From 99a408dd627afb915b92c08b6a202c4d7e0eeb37 Mon Sep 17 00:00:00 2001
From: Wes B
Date: Wed, 10 Jan 2024 12:09:15 -0800
Subject: [PATCH] Created a draft of the data joins chapter

---
 xx_joins.Rmd | 393 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 393 insertions(+)
 create mode 100644 xx_joins.Rmd

diff --git a/xx_joins.Rmd b/xx_joins.Rmd
new file mode 100644
index 0000000..8ddb008
--- /dev/null
+++ b/xx_joins.Rmd
@@ -0,0 +1,393 @@
+Joining Data
+====================
+You will often have two or more data sets that are part of the same project. For example, our library keeps track of its books using three tables: one identifying books, one identifying borrowers, and one that records each book checkout.
+
+## Objective:
+The goal of this workshop is to familiarize students with the concept of joining tables in R using the `dplyr` package. Participants will learn how to merge datasets based on common columns, handle different types of joins, and understand the implications of each type.
+
+
+## The `dplyr` Package
+This workshop uses R and the [`dplyr` package](https://dplyr.tidyverse.org/index.html). The next link goes to a [list of all the functions provided by `dplyr`](https://dplyr.tidyverse.org/reference/index.html). Let's check them out. If you've ever used SQL, you'll probably recognize functions like `select()`, `left_join()`, and `group_by()`. In fact, `dplyr` was designed to bring SQL-style data manipulation to R. As a result, the concepts of `dplyr` and SQL are nearly identical, and even the language overlaps a lot. I'll point out some examples of this as we go, because I think some people might find it helpful.
+
+## Gradebook data
+Data joins are what make relational databases possible because they link separate tables. There are many benefits to storing data in several small tables rather than one comprehensive table (reducing repetition and reducing storage requirements are probably the two biggest).
+An example of a database that we all interact with regularly is the university gradebook. One table stores information about students and another table stores grades. The grades are linked to the student records by student ID. Looking at a student's grades requires joining these two tables.
+
+We are going to use made-up gradebook data for some preliminary examples. The rows and columns of these tables have different meanings, so we won't be able to stack them side-by-side or one on top of the other. The "key" piece of information is the student ID, which is a column in both tables. If you've ever used SQL, you may recall that each table should have a **primary key**, which is a column of values that identify the rows. In a database, the primary key must be unique: there can be no duplicates. Most spreadsheets are not designed as databases, and they may have key columns that are not unique. How to handle non-unique keys is going to be a recurring feature of this workshop.
+
+In our made-up gradebook data, `student_id` will be the primary key for the table of students. Each student has a grade for each class they take, so the student ID is not a unique identifier in the table of grades. When the primary key of one table is used as an identifier in another table, it is called a **foreign key**.
+
+
+## Left and Right Joins
+
+Our goal is to augment the table of students with the information in the grades table. To augment means to add to. We are going to be adding to the `students` table, but do we add rows or columns? Each row of `grades` is an event that matches one student to one grade. Adding new rows would be like adding new grades that didn't occur: not good! Each row of `grades` refers to a student, and each student may have many columns of information in the `students` table. So we can maintain the relationships in the data while adding new columns of information to `students`.
+
+When you want to augment one dataset with information from another, you'll use a directional join: either left or right. Here, left and right refer to the order in which the tables are written in the join command. Left and right joins work almost exactly the same, and left joins are far more common in practice, so we will focus on those.
+
+### Left Join
+A left join combines all rows from the left dataset with the matching rows from the right dataset. Let's get familiar with a contrived example before actually working on the library data. We'll create two tables of made-up data: the first identifies students by name and ID, and the second lists their grades in a class.
+
+
+```{r left-join-example}
+#| message: false
+# load dplyr for the joins and knitr for kable()
+library(dplyr)
+library(knitr)
+
+# Example datasets
+students = data.frame(
+  student_id = c(1, 2, 3, 4),
+  name = c("Angel", "Beto", "Cici", "Desmond")
+  )
+
+grades = data.frame(
+  student_id = c(2, 3, 4, 5, 6),
+  grade = c(90, 85, 80, 75, 60)
+  )
+
+# Left join
+left_join(students, grades, by = "student_id") |>
+  kable()
+```
+
+There are four rows in the `students` table and five rows in the `grades` table. I have provided `students` as the first table in the left join (it is written on the left in the function), so the result will have one row per row of `students`. I have also provided `student_id` as the **by** or **key** column for the join. That means that rows will be matched between the tables based on whether their `student_id` columns match.
+
+Note that the keys are not identical between tables: the `grades` table has no `student_id` of 1, which is the ID for Angel. This shows up in the result as an `NA` in the grade column for Angel. There is no row in the result for `student_id` 5 or 6 because a left join augments the left table (`students`) with columns from the right table (`grades`). It does not add rows just because they are found in the right table.
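+
+The claims above can be checked directly. This is a minimal sketch that rebuilds the same made-up tables; the name `joined` is introduced here only for illustration.
+
+```{r left-join-row-check}
+library(dplyr)
+
+# Same made-up tables as in the left join example
+students = data.frame(
+  student_id = c(1, 2, 3, 4),
+  name = c("Angel", "Beto", "Cici", "Desmond")
+  )
+
+grades = data.frame(
+  student_id = c(2, 3, 4, 5, 6),
+  grade = c(90, 85, 80, 75, 60)
+  )
+
+joined = left_join(students, grades, by = "student_id")
+
+# The keys are unique in both tables, so there is one result row per student
+nrow(joined) == nrow(students)
+
+# Students with no matching row in grades get NA for the grade
+filter(joined, is.na(grade))
+
+# Keys that appear only in grades are absent from the result
+setdiff(grades$student_id, joined$student_id)
+```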
+
+There is one reason that a left join will add rows to the result: when the key is not unique. In that case, every possible match will be provided in the result. For example, let's add rows with repeat IDs to both the `students` and `grades` tables. Let's also rename the `student_id` column of `grades` to be `sid` so we can see how to join tables where the key column names don't match.
+
+```{r left-join-example-duplicate-keys}
+
+# Example datasets
+students = data.frame(
+  student_id = c(1, 2, 3, 4, 4),
+  name = c("Angel", "Beto", "Cici", "Desmond", "Erik"))
+
+grades = data.frame(sid = c(2, 3, 4, 5, 2),
+                    grade = c(90, 85, 80, 75, 60))
+
+# Left join
+left_join(students, grades, by = join_by(student_id == sid)) |>
+  kable()
+```
+
+Both of the tables had five rows but the result has six rows because `student_id` is 4 for two rows of `students` and `sid` is 2 for two rows of `grades`. `dplyr` has warned us that there is a many-to-many relationship in the join, which means that duplicate keys were matched in the left table and the right table. Where there are no duplicate keys in either table, the match is one-to-one. When there are duplicates in one table only, the match is one-to-many or many-to-one. These are often the desired behavior, so `dplyr` complies silently. A many-to-many match may be desired but it is often a sign that something has gone wrong, so `dplyr` gives a warning. You can get funky results when your keys are not unique!
+
+![Cats join meme](img/cat-legs-left-join-meme.jpg)
+
+
+
+## Other Joins
+I've already mentioned that a right join is like a left join, but augments the right table with the left, rather than vice versa. The next most common join is an inner join, which includes only the rows that match. And a full join includes all rows from both tables, even if there was no match. I've created a picture to help identify the differences.
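+
+The same differences show up in the number of rows each join returns. The sketch below assumes the made-up `students` and `grades` tables from the first left join example, where the keys are unique in both tables; `join_rows` is a name introduced here for illustration.
+
+```{r join-row-counts}
+library(dplyr)
+
+# Made-up tables with unique keys: IDs 1 to 4 on the left, 2 to 6 on the right
+students = data.frame(
+  student_id = c(1, 2, 3, 4),
+  name = c("Angel", "Beto", "Cici", "Desmond"))
+
+grades = data.frame(
+  student_id = c(2, 3, 4, 5, 6),
+  grade = c(90, 85, 80, 75, 60))
+
+# Number of rows returned by each kind of join
+join_rows = c(
+  left  = nrow(left_join(students, grades, by = "student_id")),
+  right = nrow(right_join(students, grades, by = "student_id")),
+  inner = nrow(inner_join(students, grades, by = "student_id")),
+  full  = nrow(full_join(students, grades, by = "student_id"))
+)
+join_rows
+```
+
+With these tables, the left join returns 4 rows, the right join 5, the inner join 3, and the full join 6.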
+
+![Disney characters illustrate differences between joins](img/disney-meme.jpg)
+
+### Right Join
+A right join is exactly like a left join, except that it keeps every row of the right table and augments it with the matching columns from the left table.
+
+### Inner Join
+An inner join returns the same columns as a left join, but potentially fewer rows. The result of an inner join only includes the rows that matched according to the join specification. This will leave out some rows from the left table if they aren't matched in the right table, which is the difference between an inner join and a left join.
+
+
+```{r inner-join-example-duplicate-key}
+# Example datasets
+students = data.frame(
+  student_id = c(1, 2, 3, 4, 4),
+  name = c("Angel", "Beto", "Cici", "Desmond", "Erik"))
+
+grades = data.frame(student_id = c(2, 3, 4, 5, 2),
+                    grade = c(90, 85, 80, 75, 60))
+
+# Inner join
+inner_join(students, grades, by = "student_id") |>
+  kable()
+```
+
+
+## Getting Clever with `join_by`
+
+So far, we've focused on the join types and the tables. There's been a third element in all of the examples that we've mostly ignored until now: the `by` argument in the joins. Specifying a single column name (like `student_id`) works great when the key columns have the same names in both tables. However, real examples are often more complicated. For those times, `dplyr` provides a function called `join_by()`, which lets you create **join specifications** to solve even very complicated problems. We begin with an example where the key name in the `grades` table has been changed from `student_id` to `sid`.
+
+
+```{r different-key-names}
+# Example datasets
+students = data.frame(
+  student_id = c(1, 2, 3, 4),
+  name = c("Angel", "Beto", "Cici", "Desmond"))
+
+grades = data.frame(sid = c(2, 3, 4, 5),
+                    grade = c(90, 85, 80, 75))
+
+# Left join
+left_join(students, grades, by = join_by(student_id == sid)) |>
+  kable()
+```
+
+Since the key column names don't match, I have provided a `join_by` specification.
+Specifying a match via `join_by` is very powerful and flexible, but the main thing to recognize here is that R searches for the column name on the left of the double-equals in the left table and searches for the column name on the right of the double-equals in the right table. In this example, that means the join will try to match `students$student_id` to `grades$sid`.
+
+### Matching multiple columns
+Sometimes it takes more than one key to uniquely identify a row of data. For example, suppose some of our students are retaking the class in 2023 because they got a failing grade in 2022. Then we would need to combine the student ID with the year to uniquely identify a student's grade. You can include multiple comparisons in a `join_by` specification by separating them with commas. In the following example, student ID still has different names between the tables but the year column has the same name in both tables.
+
+```{r two-key-cols}
+# Example datasets
+students = data.frame(
+  student_id = c(1, 2, 3, 4),
+  name = c("Angel", "Beto", "Cici", "Desmond"))
+
+# duplicate the students for two years
+students = bind_rows(
+  mutate(students, year=2022),
+  mutate(students, year=2023)
+)
+
+# create the grades data.frame
+grades = data.frame(
+  sid = c(2, 3, 4, 5),
+  grade = c(90, 85, 80, 75)
+  )
+
+# duplicate the grades table for two years
+grades = bind_rows(
+  mutate(grades, grade = grade-50, year=2022),
+  mutate(grades, year=2023)
+)
+
+# Left join
+left_join(students, grades, by = join_by(student_id==sid, year)) |>
+  kable()
+```
+
+
+To learn clever tricks for complicated joins, see the documentation at `?join_by`.
+
+
+## Examples
+We've seen enough of the made-up grades example! Let's look at some real data and practice our skills!
+
+Let's begin by looking at the data on books, borrowers, and checkouts.
+
+```{r import-data}
+#| warning: false
+#| error: false
+#| message: false
+borrowers = read.csv("data/library/borrowers.csv")
+books = read.csv("data/library/books.csv")
+checkouts = read.csv("data/library/checkouts.csv")
+
+# show the top rows
+print(head(books))
+print(head(borrowers))
+print(head(checkouts))
+
+# get the table sizes
+print(dim(books))
+print(dim(borrowers))
+print(dim(checkouts))
+```
+
+One thing we can see is that the `books` table refers to physical copies of a book, so if the library owns two copies of the same book then the same title, publisher, etc. will appear on more than one row.
+
+Our goal is to augment the `checkouts` table with the information in the `books` table. To augment means to add to. We are going to be adding to the `checkouts` table, but do we add rows or columns? Each row of `checkouts` is an event that matches one book and one borrower. Adding new rows would be like adding new events that didn't occur: not good! Each row has a book and each book has many columns of information in the `books` table. So we can maintain the relationships in the data while adding new columns of information to `checkouts`.
+
+How are we to know which books were checked out most often, or were generally checked out by the same people? The tables have different numbers of rows and columns, so we won't be able to stack them side-by-side or one on top of the other. The "key" pieces of information are the columns `books$book_id` and `borrowers$borrower_id`. If you've ever used SQL, you may recall that each table should have a **primary key**, which is a column of values that identify the rows. In a database, the primary key must be unique: there can be no duplicates. Most spreadsheets are not designed as databases, and they may have key or ID columns that are not unique. How to handle non-unique keys is going to be a recurring feature of this workshop.
+
+Now look at the `checkouts` table again.
+It has two ID columns: `book_id` and `borrower_id`. These match the book and borrower IDs in the `books` and `borrowers` tables. Obviously, these aren't unique: one person may check out multiple books, and an exceptionally popular book might be checked out as many as three times from the same library! These are columns that identify unique keys in other tables, which SQL calls **foreign keys**.
+
+Now we can begin to reason about how to approach the goal of identifying the books that are most often checked out. We want to augment the `checkouts` table with the information in the `books` table, matching rows where `book_id` matches. Every row in the `checkouts` table should match exactly one row in the results and every row in the results should match exactly one row in the `checkouts` table. The code that follows translates this plain-English description into the language used by `dplyr`.
+
+```{r most-checked-out-books}
+# Top ten books with most checkouts
+left_join(checkouts, books, by="book_id") |>
+  group_by(book_id) |>
+  summarize(title=first(title), author=first(author), n_checkouts=n()) |>
+  arrange(desc(n_checkouts)) |>
+  head(n=10) |>
+  kable()
+```
+
+Just for fun, here is an instructive example of why relational tables are a better way to store data than putting everything into one spreadsheet. If we want to identify the authors whose books were most checked out from the UCD library, we might think to adapt our previous example to group by `author` rather than by `book_id`.
+```{r most-checked-out-authors}
+# Top ten authors with most checkouts
+# (author is the grouping column, so it is kept in the summary automatically)
+left_join(checkouts, books, by="book_id") |>
+  group_by(author) |>
+  summarize(n_checkouts = n()) |>
+  arrange(desc(n_checkouts)) |>
+  head(n=10) |>
+  kable()
+```
+
+The problem is that the `author` column is a text field for author name(s), which is not a one-to-one match to a person.
+There are many reasons for this: some books have multiple authors, some authors change their names, the order of personal name and family name may be reversed, and middle initials are sometimes included, sometimes not. A table of authors would allow you to refer to authors by a unique identifier and have it always point to the same name ([this is what ORCID does for scientific publishing](https://orcid.org)).
+
+### Three Or More Tables
+Each of the join functions works on only two tables at a time, but you can join as many tables as you want by iteratively building them up two at a time. We are going to look at an example that combines `checkouts`, `books`, and `borrowers` in order to see how many books were checked out by students, faculty, and staff.
+
+```{r three-tables}
+# list the account types that checked out the most books
+left_join(checkouts, books, by="book_id") |>
+  left_join(borrowers, by="borrower_id") |>
+  group_by(account_type) |>
+  summarize(n_checkouts = n()) |>
+  arrange(desc(n_checkouts)) |>
+  kable()
+```
+
+
+## Be Explicit
+Do you find it odd that we have to tell R exactly what kind of data join to do by calling one of `left_join()`, `right_join()`, `inner_join()`, or `full_join()`? Why isn't there just one function called `join()` that assumes you're doing a left join unless you specifically provide an argument `type` like `join(..., type="inner")`? If you think it would be confusing for R to make assumptions about what kind of data join we want, then you're on the right track, but you'll want to watch out for these other cases where R has strong assumptions about what the default behavior should be.
+
+A general principle of programming is that explicit is better than implicit because writing information into your code explicitly makes it easier to understand what the code does. Here are some examples of implicit assumptions R will make unless you provide explicit instructions.
+
+
+### Handling Duplicate Keys
+Values in the key columns may not be unique. What do you think happens when you join using keys that aren't unique?
+
+```{r duplicate-keys-example}
+# Example datasets
+students = data.frame(
+  student_id = c(1, 2, 3, 4, 4),
+  name = c("Angel", "Beto", "Cici", "Desmond", "Erik"))
+
+grades = data.frame(student_id = c(2, 2, 3, 4, 4, 5),
+                    grade = c(90, 50, 85, 80, 75, 30))
+
+# Left join
+left_join(students, grades, by = "student_id") |>
+  kable()
+```
+
+We get one row in the result for every possible combination of the matching keys. Sometimes that is what you want, and other times not. In this case, it might be reasonable that Beto, Desmond, and Erik have multiple grades in the book, but it is probably not reasonable that both Desmond and Erik have student ID 4 and have the same grades as each other. This is a many-to-many match, with all the risks we've mentioned before.
+
+#### Specifying The Expected Relationship
+You can be explicit about what kind of relationship you expect in the join by specifying the `relationship` parameter. Your options are `"one-to-one"`, `"one-to-many"`, or `"many-to-one"`. Any of those will stop the code with an error if the data doesn't match the relationship you told it to expect.
+
+If you leave the `relationship` parameter blank, `dplyr` will allow a many-to-many join but will raise a warning. Pay attention to your warning messages! If you know in advance that you want a many-to-many join, then you can provide the argument `relationship = "many-to-many"`, which will do the same as leaving `relationship` blank, except it will not raise the warning.
+
+#### Using Only Distinct Rows
+An alternative to handling duplicate keys is to subset the data to avoid duplicates in the first place. The `dplyr` package provides a function, `distinct()`, which can help. When `distinct()` finds duplicated rows, it keeps the first one.
+
+```{r duplicate-keys-example-distinct}
+# Example datasets
+students = data.frame(
+  student_id = c(1, 2, 3, 4, 4),
+  name = c("Angel", "Beto", "Cici", "Desmond", "Erik"))
+
+grades = data.frame(student_id = c(2, 2, 3, 4, 4, 5),
+                    grade = c(90, 50, 85, 80, 75, 30))
+
+# Keep the first row for each student_id, then left join
+students |>
+  distinct(student_id, .keep_all=TRUE) |>
+  left_join(grades, by = "student_id") |>
+  kable()
+```
+
+
+
+### Ambiguous Columns
+When the two tables have columns with the same names, it is ambiguous which one to use in the result. `dplyr` handles that situation by keeping both and appending a suffix to each name: the column from the left table gets `.x` by default and the column from the right table gets `.y`. Let's see an example. Suppose that the `date_created` column of the `borrowers` table had the name `date` instead. Then in the joined data it would be ambiguous with the `date` column of the `checkouts` table.
+
+```{r ambiguous-date}
+# Rename the date_created column of borrowers
+borrowers = rename(borrowers, date=date_created)
+
+# Now create the list of checkouts
+left_join(checkouts, books, by="book_id") |>
+  left_join(borrowers, by="borrower_id") |>
+  head(n=10) |>
+  kable()
+```
+
+If you aren't satisfied with appending `.x` and `.y` to the ambiguous columns, then you can specify the `suffix` argument with a pair of strings like this:
+
+```{r ambiguous-date-custom-suffix}
+# Now create the list of checkouts, with custom suffixes on the join
+# where the two date columns collide
+left_join(checkouts, books, by="book_id") |>
+  left_join(borrowers, by="borrower_id", suffix=c("_checkout", "_borrower")) |>
+  head(n=10) |>
+  kable()
+```
+
+By specifying the `suffix` argument, we get more meaningful column names in the result.
+
+### Missing Values
+The `dplyr` package has a default behavior that I think is dangerous. In the conditions of a join, `NA` keys are treated as equal to one another, even though the expression `NA == NA` evaluates to `NA` rather than `TRUE` in ordinary R code. This means that keys identified as `NA` will match other `NA`s in the join.
+This is a very strong assumption that seems to contradict the idea of a missing value: if we actually don't know two keys, how can we say that they match? And if we know two keys have the same value, then that value should be recorded in the data. In my opinion, it's a mistake to have the computer make strong assumptions by default, especially if it does so without warning the user. Fortunately, there is a way to make the more sensible decision that `NA`s don't match anything: include the argument `na_matches = "never"` in the join.
+
+```{r missing-values-example}
+# Example datasets
+students = data.frame(
+  student_id = c(1, NA, 3, 4),
+  name = c("Angel", "Beto", "Cici", "Desmond"))
+
+grades = data.frame(student_id = c(2, NA, 4, 5),
+                    grade = c(90, 85, 80, 75))
+
+# Left join with the default behavior: NA keys match each other
+left_join(students, grades, by = "student_id") |>
+  kable()
+
+# Left join where NA keys never match
+left_join(students, grades, by = "student_id", na_matches = "never") |>
+  kable()
+```
+
+In the first join, Beto's `NA` student ID matches the `NA` row of `grades`, so he is assigned a grade of 85. In the second join, since Beto's student ID is `NA`, none of the rows in the `grades` table can match him. As a result, his grade is left `NA` in the result.
+
+## Conclusion
+You've now seen how to join data tables that can be linked by key columns. I encourage you to expand on the examples by posing questions and trying to write the code to answer them. Reading the documentation for [join functions](https://dplyr.tidyverse.org/reference/mutate-joins.html) and [`join_by` specifications](https://dplyr.tidyverse.org/reference/join_by.html) is a great way to continue your learning journey by studying the (many!) special cases that we skipped over here.