Skip to content

Latest commit

 

History

History
546 lines (447 loc) · 18.3 KB

cm105-data-joins.md

File metadata and controls

546 lines (447 loc) · 18.3 KB

carpentry with joins


Joined by David Seibold is licensed under CC BY-NC 2.0

contents

introduction
prerequisites
midfield data
joins
carpentry
design
exercises
references

introduction

“It’s rare that a data analysis involves only a single table of data. Typically you have many tables of data, and you must combine them to answer the questions that you’re interested in. Collectively, multiple tables of data are called relational data because it is the relations, not just the individual datasets, that are important.” (Wickham and Grolemund, 2017, Sec. 13.1) (Relational tables is the basic idea underlying relational databases.)

Data frames are our basic form of table. Combining information from two data frames is called a join. The dplyr package has joining functions called inner_join(), left_join(), right_join(), and others. We’ll use the left join regularly. You can read about the others at the help page by running ? dplyr::join or Ch. 13 Relational Data in the text.

To perform a join, the two data frames must have at least one key-column in common. Here again we see the importance of coordinatized data.

prerequisites

Project setup

  • Start every work session by launching the RStudio Project file for the course, e.g., portfolio.Rproj
  • Ensure your project directory structure satisfies the course requirements

Ensure you have installed the following packages. See install packages for instructions if needed. For midfielddata, see the instructions below.

  • tidyverse: The ‘tidyverse’ is a set of packages that work in harmony because they share common data representations and ‘API’ design. This package is designed to make it easy to install and load multiple ‘tidyverse’ packages in a single step. Learn more about the ‘tidyverse’ at https://tidyverse.org.
  • wrapr: Tools for writing and debugging R code. Provides: ‘let()’ (converts non-standard evaluation interfaces to parametric standard evaluation interfaces, inspired by ‘gtools:strmacro()’ and ‘base::bquote()’), ‘%.>%’ dot-pipe (an ‘S3’ configurable pipe), ‘build_frame()’/‘draw_frame()’ (‘data.frame’ example tools), ‘qc()’ (quoting concatenate), ‘:=’ (named map builder), and more.
  • midfielddata: A stratified sample of the MIDFIELD database of student records (registrar’s data) for 98,000 undergraduates at participating US institutions from 1987 to 2016. The data sample includes entering demographics, individual course and term data, and graduation status and degrees earned.

Install midfielddata as follows,

install.packages("drat")
drat::addRepo("midfieldr")
install.packages("midfielddata")

Scripts to initialize

explore/0601-joins-explore.R     

And start the file with a minimal header

# your name
# date

# load packages
library("tidyverse")
library("wrapr")
library("midfielddata")

Duplicate the lines of code in the session one chunk at a time. Save, Source, and compare your results to the results shown.


▲ top of page

midfield data

The midfielddata data package comprises demographic, term, course, and degree information for 97,640 undergraduate students from 1987 to 2016 (Layton and others, 2018). If you want to learn more about the data set, open its help page by running ? midfielddata.

In this exercise, we’ll work with the degree data (midfielddegrees) and the student demographic data (midfieldstudents).

# degree status 
glimpse(midfielddegrees)
#> Observations: 97,640
#> Variables: 5
#> $ id          <chr> "MID25783135", "MID25783147", "MID25783156", "MID2...
#> $ institution <chr> "Institution M", "Institution M", "Institution M",...
#> $ cip6        <chr> NA, NA, NA, "521401", NA, NA, "143501", NA, "23010...
#> $ term_degree <dbl> NA, NA, NA, 19963, NA, NA, 20001, NA, 20023, 19966...
#> $ degree      <chr> NA, NA, NA, "Bachelor of Science", NA, NA, "Bachel...


# student information at matriculation  
glimpse(midfieldstudents)
#> Observations: 97,640
#> Variables: 15
#> $ id             <chr> "MID25783135", "MID25783147", "MID25783156", "M...
#> $ institution    <chr> "Institution M", "Institution M", "Institution ...
#> $ cip6           <chr> "511199", "131202", "040201", "14XXXX", "14XXXX...
#> $ term_enter     <dbl> 19911, 20041, 20011, 19921, 20091, 20051, 19946...
#> $ transfer       <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", "N...
#> $ hours_transfer <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ sex            <chr> "Male", "Female", "Female", "Male", "Male", "Ma...
#> $ race           <chr> "White", "White", "White", "White", "White", "W...
#> $ age            <dbl> 17, 17, 18, 17, 18, 18, 17, 18, 18, 18, 18, 17,...
#> $ us_citizen     <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y...
#> $ home_zip       <chr> "29485", "29072", "43026", "29572", "29016", "2...
#> $ high_school    <chr> "411900", "410825", "210568", "411465", "410210...
#> $ sat_math       <dbl> 580, 470, 750, 560, 670, 620, 642, 590, 492, 62...
#> $ sat_verbal     <dbl> 505, 530, 627, 395, 530, 500, 486, 620, 594, 45...
#> $ act_comp       <dbl> 21, 19, 29, 18, 26, 22, 24, 25, 21, 21, 25, 20,...

Possible keys:

  • id when students are the unit of analysis, uniquely identifies each of the 97640 students in the dataset
  • cip6 when programs are the unit of analysis, uniquely identifies each of the 338 academic programs in the dataset
  • institution when institutions are the unit of analysis, uniquely identifies each of the 12 institutions in the dataset

The IDs and institutions are anonymized. The 6-digit program codes, however, are from the US Integrated Postsecondary Education Data System (IPEDS), Classification of Instructional Programs (CIP). A sampling of the CIP codes:

midfielddegrees %>% 
        select(cip6)  %>% 
        arrange(cip6) %>% 
        distinct()    %>% 
        sample_n(10) 
#> # A tibble: 10 x 1
#>    cip6  
#>    <chr> 
#>  1 450601
#>  2 030502
#>  3 010999
#>  4 131203
#>  5 400202
#>  6 161200
#>  7 380299
#>  8 169999
#>  9 260301
#> 10 512001

From the IPEDS website, we find that engineering codes start with 14. We filter midfielddegrees using str_start() to keep the rows of students graduating in an engineering major.

grad_engr <- midfielddegrees %>% 
    filter(str_starts(cip6, "14"))
    
grad_engr %>% 
        select(cip6)  %>% 
        arrange(cip6) %>% 
        distinct()  
#> # A tibble: 27 x 1
#>    cip6  
#>    <chr> 
#>  1 140101
#>  2 140201
#>  3 140301
#>  4 140401
#>  5 140501
#>  6 140601
#>  7 140701
#>  8 140801
#>  9 140901
#> 10 141001
#> # ... with 17 more rows

In this exercise, I would like to examine students in 5 engineering majors: Civil, Computer, Electrical, Mechanical, and Industrial Engineering. The first 4 digits of cip6 define these majors, for example, every 6-digit code that starts with 1408 is Civil Engineering, 1409 Computer Engineering, etc.

Let’s create a data frame that contains the 4-digit codes of the 5 majors, using wrapr::build_frame().

(engr_major <- wrapr::build_frame(
        "major",      "cip4" |
        "Civil",      "1408" |
        "Computer",   "1409" | 
        "Electrical", "1410" | 
        "Mechanical", "1419" |
        "Industrial", "1435" 
    ))
#>        major cip4
#> 1      Civil 1408
#> 2   Computer 1409
#> 3 Electrical 1410
#> 4 Mechanical 1419
#> 5 Industrial 1435

I will use the cip4 column to filter the grad_engr data frame to keep all students graduating in these majors. First, a create a single string of the cip4 values, separated by the logical OR symbol,

(major_string <- str_c(engr_major$cip4, collapse = "|"))
#> [1] "1408|1409|1410|1419|1435"

Then I filter the grad_engr data frame to keep the rows whose cip6 columns starts with any of the 4 digit codes in the keep_code string, again using filter() with str_starts(),

grad <- grad_engr %>% 
        filter(str_starts(cip6, major_string))  %>% 
        glimpse()
#> Observations: 5,322
#> Variables: 5
#> $ id          <chr> "MID25783178", "MID25783197", "MID25783441", "MID2...
#> $ institution <chr> "Institution M", "Institution M", "Institution M",...
#> $ cip6        <chr> "143501", "140801", "140801", "141001", "143501", ...
#> $ term_degree <dbl> 20001, 19911, 19963, 20001, 20001, 19963, 19981, 1...
#> $ degree      <chr> "Bachelor of Science", "Bachelor of Science", "Bac...

Every row in grad is a unique student who graduated in one of these 5 majors.

grad %>% 
        select(cip6)  %>% 
        arrange(cip6) %>% 
        distinct() 
#> # A tibble: 5 x 1
#>   cip6  
#>   <chr> 
#> 1 140801
#> 2 140901
#> 3 141001
#> 4 141901
#> 5 143501

The original data frame had 97640 students; the five engineering majors have 5322 graduates.

joins

The left_join() is perhaps the most commonly used joining function. The basic syntax is

left_join(x, y, by)   

where x and y are two data frames and by is the key or keys that the tables have in common. The left join (from the help page) “returns all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.”

Before joining, let’s reduce the number of columns to the minimum that we want to use,

grad <- grad %>%
        select(id, institution, cip6) 

demographics <- midfieldstudents %>% 
        select(id, sex, race) 

Using the student id as the key, we join the student demographic information to the student graduates data frame.

grad <- left_join(x = grad, y = demographics, by = "id") %>% 
        glimpse()
#> Observations: 5,322
#> Variables: 5
#> $ id          <chr> "MID25783178", "MID25783197", "MID25783441", "MID2...
#> $ institution <chr> "Institution M", "Institution M", "Institution M",...
#> $ cip6        <chr> "143501", "140801", "140801", "141001", "143501", ...
#> $ sex         <chr> "Male", "Male", "Male", "Male", "Male", "Female", ...
#> $ race        <chr> "Black", "White", "White", "White", "White", "Whit...

Earlier, we created the data frame engr_major with the 4-digit CIP code and our name for the major. Next we join these major names to the grad data frame.

grad <- left_join(x = grad, y = engr_major)

#> Error: `by` required, because the data sources have no common variables

This line throws an error because the two data frames I’m trying to join do not have a common key. The fix is to create a cip4 variable in grad using the first 4 digits of cip6.

grad <- grad %>% 
    mutate(cip4 = str_sub(cip6, start = 1L, end = 4L)) %>% 
    glimpse()
#> Observations: 5,322
#> Variables: 6
#> $ id          <chr> "MID25783178", "MID25783197", "MID25783441", "MID2...
#> $ institution <chr> "Institution M", "Institution M", "Institution M",...
#> $ cip6        <chr> "143501", "140801", "140801", "141001", "143501", ...
#> $ sex         <chr> "Male", "Male", "Male", "Male", "Male", "Female", ...
#> $ race        <chr> "Black", "White", "White", "White", "White", "Whit...
#> $ cip4        <chr> "1435", "1408", "1408", "1410", "1435", "1408", "1...

Now the left join will work,

grad <- left_join(x = grad, y = engr_major, by = "cip4") %>%
    glimpse()
#> Observations: 5,322
#> Variables: 7
#> $ id          <chr> "MID25783178", "MID25783197", "MID25783441", "MID2...
#> $ institution <chr> "Institution M", "Institution M", "Institution M",...
#> $ cip6        <chr> "143501", "140801", "140801", "141001", "143501", ...
#> $ sex         <chr> "Male", "Male", "Male", "Male", "Male", "Female", ...
#> $ race        <chr> "Black", "White", "White", "White", "White", "Whit...
#> $ cip4        <chr> "1435", "1408", "1408", "1410", "1435", "1408", "1...
#> $ major       <chr> "Industrial", "Civil", "Civil", "Electrical", "Ind...

carpentry

Before creating graphs, I’d like to clean up the data, The student ID and the CIP codes are no longer needed.

grad <- grad %>% 
    select(institution, major, sex, race) %>% 
    glimpse()
#> Observations: 5,322
#> Variables: 4
#> $ institution <chr> "Institution M", "Institution M", "Institution M",...
#> $ major       <chr> "Industrial", "Civil", "Civil", "Electrical", "Ind...
#> $ sex         <chr> "Male", "Male", "Male", "Male", "Male", "Female", ...
#> $ race        <chr> "Black", "White", "White", "White", "White", "Whit...

Shorten the variable name for institution

grad <- grad %>% 
    dplyr::rename("inst" = "institution") %>% 
    glimpse()
#> Observations: 5,322
#> Variables: 4
#> $ inst  <chr> "Institution M", "Institution M", "Institution M", "Inst...
#> $ major <chr> "Industrial", "Civil", "Civil", "Electrical", "Industria...
#> $ sex   <chr> "Male", "Male", "Male", "Male", "Male", "Female", "Male"...
#> $ race  <chr> "Black", "White", "White", "White", "White", "White", "W...

Recode the institution values by shortening the string “Institution” to “Inst”.

grad <- grad %>% 
    mutate(inst = str_remove_all(inst, "itution")) %>% 
  glimpse()
#> Observations: 5,322
#> Variables: 4
#> $ inst  <chr> "Inst M", "Inst M", "Inst M", "Inst M", "Inst M", "Inst ...
#> $ major <chr> "Industrial", "Civil", "Civil", "Electrical", "Industria...
#> $ sex   <chr> "Male", "Male", "Male", "Male", "Male", "Female", "Male"...
#> $ race  <chr> "Black", "White", "White", "White", "White", "White", "W...

Create a new variable race_sex that unites race and sex—a useful variable in this type of graph.

grad <- grad %>%
  unite(col = "race_sex", c("race", "sex"), sep = " ", remove = FALSE) %>% 
    glimpse()
#> Observations: 5,322
#> Variables: 5
#> $ inst     <chr> "Inst M", "Inst M", "Inst M", "Inst M", "Inst M", "In...
#> $ major    <chr> "Industrial", "Civil", "Civil", "Electrical", "Indust...
#> $ race_sex <chr> "Black Male", "White Male", "White Male", "White Male...
#> $ sex      <chr> "Male", "Male", "Male", "Male", "Male", "Female", "Ma...
#> $ race     <chr> "Black", "White", "White", "White", "White", "White",...

design

Use count() to quickly group and summarize, then convert the character variables into factors ordered by n (count() creates a variable n).

grad1 <- grad %>% 
    count(inst, major, race_sex) %>% 
    mutate(inst = fct_reorder(inst, n)) %>% 
    mutate(major = fct_reorder(major, n)) %>% 
    mutate(race_sex = fct_reorder(race_sex, n)) %>% 
    glimpse()
#> Observations: 368
#> Variables: 4
#> $ inst     <fct> Inst A, Inst A, Inst A, Inst A, Inst A, Inst A, Inst ...
#> $ major    <fct> Civil, Civil, Computer, Electrical, Electrical, Elect...
#> $ race_sex <fct> Black Female, Black Male, Black Male, Black Female, B...
#> $ n        <int> 6, 4, 2, 11, 16, 1, 7, 6, 2, 12, 1, 3, 2, 1, 3, 1, 1,...

ggplot(data = grad1, mapping = aes(x = n, y = race_sex)) +
    geom_point() + 
    facet_grid(rows = vars(inst), cols = vars(major), as.table = FALSE)

Not all populations are present at all institutions, so I’ll do a recount, this time omitting institution. By eliminating this category, I can use race and sex separately.

grad2 <- grad %>% 
    count(major, race, sex) %>% 
    mutate(major = fct_reorder(major, n)) %>% 
    mutate(race = fct_reorder(race, n)) %>% 
    mutate(sex = fct_reorder(sex, n)) %>% 
    glimpse()
#> Observations: 75
#> Variables: 4
#> $ major <fct> Civil, Civil, Civil, Civil, Civil, Civil, Civil, Civil, ...
#> $ race  <fct> Asian, Asian, Black, Black, Hispanic, Hispanic, Internat...
#> $ sex   <fct> Female, Male, Female, Male, Female, Male, Female, Male, ...
#> $ n     <int> 11, 24, 21, 39, 5, 30, 1, 5, 3, 1, 8, 1, 4, 192, 708, 13...

ggplot(data = grad2, mapping = aes(x = n, y = race)) +
    geom_point() + 
    facet_grid(rows = vars(sex), cols = vars(major), as.table = FALSE)

Add a log scale to help compare small numbers and edit the theme.

ggplot(data = grad2, mapping = aes(x = n, y = race)) +
    geom_point(size = 2) + 
    facet_grid(rows = vars(sex), cols = vars(major), as.table = FALSE) + 
  scale_x_continuous(trans = "log2") +
    labs(x = "Number of graduates (log2 scale)", y = "", caption = "Source: midfielddata R package", title = "Comparing graduates of five engineering programs") +
    theme_graphclass() 

exercises

1. midfield

Initialize a new script

explore/0601-dotplot-midfield-exercise.R    

Starting with midfielddegrees and midfieldstudents, create the data frame such that you can graph the answer to the question: “Of all the students graduating in a given major, what percent are in each sex-race group?”

  • Limit the majors to Mechanical, Electrical, Civil, and Industrial Engineering.
  • Limit the race values to Asian, Black, Hispanic, and White.

Answer:

references

Layton R, Long R and Ohland M (2018) midfielddata: Student record data for 98,000 undergraduates. R package version 0.1.0 https://midfieldr.github.io/midfielddata/

Wickham H and Grolemund G (2017) R for Data Science. O’Reilly Media, Inc., Sebastopol, CA https://r4ds.had.co.nz/


▲ top of page
◁ calendar
◁ index