Skip to content

Latest commit

 

History

History
362 lines (263 loc) · 10.4 KB

cm101-data-basics.md

File metadata and controls

362 lines (263 loc) · 10.4 KB

data basics


California Carpet by Marc Cooper is licensed under CC BY-SA 2.0

contents

prerequisites
data in base R
data in R packages
raw data files
reading XLSX files
reading CSV files
webscraping ASCII data
data directory write and read
exercises

prerequisites

  • Data basics lecture (17 min)
  • Then start the work session by launching portfolio.Rproj
  • Your project directory structure satisfies the course requirements
  • Create explore/0201-data-basics.R

The meaning of this last instruction is to create an R script named 0201-data-basics.R in your explore directory. My file-naming convention is to indicate the week of the lesson (02), the day of the week (01), and a brief description of the topic (-data-basics), and of course the file type (.R). Typically, we will have a new script for each tutorial.

You can do the exercises at the end of the tutorial in this same script.

A final note on procedure. I will often create my own version of the explore scripts. You can find them in the explore directory on the course website.

data in base R

Launching R loads all data sets in base R. You can use these data primarily for tutorials and practice sessions.

For example, in the Console type,

data() # to list data sets in base R 

yields

 #> AirPassengers  Monthly Airline Passenger Numbers
 #> BJsales        Sales Data with Leading Indicator
 #> BOD            Biochemical Oxygen Demand
 #> CO2            Carbon Dioxide Uptake in Grass Plants
 #> Formaldehyde   Determination of Formaldehyde
 etc.

For an example of a dataset help page, in the Console type,

? CO2

data in R packages

Loading a package with library() loads all the data sets in the package. You can use these data primarily for tutorials and practice sessions.

For example, in the Console type,

library("dplyr")
data(package = "dplyr")

yields

 #> band_instruments   Band membership
 #> band_instruments2  Band membership
 #> band_members       Band membership
 #> nasa               NASA spatio-temporal data
 #> starwars           Starwars characters
 #> storms             Storm tracks data

To show the data set help page, in the Console type,

? starwars

raw data files

For this work session, manually download the CSV and XLSX files from the course data-raw directory and save them to your own data-raw directory. In another session we will cover how to automate the downloads.

Generally, when treating data files in their original form,

  • Save original data files in the data-raw directory
  • Data in their original form are never edited manually
  • Read the raw data with R scripts
  • For learning R, save your R scripts in the explore directory
  • For scripts associates with your portfolio, save the R scripts in the carpentry directory

reading XLSX files

Your data-raw/ directory should contain the DSR-table1.xlsx file.

  • readxl is the package (you will have to install the package)
  • read_excel() is the function

In your open R script, type the following,

library("readxl")
df1 <- read_excel(path  = "data-raw/DSR-table1.xlsx", 
        sheet = "DSR-table1")

If a data set is small enough, we can view it in the Console by typing its name, for example,

df1

We can pretty-print the data using knitr::kable(). Add the following to your script.

library("knitr")
kable(df1)

When you use this function in an R script, it prints the table to your Console

|country     | year|  cases| population|
|:-----------|----:|------:|----------:|
|Afghanistan | 1999|    745|   19987071|
|Afghanistan | 2000|   2666|   20595360|
|Brazil      | 1999|  37737|  172006362|
|Brazil      | 2000|  80488|  174504898|
|China       | 1999| 212258| 1272915272|
|China       | 2000| 213766| 1280428583|

Later, we’ll be using the kable() function in an R Markdown script. Then the function produces a prettier table like this:

country year cases population
Afghanistan 1999 745 19987071
Afghanistan 2000 2666 20595360
Brazil 1999 37737 172006362
Brazil 2000 80488 174504898
China 1999 212258 1272915272
China 2000 213766 1280428583

read_excel() produces a tibble. Add the class() function to your script,

class(df1)
#> [1] "tbl_df"     "tbl"        "data.frame"

reading CSV files

Your data-raw/ directory should contain the scanvote.csv file.

  • readr is the package (part of the tidyverse)
  • read_csv() is the function

In your open R script, type the following,

library("tidyverse") # loads the readr package 
df2 <- read_csv(file = "data-raw/scanvote.csv")

We can pretty-print the top n rows with head()

head(df2, n = 5L)
#> # A tibble: 5 x 4
#>   District        Yes   Pop Country
#>   <chr>         <dbl> <dbl> <chr>  
#> 1 Uusimaa        70.8 118.  Fin    
#> 2 Turku ja Pori  53.4  32   Fin    
#> 3 Hame           57.8  39.5 Fin    
#> 4 Kymi           65.2  31.8 Fin    
#> 5 Ahvenanmaa     51.9  15.4 Fin

read_csv() produces a tibble. Add the class() function to your script,

class(df2)
#> [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

webscraping ASCII data

Confine your webscraping (for now) to data in ASCII format

  • On any online data page, right-click > View page source
  • Data formatted in ASCII (text) is easily recognized
  • Data formatted in HTML is also easily recognized

With online data in ASCII format, webscraping is easy

  • utils is the package
  • read.table() is the function

Add the following to your R script.

library("utils")
url <- "http://www.prdh.umontreal.ca/BDLC/data/alb/Population.txt"

df3  <- read.table(url,   
                  skip = 2, 
                  header = TRUE, 
                  stringsAsFactors = FALSE)

df3 <- as_tibble(df3)

Examine it

glimpse(df3)
#> Observations: 10,212
#> Variables: 5
#> $ Year   <int> 1921, 1921, 1921, 1921, 1921, 1921, 1921, 1921, 1921, 1...
#> $ Age    <chr> "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10",...
#> $ Female <dbl> 7864.85, 7936.45, 8024.81, 8017.72, 7925.59, 7760.15, 7...
#> $ Male   <dbl> 8133.86, 8142.91, 8240.34, 8244.01, 8154.78, 7991.74, 7...
#> $ Total  <dbl> 15998.71, 16079.36, 16265.15, 16261.73, 16080.37, 15751...

Write it to the data-raw directory. Add the following to your script,

write_csv(df3, path = "data-raw/alberta-mortality.csv")

data directory write and read

Write functions

write_csv()  # use CSV generally
saveRDS()    # use RDS to preserve factors 

Read functions for further data carpentry

read_csv() 
readRDS()  

Anytime you want to know about a function arguments, invoke the package using the library() function, then in the Console type ? function-name, for example,

library("tidyverse")
? write_csv() 

exercises

Note: The instruction to classify a data structure implies that you

  • determine the number of observations
  • determine the number of variables
  • classify every variable as quantitative or categorical
  • determine the number of levels in every categorical variable
  • classify every categorical variables as ordinal or nominal
  • classify every quantitative variable as continuous or discrete
  • other features relevant to the particular display type

1. Manually tidy Excel data

  • Open the VADeaths.xlsx file in MS Excel (or whatever spreadsheet software you use).
  • In the spreadsheet, manually move, copy, and paste cell entries until the data are in tidy form.
  • This is the only time we’ll do a manual operation. By doing this manually, you develop a better understanding of what the tidyr package does when we use it to script similar operations.
  • Classify the data structure
  • When you have finished, a partial answer is provided so you can check your work.

2. Webscrape CSV data from FiveThirtyEight

We will use the FiveThirtyEight data from: https://projects.fivethirtyeight.com/soccer-api/club/spi_matches.csv

  • In your open R script, assign the URL above to the variable name url
  • In the R script, use read_csv(url) to scrape the data from the web, and assign it to the variable name df4
  • Use glimpse() to examine the data
  • Write the CSV file to your data-raw directory using the name spi-matches.csv
  • Have a look in your data-raw directory to confirm that the file is there.

In addition,

  • Classify the data structure
  • Determine if the data are tidy or not

3. XLXS data from US Dept of Agriculture

Manually download the file nsn-extract-5-9-18.xlsx from the US General Services Administration (GSA) at the following URL

https://inventory.data.gov/dataset/67567804-073d-40ad-a710-2b0bed8b84e2/resource/360b0748-d161-4857-a7dc-dfccfaeea096/download/nsn-extract-5-9-18.xlsx

  • Save the file in your data-raw directory
  • Manually open the file and examine the contents
  • Based on the structure you see, use read_excel() to input the data into R
  • Write the CSV file to your data-raw directory with the filename nsn-extract-5-9-18.csv. At this point, I consider the CSV file raw data, just as if we have used our script to webscrape it directly.

In addition,

  • Classify the data structure
  • Determine if the data are tidy or not

references

Wickham H and Grolemund G (2017) R for Data Science. O’Reilly Media, Inc., Sebastopol, CA https://r4ds.had.co.nz/


▲ top of page
◁ calendar
◁ index