data basics

California Carpet by Marc Cooper is licensed under CC BY-SA 2.0

prerequisites

Data basics lecture (17 min)
Then start the work session by launching portfolio.Rproj
Your project directory structure satisfies the course requirements
Create explore/0201-data-basics.R

The meaning of this last instruction is to create an R script named 0201-data-basics.R in your explore directory. My file-naming convention is to indicate the week of the lesson (02), the day of the week (01), and a brief description of the topic (-data-basics), and of course the file type (.R). Typically, we will have a new script for each tutorial.

You can do the exercises at the end of the tutorial in this same script.

A final note on procedure. I will often create my own version of the explore scripts. You can find them in the explore directory on the course website.

data in base R

Launching R loads all data sets in base R. You can use these data primarily for tutorials and practice sessions.

For example, in the Console type,

data() # to list data sets in base R

yields

 #> AirPassengers  Monthly Airline Passenger Numbers
 #> BJsales        Sales Data with Leading Indicator
 #> BOD            Biochemical Oxygen Demand
 #> CO2            Carbon Dioxide Uptake in Grass Plants
 #> Formaldehyde   Determination of Formaldehyde
 etc.

For an example of a dataset help page, in the Console type,

? CO2

data in R packages

Loading a package with library() loads all the data sets in the package. You can use these data primarily for tutorials and practice sessions.

For example, in the Console type,

library("dplyr")
data(package = "dplyr")

yields

 #> band_instruments   Band membership
 #> band_instruments2  Band membership
 #> band_members       Band membership
 #> nasa               NASA spatio-temporal data
 #> starwars           Starwars characters
 #> storms             Storm tracks data

To show the data set help page, in the Console type,

? starwars

raw data files

For this work session, manually download the CSV and XLSX files from the course data-raw directory and save them to your own data-raw directory. In another session we will cover how to automate the downloads.

Generally, when treating data files in their original form,

Save original data files in the data-raw directory
Data in their original form are never edited manually
Read the raw data with R scripts
For learning R, save your R scripts in the explore directory
For scripts associates with your portfolio, save the R scripts in the carpentry directory

reading XLSX files

Your data-raw/ directory should contain the DSR-table1.xlsx file.

readxl is the package (you will have to install the package)
read_excel() is the function

In your open R script, type the following,

library("readxl")
df1 <- read_excel(path  = "data-raw/DSR-table1.xlsx", 
        sheet = "DSR-table1")

If a data set is small enough, we can view it in the Console by typing its name, for example,

df1

We can pretty-print the data using knitr::kable(). Add the following to your script.

library("knitr")
kable(df1)

When you use this function in an R script, it prints the table to your Console

|country     | year|  cases| population|
|:-----------|----:|------:|----------:|
|Afghanistan | 1999|    745|   19987071|
|Afghanistan | 2000|   2666|   20595360|
|Brazil      | 1999|  37737|  172006362|
|Brazil      | 2000|  80488|  174504898|
|China       | 1999| 212258| 1272915272|
|China       | 2000| 213766| 1280428583|

Later, we’ll be using the kable() function in an R Markdown script. Then the function produces a prettier table like this:

country	year	cases	population
Afghanistan	1999	745	19987071
Afghanistan	2000	2666	20595360
Brazil	1999	37737	172006362
Brazil	2000	80488	174504898
China	1999	212258	1272915272
China	2000	213766	1280428583

read_excel() produces a tibble. Add the class() function to your script,

class(df1)

#> [1] "tbl_df"     "tbl"        "data.frame"

reading CSV files

Your data-raw/ directory should contain the scanvote.csv file.

readr is the package (part of the tidyverse)
read_csv() is the function

In your open R script, type the following,

library("tidyverse") # loads the readr package 
df2 <- read_csv(file = "data-raw/scanvote.csv")

We can pretty-print the top n rows with head()

head(df2, n = 5L)

#> # A tibble: 5 x 4
#>   District        Yes   Pop Country
#>   <chr>         <dbl> <dbl> <chr>  
#> 1 Uusimaa        70.8 118.  Fin    
#> 2 Turku ja Pori  53.4  32   Fin    
#> 3 Hame           57.8  39.5 Fin    
#> 4 Kymi           65.2  31.8 Fin    
#> 5 Ahvenanmaa     51.9  15.4 Fin

read_csv() produces a tibble. Add the class() function to your script,

class(df2)

#> [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

webscraping ASCII data

Confine your webscraping (for now) to data in ASCII format

On any online data page, right-click > View page source
Data formatted in ASCII (text) is easily recognized
Data formatted in HTML is also easily recognized

With online data in ASCII format, webscraping is easy

utils is the package
read.table() is the function

Add the following to your R script.

library("utils")
url <- "http://www.prdh.umontreal.ca/BDLC/data/alb/Population.txt"

df3  <- read.table(url,   
                  skip = 2, 
                  header = TRUE, 
                  stringsAsFactors = FALSE)

df3 <- as_tibble(df3)

Examine it

glimpse(df3)

#> Observations: 10,212
#> Variables: 5
#> $ Year   <int> 1921, 1921, 1921, 1921, 1921, 1921, 1921, 1921, 1921, 1...
#> $ Age    <chr> "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10",...
#> $ Female <dbl> 7864.85, 7936.45, 8024.81, 8017.72, 7925.59, 7760.15, 7...
#> $ Male   <dbl> 8133.86, 8142.91, 8240.34, 8244.01, 8154.78, 7991.74, 7...
#> $ Total  <dbl> 15998.71, 16079.36, 16265.15, 16261.73, 16080.37, 15751...

Write it to the data-raw directory. Add the following to your script,

write_csv(df3, path = "data-raw/alberta-mortality.csv")

data directory write and read

Write functions

write_csv()  # use CSV generally
saveRDS()    # use RDS to preserve factors

Read functions for further data carpentry

read_csv() 
readRDS()

Anytime you want to know about a function arguments, invoke the package using the library() function, then in the Console type ? function-name, for example,

library("tidyverse")
? write_csv()

exercises

Note: The instruction to classify a data structure implies that you

determine the number of observations
determine the number of variables
classify every variable as quantitative or categorical
determine the number of levels in every categorical variable
classify every categorical variables as ordinal or nominal
classify every quantitative variable as continuous or discrete
other features relevant to the particular display type

1. Manually tidy Excel data

Open the VADeaths.xlsx file in MS Excel (or whatever spreadsheet software you use).
In the spreadsheet, manually move, copy, and paste cell entries until the data are in tidy form.
This is the only time we’ll do a manual operation. By doing this manually, you develop a better understanding of what the tidyr package does when we use it to script similar operations.
Classify the data structure
When you have finished, a partial answer is provided so you can check your work.

2. Webscrape CSV data from FiveThirtyEight

We will use the FiveThirtyEight data from: https://projects.fivethirtyeight.com/soccer-api/club/spi_matches.csv

In your open R script, assign the URL above to the variable name url
In the R script, use read_csv(url) to scrape the data from the web, and assign it to the variable name df4
Use glimpse() to examine the data
Write the CSV file to your data-raw directory using the name spi-matches.csv
Have a look in your data-raw directory to confirm that the file is there.

In addition,

Classify the data structure
Determine if the data are tidy or not

3. XLXS data from US Dept of Agriculture

Manually download the file nsn-extract-5-9-18.xlsx from the US General Services Administration (GSA) at the following URL

https://inventory.data.gov/dataset/67567804-073d-40ad-a710-2b0bed8b84e2/resource/360b0748-d161-4857-a7dc-dfccfaeea096/download/nsn-extract-5-9-18.xlsx

Save the file in your data-raw directory
Manually open the file and examine the contents
Based on the structure you see, use read_excel() to input the data into R
Write the CSV file to your data-raw directory with the filename nsn-extract-5-9-18.csv. At this point, I consider the CSV file raw data, just as if we have used our script to webscrape it directly.

In addition,

Classify the data structure
Determine if the data are tidy or not

references

Wickham H and Grolemund G (2017) R for Data Science. O’Reilly Media, Inc., Sebastopol, CA https://r4ds.had.co.nz/

▲ top of page
◁ calendar
◁ index

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cm101-data-basics.md

cm101-data-basics.md

data basics

contents

prerequisites

data in base R

data in R packages

raw data files

reading XLSX files

reading CSV files

webscraping ASCII data

data directory write and read

exercises

references

Files

cm101-data-basics.md

Latest commit

History

cm101-data-basics.md

File metadata and controls

data basics

contents

prerequisites

data in base R

data in R packages

raw data files

reading XLSX files

reading CSV files

webscraping ASCII data

data directory write and read

exercises

references