This section introduces several different functions for printing output and making that output easier to read.

The print Function

The print function prints a string representation of an object to the console. The string representation is usually formatted in a way that exposes details important to programmers rather than users.

For example, when printing a vector, the function prints the position of the first element on each line in square brackets [ ]:
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## [91] 91 92 93 94 95 96 97 98 99 100
The print function also prints quotes around strings:
## [1] "Hi"
These features make the print function ideal for printing information when you're trying to understand some code or diagnose a bug. On the other hand, these features also make print a bad choice for printing output or status messages for users (including you).
R calls the print function automatically anytime a result is returned at the prompt. Thus it's not necessary to call print to print something when you're working directly in the console; you only need it from within loops, functions, scripts, and other code that runs non-interactively.
The print function is an S3 generic (see Section 6.4), so if you create an S3 class, you can define a custom print method for it. For S4 objects, R uses the S4 generic show instead of print.
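As a sketch of how this works (the class and its print method here are invented for illustration, not taken from the text):

```r
# An object with a hypothetical S3 class "dog".
dog = structure(list(name = "Rex"), class = "dog")

# Defining a function named print.CLASS makes print() dispatch
# to it for objects of that class.
print.dog = function(x, ...) {
  cat("Dog named", x$name, "\n")
  invisible(x)
}

print(dog)
```

Returning the object invisibly is the usual convention for print methods, so the method can be chained without re-printing.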
The message Function

To print output for users, the message function is the one you should use. The main reason for this is that the message function is part of R's conditions system for reporting status information as code runs. This makes it easier for other code to detect, record, respond to, or suppress the output (see Section 4.2 to learn more about R's conditions system).
The message function prints its argument(s) and a newline to the console:
## Hello world!
If an argument isn't a string, the function automatically and silently attempts to coerce it to one:

## 4
Some types of objects can't be coerced to a string:

## Error in FUN(X[[i]], ...): cannot coerce type 'builtin' to vector of type 'character'
For objects with multiple elements, the function pastes together the string representations of the elements with no separators in between:

## 123
Similarly, if you pass the message function multiple arguments, it pastes them together with no separators:
## Hi, my name is R and x is 123
This is a convenient way to print names or descriptions alongside values from your code without having to call a formatting function like paste.
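The calls behind the outputs above are elided in this copy; reconstructions like these produce the same text (assuming x is 123):

```r
x = 123
message(4)                  # coerced to the string "4"
message(c(1, 2, 3))         # elements pasted with no separator: "123"
message("Hi, my name is ", "R", " and x is ", x)
```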
You can make the message function print something without adding a newline at the end by setting the argument appendLF = FALSE. The difference can be easy to miss unless you make several calls to message, so the say_hello function in this example calls message twice:
say_hello = function(appendLF) {
  message("Hello", appendLF = appendLF)
  message(" world!")
}

say_hello(appendLF = TRUE)
## Hello
## world!

## Hello world!
Note that RStudio always adds a newline in front of the prompt, so making an isolated call to message with appendLF = FALSE appears to produce the same output as with appendLF = TRUE. This is an example of a situation where RStudio leads you astray: in an ordinary R console, the two are clearly different.
The cat Function

The cat function, whose name stands for "concatenate and print," is a low-level way to print output to the console or a file. The message function prints output by calling cat, but cat is not part of R's conditions system.
The cat function prints its argument(s) to the console. It does not add a newline at the end:
## Hello
As with message, RStudio hides the fact that there's no newline if you make an isolated call to cat.
The cat function coerces its arguments to strings and concatenates them. By default, a space is inserted between arguments and their elements:
## 4

## 1 2 3

## Hello Nick
You can set the sep parameter to control the separator cat inserts:
## Hello|world|1|2|3
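The call that produced this output isn't shown; presumably it was along these lines:

```r
cat("Hello", "world", 1, 2, 3, sep = "|")
```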
If you want to write output to a file rather than to the console, you can call cat with the file parameter set. However, it's preferable to use functions tailored to writing specific kinds of data, such as writeLines (for text) or write.table (for tabular data), since they provide additional options to control the output.
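For instance, here is a minimal sketch of writing text to a file with writeLines and reading it back (the file name is arbitrary):

```r
path = tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), path)
readLines(path)
```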
Many scripts and packages still use cat to print output, but the message function provides more flexibility and control to people running the code. Thus it's generally preferable to use message in new code. Nevertheless, there are a few specific cases where cat is useful; for example, if you want to pipe data to a UNIX shell command. See ?cat for details.
R provides a variety of ways to format data before you print it. Taking the time to format output carefully makes it easier to read and understand, as well as making your scripts seem more professional.
One way to format strings is by adding (or removing) escape sequences. An escape sequence is a sequence of characters that represents some other character, usually one that's invisible (such as whitespace) or doesn't appear on a standard keyboard.
In R, escape sequences always begin with a backslash. For example, \n is a newline. The message and cat functions automatically convert escape sequences to the characters they represent:
## Hello
## world!
The print function doesn't convert escape sequences:
## [1] "Hello\nworld!"
Some escape sequences trigger special behavior in the console. For example, ending a line with a carriage return \r makes the console print the next line over the current one. Try running this code in a console (it's not possible to see the result in a static book):
# Run this in an R console.
for (i in 1:10) {
  message(i, "\r", appendLF = FALSE)
  # Wait 0.5 seconds.
  Sys.sleep(0.5)
}
You can find a complete list of escape sequences in ?Quotes.
You can use the sprintf function to apply specific formatting to values and substitute them into strings. The function uses a mini-language to describe the formatting and substitutions. The sprintf function (or something like it) is available in many programming languages, so being familiar with it will serve you well on your programming journey.
The key idea is that substitutions are marked by a percent sign % and a character. The character indicates the kind of data to be substituted: s for strings, i for integers, f for floating point numbers, and so on.
The first argument to sprintf must be a string, and subsequent arguments are values to substitute into the string (from left to right). For example:
## [1] "My age is 32, and my name is Nick"
You can use the mini-language to do things like specify how many digits to print after a decimal point. Format settings for a substituted value go between the percent sign % and the character. For instance, here's how to print pi with 2 digits after the decimal:
## [1] "3.14"
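The chunks behind the two sprintf outputs above are elided; calls like these reproduce them (the age and name are taken from the printed output):

```r
sprintf("My age is %i, and my name is %s", 32L, "Nick")
sprintf("%.2f", pi)
```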
You can learn more by reading ?sprintf.
Much simpler are the paste and paste0 functions, which coerce their arguments to strings and concatenate (or "paste together") them. The paste function inserts a space between each argument, while the paste0 function doesn't:
## [1] "Hello world"

## [1] "Helloworld"
You can control the character inserted between arguments with the sep parameter.
By setting an argument for the collapse parameter, you can also use the paste and paste0 functions to concatenate the elements of a vector. The argument to collapse is inserted between the elements. For example, suppose you want to paste together the elements of a vector, inserting a comma and a space (", ") in between:
## [1] "1, 2, 3"
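The calls for the paste outputs in this subsection are elided; these reproduce them:

```r
paste("Hello", "world")          # "Hello world"
paste0("Hello", "world")         # "Helloworld"
paste(1:3, collapse = ", ")      # "1, 2, 3"
```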
Members of the R community have developed many packages to make formatting strings easier:
Logging means saving the output from some code to a file as the code runs. The file where the output is saved is called a log file or log, but this name isn't indicative of a specific format (unlike, say, a "CSV file").
It's a good idea to set up some kind of logging for any code that takes more than a few minutes to run, because then if something goes wrong you can inspect the log to diagnose the problem. Think of any output that's not logged as ephemeral: it could disappear if someone reboots the computer, or there's a power outage, or some other, unforeseen event.
R's built-in tools for logging are rudimentary, but members of the community have developed a variety of packages for logging. Here are a few that are still actively maintained as of January 2023:
logging, a port of Python's logging module

R is a powerful tool for automating tasks that have repetitive steps. For example, you can:
You can implement concise, efficient solutions for these kinds of tasks in R by using iteration, which means repeating a computation many times. R provides four different strategies for writing iterative code:
Vectorization is the most efficient and most concise iteration strategy, but also the least flexible, because it only works with vectorized functions and vectors. Apply functions are more flexible (they work with any function and any data structure with elements) but less efficient and less concise. Loops and recursion provide the most flexibility but are the least concise. In recent versions of R, apply functions and loops are similar in terms of efficiency. Recursion tends to be the least efficient iteration strategy in R.
The rest of this section explains how to write loops and how to choose which iteration strategy to use. We assume you're already comfortable with vectorization and have at least some familiarity with apply functions.
+A for-loop evaluates an expression once for each element of a vector or
+list. The for
keyword creates a for-loop. The syntax is:
The variable I
is called an induction variable. At the beginning of each
+iteration, I
is assigned the next element of DATA
. The loop iterates once
+for each element, unless a keyword instructs R to exit the loop early (more
+about this in Section 3.5.4). As with if-statements and functions,
+the curly braces { }
are only required if the body contains multiple lines of
+code. Here’s a simple for-loop:
## Hi from iteration 1
+## Hi from iteration 2
+## Hi from iteration 3
+## Hi from iteration 4
+## Hi from iteration 5
+## Hi from iteration 6
+## Hi from iteration 7
+## Hi from iteration 8
+## Hi from iteration 9
+## Hi from iteration 10
+When some or all of the iterations in a task depend on results from prior +iterations, loops tend to be the most appropriate iteration strategy. For +instance, loops are a good way to implement time-based simulations or compute +values in recursively defined sequences.
+As a concrete example, suppose you want to compute the result of starting from +the value 1 and composing the sine function 100 times:
+ +## [1] 0.1688525
+Unlike other iteration strategies, loops don’t return a result automatically. +It’s up to you to use variables to store any results you want to use later. If +you want to save a result from every iteration, you can use a vector or a list +indexed on the iteration number:
+n = 1 + 100
+result = numeric(n)
+result[1] = 1
+for (i in 2:n) {
+ result[i] = sin(result[i - 1])
+}
+
+plot(result)
Section 3.5.3 explains this in more detail.
+If the iterations in a task are not dependent, it’s preferable to use +vectorization or apply functions instead of a loop. Vectorization is more +efficient, and apply functions are usually more concise.
+In some cases, you can use vectorization to handle a task even if the
+iterations are dependent. For example, you can use vectorized exponentiation
+and the sum
function to compute the sum of the cubes of many numbers:
## [1] 1001910
+A while-loop runs a block of code repeatedly as long as some condition is
+TRUE
. The while
keyword creates a while-loop. The syntax is:
The CONDITION
should be a scalar logical value or an expression that returns
+one. At the beginning of each iteration, R checks the CONDITION
and exits the
+loop if it’s FALSE
. As always, the curly braces { }
are only required if
+the body contains multiple lines of code. Here’s a simple while-loop:
## Hello from iteration 1
+## Hello from iteration 2
+## Hello from iteration 3
+## Hello from iteration 4
+## Hello from iteration 5
+## Hello from iteration 6
+## Hello from iteration 7
+## Hello from iteration 8
+## Hello from iteration 9
+## Hello from iteration 10
+Notice that this example does the same thing as the simple for-loop in Section +3.5.1, but requires 5 lines of code instead of 2. While-loops are a +generalization of for-loops, and only do the bare minimum necessary to iterate. +They tend to be most useful when you don’t know how many iterations will be +necessary to complete a task.
+As an example, suppose you want to add up the integers in order until the total +is greater than 50:
+total = 0
+i = 1
+while (total < 50) {
+ total = total + i
+ message("i is ", i, " total is ", total)
+ i = i + 1
+}
## i is 1 total is 1
+## i is 2 total is 3
+## i is 3 total is 6
+## i is 4 total is 10
+## i is 5 total is 15
+## i is 6 total is 21
+## i is 7 total is 28
+## i is 8 total is 36
+## i is 9 total is 45
+## i is 10 total is 55
+
+## [1] 55
+
+## [1] 11
+Loops often produce a different result for each iteration. If you want to save +more than one result, there are a few things you must do.
+First, set up an index vector. The index vector should usually correspond to
+the positions of the elements in the data you want to process. The seq_along
+function returns an index vector when passed a vector or list. For instance:
The loop will iterate over the index rather than the input, so the induction +variable will track the current iteration number. On the first iteration, the +induction variable will be 1, on the second it will be 2, and so on. Then you +can use the induction variable and indexing to get the input for each +iteration.
+Second, set up an empty output vector or list. This should usually also +correspond to the input, or one element longer (the extra element comes from +the initial value). R has several functions for creating vectors:
+logical
, integer
, numeric
, complex
, and character
create an empty
+vector with a specific type and length
vector
creates an empty vector with a specific type and length
rep
creates a vector by repeating elements of some other vector
Empty vectors are filled with FALSE
, 0
, or ""
, depending on the type of
+the vector. Here are some examples:
## [1] FALSE FALSE FALSE
+

## [1] 0 0 0 0

## [1] 1 2 1 2
vector:
As with the input, you can use the induction variable and indexing to set the +output for each iteration.
+Creating a vector or list in advance to store something, as we’ve just done, is
+called preallocation. Preallocation is extremely important for efficiency
+in loops. Avoid the temptation to use c
or append
to build up the output
+bit by bit in each iteration.
Finally, write the loop, making sure to get the input and set the output. As an
+example, this loop adds each element of numbers
to a running total and
+squares the new running total:
for (i in index) {
+ prev = if (i > 1) result[i - 1] else 0
+ result[i] = (numbers[i] + prev)^2
+}
+result
## [1] 1.000000e+00 4.840000e+02 2.371690e+05 5.624534e+10 3.163538e+21
+The break
keyword causes a loop to immediately exit. It only makes sense to
+use break
inside of an if-statement.
For example, suppose you want to print each string in a vector, but stop at the
+first missing value. You can do this with a for-loop and the break
keyword:
my_messages = c("Hi", "Hello", NA, "Goodbye")
+

for (msg in my_messages) {
  if (is.na(msg))
    break

  message(msg)
}
+## Hello
+The next
keyword causes a loop to immediately go to the next iteration. As
+with break
, it only makes sense to use next
inside of an if-statement.
Let’s modify the previous example so that missing values are skipped, but don’t +cause printing to stop. Here’s the code:
+ +## Hi
+## Hello
+## Goodbye
+These keywords work with both for-loops and while-loops.
+At first it may seem difficult to decide if and what kind of iteration to use. +Start by thinking about whether you need to do something over and over. If you +don’t, then you probably don’t need to use iteration. If you do, then try +iteration strategies in this order:
+Start by writing the code for just one iteration. Make sure that code works; +it’s easy to test code for one iteration.
+When you have one iteration working, then try using the code with an iteration
+strategy (you will have to make some small changes). If it doesn’t work, try to
+figure out which iteration is causing the problem. One way to do this is to use
+message
to print out information. Then try to write the code for the broken
+iteration, get that iteration working, and repeat this whole process.
The Collatz Conjecture is a conjecture in math that was introduced +in 1937 by Lothar Collatz and remains unproven today, despite being relatively +easy to explain. Here’s a statement of the conjecture:
+++Start from any positive integer. If the integer is even, divide by 2. If the +integer is odd, multiply by 3 and add 1.
+If the result is not 1, repeat using the result as the new starting value.
+The result will always reach 1 eventually, regardless of the starting value.
+
The sequences of numbers this process generates are called Collatz
+sequences. For instance, the Collatz sequence starting from 2 is 2, 1
. The
+Collatz sequence starting from 12 is 12, 6, 3, 10, 5, 16, 8, 4, 2, 1
.
You can use iteration to compute the Collatz sequence for a given starting +value. Since each number in the sequence depends on the previous one, and since +the length of the sequence varies, a while-loop is the most appropriate +iteration strategy:
+n = 5
+i = 0
+while (n != 1) {
+ i = i + 1
+ if (n %% 2 == 0) {
+ n = n / 2
+ } else {
+ n = 3 * n + 1
+ }
+ message(n, " ", appendLF = FALSE)
+}
## 16 8 4 2 1
+As of 2020, scientists have used computers to check the Collatz sequences for +every number up to approximately \(2^{64}\). For more details about the Collatz +Conjecture, check out this video.
+The U.S. Department of Agriculture (USDA) Economic Research Service (ERS) +publishes data about consumer food prices. For instance, in 2018 they posted a +dataset that estimates average retail prices for various fruits, vegetables, +and snack foods. The estimates are formatted as a collection +of Excel files, one for each type of fruit or vegetable. In this case study, +you’ll use iteration to get the estimated “fresh” price for all of the fruits +in the dataset that are sold fresh.
+To get started, download the zipped collection of fruit
+spreadsheets and save it somewhere on your computer. Then unzip the
+file with a zip program or R’s own unzip
function.
The first sheet of each file contains a table with the name of the fruit and +prices sorted by how the fruit was prepared. You can see this for yourself if +you use a spreadsheet program to inspect some of the files.
+In order to read the files into R, first get a vector of their names. You can
+use the list.files
function to list all of the files in a directory. If you
+set full.names = TRUE
, the function will return the absolute path to each
+file:
## [1] "data/fruit/apples_2013.xlsx" "data/fruit/apricots_2013.xlsx"
+## [3] "data/fruit/bananas_2013.xlsx" "data/fruit/berries_mixed_2013.xlsx"
+## [5] "data/fruit/blackberries_2013.xlsx" "data/fruit/blueberries_2013.xlsx"
+## [7] "data/fruit/cantaloupe_2013.xlsx" "data/fruit/cherries_2013.xlsx"
+## [9] "data/fruit/cranberries_2013.xlsx" "data/fruit/dates_2013.xlsx"
+## [11] "data/fruit/figs_2013.xlsx" "data/fruit/fruit_cocktail_2013.xlsx"
+## [13] "data/fruit/grapefruit_2013.xlsx" "data/fruit/grapes_2013.xlsx"
+## [15] "data/fruit/honeydew_2013.xlsx" "data/fruit/kiwi_2013.xlsx"
+## [17] "data/fruit/mangoes_2013.xlsx" "data/fruit/nectarines_2013.xlsx"
+## [19] "data/fruit/oranges_2013.xlsx" "data/fruit/papaya_2013.xlsx"
+## [21] "data/fruit/peaches_2013.xlsx" "data/fruit/pears_2013.xlsx"
+## [23] "data/fruit/pineapple_2013.xlsx" "data/fruit/plums_2013.xlsx"
+## [25] "data/fruit/pomegranate_2013.xlsx" "data/fruit/raspberries_2013.xlsx"
+## [27] "data/fruit/strawberries_2013.xlsx" "data/fruit/tangerines_2013.xlsx"
+## [29] "data/fruit/watermelon_2013.xlsx"
+The files are in Excel format, which you can read with the read_excel
+function from the readxl package. First try reading one file and extracting
+the fresh price:
## New names:
+## • `` -> `...2`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
+price appears in the row where the word in column 1 starts with "Fresh"
. You
+can use str_which
from the stringr package (Section
+1.4.1) to find and extract this row:
## # A tibble: 1 × 7
+## Apples—Average retail price per pound or…¹ ...2 ...3 ...4 ...5 ...6 ...7
+## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
+## 1 Fresh1 1.56… per … 0.9 0.24… poun… 0.42…
+## # ℹ abbreviated name:
+## # ¹`Apples—Average retail price per pound or pint and per cup equivalent, 2013`
+The price and unit appear in column 2 and column 3.
+Now generalize these steps by making a read_fresh_price
function. The
+function should accept a path as input and return a vector that contains the
+fruit name, fresh price, and unit. Don’t worry about cleaning up the fruit name
+at this point—you can do that with a vectorized operation after combining the
+data from all of the files. A few fruits don’t have a fresh price, and the
+function should return NA
for the price and unit for those. Here’s one way to
+implement the read_fresh_price
function:
read_fresh_price = function(path) {
+ prices = read_excel(path)
+
+ # Get fruit name.
+ fruit = names(prices)[[1]]
+
+ # Find fresh price.
+ idx = str_which(prices[[1]], "^Fresh")
+ if (length(idx) > 0) {
+ prices = prices[idx, ]
+ c(fruit, prices[[2]], prices[[3]])
+ } else {
+ c(fruit, NA, NA)
+ }
+}
Test that the function returns the correct result for a few of the files:
+ +## New names:
+## • `` -> `...2`
+## • `` -> `...3`
+## • `` -> `...4`
+## • `` -> `...5`
+## • `` -> `...6`
+## • `` -> `...7`
+## [1] "Apples—Average retail price per pound or pint and per cup equivalent, 2013"
+## [2] "1.5675153914496354"
+## [3] "per pound"
+
+## New names:
+## • `` -> `...2`
+## • `` -> `...3`
+## • `` -> `...4`
+## • `` -> `...5`
+## • `` -> `...6`
+## • `` -> `...7`
+## [1] "Mixed berries—Average retail price per pound and per cup equivalent, 2013"
+## [2] NA
+## [3] NA
+
+## New names:
+## • `` -> `...2`
+## • `` -> `...3`
+## • `` -> `...4`
+## • `` -> `...5`
+## • `` -> `...6`
+## • `` -> `...7`
+## [1] "Cherries—Average retail price per pound and per cup equivalent, 2013"
+## [2] "3.5929897554945156"
+## [3] "per pound"
+Now that the function is working, it’s time to choose an iteration strategy.
+The read_fresh_price
function is not vectorized, so that strategy isn’t
+possible. Reading one file doesn’t depend on reading any of the others, so
+apply functions are the best strategy here. The read_fresh_price
function
+always returns a character vector with 3 elements, so you can use sapply
to
+process all of the files and get a matrix of results:
## New names:
+## New names:
+## New names:
+## New names:
+## New names:
+## New names:
+## New names:
+## New names:
+## New names:
+## New names:
+## New names:
+## New names:
+## New names:
+## New names:
+## New names:
+## New names:
+## New names:
+## New names:
+## New names:
+## New names:
+## New names:
+## New names:
+## New names:
+## New names:
+## New names:
+## New names:
+## New names:
+## New names:
+## New names:
+## • `` -> `...2`
+## • `` -> `...3`
+## • `` -> `...4`
+## • `` -> `...5`
+## • `` -> `...6`
+## • `` -> `...7`
# Transpose, convert to a data frame, and set names for easy reading.
all_prices = t(all_prices)
all_prices = data.frame(all_prices)
rownames(all_prices) = NULL
colnames(all_prices) = c("fruit", "price", "unit")
all_prices
## fruit
+## 1 Apples—Average retail price per pound or pint and per cup equivalent, 2013
+## 2 Apricots—Average retail price per pound and per cup equivalent, 2013
+## 3 Bananas—Average retail price per pound and per cup equivalent, 2013
+## 4 Mixed berries—Average retail price per pound and per cup equivalent, 2013
+## 5 Blackberries—Average retail price per pound and per cup equivalent, 2013
+## 6 Blueberries—Average retail price per pound and per cup equivalent, 2013
+## 7 Cantaloupe—Average retail price per pound and per cup equivalent, 2013
+## 8 Cherries—Average retail price per pound and per cup equivalent, 2013
+## 9 Cranberries—Average retail price per pound and per cup equivalent, 2013
+## 10 Dates—Average retail price per pound and per cup equivalent, 2013
+## 11 Figs—Average retail price per pound and per cup equivalent, 2013
+## 12 Fruit cocktail—Average retail price per pound and per cup equivalent, 2013
+## 13 Grapefruit—Average retail price per pound or pint and per cup equivalent, 2013
+## 14 Grapes—Average retail price per pound or pint and per cup equivalent, 2013
+## 15 Honeydew melon—Average retail price per pound and per cup equivalent, 2013
+## 16 Kiwi—Average retail price per pound and per cup equivalent, 2013
+## 17 Mangoes—Average retail price per pound and per cup equivalent, 2013
+## 18 Nectarines—Average retail price per pound and per cup equivalent, 2013
+## 19 Oranges—Average retail price per pound or pint and per cup equivalent, 2013
+## 20 Papaya—Average retail price per pound and per cup equivalent, 2013
+## 21 Peaches—Average retail price per pound and per cup equivalent, 2013
+## 22 Pears—Average retail price per pound and per cup equivalent, 2013
+## 23 Pineapple—Average retail price per pound or pint and per cup equivalent, 2013
+## 24 Plums—Average retail price per pound or pint and per cup equivalent, 2013
+## 25 Pomegranate—Average retail price per pound or pint and per cup equivalent, 2013
+## 26 Raspberries—Average retail price per pound and per cup equivalent, 2013
+## 27 Strawberries—Average retail price per pound and per cup equivalent, 2013
+## 28 Tangerines—Average retail price per pound or pint and per cup equivalent, 2013
+## 29 Watermelon—Average retail price per pound and per cup equivalent, 2013
+## price unit
+## 1 1.5675153914496354 per pound
+## 2 3.0400719670964378 per pound
+## 3 0.56698341453144807 per pound
+## 4 <NA> <NA>
+## 5 5.7747082503535152 per pound
+## 6 4.7346216897250253 per pound
+## 7 0.53587377610644515 per pound
+## 8 3.5929897554945156 per pound
+## 9 <NA> <NA>
+## 10 <NA> <NA>
+## 11 <NA> <NA>
+## 12 <NA> <NA>
+## 13 0.89780204117954143 per pound
+## 14 2.0938274120049827 per pound
+## 15 0.79665620543008364 per pound
+## 16 2.0446834079658482 per pound
+## 17 1.3775634470319702 per pound
+## 18 1.7611484827950696 per pound
+## 19 1.0351727302444853 per pound
+## 20 1.2980115892049107 per pound
+## 21 1.5911868532458617 per pound
+## 22 1.4615746043999458 per pound
+## 23 0.62766194593569868 per pound
+## 24 1.8274160078099031 per pound
+## 25 2.1735904118559191 per pound
+## 26 6.9758107988552958 per pound
+## 27 2.3588084831103004 per pound
+## 28 1.3779618772323634 per pound
+## 29 0.33341203532340097 per pound
+Finally, the last step is to remove the extra text from the fruit name. One way
+to do this is with the str_split_fixed
function from the stringr package.
+There’s an en dash —
after each fruit name, which you can use for the split:
## fruit price unit
+## 1 Apples 1.5675153914496354 per pound
+## 2 Apricots 3.0400719670964378 per pound
+## 3 Bananas 0.56698341453144807 per pound
+## 4 Mixed berries <NA> <NA>
+## 5 Blackberries 5.7747082503535152 per pound
+## 6 Blueberries 4.7346216897250253 per pound
+## 7 Cantaloupe 0.53587377610644515 per pound
+## 8 Cherries 3.5929897554945156 per pound
+## 9 Cranberries <NA> <NA>
+## 10 Dates <NA> <NA>
+## 11 Figs <NA> <NA>
+## 12 Fruit cocktail <NA> <NA>
+## 13 Grapefruit 0.89780204117954143 per pound
+## 14 Grapes 2.0938274120049827 per pound
+## 15 Honeydew melon 0.79665620543008364 per pound
+## 16 Kiwi 2.0446834079658482 per pound
+## 17 Mangoes 1.3775634470319702 per pound
+## 18 Nectarines 1.7611484827950696 per pound
+## 19 Oranges 1.0351727302444853 per pound
+## 20 Papaya 1.2980115892049107 per pound
+## 21 Peaches 1.5911868532458617 per pound
+## 22 Pears 1.4615746043999458 per pound
+## 23 Pineapple 0.62766194593569868 per pound
+## 24 Plums 1.8274160078099031 per pound
+## 25 Pomegranate 2.1735904118559191 per pound
+## 26 Raspberries 6.9758107988552958 per pound
+## 27 Strawberries 2.3588084831103004 per pound
+## 28 Tangerines 1.3779618772323634 per pound
+## 29 Watermelon 0.33341203532340097 per pound
+Now the data are ready for analysis. You could extend the reader function to +extract more of the data (e.g., dried and frozen prices), but the overall +process is fundamentally the same. Write the code to handle one file (one +step), generalize it to work on several, and then iterate.
+For another example, see Liza Wood’s Real-world Function Writing +Mini-reader.
+print
Functionmessage
Functioncat
Functionggplot2
We will be using the R package ggplot2
to create data visualizations. Install it via the install.packages()
function. While we are at it, let’s make sure we install all of the packages that we’ll need for today’s workshop. Beyone ggplot2
, we’ll use readr
for reading data files, dplyr
for manupulating data, and palmerpenguins
provides a nice dataset.
install.packages("ggplot2")
+install.packages("dplyr")
install.packages("dplyr")
install.packages("readr")
install.packages("palmerpenguins")
is an enormously popular R package that provides a way to create data visualizations through a so-called “grammar of graphics” (hence the “gg” in the name). That grammar interface may be a little bit unintuitive at first but once you grasp it, you hold enormous power to quickly craft plots. It doesn’t hurt that the ggplot2
plots look great, too.
Let’s look at an example. This uses data from the palmerpenguins
package that you just installed (make sure to load the package with library(palmerpenguins)
). It is measurements of 344 penguins, collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network. The data package was created by Allison Horst. Before jumping in, let’s have a look at the data and the image we want to create.
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
@@ -360,22 +364,22 @@ 5.2.1 Examining the Plot
5.2.2 Duplicating the Palmer Penguins Plot
Here is the code to make the plot:
# matching the Allison Horst penguins plot
ggplot(penguins) +
  aes(x = flipper_length_mm,
      y = bill_length_mm,
      color = species,
      shape = species) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  xlab("Flipper length (mm)") +
  ylab("Bill length (mm)") +
  ggtitle(
    "Flipper and bill length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo penguins at Palmer Station LTER"
  ) +
  labs(color = "Penguin species", shape = "Penguin species") +
  scale_color_brewer(palette = "Dark2")
# Make a basic penguin plot with just data,
# a geometry, and a map between the two.
ggplot(penguins) +
  aes(x = flipper_length_mm,
      y = bill_length_mm,
      color = species,
      shape = species) +
  geom_point()
We know, though, that this plot is not complete. In particular, there is no title and the axes aren’t labeled meaningfully. Also, the clouds of points seem to hide the meaning that we are trying to convey and the colors aren’t colorblind-safe. The rest of the pieces of the plot call are meant to address those shortcomings.
We add a second geometry layer with geom_smooth(method=lm, se=FALSE)
, which specifies the lm
method in order to draw a straight (instead of wiggly) smoother through the data. The x
-axis label, y
-axis label and title are set by xlab()
, ylab()
, and ggtitle()
, respectively. We want a more informative label for the legend title than just the variable name (“Penguin Species” instead of “species”), which is handled by the labs()
function. And you’ll recall from the principles of data visualization that you can use Cynthia Brewer’s Color Brewer website to select colorblind-friendly color schemes. Color Brewer is integrated directly into ggplot2
, so the scale_color_brewer()
function can pull a named color scheme from Color Brewer directly into your plot as the color scale.
We can begin to better understand the grammar of graphics as we consider this example. Recognize that our data visualization conveys information via several visual channels that express data as visual marks. The geometry determines how those marks are drawn on the page, which can be set separately from the mapping. Let’s see a couple examples of that:
# placing plots via gridExtra
library(gridExtra)

# plot the Palmer penguin data with a line geometry
peng_line = ggplot(penguins) +
  aes(x = flipper_length_mm,
      y = bill_length_mm,
      color = species,
      shape = species) +
  geom_line()

# plot the Palmer penguin data with a hex heatmap geometry
peng_hex = ggplot(penguins) +
  aes(x = flipper_length_mm,
      y = bill_length_mm,
      color = species,
      shape = species) +
  geom_hex()

# place the plots side-by-side (ncol = 2 puts them in one row)
grid.arrange(peng_line, peng_hex, ncol = 2)
You can see how changing the geometry but not the mapping will plot the same data with a different method. Separating the mapping of features to channels from the drawing of marks is at the core of the grammar of graphics.
This separation of functions gives ggplot2
its power by allowing us to compose a small number of functions to express data in unlimited ways (kind of like poetry). Recognizing the grammar of graphics allows us to reason in a consistent way about different kinds of plots, and make intelligent assumptions about mappings and geometries.
First, let’s revisit the penguins data. There are three categorical features in the data: species, island, and sex. Let’s use geom_bar() to count how many penguins of each species and/or sex were observed on each island. The x-axis of the plot should be the island, but note that there are multiple values of species and sex that have the same position on that x-axis. In this case, you can use the position_dodge() or position_stack() position adjustments to tell ggplot2 how to handle the second grouping channel.
# count the penguins on each island
ggplot(penguins) +
  aes(x=island) +
  geom_bar() +
  xlab("Island") +
  ylab("Count") +
  ggtitle("Count of penguins on each island")
# count the penguins of each sex on each island
ggplot(penguins) +
  aes(x=island, fill=sex) +
  geom_bar(position=position_dodge()) +
  scale_fill_grey() +
  theme_bw() +
  xlab("Island") +
  ylab("Count") +
  ggtitle("Count of penguins on each island by sex")
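The position_stack() adjustment mentioned above works the same way; only the position argument changes. Here is a sketch of the stacked variant (assuming ggplot2 and palmerpenguins are installed, as above):

```r
library(ggplot2)
library(palmerpenguins)

# count the penguins of each sex on each island, with stacked bars
p_stack = ggplot(penguins) +
  aes(x = island, fill = sex) +
  geom_bar(position = position_stack()) +
  scale_fill_grey() +
  theme_bw() +
  xlab("Island") +
  ylab("Count") +
  ggtitle("Count of penguins on each island by sex")
p_stack
```

Stacked bars emphasize the total count per island; dodged bars make it easier to compare the groups within an island.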
Alternatively, you can use facets to separate the data into multiple plots based on a data feature. Let’s see how that works to facet the plots by species.
One way to show more information more clearly in a plot is to break the plot into pieces that each show part of the information. In ggplot2
, this is called faceting the plot. There are two main facet functions, facet_grid()
(which puts plots in a grid), and facet_wrap()
, which puts plots side-by-side until it runs out of room, then wraps to a new line. We’ll use facet_wrap()
here, with the first argument being ~species
. This tells ggplot2
to break the plot into pieces by plotting the data for each species
separately.
# count the penguins of each species on each island
ggplot(penguins) +
  aes(x=island) +
  geom_bar() +
  scale_fill_grey() +
  theme_bw() +
  xlab("Island") +
  ylab("Count") +
  ggtitle("Count of penguins on each island by species") +
  facet_wrap(~species, ncol=3)
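The other facet function, facet_grid(), lays the panels out on a grid defined by two features instead of wrapping one. A sketch, using sex for the rows and species for the columns (this particular pairing is just for illustration):

```r
library(ggplot2)
library(palmerpenguins)

# count penguins by island, with a panel grid: sex (rows) by species (columns)
peng_grid = ggplot(penguins) +
  aes(x = island) +
  geom_bar() +
  xlab("Island") +
  ylab("Count") +
  facet_grid(sex ~ species)
peng_grid
```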
Here’s an example that recently came up in my office hours. You’ve done an experiment to see how mice with two different genotypes respond to two different treatments. Now you want to plot the mean response of each group as a column, with error bars indicating the standard deviation of the mean. You also want to show the raw data. I’ve simulated some data for us to use - download it here.
This one is kind of complicated because you have to tell ggplot2 how to calculate the height of the columns and of the error bars. This involves computing summary statistics inside the plot call with stat = 'summary' and summary functions.
mice_geno = read_csv("data/genotype-response.csv")

# show the treatment response for different genotypes
ggplot(mice_geno) +
  aes(x=trt,
      y=resp,
      fill=genotype) +
  scale_fill_brewer(palette="Dark2") +
  geom_bar(position=position_dodge(),
           stat='summary',
           fun='mean') +
  geom_errorbar(fun.min=function(x) {mean(x) - sd(x) / sqrt(length(x))},
                fun.max=function(x) {mean(x) + sd(x) / sqrt(length(x))},
                stat='summary',
                position=position_dodge(0.9),
                width=0.2) +
  geom_point(position=
               position_jitterdodge(
                 dodge.width=0.9,
                 jitter.width=0.1)) +
  xlab("Treatment") +
  ylab("Response (mm/g)") +
  ggtitle("Mean growth response of mice by genotype and treatment")
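An alternative that some people find easier to debug is to precompute the group means and standard errors with dplyr, then plot the summary table with geom_col() and geom_errorbar(). This is a sketch, not the approach used above; the data frame here is a simulated stand-in for the downloaded file, assuming the same trt, resp, and genotype columns:

```r
library(dplyr)
library(ggplot2)

# simulated stand-in for data/genotype-response.csv
mice_geno = data.frame(
  trt = rep(c("control", "treated"), each = 10),
  genotype = rep(c("wt", "mut"), times = 10),
  resp = rnorm(20, mean = 5)
)

# summarize the response within each genotype-by-treatment group
mice_summary = mice_geno |>
  group_by(trt, genotype) |>
  summarize(mean_resp = mean(resp),
            se_resp = sd(resp) / sqrt(n()),
            .groups = "drop")

# plot the precomputed means and standard errors
p_mice = ggplot(mice_summary) +
  aes(x = trt, y = mean_resp, fill = genotype) +
  geom_col(position = position_dodge()) +
  geom_errorbar(aes(ymin = mean_resp - se_resp, ymax = mean_resp + se_resp),
                position = position_dodge(0.9), width = 0.2)
p_mice
```

Precomputing keeps the statistics visible and testable outside the plot, at the cost of a second data frame.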
People mail dead birds to the USDA and USGS, where scientists analyze the birds to find out why they died. Right now there is a bird flu epidemic, and the USDA provides public data about the birds in which the disease has been detected. You can access the data here or see the official USDA webpage here. After you download the data, we will load it and do some visualization.
# load the downloaded USDA data
# (mdy() comes from the lubridate package)
flu <- read_csv("data/hpai-wild-birds-ver2.csv")
flu$date <- mdy(flu$`Date Detected`)

# plot a histogram of when bird flu was detected
ggplot(flu) +
  aes(x = date) +
  geom_histogram() +
  ggtitle("Bird flu detections in wild birds") +
  xlab("Date") +
  ylab("Count")
# plot a histogram of when bird flu was detected
ggplot(flu) +
  aes(x = date, fill = `Sampling Method`) +
  geom_histogram() +
  ggtitle("Bird flu detections in wild birds") +
  xlab("Date") +
  ylab("Count")
# bar chart shows how the bird flu reports compare between west coast states
# (geom_bar() counts rows by default, so a separate stat_count() is not needed)
subset(flu, State %in% c("California", "Oregon", "Washington")) |>
  ggplot() +
  aes(x = State, fill = `Sampling Method`) +
  geom_bar() +
  ggtitle("Bird flu detections in wild birds (West coast states)") +
  ylab("Count")
Let’s compare the bird flu season to the human flu season. Download hospitalization data for the 2021-2022 and 2022-2023 flu seasons from the CDC website here or see the official Centers for Disease Control website here. After you download the data, we will see how adding a second data series works a little differently from the first. That’s because composing data, aesthetic mapping, and geometry with addition only works when there is no ambiguity about which data series is being mapped or drawn.
After downloading the data, there is some work required to adjust the dates and change the hospitalization rate from cases per 100,000 to cases per 10 million, which better matches the scale of the bird flu data.
# processing CDC flu data:
cdc <- read_csv("data/FluSurveillance_Custom_Download_Data.csv", skip = 2)
cdc$date <- as_date("1950-01-01")
year(cdc$date) <- cdc$`MMWR-YEAR`
week(cdc$date) <- cdc$`MMWR-WEEK`

# get flu hospitalization counts that include all race, sex, and age categories
cdc_overall <- subset(
  cdc,
  `AGE CATEGORY` == "Overall" &
    `SEX CATEGORY` == "Overall" &
    `RACE CATEGORY` == "Overall"
)

# convert the counts to cases per 10 million
cdc_overall$`WEEKLY RATE` <- as.numeric(cdc_overall$`WEEKLY RATE`) * 100

# remake the plot but add a new geom_line() with its own data
ggplot(flu) +
  aes(x = date, fill = `Sampling Method`) +
  geom_histogram() +
  geom_line(data = cdc_overall,
            mapping = aes(x = date, y = `WEEKLY RATE`),
            inherit.aes = FALSE) +
  ggtitle("Bird flu detections and human flu hospitalizations") +
  xlab("Date") +
  ylab("Count") +
  xlim(as_date("2022-01-01"), as_date("2023-05-01"))
The US Small Business Administration (SBA) maintains data on the loans it offers to businesses. Data about loans made since 2020 can be found at the Small Business Administration website, or you can download it from here. We’ll load that data and then explore some ways to visualize it. Since the difference between a $100 loan and a $1000 loan is more like the difference between $100,000 and $1M than between $100,000 and $100,900, we should put the loan values on a logarithmic scale. You can do this in ggplot2 with the scale_y_log10() function (when the loan values are on the y axis).
# load the small business loan data
sba <- read_csv("data/small-business-loans.csv")

# check the SBA data to see the data types, etc.
head(sba)
## # A tibble: 6 × 39
## AsOfDate Program BorrName BorrStreet BorrCity BorrState BorrZip BankName
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
5.5.4 Small Business Loans
# boxplot of loan sizes by business type
subset(sba, ProjectState == "CA") |>
  ggplot() +
  aes(x = BusinessType, y = SBAGuaranteedApproval) +
  geom_boxplot() +
  scale_y_log10() +
  ggtitle("Small Business Administration guaranteed loans in California") +
  ylab("Loan guarantee (dollars)")
# relationship between loan size and interest rate
subset(sba, ProjectState == "CA") |>
  ggplot() +
  aes(x = GrossApproval, y = InitialInterestRate) +
  geom_point() +
  facet_wrap(~BusinessType, ncol = 3) +
  scale_x_log10() +
  ggtitle("Interest rate as a function of loan size") +
  xlab("Loan size (dollars)") +
  ylab("Interest rate (%)")
Now let’s color the points by the loan status. Thankfully, ggplot2
integrates directly with Color Brewer (colorbrewer2.org) to get better color palettes. We will use the Accent
color palette, which is just one of the many options that can be found on the Color Brewer site.
There are a lot of data points, which tend to overlap and hide each other. We use a smoother (geom_smooth()) to help call out differences that would otherwise be lost in the noise of the points.
# color the dots by the loan status
subset(sba, ProjectState == "CA" & LoanStatus != "EXEMPT" & LoanStatus != "CHGOFF") |>
  ggplot() +
  aes(x = GrossApproval, y = InitialInterestRate, color = LoanStatus) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~BusinessType, ncol = 3) +
  scale_x_log10() +
  ggtitle("Interest rate as a function of loan size by loan status") +
  xlab("Loan size (dollars)") +
  ylab("Interest rate (%)") +
  labs(color = "Loan status") +
  scale_color_brewer(type = "qual", palette = "Accent")
You can use the new.env function to create a new environment:
## <environment: 0x5625b8afdb70>
Unlike most objects, printing an environment doesn’t print its contents.
Instead, R prints its type (which is environment) and a unique identifier (0x5625b8afdb70 in this case).
The unique identifier is actually the memory address of the environment. Every object you use in R is stored as a series of bytes in your computer’s random-access memory (RAM). Each byte in memory has a unique address.
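Because the identifier is an address rather than a description of contents, two environments with identical frames are still distinct objects. A small sketch you can run to see this:

```r
# two environments with the same contents are still different objects
e1 = new.env()
e2 = new.env()
e1$a = 1
e2$a = 1

# environments are compared by reference, not by contents
same_env      = identical(e1, e1)  # TRUE: the very same environment
different_env = identical(e1, e2)  # FALSE: different environments
```

This reference behavior is also why assigning an environment to a new variable does not make a copy, as discussed later in this chapter.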
To see the names in an environment’s frame, you can call the ls
or names
function on the environment:
## character(0)

## character(0)
You just created the environment e
, so its frame is currently empty. The
printout character(0)
means R returned a character vector of length 0.
You can assign an object to a name in an environment’s frame with the dollar sign $ or double square bracket [[ operator, similar to how
you would assign a named element of a list. For example, one way to assign the
number 8
to the name "lucky"
in the environment e
’s frame is:
Now there’s a name defined in the environment:
## [1] "lucky"
Here’s another example of assigning an object to a name in the environment:
You can assign any type of R object to a name in an environment, including other environments.
The ls
function ignores names that begin with a dot .
by default. For
example:
## [1] "lucky" "my_message"
You can pass the argument all.names = TRUE
to make the function return all
names in the frame:
## [1] ".x" "lucky" "my_message"
Alternatively, you can just use the names
function, which always prints all
names in an environment’s frame.
Objects in an environment’s frame don’t have positions or any particular order, so they must always be assigned to a name. R raises an error if you try to assign an object to a position:
## Error in e[[3]] = 10: wrong args for environment subassignment
As you might expect, you can also use the dollar sign operator and double square bracket operator to get objects in an environment by name:
## [1] "May your coffee kick in before reality does."

## [1] 8
You can use the exists
function to check whether a specific name exists in an
environment’s frame:
## [1] FALSE

## [1] TRUE
Finally, you can remove a name and object from an environment’s frame with the
rm
function. Make sure to pass the environment as the argument to the envir
parameter when you do this:
## [1] FALSE
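Because names in a frame are unique and can be added, read, and removed on the fly, an environment works well as a simple lookup table (a hash map). A minimal sketch using only the operations introduced above:

```r
# use an environment as a hash map of word counts
counts = new.env()
for (word in c("tea", "coffee", "tea")) {
  # look up the current count for this word, defaulting to 0
  old = if (exists(word, envir = counts, inherits = FALSE)) counts[[word]] else 0
  counts[[word]] = old + 1
}

counts[["tea"]]     # 2
counts[["coffee"]]  # 1
```

Passing inherits = FALSE keeps exists from searching parent environments, which matters once you know about R's lookup rules (covered below in this chapter).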
x = list()
x$a = 10
x

## $a
## [1] 10

y = x
y

## $a
## [1] 10
When you run y = x, R makes y refer to the same object as x, without copying it.
e_x = new.env()
e_x$a = 10
e_x$a

## [1] 10

e_y = e_x
e_x$a = 20
e_y$a

## [1] 20
As before, e_y = e_x makes both e_y and e_x refer to the same object. The difference is that when you run e_x$a = 20, the copy-on-write rule does not apply: environments are never copied on write, so the change is visible through e_y as well.
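One practical consequence of this reference behavior is that a function can modify an environment passed to it as an argument, which is impossible with a list. A sketch:

```r
# a function can modify an environment it receives as an argument
increment = function(env) {
  env$a = env$a + 1  # this change is visible to the caller
}

e = new.env()
e$a = 1
increment(e)
e$a  # 2
```

If increment took a list instead, copy-on-write would confine the change to the function's own copy.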
You can use the globalenv function to get the global environment:
## <environment: R_GlobalEnv>
The global environment is easy to recognize because its unique identifier is
R_GlobalEnv
rather than its memory address (even though it’s stored in your computer’s memory like any other object).
## <environment: R_GlobalEnv>
As you can see, at the R prompt or the top level of an R script, the local environment is just the global environment.
These functions default to the local environment for their envir parameter. This makes them convenient for inspecting or modifying the local environment’s frame:
## [1] "e"          "e_x"        "e_y"        "g"          "loc"
## [6] "source_rmd" "x"          "y"

## [1] "e"          "e_x"        "e_y"        "g"          "loc"
## [6] "source_rmd" "x"          "y"
If you assign a variable, it appears in the local environment’s frame:
## [1] "coffee"     "e"          "e_x"        "e_y"        "g"
## [6] "loc"        "source_rmd" "x"          "y"

## [1] "Right. No coffee. This is a terrible planet."
Conversely, if you assign an object in the local environment’s frame, you can access it as a variable:
## [1] "Tea isn't coffee!"
my_hello = function() {
  hello = "from the other side"
}
Even after calling the function, there’s no variable hello
in the global
environment:
## [1] "loc" "my_hello" "tea" "e_x" "x"
## [6] "e_y" "y" "coffee" "source_rmd" "e"
## [11] "g" ".First"
As further demonstration, consider this modified version of my_hello
, which
returns the call environment:
The call environment is not the global environment:
## <environment: 0x5625baff25e8>
And the variable hello
exists in the call environment, but not in the global
environment:
## [1] FALSE
-
+
## [1] TRUE
-
+
## [1] "from the other side"
Each call to a function creates a new call environment. So if you call
my_hello
again, it returns a different environment (pay attention to the
memory address):
## <environment: 0x5625baff25e8>

## <environment: 0x5625bb5756f8>
By creating a new environment for every call, R isolates code in the function body from code outside of the body. As a result, most R functions have no side effects. This is a good thing, since it means you generally don’t have to worry about calls quietly modifying your variables.
tea = "Tea isn't coffee!"
get_tea = function() {
  tea
}
Then the get_tea
function can access the tea
variable:
## [1] "Tea isn't coffee!"
Note that variable lookup takes place when a function is called, not when it’s defined. This is called dynamic lookup.
For example, the result from get_tea
changes if you change the value of
tea
:
## [1] "Tea for two."

## [1] "Tea isn't coffee!"
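Dynamic lookup is easy to demonstrate with a self-contained sketch (the names here are illustrative, not from the text above):

```r
# variable lookup happens at call time, not at definition time
greeting = "hello"
greet = function() greeting

first  = greet()   # "hello"
greeting = "howdy"
second = greet()   # "howdy": the function sees the new value
```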
When a local variable (a variable in the local environment) and a non-local variable have the same name, R almost always prioritizes the local variable. For instance:
## [1] "Earl grey is tea!"
The function body assigns the local variable tea
to "Earl grey is tea!"
, so
R returns that value rather than "Tea isn't coffee!"
. In other words, local
variables mask, or hide, non-local variables with the same name.
There’s only one case where R doesn’t prioritize local variables. To see it, consider this call:
## [1] 10.5
The variable mean
must refer to a function, because it’s being called—it’s
followed by parentheses ( )
, the call syntax. In this situation, R ignores
local variables that aren’t functions, so you can write code such as:
## [1] 5.5
That said, defining a local variable with the same name as a function can still be confusing, so it’s usually considered a bad practice.
environment(get_tea)

## <environment: R_GlobalEnv>
tea
## [1] "Tea isn't coffee!"
On the other hand, in the get_tea
function from Section
6.1.5, tea
is not a local variable:
To make this more concrete, consider a function which just returns its call environment:
get_call_env = function() {
  environment()
}
e = get_call_env()

The call environment clearly doesn’t contain the tea
variable:
## character(0)
When a variable doesn’t exist in the local environment’s frame, then R gets the parent environment of the local environment.
You can use the parent.env
function to get the parent environment of an
environment. For the call environment e
, the parent environment is the global
environment, because that’s where get_call_env
was defined:
## <environment: R_GlobalEnv>
When R can’t find tea
in the call environment’s frame, R gets the parent
environment, which is the global environment. Then R searches for tea
in the
@@ -669,12 +673,12 @@
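You can watch this chain of parents yourself by walking upward until you reach the empty environment; a sketch:

```r
# walk from the global environment up through its parents
env = globalenv()
chain = character(0)
while (!identical(env, emptyenv())) {
  chain = c(chain, environmentName(env))
  env = parent.env(env)
}
chain  # starts with "R_GlobalEnv" and includes "base" near the end
```

The exact environments in the middle depend on which packages you have attached.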
get("pi")

## [1] 3.141593
You can use the get function to look up a variable starting from a specific environment or to control how R looks up the variable. For example, if you set inherits = FALSE, R will not search any parent environments:
## Error in get("pi", inherits = FALSE): object 'pi' not found
As with most functions for inspecting and modifying environments, use the get
function sparingly. R already provides a much simpler way to get a variable:
g = globalenv()
e = parent.env(g)
e
## <environment: package:stats>
## attr(,"name")
## [1] "package:stats"
## attr(,"path")
## [1] "/usr/lib/R/library/stats"
## <environment: package:graphics>
## attr(,"name")
## [1] "package:graphics"
6.1.7 The Search Path
search()
## [1] ".GlobalEnv" "package:stats" "package:graphics"
## [4] "package:grDevices" "package:utils" "package:datasets"
## [7] "package:methods" "Autoloads" "package:base"
The base environment (identified by base) is always the topmost environment. You can use the baseenv function to get the base environment:

## <environment: base>
The base environment’s parent is the special empty environment (identified by
R_EmptyEnv
), which contains no variables and has no parent. You can use the
emptyenv
function to get the empty environment:
## <environment: R_EmptyEnv>
Understanding R’s process for looking up variables and the search path is
helpful for resolving conflicts between the names of variables in packages.
6.1.7.1 The Colon Operators
Get a variable from a package without loading the package.
For example:
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## function (x, filter, method = c("convolution", "recursive"),
## sides = 2L, circular = FALSE, init = NULL)
## {
## function (.data, ..., .by = NULL, .preserve = FALSE)
## {
## check_by_typo(...)
## function (data = NULL, mapping = aes(), ..., environment = parent.frame())
## {
## UseMethod("ggplot")
## }
## <bytecode: 0x5625ba56f2e8>
## <environment: namespace:ggplot2>
The related triple-colon operator :::
gets a private variable in a
package. Generally these are private for a reason! Only use :::
if you’re
6.2 Closures
f = function() 42
environment(f)
## <environment: R_GlobalEnv>
Since the enclosing environment exists whether or not you call the function,
you can use the enclosing environment to store and share data between calls.
counter = 0
count = function() {
  counter <<- counter + 1
  counter
}
In this example, the enclosing environment is the global environment. Each time
you call count
, it assigns a new value to the counter
variable in the
global environment.
6.2.1 Tidy Closures
counter = 0
count()

## [1] 1
Or the user might overwrite the function’s variables:
## Error in counter + 1: non-numeric argument to binary operator
For functions that rely on storing information in their enclosing environment,
there are several different ways to make sure the enclosing environment is
make_fn = function() {
  # Define variables in the enclosing environment here:

  # Define and return the function here:
  function() {
    # ...
  }
}

f = make_fn()
# Now you can call f() as you would any other function.
For example, you can use the template for the counter
function:
make_count = function() {
  counter = 0

  function() {
    counter <<- counter + 1
    counter
  }
}

count = make_count()
Then calling count
has no effect on the global environment:
## [1] 1

## [1] 10
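Each call to make_count creates a fresh enclosing environment, so counters made this way are fully independent of each other. A sketch, repeating the make_count definition so it is self-contained:

```r
make_count = function() {
  counter = 0
  function() {
    counter <<- counter + 1
    counter
  }
}

# each closure gets its own enclosing environment
count_a = make_count()
count_b = make_count()
count_a()  # 1
count_a()  # 2
count_b()  # 1: count_b's counter is untouched by count_a
```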
6.3 Attributes
class(mtcars)

## [1] "data.frame"
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
## [7] "Duster 360" "Merc 240D" "Merc 230"
attr(mtcars, "row.names")
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
## [7] "Duster 360" "Merc 240D" "Merc 230"
attr(mtcars, "foo") = 42
attr(mtcars, "foo")
## [1] 42
You can get all of the attributes attached to an object with the attributes
function:
## $names
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
mod_mtcars = structure(mtcars, foo = 50, bar = 100)
attributes(mod_mtcars)
## $names
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
attributes(5)
## NULL
But the class
function still returns a class:
## [1] "numeric"
When a helper function exists to get or set an attribute, use the helper function rather than attr. This will make your code clearer and ensure that edge cases are handled correctly.
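For example, levels() is the helper for a factor's levels attribute: both roads read the same data, but the helper is the clearer, safer interface. A sketch:

```r
f = factor(c("a", "b", "a"))

# the helper function is the idiomatic way to read the attribute
lv_helper = levels(f)          # "a" "b"

# the same information is stored in the "levels" attribute
lv_attr = attr(f, "levels")    # "a" "b"
```

Setting levels through levels<- also validates the change, while writing the attribute directly with attr<- can silently corrupt the factor.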
6.4 S3
An object’s class describes its type and is stored in the class attribute. You can get the class of an object with the class function. For example, the class of a data frame is data.frame:

## [1] "data.frame"
Some objects have more than one class. One example of this is matrices:
## [1] "matrix" "array"
When an object has multiple classes, they’re stored in the class
attribute in
order from highest to lowest priority. So the matrix m
will primarily behave
6.4.1 Method Dispatch
split
## function (x, f, drop = FALSE, ...)
## UseMethod("split")
## <bytecode: 0x5625b84f0020>
## <environment: namespace:base>
Another is the plot
function, which creates a plot:
## function (x, y, ...)
## UseMethod("plot")
## <bytecode: 0x5625ba334fa0>
## <environment: namespace:base>
The UseMethod
function requires the name of the generic (as a string) as its
first argument. The second argument is optional and specifies the object to use
methods(split)
## [1] split.data.frame split.Date split.default split.POSIXct
## see '?methods' for accessing help and source code
Method names always have the form GENERIC.CLASS
, where GENERIC
is the name
split.data.frame
## function (x, f, drop = FALSE, ...)
## {
## if (inherits(f, "formula"))
getAnywhere(plot.data.frame)
## A single object matching 'plot.data.frame' was found
## It was found in the following places
## registered S3 method for plot from namespace graphics
@@ -1143,17 +1147,17 @@ 6.4.1 Method Dispatchsplit(mtcars, mtcars$cyl)
+
The split
function is generic and dispatches on its first argument. In this
case, the first argument is mtcars
, which has class data.frame
. Since the
method split.data.frame
exists, R calls split.data.frame
with the same
arguments you used to call the generic split
function. In other words, R
calls:
-
+
When an object has more than one class, method dispatch considers them from
left to right. For instance, matrices created with the matrix
function have
class matrix
and also class array
. If you pass a matrix to a generic
@@ -1164,9 +1168,9 @@
6.4.1 Method Dispatch# install.packages("sloop")
-library("sloop")
-s3_dispatch(split(mtcars, mtcars$cyl))
+
## => split.data.frame
## * split.default
The selected method is indicated with an arrow =>
, while methods that were
@@ -1181,16 +1185,16 @@
6.4.2 Creating Objectsget_age = function(animal) {
- UseMethod("get_age")
-}
+
Next, let’s create a class Human
to represent a human. Since humans are
animals, let’s make each Human
also have class Animal
. You can use any type
of object as the foundation for a class, but lists are often a good choice
because they can store multiple named elements. Here’s how to create a Human
object with a field age_years
to store the age in years:
-
+
Class names can include any characters that are valid in R variable names. One
common convention is to make them start with an uppercase letter, to
distinguish them from variables.
@@ -1198,56 +1202,56 @@ 6.4.2 Creating ObjectsHuman = function(age_years) {
- obj = list(age_years = age_years)
- class(obj) = c("Human", "Animal")
- obj
-}
-
-asriel = Human(45)
+Human = function(age_years) {
+ obj = list(age_years = age_years)
+ class(obj) = c("Human", "Animal")
+ obj
+}
+
+asriel = Human(45)
The get_age
generic doesn’t have any methods yet, so R raises an error if you
call it (regardless of the argument’s class):
-
+
## Error in UseMethod("get_age"): no applicable method for 'get_age' applied to an object of class "c('Human', 'Animal')"
Let’s define a method for Animal
objects. The method will just return the
value of the age_years
field:
-
+
## [1] 13
-
+
## [1] 45
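One way to sketch the Animal method described above, applied to the asriel object created earlier:

```r
# Return the age_years field for any Animal.
get_age.Animal = function(animal) {
  animal$age_years
}

get_age(asriel)
## [1] 45
```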
Notice that the get_age
generic still raises an error for objects that don’t
have class Animal
:
-
+
## Error in UseMethod("get_age"): no applicable method for 'get_age' applied to an object of class "c('double', 'numeric')"
Now let’s create a class Dog
to represent dogs. Like the Human
class, a
Dog
is a kind of Animal
and has an age_years
field. Each Dog
will also
have a breed
field to store the breed of the dog:
-Dog = function(age_years, breed) {
- obj = list(age_years = age_years, breed = breed)
- class(obj) = c("Dog", "Animal")
- obj
-}
-
-pongo = Dog(10, "dalmatian")
+Dog = function(age_years, breed) {
+ obj = list(age_years = age_years, breed = breed)
+ class(obj) = c("Dog", "Animal")
+ obj
+}
+
+pongo = Dog(10, "dalmatian")
Since a Dog
is an Animal
, the get_age
generic returns a result:
-
+
## [1] 10
Recall that the goal of this example was to make get_age
return the age of an
animal in terms of a human lifespan. For a dog, their age in “human years” is
about 5 times their age in actual years. You can implement a get_age
method
for Dog
to take this into account:
-
+
Now the get_age
generic returns an age in terms of a human lifespan whether
its argument is a Human
or a Dog
:
-
+
## [1] 13
-
+
## [1] 50
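A sketch of the Dog method, which scales the age by 5 as described, applied to the pongo object created above:

```r
# A dog's age in "human years" is about 5 times its actual age.
get_age.Dog = function(animal) {
  animal$age_years * 5
}

get_age(pongo)
## [1] 50
```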
You can create new data structures in R by creating classes, and you can add
functionality to new or existing generics by creating new methods. Before
diff --git a/docs/part-2.html b/docs/part-2.html
index 4777dcb..20e9933 100644
--- a/docs/part-2.html
+++ b/docs/part-2.html
@@ -202,17 +202,33 @@
II Writing & Debugging R Code
-3 Best Practices for Writing R Scripts
-4 Squashing Bugs with R’s Debugging Tools
+ 3 Best Practices for Writing R Scripts
+
+- 3.1 Scripting Your Code
+- 3.2 Printing Output
+- 3.3 Reading Input
+- 3.4 Managing Packages
+- 3.5 Iteration Strategies
-- 4.1.1 The
print
Function
-- 4.1.2 The
message
Function
-- 4.1.3 The
cat
Function
-- 4.1.4 Formatting Output
-- 4.1.5 Logging Output
+- 3.5.1 For-loops
+- 3.5.2 While-loops
+- 3.5.3 Saving Multiple Results
+- 3.5.4 Break & Next
+- 3.5.5 Planning for Iteration
+- 3.5.6 Case Study: The Collatz Conjecture
+- 3.5.7 Case Study: U.S. Fruit Prices
+
+4 Squashing Bugs with R’s Debugging Tools
+
+- 4.1 Printing
- 4.2 The Conditions System
- 4.2.1 Raising Conditions
@@ -285,19 +301,7 @@
- 6.5 Other Object Systems
-7 Part 2
-
+7 Part 2
V Backmatter
References
Assessment
@@ -320,588 +324,9 @@
7 Part 2
-This chapter is part 2 (of 2) of Thinking in R, a workshop series about how R
-works and how to examine code critically.
-
-Learning Objectives
-
-- Describe and use R’s for, while, and repeat loops
-- Identify the most appropriate iteration strategy for a given problem
-- Explain strategies to organize iterative code
-
-
-
-7.1 Iteration Strategies
-R is a powerful tool for automating tasks that have repetitive steps. For
-example, you can:
-
-- Apply a transformation to an entire column of data.
-- Compute distances between all pairs from a set of points.
-- Read a large collection of files from disk in order to combine and analyze
-the data they contain.
-- Simulate how a system evolves over time from a specific set of starting
-parameters.
-- Scrape data from many pages of a website.
-
-You can implement concise, efficient solutions for these kinds of tasks in R by
-using iteration, which means repeating a computation many times. R provides
-four different strategies for writing iterative code:
-
-- Vectorization, where a function is implicitly called on each element of a
-vector. See this section of the R Basics reader for more details.
-- Apply functions, where a function is explicitly called on each element of a
-vector or array. See this section of the R Basics reader
-for more details.
-- Loops, where an expression is evaluated repeatedly until some condition is
-met.
-- Recursion, where a function calls itself.
-
-Vectorization is the most efficient and most concise iteration strategy, but
-also the least flexible, because it only works with vectorized functions and
-vectors. Apply functions are more flexible—they work with any function and
-any data structure with elements—but less efficient and less concise. Loops
-and recursion provide the most flexibility but are the least concise. In recent
-versions of R, apply functions and loops are similar in terms of efficiency.
-Recursion tends to be the least efficient iteration strategy in R.
-The rest of this section explains how to write loops and how to choose which
-iteration strategy to use. We assume you’re already comfortable with
-vectorization and have at least some familiarity with apply functions.
-
-7.1.1 For-loops
-A for-loop evaluates an expression once for each element of a vector or
-list. The for
keyword creates a for-loop. The syntax is:
-
-The variable I
is called an induction variable. At the beginning of each
-iteration, I
is assigned the next element of DATA
. The loop iterates once
-for each element, unless a keyword instructs R to exit the loop early (more
-about this in Section 7.1.4). As with if-statements and functions,
-the curly braces { }
are only required if the body contains multiple lines of
-code. Here’s a simple for-loop:
-
-## Hi from iteration 1
-## Hi from iteration 2
-## Hi from iteration 3
-## Hi from iteration 4
-## Hi from iteration 5
-## Hi from iteration 6
-## Hi from iteration 7
-## Hi from iteration 8
-## Hi from iteration 9
-## Hi from iteration 10
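One way to write a two-line loop that produces the output above:

```r
for (i in 1:10)
  message("Hi from iteration ", i)
```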
-When some or all of the iterations in a task depend on results from prior
-iterations, loops tend to be the most appropriate iteration strategy. For
-instance, loops are a good way to implement time-based simulations or compute
-values in recursively defined sequences.
-As a concrete example, suppose you want to compute the result of starting from
-the value 1 and composing the sine function 100 times:
-
-## [1] 0.1688525
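A sketch of this computation, using a loop that repeatedly overwrites a single variable:

```r
result = 1
for (i in 1:100) {
  result = sin(result)
}
result
## [1] 0.1688525
```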
-Unlike other iteration strategies, loops don’t return a result automatically.
-It’s up to you to use variables to store any results you want to use later. If
-you want to save a result from every iteration, you can use a vector or a list
-indexed on the iteration number:
-n = 1 + 100
-result = numeric(n)
-result[1] = 1
-for (i in 2:n) {
- result[i] = sin(result[i - 1])
-}
-
-plot(result)
-
-Section 7.1.3 explains this in more detail.
-If the iterations in a task are not dependent, it’s preferable to use
-vectorization or apply functions instead of a loop. Vectorization is more
-efficient, and apply functions are usually more concise.
-In some cases, you can use vectorization to handle a task even if the
-iterations are dependent. For example, you can use vectorized exponentiation
-and the sum
function to compute the sum of the cubes of many numbers:
-
-## [1] 1001910
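For instance, with a hypothetical input vector, the whole computation is a single vectorized expression:

```r
numbers = c(10, 25, 44, 77)  # hypothetical example values
sum(numbers^3)
```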
-
-
-7.1.2 While-loops
-A while-loop runs a block of code repeatedly as long as some condition is
-TRUE
. The while
keyword creates a while-loop. The syntax is:
-
-The CONDITION
should be a scalar logical value or an expression that returns
-one. At the beginning of each iteration, R checks the CONDITION
and exits the
-loop if it’s FALSE
. As always, the curly braces { }
are only required if
-the body contains multiple lines of code. Here’s a simple while-loop:
-
-## Hello from iteration 1
-## Hello from iteration 2
-## Hello from iteration 3
-## Hello from iteration 4
-## Hello from iteration 5
-## Hello from iteration 6
-## Hello from iteration 7
-## Hello from iteration 8
-## Hello from iteration 9
-## Hello from iteration 10
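One way to write the while-loop just described, using five lines where the for-loop needed two:

```r
i = 1
while (i <= 10) {
  message("Hello from iteration ", i)
  i = i + 1
}
```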
-Notice that this example does the same thing as the simple for-loop in Section
-7.1.1, but requires 5 lines of code instead of 2. While-loops are a
-generalization of for-loops, and only do the bare minimum necessary to iterate.
-They tend to be most useful when you don’t know how many iterations will be
-necessary to complete a task.
-As an example, suppose you want to add up the integers in order until the total
-is greater than 50:
-total = 0
-i = 1
-while (total < 50) {
- total = total + i
- message("i is ", i, " total is ", total)
- i = i + 1
-}
-## i is 1 total is 1
-## i is 2 total is 3
-## i is 3 total is 6
-## i is 4 total is 10
-## i is 5 total is 15
-## i is 6 total is 21
-## i is 7 total is 28
-## i is 8 total is 36
-## i is 9 total is 45
-## i is 10 total is 55
-
-## [1] 55
-
-## [1] 11
-
-
-7.1.3 Saving Multiple Results
-Loops often produce a different result for each iteration. If you want to save
-more than one result, there are a few things you must do.
-First, set up an index vector. The index vector should usually correspond to
-the positions of the elements in the data you want to process. The seq_along
-function returns an index vector when passed a vector or list. For instance:
-
-The loop will iterate over the index rather than the input, so the induction
-variable will track the current iteration number. On the first iteration, the
-induction variable will be 1, on the second it will be 2, and so on. Then you
-can use the induction variable and indexing to get the input for each
-iteration.
-Second, set up an empty output vector or list. This should usually also
-correspond to the input, or one element longer (the extra element comes from
-the initial value). R has several functions for creating vectors:
-
-logical
, integer
, numeric
, complex
, and character
create an empty
-vector with a specific type and length
-vector
creates an empty vector with a specific type and length
-rep
creates a vector by repeating elements of some other vector
-
-Empty vectors are filled with FALSE
, 0
, or ""
, depending on the type of
-the vector. Here are some examples:
-
-## [1] FALSE FALSE FALSE
-
-## [1] 0 0 0 0
-
-## [1] 1 2 1 2
-Let’s create an empty numeric vector the same length as the numbers
vector:
-
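A sketch of this setup, where the values in numbers are placeholders rather than the ones used in the surrounding example:

```r
numbers = c(1, 21, 3, -8)  # hypothetical input values
index = seq_along(numbers)
result = numeric(length(numbers))
```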
-As with the input, you can use the induction variable and indexing to set the
-output for each iteration.
-Creating a vector or list in advance to store something, as we’ve just done, is
-called preallocation. Preallocation is extremely important for efficiency
-in loops. Avoid the temptation to use c
or append
to build up the output
-bit by bit in each iteration.
-Finally, write the loop, making sure to get the input and set the output. As an
-example, this loop adds each element of numbers
to a running total and
-squares the new running total:
-for (i in index) {
- prev = if (i > 1) result[i - 1] else 0
- result[i] = (numbers[i] + prev)^2
-}
-result
-## [1] 1.000000e+00 4.840000e+02 2.371690e+05 5.624534e+10 3.163538e+21
-
-
-7.1.4 Break & Next
-The break
keyword causes a loop to immediately exit. It only makes sense to
-use break
inside of an if-statement.
-For example, suppose you want to print each string in a vector, but stop at the
-first missing value. You can do this with a for-loop and the break
keyword:
-my_messages = c("Hi", "Hello", NA, "Goodbye")
-
-for (msg in my_messages) {
- if (is.na(msg))
- break
-
- message(msg)
-}
-## Hi
-## Hello
-The next
keyword causes a loop to immediately go to the next iteration. As
-with break
, it only makes sense to use next
inside of an if-statement.
-Let’s modify the previous example so that missing values are skipped, but don’t
-cause printing to stop. Here’s the code:
-
-## Hi
-## Hello
-## Goodbye
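A sketch of the skipping version, reusing the my_messages vector defined above:

```r
for (msg in my_messages) {
  if (is.na(msg))
    next

  message(msg)
}
```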
-These keywords work with both for-loops and while-loops.
-
-
-7.1.5 Planning for Iteration
-At first it may seem difficult to decide if and what kind of iteration to use.
-Start by thinking about whether you need to do something over and over. If you
-don’t, then you probably don’t need to use iteration. If you do, then try
-iteration strategies in this order:
-
-- Vectorization
-- Apply functions
-
-- Try an apply function if iterations are independent.
-
-- Loops
-
-- Try a for-loop if some iterations depend on others.
-- Try a while-loop if the number of iterations is unknown.
-
-- Recursion (which isn’t covered here)
-
-- Convenient for naturally recursive problems (like Fibonacci),
-but often there are faster solutions.
-
-
-Start by writing the code for just one iteration. Make sure that code works;
-it’s easy to test code for one iteration.
-When you have one iteration working, then try using the code with an iteration
-strategy (you will have to make some small changes). If it doesn’t work, try to
-figure out which iteration is causing the problem. One way to do this is to use
-message
to print out information. Then try to write the code for the broken
-iteration, get that iteration working, and repeat this whole process.
-
-
-7.1.6 Case Study: The Collatz Conjecture
-The Collatz Conjecture is a conjecture in math that was introduced
-in 1937 by Lothar Collatz and remains unproven today, despite being relatively
-easy to explain. Here’s a statement of the conjecture:
-
-Start from any positive integer. If the integer is even, divide by 2. If the
-integer is odd, multiply by 3 and add 1.
-If the result is not 1, repeat using the result as the new starting value.
-The result will always reach 1 eventually, regardless of the starting value.
-
-The sequences of numbers this process generates are called Collatz
-sequences. For instance, the Collatz sequence starting from 2 is 2, 1
. The
-Collatz sequence starting from 12 is 12, 6, 3, 10, 5, 16, 8, 4, 2, 1
.
-You can use iteration to compute the Collatz sequence for a given starting
-value. Since each number in the sequence depends on the previous one, and since
-the length of the sequence varies, a while-loop is the most appropriate
-iteration strategy:
-n = 5
-i = 0
-while (n != 1) {
- i = i + 1
- if (n %% 2 == 0) {
- n = n / 2
- } else {
- n = 3 * n + 1
- }
- message(n, " ", appendLF = FALSE)
-}
-## 16 8 4 2 1
-As of 2020, scientists have used computers to check the Collatz sequences for
-every number up to approximately \(2^{64}\). For more details about the Collatz
-Conjecture, check out this video.
-
-
-7.1.7 Case Study: U.S. Fruit Prices
-The U.S. Department of Agriculture (USDA) Economic Research Service (ERS)
-publishes data about consumer food prices. For instance, in 2018 they posted a
-dataset that estimates average retail prices for various fruits, vegetables,
-and snack foods. The estimates are formatted as a collection
-of Excel files, one for each type of fruit or vegetable. In this case study,
-you’ll use iteration to get the estimated “fresh” price for all of the fruits
-in the dataset that are sold fresh.
-To get started, download the zipped collection of fruit
-spreadsheets and save it somewhere on your computer. Then unzip the
-file with a zip program or R’s own unzip
function.
-The first sheet of each file contains a table with the name of the fruit and
-prices sorted by how the fruit was prepared. You can see this for yourself if
-you use a spreadsheet program to inspect some of the files.
-In order to read the files into R, first get a vector of their names. You can
-use the list.files
function to list all of the files in a directory. If you
-set full.names = TRUE
, the function will return the absolute path to each
-file:
-
-## [1] "data/fruit/apples_2013.xlsx" "data/fruit/apricots_2013.xlsx"
-## [3] "data/fruit/bananas_2013.xlsx" "data/fruit/berries_mixed_2013.xlsx"
-## [5] "data/fruit/blackberries_2013.xlsx" "data/fruit/blueberries_2013.xlsx"
-## [7] "data/fruit/cantaloupe_2013.xlsx" "data/fruit/cherries_2013.xlsx"
-## [9] "data/fruit/cranberries_2013.xlsx" "data/fruit/dates_2013.xlsx"
-## [11] "data/fruit/figs_2013.xlsx" "data/fruit/fruit_cocktail_2013.xlsx"
-## [13] "data/fruit/grapefruit_2013.xlsx" "data/fruit/grapes_2013.xlsx"
-## [15] "data/fruit/honeydew_2013.xlsx" "data/fruit/kiwi_2013.xlsx"
-## [17] "data/fruit/mangoes_2013.xlsx" "data/fruit/nectarines_2013.xlsx"
-## [19] "data/fruit/oranges_2013.xlsx" "data/fruit/papaya_2013.xlsx"
-## [21] "data/fruit/peaches_2013.xlsx" "data/fruit/pears_2013.xlsx"
-## [23] "data/fruit/pineapple_2013.xlsx" "data/fruit/plums_2013.xlsx"
-## [25] "data/fruit/pomegranate_2013.xlsx" "data/fruit/raspberries_2013.xlsx"
-## [27] "data/fruit/strawberries_2013.xlsx" "data/fruit/tangerines_2013.xlsx"
-## [29] "data/fruit/watermelon_2013.xlsx"
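A call along these lines produces the listing above (the directory path is an assumption based on the output):

```r
paths = list.files("data/fruit", full.names = TRUE)
paths
```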
-The files are in Excel format, which you can read with the read_excel
-function from the readxl package. First try reading one file and extracting
-the fresh price:
-
-## New names:
-## • `` -> `...2`
-## • `` -> `...3`
-## • `` -> `...4`
-## • `` -> `...5`
-## • `` -> `...6`
-## • `` -> `...7`
-The name of the fruit is the first word in the first column’s name. The fresh
-price appears in the row where the word in column 1 starts with "Fresh"
. You
-can use str_which
from the stringr package (Section
-1.4.1) to find and extract this row:
-
-## # A tibble: 1 × 7
-## Apples—Average retail price per pound or…¹ ...2 ...3 ...4 ...5 ...6 ...7
-## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
-## 1 Fresh1 1.56… per … 0.9 0.24… poun… 0.42…
-## # ℹ abbreviated name:
-## # ¹`Apples—Average retail price per pound or pint and per cup equivalent, 2013`
-The price and unit appear in column 2 and column 3.
-Now generalize these steps by making a read_fresh_price
function. The
-function should accept a path as input and return a vector that contains the
-fruit name, fresh price, and unit. Don’t worry about cleaning up the fruit name
-at this point—you can do that with a vectorized operation after combining the
-data from all of the files. A few fruits don’t have a fresh price, and the
-function should return NA
for the price and unit for those. Here’s one way to
-implement the read_fresh_price
function:
-read_fresh_price = function(path) {
- prices = read_excel(path)
-
- # Get fruit name.
- fruit = names(prices)[[1]]
-
- # Find fresh price.
- idx = str_which(prices[[1]], "^Fresh")
- if (length(idx) > 0) {
- prices = prices[idx, ]
- c(fruit, prices[[2]], prices[[3]])
- } else {
- c(fruit, NA, NA)
- }
-}
-Test that the function returns the correct result for a few of the files:
-
-## New names:
-## • `` -> `...2`
-## • `` -> `...3`
-## • `` -> `...4`
-## • `` -> `...5`
-## • `` -> `...6`
-## • `` -> `...7`
-## [1] "Apples—Average retail price per pound or pint and per cup equivalent, 2013"
-## [2] "1.5675153914496354"
-## [3] "per pound"
-
-## New names:
-## • `` -> `...2`
-## • `` -> `...3`
-## • `` -> `...4`
-## • `` -> `...5`
-## • `` -> `...6`
-## • `` -> `...7`
-## [1] "Mixed berries—Average retail price per pound and per cup equivalent, 2013"
-## [2] NA
-## [3] NA
-
-## New names:
-## • `` -> `...2`
-## • `` -> `...3`
-## • `` -> `...4`
-## • `` -> `...5`
-## • `` -> `...6`
-## • `` -> `...7`
-## [1] "Cherries—Average retail price per pound and per cup equivalent, 2013"
-## [2] "3.5929897554945156"
-## [3] "per pound"
-Now that the function is working, it’s time to choose an iteration strategy.
-The read_fresh_price
function is not vectorized, so that strategy isn’t
-possible. Reading one file doesn’t depend on reading any of the others, so
-apply functions are the best strategy here. The read_fresh_price
function
-always returns a character vector with 3 elements, so you can use sapply
to
-process all of the files and get a matrix of results:
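A sketch of the sapply step, assuming paths holds the vector of file paths from list.files:

```r
all_prices = sapply(paths, read_fresh_price)
```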
-
-## New names:
-## New names:
-## New names:
-## New names:
-## New names:
-## New names:
-## New names:
-## New names:
-## New names:
-## New names:
-## New names:
-## New names:
-## New names:
-## New names:
-## New names:
-## New names:
-## New names:
-## New names:
-## New names:
-## New names:
-## New names:
-## New names:
-## New names:
-## New names:
-## New names:
-## New names:
-## New names:
-## New names:
-## New names:
-## • `` -> `...2`
-## • `` -> `...3`
-## • `` -> `...4`
-## • `` -> `...5`
-## • `` -> `...6`
-## • `` -> `...7`
-# Transpose, convert to a data frame, and set names for easy reading.
-all_prices = t(all_prices)
-all_prices = data.frame(all_prices)
-rownames(all_prices) = NULL
-colnames(all_prices) = c("fruit", "price", "unit")
-all_prices
-## fruit
-## 1 Apples—Average retail price per pound or pint and per cup equivalent, 2013
-## 2 Apricots—Average retail price per pound and per cup equivalent, 2013
-## 3 Bananas—Average retail price per pound and per cup equivalent, 2013
-## 4 Mixed berries—Average retail price per pound and per cup equivalent, 2013
-## 5 Blackberries—Average retail price per pound and per cup equivalent, 2013
-## 6 Blueberries—Average retail price per pound and per cup equivalent, 2013
-## 7 Cantaloupe—Average retail price per pound and per cup equivalent, 2013
-## 8 Cherries—Average retail price per pound and per cup equivalent, 2013
-## 9 Cranberries—Average retail price per pound and per cup equivalent, 2013
-## 10 Dates—Average retail price per pound and per cup equivalent, 2013
-## 11 Figs—Average retail price per pound and per cup equivalent, 2013
-## 12 Fruit cocktail—Average retail price per pound and per cup equivalent, 2013
-## 13 Grapefruit—Average retail price per pound or pint and per cup equivalent, 2013
-## 14 Grapes—Average retail price per pound or pint and per cup equivalent, 2013
-## 15 Honeydew melon—Average retail price per pound and per cup equivalent, 2013
-## 16 Kiwi—Average retail price per pound and per cup equivalent, 2013
-## 17 Mangoes—Average retail price per pound and per cup equivalent, 2013
-## 18 Nectarines—Average retail price per pound and per cup equivalent, 2013
-## 19 Oranges—Average retail price per pound or pint and per cup equivalent, 2013
-## 20 Papaya—Average retail price per pound and per cup equivalent, 2013
-## 21 Peaches—Average retail price per pound and per cup equivalent, 2013
-## 22 Pears—Average retail price per pound and per cup equivalent, 2013
-## 23 Pineapple—Average retail price per pound or pint and per cup equivalent, 2013
-## 24 Plums—Average retail price per pound or pint and per cup equivalent, 2013
-## 25 Pomegranate—Average retail price per pound or pint and per cup equivalent, 2013
-## 26 Raspberries—Average retail price per pound and per cup equivalent, 2013
-## 27 Strawberries—Average retail price per pound and per cup equivalent, 2013
-## 28 Tangerines—Average retail price per pound or pint and per cup equivalent, 2013
-## 29 Watermelon—Average retail price per pound and per cup equivalent, 2013
-## price unit
-## 1 1.5675153914496354 per pound
-## 2 3.0400719670964378 per pound
-## 3 0.56698341453144807 per pound
-## 4 <NA> <NA>
-## 5 5.7747082503535152 per pound
-## 6 4.7346216897250253 per pound
-## 7 0.53587377610644515 per pound
-## 8 3.5929897554945156 per pound
-## 9 <NA> <NA>
-## 10 <NA> <NA>
-## 11 <NA> <NA>
-## 12 <NA> <NA>
-## 13 0.89780204117954143 per pound
-## 14 2.0938274120049827 per pound
-## 15 0.79665620543008364 per pound
-## 16 2.0446834079658482 per pound
-## 17 1.3775634470319702 per pound
-## 18 1.7611484827950696 per pound
-## 19 1.0351727302444853 per pound
-## 20 1.2980115892049107 per pound
-## 21 1.5911868532458617 per pound
-## 22 1.4615746043999458 per pound
-## 23 0.62766194593569868 per pound
-## 24 1.8274160078099031 per pound
-## 25 2.1735904118559191 per pound
-## 26 6.9758107988552958 per pound
-## 27 2.3588084831103004 per pound
-## 28 1.3779618772323634 per pound
-## 29 0.33341203532340097 per pound
-Finally, the last step is to remove the extra text from the fruit name. One way
-to do this is with the str_split_fixed
function from the stringr package.
-There’s an em dash —
after each fruit name, which you can use for the split:
-
-## fruit price unit
-## 1 Apples 1.5675153914496354 per pound
-## 2 Apricots 3.0400719670964378 per pound
-## 3 Bananas 0.56698341453144807 per pound
-## 4 Mixed berries <NA> <NA>
-## 5 Blackberries 5.7747082503535152 per pound
-## 6 Blueberries 4.7346216897250253 per pound
-## 7 Cantaloupe 0.53587377610644515 per pound
-## 8 Cherries 3.5929897554945156 per pound
-## 9 Cranberries <NA> <NA>
-## 10 Dates <NA> <NA>
-## 11 Figs <NA> <NA>
-## 12 Fruit cocktail <NA> <NA>
-## 13 Grapefruit 0.89780204117954143 per pound
-## 14 Grapes 2.0938274120049827 per pound
-## 15 Honeydew melon 0.79665620543008364 per pound
-## 16 Kiwi 2.0446834079658482 per pound
-## 17 Mangoes 1.3775634470319702 per pound
-## 18 Nectarines 1.7611484827950696 per pound
-## 19 Oranges 1.0351727302444853 per pound
-## 20 Papaya 1.2980115892049107 per pound
-## 21 Peaches 1.5911868532458617 per pound
-## 22 Pears 1.4615746043999458 per pound
-## 23 Pineapple 0.62766194593569868 per pound
-## 24 Plums 1.8274160078099031 per pound
-## 25 Pomegranate 2.1735904118559191 per pound
-## 26 Raspberries 6.9758107988552958 per pound
-## 27 Strawberries 2.3588084831103004 per pound
-## 28 Tangerines 1.3779618772323634 per pound
-## 29 Watermelon 0.33341203532340097 per pound
-Now the data are ready for analysis. You could extend the reader function to
-extract more of the data (e.g., dried and frozen prices), but the overall
-process is fundamentally the same. Write the code to handle one file (one
-step), generalize it to work on several, and then iterate.
-For another example, see Liza Wood’s Real-world Function Writing
-Mini-reader.
+This chapter will eventually contain part 2 (of 2) of Thinking in R, a
+workshop series about how R works and how to examine code critically.
-
-
diff --git a/docs/reference-keys.txt b/docs/reference-keys.txt
index bfb6a02..3885c25 100644
--- a/docs/reference-keys.txt
+++ b/docs/reference-keys.txt
@@ -51,7 +51,7 @@ ambiguous-columns
missing-values
conclusion
best-practices-for-writing-r-scripts
-squashing-bugs-with-rs-debugging-tools
+scripting-your-code
printing-output
the-print-function
the-message-function
@@ -60,6 +60,18 @@ formatting-output
escape-sequences-2
formatting-functions
logging-output
+reading-input
+managing-packages
+iteration-strategies
+for-loops
+while-loops
+saving-multiple-results
+break-next
+planning-for-iteration
+case-study-the-collatz-conjecture
+case-study-u.s.-fruit-prices
+squashing-bugs-with-rs-debugging-tools
+printing-1
the-conditions-system
raising-conditions
handling-conditions
@@ -107,11 +119,3 @@ method-dispatch
creating-objects
other-object-systems
part-2
-iteration-strategies
-for-loops
-while-loops
-saving-multiple-results
-break-next
-planning-for-iteration
-case-study-the-collatz-conjecture
-case-study-u.s.-fruit-prices
diff --git a/docs/references.html b/docs/references.html
index f4e967e..f0071fa 100644
--- a/docs/references.html
+++ b/docs/references.html
@@ -202,17 +202,33 @@
II Writing & Debugging R Code
-3 Best Practices for Writing R Scripts
-4 Squashing Bugs with R’s Debugging Tools
+ 3 Best Practices for Writing R Scripts
+
+- 3.1 Scripting Your Code
+- 3.2 Printing Output
+- 3.3 Reading Input
+- 3.4 Managing Packages
+- 3.5 Iteration Strategies
-- 4.1.1 The
print
Function
-- 4.1.2 The
message
Function
-- 4.1.3 The
cat
Function
-- 4.1.4 Formatting Output
-- 4.1.5 Logging Output
+- 3.5.1 For-loops
+- 3.5.2 While-loops
+- 3.5.3 Saving Multiple Results
+- 3.5.4 Break & Next
+- 3.5.5 Planning for Iteration
+- 3.5.6 Case Study: The Collatz Conjecture
+- 3.5.7 Case Study: U.S. Fruit Prices
+
+4 Squashing Bugs with R’s Debugging Tools
+
+- 4.1 Printing
- 4.2 The Conditions System
- 4.2.1 Raising Conditions
@@ -285,19 +301,7 @@
- 6.5 Other Object Systems
-7 Part 2
-
+7 Part 2
V Backmatter
References
Assessment
diff --git a/docs/search_index.json b/docs/search_index.json
index 10a2377..2e0e851 100644
--- a/docs/search_index.json
+++ b/docs/search_index.json
@@ -1 +1 @@
-[["index.html", "Intermediate R Overview", " Intermediate R Nick Ulle and Wesley Brooks 2024-01-14 Overview This is the reader for all of UC Davis DataLab’s Intermediate R workshop series. There are currently three: Thinking in R, which is about understanding how R works, how to diagnose and fix bugs in code, and how to estimate and measure performance characteristics of code. Cleaning Data & Automating Tasks, which is about how to clean and prepare messy data such as dates, times, and text for analysis, and how to use loops or other forms of iteration to automate repetitive tasks. Data Visualization and Analysis in R, which is about plotting data and models in R. Each series is independent and consists of 2 sessions (equivalently, 2 chapters in this reader). After completing both series, students will have a better understanding of language features, packages, and programming strategies, which will enable them to write more efficient code, be more productive when writing code, and debug code more effectively. These series are not an introduction to R. Participants are expected to have prior experience using R, be comfortable with basic R syntax, and to have it pre-installed and running on their laptops. They are appropriate for motivated intermediate to advanced users who want a better understanding of base R. "],["string-date-processing.html", "1 String & Date Processing 1.1 The Tidyverse 1.2 Parsing Dates & Times 1.3 String Fundamentals 1.4 Processing Strings 1.5 Regular Expression Examples", " 1 String & Date Processing This chapter is part 1 (of 2) of Cleaning & Reshaping Data, a workshop series about how to prepare data for analysis. The major topics of this chapter are how to convert dates and times into appropriate R data types and how to extract and clean up data from strings (including numbers with non-numeric characters such as $, %, and ,). 
Learning Objectives Explain why we use special data structures for dates & times Identify the correct data structure for a given date/time Use lubridate to parse a date Use the date format string mini-language Use escape codes in strings to represent non-keyboard characters Explain what a text encoding is Use the stringr package to detect, extract, and change patterns in strings Use the regular expressions mini-language 1.1 The Tidyverse For working with dates, times, and strings, we recommend using packages from the Tidyverse, a popular collection of packages for doing data science. Compared to R’s built-in functions, we’ve found that the functions in Tidyverse packages are generally easier to learn and use. They also provide additional features and have more robust support for characters outside of the Latin alphabet. Although they’re developed by many different members of the R community, Tidyverse packages follow a unified design philosophy, and thus have many interfaces and data structures in common. The packages provide convenient and efficient alternatives to built-in R functions for many tasks, including: Reading and writing files (package readr) Processing dates and times (packages lubridate, hms) Processing strings (package stringr) Reshaping data (package tidyr) Making visualizations (package ggplot2) And more Think of the Tidyverse as a different dialect of R. Sometimes the syntax is different, and sometimes ideas are easier or harder to express concisely. As a consequence, the Tidyverse is sometimes polarizing in the R community. It’s useful to be literate in both base R and the Tidyverse, since both are popular. One major advantage of the Tidyverse is that the packages are usually well-documented and provide lots of examples. Every package has a documentation website and the most popular ones also have cheatsheets. 
1.2 Parsing Dates & Times When working with dates and times, you might want to: Use them to sort other data Add or subtract an offset Get components like the month, day, or hour Compute derived components like the day of week or quarter Compute differences Even though this list isn’t exhaustive, it shows that there are lots of things you might want to do. In order to do them in R, you must first make sure that your dates and times are represented by appropriate data types. Most of R’s built-in functions for loading data do not automatically recognize dates and times. This section describes several data types that represent dates and times, and explains how to use R to parse—break down and convert—dates and times to these types. 1.2.1 The lubridate Package As explained in Section 1.1, we recommend the Tidyverse packages for working with dates and times over other packages or R’s built-in functions. There are two: lubridate, the primary package for working with dates and times hms, a package specifically for working with time durations This chapter only covers lubridate, since it’s more useful in most situations. The package has detailed documentation and a cheatsheet. You’ll have to install the package if you haven’t already, and then load it: # install.packages("lubridate") library("lubridate") ## ## Attaching package: 'lubridate' ## The following objects are masked from 'package:base': ## ## date, intersect, setdiff, union Perhaps the most common task you’ll need to do with date and time data is convert from strings to more appropriate data types. This is because R’s built-in functions for reading data from a text format, such as read.csv, read dates and times as strings. For example, here are some dates as strings: date_strings = c("Jan 10, 2021", "Sep 3, 2018", "Feb 28, 1982") date_strings ## [1] "Jan 10, 2021" "Sep 3, 2018" "Feb 28, 1982" You can tell that these are dates, but as far as R is concerned, they’re text. 
The lubridate package provides a variety of functions to automatically parse strings into date or time objects that R understands. These functions are named with one letter per component of the date or time. The order of the letters must match the order of the components in the string you want to parse. In the example, the strings have the month (m), then the day (d), and then the year (y), so you can use the mdy function to parse them automatically: dates = mdy(date_strings) dates ## [1] "2021-01-10" "2018-09-03" "1982-02-28" class(dates) ## [1] "Date" Notice that the dates now have class Date, one of R’s built-in classes for representing dates, and that R prints them differently. Now R recognizes that the dates are in fact dates, so they’re ready to use in an analysis. There is a complete list of the automatic parsing functions in the lubridate documentation. Note: a relatively new package, clock, tries to solve some problems with the Date class people have identified over the years. The package is in the r-lib collection of packages, which provide low-level functionality complementary to the Tidyverse. Eventually, it may be preferable to use the classes in clock rather than the Date class, but for now, the Date class is still suitable for most tasks. Occasionally, a date or time string may have a format that lubridate can’t parse automatically. In that case, you can use the fast_strptime function to describe the format in detail. At a minimum, the function requires two arguments: a vector of strings to parse and a format string. The format string describes the format of the dates or times, and is based on the syntax of strptime, a function provided by many programming languages (including R) to parse date or time strings. In a format string, a percent sign % followed by a character is called a specification and has a special meaning. 
Here are a few of the most useful ones: Specification Description 2015-01-29 21:32:55 %Y 4-digit year 2015 %m 2-digit month 01 %d 2-digit day 29 %H 2-digit hour 21 %M 2-digit minute 32 %S 2-digit second 55 %% literal % % %y 2-digit year 15 %B full month name January %b short month name Jan You can find a complete list in ?fast_strptime. Other characters in the format string do not have any special meaning. Write the format string so that it matches the format of the dates you want to parse. For example, let’s try parsing an unusual time format: time_string = "6 minutes, 32 seconds after 10 o'clock" time = fast_strptime(time_string, "%M minutes, %S seconds after %H o'clock") time ## [1] "0-01-01 10:06:32 UTC" class(time) ## [1] "POSIXlt" "POSIXt" R represents date-times with the classes POSIXlt and POSIXct. There’s no built-in class to represent times alone, which is why the result in the example above includes a date. Internally, a POSIXlt object is a list with elements to store different date and time components. On the other hand, a POSIXct object is a single floating point number (type double). If you want to store your time data in a data frame, use POSIXct objects, since data frames don’t work well with columns of lists. You can control whether fast_strptime returns a POSIXlt or POSIXct object by setting the lt parameter to TRUE or FALSE: time_ct = fast_strptime(time_string, "%M minutes, %S seconds after %H o'clock", lt = FALSE) class(time_ct) ## [1] "POSIXct" "POSIXt" Another common task is combining the numeric components of a date or time into a single object. You can use the make_date and make_datetime functions to do this. The parameters are named for the different components. For example: make_date(day = 10, year = 2023, month = 1) ## [1] "2023-01-10" These functions are vectorized, so you can use them to combine the components of many dates or times at once. 
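Because these functions are vectorized, you can pass whole vectors of components at once. A quick sketch, reusing the three dates parsed earlier in this section:

```r
library("lubridate")

# Component vectors for the three dates parsed earlier
years  = c(2021, 2018, 1982)
months = c(1, 9, 2)
days   = c(10, 3, 28)

make_date(year = years, month = months, day = days)
## [1] "2021-01-10" "2018-09-03" "1982-02-28"
```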
They’re especially useful for reconstructing dates and times from tabular datasets where each component is stored in a separate column. After you’ve converted your date and time data to appropriate types, you can do any of the operations listed at the beginning of this section. For example, you can use lubridate’s period function to create an offset to add to a date or time: dates ## [1] "2021-01-10" "2018-09-03" "1982-02-28" dates + period(1, "month") ## [1] "2021-02-10" "2018-10-03" "1982-03-28" You can also use lubridate functions to get or set the components. These functions usually have the same name as the component. For instance: day(dates) ## [1] 10 3 28 month(dates) ## [1] 1 9 2 See the lubridate documentation for even more details about what you can do. 1.2.2 Case Study: Ocean Temperatures The U.S. National Oceanic and Atmospheric Administration (NOAA) publishes ocean temperature data collected by sensor buoys off the coast on the National Data Buoy Center (NDBC) website. California also has many sensors collecting ocean temperature data that are not administered by the federal government. Data from these is published on the California Ocean Observing Systems (CALOOS) Data Portal. Suppose you’re a researcher who wants to combine ocean temperature data from both sources to use in R. Both publish the data in comma-separated value (CSV) format, but record dates, times, and temperatures differently. Thus you need to be careful that the dates and times are parsed correctly. Download these two 2021 datasets: 2021_noaa-ndbc_46013.txt, from NOAA buoy 46013, off the coast of Bodega Bay (DOWNLOAD)(source) 2021_ucdavis_bml_wts.csv, from the UC Davis Bodega Bay Marine Laboratory’s sensors (DOWNLOAD)(source) The NOAA data has a fixed-width format, which means each column has a fixed width in characters over all rows. The readr package provides a function read_fwf that can automatically guess the column widths and read the data into a data frame. 
The column names appear in the first row and column units appear in the second row, so read those rows separately: # install.packages("readr") library("readr") noaa_path = "data/ocean_data/2021_noaa-ndbc_46013.txt" noaa_headers = read_fwf(noaa_path, n_max = 2, guess_max = 1) ## Rows: 2 Columns: 18 ## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## ## chr (18): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, ... ## ## ℹ Use `spec()` to retrieve the full column specification for this data. ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. noaa = read_fwf(noaa_path, skip = 2) ## Rows: 3323 Columns: 18 ## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## ## chr (4): X2, X3, X4, X5 ## dbl (14): X1, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18 ## ## ℹ Use `spec()` to retrieve the full column specification for this data. ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. names(noaa) = as.character(noaa_headers[1, ]) names(noaa)[1] = "YY" The dates and times for the observations are separated into component columns, and the read_fwf function does not convert some of these to numbers automatically. 
You can use as.numeric to convert them to numbers: cols = 2:5 noaa[cols] = lapply(noaa[cols], as.numeric) Finally, use the make_datetime function to combine the components into date-time objects: noaa_dt = make_datetime(year = noaa$YY, month = noaa$MM, day = noaa$DD, hour = noaa$hh, min = noaa$mm) noaa$date = noaa_dt head(noaa_dt) ## [1] "2021-01-01 00:00:00 UTC" "2021-01-01 00:10:00 UTC" ## [3] "2021-01-01 00:20:00 UTC" "2021-01-01 00:30:00 UTC" ## [5] "2021-01-01 00:40:00 UTC" "2021-01-01 00:50:00 UTC" That takes care of the dates in the NOAA data. The Bodega Marine Lab data is in CSV format, which you can read with read.csv or the readr package’s read_csv function. The latter is faster and usually better at guessing column types. The column names appear in the first row and the column units appear in the second row. The read_csv function handles the names automatically, but you’ll have to remove the unit row as a separate step: bml = read_csv("data/ocean_data/2021_ucdavis_bml_wts.csv") ## Rows: 87283 Columns: 4 ## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## Delimiter: "," ## chr (3): time, sea_water_temperature, z ## dbl (1): sea_water_temperature_qc_agg ## ## ℹ Use `spec()` to retrieve the full column specification for this data. ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. bml = bml[-1, ] The dates and times of the observations were loaded as strings. You can use lubridate’s ymd_hms function to automatically parse them: bml_dt = ymd_hms(bml$time) bml$date = bml_dt head(bml_dt) ## [1] "2020-12-31 09:06:00 UTC" "2020-12-31 09:12:00 UTC" ## [3] "2020-12-31 09:18:00 UTC" "2020-12-31 09:24:00 UTC" ## [5] "2020-12-31 09:30:00 UTC" "2020-12-31 09:36:00 UTC" Now you have date and time objects for both datasets, so you can combine the two. 
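One way to combine them is sketched below. This is an assumption-laden sketch, not part of the case study: it assumes the NOAA water temperature column is named WTMP (the standard NDBC column name; check names(noaa) to confirm), and it converts the BML temperature column with as.numeric because read_csv loaded it as a string.

```r
# Hypothetical sketch: extract date and temperature from each dataset,
# tag each row with its source, and row-bind them together.
# WTMP is an assumed column name; verify with names(noaa).
noaa_temps = data.frame(date = noaa$date,
                        temp = as.numeric(noaa$WTMP),
                        source = "NOAA")
bml_temps = data.frame(date = bml$date,
                       temp = as.numeric(bml$sea_water_temperature),
                       source = "BML")
ocean = rbind(noaa_temps, bml_temps)
```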
For example, you could extract the date and water temperature columns from each, create a new column identifying the data source, and then row-bind the datasets together. 1.3 String Fundamentals Strings represent text, but even if your datasets are composed entirely of numbers, you’ll need to know how to work with strings. Text formats for data are widespread: comma-separated values (CSV), tab-separated values (TSV), JavaScript object notation (JSON), a panoply of markup languages (HTML, XML, YAML, TOML), and more. When you read data in these formats into R, sometimes R will correctly convert the values to appropriate non-string types. The rest of the time, you need to know how to work with strings so that you can fix whatever went wrong and convert the data yourself. This section introduces several fundamental concepts related to working with strings. The next section, Section 1.4.1, describes the stringr package for working with strings. The last section, Section 1.4.2, builds on both and explains how to do powerful pattern matching. 1.3.1 Printing There are two different ways to print strings: you can print a representation of the characters in the string or you can print the actual characters in the string. To print a representation of the characters in a string, use the print function. The representation is useful to identify characters that are not normally visible, such as tabs and the characters that mark the end of a line. To print the actual characters in a string, use the message function. This important difference in how the print and message functions print strings is demonstrated in the next section. You can learn more about different ways to print output in R by reading Section 4.1. 1.3.2 Escape Sequences In a string, an escape sequence or escape code consists of a backslash followed by one or more characters. 
Escape sequences make it possible to: Write quotes or backslashes within a string Write characters that don’t appear on your keyboard (for example, characters in a foreign language) For example, the escape sequence \\n corresponds to the newline character. Notice that the message function translates \\n into a literal new line, whereas the print function doesn’t: x = "Hello\\nNick" message(x) ## Hello ## Nick print(x) ## [1] "Hello\\nNick" As another example, suppose we want to put a literal quote in a string. We can either enclose the string in the other kind of quotes, or escape the quotes in the string: x = 'She said, "Hi"' message(x) ## She said, "Hi" y = "She said, \\"Hi\\"" message(y) ## She said, "Hi" Since escape sequences begin with backslash, we also need to use an escape sequence to write a literal backslash. The escape sequence for a literal backslash is two backslashes: x = "\\\\" message(x) ## \\ There’s a complete list of escape sequences for R in the ?Quotes help file. Other programming languages also use escape sequences, and many of them are the same as in R. 1.3.3 Raw Strings A raw string is a string where escape sequences are turned off. Raw strings are especially useful for writing regular expressions (covered in Section 1.4.2). Raw strings begin with r\" and an opening delimiter (, [, or {. Raw strings end with a matching closing delimiter and quote. For example: x = r"(quotes " and backslashes \\)" message(x) ## quotes " and backslashes \\ Raw strings were added to R in version 4.0 (April 2020), and won’t work correctly in older versions. 1.3.4 Character Encodings Computers store data as numbers. In order to store text on a computer, people have to agree on a character encoding, a system for mapping characters to numbers. For example, in ASCII, one of the most popular encodings in the United States, the character a maps to the number 97. 
Many different character encodings exist, and sharing text used to be an inconvenient process of asking or trying to guess the correct encoding. This was so inconvenient that in the 1980s, software engineers around the world united to create the Unicode standard. Unicode includes symbols for nearly all languages in use today, as well as emoji and many ancient languages (such as Egyptian hieroglyphs). Unicode maps characters to numbers, but unlike a character encoding, it doesn’t dictate how those numbers should be mapped to bytes (sequences of ones and zeroes). As a result, there are several different character encodings that implement the Unicode standard. The most popular of these is UTF-8. In R, you can write Unicode characters with the escape sequence \\U followed by the number for the character in base 16. For instance, the number for a in Unicode is 97 (the same as in ASCII). In base 16, 97 is 61. So you can write an a as: x = "\\U61" # or "\\u61" x ## [1] "a" Unicode escape sequences are usually only used for characters that are not easy to type. For example, the cat emoji is number 1f408 (in base 16) in Unicode. So the string \"\\U1f408\" is the cat emoji. Note that being able to see printed Unicode characters also depends on whether the font your computer is using has a glyph (image representation) for that character. Many fonts are limited to a small number of languages. The Nerd Fonts project patches fonts commonly used for programming so that they have better Unicode coverage. Using a font with good Unicode coverage is not essential, but it’s convenient if you expect to work with many different natural languages or love using emoji. 1.3.4.1 Character Encodings in Text Files Most of the time, R will handle character encodings for you automatically. However, if you ever read or write a text file (including CSV and other formats) and the text looks like gibberish, it might be an encoding problem. 
This is especially true on Windows, the only modern operating system that does not (yet) use UTF-8 as the default encoding. Encoding problems when reading a file can usually be fixed by passing the encoding to the function doing the reading. For instance, the code to read a UTF-8 encoded CSV file on Windows is: read.csv("my_data.csv", fileEncoding = "UTF-8") Other reader functions may use a different parameter to set the encoding, so always check the documentation. On computers where the native language is not set to English, it can also help to set R’s native language to English with Sys.setlocale(locale = \"English\"). Encoding problems when writing a file are slightly more complicated to fix. See this blog post for a thorough explanation. 1.4 Processing Strings String processing encompasses a variety of tasks such as searching for patterns within strings, extracting data from within strings, splitting strings into component parts, and removing or replacing unwanted characters (excess whitespace, punctuation, and so on). If you work with data, sooner or later you’ll run into a dataset in text format that needs a few text corrections before or after you read it into R, and for that you’ll find familiarity with string processing invaluable. 1.4.1 The stringr Package Although R has built-in functions for string processing, we recommend using the stringr package for all of your string processing needs. The package is part of the Tidyverse, a collection of packages introduced in Section 1.1. 
Major advantages of stringr over other packages and R’s built-in functions include: Correctness: the package builds on International Components for Unicode (ICU), the Unicode Consortium’s own library for handling text encodings Discoverability: every function’s name begins with str_ so they’re easy to discover, remember, and identify in code Interface consistency: the first argument is always the string to process, the second argument is always the pattern to match (if applicable) Vectorization: most of the functions are vectorized in the first and second argument stringr has detailed documentation and also a cheatsheet. The first time you use stringr, you’ll have to install it with install.packages (the same as any other package). Then you can load the package with the library function: # install.packages("stringr") library(stringr) The typical syntax of a stringr function is: str_name(string, pattern, ...) Where: name describes what the function does string is a string to search within or transform pattern is a pattern to search for, if applicable ... is additional, function-specific arguments For example, the str_detect function detects whether a pattern appears within a string. The function returns TRUE if the pattern is found and FALSE if it isn’t: str_detect("hello", "el") ## [1] TRUE str_detect("hello", "ol") ## [1] FALSE Most of the stringr functions are vectorized in the string parameter: str_detect(c("hello", "goodbye", "lo"), "lo") ## [1] TRUE FALSE TRUE As another example, the str_sub function extracts a substring from a string, given the substring’s position. 
The first argument is the string, the second is the position of the substring’s first character, and the third is the position of the substring’s last character: str_sub("You speak of destiny as if it was fixed.", 5, 9) ## [1] "speak" The str_sub function is especially useful for extracting data from strings that have a fixed width (although the readr package’s read_fwf is usually a better choice if you have a fixed-width file). There are a lot of stringr functions. Five that are especially important and are explained in this reader are: str_detect, to test whether a string contains a pattern str_sub, to extract a substring at a given position from a string str_replace, to replace or remove parts of a string str_split_fixed, to split a string into parts str_match, to extract data from a string You can find a complete list of functions with examples on the stringr documentation’s reference page and the cheatsheet. 1.4.2 Regular Expressions The stringr functions use a special language called regular expressions or regex to describe patterns in strings. Many other programming languages also have string processing tools that use regular expressions, so fluency with regular expressions is a transferable skill. You can use a regular expression to describe a complicated pattern in just a few characters because some characters, called metacharacters, have special meanings. Metacharacters are usually punctuation characters. They are never letters or numbers, which always have their literal meaning. This table lists some of the most useful metacharacters: Metacharacter Meaning . any one character (wildcard) \\ escape character (in both R and regex), see Section 1.3.2 ^ the beginning of string (not a character) $ the end of string (not a character) [ab] one character, either 'a' or 'b' [^ab] one character, anything except 'a' or 'b' ? 
the previous character appears 0 or 1 times * the previous character appears 0 or more times + the previous character appears 1 or more times () make a group | match left OR right side (not a character) Section 1.5 provides examples of how most of the metacharacters work. Even more examples are presented in the stringr package’s regular expressions vignette. You can find a complete listing of regex metacharacters in ?regex or on the stringr cheatsheet. You can disable regular expressions in a stringr function by calling the fixed function on the pattern. For example, to test whether a string contains a literal dot .: x = c("No dot", "Lotsa dots...") str_detect(x, fixed(".")) ## [1] FALSE TRUE It’s a good idea to call fixed on any pattern that doesn’t contain regex metacharacters, because it communicates to the reader that you’re not using regex, it helps to prevent bugs, and it provides a small speed boost. 1.4.3 Replacing Parts of Strings Replacing part of a string is a common string processing task. For instance, quantitative data often contain non-numeric characters such as commas, currency symbols, and percent signs. These must be removed before converting to numeric data types. Replacement and removal go hand-in-hand, since removal is equivalent to replacing part of a string with the empty string \"\". The str_replace function replaces the first part of a string that matches a pattern (from left to right), while the related str_replace_all function replaces every part of a string that matches a pattern. Most stringr functions that do pattern matching come in a pair like this: one to process only the first match and one to process every match. As an example, suppose you want to remove commas from a number so that you can convert it with as.numeric, which returns NA for numbers that contain commas. You want to remove all of the commas, so str_replace_all is the function to use. As usual, the first argument is the string and the second is the pattern. 
The third argument is the replacement, which is the empty string \"\" in this case: x = "1,000,000" str_replace_all(x, ",", "") ## [1] "1000000" The str_replace function doesn’t work as well for this task, since it only replaces the first match to the pattern: str_replace(x, ",", "") ## [1] "1000,000" You can also use these functions to replace or remove longer patterns within words. For instance, suppose you want to change the word \"dog\" to \"cat\": x = c("dogs are great, dogs are fun", "dogs are fluffy") str_replace(x, "dog", "cat") ## [1] "cats are great, dogs are fun" "cats are fluffy" str_replace_all(x, "dog", "cat") ## [1] "cats are great, cats are fun" "cats are fluffy" As a final example, you can use the replacement functions and a regex pattern to replace repeated spaces with a single space. This is a good standardization step if you’re working with text. The key is to use the regex quantifier +, which means a character “repeats one or more times” in the pattern, and to use a single space \" \" as the replacement: x = "This sentence has extra space." str_replace_all(x, " +", " ") ## [1] "This sentence has extra space." If you just want to trim (remove) all whitespace from the beginning and end of a string, you can use the str_trim function instead. 1.4.4 Splitting Strings Distinct data in a text are generally separated by a character like a space or a comma, to make them easy for people to read. Often these separators also make the data easy for R to parse. The idea is to split the string into a separate value at each separator. The str_split function splits a string at each match to a pattern. The matching characters—that is, the separators—are discarded. For example, suppose you want to split several numbers separated by commas and spaces: x = "21, 32.3, 5, 64" result = str_split(x, ", ") result ## [[1]] ## [1] "21" "32.3" "5" "64" The str_split function always returns a list with one element for each input string. 
Here the list only has one element because x only has one element. You can get the first element with: result[[1]] ## [1] "21" "32.3" "5" "64" You then convert the values with as.numeric. To see why the str_split function always returns a list, consider what happens if you try to split two different strings at once: x = c(x, "10, 15, 1.3") result = str_split(x, ", ") result ## [[1]] ## [1] "21" "32.3" "5" "64" ## ## [[2]] ## [1] "10" "15" "1.3" Each string has a different number of parts, so the vectors in the result have different lengths. So a list is the only way to store them. You can also use the str_split function to split a sentence into words. Use spaces for the split: x = "The students in this workshop are great!" str_split(x, " ") ## [[1]] ## [1] "The" "students" "in" "this" "workshop" "are" "great!" When you know exactly how many parts you expect a string to have, use the str_split_fixed function instead of str_split. It accepts a third argument for the maximum number of splits to make. Because the number of splits is fixed, the function can return the result in a matrix instead of a list. For example: x = c("1, 2, 3", "10, 20, 30") str_split_fixed(x, ", ", 3) ## [,1] [,2] [,3] ## [1,] "1" "2" "3" ## [2,] "10" "20" "30" The str_split_fixed function is often more convenient than str_split because the nth piece of each input string is just the nth column of the result. For example, suppose you want to get the area codes from some phone numbers: phones = c("717-555-3421", "629-555-8902", "903-555-6781") result = str_split_fixed(phones, "-", 3) result[, 1] ## [1] "717" "629" "903" 1.4.5 Extracting Matches Occasionally, you might need to extract parts of a string in a more complicated way than string splitting allows. One solution is to write a regular expression that will match all of the data you want to capture, with parentheses ( ), the regex metacharacter for a group, around each distinct value. 
Then you can use the str_match function to extract the groups. Section 1.5.6 presents some examples of regex groups. For example, suppose you want to split an email address into three parts: the user name, the domain name, and the top-level domain. To create a regular expression that matches email addresses, you can use the @ and . in the address as anchors. The surrounding characters are generally alphanumeric, which you can represent with the “word” metacharacter \\w: \\w+@\\w+[.]\\w+ Next, put parentheses ( ) around each part that you want to extract: (\\w+)@(\\w+)[.](\\w+) Finally, use this pattern in str_match, adding extra backslashes so that everything is escaped correctly: x = "datalab@ucdavis.edu" regex = "(\\\\w+)@(\\\\w+)[.](\\\\w+)" str_match(x, regex) ## [,1] [,2] [,3] [,4] ## [1,] "datalab@ucdavis.edu" "datalab" "ucdavis" "edu" The function extracts the overall match to the pattern, as well as the match to each group. The pattern in this example doesn’t work for all possible email addresses, since user names can contain dots and other characters that are not alphanumeric. You could generalize the pattern if necessary. The point is that the str_match function and groups provide an extremely flexible way to extract data from strings. 1.4.6 Case Study: U.S. Warehouse Stocks The U.S. Department of Agriculture (USDA) publishes a variety of datasets online, particularly through its National Agricultural Statistics Service (NASS). Unfortunately, most of them are published in PDF or semi-structured text format, which makes reading the data into R or other statistical software a challenge. The USDA NASS posts monthly reports about stocks of agricultural products in refrigerated warehouses. In this case study, you’ll use string processing functions to extract a table of data from the December 2022 report. To begin, download the report and save it somewhere on your computer. Then open the file in a text editor (or RStudio) to inspect it. 
The goal is to extract the first table, about “Nuts, Dairy Products, Frozen Eggs, and Frozen Poultry,” from the report. The report is a semi-structured mix of natural language text and fixed-width tables. As a consequence, most functions for reading tabular data will not work well on the entire report. You could try to use a function for reading fixed-width data, such as read.fwf or the readr package’s read_fwf on only the lines containing a table. Another approach, which is shown here, is to use string processing functions to find and extract the table. The readLines function reads a text file into a character vector with one element for each line. This makes the function useful for reading unstructured or semi-structured text. Use the function to read the report: report = readLines("data/cost1222.txt") head(report) ## [1] "" ## [2] "Cold Storage" ## [3] "" ## [4] "ISSN: 1948-903X" ## [5] "" ## [6] "Released December 22, 2022, by the National Agricultural Statistics Service " In the report, tables always begin and end with lines that contain only dashes -. By locating these all-dash lines, you can locate the tables. Like str_detect, the str_which function tests whether strings in a vector match a pattern. The only difference is that str_which returns the indexes of the strings that matched (as if you had called which) rather than a logical vector. Use str_which to find the all-dash lines: # The regex means: # ^ beginning of string # -+ one or more dashes # $ end of string dashes = str_which(report, "^-+$") head(report[dashes], 2) ## [1] "--------------------------------------------------------------------------------------------------------------------------" ## [2] "--------------------------------------------------------------------------------------------------------------------------" Each table contains three all-dash lines: one at the top, one separating the header from the body, and one at the bottom. 
The header and body of the first table are: report[dashes[1]:dashes[2]] ## [1] "--------------------------------------------------------------------------------------------------------------------------" ## [2] " : : : November 30, 2022 : Public " ## [3] " : Stocks in all warehouses : as a percent of : warehouse " ## [4] " : : : : stocks " ## [5] " Commodity :-----------------------------------------------------------------------------------" ## [6] " :November 30, : October 31, :November 30, :November 30, : October 31, :November 30, " ## [7] " : 2021 : 2022 : 2022 : 2021 : 2022 : 2022 " ## [8] "--------------------------------------------------------------------------------------------------------------------------" bod = report[dashes[2]:dashes[3]] head(bod) ## [1] "--------------------------------------------------------------------------------------------------------------------------" ## [2] " : ------------ 1,000 pounds ----------- ---- percent ---- 1,000 pounds " ## [3] " : " ## [4] "Nuts : " ## [5] "Shelled : " ## [6] " Pecans .............................: 30,906 38,577 34,489 112 89 " The columns have fixed widths, so extracting the columns is relatively easy with str_sub if you can get the offsets. In the last line of the header, the columns are separated by colons :. 
Thus you can use the str_locate_all function, which returns the locations of a pattern in a string, to get the offsets: # The regex means: # [^:]+ one or more characters, excluding colons # (:|$) a colon or the end of the line cols = str_locate_all(report[dashes[2] - 1], "[^:]+(:|$)") # Like str_split, str_locate_all returns a list cols = cols[[1]] cols ## start end ## [1,] 1 39 ## [2,] 40 53 ## [3,] 54 67 ## [4,] 68 81 ## [5,] 82 95 ## [6,] 96 109 ## [7,] 110 122 You can use these offsets with str_sub to break a line in the body of the table into columns: str_sub(bod[6], cols) ## [1] " Pecans .............................:" ## [2] " 30,906 " ## [3] " 38,577 " ## [4] " 34,489 " ## [5] " 112 " ## [6] " 89 " ## [7] " " Because of the way str_sub is vectorized, you can’t process every line in the body of the table in one vectorized call. Instead, you can use sapply to call str_sub on each line: # Set USE.NAMES to make the table easier to read tab = sapply(bod, str_sub, cols, USE.NAMES = FALSE) # The sapply function transposes the table tab = t(tab) head(tab) ## [,1] [,2] ## [1,] "---------------------------------------" "--------------" ## [2,] " :" " ------------" ## [3,] " :" " " ## [4,] "Nuts :" " " ## [5,] "Shelled :" " " ## [6,] " Pecans .............................:" " 30,906 " ## [,3] [,4] [,5] [,6] ## [1,] "--------------" "--------------" "--------------" "--------------" ## [2,] " 1,000 pounds " "----------- " " ---- perc" "ent ---- " ## [3,] " " " " " " " " ## [4,] " " " " " " " " ## [5,] " " " " " " " " ## [6,] " 38,577 " " 34,489 " " 112 " " 89 " ## [,7] ## [1,] "-------------" ## [2,] "1,000 pounds " ## [3,] " " ## [4,] " " ## [5,] " " ## [6,] " " The columns still contain undesirable punctuation and whitespace, but you can remove these with str_replace_all and str_trim. 
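The reason one call to str_sub can split a single line but not the whole body is how the function vectorizes: given one string and a two-column matrix of (start, end) positions, such as str_locate_all output, it returns one substring per row of the matrix rather than recycling over multiple strings. A minimal sketch with a hypothetical string:

```r
library(stringr)

# One string plus a matrix of offsets yields one substring per matrix row
offsets = cbind(start = c(1, 4), end = c(3, 6))
str_sub("abcdef", offsets)
## [1] "abc" "def"
```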
Since the table is a matrix, it’s necessary to use apply to process it column-by-column: # The regex means: # , a comma # | OR # [.]* zero or more literal dots # : a colon # $ the end of the line tab = apply(tab, 2, function(col) { col = str_replace_all(col, ",|[.]*:$", "") str_trim(col) }) head(tab) ## [,1] [,2] ## [1,] "---------------------------------------" "--------------" ## [2,] "" "------------" ## [3,] "" "" ## [4,] "Nuts" "" ## [5,] "Shelled" "" ## [6,] "Pecans" "30906" ## [,3] [,4] [,5] [,6] ## [1,] "--------------" "--------------" "--------------" "--------------" ## [2,] "1000 pounds" "-----------" "---- perc" "ent ----" ## [3,] "" "" "" "" ## [4,] "" "" "" "" ## [5,] "" "" "" "" ## [6,] "38577" "34489" "112" "89" ## [,7] ## [1,] "-------------" ## [2,] "1000 pounds" ## [3,] "" ## [4,] "" ## [5,] "" ## [6,] "" The first few rows and the last row can be removed, since they don’t contain data. Then you can convert the table to a data frame and convert the individual columns to appropriate data types: tab = tab[-c(1:3, nrow(tab)), ] tab = data.frame(tab) tab[2:7] = lapply(tab[2:7], as.numeric) head(tab, 10) ## X1 X2 X3 X4 X5 X6 X7 ## 1 Nuts NA NA NA NA NA NA ## 2 Shelled NA NA NA NA NA NA ## 3 Pecans 30906 38577 34489 112 89 NA ## 4 In-Shell NA NA NA NA NA NA ## 5 Pecans 63788 44339 47638 75 107 NA ## 6 NA NA NA NA NA NA ## 7 Dairy products NA NA NA NA NA NA ## 8 Butter 210473 239658 199695 95 83 188566 ## 9 Natural cheese NA NA NA NA NA NA ## 10 American 834775 831213 815655 98 98 NA The data frame is now sufficiently clean that you could use it for a simple analysis. Of course, there are many things you could do to improve the extracted data frame, such as identifying categories and subcategories in the first column, removing rows that are completely empty, and adding column names. These entail more string processing and data frame manipulation—if you want to practice your R skills, try doing them on your own. 
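If you want a head start on one of those improvements, here is a sketch of dropping rows that are completely empty. It uses a tiny stand-in data frame rather than the full extracted table, so the values are illustrative only:

```r
# Tiny stand-in for the extracted data frame (values are made up)
tab = data.frame(
  X1 = c("Nuts", "", "Pecans"),
  X2 = c(NA, NA, 30906))

# A row is empty if every cell is NA or an empty string
empty = apply(tab, 1, function(row) all(is.na(row) | row == ""))
tab[!empty, ]
```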
1.5 Regular Expression Examples This section provides examples of several different regular expression metacharacters and other features. Most of the examples use the str_view function, which is especially helpful for testing regular expressions. The function displays an HTML-rendered version of the string with the first match highlighted. The RegExr website is also helpful for testing regular expressions; it provides an interactive interface where you can write regular expressions and see where they match a string. 1.5.1 The Wildcard The regex wildcard character is . and matches any single character. For example: x = "dog" str_view(x, "d.g") ## [1] │ <dog> By default, regex searches from left to right: str_view(x, ".") ## [1] │ <d><o><g> 1.5.2 Escape Sequences Like R, regular expressions can contain escape sequences that begin with a backslash. These are computed separately and after R escape sequences. The main use for escape sequences in regex is to turn a metacharacter into a literal character. For example, suppose you want to match a literal dot .. The regex for a literal dot is \.. Since backslashes in R strings have to be escaped, the R string for this regex is "\\.". For example: str_view("this.string", "\\.") ## [1] │ this<.>string The double backslash can be confusing, and it gets worse if you want to match a literal backslash. You have to escape the backslash in the regex (because backslash is the regex escape character) and then also have to escape the backslashes in R (because backslash is also the R escape character). So to match a single literal backslash in R, the code is: str_view("this\\that", "\\\\") ## [1] │ this<\>that Raw strings (see Section 1.3.3) make regular expressions easier to read, because they make backslashes literal (but they still mark the beginning of an escape sequence in regex).
You can use a raw string to write the above as: str_view(r"(this\that)", r"(\\)") ## [1] │ this<\>that 1.5.3 Anchors By default, a regex will match anywhere in the string. If you want to force a match at a specific place, use an anchor. The beginning of string anchor is ^. It marks the beginning of the string, but doesn’t count as a character in the pattern. For example, suppose you want to match an a at the beginning of the string: x = c("abc", "cab") str_view(x, "a") ## [1] │ <a>bc ## [2] │ c<a>b str_view(x, "^a") ## [1] │ <a>bc It doesn’t make sense to put characters before ^, since no characters can come before the beginning of the string. Likewise, the end of string anchor is $. It marks the end of the string, but doesn’t count as a character in the pattern. 1.5.4 Character Classes In regex, square brackets [ ] denote a character class. A character class matches exactly one character, but that character can be any of the characters inside of the square brackets. The square brackets themselves don’t count as characters in the pattern. For example, suppose you want to match c followed by either a or t: x = c("ca", "ct", "cat", "cta") str_view(x, "c[ta]") ## [1] │ <ca> ## [2] │ <ct> ## [3] │ <ca>t ## [4] │ <ct>a You can use a dash - in a character class to create a range. For example, to match letters p through z: str_view(x, "c[p-z]") ## [2] │ <ct> ## [4] │ <ct>a Ranges also work with numbers and capital letters. To match a literal dash, place the dash at the end of the character class (instead of between two other characters), as in [abc-]. Most metacharacters are literal when inside a character class. For example, [.] matches a literal dot. A hat ^ at the beginning of the character class negates the class. So for example, [^abc] matches any one character except for a, b, or c: str_view("abcdef", "[^abc]") ## [1] │ abc<d><e><f> 1.5.5 Quantifiers Quantifiers are metacharacters that affect how many times the preceding character must appear in a match.
The quantifier itself doesn’t count as a character in the match. For example, the question mark ? quantifier means the preceding character can appear 0 or 1 times. In other words, ? makes the preceding character optional. For example: x = c("abc", "ab", "ac", "abbc") str_view(x, "ab?c") ## [1] │ <abc> ## [3] │ <ac> The star * quantifier means the preceding character can appear 0 or more times. In other words, * means the preceding character can appear any number of times or not at all. For instance: str_view(x, "ab*c") ## [1] │ <abc> ## [3] │ <ac> ## [4] │ <abbc> The plus + quantifier means the preceding character must appear 1 or more times. Quantifiers are greedy, meaning they always match as many characters as possible. In this example, notice that the pattern matches the entire string, even though it could also match just abba: str_view("abbabbba", ".+a") ## [1] │ <abbabbba> You can add a question mark ? after another quantifier to make it non-greedy: str_view("abbabbba", ".+?a") ## [1] │ <abba><bbba> 1.5.6 Groups In regex, parentheses ( ) denote a group. The parentheses themselves don’t count as characters in the pattern. Groups are useful for repeating or extracting specific parts of a pattern (see Section 1.4.5). Quantifiers can act on groups in addition to individual characters. For example, suppose you want to make the entire substring ", dogs," optional in a pattern, so that both of the test strings in this example match: x = c("cats, dogs, and frogs", "cats and frogs") str_view(x, "cats(, dogs,)? and frogs") ## [1] │ <cats, dogs, and frogs> ## [2] │ <cats and frogs> 2 Tidy & Relational Data This chapter is part 2 (of 2) of Cleaning & Reshaping Data, a workshop series about how to prepare data for analysis.
The major topics of this chapter are how to reshape datasets with pivots, how to combine related datasets with joins, and how to select and use iteration strategies that automate repetitive computations. Learning Objectives After completing this session, learners should be able to: Explain what it means for data to be tidy Use the tidyr package to reshape data Explain what a relational dataset is Use the dplyr package to join data based on common columns Describe the different types of joins Identify which types of joins to use when faced with a relational dataset 2.1 Tidy Datasets The structure of a dataset—its shape and organization—has enormous influence on how difficult it will be to analyze, so making structural changes is an important part of the cleaning process. Researchers conventionally arrange tabular datasets so that each row contains a single observation or case, and each column contains a single kind of measurement or identifier, called a feature. In 2014, Hadley Wickham refined and formalized the conventions for tabular datasets by introducing the concept of tidy datasets, which have a specific structure. Paraphrasing Wickham, the rules for a tidy dataset are: Every column is a single feature. Every row is a single observation. Every cell is a single value. These rules ensure that all of the values in a dataset are visually organized and are easy to access with indexing operations. They’re also specific enough to make tidiness a convenient standard for functions that operate on tabular datasets. In fact, the Tidyverse packages (see Section 1.1) are designed from the ground up for working with tidy datasets. Tidy datasets have also been adopted as a standard in other software, including various packages for Python and Julia. This section explains how to reshape tabular datasets into tidy datasets.
While reshaping can seem tricky at first, making sure your dataset has the right structure before you begin analysis saves time and frustration in the long run. 2.1.1 The tidyr Package The tidyr package provides functions to reshape tabular datasets. It also provides examples of tidy and untidy datasets. Like most Tidyverse packages, it comes with detailed documentation and a cheatsheet. As usual, install the package if you haven’t already, and then load it: # install.packages("tidyr") library(tidyr) Let’s start with an example of a tidy dataset. The table1 dataset in the package records the number of tuberculosis cases across several different countries and years: table1 ## # A tibble: 6 × 4 ## country year cases population ## <chr> <dbl> <dbl> <dbl> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 Each of the four columns contains a single kind of measurement or identifier, so the dataset satisfies tidy rule 1. The measurements were taken at the country-year level, and each row contains data for one country-year pair, so the dataset also satisfies tidy rule 2. Each cell in the data frame only contains one value, so the dataset also satisfies tidy rule 3. The same data are recorded in table2, table3, and the pair table4a with table4b, but these are all untidy datasets.
For example, table2 breaks rule 1 because the column count contains two different kinds of measurements—case counts and population counts: table2 ## # A tibble: 12 × 4 ## country year type count ## <chr> <dbl> <chr> <dbl> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 1999 population 19987071 ## 3 Afghanistan 2000 cases 2666 ## 4 Afghanistan 2000 population 20595360 ## 5 Brazil 1999 cases 37737 ## 6 Brazil 1999 population 172006362 ## 7 Brazil 2000 cases 80488 ## 8 Brazil 2000 population 174504898 ## 9 China 1999 cases 212258 ## 10 China 1999 population 1272915272 ## 11 China 2000 cases 213766 ## 12 China 2000 population 1280428583 When considering whether you should reshape a dataset, think about what the features are and what the observations are. These depend on the dataset itself, but also on what kinds of analyses you want to do. Datasets sometimes have closely related features or multiple (nested) levels of observation. The tidyr documentation includes a detailed article on how to reason about reshaping datasets. If you do decide to reshape a dataset, then you should also think about what role each feature serves: Identifiers are labels that distinguish observations from one another. They are often but not always categorical. Examples include names or identification numbers, treatment groups, and dates or times. In the tuberculosis data set, the country and year columns are identifiers. Measurements are the values collected for each observation and typically the values of research interest. For the tuberculosis data set, the cases and population columns are measurements. Having a clear understanding of which features are identifiers and which are measurements makes it easier to use the tidyr functions. 2.1.2 Rows into Columns Tidy data rule 1 is that each column must be a single feature. 
The table2 dataset breaks this rule: table2 ## # A tibble: 12 × 4 ## country year type count ## <chr> <dbl> <chr> <dbl> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 1999 population 19987071 ## 3 Afghanistan 2000 cases 2666 ## 4 Afghanistan 2000 population 20595360 ## 5 Brazil 1999 cases 37737 ## 6 Brazil 1999 population 172006362 ## 7 Brazil 2000 cases 80488 ## 8 Brazil 2000 population 174504898 ## 9 China 1999 cases 212258 ## 10 China 1999 population 1272915272 ## 11 China 2000 cases 213766 ## 12 China 2000 population 1280428583 To make the dataset tidy, the measurements in the count column need to be separated into two separate columns, cases and population, based on the categories in the type column. You can use the pivot_wider function to pivot the single count column into two columns according to the type column. This makes the dataset wider, hence the name pivot_wider. The function’s first parameter is the dataset to pivot. Other important parameters are: values_from – The column(s) to pivot. names_from – The column that contains names for the new columns. id_cols – The identifier columns, which are not pivoted. This defaults to all columns except those in values_from and names_from. Here’s how to use the function to make table2 tidy: pivot_wider(table2, values_from = count, names_from = type) ## # A tibble: 6 × 4 ## country year cases population ## <chr> <dbl> <dbl> <dbl> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 The function automatically removes values from the country and year columns as needed to maintain their original correspondence with the pivoted values. 2.1.3 Columns into Rows Tidy data rule 2 is that every row must be a single observation. 
The table4a and table4b datasets break this rule because each row in each dataset contains measurements for two different years: table4a ## # A tibble: 3 × 3 ## country `1999` `2000` ## <chr> <dbl> <dbl> ## 1 Afghanistan 745 2666 ## 2 Brazil 37737 80488 ## 3 China 212258 213766 table4b ## # A tibble: 3 × 3 ## country `1999` `2000` ## <chr> <dbl> <dbl> ## 1 Afghanistan 19987071 20595360 ## 2 Brazil 172006362 174504898 ## 3 China 1272915272 1280428583 The tuberculosis case counts are in table4a. The population counts are in table4b. Neither is tidy. To make the table4a dataset tidy, the 1999 and 2000 columns need to be pivoted into two new columns: one for the measurements (the counts) and one for the identifiers (the years). It might help to visualize this as stacking the two separate columns 1999 and 2000 together, one on top of the other, and then adding a second column with the appropriate years. The same process makes table4b tidy. You can use the pivot_longer function to pivot the two columns 1999 and 2000 into a column of counts and a column of years. This makes the dataset longer, hence the name pivot_longer. Again the function’s first parameter is the dataset to pivot. Other important parameters are: cols – The columns to pivot. values_to – Name(s) for the new measurement column(s) names_to – Name(s) for the new identifier column(s) Here’s how to use the function to make table4a tidy: tidy4a = pivot_longer(table4a, -country, values_to = "cases", names_to = "year") tidy4a ## # A tibble: 6 × 3 ## country year cases ## <chr> <chr> <dbl> ## 1 Afghanistan 1999 745 ## 2 Afghanistan 2000 2666 ## 3 Brazil 1999 37737 ## 4 Brazil 2000 80488 ## 5 China 1999 212258 ## 6 China 2000 213766 In this case, the cols parameter is set to all columns except the country column, because the country column does not need to be pivoted. The function automatically repeats values in the country column as needed to maintain its original correspondence with the pivoted values. 
Here’s the same for table4b: tidy4b = pivot_longer(table4b, -country, values_to = "population", names_to = "year") tidy4b ## # A tibble: 6 × 3 ## country year population ## <chr> <chr> <dbl> ## 1 Afghanistan 1999 19987071 ## 2 Afghanistan 2000 20595360 ## 3 Brazil 1999 172006362 ## 4 Brazil 2000 174504898 ## 5 China 1999 1272915272 ## 6 China 2000 1280428583 Once the two datasets are tidy, you can join them with the merge function to reproduce table1: merge(tidy4a, tidy4b) ## country year cases population ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 2.1.4 Separating Values Tidy data rule 3 says each value must have its own cell. The table3 dataset breaks this rule because the rate column contains two values per cell: table3 ## # A tibble: 6 × 3 ## country year rate ## <chr> <dbl> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362 ## 4 Brazil 2000 80488/174504898 ## 5 China 1999 212258/1272915272 ## 6 China 2000 213766/1280428583 The two values separated by / in the rate column are the tuberculosis case count and the population count. To make this dataset tidy, the rate column needs to be split into two columns, cases and population. The values in the rate column are strings, so one way to do this is with the stringr package’s str_split_fixed function, described in Section 1.4.4: library(stringr) # Split the rate column into 2 columns. cols = str_split_fixed(table3$rate, fixed("/"), 2) # Remove the rate column and append the 2 new columns. 
tidy3 = table3[-3] tidy3$cases = as.numeric(cols[, 1]) tidy3$population = as.numeric(cols[, 2]) tidy3 ## # A tibble: 6 × 4 ## country year cases population ## <chr> <dbl> <dbl> <dbl> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 Extracting values, converting to appropriate data types, and then combining everything back into a single data frame is an extremely common pattern in data science. The tidyr package provides the separate function to streamline the steps taken above. The first parameter is the dataset, the second is the column to split, the third is the names of the new columns, and the fourth is the delimiter. The convert parameter controls whether the new columns are automatically converted to appropriate data types: separate(table3, rate, c("cases", "population"), "/", convert = TRUE) ## # A tibble: 6 × 4 ## country year cases population ## <chr> <dbl> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 As of writing, the tidyr developers have deprecated the separate function in favor of several more specific functions (separate_wider_delim, separate_wider_position, and separate_wider_regex). These functions are still experimental, so we still recommend using the separate function in the short term. 2.1.5 Case Study: SMART Ridership Sonoma-Marin Area Rail Transit (SMART) is a single-line passenger rail service between the San Francisco Bay and Santa Rosa. They publish data about monthly ridership in PDF and Excel format. In this case study, you’ll reshape and clean the dataset to prepare it for analysis. To get started, download the December 2022 report in Excel format.
Pay attention to where you save the file—or move it to a directory just for files related to this case study—so that you can load it into R. If you want, you can use R’s download.file function to download the file rather than your browser. The readxl package provides functions to read data from Excel files. Install the package if you don’t already have it installed, and then load it: # install.packages("readxl") library("readxl") You can use the read_excel function to read a sheet from an Excel spreadsheet. Before doing so, it’s a good idea to manually inspect the spreadsheet in a spreadsheet program. The SMART dataset contains two tables in the first sheet, one for total monthly ridership and another for average weekday ridership (by month). Let’s focus on the total monthly ridership table, which occupies cells B4 to H16. You can specify a range of cells when you call read_excel by setting the range parameter: smart_path = "./data/SMART Ridership Web Posting_Mar.23.xlsx" smart = read_excel(smart_path, range = "B4:H16") smart ## # A tibble: 12 × 7 ## Month FY18 FY19 FY20 FY21 FY22 `FY23 (DRAFT)` ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Jul - 63864 62851 9427 24627 43752 ## 2 Aug 54484 74384 65352 8703 25020 48278 ## 3 Sep 65019 62314 62974 8910 27967 49134 ## 4 Oct 57453 65492 57222 9851 26998. 59322 ## 5 Nov 56125 52774 64966 8145 26575 51383 ## 6 Dec 56425 51670 58199. 7414 24050 47606 ## 7 Jan 56527 57136 71974 6728 22710 46149 ## 8 Feb 54797 51130 71676 7412 26652 49724 ## 9 Mar 57312 58091 33624 9933 35291 53622 ## 10 Apr 56631 60256 4571 11908 34258 NA ## 11 May 59428 64036 5308 13949 38655 NA ## 12 Jun 61828 55700 8386 20469 41525 NA The loaded dataset needs to be cleaned. The FY18 column uses a hyphen to indicate missing data and has the wrong data type. The dates—months and years—are identifiers for observations, so the dataset is also not tidy. 
You can correct the missing value in the FY18 column with indexing, and the type with the as.numeric function: smart$FY18[smart$FY18 == "-"] = NA smart$FY18 = as.numeric(smart$FY18) head(smart) ## # A tibble: 6 × 7 ## Month FY18 FY19 FY20 FY21 FY22 `FY23 (DRAFT)` ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Jul NA 63864 62851 9427 24627 43752 ## 2 Aug 54484 74384 65352 8703 25020 48278 ## 3 Sep 65019 62314 62974 8910 27967 49134 ## 4 Oct 57453 65492 57222 9851 26998. 59322 ## 5 Nov 56125 52774 64966 8145 26575 51383 ## 6 Dec 56425 51670 58199. 7414 24050 47606 To make the dataset tidy, it needs to be reshaped so that the values in the various fiscal year columns are all in one column. In other words, the dataset needs to be pivoted longer (Section 2.1.3). The result of the pivot will be easier to understand if you rename the columns as their years first. Here’s one way to do that: names(smart)[-1] = 2018:2023 head(smart) ## # A tibble: 6 × 7 ## Month `2018` `2019` `2020` `2021` `2022` `2023` ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Jul NA 63864 62851 9427 24627 43752 ## 2 Aug 54484 74384 65352 8703 25020 48278 ## 3 Sep 65019 62314 62974 8910 27967 49134 ## 4 Oct 57453 65492 57222 9851 26998. 59322 ## 5 Nov 56125 52774 64966 8145 26575 51383 ## 6 Dec 56425 51670 58199. 7414 24050 47606 Next, use pivot_longer to pivot the dataset: smart = pivot_longer(smart, -Month, values_to = "riders", names_to = "fiscal_year") head(smart) ## # A tibble: 6 × 3 ## Month fiscal_year riders ## <chr> <chr> <dbl> ## 1 Jul 2018 NA ## 2 Jul 2019 63864 ## 3 Jul 2020 62851 ## 4 Jul 2021 9427 ## 5 Jul 2022 24627 ## 6 Jul 2023 43752 Now the dataset is tidy, but it’s still not completely clean. To make it easy to study time trends, let’s combine and convert the month and fiscal_year columns into a calendar date. You can use functions from the lubridate package (Section 1.2.1) to do this. 
First paste the year and month together and use the my function to parse them as dates: library(lubridate) ## ## Attaching package: 'lubridate' ## The following objects are masked from 'package:base': ## ## date, intersect, setdiff, union dates = my(paste(smart$Month, smart$fiscal_year)) dates ## [1] "2018-07-01" "2019-07-01" "2020-07-01" "2021-07-01" "2022-07-01" ## [6] "2023-07-01" "2018-08-01" "2019-08-01" "2020-08-01" "2021-08-01" ## [11] "2022-08-01" "2023-08-01" "2018-09-01" "2019-09-01" "2020-09-01" ## [16] "2021-09-01" "2022-09-01" "2023-09-01" "2018-10-01" "2019-10-01" ## [21] "2020-10-01" "2021-10-01" "2022-10-01" "2023-10-01" "2018-11-01" ## [26] "2019-11-01" "2020-11-01" "2021-11-01" "2022-11-01" "2023-11-01" ## [31] "2018-12-01" "2019-12-01" "2020-12-01" "2021-12-01" "2022-12-01" ## [36] "2023-12-01" "2018-01-01" "2019-01-01" "2020-01-01" "2021-01-01" ## [41] "2022-01-01" "2023-01-01" "2018-02-01" "2019-02-01" "2020-02-01" ## [46] "2021-02-01" "2022-02-01" "2023-02-01" "2018-03-01" "2019-03-01" ## [51] "2020-03-01" "2021-03-01" "2022-03-01" "2023-03-01" "2018-04-01" ## [56] "2019-04-01" "2020-04-01" "2021-04-01" "2022-04-01" "2023-04-01" ## [61] "2018-05-01" "2019-05-01" "2020-05-01" "2021-05-01" "2022-05-01" ## [66] "2023-05-01" "2018-06-01" "2019-06-01" "2020-06-01" "2021-06-01" ## [71] "2022-06-01" "2023-06-01" The SMART fiscal year extends from July to the following June and equals the calendar year at the end of the period. So for observations from July to December, the calendar year is the fiscal year minus 1. 
You can use indexing to make this adjustment efficiently, and then append the dates to the data frame: jul2dec = month(dates) >= 7 dates[jul2dec] = dates[jul2dec] - period(1, "year") smart$date = dates head(smart) ## # A tibble: 6 × 4 ## Month fiscal_year riders date ## <chr> <chr> <dbl> <date> ## 1 Jul 2018 NA 2017-07-01 ## 2 Jul 2019 63864 2018-07-01 ## 3 Jul 2020 62851 2019-07-01 ## 4 Jul 2021 9427 2020-07-01 ## 5 Jul 2022 24627 2021-07-01 ## 6 Jul 2023 43752 2022-07-01 As a final adjustment, you can use the tolower function to convert the column names to lowercase, so that they’re easier to use during analysis: names(smart) = tolower(names(smart)) smart ## # A tibble: 72 × 4 ## month fiscal_year riders date ## <chr> <chr> <dbl> <date> ## 1 Jul 2018 NA 2017-07-01 ## 2 Jul 2019 63864 2018-07-01 ## 3 Jul 2020 62851 2019-07-01 ## 4 Jul 2021 9427 2020-07-01 ## 5 Jul 2022 24627 2021-07-01 ## 6 Jul 2023 43752 2022-07-01 ## 7 Aug 2018 54484 2017-08-01 ## 8 Aug 2019 74384 2018-08-01 ## 9 Aug 2020 65352 2019-08-01 ## 10 Aug 2021 8703 2020-08-01 ## # ℹ 62 more rows Now that the dataset is tidied and cleaned, it’s straightforward to do things like plot it as a time series: library("ggplot2") ggplot(smart) + aes(x = date, y = riders) + geom_line() + expand_limits(y = 0) ## Warning: Removed 4 rows containing missing values (`geom_line()`). Notice the huge drop (more than 90%) in April of 2020 due to the COVID-19 pandemic! 2.1.6 Without tidyr This section shows how to pivot datasets without the help of the tidyr package. In practice, we recommend that you use the package, but the examples here may make it easier to understand what’s actually happening when you pivot a dataset. 2.1.6.1 Rows into Columns The steps for pivoting table2 wider are: Subset rows to separate cases and population values. Remove the type column from each. Rename the count column to cases and population. Merge the two subsets by matching country and year. 
And the code is: # Step 1 cases = table2[table2$type == "cases", ] pop = table2[table2$type == "population", ] # Step 2 cases = cases[-3] pop = pop[-3] # Step 3 names(cases)[3] = "cases" names(pop)[3] = "population" # Step 4 merge(cases, pop) ## country year cases population ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 2.1.6.2 Columns into Rows The steps for pivoting table4a longer are: Subset columns to separate 1999 and 2000 into two data frames. Add a year column to each. Rename the 1999 and 2000 columns to cases. Stack the two data frames with rbind. And the code is: # Step 1 df99 = table4a[-3] df00 = table4a[-2] # Step 2 df99$year = "1999" df00$year = "2000" # Step 3 names(df99)[2] = "cases" names(df00)[2] = "cases" # Step 4 rbind(df99, df00) ## # A tibble: 6 × 3 ## country cases year ## <chr> <dbl> <chr> ## 1 Afghanistan 745 1999 ## 2 Brazil 37737 1999 ## 3 China 212258 1999 ## 4 Afghanistan 2666 2000 ## 5 Brazil 80488 2000 ## 6 China 213766 2000 2.2 Relational Datasets Many datasets contain multiple tables (or data frames) that are all closely related to each other. Sometimes, the rows in one table may be connected to the rows in others through columns they have in common. For example, our library keeps track of its books using three tables: one identifying books, one identifying borrowers, and one that records each book checkout. Each book and each borrower has a unique identification number, recorded in the book and borrower tables, respectively. These ID numbers are also recorded in the checkouts table. Using the ID numbers, you can connect rows from one table to rows in another. We call this kind of dataset a relational dataset, because there are relationships between the tables. Storing relational datasets as several small tables rather than one large table has many benefits.
Perhaps the most important is that it reduces redundancy and thereby reduces the size (in bytes) of the dataset. As a result, most databases are designed to store relational datasets. Because the data are split across many different tables, relational datasets also pose a unique challenge: to explore, compute statistics, make visualizations, and answer questions, you’ll typically need to combine the data of interest into a single table. One way to do this is with a join, an operation that combines rows from two tables based on values of a column they have in common. There are many different types of joins, which are covered in the subsequent sections. 2.2.1 The dplyr Package The dplyr package provides functions to join related data frames, among other things. Check out this list of all the functions provided by dplyr. If you’ve ever used SQL, you’re probably familiar with relational datasets and recognize functions like select, left_join, and group_by. In fact, dplyr was designed to bring SQL-style data manipulation to R. As a result, many concepts of dplyr and SQL are nearly identical, and even the language overlaps a lot. I’ll point out some examples of this as we go, because I think some people might find it helpful. If you haven’t used SQL, don’t worry—all of the functions will be explained in detail. 2.2.2 Gradebook Dataset Another example of a relational dataset that we all interact with regularly is the university gradebook. One table might store information about students and another might store their grades. The grades are linked to the student records by student ID. Looking at a student’s grades requires combining the two tables with a join. Let’s use a made-up gradebook dataset to make the idea of joins concrete. We’ll create two tables: the first identifies students by name and ID, and the second lists their grades in a class. 
# Example datasets students = data.frame( student_id = c(1, 2, 3, 4), name = c("Angel", "Beto", "Cici", "Desmond")) students ## student_id name ## 1 1 Angel ## 2 2 Beto ## 3 3 Cici ## 4 4 Desmond grades = data.frame( student_id = c(2, 3, 4, 5, 6), grade = c(90, 85, 80, 75, 60)) grades ## student_id grade ## 1 2 90 ## 2 3 85 ## 3 4 80 ## 4 5 75 ## 5 6 60 The rows and columns of these tables have different meanings, so we can’t stack them side-by-side or one on top of the other. The “key” piece of information for linking them is the student_id column present in both. In relational datasets, each table usually has a primary key, a column of values that uniquely identify the rows. Key columns are important because they link rows in one table to rows in other tables. In the gradebook dataset, student_id is the primary key for the students table. Although the values of student_id in the grades table are unique, it is not a primary key for the grades table, because a student could have grades for more than one class. When one table’s primary key is included in another table, it’s called a foreign key. So student_id is a foreign key in the grades table. If you’ve used SQL, you’ve probably heard the terms primary key and foreign key before. They have the same meaning in R. In most databases, the primary key must be unique—there can be no duplicates. That said, relational datasets are not always designed for use as databases, and they may have key columns that are not unique. How to handle non-unique keys is going to be a recurring feature of this section. 2.2.3 Left Joins Suppose we want a table with each student’s name and grade. This is a combination of information from both the students table and the grades table, but how can we combine the two? The students table contains the student names and has one row for each student. So we can use the students table as a starting point. Then we need to use each student’s ID number to look up their grade in the grades table. 
When you want to combine data from two tables like this, you should think of using a join. In join terminology, the two tables are usually called the left table and right table so that it’s easy to refer to each without ambiguity. For this particular example, we’ll use a left join. A left join keeps all of the rows in the left table and combines them with rows from the right table that match the primary key. We want to keep every student in the students table, so we’ll use it as the left table. The grades table will be the right table. The key that links the two tables is student_id. This left join will only keep rows from the grades table that match student IDs present in the students table. In dplyr, you can use the left_join function to carry out a left join. The first argument is the left table and the second argument is the right table. You can also set an argument for the by parameter to specify which column(s) to use as the key. Thus: # load dplyr package library(dplyr) library(knitr) # Left join left_join(students, grades, by = "student_id") ## student_id name grade ## 1 1 Angel NA ## 2 2 Beto 90 ## 3 3 Cici 85 ## 4 4 Desmond 80 Note that the keys do not match up perfectly between the tables: the grades table has no rows with student_id 1 (Angel) and has rows with student_id 5 (an unknown student). Because we used a left join, the result has a missing value (NA) in the grade column for Angel and no entry for student_id 5. A left join augments the left table (students) with columns from the right table (grades). So the result of a left join will often have the same number of rows as the left table. New rows are not added for rows in the right table with non-matching key values. There is one case where the result of a left join will have more rows than the left table: when a key value is repeated in either table. In that case, every possible match will be provided in the result.
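As an aside, the same left join can be written in base R with the merge function we used earlier, by setting all.x = TRUE (a sketch; note that merge sorts the result by the key, while left_join keeps the left table’s row order):

```r
# Redefine the example tables so this snippet is self-contained.
students = data.frame(
  student_id = c(1, 2, 3, 4),
  name = c("Angel", "Beto", "Cici", "Desmond"))
grades = data.frame(
  student_id = c(2, 3, 4, 5, 6),
  grade = c(90, 85, 80, 75, 60))

# all.x = TRUE keeps every row of the first (left) table, filling
# grade with NA where no matching student_id exists in grades.
merge(students, grades, by = "student_id", all.x = TRUE)
```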
For an example, let’s add rows with repeat IDs to both the students and grades tables. Let’s also rename the student_id column of grades to be sid so we can see how to join tables where the key column names don’t match. # Example datasets students = data.frame( student_id = c(1, 2, 3, 4, 4), name = c("Angel", "Beto", "Cici", "Desmond", "Erik")) grades = data.frame( sid = c(2, 3, 4, 5, 2), grade = c(90, 85, 80, 75, 60)) # Left join left_join(students, grades, by = join_by(student_id == sid)) ## Warning in left_join(students, grades, by = join_by(student_id == sid)): Detected an unexpected many-to-many relationship between `x` and `y`. ## ℹ Row 2 of `x` matches multiple rows in `y`. ## ℹ Row 3 of `y` matches multiple rows in `x`. ## ℹ If a many-to-many relationship is expected, set `relationship = "many-to-many"` to silence this warning. ## student_id name grade ## 1 1 Angel NA ## 2 2 Beto 90 ## 3 2 Beto 60 ## 4 3 Cici 85 ## 5 4 Desmond 80 ## 6 4 Erik 80 Both of the tables had five rows, but the result has six rows because student_id is 4 for two rows of students and sid is 2 for two rows of grades. R warns that there is a many-to-many relationship in the join, which means that duplicate keys were matched in the left table and the right table. When there are no duplicate keys in either table, the match is one-to-one. When there are duplicates in one table only, the match is one-to-many or many-to-one. These are often the desired behavior, so R carries them out silently. A many-to-many match may be desired, but it is often a sign that something has gone wrong, so R emits a warning. You can get funky results when your keys are not unique! [Image: cats join meme] 2.2.4 Other Joins There are several other kinds of joins: A right join is almost the same as a left join, but reverses the roles of the left and right table. All rows from the right table are augmented with columns from the left table where the key matches.
An inner join returns rows from the left and right tables only if they match (their key appears in both tables). A full join returns all rows from the left table and from the right table, even if they do not match. Here’s a visualization to help identify the differences: [Image: Disney characters illustrate the differences between joins] The following subsections provide examples of different types of joins. 2.2.4.1 Inner Join An inner join returns the same columns as a left join, but potentially fewer rows. The result of an inner join only includes the rows that matched according to the join specification. This will leave out some rows from the left table if they aren’t matched in the right table, which is the difference between an inner join and a left join. # Example datasets students = data.frame( student_id = c(1, 2, 3, 4, 4), name = c("Angel", "Beto", "Cici", "Desmond", "Erik")) grades = data.frame( student_id = c(2, 3, 4, 5, 2), grade = c(90, 85, 80, 75, 60)) # Inner join inner_join(students, grades, by = "student_id") |> kable() ## Warning in inner_join(students, grades, by = "student_id"): Detected an unexpected many-to-many relationship between `x` and `y`. ## ℹ Row 2 of `x` matches multiple rows in `y`. ## ℹ Row 3 of `y` matches multiple rows in `x`. ## ℹ If a many-to-many relationship is expected, set `relationship = "many-to-many"` to silence this warning. student_id name grade 2 Beto 90 2 Beto 60 3 Cici 85 4 Desmond 80 4 Erik 80 2.2.5 Getting Clever with join_by So far, we’ve focused on the join types and the tables. There’s been a third element in all of the examples that we’ve mostly ignored until now: the by argument in the joins. Specifying a single column name (like student_id) works great when the key columns have the same names in both tables. However, real examples are often more complicated. For those times, dplyr provides a function called join_by, which lets you create join specifications to solve even very complicated problems.
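Before turning to join_by, here’s a quick sketch of the right and full joins described above, using small example tables with unique keys so no warnings are raised:

```r
library(dplyr)

students = data.frame(
  student_id = c(1, 2, 3, 4),
  name = c("Angel", "Beto", "Cici", "Desmond"))
grades = data.frame(
  student_id = c(2, 3, 4, 5),
  grade = c(90, 85, 80, 75))

# Right join: keeps every row of grades, so Angel (who has no grade)
# is dropped and student_id 5 appears with a missing name.
right_join(students, grades, by = "student_id")

# Full join: keeps every row from both tables, with NAs wherever
# one side has no match.
full_join(students, grades, by = "student_id")
```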
We begin with an example where the key name in the grades table has been changed from student_id to sid. # Example datasets students = data.frame( student_id = c(1, 2, 3, 4), name = c("Angel", "Beto", "Cici", "Desmond")) grades = data.frame( sid = c(2, 3, 4, 5), grade = c(90, 85, 80, 75)) # Left join left_join(students, grades, by = join_by(student_id==sid)) |> kable() student_id name grade 1 Angel NA 2 Beto 90 3 Cici 85 4 Desmond 80 Since the key column names don’t match, I have provided a join_by specification. Specifying a match via join_by is very powerful and flexible, but the main thing to recognize here is that R searches for the column name on the left of the double-equals in the left table and searches for the column name on the right of the double-equals in the right table. In this example, that means the join will try to match students$student_id to grades$sid. 2.2.5.1 Matching multiple columns Sometimes it takes more than one key to uniquely identify a row of data. For example, suppose some of our students are retaking the class in 2023 because they got a failing grade in 2022. Then we would need to combine the student ID with the year to uniquely identify a student’s grade. You can include multiple comparisons in a join_by specification by separating them with commas. In the following example, student ID still has different names between the tables but the year column has the same name in both tables. 
# Example datasets students = data.frame( student_id = c(1, 2, 3, 4), name = c("Angel", "Beto", "Cici", "Desmond")) # duplicate the students for two years students = bind_rows( mutate(students, year = 2022), mutate(students, year = 2023) ) # create the grades data.frame grades = data.frame( sid = c(2, 3, 4, 5), grade = c(90, 85, 80, 75) ) # duplicate the grades table for two years grades = bind_rows( mutate(grades, grade = grade - 50, year = 2022), mutate(grades, year = 2023) ) # Left join left_join(students, grades, by = join_by(student_id==sid, year)) |> kable() student_id name year grade 1 Angel 2022 NA 2 Beto 2022 40 3 Cici 2022 35 4 Desmond 2022 30 1 Angel 2023 NA 2 Beto 2023 90 3 Cici 2023 85 4 Desmond 2023 80 To learn clever tricks for complicated joins, see the documentation at ?join_by. 2.2.6 Examples We’ve seen enough of the made-up grades example! Let’s look at some real data and practice our skills! Let’s begin by looking at the data on books, borrowers, and checkouts. borrowers = read.csv("data/library/borrowers.csv") books = read.csv("data/library/books.csv") checkouts = read.csv("data/library/checkouts.csv") # show the top rows print(head(books)) ## book_id title author subject ## 1 1 my alaska guy noir NA ## 2 2 jubilee charro NA ## 3 3 the window ruiner ruined my windows zipperman NA ## 4 4 clowns in the clouds dark doug NA ## 5 5 dogs walked, tigers tamed, bars emptied lucky pierre NA ## 6 6 ace ventura sandy mackinnon NA ## Keywords minority_author Description. Contents. Series. Pages. Publisher. ## 1 NA NA NA NA NA 100 NA ## 2 NA NA NA NA NA 100 NA ## 3 NA NA NA NA NA 100 NA ## 4 NA NA NA NA NA 100 NA ## 5 NA NA NA NA NA 100 NA ## 6 NA NA NA NA NA 100 NA ## creation_date publication_date Edition. format venue Language. Source. ## 1 NA NA NA NA NA NA NA ## 2 NA NA NA NA NA NA NA ## 3 NA NA NA NA NA NA NA ## 4 NA NA NA NA NA NA NA ## 5 NA NA NA NA NA NA NA ## 6 NA NA NA NA NA NA NA ## Identifier. ISBN. location copies type Barcode. 
## 1 NA NA NA NA NA NA ## 2 NA NA NA NA NA NA ## 3 NA NA NA NA NA NA ## 4 NA NA NA NA NA NA ## 5 NA NA NA NA NA NA ## 6 NA NA NA NA NA NA print(head(borrowers)) ## borrower_id account_type major date_created ## 1 1 student NA NA ## 2 2 staff NA NA ## 3 3 student NA NA ## 4 4 student NA NA ## 5 5 faculty NA NA ## 6 6 staff NA NA print(head(checkouts)) ## borrower_id book_id checkout_date due_date ## 1 1 1 12/21/23 6/18/24 ## 2 1 2 1/3/22 7/2/22 ## 3 2 2 10/12/20 4/10/21 ## 4 2 3 5/5/23 11/1/23 ## 5 2 4 12/12/21 6/10/22 ## 6 4 2 6/7/18 12/4/18 # get the table sizes print(dim(books)) ## [1] 9 24 print(dim(borrowers)) ## [1] 10 4 print(dim(checkouts)) ## [1] 7 4 One thing we can see is that the books table refers to physical copies of a book, so if the library owns two copies of the same book then the same title, publisher, etc. will appear on more than one row. In the previous section, we set a goal of augmenting the checkouts table with the information in the books table. To augment means to add to. We are going to be adding to the checkouts table, but do we add rows or columns? Each row of checkouts is an event that matches one book and one borrower. Adding new rows would be like adding new events that didn’t occur: not good! Each row has a book and each book has many columns of information in the books table. So we can maintain the relationships in the data while adding new columns of information to checkouts. How are we to know which books were checked out most often, or were generally checked out by the same people? The tables have different numbers of rows and columns, so we won’t be able to stack them side-by-side or one on top of the other. The “key” pieces of information are the columns books$book_id and borrowers$borrower_id. If you’ve ever used SQL, you may recall that each table should have a primary key, which is a column of values that identify the rows. In a database, the primary key must be unique - there can be no duplicates.
Most spreadsheets are not designed as databases, and they may have key or ID columns that are not unique. How to handle non-unique keys is going to be a recurring feature of this workshop. Now look at the checkouts table again. It has two ID columns: book_id and borrower_id. These match the borrower and book IDs in the borrowers and books tables. Obviously, these aren’t unique: one person may check out multiple books, and an exceptionally popular book might be checked out as many as three times from the same library! These are columns that identify unique keys in other tables, which SQL calls foreign keys. Now we can begin to reason about how to approach the goal of identifying the books that are most often checked out. We want to augment the checkouts table with the information in the books table, matching rows where book_id matches. Every row in the checkouts table should match exactly one row in the results and every row in the results should match exactly one row in the checkouts table. In the next section we will translate this plain-English description into the language used by dplyr. # Top ten books with most checkouts left_join(checkouts, books, by="book_id") |> group_by(book_id) |> summarize(title=first(title), author=first(author), n_checkouts=n()) |> arrange(desc(n_checkouts)) |> head(n=10) |> kable() book_id title author n_checkouts 2 jubilee charro 3 1 my alaska guy noir 1 3 the window ruiner ruined my windows zipperman 1 4 clowns in the clouds dark doug 1 5 dogs walked, tigers tamed, bars emptied lucky pierre 1 Just for fun, here is an instructive example of why relational tables are a better way to store data than putting everything into one spreadsheet. If we want to identify the authors whose books were most checked out from the UCD library, we might think to adapt our previous example to group by author rather than by book_id.
# Top ten authors with most checkouts left_join(checkouts, books, by="book_id") |> group_by(author) |> summarize(author=first(author), n_checkouts = n()) |> arrange(desc(n_checkouts)) |> head(n=10) |> kable() author n_checkouts charro 3 dark doug 1 guy noir 1 lucky pierre 1 zipperman 1 The problem is that the author column is a text field for author name(s), which is not a one-to-one match to a person. There are a lot of reasons: some books have multiple authors, some authors change their names, the order of personal name and family name may be reversed, and middle initials are sometimes included, sometimes not. A table of authors would allow you to refer to authors by a unique identifier and have it always point to the same name (this is what ORCID does for scientific publishing). 2.2.6.1 Three or More Tables A join operates on two tables, but you can combine multiple tables by doing several joins in a row. Let’s look at an example that combines checkouts, books, and borrowers in order to see how many books were checked out by students, faculty, and staff. # list the account types who checked out the most books left_join(checkouts, books, by="book_id") |> left_join(borrowers, by="borrower_id") |> group_by(borrower_id) |> summarize(account_type=first(account_type), n_checkouts = n()) |> arrange(desc(n_checkouts)) |> kable() borrower_id account_type n_checkouts 2 staff 3 1 student 2 4 student 1 5 faculty 1 2.2.7 Be Explicit Do you find it odd that we have to tell R exactly what kind of data join to do by calling one of left_join, right_join, inner_join, or full_join? Why isn’t there just one function called join that assumes you’re doing a left join unless you specifically provide an argument type like join(..., type="inner")?
If you think it would be confusing for R to make assumptions about what kind of data join we want, then you’re on the right track but you’ll want to watch out for these other cases where R has strong assumptions about what the default behavior should be. A general principle of programming is that explicit is better than implicit because writing information into your code explicitly makes it easier to understand what the code does. Here are some examples of implicit assumptions R will make unless you provide explicit instructions. 2.2.7.1 Handling Duplicate Keys Values in the key columns may not be unique. What do you think happens when you join using keys that aren’t unique? # Example datasets students = data.frame( student_id = c(1, 2, 3, 4, 4), name = c("Angel", "Beto", "Cici", "Desmond", "Erik")) grades = data.frame(student_id = c(2, 2, 3, 4, 4, 5), grade = c(90, 50, 85, 80, 75, 30)) # Left join left_join(students, grades, by = "student_id") |> kable() ## Warning in left_join(students, grades, by = "student_id"): Detected an unexpected many-to-many relationship between `x` and `y`. ## ℹ Row 2 of `x` matches multiple rows in `y`. ## ℹ Row 4 of `y` matches multiple rows in `x`. ## ℹ If a many-to-many relationship is expected, set `relationship = "many-to-many"` to silence this warning. student_id name grade 1 Angel NA 2 Beto 90 2 Beto 50 3 Cici 85 4 Desmond 80 4 Desmond 75 4 Erik 80 4 Erik 75 We get one row in the result for every possible combination of the matching keys. Sometimes that is what you want, and other times not. In this case, it might be reasonable that Beto, Desmond, and Erik have multiple grades in the book, but it is probably not reasonable that both Desmond and Erik have student ID 4 and have the same grades as each other. This is a many-to-many match, with all the risks we’ve mentioned before. 
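The warning message above suggests a remedy, and it works: declaring the relationship you expect makes the join explicit (a sketch with the same duplicated tables; the relationship parameter is available in dplyr 1.1.1 and later):

```r
library(dplyr)

students = data.frame(
  student_id = c(1, 2, 3, 4, 4),
  name = c("Angel", "Beto", "Cici", "Desmond", "Erik"))
grades = data.frame(
  student_id = c(2, 2, 3, 4, 4, 5),
  grade = c(90, 50, 85, 80, 75, 30))

# Declaring many-to-many up front silences the warning:
left_join(students, grades, by = "student_id",
          relationship = "many-to-many")

# A stricter declaration turns the surprise into an error instead:
# left_join(students, grades, by = "student_id",
#           relationship = "one-to-one")  # stops with an error here
```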
2.2.7.1.1 Specifying the Expected Relationship You can be explicit about what kind of relationship you expect in the join by specifying the relationship parameter. Your options are one-to-one, one-to-many, or many-to-one. Any of those will stop the code with an error if the data doesn’t match the relationship you told it to expect. If you leave the relationship parameter unset, R will allow a many-to-many join but will raise a warning. Pay attention to your warning messages! If you know in advance that you want a many-to-many join, then you can provide the argument relationship='many-to-many', which will do the same as leaving relationship unset, except it will not raise the warning. 2.2.7.1.2 Using Only Distinct Rows An alternative to handling duplicate keys is to subset the data to avoid duplicates in the first place. The dplyr package provides a function, distinct, which can help. When distinct finds duplicated rows, it keeps the first one. # Example datasets students = data.frame( student_id = c(1, 2, 3, 4, 4), name = c("Angel", "Beto", "Cici", "Desmond", "Erik")) grades = data.frame(student_id = c(2, 2, 3, 4, 4, 5), grade = c(90, 50, 85, 80, 75, 30)) # Left join distinct_keys_result = students |> distinct(student_id, .keep_all=TRUE) |> left_join(grades, by = "student_id") |> kable() 2.2.7.2 Ambiguous Columns When the two tables have columns with the same names, it is ambiguous which one to use in the result. R handles that situation by keeping both but changing the names to distinguish them: the column from the left table gets a .x appended by default and the column from the right table gets a .y appended by default. Let’s see an example. Suppose that the date_created column of the borrowers table had the name date instead. Then in the joined data it would be ambiguous with the date column of the checkouts table.
# Rename the date_created column of borrowers borrowers = rename(borrowers, date=date_created) # Now create the list of checkouts left_join(checkouts, books, by="book_id") |> left_join(borrowers, by="borrower_id") |> head(n=10) |> kable() borrower_id book_id checkout_date due_date title author subject Keywords minority_author Description. Contents. Series. Pages. Publisher. creation_date publication_date Edition. format venue Language. Source. Identifier. ISBN. location copies type Barcode. account_type major date 1 1 12/21/23 6/18/24 my alaska guy noir NA NA NA NA NA NA 100 NA NA NA NA NA NA NA NA NA NA NA NA NA NA student NA NA 1 2 1/3/22 7/2/22 jubilee charro NA NA NA NA NA NA 100 NA NA NA NA NA NA NA NA NA NA NA NA NA NA student NA NA 2 2 10/12/20 4/10/21 jubilee charro NA NA NA NA NA NA 100 NA NA NA NA NA NA NA NA NA NA NA NA NA NA staff NA NA 2 3 5/5/23 11/1/23 the window ruiner ruined my windows zipperman NA NA NA NA NA NA 100 NA NA NA NA NA NA NA NA NA NA NA NA NA NA staff NA NA 2 4 12/12/21 6/10/22 clowns in the clouds dark doug NA NA NA NA NA NA 100 NA NA NA NA NA NA NA NA NA NA NA NA NA NA staff NA NA 4 2 6/7/18 12/4/18 jubilee charro NA NA NA NA NA NA 100 NA NA NA NA NA NA NA NA NA NA NA NA NA NA student NA NA 5 5 10/1/23 3/29/24 dogs walked, tigers tamed, bars emptied lucky pierre NA NA NA NA NA NA 100 NA NA NA NA NA NA NA NA NA NA NA NA NA NA faculty NA NA If you aren’t satisfied with appending .x and .y to the ambiguous columns, then you can specify the suffix argument with a pair of strings like this: # Now create the list of checkouts left_join(checkouts, books, by="book_id", suffix=c("_checkout", "_book")) |> head(n=10) |> kable() borrower_id book_id checkout_date due_date title author subject Keywords minority_author Description. Contents. Series. Pages. Publisher. creation_date publication_date Edition. format venue Language. Source. Identifier. ISBN. location copies type Barcode. 
1 1 12/21/23 6/18/24 my alaska guy noir NA NA NA NA NA NA 100 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 2 1/3/22 7/2/22 jubilee charro NA NA NA NA NA NA 100 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 2 2 10/12/20 4/10/21 jubilee charro NA NA NA NA NA NA 100 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 2 3 5/5/23 11/1/23 the window ruiner ruined my windows zipperman NA NA NA NA NA NA 100 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 2 4 12/12/21 6/10/22 clowns in the clouds dark doug NA NA NA NA NA NA 100 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 4 2 6/7/18 12/4/18 jubilee charro NA NA NA NA NA NA 100 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 5 5 10/1/23 3/29/24 dogs walked, tigers tamed, bars emptied lucky pierre NA NA NA NA NA NA 100 NA NA NA NA NA NA NA NA NA NA NA NA NA NA By specifying the suffix argument, we get more meaningful column names in the result. 2.2.7.3 Missing Values The dplyr package has a default behavior that I think is dangerous. In the conditions of a join, NA==NA evaluates to TRUE, which is unlike the behavior anywhere else in R. This means that keys identified as NA will match other NAs in the join. This is a very strong assumption that seems to contradict the idea of a missing value: if we don’t actually know two keys, how can we say that they match? And if we know two keys have the same value, then they should be labeled in the data. In my opinion, it’s a mistake to have the computer make strong assumptions by default, especially if it does so without warning the user. Fortunately, there is a way to make the more sensible decision that NAs don’t match anything: include the argument na_matches='never' in the join.
# Example datasets students = data.frame( student_id = c(1, NA, 3, 4), name = c("Angel", "Beto", "Cici", "Desmond")) grades = data.frame(student_id = c(2, NA, 4, 5), grade = c(90, 85, 80, 75)) # Left joins left_join(students, grades, by = "student_id") |> kable() student_id name grade 1 Angel NA NA Beto 85 3 Cici NA 4 Desmond 80 left_join(students, grades, by = "student_id", na_matches = "never") |> kable() student_id name grade 1 Angel NA NA Beto NA 3 Cici NA 4 Desmond 80 Notice that since Beto’s student ID is NA, none of the rows in the grades table can match him. As a result, his grade is left NA in the result. 2.2.8 Conclusion You’ve now seen how to join data tables that can be linked by key columns. I encourage you to expand on the examples by posing questions and trying to write the code to answer them. Reading the documentation for join functions and join_by specifications is a great way to continue your learning journey by studying the (many!) special cases that we skipped over here. 3 Best Practices for Writing R Scripts 4 Squashing Bugs with R’s Debugging Tools 4.1 Printing Output 4.2 The Conditions System 4.3 Global Options 4.4 Debugging 4.5 Measuring Performance The major topics of this chapter are how to print output, how R’s conditions system for warnings and errors works, how to use the R debugger, and how to estimate the performance of R code.
Learning Objectives Identify and explain the difference between R’s various printing functions Use R’s conditions system to raise and catch messages, warnings, and errors Use R’s debugging functions to diagnose bugs in code Estimate the amount of memory a data set will require Use the lobstr package to get memory usage for an R object Describe what a profiler is and why you would use one Describe what kinds of profiling tools R provides 4.1 Printing Output Perhaps the simplest thing you can do to get a better understanding of some code is to make it print out lots of information about what’s happening as it runs. This section introduces several different functions for printing output and making that output easier to read. 4.1.1 The print Function The print function prints a string representation of an object to the console. The string representation is usually formatted in a way that exposes details important to programmers rather than users. For example, when printing a vector, the function prints the position of the first element on each line in square brackets [ ]: print(1:100) ## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 ## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 ## [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 ## [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 ## [91] 91 92 93 94 95 96 97 98 99 100 The print function also prints quotes around strings: print("Hi") ## [1] "Hi" These features make the print function ideal for printing information when you’re trying to understand some code or diagnose a bug. On the other hand, these features also make print a bad choice for printing output or status messages for users (including you). R calls the print function automatically anytime a result is returned at the prompt.
Thus it’s not necessary to call print to print something when you’re working directly in the console—only from within loops, functions, scripts, and other code that runs non-interactively. The print function is an S3 generic (see Section 6.4), so if you create an S3 class, you can define a custom print method for it. For S4 objects, R uses the S4 generic show instead of print. 4.1.2 The message Function To print output for users, the message function is the one you should use. The main reason for this is that the message function is part of R’s conditions system for reporting status information as code runs. This makes it easier for other code to detect, record, respond to, or suppress the output. Section 4.2 will explain the conditions system in more detail. The message function prints its argument(s) and a newline to the console: message("Hello world!") ## Hello world! If an argument isn’t a string, the function automatically and silently attempts to coerce it to one: message(4) ## 4 Some types of objects can’t be coerced to a string: message(sqrt) ## Error in FUN(X[[i]], ...): cannot coerce type 'builtin' to vector of type 'character' For objects with multiple elements, the function pastes together the string representations of the elements with no separators in between: x = c(1, 2, 3) message(x) ## 123 Similarly, if you pass the message function multiple arguments, it pastes them together with no separators: name = "R" message("Hi, my name is ", name, " and x is ", x) ## Hi, my name is R and x is 123 This is a convenient way to print names or descriptions alongside values from your code without having to call a formatting function like paste. You can make the message function print something without adding a newline at the end by setting the argument appendLF = FALSE.
The difference can be easy to miss unless you make several calls to message, so the say_hello function in this example calls message twice: say_hello = function(appendLF) { message("Hello", appendLF = appendLF) message(" world!") } say_hello(appendLF = TRUE) ## Hello ## world! say_hello(appendLF = FALSE) ## Hello world! Note that RStudio always adds a newline in front of the prompt, so making an isolated call to message with appendLF = FALSE appears to produce the same output as with appendLF = TRUE. This is an example of a situation where RStudio leads you astray: in an ordinary R console, the two are clearly different. 4.1.3 The cat Function The cat function, whose name stands for “concatenate and print,” is a low-level way to print output to the console or a file. The message function prints output by calling cat, but cat is not part of R’s conditions system. The cat function prints its argument(s) to the console. It does not add a newline at the end: cat("Hello") ## Hello As with message, RStudio hides the fact that there’s no newline if you make an isolated call to cat. The cat function coerces its arguments to strings and concatenates them. By default, a space is inserted between arguments and their elements: cat(4) ## 4 cat(x) ## 1 2 3 cat("Hello", "Nick") ## Hello Nick You can set the sep parameter to control the separator cat inserts: cat("Hello", "world", x, sep = "|") ## Hello|world|1|2|3 If you want to write output to a file rather than to the console, you can call cat with the file parameter set. However, it’s preferable to use functions tailored to writing specific kinds of data, such as writeLines (for text) or write.table (for tabular data), since they provide additional options to control the output. Many scripts and packages still use cat to print output, but the message function provides more flexibility and control to people running the code. Thus it’s generally preferable to use message in new code. 
Nevertheless, there are a few specific cases where cat is useful—for example, if you want to pipe data to a UNIX shell command. See ?cat for details. 4.1.4 Formatting Output R provides a variety of ways to format data before you print it. Taking the time to format output carefully makes it easier to read and understand, as well as making your scripts seem more professional. 4.1.4.1 Escape Sequences One way to format strings is by adding (or removing) escape sequences. An escape sequence is a sequence of characters that represents some other character, usually one that’s invisible (such as whitespace) or doesn’t appear on a standard keyboard. In R, escape sequences always begin with a backslash. For example, \n is a newline. The message and cat functions automatically convert escape sequences to the characters they represent: x = "Hello\nworld!" message(x) ## Hello ## world! The print function doesn’t convert escape sequences: print(x) ## [1] "Hello\nworld!" Some escape sequences trigger special behavior in the console. For example, ending a line with a carriage return \r makes the console print the next line over the current line. Try running this code in a console (it’s not possible to see the result in a static book): # Run this in an R console. for (i in 1:10) { message(i, "\r", appendLF = FALSE) # Wait 0.5 seconds. Sys.sleep(0.5) } You can find a complete list of escape sequences in ?Quotes. 4.1.4.2 Formatting Functions You can use the sprintf function to apply specific formatting to values and substitute them into strings. The function uses a mini-language to describe the formatting and substitutions. The sprintf function (or something like it) is available in many programming languages, so being familiar with it will serve you well on your programming journey. The key idea is that substitutions are marked by a percent sign % and a character.
The character indicates the kind of data to be substituted: s for strings, i for integers, f for floating point numbers, and so on. The first argument to sprintf must be a string, and subsequent arguments are values to substitute into the string (from left to right). For example: sprintf("My age is %i, and my name is %s", 32, "Nick") ## [1] "My age is 32, and my name is Nick" You can use the mini-language to do things like specify how many digits to print after a decimal point. Format settings for a substituted value go between the percent sign % and the character. For instance, here’s how to print pi with 2 digits after the decimal: sprintf("%.2f", pi) ## [1] "3.14" You can learn more by reading ?sprintf. Much simpler are the paste and paste0 functions, which coerce their arguments to strings and concatenate (or “paste together”) them. The paste function inserts a space between each argument, while the paste0 function doesn’t: paste("Hello", "world") ## [1] "Hello world" paste0("Hello", "world") ## [1] "Helloworld" You can control the character inserted between arguments with the sep parameter. By setting an argument for the collapse parameter, you can also use the paste and paste0 functions to concatenate the elements of a vector. The argument to collapse is inserted between the elements. For example, suppose you want to paste together the elements of a vector, inserting a comma and a space (", ") in between: paste(1:3, collapse = ", ") ## [1] "1, 2, 3" Members of the R community have developed many packages to make formatting strings easier: cli – helper functions for developing command-line interfaces, including functions to add color, progress bars, and more. glue – alternatives to sprintf for string interpolation. stringr – a collection of general-purpose string manipulation functions. 4.1.5 Logging Output Logging means saving the output from some code to a file as the code runs.
The file where the output is saved is called a log file or log, but this name isn’t indicative of a specific format (unlike, say, a “CSV file”). It’s a good idea to set up some kind of logging for any code that takes more than a few minutes to run, because then if something goes wrong you can inspect the log to diagnose the problem. Think of any output that’s not logged as ephemeral: it could disappear if someone reboots the computer, or there’s a power outage, or some other, unforeseen event. R’s built-in tools for logging are rudimentary, but members of the community have developed a variety of packages for logging. Here are a few that are still actively maintained as of January 2023: logger – a relatively new package that aims to improve aspects of other logging packages that R users find confusing. futile.logger – a popular, mature logging package based on Apache’s Log4j utility and on R idioms. logging – a mature logging package based on Python’s logging module. loggit – integrates with R’s conditions system and writes logs in JavaScript Object Notation (JSON) format so they are easy to inspect programmatically. log4r – another package based on Log4j with an object-oriented programming approach. 4.2 The Conditions System R’s conditions system provides a way to signal and handle unusual conditions that arise while code runs. With the conditions system, you can make R print status, warning, and error messages that make it easier for users to understand what your code is doing and whether they’re using it as intended. The conditions system also makes it possible to safely run code that might cause an error, and respond appropriately in the event that it does. In short, understanding the conditions system will enable you to write code that’s easier to use and more robust. 4.2.1 Raising Conditions The message, warning, and stop functions are the primary ways to raise, or signal, conditions. The message function was described in Section 4.1.2.
A message provides status information about running code, but does not necessarily indicate that something has gone wrong. You can use messages to print out any information you think might be relevant to users. The warning function raises a warning. Warnings indicate that something unexpected happened, but that it didn’t stop the code from running. By default, R doesn’t print warnings to the console until code finishes running, which can make it difficult to understand their cause; Section 4.3 explains how to change this setting. Unnamed arguments to the warning function are concatenated with no separator between them, in the same way as arguments to the message function. For example: warning("Objects in mirror", " may be closer than they appear.") ## Warning: Objects in mirror may be closer than they appear. Warnings are always printed with Warning: before the message. By default, calling warning from the body of a function also prints the name of the function: f = function(x, y) { warning("This is a warning!") x + y } f(3, 4) ## Warning in f(3, 4): This is a warning! ## [1] 7 The name of the function that raised the warning is generally useful information for users that want to correct whatever caused the warning. Occasionally, you might want to disable this behavior, which you can do by setting call. = FALSE: f = function(x, y) { warning("This is a warning!", call. = FALSE) x + y } f(3, 4) ## Warning: This is a warning! ## [1] 7 The warning function also has several other parameters that control when and how warnings are displayed. The stop function raises an error, which indicates that something unexpected happened that prevents the code from running, and immediately stops the evaluation of code. As a result, R prints errors as soon as they’re raised. 
For instance, in this function, the line x + y never runs: f = function(x, y) { stop() x + y } f(3, 4) ## Error in f(3, 4): Like message and warning, the stop function concatenates its unnamed arguments into a message to print: stop("I'm afraid something has gone terribly wrong.") ## Error in eval(expr, envir, enclos): I'm afraid something has gone terribly wrong. Errors are always printed with Error: before the error message. You can use the call. parameter to control whether the error message also includes the name of the function from which stop was called. When writing code—especially functions, executable scripts, and packages—it’s a good habit to include tests for unexpected conditions such as invalid arguments and impossible results. If the tests detect a problem, use the warning or stop function (depending on severity) to signal what the problem is. Try to provide a concise but descriptive warning or error message so that users can easily understand what went wrong. 4.2.2 Handling Conditions In some cases, you can anticipate the problems likely to occur when code runs and can even devise ways to work around them. As an example, suppose your code is supposed to load parameters from a configuration file, but the path to the file provided by the user is invalid. It might still be possible for your code to run by falling back on a set of default parameters. R’s conditions system provides a way to handle or “catch” messages, warnings, and errors, and to run alternative code in response. You can use the try function to safely run code that might produce an error. If no error occurs, the try function returns whatever the result of the code was. If an error does occur, the try function prints the error message and returns an object of class try-error, but evaluation does not stop. 
For example: bad_add = function(x) { # No error x1 = try(5 + x) # Error x2 = try("yay" + x) list(x1, x2) } bad_add(10) ## Error in "yay" + x : non-numeric argument to binary operator ## [[1]] ## [1] 15 ## ## [[2]] ## [1] "Error in \"yay\" + x : non-numeric argument to binary operator\n" ## attr(,"class") ## [1] "try-error" ## attr(,"condition") ## <simpleError in "yay" + x: non-numeric argument to binary operator> The simplest thing you can do in response to an error is ignore it. This is usually not a good idea, but if you understand exactly what went wrong, can’t fix it easily, and know it won’t affect the rest of your code, doing so might be the best option. A more robust approach is to inspect the result from a call to try to see if an error occurred, and then take some appropriate action if one did. You can use the inherits function to check whether an object has a specific class, so here’s a template for how to run code that might cause an error, check for the error, and respond to it: result = try({ # Code that might cause an error. }) if (inherits(result, "try-error")) { # Code to respond to the error. } You can prevent the try function from printing error messages by setting silent = TRUE. This is useful when your code is designed to detect and handle the error, so you don’t want users to think an error occurred. The tryCatch function provides another way to handle conditions raised by a piece of code. It requires that you provide a handler function for each kind of condition you want to handle. The kinds of conditions are: message warning error interrupt – when the user interrupts the code (for example, by pressing Ctrl-C) Each handler function must accept exactly one argument. When you call tryCatch, if the suspect code raises a condition, then it calls the associated handler function and returns whatever the handler returns. Otherwise, tryCatch returns the result of the code.
Here’s an example of using tryCatch to catch an error: bad_fn = function(x, y) { stop("Hi") x + y } err = tryCatch(bad_fn(3, 4), error = function(e) e) And here’s an example of using tryCatch to catch a message: msg_fn = function(x, y) { message("Hi") x + y } msg = tryCatch(msg_fn(3, 4), message = function(e) e) The tryCatch function always silences conditions. Details about raised conditions are provided in the object passed to the handler function, which has class condition (and a more specific class that indicates what kind of condition it is). If you want to learn more about R’s conditions system, start by reading ?conditions. 4.3 Global Options R’s global options control many different aspects of how R works. They’re relevant to the theme of this chapter because some of them control when and how R displays warnings and errors. You can use the options function to get or set global options. If you call the function with no arguments, it returns the current settings: opts = options() # Display the first 6 options. head(opts) ## $add.smooth ## [1] TRUE ## ## $bitmapType ## [1] "cairo" ## ## $browser ## [1] "" ## ## $browserNLdisabled ## [1] FALSE ## ## $callr.condition_handler_cli_message ## function (msg) ## { ## custom_handler <- getOption("cli.default_handler") ## if (is.function(custom_handler)) { ## custom_handler(msg) ## } ## else { ## cli_server_default(msg) ## } ## } ## <bytecode: 0x559004a07e50> ## <environment: namespace:cli> ## ## $CBoundsCheck ## [1] FALSE This section only explains a few of the options, but you can read about all of them in ?options. The warn option controls how R handles warnings. It can be set to three different values: 0 – (the default) warnings are only displayed after code finishes running. 1 – warnings are displayed immediately. 2 – warnings stop code from running, like errors. Setting warn = 2 is useful for pinpointing expressions that raise warnings.
Setting warn = 1 makes it easier to determine which expressions raise warnings, without the inconvenience of stopping code from running. That makes it a good default (better than the actual default). You can use the options function to change the value of the warn option: options(warn = 1) When you set an option this way, the change only lasts until you quit R. Next time you start R, the option will go back to its default value. Fortunately, there is a way to override the default options every time R starts. When R starts, it searches for a .Rprofile file. The file is usually in your system’s home directory (see this section of the R Basics Reader for how to locate your home directory). Customizing your .Rprofile file is one of the marks of an experienced R user. If you define a .First function in your .Rprofile, R will call it automatically during startup. Here’s an example .First function: .First = function() { # Only change options if R is running interactively. if (!interactive()) return() options( # Don't print more than 1000 elements of anything. max.print = 1000, # Warn on partial matches. warnPartialMatchAttr = TRUE, warnPartialMatchDollar = TRUE, warnPartialMatchArgs = TRUE, # Print warnings immediately (2 = warnings are errors). warn = 1 ) } You can learn more about the .Rprofile file and R’s startup process at ?Startup. 4.4 Debugging Debugging code is the process of confirming, step-by-step, that what you believe the code does is what the code actually does. The key idea is to check each step (or expression) in the code. There are two different strategies for doing this: Work forward through the code from the beginning. Work backward from the source of an error. R has built-in functions to help with debugging. The browser() function pauses the running code and starts R’s debugging system. For example: # Run this in an R console.
f = function(n) { total = 0 for (i in 1:n) { browser() total = total + i } total } f(10) The most important debugger commands are: n to run the next line s to “step into” a call c to continue running the code Q to quit the debugger where to print call stack help to print debugger help Another example: # Run this in an R console. g = function(x, y) (1 + x) * y f = function(n) { total = 0 for (i in 1:n) { browser() total = total + g(i, i) } total } f(11) 4.4.1 Other Functions The debug() function places a call to browser() at the beginning of a function. Use debug() to debug functions that you can’t or don’t want to edit. For example: # Run this in an R console. f = function(x, y) { x + y } debug(f) f(5, 5) You can use undebug() to reverse the effect of debug(): # Run this in an R console. undebug(f) f(10, 20) The debugonce() function places a call to browser() at the beginning of a function for the next call only. The idea is that you then don’t have to call undebug(). For instance: # Run this in an R console. debugonce(f) f(10, 20) f(3, 4) Finally, the global option error can be used to make R enter the debugger any time an error occurs. Set the option to error = recover: options(error = recover) Then try this example: # Run this in an R console. bad_fn = function(x, y) { stop("Hi") x + y } bad_fn(3, 4) 4.5 Measuring Performance How quickly code runs and how much memory it uses can be just as much of an obstacle to research computing tasks as errors and bugs. This section describes some of the strategies you can use to estimate or measure the performance characteristics of code, so that you can identify potential problems and fix them. 4.5.1 Estimating Memory Usage Running out of memory can be extremely frustrating, because it can slow down your code or prevent it from running at all. 
It’s useful to know how to estimate how much memory a given data structure will use so that you can determine whether a programming strategy is feasible before you even start writing code. The central processing units (CPUs) in most modern computers are designed to work most efficiently with 64 bits of data at a time. Consequently, R and other programming languages typically use 64 bits to store each number (regardless of type). While the data structures R uses create some additional overhead, you can use this fact to do back-of-the-envelope calculations about how much memory a vector or matrix of numbers will require. Start by determining how many elements the data structure will contain. Then multiply by 64 bits and divide by 8 to convert bits to bytes. You can then repeatedly divide by 1024 to convert to kilobytes, megabytes, gigabytes, or terabytes. For instance, a vector of 2 million numbers will require approximately this many megabytes: n = 2000000 n * (64 / 8) / 1024^2 ## [1] 15.25879 You can even write an R function to do these calculations for you! If you’re not sure whether a particular programming strategy is realistic, do the memory calculations before you start writing code. This is a simple way to avoid strategies that are inefficient. If you’ve already written some code and it runs out of memory, the first step to fixing the problem is identifying the cause. The lobstr package provides functions to explore how R is using memory. You can use the mem_used function to get the amount of memory R is currently using: library("lobstr") mem_used() ## 42.08 MB Sometimes the culprit isn’t your code, but other applications on your computer. Modern web browsers are especially memory-intensive, and closing yours while you run code can make a big difference.
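As suggested above, the back-of-the-envelope memory calculation is easy to wrap in a small function. Here's a sketch; the name vector_mb is made up for this example, and it assumes 8 bytes (64 bits) per element:

```r
# Estimate the memory (in megabytes) needed for a vector of n numbers,
# assuming 64 bits (8 bytes) per element. The name vector_mb is
# hypothetical, not part of any package.
vector_mb = function(n, bytes_per_element = 8) {
  n * bytes_per_element / 1024^2
}

vector_mb(2000000)
## [1] 15.25879
```

Changing bytes_per_element lets you redo the estimate for other element sizes (for instance, 4 bytes for R's integer type).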
If you’ve determined that your code is the reason R runs out of memory, you can use the obj_size function to get how much memory objects in your code actually use: obj_size(1) ## 56 B x = runif(n) obj_size(x) ## 16.00 MB obj_size(mtcars) ## 7.21 kB If a specific object created by your code uses a lot of memory, think about ways you might change the code to avoid creating the object or avoid creating the entire object at once. For instance, consider whether it’s possible to create part of the object, save that to disk, remove it from memory, and then create another part. 4.5.2 Benchmarking Benchmarking means timing how long code takes to run. Benchmarking is useful for evaluating different strategies to solve a computational problem and for understanding how quickly (or slowly) your code runs. When you benchmark code, it’s important to collect and aggregate multiple data points so that your estimates reflect how the code performs on average. R has built-in functions for timing code, but several packages provide functions that are more convenient for benchmarking, because they automatically run the code multiple times and return summary statistics. The two most mature packages for benchmarking are: microbenchmark bench The microbenchmark package is simpler to use. It provides a single function, microbenchmark, for carrying out benchmarks. The function accepts any number of expressions to benchmark as arguments. For example, to compare the speed of runif and rnorm (as A and B respectively): library("microbenchmark") microbenchmark(A = runif(1e5), B = rnorm(1e5)) ## Unit: milliseconds ## expr min lq mean median uq max neval ## A 2.823220 3.201357 3.389076 3.229029 3.365375 8.54375 100 ## B 5.886882 6.175219 6.408401 6.246132 6.371443 12.54406 100 The microbenchmark function has parameters to control the number of times each expression runs, the units for the timings, and more. You can find the details in ?microbenchmark.
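For a rough, one-off timing without installing a package, you can use R's built-in system.time function, which times a single evaluation of an expression. A minimal sketch (note that, unlike microbenchmark, it runs the code only once, so the result varies from run to run):

```r
# Time a single evaluation of an expression with the built-in
# system.time function. This runs the code just once, so it gives a
# rough estimate rather than a distribution of timings.
timing = system.time(sum(runif(1e6)))

# The "elapsed" entry is the wall-clock time in seconds.
timing["elapsed"]
```

Because a single run is noisy, prefer microbenchmark or bench when you need to compare alternatives carefully.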
4.5.3 Profiling Profiling code means collecting data about the code as it runs, and a profiler is a program that profiles code. A typical profiler estimates how much time is spent on each expression (as actual time or as a percentage of total runtime) and how much memory the code uses over time. Profiling is a good way to determine which parts of your code are performance bottlenecks, so that you can target them when you try to optimize your code. R has a built-in profiler. You can use the Rprof function to enable or disable the profiler. Essential parameters for the function are: filename – a path to a file for storing results. Defaults to Rprof.out. interval – the time between samples, in seconds. memory.profiling – whether to track memory in addition to time. Set these parameters in the first call you make to Rprof, which will enable the profiler. Then run the code you want to profile. At the end of the code, call Rprof(NULL) to disable the profiler. The profiler saves the collected data to a file. You can use the summaryRprof function to read the profile data and get a summary. Essential parameters for this function are: filename – the path to the results file. Defaults to Rprof.out. memory – how to display memory information. Use "both" to see total changes. The summary lists times in seconds and memory in bytes. The profvis package provides an interactive graphical interface for exploring profile data collected with Rprof. Examining profile data graphically makes it easier to interpret the results and to identify patterns. 5 Data Visualization in R We are here today to learn how to do data visualization in R. Some of you will have recently done the Principles of Data Visualization workshop.
There you were given a checklist of questions to guide you as you create a plot, which we are going to use today. The checklist is here. 5.1 Our Friend ggplot2 We will be using the R package ggplot2 to create data visualizations. Install it via the install.packages() function. While we are at it, let’s make sure we install all of the packages that we’ll need for today’s workshop. Beyond ggplot2, we’ll use readr for reading data files, dplyr for manipulating data, and palmerpenguins, which provides a nice dataset. install.packages("ggplot2") install.packages("dplyr") install.packages("readr") install.packages("palmerpenguins") ggplot2 is an enormously popular R package that provides a way to create data visualizations through a so-called “grammar of graphics” (hence the “gg” in the name). That grammar interface may be a little bit unintuitive at first, but once you grasp it, you hold enormous power to quickly craft plots. It doesn’t hurt that ggplot2 plots look great, too. 5.1.1 The Grammar of Graphics The grammar of graphics breaks the elements of statistical graphics into parts in an analogy to human language grammars. Knowing how to put together subject nouns, object nouns, verbs, and adjectives allows you to construct sentences that express meaning. Similarly, the grammar of graphics is a collection of layers and the rules for putting them together to graphically express the meaning of your data. 5.2 Example: Palmer Penguins Let’s look at an example. This uses data from the palmerpenguins package that you just installed (make sure to load the package with library(palmerpenguins)). It is measurements of 344 penguins, collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network. The data package was created by Allison Horst. Before jumping in, let’s have a look at the data and the image we want to create.
head(penguins) ## # A tibble: 6 × 8 ## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g ## <fct> <fct> <dbl> <dbl> <int> <int> ## 1 Adelie Torgersen 39.1 18.7 181 3750 ## 2 Adelie Torgersen 39.5 17.4 186 3800 ## 3 Adelie Torgersen 40.3 18 195 3250 ## 4 Adelie Torgersen NA NA NA NA ## 5 Adelie Torgersen 36.7 19.3 193 3450 ## 6 Adelie Torgersen 39.3 20.6 190 3650 ## # ℹ 2 more variables: sex <fct>, year <int> Plot of bill length vs. flipper length for the Palmer Penguins 5.2.1 Examining the Plot Referring to the graphics checklist, we see that this plot has two numerical features (bill length and flipper length), expressed using a scatter plot. There is also a categorical feature (species), which is indicated by the different colors and shapes of the plot. The plot expresses the fact that flipper length is positively associated with bill length for all three species of penguins, but the sizes and the relationships between them are unique to each species. There is a title and a legend, the axes are labeled with units, and all of the text is in plain language. There is a risk that the data may hide the message, so a smoothing line is added to each species for clarity. The colors are accessible (avoiding red/green colorblindness issues). This is a good dataviz, so let’s duplicate it!
5.2.2 Duplicating the Palmer Penguins Plot Here is the code to make the plot: # matching the Allison Horst penguins plot ggplot(penguins) + aes(x = flipper_length_mm, y = bill_length_mm, color = species, shape = species) + geom_point() + geom_smooth(method = lm, se = FALSE) + xlab("Flipper length (mm)") + ylab("Bill length (mm)") + ggtitle( "Flipper and bill length", subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo penguins at Palmer Station LTER" ) + labs(color = "Penguin species", shape = "Penguin species") + scale_color_brewer(palette = "Dark2") 5.2.3 Analysis This is a complicated data visualization that includes features we haven’t covered yet, so let’s go into how it works. You might have noticed how the code to make the plot is broken into a bunch of function calls, all joined by plus signs (+). Each function does something small and specific. The functions and the ability to add them together provide a powerful and flexible “grammar” to describe the desired plot. Our plot begins the way that most do - by calling ggplot() with the data as the argument. This creates a plot object (but doesn’t draw it) and sets the data. Then we add a so-called “aesthetic mapping” with the aes() function. An aesthetic mapping describes how features of the data map to visual channels of the plot. In our case, that means mapping the flipper length to the x (horizontal) direction, the bill length to the y (vertical) direction, and mapping species to both color and shape. Next, we add a geometry to describe what kind of marks to use in drawing the plot. Here you can refer to the table at the top of the graphics checklist that suggests geometries to use for different kinds of features. We have numeric features for both x and y, so the table suggests line, scatter (points), and heatmap. We’ve selected points (geom_point()) because we want to show the individual penguins (lines would imply a chain of connections from one penguin to the next.)
Those three parts (data, a geometry, and a map between the two) would be enough to get a basic plot that looks like this: # Make a basic penguin plot with just data, # a geometry, and a map between the two. ggplot(penguins) + aes(x = flipper_length_mm, y = bill_length_mm, color = species, shape = species) + geom_point() We know, though, that this plot is not complete. In particular, there is no title and the axes aren’t labeled meaningfully. Also, the clouds of points seem to hide the meaning that we are trying to convey, and the colors aren’t colorblind-safe. The rest of the pieces of the plot call are meant to address those shortcomings. We add a second geometry layer with geom_smooth(method = lm, se = FALSE), which specifies the lm method in order to draw a straight (instead of wiggly) smoother through the data. The x-axis label, y-axis label, and title are set by xlab(), ylab(), and ggtitle(), respectively. We want a more informative label for the legend title than just the variable name (“Penguin species” instead of “species”), which is handled by the labs() function. And you’ll recall from the principles of data visualization that you can use Cynthia Brewer’s Color Brewer website to select colorblind-friendly color schemes. Color Brewer is integrated directly into ggplot2, so the scale_color_brewer() function can pull a named color scheme from Color Brewer directly into your plot as the color scale. We can begin to better understand the grammar of graphics as we consider this example. Recognize that our data visualization conveys information via several visual channels that express data as visual marks. The geometry determines how those marks are drawn on the page, which can be set separately from the mapping.
Let’s see a couple of examples of that: # placing plots via gridExtra library(gridExtra) # plot the Palmer penguin data with a line geometry peng_line = ggplot(penguins) + aes(x = flipper_length_mm, y = bill_length_mm, color = species, shape = species) + geom_line() # plot the Palmer penguin data with a hex heatmap geometry peng_hex = ggplot(penguins) + aes(x = flipper_length_mm, y = bill_length_mm, color = species, shape = species) + geom_hex() # place the plots side-by-side grid.arrange(peng_line, peng_hex, ncol = 2) You can see how changing the geometry but not the mapping will plot the same data with a different method. Separating the mapping of features to channels from the drawing of marks is at the core of the grammar of graphics. This separation of functions gives ggplot2 its power by allowing us to compose a small number of functions to express data in unlimited ways (kind of like poetry). Recognizing the grammar of graphics allows us to reason in a consistent way about different kinds of plots, and make intelligent assumptions about mappings and geometries. 5.3 Layers Layers are the building blocks of the grammar of graphics. The typical pattern is that you express the idea of a plot in the grammar of graphics by adding layers with the addition symbol (+). There aren’t even that many layers to know! Here is the list, and the name of the function(s) you’ll use to control the layer. Some of the names include asterisks because there are a lot of similar options - for instance, geometry layers include geom_point(), geom_line(), geom_boxplot(), and many more. See the comprehensive listing on the official ggplot2 website. Data (ggplot()) - provides the data for the visualization. Aesthetic mapping (aes()) - a mapping that indicates which variables in the data control which channel in the plot (recall from the Principles of Data Visualization that a “channel” is used in an abstract way to include things like shape, color, and line width.)
Geometry (geom_*()) - how the marks will be drawn in the figure. Statistical transform (stat_*()) - alters the data before drawing - for instance binning or removing duplicates. Scale (scale_*()) - used to control the way that values in the data are mapped to the channels. For instance, you can control how numbers or categories in the data map to colors. Coordinates (coord_*()) - used to control how the data are mapped to plot axes. Facets (facet_*()) - used to separate the data into subplots called “facets”. Theme (theme()) - modifies plot details like titles, labels, and legends. 5.4 Guidelines for Graphics I’ve attached the PDF checklist for creating good data visualizations, created by Nick Ulle of UC Davis Datalab. Download it and keep a copy around - it’s an excellent guide. I’m going to go over how the checklist translates into the grammar of graphics. 5.4.1 Data You can’t have a data visualization without data! ggplot2 expects that your data is tidy, which means that each row is a complete observation and each column is a unique feature. In fact, ggplot2 is part of an actively developing collection of packages called the tidyverse that provides ways to create and work with tidy data. You don’t have to adopt the entire tidyverse to use ggplot2, though. 5.4.2 Feature Types The first item on the list is a table of options for geometries that are commonly relevant for a given kind of data - for instance, a histogram is a geometry that can be used with a single numeric feature, and a box plot can be used with one numeric and one categorical feature. Should it be a dot plot? Pie plots are hard to read and bar plots don’t use space efficiently (Cleveland and McGill 1990; Heer and Bostock 2010). Generally a dot plot is a better choice. 5.4.3 Theme Guidelines Does the graphic convey important information? Don’t include graphics that are uninformative or redundant. Title? Make sure the title explains what the graphic shows. Axis labels?
Label the axes in plain language (no variable names!). Axis units? Label the axes with units (inches, dollars, etc). Legend? Any graphic that shows two or more categories coded by style or color must include a legend. 5.4.4 Scale Guidelines Appropriate scales and limits? Make sure the scales and limits of the axes do not lead people to incorrect conclusions. For side-by-side graphics or graphics that viewers will compare, use identical scales and limits. Print safe? Design graphics to be legible in black & white. Color is great, but use point and line styles to distinguish groups in addition to color. Also try to choose colors that are accessible to colorblind people. The RColorBrewer and viridis packages can help with choosing colors. 5.4.5 Facet Guidelines No more than 5 lines? Line plots with more than 5 lines risk becoming hard-to-read “spaghetti” plots. Generally a line plot with more than 5 lines should be split into multiple plots with fewer lines. If the x-axis is discrete, consider using a heat map instead. No overplotting? Scatter plots where many plot points overlap hide the actual patterns in the data. Consider splitting the data into facets, making the points smaller, or using a two-dimensional density plot (a smooth scatter plot) instead. 5.5 Case Studies We have covered enough of the grammar of graphics that you should begin to see the patterns in how it is used to express graphical ideas for ggplot2. Now we will work through some examples. 5.5.1 Counting Penguins First, let’s revisit the penguins data. There are three categorical features in the data: species, island, and sex. Let’s use geom_bar() to count how many penguins of each species and/or sex were observed on each island. The x-axis of the plot should be the island, but note that there are multiple values of species and sex that have the same position on that x-axis.
In this case, you can use the position_dodge() or position_stack() arguments to tell ggplot2 how to handle the second grouping channel. # count the penguins on each island ggplot(penguins) + aes(x=island) + geom_bar() + xlab("Island") + ylab("Count") + ggtitle("Count of penguins on each island") #count the penguins of each sex on each island ggplot(penguins) + aes(x=island, fill=sex) + geom_bar(position=position_dodge()) + scale_fill_grey() + theme_bw() + xlab("Island") + ylab("Count") + ggtitle("Count of penguins on each island by sex") Alternatively, you can use facets to separate the data into multiple plots based on a data feature. Let’s see how that works by faceting the plots by species. One way to show more information more clearly in a plot is to break the plot into pieces that each show part of the information. In ggplot2, this is called faceting the plot. There are two main facet functions, facet_grid() (which puts plots in a grid), and facet_wrap(), which puts plots side-by-side until it runs out of room, then wraps to a new line. We’ll use facet_wrap() here, with the first argument being ~species. This tells ggplot2 to break the plot into pieces by plotting the data for each species separately. #count the penguins of each species on each island ggplot(penguins) + aes(x=island) + geom_bar() + scale_fill_grey() + theme_bw() + xlab("Island") + ylab("Count") + ggtitle("Count of penguins on each island by species") + facet_wrap(~species, ncol=3) 5.5.2 Experimental Data with Error Bars Here’s an example that recently came up in my office hours. You’ve done an experiment to see how mice with two different genotypes respond to two different treatments. Now you want to plot the mean response of each group as a column, with error bars indicating the standard error of the mean. You also want to show the raw data. I’ve simulated some data for us to use - download it here. 
This one is kind of complicated because you have to tell ggplot2 how to calculate the height of the columns and of the error bars. This involves computing summary statistics within the plot by setting stat='summary'. mice_geno = read_csv("data/genotype-response.csv") # show the treatment response for different genotypes ggplot(mice_geno) + aes(x=trt, y=resp, fill=genotype) + scale_fill_brewer(palette="Dark2") + geom_bar(position=position_dodge(), stat='summary', fun='mean') + geom_errorbar(fun.min=function(x) {mean(x) - sd(x) / sqrt(length(x))}, fun.max=function(x) {mean(x) + sd(x) / sqrt(length(x))}, stat='summary', position=position_dodge(0.9), width=0.2) + geom_point(position= position_jitterdodge( dodge.width=0.9, jitter.width=0.1)) + xlab("Treatment") + ylab("Response (mm/g)") + ggtitle("Mean growth response of mice by genotype and treatment") 5.5.3 Bird Flu Mortality People mail dead birds to the USDA and USGS, where scientists analyze the birds to find out why they died. Right now there is a bird flu epidemic, and the USDA provides public data about the birds in which the disease has been detected. You can access the data here or see the official USDA webpage here. After you download the data, we will load the data and do some visualization. 
# load the bird flu data downloaded from the USDA website flu <- read_csv("data/hpai-wild-birds-ver2.csv") flu$date <- mdy(flu$`Date Detected`) # plot a histogram of when bird flu was detected ggplot(flu) + aes(x = date) + geom_histogram() + ggtitle("Bird flu detections in wild birds") + xlab("Date") + ylab("Count") # plot the histogram again, colored by sampling method ggplot(flu) + aes(x = date, fill = `Sampling Method`) + geom_histogram() + ggtitle("Bird flu detections in wild birds") + xlab("Date") + ylab("Count") # bar chart shows how the bird flu reports compare between west coast states subset(flu, State %in% c("California", "Oregon", "Washington")) |> ggplot() + aes(x = State, fill = `Sampling Method`) + geom_bar() + ggtitle("Bird flu detections in wild birds (West coast states)") + ylab("Count") Let’s compare the bird flu season to the human flu season. Download hospitalization data for the 2021-2022 and 2022-2023 flu seasons from the CDC website here or see the official Centers for Disease Control website here. After you download the data, we will see how adding a second data series works a little differently from the first. That’s because composing data, aesthetic mapping, and geometry with addition only works when there is no ambiguity about which data series is being mapped or drawn. After downloading the data, there is some work required to adjust the dates and change the hospitalization rate from cases per 100,000 to cases per 10 million, which better matches the scale of the bird flu data. 
# processing CDC flu data: cdc <- read_csv("data/FluSurveillance_Custom_Download_Data.csv", skip = 2) cdc$date <- as_date("1950-01-01") year(cdc$date) <- cdc$`MMWR-YEAR` week(cdc$date) <- cdc$`MMWR-WEEK` # get flu hospitalization counts that include all race, sex, and age categories cdc_overall <- subset( cdc, `AGE CATEGORY` == "Overall" & `SEX CATEGORY` == "Overall" & `RACE CATEGORY` == "Overall" ) # convert the counts to cases per 10 million cdc_overall$`WEEKLY RATE` <- as.numeric(cdc_overall$`WEEKLY RATE`) * 100 # remake the plot but add a new geom_line() with its own data ggplot(flu) + aes(x = date, fill = `Sampling Method`) + geom_histogram() + geom_line(data = cdc_overall, mapping = aes(x = date, y = `WEEKLY RATE`), inherit.aes = FALSE) + ggtitle("Bird flu detections and human flu hospitalizations") + xlab("Date") + ylab("Count") + xlim(as_date("2022-01-01"), as_date("2023-05-01")) 5.5.4 Small Business Loans The US Small Business Administration (SBA) maintains data on the loans it offers to businesses. Data about loans made since 2020 can be found at the Small Business Administration website, or you can download it from here. We’ll load that data and then explore some ways to visualize it. Since the difference between a $100 loan and a $1000 loan is more like the difference between $100,000 and $1M than between $100,000 and $100,900, we should put the loan values on a logarithmic scale. You can do this in ggplot2 with the scale_y_log10() function (when the loan values are on the y axis). # load the small business loan data sba <- read_csv("data/small-business-loans.csv") # check the SBA data to see the data types, etc. head(sba) ## # A tibble: 6 × 39 ## AsOfDate Program BorrName BorrStreet BorrCity BorrState BorrZip BankName ## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 20230331 7A Mark Dusa 3623 Swal… Sylvania OH 43560 The Hun… ## 2 20230331 7A Shaddai Harris 614 Valle… Arlingt… TX 76018 PeopleF… ## 3 20230331 7A Aqualon Inc. 
7180 Agen… Tipp Ci… OH 45371 The Hun… ## 4 20230331 7A Redline Resta… 2450 Cher… Saint C… FL 34772 SouthSt… ## 5 20230331 7A Meluota Corp 2702 ASTO… ASTORIA NY 11102 Santand… ## 6 20230331 7A Sky Lake Vaca… 15 Nestle… Laconia NH 03246 TD Bank… ## # ℹ 31 more variables: BankFDICNumber <dbl>, BankNCUANumber <dbl>, ## # BankStreet <chr>, BankCity <chr>, BankState <chr>, BankZip <chr>, ## # GrossApproval <dbl>, SBAGuaranteedApproval <dbl>, ApprovalDate <chr>, ## # ApprovalFiscalYear <dbl>, FirstDisbursementDate <chr>, ## # DeliveryMethod <chr>, subpgmdesc <chr>, InitialInterestRate <dbl>, ## # TermInMonths <dbl>, NaicsCode <dbl>, NaicsDescription <chr>, ## # FranchiseCode <chr>, FranchiseName <chr>, ProjectCounty <chr>, … # boxplot of loan sizes by business type subset(sba, ProjectState == "CA") |> ggplot() + aes(x = BusinessType, y = SBAGuaranteedApproval) + geom_boxplot() + scale_y_log10() + ggtitle("Small Business Administration guaranteed loans in California") + ylab("Loan guarantee (dollars)") # relationship between loan size and interest rate subset(sba, ProjectState == "CA") |> ggplot() + aes(x = GrossApproval, y = InitialInterestRate) + geom_point() + facet_wrap(~BusinessType, ncol = 3) + scale_x_log10() + ggtitle("Interest rate as a function of loan size") + xlab("Loan size (dollars)") + ylab("Interest rate (%)") Now let’s color the points by the loan status. Thankfully, ggplot2 integrates directly with Color Brewer (colorbrewer2.org) to get better color palettes. We will use the Accent color palette, which is just one of the many options that can be found on the Color Brewer site. There are a lot of data points, which tend to overlap and hide each other. We use a smoother (geom_smooth()) to help call out differences that would otherwise be lost in the noise of the points. # color the dots by the loan status. 
subset(sba, ProjectState == "CA" & LoanStatus != "EXEMPT" & LoanStatus != "CHGOFF") |> ggplot() + aes(x = GrossApproval, y = InitialInterestRate, color = LoanStatus) + geom_point() + geom_smooth() + facet_wrap(~BusinessType, ncol = 3) + scale_x_log10() + ggtitle("Interest rate as a function of loan size by loan status") + xlab("Loan size (dollars)") + ylab("Interest rate (%)") + labs(color = "Loan status") + scale_color_brewer(type = "qual", palette = "Accent") 6 Language Fundamentals This chapter is part 1 (of 2) of Thinking in R, a workshop series about how R works and how to examine code critically. The major topics of this chapter are how R stores and locates variables (including functions) defined in your code and in packages, and how some of R’s object-oriented programming systems work. Learning Objectives Explain what an environment is and how R uses them Explain how R looks up variables Explain what attributes are and how R uses them Get and set attributes Explain what (S3) classes are and how R uses them Explain R’s (S3) method dispatch system Create an (S3) class Describe R’s other object-oriented programming systems at a high level 6.1 Variables & Environments Assigning and looking up values of variables are fundamental operations in R, as in most programming languages. They were likely among the first operations you learned, and now you use them instinctively. This section is a deep dive into what R actually does when you assign a variable and how R looks up the values of those variables later. Understanding the process and the data structures involved will introduce you to new programming strategies, make it easier to reason about code, and help you identify potential bugs. 6.1.1 What’s an Environment? The foundation of how R stores and looks up variables is a data structure called an environment. 
Every environment has two parts: A frame, which is a collection of names and associated R objects. A parent or enclosing environment, which must be another environment. For now, you’ll learn how to create environments and how to assign and get values from their frames. Parent environments will be explained in a later section. You can use the new.env function to create a new environment: e = new.env() e ## <environment: 0x55b763994b70> Unlike most objects, printing an environment doesn’t print its contents. Instead, R prints its type (which is environment) and a unique identifier (0x55b763994b70 in this case). The unique identifier is actually the memory address of the environment. Every object you use in R is stored as a series of bytes in your computer’s random-access memory (RAM). Each byte in memory has a unique address, similar to how each house on a street has a unique address. Memory addresses are usually just numbers counting up from 0, but they’re often written in hexadecimal (base 16) (indicated by the prefix 0x) because it’s more concise. For the purposes of this reader, you can just think of the memory address as a unique identifier. To see the names in an environment’s frame, you can call the ls or names function on the environment: ls(e) ## character(0) names(e) ## character(0) You just created the environment e, so its frame is currently empty. The printout character(0) means R returned a character vector of length 0. You can assign an R object to a name in an environment’s frame with the dollar sign $ operator or the double square bracket [[ operator, similar to how you would assign a named element of a list. For example, one way to assign the number 8 to the name \"lucky\" in the environment e’s frame is: e$lucky = 8 Now there’s a name defined in the environment: ls(e) ## [1] "lucky" names(e) ## [1] "lucky" Here’s another example of assigning an object to a name in the environment: e[["my_message"]] = "May your coffee kick in before reality does." 
You can assign any type of R object to a name in an environment, including other environments. The ls function ignores names that begin with a dot . by default. For example: e$.x = list(1, sin) ls(e) ## [1] "lucky" "my_message" You can pass the argument all.names = TRUE to make the function return all names in the frame: ls(e, all.names = TRUE) ## [1] ".x" "lucky" "my_message" Alternatively, you can just use the names function, which always prints all names in an environment’s frame. Objects in an environment’s frame don’t have positions or any particular order, so they must always be assigned to a name. R raises an error if you try to assign an object to a position: e[[3]] = 10 ## Error in e[[3]] = 10: wrong args for environment subassignment As you might expect, you can also use the dollar sign operator and double square bracket operator to get objects in an environment by name: e$my_message ## [1] "May your coffee kick in before reality does." e[["lucky"]] ## [1] 8 You can use the exists function to check whether a specific name exists in an environment’s frame: exists("hi", e) ## [1] FALSE exists("lucky", e) ## [1] TRUE Finally, you can remove a name and object from an environment’s frame with the rm function. Make sure to pass the environment as the argument to the envir parameter when you do this: rm("lucky", envir = e) exists("lucky", e) ## [1] FALSE 6.1.2 Reference Objects Environments are reference objects, which means they don’t follow R’s copy-on-write rule: for most types of objects, if you modify the object, R automatically and silently makes a copy, so that any other variables that refer to the object remain unchanged. As an example, lists follow the copy-on-write rule. Suppose you assign a list to variable x, assign x to y, and then make a change to x: x = list() x$a = 10 x ## $a ## [1] 10 y = x x$a = 20 y ## $a ## [1] 10 When you run y = x, R makes y refer to the same object as x, without using any additional memory. 
When you run x$a = 20, the copy-on-write rule applies, so R creates and modifies a copy of the object. From then on, x refers to the modified copy and y refers to the original. Environments don’t follow the copy-on-write rule, so repeating the example with an environment produces a different result: e_x = new.env() e_x$a = 10 e_x$a ## [1] 10 e_y = e_x e_x$a = 20 e_y$a ## [1] 20 As before, e_y = e_x makes both e_y and e_x refer to the same object. The difference is that when you run e_x$a = 20, the copy-on-write rule does not apply and R does not create a copy of the environment. As a result, the change to e_x is also reflected in e_y. Environments and other reference objects can be confusing since they behave differently from most objects. You usually won’t need to construct or manipulate environments directly, but it’s useful to know how to inspect them. 6.1.3 The Local Environment Think of environments as containers for variables. Whenever you assign a variable, R assigns it to the frame of an environment. Whenever you get a variable, R searches through one or more environments for its value. When you start R, R creates a special environment called the global environment to store variables you assign at the prompt or the top level of a script. You can use the globalenv function to get the global environment: g = globalenv() g ## <environment: R_GlobalEnv> The global environment is easy to recognize because its unique identifier is R_GlobalEnv rather than its memory address (even though it’s stored in your computer’s memory like any other object). The local environment is the environment where the assignment operators <- and = assign variables. Think of the local environment as the environment that’s currently active. The local environment varies depending on the context where you run an expression. 
You can get the local environment with the environment function: loc = environment() loc ## <environment: R_GlobalEnv> As you can see, at the R prompt or the top level of an R script, the local environment is just the global environment. Except for names, the functions introduced in Section 6.1.1 default to the local environment if you don’t set the envir parameter. This makes them convenient for inspecting or modifying the local environment’s frame: ls(loc) ## [1] "e" "e_x" "e_y" "g" "loc" ## [6] "source_rmd" "x" "y" ls() ## [1] "e" "e_x" "e_y" "g" "loc" ## [6] "source_rmd" "x" "y" If you assign a variable, it appears in the local environment’s frame: coffee = "Right. No coffee. This is a terrible planet." ls() ## [1] "coffee" "e" "e_x" "e_y" "g" ## [6] "loc" "source_rmd" "x" "y" loc$coffee ## [1] "Right. No coffee. This is a terrible planet." Conversely, if you assign an object in the local environment’s frame, you can access it as a variable: loc$tea = "Tea isn't coffee!" tea ## [1] "Tea isn't coffee!" 6.1.4 Call Environments Every time you call (not define) a function, R creates a new environment. R uses this call environment as the local environment while the code in the body of the function runs. As a result, assigning variables in a function doesn’t affect the global environment, and they generally can’t be accessed from outside of the function. 
For example, consider this function which assigns the variable hello: my_hello = function() { hello = "from the other side" } Even after calling the function, there’s no variable hello in the global environment: my_hello() names(g) ## [1] "loc" "my_hello" "tea" "e_x" "x" ## [6] "e_y" "y" "coffee" "source_rmd" "e" ## [11] "g" ".First" As further demonstration, consider this modified version of my_hello, which returns the call environment: my_hello = function() { hello = "from the other side" environment() } The call environment is not the global environment: e = my_hello() e ## <environment: 0x55b765e895e8> And the variable hello exists in the call environment, but not in the global environment: exists("hello", g) ## [1] FALSE exists("hello", e) ## [1] TRUE e$hello ## [1] "from the other side" Each call to a function creates a new call environment. So if you call my_hello again, it returns a different environment (pay attention to the memory address): e2 = my_hello() e ## <environment: 0x55b765e895e8> e2 ## <environment: 0x55b76640c6f8> By creating a new environment for every call, R isolates code in the function body from code outside of the body. As a result, most R functions have no side effects. This is a good thing, since it means you generally don’t have to worry about calls assigning, reassigning, or removing variables in other environments (such as the global environment!). The local function provides another way to create a new local environment in which to run code. However, it’s usually preferable to define and call a function, since that makes it easier to test and reuse the code. 6.1.5 Lexical Scoping A function can access variables outside of its local environment, but only if those variables exist in the environment where the function was defined (not called). This property is called lexical scoping. For example, assign a variable tea and function get_tea in the global environment: tea = "Tea isn't coffee!" 
get_tea = function() { tea } Then the get_tea function can access the tea variable: get_tea() ## [1] "Tea isn't coffee!" Note that variable lookup takes place when a function is called, not when it’s defined. This is called dynamic lookup. For example, the result from get_tea changes if you change the value of tea: tea = "Tea for two." get_tea() ## [1] "Tea for two." tea = "Tea isn't coffee!" get_tea() ## [1] "Tea isn't coffee!" When a local variable (a variable in the local environment) and a non-local variable have the same name, R almost always prioritizes the local variable. For instance: get_local_tea = function() { tea = "Earl grey is tea!" tea } get_local_tea() ## [1] "Earl grey is tea!" The function body assigns the local variable tea to \"Earl grey is tea!\", so R returns that value rather than \"Tea isn't coffee!\". In other words, local variables mask, or hide, non-local variables with the same name. There’s only one case where R doesn’t prioritize local variables. To see it, consider this call: mean(1:20) ## [1] 10.5 The variable mean must refer to a function, because it’s being called—it’s followed by parentheses ( ), the call syntax. In this situation, R ignores local variables that aren’t functions, so you can write code such as: mean = 10 mean(1:10) ## [1] 5.5 That said, defining a local variable with the same name as a function can still be confusing, so it’s usually considered a bad practice. To help you reason about lexical scoping, you can get the environment where a function was defined by calling the environment function on the function itself. For example, the get_tea function was defined in the global environment: environment(get_tea) ## <environment: R_GlobalEnv> 6.1.6 Variable Lookup The key to how R looks up variables and how lexical scoping works is that in addition to a frame, every environment has a parent environment. When R evaluates a variable in an expression, it starts by looking for the variable in the local environment’s frame. 
For example, at the prompt, tea is a local variable because that’s where you assigned it. If you enter tea at the prompt, R finds tea in the local environment’s frame and returns the value: tea ## [1] "Tea isn't coffee!" On the other hand, in the get_tea function from Section 6.1.5, tea is not a local variable: get_tea = function() { tea } To make this more concrete, consider a function which just returns its call environment: get_call_env = function() { environment() } The call environment clearly doesn’t contain the tea variable: e = get_call_env() ls(e) ## character(0) When a variable doesn’t exist in the local environment’s frame, then R gets the parent environment of the local environment. You can use the parent.env function to get the parent environment of an environment. For the call environment e, the parent environment is the global environment, because that’s where get_call_env was defined: parent.env(e) ## <environment: R_GlobalEnv> When R can’t find tea in the call environment’s frame, R gets the parent environment, which is the global environment. Then R searches for tea in the global environment, finds it, and returns the value. R repeats the lookup process for as many parents as necessary to find the variable, stopping only when it finds the variable or reaches a special environment called the empty environment, which will be explained in Section 6.1.7. The lookup process also hints at how R finds variables and functions such as pi and sqrt that clearly aren’t defined in the global environment. They’re defined in parent environments of the global environment. The get function looks up a variable by name: get("pi") ## [1] 3.141593 You can use the get function to look up a variable starting from a specific environment or to control how R performs the lookup. 
For example, if you set inherits = FALSE, R will not search any parent environments: get("pi", inherits = FALSE) ## Error in get("pi", inherits = FALSE): object 'pi' not found As with most functions for inspecting and modifying environments, use the get function sparingly. R already provides a much simpler way to get a variable: the variable’s name. 6.1.7 The Search Path R also uses environments to manage packages. Each time you load a package with library or require, R creates a new environment: The frame contains the package’s local variables. The parent environment is the environment of the previous package loaded. This new environment becomes the parent of the global environment. R always loads several built-in packages at startup, which contain variables and functions such as pi and sqrt. Thus the global environment is never the top-level environment. For instance: g = globalenv() e = parent.env(g) e ## <environment: package:stats> ## attr(,"name") ## [1] "package:stats" ## attr(,"path") ## [1] "/usr/lib/R/library/stats" e = parent.env(e) e ## <environment: package:graphics> ## attr(,"name") ## [1] "package:graphics" ## attr(,"path") ## [1] "/usr/lib/R/library/graphics" Notice that package environments use package: and the name of the package as their unique identifier rather than their memory address. The chain of package environments is called the search path. The search function returns the search path: search() ## [1] ".GlobalEnv" "package:stats" "package:graphics" ## [4] "package:grDevices" "package:utils" "package:datasets" ## [7] "package:methods" "Autoloads" "package:base" The base environment (identified by base) is always the topmost environment. You can use the baseenv function to get the base environment: baseenv() ## <environment: base> The base environment’s parent is the special empty environment (identified by R_EmptyEnv), which contains no variables and has no parent. 
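To make the chain of parents concrete, here's a short sketch (not from the reader) that walks upward from the global environment until it reaches the empty environment. It uses the environmentName function, which returns the identifiers shown in the printouts above:

```r
# Walk the chain of parent environments, starting at the global
# environment and printing each environment's name. The walk stops
# when it reaches the empty environment, which has no parent.
env = globalenv()
while (environmentName(env) != "R_EmptyEnv") {
  print(environmentName(env))
  env = parent.env(env)
}
```

Run at the prompt, this prints roughly the same sequence of names as search, ending with base just before the walk stops at the empty environment.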
You can use the emptyenv function to get the empty environment: emptyenv() ## <environment: R_EmptyEnv> Understanding R’s process for looking up variables and the search path is helpful for resolving conflicts between the names of variables in packages. 6.1.7.1 The Colon Operators The double-colon operator :: gets a variable in a specific package. Two common uses: Disambiguate which package you mean when several packages have variables with the same names. Get a variable from a package without loading the package. For example: library(dplyr) ## ## Attaching package: 'dplyr' ## The following objects are masked from 'package:stats': ## ## filter, lag ## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union stats::filter ## function (x, filter, method = c("convolution", "recursive"), ## sides = 2L, circular = FALSE, init = NULL) ## { ## method <- match.arg(method) ## x <- as.ts(x) ## storage.mode(x) <- "double" ## xtsp <- tsp(x) ## n <- as.integer(NROW(x)) ## if (is.na(n)) ## stop(gettextf("invalid value of %s", "NROW(x)"), domain = NA) ## nser <- NCOL(x) ## filter <- as.double(filter) ## nfilt <- as.integer(length(filter)) ## if (is.na(nfilt)) ## stop(gettextf("invalid value of %s", "length(filter)"), ## domain = NA) ## if (anyNA(filter)) ## stop("missing values in 'filter'") ## if (method == "convolution") { ## if (nfilt > n) ## stop("'filter' is longer than time series") ## sides <- as.integer(sides) ## if (is.na(sides) || (sides != 1L && sides != 2L)) ## stop("argument 'sides' must be 1 or 2") ## circular <- as.logical(circular) ## if (is.na(circular)) ## stop("'circular' must be logical and not NA") ## if (is.matrix(x)) { ## y <- matrix(NA, n, nser) ## for (i in seq_len(nser)) y[, i] <- .Call(C_cfilter, ## x[, i], filter, sides, circular) ## } ## else y <- .Call(C_cfilter, x, filter, sides, circular) ## } ## else { ## if (missing(init)) { ## init <- matrix(0, nfilt, nser) ## } ## else { ## ni <- NROW(init) ## if (ni != 
nfilt) ## stop("length of 'init' must equal length of 'filter'") ## if (NCOL(init) != 1L && NCOL(init) != nser) { ## stop(sprintf(ngettext(nser, "'init' must have %d column", ## "'init' must have 1 or %d columns", domain = "R-stats"), ## nser), domain = NA) ## } ## if (!is.matrix(init)) ## dim(init) <- c(nfilt, nser) ## } ## ind <- seq_len(nfilt) ## if (is.matrix(x)) { ## y <- matrix(NA, n, nser) ## for (i in seq_len(nser)) y[, i] <- .Call(C_rfilter, ## x[, i], filter, c(rev(init[, i]), double(n)))[-ind] ## } ## else y <- .Call(C_rfilter, x, filter, c(rev(init[, 1L]), ## double(n)))[-ind] ## } ## tsp(y) <- xtsp ## class(y) <- if (nser > 1L) ## c("mts", "ts") ## else "ts" ## y ## } ## <bytecode: 0x55b765e3dd10> ## <environment: namespace:stats> dplyr::filter ## function (.data, ..., .by = NULL, .preserve = FALSE) ## { ## check_by_typo(...) ## by <- enquo(.by) ## if (!quo_is_null(by) && !is_false(.preserve)) { ## abort("Can't supply both `.by` and `.preserve`.") ## } ## UseMethod("filter") ## } ## <bytecode: 0x55b761f4d678> ## <environment: namespace:dplyr> ggplot2::ggplot ## function (data = NULL, mapping = aes(), ..., environment = parent.frame()) ## { ## UseMethod("ggplot") ## } ## <bytecode: 0x55b7654062e8> ## <environment: namespace:ggplot2> The related triple-colon operator ::: gets a private variable in a package. Generally these are private for a reason! Only use ::: if you’re sure you know what you’re doing. 6.2 Closures A closure is a function together with an enclosing environment. In order to support lexical scoping, every R function is a closure (except a few very special built-in functions). The enclosing environment is generally the environment where the function was defined. 
Recall that you can use the environment function to get the enclosing environment of a function: f = function() 42 environment(f) ## <environment: R_GlobalEnv> Since the enclosing environment exists whether or not you call the function, you can use the enclosing environment to store and share data between calls. You can use the superassignment operator <<- to assign to a variable in an ancestor environment (if the variable already exists) or in the global environment (if it does not). For example, suppose you want to make a function that returns the number of times it’s been called: counter = 0 count = function() { counter <<- counter + 1 counter } In this example, the enclosing environment is the global environment. Each time you call count, it assigns a new value to the counter variable in the global environment. 6.2.1 Tidy Closures The count function has a side effect: it reassigns a non-local variable. As discussed in 6.1.4, functions with side effects make code harder to understand and reason about. Use side effects sparingly and try to isolate them from the global environment. When side effects aren’t isolated, several things can go wrong. The function might overwrite the user’s variables: counter = 0 count() ## [1] 1 Or the user might overwrite the function’s variables: counter = "hi" count() ## Error in counter + 1: non-numeric argument to binary operator For functions that rely on storing information in their enclosing environment, there are several different ways to make sure the enclosing environment is isolated. Two of these are: Define and return the function from the body of another function. The second function is called a factory function because it produces (returns) the first. The enclosing environment of the first function is the call environment of the second. Define the function inside of a call to local. 
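As a sketch of the second approach (not shown in the original), local runs its body in a new environment and returns the last value, so the returned function's enclosing environment is isolated from the global environment:

```r
# Define the counter inside a call to local. The variable `counter`
# lives in the environment created by local, not in the global
# environment, so user code can't accidentally overwrite it.
count = local({
  counter = 0
  function() {
    counter <<- counter + 1
    counter
  }
})

count()         # returns 1
count()         # returns 2
counter = "hi"  # assigns a *global* counter; has no effect on count
count()         # returns 3
```

This has the same isolation property as the factory-function approach; which one to use is mostly a matter of taste.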
Here’s a template for the first approach: make_fn = function() { # Define variables in the enclosing environment here: # Define and return the function here: function() { # ... } } f = make_fn() # Now you can call f() as you would any other function. For example, you can use the template for the counter function: make_count = function() { counter = 0 function() { counter <<- counter + 1 counter } } count = make_count() Then calling count has no effect on the global environment: counter = 10 count() ## [1] 1 counter ## [1] 10 6.3 Attributes An attribute is named metadata attached to an R object. Attributes provide basic information about objects and play an important role in R’s class system, so most objects have attributes. Some common attributes are: class – the class row.names – row names names – element names or column names dim – dimensions (on matrices) dimnames – names of dimensions (on matrices) R provides helper functions to get and set the values of the common attributes. These functions usually have the same name as the attribute. For example, the class function gets or sets the class attribute: class(mtcars) ## [1] "data.frame" row.names(mtcars) ## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" ## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant" ## [7] "Duster 360" "Merc 240D" "Merc 230" ## [10] "Merc 280" "Merc 280C" "Merc 450SE" ## [13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood" ## [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128" ## [19] "Honda Civic" "Toyota Corolla" "Toyota Corona" ## [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28" ## [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2" ## [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino" ## [31] "Maserati Bora" "Volvo 142E" An attribute can have any name and any value. 
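As an illustration of how much behavior an attribute can drive, setting the dim attribute on a plain vector is exactly what makes it a matrix:

```r
x = 1:6
class(x)
## [1] "integer"

# Assigning the dim attribute reshapes the vector into a 2-by-3
# matrix (filled column by column) and changes its class.
dim(x) = c(2, 3)
x
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
```

In R 4.0.0 and later, class(x) after this assignment returns both "matrix" and "array".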
You can use the attr function to get or set an attribute by name: attr(mtcars, "row.names") ## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" ## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant" ## [7] "Duster 360" "Merc 240D" "Merc 230" ## [10] "Merc 280" "Merc 280C" "Merc 450SE" ## [13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood" ## [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128" ## [19] "Honda Civic" "Toyota Corolla" "Toyota Corona" ## [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28" ## [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2" ## [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino" ## [31] "Maserati Bora" "Volvo 142E" attr(mtcars, "foo") = 42 attr(mtcars, "foo") ## [1] 42 You can get all of the attributes attached to an object with the attributes function: attributes(mtcars) ## $names ## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" ## [11] "carb" ## ## $row.names ## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" ## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant" ## [7] "Duster 360" "Merc 240D" "Merc 230" ## [10] "Merc 280" "Merc 280C" "Merc 450SE" ## [13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood" ## [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128" ## [19] "Honda Civic" "Toyota Corolla" "Toyota Corona" ## [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28" ## [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2" ## [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino" ## [31] "Maserati Bora" "Volvo 142E" ## ## $class ## [1] "data.frame" ## ## $foo ## [1] 42 You can use the structure function to set multiple attributes on an object: mod_mtcars = structure(mtcars, foo = 50, bar = 100) attributes(mod_mtcars) ## $names ## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" ## [11] "carb" ## ## $row.names ## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" ## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant" ## [7] "Duster 360" "Merc 240D" "Merc 230" ## [10] "Merc 280" "Merc 280C" "Merc 450SE" ## [13] 
"Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood" ## [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128" ## [19] "Honda Civic" "Toyota Corolla" "Toyota Corona" ## [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28" ## [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2" ## [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino" ## [31] "Maserati Bora" "Volvo 142E" ## ## $class ## [1] "data.frame" ## ## $foo ## [1] 50 ## ## $bar ## [1] 100 Vectors usually don’t have attributes: attributes(5) ## NULL But the class function still returns a class: class(5) ## [1] "numeric" When a helper function exists to get or set an attribute, use the helper function rather than attr. This will make your code clearer and ensure that attributes with special behavior and requirements, such as dim, are set correctly. 6.4 S3 R provides several systems for object-oriented programming (OOP), a programming paradigm where code is organized into a collection of “objects” that interact with each other. These systems provide a way to create new data structures with customized behavior, and also underpin how some of R’s built-in functions work. The S3 system is particularly important for understanding R, because it’s the oldest and most widely-used. This section focuses on S3, while Section 6.5 provides an overview of R’s other OOP systems. The central idea of S3 is that some functions can be generic, meaning they perform different computations (and run different code) for different classes of objects. Conversely, every object has at least one class, which dictates how the object behaves. For most objects, the class is independent of type and is stored in the class attribute. You can get the class of an object with the class function. For example, the class of a data frame is data.frame: class(mtcars) ## [1] "data.frame" Some objects have more than one class. 
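For instance, the dim helper mentioned earlier both validates the new dimensions and produces one such multi-class object (a quick sketch):

```r
# The dim helper checks that the new dimensions match the length of the
# vector (2 * 3 must equal 6) and turns the plain vector into a matrix.
x = 1:6
dim(x) = c(2, 3)
class(x)
## [1] "matrix" "array"
```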
One example of this is matrices: m = matrix() class(m) ## [1] "matrix" "array" When an object has multiple classes, they’re stored in the class attribute in order from highest to lowest priority. So the matrix m will primarily behave like a matrix, but it can also behave like an array. The priority of classes is often described in terms of a child-parent relationship: array is the parent class of matrix, or equivalently, the class matrix inherits from the class array. 6.4.1 Method Dispatch A function is generic if it selects and calls another function, called a method, based on the class of one of its arguments. A generic function can have any number of methods, and each must have the same signature, or collection of parameters, as the generic. Think of a generic function’s methods as the range of different computations it can perform, or alternatively as the range of different classes it can accept as input. Method dispatch, or just dispatch, is the process of selecting a method based on the class of an argument. You can identify S3 generics because they always call the UseMethod function, which initiates S3 method dispatch. Many of R’s built-in functions are generic. One example is the split function, which splits a data frame or vector into groups: split ## function (x, f, drop = FALSE, ...) ## UseMethod("split") ## <bytecode: 0x55b763387020> ## <environment: namespace:base> Another is the plot function, which creates a plot: plot ## function (x, y, ...) ## UseMethod("plot") ## <bytecode: 0x55b7651cbfa0> ## <environment: namespace:base> The UseMethod function requires the name of the generic (as a string) as its first argument. The second argument is optional and specifies the object to use for method dispatch. By default, the first argument to the generic is used for method dispatch. So for split, the argument for x is used for method dispatch. R checks the class of the argument and selects a matching method. 
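To see the optional second argument to UseMethod in action, here's a toy generic (the names describe, label, and x are made up for this sketch) that dispatches on its second parameter instead of its first:

```r
# Passing x as the second argument to UseMethod makes R dispatch on the
# class of x rather than the class of label.
describe = function(label, x) UseMethod("describe", x)
describe.numeric = function(label, x) paste(label, "is a number")
describe.default = function(label, x) paste(label, "is something else")

describe("a", 1.5)
## [1] "a is a number"
describe("b", "hi")
## [1] "b is something else"
```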
You can use the methods function to list all of the methods of a generic. The methods for split are: methods(split) ## [1] split.data.frame split.Date split.default split.POSIXct ## see '?methods' for accessing help and source code Method names always have the form GENERIC.CLASS, where GENERIC is the name of the generic and CLASS is the name of a class. For instance, split.data.frame is the split method for objects with class data.frame. Methods named GENERIC.default are a special case: they are default methods, selected only if none of the other methods match the class during dispatch. So split.default is the default method for split. Most generic functions have a default method. Methods are ordinary R functions. For instance, the code for split.data.frame is: split.data.frame ## function (x, f, drop = FALSE, ...) ## { ## if (inherits(f, "formula")) ## f <- .formula2varlist(f, x) ## lapply(split(x = seq_len(nrow(x)), f = f, drop = drop, ...), ## function(ind) x[ind, , drop = FALSE]) ## } ## <bytecode: 0x55b763642e08> ## <environment: namespace:base> Sometimes methods are defined privately in packages and can’t be accessed by typing their name at the prompt. You can use the getAnywhere function to get the code for these methods. For instance, to get the code for plot.data.frame: getAnywhere(plot.data.frame) ## A single object matching 'plot.data.frame' was found ## It was found in the following places ## registered S3 method for plot from namespace graphics ## namespace:graphics ## with value ## ## function (x, ...) ## { ## plot2 <- function(x, xlab = names(x)[1L], ylab = names(x)[2L], ## ...) plot(x[[1L]], x[[2L]], xlab = xlab, ylab = ylab, ## ...) ## if (!is.data.frame(x)) ## stop("'plot.data.frame' applied to non data frame") ## if (ncol(x) == 1) { ## x1 <- x[[1L]] ## if (class(x1)[1L] %in% c("integer", "numeric")) ## stripchart(x1, ...) ## else plot(x1, ...) ## } ## else if (ncol(x) == 2) { ## plot2(x, ...) ## } ## else { ## pairs(data.matrix(x), ...) 
## } ## } ## <bytecode: 0x55b7652badb8> ## <environment: namespace:graphics> As a demonstration of method dispatch, consider this code to split the mtcars dataset by number of cylinders: split(mtcars, mtcars$cyl) The split function is generic and dispatches on its first argument. In this case, the first argument is mtcars, which has class data.frame. Since the method split.data.frame exists, R calls split.data.frame with the same arguments you used to call the generic split function. In other words, R calls: split.data.frame(mtcars, mtcars$cyl) When an object has more than one class, method dispatch considers them from left to right. For instance, matrices created with the matrix function have class matrix and also class array. If you pass a matrix to a generic function, R will first look for a matrix method. If there isn’t one, R will look for an array method. If there still isn’t one, R will look for a default method. If there’s no default method either, then R raises an error. The sloop package provides useful functions for inspecting S3 classes, generics, and methods, as well as the method dispatch process. For example, you can use the s3_dispatch function to see which method will be selected when you call a generic: # install.packages("sloop") library("sloop") s3_dispatch(split(mtcars, mtcars$cyl)) ## => split.data.frame ## * split.default The selected method is indicated with an arrow =>, while methods that were not selected are indicated with a star *. See ?s3_dispatch for complete details about the output from the function. 6.4.2 Creating Objects S3 classes are defined implicitly by their associated methods. To create a new class, decide what its structure will be and define some methods. To create an object of the class, set an object’s class attribute to the class name. For example, let’s create a generic function get_age that returns the age of an animal in terms of a typical human lifespan. 
First define the generic: get_age = function(animal) { UseMethod("get_age") } Next, let’s create a class Human to represent a human. Since humans are animals, let’s make each Human also have class Animal. You can use any type of object as the foundation for a class, but lists are often a good choice because they can store multiple named elements. Here’s how to create a Human object with a field age_years to store the age in years: lyra = list(age_years = 13) class(lyra) = c("Human", "Animal") Class names can include any characters that are valid in R variable names. One common convention is to make them start with an uppercase letter, to distinguish them from variables. If you want to make constructing an object of a given class less ad-hoc (and error-prone), define a constructor function that returns a new object of a given class. A common convention is to give the constructor function the same name as the class: Human = function(age_years) { obj = list(age_years = age_years) class(obj) = c("Human", "Animal") obj } asriel = Human(45) The get_age generic doesn’t have any methods yet, so R raises an error if you call it (regardless of the argument’s class): get_age(lyra) ## Error in UseMethod("get_age"): no applicable method for 'get_age' applied to an object of class "c('Human', 'Animal')" Let’s define a method for Animal objects. The method will just return the value of the age_years field: get_age.Animal = function(animal) { animal$age_years } get_age(lyra) ## [1] 13 get_age(asriel) ## [1] 45 Notice that the get_age generic still raises an error for objects that don’t have class Animal: get_age(3) ## Error in UseMethod("get_age"): no applicable method for 'get_age' applied to an object of class "c('double', 'numeric')" Now let’s create a class Dog to represent dogs. Like the Human class, a Dog is a kind of Animal and has an age_years field. 
Each Dog will also have a breed field to store the breed of the dog: Dog = function(age_years, breed) { obj = list(age_years = age_years, breed = breed) class(obj) = c("Dog", "Animal") obj } pongo = Dog(10, "dalmatian") Since a Dog is an Animal, the get_age generic returns a result: get_age(pongo) ## [1] 10 Recall that the goal of this example was to make get_age return the age of an animal in terms of a human lifespan. For a dog, their age in “human years” is about 5 times their age in actual years. You can implement a get_age method for Dog to take this into account: get_age.Dog = function(animal) { animal$age_years * 5 } Now the get_age generic returns an age in terms of a human lifespan whether its argument is a Human or a Dog: get_age(lyra) ## [1] 13 get_age(pongo) ## [1] 50 You can create new data structures in R by creating classes, and you can add functionality to new or existing generics by creating new methods. Before creating a class, think about whether R already provides a data structure that suits your needs. It’s uncommon to create new classes in the course of a typical data analysis, but many packages do provide new classes. Regardless of whether you ever create a new class, understanding the details means understanding how S3 works, and thus how R’s many S3 generic functions work. As a final note, while exploring S3 methods you may also encounter the NextMethod function. The NextMethod function redirects dispatch to the method that is the next closest match for an object’s class. You can learn more by reading ?NextMethod. 6.5 Other Object Systems R provides many systems for object-oriented programming besides S3. Some are built into the language, while others are provided by packages. A few of the most popular systems are: S4 – S4 is built into R and is the most widely-used system after S3. Like S3, S4 frames OOP in terms of generic functions and methods. 
The major differences are that S4 is stricter—the structure of each class must be formally defined—and that S4 generics can dispatch on the classes of multiple arguments instead of just one. R provides a special field operator @ to access fields of an S4 object. Most of the packages in the Bioconductor project use S4. Reference classes – Objects created with the S3 and S4 systems generally follow the copy-on-write rule, but this can be inefficient for some programming tasks. The reference class system is built into R and provides a way to create reference objects with a formal class structure (in the spirit of S4). This system is more like OOP systems in languages like Java and Python than S3 or S4 are. The reference class system is sometimes jokingly called “R5”, but that isn’t an official name. R6 – An alternative to reference classes created by Winston Chang, a developer at Posit (formerly RStudio). It claims to be simpler and faster than reference classes. S7 – A new OOP system being developed collaboratively by representatives from several different important groups in the R community, including the R core developers, Bioconductor, and Posit. Many of these systems are described in more detail in Hadley Wickham’s book Advanced R. 7 Part 2 This chapter is part 2 (of 2) of Thinking in R, a workshop series about how R works and how to examine code critically. Learning Objectives Describe and use R’s for, while, and repeat loops Identify the most appropriate iteration strategy for a given problem Explain strategies to organize iterative code 7.1 Iteration Strategies R is a powerful tool for automating tasks that have repetitive steps. For example, you can: Apply a transformation to an entire column of data. Compute distances between all pairs from a set of points. Read a large collection of files from disk in order to combine and analyze the data they contain. 
Simulate how a system evolves over time from a specific set of starting parameters. Scrape data from many pages of a website. You can implement concise, efficient solutions for these kinds of tasks in R by using iteration, which means repeating a computation many times. R provides four different strategies for writing iterative code: Vectorization, where a function is implicitly called on each element of a vector. See this section of the R Basics reader for more details. Apply functions, where a function is explicitly called on each element of a vector or array. See this section of the R Basics reader for more details. Loops, where an expression is evaluated repeatedly until some condition is met. Recursion, where a function calls itself. Vectorization is the most efficient and most concise iteration strategy, but also the least flexible, because it only works with vectorized functions and vectors. Apply functions are more flexible—they work with any function and any data structure with elements—but less efficient and less concise. Loops and recursion provide the most flexibility but are the least concise. In recent versions of R, apply functions and loops are similar in terms of efficiency. Recursion tends to be the least efficient iteration strategy in R. The rest of this section explains how to write loops and how to choose which iteration strategy to use. We assume you’re already comfortable with vectorization and have at least some familiarity with apply functions. 7.1.1 For-loops A for-loop evaluates an expression once for each element of a vector or list. The for keyword creates a for-loop. The syntax is: for (I in DATA) { # Your code goes here } The variable I is called an induction variable. At the beginning of each iteration, I is assigned the next element of DATA. The loop iterates once for each element, unless a keyword instructs R to exit the loop early (more about this in Section 7.1.4). 
As with if-statements and functions, the curly braces { } are only required if the body contains multiple lines of code. Here’s a simple for-loop: for (i in 1:10) message("Hi from iteration ", i) ## Hi from iteration 1 ## Hi from iteration 2 ## Hi from iteration 3 ## Hi from iteration 4 ## Hi from iteration 5 ## Hi from iteration 6 ## Hi from iteration 7 ## Hi from iteration 8 ## Hi from iteration 9 ## Hi from iteration 10 When some or all of the iterations in a task depend on results from prior iterations, loops tend to be the most appropriate iteration strategy. For instance, loops are a good way to implement time-based simulations or compute values in recursively defined sequences. As a concrete example, suppose you want to compute the result of starting from the value 1 and composing the sine function 100 times: result = 1 for (i in 1:100) { result = sin(result) } result ## [1] 0.1688525 Unlike other iteration strategies, loops don’t return a result automatically. It’s up to you to use variables to store any results you want to use later. If you want to save a result from every iteration, you can use a vector or a list indexed on the iteration number: n = 1 + 100 result = numeric(n) result[1] = 1 for (i in 2:n) { result[i] = sin(result[i - 1]) } plot(result) Section 7.1.3 explains this in more detail. If the iterations in a task are not dependent, it’s preferable to use vectorization or apply functions instead of a loop. Vectorization is more efficient, and apply functions are usually more concise. In some cases, you can use vectorization to handle a task even if the iterations are dependent. For example, you can use vectorized exponentiation and the sum function to compute the sum of the cubes of many numbers: numbers = c(10, 3, 100, -5, 2, 10) sum(numbers^3) ## [1] 1001910 7.1.2 While-loops A while-loop runs a block of code repeatedly as long as some condition is TRUE. The while keyword creates a while-loop. 
The syntax is: while (CONDITION) { # Your code goes here } The CONDITION should be a scalar logical value or an expression that returns one. At the beginning of each iteration, R checks the CONDITION and exits the loop if it’s FALSE. As always, the curly braces { } are only required if the body contains multiple lines of code. Here’s a simple while-loop: i = 0 while (i < 10) { i = i + 1 message("Hello from iteration ", i) } ## Hello from iteration 1 ## Hello from iteration 2 ## Hello from iteration 3 ## Hello from iteration 4 ## Hello from iteration 5 ## Hello from iteration 6 ## Hello from iteration 7 ## Hello from iteration 8 ## Hello from iteration 9 ## Hello from iteration 10 Notice that this example does the same thing as the simple for-loop in Section 7.1.1, but requires 5 lines of code instead of 2. While-loops are a generalization of for-loops, and only do the bare minimum necessary to iterate. They tend to be most useful when you don’t know how many iterations will be necessary to complete a task. As an example, suppose you want to add up the integers in order until the total is greater than 50: total = 0 i = 1 while (total < 50) { total = total + i message("i is ", i, " total is ", total) i = i + 1 } ## i is 1 total is 1 ## i is 2 total is 3 ## i is 3 total is 6 ## i is 4 total is 10 ## i is 5 total is 15 ## i is 6 total is 21 ## i is 7 total is 28 ## i is 8 total is 36 ## i is 9 total is 45 ## i is 10 total is 55 total ## [1] 55 i ## [1] 11 7.1.3 Saving Multiple Results Loops often produce a different result for each iteration. If you want to save more than one result, there are a few things you must do. First, set up an index vector. The index vector should usually correspond to the positions of the elements in the data you want to process. The seq_along function returns an index vector when passed a vector or list. 
For instance: numbers = c(-1, 21, 3, -8, 5) index = seq_along(numbers) The loop will iterate over the index rather than the input, so the induction variable will track the current iteration number. On the first iteration, the induction variable will be 1, on the second it will be 2, and so on. Then you can use the induction variable and indexing to get the input for each iteration. Second, set up an empty output vector or list. This should usually also correspond to the input, or one element longer (the extra element comes from the initial value). R has several functions for creating vectors: logical, integer, numeric, complex, and character create an empty vector with a specific type and length vector creates an empty vector with a type and length you pass as arguments rep creates a vector by repeating elements of some other vector Empty vectors are filled with FALSE, 0, or "", depending on the type of the vector. Here are some examples: logical(3) ## [1] FALSE FALSE FALSE numeric(4) ## [1] 0 0 0 0 rep(c(1, 2), 2) ## [1] 1 2 1 2 Let’s create an empty numeric vector the same length as the numbers vector: n = length(numbers) result = numeric(n) As with the input, you can use the induction variable and indexing to set the output for each iteration. Creating a vector or list in advance to store something, as we’ve just done, is called preallocation. Preallocation is extremely important for efficiency in loops. Avoid the temptation to use c or append to build up the output bit by bit in each iteration. Finally, write the loop, making sure to get the input and set the output. As an example, this loop adds each element of numbers to a running total and squares the new running total: for (i in index) { prev = if (i > 1) result[i - 1] else 0 result[i] = (numbers[i] + prev)^2 } result ## [1] 1.000000e+00 4.840000e+02 2.371690e+05 5.624534e+10 3.163538e+21 7.1.4 Break & Next The break keyword causes a loop to immediately exit. It only makes sense to use break inside of an if-statement. 
For example, suppose you want to print each string in a vector, but stop at the first missing value. You can do this with a for-loop and the break keyword: my_messages = c("Hi", "Hello", NA, "Goodbye") for (msg in my_messages) { if (is.na(msg)) break message(msg) } ## Hi ## Hello The next keyword causes a loop to immediately go to the next iteration. As with break, it only makes sense to use next inside of an if-statement. Let’s modify the previous example so that missing values are skipped, but don’t cause printing to stop. Here’s the code: for (msg in my_messages) { if (is.na(msg)) next message(msg) } ## Hi ## Hello ## Goodbye These keywords work with both for-loops and while-loops. 7.1.5 Planning for Iteration At first it may seem difficult to decide if and what kind of iteration to use. Start by thinking about whether you need to do something over and over. If you don’t, then you probably don’t need to use iteration. If you do, then try iteration strategies in this order: Vectorization Apply functions Try an apply function if iterations are independent. Loops Try a for-loop if some iterations depend on others. Try a while-loop if the number of iterations is unknown. Recursion (which isn’t covered here) Convenient for naturally recursive problems (like Fibonacci), but often there are faster solutions. Start by writing the code for just one iteration. Make sure that code works; it’s easy to test code for one iteration. When you have one iteration working, then try using the code with an iteration strategy (you will have to make some small changes). If it doesn’t work, try to figure out which iteration is causing the problem. One way to do this is to use message to print out information. Then try to write the code for the broken iteration, get that iteration working, and repeat this whole process. 
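As a small, hypothetical illustration of this workflow: suppose you want to parse a vector of strings as numbers, treating unparseable entries as 0. Write and check the code for one element first, then hand it to an iteration strategy:

```r
strings = c("3.1", "4.2", "oops", "5.0")

# Step 1: code for a single iteration, tested on the first element.
val = suppressWarnings(as.numeric(strings[1]))
if (is.na(val)) val = 0
val
## [1] 3.1

# Step 2: the iterations are independent, so an apply function fits.
parsed = sapply(strings, function(s) {
  val = suppressWarnings(as.numeric(s))
  if (is.na(val)) 0 else val
})
unname(parsed)
## [1] 3.1 4.2 0.0 5.0
```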
7.1.6 Case Study: The Collatz Conjecture The Collatz Conjecture is a conjecture in math that was introduced in 1937 by Lothar Collatz and remains unproven today, despite being relatively easy to explain. Here’s a statement of the conjecture: Start from any positive integer. If the integer is even, divide by 2. If the integer is odd, multiply by 3 and add 1. If the result is not 1, repeat using the result as the new starting value. The result will always reach 1 eventually, regardless of the starting value. The sequences of numbers this process generates are called Collatz sequences. For instance, the Collatz sequence starting from 2 is 2, 1. The Collatz sequence starting from 12 is 12, 6, 3, 10, 5, 16, 8, 4, 2, 1. You can use iteration to compute the Collatz sequence for a given starting value. Since each number in the sequence depends on the previous one, and since the length of the sequence varies, a while-loop is the most appropriate iteration strategy: n = 5 i = 0 while (n != 1) { i = i + 1 if (n %% 2 == 0) { n = n / 2 } else { n = 3 * n + 1 } message(n, " ", appendLF = FALSE) } ## 16 8 4 2 1 As of 2020, scientists have used computers to check the Collatz sequences for every number up to approximately \(2^{64}\). For more details about the Collatz Conjecture, check out this video. 7.1.7 Case Study: U.S. Fruit Prices The U.S. Department of Agriculture (USDA) Economic Research Service (ERS) publishes data about consumer food prices. For instance, in 2018 they posted a dataset that estimates average retail prices for various fruits, vegetables, and snack foods. The estimates are formatted as a collection of Excel files, one for each type of fruit or vegetable. In this case study, you’ll use iteration to get the estimated “fresh” price for all of the fruits in the dataset that are sold fresh. To get started, download the zipped collection of fruit spreadsheets and save it somewhere on your computer. 
Then unzip the file with a zip program or R’s own unzip function. The first sheet of each file contains a table with the name of the fruit and prices sorted by how the fruit was prepared. You can see this for yourself if you use a spreadsheet program to inspect some of the files. In order to read the files into R, first get a vector of their names. You can use the list.files function to list all of the files in a directory. If you set full.names = TRUE, the function will prepend the directory to each file name, so that you get complete paths: paths = list.files("data/fruit", full.names = TRUE) paths ## [1] "data/fruit/apples_2013.xlsx" "data/fruit/apricots_2013.xlsx" ## [3] "data/fruit/bananas_2013.xlsx" "data/fruit/berries_mixed_2013.xlsx" ## [5] "data/fruit/blackberries_2013.xlsx" "data/fruit/blueberries_2013.xlsx" ## [7] "data/fruit/cantaloupe_2013.xlsx" "data/fruit/cherries_2013.xlsx" ## [9] "data/fruit/cranberries_2013.xlsx" "data/fruit/dates_2013.xlsx" ## [11] "data/fruit/figs_2013.xlsx" "data/fruit/fruit_cocktail_2013.xlsx" ## [13] "data/fruit/grapefruit_2013.xlsx" "data/fruit/grapes_2013.xlsx" ## [15] "data/fruit/honeydew_2013.xlsx" "data/fruit/kiwi_2013.xlsx" ## [17] "data/fruit/mangoes_2013.xlsx" "data/fruit/nectarines_2013.xlsx" ## [19] "data/fruit/oranges_2013.xlsx" "data/fruit/papaya_2013.xlsx" ## [21] "data/fruit/peaches_2013.xlsx" "data/fruit/pears_2013.xlsx" ## [23] "data/fruit/pineapple_2013.xlsx" "data/fruit/plums_2013.xlsx" ## [25] "data/fruit/pomegranate_2013.xlsx" "data/fruit/raspberries_2013.xlsx" ## [27] "data/fruit/strawberries_2013.xlsx" "data/fruit/tangerines_2013.xlsx" ## [29] "data/fruit/watermelon_2013.xlsx" The files are in Excel format, which you can read with the read_excel function from the readxl package. 
First try reading one file and extracting the fresh price: library("readxl") prices = read_excel(paths[1]) ## New names: ## • `` -> `...2` ## • `` -> `...3` ## • `` -> `...4` ## • `` -> `...5` ## • `` -> `...6` ## • `` -> `...7` The name of the fruit is the first word in the first column’s name. The fresh price appears in the row where the word in column 1 starts with "Fresh". You can use str_which from the stringr package (Section 1.4.1) to find and extract this row: library("stringr") idx_fresh = str_which(prices[[1]], "^Fresh") prices[idx_fresh, ] ## # A tibble: 1 × 7 ## Apples—Average retail price per pound or…¹ ...2 ...3 ...4 ...5 ...6 ...7 ## <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 Fresh1 1.56… per … 0.9 0.24… poun… 0.42… ## # ℹ abbreviated name: ## # ¹`Apples—Average retail price per pound or pint and per cup equivalent, 2013` The price and unit appear in column 2 and column 3. Now generalize these steps by making a read_fresh_price function. The function should accept a path as input and return a vector that contains the fruit name, fresh price, and unit. Don’t worry about cleaning up the fruit name at this point—you can do that with a vectorized operation after combining the data from all of the files. A few fruits don’t have a fresh price, and the function should return NA for the price and unit for those. Here’s one way to implement the read_fresh_price function: read_fresh_price = function(path) { prices = read_excel(path) # Get fruit name. fruit = names(prices)[[1]] # Find fresh price. 
idx = str_which(prices[[1]], "^Fresh") if (length(idx) > 0) { prices = prices[idx, ] c(fruit, prices[[2]], prices[[3]]) } else { c(fruit, NA, NA) } } Test that the function returns the correct result for a few of the files: read_fresh_price(paths[[1]]) ## New names: ## • `` -> `...2` ## • `` -> `...3` ## • `` -> `...4` ## • `` -> `...5` ## • `` -> `...6` ## • `` -> `...7` ## [1] "Apples—Average retail price per pound or pint and per cup equivalent, 2013" ## [2] "1.5675153914496354" ## [3] "per pound" read_fresh_price(paths[[4]]) ## New names: ## • `` -> `...2` ## • `` -> `...3` ## • `` -> `...4` ## • `` -> `...5` ## • `` -> `...6` ## • `` -> `...7` ## [1] "Mixed berries—Average retail price per pound and per cup equivalent, 2013" ## [2] NA ## [3] NA read_fresh_price(paths[[8]]) ## New names: ## • `` -> `...2` ## • `` -> `...3` ## • `` -> `...4` ## • `` -> `...5` ## • `` -> `...6` ## • `` -> `...7` ## [1] "Cherries—Average retail price per pound and per cup equivalent, 2013" ## [2] "3.5929897554945156" ## [3] "per pound" Now that the function is working, it’s time to choose an iteration strategy. The read_fresh_price function is not vectorized, so that strategy isn’t possible. Reading one file doesn’t depend on reading any of the others, so apply functions are the best strategy here. 
The read_fresh_price function always returns a character vector with 3 elements, so you can use sapply to process all of the files and get a matrix of results: all_prices = sapply(paths, read_fresh_price) ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## • `` -> `...2` ## • `` -> `...3` ## • `` -> `...4` ## • `` -> `...5` ## • `` -> `...6` ## • `` -> `...7` # Transpose, convert to a data frame, and set names for easy reading. all_prices = t(all_prices) all_prices = data.frame(all_prices) rownames(all_prices) = NULL colnames(all_prices) = c("fruit", "price", "unit") all_prices ## fruit ## 1 Apples—Average retail price per pound or pint and per cup equivalent, 2013 ## 2 Apricots—Average retail price per pound and per cup equivalent, 2013 ## 3 Bananas—Average retail price per pound and per cup equivalent, 2013 ## 4 Mixed berries—Average retail price per pound and per cup equivalent, 2013 ## 5 Blackberries—Average retail price per pound and per cup equivalent, 2013 ## 6 Blueberries—Average retail price per pound and per cup equivalent, 2013 ## 7 Cantaloupe—Average retail price per pound and per cup equivalent, 2013 ## 8 Cherries—Average retail price per pound and per cup equivalent, 2013 ## 9 Cranberries—Average retail price per pound and per cup equivalent, 2013 ## 10 Dates—Average retail price per pound and per cup equivalent, 2013 ## 11 Figs—Average retail price per pound and per cup equivalent, 2013 ## 12 Fruit cocktail—Average retail price per pound and per cup equivalent, 2013 ## 13 Grapefruit—Average retail price per pound or pint and per cup equivalent, 2013 ## 14 Grapes—Average retail price per pound or pint 
and per cup equivalent, 2013 ## 15 Honeydew melon—Average retail price per pound and per cup equivalent, 2013 ## 16 Kiwi—Average retail price per pound and per cup equivalent, 2013 ## 17 Mangoes—Average retail price per pound and per cup equivalent, 2013 ## 18 Nectarines—Average retail price per pound and per cup equivalent, 2013 ## 19 Oranges—Average retail price per pound or pint and per cup equivalent, 2013 ## 20 Papaya—Average retail price per pound and per cup equivalent, 2013 ## 21 Peaches—Average retail price per pound and per cup equivalent, 2013 ## 22 Pears—Average retail price per pound and per cup equivalent, 2013 ## 23 Pineapple—Average retail price per pound or pint and per cup equivalent, 2013 ## 24 Plums—Average retail price per pound or pint and per cup equivalent, 2013 ## 25 Pomegranate—Average retail price per pound or pint and per cup equivalent, 2013 ## 26 Raspberries—Average retail price per pound and per cup equivalent, 2013 ## 27 Strawberries—Average retail price per pound and per cup equivalent, 2013 ## 28 Tangerines—Average retail price per pound or pint and per cup equivalent, 2013 ## 29 Watermelon—Average retail price per pound and per cup equivalent, 2013 ## price unit ## 1 1.5675153914496354 per pound ## 2 3.0400719670964378 per pound ## 3 0.56698341453144807 per pound ## 4 <NA> <NA> ## 5 5.7747082503535152 per pound ## 6 4.7346216897250253 per pound ## 7 0.53587377610644515 per pound ## 8 3.5929897554945156 per pound ## 9 <NA> <NA> ## 10 <NA> <NA> ## 11 <NA> <NA> ## 12 <NA> <NA> ## 13 0.89780204117954143 per pound ## 14 2.0938274120049827 per pound ## 15 0.79665620543008364 per pound ## 16 2.0446834079658482 per pound ## 17 1.3775634470319702 per pound ## 18 1.7611484827950696 per pound ## 19 1.0351727302444853 per pound ## 20 1.2980115892049107 per pound ## 21 1.5911868532458617 per pound ## 22 1.4615746043999458 per pound ## 23 0.62766194593569868 per pound ## 24 1.8274160078099031 per pound ## 25 2.1735904118559191 per pound ## 26 
6.9758107988552958 per pound ## 27 2.3588084831103004 per pound ## 28 1.3779618772323634 per pound ## 29 0.33341203532340097 per pound Finally, the last step is to remove the extra text from the fruit name. One way to do this is with the str_split_fixed function from the stringr package. There’s an em dash — after each fruit name, which you can use for the split: fruit = str_split_fixed(all_prices$fruit, "—", 2)[, 1] all_prices$fruit = fruit all_prices ## fruit price unit ## 1 Apples 1.5675153914496354 per pound ## 2 Apricots 3.0400719670964378 per pound ## 3 Bananas 0.56698341453144807 per pound ## 4 Mixed berries <NA> <NA> ## 5 Blackberries 5.7747082503535152 per pound ## 6 Blueberries 4.7346216897250253 per pound ## 7 Cantaloupe 0.53587377610644515 per pound ## 8 Cherries 3.5929897554945156 per pound ## 9 Cranberries <NA> <NA> ## 10 Dates <NA> <NA> ## 11 Figs <NA> <NA> ## 12 Fruit cocktail <NA> <NA> ## 13 Grapefruit 0.89780204117954143 per pound ## 14 Grapes 2.0938274120049827 per pound ## 15 Honeydew melon 0.79665620543008364 per pound ## 16 Kiwi 2.0446834079658482 per pound ## 17 Mangoes 1.3775634470319702 per pound ## 18 Nectarines 1.7611484827950696 per pound ## 19 Oranges 1.0351727302444853 per pound ## 20 Papaya 1.2980115892049107 per pound ## 21 Peaches 1.5911868532458617 per pound ## 22 Pears 1.4615746043999458 per pound ## 23 Pineapple 0.62766194593569868 per pound ## 24 Plums 1.8274160078099031 per pound ## 25 Pomegranate 2.1735904118559191 per pound ## 26 Raspberries 6.9758107988552958 per pound ## 27 Strawberries 2.3588084831103004 per pound ## 28 Tangerines 1.3779618772323634 per pound ## 29 Watermelon 0.33341203532340097 per pound Now the data are ready for analysis. You could extend the reader function to extract more of the data (e.g., dried and frozen prices), but the overall process is fundamentally the same. Write the code to handle one file (one step), generalize it to work on several, and then iterate. 
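One further cleanup step you might want (a sketch, not part of the original pipeline): the price column is still a character vector, because sapply returned a character matrix. Converting it with as.numeric makes the prices usable in computations, and rows with missing prices become NA automatically. The example rows below are abbreviated stand-ins for the full data frame built above.

```r
# Sketch with two abbreviated example rows; in the pipeline above,
# `all_prices` already exists and only the last line is needed.
all_prices = data.frame(
  fruit = c("Apples", "Mixed berries"),
  price = c("1.5675153914496354", NA),
  unit  = c("per pound", NA)
)

# Convert the character prices to numeric; NA stays NA.
all_prices$price = as.numeric(all_prices$price)
all_prices$price
```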
For another example, see Liza Wood’s Real-world Function Writing Mini-reader. "],["references.html", "References", " References This reader would not have been possible without the many excellent reference texts created by other members of the R community. Now that you’ve completed this reader, these texts are a great way to continue your R learning journey. Advanced R by Hadley Wickham is a must-read if you want a deep understanding of R. It provides many examples of R features that are important for package/software development. Other texts I’ve found useful include: What They Forgot to Teach You About R by Bryan & Hester. The Art of R Programming by Matloff (of UC Davis). A general reference on R programming, with more of a computer science and software engineering perspective than most R texts. The R Inferno by Burns. A discussion of the most difficult and confusing parts of R. R Packages by Wickham. A gentle, modern introduction to creating packages for R. Writing R Extensions by the R core developers. A description of how to create packages and other extensions for R. R Language Definition by the R core developers. Documentation about how R works at a low level. R Internals by the R core developers. Documentation about how R works internally (that is, its C code). Finally, here are a few other readers and notes created by DataLab staff: My personal teaching notes from several years of teaching statistical computing. R Basics, our workshop series aimed at people just starting to learn R. Adventures in Data Science, our course introducing humanities undergraduates to data science techniques. Python Basics, our workshop series aimed at people just starting to learn Python. Intermediate Python, this reader’s counterpart for Python users. 
"],["assessment.html", "Assessment", " Assessment If you are taking this workshop to complete a GradPathways research computing or other DataLab sponsored pathway, you need to complete an assessment and submit to GradPathways to claim your badge. For the “Intermediate R” 4-part workshop series, you can download the assessment instructions here. For the “Intermediate R: Data Visualizations with Ggplot” workshop, register for the Data Visualization pathway and complete the following. Submit all materials to GradPathways via the Badgr portal: Generate a data visualization from your research data using ggplot in R. Export it as a .jpg titled “figure_v1”. Upload the figure and code used to generate it. Next, iterate on the figure using data visualization best practices such that it is a publication-worthy plot. Export this revised plot as “figure_v2”, and upload it along with its respective code and a figure caption. Write a narrative explaining the data visualization. This should include a short data biography (what are the data, who collected it and why, and who it affects), the purpose of the plot (audience, goals), and what data story it tells. Discuss what changes you made, and why, from v1 to v2. Also list a few additional changes you would ideally like to make to v2 in the future. If you are happy with v2, instead of listing future changes discuss what other plot types might be appropriate or additional data visualizations you could make to help support your data story. If you have questions regarding how to submit your assessment materials, contact GradPathways. "],["404.html", "Page not found", " Page not found The page you requested cannot be found (perhaps it was moved or renamed). You may want to try searching to find the page's new location, or use the table of contents to find the page you are looking for. "]]
+[["index.html", "Intermediate R Overview", " Intermediate R Nick Ulle and Wesley Brooks 2024-01-14 Overview This is the reader for all of UC Davis DataLab’s Intermediate R workshop series. There are currently three: Thinking in R, which is about understanding how R works, how to diagnose and fix bugs in code, and how to estimate and measure performance characteristics of code. Cleaning Data & Automating Tasks, which is about how to clean and prepare messy data such as dates, times, and text for analysis, and how to use loops or other forms of iteration to automate repetitive tasks. Data Visualization and Analysis in R, which is about plotting data and models in R. Each series is independent and consists of 2 sessions (equivalently, 2 chapters in this reader). After completing these series, students will have a better understanding of language features, packages, and programming strategies, which will enable them to write more efficient code, be more productive when writing code, and debug code more effectively. These series are not an introduction to R. Participants are expected to have prior experience using R, be comfortable with basic R syntax, and to have it pre-installed and running on their laptops. They are appropriate for motivated intermediate to advanced users who want a better understanding of base R. "],["string-date-processing.html", "1 String & Date Processing 1.1 The Tidyverse 1.2 Parsing Dates & Times 1.3 String Fundamentals 1.4 Processing Strings 1.5 Regular Expression Examples", " 1 String & Date Processing This chapter is part 1 (of 2) of Cleaning & Reshaping Data, a workshop series about how to prepare data for analysis. The major topics of this chapter are how to convert dates and times into appropriate R data types and how to extract and clean up data from strings (including numbers with non-numeric characters such as $, %, and ,). 
Learning Objectives Explain why we use special data structures for dates & times Identify the correct data structure for a given date/time Use lubridate to parse a date Use the date format string mini-language Use escape codes in strings to represent non-keyboard characters Explain what a text encoding is Use the stringr package to detect, extract, and change patterns in strings Use the regular expressions mini-language 1.1 The Tidyverse For working with dates, times, and strings, we recommend using packages from the Tidyverse, a popular collection of packages for doing data science. Compared to R’s built-in functions, we’ve found that the functions in Tidyverse packages are generally easier to learn and use. They also provide additional features and have more robust support for characters outside of the Latin alphabet. Although they’re developed by many different members of the R community, Tidyverse packages follow a unified design philosophy, and thus have many interfaces and data structures in common. The packages provide convenient and efficient alternatives to built-in R functions for many tasks, including: Reading and writing files (package readr) Processing dates and times (packages lubridate, hms) Processing strings (package stringr) Reshaping data (package tidyr) Making visualizations (package ggplot2) And more Think of the Tidyverse as a different dialect of R. Sometimes the syntax is different, and sometimes ideas are easier or harder to express concisely. As a consequence, the Tidyverse is sometimes polarizing in the R community. It’s useful to be literate in both base R and the Tidyverse, since both are popular. One major advantage of the Tidyverse is that the packages are usually well-documented and provide lots of examples. Every package has a documentation website and the most popular ones also have cheatsheets. 
1.2 Parsing Dates & Times When working with dates and times, you might want to: Use them to sort other data Add or subtract an offset Get components like the month, day, or hour Compute derived components like the day of week or quarter Compute differences Even though this list isn’t exhaustive, it shows that there are lots of things you might want to do. In order to do them in R, you must first make sure that your dates and times are represented by appropriate data types. Most of R’s built-in functions for loading data do not automatically recognize dates and times. This section describes several data types that represent dates and times, and explains how to use R to parse—break down and convert—dates and times to these types. 1.2.1 The lubridate Package As explained in Section 1.1, we recommend the Tidyverse packages for working with dates and times over other packages or R’s built-in functions. There are two: lubridate, the primary package for working with dates and times hms, a package specifically for working with time durations This chapter only covers lubridate, since it’s more useful in most situations. The package has detailed documentation and a cheatsheet. You’ll have to install the package if you haven’t already, and then load it: # install.packages("lubridate") library("lubridate") ## ## Attaching package: 'lubridate' ## The following objects are masked from 'package:base': ## ## date, intersect, setdiff, union Perhaps the most common task you’ll need to do with date and time data is convert from strings to more appropriate data types. This is because R’s built-in functions for reading data from a text format, such as read.csv, read dates and times as strings. For example, here are some dates as strings: date_strings = c("Jan 10, 2021", "Sep 3, 2018", "Feb 28, 1982") date_strings ## [1] "Jan 10, 2021" "Sep 3, 2018" "Feb 28, 1982" You can tell that these are dates, but as far as R is concerned, they’re text. 
The lubridate package provides a variety of functions to automatically parse strings into date or time objects that R understands. These functions are named with one letter per component of the date or time. The order of the letters must match the order of the components in the string you want to parse. In the example, the strings have the month (m), then the day (d), and then the year (y), so you can use the mdy function to parse them automatically: dates = mdy(date_strings) dates ## [1] "2021-01-10" "2018-09-03" "1982-02-28" class(dates) ## [1] "Date" Notice that the dates now have class Date, one of R’s built-in classes for representing dates, and that R prints them differently. Now R recognizes that the dates are in fact dates, so they’re ready to use in an analysis. There is a complete list of the automatic parsing functions in the lubridate documentation. Note: a relatively new package, clock, tries to solve some problems with the Date class people have identified over the years. The package is in the r-lib collection of packages, which provide low-level functionality complementary to the Tidyverse. Eventually, it may be preferable to use the classes in clock rather than the Date class, but for now, the Date class is still suitable for most tasks. Occasionally, a date or time string may have a format that lubridate can’t parse automatically. In that case, you can use the fast_strptime function to describe the format in detail. At a minimum, the function requires two arguments: a vector of strings to parse and a format string. The format string describes the format of the dates or times, and is based on the syntax of strptime, a function provided by many programming languages (including R) to parse date or time strings. In a format string, a percent sign % followed by a character is called a specification and has a special meaning. 
Here are a few of the most useful ones: Specification Description 2015-01-29 21:32:55 %Y 4-digit year 2015 %m 2-digit month 01 %d 2-digit day 29 %H 2-digit hour 21 %M 2-digit minute 32 %S 2-digit second 55 %% literal % % %y 2-digit year 15 %B full month name January %b short month name Jan You can find a complete list in ?fast_strptime. Other characters in the format string do not have any special meaning. Write the format string so that it matches the format of the dates you want to parse. For example, let’s try parsing an unusual time format: time_string = "6 minutes, 32 seconds after 10 o'clock" time = fast_strptime(time_string, "%M minutes, %S seconds after %H o'clock") time ## [1] "0-01-01 10:06:32 UTC" class(time) ## [1] "POSIXlt" "POSIXt" R represents date-times with the classes POSIXlt and POSIXct. There’s no built-in class to represent times alone, which is why the result in the example above includes a date. Internally, a POSIXlt object is a list with elements to store different date and time components. On the other hand, a POSIXct object is a single floating point number (type double). If you want to store your time data in a data frame, use POSIXct objects, since data frames don’t work well with columns of lists. You can control whether fast_strptime returns a POSIXlt or POSIXct object by setting the lt parameter to TRUE or FALSE: time_ct = fast_strptime(time_string, "%M minutes, %S seconds after %H o'clock", lt = FALSE) class(time_ct) ## [1] "POSIXct" "POSIXt" Another common task is combining the numeric components of a date or time into a single object. You can use the make_date and make_datetime functions to do this. The parameters are named for the different components. For example: make_date(day = 10, year = 2023, month = 1) ## [1] "2023-01-10" These functions are vectorized, so you can use them to combine the components of many dates or times at once. 
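A small sketch of that vectorized use, reconstructing the three example dates from earlier in this section from their components:

```r
library("lubridate")

# Each argument is a vector; the components are combined elementwise
# into a vector of Date objects.
make_date(
  year  = c(2021, 2018, 1982),
  month = c(1, 9, 2),
  day   = c(10, 3, 28)
)
## [1] "2021-01-10" "2018-09-03" "1982-02-28"
```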
They’re especially useful for reconstructing dates and times from tabular datasets where each component is stored in a separate column. After you’ve converted your date and time data to appropriate types, you can do any of the operations listed at the beginning of this section. For example, you can use lubridate’s period function to create an offset to add to a date or time: dates ## [1] "2021-01-10" "2018-09-03" "1982-02-28" dates + period(1, "month") ## [1] "2021-02-10" "2018-10-03" "1982-03-28" You can also use lubridate functions to get or set the components. These functions usually have the same name as the component. For instance: day(dates) ## [1] 10 3 28 month(dates) ## [1] 1 9 2 See the lubridate documentation for even more details about what you can do. 1.2.2 Case Study: Ocean Temperatures The U.S. National Oceanic and Atmospheric Administration (NOAA) publishes ocean temperature data collected by sensor buoys off the coast on the National Data Buoy Center (NDBC) website. California also has many sensors collecting ocean temperature data that are not administered by the federal government. Data from these is published on the California Ocean Observing Systems (CALOOS) Data Portal. Suppose you’re a researcher who wants to combine ocean temperature data from both sources to use in R. Both publish the data in comma-separated value (CSV) format, but record dates, times, and temperatures differently. Thus you need to be careful that the dates and times are parsed correctly. Download these two 2021 datasets: 2021_noaa-ndbc_46013.txt, from NOAA buoy 46013, off the coast of Bodega Bay (DOWNLOAD)(source) 2021_ucdavis_bml_wts.csv, from the UC Davis Bodega Bay Marine Laboratory’s sensors (DOWNLOAD)(source) The NOAA data has a fixed-width format, which means each column has a fixed width in characters over all rows. The readr package provides a function read_fwf that can automatically guess the column widths and read the data into a data frame. 
The column names appear in the first row and column units appear in the second row, so read those rows separately: # install.packages("readr") library("readr") noaa_path = "data/ocean_data/2021_noaa-ndbc_46013.txt" noaa_headers = read_fwf(noaa_path, n_max = 2, guess_max = 1) ## Rows: 2 Columns: 18 ## ── Column specification ────────────────────────────────────────────────────────────── ## ## chr (18): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, ... ## ## ℹ Use `spec()` to retrieve the full column specification for this data. ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. noaa = read_fwf(noaa_path, skip = 2) ## Rows: 3323 Columns: 18 ## ── Column specification ────────────────────────────────────────────────────────────── ## ## chr (4): X2, X3, X4, X5 ## dbl (14): X1, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18 ## ## ℹ Use `spec()` to retrieve the full column specification for this data. ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. names(noaa) = as.character(noaa_headers[1, ]) names(noaa)[1] = "YY" The dates and times for the observations are separated into component columns, and the read_fwf function does not convert some of these to numbers automatically. You can use as.numeric to convert them to numbers: cols = 2:5 noaa[cols] = lapply(noaa[cols], as.numeric) Finally, use the make_datetime function to combine the components into date-time objects: noaa_dt = make_datetime(year = noaa$YY, month = noaa$MM, day = noaa$DD, hour = noaa$hh, min = noaa$mm) noaa$date = noaa_dt head(noaa_dt) ## [1] "2021-01-01 00:00:00 UTC" "2021-01-01 00:10:00 UTC" ## [3] "2021-01-01 00:20:00 UTC" "2021-01-01 00:30:00 UTC" ## [5] "2021-01-01 00:40:00 UTC" "2021-01-01 00:50:00 UTC" That takes care of the dates in the NOAA data. The Bodega Marine Lab data is CSV format, which you can read with read.csv or the readr package’s read_csv function. 
The latter is faster and usually better at guessing column types. The column names appear in the first row and the column units appear in the second row. The read_csv function handles the names automatically, but you’ll have to remove the unit row as a separate step: bml = read_csv("data/ocean_data/2021_ucdavis_bml_wts.csv") ## Rows: 87283 Columns: 4 ## ── Column specification ────────────────────────────────────────────────────────────── ## Delimiter: "," ## chr (3): time, sea_water_temperature, z ## dbl (1): sea_water_temperature_qc_agg ## ## ℹ Use `spec()` to retrieve the full column specification for this data. ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. bml = bml[-1, ] The dates and times of the observations were loaded as strings. You can use lubridate’s ymd_hms function to automatically parse them: bml_dt = ymd_hms(bml$time) bml$date = bml_dt head(bml_dt) ## [1] "2020-12-31 09:06:00 UTC" "2020-12-31 09:12:00 UTC" ## [3] "2020-12-31 09:18:00 UTC" "2020-12-31 09:24:00 UTC" ## [5] "2020-12-31 09:30:00 UTC" "2020-12-31 09:36:00 UTC" Now you have date and time objects for both datasets, so you can combine the two. For example, you could extract the date and water temperature columns from each, create a new column identifying the data source, and then row-bind the datasets together. 1.3 String Fundamentals Strings represent text, but even if your datasets are composed entirely of numbers, you’ll need to know how to work with strings. Text formats for data are widespread: comma-separated values (CSV), tab-separated values (TSV), JavaScript object notation (JSON), a panoply of markup languages (HTML, XML, YAML, TOML), and more. When you read data in these formats into R, sometimes R will correctly convert the values to appropriate non-string types. The rest of the time, you need to know how to work with strings so that you can fix whatever went wrong and convert the data yourself. 
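As a small preview of what such a fix can look like, here is a sketch using base R’s gsub with made-up prices (stringr equivalents appear later in this chapter):

```r
# Prices read from a text file arrive as strings. Strip the dollar signs
# and commas, then convert with as.numeric. Inside a regex character
# class [ ], both $ and , are treated literally.
prices = c("$1,200", "$85", "$3,400")
as.numeric(gsub("[$,]", "", prices))
# the result is the numeric vector 1200, 85, 3400
```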
This section introduces several fundamental concepts related to working with strings. The next section, Section 1.4.1, describes the stringr package for working with strings. The last section, Section 1.4.2, builds on both and explains how to do powerful pattern matching. 1.3.1 Printing There are two different ways to print strings: you can print a representation of the characters in the string or you can print the actual characters in the string. To print a representation of the characters in a string, use the print function. The representation is useful to identify characters that are not normally visible, such as tabs and the characters that mark the end of a line. To print the actual characters in a string, use the message function. This important difference in how the print and message functions print strings is demonstrated in the next section. You can learn more about different ways to print output in R by reading Section 3.2. 1.3.2 Escape Sequences In a string, an escape sequence or escape code consists of a backslash followed by one or more characters. Escape sequences make it possible to: Write quotes or backslashes within a string Write characters that don’t appear on your keyboard (for example, characters in a foreign language) For example, the escape sequence \\n corresponds to the newline character. Notice that the message function translates \\n into a literal new line, whereas the print function doesn’t: x = "Hello\\nNick" message(x) ## Hello ## Nick print(x) ## [1] "Hello\\nNick" As another example, suppose we want to put a literal quote in a string. We can either enclose the string in the other kind of quotes, or escape the quotes in the string: x = 'She said, "Hi"' message(x) ## She said, "Hi" y = "She said, \\"Hi\\"" message(y) ## She said, "Hi" Since escape sequences begin with backslash, we also need to use an escape sequence to write a literal backslash. 
The escape sequence for a literal backslash is two backslashes: x = "\\\\" message(x) ## \\ There’s a complete list of escape sequences for R in the ?Quotes help file. Other programming languages also use escape sequences, and many of them are the same as in R. 1.3.3 Raw Strings A raw string is a string where escape sequences are turned off. Raw strings are especially useful for writing regular expressions (covered in Section 1.4.2). Raw strings begin with r\" and an opening delimiter (, [, or {. Raw strings end with a matching closing delimiter and quote. For example: x = r"(quotes " and backslashes \\)" message(x) ## quotes " and backslashes \\ Raw strings were added to R in version 4.0 (April 2020), and won’t work correctly in older versions. 1.3.4 Character Encodings Computers store data as numbers. In order to store text on a computer, people have to agree on a character encoding, a system for mapping characters to numbers. For example, in ASCII, one of the most popular encodings in the United States, the character a maps to the number 97. Many different character encodings exist, and sharing text used to be an inconvenient process of asking or trying to guess the correct encoding. This was so inconvenient that in the 1980s, software engineers around the world united to create the Unicode standard. Unicode includes symbols for nearly all languages in use today, as well as emoji and many ancient languages (such as Egyptian hieroglyphs). Unicode maps characters to numbers, but unlike a character encoding, it doesn’t dictate how those numbers should be mapped to bytes (sequences of ones and zeroes). As a result, there are several different character encodings that support and are synonymous with Unicode. The most popular of these is UTF-8. In R, you can write Unicode characters with the escape sequence \\U followed by the number for the character in base 16. For instance, the number for a in Unicode is 97 (the same as in ASCII). In base 16, 97 is 61. 
So you can write an a as: x = "\\U61" # or "\\u61" x ## [1] "a" Unicode escape sequences are usually only used for characters that are not easy to type. For example, the cat emoji is number 1f408 (in base 16) in Unicode. So the string \"\\U1f408\" is the cat emoji. Note that being able to see printed Unicode characters also depends on whether the font your computer is using has a glyph (image representation) for that character. Many fonts are limited to a small number of languages. The NerdFont project patches fonts commonly used for programming so that they have better Unicode coverage. Using a font with good Unicode coverage is not essential, but it’s convenient if you expect to work with many different natural languages or love using emoji. 1.3.4.1 Character Encodings in Text Files Most of the time, R will handle character encodings for you automatically. However, if you ever read or write a text file (including CSV and other formats) and the text looks like gibberish, it might be an encoding problem. This is especially true on Windows, the only modern operating system that does not (yet) use UTF-8 as the default encoding. Encoding problems when reading a file can usually be fixed by passing the encoding to the function doing the reading. For instance, the code to read a UTF-8 encoded CSV file on Windows is: read.csv("my_data.csv", fileEncoding = "UTF-8") Other reader functions may use a different parameter to set the encoding, so always check the documentation. On computers where the native language is not set to English, it can also help to set R’s native language to English with Sys.setlocale(locale = \"English\"). Encoding problems when writing a file are slightly more complicated to fix. See this blog post for a thorough explanation. 
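If mis-encoded text has already made it into R, base R’s iconv function can sometimes repair it by converting between encodings after the fact. A sketch (the Latin-1 example string is made up for illustration):

```r
# "café" stored as Latin-1 bytes: é is byte 0xe9 in Latin-1.
x = "caf\xe9"
Encoding(x) = "latin1"   # declare what encoding the bytes are in

# Convert the Latin-1 bytes to UTF-8.
iconv(x, from = "latin1", to = "UTF-8")
```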
1.4 Processing Strings String processing encompasses a variety of tasks such as searching for patterns within strings, extracting data from within strings, splitting strings into component parts, and removing or replacing unwanted characters (excess whitespace, punctuation, and so on). If you work with data, sooner or later you’ll run into a dataset in text format that needs a few text corrections before or after you read it into R, and for that you’ll find familiarity with string processing invaluable. 1.4.1 The stringr Package Although R has built-in functions for string processing, we recommend using the stringr package for all of your string processing needs. The package is part of the Tidyverse, a collection of packages introduced in Section 1.1. Major advantages of stringr over other packages and R’s built-in functions include: Correctness: the package builds on International Components for Unicode (ICU), the Unicode Consortium’s own library for handling text encodings Discoverability: every function’s name begins with str_ so they’re easy to discover, remember, and identify in code Interface consistency: the first argument is always the string to process, the second argument is always the pattern to match (if applicable) Vectorization: most of the functions are vectorized in the first and second argument stringr has detailed documentation and also a cheatsheet. The first time you use stringr, you’ll have to install it with install.packages (the same as any other package). Then you can load the package with the library function: # install.packages("stringr") library(stringr) The typical syntax of a stringr function is: str_name(string, pattern, ...) Where: name describes what the function does string is a string to search within or transform pattern is a pattern to search for, if applicable ... is additional, function-specific arguments For example, the str_detect function detects whether a pattern appears within a string. 
The function returns TRUE if the pattern is found and FALSE if it isn’t: str_detect("hello", "el") ## [1] TRUE str_detect("hello", "ol") ## [1] FALSE Most of the stringr functions are vectorized in the string parameter: str_detect(c("hello", "goodbye", "lo"), "lo") ## [1] TRUE FALSE TRUE As another example, the str_sub function extracts a substring from a string, given the substring’s position. The first argument is the string, the second is the position of the substring’s first character, and the third is the position of the substring’s last character: str_sub("You speak of destiny as if it was fixed.", 5, 9) ## [1] "speak" The str_sub function is especially useful for extracting data from strings that have a fixed width (although the readr package’s read_fwf is usually a better choice if you have a fixed-width file). There are a lot of stringr functions. Five that are especially important and are explained in this reader are: str_detect, to test whether a string contains a pattern str_sub, to extract a substring at a given position from a string str_replace, to replace or remove parts of a string str_split_fixed, to split a string into parts str_match, to extract data from a string You can find a complete list of functions with examples on the stringr documentation’s reference page and the cheatsheet. 1.4.2 Regular Expressions The stringr functions use a special language called regular expressions or regex to describe patterns in strings. Many other programming languages also have string processing tools that use regular expressions, so fluency with regular expressions is a transferable skill. You can use a regular expression to describe a complicated pattern in just a few characters because some characters, called metacharacters, have special meanings. Metacharacters are usually punctuation characters. They are never letters or numbers, which always have their literal meaning. This table lists some of the most useful metacharacters: Metacharacter Meaning . 
any one character (wildcard) \ escape character (in both R and regex), see Section 1.3.2 ^ the beginning of string (not a character) $ the end of string (not a character) [ab] one character, either 'a' or 'b' [^ab] one character, anything except 'a' or 'b' ? the previous character appears 0 or 1 times * the previous character appears 0 or more times + the previous character appears 1 or more times () make a group | match left OR right side (not a character) Section 1.5 provides examples of how most of the metacharacters work. Even more examples are presented in the stringr package’s regular expressions vignette. You can find a complete listing of regex metacharacters in ?regex or on the stringr cheatsheet. You can disable regular expressions in a stringr function by calling the fixed function on the pattern. For example, to test whether a string contains a literal dot .: x = c("No dot", "Lotsa dots...") str_detect(x, fixed(".")) ## [1] FALSE TRUE It’s a good idea to call fixed on any pattern that doesn’t contain regex metacharacters, because it communicates to the reader that you’re not using regex, it helps to prevent bugs, and it provides a small speed boost. 1.4.3 Replacing Parts of Strings Replacing part of a string is a common string processing task. For instance, quantitative data often contain non-numeric characters such as commas, currency symbols, and percent signs. These must be removed before converting to numeric data types. Replacement and removal go hand-in-hand, since removal is equivalent to replacing part of a string with the empty string "". The str_replace function replaces the first part of a string that matches a pattern (from left to right), while the related str_replace_all function replaces every part of a string that matches a pattern. Most stringr functions that do pattern matching come in a pair like this: one to process only the first match and one to process every match.
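This first-match/every-match pairing appears throughout the package. For instance, str_extract and str_extract_all (two stringr functions not covered in detail in this reader) follow the same convention. A quick sketch:

```r
library(stringr)

x = "cats, dogs, and frogs"

# str_extract returns only the first match:
str_extract(x, "[a-z]+s")
## [1] "cats"

# str_extract_all returns every match, as a list with one element
# per input string:
str_extract_all(x, "[a-z]+s")
## [[1]]
## [1] "cats"  "dogs"  "frogs"
```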
As an example, suppose you want to remove commas from a number so that you can convert it with as.numeric, which returns NA for numbers that contain commas. You want to remove all of the commas, so str_replace_all is the function to use. As usual, the first argument is the string and the second is the pattern. The third argument is the replacement, which is the empty string "" in this case: x = "1,000,000" str_replace_all(x, ",", "") ## [1] "1000000" The str_replace function doesn’t work as well for this task, since it only replaces the first match to the pattern: str_replace(x, ",", "") ## [1] "1000,000" You can also use these functions to replace or remove longer patterns within words. For instance, suppose you want to change the word "dog" to "cat": x = c("dogs are great, dogs are fun", "dogs are fluffy") str_replace(x, "dog", "cat") ## [1] "cats are great, dogs are fun" "cats are fluffy" str_replace_all(x, "dog", "cat") ## [1] "cats are great, cats are fun" "cats are fluffy" As a final example, you can use the replacement functions and a regex pattern to replace repeated spaces with a single space. This is a good standardization step if you’re working with text. The key is to use the regex quantifier +, which means a character “repeats one or more times” in the pattern, and to use a single space " " as the replacement: x = "This  sentence has   extra    space." str_replace_all(x, " +", " ") ## [1] "This sentence has extra space." If you just want to trim (remove) all whitespace from the beginning and end of a string, you can use the str_trim function instead. 1.4.4 Splitting Strings Distinct data in a text are generally separated by a character like a space or a comma, to make them easy for people to read. Often these separators also make the data easy for R to parse. The idea is to split the string into a separate value at each separator. The str_split function splits a string at each match to a pattern.
The matching characters—that is, the separators—are discarded. For example, suppose you want to split several numbers separated by commas and spaces: x = "21, 32.3, 5, 64" result = str_split(x, ", ") result ## [[1]] ## [1] "21" "32.3" "5" "64" The str_split function always returns a list with one element for each input string. Here the list only has one element because x only has one element. You can get the first element with: result[[1]] ## [1] "21" "32.3" "5" "64" You then convert the values with as.numeric. To see why the str_split function always returns a list, consider what happens if you try to split two different strings at once: x = c(x, "10, 15, 1.3") result = str_split(x, ", ") result ## [[1]] ## [1] "21" "32.3" "5" "64" ## ## [[2]] ## [1] "10" "15" "1.3" Each string has a different number of parts, so the vectors in the result have different lengths. So a list is the only way to store them. You can also use the str_split function to split a sentence into words. Use spaces for the split: x = "The students in this workshop are great!" str_split(x, " ") ## [[1]] ## [1] "The" "students" "in" "this" "workshop" "are" "great!" When you know exactly how many parts you expect a string to have, use the str_split_fixed function instead of str_split. It accepts a third argument for the maximum number of splits to make. Because the number of splits is fixed, the function can return the result in a matrix instead of a list. For example: x = c("1, 2, 3", "10, 20, 30") str_split_fixed(x, ", ", 3) ## [,1] [,2] [,3] ## [1,] "1" "2" "3" ## [2,] "10" "20" "30" The str_split_fixed function is often more convenient than str_split because the nth piece of each input string is just the nth column of the result. 
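Because str_split_fixed returns a matrix, you can also attach column names and convert the result to a data frame, which makes each piece accessible by name. A quick sketch (the column names here are invented for illustration):

```r
library(stringr)

x = c("1, 2, 3", "10, 20, 30")
parts = str_split_fixed(x, ", ", 3)
colnames(parts) = c("first", "second", "third")
df = as.data.frame(parts)
df$first
## [1] "1"  "10"
```

The values are still strings at this point, so you would follow up with as.numeric on each column.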
For example, suppose you want to get the area codes from some phone numbers: phones = c("717-555-3421", "629-555-8902", "903-555-6781") result = str_split_fixed(phones, "-", 3) result[, 1] ## [1] "717" "629" "903" 1.4.5 Extracting Matches Occasionally, you might need to extract parts of a string in a more complicated way than string splitting allows. One solution is to write a regular expression that will match all of the data you want to capture, with parentheses ( ), the regex metacharacter for a group, around each distinct value. Then you can use the str_match function to extract the groups. Section 1.5.6 presents some examples of regex groups. For example, suppose you want to split an email address into three parts: the user name, the domain name, and the top-level domain. To create a regular expression that matches email addresses, you can use the @ and . in the address as anchors. The surrounding characters are generally alphanumeric, which you can represent with the “word” metacharacter \w: \w+@\w+[.]\w+ Next, put parentheses ( ) around each part that you want to extract: (\w+)@(\w+)[.](\w+) Finally, use this pattern in str_match, adding extra backslashes so that everything is escaped correctly: x = "datalab@ucdavis.edu" regex = "(\\w+)@(\\w+)[.](\\w+)" str_match(x, regex) ## [,1] [,2] [,3] [,4] ## [1,] "datalab@ucdavis.edu" "datalab" "ucdavis" "edu" The function extracts the overall match to the pattern, as well as the match to each group. The pattern in this example doesn’t work for all possible email addresses, since user names can contain dots and other characters that are not alphanumeric. You could generalize the pattern if necessary. The point is that the str_match function and groups provide an extremely flexible way to extract data from strings. 1.4.6 Case Study: U.S. Warehouse Stocks The U.S.
Department of Agriculture (USDA) publishes a variety of datasets online, particularly through its National Agricultural Statistics Service (NASS). Unfortunately, most of these are published in PDF or semi-structured text format, which makes reading the data into R or other statistical software a challenge. The USDA NASS posts monthly reports about stocks of agricultural products in refrigerated warehouses. In this case study, you’ll use string processing functions to extract a table of data from the December 2022 report. To begin, download the report and save it somewhere on your computer. Then open the file in a text editor (or RStudio) to inspect it. The goal is to extract the first table, about “Nuts, Dairy Products, Frozen Eggs, and Frozen Poultry,” from the report. The report is a semi-structured mix of natural language text and fixed-width tables. As a consequence, most functions for reading tabular data will not work well on the entire report. You could try to use a function for reading fixed-width data, such as read.fwf or the readr package’s read_fwf on only the lines containing a table. Another approach, which is shown here, is to use string processing functions to find and extract the table. The readLines function reads a text file into a character vector with one element for each line. This makes the function useful for reading unstructured or semi-structured text. Use the function to read the report: report = readLines("data/cost1222.txt") head(report) ## [1] "" ## [2] "Cold Storage" ## [3] "" ## [4] "ISSN: 1948-903X" ## [5] "" ## [6] "Released December 22, 2022, by the National Agricultural Statistics Service " In the report, tables always begin and end with lines that contain only dashes -. By locating these all-dash lines, you can locate the tables. Like str_detect, the str_which function tests whether strings in a vector match a pattern.
The only difference is that str_which returns the indexes of the strings that matched (as if you had called which) rather than a logical vector. Use str_which to find the all-dash lines: # The regex means: # ^ beginning of string # -+ one or more dashes # $ end of string dashes = str_which(report, "^-+$") head(report[dashes], 2) ## [1] "--------------------------------------------------------------------------------------------------------------------------" ## [2] "--------------------------------------------------------------------------------------------------------------------------" Each table contains three all-dash lines: the first marks the top of the table, the second separates the header from the body, and the third marks the bottom. The header and body of the first table are: report[dashes[1]:dashes[2]] ## [1] "--------------------------------------------------------------------------------------------------------------------------" ## [2] " : : : November 30, 2022 : Public " ## [3] " : Stocks in all warehouses : as a percent of : warehouse " ## [4] " : : : : stocks " ## [5] " Commodity :-----------------------------------------------------------------------------------" ## [6] " :November 30, : October 31, :November 30, :November 30, : October 31, :November 30, " ## [7] " : 2021 : 2022 : 2022 : 2021 : 2022 : 2022 " ## [8] "--------------------------------------------------------------------------------------------------------------------------" bod = report[dashes[2]:dashes[3]] head(bod) ## [1] "--------------------------------------------------------------------------------------------------------------------------" ## [2] " : ------------ 1,000 pounds ----------- ---- percent ---- 1,000 pounds " ## [3] " : " ## [4] "Nuts : " ## [5] "Shelled : " ## [6] " Pecans .............................: 30,906 38,577 34,489 112 89 " The columns have fixed widths, so extracting the columns is relatively easy with str_sub if you can get the offsets. In the last line of the header, the columns are separated by colons :.
Thus you can use the str_locate_all function, which returns the locations of a pattern in a string, to get the offsets: # The regex means: # [^:]+ one or more characters, excluding colons # (:|$) a colon or the end of the line cols = str_locate_all(report[dashes[2] - 1], "[^:]+(:|$)") # Like str_split, str_locate_all returns a list cols = cols[[1]] cols ## start end ## [1,] 1 39 ## [2,] 40 53 ## [3,] 54 67 ## [4,] 68 81 ## [5,] 82 95 ## [6,] 96 109 ## [7,] 110 122 You can use these offsets with str_sub to break a line in the body of the table into columns: str_sub(bod[6], cols) ## [1] " Pecans .............................:" ## [2] " 30,906 " ## [3] " 38,577 " ## [4] " 34,489 " ## [5] " 112 " ## [6] " 89 " ## [7] " " Because of the way str_sub is vectorized, you can’t process every line in the body of the table in one vectorized call. Instead, you can use sapply to call str_sub on each line: # Set USE.NAMES to make the table easier to read tab = sapply(bod, str_sub, cols, USE.NAMES = FALSE) # The sapply function transposes the table tab = t(tab) head(tab) ## [,1] [,2] ## [1,] "---------------------------------------" "--------------" ## [2,] " :" " ------------" ## [3,] " :" " " ## [4,] "Nuts :" " " ## [5,] "Shelled :" " " ## [6,] " Pecans .............................:" " 30,906 " ## [,3] [,4] [,5] [,6] ## [1,] "--------------" "--------------" "--------------" "--------------" ## [2,] " 1,000 pounds " "----------- " " ---- perc" "ent ---- " ## [3,] " " " " " " " " ## [4,] " " " " " " " " ## [5,] " " " " " " " " ## [6,] " 38,577 " " 34,489 " " 112 " " 89 " ## [,7] ## [1,] "-------------" ## [2,] "1,000 pounds " ## [3,] " " ## [4,] " " ## [5,] " " ## [6,] " " The columns still contain undesirable punctuation and whitespace, but you can remove these with str_replace_all and str_trim. 
Since the table is a matrix, it’s necessary to use apply to process it column-by-column: # The regex means: # , a comma # | OR # [.]* zero or more literal dots # : a colon # $ the end of the line tab = apply(tab, 2, function(col) { col = str_replace_all(col, ",|[.]*:$", "") str_trim(col) }) head(tab) ## [,1] [,2] ## [1,] "---------------------------------------" "--------------" ## [2,] "" "------------" ## [3,] "" "" ## [4,] "Nuts" "" ## [5,] "Shelled" "" ## [6,] "Pecans" "30906" ## [,3] [,4] [,5] [,6] ## [1,] "--------------" "--------------" "--------------" "--------------" ## [2,] "1000 pounds" "-----------" "---- perc" "ent ----" ## [3,] "" "" "" "" ## [4,] "" "" "" "" ## [5,] "" "" "" "" ## [6,] "38577" "34489" "112" "89" ## [,7] ## [1,] "-------------" ## [2,] "1000 pounds" ## [3,] "" ## [4,] "" ## [5,] "" ## [6,] "" The first few rows and the last row can be removed, since they don’t contain data. Then you can convert the table to a data frame and convert the individual columns to appropriate data types: tab = tab[-c(1:3, nrow(tab)), ] tab = data.frame(tab) tab[2:7] = lapply(tab[2:7], as.numeric) head(tab, 10) ## X1 X2 X3 X4 X5 X6 X7 ## 1 Nuts NA NA NA NA NA NA ## 2 Shelled NA NA NA NA NA NA ## 3 Pecans 30906 38577 34489 112 89 NA ## 4 In-Shell NA NA NA NA NA NA ## 5 Pecans 63788 44339 47638 75 107 NA ## 6 NA NA NA NA NA NA ## 7 Dairy products NA NA NA NA NA NA ## 8 Butter 210473 239658 199695 95 83 188566 ## 9 Natural cheese NA NA NA NA NA NA ## 10 American 834775 831213 815655 98 98 NA The data frame is now sufficiently clean that you could use it for a simple analysis. Of course, there are many things you could do to improve the extracted data frame, such as identifying categories and subcategories in the first column, removing rows that are completely empty, and adding column names. These entail more string processing and data frame manipulation—if you want to practice your R skills, try doing them on your own. 
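As a sketch of one of those suggested improvements, here is one way to drop the completely empty rows, assuming the tab data frame built above:

```r
# A row is "empty" if its label is blank and every measurement is NA.
empty = tab$X1 == "" & rowSums(!is.na(tab[2:7])) == 0
tab = tab[!empty, ]
```

The other improvements, such as identifying categories and adding column names, follow a similar pattern of indexing and assignment.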
1.5 Regular Expression Examples This section provides examples of several different regular expression metacharacters and other features. Most of the examples use the str_view function, which is especially helpful for testing regular expressions. The function displays an HTML-rendered version of the string with the first match highlighted. The RegExr website is also helpful for testing regular expressions; it provides an interactive interface where you can write regular expressions and see where they match a string. 1.5.1 The Wildcard The regex wildcard character is . and matches any single character. For example: x = "dog" str_view(x, "d.g") ## [1] │ <dog> By default, regex searches from left to right: str_view(x, ".") ## [1] │ <d><o><g> 1.5.2 Escape Sequences Like R, regular expressions can contain escape sequences that begin with a backslash. These are computed separately and after R escape sequences. The main use for escape sequences in regex is to turn a metacharacter into a literal character. For example, suppose you want to match a literal dot. The regex for a literal dot is \. and, since backslashes in R strings have to be escaped, the R string for this regex is "\\.". For example: str_view("this.string", "\\.") ## [1] │ this<.>string The double backslash can be confusing, and it gets worse if you want to match a literal backslash. You have to escape the backslash in the regex (because backslash is the regex escape character) and then also have to escape the backslashes in R (because backslash is also the R escape character). So to match a single literal backslash in R, the code is: str_view("this\\that", "\\\\") ## [1] │ this<\>that Raw strings (see Section 1.3.3) make regular expressions easier to read, because they make backslashes literal (but they still mark the beginning of an escape sequence in regex).
You can use a raw string to write the above as: str_view(r"(this\that)", r"(\\)") ## [1] │ this<\>that 1.5.3 Anchors By default, a regex will match anywhere in the string. If you want to force a match at a specific place, use an anchor. The beginning of string anchor is ^. It marks the beginning of the string, but doesn’t count as a character in the pattern. For example, suppose you want to match an a at the beginning of the string: x = c("abc", "cab") str_view(x, "a") ## [1] │ <a>bc ## [2] │ c<a>b str_view(x, "^a") ## [1] │ <a>bc It doesn’t make sense to put characters before ^, since no characters can come before the beginning of the string. Likewise, the end of string anchor is $. It marks the end of the string, but doesn’t count as a character in the pattern. 1.5.4 Character Classes In regex, square brackets [ ] denote a character class. A character class matches exactly one character, but that character can be any of the characters inside of the square brackets. The square brackets themselves don’t count as characters in the pattern. For example, suppose you want to match c followed by either a or t: x = c("ca", "ct", "cat", "cta") str_view(x, "c[ta]") ## [1] │ <ca> ## [2] │ <ct> ## [3] │ <ca>t ## [4] │ <ct>a You can use a dash - in a character class to create a range. For example, to match letters p through z: str_view(x, "c[p-z]") ## [2] │ <ct> ## [4] │ <ct>a Ranges also work with numbers and capital letters. To match a literal dash, place the dash at the end of the character class (instead of between two other characters), as in [abc-]. Most metacharacters are literal when inside a character class. For example, [.] matches a literal dot. A hat ^ at the beginning of the character class negates the class. So for example, [^abc] matches any one character except for a, b, or c: str_view("abcdef", "[^abc]") ## [1] │ abc<d><e><f> 1.5.5 Quantifiers Quantifiers are metacharacters that affect how many times the preceding character must appear in a match.
The quantifier itself doesn’t count as a character in the match. For example, the question mark ? quantifier means the preceding character can appear 0 or 1 times. In other words, ? makes the preceding character optional. For example: x = c("abc", "ab", "ac", "abbc") str_view(x, "ab?c") ## [1] │ <abc> ## [3] │ <ac> The star * quantifier means the preceding character can appear 0 or more times. In other words, * means the preceding character can appear any number of times or not at all. For instance: str_view(x, "ab*c") ## [1] │ <abc> ## [3] │ <ac> ## [4] │ <abbc> The plus + quantifier means the preceding character must appear 1 or more times. Quantifiers are greedy, meaning they always match as many characters as possible. In this example, notice that the pattern matches the entire string, even though it could also match just abba: str_view("abbabbba", ".+a") ## [1] │ <abbabbba> You can add a question mark ? after another quantifier to make it non-greedy: str_view("abbabbba", ".+?a") ## [1] │ <abba><bbba> 1.5.6 Groups In regex, parentheses ( ) denote a group. The parentheses themselves don’t count as characters in the pattern. Groups are useful for repeating or extracting specific parts of a pattern (see Section 1.4.5). Quantifiers can act on groups in addition to individual characters. For example, suppose you want to make the entire substring ", dogs," optional in a pattern, so that both of the test strings in this example match: x = c("cats, dogs, and frogs", "cats and frogs") str_view(x, "cats(, dogs,)? and frogs") ## [1] │ <cats, dogs, and frogs> ## [2] │ <cats and frogs> 2 Tidy & Relational Data This chapter is part 2 (of 2) of Cleaning & Reshaping Data, a workshop series about how to prepare data for analysis.
The major topics of this chapter are how to reshape datasets with pivots and how to combine related datasets with joins. Learning Objectives After completing this session, learners should be able to: Explain what it means for data to be tidy Use the tidyr package to reshape data Explain what a relational dataset is Use the dplyr package to join data based on common columns Describe the different types of joins Identify which types of joins to use when faced with a relational dataset 2.1 Tidy Datasets The structure of a dataset—its shape and organization—has enormous influence on how difficult it will be to analyze, so making structural changes is an important part of the cleaning process. Researchers conventionally arrange tabular datasets so that each row contains a single observation or case, and each column contains a single kind of measurement or identifier, called a feature. In 2014, Hadley Wickham refined and formalized the conventions for tabular datasets by introducing the concept of tidy datasets, which have a specific structure. Paraphrasing Wickham, the rules for a tidy dataset are: Every column is a single feature. Every row is a single observation. Every cell is a single value. These rules ensure that all of the values in a dataset are visually organized and are easy to access with indexing operations. They’re also specific enough to make tidiness a convenient standard for functions that operate on tabular datasets. In fact, the Tidyverse packages (see Section 1.1) are designed from the ground up for working with tidy datasets. Tidy datasets have also been adopted as a standard in other software, including various packages for Python and Julia. This section explains how to reshape tabular datasets into tidy datasets. While reshaping can seem tricky at first, making sure your dataset has the right structure before you begin analysis saves time and frustration in the long run.
2.1.1 The tidyr Package The tidyr package provides functions to reshape tabular datasets. It also provides examples of tidy and untidy datasets. Like most Tidyverse packages, it comes with detailed documentation and a cheatsheet. As usual, install the package if you haven’t already, and then load it: # install.packages("tidyr") library(tidyr) Let’s start with an example of a tidy dataset. The table1 dataset in the package records the number of tuberculosis cases across several different countries and years: table1 ## # A tibble: 6 × 4 ## country year cases population ## <chr> <dbl> <dbl> <dbl> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 Each of the four columns contains a single kind of measurement or identifier, so the dataset satisfies tidy rule 1. The measurements were taken at the country-year level, and each row contains data for one country-year pair, so the dataset also satisfies tidy rule 2. Each cell in the data frame only contains one value, so the dataset also satisfies tidy rule 3. The same data are recorded in table2, table3, and the pair table4a with table4b, but these are all untidy datasets.
For example, table2 breaks rule 1 because the column count contains two different kinds of measurements—case counts and population counts: table2 ## # A tibble: 12 × 4 ## country year type count ## <chr> <dbl> <chr> <dbl> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 1999 population 19987071 ## 3 Afghanistan 2000 cases 2666 ## 4 Afghanistan 2000 population 20595360 ## 5 Brazil 1999 cases 37737 ## 6 Brazil 1999 population 172006362 ## 7 Brazil 2000 cases 80488 ## 8 Brazil 2000 population 174504898 ## 9 China 1999 cases 212258 ## 10 China 1999 population 1272915272 ## 11 China 2000 cases 213766 ## 12 China 2000 population 1280428583 When considering whether you should reshape a dataset, think about what the features are and what the observations are. These depend on the dataset itself, but also on what kinds of analyses you want to do. Datasets sometimes have closely related features or multiple (nested) levels of observation. The tidyr documentation includes a detailed article on how to reason about reshaping datasets. If you do decide to reshape a dataset, then you should also think about what role each feature serves: Identifiers are labels that distinguish observations from one another. They are often but not always categorical. Examples include names or identification numbers, treatment groups, and dates or times. In the tuberculosis data set, the country and year columns are identifiers. Measurements are the values collected for each observation and typically the values of research interest. For the tuberculosis data set, the cases and population columns are measurements. Having a clear understanding of which features are identifiers and which are measurements makes it easier to use the tidyr functions. 2.1.2 Rows into Columns Tidy data rule 1 is that each column must be a single feature. 
The table2 dataset breaks this rule: table2 ## # A tibble: 12 × 4 ## country year type count ## <chr> <dbl> <chr> <dbl> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 1999 population 19987071 ## 3 Afghanistan 2000 cases 2666 ## 4 Afghanistan 2000 population 20595360 ## 5 Brazil 1999 cases 37737 ## 6 Brazil 1999 population 172006362 ## 7 Brazil 2000 cases 80488 ## 8 Brazil 2000 population 174504898 ## 9 China 1999 cases 212258 ## 10 China 1999 population 1272915272 ## 11 China 2000 cases 213766 ## 12 China 2000 population 1280428583 To make the dataset tidy, the measurements in the count column need to be separated into two separate columns, cases and population, based on the categories in the type column. You can use the pivot_wider function to pivot the single count column into two columns according to the type column. This makes the dataset wider, hence the name pivot_wider. The function’s first parameter is the dataset to pivot. Other important parameters are: values_from – The column(s) to pivot. names_from – The column that contains names for the new columns. id_cols – The identifier columns, which are not pivoted. This defaults to all columns except those in values_from and names_from. Here’s how to use the function to make table2 tidy: pivot_wider(table2, values_from = count, names_from = type) ## # A tibble: 6 × 4 ## country year cases population ## <chr> <dbl> <dbl> <dbl> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 The function automatically removes values from the country and year columns as needed to maintain their original correspondence with the pivoted values. 2.1.3 Columns into Rows Tidy data rule 2 is that every row must be a single observation. 
The table4a and table4b datasets break this rule because each row in each dataset contains measurements for two different years: table4a ## # A tibble: 3 × 3 ## country `1999` `2000` ## <chr> <dbl> <dbl> ## 1 Afghanistan 745 2666 ## 2 Brazil 37737 80488 ## 3 China 212258 213766 table4b ## # A tibble: 3 × 3 ## country `1999` `2000` ## <chr> <dbl> <dbl> ## 1 Afghanistan 19987071 20595360 ## 2 Brazil 172006362 174504898 ## 3 China 1272915272 1280428583 The tuberculosis case counts are in table4a. The population counts are in table4b. Neither is tidy. To make the table4a dataset tidy, the 1999 and 2000 columns need to be pivoted into two new columns: one for the measurements (the counts) and one for the identifiers (the years). It might help to visualize this as stacking the two separate columns 1999 and 2000 together, one on top of the other, and then adding a second column with the appropriate years. The same process makes table4b tidy. You can use the pivot_longer function to pivot the two columns 1999 and 2000 into a column of counts and a column of years. This makes the dataset longer, hence the name pivot_longer. Again the function’s first parameter is the dataset to pivot. Other important parameters are: cols – The columns to pivot. values_to – Name(s) for the new measurement column(s) names_to – Name(s) for the new identifier column(s) Here’s how to use the function to make table4a tidy: tidy4a = pivot_longer(table4a, -country, values_to = "cases", names_to = "year") tidy4a ## # A tibble: 6 × 3 ## country year cases ## <chr> <chr> <dbl> ## 1 Afghanistan 1999 745 ## 2 Afghanistan 2000 2666 ## 3 Brazil 1999 37737 ## 4 Brazil 2000 80488 ## 5 China 1999 212258 ## 6 China 2000 213766 In this case, the cols parameter is set to all columns except the country column, because the country column does not need to be pivoted. The function automatically repeats values in the country column as needed to maintain its original correspondence with the pivoted values. 
Here’s the same for table4b: tidy4b = pivot_longer(table4b, -country, values_to = "population", names_to = "year") tidy4b ## # A tibble: 6 × 3 ## country year population ## <chr> <chr> <dbl> ## 1 Afghanistan 1999 19987071 ## 2 Afghanistan 2000 20595360 ## 3 Brazil 1999 172006362 ## 4 Brazil 2000 174504898 ## 5 China 1999 1272915272 ## 6 China 2000 1280428583 Once the two datasets are tidy, you can join them with the merge function to reproduce table1: merge(tidy4a, tidy4b) ## country year cases population ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 2.1.4 Separating Values Tidy data rule 3 says each value must have its own cell. The table3 dataset breaks this rule because the rate column contains two values per cell: table3 ## # A tibble: 6 × 3 ## country year rate ## <chr> <dbl> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362 ## 4 Brazil 2000 80488/174504898 ## 5 China 1999 212258/1272915272 ## 6 China 2000 213766/1280428583 The two values separated by / in the rate column are the tuberculosis case count and the population count. To make this dataset tidy, the rate column needs to be split into two columns, cases and population. The values in the rate column are strings, so one way to do this is with the stringr package’s str_split_fixed function, described in Section 1.4.4: library(stringr) # Split the rate column into 2 columns. cols = str_split_fixed(table3$rate, fixed("/"), 2) # Remove the rate column and append the 2 new columns. 
tidy3 = table3[-3] tidy3$cases = as.numeric(cols[, 1]) tidy3$population = as.numeric(cols[, 2]) tidy3 ## # A tibble: 6 × 4 ## country year cases population ## <chr> <dbl> <dbl> <dbl> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 Extracting values, converting to appropriate data types, and then combining everything back into a single data frame is an extremely common pattern in data science. The tidyr package provides the separate function to streamline the steps taken above. The first parameter is the dataset, the second is the column to split, the third is the names of the new columns, and the fourth is the delimiter. The convert parameter controls whether the new columns are automatically converted to appropriate data types: separate(table3, rate, c("cases", "population"), "/", convert = TRUE) ## # A tibble: 6 × 4 ## country year cases population ## <chr> <dbl> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 As of writing, the tidyr developers have deprecated the separate function in favor of several more specific functions (separate_wider_delim, separate_wider_position, and separate_wider_regex). These functions are still experimental, so we still recommend using the separate function in the short term. 2.1.5 Case Study: SMART Ridership Sonoma-Marin Area Rail Transit (SMART) is a single-line passenger rail service between the San Francisco Bay and Santa Rosa. They publish data about monthly ridership in PDF and Excel format. In this case study, you’ll reshape and clean the dataset to prepare it for analysis. To get started, download the December 2022 report in Excel format.
Pay attention to where you save the file—or move it to a directory just for files related to this case study—so that you can load it into R. If you want, you can use R’s download.file function to download the file rather than your browser. The readxl package provides functions to read data from Excel files. Install the package if you don’t already have it installed, and then load it: # install.packages("readxl") library("readxl") You can use the read_excel function to read a sheet from an Excel spreadsheet. Before doing so, it’s a good idea to manually inspect the spreadsheet in a spreadsheet program. The SMART dataset contains two tables in the first sheet, one for total monthly ridership and another for average weekday ridership (by month). Let’s focus on the total monthly ridership table, which occupies cells B4 to H16. You can specify a range of cells when you call read_excel by setting the range parameter: smart_path = "./data/SMART Ridership Web Posting_Mar.23.xlsx" smart = read_excel(smart_path, range = "B4:H16") smart ## # A tibble: 12 × 7 ## Month FY18 FY19 FY20 FY21 FY22 `FY23 (DRAFT)` ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Jul - 63864 62851 9427 24627 43752 ## 2 Aug 54484 74384 65352 8703 25020 48278 ## 3 Sep 65019 62314 62974 8910 27967 49134 ## 4 Oct 57453 65492 57222 9851 26998. 59322 ## 5 Nov 56125 52774 64966 8145 26575 51383 ## 6 Dec 56425 51670 58199. 7414 24050 47606 ## 7 Jan 56527 57136 71974 6728 22710 46149 ## 8 Feb 54797 51130 71676 7412 26652 49724 ## 9 Mar 57312 58091 33624 9933 35291 53622 ## 10 Apr 56631 60256 4571 11908 34258 NA ## 11 May 59428 64036 5308 13949 38655 NA ## 12 Jun 61828 55700 8386 20469 41525 NA The loaded dataset needs to be cleaned. The FY18 column uses a hyphen to indicate missing data and has the wrong data type. The dates—months and years—are identifiers for observations, so the dataset is also not tidy. 
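Before fixing anything, it may help to confirm which columns loaded with the wrong type. A quick diagnostic sketch, using the smart data frame loaded above:

```r
# Check the class of every column. FY18 shows up as character rather
# than numeric because of the "-" placeholder for missing data.
sapply(smart, class)
```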
You can correct the missing value in the FY18 column with indexing, and the type with the as.numeric function: smart$FY18[smart$FY18 == "-"] = NA smart$FY18 = as.numeric(smart$FY18) head(smart) ## # A tibble: 6 × 7 ## Month FY18 FY19 FY20 FY21 FY22 `FY23 (DRAFT)` ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Jul NA 63864 62851 9427 24627 43752 ## 2 Aug 54484 74384 65352 8703 25020 48278 ## 3 Sep 65019 62314 62974 8910 27967 49134 ## 4 Oct 57453 65492 57222 9851 26998. 59322 ## 5 Nov 56125 52774 64966 8145 26575 51383 ## 6 Dec 56425 51670 58199. 7414 24050 47606 To make the dataset tidy, it needs to be reshaped so that the values in the various fiscal year columns are all in one column. In other words, the dataset needs to be pivoted longer (Section 2.1.3). The result of the pivot will be easier to understand if you rename the columns as their years first. Here’s one way to do that: names(smart)[-1] = 2018:2023 head(smart) ## # A tibble: 6 × 7 ## Month `2018` `2019` `2020` `2021` `2022` `2023` ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Jul NA 63864 62851 9427 24627 43752 ## 2 Aug 54484 74384 65352 8703 25020 48278 ## 3 Sep 65019 62314 62974 8910 27967 49134 ## 4 Oct 57453 65492 57222 9851 26998. 59322 ## 5 Nov 56125 52774 64966 8145 26575 51383 ## 6 Dec 56425 51670 58199. 7414 24050 47606 Next, use pivot_longer to pivot the dataset: smart = pivot_longer(smart, -Month, values_to = "riders", names_to = "fiscal_year") head(smart) ## # A tibble: 6 × 3 ## Month fiscal_year riders ## <chr> <chr> <dbl> ## 1 Jul 2018 NA ## 2 Jul 2019 63864 ## 3 Jul 2020 62851 ## 4 Jul 2021 9427 ## 5 Jul 2022 24627 ## 6 Jul 2023 43752 Now the dataset is tidy, but it’s still not completely clean. To make it easy to study time trends, let’s combine and convert the month and fiscal_year columns into a calendar date. You can use functions from the lubridate package (Section 1.2.1) to do this. 
First paste the year and month together and use the my function to parse them as dates: library(lubridate) ## ## Attaching package: 'lubridate' ## The following objects are masked from 'package:base': ## ## date, intersect, setdiff, union dates = my(paste(smart$Month, smart$fiscal_year)) dates ## [1] "2018-07-01" "2019-07-01" "2020-07-01" "2021-07-01" "2022-07-01" ## [6] "2023-07-01" "2018-08-01" "2019-08-01" "2020-08-01" "2021-08-01" ## [11] "2022-08-01" "2023-08-01" "2018-09-01" "2019-09-01" "2020-09-01" ## [16] "2021-09-01" "2022-09-01" "2023-09-01" "2018-10-01" "2019-10-01" ## [21] "2020-10-01" "2021-10-01" "2022-10-01" "2023-10-01" "2018-11-01" ## [26] "2019-11-01" "2020-11-01" "2021-11-01" "2022-11-01" "2023-11-01" ## [31] "2018-12-01" "2019-12-01" "2020-12-01" "2021-12-01" "2022-12-01" ## [36] "2023-12-01" "2018-01-01" "2019-01-01" "2020-01-01" "2021-01-01" ## [41] "2022-01-01" "2023-01-01" "2018-02-01" "2019-02-01" "2020-02-01" ## [46] "2021-02-01" "2022-02-01" "2023-02-01" "2018-03-01" "2019-03-01" ## [51] "2020-03-01" "2021-03-01" "2022-03-01" "2023-03-01" "2018-04-01" ## [56] "2019-04-01" "2020-04-01" "2021-04-01" "2022-04-01" "2023-04-01" ## [61] "2018-05-01" "2019-05-01" "2020-05-01" "2021-05-01" "2022-05-01" ## [66] "2023-05-01" "2018-06-01" "2019-06-01" "2020-06-01" "2021-06-01" ## [71] "2022-06-01" "2023-06-01" The SMART fiscal year extends from July to the following June and equals the calendar year at the end of the period. So for observations from July to December, the calendar year is the fiscal year minus 1. 
You can use indexing to make this adjustment efficiently, and then append the dates to the data frame: jul2dec = month(dates) >= 7 dates[jul2dec] = dates[jul2dec] - period(1, "year") smart$date = dates head(smart) ## # A tibble: 6 × 4 ## Month fiscal_year riders date ## <chr> <chr> <dbl> <date> ## 1 Jul 2018 NA 2017-07-01 ## 2 Jul 2019 63864 2018-07-01 ## 3 Jul 2020 62851 2019-07-01 ## 4 Jul 2021 9427 2020-07-01 ## 5 Jul 2022 24627 2021-07-01 ## 6 Jul 2023 43752 2022-07-01 As a final adjustment, you can use the tolower function to convert the column names to lowercase, so that they’re easier to use during analysis: names(smart) = tolower(names(smart)) smart ## # A tibble: 72 × 4 ## month fiscal_year riders date ## <chr> <chr> <dbl> <date> ## 1 Jul 2018 NA 2017-07-01 ## 2 Jul 2019 63864 2018-07-01 ## 3 Jul 2020 62851 2019-07-01 ## 4 Jul 2021 9427 2020-07-01 ## 5 Jul 2022 24627 2021-07-01 ## 6 Jul 2023 43752 2022-07-01 ## 7 Aug 2018 54484 2017-08-01 ## 8 Aug 2019 74384 2018-08-01 ## 9 Aug 2020 65352 2019-08-01 ## 10 Aug 2021 8703 2020-08-01 ## # ℹ 62 more rows Now that the dataset is tidied and cleaned, it’s straightforward to do things like plot it as a time series: library("ggplot2") ggplot(smart) + aes(x = date, y = riders) + geom_line() + expand_limits(y = 0) ## Warning: Removed 4 rows containing missing values (`geom_line()`). Notice the huge drop (more than 90%) in April of 2020 due to the COVID-19 pandemic! 2.1.6 Without tidyr This section shows how to pivot datasets without the help of the tidyr package. In practice, we recommend that you use the package, but the examples here may make it easier to understand what’s actually happening when you pivot a dataset. 2.1.6.1 Rows into Columns The steps for pivoting table2 wider are: Subset rows to separate cases and population values. Remove the type column from each. Rename the count column to cases and population. Merge the two subsets by matching country and year. 
And the code is: # Step 1 cases = table2[table2$type == "cases", ] pop = table2[table2$type == "population", ] # Step 2 cases = cases[-3] pop = pop[-3] # Step 3 names(cases)[3] = "cases" names(pop)[3] = "population" # Step 4 merge(cases, pop) ## country year cases population ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 2.1.6.2 Columns into Rows The steps for pivoting table4a longer are: Subset columns to separate 1999 and 2000 into two data frames. Add a year column to each. Rename the 1999 and 2000 columns to cases. Stack the two data frames with rbind. And the code is: # Step 1 df99 = table4a[-3] df00 = table4a[-2] # Step 2 df99$year = "1999" df00$year = "2000" # Step 3 names(df99)[2] = "cases" names(df00)[2] = "cases" # Step 4 rbind(df99, df00) ## # A tibble: 6 × 3 ## country cases year ## <chr> <dbl> <chr> ## 1 Afghanistan 745 1999 ## 2 Brazil 37737 1999 ## 3 China 212258 1999 ## 4 Afghanistan 2666 2000 ## 5 Brazil 80488 2000 ## 6 China 213766 2000 2.2 Relational Datasets Many datasets contain multiple tables (or data frames) that are all closely related to each other. Sometimes, the rows in one table may be connected to the rows in others through columns they have in common. For example, our library keeps track of its books using three tables: one identifying books, one identifying borrowers, and one that records each book checkout. Each book and each borrower has a unique identification number, recorded in the book and borrower tables, respectively. These ID numbers are also recorded in the checkouts table. Using the ID numbers, you can connect rows from one table to rows in another. We call this kind of dataset a relational dataset, because there are relationships between the tables. Storing relational datasets as several small tables rather than one large table has many benefits. 
Perhaps the most important is that it reduces redundancy and thereby reduces the size (in bytes) of the dataset. As a result, most databases are designed to store relational datasets. Because the data are split across many different tables, relational datasets also pose a unique challenge: to explore, compute statistics, make visualizations, and answer questions, you’ll typically need to combine the data of interest into a single table. One way to do this is with a join, an operation that combines rows from two tables based on values of a column they have in common. There are many different types of joins, which are covered in the subsequent sections. 2.2.1 The dplyr Package The dplyr package provides functions to join related data frames, among other things. Check out this list of all the functions provided by dplyr. If you’ve ever used SQL, you’re probably familiar with relational datasets and recognize functions like select, left_join, and group_by. In fact, dplyr was designed to bring SQL-style data manipulation to R. As a result, many concepts of dplyr and SQL are nearly identical, and even the language overlaps a lot. I’ll point out some examples of this as we go, because I think some people might find it helpful. If you haven’t used SQL, don’t worry—all of the functions will be explained in detail. 2.2.2 Gradebook Dataset Another example of a relational dataset that we all interact with regularly is the university gradebook. One table might store information about students and another might store their grades. The grades are linked to the student records by student ID. Looking at a student’s grades requires combining the two tables with a join. Let’s use a made-up gradebook dataset to make the idea of joins concrete. We’ll create two tables: the first identifies students by name and ID, and the second lists their grades in a class. 
# Example datasets students = data.frame( student_id = c(1, 2, 3, 4), name = c("Angel", "Beto", "Cici", "Desmond")) students ## student_id name ## 1 1 Angel ## 2 2 Beto ## 3 3 Cici ## 4 4 Desmond grades = data.frame( student_id = c(2, 3, 4, 5, 6), grade = c(90, 85, 80, 75, 60)) grades ## student_id grade ## 1 2 90 ## 2 3 85 ## 3 4 80 ## 4 5 75 ## 5 6 60 The rows and columns of these tables have different meanings, so we can’t stack them side-by-side or one on top of the other. The “key” piece of information for linking them is the student_id column present in both. In relational datasets, each table usually has a primary key, a column of values that uniquely identify the rows. Key columns are important because they link rows in one table to rows in other tables. In the gradebook dataset, student_id is the primary key for the students table. Although the values of student_id in the grades table are unique, it is not a primary key for the grades table, because a student could have grades for more than one class. When one table’s primary key is included in another table, it’s called a foreign key. So student_id is a foreign key in the grades table. If you’ve used SQL, you’ve probably heard the terms primary key and foreign key before. They have the same meaning in R. In most databases, the primary key must be unique—there can be no duplicates. That said, relational datasets are not always designed for use as databases, and they may have key columns that are not unique. How to handle non-unique keys is going to be a recurring feature of this section. 2.2.3 Left Joins Suppose we want a table with each student’s name and grade. This is a combination of information from both the students table and the grades table, but how can we combine the two? The students table contains the student names and has one row for each student. So we can use the students table as a starting point. Then we need to use each student’s ID number to look up their grade in the grades table. 
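Before reaching for a join, it may help to see the lookup done by hand. The following sketch uses base R’s match function on the students and grades tables defined above; it is not how we’ll proceed, but it shows the work a join automates:

```r
# For each student ID, find the position of the matching row in grades
# (match returns NA where there is no matching ID).
idx = match(students$student_id, grades$student_id)
# Use those positions to pull out each student's grade.
cbind(students, grade = grades$grade[idx])
```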
When you want to combine data from two tables like this, you should think of using a join. In join terminology, the two tables are usually called the left table and right table so that it’s easy to refer to each without ambiguity. For this particular example, we’ll use a left join. A left join keeps all of the rows in the left table and combines them with rows from the right table that match the key. We want to keep every student in the students table, so we’ll use it as the left table. The grades table will be the right table. The key that links the two tables is student_id. This left join will only keep rows from the grades table that match student IDs present in the students table. In dplyr, you can use the left_join function to carry out a left join. The first argument is the left table and the second argument is the right table. You can also set an argument for the by parameter to specify which column(s) to use as the key. Thus: # load dplyr package library(dplyr) library(knitr) # Left join left_join(students, grades, by = "student_id") ## student_id name grade ## 1 1 Angel NA ## 2 2 Beto 90 ## 3 3 Cici 85 ## 4 4 Desmond 80 Note that the keys do not match up perfectly between the tables: the grades table has no rows with student_id 1 (Angel) and has rows with student_id 5 (an unknown student). Because we used a left join, the result has a missing value (NA) in the grade column for Angel and no entry for student_id 5. A left join augments the left table (students) with columns from the right table (grades). So the result of a left join will often have the same number of rows as the left table. New rows are not added for rows in the right table with non-matching key values. There is one case where the result of a left join will have more rows than the left table: when a key value is repeated in either table. In that case, every possible match will be provided in the result. 
For an example, let’s add rows with repeat IDs to both the students and grades tables. Let’s also rename the student_id column of grades to be sid so we can see how to join tables where the key column names don’t match. # Example datasets students = data.frame( student_id = c(1, 2, 3, 4, 4), name = c("Angel", "Beto", "Cici", "Desmond", "Erik")) grades = data.frame( sid = c(2, 3, 4, 5, 2), grade = c(90, 85, 80, 75, 60)) # Left join left_join(students, grades, by = join_by(student_id == sid)) ## Warning in left_join(students, grades, by = join_by(student_id == sid)): Detected an unexpected many-to-many relationship between `x` and `y`. ## ℹ Row 2 of `x` matches multiple rows in `y`. ## ℹ Row 3 of `y` matches multiple rows in `x`. ## ℹ If a many-to-many relationship is expected, set `relationship = "many-to-many"` to ## silence this warning. ## student_id name grade ## 1 1 Angel NA ## 2 2 Beto 90 ## 3 2 Beto 60 ## 4 3 Cici 85 ## 5 4 Desmond 80 ## 6 4 Erik 80 Both of the tables had five rows, but the result has six rows because student_id is 4 for two rows of students and sid is 2 for two rows of grades. R warns that there is a many-to-many relationship in the join, which means that duplicate keys were matched in the left table and the right table. When there are no duplicate keys in either table, the match is one-to-one. When there are duplicates in one table only, the match is one-to-many or many-to-one. These are often the desired behavior, so R complies silently. A many-to-many match may be desired, but it is often a sign that something has gone wrong, so R emits a warning. You can get funky results when your keys are not unique! Cats join meme 2.2.4 Other Joins There are several other kinds of joins: A right join is almost the same as a left join, but reverses the roles of the left and right table. All rows from the right table are augmented with columns from the left table where the key matches. 
An inner join returns rows from the left and right tables only if they match (their key appears in both tables). A full join returns all rows from the left table and from the right table, even if they do not match. Here’s a visualization to help identify the differences: Disney characters illustrate differences between joins The following subsections provide examples of different types of joins. 2.2.4.1 Inner Join An inner join returns the same columns as a left join, but potentially fewer rows. The result of an inner join only includes the rows that matched according to the join specification. This will leave out some rows from the left table if they aren’t matched in the right table, which is the difference between an inner join and a left join. # Example datasets students = data.frame( student_id = c(1, 2, 3, 4, 4), name = c("Angel", "Beto", "Cici", "Desmond", "Erik")) grades = data.frame( student_id = c(2, 3, 4, 5, 2), grade = c(90, 85, 80, 75, 60)) # Inner join inner_join(students, grades, by = "student_id") |> kable() ## Warning in inner_join(students, grades, by = "student_id"): Detected an unexpected many-to-many relationship between `x` and `y`. ## ℹ Row 2 of `x` matches multiple rows in `y`. ## ℹ Row 3 of `y` matches multiple rows in `x`. ## ℹ If a many-to-many relationship is expected, set `relationship = "many-to-many"` to ## silence this warning. student_id name grade 2 Beto 90 2 Beto 60 3 Cici 85 4 Desmond 80 4 Erik 80 2.2.5 Getting Clever with join_by So far, we’ve focused on the join types and the tables. There’s been a third element in all of the examples that we’ve mostly ignored until now: the by argument in the joins. Specifying a single column name (like student_id) works great when the key columns have the same names in both tables. However, real examples are often more complicated. For those times, dplyr provides a function called join_by, which lets you create join specifications to solve even very complicated problems. 
We begin with an example where the key name in the grades table has been changed from student_id to sid. # Example datasets students = data.frame( student_id = c(1, 2, 3, 4), name = c("Angel", "Beto", "Cici", "Desmond")) grades = data.frame( sid = c(2, 3, 4, 5), grade = c(90, 85, 80, 75)) # Left join left_join(students, grades, by = join_by(student_id==sid)) |> kable() student_id name grade 1 Angel NA 2 Beto 90 3 Cici 85 4 Desmond 80 Since the key column names don’t match, I have provided a join_by specification. Specifying a match via join_by is very powerful and flexible, but the main thing to recognize here is that R searches for the column name on the left of the double-equals in the left table and searches for the column name on the right of the double-equals in the right table. In this example, that means the join will try to match students$student_id to grades$sid. 2.2.5.1 Matching multiple columns Sometimes it takes more than one key to uniquely identify a row of data. For example, suppose some of our students are retaking the class in 2023 because they got a failing grade in 2022. Then we would need to combine the student ID with the year to uniquely identify a student’s grade. You can include multiple comparisons in a join_by specification by separating them with commas. In the following example, student ID still has different names between the tables but the year column has the same name in both tables. 
# Example datasets students = data.frame( student_id = c(1, 2, 3, 4), name = c("Angel", "Beto", "Cici", "Desmond")) # duplicate the students for two years students = bind_rows( mutate(students, year = 2022), mutate(students, year = 2023) ) # create the grades data.frame grades = data.frame( sid = c(2, 3, 4, 5), grade = c(90, 85, 80, 75) ) # duplicate the grades table for two years grades = bind_rows( mutate(grades, grade = grade - 50, year = 2022), mutate(grades, year = 2023) ) # Left join left_join(students, grades, by = join_by(student_id==sid, year)) |> kable() student_id name year grade 1 Angel 2022 NA 2 Beto 2022 40 3 Cici 2022 35 4 Desmond 2022 30 1 Angel 2023 NA 2 Beto 2023 90 3 Cici 2023 85 4 Desmond 2023 80 To learn clever tricks for complicated joins, see the documentation at ?join_by. 2.2.6 Examples We’ve seen enough of the made-up grades example! Let’s look at some real data and practice our skills! Let’s begin by looking at the data on books, borrowers, and checkouts. borrowers = read.csv("data/library/borrowers.csv") books = read.csv("data/library/books.csv") checkouts = read.csv("data/library/checkouts.csv") # show the top rows print(head(books)) ## book_id title author subject ## 1 1 my alaska guy noir NA ## 2 2 jubilee charro NA ## 3 3 the window ruiner ruined my windows zipperman NA ## 4 4 clowns in the clouds dark doug NA ## 5 5 dogs walked, tigers tamed, bars emptied lucky pierre NA ## 6 6 ace ventura sandy mackinnon NA ## Keywords minority_author Description. Contents. Series. Pages. Publisher. ## 1 NA NA NA NA NA 100 NA ## 2 NA NA NA NA NA 100 NA ## 3 NA NA NA NA NA 100 NA ## 4 NA NA NA NA NA 100 NA ## 5 NA NA NA NA NA 100 NA ## 6 NA NA NA NA NA 100 NA ## creation_date publication_date Edition. format venue Language. Source. ## 1 NA NA NA NA NA NA NA ## 2 NA NA NA NA NA NA NA ## 3 NA NA NA NA NA NA NA ## 4 NA NA NA NA NA NA NA ## 5 NA NA NA NA NA NA NA ## 6 NA NA NA NA NA NA NA ## Identifier. ISBN. location copies type Barcode. 
## 1 NA NA NA NA NA NA ## 2 NA NA NA NA NA NA ## 3 NA NA NA NA NA NA ## 4 NA NA NA NA NA NA ## 5 NA NA NA NA NA NA ## 6 NA NA NA NA NA NA print(head(borrowers)) ## borrower_id account_type major date_created ## 1 1 student NA NA ## 2 2 staff NA NA ## 3 3 student NA NA ## 4 4 student NA NA ## 5 5 faculty NA NA ## 6 6 staff NA NA print(head(checkouts)) ## borrower_id book_id checkout_date due_date ## 1 1 1 12/21/23 6/18/24 ## 2 1 2 1/3/22 7/2/22 ## 3 2 2 10/12/20 4/10/21 ## 4 2 3 5/5/23 11/1/23 ## 5 2 4 12/12/21 6/10/22 ## 6 4 2 6/7/18 12/4/18 # get the table sizes print(dim(books)) ## [1] 9 24 print(dim(borrowers)) ## [1] 10 4 print(dim(checkouts)) ## [1] 7 4 One thing we can see is that the books table refers to physical copies of a book, so if the library owns two copies of the same book then the same title, publisher, etc. will appear in two different rows. In the previous section, we set a goal of augmenting the checkouts table with the information in the books table. To augment means to add to. We are going to be adding to the checkouts table, but do we add rows or columns? Each row of checkouts is an event that matches one book and one borrower. Adding new rows would be like adding new events that didn’t occur: not good! Each row has a book and each book has many columns of information in the books table. So we can maintain the relationships in the data while adding new columns of information to checkouts. How are we to know which books were checked out most often, or were generally checked out by the same people? The tables have different numbers of rows and columns, so we won’t be able to stack them side-by-side or one on top of the other. The “key” pieces of information are the columns books$book_id and borrowers$borrower_id. If you’ve ever used SQL, you may recall that each table should have a primary key, which is a column of values that identify the rows. In a database, the primary key must be unique - there can be no duplicates. 
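Before treating a column as a primary key, it’s worth verifying that its values really are unique. A quick sketch of that check on the tables loaded above:

```r
# anyDuplicated returns 0 when a vector contains no repeated values,
# so a primary key should pass this test.
anyDuplicated(borrowers$borrower_id) == 0
# borrower_id repeats in the checkouts table (borrower 2 appears three
# times in the head alone), so there it is not a primary key.
anyDuplicated(checkouts$borrower_id) == 0
```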
Most spreadsheets are not designed as databases, and they may have key or ID columns that are not unique. How to handle non-unique keys is going to be a recurring feature of this chapter. Now look at the checkouts table again. It has two ID columns: book_id and borrower_id. These match the borrower and book IDs in the borrowers and books tables. Obviously, these aren’t unique: one person may check out multiple books, and an exceptionally popular book might be checked out as many as three times from the same library! These are columns that identify unique keys in other tables, which SQL calls foreign keys. Now we can begin to reason about how to approach the goal of identifying the books that are most often checked out. We want to augment the checkouts table with the information in the books table, matching rows where book_id matches. Every row in the checkouts table should match exactly one row in the results and every row in the results should match exactly one row in the checkouts table. In the next section we will translate this plain-English description into the language used by dplyr. # Top ten books with most checkouts left_join(checkouts, books, by="book_id") |> group_by(book_id) |> summarize(title=first(title), author=first(author), n_checkouts=n()) |> arrange(desc(n_checkouts)) |> head(n=10) |> kable() book_id title author n_checkouts 2 jubilee charro 3 1 my alaska guy noir 1 3 the window ruiner ruined my windows zipperman 1 4 clowns in the clouds dark doug 1 5 dogs walked, tigers tamed, bars emptied lucky pierre 1 Just for fun, here is an instructive example of why relational tables are a better way to store data than putting everything into one spreadsheet. If we want to identify the authors whose books were most checked out from the UCD library, we might think to adapt our previous example to group by author rather than by book_id. 
# Top ten authors with most checkouts left_join(checkouts, books, by="book_id") |> group_by(author) |> summarize(author=first(author), n_checkouts = n()) |> arrange(desc(n_checkouts)) |> head(n=10) |> kable() author n_checkouts charro 3 dark doug 1 guy noir 1 lucky pierre 1 zipperman 1 The problem is that the author column is a text field for author name(s), which is not a one-to-one match to a person. There are a lot of reasons: some books have multiple authors, some authors change their names, the order of personal name and family name may be reversed, and middle initials are sometimes included, sometimes not. A table of authors would allow you to refer to authors by a unique identifier and have it always point to the same name (this is what ORCID does for scientific publishing). 2.2.6.1 Three or More Tables A join operates on two tables, but you can combine multiple tables by doing several joins in a row. Let’s look at an example that combines checkouts, books, and borrowers in order to see how many books were checked out by students, faculty, and staff. # list the account types who checked out the most books left_join(checkouts, books, by="book_id") |> left_join(borrowers, by="borrower_id") |> group_by(borrower_id) |> summarize(account_type=first(account_type), n_checkouts = n()) |> arrange(desc(n_checkouts)) |> kable() borrower_id account_type n_checkouts 2 staff 3 1 student 2 4 student 1 5 faculty 1 2.2.7 Be Explicit Do you find it odd that we have to tell R exactly what kind of data join to do by calling one of left_join, right_join, inner_join, or full_join? Why isn’t there just one function called join that assumes you’re doing a left join unless you specifically provide an argument type like join(..., type = "inner")? 
If you think it would be confusing for R to make assumptions about what kind of data join we want, then you’re on the right track but you’ll want to watch out for these other cases where R has strong assumptions about what the default behavior should be. A general principle of programming is that explicit is better than implicit because writing information into your code explicitly makes it easier to understand what the code does. Here are some examples of implicit assumptions R will make unless you provide explicit instructions. 2.2.7.1 Handling Duplicate Keys Values in the key columns may not be unique. What do you think happens when you join using keys that aren’t unique? # Example datasets students = data.frame( student_id = c(1, 2, 3, 4, 4), name = c("Angel", "Beto", "Cici", "Desmond", "Erik")) grades = data.frame(student_id = c(2, 2, 3, 4, 4, 5), grade = c(90, 50, 85, 80, 75, 30)) # Left join left_join(students, grades, by = "student_id") |> kable() ## Warning in left_join(students, grades, by = "student_id"): Detected an unexpected many-to-many relationship between `x` and `y`. ## ℹ Row 2 of `x` matches multiple rows in `y`. ## ℹ Row 4 of `y` matches multiple rows in `x`. ## ℹ If a many-to-many relationship is expected, set `relationship = "many-to-many"` to ## silence this warning. student_id name grade 1 Angel NA 2 Beto 90 2 Beto 50 3 Cici 85 4 Desmond 80 4 Desmond 75 4 Erik 80 4 Erik 75 We get one row in the result for every possible combination of the matching keys. Sometimes that is what you want, and other times not. In this case, it might be reasonable that Beto, Desmond, and Erik have multiple grades in the book, but it is probably not reasonable that both Desmond and Erik have student ID 4 and have the same grades as each other. This is a many-to-many match, with all the risks we’ve mentioned before. 
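The next section describes the remedy in detail; as a preview, you can declare the relationship you expect so that a violation like the duplicated IDs above stops with an error instead of a warning. A sketch, assuming dplyr 1.1 or later and the students and grades tables just defined:

```r
# "one-to-many" promises that the keys in the left table are unique;
# the duplicated student_id 4 in students breaks that promise, so this
# join stops with an error (wrapped in try() so the script continues).
try(
  left_join(students, grades, by = "student_id",
            relationship = "one-to-many")
)
```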
2.2.7.1.1 Specifying the Expected Relationship You can be explicit about what kind of relationship you expect in the join by specifying the relationship parameter. Your options are one-to-one, one-to-many, or many-to-one. Any of those will stop the code with an error if the data doesn’t match the relationship you told it to expect. If you leave the relationship parameter blank, R will allow a many-to-many join but will raise a warning. Pay attention to your warning messages! If you know in advance that you want a many-to-many join, then you can provide the argument relationship = 'many-to-many', which will do the same as leaving relationship blank, except it will not raise the warning. 2.2.7.1.2 Using Only Distinct Rows An alternative to handling duplicate keys is to subset the data to avoid duplicates in the first place. The dplyr package provides a function, distinct, which can help. When distinct finds duplicated rows, it keeps the first one. # Example datasets students = data.frame( student_id = c(1, 2, 3, 4, 4), name = c("Angel", "Beto", "Cici", "Desmond", "Erik")) grades = data.frame(student_id = c(2, 2, 3, 4, 4, 5), grade = c(90, 50, 85, 80, 75, 30)) # Left join distinct_keys_result = students |> distinct(student_id, .keep_all=TRUE) |> left_join(grades, by = "student_id") |> kable() 2.2.7.2 Ambiguous Columns When the two tables have columns with the same names, it is ambiguous which one to use in the result. R handles that situation by keeping both columns but renaming them: the column from the left table gets a .x appended by default and the column from the right table gets a .y appended by default. Let’s see an example. Suppose that the date_created column of the borrowers table had the name date instead. Then in the joined data it would be ambiguous with the date column of the checkouts table. 
# Rename the date_created column of borrowers borrowers = rename(borrowers, date=date_created) # Now create the list of checkouts left_join(checkouts, books, by="book_id") |> left_join(borrowers, by="borrower_id") |> head(n=10) |> kable() borrower_id book_id checkout_date due_date title author subject Keywords minority_author Description. Contents. Series. Pages. Publisher. creation_date publication_date Edition. format venue Language. Source. Identifier. ISBN. location copies type Barcode. account_type major date 1 1 12/21/23 6/18/24 my alaska guy noir NA NA NA NA NA NA 100 NA NA NA NA NA NA NA NA NA NA NA NA NA NA student NA NA 1 2 1/3/22 7/2/22 jubilee charro NA NA NA NA NA NA 100 NA NA NA NA NA NA NA NA NA NA NA NA NA NA student NA NA 2 2 10/12/20 4/10/21 jubilee charro NA NA NA NA NA NA 100 NA NA NA NA NA NA NA NA NA NA NA NA NA NA staff NA NA 2 3 5/5/23 11/1/23 the window ruiner ruined my windows zipperman NA NA NA NA NA NA 100 NA NA NA NA NA NA NA NA NA NA NA NA NA NA staff NA NA 2 4 12/12/21 6/10/22 clowns in the clouds dark doug NA NA NA NA NA NA 100 NA NA NA NA NA NA NA NA NA NA NA NA NA NA staff NA NA 4 2 6/7/18 12/4/18 jubilee charro NA NA NA NA NA NA 100 NA NA NA NA NA NA NA NA NA NA NA NA NA NA student NA NA 5 5 10/1/23 3/29/24 dogs walked, tigers tamed, bars emptied lucky pierre NA NA NA NA NA NA 100 NA NA NA NA NA NA NA NA NA NA NA NA NA NA faculty NA NA If you aren’t satisfied with appending .x and .y to the ambiguous columns, then you can specify the suffix argument with a pair of strings like this: # Now create the list of checkouts left_join(checkouts, books, by="book_id", suffix=c("_checkout", "_book")) |> head(n=10) |> kable() borrower_id book_id checkout_date due_date title author subject Keywords minority_author Description. Contents. Series. Pages. Publisher. creation_date publication_date Edition. format venue Language. Source. Identifier. ISBN. location copies type Barcode. 
1 1 12/21/23 6/18/24 my alaska guy noir NA NA NA NA NA NA 100 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 2 1/3/22 7/2/22 jubilee charro NA NA NA NA NA NA 100 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 2 2 10/12/20 4/10/21 jubilee charro NA NA NA NA NA NA 100 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 2 3 5/5/23 11/1/23 the window ruiner ruined my windows zipperman NA NA NA NA NA NA 100 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 2 4 12/12/21 6/10/22 clowns in the clouds dark doug NA NA NA NA NA NA 100 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 4 2 6/7/18 12/4/18 jubilee charro NA NA NA NA NA NA 100 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 5 5 10/1/23 3/29/24 dogs walked, tigers tamed, bars emptied lucky pierre NA NA NA NA NA NA 100 NA NA NA NA NA NA NA NA NA NA NA NA NA NA By specifying the suffix argument, we get column names in the result with more meaningful names. 2.2.7.3 Missing Values The dplyr package has a default behavior that I think is dangerous. In the conditions of a join, NA==NA evaluates to TRUE, which is unlike the behavior anywhere else in R. This means that keys identified as NA will match other NAs in the join. This is a very strong assumption that seems to contradict the idea of a missing value since if we actually don’t know two keys, how can we say that they match? And if we know two keys have the same value then they should be labeled in the data. In my opinion, it’s a mistake to have the computer make strong assumptions by default, and especially if it does so without warning the user. Fortunately, there is a way to make the more sensible decision that NAs don’t match anything: include the argument na_matches='never' in the join. 
# Example datasets students = data.frame( student_id = c(1, NA, 3, 4), name = c("Angel", "Beto", "Cici", "Desmond")) grades = data.frame(student_id = c(2, NA, 4, 5), grade = c(90, 85, 80, 75)) # Left joins left_join(students, grades, by = "student_id") |> kable() student_id name grade 1 Angel NA NA Beto 85 3 Cici NA 4 Desmond 80 left_join(students, grades, by = "student_id", na_matches = "never") |> kable() student_id name grade 1 Angel NA NA Beto NA 3 Cici NA 4 Desmond 80 Notice that since Beto’s student ID is NA, none of the rows in the grades table can match him. As a result, his grade is left NA in the result. 2.2.8 Conclusion You’ve now seen how to join data tables that can be linked by key columns. I encourage you to expand on the examples by posing questions and trying to write the code to answer them. Reading the documentation for join functions and join_by specifications is a great way to continue your learning journey by studying the (many!) special cases that we skipped over here. "],["best-practices-for-writing-r-scripts.html", "3 Best Practices for Writing R Scripts 3.1 Scripting Your Code 3.2 Printing Output 3.3 Reading Input 3.4 Managing Packages 3.5 Iteration Strategies", " 3 Best Practices for Writing R Scripts Learning Objectives Identify and explain the difference between R’s various printing functions Describe and use R’s for, while, and repeat loops Identify the most appropriate iteration strategy for a given problem Explain strategies to organize iterative code 3.1 Scripting Your Code 3.2 Printing Output This section introduces several different functions for printing output and making that output easier to read. 3.2.1 The print Function The print function prints a string representation of an object to the console. The string representation is usually formatted in a way that exposes details important to programmers rather than users. 
For example, when printing a vector, the function prints the position of the first element on each line in square brackets [ ]: print(1:100) ## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 ## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 ## [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 ## [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 ## [91] 91 92 93 94 95 96 97 98 99 100 The print function also prints quotes around strings: print("Hi") ## [1] "Hi" These features make the print function ideal for printing information when you’re trying to understand some code or diagnose a bug. On the other hand, these features also make print a bad choice for printing output or status messages for users (including you). R calls the print function automatically anytime a result is returned at the prompt. Thus it’s not necessary to call print to print something when you’re working directly in the console—only from within loops, functions, scripts, and other code that runs non-interactively. The print function is an S3 generic (see Section 6.4), so if you create an S3 class, you can define a custom print method for it. For S4 objects, R uses the S4 generic show instead of print. 3.2.2 The message Function To print output for users, the message function is the one you should use. The main reason for this is that the message function is part of R’s conditions system for reporting status information as code runs. This makes it easier for other code to detect, record, respond to, or suppress the output (see Section 4.2 to learn more about R’s conditions system). The message function prints its argument(s) and a newline to the console: message("Hello world!") ## Hello world! 
If an argument isn’t a string, the function automatically and silently attempts to coerce it to one: message(4) ## 4 Some types of objects can’t be coerced to a string: message(sqrt) ## Error in FUN(X[[i]], ...): cannot coerce type 'builtin' to vector of type 'character' For objects with multiple elements, the function pastes together the string representations of the elements with no separators in between: x = c(1, 2, 3) message(x) ## 123 Similarly, if you pass the message function multiple arguments, it pastes them together with no separators: name = "R" message("Hi, my name is ", name, " and x is ", x) ## Hi, my name is R and x is 123 This is a convenient way to print names or descriptions alongside values from your code without having to call a formatting function like paste. You can make the message function print something without adding a newline at the end by setting the argument appendLF = FALSE. The difference can be easy to miss unless you make several calls to message, so the say_hello function in this example calls message twice: say_hello = function(appendLF) { message("Hello", appendLF = appendLF) message(" world!") } say_hello(appendLF = TRUE) ## Hello ## world! say_hello(appendLF = FALSE) ## Hello world! Note that RStudio always adds a newline in front of the prompt, so making an isolated call to message with appendLF = FALSE appears to produce the same output as with appendLF = TRUE. This is an example of a situation where RStudio leads you astray: in an ordinary R console, the two are clearly different. 3.2.3 The cat Function The cat function, whose name stands for “concatenate and print,” is a low-level way to print output to the console or a file. The message function prints output by calling cat, but cat is not part of R’s conditions system. The cat function prints its argument(s) to the console. 
It does not add a newline at the end: cat("Hello") ## Hello As with message, RStudio hides the fact that there’s no newline if you make an isolated call to cat. The cat function coerces its arguments to strings and concatenates them. By default, a space is inserted between arguments and their elements: cat(4) ## 4 cat(x) ## 1 2 3 cat("Hello", "Nick") ## Hello Nick You can set the sep parameter to control the separator cat inserts: cat("Hello", "world", x, sep = "|") ## Hello|world|1|2|3 If you want to write output to a file rather than to the console, you can call cat with the file parameter set. However, it’s preferable to use functions tailored to writing specific kinds of data, such as writeLines (for text) or write.table (for tabular data), since they provide additional options to control the output. Many scripts and packages still use cat to print output, but the message function provides more flexibility and control to people running the code. Thus it’s generally preferable to use message in new code. Nevertheless, there are a few specific cases where cat is useful—for example, if you want to pipe data to a UNIX shell command. See ?cat for details. 3.2.4 Formatting Output R provides a variety of ways to format data before you print it. Taking the time to format output carefully makes it easier to read and understand, as well as making your scripts seem more professional. 3.2.4.1 Escape Sequences One way to format strings is by adding (or removing) escape sequences. An escape sequence is a sequence of characters that represents some other character, usually one that’s invisible (such as whitespace) or doesn’t appear on a standard keyboard. In R, escape sequences always begin with a backslash. For example, \n is a newline. The message and cat functions automatically convert escape sequences to the characters they represent: x = "Hello\nworld!" message(x) ## Hello ## world! The print function doesn’t convert escape sequences: print(x) ## [1] "Hello\nworld!" 
Some escape sequences trigger special behavior in the console. For example, ending a line with a carriage return \r makes the console print the next line over the current line. Try running this code in a console (it’s not possible to see the result in a static book): # Run this in an R console. for (i in 1:10) { message(i, "\r", appendLF = FALSE) # Wait 0.5 seconds. Sys.sleep(0.5) } You can find a complete list of escape sequences in ?Quotes. 3.2.4.2 Formatting Functions You can use the sprintf function to apply specific formatting to values and substitute them into strings. The function uses a mini-language to describe the formatting and substitutions. The sprintf function (or something like it) is available in many programming languages, so being familiar with it will serve you well on your programming journey. The key idea is that substitutions are marked by a percent sign % and a character. The character indicates the kind of data to be substituted: s for strings, i for integers, f for floating point numbers, and so on. The first argument to sprintf must be a string, and subsequent arguments are values to substitute into the string (from left to right). For example: sprintf("My age is %i, and my name is %s", 32, "Nick") ## [1] "My age is 32, and my name is Nick" You can use the mini-language to do things like specify how many digits to print after a decimal point. Format settings for a substituted value go between the percent sign % and the character. For instance, here’s how to print pi with 2 digits after the decimal: sprintf("%.2f", pi) ## [1] "3.14" You can learn more by reading ?sprintf. Much simpler are the paste and paste0 functions, which coerce their arguments to strings and concatenate (or “paste together”) them. 
The paste function inserts a space between each argument, while the paste0 function doesn’t: paste("Hello", "world") ## [1] "Hello world" paste0("Hello", "world") ## [1] "Helloworld" You can control the character inserted between arguments with the sep parameter. By setting an argument for the collapse parameter, you can also use the paste and paste0 functions to concatenate the elements of a vector. The argument to collapse is inserted between the elements. For example, suppose you want to paste together elements of a vector, inserting a comma and space ", " in between: paste(1:3, collapse = ", ") ## [1] "1, 2, 3" Members of the R community have developed many packages to make formatting strings easier: cli – helper functions for developing command-line interfaces, including functions to add color, progress bars, and more. glue – alternatives to sprintf for string interpolation. stringr – a collection of general-purpose string manipulation functions. 3.2.5 Logging Output Logging means saving the output from some code to a file as the code runs. The file where the output is saved is called a log file or log, but this name isn’t indicative of a specific format (unlike, say, a “CSV file”). It’s a good idea to set up some kind of logging for any code that takes more than a few minutes to run, because then if something goes wrong you can inspect the log to diagnose the problem. Think of any output that’s not logged as ephemeral: it could disappear if someone reboots the computer, or there’s a power outage, or some other, unforeseen event. R’s built-in tools for logging are rudimentary, but members of the community have developed a variety of packages for logging. Here are a few that are still actively maintained as of January 2023: logger – a relatively new package that aims to improve aspects of other logging packages that R users find confusing. futile.logger – a popular, mature logging package based on Apache’s Log4j utility and on R idioms. 
logging – a mature logging package based on Python’s logging module. loggit – integrates with R’s conditions system and writes logs in JavaScript Object Notation (JSON) format so they are easy to inspect programmatically. log4r – another package based on Log4j with an object-oriented programming approach. 3.3 Reading Input 3.4 Managing Packages 3.5 Iteration Strategies R is a powerful tool for automating tasks that have repetitive steps. For example, you can: Apply a transformation to an entire column of data. Compute distances between all pairs from a set of points. Read a large collection of files from disk in order to combine and analyze the data they contain. Simulate how a system evolves over time from a specific set of starting parameters. Scrape data from many pages of a website. You can implement concise, efficient solutions for these kinds of tasks in R by using iteration, which means repeating a computation many times. R provides four different strategies for writing iterative code: Vectorization, where a function is implicitly called on each element of a vector. See this section of the R Basics reader for more details. Apply functions, where a function is explicitly called on each element of a vector or array. See this section of the R Basics reader for more details. Loops, where an expression is evaluated repeatedly until some condition is met. Recursion, where a function calls itself. Vectorization is the most efficient and most concise iteration strategy, but also the least flexible, because it only works with vectorized functions and vectors. Apply functions are more flexible—they work with any function and any data structure with elements—but less efficient and less concise. Loops and recursion provide the most flexibility but are the least concise. In recent versions of R, apply functions and loops are similar in terms of efficiency. Recursion tends to be the least efficient iteration strategy in R. 
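To make the comparison concrete, here is a short sketch (not from the original text) that computes the same squares with each of the three main strategies. All three produce identical results, so the choice comes down to efficiency and clarity:

```r
numbers = c(10, 3, 100, -5, 2, 10)

# Vectorization: the ^ operator is implicitly applied to every element.
squares_vec = numbers^2

# Apply function: sapply explicitly calls the function on each element.
squares_apply = sapply(numbers, function(x) x^2)

# Loop: preallocate the output, then fill it in one element at a time.
squares_loop = numeric(length(numbers))
for (i in seq_along(numbers)) {
  squares_loop[i] = numbers[i]^2
}

identical(squares_vec, squares_apply)
## [1] TRUE
identical(squares_vec, squares_loop)
## [1] TRUE
```

Because the iterations here are independent, the vectorized version is the one you would actually use; the other two are shown only for comparison.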
The rest of this section explains how to write loops and how to choose which iteration strategy to use. We assume you’re already comfortable with vectorization and have at least some familiarity with apply functions. 3.5.1 For-loops A for-loop evaluates an expression once for each element of a vector or list. The for keyword creates a for-loop. The syntax is: for (I in DATA) { # Your code goes here } The variable I is called an induction variable. At the beginning of each iteration, I is assigned the next element of DATA. The loop iterates once for each element, unless a keyword instructs R to exit the loop early (more about this in Section 3.5.4). As with if-statements and functions, the curly braces { } are only required if the body contains multiple lines of code. Here’s a simple for-loop: for (i in 1:10) message("Hi from iteration ", i) ## Hi from iteration 1 ## Hi from iteration 2 ## Hi from iteration 3 ## Hi from iteration 4 ## Hi from iteration 5 ## Hi from iteration 6 ## Hi from iteration 7 ## Hi from iteration 8 ## Hi from iteration 9 ## Hi from iteration 10 When some or all of the iterations in a task depend on results from prior iterations, loops tend to be the most appropriate iteration strategy. For instance, loops are a good way to implement time-based simulations or compute values in recursively defined sequences. As a concrete example, suppose you want to compute the result of starting from the value 1 and composing the sine function 100 times: result = 1 for (i in 1:100) { result = sin(result) } result ## [1] 0.1688525 Unlike other iteration strategies, loops don’t return a result automatically. It’s up to you to use variables to store any results you want to use later. If you want to save a result from every iteration, you can use a vector or a list indexed on the iteration number: n = 1 + 100 result = numeric(n) result[1] = 1 for (i in 2:n) { result[i] = sin(result[i - 1]) } plot(result) Section 3.5.3 explains this in more detail. 
If the iterations in a task are not dependent, it’s preferable to use vectorization or apply functions instead of a loop. Vectorization is more efficient, and apply functions are usually more concise. In some cases, you can use vectorization to handle a task even if the iterations are dependent. For example, you can use vectorized exponentiation and the sum function to compute the sum of the cubes of many numbers: numbers = c(10, 3, 100, -5, 2, 10) sum(numbers^3) ## [1] 1001910 3.5.2 While-loops A while-loop runs a block of code repeatedly as long as some condition is TRUE. The while keyword creates a while-loop. The syntax is: while (CONDITION) { # Your code goes here } The CONDITION should be a scalar logical value or an expression that returns one. At the beginning of each iteration, R checks the CONDITION and exits the loop if it’s FALSE. As always, the curly braces { } are only required if the body contains multiple lines of code. Here’s a simple while-loop: i = 0 while (i < 10) { i = i + 1 message("Hello from iteration ", i) } ## Hello from iteration 1 ## Hello from iteration 2 ## Hello from iteration 3 ## Hello from iteration 4 ## Hello from iteration 5 ## Hello from iteration 6 ## Hello from iteration 7 ## Hello from iteration 8 ## Hello from iteration 9 ## Hello from iteration 10 Notice that this example does the same thing as the simple for-loop in Section 3.5.1, but requires 5 lines of code instead of 2. While-loops are a generalization of for-loops, and only do the bare minimum necessary to iterate. They tend to be most useful when you don’t know how many iterations will be necessary to complete a task. 
As an example, suppose you want to add up the integers in order until the total is greater than 50: total = 0 i = 1 while (total < 50) { total = total + i message("i is ", i, " total is ", total) i = i + 1 } ## i is 1 total is 1 ## i is 2 total is 3 ## i is 3 total is 6 ## i is 4 total is 10 ## i is 5 total is 15 ## i is 6 total is 21 ## i is 7 total is 28 ## i is 8 total is 36 ## i is 9 total is 45 ## i is 10 total is 55 total ## [1] 55 i ## [1] 11 3.5.3 Saving Multiple Results Loops often produce a different result for each iteration. If you want to save more than one result, there are a few things you must do. First, set up an index vector. The index vector should usually correspond to the positions of the elements in the data you want to process. The seq_along function returns an index vector when passed a vector or list. For instance: numbers = c(-1, 21, 3, -8, 5) index = seq_along(numbers) The loop will iterate over the index rather than the input, so the induction variable will track the current iteration number. On the first iteration, the induction variable will be 1, on the second it will be 2, and so on. Then you can use the induction variable and indexing to get the input for each iteration. Second, set up an empty output vector or list. This should usually also correspond to the input, or one element longer (the extra element comes from the initial value). R has several functions for creating vectors: logical, integer, numeric, complex, and character create an empty vector with a specific type and length vector creates an empty vector with a specific type and length rep creates a vector by repeating elements of some other vector Empty vectors are filled with FALSE, 0, or "", depending on the type of the vector. 
Here are some examples: logical(3) ## [1] FALSE FALSE FALSE numeric(4) ## [1] 0 0 0 0 rep(c(1, 2), 2) ## [1] 1 2 1 2 Let’s create an empty numeric vector congruent to the numbers vector: n = length(numbers) result = numeric(n) As with the input, you can use the induction variable and indexing to set the output for each iteration. Creating a vector or list in advance to store something, as we’ve just done, is called preallocation. Preallocation is extremely important for efficiency in loops. Avoid the temptation to use c or append to build up the output bit by bit in each iteration. Finally, write the loop, making sure to get the input and set the output. As an example, this loop adds each element of numbers to a running total and squares the new running total: for (i in index) { prev = if (i > 1) result[i - 1] else 0 result[i] = (numbers[i] + prev)^2 } result ## [1] 1.000000e+00 4.840000e+02 2.371690e+05 5.624534e+10 3.163538e+21 3.5.4 Break & Next The break keyword causes a loop to immediately exit. It only makes sense to use break inside of an if-statement. For example, suppose you want to print each string in a vector, but stop at the first missing value. You can do this with a for-loop and the break keyword: my_messages = c("Hi", "Hello", NA, "Goodbye") for (msg in my_messages) { if (is.na(msg)) break message(msg) } ## Hi ## Hello The next keyword causes a loop to immediately go to the next iteration. As with break, it only makes sense to use next inside of an if-statement. Let’s modify the previous example so that missing values are skipped, but don’t cause printing to stop. Here’s the code: for (msg in my_messages) { if (is.na(msg)) next message(msg) } ## Hi ## Hello ## Goodbye These keywords work with both for-loops and while-loops. 3.5.5 Planning for Iteration At first it may seem difficult to decide if and what kind of iteration to use. Start by thinking about whether you need to do something over and over. 
If you don’t, then you probably don’t need to use iteration. If you do, then try iteration strategies in this order: Vectorization Apply functions Try an apply function if iterations are independent. Loops Try a for-loop if some iterations depend on others. Try a while-loop if the number of iterations is unknown. Recursion (which isn’t covered here) Convenient for naturally recursive problems (like Fibonacci), but often there are faster solutions. Start by writing the code for just one iteration. Make sure that code works; it’s easy to test code for one iteration. When you have one iteration working, then try using the code with an iteration strategy (you will have to make some small changes). If it doesn’t work, try to figure out which iteration is causing the problem. One way to do this is to use message to print out information. Then try to write the code for the broken iteration, get that iteration working, and repeat this whole process. 3.5.6 Case Study: The Collatz Conjecture The Collatz Conjecture is a conjecture in math that was introduced in 1937 by Lothar Collatz and remains unproven today, despite being relatively easy to explain. Here’s a statement of the conjecture: Start from any positive integer. If the integer is even, divide by 2. If the integer is odd, multiply by 3 and add 1. If the result is not 1, repeat using the result as the new starting value. The result will always reach 1 eventually, regardless of the starting value. The sequences of numbers this process generates are called Collatz sequences. For instance, the Collatz sequence starting from 2 is 2, 1. The Collatz sequence starting from 12 is 12, 6, 3, 10, 5, 16, 8, 4, 2, 1. You can use iteration to compute the Collatz sequence for a given starting value. 
Since each number in the sequence depends on the previous one, and since the length of the sequence varies, a while-loop is the most appropriate iteration strategy: n = 5 i = 0 while (n != 1) { i = i + 1 if (n %% 2 == 0) { n = n / 2 } else { n = 3 * n + 1 } message(n, " ", appendLF = FALSE) } ## 16 8 4 2 1 As of 2020, scientists have used computers to check the Collatz sequences for every number up to approximately \(2^{64}\). For more details about the Collatz Conjecture, check out this video. 3.5.7 Case Study: U.S. Fruit Prices The U.S. Department of Agriculture (USDA) Economic Research Service (ERS) publishes data about consumer food prices. For instance, in 2018 they posted a dataset that estimates average retail prices for various fruits, vegetables, and snack foods. The estimates are formatted as a collection of Excel files, one for each type of fruit or vegetable. In this case study, you’ll use iteration to get the estimated “fresh” price for all of the fruits in the dataset that are sold fresh. To get started, download the zipped collection of fruit spreadsheets and save it somewhere on your computer. Then unzip the file with a zip program or R’s own unzip function. The first sheet of each file contains a table with the name of the fruit and prices sorted by how the fruit was prepared. You can see this for yourself if you use a spreadsheet program to inspect some of the files. In order to read the files into R, first get a vector of their names. You can use the list.files function to list all of the files in a directory. 
If you set full.names = TRUE, the function will return the absolute path to each file: paths = list.files("data/fruit", full.names = TRUE) paths ## [1] "data/fruit/apples_2013.xlsx" "data/fruit/apricots_2013.xlsx" ## [3] "data/fruit/bananas_2013.xlsx" "data/fruit/berries_mixed_2013.xlsx" ## [5] "data/fruit/blackberries_2013.xlsx" "data/fruit/blueberries_2013.xlsx" ## [7] "data/fruit/cantaloupe_2013.xlsx" "data/fruit/cherries_2013.xlsx" ## [9] "data/fruit/cranberries_2013.xlsx" "data/fruit/dates_2013.xlsx" ## [11] "data/fruit/figs_2013.xlsx" "data/fruit/fruit_cocktail_2013.xlsx" ## [13] "data/fruit/grapefruit_2013.xlsx" "data/fruit/grapes_2013.xlsx" ## [15] "data/fruit/honeydew_2013.xlsx" "data/fruit/kiwi_2013.xlsx" ## [17] "data/fruit/mangoes_2013.xlsx" "data/fruit/nectarines_2013.xlsx" ## [19] "data/fruit/oranges_2013.xlsx" "data/fruit/papaya_2013.xlsx" ## [21] "data/fruit/peaches_2013.xlsx" "data/fruit/pears_2013.xlsx" ## [23] "data/fruit/pineapple_2013.xlsx" "data/fruit/plums_2013.xlsx" ## [25] "data/fruit/pomegranate_2013.xlsx" "data/fruit/raspberries_2013.xlsx" ## [27] "data/fruit/strawberries_2013.xlsx" "data/fruit/tangerines_2013.xlsx" ## [29] "data/fruit/watermelon_2013.xlsx" The files are in Excel format, which you can read with the read_excel function from the readxl package. First try reading one file and extracting the fresh price: library("readxl") prices = read_excel(paths[1]) ## New names: ## • `` -> `...2` ## • `` -> `...3` ## • `` -> `...4` ## • `` -> `...5` ## • `` -> `...6` ## • `` -> `...7` The name of the fruit is the first word in the first column’s name. The fresh price appears in the row where the word in column 1 starts with \"Fresh\". 
You can use str_which from the stringr package (Section 1.4.1) to find and extract this row: library("stringr") idx_fresh = str_which(prices[[1]], "^Fresh") prices[idx_fresh, ] ## # A tibble: 1 × 7 ## Apples—Average retail price per pound or…¹ ...2 ...3 ...4 ...5 ...6 ...7 ## <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 Fresh1 1.56… per … 0.9 0.24… poun… 0.42… ## # ℹ abbreviated name: ## # ¹`Apples—Average retail price per pound or pint and per cup equivalent, 2013` The price and unit appear in column 2 and column 3. Now generalize these steps by making a read_fresh_price function. The function should accept a path as input and return a vector that contains the fruit name, fresh price, and unit. Don’t worry about cleaning up the fruit name at this point—you can do that with a vectorized operation after combining the data from all of the files. A few fruits don’t have a fresh price, and the function should return NA for the price and unit for those. Here’s one way to implement the read_fresh_price function: read_fresh_price = function(path) { prices = read_excel(path) # Get fruit name. fruit = names(prices)[[1]] # Find fresh price. 
idx = str_which(prices[[1]], "^Fresh") if (length(idx) > 0) { prices = prices[idx, ] c(fruit, prices[[2]], prices[[3]]) } else { c(fruit, NA, NA) } } Test that the function returns the correct result for a few of the files: read_fresh_price(paths[[1]]) ## New names: ## • `` -> `...2` ## • `` -> `...3` ## • `` -> `...4` ## • `` -> `...5` ## • `` -> `...6` ## • `` -> `...7` ## [1] "Apples—Average retail price per pound or pint and per cup equivalent, 2013" ## [2] "1.5675153914496354" ## [3] "per pound" read_fresh_price(paths[[4]]) ## New names: ## • `` -> `...2` ## • `` -> `...3` ## • `` -> `...4` ## • `` -> `...5` ## • `` -> `...6` ## • `` -> `...7` ## [1] "Mixed berries—Average retail price per pound and per cup equivalent, 2013" ## [2] NA ## [3] NA read_fresh_price(paths[[8]]) ## New names: ## • `` -> `...2` ## • `` -> `...3` ## • `` -> `...4` ## • `` -> `...5` ## • `` -> `...6` ## • `` -> `...7` ## [1] "Cherries—Average retail price per pound and per cup equivalent, 2013" ## [2] "3.5929897554945156" ## [3] "per pound" Now that the function is working, it’s time to choose an iteration strategy. The read_fresh_price function is not vectorized, so that strategy isn’t possible. Reading one file doesn’t depend on reading any of the others, so apply functions are the best strategy here. 
The read_fresh_price function always returns a character vector with 3 elements, so you can use sapply to process all of the files and get a matrix of results: all_prices = sapply(paths, read_fresh_price) ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## New names: ## • `` -> `...2` ## • `` -> `...3` ## • `` -> `...4` ## • `` -> `...5` ## • `` -> `...6` ## • `` -> `...7` # Transpose, convert to a data frame, and set names for easy reading. all_prices = t(all_prices) all_prices = data.frame(all_prices) rownames(all_prices) = NULL colnames(all_prices) = c("fruit", "price", "unit") all_prices ## fruit ## 1 Apples—Average retail price per pound or pint and per cup equivalent, 2013 ## 2 Apricots—Average retail price per pound and per cup equivalent, 2013 ## 3 Bananas—Average retail price per pound and per cup equivalent, 2013 ## 4 Mixed berries—Average retail price per pound and per cup equivalent, 2013 ## 5 Blackberries—Average retail price per pound and per cup equivalent, 2013 ## 6 Blueberries—Average retail price per pound and per cup equivalent, 2013 ## 7 Cantaloupe—Average retail price per pound and per cup equivalent, 2013 ## 8 Cherries—Average retail price per pound and per cup equivalent, 2013 ## 9 Cranberries—Average retail price per pound and per cup equivalent, 2013 ## 10 Dates—Average retail price per pound and per cup equivalent, 2013 ## 11 Figs—Average retail price per pound and per cup equivalent, 2013 ## 12 Fruit cocktail—Average retail price per pound and per cup equivalent, 2013 ## 13 Grapefruit—Average retail price per pound or pint and per cup equivalent, 2013 ## 14 Grapes—Average retail price per pound or pint 
and per cup equivalent, 2013 ## 15 Honeydew melon—Average retail price per pound and per cup equivalent, 2013 ## 16 Kiwi—Average retail price per pound and per cup equivalent, 2013 ## 17 Mangoes—Average retail price per pound and per cup equivalent, 2013 ## 18 Nectarines—Average retail price per pound and per cup equivalent, 2013 ## 19 Oranges—Average retail price per pound or pint and per cup equivalent, 2013 ## 20 Papaya—Average retail price per pound and per cup equivalent, 2013 ## 21 Peaches—Average retail price per pound and per cup equivalent, 2013 ## 22 Pears—Average retail price per pound and per cup equivalent, 2013 ## 23 Pineapple—Average retail price per pound or pint and per cup equivalent, 2013 ## 24 Plums—Average retail price per pound or pint and per cup equivalent, 2013 ## 25 Pomegranate—Average retail price per pound or pint and per cup equivalent, 2013 ## 26 Raspberries—Average retail price per pound and per cup equivalent, 2013 ## 27 Strawberries—Average retail price per pound and per cup equivalent, 2013 ## 28 Tangerines—Average retail price per pound or pint and per cup equivalent, 2013 ## 29 Watermelon—Average retail price per pound and per cup equivalent, 2013 ## price unit ## 1 1.5675153914496354 per pound ## 2 3.0400719670964378 per pound ## 3 0.56698341453144807 per pound ## 4 <NA> <NA> ## 5 5.7747082503535152 per pound ## 6 4.7346216897250253 per pound ## 7 0.53587377610644515 per pound ## 8 3.5929897554945156 per pound ## 9 <NA> <NA> ## 10 <NA> <NA> ## 11 <NA> <NA> ## 12 <NA> <NA> ## 13 0.89780204117954143 per pound ## 14 2.0938274120049827 per pound ## 15 0.79665620543008364 per pound ## 16 2.0446834079658482 per pound ## 17 1.3775634470319702 per pound ## 18 1.7611484827950696 per pound ## 19 1.0351727302444853 per pound ## 20 1.2980115892049107 per pound ## 21 1.5911868532458617 per pound ## 22 1.4615746043999458 per pound ## 23 0.62766194593569868 per pound ## 24 1.8274160078099031 per pound ## 25 2.1735904118559191 per pound ## 26 
6.9758107988552958 per pound ## 27 2.3588084831103004 per pound ## 28 1.3779618772323634 per pound ## 29 0.33341203532340097 per pound The last step is to remove the extra text from the fruit name. One way to do this is with the str_split_fixed function from the stringr package. There’s an em dash — after each fruit name, which you can use for the split: fruit = str_split_fixed(all_prices$fruit, "—", 2)[, 1] all_prices$fruit = fruit all_prices ## fruit price unit ## 1 Apples 1.5675153914496354 per pound ## 2 Apricots 3.0400719670964378 per pound ## 3 Bananas 0.56698341453144807 per pound ## 4 Mixed berries <NA> <NA> ## 5 Blackberries 5.7747082503535152 per pound ## 6 Blueberries 4.7346216897250253 per pound ## 7 Cantaloupe 0.53587377610644515 per pound ## 8 Cherries 3.5929897554945156 per pound ## 9 Cranberries <NA> <NA> ## 10 Dates <NA> <NA> ## 11 Figs <NA> <NA> ## 12 Fruit cocktail <NA> <NA> ## 13 Grapefruit 0.89780204117954143 per pound ## 14 Grapes 2.0938274120049827 per pound ## 15 Honeydew melon 0.79665620543008364 per pound ## 16 Kiwi 2.0446834079658482 per pound ## 17 Mangoes 1.3775634470319702 per pound ## 18 Nectarines 1.7611484827950696 per pound ## 19 Oranges 1.0351727302444853 per pound ## 20 Papaya 1.2980115892049107 per pound ## 21 Peaches 1.5911868532458617 per pound ## 22 Pears 1.4615746043999458 per pound ## 23 Pineapple 0.62766194593569868 per pound ## 24 Plums 1.8274160078099031 per pound ## 25 Pomegranate 2.1735904118559191 per pound ## 26 Raspberries 6.9758107988552958 per pound ## 27 Strawberries 2.3588084831103004 per pound ## 28 Tangerines 1.3779618772323634 per pound ## 29 Watermelon 0.33341203532340097 per pound Now the data are ready for analysis. You could extend the reader function to extract more of the data (e.g., dried and frozen prices), but the overall process is fundamentally the same. Write the code to handle one file (one step), generalize it to work on several, and then iterate. 
For another example, see Liza Wood’s Real-world Function Writing Mini-reader. "],["squashing-bugs-with-rs-debugging-tools.html", "4 Squashing Bugs with R’s Debugging Tools 4.1 Printing 4.2 The Conditions System 4.3 Global Options 4.4 Debugging 4.5 Measuring Performance", " 4 Squashing Bugs with R’s Debugging Tools The major topics of this chapter are how to print output, how R’s conditions system for warnings and errors works, how to use the R debugger, and how to estimate the performance of R code. Learning Objectives Use R’s conditions system to raise and catch messages, warnings, and errors Use R’s debugging functions to diagnose bugs in code Estimate the amount of memory a data set will require Use the lobstr package to get memory usage for an R object Describe what a profiler is and why you would use one Describe what kinds of profiling tools R provides 4.1 Printing Perhaps the simplest thing you can do to get a better understanding of some code is make it print out lots of information about what’s happening as it runs. You can use the print function to print values in a way that exposes details important to programmers. For example, when printing a vector, the function prints the position of the first element on each line in square brackets [ ]: print(1:100) ## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 ## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 ## [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 ## [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 ## [91] 91 92 93 94 95 96 97 98 99 100 The print function also prints quotes around strings: print("Hi") ## [1] "Hi" These features make the print function ideal for printing information when you’re trying to understand some code or diagnose a bug. 4.2 The Conditions System R’s conditions system provides a way to signal and handle unusual conditions that arise while code runs. 
With the conditions system, you can make R print status, warning, and error messages that make it easier for users to understand what your code is doing and whether they’re using it as intended. The conditions system also makes it possible to safely run code that might cause an error, and respond appropriately in the event that it does. In short, understanding the conditions system will enable you to write code that’s easier to use and more robust. 4.2.1 Raising Conditions The message, warning, and stop functions are the primary ways to raise, or signal, conditions. The message function is the primary way to print output (see Section 3.2 for alternatives). A message provides status information about running code, but does not necessarily indicate that something has gone wrong. You can use messages to print out any information you think might be relevant to users. The warning function raises a warning. Warnings indicate that something unexpected happened, but that it didn’t stop the code from running. By default, R doesn’t print warnings to the console until code finishes running, which can make it difficult to understand their cause; Section 4.3 explains how to change this setting. Unnamed arguments to the warning function are concatenated with no separator between them, in the same way as arguments to the message function. For example: warning("Objects in mirror", " may be closer than they appear.") ## Warning: Objects in mirror may be closer than they appear. Warnings are always printed with Warning: before the message. By default, calling warning from the body of a function also prints the name of the function: f = function(x, y) { warning("This is a warning!") x + y } f(3, 4) ## Warning in f(3, 4): This is a warning! ## [1] 7 The name of the function that raised the warning is generally useful information for users who want to correct whatever caused the warning. Occasionally, you might want to disable this behavior, which you can do by setting call. 
= FALSE: f = function(x, y) { warning("This is a warning!", call. = FALSE) x + y } f(3, 4) ## Warning: This is a warning! ## [1] 7 The warning function also has several other parameters that control when and how warnings are displayed. The stop function raises an error, which indicates that something unexpected happened that prevents the code from running, and immediately stops the evaluation of code. As a result, R prints errors as soon as they’re raised. For instance, in this function, the line x + y never runs: f = function(x, y) { stop() x + y } f(3, 4) ## Error in f(3, 4): Like message and warning, the stop function concatenates its unnamed arguments into a message to print: stop("I'm afraid something has gone terribly wrong.") ## Error in eval(expr, envir, enclos): I'm afraid something has gone terribly wrong. Errors are always printed with Error: before the error message. You can use the call. parameter to control whether the error message also includes the name of the function from which stop was called. When writing code—especially functions, executable scripts, and packages—it’s a good habit to include tests for unexpected conditions such as invalid arguments and impossible results. If the tests detect a problem, use the warning or stop function (depending on severity) to signal what the problem is. Try to provide a concise but descriptive warning or error message so that users can easily understand what went wrong. 4.2.2 Handling Conditions In some cases, you can anticipate the problems likely to occur when code runs and can even devise ways to work around them. As an example, suppose your code is supposed to load parameters from a configuration file, but the path to the file provided by the user is invalid. It might still be possible for your code to run by falling back on a set of default parameters. R’s conditions system provides a way to handle or “catch” messages, warnings, and errors, and to run alternative code in response. 
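As a concrete sketch of the configuration-file scenario described above, here is one way the fallback might look using tryCatch, which is covered later in this section. The file path, parameter names, and the use of dget as the reader are all hypothetical choices for illustration:

```r
# Hypothetical example: fall back to default parameters when the
# configuration file can't be read. dget() reads an R object that
# was written with dput(); any file reader would work the same way.
read_params = function(path) {
  defaults = list(n_iter = 100, tolerance = 1e-6)
  tryCatch(
    dget(path),
    error = function(e) {
      warning("Couldn't read '", path, "'; using default parameters.",
        call. = FALSE)
      defaults
    }
  )
}

params = read_params("no/such/file.cfg")
params$n_iter
## [1] 100
```

Because the error is caught, the code keeps running with the defaults instead of stopping, and the warning tells the user what happened.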
You can use the try function to safely run code that might produce an error. If no error occurs, the try function returns whatever the result of the code was. If an error does occur, the try function prints the error message and returns an object of class try-error, but evaluation does not stop. For example: bad_add = function(x) { # No error x1 = try(5 + x) # Error x2 = try("yay" + x) list(x1, x2) } bad_add(10) ## Error in "yay" + x : non-numeric argument to binary operator ## [[1]] ## [1] 15 ## ## [[2]] ## [1] "Error in \"yay\" + x : non-numeric argument to binary operator\n" ## attr(,"class") ## [1] "try-error" ## attr(,"condition") ## <simpleError in "yay" + x: non-numeric argument to binary operator> The simplest thing you can do in response to an error is ignore it. This is usually not a good idea, but if you understand exactly what went wrong, can’t fix it easily, and know it won’t affect the rest of your code, doing so might be the best option. A more robust approach is to inspect the result from a call to try to see if an error occurred, and then take some appropriate action if one did. You can use the inherits function to check whether an object has a specific class, so here’s a template for how to run code that might cause an error, check for the error, and respond to it: result = try({ # Code that might cause an error. }) if (inherits(result, "try-error")) { # Code to respond to the error. } You can prevent the try function from printing error messages by setting silent = TRUE. This is useful when your code is designed to detect and handle the error, so you don’t want users to think an error occurred. The tryCatch function provides another way to handle conditions raised by a piece of code. It requires that you provide a handler function for each kind of condition you want to handle. 
The kinds of conditions are: message warning error interrupt – when the user interrupts the code (for example, by pressing Ctrl-C) Each handler function must accept exactly one argument. When you call tryCatch, if the suspect code raises a condition, then it calls the associated handler function and returns whatever the handler returns. Otherwise, tryCatch returns the result of the code. Here’s an example of using tryCatch to catch an error: bad_fn = function(x, y) { stop("Hi") x + y } err = tryCatch(bad_fn(3, 4), error = function(e) e) And here’s an example of using tryCatch to catch a message: msg_fn = function(x, y) { message("Hi") x + y } msg = tryCatch(msg_fn(3, 4), message = function(e) e) The tryCatch function always silences conditions. Details about raised conditions are provided in the object passed to the handler function, which has class condition (and a more specific class that indicates what kind of condition it is). If you want to learn more about R’s conditions system, start by reading ?conditions. 4.3 Global Options R’s global options control many different aspects of how R works. They’re relevant to the theme of this chapter because some of them control when and how R displays warnings and errors. You can use the options function to get or set global options. If you call the function with no arguments, it returns the current settings: opts = options() # Display the first 6 options. head(opts) ## $add.smooth ## [1] TRUE ## ## $bitmapType ## [1] "cairo" ## ## $browser ## [1] "" ## ## $browserNLdisabled ## [1] FALSE ## ## $callr.condition_handler_cli_message ## function (msg) ## { ## custom_handler <- getOption("cli.default_handler") ## if (is.function(custom_handler)) { ## custom_handler(msg) ## } ## else { ## cli_server_default(msg) ## } ## } ## <bytecode: 0x5555844dbe50> ## <environment: namespace:cli> ## ## $CBoundsCheck ## [1] FALSE This section only explains a few of the options, but you can read about all of them in ?options. 
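When you only need one option, the getOption function is more convenient than indexing into the full list returned by options. A short sketch:

```r
# Look up a single option by name. The default for the digits
# option (how many significant digits R prints) is 7.
getOption("digits")
## [1] 7

# Setting an option returns the old value, so you can restore it later.
old = options(digits = 3)
print(pi)
## [1] 3.14
options(old)
```

Saving and restoring the old value this way is a polite habit in functions, so that your code doesn't permanently change the user's settings.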
The warn option controls how R handles warnings. It can be set to three different values: 0 – (the default) warnings are only displayed after code finishes running. 1 – warnings are displayed immediately. 2 – warnings stop code from running, like errors. Setting warn = 2 is useful for pinpointing expressions that raise warnings. Setting warn = 1 makes it easier to determine which expressions raise warnings, without the inconvenience of stopping code from running. That makes it a good default (better than the actual default). You can use the options function to change the value of the warn option: options(warn = 1) When you set an option this way, the change only lasts until you quit R. Next time you start R, the option will go back to its default value. Fortunately, there is a way to override the default options every time R starts. When R starts, it searches for a .Rprofile file. The file is usually in your system’s home directory (see this section of the R Basics Reader for how to locate your home directory). Customizing your .Rprofile file is one of the marks of an experienced R user. If you define a .First function in your .Rprofile, R will call it automatically during startup. Here’s an example .First function: .First = function() { # Only change options if R is running interactively. if (!interactive()) return() options( # Don't print more than 1000 elements of anything. max.print = 1000, # Warn on partial matches. warnPartialMatchAttr = TRUE, warnPartialMatchDollar = TRUE, warnPartialMatchArgs = TRUE, # Print warnings immediately (2 = warnings are errors). warn = 1 ) } You can learn more about the .Rprofile file and R’s startup process at ?Startup. 4.4 Debugging Debugging code is the process of confirming, step-by-step, that what you believe the code does is what the code actually does. The key idea is to check each step (or expression) in the code. There are two different strategies for doing this: Work forward through the code from the beginning. 
Work backward from the source of an error. R has built-in functions to help with debugging. The browser() function pauses the running code and starts R’s debugging system. For example: # Run this in an R console. f = function(n) { total = 0 for (i in 1:n) { browser() total = total + i } total } f(10) The most important debugger commands are: n to run the next line s to “step into” a call c to continue running the code Q to quit the debugger where to print call stack help to print debugger help Another example: # Run this in an R console. g = function(x, y) (1 + x) * y f = function(n) { total = 0 for (i in 1:n) { browser() total = total + g(i, i) } total } f(11) 4.4.1 Other Functions The debug() function places a call to browser() at the beginning of a function. Use debug() to debug functions that you can’t or don’t want to edit. For example: # Run this in an R console. f = function(x, y) { x + y } debug(f) f(5, 5) You can use undebug() to reverse the effect of debug(): # Run this in an R console. undebug(f) f(10, 20) The debugonce() function places a call to browser() at the beginning of a function for the next call only. The idea is that you then don’t have to call undebug(). For instance: # Run this in an R console. debugonce(f) f(10, 20) f(3, 4) Finally, the global option error can be used to make R enter the debugger any time an error occurs. Set the option to error = recover: options(error = recover) Then try this example: # Run this in an R console. bad_fn = function(x, y) { stop("Hi") x + y } bad_fn(3, 4) 4.5 Measuring Performance How quickly code runs and how much memory it uses can be just as much of an obstacle to research computing tasks as errors and bugs. This section describes some of the strategies you can use to estimate or measure the performance characteristics of code, so that you can identify potential problems and fix them. 
4.5.1 Estimating Memory Usage Running out of memory can be extremely frustrating, because it can slow down your code or prevent it from running at all. It’s useful to know how to estimate how much memory a given data structure will use so that you can determine whether a programming strategy is feasible before you even start writing code. The central processing units (CPUs) in most modern computers are designed to work most efficiently with 64 bits of data at a time. Consequently, R and other programming languages typically use 64 bits to store each number (regardless of type). While the data structures R uses create some additional overhead, you can use this fact to do back-of-the-envelope calculations about how much memory a vector or matrix of numbers will require. Start by determining how many elements the data structure will contain. Then multiply by 64 bits and divide by 8 to convert bits to bytes. You can then repeatedly divide by 1024 to convert to kilobytes, megabytes, gigabytes, or terabytes. For instance, a vector of 2 million numbers will require approximately this many megabytes: n = 2000000 n * (64 / 8) / 1024^2 ## [1] 15.25879 You can even write an R function to do these calculations for you! If you’re not sure whether a particular programming strategy is realistic, do the memory calculations before you start writing code. This is a simple way to avoid strategies that are inefficient. If you’ve already written some code and it runs out of memory, the first step to fixing the problem is identifying the cause. The lobstr package provides functions to explore how R is using memory. You can use the mem_used function to get the amount of memory R is currently using: library("lobstr") mem_used() ## 37.76 MB Sometimes the culprit isn’t your code, but other applications on your computer. Modern web browsers are especially memory-intensive, and closing yours while you run code can make a big difference. 
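As suggested above, the back-of-the-envelope calculation is easy to wrap in a function. Here's one possible sketch (the name estimate_memory is made up, and the result ignores R's per-object overhead, so treat it as a lower bound):

```r
# Estimate the memory required for n 64-bit numbers, in a chosen unit.
# This is a rough lower bound; it ignores R's per-object overhead.
estimate_memory = function(n, unit = c("MB", "KB", "GB")) {
  unit = match.arg(unit)
  bytes = n * (64 / 8)                          # 8 bytes per number
  power = switch(unit, KB = 1, MB = 2, GB = 3)  # how many times to divide by 1024
  bytes / 1024^power
}

estimate_memory(2000000)       # the 2 million number example, in megabytes
## [1] 15.25879
estimate_memory(1e9, "GB")     # a billion numbers, in gigabytes
```

A function like this makes it painless to check feasibility before committing to a strategy, for example when deciding whether a distance matrix for a large data set will fit in memory.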
If you’ve determined that your code is the reason R runs out of memory, you can use the obj_size function to get how much memory objects in your code actually use: obj_size(1) ## 56 B x = runif(n) obj_size(x) ## 16.00 MB obj_size(mtcars) ## 7.21 kB If a specific object created by your code uses a lot of memory, think about ways you might change the code to avoid creating the object or avoid creating the entire object at once. For instance, consider whether it’s possible to create part of the object, save that to disk, remove it from memory, and then create another part. 4.5.2 Benchmarking Benchmarking means timing how long code takes to run. Benchmarking is useful for evaluating different strategies to solve a computational problem and for understanding how quickly (or slowly) your code runs. When you benchmark code, it’s important to collect and aggregate multiple data points so that your estimates reflect how the code performs on average. R has built-in functions for timing code, but several packages provide functions that are more convenient for benchmarking, because they automatically run the code multiple times and return summary statistics. The two most mature packages for benchmarking are: microbenchmark bench The microbenchmark package is simpler to use. It provides a single function, microbenchmark, for carrying out benchmarks. The function accepts any number of expressions to benchmark as arguments. For example, to compare the speed of runif and rnorm (as A and B respectively): library("microbenchmark") microbenchmark(A = runif(1e5), B = rnorm(1e5)) ## Unit: milliseconds ## expr min lq mean median uq max neval ## A 2.866326 3.258161 3.479572 3.323092 3.472401 7.942078 100 ## B 5.904507 6.281376 6.721875 6.485273 6.915268 13.351294 100 The microbenchmark function has parameters to control the number of times each expression runs, the units for the timings, and more. You can find the details in ?microbenchmark. 
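If you just want a single rough timing without installing a package, base R's system.time function works, although it runs the code only once and so gives noisier estimates than microbenchmark:

```r
# Time a single expression once. The "elapsed" entry is wall-clock
# time in seconds; your numbers will differ from run to run.
system.time(runif(1e7))
```

Because a single run can be thrown off by whatever else your computer is doing, prefer a real benchmarking package when you need to compare strategies, and reserve system.time for quick sanity checks.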
4.5.3 Profiling Profiling code means collecting data about the code as it runs, and a profiler is a program that profiles code. A typical profiler estimates how much time is spent on each expression (as actual time or as a percentage of total runtime) and how much memory the code uses over time. Profiling is a good way to determine which parts of your code are performance bottlenecks, so that you can target them when you try to optimize your code. R has a built-in profiler. You can use the Rprof function to enable or disable the profiler. Essential parameters for the function are: filename – a path to a file for storing results. Defaults to Rprof.out. interval – the time between samples, in seconds. memory.profiling – whether to track memory in addition to time. Set these parameters in the first call you make to Rprof, which will enable the profiler. Then run the code you want to profile. At the end of the code, call Rprof(NULL) to disable the profiler. The profiler saves the collected data to a file. You can use the summaryRprof function to read the profile data and get a summary. Essential parameters for this function are: filename – the path to the results file. Defaults to Rprof.out. memory – how to display memory information. Use "both" to see total changes. The summary lists times in seconds and memory in bytes. The profvis package provides an interactive graphical interface for exploring profile data collected with Rprof. Examining profile data graphically makes it easier to interpret the results and to identify patterns. "],["data-visualization-in-r.html", "5 Data Visualization in R 5.1 Our Friend ggplot2 5.2 Example: Palmer Penguins 5.3 Layers 5.4 Guidelines for Graphics 5.5 Case Studies", " 5 Data Visualization in R We are here today to learn how to do data visualization in R. Some of you will have recently done the Principles of Data Visualization workshop. 
There you were given a checklist of questions to guide you as you create a plot, which we are going to use today. The checklist is here. 5.1 Our Friend ggplot2 We will be using the R package ggplot2 to create data visualizations. Install it via the install.packages() function. While we are at it, let’s make sure we install all of the packages that we’ll need for today’s workshop. Beyond ggplot2, we’ll use readr for reading data files, dplyr for manipulating data, and palmerpenguins for a nice dataset. install.packages("ggplot2") install.packages("dplyr") install.packages("readr") install.packages("palmerpenguins") ggplot2 is an enormously popular R package that provides a way to create data visualizations through a so-called “grammar of graphics” (hence the “gg” in the name). That grammar interface may be a little bit unintuitive at first but once you grasp it, you hold enormous power to quickly craft plots. It doesn’t hurt that the ggplot2 plots look great, too. 5.1.1 The Grammar of Graphics The grammar of graphics breaks the elements of statistical graphics into parts in an analogy to human language grammars. Knowing how to put together subject nouns, object nouns, verbs, and adjectives allows you to construct sentences that express meaning. Similarly, the grammar of graphics is a collection of layers and the rules for putting them together to graphically express the meaning of your data. 5.2 Example: Palmer Penguins Let’s look at an example. This uses data from the palmerpenguins package that you just installed (make sure to load the package with library(palmerpenguins)). It is measurements of 344 penguins, collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network. The data package was created by Allison Horst. Before jumping in, let’s have a look at the data and the image we want to create. 
head(penguins) ## # A tibble: 6 × 8 ## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g ## <fct> <fct> <dbl> <dbl> <int> <int> ## 1 Adelie Torgersen 39.1 18.7 181 3750 ## 2 Adelie Torgersen 39.5 17.4 186 3800 ## 3 Adelie Torgersen 40.3 18 195 3250 ## 4 Adelie Torgersen NA NA NA NA ## 5 Adelie Torgersen 36.7 19.3 193 3450 ## 6 Adelie Torgersen 39.3 20.6 190 3650 ## # ℹ 2 more variables: sex <fct>, year <int> Plot of bill length vs. flipper length for the Palmer Penguins 5.2.1 Examining the Plot Referring to the graphics checklist, we see that this plot has two numerical features (bill length and flipper length), expressed using a scatter plot. There is also a categorical feature (species), which is indicated by the different colors and shapes of the plot. The plot expresses the fact that flipper length is positively associated with bill length for all three species of penguins, but the sizes and the relationships between them are unique to each species. There is a title and a legend, the axes are labeled with units, and all of the text is in plain language. There is a risk that the data may hide the message, so a smoothing line is added to each species for clarity. The colors are accessible (avoiding red/green colorblindness issues). This is a good dataviz, now let’s duplicate it! 
5.2.2 Duplicating the Palmer Penguins Plot Here is the code to make the plot: # matching the Allison Horst penguins plot ggplot(penguins) + aes(x = flipper_length_mm, y = bill_length_mm, color = species, shape = species) + geom_point() + geom_smooth(method = lm, se = FALSE) + xlab("Flipper length (mm)") + ylab("Bill length (mm)") + ggtitle( "Flipper and bill length", subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo penguins at Palmer Station LTER" ) + labs(color = "Penguin species", shape = "Penguin species") + scale_color_brewer(palette = "Dark2") 5.2.3 Analysis This is a complicated data visualization that includes features we haven’t covered yet, so let’s go into how it works. You might have noticed how the code to make the plot is separated into a bunch of function calls, all separated by plus signs (+). Each function does something small and specific. The functions and the ability to add them together provide a powerful and flexible “grammar” to describe the desired plot. Our plot begins the way that most do - by calling ggplot() with the data as the argument. This creates a plot object (but doesn’t draw it) and sets the data. Then we add a so-called “aesthetic mapping” with the aes() function. An aesthetic mapping describes how features of the data map to visual channels of the plot. In our case, that means mapping the flipper length to the x (horizontal) direction, the bill length to the y (vertical) direction, and mapping species to both color and shape. Next, we add a geometry to describe what kind of marks to use in drawing the plot. Here you can refer to the table at the top of the graphics checklist that suggests geometries to use for different kinds of features. We have numeric features for both x and y, so the table suggests line, scatter (points), and heatmap. We’ve selected points (geom_point()) because we want to show the individual penguins (lines would imply a chain of connections from one penguin to the next.) 
Those three parts (data, a geometry, and a map between the two) would be enough to get a basic plot that looks like this: # Make a basic penguin plot with just data, # a geometry, and a map between the two. ggplot(penguins) + aes(x = flipper_length_mm, y = bill_length_mm, color = species, shape = species) + geom_point() We know, though, that this plot is not complete. In particular, there is no title and the axes aren’t labeled meaningfully. Also, the clouds of points seem to hide the meaning that we are trying to convey and the colors aren’t colorblind-safe. The rest of the pieces of the plot call are meant to address those shortcomings. We add a second geometry layer with geom_smooth(method=lm, se=FALSE), which specifies the lm method in order to draw a straight (instead of wiggly) smoother through the data. The x-axis label, y-axis label and title are set by xlab(), ylab(), and ggtitle(), respectively. We want a more informative label for the legend title than just the variable name (“Penguin Species” instead of “species”), which is handled by the labs() function. And you’ll recall from the principles of data visualization that you can use Cynthia Brewer’s Color Brewer website to select colorblind-friendly color schemes. Color Brewer is integrated directly into ggplot2, so the scale_color_brewer() function can pull a named color scheme from Color Brewer directly into your plot as the color scale. We can begin to better understand the grammar of graphics as we consider this example. Recognize that our data visualization conveys information via several visual channels that express data as visual marks. The geometry determines how those marks are drawn on the page, which can be set separately from the mapping. 
Let’s see a couple examples of that: # placing plots via gridExtra library(gridExtra) # plot the Palmer penguin data with a line geometry peng_line = ggplot(penguins) + aes(x = flipper_length_mm, y = bill_length_mm, color = species, shape = species) + geom_line() # plot the Palmer penguin data with a hex heatmap geometry peng_hex = ggplot(penguins) + aes(x = flipper_length_mm, y = bill_length_mm, color = species, shape = species) + geom_hex() # place the plots side-by-side grid.arrange(peng_line, peng_hex) You can see how changing the geometry but not the mapping will plot the same data with a different method. Separating the mapping of features to channels from the drawing of marks is at the core of the grammar of graphics. This separation of functions gives ggplot2 its power by allowing us to compose a small number of functions to express data in unlimited ways (kind of like poetry). Recognizing the grammar of graphics allows us to reason in a consistent way about different kinds of plots, and make intelligent assumptions about mappings and geometries. 5.3 Layers Layers are the building blocks of the grammar of graphics. The typical pattern is that you express the idea of a plot in the grammar of graphics by adding layers with the addition symbol (+). There aren’t even that many layers to know! Here is the list, and the name of the function(s) you’ll use to control the layer. Some of the names include asterisks because there are a lot of similar options - for instance, geometry layers include geom_point(), geom_line(), geom_boxplot(), and many more. See the comprehensive listing on the official ggplot2 website. Data (ggplot()) - provides the data for the visualization. Aesthetic mapping (aes()) - a mapping that indicates which variables in the data control which channel in the plot (recall from the Principles of Data Visualization that a “channel” is used in an abstract way to include things like shape, color, and line width.) 
Geometry (geom_*()) - how the marks will be drawn in the figure. Statistical transform (stat_*()) - alters the data before drawing - for instance binning or removing duplicates. Scale (scale_*()) - used to control the way that values in the data are mapped to the channels. For instance, you can control how numbers or categories in the data map to colors. Coordinates (coord_*()) - used to control how the data are mapped to plot axes. Facets (facet_*()) - used to separate the data into subplots called “facets”. Theme (theme()) - modifies plot details like titles, labels, and legends. 5.4 Guidelines for Graphics I’ve attached the PDF checklist for creating good data visualizations, created by Nick Ulle of UC Davis Datalab. Download it and keep a copy around - it’s an excellent guide. I’m going to go over how the checklist translates into the grammar of graphics. 5.4.1 Data You can’t have a data visualization without data! ggplot2 expects that your data is tidy, which means that each row is a complete observation and each column is a unique feature. In fact, ggplot2 is part of an actively developing collection of packages called the tidyverse that provides ways to create and work with tidy data. You don’t have to adopt the entire tidyverse to use ggplot2, though. 5.4.2 Feature Types The first item on the list is a table of options for geometries that are commonly relevant for a given kind of data - for instance, a histogram is a geometry that can be used with a single numeric feature, and a box plot can be used with one numeric and one categorical feature. - Should it be a dot plot? Pie plots are hard to read and bar plots don’t use space efficiently (Cleveland and McGill 1990; Heer and Bostock 2010). Generally a dot plot is a better choice. 5.4.3 Theme Guidelines Does the graphic convey important information? Don’t include graphics that are uninformative or redundant. Title? Make sure the title explains what the graphic shows. Axis labels? 
Label the axes in plain language (no variable names!). Axis units? Label the axes with units (inches, dollars, etc). Legend? Any graphic that shows two or more categories coded by style or color must include a legend. 5.4.4 Scale Guidelines Appropriate scales and limits? Make sure the scales and limits of the axes do not lead people to incorrect conclusions. For side-by-side graphics or graphics that viewers will compare, use identical scales and limits. Print safe? Design graphics to be legible in black & white. Color is great, but use point and line styles to distinguish groups in addition to color. Also try to choose colors that are accessible to colorblind people. The RColorBrewer and viridis packages can help with choosing colors. 5.4.5 Facet Guidelines No more than 5 lines? Line plots with more than 5 lines risk becoming hard-to-read “spaghetti” plots. Generally a line plot with more than 5 lines should be split into multiple plots with fewer lines. If the x-axis is discrete, consider using a heat map instead. No overplotting? Scatter plots where many plot points overlap hide the actual patterns in the data. Consider splitting the data into facets, making the points smaller, or using a two-dimensional density plot (a smooth scatter plot) instead. 5.5 Case Studies We have covered enough of the grammar of graphics that you should begin to see the patterns in how it is used to express graphical ideas for ggplot2. Now we will work through some examples. 5.5.1 Counting Penguins First, let’s revisit the penguins data. There are three categorical features in the data: species, island, and sex. Let’s use geom_bar() to count how many penguins of each species and/or sex were observed on each island. The x-axis of the plot should be the island, but note that there are multiple values of species and sex that have the same position on that x-axis.
In this case, you can use the position_dodge() or position_stack() functions, passed to the geometry’s position argument, to tell ggplot2 how to handle the second grouping channel. # count the penguins on each island ggplot(penguins) + aes(x=island) + geom_bar() + xlab("Island") + ylab("Count") + ggtitle("Count of penguins on each island") # count the penguins of each sex on each island ggplot(penguins) + aes(x=island, fill=sex) + geom_bar(position=position_dodge()) + scale_fill_grey() + theme_bw() + xlab("Island") + ylab("Count") + ggtitle("Count of penguins on each island by sex") Alternatively, you can use facets to separate the data into multiple plots based on a data feature. Let’s see how that works to facet the plots by species. One way to show more information more clearly in a plot is to break the plot into pieces that each show part of the information. In ggplot2, this is called faceting the plot. There are two main facet functions, facet_grid() (which puts plots in a grid), and facet_wrap(), which puts plots side-by-side until it runs out of room, then wraps to a new line. We’ll use facet_wrap() here, with the first argument being ~species. This tells ggplot2 to break the plot into pieces by plotting the data for each species separately. # count the penguins of each species on each island ggplot(penguins) + aes(x=island) + geom_bar() + scale_fill_grey() + theme_bw() + xlab("Island") + ylab("Count") + ggtitle("Count of penguins on each island by species") + facet_wrap(~species, ncol=3) 5.5.2 Experimental Data with Error Bars Here’s an example that recently came up in my office hours. You’ve done an experiment to see how mice with two different genotypes respond to two different treatments. Now you want to plot the mean response of each group as a column, with error bars indicating the standard deviation of the mean. You also want to show the raw data. I’ve simulated some data for us to use - download it here.
This one is kind of complicated because you have to tell ggplot2 how to calculate the height of the columns and of the error bars. This involves computing summary statistics inside the layers, using stat='summary' with summary functions. mice_geno = read_csv("data/genotype-response.csv") # show the treatment response for different genotypes ggplot(mice_geno) + aes(x=trt, y=resp, fill=genotype) + scale_fill_brewer(palette="Dark2") + geom_bar(position=position_dodge(), stat='summary', fun='mean') + geom_errorbar(fun.min=function(x) {mean(x) - sd(x) / sqrt(length(x))}, fun.max=function(x) {mean(x) + sd(x) / sqrt(length(x))}, stat='summary', position=position_dodge(0.9), width=0.2) + geom_point(position= position_jitterdodge( dodge.width=0.9, jitter.width=0.1)) + xlab("Treatment") + ylab("Response (mm/g)") + ggtitle("Mean growth response of mice by genotype and treatment") 5.5.3 Bird Flu Mortality People mail dead birds to the USDA and USGS, where scientists analyze the birds to find out why they died. Right now there is a bird flu epidemic, and the USDA provides public data about the birds in which the disease has been detected. You can access the data here or see the official USDA webpage here. After you download the data, we will load the data and do some visualization.
# load the bird flu data downloaded from the USDA website flu <- read_csv("data/hpai-wild-birds-ver2.csv") flu$date <- mdy(flu$`Date Detected`) # plot a histogram of when bird flu was detected ggplot(flu) + aes(x = date) + geom_histogram() + ggtitle("Bird flu detections in wild birds") + xlab("Date") + ylab("Count") # histogram of detections, colored by sampling method ggplot(flu) + aes(x = date, fill = `Sampling Method`) + geom_histogram() + ggtitle("Bird flu detections in wild birds") + xlab("Date") + ylab("Count") # bar chart shows how the bird flu reports compare between west coast states subset(flu, State %in% c("California", "Oregon", "Washington")) |> ggplot() + aes(x = State, fill = `Sampling Method`) + geom_bar() + ggtitle("Bird flu detections in wild birds (West coast states)") + ylab("Count") Let’s compare the bird flu season to the human flu season. Download hospitalization data for the 2021-2022 and 2022-2023 flu seasons from the CDC website here or see the official Centers for Disease Control website here. After you download the data, we will see how adding a second data series works a little differently from the first. That’s because composing data, aesthetic mapping, and geometry with addition only works when there is no ambiguity about which data series is being mapped or drawn. After downloading the data, there is some work required to adjust the dates and change the hospitalization rate from cases per 100,000 to cases per 10 million, which better matches the scale of the bird flu data.
# processing CDC flu data: cdc <- read_csv("data/FluSurveillance_Custom_Download_Data.csv", skip = 2) cdc$date <- as_date("1950-01-01") year(cdc$date) <- cdc$`MMWR-YEAR` week(cdc$date) <- cdc$`MMWR-WEEK` # get flu hospitalization counts that include all race, sex, and age categories cdc_overall <- subset( cdc, `AGE CATEGORY` == "Overall" & `SEX CATEGORY` == "Overall" & `RACE CATEGORY` == "Overall" ) # convert the counts to cases per 10 million cdc_overall$`WEEKLY RATE` <- as.numeric(cdc_overall$`WEEKLY RATE`) * 100 # remake the plot but add a new geom_line() with its own data ggplot(flu) + aes(x = date, fill = `Sampling Method`) + geom_histogram() + geom_line(data = cdc_overall, mapping = aes(x = date, y = `WEEKLY RATE`), inherit.aes = FALSE) + ggtitle("Bird flu detections and human flu hospitalizations") + xlab("Date") + ylab("Count") + xlim(as_date("2022-01-01"), as_date("2023-05-01")) 5.5.4 Small Business Loans The US Small Business Administration (SBA) maintains data on the loans it offers to businesses. Data about loans made since 2020 can be found at the Small Business Administration website, or you can download it from here. We’ll load that data and then explore some ways to visualize it. Since the difference between a $100 loan and a $1000 loan is more like the difference between $100,000 and $1M than between $100,000 and $100,900, we should put the loan values on a logarithmic scale. You can do this in ggplot2 with the scale_y_log10() function (when the loan values are on the y axis). # load the small business loan data sba <- read_csv("data/small-business-loans.csv") # check the SBA data to see the data types, etc. head(sba) ## # A tibble: 6 × 39 ## AsOfDate Program BorrName BorrStreet BorrCity BorrState BorrZip BankName ## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 20230331 7A Mark Dusa 3623 Swal… Sylvania OH 43560 The Hun… ## 2 20230331 7A Shaddai Harris 614 Valle… Arlingt… TX 76018 PeopleF… ## 3 20230331 7A Aqualon Inc.
7180 Agen… Tipp Ci… OH 45371 The Hun… ## 4 20230331 7A Redline Resta… 2450 Cher… Saint C… FL 34772 SouthSt… ## 5 20230331 7A Meluota Corp 2702 ASTO… ASTORIA NY 11102 Santand… ## 6 20230331 7A Sky Lake Vaca… 15 Nestle… Laconia NH 03246 TD Bank… ## # ℹ 31 more variables: BankFDICNumber <dbl>, BankNCUANumber <dbl>, ## # BankStreet <chr>, BankCity <chr>, BankState <chr>, BankZip <chr>, ## # GrossApproval <dbl>, SBAGuaranteedApproval <dbl>, ApprovalDate <chr>, ## # ApprovalFiscalYear <dbl>, FirstDisbursementDate <chr>, ## # DeliveryMethod <chr>, subpgmdesc <chr>, InitialInterestRate <dbl>, ## # TermInMonths <dbl>, NaicsCode <dbl>, NaicsDescription <chr>, ## # FranchiseCode <chr>, FranchiseName <chr>, ProjectCounty <chr>, … # boxplot of loan sizes by business type subset(sba, ProjectState == "CA") |> ggplot() + aes(x = BusinessType, y = SBAGuaranteedApproval) + geom_boxplot() + scale_y_log10() + ggtitle("Small Business Administration guaranteed loans in California") + ylab("Loan guarantee (dollars)") # relationship between loan size and interest rate subset(sba, ProjectState == "CA") |> ggplot() + aes(x = GrossApproval, y = InitialInterestRate) + geom_point() + facet_wrap(~BusinessType, ncol = 3) + scale_x_log10() + ggtitle("Interest rate as a function of loan size") + xlab("Loan size (dollars)") + ylab("Interest rate (%)") Now let’s color the points by the loan status. Thankfully, ggplot2 integrates directly with Color Brewer (colorbrewer2.org) to get better color palettes. We will use the Accent color palette, which is just one of the many options that can be found on the Color Brewer site. There are a lot of data points, which tend to overlap and hide each other. We use a smoother (geom_smooth()) to help call out differences that would otherwise be lost in the noise of the points. # color the dots by the loan status.
subset(sba, ProjectState == "CA" & LoanStatus != "EXEMPT" & LoanStatus != "CHGOFF") |> ggplot() + aes(x = GrossApproval, y = InitialInterestRate, color = LoanStatus) + geom_point() + geom_smooth() + facet_wrap(~BusinessType, ncol = 3) + scale_x_log10() + ggtitle("Interest rate as a function of loan size by loan status") + xlab("Loan size (dollars)") + ylab("Interest rate (%)") + labs(color = "Loan status") + scale_color_brewer(type = "qual", palette = "Accent") 6 Language Fundamentals This chapter is part 1 (of 2) of Thinking in R, a workshop series about how R works and how to examine code critically. The major topics of this chapter are how R stores and locates variables (including functions) defined in your code and in packages, and how some of R’s object-oriented programming systems work. Learning Objectives Explain what an environment is and how R uses them Explain how R looks up variables Explain what attributes are and how R uses them Get and set attributes Explain what (S3) classes are and how R uses them Explain R’s (S3) method dispatch system Create an (S3) class Describe R’s other object-oriented programming systems at a high level 6.1 Variables & Environments Assigning and looking up values of variables are fundamental operations in R, as in most programming languages. They were likely among the first operations you learned, and now you use them instinctively. This section is a deep dive into what R actually does when you assign a variable and how R looks up the values of those variables later. Understanding the process and the data structures involved will introduce you to new programming strategies, make it easier to reason about code, and help you identify potential bugs. 6.1.1 What’s an Environment? The foundation of how R stores and looks up variables is a data structure called an environment.
Every environment has two parts: A frame, which is a collection of names and associated R objects. A parent or enclosing environment, which must be another environment. For now, you’ll learn how to create environments and how to assign and get values from their frames. Parent environments will be explained in a later section. You can use the new.env function to create a new environment: e = new.env() e ## <environment: 0x5625b8afdb70> Unlike most objects, printing an environment doesn’t print its contents. Instead, R prints its type (which is environment) and a unique identifier (0x5625b8afdb70 in this case). The unique identifier is actually the memory address of the environment. Every object you use in R is stored as a series of bytes in your computer’s random-access memory (RAM). Each byte in memory has a unique address, similar to how each house on a street has a unique address. Memory addresses are usually just numbers counting up from 0, but they’re often written in hexadecimal (base 16) (indicated by the prefix 0x) because it’s more concise. For the purposes of this reader, you can just think of the memory address as a unique identifier. To see the names in an environment’s frame, you can call the ls or names function on the environment: ls(e) ## character(0) names(e) ## character(0) You just created the environment e, so its frame is currently empty. The printout character(0) means R returned a character vector of length 0. You can assign an R object to a name in an environment’s frame with the dollar sign $ operator or the double square bracket [[ operator, similar to how you would assign a named element of a list. For example, one way to assign the number 8 to the name "lucky" in the environment e’s frame is: e$lucky = 8 Now there’s a name defined in the environment: ls(e) ## [1] "lucky" names(e) ## [1] "lucky" Here’s another example of assigning an object to a name in the environment: e[["my_message"]] = "May your coffee kick in before reality does."
You can assign any type of R object to a name in an environment, including other environments. The ls function ignores names that begin with a dot . by default. For example: e$.x = list(1, sin) ls(e) ## [1] "lucky" "my_message" You can pass the argument all.names = TRUE to make the function return all names in the frame: ls(e, all.names = TRUE) ## [1] ".x" "lucky" "my_message" Alternatively, you can just use the names function, which always prints all names in an environment’s frame. Objects in an environment’s frame don’t have positions or any particular order, so they must always be assigned to a name. R raises an error if you try to assign an object to a position: e[[3]] = 10 ## Error in e[[3]] = 10: wrong args for environment subassignment As you might expect, you can also use the dollar sign operator and double square bracket operator to get objects in an environment by name: e$my_message ## [1] "May your coffee kick in before reality does." e[["lucky"]] ## [1] 8 You can use the exists function to check whether a specific name exists in an environment’s frame: exists("hi", e) ## [1] FALSE exists("lucky", e) ## [1] TRUE Finally, you can remove a name and object from an environment’s frame with the rm function. Make sure to pass the environment as the argument to the envir parameter when you do this: rm("lucky", envir = e) exists("lucky", e) ## [1] FALSE 6.1.2 Reference Objects Environments are reference objects, which means they don’t follow R’s copy-on-write rule: for most types of objects, if you modify the object, R automatically and silently makes a copy, so that any other variables that refer to the object remain unchanged. As an example, lists follow the copy-on-write rule. Suppose you assign a list to variable x, assign x to y, and then make a change to x: x = list() x$a = 10 x ## $a ## [1] 10 y = x x$a = 20 y ## $a ## [1] 10 When you run y = x, R makes y refer to the same object as x, without using any additional memory. 
When you run x$a = 20, the copy-on-write rule applies, so R creates and modifies a copy of the object. From then on, x refers to the modified copy and y refers to the original. Environments don’t follow the copy-on-write rule, so repeating the example with an environment produces a different result: e_x = new.env() e_x$a = 10 e_x$a ## [1] 10 e_y = e_x e_x$a = 20 e_y$a ## [1] 20 As before, e_y = e_x makes both e_y and e_x refer to the same object. The difference is that when you run e_x$a = 20, the copy-on-write rule does not apply and R does not create a copy of the environment. As a result, the change to e_x is also reflected in e_y. Environments and other reference objects can be confusing since they behave differently from most objects. You usually won’t need to construct or manipulate environments directly, but it’s useful to know how to inspect them. 6.1.3 The Local Environment Think of environments as containers for variables. Whenever you assign a variable, R assigns it to the frame of an environment. Whenever you get a variable, R searches through one or more environments for its value. When you start R, R creates a special environment called the global environment to store variables you assign at the prompt or the top level of a script. You can use the globalenv function to get the global environment: g = globalenv() g ## <environment: R_GlobalEnv> The global environment is easy to recognize because its unique identifier is R_GlobalEnv rather than its memory address (even though it’s stored in your computer’s memory like any other object). The local environment is the environment where the assignment operators <- and = assign variables. Think of the local environment as the environment that’s currently active. The local environment varies depending on the context where you run an expression.
You can get the local environment with the environment function: loc = environment() loc ## <environment: R_GlobalEnv> As you can see, at the R prompt or the top level of an R script, the local environment is just the global environment. Except for names, the functions introduced in Section 6.1.1 default to the local environment if you don’t set the envir parameter. This makes them convenient for inspecting or modifying the local environment’s frame: ls(loc) ## [1] "e" "e_x" "e_y" "g" "loc" ## [6] "source_rmd" "x" "y" ls() ## [1] "e" "e_x" "e_y" "g" "loc" ## [6] "source_rmd" "x" "y" If you assign a variable, it appears in the local environment’s frame: coffee = "Right. No coffee. This is a terrible planet." ls() ## [1] "coffee" "e" "e_x" "e_y" "g" ## [6] "loc" "source_rmd" "x" "y" loc$coffee ## [1] "Right. No coffee. This is a terrible planet." Conversely, if you assign an object in the local environment’s frame, you can access it as a variable: loc$tea = "Tea isn't coffee!" tea ## [1] "Tea isn't coffee!" 6.1.4 Call Environments Every time you call (not define) a function, R creates a new environment. R uses this call environment as the local environment while the code in the body of the function runs. As a result, assigning variables in a function doesn’t affect the global environment, and they generally can’t be accessed from outside of the function. 
For example, consider this function which assigns the variable hello: my_hello = function() { hello = "from the other side" } Even after calling the function, there’s no variable hello in the global environment: my_hello() names(g) ## [1] "loc" "my_hello" "tea" "e_x" "x" ## [6] "e_y" "y" "coffee" "source_rmd" "e" ## [11] "g" ".First" As further demonstration, consider this modified version of my_hello, which returns the call environment: my_hello = function() { hello = "from the other side" environment() } The call environment is not the global environment: e = my_hello() e ## <environment: 0x5625baff25e8> And the variable hello exists in the call environment, but not in the global environment: exists("hello", g) ## [1] FALSE exists("hello", e) ## [1] TRUE e$hello ## [1] "from the other side" Each call to a function creates a new call environment. So if you call my_hello again, it returns a different environment (pay attention to the memory address): e2 = my_hello() e ## <environment: 0x5625baff25e8> e2 ## <environment: 0x5625bb5756f8> By creating a new environment for every call, R isolates code in the function body from code outside of the body. As a result, most R functions have no side effects. This is a good thing, since it means you generally don’t have to worry about calls assigning, reassigning, or removing variables in other environments (such as the global environment!). The local function provides another way to create a new local environment in which to run code. However, it’s usually preferable to define and call a function, since that makes it easier to test and reuse the code. 6.1.5 Lexical Scoping A function can access variables outside of its local environment, but only if those variables exist in the environment where the function was defined (not called). This property is called lexical scoping. For example, assign a variable tea and function get_tea in the global environment: tea = "Tea isn't coffee!" 
get_tea = function() { tea } Then the get_tea function can access the tea variable: get_tea() ## [1] "Tea isn't coffee!" Note that variable lookup takes place when a function is called, not when it’s defined. This is called dynamic lookup. For example, the result from get_tea changes if you change the value of tea: tea = "Tea for two." get_tea() ## [1] "Tea for two." tea = "Tea isn't coffee!" get_tea() ## [1] "Tea isn't coffee!" When a local variable (a variable in the local environment) and a non-local variable have the same name, R almost always prioritizes the local variable. For instance: get_local_tea = function() { tea = "Earl grey is tea!" tea } get_local_tea() ## [1] "Earl grey is tea!" The function body assigns the local variable tea to "Earl grey is tea!", so R returns that value rather than "Tea isn't coffee!". In other words, local variables mask, or hide, non-local variables with the same name. There’s only one case where R doesn’t prioritize local variables. To see it, consider this call: mean(1:20) ## [1] 10.5 The variable mean must refer to a function, because it’s being called—it’s followed by parentheses ( ), the call syntax. In this situation, R ignores local variables that aren’t functions, so you can write code such as: mean = 10 mean(1:10) ## [1] 5.5 That said, defining a local variable with the same name as a function can still be confusing, so it’s usually considered a bad practice. To help you reason about lexical scoping, you can get the environment where a function was defined by calling the environment function on the function itself. For example, the get_tea function was defined in the global environment: environment(get_tea) ## <environment: R_GlobalEnv> 6.1.6 Variable Lookup The key to how R looks up variables and how lexical scoping works is that in addition to a frame, every environment has a parent environment. When R evaluates a variable in an expression, it starts by looking for the variable in the local environment’s frame.
For example, at the prompt, tea is a local variable because that’s where you assigned it. If you enter tea at the prompt, R finds tea in the local environment’s frame and returns the value: tea ## [1] "Tea isn't coffee!" On the other hand, in the get_tea function from Section 6.1.5, tea is not a local variable: get_tea = function() { tea } To make this more concrete, consider a function which just returns its call environment: get_call_env = function() { environment() } The call environment clearly doesn’t contain the tea variable: e = get_call_env() ls(e) ## character(0) When a variable doesn’t exist in the local environment’s frame, then R gets the parent environment of the local environment. You can use the parent.env function to get the parent environment of an environment. For the call environment e, the parent environment is the global environment, because that’s where get_call_env was defined: parent.env(e) ## <environment: R_GlobalEnv> When R can’t find tea in the call environment’s frame, R gets the parent environment, which is the global environment. Then R searches for tea in the global environment, finds it, and returns the value. R repeats the lookup process for as many parents as necessary to find the variable, stopping only when it finds the variable or reaches a special environment called the empty environment, which will be explained in Section 6.1.7. The lookup process also hints at how R finds variables and functions such as pi and sqrt that clearly aren’t defined in the global environment. They’re defined in parent environments of the global environment. The get function looks up a variable by name: get("pi") ## [1] 3.141593 You can use the get function to look up a variable starting from a specific environment or to control how R looks up the variable.
For example, if you set inherits = FALSE, R will not search any parent environments: get("pi", inherits = FALSE) ## Error in get("pi", inherits = FALSE): object 'pi' not found As with most functions for inspecting and modifying environments, use the get function sparingly. R already provides a much simpler way to get a variable: the variable’s name. 6.1.7 The Search Path R also uses environments to manage packages. Each time you load a package with library or require, R creates a new environment: The frame contains the package’s local variables. The parent environment is the environment of the previous package loaded. This new environment becomes the parent of the global environment. R always loads several built-in packages at startup, which contain variables and functions such as pi and sqrt. Thus the global environment is never the top-level environment. For instance: g = globalenv() e = parent.env(g) e ## <environment: package:stats> ## attr(,"name") ## [1] "package:stats" ## attr(,"path") ## [1] "/usr/lib/R/library/stats" e = parent.env(e) e ## <environment: package:graphics> ## attr(,"name") ## [1] "package:graphics" ## attr(,"path") ## [1] "/usr/lib/R/library/graphics" Notice that package environments use package: and the name of the package as their unique identifier rather than their memory address. The chain of package environments is called the search path. The search function returns the search path: search() ## [1] ".GlobalEnv" "package:stats" "package:graphics" ## [4] "package:grDevices" "package:utils" "package:datasets" ## [7] "package:methods" "Autoloads" "package:base" The base environment (identified by base) is always the topmost environment on the search path. You can use the baseenv function to get the base environment: baseenv() ## <environment: base> The base environment’s parent is the special empty environment (identified by R_EmptyEnv), which contains no variables and has no parent.
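To see the chain concretely, here is a small sketch (not part of the reader’s original examples) that walks upward from the global environment, printing each environment’s name until it reaches the empty environment:

```r
# Walk the chain of parent environments, starting from the global
# environment. environmentName returns the identifier R prints for
# special environments ("R_GlobalEnv", "package:stats", "base", ...).
e = globalenv()
while (environmentName(e) != "R_EmptyEnv") {
  print(environmentName(e))
  e = parent.env(e)
}
```

The names printed should roughly match the output of search(), ending with base, since the loop stops at the empty environment itself.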
You can use the emptyenv function to get the empty environment: emptyenv() ## <environment: R_EmptyEnv> Understanding R’s process for looking up variables and the search path is helpful for resolving conflicts between the names of variables in packages. 6.1.7.1 The Colon Operators The double-colon operator :: gets a variable in a specific package. Two common uses: Disambiguate which package you mean when several packages have variables with the same names. Get a variable from a package without loading the package. For example: library(dplyr) ## ## Attaching package: 'dplyr' ## The following objects are masked from 'package:stats': ## ## filter, lag ## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union stats::filter ## function (x, filter, method = c("convolution", "recursive"), ## sides = 2L, circular = FALSE, init = NULL) ## { ## method <- match.arg(method) ## x <- as.ts(x) ## storage.mode(x) <- "double" ## xtsp <- tsp(x) ## n <- as.integer(NROW(x)) ## if (is.na(n)) ## stop(gettextf("invalid value of %s", "NROW(x)"), domain = NA) ## nser <- NCOL(x) ## filter <- as.double(filter) ## nfilt <- as.integer(length(filter)) ## if (is.na(nfilt)) ## stop(gettextf("invalid value of %s", "length(filter)"), ## domain = NA) ## if (anyNA(filter)) ## stop("missing values in 'filter'") ## if (method == "convolution") { ## if (nfilt > n) ## stop("'filter' is longer than time series") ## sides <- as.integer(sides) ## if (is.na(sides) || (sides != 1L && sides != 2L)) ## stop("argument 'sides' must be 1 or 2") ## circular <- as.logical(circular) ## if (is.na(circular)) ## stop("'circular' must be logical and not NA") ## if (is.matrix(x)) { ## y <- matrix(NA, n, nser) ## for (i in seq_len(nser)) y[, i] <- .Call(C_cfilter, ## x[, i], filter, sides, circular) ## } ## else y <- .Call(C_cfilter, x, filter, sides, circular) ## } ## else { ## if (missing(init)) { ## init <- matrix(0, nfilt, nser) ## } ## else { ## ni <- NROW(init) ## if (ni != 
nfilt) ## stop("length of 'init' must equal length of 'filter'") ## if (NCOL(init) != 1L && NCOL(init) != nser) { ## stop(sprintf(ngettext(nser, "'init' must have %d column", ## "'init' must have 1 or %d columns", domain = "R-stats"), ## nser), domain = NA) ## } ## if (!is.matrix(init)) ## dim(init) <- c(nfilt, nser) ## } ## ind <- seq_len(nfilt) ## if (is.matrix(x)) { ## y <- matrix(NA, n, nser) ## for (i in seq_len(nser)) y[, i] <- .Call(C_rfilter, ## x[, i], filter, c(rev(init[, i]), double(n)))[-ind] ## } ## else y <- .Call(C_rfilter, x, filter, c(rev(init[, 1L]), ## double(n)))[-ind] ## } ## tsp(y) <- xtsp ## class(y) <- if (nser > 1L) ## c("mts", "ts") ## else "ts" ## y ## } ## <bytecode: 0x5625bafa6d10> ## <environment: namespace:stats> dplyr::filter ## function (.data, ..., .by = NULL, .preserve = FALSE) ## { ## check_by_typo(...) ## by <- enquo(.by) ## if (!quo_is_null(by) && !is_false(.preserve)) { ## abort("Can't supply both `.by` and `.preserve`.") ## } ## UseMethod("filter") ## } ## <bytecode: 0x5625b70b6678> ## <environment: namespace:dplyr> ggplot2::ggplot ## function (data = NULL, mapping = aes(), ..., environment = parent.frame()) ## { ## UseMethod("ggplot") ## } ## <bytecode: 0x5625ba56f2e8> ## <environment: namespace:ggplot2> The related triple-colon operator ::: gets a private variable in a package. Generally these are private for a reason! Only use ::: if you’re sure you know what you’re doing. 6.2 Closures A closure is a function together with an enclosing environment. In order to support lexical scoping, every R function is a closure (except a few very special built-in functions). The enclosing environment is generally the environment where the function was defined. 
Recall that you can use the environment function to get the enclosing environment of a function: f = function() 42 environment(f) ## <environment: R_GlobalEnv> Since the enclosing environment exists whether or not you call the function, you can use the enclosing environment to store and share data between calls. You can use the superassignment operator <<- to assign to a variable in an ancestor environment (if the variable already exists) or the global environment (if the variable does not already exist). For example, suppose you want to make a function that returns the number of times it’s been called: counter = 0 count = function() { counter <<- counter + 1 counter } In this example, the enclosing environment is the global environment. Each time you call count, it assigns a new value to the counter variable in the global environment. 6.2.1 Tidy Closures The count function has a side effect—it reassigns a non-local variable. As discussed in 6.1.4, functions with side effects make code harder to understand and reason about. Use side effects sparingly and try to isolate them from the global environment. When side effects aren’t isolated, several things can go wrong. The function might overwrite the user’s variables: counter = 0 count() ## [1] 1 Or the user might overwrite the function’s variables: counter = "hi" count() ## Error in counter + 1: non-numeric argument to binary operator For functions that rely on storing information in their enclosing environment, there are several different ways to make sure the enclosing environment is isolated. Two of these are: Define and return the function from the body of another function. The second function is called a factory function because it produces (returns) the first. The enclosing environment of the first function is the call environment of the second. Define the function inside of a call to local.
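The second approach is quick to sketch: local runs its code in a fresh environment and returns the value of the last expression, so a counter defined there is hidden from the global environment.

```r
# local() evaluates the code in a new environment and returns the
# last expression -- here, the counter function. The counter variable
# lives in that environment, so <<- updates it rather than creating
# or modifying a global variable.
count = local({
  counter = 0
  function() {
    counter <<- counter + 1
    counter
  }
})
```

Calling count() then increments the hidden counter (the first call returns 1, the second 2), and a global variable named counter is never touched.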
Here’s a template for the first approach: make_fn = function() { # Define variables in the enclosing environment here: # Define and return the function here: function() { # ... } } f = make_fn() # Now you can call f() as you would any other function. For example, you can use the template for the counter function: make_count = function() { counter = 0 function() { counter <<- counter + 1 counter } } count = make_count() Then calling count has no effect on the global environment: counter = 10 count() ## [1] 1 counter ## [1] 10 6.3 Attributes An attribute is named metadata attached to an R object. Attributes provide basic information about objects and play an important role in R’s class system, so most objects have attributes. Some common attributes are: class – the class row.names – row names names – element names or column names dim – dimensions (on matrices) dimnames – names of dimensions (on matrices) R provides helper functions to get and set the values of the common attributes. These functions usually have the same name as the attribute. For example, the class function gets or sets the class attribute: class(mtcars) ## [1] "data.frame" row.names(mtcars) ## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" ## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant" ## [7] "Duster 360" "Merc 240D" "Merc 230" ## [10] "Merc 280" "Merc 280C" "Merc 450SE" ## [13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood" ## [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128" ## [19] "Honda Civic" "Toyota Corolla" "Toyota Corona" ## [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28" ## [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2" ## [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino" ## [31] "Maserati Bora" "Volvo 142E" An attribute can have any name and any value. 
You can use the attr function to get or set an attribute by name: attr(mtcars, "row.names") ## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" ## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant" ## [7] "Duster 360" "Merc 240D" "Merc 230" ## [10] "Merc 280" "Merc 280C" "Merc 450SE" ## [13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood" ## [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128" ## [19] "Honda Civic" "Toyota Corolla" "Toyota Corona" ## [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28" ## [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2" ## [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino" ## [31] "Maserati Bora" "Volvo 142E" attr(mtcars, "foo") = 42 attr(mtcars, "foo") ## [1] 42 You can get all of the attributes attached to an object with the attributes function: attributes(mtcars) ## $names ## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" ## [11] "carb" ## ## $row.names ## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" ## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant" ## [7] "Duster 360" "Merc 240D" "Merc 230" ## [10] "Merc 280" "Merc 280C" "Merc 450SE" ## [13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood" ## [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128" ## [19] "Honda Civic" "Toyota Corolla" "Toyota Corona" ## [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28" ## [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2" ## [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino" ## [31] "Maserati Bora" "Volvo 142E" ## ## $class ## [1] "data.frame" ## ## $foo ## [1] 42 You can use the structure function to set multiple attributes on an object: mod_mtcars = structure(mtcars, foo = 50, bar = 100) attributes(mod_mtcars) ## $names ## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" ## [11] "carb" ## ## $row.names ## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" ## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant" ## [7] "Duster 360" "Merc 240D" "Merc 230" ## [10] "Merc 280" "Merc 280C" "Merc 450SE" ## [13] 
"Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood" ## [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128" ## [19] "Honda Civic" "Toyota Corolla" "Toyota Corona" ## [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28" ## [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2" ## [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino" ## [31] "Maserati Bora" "Volvo 142E" ## ## $class ## [1] "data.frame" ## ## $foo ## [1] 50 ## ## $bar ## [1] 100 Vectors usually don’t have attributes: attributes(5) ## NULL But the class function still returns a class: class(5) ## [1] "numeric" When a helper function exists to get or set an attribute, use the helper function rather than attr. This will make your code clearer and ensure that attributes with special behavior and requirements, such as dim, are set correctly. 6.4 S3 R provides several systems for object-oriented programming (OOP), a programming paradigm where code is organized into a collection of “objects” that interact with each other. These systems provide a way to create new data structures with customized behavior, and also underpin how some of R’s built-in functions work. The S3 system is particularly important for understanding R, because it’s the oldest and most widely-used. This section focuses on S3, while Section 6.5 provides an overview of R’s other OOP systems. The central idea of S3 is that some functions can be generic, meaning they perform different computations (and run different code) for different classes of objects. Conversely, every object has at least one class, which dictates how the object behaves. For most objects, the class is independent of type and is stored in the class attribute. You can get the class of an object with the class function. For example, the class of a data frame is data.frame: class(mtcars) ## [1] "data.frame" Some objects have more than one class. 
One example of this is matrices: m = matrix() class(m) ## [1] "matrix" "array" When an object has multiple classes, they’re stored in the class attribute in order from highest to lowest priority. So the matrix m will primarily behave like a matrix, but it can also behave like an array. The priority of classes is often described in terms of a child-parent relationship: array is the parent class of matrix, or equivalently, the class matrix inherits from the class array. 6.4.1 Method Dispatch A function is generic if it selects and calls another function, called a method, based on the class of one of its arguments. A generic function can have any number of methods, and each must have the same signature, or collection of parameters, as the generic. Think of a generic function’s methods as the range of different computations it can perform, or alternatively as the range of different classes it can accept as input. Method dispatch, or just dispatch, is the process of selecting a method based on the class of an argument. You can identify S3 generics because they always call the UseMethod function, which initiates S3 method dispatch. Many of R’s built-in functions are generic. One example is the split function, which splits a data frame or vector into groups: split ## function (x, f, drop = FALSE, ...) ## UseMethod("split") ## <bytecode: 0x5625b84f0020> ## <environment: namespace:base> Another is the plot function, which creates a plot: plot ## function (x, y, ...) ## UseMethod("plot") ## <bytecode: 0x5625ba334fa0> ## <environment: namespace:base> The UseMethod function requires the name of the generic (as a string) as its first argument. The second argument is optional and specifies the object to use for method dispatch. By default, the first argument to the generic is used for method dispatch. So for split, the argument for x is used for method dispatch. R checks the class of the argument and selects a matching method. 
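As a small sketch of how this works end to end, here is a made-up generic named describe (the name and its methods are hypothetical, for illustration only) with a data.frame method and a default method:

```r
# A generic calls UseMethod, which dispatches on the class of the
# first argument.
describe = function(x) UseMethod("describe")

# Method selected for objects with class data.frame.
describe.data.frame = function(x) {
  paste("a data frame with", nrow(x), "rows")
}

# Fallback selected when no other method matches.
describe.default = function(x) {
  paste("an object of class", class(x)[1])
}

describe(mtcars)  # dispatches to describe.data.frame
## [1] "a data frame with 32 rows"
describe(1:3)     # no integer method, so describe.default is used
## [1] "an object of class integer"
```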
You can use the methods function to list all of the methods of a generic. The methods for split are: methods(split) ## [1] split.data.frame split.Date split.default split.POSIXct ## see '?methods' for accessing help and source code Method names always have the form GENERIC.CLASS, where GENERIC is the name of the generic and CLASS is the name of a class. For instance, split.data.frame is the split method for objects with class data.frame. Methods named GENERIC.default are a special case: they are default methods, selected only if none of the other methods match the class during dispatch. So split.default is the default method for split. Most generic functions have a default method. Methods are ordinary R functions. For instance, the code for split.data.frame is: split.data.frame ## function (x, f, drop = FALSE, ...) ## { ## if (inherits(f, "formula")) ## f <- .formula2varlist(f, x) ## lapply(split(x = seq_len(nrow(x)), f = f, drop = drop, ...), ## function(ind) x[ind, , drop = FALSE]) ## } ## <bytecode: 0x5625b87abe08> ## <environment: namespace:base> Sometimes methods are defined privately in packages and can’t be accessed by typing their name at the prompt. You can use the getAnywhere function to get the code for these methods. For instance, to get the code for plot.data.frame: getAnywhere(plot.data.frame) ## A single object matching 'plot.data.frame' was found ## It was found in the following places ## registered S3 method for plot from namespace graphics ## namespace:graphics ## with value ## ## function (x, ...) ## { ## plot2 <- function(x, xlab = names(x)[1L], ylab = names(x)[2L], ## ...) plot(x[[1L]], x[[2L]], xlab = xlab, ylab = ylab, ## ...) ## if (!is.data.frame(x)) ## stop("'plot.data.frame' applied to non data frame") ## if (ncol(x) == 1) { ## x1 <- x[[1L]] ## if (class(x1)[1L] %in% c("integer", "numeric")) ## stripchart(x1, ...) ## else plot(x1, ...) ## } ## else if (ncol(x) == 2) { ## plot2(x, ...) ## } ## else { ## pairs(data.matrix(x), ...) 
## } ## } ## <bytecode: 0x5625ba423db8> ## <environment: namespace:graphics> As a demonstration of method dispatch, consider this code to split the mtcars dataset by number of cylinders: split(mtcars, mtcars$cyl) The split function is generic and dispatches on its first argument. In this case, the first argument is mtcars, which has class data.frame. Since the method split.data.frame exists, R calls split.data.frame with the same arguments you used to call the generic split function. In other words, R calls: split.data.frame(mtcars, mtcars$cyl) When an object has more than one class, method dispatch considers them from left to right. For instance, matrices created with the matrix function have class matrix and also class array. If you pass a matrix to a generic function, R will first look for a matrix method. If there isn’t one, R will look for an array method. If there still isn’t one, R will look for a default method. If there’s no default method either, then R raises an error. The sloop package provides useful functions for inspecting S3 classes, generics, and methods, as well as the method dispatch process. For example, you can use the s3_dispatch function to see which method will be selected when you call a generic: # install.packages("sloop") library("sloop") s3_dispatch(split(mtcars, mtcars$cyl)) ## => split.data.frame ## * split.default The selected method is indicated with an arrow =>, while methods that were not selected are indicated with a star *. See ?s3_dispatch for complete details about the output from the function. 6.4.2 Creating Objects S3 classes are defined implicitly by their associated methods. To create a new class, decide what its structure will be and define some methods. To create an object of the class, set an object’s class attribute to the class name. For example, let’s create a generic function get_age that returns the age of an animal in terms of a typical human lifespan. 
First define the generic: get_age = function(animal) { UseMethod("get_age") } Next, let’s create a class Human to represent a human. Since humans are animals, let’s make each Human also have class Animal. You can use any type of object as the foundation for a class, but lists are often a good choice because they can store multiple named elements. Here’s how to create a Human object with a field age_years to store the age in years: lyra = list(age_years = 13) class(lyra) = c("Human", "Animal") Class names can include any characters that are valid in R variable names. One common convention is to make them start with an uppercase letter, to distinguish them from variables. If you want to make constructing an object of a given class less ad-hoc (and error-prone), define a constructor function that returns a new object of a given class. A common convention is to give the constructor function the same name as the class: Human = function(age_years) { obj = list(age_years = age_years) class(obj) = c("Human", "Animal") obj } asriel = Human(45) The get_age generic doesn’t have any methods yet, so R raises an error if you call it (regardless of the argument’s class): get_age(lyra) ## Error in UseMethod("get_age"): no applicable method for 'get_age' applied to an object of class "c('Human', 'Animal')" Let’s define a method for Animal objects. The method will just return the value of the age_years field: get_age.Animal = function(animal) { animal$age_years } get_age(lyra) ## [1] 13 get_age(asriel) ## [1] 45 Notice that the get_age generic still raises an error for objects that don’t have class Animal: get_age(3) ## Error in UseMethod("get_age"): no applicable method for 'get_age' applied to an object of class "c('double', 'numeric')" Now let’s create a class Dog to represent dogs. Like the Human class, a Dog is a kind of Animal and has an age_years field. 
Each Dog will also have a breed field to store the breed of the dog: Dog = function(age_years, breed) { obj = list(age_years = age_years, breed = breed) class(obj) = c("Dog", "Animal") obj } pongo = Dog(10, "dalmatian") Since a Dog is an Animal, the get_age generic returns a result: get_age(pongo) ## [1] 10 Recall that the goal of this example was to make get_age return the age of an animal in terms of a human lifespan. For a dog, their age in “human years” is about 5 times their age in actual years. You can implement a get_age method for Dog to take this into account: get_age.Dog = function(animal) { animal$age_years * 5 } Now the get_age generic returns an age in terms of a human lifespan whether its argument is a Human or a Dog: get_age(lyra) ## [1] 13 get_age(pongo) ## [1] 50 You can create new data structures in R by creating classes, and you can add functionality to new or existing generics by creating new methods. Before creating a class, think about whether R already provides a data structure that suits your needs. It’s uncommon to create new classes in the course of a typical data analysis, but many packages do provide new classes. Regardless of whether you ever create a new class, understanding the details means understanding how S3 works, and thus how R’s many S3 generic functions work. As a final note, while exploring S3 methods you may also encounter the NextMethod function. The NextMethod function redirects dispatch to the method that is the next closest match for an object’s class. You can learn more by reading ?NextMethod. 6.5 Other Object Systems R provides many systems for object-oriented programming besides S3. Some are built into the language, while others are provided by packages. A few of the most popular systems are: S4 – S4 is built into R and is the most widely-used system after S3. Like S3, S4 frames OOP in terms of generic functions and methods. 
The major differences are that S4 is stricter—the structure of each class must be formally defined—and that S4 generics can dispatch on the classes of multiple arguments instead of just one. R provides a special field operator @ to access fields of an S4 object. Most of the packages in the Bioconductor project use S4. Reference classes – Objects created with the S3 and S4 systems generally follow the copy-on-write rule, but this can be inefficient for some programming tasks. The reference class system is built into R and provides a way to create reference objects with a formal class structure (in the spirit of S4). This system is more like OOP systems in languages like Java and Python than S3 or S4 are. The reference class system is sometimes jokingly called “R5”, but that isn’t an official name. R6 – An alternative to reference classes created by Winston Chang, a developer at Posit (formerly RStudio). It claims to be simpler and faster than reference classes. S7 – A new OOP system being developed collaboratively by representatives from several different important groups in the R community, including the R core developers, Bioconductor, and Posit. Many of these systems are described in more detail in Hadley Wickham’s book Advanced R. "],["part-2.html", "7 Part 2", " 7 Part 2 This chapter will eventually contain part 2 (of 2) of Thinking in R, a workshop series about how R works and how to examine code critically. "],["references.html", "References", " References This reader would not have been possible without the many excellent reference texts created by other members of the R community. Now that you’ve completed this reader, these texts are a great way to continue your R learning journey. Advanced R by Hadley Wickham is a must-read if you want a deep understanding of R. It provides many examples of R features that are important for package/software development. Other texts I’ve found useful include: What They Forgot to Teach You About R by Bryan & Hester. 
The Art of R Programming by Matloff (of UC Davis). A general reference on R programming, with more of a computer science and software engineering perspective than most R texts. The R Inferno by Burns. A discussion of the most difficult and confusing parts of R. R Packages by Wickham. A gentle, modern introduction to creating packages for R. Writing R Extensions by the R core developers. A description of how to create packages and other extensions for R. R Language Definition by the R core developers. Documentation about how R works at a low level. R Internals by the R core developers. Documentation about how R works internally (that is, its C code). Finally, here are a few other readers and notes created by DataLab staff: My personal teaching notes from several years of teaching statistical computing. R Basics, our workshop series aimed at people just starting to learn R. Adventures in Data Science, our course introducing humanities undergraduates to data science techniques. Python Basics, our workshop series aimed at people just starting to learn Python. Intermediate Python, this reader’s counterpart for Python users. "],["assessment.html", "Assessment", " Assessment If you are taking this workshop to complete a GradPathways research computing or other DataLab sponsored pathway, you need to complete an assessment and submit to GradPathways to claim your badge. For the “Intermediate R” 4-part workshop series, you can download the assessment instructions here. For the “Intermediate R: Data Visualizations with Ggplot” workshop, register for the Data Visualization pathway and complete the following. Submit all materials to GradPathways via the Badgr portal: Generate a data visualization from your research data using ggplot in R. Export it as a .jpg titled “figure_v1”. Upload the figure and code used to generate it. Next, iterate on the figure using data visualization best practices such that it is a publication-worthy plot. 
Export this revised plot as “figure_v2”, and upload it along with its respective code and a figure caption. Write a narrative explaining the data visualization. This should include a short data biography (what are the data, who collected it and why, and who it affects), the purpose of the plot (audience, goals), and what data story it tells. Discuss what changes you made, and why, from v1 to v2. Also list a few additional changes you would ideally like to make to v2 in the future. If you are happy with v2, instead of listing future changes discuss what other plot types might be appropriate or additional data visualizations you could make to help support your data story. If you have questions regarding how to submit your assessment materials, contact GradPathways. "],["404.html", "Page not found", " Page not found The page you requested cannot be found (perhaps it was moved or renamed). You may want to try searching to find the page's new location, or use the table of contents to find the page you are looking for. "]]
diff --git a/docs/squashing-bugs-with-rs-debugging-tools.html b/docs/squashing-bugs-with-rs-debugging-tools.html
index 6670a84..8478c35 100644
--- a/docs/squashing-bugs-with-rs-debugging-tools.html
+++ b/docs/squashing-bugs-with-rs-debugging-tools.html
@@ -202,17 +202,33 @@
II Writing & Debugging R Code
-3 Best Practices for Writing R Scripts
-4 Squashing Bugs with R’s Debugging Tools
+ 3 Best Practices for Writing R Scripts
+4 Squashing Bugs with R’s Debugging Tools
+
+- 4.1 Printing
- 4.2 The Conditions System
- 4.2.1 Raising Conditions
@@ -285,19 +301,7 @@
- 6.5 Other Object Systems
-7 Part 2
-
+7 Part 2
V Backmatter
References
Assessment
@@ -323,10 +327,9 @@ 4 Squashing Bugs with R’s Debug
The major topics of this chapter are how to print output, how R’s conditions
system for warnings and errors works, how to use the R debugger, and how to
estimate the performance of R code.
-
-Learning Objectives
+
+Learning Objectives
-- Identify and explain the difference between R’s various printing functions
- Use R’s conditions system to raise and catch messages, warnings, and errors
- Use R’s debugging functions to diagnose bugs in code
- Estimate the amount of memory a data set will require
@@ -335,20 +338,15 @@ Learning Objectives
-4.1 Printing Output
+
+4.1 Printing
Perhaps the simplest thing you can do to get a better understanding of some
code is make it print out lots of information about what’s happening as it
-runs. This section introduces several different functions for printing output
-and making that output easier to read.
-
-4.1.1 The print
Function
-The print
function prints a string representation of an object to the
-console. The string representation is usually formatted in a way that exposes
-detail important programmers rather than users.
+runs. You can use the print
function to print values in a way that exposes
+details important to programmers.
For example, when printing a vector, the function prints the position of the
first element on each line in square brackets [ ]
:
-
+
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
@@ -356,210 +354,10 @@ 4.1.1 The print
Func
## [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## [91] 91 92 93 94 95 96 97 98 99 100
The print
function also prints quotes around strings:
-
+
## [1] "Hi"
These features make the print
function ideal for printing information when
-you’re trying to understand some code or diagnose a bug. On the other hand,
-these features also make print
a bad choice for printing output or status
-messages for users (including you).
-R calls the print
function automatically anytime a result is returned at the
-prompt. Thus it’s not necessary to call print
to print something when you’re
-working directly in the console—only from within loops, functions, scripts,
-and other code that runs non-interactively.
-The print
function is an S3 generic (see Section 6.4), so you if you
-create an S3 class, you can define a custom print method for it. For S4
-objects, R uses the S4 generic show
instead of print
.
-
-
-4.1.2 The message
Function
-To print output for users, the message
function is the one you should use.
-The main reason for this is that the message
function is part of R’s
-conditions system for reporting status information as code runs. This makes
-it easier for other code to detect, record, respond to, or suppress the output.
-Section 4.2 will explain the conditions system in more
-detail.
-The message
function prints its argument(s) and a newline to the console:
-
-## Hello world!
-If an argument isn’t a string, the function automatically and silently attempts
-to coerce it to one:
-
-## 4
-Some types of objects can’t be coerced to a string:
-
-## Error in FUN(X[[i]], ...): cannot coerce type 'builtin' to vector of type 'character'
-For objects with multiple elements, the function pastes together the string
-representations of the elements with no separators in between:
-
-## 123
-Similarly, if you pass the message
function multiple arguments, it pastes
-them together with no separators:
-
-## Hi, my name is R and x is 123
-This is a convenient way to print names or descriptions alongside values from
-your code without having to call a formatting function like paste
.
-You can make the message function print something without adding a newline at
-the end by setting the argument appendLF = FALSE
. The difference can be easy
-to miss unless you make several calls to message
, so the say_hello
function
-in this example calls message
twice:
-say_hello = function(appendLF) {
- message("Hello", appendLF = appendLF)
- message(" world!")
-}
-
-say_hello(appendLF = TRUE)
-## Hello
-## world!
-
-## Hello world!
-Note that RStudio always adds a newline in front of the prompt, so making an
-isolated call to message
with appendLF = FALSE
appears to produce the same
-output as with appendLF = TRUE
. This is an example of a situation where
-RStudio leads you astray: in an ordinary R console, the two are clearly
-different.
-
-
-4.1.3 The cat
Function
-The cat
function, whose name stands for “concatenate and print,” is a
-low-level way to print output to the console or a file. The message
function
-prints output by calling cat
, but cat
is not part of R’s conditions system.
-The cat
function prints its argument(s) to the console. It does not add a
-newline at the end:
-
-## Hello
-As with message
, RStudio hides the fact that there’s no newline if you make
-an isolated call to cat
.
-The cat
function coerces its arguments to strings and concatenates them. By
-default, a space
is inserted between arguments and their elements:
-
-## 4
-
-## 1 2 3
-
-## Hello Nick
-You can set the sep
parameter to control the separator cat
inserts:
-
-## Hello|world|1|2|3
-If you want to write output to a file rather than to the console, you can call
-cat
with the file
parameter set. However, it’s preferable to use functions
-tailored to writing specific kinds of data, such as writeLines
(for text) or
-write.table
(for tabular data), since they provide additional options to
-control the output.
-Many scripts and packages still use cat
to print output, but the message
-function provides more flexibility and control to people running the code. Thus
-it’s generally preferable to use message
in new code. Nevertheless, there are
-a few specific cases where cat
is useful—for example, if you want to pipe
-data to a UNIX shell command. See ?cat
for details.
-
-
-4.1.4 Formatting Output
-R provides a variety of ways to format data before you print it. Taking the
-time to format output carefully makes it easier to read and understand, as well
-as making your scripts seem more professional.
-
-4.1.4.1 Escape Sequences
-One way to format strings is by adding (or removing) escape sequences. An
-escape sequence is a sequence of characters that represents some other
-character, usually one that’s invisible (such as whitespace) or doesn’t appear
-on a standard keyboard.
-In R, escape sequences always begin with a backslash. For example, \n
is a
-newline. The message
and cat
functions automatically convert escape
-sequences to the characters they represent:
-
-## Hello
-## world!
-The print
function doesn’t convert escape sequences:
-
-## [1] "Hello\nworld!"
-Some escape sequences trigger special behavior in the console. For example,
-ending a line with a carriage return \r
makes the console print the next line
-over the line. Try running this code in a console (it’s not possible to see the
-result in a static book):
-# Run this in an R console.
-for (i in 1:10) {
- message(i, "\r", appendLF = FALSE)
- # Wait 0.5 seconds.
- Sys.sleep(0.5)
-}
-You can find a complete list of escape sequences in ?Quotes
.
-
-
-4.1.4.2 Formatting Functions
-You can use the sprintf
function to apply specific formatting to values and
-substitute them into strings. The function uses a mini-language to describe the
-formatting and substitutions. The sprintf
function (or something like it) is
-available in many programming languages, so being familiar with it will serve
-you well on your programming journey.
-The key idea is that substitutions are marked by a percent sign %
and a
-character. The character indicates the kind of data to be substituted: s
for
-strings, i
for integers, f
for floating point numbers, and so on.
-The first argument to sprintf
must be a string, and subsequent arguments are
-values to substitute into the string (from left to right). For example:
-
-## [1] "My age is 32, and my name is Nick"
-You can use the mini-language to do things like specify how many digits to
-print after a decimal point. Format settings for a substituted value go between
-the percent sign %
and the character. For instance, here’s how to print pi
-with 2 digits after the decimal:
-
-## [1] "3.14"
-You can learn more by reading ?sprintf
.
-Much simpler are the paste
and paste0
functions, which coerce their
-arguments to strings and concatenate (or “paste together”) them. The paste
-function inserts a space
between each argument, while the paste0
-function doesn’t:
-
-## [1] "Hello world"
-
-## [1] "Helloworld"
-You can control the character inserted between arguments with the sep
-parameter.
-By setting an argument for the collapse
parameter, you can also use the
-paste
and paste0
functions to concatenate the elements of a vector. The
-argument to collapse
is inserted between the elements. For example, suppose
-you want to paste together elements of a vector inserting a comma and space
-,
in between:
-
-## [1] "1, 2, 3"
-Members of the R community have developed many packages to make formatting
-strings easier:
-
-
-
-
-4.1.5 Logging Output
-Logging means saving the output from some code to a file as the code runs.
-The file where the output is saved is called a log file or log, but this
-name isn’t indicative of a specific format (unlike, say, a “CSV file”).
-It’s a good idea to set up some kind of logging for any code that takes more
-than a few minutes to run, because then if something goes wrong you can inspect
-the log to diagnose the problem. Think of any output that’s not logged as
-ephemeral: it could disappear if someone reboots the computer, or there’s a
-power outage, or some other, unforeseen event.
-R’s built-in tools for logging are rudimentary, but members of the community
-have developed a variety of packages for logging. Here are a few that are still
-actively maintained as of January 2023:
-
-- logger – a relatively new package that aims to improve aspects of other
-logging packages that R users find confusing.
-- futile.logger – a popular, mature logging package based on Apache’s
-Log4j utility and on R idioms.
-- logging – a mature logging package based on Python’s
logging
module.
-- loggit – integrates with R’s conditions system and writes logs in
-JavaScript Object Notation (JSON) format so they are easy to inspect
-programmatically.
-- log4r – another package based on Log4j with an object-oriented
-programming approach.
-
-
+you’re trying to understand some code or diagnose a bug.
4.2 The Conditions System
@@ -575,10 +373,11 @@ 4.2 The Conditions System4.2.1 Raising Conditions
The message
, warning
, and stop
functions are the primary ways to
raise, or signal, conditions.
-The message
function was described in Section 4.1.2. A
-message provides status information about running code, but does not
-necessarily indicate that something has gone wrong. You can use messages to
-print out any information you think might be relevant to users.
+The message
function is the primary way to print output (see Section
+3.2 for alternatives). A message provides status information
+about running code, but does not necessarily indicate that something has gone
+wrong. You can use messages to print out any information you think might be
+relevant to users.
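As a quick sketch (the function `describe` here is hypothetical), `message` writes to the standard error stream, so status text stays separate from a function's actual result:

```r
# A hypothetical status message inside a function; message() writes to
# stderr, so it doesn't mix with the function's return value.
describe = function(x) {
  message("Summarizing ", length(x), " values.")
  mean(x)
}

describe(c(1, 2, 3))
```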
The warning
function raises a warning. Warnings indicate that something
unexpected happened, but that it didn’t stop the code from running. By default,
R doesn’t print warnings to the console until code finishes running, which can
@@ -587,28 +386,28 @@
4.2.1 Raising Conditionswarning("Objects in mirror", " may be closer than they appear.")
+warning("Objects in mirror", " may be closer than they appear.")
## Warning: Objects in mirror may be closer than they appear.
Warnings are always printed with Warning:
before the message. By default,
calling warning
from the body of a function also prints the name of the
function:
-
+
## Warning in f(3, 4): This is a warning!
## [1] 7
The name of the function that raised the warning is generally useful
information for users who want to correct whatever caused the warning.
Occasionally, you might want to disable this behavior, which you can do by
setting call. = FALSE
:
-
+
## Warning: This is a warning!
## [1] 7
The warning
function also has several other parameters that control when and
@@ -617,16 +416,16 @@
4.2.1 Raising Conditionsf = function(x, y) {
- stop()
- x + y
-}
-
-f(3, 4)
+f = function(x, y) {
+  stop()
+  x + y
+}
+
+f(3, 4)
## Error in f(3, 4):
Like message
and warning
, the stop
function concatenates its unnamed
arguments into a message to print:
-
+
## Error in eval(expr, envir, enclos): I'm afraid something has gone terribly wrong.
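For instance, a call along these lines produces the error above (a sketch; the call is wrapped in a function and caught with `tryCatch` so the example keeps running):

```r
# stop() pastes its unnamed arguments together into one error message.
fail = function() stop("I'm afraid something has gone ", "terribly wrong.")

# Catch the error so the script keeps running; return its message.
tryCatch(fail(), error = conditionMessage)
```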
Errors are always printed with Error:
before the error message. You can use
the call.
parameter to control whether the error message also includes the
@@ -652,16 +451,16 @@
4.2.2 Handling Conditionstry function prints the error message and
returns an object of class try-error
, but evaluation does not stop. For
example:
-bad_add = function(x) {
- # No error
- x1 = try(5 + x)
- # Error
- x2 = try("yay" + x)
-
- list(x1, x2)
-}
-
-bad_add(10)
+bad_add = function(x) {
+ # No error
+ x1 = try(5 + x)
+ # Error
+ x2 = try("yay" + x)
+
+ list(x1, x2)
+}
+
+bad_add(10)
## Error in "yay" + x : non-numeric argument to binary operator
## [[1]]
## [1] 15
@@ -681,12 +480,12 @@ 4.2.2 Handling Conditionsinherits
function to check whether an object has a specific class, so
here’s a template for how to run code that might cause an error, check for the
error, and respond to it:
-result = try({
- # Code that might cause an error.
-})
-if (inherits(result, "try-error")) {
- # Code to respond to the error.
-}
+result = try({
+ # Code that might cause an error.
+})
+if (inherits(result, "try-error")) {
+ # Code to respond to the error.
+}
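As a concrete instance of that template (using `NA` as the fallback value is an assumption for illustration, not a rule):

```r
# Adding a string to a number raises an error; try() captures it.
result = try("yay" + 1, silent = TRUE)

if (inherits(result, "try-error")) {
  # Fall back to a missing value instead of stopping the program.
  result = NA
}

result
```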
You can prevent the try
function from printing error messages by setting
silent = TRUE
. This is useful when your code is designed to detect and handle
the error, so you don’t want users to think an error occurred.
@@ -705,19 +504,19 @@ 4.2.2 Handling ConditionstryCatch returns the result of the code.
Here’s an example of using tryCatch
to catch an error:
-bad_fn = function(x, y) {
- stop("Hi")
- x + y
-}
-
-err = tryCatch(bad_fn(3, 4), error = function(e) e)
+bad_fn = function(x, y) {
+ stop("Hi")
+ x + y
+}
+
+err = tryCatch(bad_fn(3, 4), error = function(e) e)
And here’s an example of using tryCatch
to catch a message:
-msg_fn = function(x, y) {
- message("Hi")
- x + y
-}
-
-msg = tryCatch(msg_fn(3, 4), message = function(e) e)
+msg_fn = function(x, y) {
+ message("Hi")
+ x + y
+}
+
+msg = tryCatch(msg_fn(3, 4), message = function(e) e)
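A sketch of inspecting a captured condition object (the call here raises a message directly, rather than through a helper function):

```r
# Capture a message condition and inspect its class and text.
msg = tryCatch(message("Hi"), message = function(e) e)

class(msg)             # includes "message" and "condition"
conditionMessage(msg)  # the text, with a trailing newline
```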
The tryCatch
function always silences conditions. Details about raised
conditions are provided in the object passed to the handler function, which has
class condition
(and a more specific class that indicates what kind of
@@ -733,9 +532,9 @@
4.3 Global Optionsopts = options()
-# Display the first 6 options.
-head(opts)
+opts = options()
+# Display the first 6 options.
+head(opts)
## $add.smooth
## [1] TRUE
##
@@ -759,7 +558,7 @@ 4.3 Global Options4.3 Global Optionsoptions(warn = 1)
+options(warn = 1)
When you set an option this way, the change only lasts until you quit R. Next
time you start R, the option will go back to its default value. Fortunately,
there is a way to override the default options every time R starts.
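One helpful idiom (sketched here; not required by `options` itself): the function returns the previous values of any options it sets, so you can save and later restore them:

```r
# options() returns the old values, so a change can be undone.
old = options(warn = 1)   # print warnings immediately
getOption("warn")         # now 1

options(old)              # restore the previous setting
```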
@@ -789,22 +588,22 @@ 4.3 Global Options.First = function() {
- # Only change options if R is running interactively.
- if (!interactive())
- return()
-
- options(
- # Don't print more than 1000 elements of anything.
- max.print = 1000,
- # Warn on partial matches.
- warnPartialMatchAttr = TRUE,
- warnPartialMatchDollar = TRUE,
- warnPartialMatchArgs = TRUE,
- # Print warnings immediately (2 = warnings are errors).
- warn = 1
- )
-}
+.First = function() {
+ # Only change options if R is running interactively.
+ if (!interactive())
+ return()
+
+ options(
+ # Don't print more than 1000 elements of anything.
+ max.print = 1000,
+ # Warn on partial matches.
+ warnPartialMatchAttr = TRUE,
+ warnPartialMatchDollar = TRUE,
+ warnPartialMatchArgs = TRUE,
+ # Print warnings immediately (2 = warnings are errors).
+ warn = 1
+ )
+}
You can learn more about the .Rprofile
file and R’s startup process at
?Startup
.
@@ -821,17 +620,17 @@ 4.4 Debugging# Run this in an R console.
-f = function(n) {
- total = 0
- for (i in 1:n) {
- browser()
- total = total + i
- }
- total
-}
-
-f(10)
+# Run this in an R console.
+f = function(n) {
+ total = 0
+ for (i in 1:n) {
+ browser()
+ total = total + i
+ }
+ total
+}
+
+f(10)
The most important debugger commands are:
n
to run the next line
@@ -842,54 +641,54 @@ 4.4 Debugging# Run this in an R console.
-g = function(x, y) (1 + x) * y
-
-f = function(n) {
- total = 0
- for (i in 1:n) {
- browser()
- total = total + g(i, i)
- }
-
- total
-}
-
-f(11)
+# Run this in an R console.
+g = function(x, y) (1 + x) * y
+
+f = function(n) {
+ total = 0
+ for (i in 1:n) {
+ browser()
+ total = total + g(i, i)
+ }
+
+ total
+}
+
+f(11)
4.4.1 Other Functions
The debug()
function places a call to browser()
at the beginning
of a function. Use debug()
to debug functions that you can’t or don’t want to
edit. For example:
-
+
You can use undebug()
to reverse the effect of debug()
:
-
+
The debugonce()
function places a call to browser()
at the beginning of a
function for the next call only. The idea is that you then don’t have to call
undebug()
. For instance:
-
+
Finally, the global option error
can be used to make R enter the debugger any
time an error occurs. Set the option to error = recover
:
-
+
Then try this example:
-
+
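A sketch of these helpers on a throwaway function (the function `f` here is hypothetical; run this interactively to actually enter the browser):

```r
f = function(x) x^2

debug(f)      # browser() will run at the start of every call to f
# f(2)        # uncomment in an interactive session to step through

undebug(f)    # stop debugging f
debugonce(f)  # debug only the next call, then revert automatically

# options(error = recover)  # enter the debugger whenever an error occurs
```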
@@ -917,8 +716,8 @@ 4.5.1 Estimating Memory Usage
-
+
## [1] 15.25879
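The printed figure is consistent with storing 2 million double-precision numbers (an assumption inferred from the output; doubles take 8 bytes each):

```r
# 2 million doubles at 8 bytes each, converted to megabytes.
2e6 * 8 / 1024^2
```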
You can even write an R function to do these calculations for you! If you’re
not sure whether a particular programming strategy is realistic, do the memory
@@ -929,21 +728,21 @@
4.5.1 Estimating Memory Usage
You can use the mem_used
function to get the amount of memory R is currently
using:
-
-## 42.08 MB
+mem_used()
+## 37.76 MB
Sometimes the culprit isn’t your code, but other applications on your computer.
Modern web browsers are especially memory-intensive, and closing yours while
you run code can make a big difference.
If you’ve determined that your code is the reason R runs out of memory, you can
use the obj_size
function to get how much memory objects in your code
actually use:
-
+
## 56 B
-
+
## 16.00 MB
-
+
## 7.21 kB
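A sketch of comparable calls (the exact objects measured above are assumptions; `obj_size` and `mem_used` come from the lobstr package):

```r
library("lobstr")

obj_size(1L)          # a tiny object: mostly per-object overhead
obj_size(rnorm(2e6))  # 2 million doubles: about 16 MB
```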
If a specific object created by your code uses a lot of memory, think about
ways you might change the code to avoid creating the object or avoid creating
@@ -970,12 +769,12 @@
4.5.2 Benchmarkinglibrary("microbenchmark")
-microbenchmark(A = runif(1e5), B = rnorm(1e5))
+library("microbenchmark")
+microbenchmark(A = runif(1e5), B = rnorm(1e5))
## Unit: milliseconds
-## expr min lq mean median uq max neval
-## A 2.823220 3.201357 3.389076 3.229029 3.365375 8.54375 100
-## B 5.886882 6.175219 6.408401 6.246132 6.371443 12.54406 100
+## expr min lq mean median uq max neval
+## A 2.866326 3.258161 3.479572 3.323092 3.472401 7.942078 100
+## B 5.904507 6.281376 6.721875 6.485273 6.915268 13.351294 100
The microbenchmark
function has parameters to control the number of times each
expression runs, the units for the timings, and more. You can find the details
in ?microbenchmark
.
diff --git a/docs/string-date-processing.html b/docs/string-date-processing.html
index fabea19..2487ce5 100644
--- a/docs/string-date-processing.html
+++ b/docs/string-date-processing.html
@@ -202,17 +202,33 @@
II Writing & Debugging R Code
-3 Best Practices for Writing R Scripts
-4 Squashing Bugs with R’s Debugging Tools
+ 3 Best Practices for Writing R Scripts
+
+- 3.1 Scripting Your Code
+- 3.2 Printing Output
+- 3.3 Reading Input
+- 3.4 Managing Packages
+- 3.5 Iteration Strategies
-- 4.1.1 The
print
Function
-- 4.1.2 The
message
Function
-- 4.1.3 The
cat
Function
-- 4.1.4 Formatting Output
-- 4.1.5 Logging Output
+- 3.5.1 For-loops
+- 3.5.2 While-loops
+- 3.5.3 Saving Multiple Results
+- 3.5.4 Break & Next
+- 3.5.5 Planning for Iteration
+- 3.5.6 Case Study: The Collatz Conjecture
+- 3.5.7 Case Study: U.S. Fruit Prices
+
+4 Squashing Bugs with R’s Debugging Tools
+
+- 4.1 Printing
- 4.2 The Conditions System
- 4.2.1 Raising Conditions
@@ -285,19 +301,7 @@
- 6.5 Other Object Systems
-7 Part 2
-
+7 Part 2
V Backmatter
References
Assessment
@@ -584,7 +588,7 @@ 1.2.2 Case Study: Ocean Temperatu
noaa_path = "data/ocean_data/2021_noaa-ndbc_46013.txt"
noaa_headers = read_fwf(noaa_path, n_max = 2, guess_max = 1)
## Rows: 2 Columns: 18
-## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
+## ── Column specification ──────────────────────────────────────────────────────────────
##
## chr (18): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, ...
##
@@ -592,7 +596,7 @@ 1.2.2 Case Study: Ocean Temperatu
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 3323 Columns: 18
-## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
+## ── Column specification ──────────────────────────────────────────────────────────────
##
## chr (4): X2, X3, X4, X5
## dbl (14): X1, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18
@@ -623,7 +627,7 @@ 1.2.2 Case Study: Ocean Temperatu
names automatically, but you’ll have to remove the unit row as a separate step:
## Rows: 87283 Columns: 4
-## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
+## ── Column specification ──────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): time, sea_water_temperature, z
## dbl (1): sea_water_temperature_qc_agg
@@ -672,7 +676,7 @@ 1.3.1 Printing4.1.
+3.2.
1.3.2 Escape Sequences
diff --git a/docs/tidy-relational-data.html b/docs/tidy-relational-data.html
index e351e4b..0fd38a1 100644
--- a/docs/tidy-relational-data.html
+++ b/docs/tidy-relational-data.html
@@ -202,17 +202,33 @@
II Writing & Debugging R Code
-3 Best Practices for Writing R Scripts
-4 Squashing Bugs with R’s Debugging Tools
+ 3 Best Practices for Writing R Scripts
+
+- 3.1 Scripting Your Code
+- 3.2 Printing Output
+- 3.3 Reading Input
+- 3.4 Managing Packages
+- 3.5 Iteration Strategies
-- 4.1.1 The
print
Function
-- 4.1.2 The
message
Function
-- 4.1.3 The
cat
Function
-- 4.1.4 Formatting Output
-- 4.1.5 Logging Output
+- 3.5.1 For-loops
+- 3.5.2 While-loops
+- 3.5.3 Saving Multiple Results
+- 3.5.4 Break & Next
+- 3.5.5 Planning for Iteration
+- 3.5.6 Case Study: The Collatz Conjecture
+- 3.5.7 Case Study: U.S. Fruit Prices
+
+4 Squashing Bugs with R’s Debugging Tools
+
+- 4.1 Printing
- 4.2 The Conditions System
- 4.2.1 Raising Conditions
@@ -285,19 +301,7 @@
- 6.5 Other Object Systems
-7 Part 2
-
+7 Part 2
V Backmatter
References
Assessment
@@ -322,9 +326,8 @@
2 Tidy & Relational Data
This chapter is part 2 (of 2) of Cleaning & Reshaping Data, a workshop
series about how to prepare data for analysis. The major topics of this chapter
-are how to reshape datasets with pivots, how to combine related datasets with
-joins, and how to select and use iteration strategies that automate repetitive
-computations.
+are how to reshape datasets with pivots and how to combine related datasets
+with joins.
Learning Objectives
After completing this session, learners should be able to:
@@ -1020,7 +1023,8 @@ 2.2.3 Left Joins2.2.4.1 Inner Join
@@ -1551,7 +1556,8 @@ 2.2.7.1 Handling Duplicate Keys## Warning in left_join(students, grades, by = "student_id"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 2 of `x` matches multiple rows in `y`.
## ℹ Row 4 of `y` matches multiple rows in `x`.
-## ℹ If a many-to-many relationship is expected, set `relationship = "many-to-many"` to silence this warning.
+## ℹ If a many-to-many relationship is expected, set `relationship = "many-to-many"` to
+## silence this warning.