-
Notifications
You must be signed in to change notification settings - Fork 14
/
Copy pathpurrr-basics.Rmd
358 lines (236 loc) · 13.3 KB
/
purrr-basics.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
# Basic map functions
```{r message = FALSE, warning = FALSE}
library(tidyverse)
```
## Iteration as an assembly line
You'll often need to apply the same function to each element of a list or atomic vector. As an example, take `moons`, a list of vectors.
```{r}
moons <-
list(
earth = 1737.1,
mars = c(11.3, 6.2),
neptune =
c(60.4, 81.4, 156, 174.8, 194, 34.8, 420, 2705.2, 340, 62, 44, 42, 40, 60)
)
```
(This data is from the Wikipedia pages on [Earth's Moon](https://en.wikipedia.org/wiki/Moon), [Mars's moons](https://en.wikipedia.org/wiki/Moons_of_Mars), and [Neptune's moons](https://en.wikipedia.org/wiki/Moons_of_Neptune)).
Each vector in `moons` represents a planet. The elements of each vector represent the radii of that planet's moons, in kilometers. (Note that some moons are irregularly shaped, so these are average radii.)
Say we want to figure out the number of moons that belong to each planet by taking the length of each vector. We can't just apply `length()` to `moons`.
```{r}
length(moons)
```
`length(moons)` returns the number of planets in `moons`. Instead, we need to apply `length()` to each vector in `moons`. We could find the length of each vector by individually pulling out each one and then applying `length()`.
```{r}
length(moons$earth)
length(moons$mars)
length(moons$neptune)
```
This strategy is tedious and repetitive, and would be even more tedious and repetitive if `moons` contained more planets. Fortunately, we can use functions from the purrr package to iterate through vectors and do the same thing to each element. In this reading, you'll learn about the most basic purrr functions: the map functions.
Take a look at the [purrr cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/purrr.pdf) for an overview of all the purrr functions.
The map functions are like an assembly line in a factory. One conveyor belt, the *input belt*, transports various objects to a worker. The worker picks up each object and does something with it, but, importantly, never changes the object itself. For example, she might make a mold of the object, or grab a piece of plastic corresponding to the color of the object. Then, she places her new creation (the mold, piece of plastic, etc.) on the *output belt*. She does this until there are no new objects to process on the input belt and the conveyor belt stops. The factory then ends up with the same number of output objects as came in on the input belt.
Imagine the map functions as this factory. You supply the map function with a list or vector (the objects on the input conveyor belt) and a function (what the worker does with each object). Then, the map function makes the conveyor belts run, applying the function to each element in the original vector to create a new vector of the same length, while never changing the original.
In our `moons` example, the elements on the input belt are the vectors of moon radii. We want the worker to take the length of each vector by applying `length()`, producing three numbers on the output belt, each indicating a number of moons. In the next section, we'll show you exactly how to carry out this operation with the map functions.
## The map functions
We'll explain the most general map function, `map()`, first. `map()`, like all the map functions, takes a list/vector and a function as arguments.
```{r echo=FALSE}
knitr::include_graphics("images/map-step-1.png", dpi = image_dpi)
```
(All diagrams were adapted from https://adv-r.hadley.nz/functionals.html.)
`map()` then applies that function to each element of the input list/vector.
```{r echo=FALSE}
knitr::include_graphics("images/map-step-2.png", dpi = image_dpi)
```
Applying the function to each element of the input list/vector results in one output element for each input element.
```{r echo=FALSE}
knitr::include_graphics("images/map.png", dpi = image_dpi)
```
`map()` then combines all these output elements into a list.
```{r echo=FALSE}
knitr::include_graphics("images/map-output.png", dpi = image_dpi)
```
In the assembly line metaphor, the input list/vector is the container for the items that go on the conveyor belt. The function specifies what the worker should do with each item. The items on the output conveyor belt make up the list that `map()` returns once it's done.
In our example, `moons` is our input. We want to apply `length()` to each element of `moons`, so `length` is the function.
```{r echo=FALSE}
knitr::include_graphics("images/map-length.png", dpi = image_dpi)
```
Each call to `length()` produces an integer, so the result is a list of integers.
```{r echo=FALSE}
knitr::include_graphics("images/length-list.png", dpi = image_dpi)
```
Here's what the call to `map()` looks like:
```{r}
map(moons, length)
```
Note that we used `length`, not `length()`, to specify which function to use. `length()` calls the function, while `length` refers to the function object.
As we already said, `map()` returns a list.
```{r}
typeof(map(moons, length))
```
A list of integers is fine, but you'll often rather have an atomic vector of integers. purrr also contains variants of `map()` that produce atomic vectors. There is one variant for each type of atomic vector.
* `map_int()` creates an integer vector.
* `map_dbl()` creates a double vector.
* `map_chr()` creates a character vector.
* `map_lgl()` creates a logical vector.
`length()` returns integers, so we'll use `map_int()` to create an integer vector instead of a list.
```{r}
map_int(moons, length)
```
The result looks similar to the result we got from `map()`, but is now an integer vector.
```{r}
typeof(map_int(moons, length))
```
`map_int()` requires that each output element be an integer.
```{r echo=FALSE}
knitr::include_graphics("images/map-int.png", dpi = image_dpi)
```
If each element is an integer, `map_int()` can then combine all the elements into a single atomic vector.
```{r echo=FALSE}
knitr::include_graphics("images/map-int-output.png", dpi = image_dpi)
```
The other variants of `map()` that produce double, character, and logical vectors work in the same way.
We could use `map()` and `median()` to find the median moon radius for each planet.
```{r}
map(moons, median)
```
Or, we could use `map_dbl()` to produce a vector of doubles.
```{r}
map_dbl(moons, median)
```
Mars has very small moons!
Use `map_chr()` to create a character vector when using a function that produces characters, and `map_lgl()` to create logical vectors.
```{r}
map_lgl(moons, is.double)
```
The map variants that produce atomic vectors will throw an error if your function doesn't output the correct type.
```{r error=TRUE}
map_int(moons, median)
```
`median()` returns doubles, but `map_int()` expects integers, and cannot coerce doubles into integers. If you get a "Can't coerce" error when working with map functions, you're likely using the wrong variant.
Some functions return vectors. For example, `sort()` returns a sorted version of a vector.
```{r}
sort(moons$neptune)
```
Even though `sort()` returns a vector of doubles, we can't use `map_dbl()` to apply `sort()` to each element of `moons`.
```{r error=TRUE}
map_dbl(moons, sort)
```
`map_dbl()` requires that the function return a *single* double for each element of `moons` because atomic vectors can't contain other vectors. `sort()` returns a vector of doubles for each element, so `map_dbl()` fails.
Instead, we have to use `map()` to combine the individual output vectors into a list.
```{r}
map(moons, sort)
```
Because `map()` creates lists, it's very flexible. We just used `map()` to create a list of vectors, but you can also use `map()` to create lists of lists, lists of lists of lists, lists of tibbles, lists of functions, etc.
### Extra arguments
In the previous section, we used `sort()` to arrange each vector. By default, `sort()` arranges in increasing order.
```{r}
map(moons, sort)
```
To sort in decreasing order, we have to specify `decreasing = TRUE` inside of `sort()`. Outside of a map function, we would just put `decreasing = TRUE` into the function call.
```{r}
sort(moons$neptune, decreasing = TRUE)
```
Inside a map function, you put function arguments directly after the function name.
```{r}
map(moons, sort, decreasing = TRUE)
```
You can add as many arguments as you like, and `map()` will automatically supply them to the function.
```{r echo=FALSE}
knitr::include_graphics("images/map-extra-arg.png", dpi = image_dpi)
```
For example, the following code uses two additional arguments to find the 95th quantile for each planet, excluding missing values.
```{r}
moons %>%
map_dbl(quantile, probs = 0.95, na.rm = TRUE)
```
### Anonymous functions
So far in the reading, we've only given map **named** functions. Recall that named functions have a name which you can use to call the function.
Say we want to convert the moon radii from kilometers to miles. Here's a named function that turns kilometers into miles.
```{r}
km_to_miles <- function(x) {
x * 0.62
}
```
Anytime we want to convert kilometers to miles, we can now call `km_to_miles()`.
```{r}
km_to_miles(22)
```
Now, we can use `km_to_miles()` and `map()` to convert all moon radii to miles. We have to use `map()` because `km_to_miles()` will return a vector for each planet.
```{r}
map(moons, km_to_miles)
```
If we're not going to use `km_to_miles()` again, we don't need to make a named function. It will be less work and more succinct to just create an anonymous function inside `map()`. Recall that anonymous functions are just functions without names, and the full syntax looks like this:
```{r}
function(x) x * 0.62
```
We can copy this anonymous function directly into `map()`.
```{r}
map(moons, function(x) x * 0.62)
```
The full syntax for anonymous functions is clunky, so purrr provides a shortcut. Here's what the code looks like if we use the shortcut:
```{r}
moons %>%
map(~ . * 0.62)
```
The `~` tells `map()` that an anonymous function is coming. The `.` refers to the function argument, taking the place of `x` from the full anonymous function syntax.
Just like when you supply `map()` with a named function, `map()` will apply an anonymous function to each element of `moons`. You can think of `.` as a placeholder, referring to `earth`, then `mars`, and then `neptune` as the conveyor belt delivers the vector of each planet's moons to the worker.
```{r echo=FALSE}
knitr::include_graphics("images/map-anonymous.png", dpi = image_dpi)
```
If a named function does not exist for your mapping, we recommend using an anonymous function instead of creating a named function, unless you'll use the function again or the function is very long. We also recommend always using the syntax shortcut instead of the full anonymous function syntax.
Anonymous functions can be confusing and tricky to get right. If you're struggling to get the syntax correct, try testing your anonymous function on just a single element of your list. Here, we'll explain one strategy for doing so.
First, assign a single element of your list to the variable `.`. This looks strange, but means you'll end up with a function that uses the purrr anonymous function syntax.
```{r}
. <- moons[[3]]
.
```
(We picked the third element because Neptune is more interesting than Earth, but it's usually easiest to simply pick the first element with `[[1]]`.)
Then, figure out how to perform your desired operation on just that element. Check that the output is correct.
```{r}
. * 0.62
```
Now, copy and paste your code into `map()`. Just remember to put a `~` at the beginning.
```{r}
moons %>%
map(~ . * 0.62)
```
Let's walk through a more complicated example. Say you want to find the number of large moons that belong to each planet. We'll define a large moon as a moon with a radius greater than 100 kilometers.
First, assign an element of your list to `.`.
```{r}
. <- moons[[3]]
.
```
Then, build and test your function just for `.`.
First, we need to subset `.` so that it just includes moon radii greater than 100. Recall the syntax for subsetting vectors. `x[x == 1]` extracts just the elements of some vector `x` that are equal to 1. `x[x > 2]` gives you all the elements of `x` greater than 2. Use this syntax to extract the moon radii in `.` that are greater than 100.
```{r}
.[. > 100]
```
Then, take the length of this new vector to find the number of moons it contains.
```{r}
length(.[. > 100])
```
Now, copy and paste your code into a map function after a `~`. We'll use `map_int()` because `length()` returns integers.
```{r}
moons %>%
map_int(~ length(.[. > 100]))
```
This strategy makes the purrr anonymous function syntax less abstract and allows you to test your anonymous functions before plugging them into a map function.
### Tibbles
All the examples so far have used the map function on `moons`, a list of vectors, but the map functions work on any type of vector or list, including tibbles.
Tibbles are lists of vectors. Notice that when you create a tibble with `tibble()`, you create a vector for each column.
```{r}
y <-
tibble(
col_1 = c(1, 2, 3),
col_2 = c(100, 200, 300),
col_3 = c(0.1, 0.2, 0.3)
)
```
Because the elements of a tibble are the vector columns, map functions act on the tibble columns, not the tibble rows.
```{r}
map_dbl(y, median)
```
Here's another example that finds the number of `NA`'s in each column of `nycflights13::flights`.
```{r}
nycflights13::flights %>%
map_int(~ sum(is.na(.)))
```