# An Introduction to Functional Programming
**Functional Programming (FP)**\index{functions!functional programming} is another way of thinking about how to organize programs. We talked about OOP---another way to organize programs---in the last chapter (chapter \@ref(an-introduction-to-object-oriented-programming)). So how do OOP and FP differ? To put it simply, FP focuses on functions instead of objects. Because we are talking a lot about functions in this chapter, we will assume you have read and understood section \@ref(functions).
Neither R nor Python is a purely functional language. For us, FP is a style that we can choose to let guide us, or that we can disregard. You can choose to employ a more functional style, or you can choose to use a more object-oriented style, or neither. Some people tend to prefer one style over the others, and others decide which to use depending on the task at hand.
More specifically, a functional programming style takes advantage of **first-class functions**\index{functions!first-class functions} and favors functions that are **pure**.\index{functions!pure functions}
1. **First-class functions** are [@struc_and_interp] functions that
- can be passed as arguments to other functions,
- can be returned from other functions, and
- can be assigned to variables or stored in data structures.
2. **Pure functions**
- return the same output if they are given the same input, and
- do not produce **side-effects**.
Side-effects are changes made to non-temporary variables, to the "state" of the program.
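To make this concrete, here is a minimal sketch (the function and variable names are made up for illustration). The first function is pure; the second one's output depends on a non-local variable, so the same inputs can produce different outputs.
```{r, collapse = TRUE}
pureAdd <- function(a, b){
  a + b                  # output depends only on the inputs; nothing else is touched
}
offset <- 1
impureAdd <- function(a, b){
  a + b + offset         # not pure: the result also depends on the non-local `offset`
}
impureAdd(1, 2)          # 4 ...
offset <- 100
impureAdd(1, 2)          # ... now 103, even though the inputs did not change
```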
We discussed (1) in the beginning of chapter \@ref(functions). If you have not used any other programming languages before, you might even take (1) for granted. However, using first-class functions can be difficult in other languages not mentioned in this text.
There is more to say about definition (2). This means you should keep your functions as *modular* as possible, unless you want your overall program to be much more difficult to understand. FP stipulates that
- **ideally functions will not refer to non-local variables;**
- **ideally functions will not (refer to and) modify non-local variables; and**
- **ideally functions will not modify their arguments.**
Unfortunately, violating the first of these three criteria is very easy to do in both of our languages. Recall our conversation about *dynamic lookup* in subsection \@ref(accessing-and-modifying-captured-variables). Both R and Python use dynamic lookup,\index{dynamic lookup} which means you can't reliably control *when* functions look for variables. Typos in variable names easily go undiscovered, and modified global variables can potentially wreak havoc on your overall program.
Fortunately it is difficult to modify global variables inside functions in both R and Python. This was also discussed in subsection \@ref(accessing-and-modifying-captured-variables). In Python, you need to make use of the `global` keyword (mentioned in section \@ref(passing-by-assignment-in-python)), and in R, you need to use the rare super assignment operator (it looks like `<<-`, and it was mentioned in \@ref(passing-by-value-in-r)). Because these two symbols are so rare, they can serve as signals to viewers of your code about when and where (in which functions) global variables are being modified.
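As a quick sketch of what that signal looks like in R (the names here are only illustrative), the super assignment operator makes the modification of a global variable stand out; the Python `global` keyword plays an analogous role.
```{r, collapse = TRUE}
counter <- 0
incrementCounter <- function(){
  counter <<- counter + 1   # <<- signals that a non-local variable is being modified
}
incrementCounter()
counter
```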
Last, violating the third criterion is easy in Python and difficult in R. This was discussed earlier in \@ref(modifying-a-functions-arguments). Python can mutate/change arguments that have a mutable type because it has *pass-by-assignment* semantics\index{pass-by-assignment} (mentioned in section \@ref(passing-by-assignment-in-python)), and R generally can't modify its arguments at all because it has *pass-by-value*\index{pass-by-value} semantics (section \@ref(passing-by-value-in-r)).\index{mutable in Python}
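For comparison, here is a minimal Python sketch (again with made-up names) of a function modifying its argument in place; an analogous R function would leave the caller's object untouched.
```{python, collapse = TRUE}
def append_zero(some_list):
    some_list.append(0.0)     # modifies the caller's list: a side-effect

my_data = [1.0, 2.0]
append_zero(my_data)
print(my_data)                # [1.0, 2.0, 0.0] -- the argument itself was changed
```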
This chapter avoids the philosophical discussion of FP. Instead, it takes the applied approach, and provides instructions on how to use FP in your own programs. I try to give examples of *how* you can use FP, and *when* these tools are especially suitable.
One of the biggest tip-offs that you should be using functional programming is if you need to evaluate a single function many times, or in many different ways. This happens quite frequently in statistical computing. Instead of copy/pasting similar-looking lines of code, you might consider *higher-order* functions that take your function as an input, and intelligently call it in all the many ways you want it to. A third option you might also consider is to use a loop (c.f. \@ref(loops)). However, that approach is not very functional, and so it will not be heavily-discussed in this section.
Another tip-off that you need FP is if you need many different functions that are all "related" to one another. Should you define each function separately, using excessive copy/paste-ing? Or should you write a function that can elegantly generate any function you need?
Not repeating yourself and re-using code is a primary motivation, but it is not the only one. Another motivation for **functional programming** is clearly explained in [Advanced R](https://adv-r.hadley.nz/fp.html)^[Even though this book only discusses *one* of our languages of interest, this quote applies to both languages.]:
> A functional style tends to create functions that can easily be analysed in isolation (i.e. using only local information), and hence is often much easier to automatically optimise or parallelise.
All of these sound like good things to have in our code, so let's get started with some examples!
## Functions as Function Inputs in R
Many of the most commonly-used functionals in R have names that end in "apply". The ones I discuss are `sapply()`, `vapply()`, `lapply()`, `apply()`, `tapply()` and `mapply()`. Each of these takes a function as one of its arguments. Recall that this is made possible by the fact that R has first-class functions.
### `sapply()` and `vapply()`
```{r, echo=F}
myDFrows <- 10
myDFcols <- 100
myDF <- data.frame(matrix(rnorm(myDFrows*myDFcols), ncol = myDFcols))
```
Suppose we have a `data.frame` that has `r nrow(myDF)` rows and `r ncol(myDF)` columns. What if we want to take the mean of each column?
An amateurish way to do this would be something like the following.
```{r, eval=F}
myFirstMean <- mean(myDF[,1])
mySecondMean <- mean(myDF[,2])
# ... so on and so forth ..
myHundredthMean <- mean(myDF[,100])
```
You will need one line of code for each column in the data frame! For data frames with a lot of columns, this becomes quite tedious. You should also ask yourself what happens to you and your collaborators when the data frame changes even slightly, or if you want to apply a different function to its columns. Third, the results are not stored in a single container. You are making it difficult on yourself if you want to use these variables in subsequent pieces of code.
::: {.rmd-details data-latex=""}
"Don't repeat yourself" (DRY) is an idea that's been around for a while and is widely accepted [@hunt2000pragmatic]. DRY is the opposite of [WET](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself#WET).\index{DRY code|see{WET code}}\index{WET code|see{DRY code}}
:::
Instead, prefer the use of `sapply()` in this situation. The "s" in `sapply()` stands for "simplified." In this bit of code `mean()` is called on each column of the data frame. `sapply()` applies the function over columns, instead of rows, because data frames are internally a `list` of columns.
```{r, collapse=TRUE}
myMeans <- sapply(myDF, mean)
myMeans[1:5]
```
Each call to `mean()` returns a `double` `vector` of length $1$. This is necessary if you want to collect all the results into a `vector`--remember, all elements of a `vector` have to have the same type. To get the same behavior, you might also consider using `vapply(myDF, mean, numeric(1))`.
In the above case, "simplify" referred to how one-hundred length-$1$ vectors were simplified into one length-$100$ `vector`. However, "simplified" does not necessarily imply that all elements will be stored in a `vector`. Consider the `summary` function, which returns a `double` `vector` of length $6$. In this case, one-hundred length-$6$ vectors were simplified into one $6 \times 100$ matrix.
```{r, collapse = TRUE}
mySummaries <- sapply(myDF, summary)
is.matrix(mySummaries)
dim(mySummaries)
```
::: {.rmd-details data-latex=""}
Another function that is worth mentioning is `replicate()`--it is a wrapper for `sapply()`. Consider a situation where you want to call a function many times with the same inputs. You might try something like `sapply(1:100, function(elem) { return(myFunc(someInput)) } )`. Another, more readable, way to do this is `replicate(100, myFunc(someInput))`.
:::
### `lapply()`
Functions that do not return types that fit neatly into a `vector`, `matrix`, or `array` might need their results stored in a `list`. In this situation, you would need `lapply()`. The "l" in `lapply()` stands for "list". `lapply()` always returns a `list` of the same length as its input.
```{r, collapse = TRUE}
regress <- function(y){ lm(y ~ 1) }
myRegs <- lapply(myDF, regress)
length(myRegs)
class(myRegs[[1]])
summary(myRegs[[12]])
```
### `apply()`
I use `sapply()` and `lapply()` the most, personally. The next most common function I use is `apply()`. You can use it to apply functions to *rows* of rectangular arrays instead of columns. However, it can also apply functions over columns, just as the other functions we discussed can.^[`apply()` is everyone's favorite whipping boy whenever it comes to comparing `apply()` against the other `*apply()` functions. This is because it is generally a little slower--it is written in R and doesn't call out to compiled C code. However, in my humble opinion, it doesn't matter all that much because the fractions of a second saved don't always add up in practice. ]
```{r, collapse = TRUE}
dim(myDF)
results <- apply(myDF, 1, mean)
results[1:4]
```
Another situation where it can be useful to apply a function to rows involves **predicate functions.** A predicate function is just a fancy name for a function that returns a Boolean. I use them to filter out rows of a `data.frame`. Without a predicate function, filtering rows might look something like this on our real estate data [@albemarle_county_gis_web] [@clay_ford].
```{r, collapse = TRUE, R.options = list(width = 65)}
albRealEstate <- read.csv("data/albemarle_real_estate.csv")
firstB <- albRealEstate$YearBuilt == 2006
secondB <- albRealEstate$Condition == "Average"
thirdB <- albRealEstate$City == "CROZET"
subDF <- albRealEstate[(firstB & secondB) | thirdB,]
str(subDF, strict.width = "cut")
```
Complicated filtering criteria can become quite wide, so I prefer to break the above code into three steps.
- Step 1: write a predicate function that returns TRUE or FALSE;
- Step 2: construct a `logical` `vector` by `apply()`ing the predicate over rows;
- Step 3: plug the `logical` `vector` into the `[` operator to remove the rows.
```{r, collapse = TRUE, R.options = list(width = 65)}
pred <- function(row){
  yrBuiltCorrect <- row['YearBuilt'] == 2006
  aveCond <- row['Condition'] == "Average"
  inCrozet <- row['City'] == "CROZET"
  (yrBuiltCorrect && aveCond) || inCrozet
}
whichRows <- apply(albRealEstate, 1, pred)
subDF <- albRealEstate[whichRows,]
str(subDF, strict.width = "cut")
```
### `tapply()`
`tapply()` can be very handy when you need it. We alluded to the definition before, in subsection \@ref(data-frames-in-r): a **ragged array** is a collection of arrays that all have potentially different lengths. I don't typically construct such an object and then pass it to `tapply()`. Rather, I let `tapply()` construct the ragged array for me. The first argument it expects is, to quote the documentation, "typically vector-like," while the second tells us how to break that `vector` into chunks. The third argument is a function that gets applied to each `vector` chunk.
If I wanted the average home price for each city, I could use something like this.
```{r, collapse = TRUE, R.options = list(width = 65)}
str(albRealEstate, strict.width = "cut")
length(unique(albRealEstate$City))
tapply(albRealEstate$TotalValue, list(albRealEstate$City), mean)[1:4]
```
You might be wondering why we put `albRealEstate$City` into a `list`. That seems kind of unnecessary. This is because `tapply()` can be used with multiple `factor`s--this will break down the `vector` input into a finer partition. The second argument must be one object, though, so all of these `factor`s must be collected into a `list`. The following code produces a "pivot table."
```{r, collapse = TRUE}
pivTable <- tapply(albRealEstate$TotalValue,
list(albRealEstate$City, albRealEstate$Condition),
mean)
pivTable[,1:5]
```
::: {.rmd-details data-latex=""}
For functions that return higher-dimensional output, you will have to use something like `by()` or `aggregate()` in place of `tapply()`. A small sketch with `aggregate()` follows below.
:::
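Here is a small sketch of what that can look like with `aggregate()`, using the same real estate data; the function applied to each group returns three numbers instead of one.
```{r, collapse = TRUE}
# quartiles of home value, by city -- one multi-number summary per group
aggregate(TotalValue ~ City, data = albRealEstate,
          FUN = quantile, probs = c(.25, .5, .75))
```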
### `mapply()`
The [documentation of `mapply()`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/mapply.html) states `mapply()` is a multivariate version of `sapply()`. `sapply()` worked with univariate functions; the function was called multiple times, but each time with a single argument. If you have a function that takes *multiple arguments*, and you want those arguments to change each time the function is called, then you might be able to use `mapply()`.
Here is a short example. Regarding the `n=` argument of `rnorm()`, the documentation explains, "[i]f `length(n) > 1`, the length is taken to be the number required." This would be a problem if we want to sample three times from a mean $0$ normal first, then twice from a mean $100$ normal, and finally once from a mean $42$ normal distribution. We only get three samples when we want six!
```{r}
rnorm(n = c(3,2,1), mean = c(0,100,42), sd = c(.01, .01, .01))
mapply(rnorm, n = c(3,2,1), mean = c(0,100,42), sd = c(.01, .01, .01))
```
### `Reduce()` and `do.call()`
Unlike the other examples of functions that take other functions as inputs, `Reduce()` and `do.call()` don't have many outputs. Instead of collecting many outputs into a container, they just output one thing.
Let's start with an example: "combining" data sets. In section \@ref(reshaping-and-combining-data-sets) we talked about several different ways of combining data sets. We discussed stacking data sets on top of one another with `rbind()` (c.f. subsection \@ref(stacking-data-sets-and-placing-them-shoulder-to-shoulder)), stacking them side-by-side with `cbind()` (also in \@ref(stacking-data-sets-and-placing-them-shoulder-to-shoulder)), and intelligently joining them together with `merge()` (c.f. \@ref(merging-or-joining-data-sets)).
Now consider the task of combining *many* data sets. How can we combine three or more data sets into one? Also, how do we write DRY code and abide by the DRY principle? As the name of the subsection suggests, we can use either `Reduce()` or `do.call()` as a higher-order function. Just like the aforementioned `*apply()` functions, they take in either `cbind()`, `rbind()`, or `merge()` as a function input. Which one do we pick, though? The answer to that question deals with *how many arguments our lower-order function takes.*
Take a look at the documentation to `rbind()`. Its first argument is `...`, which is the [dot-dot-dot](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Dot_002ddot_002ddot) symbol. This means `rbind()` can take a varying number of `data.frame`s to stack on top of each other. In other words, `rbind()` is **variadic**.
On the other hand, take a look at the documentation of `merge()`. It only takes two `data.frame`s at a time^[Although, it is still variadic. The difference is that the `dot-dot-dot` symbol does not refer to a varying number of `data.frame`s, just a varying number of other things we don't care about in this example.]. If we want to combine many data sets, `merge()` needs a helper function.
This is the difference between `Reduce()` and `do.call()`. `do.call()` calls a function once on many arguments, so its function must be able to handle many arguments. On the other hand, `Reduce()` calls a binary function many times on pairs of arguments. `Reduce()`'s function argument gets called on the first two elements, then on the first output and the third element, then on the second output and fourth element, and so on.
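Before applying this to data frames, here is a minimal sketch of the difference using addition: `do.call()` makes one call with all of the arguments at once, while `Reduce()` chains together many calls of a binary function.
```{r, collapse = TRUE}
do.call(sum, list(1, 2, 3, 4))   # one call: sum(1, 2, 3, 4)
Reduce(`+`, list(1, 2, 3, 4))    # many calls: ((1 + 2) + 3) + 4
```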
Here is an initial example that makes use of four data sets `d1.csv`, `d2.csv`, `d3.csv`, and `d4.csv`. To start, ask yourself how we would read all of these in. There is a temptation to copy and paste `read.csv` calls, but that would violate the DRY principle. Instead, let's use `lapply()` with an anonymous function that constructs a file path string, and then uses it to read in the data set the string refers to.
```{r, collapse = TRUE}
numDataSets <- 4
dataSets <- paste0("d",1:numDataSets)
dfs <- lapply(dataSets,
function(name) read.csv(paste0("data/", name, ".csv")))
head(dfs[[3]])
```
Notice how the above code would only need to be changed by one character if we wanted to increase the number of data sets being read in!^[To make it even more flexible, we could write code that doesn't assume the files are all named the same way, or that they are all in the same directory together.]
Next, `cbind()`ing them all together can be done as follows. `do.call()` will call the function only once. `cbind()` takes many arguments at once, so this works. This code is even better than the above code in that if `dfs` becomes longer, or changes at all, *nothing* will need to be changed.
```{r, collapse = TRUE}
do.call(cbind, dfs) # DRY! :)
# cbind(df1,df2,df3,df4) # WET! :(
```
What if we wanted to `merge()` all these data sets together? After all, the `id` column appears to be repeating itself, and some data from `d2` isn't lining up.
```{r, collapse=TRUE}
Reduce(merge, dfs)
```
Again, this is very DRY code. Nothing would need to be changed if `dfs` grew. Furthermore, trying to `do.call()` the `merge()` function wouldn't work because it can only take two data sets at a time.
## Functions as Function Inputs in Python
### Functions as Function Inputs in Base Python
I discuss two functions from base Python that take functions as input. Neither returns a `list` or a `np.array`, but they do return different kinds of **iterables**, which are "objects capable of returning their members one at a time," [according to the Python documentation.](https://docs.python.org/3/glossary.html) `map()`, the function, will return objects of type `map`. `filter()`, the function, will return objects of type `filter`. Oftentimes we will just convert these to a container we are more familiar with.
#### `map()`
[`map()`](https://docs.python.org/3/library/functions.html#map) can call a function repeatedly using elements of a container as inputs. Here is an example of calculating outputs of a *spline* function, which can be useful for coming up with predictors in regression models. This particular spline function is $f(x) = (x-k)1(x \ge k)$, where $k$ is some chosen "knot point."
```{python, collapse=TRUE}
import numpy as np
my_inputs = np.linspace(start = 0, stop = 2*np.pi)
def spline(x):
    knot = 3.0
    if x >= knot:
        return x - knot
    else:
        return 0.0
output = list(map(spline, my_inputs))
```
We can visualize the mathematical function by plotting its outputs against its inputs. More information on visualization was given in subsection \@ref(visualization).
```{r spline-plot, out.width='80%', fig.align='center', fig.cap = "Our Spline Function", echo = F}
knitr::include_graphics("pics/spline.png")
```
`map()` can also be used like `mapply()`. In other words, you can apply it to two containers at once.
```{python, collapse=TRUE}
import numpy as np
x = np.linspace(start = -1., stop = 1.0)
y = np.linspace(start = -1., stop = 1.0)
def f(x,y):
    return np.log(x**2 + y**2)
list(map(f, x, y))[:3]
```
#### `filter()`
[`filter()`](https://docs.python.org/3/library/functions.html#filter) helps remove unwanted elements from a container. It returns an iterable of type `filter`, which we can iterate over or convert to a more familiar type of container. In this example, I iterate over it without converting it.
This code also provides our first example of a [**lambda function**](https://docs.python.org/3/tutorial/controlflow.html#lambda-expressions) [@Lutz13]. Lambda functions are simply another way to define functions. Notice that in this example, we didn't have to name our function. In other words, it was **anonymous**. We can also save a few lines of code.
```{python, collapse = TRUE}
raw_data = np.arange(0,1.45,.01)
for elem in filter(lambda x : x**2 > 2, raw_data):
    print(elem)
```
### Functions as Function Inputs in Numpy
Numpy provides a [number of functions](https://numpy.org/doc/stable/reference/routines.functional.html) that facilitate working with `np.ndarray`s in a functional style. For example, [`np.apply_along_axis()`](https://numpy.org/doc/stable/reference/generated/numpy.apply_along_axis.html) is similar to R's `apply()`. `apply()` had a `MARGIN=` argument (`1` applies the function to each row, `2` to each column), whereas this function has an `axis=` argument (`0` applies the function to each column, `1` to each row).
```{python, collapse = TRUE}
import numpy as np
my_array = np.arange(6).reshape((2,3))
my_array
np.apply_along_axis(sum, 0, my_array) # summing columns
np.apply_along_axis(sum, 1, my_array) # summing rows
```
### Functional Methods in Pandas
Pandas\index{Pandas}' `DataFrame`s have an [`.apply()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html) that is very similar to `apply()` in R,^[You should know that a lot of special-case functions that you typically apply to rows or columns come built-in as `DataFrame` methods. For instance, `.mean()` would allow you to do something like `my_df.mean()`.] but again, just like the above function, you have to think about an `axis=` argument instead of a `MARGIN=` argument.
```{python, collapse = TRUE}
import pandas as pd
alb_real_est = pd.read_csv("data/albemarle_real_estate.csv")
alb_real_est.shape
alb_real_est.apply(len, axis=0) # length of columns
type(alb_real_est.apply(len, axis=1)) # length of rows
```
Another thing to keep in mind is that `DataFrame`s, unlike `ndarray`s, don't have to have the same type for all elements. If you have mixed column types, then summing rows, for instance, might not make sense. This just requires subsetting columns before `.apply()`ing a function to rows. Here is an example of computing each property's "score".
```{python, collapse = TRUE}
import pandas as pd
# alb_real_est.apply(sum, axis=1) # can't add letters to numbers!
def get_prop_score(row):
    return 2*row[0] + 3*row[1]
two_col_df = alb_real_est[['FinSqFt','LotSize']]
alb_real_est['Score'] = two_col_df.apply(get_prop_score, 1)
alb_real_est[['FinSqFt','LotSize','Score']].head(2)
```
`.apply()` also works with more than one function at a time.
```{python, collapse = TRUE}
alb_real_est[['FinSqFt','LotSize']].apply([sum, len])
```
If you do not want to waste two lines defining a function with `def`, you can use an anonymous lambda function. Be careful, though--if your function is complex enough, then your lines will get quite wide. For instance, this example is pushing it.
```{python, collapse = TRUE}
two_col_df.apply(lambda row : sum(row*[2,3]), 1)[:3]
```
The previous example `.apply()`s a *binary* function to each row. The function is binary because it takes two elements at a time. If you want to apply a *unary* function (i.e. one that takes one argument at a time) to every element, for each row and each column, then you can use [`.applymap()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.applymap.html#pandas.DataFrame.applymap).
```{python, collapse = TRUE}
alb_real_est[['FinSqFt','LotSize']].applymap(lambda e : e + 1).head(3)
```
Last, we have a [`.groupby()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) method, which can be used to mirror the behavior of R's `tapply()`, `aggregate()` or `by()`. It can take the `DataFrame` it belongs to, and group its rows into multiple sub-`DataFrame`s. The collection of sub-`DataFrames` has a lot of the same methods that an individual `DataFrame` has (e.g. the subsetting operators, and the `.apply()` method), which can all be used in a second step of calculating things on each sub-`DataFrame`.
```{python, collapse = TRUE}
type(alb_real_est.groupby(['City']))
type(alb_real_est.groupby(['City'])['TotalValue'])
```
Here is an example that models some pretty typical functionality. It shows two ways to get the average home price by city. The first line groups the rows by which `City` they are in, extracts the `TotalValue` column in each sub-`DataFrame`, and then `.apply()`s the `np.average()` function on the sole column found in each sub-`DataFrame`. The second `.apply()`s a lambda function to each sub-`DataFrame` directly. More details on this "split-apply-combine" strategy can be found in the [Pandas documentation.](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html)
```{python, collapse = TRUE}
grouped = alb_real_est.groupby(['City'])
grouped['TotalValue'].apply(np.average)
grouped.apply(lambda df : np.average(df['TotalValue']))
```
## Functions as Function Outputs in R
Functions that create and return other functions are sometimes called **function factories.** Functions are first-class objects in R, so it's easy to return them. What's more interesting is that supposedly temporary objects inside the outer function can be accessed during the call of the inner function after it's returned.
Here is a first quick example.
```{r, collapse = TRUE}
funcFactory <- function(greetingMessage){
  function(name){
    paste(greetingMessage, name)
  }
}
greetWithHello <- funcFactory("Hello")
greetWithHello("Taylor")
greetWithHello("Charlie")
```
Notice that **the `greetingMessage=` argument that is passed in, `"Hello"`, isn't temporary anymore.** It lives on so it can be used by all the functions created by `funcFactory()`. This is the most surprising aspect of writing function factories.
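To drive the point home, every function created by `funcFactory()` remembers the `greetingMessage=` it was created with (the extra greeting below is just for illustration).
```{r, collapse = TRUE}
greetWithHowdy <- funcFactory("Howdy")
greetWithHowdy("Taylor")
greetWithHello("Taylor")  # the original function still uses "Hello"
```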
Let's now consider a more complicated and realistic example. Let's implement a variance reduction technique called **common random numbers**.
Suppose $X \sim \text{Normal}(\mu, \sigma^2)$, and we are interested in approximating an expectation of a function of this random variable. Suppose that we don't know that
\begin{equation}
\mathbb{E}[\sin(X)] = \sin(\mu) \exp\left(-\frac{\sigma^2}{2}\right)
\end{equation}
for any particular choice of $\mu$ and $\sigma^2$, and instead, we choose to use the Monte Carlo method:
\begin{equation}
\hat{\mathbb{E}}[\sin(X)] = \frac{1}{n}\sum_{i=1}^n\sin(X^i)
\end{equation}
where $X^1, \ldots, X^n \overset{\text{iid}}{\sim} \text{Normal}(\mu, \sigma^2)$ is a large collection of draws from the appropriate normal distribution, probably coming from a call to `rnorm()`. In more realistic situations, the theoretical expectation might not be tractable, either because the random variable has a complicated distribution, or maybe because the functional is very complicated. In these cases, a tool like Monte Carlo might be the only available approach.
Here are two functions that calculate the above quantities for $n=1000$. `actualExpectSin()` is a function that computes the theoretical expectation for any particular parameter pair. `monteCarloSin()` is a function that implements the Monte Carlo approximate expectation.
```{r, collapse = TRUE}
n <- 1000 # don't hardcode variables that aren't passed as arguments!
actualExpectSin <- function(params){
  stopifnot(params[2] > 0) # second parameter is sigma
  sin(params[1])*exp(-.5*(params[2]^2))
}
monteCarloSin <- function(params){
  stopifnot(params[2] > 0)
  mean(sin(rnorm(n = n, mean = params[1], sd = params[2])))
}
# monteCarloSin(c(10,1))
```
One-off approximations aren't as interesting as visualizing many expectations for many parameter inputs. Below we plot the expectations for many different parameter vectors/configurations/settings.
```{r, collapse = TRUE, out.width='80%', fig.align='center', fig.cap = "Monte Carlo Approximations versus Exact Evaluations"}
muGrid <- seq(-10,10, length.out = 100)
sigmaGrid <- seq(.001, 5, length.out = 100)
muSigmaGrid <- expand.grid(muGrid, sigmaGrid)
actuals <- matrix(apply(muSigmaGrid, 1, actualExpectSin),
ncol = length(muGrid))
mcApprox <- matrix(apply(muSigmaGrid, 1, monteCarloSin),
ncol = length(muGrid))
par(mfrow=c(1,2))
contour(muGrid, sigmaGrid, actuals,
xlab = "mu", ylab = "sigma", main = "actual expects")
contour(muGrid, sigmaGrid, mcApprox,
xlab = "mu", ylab = "sigma", main = "mc without crn")
```
```{r, echo=FALSE}
par(mfrow=c(1,1))
```
There are three problems with this implementation:
- `monteCarloSin()` is not pure because it captures the `n` variable,
- the only way to increase the accuracy of the plot in the right panel is to increase `n`, and
- every time we re-run this code the plot on the right looks different.
If we wanted to use common random numbers, we could generate $Z^1, \ldots, Z^n \overset{\text{iid}}{\sim} \text{Normal}(0, 1)$, and use the fact that
\begin{equation}
X^i = \mu + \sigma Z^i
\end{equation}
This leads to the Monte Carlo estimate
\begin{equation}
\tilde{\mathbb{E}}[\sin(X)] = \frac{1}{n}\sum_{i=1}^n\sin(\mu + \sigma Z^i)
\end{equation}
Here is one function that naively implements Monte Carlo with common random numbers. We generate the collection of standard normal random variables once, globally. Each time you call `monteCarloSinCRNv1(c(10,1))`, you get the same answer.
```{r, collapse = TRUE}
commonZs <- rnorm(n=n)
monteCarloSinCRNv1 <- function(params){
  stopifnot(params[2] > 0)
  mean(sin(params[1] + params[2]*commonZs))
}
# monteCarloSinCRNv1(c(10,1))
```
Let's compare using common random numbers to going without. As you can see, common random numbers make the plot look "smoother." In other words, we increase our sampling accuracy without spending more computational time.
```{r, collapse=TRUE, out.width='80%', fig.align='center', fig.cap = "Monte Carlo: With and Without Common Random Numbers"}
mcApproxCRNv1 <- matrix(apply(muSigmaGrid, 1, monteCarloSinCRNv1),
ncol = length(muGrid))
par(mfrow=c(1,2))
contour(muGrid, sigmaGrid, mcApprox,
xlab = "mu", ylab = "sigma", main = "mc without crn")
contour(muGrid, sigmaGrid, mcApproxCRNv1,
xlab = "mu", ylab = "sigma", main = "mc with crn")
par(mfrow=c(1,1))
```
There are some new downsides to this implementation to consider:
- we have another global variable--a bunch of samples called `commonZs` floating around, and
- the dependence on the global variable for sample size is even further obscured.
We can fix these two problems very nicely by using a function factory.
```{r, collapse = TRUE}
makeMCFunc <- function(n = 1000){
  commonZs <- rnorm(n)
  function(params){
    stopifnot(params[2] > 0)
    mean(sin(params[1] + params[2]*commonZs))
  }
}
monteCarloSinCRNv2 <- makeMCFunc()
# now call monteCarloSinCRNv2 to approx. expectations
# e.g. monteCarloSinCRNv2(c(10,1))
```
This is much better because
- the desired sample size must be passed in as a function argument instead of being captured,
- the re-used standard normal variates are not in the global environment anymore, and
- a sensible default number of samples is provided in the event that the programmer forgets to specify one.
The inner function did in fact capture `commonZs`, but it captured it from the enclosing scope, not the global scope. Capturing isn't always a terrible idea. It would be difficult to modify these variables, so we don't need to worry about function behavior changing in unpredictable ways. Actually, capturing a variable instead of passing it in is an intelligent design choice--now the end-users of functions created by this factory don't need to worry about plugging in extra parameters.
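As a small demonstration of the payoff, each function made by `makeMCFunc()` carries its own copy of `commonZs`, so calling it twice with the same parameters gives identical answers without relying on any global variables.
```{r, collapse = TRUE}
monteCarloSinCRNv2(c(10, 1))  # the captured draws are re-used...
monteCarloSinCRNv2(c(10, 1))  # ...so repeated calls agree exactly
```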
Let's use $1000$ samples again and make sure this function works by comparing its output to the known true function. Run the following code on your own machine. Note the new Greek letters in the axis labels.
```{r, collapse=TRUE, eval=FALSE}
mcApproxCRNv2 <- matrix(apply(muSigmaGrid, 1, monteCarloSinCRNv2),
ncol = length(muGrid))
par(mfrow=c(1,2))
contour(muGrid, sigmaGrid, mcApprox,
xlab = expression(mu), ylab = expression(sigma))
contour(muGrid, sigmaGrid, mcApproxCRNv2,
xlab = expression(mu), ylab = expression(sigma))
par(mfrow=c(1,1))
```
## Functions as Function Outputs in Python
We can write function factories in Python, too. Here is another implementation of the first example from the previous section. Again, just as in R, the `str` passed in as `greeting_message` persists well after `func_factory()` has finished working.
```{python, collapse = TRUE}
def func_factory(greeting_message):
    def func(name):
        print(greeting_message + ' ' + name)
    return func
greet_with_hello = func_factory("Hello")
greet_with_hello("Taylor")
greet_with_hello("Charlie")
```
Let's consider another less trivial example. Recall the spline function from earlier in the chapter:
```{python, collapse = TRUE}
import numpy as np
def spline(x):
    knot = 3.0
    if x >= knot:
        return x - knot
    else:
        return 0.0
```
This function is limited in that it takes in only one element at a time. Unfortunately, we would not be able to provide an entire Numpy array as an argument (e.g. `spline(np.arange(3))`). Many functions do possess this array-friendly behavior, and it is generally advantageous to exploit it. If you recall our discussion about universal functions in section \@ref(vectorization-in-python), you might have grown accustomed to writing vectorized code.
Fortunately there's a way to automatically vectorize functions like the one above: [`np.vectorize()`](https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html#numpy.vectorize). `np.vectorize()` takes in a unary function, and outputs a vectorized version of it that is able to take entire arrays as an input. Here's an example. Compare this to us using `map()` before.
```{python, collapse = TRUE}
my_inputs = np.linspace(start = 0, stop = 2*np.pi)
# make a vectorized function
vec_spline = np.vectorize(spline)
vec_spline(my_inputs)[:4]
# alternatively output = list(map(spline, my_inputs)) from last time
```
The above code doesn't just demonstrate how to return functions from a function. It is also an example of using functions as function inputs. When a function takes in and spits out functions, there is an alternative way to use it that is unique to Python. You can use **[function decorators](https://www.python.org/dev/peps/pep-0318/).**\index{functions!function decorators} You can *decorate* a function by using the `@` operator [@Lutz13].
If you decorate a function, it is equivalent to passing that function in to a function factory (aka outer function). That function will take the function you defined, alter it, and then give it back to you with the same name that you chose in the first place.
```{python, collapse = TRUE}
# instead of spline = np.vectorize(spline)
@np.vectorize
def spline(x):
    knot = 3.0
    if x >= knot:
        return x - knot
    else:
        return 0.0
spline(my_inputs)[:4]
```
### Writing Our Own Decorators
We can write our own function factory that can be used as decoration. The main restriction is that this function factory must take a function as an argument, too. This can sometimes be limiting. You might have noticed that the definition of `func_factory()` from earlier in this section did not do that. If you don't believe me, then as an exercise, after you read this section, you might try rewriting the example from \@ref(functions-as-function-outputs-in-r) that implements Monte Carlo sampling using common random numbers.
Before we get too ahead of ourselves, let's describe the basics. Here is our first decorator function `add_greeting()`.
```{python, collapse = TRUE}
def add_greeting(func):
    def wrapper(name):
        print('Salutations, ')
        func(name)
    return wrapper
```
The decorator `add_greeting()` returns a function that is an embellished version of the function it is given. When we decorate a function with it, it looks like this.
```{python, collapse = TRUE}
@add_greeting
def print_name(first_name):
    print(first_name)
```
You could get the same behavior by typing the following. They are equivalent!
```{python, collapse = TRUE}
def print_name(first_name):
    print(first_name)
print_name = add_greeting(print_name)
```
Things can get a little more complicated when your decorators take additional arguments.
```{python, collapse = TRUE, echo = F}
def add_greeting(greet):
    def decorator(func):
        def wrapper(name):
            print(greet)
            func(name)
        return wrapper
    return decorator
```
```{python, collapse = TRUE}
@add_greeting("How you doin'")
def print_name(first_name):
    print(first_name)
print_name('Taylor')
```
So how do we write decorators that accomplish this? The important thing to remember is that `@add_greeting("How you doin'")` in the previous code block is equivalent to writing this after the function definition: `print_name = add_greeting("How you doin'")(print_name)`. This is a function returning a function returning a function! The definition of `add_greeting()` could look something like this.
```{python, collapse = TRUE}
def add_greeting(greet):
    def decorator(func):
        def wrapper(name):
            print(greet)
            func(name)
        return wrapper
    return decorator
```
Now that you know how decorators work, you can feel comfortable using third-party ones. You might come across, for example, the `@jit` decorator from [Numba](https://numba.pydata.org/), which will translate your Python function into faster machine code, the `@lru_cache` decorator from the [`functools` module](https://docs.python.org/3/library/functools.html)--this can make your code faster by saving some of its outputs--or decorators that perform application specific tasks like [`@tf.function`](https://www.tensorflow.org/api_docs/python/tf/function) from Tensorflow.
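As one example of how a third-party decorator gets used, here is a small sketch with `@lru_cache` from the standard library's `functools` module; previously computed outputs of the decorated function are stored and re-used, which speeds up this naive recursion considerably.
```{python, collapse = TRUE}
from functools import lru_cache

@lru_cache(maxsize=None)   # remember every (input, output) pair seen so far
def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

fib(30)                    # fast, because repeated subproblems are cached
```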
## Exercises
### Python Questions
1.
Write a function decorator called `@log_dens_eval(left_bound, right_bound)`. When it decorates a function, say `func(x)`, it will not change that function's input or output, but it will verify that the input to the function (in this case `x`) is between `left_bound` and `right_bound`. If it is not, it will return negative infinity.
2.
The Split-Apply-Combine strategy might be useful in writing code for a tree-based model [@trees]. We won't discuss how these models are estimated, but we will write a function that generates another function that is able to generate predictions by stratifying the input/predictor/feature space.
a) Import the data `"winequality-red.csv"`, call it `wine`, and remove all columns except for `fixed acidity`, `volatile acidity`, and `quality`.
b) Write a function called `generate_pred_func(fixed_cutoff, vol_cutoff, dataset)`.
+ The `dataset` argument should be a Pandas `DataFrame` that has three columns called `fixed acidity`, `volatile acidity`, and `quality`.
+ The `fixed_cutoff` argument should be a floating point number that separates `fixed acidity` into two regions.
+ The `vol_cutoff` argument should be a floating point number that separates `volatile acidity` into two regions.
+ The function should return a function, say `func(fixed_acidity, volatile_acidity)`. The two arguments are floating points. This function should return the most frequent `quality` observation out of all points whose inputs lie in the corresponding region in the feature space.
After you finish the problem, you should have a definition of a `generate_pred_func()` that could be used as follows:
```{python, collapse = TRUE, eval=FALSE}
predict = generate_pred_func(fixed_cutoff=8, vol_cutoff=.5, dataset=wine)
predict(10,10)
```
3.
Let's predict what type of activity someone is doing based on measurements taken from their cell phone. We will begin implementing a **K-Nearest Neighbors (KNN) classifier** [@knn1] [@knn2].
Consider the data files `"X_train.txt"` and `"y_train.txt"` from [@Anguita2013APD], which is available from the UCI Machine Learning Repository [@uci_data]. The first data set consists of recorded movements from a cell phone, and the second data set consists of activity labels of people. Labels $1$ through $6$ correspond to walking, walking upstairs, walking downstairs, sitting, standing and laying, respectively.
a) Read in `"X_train.txt"` as a `DataFrame` called `x_train`.
b) Read in `"y_train.txt"` as a `DataFrame` called `y_train`.
c) Define a function called `standardize(arr)` that takes in an array-like and returns a standardized version of the array-like. Do this by subtracting from each element the overall mean and dividing each element by the standard deviation (use the length as a denominator, not length minus $1$).
d) Apply `standardize()` to each column and transform `x_train` by replacing all of its column by their standardized versions. Make sure to overwrite `x_train`. Do this in one line of code.
e) Write a function called `euclid_dist(arr1, arr2)` that calculates Euclidean distance between two points/array-likes.
f) What is the most common label among the 5 rows that are closest to the first row? Assign your answer to `my_predict`. Assume that the two data sets you imported have the same order. Don't include the first row in these $5$ *nearest neighbors*. Take care not to modify `x_train` or `y_train`.
### R Questions
1.
The density of a particular bivariate Gaussian distribution is
$$
f(x,y) = \frac{1}{2 \pi} \exp\left[ -\frac{x ^2 + y^2}{2} \right] \tag{1}.
$$
The random elements $X$ and $Y$, in this particular case, are independent, each have unit variance, and zero mean. In this case, the marginal for $X$ is a mean $0$, unit variance normal distribution:
$$
g(x) = \frac{1}{\sqrt{2\pi}} \exp\left[ -\frac{x ^2 }{2} \right] \tag{2}.
$$
a) Write a function called `fTwoArgs(x,y)` that takes two arguments, and returns the value of the above density in equation (1) at those two points.
b) Write a function called `fOneArg(vec)` that takes one argument: a length two `vector`. It should return a density in equation (1) evaluated at that point.
c) Write a function called `gOneArg(x)` that evaluates the density in (2).
d) Generate two sequences called `xPoints` and `yPoints`. Have them contain the twenty equally-spaced numbers going from $-3$ to $3$, inclusive.
e) Use `expand.grid()` to create a `data.frame` called `myGrid`. It should have two columns, and it should contain in its rows every possible pair of two points from the above sequences. The "x" coordinates should be in the first column.
f) Use `mapply()` to evaluate the bivariate density on every grid point. Store your results in a `vector` `mEvals`.
g) Use `apply()` to evaluate the bivariate density on every grid point. Store your results in a `vector` `aEvals`.
h) Use `sapply()` to evaluate the univariate density on every element of `xPoints`. Store your results in a `vector` `sEvals`.
i) Use `vapply()` to evaluate the univariate density on every element of `xPoints`. Store your results in a `vector` `vEvals`.
j) Use `lapply()` to evaluate the univariate density on every element of `xPoints`. Store your results in a `list` `lEvals`.
k) Generate two plots of the bivariate density. For one, use `persp()`. For the other, use `contour()`. Feel free to revive the code you used in Chapter 13's Exercise 1.
l) Generate a third plot of the univariate density. Feel free to revive the code you used in Chapter 13's Exercises.
2.
Write a function that reads in all of the data sets contained in any given folder.
+ The function should be called `readAllData(path, extensions)`.
+ The first argument, `path`, should be a string representing which folder you would like to search for files.
+ The second argument, `extensions`, should be a `character` `vector` of all the file extensions you are interested in reading in (e.g. `c('.txt','.csv')`). Be careful about using regular expressions!
+ The function should return a `list` of `data.frame`s.
3.
Consider the Militarized Interstate Disputes (v5.0) [@mid5] data sets again: `"MIDA 5.0.csv"`, `"MIDB 5.0.csv"`, `"MIDI 5.0.csv"`, and `"MIDIP 5.0.csv"`.
a) Read these data sets in as a `list` of `data.frame`s. Call it `dfList`.
b) Use `lapply()` and `do.call()` to calculate the biggest column average for all of these data sets. Store this average as `biggestAve`. Store the name of the column that had this biggest average as `whichBiggestAve`. Don't worry about storing which data set this column mean was found in.
c) Use `Reduce()` and `lapply()` to, once again, calculate the biggest column average. Store this number as `biggestAve2`.
<!-- ### Exercises -->
<!-- 1. Let $X_t$ be the price of a stock at time $t$. A *call (or put) option* is a type of derivative contract you can buy or sell. It's called a derivative because its value derives from the value of a stock. A single call contract (or put) gives the owner the right to buy (or sell) stock ($100$ shares) at a certain price. This certain price written into an options contract ($s$) is called the *strike price*. The day the option expires is called the *expiration date*. If a holder of the option uses his/her right to buy or sell the underlying stock, he/she is said to have *exercised* the option. -->
<!-- The price and value of options floats around randomly through time just like a stock does. Part of the value of an option is its *intrinsic value*. It is not the same as the overall value of an option, or its "fair price"--rather, the intrinsic value represents how much money its possible to guarantee at the current moment, if you exercised the option, and then liquidated the resulting position in the underlying. -->
<!-- For example, call owners hope the stock rises above the strike price so that they can buy the stock for a low price, and immediately sell it for a high price. If $X_t > s$, they could buy at $s$, and sell at $X_t$. In other words, the intrinsic value of a call option is -->
<!-- $$ -->
<!-- C^s_t = 100 \times \max(X_t - s, 0) -->
<!-- $$ -->
<!-- as long as $t$ is before the expiration date. If $X_t < s$, then the owner doesn't have to buy the stock, so the call at that time is intrinsically worthless. -->
<!-- Put owners hope the stock price falls below the strike price, because a put's intrinsic value is -->
<!-- $$ -->
<!-- P^s_t = 100 \times \max(s - X_t, 0) -->
<!-- $$ -->
<!-- Sometimes people buy and sell multiple contracts (with the same expiration date) at the same time. If they do, the intrinsic value of the entire position is additive. For example, if you bought two of the same put option as the one above, then you just double the above expression to get the intrinsic value. -->
<!-- As another example, if you buy (you can also sell) a *call spread*, you buy one call, and then sell another call with the same expiration and a higher strike price. Assuming $s_1 < s_2$ the intrinsic value is of the call spread with strikes $s_1$ and $s_2$ is -->
<!-- $$ -->
<!-- 100 \times \max(X_t - s_1, 0) - 100\times\max(X_t - s_2, 0) -->
<!-- $$ -->
<!-- a. In R, write a function called `getIntrinsicValueFunc` that takes three vectors as arguments, (`dirs`, `strikes`, `contractType`, `numContracts`), and returns a function. These arguments should all be the same length. The length of these inputs is the number of contracts in an options position. The function that `getIntrinsicValueFunc` returns should only take one argument: `prices`, a vector of prices of the underlying stock. The function that `getIntrinsicValueFunc` should return a vector of intrinsic values over time. -->
<!-- Because the arguments may only take on certain values, and because the function to be returned has a certain signature, a template has been provided below. -->
<!-- ```{r, eval=F} -->
<!-- getIntrinsicValueFunc <-function(dirs, strikes, contractType, numContracts){ -->
<!-- stopifnot(all(dirs %in% c("buy", "sell"))) -->
<!-- stopifnot(all(contractType %in% c("call", "put"))) -->
<!-- stopifnot(all(is.integer(strikes))) -->
<!-- stopifnot(all(is.integer(numContracts))) -->
<!-- stopifnot( (length(dirs) == length(strikes)) && (length(dirs) == length(contractType))) -->
<!-- usefulFunc <- function(prices){ -->
<!-- # TODO -->
<!-- } -->
<!-- return(usefulFunc) -->
<!-- } -->
<!-- ``` -->
<!-- b. Create an intrinsic value plot. The x-axis should be a grid of values: let's say from $100$ to $200$. On the y-axis plot the intrinsic value of buying a call spread with strikes $110$ and $120$. -->
<!-- c. Do the same thing as above but show what happens when you sell the same call spread. -->
<!-- d. Plot the intrinsic value of a bought put spread with strikes $110$ and $120$. -->
<!-- e. Plot the intrinsic value of a sold put spread with strikes $110$ and $120$. What is the difference between all four plots? -->
<!-- 2. Do the same thing, but in Python! -->
<!-- creating row masks in a data frame -->
<!-- .apply with weird predicate function -->
<!-- Optimization example -->
<!-- Importance sampling example -->
<!-- bootstrap -->
<!-- numerical integration -->
<!-- 3. verbal, open-ended wuestion about what happens when you make a typo in a variable name? R versus Python? -->
<!-- 4. try to think of the most complicated program that uses two functions that both use (and modify) globals -->