# Classification
```{r}
#| echo: false
set.seed(1234)
source("_common.R")
```
This lab will be our first experience with classification models. These models differ from the regression models we saw in the last chapter in that the response variable is qualitative instead of continuous.
This chapter will use [parsnip](https://www.tidymodels.org/start/models/) for model fitting and [recipes and workflows](https://www.tidymodels.org/start/recipes/) to perform the transformations.
## The Stock Market Data
We load tidymodels for the modeling functions, ISLR and ISLR2 for data sets, [discrim](https://discrim.tidymodels.org/) to give us access to discriminant analysis models such as LDA and QDA as well as the Naive Bayes model, and [poissonreg](https://poissonreg.tidymodels.org/) for Poisson regression.
```{r}
#| message: false
library(tidymodels)
library(ISLR) # For the Smarket data set
library(ISLR2) # For the Bikeshare data set
library(discrim)
library(poissonreg)
```
```{r}
#| echo: false
select <- dplyr::select
```
We will be examining the `Smarket` data set for this lab. It contains a number of numeric variables plus a variable called `Direction` which has the two labels `"Up"` and `"Down"`. Before we go on to modeling, let us take a look at the correlation between the variables.
To look at the correlation, we will use the [corrr](https://corrr.tidymodels.org/) package. The `correlate()` function will calculate the correlation matrix of all the variables it is given. We will therefore remove `Direction`, as it is not numeric.
Then we pass that to `rplot()` to quickly visualize the correlation matrix. I have also changed the `colours` argument to better see what is going on.
```{r}
#| fig-alt: |
#| Correlation chart. Most values are very close to 0.
#| Year and Volume appear quite correlated.
library(corrr)
cor_Smarket <- Smarket %>%
select(-Direction) %>%
correlate()
rplot(cor_Smarket, colours = c("indianred2", "black", "skyblue1"))
```
And we see that these variables are more or less uncorrelated with each other. The one exception is `Year` and `Volume`, which are somewhat correlated.
If you want a heatmap-style correlation chart, you can also create one manually.
```{r}
#| fig-alt: |
#| Correlation chart. Most values are very close to 0.
#| Year and Volume appear quite correlated.
library(paletteer)
cor_Smarket %>%
stretch() %>%
ggplot(aes(x, y, fill = r)) +
geom_tile() +
geom_text(aes(label = as.character(fashion(r)))) +
scale_fill_paletteer_c("scico::roma", limits = c(-1, 1), direction = -1)
```
If we plot `Year` against `Volume` we see that there is an upwards trend in `Volume` with time.
```{r}
#| fig-alt: |
#| Jittered scatter chart. Jittered around year along the
#| x-axis. Volume along the y-axis. Fairly wide scattering
#| along volume. Slight increase in volume as year increases.
ggplot(Smarket, aes(Year, Volume)) +
geom_jitter(height = 0)
```
## Logistic Regression
Now we will fit a logistic regression model. We will again use the parsnip package, and we will use `logistic_reg()` to create a logistic regression model specification.
```{r}
lr_spec <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
```
Notice that while I did set the engine and mode, they are just restating the defaults.
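You can confirm this by printing a bare specification; the defaults are already `"glm"` and `"classification"`:
```{r}
# Printing the bare specification shows the default engine and mode
logistic_reg()
```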
We can now fit the model like normal. We want to model the `Direction` of the stock market based on the percentage return from the 5 previous days plus the volume of shares traded.
Fitting a classification model with parsnip requires that the response variable is a factor. This is the case for the `Smarket` data set, so we don't need to make any adjustments.
```{r}
lr_fit <- lr_spec %>%
fit(
Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Smarket
)
lr_fit
```
This fit is done using the `glm()` function, and it comes with a very handy `summary()` method as well.
```{r}
lr_fit %>%
pluck("fit") %>%
summary()
```
This lets us see several things, such as parameter estimates, standard errors, p-values, and model fit statistics. We can use the `tidy()` function to extract some of these model attributes for further analysis or presentation.
```{r}
tidy(lr_fit)
```
Predictions are done much the same way. Here we use the model to predict on the data it was trained on.
```{r}
predict(lr_fit, new_data = Smarket)
```
The result is a tibble with a single column `.pred_class` which will be a factor variable of the same labels as the original training data set.
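If you want to double-check that, the factor levels of the predictions should match the levels of `Direction`; a quick check:
```{r}
# The predicted classes keep the factor levels of the response
predict(lr_fit, new_data = Smarket) %>%
  pull(.pred_class) %>%
  levels()
```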
We can also get back probability predictions by specifying `type = "prob"`.
```{r}
predict(lr_fit, new_data = Smarket, type = "prob")
```
Note that we get back a column for each of the classes. With only two classes this is a little redundant, since one column is simply one minus the other, but once we get to multiclass models it becomes quite handy.
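A minimal check that the two columns are indeed complementary, i.e. that each row sums to one:
```{r}
# For two classes, .pred_Down + .pred_Up should equal 1 in every row
predict(lr_fit, new_data = Smarket, type = "prob") %>%
  mutate(total = .pred_Down + .pred_Up) %>%
  summarise(all_one = all(near(total, 1)))
```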
Using `augment()` we can add the predictions to the data frame and then use that to look at model performance metrics. Before we calculate the metrics directly, I find it useful to look at the [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix). This shows how well the predictive model is performing by giving a table of predicted values against the true values.
```{r}
augment(lr_fit, new_data = Smarket) %>%
conf_mat(truth = Direction, estimate = .pred_class)
```
A well-performing model would ideally have high numbers along the diagonal (upper-left to lower-right) and small numbers on the off-diagonal. We see here that the model isn't great, as it tends to predict `"Down"` as `"Up"` more often than it should.
If you want a more visual representation of the confusion matrix, you can pipe the result of `conf_mat()` into `autoplot()` to generate a ggplot2 chart.
```{r}
#| fig-alt: |
#| Confusion matrix chart. Truth along the x-axis,
#| prediction along the y-axis. Up is predicted vastly
#| more than Down, regardless of the true value.
augment(lr_fit, new_data = Smarket) %>%
conf_mat(truth = Direction, estimate = .pred_class) %>%
autoplot(type = "heatmap")
```
We can also calculate various performance metrics. One of the most common metrics is accuracy, which is how often the model predicted correctly as a percentage.
```{r}
augment(lr_fit, new_data = Smarket) %>%
accuracy(truth = Direction, estimate = .pred_class)
```
And we see that the accuracy isn't great either, which was already obvious from the confusion matrix.
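To put that number in context, we can compare it to the majority-class baseline, the accuracy we would get by always predicting the most common direction:
```{r}
# Baseline accuracy: the proportion of the most common class
Smarket %>%
  count(Direction) %>%
  mutate(prop = n / sum(n))
```
The fitted model only barely beats always predicting `"Up"`.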
We just fit a model and evaluated it on the same data. This doesn't tell us much about how the model would perform on new data. Let us instead split up the data, train the model on one part, and evaluate it on the other. Since we are working with data that has a time component, it is natural to fit the model using all but the last year of data and evaluate it on the final year. This more closely matches how such a model would be used in real life.
```{r}
Smarket_train <- Smarket %>%
filter(Year != 2005)
Smarket_test <- Smarket %>%
filter(Year == 2005)
```
Now that we have split the data into `Smarket_train` and `Smarket_test` we can fit a logistic regression model to `Smarket_train` and evaluate it on `Smarket_test` to see how well the model generalizes.
```{r}
lr_fit2 <- lr_spec %>%
fit(
Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Smarket_train
)
```
And we will evaluate on the testing data set.
```{r}
augment(lr_fit2, new_data = Smarket_test) %>%
conf_mat(truth = Direction, estimate = .pred_class)
augment(lr_fit2, new_data = Smarket_test) %>%
accuracy(truth = Direction, estimate = .pred_class)
```
We see that this model is now more likely to predict `"Down"` than `"Up"`. Also, note how the model performs worse than the previous fit. This is expected, since we are evaluating on new data.
We recall that the logistic regression model had underwhelming p-values. Let us see what happens if we remove some of the variables that appear not to be helpful. We might achieve a more predictive model, since variables that have no relationship with the response tend to increase the variance without a corresponding decrease in bias.
```{r}
lr_fit3 <- lr_spec %>%
fit(
Direction ~ Lag1 + Lag2,
data = Smarket_train
)
augment(lr_fit3, new_data = Smarket_test) %>%
conf_mat(truth = Direction, estimate = .pred_class)
augment(lr_fit3, new_data = Smarket_test) %>%
accuracy(truth = Direction, estimate = .pred_class)
```
And we see an increase in performance. The model is still not perfect, but it is starting to perform better.
Suppose that we want to predict the returns associated with particular values of `Lag1` and `Lag2`. In particular, we want to predict `Direction` on a day when `Lag1` and `Lag2` equal 1.2 and 1.1, respectively, and on a day when they equal 1.5 and −0.8.
For this, we start by creating a tibble corresponding to the scenarios we want to predict for:
```{r}
Smarket_new <- tibble(
Lag1 = c(1.2, 1.5),
Lag2 = c(1.1, -0.8)
)
```
And then we use `predict()`:
```{r}
predict(
lr_fit3,
new_data = Smarket_new,
type = "prob"
)
```
## Linear Discriminant Analysis
Now we will perform LDA on the `Smarket` data. We will use the `discrim_linear()` function to create an LDA specification. We will continue to use 2 predictors for easy comparison.
```{r}
lda_spec <- discrim_linear() %>%
set_mode("classification") %>%
set_engine("MASS")
lda_fit <- lda_spec %>%
fit(Direction ~ Lag1 + Lag2, data = Smarket_train)
lda_fit
```
One of the things to look for in the LDA output is the group means. We see that there is a slight difference between the means of the two groups. These suggest that there is a tendency for the previous 2 days' returns to be negative on days when the market increases, and a tendency for the previous 2 days' returns to be positive on days when the market declines.
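If you want the group means on their own, you can pull them out of the underlying MASS fit, in the same way we used `pluck()` earlier:
```{r}
# The "means" element of the MASS::lda object holds the average
# of each predictor within each class
lda_fit %>%
  pluck("fit") %>%
  pluck("means")
```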
Predictions are done just the same as with logistic regression:
```{r}
predict(lda_fit, new_data = Smarket_test)
predict(lda_fit, new_data = Smarket_test, type = "prob")
```
And we can take a look at the performance.
```{r}
augment(lda_fit, new_data = Smarket_test) %>%
conf_mat(truth = Direction, estimate = .pred_class)
augment(lda_fit, new_data = Smarket_test) %>%
accuracy(truth = Direction, estimate = .pred_class)
```
And we see no marked difference in performance between this model and the logistic regression model.
## Quadratic Discriminant Analysis
We will now fit a QDA model. The `discrim_quad()` function is used here.
Once we have the model specification, fitting the model is just like before.
```{r}
qda_spec <- discrim_quad() %>%
set_mode("classification") %>%
set_engine("MASS")
qda_fit <- qda_spec %>%
fit(Direction ~ Lag1 + Lag2, data = Smarket_train)
```
```{r}
augment(qda_fit, new_data = Smarket_test) %>%
conf_mat(truth = Direction, estimate = .pred_class)
augment(qda_fit, new_data = Smarket_test) %>%
accuracy(truth = Direction, estimate = .pred_class)
```
And we see another increase in accuracy. However, this model still rarely predicts `"Down"`. This makes it appear that the quadratic form assumed by QDA captures the relationship more accurately.
## Naive Bayes
We will now fit a Naive Bayes model to the `Smarket` data. For this, we will be using the `naive_Bayes()` function to create the specification and also set the `usekernel` argument to `FALSE`. This means that we are assuming that the predictors `Lag1` and `Lag2` are drawn from Gaussian distributions.
Once the model is specified, the fitting process is exactly like before:
```{r}
nb_spec <- naive_Bayes() %>%
set_mode("classification") %>%
set_engine("klaR") %>%
set_args(usekernel = FALSE)
nb_fit <- nb_spec %>%
fit(Direction ~ Lag1 + Lag2, data = Smarket_train)
```
Once the model is fit, we can create the confusion matrix based on the testing data and also assess the model accuracy.
```{r}
augment(nb_fit, new_data = Smarket_test) %>%
conf_mat(truth = Direction, estimate = .pred_class)
```
```{r}
augment(nb_fit, new_data = Smarket_test) %>%
accuracy(truth = Direction, estimate = .pred_class)
```
The accuracy of the Naive Bayes model is very similar to that of the QDA model. This seems reasonable, since the scatter plot below shows no apparent relationship between `Lag1` and `Lag2`, so the Naive Bayes assumption of independently distributed predictors is not unreasonable.
```{r}
#| message: false
#| fig-alt: |
#| Scatter chart. Lag1 along the x-axis and Lag2 along the
#| y-axis. No apparent correlation between Lag1 and Lag2.
ggplot(Smarket, aes(Lag1, Lag2)) +
geom_point(alpha = 0.1, size = 2) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "No apparent correlation between Lag1 and Lag2")
```
## K-Nearest Neighbors
Lastly, let us take a look at a K-Nearest Neighbors model. This is the first model we have looked at that has a hyperparameter we need to specify. I have set it to 3 with `neighbors = 3`. Fitting is done as usual.
```{r}
knn_spec <- nearest_neighbor(neighbors = 3) %>%
set_mode("classification") %>%
set_engine("kknn")
knn_fit <- knn_spec %>%
fit(Direction ~ Lag1 + Lag2, data = Smarket_train)
knn_fit
```
And evaluation is done the same way:
```{r}
augment(knn_fit, new_data = Smarket_test) %>%
conf_mat(truth = Direction, estimate = .pred_class)
augment(knn_fit, new_data = Smarket_test) %>%
accuracy(truth = Direction, estimate = .pred_class)
```
It appears that this model is not performing that well.
We will try using a K-nearest neighbors model in an application to caravan insurance data. This data set includes 85 predictors that measure demographic characteristics for 5822 individuals. The response variable is `Purchase`, which indicates whether or not a given individual purchases a caravan insurance policy. In this data set, only 6% of people purchased caravan insurance.
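We can confirm the class imbalance with a quick count:
```{r}
# Purchase is heavily imbalanced: roughly 6% "Yes"
Caravan %>%
  count(Purchase) %>%
  mutate(prop = n / sum(n))
```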
We want to build a predictive model that uses the demographic characteristics to predict whether an individual will purchase caravan insurance. Before we go on, we split the data into a training data set and a testing data set. (This is not the proper way to do this; see the next chapter for the correct approach.)
```{r}
Caravan_test <- Caravan[seq_len(1000), ]
Caravan_train <- Caravan[-seq_len(1000), ]
```
Since we are using a K-nearest neighbors model, it is important that the variables are centered and scaled so that they all have a uniform influence. We can accomplish this transformation with `step_normalize()`, which does centering and scaling in one go.
```{r}
rec_spec <- recipe(Purchase ~ ., data = Caravan_train) %>%
step_normalize(all_numeric_predictors())
```
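If you want to verify that the recipe does what we expect, you can `prep()` it and `bake()` the training data; a minimal sanity check on the first two predictors (`MOSTYPE` and `MAANTHUI`):
```{r}
# After step_normalize(), numeric predictors should have
# mean 0 and standard deviation 1 on the training data
rec_spec %>%
  prep() %>%
  bake(new_data = NULL) %>%
  summarise(across(c(MOSTYPE, MAANTHUI), list(mean = mean, sd = sd)))
```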
We will be trying different values of K to see how the number of neighbors affects the model performance. A workflow object is created, with just the recipe added.
```{r}
Caravan_wf <- workflow() %>%
add_recipe(rec_spec)
```
Next we create a general KNN model specification.
```{r}
knn_spec <- nearest_neighbor() %>%
set_mode("classification") %>%
set_engine("kknn")
```
We can then use this model specification along with `Caravan_wf` to create 3 full workflow objects for `K = 1, 3, 5`.
```{r}
knn1_wf <- Caravan_wf %>%
add_model(knn_spec %>% set_args(neighbors = 1))
knn3_wf <- Caravan_wf %>%
add_model(knn_spec %>% set_args(neighbors = 3))
knn5_wf <- Caravan_wf %>%
add_model(knn_spec %>% set_args(neighbors = 5))
```
With all these workflow specifications, we can fit the models one by one.
```{r}
knn1_fit <- fit(knn1_wf, data = Caravan_train)
knn3_fit <- fit(knn3_wf, data = Caravan_train)
knn5_fit <- fit(knn5_wf, data = Caravan_train)
```
And we can calculate all the confusion matrices.
```{r}
augment(knn1_fit, new_data = Caravan_test) %>%
conf_mat(truth = Purchase, estimate = .pred_class)
```
```{r}
augment(knn3_fit, new_data = Caravan_test) %>%
conf_mat(truth = Purchase, estimate = .pred_class)
```
```{r}
augment(knn5_fit, new_data = Caravan_test) %>%
conf_mat(truth = Purchase, estimate = .pred_class)
```
And it appears that the model performance doesn't change much when going from 1 to 5 neighbors.
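Accuracy here is dominated by the majority `"No"` class, though. A more telling question is how often the model is right when it predicts `"Yes"`; one way to check this is yardstick's `ppv()` (positive predictive value). A minimal sketch for the K = 5 fit:
```{r}
# Of the customers predicted to purchase ("Yes"), how many did?
# Purchase has levels "No", "Yes", so event_level = "second"
# treats "Yes" as the event of interest.
augment(knn5_fit, new_data = Caravan_test) %>%
  ppv(truth = Purchase, estimate = .pred_class, event_level = "second")
```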
## Poisson Regression
So far we have been using the `Smarket` data set to predict the stock price movement. We will now shift to a new data set, `Bikeshare`, and look at the number of bike rentals per hour in Washington, D.C.
The variable of interest, *number of bike rentals per hour*, can take on non-negative integer values. This makes Poisson regression a suitable candidate for modeling it.
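Concretely, Poisson regression models the log of the expected count as a linear function of the predictors:

$$
\log \mathbb{E}[Y \mid X] = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p
$$

so the fitted mean, the exponential of the linear predictor, is always non-negative.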
We start with specifying the model using the `poisson_reg()` function.
```{r}
pois_spec <- poisson_reg() %>%
set_mode("regression") %>%
set_engine("glm")
```
Here we will be predicting `bikers` using the following predictors:
* `mnth` - month of the year, coded as a factor
* `hr` - hour of the day, coded as a factor from 0 to 23
* `workingday` - Is it a workday? Already coded as a dummy variable with Yes = 1, No = 0
* `temp` - normalized temperature in Celsius
* `weathersit` - weather condition, again coded as a factor with the following levels:
* clear
* cloudy/misty
* light rain/snow
* heavy rain/snow
As we can see, apart from `temp` all the other predictors are categorical. Thus, we will first create a recipe to convert these into dummy variables and then bundle the model spec and recipe using a workflow.
```{r}
pois_rec_spec <- recipe(
bikers ~ mnth + hr + workingday + temp + weathersit,
data = Bikeshare
) %>%
step_dummy(all_nominal_predictors())
```
```{r}
pois_wf <- workflow() %>%
add_recipe(pois_rec_spec) %>%
add_model(pois_spec)
```
With the workflow in place, we follow the same pattern to fit the model and look at the predictions.
```{r}
#| fig.asp: 1.0
#| fig.alt: |
#| Scatter chart. bikers along the x-axis and .pred along
#| the y-axis. A diagonal line has been added. Points are
#| scattered around the diagonal. More closely for low values.
pois_fit <- pois_wf %>% fit(data = Bikeshare)
augment(pois_fit, new_data = Bikeshare, type.predict = "response") %>%
ggplot(aes(bikers, .pred)) +
geom_point(alpha = 0.1) +
geom_abline(slope = 1, linewidth = 1, color = "grey40") +
labs(title = "Predicting the number of bikers per hour using Poisson Regression",
x = "Actual", y = "Predicted")
```
We can also look at the model coefficients to get a feel for how the model works and compare it with our own understanding.
Looking at the coefficients corresponding to the `mnth` variable, we note that they are lower in the winter months and higher in the summer months. This seems logical, as we would expect the number of bike rentals to be higher during summertime.
```{r}
#| fig.asp: 0.6
#| fig.alt: |
#| Line chart. months along the x-axis, coefficient along the
#| y-axis. Coefficient values for January and February are low
#| Rest of the months are high.
pois_fit_coef_mnths <-
tidy(pois_fit) %>%
filter(grepl("^mnth", term)) %>%
mutate(
term = stringr::str_replace(term, "mnth_", ""),
term = forcats::fct_inorder(term)
)
pois_fit_coef_mnths %>%
ggplot(aes(term, estimate)) +
geom_line(group = 1) +
geom_point(shape = 21, size = 3, stroke = 1.5,
fill = "black", color = "white") +
labs(title = "Coefficient value from Poission Regression",
x = "Month", y = "Coefficient")
```
We can similarly also look at the coefficients corresponding to the `hr` variable. Here the peaks occur at 8:00 AM and 5:00 PM, i.e. during normal office start and end times.
```{r}
#| fig.asp: 0.6
#| fig.alt: |
#| Line chart. hours along the x-axis, coefficient along the
#| y-axis. Coefficient values for hour between 1 and 7 are
#| low, the rest are higher.
pois_fit_coef_hr <-
tidy(pois_fit) %>%
filter(grepl("^hr", term)) %>%
mutate(
term = stringr::str_replace(term, "hr_X", ""),
term = forcats::fct_inorder(term)
)
pois_fit_coef_hr %>%
ggplot(aes(term, estimate)) +
geom_line(group = 1) +
geom_point(shape = 21, size = 3, stroke = 1.5,
fill = "black", color = "white") +
labs(title = "Coefficient value from Poission Regression",
x = "hours", y = "Coefficient")
```
## Extra - comparing multiple models
This section is new and not part of ISLR. We have fitted a lot of different models in this lab, and we were able to calculate the performance metrics one by one, but that is not ideal if we want to compare the models. Below is an example of how you can more conveniently calculate performance metrics for multiple models at the same time.
Start off by creating a named list of the fitted models you want to evaluate. I have made sure to include only models that were fitted with the same predictors, to make them easier to compare.
```{r}
models <- list("logistic regression" = lr_fit3,
"LDA" = lda_fit,
"QDA" = qda_fit,
"KNN" = knn_fit)
```
Next use `imap_dfr()` from the [purrr](https://purrr.tidyverse.org/) package to apply `augment()` to each of the models using the testing data set. `.id = "model"` creates a column named `"model"` that is added to the resulting tibble using the names of `models`.
```{r}
preds <- imap_dfr(models, augment,
new_data = Smarket_test, .id = "model")
preds %>%
select(model, Direction, .pred_class, .pred_Down, .pred_Up)
```
We have seen how to use `accuracy()` a lot of times by now, but it is not the only metric to use for classification, and yardstick provides [many more](https://yardstick.tidymodels.org/reference/index.html#section-classification-metrics).
You can combine multiple metrics together with `metric_set()`:
```{r}
multi_metric <- metric_set(accuracy, sensitivity, specificity)
```
and then the resulting function can be applied to calculate multiple metrics at the same time. All of the yardstick metric functions work with grouped tibbles, so by calling `group_by(model)` we can calculate the metrics for each of the models in one go.
```{r}
preds %>%
group_by(model) %>%
multi_metric(truth = Direction, estimate = .pred_class)
```
The same technique can be used to create ROC curves.
```{r}
#| fig-alt: |
#| A ROC curve plot. 1-specificity along the x-axis and
#| sensitivity along the y-axis. A dotted line is drawn
#| along the diagonal. One curve for each of the 4 models
#| KNN, LDA, QDA and Logistic regression is drawn.
#| The curves are all fairly close to the diagonal for
#| all models with KNN doing the absolutely worst.
preds %>%
group_by(model) %>%
roc_curve(Direction, .pred_Down) %>%
autoplot()
```
Here you can't see the LDA curve because it lies almost perfectly under the logistic regression curve.
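To attach a single number to each curve, the same grouped approach works with `roc_auc()`:
```{r}
# Area under the ROC curve, computed per model
preds %>%
  group_by(model) %>%
  roc_auc(Direction, .pred_Down)
```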