# Understanding the Features {#sec-knowing-feature}
![](img/chapter_gp_plots/gp_plot_16.svg){width=75% fig-align='center'}
Assuming our model is adequate, let's now turn our attention to the features. There's a lot to unpack here, so let's get started!
## Key Ideas {#sec-knowing-feature-key}
- There are many ways to get to know your features, and though it can be challenging, it can also be very rewarding!
- Visualizations can help you understand how your model is making predictions and which features are important.
- Feature importance is difficult to determine even in the simplest of model settings, but there are tools to help you understand how much each feature contributes to a prediction.
### Why this matters {#sec-knowing-feature-why}
We often need to, or simply want to, know why our predictions come about. What's driving them? What happens for specific observations? By exploring the features, we can begin to better understand our model, and potentially improve it. We can also get a sense of what's important in our data and what's not, which can help us make better decisions.
### Helpful context {#sec-knowing-feature-good}
We'd suggest having the linear model basics down pretty well, including basic logistic regression (@sec-glm-logistic), as much of what we explore here can be understood most easily in that setting. If you already have machine learning models down too, you're in good shape, as this can all apply in those modeling contexts as well.
## Data Setup
We'll use the movie review data as with our previous chapters. Later on, we'll also use the world happiness data set to explore some more advanced concepts. For the movie review data, we'll split the data into training and testing sets, and then fit a linear regression model and a logistic regression model to the training data. We'll then use the testing data to evaluate the models.
:::{.panel-tabset}
##### R
```{r}
#| label: r-regression-import-split
# all data found on github repo
df_reviews = read_csv('https://tinyurl.com/moviereviewsdata')
set.seed(123)
initial_split = sample(
x = 1:nrow(df_reviews),
size = nrow(df_reviews) * .75,
replace = FALSE
)
df_train = df_reviews[initial_split, ]
df_test = df_reviews[-initial_split, ]
model_lr_train = lm(
rating ~
review_year_0
+ release_year_0
+ age_sc
+ length_minutes_sc
+ total_reviews_sc
+ word_count_sc
+ genre
,
df_train
)
model_class_train = glm(
rating_good ~
review_year_0
+ release_year_0
+ age_sc
+ length_minutes_sc
+ total_reviews_sc
+ word_count_sc
+ genre
,
df_train,
family = binomial
)
```
##### Python
```{python}
#| label: python-regression-import-split
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split
# all data found on github repo
df_reviews = pd.read_csv('https://tinyurl.com/moviereviewsdata')
df_train, df_test = train_test_split(
df_reviews,
test_size = 0.25,
random_state = 123
)
# we'll use this later
features = [
"review_year_0",
"release_year_0",
"age_sc",
"length_minutes_sc",
"total_reviews_sc",
"word_count_sc",
"genre",
]
model = 'rating ~ ' + " + ".join(features)
model_lr_train = smf.ols(formula = model, data = df_train).fit()
model = 'rating_good ~ ' + " + ".join(features)
model_class_train = smf.glm(
formula = model,
data = df_train,
family = sm.families.Binomial()
).fit()
```
:::
## Basic model parameters {#sec-knowing-feature-params}
We saw in the linear model chapter ([-@sec-lm-interpretation-feature]) that we can get a lot out of the basic output from standard linear models. Our starting point should be the coefficients or weights, which can give us a sense of the direction and magnitude of the relationship between the feature and the target given their respective scales. We can also look at the standard errors and confidence intervals to get a sense of the uncertainty in those estimates. Here is a basic summary of the coefficients for our regression model on the training data.
\footnotesize
```{r}
#| echo: false
#| label: tbl-r-regression-coefs
#| tbl-cap: Coefficients for Our Regression Model
broom::tidy(model_lr_train, conf.int = TRUE) |>
janitor::clean_names() |>
rename(feature = term) |>
mutate(feature = ifelse(feature == "(Intercept)", "intercept", feature)) |>
gt(decimals=2)
```
\normalsize
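The table above is produced behind the scenes; if you want something similar for your own model, a minimal sketch is shown below (it assumes the broom package is installed, but base R functions get you most of the way there too).

```{r}
#| label: coef-summary-sketch
#| eval: false
# coefficients with confidence intervals for the training model
broom::tidy(model_lr_train, conf.int = TRUE)

# or with base R
coef(model_lr_train)
confint(model_lr_train)
```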
We also noted how we can get a bit more relative comparison by using standardized coefficients, or some other scaling of the coefficients that allows for a bit of a more apples-to-apples comparison. But as we'll see, in the real world even if we have just apples, there are fuji, gala, granny smith, honeycrisp, and many other kinds, and some may be good for snacks, others for baking pies, some are good for cider, etc. In other words, **there is no one size fits all approach to understanding how a feature contributes to understanding the target**, and the sooner you grasp that, the better.
## Feature Contributions {#sec-knowing-feature-contrib}
We can also look at the contribution of a feature to the model's explanatory power, namely through its predictions. To start our discussion, we don't want to lean too heavily on the phrase **feature importance** yet, because as we'll see later, trying to rank features by an importance metric is difficult at best, and a misguided endeavor at worst. We can, however, look at the **feature contribution** to the model's predictions, and come to a conclusion about whether we think a feature is practically important; we just need to be careful about how we do it.
Truly understanding feature contribution is a bit more complicated than just looking at the coefficient if we're using any model that isn't a linear regression, and there are many ways to go about it. We know we can't compare raw coefficients across features, because they are on different scales. But even when we put them on the same scale, it may be very easy for some features to move, e.g., one standard deviation, and very hard for others. Binary features can only be on or off, while numeric features can move around more, but numeric features may also be highly skewed. We also can't use statistical significance based on p-values, because they reflect sample size as much as or more than effect size.
So what are we to do? What you need to know to get started looking at a feature's contribution includes the following:
- feature range and variability
- feature distributions (e.g., skewness)
- representative values of the feature
- target range and variability
- feature interactions and correlations
We can't necessarily do a whole lot about these aspects, but we can at least be aware of them to help us understand any effects we want to explore. And just as importantly, knowing this sort of information can help us be aware of the limitations of our understanding of these effects. In any case, let's try to get a sense of how we can understand the contribution of a feature to our model.
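None of this requires anything fancy. A minimal sketch of how you might inspect these aspects for the training data follows, using the column names from the setup above; swap in whatever summaries you prefer.

```{r}
#| label: feature-inspection-sketch
#| eval: false
# range and variability of the numeric features
summary(df_train |> select(where(is.numeric)))

# a quick look at a single feature's distribution (e.g., skewness)
hist(df_train$word_count_sc)

# distribution of a categorical feature
table(df_train$genre)

# feature correlations
cor(df_train |> select(age_sc, length_minutes_sc, total_reviews_sc, word_count_sc))
```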
## Marginal Effects {#sec-marginal-effects}
One way to understand the contribution of a feature to the model is to look at the **marginal effect** of the feature, which conceptually attempts to boil a feature effect down to something simple like an average. Unfortunately, not everyone means the same thing when they use this term, and it can be a bit confusing. Marginal effects *typically* refer to a partial derivative of the target with respect to the feature. Oh no! Math! However, this becomes very simple for standard linear models with no interactions and all linear effects, as in linear regression: the derivative of the target with respect to the feature is just the coefficient itself (a quick numerical check of this follows the list below). But for more complicated models, even just a classification model like our logistic regression, which is nonlinear on the probability scale, we need to do a bit more work to get the marginal effect, or other so-called *average* effects. Let's think about a couple of common versions:
- Average slope, Average Marginal Effect
- Marginal effect at the mean
- Marginal Means (for categorical features)
- Counterfactuals and other predictions at key feature values
Note that if you need an introduction or refresher on logistic regression, you'll find it in the GLM chapter ([-@sec-glm-logistic]).
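Before getting to those, here is the quick numerical check of the 'derivative equals coefficient' claim for the linear regression model from the setup. It uses the same finite-difference idea we'll lean on again shortly; this is just a sketch, and the tiny step size `eps` is an arbitrary choice.

```{r}
#| label: linear-marginal-effect-check
#| eval: false
# nudge word count by a tiny amount and see how much the prediction moves;
# for a purely linear model this recovers the coefficient exactly
eps = 1e-6

d_pred = (
    predict(model_lr_train, newdata = df_train |> mutate(word_count_sc = word_count_sc + eps)) -
        predict(model_lr_train, newdata = df_train)
) / eps

mean(d_pred)
coef(model_lr_train)['word_count_sc']
```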
### Marginal effects at the mean {#sec-marginal-effects-mean}
First let's think about an average slope. This is the average of the slopes across the feature's values, or the values of another feature it interacts with. To begin, let's just look at the effect of word count, and we'll do this for the logistic regression model to make things more interesting. A good question to start with is: how do we visualize the relationship/effect?
Here are two plots; both are useful, neither is inherently wrong, and yet they tell us something different. The first plot shows the predicted probability of a good review as word count changes, *with all other features at their mean* (or mode for categorical features like genre). We can see that on the probability scale, this relationship is negative, with a bit of a curve that gets steeper at higher word count values.
:::{.content-visible when-format='html'}
```{r}
#| echo: false
#| label: fig-r-mem-vs-pdp
#| fig-cap: Marginal Effect at the Mean vs. Partial Dependence Plot
p_dat = ggeffects::ggpredict(model_class_train, terms = "word_count_sc")
pred_0 = p_dat |> filter(x == 0) |> pull(predicted)
p_mem = plot(p_dat, colors = c(okabe_ito[['darkblue']])) +
geom_point(x = 0, y = pred_0, size = 4) +
labs(
# title = 'Predicted Probability of Good Review',
title = '',
subtitle = 'Other features at their mean/mode',
x = 'Word Count (standardized)',
y = ''
) +
scale_y_continuous(
breaks = seq(0, 1, .1),
limits = c(0, 1),
labels = scales::percent
) +
coord_cartesian(xlim = c(-2, 2)) +
theme_clean()
# autoplot(pdp::partial(model_class_train, pred.var='word_count_sc', type = 'classification', plot.engine='ggplot2'))
df_pdp = plot(
DALEX::model_profile(
explainer = DALEX::explain(model_class_train, verbose=FALSE),
N = NULL,
variables = 'word_count_sc'
)
)$data
avg_pred_0 = predict(model_class_train, newdata=df_train |>
mutate(word_count_sc=0), type = 'response') |>
mean()
p_pdp = df_pdp |>
rename(
'word_count_sc' = `_x_`,
'prob' = `_yhat_`
) |>
ggplot(aes(x = word_count_sc, y = prob)) +
geom_line() +
geom_point(x = 0, y = avg_pred_0, size = 4) +
scale_y_continuous(
breaks = seq(0, 1, .1),
limits = c(0, 1),
labels = scales::percent
) +
labs(
subtitle = 'Partial Dependence Plot',
# title = 'Predicted Probability of Good Review',
x = 'Word Count (standardized)',
y = ''
) +
coord_cartesian(xlim = c(-2, 2))
p_mem +
scale_color_manual(values = c(okabe_ito[['darkblue']], okabe_ito[['orange']])) +
theme_clean() +
p_pdp +
plot_layout(guides = 'collect', ) +
plot_annotation(title = 'Predicted Probability of Good Review') &
theme(
plot.title = element_text(size = 14, hjust = 0),
plot.subtitle = element_text(size = 12, hjust = 0),
axis.line.y = element_line(color = "gray90"),
axis.line.x = element_line(color = "gray90"),
axis.title.x = element_text(size = 10),
panel.grid.minor.x = element_blank(),
panel.grid.minor.y = element_blank(),
)
ggsave("img/knowing-mem-vs-pdp.svg", width = 8, height = 6)
```
:::
:::{.content-visible when-format='pdf'}
![Marginal Effect at the Mean vs. Partial Dependence Plot](img/knowing-mem-vs-pdp.svg){#fig-r-mem-vs-pdp}
:::
The second plot shows what is called a **partial dependence plot**, which shows the *average predicted probability* of a good review as word count changes. In both cases we make predictions with imputed values: the left plot imputes the other features to be their mean or mode, while the right plot leaves the other features at their actual values, and then, using a range of values for word count, gets a prediction as if every observation had that value for word count. We then plot the average of the predictions for each value in the range.
Let's demystify this by calculating it ourselves.
:::{.panel-tabset}
##### R
```{r}
#| label: r-marginal-effects-word-count
#| eval: false
#| results: hide
#| fig.cap: Average Slope of Word Count by Genre
df_typical = tibble(
genre = names(which.max(table(df_train$genre))),
age_sc = mean(df_train$age_sc),
release_year_0 = mean(df_train$release_year_0),
length_minutes_sc = mean(df_train$length_minutes_sc),
total_reviews_sc = mean(df_train$total_reviews_sc),
word_count_sc = mean(df_train$word_count_sc),
review_year_0 = mean(df_train$review_year_0)
)
# avg prediction when everything is typical
avg_pred = predict(model_class_train, newdata = df_typical, type = "response") |>
mean()
# avg prediction when word count is at its mean
avg_pred_0 = predict(
model_class_train,
newdata = df_train |> mutate(word_count_sc = 0),
type = 'response'
) |>
mean()
c(avg_pred, avg_pred_0)
```
##### Python
```{python}
#| label: python-marginal-effects-word-count
#| eval: false
#| results: hide
#| fig.cap: Average Slope of Word Count by Genre
df_typical = pd.DataFrame({
"genre": df_train.genre.mode().values,
"age_sc": df_train.age_sc.mean(),
"release_year_0": df_train.release_year_0.mean(),
"length_minutes_sc": df_train.length_minutes_sc.mean(),
"total_reviews_sc": df_train.total_reviews_sc.mean(),
"word_count_sc": df_train.word_count_sc.mean(),
"review_year_0": df_train.review_year_0.mean()
})
# avg prediction when everything is typical
avg_pred_typical = model_class_train.predict(df_typical).mean()
# avg prediction when word count is at its mean
avg_pred_0 = model_class_train.predict(
df_train.assign(word_count_sc = 0)
).mean()
avg_pred_typical.round(3), avg_pred_0.round(3)
```
:::
When word count is zero, i.e., at its mean, and everything else is at its mean/mode, we'd predict a chance of a good review of about `r label_percent()(pred_0)`. In other words, when everything is typical, we have a pretty good chance of getting a good review. The average prediction we'd get if we predicted every observation as if it were at the mean word count is more like `r label_percent()(avg_pred_0)`, which is notably less. Which is correct? Both, or neither! They are telling us different things, either of which may be useful, or not. If it's doubtful that the feature values used in the calculation are realistic (e.g., everything at its mean at the same time, or an average word count when the length of a movie is at its minimum), then they may both be misleading. You have to know your features and your target to know how best to use the information.
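The same logic extends to the full partial dependence curve in the earlier figure: repeat the 'predict as if every observation had this value' step over a grid of word counts and average each time. A minimal sketch follows; the grid range here is an arbitrary choice.

```{r}
#| label: pdp-by-hand-word-count
#| eval: false
# by-hand partial dependence for word count: for each grid value, predict as if
# every observation had that value, then average the predicted probabilities
grid = seq(-2, 2, by = 0.25)

pdp_word_count = map_df(
    grid,
    ~ tibble(
        word_count_sc = .x,
        avg_prob = mean(
            predict(
                model_class_train,
                newdata = df_train |> mutate(word_count_sc = .x),
                type = 'response'
            )
        )
    )
)

# plotting avg_prob against word_count_sc reproduces the PDP shown earlier
```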
### Average marginal effects {#sec-avg-marginal-effects}
Let's say we want to boil our understanding of the feature-target relationship to a single number. In this case, the coefficient is fine if we're dealing with an entirely linear model. In this classification case, the raw coefficient tells us what we need to know, but on the log odds scale, which is not very intuitive for most folks. We can understand the probability scale, but this means things get nonlinear. As an example, a .1 to .2 change in the probability is doubling it, while a .8 to .9 change is a 12.5% increase in the probability. But is there any way we can stick with probabilities and get a single value to understand the change in the probability of a good review as word count changes by 1 unit?
Yes! We can look at what's called the **average marginal effect** of word count. This is the average of the slope of the predicted probability of a good review as word count changes. This is a bit more complicated than just looking at the coefficient, but it's still intuitive, and more so than a coefficient expressed in terms of odds. How do we get it? By a neat little trick where we predict the target with the feature at two values: we start with the observed value and then add or subtract a very small amount, and take the difference in the predictions for those two feature values. Scaled by the size of that small change, this amounts to taking the derivative of the target with respect to the feature.
```{r}
#| echo: false
#| eval: false
#| label: fig-r-marginal-slope-word-count
#| fig-cap: Average Slope of Word Count by Genre
library(marginaleffects)
model_word_ct_by_genre = update(model_lr_train, . ~ . + word_count_sc:genre)
# average slope
p1 = plot_slopes(model_lr_train, variables = "word_count_sc", by = 'genre') +
labs(
title = 'Average Slope of Word Count (standardized) by Genre',
subtitle = 'No interaction'
)
p2 = plot_slopes(model_word_ct_by_genre, variables = "word_count_sc", by = 'genre') +
labs(
title = 'Average Slope of Word Count (standardized) by Genre',
subtitle = 'No interaction'
)
#|
# our average
model_word_ct_by_genre = update(model_lr_train, . ~ . + word_count_sc:genre)
# slopes(model_word_ct_by_genre, variables = "word_count_sc")
```
:::{.panel-tabset}
##### R
```{r}
#| label: calc-marginal-slope-word-count-r
fudge_factor = 1e-3
fudge_plus = predict(
model_class_train,
newdata = df_train |> mutate(word_count_sc = word_count_sc + fudge_factor/2),
type = "response"
)
fudge_minus = predict(
model_class_train,
newdata = df_train |> mutate(word_count_sc = word_count_sc - fudge_factor/2),
type = "response"
)
# compare
# mean(fudge_plus - fudge_minus) / fudge_factor
marginaleffects::avg_slopes(
model_class_train,
variables = "word_count_sc",
type = 'response'
)
```
##### Python
```{python}
#| label: calc-marginal-slope-word-count-py
fudge_factor = 1e-3
fudge_plus = model_class_train.predict(
df_train.assign(
word_count_sc = df_train.word_count_sc + fudge_factor/2
)
)
fudge_minus = model_class_train.predict(
df_train.assign(
word_count_sc = df_train.word_count_sc - fudge_factor/2
)
)
np.mean(fudge_plus - fudge_minus) / fudge_factor
# note that the marginaleffects is available in Python, but still very fresh!
# we'll add a comparison in the future, but it doesn't handle models
# with categoricals right now.
# import marginaleffects as me
# me.avg_slopes(model_class_train, variables = "word_count_sc")
```
:::
Our result suggests we're getting about a `r round(mean(fudge_plus - fudge_minus) / fudge_factor, 1)` drop in the expected probability of a good review for a 1 standard deviation increase in word count on average. This is a bit more intuitive than the coefficient or odds ratio based on it, and we probably don't want to ignore that sort of change. It also doesn't take much to get with the right package, or even on our own. Another nice thing about this approach is that it can potentially be applied to any model, including ones that don't normally produce coefficients, like gradient boosting models or deep learning models.
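For reference, here is the raw coefficient that this average summarizes, along with the odds ratio you'd get by exponentiating it; this is just a quick look at objects already created above.

```{r}
#| label: word-count-coef-odds-ratio
#| eval: false
# the logistic regression coefficient is on the log-odds scale;
# exponentiating gives an odds ratio
coef(model_class_train)['word_count_sc']
exp(coef(model_class_train)['word_count_sc'])
```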
### Marginal means
Marginal means are just about getting an average prediction for the levels of *categorical features*. As an example, we can get the average predicted probability of a good review for each level of the genre feature, and then compare them. To do this we just have to make predictions as if every observation had a certain value for genre, and then average the predictions. This is also the exact same approach that produced the PDP for word count we saw earlier.
:::{.panel-tabset}
##### R
```{r}
#| label: r-marginal-means-genre
#| eval: false
#| results: hide
marginal_means = map_df(
unique(df_train$genre),
~ tibble(
genre = .x,
avg_pred = predict(
model_class_train,
newdata = df_train |>
mutate(genre = .x),
type = "link"
) |>
mean() |> # get the average log odds then transform
plogis()
)
)
# marginal_means
marginaleffects::avg_predictions(
model_class_train,
newdata = df_train,
variables = "genre"
)
```
##### Python
```{python}
#| label: python-marginal-means-genre
#| eval: false
inv_logit = lambda x: 1 / (1 + np.exp(-x))
marginal_means = pd.DataFrame({
"genre": df_train.genre.unique(),
"avg_pred": [
inv_logit(
model_class_train.predict(df_train.assign(genre = g), which='linear')
).mean()
for g in df_train.genre.unique()
]
})
marginal_means
```
:::
```{r}
#| echo: false
#| label: tbl-marginal-means-genre
#| tbl-cap: Average Predicted Probability of a Good Review by Genre
marginaleffects::avg_predictions(model_class_train, variables = "genre") |>
as_tibble() |>
select(-c(p.value, s.value)) |>
gt() |>
tab_footnote(
"Select output from the R package marginaleffects."
) |>
tab_options(
footnotes.font.size = 8
)
```
These results suggest that we can expect a good review for Comedy and Drama, maybe not so much for the others, *on average*. It also seems that our probability of a good review is similar for Comedy and Drama, and the others kind of group together as well.
:::{.callout-note title="That's not what I got!" collapse='true'}
Just a reminder that if you used a different seed for your split or are using Python vs. R, you will see slightly different results. This is normal and expected!
:::
## Counterfactual Predictions {#sec-knowing-counterfactual-predictions}
The nice thing about having a model is that we can make predictions for any set of feature values we want to explore. This is a great way to understand the contribution of a feature to the model. We can make predictions for a range of feature values, and then compare the predictions to see how much the feature contributes to the model. **Counterfactual predictions** allow us to ask "what if?" questions of our model, and see how it responds. As an example, we can get a prediction as if every review was made for a drama, and then see what we'd expect if every review pertained to a comedy. This is a very powerful approach, and often utilized in causal inference, but it's also a great way to understand the contribution of a feature to a model in general.
Consider an experimental setting where we have lots of control over how the data is produced for different scenarios. Ideally we'd be able to look at the same instances when everything about them was identical, except that in one case the instance was part of the control group, and in another, part of the treatment group. Unfortunately, not only is it impossible to have *everything* be identical, but it's also impossible to have the same instance be in two experimental group settings at the same time! Counterfactual predictions are the next best thing, because once we have a model, we can predict an observation *as if* it was in the treatment group, and then again as if it was a control. If we do this for all observations, we can get a sense of the **average treatment effect**, one of the main points of interest in causal inference.
But you don't need an experiment for this. In fact, we've been doing this all along with our marginal effects examples. Take the marginal means, where we looked at the average predicted probability of a good review as if every observation was a specific genre. We could have also looked at any specific observation's predictions to compare the two genre settings, instead of getting an average. This is very much in the spirit of a counterfactual prediction.
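We can do this with the review model right now. The sketch below predicts every training observation as if it were a drama and then as if it were a comedy, and averages the difference; the exact genre labels ('Drama', 'Comedy') are an assumption here, so check `unique(df_train$genre)` if yours differ.

```{r}
#| label: counterfactual-genre-contrast
#| eval: false
# counterfactual contrast: same observations, two different genre settings
# (genre labels assumed; check unique(df_train$genre))
pred_drama = predict(
    model_class_train,
    newdata = df_train |> mutate(genre = 'Drama'),
    type = 'response'
)

pred_comedy = predict(
    model_class_train,
    newdata = df_train |> mutate(genre = 'Comedy'),
    type = 'response'
)

# average difference in the predicted probability of a good review
mean(pred_drama - pred_comedy)
```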
Let's try a new data set to help drive the point home. We'll use some data on the global stage: the world happiness data set (@sec-dd-world-happiness-report). For our model, we'll predict the happiness score for a country, considering freedom to make life choices, GDP, and other things associated with happiness. We'll then switch the freedom to make life choices and GDP values for the US and Russia, and see how the predictions change!
:::{.panel-tabset}
##### R
```{r}
#| label: r-counterfactual-happiness
#| results: hide
# data available on repo (full link in appendix)
df_happiness_2018 = read_csv('https://tinyurl.com/worldhappiness2018')
model_happiness = lm(
happiness_score ~
log_gdp_per_capita
+ healthy_life_expectancy_at_birth
+ generosity
+ freedom_to_make_life_choices
+ confidence_in_national_government,
data = df_happiness_2018
)
df_us_russia = df_happiness_2018 |>
filter(country %in% c('United States', 'Russia'))
happiness_gdp_freedom_values = df_us_russia |>
arrange(country) |>
select(log_gdp_per_capita, freedom_to_make_life_choices)
base_predictions = predict(
model_happiness,
newdata = df_us_russia
)
# switch up their GDP and freedom!
df_switch = df_us_russia |>
mutate(
log_gdp_per_capita = rev(log_gdp_per_capita),
freedom_to_make_life_choices = rev(freedom_to_make_life_choices)
)
switch_predictions = predict(
model_happiness,
newdata = df_switch
)
tibble(
country = c('Russia', 'USA'),
base_predictions,
switch_predictions
) |>
mutate(
diff_in_happiness = switch_predictions - base_predictions
)
```
##### Python
```{python}
#| label: python-counterfactual-happiness
#| results: hide
# data available on repo (full link in appendix)
df_happiness_2018 = pd.read_csv('https://tinyurl.com/worldhappiness2018')
model_happiness = smf.ols(
formula = 'happiness_score ~ \
log_gdp_per_capita \
+ healthy_life_expectancy_at_birth \
+ generosity \
+ freedom_to_make_life_choices \
+ confidence_in_national_government',
data = df_happiness_2018
).fit()
df_us_russia = df_happiness_2018[
df_happiness_2018.country.isin(['United States', 'Russia'])
]
happiness_gdp_freedom_values = df_happiness_2018.loc[
df_happiness_2018.country.isin(['United States', 'Russia']),
['log_gdp_per_capita', 'freedom_to_make_life_choices']
]
base_predictions = model_happiness.predict(df_us_russia)
# switch up their GDP and freedom!
df_switch = df_us_russia.copy()
df_switch[['log_gdp_per_capita', 'freedom_to_make_life_choices']] = df_switch[
['log_gdp_per_capita', 'freedom_to_make_life_choices']
].values[::-1]
switch_predictions = model_happiness.predict(df_switch)
pd.DataFrame({
'country': ['Russia', 'USA'],
'base_predictions': base_predictions,
'switch_predictions': switch_predictions,
'diff_in_happiness': switch_predictions - base_predictions
}).round(3)
```
:::
Here are the results in a clean table.
```{r}
#| echo: false
#| label: tbl-counterfactual-happiness
#| tbl-cap: Counterfactual Predictions for Happiness Score
tab_democracy = tibble(
country = c("Russia", "United States"),
base_predictions,
switch_predictions
) |>
mutate(
diff_in_happiness = switch_predictions - base_predictions
)
tab_democracy |>
gt(decimals = 1)
```
In this case, we see that the happiness score is expected to be very lopsided in favor of the US: our base predictions suggest the US would be almost a full standard deviation higher in happiness than Russia given their current values (SD happiness ~`r round(sd(df_happiness_2018$happiness_score), 1)`). But if the US was just a bit more like Russia, we'd see a significant drop, even if it maintained its life expectancy, generosity, and faith in government. Likewise, if Russia was a bit more like the US, we'd expect to see a significant increase in its happiness score.
It's very easy even with base package functions to see some very interesting things about our data and model. As an exercise, you might go back to the movie reviews data and see what happens if we switch the age of reviewer and length of a movie for a few observations. Counterfactual predictions get us thinking more explicitly about what the situation would be if things were much different, but in the end, we're just playing around with predicted values and thinking about possibilities!
## SHAP Values {#sec-knowing-shap-values}
As we've suggested, most models are more complicated than can be explained by a simple coefficient, for example, the nonlinear effects in generalized additive models. There may not even be feature-specific coefficients available, as with gradient boosting models. Or we may have many parameters associated with a feature, as in deep learning. Such models typically won't come with statistical output like standard errors and confidence intervals either. But we still have some tricks up our sleeve to help us figure things out!
A very common interpretation tool is called a **SHAP value**. SHAP stands for **SHapley Additive exPlanations**, and it provides a means to understand how much each feature contributes to a specific prediction. It's based on a concept from game theory called the Shapley value, which is a way to understand how much each player contributes to the outcome of a game. For our modeling context, SHAP values break down a prediction to show the impact of each feature. The reason we bring it up here is that it has a nice intuition in the linear model case, and seeing it in that context is a good way to get a sense of how it works. Furthermore, it builds on what we've been talking about with our various prediction approaches.
While the actual computations behind the scenes can be tedious, the basic idea is relatively straightforward: for a given prediction at a specific observation with set feature values, we can calculate *the difference between the prediction at that observation versus the average prediction for the model as a whole*. We can break this down for each feature, and see how much each contributes to the difference. This gives us the **local effect** of the feature, or how it plays out for a specific observation, as opposed to the whole data. The SHAP approach also has the benefit of being applicable to *any* model, whether a simple linear or deep learning model. Very cool! To demonstrate, we'll use the simple model from our model comparison demo in the previous chapter, but keep the features on the raw scale.
:::{.panel-tabset}
##### R
```{r}
#| label: lm-extra-features-r
model_lr_3feat = lm(
rating ~
age
+ release_year
+ length_minutes,
data = df_reviews
)
# inspect if desired
# summary(model_lr_3feat)
```
##### Python
```{python}
#| label: lm-extra-features-py
model_lr_3feat = smf.ols(
formula = 'rating ~ \
age \
+ release_year \
+ length_minutes',
data = df_reviews
).fit()
# inspect if desired
# model_lr_3feat.summary(slim = True)
```
:::
```{r}
#| echo: false
#| label: save-shap-model-r
# save the model for use in the next section
save(model_lr_3feat, file = "data/models/knowing_models/model_lr_3feat.RData")
```
```{python}
#| echo: false
#| label: save-shap-model-py
model_lr_3feat.save("data/models/knowing_models/model_lr_3feat.pkl")
```
With our model in place, let's look at the SHAP values for the features. We'll start by choosing the instance we want to explain. Here we'll consider an observation where the release year is 2020, the age of the reviewer is 30, and the movie length is 110 minutes. To aid our understanding, we calculate the SHAP values at that observation both by hand and with a package. The by-hand approach consists of the following steps.
1. Get the average prediction for the model
2. Get the prediction for every observation as if it had the feature value of interest, and average those predictions
3. Calculate the SHAP value as the difference between the average prediction from step 2 and the overall average prediction from step 1
Note that this approach only works for our simple linear regression case, and we'd need to use a package incorporating an appropriate method for more complicated settings. But this simplified approach helps get our bearings on what SHAP values tell us. Also, be aware that our focus is a feature's *marginal contribution* at a *single observation*. Our coefficient already tells us the average contribution of a feature across all observations for this linear regression setting, i.e., the average marginal effect discussed previously.
:::{.panel-tabset}
##### R
```{r}
#| eval: true
#| label: shap-values-r
#| results: hide
# first we need to get the average prediction
avg_pred = mean(predict(model_lr_3feat))
# observation of interest we want shap values for
obs_of_interest = tibble(
age = 30,
length_minutes = 110,
release_year = 2020
)
# then we need to get the prediction for the feature value of interest
# for all observations, and average them
pred_age_30 = predict(
model_lr_3feat,
newdata = df_reviews |> mutate(age = obs_of_interest$age)
)
pred_year_2022 = predict(
model_lr_3feat,
newdata = df_reviews |>
mutate(release_year = obs_of_interest$release_year)
)
pred_length_110 = predict(
model_lr_3feat,
newdata = df_reviews |>
mutate(length_minutes = obs_of_interest$length_minutes)
)
# then we can calculate the shap values
shap_value_ours = tibble(
age = mean(pred_age_30) - avg_pred,
release_year = mean(pred_year_2022) - avg_pred,
length_minutes = mean(pred_length_110) - avg_pred
)
```
##### Python
```{python}
#| eval: false
#| label: shap-values-py
#| results: hide
# first we need to get the average prediction
avg_pred = model_lr_3feat.predict(df_reviews).mean()
# observation of interest we want shap values for
obs_of_interest = pd.DataFrame({
'age': 30,
'release_year': 2020,
'length_minutes': 110
}, index = ['new_observation'])
# then we need to get the prediction for the feature value of interest
# for all observations, and average them
pred_dat = df_reviews.assign(
age = obs_of_interest.loc['new_observation', 'age']
)
pred_age_30 = model_lr_3feat.predict(pred_dat)
pred_dat = df_reviews.assign(
release_year = obs_of_interest.loc['new_observation', 'release_year']
)
pred_year_2022 = model_lr_3feat.predict(pred_dat)
pred_dat = df_reviews.assign(
length_minutes = obs_of_interest.loc['new_observation', 'length_minutes']
)
pred_length_110 = model_lr_3feat.predict(pred_dat)
# then we can calculate the shap values
shap_value_ours = pd.DataFrame({
'age': pred_age_30.mean() - avg_pred,
'release_year': pred_year_2022.mean() - avg_pred,
'length_minutes': pred_length_110.mean() - avg_pred
}, index = ['new_observation'])
```
:::
Now that we have our own version set up, we can use a package to do the work more formally and compare the results.
:::{.panel-tabset}
##### R
```{r}
#| eval: true
#| label: shap-values-r-package
#| results: hide
# we'll use the DALEX package for this
explainer = DALEX::explain(model_lr_3feat, verbose = FALSE)
shap_value_package = DALEX::predict_parts(
explainer,
obs_of_interest,
type = 'shap'
)
rbind(
shap_value_ours,
shap_value_package[
c('age', 'release_year', 'length_minutes'),
'contribution'
]
)
```
##### Python
```{python}
#| eval: false
#| label: shap-values-py-package
#|
# now use the shap package for this; it does not work with statsmodels though,
# but we still get there in the end!
import shap
from sklearn.linear_model import LinearRegression
# set data up for shap and sklearn
fnames = [
'age',
'release_year',
'length_minutes'
]
X = df_reviews[fnames]
y = df_reviews['rating']
# use a linear model that works with shap
model_reviews = LinearRegression().fit(X, y)
# 1000 instances for use as the 'background distribution'
X_sample = shap.maskers.Independent(data = X, max_samples = 1000)
# # compute the SHAP values for the linear model
explainer = shap.Explainer(
model_reviews.predict,
X_sample
)
shap_values = explainer(obs_of_interest)
shap_value_package = pd.DataFrame(
shap_values.values[0, :],
index = fnames,
columns = ['new_observation']
).T
pd.concat([shap_value_ours, shap_value_package])
```
:::
The following table reveals that the results are identical.
```{r}
#| echo: false
#| label: tbl-shap-values-comparison
#| tbl-cap: SHAP Value Comparison
rbind(
ours = shap_value_ours,
dalex = shap_value_package[c('age', 'release_year', 'length_minutes'), 'contribution']
) |>
mutate(source = c('Ours', 'Package')) |>
select(source, everything()) |>
gt(decimals = 3)
```
We can visualize these as well, via a **force plot** or **waterfall plot**, the latter of which is shown in the next plot. The dotted line at *E[f(x)]* represents the average prediction from our model (~`r round(mean(fitted(model_lr_3feat)), 2)`), and the prediction we have for the observation at *f(x)*, which is about `r round(mean(predict(model_lr_3feat, obs_of_interest)), 2)`.
With the average prediction as our starting point, we add the SHAP values for each feature to get the prediction for the observation. First we add the SHAP value for age, which bumps the value by `r round(shap_value_ours[['age']], 3)`, then the SHAP value for movie length, which `r word_sign(shap_value_ours[['length_minutes']], c('increases', 'decreases'))` the prediction `r round(shap_value_ours[['length_minutes']], 3)`, and finally the SHAP value for release year, which brings us to the final predicted value by `r word_sign(shap_value_ours[['release_year']], c('increasing', 'decreasing'))` the prediction `r round(shap_value_ours[['release_year']], 3)`.
```{r}
#| echo: false
#| label: fig-shap-viz-r
#| fig-cap: SHAP Visualizations
# change width and height if using both plots
pp = DALEX::predict_parts(
explainer,
obs_of_interest,
type = "shap",
# B = 1000 # b won't matter since linreg
)
fc = unname(okabe_ito[c(5,6)])
library(shapviz)
plot_dat_shap = shapviz(pp)
layout = '
AAAAA
#BBB#
'
bd = sv_force(
plot_dat_shap,
fill_colors = fc,
max_display = 4,
colour = "white",
contrast = FALSE
) +
labs(
x = "Breakdown",
title = "SHAP Force"
)
wf = sv_waterfall(
plot_dat_shap,
max_display = 4,
fill_colors = fc,
colour = "white",
contrast = FALSE
) +
labs(
x = "Contribution to Prediction",
title = "SHAP Waterfall Plot"
)
# bd / wf +
# plot_layout(nrow = 2, widths = c(.25, 1), design=layout )
## just show waterfall for simplicity
wf
# sv_importance(plot_dat_shap) +
# xlab("Importance")
```
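As a quick sanity check of what the waterfall depicts, the contributions are additive. Using the objects created in the R tab above, a minimal sketch: the average prediction plus the summed SHAP values should reproduce the prediction for our chosen observation.

```{r}
#| label: shap-additivity-check
#| eval: false
# average prediction plus the feature contributions recovers the prediction
# for this observation
avg_pred + sum(unlist(shap_value_ours))

predict(model_lr_3feat, newdata = obs_of_interest)
```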
And there you have it: we've demystified the SHAP value! Things get more complicated in nonlinear settings, when dealing with correlated features, and in other cases, but hopefully this provides you with some context. SHAP values are useful because they tell us how much each feature contributes to the prediction for the observation under consideration.
Pretty neat, huh? So for any observation we want to inspect, and more importantly, for any model we might use, we can get a sense of how features contribute to that prediction. We can also get a sense of how much each feature contributes to the model as a whole by aggregating these values across all observations in our data, and this potentially provides a measure of **feature importance**, but we'll come back to that in a bit.
## Related Visualizations {#sec-knowing-related-viz}
We've seen previously how we can plot predictions in different ways with what's called a **partial dependence plot** (@fig-r-mem-vs-pdp). In essence, a PDP shows the average prediction across the values of a feature. This is also what we were just doing to calculate our SHAP value, and for the linear case, the PDP has a direct correspondence to SHAP. In this setting, the SHAP value is the difference between the point on the PDP at a specific feature value and the average prediction. As an example, for a movie of 110 minutes, the line in the plot below corresponds to the value in the previous waterfall plot (@fig-shap-viz-r).
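Using the by-hand objects from earlier, a minimal sketch of that correspondence for a 110-minute movie:

```{r}
#| label: pdp-shap-correspondence-check
#| eval: false
# PDP point for length at 110 minutes: the average prediction with every
# observation's length_minutes set to 110
pdp_point_110 = mean(
    predict(model_lr_3feat, newdata = df_reviews |> mutate(length_minutes = 110))
)

# subtracting the overall average prediction recovers the SHAP value for length
pdp_point_110 - mean(predict(model_lr_3feat))
```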
:::{.content-visible when-format='html'}
```{r}
#| echo: false
#| label: fig-pdp-ice-r