SHAP analysis for h2o.deeplearning model #16463

Closed

bappa10085 opened this issue Dec 17, 2024 · 12 comments

@bappa10085 commented Dec 17, 2024

From the documentation of the h2o package I learned that SHAP analysis is not available for h2o.deeplearning models, so I was unable to use shapviz with an h2o.deeplearning model. Here is a minimal reproducible example:

library(shapviz)
library(tidyverse)
library(h2o)
h2o.init()

set.seed(1)
# Get rid of those darn ordinals
ord <- c("clarity", "cut", "color")
diamonds[, ord] <- lapply(diamonds[, ord], factor, ordered = FALSE)

dia_h2o <- as.h2o(diamonds)

### Deep Learning Model
fit <- h2o.deeplearning(x = c("carat", "clarity", "color", "cut"),
                        y = "price", training_frame = dia_h2o, seed = 123456)

fit

# SHAP analysis on about 2000 diamonds
X_small <- diamonds %>%
  filter(carat <= 2.5) %>%
  sample_n(2000) %>%
  as.h2o()

shp <- shapviz(fit, X_pred = X_small)

It returns the following error:

Error in .check_model_suitability_for_calculation_of_contributions(object, :
Calculation of feature contributions without a background frame requires a tree-based model.

According to the developer of the shapviz package, "h2o provides SHAP only for certain tree based models. You could ask h2o for a model-agnostic explainer such as permutation SHAP or Kernel SHAP (or some deep learning-specific variant). If they would implement something like this, it could be easily added to the pure plotting package shapviz.", as discussed here. Are there any plans to add this feature?

@mayer79 commented Dec 20, 2024

A (slightly unrelated) comment from my side: it is fantastic that h2o random forests now provide SHAP values as well.

@tomasfryda (Contributor) commented:

We actually support SHAP for Deep Learning, as well as for GBM, DRF, GLM, Stacked Ensembles, and XGBoost (see https://docs.h2o.ai/h2o/latest-stable/h2o-docs/performance-and-prediction.html#predict-contributions).

The problem in the shapviz package likely stems from a difference in the API: all models that support SHAP in h2o (all models that come out of h2o AutoML) can be run with a background_frame, but only tree-based models can be run without one. AFAIK, the recommendation is to use a background_frame even for the tree-based models, although it is slower.

A background frame is a frame that contains the baselines. The Generalized Deep SHAP paper has some examples of when it might be beneficial to use a particular subset as the background_frame.
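
For illustration, the contributions can also be requested from h2o directly (a minimal sketch, assuming the fit and dia_h2o objects from the example above; taking the first 100 rows as the background frame is purely for brevity, a random sample would be more typical):

# background frame containing the baselines (here simply the first 100 training rows)
bg <- dia_h2o[1:100, ]

# SHAP-style contributions for the deep learning model:
# one column per feature plus a BiasTerm column
contrib <- h2o.predict_contributions(fit, dia_h2o[1:10, ], background_frame = bg)
head(contrib)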

@bappa10085 (Author) commented:

@tomasfryda Can you show the code using the example data I have provided?

@mayer79 commented Jan 7, 2025

@tomasfryda very neat! I was not aware of additional SHAP algos in h2o.

I will study the implementation a bit closer and see if adaptations to {shapviz} are required.

This seems to work:

# SHAP analysis on 200 diamonds, using the first 50 rows as the background frame
X_small <- diamonds %>%
  filter(carat <= 2.5) %>%
  sample_n(200) %>%
  as.h2o()

X_bg <- X_small[1:50, ]

shp <- shapviz(fit, X_pred = X_small, background_frame = X_bg)
sv_importance(shp)                # bar plot of mean absolute SHAP values
sv_importance(shp, kind = "bee")  # beeswarm summary plot
sv_dependence(shp, v = c("carat", "clarity", "color", "cut"))

[Screenshots of the resulting SHAP importance and dependence plots]

@bappa10085 (Author) commented:

@mayer79 What is the optimal size of the background_frame (as a % of the samples)?

@mayer79 commented Jan 8, 2025

My feeling is around 100-500 rows (not a percentage), but I need to study the implementation.

@tomasfryda (Contributor) commented:

@mayer79 If you have any questions regarding the implementation, feel free to ask me (I implemented it in H2O-3).

@bappa10085 The number of samples depends on the use case and the complexity of the task (e.g., how big the dataset needs to be to be representative). It can be memory intensive, since we internally build a matrix with nrows(test_frame) * nrows(background_frame) rows. That is the worst case, for output_per_reference=True, which outputs a contribution for each row of the test_frame compared to each row of the background frame (this is used in Generalized Deep SHAP in Stacked Ensembles). But 100-500 samples seems like a good first try. (You can always add more, or pick some other background frame sample, to find out how sensitive the SHAP values are to the particular background frame.)
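
That sensitivity check could look roughly like this (a sketch, not an official recipe; it simply recomputes the contributions against a second, disjoint background frame, reusing fit, dia_h2o, and the 200-row X_small from above):

# two disjoint background samples of 100 rows each;
# 200 test rows x 100 background rows -> a 20,000-row internal matrix
bg1 <- dia_h2o[1:100, ]
bg2 <- dia_h2o[101:200, ]

shap1 <- h2o.predict_contributions(fit, X_small, background_frame = bg1)
shap2 <- h2o.predict_contributions(fit, X_small, background_frame = bg2)

# if the largest discrepancy is small, the background sample is large enough
max(abs(as.data.frame(shap1) - as.data.frame(shap2)))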

@bappa10085 (Author) commented:

@mayer79 and @tomasfryda Another question: generally we use 70% of the data for model training (train set) and 30% for model validation (test set). Which set should be used as X_pred and which as background_frame?

@tomasfryda (Contributor) commented:

@bappa10085 Short answer: I'd use a subset of the training data as the background_frame and the test set as X_pred.

But it's actually quite complicated to decide, and it depends on what question you are trying to answer.

For example, according to the Consumer Financial Protection Bureau, for credit denials in the US the regulatory commentary suggests to “identify the factors for which the applicant’s score fell furthest below the average score for each of those factors achieved by applicants whose total score was at or slightly above the minimum passing score.” According to Machine Learning for High-Risk Applications (by Patrick Hall, James Curtis, Parul Pandey), this can be done by using the applicants just above the cutoff for the credit product as the background dataset.
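
As a purely hypothetical sketch of that idea (train, model, denied, score, and the cutoff value are illustrative names, not objects from this thread; h2o frames support this kind of logical row filtering):

# applicants scoring at or slightly above the minimum passing score
cutoff <- 620  # hypothetical minimum passing score
near_cutoff <- train[train$score >= cutoff & train$score <= cutoff + 20, ]

# explain the denied applications relative to the "barely approved" baseline group
shap_denials <- h2o.predict_contributions(model, denied, background_frame = near_cutoff)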

@mayer79 commented Jan 8, 2025

IMHO it does not matter which data you sample from, because the response variable is not used. Thus, I'd sample both from the training data.

@tomasfryda (Contributor) commented:

@mayer79 I think it depends on the question you're trying to answer. When you have real-world data, it can have temporal dependencies that influence the model, e.g., crop production being influenced by climate change.

Depending on the choice of the background dataset, you can get either SHAP values that were applicable 50 years ago (e.g., when there were more rainy days, artificial irrigation might appear much less important) or values that are applicable now.

Since we're often interested in the generalization capabilities of the model, we keep the interesting/"future" data in the test set. By that logic, I think it can be beneficial to use the test set as X_pred and a relevant subset of the training data as the background_frame (in my example, something like a subset from the last decade), so we get SHAP values that correspond to our use case of the model.
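
In code, that idea could look like this hypothetical sketch (train, test, model, and the year column are illustrative names, not objects from this thread):

# baselines from the most recent decade of the training data only,
# so the SHAP values reflect current conditions rather than historical ones
bg_recent <- train[train$year >= 2015, ]

shap_now <- h2o.predict_contributions(model, test, background_frame = bg_recent)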

Does that make it clearer, or did I make a logical mistake somewhere?

@mayer79 commented Jan 8, 2025

I think that makes perfect sense. I had a "simple" splitting scheme in mind, where I'd want to reduce unnecessary use of the test data.
