SHAP analysis for `h2o.deeplearning` model #16463
Comments
A (slightly unrelated) comment from my side: it is fantastic that h2o random forests now provide SHAP values as well.
We actually support SHAP for Deep Learning, GBM, DRF, GLM, Stacked Ensembles, and XGBoost (see https://docs.h2o.ai/h2o/latest-stable/h2o-docs/performance-and-prediction.html#predict-contributions). The `background_frame` is a frame that contains the baselines. The Generalized Deep SHAP paper has some examples of when it can be beneficial to use a particular subset of the data as the background frame.
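As a rough illustration of the API referenced above, the sketch below computes SHAP contributions for a Deep Learning model with an explicit background frame. It assumes a recent H2O version where `h2o.predict_contributions` accepts a `background_frame` argument; the dataset and variable names are illustrative, not from this thread.

```r
# Hedged sketch: SHAP contributions for a Deep Learning model with an
# explicit background frame of baselines. Exact argument availability
# may differ across H2O versions.
library(h2o)
h2o.init()

train <- as.h2o(iris)
fit <- h2o.deeplearning(
  x = 2:5,                 # predictors
  y = "Sepal.Length",      # numeric response
  training_frame = train
)

bg <- train[1:50, ]        # background frame: rows used as baselines
contribs <- h2o.predict_contributions(fit, train, background_frame = bg)
head(contribs)             # per-feature contributions plus BiasTerm
```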
@tomasfryda Can you show the code using the example data I have provided?
@tomasfryda very neat! I was not aware of additional SHAP algos in h2o. I will study the implementation a bit closer and see if adaptations to {shapviz} are required. This seems to work:

```r
X_small <- diamonds %>%
  filter(carat <= 2.5) %>%
  sample_n(200) %>%
  as.h2o()
X_bg <- X_small[1:50, ]

shp <- shapviz(fit, X_pred = X_small, background_frame = X_bg)
sv_importance(shp)
sv_importance(shp, kind = "bee")
sv_dependence(shp, v = c("carat", "clarity", "color", "cut"))
```
@mayer79 What should be the optimum size of the background frame?
My feeling is around 100-500 rows (not percent), but I need to study the implementation.
@mayer79 If you have any questions regarding the implementation, feel free to ask me (I implemented it in H2O-3). @Bappa10 The number of samples depends on the use-case and complexity of the task (e.g. how big does the dataset be to be representative). It can be memory intensive since we internally build a matrix with |
@mayer79 and @tomasfryda Another question: generally we use 70% of the data for model training (train set) and 30% for model validation (test set). Which set should be used as the background frame?
@bappa10085 Short answer: I'd use a subset of the training data as the background frame. But it's actually quite complicated to decide, and it depends on what question you are trying to answer. For example, according to the Consumer Financial Protection Bureau, for credit denials in the US, the regulatory commentary suggests to "identify the factors for which the applicant's score fell furthest below the average score for each of those factors achieved by applicants whose total score was at or slightly above the minimum passing score." This can be done by using the applicants just above the cutoff to receive the credit product as the background dataset, according to Machine Learning for High-Risk Applications (by Patrick Hall, James Curtis, Parul Pandey).
IMHO it does not matter which data we sample from, because the response variable is not used. Thus, I'd sample both from the training data.
@mayer79 I think it depends on the question you're trying to answer. When you have real-world data, it can have temporal dependencies that might influence the model, e.g., crop production being influenced by climate change. Depending on the choice of the background dataset, you can either have SHAP values that were applicable 50 years ago (e.g. when there were more rainy days, artificial irrigation might appear much less important) or that are applicable now. Since we are often interested in the generalization capabilities of the model, we keep the interesting/"future" data in the test set, so by that logic I think it can be beneficial to use the test set as the background frame. Does that make it clearer, or did I make some logical mistake there?
I think that makes perfect sense. I had a "simple" splitting scheme in mind, where I'd want to reduce unnecessary use of the test data.
From the documentation of the `h2o` package I came to know that SHAP analysis is not available for the `h2o.deeplearning` model, so I was unable to use `shapviz` for an `h2o.deeplearning` model. Here is a minimal reproducible example. It returns the following error.
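The original reproducible example and error output did not survive in this thread. A representative sketch of the kind of example described, with dataset and variable names as assumptions, might look like:

```r
# Hypothetical reconstruction of the reported failure case
# (not the original code from the issue).
library(h2o)
library(shapviz)

h2o.init()
iris_hf <- as.h2o(iris)

fit <- h2o.deeplearning(
  x = c("Sepal.Width", "Petal.Length", "Petal.Width", "Species"),
  y = "Sepal.Length",
  training_frame = iris_hf
)

# At the time of the report, this errored for deep learning models,
# because SHAP contributions were only exposed for tree-based models:
shp <- shapviz(fit, X_pred = iris_hf)
```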
According to the developer of the `shapviz` package: "`h2o` provides SHAP only for certain tree based models. You could ask `h2o` for a model-agnostic explainer such as permutation SHAP or Kernel SHAP (or some deep learning-specific variant). If they would implement something like this, it could be easily added to the pure plotting package `shapviz`." (as discussed here). Are there any plans to add this feature?