fastshap::explain() uses all cores and not those defined by registerDoParallel(cores=X) #75
Comments
Thanks @abussalleuc, on a Windows system, multicore functionality in R will not work (e.g., specifying registerDoParallel(cores = 25)); try registering a socket cluster instead:

cl <- makeCluster(25)
registerDoParallel(cl)

Does this seem to fix the issue on your system? Further, explaining that many rows, even with …
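A fuller version of that suggestion, with setup and cleanup spelled out (the worker count of 25 is carried over from the thread; the rest is a sketch, not a confirmed fix):

library(doParallel)

cl <- makeCluster(25)    # socket cluster; works on Windows, unlike fork-based multicore
registerDoParallel(cl)

# ... run fastshap::explain(..., parallel = TRUE) here ...

stopCluster(cl)          # release the workers when finished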
Hi @brandongreenwell-8451, I was originally using makePSOCKcluster, which (to my limited knowledge) should work on a Windows machine. I tried your suggestion and the issue persists. My idea is to use SHAP values to explain correlations between modeled variables that share the same predictors, so my dataset is … Thank you for your time.
Hi @bgreenwell,
I am using fastshap::explain() on a large dataset (4 million rows, 23 columns) on a Windows system (512 GB RAM, 48 cores).
Here is what the code looks like:
t <- fastshap::explain(
  model,                        # ranger::ranger() object
  X = train_set[, vars],        # training set used to fit the model (~1 million rows, but a wider range of predictor values)
  pred_wrapper = pfun,          # prediction function: ranger::predict()$predictions
  newdata = new_data[, vars],   # dataset to explain (~4 million rows)
  feature_names = NULL,         # predictor variables of interest (23)
  nsim = 10,
  adjust = TRUE,
  parallel = TRUE,
  .packages = c("ranger")
)
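One thing worth checking, though this is an assumption rather than anything established in the thread: ranger predictions are themselves multithreaded and default to all available CPUs (num.threads = NULL), so each foreach worker can fan out across every core regardless of which backend is registered. A minimal sketch of a prediction wrapper that caps ranger's internal threading:

pfun <- function(object, newdata) {
  # num.threads = 1 keeps each worker single-threaded, so total core usage
  # is governed by the registered foreach backend alone
  predict(object, data = newdata, num.threads = 1)$predictions
}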
Due to the size of my dataset, I don't want to use all cores, to avoid memory issues, so I tried defining the parallel backend as:
cl <- makePSOCKcluster(25)
registerDoParallel(cl)
and as:
registerDoParallel(cores = 25)
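A quick sanity check worth running first (a sketch, assuming the doParallel backend above): ask foreach how many workers the registered backend actually reports before calling explain().

library(doParallel)
registerDoParallel(cores = 25)
foreach::getDoParWorkers()   # should print 25 if the backend registered as intended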
In both cases I have noticed (using Task Manager > Performance) that all of the available cores, not just those defined above, are being used.
I tried this with data subsets of 4,000 rows, varying the number of simulations and cores, but it still uses more cores than those defined by the registerDoParallel command (again based on Task Manager > Performance).
While explain() runs and produces results with the smaller datasets, with the whole dataset it uses 100% of my RAM and sometimes the computer crashes.
In total, the model, train_set, and new_data objects weigh ~5 GB, so I don't think it is a good idea to use as many clusters/cores/logical processors as possible.
Am I defining the parallel backend wrong?
Should I instead create a foreach loop over each column, with parallel = FALSE?
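A minimal sketch of that column-wise alternative, assuming fastshap's feature_names argument accepts a single predictor name (and leaving out adjust = TRUE, on the assumption that the adjustment sums SHAP values across all features and so doesn't apply column by column):

library(doParallel)
library(foreach)

cl <- makePSOCKcluster(25)
registerDoParallel(cl)

# one explain() call per predictor; the outer loop, not fastshap,
# decides how many workers run at once
shap <- foreach(v = vars, .combine = cbind,
                .packages = c("fastshap", "ranger"),
                .export = c("model", "train_set", "new_data", "pfun")) %dopar% {
  fastshap::explain(
    model,
    X = train_set[, vars],
    feature_names = v,           # explain just this column
    pred_wrapper = pfun,
    newdata = new_data[, vars],
    nsim = 10,
    parallel = FALSE             # parallelism handled by the outer foreach
  )
}

stopCluster(cl)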
Thank you for your time.
Best,
Alonso