CUDA error: device-side assert triggered during stability_selection #69
Comments
I have absolutely no idea. Stability selection shouldn't have anything to do with that error. Maybe try updating PyTorch to the latest version?
Hey @louisabraham I've tried updating to the latest version, but this error still occurs randomly.
Traceback (most recent call last):
Could you re-run your script with the environment variable
Okay, see https://builtin.com/software-engineering-perspectives/cuda-error-device-side-assert-triggered. A guess: maybe the number of neurons in the classifier's output layer doesn't match the total number of classes?
I re-ran my code and this time it didn't return any error, lol. This seems like a random error that occurs
My uneducated guess is that it could be due to a size mismatch or an out-of-bounds indexing error (not static, but data-dependent, hence the "random" failures you observe). To be clear, the errors are still deterministic, not random (nothing is ever "random"!): they depend on which samples end up in a given split.
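One way to test the out-of-bounds guess without touching the GPU is to check the labels themselves. With class-index targets, a cross-entropy-style loss expects every label to lie in `[0, n_classes)`, and a label outside that range is a common cause of this assert. The sketch below is only an illustration: `find_out_of_range_labels` and the sample labels are hypothetical, not part of any library, and `n_classes` is assumed to equal the size of the model's output layer.

```python
from collections import Counter


def find_out_of_range_labels(labels, n_classes):
    """Return {label: count} for every label outside [0, n_classes)."""
    bad = Counter()
    for label in labels:
        label = int(label)
        if not 0 <= label < n_classes:
            bad[label] += 1
    return dict(bad)


# Hypothetical validation labels for a 3-class problem; 3 and -1 are invalid.
y_val = [0, 2, 1, 3, 0, -1, 3]
print(find_out_of_range_labels(y_val, n_classes=3))  # {3: 2, -1: 1}
```

For a 1-D integer label tensor, passing `y_val.tolist()` should work the same way. An empty result for every split would suggest the labels are not the cause.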
Yes, I will try that (makes sense!)
I found that stability selection occasionally triggers a CUDA error like the one below:
self.criterion(model(X_val), y_val).item()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Could this be due to the use of validation sets? The error usually occurs in the middle of the stability-selection process, after a varying number of trials (e.g. sometimes it fails at 40/100, other times at 80/100...).
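The error message itself suggests `CUDA_LAUNCH_BLOCKING=1`. A minimal sketch of setting it from inside the script, assuming the variable has to be in place before the first CUDA call (setting it before importing torch is the safest option):

```python
import os

# Make CUDA kernel launches synchronous so the stack trace points at the
# call that actually failed. Set this before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch and run the failing code only after this point
```

This slows execution down, so it is best used only while debugging. Running the same code on the CPU is another way to get a more readable error, typically an ordinary Python exception instead of a device-side assert.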