CUDA error: device-side assert triggered during stability_selection #69

Open
ElrondL opened this issue Nov 8, 2024 · 8 comments

ElrondL commented Nov 8, 2024

I found that, on occasion, stability selection will trigger a CUDA error like the one below:

self.criterion(model(X_val), y_val).item()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Could this be due to the use of validation sets? The error often occurs in the middle of the stability-selection process, at a seemingly random trial (e.g. sometimes it fails at 40/100, other times at 80/100...).
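
For reference, the call that triggers this looks roughly like the sketch below. The data here is synthetic and the device keyword is my assumption about how the estimator ends up on the GPU, not a verbatim copy of my script:

```python
# Rough shape of the failing call (synthetic data; the device kwarg is an
# assumption, not necessarily how the estimator has to be configured).
import numpy as np
import torch
from lassonet import LassoNetClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20)).astype(np.float32)
y = rng.integers(0, 3, size=200)  # integer class labels 0..2

device = "cuda" if torch.cuda.is_available() else "cpu"
model = LassoNetClassifier(device=device)
oracle, order, *_ = model.stability_selection(X, y, n_models=100)
```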

@louisabraham (Collaborator)

I have absolutely no idea. Stability selection shouldn't have anything to do with that error. Maybe try updating PyTorch to the latest version?

ElrondL commented Nov 14, 2024

Hey @louisabraham, I've tried updating to the latest version, but this error still occurs randomly.

ElrondL commented Nov 14, 2024

Traceback (most recent call last):
File "...", line --, in eval_LassoNet
oracle, order, *_ = lassonetModel.stability_selection(X, y, n_models=100)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../lib/python3.11/site-packages/lassonet/interfaces.py", line 569, in stability_selection
paths = [
^
File ".../lib/python3.11/site-packages/lassonet/interfaces.py", line 570, in
self._stability_selection_path(X, y, lambda_seq)
File ".../lib/python3.11/site-packages/lassonet/interfaces.py", line 531, in _stability_selection_path
return BaseLassoNet.path(
^^^^^^^^^^^^^^^^^^
File ".../python3.11/site-packages/lassonet/interfaces.py", line 411, in path
self._train(
File ".../python3.11/site-packages/lassonet/interfaces.py", line 266, in _train
best_val_obj = validation_obj()
^^^^^^^^^^^^^^^^
File ".../lib/python3.11/site-packages/lassonet/interfaces.py", line 260, in validation_obj
self.criterion(model(X_val), y_val).item()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

@ilemhadri (Collaborator)

Could you re-run your script with the environment variable CUDA_LAUNCH_BLOCKING=1 set and report the stack trace?
The command would look like CUDA_LAUNCH_BLOCKING=1 python script.py, and we would want to check which operation fails in the reported stack trace.
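
If it is more convenient to set the variable from inside the script, a minimal sketch (it has to be set before CUDA is initialized, i.e. before the first import of torch):

```python
# Set the variable before torch initializes CUDA, i.e. at the very top of the script.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the variable is set
```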

ElrondL commented Nov 14, 2024

Okay.

https://builtin.com/software-engineering-perspectives/cuda-error-device-side-assert-triggered
It's a little strange that the assertion error happens in the middle of stability selection. I'm only seeing this with LassoNetClassifier.

A guess is that the classifier's output layer doesn't have the right number of neurons to match the total number of classes. Could this be because stability selection runs some sort of train/test split and only detects the total number of classes from the training subset?
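
If that guess is right, here is a minimal sketch (plain PyTorch, independent of lassonet, so only an illustration of the mechanism) of how a label outside the output range produces exactly this assert:

```python
# Illustration only: if the output layer was sized from a subsample that never
# saw one of the classes, a held-out label can be >= the number of outputs and
# the CUDA kernel behind cross_entropy fires a device-side assert.
import torch

n_outputs = 3  # the model only "knows" classes 0..2
logits = torch.randn(4, n_outputs, device="cuda")
targets = torch.tensor([0, 1, 2, 3], device="cuda")  # label 3 is out of range

loss = torch.nn.functional.cross_entropy(logits, targets)
loss.item()  # RuntimeError: CUDA error: device-side assert triggered
```

On CPU the same call raises a plain IndexError that names the offending label.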

ElrondL commented Nov 15, 2024

I re-ran my code and this time it didn't return any error, lol. This seems to be an intermittent error.

ilemhadri (Collaborator) commented Nov 15, 2024

My uneducated guess is that it could be due to a size mismatch or an out-of-bounds indexing error (not static, but data-dependent, hence the "random" failures you observe). To be clear, the errors are still deterministic, not random (nothing is ever "random"!).
Besides my previous suggestion, another thing you could do is run the code on CPU and try to catch the same error there; it will give a more useful traceback message.
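
For example, something along these lines; the device keyword is an assumption about the estimator's constructor, while hiding the GPUs works regardless:

```python
# Two ways to force a CPU run; an out-of-range label then raises a plain
# IndexError with a readable Python traceback instead of an async CUDA assert.

# (a) hide the GPUs before torch/lassonet are imported
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# (b) or, assuming the estimator accepts a device argument:
from lassonet import LassoNetClassifier
model = LassoNetClassifier(device="cpu")
```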

ElrondL commented Nov 15, 2024

Yes, I will try that (makes sense!).
