CUDA error: device-side assert triggered during stability_selection #69

Open
ElrondL opened this issue Nov 8, 2024 · 8 comments

ElrondL commented Nov 8, 2024

I found that, on occasion, stability selection will trigger a CUDA error like the one below:

self.criterion(model(X_val), y_val).item()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Could this be due to the use of validation sets? The error often occurs in the middle of the stability-selection process, at a seemingly random trial (e.g. sometimes it fails at 40/100, other times at 80/100...).
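
For reference, the call that triggers this looks roughly like the sketch below. The data here is synthetic and the device keyword is my assumption about how the estimator ends up on the GPU, not a verbatim copy of my script:

```python
# Rough shape of the failing call (synthetic data; the device kwarg is an
# assumption, not necessarily how the estimator has to be configured).
import numpy as np
import torch
from lassonet import LassoNetClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20)).astype(np.float32)
y = rng.integers(0, 3, size=200)  # integer class labels 0..2

device = "cuda" if torch.cuda.is_available() else "cpu"
model = LassoNetClassifier(device=device)
oracle, order, *_ = model.stability_selection(X, y, n_models=100)
```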

@louisabraham (Collaborator)

I have absolutely no idea. Stability selection shouldn't have anything to do with that error. Maybe try updating PyTorch to the latest version?

ElrondL commented Nov 14, 2024

Hey @louisabraham, I've tried updating to the latest version, but this error still occurs randomly.

ElrondL commented Nov 14, 2024

Traceback (most recent call last):
File "...", line --, in eval_LassoNet
oracle, order, *_ = lassonetModel.stability_selection(X, y, n_models=100)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../lib/python3.11/site-packages/lassonet/interfaces.py", line 569, in stability_selection
paths = [
^
File ".../lib/python3.11/site-packages/lassonet/interfaces.py", line 570, in
self._stability_selection_path(X, y, lambda_seq)
File ".../lib/python3.11/site-packages/lassonet/interfaces.py", line 531, in _stability_selection_path
return BaseLassoNet.path(
^^^^^^^^^^^^^^^^^^
File ".../python3.11/site-packages/lassonet/interfaces.py", line 411, in path
self._train(
File ".../python3.11/site-packages/lassonet/interfaces.py", line 266, in _train
best_val_obj = validation_obj()
^^^^^^^^^^^^^^^^
File ".../lib/python3.11/site-packages/lassonet/interfaces.py", line 260, in validation_obj
self.criterion(model(X_val), y_val).item()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

@ilemhadri (Collaborator)

Could you re-run your script with the environment variable CUDA_LAUNCH_BLOCKING=1 set and report the stack trace?
The command would look like CUDA_LAUNCH_BLOCKING=1 python script.py, and we would want to check which operation fails in the reported stack trace.
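
If it is more convenient to set the variable from inside the script, a minimal sketch (it has to be set before CUDA is initialized, i.e. before the first import of torch):

```python
# Set the variable before torch initializes CUDA, i.e. at the very top of the script.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the variable is set
```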

ElrondL commented Nov 14, 2024

Okay.

https://builtin.com/software-engineering-perspectives/cuda-error-device-side-assert-triggered
It's a little strange that the assertion error happens in the middle of stability selection. I'm only seeing this with LassoNetClassifier.

A guess is that the classifier's output layer doesn't have the right number of neurons to match the total number of classes. Could this be because stability selection runs some sort of train/test split and only detects the total number of classes from the training subset?
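
If that guess is right, here is a minimal sketch (plain PyTorch, independent of lassonet, so only an illustration of the mechanism) of how a label outside the output range produces exactly this assert:

```python
# Illustration only: if the output layer was sized from a subsample that never
# saw one of the classes, a held-out label can be >= the number of outputs and
# the CUDA kernel behind cross_entropy fires a device-side assert.
import torch

n_outputs = 3  # the model only "knows" classes 0..2
logits = torch.randn(4, n_outputs, device="cuda")
targets = torch.tensor([0, 1, 2, 3], device="cuda")  # label 3 is out of range

loss = torch.nn.functional.cross_entropy(logits, targets)
loss.item()  # RuntimeError: CUDA error: device-side assert triggered
```

On CPU the same call raises a plain IndexError that names the offending label.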

ElrondL commented Nov 15, 2024

I re-ran my code and this time it didn't return any error, lol. This seems to be an intermittent error.

ilemhadri (Collaborator) commented Nov 15, 2024

My uneducated guess is that it could be due to a size mismatch or an out-of-bounds indexing error (not static, but data-dependent, hence the "random" failures you observe). To be clear, the errors are still deterministic, not random (nothing is ever "random"!).
Besides my previous suggestion, another thing you could do is run the code on CPU and try to catch the same error there; it will give a more useful traceback message.
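
For example, something along these lines; the device keyword is an assumption about the estimator's constructor, while hiding the GPUs works regardless:

```python
# Two ways to force a CPU run; an out-of-range label then raises a plain
# IndexError with a readable Python traceback instead of an async CUDA assert.

# (a) hide the GPUs before torch/lassonet are imported
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# (b) or, assuming the estimator accepts a device argument:
from lassonet import LassoNetClassifier
model = LassoNetClassifier(device="cpu")
```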

ElrondL commented Nov 15, 2024

Yes, I will try that (makes sense!).
