CUDA backend failed to initialize #1048

Open
hofmank0 opened this issue Dec 17, 2024 · 1 comment

Comments

@hofmank0

I used to have a working AF2 installation on my computer (Ubuntu 24.04, NVIDIA RTX 4090, driver nvidia-open 565.57.01).
Unfortunately, after installing AF3 on the same computer (which worked after some hiccups) and trying to switch Docker to rootless mode, AF2 stopped working, and so did AF3. After extensive web searches I tried several suggested fixes, but they made things worse. Even after reinstalling everything (Docker, the CUDA toolkit, AF2) following my original procedure (and avoiding rootless mode), I have not managed to get it running again. The error messages vary, but the most persistent one is:
CUDA backend failed to initialize: jaxlib/cuda/versions_helpers.cc:98: operation cuInit(0) failed: CUDA_ERROR_NO_DEVICE (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
I read that this might be related to having CUDA directories in LD_LIBRARY_PATH and/or to setting (or not properly setting) CUDA_VISIBLE_DEVICES. However, changing those did not help.
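A minimal Python sketch of the two checks mentioned above, plus the error message's own suggestion to rerun with TF_CPP_MIN_LOG_LEVEL=0. It assumes it is run in the same environment AF2 uses (inside the container, for a Docker setup). The helper names are hypothetical, and the CUDA_VISIBLE_DEVICES rule is an approximation of NVIDIA's documented behavior: CUDA stops at the first invalid entry, so an empty value, "-1", or an out-of-range index such as "1" on a one-GPU machine leaves no GPU visible, which would surface as CUDA_ERROR_NO_DEVICE.

```python
import os
from typing import List, Optional

# Hypothetical helpers; a diagnostic sketch, not part of AF2 or JAX.
# Ask for verbose C++ logs, as the error message suggests. jaxlib reads this
# when it initializes, so it is set before jax is imported below.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "0"


def visible_gpu_count(value: Optional[str], gpu_count: int) -> int:
    """Approximate how many of `gpu_count` GPUs CUDA_VISIBLE_DEVICES exposes.

    Unset means all GPUs. Parsing stops at the first invalid entry, so only
    GPUs listed before it are visible; "" or "-1" therefore hides all of them.
    """
    if value is None:
        return gpu_count
    visible = 0
    for entry in value.split(","):
        entry = entry.strip()
        if entry.startswith(("GPU-", "MIG-")):
            visible += 1  # assumes the UUID matches an installed device
            continue
        try:
            index = int(entry)
        except ValueError:
            break
        if not 0 <= index < gpu_count:
            break
        visible += 1
    return visible


def cuda_dirs_on_ld_library_path(value: Optional[str]) -> List[str]:
    """Return LD_LIBRARY_PATH entries that look like CUDA library directories."""
    if not value:
        return []
    return [p for p in value.split(":") if "cuda" in p.lower()]


def jax_backend() -> str:
    """Return the backend JAX picked ('gpu', 'cpu', ...) or a short error string."""
    try:
        import jax
    except ImportError:
        return "jax not installed"
    try:
        # After a failed CUDA init, JAX usually logs a warning and falls back to CPU.
        return jax.default_backend()
    except RuntimeError as err:  # e.g. when JAX_PLATFORMS forces the CUDA backend
        return f"error: {err}"


if __name__ == "__main__":
    cvd = os.environ.get("CUDA_VISIBLE_DEVICES")
    # gpu_count=1 assumes the single-GPU machine described in this issue.
    print(f"CUDA_VISIBLE_DEVICES={cvd!r} -> visible GPUs: {visible_gpu_count(cvd, 1)}")
    for d in cuda_dirs_on_ld_library_path(os.environ.get("LD_LIBRARY_PATH")):
        print("CUDA directory on LD_LIBRARY_PATH:", d)
    print("JAX backend:", jax_backend())
```

If the script reports zero visible GPUs, the environment is hiding the card before CUDA ever looks for it; if it reports one GPU but JAX still lands on "cpu", the verbose log lines printed during the import are the next place to look.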

I got a similar error message during my first (successful) AF2 installation, where it could be fixed by using an older Docker version (5:26.1.4-1 rather than the newer 5:27.4.0-1). This time, however, reverting to the old Docker version did not help.

As far as I can tell, the GPU is present and functional. I know very little about Docker and am not sure how to test whether the GPU is accessible to the Docker image I built. The AF2 documentation recommends trying
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
but this never worked for me, not even when AF2 was running fine.

I am getting somewhat desperate. Could this be a hardware problem? Any hint would be greatly appreciated.
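On the hardware question, a host-side check run outside Docker can narrow things down. Below is a minimal sketch that uses nvidia-smi's CSV query mode (`--query-gpu=name,driver_version --format=csv,noheader`); the function names are hypothetical. If the host lists the 4090 with the expected driver version, the card and driver are at least being detected, and a failure that only shows up inside containers is more likely a Docker or configuration problem than a hardware fault.

```python
import shutil
import subprocess
from typing import List, Optional, Tuple

# Hypothetical helpers; a diagnostic sketch, not part of AF2.
QUERY = ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"]


def parse_gpu_query(output: str) -> List[Tuple[str, str]]:
    """Parse 'name, driver_version' lines from nvidia-smi CSV output."""
    gpus = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last comma so the driver version is always the final field.
        name, _, driver = line.rpartition(",")
        gpus.append((name.strip(), driver.strip()))
    return gpus


def query_host_gpus() -> Optional[List[Tuple[str, str]]]:
    """Return the GPUs the host driver reports, or None if nvidia-smi fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(QUERY, capture_output=True, text=True)
    if result.returncode != 0:
        return None
    return parse_gpu_query(result.stdout)


if __name__ == "__main__":
    gpus = query_host_gpus()
    if not gpus:
        print("The host driver reports no GPU: check the driver (or hardware) first.")
    else:
        for name, driver in gpus:
            print(f"host sees {name} (driver {driver})")
```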

@NicholasAKovacs

That line never worked for me either; docker couldn't find the image.

The following slightly modified line worked, since nvidia/cuda:11.0.3-base is on Docker Hub whereas nvidia/cuda:11.0-base is not:
docker run --rm --gpus all nvidia/cuda:11.0.3-base nvidia-smi
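To turn that command into a check that also separates the failure modes, here is a minimal sketch assuming Docker and Python 3 on the host; the function names are hypothetical. Exit codes 125, 126 and 127 are the ones `docker run` documents for an error in the Docker daemon itself (a missing image such as the 11.0-base tag would typically end up here), a command that cannot be invoked, and a command that cannot be found. `--gpus all` also relies on the NVIDIA Container Toolkit being installed and configured for Docker, so a 125 while nvidia-smi works on the host points at the Docker side rather than at the GPU.

```python
import shutil
import subprocess
from typing import List, Optional

# Hypothetical helpers; a diagnostic sketch, not part of AF2.
# Tag taken from the comment above; tags on Docker Hub can be retired over time.
DEFAULT_IMAGE = "nvidia/cuda:11.0.3-base"


def smoke_test_command(image: str = DEFAULT_IMAGE) -> List[str]:
    """Build argv for: docker run --rm --gpus all <image> nvidia-smi"""
    return ["docker", "run", "--rm", "--gpus", "all", image, "nvidia-smi"]


def explain_docker_exit(code: Optional[int]) -> str:
    """Translate a `docker run` exit code into a rough diagnosis."""
    if code is None:
        return "docker is not on PATH"
    if code == 0:
        return "the container can see the GPU"
    if code == 125:
        # Daemon-side error: e.g. image not found, or no GPU runtime configured.
        return "docker daemon error (check the image tag and the GPU runtime setup)"
    if code == 126:
        return "nvidia-smi could not be invoked inside the container"
    if code == 127:
        return "nvidia-smi was not found inside the container"
    return f"nvidia-smi exited with code {code} inside the container"


def run_smoke_test(image: str = DEFAULT_IMAGE) -> Optional[int]:
    """Run the container check (output goes to the terminal); None if docker is missing."""
    if shutil.which("docker") is None:
        return None
    return subprocess.run(smoke_test_command(image)).returncode


if __name__ == "__main__":
    print(explain_docker_exit(run_smoke_test()))
```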
