Error while attaching to shared memory segment #1589

Open
cassanof opened this issue Jan 25, 2025 · 1 comment

Comments

@cassanof

Hello, we are encountering the following error on the latest NCCL version. Downgrading to 2.21.5 solved the issue.

g148:14937:69361 [0] NCCL INFO Channel 00/1 : 0[0] -> 7[7] via P2P/CUMEM
g148:14937:69361 [0] NCCL INFO Channel 01/1 : 0[0] -> 8[0] [send] via NET/IB/9/GDRDMA/Shared

g148:14937:69361 [0] misc/shmutils.cc:93 NCCL WARN Call to open failed: No such file or directory

g148:14937:69361 [0] misc/shmutils.cc:129 NCCL WARN Error while attaching to shared memory segment /dev/shm/nccl-glnlGK (size 14156128), error: No such file or directory (2)
@kiskra-nvidia
Member

Is it reproducible with something generic like all_reduce_perf from https://github.com/NVIDIA/nccl-tests? Could you share a complete log file obtained with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,ENV,BOOTSTRAP,ALLOC?
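For anyone trying to collect such a log, here is a minimal sketch of one way to do it from Python. The two environment variables are exactly the ones requested above. Everything else is an assumption for illustration: the binary path `./build/all_reduce_perf`, the GPU count of 8, the log file name, and the helper names `build_env`, `build_command` and `run_and_log`. It launches a single-process, single-node run only. The failing job in the report spans nodes (rank 0 sends to rank 8 over NET/IB), so a faithful reproduction would likely need an MPI launcher on top of this.

```python
import os
import subprocess

# Debug settings quoted from the request above.
NCCL_DEBUG_SETTINGS = {
    "NCCL_DEBUG": "INFO",
    "NCCL_DEBUG_SUBSYS": "INIT,ENV,BOOTSTRAP,ALLOC",
}


def build_env(base_env=None):
    """Return a copy of the environment with the NCCL debug variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(NCCL_DEBUG_SETTINGS)
    return env


def build_command(binary="./build/all_reduce_perf", gpus=8):
    """Build an all_reduce_perf command line.

    The binary path and GPU count are assumptions; adjust them to the
    local nccl-tests build. The sweep runs from 8 bytes to 128 MB,
    doubling the message size at each step.
    """
    return [binary, "-b", "8", "-e", "128M", "-f", "2", "-g", str(gpus)]


def run_and_log(log_path="nccl_debug.log", **command_kwargs):
    """Run the test and write stdout and stderr to a single log file."""
    command = build_command(**command_kwargs)
    with open(log_path, "w") as log:
        result = subprocess.run(
            command, env=build_env(), stdout=log, stderr=subprocess.STDOUT
        )
    return result.returncode
```

Calling `run_and_log()` from the nccl-tests checkout would leave the complete output in `nccl_debug.log`, which can then be attached to the issue.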

Also, have you reviewed https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#shared-memory?
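As a quick first check before reading the guide, the sketch below inspects the shared memory directory from the failing process's point of view. It is not a diagnosis of this issue: it only reports whether `/dev/shm` exists, is writable, has room for a segment of the size shown in the warning (14156128 bytes), and accepts a new file. The function name `check_shm` and the report keys are hypothetical, and the checks are generic filesystem checks, not anything NCCL itself runs.

```python
import os
import shutil
import tempfile

# Segment size copied from the NCCL WARN line in the report above.
SEGMENT_SIZE = 14156128


def check_shm(path="/dev/shm", required_bytes=SEGMENT_SIZE):
    """Report basic health of the shared memory directory (hypothetical helper)."""
    report = {
        "path": path,
        "exists": os.path.isdir(path),
        "writable": False,
        "free_bytes": 0,
        "enough_space": False,
        "can_create": False,
    }
    if not report["exists"]:
        return report

    report["writable"] = os.access(path, os.W_OK)
    report["free_bytes"] = shutil.disk_usage(path).free
    report["enough_space"] = report["free_bytes"] >= required_bytes

    # Try to create and remove a small file, which catches read-only mounts
    # and permission problems that os.access can miss.
    try:
        with tempfile.NamedTemporaryFile(dir=path, prefix="nccl-check-"):
            report["can_create"] = True
    except OSError:
        pass
    return report


if __name__ == "__main__":
    for key, value in check_shm().items():
        print(f"{key}: {value}")
```

Running it inside the same container or job environment as each rank matters, because every rank on a node has to see the same `/dev/shm` for one rank to attach to a segment another rank created.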
