nccl test failed when using gdr #78
Comments
Disable P2P and SHM for the network test:
./all_reduce_perf -g 2
Using devices
k69a05298:85517:85517 [0] NCCL INFO Launch mode Group/CGMD
k69a05298:85517:85837 [0] transport/net_ib.cc:818 NCCL WARN NET/IB : Got completion with error 4, opcode 1, len 32622, vendor err 81
k69a05298:85517:85836 [0] transport/net_ib.cc:818 NCCL WARN NET/IB : Got completion with error 4, opcode 1, len 32622, vendor err 81
Environment: DGX-2
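For anyone trying to reproduce this, here is a rough Python sketch of the setup and of how the warning can be read. It assumes the transports were disabled through NCCL's documented environment variables NCCL_P2P_DISABLE and NCCL_SHM_DISABLE (the report does not show the exact settings), and that the "error" number in the warning is the libibverbs ibv_wc_status value. The helper names network_only_env and decode_completion are made up for illustration.

```python
import os
import re

def network_only_env(base=None):
    """Return a copy of the environment that pushes NCCL onto the network transport.

    Assumption: these are the settings behind "disable p2p and shm";
    the issue itself does not list them.
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_P2P_DISABLE"] = "1"  # turn off the GPU peer-to-peer transport
    env["NCCL_SHM_DISABLE"] = "1"  # turn off the shared-memory transport
    env["NCCL_DEBUG"] = "INFO"     # emit NCCL INFO/WARN lines
    return env

# First values of enum ibv_wc_status from libibverbs.
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    1: "IBV_WC_LOC_LEN_ERR",
    2: "IBV_WC_LOC_QP_OP_ERR",
    3: "IBV_WC_LOC_EEC_OP_ERR",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
}

_WARN = re.compile(
    r"Got completion with error (\d+), opcode (\d+), len (\d+), vendor err (\d+)"
)

def decode_completion(line):
    """Pull the numeric fields out of a NET/IB completion warning, or return None."""
    match = _WARN.search(line)
    if match is None:
        return None
    error, opcode, length, vendor_err = (int(g) for g in match.groups())
    return {
        "status": IBV_WC_STATUS.get(error, "unknown"),
        "opcode": opcode,
        "len": length,
        "vendor_err": vendor_err,
    }

# To launch the test with this environment (not run here):
#   import subprocess
#   subprocess.run(["./all_reduce_perf", "-g", "2"], env=network_only_env())
```

Under that reading, error 4 is IBV_WC_LOC_PROT_ERR, a local protection error, meaning the adapter was not allowed to access the memory region named in the work request.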
I'm a user like you, but I had the same problem and solved it by disabling PCIe ACS. I got my information from this issue, which seems to match your problem, and from this forum thread: https://forums.developer.nvidia.com/t/multi-gpu-peer-to-peer-access-failing-on-tesla-k80/39748/13 I'm not an expert, so take my suggestion with a grain of salt.
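Before changing anything, it may help to check whether ACS is actually enabled. A common check is to look for "SrcValid+" on the ACSCtl lines of `sudo lspci -vvv`. The sketch below does that on captured output; the function name devices_with_acs_enabled is hypothetical, and the parsing assumes the usual lspci layout (an unindented device header line followed by indented capability lines).

```python
import re

# An unindented line starting with a PCI address such as "00:01.0" or "0000:00:01.0".
_DEVICE = re.compile(r"^([0-9a-fA-F:.]+)\s")

def devices_with_acs_enabled(lspci_text):
    """Return the PCI addresses whose ACSCtl line shows SrcValid+.

    lspci_text is the captured output of `lspci -vvv` (root is usually
    needed for the capability details to be printed).
    """
    enabled = []
    current = None
    for line in lspci_text.splitlines():
        header = _DEVICE.match(line)
        if header:
            current = header.group(1)  # remember which device we are inside
        elif "ACSCtl:" in line and "SrcValid+" in line and current is not None:
            enabled.append(current)
    return enabled

# Example use (not run here):
#   import subprocess
#   out = subprocess.run(["lspci", "-vvv"], capture_output=True, text=True).stdout
#   print(devices_with_acs_enabled(out))
```

If the list is empty, ACS is probably not what is blocking the transfers. If it is not, the setting can usually be changed in the BIOS or per bridge from the OS; the exact procedure depends on the platform, so check the vendor documentation.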