NCCL internal error for ncclCommInitRank when using infiniband #1591
Comments
This usually happens when two GPUs within the same node end up seeing a different node topology (which should never happen in theory). Can you capture the log with NCCL_DEBUG=INFO?
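A minimal sketch of how that could be done per rank (assuming a launcher such as torchrun or a Ray worker, where each rank is its own process; the file path is only an example): NCCL_DEBUG_FILE accepts %h (hostname) and %p (PID) placeholders, so every rank writes its own .txt log.

```python
# Sketch: enable per-rank NCCL logging before the process group is created.
# Assumes each rank runs in its own process, so these variables affect only that rank.
import os

os.environ["NCCL_DEBUG"] = "INFO"
# %h expands to the hostname, %p to the PID, giving one log file per rank.
os.environ["NCCL_DEBUG_FILE"] = "/tmp/nccl_%h_%p.txt"  # hypothetical path

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Any collective triggers NCCL communicator creation and produces the log.
x = torch.ones(1, device="cuda")
dist.all_reduce(x)
torch.cuda.synchronize()
dist.destroy_process_group()
```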
Sure thing, here are the full logs:
The NCCL_DEBUG=INFO log is only for rank 0. It would be good to capture it in separate files and attach them as .txt files. Actually, that may be the reason for your problems: only rank 0 is getting the environment variables, hence NCCL_IB_HCA or NCCL_CUMEM_ENABLE is inconsistent between ranks of the same node, causing the crash. You seem to have many NICs on the node, so setting these variables consistently on every rank matters.
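One way to check that suspicion (a sketch along the same lines as the test script below, not from the original thread) is to have every rank report the NCCL-related variables it actually sees and gather them on rank 0:

```python
# Sketch: have every rank report the NCCL environment it actually sees.
# A mismatch in NCCL_IB_HCA or NCCL_CUMEM_ENABLE between ranks on the same
# host is exactly the inconsistency suspected above.
import os
import socket

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

checked = ["NCCL_DEBUG", "NCCL_IB_HCA", "NCCL_CUMEM_ENABLE"]
env = {k: os.environ.get(k, "<unset>") for k in checked}

# Collect every rank's view on rank 0 so differences are easy to spot.
gathered = [None] * dist.get_world_size()
dist.all_gather_object(gathered, (socket.gethostname(), rank, env))
if rank == 0:
    for host, r, e in gathered:
        print(f"{host} rank {r}: {e}")
dist.destroy_process_group()
```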
Okay, this is my bad. I set the envs manually on the second node and didn't restart the Ray process. I ran into another issue, though, with NCCL hanging on me: I had to manually switch NCCL_IB_HCA depending on the run. E.g. with nodes_per_proc=2, mlx5_5 would work, but it would hang with nodes_per_proc=8, which worked with mlx5_2. Here is the script I used for this:

```python
import torch
import torch.distributed as dist

# Join the job and bind this rank to its local GPU.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# All-reduce a small tensor and check the result against the world size.
data = torch.FloatTensor([1,] * 128).to("cuda")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()

value = data.mean().item()
world_size = dist.get_world_size()
assert value == world_size, f"Expected {world_size}, got {value}"
print("PyTorch NCCL is successful!")
```

Running with NCCL_IB_HCA=mlx5_5, here are the logs:
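Since the earlier crash came from setting the variables by hand on only one node, here is a small sketch of pinning the choice inside the script itself (the device names are just the ones mentioned above): NCCL_IB_HCA accepts a comma-separated list of HCAs, and a leading ^ excludes devices instead.

```python
# Sketch: choose the HCAs in code so every rank and node gets the same value.
import os

# Comma-separated list; both ports mentioned above, purely as an example.
os.environ["NCCL_IB_HCA"] = "mlx5_2,mlx5_5"
# Alternatively, "^mlx5_0" would exclude a (hypothetical) device instead.

import torch
import torch.distributed as dist

# Must be set before the process group / first NCCL communicator is created.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
```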
This begs the question: what is your network configuration? In general, you don't want to limit NCCL to a single HCA. Still, it is weird that mlx5_5 works with 2 ranks per node but hangs with 8, while mlx5_2 works with 8.
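One way to answer that from inside the container is to read the standard Linux sysfs entries for the RDMA devices (a sketch; ibstat or ibv_devinfo report the same information):

```python
# Sketch: list every RDMA device with its port state, link layer, and rate,
# to see which of the many NICs are actually up and whether they are IB or RoCE.
from pathlib import Path

root = Path("/sys/class/infiniband")
if not root.exists():
    print("no RDMA devices visible (is the IB stack available inside the container?)")
else:
    for dev in sorted(root.iterdir()):
        for port in sorted((dev / "ports").iterdir()):
            state = (port / "state").read_text().strip()      # e.g. "4: ACTIVE"
            link = (port / "link_layer").read_text().strip()  # "InfiniBand" or "Ethernet"
            rate = (port / "rate").read_text().strip()        # e.g. "200 Gb/sec (4X HDR)"
            print(f"{dev.name} port {port.name}: {state}, {link}, {rate}")
```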
Hi, I'm trying to serve a vLLM instance on 2 nodes.
I'm running inside a Docker container:
When trying to serve, it throws an internal error.
Running with NCCL_IB_DISABLE=1 works successfully, but the throughput is terrible.
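To put a number on the throughput difference, here is a rough sketch based on the same all_reduce test as above (the message size and iteration count are arbitrary); run it once with and once without NCCL_IB_DISABLE=1 and compare:

```python
# Sketch: time a large all_reduce to compare IB vs. NCCL_IB_DISABLE=1 runs.
import time

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

data = torch.ones(256 * 1024 * 1024 // 4, device="cuda")  # 256 MB of fp32

# Warm-up so communicator setup is not part of the measurement.
for _ in range(5):
    dist.all_reduce(data)
torch.cuda.synchronize()

iters = 20
start = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(data)
torch.cuda.synchronize()
elapsed = (time.perf_counter() - start) / iters

if rank == 0:
    size_gb = data.numel() * data.element_size() / 1e9
    print(f"all_reduce of {size_gb:.2f} GB took {elapsed * 1000:.1f} ms "
          f"({size_gb / elapsed:.1f} GB/s algorithm bandwidth)")
dist.destroy_process_group()
```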
Some debug info:
----NODE 0 (host)-----
----NODE 1-----