NCCL internal error for ncclCommInitRank when using infiniband #1591
Comments
This usually happens when two GPUs within the same node end up seeing a different node topology (which should never happen in theory). Can you capture the log with NCCL_DEBUG=INFO?
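A minimal sketch of how that could be done per rank (assuming a launcher such as torchrun or a Ray worker, where each rank is its own process; the file path is only an example): NCCL_DEBUG_FILE accepts %h (hostname) and %p (PID) placeholders, so every rank writes its own .txt log.

```python
# Sketch: enable per-rank NCCL logging before the process group is created.
# Assumes each rank runs in its own process, so these variables affect only that rank.
import os

os.environ["NCCL_DEBUG"] = "INFO"
# %h expands to the hostname, %p to the PID, giving one log file per rank.
os.environ["NCCL_DEBUG_FILE"] = "/tmp/nccl_%h_%p.txt"  # hypothetical path

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Any collective triggers NCCL communicator creation and produces the log.
x = torch.ones(1, device="cuda")
dist.all_reduce(x)
torch.cuda.synchronize()
dist.destroy_process_group()
```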
Sure thing, here are the full logs:
The NCCL_DEBUG=INFO log is only for rank 0. It would be good to capture it in separate files and attach them as .txt files. Actually, that may be the reason for your problems: only rank 0 is getting the environment variables, hence NCCL_IB_HCA or NCCL_CUMEM_ENABLE is inconsistent between ranks of the same node, causing the crash. You seem to have many NICs on the node, so setting these variables consistently on every rank matters.
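One way to check that suspicion (a sketch along the same lines as the test script below, not from the original thread) is to have every rank report the NCCL-related variables it actually sees and gather them on rank 0:

```python
# Sketch: have every rank report the NCCL environment it actually sees.
# A mismatch in NCCL_IB_HCA or NCCL_CUMEM_ENABLE between ranks on the same
# host is exactly the inconsistency suspected above.
import os
import socket

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

checked = ["NCCL_DEBUG", "NCCL_IB_HCA", "NCCL_CUMEM_ENABLE"]
env = {k: os.environ.get(k, "<unset>") for k in checked}

# Collect every rank's view on rank 0 so differences are easy to spot.
gathered = [None] * dist.get_world_size()
dist.all_gather_object(gathered, (socket.gethostname(), rank, env))
if rank == 0:
    for host, r, e in gathered:
        print(f"{host} rank {r}: {e}")
dist.destroy_process_group()
```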
Okay, this is my bad. I set the envs manually on the second node and didn't restart the Ray process. I ran into another issue, though, with NCCL hanging on me: I had to manually switch NCCL_IB_HCA depending on the run. E.g. with nodes_per_proc=2, mlx5_5 would work, but it would hang with nodes_per_proc=8, which worked with mlx5_2. Here is the script I used for this:

```python
import torch
import torch.distributed as dist

# Join the job and bind this rank to its local GPU.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# All-reduce a small tensor and check the result against the world size.
data = torch.FloatTensor([1,] * 128).to("cuda")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()

value = data.mean().item()
world_size = dist.get_world_size()
assert value == world_size, f"Expected {world_size}, got {value}"
print("PyTorch NCCL is successful!")
```

Running with NCCL_IB_HCA=mlx5_5, here are the logs:
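Since the earlier crash came from setting the variables by hand on only one node, here is a small sketch of pinning the choice inside the script itself (the device names are just the ones mentioned above): NCCL_IB_HCA accepts a comma-separated list of HCAs, and a leading ^ excludes devices instead.

```python
# Sketch: choose the HCAs in code so every rank and node gets the same value.
import os

# Comma-separated list; both ports mentioned above, purely as an example.
os.environ["NCCL_IB_HCA"] = "mlx5_2,mlx5_5"
# Alternatively, "^mlx5_0" would exclude a (hypothetical) device instead.

import torch
import torch.distributed as dist

# Must be set before the process group / first NCCL communicator is created.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
```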
This begs the question: what is your network configuration? In general, you don't want to limit NCCL to a single HCA. Still, it is weird that mlx5_5 works with 2 ranks per node but hangs with 8, while mlx5_2 works with 8.
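One way to answer that from inside the container is to read the standard Linux sysfs entries for the RDMA devices (a sketch; ibstat or ibv_devinfo report the same information):

```python
# Sketch: list every RDMA device with its port state, link layer, and rate,
# to see which of the many NICs are actually up and whether they are IB or RoCE.
from pathlib import Path

root = Path("/sys/class/infiniband")
if not root.exists():
    print("no RDMA devices visible (is the IB stack available inside the container?)")
else:
    for dev in sorted(root.iterdir()):
        for port in sorted((dev / "ports").iterdir()):
            state = (port / "state").read_text().strip()      # e.g. "4: ACTIVE"
            link = (port / "link_layer").read_text().strip()  # "InfiniBand" or "Ethernet"
            rate = (port / "rate").read_text().strip()        # e.g. "200 Gb/sec (4X HDR)"
            print(f"{dev.name} port {port.name}: {state}, {link}, {rate}")
```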
Hi, I'm trying to serve a vLLM instance on 2 nodes.
I'm running inside a Docker container:
When trying to serve, it throws an internal error.
Running with NCCL_IB_DISABLE=1 works successfully, but the throughput is terrible.
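To put a number on the throughput difference, here is a rough sketch based on the same all_reduce test as above (the message size and iteration count are arbitrary); run it once with and once without NCCL_IB_DISABLE=1 and compare:

```python
# Sketch: time a large all_reduce to compare IB vs. NCCL_IB_DISABLE=1 runs.
import time

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

data = torch.ones(256 * 1024 * 1024 // 4, device="cuda")  # 256 MB of fp32

# Warm-up so communicator setup is not part of the measurement.
for _ in range(5):
    dist.all_reduce(data)
torch.cuda.synchronize()

iters = 20
start = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(data)
torch.cuda.synchronize()
elapsed = (time.perf_counter() - start) / iters

if rank == 0:
    size_gb = data.numel() * data.element_size() / 1e9
    print(f"all_reduce of {size_gb:.2f} GB took {elapsed * 1000:.1f} ms "
          f"({size_gb / elapsed:.1f} GB/s algorithm bandwidth)")
dist.destroy_process_group()
```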
Some debug info:
----NODE 0 (host)-----
----NODE 1-----