[Bug] NCCL Crash with SIGSEGV #2803

Open
5 tasks done
looput opened this issue Jan 9, 2025 · 0 comments

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3)
==== backtrace (tid: 212877) ====
0 0x0000000000042520 __sigaction()  ???:0
1 0x0000000000049b8a ncclMemoryPoolAlloc<ncclProxyOp>()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/include/utils.h:280
2 0x0000000000049b8a addProxyOpIfNeeded()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:180
3 0x0000000000049b8a addProxyOpIfNeeded()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:176
4 0x000000000004c496 addCBDCollToPlan()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:481                                                                                                   
5 0x000000000004f5bd ncclLaunchPrepare()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:844                                                                                                  
6 0x000000000004f5bd ncclLaunchPrepare()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1260                                                                                                 
7 0x0000000000053d4b groupLaunch()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:129
8 0x0000000000053d4b groupLaunch()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:339
9 0x0000000000054f88 ncclGroupEndInternal()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:418                                                          
10 0x0000000000054f88 ncclGroupEndInternal()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:368                                                          
11 0x000000000004d74f ncclEnqueueCheck()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:2032                                                                                                  
12 0x0000000000044b36 ncclAllGather()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/collectives.cc:26                                                                                                   
13 0x00000000011fd1f3 c10d::ProcessGroupNCCL::_allgather_base()  ???:0                            
14 0x0000000005f8e9b8 c10d::ops::(anonymous namespace)::_allgather_base_CUDA()  Ops.cpp:0         
15 0x0000000005f985cc c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<at::Tensor, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long), std::tuple<at::Tensor, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long> >, false>::call()  :0
16 0x00000000055b224b c10::OperatorHandle::redispatchBoxed()  :0                                                                                                                                            
17 0x00000000055afad9 torch::autograd::basicAutogradNotImplementedFallbackImpl()  autograd_not_implemented_fallback.cpp:0
18 0x0000000001a8c3f8 c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::autograd_fallback>()  VariableFallbackKernel.cpp:0
19 0x0000000005f9fc2e c10::impl::BoxedKernelWrapper<std::tuple<at::Tensor, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long), void>::call()  :0                                                                
20 0x0000000005fabfe8 c10d::ProcessGroup::_allgather_base()  :0
21 0x0000000000df6c7e pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&)#1}, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&)#1}&&, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  :0
22 0x00000000004cb474 pybind11::cpp_function::dispatcher()  :0                                                                                                                                              
23 0x000000000015a10e PyObject_CallFunctionObjArgs()  ???:0                                           
24 0x0000000000150a7b _PyObject_MakeTpCall()  ???:0                                                                                                                                                         
25 0x0000000000168acb PyMethod_New()  ???:0                                                                                                                                                                 
26 0x0000000000148cfa _PyEval_EvalFrameDefault()  ???:0                                                                                                                                                     
27 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0                                             
28 0x0000000000169492 PyObject_Call()  ???:0                                                          
29 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0                                                                                                                                                     
30 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0                                                                                                                                                       
31 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0                                               
32 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0                                                                                                                                                       
33 0x000000000014345c _PyEval_EvalFrameDefault()  ???:0                                               
34 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0                                                                                                                                                       
35 0x000000000014326d _PyEval_EvalFrameDefault()  ???:0                                                                                                                                                     
36 0x000000000016893e PyMethod_New()  ???:0                                                           
37 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
38 0x000000000016893e PyMethod_New()  ???:0
39 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
40 0x000000000014fc14 _PyObject_FastCallDictTstate()  ???:0
41 0x000000000016586c _PyObject_Call_Prepend()  ???:0
42 0x0000000000280700 PyInit__datetime()  ???:0
43 0x0000000000150a7b _PyObject_MakeTpCall()  ???:0 
44 0x0000000000149629 _PyEval_EvalFrameDefault()  ???:0
45 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
46 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
47 0x00000000001687f1 PyMethod_New()  ???:0
48 0x0000000000148cfa _PyEval_EvalFrameDefault()  ???:0
49 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
50 0x000000000014345c _PyEval_EvalFrameDefault()  ???:0
51 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
52 0x000000000014345c _PyEval_EvalFrameDefault()  ???:0
53 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
54 0x000000000014345c _PyEval_EvalFrameDefault()  ???:0
55 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
56 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
=================================
[2025-01-08 11:17:51 TP7] Scheduler hit an exception: Traceback (most recent call last):
 File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1578, in run_scheduler_process
   scheduler.event_loop_overlap()
 File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
   return func(*args, **kwargs)
 File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 410, in event_loop_overlap
   recv_reqs = self.recv_requests()
 File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 459, in recv_requests
   recv_reqs = broadcast_pyobj(recv_reqs, self.tp_rank, self.tp_cpu_group)
 File "/usr/local/lib/python3.10/dist-packages/sglang/srt/utils.py", line 731, in broadcast_pyobj
   dist.broadcast(tensor_size, src=0, group=dist_group)
 File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 83, in wrapper 
   return func(*args, **kwargs)
 File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2425, in broadcast
   work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [29.127.64.100]:26496

[2025-01-08 11:17:51 TP1] Scheduler hit an exception: Traceback (most recent call last):
 File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1578, in run_scheduler_process
   scheduler.event_loop_overlap()
 File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
   return func(*args, **kwargs)
 File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 410, in event_loop_overlap
   recv_reqs = self.recv_requests()
 File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 459, in recv_requests
   recv_reqs = broadcast_pyobj(recv_reqs, self.tp_rank, self.tp_cpu_group)
 File "/usr/local/lib/python3.10/dist-packages/sglang/srt/utils.py", line 731, in broadcast_pyobj
   dist.broadcast(tensor_size, src=0, group=dist_group)
 File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 83, in wrapper 
   return func(*args, **kwargs)
 File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2425, in broadcast
   work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [29.127.64.100]:2711

Killed
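
For triage, the native backtrace above can be reduced to its NCCL frames with a short script. This is only a sketch: the names `parse_frames` and `nccl_frames` are hypothetical, and the frame pattern is inferred from the lines in this report rather than from any documented format.

```python
import re

# One native frame in this report looks like:
#   12 0x0000000000044b36 ncclAllGather()  <source path>:26
# The pattern below is an assumption inferred from those lines.
FRAME_RE = re.compile(
    r"^\s*(?P<index>\d+)\s+0x(?P<addr>[0-9a-fA-F]+)\s+"
    r"(?P<symbol>.+?)\(\)\s+(?P<location>\S+)\s*$"
)

# A few frames copied from the backtrace in this issue.
SAMPLE = """\
0 0x0000000000042520 __sigaction()  ???:0
1 0x0000000000049b8a ncclMemoryPoolAlloc<ncclProxyOp>()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/include/utils.h:280
12 0x0000000000044b36 ncclAllGather()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/collectives.cc:26
13 0x00000000011fd1f3 c10d::ProcessGroupNCCL::_allgather_base()  ???:0
"""


def parse_frames(text):
    """Return one dict per line that matches the frame pattern."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append(
                {
                    "index": int(match.group("index")),
                    "addr": match.group("addr"),
                    "symbol": match.group("symbol"),
                    "location": match.group("location"),
                }
            )
    return frames


def nccl_frames(frames):
    """Keep only frames whose source path lies inside the NCCL tree."""
    return [frame for frame in frames if "/nccl/" in frame["location"]]


if __name__ == "__main__":
    for frame in nccl_frames(parse_frames(SAMPLE)):
        print(frame["index"], frame["symbol"], frame["location"])
```

Run against the full backtrace, the lowest-numbered NCCL frame is the one closest to the fault.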

Reproduction

Node 1:
python -m sglang.launch_server --model-path DeepSeek-V3 --tp 16 --nccl-init 29.127.64.100:5000 --nnodes 2 --node-rank 0 --trust-remote-code --port 80 --host 0.0.0.0 --schedule-conservativeness 0.3 --context-length 32768

Node 2:
python -m sglang.launch_server --model-path DeepSeek-V3 --tp 16 --nccl-init 29.127.64.100:5000 --nnodes 2 --node-rank 1 --trust-remote-code --port 80 --host 0.0.0.0 --schedule-conservativeness 0.3 --context-length 32768
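
The two commands differ only in `--node-rank`. A minimal sketch that generates both from one template is shown here to make that explicit; `launch_command` is a hypothetical helper, and every flag and value is copied from the commands above rather than checked against the sglang CLI.

```python
import shlex


def launch_command(node_rank, nnodes=2, init_addr="29.127.64.100:5000"):
    """Build the launch command for one node (flags copied from this report)."""
    args = [
        "python", "-m", "sglang.launch_server",
        "--model-path", "DeepSeek-V3",
        "--tp", "16",
        "--nccl-init", init_addr,
        "--nnodes", str(nnodes),
        "--node-rank", str(node_rank),
        "--trust-remote-code",
        "--port", "80",
        "--host", "0.0.0.0",
        "--schedule-conservativeness", "0.3",
        "--context-length", "32768",
    ]
    # shlex.join quotes arguments only when the shell would need it.
    return shlex.join(args)


if __name__ == "__main__":
    for rank in range(2):
        print(launch_command(rank))
```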

Environment

/usr/local/lib/python3.10/dist-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.4                                                        
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"                                                                                                                                                              
Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at https://github.com/opencv/opencv-python/issues/884 to correct your environment. The import of cv2 has been skipped.
/usr/local/lib/python3.10/dist-packages/pydantic/_internal/_config.py:341: UserWarning: Valid config keys have changed in V2:
* 'fields' has been removed
  warnings.warn(message, UserWarning)
Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H20                                                                                    
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0                                                                                                                                                                                            
CUDA_HOME: /usr/local/cuda                                                                                         
NVCC: Cuda compilation tools, release 12.4, V12.4.131                                                                                                                                                                                  
CUDA Driver Version: 535.161.08                                                                                    
PyTorch: 2.5.1+cu124                                                                                                                                                                                                                   
sglang: 0.4.1.post3                                                                                                
flashinfer: 0.1.6+cu124torch2.4                                                                                                                                                                                                        
triton: 3.1.0                                                                                                      
transformers: 4.47.1                                                                                                                                                                                                                   
torchao: 0.7.0                                                                                                     
numpy: 1.26.4                                                                                                                                                                                                                          
aiohttp: 3.9.5                                                                                                     
fastapi: 0.114.1                                                                                                                                                                                                                       
hf_transfer: 0.1.8                                                                                                 
huggingface_hub: 0.24.7                                                                                                                                                                                                                
interegular: 0.3.3                                                                                                 
modelscope: 1.21.1                                                                                                                                                                                                                     
orjson: 3.10.13                                                                                                    
packaging: 24.0                                                                                                                                                                                                                        
psutil: 5.9.8                                                                                                      
pydantic: 2.9.1
multipart: 0.0.20
zmq: 26.0.3
uvicorn: 0.30.6                                                                                                    
uvloop: 0.20.0                                                                                                                                                                                                                         
vllm: 0.6.4.post1
openai: 1.58.1
anthropic: 0.42.0
decord: 0.6.0

Legend:                                                                                                                                                                                                                                
                                                                                                                                                                                                                                       
  X    = Self                                                                                                                                                                                                                          
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)                                                                                                                                 
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node                                                                                                                           
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)                                                                                                                                                  
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)                                                                                                                                         
  PIX  = Connection traversing at most a single PCIe bridge                                                                                                                                                                            
  NV#  = Connection traversing a bonded set of # NVLinks                                                                                                                                                                               
                                                                                                                                                                                                                                       
NIC Legend:       

NVIDIA Topology:                                                                                                                                                                                                                                                                                                       
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    NIC9    NIC10   NIC11   NIC12   NIC13   NIC14   NIC15   NIC16   NIC17   NIC18   NIC19   NIC20   NIC21   NIC22   NIC23   NIC24   NIC25   CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     0-95,192-287    0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    PHB     PIX     SYS     SYS     SYS     SYS     0-95,192-287    0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    PIX     PHB     SYS     SYS     SYS     SYS     0-95,192-287    0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     0-95,192-287    0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    96-191,288-383  1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    96-191,288-383  1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     PHB     NODE    NODE    PIX     96-191,288-383  1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     PIX     NODE    NODE    PHB     96-191,288-383  1               N/A
NIC0    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE     X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC1    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC2    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC3    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC4    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC5    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC6    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC7    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC8    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC9    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC10   SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC11   SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC12   SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC13   SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC14   SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC15   SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC16   SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC17   SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC18   PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    SYS     SYS     SYS     SYS
NIC19   NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE     X      NODE    NODE    SYS     SYS     SYS     SYS
NIC20   NODE    PHB     PIX     NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE     X      PHB     SYS     SYS     SYS     SYS
NIC21   NODE    PIX     PHB     NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    PHB      X      SYS     SYS     SYS     SYS
NIC22   SYS     SYS     SYS     SYS     NODE    NODE    PHB     PIX     NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODENODE    SYS     SYS     SYS     SYS      X      NODE    NODE    PHB
NIC23   SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODENODE    SYS     SYS     SYS     SYS     NODE     X      NODE    NODE
NIC24   SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODENODE    SYS     SYS     SYS     SYS     NODE    NODE     X      NODE
NIC25   SYS     SYS     SYS     SYS     NODE    NODE    PIX     PHB     NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODENODE    SYS     SYS     SYS     SYS     PHB     NODE    NODE     X 
                                                                                                                                                                                                                                       
Legend:                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                       
  X    = Self                                                                                                                                                                                                                                                             
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)                                                                                                                                 
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node                                                                                                                                                              
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)                                                                                                                                                  
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)                                                                                                                                                                            
  PIX  = Connection traversing at most a single PCIe bridge                                                                                                                                                                            
  NV#  = Connection traversing a bonded set of # NVLinks                                                                                                                                                                                                                  
                                                                                                                                                                                                                                       
NIC Legend:                                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                       
  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9
  NIC10: mlx5_10
  NIC11: mlx5_11
  NIC12: mlx5_12
  NIC13: mlx5_13
  NIC14: mlx5_14
  NIC15: mlx5_16
  NIC16: mlx5_17
  NIC17: mlx5_18
  NIC18: mlx5_bond_1
  NIC19: mlx5_bond_2
  NIC20: mlx5_bond_3
  NIC21: mlx5_bond_4
  NIC22: mlx5_bond_5
  NIC23: mlx5_bond_6
  NIC24: mlx5_bond_7
  NIC25: mlx5_bond_8
                                                                                                                   
ulimit soft: 1024