I am running benchmark tests using nccl-tests. I have 2 nodes connected via RoCE, and I have installed nv_peer_memory. However, once I turn on GPU Direct RDMA, the all_reduce_perf bandwidth gets dramatically worse than without GPU Direct RDMA. I am aware that the GPU PCIe topology matters, which is why I am only using GPU0 on both nodes, since GPU0 and the Mellanox HCA are attached to the same CPU.

![Screen Shot 2019-04-16 at 8 23 46 PM](https://user-images.githubusercontent.com/28944236/56209232-99471480-6085-11e9-8d7e-224b0a45ea0d.png)
![Screen Shot 2019-04-16 at 8 34 58 PM](https://user-images.githubusercontent.com/28944236/56210182-d3191a80-6087-11e9-907c-22397bda204d.png)
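Both cases are launched the same way, roughly the shape below (host names, binary path, and the size sweep are placeholders, not my exact command line):

```bash
# Same launch for both runs; only the host-side GPU Direct RDMA configuration
# differs between the two cases. Host names and paths are placeholders.
mpirun -np 2 -H node1,node2 \
    -x NCCL_DEBUG=INFO \
    -x CUDA_VISIBLE_DEVICES=0 \
    ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
```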
The GPU topology is:

Without GPU Direct RDMA, just plain RoCE: GPU0 on node 1 <-> GPU0 on node 2
With GPU Direct RDMA over RoCE: GPU0 on node 1 <-> GPU0 on node 2

![Screen Shot 2019-04-16 at 8 31 29 PM](https://user-images.githubusercontent.com/28944236/56209733-ccd66e80-6086-11e9-81f2-09497f4e682b.png)
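The PCIe locality can also be confirmed directly on each node; what matters is the label between GPU0 and the Mellanox device (PIX/PHB means they share a PCIe switch or host bridge, SYS means traffic has to cross the inter-CPU link):

```bash
# Print the GPU <-> NIC PCIe/NUMA relationship on this node.
nvidia-smi topo -m
```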
According to the suggested system configurations for GPUDirect RDMA, having a single CPU between the GPU and the Mellanox HCA yields worse performance, but I never expected it to be this much worse.
At this point, I am wondering if there is any tool that can help debug nv_peer_mem to make sure it really takes effect. Or maybe there is something I misconfigured?
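So far the only checks I have are generic module-level ones, along these lines:

```bash
# Basic sanity checks that the nv_peer_mem kernel module is loaded.
lsmod | grep nv_peer_mem          # is the module present in the kernel?
sudo service nv_peer_mem status   # init script installed by nvidia-peer-memory
dmesg | grep -i nv_peer_mem       # any load/registration messages
```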
Here are the details about my environment:
NVIDIA Tesla V100
CUDA 9.0
NCCL 2.2.13
OFED 4.2-1.2.0
Mellanox MT27710 ConnectX-4 Lx
nvidia_peer_memory 1.0-8
I notice that the log says 'No module present for GPU Direct RDMA'. When I check the module's status, this is what it looks like. Is this normal?

![Screen Shot 2019-04-16 at 8 52 55 PM](https://user-images.githubusercontent.com/28944236/56211098-c4336780-6089-11e9-9528-274fdb1809dc.png)
Even after I re-installed nv_peer_mem and the 'No module present for GPU Direct RDMA' message is gone, the performance still doesn't get any better in the GPU Direct RDMA case.
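In case it is useful, the next things I can think of checking (the grep pattern may differ between NCCL versions, and the ACS point is the general GPUDirect guidance rather than anything specific to this setup):

```bash
# With NCCL_DEBUG=INFO, the ring setup lines indicate whether the IB transport
# is using GPU Direct RDMA (they mention GDRDMA when it is in effect; exact
# wording may vary across NCCL versions).
mpirun -np 2 -H node1,node2 -x NCCL_DEBUG=INFO \
    ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1 2>&1 | grep -i gdrdma

# PCIe ACS forces peer-to-peer traffic through the root complex and is a known
# cause of poor GPU Direct RDMA bandwidth; check whether it is enabled on the
# bridges between the GPU and the NIC.
sudo lspci -vvv | grep -i acsctl
```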