[Bug] deepseek v3 inference on multiple nodes is very slow #2794

Open

inforly opened this issue Jan 8, 2025 · 5 comments

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

We can run the DeepSeek V3 model on multiple GPU nodes (e.g., 16), but inference is very slow, only ~1.5 tokens/s. Are there any issues with multi-node serving, or is something wrong with our configuration?

Reproduction

The start command:
python3 -m sglang.launch_server --model-path /mnt/blob/deepseek/ --tp 16 --dist-init-addr $ip:20000 --nnodes 16 --node-rank 0 --enable-torch-compile --quantization fp8 --kv-cache-dtype fp8_e5m2 --disable-cuda-graph --trust-remote-code --host 0.0.0.0 --port 40000
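
(Presumably this same command is run on each of the 16 nodes with only --node-rank changed, 0 through 15, and $ip pointing at the rank-0 node, mirroring the two-node example later in this thread; a sketch:)

# hypothetical per-node launch, run on node i = 0..15
python3 -m sglang.launch_server --model-path /mnt/blob/deepseek/ --tp 16 --dist-init-addr $ip:20000 --nnodes 16 --node-rank $i --enable-torch-compile --quantization fp8 --kv-cache-dtype fp8_e5m2 --disable-cuda-graph --trust-remote-code --host 0.0.0.0 --port 40000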

Environment

16 Nvidia H100 GPU nodes

zhyncs (Member) commented Jan 8, 2025

Hi @antferdom @fsygd, could you take a look? How is the performance on multi-node H200 or H800 setups? Thanks!

zhyncs (Member) commented Jan 8, 2025

Hi @inforly, why did you disable CUDA Graph?

cgpeter96 commented Jan 8, 2025

I deployed DeepSeek V3 on 2×8 H800 GPUs. Token generation is very fast, but it cannot stop and returns an empty string.

roG0d (Contributor) commented Jan 8, 2025

For 2 nodes with 8×H200 GPUs each, the performance is as follows:

DeepSeek V3 on 2x8xH200 (multi-node)

BF16

Given the following configuration:

# launch server
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000 --enable-torch-compile --mem-fraction-static 0.8 --disable-cuda-graph

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000 --enable-torch-compile --mem-fraction-static 0.8 --disable-cuda-graph


# bench_serving for 300/600/1200/2400 num-prompts
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 300 --request-rate 1 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000
RPS   Num Prompts   Median E2E Latency (ms)   Median TTFT (ms)   Median TPOT (ms)   Median ITL (ms)   Output token throughput (tok/s)
1     300           971,353.97                53,189.54          843.03             638.68            275.06
2     600           2,010,951.23              313,373.93         1622.07            1192.37           256.50
4     1200          3,881,082.65              774,460.73         1645.51            1178.42           255.45
8     2400          6,819,185.61              4,072,706.72       2239.22            1205.60           250.08

FP8

Given the following configuration:

# launch server
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000 --enable-torch-compile --quantization fp8 --kv-cache-dtype fp8_e5m2 --disable-cuda-graph

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000 --enable-torch-compile --quantization fp8 --kv-cache-dtype fp8_e5m2 --disable-cuda-graph


# bench_serving for 300/600/1200/2400 num-prompts
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 300 --request-rate 1 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000
RPS   Num Prompts   Median E2E Latency (ms)   Median TTFT (ms)   Median TPOT (ms)   Median ITL (ms)   Output token throughput (tok/s)
1     300           985,610.62                56,824.07          862.84             662.33            271.60
2     600           1,975,371.99              305,318.37         1632.35            1219.14           288.41
4     1200          3,901,390.30              767,082.14         3023.99            2189.83           269.19
8     2400          7,374,173.14              1,680,440.41       2974.87            2007.02           276.74

The main difference I notice is that you're using --nnodes 16 with --tp 16, which places a single tensor-parallel rank (one GPU) on each of 16 nodes. This could explain the low throughput: as we observed, communication between nodes carries much more overhead than communication between GPUs within the same node. It's generally preferable to consolidate the tensor-parallel group onto as few nodes as possible rather than spread it across many machines.
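
Since inter-node all-reduce dominates at that scale, it may also be worth verifying that NCCL is actually using the fast interconnect (InfiniBand/RoCE) instead of falling back to TCP over Ethernet. A rough diagnostic sketch (not from this thread; the interface and HCA names below are placeholders for whatever your cluster exposes):

# check that the InfiniBand adapters are up on every node
ibstat

# check the intra-node GPU topology (NVLink vs. PCIe)
nvidia-smi topo -m

# make NCCL log which transport it picks, and point it at the right interfaces
export NCCL_DEBUG=INFO            # prints chosen transports/rings at startup
export NCCL_IB_DISABLE=0          # allow InfiniBand (set to 1 only to rule it out)
export NCCL_SOCKET_IFNAME=eth0    # placeholder: NIC used for bootstrap/TCP traffic
export NCCL_IB_HCA=mlx5           # placeholder: prefix of the IB HCAs to use

# then relaunch sglang.launch_server on each node with these variables exported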

@antferdom, @zhyncs, @inforly

inforly (Author) commented Jan 9, 2025

Great, thanks @roG0d! I tried reducing --nnodes from 16 to 8, and the throughput increased from 1.5 tokens/s to 2 tokens/s. A few more questions:

  1. Are you using InfiniBand to connect the nodes?
  2. For this kind of multi-node inference, do the worker nodes only communicate with the head node, or do they also communicate with each other?
