[Bug] deepseek v3 inference on multiple nodes is very slow #2794

Open

inforly opened this issue Jan 8, 2025 · 5 comments

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

We can run the DeepSeek V3 model on multiple GPU nodes (e.g., 16), but inference is very slow, only ~1.5 tokens/s. Are there any issues with multi-node serving, or is something wrong with our configuration?

Reproduction

The start command:
python3 -m sglang.launch_server --model-path /mnt/blob/deepseek/ --tp 16 --dist-init-addr $ip:20000 --nnodes 16 --node-rank 0 --enable-torch-compile --quantization fp8 --kv-cache-dtype fp8_e5m2 --disable-cuda-graph --trust-remote-code --host 0.0.0.0 --port 40000
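
(Presumably this same command is run on each of the 16 nodes with only --node-rank changed, 0 through 15, and $ip pointing at the rank-0 node, mirroring the two-node example later in this thread; a sketch:)

# hypothetical per-node launch, run on node i = 0..15
python3 -m sglang.launch_server --model-path /mnt/blob/deepseek/ --tp 16 --dist-init-addr $ip:20000 --nnodes 16 --node-rank $i --enable-torch-compile --quantization fp8 --kv-cache-dtype fp8_e5m2 --disable-cuda-graph --trust-remote-code --host 0.0.0.0 --port 40000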

Environment

16 Nvidia H100 GPU nodes

zhyncs (Member) commented Jan 8, 2025

Hi @antferdom @fsygd, could you take a look? How is the performance on multi-node H200 or H800 setups? Thanks!

zhyncs (Member) commented Jan 8, 2025

Hi @inforly, why did you disable CUDA Graph?

cgpeter96 commented Jan 8, 2025

I deployed DeepSeek V3 on 2×8 H800 GPUs. Token generation is very fast, but it cannot stop and returns an empty string.

roG0d (Contributor) commented Jan 8, 2025

For 2 nodes with 8×H200 GPUs each, the performance is as follows:

DeepSeek V3 on 2x8xH200 (multi-node)

BF16

Given the following configuration:

# launch server
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000 --enable-torch-compile --mem-fraction-static 0.8 --disable-cuda-graph

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000 --enable-torch-compile --mem-fraction-static 0.8 --disable-cuda-graph


# bench_serving for 300/600/1200/2400 num-prompts
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 300 --request-rate 1 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000
RPS   Num Prompts   Median E2E Latency (ms)   Median TTFT (ms)   Median TPOT (ms)   Median ITL (ms)   Output token throughput (tok/s)
1     300           971,353.97                53,189.54          843.03             638.68            275.06
2     600           2,010,951.23              313,373.93         1622.07            1192.37           256.50
4     1200          3,881,082.65              774,460.73         1645.51            1178.42           255.45
8     2400          6,819,185.61              4,072,706.72       2239.22            1205.60           250.08

FP8

Given the following configuration:

# launch server
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000 --enable-torch-compile --quantization fp8 --kv-cache-dtype fp8_e5m2 --disable-cuda-graph

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000 --enable-torch-compile --quantization fp8 --kv-cache-dtype fp8_e5m2 --disable-cuda-graph


# bench_serving for 300/600/1200/2400 num-prompts
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 300 --request-rate 1 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000
RPS   Num Prompts   Median E2E Latency (ms)   Median TTFT (ms)   Median TPOT (ms)   Median ITL (ms)   Output token throughput (tok/s)
1     300           985,610.62                56,824.07          862.84             662.33            271.60
2     600           1,975,371.99              305,318.37         1632.35            1219.14           288.41
4     1200          3,901,390.30              767,082.14         3023.99            2189.83           269.19
8     2400          7,374,173.14              1,680,440.41       2974.87            2007.02           276.74

The main difference I notice is that you're using --nnodes 16 with --tp 16, which places a single tensor-parallel rank (one GPU) on each of 16 nodes. This could explain the low throughput: as we observed, communication between nodes carries much more overhead than communication between GPUs within the same node. It's generally preferable to consolidate the tensor-parallel group onto as few nodes as possible rather than spread it across many machines.
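
Since inter-node all-reduce dominates at that scale, it may also be worth verifying that NCCL is actually using the fast interconnect (InfiniBand/RoCE) instead of falling back to TCP over Ethernet. A rough diagnostic sketch (not from this thread; the interface and HCA names below are placeholders for whatever your cluster exposes):

# check that the InfiniBand adapters are up on every node
ibstat

# check the intra-node GPU topology (NVLink vs. PCIe)
nvidia-smi topo -m

# make NCCL log which transport it picks, and point it at the right interfaces
export NCCL_DEBUG=INFO            # prints chosen transports/rings at startup
export NCCL_IB_DISABLE=0          # allow InfiniBand (set to 1 only to rule it out)
export NCCL_SOCKET_IFNAME=eth0    # placeholder: NIC used for bootstrap/TCP traffic
export NCCL_IB_HCA=mlx5           # placeholder: prefix of the IB HCAs to use

# then relaunch sglang.launch_server on each node with these variables exported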

@antferdom, @zhyncs, @inforly

inforly (Author) commented Jan 9, 2025

Great, thanks @roG0d! I tried reducing --nnodes from 16 to 8, and the throughput increased from 1.5 tokens/s to 2 tokens/s. A few more questions:

  1. Are you using InfiniBand to connect the nodes?
  2. For this kind of multi-node inference, do the worker nodes only communicate with the head node, or do they also communicate with each other?
