[Bug] deepseek v3 inference on multiple nodes is very slow #2794
Comments
Hi @antferdom @fsygd, could you take a look? How is the performance on H200 or H800 multi-node setups? Thanks!
Hi @inforly, why did you disable CUDA Graph?
I deployed DeepSeek V3 on …
For 2 nodes with 8xH200 GPUs, the performance is the following:

DeepSeek V3 on 2x8xH200 (multi-node), BF16. Given the following configuration:

# launch server
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000 --enable-torch-compile --mem-fraction-static 0.8 --disable-cuda-graph
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000 --enable-torch-compile --mem-fraction-static 0.8 --disable-cuda-graph
# bench_serving for 300/600/1200/2400 num-prompts
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompts 300 --request-rate 1 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000
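The benchmark described above is run once per request count (300/600/1200/2400). A minimal sketch of that sweep, written as a dry run that only prints each invocation so it can be inspected without a live server (drop the `echo` to actually execute; the flags mirror the command above and are otherwise assumptions):

```shell
# Dry-run sweep over the request counts used in the benchmark above.
# Assumes the sglang server is already serving on 0.0.0.0:40000.
for n in 300 600 1200 2400; do
  echo python3 -m sglang.bench_serving --backend sglang --dataset-name random \
    --random-range-ratio 1 --num-prompts "$n" --request-rate 1 \
    --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000
done
```

Keeping `--request-rate` fixed while scaling `--num-prompts` lengthens the run without changing the offered load, which makes the throughput numbers at the four sizes directly comparable.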
FP8. Given the following configuration:
The main difference I notice is that you're using …
Great, thanks @roG0d! I tried reducing --nnodes from 16 to 8, and the throughput increased from 1.5 tokens/s to 2 tokens/s. More questions here:
Checklist
Describe the bug
We can run the DeepSeek V3 model across multiple GPU nodes (16 in our case), but inference is very slow, only ~1.5 tokens/s. Are there known issues with multi-node serving, or is something wrong with our configuration?
Reproduction
The start command:
python3 -m sglang.launch_server --model-path /mnt/blob/deepseek/ --tp 16 --dist-init-addr $ip:20000 --nnodes 16 --node-rank 0 --enable-torch-compile --quantization fp8 --kv-cache-dtype fp8_e5m2 --disable-cuda-graph --trust-remote-code --host 0.0.0.0 --port 40000
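The same command has to be issued on every node with only `--node-rank` varying. A hypothetical helper that prints the per-rank command for all 16 nodes (dry run; `RANK0_IP` is a placeholder for the rank-0 node's address that `$ip` holds in the command above, and in practice each printed command must be run on its own node):

```shell
# Print the launch command for each of the 16 node ranks.
# RANK0_IP is a placeholder; replace it with the rank-0 node's real address.
RANK0_IP=192.168.0.1
for rank in $(seq 0 15); do
  echo "node $rank:" python3 -m sglang.launch_server --model-path /mnt/blob/deepseek/ \
    --tp 16 --dist-init-addr "$RANK0_IP:20000" --nnodes 16 --node-rank "$rank" \
    --enable-torch-compile --quantization fp8 --kv-cache-dtype fp8_e5m2 \
    --disable-cuda-graph --trust-remote-code --host 0.0.0.0 --port 40000
done
```

All ranks point `--dist-init-addr` at the same rank-0 endpoint so the distributed process group can rendezvous; a mismatch there is a common cause of hangs at startup.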
Environment
16 Nvidia H100 GPU nodes