[Feature] DeepSeek V3 optimization #2591
Comments
Very quick response!
The overlap scheduler is model-independent but is not yet supported when using DP attention. We have a private branch for this and will upstream it soon.
Is the memory of an 8-GPU instance sufficient? This model is very large.
671B works on H200 * 8 with FP8 (671 < 141 * 8)
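As a rough check of the arithmetic in the reply above, the sketch below compares weight memory against aggregate GPU memory. It assumes one byte per parameter for FP8 and two for BF16, treats 1 GB as 10^9 bytes, and ignores KV cache, activations, and runtime overhead, so it only bounds the weights. The function names are illustrative and not part of SGLang.

```python
# Back-of-the-envelope check: do the model weights fit in aggregate GPU memory?
# Assumptions: 1 GB = 1e9 bytes; only weights are counted (no KV cache,
# activations, or framework overhead).

PARAMS_BILLIONS = 671   # parameter count quoted in the thread
GPU_MEMORY_GB = 141     # per-GPU H200 capacity quoted in the thread
NUM_GPUS = 8


def weight_memory_gb(params_billions, bytes_per_param):
    """Weight footprint in GB: 1e9 params * bytes each / 1e9 bytes per GB."""
    return params_billions * bytes_per_param


def headroom_gb(params_billions, bytes_per_param, gpu_memory_gb, num_gpus):
    """Aggregate GPU memory left after loading the weights (negative = no fit)."""
    return gpu_memory_gb * num_gpus - weight_memory_gb(params_billions, bytes_per_param)


if __name__ == "__main__":
    for name, nbytes in (("FP8", 1), ("BF16", 2)):
        left = headroom_gb(PARAMS_BILLIONS, nbytes, GPU_MEMORY_GB, NUM_GPUS)
        verdict = "fits" if left > 0 else "does not fit"
        print(f"{name}: weights {verdict}, headroom {left} GB")
```

Under these assumptions FP8 leaves headroom for the KV cache on one node, while 16-bit weights alone would exceed the node's total memory.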
Hi @fengyang95, you can also consider a multi-node setup.
FYI: due to the tight schedule, SGLang v0.4.1 currently provides only preliminary support for DeepSeek V3. To make it run more cost-efficiently, we need to complete most of the optimizations mentioned above. If you are interested in any of them, feel free to join the SGLang Slack for discussion or contribute a PR. We hope to complete these optimizations quickly and appreciate any discussion and contributions.
Update: SGLang v0.4.1.post1 supports CUDA Graph for DeepSeek V3. Please use the latest version:
pip install "sglang[all]==0.4.1.post1" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer
Update: SGLang v0.4.1.post2 supports FP8 GEMM tuning for DeepSeek V3. Please use the latest version.
ref #2647
Is there a plan to support MTP?
It's on the roadmap and it's named |
@zhyncs @Ying1123 @merrymercy Hello, I have two questions. Could you help me answer them?
1. Can we decouple TP and DP after this implementation? Can we configure a scenario where DP is not equal to TP?
2. Is there a detailed schedule for the optimizations mentioned above? Are there any supporting design documents that can be shared?
I have another question regarding DP attention. The SGLang blog mentions that DP attention is effective because MLA has only 1 KV head, which causes unnecessary duplication of KV caches. DeepSeek-V3 MLA has more KV heads (16 attention, 128 KV), so do we still replicate KV caches when using something like TP8 or TP16? I understand there might not be sufficient heads if the deployment is large. +1 for shared design docs, if possible.
https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/config.json
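The KV cache duplication question above can be made concrete with a small estimate. This is a sketch under stated assumptions, not SGLang's implementation: it assumes the MLA cache stores one compressed latent vector plus a RoPE key part per token per layer, that this latent cannot be sharded by head under tensor parallelism, and that DP attention splits tokens evenly across ranks. The values kv_lora_rank=512, qk_rope_head_dim=64, and num_hidden_layers=61 are assumed to match the linked config.json and should be verified there; the 2-byte cache element size and the function names are hypothetical.

```python
# Rough estimate of MLA KV cache size per GPU under TP-only vs. DP attention.
# Assumed model values (verify against the linked config.json):
KV_LORA_RANK = 512       # compressed KV latent width (assumption)
QK_ROPE_HEAD_DIM = 64    # decoupled RoPE key width (assumption)
NUM_LAYERS = 61          # number of transformer layers (assumption)
BYTES_PER_ELEM = 2       # 16-bit cache entries (assumption)


def mla_kv_bytes_per_token(kv_lora_rank, qk_rope_head_dim, num_layers, bytes_per_elem):
    """Bytes cached per token if each layer stores one latent plus one RoPE key."""
    return (kv_lora_rank + qk_rope_head_dim) * num_layers * bytes_per_elem


def cache_bytes_per_gpu(total_tokens, bytes_per_token, num_gpus, dp_attention):
    """Per-GPU cache size.

    TP only: the latent is not split by head, so every rank holds all tokens.
    DP attention: each rank holds only its own requests (even split assumed).
    """
    tokens_here = total_tokens // num_gpus if dp_attention else total_tokens
    return tokens_here * bytes_per_token


if __name__ == "__main__":
    per_token = mla_kv_bytes_per_token(
        KV_LORA_RANK, QK_ROPE_HEAD_DIM, NUM_LAYERS, BYTES_PER_ELEM
    )
    tokens = 80_000  # hypothetical number of cached tokens across all requests
    tp_only = cache_bytes_per_gpu(tokens, per_token, 8, dp_attention=False)
    dp_attn = cache_bytes_per_gpu(tokens, per_token, 8, dp_attention=True)
    print(f"bytes per token: {per_token}")
    print(f"TP8 only: {tp_only / 1e9:.2f} GB per GPU")
    print(f"DP8 attention: {dp_attn / 1e9:.2f} GB per GPU")
```

If these assumptions hold, the per-GPU cache under TP only is larger by a factor equal to the number of ranks, which is the duplication the blog refers to.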
Is there any data on inference-time batch size and token imbalance between experts? What is the total throughput for an 8xH200 node?
Has there been any progress on NextN support?
The overlap scheduler with DP attention cannot be used on A800 * 4 because it always runs out of memory (OOM).
Is there a plan to support TP + SP attention? The paper says "The attention part employs 4-way Tensor Parallelism (TP4) with Sequence Parallelism (SP)" |
Checklist
Usage
User Guide for Existing System (Installation & Launch)
https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3
Please use the latest version v0.4.1.post3
Features
moe_align_block_size @HandH1998 @zhyncs @BBuf
E=256,N=256,device_name=NVIDIA_H200,dtype=fp8_w8a8.json @BBuf
nextn speculative decoding @merrymercy
Related resources
No response