[Feature] DeepSeek V3 optimization #2591

Open
7 of 15 tasks
zhyncs opened this issue Dec 26, 2024 · 18 comments
Labels
enhancement New feature or request high priority performance quant LLM Quantization

Comments

@zhyncs
Member

zhyncs commented Dec 26, 2024

Checklist

Usage

User Guide for Existing System (Installation & Launch)

https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3

Please use the latest version v0.4.1.post3

Features

Related resources

No response

@zhyncs zhyncs added enhancement New feature or request performance quant LLM Quantization labels Dec 26, 2024
@zhyncs zhyncs pinned this issue Dec 26, 2024
@libratiger
Contributor

Very quick response!
I understand that the overlap scheduler is model-independent and is a general optimization that should be supported by default.
Or are some special optimizations still needed for it?

@merrymercy
Contributor

merrymercy commented Dec 26, 2024

The overlap scheduler is model-independent, but it is not yet supported when DP attention is used. We have a private branch for this and will upstream it soon.

@fengyang95

fengyang95 commented Dec 26, 2024

Is the memory sufficient on an 8-GPU instance? This model is very large.

@zhyncs
Member Author

zhyncs commented Dec 26, 2024

> Is the memory sufficient on an 8-GPU instance? This model is very large.

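671B works on H200 * 8 with FP8: at 1 byte per parameter the weights take about 671 GB, which is less than the 141 GB * 8 = 1128 GB of total GPU memory (671 < 141 * 8).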
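The inequality above can be checked with a back-of-the-envelope sketch. It counts weights only and ignores KV cache, activations, and runtime overhead, so it gives a lower bound on what is needed rather than a guarantee. The function names are made up for illustration, and the 2-bytes-per-parameter comparison is an added assumption, not something stated in this thread.

```python
# Rough check: do the model weights fit in the aggregate GPU memory?
# Weights only; KV cache, activations, and framework overhead are ignored.

def weight_memory_gb(num_params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return num_params_billion * bytes_per_param


def fits(num_params_billion: float, bytes_per_param: float,
         num_gpus: int, gpu_memory_gb: float):
    """Return (needed GB, available GB, whether needed < available)."""
    needed = weight_memory_gb(num_params_billion, bytes_per_param)
    available = num_gpus * gpu_memory_gb
    return needed, available, needed < available


# FP8 stores one byte per parameter: 671B parameters on 8 GPUs with 141 GB each.
needed, available, ok = fits(671, 1.0, 8, 141)
print(f"FP8: need ~{needed:.0f} GB of {available:.0f} GB, fits: {ok}")

# Assumed comparison: the same parameter count at 2 bytes per parameter.
needed16, available16, ok16 = fits(671, 2.0, 8, 141)
print(f"2-byte: need ~{needed16:.0f} GB of {available16:.0f} GB, fits: {ok16}")
```

The remaining headroom after the weights is what is left for the KV cache and everything else, which is why a smaller-memory GPU can fail even when the weights alone nominally fit.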

@zhyncs
Member Author

zhyncs commented Dec 26, 2024

Hi @fengyang95, you can also consider a multi-node setup.

If you do not have GPUs with large enough memory, please try multi-node tensor parallelism (help 1 help 2).

@zhyncs
Member Author

zhyncs commented Dec 26, 2024

FYI Due to the tight schedule, SGLang v0.4.1 currently only provides preliminary support for DeepSeek V3. To make it run more cost-efficiently, we need to complete most of the optimizations mentioned above. If you are interested in any of the above optimizations, feel free to join the SGLang Slack for discussions or contribute a PR. We hope to complete these optimizations quickly and appreciate any discussion and contributions.

@zhyncs
Member Author

zhyncs commented Dec 27, 2024

Update: SGLang v0.4.1.post1 supports CUDA Graph for DeepSeek V3. Please use the latest version.

pip install "sglang[all]==0.4.1.post1" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer

@zhyncs
Member Author

zhyncs commented Dec 29, 2024

Update: SGLang v0.4.1.post2 supports FP8 GEMM Tuning for DeepSeek V3. Please use the latest version.

pip install "sglang[all]==0.4.1.post2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer
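Since each update asks for a newer post-release, it can help to confirm which version is actually installed. The sketch below is one possible way to do that with the standard library only. It handles only version strings of the form `X.Y.Z` or `X.Y.Z.postN` (the forms that appear in this thread), and all function names are hypothetical helpers, not part of SGLang.

```python
import re
from importlib import metadata

# Matches "X.Y.Z" with an optional ".postN" suffix, e.g. "0.4.1.post2".
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:\.post(\d+))?$")


def parse_version(version: str):
    """Turn 'X.Y.Z' or 'X.Y.Z.postN' into a comparable tuple.

    A version without a post suffix sorts before any post-release of the
    same X.Y.Z, so the missing suffix is encoded as -1.
    """
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    major, minor, patch, post = match.groups()
    return int(major), int(minor), int(patch), int(post) if post is not None else -1


def is_at_least(installed: str, minimum: str) -> bool:
    """True if `installed` is the same as or newer than `minimum`."""
    return parse_version(installed) >= parse_version(minimum)


def installed_sglang_version():
    """Return the installed sglang version string, or None if it is not installed."""
    try:
        return metadata.version("sglang")
    except metadata.PackageNotFoundError:
        return None


print(is_at_least("0.4.1.post2", "0.4.1.post1"))
print(is_at_least("0.4.1", "0.4.1.post1"))
```

For anything beyond this narrow format (pre-releases, dev builds, local version labels), a full PEP 440 parser is the safer choice.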

@zhyncs
Member Author

zhyncs commented Dec 30, 2024

ref #2647

@CSEEduanyu

Is there a plan to support MTP?

@zhyncs
Member Author

zhyncs commented Jan 6, 2025

> Is there a plan to support MTP?

It's on the roadmap and it's named nextn. We'll support it soon.

@lixiaolx

lixiaolx commented Jan 8, 2025

@zhyncs @Ying1123 @merrymercy, hello.
You mentioned this item above:

> TP+DP Attention @Ying1123

I have two questions. Could you help me answer them?

1. Can we decouple TP and DP after this implementation? Can we configure a scenario where DP is not equal to TP?

2. Is there a detailed schedule for the items mentioned above? Are there any related design documents that can be shared?

@Mutinifni
Contributor

I have another question regarding DP attention. The SGLang blog mentions that DP attention is effective because MLA has only 1 KV head, which causes unnecessary duplication of KV caches. DeepSeek-V3 MLA has more KV heads (16 attention, 128 KV), so do we still replicate KV caches when using something like TP8 or TP16? I understand there might not be enough heads if the deployment is large.

+1 for shared design docs, if possible.
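To make the question concrete, here is a minimal sketch of the generic head-sharding arithmetic under tensor parallelism: when there are fewer KV heads than TP ranks, each head has to be stored on several ranks. This is an illustration under the assumption of even head sharding, not SGLang's actual implementation, and `kv_copies` is a made-up name. It also does not settle how many KV heads DeepSeek-V3's MLA effectively has, which is the open point here.

```python
def kv_copies(num_kv_heads: int, tp_size: int) -> tuple[int, int]:
    """Return (KV heads stored per rank, ranks holding a copy of each head).

    Assumes heads are split evenly across tensor-parallel ranks, and that a
    head is replicated when there are more ranks than heads.
    """
    if num_kv_heads <= 0 or tp_size <= 0:
        raise ValueError("num_kv_heads and tp_size must be positive")
    if tp_size <= num_kv_heads:
        if num_kv_heads % tp_size != 0:
            raise ValueError("num_kv_heads must be divisible by tp_size")
        # Enough heads to go around: each rank owns a distinct slice.
        return num_kv_heads // tp_size, 1
    if tp_size % num_kv_heads != 0:
        raise ValueError("tp_size must be divisible by num_kv_heads")
    # More ranks than heads: every head is duplicated on several ranks.
    return 1, tp_size // num_kv_heads


for heads, tp in [(1, 8), (128, 8), (128, 16)]:
    per_rank, copies = kv_copies(heads, tp)
    print(f"kv_heads={heads:3d} tp={tp:2d}: {per_rank} head(s) per rank, {copies} copy/copies of each head")
```

With a single KV head and TP8, every rank stores the same head, so the cache is held 8 times. With 128 distinct KV heads, TP8 or TP16 would shard them without replication under this model.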

@pipul

pipul commented Jan 13, 2025

> I have another question regarding DP attention. The SGLang blog mentions that DP attention is effective because MLA has only 1 KV head, which causes unnecessary duplication of KV caches. DeepSeek-V3 MLA has more KV heads (16 attention, 128 KV), so do we still replicate KV caches when using something like TP8 or TP16? I understand there might not be enough heads if the deployment is large.
>
> +1 for shared design docs, if possible.

@zhyncs @Mutinifni

https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/config.json
"num_key_value_heads": 128,
DeepSeek-V3 has 128 KV heads??

@min-xu-et
Contributor

Is there any data on inference-time batch size and token imbalance between experts? What is the total throughput for an 8xH200 node?

@CSEEduanyu

Has there been any progress on NextN support?

@lambert0312

The overlap scheduler with DP attention cannot be used on A800 * 4 because it always runs out of memory (OOM).

@MtFitzRoy

Is there a plan to support TP + SP attention?

The paper says "The attention part employs 4-way Tensor Parallelism (TP4) with Sequence Parallelism (SP)"
