
[Bug] Deepseek v3 forward_absorb() of attention bug #2764

Open
5 tasks done
yixue-qq opened this issue Jan 7, 2025 · 4 comments

yixue-qq commented Jan 7, 2025

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

In DeepSeek-V3 (and V2) attention, the decode phase uses forward_absorb().
However, self.w_kc is set to None during initialization, so it has no dtype when this check runs:

if self.w_kc.dtype == torch.float8_e4m3fnuz:

This gives the error: AttributeError: 'NoneType' object has no attribute 'dtype'.
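
A minimal sketch of the failure mode (the class and attribute names below only mirror the issue; they are not the actual sglang code, and the guard is purely illustrative):

import torch

class FakeMLAAttention:
    def __init__(self):
        # w_kc stays None until weight loading post-processes kv_b_proj
        self.w_kc = None

attn = FakeMLAAttention()
# The unguarded check from the issue fails when weights were never loaded:
#   attn.w_kc.dtype == torch.float8_e4m3fnuz
#   -> AttributeError: 'NoneType' object has no attribute 'dtype'
if attn.w_kc is not None and attn.w_kc.dtype == torch.float8_e4m3fnuz:
    print("fp8 weight-absorption path")
else:
    print("w_kc is missing (weights not loaded) or not fp8")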

By the way, I'm loading dummy weights with: python3 -m sglang.launch_server --model my_path/deepseek_v3/DeepSeek-V3 --tp 1 --trust-remote-code --port 30000 --load-format dummy

Reproduction

python3 -m sglang.launch_server --model my_path/deepseek_v3/DeepSeek-V3 --tp 1 --trust-remote-code --port 30000 --load-format dummy

Environment

Docker provided:
docker pull lmsysorg/sglang:latest

ispobock (Collaborator) commented Jan 7, 2025

This happens because weight loading was not executed. w_kc is only populated after the weights are loaded. ref code
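
Roughly, that post-load step splits the loaded kv_b_proj weight into per-head w_kc and w_vc pieces used by the absorbed decode path. The sketch below uses assumed DeepSeek-V3-like dimensions and made-up variable names; it only illustrates the idea and is not the real sglang post-processing code:

import torch

# Assumed (illustrative) MLA dimensions
num_heads = 128
qk_nope_head_dim = 128
v_head_dim = 128
kv_lora_rank = 512

# kv_b_proj maps the compressed KV latent (kv_lora_rank) up to per-head K_nope and V
kv_b_proj_weight = torch.randn(num_heads * (qk_nope_head_dim + v_head_dim), kv_lora_rank)

# Split the loaded weight per head into the two pieces used by forward_absorb()
w = kv_b_proj_weight.unflatten(0, (num_heads, qk_nope_head_dim + v_head_dim))
w_kc, w_vc = w.split([qk_nope_head_dim, v_head_dim], dim=1)
print(w_kc.shape, w_vc.shape)  # these only exist after weights are actually loaded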

yixue-qq (Author) commented Jan 7, 2025

Does this mean the decoding phase cannot be run with dummy weights?
Suppose I only have 1 GPU, which cannot hold the full Hugging Face *.safetensors model weights. What can I do to run the decoding phase on 1 GPU?

ispobock (Collaborator) commented Jan 7, 2025

  • When MLA is enabled, dummy weights cannot be used, since some post-processing of the weights happens after weight loading.
  • It's not possible to run decoding on 1 GPU even with dummy weights. Dummy weights only mean the parameter values are randomly initialized; the number of parameters is not reduced (see the rough memory estimate below).
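
A back-of-envelope check of the second point: dummy loading only randomizes values, so all of DeepSeek-V3's roughly 671B parameters still have to fit in GPU memory. The numbers below are approximate:

params = 671e9  # approximate total parameter count of DeepSeek-V3
for fmt, bytes_per_param in [("fp8", 1), ("bf16", 2)]:
    print(f"{fmt}: ~{params * bytes_per_param / 1e9:.0f} GB of weights")
# fp8: ~671 GB, bf16: ~1342 GB -- far beyond a single GPU, regardless of load format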

yixue-qq (Author) commented Jan 8, 2025

I see. Previously I reduced the number of layers to 2 so that dummy weights fit on 1 GPU, but it seems MLA indeed cannot run the decoding phase with dummy weights. Thanks.
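
For reference, one way to shrink the model for this kind of debugging is to override the layer count at launch time. This assumes your sglang version supports the --json-model-override-args flag:

python3 -m sglang.launch_server --model my_path/deepseek_v3/DeepSeek-V3 --tp 1 --trust-remote-code --port 30000 --load-format dummy --json-model-override-args '{"num_hidden_layers": 2}'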
