
[Bug] Deepseek v3 forward_absorb() of attention bug #2764

Open
5 tasks done
yixue-qq opened this issue Jan 7, 2025 · 4 comments

yixue-qq commented Jan 7, 2025

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

In DeepSeek-V3 (and V2) attention, the decode phase uses forward_absorb().
However, self.w_kc is set to None during initialization, so it has no dtype when this check runs:

if self.w_kc.dtype == torch.float8_e4m3fnuz:

This gives the error: AttributeError: 'NoneType' object has no attribute 'dtype'.
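
A minimal sketch of the failure mode (the class and attribute names below only mirror the issue; they are not the actual sglang code, and the guard is purely illustrative):

import torch

class FakeMLAAttention:
    def __init__(self):
        # w_kc stays None until weight loading post-processes kv_b_proj
        self.w_kc = None

attn = FakeMLAAttention()
# The unguarded check from the issue fails when weights were never loaded:
#   attn.w_kc.dtype == torch.float8_e4m3fnuz
#   -> AttributeError: 'NoneType' object has no attribute 'dtype'
if attn.w_kc is not None and attn.w_kc.dtype == torch.float8_e4m3fnuz:
    print("fp8 weight-absorption path")
else:
    print("w_kc is missing (weights not loaded) or not fp8")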

By the way, I'm loading dummy weights with: python3 -m sglang.launch_server --model my_path/deepseek_v3/DeepSeek-V3 --tp 1 --trust-remote-code --port 30000 --load-format dummy

Reproduction

python3 -m sglang.launch_server --model my_path/deepseek_v3/DeepSeek-V3 --tp 1 --trust-remote-code --port 30000 --load-format dummy

Environment

Docker provided:
docker pull lmsysorg/sglang:latest

ispobock (Collaborator) commented Jan 7, 2025

This happens because weight loading was not executed. w_kc is only populated after the weights are loaded. ref code
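
Roughly, that post-load step splits the loaded kv_b_proj weight into per-head w_kc and w_vc pieces used by the absorbed decode path. The sketch below uses assumed DeepSeek-V3-like dimensions and made-up variable names; it only illustrates the idea and is not the real sglang post-processing code:

import torch

# Assumed (illustrative) MLA dimensions
num_heads = 128
qk_nope_head_dim = 128
v_head_dim = 128
kv_lora_rank = 512

# kv_b_proj maps the compressed KV latent (kv_lora_rank) up to per-head K_nope and V
kv_b_proj_weight = torch.randn(num_heads * (qk_nope_head_dim + v_head_dim), kv_lora_rank)

# Split the loaded weight per head into the two pieces used by forward_absorb()
w = kv_b_proj_weight.unflatten(0, (num_heads, qk_nope_head_dim + v_head_dim))
w_kc, w_vc = w.split([qk_nope_head_dim, v_head_dim], dim=1)
print(w_kc.shape, w_vc.shape)  # these only exist after weights are actually loaded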

yixue-qq (Author) commented Jan 7, 2025

Does this mean the decoding phase cannot be run with dummy weights?
Suppose I only have 1 GPU, which cannot hold the full Hugging Face *.safetensors model weights. What can I do to run the decoding phase on 1 GPU?

ispobock (Collaborator) commented Jan 7, 2025

  • When MLA is enabled, dummy weights cannot be used, since some post-processing of the weights happens after weight loading.
  • It's not possible to run decoding on 1 GPU even with dummy weights. Dummy weights only mean the parameter values are randomly initialized; the number of parameters is not reduced (see the rough memory estimate below).
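
A back-of-envelope check of the second point: dummy loading only randomizes values, so all of DeepSeek-V3's roughly 671B parameters still have to fit in GPU memory. The numbers below are approximate:

params = 671e9  # approximate total parameter count of DeepSeek-V3
for fmt, bytes_per_param in [("fp8", 1), ("bf16", 2)]:
    print(f"{fmt}: ~{params * bytes_per_param / 1e9:.0f} GB of weights")
# fp8: ~671 GB, bf16: ~1342 GB -- far beyond a single GPU, regardless of load format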

yixue-qq (Author) commented Jan 8, 2025

I see. Previously I reduced the number of layers to 2 so that dummy weights fit on 1 GPU, but it seems MLA indeed cannot run the decoding phase with dummy weights. Thanks.
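
For reference, one way to shrink the model for this kind of debugging is to override the layer count at launch time. This assumes your sglang version supports the --json-model-override-args flag:

python3 -m sglang.launch_server --model my_path/deepseek_v3/DeepSeek-V3 --tp 1 --trust-remote-code --port 30000 --load-format dummy --json-model-override-args '{"num_hidden_layers": 2}'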
