[Bug] Benchmarks with EAGLE-2 #2777

Open
5 tasks done
xavier-h-10 opened this issue Jan 7, 2025 · 1 comment

xavier-h-10 commented Jan 7, 2025

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

I tried running several of the bundled benchmarks against sglang with EAGLE-2 speculative decoding, but the server crashes in both cases below (tracebacks included).

Reproduction

python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf --speculative-algo EAGLE --speculative-draft lmzheng/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7
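
A quick way to sanity-check the launched server, assuming the default port 30000 (the launch command above does not pass --port), is to send a single request to the /generate endpoint before starting a benchmark; this is a minimal sketch, not a command taken from the original report:

# Hypothetical sanity check; assumes sglang's default port 30000.
curl http://127.0.0.1:30000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, how are you?", "sampling_params": {"max_new_tokens": 32}}'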

multi_turn_chat benchmark

~/sglang/benchmark/multi_turn_chat$ python3 bench_sglang.py --tokenizer meta-llama/Llama-2-7b-chat-hf --long

Result

[2025-01-07 20:24:05] INFO: 127.0.0.1:34396 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-01-07 20:24:05 TP0] Prefill batch. #new-seq: 1, #new-token: 313, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-07 20:24:05 TP0] Prefill batch. #new-seq: 1, #new-token: 387, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.06, #running-req: 1, #queue-req: 18
[2025-01-07 20:24:05 TP0] Prefill batch. #new-seq: 1, #new-token: 433, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.13, #running-req: 2, #queue-req: 17
[2025-01-07 20:24:05 TP0] Prefill batch. #new-seq: 1, #new-token: 503, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.21, #running-req: 3, #queue-req: 16
[2025-01-07 20:24:05 TP0] Prefill batch. #new-seq: 1, #new-token: 487, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.30, #running-req: 4, #queue-req: 15
[2025-01-07 20:24:05 TP0] Prefill batch. #new-seq: 1, #new-token: 278, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.39, #running-req: 5, #queue-req: 14
[2025-01-07 20:24:05 TP0] Prefill batch. #new-seq: 1, #new-token: 305, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.44, #running-req: 6, #queue-req: 13
[2025-01-07 20:24:06 TP0] Prefill batch. #new-seq: 1, #new-token: 407, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.50, #running-req: 7, #queue-req: 12
[2025-01-07 20:24:06 TP0] Prefill batch. #new-seq: 1, #new-token: 434, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.57, #running-req: 8, #queue-req: 11
[2025-01-07 20:24:06 TP0] Prefill batch. #new-seq: 1, #new-token: 392, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.65, #running-req: 9, #queue-req: 10
[2025-01-07 20:24:06 TP0] Prefill batch. #new-seq: 1, #new-token: 443, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.73, #running-req: 10, #queue-req: 9
[2025-01-07 20:24:06 TP0] Prefill batch. #new-seq: 1, #new-token: 256, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.81, #running-req: 11, #queue-req: 8
[2025-01-07 20:24:06 TP0] Prefill batch. #new-seq: 1, #new-token: 316, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.85, #running-req: 12, #queue-req: 7
[2025-01-07 20:24:06 TP0] Prefill batch. #new-seq: 1, #new-token: 379, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.91, #running-req: 13, #queue-req: 6
[2025-01-07 20:24:07 TP0] Scheduler hit an exception: Traceback (most recent call last):
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/managers/scheduler.py", line 1616, in run_scheduler_process
scheduler.event_loop_normal()
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/managers/scheduler.py", line 411, in event_loop_normal
batch = self.get_next_batch_to_run()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/managers/scheduler.py", line 783, in get_next_batch_to_run
self.running_batch = self.update_running_batch(self.running_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/managers/scheduler.py", line 924, in update_running_batch
self.draft_worker.finish_request(retracted_reqs)
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/speculative/eagle_worker.py", line 163, in finish_request
- self.finish_extend_len[req.rid]
~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^
KeyError: '58588572becf4f0997b94a5f2c548495'

Killed

mtbench benchmark

~/sglang/benchmark/mtbench$ python3 bench_sglang.py --num-questions 80

Result

[2025-01-07 20:16:02] INFO: 127.0.0.1:41452 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-01-07 20:16:02 TP0] Prefill batch. #new-seq: 1, #new-token: 48, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-07 20:16:02 TP0] Prefill batch. #new-seq: 1, #new-token: 75, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.01, #running-req: 1, #queue-req: 62
[2025-01-07 20:16:02 TP0] Prefill batch. #new-seq: 1, #new-token: 80, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.02, #running-req: 2, #queue-req: 61
[2025-01-07 20:16:02 TP0] Prefill batch. #new-seq: 1, #new-token: 70, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.04, #running-req: 3, #queue-req: 60
[2025-01-07 20:16:02 TP0] Prefill batch. #new-seq: 1, #new-token: 60, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.05, #running-req: 4, #queue-req: 59
[2025-01-07 20:16:02 TP0] Prefill batch. #new-seq: 1, #new-token: 48, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.06, #running-req: 5, #queue-req: 58
[2025-01-07 20:16:02 TP0] Prefill batch. #new-seq: 1, #new-token: 55, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.07, #running-req: 6, #queue-req: 57
[2025-01-07 20:16:02 TP0] Prefill batch. #new-seq: 1, #new-token: 56, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.08, #running-req: 7, #queue-req: 56
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 71, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.09, #running-req: 8, #queue-req: 55
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 128, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.10, #running-req: 9, #queue-req: 54
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 79, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.13, #running-req: 10, #queue-req: 53
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 59, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.14, #running-req: 11, #queue-req: 52
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 119, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.15, #running-req: 12, #queue-req: 51
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 123, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.17, #running-req: 13, #queue-req: 50
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 148, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.20, #running-req: 14, #queue-req: 49
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 100, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.22, #running-req: 15, #queue-req: 48
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 83, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.24, #running-req: 16, #queue-req: 47
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 66, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.26, #running-req: 17, #queue-req: 46
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 66, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.27, #running-req: 18, #queue-req: 45
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 78, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.28, #running-req: 19, #queue-req: 44
[2025-01-07 20:16:14] INFO: 127.0.0.1:41560 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 20:16:16 TP0] Decode batch. #running-req: 19, #token: 3526, token usage: 0.65, gen throughput (token/s): 2.08, #queue-req: 45
[2025-01-07 20:16:18] INFO: 127.0.0.1:41590 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 20:16:18] INFO: 127.0.0.1:41592 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 20:16:19 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.63, #running-req: 17, #queue-req: 46
[2025-01-07 20:16:19 TP0] Prefill batch. #new-seq: 1, #new-token: 62, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.64, #running-req: 18, #queue-req: 45
[2025-01-07 20:16:19] INFO: 127.0.0.1:41604 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 20:16:20 TP0] Prefill batch. #new-seq: 1, #new-token: 45, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.62, #running-req: 18, #queue-req: 45
[2025-01-07 20:16:25] INFO: 127.0.0.1:41672 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 20:16:26 TP0] Decode out of memory happened. #retracted_reqs: 1, #new_token_ratio: 0.6348 -> 0.7838
[2025-01-07 20:16:26 TP0] Scheduler hit an exception: Traceback (most recent call last):
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/managers/scheduler.py", line 1616, in run_scheduler_process
scheduler.event_loop_normal()
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/managers/scheduler.py", line 419, in event_loop_normal
result = self.run_batch(batch)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/managers/scheduler.py", line 966, in run_batch
self.draft_worker.forward_batch_speculative_generation(batch)
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/speculative/eagle_worker.py", line 73, in forward_batch_speculative_generation
self.forward_draft_decode(batch)
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/speculative/eagle_worker.py", line 55, in forward_draft_decode
logits_output = self.model_runner.forward(forward_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/model_executor/model_runner.py", line 715, in forward
return self.forward_decode(forward_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/model_executor/model_runner.py", line 674, in forward_decode
return self.model.forward(
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/models/llama.py", line 356, in forward
hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/models/llama_eagle.py", line 95, in forward
hidden_states, residual = layer(
^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/models/llama.py", line 235, in forward
hidden_states = self.self_attn(
^^^^^^^^^^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/models/llama.py", line 172, in forward
attn_output = self.attn(q, k, v, forward_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/layers/radix_attention.py", line 65, in forward
return forward_batch.attn_backend.forward(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/layers/attention/init.py", line 67, in forward
return self.forward_decode(q, k, v, layer, forward_batch, save_kv_cache)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 415, in forward_decode
o = decode_wrapper.forward(
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/flashinfer/decode.py", line 589, in forward
return self.run(
^^^^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/flashinfer/decode.py", line 673, in run
out = self._wrapper.run(
^^^^^^^^^^^^^^^^^^
RuntimeError: CHECK_GE(paged_kv_indptr.size(0), batch_size + 1) failed. 137 vs 145

Killed

Environment

2025-01-07 18:47:36.128848: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1736275656.370115 2302 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1736275656.436779 2302 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-01-07 18:47:37.039609: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[2025-01-07 18:47:43] INFO _client.py:1038: HTTP Request: GET https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json "HTTP/1.1 200 OK"
/opt/conda/envs/rl/lib/python3.11/site-packages/pydantic/_internal/_config.py:345: UserWarning: Valid config keys have changed in V2:

  • 'fields' has been removed
    warnings.warn(message, UserWarning)

Python: 3.11.10 | packaged by conda-forge | (main, Oct 16 2024, 01:27:36) [GCC 13.3.0]
CUDA available: True
GPU 0: NVIDIA L4
GPU 0 Compute Capability: 8.9
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 550.127.05
PyTorch: 2.5.1+cu124
flashinfer: 0.1.6+cu124torch2.4
triton: 3.1.0
transformers: 4.47.1
torchao: 0.7.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.6
hf_transfer: 0.1.8
huggingface_hub: 0.27.0
interegular: 0.3.3
modelscope: 1.21.1
orjson: 3.10.13
packaging: 24.2
psutil: 6.1.0
pydantic: 2.10.4
multipart: 0.0.20
zmq: 26.2.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.59.3
anthropic: 0.42.0
decord: 0.6.0
NVIDIA Topology:
GPU0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X 0-3 0 N/A

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

Hypervisor vendor: KVM
ulimit soft: 1048576

yukavio (Collaborator) commented Jan 10, 2025

You can fix this bug by using the latest main branch. This problem is fixed in #2711.
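
For reference, the usual way to pick up the fix is a from-source install of the latest main branch; a minimal sketch, assuming the standard sglang repository layout (the extras tag may differ for your environment):

# Install sglang from the latest main branch (sketch; extras may vary).
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"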
