[Bug] Benchmarks with EAGLE-2 #2777

Open
5 tasks done
xavier-h-10 opened this issue Jan 7, 2025 · 1 comment

xavier-h-10 commented Jan 7, 2025

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

I tried running several of the bundled benchmarks against sglang with EAGLE-2 speculative decoding, but the server crashes in both cases below (tracebacks included).

Reproduction

python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf --speculative-algo EAGLE --speculative-draft lmzheng/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7
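
A quick way to sanity-check the launched server, assuming the default port 30000 (the launch command above does not pass --port), is to send a single request to the /generate endpoint before starting a benchmark; this is a minimal sketch, not a command taken from the original report:

# Hypothetical sanity check; assumes sglang's default port 30000.
curl http://127.0.0.1:30000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, how are you?", "sampling_params": {"max_new_tokens": 32}}'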

multi_turn_chat benchmark

~/sglang/benchmark/multi_turn_chat$ python3 bench_sglang.py --tokenizer meta-llama/Llama-2-7b-chat-hf --long

Result

[2025-01-07 20:24:05] INFO: 127.0.0.1:34396 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-01-07 20:24:05 TP0] Prefill batch. #new-seq: 1, #new-token: 313, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-07 20:24:05 TP0] Prefill batch. #new-seq: 1, #new-token: 387, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.06, #running-req: 1, #queue-req: 18
[2025-01-07 20:24:05 TP0] Prefill batch. #new-seq: 1, #new-token: 433, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.13, #running-req: 2, #queue-req: 17
[2025-01-07 20:24:05 TP0] Prefill batch. #new-seq: 1, #new-token: 503, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.21, #running-req: 3, #queue-req: 16
[2025-01-07 20:24:05 TP0] Prefill batch. #new-seq: 1, #new-token: 487, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.30, #running-req: 4, #queue-req: 15
[2025-01-07 20:24:05 TP0] Prefill batch. #new-seq: 1, #new-token: 278, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.39, #running-req: 5, #queue-req: 14
[2025-01-07 20:24:05 TP0] Prefill batch. #new-seq: 1, #new-token: 305, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.44, #running-req: 6, #queue-req: 13
[2025-01-07 20:24:06 TP0] Prefill batch. #new-seq: 1, #new-token: 407, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.50, #running-req: 7, #queue-req: 12
[2025-01-07 20:24:06 TP0] Prefill batch. #new-seq: 1, #new-token: 434, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.57, #running-req: 8, #queue-req: 11
[2025-01-07 20:24:06 TP0] Prefill batch. #new-seq: 1, #new-token: 392, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.65, #running-req: 9, #queue-req: 10
[2025-01-07 20:24:06 TP0] Prefill batch. #new-seq: 1, #new-token: 443, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.73, #running-req: 10, #queue-req: 9
[2025-01-07 20:24:06 TP0] Prefill batch. #new-seq: 1, #new-token: 256, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.81, #running-req: 11, #queue-req: 8
[2025-01-07 20:24:06 TP0] Prefill batch. #new-seq: 1, #new-token: 316, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.85, #running-req: 12, #queue-req: 7
[2025-01-07 20:24:06 TP0] Prefill batch. #new-seq: 1, #new-token: 379, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.91, #running-req: 13, #queue-req: 6
[2025-01-07 20:24:07 TP0] Scheduler hit an exception: Traceback (most recent call last):
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/managers/scheduler.py", line 1616, in run_scheduler_process
scheduler.event_loop_normal()
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/managers/scheduler.py", line 411, in event_loop_normal
batch = self.get_next_batch_to_run()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/managers/scheduler.py", line 783, in get_next_batch_to_run
self.running_batch = self.update_running_batch(self.running_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/managers/scheduler.py", line 924, in update_running_batch
self.draft_worker.finish_request(retracted_reqs)
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/speculative/eagle_worker.py", line 163, in finish_request
- self.finish_extend_len[req.rid]
~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^
KeyError: '58588572becf4f0997b94a5f2c548495'

Killed

mtbench benchmark

~/sglang/benchmark/mtbench$ python3 bench_sglang.py --num-questions 80

Result

[2025-01-07 20:16:02] INFO: 127.0.0.1:41452 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-01-07 20:16:02 TP0] Prefill batch. #new-seq: 1, #new-token: 48, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-07 20:16:02 TP0] Prefill batch. #new-seq: 1, #new-token: 75, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.01, #running-req: 1, #queue-req: 62
[2025-01-07 20:16:02 TP0] Prefill batch. #new-seq: 1, #new-token: 80, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.02, #running-req: 2, #queue-req: 61
[2025-01-07 20:16:02 TP0] Prefill batch. #new-seq: 1, #new-token: 70, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.04, #running-req: 3, #queue-req: 60
[2025-01-07 20:16:02 TP0] Prefill batch. #new-seq: 1, #new-token: 60, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.05, #running-req: 4, #queue-req: 59
[2025-01-07 20:16:02 TP0] Prefill batch. #new-seq: 1, #new-token: 48, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.06, #running-req: 5, #queue-req: 58
[2025-01-07 20:16:02 TP0] Prefill batch. #new-seq: 1, #new-token: 55, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.07, #running-req: 6, #queue-req: 57
[2025-01-07 20:16:02 TP0] Prefill batch. #new-seq: 1, #new-token: 56, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.08, #running-req: 7, #queue-req: 56
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 71, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.09, #running-req: 8, #queue-req: 55
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 128, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.10, #running-req: 9, #queue-req: 54
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 79, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.13, #running-req: 10, #queue-req: 53
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 59, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.14, #running-req: 11, #queue-req: 52
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 119, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.15, #running-req: 12, #queue-req: 51
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 123, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.17, #running-req: 13, #queue-req: 50
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 148, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.20, #running-req: 14, #queue-req: 49
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 100, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.22, #running-req: 15, #queue-req: 48
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 83, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.24, #running-req: 16, #queue-req: 47
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 66, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.26, #running-req: 17, #queue-req: 46
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 66, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.27, #running-req: 18, #queue-req: 45
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 78, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.28, #running-req: 19, #queue-req: 44
[2025-01-07 20:16:14] INFO: 127.0.0.1:41560 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 20:16:16 TP0] Decode batch. #running-req: 19, #token: 3526, token usage: 0.65, gen throughput (token/s): 2.08, #queue-req: 45
[2025-01-07 20:16:18] INFO: 127.0.0.1:41590 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 20:16:18] INFO: 127.0.0.1:41592 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 20:16:19 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.63, #running-req: 17, #queue-req: 46
[2025-01-07 20:16:19 TP0] Prefill batch. #new-seq: 1, #new-token: 62, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.64, #running-req: 18, #queue-req: 45
[2025-01-07 20:16:19] INFO: 127.0.0.1:41604 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 20:16:20 TP0] Prefill batch. #new-seq: 1, #new-token: 45, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.62, #running-req: 18, #queue-req: 45
[2025-01-07 20:16:25] INFO: 127.0.0.1:41672 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 20:16:26 TP0] Decode out of memory happened. #retracted_reqs: 1, #new_token_ratio: 0.6348 -> 0.7838
[2025-01-07 20:16:26 TP0] Scheduler hit an exception: Traceback (most recent call last):
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/managers/scheduler.py", line 1616, in run_scheduler_process
scheduler.event_loop_normal()
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/managers/scheduler.py", line 419, in event_loop_normal
result = self.run_batch(batch)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/managers/scheduler.py", line 966, in run_batch
self.draft_worker.forward_batch_speculative_generation(batch)
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/speculative/eagle_worker.py", line 73, in forward_batch_speculative_generation
self.forward_draft_decode(batch)
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/speculative/eagle_worker.py", line 55, in forward_draft_decode
logits_output = self.model_runner.forward(forward_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/model_executor/model_runner.py", line 715, in forward
return self.forward_decode(forward_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/model_executor/model_runner.py", line 674, in forward_decode
return self.model.forward(
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/models/llama.py", line 356, in forward
hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/models/llama_eagle.py", line 95, in forward
hidden_states, residual = layer(
^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/models/llama.py", line 235, in forward
hidden_states = self.self_attn(
^^^^^^^^^^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/models/llama.py", line 172, in forward
attn_output = self.attn(q, k, v, forward_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/layers/radix_attention.py", line 65, in forward
return forward_batch.attn_backend.forward(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/layers/attention/init.py", line 67, in forward
return self.forward_decode(q, k, v, layer, forward_batch, save_kv_cache)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 415, in forward_decode
o = decode_wrapper.forward(
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/flashinfer/decode.py", line 589, in forward
return self.run(
^^^^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/flashinfer/decode.py", line 673, in run
out = self._wrapper.run(
^^^^^^^^^^^^^^^^^^
RuntimeError: CHECK_GE(paged_kv_indptr.size(0), batch_size + 1) failed. 137 vs 145

Killed

Environment

2025-01-07 18:47:36.128848: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1736275656.370115 2302 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1736275656.436779 2302 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-01-07 18:47:37.039609: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[2025-01-07 18:47:43] INFO _client.py:1038: HTTP Request: GET https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json "HTTP/1.1 200 OK"
/opt/conda/envs/rl/lib/python3.11/site-packages/pydantic/_internal/_config.py:345: UserWarning: Valid config keys have changed in V2:

  • 'fields' has been removed
    warnings.warn(message, UserWarning)

Python: 3.11.10 | packaged by conda-forge | (main, Oct 16 2024, 01:27:36) [GCC 13.3.0]
CUDA available: True
GPU 0: NVIDIA L4
GPU 0 Compute Capability: 8.9
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 550.127.05
PyTorch: 2.5.1+cu124
flashinfer: 0.1.6+cu124torch2.4
triton: 3.1.0
transformers: 4.47.1
torchao: 0.7.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.6
hf_transfer: 0.1.8
huggingface_hub: 0.27.0
interegular: 0.3.3
modelscope: 1.21.1
orjson: 3.10.13
packaging: 24.2
psutil: 6.1.0
pydantic: 2.10.4
multipart: 0.0.20
zmq: 26.2.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.59.3
anthropic: 0.42.0
decord: 0.6.0
NVIDIA Topology:
GPU0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X 0-3 0 N/A

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

Hypervisor vendor: KVM
ulimit soft: 1048576

yukavio (Collaborator) commented Jan 10, 2025

You can fix this bug by using the latest main branch. This problem is fixed in #2711.
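
For reference, the usual way to pick up the fix is a from-source install of the latest main branch; a minimal sketch, assuming the standard sglang repository layout (the extras tag may differ for your environment):

# Install sglang from the latest main branch (sketch; extras may vary).
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"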
