Checklist
1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
I have tried to run several of the sglang benchmarks with EAGLE-2 speculative decoding enabled, but the server crashes under both of the ones below: multi_turn_chat dies with a KeyError in EagleWorker.finish_request after a request is retracted, and mtbench fails a flashinfer CHECK_GE shape check during decode right after "Decode out of memory happened".
Reproduction
python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf --speculative-algo EAGLE --speculative-draft lmzheng/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7
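Before running either benchmark, the server can be smoke-tested with a single request against the native /generate endpoint (a minimal sketch assuming the default port 30000; the prompt and sampling parameters are placeholders, not taken from the benchmark scripts):

import requests

# Assumes the server launched with the command above is listening on the default port 30000.
resp = requests.post(
    "http://127.0.0.1:30000/generate",
    json={
        "text": "Hello, my name is",
        "sampling_params": {"temperature": 0.7, "max_new_tokens": 64},
    },
)
resp.raise_for_status()
print(resp.json()["text"])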
multi_turn_chat benchmark:
~/sglang/benchmark/multi_turn_chat$ python3 bench_sglang.py --tokenizer meta-llama/Llama-2-7b-chat-hf --long
Result
[2025-01-07 20:24:05] INFO: 127.0.0.1:34396 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-01-07 20:24:05 TP0] Prefill batch. #new-seq: 1, #new-token: 313, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-07 20:24:05 TP0] Prefill batch. #new-seq: 1, #new-token: 387, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.06, #running-req: 1, #queue-req: 18
[2025-01-07 20:24:05 TP0] Prefill batch. #new-seq: 1, #new-token: 433, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.13, #running-req: 2, #queue-req: 17
[2025-01-07 20:24:05 TP0] Prefill batch. #new-seq: 1, #new-token: 503, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.21, #running-req: 3, #queue-req: 16
[2025-01-07 20:24:05 TP0] Prefill batch. #new-seq: 1, #new-token: 487, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.30, #running-req: 4, #queue-req: 15
[2025-01-07 20:24:05 TP0] Prefill batch. #new-seq: 1, #new-token: 278, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.39, #running-req: 5, #queue-req: 14
[2025-01-07 20:24:05 TP0] Prefill batch. #new-seq: 1, #new-token: 305, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.44, #running-req: 6, #queue-req: 13
[2025-01-07 20:24:06 TP0] Prefill batch. #new-seq: 1, #new-token: 407, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.50, #running-req: 7, #queue-req: 12
[2025-01-07 20:24:06 TP0] Prefill batch. #new-seq: 1, #new-token: 434, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.57, #running-req: 8, #queue-req: 11
[2025-01-07 20:24:06 TP0] Prefill batch. #new-seq: 1, #new-token: 392, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.65, #running-req: 9, #queue-req: 10
[2025-01-07 20:24:06 TP0] Prefill batch. #new-seq: 1, #new-token: 443, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.73, #running-req: 10, #queue-req: 9
[2025-01-07 20:24:06 TP0] Prefill batch. #new-seq: 1, #new-token: 256, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.81, #running-req: 11, #queue-req: 8
[2025-01-07 20:24:06 TP0] Prefill batch. #new-seq: 1, #new-token: 316, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.85, #running-req: 12, #queue-req: 7
[2025-01-07 20:24:06 TP0] Prefill batch. #new-seq: 1, #new-token: 379, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.91, #running-req: 13, #queue-req: 6
[2025-01-07 20:24:07 TP0] Scheduler hit an exception: Traceback (most recent call last):
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/managers/scheduler.py", line 1616, in run_scheduler_process
scheduler.event_loop_normal()
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/managers/scheduler.py", line 411, in event_loop_normal
batch = self.get_next_batch_to_run()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/managers/scheduler.py", line 783, in get_next_batch_to_run
self.running_batch = self.update_running_batch(self.running_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/managers/scheduler.py", line 924, in update_running_batch
self.draft_worker.finish_request(retracted_reqs)
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/speculative/eagle_worker.py", line 163, in finish_request
- self.finish_extend_len[req.rid]
~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^
KeyError: '58588572becf4f0997b94a5f2c548495'
Killed
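The KeyError suggests finish_request is invoked for the retracted request even though its rid was never recorded in finish_extend_len. A standalone sketch of that failure mode and a tolerant lookup (the dict name and rid come from the traceback above; the .pop guard is only an illustration, not the actual sglang fix):

# finish_extend_len maps rid -> extend length; a retracted request's rid
# may never have been inserted, so plain indexing raises KeyError.
finish_extend_len: dict[str, int] = {}

def finish_request_len(rid: str, seq_len: int) -> int:
    # Hypothetical guard: default a missing extend length to 0 instead of raising.
    return seq_len - finish_extend_len.pop(rid, 0)

print(finish_request_len("58588572becf4f0997b94a5f2c548495", 42))  # 42, no KeyError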
mtbench benchmark:
~/sglang/benchmark/mtbench$ python3 bench_sglang.py --num-questions 80
Result
[2025-01-07 20:16:02] INFO: 127.0.0.1:41452 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-01-07 20:16:02 TP0] Prefill batch. #new-seq: 1, #new-token: 48, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-07 20:16:02 TP0] Prefill batch. #new-seq: 1, #new-token: 75, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.01, #running-req: 1, #queue-req: 62
[2025-01-07 20:16:02 TP0] Prefill batch. #new-seq: 1, #new-token: 80, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.02, #running-req: 2, #queue-req: 61
[2025-01-07 20:16:02 TP0] Prefill batch. #new-seq: 1, #new-token: 70, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.04, #running-req: 3, #queue-req: 60
[2025-01-07 20:16:02 TP0] Prefill batch. #new-seq: 1, #new-token: 60, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.05, #running-req: 4, #queue-req: 59
[2025-01-07 20:16:02 TP0] Prefill batch. #new-seq: 1, #new-token: 48, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.06, #running-req: 5, #queue-req: 58
[2025-01-07 20:16:02 TP0] Prefill batch. #new-seq: 1, #new-token: 55, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.07, #running-req: 6, #queue-req: 57
[2025-01-07 20:16:02 TP0] Prefill batch. #new-seq: 1, #new-token: 56, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.08, #running-req: 7, #queue-req: 56
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 71, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.09, #running-req: 8, #queue-req: 55
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 128, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.10, #running-req: 9, #queue-req: 54
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 79, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.13, #running-req: 10, #queue-req: 53
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 59, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.14, #running-req: 11, #queue-req: 52
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 119, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.15, #running-req: 12, #queue-req: 51
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 123, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.17, #running-req: 13, #queue-req: 50
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 148, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.20, #running-req: 14, #queue-req: 49
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 100, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.22, #running-req: 15, #queue-req: 48
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 83, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.24, #running-req: 16, #queue-req: 47
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 66, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.26, #running-req: 17, #queue-req: 46
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 66, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.27, #running-req: 18, #queue-req: 45
[2025-01-07 20:16:03 TP0] Prefill batch. #new-seq: 1, #new-token: 78, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.28, #running-req: 19, #queue-req: 44
[2025-01-07 20:16:14] INFO: 127.0.0.1:41560 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 20:16:16 TP0] Decode batch. #running-req: 19, #token: 3526, token usage: 0.65, gen throughput (token/s): 2.08, #queue-req: 45
[2025-01-07 20:16:18] INFO: 127.0.0.1:41590 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 20:16:18] INFO: 127.0.0.1:41592 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 20:16:19 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.63, #running-req: 17, #queue-req: 46
[2025-01-07 20:16:19 TP0] Prefill batch. #new-seq: 1, #new-token: 62, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.64, #running-req: 18, #queue-req: 45
[2025-01-07 20:16:19] INFO: 127.0.0.1:41604 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 20:16:20 TP0] Prefill batch. #new-seq: 1, #new-token: 45, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.62, #running-req: 18, #queue-req: 45
[2025-01-07 20:16:25] INFO: 127.0.0.1:41672 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 20:16:26 TP0] Decode out of memory happened. #retracted_reqs: 1, #new_token_ratio: 0.6348 -> 0.7838
[2025-01-07 20:16:26 TP0] Scheduler hit an exception: Traceback (most recent call last):
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/managers/scheduler.py", line 1616, in run_scheduler_process
scheduler.event_loop_normal()
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/managers/scheduler.py", line 419, in event_loop_normal
result = self.run_batch(batch)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/managers/scheduler.py", line 966, in run_batch
self.draft_worker.forward_batch_speculative_generation(batch)
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/speculative/eagle_worker.py", line 73, in forward_batch_speculative_generation
self.forward_draft_decode(batch)
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/speculative/eagle_worker.py", line 55, in forward_draft_decode
logits_output = self.model_runner.forward(forward_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/model_executor/model_runner.py", line 715, in forward
return self.forward_decode(forward_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/model_executor/model_runner.py", line 674, in forward_decode
return self.model.forward(
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/models/llama.py", line 356, in forward
hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/models/llama_eagle.py", line 95, in forward
hidden_states, residual = layer(
^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/models/llama.py", line 235, in forward
hidden_states = self.self_attn(
^^^^^^^^^^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/models/llama.py", line 172, in forward
attn_output = self.attn(q, k, v, forward_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/layers/radix_attention.py", line 65, in forward
return forward_batch.attn_backend.forward(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/layers/attention/init.py", line 67, in forward
return self.forward_decode(q, k, v, layer, forward_batch, save_kv_cache)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang_0105/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 415, in forward_decode
o = decode_wrapper.forward(
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/flashinfer/decode.py", line 589, in forward
return self.run(
^^^^^^^^^
File "/opt/conda/envs/rl/lib/python3.11/site-packages/flashinfer/decode.py", line 673, in run
out = self._wrapper.run(
^^^^^^^^^^^^^^^^^^
RuntimeError: CHECK_GE(paged_kv_indptr.size(0), batch_size + 1) failed. 137 vs 145
Killed
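One plausible reading of the CHECK_GE numbers, given the retraction logged just before the crash: flashinfer requires len(paged_kv_indptr) >= batch_size + 1, and with --speculative-eagle-topk 8 a plan for 17 running requests yields 17 * 8 + 1 = 137 indptr entries, while a batch still sized for 18 requests needs 18 * 8 + 1 = 145, i.e. the decode wrapper's metadata looks stale after the retraction. A small sketch reproducing the failed invariant (the helper below is illustrative, not flashinfer code):

import torch

def check_decode_metadata(paged_kv_indptr: torch.Tensor, batch_size: int) -> None:
    # Mirrors flashinfer's CHECK_GE(paged_kv_indptr.size(0), batch_size + 1).
    if paged_kv_indptr.size(0) < batch_size + 1:
        raise RuntimeError(
            f"CHECK_GE(paged_kv_indptr.size(0), batch_size + 1) failed. "
            f"{paged_kv_indptr.size(0)} vs {batch_size + 1}"
        )

# Metadata planned for 17 requests (137 entries) vs a batch of 18 * 8 = 144 tokens.
check_decode_metadata(torch.zeros(137, dtype=torch.int32), 144)  # raises: 137 vs 145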
Environment
2025-01-07 18:47:36.128848: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1736275656.370115 2302 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1736275656.436779 2302 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-01-07 18:47:37.039609: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[2025-01-07 18:47:43] INFO _client.py:1038: HTTP Request: GET https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json "HTTP/1.1 200 OK"
/opt/conda/envs/rl/lib/python3.11/site-packages/pydantic/_internal/_config.py:345: UserWarning: Valid config keys have changed in V2:
'fields' has been removed
warnings.warn(message, UserWarning)
Python: 3.11.10 | packaged by conda-forge | (main, Oct 16 2024, 01:27:36) [GCC 13.3.0]
CUDA available: True
GPU 0: NVIDIA L4
GPU 0 Compute Capability: 8.9
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 550.127.05
PyTorch: 2.5.1+cu124
flashinfer: 0.1.6+cu124torch2.4
triton: 3.1.0
transformers: 4.47.1
torchao: 0.7.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.6
hf_transfer: 0.1.8
huggingface_hub: 0.27.0
interegular: 0.3.3
modelscope: 1.21.1
orjson: 3.10.13
packaging: 24.2
psutil: 6.1.0
pydantic: 2.10.4
multipart: 0.0.20
zmq: 26.2.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.59.3
anthropic: 0.42.0
decord: 0.6.0
NVIDIA Topology:
GPU0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X 0-3 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
Hypervisor vendor: KVM
ulimit soft: 1048576