[Benchmark] sglang successful requests issue (may related to env) #2805

Open

Rssevenyu opened this issue Jan 9, 2025 · 2 comments

Rssevenyu commented on Jan 9, 2025
sglang 0.4.0.post2

python -m sglang.launch_server --model-path /mnt/home/Llama-3.1-8B-Instruct --enable-torch-compile --disable-radix-cache

python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --dataset-path /mnt/home/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 3000 --output-file /mnt/home/offline_sglang.jsonl
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: not set
Successful requests: 1648
Benchmark duration (s): 170.41
Total input tokens: 369103
Total generated tokens: 326408
Total generated tokens (retokenized): 326356
Request throughput (req/s): 9.67
Input token throughput (tok/s): 2165.94
Output token throughput (tok/s): 1915.40
Total token throughput (tok/s): 4081.34
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 80126.93
Median E2E Latency (ms): 81160.44
---------------Time to First Token----------------
Mean TTFT (ms): 44294.80
Median TTFT (ms): 31463.00
P99 TTFT (ms): 106154.19
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 289.75
Median TPOT (ms): 208.83
P99 TPOT (ms): 1533.27
---------------Inter-token Latency----------------
Mean ITL (ms): 182.46
Median ITL (ms): 145.49
P99 ITL (ms): 562.80
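
As a quick sanity check before benchmarking, a single request to the freshly launched server can rule out silent connection or generation failures. The sketch below assumes the server from the launch command above is listening on the default sglang.launch_server address (127.0.0.1:30000, since no host/port was overridden) and uses sglang's native /generate endpoint; adjust if the setup differs.

import requests

# Send one request to the locally launched sglang server (assumed to be at the
# default http://127.0.0.1:30000) and fail loudly on any non-2xx status.
resp = requests.post(
    "http://127.0.0.1:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0, "max_new_tokens": 16},
    },
    timeout=60,
)
resp.raise_for_status()       # an error status here suggests benchmark requests would also fail
print(resp.json()["text"])    # the generated continuation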

vllm 0.6.3.post1

python -m vllm.entrypoints.openai.api_server --model /mnt/home/Llama-3.1-8B-Instruct --disable-log-requests

python3 bench_serving.py --backend vllm --dataset-name sharegpt --dataset-path /mnt/home/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 3000 --output-file /mnt/home/offline_vllm.jsonl
============ Serving Benchmark Result ============
Backend: vllm
Traffic request rate: inf
Max request concurrency: not set
Successful requests: 2947
Benchmark duration (s): 334.35
Total input tokens: 660878
Total generated tokens: 572708
Total generated tokens (retokenized): 572537
Request throughput (req/s): 8.81
Input token throughput (tok/s): 1976.62
Output token throughput (tok/s): 1712.91
Total token throughput (tok/s): 3689.54
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 152238.67
Median E2E Latency (ms): 151892.38
---------------Time to First Token----------------
Mean TTFT (ms): 130851.32
Median TTFT (ms): 126929.85
P99 TTFT (ms): 270278.77
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 112.41
Median TPOT (ms): 115.50
P99 TPOT (ms): 145.59
---------------Inter-token Latency----------------
Mean ITL (ms): 110.68
Median ITL (ms): 112.92
P99 ITL (ms): 493.36
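
For reference, the two runs above differ mainly in how many of the 3000 prompts completed, which also skews the throughput comparison. Computed directly from the figures reported:

# Success rates and relative workload, from the reported numbers above
sglang_success = 1648 / 3000     # ~54.9% of prompts completed
vllm_success   = 2947 / 3000     # ~98.2% of prompts completed

# The sglang run also processed roughly half the tokens (369103 vs 660878 input,
# 326408 vs 572708 output) in roughly half the wall time (170.41 s vs 334.35 s),
# so the per-second throughput figures are not directly comparable.
print(f"sglang: {sglang_success:.1%}, vllm: {vllm_success:.1%}")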

Rssevenyu changed the title from "Why does SGlang achieve significantly fewer successful requests compared to VLLM when using ShareGPT in the benchmark?" to "[Benchmark] Why does SGlang achieve significantly fewer successful requests compared to VLLM when using ShareGPT in the benchmark?" on Jan 9, 2025
zhyncs (Member) commented on Jan 9, 2025
It works well for me. It's likely an issue with your environment, or something else went wrong. Try running python3 -m sglang.check_env to check.

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v /mnt/co-research/shared-models:/root/.cache/huggingface \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000 --enable-torch-compile --disable-radix-cache
python3 -m sglang.bench_serving --backend sglang --num-prompts 3000
(screenshot of serving benchmark results)
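
If the Docker setup above does not reproduce the problem, one way to narrow things down on the original host is to rerun the benchmark with fewer prompts and a finite request rate, separating load-related failures from environment problems. This assumes the --num-prompts and --request-rate flags of sglang.bench_serving in 0.4.0.post2 behave as in current versions (the results above show the default request rate is inf):

python3 -m sglang.bench_serving --backend sglang --num-prompts 300 --request-rate 8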

zhyncs changed the title from "[Benchmark] Why does SGlang achieve significantly fewer successful requests compared to VLLM when using ShareGPT in the benchmark?" to "[Benchmark] sglang successful requests issue (may related to env)" on Jan 9, 2025
Rssevenyu (Author) commented
python3 -m sglang.check_env
/root/anaconda3/lib/python3.11/site-packages/pydantic/_internal/_config.py:345: UserWarning: Valid config keys have changed in V2:
* 'fields' has been removed
  warnings.warn(message, UserWarning)

Python: 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0]
CUDA available: True
GPU 0: NVIDIA A40
GPU 0 Compute Capability: 8.6
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.66
CUDA Driver Version: 535.104.12
PyTorch: 2.4.0+cu121
sglang: 0.4.0.post2
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.47.1
torchao: 0.7.0
numpy: 1.26.4
aiohttp: 3.8.5
fastapi: 0.115.6
hf_transfer: 0.1.9
huggingface_hub: 0.27.1
interegular: 0.3.3
modelscope: 1.22.0
orjson: 3.10.13
packaging: 23.1
psutil: 5.9.0
pydantic: 2.10.4
multipart: 0.0.20
zmq: 26.2.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.3.post1
openai: 1.59.4
anthropic: 0.42.0
decord: 0.6.0

NVIDIA Topology:
      GPU0  NIC0  NIC1  NIC2  NIC3  NIC4  NIC5  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0   X    NODE  NODE  PIX   PIX   SYS   SYS   0-31,64-95    0              N/A
NIC0  NODE   X    PIX   NODE  NODE  SYS   SYS
NIC1  NODE  PIX    X    NODE  NODE  SYS   SYS
NIC2  PIX   NODE  NODE   X    PIX   SYS   SYS
NIC3  PIX   NODE  NODE  PIX    X    SYS   SYS
NIC4  SYS   SYS   SYS   SYS   SYS    X    PIX
NIC5  SYS   SYS   SYS   SYS   SYS   PIX    X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5

ulimit soft: 65536
