[Benchmark] sglang successful requests issue (may related to env) #2805

Open

Rssevenyu opened this issue Jan 9, 2025 · 2 comments

Rssevenyu commented on Jan 9, 2025
sglang 0.4.0.post2

python -m sglang.launch_server --model-path /mnt/home/Llama-3.1-8B-Instruct --enable-torch-compile --disable-radix-cache

python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --dataset-path /mnt/home/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 3000 --output-file /mnt/home/offline_sglang.jsonl
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: not set
Successful requests: 1648
Benchmark duration (s): 170.41
Total input tokens: 369103
Total generated tokens: 326408
Total generated tokens (retokenized): 326356
Request throughput (req/s): 9.67
Input token throughput (tok/s): 2165.94
Output token throughput (tok/s): 1915.40
Total token throughput (tok/s): 4081.34
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 80126.93
Median E2E Latency (ms): 81160.44
---------------Time to First Token----------------
Mean TTFT (ms): 44294.80
Median TTFT (ms): 31463.00
P99 TTFT (ms): 106154.19
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 289.75
Median TPOT (ms): 208.83
P99 TPOT (ms): 1533.27
---------------Inter-token Latency----------------
Mean ITL (ms): 182.46
Median ITL (ms): 145.49
P99 ITL (ms): 562.80
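
As a quick sanity check before benchmarking, a single request to the freshly launched server can rule out silent connection or generation failures. The sketch below assumes the server from the launch command above is listening on the default sglang.launch_server address (127.0.0.1:30000, since no host/port was overridden) and uses sglang's native /generate endpoint; adjust if the setup differs.

import requests

# Send one request to the locally launched sglang server (assumed to be at the
# default http://127.0.0.1:30000) and fail loudly on any non-2xx status.
resp = requests.post(
    "http://127.0.0.1:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0, "max_new_tokens": 16},
    },
    timeout=60,
)
resp.raise_for_status()       # an error status here suggests benchmark requests would also fail
print(resp.json()["text"])    # the generated continuation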

vllm 0.6.3.post1

python -m vllm.entrypoints.openai.api_server --model /mnt/home/Llama-3.1-8B-Instruct --disable-log-requests

python3 bench_serving.py --backend vllm --dataset-name sharegpt --dataset-path /mnt/home/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 3000 --output-file /mnt/home/offline_vllm.jsonl
============ Serving Benchmark Result ============
Backend: vllm
Traffic request rate: inf
Max request concurrency: not set
Successful requests: 2947
Benchmark duration (s): 334.35
Total input tokens: 660878
Total generated tokens: 572708
Total generated tokens (retokenized): 572537
Request throughput (req/s): 8.81
Input token throughput (tok/s): 1976.62
Output token throughput (tok/s): 1712.91
Total token throughput (tok/s): 3689.54
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 152238.67
Median E2E Latency (ms): 151892.38
---------------Time to First Token----------------
Mean TTFT (ms): 130851.32
Median TTFT (ms): 126929.85
P99 TTFT (ms): 270278.77
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 112.41
Median TPOT (ms): 115.50
P99 TPOT (ms): 145.59
---------------Inter-token Latency----------------
Mean ITL (ms): 110.68
Median ITL (ms): 112.92
P99 ITL (ms): 493.36
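
For reference, the two runs above differ mainly in how many of the 3000 prompts completed, which also skews the throughput comparison. Computed directly from the figures reported:

# Success rates and relative workload, from the reported numbers above
sglang_success = 1648 / 3000     # ~54.9% of prompts completed
vllm_success   = 2947 / 3000     # ~98.2% of prompts completed

# The sglang run also processed roughly half the tokens (369103 vs 660878 input,
# 326408 vs 572708 output) in roughly half the wall time (170.41 s vs 334.35 s),
# so the per-second throughput figures are not directly comparable.
print(f"sglang: {sglang_success:.1%}, vllm: {vllm_success:.1%}")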

Rssevenyu changed the title from "Why does SGlang achieve significantly fewer successful requests compared to VLLM when using ShareGPT in the benchmark?" to "[Benchmark] Why does SGlang achieve significantly fewer successful requests compared to VLLM when using ShareGPT in the benchmark?" on Jan 9, 2025
zhyncs (Member) commented on Jan 9, 2025
It works well for me. It's likely an issue with your environment, or something else went wrong. Try running python3 -m sglang.check_env to check.

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v /mnt/co-research/shared-models:/root/.cache/huggingface \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000 --enable-torch-compile --disable-radix-cache
python3 -m sglang.bench_serving --backend sglang --num-prompts 3000
(screenshot of serving benchmark results)
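
If the Docker setup above does not reproduce the problem, one way to narrow things down on the original host is to rerun the benchmark with fewer prompts and a finite request rate, separating load-related failures from environment problems. This assumes the --num-prompts and --request-rate flags of sglang.bench_serving in 0.4.0.post2 behave as in current versions (the results above show the default request rate is inf):

python3 -m sglang.bench_serving --backend sglang --num-prompts 300 --request-rate 8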

zhyncs changed the title from "[Benchmark] Why does SGlang achieve significantly fewer successful requests compared to VLLM when using ShareGPT in the benchmark?" to "[Benchmark] sglang successful requests issue (may related to env)" on Jan 9, 2025
Rssevenyu (Author) commented
python3 -m sglang.check_env
/root/anaconda3/lib/python3.11/site-packages/pydantic/_internal/_config.py:345: UserWarning: Valid config keys have changed in V2:
* 'fields' has been removed
  warnings.warn(message, UserWarning)

Python: 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0]
CUDA available: True
GPU 0: NVIDIA A40
GPU 0 Compute Capability: 8.6
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.66
CUDA Driver Version: 535.104.12
PyTorch: 2.4.0+cu121
sglang: 0.4.0.post2
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.47.1
torchao: 0.7.0
numpy: 1.26.4
aiohttp: 3.8.5
fastapi: 0.115.6
hf_transfer: 0.1.9
huggingface_hub: 0.27.1
interegular: 0.3.3
modelscope: 1.22.0
orjson: 3.10.13
packaging: 23.1
psutil: 5.9.0
pydantic: 2.10.4
multipart: 0.0.20
zmq: 26.2.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.3.post1
openai: 1.59.4
anthropic: 0.42.0
decord: 0.6.0

NVIDIA Topology:
      GPU0  NIC0  NIC1  NIC2  NIC3  NIC4  NIC5  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0   X    NODE  NODE  PIX   PIX   SYS   SYS   0-31,64-95    0              N/A
NIC0  NODE   X    PIX   NODE  NODE  SYS   SYS
NIC1  NODE  PIX    X    NODE  NODE  SYS   SYS
NIC2  PIX   NODE  NODE   X    PIX   SYS   SYS
NIC3  PIX   NODE  NODE  PIX    X    SYS   SYS
NIC4  SYS   SYS   SYS   SYS   SYS    X    PIX
NIC5  SYS   SYS   SYS   SYS   SYS   PIX    X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5

ulimit soft: 65536
