From 7c6e609cd4453362f49a643b5b32ed2943658e36 Mon Sep 17 00:00:00 2001 From: Rodrigo Garcia <32329949+roG0d@users.noreply.github.com> Date: Thu, 2 Jan 2025 09:43:21 +0100 Subject: [PATCH 01/12] Included Multinode DeepSeekv3 --- benchmark/deepseek_v3/README.md | 45 +++++++++++++++++++++++++++++---- 1 file changed, 40 insertions(+), 5 deletions(-) diff --git a/benchmark/deepseek_v3/README.md b/benchmark/deepseek_v3/README.md index 9c61af88fd2..6f4b0a9b8d6 100644 --- a/benchmark/deepseek_v3/README.md +++ b/benchmark/deepseek_v3/README.md @@ -56,18 +56,53 @@ response = client.chat.completions.create( ) print(response) ``` -### Example serving with 2 H20*8 -For example, there are two H20 nodes, each with 8 GPUs. The first node's IP is `10.0.0.1`, and the second node's IP is `10.0.0.2`. +### Example serving with Docker two H200*8 nodes +Having two H200 nodes, each with 8 GPUs. The first node's IP is `192.168.114.10`, and the second node's IP is `192.168.114.11`. Configuring the endpoint to expose it to another docker container with `--host 0.0.0.0` and `--port 40000` and configuring nccl comms with `--nccl-init 192.168.114.10:20000`. ```bash # node 1 -python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code +docker run --gpus all \ + --shm-size 32g \ + --network=host \ + -v ~/.cache/huggingface:/root/.cache/huggingface \ + --name sglang_multinode1 \ + -it \ + --rm \ + --env "HF_TOKEN=$HF_TOKEN" \ + --ipc=host \ + lmsysorg/sglang:latest \ + python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init 192.168.114.10:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000 +``` +```bash # node 2 -python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init 10.0.0.1:5000 --nnodes 2 --node-rank 1 --trust-remote-code +docker run --gpus all \ + --shm-size 32g \ + --network=host \ + -v ~/.cache/huggingface:/root/.cache/huggingface \ + --name sglang_multinode2 \ + -it \ + --rm \ + --env "HF_TOKEN=$HF_TOKEN" \ + --ipc=host \ + lmsysorg/sglang:latest \ + python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init 192.168.114.10:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000 ``` -If you have two H100 nodes, the usage is similar to the aforementioned H20. 
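+Once both ranks are up, you can sanity-check the endpoint from the first node with a plain HTTP request before running workloads (a sketch; it assumes the server's `/health` route is reachable at the configured host and port):
+```bash
+curl http://192.168.114.10:40000/health
+```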
+To ensure the functionality, we include a testing from a client docker container: +```bash +docker run --gpus all \ + --shm-size 32g \ + --network=host \ + -v ~/.cache/huggingface:/root/.cache/huggingface \ + --name sglang_multinode_client \ + -it \ + --rm \ + --env "HF_TOKEN=$HF_TOKEN" \ + --ipc=host \ + lmsysorg/sglang:latest \ + python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 1 --random-output 512 --random-range-ratio 1 --num-prompts 1 --host 0.0.0.0 --port 40000 --output-file "deepseekv3_multinode.jsonl" +``` ## DeepSeek V3 Optimization Plan From 5b809e6998759af6b72a9f1ba05d50a32ee06989 Mon Sep 17 00:00:00 2001 From: Rodrigo Garcia <32329949+roG0d@users.noreply.github.com> Date: Thu, 2 Jan 2025 10:38:00 +0100 Subject: [PATCH 02/12] Reincluded H20 example --- benchmark/deepseek_v3/README.md | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/benchmark/deepseek_v3/README.md b/benchmark/deepseek_v3/README.md index 6f4b0a9b8d6..d8244e39e95 100644 --- a/benchmark/deepseek_v3/README.md +++ b/benchmark/deepseek_v3/README.md @@ -56,6 +56,19 @@ response = client.chat.completions.create( ) print(response) ``` +### Example serving with 2 H20*8 +For example, there are two H20 nodes, each with 8 GPUs. The first node's IP is `10.0.0.1`, and the second node's IP is `10.0.0.2`. + +```bash +# node 1 +python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code + +# node 2 +python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init 10.0.0.1:5000 --nnodes 2 --node-rank 1 --trust-remote-code +``` + +If you have two H100 nodes, the usage is similar to the aforementioned H20. + ### Example serving with Docker two H200*8 nodes Having two H200 nodes, each with 8 GPUs. The first node's IP is `192.168.114.10`, and the second node's IP is `192.168.114.11`. Configuring the endpoint to expose it to another docker container with `--host 0.0.0.0` and `--port 40000` and configuring nccl comms with `--nccl-init 192.168.114.10:20000`. From 640b41c16d67aa32a256bd8ec724103d45237761 Mon Sep 17 00:00:00 2001 From: Rodrigo Garcia <32329949+roG0d@users.noreply.github.com> Date: Thu, 2 Jan 2025 11:16:00 +0100 Subject: [PATCH 03/12] Updated --nccl-init for --dist-init-addr --- benchmark/deepseek_v3/README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/benchmark/deepseek_v3/README.md b/benchmark/deepseek_v3/README.md index d8244e39e95..d6d51ac6cb1 100644 --- a/benchmark/deepseek_v3/README.md +++ b/benchmark/deepseek_v3/README.md @@ -61,10 +61,10 @@ For example, there are two H20 nodes, each with 8 GPUs. The first node's IP is ` ```bash # node 1 -python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code +python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code # node 2 -python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init 10.0.0.1:5000 --nnodes 2 --node-rank 1 --trust-remote-code +python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 1 --trust-remote-code ``` If you have two H100 nodes, the usage is similar to the aforementioned H20. 
@@ -84,7 +84,7 @@ docker run --gpus all \ --env "HF_TOKEN=$HF_TOKEN" \ --ipc=host \ lmsysorg/sglang:latest \ - python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init 192.168.114.10:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000 + python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000 ``` ```bash @@ -99,7 +99,7 @@ docker run --gpus all \ --env "HF_TOKEN=$HF_TOKEN" \ --ipc=host \ lmsysorg/sglang:latest \ - python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init 192.168.114.10:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000 + python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000 ``` To ensure the functionality, we include a testing from a client docker container: From 9d8c2b4cc8da7238ff6c8b8e006339206413679f Mon Sep 17 00:00:00 2001 From: Yineng Zhang Date: Thu, 2 Jan 2025 22:12:07 +0800 Subject: [PATCH 04/12] upd --- benchmark/deepseek_v3/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/benchmark/deepseek_v3/README.md b/benchmark/deepseek_v3/README.md index d6d51ac6cb1..7199faba34e 100644 --- a/benchmark/deepseek_v3/README.md +++ b/benchmark/deepseek_v3/README.md @@ -70,7 +70,7 @@ python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --di If you have two H100 nodes, the usage is similar to the aforementioned H20. ### Example serving with Docker two H200*8 nodes -Having two H200 nodes, each with 8 GPUs. The first node's IP is `192.168.114.10`, and the second node's IP is `192.168.114.11`. Configuring the endpoint to expose it to another docker container with `--host 0.0.0.0` and `--port 40000` and configuring nccl comms with `--nccl-init 192.168.114.10:20000`. +Having two H200 nodes, each with 8 GPUs. The first node's IP is `192.168.114.10`, and the second node's IP is `192.168.114.11`. Configuring the endpoint to expose it to another docker container with `--host 0.0.0.0` and `--port 40000` and configuring nccl comms with `--dist-init-addr 192.168.114.10:20000`. ```bash # node 1 From 2770fe95f2a2d9ab605544096dd62499cb78dea2 Mon Sep 17 00:00:00 2001 From: Yineng Zhang Date: Thu, 2 Jan 2025 22:13:24 +0800 Subject: [PATCH 05/12] upd --- benchmark/deepseek_v3/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/benchmark/deepseek_v3/README.md b/benchmark/deepseek_v3/README.md index 7199faba34e..d7907824c02 100644 --- a/benchmark/deepseek_v3/README.md +++ b/benchmark/deepseek_v3/README.md @@ -70,7 +70,7 @@ python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --di If you have two H100 nodes, the usage is similar to the aforementioned H20. ### Example serving with Docker two H200*8 nodes -Having two H200 nodes, each with 8 GPUs. The first node's IP is `192.168.114.10`, and the second node's IP is `192.168.114.11`. Configuring the endpoint to expose it to another docker container with `--host 0.0.0.0` and `--port 40000` and configuring nccl comms with `--dist-init-addr 192.168.114.10:20000`. +There are two H200 nodes, each with 8 GPUs. The first node's IP is `192.168.114.10`, and the second node's IP is `192.168.114.11`. 
Configure the endpoint to expose it to another Docker container using `--host 0.0.0.0` and `--port 40000`, and set up communications with `--dist-init-addr 192.168.114.10:20000`.

 ```bash
 # node 1

From 438cf62e2377c75ac506182266ecaf92295ff414 Mon Sep 17 00:00:00 2001
From: Yineng Zhang
Date: Thu, 2 Jan 2025 22:14:29 +0800
Subject: [PATCH 06/12] upd

---
 benchmark/deepseek_v3/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/benchmark/deepseek_v3/README.md b/benchmark/deepseek_v3/README.md
index d7907824c02..bc09c9e63ef 100644
--- a/benchmark/deepseek_v3/README.md
+++ b/benchmark/deepseek_v3/README.md
@@ -102,7 +102,7 @@ docker run --gpus all \
   python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000
 ```

-To ensure the functionality, we include a testing from a client docker container:
+To ensure functionality, we include a test from a client Docker container.
 ```bash
 docker run --gpus all \
   --shm-size 32g \

From 92b49113fde4d4c1e4ceab4917f7df6507302e64 Mon Sep 17 00:00:00 2001
From: Yineng Zhang
Date: Thu, 2 Jan 2025 22:16:43 +0800
Subject: [PATCH 07/12] upd

---
 benchmark/deepseek_v3/README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/benchmark/deepseek_v3/README.md b/benchmark/deepseek_v3/README.md
index bc09c9e63ef..15cf0b26a24 100644
--- a/benchmark/deepseek_v3/README.md
+++ b/benchmark/deepseek_v3/README.md
@@ -71,6 +71,7 @@ If you have two H100 nodes, the usage is similar to the aforementioned H20.

 ### Example serving with Docker two H200*8 nodes
 There are two H200 nodes, each with 8 GPUs. The first node's IP is `192.168.114.10`, and the second node's IP is `192.168.114.11`.
 Configure the endpoint to expose it to another Docker container using `--host 0.0.0.0` and `--port 40000`, and set up communications with `--dist-init-addr 192.168.114.10:20000`.
+A single H200 with 8 devices can run DeepSeek V3; the dual H200 setup is just to demonstrate multi-node usage.
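+For reference, a single 8xH200 node can serve the model by dropping the multi-node flags and using `--tp 8` (a sketch reusing the flags from the commands below):
+```bash
+python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --host 0.0.0.0 --port 40000
+```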
 ```bash
 # node 1

From 5035232da31acda7a3a34c134f5bc9c9c9ba0d1e Mon Sep 17 00:00:00 2001
From: Rodri
Date: Sun, 5 Jan 2025 08:34:22 -0800
Subject: [PATCH 08/12] (ADDED): benchmarks results for deepseekv3

---
 benchmark/benchmark_dsv3/README.md      | 152 ++++++++++++++++++++++++
 benchmark/benchmark_dsv3/deepseek_v3.sh |  69 +++++++++++
 2 files changed, 221 insertions(+)
 create mode 100644 benchmark/benchmark_dsv3/README.md
 create mode 100644 benchmark/benchmark_dsv3/deepseek_v3.sh

diff --git a/benchmark/benchmark_dsv3/README.md b/benchmark/benchmark_dsv3/README.md
new file mode 100644
index 00000000000..e23f57ffa6e
--- /dev/null
+++ b/benchmark/benchmark_dsv3/README.md
@@ -0,0 +1,152 @@
+## Benchmark for SGLang v0.4.1 - DeepSeek v3 on Different H200 configurations
+
+We research the capabilities of two configurations of H200 NVIDIA GPUs:
+- Single-node 8xH200 (BF16/FP8)
+- Multi-node 2x8xH200 (BF16/FP8)
+
+For the benchmarks, we choose the following baseline parameters:
+
+- `--random-range-ratio 1`
+- `--request-rate 1`
+- `--random-input 1024`
+- `--random-output 1024`
+
+Complete results and logs for the benchmarks are available at https://github.com/datacrunch-research/h200-benchmarks
+
+## DeepSeek V3 on 8xH200 (single-node)
+
+### BF16
+
+| RPS | Num Prompts | Median E2E Latency (ms) | Median TTFT (ms) | Median TPOT (ms) | Median ITL (ms) | Output token throughput (tok/s) |
+| ---- | ----------- | ----------------------- | ---------------- | ---------------- | --------------- | ------------------------------- |
+| 1 | 300 | 214,924.09 | 587.15 | 209.48 | 159.64 | 639.99 |
+| 2 | 600 | 235,524.70 | 598.77 | 229.30 | 162.99 | 1313.74 |
+| 4 | 1200 | 324,438.44 | 766.70 | 316.35 | 237.99 | 2378.26 |
+| 8 | 2400 | 686,261.57 | 1191.74 | 516.67 | 255.96 | 2249.03 |
+
+### FP8
+
+| RPS | Num Prompts | Median E2E Latency (ms) | Median TTFT (ms) | Median TPOT (ms) | Median ITL (ms) | Output token throughput (tok/s) |
+| ---- | ----------- | ----------------------- | ---------------- | ---------------- | --------------- | ------------------------------- |
+| 1 | 300 | 147,735.43 | 563.41 | 143.71 | 101.78 | 773.15 |
+| 2 | 600 | 234,757.13 | 684.33 | 228.78 | 149.46 | 1401.77 |
+| 4 | 1200 | 376,040.67 | 865.26 | 366.48 | 287.95 | 2214.76 |
+| 8 | 2400 | 692,710.83 | 1358.77 | 675.95 | 515.18 | 2864.31 |
+
+## DeepSeek V3 on 2x8xH200 (multi-node)
+
+### BF16
+
+| RPS | Num Prompts | Median E2E Latency (ms) | Median TTFT (ms) | Median TPOT (ms) | Median ITL (ms) | Output token throughput (tok/s) |
+| ---- | ----------- | ----------------------- | ---------------- | ---------------- | --------------- | ------------------------------- |
+| 1 | 300 | 971,353.97 | 53,189.54 | 843.03 | 638.68 | 275.06 |
+| 2 | 600 | 2,010,951.23 | 313,373.93 | 1622.07 | 1192.37 | 256.50 |
+| 4 | 1200 | 3,881,082.65 | 774,460.73 | 1645.51 | 1178.42 | 255.45 |
+| 8 | 2400 | 6,819,185.61 | 4,072,706.72 | 2239.22 | 1205.60 | 250.08 |
+
+### FP8
+
+| RPS | Num Prompts | Median E2E Latency (ms) | Median TTFT (ms) | Median TPOT (ms) | Median ITL (ms) | Output token throughput (tok/s) |
+| ---- | ----------- | ----------------------- | ---------------- | ---------------- | --------------- | ------------------------------- |
+| 1 | 300 | 985,610.62 | 56,824.07 | 862.84 | 662.33 | 271.60 |
+| 2 | 600 | 1,975,371.99 | 305,318.37 | 1632.35 | 1219.14 | 288.41 |
+| 4 | 1200 | 3,901,390.30 | 767,082.14 | 3023.99 | 2189.83 | 269.19 |
+| 8 | 2400 | 7,374,173.14 | 1,680,440.41 | 2974.87 | 2007.02 | 276.74 |
+
+## Environment
+
+To guarantee that benchmarking results are
reproducible, we execute all the experiments with the latest available SGLang Docker image. Build the benchmarking environment by running the following commands:
+
+```bash
+docker pull lmsysorg/sglang:dev
+
+docker run -it -d --shm-size 32g --gpus all --net host \
+--env "HF_TOKEN=$HF_TOKEN" \
+-v <path_to_hf_cache>:/root/.cache/huggingface \
+--ipc=host --name sglang_dev lmsysorg/sglang:latest bash
+
+docker exec -it sglang_dev /bin/bash
+```
+
+## Notes
+
+Keep in mind the differences in the commands for optimization techniques due to memory constraints.
+
+## Online benchmarks
+
+## DeepSeek V3 on 8xH200 (single-node)
+
+### BF16
+
+```bash
+# launch server
+python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --enable-torch-compile --enable-dp-attention --mem-fraction-static 0.8 --disable-cuda-graph
+
+
+# bench serving
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 300 --request-rate 1 --random-input 1024 --random-output 1024 --output-file deepseek_v3_8xh200_BF16_online_output.jsonl
+
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 600 --request-rate 2 --random-input 1024 --random-output 1024 --output-file deepseek_v3_8xh200_BF16_online_output.jsonl
+
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 1200 --request-rate 4 --random-input 1024 --random-output 1024 --output-file deepseek_v3_8xh200_BF16_online_output.jsonl
+
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 2400 --request-rate 8 --random-input 1024 --random-output 1024 --output-file deepseek_v3_8xh200_BF16_online_output.jsonl
+
+```
+
+### FP8
+
+```bash
+# launch server
+python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 \
+--quantization fp8 --kv-cache-dtype fp8_e5m2 --trust-remote-code --enable-dp-attention
+
+
+# bench serving
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 300 --request-rate 1 --random-input 1024 --random-output 1024 --output-file deepseek_v3_8xh200_FP8_online_output.jsonl
+
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 600 --request-rate 2 --random-input 1024 --random-output 1024 --output-file deepseek_v3_8xh200_FP8_online_output.jsonl
+
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 1200 --request-rate 4 --random-input 1024 --random-output 1024 --output-file deepseek_v3_8xh200_FP8_online_output.jsonl
+
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 2400 --request-rate 8 --random-input 1024 --random-output 1024 --output-file deepseek_v3_8xh200_FP8_online_output.jsonl
+```
+## DeepSeek V3 on 2x8xH200 (multi-node)
+
+### BF16
+
+```bash
+# launch server
+python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000 --enable-torch-compile --mem-fraction-static 0.8 --disable-cuda-graph
+
+python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000 --enable-torch-compile --mem-fraction-static 0.8 --disable-cuda-graph
+
+
+# bench serving
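+# (note: run these from a client that can reach the rank-0 node on port 40000;
+# each invocation sweeps one request rate and appends a result line to the
+# same JSONL output file)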
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 300 --request-rate 1 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_BF16_online_output.jsonl
+
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 600 --request-rate 2 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_BF16_online_output.jsonl
+
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 1200 --request-rate 4 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_BF16_online_output.jsonl
+
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 2400 --request-rate 8 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_BF16_online_output.jsonl
+```
+
+### FP8
+
+```bash
+# launch server
+python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000 --enable-torch-compile --quantization fp8 --kv-cache-dtype fp8_e5m2 --disable-cuda-graph
+
+python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000 --enable-torch-compile --quantization fp8 --kv-cache-dtype fp8_e5m2 --disable-cuda-graph
+
+
+# bench serving
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 300 --request-rate 1 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl
+
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 600 --request-rate 2 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl
+
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 1200 --request-rate 4 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl
+
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 2400 --request-rate 8 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl
+```
diff --git a/benchmark/benchmark_dsv3/deepseek_v3.sh b/benchmark/benchmark_dsv3/deepseek_v3.sh
new file mode 100644
index 00000000000..d2fa25dd95d
--- /dev/null
+++ b/benchmark/benchmark_dsv3/deepseek_v3.sh
@@ -0,0 +1,69 @@
+# Docker single-node command: (FP8 version)
+: '
+docker run --gpus all \
+  --shm-size 32g \
+  --network=host \
+  -v /mnt/co-research/shared-models:/root/.cache/huggingface \
+  --name sglang_singlenodeFP8 \
+  -it \
+  --rm \
+  --env "HF_TOKEN=$HF_TOKEN" \
+  --ipc=host \
+  lmsysorg/sglang:latest \
+  python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --quantization fp8 --kv-cache-dtype fp8_e5m2 --trust-remote-code --host 0.0.0.0 --port 40000 --enable-dp-attention
+'
+
+# Docker multi-node command: (BF16 version)
+# Node0:
+: '
+docker run --gpus all \
+  --shm-size 32g \
+  --network=host \
+  -v 
/mnt/co-research/shared-models:/root/.cache/huggingface \
+  --name sglang_multinode0 \
+  -it \
+  --rm \
+  --env "HF_TOKEN=$HF_TOKEN" \
+  --ipc=host \
+  lmsysorg/sglang:latest \
+  python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000
+
+'
+
+# Node1:
+: '
+docker run --gpus all \
+  --shm-size 32g \
+  --network=host \
+  -v /mnt/co-research/shared-models:/root/.cache/huggingface \
+  --name sglang_multinode1 \
+  -it \
+  --rm \
+  --env "HF_TOKEN=$HF_TOKEN" \
+  --ipc=host \
+  lmsysorg/sglang:latest \
+  python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000
+
+'
+
+# Docker basic client command:
+: '
+docker run --gpus all \
+  --shm-size 32g \
+  --network=host \
+  -v /mnt/co-research/shared-models:/root/.cache/huggingface \
+  --name sglang_bnchmrk_client \
+  -it \
+  --rm \
+  --env "HF_TOKEN=$HF_TOKEN" \
+  --ipc=host \
+  lmsysorg/sglang:latest \
+  python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 1 --random-output 512 --random-range-ratio 1 --num-prompts 1 --host 0.0.0.0 --port 40000
+'
+
+# 8xH200/2x8xH200 FP8/BF16
+# Online
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 300 --request-rate 1 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 600 --request-rate 2 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 1200 --request-rate 4 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 2400 --request-rate 8 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl

From 62ef626c92766b49df7c6a82b4d00cad92cf1b4d Mon Sep 17 00:00:00 2001
From: Rodrigo Garcia <32329949+roG0d@users.noreply.github.com>
Date: Sun, 5 Jan 2025 17:35:58 +0100
Subject: [PATCH 09/12] Update link to results

---
 benchmark/benchmark_dsv3/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/benchmark/benchmark_dsv3/README.md b/benchmark/benchmark_dsv3/README.md
index e23f57ffa6e..1982bd02c98 100644
--- a/benchmark/benchmark_dsv3/README.md
+++ b/benchmark/benchmark_dsv3/README.md
@@ -11,7 +11,7 @@ For the benchmarks, we choose the following baseline parameters:
 - `--random-input 1024`
 - `--random-output 1024`

-Complete results and logs for the benchmarks are available at https://github.com/datacrunch-research/h200-benchmarks
+Complete results and logs for the benchmarks are available at [https://github.com/datacrunch-research/h200-benchmarks](https://github.com/datacrunch-research/h200-benchmarks/commit/700675be3e55a62925f9c1a80f0b68ecf724ec13)

 ## DeepSeek V3 on 8xH200 (single-node)

From 26db155e335cd6e00d6f1ec161a3fc4c65dae4c0 Mon Sep 17 00:00:00 2001
From: Rodrigo Garcia <32329949+roG0d@users.noreply.github.com>
Date: Sun, 12 Jan 2025 14:04:43 +0100
Subject: [PATCH
10/12] Update infiniband bandwidth and nccl version

---
 benchmark/benchmark_dsv3/README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/benchmark/benchmark_dsv3/README.md b/benchmark/benchmark_dsv3/README.md
index 1982bd02c98..265a805d5b6 100644
--- a/benchmark/benchmark_dsv3/README.md
+++ b/benchmark/benchmark_dsv3/README.md
@@ -3,6 +3,7 @@
 We research the capabilities of two configurations of H200 NVIDIA GPUs:
 - Single-node 8xH200 (BF16/FP8)
 - Multi-node 2x8xH200 (BF16/FP8)
+  - using InfiniBand (400 Gbps) with `nccl=2.21.5`

 For the benchmarks, we choose the following baseline parameters:

From 49f7815f9375c4ea9ad1be40cee8c8dd2b813f4c Mon Sep 17 00:00:00 2001
From: Rodri
Date: Sun, 12 Jan 2025 13:14:36 +0000
Subject: [PATCH 11/12] (ADDED): outputs and logs for DeepSeekv3 and sglang v0.4.1.post4 experimentation

---
 .../deepseek_v3_bf16_2x8xh200_log_output.txt    | 142 +++++++++++++++++
 .../deepseek_v3_bf16_8xh200_log_output.txt      | 144 +++++++++++++++++
 .../deepseek_v3_fp8_2x8xh200_log_output.txt     | 145 ++++++++++++++++++
 .../deepseek_v3_fp8_8xh200_log_output.txt       | 144 +++++++++++++++++
 benchmark/benchmark_v0.4.1.post4/README.md      | 129 ++++++++++++++++
 benchmark/benchmark_v0.4.1.post4/deepseek_v3.sh |  69 +++++++++
 .../deepseek_v3_bf16_8xh200_log_output.txt      | 145 ++++++++++++++++++
 .../deepseek_v3_fp8_8xh200_log_output.txt       | 145 ++++++++++++++++++
 8 files changed, 1063 insertions(+)
 create mode 100644 benchmark/benchmark_dsv3/outputs/logs/deepseek_v3_bf16_2x8xh200_log_output.txt
 create mode 100644 benchmark/benchmark_dsv3/outputs/logs/deepseek_v3_bf16_8xh200_log_output.txt
 create mode 100644 benchmark/benchmark_dsv3/outputs/logs/deepseek_v3_fp8_2x8xh200_log_output.txt
 create mode 100644 benchmark/benchmark_dsv3/outputs/logs/deepseek_v3_fp8_8xh200_log_output.txt
 create mode 100644 benchmark/benchmark_v0.4.1.post4/README.md
 create mode 100644 benchmark/benchmark_v0.4.1.post4/deepseek_v3.sh
 create mode 100644 benchmark/benchmark_v0.4.1.post4/outputs/logs/deepseek_v3_bf16_8xh200_log_output.txt
 create mode 100644 benchmark/benchmark_v0.4.1.post4/outputs/logs/deepseek_v3_fp8_8xh200_log_output.txt

diff --git a/benchmark/benchmark_dsv3/outputs/logs/deepseek_v3_bf16_2x8xh200_log_output.txt b/benchmark/benchmark_dsv3/outputs/logs/deepseek_v3_bf16_2x8xh200_log_output.txt
new file mode 100644
index 00000000000..a1bb54fab73
--- /dev/null
+++ b/benchmark/benchmark_dsv3/outputs/logs/deepseek_v3_bf16_2x8xh200_log_output.txt
@@ -0,0 +1,142 @@
+#Input tokens: 307200
+#Output tokens: 307200
+Starting initial single prompt test run...
+Initial test run completed. Starting main benchmark run...
+
+============ Serving Benchmark Result ============
+Backend: sglang
+Traffic request rate: 1.0
+Max reqeuest concurrency: not set
+Successful requests: 300
+Benchmark duration (s): 1116.85
+Total input tokens: 307200
+Total generated tokens: 307200
+Total generated tokens (retokenized): 306053
+Request throughput (req/s): 0.27
+Input token throughput (tok/s): 275.06
+Output token throughput (tok/s): 275.06
+Total token throughput (tok/s): 550.12
+----------------End-to-End Latency----------------
+Mean E2E Latency (ms): 968448.85
+Median E2E Latency (ms): 971353.97
+---------------Time to First Token----------------
+Mean TTFT (ms): 105080.04
+Median TTFT (ms): 53189.54
+P99 TTFT (ms): 251466.03
+-----Time per Output Token (excl.
1st token)------ +Mean TPOT (ms): 843.96 +Median TPOT (ms): 843.03 +P99 TPOT (ms): 1070.14 +---------------Inter-token Latency---------------- +Mean ITL (ms): 843.96 +Median ITL (ms): 638.68 +P99 ITL (ms): 708.01 +================================================== +Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=40000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=600, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=2.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_2x8xh200_BF16_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) + +#Input tokens: 614400 +#Output tokens: 614400 +Starting initial single prompt test run... +Initial test run completed. Starting main benchmark run... + +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: 2.0 +Max reqeuest concurrency: not set +Successful requests: 600 +Benchmark duration (s): 2395.34 +Total input tokens: 614400 +Total generated tokens: 614400 +Total generated tokens (retokenized): 612299 +Request throughput (req/s): 0.25 +Input token throughput (tok/s): 256.50 +Output token throughput (tok/s): 256.50 +Total token throughput (tok/s): 513.00 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 2003883.86 +Median E2E Latency (ms): 2010951.23 +---------------Time to First Token---------------- +Mean TTFT (ms): 317480.50 +Median TTFT (ms): 313373.93 +P99 TTFT (ms): 628073.04 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 1648.49 +Median TPOT (ms): 1622.07 +P99 TPOT (ms): 2054.30 +---------------Inter-token Latency---------------- +Mean ITL (ms): 1648.32 +Median ITL (ms): 1192.37 +P99 ITL (ms): 1525.58 +================================================== +Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=40000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=1200, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=4.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_2x8xh200_BF16_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) + +#Input tokens: 1228800 +#Output tokens: 1228800 +Starting initial single prompt test run... +Initial test run completed. Starting main benchmark run... 
+ +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: 4.0 +Max reqeuest concurrency: not set +Successful requests: 1200 +Benchmark duration (s): 4810.40 +Total input tokens: 1228800 +Total generated tokens: 1228800 +Total generated tokens (retokenized): 1224692 +Request throughput (req/s): 0.25 +Input token throughput (tok/s): 255.45 +Output token throughput (tok/s): 255.45 +Total token throughput (tok/s): 510.89 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 3206867.31 +Median E2E Latency (ms): 3881082.65 +---------------Time to First Token---------------- +Mean TTFT (ms): 1426498.17 +Median TTFT (ms): 774460.73 +P99 TTFT (ms): 3980643.34 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 1740.34 +Median TPOT (ms): 1645.51 +P99 TPOT (ms): 3600.89 +---------------Inter-token Latency---------------- +Mean ITL (ms): 1740.23 +Median ITL (ms): 1178.42 +P99 ITL (ms): 1608.58 +================================================== +Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=40000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=2400, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=8.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_2x8xh200_BF16_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) + +#Input tokens: 2457600 +#Output tokens: 2457600 +Starting initial single prompt test run... +Initial test run completed. Starting main benchmark run... + +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: 8.0 +Max reqeuest concurrency: not set +Successful requests: 2400 +Benchmark duration (s): 9827.36 +Total input tokens: 2457600 +Total generated tokens: 2457600 +Total generated tokens (retokenized): 2449303 +Request throughput (req/s): 0.24 +Input token throughput (tok/s): 250.08 +Output token throughput (tok/s): 250.08 +Total token throughput (tok/s): 500.15 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 6004940.75 +Median E2E Latency (ms): 6819185.61 +---------------Time to First Token---------------- +Mean TTFT (ms): 3356919.45 +Median TTFT (ms): 4072706.72 +P99 TTFT (ms): 7107066.15 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 2588.49 +Median TPOT (ms): 2239.22 +P99 TPOT (ms): 7387.83 +---------------Inter-token Latency---------------- +Mean ITL (ms): 2587.96 +Median ITL (ms): 1205.60 +P99 ITL (ms): 8271.60 +================================================== diff --git a/benchmark/benchmark_dsv3/outputs/logs/deepseek_v3_bf16_8xh200_log_output.txt b/benchmark/benchmark_dsv3/outputs/logs/deepseek_v3_bf16_8xh200_log_output.txt new file mode 100644 index 00000000000..8a9873ad5d6 --- /dev/null +++ b/benchmark/benchmark_dsv3/outputs/logs/deepseek_v3_bf16_8xh200_log_output.txt @@ -0,0 +1,144 @@ +Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=30000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=300, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=1.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_8xh200_BF16_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) + +#Input tokens: 307200 +#Output tokens: 307200 +Starting initial single prompt test run... +Initial test run completed. Starting main benchmark run... + +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: 1.0 +Max reqeuest concurrency: not set +Successful requests: 300 +Benchmark duration (s): 480.00 +Total input tokens: 307200 +Total generated tokens: 307200 +Total generated tokens (retokenized): 306052 +Request throughput (req/s): 0.62 +Input token throughput (tok/s): 639.99 +Output token throughput (tok/s): 639.99 +Total token throughput (tok/s): 1279.99 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 219910.49 +Median E2E Latency (ms): 214924.09 +---------------Time to First Token---------------- +Mean TTFT (ms): 1484.08 +Median TTFT (ms): 587.15 +P99 TTFT (ms): 10167.11 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 213.52 +Median TPOT (ms): 209.48 +P99 TPOT (ms): 271.65 +---------------Inter-token Latency---------------- +Mean ITL (ms): 213.52 +Median ITL (ms): 159.64 +P99 ITL (ms): 907.22 +================================================== +Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=30000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=600, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=2.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_8xh200_BF16_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) + +#Input tokens: 614400 +#Output tokens: 614400 +Starting initial single prompt test run... +Initial test run completed. Starting main benchmark run... 
+ +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: 2.0 +Max reqeuest concurrency: not set +Successful requests: 600 +Benchmark duration (s): 467.67 +Total input tokens: 614400 +Total generated tokens: 614400 +Total generated tokens (retokenized): 612253 +Request throughput (req/s): 1.28 +Input token throughput (tok/s): 1313.74 +Output token throughput (tok/s): 1313.74 +Total token throughput (tok/s): 2627.48 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 235341.58 +Median E2E Latency (ms): 235524.70 +---------------Time to First Token---------------- +Mean TTFT (ms): 652.11 +Median TTFT (ms): 598.77 +P99 TTFT (ms): 1338.42 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 229.41 +Median TPOT (ms): 229.30 +P99 TPOT (ms): 296.47 +---------------Inter-token Latency---------------- +Mean ITL (ms): 229.42 +Median ITL (ms): 162.99 +P99 ITL (ms): 922.06 +================================================== +Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=30000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=1200, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=4.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_8xh200_BF16_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) + +#Input tokens: 1228800 +#Output tokens: 1228800 +Starting initial single prompt test run... +Initial test run completed. Starting main benchmark run... + +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: 4.0 +Max reqeuest concurrency: not set +Successful requests: 1200 +Benchmark duration (s): 516.68 +Total input tokens: 1228800 +Total generated tokens: 1228800 +Total generated tokens (retokenized): 1224646 +Request throughput (req/s): 2.32 +Input token throughput (tok/s): 2378.26 +Output token throughput (tok/s): 2378.26 +Total token throughput (tok/s): 4756.52 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 321625.84 +Median E2E Latency (ms): 324438.44 +---------------Time to First Token---------------- +Mean TTFT (ms): 790.54 +Median TTFT (ms): 766.70 +P99 TTFT (ms): 1631.13 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 313.62 +Median TPOT (ms): 316.35 +P99 TPOT (ms): 404.28 +---------------Inter-token Latency---------------- +Mean ITL (ms): 313.63 +Median ITL (ms): 237.99 +P99 ITL (ms): 1125.06 +================================================== +Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=30000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=2400, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=8.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_8xh200_BF16_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) + +#Input tokens: 2457600 +#Output tokens: 2457600 +Starting initial single prompt test run... +Initial test run completed. Starting main benchmark run... + +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: 8.0 +Max reqeuest concurrency: not set +Successful requests: 2400 +Benchmark duration (s): 1092.74 +Total input tokens: 2457600 +Total generated tokens: 2457600 +Total generated tokens (retokenized): 2449187 +Request throughput (req/s): 2.20 +Input token throughput (tok/s): 2249.03 +Output token throughput (tok/s): 2249.03 +Total token throughput (tok/s): 4498.07 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 654511.27 +Median E2E Latency (ms): 686261.57 +---------------Time to First Token---------------- +Mean TTFT (ms): 96306.56 +Median TTFT (ms): 1191.74 +P99 TTFT (ms): 471552.20 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 545.65 +Median TPOT (ms): 516.67 +P99 TPOT (ms): 832.97 +---------------Inter-token Latency---------------- +Mean ITL (ms): 545.71 +Median ITL (ms): 255.96 +P99 ITL (ms): 4197.25 +================================================== diff --git a/benchmark/benchmark_dsv3/outputs/logs/deepseek_v3_fp8_2x8xh200_log_output.txt b/benchmark/benchmark_dsv3/outputs/logs/deepseek_v3_fp8_2x8xh200_log_output.txt new file mode 100644 index 00000000000..c7d51e7cb5d --- /dev/null +++ b/benchmark/benchmark_dsv3/outputs/logs/deepseek_v3_fp8_2x8xh200_log_output.txt @@ -0,0 +1,145 @@ +nohup: ignoring input +Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=40000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=300, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=1.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_2x8h200_FP8_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) + +#Input tokens: 307200 +#Output tokens: 307200 +Starting initial single prompt test run... +Initial test run completed. Starting main benchmark run... 
+ +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: 1.0 +Max reqeuest concurrency: not set +Successful requests: 300 +Benchmark duration (s): 1131.09 +Total input tokens: 307200 +Total generated tokens: 307200 +Total generated tokens (retokenized): 306092 +Request throughput (req/s): 0.27 +Input token throughput (tok/s): 271.60 +Output token throughput (tok/s): 271.60 +Total token throughput (tok/s): 543.19 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 982681.06 +Median E2E Latency (ms): 985610.62 +---------------Time to First Token---------------- +Mean TTFT (ms): 99781.93 +Median TTFT (ms): 56824.07 +P99 TTFT (ms): 244007.03 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 863.05 +Median TPOT (ms): 862.84 +P99 TPOT (ms): 1084.94 +---------------Inter-token Latency---------------- +Mean ITL (ms): 863.05 +Median ITL (ms): 662.33 +P99 ITL (ms): 695.39 +================================================== +Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=40000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=600, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=2.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_2x8xh200_FP8_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) + +#Input tokens: 614400 +#Output tokens: 614400 +Starting initial single prompt test run... +Initial test run completed. Starting main benchmark run... + +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: 2.0 +Max reqeuest concurrency: not set +Successful requests: 600 +Benchmark duration (s): 2130.27 +Total input tokens: 614400 +Total generated tokens: 614400 +Total generated tokens (retokenized): 612142 +Request throughput (req/s): 0.28 +Input token throughput (tok/s): 288.41 +Output token throughput (tok/s): 288.41 +Total token throughput (tok/s): 576.83 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 1978002.69 +Median E2E Latency (ms): 1975371.99 +---------------Time to First Token---------------- +Mean TTFT (ms): 309169.92 +Median TTFT (ms): 305318.37 +P99 TTFT (ms): 609895.40 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 1631.31 +Median TPOT (ms): 1632.35 +P99 TPOT (ms): 2057.38 +---------------Inter-token Latency---------------- +Mean ITL (ms): 1631.34 +Median ITL (ms): 1219.14 +P99 ITL (ms): 1537.46 +================================================== +Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=40000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=1200, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=4.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_2x8xh200_FP8_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) + +#Input tokens: 1228800 +#Output tokens: 1228800 +Starting initial single prompt test run... +Initial test run completed. Starting main benchmark run... + +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: 4.0 +Max reqeuest concurrency: not set +Successful requests: 1200 +Benchmark duration (s): 4564.80 +Total input tokens: 1228800 +Total generated tokens: 1228800 +Total generated tokens (retokenized): 1224515 +Request throughput (req/s): 0.26 +Input token throughput (tok/s): 269.19 +Output token throughput (tok/s): 269.19 +Total token throughput (tok/s): 538.38 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 3929702.07 +Median E2E Latency (ms): 3901390.30 +---------------Time to First Token---------------- +Mean TTFT (ms): 767128.52 +Median TTFT (ms): 767082.14 +P99 TTFT (ms): 1504428.26 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 3091.47 +Median TPOT (ms): 3023.99 +P99 TPOT (ms): 3886.39 +---------------Inter-token Latency---------------- +Mean ITL (ms): 3091.12 +Median ITL (ms): 2189.83 +P99 ITL (ms): 2596.82 +================================================== +Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=40000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=2400, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=8.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_2x8xh200_FP8_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) + +#Input tokens: 2457600 +#Output tokens: 2457600 +Starting initial single prompt test run... +Initial test run completed. Starting main benchmark run... 
+ +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: 8.0 +Max reqeuest concurrency: not set +Successful requests: 2400 +Benchmark duration (s): 8880.48 +Total input tokens: 2457600 +Total generated tokens: 2457600 +Total generated tokens (retokenized): 2448836 +Request throughput (req/s): 0.27 +Input token throughput (tok/s): 276.74 +Output token throughput (tok/s): 276.74 +Total token throughput (tok/s): 553.48 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 6079389.87 +Median E2E Latency (ms): 7374173.14 +---------------Time to First Token---------------- +Mean TTFT (ms): 2858184.95 +Median TTFT (ms): 1680440.41 +P99 TTFT (ms): 7511052.50 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 3148.78 +Median TPOT (ms): 2974.87 +P99 TPOT (ms): 6686.54 +---------------Inter-token Latency---------------- +Mean ITL (ms): 3148.57 +Median ITL (ms): 2007.02 +P99 ITL (ms): 2745.71 +================================================== diff --git a/benchmark/benchmark_dsv3/outputs/logs/deepseek_v3_fp8_8xh200_log_output.txt b/benchmark/benchmark_dsv3/outputs/logs/deepseek_v3_fp8_8xh200_log_output.txt new file mode 100644 index 00000000000..a41e219d741 --- /dev/null +++ b/benchmark/benchmark_dsv3/outputs/logs/deepseek_v3_fp8_8xh200_log_output.txt @@ -0,0 +1,144 @@ +Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=30000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=300, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=1.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_8xh200_FP8_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) + +#Input tokens: 307200 +#Output tokens: 307200 +Starting initial single prompt test run... +Initial test run completed. Starting main benchmark run... + +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: 1.0 +Max reqeuest concurrency: not set +Successful requests: 300 +Benchmark duration (s): 397.34 +Total input tokens: 307200 +Total generated tokens: 307200 +Total generated tokens (retokenized): 306153 +Request throughput (req/s): 0.76 +Input token throughput (tok/s): 773.15 +Output token throughput (tok/s): 773.15 +Total token throughput (tok/s): 1546.29 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 139395.84 +Median E2E Latency (ms): 147735.43 +---------------Time to First Token---------------- +Mean TTFT (ms): 629.85 +Median TTFT (ms): 563.41 +P99 TTFT (ms): 1184.81 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 135.65 +Median TPOT (ms): 143.71 +P99 TPOT (ms): 154.90 +---------------Inter-token Latency---------------- +Mean ITL (ms): 135.65 +Median ITL (ms): 101.78 +P99 ITL (ms): 588.61 +================================================== +Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=30000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=600, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=2.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_8xh200_FP8_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) + +#Input tokens: 614400 +#Output tokens: 614400 +Starting initial single prompt test run... +Initial test run completed. Starting main benchmark run... + +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: 2.0 +Max reqeuest concurrency: not set +Successful requests: 600 +Benchmark duration (s): 438.30 +Total input tokens: 614400 +Total generated tokens: 614400 +Total generated tokens (retokenized): 612306 +Request throughput (req/s): 1.37 +Input token throughput (tok/s): 1401.77 +Output token throughput (tok/s): 1401.77 +Total token throughput (tok/s): 2803.54 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 227131.11 +Median E2E Latency (ms): 234757.13 +---------------Time to First Token---------------- +Mean TTFT (ms): 742.35 +Median TTFT (ms): 684.33 +P99 TTFT (ms): 1576.72 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 221.30 +Median TPOT (ms): 228.78 +P99 TPOT (ms): 280.95 +---------------Inter-token Latency---------------- +Mean ITL (ms): 221.30 +Median ITL (ms): 149.46 +P99 ITL (ms): 1046.25 +================================================== +Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=30000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=1200, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=4.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_8xh200_FP8_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) + +#Input tokens: 1228800 +#Output tokens: 1228800 +Starting initial single prompt test run... +Initial test run completed. Starting main benchmark run... 
+ +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: 4.0 +Max reqeuest concurrency: not set +Successful requests: 1200 +Benchmark duration (s): 554.82 +Total input tokens: 1228800 +Total generated tokens: 1228800 +Total generated tokens (retokenized): 1224403 +Request throughput (req/s): 2.16 +Input token throughput (tok/s): 2214.76 +Output token throughput (tok/s): 2214.76 +Total token throughput (tok/s): 4429.52 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 370518.68 +Median E2E Latency (ms): 376040.67 +---------------Time to First Token---------------- +Mean TTFT (ms): 881.28 +Median TTFT (ms): 865.26 +P99 TTFT (ms): 1518.00 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 361.33 +Median TPOT (ms): 366.48 +P99 TPOT (ms): 451.84 +---------------Inter-token Latency---------------- +Mean ITL (ms): 361.33 +Median ITL (ms): 287.95 +P99 ITL (ms): 1244.53 +================================================== +Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=30000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=2400, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=8.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_8xh200_FP8_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) + +#Input tokens: 2457600 +#Output tokens: 2457600 +Starting initial single prompt test run... +Initial test run completed. Starting main benchmark run... + +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: 8.0 +Max reqeuest concurrency: not set +Successful requests: 2400 +Benchmark duration (s): 858.01 +Total input tokens: 2457600 +Total generated tokens: 2457600 +Total generated tokens (retokenized): 2449246 +Request throughput (req/s): 2.80 +Input token throughput (tok/s): 2864.31 +Output token throughput (tok/s): 2864.31 +Total token throughput (tok/s): 5728.61 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 687402.83 +Median E2E Latency (ms): 692710.83 +---------------Time to First Token---------------- +Mean TTFT (ms): 1627.56 +Median TTFT (ms): 1358.77 +P99 TTFT (ms): 4392.08 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 670.36 +Median TPOT (ms): 675.95 +P99 TPOT (ms): 780.39 +---------------Inter-token Latency---------------- +Mean ITL (ms): 670.53 +Median ITL (ms): 515.18 +P99 ITL (ms): 4618.92 +==================================================
diff --git a/benchmark/benchmark_v0.4.1.post4/README.md b/benchmark/benchmark_v0.4.1.post4/README.md
new file mode 100644
index 00000000000..881cb7fdebc
--- /dev/null
+++ b/benchmark/benchmark_v0.4.1.post4/README.md
@@ -0,0 +1,129 @@
+## Benchmark for SGLang v0.4.1.post4 - DeepSeek V3 on Different H200 Configurations
+
+We benchmark the capabilities of two H200 NVIDIA GPU configurations:
+- Single-node 8xH200 (BF16/FP8)
+- Multi-node 2x8xH200 (BF16/FP8)
+
+For the benchmarks, we use the following baseline parameters:
+
+- `--random-range-ratio 1`
+- `--request-rate 1`
+- `--random-input 1024`
+- `--random-output 1024`
+
+Complete results and logs for the benchmarks are available at https://github.com/datacrunch-research/h200-benchmarks.
+
+## Environment
+
+To guarantee reproducible benchmark results, we run all experiments inside the latest available SGLang Docker image. Build the benchmarking environment with the following commands:
+
+```bash
+docker pull lmsysorg/sglang:latest
+
+docker run -it -d --shm-size 32g --gpus all --net host \
+    --env "HF_TOKEN=$HF_TOKEN" \
+    -v <path/to/hf/cache>:/root/.cache/huggingface \
+    --ipc=host --name sglang_dev lmsysorg/sglang:latest bash
+
+docker exec -it sglang_dev /bin/bash
+```
+
+## Notes
+
+Keep in mind that the launch commands differ in their optimization flags because of memory constraints: the BF16 runs use `--mem-fraction-static 0.8` and `--disable-cuda-graph`, while the FP8 runs rely on `--quantization fp8` with an FP8 KV cache (`--kv-cache-dtype fp8_e5m2`).
+
+## Online benchmarks
+
+## DeepSeek V3 on 8xH200 (single-node)
+
+### BF16
+
+```bash
+# launch server
+python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --enable-torch-compile --enable-dp-attention --mem-fraction-static 0.8 --disable-cuda-graph
+
+# bench serving
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompts 300 --request-rate 1 --random-input 1024 --random-output 1024 --output-file deepseek_v3_8xh200_BF16_online_output.jsonl
+
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompts 600 --request-rate 2 --random-input 1024 --random-output 1024 --output-file deepseek_v3_8xh200_BF16_online_output.jsonl
+
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompts 1200 --request-rate 4 --random-input 1024 --random-output 1024 --output-file deepseek_v3_8xh200_BF16_online_output.jsonl
+
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompts 2400 --request-rate 8 --random-input 1024 --random-output 1024 --output-file deepseek_v3_8xh200_BF16_online_output.jsonl
+```
+
+### FP8
+
+```bash
+# launch server
+python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --quantization fp8 --kv-cache-dtype fp8_e5m2 --trust-remote-code --enable-dp-attention
+
+# bench serving
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompts 300 --request-rate 1 --random-input 1024 --random-output 1024 --output-file deepseek_v3_8xh200_FP8_online_output.jsonl
+
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompts 600 --request-rate 2 --random-input 1024 --random-output 1024 --output-file deepseek_v3_8xh200_FP8_online_output.jsonl
+
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompts 1200 --request-rate 4 --random-input 1024 --random-output 1024 --output-file deepseek_v3_8xh200_FP8_online_output.jsonl
+
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompts 2400 --request-rate 8 --random-input 1024 --random-output 1024 --output-file deepseek_v3_8xh200_FP8_online_output.jsonl
```
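+
+The four `bench_serving` invocations above only vary `--request-rate` and `--num-prompts` (always 300 prompts per unit of request rate), so the sweep can be scripted. A minimal sketch, assuming the server launched above is already listening on its default port:
+
+```bash
+# Sweep the request rates used in this benchmark; --num-prompts scales as 300 x rate.
+for rate in 1 2 4 8; do
+    python3 -m sglang.bench_serving --backend sglang --dataset-name random \
+        --random-range-ratio 1 --num-prompts $((300 * rate)) --request-rate "$rate" \
+        --random-input 1024 --random-output 1024 \
+        --output-file deepseek_v3_8xh200_FP8_online_output.jsonl
+done
+```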
+
+## DeepSeek V3 on 2x8xH200 (multi-node)
+
+### BF16
+
+```bash
+# launch server (node rank 0)
+python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000 --enable-torch-compile --mem-fraction-static 0.8 --disable-cuda-graph
+
+# launch server (node rank 1)
+python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000 --enable-torch-compile --mem-fraction-static 0.8 --disable-cuda-graph
+
+# bench serving
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompts 300 --request-rate 1 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_BF16_online_output.jsonl
+
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompts 600 --request-rate 2 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_BF16_online_output.jsonl
+
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompts 1200 --request-rate 4 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_BF16_online_output.jsonl
+
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompts 2400 --request-rate 8 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_BF16_online_output.jsonl
+```
+
+### FP8
+
+```bash
+# launch server (node rank 0)
+python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000 --enable-torch-compile --quantization fp8 --kv-cache-dtype fp8_e5m2 --disable-cuda-graph
+
+# launch server (node rank 1)
+python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000 --enable-torch-compile --quantization fp8 --kv-cache-dtype fp8_e5m2 --disable-cuda-graph
+
+# bench serving
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompts 300 --request-rate 1 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl
+
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompts 600 --request-rate 2 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl
+
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompts 1200 --request-rate 4 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl
+
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompts 2400 --request-rate 8 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl
+```
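+
+Before kicking off a long sweep against the multi-node deployment, it is worth confirming that the rank-0 endpoint is answering. A minimal check, assuming the server exposes its OpenAI-compatible HTTP routes on the host and port configured above:
+
+```bash
+# Should list the served model once both nodes have finished loading weights.
+curl -s http://192.168.114.10:40000/v1/models
+```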
+
+#### Note: Detach mode
+
+To keep the server and the benchmark script running after the SSH session closes, launch them with `nohup` in the background:
+
+```bash
+nohup python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --quantization fp8 --kv-cache-dtype fp8_e5m2 --trust-remote-code --enable-dp-attention --host 0.0.0.0 --port 40000 &> singlenode_fp8.log &
+```
+
+```bash
+nohup bash deepseek_v3.sh &> deepseek_v3_fp8_8xh200_log_output.txt &
+```
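+
+A detached run only writes to its log file, so progress has to be followed there. A small sketch, using the log names from above:
+
+```bash
+# Follow the benchmark while it runs...
+tail -f deepseek_v3_fp8_8xh200_log_output.txt
+
+# ...or pull out just the result banners afterwards (each block is ~30 lines).
+grep -A 30 "Serving Benchmark Result" deepseek_v3_fp8_8xh200_log_output.txt
+```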
\ No newline at end of file
diff --git a/benchmark/benchmark_v0.4.1.post4/deepseek_v3.sh b/benchmark/benchmark_v0.4.1.post4/deepseek_v3.sh
new file mode 100644
index 00000000000..1b0efd91b01
--- /dev/null
+++ b/benchmark/benchmark_v0.4.1.post4/deepseek_v3.sh
@@ -0,0 +1,69 @@
+# Docker single-node command: (FP8 version) * PROVISIONAL *
+: '
+docker run --gpus all \
+    --shm-size 32g \
+    --network=host \
+    -v /mnt/co-research/shared-models:/root/.cache/huggingface \
+    --name sglang_singlenodeFP8 \
+    -it \
+    --rm \
+    --env "HF_TOKEN=$HF_TOKEN" \
+    --ipc=host \
+    lmsysorg/sglang:latest \
+    python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --quantization fp8 --kv-cache-dtype fp8_e5m2 --trust-remote-code --host 0.0.0.0 --port 40000 --enable-dp-attention
+'
+
+# Docker multi-node command: (BF16 version) * PROVISIONAL *
+# Node0: * PROVISIONAL *
+: '
+docker run --gpus all \
+    --shm-size 32g \
+    --network=host \
+    -v /mnt/co-research/shared-models:/root/.cache/huggingface \
+    --name sglang_multinode0 \
+    -it \
+    --rm \
+    --env "HF_TOKEN=$HF_TOKEN" \
+    --ipc=host \
+    lmsysorg/sglang:latest \
+    python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000
+'
+
+# Node1: * PROVISIONAL *
+: '
+docker run --gpus all \
+    --shm-size 32g \
+    --network=host \
+    -v /mnt/co-research/shared-models:/root/.cache/huggingface \
+    --name sglang_multinode1 \
+    -it \
+    --rm \
+    --env "HF_TOKEN=$HF_TOKEN" \
+    --ipc=host \
+    lmsysorg/sglang:latest \
+    python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000
+'
+
+# Docker basic client command: * PROVISIONAL *
+: '
+docker run --gpus all \
+    --shm-size 32g \
+    --network=host \
+    -v /mnt/co-research/shared-models:/root/.cache/huggingface \
+    --name sglang_bnchmrk_client \
+    -it \
+    --rm \
+    --env "HF_TOKEN=$HF_TOKEN" \
+    --ipc=host \
+    lmsysorg/sglang:latest \
+    python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 1 --random-output 512 --random-range-ratio 1 --num-prompts 1 --host 0.0.0.0 --port 40000
+'
+
+# 8xH200/2x8xH200 FP8/BF16
+# Online
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompts 300 --request-rate 1 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompts 600 --request-rate 2 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompts 1200 --request-rate 4 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompts 2400 --request-rate 8 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl
diff --git a/benchmark/benchmark_v0.4.1.post4/outputs/logs/deepseek_v3_bf16_8xh200_log_output.txt b/benchmark/benchmark_v0.4.1.post4/outputs/logs/deepseek_v3_bf16_8xh200_log_output.txt
new file mode 100644
index 00000000000..498e4a7b759
--- /dev/null
+++ b/benchmark/benchmark_v0.4.1.post4/outputs/logs/deepseek_v3_bf16_8xh200_log_output.txt
@@ -0,0 +1,145 @@
+nohup: ignoring input +Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=40000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=300, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=1.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_8xh200_BF16_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) + +#Input tokens: 307200 +#Output tokens: 307200 +Starting initial single prompt test run... +Initial test run completed. Starting main benchmark run... + +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: 1.0 +Max reqeuest concurrency: not set +Successful requests: 300 +Benchmark duration (s): 478.05 +Total input tokens: 307200 +Total generated tokens: 307200 +Total generated tokens (retokenized): 306099 +Request throughput (req/s): 0.63 +Input token throughput (tok/s): 642.60 +Output token throughput (tok/s): 642.60 +Total token throughput (tok/s): 1285.21 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 215967.43 +Median E2E Latency (ms): 213466.29 +---------------Time to First Token---------------- +Mean TTFT (ms): 1385.94 +Median TTFT (ms): 578.54 +P99 TTFT (ms): 9433.96 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 209.76 +Median TPOT (ms): 207.72 +P99 TPOT (ms): 263.40 +---------------Inter-token Latency---------------- +Mean ITL (ms): 209.76 +Median ITL (ms): 157.87 +P99 ITL (ms): 898.72 +================================================== +Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=40000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=600, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=2.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_8xh200_BF16_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) + +#Input tokens: 614400 +#Output tokens: 614400 +Starting initial single prompt test run... +Initial test run completed. Starting main benchmark run...
+ +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: 2.0 +Max reqeuest concurrency: not set +Successful requests: 600 +Benchmark duration (s): 465.57 +Total input tokens: 614400 +Total generated tokens: 614400 +Total generated tokens (retokenized): 612168 +Request throughput (req/s): 1.29 +Input token throughput (tok/s): 1319.68 +Output token throughput (tok/s): 1319.68 +Total token throughput (tok/s): 2639.36 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 233050.15 +Median E2E Latency (ms): 233187.80 +---------------Time to First Token---------------- +Mean TTFT (ms): 653.21 +Median TTFT (ms): 593.40 +P99 TTFT (ms): 1171.34 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 227.17 +Median TPOT (ms): 227.37 +P99 TPOT (ms): 291.99 +---------------Inter-token Latency---------------- +Mean ITL (ms): 227.17 +Median ITL (ms): 160.69 +P99 ITL (ms): 925.49 +================================================== +Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=40000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=1200, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=4.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_8xh200_BF16_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) + +#Input tokens: 1228800 +#Output tokens: 1228800 +Starting initial single prompt test run... +Initial test run completed. Starting main benchmark run... + +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: 4.0 +Max reqeuest concurrency: not set +Successful requests: 1200 +Benchmark duration (s): 516.75 +Total input tokens: 1228800 +Total generated tokens: 1228800 +Total generated tokens (retokenized): 1224574 +Request throughput (req/s): 2.32 +Input token throughput (tok/s): 2377.95 +Output token throughput (tok/s): 2377.95 +Total token throughput (tok/s): 4755.90 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 322017.38 +Median E2E Latency (ms): 325442.20 +---------------Time to First Token---------------- +Mean TTFT (ms): 782.61 +Median TTFT (ms): 756.71 +P99 TTFT (ms): 1651.82 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 314.01 +Median TPOT (ms): 317.36 +P99 TPOT (ms): 403.03 +---------------Inter-token Latency---------------- +Mean ITL (ms): 314.02 +Median ITL (ms): 239.53 +P99 ITL (ms): 1122.77 +================================================== +Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=40000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=2400, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=8.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_8xh200_BF16_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) + +#Input tokens: 2457600 +#Output tokens: 2457600 +Starting initial single prompt test run... +Initial test run completed. Starting main benchmark run... + +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: 8.0 +Max reqeuest concurrency: not set +Successful requests: 2400 +Benchmark duration (s): 1103.55 +Total input tokens: 2457600 +Total generated tokens: 2457600 +Total generated tokens (retokenized): 2449208 +Request throughput (req/s): 2.17 +Input token throughput (tok/s): 2226.99 +Output token throughput (tok/s): 2226.99 +Total token throughput (tok/s): 4453.98 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 665434.00 +Median E2E Latency (ms): 697458.92 +---------------Time to First Token---------------- +Mean TTFT (ms): 98361.38 +Median TTFT (ms): 1194.39 +P99 TTFT (ms): 483150.07 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 554.32 +Median TPOT (ms): 526.85 +P99 TPOT (ms): 844.25 +---------------Inter-token Latency---------------- +Mean ITL (ms): 554.33 +Median ITL (ms): 257.79 +P99 ITL (ms): 4449.63 +================================================== diff --git a/benchmark/benchmark_v0.4.1.post4/outputs/logs/deepseek_v3_fp8_8xh200_log_output.txt b/benchmark/benchmark_v0.4.1.post4/outputs/logs/deepseek_v3_fp8_8xh200_log_output.txt new file mode 100644 index 00000000000..a6557793a9a --- /dev/null +++ b/benchmark/benchmark_v0.4.1.post4/outputs/logs/deepseek_v3_fp8_8xh200_log_output.txt @@ -0,0 +1,145 @@ +nohup: ignoring input +Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=40000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=300, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=1.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_8xh200_BF16_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) + +#Input tokens: 307200 +#Output tokens: 307200 +Starting initial single prompt test run... +Initial test run completed. Starting main benchmark run... 
+ +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: 1.0 +Max reqeuest concurrency: not set +Successful requests: 300 +Benchmark duration (s): 397.92 +Total input tokens: 307200 +Total generated tokens: 307200 +Total generated tokens (retokenized): 306152 +Request throughput (req/s): 0.75 +Input token throughput (tok/s): 772.01 +Output token throughput (tok/s): 772.01 +Total token throughput (tok/s): 1544.02 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 146462.10 +Median E2E Latency (ms): 157347.12 +---------------Time to First Token---------------- +Mean TTFT (ms): 1473.33 +Median TTFT (ms): 586.83 +P99 TTFT (ms): 11050.65 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 141.73 +Median TPOT (ms): 153.11 +P99 TPOT (ms): 162.82 +---------------Inter-token Latency---------------- +Mean ITL (ms): 141.73 +Median ITL (ms): 103.52 +P99 ITL (ms): 596.10 +================================================== +Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=40000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=600, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=2.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_8xh200_BF16_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) + +#Input tokens: 614400 +#Output tokens: 614400 +Starting initial single prompt test run... +Initial test run completed. Starting main benchmark run... + +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: 2.0 +Max reqeuest concurrency: not set +Successful requests: 600 +Benchmark duration (s): 438.70 +Total input tokens: 614400 +Total generated tokens: 614400 +Total generated tokens (retokenized): 612336 +Request throughput (req/s): 1.37 +Input token throughput (tok/s): 1400.49 +Output token throughput (tok/s): 1400.49 +Total token throughput (tok/s): 2800.98 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 227856.17 +Median E2E Latency (ms): 235231.81 +---------------Time to First Token---------------- +Mean TTFT (ms): 766.49 +Median TTFT (ms): 695.87 +P99 TTFT (ms): 1764.03 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 221.98 +Median TPOT (ms): 228.72 +P99 TPOT (ms): 282.76 +---------------Inter-token Latency---------------- +Mean ITL (ms): 221.99 +Median ITL (ms): 150.11 +P99 ITL (ms): 1054.09 +================================================== +Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=40000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=1200, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=4.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_8xh200_BF16_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) + +#Input tokens: 1228800 +#Output tokens: 1228800 +Starting initial single prompt test run... +Initial test run completed. Starting main benchmark run... + +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: 4.0 +Max reqeuest concurrency: not set +Successful requests: 1200 +Benchmark duration (s): 555.83 +Total input tokens: 1228800 +Total generated tokens: 1228800 +Total generated tokens (retokenized): 1224553 +Request throughput (req/s): 2.16 +Input token throughput (tok/s): 2210.76 +Output token throughput (tok/s): 2210.76 +Total token throughput (tok/s): 4421.52 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 371743.37 +Median E2E Latency (ms): 376584.64 +---------------Time to First Token---------------- +Mean TTFT (ms): 885.85 +Median TTFT (ms): 870.93 +P99 TTFT (ms): 1826.65 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 362.52 +Median TPOT (ms): 367.10 +P99 TPOT (ms): 453.56 +---------------Inter-token Latency---------------- +Mean ITL (ms): 362.52 +Median ITL (ms): 290.58 +P99 ITL (ms): 1248.61 +================================================== +Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=40000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=2400, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=8.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_8xh200_BF16_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) + +#Input tokens: 2457600 +#Output tokens: 2457600 +Starting initial single prompt test run... +Initial test run completed. Starting main benchmark run... 
+ +============ Serving Benchmark Result ============ +Backend: sglang +Traffic request rate: 8.0 +Max reqeuest concurrency: not set +Successful requests: 2400 +Benchmark duration (s): 859.04 +Total input tokens: 2457600 +Total generated tokens: 2457600 +Total generated tokens (retokenized): 2449208 +Request throughput (req/s): 2.79 +Input token throughput (tok/s): 2860.87 +Output token throughput (tok/s): 2860.87 +Total token throughput (tok/s): 5721.75 +----------------End-to-End Latency---------------- +Mean E2E Latency (ms): 687592.60 +Median E2E Latency (ms): 692819.14 +---------------Time to First Token---------------- +Mean TTFT (ms): 1556.18 +Median TTFT (ms): 1310.69 +P99 TTFT (ms): 4496.40 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 670.61 +Median TPOT (ms): 676.19 +P99 TPOT (ms): 780.25 +---------------Inter-token Latency---------------- +Mean ITL (ms): 671.01 +Median ITL (ms): 518.90 +P99 ITL (ms): 4551.92 +================================================== From c62b0bf12a789baed7f02145987316a2c3f4308d Mon Sep 17 00:00:00 2001 From: Rodri Date: Sun, 12 Jan 2025 13:28:55 +0000 Subject: [PATCH 12/12] (DELETED): logs for DeepSeekv3 and v0.4.1.post4 --- .../deepseek_v3_bf16_2x8xh200_log_output.txt | 142 ----------------- .../deepseek_v3_bf16_8xh200_log_output.txt | 144 ----------------- .../deepseek_v3_fp8_2x8xh200_log_output.txt | 145 ------------------ .../deepseek_v3_fp8_8xh200_log_output.txt | 144 ----------------- .../deepseek_v3_bf16_8xh200_log_output.txt | 145 ------------------ .../deepseek_v3_fp8_8xh200_log_output.txt | 145 ------------------ 6 files changed, 865 deletions(-) delete mode 100644 benchmark/benchmark_dsv3/outputs/logs/deepseek_v3_bf16_2x8xh200_log_output.txt delete mode 100644 benchmark/benchmark_dsv3/outputs/logs/deepseek_v3_bf16_8xh200_log_output.txt delete mode 100644 benchmark/benchmark_dsv3/outputs/logs/deepseek_v3_fp8_2x8xh200_log_output.txt delete mode 100644 benchmark/benchmark_dsv3/outputs/logs/deepseek_v3_fp8_8xh200_log_output.txt delete mode 100644 benchmark/benchmark_v0.4.1.post4/outputs/logs/deepseek_v3_bf16_8xh200_log_output.txt delete mode 100644 benchmark/benchmark_v0.4.1.post4/outputs/logs/deepseek_v3_fp8_8xh200_log_output.txt diff --git a/benchmark/benchmark_dsv3/outputs/logs/deepseek_v3_bf16_2x8xh200_log_output.txt b/benchmark/benchmark_dsv3/outputs/logs/deepseek_v3_bf16_2x8xh200_log_output.txt deleted file mode 100644 index a1bb54fab73..00000000000 --- a/benchmark/benchmark_dsv3/outputs/logs/deepseek_v3_bf16_2x8xh200_log_output.txt +++ /dev/null @@ -1,142 +0,0 @@ -#Input tokens: 307200 -#Output tokens: 307200 -Starting initial single prompt test run... -Initial test run completed. Starting main benchmark run... - -============ Serving Benchmark Result ============ -Backend: sglang -Traffic request rate: 1.0 -Max reqeuest concurrency: not set -Successful requests: 300 -Benchmark duration (s): 1116.85 -Total input tokens: 307200 -Total generated tokens: 307200 -Total generated tokens (retokenized): 306053 -Request throughput (req/s): 0.27 -Input token throughput (tok/s): 275.06 -Output token throughput (tok/s): 275.06 -Total token throughput (tok/s): 550.12 -----------------End-to-End Latency---------------- -Mean E2E Latency (ms): 968448.85 -Median E2E Latency (ms): 971353.97 ----------------Time to First Token---------------- -Mean TTFT (ms): 105080.04 -Median TTFT (ms): 53189.54 -P99 TTFT (ms): 251466.03 ------Time per Output Token (excl. 
1st token)------ -Mean TPOT (ms): 843.96 -Median TPOT (ms): 843.03 -P99 TPOT (ms): 1070.14 ----------------Inter-token Latency---------------- -Mean ITL (ms): 843.96 -Median ITL (ms): 638.68 -P99 ITL (ms): 708.01 -================================================== -Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=40000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=600, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=2.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_2x8xh200_BF16_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) - -#Input tokens: 614400 -#Output tokens: 614400 -Starting initial single prompt test run... -Initial test run completed. Starting main benchmark run... - -============ Serving Benchmark Result ============ -Backend: sglang -Traffic request rate: 2.0 -Max reqeuest concurrency: not set -Successful requests: 600 -Benchmark duration (s): 2395.34 -Total input tokens: 614400 -Total generated tokens: 614400 -Total generated tokens (retokenized): 612299 -Request throughput (req/s): 0.25 -Input token throughput (tok/s): 256.50 -Output token throughput (tok/s): 256.50 -Total token throughput (tok/s): 513.00 -----------------End-to-End Latency---------------- -Mean E2E Latency (ms): 2003883.86 -Median E2E Latency (ms): 2010951.23 ----------------Time to First Token---------------- -Mean TTFT (ms): 317480.50 -Median TTFT (ms): 313373.93 -P99 TTFT (ms): 628073.04 ------Time per Output Token (excl. 1st token)------ -Mean TPOT (ms): 1648.49 -Median TPOT (ms): 1622.07 -P99 TPOT (ms): 2054.30 ----------------Inter-token Latency---------------- -Mean ITL (ms): 1648.32 -Median ITL (ms): 1192.37 -P99 ITL (ms): 1525.58 -================================================== -Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=40000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=1200, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=4.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_2x8xh200_BF16_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) - -#Input tokens: 1228800 -#Output tokens: 1228800 -Starting initial single prompt test run... -Initial test run completed. Starting main benchmark run... 
- -============ Serving Benchmark Result ============ -Backend: sglang -Traffic request rate: 4.0 -Max reqeuest concurrency: not set -Successful requests: 1200 -Benchmark duration (s): 4810.40 -Total input tokens: 1228800 -Total generated tokens: 1228800 -Total generated tokens (retokenized): 1224692 -Request throughput (req/s): 0.25 -Input token throughput (tok/s): 255.45 -Output token throughput (tok/s): 255.45 -Total token throughput (tok/s): 510.89 -----------------End-to-End Latency---------------- -Mean E2E Latency (ms): 3206867.31 -Median E2E Latency (ms): 3881082.65 ----------------Time to First Token---------------- -Mean TTFT (ms): 1426498.17 -Median TTFT (ms): 774460.73 -P99 TTFT (ms): 3980643.34 ------Time per Output Token (excl. 1st token)------ -Mean TPOT (ms): 1740.34 -Median TPOT (ms): 1645.51 -P99 TPOT (ms): 3600.89 ----------------Inter-token Latency---------------- -Mean ITL (ms): 1740.23 -Median ITL (ms): 1178.42 -P99 ITL (ms): 1608.58 -================================================== -Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=40000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=2400, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=8.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_2x8xh200_BF16_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) - -#Input tokens: 2457600 -#Output tokens: 2457600 -Starting initial single prompt test run... -Initial test run completed. Starting main benchmark run... - -============ Serving Benchmark Result ============ -Backend: sglang -Traffic request rate: 8.0 -Max reqeuest concurrency: not set -Successful requests: 2400 -Benchmark duration (s): 9827.36 -Total input tokens: 2457600 -Total generated tokens: 2457600 -Total generated tokens (retokenized): 2449303 -Request throughput (req/s): 0.24 -Input token throughput (tok/s): 250.08 -Output token throughput (tok/s): 250.08 -Total token throughput (tok/s): 500.15 -----------------End-to-End Latency---------------- -Mean E2E Latency (ms): 6004940.75 -Median E2E Latency (ms): 6819185.61 ----------------Time to First Token---------------- -Mean TTFT (ms): 3356919.45 -Median TTFT (ms): 4072706.72 -P99 TTFT (ms): 7107066.15 ------Time per Output Token (excl. 
1st token)------ -Mean TPOT (ms): 2588.49 -Median TPOT (ms): 2239.22 -P99 TPOT (ms): 7387.83 ----------------Inter-token Latency---------------- -Mean ITL (ms): 2587.96 -Median ITL (ms): 1205.60 -P99 ITL (ms): 8271.60 -================================================== diff --git a/benchmark/benchmark_dsv3/outputs/logs/deepseek_v3_bf16_8xh200_log_output.txt b/benchmark/benchmark_dsv3/outputs/logs/deepseek_v3_bf16_8xh200_log_output.txt deleted file mode 100644 index 8a9873ad5d6..00000000000 --- a/benchmark/benchmark_dsv3/outputs/logs/deepseek_v3_bf16_8xh200_log_output.txt +++ /dev/null @@ -1,144 +0,0 @@ -Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=30000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=300, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=1.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_8xh200_BF16_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) - -#Input tokens: 307200 -#Output tokens: 307200 -Starting initial single prompt test run... -Initial test run completed. Starting main benchmark run... - -============ Serving Benchmark Result ============ -Backend: sglang -Traffic request rate: 1.0 -Max reqeuest concurrency: not set -Successful requests: 300 -Benchmark duration (s): 480.00 -Total input tokens: 307200 -Total generated tokens: 307200 -Total generated tokens (retokenized): 306052 -Request throughput (req/s): 0.62 -Input token throughput (tok/s): 639.99 -Output token throughput (tok/s): 639.99 -Total token throughput (tok/s): 1279.99 -----------------End-to-End Latency---------------- -Mean E2E Latency (ms): 219910.49 -Median E2E Latency (ms): 214924.09 ----------------Time to First Token---------------- -Mean TTFT (ms): 1484.08 -Median TTFT (ms): 587.15 -P99 TTFT (ms): 10167.11 ------Time per Output Token (excl. 1st token)------ -Mean TPOT (ms): 213.52 -Median TPOT (ms): 209.48 -P99 TPOT (ms): 271.65 ----------------Inter-token Latency---------------- -Mean ITL (ms): 213.52 -Median ITL (ms): 159.64 -P99 ITL (ms): 907.22 -================================================== -Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=30000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=600, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=2.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_8xh200_BF16_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) - -#Input tokens: 614400 -#Output tokens: 614400 -Starting initial single prompt test run... -Initial test run completed. Starting main benchmark run... 
- -============ Serving Benchmark Result ============ -Backend: sglang -Traffic request rate: 2.0 -Max reqeuest concurrency: not set -Successful requests: 600 -Benchmark duration (s): 467.67 -Total input tokens: 614400 -Total generated tokens: 614400 -Total generated tokens (retokenized): 612253 -Request throughput (req/s): 1.28 -Input token throughput (tok/s): 1313.74 -Output token throughput (tok/s): 1313.74 -Total token throughput (tok/s): 2627.48 -----------------End-to-End Latency---------------- -Mean E2E Latency (ms): 235341.58 -Median E2E Latency (ms): 235524.70 ----------------Time to First Token---------------- -Mean TTFT (ms): 652.11 -Median TTFT (ms): 598.77 -P99 TTFT (ms): 1338.42 ------Time per Output Token (excl. 1st token)------ -Mean TPOT (ms): 229.41 -Median TPOT (ms): 229.30 -P99 TPOT (ms): 296.47 ----------------Inter-token Latency---------------- -Mean ITL (ms): 229.42 -Median ITL (ms): 162.99 -P99 ITL (ms): 922.06 -================================================== -Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=30000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=1200, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=4.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_8xh200_BF16_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) - -#Input tokens: 1228800 -#Output tokens: 1228800 -Starting initial single prompt test run... -Initial test run completed. Starting main benchmark run... - -============ Serving Benchmark Result ============ -Backend: sglang -Traffic request rate: 4.0 -Max reqeuest concurrency: not set -Successful requests: 1200 -Benchmark duration (s): 516.68 -Total input tokens: 1228800 -Total generated tokens: 1228800 -Total generated tokens (retokenized): 1224646 -Request throughput (req/s): 2.32 -Input token throughput (tok/s): 2378.26 -Output token throughput (tok/s): 2378.26 -Total token throughput (tok/s): 4756.52 -----------------End-to-End Latency---------------- -Mean E2E Latency (ms): 321625.84 -Median E2E Latency (ms): 324438.44 ----------------Time to First Token---------------- -Mean TTFT (ms): 790.54 -Median TTFT (ms): 766.70 -P99 TTFT (ms): 1631.13 ------Time per Output Token (excl. 
1st token)------ -Mean TPOT (ms): 313.62 -Median TPOT (ms): 316.35 -P99 TPOT (ms): 404.28 ----------------Inter-token Latency---------------- -Mean ITL (ms): 313.63 -Median ITL (ms): 237.99 -P99 ITL (ms): 1125.06 -================================================== -Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=30000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=2400, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=8.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_8xh200_BF16_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) - -#Input tokens: 2457600 -#Output tokens: 2457600 -Starting initial single prompt test run... -Initial test run completed. Starting main benchmark run... - -============ Serving Benchmark Result ============ -Backend: sglang -Traffic request rate: 8.0 -Max reqeuest concurrency: not set -Successful requests: 2400 -Benchmark duration (s): 1092.74 -Total input tokens: 2457600 -Total generated tokens: 2457600 -Total generated tokens (retokenized): 2449187 -Request throughput (req/s): 2.20 -Input token throughput (tok/s): 2249.03 -Output token throughput (tok/s): 2249.03 -Total token throughput (tok/s): 4498.07 -----------------End-to-End Latency---------------- -Mean E2E Latency (ms): 654511.27 -Median E2E Latency (ms): 686261.57 ----------------Time to First Token---------------- -Mean TTFT (ms): 96306.56 -Median TTFT (ms): 1191.74 -P99 TTFT (ms): 471552.20 ------Time per Output Token (excl. 1st token)------ -Mean TPOT (ms): 545.65 -Median TPOT (ms): 516.67 -P99 TPOT (ms): 832.97 ----------------Inter-token Latency---------------- -Mean ITL (ms): 545.71 -Median ITL (ms): 255.96 -P99 ITL (ms): 4197.25 -================================================== diff --git a/benchmark/benchmark_dsv3/outputs/logs/deepseek_v3_fp8_2x8xh200_log_output.txt b/benchmark/benchmark_dsv3/outputs/logs/deepseek_v3_fp8_2x8xh200_log_output.txt deleted file mode 100644 index c7d51e7cb5d..00000000000 --- a/benchmark/benchmark_dsv3/outputs/logs/deepseek_v3_fp8_2x8xh200_log_output.txt +++ /dev/null @@ -1,145 +0,0 @@ -nohup: ignoring input -Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=40000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=300, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=1.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_2x8h200_FP8_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) - -#Input tokens: 307200 -#Output tokens: 307200 -Starting initial single prompt test run... -Initial test run completed. Starting main benchmark run... 
- -============ Serving Benchmark Result ============ -Backend: sglang -Traffic request rate: 1.0 -Max reqeuest concurrency: not set -Successful requests: 300 -Benchmark duration (s): 1131.09 -Total input tokens: 307200 -Total generated tokens: 307200 -Total generated tokens (retokenized): 306092 -Request throughput (req/s): 0.27 -Input token throughput (tok/s): 271.60 -Output token throughput (tok/s): 271.60 -Total token throughput (tok/s): 543.19 -----------------End-to-End Latency---------------- -Mean E2E Latency (ms): 982681.06 -Median E2E Latency (ms): 985610.62 ----------------Time to First Token---------------- -Mean TTFT (ms): 99781.93 -Median TTFT (ms): 56824.07 -P99 TTFT (ms): 244007.03 ------Time per Output Token (excl. 1st token)------ -Mean TPOT (ms): 863.05 -Median TPOT (ms): 862.84 -P99 TPOT (ms): 1084.94 ----------------Inter-token Latency---------------- -Mean ITL (ms): 863.05 -Median ITL (ms): 662.33 -P99 ITL (ms): 695.39 -================================================== -Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=40000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=600, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=2.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_2x8xh200_FP8_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) - -#Input tokens: 614400 -#Output tokens: 614400 -Starting initial single prompt test run... -Initial test run completed. Starting main benchmark run... - -============ Serving Benchmark Result ============ -Backend: sglang -Traffic request rate: 2.0 -Max reqeuest concurrency: not set -Successful requests: 600 -Benchmark duration (s): 2130.27 -Total input tokens: 614400 -Total generated tokens: 614400 -Total generated tokens (retokenized): 612142 -Request throughput (req/s): 0.28 -Input token throughput (tok/s): 288.41 -Output token throughput (tok/s): 288.41 -Total token throughput (tok/s): 576.83 -----------------End-to-End Latency---------------- -Mean E2E Latency (ms): 1978002.69 -Median E2E Latency (ms): 1975371.99 ----------------Time to First Token---------------- -Mean TTFT (ms): 309169.92 -Median TTFT (ms): 305318.37 -P99 TTFT (ms): 609895.40 ------Time per Output Token (excl. 
1st token)------ -Mean TPOT (ms): 1631.31 -Median TPOT (ms): 1632.35 -P99 TPOT (ms): 2057.38 ----------------Inter-token Latency---------------- -Mean ITL (ms): 1631.34 -Median ITL (ms): 1219.14 -P99 ITL (ms): 1537.46 -================================================== -Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=40000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=1200, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=4.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_2x8xh200_FP8_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) - -#Input tokens: 1228800 -#Output tokens: 1228800 -Starting initial single prompt test run... -Initial test run completed. Starting main benchmark run... - -============ Serving Benchmark Result ============ -Backend: sglang -Traffic request rate: 4.0 -Max reqeuest concurrency: not set -Successful requests: 1200 -Benchmark duration (s): 4564.80 -Total input tokens: 1228800 -Total generated tokens: 1228800 -Total generated tokens (retokenized): 1224515 -Request throughput (req/s): 0.26 -Input token throughput (tok/s): 269.19 -Output token throughput (tok/s): 269.19 -Total token throughput (tok/s): 538.38 -----------------End-to-End Latency---------------- -Mean E2E Latency (ms): 3929702.07 -Median E2E Latency (ms): 3901390.30 ----------------Time to First Token---------------- -Mean TTFT (ms): 767128.52 -Median TTFT (ms): 767082.14 -P99 TTFT (ms): 1504428.26 ------Time per Output Token (excl. 1st token)------ -Mean TPOT (ms): 3091.47 -Median TPOT (ms): 3023.99 -P99 TPOT (ms): 3886.39 ----------------Inter-token Latency---------------- -Mean ITL (ms): 3091.12 -Median ITL (ms): 2189.83 -P99 ITL (ms): 2596.82 -================================================== -Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=40000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=2400, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=8.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_2x8xh200_FP8_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) - -#Input tokens: 2457600 -#Output tokens: 2457600 -Starting initial single prompt test run... -Initial test run completed. Starting main benchmark run... 
- -============ Serving Benchmark Result ============ -Backend: sglang -Traffic request rate: 8.0 -Max reqeuest concurrency: not set -Successful requests: 2400 -Benchmark duration (s): 8880.48 -Total input tokens: 2457600 -Total generated tokens: 2457600 -Total generated tokens (retokenized): 2448836 -Request throughput (req/s): 0.27 -Input token throughput (tok/s): 276.74 -Output token throughput (tok/s): 276.74 -Total token throughput (tok/s): 553.48 -----------------End-to-End Latency---------------- -Mean E2E Latency (ms): 6079389.87 -Median E2E Latency (ms): 7374173.14 ----------------Time to First Token---------------- -Mean TTFT (ms): 2858184.95 -Median TTFT (ms): 1680440.41 -P99 TTFT (ms): 7511052.50 ------Time per Output Token (excl. 1st token)------ -Mean TPOT (ms): 3148.78 -Median TPOT (ms): 2974.87 -P99 TPOT (ms): 6686.54 ----------------Inter-token Latency---------------- -Mean ITL (ms): 3148.57 -Median ITL (ms): 2007.02 -P99 ITL (ms): 2745.71 -================================================== diff --git a/benchmark/benchmark_dsv3/outputs/logs/deepseek_v3_fp8_8xh200_log_output.txt b/benchmark/benchmark_dsv3/outputs/logs/deepseek_v3_fp8_8xh200_log_output.txt deleted file mode 100644 index a41e219d741..00000000000 --- a/benchmark/benchmark_dsv3/outputs/logs/deepseek_v3_fp8_8xh200_log_output.txt +++ /dev/null @@ -1,144 +0,0 @@ -Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=30000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=300, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=1.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_8xh200_FP8_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None) - -#Input tokens: 307200 -#Output tokens: 307200 -Starting initial single prompt test run... -Initial test run completed. Starting main benchmark run... - -============ Serving Benchmark Result ============ -Backend: sglang -Traffic request rate: 1.0 -Max reqeuest concurrency: not set -Successful requests: 300 -Benchmark duration (s): 397.34 -Total input tokens: 307200 -Total generated tokens: 307200 -Total generated tokens (retokenized): 306153 -Request throughput (req/s): 0.76 -Input token throughput (tok/s): 773.15 -Output token throughput (tok/s): 773.15 -Total token throughput (tok/s): 1546.29 -----------------End-to-End Latency---------------- -Mean E2E Latency (ms): 139395.84 -Median E2E Latency (ms): 147735.43 ----------------Time to First Token---------------- -Mean TTFT (ms): 629.85 -Median TTFT (ms): 563.41 -P99 TTFT (ms): 1184.81 ------Time per Output Token (excl. 
1st token)------
-Mean TPOT (ms): 135.65
-Median TPOT (ms): 143.71
-P99 TPOT (ms): 154.90
----------------Inter-token Latency----------------
-Mean ITL (ms): 135.65
-Median ITL (ms): 101.78
-P99 ITL (ms): 588.61
-==================================================
-Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=30000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=600, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=2.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_8xh200_FP8_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None)
-
-#Input tokens: 614400
-#Output tokens: 614400
-Starting initial single prompt test run...
-Initial test run completed. Starting main benchmark run...
-
-============ Serving Benchmark Result ============
-Backend: sglang
-Traffic request rate: 2.0
-Max reqeuest concurrency: not set
-Successful requests: 600
-Benchmark duration (s): 438.30
-Total input tokens: 614400
-Total generated tokens: 614400
-Total generated tokens (retokenized): 612306
-Request throughput (req/s): 1.37
-Input token throughput (tok/s): 1401.77
-Output token throughput (tok/s): 1401.77
-Total token throughput (tok/s): 2803.54
-----------------End-to-End Latency----------------
-Mean E2E Latency (ms): 227131.11
-Median E2E Latency (ms): 234757.13
----------------Time to First Token----------------
-Mean TTFT (ms): 742.35
-Median TTFT (ms): 684.33
-P99 TTFT (ms): 1576.72
------Time per Output Token (excl. 1st token)------
-Mean TPOT (ms): 221.30
-Median TPOT (ms): 228.78
-P99 TPOT (ms): 280.95
----------------Inter-token Latency----------------
-Mean ITL (ms): 221.30
-Median ITL (ms): 149.46
-P99 ITL (ms): 1046.25
-==================================================
-Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=30000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=1200, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=4.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_8xh200_FP8_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None)
-
-#Input tokens: 1228800
-#Output tokens: 1228800
-Starting initial single prompt test run...
-Initial test run completed. Starting main benchmark run...
-
-============ Serving Benchmark Result ============
-Backend: sglang
-Traffic request rate: 4.0
-Max reqeuest concurrency: not set
-Successful requests: 1200
-Benchmark duration (s): 554.82
-Total input tokens: 1228800
-Total generated tokens: 1228800
-Total generated tokens (retokenized): 1224403
-Request throughput (req/s): 2.16
-Input token throughput (tok/s): 2214.76
-Output token throughput (tok/s): 2214.76
-Total token throughput (tok/s): 4429.52
-----------------End-to-End Latency----------------
-Mean E2E Latency (ms): 370518.68
-Median E2E Latency (ms): 376040.67
----------------Time to First Token----------------
-Mean TTFT (ms): 881.28
-Median TTFT (ms): 865.26
-P99 TTFT (ms): 1518.00
------Time per Output Token (excl. 1st token)------
-Mean TPOT (ms): 361.33
-Median TPOT (ms): 366.48
-P99 TPOT (ms): 451.84
----------------Inter-token Latency----------------
-Mean ITL (ms): 361.33
-Median ITL (ms): 287.95
-P99 ITL (ms): 1244.53
-==================================================
-Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=30000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=2400, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=8.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_8xh200_FP8_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None)
-
-#Input tokens: 2457600
-#Output tokens: 2457600
-Starting initial single prompt test run...
-Initial test run completed. Starting main benchmark run...
-
-============ Serving Benchmark Result ============
-Backend: sglang
-Traffic request rate: 8.0
-Max reqeuest concurrency: not set
-Successful requests: 2400
-Benchmark duration (s): 858.01
-Total input tokens: 2457600
-Total generated tokens: 2457600
-Total generated tokens (retokenized): 2449246
-Request throughput (req/s): 2.80
-Input token throughput (tok/s): 2864.31
-Output token throughput (tok/s): 2864.31
-Total token throughput (tok/s): 5728.61
-----------------End-to-End Latency----------------
-Mean E2E Latency (ms): 687402.83
-Median E2E Latency (ms): 692710.83
----------------Time to First Token----------------
-Mean TTFT (ms): 1627.56
-Median TTFT (ms): 1358.77
-P99 TTFT (ms): 4392.08
------Time per Output Token (excl. 1st token)------
-Mean TPOT (ms): 670.36
-Median TPOT (ms): 675.95
-P99 TPOT (ms): 780.39
----------------Inter-token Latency----------------
-Mean ITL (ms): 670.53
-Median ITL (ms): 515.18
-P99 ITL (ms): 4618.92
-==================================================
diff --git a/benchmark/benchmark_v0.4.1.post4/outputs/logs/deepseek_v3_bf16_8xh200_log_output.txt b/benchmark/benchmark_v0.4.1.post4/outputs/logs/deepseek_v3_bf16_8xh200_log_output.txt
deleted file mode 100644
index 498e4a7b759..00000000000
--- a/benchmark/benchmark_v0.4.1.post4/outputs/logs/deepseek_v3_bf16_8xh200_log_output.txt
+++ /dev/null
@@ -1,145 +0,0 @@
-nohup: ignoring input
-Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=40000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=300, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=1.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_8xh200_BF16_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None)
-
-#Input tokens: 307200
-#Output tokens: 307200
-Starting initial single prompt test run...
-Initial test run completed. Starting main benchmark run...
-
-============ Serving Benchmark Result ============
-Backend: sglang
-Traffic request rate: 1.0
-Max reqeuest concurrency: not set
-Successful requests: 300
-Benchmark duration (s): 478.05
-Total input tokens: 307200
-Total generated tokens: 307200
-Total generated tokens (retokenized): 306099
-Request throughput (req/s): 0.63
-Input token throughput (tok/s): 642.60
-Output token throughput (tok/s): 642.60
-Total token throughput (tok/s): 1285.21
-----------------End-to-End Latency----------------
-Mean E2E Latency (ms): 215967.43
-Median E2E Latency (ms): 213466.29
----------------Time to First Token----------------
-Mean TTFT (ms): 1385.94
-Median TTFT (ms): 578.54
-P99 TTFT (ms): 9433.96
------Time per Output Token (excl. 1st token)------
-Mean TPOT (ms): 209.76
-Median TPOT (ms): 207.72
-P99 TPOT (ms): 263.40
----------------Inter-token Latency----------------
-Mean ITL (ms): 209.76
-Median ITL (ms): 157.87
-P99 ITL (ms): 898.72
-==================================================
-Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=40000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=600, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=2.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_8xh200_BF16_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None)
-
-#Input tokens: 614400
-#Output tokens: 614400
-Starting initial single prompt test run...
-Initial test run completed. Starting main benchmark run...
-
-============ Serving Benchmark Result ============
-Backend: sglang
-Traffic request rate: 2.0
-Max reqeuest concurrency: not set
-Successful requests: 600
-Benchmark duration (s): 465.57
-Total input tokens: 614400
-Total generated tokens: 614400
-Total generated tokens (retokenized): 612168
-Request throughput (req/s): 1.29
-Input token throughput (tok/s): 1319.68
-Output token throughput (tok/s): 1319.68
-Total token throughput (tok/s): 2639.36
-----------------End-to-End Latency----------------
-Mean E2E Latency (ms): 233050.15
-Median E2E Latency (ms): 233187.80
----------------Time to First Token----------------
-Mean TTFT (ms): 653.21
-Median TTFT (ms): 593.40
-P99 TTFT (ms): 1171.34
------Time per Output Token (excl. 1st token)------
-Mean TPOT (ms): 227.17
-Median TPOT (ms): 227.37
-P99 TPOT (ms): 291.99
----------------Inter-token Latency----------------
-Mean ITL (ms): 227.17
-Median ITL (ms): 160.69
-P99 ITL (ms): 925.49
-==================================================
-Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=40000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=1200, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=4.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_8xh200_BF16_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None)
-
-#Input tokens: 1228800
-#Output tokens: 1228800
-Starting initial single prompt test run...
-Initial test run completed. Starting main benchmark run...
-
-============ Serving Benchmark Result ============
-Backend: sglang
-Traffic request rate: 4.0
-Max reqeuest concurrency: not set
-Successful requests: 1200
-Benchmark duration (s): 516.75
-Total input tokens: 1228800
-Total generated tokens: 1228800
-Total generated tokens (retokenized): 1224574
-Request throughput (req/s): 2.32
-Input token throughput (tok/s): 2377.95
-Output token throughput (tok/s): 2377.95
-Total token throughput (tok/s): 4755.90
-----------------End-to-End Latency----------------
-Mean E2E Latency (ms): 322017.38
-Median E2E Latency (ms): 325442.20
----------------Time to First Token----------------
-Mean TTFT (ms): 782.61
-Median TTFT (ms): 756.71
-P99 TTFT (ms): 1651.82
------Time per Output Token (excl. 1st token)------
-Mean TPOT (ms): 314.01
-Median TPOT (ms): 317.36
-P99 TPOT (ms): 403.03
----------------Inter-token Latency----------------
-Mean ITL (ms): 314.02
-Median ITL (ms): 239.53
-P99 ITL (ms): 1122.77
-==================================================
-Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=40000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=2400, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=8.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_8xh200_BF16_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None)
-
-#Input tokens: 2457600
-#Output tokens: 2457600
-Starting initial single prompt test run...
-Initial test run completed. Starting main benchmark run...
-
-============ Serving Benchmark Result ============
-Backend: sglang
-Traffic request rate: 8.0
-Max reqeuest concurrency: not set
-Successful requests: 2400
-Benchmark duration (s): 1103.55
-Total input tokens: 2457600
-Total generated tokens: 2457600
-Total generated tokens (retokenized): 2449208
-Request throughput (req/s): 2.17
-Input token throughput (tok/s): 2226.99
-Output token throughput (tok/s): 2226.99
-Total token throughput (tok/s): 4453.98
-----------------End-to-End Latency----------------
-Mean E2E Latency (ms): 665434.00
-Median E2E Latency (ms): 697458.92
----------------Time to First Token----------------
-Mean TTFT (ms): 98361.38
-Median TTFT (ms): 1194.39
-P99 TTFT (ms): 483150.07
------Time per Output Token (excl. 1st token)------
-Mean TPOT (ms): 554.32
-Median TPOT (ms): 526.85
-P99 TPOT (ms): 844.25
----------------Inter-token Latency----------------
-Mean ITL (ms): 554.33
-Median ITL (ms): 257.79
-P99 ITL (ms): 4449.63
-==================================================
diff --git a/benchmark/benchmark_v0.4.1.post4/outputs/logs/deepseek_v3_fp8_8xh200_log_output.txt b/benchmark/benchmark_v0.4.1.post4/outputs/logs/deepseek_v3_fp8_8xh200_log_output.txt
deleted file mode 100644
index a6557793a9a..00000000000
--- a/benchmark/benchmark_v0.4.1.post4/outputs/logs/deepseek_v3_fp8_8xh200_log_output.txt
+++ /dev/null
@@ -1,145 +0,0 @@
-nohup: ignoring input
-Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=40000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=300, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=1.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_8xh200_BF16_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None)
-
-#Input tokens: 307200
-#Output tokens: 307200
-Starting initial single prompt test run...
-Initial test run completed. Starting main benchmark run...
-
-============ Serving Benchmark Result ============
-Backend: sglang
-Traffic request rate: 1.0
-Max reqeuest concurrency: not set
-Successful requests: 300
-Benchmark duration (s): 397.92
-Total input tokens: 307200
-Total generated tokens: 307200
-Total generated tokens (retokenized): 306152
-Request throughput (req/s): 0.75
-Input token throughput (tok/s): 772.01
-Output token throughput (tok/s): 772.01
-Total token throughput (tok/s): 1544.02
-----------------End-to-End Latency----------------
-Mean E2E Latency (ms): 146462.10
-Median E2E Latency (ms): 157347.12
----------------Time to First Token----------------
-Mean TTFT (ms): 1473.33
-Median TTFT (ms): 586.83
-P99 TTFT (ms): 11050.65
------Time per Output Token (excl. 1st token)------
-Mean TPOT (ms): 141.73
-Median TPOT (ms): 153.11
-P99 TPOT (ms): 162.82
----------------Inter-token Latency----------------
-Mean ITL (ms): 141.73
-Median ITL (ms): 103.52
-P99 ITL (ms): 596.10
-==================================================
-Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=40000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=600, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=2.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_8xh200_BF16_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None)
-
-#Input tokens: 614400
-#Output tokens: 614400
-Starting initial single prompt test run...
-Initial test run completed. Starting main benchmark run...
-
-============ Serving Benchmark Result ============
-Backend: sglang
-Traffic request rate: 2.0
-Max reqeuest concurrency: not set
-Successful requests: 600
-Benchmark duration (s): 438.70
-Total input tokens: 614400
-Total generated tokens: 614400
-Total generated tokens (retokenized): 612336
-Request throughput (req/s): 1.37
-Input token throughput (tok/s): 1400.49
-Output token throughput (tok/s): 1400.49
-Total token throughput (tok/s): 2800.98
-----------------End-to-End Latency----------------
-Mean E2E Latency (ms): 227856.17
-Median E2E Latency (ms): 235231.81
----------------Time to First Token----------------
-Mean TTFT (ms): 766.49
-Median TTFT (ms): 695.87
-P99 TTFT (ms): 1764.03
------Time per Output Token (excl. 1st token)------
-Mean TPOT (ms): 221.98
-Median TPOT (ms): 228.72
-P99 TPOT (ms): 282.76
----------------Inter-token Latency----------------
-Mean ITL (ms): 221.99
-Median ITL (ms): 150.11
-P99 ITL (ms): 1054.09
-==================================================
-Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=40000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=1200, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=4.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_8xh200_BF16_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None)
-
-#Input tokens: 1228800
-#Output tokens: 1228800
-Starting initial single prompt test run...
-Initial test run completed. Starting main benchmark run...
-
-============ Serving Benchmark Result ============
-Backend: sglang
-Traffic request rate: 4.0
-Max reqeuest concurrency: not set
-Successful requests: 1200
-Benchmark duration (s): 555.83
-Total input tokens: 1228800
-Total generated tokens: 1228800
-Total generated tokens (retokenized): 1224553
-Request throughput (req/s): 2.16
-Input token throughput (tok/s): 2210.76
-Output token throughput (tok/s): 2210.76
-Total token throughput (tok/s): 4421.52
-----------------End-to-End Latency----------------
-Mean E2E Latency (ms): 371743.37
-Median E2E Latency (ms): 376584.64
----------------Time to First Token----------------
-Mean TTFT (ms): 885.85
-Median TTFT (ms): 870.93
-P99 TTFT (ms): 1826.65
------Time per Output Token (excl. 1st token)------
-Mean TPOT (ms): 362.52
-Median TPOT (ms): 367.10
-P99 TPOT (ms): 453.56
----------------Inter-token Latency----------------
-Mean ITL (ms): 362.52
-Median ITL (ms): 290.58
-P99 ITL (ms): 1248.61
-==================================================
-Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=40000, dataset_name='random', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=2400, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=1.0, request_rate=8.0, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file='deepseek_v3_8xh200_BF16_online_output.jsonl', disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None)
-
-#Input tokens: 2457600
-#Output tokens: 2457600
-Starting initial single prompt test run...
-Initial test run completed. Starting main benchmark run...
-
-============ Serving Benchmark Result ============
-Backend: sglang
-Traffic request rate: 8.0
-Max reqeuest concurrency: not set
-Successful requests: 2400
-Benchmark duration (s): 859.04
-Total input tokens: 2457600
-Total generated tokens: 2457600
-Total generated tokens (retokenized): 2449208
-Request throughput (req/s): 2.79
-Input token throughput (tok/s): 2860.87
-Output token throughput (tok/s): 2860.87
-Total token throughput (tok/s): 5721.75
-----------------End-to-End Latency----------------
-Mean E2E Latency (ms): 687592.60
-Median E2E Latency (ms): 692819.14
----------------Time to First Token----------------
-Mean TTFT (ms): 1556.18
-Median TTFT (ms): 1310.69
-P99 TTFT (ms): 4496.40
------Time per Output Token (excl. 1st token)------
-Mean TPOT (ms): 670.61
-Median TPOT (ms): 676.19
-P99 TPOT (ms): 780.25
----------------Inter-token Latency----------------
-Mean ITL (ms): 671.01
-Median ITL (ms): 518.90
-P99 ITL (ms): 4551.92
-==================================================
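For reference, the two deleted `*_log_output.txt` files above correspond to a sweep of `sglang.bench_serving` over request rates of 1, 2, 4, and 8 req/s, with `--num-prompts` scaled alongside the rate (300, 600, 1200, 2400). The driver script itself is not part of this patch, so the loop below is only a sketch reconstructed from the `Namespace(...)` lines in the logs; the port and output file name are taken from those lines and may differ in other setups.

```bash
# Hypothetical reconstruction of the sweep that generated the deleted logs;
# the actual driver script is not included in this patch. Parameter values
# are read off the Namespace(...) lines (random 1024/1024 tokens, port 40000).
for rate in 1 2 4 8; do
  python3 -m sglang.bench_serving \
    --backend sglang \
    --host 0.0.0.0 --port 40000 \
    --dataset-name random \
    --random-input 1024 --random-output 1024 --random-range-ratio 1 \
    --request-rate "$rate" \
    --num-prompts $((300 * rate)) \
    --output-file deepseek_v3_8xh200_BF16_online_output.jsonl
done
```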