This benchmark evaluates the performance of LLaMA 3.2 using various quantization methods and no quantization. The focus is on the prefill operation, with results compared across TGI versions 2.2.0 and 3.0.1.
Example Docker Command
To run the benchmark using Docker:
docker run -d --name tgi_server --gpus all -p 8080:8080 -v .:/data ghcr.io/huggingface/text-generation-inference:2.2.0 --model-id /data/llama-3.2-3B-instruct-awq --max-total-tokens=15256 --max-input-length=15192 --max-batch-prefill-tokens=15192 --quantize awq
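For the 3.0.1 runs the launch is presumably identical apart from the image tag; this is a sketch with the model path, token limits, and flag names assumed to carry over unchanged (for the non-AWQ rows in the tables, --model-id would point at unquantized weights and --quantize would be set to the matching value, e.g. eetq or bitsandbytes-nf4, or omitted entirely):
docker run -d --name tgi_server --gpus all -p 8080:8080 -v .:/data ghcr.io/huggingface/text-generation-inference:3.0.1 --model-id /data/llama-3.2-3B-instruct-awq --max-total-tokens=15256 --max-input-length=15192 --max-batch-prefill-tokens=15192 --quantize awq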
Batch Sizes: 1 and 2 (batch size 4 tested only in specific cases)
Results: Input Length 4096
Average Prefill Operation Time (ms) for 10 runs:
| Quantization | 2.2.0 (batch size 1 / 2 / 4) | 3.0.1 (batch size 1 / 2 / 4) |
| --- | --- | --- |
| None | 672 / 1277 / 2543 | 661 / 1326 / 17620 |
| bitsandbytes-nf4 | 697 / 1302 / 2562 | 684 / 1328 / 5905 |
| eetq | 705 / 1759 / 2573 | 713 / 1750 / 39690 |
| AWQ | 667 / 1302 / 2506 | 679 / 1354 / 21214 |
Observations from the Table and from Larger Input Lengths
Batch Size 4 Anomalies:
Significant increase in prefill time with batch size 4 in version 3.0.1 compared to 2.2.0.
Example: With eetq quantization, the time increased from 2,573 ms in 2.2.0 to 39,690 ms in 3.0.1.
AWQ Quantization at Input Length 8192:
Batch Size 1: Similar times (~1,450 ms) across versions.
Batch Size 2:
Version 2.2.0: 2,806 ms
Version 3.0.1: 18,400 ms
AWQ Quantization at Input Length 15000:
Batch Size 1:
Version 2.2.0: 2,964 ms
Version 3.0.1: 12,714 ms
Batch Size 2:
Not measured for version 3.0.1.
Version 2.2.0: 5,612 ms, which is still well below the 12,714 ms that version 3.0.1 needs at batch size 1.
Decode Times
The decode operation performs similarly across versions for a decode length of 256; larger values were not tested.
Possible GPU Memory Considerations
For AWQ quantization with an input length of 4096 and batch size 1, TGI version 2.2.0 requires approximately 10 GB of GPU memory, whereas version 3.0.1 requires about 13 GB.
As batch size or input length increases, version 3.0.1 may exhaust the available 16 GB of GPU memory faster. This could explain the slower processing times observed in version 3.0.1, as memory overhead and resource contention may affect performance.
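One way to verify these memory figures is to sample GPU usage while a benchmark run is in progress; a minimal sketch using nvidia-smi (the one-second sampling interval is chosen here just for illustration):
nvidia-smi --query-gpu=timestamp,memory.used,memory.total --format=csv -l 1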
Expected behavior
I would expect TGI 3.0.1 to perform at least as well as 2.2.0. Specifically:
Comparable processing times for the same input length and batch size across both versions.
No significant increase in GPU memory usage in version 3.0.1 compared to version 2.2.0 for identical configurations.
Consistent scalability as input length or batch size increases, without unexpected slowdowns or performance regressions in the newer version.
System Info
TGI versions 3.0.1 and 2.2.0, official docker images.
Windows 11.
GPU: NVIDIA GeForce RTX 4060 Ti, 16 GB memory. NVIDIA-SMI 565.77.01, Driver Version 566.36, CUDA Version 12.7.
Docker version: 27.4.0
Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.79.0
Commit sha: db7e043
Information
Tasks
Reproduction
Text-Generation Benchmark with LLaMA 3.2
Benchmark Details
text-generation-benchmark --tokenizer-name /Llama-3.2-3B-Instruct/ --sequence-length 15000 --decode-length 2
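To sweep the batch sizes for a given input length (for example the 4096-token runs in the table above), the benchmark's --batch-size and --runs options can be used. This is a sketch assuming those options are available in both images (check text-generation-benchmark --help); the tool is run inside the serving container, e.g. via docker exec -it tgi_server:
text-generation-benchmark --tokenizer-name /Llama-3.2-3B-Instruct/ --sequence-length 4096 --decode-length 2 --batch-size 1 --batch-size 2 --batch-size 4 --runs 10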