Prefill operation can be significantly slower in TGI v3 vs TGI v2 #2896

Open · 2 of 4 tasks
biba10 opened this issue Jan 10, 2025 · 0 comments

System Info

TGI versions 3.0.1 and 2.2.0, official Docker images.
Windows 11.
GPU: NVIDIA GeForce RTX 4060 Ti, 16 GB memory. NVIDIA-SMI 565.77.01, Driver Version 566.36, CUDA Version 12.7.
Docker version: 27.4.0

Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.79.0
Commit sha: db7e043

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Text-Generation Benchmark with LLaMA 3.2

This benchmark evaluates LLaMA 3.2 performance with several quantization methods as well as without quantization. The focus is on the prefill operation, with results compared between TGI versions 2.2.0 and 3.0.1.


Example Docker Command

To run the benchmark using Docker:
docker run -d --name tgi_server --gpus all -p 8080:8080 -v .:/data ghcr.io/huggingface/text-generation-inference:2.2.0 --model-id /data/llama-3.2-3B-instruct-awq --max-total-tokens=15256 --max-input-length=15192 --max-batch-prefill-tokens=15192 --quantize awq
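
The 3.0.1 runs use the same invocation with only the image tag changed; a sketch, assuming both versions accept these flags identically (newer versions also accept --max-input-tokens in place of the older --max-input-length):

docker run -d --name tgi_server --gpus all -p 8080:8080 -v .:/data ghcr.io/huggingface/text-generation-inference:3.0.1 --model-id /data/llama-3.2-3B-instruct-awq --max-total-tokens=15256 --max-input-length=15192 --max-batch-prefill-tokens=15192 --quantize awq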

Benchmark Details

  • Command used:
    text-generation-benchmark --tokenizer-name /Llama-3.2-3B-Instruct/ --sequence-length 15000 --decode-length 2
  • Sequence Lengths Tested: 4096, 8192, 15000
  • Batch Sizes: 1 and 2 (batch size 4 tested only in specific cases); a looped sweep covering these settings is sketched after this list
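
To cover all cells of the tables below in one pass, the command above can be wrapped in a loop. A minimal sketch, not taken from the original report; the --batch-size (repeatable) and --runs flags are assumed from the benchmark tool's help output:

  for seq in 4096 8192 15000; do
    text-generation-benchmark \
      --tokenizer-name /Llama-3.2-3B-Instruct/ \
      --sequence-length "$seq" \
      --decode-length 2 \
      --batch-size 1 --batch-size 2 \
      --runs 10
  done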

Results: Input Length 4096

Average prefill time (ms), mean over 10 runs; each cell shows batch size 1/2/4:

Quantization       | 2.2.0            | 3.0.1
None               | 672/1277/2543    | 661/1326/17620
bitsandbytes-nf4   | 697/1302/2562    | 684/1328/5905
eetq               | 705/1759/2573    | 713/1750/39690
AWQ                | 667/1302/2506    | 679/1354/21214

Observations from the Table and from Larger Input Lengths

  1. Batch Size 4 Anomalies:
  • Significant increase in prefill time with batch size 4 in version 3.0.1 compared to 2.2.0.
  • Example: with eetq quantization, the time increased from 2,573 ms in 2.2.0 to 39,690 ms in 3.0.1.
  2. AWQ Quantization at Input Length 8192:
  • Batch Size 1: similar times (~1,450 ms) across versions.
  • Batch Size 2:
    • Version 2.2.0: 2,806 ms
    • Version 3.0.1: 18,400 ms
  3. AWQ Quantization at Input Length 15000:
  • Batch Size 1:
    • Version 2.2.0: 2,964 ms
    • Version 3.0.1: 12,714 ms
  • Batch Size 2:
    • Not measured for version 3.0.1.
    • Version 2.2.0: 5,612 ms, which is still well below the 12,714 ms measured at batch size 1 in 3.0.1.

Decode Times

  • Decode performance for 256 tokens is similar across both versions; larger values were not tested.
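
A decode-focused run would use the same tool with a longer decode length; a sketch only, since the exact parameters of that check are not stated in the report (the sequence length below is a placeholder):

  text-generation-benchmark --tokenizer-name /Llama-3.2-3B-Instruct/ --sequence-length 4096 --decode-length 256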

Possible GPU Memory Considerations

  • For AWQ quantization with an input length of 4096 and batch size 1, TGI version 2.2.0 requires approximately 10 GB of GPU memory, whereas version 3.0.1 requires about 13 GB.
  • As batch size or input length grows, version 3.0.1 may exhaust the available 16 GB of GPU memory sooner; the resulting memory pressure and resource contention could explain the slower processing times observed in 3.0.1.
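
One way to verify the memory figures above is to poll usage once per second while the container runs; a standard nvidia-smi invocation, not part of the original measurements:

  nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1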

Expected behavior

I would expect at least the same performance between TGI versions 3.0.1 and 2.2.0. Specifically:

  • Comparable processing times for the same input length and batch size across both versions.
  • No significant increase in GPU memory usage in version 3.0.1 compared to version 2.2.0 for identical configurations.
  • Consistent scalability as input length or batch size increases, without unexpected slowdowns or performance regressions in the newer version.