Prefill operation can be significantly slower in TGI v3 vs TGI v2 #2896

Open · 2 of 4 tasks
biba10 opened this issue Jan 10, 2025 · 0 comments

System Info

TGI versions 3.0.1 and 2.2.0, official Docker images.
Windows 11.
GPU: NVIDIA GeForce RTX 4060 Ti, 16 GB memory. NVIDIA-SMI 565.77.01, Driver Version 566.36, CUDA Version 12.7.
Docker version: 27.4.0

Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.79.0
Commit sha: db7e043

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Text-Generation Benchmark with LLaMA 3.2

This benchmark evaluates LLaMA 3.2 performance with several quantization methods as well as without quantization. The focus is on the prefill operation, with results compared between TGI versions 2.2.0 and 3.0.1.


Example Docker Command

To run the benchmark using Docker:
docker run -d --name tgi_server --gpus all -p 8080:8080 -v .:/data ghcr.io/huggingface/text-generation-inference:2.2.0 --model-id /data/llama-3.2-3B-instruct-awq --max-total-tokens=15256 --max-input-length=15192 --max-batch-prefill-tokens=15192 --quantize awq
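
The 3.0.1 runs use the same invocation with only the image tag changed; a sketch, assuming both versions accept these flags identically (newer versions also accept --max-input-tokens in place of the older --max-input-length):

docker run -d --name tgi_server --gpus all -p 8080:8080 -v .:/data ghcr.io/huggingface/text-generation-inference:3.0.1 --model-id /data/llama-3.2-3B-instruct-awq --max-total-tokens=15256 --max-input-length=15192 --max-batch-prefill-tokens=15192 --quantize awq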

Benchmark Details

  • Command used:
    text-generation-benchmark --tokenizer-name /Llama-3.2-3B-Instruct/ --sequence-length 15000 --decode-length 2
  • Sequence Lengths Tested: 4096, 8192, 15000
  • Batch Sizes: 1 and 2 (batch size 4 tested only in specific cases); a looped sweep covering these settings is sketched after this list
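
To cover all cells of the tables below in one pass, the command above can be wrapped in a loop. A minimal sketch, not taken from the original report; the --batch-size (repeatable) and --runs flags are assumed from the benchmark tool's help output:

  for seq in 4096 8192 15000; do
    text-generation-benchmark \
      --tokenizer-name /Llama-3.2-3B-Instruct/ \
      --sequence-length "$seq" \
      --decode-length 2 \
      --batch-size 1 --batch-size 2 \
      --runs 10
  done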

Results: Input Length 4096

Average prefill time (ms), mean over 10 runs; each cell shows batch size 1/2/4:

Quantization       | 2.2.0            | 3.0.1
None               | 672/1277/2543    | 661/1326/17620
bitsandbytes-nf4   | 697/1302/2562    | 684/1328/5905
eetq               | 705/1759/2573    | 713/1750/39690
AWQ                | 667/1302/2506    | 679/1354/21214

Observations from the Table and from Larger Input Lengths

  1. Batch Size 4 Anomalies:
  • Significant increase in prefill time with batch size 4 in version 3.0.1 compared to 2.2.0.
  • Example: with eetq quantization, the time increased from 2,573 ms in 2.2.0 to 39,690 ms in 3.0.1.
  2. AWQ Quantization at Input Length 8192:
  • Batch Size 1: similar times (~1,450 ms) across versions.
  • Batch Size 2:
    • Version 2.2.0: 2,806 ms
    • Version 3.0.1: 18,400 ms
  3. AWQ Quantization at Input Length 15000:
  • Batch Size 1:
    • Version 2.2.0: 2,964 ms
    • Version 3.0.1: 12,714 ms
  • Batch Size 2:
    • Not measured for version 3.0.1.
    • Version 2.2.0: 5,612 ms, which is still well below the 12,714 ms measured at batch size 1 in 3.0.1.

Decode Times

  • Decode performance for 256 tokens is similar across both versions; larger values were not tested.
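
A decode-focused run would use the same tool with a longer decode length; a sketch only, since the exact parameters of that check are not stated in the report (the sequence length below is a placeholder):

  text-generation-benchmark --tokenizer-name /Llama-3.2-3B-Instruct/ --sequence-length 4096 --decode-length 256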

Possible GPU Memory Considerations

  • For AWQ quantization with an input length of 4096 and batch size 1, TGI version 2.2.0 requires approximately 10 GB of GPU memory, whereas version 3.0.1 requires about 13 GB.
  • As batch size or input length grows, version 3.0.1 may exhaust the available 16 GB of GPU memory sooner; the resulting memory pressure and resource contention could explain the slower processing times observed in 3.0.1.
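
One way to verify the memory figures above is to poll usage once per second while the container runs; a standard nvidia-smi invocation, not part of the original measurements:

  nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1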

Expected behavior

I would expect at least the same performance between TGI versions 3.0.1 and 2.2.0. Specifically:

  • Comparable processing times for the same input length and batch size across both versions.
  • No significant increase in GPU memory usage in version 3.0.1 compared to version 2.2.0 for identical configurations.
  • Consistent scalability as input length or batch size increases, without unexpected slowdowns or performance regressions in the newer version.