Issue Description
In TGI version 3.0.1, the automatic calculation of sequence length and token limits produces values that far exceed practical GPU memory constraints, resulting in out-of-memory (OOM) errors even for input lengths well below the advertised maximum.
Steps to Reproduce
Start the TGI server:
docker run -d --name tgi_server --gpus all -p 8080:8080 -v .:/data ghcr.io/huggingface/text-generation-inference:3.0.1 --model-id /data/llama-3.2-3B-instruct-awq --quantize awq
Automatic estimations reported by the launcher:
Maximum input tokens defaulted to 96968
Maximum total tokens defaulted to 96969
Setting max batch total tokens to 96969
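For comparison, the limits can also be pinned explicitly so the launcher does not rely on its automatic estimation. A minimal sketch, assuming the standard launcher flags and illustrative values sized by hand for a 16 GB card (not tuned or verified):
docker run -d --name tgi_server --gpus all -p 8080:8080 -v .:/data \
  ghcr.io/huggingface/text-generation-inference:3.0.1 \
  --model-id /data/llama-3.2-3B-instruct-awq --quantize awq \
  --max-input-tokens 16384 --max-total-tokens 16512 \
  --max-batch-prefill-tokens 16384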
Run the benchmark with a lower input length
Using the benchmark with an input length of 20000, which is significantly lower than 96968:
text-generation-benchmark --tokenizer-name /Llama-3.2-3B-Instruct/ --sequence-length 20000 --decode-length 2
This results in a CUDA OOM error:
ERROR prefill{id=0 size=1}:prefill{id=0 size=1}: text_generation_client: backends/client/src/lib.rs:46: Server error: CUDA out of memory. Tried to allocate 312.00 MiB. GPU 0 has a total capacity of 16.00 GiB of which 0 bytes is free. Of the allocated memory 15.45 GiB is allocated by PyTorch, with 22.33 MiB allocated in private pools (e.g., CUDA Graphs), and 413.68 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
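The error message itself suggests trying the expandable_segments allocator option. Passing it through Docker is just a matter of setting an environment variable for the container, as sketched below; this may reduce fragmentation but presumably does not address the over-estimated limits:
docker run -d --name tgi_server --gpus all -p 8080:8080 -v .:/data \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  ghcr.io/huggingface/text-generation-inference:3.0.1 \
  --model-id /data/llama-3.2-3B-instruct-awq --quantize awq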
Observations compared to version 2.2.0
Automatic token limits in v3:
Maximum batch total tokens: ~96k
Memory usage: ~13 GB for input length 4096
Comparative values in v2.2.0:
Maximum batch total tokens: ~61k
Memory usage: ~10 GB for input length 4096
Inconsistencies in v3:
Automatically calculated token limits are unrealistically high.
Higher memory consumption for the same input lengths compared to v2.2.0.
These issues likely contribute to the slower processing times observed in version 3.0.1 for larger inputs, as discussed in #2896 (see the rough KV-cache estimate below).
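As a rough sanity check on the v3 default, the KV-cache footprint alone can be estimated from the published Llama-3.2-3B configuration (28 layers, 8 KV heads, head dimension 128, fp16 cache). The config values and the calculation are my own assumptions, not anything reported by TGI:
# back-of-envelope KV-cache size for the auto-computed limit
LAYERS=28; KV_HEADS=8; HEAD_DIM=128; BYTES_PER_ELEM=2   # assumed Llama-3.2-3B config, fp16 cache
TOKENS=96969                                            # auto-computed max batch total tokens
# factor of 2 covers both K and V; result printed in MiB
echo $(( 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM * TOKENS / 1024 / 1024 )) MiB
# prints roughly 10600 MiB (~10.4 GiB) for the KV cache alone, before model
# weights, prefill activations and allocator overhead on a 16 GiB card
By the same estimate, the v2.2.0 default of ~61k tokens would need roughly 6.5 GiB, which leaves far more headroom on this GPU.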
Expected behavior
Input lengths below the automatically calculated maximum input tokens should not result in out-of-memory errors.
Token limit calculations should align more closely with practical GPU memory constraints to avoid unrealistic values.
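For reference, the limits the server actually settles on can be read back from its /info endpoint once it is up (exact field names may vary slightly between versions):
curl -s http://localhost:8080/info
# returns the model id together with the effective max_input_tokens,
# max_total_tokens and max_batch_total_tokens values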
System Info
TGI versions 3.0.1 and 2.2.0, official docker images.
Windows 11.
GPU: NVIDIA GeForce RTX 4060 Ti, 16 GB memory; NVIDIA-SMI 565.77.01, driver version 566.36, CUDA version 12.7
Docker version: 27.4.0
Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.79.0
Commit sha: db7e043