Issue Description
In TGI version 3.0.1, the automatic calculation of sequence length and token limits produces values that far exceed practical GPU memory constraints, resulting in out-of-memory (OOM) errors even for input lengths well below the advertised maximum.
Steps to Reproduce
Start the TGI server:
docker run -d --name tgi_server --gpus all -p 8080:8080 -v .:/data ghcr.io/huggingface/text-generation-inference:3.0.1 --model-id /data/llama-3.2-3B-instruct-awq --quantize awq
Automatic estimations reported by the launcher:
Maximum input tokens defaulted to 96968
Maximum total tokens defaulted to 96969
Setting max batch total tokens to 96969
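For comparison, the limits can also be pinned explicitly so the launcher does not rely on its automatic estimation. A minimal sketch, assuming the standard launcher flags and illustrative values sized by hand for a 16 GB card (not tuned or verified):
docker run -d --name tgi_server --gpus all -p 8080:8080 -v .:/data \
  ghcr.io/huggingface/text-generation-inference:3.0.1 \
  --model-id /data/llama-3.2-3B-instruct-awq --quantize awq \
  --max-input-tokens 16384 --max-total-tokens 16512 \
  --max-batch-prefill-tokens 16384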
Run the benchmark with a lower input length
Using the benchmark with an input length of 20000, which is significantly lower than 96968:
text-generation-benchmark --tokenizer-name /Llama-3.2-3B-Instruct/ --sequence-length 20000 --decode-length 2
This results in a CUDA OOM error:
ERROR prefill{id=0 size=1}:prefill{id=0 size=1}: text_generation_client: backends/client/src/lib.rs:46: Server error: CUDA out of memory. Tried to allocate 312.00 MiB. GPU 0 has a total capacity of 16.00 GiB of which 0 bytes is free. Of the allocated memory 15.45 GiB is allocated by PyTorch, with 22.33 MiB allocated in private pools (e.g., CUDA Graphs), and 413.68 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
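The error message itself suggests trying the expandable_segments allocator option. Passing it through Docker is just a matter of setting an environment variable for the container, as sketched below; this may reduce fragmentation but presumably does not address the over-estimated limits:
docker run -d --name tgi_server --gpus all -p 8080:8080 -v .:/data \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  ghcr.io/huggingface/text-generation-inference:3.0.1 \
  --model-id /data/llama-3.2-3B-instruct-awq --quantize awq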
Observations compared to version 2.2.0
Automatic token limits in v3:
Maximum batch total tokens: ~96k
Memory usage: ~13 GB for input length 4096
Comparative values in v2.2.0:
Maximum batch total tokens: ~61k
Memory usage: ~10 GB for input length 4096
Inconsistencies in v3:
Automatically calculated token limits are unrealistically high.
Higher memory consumption for the same input lengths compared to v2.2.0.
These issues likely contribute to the slower processing times observed in version 3.0.1 for larger inputs, as discussed in #2896 (see the rough KV-cache estimate below).
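As a rough sanity check on the v3 default, the KV-cache footprint alone can be estimated from the published Llama-3.2-3B configuration (28 layers, 8 KV heads, head dimension 128, fp16 cache). The config values and the calculation are my own assumptions, not anything reported by TGI:
# back-of-envelope KV-cache size for the auto-computed limit
LAYERS=28; KV_HEADS=8; HEAD_DIM=128; BYTES_PER_ELEM=2   # assumed Llama-3.2-3B config, fp16 cache
TOKENS=96969                                            # auto-computed max batch total tokens
# factor of 2 covers both K and V; result printed in MiB
echo $(( 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM * TOKENS / 1024 / 1024 )) MiB
# prints roughly 10600 MiB (~10.4 GiB) for the KV cache alone, before model
# weights, prefill activations and allocator overhead on a 16 GiB card
By the same estimate, the v2.2.0 default of ~61k tokens would need roughly 6.5 GiB, which leaves far more headroom on this GPU.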
Expected behavior
Input lengths below the automatically calculated maximum input tokens should not result in out-of-memory errors.
Token limit calculations should align more closely with practical GPU memory constraints to avoid unrealistic values.
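For reference, the limits the server actually settles on can be read back from its /info endpoint once it is up (exact field names may vary slightly between versions):
curl -s http://localhost:8080/info
# returns the model id together with the effective max_input_tokens,
# max_total_tokens and max_batch_total_tokens values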
System Info
TGI versions 3.0.1 and 2.2.0, official docker images.
Windows 11.
GPU: NVIDIA GeForce RTX 4060 Ti, 16 GB memory; NVIDIA-SMI 565.77.01, driver version 566.36, CUDA version 12.7
Docker version: 27.4.0
Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.79.0
Commit sha: db7e043