Automatic Calculation of Sequence Length in TGI v3 Leads to Unrealistic Values Before CUDA OOM #2897

biba10 opened this issue on Jan 10, 2025

System Info

TGI versions 3.0.1 and 2.2.0, official Docker images.
Host OS: Windows 11.
GPU: NVIDIA GeForce RTX 4060 Ti, 16 GB memory; NVIDIA-SMI 565.77.01, Driver Version 566.36, CUDA Version 12.7.
Docker version: 27.4.0

Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.79.0
Commit sha: db7e043

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Issue Description

In TGI version 3.0.1, the automatic calculation of sequence length and token limits produces values that far exceed what the GPU memory can actually hold, so out-of-memory (OOM) errors occur even for input lengths well below the advertised maximum.


Steps to Reproduce

  1. Start the TGI server:
     docker run -d --name tgi_server --gpus all -p 8080:8080 -v .:/data ghcr.io/huggingface/text-generation-inference:3.0.1 --model-id /data/llama-3.2-3B-instruct-awq --quantize awq

  2. Note the automatic estimations in the startup logs:
     • Maximum input tokens defaulted to 96968
     • Maximum total tokens defaulted to 96969
     • Setting max batch total tokens to 96969

  3. Run the benchmark with an input length of 20000, significantly lower than the advertised 96968:
     text-generation-benchmark --tokenizer-name /Llama-3.2-3B-Instruct/ --sequence-length 20000 --decode-length 2

  4. The prefill fails with a CUDA OOM error (a workaround sketch with explicit limits follows these steps):
     ERROR prefill{id=0 size=1}:prefill{id=0 size=1}: text_generation_client: backends/client/src/lib.rs:46: Server error: CUDA out of memory. Tried to allocate 312.00 MiB. GPU 0 has a total capacity of 16.00 GiB of which 0 bytes is free. Of the allocated memory 15.45 GiB is allocated by PyTorch, with 22.33 MiB allocated in private pools (e.g., CUDA Graphs), and 413.68 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
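A workaround sketch, not a fix for the estimator itself: pinning the limits manually should keep the server inside the 16 GiB budget. The flags below are standard text-generation-launcher arguments; the 8192/8193 values are illustrative examples for this card, not tuned recommendations.

     # Same container as step 1, but with explicit limits instead of the automatic estimate
     docker run -d --name tgi_server --gpus all -p 8080:8080 -v .:/data \
       ghcr.io/huggingface/text-generation-inference:3.0.1 \
       --model-id /data/llama-3.2-3B-instruct-awq --quantize awq \
       --max-input-tokens 8192 --max-total-tokens 8193 --max-batch-prefill-tokens 8192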

Observations compared to version 2.2.0

  • Automatic token limits in v3:
    • Maximum batch total tokens: ~96k
    • Memory usage: ~13 GB for input length 4096
  • Comparative values in v2.2.0:
    • Maximum batch total tokens: ~61k
    • Memory usage: ~10 GB for input length 4096
  • Inconsistencies in v3:
    • The automatically calculated token limits are unrealistically high for a 16 GiB GPU (a rough KV-cache estimate follows this list).
    • Memory consumption is higher than in v2.2.0 for the same input lengths.
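For scale, a back-of-the-envelope KV-cache estimate suggests the ~96k budget leaves almost no headroom on this card. Assuming the published Llama-3.2-3B configuration (28 layers, 8 KV heads, head dimension 128) and an fp16 KV cache (figures taken from the model card, not from TGI's output), the cache alone for 96,969 tokens comes to roughly 10.4 GiB, before AWQ weights, prefill activations, and CUDA graphs are accounted for:

     # 2 (K and V) * 28 layers * 8 KV heads * 128 head_dim * 2 bytes (fp16) = 114,688 bytes per token
     python3 -c 'print(2 * 28 * 8 * 128 * 2 * 96969 / 2**30)'   # ≈ 10.36 GiB for the KV cache alone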

These issues likely contribute to the slower processing times observed in version 3.0.1 for larger inputs, as discussed in #2896.

Expected behavior

  • Input lengths below the automatically calculated maximum input tokens should not result in out-of-memory errors.
  • Token limit calculations should align more closely with practical GPU memory constraints to avoid unrealistic values.
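For reference, the limits the server actually settled on can be read back from TGI's /info endpoint, which may help when comparing versions; the exact field names vary between releases, so treat the ones mentioned below as an assumption to verify:

     # Inspect the effective limits the launcher chose (e.g. max_input_tokens, max_total_tokens, max_batch_total_tokens)
     curl -s http://localhost:8080/info | python3 -m json.tool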