
TGI drops requests when 150 requests are sent continuously at a rate of 5 requests per second on 8 x AMD MI300X with Llama 3.1 405B #2635

Bihan opened this issue Oct 11, 2024 · 0 comments


System Info

TGI Docker Image: ghcr.io/huggingface/text-generation-inference:sha-11d7af7-rocm
MODEL: meta-llama/Llama-3.1-405B-Instruct

Hardware used:
Intel® Xeon® Platinum 8470 2G, 52C/104T, 16GT/s, 105M Cache, Turbo, HT (350W) [x2]
AMD MI300X GPU OAM 192GB 750W GPUs [x8]
64GB RDIMM, 4800MT/s Dual Rank [x32]

Hardware provided by: hotaisle

Deployed using: dstack

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Steps to reproduce (a consolidated command sketch follows these steps)

  1. Provision the machine with the Docker image mentioned above.
  2. Run: text-generation-launcher --port 8000 --num-shard 8 --sharded true --max-concurrent-requests 8192 --max-total-tokens 130000 --max-input-tokens 125000
  3. Clone the benchmarking repo to obtain the benchmark_serving.py script.
  4. Run: pip install aiohttp
  5. Run: python benchmark_serving.py --backend tgi --model meta-llama/Llama-3.1-405B-Instruct --dataset-name sonnet --sonnet-input-len 1000 --endpoint /generate_stream --dataset-path="sonnet.txt" --num-prompt=150 --request-rate=5
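
For reference, a minimal consolidated sketch of steps 1-5 as shell commands. This is an assumption-laden sketch, not the exact dstack deployment: the docker run device flags follow the standard TGI ROCm invocation, and the HF_TOKEN value, benchmark repository URL, and --model-id flag are placeholders (in the original deployment the model may be selected through the dstack configuration instead).

  # Start TGI on all 8 MI300X GPUs; the image's entrypoint is
  # text-generation-launcher, so the launcher flags are appended to docker run.
  docker run --rm -it \
    --device /dev/kfd --device /dev/dri \
    --group-add video --ipc=host --shm-size 1g \
    -e HF_TOKEN=<your_hf_token> \
    -p 8000:8000 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:sha-11d7af7-rocm \
    --model-id meta-llama/Llama-3.1-405B-Instruct \
    --port 8000 --num-shard 8 --sharded true \
    --max-concurrent-requests 8192 \
    --max-total-tokens 130000 --max-input-tokens 125000

  # In a second shell: fetch the benchmarking script and its dependency.
  git clone <benchmark-repo-url>   # placeholder for the repo containing benchmark_serving.py
  cd <benchmark-repo>
  pip install aiohttp

  # Send 150 prompts at ~5 requests per second against the streaming endpoint.
  python benchmark_serving.py \
    --backend tgi \
    --model meta-llama/Llama-3.1-405B-Instruct \
    --dataset-name sonnet --sonnet-input-len 1000 \
    --dataset-path "sonnet.txt" \
    --endpoint /generate_stream \
    --num-prompt 150 --request-rate 5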

Expected behavior

All 150 requests should succeed.
