Deploy Qwen2-VL in inference endpoints failed #2879

Open · 2 of 4 tasks

AHEADer opened this issue Jan 6, 2025 · 1 comment

AHEADer commented Jan 6, 2025

System Info

When I use Inference Endpoints to deploy qwen2-vl-7b-instruct, I get the following error:

Exit code: 1. Reason: le \"/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1562, in _call_impl\n    return forward_call(*args, **kwargs)\n  File \"/opt/conda/lib/python3.11/site-packages/text_generation_server/layers/rotary.py\", line 57, in forward\n    rotary_emb.apply_rotary(q1, q2, cos, sin, q1, q2, False)\nRuntimeError: The size of tensor a (16) must match the size of tensor b (32) at non-singleton dimension 0"},"target":"text_generation_launcher"}
{"timestamp":"2025-01-06T12:37:13.173078Z","level":"ERROR","message":"Server error: The size of tensor a (16) must match the size of tensor b (32) at non-singleton dimension 0","target":"text_generation_router_v3::client","filename":"backends/v3/src/client/mod.rs","line_number":45,"span":{"name":"warmup"},"spans":[{"max_batch_size":"None","max_input_length":"Some(10000)","max_prefill_tokens":10000,"max_total_tokens":"Some(10001)","name":"warmup"},{"name":"warmup"}]}
Error: Backend(Warmup(Generation("The size of tensor a (16) must match the size of tensor b (32) at non-singleton dimension 0")))
{"timestamp":"2025-01-06T12:37:13.198321Z","level":"ERROR","fields":{"message":"Webserver Crashed"},"target":"text_generation_launcher"}
{"timestamp":"2025-01-06T12:37:13.198350Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
{"timestamp":"2025-01-06T12:37:13.265449Z","level":"INFO","fields":{"message":"Terminating shard"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2025-01-06T12:37:13.265981Z","level":"INFO","fields":{"message":"Waiting for shard to gracefully shutdown"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2025-01-06T12:37:13.566463Z","level":"INFO","fields":{"message":"shard terminated"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
Error: WebserverFailed

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. On https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct, click Deploy and choose Inference Endpoints.
  2. Select an L40 or A100 instance, then wait for the error (a local approximation of this setup is sketched after this list).
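For reference, here is a rough local sketch of the kind of TGI deployment the endpoint runs; the image tag, port, and volume path are assumptions, and the token limits mirror the warmup values in the log above:

```shell
# Sketch of a local TGI deployment approximating the failing endpoint.
# Image tag, port, and volume path are assumptions; adjust as needed.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id Qwen/Qwen2-VL-7B-Instruct \
  --max-input-tokens 10000 \
  --max-total-tokens 10001
```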

Expected behavior

The model deploys successfully on Inference Endpoints.

drbh (Collaborator) commented Jan 7, 2025

Hi @AHEADer, thank you for opening this issue. I believe this is similar to #2875 and may be related to a bug with CUDA graphs and rotary embeddings. I'm looking for a good fix now and will update this issue once it's resolved. In the meantime, I believe it's possible to avoid this issue by setting the environment variable CUDA_GRAPHS=0.
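A minimal sketch of that workaround for a local Docker run (the image tag and other flags are assumptions; on Inference Endpoints the variable can instead be added through the endpoint's environment-variable settings):

```shell
# Workaround sketch: pass CUDA_GRAPHS=0 to disable CUDA graphs,
# which avoids the rotary-embedding shape mismatch during warmup.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -e CUDA_GRAPHS=0 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id Qwen/Qwen2-VL-7B-Instruct
```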
