When I deploy qwen2-vl-7b-instruct on Inference Endpoints, I get the following error:
Exit code: 1. Reason: le \"/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1562, in _call_impl\n return forward_call(*args, **kwargs)\n File \"/opt/conda/lib/python3.11/site-packages/text_generation_server/layers/rotary.py\", line 57, in forward\n rotary_emb.apply_rotary(q1, q2, cos, sin, q1, q2, False)\nRuntimeError: The size of tensor a (16) must match the size of tensor b (32) at non-singleton dimension 0"},"target":"text_generation_launcher"}
{"timestamp":"2025-01-06T12:37:13.173078Z","level":"ERROR","message":"Server error: The size of tensor a (16) must match the size of tensor b (32) at non-singleton dimension 0","target":"text_generation_router_v3::client","filename":"backends/v3/src/client/mod.rs","line_number":45,"span":{"name":"warmup"},"spans":[{"max_batch_size":"None","max_input_length":"Some(10000)","max_prefill_tokens":10000,"max_total_tokens":"Some(10001)","name":"warmup"},{"name":"warmup"}]}
Error: Backend(Warmup(Generation("The size of tensor a (16) must match the size of tensor b (32) at non-singleton dimension 0")))
{"timestamp":"2025-01-06T12:37:13.198321Z","level":"ERROR","fields":{"message":"Webserver Crashed"},"target":"text_generation_launcher"}
{"timestamp":"2025-01-06T12:37:13.198350Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
{"timestamp":"2025-01-06T12:37:13.265449Z","level":"INFO","fields":{"message":"Terminating shard"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2025-01-06T12:37:13.265981Z","level":"INFO","fields":{"message":"Waiting for shard to gracefully shutdown"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2025-01-06T12:37:13.566463Z","level":"INFO","fields":{"message":"shard terminated"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
Error: WebserverFailed
Hi @AHEADer, thank you for opening this issue. I believe this is similar to #2875 and may be related to a bug with CUDA graphs and rotary embeddings. I'm looking for a good fix now and will update this issue once it is resolved. In the meantime, I believe it's possible to avoid the error by setting the environment variable CUDA_GRAPHS=0.
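For anyone reproducing this outside Inference Endpoints, here is a minimal sketch of the workaround above, assuming a local `text-generation-launcher` binary on the PATH and the Hub model id `Qwen/Qwen2-VL-7B-Instruct` (both are assumptions, not taken from the report). The helper `build_launch` is hypothetical; it only prepares the command and environment and does not start the server. On Inference Endpoints the same variable would presumably be set in the endpoint's environment settings instead.

```python
import os
import shlex


def build_launch(model_id, base_env=None):
    """Return (cmd, env) for starting the launcher with CUDA graphs disabled.

    Hypothetical helper for illustration; it does not run anything.
    """
    # Copy the environment so the caller's mapping is not mutated.
    env = dict(os.environ if base_env is None else base_env)
    # Workaround suggested in this thread: disable CUDA graphs.
    env["CUDA_GRAPHS"] = "0"
    # Assumed invocation: the launcher binary plus a model id.
    cmd = ["text-generation-launcher", "--model-id", model_id]
    return cmd, env


if __name__ == "__main__":
    cmd, env = build_launch("Qwen/Qwen2-VL-7B-Instruct")
    # Print the equivalent shell line instead of launching the server.
    print("CUDA_GRAPHS=" + env["CUDA_GRAPHS"] + " " + shlex.join(cmd))
    # To actually start the server (requires the launcher to be installed):
    # import subprocess; subprocess.run(cmd, env=env, check=True)
```

If warmup still fails with the same tensor size mismatch after this change, the cause is probably not limited to CUDA graphs, and that would be worth reporting here.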
System Info
When I deploy qwen2-vl-7b-instruct on Inference Endpoints, I get the error shown above.
Information
Tasks
Reproduction
Expected behavior
The model deploys successfully on the Inference Endpoint.