Deploy Qwen2-VL in inference endpoints failed #2879

Open · 2 of 4 tasks

AHEADer opened this issue Jan 6, 2025 · 1 comment

AHEADer commented Jan 6, 2025

System Info

When I use Inference Endpoints to deploy qwen2-vl-7b-instruct, I get the following error:

Exit code: 1. Reason: le \"/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1562, in _call_impl\n    return forward_call(*args, **kwargs)\n  File \"/opt/conda/lib/python3.11/site-packages/text_generation_server/layers/rotary.py\", line 57, in forward\n    rotary_emb.apply_rotary(q1, q2, cos, sin, q1, q2, False)\nRuntimeError: The size of tensor a (16) must match the size of tensor b (32) at non-singleton dimension 0"},"target":"text_generation_launcher"}
{"timestamp":"2025-01-06T12:37:13.173078Z","level":"ERROR","message":"Server error: The size of tensor a (16) must match the size of tensor b (32) at non-singleton dimension 0","target":"text_generation_router_v3::client","filename":"backends/v3/src/client/mod.rs","line_number":45,"span":{"name":"warmup"},"spans":[{"max_batch_size":"None","max_input_length":"Some(10000)","max_prefill_tokens":10000,"max_total_tokens":"Some(10001)","name":"warmup"},{"name":"warmup"}]}
Error: Backend(Warmup(Generation("The size of tensor a (16) must match the size of tensor b (32) at non-singleton dimension 0")))
{"timestamp":"2025-01-06T12:37:13.198321Z","level":"ERROR","fields":{"message":"Webserver Crashed"},"target":"text_generation_launcher"}
{"timestamp":"2025-01-06T12:37:13.198350Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
{"timestamp":"2025-01-06T12:37:13.265449Z","level":"INFO","fields":{"message":"Terminating shard"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2025-01-06T12:37:13.265981Z","level":"INFO","fields":{"message":"Waiting for shard to gracefully shutdown"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2025-01-06T12:37:13.566463Z","level":"INFO","fields":{"message":"shard terminated"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
Error: WebserverFailed

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. On https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct, click Deploy and choose Inference Endpoints.
  2. Select an L40 or A100 instance, then wait for the error (a local approximation of this setup is sketched after this list).
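For reference, here is a rough local sketch of the kind of TGI deployment the endpoint runs; the image tag, port, and volume path are assumptions, and the token limits mirror the warmup values in the log above:

```shell
# Sketch of a local TGI deployment approximating the failing endpoint.
# Image tag, port, and volume path are assumptions; adjust as needed.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id Qwen/Qwen2-VL-7B-Instruct \
  --max-input-tokens 10000 \
  --max-total-tokens 10001
```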

Expected behavior

The model deploys successfully on Inference Endpoints.

drbh (Collaborator) commented Jan 7, 2025

Hi @AHEADer, thank you for opening this issue. I believe this is similar to #2875 and may be related to a bug with CUDA graphs and rotary embeddings. I'm looking for a good fix now and will update this issue once it's resolved. In the meantime, I believe it's possible to avoid this issue by setting the environment variable CUDA_GRAPHS=0.
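A minimal sketch of that workaround for a local Docker run (the image tag and other flags are assumptions; on Inference Endpoints the variable can instead be added through the endpoint's environment-variable settings):

```shell
# Workaround sketch: pass CUDA_GRAPHS=0 to disable CUDA graphs,
# which avoids the rotary-embedding shape mismatch during warmup.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -e CUDA_GRAPHS=0 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id Qwen/Qwen2-VL-7B-Instruct
```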
