[Bug] Regression: "opea/vllm-gaudi:latest" container in crash loop #1038
Comments
The only change on the OPEA Git side since v1.1 is dropping of the […]. However, comparing the […] shows quite a few differences, also in the sizes of the installed layers.

=> I think the problem is on the Habana repo side. Recent Gaudi vLLM dependency changes are one possibility: https://github.com/HabanaAI/vllm-fork/commits/habana_main/requirements-hpu.txt

Maybe the new HPU deps do not correctly handle the pod's Gaudi plugin device request, which gives vLLM (write) access to only one of the 8 devices in the node?
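A minimal sketch for checking that theory, i.e. which accelerator device nodes are actually visible inside the vLLM pod; the label selector and the device paths (/dev/accel vs. /dev/hl*) are assumptions and depend on the Helm release and driver generation:

$ POD=$(kubectl get pods -l app.kubernetes.io/name=vllm -o name | head -n1)   # label is an assumption, adjust to your release
$ # List the accelerator device nodes the container can see (path differs by driver generation)
$ kubectl exec $POD -- sh -c 'ls -l /dev/accel/ /dev/hl* 2>/dev/null'
$ # Show which Gaudi resources the pod actually requested from the device plugin
$ kubectl get $POD -o jsonpath='{.spec.containers[*].resources}'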
Not sure how the Docker files are built. Here both v1.1 and latest are built from the same commit ID. @ashahba
@xiguiw As you can see from the OPEA script, it […]

[1] It would be faster to clone the specific commit instead of first cloning the whole repo and only then checking out that specific commit.
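A minimal sketch of what footnote [1] suggests, assuming the HabanaAI/vllm-fork repo and a placeholder commit hash; fetching only that commit avoids downloading the full history:

$ git init vllm-fork && cd vllm-fork
$ git remote add origin https://github.com/HabanaAI/vllm-fork.git
$ # Fetch only the single commit the image should be built from (hash is a placeholder)
$ git fetch --depth 1 origin <commit-sha>
$ git checkout FETCH_HEAD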
@eero-t I mean, if the vllm-gaudi image is built from the same commit in v1.1 and latest, it does not make sense that v1.1 works but latest fails. If the v1.1 and latest Docker images are built with different commit IDs, that is possible.
The exact build command is this: […]

Which uses this exact Dockerfile: […]

Which uses this exact Python module requirements file: […]
There are several components installed while building those images, and they can change even though the commit IDs for the top-level code do not change. Neither […] nor […] is pinned, and the dependencies in […] are unversioned => the image build gets the latest versions of those deps, which can change between builds...
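A minimal sketch, assuming both image tags can run pip with an overridden entrypoint, for diffing the Python packages actually baked into the v1.1 and latest images:

$ docker run --rm --entrypoint pip opea/vllm-gaudi:1.1 freeze > deps-1.1.txt
$ docker run --rm --entrypoint pip opea/vllm-gaudi:latest freeze > deps-latest.txt
$ # Any drift here comes from unpinned requirements, not from the OPEA commit ID
$ diff deps-1.1.txt deps-latest.txt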
@xiguiw My first guess would be that the issue is with the drivers in the Habanalabs "pytorch-installer-2.4.0:latest" image, which is used as the base image by […].

According to: https://vault.habana.ai/ui/repos/tree/General/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0

Its current […]
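A minimal sketch for checking whether the base image behind that latest tag has silently changed; the pull path is an assumption derived from the Vault URL above:

$ docker pull vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
$ # The digest and creation time reveal whether "latest" was rebuilt recently
$ docker inspect --format '{{index .RepoDigests 0}} created={{.Created}}' \
      vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest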
While […]. Now these crashes:

[…]

are due to a segfault:

[…]
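A minimal sketch for pulling that segfault evidence out of a crash-looping pod; the pod name is a placeholder:

$ kubectl describe pod <vllm-gaudi-pod> | grep -A3 'Last State'   # exit code 139 indicates SIGSEGV
$ # Log of the previous (crashed) container instance
$ kubectl logs <vllm-gaudi-pod> --previous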
Which raises the question: why is the latest OPEA image not using the latest Gaudi vLLM release, built from vLLM v0.6.4: […], but instead a non-release commit of that repo, using vLLM v0.6.3-dev?
@eero-t Thank you a lot for discovering this. We ran into some issue, so we used a dev version. We will try the latest version and upgrade to it if there is no CI issue.
The latest Gaudi driver release is 1.19.1, according to: https://docs.habana.ai/en/latest/Release_Notes/GAUDI_Release_Notes.html

Whereas my test node has 1.16.2, according to output from: […]

If newer Gaudi HW drivers are required with […]
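A minimal sketch of two common ways to read the installed Gaudi driver version on a node; the exact hl-smi output formatting may vary between releases:

$ hl-smi | grep -i 'driver version'        # Habana's equivalent of nvidia-smi
$ # Or read the kernel module metadata directly
$ modinfo habanalabs | grep -i '^version'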
It's because at that time the vLLM master had a bug and could not be used, so we used a dev version. We already have a PR to update it.
It's a Gaudi limitation that the HPU driver also has to be upgraded before using the latest Gaudi-related Docker images.
Fixed by #1156
Priority: Undecided
OS type:
Hardware type:
Installation method:
Deploy method:
Running nodes: Single Node
What's the version?
https://hub.docker.com/layers/opea/vllm-gaudi/latest/images/sha256-d2c0b0aa88cd26ae2084990663d8d789728f658bacacd8a49cc5b81a6a022c58
Description
The vllm-gaudi:latest container does not find devices, and is in a crash loop. But if I change the latest tag to 1.1, it works fine, i.e. this is a regression.

Reproduce steps
Run ChatQnA from GenAIInfra with vLLM:
$ helm install chatqna chatqna/ --skip-tests --values chatqna/gaudi-vllm-values.yaml ...
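A minimal sketch for confirming the failure after the helm install; the "vllm" filter is an assumption based on the chart values used above:

$ kubectl get pods | grep -i vllm          # expect STATUS CrashLoopBackOff with the latest tag
$ # Verify which image tag the failing pod is actually running
$ kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}' | grep -i vllm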
Raw log