
[Bug] Regression: "opea/vllm-gaudi:latest" container in crash loop #1038

Closed

eero-t opened this issue Dec 16, 2024 · 10 comments
Labels: aitce, bug (Something isn't working), r1.2
Milestone: v1.2

Comments

eero-t (Contributor) commented Dec 16, 2024

Priority

Undecided

OS type

  • OS: Ubuntu 22.04
  • Kernel: 5.15.0

Hardware type

  • HW: Gaudi2
  • driver_ver: 1.16.2-f195ec4

Installation method

  • Pull docker images from hub.docker.com

Deploy method

  • Helm

Running nodes

Single Node

What's the version?

https://hub.docker.com/layers/opea/vllm-gaudi/latest/images/sha256-d2c0b0aa88cd26ae2084990663d8d789728f658bacacd8a49cc5b81a6a022c58

Description

The vllm-gaudi:latest container does not find any Gaudi devices and is in a crash loop.

If I change the latest tag to 1.1, it works fine, i.e. this is a regression.

Reproduce steps

Run ChatQnA from GenAIInfra with vLLM:
$ helm install chatqna chatqna/ --skip-tests --values chatqna/gaudi-vllm-values.yaml ...
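
As a workaround until this is fixed, the known-good 1.1 tag can be pinned instead of latest. A minimal sketch, assuming the chart exposes the vLLM image tag under a vllm.image.tag value (the exact key name is an assumption; check gaudi-vllm-values.yaml):

# Hypothetical override; adjust the values key to match the chart
$ helm install chatqna chatqna/ --skip-tests \
    --values chatqna/gaudi-vllm-values.yaml \
    --set vllm.image.tag=1.1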

Raw log

$ kubectl logs chatqna-vllm-75dfb59d66-wp4vs
...
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py", line 132, in current_device
    init()
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py", line 71, in init
    _hpu_C.init()
RuntimeError: synStatus=8 [Device not found] Device acquire failed.
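
For reference, a couple of checks that could narrow this down (a sketch; assumes hl-smi is present in the image and uses the crashing pod's name from above):

# Compare device visibility from a pod running the working 1.1 image
$ kubectl exec -it <running-vllm-gaudi-1.1-pod> -- hl-smi
# Check whether the crashing pod was actually granted a Gaudi device
$ kubectl describe pod chatqna-vllm-75dfb59d66-wp4vs | grep -i habana.ai/gaudi
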
eero-t added the bug (Something isn't working) label on Dec 16, 2024
eero-t (Contributor, Author) commented Dec 16, 2024

The only change in OPEA Git since v1.1 is the dropping of the eager option, and the v1.1 image works fine with that change:
https://github.com/opea-project/GenAIComps/commits/main/comps/llms/text-generation/vllm/langchain/dependency/

However, comparing the layers of the latest image to those of the earlier v1.1 image:
https://hub.docker.com/layers/opea/vllm-gaudi/1.1/images/sha256-c75d22e05ff23e4c0745e9c0a56ec74763f85c7fecf23b7f62e0da74175ddae7

shows quite a few differences, including in the sizes of the installed layers.

=> I think the problem is on the Habana repo side.

Recent Gaudi vLLM dependency changes are one possibility: https://github.com/HabanaAI/vllm-fork/commits/habana_main/requirements-hpu.txt

Maybe the new HPU deps do not correctly handle the pod's Gaudi plugin device request, which allows vLLM (write) access to only one of the node's 8 devices?
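
One way to see what device request the pod actually received (a sketch; the resource name comes from the Habana device plugin, typically habana.ai/gaudi):

# Show the resource limits set on the vLLM container of the crashing pod
$ kubectl get pod chatqna-vllm-75dfb59d66-wp4vs \
    -o jsonpath='{.spec.containers[0].resources.limits}'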

xiguiw (Collaborator) commented Dec 19, 2024

I'm not sure how the Docker images are built.
I found one build command, but did not find Dockerfile.hpu.

Here both v1.1 and latest are built from the same commit ID (git checkout 3c39626), but Dockerfile.hpu is missing. Judging from the behavior, this should be a Gaudi vLLM service issue, not an OPEA-level one.

@ashahba
Could you let us know the build command for the Docker image on Docker Hub?

eero-t (Contributor, Author) commented Dec 19, 2024

I'm not sure how the Docker images are built. I found one build command, but did not find Dockerfile.hpu.

@xiguiw As you can see from the OPEA script, it git-clones the Habana repo [1], cds into the repo's vllm-fork directory, and builds Dockerfile.hpu from there.

[1] It would be faster to fetch just the specific commit, instead of first cloning the whole repo and only then checking out that commit; see the sketch below.
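
A minimal sketch of such a fetch (assumes the full commit SHA is used and that the server allows fetching commits by SHA, which GitHub generally does):

$ git init vllm-fork && cd vllm-fork
$ git remote add origin https://github.com/HabanaAI/vllm-fork.git
$ git fetch --depth 1 origin <full SHA of commit 3c39626>
$ git checkout FETCH_HEAD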

xiguiw (Collaborator) commented Dec 24, 2024

I'm not sure how the Docker images are built. I found one build command, but did not find Dockerfile.hpu.

@xiguiw As you can see from the OPEA script, it git-clones the Habana repo [1], cds into the repo's vllm-fork directory, and builds Dockerfile.hpu from there.

@eero-t
I did not find the exact build command.

What I mean is: if the vllm-gaudi image is built from the same commit for both v1.1 and latest, it does not make sense that v1.1 works but latest fails.

If the v1.1 and latest Docker images are built from different commit IDs, that is possible.
We can try to build the vllm-gaudi Docker image independently (without OPEA), and then verify whether the vllm-gaudi service works.

eero-t (Contributor, Author) commented Dec 30, 2024

@eero-t I did not find the exact build command.

The exact build command is this:
https://github.com/opea-project/GenAIComps/blob/main/comps/llms/text-generation/vllm/langchain/dependency/build_docker_vllm.sh#L41

Which uses this exact Dockerfile:
https://github.com/HabanaAI/vllm-fork/blob/3c39626/Dockerfile.hpu

Which uses this exact Python module requirements file:
https://github.com/HabanaAI/vllm-fork/blob/3c39626/requirements-hpu.txt
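
Putting those together, rebuilding the image outside of OPEA would roughly look like this (a sketch; the exact docker build flags used by build_docker_vllm.sh may differ):

$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork && git checkout 3c39626
$ docker build -f Dockerfile.hpu -t vllm-gaudi:local-test .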

What I mean is: if the vllm-gaudi image is built from the same commit for both v1.1 and latest, it does not make sense that v1.1 works but latest fails. If the v1.1 and latest Docker images are built from different commit IDs, that is possible.

Several components are installed while building those images, and they can change even though the commit IDs of the top-level code do not.

Neither Dockerfile.hpu nor the requirements-hpu.txt file it uses specifies exact version numbers / SHA values for their dependencies.

And requirements-common.txt, which requirements-hpu.txt depends on, pulls in several additional Python modules whose exact version numbers are not given either: https://github.com/HabanaAI/vllm-fork/blob/3c39626/requirements-common.txt

=> each image build gets the latest versions of those deps, which can change between builds...
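
One way to confirm this would be to diff the Python packages actually baked into the two images (a sketch; assumes pip is on PATH in both images and that they can be pulled locally):

$ docker run --rm --entrypoint pip opea/vllm-gaudi:1.1 freeze > deps-1.1.txt
$ docker run --rm --entrypoint pip opea/vllm-gaudi:latest freeze > deps-latest.txt
$ diff deps-1.1.txt deps-latest.txt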

We can try to build the vllm-gaudi Docker image independently (without OPEA), and then verify whether the vllm-gaudi service works.

@xiguiw My first guess would be that the issue is with the drivers in the Habanalabs "pytorch-installer-2.4.0:latest" image, which Dockerfile.hpu uses as its base image.

According to: https://vault.habana.ai/ui/repos/tree/General/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0

Its current latest version is from 2024-12-12, whereas the OPEA vllm-gaudi:1.1 image was created several weeks earlier, on 2024-11-22: https://hub.docker.com/r/opea/vllm-gaudi/tags
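
A quick way to compare the build timelines (a sketch; the images need to be pulled first, and the base image path is assumed from the vault.habana.ai listing above):

$ docker inspect -f '{{.Created}}' opea/vllm-gaudi:1.1 opea/vllm-gaudi:latest
$ docker inspect -f '{{.Created}}' \
    vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest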

eero-t (Contributor, Author) commented Jan 3, 2025

While the opea/vllm-gaudi:1.1 image continues working, the failure reason for opea/vllm-gaudi:latest has changed.

Now these crashes:

NAME                                   READY   STATUS             RESTARTS      AGE
docsum-vllm-56d857cb77-99788           0/1     CrashLoopBackOff   5 (32s ago)   6m31s

Are due to a segfault:

Internal Error: Received signal - Segmentation fault
Task exception was never retrieved
future: <Task finished name='Task-2' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev910+g3c39626f.gaudi000-py3.10.egg/vllm/engine/multiprocessing/client.py:175> exception=ZMQError('Operation not supported')>
...

Which raises the question: why is the latest OPEA image not using the latest Gaudi vLLM release, built from vLLM v0.6.4 (https://github.com/HabanaAI/vllm-fork/releases), but instead a non-release commit of that repo, based on vLLM v0.6.3-dev?

yinghu5 (Collaborator) commented Jan 14, 2025

@eero-t thank you a lot for discovering this. We ran into some issues, so we used a dev version; we will try the latest version and upgrade to it if there are no CI issues.

yinghu5 added this to OPEA on Jan 14, 2025
yinghu5 added this to the v1.2 milestone on Jan 14, 2025
yinghu5 moved this to In progress in OPEA on Jan 14, 2025
eero-t (Contributor, Author) commented Jan 14, 2025

Latest Gaudi driver release is 1.19.1, according to: https://docs.habana.ai/en/latest/Release_Notes/GAUDI_Release_Notes.html

Whereas my test node has 1.16.2, according to output from: head /sys/class/accel/accel0/device/*ver
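
(For reference, hl-smi on the host also reports this; a sketch, assuming the Habana tools are installed:)

$ hl-smi    # header shows the installed driver version, similar to nvidia-smi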

If the latest image builds and the v1.2 release require newer Gaudi HW drivers than the OPEA 1.1 release did, that would need to be prominently stated in the docs.

ftian1 (Collaborator) commented Jan 15, 2025

But instead a non-release commit of that repo, based on vLLM v0.6.3-dev?

It's because at that time the vLLM master had a bug. We already have a PR to update it.

If the latest image builds and the v1.2 release require newer Gaudi HW drivers than the OPEA 1.1 release did, that would need to be prominently stated in the docs.

It's a Gaudi limitation that requires the HPU driver to also be upgraded before using the latest Gaudi-related Docker images.

joshuayao (Collaborator) commented:

Fixed by #1156

github-project-automation bot moved this from In progress to Done in OPEA on Jan 17, 2025