Stuck during parallel inference. #3057

Lanbai-eleven · 2025-01-20T16:28:18Z

I am performing parallel inference with a batch size of 8 on a machine with 4 * A6000 GPUs. However, after running inference for a while, it gets stuck and stops responding. Meanwhile, nvidia-smi shows the following situation:

lvhan028 · 2025-01-21T01:46:59Z

Please share the env information by running lmdeploy check_env

Lanbai-eleven · 2025-01-21T04:26:25Z

Here is the output of lmdeploy check_env
`Python: 3.9.0 (default, Nov 15 2020, 14:28:56) [GCC 7.3.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3: NVIDIA RTX A6000
CUDA_HOME: /usr
NVCC: Cuda compilation tools, release 12.1, V12.1.66
GCC: gcc (Ubuntu 8.4.0-3ubuntu2) 8.4.0
PyTorch: 2.4.1+cu121
PyTorch compiling details: PyTorch built with:

GCC 9.3
C++ Version: 201703
Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v3.4.2 (Git Hash 1137e04ec0b5251ca2b4400a4fd3c667ce843d67)
OpenMP 201511 (a.k.a. OpenMP 4.5)
LAPACK is enabled (usually provided by MKL)
NNPACK is enabled
CPU capability usage: AVX2
CUDA Runtime 12.1
NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
CuDNN 90.1 (built against CUDA 12.4)
Magma 2.6.1
Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=9.1.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.4.1, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

TorchVision: 0.19.1+cu121
LMDeploy: 0.7.0+6cd35d5
transformers: 4.47.0
gradio: Not Found
fastapi: 0.115.6
pydantic: 2.10.3
triton: 3.0.0
NVIDIA Topology:
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV4     NODE    NODE    0-127           0               N/A
GPU1    NV4      X      NODE    NODE    0-127           0               N/A
GPU2    NODE    NODE     X      NV4     0-127           0               N/A
GPU3    NODE    NODE    NV4      X      0-127           0               N/A

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks`

lvhan028 · 2025-01-21T04:41:33Z

Can you share the reproducible code snippet too?

Lanbai-eleven · 2025-01-21T06:28:31Z

I may not be able to provide the complete code because this is a complex project, but essentially, I am performing normal inference using a VLM with a batch size of 8.
Each input in a batch consists of 8 images along with the same prompt. This issue occurs with both Internvl-8B and QwenVL-7B models. It seems that reducing the batch size to 4 can alleviate the occurrence of this problem.

`def batch_generate_entities_and_relations(
    llm: BaseVideoModel,
    events: list[dict],
    video: VideoRepresentation,
    file_path: str,
    global_config: dict,
    max_retries: int = 5,
    batch_size: int = 4,
):
    ...
    try:
        batch_responses = llm.batch_generate_response(batch_inputs=batch_inputs)
    ...

# model:
# Imports assumed to come from LMDeploy; they were not shown in the original snippet.
from lmdeploy import GenerationConfig
from lmdeploy.vl.constants import IMAGE_TOKEN

class InternVL_Pipe:
    def batch_generate_response(self, batch_inputs, timestamps=None):
        prompts = []
        gen_config = GenerationConfig(do_sample=True, max_new_tokens=2048)
        if "video" in batch_inputs[0]:
            for inputs in batch_inputs:
                images = inputs["video"]
                # One image placeholder per frame, labelled by index or by timestamp.
                if timestamps is None:
                    video_prefix = ''.join(f'Frame-{i}: {IMAGE_TOKEN}\n' for i in range(len(images)))
                else:
                    video_prefix = ''.join(f'Timestamp {timestamps[i]}s : {IMAGE_TOKEN}\n' for i in range(len(images)))
                question = video_prefix + inputs["text"]
                prompts.append((question, images))
        else:
            # Text-only batch: each prompt is the plain question.
            for inputs in batch_inputs:
                prompts.append(inputs["text"])

        responses = self.pipe(prompts, gen_config=gen_config)
        return [response.text for response in responses]
`
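Since a batch size of 4 seems to reduce the hangs, one possible stopgap is to split each batch of 8 into smaller sub-batches before calling the model. The sketch below is only a workaround under that assumption, not a fix for the underlying hang; `chunked` and `generate_in_sub_batches` are hypothetical helper names, and `generate` stands for any callable that takes a list of inputs and returns one response per input.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunked(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def generate_in_sub_batches(
    generate: Callable[[Sequence[T]], List[R]],
    batch_inputs: Sequence[T],
    sub_batch_size: int = 4,
) -> List[R]:
    """Call `generate` once per sub-batch and concatenate results in input order."""
    responses: List[R] = []
    for sub_batch in chunked(batch_inputs, sub_batch_size):
        responses.extend(generate(sub_batch))
    return responses
```

With the snippet above, this could be wired in as `generate_in_sub_batches(lambda b: llm.batch_generate_response(batch_inputs=b), batch_inputs)`, which keeps the response order identical to the input order.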

lvhan028 · 2025-01-21T15:52:27Z

Can you set log_level=INFO when creating the pipeline?
Hope we can get some clues from the info log.
Another way to debug the issue is to use gdb, for instance:

gdb attach <pid>
set logging on
thread apply all bt
c
set logging off

You can find gdb.txt in the working directory. Please share it with us.
Meanwhile, we will try to reproduce it on our A100 device.
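As a complement to the gdb backtrace, the Python side of the process can also be made to dump its own thread stacks when a call takes too long. This is a minimal sketch using the standard-library `faulthandler` module; `dump_stacks_if_stuck` is a hypothetical helper name, and the timeout value is an assumption to be tuned to the workload. It only shows Python frames, so it will not reveal where native code is blocked.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def dump_stacks_if_stuck(timeout_s: float, file=None):
    """Dump all Python thread stacks every `timeout_s` seconds while the block runs.

    `file` must be a real file object with a file descriptor; defaults to stderr.
    """
    target = file if file is not None else sys.stderr
    # A watchdog thread writes the tracebacks; repeat=True keeps dumping while stuck.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)
    try:
        yield
    finally:
        # Stop the watchdog once the guarded block finishes normally.
        faulthandler.cancel_dump_traceback_later()
```

Wrapping the suspect call, for example `with dump_stacks_if_stuck(600): responses = pipe(prompts)`, would print which Python line each thread is waiting on if the call has not returned after ten minutes.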

LaoWangGB · 2025-01-22T13:54:35Z

Same problem when running inference on a 78B model with 4 * H800. With lmdeploy==0.6.3 it occurs occasionally, but with lmdeploy==0.7.0 it occurs every time.

Lanbai-eleven · 2025-01-23T05:42:07Z

So, do I need to keep running until the “stuck” situation occurs, or do I only need the INFO logs when creating the pipeline?

lvhan028 · 2025-01-23T06:49:14Z

Set log_level="INFO" when creating the pipeline, then run the reproducible code.
When the hang occurs, please share the whole log with us.
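For reference, a minimal sketch of what that looks like, assuming the pipeline is created through `lmdeploy.pipeline`; `create_debug_pipeline` is a hypothetical wrapper name and the model path is a placeholder.

```python
def create_debug_pipeline(model_path: str):
    """Build an LMDeploy pipeline that emits INFO-level logs."""
    # Imported lazily so this sketch can be loaded without LMDeploy installed.
    from lmdeploy import pipeline

    # log_level is passed at creation time, so it applies to the whole run.
    return pipeline(model_path, log_level="INFO")
```

The log needs to cover the whole run up to the hang, so it helps to capture the script's output to a file, for example by redirecting stdout and stderr when launching it.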

Lanbai-eleven · 2025-01-24T14:37:40Z

Here is the log

lvhan028 assigned lvhan028 and lzhangzz and unassigned lvhan028 Jan 21, 2025
