model_analyzer profile with mig : DCGM initialization error #954

Open
jason-i-vv opened this issue Dec 25, 2024 · 4 comments

jason-i-vv commented Dec 25, 2024

Hardware: H800

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Oct_29_23:50:19_PDT_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0

model-analyzer --version
1.47.0

# base image
nvcr.io/nvidia/tritonserver:24.11-py3

When using the whole card there is no problem. However, after enabling MIG mode, when the container is attached to the MIG devices, model_analyzer cannot be run.

docker run -ti --rm --gpus='"device=0:0,0:1"' --network=host -v $PWD:/mnt --name triton-server tritonserver-modelanalyzer:latest

model-analyzer profile \
  --model-repository=/mnt/models \
  --profile-models=densenet_onnx \
  --output-model-repository-path=results

[Model Analyzer] Initializing GPUDevice handles
CacheManager Init Failed. Error: -17
Traceback (most recent call last):
  File "/usr/local/bin/model-analyzer", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/model_analyzer/entrypoint.py", line 263, in main
    gpus = GPUDeviceFactory().verify_requested_gpus(config.gpus)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/model_analyzer/device/gpu_device_factory.py", line 39, in __init__
    self.init_all_devices()
  File "/usr/local/lib/python3.12/dist-packages/model_analyzer/device/gpu_device_factory.py", line 58, in init_all_devices
    dcgm_handle = dcgm_agent.dcgmStartEmbedded(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/model_analyzer/monitor/dcgm/dcgm_agent.py", line 56, in wrapper
    return fn(*newargs, **newkwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/model_analyzer/monitor/dcgm/dcgm_agent.py", line 91, in dcgmStartEmbedded
    dcgm_structs._dcgmCheckReturn(ret)
  File "/usr/local/lib/python3.12/dist-packages/model_analyzer/monitor/dcgm/dcgm_structs.py", line 691, in _dcgmCheckReturn
    raise DCGMError(ret)
model_analyzer.monitor.dcgm.dcgm_structs.DCGMError_InitError: DCGM initialization error
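
As a quick sanity check inside the same container, the MIG instances and DCGM can be probed directly. This is only a diagnostic sketch; the dcgmi CLI may not be installed in the Triton image, and the listed devices will differ per setup.

# List the physical GPUs and MIG instances visible to the container
nvidia-smi -L

# If the DCGM CLI is available, check whether it can start and enumerate the devices
dcgmi discovery -l
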
jason-i-vv (Author) commented:

When I test tritonserver on its own, the server starts, but the same DCGM error appears.

I1225 08:54:50.204154 265 server.cc:631] 
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
| Backend     | Path                                                            | Config                                                                                                                       |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.00000 |
|             |                                                                 | 0","default-max-batch-size":"4"}}                                                                                            |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+

I1225 08:54:50.204199 265 server.cc:674] 
+---------------+---------+--------+
| Model         | Version | Status |
+---------------+---------+--------+
| densenet_onnx | 1       | READY  |
+---------------+---------+--------+

CacheManager Init Failed. Error: -17
W1225 08:54:50.215310 265 metrics.cc:811] "**DCGM unable to start: DCGM initialization error**"
I1225 08:54:50.215796 265 metrics.cc:783] "Collecting CPU metrics"
I1225 08:54:50.215906 265 tritonserver.cc:2598]

jason-i-vv (Author) commented:

I want to run a test to verify whether densenet_onnx performs better when the H800 is split into seven MIG instances than on one full H800 card. Do you have any suggestions for how to set up this test?
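
One way to structure that comparison, sketched below under assumed ports and endpoints, is to run one Triton server on the full card and one (or seven) on MIG slices serving the same model, then drive each with perf_analyzer and compare the reports:

# Server A: full H800, HTTP endpoint on port 8000 (assumed)
perf_analyzer -m densenet_onnx -u localhost:8000 \
  --concurrency-range 1:8 -f full_gpu.csv

# Server B: a single MIG slice, HTTP endpoint on port 9000 (assumed)
perf_analyzer -m densenet_onnx -u localhost:9000 \
  --concurrency-range 1:8 -f mig_slice.csv

For the seven-slice case, the aggregate throughput is roughly the sum of the seven per-slice runs, assuming the instances are driven independently and concurrently.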

nv-braf (Contributor) commented Jan 7, 2025

This is an issue with DCGM not working properly on MIG. We have recently added support to disable DCGM: #952

Note that this only works when running MA in remote launch mode.
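
For reference, a minimal sketch of remote launch mode; the endpoints are assumptions, and the exact name of the DCGM-disable option added in #952 should be checked against model-analyzer profile --help for your version:

# 1. Start tritonserver manually inside the MIG container
tritonserver --model-repository=/mnt/models &

# 2. Point Model Analyzer at the already-running server instead of having it launch one
model-analyzer profile \
  --model-repository=/mnt/models \
  --profile-models=densenet_onnx \
  --output-model-repository-path=results \
  --triton-launch-mode=remote \
  --triton-http-endpoint=localhost:8000 \
  --triton-grpc-endpoint=localhost:8001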

jason-i-vv (Author) commented:

Thanks, I will check it.
