model_analyzer profile with mig : DCGM initialization error #954

Open
jason-i-vv opened this issue Dec 25, 2024 · 4 comments

jason-i-vv commented Dec 25, 2024

Hardware: H800

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Oct_29_23:50:19_PDT_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0

model-analyzer --version
1.47.0

# base image
nvcr.io/nvidia/tritonserver:24.11-py3

When using the whole card there is no problem. However, after enabling MIG mode, when the container is attached to the MIG devices, model_analyzer cannot be run.

docker run -ti --rm --gpus='"device=0:0,0:1"' --network=host -v $PWD:/mnt --name triton-server tritonserver-modelanalyzer:latest

model-analyzer profile \
  --model-repository=/mnt/models \
  --profile-models=densenet_onnx \
  --output-model-repository-path=results

[Model Analyzer] Initializing GPUDevice handles
CacheManager Init Failed. Error: -17
Traceback (most recent call last):
  File "/usr/local/bin/model-analyzer", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/model_analyzer/entrypoint.py", line 263, in main
    gpus = GPUDeviceFactory().verify_requested_gpus(config.gpus)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/model_analyzer/device/gpu_device_factory.py", line 39, in __init__
    self.init_all_devices()
  File "/usr/local/lib/python3.12/dist-packages/model_analyzer/device/gpu_device_factory.py", line 58, in init_all_devices
    dcgm_handle = dcgm_agent.dcgmStartEmbedded(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/model_analyzer/monitor/dcgm/dcgm_agent.py", line 56, in wrapper
    return fn(*newargs, **newkwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/model_analyzer/monitor/dcgm/dcgm_agent.py", line 91, in dcgmStartEmbedded
    dcgm_structs._dcgmCheckReturn(ret)
  File "/usr/local/lib/python3.12/dist-packages/model_analyzer/monitor/dcgm/dcgm_structs.py", line 691, in _dcgmCheckReturn
    raise DCGMError(ret)
model_analyzer.monitor.dcgm.dcgm_structs.DCGMError_InitError: DCGM initialization error
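
As a quick sanity check inside the same container, the MIG instances and DCGM can be probed directly. This is only a diagnostic sketch; the dcgmi CLI may not be installed in the Triton image, and the listed devices will differ per setup.

# List the physical GPUs and MIG instances visible to the container
nvidia-smi -L

# If the DCGM CLI is available, check whether it can start and enumerate the devices
dcgmi discovery -l
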
jason-i-vv (Author) commented:

When I test tritonserver on its own, the server starts, but the same DCGM error appears.

I1225 08:54:50.204154 265 server.cc:631] 
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
| Backend     | Path                                                            | Config                                                                                                                       |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.00000 |
|             |                                                                 | 0","default-max-batch-size":"4"}}                                                                                            |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+

I1225 08:54:50.204199 265 server.cc:674] 
+---------------+---------+--------+
| Model         | Version | Status |
+---------------+---------+--------+
| densenet_onnx | 1       | READY  |
+---------------+---------+--------+

CacheManager Init Failed. Error: -17
W1225 08:54:50.215310 265 metrics.cc:811] "**DCGM unable to start: DCGM initialization error**"
I1225 08:54:50.215796 265 metrics.cc:783] "Collecting CPU metrics"
I1225 08:54:50.215906 265 tritonserver.cc:2598]

jason-i-vv (Author) commented:

I want to run a test to verify whether densenet_onnx performs better when the H800 is split into seven MIG instances than on one full H800 card. Do you have any suggestions for how to set up this test?
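
One way to structure that comparison, sketched below under assumed ports and endpoints, is to run one Triton server on the full card and one (or seven) on MIG slices serving the same model, then drive each with perf_analyzer and compare the reports:

# Server A: full H800, HTTP endpoint on port 8000 (assumed)
perf_analyzer -m densenet_onnx -u localhost:8000 \
  --concurrency-range 1:8 -f full_gpu.csv

# Server B: a single MIG slice, HTTP endpoint on port 9000 (assumed)
perf_analyzer -m densenet_onnx -u localhost:9000 \
  --concurrency-range 1:8 -f mig_slice.csv

For the seven-slice case, the aggregate throughput is roughly the sum of the seven per-slice runs, assuming the instances are driven independently and concurrently.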

nv-braf (Contributor) commented Jan 7, 2025

This is an issue with DCGM not working properly on MIG. We have recently added support to disable DCGM: #952

Note that this only works when running MA in remote launch mode.
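
For reference, a minimal sketch of remote launch mode; the endpoints are assumptions, and the exact name of the DCGM-disable option added in #952 should be checked against model-analyzer profile --help for your version:

# 1. Start tritonserver manually inside the MIG container
tritonserver --model-repository=/mnt/models &

# 2. Point Model Analyzer at the already-running server instead of having it launch one
model-analyzer profile \
  --model-repository=/mnt/models \
  --profile-models=densenet_onnx \
  --output-model-repository-path=results \
  --triton-launch-mode=remote \
  --triton-http-endpoint=localhost:8000 \
  --triton-grpc-endpoint=localhost:8001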

jason-i-vv (Author) commented:

Thanks, I will check it.
