build: Upgrade DCGM from 3.3.6 to 3.3.9 #7952

Closed

rmccorm4
Contributor

Possible resolution to an intermittent segfault from DCGM in various test scenarios, specifically on machines with an NVSwitch:

Signal (11) received.
 0# 0x000055F7D4359B48 in /opt/tritonserver/bin/tritonserver
 1# 0x00007F4E49FEF320 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# 0x00007F4DCA04FCB3 in /usr/lib/x86_64-linux-gnu/libnvidia-nscq.so.2
 3# 0x00007F4DCA050661 in /usr/lib/x86_64-linux-gnu/libnvidia-nscq.so.2
 4# 0x00007F4DCA04EB50 in /usr/lib/x86_64-linux-gnu/libnvidia-nscq.so.2
 5# nscq_session_path_observe in /usr/lib/x86_64-linux-gnu/libnvidia-nscq.so.2
 6# 0x00007F4E404F10E7 in /usr/lib/x86_64-linux-gnu/libdcgmmodulenvswitch.so.3
 7# 0x00007F4E4048DE53 in /usr/lib/x86_64-linux-gnu/libdcgmmodulenvswitch.so.3
 8# 0x00007F4E40474E54 in /usr/lib/x86_64-linux-gnu/libdcgmmodulenvswitch.so.3
 9# 0x00007F4E40478BCB in /usr/lib/x86_64-linux-gnu/libdcgmmodulenvswitch.so.3
10# 0x00007F4E404FFE5B in /usr/lib/x86_64-linux-gnu/libdcgmmodulenvswitch.so.3
11# 0x00007F4E405003A9 in /usr/lib/x86_64-linux-gnu/libdcgmmodulenvswitch.so.3
12# 0x00007F4E4A046A94 in /usr/lib/x86_64-linux-gnu/libc.so.6
13# __clone in /usr/lib/x86_64-linux-gnu/libc.so.6

Another possible workaround for end users, if the issue is still seen after the version upgrade, is to disable GPU metrics when running Triton:

tritonserver --allow-gpu-metrics false ...
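
For anyone launching Triton from a test harness, the workaround above could be wired in roughly as follows. This is only a sketch: `build_triton_command` is a hypothetical helper, and the `/models` path is a placeholder assumption, not something taken from this PR.

```python
import shlex
from typing import List


def build_triton_command(model_repository: str, allow_gpu_metrics: bool = False) -> List[str]:
    """Build a tritonserver argv list (hypothetical helper, not part of Triton).

    GPU metrics default to off here, mirroring the workaround in this comment.
    """
    return [
        "tritonserver",
        f"--model-repository={model_repository}",
        "--allow-gpu-metrics",
        "true" if allow_gpu_metrics else "false",
    ]


# "/models" is a placeholder path used only for illustration.
command = build_triton_command("/models")
print(shlex.join(command))
```

The list can be passed to `subprocess.run(command)` on a machine where `tritonserver` is installed; the sketch only prints it so that it runs anywhere.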

@rmccorm4
Contributor Author

This should be picked into r25.01 as well.

@@ -77,7 +77,7 @@
 "ort_version": "1.20.1",
 "ort_openvino_version": "2024.4.0",
 "standalone_openvino_version": "2024.4.0",
-"dcgm_version": "3.3.6",
+"dcgm_version": "3.3.9",
Contributor

Why are we installing the old DCGM 2.2.3 by default here?

dcgm_version = "2.2.3"
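
To make the gap concrete, here is a minimal sketch that compares the two version strings numerically rather than as text. The helper names are hypothetical and do not come from either repository.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as "3.3.9" into (3, 3, 9) for numeric comparison."""
    return tuple(int(part) for part in version.split("."))


def is_older(candidate: str, reference: str) -> bool:
    """Return True when `candidate` is strictly older than `reference`."""
    return parse_version(candidate) < parse_version(reference)


# The default questioned in this comment versus the version this PR moves to.
print(is_older("2.2.3", "3.3.9"))
```

Comparing integer tuples avoids the pitfall of plain string comparison, where "3.3.10" would sort before "3.3.9".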

Contributor

This needs to be re-verified.
I see there is a condition, and it was probably working fine before some changes.
But the catch itself is worth attention.

Contributor Author

We should update MA to stay in sync, but that doesn't need to block this PR. I will open a separate one for MA.

@rmccorm4
Contributor Author

This doesn't appear to have fixed the issue, closing for now.

@rmccorm4 rmccorm4 closed this Jan 18, 2025
3 participants