Model Analyzer GPU Memory Usage Differences #847

Open
KimiJL opened this issue Mar 25, 2024 · 5 comments
Comments


KimiJL commented Mar 25, 2024

Version: nvcr.io/nvidia/tritonserver:24.01-py3-sdk

For a profiled model, the GPU Memory Usage (MB) shown in results/metrics-model-gpu.csv is different from the value in the model's result_summary.pdf.

In my case, metrics-model-gpu.csv shows 1592.8 while the PDF report shows 1031.

This could be my misunderstanding: do these two metrics represent the same thing? I am looking for the maximum GPU memory usage for a given model, so which would be the more accurate result?


KimiJL commented Mar 25, 2024

Additional Context:

I am using a machine with two GPUs, though the model is limited to a single model instance.

I have noticed that if I add up the GPU memory of both GPUs from the CSV and divide by 2, i.e. (470.8 + 1592.8) / 2 = 1031.8, I get close to the PDF result. Could this be a coincidence?

tgerdesnv (Collaborator) commented

Hi @KimiJL, sorry for the slow response. I just returned from vacation.

I suspect that your observation is not a coincidence and that there is a bug. We will have to investigate further.

May I ask, were you running in local mode? Or docker or remote?


KimiJL commented Apr 18, 2024

Hi @tgerdesnv, thanks for the response.

I was running with --triton-launch-mode=docker.

tgerdesnv (Collaborator) commented

@KimiJL I have confirmed that the values in the PDFs are in fact averages across the GPUs. The values in metrics-model-gpu.csv are the raw per-GPU values. So, in your case, the total maximum memory usage by the model on your machine would be 470.8 + 1592.8 = 2063.6 MB.

I will fix Model Analyzer to show total memory usage, or clarify the labels to indicate that it is average memory usage.
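
For anyone who wants to sanity-check this from the raw CSV, here is a rough sketch (the column names below are assumptions based on a metrics-model-gpu.csv from a recent release; check the header of your own file):

```python
import pandas as pd

# Per-GPU metrics written by Model Analyzer for the profiled model.
df = pd.read_csv("results/metrics-model-gpu.csv")

# Peak memory reported for each GPU.
# NOTE: "GPU UUID" and "GPU Memory Usage (MB)" are assumed column names.
mem_per_gpu = df.groupby("GPU UUID")["GPU Memory Usage (MB)"].max()

print(mem_per_gpu)         # raw per-GPU values, e.g. 470.8 and 1592.8
print(mem_per_gpu.mean())  # average across GPUs -- what the PDF summary currently reports (~1031.8)
print(mem_per_gpu.sum())   # total across GPUs (~2063.6) -- the model's overall footprint on the machine
```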


KimiJL commented Apr 24, 2024

@tgerdesnv Great, thank you for the clarification. That makes sense!
