Enable Nvidia GPU Support #57
Hey @dsloanm, really nice! I only have one concern: auto-rebooting. Juju has default model configs In revs past, we put the application into a blocked status if the machine needs a reboot so that we could handle rebooting to get the new kernel before installing device drivers, so we didn't install drivers for the outgoing kernel (like if you install a kernel and then install other drivers before rebooting to get the new kernel). I'm not sure if the Nvidia drivers installed via apt will have this issue, but we had the issue when we previously installed infiniband drivers with another charm, where the charm installed drivers for the running, outgoing kernel. Thoughts?
Possibly a conditional reboot of the machine in the dispatch file would be best so we can catch it before any charm code actually runs? Something like:
#!/bin/bash
# Filename: dispatch
if ! [[ -f '.init-reboot' ]]
then
    if [[ -f '/var/run/reboot-required' ]]
    then
        reboot
    fi
    touch .init-reboot
fi
JUJU_DISPATCH_PATH="${JUJU_DISPATCH_PATH:-$0}" PYTHONPATH=lib:venv /usr/bin/env python3 ./src/charm.py
Side note (if you do modify the
if ! [[ -f '.installed' ]]
then
    # Necessary to compile and install NHC
    apt-get install --assume-yes make
    touch .installed
fi
^ can be safely removed.
Hi @jamesbeedy, thanks and happy New Year! You bring up a good point with auto-rebooting, particularly the issues around getting the right drivers installed as new kernel versions are being swapped in. Installing the Nvidia drivers from apt can pull in a new kernel as a dependency if you're using an older boot image and the latest available Agreed we should account for this auto-update/needs-reboot case. I'll reproduce it in my test environment then try out the
Hey, happy new year to you too! Something else came to mind ... oftentimes, users provide their own drivers for GPU and network cards. I'm wondering if we can detect pre-existing driver installation and skip driver installation if drivers are already installed on the machine. Possibly this could even be a charm config instead of detection, like
The latest commit adds a reboot check as the first step in the install hook to account for an outgoing kernel. I went with keeping things in the charm code over a custom dispatch script. I'll also have a look at removing the For user-provided drivers, I'd lean towards a charm config over detection but will give it some more thought. Definitely a case we should account for though, yep.
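For readers following along, the reboot-first ordering described here can be sketched roughly as below. This is only an illustration under assumptions, not the code from the commit: the function names, the `install` helper and the injected `reboot` callable are hypothetical, and the only detail taken from the thread is the `/var/run/reboot-required` flag file. In a real charm the `reboot` callable would presumably be wired to Juju's `juju-reboot` hook tool.

```python
from pathlib import Path
from typing import Callable

# Flag file from the dispatch example earlier in the thread; Ubuntu creates it
# when an installed package (such as a new kernel) needs a reboot.
REBOOT_FLAG = Path("/var/run/reboot-required")


def reboot_required(flag: Path = REBOOT_FLAG) -> bool:
    """Return True if the OS has flagged a pending reboot."""
    return flag.exists()


def install(
    install_drivers: Callable[[], None],
    reboot: Callable[[], None],
    flag: Path = REBOOT_FLAG,
) -> str:
    """Hypothetical install step: reboot first if one is pending.

    Drivers are only installed when no reboot is pending, so they are never
    built against a kernel that is about to be replaced.
    """
    if reboot_required(flag):
        reboot()  # assumption: the hook is re-run after the machine comes back
        return "rebooting"
    install_drivers()
    return "installed"
```

Passing the reboot and driver-install actions in as callables keeps the ordering logic testable without touching the machine.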
Seems the installation of Support for user-provided drivers is something we'll need to spec out a bit more so will push to a future PR.
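Since user-provided drivers are deferred to a future PR, the following is only a rough sketch of how the two options discussed above (detection versus a charm config) might combine. Everything here is an assumption for illustration: the `install-gpu-driver` config key and both function names are hypothetical and are not taken from this PR. The detection relies on `/proc/driver/nvidia/version`, which exists while the Nvidia kernel module is loaded, and on `nvidia-smi` being on the `PATH`.

```python
import shutil
from pathlib import Path
from typing import Callable, Optional

# Present while the Nvidia kernel module is loaded.
NVIDIA_PROC = Path("/proc/driver/nvidia/version")


def driver_already_installed(
    proc_file: Path = NVIDIA_PROC,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> bool:
    """Best-effort check for a pre-existing Nvidia driver installation."""
    return proc_file.exists() or which("nvidia-smi") is not None


def should_install_driver(config: dict, detected: bool) -> bool:
    """Decide whether the charm should install drivers itself.

    "install-gpu-driver" is a hypothetical config key: an explicit False
    always wins, otherwise installation is skipped when a driver is detected.
    """
    if not config.get("install-gpu-driver", True):
        return False
    return not detected
```

Letting the explicit config take precedence over detection matches the preference voiced in the thread for a charm config, while detection remains a fallback.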
Adds reboot check as first step in install hook in case of a pending kernel update.
to `<1.0.0,>=0.11.0`
I gotta warn you, I'm a bit of a stickler about comments in code 😅
Overall, great work so far! The bulk of this review is centered on the compute node side of things (installing the drivers, getting device files, etc.) with some comments on slurmctld. How we handle things on slurmctld is more flexible, so I'll go more in-depth after you've addressed my comments on the slurmd side of things. Looking quite good 🤩
Let me know if you have any questions!
try/catch blocks. Add better support for comma-separated values in `get_slurmd_info`
That's all initial comments addressed. For the next round, I think a piece needing particular attention is the To work around this, I've omitted Ideas for improving this would be much appreciated. As would thoughts on tidying up the stack of
This is looking really good 🤩!
Just a couple comments around documentation and method names. After addressing those comments, I think I am good to merge this PR.
Latest comments now addressed. Let me know if you spot anything else needing fixed.
Just one last typing-related thing I noticed. After addressing that, I think we're good to 🚢
Looks good. Real good 😎
Excellent work on this!
This PR enables use of Nvidia GPUs on a Charmed HPC cluster. The `slurmd` charm is extended to perform automated GPU detection and driver installation. The `slurmctld` charm is extended to be aware of GPU-enabled compute nodes and to provide the necessary configuration in `slurm.conf` and the new `gres.conf` configuration file.
Usage