Enable Nvidia GPU Support #57

Merged: 15 commits merged into charmed-hpc:main on Jan 13, 2025

Conversation

dsloanm (Contributor) commented Dec 20, 2024

Draft PR as there are outstanding actions. Update: no longer a draft PR!

  • Tests need to be written.
  • Accounting configuration needs to be added: AccountingStorageTRES=gres/gpu, etc. Out of scope for this PR.
  • Better logging and status messages are needed.
    • Especially after the unit reboot following driver installation: if slurmd and slurmctld are not related, the unit status is set to "slurmd not running" instead of "relation needed". Not a blocker, and the _check_status function should fix the unit status on the next update_status event (see the sketch after this list).
  • Approaches for package install, reboot detection and (lack of) NVLink detection aren't finalised. NVLink detection is out of scope for this PR.
  • Account for pre-existing drivers. Also for a future PR.
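
As a rough illustration of the status-handling point in the bullet above, the fix amounts to preferring the missing-relation message over the service-down message when both apply. This is a minimal sketch using ops status classes, not the charm's actual _check_status code:

# Illustrative only -- not the charm's actual _check_status implementation.
from ops.model import ActiveStatus, BlockedStatus, StatusBase, WaitingStatus


def check_status(slurmctld_related: bool, slurmd_running: bool) -> StatusBase:
    """Pick a unit status, preferring the missing-relation message."""
    if not slurmctld_related:
        return BlockedStatus("Need relations: slurmctld")
    if not slurmd_running:
        return WaitingStatus("slurmd not running")
    return ActiveStatus()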

This PR enables use of Nvidia GPUs on a Charmed HPC cluster. The slurmd charm is extended to perform automated GPU detection and driver installation. The slurmctld charm is extended to be aware of GPU-enabled compute nodes and to provide the necessary configuration in slurm.conf and the new gres.conf configuration file.
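
To make the detection step concrete, the sketch below shows roughly how per-GPU gres.conf lines of the form seen in the sample output further down could be derived with pynvml. This is illustrative only, not the charm's actual implementation; the node name and the /dev/nvidiaN device-file layout are assumptions.

# Illustrative sketch only -- not the charm's actual implementation.
# Assumes the Nvidia driver is already installed and that GPUs are exposed
# as /dev/nvidia0, /dev/nvidia1, ... device files.
import socket

import pynvml


def gres_lines() -> list[str]:
    """Build gres.conf-style lines for each detected Nvidia GPU."""
    pynvml.nvmlInit()
    try:
        node = socket.gethostname()
        lines = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):  # older pynvml releases return bytes
                name = name.decode()
            # "Tesla T4" -> "tesla_t4", matching the sample output below
            gpu_type = name.lower().replace(" ", "_")
            lines.append(f"NodeName={node} Name=gpu Type={gpu_type} File=/dev/nvidia{i}")
        return lines
    finally:
        pynvml.nvmlShutdown()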

Usage

# Assuming a bootstrapped AWS juju controller and a new model
# Deploy controller, login node and compute node
juju deploy slurmctld
juju deploy sackd
# Testing has been performed on AWS g4dn.xlarge instances, each equipped with an Nvidia Tesla T4 GPU
juju deploy --constraints="instance-type=g4dn.xlarge" slurmd compute-gpu

# Relate applications
juju integrate slurmctld:login-node sackd:slurmctld
juju integrate slurmctld:slurmd compute-gpu:slurmctld

# Wait for deployment to complete

# Bring up compute node
juju run compute-gpu/0 node-configured

# Connect to login node and submit a GPU job
juju ssh sackd/0
sudo su -

# Create a new job script.
# Note the "#SBATCH --gres=gpu:tesla_t4:1" requesting a GPU.
cat << 'EOF' > test.submit
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=slurmd
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:00:30
#SBATCH --error=test.%J.err
#SBATCH --output=test.%J.out
#SBATCH --gres=gpu:tesla_t4:1

echo "Hello from GPU-enabled node `hostname`"
echo "gres.conf contents:"
cat /run/slurm/conf/gres.conf
EOF

# Submit and check output
sbatch test.submit
logout
juju ssh compute-gpu/0
sudo su -
cat test.1.out

# Sample output:
#
#   Hello from GPU-enabled node ip-172-31-2-0
#   gres.conf contents:
#   NodeName=ip-172-31-2-0 Name=gpu Type=tesla_t4 File=/dev/nvidia0
#
# A GPU-enabled job was submitted, scheduled and executed. Success!

jamesbeedy (Contributor) commented Dec 26, 2024

Hey @dsloanm, really nice! I only have one concern: auto-rebooting. Juju has default model configs auto-update: true, auto-upgrade: true; oftentimes, depending on how dated the boot image is, kernel and/or driver updates will automatically install when the charm initializes and put the machine into a reboot-required state. I don't see any issues with what you have done here... just wondering if we want to account for this needs-reboot case separately instead of rolling it into the GPU context.

In past revisions, we put the application into a blocked status if the machine needed a reboot, so that we could reboot into the new kernel before installing device drivers and avoid installing drivers against the outgoing kernel (e.g. installing a kernel update and then installing other drivers before rebooting into the new kernel).

I'm not sure if the Nvidia drivers installed via apt will have this issue, but we hit it previously when installing InfiniBand drivers with another charm, where the charm installed drivers for the running, outgoing kernel.

Thoughts?

jamesbeedy (Contributor) commented Dec 26, 2024

Possibly a conditional reboot of the machine in the dispatch file would be best so we can catch it before any charm code actually runs?

something like

#!/bin/bash
# Filename: dispatch

if ! [[ -f '.init-reboot' ]]
then
    if [[ -f '/var/run/reboot-required' ]]
    then
        reboot
    fi

    touch .init-reboot
fi

JUJU_DISPATCH_PATH="${JUJU_DISPATCH_PATH:-$0}" PYTHONPATH=lib:venv /usr/bin/env python3 ./src/charm.py

jamesbeedy (Contributor) commented:

Side note (if you do modify the dispatch file): since we now build nhc in the charmcraft build step, we probably don't need to install make in the dispatch file with apt anymore.

if ! [[ -f '.installed' ]]
then
    # Necessary to compile and install NHC
    apt-get install --assume-yes make
    touch .installed
fi

^ can be safely removed.

dsloanm (Contributor, Author) commented Jan 6, 2025

Hi @jamesbeedy, thanks and happy New Year!

You bring up a good point with auto-rebooting, particularly the issues around getting the right drivers installed as new kernel versions are being swapped in. Installing the Nvidia drivers from apt can pull in a new kernel as a dependency if you're using an older boot image and the latest available linux-modules-nvidia package in the repos isn't compatible with the running kernel. I can definitely see this causing problems with infiniband drivers, as you say, when we move on to those.

Agreed we should account for this auto-update/needs-reboot case. I'll reproduce it in my test environment then try out the dispatch file changes.

jamesbeedy (Contributor) commented Jan 6, 2025

Hey- happy new year to you too!

Something else came to mind... oftentimes, users provide their own drivers for GPUs and network cards. I'm wondering if we can detect a pre-existing driver installation and skip installing drivers if they're already on the machine. Possibly this could even be a charm config instead of detection, like install-gpu-drivers: true|false. Thoughts?
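
A hypothetical shape for that on the charm side, combining a config gate with a simple pre-existing-driver check (the install-gpu-drivers key and both helper names are illustrative, not anything that exists in this PR):

# Hypothetical sketch of gating driver installation behind a charm config
# option and a simple pre-existing-driver check. The install-gpu-drivers key
# and both helpers are illustrative names, not part of this PR.
from pathlib import Path


def nvidia_driver_present() -> bool:
    """Best-effort check for an already-installed, loaded Nvidia driver."""
    return Path("/proc/driver/nvidia/version").exists()


def should_install_gpu_drivers(config: dict) -> bool:
    """Install drivers only when the operator hasn't opted out and none exist."""
    if not config.get("install-gpu-drivers", True):
        return False
    return not nvidia_driver_present()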

dsloanm (Contributor, Author) commented Jan 7, 2025

The latest commit adds a reboot check as the first step in the install hook to account for an outgoing kernel. I went with keeping things in the charm code over a custom dispatch script. I'll also have a look at removing the make installation and see if we can sneak it into this PR.

For user-provided drivers, I'd lean towards a charm config over detection but will give it some more thought. Definitely a case we should account for though, yep.
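
For context, an early reboot check of the kind described above can look roughly like the following. This is a sketch only, not necessarily what the commit implements; it assumes the standard /var/run/reboot-required flag file and the juju-reboot hook tool available to machine charms.

# Sketch of an early reboot check in the install hook -- not necessarily what
# the commit implements. Assumes the standard /var/run/reboot-required flag
# file and the juju-reboot hook tool available in machine charm hooks.
import subprocess
from pathlib import Path


def reboot_if_required() -> None:
    """Reboot into a freshly installed kernel before installing drivers."""
    if Path("/var/run/reboot-required").exists():
        # juju-reboot --now stops the current hook and re-queues it to run
        # again once the machine is back up, by which point the flag is gone.
        subprocess.check_call(["juju-reboot", "--now"])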

dsloanm (Contributor, Author) commented Jan 7, 2025

It seems the installation of make in the dispatch script is still needed to build nhc; the charm's install hook fails without it. It looks like the charmcraft step just includes the source archive, which then gets built by utils.nhc.install().

Support for user-provided drivers is something we'll need to spec out a bit more, so I'll push it to a future PR.

dsloanm marked this pull request as ready for review January 7, 2025 18:10
dsloanm requested a review from a team as a code owner January 7, 2025 18:10
dsloanm requested review from jedel1043 and NucciTheBoss and removed the request for a team January 7, 2025 18:10
NucciTheBoss (Member) left a comment


I gotta warn you, I'm a bit of a stickler about comments in code 😅

Overall, great work so far! The bulk of this review is centered on the compute node side of things (installing the drivers, getting device files, etc.), with some comments on slurmctld. How we handle things on slurmctld is more flexible, so I'll go more in-depth after you've addressed my comments on the slurmd side of things. Looking quite good 🤩

Let me know if you have any questions!

Review threads on charms/slurmd/src/utils/machine.py, charms/slurmd/src/charm.py and charms/slurmctld/src/charm.py (outdated, resolved).
Commits pushed, including "… try/catch blocks" and "Add better support for comma-separated values in `get_slurmd_info`".
dsloanm requested a review from NucciTheBoss January 10, 2025 13:48
dsloanm (Contributor, Author) commented Jan 10, 2025

That's all the initial comments addressed. For the next round, I think one piece needing particular attention is the slurmd unit tests. We're now building ubuntu-drivers-common from source, which isn't an issue for the integration tests, since the charm build handles everything, but it does complicate the unit tests since the host machine is unlikely to have the build dependencies for ubuntu-drivers-common (libdrm-dev, libkmod-dev, etc.) installed.

To work around this, I've omitted ubuntu-drivers-common from test-requirements.txt and added module_mocks.py to mock away the missing UbuntuDrivers and apt_pkg modules. This must happen before the `from charm import SlurmdCharm` line in the test code or it will attempt to import apt_pkg before the fake module is in place. The current implementation introduces an ordering requirement in our imports (`from module_mocks import apt_pkg_mock, detect_mock` must come before `from charm import SlurmdCharm`). This isn't ideal, especially as it's an order that doesn't follow the Python standard (`# noqa: I001` is used to keep the linter at bay).

Ideas for improving this would be much appreciated, as would thoughts on tidying up the stack of `@patch("pynvml.*")` decorators on test_slurmctld_on_relation_created.
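
For reference, the module_mocks.py approach described above is essentially the standard sys.modules-injection pattern, roughly as sketched below (the real file will differ in detail):

# Rough sketch of the sys.modules-injection approach used by module_mocks.py
# to stub out modules whose build dependencies aren't present on the test
# host. The real file will differ in detail.
import sys
from unittest.mock import MagicMock

apt_pkg_mock = MagicMock()
detect_mock = MagicMock()
ubuntu_drivers_mock = MagicMock(detect=detect_mock)

# These assignments must run before `from charm import SlurmdCharm`, otherwise
# the real apt_pkg/UbuntuDrivers imports are attempted -- hence the
# import-order requirement (and the `# noqa: I001`) mentioned above.
sys.modules["apt_pkg"] = apt_pkg_mock
sys.modules["UbuntuDrivers"] = ubuntu_drivers_mock
sys.modules["UbuntuDrivers.detect"] = detect_mock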

NucciTheBoss (Member) left a comment


This is looking really good 🤩!

Just a couple of comments around documentation and method names. After addressing those, I think I'm good to merge this PR.

Review threads on charms/slurmctld/src/charm.py, charms/slurmd/src/utils/machine.py, charms/slurmd/tests/unit/module_mocks.py, charms/slurmd/src/utils/gpu.py, charms/slurmctld/src/interface_slurmd.py and charms/slurmd/requirements.txt (resolved).
dsloanm requested a review from NucciTheBoss January 13, 2025 13:24
dsloanm (Contributor, Author) commented Jan 13, 2025

The latest comments are now addressed. Let me know if you spot anything else that needs fixing.

NucciTheBoss (Member) left a comment


Just one last typing-related thing I noticed. After addressing that, I think we're good to 🚢

Review thread on charms/slurmctld/src/charm.py (outdated, resolved).
NucciTheBoss self-requested a review January 13, 2025 15:47
NucciTheBoss (Member) left a comment


Looks good. Real good 😎

Excellent work on this!

NucciTheBoss merged commit 2b5aaee into charmed-hpc:main Jan 13, 2025
5 checks passed