Enable Nvidia GPU Support #57

Merged: 15 commits merged into charmed-hpc:main on Jan 13, 2025

Conversation

dsloanm (Contributor) commented Dec 20, 2024

Draft PR as there are outstanding actions. Update: no longer a draft PR!

  • Tests need to be written.
  • Accounting configuration needs to be added: AccountingStorageTRES=gres/gpu, etc. Out of scope for this PR.
  • Better logging and status messages are needed.
    • Especially after the unit reboot following driver installation: if slurmd and slurmctld are not related, the unit status is set to "slurmd not running" instead of "relation needed". Not a blocker, and the _check_status function should fix the unit status on the next update_status event (see the sketch after this list).
  • Approaches for package install, reboot detection and (lack of) NVLink detection aren't finalised. NVLink detection is out of scope for this PR.
  • Account for pre-existing drivers. Also for a future PR.
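
As a rough illustration of the status-handling point in the bullet above, the fix amounts to preferring the missing-relation message over the service-down message when both apply. This is a minimal sketch using ops status classes, not the charm's actual _check_status code:

# Illustrative only -- not the charm's actual _check_status implementation.
from ops.model import ActiveStatus, BlockedStatus, StatusBase, WaitingStatus


def check_status(slurmctld_related: bool, slurmd_running: bool) -> StatusBase:
    """Pick a unit status, preferring the missing-relation message."""
    if not slurmctld_related:
        return BlockedStatus("Need relations: slurmctld")
    if not slurmd_running:
        return WaitingStatus("slurmd not running")
    return ActiveStatus()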

This PR enables use of Nvidia GPUs on a Charmed HPC cluster. The slurmd charm is extended to perform automated GPU detection and driver installation. The slurmctld charm is extended to be aware of GPU-enabled compute nodes and to provide the necessary configuration in slurm.conf and the new gres.conf configuration file.
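
To make the detection step concrete, the sketch below shows roughly how per-GPU gres.conf lines of the form seen in the sample output further down could be derived with pynvml. This is illustrative only, not the charm's actual implementation; the node name and the /dev/nvidiaN device-file layout are assumptions.

# Illustrative sketch only -- not the charm's actual implementation.
# Assumes the Nvidia driver is already installed and that GPUs are exposed
# as /dev/nvidia0, /dev/nvidia1, ... device files.
import socket

import pynvml


def gres_lines() -> list[str]:
    """Build gres.conf-style lines for each detected Nvidia GPU."""
    pynvml.nvmlInit()
    try:
        node = socket.gethostname()
        lines = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):  # older pynvml releases return bytes
                name = name.decode()
            # "Tesla T4" -> "tesla_t4", matching the sample output below
            gpu_type = name.lower().replace(" ", "_")
            lines.append(f"NodeName={node} Name=gpu Type={gpu_type} File=/dev/nvidia{i}")
        return lines
    finally:
        pynvml.nvmlShutdown()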

Usage

# Assuming a bootstrapped AWS juju controller and a new model
# Deploy controller, login node and compute node
juju deploy slurmctld
juju deploy sackd
# Testing has been performed on AWS g4dn.xlarge instances, each equipped with an Nvidia Tesla T4 GPU
juju deploy --constraints="instance-type=g4dn.xlarge" slurmd compute-gpu

# Relate applications
juju integrate slurmctld:login-node sackd:slurmctld
juju integrate slurmctld:slurmd compute-gpu:slurmctld

# Wait for deployment to complete

# Bring up compute node
juju run compute-gpu/0 node-configured

# Connect to login node and submit a GPU job
juju ssh sackd/0
sudo su -

# Create a new job script.
# Note the "#SBATCH --gres=gpu:tesla_t4:1" requesting a GPU.
cat << 'EOF' > test.submit
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=slurmd
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:00:30
#SBATCH --error=test.%J.err
#SBATCH --output=test.%J.out
#SBATCH --gres=gpu:tesla_t4:1

echo "Hello from GPU-enabled node `hostname`"
echo "gres.conf contents:"
cat /run/slurm/conf/gres.conf
EOF

# Submit and check output
sbatch test.submit
logout
juju ssh compute-gpu/0
sudo su -
cat test.1.out

# Sample output:
#
#   Hello from GPU-enabled node ip-172-31-2-0
#   gres.conf contents:
#   NodeName=ip-172-31-2-0 Name=gpu Type=tesla_t4 File=/dev/nvidia0
#
# A GPU-enabled job was submitted, scheduled and executed. Success!

jamesbeedy (Contributor) commented Dec 26, 2024

Hey @dsloanm, really nice! I only have one concern: auto-rebooting. Juju has default model configs auto-update: true, auto-upgrade: true; oftentimes, depending on how dated the boot image is, kernel and/or driver updates will automatically install when the charm initializes and put the machine into a reboot-required state. I don't see any issues with what you have done here... just wondering if we want to account for this needs-reboot case separately instead of rolling it into the GPU context.

In past revisions, we put the application into a blocked status if the machine needed a reboot, so that we could reboot into the new kernel before installing device drivers and avoid installing drivers against the outgoing kernel (e.g. installing a kernel update and then installing other drivers before rebooting into the new kernel).

I'm not sure if the Nvidia drivers installed via apt will have this issue, but we hit it previously when installing InfiniBand drivers with another charm, where the charm installed drivers for the running, outgoing kernel.

Thoughts?

jamesbeedy (Contributor) commented Dec 26, 2024

Possibly a conditional reboot of the machine in the dispatch file would be best so we can catch it before any charm code actually runs?

something like

#!/bin/bash
# Filename: dispatch

if ! [[ -f '.init-reboot' ]]
then
    if [[ -f '/var/run/reboot-required' ]]
    then
        reboot
    fi

    touch .init-reboot
fi

JUJU_DISPATCH_PATH="${JUJU_DISPATCH_PATH:-$0}" PYTHONPATH=lib:venv /usr/bin/env python3 ./src/charm.py

jamesbeedy (Contributor) commented:

Side note (if you do modify the dispatch file): since we now build nhc in the charmcraft build step, we probably don't need to install make in the dispatch file with apt anymore.

if ! [[ -f '.installed' ]]
then
    # Necessary to compile and install NHC
    apt-get install --assume-yes make
    touch .installed
fi

^ can be safely removed.

dsloanm (Contributor, Author) commented Jan 6, 2025

Hi @jamesbeedy, thanks and happy New Year!

You bring up a good point with auto-rebooting, particularly the issues around getting the right drivers installed as new kernel versions are being swapped in. Installing the Nvidia drivers from apt can pull in a new kernel as a dependency if you're using an older boot image and the latest available linux-modules-nvidia package in the repos isn't compatible with the running kernel. I can definitely see this causing problems with infiniband drivers, as you say, when we move on to those.

Agreed we should account for this auto-update/needs-reboot case. I'll reproduce it in my test environment then try out the dispatch file changes.

jamesbeedy (Contributor) commented Jan 6, 2025

Hey- happy new year to you too!

Something else came to mind... oftentimes, users provide their own drivers for GPUs and network cards. I'm wondering if we can detect a pre-existing driver installation and skip installing drivers if they're already on the machine. Possibly this could even be a charm config instead of detection, like install-gpu-drivers: true|false. Thoughts?
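
A hypothetical shape for that on the charm side, combining a config gate with a simple pre-existing-driver check (the install-gpu-drivers key and both helper names are illustrative, not anything that exists in this PR):

# Hypothetical sketch of gating driver installation behind a charm config
# option and a simple pre-existing-driver check. The install-gpu-drivers key
# and both helpers are illustrative names, not part of this PR.
from pathlib import Path


def nvidia_driver_present() -> bool:
    """Best-effort check for an already-installed, loaded Nvidia driver."""
    return Path("/proc/driver/nvidia/version").exists()


def should_install_gpu_drivers(config: dict) -> bool:
    """Install drivers only when the operator hasn't opted out and none exist."""
    if not config.get("install-gpu-drivers", True):
        return False
    return not nvidia_driver_present()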

dsloanm (Contributor, Author) commented Jan 7, 2025

The latest commit adds a reboot check as the first step in the install hook to account for an outgoing kernel. I went with keeping things in the charm code over a custom dispatch script. I'll also have a look at removing the make installation and see if we can sneak it into this PR.

For user-provided drivers, I'd lean towards a charm config over detection but will give it some more thought. Definitely a case we should account for though, yep.
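
For context, an early reboot check of the kind described above can look roughly like the following. This is a sketch only, not necessarily what the commit implements; it assumes the standard /var/run/reboot-required flag file and the juju-reboot hook tool available to machine charms.

# Sketch of an early reboot check in the install hook -- not necessarily what
# the commit implements. Assumes the standard /var/run/reboot-required flag
# file and the juju-reboot hook tool available in machine charm hooks.
import subprocess
from pathlib import Path


def reboot_if_required() -> None:
    """Reboot into a freshly installed kernel before installing drivers."""
    if Path("/var/run/reboot-required").exists():
        # juju-reboot --now stops the current hook and re-queues it to run
        # again once the machine is back up, by which point the flag is gone.
        subprocess.check_call(["juju-reboot", "--now"])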

dsloanm (Contributor, Author) commented Jan 7, 2025

It seems the installation of make in the dispatch script is still needed to build nhc; the charm's install hook fails without it. It looks like the charmcraft step just includes the source archive, which then gets built by utils.nhc.install().

Support for user-provided drivers is something we'll need to spec out a bit more, so I'll push it to a future PR.

dsloanm marked this pull request as ready for review January 7, 2025 18:10
dsloanm requested a review from a team as a code owner January 7, 2025 18:10
dsloanm requested review from jedel1043 and NucciTheBoss and removed the request for a team January 7, 2025 18:10
NucciTheBoss (Member) left a comment


I gotta warn you, I'm a bit of a stickler about comments in code 😅

Overall, great work so far! The bulk of this review is centered on the compute node side of things (installing the drivers, getting device files, etc.), with some comments on slurmctld. How we handle things on slurmctld is more flexible, so I'll go more in-depth after you've addressed my comments on the slurmd side of things. Looking quite good 🤩

Let me know if you have any questions!

Review threads on charms/slurmd/src/utils/machine.py, charms/slurmd/src/charm.py and charms/slurmctld/src/charm.py (outdated, resolved).
Commits pushed, including "… try/catch blocks" and "Add better support for comma-separated values in `get_slurmd_info`".
dsloanm requested a review from NucciTheBoss January 10, 2025 13:48
dsloanm (Contributor, Author) commented Jan 10, 2025

That's all the initial comments addressed. For the next round, I think one piece needing particular attention is the slurmd unit tests. We're now building ubuntu-drivers-common from source, which isn't an issue for the integration tests, since the charm build handles everything, but it does complicate the unit tests since the host machine is unlikely to have the build dependencies for ubuntu-drivers-common (libdrm-dev, libkmod-dev, etc.) installed.

To work around this, I've omitted ubuntu-drivers-common from test-requirements.txt and added module_mocks.py to mock away the missing UbuntuDrivers and apt_pkg modules. This must happen before the `from charm import SlurmdCharm` line in the test code or it will attempt to import apt_pkg before the fake module is in place. The current implementation introduces an ordering requirement in our imports (`from module_mocks import apt_pkg_mock, detect_mock` must come before `from charm import SlurmdCharm`). This isn't ideal, especially as it's an order that doesn't follow the Python standard (`# noqa: I001` is used to keep the linter at bay).

Ideas for improving this would be much appreciated, as would thoughts on tidying up the stack of `@patch("pynvml.*")` decorators on test_slurmctld_on_relation_created.
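
For reference, the module_mocks.py approach described above is essentially the standard sys.modules-injection pattern, roughly as sketched below (the real file will differ in detail):

# Rough sketch of the sys.modules-injection approach used by module_mocks.py
# to stub out modules whose build dependencies aren't present on the test
# host. The real file will differ in detail.
import sys
from unittest.mock import MagicMock

apt_pkg_mock = MagicMock()
detect_mock = MagicMock()
ubuntu_drivers_mock = MagicMock(detect=detect_mock)

# These assignments must run before `from charm import SlurmdCharm`, otherwise
# the real apt_pkg/UbuntuDrivers imports are attempted -- hence the
# import-order requirement (and the `# noqa: I001`) mentioned above.
sys.modules["apt_pkg"] = apt_pkg_mock
sys.modules["UbuntuDrivers"] = ubuntu_drivers_mock
sys.modules["UbuntuDrivers.detect"] = detect_mock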

NucciTheBoss (Member) left a comment


This is looking really good 🤩!

Just a couple of comments around documentation and method names. After addressing those, I think I'm good to merge this PR.

Review threads on charms/slurmctld/src/charm.py, charms/slurmd/src/utils/machine.py, charms/slurmd/tests/unit/module_mocks.py, charms/slurmd/src/utils/gpu.py, charms/slurmctld/src/interface_slurmd.py and charms/slurmd/requirements.txt (resolved).
dsloanm requested a review from NucciTheBoss January 13, 2025 13:24
dsloanm (Contributor, Author) commented Jan 13, 2025

The latest comments are now addressed. Let me know if you spot anything else that needs fixing.

NucciTheBoss (Member) left a comment


Just one last typing-related thing I noticed. After addressing that, I think we're good to 🚢

Review thread on charms/slurmctld/src/charm.py (outdated, resolved).
NucciTheBoss self-requested a review January 13, 2025 15:47
NucciTheBoss (Member) left a comment


Looks good. Real good 😎

Excellent work on this!

NucciTheBoss merged commit 2b5aaee into charmed-hpc:main Jan 13, 2025
5 checks passed