Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(slurmd): enable Infiniband RDMA support #72

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

dsloanm
Copy link
Contributor

@dsloanm dsloanm commented Jan 28, 2025

This PR enables use of Infiniband in Charmed HPC clusters. Microsoft Azure Standard_HB120rs_v3 instances equipped with 200 Gb/sec HDR InfiniBand have been used for testing.

Minimally, installing the rdma-core package from the Ubuntu repositories is sufficient to enable connectivity as MLNX drivers for ConnectX cards are available out-of-the-box. Here, infiniband-diags is also installed to provide diagnostic tools ibstat, ibping, etc. for cluster administrators.

These changes also re-enable UCX for OpenMPI. The developer-intended default for OpenMPI is to use the UCX PML when Infiniband devices are available however, a Debian patch disables UCX in order to suppress warnings in the case where RDMA hardware is not available. Without this change, measured bandwidth is the region of 3Gb/s on test systems, while upwards of 190Gb/s can be seen with the change.

String operations are used to remove ucx and uct from /etc/openmpi/openmpi-mca-params.conf. An alternative we may wish to consider is including our own openmpi-mca-params.conf and copying it into place.

@dsloanm dsloanm requested a review from a team as a code owner January 28, 2025 12:03
@dsloanm dsloanm requested review from jedel1043 and NucciTheBoss and removed request for a team January 28, 2025 12:03
@dsloanm
Copy link
Contributor Author

dsloanm commented Jan 28, 2025

Integration test failures due to charmcraft 3.3.2 upgrade. Issue opened at canonical/charmcraft#2125

@dsloanm
Copy link
Contributor Author

dsloanm commented Jan 29, 2025

Synced with main to pull in fix(ci): pin charmcraft to version 3.2.3 and tests are passing again.

@NucciTheBoss NucciTheBoss added the enhancement New feature or request label Jan 29, 2025
Copy link
Member

@NucciTheBoss NucciTheBoss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks good. A few questions/suggestions around the implementation of the openmpi overrides.

Also, this PR needs complimenting unit tests and code coverage to ensure everything should work as we expect without blowing a bunch of cash on Azure for integration tests 😅

Comment on lines +47 to +86
# Re-enable UCX/UCT.
# The OpenMPI package from archive includes a Debian patch that disables UCX to silence
# warnings when running on systems without RDMA hardware. This leads to the less performant
# "ob1" method being selected on Infiniband-enabled systems. By default, OpenMPI selects UCX
# when Infiniband devices are available. This default functionality is restored here.
#
# See:
# https://sources.debian.org/src/openmpi/4.1.4-3/debian/patches/no-warning-unused.patch/#L24
# https://github.com/open-mpi/ompi/issues/8367
# https://github.com/open-mpi/ompi/blob/v4.1.x/README#L763
path = "/etc/openmpi/openmpi-mca-params.conf"
_logger.info("enabling OpenMPI UCX transport in %s", path)

try:
with open(path, "r") as f:
lines = f.readlines()
except FileNotFoundError:
_logger.warn("%s not found. skipping OpenMPI UCX enablement", path)
return

# Current patch adds lines:
# btl = ^uct,openib,ofi
# pml = ^ucx
# osc = ^ucx,pt2pt
# Remove "uct" and "ucx". Values must still begin with "^", e.g. `btl = ^openib,ofi`, to leave
# openib and ofi disabled.
new_lines = []
for line in lines:
if line.startswith(("btl =", "pml =", "osc =")):
parts = line.split()
values = parts[2].strip().lstrip("^").split(",")
values = [v for v in values if v not in ("uct", "ucx")]
# Remove line entirely if all values removed, e.g. "pml = ^ucx"
if values:
new_lines.append(f"{parts[0]} = ^{','.join(values)}\n")
else:
new_lines.append(line)

with open(path, "w") as f:
f.writelines(new_lines)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer if this logic was moved to it's own separate function rather than being bundled in with the install function since it seems a bit involved.

Reason saying is that I typically try to keep all functions 1:1 since it makes unit testing much, much easier since you have less branches in one function to cover. Breaking these methods apart also makes tracking possible bugs down easier. E.g. is it an issue with how we're installing the deb packages, or is it a problem with how we're modifying the OpenMPI configuration? Also, a breaking change to how openmpu-mca-params.conf is manage shouldn't break our test ensuring the correct rdma packageset is installed.

If you only want the module to expose one public method such as install, you can use a similar approach to what we do in slurm_ops where install wraps a bunch of private methods, and those private methods are tested directly in the unit tests:

https://github.com/charmed-hpc/hpc-libs/blob/865d0710a1517b066231f421c8bd893b4c4a757b/lib/charms/hpc_libs/v0/slurm_ops.py#L533-L538

For example, rdma.install() could call private methods such as _install_rdma() and _override_opmi_conf(). That way the logic is more encapsulated for testing and triaging, but the developer UX for consuming this module inside the charm is more streamlined.

Comment on lines +60 to +86
try:
with open(path, "r") as f:
lines = f.readlines()
except FileNotFoundError:
_logger.warn("%s not found. skipping OpenMPI UCX enablement", path)
return

# Current patch adds lines:
# btl = ^uct,openib,ofi
# pml = ^ucx
# osc = ^ucx,pt2pt
# Remove "uct" and "ucx". Values must still begin with "^", e.g. `btl = ^openib,ofi`, to leave
# openib and ofi disabled.
new_lines = []
for line in lines:
if line.startswith(("btl =", "pml =", "osc =")):
parts = line.split()
values = parts[2].strip().lstrip("^").split(",")
values = [v for v in values if v not in ("uct", "ucx")]
# Remove line entirely if all values removed, e.g. "pml = ^ucx"
if values:
new_lines.append(f"{parts[0]} = ^{','.join(values)}\n")
else:
new_lines.append(line)

with open(path, "w") as f:
f.writelines(new_lines)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, why not modify the file content in-place rather than create a duplicate list?

Suggested change
try:
with open(path, "r") as f:
lines = f.readlines()
except FileNotFoundError:
_logger.warn("%s not found. skipping OpenMPI UCX enablement", path)
return
# Current patch adds lines:
# btl = ^uct,openib,ofi
# pml = ^ucx
# osc = ^ucx,pt2pt
# Remove "uct" and "ucx". Values must still begin with "^", e.g. `btl = ^openib,ofi`, to leave
# openib and ofi disabled.
new_lines = []
for line in lines:
if line.startswith(("btl =", "pml =", "osc =")):
parts = line.split()
values = parts[2].strip().lstrip("^").split(",")
values = [v for v in values if v not in ("uct", "ucx")]
# Remove line entirely if all values removed, e.g. "pml = ^ucx"
if values:
new_lines.append(f"{parts[0]} = ^{','.join(values)}\n")
else:
new_lines.append(line)
with open(path, "w") as f:
f.writelines(new_lines)
file = Path("/etc/openmpi/openmpi-mca-params.conf")
if not file.exists():
_logger.warn("%s not found. skipping OpenMPI UCX enablement", file)
return
content = file.read_text().splitlines()
for i, line in enumerate(content):
if not line.startswith(("btl =", "pml =", "osc =")):
continue
parts = line.split()
values = parts[2].strip().lstrip("^").split(",")
values = [v for v in values if v not in ("uct", "ucx")]
# Remove line entirely if all values removed, e.g. "pml = ^ucx"
if values:
content[i] = f"{parts[0]} = ^{','.join(values)}\n"
file.write_text("\n".join(content) + "\n")

Figured this why we don't need to spend time expanding a new list in memory if we already have a fixed sized list we can create from openmpi-mca-params.conf. We also only need to worry about modifying the exact entries that we need to modify. Other benefit is one less level of nesting 😅

Raises:
RDMAOpsError: Raised if error is encountered during package install.
"""
install_packages = ["rdma-core", "infiniband-diags"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[foresight]: I wonder if we should eventually think of enabling a charm config option like enable-debug-tools=true which installs packages like infiniband-diags for system administrators. This way we can keep the overall image footprint size down, but still give people the ability to dive into nodes to figure out what's going wrong with their deployment.

Suggested change
install_packages = ["rdma-core", "infiniband-diags"]
install_packages = ["rdma-core", "infiniband-diags"]

Just wondering since I imagine the average end user will want BLAZINGLY FAST MPI, but only system administrators will want to have access to the various diagnostic tools if they are needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I yearn for the unit tests...

Comment on lines +34 to +38
"""Install RDMA packages.

Raises:
RDMAOpsError: Raised if error is encountered during package install.
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will re-enabling UCX on non-Infiniband equipped nodes impact the user experience? Will openmpi throw a bunch of unfriendly log messages? If so, this should be documented behavior within our documentation.

@NucciTheBoss NucciTheBoss added the needs testing Fix/feature needs addition testing before being approved label Jan 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request needs testing Fix/feature needs addition testing before being approved
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants