-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(slurmd): enable Infiniband RDMA support #72
base: main
Are you sure you want to change the base?
Conversation
Integration test failures due to charmcraft 3.3.2 upgrade. Issue opened at canonical/charmcraft#2125 |
Synced with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, looks good. A few questions/suggestions around the implementation of the openmpi
overrides.
Also, this PR needs complimenting unit tests and code coverage to ensure everything should work as we expect without blowing a bunch of cash on Azure for integration tests 😅
# Re-enable UCX/UCT. | ||
# The OpenMPI package from archive includes a Debian patch that disables UCX to silence | ||
# warnings when running on systems without RDMA hardware. This leads to the less performant | ||
# "ob1" method being selected on Infiniband-enabled systems. By default, OpenMPI selects UCX | ||
# when Infiniband devices are available. This default functionality is restored here. | ||
# | ||
# See: | ||
# https://sources.debian.org/src/openmpi/4.1.4-3/debian/patches/no-warning-unused.patch/#L24 | ||
# https://github.com/open-mpi/ompi/issues/8367 | ||
# https://github.com/open-mpi/ompi/blob/v4.1.x/README#L763 | ||
path = "/etc/openmpi/openmpi-mca-params.conf" | ||
_logger.info("enabling OpenMPI UCX transport in %s", path) | ||
|
||
try: | ||
with open(path, "r") as f: | ||
lines = f.readlines() | ||
except FileNotFoundError: | ||
_logger.warn("%s not found. skipping OpenMPI UCX enablement", path) | ||
return | ||
|
||
# Current patch adds lines: | ||
# btl = ^uct,openib,ofi | ||
# pml = ^ucx | ||
# osc = ^ucx,pt2pt | ||
# Remove "uct" and "ucx". Values must still begin with "^", e.g. `btl = ^openib,ofi`, to leave | ||
# openib and ofi disabled. | ||
new_lines = [] | ||
for line in lines: | ||
if line.startswith(("btl =", "pml =", "osc =")): | ||
parts = line.split() | ||
values = parts[2].strip().lstrip("^").split(",") | ||
values = [v for v in values if v not in ("uct", "ucx")] | ||
# Remove line entirely if all values removed, e.g. "pml = ^ucx" | ||
if values: | ||
new_lines.append(f"{parts[0]} = ^{','.join(values)}\n") | ||
else: | ||
new_lines.append(line) | ||
|
||
with open(path, "w") as f: | ||
f.writelines(new_lines) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer if this logic was moved to it's own separate function rather than being bundled in with the install
function since it seems a bit involved.
Reason saying is that I typically try to keep all functions 1:1 since it makes unit testing much, much easier since you have less branches in one function to cover. Breaking these methods apart also makes tracking possible bugs down easier. E.g. is it an issue with how we're installing the deb packages, or is it a problem with how we're modifying the OpenMPI configuration? Also, a breaking change to how openmpu-mca-params.conf is manage shouldn't break our test ensuring the correct rdma packageset is installed.
If you only want the module to expose one public method such as install
, you can use a similar approach to what we do in slurm_ops
where install
wraps a bunch of private methods, and those private methods are tested directly in the unit tests:
For example, rdma.install()
could call private methods such as _install_rdma()
and _override_opmi_conf()
. That way the logic is more encapsulated for testing and triaging, but the developer UX for consuming this module inside the charm is more streamlined.
try: | ||
with open(path, "r") as f: | ||
lines = f.readlines() | ||
except FileNotFoundError: | ||
_logger.warn("%s not found. skipping OpenMPI UCX enablement", path) | ||
return | ||
|
||
# Current patch adds lines: | ||
# btl = ^uct,openib,ofi | ||
# pml = ^ucx | ||
# osc = ^ucx,pt2pt | ||
# Remove "uct" and "ucx". Values must still begin with "^", e.g. `btl = ^openib,ofi`, to leave | ||
# openib and ofi disabled. | ||
new_lines = [] | ||
for line in lines: | ||
if line.startswith(("btl =", "pml =", "osc =")): | ||
parts = line.split() | ||
values = parts[2].strip().lstrip("^").split(",") | ||
values = [v for v in values if v not in ("uct", "ucx")] | ||
# Remove line entirely if all values removed, e.g. "pml = ^ucx" | ||
if values: | ||
new_lines.append(f"{parts[0]} = ^{','.join(values)}\n") | ||
else: | ||
new_lines.append(line) | ||
|
||
with open(path, "w") as f: | ||
f.writelines(new_lines) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, why not modify the file content in-place rather than create a duplicate list?
try: | |
with open(path, "r") as f: | |
lines = f.readlines() | |
except FileNotFoundError: | |
_logger.warn("%s not found. skipping OpenMPI UCX enablement", path) | |
return | |
# Current patch adds lines: | |
# btl = ^uct,openib,ofi | |
# pml = ^ucx | |
# osc = ^ucx,pt2pt | |
# Remove "uct" and "ucx". Values must still begin with "^", e.g. `btl = ^openib,ofi`, to leave | |
# openib and ofi disabled. | |
new_lines = [] | |
for line in lines: | |
if line.startswith(("btl =", "pml =", "osc =")): | |
parts = line.split() | |
values = parts[2].strip().lstrip("^").split(",") | |
values = [v for v in values if v not in ("uct", "ucx")] | |
# Remove line entirely if all values removed, e.g. "pml = ^ucx" | |
if values: | |
new_lines.append(f"{parts[0]} = ^{','.join(values)}\n") | |
else: | |
new_lines.append(line) | |
with open(path, "w") as f: | |
f.writelines(new_lines) | |
file = Path("/etc/openmpi/openmpi-mca-params.conf") | |
if not file.exists(): | |
_logger.warn("%s not found. skipping OpenMPI UCX enablement", file) | |
return | |
content = file.read_text().splitlines() | |
for i, line in enumerate(content): | |
if not line.startswith(("btl =", "pml =", "osc =")): | |
continue | |
parts = line.split() | |
values = parts[2].strip().lstrip("^").split(",") | |
values = [v for v in values if v not in ("uct", "ucx")] | |
# Remove line entirely if all values removed, e.g. "pml = ^ucx" | |
if values: | |
content[i] = f"{parts[0]} = ^{','.join(values)}\n" | |
file.write_text("\n".join(content) + "\n") |
Figured this why we don't need to spend time expanding a new list in memory if we already have a fixed sized list we can create from openmpi-mca-params.conf. We also only need to worry about modifying the exact entries that we need to modify. Other benefit is one less level of nesting 😅
Raises: | ||
RDMAOpsError: Raised if error is encountered during package install. | ||
""" | ||
install_packages = ["rdma-core", "infiniband-diags"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[foresight]: I wonder if we should eventually think of enabling a charm config option like enable-debug-tools=true
which installs packages like infiniband-diags
for system administrators. This way we can keep the overall image footprint size down, but still give people the ability to dive into nodes to figure out what's going wrong with their deployment.
install_packages = ["rdma-core", "infiniband-diags"] | |
install_packages = ["rdma-core", "infiniband-diags"] |
Just wondering since I imagine the average end user will want BLAZINGLY FAST MPI, but only system administrators will want to have access to the various diagnostic tools if they are needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I yearn for the unit tests...
"""Install RDMA packages. | ||
|
||
Raises: | ||
RDMAOpsError: Raised if error is encountered during package install. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How will re-enabling UCX on non-Infiniband equipped nodes impact the user experience? Will openmpi
throw a bunch of unfriendly log messages? If so, this should be documented behavior within our documentation.
This PR enables use of Infiniband in Charmed HPC clusters. Microsoft Azure
Standard_HB120rs_v3
instances equipped with 200 Gb/sec HDR InfiniBand have been used for testing.Minimally, installing the
rdma-core
package from the Ubuntu repositories is sufficient to enable connectivity as MLNX drivers for ConnectX cards are available out-of-the-box. Here,infiniband-diags
is also installed to provide diagnostic toolsibstat
,ibping
, etc. for cluster administrators.These changes also re-enable UCX for OpenMPI. The developer-intended default for OpenMPI is to use the UCX PML when Infiniband devices are available however, a Debian patch disables UCX in order to suppress warnings in the case where RDMA hardware is not available. Without this change, measured bandwidth is the region of 3Gb/s on test systems, while upwards of 190Gb/s can be seen with the change.
String operations are used to remove
ucx
anduct
from/etc/openmpi/openmpi-mca-params.conf
. An alternative we may wish to consider is including our ownopenmpi-mca-params.conf
and copying it into place.