-
Notifications
You must be signed in to change notification settings - Fork 563
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bphilip/add mofed and dependencies #11479
base: 3.0-dev
Are you sure you want to change the base?
Conversation
Infiniband is a requirement for AI/ML deployments for inter-GPU communication and scale out GPU clusters. HPC team built HPC azure linux image needs infiniband mofed drivers. Their major customer singularity is on a deadline to move from Ubuntu to azure linux for AI/ML workloads. All of the required sources are open source and have spec's which are already being used by NVIDIA. This PR brings to Azure Linux MOFED driver for infiniband driver and all dependencies to help use, manage and debug the stack. These modules have already been built, integrated into an HPC image and tested on an existing 16 mode cluster owned by HPC team. Performance characteristics are within specified tolerance limits.
# Copyright (c) 2014 Mellanox Technologies. All rights reserved. | ||
# | ||
# This Software is licensed under one of the following licenses: | ||
# | ||
# 1) under the terms of the "Common Public License 1.0" a copy of which is | ||
# available from the Open Source Initiative, see | ||
# http://www.opensource.org/licenses/cpl.php. | ||
# | ||
# 2) under the terms of the "The BSD License" a copy of which is | ||
# available from the Open Source Initiative, see | ||
# http://www.opensource.org/licenses/bsd-license.php. | ||
# | ||
# 3) under the terms of the "GNU General Public License (GPL) Version 2" a | ||
# copy of which is available from the Open Source Initiative, see | ||
# http://www.opensource.org/licenses/gpl-license.php. | ||
# | ||
# Licensee has the right to choose one of the above licenses. | ||
# | ||
# Redistributions of source code must retain the above copyright | ||
# notice and one of the license notices. | ||
# | ||
# Redistributions in binary form must reproduce both the above copyright | ||
# notice, one of the license notices in the documentation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how to interpret this licensing comment and the source attributions (we mention both Nvidia and claim we've done this ourselves at the same time). I think we need more clarify on the licensing of the new specs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took this line
Licensee has the right to choose one of the above licenses.
as the guidance. Does that work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added license copyright files as needed for SPECS fwctl, iser, mofed kernel, srp etc.
%{!?_name: %define _name fwctl} | ||
%{!?_version: %define _version 24.10} | ||
%{!?_release: %define _release OFED.24.10.0.6.7.1} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why define these _version
separately? Won't the %{version}
work? Similar question for _name
. And what's the logic behind _release
? I see it trickles down to _kmp_rel
for instance but I don't see _kmp_rel
used anywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This version is defined from MOFED SPECS for alignment. Hence keeping it same for furture updates.
%endif | ||
%global _kmp_rel %{_release1}%{?_kmp_build_num}%{?_dist} | ||
|
||
Summary: %{_name} Driver |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the specs done, please also update the version_release_matching_groups
inside toolkit/scripts/check_entangled_specs.py
. This is part of the PR check, which makes sure "entangled" specs get updated in lockstep within one PR. In case of (regular, signed) spec pairs, if you only update the regular one and forget about this one here, we'd end up either publishing a newer version without the signed .ko
files or not publishing the new version at all (I'd need to check the details).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sign pipeline changes are coming from another PR. Is this can be done with signing pipeline changes for signed SPECs?
@@ -17,6 +17,7 @@ | |||
"Fedora (ISC)": r'\n-\s+Initial (CBL-Mariner|Azure Linux) import from Fedora \d+ \(license: ISC\)(\.|\n|$)', | |||
"Magnus Edenhill Open Source": r'\n-\s+Initial (CBL-Mariner|Azure Linux) import from Magnus Edenhill Open Source \(license: BSD\)(\.|\n|$)', | |||
"NVIDIA": r'\n-\s+Initial (CBL-Mariner|Azure Linux) import from NVIDIA \(license: (ASL 2\.0|GPLv2)\)(\.|\n|$)', | |||
"NVIDIA (BSD)": r'\n-\s+Initial (CBL-Mariner|Azure Linux) import from NVIDIA \(BSD\) \(license: (BSD)\)(\.|\n|$)', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for "CBL-Mariner"` in the new source attributions. The other ones are to support the old specs. And there's no need to repeat "BSD", I think.
"NVIDIA (BSD)": r'\n-\s+Initial (CBL-Mariner|Azure Linux) import from NVIDIA \(BSD\) \(license: (BSD)\)(\.|\n|$)', | |
"NVIDIA (BSD)": r'\n-\s+Initial Azure Linux import from NVIDIA \(license: (BSD)\)(\.|\n|$)', |
A bigger question, perhaps: why add a new entry specially for BSD? I see the older one for Nvidia already covers two licenses. Is there a reason we don't want to just update that one?
BuildRequires: gcc | ||
BuildRequires: make | ||
BuildRequires: kernel-devel = %{target_kernel_version_full} | ||
BuildRequires: kernel-headers = %{target_kernel_version_full} | ||
BuildRequires: binutils | ||
BuildRequires: systemd | ||
BuildRequires: kmod | ||
BuildRequires: mlnx-ofa_kernel-devel = %{_version} | ||
BuildRequires: mlnx-ofa_kernel-source = %{_version} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General comment for all signed specs: they don't need most BuildRequires
, since they only expand already built RPMs and replace a few files. The packages providing rpm2cpio
and cpio
are part of the default build chroot but to be on the safe side, I'd add them to BuildRequires
.
Requires: kmod | ||
|
||
%description | ||
fwctl signed kernel modules |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the description, that the published package will have, so I'd suggest something, that gives users more information about the package.
# This package's "version" and "release" must reflect the unsigned version that | ||
# was signed. | ||
# An important consequence is that when making a change to this package, the | ||
# unsigned version/release must be increased to keep the two versions consistent. | ||
# Ideally though, this spec will not change much or at all, so the version will | ||
# just track the unsigned package's version/release. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for the comment - the specs entanglement PR check will handle that for you. Just make sure to update the set of spec pairs, which need to be kept in sync.
Source1: fwctl.ko | ||
Source2: mlx5_fwctl.ko | ||
|
||
BuildRoot: /var/tmp/%{name}-%{version}-build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a need to define BuildRoot
explicitly here instead of using the default?
%postun | ||
/sbin/depmod %{KVERSION} | ||
|
||
%if "%{KMP}" != "1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure KMP
makes much sense in the signed spec, since its only purpose is to repackage the unsigned RPM. Also, with the way it's written right now, if KMP
ever was equal to 1, we'd end up with an invalid spec file (no %files
section).
@@ -0,0 +1,142 @@ | |||
# |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a bunch of comments into this spec but they apply to most/all other signed spec files.
License: GPLv2 | ||
Url: http://nvidia.com | ||
Group: System Environment/Base | ||
Source0: https://linux.mellanox.com/public/repo/mlnx_ofed/24.10-0.7.0.0/SRPMS/fwctl-24.10.tgz#/%{_name}-%{_version}.tgz |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also use macros to parameterize the URL, so that it updates automatically with the macros. This makes it easier to miss necessary changes during updates.
%if "%{KMP}" != "1" | ||
%files -n %{non_kmp_pname} | ||
/lib/modules/%{KVERSION}/%{install_mod_dir}/ | ||
%if %{IS_RHEL_VENDOR} | ||
%if ! 0%{?fedora} | ||
%config(noreplace) %{_sysconfdir}/depmod.d/%{_name}.conf | ||
%endif | ||
%endif | ||
%endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing %license
. Not strictly necessary when there's a signed spec counterpart to this one but consistency helps. Having said that, a file explaining the license IS necessary.
Merge Checklist
All boxes should be checked before merging the PR (just tick any boxes which don't apply to this PR)
*-static
subpackages, etc.) have had theirRelease
tag incremented../cgmanifest.json
,./toolkit/scripts/toolchain/cgmanifest.json
,.github/workflows/cgmanifest.json
)./LICENSES-AND-NOTICES/SPECS/data/licenses.json
,./LICENSES-AND-NOTICES/SPECS/LICENSES-MAP.md
,./LICENSES-AND-NOTICES/SPECS/LICENSE-EXCEPTIONS.PHOTON
)*.signatures.json
filessudo make go-tidy-all
andsudo make go-test-coverage
passSummary
What does the PR accomplish, why was it needed?
This PR brings in MOFED and dependent packages required to build our AI/ML distribution story. Immediately HPC images require infiniband for GPU-clustering. MOFED is thus a core requirement. There are first and third party requirments for MOFED drivers directly and dependent packages.
CUDA and GdrCopy drivers need MOFED as a build-time and runtime requirements. This is not satisfied today and CUDA driver is built without MOFED support. These packages getting integrated into core will enable that work stream to be unblocked as well.
Change Log
There are 18 new packages being added by this change to the SPECS directory. There are 8 new SIGNED-SPECS being added too. They are either directly required to build out MOFED and support infrastructure or are tools required to provide a complete infiniband solution like tools to configure the hw/driver and debug issues.
fwctl
ibarr
ibsim
iser
isert
knem
mft_kernel
mlnx-ethtool
mlnx-iproute2
mlnx-nfsrdma
mlnx-ofa_kernel
mlx-steering-dump
multiperf rshim
sockperf
srp
xpmem-lib
xpmem
Does this affect the toolchain?
No
YES/NO
Associated issues
Links to CVEs
Test Methodology
https://dev.azure.com/mariner-org/mariner/_build/results?buildId=709280&view=results
The RPMs from a local build of these specs were installed on A100, V100 and T4 GPU based VMs and functionality verified.