[gpu][spark-rapids] Fix MIG script #1269
Conversation
Signed-off-by: Suraj Aralihalli <[email protected]>
Signed-off-by: Suraj Aralihalli <[email protected]>
Hey @cjac, I can't add you as a reviewer for this PR. Could you please take a look? Thanks!
Hi there! Take a look at the latest version of the utility scripts here:
function os_id() ( set +x ; grep '^ID=' /etc/os-release | cut -d= -f2 | xargs ; )
function os_version() ( set +x ; grep '^VERSION_ID=' /etc/os-release | cut -d= -f2 | xargs ; )
function os_codename() ( set +x ; grep '^VERSION_CODENAME=' /etc/os-release | cut -d= -f2 | xargs ; )
function version_ge() ( set +x ; [ "$1" = "$(echo -e "$1\n$2" | sort -V | tail -n1)" ] ; )
function version_gt() ( set +x ; [ "$1" = "$2" ] && return 1 || version_ge $1 $2 ; )
function version_le() ( set +x ; [ "$1" = "$(echo -e "$1\n$2" | sort -V | head -n1)" ] ; )
function version_lt() ( set +x ; [ "$1" = "$2" ] && return 1 || version_le $1 $2 ; )
readonly -A supported_os=(
['debian']="10 11 12"
['rocky']="8 9"
['ubuntu']="18.04 20.04 22.04"
)
# dynamically define OS version test utility functions
if [[ "$(os_id)" == "rocky" ]];
then _os_version="$(os_version | sed -e 's/[^0-9].*$//g')"
else _os_version="$(os_version)"; fi
for os_id_val in 'rocky' 'ubuntu' 'debian' ; do
  eval "function is_${os_id_val}() ( set +x ; [[ \"\$(os_id)\" == '${os_id_val}' ]] ; )"
  for osver in ${supported_os["${os_id_val}"]} ; do
    eval "function is_${os_id_val}${osver%%.*}() ( set +x ; is_${os_id_val} && [[ \"${_os_version}\" == \"${osver}\" ]] ; )"
    eval "function ge${os_id_val}${osver%%.*}() ( set +x ; is_${os_id_val} && version_ge \"${_os_version}\" \"${osver}\" ; )"
    eval "function le${os_id_val}${osver%%.*}() ( set +x ; is_${os_id_val} && version_le \"${_os_version}\" \"${osver}\" ; )"
  done
done
function is_debuntu() ( set +x ; is_debian || is_ubuntu ; )
function os_vercat() ( set +x
if is_ubuntu ; then os_version | sed -e 's/[^0-9]//g'
elif is_rocky ; then os_version | sed -e 's/[^0-9].*$//g'
else os_version ; fi ; )
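For readers following the thread, here is a minimal sketch of how these generated predicates tend to be consumed; the branch bodies and the nv_repo_os naming are illustrative, not taken from this PR:
# pick a repository OS string based on the detected distribution
if is_debuntu ; then
  # e.g. "debian12" or "ubuntu2204", as produced by os_id + os_vercat
  nv_repo_os="$(os_id)$(os_vercat)"
elif is_rocky ; then
  nv_repo_os="rhel$(os_vercat)"
fi

# gate a feature on a minimum OS release using the generated comparators
if gedebian12 || geubuntu22 ; then
  echo "recent distribution detected: ${nv_repo_os}"
fi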
I want to create a template system so that we don't have to keep these functions synchronized manually.
A job for another day.
I think these changes should wait on the next PR. There are a lot of things happening in the GPU driver that should be included, IMHO.
CUDA_VERSION=$(get_metadata_attribute 'cuda-version' '12.2.2') #12.2.2
NVIDIA_DRIVER_VERSION=$(get_metadata_attribute 'driver-version' '535.104.05') #535.104.05
CUDA_VERSION=$(get_metadata_attribute 'cuda-version' '12.4.1') #12.2.2
NVIDIA_DRIVER_VERSION=$(get_metadata_attribute 'driver-version' '550.54.15') #535.104.05
CUDA_VERSION_MAJOR="${CUDA_VERSION%.*}" #12.2
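For reference, since these defaults are read from instance metadata, they can be overridden at cluster creation time without editing the script. A hedged sketch (the cluster name, region, accelerator type, and init-action path are placeholders):
gcloud dataproc clusters create my-gpu-cluster \
  --region us-central1 \
  --worker-accelerator type=nvidia-tesla-t4,count=1 \
  --initialization-actions gs://goog-dataproc-initialization-actions-us-central1/spark-rapids/mig.sh \
  --metadata cuda-version=12.4.1,driver-version=550.54.15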
The cuda and driver version selection code is presently getting a revamp, too. You might want to wait for that...
Thanks for letting me know. Can we include these improvements in a different PR to unblock the customers (NVIDIA/spark-rapids#11675)?
the revamp is in #1275 and is ready for review. If you can take a look and leave a comment, it will give the final reviewers solace to know that our users have exercised the changes.
WORKDIR=/opt/install-nvidia-driver
mkdir -p "${WORKDIR}"
pushd $_
# Fetch open souce kernel module with corresponding tag
*source
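For context, fetching the open-source kernel module at the tag matching the driver version usually looks roughly like the following; this is a sketch, and the use of NVIDIA's open-gpu-kernel-modules repository and its tag naming here is an assumption about what the script targets:
# clone only the tag matching the installed driver version
git clone --branch "${NVIDIA_DRIVER_VERSION}" --depth 1 \
  https://github.com/NVIDIA/open-gpu-kernel-modules.git "${WORKDIR}/open-gpu-kernel-modules"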
execute_with_retries "apt-get update"

execute_with_retries "apt-get install -y -q --no-install-recommends cuda-drivers-${NVIDIA_DRIVER_VERSION_PREFIX}"
execute_with_retries "apt-get install -y -q --no-install-recommends cuda-toolkit-${CUDA_VERSION_MAJOR//./-}"

# enable a systemd service that updates kernel headers after reboot
No, bad idea. The kernel should not change. To get a new kernel, upgrade the image. The kernel version should be pinned, in my opinion. Otherwise, there are too many failure states.
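If pinning is the route taken, a minimal sketch of holding the kernel on Debian/Ubuntu could look like this; the metapackage names vary by image and are assumptions here:
# hold the kernel packages so apt-get upgrade / unattended-upgrades
# cannot pull in a new kernel ABI behind the driver's back
for pkg in "linux-image-$(uname -r)" "linux-headers-$(uname -r)" \
           linux-image-cloud-amd64 linux-image-amd64 ; do
  if dpkg-query -W -f='${Status}\n' "${pkg}" 2>/dev/null | grep -q 'ok installed' ; then
    apt-mark hold "${pkg}"
  fi
done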
I agree. Currently, reboots can trigger kernel upgrades, potentially breaking the drivers. We should avoid kernel upgrades.
Update: I'm concerned that pinning the kernel might prevent security-related fixes from being applied to existing clusters. Please correct me if my concern is unwarranted.
I don't know if I can support customers running outdated images. I guess we've been doing it to date by applying main force. But it's not really sustainable. Can you tell me whether your users will be able to move to a new dataproc image to get new kernels?
The customer can only apply new kernel fixes if they reboot into a new kernel. When they do that, they will lose the drivers that I build for them just for that kernel version. I do not enable DKMS because we do not recommend changing kernels. Since DKMS is not installed, the system will not even attempt to build a new version of the kernel driver. The assumption is that the service account is only granted access to the driver signing private key material during the installation, or more realistically, during custom image generation.
On systems with secure boot enabled, the user will not be able to fetch the signing material required for driver re-build. It's just a lot less hassle if customers decommission old images and move to new images regularly.
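As a side note, the Secure Boot state that drives this policy can be read on the node itself; a small sketch using mokutil (whether mokutil is preinstalled on a given image is an assumption):
# mokutil prints "SecureBoot enabled" or "SecureBoot disabled"
SECURE_BOOT="$(mokutil --sb-state 2>/dev/null | awk '{print $2}')"
if [[ "${SECURE_BOOT}" == "enabled" ]]; then
  echo "Secure Boot is enabled; kernel modules must be signed with an enrolled MOK key"
fi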
We transitioned away from installing drivers using run files because this approach caused issues with drivers during kernel upgrades and subsequent reboots. This was likely due to the kernel version not being pinned, leading to kernel upgrades.
Are you suggesting we pin the kernel version for each Dataproc image version (e.g., v2.0, v2.1, v2.2)?
If a user has a long-running GPU cluster with v2.2 and needs to upgrade the kernel for security patches, what would be the recommended process?
> It's just a lot less hassle if customers decommission old images and move to new images regularly
- Does Dataproc (v2.0, v2.1, v2.2) support minor versions?
- How can one ensure the latest kernel is installed during cluster creation, specifically at the time of installing the run file?
Does creating a new cluster not work? Unless you have a reliable way to change kernels and reboot into the new one prior to running the installer, I'd say that there is not really a supported way to run a kernel not shipped with the image.
If you have a customer needing to change kernels who is not able to move to the latest dataproc image, can you have them open a support case and request that it be assigned to cjac for triage?
That day is today. Please take a look at #1282
Thanks for sharing! I'm currently traveling; I'll get to it as soon as I can.
I've moved the MIG-specific changes into PR #1284
Thank you for the changes. These look good.
# Hold all NVIDIA-related packages from being upgraded unintentionally by services like unattended-upgrades
# Users should run apt-mark unhold when they wish to upgrade these packages
function hold_nvidia_packages() {
thanks for this function. I've integrated this into the templates/gpu/util_functions file from #1282
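For readers who cannot see the collapsed diff, a sketch of what a hold function along these lines can look like; this pattern-based version is illustrative, not the PR's exact body:
function hold_nvidia_packages() {
  # list installed packages matching the NVIDIA/CUDA name patterns and hold them
  local pkgs
  pkgs="$(dpkg-query -W -f='${binary:Package}\n' 'nvidia-*' 'libnvidia-*' 'cuda-*' 2>/dev/null || true)"
  if [[ -n "${pkgs}" ]]; then
    echo "${pkgs}" | xargs apt-mark hold
  fi
}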
readonly LOCAL_INSTALLER_DEB="cuda-repo-ubuntu${UBUNTU_VERSION}04-${CUDA_VERSION_MAJOR//./-}-local_${CUDA_VERSION}-${NVIDIA_DRIVER_VERSION}-1_amd64.deb"
curl -fsSL --retry-connrefused --retry 3 --retry-max-time 5 \
  "https://developer.download.nvidia.com/compute/cuda/${CUDA_VERSION}/local_installers/${LOCAL_INSTALLER_DEB}" -o /tmp/local-installer.deb
# Ubuntu 18.04 is not supported by new style NV debs; install from .run files + github
The new code only installs driver utils and CUDA from .run files; I can add the network install support back in if someone asks for it.
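For comparison, the .run-file path uses the same local_installers download location as the .deb above; a hedged sketch of the toolkit-only invocation (flags per the public CUDA runfile installer, not necessarily the PR's exact call):
CUDA_RUNFILE="cuda_${CUDA_VERSION}_${NVIDIA_DRIVER_VERSION}_linux.run"
curl -fsSL --retry-connrefused --retry 3 \
  "https://developer.download.nvidia.com/compute/cuda/${CUDA_VERSION}/local_installers/${CUDA_RUNFILE}" \
  -o "/tmp/${CUDA_RUNFILE}"
# install only the toolkit; the kernel driver is built and signed separately
bash "/tmp/${CUDA_RUNFILE}" --silent --toolkit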
systemctl reboot
}

# Verify if compatible linux distros and secure boot options are used
function check_os_and_secure_boot() {
Thank you for this function. I've split it into two and included the OS check in templates/common/util_functions and the secure-boot checks in templates/secure-boot/util_functions.
if [[ "${SECURE_BOOT}" == "enabled" ]]; then | ||
echo "Error: Secure Boot is enabled. Please disable Secure Boot while creating the cluster." | ||
if [[ "${SECURE_BOOT}" == "enabled" && $(echo "${DATAPROC_IMAGE_VERSION} <= 2.1" | bc -l) == 1 ]]; then | ||
echo "Error: Secure Boot is not supported before image 2.2. Please disable Secure Boot while creating the cluster." |
It's just Debian that it's not supported on before 2.2; Rocky and Ubuntu work fine with Secure Boot enabled on 2.0 and 2.1.
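To capture that distinction, the check could be scoped to Debian only; a sketch using the variables already present in the script (OS_NAME, SECURE_BOOT, DATAPROC_IMAGE_VERSION):
if [[ "${SECURE_BOOT}" == "enabled" && "${OS_NAME}" == "debian" ]] \
   && [[ "$(echo "${DATAPROC_IMAGE_VERSION} <= 2.1" | bc -l)" == "1" ]]; then
  echo "Error: Secure Boot is not supported on Debian before image 2.2. Please disable Secure Boot while creating the cluster."
  exit 1
fi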
fi
}

# Detect dataproc image version from its various names
if (! test -v DATAPROC_IMAGE_VERSION) && test -v DATAPROC_VERSION; then
I'm using spark-submit --version to determine which dataproc version we're on.
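For illustration, deriving the image line from the bundled Spark release could look roughly like this; the Spark-to-Dataproc mapping in the case statement is an assumption, not taken from #1282:
# parse the Spark version bundled with the image
SPARK_VERSION="$(spark-submit --version 2>&1 | sed -n 's/.*version[[:space:]]*\([0-9][0-9.]*\).*/\1/p' | head -n1)"
case "${SPARK_VERSION}" in
  3.5.*) DATAPROC_IMAGE_VERSION="2.2" ;;
  3.3.*) DATAPROC_IMAGE_VERSION="2.1" ;;
  3.1.*) DATAPROC_IMAGE_VERSION="2.0" ;;
  *) echo "unrecognized Spark version: ${SPARK_VERSION}" >&2 ;;
esac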
if [[ ${OS_NAME} == debian ]] || [[ ${OS_NAME} == ubuntu ]]; then
  export DEBIAN_FRONTEND=noninteractive
  execute_with_retries "apt-get update"
  execute_with_retries "apt-get --allow-releaseinfo-change update"
Do we really want to do this? Shouldn't the customer be using a more recent image to get an apt cache with the correct releaseinfo changes?
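If the change is kept, apt also offers narrower switches that only tolerate specific Release fields changing (for example the Suite flip from stable to oldstable), which may be less surprising than accepting every release-info change; a sketch (flag availability depends on the apt version shipped in the image):
execute_with_retries "apt-get --allow-releaseinfo-change-suite update"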
mig.sh tests do not work right now, but we have customers indicating that this version works for them.
Resolves #1259
This PR updates the MIG script to align the driver installation steps with those in spark-rapids.sh.