
[gpu][spark-rapids] Fix MIG script #1269

Merged 2 commits into master on Jan 24, 2025

Conversation

@SurajAralihalli (Contributor):

Resolves #1259

This PR updates the MIG script to align the driver installation steps with those in spark-rapids.sh.

Signed-off-by: Suraj Aralihalli <[email protected]>
Signed-off-by: Suraj Aralihalli <[email protected]>
@SurajAralihalli (Contributor Author):

Hey @cjac, I can't add you as a reviewer for this PR. Could you please take a look? Thanks!
This PR was tested for NVIDIA/spark-rapids#11675.

@cjac (Contributor) left a comment:


Hi there! Take a look at the latest version of the utility scripts here:

https://github.com/LLC-Technologies-Collier/initialization-actions/blob/db33fce27205535f440fa5cc0545cabc591e6730/gpu/install_gpu_driver.sh#L19

function os_id() ( set +x ; grep '^ID=' /etc/os-release | cut -d= -f2 | xargs ; )
function os_version() ( set +x ; grep '^VERSION_ID=' /etc/os-release | cut -d= -f2 | xargs ; )
function os_codename() ( set +x ; grep '^VERSION_CODENAME=' /etc/os-release | cut -d= -f2 | xargs ; )

function version_ge() ( set +x ; [ "$1" = "$(echo -e "$1\n$2" | sort -V | tail -n1)" ] ; )
function version_gt() ( set +x ; [ "$1" = "$2" ] && return 1 || version_ge $1 $2 ; )
function version_le() ( set +x ; [ "$1" = "$(echo -e "$1\n$2" | sort -V | head -n1)" ] ; )
function version_lt() ( set +x ; [ "$1" = "$2" ] && return 1 || version_le $1 $2 ; )

readonly -A supported_os=(
  ['debian']="10 11 12"
  ['rocky']="8 9"
  ['ubuntu']="18.04 20.04 22.04"
)

# dynamically define OS version test utility functions

if [[ "$(os_id)" == "rocky" ]];
then _os_version=$(os_version | sed -e 's/[^0-9].*$//g')
else _os_version="$(os_version)"; fi
for os_id_val in 'rocky' 'ubuntu' 'debian' ; do
  eval "function is_${os_id_val}() ( set +x ; [[ \"$(os_id)\" == '${os_id_val}' ]] ; )"

  for osver in $(echo "${supported_os["${os_id_val}"]}") ; do
    eval "function is_${os_id_val}${osver%%.*}() ( set +x ; is_${os_id_val} && [[ \"${_os_version}\" == \"${osver}\" ]] ; )"
    eval "function ge_${os_id_val}${osver%%.*}() ( set +x ; is_${os_id_val} && version_ge \"${_os_version}\" \"${osver}\" ; )"
    eval "function le_${os_id_val}${osver%%.*}() ( set +x ; is_${os_id_val} && version_le \"${_os_version}\" \"${osver}\" ; )"
  done
done

function is_debuntu() ( set +x ; is_debian || is_ubuntu ; )

function os_vercat() ( set +x
if is_ubuntu ; then os_version | sed -e 's/[^0-9]//g'
elif is_rocky ; then os_version | sed -e 's/[^0-9].*$//g'
else os_version ; fi ; )
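A self-contained sketch exercising the helpers above: the `sort -V` version comparators, plus one example of a function the `eval` loop would generate, shown expanded for clarity. The `os_id`/`os_version` stubs stand in for the real `/etc/os-release` parsing, and `printf` is used in place of `echo -e`.

```shell
# Redefined here so the snippet runs anywhere; behavior mirrors the excerpt above.
version_ge() ( [ "$1" = "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n1)" ] )
version_gt() ( [ "$1" = "$2" ] && return 1 || version_ge "$1" "$2" )

version_ge "12.4.1" "12.2.2" && echo "12.4.1 >= 12.2.2"
version_gt "550.54.15" "535.104.05" && echo "550.54.15 > 535.104.05"

os_id() ( echo debian )    # stub for: grep '^ID=' /etc/os-release | cut -d= -f2 | xargs
os_version() ( echo 12 )   # stub for VERSION_ID
is_debian() ( [[ "$(os_id)" == "debian" ]] )
# what the eval loop produces for debian 12, written out by hand:
is_debian12() ( is_debian && [[ "$(os_version)" == "12" ]] )
is_debian12 && echo "is_debian12: yes"
```

Note that `sort -V` compares dotted version strings segment by segment, which is why it handles driver versions like 535.104.05 correctly where plain lexical sort would not.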

@cjac (Contributor) left a comment:


I want to create a template system so that we don't have to keep these functions synchronized manually.

A job for another day.

@cjac (Contributor) left a comment:


I think these changes should wait on the next PR. There are a lot of things happening in the GPU driver that should be included, IMHO.

-CUDA_VERSION=$(get_metadata_attribute 'cuda-version' '12.2.2') #12.2.2
-NVIDIA_DRIVER_VERSION=$(get_metadata_attribute 'driver-version' '535.104.05') #535.104.05
+CUDA_VERSION=$(get_metadata_attribute 'cuda-version' '12.4.1') #12.2.2
+NVIDIA_DRIVER_VERSION=$(get_metadata_attribute 'driver-version' '550.54.15') #535.104.05
 CUDA_VERSION_MAJOR="${CUDA_VERSION%.*}" #12.2
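The hunk above relies on `get_metadata_attribute`, defined elsewhere in the script, which returns a cluster metadata value or a supplied default. A runnable stub of that fallback pattern, with the metadata lookup faked out so it runs anywhere:

```shell
# Stub of the metadata-with-default pattern. The real helper queries the GCE
# metadata server; here the attribute is treated as unset so the default wins.
get_metadata_attribute() {
  local attr="$1" default="$2"
  # real script (roughly): curl -s -H 'Metadata-Flavor: Google' \
  #   "http://metadata.google.internal/computeMetadata/v1/instance/attributes/${attr}"
  echo "${default}"
}
CUDA_VERSION=$(get_metadata_attribute 'cuda-version' '12.4.1')
CUDA_VERSION_MAJOR="${CUDA_VERSION%.*}"
echo "CUDA ${CUDA_VERSION} (major ${CUDA_VERSION_MAJOR})"
```

The `%.*` expansion strips the last dotted component, so 12.4.1 yields a major version of 12.4.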
Contributor:


The cuda and driver version selection code is presently getting a revamp, too. You might want to wait for that...

Contributor Author:


Thanks for letting me know. Can we include these improvements in a different PR to unblock the customers (NVIDIA/spark-rapids#11675)?

Contributor:


The revamp is in #1275 and is ready for review. If you can take a look and leave a comment, it will give the final reviewers solace to know that our users have exercised the changes.

WORKDIR=/opt/install-nvidia-driver
mkdir -p "${WORKDIR}"
pushd $_
# Fetch open souce kernel module with corresponding tag
Contributor:


*source

execute_with_retries "apt-get update"

execute_with_retries "apt-get install -y -q --no-install-recommends cuda-drivers-${NVIDIA_DRIVER_VERSION_PREFIX}"
execute_with_retries "apt-get install -y -q --no-install-recommends cuda-toolkit-${CUDA_VERSION_MAJOR//./-}"

# enable a systemd service that updates kernel headers after reboot
Contributor:


No, bad idea. The kernel should not change. To get a new kernel, upgrade the image. The kernel version should be pinned, in my opinion. Otherwise, there are too many failure states.

@SurajAralihalli (Contributor Author) commented Dec 6, 2024:


I agree. Currently, reboots can trigger kernel upgrades, potentially breaking the drivers. We should avoid kernel upgrades.

Update: I'm concerned that pinning the kernel might prevent security-related fixes from being applied to existing clusters. Please correct me if my concern is unwarranted.
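For context on the pinning discussion, holding the running kernel on a Debian-family image is typically done with apt-mark hold. A dry-run sketch (package names are typical Debian/Ubuntu names, not taken from this PR; the commands are printed rather than executed so the snippet is safe to run anywhere):

```shell
# Dry-run: print the hold commands for the running kernel's packages.
# In a real init action you would execute apt-mark directly.
kver="$(uname -r)"
for pkg in "linux-image-${kver}" "linux-headers-${kver}"; do
  echo "would run: apt-mark hold ${pkg}"
done
```

The flip side, raised above, is that held packages also stop receiving security updates until someone runs `apt-mark unhold`.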

Contributor:


I don't know if I can support customers running outdated images. I guess we've been doing it to date by applying main force. But it's not really sustainable. Can you tell me whether your users will be able to move to a new dataproc image to get new kernels?

The customer can only apply new kernel fixes if they reboot into a new kernel. When they do that, they will lose the drivers that I build for them just for that kernel version. I do not enable DKMS because we do not recommend changing kernels. Since DKMS is not installed, the system will not even attempt to build a new version of the kernel driver. The assumption is that the service account is only granted access to the driver signing private key material during the installation, or more realistically, during custom image generation.

On systems with secure boot enabled, the user will not be able to fetch the signing material required for driver re-build. It's just a lot less hassle if customers decommission old images and move to new images regularly.

Contributor Author:


We transitioned away from installing drivers using run files because this approach caused issues with drivers during kernel upgrades and subsequent reboots. This was likely due to the kernel version not being pinned, leading to kernel upgrades.

Are you suggesting we pin the kernel version for each Dataproc image version (e.g., v2.0, v2.1, v2.2)?
If a user has a long-running GPU cluster with v2.2 and needs to upgrade the kernel for security patches, what would be the recommended process?

Contributor Author:


> It's just a lot less hassle if customers decommission old images and move to new images regularly

  • Does Dataproc (v2.0, v2.1, v2.2) support minor versions?
  • How can one ensure the latest kernel is installed during cluster creation, specifically at the time of installing the run file?

Contributor:


Does creating a new cluster not work? Unless you have a reliable way to change kernels and reboot into the new one prior to running the installer, I'd say that there is not really a supported way to run a kernel not shipped with the image.

If you have a customer needing to change kernels who is not able to move to the latest dataproc image, can you have them open a support case and request that it be assigned to cjac for triage?

@cjac (Contributor) commented Dec 20, 2024:

> I want to create a template system so that we don't have to keep these functions synchronized manually.
>
> A job for another day.

That day is today. Please take a look at #1282

@SurajAralihalli (Contributor Author):

> That day is today. Please take a look at #1282

Thanks for sharing! I'm currently traveling; I'll get to it as soon as I can.

@cjac (Contributor) commented Jan 10, 2025:

I've moved the mig-specific changes into PR #1284

@cjac (Contributor) left a comment:


Thank you for the changes. These look good.


# Hold all NVIDIA-related packages to prevent unintentional upgrades by users or services like unattended-upgrades
# Users should run apt-mark unhold when they wish to upgrade these packages
function hold_nvidia_packages() {
Contributor:


Thanks for this function. I've integrated it into the templates/gpu/util_functions file from #1282
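The body of hold_nvidia_packages is elided in the excerpt above. A dry-run sketch of the hold pattern (the helper name and grep pattern are illustrative, not the PR's actual code); it reads the package list from stdin so it can run without dpkg:

```shell
# Illustrative dry-run: select NVIDIA/CUDA packages from a list on stdin and
# print the apt-mark hold command for each rather than executing it.
hold_matching_packages() {
  grep -Ei '^(nvidia|cuda|libnvidia)' | while read -r pkg; do
    echo "apt-mark hold ${pkg}"
  done
}
printf 'nvidia-driver-550\nlibc6\ncuda-toolkit-12-4\n' | hold_matching_packages
```

The real function would feed it something like `dpkg-query -W -f='${Package}\n'` and run `apt-mark hold` directly.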

readonly LOCAL_INSTALLER_DEB="cuda-repo-ubuntu${UBUNTU_VERSION}04-${CUDA_VERSION_MAJOR//./-}-local_${CUDA_VERSION}-${NVIDIA_DRIVER_VERSION}-1_amd64.deb"
curl -fsSL --retry-connrefused --retry 3 --retry-max-time 5 \
"https://developer.download.nvidia.com/compute/cuda/${CUDA_VERSION}/local_installers/${LOCAL_INSTALLER_DEB}" -o /tmp/local-installer.deb
# Ubuntu 18.04 is not supported by new style NV debs; install from .run files + github
Contributor:


The new code only installs driver utils and CUDA from .run files; I can add the network install support back in if someone asks for it.
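For reference, the LOCAL_INSTALLER_DEB interpolation in the hunk above expands as shown below (the values are the ones discussed in this PR; UBUNTU_VERSION here is assumed to be the two-digit year so that "${UBUNTU_VERSION}04" yields 2204):

```shell
UBUNTU_VERSION=22            # assumption: "22", making "${UBUNTU_VERSION}04" = 2204
CUDA_VERSION=12.4.1
CUDA_VERSION_MAJOR="${CUDA_VERSION%.*}"
NVIDIA_DRIVER_VERSION=550.54.15
# ${CUDA_VERSION_MAJOR//./-} rewrites dots to dashes: 12.4 -> 12-4
LOCAL_INSTALLER_DEB="cuda-repo-ubuntu${UBUNTU_VERSION}04-${CUDA_VERSION_MAJOR//./-}-local_${CUDA_VERSION}-${NVIDIA_DRIVER_VERSION}-1_amd64.deb"
echo "${LOCAL_INSTALLER_DEB}"   # → cuda-repo-ubuntu2204-12-4-local_12.4.1-550.54.15-1_amd64.deb
```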


systemctl reboot
}

# Verify if compatible linux distros and secure boot options are used
function check_os_and_secure_boot() {
Contributor:


Thank you for this function. I've split it into two and included the os check in templates/common/util_functions and the secure-boot checks in templates/secure-boot/util_functions

-if [[ "${SECURE_BOOT}" == "enabled" ]]; then
-  echo "Error: Secure Boot is enabled. Please disable Secure Boot while creating the cluster."
+if [[ "${SECURE_BOOT}" == "enabled" && $(echo "${DATAPROC_IMAGE_VERSION} <= 2.1" | bc -l) == 1 ]]; then
+  echo "Error: Secure Boot is not supported before image 2.2. Please disable Secure Boot while creating the cluster."
Contributor:


It's just Debian that it's not supported on before 2.2; Rocky and Ubuntu work fine with secure boot enabled on 2.0 and 2.1.

fi
}

# Detect dataproc image version from its various names
if (! test -v DATAPROC_IMAGE_VERSION) && test -v DATAPROC_VERSION; then
Contributor:


I'm using spark-submit --version to determine which dataproc version we're on.
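A sketch of that approach; the Spark-to-Dataproc mapping below is an assumption to verify against the Dataproc release notes, and the spark-submit call is stubbed so the snippet runs anywhere:

```shell
# Derive the Dataproc image line from the Spark version. The mapping
# (3.5 -> 2.2, 3.3 -> 2.1, 3.1 -> 2.0) is illustrative, not authoritative.
spark_version="3.5.1"   # stand-in for parsing `spark-submit --version` output
case "${spark_version}" in
  3.5.*) DATAPROC_IMAGE_VERSION="2.2" ;;
  3.3.*) DATAPROC_IMAGE_VERSION="2.1" ;;
  3.1.*) DATAPROC_IMAGE_VERSION="2.0" ;;
  *)     DATAPROC_IMAGE_VERSION="unknown" ;;
esac
echo "Dataproc image line: ${DATAPROC_IMAGE_VERSION}"
```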

 if [[ ${OS_NAME} == debian ]] || [[ ${OS_NAME} == ubuntu ]]; then
   export DEBIAN_FRONTEND=noninteractive
-  execute_with_retries "apt-get update"
+  execute_with_retries "apt-get --allow-releaseinfo-change update"
Contributor:


do we really want to do this? Shouldn't the customer be using a more recent image to get an apt cache with the correct releaseinfo changes?


@cjac (Contributor) commented Jan 24, 2025:

mig.sh tests do not work right now, but we have customers indicating that this version works for them.

@cjac cjac merged commit 75594a7 into GoogleCloudDataproc:master Jan 24, 2025
1 of 2 checks passed
Successfully merging this pull request may close these issues:

[gpu][spark-rapids] Consolidate mig.sh Scripts and Sync Driver Installation Steps Across Copies