
[gpu][spark-rapids] Fix MIG script #1269

Merged 2 commits into master on Jan 24, 2025

Conversation

@SurajAralihalli (Contributor):

Resolves #1259

This PR updates the MIG script to align the driver installation steps with those in spark-rapids.sh.

Signed-off-by: Suraj Aralihalli <[email protected]>
Signed-off-by: Suraj Aralihalli <[email protected]>
@SurajAralihalli (Contributor Author):

Hey @cjac, I can't add you as a reviewer for this PR. Could you please take a look? Thanks!
This PR was tested for NVIDIA/spark-rapids#11675.

@cjac (Contributor) left a comment:


Hi there! Take a look at the latest version of the utility scripts here:

https://github.com/LLC-Technologies-Collier/initialization-actions/blob/db33fce27205535f440fa5cc0545cabc591e6730/gpu/install_gpu_driver.sh#L19

function os_id() ( set +x ; grep '^ID=' /etc/os-release | cut -d= -f2 | xargs ; )
function os_version() ( set +x ; grep '^VERSION_ID=' /etc/os-release | cut -d= -f2 | xargs ; )
function os_codename() ( set +x ; grep '^VERSION_CODENAME=' /etc/os-release | cut -d= -f2 | xargs ; )

function version_ge() ( set +x ; [ "$1" = "$(echo -e "$1\n$2" | sort -V | tail -n1)" ] ; )
function version_gt() ( set +x ; [ "$1" = "$2" ] && return 1 || version_ge $1 $2 ; )
function version_le() ( set +x ; [ "$1" = "$(echo -e "$1\n$2" | sort -V | head -n1)" ] ; )
function version_lt() ( set +x ; [ "$1" = "$2" ] && return 1 || version_le $1 $2 ; )

readonly -A supported_os=(
  ['debian']="10 11 12"
  ['rocky']="8 9"
  ['ubuntu']="18.04 20.04 22.04"
)

# dynamically define OS version test utility functions

if [[ "$(os_id)" == "rocky" ]];
then _os_version=$(os_version | sed -e 's/[^0-9].*$//g')
else _os_version="$(os_version)"; fi
for os_id_val in 'rocky' 'ubuntu' 'debian' ; do
  eval "function is_${os_id_val}() ( set +x ; [[ \"$(os_id)\" == '${os_id_val}' ]] ; )"

  for osver in $(echo "${supported_os["${os_id_val}"]}") ; do
    eval "function is_${os_id_val}${osver%%.*}() ( set +x ; is_${os_id_val} && [[ \"${_os_version}\" == \"${osver}\" ]] ; )"
    eval "function ge_${os_id_val}${osver%%.*}() ( set +x ; is_${os_id_val} && version_ge \"${_os_version}\" \"${osver}\" ; )"
    eval "function le_${os_id_val}${osver%%.*}() ( set +x ; is_${os_id_val} && version_le \"${_os_version}\" \"${osver}\" ; )"
  done
done

function is_debuntu() ( set +x ; is_debian || is_ubuntu ; )

function os_vercat() ( set +x
if is_ubuntu ; then os_version | sed -e 's/[^0-9]//g'
elif is_rocky ; then os_version | sed -e 's/[^0-9].*$//g'
else os_version ; fi ; )
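A self-contained sketch exercising the helpers above: the `sort -V` version comparators, plus one example of a function the `eval` loop would generate, shown expanded for clarity. The `os_id`/`os_version` stubs stand in for the real `/etc/os-release` parsing, and `printf` is used in place of `echo -e`.

```shell
# Redefined here so the snippet runs anywhere; behavior mirrors the excerpt above.
version_ge() ( [ "$1" = "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n1)" ] )
version_gt() ( [ "$1" = "$2" ] && return 1 || version_ge "$1" "$2" )

version_ge "12.4.1" "12.2.2" && echo "12.4.1 >= 12.2.2"
version_gt "550.54.15" "535.104.05" && echo "550.54.15 > 535.104.05"

os_id() ( echo debian )    # stub for: grep '^ID=' /etc/os-release | cut -d= -f2 | xargs
os_version() ( echo 12 )   # stub for VERSION_ID
is_debian() ( [[ "$(os_id)" == "debian" ]] )
# what the eval loop produces for debian 12, written out by hand:
is_debian12() ( is_debian && [[ "$(os_version)" == "12" ]] )
is_debian12 && echo "is_debian12: yes"
```

Note that `sort -V` compares dotted version strings segment by segment, which is why it handles driver versions like 535.104.05 correctly where plain lexical sort would not.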

@cjac (Contributor) left a comment:


I want to create a template system so that we don't have to keep these functions synchronized manually.

A job for another day.

@cjac (Contributor) left a comment:


I think these changes should wait on the next PR. There are a lot of things happening in the GPU driver that should be included, IMHO.

-CUDA_VERSION=$(get_metadata_attribute 'cuda-version' '12.2.2') #12.2.2
-NVIDIA_DRIVER_VERSION=$(get_metadata_attribute 'driver-version' '535.104.05') #535.104.05
+CUDA_VERSION=$(get_metadata_attribute 'cuda-version' '12.4.1') #12.2.2
+NVIDIA_DRIVER_VERSION=$(get_metadata_attribute 'driver-version' '550.54.15') #535.104.05
 CUDA_VERSION_MAJOR="${CUDA_VERSION%.*}" #12.2
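The hunk above relies on `get_metadata_attribute`, defined elsewhere in the script, which returns a cluster metadata value or a supplied default. A runnable stub of that fallback pattern, with the metadata lookup faked out so it runs anywhere:

```shell
# Stub of the metadata-with-default pattern. The real helper queries the GCE
# metadata server; here the attribute is treated as unset so the default wins.
get_metadata_attribute() {
  local attr="$1" default="$2"
  # real script (roughly): curl -s -H 'Metadata-Flavor: Google' \
  #   "http://metadata.google.internal/computeMetadata/v1/instance/attributes/${attr}"
  echo "${default}"
}
CUDA_VERSION=$(get_metadata_attribute 'cuda-version' '12.4.1')
CUDA_VERSION_MAJOR="${CUDA_VERSION%.*}"
echo "CUDA ${CUDA_VERSION} (major ${CUDA_VERSION_MAJOR})"
```

The `%.*` expansion strips the last dotted component, so 12.4.1 yields a major version of 12.4.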
Contributor:


The cuda and driver version selection code is presently getting a revamp, too. You might want to wait for that...

Contributor Author:


Thanks for letting me know. Can we include these improvements in a different PR to unblock the customers (NVIDIA/spark-rapids#11675)?

Contributor:


The revamp is in #1275 and is ready for review. If you can take a look and leave a comment, it will give the final reviewers solace to know that our users have exercised the changes.

WORKDIR=/opt/install-nvidia-driver
mkdir -p "${WORKDIR}"
pushd $_
# Fetch open souce kernel module with corresponding tag
Contributor:


*source

execute_with_retries "apt-get update"

execute_with_retries "apt-get install -y -q --no-install-recommends cuda-drivers-${NVIDIA_DRIVER_VERSION_PREFIX}"
execute_with_retries "apt-get install -y -q --no-install-recommends cuda-toolkit-${CUDA_VERSION_MAJOR//./-}"

# enable a systemd service that updates kernel headers after reboot
Contributor:


No, bad idea. The kernel should not change. To get a new kernel, upgrade the image. The kernel version should be pinned, in my opinion. Otherwise, there are too many failure states.

@SurajAralihalli (Contributor Author) commented Dec 6, 2024:


I agree. Currently, reboots can trigger kernel upgrades, potentially breaking the drivers. We should avoid kernel upgrades.

Update: I'm concerned that pinning the kernel might prevent security-related fixes from being applied to existing clusters. Please correct me if my concern is unwarranted.
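For context on the pinning discussion, holding the running kernel on a Debian-family image is typically done with apt-mark hold. A dry-run sketch (package names are typical Debian/Ubuntu names, not taken from this PR; the commands are printed rather than executed so the snippet is safe to run anywhere):

```shell
# Dry-run: print the hold commands for the running kernel's packages.
# In a real init action you would execute apt-mark directly.
kver="$(uname -r)"
for pkg in "linux-image-${kver}" "linux-headers-${kver}"; do
  echo "would run: apt-mark hold ${pkg}"
done
```

The flip side, raised above, is that held packages also stop receiving security updates until someone runs `apt-mark unhold`.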

Contributor:


I don't know if I can support customers running outdated images. I guess we've been doing it to date by applying main force. But it's not really sustainable. Can you tell me whether your users will be able to move to a new dataproc image to get new kernels?

The customer can only apply new kernel fixes if they reboot into a new kernel. When they do that, they will lose the drivers that I build for them just for that kernel version. I do not enable DKMS because we do not recommend changing kernels. Since DKMS is not installed, the system will not even attempt to build a new version of the kernel driver. The assumption is that the service account is only granted access to the driver signing private key material during the installation, or more realistically, during custom image generation.

On systems with secure boot enabled, the user will not be able to fetch the signing material required for driver re-build. It's just a lot less hassle if customers decommission old images and move to new images regularly.

Contributor Author:


We transitioned away from installing drivers using run files because this approach caused issues with drivers during kernel upgrades and subsequent reboots. This was likely due to the kernel version not being pinned, leading to kernel upgrades.

Are you suggesting we pin the kernel version for each Dataproc image version (e.g., v2.0, v2.1, v2.2)?
If a user has a long-running GPU cluster with v2.2 and needs to upgrade the kernel for security patches, what would be the recommended process?

Contributor Author:


> It's just a lot less hassle if customers decommission old images and move to new images regularly

  • Does Dataproc (v2.0, v2.1, v2.2) support minor versions?
  • How can one ensure the latest kernel is installed during cluster creation, specifically at the time of installing the run file?

Contributor:


Does creating a new cluster not work? Unless you have a reliable way to change kernels and reboot into the new one prior to running the installer, I'd say that there is not really a supported way to run a kernel not shipped with the image.

If you have a customer needing to change kernels who is not able to move to the latest dataproc image, can you have them open a support case and request that it be assigned to cjac for triage?

@cjac (Contributor) commented Dec 20, 2024:

> I want to create a template system so that we don't have to keep these functions synchronized manually.
>
> A job for another day.

That day is today. Please take a look at #1282

@SurajAralihalli (Contributor Author):

> That day is today. Please take a look at #1282

Thanks for sharing! I'm currently traveling; I'll get to it as soon as I can.

@cjac (Contributor) commented Jan 10, 2025:

I've moved the mig-specific changes into PR #1284

@cjac (Contributor) left a comment:


Thank you for the changes. These look good.


# Hold all NVIDIA-related packages to prevent unintentional upgrades by users or services like unattended-upgrades
# Users should run apt-mark unhold when they wish to upgrade these packages
function hold_nvidia_packages() {
Contributor:


Thanks for this function. I've integrated it into the templates/gpu/util_functions file from #1282
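The body of hold_nvidia_packages is elided in the excerpt above. A dry-run sketch of the hold pattern (the helper name and grep pattern are illustrative, not the PR's actual code); it reads the package list from stdin so it can run without dpkg:

```shell
# Illustrative dry-run: select NVIDIA/CUDA packages from a list on stdin and
# print the apt-mark hold command for each rather than executing it.
hold_matching_packages() {
  grep -Ei '^(nvidia|cuda|libnvidia)' | while read -r pkg; do
    echo "apt-mark hold ${pkg}"
  done
}
printf 'nvidia-driver-550\nlibc6\ncuda-toolkit-12-4\n' | hold_matching_packages
```

The real function would feed it something like `dpkg-query -W -f='${Package}\n'` and run `apt-mark hold` directly.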

readonly LOCAL_INSTALLER_DEB="cuda-repo-ubuntu${UBUNTU_VERSION}04-${CUDA_VERSION_MAJOR//./-}-local_${CUDA_VERSION}-${NVIDIA_DRIVER_VERSION}-1_amd64.deb"
curl -fsSL --retry-connrefused --retry 3 --retry-max-time 5 \
"https://developer.download.nvidia.com/compute/cuda/${CUDA_VERSION}/local_installers/${LOCAL_INSTALLER_DEB}" -o /tmp/local-installer.deb
# Ubuntu 18.04 is not supported by new style NV debs; install from .run files + github
Contributor:


The new code only installs driver utils and CUDA from .run files; I can add the network install support back in if someone asks for it.
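For reference, the LOCAL_INSTALLER_DEB interpolation in the hunk above expands as shown below (the values are the ones discussed in this PR; UBUNTU_VERSION here is assumed to be the two-digit year so that "${UBUNTU_VERSION}04" yields 2204):

```shell
UBUNTU_VERSION=22            # assumption: "22", making "${UBUNTU_VERSION}04" = 2204
CUDA_VERSION=12.4.1
CUDA_VERSION_MAJOR="${CUDA_VERSION%.*}"
NVIDIA_DRIVER_VERSION=550.54.15
# ${CUDA_VERSION_MAJOR//./-} rewrites dots to dashes: 12.4 -> 12-4
LOCAL_INSTALLER_DEB="cuda-repo-ubuntu${UBUNTU_VERSION}04-${CUDA_VERSION_MAJOR//./-}-local_${CUDA_VERSION}-${NVIDIA_DRIVER_VERSION}-1_amd64.deb"
echo "${LOCAL_INSTALLER_DEB}"   # → cuda-repo-ubuntu2204-12-4-local_12.4.1-550.54.15-1_amd64.deb
```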


systemctl reboot
}

# Verify if compatible linux distros and secure boot options are used
function check_os_and_secure_boot() {
Contributor:


Thank you for this function. I've split it into two and included the os check in templates/common/util_functions and the secure-boot checks in templates/secure-boot/util_functions

-if [[ "${SECURE_BOOT}" == "enabled" ]]; then
-  echo "Error: Secure Boot is enabled. Please disable Secure Boot while creating the cluster."
+if [[ "${SECURE_BOOT}" == "enabled" && $(echo "${DATAPROC_IMAGE_VERSION} <= 2.1" | bc -l) == 1 ]]; then
+  echo "Error: Secure Boot is not supported before image 2.2. Please disable Secure Boot while creating the cluster."
Contributor:


It's just Debian that it's not supported on before 2.2; Rocky and Ubuntu work fine with secure boot enabled on 2.0 and 2.1.

fi
}

# Detect dataproc image version from its various names
if (! test -v DATAPROC_IMAGE_VERSION) && test -v DATAPROC_VERSION; then
Contributor:


I'm using spark-submit --version to determine which dataproc version we're on.
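A sketch of that approach; the Spark-to-Dataproc mapping below is an assumption to verify against the Dataproc release notes, and the spark-submit call is stubbed so the snippet runs anywhere:

```shell
# Derive the Dataproc image line from the Spark version. The mapping
# (3.5 -> 2.2, 3.3 -> 2.1, 3.1 -> 2.0) is illustrative, not authoritative.
spark_version="3.5.1"   # stand-in for parsing `spark-submit --version` output
case "${spark_version}" in
  3.5.*) DATAPROC_IMAGE_VERSION="2.2" ;;
  3.3.*) DATAPROC_IMAGE_VERSION="2.1" ;;
  3.1.*) DATAPROC_IMAGE_VERSION="2.0" ;;
  *)     DATAPROC_IMAGE_VERSION="unknown" ;;
esac
echo "Dataproc image line: ${DATAPROC_IMAGE_VERSION}"
```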

 if [[ ${OS_NAME} == debian ]] || [[ ${OS_NAME} == ubuntu ]]; then
   export DEBIAN_FRONTEND=noninteractive
-  execute_with_retries "apt-get update"
+  execute_with_retries "apt-get --allow-releaseinfo-change update"
Contributor:


do we really want to do this? Shouldn't the customer be using a more recent image to get an apt cache with the correct releaseinfo changes?


@cjac (Contributor) commented Jan 24, 2025:

mig.sh tests do not work right now, but we have customers indicating that this version works for them.

@cjac cjac merged commit 75594a7 into GoogleCloudDataproc:master Jan 24, 2025
1 of 2 checks passed
Successfully merging this pull request may close these issues:

[gpu][spark-rapids] Consolidate mig.sh Scripts and Sync Driver Installation Steps Across Copies