NVIDIA GPU Driver and DevicePlugin Installation

Summary

This kops hook container may be used to enable nodes with GPUs to work with Kubernetes. It is targeted specifically for AWS GPU instance types.

It installs the following from web sources.

  1. Nvidia Device Drivers
  2. Cuda Libraries v9.1
  3. nvidia-docker
  4. docker-ce

Using this hook indicates that you agree to the Nvidia licenses.

How it works

  • This kops hook container runs on a kubernetes node upon every boot.
  • It installs onto the host system a systemd oneshot service unit nvidia-device-plugin.service along with setup scripts.
  • The systemd unit nvidia-device-plugin.service runs and executes the setup scripts in the host directory /nvidia-device-plugin.
  • The scripts install the Nvidia device drivers, the CUDA libraries, and nvidia-docker, along with the matching version of docker-ce.
  • The scheduling of work in a separate systemd unit outside of this kops hook is required because it is not possible to upgrade docker-ce on the host from within a docker container.
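The hand-off from the hook container to a host-level unit can be pictured with a small sketch. The function below prints a oneshot unit of the kind the hook might write to the host. Only the unit name and the /nvidia-device-plugin directory come from this README; the unit body, the `setup.sh` script name, and the `render_unit` helper are assumptions made for illustration and are not the hook's actual files.

```shell
# Print an illustrative systemd oneshot unit to stdout.
# $1: host directory holding the setup scripts (defaults to /nvidia-device-plugin).
# NOTE: "setup.sh" is a hypothetical script name, not taken from the hook.
render_unit() {
  script_dir="${1:-/nvidia-device-plugin}"
  cat <<EOF
[Unit]
Description=NVIDIA driver and device plugin setup (illustrative sketch)
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=${script_dir}/setup.sh

[Install]
WantedBy=multi-user.target
EOF
}
```

Running the work as a host unit rather than inside the hook container is what lets the scripts replace docker-ce without stopping the container that started them.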

Prerequisites

Although this hook may work with many combinations of software versions and images, it has only been tested with the following:

  • kops: 1.9
  • kubernetes: 1.10, 1.11
  • OS Image: kope.io/k8s-1.10-debian-stretch-amd64-hvm-ebs-2018-05-27
    • This is most certainly not the default image for kops. The OS image must be explicitly overridden in the cluster or instancegroup spec.
    • Debian stretch is needed because nvidia-docker requires a newer version of docker-ce >= 18.0, which is not available in the Debian jessie package repository. In addition, the Debian jessie kernel was compiled with gcc-7, while the system packages install gcc-4, thus making the nvidia driver compilation fail.
  • cloud: AWS
    • This hook currently works only on AWS.
    • It uses an AWS discovery mechanism to determine the node instance type, and then installs the correct drivers and configures the optimal settings for the GPU chipset.
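As a rough sketch of the instance-type discovery mentioned above, the functions below query the EC2 instance metadata service and classify the result. The metadata URL is the standard EC2 endpoint; the function names and the list of GPU families (`p2`, `p3`, `g2`, `g3`) are assumptions for illustration and may not match what the hook actually checks.

```shell
# Return 0 if the given EC2 instance type belongs to a GPU family.
# The family list is an illustrative assumption, not the hook's actual list.
is_gpu_instance_type() {
  case "$1" in
    p2.*|p3.*|g2.*|g3.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Ask the EC2 instance metadata service for this node's instance type.
# Only meaningful on an EC2 instance; assumes IMDSv1 access is allowed.
current_instance_type() {
  curl -s --max-time 2 http://169.254.169.254/latest/meta-data/instance-type
}
```

On a node, `is_gpu_instance_type "$(current_instance_type)" || exit 0` would give the kind of early exit that makes a setup script a no-op on non-GPU machines.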

Test Matrix

This kops hook was developed against the following version combinations.

| Kops Version | Kubernetes Version | GPU Mode | OS Image |
| --- | --- | --- | --- |
| 1.10-beta.1 | 1.10 | deviceplugin | kope.io/k8s-1.10-debian-stretch-amd64-hvm-ebs-2018-05-27 |
| 1.9.1 | 1.11 | deviceplugin | kope.io/k8s-1.10-debian-stretch-amd64-hvm-ebs-2018-05-27 |
| 1.9.1 | 1.10 | legacy | kope.io/k8s-1.10-debian-stretch-amd64-hvm-ebs-2018-05-27 |

Using this DevicePlugin

Create a Cluster with GPU Nodes

kops create cluster gpu.example.com \
  --zones us-east-1c \
  --node-size p2.xlarge \
  --node-count 1 \
  --image kope.io/k8s-1.10-debian-stretch-amd64-hvm-ebs-2018-05-27 \
  --kubernetes-version 1.11.0

Enable the Kops Installation Hook and DevicePlugins

This should be safe to do for all machines, because the hook auto-detects if the machine is an AWS GPU instancetype and will NO-OP otherwise. Choose between the DevicePlugin GPU Mode or Legacy Accelerators GPU Mode.

(Preferred) DevicePlugin GPU Mode

This mode is:

  • Required for kubernetes >= 1.11.0
  • Optional for 1.8.0 <= kubernetes < 1.11.0
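The version rule above can be expressed as a small helper, shown here as a sketch. The function name and the output labels are hypothetical, and treating versions below 1.8 as legacy-only is an assumption, since this README does not cover them.

```shell
# Print which GPU mode applies to a Kubernetes version such as 1.10.3.
gpu_mode_for_version() {
  version="${1#v}"                 # tolerate a leading "v"
  major="${version%%.*}"
  rest="${version#*.}"
  minor="${rest%%.*}"
  if [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 11 ]; }; then
    echo "deviceplugin"            # DevicePlugin mode is required
  elif [ "$major" -eq 1 ] && [ "$minor" -ge 8 ]; then
    echo "deviceplugin-or-legacy"  # DevicePlugin mode is optional
  else
    echo "legacy"                  # assumption: not covered by this README
  fi
}
```

For example, `gpu_mode_for_version 1.10.3` prints `deviceplugin-or-legacy`, while `gpu_mode_for_version 1.11.0` prints `deviceplugin`.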

For Kubernetes >= 1.11.0 or clusters supporting DevicePlugins

# > kops edit instancegroup nodes

spec:
  image: kope.io/k8s-1.10-debian-stretch-amd64-hvm-ebs-2018-05-27
  hooks:
  - execContainer:
      image: dcwangmit01/nvidia-device-plugin:0.1.0

### The settings below are only necessary for kubernetes < 1.11.0, where
###   deviceplugins are not enabled by default.
# kubelet:
#   featureGates:
#     # Enable DevicePlugins
#     DevicePlugins: "true"
#     # Disable Accelerators (may interfere with DevicePlugins)
#     Accelerators: "false"

(Deprecated) Legacy Accelerators GPU Mode

The legacy accelerator GPU mode is equivalent to the original GPU hook. Accelerators are deprecated in Kubernetes >= 1.11.0.

# > kops edit instancegroup nodes

spec:
  image: kope.io/k8s-1.10-debian-stretch-amd64-hvm-ebs-2018-05-27
  hooks:
  - execContainer:
      image: dcwangmit01/nvidia-device-plugin:0.1.0
      environment:
        NVIDIA_DEVICE_PLUGIN_MODE: legacy
  kubelet:
    featureGates:
      # Disable DevicePlugins (may interfere with Accelerators)
      DevicePlugins: "false"
      # Enable Accelerators
      Accelerators: "true"

Update the cluster

kops update cluster gpu.example.com --yes
kops rolling-update cluster gpu.example.com --yes

Deploy the Daemonset for the Nvidia DevicePlugin

Only for DevicePlugin GPU Mode, load the deviceplugin daemonset for your specific environment. This is not required for the Legacy Accelerators GPU Mode.

# For kubernetes 1.10
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.10/nvidia-device-plugin.yml

# For kubernetes 1.11
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.11/nvidia-device-plugin.yml

# (Optional) Set a permissive toleration so the daemonset can run on any node,
#   including GPU nodes that you have tainted.
kubectl patch daemonset nvidia-device-plugin-daemonset --namespace kube-system \
  -p '{ "spec": { "template": { "spec": { "tolerations": [ { "operator": "Exists" } ] } } } }'
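If the deployment is scripted, the manifest choice can be derived from the cluster's Kubernetes version. The sketch below maps a version to one of the two manifest URLs listed above; the function name is hypothetical, and versions other than 1.10 and 1.11 are rejected because this README lists no manifest for them.

```shell
# Print the device plugin manifest URL for a Kubernetes version (e.g. 1.11.0).
# Only the two versions listed in this README are handled.
device_plugin_manifest_url() {
  case "$1" in
    1.10|1.10.*) tag="v1.10" ;;
    1.11|1.11.*) tag="v1.11" ;;
    *)
      echo "no manifest listed in this README for Kubernetes $1" >&2
      return 1
      ;;
  esac
  echo "https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/${tag}/nvidia-device-plugin.yml"
}
```

A possible use is `kubectl create -f "$(device_plugin_manifest_url 1.11.0)"`.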

Validate that GPUs are Working

Deploy a Test Pod

cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: tf-gpu
spec:
  containers:
  - name: gpu
    image: tensorflow/tensorflow:1.9.0-gpu
    imagePullPolicy: IfNotPresent
    resources:
      limits:
        memory: 1024Mi
        # ^ Set memory in case default limits are set low
        nvidia.com/gpu: 1 # requesting 1 GPU
        # ^ For Legacy Accelerators mode this key must be renamed to
        #   'alpha.kubernetes.io/nvidia-gpu'
  tolerations:
  # This toleration allows the test pod to run on any node, including GPU
  #   nodes that you have tainted.
  - operator: "Exists"
  # ^ If you have a specific taint to target, comment out the above and modify
  #   the example below

### Example tolerations
# - key: "dedicated"
#   operator: "Equal"
#   value: "gpu"
#   effect: "NoExecute"
EOF
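Because the GPU resource key differs between the two modes, a script that generates the test pod needs to pick the right one. A minimal sketch follows; the function name is hypothetical, and the mode names follow the `deviceplugin` and `legacy` values used elsewhere in this README.

```shell
# Print the pod resource key used to request GPUs in the given GPU mode.
# Defaults to DevicePlugin mode when no argument is given.
gpu_resource_key() {
  case "${1:-deviceplugin}" in
    deviceplugin) echo "nvidia.com/gpu" ;;
    legacy)       echo "alpha.kubernetes.io/nvidia-gpu" ;;
    *)
      echo "unknown GPU mode: $1" >&2
      return 1
      ;;
  esac
}
```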

Validate that GPUs are working

# Check that nodes are detected to have GPUs
kubectl describe nodes | grep -E 'gpu:\s.*[1-9]'

# Check the logs of the Tensorflow Container to ensure that it ran
kubectl logs tf-gpu

# Show GPU info from within the pod
#   Only works in DevicePlugin mode
kubectl exec -it tf-gpu -- nvidia-smi

# Show Tensorflow detects GPUs from within the pod.
#   Only works in DevicePlugin mode
kubectl exec -it tf-gpu -- \
  python -c 'from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())'
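For use in automation, the node check can be wrapped so that it returns an exit status instead of printing matches. This sketch applies the same pattern as the `grep` command above, written with a POSIX character class in place of `\s`; the function name is hypothetical.

```shell
# Read `kubectl describe nodes` output on stdin and succeed if any line
# reports a non-zero GPU count.
nodes_report_gpus() {
  grep -Eq 'gpu:[[:space:]].*[1-9]'
}
```

A possible use is `kubectl describe nodes | nodes_report_gpus && echo "GPUs detected"`.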