
RuntimeClassName not transmitted to the provider cluster #2869

Open
remmen-io opened this issue Dec 16, 2024 · 2 comments
Labels
feat (Adds a new feature to the codebase) · workaround (This issue or pull request contains a workaround)

Comments

@remmen-io

What happened:

With namespace offloading, the runtimeClassName is not propagated from the consumer cluster to the provider cluster.

The following Deployment, which sets a runtimeClassName, is applied on the consumer cluster:

kind: Deployment
apiVersion: apps/v1
metadata:
  name: gpu-test
  labels:
    app: gpu-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-test
  template:
    metadata:
      labels:
        app: gpu-test
    spec:
      runtimeClassName: nvidia
      containers:
        - resources:
            limits:
              cpu: "1"
              memory: "1Gi"
              nvidia.com/gpu: "1"
            requests:
              cpu: "1"
              nvidia.com/gpu: "1"
          name: gpu
          imagePullPolicy: IfNotPresent
          image: "nvidia/cuda:12.6.1-base-ubuntu24.04"
          command: ["/bin/bash"]
          args:
           - "-c"
           - "while true; do nvidia-smi;sleep 100000; done"

On the provider cluster, the resulting pod spec is missing the runtimeClassName:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2024-12-13T13:56:15Z"
  labels:
    app: gpu-test
    liqo.io/creator-user: gpupool-9529260abb9b
    liqo.io/managed-by: shadowpod
    offloading.liqo.io/destination: e1-k8s-lab-t
    offloading.liqo.io/nodename: gpupool
    offloading.liqo.io/origin: e1-k8s-lab-b
    pod-template-hash: 79b6ccb455
  name: gpu-test-79b6ccb455-gtpz7
  namespace: liqo-demo-e1-k8s-lab-b
  ownerReferences:
  - apiVersion: offloading.liqo.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: ShadowPod
    name: gpu-test-79b6ccb455-gtpz7
    uid: 61675316-6f37-49d4-8f5b-197896752371
  resourceVersion: "9930727"
  uid: 5d56a142-afce-43e8-b25a-e9b255689a74
spec:
  automountServiceAccountToken: false
  containers:
  - args:
    - -c
    - while true; do nvidia-smi;sleep 100000; done
    command:
    - /bin/bash
    env:
    - name: KUBERNETES_SERVICE_HOST
      value: kubernetes.default
    - name: KUBERNETES_SERVICE_PORT
      value: "443"
    - name: KUBERNETES_PORT
      value: tcp://kubernetes.default:443
    - name: KUBERNETES_PORT_443_TCP
      value: tcp://kubernetes.default:443
    - name: KUBERNETES_PORT_443_TCP_PROTO
      value: tcp
    - name: KUBERNETES_PORT_443_TCP_ADDR
      value: kubernetes.default
    - name: KUBERNETES_PORT_443_TCP_PORT
      value: "443"
    image: nvidia/cuda:12.6.1-base-ubuntu24.04
    imagePullPolicy: IfNotPresent
    name: gpu
    resources:
      limits:
        cpu: "1"
        memory: 1Gi
        nvidia.com/gpu: "1"
      requests:
        cpu: "1"
        memory: 1Gi
        nvidia.com/gpu: "1"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-dbb7m
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: e1-k8shpc-001
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists

What you expected to happen:

The runtimeClassName should be preserved on the pod created in the provider cluster.
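
In other words, the offloaded pod spec would be expected to carry the same field as the consumer-side template, e.g.:

spec:
  runtimeClassName: nvidia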

How to reproduce it (as minimally and precisely as possible):

Define a runtimeClassName on a Deployment in an offloaded namespace on the consumer cluster, then inspect the pod created on the provider cluster.

Anything else we need to know?:

Environment:

  • Liqo version: latest HEAD at that time (tag: 896af81)
  • Liqoctl version: v1.0.0-rc.2
  • Kubernetes version (use kubectl version): v1.30.5
  • Cloud provider or hardware configuration: BareMetal with Talos
  • Node image:
  • Network plugin and version: Cilium v1.15.10-cee
  • Install tools:
  • Others:
@aleoli (Member) commented Dec 18, 2024

Hi @remmen-io!

Thanks for reporting. This is a known issue that will probably be addressed in a future version.

Currently, a possible workaround is to add a label to the pod and use it to apply a patch on the provider side.

For instance, this can be done with a Kyverno policy like this one:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: set-nvidia-runtime
spec:
  rules:
    - name: set-notebooks-runtime-class-nvidia
      match:
        any:
        - resources:
            kinds:
            - Pod
            selector:
              matchLabels:
                liqo-nvidia: "true"
      mutate:
        patchStrategicMerge:
          spec:
            runtimeClassName: nvidia
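
For completeness, a minimal sketch of the consumer-side Deployment that works with the policy above (assuming the policy is installed on the provider cluster); the pod template only needs to carry the liqo-nvidia label so that the offloaded pod gets patched provider-side:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-test
  template:
    metadata:
      labels:
        app: gpu-test
        liqo-nvidia: "true"   # matched by the set-nvidia-runtime ClusterPolicy above
    spec:
      containers:
        - name: gpu
          image: nvidia/cuda:12.6.1-base-ubuntu24.04
          command: ["/bin/bash", "-c", "while true; do nvidia-smi; sleep 100000; done"]
          resources:
            limits:
              nvidia.com/gpu: "1"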

@fra98 (Member) commented Dec 27, 2024

Hi @remmen-io, PR #2887 adds support for RuntimeClass reflection, at both the Pod level and the Node level (through the VirtualNode OffloadingPatch field).
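
As a rough sketch of what the node-level option could look like once that PR lands in a release (the offloadingPatch/runtimeClassName field names and the API version are assumptions here; check the VirtualNode CRD shipped with your Liqo version):

apiVersion: offloading.liqo.io/v1beta1
kind: VirtualNode
metadata:
  name: gpupool          # virtual node backing the provider cluster
  namespace: liqo
spec:
  offloadingPatch:
    # assumed field from PR #2887: applied to pods offloaded through this virtual node
    runtimeClassName: nvidia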
