
RuntimeClassName not transmitted to the provider cluster #2869

Open
remmen-io opened this issue Dec 16, 2024 · 2 comments
Labels
feat (Adds a new feature to the codebase) · workaround (This issue or pull request contains a workaround)

Comments

@remmen-io

What happened:

With namespace offloading, the runtimeClassName is not propagated from the consumer cluster to the provider cluster.

The following Deployment, which sets a runtimeClassName, is applied on the consumer cluster:

kind: Deployment
apiVersion: apps/v1
metadata:
  name: gpu-test
  labels:
    app: gpu-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-test
  template:
    metadata:
      labels:
        app: gpu-test
    spec:
      runtimeClassName: nvidia
      containers:
        - resources:
            limits:
              cpu: "1"
              memory: "1Gi"
              nvidia.com/gpu: "1"
            requests:
              cpu: "1"
              nvidia.com/gpu: "1"
          name: gpu
          imagePullPolicy: IfNotPresent
          image: "nvidia/cuda:12.6.1-base-ubuntu24.04"
          command: ["/bin/bash"]
          args:
           - "-c"
           - "while true; do nvidia-smi;sleep 100000; done"

On the provider cluster, the resulting pod spec is missing the runtimeClassName:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2024-12-13T13:56:15Z"
  labels:
    app: gpu-test
    liqo.io/creator-user: gpupool-9529260abb9b
    liqo.io/managed-by: shadowpod
    offloading.liqo.io/destination: e1-k8s-lab-t
    offloading.liqo.io/nodename: gpupool
    offloading.liqo.io/origin: e1-k8s-lab-b
    pod-template-hash: 79b6ccb455
  name: gpu-test-79b6ccb455-gtpz7
  namespace: liqo-demo-e1-k8s-lab-b
  ownerReferences:
  - apiVersion: offloading.liqo.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: ShadowPod
    name: gpu-test-79b6ccb455-gtpz7
    uid: 61675316-6f37-49d4-8f5b-197896752371
  resourceVersion: "9930727"
  uid: 5d56a142-afce-43e8-b25a-e9b255689a74
spec:
  automountServiceAccountToken: false
  containers:
  - args:
    - -c
    - while true; do nvidia-smi;sleep 100000; done
    command:
    - /bin/bash
    env:
    - name: KUBERNETES_SERVICE_HOST
      value: kubernetes.default
    - name: KUBERNETES_SERVICE_PORT
      value: "443"
    - name: KUBERNETES_PORT
      value: tcp://kubernetes.default:443
    - name: KUBERNETES_PORT_443_TCP
      value: tcp://kubernetes.default:443
    - name: KUBERNETES_PORT_443_TCP_PROTO
      value: tcp
    - name: KUBERNETES_PORT_443_TCP_ADDR
      value: kubernetes.default
    - name: KUBERNETES_PORT_443_TCP_PORT
      value: "443"
    image: nvidia/cuda:12.6.1-base-ubuntu24.04
    imagePullPolicy: IfNotPresent
    name: gpu
    resources:
      limits:
        cpu: "1"
        memory: 1Gi
        nvidia.com/gpu: "1"
      requests:
        cpu: "1"
        memory: 1Gi
        nvidia.com/gpu: "1"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-dbb7m
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: e1-k8shpc-001
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists

What you expected to happen:

The runtimeClassName should be preserved on the pod created in the provider cluster.
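
In other words, the offloaded pod spec would be expected to carry the same field as the consumer-side template, e.g.:

spec:
  runtimeClassName: nvidia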

How to reproduce it (as minimally and precisely as possible):

Define a runtimeClassName on a Deployment in an offloaded namespace on the consumer cluster, then inspect the pod created on the provider cluster.

Anything else we need to know?:

Environment:

  • Liqo version: latest HEAD at that time (tag: 896af81)
  • Liqoctl version: v1.0.0-rc.2
  • Kubernetes version (use kubectl version): v1.30.5
  • Cloud provider or hardware configuration: BareMetal with Talos
  • Node image:
  • Network plugin and version: Cilium v1.15.10-cee
  • Install tools:
  • Others:
@aleoli (Member) commented Dec 18, 2024

Hi @remmen-io!

Thanks for reporting. This is a known issue that will probably be addressed in a future version.

Currently, a possible workaround is to add a label to the pod and use it to apply a patch on the provider side.

For instance, this can be done with a Kyverno policy like this one:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: set-nvidia-runtime
spec:
  rules:
    - name: set-notebooks-runtime-class-nvidia
      match:
        any:
        - resources:
            kinds:
            - Pod
            selector:
              matchLabels:
                liqo-nvidia: "true"
      mutate:
        patchStrategicMerge:
          spec:
            runtimeClassName: nvidia
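
For completeness, a minimal sketch of the consumer-side Deployment that works with the policy above (assuming the policy is installed on the provider cluster); the pod template only needs to carry the liqo-nvidia label so that the offloaded pod gets patched provider-side:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-test
  template:
    metadata:
      labels:
        app: gpu-test
        liqo-nvidia: "true"   # matched by the set-nvidia-runtime ClusterPolicy above
    spec:
      containers:
        - name: gpu
          image: nvidia/cuda:12.6.1-base-ubuntu24.04
          command: ["/bin/bash", "-c", "while true; do nvidia-smi; sleep 100000; done"]
          resources:
            limits:
              nvidia.com/gpu: "1"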

@fra98 (Member) commented Dec 27, 2024

Hi @remmen-io, PR #2887 adds support for RuntimeClass reflection, at both the Pod level and the Node level (through the VirtualNode OffloadingPatch field).
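
As a rough sketch of what the node-level option could look like once that PR lands in a release (the offloadingPatch/runtimeClassName field names and the API version are assumptions here; check the VirtualNode CRD shipped with your Liqo version):

apiVersion: offloading.liqo.io/v1beta1
kind: VirtualNode
metadata:
  name: gpupool          # virtual node backing the provider cluster
  namespace: liqo
spec:
  offloadingPatch:
    # assumed field from PR #2887: applied to pods offloaded through this virtual node
    runtimeClassName: nvidia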
