add runtimeClassName and nodeAffinity for koordlet #71

Closed
wants to merge 1 commit into from

@guhuajun guhuajun commented Oct 15, 2023

Note on DCO:

If the DCO action in the integration test fails, one or more of your commits are not signed off. Please click on the Details link next to the DCO action for instructions on how to resolve this.

Checklist:

  • I have bumped the chart version according to versioning
  • I have updated the chart changelog with all the changes that come with this pull request according to changelog.
  • Any new values are backwards compatible and/or have sensible default.
  • I have signed off all my commits as required by DCO.

Changes are automatically published when merged to master. They are not published on branches.


Please review. I am running a k3s cluster with 1 master node and 2 worker nodes with GPU cards. The worker nodes use containerd (see the example agent config). When the default runtime is specified in /etc/docker/daemon.json, it works.

  • I need to specify runtimeClassName so that the GPU cards are discovered. Otherwise, kubectl get devices returns an empty list.
  • I need to specify nodeAffinity so that koordlet is scheduled to worker nodes only.
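For reviewers, the two settings could look roughly like the following in the koordlet DaemonSet pod template. This is only a sketch, not the actual diff: the RuntimeClass name `nvidia`, its `handler` value, and the `node-role.kubernetes.io/control-plane` label key are assumptions about a typical k3s GPU setup and may differ in your cluster.

```yaml
# Assumption: a RuntimeClass named "nvidia" whose handler matches the
# NVIDIA runtime configured in containerd. Names here are illustrative.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
---
# Hypothetical fragment of the koordlet DaemonSet pod template.
spec:
  template:
    spec:
      # Run koordlet with the NVIDIA runtime so it can see the GPU cards.
      runtimeClassName: nvidia
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  # Assumption: master nodes carry this label, so
                  # DoesNotExist restricts koordlet to worker nodes.
                  - key: node-role.kubernetes.io/control-plane
                    operator: DoesNotExist
```

Using `requiredDuringSchedulingIgnoredDuringExecution` makes the restriction a hard requirement. A chart would more likely expose both fields as values with empty defaults so that existing installs are unaffected.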

After the above modifications, I see the expected output from kubectl get devices.

user@k3s81:~# kubectl get devices -o yaml
apiVersion: v1
items:
- apiVersion: scheduling.koordinator.sh/v1alpha1
  kind: Device
  metadata:
    creationTimestamp: "2023-10-15T07:12:40Z"
    generation: 1
    labels:
      node.koordinator.sh/gpu-driver-version: 535.104.12
      node.koordinator.sh/gpu-model: GeForce-RTX-2080-Ti
    name: k3s82
    ownerReferences:
    - apiVersion: v1
      blockOwnerDeletion: true
      controller: true
      kind: Node
      name: k3s82
      uid: 42b37e92-e079-466d-ab18-a00ce2187cf3
    resourceVersion: "10140"
    uid: 3e24800e-7cfb-4c96-8a4f-d2eca3fef5a1
  spec:
    devices:
    - health: true
      id: GPU-ebca0fd7-1821-8765-3239-ca056e15b028
      minor: 0
      resources:
        koordinator.sh/gpu-core: "100"
        koordinator.sh/gpu-memory: 11Gi
        koordinator.sh/gpu-memory-ratio: "100"
      type: gpu
  status: {}
- apiVersion: scheduling.koordinator.sh/v1alpha1
  kind: Device
  metadata:
    creationTimestamp: "2023-10-15T07:12:40Z"
    generation: 1
    labels:
      node.koordinator.sh/gpu-driver-version: 535.104.12
      node.koordinator.sh/gpu-model: GeForce-RTX-2080-Ti
    name: k3s83
    ownerReferences:
    - apiVersion: v1
      blockOwnerDeletion: true
      controller: true
      kind: Node
      name: k3s83
      uid: a09e5037-42fa-4868-b061-89af3d4951bb
    resourceVersion: "10139"
    uid: 8ff8e0a7-f557-4d9e-8e30-345d2341fb90
  spec:
    devices:
    - health: true
      id: GPU-7e738f84-ca39-bce7-3107-73ba51a16e9b
      minor: 0
      resources:
        koordinator.sh/gpu-core: "100"
        koordinator.sh/gpu-memory: 11Gi
        koordinator.sh/gpu-memory-ratio: "100"
      type: gpu
  status: {}
kind: List
metadata:
  resourceVersion: ""

Example agent config:

# https://github.com/k3s-io/k3s/issues/1264#issuecomment-903821584
token: TOKEN
server: https://k3s81:6443
docker: false
kubelet-arg:
  - "node-status-update-frequency=4s"
private-registry: "/etc/rancher/k3s/registry.yaml"
flannel-iface: "ens193"
log: "/var/log/k3s-agent.log"

@koordinator-bot koordinator-bot bot requested review from eahydra and stormgbs October 15, 2023 08:00
@koordinator-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign stormgbs after the PR has been reviewed.
You can assign the PR to them by writing /assign @stormgbs in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ZiMengSheng
Contributor

@guhuajun we have added release v1.4.0 chart, would you like to provide a pr like this for release v1.4.0?

@guhuajun
Author

> @guhuajun we have added release v1.4.0 chart, would you like to provide a pr like this for release v1.4.0?

Let me find some time to open a new PR.

@guhuajun guhuajun closed this Jan 22, 2024