
fix: Ensure nvidia modprobe is run before kubelet #5099

Merged
merged 4 commits into master from bbrady/nv-kubelet on Oct 24, 2024

Conversation

benjamin-brady
Member

benjamin-brady commented on Oct 16, 2024

What type of PR is this?
/kind bug

What this PR does / why we need it:
Removes the kubelet restart from the nvidia-modprobe service and ensures nvidia-modprobe runs before kubelet (see the ordering sketch below).
Which issue(s) this PR fixes:
Incident 487929399, 552859186

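For illustration, here is a minimal sketch of the ordering guarantee this PR describes, expressed as a systemd drop-in; the actual change lives in AgentBaker's provisioning assets, and the drop-in path and file name below are assumptions, not the PR's implementation:

# Illustrative only: order kubelet after nvidia-modprobe via a systemd drop-in,
# rather than having nvidia-modprobe restart kubelet after loading the modules.
sudo mkdir -p /etc/systemd/system/kubelet.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/kubelet.service.d/10-nvidia-modprobe.conf
[Unit]
# Start kubelet only after the NVIDIA kernel modules have been loaded.
After=nvidia-modprobe.service
Wants=nvidia-modprobe.service
EOF
sudo systemctl daemon-reload

With that ordering in place, an explicit kubelet restart inside nvidia-modprobe.service is no longer needed.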

@benjamin-brady
Member Author

I have run the tests locally on GPU nodes and they pass.

benjamin-brady merged commit 67d7159 into master on Oct 24, 2024
19 checks passed
benjamin-brady deleted the bbrady/nv-kubelet branch on October 24, 2024 at 01:51
@tomasgiden

I'm one of the customers whose issues triggered this action (TrackingID#2402010050000797), and although the comments in this PR suggest you really understood the problem, I'm sorry to say the issue isn't solved.

I'm running AKSUbuntu-2204containerd-202411.03.0 on AKS 1.28.3 in Western Europe, and the kubelet logs still show a kubelet restart after starting a deallocated node. This still causes the cluster to mark the pod as unschedulable for a few seconds, which triggers the autoscaler to spin up unnecessary new nodes; a few seconds later the pod is scheduled on the first node anyway. (The kubelet restart does seem faster now than when the issue appeared last spring: it only adds a few seconds of delay to the pod start, whereas before it added tens of seconds, made the pod fail to start, and caused a new pod to be created.)

Here is what I see in the kubelet log:
Nov 25 13:37:52 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:52.110140 1966 kubelet.go:2421] "SyncLoop (PLEG): event for pod" pod="kube-system/csi-azuredisk-node-tnxtt" event={"ID":"798a5c08-bb47-481f-9191-d544242a0960","Type":"ContainerStarted","Data":"fc6dd50d229c087189c245115468f9ae4554550b4d0a2483c26511437a13798f"}

Nov 25 13:37:52 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:52.859934 1966 server.go:144] "Got registration request from device plugin with resource" resourceName="nvidia.com/gpu"

Nov 25 13:37:52 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:52.859973 1966 handler.go:94] "Registered client" name="nvidia.com/gpu"

Nov 25 13:37:52 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:52.867133 1966 manager.go:229] "Device plugin connected" resourceName="nvidia.com/gpu"

Nov 25 13:37:52 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:52.867523 1966 client.go:91] "State pushed for device plugin" resource="nvidia.com/gpu" resourceCapacity=1

Nov 25 13:37:52 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:52.880345 1966 manager.go:278] "Processed device updates for resource" resourceName="nvidia.com/gpu" totalCount=1 healthyCount=1

Nov 25 13:37:53 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:53.113166 1966 kubelet.go:2421] "SyncLoop (PLEG): event for pod" pod="default/nvidia-device-plugin-daemonset-rdhsp" event={"ID":"4207e042-3614-48ff-b050-598aa3eb9567","Type":"ContainerStarted","Data":"8651dd02ee89c31607009e69bcb04d8e59a074ac0525980005d14d1b87e2ad80"}

Nov 25 13:37:53 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:53.131703 1966 pod_startup_latency_tracker.go:102] "Observed pod startup duration" pod="default/nvidia-device-plugin-daemonset-rdhsp" podStartSLOduration=5.131659833 podCreationTimestamp="2024-11-25 13:37:48 +0000 UTC" firstStartedPulling="0001-01-01 00:00:00 +0000 UTC" lastFinishedPulling="0001-01-01 00:00:00 +0000 UTC" observedRunningTime="2024-11-25 13:37:53.131351661 +0000 UTC m=+8.213644351" watchObservedRunningTime="2024-11-25 13:37:53.131659833 +0000 UTC m=+8.213952533"

Nov 25 13:37:54 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:54.073217 1966 plugin_watcher.go:194] "Adding socket path or updating timestamp to desired state cache" path="/var/lib/kubelet/plugins_registry/disk.csi.azure.com-reg.sock"

Nov 25 13:37:54 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:54.122293 1966 reconciler.go:161] "OperationExecutor.RegisterPlugin started" plugin={"SocketPath":"/var/lib/kubelet/plugins_registry/disk.csi.azure.com-reg.sock","Timestamp":"2024-11-25T13:37:54.073243489Z","Handler":null,"Name":""}

Nov 25 13:37:54 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:54.123339 1966 csi_plugin.go:99] kubernetes.io/csi: Trying to validate a new CSI Driver with name: disk.csi.azure.com endpoint: /var/lib/kubelet/plugins/disk.csi.azure.com/csi.sock versions: 1.0.0

Nov 25 13:37:54 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:54.123372 1966 csi_plugin.go:112] kubernetes.io/csi: Register new plugin with name: disk.csi.azure.com at endpoint: /var/lib/kubelet/plugins/disk.csi.azure.com/csi.sock

Nov 25 13:37:54 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:54.915338 1966 dynamic_cafile_content.go:171] "Shutting down controller" name="client-ca-bundle::/etc/kubernetes/certs/ca.crt"

Nov 25 13:37:54 aks-gpu10-42902258-vmss000000 systemd[1]: kubelet.service: Deactivated successfully.

The whole log should be available in connection to the Tracking ID but I can of course share it directly on request.

The consequence of this issue is that deallocated GPU nodes cannot be used for autoscaling, since extra nodes are scaled up that cost money while going unused. This makes it very hard for a startup that cannot afford to keep GPUs online around the clock to do semi-real-time, event-driven autoscaling where customers are waiting on a response.
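For reference, a quick way to check for the extra restart on a freshly started node, using standard systemd/journald tooling (nothing AKS-specific is assumed beyond the nvidia-modprobe and kubelet unit names):

# Number of times systemd has restarted kubelet since the unit was loaded.
systemctl show kubelet --property=NRestarts
# Interleaved nvidia-modprobe and kubelet logs around node start-up.
journalctl -u nvidia-modprobe -u kubelet --since "1 hour ago" | grep -Ei "restart|deactivated"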

@benjamin-brady
Member Author

@tomasgiden I've been looking into this and you're right, the problem still exists. I understand the scaling issues are frustrating and costly, so thanks for bringing this to my attention.
Although I thought I had fixed the issue, when I look at a fresh node on the latest image versions the kubelet restart is, sure enough, still there inside nvidia-modprobe.service. The team and I have tracked this down to this file: https://github.com/Azure/AgentBaker/blob/master/parts/linux/cloud-init/nodecustomdata.yml#L300

I've updated the service on another branch (bbrady/nv-kubelet2) and I'm currently adding more robust testing, because my existing test didn't catch the problem even though that was the intention. I'll open another PR with my changes as soon as I'm confident the bug is fixed and ping you on it.
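A hypothetical node-level check in the spirit of the more robust testing mentioned above; the unit file path is an assumption and the actual AgentBaker tests may look quite different:

# Fail if the generated nvidia-modprobe unit still restarts kubelet, and
# confirm that kubelet is ordered after nvidia-modprobe.
set -e
unit=/etc/systemd/system/nvidia-modprobe.service   # assumed install path
if grep -q "systemctl restart kubelet" "$unit"; then
    echo "FAIL: nvidia-modprobe.service still restarts kubelet" >&2
    exit 1
fi
systemctl show kubelet --property=After | grep -q "nvidia-modprobe.service" \
    || { echo "FAIL: kubelet is not ordered after nvidia-modprobe" >&2; exit 1; }
echo "PASS: no kubelet restart in nvidia-modprobe.service and ordering looks correct"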
