
fix: Ensure nvidia modprobe is run before kubelet #5099

Merged
merged 4 commits into master from bbrady/nv-kubelet on Oct 24, 2024

Conversation

benjamin-brady
Member

benjamin-brady commented on Oct 16, 2024

What type of PR is this?
/kind bug

What this PR does / why we need it:
Removes the kubelet restart from the nvidia-modprobe service and ensures nvidia-modprobe runs before kubelet (see the ordering sketch below).
Which issue(s) this PR fixes:
Incident 487929399, 552859186

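For illustration, here is a minimal sketch of the ordering guarantee this PR describes, expressed as a systemd drop-in; the actual change lives in AgentBaker's provisioning assets, and the drop-in path and file name below are assumptions, not the PR's implementation:

# Illustrative only: order kubelet after nvidia-modprobe via a systemd drop-in,
# rather than having nvidia-modprobe restart kubelet after loading the modules.
sudo mkdir -p /etc/systemd/system/kubelet.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/kubelet.service.d/10-nvidia-modprobe.conf
[Unit]
# Start kubelet only after the NVIDIA kernel modules have been loaded.
After=nvidia-modprobe.service
Wants=nvidia-modprobe.service
EOF
sudo systemctl daemon-reload

With that ordering in place, an explicit kubelet restart inside nvidia-modprobe.service is no longer needed.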

@benjamin-brady
Member Author

I have run the tests locally on GPU nodes and they pass.

benjamin-brady merged commit 67d7159 into master on Oct 24, 2024
19 checks passed
benjamin-brady deleted the bbrady/nv-kubelet branch on October 24, 2024 at 01:51
@tomasgiden

I'm one of the customers whose issues triggered this action (TrackingID#2402010050000797), and although the comments in this PR suggest you really understood the problem, I'm sorry to say the issue isn't solved.

I'm running AKSUbuntu-2204containerd-202411.03.0 on AKS 1.28.3 in Western Europe, and the kubelet logs still show a kubelet restart after starting a deallocated node. This still causes the cluster to mark the pod as unschedulable for a few seconds, which triggers the autoscaler to spin up unnecessary new nodes; a few seconds later the pod is scheduled on the first node anyway. (The kubelet restart does seem faster now than when the issue appeared last spring: it only adds a few seconds of delay to the pod start, whereas before it added tens of seconds, made the pod fail to start, and caused a new pod to be created.)

Here is what I see in the kubelet log:
Nov 25 13:37:52 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:52.110140 1966 kubelet.go:2421] "SyncLoop (PLEG): event for pod" pod="kube-system/csi-azuredisk-node-tnxtt" event={"ID":"798a5c08-bb47-481f-9191-d544242a0960","Type":"ContainerStarted","Data":"fc6dd50d229c087189c245115468f9ae4554550b4d0a2483c26511437a13798f"}

Nov 25 13:37:52 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:52.859934 1966 server.go:144] "Got registration request from device plugin with resource" resourceName="nvidia.com/gpu"

Nov 25 13:37:52 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:52.859973 1966 handler.go:94] "Registered client" name="nvidia.com/gpu"

Nov 25 13:37:52 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:52.867133 1966 manager.go:229] "Device plugin connected" resourceName="nvidia.com/gpu"

Nov 25 13:37:52 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:52.867523 1966 client.go:91] "State pushed for device plugin" resource="nvidia.com/gpu" resourceCapacity=1

Nov 25 13:37:52 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:52.880345 1966 manager.go:278] "Processed device updates for resource" resourceName="nvidia.com/gpu" totalCount=1 healthyCount=1

Nov 25 13:37:53 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:53.113166 1966 kubelet.go:2421] "SyncLoop (PLEG): event for pod" pod="default/nvidia-device-plugin-daemonset-rdhsp" event={"ID":"4207e042-3614-48ff-b050-598aa3eb9567","Type":"ContainerStarted","Data":"8651dd02ee89c31607009e69bcb04d8e59a074ac0525980005d14d1b87e2ad80"}

Nov 25 13:37:53 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:53.131703 1966 pod_startup_latency_tracker.go:102] "Observed pod startup duration" pod="default/nvidia-device-plugin-daemonset-rdhsp" podStartSLOduration=5.131659833 podCreationTimestamp="2024-11-25 13:37:48 +0000 UTC" firstStartedPulling="0001-01-01 00:00:00 +0000 UTC" lastFinishedPulling="0001-01-01 00:00:00 +0000 UTC" observedRunningTime="2024-11-25 13:37:53.131351661 +0000 UTC m=+8.213644351" watchObservedRunningTime="2024-11-25 13:37:53.131659833 +0000 UTC m=+8.213952533"

Nov 25 13:37:54 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:54.073217 1966 plugin_watcher.go:194] "Adding socket path or updating timestamp to desired state cache" path="/var/lib/kubelet/plugins_registry/disk.csi.azure.com-reg.sock"

Nov 25 13:37:54 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:54.122293 1966 reconciler.go:161] "OperationExecutor.RegisterPlugin started" plugin={"SocketPath":"/var/lib/kubelet/plugins_registry/disk.csi.azure.com-reg.sock","Timestamp":"2024-11-25T13:37:54.073243489Z","Handler":null,"Name":""}

Nov 25 13:37:54 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:54.123339 1966 csi_plugin.go:99] kubernetes.io/csi: Trying to validate a new CSI Driver with name: disk.csi.azure.com endpoint: /var/lib/kubelet/plugins/disk.csi.azure.com/csi.sock versions: 1.0.0

Nov 25 13:37:54 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:54.123372 1966 csi_plugin.go:112] kubernetes.io/csi: Register new plugin with name: disk.csi.azure.com at endpoint: /var/lib/kubelet/plugins/disk.csi.azure.com/csi.sock

Nov 25 13:37:54 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:54.915338 1966 dynamic_cafile_content.go:171] "Shutting down controller" name="client-ca-bundle::/etc/kubernetes/certs/ca.crt"

Nov 25 13:37:54 aks-gpu10-42902258-vmss000000 systemd[1]: kubelet.service: Deactivated successfully.

The whole log should be available in connection to the Tracking ID but I can of course share it directly on request.

The consequence of this issue is that deallocated GPU nodes cannot be used for autoscaling, since extra nodes are scaled up that cost money while going unused. This makes it very hard for a startup that cannot afford to keep GPUs online around the clock to do semi-real-time, event-driven autoscaling where customers are waiting on a response.
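For reference, a quick way to check for the extra restart on a freshly started node, using standard systemd/journald tooling (nothing AKS-specific is assumed beyond the nvidia-modprobe and kubelet unit names):

# Number of times systemd has restarted kubelet since the unit was loaded.
systemctl show kubelet --property=NRestarts
# Interleaved nvidia-modprobe and kubelet logs around node start-up.
journalctl -u nvidia-modprobe -u kubelet --since "1 hour ago" | grep -Ei "restart|deactivated"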

@benjamin-brady
Member Author

@tomasgiden I've been looking into this and you're right, the problem still exists. I understand the scaling issues are frustrating and costly, so thanks for bringing this to my attention.
Although I thought I had fixed the issue, when I look at a fresh node on the latest image versions the kubelet restart is, sure enough, still there inside nvidia-modprobe.service. The team and I have tracked this down to this file: https://github.com/Azure/AgentBaker/blob/master/parts/linux/cloud-init/nodecustomdata.yml#L300

I've updated the service on another branch (bbrady/nv-kubelet2) and I'm currently adding more robust testing, because my existing test didn't catch the problem even though that was the intention. I'll open another PR with my changes as soon as I'm confident the bug is fixed and ping you on it.
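A hypothetical node-level check in the spirit of the more robust testing mentioned above; the unit file path is an assumption and the actual AgentBaker tests may look quite different:

# Fail if the generated nvidia-modprobe unit still restarts kubelet, and
# confirm that kubelet is ordered after nvidia-modprobe.
set -e
unit=/etc/systemd/system/nvidia-modprobe.service   # assumed install path
if grep -q "systemctl restart kubelet" "$unit"; then
    echo "FAIL: nvidia-modprobe.service still restarts kubelet" >&2
    exit 1
fi
systemctl show kubelet --property=After | grep -q "nvidia-modprobe.service" \
    || { echo "FAIL: kubelet is not ordered after nvidia-modprobe" >&2; exit 1; }
echo "PASS: no kubelet restart in nvidia-modprobe.service and ordering looks correct"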
