fix: Ensure nvidia modprobe is run before kubelet #5099
Conversation
Force-pushed from 0c4119e to a945165.
I have run the tests locally on GPU nodes and they pass.
See IcM 552859186
Force-pushed from 2e4e72c to 5e11530.
I'm one of the customers whose issues triggered this action (TrackingID#2402010050000797), and though the comments in this PR make it look like you really understood the problem, I'm sorry to say that the issue isn't solved. I'm running AKSUbuntu-2204containerd-202411.03.0 on AKS 1.28.3 in Western Europe, and the kubelet logs still show a kubelet restart after a deallocated node is started. This still causes the cluster to mark the pod as unschedulable for a few seconds, which triggers the autoscaler to spin up new, unnecessary nodes. A few seconds later the pod is scheduled on the first node anyway. (I have a feeling, though, that the kubelet restart is faster now than when the issue appeared last spring and only adds a few seconds of delay to the pod start, whereas I believe it previously added tens of seconds of delay and caused the pod to fail to start and a new pod to be created.) Here is what I see in the kubelet log:

```
Nov 25 13:37:52 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:52.859934 1966 server.go:144] "Got registration request from device plugin with resource" resourceName="nvidia.com/gpu"
Nov 25 13:37:52 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:52.859973 1966 handler.go:94] "Registered client" name="nvidia.com/gpu"
Nov 25 13:37:52 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:52.867133 1966 manager.go:229] "Device plugin connected" resourceName="nvidia.com/gpu"
Nov 25 13:37:52 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:52.867523 1966 client.go:91] "State pushed for device plugin" resource="nvidia.com/gpu" resourceCapacity=1
Nov 25 13:37:52 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:52.880345 1966 manager.go:278] "Processed device updates for resource" resourceName="nvidia.com/gpu" totalCount=1 healthyCount=1
Nov 25 13:37:53 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:53.113166 1966 kubelet.go:2421] "SyncLoop (PLEG): event for pod" pod="default/nvidia-device-plugin-daemonset-rdhsp" event={"ID":"4207e042-3614-48ff-b050-598aa3eb9567","Type":"ContainerStarted","Data":"8651dd02ee89c31607009e69bcb04d8e59a074ac0525980005d14d1b87e2ad80"}
Nov 25 13:37:53 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:53.131703 1966 pod_startup_latency_tracker.go:102] "Observed pod startup duration" pod="default/nvidia-device-plugin-daemonset-rdhsp" podStartSLOduration=5.131659833 podCreationTimestamp="2024-11-25 13:37:48 +0000 UTC" firstStartedPulling="0001-01-01 00:00:00 +0000 UTC" lastFinishedPulling="0001-01-01 00:00:00 +0000 UTC" observedRunningTime="2024-11-25 13:37:53.131351661 +0000 UTC m=+8.213644351" watchObservedRunningTime="2024-11-25 13:37:53.131659833 +0000 UTC m=+8.213952533"
Nov 25 13:37:54 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:54.073217 1966 plugin_watcher.go:194] "Adding socket path or updating timestamp to desired state cache" path="/var/lib/kubelet/plugins_registry/disk.csi.azure.com-reg.sock"
Nov 25 13:37:54 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:54.122293 1966 reconciler.go:161] "OperationExecutor.RegisterPlugin started" plugin={"SocketPath":"/var/lib/kubelet/plugins_registry/disk.csi.azure.com-reg.sock","Timestamp":"2024-11-25T13:37:54.073243489Z","Handler":null,"Name":""}
Nov 25 13:37:54 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:54.123339 1966 csi_plugin.go:99] kubernetes.io/csi: Trying to validate a new CSI Driver with name: disk.csi.azure.com endpoint: /var/lib/kubelet/plugins/disk.csi.azure.com/csi.sock versions: 1.0.0
Nov 25 13:37:54 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:54.123372 1966 csi_plugin.go:112] kubernetes.io/csi: Register new plugin with name: disk.csi.azure.com at endpoint: /var/lib/kubelet/plugins/disk.csi.azure.com/csi.sock
Nov 25 13:37:54 aks-gpu10-42902258-vmss000000 kubelet[1966]: I1125 13:37:54.915338 1966 dynamic_cafile_content.go:171] "Shutting down controller" name="client-ca-bundle::/etc/kubernetes/certs/ca.crt"
Nov 25 13:37:54 aks-gpu10-42902258-vmss000000 systemd[1]: kubelet.service: Deactivated successfully.
```

The whole log should be available in connection with the Tracking ID, but I can of course share it directly on request. The consequence of this issue is that deallocated GPU nodes cannot be used for autoscaling, since extra nodes are scaled up, costing money without being utilized. This makes it very hard for a startup that cannot afford to keep GPUs online the whole time to do semi-real-time, event-driven autoscaling where customers are waiting on a response.
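(For anyone reproducing this, a generic sketch of how to confirm the extra kubelet restart on a node; it assumes only a systemd-managed kubelet.service, as in the log above, and is not specific to this incident.)

```bash
# Sketch: check whether kubelet was restarted after node start.
# Assumes a systemd-managed kubelet.service, as in the log above.

# NRestarts counts how many times systemd has restarted the unit since boot.
systemctl show kubelet.service --property=NRestarts

# Look for the stop/start pair around node startup in the journal.
journalctl -u kubelet.service --no-pager | grep -E "Deactivated successfully|Started"
```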
@tomasgiden I've been looking into this and you're right, the problem still exists. I understand the scaling issues are frustrating and costly, so thanks for bringing this to my attention. I've updated the service on another branch (bbrady/nv-kubelet2), and I'm currently adding more robust testing, because my test didn't actually catch the problem even though that was the intention. I'll open another PR with my changes as soon as I'm confident the bug is fixed and ping you on it.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Removes the kubelet restart from the nvidia-modprobe service and ensures nvidia-modprobe is run before kubelet (see the illustrative sketch at the end of this section).
Which issue(s) this PR fixes:
Incident 487929399, 552859186
Requirements:
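For context on the ordering change described above, here is a minimal sketch of one way to express "run nvidia-modprobe before kubelet" in systemd. It is illustrative only: the unit name, file path, and ExecStart line are assumptions, not the actual AgentBaker artifacts.

```bash
#!/usr/bin/env bash
# Illustrative only: order a oneshot nvidia-modprobe unit ahead of kubelet
# instead of restarting kubelet afterwards. Names and paths are assumptions,
# not the actual AgentBaker artifacts.
set -euo pipefail

cat > /etc/systemd/system/nvidia-modprobe.service <<'EOF'
[Unit]
Description=Load NVIDIA kernel modules before kubelet starts
# Ordering: when both units are started in the same transaction, this one
# finishes starting before kubelet.service begins.
Before=kubelet.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/nvidia-modprobe

[Install]
# Pull this unit in whenever kubelet is started via its dependencies.
WantedBy=kubelet.service
EOF

systemctl daemon-reload
systemctl enable nvidia-modprobe.service
```

With Before= plus the wants-dependency, kubelet does not begin starting until the modprobe unit has completed, so no post-hoc kubelet restart is needed; that is the general shape of the fix this PR describes, not its literal contents.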