aws-node should untaint node #2808
Comments
@runningman84 the node should not be marked as "Ready" until the |
The VPC CNI plugin is not able to modify node taints, as that would be a sizable security risk |
okay what could be the reason for us seeing this behaviour? |
Do you see the nodes as "Not Ready" during this time window? Do these application pods have their own tolerations? |
I have double-checked that ... t = point in time ... there can be seconds or even a minute between two numbers. t0: node appears as not ready. It looks like the node becomes ready too fast, without waiting for the aws-node pod... |
@runningman84 can you please share the node logs from this timeline? Mainly we would need to look at the CNI and IPAMD logs in |
I just sent the logs and we also have an open case id: 170893944201879 |
Thanks @runningman84, let's work through the support case, as the support team will triage and then bring in the service team if needed. |
Hi, any news on this one? We have the same issue. |
The AWS support case did not really solve this. We got the suggestion to try prefix delegation mode or similar measures to speed up IP allocation. The general question is: should a node be unready until aws-node is up and running? |
I am facing the same issue on EKS 1.28: kubelet shows Ready status on the node. It seems I will have to make my own workaround by monitoring new nodes myself and assigning a label aged=y after they have been there for a minute, then giving all my pods a nodeAffinity that looks for that label. Ideally the aws pods would add a label to the node themselves. Any ideas @jdn5126? |
The node should be marked as Not Ready until the aws-node pod copies the CNI configuration files to /etc/cni/net.d/, which it does after it finishes initialization. Marking the node as schedulable (Ready or Not Ready) is done by the kubelet process. @tooptoop4 - did you manage to resolve this issue at your end? |
@orsenthil Not resolved. The node goes 1. NotReady > 2. Ready > 3. NotReady > 4. Ready. Pods are getting scheduled and stuck between 2. and 3. |
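The age-based workaround described in this comment could be sketched roughly as follows. This only covers the selection step (which nodes are old enough to receive the aged=y label). The function name and the input shape are hypothetical, and the actual labeling through the Kubernetes API is not shown.

```python
from datetime import datetime, timedelta, timezone


def nodes_to_label(creation_times, now, min_age=timedelta(minutes=1)):
    """Return the names of nodes that have existed for at least min_age.

    creation_times maps a node name to its creation time (timezone-aware).
    The caller would then apply the label (aged=y in the comment above)
    to each returned node; that step is intentionally left out here.
    """
    return sorted(
        name
        for name, created in creation_times.items()
        if now - created >= min_age
    )
```

Note that a fixed delay is only a heuristic: it does not confirm that aws-node has actually finished initializing on the node.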
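To check this condition on an affected node, a minimal sketch like the one below could report whether any CNI configuration file is present in /etc/cni/net.d/ at a given moment. The set of file extensions is an assumption about what the container runtime picks up, and the function name is hypothetical.

```python
from pathlib import Path

# Assumption: extensions the runtime treats as CNI configuration files.
CNI_CONF_EXTENSIONS = {".conf", ".conflist", ".json"}


def cni_config_present(conf_dir="/etc/cni/net.d"):
    """Return True if conf_dir contains at least one CNI config file."""
    directory = Path(conf_dir)
    if not directory.is_dir():
        return False
    return any(
        entry.is_file() and entry.suffix in CNI_CONF_EXTENSIONS
        for entry in directory.iterdir()
    )
```

Running this in a loop with timestamps while a node boots would show when the configuration appears relative to the kubelet reporting Ready.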
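To pin down the window between steps 2 and 3, one could extract the node's Ready condition transitions from events or logs and look for Ready periods that end in NotReady. The sketch below assumes the transitions are already available as chronologically ordered (time, ready) pairs; that input format and the function name are hypothetical.

```python
def transient_ready_windows(transitions):
    """Find Ready periods that were followed by a return to NotReady.

    transitions is a chronologically ordered list of (time, ready) pairs,
    where ready is True for Ready and False for NotReady.
    Returns a list of (ready_start, not_ready_start) pairs.
    """
    windows = []
    ready_since = None
    for time, ready in transitions:
        if ready and ready_since is None:
            # Node just became Ready; remember when.
            ready_since = time
        elif not ready and ready_since is not None:
            # Node fell back to NotReady; record the closed window.
            windows.append((ready_since, time))
            ready_since = None
    return windows
```

Pods whose scheduling time falls inside one of the returned windows are candidates for the stuck pods described above.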
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days |
still an issue in EKS v1.29.7 with amazon-k8s-cni:v1.18.2 kubectl events shows this timeline: default 40m Normal Starting node/redactip kubelet, redactip Starting kubelet. 40m 1 redactip |
I have two somewhat opposite points:
|
@runningman84 @tooptoop4 Have you tried the following type of debugging to narrow down the timing of the issue?
Example 1
fields @timestamp, @message
| filter @message like /ip-<redacted>.<redacted AWS region>.compute.internal/
| sort @timestamp asc
Example 2 (remove # to view only messages for userAgent kubelet)
fields @timestamp, @message
| filter @message like /<K8s node name>/
#| filter userAgent like /kubelet/
| filter @message like /kubelet is posting ready status/
| display @timestamp, userAgent, verb, requestURI
| sort @timestamp asc
Note: There are some CloudWatch costs associated with this.
Another ask, just to be sure: Have you modified kubelet maxPods so that it is larger than the number of IPs available for this particular instance type? See Maximum IP addresses per network interface. |
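For the maxPods question, a quick sanity check could look like the sketch below. It assumes the commonly cited EKS formula for secondary IP mode without prefix delegation, namely interfaces × (addresses per interface − 1) + 2. The interface and address counts for a given instance type have to be looked up in the AWS documentation and are passed in as parameters; the function names are hypothetical.

```python
def max_pods(enis, ips_per_eni):
    """Pod capacity in secondary IP mode (assumes no prefix delegation).

    One address per interface is the interface's own primary address,
    and 2 is added for pods that use host networking.
    """
    return enis * (ips_per_eni - 1) + 2


def max_pods_exceeds_ips(configured_max_pods, enis, ips_per_eni):
    """Return True if the kubelet maxPods setting is above the IP-based limit."""
    return configured_max_pods > max_pods(enis, ips_per_eni)
```

For example, with 3 interfaces and 10 addresses each, the limit is 3 × 9 + 2 = 29, so a maxPods value of 110 would exceed it.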
This is strange. Do you have any logs that indicate why the node is getting marked as NotReady after being in the Ready state? This is key to understanding the behavior here. The pods were probably scheduled while the node was in the Ready state, but then it went to NotReady for some reason. |
@runningman84 @tooptoop4 if you have additional logs please send them to
@orsenthil Anything else? Please advise! Thank you! |
kubelet logs for the affected node:
one of my pods got allocated to this node by argoworkflows at 14:03:26 and got stuck |
@tooptoop4 Thank you, I see. But we need the other logs mentioned here for this timeframe as well! |
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days |
/unrotten 🐔 |
@youwalther65 i've raised a support case with the logs |
@tooptoop4 Can you please share the support case ID with me, thx. |
What would you like to be added:
In our EKS Bottlerocket use case we see Karpenter provisioning nodes which get the aws-node pod and some application pods upon start. Unfortunately the aws-node pod takes several dozen seconds to become ready. The application pods try to start in the meantime and fail because they do not get an IP address.
Should we use karpenter to taint the nodes until the aws-node is ready? Is the cni plugin able to remove a startup taint once it is ready?
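The startup-taint idea can be pictured with the sketch below: given a node's current taints and whether aws-node is ready, it computes the taint list the node should end up with. A reply in this thread states that the VPC CNI plugin itself is not able to modify node taints, so this logic would have to live in a separate component. The taint key, the function name, and the dictionary shape are all hypothetical, and no Kubernetes API calls are made.

```python
def taints_after_cni_ready(taints, cni_ready,
                           startup_key="example.com/cni-not-ready"):
    """Return the taints a node should carry, given aws-node readiness.

    taints is a list of dicts with at least a "key" entry.
    startup_key is a made-up taint key used only for illustration.
    While aws-node is not ready, the taints are returned unchanged;
    once it is ready, the startup taint is dropped and all others kept.
    """
    if not cni_ready:
        return list(taints)
    return [taint for taint in taints if taint.get("key") != startup_key]
```

A controller built around this would patch the node only when the returned list differs from the current one.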
Why is this needed:
Pods constantly fail during startup due to missing ip addresses because the aws-node pod is not ready.