Race condition between aws-vpc-cni-k8s and systemd-networkd on AL2023 #3162

Open
dracut5 opened this issue Dec 30, 2024 · 8 comments
dracut5 commented Dec 30, 2024

What happened:
Hi!
We have faced weird behavior where some cronjobs in the EKS cluster occasionally fail due to network timeouts. It occurs randomly and we were not able to reproduce the issue for a long time, but when we finally got into it we found that a node's secondary network interface, which was created and attached by the Amazon VPC CNI plugin, didn't have an IP address assigned. When a pod gets an IP address from the pool located on this secondary ENI, it is unable to perform any network actions; the network is fully inaccessible.

That was the root cause of our problem, but we decided to continue our investigation and figure out why the IP is absent, and why it does not happen every time.
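
For reference, this is roughly how we spotted the broken state on an affected node (a minimal sketch; the interface name ens6 is taken from our logs below):

ip -4 addr show dev ens6   # no "inet" line in the output means the primary IP is missing
ip link show dev ens6      # the link itself is UP and has carrier, only the address is gone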

Attach logs
During the initial phase the vpc-cni creates and attaches a secondary ENI to the instance to satisfy the condition MINIMUM_IP_TARGET=10. The plugin also configures the corresponding network interface by adding the primary IP address:

{"level":"info","ts":"2024-12-27T13:22:35.562Z","caller":"networkutils/network.go:1019","msg":"Setting up network for an ENI with IP address 172.26.103.105, MAC address 06:78:44:01:96:d1, CIDR 172.26.100.0/22 and route table 2"}
{"level":"debug","ts":"2024-12-27T13:22:35.562Z","caller":"networkutils/network.go:1019","msg":"Setting up ENI's primary IP 172.26.103.105"}
{"level":"debug","ts":"2024-12-27T13:22:35.564Z","caller":"networkutils/network.go:1019","msg":"Adding IP address 172.26.103.105/22"}

At the same time, until /etc/systemd/network/80-ec2.network.d/10-eks_primary_eni_only.conf appears, the systemd-networkd service takes the mentioned interface under its control and runs a DHCP client to obtain a lease:

Dec 27 13:22:35 ip-172-26-101-7.eu-west-1.compute.internal systemd-networkd[1438]: ens6: Configuring with /usr/lib/systemd/network/80-ec2.network.
Dec 27 13:22:35 ip-172-26-101-7.eu-west-1.compute.internal systemd-networkd[1438]: ens6: Link UP
Dec 27 13:22:35 ip-172-26-101-7.eu-west-1.compute.internal systemd-networkd[1438]: ens6: Gained carrier
Dec 27 13:22:35 ip-172-26-101-7.eu-west-1.compute.internal systemd-networkd[1438]: ens6: Gained IPv6LL
Dec 27 13:22:39 ip-172-26-101-7.eu-west-1.compute.internal systemd-networkd[1438]: ens6: DHCPv4 address 172.26.103.105/22, gateway 172.26.100.1 acquired from 172.26.100.1
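
To see which component currently manages the interface, networkctl is handy (a sketch we used while debugging):

networkctl list          # the SETUP column shows configured / unmanaged per link
networkctl status ens6   # per-link details, including the matching .network file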

When the 10-eks_primary_eni_only.conf file, which specifies which interfaces systemd-networkd should manage (only one, to be precise: the primary ENI interface), is created, the service decides to stop managing the device:

Dec 27 13:22:43 ip-172-26-101-7.eu-west-1.compute.internal systemd-networkd[1438]: ens6: Unmanaging interface.
Dec 27 13:22:43 ip-172-26-101-7.eu-west-1.compute.internal systemd-networkd[1438]: ens6: DHCP lease lost
Dec 27 13:22:43 ip-172-26-101-7.eu-west-1.compute.internal systemd-networkd[1438]: ens6: DHCPv6 lease lost

This action results in the deletion of the assigned IP address, since the DHCP lease is released.
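
For completeness, the drop-in can be inspected on the node; its rough shape is a [Match] section pinned to the primary ENI's MAC address, though the exact contents are generated by nodeadm and may differ (illustrative only):

cat /etc/systemd/network/80-ec2.network.d/10-eks_primary_eni_only.conf
# Assumed, illustrative shape - not verbatim:
# [Match]
# PermanentMACAddress=<primary ENI MAC>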

From that point on, our secondary interface, ens6, is managed exclusively by aws-cni, but the plugin knows nothing about the IP address removal done by systemd-networkd. In the end we get a broken, unrouted network interface and, as said before, when a pod obtains an IP address from the related pool it is unable to reach any network endpoints. A plugin restart helps, but it is not a long-term solution.
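
The short-term recovery is the plugin restart mentioned above; a rough sketch of that, plus a manual stopgap that only restores the address (routes set up by the CNI may still require the restart), using the values from our logs:

kubectl -n kube-system rollout restart daemonset aws-node   # lets the CNI re-run its ENI setup
ip addr add 172.26.103.105/22 dev ens6                      # manual stopgap on the affected node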

So, very likely, it is a race condition between the Amazon VPC CNI plugin and the systemd-networkd service: under certain circumstances they manage the same network interface at the same time.

It is worth mentioning that from time to time the secondary interface keeps its assigned primary IP, namely when the DHCP client had not yet obtained a lease before the "Unmanaging interface" event occurred.

What you expected to happen:

Secondary interfaces are managed only by the Amazon VPC CNI plugin and get a proper, permanent network configuration during the node init process.

How to reproduce it (as minimally and precisely as possible):

  1. Set MINIMUM_IP_TARGET slightly higher than a single ENI of the particular instance type can handle; this forces the plugin to create and attach an additional ENI while the node is starting. For example, for c7*.large types that is 10 or more: an ENI supports 10 IPs max - 1 primary and 9 secondary.
  2. Create new instances from that template until you catch the issue: a secondary interface without any IP address (a quick check for this state is sketched below).
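
A quick check for the broken state across a new node's interfaces (a hypothetical helper, not part of the plugin):

# Flag non-loopback interfaces that came up without any IPv4 address assigned.
for dev in $(ls /sys/class/net | grep -v '^lo$'); do
  if ! ip -4 addr show dev "$dev" | grep -q 'inet '; then
    echo "no IPv4 address on $dev"
  fi
done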

Anything else we need to know?:
We found a workaround: decreasing MINIMUM_IP_TARGET to 9 so that all IPs are placed on the primary interface during the instance init process; it works for all instance types used in our environments. The vpc-cni can still create additional ENIs later if needed, attach and configure them, but this prevents the race against systemd-networkd at the very start, before the 10-eks_primary_eni_only.conf file exists.
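
For what it's worth, this is roughly how we apply the workaround through the EKS add-on configuration (a sketch; the cluster name is a placeholder and the values are our settings with MINIMUM_IP_TARGET lowered to 9):

aws eks update-addon \
  --cluster-name my-cluster \
  --addon-name vpc-cni \
  --configuration-values '{"env":{"WARM_IP_TARGET":"2","MINIMUM_IP_TARGET":"9","MAX_ENI":"3"}}'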

Merry Christmas and Happy New Year y'all 🎄

Environment:

  • Kubernetes version (use kubectl version): v1.31.3-eks
  • CNI Version: v1.19.0-eksbuild.1, installed as EKS Add-on
  • OS (e.g: cat /etc/os-release): Amazon Linux 2023.6.20241212
  • Kernel (e.g. uname -a): 6.1.119-129.201.amzn2023.aarch64
  • Systemd: 252.23-2
  • Previous CNI config, which caused the issue
{
  "env": {
    "WARM_IP_TARGET": "2",
    "MINIMUM_IP_TARGET": "10",
    "MAX_ENI": "3"
  }
}
dracut5 added the bug label Dec 30, 2024
dracut5 changed the title Race condition between aws-vpc-cni-k8s and systemd-networked on AL2023 to Race condition between aws-vpc-cni-k8s and systemd-networkd on AL2023 Dec 30, 2024

orsenthil commented Dec 31, 2024

So, very likely, it is a race condition between the Amazon VPC CNI plugin and the systemd-networkd service: under certain circumstances they manage the same network interface at the same time.

Amazon Linux 2023.6.20241212

Thank you for providing these details.

We had seen a similar issue with a race between the "amazon-ec2-net-utils" package installed in the AL2023 AMI (earlier than v20240329) and VPC CNI: the default gateway route installed by the CNI for these secondary ENIs was deleted by amazon-ec2-net-utils. The ENI and its routes were first initialized by the VPC CNI plugin, then the amazon-ec2-net-utils init came into action and subsequently removed those routes in the race.

A change was made in awslabs/amazon-eks-ami#1738 by @M00nF1sh to make systemd-networkd manage only the primary ENI.

It seems like a direct solution to your report.

An AMI with the fix was released - https://github.com/awslabs/amazon-eks-ami/releases/tag/v20240329. The resolution was to upgrade to AMI v20240329 or later.

Could you confirm that you are on one of these supported AMIs?
Did this happen after an AMI upgrade?
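
One way to confirm which AMI a node booted from (a sketch; the IDs below are placeholders):

aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].ImageId' --output text
aws ec2 describe-images --image-ids ami-0123456789abcdef0 \
  --query 'Images[].Name' --output text   # the image name carries the EKS AMI release date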


dracut5 commented Jan 2, 2025

@orsenthil, thank you for the answer!

Our AMI version is v20241213, which is much newer than v20240329. Of course, we can update it to the latest v20241225, but I'm not sure that will help.

I read about the changes in the mentioned PR and I think the reason might be that nodeadm-run generates 10-eks_primary_eni_only.conf after aws-vpc-cni-k8s creates and attaches a secondary ENI, so, until the file is created, systemd-networkd still picks up all network interfaces.

In our case the plugin doesn't wait for pods to be scheduled; it tries to satisfy the condition MINIMUM_IP_TARGET=10, which for the c7g.large instance type means creating an additional ENI immediately.
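
If it helps, the ordering can be checked on an affected node by comparing the drop-in's timestamp with the systemd-networkd journal (a rough sketch; file path as in my first message):

stat -c 'last modified: %y  %n' /etc/systemd/network/80-ec2.network.d/10-eks_primary_eni_only.conf
journalctl -u systemd-networkd --no-pager | grep -E 'ens6: (Configuring|DHCPv4 address|Unmanaging)'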

TODOs after this PR: there are limitations on current solutions as well, and we should figure long term solution for this:

Also, I noticed they mention some limitations there; is it possible to get any info about that?
Maybe this issue is part of those limitations 😞


dracut5 commented Jan 6, 2025

Hi all once again!

Could someone take a look at this, please? We were able to find a way to bypass the issue, but we are still not fully certain that it will work every time.

That fix in the EKS AMI project doesn't seem to be a solution to this particular problem, since the 10-eks_primary_eni_only.conf file is eventually created and it does include the MAC address of the primary ENI. If you need additional info, I will try to provide whatever I have found.
Really appreciate any help and thank you in advance!

P.S. Maybe I should submit a ticket for the AWS support team?

@orsenthil

@dracut5, I will take a look at this and try to reproduce it with the information you have provided.


dracut5 commented Jan 6, 2025

Sorry, forgot to add: we are using self-managed node groups in this case, not EKS managed.


dracut5 commented Jan 10, 2025

Hi @orsenthil,

Did you have a chance to try to reproduce the issue? Was it successful?
Could I provide any details that would make the investigation simpler?

Thanks a lot!

@orsenthil

Hello @dracut5, I haven't been able to reproduce this issue yet.
I created an EKS cluster with a self-managed node group, although MNG vs self-managed node group shouldn't matter for this.

[senthilx@88665a371033:~/github-issues/3162]$ cat cluster.yaml
# cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: test-race-condition
  region: us-west-2  # specify your desired region
[senthilx@88665a371033:~/github-issues/3162]$ cat nodegroup.yaml
# nodegroup.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: test-race-condition
  region: us-west-2

nodeGroups:
  - name: self-managed-ng
    instanceType: t3.large
    desiredCapacity: 2
    minSize: 1
    maxSize: 4
    amiFamily: AmazonLinux2023
    labels:
      nodegroup-type: self-managed
    tags:
      nodegroup-role: worker
    ssh:
      allow: true
  1. Since it is a t3.large instance, it has 3 ENIs and each ENI can support a max of 12 IPs. With the primary IP reserved, that leaves 11 usable IPs per ENI. So I set MINIMUM_IP_TARGET to 14.

  2. I scaled up the nodes in my cluster. (The nodes came up with AMI ami-079182793d10affb9.)

eksctl scale nodegroup --cluster=$CLUSTER_NAME --name=self-managed-ng --nodes=4 --nodes-max=10 --nodes-min=0

And I noticed the additional ENI allocated (a cross-check from the AWS CLI side is sketched after the journalctl output below).

{"level":"debug","ts":"2025-01-11T02:05:07.084Z","caller":"datastore/data_store.go:1002","msg":"ENI eni-00bca9c5a3fda2c89 cannot be deleted because it is primary"}
{"level":"debug","ts":"2025-01-11T02:05:07.084Z","caller":"datastore/data_store.go:1002","msg":"ENI eni-098689bba2b6e8cbe cannot be deleted because it is required for MINIMUM_IP_TARGET: 14"}
  3. I ended up creating pods with IPs assigned from the secondary network interface, and they continued to get routable IPs.

The journalctl -u systemd-networkd output does show this, which means the interface is now managed by aws-cni.

Jan 11 02:00:47 ip-192-168-74-236.us-west-2.compute.internal systemd-networkd[1517]: ens5: Reconfiguring with /usr/lib/systemd/network/80-ec2.network.
Jan 11 02:00:47 ip-192-168-74-236.us-west-2.compute.internal systemd-networkd[1517]: ens5: DHCP lease lost
Jan 11 02:00:47 ip-192-168-74-236.us-west-2.compute.internal systemd-networkd[1517]: ens5: DHCPv6 lease lost
Jan 11 02:00:47 ip-192-168-74-236.us-west-2.compute.internal systemd-networkd[1517]: ens5: DHCPv4 address 192.168.74.236/19, gateway 192.168.64.1 acquired from 192.168.64.1
Jan 11 02:01:16 ip-192-168-74-236.us-west-2.compute.internal systemd-networkd[1517]: eth0: Interface name change detected, renamed to ens6.
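
As a cross-check from the AWS side that the extra ENI really is attached (a sketch; the instance ID is a placeholder):

aws ec2 describe-network-interfaces \
  --filters Name=attachment.instance-id,Values=i-0123456789abcdef0 \
  --query 'NetworkInterfaces[].{Eni:NetworkInterfaceId,Device:Attachment.DeviceIndex,Ips:PrivateIpAddresses[].PrivateIpAddress}'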

  1. Do you only see this when you set MINIMUM_IP_TARGET?
  2. Assuming this is a race condition, perhaps I should try the above steps a couple of times to see whether I end up with a non-routable secondary IP address?

Do the above steps look close to how you have been reproducing this issue?


dracut5 commented Jan 13, 2025

It looks very similar to what I have done to reproduce the issue, that's true.

  1. We didn't try to change the plugin configuration much, since we found the workaround of setting MINIMUM_IP_TARGET = 9, which prevents instant secondary ENI creation.
  2. Yes, please try to run the scaling process several times.

I would also suggest adding a few extra steps to match our configuration as closely as possible:

  1. Add 2-3 DaemonSets - they can be any dummy services - which should be placed on newly created nodes right after they become accessible.
  2. Add a simple userdata bash script like sleep 3. This command can be inserted before and after the main nodeadm entry, since we have pre and post cloud-init scripts (currently we are using terraform-aws-eks for EKS creation); a rough sketch follows below.

I suspect that the userdata scripts in particular might affect the network race condition.
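
Roughly, the userdata shape I mean looks like this (an illustrative sketch; the actual nodeadm bootstrap step is generated by the terraform-aws-eks module, so the middle line is only a placeholder):

#!/usr/bin/env bash
sleep 3   # pre cloud-init script
# ... nodeadm bootstrap generated by the module runs here ...
sleep 3   # post cloud-init script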
