Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check kubernetes services are running as expected after bootstrapping / joining #873

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

claudiubelu
Copy link
Contributor

@claudiubelu claudiubelu commented Dec 6, 2024

Currently, we're doing only a few checks that would ensure a successful installation of the k8s-snap (the containerd path is free for us to use, the ports meant for Kubernetes services are free, binaries are sane).

However, due to various reasons, the bootstrap / join process can still succeed, even if some of the services are not running (e.g.: invalid extra service arguments given in the bootstrap config, leading to a service to fail due to an unknown or invalid argument). At the moment, we're only verifying that kube-apiserver is accessible.

We're adding a few checks that the Kubernetes services are not stopped / failed after we've started them.

Adds integration test for this scenario.

@claudiubelu claudiubelu requested a review from a team as a code owner December 6, 2024 09:19
@claudiubelu claudiubelu marked this pull request as draft December 6, 2024 09:19
@claudiubelu claudiubelu marked this pull request as ready for review December 10, 2024 11:19
@claudiubelu claudiubelu force-pushed the svc-checks branch 3 times, most recently from 5a4e5dc to a1e2f8b Compare December 10, 2024 13:34
# Source arguments and substitute environment variables. Will fail if we cannot read the file.
declare -a args="($(cat "${SNAP_COMMON}/args/${service_name}"))"

set -xe
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: moved these above. If the args are missing, it should error out as well.

Comment on lines +69 to +72
if err := control.Consistently(ctx, 3, 5*time.Second, func() error {
if err := snaputil.CheckControlPlaneServicesStates(ctx, snap, "active"); err != nil {
return fmt.Errorf("failed to ensure all control plane services are active: %w", err)
}
Copy link
Contributor

@eaudetcobello eaudetcobello Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not the biggest fan of adding 15 seconds to every bootstrap... Is it not possible to poll the state frequently and continue once all services are active?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, it should have been 10: 3 attempts, with 5 second delays between them. Will amend that as well.

We can reduce the delay between checks; currently it's meant to catch cases in which the services fail almost immediately (e.g. wrong service arguments). What about 2 second between checks?

Copy link
Contributor

@eaudetcobello eaudetcobello Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a fan of any kind of hard-coded time. What about this?

Is it not possible to poll the state frequently (milliseconds) and continue as soon as all services are active?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, systemd reports a unit as active immediately after starting it, even if it'll fail almost immediately afterwards. Ofcourse, it'll eventually mark it as failed afterwards, and it'll try again to start it due to the restart policy.

If you are to check the unit state immediately after starting it, you'll most likely see it as active. We can easily test this. Let's say we have a --foo=lish argument in /var/snap/k8s/common/args/kubelet, which will cause kubelet to fail. If we then run the following command:

svc_name="snap.k8s.kubelet.service" && sudo systemctl stop $svc_name && sudo systemctl start $svc_name && systemctl is-active $svc_name && sleep 5 && systemctl is-active $svc_name

We'd see the following:

active
failed

If the sleep is shorter (2s), you might even see it in activating state.

An idea would be to have ExecStartPost in the systemd units, which could help ensure that the service is running. As an interesting tidbit, it seems that setting ExecStartPost=/usr/bin/sleep 1 would actually protect against immediately-exiting services (though I'm not 100% it is an intended behaviour).

Currently, we're doing only a few checks that would ensure a successful
installation of the k8s-snap (the containerd path is free for us to use,
the ports meant for Kubernetes services are free, binaries are sane).

However, due to various reasons, the bootstrap / join process can still
succeed, even if some of the services are not running (e.g.: invalid
extra service arguments given in the bootstrap config, leading to a
service to fail due to an unknown or invalid argument). At the moment,
we're only verifying that kube-apiserver is accessible.

We're adding a few checks that the Kubernetes services are not stopped /
failed after we've started them.

Adds integration test for this scenario.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants