-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Check kubernetes services are running as expected after bootstrapping / joining #873
base: main
Are you sure you want to change the base?
Conversation
2a12af1
to
58f3985
Compare
5a4e5dc
to
a1e2f8b
Compare
# Source arguments and substitute environment variables. Will fail if we cannot read the file. | ||
declare -a args="($(cat "${SNAP_COMMON}/args/${service_name}"))" | ||
|
||
set -xe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: moved these above. If the args are missing, it should error out as well.
if err := control.Consistently(ctx, 3, 5*time.Second, func() error { | ||
if err := snaputil.CheckControlPlaneServicesStates(ctx, snap, "active"); err != nil { | ||
return fmt.Errorf("failed to ensure all control plane services are active: %w", err) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not the biggest fan of adding 15 seconds to every bootstrap... Is it not possible to poll the state frequently and continue once all services are active?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, it should have been 10: 3 attempts, with 5 second delays between them. Will amend that as well.
We can reduce the delay between checks; currently it's meant to catch cases in which the services fail almost immediately (e.g. wrong service arguments). What about 2 second between checks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not a fan of any kind of hard-coded time. What about this?
Is it not possible to poll the state frequently (milliseconds) and continue as soon as all services are active?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really, systemd reports a unit as active
immediately after starting it, even if it'll fail almost immediately afterwards. Ofcourse, it'll eventually mark it as failed afterwards, and it'll try again to start it due to the restart policy.
If you are to check the unit state immediately after starting it, you'll most likely see it as active
. We can easily test this. Let's say we have a --foo=lish
argument in /var/snap/k8s/common/args/kubelet
, which will cause kubelet
to fail. If we then run the following command:
svc_name="snap.k8s.kubelet.service" && sudo systemctl stop $svc_name && sudo systemctl start $svc_name && systemctl is-active $svc_name && sleep 5 && systemctl is-active $svc_name
We'd see the following:
active
failed
If the sleep is shorter (2s), you might even see it in activating
state.
An idea would be to have ExecStartPost
in the systemd units, which could help ensure that the service is running. As an interesting tidbit, it seems that setting ExecStartPost=/usr/bin/sleep 1
would actually protect against immediately-exiting services (though I'm not 100% it is an intended behaviour).
Currently, we're doing only a few checks that would ensure a successful installation of the k8s-snap (the containerd path is free for us to use, the ports meant for Kubernetes services are free, binaries are sane). However, due to various reasons, the bootstrap / join process can still succeed, even if some of the services are not running (e.g.: invalid extra service arguments given in the bootstrap config, leading to a service to fail due to an unknown or invalid argument). At the moment, we're only verifying that kube-apiserver is accessible. We're adding a few checks that the Kubernetes services are not stopped / failed after we've started them. Adds integration test for this scenario.
a1e2f8b
to
5ce8ec3
Compare
Currently, we're doing only a few checks that would ensure a successful installation of the k8s-snap (the containerd path is free for us to use, the ports meant for Kubernetes services are free, binaries are sane).
However, due to various reasons, the bootstrap / join process can still succeed, even if some of the services are not running (e.g.: invalid extra service arguments given in the bootstrap config, leading to a service to fail due to an unknown or invalid argument). At the moment, we're only verifying that
kube-apiserver
is accessible.We're adding a few checks that the Kubernetes services are not stopped / failed after we've started them.
Adds integration test for this scenario.