Download values.yaml and configure the settings (the default settings MAY NOT work!). For multi-cluster setup, a separate values.yaml is required for each cluster.
- `core` / `worker`:
  - For single-cluster setup, leave as default.
  - For multi-cluster setup, set `core: false` and `worker: true` for non-core clusters. For the core cluster, set `core: true`, and set `worker` to `true` or `false` depending on whether pods in the core cluster need to be protected. Also set `workerCellId` to a unique cluster name if `worker` is `true`. (An example worker-cluster `values.yaml` is sketched after this list.)
- `coreCluster`:
  - For single-cluster setup, leave as default.
  - Provide a literal kubeconfig or reference an existing secret to the core cluster if the current release is `release.worker = true` and `release.core = false`.
- `host`:
  - For single-cluster setup, leave empty.
  - For multi-cluster and out-of-cluster setups, see the "Webhook service discovery" section.
- `tls`:
  - Set `custom` to `true`, and generate a new certificate for the webhook. This step is required for most setups unless `webhook.host` is set to a reverse proxy trusted by the system CA root of kube-apiserver. To generate a new certificate for single-cluster setups, use one of the following methods:
Using `openssl`:
- Change `DAYS` to the number of days until certificate expiry.
- Change `YOUR_NAMESPACE` to the namespace for the Podseidon Helm release.
openssl req -x509 -newkey rsa:1024 -keyout webhook.key -out webhook.cert -sha256 -nodes -subj '/CN=CN' \
-days DAYS \
-addext 'subjectAltName=DNS:*.YOUR_NAMESPACE.svc'
Copy the contents of `webhook.key` and `webhook.cert` into the corresponding fields in `values.yaml`.
Using `cfssl`:
- Download `cfssl*.json` from `test/assets` of this repository.
- Replace all occurrences of `podseidon-webhook` with `*.YOUR_NAMESPACE.svc`, where `YOUR_NAMESPACE` is the namespace for the Podseidon Helm release.
- Run the following commands:
cfssl gencert -initca cfssl-ca-csr.json | cfssljson -bare root
cfssl gencert -ca=root.pem -ca-key=root-key.pem \
-config=cfssl-config.json -profile=webhook \
cfssl-webhook-csr.json | cfssljson -bare webhook
Other settings are usable by default and self-explanatory.
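For reference, the sketch below shows what a non-core worker cluster's `values.yaml` might contain. The exact nesting of the `coreCluster` and `webhook` options may differ between chart versions, so treat the field layout as an assumption and verify it against the downloaded `values.yaml`.

```sh
# Illustrative only: the field layout is assumed, not taken from a specific chart version.
cat > values-worker-a.yaml <<'EOF'
core: false                  # this release does not host the core components
worker: true                 # protect pods running in this cluster
workerCellId: worker-a       # unique name for this worker cluster
coreCluster: {}              # fill in a literal kubeconfig or a secret reference for the core cluster
webhook:
  host: ""                   # see the "Webhook service discovery" section
EOF
```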
Install the chart with the configured values for each cluster.
helm install podseidon oci://ghcr.io/kubewharf/podseidon-chart --values values.yaml
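For a multi-cluster setup, the same command is repeated once per cluster with that cluster's values file. A sketch assuming two kubeconfig contexts named `core-cluster` and `worker-cluster-a` (both placeholders):

```sh
# Context names and values file names are placeholders for your environment.
helm --kube-context core-cluster install podseidon \
    oci://ghcr.io/kubewharf/podseidon-chart --values values-core.yaml
helm --kube-context worker-cluster-a install podseidon \
    oci://ghcr.io/kubewharf/podseidon-chart --values values-worker-a.yaml
```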
Tip
For multi-cluster setups with multiple core clusters,
treat each core cluster as a separate Podseidon installation.
It is possible to install the Podseidon chart multiple times using different Helm release names.
Each installation can be configured with a different `coreCluster`.
However, note the possible performance impact since
each installation runs a separate validation webhook,
which kube-apiserver calls serially and may greatly increase request latency.
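As an illustration, two releases serving two different core clusters could be installed as follows; the release names and values files are hypothetical.

```sh
# Each release points at a different core cluster via its own coreCluster setting.
helm install podseidon-core-a oci://ghcr.io/kubewharf/podseidon-chart --values values-core-a.yaml
helm install podseidon-core-b oci://ghcr.io/kubewharf/podseidon-chart --values values-core-b.yaml
```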
There are two possible approaches for out-of-cluster deployment of the Podseidon components.
The first method is to use a virtual kubelet that
proxies the Podseidon controller pods to a separate control plane (Kube-on-Kube) cluster.
To use this approach, simply set the `nodeSelector`
under the relevant components.
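For example, assuming the virtual-kubelet nodes carry a label such as `type: virtual-kubelet` (the label key/value and the exact `<component>.nodeSelector` values path are assumptions, not chart guarantees), the selectors could be set per component:

```sh
# Schedule the controller pods onto virtual-kubelet nodes; adjust the label to your environment.
helm install podseidon oci://ghcr.io/kubewharf/podseidon-chart --values values.yaml \
    --set generator.nodeSelector.type=virtual-kubelet \
    --set aggregator.nodeSelector.type=virtual-kubelet \
    --set webhook.nodeSelector.type=virtual-kubelet
```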
The second method is to avoid creating the Deployments in the Helm chart
and manually create the Deployments in a separate control plane cluster.
This option is not natively supported by the Helm chart,
which is published for general users and would become too complicated if such options were included.
However, it is possible to modify the Helm chart to generate only non-workload templates
by changing the main entrypoints to include the template `podseidon.boilerplate.aux.obj`
instead of `podseidon.boilerplate.entrypoint.obj`,
and to separately generate the appropriate command-line options passed to the controller binaries
by invoking the template `podseidon.boilerplate.args.yaml-array`.
Note that `<component>.yaml` only invokes the templates in `_<component>.yaml`,
and the latter file only defines templates without invoking anything at the top level,
so it is possible to import all files starting with `_`
as library templates
and build your own Helm chart.
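If you take this route, one convenient starting point (not a requirement of the chart) is to render the manifests locally and pick out the non-Deployment objects to recreate in the control plane cluster:

```sh
# Render without installing; the resulting file can be split into workload and non-workload objects.
helm template podseidon oci://ghcr.io/kubewharf/podseidon-chart --values values.yaml > rendered.yaml
```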
In both approaches, the `coreCluster`
and `workerCluster`
in the corresponding components
need to be configured accordingly to avoid using the `inCluster`
configuration
and connecting to the control plane cluster incorrectly.
The `impersonate`
option may be useful to
authenticate as the service accounts created by the Podseidon chart,
which declare more granular access to cluster resources,
while using a user account with higher access from the `literal`/`fromSecret`
kubeconfigs.
Podseidon webhook instances are only deployed in the release where `release.core = true`.
For releases where `release.worker = true` and `release.core = false`,
set `webhook.host` to a URL that resolves to the webhook Service created in the host cluster.
How such a URL is provisioned depends on the multi-cluster service discovery solution used.
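For illustration, assuming the service discovery layer exposes the core cluster's webhook Service under a DNS name like `podseidon-webhook.podseidon.example.internal` (a placeholder, not something the chart provides), a worker release would set:

```sh
# The hostname below is hypothetical; substitute whatever your service discovery solution provides.
helm --kube-context worker-cluster-a upgrade --install podseidon \
    oci://ghcr.io/kubewharf/podseidon-chart --values values-worker-a.yaml \
    --set webhook.host=podseidon-webhook.podseidon.example.internal
```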
To minimize disruption to existing operations, the following options are available for seamless adoption:
- Set `webhook.failurePolicy` to `Ignore` initially to check if the webhook is actually reachable.
- Set `webhook.dryRun` to `true` to obtain metrics on rejection rate without actually blocking pod deletion.
- Do not select everything under `generator.protectedSelector` initially. Only label specific canary workloads with the selector to observe any disruption to operations on these workloads before expanding to all other workloads.
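A combined sketch of these adoption settings, assuming `generator.protectedSelector` accepts a standard label selector and using a made-up canary label:

```sh
# Canary rollout: webhook failures are ignored, rejections are dry-run only,
# and only workloads labeled podseidon-canary=enabled are selected for protection.
helm upgrade podseidon oci://ghcr.io/kubewharf/podseidon-chart --values values.yaml \
    --set webhook.failurePolicy=Ignore \
    --set webhook.dryRun=true \
    --set generator.protectedSelector.matchLabels.podseidon-canary=enabled
```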
Each container exposes Prometheus metrics over HTTP at port 9090. The following metrics have been found relevant.
All time-related metrics are in seconds.
- `heartbeat`: The age of active instances.
- `leader_heartbeat`: Time since acquisition of current leader lease for a component. Consistently low value may indicate a crash loop.
- `generator_reconcile`: A histogram of reconcile duration in generator. The `error` tag indicates the error rate of various causes.
- `aggregator_reconcile`: A histogram of reconcile duration in aggregator. The `error` tag indicates the error rate of various causes.
- `aggregator_next_event_pool_current_size`, `aggregator_next_event_pool_current_latency`: The former is the number of PodProtector objects with newer webhook admission events than the last watch event received from the Pod watch stream in aggregator. The latter is the time since the oldest of such events. Elevated values may indicate Pod watch lag or clock skew between webhook and aggregator.
- `webhook_request`: Number of webhook requests processed. A sudden drop in request rate indicates possible misconfiguration in the apiserver → webhook path. The histogram indicates the total processing time for each webhook request.
- `webhook_handle_pod_in_ppr`: Number of PodProtector–Pod pairs processed by webhook. A sudden drop in processing rate indicates possible problem with PodProtectors, either due to webhook informer cache inconsistency or selector misconfiguration. The `rejected` tag indicates whether the admission review is rejected, which may be used to monitor either false positives/negatives or actual incidents caused by other controllers trying to delete existing pods.
- `webhook_http_error`: Number of webhook requests that failed (instead of getting rejected or approved). Cross check with `apiserver_admission_webhook_rejection_count{error_type=*}` from kube-apiserver.
- `retrybatch_submit_retry_count`: A histogram of the number of PodProtector updates involved with each PodProtector–Pod pair. Note that multiple Pods for the same PodProtector may be involved in the same PodProtector update, of which the count is indicated by the `retrybatch_execute_batch_size` histogram. Elevated values may indicate a high conflict rate, e.g. caused by too many webhook instances.
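To inspect these metrics ad hoc without a full monitoring stack, the port-9090 endpoint mentioned above can be scraped directly; the namespace and Deployment name below are placeholders for your installation.

```sh
# Forward the metrics port of a webhook pod (names are placeholders for your installation)...
kubectl -n podseidon port-forward deploy/podseidon-webhook 9090:9090
# ...then, in a second terminal, grep for the relevant series:
curl -s http://localhost:9090/metrics | grep -E 'webhook_request|webhook_handle_pod_in_ppr|webhook_http_error'
```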
Core cluster kube-apiserver/etcd not ready
Podseidon webhook incorrectly blocks pod deletion in worker clusters due to inability to update PodProtector.
If the core cluster control plane cannot be recovered shortly and pod deletion in worker clusters urgently needs to recover, consider the following steps:
- Stop the relevant controllers to avoid unexpected activity.
- Disable the Podseidon webhook by patching the webhook configuration:
KUBECONFIG=worker-cluster.yaml kubectl \
patch validatingwebhookconfiguration podseidon.kubewharf.io \
--type=json -p '[{
"op": "replace",
"path": "/webhooks/0/objectSelector",
"value": {"matchLabels": {"this-selector-does": "not-match-anything"}}
}]'
Revert the response steps in reverse order. Ensure the webhook is handling events correctly before restarting the stopped controllers to avoid the risk of controller malfunction if the control plane experienced data corruption or cache inconsistency.
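If the chart-managed configuration used an empty `objectSelector` (match everything), the patch above can be reverted as sketched below; otherwise, re-running `helm upgrade` restores the original chart-managed value.

```sh
# Sketch of the revert; verify the original objectSelector before applying.
KUBECONFIG=worker-cluster.yaml kubectl \
    patch validatingwebhookconfiguration podseidon.kubewharf.io \
    --type=json -p '[{"op": "replace", "path": "/webhooks/0/objectSelector", "value": {}}]'
```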
Worker cluster kube-apiserver/etcd not ready
Aggregator for the worker cluster may not update correctly, voiding protection by Podseidon. Assuming the absence of a data plane fault, and considering that other controllers are equally unable to dispatch pod-disrupting requests to the cluster, no action is required as the cluster status is mostly stationary and hence accurate. However, if the issue persists for a long time, failed pod evictions may accumulate, and since the apiserver is unable to create new pods correctly, this may develop into a data plane fault (see next row).
Monitor pod health in case this develops into a data plane fault.
Data center network connectivity issues, large scale pod unreachability
Aggregator is unable to obtain the latest cluster status from the worker cluster and/or write the latest status into the PodProtector in the core cluster. This may result in an excessive count of available pods when they are actually unusable. Thus, webhook may incorrectly allow killing healthy pods in other worker clusters and further worsen the situation.
Instruct generator to delete the cell status for all or selected workloads with an incorrect status:
KUBECONFIG=core-cluster.yaml kubectl \
annotate ppr ${NAME_OR_LABEL_SELECTOR} \
podseidon.kubewharf.io/remove-cell-once=${WORKER_CELL_ID}
This command is a backdoor implemented in generator to force-clear aggregator and webhook status for a worker cluster when the cluster is removed or unusable. If aggregator somehow manages to add the cell status back to the PodProtector, stop the aggregator for that worker cluster.
If aggregator was stopped, restart it after ensuring that the pods are really available as indicated in pod status.
- Generator leader heartbeat disappearance
- Generator reconcile QPS drop
- PodProtector `minAvailable` not updating
Webhook may perceive a lagging replica requirement greater/less than the desired value, thus incorrectly blocking/allowing pod deletion.
Try each of the procedures in order.
- Scale up generator if CPU/memory resource saturation is observed.
- Force delete and recreate generator pods to attempt recovery.
- Confirm the problem is merely caused by Podseidon, and all webhook rejections are false positives. Confirm the absence of concurrent control plane malfunction.
- Disable the Podseidon webhook in all worker clusters by patching ValidatingWebhookConfiguration.
- Aggregator leader heartbeat disappearance
- Accumulation of `next_event_pool_*` metrics
- Accumulation of admission history in PodProtector status
- High `.status.summary.maxLatencyMillis` in PodProtector objects
Webhook admission history may accumulate due to lack of cleanup from aggregator, resulting in false positive rejection.
Try each of the procedures in order.
- Scale up aggregator if CPU/memory resource saturation is observed.
- Force delete and recreate aggregator pods to attempt recovery.
- Confirm the problem is merely caused by Podseidon, and all webhook rejections are false positives. Confirm the absence of concurrent control plane malfunction.
- Disable the Podseidon webhook for the affected worker cluster by patching ValidatingWebhookConfiguration. The webhooks for other worker clusters may also need to be disabled, since they share the same unavailability counter.
Elevated values of the kube-apiserver metric `apiserver_admission_webhook_rejection_count{error_type=calling_webhook_error,name=podseidon.kubewharf.io}`.
Worker clusters may be unable to delete pods normally.
Try each of the procedures in order.
- Scale up/out webhook if CPU/memory resource saturation is observed.
- Force delete and recreate webhook pods to attempt recovery.
- Confirm the problem is merely caused by Podseidon. Confirm the absence of concurrent control plane malfunction.
- Disable the Podseidon webhook in all worker clusters by patching ValidatingWebhookConfiguration.
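One way to check the kube-apiserver rejection metric mentioned above, assuming your credentials can read the `/metrics` endpoint of the worker cluster's apiserver:

```sh
# Look for calling_webhook_error entries attributed to the Podseidon webhook.
KUBECONFIG=worker-cluster.yaml kubectl get --raw /metrics \
    | grep 'apiserver_admission_webhook_rejection_count' | grep 'podseidon.kubewharf.io'
```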