# Coverage analysis

It is impossible to enumerate all failure scenarios, much less protect against all of them. However, this page analyzes the impact of various components malfunctioning.

## Core control plane

"Core control plane" refers to the common dependency of all control plane components, namely kube-apiserver and etcd.

| Cluster | Scenario | Affected objects | What happens without Podseidon | What happens with Podseidon |
|---------|----------|------------------|--------------------------------|------------------------------|
| Core | Data disappearance (e.g. due to etcd data corruption or buggy controllers) | Source workload only | ❌ The GC controller (or an equivalent cascade deletion controller) would cascade-delete all pods. | ✅ The PodProtector is not cascade-deleted because it has no explicit deletionTimestamp. Cascade deletion of the underlying pods is rejected by the Podseidon webhook. |
| Core | Data disappearance | PodProtector only | N/A | ⚠️ The webhook can no longer reject pod deletion, but controllers will not actively try to delete the pods since the normal path is unaffected. |
| Core | Data disappearance | PodProtector + source workload/intermediate objects | ❌ The GC controller (or an equivalent cascade deletion controller) would cascade-delete all pods. | ❌ The GC controller (or an equivalent cascade deletion controller) would cascade-delete all pods. The Podseidon webhook is unable to protect the pods if kube-apiserver has already sent the deletion event to its informer. |
| Core | Data disappearance | Other dependency objects | ⚠️ No direct impact on running pods, but recreated pods cannot start correctly. | ⚠️ No direct impact on running pods, but recreated pods cannot start correctly. |
| Core | Loss of strong consistency | PodProtector | N/A | ⚠️ No direct impact on normal operations, but the webhook may incorrectly allow pod deletion if the apiserver returns 200 OK to conflicting PodProtector status updates. |
| Worker | Data disappearance (e.g. due to etcd data corruption or buggy controllers) | Pod | ❌ Kubelet will kill the pods without warning. This cannot be mitigated without modifying kubelet code. | ❌ Kubelet will kill the pods without warning. This cannot be mitigated without modifying kubelet code. |
| Worker | Data disappearance | Intermediate objects (e.g. ReplicaSet) | ❌ The GC controller (or an equivalent cascade deletion controller) would cascade-delete all pods. | ✅ The PodProtector is not cascade-deleted because it has no explicit deletionTimestamp. Cascade deletion of the underlying pods is rejected by the Podseidon webhook. |
| Worker | Data disappearance | Podseidon ValidatingWebhookConfiguration | N/A | ⚠️ kube-apiserver no longer calls the Podseidon webhook, so protection is lost. Such data disappearance is often correlated with mass pod disappearance, so the pod count drops immediately and the ReplicaSet controller is unlikely to try to delete pods at the same time. |
| Worker | Significant watch cache lag (but any available watch events are still delivered in order) | Pod → Podseidon Aggregator | N/A | ⚠️ Normal operations (such as scaling and eviction) may be disrupted because the Podseidon webhook does not observe new pods becoming available and therefore does not replenish the quota for pod deletion. With `--aggregator-informer-synctime-algorithm=clock`, this may also result in incorrect approval of pod deletion due to the lag between PodProtector admission and event reception; this issue does not happen if `status` is used instead. |
| Worker | Loss of strong consistency | Pod → Podseidon Aggregator watch | N/A | ⚠️ The Aggregator incorrectly invalidates old `admissionHistory` entries that have not yet been observed in the current view of the pod list. The resulting `estimatedAvailableReplicas` is greater than the actual value, leading to incorrect approval of pod deletion (see the sketch below). |
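The last two rows hinge on how the webhook budgets pod deletions against the aggregated PodProtector status. The following Go sketch is illustrative only: the `podProtectorStatus` struct and `canDeletePod` helper are hypothetical simplifications, not the actual Podseidon types. It shows why an inflated `estimatedAvailableReplicas` or a prematurely cleared admission history leads to incorrect approvals, while a lagging aggregation leads to incorrect rejections.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical, simplified view of an aggregated PodProtector status.
// Field names mirror the ones mentioned above; the layout is illustrative only.
type podProtectorStatus struct {
	minAvailable               int32       // pods that must remain available
	estimatedAvailableReplicas int32       // pods the aggregator currently believes are available
	admissionHistory           []time.Time // deletions admitted but not yet reflected in aggregation
}

// canDeletePod sketches the webhook decision: approve a deletion only if,
// after accounting for deletions that were already admitted but are not yet
// visible to the aggregator, availability stays at or above minAvailable.
func canDeletePod(s podProtectorStatus) bool {
	pending := int32(len(s.admissionHistory))
	return s.estimatedAvailableReplicas-pending-1 >= s.minAvailable
}

func main() {
	status := podProtectorStatus{minAvailable: 9, estimatedAvailableReplicas: 10}

	// With an accurate aggregation, exactly one deletion is approved...
	fmt.Println(canDeletePod(status)) // true

	// ...and recording it in the admission history blocks further deletions
	// until the aggregator observes a replacement pod become available.
	status.admissionHistory = append(status.admissionHistory, time.Now())
	fmt.Println(canDeletePod(status)) // false

	// If a lossy watch clears the history entry prematurely, or inflates
	// estimatedAvailableReplicas, the same call returns true again and one
	// pod too many can be deleted; if the aggregation lags and never counts
	// new pods, legitimate deletions keep being rejected instead.
}
```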

## Podseidon components

| Component | Scenario | Consequence |
|-----------|----------|-------------|
| Generator | Not working | ⚠️ Insufficient protection after scaling up; incorrect rejection after scaling down. |
| Generator | Incorrect logic | ⚠️ Insufficient protection after scaling up; incorrect rejection after scaling down. |
| Aggregator | Not working | ⚠️ False positives in the admission history are not cleared in time, and newly available pods are not observed in aggregation. Both may disrupt normal operations due to incorrect rejections from the Podseidon webhook. |
| Aggregator | Incorrect logic | ⚠️ The admission history may be incorrectly cleared or preserved, or the aggregated replica count may be too large or too small, resulting in incorrect approval or rejection from the Podseidon webhook respectively. |
| Webhook | Unavailable | ⚠️ Pod deletion is denied if `failurePolicy` is set to `Fail` and all instances are unavailable, disrupting normal operations (see the configuration sketch below). |
| Webhook | Incorrect logic | ⚠️ The webhook may incorrectly approve or reject pod deletions. |
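The "Unavailable" consequence depends entirely on the webhook's `failurePolicy`. The following Go sketch uses the standard `k8s.io/api/admissionregistration/v1` types with placeholder names and a placeholder service reference (it is not the manifest shipped with Podseidon) to show the trade-off: `Fail` blocks pod deletion whenever no webhook instance is reachable, while `Ignore` would let deletions through unprotected.

```go
package main

import (
	"fmt"

	admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// podDeletionWebhook builds an illustrative ValidatingWebhookConfiguration
// that intercepts pod DELETE requests. All names are placeholders.
func podDeletionWebhook() *admissionregistrationv1.ValidatingWebhookConfiguration {
	// Fail: reject deletions while the webhook is unreachable (safe, but may
	// disrupt normal operations). Ignore would allow them through unprotected.
	failurePolicy := admissionregistrationv1.Fail
	sideEffects := admissionregistrationv1.SideEffectClassNone
	scope := admissionregistrationv1.NamespacedScope

	return &admissionregistrationv1.ValidatingWebhookConfiguration{
		ObjectMeta: metav1.ObjectMeta{Name: "podseidon-example"},
		Webhooks: []admissionregistrationv1.ValidatingWebhook{{
			Name:                    "pod-deletion.podseidon.example.com",
			AdmissionReviewVersions: []string{"v1"},
			SideEffects:             &sideEffects,
			FailurePolicy:           &failurePolicy,
			Rules: []admissionregistrationv1.RuleWithOperations{{
				Operations: []admissionregistrationv1.OperationType{admissionregistrationv1.Delete},
				Rule: admissionregistrationv1.Rule{
					APIGroups:   []string{""},
					APIVersions: []string{"v1"},
					Resources:   []string{"pods"},
					Scope:       &scope,
				},
			}},
			ClientConfig: admissionregistrationv1.WebhookClientConfig{
				Service: &admissionregistrationv1.ServiceReference{
					Namespace: "podseidon-system",  // placeholder
					Name:      "podseidon-webhook", // placeholder
				},
			},
		}},
	}
}

func main() {
	cfg := podDeletionWebhook()
	fmt.Println(cfg.Webhooks[0].Name, *cfg.Webhooks[0].FailurePolicy)
}
```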

## Other components

✅ Disruptions to the chain between the source of truth (the main workload) and the pods shall not result in service disruption beyond the level permitted by `maxUnavailable`.