Add pod diagnostics before scaling down to zero in scaler #15326
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main   #15326      +/-   ##
==========================================
- Coverage   84.76%   84.60%   -0.16%
==========================================
  Files         218      218
  Lines       13504    13534      +30
==========================================
+ Hits        11447    11451       +4
- Misses       1690     1713      +23
- Partials      367      370       +3

☔ View full report in Codecov by Sentry.
/retest
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: skonto
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@dprotaso gentle ping.
func (pas *PodAutoscalerStatus) MarkScaleTargetNotInitialized(reason, message string) {
	podCondSet.Manage(pas).MarkFalse(PodAutoscalerConditionScaleTargetInitialized, reason, message)
}
We should double check usages of this condition, because before it would always go Unknown => (True|False) and then remain unchanged. I can't recall if there's code that assumes that it never changes.
Maybe we want to introduce a new condition, to surface subsequent scaling issues.
I can't recall if there's code that assumes that it never changes.
Tests are not covering revision transitions? I checked the pa status propagation for the revision reconciliation and we have specific cases where this matters, but they don't seem affected. I can take another look if there is a scenario where this might be a problem. In general we should be able to set this to False (for whatever reason), since it is a legitimate value, and then any reconciliation should take that condition into consideration and adjust. Here we go from True to False.
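To illustrate the transition being discussed, here is a minimal sketch (not code from this PR; the package name, caller, and reason/message strings are made up) of the new helper flipping an already-initialized scale target back to False, i.e. the True => False move described above:

package scalersketch

import (
	"fmt"

	autoscalingv1alpha1 "knative.dev/serving/pkg/apis/autoscaling/v1alpha1"
)

// markActivationFailed is a hypothetical caller of the helper added in this PR.
func markActivationFailed(pa *autoscalingv1alpha1.PodAutoscaler) {
	// Before this change the condition went Unknown => (True|False) once and then
	// stayed put; this flips an already-True condition back to False, so downstream
	// code has to re-read it rather than assume it is sticky.
	pa.Status.MarkScaleTargetNotInitialized("NoHealthyPods", "activation timed out") // illustrative reason/message

	if !pa.Status.IsScaleTargetInitialized() {
		fmt.Println("scale target is no longer considered initialized")
	}
}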
pod, err := podCounter.GetAnyPod()
if err != nil {
	return fmt.Errorf("error getting a pod for the revision: %w", err)
}
Fetching a pod here seems premature
Could you elaborate? We already do that in that function via the pod accessor, for getting the state a few lines below. We are going to test for handling the scale-to-zero case and check the pod status, if we have to.
@@ -114,10 +114,16 @@ func (c *Reconciler) ReconcileKind(ctx context.Context, pa *autoscalingv1alpha1.
	if err := c.ReconcileMetric(ctx, pa, resolveScrapeTarget(ctx, pa)); err != nil {
		return fmt.Errorf("error reconciling Metric: %w", err)
	}
	podCounter := resourceutil.NewPodAccessor(c.podsLister, pa.Namespace, pa.Labels[serving.RevisionLabelKey])

	pod, err := podCounter.GetAnyPod()
Should be getting a pod that isn't ready, e.g. you could have min scale = 10 and the last pod can't be scheduled (due to resource constraints).
We are not targeting all pods; we are targeting the scenario with the image issue. If someone wants to cover all cases they can extend the work here later. Maybe I should change the PR title: here we are adding pod diagnostics only for the image issue, or for similar issues where all pods are stuck and deployment reconciliation cannot catch it due to known K8s limitations (the progress deadline cannot catch all cases).
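As a rough illustration of the kind of pod diagnostics meant here, the sketch below (an assumed helper, not the PR's actual code) pulls the first waiting reason/message off a pod's container statuses, which is the signal you get when the image can no longer be pulled:

package scalersketch

import corev1 "k8s.io/api/core/v1"

// waitingDiagnostic returns the first waiting reason/message found on the
// pod's containers (e.g. ImagePullBackOff), so it can be surfaced in the
// status before scaling down to zero.
func waitingDiagnostic(pod corev1.Pod) (reason, message string, ok bool) {
	for _, cs := range pod.Status.ContainerStatuses {
		if w := cs.State.Waiting; w != nil && w.Reason != "" {
			return w.Reason, w.Message, true
		}
	}
	return "", "", false
}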
return retry.RetryOnConflict(retry.DefaultBackoff, func() error {
	rev, err := client.ServingV1().Revisions(pa.Namespace).Get(ctx, pa.Name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	rev.Status.MarkResourcesAvailableFalse(w.Reason, w.Message)
	if _, err = client.ServingV1().Revisions(pa.Namespace).UpdateStatus(ctx, rev, metav1.UpdateOptions{}); err != nil {
		return err
	}
	return nil
})
}
}
This is sort of violating our abstractions: if we wanted to propagate this error message to the revision, we would have to do it through a PodAutoscaler condition.
have to do it through a PodAutoscaler condition.
Normally yes, but due to the distributed status logic (not clear, undocumented) and how things are implemented, this is safer imho, as it makes the decision locally and avoids influencing anything else down the code path. 🤷 I can try to change it, but that would require propagating this decision down to the pa status update (not ideal, as that code is many lines below); I had it that way previously. Let's see.
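For comparison, a hedged sketch of the alternative the review suggests: the scaler only records the failure on the PodAutoscaler, and the revision reconciler translates it during its usual status propagation. The function name and the reason/message strings here are assumptions based on the "pa status propagation" mentioned in this thread, not code from this PR.

package scalersketch

import (
	autoscalingv1alpha1 "knative.dev/serving/pkg/apis/autoscaling/v1alpha1"
	v1 "knative.dev/serving/pkg/apis/serving/v1"
)

// propagateViaPA sketches the PA-condition path: no direct Revision update
// from the scaler, only a PA condition that the revision reconciler picks up.
func propagateViaPA(rev *v1.Revision, pa *autoscalingv1alpha1.PodAutoscaler) {
	// Scaler side: record the failure on the PodAutoscaler only.
	pa.Status.MarkScaleTargetNotInitialized("NoHealthyPods", "activation timed out") // illustrative values

	// Revision reconciler side: assumed to be the existing "pa status propagation"
	// referenced above, which maps PA conditions onto the revision's conditions.
	rev.Status.PropagateAutoscalerStatus(&pa.Status)
}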
This Pull Request is stale because it has been open for 90 days with no activity.
/remove-lifecycle stale
I will create another PR to address the comments.
Fixes #14157
Proposed Changes
Replaces "If deployment is never available propagate the container msg" (#14835).
Adds pod diagnostics, as it was pending here; I am wondering what is needed to remove activationTimeoutBuffer. The idea is to mark the revision with resourcesAvailable=false and the pa with ScaleTargetInitialized=false just before we scale down to zero, after we have timed out and failed the activation here.
This would trigger the following condition in the revision lifecycle and pa status propagation:
if !ps.IsScaleTargetInitialized() && !resUnavailable && ps.ServiceName != "" {
A revision with no resources available will be set to Ready=False (due to its condSet), and that condition will propagate up to the ksvc.
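A minimal sketch of that propagation, assuming the standard Revision lifecycle helpers; the function and the reason/message are illustrative, and only MarkResourcesAvailableFalse appears in this PR's diff:

package scalersketch

import (
	"fmt"

	v1 "knative.dev/serving/pkg/apis/serving/v1"
)

// demoReadinessFlip shows that marking ResourcesAvailable=False is expected to
// drive the revision's Ready condition to False via its condition set, which
// the ksvc then reflects through the usual status propagation.
func demoReadinessFlip() {
	var rev v1.Revision
	rev.Status.InitializeConditions()
	rev.Status.MarkResourcesAvailableFalse("NoHealthyPods", "back-off pulling image") // illustrative values

	if c := rev.Status.GetCondition(v1.RevisionConditionReady); c != nil {
		fmt.Println(c.Status, c.Reason) // expected: False NoHealthyPods
	}
}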
Tested with:
Steps to reproduce: first run the ksvc and let it scale down to zero, then remove the revision image from the local registry. Disable network access so the image cannot be fetched, and issue a new request.
The status of the Serving resources will become:
After we bring the image back, a new request works as expected and the resource statuses return to normal.
Release Note