
Add pod diagnostics before scaling down to zero in scaler #15326

Closed
skonto wants to merge 2 commits

Contributor

@skonto skonto commented Jun 12, 2024

Fixes #14157

Proposed Changes

  • Replaces If deployment is never available propagate the container msg #14835

  • Adds pod diagnostics, which were still pending here. I am wondering what is needed to remove activationTimeoutBuffer.

  • The idea is to mark the revision with resourcesAvailable=false and the PA with ScaleTargetInitialized=false just before
    we scale down to zero, after activation has timed out and failed here.
    This triggers the following condition in the revision lifecycle and PA status propagation:
    if !ps.IsScaleTargetInitialized() && !resUnavailable && ps.ServiceName != "" {
    A revision with no resources available is set to Ready=False (due to its condSet), and that propagates the condition up to the ksvc.

  • Tested with:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello
spec:
  template:
    metadata:
      annotations:
        serving.knative.dev/progress-deadline: "45s"
    spec:
      timeoutSeconds: 30
      containers:
        - image: ghcr.io/knative/helloworld-go:latest
          ports:
            - containerPort: 8080
          env:
            - name: TARGET
              value: "World"

Steps to reproduce: first run the ksvc, let it scale down to zero, and then remove the revision image from the local registry. Disable network access so the image cannot be fetched, then issue a new request.
The status of the Serving resources then becomes:

{
    "apiVersion": "v1",
    "items": [
        {
            "apiVersion": "serving.knative.dev/v1",
            "kind": "Service",
            "metadata": {
                "annotations": {
....
            },
            "status": {
                "address": {
                    "url": "http://hello.default.svc.cluster.local"
                },
                "conditions": [
                    {
                        "lastTransitionTime": "2024-06-12T13:02:06Z",
                        "message": "Revision \"hello-00001\" failed with message: Initial scale was never achieved.",
                        "reason": "RevisionFailed",
                        "status": "False",
                        "type": "ConfigurationsReady"
                    },
                    {
                        "lastTransitionTime": "2024-06-12T13:02:05Z",
                        "message": "Revision \"hello-00001\" failed to become ready.",
                        "reason": "RevisionMissing",
                        "status": "False",
                        "type": "Ready"
                    },
                    {
                        "lastTransitionTime": "2024-06-12T13:02:05Z",
                        "message": "Revision \"hello-00001\" failed to become ready.",
                        "reason": "RevisionMissing",
                        "status": "False",
                        "type": "RoutesReady"
                    }
                ],
    }
}
{
    "apiVersion": "v1",
    "items": [
        {
            "apiVersion": "serving.knative.dev/v1",
            "kind": "Revision",
            "metadata": {
                "annotations": {
                    "serving.knative.dev/creator": "minikube-user",
                    "serving.knative.dev/progress-deadline": "45s",
                    "serving.knative.dev/routes": "hello",
                    "serving.knative.dev/routingStateModified": "2024-06-12T12:57:33Z"
                },
...

            "status": {
                "actualReplicas": 0,
                "conditions": [
                    {
                        "lastTransitionTime": "2024-06-12T13:02:06Z",
                        "message": "The target is not receiving traffic.",
                        "reason": "NoTraffic",
                        "severity": "Info",
                        "status": "False",
                        "type": "Active"
                    },
                    {
                        "lastTransitionTime": "2024-06-12T12:57:51Z",
                        "status": "True",
                        "type": "ContainerHealthy"
                    },
                    {
                        "lastTransitionTime": "2024-06-12T13:02:06Z",
                        "message": "Initial scale was never achieved",
                        "reason": "ProgressDeadlineExceeded",
                        "status": "False",
                        "type": "Ready"
                    },
                    {
                        "lastTransitionTime": "2024-06-12T13:02:06Z",
                        "message": "Initial scale was never achieved",
                        "reason": "ProgressDeadlineExceeded",
                        "status": "False",
                        "type": "ResourcesAvailable"
                    }
                ],
...
}
{
    "apiVersion": "v1",
    "items": [
        {
            "apiVersion": "serving.knative.dev/v1",
            "kind": "Configuration",
            "metadata": {
...
                "name": "hello",
                "namespace": "default",
...
            "status": {
                "conditions": [
                    {
                        "lastTransitionTime": "2024-06-12T13:02:06Z",
                        "message": "Revision \"hello-00001\" failed with message: Initial scale was never achieved.",
                        "reason": "RevisionFailed",
                        "status": "False",
                        "type": "Ready"
                    }
                ],
...
}
{
    "apiVersion": "v1",
    "items": [
        {
            "apiVersion": "autoscaling.internal.knative.dev/v1alpha1",
            "kind": "PodAutoscaler",

  ...
            "spec": {
                "protocolType": "http1",
                "reachability": "Reachable",
                "scaleTargetRef": {
                    "apiVersion": "apps/v1",
                    "kind": "Deployment",
                    "name": "hello-00001-deployment"
                }
            },
            "status": {
                "actualScale": 0,
                "conditions": [
                    {
                        "lastTransitionTime": "2024-06-12T13:02:05Z",
                        "message": "The target is not receiving traffic.",
                        "reason": "NoTraffic",
                        "status": "False",
                        "type": "Active"
                    },
                    {
                        "lastTransitionTime": "2024-06-12T13:02:05Z",
                        "message": "The target is not receiving traffic.",
                        "reason": "NoTraffic",
                        "status": "False",
                        "type": "Ready"
                    },
                    {
                        "lastTransitionTime": "2024-06-12T12:58:51Z",
                        "message": "K8s Service is not ready",
                        "reason": "NotReady",
                        "status": "Unknown",
                        "type": "SKSReady"
                    },
                    {
                "desiredScale": 0,
                "metricsServiceName": "hello-00001-private",
                "observedGeneration": 1,
                "serviceName": "hello-00001"
            }
        }
    ],
}

After we bring the image back, a new request works as expected and the resource statuses return to normal.
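
The gating check quoted in the proposed changes can be illustrated with a minimal, self-contained sketch. The types below (`paStatus`, `revStatus`, `Status`) are hypothetical stand-ins, not the Knative types, and `propagate` only mirrors the quoted condition; the reason and message strings are taken from the status output above.

```go
package main

import "fmt"

// Status is a simplified tri-state condition value (a stand-in for the
// Kubernetes condition status type).
type Status string

const (
	Unknown Status = "Unknown"
	True    Status = "True"
	False   Status = "False"
)

// paStatus is a hypothetical, trimmed-down view of a PodAutoscaler status.
type paStatus struct {
	Ready                  Status
	ScaleTargetInitialized Status
	ServiceName            string
}

// revStatus is a hypothetical, trimmed-down view of a Revision status.
type revStatus struct {
	ResourcesAvailable Status
	Reason, Message    string
}

// propagate marks the revision's resources as unavailable when the PA is
// not ready, its scale target is not initialized, a service name is set,
// and the revision has not already been marked unavailable.
func propagate(rs *revStatus, ps paStatus) {
	resUnavailable := rs.ResourcesAvailable == False
	if ps.Ready != False {
		return
	}
	if ps.ScaleTargetInitialized != True && !resUnavailable && ps.ServiceName != "" {
		rs.ResourcesAvailable = False
		rs.Reason = "ProgressDeadlineExceeded"
		rs.Message = "Initial scale was never achieved"
	}
}

func main() {
	rs := &revStatus{ResourcesAvailable: True}
	propagate(rs, paStatus{Ready: False, ScaleTargetInitialized: False, ServiceName: "hello-00001"})
	fmt.Println(rs.ResourcesAvailable, rs.Reason)
}
```

Because the check skips revisions that are already marked unavailable, an earlier and more specific failure message is not overwritten in this sketch.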
Release Note


@skonto skonto requested a review from dprotaso June 12, 2024 13:22
@skonto skonto self-assigned this Jun 12, 2024
@knative-prow knative-prow bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 12, 2024
@knative-prow knative-prow bot requested a review from izabelacg June 12, 2024 13:22
@skonto skonto removed their assignment Jun 12, 2024
@knative-prow knative-prow bot requested a review from ReToCode June 12, 2024 13:22
@skonto skonto changed the title Add podchecking before scaling down to zero in scaler Add pod diagnostics before scaling down to zero in scaler Jun 12, 2024

codecov bot commented Jun 12, 2024

Codecov Report

Attention: Patch coverage is 21.21212% with 26 lines in your changes missing coverage. Please review.

Project coverage is 84.60%. Comparing base (62ce45c) to head (248d6e8).
Report is 150 commits behind head on main.

Files with missing lines | Patch % | Lines
pkg/reconciler/autoscaling/kpa/scaler.go | 21.05% | 13 Missing and 2 partials ⚠️
pkg/resources/pods.go | 0.00% | 7 Missing ⚠️
pkg/apis/autoscaling/v1alpha1/pa_lifecycle.go | 0.00% | 2 Missing ⚠️
pkg/reconciler/autoscaling/kpa/kpa.go | 60.00% | 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15326      +/-   ##
==========================================
- Coverage   84.76%   84.60%   -0.16%     
==========================================
  Files         218      218              
  Lines       13504    13534      +30     
==========================================
+ Hits        11447    11451       +4     
- Misses       1690     1713      +23     
- Partials      367      370       +3     


@skonto skonto changed the title Add pod diagnostics before scaling down to zero in scaler [wip] Add pod diagnostics before scaling down to zero in scaler Jun 12, 2024
@knative-prow knative-prow bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 12, 2024
Contributor Author

skonto commented Jun 12, 2024

error: the server doesn't have a resource type "ksvc"

@skonto skonto changed the title [wip] Add pod diagnostics before scaling down to zero in scaler Add pod diagnostics before scaling down to zero in scaler Jun 12, 2024
@knative-prow knative-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 12, 2024
Contributor Author

skonto commented Jun 14, 2024

/retest


knative-prow bot commented Jun 14, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: skonto
Once this PR has been reviewed and has the lgtm label, please ask for approval from dprotaso. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor Author

skonto commented Jun 18, 2024

@dprotaso gentle ping.

Contributor Author

skonto commented Jun 25, 2024

@dprotaso gentle ping.

Comment on lines +220 to +223
func (pas *PodAutoscalerStatus) MarkScaleTargetNotInitialized(reason, message string) {
	podCondSet.Manage(pas).MarkFalse(PodAutoscalerConditionScaleTargetInitialized, reason, message)
}

Member

We should double check usages of this condition. Because before it would always be Unknown=>(True|False) and then remain unchanged.

I can't recall if there's code that assumes that it never changes.

Member

Maybe we might want to introduce a new condition - to maybe surface subsequent scaling issues

Contributor Author

@skonto skonto Jun 26, 2024

> I can't recall if there's code that assumes that it never changes.

Aren't the tests covering revision transitions? I checked the PA status propagation in the revision reconciliation; there are specific cases where this matters, but they don't seem to be affected. I can take another look in case there is a scenario where this might be a problem. In general we should be able to set this to False (for whatever reason), since it is a legitimate value, and any reconciliation should then take that condition into consideration and adjust. Here we go from True to False.

Comment on lines +119 to +122
pod, err := podCounter.GetAnyPod()
if err != nil {
	return fmt.Errorf("error getting a pod for the revision: %w", err)
}
Member

Fetching a pod here seems premature

Contributor Author

@skonto skonto Jun 26, 2024

Could you elaborate? We already do that in that function via the pod accessor, to get the state a few lines below. We are going to test for the scale-to-zero case and check the pod status if we have to.

@@ -114,10 +114,16 @@ func (c *Reconciler) ReconcileKind(ctx context.Context, pa *autoscalingv1alpha1.
if err := c.ReconcileMetric(ctx, pa, resolveScrapeTarget(ctx, pa)); err != nil {
	return fmt.Errorf("error reconciling Metric: %w", err)
}
podCounter := resourceutil.NewPodAccessor(c.podsLister, pa.Namespace, pa.Labels[serving.RevisionLabelKey])

pod, err := podCounter.GetAnyPod()
Member

Should be getting a pod that isn't ready - eg. you could have min scale = 10 and the last pod can't be scheduled (due to resource constraints)
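
The suggestion above, preferring a pod that is not ready over an arbitrary one, could look roughly like the following sketch. The `pod` type and `firstNotReady` helper are hypothetical and simplified for illustration; they are not part of the PR or of the Knative pod accessor.

```go
package main

import "fmt"

// pod is a hypothetical stand-in for a Kubernetes Pod, reduced to the
// fields this sketch needs.
type pod struct {
	Name  string
	Ready bool
}

// firstNotReady returns the first pod that is not ready. The boolean is
// false when every pod in the slice is ready (or the slice is empty).
func firstNotReady(pods []pod) (pod, bool) {
	for _, p := range pods {
		if !p.Ready {
			return p, true
		}
	}
	return pod{}, false
}

func main() {
	// With min scale > 1, most pods may be ready while one is stuck.
	pods := []pod{{"hello-a", true}, {"hello-b", true}, {"hello-c", false}}
	if p, ok := firstNotReady(pods); ok {
		fmt.Println("diagnose:", p.Name)
	}
}
```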

Contributor Author

@skonto skonto Jun 26, 2024

We are not targeting all pods; we are targeting the scenario with the image issue. Anyone who wants to cover all cases can extend this work later. Maybe I should change the PR title: here we add pod diagnostics only for the image issue, or for similar issues where all pods are stuck and deployment reconciliation cannot catch it due to known K8s limitations (the progress deadline cannot catch all cases).
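
As a rough illustration of what a pod diagnostic for the image scenario might extract, the sketch below scans container statuses for a waiting state and returns its reason and message. The `containerStatus` type is a hypothetical flattening of the Kubernetes container status (where the waiting state carries a reason and a message), and `waitingDiagnosis` is an illustrative helper, not code from this PR.

```go
package main

import "fmt"

// containerStatus is a hypothetical, flattened stand-in for a Kubernetes
// container status. WaitingReason is empty when the container is not in
// a waiting state.
type containerStatus struct {
	Name           string
	WaitingReason  string
	WaitingMessage string
}

// waitingDiagnosis returns the reason and message of the first container
// that is stuck in a waiting state, for example because its image cannot
// be pulled. found is false when no container is waiting.
func waitingDiagnosis(statuses []containerStatus) (reason, message string, found bool) {
	for _, cs := range statuses {
		if cs.WaitingReason != "" {
			return cs.WaitingReason, cs.WaitingMessage, true
		}
	}
	return "", "", false
}

func main() {
	statuses := []containerStatus{
		{Name: "queue-proxy"},
		{Name: "user-container", WaitingReason: "ImagePullBackOff", WaitingMessage: "Back-off pulling image"},
	}
	if reason, msg, ok := waitingDiagnosis(statuses); ok {
		fmt.Printf("%s: %s\n", reason, msg)
	}
}
```

The returned reason and message are the kind of values that could then be recorded on a status condition instead of the generic timeout text.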

Comment on lines +398 to +410
return retry.RetryOnConflict(retry.DefaultBackoff, func() error {
	rev, err := client.ServingV1().Revisions(pa.Namespace).Get(ctx, pa.Name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	rev.Status.MarkResourcesAvailableFalse(w.Reason, w.Message)
	if _, err = client.ServingV1().Revisions(pa.Namespace).UpdateStatus(ctx, rev, metav1.UpdateOptions{}); err != nil {
		return err
	}
	return nil
})
}
}
Member

this is sorta violating our abstractions - if we wanted to propagate this error message to the revision we would have to do it through a PodAutoscaler condition.

Contributor Author

@skonto skonto Jun 26, 2024

> have to do it through a PodAutoscaler condition.

Normally yes, but given the distributed status logic (unclear and undocumented) and how things are implemented, this is safer imho: it makes the decision locally and avoids influencing anything else down the code path. 🤷 I can try to change it, but that will require propagating this decision down to the PA status update (not ideal, as that code is many lines below). I had it that way previously. Let's see.


This Pull Request is stale because it has been open for 90 days with
no activity. It will automatically close after 30 more days of
inactivity. Reopen with /reopen. Mark as fresh by adding the
comment /remove-lifecycle stale.

@github-actions github-actions bot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 25, 2024
Contributor Author

skonto commented Sep 26, 2024

/remove-lifecycle stale

Contributor Author

skonto commented Sep 30, 2024

I will create another PR to address the comments.

@skonto skonto closed this Sep 30, 2024
Labels
size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Error for failed revision is not reported due to scaling to zero
2 participants