[TEST] [WIP] Debug https_serving_main #15027

Closed
wants to merge 8 commits into from

Conversation

skonto
Contributor

@skonto skonto commented Mar 21, 2024

@knative-prow knative-prow bot requested review from evankanderson and mgencur March 21, 2024 14:19
@knative-prow knative-prow bot added area/test-and-release It flags unit/e2e/conformance/perf test issues for product features size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 21, 2024

codecov bot commented Mar 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.95%. Comparing base (c2d0af1) to head (b5579cf).
Report is 48 commits behind head on main.

❗ Current head b5579cf differs from pull request most recent head 68bc1e3. Consider uploading reports for the commit 68bc1e3 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15027      +/-   ##
==========================================
+ Coverage   84.11%   84.95%   +0.83%     
==========================================
  Files         213      213              
  Lines       16783    13107    -3676     
==========================================
- Hits        14117    11135    -2982     
+ Misses       2315     1619     -696     
- Partials      351      353       +2     


Contributor Author

skonto commented Mar 21, 2024

/test ?


knative-prow bot commented Mar 21, 2024

@skonto: The following commands are available to trigger required jobs:

  • /test build-tests
  • /test contour-latest
  • /test contour-tls
  • /test gateway-api-latest
  • /test istio-latest-no-mesh
  • /test istio-latest-no-mesh-tls
  • /test kourier-stable
  • /test kourier-stable-tls
  • /test unit-tests
  • /test upgrade-tests

The following commands are available to trigger optional jobs:

  • /test gateway-api-latest-and-contour
  • /test https
  • /test istio-latest-mesh
  • /test istio-latest-mesh-short
  • /test istio-latest-mesh-tls
  • /test performance-tests

Use /test all to run the following jobs that were automatically triggered:

  • build-tests_serving_main
  • istio-latest-no-mesh-tls_serving_main
  • istio-latest-no-mesh_serving_main
  • unit-tests_serving_main
  • upgrade-tests_serving_main

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Contributor Author

skonto commented Mar 21, 2024

/test https

@knative-prow knative-prow bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Mar 21, 2024
Contributor Author

skonto commented Mar 21, 2024

/test https

Contributor Author

skonto commented Mar 21, 2024

/test https

@skonto skonto force-pushed the debug_https_serving_main branch from c822b7c to 820b3b0 Compare March 28, 2024 09:57

knative-prow bot commented Mar 28, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: skonto

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor Author

skonto commented Mar 28, 2024

/test https

Contributor Author

skonto commented Mar 28, 2024

    autoscale.go:379: revision "autoscale-sustaining-aggregation-linear-rqjstwbr-00001" #replicas: 12, want between [7, 17]
    autoscale.go:130: Stopping generateTraffic
    autoscale_test.go:144: request success rate under SLO: total = 38251, errors = 29586, rate = 0.226530, SLO = 0.999000

Need to check why the SLO check fails: most requests return errors (29586 of 38251).
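To make the gap concrete, here is a minimal sketch of the arithmetic behind the log line above. It assumes the rate is computed as (total - errors) / total, which reproduces the logged 0.226530; the helper name successRate is hypothetical and not taken from autoscale_test.go.

```go
package main

import "fmt"

// successRate returns the fraction of requests that did not error.
// Hypothetical helper: assumes rate = (total - errors) / total.
func successRate(total, errors int) float64 {
	if total == 0 {
		return 0
	}
	return float64(total-errors) / float64(total)
}

func main() {
	const slo = 0.999
	total, errors := 38251, 29586 // values from the log line above

	rate := successRate(total, errors)
	fmt.Printf("rate = %.6f, SLO = %.6f, meets SLO: %t\n", rate, slo, rate >= slo)

	// Rough error budget at this request volume: total * (1 - SLO).
	budget := float64(total) * (1 - slo)
	fmt.Printf("error budget at this volume: about %.0f requests\n", budget)
}
```

At this volume a 0.999 SLO leaves room for only a few dozen failed requests, so 29586 errors points at a systematic failure rather than flakiness.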

@knative-prow knative-prow bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 28, 2024
Contributor Author

skonto commented Mar 28, 2024

/test https

Contributor Author

skonto commented Mar 28, 2024

/test https

Contributor Author

skonto commented Mar 29, 2024

/test https

@skonto skonto changed the title from "[TEST] [WIP] Debug for https_serving_main" to "[TEST] [WIP] Debug https_serving_main" Mar 29, 2024
Contributor Author

skonto commented Mar 29, 2024

Looking at the log dump, the net-certmanager controller container is in CrashLoopBackOff:

      lastTransitionTime: "2024-03-29T14:28:49Z"
      message: 'containers with unready status: [controller]'
      reason: ContainersNotReady
      status: "False"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: "2024-03-29T14:28:49Z"
      message: 'containers with unready status: [controller]'
      reason: ContainersNotReady
      status: "False"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: "2024-03-29T14:28:49Z"
      status: "True"
      type: PodScheduled
    containerStatuses:
    - containerID: containerd://9867d1e0e7a9a20cf5af6ec887e5c154441c3669d4855d8422b1f3a7153dc39b
      image: sha256:bb6131df0ac9a6d55d411db23512603a3d784ffc2f032f757d65066ad3266ab2
      imageID: gcr.io/knative-nightly/knative.dev/net-certmanager/cmd/controller@sha256:2282bd8df0ef44ff4bd64ee6a6e11103e3301003c7053fb08d91b062113f46fa
      lastState:
        terminated:
          containerID: containerd://9867d1e0e7a9a20cf5af6ec887e5c154441c3669d4855d8422b1f3a7153dc39b
          exitCode: 1
          finishedAt: "2024-03-29T14:29:33Z"
          reason: Error
          startedAt: "2024-03-29T14:29:33Z"
      name: controller
      ready: false
      restartCount: 3
      started: false
      state:
        waiting:
          message: back-off 40s restarting failed container=controller pod=net-certmanager-controller-879c76664-jbn7r_c9023fde-9375-4d36-ba73-4c0aa3ed8b65(20cc3f18-c917-4601-81e8-678155d28f76)
          reason: CrashLoopBackOff
    hostIP: 10.128.0.12
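For reading the status above, a rough sketch of how the restart back-off grows may help. It assumes the kubelet's documented behaviour of an exponential back-off starting at 10s, doubling per restart and capped at five minutes; the function name crashLoopDelay and the exact mapping from restartCount to delay are illustrative assumptions, not kubelet code.

```go
package main

import (
	"fmt"
	"time"
)

// crashLoopDelay approximates the restart back-off for a crashing container.
// Assumption: 10s base delay, doubled on each restart, capped at 5 minutes.
func crashLoopDelay(restartCount int) time.Duration {
	const (
		baseDelay = 10 * time.Second
		maxDelay  = 5 * time.Minute
	)
	if restartCount < 1 {
		return 0
	}
	d := baseDelay
	for i := 1; i < restartCount; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	return d
}

func main() {
	for n := 1; n <= 6; n++ {
		fmt.Printf("restartCount=%d back-off=%s\n", n, crashLoopDelay(n))
	}
}
```

Under these assumptions restartCount: 3 corresponds to the "back-off 40s" message in the dump, which suggests the container had only just started crashing when the status was captured.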

Member

ReToCode commented Apr 16, 2024

I managed to get some logs. So far I have identified one container that was restarted, but that one was killed by chaosduck: https://gist.github.com/ReToCode/f140c99b13ad9efe6d81b43344fb1824#file-logs-txt-L122 which seems reasonable.

These are the containers that stern tailed as they were created (+) and deleted (-):

+ net-certmanager-controller-56d454f56c-snpg5 › controller
+ net-certmanager-controller-56d454f56c-5mcfc › controller
- net-certmanager-controller-56d454f56c-snpg5 › controller
+ net-certmanager-controller-56d454f56c-2d95n › controller
- net-certmanager-controller-56d454f56c-5mcfc › controller
+ net-certmanager-controller-56d454f56c-4x8wd › controller
- net-certmanager-controller-56d454f56c-2d95n › controller
+ net-certmanager-controller-56d454f56c-j2s9r › controller
- net-certmanager-controller-56d454f56c-4x8wd › controller
+ net-certmanager-controller-56d454f56c-qdgdx › controller
- net-certmanager-controller-56d454f56c-j2s9r › controller
cat k8s.logs.txt| grep Quacking | grep net-certmanager
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:21:04 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-8tgw6"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:21:58 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-cqz8j"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:22:56 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-92lfj"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:23:52 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-8gllm"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:24:22 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-snpg5"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:25:17 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-5mcfc"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:25:49 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-2d95n"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:26:38 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-4x8wd"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:27:15 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-j2s9r"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:28:08 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-qdgdx"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:28:57 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-jlq9r"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:29:25 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-z2765"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:30:23 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-6jk9x"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:31:21 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-zfrl2"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:31:47 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-b4bnh"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:32:42 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-s7vfd"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:33:26 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-2dcgg"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:33:46 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-wk7q4"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:34:34 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-rbsnv"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:35:19 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-7tz4r"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:35:47 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-l9xqt"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:36:09 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-j8jxt"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:36:29 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-r2vs5"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:37:24 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-2rfgx"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:38:09 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-4qkdc"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:39:01 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-hvrg4"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:39:56 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-tqzrt"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:40:48 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-5qzgn"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:41:32 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-pqnq8"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:42:22 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-f6jk4"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:42:42 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-7wh58"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:43:08 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-qfl8c"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:44:05 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-qgj4v"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:45:03 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-nts7r"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:45:25 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-5wrf2"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:46:25 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-8p58k"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:47:11 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-nrm58"

It seems that pod is killed quite often. I'm not sure whether that is expected with chaosduck; maybe @dprotaso knows more? In any case, that alone should not explain why we see errors on requests.
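To quantify "pretty often": the grep output above lists 37 kills between 12:21:04 and 12:47:11, so 36 gaps over 1567 seconds, or roughly one kill every 44 seconds. A small sketch for extracting the gaps from such a log follows; the regular expression and the killIntervals helper are assumptions based only on the line format shown above, not on chaosduck's source.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
	"time"
)

// quackRe captures the timestamp, component and leader pod of a chaosduck line.
// Assumption: the line format matches the grep output shown above.
var quackRe = regexp.MustCompile(
	`(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) Quacking at "([^"]+)" leader "([^"]+)"`)

// killIntervals returns the gaps between consecutive kills of one component.
func killIntervals(logs, component string) ([]time.Duration, error) {
	var prev time.Time
	var gaps []time.Duration
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		m := quackRe.FindStringSubmatch(sc.Text())
		if m == nil || m[2] != component {
			continue // not a kill of the component we care about
		}
		ts, err := time.Parse("2006/01/02 15:04:05", m[1])
		if err != nil {
			return nil, err
		}
		if !prev.IsZero() {
			gaps = append(gaps, ts.Sub(prev))
		}
		prev = ts
	}
	return gaps, sc.Err()
}

func main() {
	// First three lines of the grep output above.
	logs := `chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:21:04 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-8tgw6"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:21:58 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-cqz8j"
chaosduck-6d6879f88-8nlrb chaosduck 2024/04/15 12:22:56 Quacking at "net-certmanager-controller" leader "net-certmanager-controller-56d454f56c-92lfj"`

	gaps, err := killIntervals(logs, "net-certmanager-controller")
	if err != nil {
		panic(err)
	}
	fmt.Println(gaps) // [54s 58s]
}
```

Running the same function over the full k8s.logs.txt would show whether the gaps are uniform, which could help decide whether the kill frequency matches the configured chaosduck period.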

Member

dprotaso commented Apr 16, 2024

I'd recommend disabling chaosduck for this component to see if it helps. If it does, create an issue afterwards to add more resiliency.

Contributor Author

skonto commented Apr 18, 2024

/retest

@skonto skonto force-pushed the debug_https_serving_main branch from bc46b25 to b5579cf Compare April 18, 2024 10:01

knative-prow bot commented Apr 18, 2024

@skonto: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                 Commit   Details  Required  Rerun command
build-tests_serving_main  68bc1e3  link     true      /test build-tests
https_serving_main        68bc1e3  link     false     /test https

Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@skonto skonto closed this Apr 26, 2024
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test-and-release It flags unit/e2e/conformance/perf test issues for product features size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

3 participants