Deflake tracing tests #10328
Conversation
Issues linked to changelog:
Visit the preview URL for this PR (updated for commit 738b661): https://gloo-edge--pr10328-deflake-tracing-test-4k7yuir8.web.app (expires Wed, 20 Nov 2024 20:41:26 GMT) 🔥 via Firebase Hosting GitHub Action 🌎 Sign: 77c2b86e287749579b7ff9cadb81e099042ef677
lgtm
@@ -124,7 +124,14 @@ func (s *testingSuite) TestSpanNameTransformationsWithoutRouteDecorator() {
	})),
	curl.WithHostHeader(testHostname),
	curl.WithPort(gatewayProxyPort),
	// this request sometimes times out, so let's add some retries
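(The diff is truncated here. Purely for illustration, request-level retries boil down to curl's native retry flags; the URL and host header below are placeholders, and the real test builds its command through the suite's curl option helpers rather than by hand.)

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Placeholder URL and host header; real values come from the test suite.
	cmd := exec.Command("curl",
		"--silent",           // the -s flag this PR also adds
		"--retry", "3",       // retry transient failures a few times
		"--retry-delay", "2", // wait 2 seconds between attempts
		"--connect-timeout", "5",
		"--header", "Host: example.com",
		"http://localhost:8080/some/path",
	)
	out, err := cmd.CombinedOutput()
	fmt.Println(string(out), err)
}
```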
What is unique about this test that means it benefits from a retry? If there is something unique that would be good to callout. If other tests are susceptible, would we be better adding this as a default to the AssertEventuallyConsistentCurlResponse function instead and letting callers who don't want retries opt out?
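For illustration, a retries-by-default helper with an explicit opt-out might be shaped like this; all names below are hypothetical and do not reflect the repo's actual option API:

```go
// Sketch only: option and function names are invented to show the shape of
// the "default on, caller opts out" idea.
package assertions

import "time"

type curlConfig struct {
	retries            int
	retryDelay         time.Duration
	retriesSetByCaller bool
}

type Option func(*curlConfig)

// WithoutRetries is the opt-out a caller would use to force a single attempt.
func WithoutRetries() Option {
	return func(c *curlConfig) {
		c.retries = 0
		c.retriesSetByCaller = true
	}
}

// AssertEventuallyConsistentCurlResponse applies retry defaults unless the
// caller explicitly opted out via WithoutRetries.
func AssertEventuallyConsistentCurlResponse(opts ...Option) {
	cfg := curlConfig{}
	for _, opt := range opts {
		opt(&cfg)
	}
	if !cfg.retriesSetByCaller {
		cfg.retries = 3
		cfg.retryDelay = 2 * time.Second
	}
	// ... build the curl command from cfg and run the usual
	// Eventually/Consistently assertions against the response.
}
```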
Since my guess is that this is a timing issue, I can really only speculate here, but here are a few possible reasons:
- We define a non-default gateway - maybe this is causing some sort of translation delay.
- We also define a new service that routes to the separate port on the new gateway. Maybe that service needs some kind of "warm-up" time, or possibly there's some sort of timing issue or race condition in Kubernetes' routing logic.
- Something to do with the OpenTelemetry collector, possibly related to verifying the presence of the upstream. This was something David suggested when we were discussing this initially.
All of these possibilities sound a little ridiculous to me, to be honest. I could spend more time investigating this to find the root cause, but I don't feel it's worth the effort given how difficult it is to reproduce at this point. And since I've gone through the effort of pausing the cluster after the failure and demonstrating that things are still in a consistent state and subsequent requests are working, I think it's safe to just add some retries here, since that's more or less what I was doing when I reproduced the issue locally.
There is one alternative approach we could consider here: adding a small delay to the AssertEventuallyConsistentCurlResponse call that we're using in the test. That would preempt additional flakes that could hypothetically be affected by this same issue. If you think that's a better approach, I'm happy to do that instead.
Finally, I think it's also worth considering that we can't really know what fixes the flake until we try merging something and observe whether it continues to occur. So I'd like to try something, but I'm happy to continue the discussion on whatever we think is the best thing to try.
would we be better adding this as a default to the AssertEventuallyConsistentCurlResponse function instead and letting callers who don't want retries opt out?
This is something that I did consider. Personally, I don't particularly like it when I call a function with a set of options and then observe the function changing those options under my feet. But there is a benefit to preempting this flake from occurring again, so I'd feel more comfortable doing that in one of the following two ways:
- Add a delay in AssertEventuallyConsistentCurlResponse itself.
- Consider making AssertEventuallyConsistentCurlResponse slightly less strict about its requirements. For example, maybe we could make the Consistently block try again a few times, or replace it with something like "eventually, we get 5 successful requests in a row" (sketched below).
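A rough sketch of that last idea, assuming Gomega (the source of the Consistently/Eventually blocks discussed above) and a placeholder endpoint; the real test issues its requests through the curl helpers:

```go
package tracing

import (
	"net/http"
	"testing"
	"time"

	"github.com/onsi/gomega"
)

// Placeholder endpoint; the real test curls the gateway proxy.
const target = "http://localhost:8080/"

func requestSucceeds() bool {
	resp, err := http.Get(target)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func TestEventuallyConsistentResponses(t *testing.T) {
	g := gomega.NewWithT(t)
	// Instead of requiring every attempt to pass from the start (Consistently),
	// keep polling until we observe 5 successful requests in a row.
	g.Eventually(func() bool {
		for i := 0; i < 5; i++ {
			if !requestSucceeds() {
				return false
			}
		}
		return true
	}, 2*time.Minute, 5*time.Second).Should(gomega.BeTrue())
}
```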
I'm not quite sure I understand the "delay" idea. Isn't the point of having an Eventually first that it operates as the delay and keeps iterating until things are ready?
As for relaxing the strict requirements, I agree that we may need to come up with ways to ensure this works. In my mind, curl retries were the best way to do this. What do you think wrapping the Consistently in an Eventually block would help test that adding retries to our requests would not?
Makes sense to me! If this is a useful starting point, I think we should add it for this test. If we find it effective, I would argue that we should consider making it a standard that other tests can use as well.
	curl.WithPath(pathWithoutRouteDescriptor),
	curl.Silent(),
If it's valuable for this test, should we make the silent argument a default?
@ashishb-solo are you still working on this or can it be closed?
Description
See issue link for flake details. This pull request attempts to address a flake by adding retries to one particular request that is resulting in failures.
Additionally, the test output was a bit hard to read, so this pull request makes two changes to address that:
- Add the -s flag to avoid progress bars appearing in the curl stderr
- Add some newlines that were missing from a few printf statements
API changes
None
Code changes
Add retry logic in the test and some print statements
CI changes
Nil
Docs changes
None
Context
Issue link has information on the test flakes
Interesting decisions
I decided to add the retry logic to the failing requests only. We could have added the retry attempts in this block to preempt other flakes like this one from occurring, but that could also have an adverse effect on other tests and feels like a bad separation of concerns, so I preferred not to go with that solution. Happy to change my mind here though.
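Conceptually, the change amounts to wrapping just this one request in a bounded retry loop rather than changing the shared helper's defaults. A minimal standalone sketch of that idea (the URL is a placeholder and retryRequest is an invented name, not the suite's actual helper):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// retryRequest is an invented helper for illustration: it retries a single
// request a bounded number of times before giving up.
func retryRequest(url string, attempts int, delay time.Duration) (*http.Response, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := http.Get(url)
		if err == nil && resp.StatusCode == http.StatusOK {
			return resp, nil
		}
		if err != nil {
			lastErr = err
		} else {
			lastErr = fmt.Errorf("unexpected status %d", resp.StatusCode)
			resp.Body.Close()
		}
		time.Sleep(delay)
	}
	return nil, fmt.Errorf("request failed after %d attempts: %w", attempts, lastErr)
}

func main() {
	// Placeholder target; the real test hits the gateway proxy.
	resp, err := retryRequest("http://localhost:8080/some/path", 3, 2*time.Second)
	if err != nil {
		fmt.Println("failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("succeeded with status", resp.StatusCode)
}
```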
Testing steps
See issue link
Notes for reviewers
Checklist: