Deflake tracing tests #10328
Conversation
Issues linked to changelog:
Visit the preview URL for this PR (updated for commit 738b661): https://gloo-edge--pr10328-deflake-tracing-test-4k7yuir8.web.app (expires Wed, 20 Nov 2024 20:41:26 GMT) 🔥 via Firebase Hosting GitHub Action 🌎 Sign: 77c2b86e287749579b7ff9cadb81e099042ef677
lgtm
@@ -124,7 +124,14 @@ func (s *testingSuite) TestSpanNameTransformationsWithoutRouteDecorator() {
	})),
	curl.WithHostHeader(testHostname),
	curl.WithPort(gatewayProxyPort),
	// this request sometimes times out, so let's add some retries
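(The diff is truncated here. Purely for illustration, request-level retries boil down to curl's native retry flags; the URL and host header below are placeholders, and the real test builds its command through the suite's curl option helpers rather than by hand.)

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Placeholder URL and host header; real values come from the test suite.
	cmd := exec.Command("curl",
		"--silent",           // the -s flag this PR also adds
		"--retry", "3",       // retry transient failures a few times
		"--retry-delay", "2", // wait 2 seconds between attempts
		"--connect-timeout", "5",
		"--header", "Host: example.com",
		"http://localhost:8080/some/path",
	)
	out, err := cmd.CombinedOutput()
	fmt.Println(string(out), err)
}
```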
What is unique about this test that means it benefits from a retry? If there is something unique that would be good to callout. If other tests are susceptible, would we be better adding this as a default to the AssertEventuallyConsistentCurlResponse function instead and letting callers who don't want retries opt out?
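For illustration, a retries-by-default helper with an explicit opt-out might be shaped like this; all names below are hypothetical and do not reflect the repo's actual option API:

```go
// Sketch only: option and function names are invented to show the shape of
// the "default on, caller opts out" idea.
package assertions

import "time"

type curlConfig struct {
	retries            int
	retryDelay         time.Duration
	retriesSetByCaller bool
}

type Option func(*curlConfig)

// WithoutRetries is the opt-out a caller would use to force a single attempt.
func WithoutRetries() Option {
	return func(c *curlConfig) {
		c.retries = 0
		c.retriesSetByCaller = true
	}
}

// AssertEventuallyConsistentCurlResponse applies retry defaults unless the
// caller explicitly opted out via WithoutRetries.
func AssertEventuallyConsistentCurlResponse(opts ...Option) {
	cfg := curlConfig{}
	for _, opt := range opts {
		opt(&cfg)
	}
	if !cfg.retriesSetByCaller {
		cfg.retries = 3
		cfg.retryDelay = 2 * time.Second
	}
	// ... build the curl command from cfg and run the usual
	// Eventually/Consistently assertions against the response.
}
```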
Since my guess is that this is a timing issue, I can really only speculate here, but here are a few possible reasons:
- We define a non-default gateway - maybe this is causing some sort of translation delay.
- We also define a new service that routes to the separate port on the new gateway. Maybe that service needs some kind of "warm-up" time, or possibly there's some sort of timing issue or race condition in Kubernetes' routing logic.
- Something to do with the OpenTelemetry collector, possibly related to verifying the presence of the upstream. This was something David suggested when we were discussing this initially.
All of these possibilities sound a little ridiculous to me, to be honest. I could spend more time investigating this to find the root cause, but I don't feel it's worth the effort given how difficult it is to reproduce at this point. And since I've gone through the effort of pausing the cluster after the failure and demonstrating that things are still in a consistent state and subsequent requests are working, I think it's safe to just add some retries here, since that's more or less what I was doing when I reproduced the issue locally.
There is one alternative approach we could consider here: adding a small delay to the AssertEventuallyConsistentCurlResponse call that we're using in the test. That would preempt additional flakes that could hypothetically be affected by this same issue. If you think that's a better approach, I'm happy to do that instead.
Finally, I think it's also worth considering that we can't really know what fixes the flake until we try merging something and observe whether it continues to occur. So I'd like to try something, but I'm happy to continue the discussion on whatever we think is the best thing to try.
would we be better adding this as a default to the AssertEventuallyConsistentCurlResponse function instead and letting callers who don't want retries opt out?
This is something that I did consider. Personally, I don't particularly like it when I call a function with a set of options and then observe the function changing those options under my feet. But there is a benefit to preempting this flake from occurring again, so I'd feel more comfortable doing that in one of the following two ways:
- Add a delay in AssertEventuallyConsistentCurlResponse itself.
- Consider making AssertEventuallyConsistentCurlResponse slightly less strict about its requirements. For example, maybe we could make the Consistently block try again a few times, or replace it with something like "eventually, we get 5 successful requests in a row" (sketched below).
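A rough sketch of that last idea, assuming Gomega (the source of the Consistently/Eventually blocks discussed above) and a placeholder endpoint; the real test issues its requests through the curl helpers:

```go
package tracing

import (
	"net/http"
	"testing"
	"time"

	"github.com/onsi/gomega"
)

// Placeholder endpoint; the real test curls the gateway proxy.
const target = "http://localhost:8080/"

func requestSucceeds() bool {
	resp, err := http.Get(target)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func TestEventuallyConsistentResponses(t *testing.T) {
	g := gomega.NewWithT(t)
	// Instead of requiring every attempt to pass from the start (Consistently),
	// keep polling until we observe 5 successful requests in a row.
	g.Eventually(func() bool {
		for i := 0; i < 5; i++ {
			if !requestSucceeds() {
				return false
			}
		}
		return true
	}, 2*time.Minute, 5*time.Second).Should(gomega.BeTrue())
}
```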
I'm not quite sure I understand the "delay" idea. Isn't the point of having an Eventually first that it operates as the delay and keeps iterating until things are ready?
As for relaxing the strict requirements, I agree that we may need to come up with ways to ensure this works. In my mind, curl retries were the best way to do this. What do you think wrapping the Consistently in an Eventually block would help test that adding retries to our requests would not?
Makes sense to me! If this is a useful starting point, I think we should add it for this test. If we find it effective, I would argue that we should consider making it a standard that other tests can use as well.
	curl.WithPath(pathWithoutRouteDescriptor),
	curl.Silent(),
If it's valuable for this test, should we make the silent argument a default?
@ashishb-solo are you still working on this or can it be closed?
Description
See issue link for flake details. This pull request attempts to address a flake by adding retries to one particular request that is resulting in failures.
Additionally, the test output was a bit hard to read, so this pull request makes two changes to address that:
- Add the -s flag to avoid progress bars appearing in the curl stderr
- Add some newlines that were missing from a few printf statements
API changes
None
Code changes
Add retry logic in the test and some print statements
CI changes
Nil
Docs changes
None
Context
Issue link has information on the test flakes
Interesting decisions
I decided to add the retry logic to the failing requests only. We could have added the retry attempts in this block to preempt other flakes like this one from occurring, but that could also have an adverse effect on other tests and feels like a bad separation of concerns, so I preferred not to go with that solution. Happy to change my mind here though.
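Conceptually, the change amounts to wrapping just this one request in a bounded retry loop rather than changing the shared helper's defaults. A minimal standalone sketch of that idea (the URL is a placeholder and retryRequest is an invented name, not the suite's actual helper):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// retryRequest is an invented helper for illustration: it retries a single
// request a bounded number of times before giving up.
func retryRequest(url string, attempts int, delay time.Duration) (*http.Response, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := http.Get(url)
		if err == nil && resp.StatusCode == http.StatusOK {
			return resp, nil
		}
		if err != nil {
			lastErr = err
		} else {
			lastErr = fmt.Errorf("unexpected status %d", resp.StatusCode)
			resp.Body.Close()
		}
		time.Sleep(delay)
	}
	return nil, fmt.Errorf("request failed after %d attempts: %w", attempts, lastErr)
}

func main() {
	// Placeholder target; the real test hits the gateway proxy.
	resp, err := retryRequest("http://localhost:8080/some/path", 3, 2*time.Second)
	if err != nil {
		fmt.Println("failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("succeeded with status", resp.StatusCode)
}
```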
Testing steps
See issue link
Notes for reviewers
Checklist: