Deflake tracing tests #10328
What is unique about this test that means it benefits from a retry? If there is something unique, it would be good to call it out. If other tests are susceptible, would we be better off adding this as a default to the `AssertEventuallyConsistentCurlResponse` function instead and letting callers who don't want retries opt out?
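For illustration, retries-on-by-default with an explicit opt-out could look roughly like the sketch below. All of these names and signatures are hypothetical; they only show the "default with opt-out" shape, not the helper's actual API.

```go
// Hypothetical sketch of retries-on-by-default with an opt-out option.
// None of these names are the real test-utility API.
type curlOpts struct {
	retries int
}

type CurlOption func(*curlOpts)

// WithoutRetries lets callers that don't want retries opt out.
func WithoutRetries() CurlOption {
	return func(o *curlOpts) { o.retries = 0 }
}

func AssertEventuallyConsistentCurlResponse(opts ...CurlOption) {
	o := &curlOpts{retries: 3} // retry by default
	for _, opt := range opts {
		opt(o)
	}
	// ... issue the curl request up to o.retries+1 times before running the
	// existing consistency assertion ...
}
```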
Since my guess is that this is a timing issue, I can really only speculate here, but here are a few possible reasons:

- We define a non-default gateway - maybe this is causing some sort of translation delay.
- We also define a new service that routes to a separate port on the new gateway. Maybe that service needs some kind of "warm up" time, or possibly there's some sort of timing issue or race condition in Kubernetes' routing logic.
- Something to do with the OpenTelemetry collector, possibly related to verifying the presence of the upstream. This was something David suggested when we were discussing this initially.

All of these possibilities sound a little ridiculous to me, to be honest. I could spend more time investigating to find the root cause, but I don't feel it's worth the effort given how difficult this is to reproduce at this point. And since I've gone through the effort of pausing the cluster after the failure and demonstrating that things are still in a consistent state and that subsequent requests are working, I think it's safe to just add some retries here, since that's more or less what I was doing when I reproduced the issue locally.

There is one alternate approach we could consider here: adding a small delay in the `AssertEventuallyConsistentCurlResponse` call that we're using in the test. That would preempt additional flakes that could hypothetically be affected by this same issue. If you think that's a better approach, then I'm happy to do that instead. Finally, I think it's also worth considering that we can't really know what fixes the flake until we try merging something and observe whether the flake continues to occur. So I'd like to try something, but I'm happy to continue the discussion on whatever we think is the best thing to try.
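For reference, the retries here could look roughly like the Gomega-style sketch below. `checkTracedResponse` is a hypothetical stand-in for the curl assertion the test actually performs, not its real name.

```go
// Rough sketch only, assuming the usual Gomega dot-import in the test package.
// checkTracedResponse is a hypothetical stand-in for the test's curl assertion.
Eventually(func(g Gomega) {
	g.Expect(checkTracedResponse(ctx)).To(Succeed())
}, "30s", "1s").Should(Succeed(), "request through the new gateway should eventually succeed")
```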
This is something that I did consider. Personally, I don't particularly like it when I call a function with a set of options and then observe the function changing those options under my feet. But there is a benefit to preempting this flake from occurring again, so I'd feel more comfortable doing that in one of the following two ways:

1. Add a delay in `AssertEventuallyConsistentCurlResponse` itself.
2. Make `AssertEventuallyConsistentCurlResponse` slightly less strict about its requirements. For example, maybe we could make the `Consistently` block try again a few times, or replace it with something like "eventually, we get 5 successful requests in a row" (see the rough sketch below).
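A minimal Gomega-style sketch of that second option, with `sendRequest` as a hypothetical stand-in for the curl call the helper makes:

```go
// Minimal sketch, assuming the usual Gomega dot-import; sendRequest is a
// hypothetical stand-in for the helper's curl call.
Eventually(func(g Gomega) {
	// Require 5 consecutive successes within one poll; any failure restarts
	// the streak on the next Eventually iteration.
	for i := 0; i < 5; i++ {
		g.Expect(sendRequest(ctx)).To(Succeed(), "request %d in the streak failed", i)
	}
}, "60s", "2s").Should(Succeed())
```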
I'm not quite sure I understand the "delay" idea. Isn't the goal of first having an `Eventually` that it operates as the delay and continues iterating until things are ready?

As for reducing the strict requirements, I agree, we may need to come up with ways to ensure this works. In my mind, curl retries were the best way to do that. What do you think wrapping the `Consistently` in an `Eventually` block would help test that adding retries to our requests would not?
If it's valuable for this test, should we make the `silent` argument a default?