feat: add mutex to testnode not to run multiple nodes #1342
Codecov Report
@@ Coverage Diff @@
## main #1342 +/- ##
==========================================
+ Coverage 48.21% 48.24% +0.02%
==========================================
Files 79 79
Lines 4405 4411 +6
==========================================
+ Hits 2124 2128 +4
- Misses 2103 2105 +2
Partials 178 178
Have you tested just to make sure that this resolves these flaky tests?
testutil/testnode/full_node.go
Outdated
@@ -195,13 +200,17 @@ func DefaultNetwork(t *testing.T, blockTime time.Duration) (cleanup func() error
tmNode, app, cctx, err := New(t, DefaultParams(), tmCfg, false, genState, kr)
require.NoError(t, err)

// locking the mutex not to be able to spin up multiple nodes at the same time.
mut.Lock()
- mut.Lock()
+ mut.Lock()
+ t.Cleanup(mut.Unlock)
I would personally do it this way. Perhaps even remove the custom cleanup function, since the testing fixture already provides it. There are two cases that I can see where the code exits before calling `Unlock`. Also, I generally prefer not to pass responsibility for cleanup to a function outside the scope where the resources were created (if possible).
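A minimal sketch of the suggestion above, assuming a package-level mutex like the `mut` in the diff. The names `nodeMu`, `lockForTest`, `cleanupRegistrar`, and `fakeT` are hypothetical and not part of the PR; `fakeT` exists only so the pattern can be exercised outside `go test`. In a real test, `*testing.T` would be passed directly, since it already has a `Cleanup(func())` method.

```go
package main

import "sync"

// nodeMu is a hypothetical stand-in for the package-level `mut` in the diff.
var nodeMu sync.Mutex

// cleanupRegistrar is the subset of *testing.T used here.
// *testing.T satisfies it through its Cleanup method.
type cleanupRegistrar interface {
	Cleanup(func())
}

// lockForTest acquires mu and registers the unlock with the test fixture,
// so the lock is released when the test finishes, whether it passed or failed.
func lockForTest(t cleanupRegistrar, mu *sync.Mutex) {
	mu.Lock()
	t.Cleanup(mu.Unlock)
}

// fakeT is an illustration-only stand-in for *testing.T.
// It records cleanups and runs them in last-in, first-out order.
type fakeT struct {
	cleanups []func()
}

// Cleanup records fn to be run later by runCleanups.
func (f *fakeT) Cleanup(fn func()) {
	f.cleanups = append(f.cleanups, fn)
}

// runCleanups runs the recorded cleanups, most recently added first.
func (f *fakeT) runCleanups() {
	for i := len(f.cleanups) - 1; i >= 0; i-- {
		f.cleanups[i]()
	}
}
```

With this shape, the caller no longer needs to remember to call a returned cleanup function for the lock itself; the fixture owns the release.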
> Perhaps even remove the custom cleanup function since the testing fixture already provides it
Understandable. We could also define a `TestNode` struct that contains an exported `Cleanup` function to be used in tests. However, I am not against the way it is right now, as it is really easy to start and stop.
> There are two cases that I can see where the code exits before calling Unlock
Yes, but those are not returning errors, they're `require`ing no errors. This means that, if I understand correctly, if we hit that case, the whole test suite will tear down and there will be no need to unlock or clean up.
This raises a question: will starting a separate test in a separate package, after failing a `require`, be affected by the locked mutex? I assume not, but I'm not sure.
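For context on the question above: according to the `testing` package documentation, `t.FailNow` (which `require` calls on failure) stops the current test by calling `runtime.Goexit`, and execution continues at the next test rather than tearing down the whole binary. `go test` also builds and runs a separate test binary per package, so a package-level mutex is not shared between packages, but later tests in the same package could block on it. The sketch below imitates that exit path with a plain goroutine; `lockedAfterAbort` is a hypothetical helper, not code from the PR.

```go
package main

import (
	"runtime"
	"sync"
)

// lockedAfterAbort locks a mutex inside a goroutine and then aborts the
// goroutine with runtime.Goexit, the same mechanism t.FailNow relies on.
// It reports whether the mutex is still held after the goroutine is gone.
func lockedAfterAbort(deferUnlock bool) bool {
	var mu sync.Mutex
	done := make(chan struct{})

	go func() {
		// Registered first, so it runs last, after any deferred Unlock.
		defer close(done)

		mu.Lock()
		if deferUnlock {
			// Deferred calls still run when the goroutine exits via Goexit.
			defer mu.Unlock()
		}

		runtime.Goexit()
		mu.Unlock() // never reached: Goexit does not return
	}()

	<-done

	// TryLock succeeds only if nobody holds the mutex any more.
	if mu.TryLock() {
		mu.Unlock()
		return false
	}
	return true
}
```

Without the deferred unlock the mutex stays held after the abort, which is the situation a failed `require` between `Lock` and `Unlock` would create for the remaining tests in the same package.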
If we can stick everything in the testing.T.Cleanup, then we totally should. @sweexordious if you think this is out of scope for this PR, then I can handle it in a follow-up.
I can open a follow up PR doing that. Do we want to merge this one?
@cmwaters I didn't try the flaky tests specifically, but this should give us an extra layer of security against misusing the default node. So even if it doesn't fix the flakiness, it's a step in that direction.
@sweexordious when you get the chance, can you try this with the qgb tests to see if it fixes the issue? thx!
@evan-forbes Yes, definitely, tomorrow. Can I merge this one?
Yeah, sure. I'm probably just stating the obvious, but if for some reason this doesn't help, or we missed something, then let's make a mental note to remove or fix this change 🙂
Optional comment edits to improve clarity. No blocking feedback
Co-authored-by: Rootul P <[email protected]>
@evan-forbes Just checked, it doesn't fix my issue.
The problem is, this issue doesn't happen when running the package tests alone. Should we still merge, or keep it open until we figure out why it's failing this way?
I don't think this is the correct fix. How is this different from just ensuring that no tests run in parallel, i.e. not using […]
So it isn't a tests-running-in-parallel issue, but something else. Since the tests aren't running in parallel, either ports aren't getting freed efficiently, or tests just need to handle this and find a free port. Node had this issue; they implemented a […]
Where in the test code are we defining ports?
Additionally, looking at the linked issue, the DB concern should be solvable by simply creating unique test directories for each node, unless I'm misunderstanding that bug.
Is this deterministic on your machine? I just ran the test suite on my Mac and it didn't fail.
That's one way to solve it. But in this PR I wanted to help address this issue without having to assign ports dynamically in every repo that uses this code.
They're coming from the default config from Celestia-core:
Yes, the DB issue could definitely be solved that way. In fact, we could fix the ports issue by assigning them dynamically when initializing the config, and the DB issue by specifying different home directories, but we would still be stuck with the configuration being sealed when starting the node. So unless we do a big refactor to avoid sealing the config during startup, we will never be able to run multiple nodes in parallel with different configs. Hence this PR, which adds an extra layer of security by ensuring that no two nodes run in parallel (although it doesn't solve the original problem).
Please correct me if I'm wrong, @evan-forbes.
I don't think having a global lock to bottleneck tests to a single node is the right approach. We should be able to run tests in parallel. Additionally, since the issue persists, there is no benefit to merging this in.
I think we need to clarify the actual problem statement we are trying to solve. The issue cites the port collision and the overlap in directory, which this PR doesn't address.
Along with the detailed problem statement, we should also align on the ideal end state.
To me being able to run tests and nodes in parallel would be ideal, so we should find a solution that does that.
@MSevey Makes sense, I'll see if I can solve the tests issue locally and update this PR with the changes. And then, we open an issue that concerns running the tests in parallel.
Created an issue for the above problem + a repo that is able to reproduce it: celestiaorg/orchestrator-relayer#104
Just found out what the issue is about. I didn't know that […] As a solution, what do you think of the following:
So ports are allocated dynamically by default.
I am happy to open a quick PR for […] Let me know what you think.
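One common way to allocate ports dynamically, sketched here under the assumption that the node config accepts a `tcp://host:port` listen address, is to bind to port 0 and let the operating system pick a free port. `freePort` and `listenAddr` are hypothetical helpers for illustration; this is not necessarily how the follow-up PR does it.

```go
package main

import (
	"fmt"
	"net"
)

// freePort asks the OS for an unused TCP port by listening on port 0,
// reads back the port that was assigned, and releases it again.
func freePort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()

	addr, ok := l.Addr().(*net.TCPAddr)
	if !ok {
		return 0, fmt.Errorf("unexpected address type %T", l.Addr())
	}
	return addr.Port, nil
}

// listenAddr formats a port as a loopback listen address.
// The "tcp://" scheme is an assumption about the config format.
func listenAddr(port int) string {
	return fmt.Sprintf("tcp://127.0.0.1:%d", port)
}
```

There is a small window between closing the listener and the node binding the port in which another process could take it, so this reduces collisions rather than ruling them out.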
Interesting, I also did not know that it would run things in parallel by default. So this runs directories in parallel? In order to run tests in parallel within a directory, I imagine you still need the […]
I'm in favor of option 1. I think option 2 is a bit of an anti-pattern. If someone is trying to run things in parallel, and our test code is forcing it to be sequential, that could be giving developers a false positive.
Yes, 100%
Cool, will open a new PR making the change and close this one. Thanks a lot everyone for the help 👍 👍
I'd be in favor of 1 as well. We should also change the default home dir Tendermint uses to the t.TempDir, if we're not already doing that somewhere.
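A minimal sketch of giving every node its own home directory. Inside a test, `t.TempDir()` returns a fresh directory and removes it automatically when the test ends; the version below uses `os.MkdirTemp` so it also works outside a test. `newNodeHome` and the `testnode-*` pattern are hypothetical and not taken from the repository.

```go
package main

import "os"

// newNodeHome creates a unique temporary directory to use as a node's home
// and returns it together with a function that removes it again.
func newNodeHome() (dir string, cleanup func() error, err error) {
	// The "*" in the pattern is replaced with a random string,
	// so concurrent callers never receive the same path.
	dir, err = os.MkdirTemp("", "testnode-*")
	if err != nil {
		return "", nil, err
	}
	return dir, func() error { return os.RemoveAll(dir) }, nil
}
```

Because each call yields a different path, two nodes started at the same time would not open the same database files.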
I have seen some |
Interesting find and this seems to be the case based on
Just so I follow, two separate packages are running tests that spin up a testnode and these testnodes conflict? Which packages are they? I see app_test here but I don't see another invocation of the testutil/testnode node. I do see usage of the cosmos-sdk […]
Maybe a silly question, but is it possible to make the testnodes independent so that multiple can be run simultaneously? If that's too difficult then it may not be worth it. My hunch is that the integration tests that use nodes are the longest-running tests, so if we don't run them in parallel we're going to increase our total test run time, which will lead to slower dev iteration speed and CI checks.
Sorry for not specifying correctly. This is happening in a separate repo on a personal branch. You can reproduce using this: https://github.com/sweexordious/orchestrator-relayer/tree/not_working_test_poc
Do you mean parallel tests inside a single package or parallel between packages? If it's the latter, then you can do it with this change: #1368. If you mean parallel tests inside the same package, then that can be achieved with #1368 (I believe; I tried it locally, but might be missing something), but only for testnodes having the same configuration. However, if we want to test something that changes something in the […]
Line 12 in 6163e27
Also, there are some data races happening even inside a single testnode instance, so running multiple instances might lead to undefined behavior: #1369
I meant parallel between packages so the default |
## Overview
Dynamic port allocation for the testnode as per #1342 (comment)
## Checklist
- [ ] New and updated code has appropriate documentation
- [ ] New and updated code has new and/or updated testing
- [ ] Required CI checks are passing
- [ ] Visual proof for any user facing features like CLI or documentation updates
- [ ] Linked issues closed with keywords
Overview
Closes #1299
Checklist