feat: add mutex to testnode not to run multiple nodes #1342

Closed

wants to merge 7 commits

Conversation

@rach-id (Member) commented Feb 6, 2023

Overview

Closes #1299

Checklist

  • New and updated code has appropriate documentation
  • New and updated code has new and/or updated testing
  • Required CI checks are passing
  • Visual proof for any user facing features like CLI or documentation updates
  • Linked issues closed with keywords

@rach-id added the enhancement label Feb 6, 2023
@rach-id rach-id requested a review from evan-forbes as a code owner February 6, 2023 19:49
@rach-id rach-id self-assigned this Feb 6, 2023
@MSevey MSevey requested review from a team and mojtaba-esk and removed request for a team February 6, 2023 19:49
@codecov-commenter commented Feb 6, 2023

Codecov Report

Merging #1342 (35c4042) into main (0b447ad) will increase coverage by 0.02%.
The diff coverage is 66.66%.

@@            Coverage Diff             @@
##             main    #1342      +/-   ##
==========================================
+ Coverage   48.21%   48.24%   +0.02%     
==========================================
  Files          79       79              
  Lines        4405     4411       +6     
==========================================
+ Hits         2124     2128       +4     
- Misses       2103     2105       +2     
  Partials      178      178              
Impacted Files                     Coverage Δ
testutil/testnode/rpc_client.go    70.21% <66.66%> (-0.52%) ⬇️


@rach-id added the testing label Feb 6, 2023
@cmwaters (Contributor) left a comment

Have you tested just to make sure that this resolves these flaky tests?

@@ -195,13 +200,17 @@ func DefaultNetwork(t *testing.T, blockTime time.Duration) (cleanup func() error
tmNode, app, cctx, err := New(t, DefaultParams(), tmCfg, false, genState, kr)
require.NoError(t, err)

// locking the mutex so that multiple nodes cannot be spun up at the same time.
mut.Lock()

Suggested change:
-	mut.Lock()
+	mut.Lock()
+	t.Cleanup(mut.Unlock)

I would personally do it this way. Perhaps even remove the custom cleanup function, since the testing fixture already provides one. There are two cases I can see where the code exits before calling Unlock. Also, I generally prefer not to pass responsibility for cleanup to a function outside the scope where the resources were created (if possible).
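
For concreteness, here is a minimal sketch of the locking pattern being suggested. The names mut and DefaultNetwork come from the diff above; everything else is illustrative, not the PR's actual code:

```go
package testnode

import (
	"sync"
	"testing"
)

// mut guards testnode startup so that only one node runs at a time.
var mut sync.Mutex

// DefaultNetwork (body elided): pairing Lock with t.Cleanup(mut.Unlock)
// lets the testing fixture release the lock even when a require.NoError
// aborts the test early, so no explicit unlock path can be missed.
func DefaultNetwork(t *testing.T) {
	mut.Lock()
	t.Cleanup(mut.Unlock)
	// ... start the single testnode and register its own cleanup ...
}
```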

@rach-id (Member, Author):

> Perhaps even remove the custom cleanup function, since the testing fixture already provides one

Understandable. We could also define a TestNode struct that contains an exported Cleanup function to be used in tests. However, I am not against the way it is right now, as it is really easy to start and stop.

> There are two cases I can see where the code exits before calling Unlock

Yes, but those cases don't return errors; they call require.NoError. This means that, if I understand correctly, when we hit one of them the whole test suite tears down and there is no need to unlock or clean up.

This raises a question: would a test in a separate package, started after a require failure, be affected by the locked mutex? I assume not, but I'm not sure.

Member:

if we can stick everything in the testing.T.Cleanup, then we totally should. @sweexordious if you think this is out of scope of this PR, then I can handle it in a follow up

@rach-id (Member, Author):

I can open a follow up PR doing that. Do we want to merge this one?

testutil/testnode/full_node.go (outdated review comment, resolved)

@rach-id (Member, Author) commented Feb 6, 2023

> Have you tested just to make sure that this resolves these flaky tests?

@cmwaters I didn't try the flaky tests specifically, but this should give us an extra layer of protection against misusing the default node. So even if it doesn't fix the flakiness, it's a step in that direction.

@MSevey MSevey requested a review from a team February 6, 2023 22:48
@evan-forbes previously approved these changes Feb 6, 2023

@evan-forbes (Member) left a comment

@sweexordious when you get the chance, can you try this with the qgb tests to see if it fixes the issue? thx!

@rach-id (Member, Author) commented Feb 6, 2023

@evan-forbes Yes, definitely, tomorrow. Can I merge this one?

@evan-forbes (Member):
Yeah, sure. I'm probably just stating the obvious, but if for some reason this doesn't help, or we missed something, then let's make a mental note to remove or fix this change 🙂

@rootulp (Collaborator) left a comment

Optional comment edits to improve clarity. No blocking feedback.

testutil/testnode/rpc_client.go (four outdated review comments, resolved)
@MSevey MSevey requested a review from a team February 7, 2023 08:34
@rach-id (Member, Author) commented Feb 7, 2023

@evan-forbes Just checked, doesn't fix my issue

failed to listen on 127.0.0.1:26657: listen tcp 127.0.0.1:26657: bind: address already in use

The problem is that this issue doesn't happen when running the package tests alone (go test .) or when running them in the IDE. However, with make test, which runs go test -mod=readonly ./..., it fails with the above message. So I'm still not sure what's causing the issue. I assumed the tests were, for some reason, trying to spin up multiple nodes, but apparently not.

Should we still merge? Or keep this open until we figure out why it's failing this way?

@MSevey (Member) commented Feb 7, 2023

I don't think this is the correct fix. How is this different from just ensuring that no tests run in parallel, i.e. not using t.Parallel()? The repo doesn't appear to be using t.Parallel(), so it makes sense that this doesn't fix the issue.

So it isn't a tests-running-in-parallel issue but something else. Since the tests aren't running in parallel, either ports aren't being freed promptly, or the tests need to handle this and find a free port.

Node had this issue, they implemented a getFreePort helper https://github.com/celestiaorg/celestia-node/blob/main/core/testing.go#L192
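
That helper idiom, roughly, looks like the following. This is a minimal sketch of the common approach (bind to port 0 and read back the assigned port), not necessarily celestia-node's exact implementation:

```go
package testnode

import "net"

// getFreePort asks the kernel for an unused TCP port by binding to port 0,
// reads the assigned port back, and releases the listener. Note the small
// race: another process could grab the port between Close and the caller's
// own bind, so this is a best-effort helper for tests.
func getFreePort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}
```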

Where in the test code are we defining ports?

Additionally, looking at the linked issue, the db concern should be solvable by simply creating unique test directories for each node, unless I'm misunderstanding that bug.

@MSevey (Member) commented Feb 7, 2023

> @evan-forbes Just checked, doesn't fix my issue

Is this deterministic on your machine? I just ran the test suite on my Mac and it didn't fail.

@rach-id (Member, Author) commented Feb 7, 2023

> Node had this issue, they implemented a getFreePort helper https://github.com/celestiaorg/celestia-node/blob/main/core/testing.go#L192

That's one way to solve it. But with this PR I wanted to address the issue without having to assign ports dynamically in every repo that uses this code.

> Where in the test code are we defining ports?

They come from the default config in celestia-core:
https://github.com/celestiaorg/celestia-core/blob/07b8dbcb153563db8f2429b94a632b23d2621519/config/config.go#L83-L96

> Additionally, looking at the linked issue, the db concern should be solvable by simply creating unique test directories for each node, unless I'm misunderstanding that bug.

Yes, definitely the DB issue could be solved by that.

In fact, we could fix the ports issue by assigning them dynamically when initializing the config, and the db issue by specifying different home directories, but we would still be stuck with the configuration being sealed when the node starts. So unless we do a big refactor to avoid sealing the config during startup, we will never be able to run multiple nodes in parallel with different configs.

Hence this PR, which adds an extra layer of safety by ensuring that no two nodes run in parallel (it doesn't solve the original problem). The original goal is to fix the port allocation issue, which I'm still investigating locally.

> Is this deterministic on your machine? I just ran the test suite on my Mac and it didn't fail.

The tests running into these issues are not the ones in this repo, but ones I am working on for the orchestrator-relayer repo. Sorry for not providing more context.

Correct me please if I'm wrong @evan-forbes

@MSevey (Member) left a comment

I don't think having a global lock that bottlenecks tests to a single node is the right approach. We should be able to run tests in parallel. Additionally, since the issue persists, there is no benefit to merging this in.

I think we need to clarify the actual problem statement we are trying to solve. The issue cites the port collision and the overlap in directories, which this PR doesn't address.

Along with the detailed problem statement, we should also align on the ideal end state.

To me, being able to run tests and nodes in parallel would be ideal, so we should find a solution that does that.

@rach-id (Member, Author) commented Feb 7, 2023

@MSevey Makes sense. I'll see if I can solve the test issue locally and update this PR with the changes. Then we can open an issue about running the tests in parallel.

@rach-id (Member, Author) commented Feb 8, 2023

Created an issue for the above problem, plus a repo that reproduces it: celestiaorg/orchestrator-relayer#104

@rach-id rach-id marked this pull request as draft February 10, 2023 09:15
@rach-id (Member, Author) commented Feb 10, 2023

Just found out what the issue is: I didn't know that go test ./... runs the tests of different packages in parallel. This can be controlled with the -p flag: go test ./... -p 1 makes everything work. So there is no resource leak.

As a solution, what do you think of the following:

  1. Update the configuration of the testnode in here as follows, so that ports are allocated dynamically by default (listening on port 0 lets the OS pick a free ephemeral port):

	tmCfg.RPC.ListenAddress = "tcp://localhost:0"
	tmCfg.P2P.ListenAddress = "tcp://localhost:0"
	tmCfg.RPC.GRPCListenAddress = "tcp://localhost:0"
	appConf.GRPC.Address = "localhost:0"
	appConf.API.Address = "tcp://localhost:0"

  2. Keep the mutex that this PR adds, so that if anyone wants to run two testnodes in parallel (inside the same package), the tests are forced to run sequentially.

I am happy to open a quick PR for 1. And for 2, we can just merge this one.

Let me know what you think.

cc @evan-forbes @MSevey

@MSevey (Member) commented Feb 10, 2023

> Just found out what the issue is: I didn't know that go test ./... runs the tests of different packages in parallel. This can be controlled with the -p flag: go test ./... -p 1 makes everything work. So there is no resource leak.

Interesting, I also did not know that it runs things in parallel by default. So it runs packages in parallel? To run tests in parallel within a package, I imagine you still need t.Parallel().

I'm in favor of option 1.

I think option 2 is a bit of an anti-pattern. If someone is trying to run things in parallel and our test code forces them to run sequentially, that could give developers a false positive.

@rach-id (Member, Author) commented Feb 10, 2023

> To run tests in parallel within a package, I imagine you still need t.Parallel().

Yes, 100%
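
To illustrate the two levels of parallelism (a sketch with hypothetical test names): go test ./... runs each package's test binary in parallel (bounded by -p), while tests inside one package run sequentially unless they opt in:

```go
package testnode

import "testing"

// Within a package, tests run one after another by default. Calling
// t.Parallel() pauses the test and resumes it alongside other parallel
// tests, bounded by -parallel (which defaults to GOMAXPROCS).
func TestNodeStartsA(t *testing.T) {
	t.Parallel()
	// ... spin up a testnode ...
}

func TestNodeStartsB(t *testing.T) {
	t.Parallel() // may now run concurrently with TestNodeStartsA
	// ... spin up another testnode ...
}
```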

Cool, I'll open a new PR making the change and close this one.

Thanks a lot everyone for the help 👍 👍

@evan-forbes (Member):
I'd be in favor of 1 as well. We should also change the default home dir tendermint uses to t.TempDir, if we're not already doing that somewhere.
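
A minimal sketch of that idea (assuming tendermint's config exposes SetRoot, as upstream does; the helper name is hypothetical):

```go
package testnode

import (
	"testing"

	tmconfig "github.com/tendermint/tendermint/config"
)

// newTestConfig points each testnode at its own throwaway home directory,
// so repeated or concurrent runs never share data dirs. t.TempDir is
// created per test and removed automatically when the test finishes.
func newTestConfig(t *testing.T) *tmconfig.Config {
	cfg := tmconfig.DefaultConfig()
	cfg.SetRoot(t.TempDir())
	return cfg
}
```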

@rach-id (Member, Author) commented Feb 10, 2023

> We should also change the default home dir tendermint uses to t.TempDir, if we're not already doing that somewhere.

I have seen some tempDir usage somewhere, but I'm not sure where. Should we create an issue to investigate the data folder that sometimes gets created when running the tests?

@rootulp (Collaborator) commented Feb 10, 2023

> Just found out what the issue is: I didn't know that go test ./... runs the tests of different packages in parallel.

Interesting find, and this seems to be the case based on go help test output:

> Each listed package causes the execution of a separate test binary.

Just so I follow: two separate packages are running tests that spin up a testnode, and these testnodes conflict? Which packages are they? I see app_test here, but I don't see another invocation of the testutil/testnode node. I do see usage of the cosmos-sdk network.New node in app_test and testutil, so maybe those conflict with the testutil/testnode?

Maybe a silly question, but is it possible to make the testnodes independent so that multiple can run simultaneously? If that's too difficult then it may not be worth it. My hunch is that the integration tests that use nodes are the longest-running tests, so if we don't run them in parallel we'll increase our total test run time, which will lead to slower dev iteration and CI checks.

@rach-id (Member, Author) commented Feb 10, 2023

> Just so I follow: two separate packages are running tests that spin up a testnode, and these testnodes conflict? Which packages are they? I see app_test here, but I don't see another invocation of the testutil/testnode node. I do see usage of the cosmos-sdk network.New node in app_test and testutil, so maybe those conflict with the testutil/testnode?

Sorry for not being specific. This is happening in a separate repo, on a personal branch. You can reproduce it using this: https://github.com/sweexordious/orchestrator-relayer/tree/not_working_test_poc

> Maybe a silly question, but is it possible to make the testnodes independent so that multiple can run simultaneously? If that's too difficult then it may not be worth it.

Do you mean parallel tests inside a single package or parallel between packages?

If it's the latter, then with this change, #1368, you can.

If you mean parallel tests inside the same package: that can be achieved (I believe; I tried it locally, but I might be missing something) with #1368, but only for testnodes sharing the same configuration. However, if we want to test something that changes the Bech32 prefixes or the baseapp configuration, we won't be able to, because the config gets sealed:

cfg.Seal()

https://github.com/celestiaorg/cosmos-sdk/blob/0aef7ba8d215a3978de8c298dc40d9fa65041695/baseapp/baseapp.go#L80-L81
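
To make the constraint concrete, here is a generic illustration of the seal pattern (a sketch, not the SDK's actual code): once the config is sealed during node startup, any later mutation panics, so two testnodes that need different global configs cannot coexist in one process:

```go
package testnode

// Config mimics a sealable global configuration.
type Config struct {
	sealed       bool
	bech32Prefix string
}

// Seal freezes the config; baseapp-style startup code calls this once.
func (c *Config) Seal() { c.sealed = true }

// SetBech32Prefix panics after Seal, which is why a second testnode with
// a different prefix cannot be configured in the same process.
func (c *Config) SetBech32Prefix(p string) {
	if c.sealed {
		panic("config is sealed")
	}
	c.bech32Prefix = p
}
```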

Also, there are some data races happening even inside a single testnode instance, so running multiple ones might lead to undefined behavior: #1369

@rootulp (Collaborator) commented Feb 10, 2023

> Do you mean parallel tests inside a single package or parallel between packages?

I meant parallel between packages, so the default go test ./... behavior. Cool, looking forward to #1368!

rach-id added a commit that referenced this pull request Feb 10, 2023

Dynamic port allocation for the testnode as per #1342 (comment)
evan-forbes pushed a commit that referenced this pull request Apr 28, 2023

Dynamic port allocation for the testnode as per #1342 (comment)
0xchainlover pushed a commit to celestia-org/celestia-app that referenced this pull request Aug 1, 2024

Dynamic port allocation for the testnode as per celestiaorg/celestia-app#1342 (comment)
Labels

enhancement (New feature or request) · testing (items that are strictly related to adding or extending test coverage)
Development

Successfully merging this pull request may close these issues.

Only allow a single testnode network to be spun up at once
6 participants