feat: add mutex to testnode not to run multiple nodes #1342

Closed

wants to merge 7 commits

Conversation

@rach-id (Member) commented Feb 6, 2023

Overview

Closes #1299

Checklist

  • New and updated code has appropriate documentation
  • New and updated code has new and/or updated testing
  • Required CI checks are passing
  • Visual proof for any user facing features like CLI or documentation updates
  • Linked issues closed with keywords

@rach-id added the enhancement label Feb 6, 2023
@rach-id rach-id requested a review from evan-forbes as a code owner February 6, 2023 19:49
@rach-id rach-id self-assigned this Feb 6, 2023
@MSevey MSevey requested review from a team and mojtaba-esk and removed request for a team February 6, 2023 19:49
@codecov-commenter commented Feb 6, 2023

Codecov Report

Merging #1342 (35c4042) into main (0b447ad) will increase coverage by 0.02%.
The diff coverage is 66.66%.

@@            Coverage Diff             @@
##             main    #1342      +/-   ##
==========================================
+ Coverage   48.21%   48.24%   +0.02%     
==========================================
  Files          79       79              
  Lines        4405     4411       +6     
==========================================
+ Hits         2124     2128       +4     
- Misses       2103     2105       +2     
  Partials      178      178              
Impacted Files                     Coverage Δ
testutil/testnode/rpc_client.go    70.21% <66.66%> (-0.52%) ⬇️


@rach-id added the testing label Feb 6, 2023
@cmwaters (Contributor) left a comment

Have you tested just to make sure that this resolves these flaky tests?

@@ -195,13 +200,17 @@ func DefaultNetwork(t *testing.T, blockTime time.Duration) (cleanup func() error
tmNode, app, cctx, err := New(t, DefaultParams(), tmCfg, false, genState, kr)
require.NoError(t, err)

// locking the mutex so that multiple nodes cannot be spun up at the same time.
mut.Lock()

Suggested change:
-	mut.Lock()
+	mut.Lock()
+	t.Cleanup(mut.Unlock)

I would personally do it this way. Perhaps even remove the custom cleanup function, since the testing fixture already provides one. There are two cases I can see where the code exits before calling Unlock. Also, I generally prefer not to pass responsibility for cleanup to a function outside the scope where the resources were created (if possible).
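
For concreteness, here is a minimal sketch of the locking pattern being suggested. The names mut and DefaultNetwork come from the diff above; everything else is illustrative, not the PR's actual code:

```go
package testnode

import (
	"sync"
	"testing"
)

// mut guards testnode startup so that only one node runs at a time.
var mut sync.Mutex

// DefaultNetwork (body elided): pairing Lock with t.Cleanup(mut.Unlock)
// lets the testing fixture release the lock even when a require.NoError
// aborts the test early, so no explicit unlock path can be missed.
func DefaultNetwork(t *testing.T) {
	mut.Lock()
	t.Cleanup(mut.Unlock)
	// ... start the single testnode and register its own cleanup ...
}
```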

@rach-id (Member, Author):

> Perhaps even remove the custom cleanup function, since the testing fixture already provides one

Understandable. We could also define a TestNode struct that contains an exported Cleanup function to be used in tests. However, I am not against the way it is right now, as it is really easy to start and stop.

> There are two cases I can see where the code exits before calling Unlock

Yes, but those cases don't return errors; they call require.NoError. This means that, if I understand correctly, when we hit one of them the whole test suite tears down and there is no need to unlock or clean up.

This raises a question: would a test in a separate package, started after a require failure, be affected by the locked mutex? I assume not, but I'm not sure.

Member:

if we can stick everything in the testing.T.Cleanup, then we totally should. @sweexordious if you think this is out of scope of this PR, then I can handle it in a follow up

@rach-id (Member, Author):

I can open a follow up PR doing that. Do we want to merge this one?

testutil/testnode/full_node.go (outdated review comment, resolved)

@rach-id (Member, Author) commented Feb 6, 2023

> Have you tested just to make sure that this resolves these flaky tests?

@cmwaters I didn't try the flaky tests specifically, but this should give us an extra layer of protection against misusing the default node. So even if it doesn't fix the flakiness, it's a step in that direction.

@MSevey MSevey requested a review from a team February 6, 2023 22:48
@evan-forbes previously approved these changes Feb 6, 2023

@evan-forbes (Member) left a comment

@sweexordious when you get the chance, can you try this with the qgb tests to see if it fixes the issue? thx!

@rach-id (Member, Author) commented Feb 6, 2023

@evan-forbes Yes, definitely, tomorrow. Can I merge this one?

@evan-forbes (Member):
Yeah, sure. I'm probably just stating the obvious, but if for some reason this doesn't help, or we missed something, then let's make a mental note to remove or fix this change 🙂

@rootulp (Collaborator) left a comment

Optional comment edits to improve clarity. No blocking feedback.

testutil/testnode/rpc_client.go (four outdated review comments, resolved)
@MSevey MSevey requested a review from a team February 7, 2023 08:34
@rach-id (Member, Author) commented Feb 7, 2023

@evan-forbes Just checked, doesn't fix my issue

failed to listen on 127.0.0.1:26657: listen tcp 127.0.0.1:26657: bind: address already in use

The problem is that this issue doesn't happen when running the package tests alone (go test .) or when running them in the IDE. However, with make test, which runs go test -mod=readonly ./..., it fails with the above message. So I'm still not sure what's causing the issue. I assumed the tests were, for some reason, trying to spin up multiple nodes, but apparently not.

Should we still merge? Or keep this open until we figure out why it's failing this way?

@MSevey (Member) commented Feb 7, 2023

I don't think this is the correct fix. How is this different from just ensuring that no tests run in parallel, i.e. not using t.Parallel()? The repo doesn't appear to be using t.Parallel(), so it makes sense that this doesn't fix the issue.

So it isn't a tests-running-in-parallel issue but something else. Since the tests aren't running in parallel, either ports aren't being freed promptly, or the tests need to handle this and find a free port.

Node had this issue, they implemented a getFreePort helper https://github.com/celestiaorg/celestia-node/blob/main/core/testing.go#L192
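
That helper idiom, roughly, looks like the following. This is a minimal sketch of the common approach (bind to port 0 and read back the assigned port), not necessarily celestia-node's exact implementation:

```go
package testnode

import "net"

// getFreePort asks the kernel for an unused TCP port by binding to port 0,
// reads the assigned port back, and releases the listener. Note the small
// race: another process could grab the port between Close and the caller's
// own bind, so this is a best-effort helper for tests.
func getFreePort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}
```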

Where in the test code are we defining ports?

Additionally, looking at the linked issue, the db concern should be solvable by simply creating unique test directories for each node, unless I'm misunderstanding that bug.

@MSevey (Member) commented Feb 7, 2023

> @evan-forbes Just checked, doesn't fix my issue

Is this deterministic on your machine? I just ran the test suite on my Mac and it didn't fail.

@rach-id (Member, Author) commented Feb 7, 2023

> Node had this issue, they implemented a getFreePort helper https://github.com/celestiaorg/celestia-node/blob/main/core/testing.go#L192

That's one way to solve it. But with this PR I wanted to address the issue without having to assign ports dynamically in every repo that uses this code.

> Where in the test code are we defining ports?

They come from the default config in celestia-core:
https://github.com/celestiaorg/celestia-core/blob/07b8dbcb153563db8f2429b94a632b23d2621519/config/config.go#L83-L96

> Additionally, looking at the linked issue, the db concern should be solvable by simply creating unique test directories for each node, unless I'm misunderstanding that bug.

Yes, definitely the DB issue could be solved by that.

In fact, we could fix the ports issue by assigning them dynamically when initializing the config, and the db issue by specifying different home directories, but we would still be stuck with the configuration being sealed when the node starts. So unless we do a big refactor to avoid sealing the config during startup, we will never be able to run multiple nodes in parallel with different configs.

Hence this PR, which adds an extra layer of safety by ensuring that no two nodes run in parallel (it doesn't solve the original problem). The original goal is to fix the port allocation issue, which I'm still investigating locally.

> Is this deterministic on your machine? I just ran the test suite on my Mac and it didn't fail.

The tests running into these issues are not the ones in this repo, but ones I am working on for the orchestrator-relayer repo. Sorry for not providing more context.

Correct me please if I'm wrong @evan-forbes

@MSevey (Member) left a comment

I don't think having a global lock that bottlenecks tests to a single node is the right approach. We should be able to run tests in parallel. Additionally, since the issue persists, there is no benefit to merging this in.

I think we need to clarify the actual problem statement we are trying to solve. The issue cites the port collision and the overlap in directories, which this PR doesn't address.

Along with the detailed problem statement, we should also align on the ideal end state.

To me, being able to run tests and nodes in parallel would be ideal, so we should find a solution that does that.

@rach-id (Member, Author) commented Feb 7, 2023

@MSevey Makes sense. I'll see if I can solve the test issue locally and update this PR with the changes. Then we can open an issue about running the tests in parallel.

@rach-id (Member, Author) commented Feb 8, 2023

Created an issue for the above problem, plus a repo that reproduces it: celestiaorg/orchestrator-relayer#104

@rach-id rach-id marked this pull request as draft February 10, 2023 09:15
@rach-id (Member, Author) commented Feb 10, 2023

Just found out what the issue is: I didn't know that go test ./... runs the tests of different packages in parallel. This can be controlled with the -p flag: go test ./... -p 1 makes everything work. So there is no resource leak.

As a solution, what do you think of the following:

  1. Update the configuration of the testnode in here as follows, so that ports are allocated dynamically by default (listening on port 0 lets the OS pick a free ephemeral port):

	tmCfg.RPC.ListenAddress = "tcp://localhost:0"
	tmCfg.P2P.ListenAddress = "tcp://localhost:0"
	tmCfg.RPC.GRPCListenAddress = "tcp://localhost:0"
	appConf.GRPC.Address = "localhost:0"
	appConf.API.Address = "tcp://localhost:0"

  2. Keep the mutex that this PR adds, so that if anyone wants to run two testnodes in parallel (inside the same package), the tests are forced to run sequentially.

I am happy to open a quick PR for 1. And for 2, we can just merge this one.

Let me know what you think.

cc @evan-forbes @MSevey

@MSevey (Member) commented Feb 10, 2023

> Just found out what the issue is: I didn't know that go test ./... runs the tests of different packages in parallel. This can be controlled with the -p flag: go test ./... -p 1 makes everything work. So there is no resource leak.

Interesting, I also did not know that it runs things in parallel by default. So it runs packages in parallel? To run tests in parallel within a package, I imagine you still need t.Parallel().

I'm in favor of option 1.

I think option 2 is a bit of an anti-pattern. If someone is trying to run things in parallel and our test code forces them to run sequentially, that could give developers a false positive.

@rach-id (Member, Author) commented Feb 10, 2023

> To run tests in parallel within a package, I imagine you still need t.Parallel().

Yes, 100%
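
To illustrate the two levels of parallelism (a sketch with hypothetical test names): go test ./... runs each package's test binary in parallel (bounded by -p), while tests inside one package run sequentially unless they opt in:

```go
package testnode

import "testing"

// Within a package, tests run one after another by default. Calling
// t.Parallel() pauses the test and resumes it alongside other parallel
// tests, bounded by -parallel (which defaults to GOMAXPROCS).
func TestNodeStartsA(t *testing.T) {
	t.Parallel()
	// ... spin up a testnode ...
}

func TestNodeStartsB(t *testing.T) {
	t.Parallel() // may now run concurrently with TestNodeStartsA
	// ... spin up another testnode ...
}
```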

Cool, I'll open a new PR making the change and close this one.

Thanks a lot everyone for the help 👍 👍

@evan-forbes (Member):
I'd be in favor of 1 as well. We should also change the default home dir tendermint uses to t.TempDir, if we're not already doing that somewhere.
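
A minimal sketch of that idea (assuming tendermint's config exposes SetRoot, as upstream does; the helper name is hypothetical):

```go
package testnode

import (
	"testing"

	tmconfig "github.com/tendermint/tendermint/config"
)

// newTestConfig points each testnode at its own throwaway home directory,
// so repeated or concurrent runs never share data dirs. t.TempDir is
// created per test and removed automatically when the test finishes.
func newTestConfig(t *testing.T) *tmconfig.Config {
	cfg := tmconfig.DefaultConfig()
	cfg.SetRoot(t.TempDir())
	return cfg
}
```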

@rach-id (Member, Author) commented Feb 10, 2023

> We should also change the default home dir tendermint uses to t.TempDir, if we're not already doing that somewhere.

I have seen some tempDir usage somewhere, but I'm not sure where. Should we create an issue to investigate the data folder that sometimes gets created when running the tests?

@rootulp (Collaborator) commented Feb 10, 2023

> Just found out what the issue is: I didn't know that go test ./... runs the tests of different packages in parallel.

Interesting find, and this seems to be the case based on go help test output:

> Each listed package causes the execution of a separate test binary.

Just so I follow: two separate packages are running tests that spin up a testnode, and these testnodes conflict? Which packages are they? I see app_test here, but I don't see another invocation of the testutil/testnode node. I do see usage of the cosmos-sdk network.New node in app_test and testutil, so maybe those conflict with the testutil/testnode?

Maybe a silly question, but is it possible to make the testnodes independent so that multiple can run simultaneously? If that's too difficult then it may not be worth it. My hunch is that the integration tests that use nodes are the longest-running tests, so if we don't run them in parallel we'll increase our total test run time, which will lead to slower dev iteration and CI checks.

@rach-id (Member, Author) commented Feb 10, 2023

> Just so I follow: two separate packages are running tests that spin up a testnode, and these testnodes conflict? Which packages are they? I see app_test here, but I don't see another invocation of the testutil/testnode node. I do see usage of the cosmos-sdk network.New node in app_test and testutil, so maybe those conflict with the testutil/testnode?

Sorry for not being specific. This is happening in a separate repo, on a personal branch. You can reproduce it using this: https://github.com/sweexordious/orchestrator-relayer/tree/not_working_test_poc

> Maybe a silly question, but is it possible to make the testnodes independent so that multiple can run simultaneously? If that's too difficult then it may not be worth it.

Do you mean parallel tests inside a single package or parallel between packages?

If it's the latter, then with this change, #1368, you can.

If you mean parallel tests inside the same package: that can be achieved (I believe; I tried it locally, but I might be missing something) with #1368, but only for testnodes sharing the same configuration. However, if we want to test something that changes the Bech32 prefixes or the baseapp configuration, we won't be able to, because the config gets sealed:

cfg.Seal()

https://github.com/celestiaorg/cosmos-sdk/blob/0aef7ba8d215a3978de8c298dc40d9fa65041695/baseapp/baseapp.go#L80-L81
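
To make the constraint concrete, here is a generic illustration of the seal pattern (a sketch, not the SDK's actual code): once the config is sealed during node startup, any later mutation panics, so two testnodes that need different global configs cannot coexist in one process:

```go
package testnode

// Config mimics a sealable global configuration.
type Config struct {
	sealed       bool
	bech32Prefix string
}

// Seal freezes the config; baseapp-style startup code calls this once.
func (c *Config) Seal() { c.sealed = true }

// SetBech32Prefix panics after Seal, which is why a second testnode with
// a different prefix cannot be configured in the same process.
func (c *Config) SetBech32Prefix(p string) {
	if c.sealed {
		panic("config is sealed")
	}
	c.bech32Prefix = p
}
```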

Also, there are some data races happening even inside a single testnode instance, so running multiple ones might lead to undefined behavior: #1369

@rootulp (Collaborator) commented Feb 10, 2023

> Do you mean parallel tests inside a single package or parallel between packages?

I meant parallel between packages, so the default go test ./... behavior. Cool, looking forward to #1368!

rach-id added a commit that referenced this pull request Feb 10, 2023

Dynamic port allocation for the testnode as per #1342 (comment)
evan-forbes pushed a commit that referenced this pull request Apr 28, 2023

Dynamic port allocation for the testnode as per #1342 (comment)
0xchainlover pushed a commit to celestia-org/celestia-app that referenced this pull request Aug 1, 2024

Dynamic port allocation for the testnode as per celestiaorg/celestia-app#1342 (comment)
Labels

enhancement (New feature or request) · testing (items that are strictly related to adding or extending test coverage)
Development

Successfully merging this pull request may close these issues.

Only allow a single testnode network to be spun up at once
6 participants