Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: prevent panic on chain replay #1197

Merged
merged 19 commits into from
Dec 11, 2024
Merged

fix: prevent panic on chain replay #1197

merged 19 commits into from
Dec 11, 2024

Conversation

karlem
Copy link
Contributor

@karlem karlem commented Nov 6, 2024

Close #1196

Removing the dependency on the CometBFT client in favor of caching. When CometBFT is catching up—replaying from the beginning of the chain to synchronize with the ABCI app—it does not start the RPC API. Unfortunately, our ABCI app relied on the API during consensus events, which made it impossible to replay the chain.

Solved by relying on the exec and app states instead of making calls to CometBFT.

@karlem karlem requested a review from a team as a code owner November 6, 2024 20:28
@raulk raulk requested a review from LePremierHomme November 11, 2024 11:39
fendermint/app/src/app.rs Outdated Show resolved Hide resolved
fendermint/app/src/app.rs Outdated Show resolved Hide resolved
fendermint/app/src/validators.rs Outdated Show resolved Hide resolved
@karlem
Copy link
Contributor Author

karlem commented Nov 14, 2024

@LePremierHomme could you have another look please?

@karlem karlem requested a review from cryptoAtwill November 14, 2024 13:33
Copy link
Contributor

@cryptoAtwill cryptoAtwill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karlem A quick question, how will the cache be recovered after crashes? Seems not queried when booting?

fendermint/app/src/validators.rs Outdated Show resolved Hide resolved
fendermint/app/src/validators.rs Outdated Show resolved Hide resolved
fendermint/app/src/app.rs Outdated Show resolved Hide resolved
@karlem
Copy link
Contributor Author

karlem commented Nov 14, 2024

@karlem A quick question, how will the cache be recovered after crashes? Seems not queried when booting?

Yeah good point. We would loose it after the crash. Do you think maybe saving it to the app state might be a way to go?

fendermint/app/src/consensus_params.rs Outdated Show resolved Hide resolved
fendermint/app/src/app.rs Outdated Show resolved Hide resolved
fendermint/app/src/consensus_params.rs Outdated Show resolved Hide resolved
@cryptoAtwill
Copy link
Contributor

@karlem The rabbit hole gets deeper. But I think it's possible to read the state during boot up. The validators and gas limits are all stored in the contract or actor, so technically we can call the getters from the store directly.

@raulk
Copy link
Contributor

raulk commented Nov 15, 2024

@cryptoAtwill that's a good point; it would also take us one tiny step closer to our desired end state of having as much logic as possible in on-chain actors.

@karlem
Copy link
Contributor Author

karlem commented Nov 15, 2024

@cryptoAtwill Yeah, I think working with the contract makes more sense than this. I would ditch this in favor of using the contracts, as they are part of the app state—much more solid than this.

@karlem
Copy link
Contributor Author

karlem commented Nov 20, 2024

The validators and gas limits are all stored in the contract or actor, so technically we can call the getters from the store directly.

@cryptoAtwill , I believe the issue is that validators are only stored in the actor after top-down finality has been finalized, whereas we need them available earlier.

We have two potential approaches to resolve this:

Store the initial set of validators from genesis in the top-down finality actor/contract.
This approach might introduce unintended side effects, so it may not be ideal.

Create a new actor specifically for storing these validators before top-down finality is achieved.

What are your thoughts on this? Also looping in @raulk for input.

@karlem
Copy link
Contributor Author

karlem commented Nov 21, 2024

Related: #1166

@karlem karlem force-pushed the fix-replay-panic branch 2 times, most recently from 77d6fe3 to 7b48df2 Compare November 26, 2024 22:59
@karlem
Copy link
Contributor Author

karlem commented Nov 26, 2024

@cryptoAtwill Changed the implementation to rely on app and exec states instead.

fendermint/vm/interpreter/src/chain.rs Outdated Show resolved Hide resolved
fendermint/vm/interpreter/src/fvm/exec.rs Outdated Show resolved Hide resolved
fendermint/vm/interpreter/src/fvm/exec.rs Outdated Show resolved Hide resolved
@raulk
Copy link
Contributor

raulk commented Nov 28, 2024

I'm wrapping my head around the problem and the proposed solution to help push things forward. Here are some notes.

If I'm following correctly, this attempts to fix an edge case whereby during a CometBFT startup, the latter may attempts to perform a chain replay either from the WAL or from the network, by feeding blocks to the ABCI app. The issue arises because we recently introduced logic to fetch the block proposer's public key so we could credit gas premiums to their on-chain account. This happens in begin_block and was introduced in #1173. Because our local cache is empty, we fall back to CometBFT, whose RPC API has not started yet, therefore making us fail.

I think all of this can be greatly simplified if instead of querying CometBFT, we query our power table from the gateway actor and match on the block proposer's identity. The gateway already has public keys. Such a call would be entirely inside the state tree, so it has no dependencies on CometBFT, and would break the problematic cycle.

In fact, we already do this in end_block in order to calculate power table updates.

  1. end_block calls interpreter.end():
    .modify_exec_state(|s| self.interpreter.end(s))
  2. When we create a checkpoint, it may come with power table updates:
    checkpoint::maybe_create_checkpoint(&self.gateway, &mut state, &mut block_end_events)
  3. We query the current power table:
    ipc_power_table(gateway, state).context("failed to get the current power table")?;
  4. We already have a helper that queries the gateway actor in the state tree and returns domain objects, containing validator pubkeys:
    pub fn current_power_table(

In a nutshell, I don't think we need one more cache here, nor to manage its lifecycle, nor anything like that. Assuming I'm right, we can kill the ValidatorTracker entirely and simply replace it with the above.

@cryptoAtwill does this sound right to you?

@karlem
Copy link
Contributor Author

karlem commented Nov 29, 2024

@cryptoAtwill and @raulk

After the last review and team decision, I have updated the PR again. The following changes are included:

  • Rename ValidatorTracker to ValidatorCache: This emphasizes that it serves as a cache and is not the source of truth.
  • Remove dependency on the CometBFT client:
    • Instead of relying on the CometBFT client, the system now queries the gateway through FvmExecState, leveraging the current_power_table from the cache.
  • Make the cache immutable:
    • The cache now offers two methods for interaction:
      1. new_from_state: Constructs and populates a new cache based on the current state.
      2. get_validator: Retrieves a validator from the cache.
  • Integrate the cache into the App state wrapped in an Option:
    • Initially, the cache is None.
    • In begin_block, the cache is initialized if it is currently None.
    • In end_block, the existing cache is replaced with a freshly populated one if the validator table has been updated.

Hopefully, this aligns closely with what we want.

fendermint/app/src/validators.rs Outdated Show resolved Hide resolved
fendermint/crypto/Cargo.toml Outdated Show resolved Hide resolved
fendermint/app/src/app.rs Outdated Show resolved Hide resolved
fendermint/app/src/app.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@raulk raulk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few nits; other than that, LGTM. Glad we're landing this after the several twists and bends this has taken.

fendermint/app/src/app.rs Outdated Show resolved Hide resolved
fendermint/app/src/validators.rs Outdated Show resolved Hide resolved
@@ -158,7 +159,7 @@ pub struct GenesisOutput {

pub struct GenesisBuilder {
/// Hardhat like util to deploy ipc contracts
hardhat: Option<Hardhat>,
hardhat: Hardhat,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good cleanup here!

fendermint/vm/interpreter/src/fvm/checkpoint.rs Outdated Show resolved Hide resolved
@karlem karlem merged commit f358fde into main Dec 11, 2024
15 of 17 checks passed
@karlem karlem deleted the fix-replay-panic branch December 11, 2024 19:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
4 participants