feat(warp): added warp update to madara and other changes #393

Merged (86 commits) on Nov 29, 2024

Conversation

@Trantorian1 (Collaborator) commented Nov 21, 2024

Pull Request type

  • Feature
  • Refactoring (no functional changes, no API changes)
  • Documentation content changes

What is the current behavior?

There is currently no way to migrate the Madara db on a breaking change. This PR fixes that, adding some quality-of-life features in the process.

What is the new behavior?

  • Added warp update, which allows Madara to migrate its database much faster than a re-sync from genesis. This works by running a local feeder gateway (fgw) node and a second node which syncs from it with higher-than-normal parallelism. This is documented in the README, where you can find more detailed information on how to test the feature.

  • Added cli parameter presets. This is quite primitive at the moment and only works by overriding other cli args after they have been parsed. This sadly means the user cannot currently override options set by a preset (and it is all very manual). Imo it is still a good start towards making common arg configurations such as rpc and gateway (or just lengthy ones like warp sync) easier to use.

  • Db flushes no longer occur on a fixed 5s timer. Instead, `maybe_flush` has been renamed to `flush` and is called every n blocks in the l2 sync block storage process. The interval can be set through the `--flush-every-n-blocks` cli arg and defaults to 1000.

  • Added the option to increase sync parallelism. This is used by warp sync and is available under the `--sync-parallelism` cli flag, which defaults to 10 (meaning 10 blocks will be fetched in parallel during l2 sync). A sketch of how these flags might be declared is shown right after this list.
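
As a rough illustration of the two new sync flags above, here is a minimal clap sketch. It is an assumption for illustration only: the struct and field layout, types and env handling shown here do not necessarily match the actual Madara `SyncParams` definition.

```rust
use clap::Parser;

/// Minimal sketch of the sync-related flags described above
/// (illustrative only; not the actual Madara definitions).
#[derive(Parser, Debug)]
pub struct SyncParams {
    /// Flush the database to disk every `n` blocks during l2 sync.
    #[clap(long, default_value_t = 1000)]
    pub flush_every_n_blocks: u64,

    /// Number of blocks to fetch in parallel during l2 sync.
    #[clap(long, default_value_t = 10, value_parser = clap::value_parser!(u8).range(1..))]
    pub sync_parallelism: u8,
}

fn main() {
    let params = SyncParams::parse();
    println!(
        "flush every {} blocks, sync parallelism {}",
        params.flush_every_n_blocks, params.sync_parallelism
    );
}
```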

Some important changes have also been made to the rpc admin endpoints:

  • Added a new admin endpoint, stop_node, which can be used to stop madara remotely. This is used by warp update to stop the warp update sender once the receiver has finished synchronizing.

  • Added a new admin endpoint, ping, which just returns the unix time at which it was received. This can be used to check if a node is active or calculate network latency.

  • Added various new admin endpoints which allow controlling the capabilities of the node (for example stopping / restarting sync). This is quite basic for now and has only been implemented for sync and user-facing rpc services. It is sadly not yet possible to start services which were not available on startup. See the sketch below for what such an endpoint surface might look like.
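
For illustration, the admin endpoints above could be exposed through a jsonrpsee trait along these lines. This is a hedged sketch only: the actual trait, namespace and method names in Madara may differ.

```rust
use jsonrpsee::{core::RpcResult, proc_macros::rpc};

/// Illustrative admin surface; the real Madara trait and method names may differ.
#[rpc(server, namespace = "madara")]
pub trait AdminRpc {
    /// Stops the node remotely. Warp update uses this to stop the sender once
    /// the receiver has finished synchronizing.
    #[method(name = "stopNode")]
    async fn stop_node(&self) -> RpcResult<()>;

    /// Returns the unix time (in seconds) at which the request was received;
    /// useful for liveness checks and latency estimation.
    #[method(name = "ping")]
    async fn ping(&self) -> RpcResult<u64>;

    /// Starts or stops a capability such as l2 sync at runtime.
    #[method(name = "serviceSync")]
    async fn service_sync(&self, start: bool) -> RpcResult<()>;
}
```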

Note

There are two main issues I have with the current service architecture:

  1. It is not possible to start a service later on in the execution of the node
  2. Interruption with graceful shutdown can occur at any point during execution, and beyond that we have no hook to simply pause the execution of a service.

I am thinking about how best to approach this. Currently, I am favoring a monitor solution where ServiceGroup also keeps in memory the information necessary to start (and restart) a service. There also needs to be some mechanism for the ServiceGroup to cleanly pause the execution of a service, but I am not sure of the best way to go about this yet. One possible shape of such a monitor is sketched below.
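
Purely as an illustration of the monitor idea, not a proposed implementation: none of these types exist in Madara under these names, and a real version would pause services through their cancellation tokens rather than aborting tasks.

```rust
use std::collections::HashMap;
use tokio::task::JoinHandle;

// Hypothetical types for illustration only.
type ServiceFactory = Box<dyn Fn() -> JoinHandle<()> + Send>;

struct ServiceMonitor {
    // Keeping the factory around is what makes a later (re)start possible.
    factories: HashMap<&'static str, ServiceFactory>,
    running: HashMap<&'static str, JoinHandle<()>>,
}

impl ServiceMonitor {
    fn start(&mut self, name: &'static str) {
        if let Some(factory) = self.factories.get(name) {
            self.running.insert(name, factory());
        }
    }

    fn stop(&mut self, name: &'static str) {
        // Aborting is the bluntest option; a graceful pause would instead
        // signal the service through its cancellation token.
        if let Some(handle) = self.running.remove(name) {
            handle.abort();
        }
    }
}
```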

On the rpc front:

  • Updated the rpc versioning middleware to accept rpc methods sent with jsonrpsee. Because of our versioning scheme, jsonrpsee sends requests of the form `namespace_version_method`, whereas our version middleware only accepted requests of the form `namespace_method`, with the version specified in the header. The middleware now accepts both formats. This is only needed so that one node can make rpc calls to another; the alternative would have been to craft those requests manually. I think this is a better solution, as jsonrpsee provides type checking and function signatures for each of our methods, which should reduce errors. A sketch of the method-name handling is shown below.
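
To make the two accepted formats concrete, here is a hedged sketch of the method-name handling. The version token format and the list of known versions are assumptions for illustration; the actual middleware is more involved.

```rust
/// Versions the middleware might recognize (illustrative values).
const KNOWN_VERSIONS: &[&str] = &["v0_7_1", "v0_8_0"];

/// Accepts both "starknet_getBlockWithTxs" (version taken from a header) and
/// "starknet_v0_7_1_getBlockWithTxs" (version embedded, as sent by jsonrpsee).
/// Returns the resolved version and the bare "namespace_method" name.
fn normalize_method<'a>(method: &'a str, header_version: &'a str) -> (&'a str, String) {
    if let Some((namespace, rest)) = method.split_once('_') {
        for v in KNOWN_VERSIONS {
            if let Some(bare) = rest.strip_prefix(*v).and_then(|r| r.strip_prefix('_')) {
                return (v, format!("{namespace}_{bare}"));
            }
        }
    }
    // No embedded version: fall back to the version from the request header.
    (header_version, method.to_string())
}

fn main() {
    assert_eq!(
        normalize_method("starknet_v0_7_1_specVersion", "v0_8_0"),
        ("v0_7_1", "starknet_specVersion".to_string())
    );
    assert_eq!(
        normalize_method("starknet_specVersion", "v0_8_0"),
        ("v0_8_0", "starknet_specVersion".to_string())
    );
}
```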

Refactor

  • Finally, the various methods in the l2 sync section of madara have been refactored to be a bit more readable.

Does this introduce a breaking change?

No.

Trantorian1 and others added 30 commits starting November 18, 2024. Selected commit notes:
  • This should be handled by an external proxy anyways.
  • This currently only contains the `MadaraWriteRpc` methods but will be used to encapsulate any sensitive admin methods in the future.
@antiyro (Member) commented Nov 27, 2024

[screenshot: Capture d’écran 2024-11-27 à 14 20 03]

Problems when sending SIGINT to the sync: sometimes it is not detected, or there is no graceful shutdown and a segfault occurs.

@Trantorian1 (Collaborator Author):

> [screenshot] Problems when sending SIGINT to the sync: sometimes it is not detected, or there is no graceful shutdown and a segfault occurs.

This has been fixed

@cchudant (Member):

I get an infinite freeze when running with: `--full --network mainnet --l1-endpoint <> --base-path /tmp/madara-b --warp-update-receiver`

Was this caught by the CI? If not, what was wrong here, and why did the e2e tests pass? The idea of such a big problem not getting caught by the CI makes me worried.

@Trantorian1 (Collaborator Author):

> I get an infinite freeze when running with: `--full --network mainnet --l1-endpoint <> --base-path /tmp/madara-b --warp-update-receiver`
>
> Was this caught by the CI? If not, what was wrong here, and why did the e2e tests pass? The idea of such a big problem not getting caught by the CI makes me worried.

This was in fact not an infinite freeze, but just tracing being disabled after #401.

@cchudant (Member) left a comment:

So, I'm not sure I understand the difference between what's checked during regular block import and what's checked when importing a block during warp update.

Are the block hashes and stuff checked in warp mode too?

README.md Outdated
## ✔ Supported Features
## 🔄 Migration

When migration to a newer version of Madara which introduces breaking changes,
Member:

When migrating to a newer version of Madara [that introduces breaking changes]
I would probably simply remove the latter part of the sentence tbh, people are not expected to migrate to new versions that don't require upgrading the db?

Collaborator Author:

This has been fixed.

@@ -176,11 +170,6 @@ impl BlockImporter {
validation: BlockValidationContext,
) -> Result<BlockImportResult, BlockImportError> {
let result = self.verify_apply.verify_apply(block, validation).await?;
// Flush step.
let force = self.always_force_flush;
Member:

aaah!! no, we really need to flush every block, always, in block_production mode

@Trantorian1 (Collaborator Author), Nov 29, 2024:

This no longer matters, as `maybe_flush` has been removed and now always flushes, so `always_force_flush` is no longer needed. I noticed that the only place in the code where we were performing this check was in the block import, so we just do the check there, while other parts of the code call `flush` directly.

@@ -238,7 +239,13 @@ pub async fn handle_get_block_traces(
traces: Vec<TransactionTraceWithHash>,
}

let traces = v0_7_1_trace_block_transactions(&Starknet::new(backend, add_transaction_provider), block_id).await?;
// TODO: we should probably use the actual service context here instead of
// creating a new one!
Member:

We should probably not implement trace block like this at all actually, I don't see why the gateway would depend on rpc.

@cchudant (Member), Nov 28, 2024:

This was a hack afaik.

self.backend
.maybe_flush(true)
.map_err(|err| BlockImportError::Internal(format!("DB flushing error: {err:#}").into()))?;
self.backend.flush().map_err(|err| BlockImportError::Internal(format!("DB flushing error: {err:#}").into()))?;
Member:

So, just a note here: flushing is the responsibility of the block import crate, and this flushing here is a hack for pending blocks.
I'd like the flushing to be moved back into block import and removed from the sync l2.rs once again.

Member:

You can keep this hack here of course, since it was already present.

Self {
backend: Arc::clone(&self.backend),
add_transaction_provider: Arc::clone(&self.add_transaction_provider),
ctx: self.ctx.branch(),
Member:

Does that mean that cloning a Starknet instance will make it a parent to a newly created child? This feels weird, I wouldn't expect Clone to do stuff like that.

How about just wrapping ServiceContext in an Arc, so that cloning a Starknet actually shares the cancellation token instead of forking it?

Collaborator Author:

Hmm, branch only clones the CancellationToken though, which is itself behind an Arc, so I don't think this is necessary:

pub struct CancellationToken {
    inner: Arc<tree_node::TreeNode>,
}

@Trantorian1 (Collaborator Author), Nov 29, 2024:

Also, `ServiceContext::branch` does not create any child `CancellationToken`; that is handled by `ServiceContext::branch_id`.

Member:

Yes, I did not realize branch was a clone, sorry.

Collaborator Author:

No worries, I've updated it to `clone` to make this more obvious :)

let BlockImportResult { header, block_hash } = block_import.verify_apply(block, validation.clone()).await?;

if header.block_number - last_block_n >= flush_every_n_blocks as u64 {
Member:

Re: I really think this should not be here, flushing should definitely not be the concern of l2.rs.

Collaborator Author:

Where do you suggest we move this instead? This was already called indirectly in the `verify_apply` method, if I remember correctly. The issue is that if we are running a full node with only sync services enabled, we still need a point where new blocks are flushed. Also, I don't quite see how we can share information across services in an elegant way. Imo it makes most sense to call `flush` where information about the block store is directly available, be that sync, block_production or the mempool.

Member:

I think it makes more sense to call it in block_import? This seems way too low level to be a concern of l2.rs.

Member:

What do you mean by sharing info in this case?

Collaborator Author:

Tangentially related, but I have added a timer inside the l2 `verify_apply_task` which will flush blocks after a set interval of time has elapsed, as you had suggested orally. This can be controlled via the `--sync-every-n-seconds` cli arg. So now we can be certain blocks will be flushed to the db either every n seconds or every n blocks, whichever comes first (see the sketch below).
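
For clarity, the dual condition described here boils down to something like the sketch below. This is an assumption about the shape of the check only, not the actual l2 sync code.

```rust
use std::time::{Duration, Instant};

/// Illustrative sketch: flush when either `n` blocks have been imported or
/// `n` seconds have elapsed since the last flush, whichever comes first.
struct FlushTicker {
    flush_every_n_blocks: u64,
    flush_every_n_seconds: Duration,
    last_flushed_block: u64,
    last_flushed_at: Instant,
}

impl FlushTicker {
    /// Returns true if a flush should happen after importing `block_n`.
    fn should_flush(&mut self, block_n: u64) -> bool {
        let by_blocks =
            block_n.saturating_sub(self.last_flushed_block) >= self.flush_every_n_blocks;
        let by_time = self.last_flushed_at.elapsed() >= self.flush_every_n_seconds;
        if by_blocks || by_time {
            self.last_flushed_block = block_n;
            self.last_flushed_at = Instant::now();
        }
        by_blocks || by_time
    }
}
```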

.multiple(false)
)
)]
pub struct ArgsPresetParams {
Member:

What's this preset thing, and how does it relate to the older presets feature?

Collaborator Author:

So, basically I realized some configs were starting to take up a lot of arguments. Initially, warp updates were just set up as a series of predefined arguments (this is still the case for --warp-update-sender), and it seemed really tedious and error-prone for the user to have to remember so many options. All this does is set various options in the RunCmd struct after parsing has occurred. This has its downsides, most notably that the user cannot override args presets this way, as they are evaluated after user inputs. The advantage is that it streamlines more complex cli setups with many options and flags (this mostly applies to --warp-update-sender). A sketch of this post-parse override is shown below.
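
As an illustration of the post-parse override, here is a minimal sketch. The field and method names are assumptions and do not match the real `RunCmd` exactly; the point is only that the preset mutates the already-parsed struct, which is why user-supplied values for those fields are silently overridden.

```rust
use clap::Parser;

/// Hypothetical subset of the node's CLI, for illustration only.
#[derive(Parser, Debug)]
struct RunCmd {
    /// Enables the warp update sender preset.
    #[clap(long)]
    warp_update_sender: bool,
    #[clap(long, default_value_t = 10)]
    sync_parallelism: u8,
    #[clap(long)]
    rpc_disable: bool,
}

impl RunCmd {
    /// Applied after `RunCmd::parse()`, so these values win over user input.
    fn apply_warp_update_sender_preset(&mut self) {
        self.sync_parallelism = 32;
        self.rpc_disable = true;
    }
}

fn main() {
    let mut cmd = RunCmd::parse();
    if cmd.warp_update_sender {
        cmd.apply_warp_update_sender_preset();
    }
    println!("{cmd:?}");
}
```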

Collaborator Author:

Will be adding this in a doc comment

Collaborator Author:

This has been fixed. Added some documentation to `ArgsPresetParams` which details this, as well as some of the issues and limitations with the current cli setup.

Member:

Yeah, this seems weird, but I think it's fine to merge that as is for now.

@@ -201,7 +197,7 @@ async fn main() -> anyhow::Result<()> {

app.start_and_drive_to_end().await?;

tracing::info!("Shutting down analytics");
tracing::info!("🔌 Shutting down analytics...");
Member:

Can you remove this log entirely? It's actually not useful tbh.

Collaborator Author:

This has been fixed.

@Trantorian1 (Collaborator Author) commented Nov 29, 2024

> So, I'm not sure I understand the difference between what's checked during regular block import and what's checked when importing a block during warp update.
>
> Are the block hashes and stuff checked in warp mode too?

Yes, the block hashes and state root are checked. This is a bit of a shame, as there clearly are large performance gains to be had in performing a single state root computation, since we know we are synchronizing from a trusted source. This was not implemented as it felt like it was going a bit out of scope (at that point I would be implementing warp sync as well). Still, we are nonetheless cutting down on network latency and increasing the number of blocks synchronized by maximizing CPU parallelism. Some initial benchmarks have shown a 33% improvement in sync time through this alone, so I think this is good enough for now.

In the future, we have some low-hanging fruit left which could make this a lot faster, mainly computing the state root in a single pass from genesis to full sync. But for now I think it'd be better to move on to other features, most importantly RPC v0.13.3, especially since this feature has already been split between this PR and another upcoming one.

@cchudant (Member) left a comment:

you need to add a db flush in close_block.rs just after verify_apply

/// This is mostly relevant in the context of the `--n-blocks-to-sync` cli
/// argument which states that a node might stop synchronizing before it has
/// caught up with the tip of the chain.
enum SyncStatus {
Member:

mmh.. maybe rename it to SyncedStatus or SyncStopReason or idk?


@@ -66,16 +77,73 @@ pub struct SyncParams {
/// blocks. This can either be once the node has caught up with the head of
/// the chain or when it has synced as many blocks as specified by
/// --n-blocks-to-sync.
#[clap(env = "MADARA_STOP_ON_SYNC", long, default_value_t = false)]
#[clap(env = "MADARA_STOP_ON_SYNC", long, value_name = "STOP ON SYNC", default_value_t = false)]
Member:

adding value_name to a bool seems weird?

Collaborator Author:

Yep this makes sense. Removed.

default_value_t = 10,
value_parser = clap::value_parser!(u8).range(1..)
)]
pub sync_parallelism: u8,
Member:

why u8?

Collaborator Author:

This seemed like a reasonable bound for parallelism; I don't think anyone will be running Madara with sync_parallelism > 255 anytime soon.

@Trantorian1 (Collaborator Author):

> you need to add a db flush in close_block.rs just after verify_apply

This has been fixed. Added a call to `backend.flush()` in `BlockProductionTask::<Mempool>::on_block_time`, which is the only place where `close_block` from close_block.rs is called. This mirrors the flush in `on_pending_time_tick`.

@Trantorian1 merged commit 73691f4 into main on Nov 29, 2024
10 checks passed
Labels: docs (Documentation changes or improvements), feature (Request for new feature or enhancement)
Project status: Done
4 participants