feat(warp): added warp update to madara and other changes #393
Conversation
This should be handled by an external proxy anyways.
This currently only contains the `MadaraWriteRpc` methods but will be used to encapsulate any sensitive admin methods in the future.
Was this caught by the CI? If not, what was wrong here and why did the e2e pass? The idea of such a big problem not getting caught by the CI makes me worried.
This was in fact not an infinite freeze, but just tracing being disabled after #401.
So, I'm not sure I understand the difference between what's checked during regular block import and what's checked when importing a block in warp updates.
Are the block hashes and stuff checked in warp mode too?
README.md
## ✔ Supported Features
## 🔄 Migration

When migration to a newer version of Madara which introduces breaking changes,
When migrating to a newer version of Madara [that introduces breaking changes]
I would probably simply remove the latter part of the sentence tbh, people are not expected to migrate to new versions that don't require upgrading the db?
This has been fixed.
@@ -176,11 +170,6 @@ impl BlockImporter {
        validation: BlockValidationContext,
    ) -> Result<BlockImportResult, BlockImportError> {
        let result = self.verify_apply.verify_apply(block, validation).await?;
        // Flush step.
        let force = self.always_force_flush;
aaah!! no, we really need to flush every block, always, in block_production mode
This no longer matters, as `maybe_flush` has been removed and `flush` will always flush anyway, so `always_force_flush` is no longer needed. I noticed that the only place in the code where we were performing this check was in the block import, so we just do the check there, while other parts of the code call `flush` directly.
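Roughly, the resulting flow looks like the sketch below (simplified, hypothetical types rather than Madara's actual `BlockImporter`; where exactly the check lives is discussed further down):

```rust
// Minimal sketch with simplified, hypothetical types (not Madara's actual code).
struct Backend;

impl Backend {
    /// Always flushes: the old `maybe_flush(force)` conditional is gone.
    fn flush(&self) -> Result<(), String> {
        // persist pending writes to disk
        Ok(())
    }
}

struct BlockImporter {
    backend: Backend,
    flush_every_n_blocks: u64,
    last_flushed_block: u64,
}

impl BlockImporter {
    /// The "should we flush yet?" check now lives at the call site that knows
    /// about block numbers, instead of being hidden inside the backend.
    fn verify_apply(&mut self, block_number: u64) -> Result<(), String> {
        // ... verify and apply the block ...
        if block_number.saturating_sub(self.last_flushed_block) >= self.flush_every_n_blocks {
            self.backend.flush()?;
            self.last_flushed_block = block_number;
        }
        Ok(())
    }
}

fn main() {
    let mut importer =
        BlockImporter { backend: Backend, flush_every_n_blocks: 1000, last_flushed_block: 0 };
    for n in 1..=2_000u64 {
        importer.verify_apply(n).unwrap();
    }
}
```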
@@ -238,7 +239,13 @@ pub async fn handle_get_block_traces(
    traces: Vec<TransactionTraceWithHash>,
}

let traces = v0_7_1_trace_block_transactions(&Starknet::new(backend, add_transaction_provider), block_id).await?;
// TODO: we should probably use the actual service context here instead of
// creating a new one!
We should probably not implement trace block like this at all, actually; I don't see why the gateway would depend on rpc.
this was a hack afaik
self.backend
    .maybe_flush(true)
    .map_err(|err| BlockImportError::Internal(format!("DB flushing error: {err:#}").into()))?;
self.backend.flush().map_err(|err| BlockImportError::Internal(format!("DB flushing error: {err:#}").into()))?;
So, just a note here:
flushing is the responsibility of the block import crate; this flushing here is a hack for pending blocks.
I'd like the flushing to be moved back into block import and be removed from sync l2.rs once again.
you can keep this hack here of course since it was already present
crates/client/rpc/src/lib.rs
Self {
    backend: Arc::clone(&self.backend),
    add_transaction_provider: Arc::clone(&self.add_transaction_provider),
    ctx: self.ctx.branch(),
Does that mean that cloning a `Starknet` instance will make it a parent to a newly created child?
This feels weird... I wouldn't expect `Clone` to do stuff like that.
How about just wrapping `ServiceContext` in an `Arc`, so that cloning a `Starknet` actually shares the cancellation token instead of forking it?
Hum. `branch` only clones the `CancellationToken` though, which is itself behind an `Arc`, so I don't think this is necessary.
pub struct CancellationToken {
    inner: Arc<tree_node::TreeNode>,
}
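For context, here is a small sketch of how `tokio_util::sync::CancellationToken` (assuming that is the token type in use) behaves: cloning shares the same underlying token, while `child_token` creates a new token linked to its parent.

```rust
use tokio_util::sync::CancellationToken;

fn main() {
    let parent = CancellationToken::new();

    // Cloning shares the same underlying (Arc-backed) token.
    let shared = parent.clone();

    // child_token() creates a *new* token that is cancelled when the
    // parent is cancelled, but can also be cancelled on its own.
    let child = parent.child_token();

    parent.cancel();
    assert!(shared.is_cancelled()); // the clone observes the cancellation
    assert!(child.is_cancelled()); // the child is cancelled through its parent
}
```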
Also, `ServiceContext::branch` does not create any child `CancellationToken`; that is handled by `ServiceContext::branch_id`.
Yes, I did not realize `branch` was a clone, sorry.
No worries, I've updated it to `clone` to make this more obvious :)
crates/client/sync/src/l2.rs
let BlockImportResult { header, block_hash } = block_import.verify_apply(block, validation.clone()).await?;

if header.block_number - last_block_n >= flush_every_n_blocks as u64 {
Re: I really think this should not be here, flushing should definitely not be the concern of l2.rs.
Where do you suggest we move this instead? This was already called indirectly in the `verify_apply` method, if I remember correctly. The issue is that if we are running a full node with only sync services enabled, we still need a point where new blocks are flushed. Also, I don't quite see how we can share information across services in an elegant way. Imo it makes the most sense to call `flush` where information about the block store is directly available, be that `sync`, `block_production` or the `mempool`.
I think it makes more sense to call it in block_import? This seems way too low-level to be a concern of l2.rs.
What do you mean by sharing info in this case?
Tangentially related, but I have added a timer inside the l2 `verify_apply_task` which will flush blocks after a set interval of time has elapsed, as you had suggested orally. This can be controlled via the `--sync-every-n-seconds` cli arg. So now we can be certain blocks will be flushed to the db either every n seconds or every n blocks, whichever comes first.
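Something along these lines (a self-contained sketch with made-up types and values, not the actual `verify_apply_task`):

```rust
use std::time::Duration;
use tokio::sync::mpsc;

// Hypothetical stand-in for the database backend; only `flush` matters here.
struct Backend;
impl Backend {
    fn flush(&self) {
        println!("flushed");
    }
}

#[tokio::main]
async fn main() {
    let backend = Backend;
    let (tx, mut blocks) = mpsc::channel::<u64>(16);

    // Simulate the fetch/convert pipeline handing over verified block numbers.
    tokio::spawn(async move {
        for n in 1..=10u64 {
            let _ = tx.send(n).await;
        }
    });

    let flush_every_n_blocks = 4u64; // --flush-every-n-blocks
    let mut flush_interval = tokio::time::interval(Duration::from_secs(10)); // --sync-every-n-seconds
    flush_interval.tick().await; // the first tick completes immediately, skip it
    let mut last_flushed = 0u64;

    loop {
        tokio::select! {
            // Time-based flush: guarantees data hits disk even on a slow sync.
            _ = flush_interval.tick() => backend.flush(),
            // Count-based flush: bounds how many unflushed blocks can accumulate.
            maybe_block = blocks.recv() => match maybe_block {
                Some(block_n) if block_n - last_flushed >= flush_every_n_blocks => {
                    backend.flush();
                    last_flushed = block_n;
                }
                Some(_) => {}
                None => break, // channel closed: sync is done
            },
        }
    }
}
```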
        .multiple(false)
    )
)]
pub struct ArgsPresetParams {
What's this preset thing, and how does it relate to the older presets feature?
So. Basically, I realized some configs were starting to take up a lot of arguments. Initially, warp updates were just set up as a series of predefined arguments (this is still the case for `--warp-update-sender`), and it seemed really tedious and error-prone for the user to have to remember so many options. All this does is set various options in the `RunCmd` struct after parsing has occurred. This has its downsides, most notably that the user cannot override args presets this way, as they are evaluated after user inputs. The advantage is that it streamlines more complex cli setups with many options and flags (this mostly applies to `--warp-update-sender`).
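To make the mechanism concrete, here is a simplified, hypothetical sketch (not the real `RunCmd` fields or preset values) of what "overriding args after parsing" means:

```rust
use clap::Parser;

/// Hypothetical, trimmed-down stand-in for the real `RunCmd`.
#[derive(Parser, Debug)]
struct RunCmd {
    #[clap(long, default_value_t = false)]
    warp_update_sender: bool,
    #[clap(long, default_value_t = 10)]
    sync_parallelism: u8,
    #[clap(long, default_value_t = 1000)]
    flush_every_n_blocks: u64,
}

impl RunCmd {
    /// The preset overrides fields *after* clap has finished parsing, which is
    /// why user-supplied values cannot currently override a preset.
    fn apply_warp_update_sender_preset(mut self) -> Self {
        self.warp_update_sender = true;
        self.sync_parallelism = 32; // illustrative values only
        self.flush_every_n_blocks = 100;
        self
    }
}

fn main() {
    let cmd = RunCmd::parse().apply_warp_update_sender_preset();
    println!("{cmd:?}");
}
```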
Will be adding this in a doc comment
This has been fixed. Added some documentation to `ArgsPresetParams` which details this, as well as some of the issues and limitations with the current cli setup.
Yeah, this seems weird, but I think it's fine to merge that as is for now.
crates/node/src/main.rs
@@ -201,7 +197,7 @@ async fn main() -> anyhow::Result<()> {

    app.start_and_drive_to_end().await?;

    tracing::info!("Shutting down analytics");
    tracing::info!("🔌 Shutting down analytics...");
can you remove this log entirely, it's actually not useful tbh
This has been fixed.
Yes, the block hashes and state root are checked. This is a bit of a shame, as there clearly are lots of performance gains to be had in performing a single state root computation since we know we are synchronizing from a trusted source. This was not implemented as it felt like it was going a bit out of scope (at that point I would be implementing warp sync as well). Still, we are nonetheless cutting down on network latency and increasing the number of blocks synchronized by maximizing CPU parallelism. Some initial benchmarks have shown a 33% improvement in sync time through this alone, so I think this is good enough for now. In the future, we have some low-hanging fruit left which could make this a lot faster, mainly computing the state root in a single go from genesis to full sync. But for now I think it'd be better to move on to other features, most importantly RPC v0.13.3, especially since this feature has already been divided between this and another upcoming PR.
you need to add a db flush in close_block.rs just after verify_apply
/// This is mostly relevant in the context of the `--n-blocks-to-sync` cli
/// argument which states that a node might stop synchronizing before it has
/// caught up with the tip of the chain.
enum SyncStatus {
mmh.. maybe rename it to SyncedStatus or SyncStopReason or idk?
crates/node/src/cli/sync.rs
@@ -66,16 +77,73 @@ pub struct SyncParams {
    /// blocks. This can either be once the node has caught up with the head of
    /// the chain or when it has synced as many blocks as specified by
    /// --n-blocks-to-sync.
    #[clap(env = "MADARA_STOP_ON_SYNC", long, default_value_t = false)]
    #[clap(env = "MADARA_STOP_ON_SYNC", long, value_name = "STOP ON SYNC", default_value_t = false)]
adding value_name to a bool seems weird?
Yep this makes sense. Removed.
    default_value_t = 10,
    value_parser = clap::value_parser!(u8).range(1..)
)]
pub sync_parallelism: u8,
why u8?
This seemed like a reasonable bound for parallelism; I don't think anyone will be running Madara with `sync_parallelism` > 255 anytime soon.
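For what it's worth, here is a small, self-contained sketch (with a hypothetical `fetch_block`, not Madara's actual fetch pipeline) of how a `u8` parallelism value can bound concurrent block fetches:

```rust
use futures::stream::{self, StreamExt};

// Hypothetical stand-in for a feeder-gateway block fetch.
async fn fetch_block(block_n: u64) -> u64 {
    block_n
}

#[tokio::main]
async fn main() {
    let sync_parallelism: u8 = 10; // --sync-parallelism

    let fetched: Vec<u64> = stream::iter(0u64..100)
        .map(fetch_block)
        // At most `sync_parallelism` requests in flight; results keep block order.
        .buffered(sync_parallelism as usize)
        .collect()
        .await;

    assert_eq!(fetched.len(), 100);
}
```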
This has been fixed. Added a call to
Pull Request type
What is the current behavior?
There currently is no way to migrate the Madara db on a breaking change. This PR fixes this, adding some more quality-of-life features in the process.
What is the new behavior?
- Added warp update, which allows Madara to migrate its database at speeds much faster than a re-sync from genesis. This works by running a local fgw node and a second node which syncs from it with higher than normal parallelization. This has been documented in the README, where you can find more detailed information on how to test this feature.
- Added cli parameter presets. This is quite primitive atm, and only functions by overriding other cli args after they have been parsed. This sadly means the user cannot currently override options set in the presets (also, this is all very manual). Imo it is still a good start in making some common args configurations, such as rpc and gateway (or just lengthy ones like warp sync), easier to use.
- Db flush no longer occurs on a fixed-interval timer of 5s. Instead, `maybe_flush` has been renamed to `flush` and is called every `n` blocks in the l2 sync block storage process. This can be set through the cli arg `--flush-every-n-blocks` and defaults to 1000.
- Added the option to increase sync parallelism. This is used by warp sync and is available under the `--sync-parallelism` cli flag, which defaults to 10 (meaning 10 blocks will be fetched in parallel during l2 sync).
- Some important changes have also been made to the rpc admin endpoints:
  - Added a new admin endpoint, `stop_node`, which can be used to stop Madara remotely. This is used by warp update to stop the warp update sender once the receiver has finished synchronizing.
  - Added a new admin endpoint, `ping`, which just returns the unix time at which it was received. This can be used to check if a node is active or to calculate network latency.
  - Added various new admin endpoints which allow controlling the capabilities of the node (for example stopping / restarting sync). This is quite basic atm and has only been implemented for sync and user-facing rpc services. It is sadly not possible atm to start services which were not available on startup.
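As an illustration, calling one of these admin endpoints from another node could look roughly like the sketch below, using a jsonrpsee HTTP client (the method name `madara_ping` and the admin port are placeholders; the actual namespace, method names and port may differ):

```rust
use jsonrpsee::core::client::ClientT;
use jsonrpsee::http_client::HttpClientBuilder;
use jsonrpsee::rpc_params;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Placeholder URL and method name: the real admin port and namespace may differ.
    let client = HttpClientBuilder::default().build("http://localhost:9943")?;

    // `ping` returns the unix time at which the request was received, which can
    // be used both as a liveness check and as a rough latency measurement.
    let sent_at = std::time::SystemTime::now();
    let received_at_unix: u64 = client.request("madara_ping", rpc_params![]).await?;
    println!("node is up, received at unix time {received_at_unix}, round trip {:?}", sent_at.elapsed()?);

    Ok(())
}
```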
Note
There are currently two main issues I have with the current service architecture:
I am thinking of how best to approach this. Currently, I am favoring a monitor solution where `ServiceGroup` also keeps in memory the necessary information to start (and restart) a service. There also needs to be some mechanism for the `ServiceGroup` to be able to cleanly pause the execution of a service, but I am not sure of the best way to go about this yet.

On the rpc front:
`namespace_version_method`, whereas our version middleware only accepts requests of the form `namespace_method`, with `version` being specified in the header. This has been changed to allow for both formats. This is only useful for us to be able to make rpc calls from one node to the other, but the alternative would have been to do this manually. I think this is a better solution, as jsonrpsee provides type checking and function signatures for each of our methods, which should reduce errors.

Refactor
Does this introduce a breaking change?
No.