
feat(services): reworked Madara services for better cancellation control #405

Merged: 32 commits into main on Dec 16, 2024

Conversation

@Trantorian1 (Collaborator) commented on Dec 3, 2024

Pull Request type

Please add the labels corresponding to the type of changes your PR introduces:

  • Bugfix
  • Feature
  • Refactoring (no functional changes, no API changes)
  • Documentation content changes

What is the current behavior?

The current Madara service architecture is limited and only allows services to be started at node startup. This is especially a problem if we wish to achieve zero downtime on warp updates.

What is the new behavior?

This PR significantly overhauls the Madara service architecture. Not much will be said here, as the changes are already extensively documented in service.rs. You can also find more information about how this impacts warp update in the README.

Some services also had to be updated to make them compatible with the new architecture, which mostly meant supporting service restarts. See RpcService for an example, as it has been quite significantly refactored.
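As a rough illustration of what "supporting restarts" means in practice (a hypothetical sketch, not the actual Madara `Service` API; the type and method names below are made up), a restart-friendly service rebuilds its per-run state inside its start method instead of consuming one-shot fields, so the runner can call it again after a stop:

```rust
use tokio_util::sync::CancellationToken;

/// Hypothetical restartable service: everything needed for a run is
/// (re)built inside `start`, so calling it again after a shutdown works.
struct RpcLikeService {
    bind_addr: std::net::SocketAddr, // configuration survives across runs
}

impl RpcLikeService {
    async fn start(&self, cancel: CancellationToken) -> anyhow::Result<()> {
        // Per-run state (sockets, channels, ...) is created here, not in the constructor.
        let listener = tokio::net::TcpListener::bind(self.bind_addr).await?;
        loop {
            tokio::select! {
                // Stop serving as soon as this run is cancelled.
                _ = cancel.cancelled() => break,
                conn = listener.accept() => {
                    let (_socket, _peer) = conn?;
                    // serve the connection...
                }
            }
        }
        Ok(()) // the runner is free to call `start` again later
    }
}
```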

Does this introduce a breaking change?

Yes. Some changes were made to the CLI args for the sake of cleanup: --sync-disabled is now called --l1-sync-disabled to more accurately reflect its function.

Other information

I am not sure how much I like my handling of SERVICE_GRACE_PERIOD. On the one hand, it is handy for avoiding node stalls in case of mistakes on our part and for guaranteeing that SIGINT and SIGTERM always work; on the other hand, it can still lead to forceful cancellations (this is by design). Let me know if you think this trade-off is worth it.
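For context, a minimal sketch of the kind of grace-period enforcement being described (the constant name matches the PR, but the helper, its signature, and the 10 s value are illustrative assumptions, not Madara's actual implementation):

```rust
use std::time::Duration;
use tokio::task::JoinSet;

// Illustrative value; the real SERVICE_GRACE_PERIOD lives in the service code.
const SERVICE_GRACE_PERIOD: Duration = Duration::from_secs(10);

/// Give running services a bounded window to wind down after cancellation,
/// then abort whatever is still left so SIGINT/SIGTERM always terminate the node.
async fn shutdown_with_grace(mut services: JoinSet<anyhow::Result<()>>) {
    let drained = tokio::time::timeout(SERVICE_GRACE_PERIOD, async {
        while let Some(res) = services.join_next().await {
            // Surface both panics/cancellations (JoinError) and service errors.
            if let Err(err) = res.map_err(anyhow::Error::from).and_then(|inner| inner) {
                eprintln!("service ended abnormally: {err}");
            }
        }
    })
    .await;

    if drained.is_err() {
        // Forceful cancellation, by design: better a hard stop than a stalled node.
        services.abort_all();
    }
}
```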

@Trantorian1 added the "feature" (Request for new feature or enhancement) label on Dec 3, 2024
@Trantorian1 self-assigned this on Dec 3, 2024
@Trantorian1 marked this pull request as ready for review on December 5, 2024 at 10:58
README.md Outdated
| `madara_gatewayDisable` | Disables the feeder gateway |
| `madara_gatewayEnable` | Enables the feeder gateway |
| `madara_gatewayRestart` | Restarts the feeder gateway |
| `madara_telemetryDisable` | Disables node telemetry |
Member:
what is this used for?

Collaborator (author):
These are supposed to be endpoints a node admin can use in case they need to disable a specific service, for example for maintenance reasons. I am going to refactor this into a single endpoint though, as per @jbcaron's suggestion.

README.md Outdated
> database.
### Warp Update

Warp update requires an already synchronized _local_ node with a working
Member:
we may want to be a bit clearer that this is the source database that we're setting up here

Collaborator (author):
Yep, will add some clarifying information!

README.md (resolved)
--full `# This also works with other types of nodes` \
--network mainnet \
--warp-update-sender \
--l1-sync-disabled `# We disable sync, for testing purposes` \
Member:
"replace this argument with your --l1-endpoint parameter"

Collaborator (author):
Will add more info about this. The goal for this demonstration, though, is to stop syncing; otherwise the sender node would keep synchronizing and the update would take forever.

README.md Outdated
Suppose you are an RPC service provider and your node is also running an RPC
server and exposing it to your clients: if you have to shut it down or restart
it for the duration of a migration, this will result in downtime for your service
and added complexity in setting up redundancies.
Member:
Well, you can just resync a new node; I think it makes more sense to talk about sequencers here.

Collaborator (author):
Yea fair, though this is still a bit faster than that due to being on a local network. It would also be quite easy to make this a lot faster "in the future ™️ " by computing the state root once the migration has completed, instead of at each block which is what we do on a normal sync. Something to keep in mind.

pub struct TelemetryEvent {
    verbosity: VerbosityLevel,
    message: serde_json::Value,
}

#[derive(Debug, Clone)]
pub struct TelemetryHandle(Option<Arc<mpsc::Sender<TelemetryEvent>>>);
#[repr(transparent)]
pub struct TelemetryHandle(tokio::sync::broadcast::Sender<TelemetryEvent>);
Member:
ah! why the change?

Collaborator (author):
So! We need services to be restartable. This does not work if we store TelemetryHandle as an Option and just call std::mem::take on it. Similarly, we cannot use a tokio::sync::mpsc, as we need to be able to re-create the receiver on each service start (the service future takes ownership of it).
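A minimal sketch of why a broadcast channel works here (illustrative only, not the actual TelemetryHandle code): the handle keeps only the Sender, and every service run creates its own fresh Receiver via subscribe(), so nothing is consumed across restarts:

```rust
use tokio::sync::broadcast;

#[derive(Clone, Debug)]
struct Event(String); // stand-in for TelemetryEvent

#[derive(Clone)]
struct Handle(broadcast::Sender<Event>);

impl Handle {
    fn new(capacity: usize) -> Self {
        Self(broadcast::channel(capacity).0)
    }

    /// Each call hands out an independent receiver, so a restarted service
    /// simply subscribes again instead of taking ownership of a one-shot rx.
    fn subscribe(&self) -> broadcast::Receiver<Event> {
        self.0.subscribe()
    }
}

#[tokio::main]
async fn main() {
    let handle = Handle::new(16);
    for run in 0..2 {
        // Simulate starting the telemetry service twice: a fresh rx per run.
        let mut rx = handle.subscribe();
        let _ = handle.0.send(Event(format!("run {run}")));
        println!("{:?}", rx.recv().await.unwrap());
    }
}
```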

use tokio::task::JoinSet;

pub const SERVICE_COUNT: usize = 8;
Member:
I think this shouldn't be here, by the way; it should be in main.rs.

If an appchain wants to add a service by importing all of our crates and just having their own main.rs, which starts and hooks services differently (or uses other services), this won't work for them.

Collaborator (author):
Well, they'll have bigger problems due to the size limit of using an std::sync::atomic::AtomicU8. I assume you mean having the service id behind a trait so appchains can create their own services?

Collaborator (author):
Well, this was a bit of a refactor, but MadaraServiceId is now only a blanket implementation of ServiceId for our services, with PowerOfTwo being the actual backing type used to identify services. Appchains can create and register their own services by implementing the Service and ServiceId traits on their own structs, and they are free to use any service id enum they want; all relevant methods now only require an impl ServiceId.

As for SERVICE_COUNT, it has been renamed to SERVICE_COUNT_MAX to better reflect what it actually represents and has been increased to 64.
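A rough sketch of what this design enables for an appchain (the trait and type shapes below are assumptions inferred from this discussion, not Madara's exact definitions):

```rust
// Hypothetical shapes: a ServiceId trait backed by a power-of-two bit, so up to
// SERVICE_COUNT_MAX (64) services fit into a single bitmask.
#[derive(Clone, Copy, PartialEq, Eq)]
struct PowerOfTwo(u64);

trait ServiceId {
    fn svc_id(&self) -> PowerOfTwo;
}

// An appchain defines its own id enum alongside Madara's built-in ids...
enum MyAppchainService {
    Indexer,
    Prover,
}

impl ServiceId for MyAppchainService {
    fn svc_id(&self) -> PowerOfTwo {
        // ...as long as the chosen bits do not collide with already-registered services.
        match self {
            MyAppchainService::Indexer => PowerOfTwo(1u64 << 32),
            MyAppchainService::Prover => PowerOfTwo(1u64 << 33),
        }
    }
}
```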

crates/primitives/utils/src/service.rs (resolved)
// been canceled or this service was deactivated
let res = tokio::select! {
    svc = rx.recv() => svc.ok(),
    _ = token_global.cancelled() => break,
Member:
Why isn't token_global a parent of token_local? You shouldn't have to await both.

Collaborator (author):
We have to await both in case the token_local is cancelled but the token_global is not.
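To illustrate the point (the names and surrounding loop are hypothetical): the two tokens carry different meanings, so the loop selects over both in order to tell "this service was deactivated" apart from "the whole node is shutting down":

```rust
use tokio::sync::broadcast;
use tokio_util::sync::CancellationToken;

/// Hypothetical loop: `token_global` means the node is shutting down,
/// `token_local` means only this service was deactivated. Returns true on
/// node shutdown, false if only this service was turned off (the runner may
/// then start it again later).
async fn subscriber_loop(
    token_global: CancellationToken,
    token_local: CancellationToken,
    mut rx: broadcast::Receiver<u32>,
) -> bool {
    loop {
        tokio::select! {
            // Node-wide shutdown: this service will not be restarted.
            _ = token_global.cancelled() => return true,
            // Local deactivation only: the node keeps running.
            _ = token_local.cancelled() => return false,
            msg = rx.recv() => match msg {
                Ok(value) => println!("got {value}"),
                // Channel closed: nothing left to do for this run.
                Err(broadcast::error::RecvError::Closed) => return false,
                // Lagged behind: keep going.
                Err(broadcast::error::RecvError::Lagged(_)) => continue,
            },
        }
    }
}
```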

///
/// Used as part of [ServiceContext::service_subscribe].
#[derive(Clone, Copy)]
pub struct ServiceTransport {
Member:
why is this called transport?

@Trantorian1 (Collaborator, author) on Dec 6, 2024:
Just a name I came up with.

Comment on lines 46 to 49
while !ctx.is_cancelled() {
    interval.tick().await;
    gas_price_worker_once(eth_client, &l1_gas_provider, gas_price_poll_ms).await?;
}
Member:
Currently, cancellation is only checked before waiting for the interval tick. If ctx.is_cancelled() becomes true during interval.tick().await, the worker cannot be canceled immediately, potentially delaying the shutdown. It would be better to modify this code to allow cancellation even during the tick wait.
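One way to make the tick itself cancellable is to race it against the cancellation signal. A generic sketch of that pattern (the helper name, the CancellationToken-based signature, and the closure-based API are assumptions; the actual fix in this PR may look different):

```rust
use std::time::Duration;
use tokio_util::sync::CancellationToken;

/// Run `work` on every tick, but let cancellation interrupt the loop even
/// while it is waiting for the next tick, so shutdown is never delayed by up
/// to a full polling period.
async fn poll_until_cancelled<F, Fut>(
    cancel: CancellationToken,
    period: Duration,
    mut work: F,
) -> anyhow::Result<()>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = anyhow::Result<()>>,
{
    let mut interval = tokio::time::interval(period);
    interval.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip);
    loop {
        tokio::select! {
            // Cancellation wins even mid-wait.
            _ = cancel.cancelled() => return Ok(()),
            _ = interval.tick() => work().await?,
        }
    }
}
```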

Collaborator (author):
Good catch, will be adding that!

Collaborator (author):
This has been fixed!

listen_and_update_state(eth_client, backend, &eth_client.l1_block_metrics, chain_id, ctx)
let event_filter = eth_client.l1_core_contract.event_filter::<StarknetCoreContract::LogStateUpdate>();

let mut event_stream = event_filter
Member:
same

Collaborator (author):
This has been fixed.

Comment on lines 24 to 35
if !gas_price_sync_disabled {
    tokio::try_join!(
        state_update_worker(&backend, &eth_client, ctx.clone()),
        gas_price_worker(&eth_client, l1_gas_provider, gas_price_poll_ms, ctx.clone()),
        sync(&backend, &eth_client, &chain_id, mempool, ctx.clone())
    )?;
} else {
    tokio::try_join!(
        state_update_worker(&backend, &eth_client, ctx.clone()),
        sync(&backend, &eth_client, &chain_id, mempool, ctx.clone())
    )?;
}
Member:
Suggested change
-if !gas_price_sync_disabled {
-    tokio::try_join!(
-        state_update_worker(&backend, &eth_client, ctx.clone()),
-        gas_price_worker(&eth_client, l1_gas_provider, gas_price_poll_ms, ctx.clone()),
-        sync(&backend, &eth_client, &chain_id, mempool, ctx.clone())
-    )?;
-} else {
-    tokio::try_join!(
-        state_update_worker(&backend, &eth_client, ctx.clone()),
-        sync(&backend, &eth_client, &chain_id, mempool, ctx.clone())
-    )?;
-}
+let mut join_set = JoinSet::new();
+join_set.spawn(state_update_worker(&backend, &eth_client, ctx.clone()));
+join_set.spawn(sync(&backend, &eth_client, &chain_id, mempool, ctx.clone()));
+if !gas_price_sync_disabled {
+    join_set.spawn(gas_price_worker(&eth_client, l1_gas_provider, gas_price_poll_ms, ctx.clone()));
+}
+while let Some(result) = join_set.join_next().await {
+    result??;
+}

Collaborator (author):
This has been fixed.

@@ -105,17 +107,27 @@ pub async fn l2_fetch_task(

     let mut interval = tokio::time::interval(sync_polling_interval);
     interval.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip);
-    while wait_or_graceful_shutdown(interval.tick(), &ctx).await.is_some() {
+    while !ctx.is_cancelled() {
+        interval.tick().await;
Member:
same

Collaborator (author):
This has been fixed.

@@ -190,17 +197,20 @@ async fn l2_pending_block_task(

     let mut interval = tokio::time::interval(pending_block_poll_interval);
     interval.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip);
-    while wait_or_graceful_shutdown(interval.tick(), &ctx).await.is_some() {
+    while !ctx.is_cancelled() {
+        interval.tick().await;
Member:
same

Collaborator (author):
This has been fixed.

Member:
You could have resolved #387 at the same time

@antiyro (Member) left a comment:
works extremely fast locally

@cchudant (Member) left a comment:
okay

@antiyro merged commit d3a8367 into main on Dec 16, 2024
15 of 19 checks passed
Labels: feature (Request for new feature or enhancement)
Project status: Done
4 participants