Optimize trace generation #1558

Open
Al-Kindi-0 opened this issue Oct 31, 2024 · 3 comments

Comments

@Al-Kindi-0 (Collaborator) commented Oct 31, 2024

Currently, trace generation, both main and auxiliary, takes roughly 10% of the total proving time. We should benchmark and profile this step in order to get an idea of where the bottlenecks are.
Witness generation is an inherently sequential process, but there are still strategies we could explore to improve things. One such approach is multi-stage building of the main trace over several passes, where, for example, the first pass fills only the minimal set of trace values needed to then fill the hasher chiplet in parallel (see the sketch below).
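
A minimal sketch of what such a two-pass build could look like, assuming a hypothetical `HasherRequest` recorded by the first pass (the actual chiplet layout, column count, and row expansion in Miden VM are different): the second pass expands each request into its block of rows in parallel via rayon and then stitches the blocks together with a cheap sequential copy.

```rust
use rayon::prelude::*;

// Illustrative column count; not the real hasher chiplet layout.
const HASHER_WIDTH: usize = 17;

/// Data captured by the first, sequential pass; name and fields are hypothetical.
struct HasherRequest {
    input_state: [u64; 12],
}

/// Expand one request into its block of trace rows (e.g., 8 rows per permutation).
/// This is a pure function of the request, so blocks can be computed independently;
/// a real implementation would apply the hash permutation round by round here.
fn compute_rows(req: &HasherRequest) -> Vec<[u64; HASHER_WIDTH]> {
    let mut rows = Vec::with_capacity(8);
    let mut state = [0u64; HASHER_WIDTH];
    state[..12].copy_from_slice(&req.input_state);
    for round in 0..8u64 {
        state[HASHER_WIDTH - 1] = round; // placeholder round bookkeeping
        rows.push(state);
    }
    rows
}

/// Second pass: compute all blocks in parallel, then concatenate them sequentially.
fn build_hasher_chiplet(requests: &[HasherRequest]) -> Vec<[u64; HASHER_WIDTH]> {
    let blocks: Vec<Vec<[u64; HASHER_WIDTH]>> = requests.par_iter().map(compute_rows).collect();
    blocks.into_iter().flatten().collect()
}
```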

@plafer (Contributor) commented Oct 31, 2024

Additional context: #1533 (comment)

@bobbinth (Contributor) commented

I think a good target here would be to get main trace generation to run at 50 MHz or more on something like an M1, and a stretch goal could be 100 MHz. I think this should be doable, but it would require quite a bit of investigation and potentially some refactoring.
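
For a sense of scale (assuming the target is measured in main-trace rows generated per second): a trace of 2^26 ≈ 67M rows would take about 1.3 s to build at 50 MHz and about 0.67 s at 100 MHz.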

@bobbinth (Contributor) commented Jan 6, 2025

One of the approaches to investigate here is to completely split execution from trace generation. So, we'd have two components:

  1. Executor: execute a program from start to finish as quickly as possible while maintaining minimal required state.
  2. Trace generator: build the execution trace for a contiguous set of VM cycles.

The main idea is that the executor will run in a single thread while multiple trace generators could run concurrently in multiple threads:

[image: diagram of a single executor thread feeding multiple trace generator threads]
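
A rough sketch of how the two components could be wired together, assuming a checkpoint-per-segment design; `SegmentCheckpoint`, `SegmentTrace`, `execute_segment`, and `generate_segment_trace` are hypothetical names, not the actual Miden VM API. The executor runs on a single thread and sends a cheap clone of its minimal state at every segment boundary into a channel; trace generators pull segments off the channel and expand them on the rayon thread pool as soon as they become available.

```rust
use rayon::prelude::*;
use std::sync::mpsc;
use std::thread;

/// Minimal state needed to (re)start execution at a segment boundary.
#[derive(Clone)]
struct SegmentCheckpoint { /* stack, memory, advice provider, decoder state */ }

/// Trace columns for one contiguous range of VM cycles.
struct SegmentTrace { /* main-trace columns for this segment */ }

/// Fast pass: advance the VM by `num_cycles` cycles, keeping only minimal state.
fn execute_segment(_start: &SegmentCheckpoint, _num_cycles: usize) -> SegmentCheckpoint {
    unimplemented!("execution loop goes here")
}

/// Slow pass: replay the segment, this time filling in all trace columns.
fn generate_segment_trace(_start: SegmentCheckpoint, _num_cycles: usize) -> SegmentTrace {
    unimplemented!("trace filling goes here")
}

fn build_trace(initial: SegmentCheckpoint, num_segments: usize, segment_len: usize) -> Vec<SegmentTrace> {
    let (tx, rx) = mpsc::channel::<(usize, SegmentCheckpoint)>();

    // Executor: single thread, runs ahead of trace generation and emits one
    // cheap checkpoint clone at the start of each segment.
    let executor = thread::spawn(move || {
        let mut state = initial;
        for idx in 0..num_segments {
            tx.send((idx, state.clone())).unwrap();
            state = execute_segment(&state, segment_len);
        }
    });

    // Trace generators: segments become available as the executor produces
    // checkpoints and are expanded into trace columns on the rayon thread pool.
    let mut segments: Vec<(usize, SegmentTrace)> = rx
        .into_iter()
        .par_bridge()
        .map(|(idx, ckpt)| (idx, generate_segment_trace(ckpt, segment_len)))
        .collect();

    executor.join().unwrap();
    segments.sort_by_key(|seg| seg.0); // par_bridge does not preserve order
    segments.into_iter().map(|(_, trace)| trace).collect()
}
```

With this shape, the main tuning knobs are the segment length (which amortizes the cost of producing a checkpoint) and the cost of cloning the checkpoint itself, which is what the state breakdown below tries to keep small.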

The key question is which parts of the VM state the trace generators would need in order to generate the trace. Here is my initial stab at this (a rough sketch of such a checkpoint follows the list):

  1. Stack state - this can be tracked pretty efficiently and can probably be cloned with minimal overhead when the executor is done with a given segment.
  2. Memory state - this would also need to be cloned, but probably at a somewhat higher cost than cloning the stack state. One potential optimization here is to clone only the part of the memory which is "touched" during the current segment.
  3. Advice provider state - cloning this one may require quite a bit of work, as the advice provider may hold a lot of data. However, the Merkle store component is "append only" and so does not need to be cloned. If we can make the advice map append-only as well, we can avoid cloning it too. Cloning the advice stack should be pretty cheap.
  4. Decoder state - I haven't thought this through completely, but I think the state here should be relatively small and shouldn't be too expensive to clone.
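
A rough sketch of what a per-segment checkpoint might hold, following the breakdown above; all type names are hypothetical. The main points it tries to capture: the stack and decoder state are small and cheap to clone, memory can be cloned partially (ideally only what is touched), and append-only structures such as the Merkle store can be shared (e.g., behind an `Arc` or via a persistent data structure) instead of being deep-cloned.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

type Word = [u64; 4];

struct SegmentCheckpoint {
    // 1. Stack: top of stack plus overflow table; small, cheap to clone.
    stack_top: [u64; 16],
    stack_overflow: Vec<u64>,

    // 2. Memory: only the addresses touched so far (or, better, only those
    //    touched during this segment) need to be materialized.
    memory: BTreeMap<u64, Word>,

    // 3. Advice provider: the Merkle store is append-only, so a snapshot can be
    //    shared rather than deep-cloned; the advice stack is small; the advice
    //    map would ideally become append-only too so it can be shared the same way.
    merkle_store: Arc<MerkleStoreSnapshot>,
    advice_stack: Vec<u64>,
    advice_map: Arc<BTreeMap<Word, Vec<u64>>>,

    // 4. Decoder: enough context to resume decoding mid-program (e.g., the
    //    current block stack); expected to be relatively small.
    decoder_block_stack: Vec<BlockInfo>,
}

// Placeholder types standing in for the real advice-provider / decoder types.
struct MerkleStoreSnapshot;
struct BlockInfo;
```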

Assuming trace generation is at least 10x more expensive than execution, we should be able to get close to a 5x-10x speedup on high-end laptops, and even more on server-grade machines (i.e., the more cores we can throw at it, the faster we'll generate the trace).
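
To put rough numbers on this (a back-of-envelope, Amdahl-style estimate assuming execution is the only sequential part and trace generation costs 10x execution): with 1 unit of execution plus 10 units of trace generation spread across N cores, total time is 1 + 10/N versus 11 sequentially, i.e., a speedup of 11 / (1 + 10/N) ≈ 4.9x for N = 8 and ≈ 6.8x for N = 16, approaching 11x as N grows. Overlapping the executor with the trace generators, as in the sketch above, would improve on this somewhat.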

bobbinth added this to the v0.13.0 milestone on Jan 8, 2025