Optimize trace generation #1558

Open
Al-Kindi-0 opened this issue Oct 31, 2024 · 3 comments

Comments

@Al-Kindi-0 (Collaborator) commented Oct 31, 2024

Currently, trace generation, both main and auxiliary, takes roughly 10% of the total proving time. We should benchmark and profile this step in order to get an idea of where the bottlenecks are.
Witness generation is an inherently sequential process, but there are still strategies we could explore to improve things. One such approach is multi-stage building of the main trace over several passes, where, for example, the first pass fills only the minimal set of trace values needed to then fill the hasher chiplet in parallel (see the sketch below).
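
A minimal sketch of what such a two-pass build could look like, assuming a hypothetical `HasherRequest` recorded by the first pass (the actual chiplet layout, column count, and row expansion in Miden VM are different): the second pass expands each request into its block of rows in parallel via rayon and then stitches the blocks together with a cheap sequential copy.

```rust
use rayon::prelude::*;

// Illustrative column count; not the real hasher chiplet layout.
const HASHER_WIDTH: usize = 17;

/// Data captured by the first, sequential pass; name and fields are hypothetical.
struct HasherRequest {
    input_state: [u64; 12],
}

/// Expand one request into its block of trace rows (e.g., 8 rows per permutation).
/// This is a pure function of the request, so blocks can be computed independently;
/// a real implementation would apply the hash permutation round by round here.
fn compute_rows(req: &HasherRequest) -> Vec<[u64; HASHER_WIDTH]> {
    let mut rows = Vec::with_capacity(8);
    let mut state = [0u64; HASHER_WIDTH];
    state[..12].copy_from_slice(&req.input_state);
    for round in 0..8u64 {
        state[HASHER_WIDTH - 1] = round; // placeholder round bookkeeping
        rows.push(state);
    }
    rows
}

/// Second pass: compute all blocks in parallel, then concatenate them sequentially.
fn build_hasher_chiplet(requests: &[HasherRequest]) -> Vec<[u64; HASHER_WIDTH]> {
    let blocks: Vec<Vec<[u64; HASHER_WIDTH]>> = requests.par_iter().map(compute_rows).collect();
    blocks.into_iter().flatten().collect()
}
```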

@plafer (Contributor) commented Oct 31, 2024

Additional context: #1533 (comment)

@bobbinth (Contributor) commented

I think a good target here would be to get main trace generation to run at 50 MHz or more on something like an M1, and a stretch goal could be 100 MHz. I think this should be doable, but it would require quite a bit of investigation and potentially some refactoring.
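
For a sense of scale (assuming the target is measured in main-trace rows generated per second): a trace of 2^26 ≈ 67M rows would take about 1.3 s to build at 50 MHz and about 0.67 s at 100 MHz.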

@bobbinth (Contributor) commented Jan 6, 2025

One of the approaches to investigate here is to completely split execution from trace generation. So, we'd have two components:

  1. Executor: execute a program from start to finish as quickly as possible while maintaining minimal required state.
  2. Trace generator: build the execution trace for a contiguous set of VM cycles.

The main idea is that the executor will run in a single thread while multiple trace generators could run concurrently in multiple threads:

[image: diagram of a single executor thread feeding multiple trace generator threads]
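
A rough sketch of how the two components could be wired together, assuming a checkpoint-per-segment design; `SegmentCheckpoint`, `SegmentTrace`, `execute_segment`, and `generate_segment_trace` are hypothetical names, not the actual Miden VM API. The executor runs on a single thread and sends a cheap clone of its minimal state at every segment boundary into a channel; trace generators pull segments off the channel and expand them on the rayon thread pool as soon as they become available.

```rust
use rayon::prelude::*;
use std::sync::mpsc;
use std::thread;

/// Minimal state needed to (re)start execution at a segment boundary.
#[derive(Clone)]
struct SegmentCheckpoint { /* stack, memory, advice provider, decoder state */ }

/// Trace columns for one contiguous range of VM cycles.
struct SegmentTrace { /* main-trace columns for this segment */ }

/// Fast pass: advance the VM by `num_cycles` cycles, keeping only minimal state.
fn execute_segment(_start: &SegmentCheckpoint, _num_cycles: usize) -> SegmentCheckpoint {
    unimplemented!("execution loop goes here")
}

/// Slow pass: replay the segment, this time filling in all trace columns.
fn generate_segment_trace(_start: SegmentCheckpoint, _num_cycles: usize) -> SegmentTrace {
    unimplemented!("trace filling goes here")
}

fn build_trace(initial: SegmentCheckpoint, num_segments: usize, segment_len: usize) -> Vec<SegmentTrace> {
    let (tx, rx) = mpsc::channel::<(usize, SegmentCheckpoint)>();

    // Executor: single thread, runs ahead of trace generation and emits one
    // cheap checkpoint clone at the start of each segment.
    let executor = thread::spawn(move || {
        let mut state = initial;
        for idx in 0..num_segments {
            tx.send((idx, state.clone())).unwrap();
            state = execute_segment(&state, segment_len);
        }
    });

    // Trace generators: segments become available as the executor produces
    // checkpoints and are expanded into trace columns on the rayon thread pool.
    let mut segments: Vec<(usize, SegmentTrace)> = rx
        .into_iter()
        .par_bridge()
        .map(|(idx, ckpt)| (idx, generate_segment_trace(ckpt, segment_len)))
        .collect();

    executor.join().unwrap();
    segments.sort_by_key(|seg| seg.0); // par_bridge does not preserve order
    segments.into_iter().map(|(_, trace)| trace).collect()
}
```

With this shape, the main tuning knobs are the segment length (which amortizes the cost of producing a checkpoint) and the cost of cloning the checkpoint itself, which is what the state breakdown below tries to keep small.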

The key question is which parts of the VM state the trace generators would need in order to generate the trace. Here is my initial stab at this (a rough sketch of such a checkpoint follows the list):

  1. Stack state - this can be tracked pretty efficiently and can probably be cloned with minimal overhead when the executor is done with a given segment.
  2. Memory state - this would also need to be cloned, but probably at a somewhat higher cost than cloning the stack state. One potential optimization here is to clone only the part of the memory which is "touched" during the current segment.
  3. Advice provider state - cloning this one may require quite a bit of work, as the advice provider may hold a lot of data. However, the Merkle store component is "append only" and so does not need to be cloned. If we can make the advice map append-only as well, we can avoid cloning it too. Cloning the advice stack should be pretty cheap.
  4. Decoder state - I haven't thought this through completely, but I think the state here should be relatively small and shouldn't be too expensive to clone.
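
A rough sketch of what a per-segment checkpoint might hold, following the breakdown above; all type names are hypothetical. The main points it tries to capture: the stack and decoder state are small and cheap to clone, memory can be cloned partially (ideally only what is touched), and append-only structures such as the Merkle store can be shared (e.g., behind an `Arc` or via a persistent data structure) instead of being deep-cloned.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

type Word = [u64; 4];

struct SegmentCheckpoint {
    // 1. Stack: top of stack plus overflow table; small, cheap to clone.
    stack_top: [u64; 16],
    stack_overflow: Vec<u64>,

    // 2. Memory: only the addresses touched so far (or, better, only those
    //    touched during this segment) need to be materialized.
    memory: BTreeMap<u64, Word>,

    // 3. Advice provider: the Merkle store is append-only, so a snapshot can be
    //    shared rather than deep-cloned; the advice stack is small; the advice
    //    map would ideally become append-only too so it can be shared the same way.
    merkle_store: Arc<MerkleStoreSnapshot>,
    advice_stack: Vec<u64>,
    advice_map: Arc<BTreeMap<Word, Vec<u64>>>,

    // 4. Decoder: enough context to resume decoding mid-program (e.g., the
    //    current block stack); expected to be relatively small.
    decoder_block_stack: Vec<BlockInfo>,
}

// Placeholder types standing in for the real advice-provider / decoder types.
struct MerkleStoreSnapshot;
struct BlockInfo;
```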

Assuming trace generation is at least 10x more expensive than execution, we should be able to get close to a 5x-10x speedup on high-end laptops, and even more on server-grade machines (i.e., the more cores we can throw at it, the faster we'll generate the trace).
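
To put rough numbers on this (a back-of-envelope, Amdahl-style estimate assuming execution is the only sequential part and trace generation costs 10x execution): with 1 unit of execution plus 10 units of trace generation spread across N cores, total time is 1 + 10/N versus 11 sequentially, i.e., a speedup of 11 / (1 + 10/N) ≈ 4.9x for N = 8 and ≈ 6.8x for N = 16, approaching 11x as N grows. Overlapping the executor with the trace generators, as in the sketch above, would improve on this somewhat.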

bobbinth added this to the v0.13.0 milestone on Jan 8, 2025