# Faster (than `rcomb_*`) instruction to compute random linear combinations #1600
## Comments
In relation to the use of the above proposal in Falcon, we can use the same approach: the way to do so is, again, to compute the powers of the challenge. As an example of the changes in the code, we will have the following block:
This would mean an improvement in the cycle count compared to the current implementation.
We'd also need to store intermediate results in helper registers, no? (Otherwise, the expression degree would be 5 for extension field operations and 9 for base field operations.) I think there are some other benefits to the Horner eval approach:
But there are also drawbacks. I think the main one is some security loss, because we'd now be using powers of alpha rather than distinct alphas. How can we estimate this loss?
Indeed, we would use those to reduce the degree of constraints. Since we have spare helper registers, this should be doable.
Indeed, this would depend on the extension degree and the security we are targeting, but there are basically two situations of interest (this is a bit of a simplification, though):
Note, however, that the lost bits can be compensated for by grinding; see the ethSTARK paper for an explanation of this. This means that we can regain the lost bits by doing an appropriate amount of proof-of-work.
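As a rough rule of thumb (my paraphrase of the grinding argument, not a quote from the ethSTARK paper): a $d$-bit proof-of-work on the query seed buys back about $d$ bits, at the cost of roughly $2^d$ extra hash evaluations for the prover:

$$\lambda_{\text{with grinding}} \;\approx\; \lambda_{\text{without grinding}} + d.$$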
And if we are targeting 100 bits in LDR (conjectured security), would the same considerations apply? Separately, it would be great to get a rough estimate for the number of cycles we'd need to compute this.
The conjecture on the security of the toy problem in the ethSTARK paper didn't consider the case of a single challenge. Hence, we would have to adapt the analysis to this case; without thinking too much about it, it seems to me that the same estimate therein will also work in the case of a single challenge (I think this is what Plonky2 does). Note: we can use the following crude rule of thumb: if we use a cubic extension field then there will be no loss, while if we are using just the quadratic extension then there will be some loss.
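To put a rough number on the rule of thumb (a back-of-the-envelope sketch of my own, not from the thread; $n$ denotes the number of batched terms and $K$ the extension field the challenge is drawn from): batching with powers of a single challenge only fails if a nonzero polynomial of degree at most $n-1$ vanishes at the challenge, so

$$\Pr_{\alpha \leftarrow K}\Big[\sum_{i=0}^{n-1} c_i\,\alpha^i = 0\Big] \;\le\; \frac{n-1}{|K|} \qquad \text{for any fixed } (c_i) \neq 0,$$

i.e., we lose roughly $\log_2(n-1)$ bits compared to using $n$ independent challenges. Over a 64-bit base field, $|K| \approx 2^{128}$ for the quadratic extension and $|K| \approx 2^{192}$ for the cubic one, which is why the loss only really matters in the quadratic case.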
Makes sense, we should have something more concrete pretty soon.
So, with this, we would be at 92 bits?
That's correct, but I think 92 should be 94 (based on the math).
Fix an $x$ in the LDE domain. The task of computing the DEEP query at $x$ then reduces to evaluating a polynomial in the challenge $\alpha$, which can be done with Horner's method.
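For reference, the Horner identity being relied on is the standard one (included here for clarity; $a_i$ denotes the batched terms and $\alpha$ the single challenge):

$$\sum_{i=0}^{n-1} a_i\,\alpha^i \;=\; a_0 + \alpha\Big(a_1 + \alpha\big(a_2 + \cdots + \alpha\, a_{n-1}\big)\cdots\Big),$$

which costs one multiplication and one addition in the extension field per term and consumes the terms in a single pass.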
Given the above, the main work when computing a DEEP query is in this Horner evaluation. A few subtle points are worth mentioning:
This is a stab at how we can do the above in MASM. Specifically, we can change (roughly) the following lines between 525 and 565. Note that in what follows:
The current cost of doing the equivalent of the above, i.e., lines 525 to 565 in the previous reference, is at around
All in all, the above gives some promising indications, but we would probably want to have something that is fully integrated, using an emulation of the new instruction. To compare with the current implementation, with some caveats included towards the end, we have:
The constants are not included in the above, as they should take into account the impact of using the new instruction.
My understanding of the process and costs is currently as follows. First, we need to compute the powers of the challenge. Then, for each query, we'd need to:
So, all in all, I'm estimating about 200 cycles per query, and the overall cost becomes roughly 6K cycles. Could you double-check whether my estimates are close to reality? I'm also curious how 6K cycles compares to our current costs.
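(As a quick sanity check of the arithmetic, using the 27 queries mentioned elsewhere in this issue: $27 \times 200 = 5400$ cycles, which together with the one-time cost of computing the powers of the challenge lines up with the ~6K figure above.)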
I have updated the previous comment to be more specific with regard to the cycle count. To summarize:
For completeness' sake, using the approach above:
Though the above is a nice improvement, which implies that we can stop one layer earlier during FRI folding, I suspect that we can do better still.
### Feature description
We usually use the following pattern for loading data from the advice provider into the VM's memory:
This pattern is so fundamental that it might make sense to introduce a native instruction in order to do it at the chiplet level. The following is meant to start a discussion on how to build such an instruction.
### Modifying the hasher chiplet
From the stack side, the instruction could have as input the following stack elements:

- `Digest` is the claimed commitment to the data just about to get hashed, or in other words the hash of the unhashed data.
- `dst_ptr` is the pointer to the start of the region of memory to store the aforementioned data.
- `end_ptr` is the pointer to the last word of the said region of memory.

The stack will then have to send a request to the hasher chiplet in order to initiate a linear hash of the elements. The request could be `(op_label, clk, ctx, dst_ptr)` and `(op_label, clk, ctx, end_ptr, Digest)`. The hasher will then respond with `(op_label, clk, ctx, dst_ptr)` to signal the start of the requested operation and, at the last step, with `(op_label, clk, ctx, end_ptr, Digest)` to signal the end of the operation.

Note that there is no way, at least none apparent to me, around adding at least three extra columns to the hasher chiplet in order to keep track of `clk`, `ctx` and `ptr`, where `clk` and `ctx` remain fixed throughout the operation and `ptr` starts with `dst_ptr` and is incremented at each 8-row cycle by 2 until we reach `end_ptr`. We might also need one additional column to be able to add the new hasher internal instruction, but we might get away without doing so by redesigning things. Thus, in total, we would need to increase the number of columns in the hasher chiplet by at least 3.

We now describe what happens between the two hasher responses above. The hasher chiplet, upon initiating the linear hash operation above, will send a request, at each 8-row cycle, to the memory chiplet of the form `(op_label, clk, ctx, addr, Data[..4])` and `(op_label, clk, ctx, addr + 1, Data[4..])`, where `Data[0..8]` is the current double word occupying the rate portion of the hasher state, which is filled up by the prover non-deterministically.

Note that with the current degree of the flags in the hasher chiplet, which is at most $4$, the chiplets' bus can accommodate the above response and double request, even if we have to increase the degrees of the current internal flags.

From the side of the memory chiplet nothing changes, as we are using one bus to connect all of the VM components. This would change if, for some reason, we decide to isolate memory-related things into a separate bus.
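To make the intended behaviour easier to discuss, here is a minimal Rust model of what the instruction would do from the VM's point of view (an illustrative sketch only, not a proposed implementation: the `Felt`/`Word` types, the toy `permute` stand-in for the real permutation, the digest location, and all function names are my own assumptions):

```rust
// Toy model of the proposed "hash advice data to memory" instruction.
type Felt = u64;
type Word = [Felt; 4];

/// Stand-in for the hasher permutation over the 12-element state (capacity 4 + rate 8).
/// NOT the real RPO permutation; it only exists so the sketch runs.
fn permute(state: &mut [Felt; 12]) {
    for (i, s) in state.iter_mut().enumerate() {
        *s = s.wrapping_mul(31).wrapping_add(i as Felt + 1);
    }
}

/// Stream words from the advice provider into memory from `dst_ptr` to `end_ptr`
/// (two words per 8-row hasher cycle), absorb them into the hasher state, and
/// check the resulting digest against the claimed commitment.
fn hash_advice_to_memory(
    advice: &mut impl Iterator<Item = Word>,
    memory: &mut Vec<Word>,
    dst_ptr: usize,
    end_ptr: usize,
    claimed_digest: Word,
) -> Result<(), &'static str> {
    let mut state = [0 as Felt; 12];
    let mut ptr = dst_ptr;
    while ptr <= end_ptr {
        // The prover fills the rate portion non-deterministically from the advice provider.
        let lo = advice.next().ok_or("advice stream exhausted")?;
        let hi = advice.next().ok_or("advice stream exhausted")?;
        // Corresponds to the memory requests (op_label, clk, ctx, addr, Data[..4])
        // and (op_label, clk, ctx, addr + 1, Data[4..]).
        memory[ptr] = lo;
        memory[ptr + 1] = hi;
        state[4..8].copy_from_slice(&lo);
        state[8..12].copy_from_slice(&hi);
        permute(&mut state);
        ptr += 2; // `ptr` advances by 2 until it reaches `end_ptr`
    }
    // Read the digest from the final state and compare against the claim.
    let digest: Word = state[4..8].try_into().unwrap();
    if digest == claimed_digest { Ok(()) } else { Err("digest mismatch") }
}

fn main() {
    let mut advice = vec![[1, 2, 3, 4], [5, 6, 7, 8]].into_iter();
    let mut memory = vec![[0; 4]; 2];
    // With a toy permutation the claimed digest below is just a placeholder,
    // so this call is expected to report a mismatch.
    let result = hash_advice_to_memory(&mut advice, &mut memory, 0, 0, [0; 4]);
    println!("memory = {:?}, result = {:?}", memory, result);
}
```

The constraint-level design is, of course, the bus interaction described above; the sketch only pins down the intended input/output relation.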
### The Old Approach and Computing inner products
Most of the time, the hashed-to-memory data will at some point need to be involved in some random linear combination computation. For most situations, this amounts to an inner product between two vectors stored in memory, where at least one of the two vectors will be loaded from the advice provider, i.e., un-hashed. To get a feeling for this, I will describe it using the example of computing the DEEP queries.
Fix an $x$ in the LDE domain and consider the DEEP composition evaluated at $x$.
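For orientation, the DEEP composition evaluated at a query point $x$ typically has the following shape (a generic sketch of the standard construction, with OOD point $z$, trace domain generator $g$, trace columns $T_i$ and challenges $\alpha_i, \beta_i$; the exact expression intended here may differ):

$$\sum_{i} \alpha_i \cdot \frac{T_i(x) - T_i(z)}{x - z} \;+\; \sum_{i} \beta_i \cdot \frac{T_i(x) - T_i(g \cdot z)}{x - g \cdot z}.$$

Grouping by column shows why each query boils down to inner products $\sum_i a_i \cdot b_i$, with $b_i = T_i(x)$ in the base field and $a_i$ collecting the extension-field challenge and denominator factors.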
Notice the following:
All in all, the most critical part in computing DEEP queries is the computation of quantities of the form $\sum_{i=0}^{n-1} a_i \cdot b_i$, where $b_i$ can be over either the base or extension field and $a_i$ is always over the extension field. Hence, we will need separate operations in order to handle both situations. Note that one of these instructions is similar to, but more general than, the `rcomb_base` instruction.

Assume the degree of our extension field is 2. Then, given the above, we can introduce a new instruction, call it `inner_product_base`, which will help us compute the inner product as the sum of terms of the form $$a_0 \cdot b_0 + \cdots + a_3 \cdot b_3$$ where $a_i$ are over the base field and are on the operand stack and $b_i$ are extension field elements and are stored in memory.

The main change that will be needed to make the following proposal work is extending the size of the helper registers from $6$ to $8$ field elements. Assuming this, the effect of `inner_product_base` can be illustrated with the following:

Note that $acc = (acc_0, acc_1)$ is an accumulator holding the value of the inner product sum thus far and $acc'$ is the updated sum that includes $$a_0 \cdot b_0 + \cdots + a_3 \cdot b_3.$$
Note also that the pointer to $b$ is updated and that the two upper words of the stack are permuted. Hence we can compute the next term by an immediate second call to `inner_product_base`, with the following resulting effect:

#### Example
Assume we have a very wide (main) trace, say $2400$ columns wide as in the Keccak case. Then for each of the DEEP queries we will need to compute an inner product between a base field vector and an extension field vector of length $2400$. This means that for each query we will have to run the following block:
This means that we will only need $1200$ cycles to do most of the work needed to compute the DEEP queries. If we use the estimate of $100$ cycles for the remaining work, then we are looking at $1300$ cycles in total. This means that if the number of queries is $27$, then we would need around $35000$ cycles. Note that the two main components of recursive proof verification whose work increases linearly with the width of the trace are constraint evaluation and the computation of the DEEP queries. Hence, given our current implementation of the recursive verifier, the above suggests that we would be able to verify a recursive proof for Keccak in under $64000$ cycles.
For the recursive verifier for the Miden VM, substituting the current `rcomb_base` and `rcomb_ext` with `inner_product_base` and `inner_product_ext` should lead to a slightly faster implementation. If the above is interesting, then it should be relatively easy to have a POC implementation of DEEP queries computation using the new candidate instructions in order to get more concrete numbers.

### Why is this feature needed?
This will improve recursive verification in Miden VM.