Faster (than rcomb_*) instruction to compute random linear combinations #1600

Open

Al-Kindi-0 opened this issue Dec 13, 2024 · 12 comments

Labels: enhancement (New feature or request), recursive verifier (Related to recursive verifier implementation)

@Al-Kindi-0 (Collaborator) commented Dec 13, 2024

Feature description

We usually use the following pattern for loading data from the advice provider into the VM's memory:

push.DESTINATION_POINTER
padw padw padw
repeat.x
    adv_pipe hperm
end

This pattern is so fundamental that it might make sense to introduce a native instruction in order to do it at the chiplet level. The following is meant to start a discussion on how to build such an instruction.

Modifying the hasher chiplet

From the stack side, the instruction could have as input the following stack elements:

  1. Digest is the claimed commitment to the data about to be hashed, i.e., the expected hash of the (as yet un-hashed) data.
  2. dst_ptr is the pointer to the start of the region of memory where the aforementioned data will be stored.
  3. end_ptr is the pointer to the last word of said region of memory.

The stack will then have to send a request to the hasher chiplet in order to initiate a linear hash of the elements. The request could be
(op_label, clk, ctx, dst_ptr) and (op_label, clk, ctx, end_ptr, Digest).

The hasher will then respond with (op_label, clk, ctx, dst_ptr) to signal the start of the requested operation, and at the last step with (op_label, clk, ctx, end_ptr, Digest) to signal the end of the operation. Note that there is no way, at least none apparent to me, around adding at least three extra columns to the hasher chiplet in order to keep track of clk, ctx and ptr, where clk and ctx remain fixed throughout the operation and ptr starts at dst_ptr and is incremented by 2 at each 8-row cycle until it reaches end_ptr. We might also need one additional column to be able to add the new hasher-internal instruction, but we might get away without it by redesigning things.
Thus, in total, we would need to increase the number of columns in the hasher chiplet by at least 3.
We now describe what happens between the two hasher responses above. The hasher chiplet, upon initiating the linear hash operation above, will send a request to the memory chiplet at each 8-row cycle of the form (op_label, clk, ctx, addr, Data[..4]) and (op_label, clk, ctx, addr + 1, Data[4..]), where Data[0..8] is the current double word occupying the rate portion of the hasher state, which is filled in non-deterministically by the prover.
Note that with the current degree of the flags in the hasher chiplet, which is at most $4$, the chiplets' bus can accommodate the above response and double request even if we have to increase the degrees of the current internal flags.
From the side of the memory chiplet, nothing changes, as we use a single bus to connect all of the VM components. This would change if, for some reason, we decided to isolate memory-related communication into a separate bus.
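To make the bus interaction above more concrete, the following is a rough Rust sketch of the messages involved. The names and types (`BusMsg`, `Felt`, `Word`, `cycle_pointers`) are illustrative assumptions, not the actual chiplet implementation, and no field arithmetic is modeled.

    // Illustrative model only: `Felt`, `Word`, and `BusMsg` are stand-ins, not Miden VM types.
    type Felt = u64;
    type Word = [Felt; 4];

    /// Messages exchanged over the chiplets' bus for the proposed instruction.
    enum BusMsg {
        /// Sent by the stack and matched by the hasher to start the linear hash.
        Start { op_label: Felt, clk: Felt, ctx: Felt, dst_ptr: Felt },
        /// Sent at the last step of the operation; `digest` is the claimed commitment.
        End { op_label: Felt, clk: Felt, ctx: Felt, end_ptr: Felt, digest: Word },
        /// Issued by the hasher to the memory chiplet, twice per 8-row cycle
        /// (one message per word of the absorbed double word).
        MemWrite { op_label: Felt, clk: Felt, ctx: Felt, addr: Felt, data: Word },
    }

    /// The hasher-side pointer bookkeeping: `ptr` starts at `dst_ptr` and is
    /// incremented by 2 at each 8-row cycle until it reaches `end_ptr`.
    fn cycle_pointers(dst_ptr: Felt, end_ptr: Felt) -> Vec<Felt> {
        (dst_ptr..=end_ptr).step_by(2).collect()
    }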

The Old Approach and Computing Inner Products

Most of the time, the hashed-to-memory data will at some point need to be involved in some random linear combination computation. For most situations, this amounts to an inner product between two vectors stored in memory, where at least one of the two vectors will be loaded from the advice provider, i.e., un-hashed. To get a feeling for this, I will use the example of computing the DEEP queries.

Fix an $x$ in the LDE domain and let

$$ \mathsf{S} := \sum_{i=0}^k\alpha_i {\left(\frac{T_i(x) - T_i(z)}{x - z} + \frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g} \right)}. $$

Then

$$\begin{aligned} \mathsf{S} &= \sum_{i=0}^k\alpha_i {\left(\frac{T_i(x) - T_i(z)}{x - z} \right)} + \sum_{i=0}^k\alpha_i {\left(\frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g} \right)} \\\ &= \sum_{i=0}^k {\left(\frac{\alpha_i\cdot T_i(x) - \alpha_i\cdot T_i(z)}{x - z} \right)} + \sum_{i=0}^k{\left(\frac{\alpha_i\cdot T_i(x) - \alpha_i\cdot T_i(z \cdot g)}{x - z \cdot g} \right)} \\\ &= \frac{1}{x - z}\sum_{i=0}^k \left(\alpha_i\cdot T_i(x) - \alpha_i\cdot T_i(z) \right) +\frac{1}{x - z \cdot g} \sum_{i=0}^k\left(\alpha_i\cdot T_i(x) - \alpha_i\cdot T_i(z \cdot g) \right) \\\ &= \frac{1}{x - z} \left(\mathsf{ip}_x - \mathsf{ip}_z \right) +\frac{1}{x - z \cdot g} \left(\mathsf{ip}_x - \mathsf{ip}_{gz} \right) \end{aligned}$$

where

$$\mathsf{ip}_x := \langle\alpha_{\cdot}, T_{\cdot}(x)\rangle := \sum_{i=0}^k\alpha_i\cdot T_i(x),$$
$$\mathsf{ip}_z := \langle\alpha_{\cdot}, T_{\cdot}(z)\rangle := \sum_{i=0}^k\alpha_i\cdot T_i(z),$$
$$\mathsf{ip}_{gz} := \langle\alpha_{\cdot}, T_{\cdot}(z \cdot g)\rangle := \sum_{i=0}^k\alpha_i\cdot T_i(z \cdot g).$$

Notice the following:

  1. $\mathsf{ip}_z$ and $\mathsf{ip}_{gz}$ are independent of $x$ and as such they can be computed once and for all FRI queries.
  2. The only terms dependent on $x$ are $\mathsf{ip}_x$, $\frac{1}{x - z}$ and $\frac{1}{x - gz}$.
  3. Computing $\mathsf{ip}_x$, $\mathsf{ip}_z$ or $\mathsf{ip}_{gz}$ amounts to the same as computing an inner product of two vectors.

All in all, the most critical part of computing the DEEP queries is the computation of quantities of the form $\sum_{i=0}^{n-1} a_i \cdot b_i$, where $b_i$ can be over either the base or the extension field and $a_i$ is always over the extension field. Hence, we will need separate operations in order to handle both situations. Note that one of these instructions is similar to, but more general than, the rcomb_base instruction.
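As a sanity check of the algebra above, here is a small, self-contained Rust sketch (using a toy prime field and arbitrary made-up values rather than the quadratic extension of the actual base field) confirming that computing $\mathsf{S}$ column by column agrees with the inner-product form:

    // Toy-field check of S = (ip_x - ip_z)/(x - z) + (ip_x - ip_gz)/(x - gz);
    // the prime, the column values and the alphas are arbitrary illustrative choices.
    const P: u64 = 2_147_483_647; // 2^31 - 1, a toy prime

    fn add(a: u64, b: u64) -> u64 { (a + b) % P }
    fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }
    fn mul(a: u64, b: u64) -> u64 { ((a as u128 * b as u128) % P as u128) as u64 }
    fn pow(mut b: u64, mut e: u64) -> u64 {
        let mut acc = 1;
        while e > 0 { if e & 1 == 1 { acc = mul(acc, b); } b = mul(b, b); e >>= 1; }
        acc
    }
    fn inv(a: u64) -> u64 { pow(a, P - 2) } // Fermat's little theorem

    fn inner_product(a: &[u64], b: &[u64]) -> u64 {
        a.iter().zip(b).fold(0, |acc, (&x, &y)| add(acc, mul(x, y)))
    }

    fn main() {
        // T_i(x), T_i(z), T_i(z*g) for k + 1 = 5 columns, plus the batching challenges.
        let t_x: [u64; 5] = [3, 1, 4, 1, 5];
        let t_z: [u64; 5] = [9, 2, 6, 5, 3];
        let t_gz: [u64; 5] = [5, 8, 9, 7, 9];
        let alpha: [u64; 5] = [11, 13, 17, 19, 23];
        let (x, z, gz) = (101u64, 202u64, 303u64);

        // Direct evaluation of S, column by column.
        let mut s = 0;
        for i in 0..5 {
            let lhs = mul(sub(t_x[i], t_z[i]), inv(sub(x, z)));
            let rhs = mul(sub(t_x[i], t_gz[i]), inv(sub(x, gz)));
            s = add(s, mul(alpha[i], add(lhs, rhs)));
        }

        // Evaluation via the three inner products ip_x, ip_z and ip_gz.
        let ip_x = inner_product(&alpha, &t_x);
        let ip_z = inner_product(&alpha, &t_z);
        let ip_gz = inner_product(&alpha, &t_gz);
        let s2 = add(
            mul(sub(ip_x, ip_z), inv(sub(x, z))),
            mul(sub(ip_x, ip_gz), inv(sub(x, gz))),
        );

        assert_eq!(s, s2);
    }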

Assume the degree of our extension field is 2. Then, given the above, we can introduce a new instruction, call it inner_product_base, which will help us compute the inner product as a sum of terms of the form $$a_0 \cdot b_0 + \cdots + a_3 \cdot b_3$$ where the $a_i$ are base field elements sitting on the operand stack and the $b_i$ are extension field elements stored in memory.

The main change needed to make the following proposal work is extending the size of the helper registers from $6$ to $8$ field elements. Assuming this, the effect of inner_product_base can be illustrated as follows:

(diagram: first_step_of_lin_comb_base)

Note that $acc = (acc0, acc1)$ is an accumulator holding the value of the inner product sum so far, and $acc^{'}$ is the updated sum that includes $$a_0 \cdot b_0 + \cdots + a_3 \cdot b_3$$

Note also that the pointer to $b$ is updated and that the two upper words of the stack are permuted. Hence we can compute the next term by an immediate second call to inner_product_base with the following resulting effect:

(diagram: second_step_of_lin_comb_base)
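A minimal Rust model of the accumulator update performed by a single inner_product_base invocation; the types, the absence of modular reduction, and the pointer increment (word addressing assumed, as in the hasher section above) are illustrative assumptions, not VM code:

    // Model of one `inner_product_base` step: wrapping arithmetic stands in for
    // field arithmetic, so only the shape of the update is captured here.
    type Felt = u64;
    type ExtElem = (Felt, Felt); // two base-field limbs of a quadratic extension element

    fn inner_product_base_step(
        a: [Felt; 4],      // base-field operands taken from the operand stack
        b: [ExtElem; 4],   // extension-field operands read from memory at `b_ptr`
        mut acc: ExtElem,  // (acc0, acc1), the running inner-product sum
        b_ptr: u64,        // pointer into the b vector
    ) -> (ExtElem, u64) {
        for i in 0..4 {
            // A base-field scalar times an extension element scales both limbs.
            acc.0 = acc.0.wrapping_add(a[i].wrapping_mul(b[i].0));
            acc.1 = acc.1.wrapping_add(a[i].wrapping_mul(b[i].1));
        }
        // Four extension elements occupy two words, so the pointer advances by 2
        // (assuming word addressing).
        (acc, b_ptr + 2)
    }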

Example

Assume we have a very wide (main) trace, say $2400$ columns wide as in the Keccak case. Then for each of the DEEP queries we will need to compute an inner product between a base field vector and an extension field vector of length $2400$. Concretely, for each query we would have to run the following block:

repeat.300
    adv_pipe hperm
    inner_product_base
    inner_product_base
end

This means that we will only need $1200$ cycles to do most of the work needed to compute the DEEP queries. If we use the estimate of $100$ cycles for the remaining work, then we are looking at $1300$ cycles in total per query. This means that if the number of queries is $27$, then we would need around $35000$ cycles. Note that the two main components of recursive proof verification whose work increases linearly with the width of the trace are constraint evaluation and the computation of the DEEP queries. Hence, given our current implementation of the recursive verifier, the above suggests that we would be able to verify a recursive proof for Keccak in under $64000$ cycles.
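Spelling out the arithmetic behind these estimates (assuming one cycle per instruction in the block above):

$$27 \cdot (300 \cdot 4 + 100) = 27 \cdot 1300 = 35100 \approx 35000 \text{ cycles}.$$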

For the recursive verifier for the Miden VM, substituting the current rcomb_base and rcomb_ext with inner_product_base and inner_product_ext should lead to a slightly faster implementation. If the above is interesting, then it should be relatively easy to have a POC implementation of DEEP queries computation using the new candidate instructions in order to get more concrete numbers.

Why is this feature needed?

This will improve recursive verification in Miden VM.

Al-Kindi-0 added the enhancement label on Dec 13, 2024

@Al-Kindi-0 (Collaborator, Author) commented:

In relation to the use of the above proposal in Falcon, we can use the inner_product_base op in order to evaluate the polynomials $h$, $s_2$, $\pi_1$ and $\pi_2$ at a random point $\tau$.

The way to do so is, again, to compute the powers of $\tau$ and store them in memory. The pointer for this will be the same as the a_ptr above. The difference will be that now $\tau^i$ will be laid out in the more compact form [a_i, b_i, c_i, d_i] where $(a_i, b_i) := \tau^i$ and $(c_i, d_i) := \tau^{i+1}$ instead of [a_i, b_i, 0, 0].
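A rough sketch of how such a table of powers could be laid out (the `ExtElem` type and the `mul_ext` closure are placeholders for actual quadratic-extension arithmetic, and the assumption is that consecutive words continue with consecutive powers):

    // Lay out powers of tau two-per-word: word i holds [tau^(2i), tau^(2i+1)]
    // as four base-field limbs, instead of the sparse [a_i, b_i, 0, 0] layout.
    type Felt = u64;
    type ExtElem = (Felt, Felt); // (low, high) limbs of a quadratic extension element

    fn compact_tau_powers(
        tau: ExtElem,
        n_words: usize,
        mul_ext: impl Fn(ExtElem, ExtElem) -> ExtElem, // extension-field multiplication
    ) -> Vec<[Felt; 4]> {
        let mut words = Vec::with_capacity(n_words);
        let mut cur = (1, 0); // tau^0
        for _ in 0..n_words {
            let next = mul_ext(cur, tau);          // the odd power in this word
            words.push([cur.0, cur.1, next.0, next.1]);
            cur = mul_ext(next, tau);              // advance to the next even power
        }
        words
    }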

As an example of the changes in the code, we would have the following (the current approach followed by the new one):

    # Compute h(tau): current approach using rcomb_base
    repeat.64
        mem_stream
        repeat.8
            rcomb_base
        end
    end
    # Compute h(tau): new approach using inner_product_base
    repeat.64
        mem_stream
        inner_product_base
        inner_product_base
    end

This would mean an improvement in the cycle count from $576$ to $192$ per polynomial. Since we are evaluating 4 polynomials, this means we can shave off about $1500$ cycles (namely $4 \cdot (576 - 192) = 1536$) from the probabilistic_product procedure.

@Al-Kindi-0 (Collaborator, Author) commented:

Another way to achieve the same goal would be to simplify the problem we are considering: namely, instead of computing quantities of the form $$\sum a_i \cdot b_i$$ we instead compute $$\sum a_i \alpha^i.$$
This amounts to the problem of evaluating a polynomial given in coefficient form at some point $\alpha$.

General Idea

The main idea is to use Horner's method, but the main challenge we then face is the need to have the coefficients of the polynomial in reverse order. Let's assume this for now and discuss afterwards how we can lift this restriction.
Let us also assume that both the coefficients and the evaluation point are elements in the quadratic extension field. It should be straightforward to tackle other configurations.

Let's start by defining our new primitive for doing Horner evaluation, call it horner_eval_ext, which has the following transition diagram:
(diagram: horner_eval_rev)

where

$$acc' = (acc0', acc1') := ((((acc \cdot \alpha + a_3) \cdot \alpha + a_2) \cdot \alpha + a_1) \cdot \alpha) + a_0$$

Note that the order of the $a$ coefficients is reversed from the point of view of un-hashing (i.e., we are hashing from right to left), namely $a_3$ is the coefficient of the monomial with the highest power among $a_0, a_1, a_2, a_3$.
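A minimal Rust model of this accumulator update; the types and the `mul_ext`/`add_ext` closures are placeholder assumptions standing in for actual quadratic-extension arithmetic:

    // One `horner_eval_ext` step: fold four extension-field coefficients,
    // given highest power first, into the accumulator at the point alpha.
    type Felt = u64;
    type ExtElem = (Felt, Felt);

    fn horner_eval_ext_step(
        acc: ExtElem,
        coeffs_high_to_low: [ExtElem; 4], // [a3, a2, a1, a0]
        alpha: ExtElem,
        mul_ext: impl Fn(ExtElem, ExtElem) -> ExtElem,
        add_ext: impl Fn(ExtElem, ExtElem) -> ExtElem,
    ) -> ExtElem {
        // acc' = ((((acc * alpha + a3) * alpha + a2) * alpha + a1) * alpha) + a0
        coeffs_high_to_low
            .iter()
            .fold(acc, |acc, &a| add_ext(mul_ext(acc, alpha), a))
    }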

Now, using the above and assuming, again, that the coefficients are in reverse order, we can compute $$\sum_{i=0}^{4n-1} a_i \alpha^i$$
using the following code:

repeat.n
 adv_pipe
 horner_eval_ext
 hperm
end

The above suggests that horner_eval_ext will have the same improvement effect as inner_prod_ext but horner_eval_base will be twice as fast as inner_prod_base as we don't have the restriction imposed by having to store the random coefficients in the helper registers anymore.

Lifting the Restriction

The easiest way to make the above work would be to change the backend to always process the leaves of Merkle trees in reverse order, but this is a bit unclean.
The other way is to use an interesting identity, which we describe now. Let $$P(x) = \sum_{i=0}^{n} a_i \cdot x^i$$ be the polynomial we want to evaluate, and suppose we have access to its reversed version, namely
$$P_{rev}(x) = \sum_{i=0}^{n} a_{n-i} \cdot x^i.$$

Then we have the following identity:

$$P(\alpha) = {\alpha^n}P_{rev}(\alpha^{-1}) $$
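To see why this holds, substitute the definition of $P_{rev}$ and re-index the sum:

$$\alpha^n P_{rev}(\alpha^{-1}) = \alpha^n \sum_{i=0}^{n} a_{n-i}\,\alpha^{-i} = \sum_{i=0}^{n} a_{n-i}\,\alpha^{n-i} = \sum_{j=0}^{n} a_{j}\,\alpha^{j} = P(\alpha).$$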

This suggests the following strategy to evaluate $P(x)$ at $\alpha$ when we are given its coefficients in reverse order:

  1. Compute $\alpha^{-1}$ and store it at address with pointer alpha_ptr
  2. Use the method presented in the previous section with alpha_ptr
  3. Compute $\alpha^n$
  4. Multiply the final value of $acc$ by $\alpha^n$ to obtain $P(\alpha)$

Note that Points 1 and 3 above are done only once in the context of, for example, computing queries to the DEEP polynomial. Hence these costs are negligible after amortization.

@bobbinth (Contributor) commented Jan 7, 2025

> The above suggests that horner_eval_ext will have the same improvement effect as inner_prod_ext but horner_eval_base will be twice as fast as inner_prod_base as we don't have the restriction imposed by having to store the random coefficients in the helper registers anymore.

We'd also need to store intermediate results in helper registers, no? (otherwise expression degree would be 5 for extension field, and 9 for base field operations)

I think there are some other benefits to the Horner eval approach:

  1. We don't need to introduce extra helper registers.
  2. We only do a single memory read per instruction invocation (to read the alpha). With inner product, we were doing 2 memory reads. Of course, it would be ideal if we could put alpha somehow onto the stack and then we wouldn't need to do memory reads at all - but I'm not sure that's possible.
  3. We need to draw fewer random values.

But there are also drawbacks. I think the main one is some security loss, because we'd now be using powers of alpha rather than distinct alphas. How can we estimate this loss?

@Al-Kindi-0 (Collaborator, Author) commented Jan 7, 2025

> > The above suggests that horner_eval_ext will have the same improvement effect as inner_prod_ext but horner_eval_base will be twice as fast as inner_prod_base as we don't have the restriction imposed by having to store the random coefficients in the helper registers anymore.
>
> We'd also need to store intermediate results in helper registers, no? (otherwise expression degree would be 5 for extension field, and 9 for base field operations)

Indeed, we would use those to reduce the degree of constraints. Since we have spare helper registers, this should be doable.

> I think there are some other benefits to the Horner eval approach:
>
> 1. We don't need to introduce extra helper registers.
> 2. We only do a single memory read per instruction invocation (to read the alpha). With inner product, we were doing 2 memory reads. Of course, it would be ideal if we could put alpha somehow onto the stack and then we wouldn't need to do memory reads at all - but I'm not sure that's possible.
> 3. We need to draw fewer random values.
>
> But there are also drawbacks. I think the main one is some security loss, because we'd now be using powers of alpha rather than distinct alphas. How can we estimate this loss?

Indeed, this would depend on the extension degree and the security we are targeting, but there are basically two situations of interest (this is a bit of a simplification, though):

  1. Using a quadratic extension field, and assuming we are targeting the maximal security one can achieve in the UDR setting, there will be $\log_2(L)$ bits of security lost if we use powers of a single challenge. Here $L$ is the number of columns we are batching. For example, in the case of Keccak, we would lose about $11$ bits of security using powers of a single challenge versus an independent challenge per column.
  2. Using a cubic extension field, and assuming we are targeting $128$ bits in the LDR setting, for all typical trace sizes there will be no loss in security from using powers of a single challenge.

Note, however, that the lost bits can be compensated for by grinding; see the ethSTARK paper for an explanation of this. This means that we can regain the lost bits by doing an $11$-bit proof of work just before sampling the challenge $\alpha$.

@bobbinth (Contributor) commented Jan 8, 2025

> Using a quadratic extension field, and assuming we are targeting the maximal security one can achieve in the UDR setting, there will be $\log_2(L)$ bits of security lost if we use powers of a single challenge. Here $L$ is the number of columns we are batching. For example, in the case of Keccak, we would lose about $11$ bits of security using powers of a single challenge versus an independent challenge per column.

And if we are targeting 100 bits in LDR (conjectured security), would $L=2000$ reduce this security to 89 bits?

Separately, it would be great to get a rough estimate for the number of cycles we'd need to compute $\mathsf{ip}_x$, $\mathsf{ip}_z$ and $\mathsf{ip}_{gz}$, assuming we have 80 base field columns and 8 extension field columns.

bobbinth added this to the v0.13.0 milestone on Jan 8, 2025

@Al-Kindi-0 (Collaborator, Author) commented:

> And if we are targeting 100 bits in LDR (conjectured security), would $L = 2000$ reduce this security to 89 bits?

The conjecture on the security of the toy problem in the ethSTARK paper didn't consider the case of a single challenge. Hence, we would have to adapt the analysis to this case, and without thinking about it too much it seems to me that the same estimate therein will also work in the case of a single challenge (I think this is what Plonky2 does).
Just to remind myself: the conjectured security of the toy problem has $\frac{1}{|\mathbb{F}|}$ as its first term, and the above suggests that it will not change, hence there won't be a degradation.

In Winterfell the term $\frac{|D|}{|\mathbb{F}|}$ appears, where $D$ is the LDE domain. This also appears in the ethSTARK GitHub repo. Technically, this is not congruent with the conjecture on the security of the toy problem but is based instead on Conjecture 8.4 in the proximity gaps paper, which is a conjecture about what happens beyond LDR, i.e., up to the capacity bound of RS codes. If we use this conjecture instead, then the error term will go from $\frac{|D|}{|\mathbb{F}|}$ to $\frac{(N - 1) \cdot |D|}{|\mathbb{F}|}$, where $N$ is the number of terms in the sum: again a loss of $\log_2(N)$ bits. Note that this is a bit counter-intuitive, as it suggests that the error term behaves as nicely as it does in the UDR setting (implying that one can get away with a quadratic extension for moderate trace lengths), which is better than LDR (remember that there the error term behaves like $\frac{|D|^2}{|\mathbb{F}|}$, which means that one cannot reach any meaningful security level, for moderate trace lengths, with an extension degree less than $3$).

Note: we can use the following crude rule of thumb: if we use a cubic extension field, there will be no loss; if we use just the quadratic extension, there will be a loss of $\log_2(N)$ bits compared to independent challenges.

> Separately, it would be great to get a rough estimate for the number of cycles we'd need to compute $\mathsf{ip}_x$, $\mathsf{ip}_z$ and $\mathsf{ip}_{gz}$, assuming we have 80 base field columns and 8 extension field columns.

Makes sense, we should have something more concrete pretty soon.

@bobbinth (Contributor) commented Jan 8, 2025

So, with $\frac{(N - 1) \cdot |D|}{|\mathbb{F}|}$, a 1M cycle trace and 80 columns would cap the security at 98 bits, right? And for 2000 columns, the cap is 92 bits? And if I understood correctly, we should be able to bring this back up with grinding?

@Al-Kindi-0 (Collaborator, Author) commented:

That's correct, but I think 92 should be 94 (since the math is $128 - 20 - 11 - 3 = 94$). Say we want to reach 100 bits of security, then we would need to add $2$ bits of PoW in the first case and $6$ bits of PoW in the second. The PoW would happen before sampling $\alpha$.
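One way to read these terms (an interpretation, assuming a trace of $2^{20}$ rows, a blowup factor of $8$, $N \approx 2000$ batched terms, and $|\mathbb{F}| \approx 2^{128}$ for the quadratic extension):

$$128 - \log_2|D| - \log_2(N - 1) \approx 128 - (20 + 3) - 11 = 94 \text{ bits}.$$

The same reading with $N = 80$ gives roughly $128 - 23 - 6.3 \approx 98$ bits, matching the first case.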

@Al-Kindi-0 (Collaborator, Author) commented Jan 10, 2025

Fix an $x$ in the LDE domain and let

$$ \mathsf{S} := \sum_{i=0}^{m-1}\alpha^i {\left(\frac{T_i(x) - T_i(z)}{x - z} + \frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g} \right)} + \sum_{i=m}^{m+n-1}\alpha^i {\left(\frac{A_i(x) - A_i(z)}{x - z} + \frac{A_i(x) - A_i(z \cdot g)}{x - z \cdot g} \right)} + \sum_{i=m+n}^{m+n+o-1}\alpha^i {\left(\frac{H_i(x) - H_i(z)}{x - z}\right)}. $$

Then

$$\begin{aligned} \mathsf{S} &= \frac{1}{x - z} \left( \sum_{i=0}^{m-1}\alpha^i {\left( {T_i(x) - T_i(z)} \right)} + \sum_{i=m}^{m+n-1}\alpha^i {\left( {A_i(x) - A_i(z)} \right)} + \sum_{i=m+n}^{m+n+o-1}\alpha^i {\left( {H_i(x) - H_i(z)} \right)} \right) + \frac{1}{x - z \cdot g} \left( \sum_{i=0}^{m-1}\alpha^i {\left( {T_i(x) - T_i(z \cdot g)} \right)} + \sum_{i=m}^{m+n-1}\alpha^i {\left( {A_i(x) - A_i(z \cdot g)} \right)} \right).\\\ &= \frac{1}{x - z} \left( \left( \sum_{i=0}^{m-1}\alpha^i {{T_i(x)}} + \sum_{i=m}^{m+n-1}\alpha^i {{A_i(x)}} + \sum_{i=m+n}^{m+n+o-1}\alpha^i {{H_i(x)}} \right) - \left( \sum_{i=0}^{m-1}\alpha^i {{T_i(z)}} + \sum_{i=m}^{m+n-1}\alpha^i {{A_i(z)}} + \sum_{i=m+n}^{m+n+o-1}\alpha^i {{H_i(z)}} \right) \right) + \frac{1}{x - z \cdot g} \left( \left( \sum_{i=0}^{m-1}\alpha^i {{T_i(x)}} + \sum_{i=m}^{m+n-1}\alpha^i {{A_i(x)}} \right) - \left( \sum_{i=0}^{m-1}\alpha^i {{T_i( z \cdot g )}} + \sum_{i=m}^{m+n-1}\alpha^i {{A_i( z \cdot g )}} \right) \right) .\\\ &= \frac{1}{x - z} \left( Q^x(\alpha) - Q^z(\alpha) \right) + \frac{1}{x - z \cdot g} \left( P^x(\alpha) - P^{z \cdot g}(\alpha) \right) . \end{aligned}$$

This means that the task of computing $\mathsf{S}$ amounts to computing the evaluations of $4$ polynomials at the challenge point $\alpha$. However, we note the following:

  1. $Q^z(\alpha)$ is the same for all queries (independent of $x$) and hence it needs to be computed only once.
  2. $P^{z \cdot g}(\alpha)$ is the same for all queries (independent of $x$) and hence it needs to be computed only once.
  3. The $n + m$ coefficients of $P^x(X)$ are the first $n + m$ coefficients of $Q^x(X)$, and since we will be evaluating the reversed versions of both polynomials, computing the value of $P^x(\alpha)$ will come almost for free while computing $Q^x(\alpha)$ (see the identity spelled out below).

Given the above, the main work when computing $\mathsf{S}$ goes into computing $Q^x(\alpha)$.

A few subtle points worth mentioning:

  1. The coefficients of $P^x(X)$ and $Q^x(X)$ are a mix of elements (purely) in the base field and some in the extension field. However, this is not an issue as long as we use the appropriate version of horner_eval_*.
  2. There is a complication at the boundaries between the $T_i$, $A_i$ and $H_i$ elements when, in addition to the previous point, the number of $T_i$, $A_i$ and $H_i$ elements is not divisible by $8$, $4$ and $4$, respectively. This is because horner_eval_* is atomic. The solution is then an extra correction step at the boundaries. This step amounts to multiplications by powers of $\alpha$ which are pre-computed and reused for all queries.
  3. If we are willing to open the constraint composition segment polynomials at $zg$ as well i.e., include terms of the form $\frac{H_i(x) - H_i(gz)}{x - gz}$ then we get $P^x(\alpha) = Q^x(\alpha)$ which can bring a nice improvement.
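The following identity is what makes "computing $P^x(\alpha)$ almost for free" work, and it also explains the $\alpha^K$ and $\alpha^L$ adjustment factors in the code below. If the coefficients $c_0, c_1, \dots$ are fed to the Horner accumulator in their natural order but the evaluation point is $\alpha^{-1}$, then after the first $j + 1$ coefficients

$$\mathsf{acc}_j = \sum_{i=0}^{j} c_i\,\alpha^{-(j-i)} = \alpha^{-j}\sum_{i=0}^{j} c_i\,\alpha^{i},$$

so saving the accumulator after the first $m + n$ coefficients and multiplying it by $\alpha^{m+n-1}$ yields $P^x(\alpha)$, while continuing through the remaining $o$ coefficients and multiplying by $\alpha^{m+n+o-1}$ yields $Q^x(\alpha)$.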

This is a stab at how we can do the above in MASM. Specifically, we can change (roughly) the following lines, between 525 and 565. Note that in what follows:

  1. We assume that the main trace width is 80, aux trace width is 8 and constraint composition trace is 8. Thus we avoid the need to adjust the Horner evaluation accumulator at the boundaries.
  2. The computation of the terms which are independent of $x$ is not done as this is a constant amount of work independent of the number of queries.
    # I) Set up the stack

    # 1) Pointer to the inverse of the challenge
    exec.constants::alpha_inv_ptr
    
    # 2) Pointer to T_i(x), A_i(x), H_i(x) for current x
    exec.constants::current_trace_row_ptr
    
    # 3) Accumulator
    push.0.0

    # 4) Configure hasher state
    padw padw padw

    # II) Compute P^x(\alpha) and Q^x(\alpha)

    # 1) Compute P^x_rev(\alpha_inv)

    ## a) compute \sum alpha_inv^i T_i(x)
    repeat.10
        mstream horner_eval_base
    end

    ## b) compute \sum alpha_inv^i A_i(x)
    repeat.2
        mstream horner_eval_ext
    end

    ## c) save P^x_rev(\alpha_inv)
    swapw.3 exec.constants::TMP20 mem_storew swapw.3

    # 2) Compute Q^x_rev(\alpha_inv)

    # compute \sum alpha_inv^i H_i(x)
    repeat.2
        mstream horner_eval_ext
    end

    # 3) Adjust Q^x_rev(\alpha_inv) to get Q^x(\alpha)
    
    ## a) Load adjustment factor alpha^K where K = n + m + o - 1
    swapdw exec.constants::alpha_k_ptr mem_loadw drop drop
    # => [alphaK1, alphaK0, acc1, acc0, Y, Y, y, y, ...] where y is garbage and Y = [y, y, y, y]

    ## b) Compute Q^x(\alpha) as Q^x_rev(\alpha_inv) * alpha^K
    ext2mul
    # => [res1, res0, y, y, Y, Y, ...] where res = Q^x(\alpha)

    # III) Compute Q^x(\alpha) - Q^z(\alpha)

    ## a) Load  (- Q^z(\alpha))
    swapw exec.constants::Q_z_alpha mem_loadw 
    # => [y, y, tmp1, tmp0, res1, res0, y, y, Y, ...]

    ## b) Compute difference and hence the numerator
    movdn.5 movdn.4 
    # => [tmp1, tmp0, res1, res0, y, y, y, y, Y, ...]
    ext2add
    # => [num1, num0, y, y, y, y, Y, ...] where num = Q^x(\alpha) - Q^z(\alpha)

    # IV) Compute the first quotient (Q^x(\alpha) - Q^z(\alpha)) / (x - z)

    ## a) Load the denominator (x - z)
    swapw exec.constants::Z_ptr mem_loadw
    # => [zg1, zg0, z1, z0, num1, num0, y, y, y, y, ...]

    ## b) Divide
    movdn.5 movdn.5
    # => [z1, z0, num1, num0, zg1, zg0, y, y, y, y, ...]
    ext2div
    # => [res1, res0, zg1, zg0, y, y, y, y, ...]

    # V) Compute  P^x(\alpha) - P^gz(\alpha)

    ## a) Compute P^x(\alpha) from P^x_rev(\alpha_inv)
    swapw exec.constants::TMP20 mem_loadw
    # => [tmp1, tmp0, y, y, res1, res0, zg1, zg0, ...]

    ## b) Multiply by alpha^L where L = n + m - 1
    movup.3 movup.3 push.0.0 exec.constants::alpha_l_ptr mem_loadw
    # => [y, y, alphaL1, alphaL0, tmp1, tmp0, res1, res0, zg1, zg0, ...]
    push.0.0 swapw
    # => [alphaL1, alphaL0, tmp1, tmp0, Y, res1, res0, zg1, zg0, ...]
    ext2mul
    # => [num`1, num`0, Y, res1, res0, zg1, zg0, ...] where num` = P^x(\alpha)

    ## c) Load (- P^gz(\alpha))  
    movdn.5 movdn.5 exec.constants::P_zg_alpha mem_loadw drop drop
    ext2add
    # => [num`1, num`0, res1, res0, zg1, zg0, ...]

    ## d) compute (P^x(\alpha) - P^{zg}(\alpha)) / (x - zg)
    movup.5 movup.5 ext2div
    # => [res`1, res`0, res1, res0, ...] where res` = (P^x(\alpha) - P^{zg}(\alpha)) / (x - zg)

    # VI) Compute final result
    ext2add
    # => [v1, v0, ...] where v = (Q^x(\alpha) - Q^z(\alpha)) / (x - z) + (P^x(\alpha) - P^{zg}(\alpha)) / (x - zg)

The current cost of doing the equivalent of the above, i.e., lines 525 to 565 in the previous reference, is around $165$ cycles, while the cost of the code given above is about $130$ cycles.
Note that:

  1. A good portion of the cycles goes into computing $P^x(\alpha)$ from $Q^x(\alpha)$. Hence, including terms of the form $\frac{H_i(x) - H_i(gz)}{x - gz}$ in the sum can avoid this cost.
  2. The improvement between the new and old way of computing the DEEP queries will grow larger the wider the trace becomes.

All in all, the above gives some promising indications, but we would probably want to have something that is fully integrated, using an emulation of horner_eval_*.

To compare with the current implementation, with some caveats included towards the end, we have:

  1. New implementation: 130 cycles for the above + 200 cycles for hashing and Merkle path authentication. Note that in the actual implementation we will not use mem_stream as we did in the above code, but will instead use adv_pipe, horner_eval_* and hperm at the same time, so 200 cycles is a bit of an overestimate for hashing and Merkle path authentication. On the other hand, 130 cycles also underestimates the cost to compute $Q^x(\alpha)$ and $P^x(\alpha)$, as we most certainly will not have perfect alignment of the committed traces from the POV of horner_eval_*. All in all, I think 330 cycles per query is a fair estimate here.
  2. Old implementation: 445 cycles per query.

The constant costs are not included in the above, as they should take into account the impact of using horner_eval_* over the whole recursive verifier. This includes the cost of generating the randomness (which is lower when using horner_eval_* than when not), the cost of computing $P^{zg}(\alpha)$ and $Q^{z}(\alpha)$ (around 500 cycles when using horner_eval_* versus 0 cycles when not), and second-order effects (e.g., the ability to stop FRI folding earlier, or maybe avoiding the probabilistic NTT when the number of FRI queries is low, etc.).

Al-Kindi-0 changed the title from "Faster advice-provider-to-memory data un-hashing" to "Faster (than rcomb_*) instruction to compute random linear combinations" on Jan 10, 2025

@bobbinth (Contributor) commented:

My understanding of the process and costs is currently as follows:

First, we need to compute $Q^z(\alpha)$ and $P^{z \cdot g}(\alpha)$. This can be done with the aid of the horner_eval_ext instruction. If my estimation in #1610 (comment) is not too far off, we should be able to do this in under 500 cycles.

Then, for each query, we'd need to:

  1. Compute $P^x(X)$ and $Q^x(X)$. Assuming computing $P^x(X)$ comes nearly for free when computing $Q^x(X)$, and also that we can use the horner_eval_base instruction (which processes 8 terms per cycle), we may be able to do this in under 100 cycles (let's confirm this).
  2. Compute $\frac{1}{x - z}$ and $\frac{1}{x - z \cdot g}$. We could pre-compute $-z$ and $-z \cdot g$, and with these, I think we'd need under 50 cycles to compute these.
  3. Compute the final expression - maybe another 50 cycles.

So, all in all, I'm estimating about 200 cycles per query. Then, the overall cost becomes $500 + 200 \cdot n$, where $n$ is the number of queries. For $n=27$ we get about 6K cycles.

Could you double-check if my estimates are close to reality? I'm also curious how 6K cycles compares to our current costs.

@Al-Kindi-0 (Collaborator, Author) commented:

I have updated the previous comment to be more specific with regards to the cycle count. To summarize:

  1. With horner_eval_*: 330 cycles per FRI query,
  2. Current implementation: 450 cycles per FRI query.

@Al-Kindi-0 (Collaborator, Author) commented:

For completeness' sake, using inner_prod_ext can halve the cycle count of the probabilistic NTT procedure.
Here is an example of what the code would look like; note that it loads data from memory, which should be replaced by loading-and-hashing from the advice provider:

#! Input: [o1, o0, t1, t0, dst, src, ...]
#! Output: [t1, t0, dst, src, ...]
#!
#! Cycles: 44
proc.compute_and_store
    dup.3 dup.2 mul
    # => [o0 * t0, o1, o0, t1, t0, dst, src, ...]
    sub.1
    # => [o0 * t0 - 1, o1, o0, t1, t0, dst, src, ...]
    # => [tmp0, o1, o0, t1, t0, dst, src, ...]
    dup.3 movup.3 mul
    # => [o0 * t1, tmp0, o1, t1, t0, dst, src, ...]
    # => [tmp1, tmp0, o1, t1, t0, dst, src, ...]
    ext2inv
    # => [res1, res0, o1, t1, t0, dst, src, ...]

    dup.4 dup.3 mul
    # => [o1 * t0, res1, res0, o1, t1, t0, dst, src, ...]
    sub.1
    # => [tmp0, res1, res0, o1, t1, t0, dst, src, ...]
    dup.4 movup.4 mul
    # => [o1 * t1, tmp0, res1, res0, t1, t0, dst, src, ...]
    # => [tmp1, tmp0, res1, res0, t1, t0, dst, src, ...]
    ext2inv
    # => [res'1, res'0, res1, res0, t1, t0, dst, src, ...]

    dup.6 add.1 swap.7
    mem_storew
    dropw
    # => [t1, t0, dst, src, ...]
end

#! Given τ and the starting memory address of q (FRI codeword of size 64), this routine computes and stores the weights used to compute β.
#!
#! Input: [τ1, τ0, q_ptr, ...]
#! Output: [...]
#!
#! Cycles: 1412
proc.compute_and_store_64


    dup.2 add.32 movdn.2
    # => [τ1, τ0, dst_ptr, q_ptr, ...]

    push.1 push.8 # these should be inverses
    exec.compute_and_store

    push.64 push.512 # these should be inverses
    exec.compute_and_store

    # ...

    push.18158513693329981441 push.16140901060737761281 # these should be inverses
    exec.compute_and_store


    dropw
end


#! Computes the inner product β between the codeword at q_ptr and the weights derived from τ.
#!
#! Input: [q_ptr, ...]
#! Output: [β1, β0, q_ptr, ...]
#!
#! Cycles: 68
proc.compute_inner_product


    # 1) Compute the weights that go into the inner product with the codeword
    # Note that the challenge gets loaded from the advice provider and hence will need to be
    # checked later on to assert that it is indeed the result of hashing the codeword and coefficients
    # representing the remainder polynomial.
    dup
    adv_push.2
    # => [τ1, τ0, q_ptr, q_ptr, ...]
    
    exec.compute_and_store_64
    # => [q_ptr, ...]

    # 2) Compute the inner product i.e., beta in the docs
    # Compute the dst_ptr
    dup add.32 swap
    # => [q_ptr, dst_ptr, ...]

    # Push accumulator
    push.0 movdn.2 push.0 movdn.2
    # => [q_ptr, dst_ptr, acc1, acc0, ...]

    # Prepare the stack for using `mem_stream`
    padw padw padw
    # => [X, X, X, q_ptr, dst_ptr, acc1, acc0, ...]

    repeat.16
        mem_stream
        inner_product_ext
    end

    dropw dropw dropw drop drop
    # => [beta1, beta0, ...]

    # 3) Compute the evaluation of the remainder poly at tau

    exec.compute_alpha_64
    # => [alpha1, alpha0, beta1, beta0, ...]

    movup.2 assert_eq
    assert_eq
    # => [...]
end

Though the above is a nice improvement, which implies that we can stop one layer earlier during FRI folding, I suspect that with horner_eval_ext we might be able to do even better. This has to be double-checked, but I think that for a moderate number of FRI queries we might be able to do away completely with the probabilistic NTT check, as evaluating polynomials directly becomes very cheap with horner_eval_ext.
A good thing to find out would be the cost of FRI with the final probabilistic NTT check versus the cost of FRI without the probabilistic NTT and using horner_eval_ext instead, in both cases for a range of remainder polynomial degrees.
