Skip to content

Latest commit

 

History

History
330 lines (251 loc) · 11.5 KB

fusion.adoc

File metadata and controls

330 lines (251 loc) · 11.5 KB

WIP: DRAFT, July 2020 - Ben Marshall <[email protected]>

Macro-op Fusion Recommendations

This document contains recommendations for macro-op fusion in RISC-V cores that will be particularly helpful for cryptography.

Scope and Background

We use the following criteria to identify recommended macro-op fusions and associated code sequences:

  • We focus on pairs of instructions to fuse. Longer sequences are possible but harder to utilise.

  • We focus on sequences which read two register operands, and write a single register operand. This constraint means that even the most simple RISC-V implementation doesn’t need to add ports to its register file.

  • We focus on instructions which will appear in an RV32I, or RV64I architecture, with the scalar cryptography instructions also implemented.

  • We separately consider cases where the compressed C instructions are also implemented.

  • For modelling performance gains, we assume all integer instructions have unit latency, and ignore the costs of loops.

Where the C extension is implemented, the shortest (in bytes) possible fusable sequence is two instructions of two bytes each. The longest two instruction sequence we consider in all cases is eight bytes. The longest sequence a core can fuse is hence limited by the size of its instruction decode buffer. A simple implementation of RV32IC (or RV64IC) will have a 48-bit instruction buffer - primarily to handle the case of jumping onto a 16-bit aligned 32-bit instruction. Hence, fusable sequences consisting of two 16-bit instructions or one 16-bit and one 32-bit instruction are of particular interest.

The C extension instructions are designed to expand into normal, 32-bit instructions. However, they can only address a subset of the registers. We repeat table 16.1 from Volume 1 of the RISC-V instruction set manual here:

Table 1. Registers addressable by compressed instructions.

RVC Register Number

Integer Register Number

ABI Register name

0

x8

s0

1

x9

s1

2

x10

a0

3

x11

a1

4

x12

a2

5

x13

a3

6

x14

a4

7

x15

a5

Note that these tables assume that the scalar Cryptography instructions from Bitmanip have been implemented, namely:

  • andn rD, rA, rB // rD = rA & ~rB

  • orn rD, rA, rB // rD = rA | ~rB

  • xnor rD, rA, rB // rD = rA ^ ~rB

The Size column gives the size of the instruction sequence in bytes.

Table 2. Recommended 2-instruction fusion sequences with 2 operands and 1 result.
Instruction 1 Instruction 2 Size Function

c.xor rA, rA, rB

rori rA, rA, imm

6

rA = (rA ^ rB) >>> imm

xor rA, rA, rB

rori rA, rA, imm

8

rA = (rA ^ rB) >>> imm

rori rD, rA, imm

xor rD, rD, rB

8

rD = (rA >>> imm) ^ rB

rori rD, rA, imm

c.xor rD, rD, rB

6

rD = (rA >>> imm) ^ rB

c.or rD, rD, rB

xori rD, rD, -1

6

rD = ~(rA | rB)

or rD, rA, rB

xori rD, rD, -1

8

rD = ~(rA | rB)

c.and rD, rD, rB

xori rD, rD, -1

6

rD = ~(rA & rB)

and rD, rA, rB

xori rD, rD, -1

8

rD = ~(rA & rB)

Table 3. Recommended 2-instruction fusion sequences with 2 operands and 2 results.
Instruction 1 Instruction 2 Size Function

clmulh rX, rA, rB

clmul rY, rA, rB

8

.

mulh* rX, rA, rB

mul rY, rA, rB

8

.

Table 4. Recommended 2-instruction fusion sequences with 3 operands and 1 result.
Instruction 1 Instruction 2 Size Function

xor rD, rA, rB

xor rD, rD, rC

8

rD = rA ^ rB ^ rC

xor rD, rA, rB

c.xor rD, rD, rC

6

rD = rA ^ rB ^ rC

c.xor rA, rA, rB

c.xor rA, rA, rC

4

rA ^= rB ^ rC

andn rD, rA, rB

c.xor rD, rD, rC

6

rD = rC ^ (rA & ~rB)

andn rD, rA, rB

xor rD, rD, rC

8

rD = rC ^ (rA & ~rB)

and rD, rA, rB

xor rD, rD, rC

8

rD = (rA & rB) ^ rC

c.and rA, rA, rB

xor rA, rA, rC

6

rA = (rA & rB) ^ rC

c.and rA, rA, rB

c.xor rA, rA, rC

4

rA = (rA & rB) ^ rC

xor rD, rA, rB

and rD, rD, rC

8

rD = (rA ^ rB) & rC

c.xor rA, rA, rB

and rA, rA, rC

6

rA = (rA ^ rB) & rC

c.xor rA, rA, rB

c.and rA, rA, rC

4

rA = (rA ^ rB) & rC

Case Study: ChaCha20

The ChaCha20 stream cipher[RFC8439] uses the following quarter round (QR) function:

#define QR(a,b,c,d)                 \
a += b;  d ^= a;  d = ROTL(d,16);   \
c += d;  b ^= c;  b = ROTL(b,12);   \
a += b;  d ^= a;  d = ROTL(d, 8);   \
c += d;  b ^= c;  b = ROTL(b, 7);   \

which, when displayed graphically, looks like:

ChaCha20 Round Function Diagram

Squares represent the additions modulo 32, circles represent bit-wise XOR, and <<<_x is a 32-bit left rotation by x.

A complete ChaCha20 round is then expressed as:

QR(x[0], x[4], x[ 8], x[12]);
QR(x[1], x[5], x[ 9], x[13]);
QR(x[2], x[6], x[10], x[14]);
QR(x[3], x[7], x[11], x[15]);
QR(x[0], x[5], x[10], x[15]);
QR(x[1], x[6], x[11], x[12]);
QR(x[2], x[7], x[ 8], x[13]);
QR(x[3], x[4], x[ 9], x[14]);

where x is the 16-element array of 32-bit words representing the round state.

When implemented on RV32I with the scalar cryptography extensions, the QR function requires 12 instructions. One round is hence 48 instructions.

There is one fusable sequence which meets our criteria:

xor  rA, rA, rB
rori rA, rA, imm

Note there is no "rotate left by immediate" instruction, so we use the the "rotate right by immediate" with an adjusted immediate. Note also that there is no 16-bit rotate instruction, but there is a 16-bit xor instruction which fits the xor rA, rA, rB pattern.

The ChaCha20 state consists of 16 32-bit words. All of which can be kept in the registers of an RV32I/RV64I implementation. However, only eight words may be kept in registers addressable by the compressed instructions, and so maximise the opportunities for fusion.

Two compressed instruction registers (s0, s1) must be saved to the stack before being used. This will cost at minimum 6 instructions per call to a ChaCha20 block function: two stack adjustments, two stores and two loads.

The a0 and a1 registers may also need their contents moving prior to entering the block round loop, since they are used to pass parameters to functions. For the ChaCha20 block function, this will likely be the input/output array pointers. This will cost two instructions to move them from a0 / a1. If the arguments are marked as const in C code, then two more instructions are needed to put them back.

A core capable of fusing this sequence when both instructions are up to 32 bits long saves 4 cycles per quarter round.

A core capable of fusing this sequence only when the xor is 16 bits and the rori is 32 bits is more complex to analyse. Looking at the data flow graph, the fusable sequences takes their inputs from either variables a and d, or b and c. Looking at how a complete ChaCha20 round is structured, it then becomes clear that either state elements x[0..4,12..15], or x[4..11] are the ones best placed in the compressed instruction registers s0,s1,a0,…​,a5, since this creates the largest number of fusable 48-bit instruction sequences. In either case, two occurrences of sequence 1 are fusable. Hence a quarter round is then 10 cycles and a round 40 cycles.

A core which can only fuse two 16-bit instructions is incapable of fusing sequence 1.

Case Study: Keccak

Note: this section repeatedly refers to Markku’s implementation of the Keccak round function on RV64. Though based on RV64, the sequences here also apply to RV32.

rotate + xor

The Keccak Round function uses 5 instances of the sequence

rori    rD, rA, imm
xor     rD, rD, rB

A core capable of fusing this sequence when both instructions are 32 bits long will save 5 cycles per Keccak round, or 125 in total.

xor + rotate

The Keccak Round function uses 25 instances of xor-then-rotate-by-immediate.

xor     rA, rA, rB
rori    rA, rA, imm

A core capable of fusing this sequence when both instructions are 32 bits long with save 25 cycles per round, or 625 cycles in total.

andn + xor

The chi step of the Keccak round function consists of the expression x ^= (y & ~z). Hence the sequence:

andn    rD, rA, rB
xor     rD, rD, rC

A core capable of fusing this 8-byte sequence wil save 20 cycles per round, or 500 cycles in total.

xor + xor

The theta step of Keccak uses five xor reductions of five variables each: tmp = ABCDE. Being able to fuse the sequence

xor     rD, rA, rB
xor     rD, rD, rC

Saves two cycles per reduction, or 10 per round, which is 250 in total.

Case Study: SHA2

The SHA2 algorithms use two ternary non-linear functions: Ch and Maj. There is a discussion in the bitlogic document on how best to implement these functions given the base instruction set and the Bitmanip extension. Here, we only discuss the relevant macro-op fusion sequences for the Cryptography extension.

Ch(x,y,z)  = z ^ (x & (y ^ z))
Maj(x,y,z) = x ^ ((x ^ y) & (x ^ z))

For the Ch function, we can fuse either the xor+and or the and+xor instruction pairs:

xor  rT, rY, rZ
and  rT, rT, rX
xor  rT, rT, rZ

Also note that the result of Ch cannot always overwrite its inputs, making it difficult to use the compressed instruction extensions. In the sequence above, the inputs are not overwritten, and the and and xor instructions can use the compressed forms.

For the Maj function,the same and/xor fusion sequences can be used. The following code assumes the input variables cannot be overwritten.

xor rA, rX, rZ
xor rB, rX, rY  //
and rB, rB, rA  // Fuse the xor/and, or the and/xor.
xor rB, rB, rX  //

Security Considerations

These considerations are taken in the context of macro-op fusion generally, not just the recommendations listed in this document.

Where some algorithm must compute either A or B based on the value of some secret C, it is essential that A and B take the same length of time to compute. Otherwise an adversary who can measure execution time can learn something of C. One method of doing this is control flow balancing, where either A or B is artificially padded with instructions such that they take the same time to execute. If the programmer or compiler does not know exactly when a core will fuse certain macro-ops, it is possible that A and B will become un-balanced again, because one path includes a fusable sequence and the other does not. This may be addressed by computing both A and B regardless of C, and selecting the appropriate result in a constant time fashion.

References