WIP: DRAFT, July 2020 - Ben Marshall <[email protected]>
This document contains recommendations for macro-op fusion in RISC-V cores that will be particularly helpful for cryptography.
We use the following criteria to identify recommended macro-op fusions and associated code sequences:
-
We focus on pairs of instructions to fuse. Longer sequences are possible but harder to utilise.
-
We focus on sequences which read two register operands, and write a single register operand. This constraint means that even the most simple RISC-V implementation doesn’t need to add ports to its register file.
-
We focus on instructions which will appear in an RV32I, or RV64I architecture, with the scalar cryptography instructions also implemented.
-
We separately consider cases where the compressed
C
instructions are also implemented. -
For modelling performance gains, we assume all integer instructions have unit latency, and ignore the costs of loops.
Where the C
extension is implemented, the shortest (in bytes) possible
fusable sequence is two instructions of two bytes each.
The longest two instruction sequence we consider in all cases is
eight bytes.
The longest sequence a core can fuse is hence limited by the size of its
instruction decode buffer.
A simple implementation of RV32IC (or RV64IC) will have a 48-bit
instruction buffer - primarily to handle the case of jumping onto
a 16-bit aligned 32-bit instruction.
Hence, fusable sequences consisting of two 16-bit instructions or
one 16-bit and one 32-bit instruction are of particular interest.
The C
extension instructions are designed to expand into normal,
32-bit instructions. However, they can only address a subset of the
registers. We repeat table 16.1 from Volume 1 of the RISC-V instruction
set manual here:
RVC Register Number |
Integer Register Number |
ABI Register name |
0 |
x8 |
s0 |
1 |
x9 |
s1 |
2 |
x10 |
a0 |
3 |
x11 |
a1 |
4 |
x12 |
a2 |
5 |
x13 |
a3 |
6 |
x14 |
a4 |
7 |
x15 |
a5 |
Note that these tables assume that the scalar Cryptography instructions from Bitmanip have been implemented, namely:
-
andn rD, rA, rB // rD = rA & ~rB
-
orn rD, rA, rB // rD = rA | ~rB
-
xnor rD, rA, rB // rD = rA ^ ~rB
The Size column gives the size of the instruction sequence in bytes.
Instruction 1 | Instruction 2 | Size | Function |
---|---|---|---|
|
|
6 |
|
|
|
8 |
|
|
|
8 |
|
|
|
6 |
|
|
|
6 |
|
|
|
8 |
|
|
|
6 |
|
|
|
8 |
|
Instruction 1 | Instruction 2 | Size | Function |
---|---|---|---|
|
|
8 |
. |
|
|
8 |
. |
Instruction 1 | Instruction 2 | Size | Function |
---|---|---|---|
|
|
8 |
|
|
|
6 |
|
|
|
4 |
|
|
|
6 |
|
|
|
8 |
|
|
|
8 |
|
|
|
6 |
|
|
|
4 |
|
|
|
8 |
|
|
|
6 |
|
|
|
4 |
|
The ChaCha20 stream cipher[RFC8439] uses the following
quarter round (QR
) function:
#define QR(a,b,c,d) \ a += b; d ^= a; d = ROTL(d,16); \ c += d; b ^= c; b = ROTL(b,12); \ a += b; d ^= a; d = ROTL(d, 8); \ c += d; b ^= c; b = ROTL(b, 7); \
which, when displayed graphically, looks like:
Squares represent the additions modulo 32, circles represent bit-wise
XOR, and <<<_x
is a 32-bit left rotation by x
.
A complete ChaCha20 round is then expressed as:
QR(x[0], x[4], x[ 8], x[12]); QR(x[1], x[5], x[ 9], x[13]); QR(x[2], x[6], x[10], x[14]); QR(x[3], x[7], x[11], x[15]); QR(x[0], x[5], x[10], x[15]); QR(x[1], x[6], x[11], x[12]); QR(x[2], x[7], x[ 8], x[13]); QR(x[3], x[4], x[ 9], x[14]);
where x
is the 16-element array of 32
-bit words representing the
round state.
When implemented on RV32I with the scalar cryptography extensions,
the QR
function requires 12 instructions.
One round is hence 48
instructions.
There is one fusable sequence which meets our criteria:
xor rA, rA, rB rori rA, rA, imm
Note there is no "rotate left by immediate" instruction, so
we use the the "rotate right by immediate" with an adjusted immediate.
Note also that there is no 16-bit rotate instruction, but there
is a 16-bit xor instruction which fits the xor rA, rA, rB
pattern.
The ChaCha20 state consists of 16 32-bit words. All of which can be kept in the registers of an RV32I/RV64I implementation. However, only eight words may be kept in registers addressable by the compressed instructions, and so maximise the opportunities for fusion.
Two compressed instruction registers (s0
, s1
) must be saved to
the stack before being used. This will cost at minimum 6
instructions
per call to a ChaCha20 block function: two stack adjustments, two
stores and two loads.
The a0
and a1
registers may also need their contents moving prior to
entering the block round loop, since they are used to pass parameters
to functions. For the ChaCha20 block function, this will likely be
the input/output array pointers. This will cost two instructions to
move them from a0
/ a1
. If the arguments are marked as const
in
C code, then two more instructions are needed to put them back.
A core capable of fusing this sequence when both instructions are up to
32 bits long saves 4
cycles per quarter round.
A core capable of fusing this sequence only when the xor is 16 bits and
the rori is 32 bits is more complex to analyse.
Looking at the data flow graph, the fusable sequences takes their
inputs from either variables a
and d
, or b
and c
.
Looking at how a complete ChaCha20 round is structured, it then becomes
clear that either state elements x[0..4,12..15]
, or x[4..11]
are the ones best placed
in the compressed instruction registers s0,s1,a0,…,a5
, since this
creates the largest number of fusable 48-bit instruction sequences.
In either case, two occurrences of sequence 1 are fusable.
Hence a quarter round is then 10
cycles and a round 40
cycles.
A core which can only fuse two 16-bit instructions is incapable of fusing sequence 1.
Note: this section repeatedly refers to Markku’s implementation of the Keccak round function on RV64. Though based on RV64, the sequences here also apply to RV32.
The Keccak Round function uses 5
instances of the sequence
rori rD, rA, imm xor rD, rD, rB
A core capable of fusing this sequence when both instructions are
32
bits long will save 5
cycles per Keccak round, or 125
in
total.
The Keccak Round function uses 25
instances of
xor-then-rotate-by-immediate.
xor rA, rA, rB rori rA, rA, imm
A core capable of fusing this sequence when both instructions are
32
bits long with save 25
cycles per round, or 625
cycles in total.
The chi
step of the Keccak round function consists of the expression
x ^= (y & ~z)
. Hence the sequence:
andn rD, rA, rB xor rD, rD, rC
A core capable of fusing this 8-byte sequence wil save 20
cycles per
round, or 500
cycles in total.
The SHA2 algorithms use two ternary non-linear functions: Ch
and Maj
.
There is a discussion in the
bitlogic
document on how best to implement these functions given the base instruction
set and the Bitmanip extension.
Here, we only discuss the relevant macro-op fusion sequences for
the Cryptography extension.
Ch(x,y,z) = z ^ (x & (y ^ z)) Maj(x,y,z) = x ^ ((x ^ y) & (x ^ z))
For the Ch
function, we can fuse either the xor+and
or the and+xor
instruction pairs:
xor rT, rY, rZ and rT, rT, rX xor rT, rT, rZ
Also note that the result of Ch
cannot always overwrite its inputs, making
it difficult to use the compressed instruction extensions.
In the sequence above, the inputs are not overwritten, and the
and
and xor
instructions can use the compressed forms.
For the Maj
function,the same and/xor
fusion sequences can be used.
The following code assumes the input variables cannot be overwritten.
xor rA, rX, rZ xor rB, rX, rY // and rB, rB, rA // Fuse the xor/and, or the and/xor. xor rB, rB, rX //
These considerations are taken in the context of macro-op fusion generally, not just the recommendations listed in this document.
Where some algorithm must compute either A
or B
based on the value
of some secret C
, it is essential that A
and B
take the same
length of time to compute. Otherwise an adversary who can measure
execution time can learn something of C
.
One method of doing this is control flow balancing, where
either A
or B
is artificially padded with instructions such that
they take the same time to execute. If the programmer or compiler
does not know exactly when a core will fuse certain macro-ops, it
is possible that A
and B
will become un-balanced again, because
one path includes a fusable sequence and the other does not.
This may be addressed by computing both A
and B
regardless of
C
, and selecting the appropriate result in a constant time fashion.
-
[RFC8439] RFC 8439 - ChaCha20 and Poly1305 for IETF Protocols. https://tools.ietf.org/html/rfc8439