forked from cliffordwolf/xbitmanip
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathbext.tex
630 lines (493 loc) · 26.7 KB
/
bext.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
\chapter{RISC-V XBitmanip Extension}
\label{bext}
In the proposals provided in this section, the C code examples are for
illustration purposes. They are not optimal implementations, but are intended
to specify the desired functionality. See Chapter~\ref{fastc} for fast C code
for use in emulators.
The sections on encodings are mere placeholders.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Count Leading/Trailing Zeros (\texttt{clz, ctz})}
The {\tt clz} operation counts the number of 0 bits at the MSB end of the
argument. That is, the number of 0 bits before the first 1 bit counting from
the most significant bit. If the input is 0, the output is XLEN. If the input
is -1, the output is 0.
The {\tt ctz} operation counts the number of 0 bits at the LSB end of the
argument. If the input is 0, the output is XLEN. If the input is -1, the
output is 0.
\input{bextcref-clz-ctz}
\input{bextclz.tex}
One possible encoding for \texttt{clz} and \texttt{ctz} is as standard I-type opcodes
somewhere in the brownfield surrounding the shift-immediate instructions.
% \subsection{References}
%
% https://en.wikipedia.org/wiki/Find\_first\_set\#CLZ
%
% https://fgiesen.wordpress.com/2013/10/18/bit-scanning-equivalencies/
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Count Bits Set (\texttt{pcnt})}
This instruction counts the number of 1 bits in a register. This operations is known as
population count, popcount, sideways sum, bit summation, or Hamming weight.~\cite{HammingWeight,Warren12}
\input{bextcref-pcnt}
\input{bextpcnt.tex}
One possible encoding for \texttt{pcnt} is as a standard I-type opcode somewhere
in the brownfield surrounding the shift-immediate instructions.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{And-with-complement (\texttt{andc})}
This instruction implements the and-with-complement operation.
\input{bextcref-andc}
Other with-complement operations ({\tt orc, nand, nor}, etc) can be implemented
by combining {\tt not} ({\tt c.not}) with the base ALU operation. (Which can
fit in 32 bit when using two compressed instructions.) Only and-with-complement
occurs frequently enough to warrant a dedicated instruction.
\input{bextandc.tex}
% \subsection{Justification}
%
% http://svn.clifford.at/handicraft/2017/bitcode/
%
% \subsection{References}
%
% http://www.hackersdelight.org/basics2.pdf
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Shift Ones (Left/Right) (\texttt{slo,\ sloi,\ sro,\ sroi})}
These instructions are similar to shift-logical operations from the base
spec, except instead of shifting in zeros, they shifts in ones.
\input{bextcref-sxo}
\input{bextsxo.tex}
\texttt{s(l/r)o(i)} is encoded similarly to the logical shifts in the
base spec. However, the spec of the entire family of instructions is
changed so that the high bit of the instruction indicates the value to
be inserted during a shift. This means that a \texttt{sloi} instruction
can be encoded similarly to an \texttt{slli} instruction, but with a 1
in the highest bit of the encoded instruction. This encoding is
backwards compatible with the definition for the shifts in the base
spec, but allows for simple addition of a ones-insert.
When implementing this circuit, the only change in the ALU over a
standard logical shift is that the value shifted in is not zero, but is
a 1-bit register value that has been forwarded from the high bit of the
instruction decode. This creates the desired behavior on both logical
zero-shifts and logical ones-shifts.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Rotate (Left/Right) (\texttt{rol,\ ror,\ rori})}
These instructions are similar to shift-logical operations from the base
spec, except they shift in the values from the opposite side of the
register, in order. This is also called `circular shift'.
\input{bextcref-rox}
\input{bextrox.tex}
Rotate shift is implemented very similarly to the other shift
instructions. One possible way to encode it is to re-use the way that
bit 30 in the instruction encoding selects `arithmetic shift' when bit
31 is zero (signalling a logical-zero shift). We can re-use this so that
when bit 31 is set (signalling a logical-ones shift), if bit 30 is also
set, then we are doing a rotate. The following table summarizes the
behavior. The generalized reverse instructions can be encoded using the
bit pattern that would otherwise encode an ``Arithmetic Left Shift''
(which is an operation that does not exist). Likewise, the generalized zip
instruction can be encoded using the bit pattern that would otherwise
encode an ``Rotate left immediate''.
\begin{center}
\begin{tabular}{lll}
Bit 31 & Bit 30 & Meaning \\
\hline
0 & 0 & Logical Shift-Zeros \\
0 & 1 & Arithmetic Shift \\
1 & 0 & Logical Shift-Ones \\
1 & 1 & Rotate \\
\end{tabular}
\end{center}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[t]
\begin{center}
\input{bextcref-printperm-ror}
\end{center}
\caption{\texttt{ror} permutation network}
\label{permnet-ror}
\end{figure}
\section{Generalized Reverse (\texttt{grev,\ grevi})}
\label{grev}
This instruction provides a single hardware instruction that can implement all
of byte-order swap, bitwise reversal, short-order-swap, word-order-swap
(RV64), nibble-order swap, bitwise reversal in a byte, etc, all from a single
hardware instruction. It takes in a single register value and an immediate that
controls which function occurs, through controlling the levels in the recursive
tree at which reversals occur.
This operation iteratively checks each bit $i$ in rs2 from $i=0$ to
$\textrm{XLEN}-1$, and if the corresponding bit is set, swaps each adjacent
pair of $2^i$ bits.
\begin{figure}[t]
\begin{center}
\input{bextcref-printperm-grev}
\end{center}
\caption{\texttt{grev} permutation network}
\label{permnet-grev}
\end{figure}
\input{bextcref-grev}
The above pattern should be intuitive to understand in order to extend
this definition in an obvious manner for RV128.
\begin{table}[h]
\begin{small}
\begin{center}
\begin{tabular}{r l p{0.5in} r l p{0.3in} r l}
\multicolumn{2}{c}{RV32} & &
\multicolumn{5}{c}{RV64} \\
\cline{1-2}
\cline{4-8}
\multicolumn{1}{c}{shamt} & Instruction & &
\multicolumn{1}{c}{shamt} & Instruction & &
\multicolumn{1}{c}{shamt} & Instruction \\
\cline{1-2}
\cline{4-5}
\cline{7-8}
0: 00000 & --- & & 0: 000000 & --- & & 32: 100000 & {\tt wswap} \\
1: 00001 & {\tt brev.p} & & 1: 000001 & {\tt brev.p} & & 33: 100001 & --- \\
2: 00010 & {\tt pswap.n} & & 2: 000010 & {\tt pswap.n} & & 34: 100010 & --- \\
3: 00011 & {\tt brev.n} & & 3: 000011 & {\tt brev.n} & & 35: 100011 & --- \\
4: 00100 & {\tt nswap.b} & & 4: 000100 & {\tt nswap.b} & & 36: 100100 & --- \\
5: 00101 & --- & & 5: 000101 & --- & & 37: 100101 & --- \\
6: 00110 & {\tt pswap.b} & & 6: 000110 & {\tt pswap.b} & & 38: 100110 & --- \\
7: 00111 & {\tt brev.b} & & 7: 000111 & {\tt brev.b} & & 39: 100111 & --- \\
\cline{1-2}
\cline{4-5}
\cline{7-8}
8: 01000 & {\tt bswap.h} & & 8: 001000 & {\tt bswap.h} & & 40: 101000 & --- \\
9: 01001 & --- & & 9: 001001 & --- & & 41: 101001 & --- \\
10: 01010 & --- & & 10: 001010 & --- & & 42: 101010 & --- \\
11: 01011 & --- & & 11: 001011 & --- & & 43: 101011 & --- \\
12: 01100 & {\tt nswap.h} & & 12: 001100 & {\tt nswap.h} & & 44: 101100 & --- \\
13: 01101 & --- & & 13: 001101 & --- & & 45: 101101 & --- \\
14: 01110 & {\tt pswap.h} & & 14: 001110 & {\tt pswap.h} & & 46: 101110 & --- \\
15: 01111 & {\tt brev.h} & & 15: 001111 & {\tt brev.h} & & 47: 101111 & --- \\
\cline{1-2}
\cline{4-5}
\cline{7-8}
16: 10000 & {\tt hswap} & & 16: 010000 & {\tt hswap.w} & & 48: 110000 & {\tt hswap} \\
17: 10001 & --- & & 17: 010001 & --- & & 49: 110001 & --- \\
18: 10010 & --- & & 18: 010010 & --- & & 50: 110010 & --- \\
19: 10011 & --- & & 19: 010011 & --- & & 51: 110011 & --- \\
20: 10100 & --- & & 20: 010100 & --- & & 52: 110100 & --- \\
21: 10101 & --- & & 21: 010101 & --- & & 53: 110101 & --- \\
22: 10110 & --- & & 22: 010110 & --- & & 54: 110110 & --- \\
23: 10111 & --- & & 23: 010111 & --- & & 55: 110111 & --- \\
\cline{1-2}
\cline{4-5}
\cline{7-8}
24: 11000 & {\tt bswap} & & 24: 011000 & {\tt bswap.w} & & 56: 111000 & {\tt bswap} \\
25: 11001 & --- & & 25: 011001 & --- & & 57: 111001 & --- \\
26: 11010 & --- & & 26: 011010 & --- & & 58: 111010 & --- \\
27: 11011 & --- & & 27: 011011 & --- & & 59: 111011 & --- \\
28: 11100 & {\tt nswap} & & 28: 011100 & {\tt nswap.w} & & 60: 111100 & {\tt nswap} \\
29: 11101 & --- & & 29: 011101 & --- & & 61: 111101 & --- \\
30: 11110 & {\tt pswap} & & 30: 011110 & {\tt pswap.w} & & 62: 111110 & {\tt pswap} \\
31: 11111 & {\tt brev} & & 31: 011111 & {\tt brev.w} & & 63: 111111 & {\tt brev} \\
\end{tabular}
\end{center}
\end{small}
\caption{Pseudo-instructions for {\tt grevi} instruction}
\label{grevi-modes}
\end{table}
The {\tt grev} operation can easily be implemented using a permutation
network with $log_2(\textrm{XLEN})$ stages. Figure~\ref{permnet-ror}
shows the permutation network for {\tt ror} for reference.
Figure~\ref{permnet-grev} shows the permutation network for {\tt grev}.
\input{bextgrev.tex}
\texttt{grev} is encoded as standard R-type opcode and \texttt{grevi} is
encoded as standard I-type opcode. \texttt{grev} and \texttt{grevi} can
use the instruction encoding for ``arithmetic shift left''.
% \subsection{References}
%
% Hackers Delight, Chapter 7.1, ``Generalized Bit Reversal'' in
%
% https://books.google.com/books?id=iBNKMspIlqEC\&lpg=PP1\&pg=RA1-SL20-PA2\#v=onepage\&q\&f=false
%
% http://hackersdelight.org/
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Generalized shuffle (\texttt{shfl}, \texttt{unshfl})}
\label{gzip}
Shuffle is the third bit permutation instruction in XBitmanip, after rotary
shift and generalized reverse. It implements a generalization of the operation
commonly known as perfect outer shuffle and its inverse (shuffle/unshuffle),
also known as zip/unzip or interlace/uninterlace.
Bit permutations can be understood as reversible functions on bit indeces (i.e.
5 bit functions on RV32 and 6 bit functions on RV64).
\begin{center}
\begin{tabular}{l l}
Operation & Corresponding function on bit indices \\
\hline
Rotate shift & Addition modulo {\rm XLEN} \\
Generalized reverse & XOR with bitmask \\
Generalized shuffle & Bitpermutation \\
\end{tabular}
\end{center}
A generalized (un)shuffle operation has $log_2(\textrm{XLEN})-1$ control bits,
one for each pair of neighbouring bits in a bit index. When the bit is set,
generalized shuffle will swap the two index bits. The {\tt shfl} operation
performs this swaps in MSB-to-LSB order (performing a rotate left shift on
continous regions of set control bits), and the {\tt unshfl} operation performs
the swaps in LSB-to-MSB order (performing a rotate right shift on continous
regions of set control bits). Combining up to $log_2(\textrm{XLEN})$ of those
{\tt shfl}/{\tt unshfl} operations can implement any bitpermutation on the
bit indices.
The most common type of shuffle/unshuffle operation is one on an immediate
control value that only contains one continous region of set bits. We call
those operations zip/unzip and provide pseudo-instructions for them.
Shuffle/unshuffle operations that only have individual bits set (not a continous
region of two or more bits) are their own inverse. This means that a few of the
{\tt unshfli} opcodes are redundant with {\tt shfli} opcodes and therefore can
be used to encode unary functions such as {\tt ctz}, {\tt clz}, and {\tt pcnt}.
\begin{table}[h]
\begin{small}
\begin{center}
\begin{tabular}{r l l}
\multicolumn{1}{c}{mode} &
Bit index rotations &
Pseudo-Instruction \\
\hline
0: 0000 0 & no-op & --- \\
1: 0000 1 & no-op & {\it reserved} \\
2: 0001 0 & {\tt i[1] -> i[0]} & {\tt zip.n, unzip.n} \\
\sout{ 3: 0001 1} & {\it equivalent to 0001 0} & {\it reserved} \\
4: 0010 0 & {\tt i[2] -> i[1]} & {\tt zip2.b, unzip2.b} \\
\sout{ 5: 0010 1} & {\it equivalent to 0010 0} & {\it reserved} \\
6: 0011 0 & {\tt i[2] -> i[0]} & {\tt zip.b} \\
7: 0011 1 & {\tt i[2] <- i[0]} & {\tt unzip.b} \\
\hline
8: 0100 0 & {\tt i[3] -> i[2]} & {\tt zip4.h, unzip4.h} \\
\sout{ 9: 0100 1} & {\it equivalent to 0100 0} & {\it reserved} \\
10: 0101 0 & {\tt i[3] -> i[2], i[1] -> i[0]} & --- \\
\sout{11: 0101 1} & {\it equivalent to 0101 0} & {\it reserved} \\
12: 0110 0 & {\tt i[3] -> i[1]} & {\tt zip2.h} \\
13: 0110 1 & {\tt i[3] <- i[1]} & {\tt unzip2.h} \\
14: 0111 0 & {\tt i[3] -> i[0]} & {\tt zip.h} \\
15: 0111 1 & {\tt i[3] <- i[0]} & {\tt unzip.h} \\
\hline
16: 1000 0 & {\tt i[4] -> i[3]} & {\tt zip8, unzip8} \\
\sout{17: 1000 1} & {\it equivalent to 1000 0} & {\it reserved} \\
18: 1001 0 & {\tt i[4] -> i[3], i[1] -> i[0]} & --- \\
\sout{19: 1001 1} & {\it equivalent to 1001 0} & {\it reserved} \\
20: 1010 0 & {\tt i[4] -> i[3], i[2] -> i[1]} & --- \\
\sout{21: 1010 1} & {\it equivalent to 1010 0} & {\it reserved} \\
22: 1011 0 & {\tt i[4] -> i[3], i[2] -> i[0]} & --- \\
23: 1011 1 & {\tt i[4] <- i[3], i[2] <- i[0]} & --- \\
\hline
24: 1100 0 & {\tt i[4] -> i[2]} & {\tt zip4} \\
25: 1100 1 & {\tt i[4] <- i[2]} & {\tt unzip4} \\
26: 1101 0 & {\tt i[4] -> i[2], i[1] -> i[0]} & --- \\
27: 1101 1 & {\tt i[4] <- i[2], i[1] <- i[0]} & --- \\
28: 1110 0 & {\tt i[4] -> i[1]} & {\tt zip2} \\
29: 1110 1 & {\tt i[4] <- i[1]} & {\tt unzip2} \\
30: 1111 0 & {\tt i[4] -> i[0]} & {\tt zip} \\
31: 1111 1 & {\tt i[4] <- i[0]} & {\tt unzip} \\
\end{tabular}
\end{center}
\end{small}
\caption{RV32 modes and pseudo-instructions for {\tt gzip} instruction}
\label{gzip32-modes}
\end{table}
\begin{table}[h]
\begin{small}
\begin{center}
\begin{tabular}{c l p{1in} c l}
mode & Pseudo-Instruction & & mode & Pseudo-Instruction \\
\cline{1-2}
\cline{4-5}
0: 00000 0 & --- & & 32: 10000 0 & {\tt zip16, unzip16} \\
\sout{ 1: 00000 1} & {\it reserved} & & \sout{33: 10000 1} & {\it reserved} \\
2: 00001 0 & {\tt zip.n, unzip.n} & & 34: 10001 0 & --- \\
\sout{ 3: 00001 1} & {\it reserved} & & \sout{35: 10001 1} & {\it reserved} \\
4: 00010 0 & {\tt zip2.b, unzip2.b} & & 36: 10010 0 & --- \\
\sout{ 5: 00010 1} & {\it reserved} & & \sout{37: 10010 1} & {\it reserved} \\
6: 00011 0 & {\tt zip.b} & & 38: 10011 0 & --- \\
7: 00011 1 & {\tt unzip.b} & & 39: 10011 1 & --- \\
\cline{1-2}
\cline{4-5}
8: 00100 0 & {\tt zip4.h, unzip4.h} & & 40: 10100 0 & --- \\
\sout{ 9: 00100 1} & {\it reserved} & & \sout{41: 10100 1} & {\it reserved} \\
10: 00101 0 & --- & & 42: 10101 0 & --- \\
\sout{11: 00101 1} & {\it reserved} & & \sout{43: 10101 1} & {\it reserved} \\
12: 00110 0 & {\tt zip2.h} & & 44: 10110 0 & --- \\
13: 00110 1 & {\tt unzip2.h} & & 45: 10110 1 & --- \\
14: 00111 0 & {\tt zip.h} & & 46: 10111 0 & --- \\
15: 00111 1 & {\tt unzip.h} & & 47: 10111 1 & --- \\
\cline{1-2}
\cline{4-5}
16: 01000 0 & {\tt zip8.w, unzip8.w} & & 48: 11000 0 & {\tt zip8} \\
\sout{17: 01000 1} & {\it reserved} & & 49: 11000 1 & {\tt unzip8} \\
18: 01001 0 & --- & & 50: 11001 0 & --- \\
\sout{19: 01001 1} & {\it reserved} & & 51: 11001 1 & --- \\
20: 01010 0 & --- & & 52: 11010 0 & --- \\
\sout{21: 01010 1} & {\it reserved} & & 53: 11010 1 & --- \\
22: 01011 0 & --- & & 54: 11011 0 & --- \\
23: 01011 1 & --- & & 55: 11011 1 & --- \\
\cline{1-2}
\cline{4-5}
24: 01100 0 & {\tt zip4.w} & & 56: 11100 0 & {\tt zip4} \\
25: 01100 1 & {\tt unzip4.w} & & 57: 11100 1 & {\tt unzip4} \\
26: 01101 0 & --- & & 58: 11101 0 & --- \\
27: 01101 1 & --- & & 59: 11101 1 & --- \\
28: 01110 0 & {\tt zip2.w} & & 60: 11110 0 & {\tt zip2} \\
29: 01110 1 & {\tt unzip2.w} & & 61: 11110 1 & {\tt unzip2} \\
30: 01111 0 & {\tt zip.w} & & 62: 11111 0 & {\tt zip} \\
31: 01111 1 & {\tt unzip.w} & & 63: 11111 1 & {\tt unzip} \\
\end{tabular}
\end{center}
\end{small}
\caption{RV64 modes and pseudo-instructions for {\tt gzip} instruction}
\label{gzip64-modes}
\end{table}
\begin{figure}[t]
\begin{center}
\input{bextcref-printperm-gzip-noflip}
\end{center}
\caption{\texttt{gzip} permutation network without ``flip'' stages}
\label{permnet-gzip-noflip}
\end{figure}
Like GREV and rotate shift, the (un)shuffle instruction can be implemented using a short
sequence of elementry permutations, that are enabled or disabled by the mode (shamt)
bits. But (un)shuffle has one stage fewer than GREV and the additional bit of the mode
controls the order in which the stages are applied:
\input{bextcref-gzip32}
Or for RV64:
\input{bextcref-gzip64}
The above pattern should be intuitive to understand in order to extend
this definition in an obvious manner for RV128.
Alternatively (un)shuffle) can be implemented in a single network with one more
stage than GREV, with the additional first and last stage executing a
permutation that effectively reverses the order of the inner stages. However,
since the inner stages only mux half of the bits in the word each, a hardware
implementation using this additional ``flip'' stages might actually be more
expensive than simply creating two networks.
\input{bextcref-gzip32-alt}
Figure~\ref{permnet-gzip-flip} shows the (un)shuffle permutation network with
``flip'' stages and Figure~\ref{permnet-gzip-noflip} shows the (un)shuffle
permutation network without ``flip'' stages.
\begin{figure}[t]
\begin{center}
\input{bextcref-printperm-gzip-flip}
\end{center}
\caption{(un)shuffle permutation network with ``flip'' stages}
\label{permnet-gzip-flip}
\end{figure}
The \texttt{zip} instruction with the upper half of its input cleared performs
the commonly needed ``fan-out'' operation. (Equivalent to {\tt bdep} with a
0x55555555 mask.) The \texttt{zip} instruction applied twice fans out the bits
in the lower quarter of the input word by a spacing of 4 bits.
For example, the following code calculates the bitwise prefix sum of the bits
in the lower byte of a 32 bit word on RV32:
\begin{verbatim}
andi a0, a0, 0xff
zip a0, a0
zip a0, a0
slli a1, a0, 4
c.add a0, a1
slli a1, a0, 8
c.add a0, a1
slli a1, a0, 16
c.add a0, a1
\end{verbatim}
The final prefix sum is stored in the 8 nibbles of the {\tt a0} output word.
Other {\tt zip} modes can be used to ``fan-out'' in blocks of 2, 4, 8, or 16 bit.
{\tt zip} can be combined with {\tt grevi} to perform inner shuffles. For example
on RV64:
\begin{verbatim}
li a0, 0x0000000012345678
zip4 t0, a0 ; <- 0x0102030405060708
nswap.b t1, t0 ; <- 0x1020304050607080
zip8 t2, a0 ; <- 0x0012003400560078
bswap.h t3, t2 ; <- 0x1200340056007800
zip16 t4, a0 ; <- 0x0000123400005678
hswap.w t5, t4 ; <- 0x1234000056780000
\end{verbatim}
\input{bextgzip}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Byte-swap with sign extend (\texttt{bswaps.h, bswaps.w})}
This instructions perform a byte-swap (endianness conversion) on the lower 16 bits
(\texttt{bswaps.h}), 32 bits (\texttt{bswaps.w}, RV64), or 64 bits (\texttt{bswaps.d}, RV128)
of the argument and then sign extend the result. (Byte-swap without sign extend can be
performed using the \texttt{grevi} \texttt{bswap} pseudo-instructions.
\input{bextcref-bswaps}
The hardware for this operation can for the most part share resources from
\texttt{grevi} and the instructions can be encoded in the reserved space in \texttt{gzip}.
\input{bextbswaps}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Bit Extract/Deposit (\texttt{bext,\ bdep})}
This instructions implement the generic bit extract and bit deposit functions.
This operation is also referred to as bit gather/scatter, bit pack/unpack,
parallel extract/deposit, compress/expand, or right\_compress/right\_expand.
\texttt{bext} collects LSB justified bits to rd from rs1 using extract mask in rs2.
\texttt{bdep} writes LSB justified bits from rs1 to rd using deposit mask in rs2.
\input{bextcref-bext}
Implementations may choose to use smaller multi-cycle implementations of
\texttt{bext} and \texttt{bdep}, or even emulate the instructions in software.
Even though multi-cycle \texttt{bext} and \texttt{bdep} often are not fast
enough to outperform algortihms that use sequences of shifts and bit masks,
dedicated instructions for those operations can still be of great advantage in
cases where the mask argument is not constant.
For example, the following code efficiently calculates the index of the tenth
set bit in {\tt a0} using \texttt{bdep}:
\begin{verbatim}
li a1, 0x00000200
bdep a0, a1, a0
ctz a0, a0
\end{verbatim}
For cases with a constant mask an optimizing compiler would decide when to use
\texttt{bext} or \texttt{bdep} based on the optimization profile for the
concrete processor it is optimizing for. This is similar to the decision
whether to use MUL or DIV with a constant, or to perform the same operation
using a longer sequence of much simpler operations.
\input{bextscagat.tex}
% \subsection{Justification}
%
% http://svn.clifford.at/handicraft/2017/permsyn/
%
% \subsection{References}
%
% http://programming.sirrida.de/bit\_perm.html\#gather\_scatter
%
% Hackers Delight, Chapter 7.1, ``Compress, Generalized Extract'' in
%
% https://books.google.com/books?id=iBNKMspIlqEC\&lpg=PP1\&pg=RA1-SL20-PA2\#v=onepage\&q\&f=false
%
% http://hackersdelight.org/
%
% https://github.com/cliffordwolf/bextdep
%
% http://palms.ee.princeton.edu/system/files/Hilewitz_JSPS_08.pdf
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Compressed instructions (\texttt{c.not,\ c.neg,\ c.brev})}
The RISC-V ISA has no dedicated instructions for bitwise inverse (\texttt{not})
and arithmetic inverse (\texttt{neg}). Instead \texttt{not} is implemented as
\texttt{xori\ rd,\ rs,\ -1} and \texttt{neg} is implemented as
\texttt{sub\ rd,\ x0,\ rs}.
In bitmanipulation code \texttt{not} and \texttt{neg} are very common operations. But
there are no compressed encodings for those operations because there is no \texttt{c.xori}
instruction and \texttt{c.sub} can not operate on \texttt{x0}.
Many bit manipulation operations that have dedicated opcodes in other ISAs
must be constructed from smaller atoms in RISC-V XBitmanip code. But
implementations might choose to implement them in a single micro-op using
macro-op-fusion. For this it can be helpful when the fused sequences are short.
\texttt{not} and \texttt{neg} are good candidates for macro-op-fusion, so
it can be helpful to have compressed opcodes for them.
Likewise \texttt{brev} (an alias for \texttt{grevi\ rd,\ rs,\ -1}, i.e. bitwise
reversal) is also a very common atom for building bit manipulation operations. So it
is helpful to have a compressed opcode for this instruction as well.
The compressed instructions \texttt{c.not,\ c.neg,\ c.brev} must be supported by
all implementations that support the C extension and XBitmanip.
\input{bextcompr}
These three instructions fit nicely in the reserved space in C.LUI/C.ADDI16SP.
They only occupy $0.1\%$ of the $\approx15.6$ bits wide RVC encoding space.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Bit-field extract pseudo instruction ({\tt bfext})}
\label{bfext}
Extract the continous bit field starting at {\tt pos} with length {\tt len}
from {\tt rs} (with $\texttt{pos}>0$, $\texttt{len}>0$, and
$\texttt{pos}+\texttt{len}\le\textrm{XLEN}$).
\begin{verbatim}
bfext rd, rs, pos, len -> slli rd, rs, (XLEN-pos-len)
srli rd, rd, (XLEN-len)
\end{verbatim}
If possible, an implementation should fuse this two shift operations into a single
macro-op. (Some implementors have raised concerns about the lack of a dedicated
bit field extract instruction with large immediate, especially for implementations
that can not fuse instructions into macro-ops and/or implementations that do
not support compressed instructions. See Chapter~\ref{bfxp} for a brief discussion
of why no such instruction is part of XBitmanip, and a short separate extension
proposal that adds such an instruction.)