Add FPGA Optimized Register File Version #433
Hi guys, I have done some optimizations for improved mapping of hardware to FPGA resources on Ibex as part of my master's thesis. I have now been instructed to port those optimizations to other cores (cv32e40p, cva and snitch). This is the first simple but effective optimization, exploiting the distributed RAM capabilities provided by many FPGAs. I have tested the new register file by executing the [...] I did not find very extensive tests exercising the register file (particularly for the FPU register file). Cheers, Noam
Hi @ganoam. I can comment on the tests. core-v-verif does not have any testing for the FPU at this time. We will be freezing the RTL for CV32E40P with the top-level parameter "FPU" set to 0. Subsequent releases of CV32E40P will include full verification of the FPU and the associated APU interface. Question: are you pulled into the OpenHW Hardware Task Group? They are actively developing a port of the CV32E40P to an FPGA right now.
Hi @MikeOpenHWGroup. Thanks for your answer. The ci_check runs through fine when using the pulp toolchain - when using my previously installed generic toolchain, [...] Edit: It does work with the generic riscv toolchain. I had screwed up my environment variables. To verify [...]
The implementation passed the tests. Due to the output of [...]
I suspect that was not all that should have been tested. Am I missing something? Thanks a lot for your help. To your question: No, I am not in the OpenHW Hardware Task Group. Should / can I become part of it? Although I am not sure how much work I will be able to put into it - I had planned not to spend too much time on those optimizations.
Yes, there is a lot of background there that you could not be aware of. The VCS Makefiles do not yet support all tests in our regression, most notably the Google riscv-dv tests. But if ci_check says you are OK, then the PR is considered safe to merge. We will subject the merged code to our full regression at least daily.
Absolutely.
Add a register file, optimized for synthesis on FPGAs supporting distributed RAM.

The register file features two RAM blocks, each with 1 sync-write and 3 async-read ports. To achieve the behavior of a 2 sync-write / 3 async-read register file, the read access is arbitrated depending on which block was last written to. For this purpose an additional array of *NUM_TOT_WORDS* 1-bit registers is introduced.

Savings for FPGA synthesis are achieved by:
- Replacing an array of FFs with distributed RAM. Example: 31 32-bit registers as FFs occupy 992 FFs, or 446 LUTs on Xilinx Artix-7 FPGAs. The equivalent storage capacity using distributed RAM is implemented by 36 RAM32M primitives (inferred from generic HDL), or 144 distributed RAM enabled LUTs, and 31 FFs for block selection (16 LUTs).
- The distributed RAM primitives have the read-address decoders already integrated. This saves three 32-bit 32-to-1 multiplexers at the read ports.
- Since both write ports unconditionally write to their respective RAM blocks, the multiplexing of the write ports is also saved. That is 32 32-bit 2-to-1 multiplexers.

Concrete savings (synthesized for Xilinx Artix-7 FPGA):
- without FPU reg file:
    baseline:    7347 LUTs, 2508 FFs
    optimized:   5722 LUTs, 1541 FFs
    -------------------------------
    difference: -1625 LUTs (-22.1%)
                 -967 FFs  (-38.6%)
- with FPU reg file:
    baseline:   13160 LUTs, 4027 FFs
    optimized:  10257 LUTs, 2062 FFs
    -------------------------------
    difference: -3353 LUTs (-24.6%)
                -1965 FFs  (-48.8%)

Signed-off-by: ganoam <[email protected]>
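The read arbitration described in the commit message can be sketched as a small behavioral model. This is not the RTL from the PR, only an illustration in Python. The class name `DualBlockRegFile`, the method names, the hardwired-zero behavior of address 0, and the rule that port B wins when both ports write the same address in one cycle are all assumptions made for this sketch.

```python
class DualBlockRegFile:
    """Behavioral sketch: two single-write-port RAM blocks plus one select bit per word."""

    def __init__(self, num_words: int = 32, width: int = 32) -> None:
        self.mask = (1 << width) - 1
        self.mem_a = [0] * num_words        # RAM block written only by port A
        self.mem_b = [0] * num_words        # RAM block written only by port B
        self.sel_b = [False] * num_words    # True if block B was written last

    def clock(self, we_a=False, waddr_a=0, wdata_a=0,
              we_b=False, waddr_b=0, wdata_b=0) -> None:
        """One clock edge: each port writes only its own block (no write mux)."""
        if we_a:
            self.mem_a[waddr_a] = wdata_a & self.mask
            self.sel_b[waddr_a] = False
        if we_b:
            # Assumption: port B is handled last, so it wins a same-address collision.
            self.mem_b[waddr_b] = wdata_b & self.mask
            self.sel_b[waddr_b] = True

    def read(self, raddr: int) -> int:
        """Asynchronous read: pick the block that was written most recently."""
        if raddr == 0:
            return 0  # Assumption: register 0 reads as zero, as in RISC-V.
        return self.mem_b[raddr] if self.sel_b[raddr] else self.mem_a[raddr]


rf = DualBlockRegFile()
rf.clock(we_a=True, waddr_a=5, wdata_a=0x11)
rf.clock(we_b=True, waddr_b=5, wdata_b=0x22)
print(hex(rf.read(5)))  # 0x22, because block B holds the newer value
```

A stale value stays in the block that was not written last. The select bit is what hides it from the read ports, which is why one extra 1-bit register per word is needed.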
I had a look at this MR and it looks good. FPGA mapping is significantly improved. I suggest merging it.
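For reference, the resource figures quoted in the commit message can be re-derived with a few lines of arithmetic. The helper name `savings` is made up for this sketch, and the 4-LUTs-per-RAM32M figure is taken from how the primitive maps on Xilinx 7-series devices. Three of the four difference rows reproduce exactly. For the with-FPU LUT row, the quoted baseline and optimized counts differ by 2903 (22.1%) rather than the stated 3353 (24.6%), so one of the figures in that row may contain a typo.

```python
def savings(baseline: int, optimized: int) -> tuple[int, float]:
    """Return (absolute difference, percentage change relative to baseline)."""
    diff = optimized - baseline
    return diff, round(100.0 * diff / baseline, 1)


# Storage example from the commit message.
print(31 * 32)   # 992 FFs for 31 32-bit registers
print(36 * 4)    # 144 LUTs for 36 RAM32M primitives (4 LUTs each)

# Without FPU register file.
print(savings(7347, 5722))    # LUTs -> (-1625, -22.1)
print(savings(2508, 1541))    # FFs  -> (-967, -38.6)

# With FPU register file.
print(savings(13160, 10257))  # LUTs -> (-2903, -22.1), commit message states -3353 (-24.6%)
print(savings(4027, 2062))    # FFs  -> (-1965, -48.8)
```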