This repository contains tools to benchmark the finite-element(ish) tool FeenoX with Google’s Benchmark micro-benchmarking library.
You'll need the Google Benchmark library (and headers). Luckily it is on Debian's (and probably Ubuntu's) repositories:
sudo apt-get install libbenchmark-dev
You'll also need to have everything needed in order to compile FeenoX sources. See https://www.seamplex.com/feenox/doc/compilation.html.
The benchmarks use actual FeenoX code, so you need to have FeenoX configured and compiled in a subdirectory called feenox.
There is a script bootstrap.sh that will perform all the steps with the default options, which should be enough to start.
In any case, these are the steps:
- Clone FeenoX from Github
- Bootstrap FeenoX
- Configure FeenoX
- Compile FeenoX
- Create a benchmark.make file out of FeenoX’ makefiles
Run ./bootstrap.sh to get everything set up (with the default options).
You can repeat the steps above and tweak the setup a little:
- Clone FeenoX from Github, either through https or ssh:
  git clone https://www.github.com/seamplex/feenox
- Bootstrap FeenoX: run autogen.sh, this will call autogen.sh and make clean under the hood so everything will start from scratch:
  cd feenox
  ./autogen.sh
- Configure FeenoX: run configure, optionally changing the flags and/or the compiler, e.g.
  ./configure CFLAGS="-O3 -flto" MPICH_CC="clang"
  Make sure the PETSc/SLEPc architecture is no-debug!
- Compile FeenoX: call make in parallel:
  make -j6
- Create a benchmark.make file out of FeenoX’ makefiles: call bootstrap.sh after all the other steps have been performed:
  cd ..
  ./bootstrap.sh
The procedure to compile an existing benchmark is:
- Go to the benchmark directory
- Check and/or edit the Makefile to customize the benchmark’s compilation flags. Note that the benchmark is C++ while FeenoX is C, so mind the difference between CXXFLAGS and CFLAGS.
- Run make
- Run the benchmark executable
Ideally each benchmark should show some custom context with the compiler and flags used for both FeenoX and the benchmark itself:
benchmark_compiler_command: g++ -Wl,-z,relro -I/usr/include/x86_64-linux-gnu/mpich -L/usr/lib/x86_64-linux-gnu -lmpichcxx -lmpich
benchmark_compiler_flags: -O2
benchmark_compiler_version: g++ (Debian 12.2.0-1) 12.2.0
feenox_compiler_command: gcc -Wl,-z,relro -I/usr/include/x86_64-linux-gnu/mpich -L/usr/lib/x86_64-linux-gnu -lmpich
feenox_compiler_flags: -Ofast -flto=auto
feenox_compiler_version: gcc (Debian 12.2.0-1) 12.2.0
This benchmark uses only one call to FeenoX, namely feenox_principal_stress_from_cauchy(). The other benchmark cases (the remaining rows in the results) do the same job in different ways.
cd principal_stress
make
./principal_stress
All the following runs use FeenoX compiled with GCC and CFLAGS=-Ofast -flto=auto (see feenox_compiler_flags in the outputs below).
- No optimization in the benchmark's Makefile, i.e. CXXFLAGS=-O0. The call to FeenoX' code is faster because it was compiled with -Ofast and the others use -O0:
  benchmark_compiler_command: g++ -Wl,-z,relro -I/usr/include/x86_64-linux-gnu/mpich -L/usr/lib/x86_64-linux-gnu -lmpichcxx -lmpich
  benchmark_compiler_flags:
  benchmark_compiler_version: g++ (Debian 12.2.0-1) 12.2.0
  feenox_compiler_command: gcc -Wl,-z,relro -I/usr/include/x86_64-linux-gnu/mpich -L/usr/lib/x86_64-linux-gnu -lmpich
  feenox_compiler_flags: -Ofast -flto=auto
  feenox_compiler_version: gcc (Debian 12.2.0-1) 12.2.0
  ---------------------------------------------------------------------------
  Benchmark                                       Time        CPU  Iterations
  ---------------------------------------------------------------------------
  BM_principal_stress_feenox                   72.3 ns    72.3 ns     7829337
  BM_principal_stress_call                     96.8 ns    96.8 ns     7282039
  BM_principal_stress_void                     98.4 ns    98.4 ns     7078540
  BM_principal_stress_wrapper                   104 ns     104 ns     6755073
  BM_principal_stress_wrapper2                  106 ns     106 ns     6468864
  BM_principal_stress_wrapper3                  112 ns     112 ns     6331257
  BM_principal_stress_call_cpp_same            97.0 ns    97.0 ns     7289604
  BM_principal_stress_expanded                 94.0 ns    93.9 ns     7203169
  BM_principal_stress_inline                   97.5 ns    97.5 ns     7039346
  BM_principal_stress_inline_optimized_out     95.7 ns    95.7 ns     7138072
  BM_principal_stress_overhead_sigmax_double   2.38 ns    2.38 ns   290591419
  BM_principal_stress_overhead_sigmax_int      1.88 ns    1.88 ns   372357007
- Intermediate optimization -O2 in the benchmark. Now the other calls are slightly faster but still slower than FeenoX:
  benchmark_compiler_command: g++ -Wl,-z,relro -I/usr/include/x86_64-linux-gnu/mpich -L/usr/lib/x86_64-linux-gnu -lmpichcxx -lmpich
  benchmark_compiler_flags: -O2
  benchmark_compiler_version: g++ (Debian 12.2.0-1) 12.2.0
  feenox_compiler_command: gcc -Wl,-z,relro -I/usr/include/x86_64-linux-gnu/mpich -L/usr/lib/x86_64-linux-gnu -lmpich
  feenox_compiler_flags: -Ofast -flto=auto
  feenox_compiler_version: gcc (Debian 12.2.0-1) 12.2.0
  ---------------------------------------------------------------------------
  Benchmark                                       Time        CPU  Iterations
  ---------------------------------------------------------------------------
  BM_principal_stress_feenox                   70.1 ns    70.1 ns     8231293
  BM_principal_stress_call                     74.2 ns    74.2 ns     9106319
  BM_principal_stress_void                     73.1 ns    73.1 ns     9528729
  BM_principal_stress_wrapper                  73.9 ns    73.9 ns     9129438
  BM_principal_stress_wrapper2                 73.7 ns    73.7 ns     9572405
  BM_principal_stress_wrapper3                 74.2 ns    74.2 ns     9401638
  BM_principal_stress_call_cpp_same            73.4 ns    73.4 ns     9507737
  BM_principal_stress_expanded                 72.1 ns    72.1 ns     9356419
  BM_principal_stress_inline                   72.4 ns    72.4 ns     9626681
  BM_principal_stress_inline_optimized_out     26.0 ns    26.0 ns    27283808
  BM_principal_stress_overhead_sigmax_double   1.05 ns    1.05 ns   673686331
  BM_principal_stress_overhead_sigmax_int     0.522 ns   0.522 ns  1000000000
- Level-three optimization -O3 in the benchmark. Closer but FeenoX is still faster, even though some of the benchmark calls can be inlined while the call to FeenoX cannot:
  benchmark_compiler_command: g++ -Wl,-z,relro -I/usr/include/x86_64-linux-gnu/mpich -L/usr/lib/x86_64-linux-gnu -lmpichcxx -lmpich
  benchmark_compiler_flags: -O3
  benchmark_compiler_version: g++ (Debian 12.2.0-1) 12.2.0
  feenox_compiler_command: gcc -Wl,-z,relro -I/usr/include/x86_64-linux-gnu/mpich -L/usr/lib/x86_64-linux-gnu -lmpich
  feenox_compiler_flags: -Ofast -flto=auto
  feenox_compiler_version: gcc (Debian 12.2.0-1) 12.2.0
  ---------------------------------------------------------------------------
  Benchmark                                       Time        CPU  Iterations
  ---------------------------------------------------------------------------
  BM_principal_stress_feenox                   70.8 ns    70.8 ns     8061182
  BM_principal_stress_call                     72.2 ns    72.2 ns     9700876
  BM_principal_stress_void                     72.7 ns    72.7 ns     9441510
  BM_principal_stress_wrapper                  72.6 ns    72.6 ns     9749349
  BM_principal_stress_wrapper2                 72.9 ns    72.9 ns     9295160
  BM_principal_stress_wrapper3                 72.9 ns    72.9 ns     9566969
  BM_principal_stress_call_cpp_same            72.3 ns    72.3 ns     9220847
  BM_principal_stress_expanded                 72.4 ns    72.4 ns     9672055
  BM_principal_stress_inline                   72.7 ns    72.7 ns     9553364
  BM_principal_stress_inline_optimized_out     26.0 ns    26.0 ns    26423158
  BM_principal_stress_overhead_sigmax_double   1.07 ns    1.06 ns   661818956
  BM_principal_stress_overhead_sigmax_int     0.522 ns   0.522 ns  1000000000
- Fast optimization -Ofast in the benchmark. Now all the calls in the benchmark are faster because all of them are inlined while the call to FeenoX is not inlined:
  benchmark_compiler_command: g++ -Wl,-z,relro -I/usr/include/x86_64-linux-gnu/mpich -L/usr/lib/x86_64-linux-gnu -lmpichcxx -lmpich
  benchmark_compiler_flags: -Ofast
  benchmark_compiler_version: g++ (Debian 12.2.0-1) 12.2.0
  feenox_compiler_command: gcc -Wl,-z,relro -I/usr/include/x86_64-linux-gnu/mpich -L/usr/lib/x86_64-linux-gnu -lmpich
  feenox_compiler_flags: -Ofast -flto=auto
  feenox_compiler_version: gcc (Debian 12.2.0-1) 12.2.0
  ---------------------------------------------------------------------------
  Benchmark                                       Time        CPU  Iterations
  ---------------------------------------------------------------------------
  BM_principal_stress_feenox                   70.0 ns    70.0 ns     8231407
  BM_principal_stress_call                     66.4 ns    66.4 ns    10278649
  BM_principal_stress_void                     66.0 ns    66.0 ns    10674543
  BM_principal_stress_wrapper                  66.2 ns    66.2 ns    10429742
  BM_principal_stress_wrapper2                 65.9 ns    65.9 ns    10682682
  BM_principal_stress_wrapper3                 66.4 ns    66.4 ns    10319174
  BM_principal_stress_call_cpp_same            66.4 ns    66.4 ns    10632631
  BM_principal_stress_expanded                 66.0 ns    66.0 ns    10160912
  BM_principal_stress_inline                   66.2 ns    66.2 ns    10610910
  BM_principal_stress_inline_optimized_out    0.000 ns   0.000 ns  1000000000
  BM_principal_stress_overhead_sigmax_double   1.06 ns    1.06 ns   661165736
  BM_principal_stress_overhead_sigmax_int     0.524 ns   0.524 ns  1000000000
- Level-three optimization and link-time optimization -O3 -flto in the benchmark. FeenoX is slightly faster but the call does not seem to be inlined automatically, i.e. the effect of -flto is not obvious:
  benchmark_compiler_command: g++ -Wl,-z,relro -I/usr/include/x86_64-linux-gnu/mpich -L/usr/lib/x86_64-linux-gnu -lmpichcxx -lmpich
  benchmark_compiler_flags: -flto=auto -O3
  benchmark_compiler_version: g++ (Debian 12.2.0-1) 12.2.0
  feenox_compiler_command: gcc -Wl,-z,relro -I/usr/include/x86_64-linux-gnu/mpich -L/usr/lib/x86_64-linux-gnu -lmpich
  feenox_compiler_flags: -Ofast -flto=auto
  feenox_compiler_version: gcc (Debian 12.2.0-1) 12.2.0
  ---------------------------------------------------------------------------
  Benchmark                                       Time        CPU  Iterations
  ---------------------------------------------------------------------------
  BM_principal_stress_feenox                   71.2 ns    71.2 ns     8141022
  BM_principal_stress_call                     71.8 ns    71.8 ns     9698680
  BM_principal_stress_void                     72.9 ns    72.9 ns     9588978
  BM_principal_stress_wrapper                  72.4 ns    72.4 ns     9652976
  BM_principal_stress_wrapper2                 71.4 ns    71.4 ns     9351013
  BM_principal_stress_wrapper3                 72.5 ns    72.5 ns     9801144
  BM_principal_stress_call_cpp_same            72.2 ns    72.2 ns     9520815
  BM_principal_stress_expanded                 73.1 ns    73.1 ns     9768862
  BM_principal_stress_inline                   72.0 ns    72.0 ns     9721819
  BM_principal_stress_inline_optimized_out     26.2 ns    26.2 ns    26201980
  BM_principal_stress_overhead_sigmax_double   1.06 ns    1.06 ns   648864982
  BM_principal_stress_overhead_sigmax_int     0.524 ns   0.524 ns  1000000000
- Fast optimization and link-time optimization -Ofast -flto in the benchmark. Now the call to FeenoX is equivalent to the inlined and fast-optimized code within the benchmark:
  benchmark_compiler_command: g++ -Wl,-z,relro -I/usr/include/x86_64-linux-gnu/mpich -L/usr/lib/x86_64-linux-gnu -lmpichcxx -lmpich
  benchmark_compiler_flags: -flto=auto -Ofast
  benchmark_compiler_version: g++ (Debian 12.2.0-1) 12.2.0
  feenox_compiler_command: gcc -Wl,-z,relro -I/usr/include/x86_64-linux-gnu/mpich -L/usr/lib/x86_64-linux-gnu -lmpich
  feenox_compiler_flags: -Ofast -flto=auto
  feenox_compiler_version: gcc (Debian 12.2.0-1) 12.2.0
  ---------------------------------------------------------------------------
  Benchmark                                       Time        CPU  Iterations
  ---------------------------------------------------------------------------
  BM_principal_stress_feenox                   66.7 ns    66.7 ns     8693376
  BM_principal_stress_call                     66.0 ns    66.0 ns    10472553
  BM_principal_stress_void                     66.1 ns    66.1 ns    10317352
  BM_principal_stress_wrapper                  66.1 ns    66.1 ns    10133735
  BM_principal_stress_wrapper2                 66.5 ns    66.4 ns    10577702
  BM_principal_stress_wrapper3                 66.2 ns    66.1 ns    10156528
  BM_principal_stress_call_cpp_same            66.5 ns    66.5 ns    10597228
  BM_principal_stress_expanded                 66.7 ns    66.6 ns    10326400
  BM_principal_stress_inline                   66.7 ns    66.6 ns    10601185
  BM_principal_stress_inline_optimized_out    0.000 ns   0.000 ns  1000000000
  BM_principal_stress_overhead_sigmax_double   1.05 ns    1.05 ns   667892504
  BM_principal_stress_overhead_sigmax_int     0.519 ns   0.519 ns  1000000000
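When comparing runs like the ones above, it can help to express each row as a ratio against the FeenoX call instead of reading absolute nanoseconds. The following is only a sketch: the function name relative_times is made up for illustration, and it assumes that every result row has the six-column layout shown above and that all rows use the same time unit.

```shell
# relative_times BASELINE
# Hypothetical helper (not part of this repository): reads Google Benchmark
# console output on stdin and prints, for every result row, the wall time
# divided by the wall time of the row named BASELINE.
# Assumption: all rows report their time in the same unit (e.g. ns).
relative_times() {
  awk -v base="$1" '
    # result rows look like: NAME TIME UNIT CPU UNIT ITERATIONS
    NF == 6 && $1 ~ /^BM_/ {
      n++
      name[n] = $1
      t[n] = $2
      if ($1 == base) ref = $2
    }
    END {
      # refuse to divide by a missing or zero baseline
      if (ref + 0 == 0) {
        print "baseline not found or zero: " base > "/dev/stderr"
        exit 1
      }
      for (i = 1; i <= n; i++) printf "%-45s %6.2f\n", name[i], t[i] / ref
    }
  '
}
```

A possible use is ./principal_stress | relative_times BM_principal_stress_feenox, where a value above 1 means the row is slower than the call to FeenoX. Context lines, the header and the dashed separators are ignored because they do not start with BM_.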
This benchmark measures the time needed for FeenoX to build a mechanical stiffness matrix with a call to feenox_problem_build(). This case is slightly more complex because an actual mechanical problem has to be set up, including
- reading the mesh
- setting the material properties
- setting the boundary conditions
Using GCC:
benchmark_compiler_command: g++ -Wl,-z,relro -I/usr/include/x86_64-linux-gnu/mpich -L/usr/lib/x86_64-linux-gnu -lmpichcxx -lmpich
benchmark_compiler_flags: -O3
benchmark_compiler_version: g++ (Debian 12.2.0-1) 12.2.0
feenox_compiler_command: gcc -Wl,-z,relro -I/usr/include/x86_64-linux-gnu/mpich -L/usr/lib/x86_64-linux-gnu -lmpich
feenox_compiler_flags: -Ofast -flto=auto
feenox_compiler_version: gcc (Debian 12.2.0-1) 12.2.0
feenox_git_branch: main
feenox_git_clean: yes
feenox_git_date: Wed Sep 14 08:03:40 2022 -0300
feenox_git_version: v0.2.129-g8234f97
***WARNING*** Library was built as DEBUG. Timings may be affected.
-----------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------
BM_build_only/min_time:2.000 8.23 ms 8.23 ms 319
Using Clang:
benchmark_compiler_command: clang++ -Wl,-z,relro -I/usr/include/x86_64-linux-gnu/mpich -L/usr/lib/x86_64-linux-gnu -lmpichcxx -lmpich
benchmark_compiler_flags: -O3
benchmark_compiler_version: Debian clang version 14.0.6-2
feenox_compiler_command: clang -Wl,-z,relro -I/usr/include/x86_64-linux-gnu/mpich -L/usr/lib/x86_64-linux-gnu -lmpich
feenox_compiler_flags: -O3
feenox_compiler_version: Debian clang version 14.0.6-2
feenox_git_branch: main
feenox_git_clean: yes
feenox_git_date: Wed Sep 14 08:03:40 2022 -0300
feenox_git_version: v0.2.129-g8234f97
***WARNING*** Library was built as DEBUG. Timings may be affected.
-----------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------
BM_build_only/min_time:2.000 8.19 ms 8.19 ms 340
https://github.com/google/benchmark/blob/main/docs/user_guide.md#disabling-cpu-frequency-scaling
If you see this error:
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
you might want to disable the CPU frequency scaling while running the benchmark. Exactly how to do this depends on the Linux distribution, desktop environment, and installed programs. Specific details are a moving target, so we will not attempt to exhaustively document them here.
One simple option is to use the cpupower program to change the performance governor to "performance". This tool is maintained along with the Linux kernel and provided by your distribution.
It must be run as root, like this:
sudo cpupower frequency-set --governor performance
After this you can verify that all CPUs are using the performance governor by running this command:
cpupower frequency-info -o proc
The benchmarks you subsequently run will have less variance.
Note that changing the governor in this way will not persist across reboots. To set the governor back, run the first command again with the governor your system usually runs with, which varies.
If you find yourself doing this often, there are probably better options than running the commands above. Some approaches allow you to do this without root access, or by using a GUI, etc. The Arch Wiki page on CPU frequency scaling is a good place to start looking for options.
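If cpupower is not at hand, the governor can usually also be read straight from sysfs before starting a benchmark run. This is a minimal sketch, assuming a Linux system that exposes cpufreq under /sys/devices/system/cpu; the function names governors_in_use and all_performance are made up for illustration.

```shell
# governors_in_use [ROOT]
# Hypothetical helper: print the distinct scaling governors found under a
# cpufreq sysfs tree. ROOT defaults to /sys/devices/system/cpu.
# Prints nothing if the system does not expose cpufreq there.
governors_in_use() {
  root="${1:-/sys/devices/system/cpu}"
  cat "$root"/cpu[0-9]*/cpufreq/scaling_governor 2>/dev/null | sort -u
}

# all_performance [ROOT]
# Succeeds only if every CPU found reports the "performance" governor.
all_performance() {
  [ "$(governors_in_use "$1")" = "performance" ]
}
```

For example, all_performance || echo "CPU scaling is still enabled" could be run right before the benchmark executable. Reading these files does not need root; only changing the governor does.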
CXXFLAGS += -I/home/gtheler/codigos/benchmark/include $(DEFS) $(SLEPC_CC_INCLUDES) $(PETSC_CC_INCLUDES) $(DOWNLOADED_GSL_INCLUDES)
LDFLAGS += /home/gtheler/codigos/benchmark/build/src/libbenchmark.a $(SLEPC_LIB) $(PETSC_LIB) $(LIBS) $(DOWNLOADED_GSL_LIBS)