Using the madgraph4gpu benchmarking (BMK) containers

Andrea Valassi edited this page Sep 3, 2022 · 10 revisions

The HEP-benchmarks project provides Docker and Singularity containers that fully encapsulate typical software workloads of the LHC experiments. A test container based on madgraph4gpu, using the standalone tests with cudacpp matrix elements, has recently been added:

This uses software builds that are prepared in the BMK CI using CUDA 11.7 and gcc 11.2. They are based on the latest master of madgraph4gpu.

The current version of the container is v0.6; it is available from the following locations:

The following is an example, where the singularity cache dir and tmp dir are also redirected:

  export SINGULARITY_TMPDIR=/scratch/SINGULARITY_TMPDIR
  export SINGULARITY_CACHEDIR=/scratch/SINGULARITY_CACHEDIR
  singularity run -B /scratch/TMP_RESULTS:/results oras://registry.cern.ch/hep-workloads/mg5amc-madgraph4gpu-2022-bmk:v0.6 -h

The containers are configurable: using -h prints a list of options. These are still UNDER TEST: please report any issues to AndreaV. Both CPU and GPU tests are available.

  • For CPU tests, you may use -c to change the number of simultaneous copies that run on your node as separate (single-threaded) processes. You should typically use $(nproc) copies to fill the CPU, and you can also try overcommitting the node.
  • For GPU tests, it is recommended to use -c1, i.e. a single copy. The GPU can also be shared among different CPU processes, but the overhead reduces the overall throughput.
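The two modes above can be sketched as follows. The --cpu, --gpu, -c flags and the image URL come from this page; the RUN dry-run variable, the /scratch paths and the --nv flag (the standard Singularity option to expose NVIDIA GPUs) are assumptions for illustration.

```shell
# Dry-run sketch: set RUN=singularity to actually execute; by default the commands are only printed.
RUN="${RUN:-echo}"
IMAGE=oras://registry.cern.ch/hep-workloads/mg5amc-madgraph4gpu-2022-bmk:v0.6

# CPU test: one single-threaded copy per core to fill the CPU
$RUN run -B /scratch/TMP_RESULTS:/results "$IMAGE" --cpu -c "$(nproc)"

# GPU test: a single copy is enough to saturate the GPU (--nv exposes the NVIDIA driver)
$RUN run --nv -B /scratch/TMP_RESULTS:/results "$IMAGE" --gpu -c1
```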

Scripts and results in madgraph4gpu

A few preliminary results have been obtained using some simple scripts to run CPU tests and then analyse the results and produce some plots:

Several png plots are available from two nodes

Some example results for multi-core + SIMD:

A comparison of absolute throughputs for four processes, using the best SIMD:

Just for internal reference (NB for production use stick to inl0, do not use inl1!):

Recommendations

Many options are configurable; here are a few recommendations:

  • for benchmarking different systems, the most complex process ggttgg is recommended; but if you can also collect results for the other three processes, these may be useful later on...
  • for benchmarking different systems, double precision (double) is recommended; but if you can also collect results for single precision (float), these may be useful later on...
  • run separate tests for --cpu and --gpu: for CPU tests use nproc copies (this should be the default), unless you also want to produce scaling plots for different numbers of copies; for GPU tests use a single copy (-c1), as the standalone test completely saturates the GPU and there is no point in overcommitting it
  • the "number of events" configurable by -e is a multiplier over predefined numbers of events (hundreds of thousands!); the default -e1 runs tests that are too short, so their results are probably overestimated (especially for ggttg and ggtt): try -e10 or even more, and if you do a scan and check score stability, that may be useful
  • run only "inl0" and forget about "inl1"
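A scan over the -e multiplier, as suggested above, might look like the sketch below. The -e, -c and --cpu flags and the image URL come from this page; the RUN dry-run variable, the chosen multiplier values and the /scratch output paths are assumptions for illustration.

```shell
# Dry-run sketch of an -e scan to check score stability; set RUN=singularity to actually execute.
RUN="${RUN:-echo}"
IMAGE=oras://registry.cern.ch/hep-workloads/mg5amc-madgraph4gpu-2022-bmk:v0.6

# Run the same CPU test with increasing event multipliers, keeping results separate,
# then compare the reported scores: they should stabilise as -e grows.
for E in 1 10 50; do
  $RUN run -B "/scratch/TMP_RESULTS_e${E}:/results" "$IMAGE" --cpu -c "$(nproc)" -e "$E"
done
```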