Using the madgraph4gpu benchmarking (BMK) containers

Andrea Valassi edited this page Sep 3, 2022 · 10 revisions

The HEP-benchmarks project provides Docker and Singularity containers that fully encapsulate typical software workloads of the LHC experiments. A test container based on madgraph4gpu, using the standalone tests with cudacpp matrix elements, has recently been added:

This uses software builds that are prepared in the BMK CI using CUDA 11.7 and gcc 11.2. They are based on the latest master of madgraph4gpu.

The current version of the container is v0.6; it is available from the following locations:

The following is an example, where the singularity cache dir and tmp dir are also redirected:

  export SINGULARITY_TMPDIR=/scratch/SINGULARITY_TMPDIR
  export SINGULARITY_CACHEDIR=/scratch/SINGULARITY_CACHEDIR
  singularity run -B /scratch/TMP_RESULTS:/results oras://registry.cern.ch/hep-workloads/mg5amc-madgraph4gpu-2022-bmk:v0.6 -h

The containers are configurable: using -h prints a list of options. These are still UNDER TEST: please report any issues to AndreaV. Both CPU and GPU tests are available.

  • For CPU tests, you may use -c to change the number of simultaneous copies that run on your node as separate (single-threaded) processes. You should typically use $(nproc) copies to fill the CPU, and you can also try overcommitting the node.
  • For GPU tests, it is recommended to use -c1, i.e. a single copy. The GPU can also be shared among different CPU processes, but the overhead reduces the overall throughput.
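The two modes above can be sketched as follows. The --cpu, --gpu, -c flags and the image URL come from this page; the RUN dry-run variable, the /scratch paths and the --nv flag (the standard Singularity option to expose NVIDIA GPUs) are assumptions for illustration.

```shell
# Dry-run sketch: set RUN=singularity to actually execute; by default the commands are only printed.
RUN="${RUN:-echo}"
IMAGE=oras://registry.cern.ch/hep-workloads/mg5amc-madgraph4gpu-2022-bmk:v0.6

# CPU test: one single-threaded copy per core to fill the CPU
$RUN run -B /scratch/TMP_RESULTS:/results "$IMAGE" --cpu -c "$(nproc)"

# GPU test: a single copy is enough to saturate the GPU (--nv exposes the NVIDIA driver)
$RUN run --nv -B /scratch/TMP_RESULTS:/results "$IMAGE" --gpu -c1
```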

Scripts and results in madgraph4gpu

A few preliminary results have been obtained using some simple scripts to run CPU tests and then analyse the results and produce some plots:

Several png plots are available from two nodes

Some example results for multi-core + SIMD:

A comparison of absolute throughputs for four processes, using the best SIMD:

Just for internal reference (NB for production use stick to inl0, do not use inl1!):

Recommendations

Many options are configurable; here are a few recommendations:

  • for benchmarking different systems, the most complex process ggttgg is recommended; but if you can also collect results for the other three processes, these may be useful later on...
  • for benchmarking different systems, double precision (double) is recommended; but if you can also collect results for single precision (float), these may be useful later on...
  • run separate tests for --cpu and --gpu: for CPU tests use nproc copies (this should be the default), unless you also want to produce scaling plots for different numbers of copies; for GPU tests use a single copy (-c1), as the standalone test completely saturates the GPU and there is no point in overcommitting it
  • the "number of events" configurable by -e is a multiplier over predefined numbers of events (hundreds of thousands!); the default -e1 runs tests that are too short, so their results are probably overestimated (especially for ggttg and ggtt): try -e10 or even more, and if you do a scan and check score stability, that may be useful
  • run only "inl0" and forget about "inl1"
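A scan over the -e multiplier, as suggested above, might look like the sketch below. The -e, -c and --cpu flags and the image URL come from this page; the RUN dry-run variable, the chosen multiplier values and the /scratch output paths are assumptions for illustration.

```shell
# Dry-run sketch of an -e scan to check score stability; set RUN=singularity to actually execute.
RUN="${RUN:-echo}"
IMAGE=oras://registry.cern.ch/hep-workloads/mg5amc-madgraph4gpu-2022-bmk:v0.6

# Run the same CPU test with increasing event multipliers, keeping results separate,
# then compare the reported scores: they should stabilise as -e grows.
for E in 1 10 50; do
  $RUN run -B "/scratch/TMP_RESULTS_e${E}:/results" "$IMAGE" --cpu -c "$(nproc)" -e "$E"
done
```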