- Getting started with HPC on Graviton instances
- Compile instructions on an EC2 instance
- MPI application profiling
C7gn/Hpc7g instances are the latest additions to the Graviton-based EC2 instance family, optimized for network- and compute-intensive High Performance Computing (HPC) applications. This document helps HPC users get optimal performance on Graviton instances. It covers the recommended compilers, libraries, and runtime configurations for building and running HPC applications. Along with the recommended software configuration, it also provides example scripts to get started with three widely used open-source HPC applications: Weather Research and Forecasting (WRF), Open Source Field Operation And Manipulation (OpenFOAM), and Gromacs.
- Instance type: C7gn and Hpc7g (Graviton3E processor, up to 200 Gbps network bandwidth, 2 GB RAM per vCPU)
- Cluster manager: AWS ParallelCluster
- Base AMI: aws-parallelcluster-3.5.1-ubuntu-2004-lts-hvm-arm64
- Operating System: Ubuntu 20.04 (the latest version supported by ParallelCluster)
- Linux Kernel: 5.15 and later (for users who intend to use custom AMIs)
- ENA driver: version 2.8.3 and later (enhanced networking)
- EFA driver: version 1.23.0 and later (see docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-enable)
- Compiler: Arm Compiler for Linux (ACfL) v23.04 and later (see below for other compiler options)
- ArmPL: v23.04 and later (included in the ACfL compiler)
- MPI: Open MPI v4.1.4 and later (the latest official release)
- Storage: FSx for Lustre for the shared file system. HPC instance types have limited EBS bandwidth, and using FSx for Lustre avoids a bottleneck at the headnode.
We recommend using AWS ParallelCluster (previously known as CfnCluster) to deploy and manage HPC clusters on Amazon EC2. AWS ParallelCluster 3.5.1 is a tool that can automatically set up the compute resources, job scheduler, and shared filesystem commonly needed to run HPC applications. This section covers step-by-step instructions on how to set up or upgrade the tools and software packages to the recommended versions on a new ParallelCluster. Please refer to the individual sub-sections if you need to update certain software packages on an existing cluster. For a new cluster, you can use this template and replace the subnet, the S3 bucket name for the custom action script, and the ssh key information from your account to create an Ubuntu 20.04 cluster. The command to create a new cluster is:
pcluster create-cluster --cluster-name test-cluster --cluster-configuration hpc7g-ubuntu2004-useast1.yaml
The cluster creation process takes about 10 minutes. You can find the headNode information on the EC2 console page once the creation process is finished (see the image below). If you have multiple headNodes under the account, go to the instance summary and check the Instance profile arn attribute to find the one with a prefix matching the cluster name you created.
Alternatively, you can use pcluster describe-cluster --cluster-name test-cluster to find the instanceId of the headNode, and aws ec2 describe-instances --instance-ids <instanceId> to find its public IP.
{
"creationTime": "2023-04-19T12:56:19.079Z",
"headNode": {
"launchTime": "2023-05-09T14:17:39.000Z",
"instanceId": "i-01489594da7c76f77",
"publicIpAddress": "3.227.12.112",
"instanceType": "c7g.4xlarge",
"state": "running",
"privateIpAddress": "10.0.1.55"
},
"version": "3.5.1",
...
}
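For convenience, the same lookup can be scripted. The following is a minimal sketch (it assumes the AWS CLI and jq are installed and configured on the machine where you run pcluster):
# hypothetical helper: look up the headnode's public IP for a cluster
CLUSTER_NAME=test-cluster
INSTANCE_ID=$(pcluster describe-cluster --cluster-name ${CLUSTER_NAME} | jq -r '.headNode.instanceId')
aws ec2 describe-instances --instance-ids ${INSTANCE_ID} \
  --query 'Reservations[0].Instances[0].PublicIpAddress' --output text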
You can log in to the headNode in the same way as a regular EC2 instance. Run the setup script with the command ./scripts-setup/install-tools-headnode-ubuntu2004.sh to install the required tools and packages (ACfL and Open MPI) on the shared storage, /shared.
Many HPC applications depend on compiler optimizations for better performance. We recommend Arm Compiler for Linux (ACfL) because it is tailored for HPC codes and ships with Arm Performance Libraries (ArmPL), which include optimized BLAS, LAPACK, FFT, and math libraries. Follow the instructions below to install and use ACfL 23.04 (the latest version as of April 2023), or run the installation script with the command ./scripts-setup/0-install-acfl.sh.
# Install environment modules
sudo apt install environment-modules
source /etc/profile.d/modules.sh
# Find the download link to Arm compiler for your OS on https://developer.arm.com/downloads/-/arm-compiler-for-linux
mkdir -p /shared/tools && cd /shared/tools
wget -O arm-compiler-for-linux_23.04_Ubuntu-20.04_aarch64.tar <link to the tar ball>
tar xf arm-compiler-for-linux_23.04_Ubuntu-20.04_aarch64.tar
sudo ./arm-compiler-for-linux_23.04_Ubuntu-20.04/arm-compiler-for-linux_23.04_Ubuntu-20.04.sh \
-i /shared/arm -a --force
# load the module to use Arm Compiler for Linux (ACfL)
module use /shared/arm/modulefiles
module load acfl
You will see the following message if ACfL installation is successful.
Unpacking...
Installing...The installed packages contain modulefiles under /shared/arm/modulefiles
You can add these to your environment by running:
$ module use /shared/arm/modulefiles
Alternatively: $ export MODULEPATH=$MODULEPATH:/shared/arm/modulefiles
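As a quick sanity check after loading the module, you can query the compiler versions:
# confirm the ACfL compilers are on the PATH after loading the module
module use /shared/arm/modulefiles
module load acfl
armclang --version
armflang --version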
Please refer to the Appendix for a partial list of other HPC compilers with Graviton support.
Using highly optimized linear algebra and FFT libraries can significantly speed up the computation of certain HPC applications. We recommend Arm Performance Libraries (ArmPL) because it offers a vectorized math library (libamath) plus BLAS, LAPACK, and FFT libraries with better performance than other implementations such as OpenBLAS or FFTW. ArmPL can be used with the -armpl flag for ACfL; it can also be used with other compilers, for example GCC, by adding the compilation options -I${ARMPL_INCLUDES} -L${ARMPL_LIBRARIES} -larmpl.
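As an illustration, the following minimal sketch links a GCC-compiled program against ArmPL; my_solver.c is a hypothetical source file, and the snippet assumes the armpl module is loaded so that ARMPL_INCLUDES and ARMPL_LIBRARIES are set:
# compile and link a hypothetical program against ArmPL with GCC
module use /shared/arm/modulefiles
module load armpl
gcc -O3 -mcpu=neoverse-v1 my_solver.c \
  -I${ARMPL_INCLUDES} -L${ARMPL_LIBRARIES} -larmpl -lm -o my_solver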
ACfL includes the ArmPL packages as well. If you wish to install only ArmPL, follow the steps below or use the script with the command ./scripts-setup/1-install-armpl.sh.
# Find the download link to ArmPL (Ubuntu 20.04, GCC-10.2) on https://developer.arm.com/downloads/-/arm-performance-libraries
mkdir -p /shared/tools && cd /shared/tools
wget -O arm-performance-libraries_23.04_Ubuntu-20.04_gcc-10.2.tar <link to ArmPL.tar>
tar xf arm-performance-libraries_23.04_Ubuntu-20.04_gcc-10.2.tar
cd arm-performance-libraries_23.04_Ubuntu-20.04/
./arm-performance-libraries_23.04_Ubuntu-20.04.sh -i /shared/arm -a --force
You will see the following message if the installation is successful.
Unpacking...
Installing...The installed packages contain modulefiles under /shared/arm/modulefiles
You can add these to your environment by running:
$ module use /shared/arm/modulefiles
Alternatively: $ export MODULEPATH=$MODULEPATH:/shared/arm/modulefiles
C7gn/Hpc7g instances come with an Elastic Fabric Adapter (EFA) interface for low-latency node-to-node communication that offers a peak bandwidth of 200 Gbps. Installing the correct EFA driver is crucial for the performance of network-intensive HPC applications. AWS ParallelCluster 3.5.1 comes with the latest EFA driver, which supports the EFA interface on C7gn and Hpc7g. If you prefer to stay with an existing cluster created by an earlier version of AWS ParallelCluster, follow the steps below to check the EFA driver version and upgrade the driver if necessary.
# ssh into a compute instance after it is configured
fi_info -p efa
# Output on instances without the proper EFA driver
fi_getinfo: -61
# Output on instances with the proper EFA driver
provider: efa
fabric: EFA-fe80::94:3dff:fe89:1b70
domain: efa_0-rdm
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_EFA
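If the provider is missing or the driver is outdated, the driver can be upgraded with the AWS EFA installer. The steps below are a sketch; please check the EFA documentation for the current installer version before running them.
# download and run the AWS EFA installer on the instance, then re-check the provider
cd /tmp
curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz
tar -xf aws-efa-installer-latest.tar.gz
cd aws-efa-installer
sudo ./efa_installer.sh -y
fi_info -p efa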
For applications that use the Message Passing Interface (MPI) to communicate, we recommend using Open MPI v4.1.4 or later on Graviton instances. AWS ParallelCluster 3.5.1 provides Open MPI libraries built with the default GCC. For best performance, we recommend re-compiling them with ACfL 23.04 or GCC 11 and later. The following snippet shows how to build Open MPI 4.1.4 with ACfL 23.04; alternatively, use the script with the command ./scripts-setup/2a-install-openmpi-with-acfl.sh.
# compile Open MPI with ACfL
export INSTALLDIR=/shared
export OPENMPI_VERSION=4.1.4
module use /shared/arm/modulefiles
module load acfl
export CC=armclang
export CXX=armclang++
export FC=armflang
export CFLAGS="-mcpu=neoverse-512tvb"
# assuming the efa driver is installed at the default directory /opt/amazon/efa
cd /shared/tools
wget -N https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.4.tar.gz
tar -xzvf openmpi-4.1.4.tar.gz
cd openmpi-4.1.4
mkdir build-acfl
cd build-acfl
../configure --prefix=${INSTALLDIR}/openmpi-${OPENMPI_VERSION}-acfl --enable-mpirun-prefix-by-default --without-verbs --disable-man-pages --enable-builtin-atomics --with-libfabric=/opt/amazon/efa --with-libfabric-libdir=/opt/amazon/efa/lib
make -j$(nproc) && make install
To check whether Open MPI was built with ACfL, run:
export PATH=/shared/openmpi-4.1.4-acfl/bin:$PATH
export LD_LIBRARY_PATH=/shared/openmpi-4.1.4-acfl/lib:$LD_LIBRARY_PATH
mpicc --version
You will get the following message if the build is successful.
Arm C/C++/Fortran Compiler version 23.04 (build number 21) (based on LLVM 16.0.0)
Target: aarch64-unknown-linux-gnu
Thread model: posix
InstalledDir: /shared/arm/arm-linux-compiler-23.04_Ubuntu-20.04/bin
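As an optional sanity check (the hostname run below is purely illustrative), you can confirm that mpirun works and that the libfabric (OFI) components were built in:
# run a trivial command with the ACfL-built Open MPI and list the OFI components
export PATH=/shared/openmpi-4.1.4-acfl/bin:$PATH
export LD_LIBRARY_PATH=/shared/openmpi-4.1.4-acfl/lib:$LD_LIBRARY_PATH
mpirun -np 4 hostname
ompi_info | grep -i ofi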
Some HPC applications require significant amounts of file I/O; however, HPC instance types (Graviton instances included) don't have local storage and have limited EBS bandwidth and IOPS. Relying on EBS on each node can cause surprise slow-downs when the instance runs out of EBS burst credits. This is one reason we don't recommend using an Hpc7g instance (or any HPC instance type) for the headnode, since the headnode performs additional I/O as the scheduler and often serves a home directory to the compute nodes. For these reasons, we make the following recommendations:
- Use FSx for Lustre to serve data and configuration files to compute nodes. FSx for Lustre file systems can be configured in a variety of sizes and throughputs to meet your specific needs. See the SharedStorage section in the example cluster configuration.
- Headnodes should be compute-optimized instances (such as C7gn or C7g), and sized with both compute needs and EBS/networking needs in mind.
Once the HPC cluster is set up following the steps above, you can run the following sample HPC applications on Graviton and check their performance. If you run into any issues with these sample applications on Graviton instances, please raise an issue on the aws-graviton-getting-started GitHub page.
Package | Version | Build options | Run time configurations |
---|---|---|---|
WRF (Weather Research & Forecasting) | v4.5+ | ACfL | 8 CPUs per rank |
OpenFOAM (Computational Fluid Dynamics simulation) | v2112+ | ACfL | 1 CPU per rank |
Gromacs (Molecular Dynamics simulation) | v2022.4+ | ACfL with SVE_SIMD option | 1 CPU per rank |
The WRF model is one of the most widely used numerical weather prediction (NWP) systems. WRF is used extensively for research and real-time forecasting. Each simulation requires a large amount of computational resources, especially at high resolution. We recommend using WRF 4.5.
The WRF Pre-Processing System (WPS) prepares a domain (region of the Earth) for input to WRF. We recommend using WPS 4.5.
Use this script with the command ./scripts-wrf/install-wrf-tools-acfl.sh to install the required tools: zlib, hdf5, pnetcdf, netcdf-c, and netcdf-fortran. Alternatively, use these scripts in numeric order to install the tools sequentially. You will get this message if the pnetcdf installation is successful, this message if the netcdf-c installation is successful, and this message if the netcdf-fortran installation is successful.
Use this script with the command ./scripts-wrf/compile-wrf-v45-acfl.sh to configure and compile WRF.
# get WRF source v45
git clone https://github.com/wrf-model/WRF.git
cd WRF && git checkout release-v4.5
# apply a patch that includes ACfL compiler options
wget https://raw.githubusercontent.com/aws/aws-graviton-getting-started/main/HPC/scripts-wrf/WRF-v45-patch-acfl.diff
git apply WRF-v45-patch-acfl.diff
# choose option '12. (dm+sm) armclang (armflang/armclang): Aarch64' and '1=basic'
./configure
sed -i 's/(WRF_NMM_CORE)$/(WRF_NMM_CORE) -Wno-error=implicit-function-declaration -Wno-error=implicit-int/g' configure.wrf
./compile -j 1 em_real 2>&1 | tee compile_wrf.out
You will get the following message if the WRF build is successful.
==========================================================================
build started: Fri May 12 17:32:14 UTC 2023
build completed: Fri May 12 18:10:12 UTC 2023
---> Executables successfully built <---
-rwxrwxr-x 1 ubuntu ubuntu 47804664 May 12 18:10 main/ndown.exe
-rwxrwxr-x 1 ubuntu ubuntu 47553704 May 12 18:10 main/real.exe
-rwxrwxr-x 1 ubuntu ubuntu 47167056 May 12 18:10 main/tc.exe
-rwxrwxr-x 1 ubuntu ubuntu 52189632 May 12 18:09 main/wrf.exe
==========================================================================
WRF uses a hybrid shared-memory and distributed-memory programming model. We recommend using 8 threads per rank and setting the thread affinity to "compact" to reduce communication overhead and achieve better performance. The following is an example Slurm script that downloads the WRF CONUS 12km model and runs it on a single Hpc7g instance with 8 ranks and 8 threads per rank. You can submit the Slurm job by running the command sbatch sbatch-wrf-v45-acfl.sh.
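For reference, a minimal sketch of such a Slurm script is shown below; the actual sbatch-wrf-v45-acfl.sh in the repository also downloads the CONUS 12km case, and the case directory used here is illustrative.
#!/bin/bash
# sketch: WRF CONUS 12km on one Hpc7g instance, 8 MPI ranks x 8 OpenMP threads
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=8
#SBATCH --exclusive
export PATH=/shared/openmpi-4.1.4-acfl/bin:$PATH
export LD_LIBRARY_PATH=/shared/openmpi-4.1.4-acfl/lib:$LD_LIBRARY_PATH
module use /shared/arm/modulefiles
module load acfl
export OMP_NUM_THREADS=8
export OMP_PLACES=cores
export OMP_PROC_BIND=close
cd /shared/data-wrf/conus-12km        # illustrative case directory
mpirun -np 8 --map-by ppr:8:node:PE=8 --bind-to core ./wrf.exe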
At the end of the WRF log file from rank 0 (rsl.error.0000), you will see the following message if the job completes successfully.
Timing for main: time 2019-11-26_23:58:48 on domain 1: 0.46453 elapsed seconds
Timing for main: time 2019-11-27_00:00:00 on domain 1: 0.46581 elapsed seconds
mediation_integrate.G 1242 DATASET=HISTORY
mediation_integrate.G 1243 grid%id 1 grid%oid
2
Timing for Writing wrfout_d01_2019-11-27_00:00:00 for domain 1: 0.97232 elapsed seconds
wrf: SUCCESS COMPLETE WRF
You can view the WRF output model using NICE DCV and Ncview. Typically, the elapsed time spent on the computing steps is used to measure the performance of a WRF simulation on a system.
# count the compute time steps (the first timed step is skipped by NR>1)
num_compute_time_steps=$( grep "Timing for main" rsl.error.0000 | awk 'NR>1' | wc -l )
# sum the elapsed seconds of the compute steps (9th field of each timing line)
time_compute_steps=$( grep "Timing for main" rsl.error.0000 | awk 'NR>1' | awk '{ sum_comp += $9} END { print sum_comp }' )
echo $time_compute_steps
After compiling WRF 4.5, use this script with the command ./scripts-wps/0-install-jasper.sh to install the required tool, jasper. Then, use this script with the command ./scripts-wps/compile-wps.sh to configure and compile WPS.
# get WPS source 4.5
wget https://github.com/wrf-model/WPS/archive/refs/tags/v4.5.tar.gz
tar xf v4.5.tar.gz
cd WPS-4.5
# append a stanza with ACfL compiler options to arch/configure.defaults
cat >> arch/configure.defaults << EOL
########################################################################################################################
#ARCH Linux aarch64, Arm compiler OpenMPI # serial smpar dmpar dm+sm
#
COMPRESSION_LIBS = CONFIGURE_COMP_L
COMPRESSION_INC = CONFIGURE_COMP_I
FDEFS = CONFIGURE_FDEFS
SFC = armflang
SCC = armclang
DM_FC = mpif90
DM_CC = mpicc -DMPI2_SUPPORT
FC = CONFIGURE_FC
CC = CONFIGURE_CC
LD = $(FC)
FFLAGS = -ffree-form -O -fconvert=big-endian -frecord-marker=4 -ffixed-line-length-0 -Wno-error=implicit-function-declaration -Wno-error=implicit-int -Wno-error=incompatible-function-pointer-types
F77FLAGS = -ffixed-form -O -fconvert=big-endian -frecord-marker=4 -ffree-line-length-0 -Wno-error=implicit-function-declaration -Wno-error=implicit-int -Wno-error=incompatible-function-pointer-types
FCSUFFIX =
FNGFLAGS = $(FFLAGS)
LDFLAGS =
CFLAGS = -Wno-error=implicit-function-declaration -Wno-error=implicit-int -Wno-error=incompatible-function-pointer-types
CPP = /usr/bin/cpp -P -traditional
CPPFLAGS = -D_UNDERSCORE -DBYTESWAP -DLINUX -DIO_NETCDF -DBIT32 -DNO_SIGNAL CONFIGURE_MPI
RANLIB = ranlib
EOL
# configure (with option 2), and compile
./configure <<< 2
sed -i 's/-lnetcdf/-lnetcdf -lnetcdff -lgomp /g' configure.wps
./compile | tee compile_wps.log
You will see the geogrid.exe, metgrid.exe, and ungrib.exe files in your directory if the WPS build is successful.
OpenFOAM is a free, open-source CFD software package released and developed by OpenCFD Ltd since 2004. OpenFOAM has a large user base and is used for computational fluid dynamics simulations in a wide variety of industries, including aerospace, automotive, chemical manufacturing, and petroleum exploration.
Use this script with the command ./scripts-openfoam/compile-openfoam-acfl.sh to compile OpenFOAM with ACfL.
mkdir -p /shared/tools/openfoam-root && cd /shared/tools/openfoam-root
export PATH=/shared/openmpi-4.1.4-acfl/bin:$PATH
export LD_LIBRARY_PATH=/shared/openmpi-4.1.4-acfl/lib:$LD_LIBRARY_PATH
module use /shared/arm/modulefiles
module load acfl armpl
[ -d openfoam ] || git clone -b OpenFOAM-v2112 https://develop.openfoam.com/Development/openfoam.git
[ -d ThirdParty-common ] || git clone -b v2112 https://develop.openfoam.com/Development/ThirdParty-common.git
pushd ThirdParty-common
scotch_version="6.1.0"
git clone -b v${scotch_version} https://gitlab.inria.fr/scotch/scotch.git scotch_${scotch_version}
popd
cd openfoam
# a patch required for ACfL or GCC-12 (https://develop.openfoam.com/Development/openfoam/-/commit/91198eaf6a0c11b57446374d97a079ca95cf1412)
wget https://raw.githubusercontent.com/aws/aws-graviton-getting-started/main/HPC/scripts-openfoam/openfoam-v2112-patch.diff
git apply openfoam-v2112-patch.diff
sed -i -e "s/WM_COMPILER=Gcc/WM_COMPILER=Arm/g" etc/bashrc
source etc/bashrc || echo "Non-zero exit of source etc/bashrc"
./Allwmake -j
You will see the following message if the installation is successful.
========================================
Done OpenFOAM applications
========================================
========================================
prefix = /shared/tools/openfoam-root/openfoam/platforms/linuxARM64ArmDPInt32Opt
ignoring possible compilation errors
make certain to check the output file
2023-05-12 21:03:31 +0000
========================================
openfoam
Arm system compiler
linuxARM64ArmDPInt32Opt, with SYSTEMOPENMPI sys-openmpi
api = 2112
patch = 0
bin = 263 entries
lib = 120 entries
========================================
Use this script with the command sbatch ./sbatch-openfoam-acfl.sh to set up the environment parameters, perform the domain decomposition, generate the meshes, and run the OpenFOAM motorBike 70M benchmark (included in the OpenFOAM v2112 package) on a single instance with 64 ranks.
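A minimal sketch of such a script is shown below; the repository's sbatch-openfoam-acfl.sh performs the full motorBike pre-processing, so the case directory and steps here only illustrate the general flow.
#!/bin/bash
# sketch: OpenFOAM motorBike benchmark on one instance with 64 ranks
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=64
#SBATCH --exclusive
module use /shared/arm/modulefiles
module load acfl armpl
export PATH=/shared/openmpi-4.1.4-acfl/bin:$PATH
export LD_LIBRARY_PATH=/shared/openmpi-4.1.4-acfl/lib:$LD_LIBRARY_PATH
source /shared/tools/openfoam-root/openfoam/etc/bashrc
cd /shared/data-openfoam/motorBike-70M/motorBike    # illustrative case directory
decomposePar -force                                 # split the domain into 64 subdomains
mpirun -np 64 snappyHexMesh -parallel -overwrite    # generate the mesh in parallel
mpirun -np 64 simpleFoam -parallel > log/simpleFoam.log 2>&1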
If the simulation succeeds, you will see the final model statistics at the end of the log file, /shared/data-openfoam/motorBike-70M/motorBike/log/simpleFoam.log, as shown below. You can also use ParaView and NICE DCV to visualize the OpenFOAM output model.
streamLine streamLines write:
seeded 20 particles
Tracks:20
Total samples:18175
Writing data to "/shared/data-openfoam/motorBike-70M/motorBike/postProcessing/sets/streamLines/500"
forceCoeffs forces execute:
Coefficients
Cd : 0.438588 (pressure: 0.412171 viscous: 0.0264166)
Cs : 0.00672088 (pressure: 0.00631824 viscous: 0.000402645)
Cl : -0.0259146 (pressure: -0.0215873 viscous: -0.00432727)
CmRoll : 0.00360773 (pressure: 0.0034373 viscous: 0.000170428)
CmPitch : 0.228219 (pressure: 0.215858 viscous: 0.0123609)
CmYaw : 0.00165442 (pressure: 0.00162885 viscous: 2.55688e-05)
Cd(f) : 0.222901
Cd(r) : 0.215686
Cs(f) : 0.00501486
Cs(r) : 0.00170602
Cl(f) : 0.215262
Cl(r) : -0.241177
End
Finalising parallel run
Gromacs is a widely used molecular dynamics software package. Gromacs is computationally intensive and benefits from modern processors' SIMD (single instruction, multiple data) capabilities. We recommend using Gromacs 2022.4 or later because these releases implement performance-critical routines using the SVE instruction set found on Hpc7g/C7gn.
Use this script with the command ./scripts-gromacs/compile-gromacs-acfl.sh to build Gromacs with ACfL.
# note: Gromacs supports 3 different programming interfaces for FFT:
# "fftw3", "mkl" and "fftpack". The ArmPL FFT library has the same
# programming interface as FFTW, so, setting "-DGMX_FFT_LIBRARY=fftw3" and
# "-DFFTWF_LIBRARY=${ARMPL_LIBRARIES}/libarmpl_lp64.so" enables the
# ArmPL FFT library for Gromacs.
cmake .. -DGMX_BUILD_OWN_FFTW=OFF \
-DREGRESSIONTEST_DOWNLOAD=ON \
-DCMAKE_C_FLAGS="-mcpu=neoverse-512tvb —param=aarch64-autovec-preference=4 -g" \
-DCMAKE_CXX_FLAGS="-mcpu=neoverse-512tvb —param=aarch64-autovec-preference=4 -g" \
-DCMAKE_C_COMPILER=$(which mpicc) \
-DCMAKE_CXX_COMPILER=$(which mpicxx) \
-DGMX_OMP=ON \
-DGMX_MPI=ON \
-DGMX_SIMD=ARM_SVE \
-DGMX_BUILD_MDRUN_ONLY=OFF \
-DGMX_DOUBLE=OFF \
-DCMAKE_INSTALL_PREFIX=${CURDIR} \
-DBUILD_SHARED_LIBS=OFF \
-DGMX_FFT_LIBRARY=fftw3 \
-DFFTWF_LIBRARY=${ARMPL_LIBRARIES}/libarmpl_lp64.so \
-DFFTWF_INCLUDE_DIR=${ARMPL_INCLUDES} \
\
-DGMX_BLAS_USER=${ARMPL_LIBRARIES}/libarmpl_lp64.so \
-DGMX_LAPACK_USER=${ARMPL_LIBRARIES}/libarmpl_lp64.so \
\
-DGMXAPI=OFF \
-DGMX_GPU=OFF
make
make install
You will see the following message if the installation is successful.
-- Installing: /shared/gromacs-2022.4-acfl/bin/gmx_mpi
-- Up-to-date: /shared/gromacs-2022.4-acfl/bin
-- Installing: /shared/gromacs-2022.4-acfl/bin/gmx-completion.bash
-- Installing: /shared/gromacs-2022.4-acfl/bin/gmx-completion-gmx_mpi.bash
To get the best performance for benchRIB, a benchmark from the Max Planck Institute, we recommend one core per rank and 64 ranks per instance. Below is an example Slurm script for running a Gromacs job on a single instance. You can start the Slurm job with sbatch sbatch-gromacs-acfl.sh.
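A sketch of such a script is shown below; the input location and file names are illustrative, and sbatch-gromacs-acfl.sh in the repository remains the reference.
#!/bin/bash
# sketch: Gromacs benchRIB on one instance, 64 ranks with 1 OpenMP thread each
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=64
#SBATCH --exclusive
export PATH=/shared/openmpi-4.1.4-acfl/bin:$PATH
export LD_LIBRARY_PATH=/shared/openmpi-4.1.4-acfl/lib:$LD_LIBRARY_PATH
source /shared/gromacs-2022.4-acfl/bin/GMXRC
export OMP_NUM_THREADS=1
cd /shared/data-gromacs/benchRIB
mpirun -np 64 gmx_mpi mdrun -ntomp 1 -s benchRIB.tpr -g benchRIB.log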
At the end of the benchRIB output log, /shared/data-gromacs/benchRIB/benchRIB.log, you will find a section showing the performance of the simulation. Below is an example of the output file on a single Hpc7g instance. The performance is measured in ns/day (higher is better), i.e., the number of nanoseconds of the system's dynamics that can be simulated in one day of computing.
Core t (s) Wall t (s) (%)
Time: 17989.180 281.082 6400.0
(ns/day) (hour/ns)
Performance: 6.149 3.903
Finished mdrun on rank 0 Fri May 12 22:18:17 2023
code_saturne is a free, general-purpose computational fluid dynamics (CFD) software package. Developed since 1997 at Électricité de France R&D, code_saturne is distributed under the GNU GPL licence.
Use this script with the command ./scripts-code_saturne/install-codesaturne-gcc-mpi4.sh to build code_saturne with GCC. The configuration below uses the BLAS library from ArmPL. The default multi-grid solver is cs_sles_solve_native. Users can change the solver and solver settings (n_max_iter_coarse_solver, min_g_cells) by updating ./src/user/cs_user_parameters.c. This user parameters file example shows how to select the CS_SLES_P_SYM_GAUSS_SEIDEL solver for better solver performance.
cd /shared/tools
module use /shared/arm/modulefiles
module load armpl
export PATH=/shared/openmpi-4.1.6/bin:$PATH
export LD_LIBRARY_PATH=/shared/openmpi-4.1.6/lib:$LD_LIBRARY_PATH
export CC=mpicc
export CXX=mpicxx
export FC=mpif90
export F77=mpif90
export F90=mpif90
if [ ! -d code_saturne-8.0.2 ]; then
wget https://www.code-saturne.org/releases/code_saturne-8.0.2.tar.gz
tar xf code_saturne-8.0.2.tar.gz
fi
cd code_saturne-8.0.2
PREFIX=/shared/code_saturne_8.0-mpi4
mkdir build-mpi4
cd build-mpi4
../configure CC=${CC} CXX=${CXX} FC=${FC} \
--with-blas=$ARMPL_LIBRARIES --prefix=$PREFIX \
--disable-gui --without-med \
--without-hdf5 --without-cgns \
--without-metis --disable-salome \
--without-salome --without-eos \
--disable-static --enable-long-gnum \
--enable-profile
make -j
make install
The code_saturne benchmark data can be generated using the following procedures.
mkdir -p /shared/data-codesaturne && cd /shared/data-codesaturne
git clone https://github.com/code-saturne/saturne-open-cases.git
cd /shared/data-codesaturne/saturne-open-cases/BUNDLE/BENCH_F128_PREPROCESS/DATA
$PREFIX/bin/code_saturne run --initialize
cd /shared/data-codesaturne/saturne-open-cases/BUNDLE/BENCH_F128_PREPROCESS/RESU/extrude_128
./run_solver
cd /shared/data-codesaturne/saturne-open-cases/BUNDLE/BENCH_F128_02/DATA
$PREFIX/bin/code_saturne run --initialize
After that, you can use the following Slurm batch script with the command sbatch scripts-code_saturne/submit-F128-2-hpc7g-gcc-mpi4.sh to run the benchmark.
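A minimal sketch of such a script is shown below; the run directory created by code_saturne run --initialize depends on the run id, so the path here is illustrative and the repository's submit-F128-2-hpc7g-gcc-mpi4.sh is the reference.
#!/bin/bash
# sketch: run the F128 case generated above on two Hpc7g instances
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=64
#SBATCH --exclusive
export PATH=/shared/openmpi-4.1.6/bin:$PATH
export LD_LIBRARY_PATH=/shared/openmpi-4.1.6/lib:$LD_LIBRARY_PATH
cd /shared/data-codesaturne/saturne-open-cases/BUNDLE/BENCH_F128_02/RESU/<run_id>
./run_solver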
At the end of the benchmark run, you will find run_solver.log and performance.log in the run directory. These logs contain the correctness and performance information of the run. You can find the elapsed time of the job in performance.log; a sample is shown below.
Calculation time summary:
User CPU time: 294.198 s
System CPU time: 13.001 s
Total CPU time: 57958.255 s
Elapsed time: 318.384 s
CPU / elapsed time 0.965
Ideally, as you add more resources, the runtime of an HPC application should decrease linearly. When scaling is sub-linear or worse, it is usually because of non-optimal communication patterns. To debug these cases, open-source tools such as the Tau Performance System can generate profiling and tracing reports to help you locate the bottlenecks.
Configure and build Tau as follows (shown here for an AWS EC2 instance launched in a ParallelCluster setup):
$ ./configure -prefix=/shared/TauOpenMPI \
-mpi \
-mpiinc=/opt/amazon/openmpi/include \
-mpilib=/opt/amazon/openmpi/lib
After building and installing the profiler, collect a profile by executing the command below:
$ mpirun tau_exec mpiApplication > ./output.log 2>&1
A successful Tau profile collection creates profile.* files. You can visualize the results using the paraprof or pprof utilities in Tau. Shown below is a summary profile generated with the command pprof -s.
FUNCTION SUMMARY (mean):
---------------------------------------------------------------------------------------
%Time Exclusive Inclusive #Call #Subrs Inclusive Name
msec total msec usec/call
---------------------------------------------------------------------------------------
100.0 0.556 2:11.067 1 1 131067754 .TAU application
100.0 1:09.130 2:11.067 1 478495 131067198 taupreload_main
27.9 14,889 36,577 171820 171820 213 MPI_Allreduce()
16.8 22,037 22,037 172288 0 128 MPI Collective Sync
9.7 12,708 12,708 94456 0 135 MPI_Waitall()
2.8 3,624 3,624 1 0 3624935 MPI_Finalize()
2.7 3,518 3,518 1 0 3518172 MPI_Init_thread()
2.2 2,920 2,920 3597.37 0 812 MPI_Recv()
1.1 1,475 1,475 438.314 0 3367 MPI_Probe()
The table below lists HPC compilers and options that you can use for Graviton instances:
Compiler | Minimum version | Target: Graviton3 and up | Enable OpenMP | Fast Math |
---|---|---|---|---|
GCC | 11 | -O3 -mcpu=neoverse-v1 | -fopenmp | -ffast-math |
CLang/LLVM | 14 | -O3 -mcpu=neoverse-512tvb | -fopenmp | -ffast-math |
Arm Compiler for Linux | 23.04 | -O3 -mcpu=neoverse-512tvb | -fopenmp | -ffast-math |
Nvidia HPC SDK | 23.1 | -O3 -tp=neoverse-v1 | -mp | -fast |
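For example, assuming a hypothetical source file stream.c, the options in the table are applied as follows:
# GCC 11 or later
gcc -O3 -mcpu=neoverse-v1 -fopenmp -ffast-math stream.c -o stream
# Arm Compiler for Linux 23.04
armclang -O3 -mcpu=neoverse-512tvb -fopenmp -ffast-math stream.c -o stream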
Below is a list of some common HPC applications that run on Graviton.
ISV | Application | Release of support | Additional Notes |
---|---|---|---|
Ansys | Fluent | v221 | Graviton Applications (AWS) |
Ansys | LS-Dyna | 12.1 | Graviton Applications (AWS), ANSYS Deployment (Rescale) |
Ansys | RedHawk-SC | 2023R1 | Release Notes |
Fritz Haber Institute | FHIaims | 21.02 | Quantum Chemistry (AWS) |
National Center for Atmospheric Research | WRF | WRFV4.5 | Weather on Graviton (AWS), WRF on Graviton2 (ARM) |
OpenFOAM Foundation / ESI | OpenFOAM | OpenFOAM7 | Getting Best Performance (AWS), Graviton Applications (AWS), Instructions (AWS) |
Sentieon | DNAseq , TNseq, DNAscope | 202112.02 | Release Notes, Cost Effective Genomics (AWS) |
Siemens | StarCCM++ | 2023.2 | Release Notes |
Université de Genève | Palabos | 2010 | Lattice-Boltzmann Palabos (AWS) |
Altair Engineering | OpenRadioss | 20231204 | Presentations-Aachen270623 - OpenRadioss, Instructions |
Électricité de France | Code Saturne | 8.0.2 | https://www.code-saturne.org/cms/web/documentation/Tutorials |
HEXAGON | Cradle CFD | 2024.1 | Release Notes |