
Utilizing Heterogeneous Nodes


MrHyDE is designed to take advantage of heterogeneous computational architectures, including on-node parallelism, to enable large-scale simulations. Most second-generation Trilinos packages enable this through an MPI+X perspective. To take advantage of these capabilities, one needs to build both Trilinos and MrHyDE with the appropriate configurations. This is not a trivial task, and more details are provided below.

MPI Parallelism

At this time, MrHyDE only uses MPI for inter-process (distributed-memory) communication. When MPI parallelism is used, the mesh is decomposed into the same number of partitions as the number of requested MPI processes. For the most part, this distributed-memory parallelism happens behind the scenes in several of the Trilinos packages, e.g., Tpetra, Belos, and MueLu. The linear algebra objects that MrHyDE uses all have the concept of owned and owned+shared data structures. The assembly routines require the owned+shared data structures, while the linear solvers require the owned data structures. The linear algebra interface handles most of the interaction with the appropriate functionality in Trilinos to go back and forth between these data structures. However, it is worth noting that duplicating these data structures induces a significant increase in memory usage. Finally, we note that MPI allReduce commands are used in a few places in MrHyDE to compute a total or a maximum of a quantity over all of the MPI processes, as in the sketch below.
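
As an illustration of that last point, the following sketch shows how such a global reduction is typically performed through Teuchos. It is not taken from the MrHyDE source; the function name and output format are hypothetical.

```cpp
// Minimal sketch (not MrHyDE source): a global sum and max over all MPI
// processes via Teuchos, as MrHyDE does in a few places.
#include <iostream>
#include <Teuchos_DefaultComm.hpp>
#include <Teuchos_CommHelpers.hpp>

void reportGlobalQuantity(const double localValue) {  // hypothetical helper
  auto comm = Teuchos::DefaultComm<int>::getComm();

  double globalSum = 0.0, globalMax = 0.0;
  // Total of the local contributions over all ranks.
  Teuchos::reduceAll<int, double>(*comm, Teuchos::REDUCE_SUM, localValue,
                                  Teuchos::outArg(globalSum));
  // Largest local contribution over all ranks.
  Teuchos::reduceAll<int, double>(*comm, Teuchos::REDUCE_MAX, localValue,
                                  Teuchos::outArg(globalMax));

  if (comm->getRank() == 0) {
    std::cout << "total: " << globalSum << "  max: " << globalMax << std::endl;
  }
}
```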

On-Node Parallelism

On-node parallelism refers to the use of multiple threads or concurrently executed tasks within a given node. The availability of on-node parallelism depends on the computational architecture. Many modern CPU cores support multiple hardware threads, typically two or four per core, and modern heterogeneous systems, such as hybrid CPU/GPU systems, provide several thousand SIMD threads. Unfortunately, designing algorithms and data structures to exploit the full computational power of any of these architectures is not trivial and may vary significantly between architectures. MrHyDE utilizes Kokkos to enable efficient implementations that are performance portable across a wide range of architectures. A full description of Kokkos is beyond the scope of this document. Briefly, MrHyDE utilizes Kokkos Devices, which are a MemorySpace plus an ExecutionSpace, to define where and how computations occur on the node. Two different devices are used prevalently throughout the code: HostDevice and AssemblyDevice (a sketch of how these devices map onto Kokkos types follows the lists below). At this time, the HostDevice must be:

  • <Kokkos::Serial, Kokkos::HostSpace>

The AssemblyDevice can be:

  • <Kokkos::Serial, Kokkos::HostSpace>,
  • <Kokkos::OpenMP, Kokkos::HostSpace>,
  • <Kokkos::Cuda, Kokkos::CudaSpace>.
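
As a rough sketch of what these choices look like in code (the exact type aliases used inside MrHyDE may differ; this is an illustrative assumption), each device is a Kokkos::Device specialization pairing an ExecutionSpace with a MemorySpace:

```cpp
// Illustrative sketch only -- the actual MrHyDE typedefs may differ.
#include <Kokkos_Core.hpp>

// HostDevice: serial execution in host memory.
using HostDevice = Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>;

// AssemblyDevice: one of the tested options, selected at configure time.
#if defined(KOKKOS_ENABLE_CUDA)
using AssemblyDevice = Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>;
#elif defined(KOKKOS_ENABLE_OPENMP)
using AssemblyDevice = Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>;
#else
using AssemblyDevice = Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>;
#endif

// Views templated on these devices control where data lives and where
// parallel kernels execute.
using AssemblyView = Kokkos::View<double**, AssemblyDevice>;
using HostView     = Kokkos::View<double**, HostDevice>;
```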

Additional options are available through Kokkos, but have not been tested at this time. The last consideration is where the linear algebra objects will be allocated and utilized. At this time, the linear algebra packages, and therefore the linear algebra interface, use the concept of a Node rather than a Device, although the differences between the two are minimal. MrHyDE can use the following SolverNodes (see the sketch after this list):

  • Kokkos::Compat::KokkosSerialWrapperNode,
  • Kokkos::Compat::KokkosOpenMPWrapperNode,
  • Kokkos::Compat::KokkosCudaWrapperNode.
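
As an illustrative sketch (the alias names below are assumptions, not the MrHyDE definitions), the chosen SolverNode is simply the Node template parameter that the Tpetra linear algebra objects are instantiated with, which determines where they are allocated and used:

```cpp
// Illustrative sketch only -- alias names are hypothetical.
#include <Tpetra_MultiVector.hpp>
#include <Tpetra_CrsMatrix.hpp>

// Choose one of the supported wrapper nodes.
using SolverNode = Kokkos::Compat::KokkosSerialWrapperNode;
// using SolverNode = Kokkos::Compat::KokkosOpenMPWrapperNode;
// using SolverNode = Kokkos::Compat::KokkosCudaWrapperNode;

// The node type propagates into the Tpetra objects used by the linear
// algebra interface (assuming these ordinal types are enabled in the
// Trilinos build).
using Scalar = double;
using LO     = int;        // local ordinal
using GO     = long long;  // global ordinal
using LA_MultiVector = Tpetra::MultiVector<Scalar, LO, GO, SolverNode>;
using LA_CrsMatrix   = Tpetra::CrsMatrix<Scalar, LO, GO, SolverNode>;
```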

The choice of the appropriate AssemblyDevice and SolverNode can have a tremendous impact on performance, but depends on the problem being solved. Not all preconditioners are available on the GPU, and solver performance may actually be worse on the GPU than on the CPU. Moreover, many linear solvers and preconditioners require substantial amounts of memory, which may exceed what is available on the GPU. Finally, we comment on one notable exception among the MemorySpaces:

  • Kokkos::CudaUVMSpace.

UVM, or Unified Virtual Memory, is a memory space that is accessible to both the CPU and the GPU on a heterogeneous device. While this space was quite popular several years ago due to the relative ease of implementation and the ability to overlap some communication with computation, it is not a concept that applies to all GPU architectures and can cause unintended memory transfers; it is therefore not recommended for use in MrHyDE. At this time, MrHyDE does not utilize UVM space in any of its custom routines. However, Trilinos has not completely purged UVM from all of its packages, and UVM is still the default MemorySpace for some packages when Cuda is enabled. To mitigate this discrepancy, MrHyDE employs a compatibility layer with temporary Kokkos Views that use UVM memory. The data in these Views gets copied into MrHyDE's UVM-free Views once the interaction with the specific Trilinos package is complete, as sketched below. This memory transfer will be removed eventually.
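
A minimal sketch of this kind of compatibility layer is shown below, assuming a Trilinos package hands back a UVM-backed View; the function and variable names are hypothetical:

```cpp
// Illustrative sketch only -- names are hypothetical, not MrHyDE source.
#include <Kokkos_Core.hpp>

#if defined(KOKKOS_ENABLE_CUDA)
// Temporary View in UVM memory, used only while interacting with a
// Trilinos package that still defaults to Kokkos::CudaUVMSpace.
using UVM_View = Kokkos::View<double*, Kokkos::CudaUVMSpace>;
// UVM-free View used everywhere else in MrHyDE's routines.
using Dev_View = Kokkos::View<double*, Kokkos::CudaSpace>;

Dev_View copyOffUVM(const UVM_View & uvm_data) {
  // Allocate a UVM-free View of the same size and copy the data over,
  // so subsequent kernels never touch UVM memory.
  Dev_View dev_data("uvm-free copy", uvm_data.extent(0));
  Kokkos::deep_copy(dev_data, uvm_data);
  return dev_data;
}
#endif
```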

Enabling Parallelism