Rewriting RRTMGP in C Plus Plus

Why rewrite in C++?

The main motivation for a rewrite in C++ is to avoid the many subtle compiler bugs present in OpenMP and OpenACC runtimes for PGI, GNU, XL, and Cray, especially given the large use of modern Fortran features in the RRTMGP code, such as type-bound procedures in classes and class references present in kernels. We've been working on the OpenACC port for several years, and still it is not well supported by PGI's OpenACC. With a C++ port, we are pretty much guaranteed the code will perform well on AMD, Intel, and Nvidia GPUs with a single source code. Also, experience has shown the number of bugs in C++ compiler implementations are significantly fewer than Fortran directives-based approaches. Finally, even with Fortran directives, it may still be necessary to have both OpenMP and OpenACC directives, which will not be ideal.

Workflow

Testing Framework

Luckily, the RRTMGP codebase comes with its own testing framework using clear sky and cloudy forcing data. We will use this testing framework for the vast majority of our testing. The results can be diff'd based on an envelope defined by -O0 and -O3 compiler optimizations as an expected bit-level difference.

There are parts of the code that will not be covered in the existing testing framework, so before moving each piece into C++, we will need to ensure that piece is covered by the test (a printf will suffice for this). For any piece not covered by the existing testing framework, we will need to create a unit test for that, presumably forced by random data.

Each testing cycle should also include valgrind, cuda-memcheck, Fortran bounds checking, and C++ MD Array bounds checking as well. This can be done in three total passes: (1) A valgrind pass on the CPU, (2) A Fortran and C++ bounds checking pass on the CPU, and (3) A cuda-memcheck pass on the GPU with answer checking. Given the testing framework runs in tens of seconds, this should be OK for a development cycle time length.

Incremental Porting

The ideal workflow for an effort of this type is to change as little code as possible at a time between testing cycles. That way, when a bug is introduced, it can only be due to a very limited number of lines of code. It makes maintaining a bug-free code significantly easier. Along these lines, the plan is to port one routine at a time starting at the lowest level kernels first, and working our way up from there.

Workflow for Each Routine

We will use the Fortran iso_c_binding to call C++ routines from Fortran. Let's start with the routine below:

pure subroutine combine_and_reorder_2str(ncol, nlay, ngpt, tau_abs, tau_rayleigh, tau, ssa, g) &
  bind(C, name="combine_and_reorder_2str")
  integer,                             intent(in) :: ncol, nlay, ngpt
  real(wp), dimension(ngpt,nlay,ncol), intent(in   ) :: tau_abs, tau_rayleigh
  real(wp), dimension(ncol,nlay,ngpt), intent(inout) :: tau, ssa, g ! inout because components are allocated
  integer  :: icol, ilay, igpt
  real(wp) :: t

  do icol = 1, ncol
    do ilay = 1, nlay
      do igpt = 1, ngpt
         t = tau_abs(igpt,ilay,icol) + tau_rayleigh(igpt,ilay,icol)
         tau(icol,ilay,igpt) = t
         g  (icol,ilay,igpt) = 0._wp
         if(t > 2._wp * tiny(t)) then
           ssa(icol,ilay,igpt) = tau_rayleigh(igpt,ilay,icol) / t
         else
           ssa(icol,ilay,igpt) = 0._wp
         end if
      end do
    end do
  end do
end subroutine combine_and_reorder_2str

The first step is to create a Fortran interface for this routine:

interface
  subroutine combine_and_reorder_2str(ncol, nlay, ngpt, tau_abs, tau_rayleigh, tau, ssa, g) &
                bind(C, name="combine_and_reorder_2str")
    use iso_c_binding
    c_int      , intent(in   ) :: ncol, nlay, ngpt
    type(c_ptr), intent(in   ) :: tau_abs, tau_rayleigh
    type(c_ptr), intent(inout) :: tau, ssa, g
  end subroutine gator_init_c
end interface

Then, the C++ routine can be defined as follows:

typedef double real;
typedef Kokkos::View<real*      , Kokkos::LayoutRight, Kokkos::HostSpace, Kokkos::MemoryTraits<Kokkos::Unmanaged> > real1d;
typedef Kokkos::View<real**     , Kokkos::LayoutRight, Kokkos::HostSpace, Kokkos::MemoryTraits<Kokkos::Unmanaged> > real2d;
typedef Kokkos::View<real***    , Kokkos::LayoutRight, Kokkos::HostSpace, Kokkos::MemoryTraits<Kokkos::Unmanaged> > real3d;
typedef Kokkos::View<real****   , Kokkos::LayoutRight, Kokkos::HostSpace, Kokkos::MemoryTraits<Kokkos::Unmanaged> > real4d;
typedef Kokkos::View<real*****  , Kokkos::LayoutRight, Kokkos::HostSpace, Kokkos::MemoryTraits<Kokkos::Unmanaged> > real5d;
typedef Kokkos::View<real****** , Kokkos::LayoutRight, Kokkos::HostSpace, Kokkos::MemoryTraits<Kokkos::Unmanaged> > real6d;
typedef Kokkos::View<real*******, Kokkos::LayoutRight, Kokkos::HostSpace, Kokkos::MemoryTraits<Kokkos::Unmanaged> > real7d;

#include <limits>

extern "C" void combine_and_reorder_2str( int ncol, int nlay, int ngpt, real const *tau_abs_ptr , 
                                          real const *tau_rayleigh_ptr , real *tau_ptr , real *ssa_ptr ,
                                          real *g_ptr ) {
  // This replaces the Fortran tiny intrinsic function
  real realtiny = std::numeric_limits<real>::min();

  // Immediately wrap all pointers in unmanaged Kokkos Views
  real3d tau_abs     ( tau_abs_ptr      , ncol , nlay , ngpt );
  real3d tau_rayleigh( tau_rayleigh_ptr , ncol , nlay , ngpt );
  real3d tau         ( tau_ptr          , ngpt , nlay , ncol );
  real3d ssa         ( ssa_ptr          , ngpt , nlay , ncol );
  real3d g           ( g_ptr            , ngpt , nlay , ncol );

  // Function body
  for (int icol=0; icol < ncol; icol++) {
    for (int ilay=0; ilay < nlay; ilay++) {
      for (int igpt=0; igpt < ngpt; igpt++) {
        real t = tau_abs(icol,ilay,igpt) + tau_rayleigh(icol,ilay,igpt);
        tau(igpt,ilay,icol) = t;
        g  (igpt,ilay,icol) = 0;
        if (t > 2*realtiny) {
          ssa(igpt,ilay,icol) = tau_rayleigh(icol,ilay,igpt) / t;
        } else {
          ssa(igpt,ilay,icol) = 0;
        }
      }
    }
  }
}

Some points to explain what's happening above:

Make sure the Fortran real types match the C++ types (double versus float). In this case, double will be the default type. This is easiest to handle with a C++ typedef like above.
Fortran intent(in) variables will have a const keyword (double const *varname). Note that in C++, you read lines from right to left. This means "varname is a pointer to a constant double", or rather, the data pointed to by the pointer does not change in this routine.
Fortran intent(out) and intent(inout) will not have the const attribute. In most Fortran implementations, out and inout intents are handled identically anyway.
It's best to immediately wrap the raw pointers in unmanaged Kokkos Views. The reason is that we can keep the multi-dimensional array syntax, and we can use Kokkos array bounds checking.
Kokkos::View<real**, Kokkos::LayoutRight, Kokkos::HostSpace, Kokkos::MemoryTraits<Kokkos::Unmanaged> >
- real**: The number of asterisks is the number of dimensions
- Kokkos::LayoutRight: This says that the right index varies the fastest, and it's a contiguous block of memory
- Kokkos::HostSpace: This is a CPU array, not a GPU array. (We'll GPU port later)
- Kokkos::MemoryTraits<Kokkos::Unmanaged>: The data is already allocated. This tells Kokkos to simply wrap the data in a convenient multi-dimensional syntax rather than mange its allocation, deallocation, and reference counting.
"But don't you have to transpose data between C and Fortran?"
- In Fortran, it's a contiguous block of memory with the left index varying the fastest. In C++, we're defining this same block of memory as contiguous with the right index varying the fastest.
- The only thing you have to make sure is that the index that's varying the fastest is still varying the fastest. Therefore, when we define LayoutRight in the View, we need to transpose the indices. The main reason for defining LayoutRight is because that's what people expect when they read C / C++.
extern "C" is required for linking purposes.
Once you create the C++ routine and the Fortran interface, you can then call the function exactly the same as you did before from Fortran.
Fortran requires you to declare all local variables before the function code, but C++ allows you to declare them wherever you want. It's good form to only declare local variables in the scope where you actually need them to avoid a number of potential bugs you could otherwise introduce.
As you port higher level routines to C++ that are calling these kernels, you can move the unmanaged Kokkos Views outside the routines and into the higher level, and you can get rid of the extern "C" declarations as follows:

void combine_and_reorder_2str( int ncol, int nlay, int ngpt, real3d const tau_abs , 
                               real3d const tau_rayleigh , real3d tau , real3d ssa ,
                               real3d g ) {
  // This replaces the Fortran tiny intrinsic function
  real realtiny = std::numeric_limits<real>::min();

  // Function body
  for (int icol=0; icol < ncol; icol++) {
    for (int ilay=0; ilay < nlay; ilay++) {
      for (int igpt=0; igpt < ngpt; igpt++) {
        real t = tau_abs(icol,ilay,igpt) + tau_rayleigh(icol,ilay,igpt);
        tau(igpt,ilay,icol) = t;
        g  (igpt,ilay,icol) = 0;
        if (t > 2*realtiny) {
          ssa(igpt,ilay,icol) = tau_rayleigh(icol,ilay,igpt) / t;
        } else {
          ssa(igpt,ilay,icol) = 0;
        }
      }
    }
  }
}

Once the code is giving the right answer with for loops, we can then move it to Kokkos::parallel_for with:

  // for (int icol=0; icol < ncol; icol++) {
  //   for (int ilay=0; ilay < nlay; ilay++) {
  //     for (int igpt=0; igpt < ngpt; igpt++) {
  Kokkos::parallel_for( ncol*nlay*ngpt , KOKKOS_LAMBDA (int iGlob) {
    int icol, ilay, igpt;
    unpackIndices( iGlob , ncol , nlay , ngpt , icol , ilay , igpt );

    real t = tau_abs(icol,ilay,igpt) + tau_rayleigh(icol,ilay,igpt);
    tau(igpt,ilay,icol) = t;
    g  (igpt,ilay,icol) = 0;
    if (t > 2*realtiny) {
      ssa(igpt,ilay,icol) = tau_rayleigh(icol,ilay,igpt) / t;
    } else {
      ssa(igpt,ilay,icol) = 0;
    }
  });

where the unpackInidices functions can be copied from https://github.com/mrnorman/YAKL/blob/master/YAKL.h , and YAKL_INLINE is replaced with KOKKOS_INLINE.

Unit Tests

For routines not covered by the existing testing framework in RRTMGP, we will need to create unit tests for those. In those cases, it's easy to just call the C++ routine from a test driver, meaning we do not need to modify the tested routine at all. Also, it's best to keep test code in a separate directory for unit tests so that we don't clutter the main code directory.

File Organization

We should try to mimic the file structure of RRTMGP. Therefore, each file in RRTMGP will have a corrosponding .cpp and .h file. The reason for the header files is to be able to call any of the routines in a test driver in a different directory.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly