-
Notifications
You must be signed in to change notification settings - Fork 3
Rewriting RRTMGP in C Plus Plus
The main motivation for a rewrite in C++ is to avoid the many subtle compiler bugs present in OpenMP and OpenACC runtimes for PGI, GNU, XL, and Cray, especially given the large use of modern Fortran features in the RRTMGP code, such as type-bound procedures in classes and class references present in kernels. We've been working on the OpenACC port for several years, and still it is not well supported by PGI's OpenACC. With a C++ port, we are pretty much guaranteed the code will perform well on AMD, Intel, and Nvidia GPUs with a single source code. Also, experience has shown the number of bugs in C++ compiler implementations are significantly fewer than Fortran directives-based approaches. Finally, even with Fortran directives, it may still be necessary to have both OpenMP and OpenACC directives, which will not be ideal.
Luckily, the RRTMGP codebase comes with its own testing framework using clear sky and cloudy forcing data. We will use this testing framework for the vast majority of our testing. The results can be diff'd based on an envelope defined by -O0
and -O3
compiler optimizations as an expected bit-level difference.
There are parts of the code that will not be covered in the existing testing framework, so before moving each piece into C++, we will need to ensure that piece is covered by the test (a printf
will suffice for this). For any piece not covered by the existing testing framework, we will need to create a unit test for that, presumably forced by random data.
Each testing cycle should also include valgrind
, cuda-memcheck
, Fortran bounds checking, and C++ MD Array bounds checking as well. This can be done in three total passes: (1) A valgrind
pass on the CPU, (2) A Fortran and C++ bounds checking pass on the CPU, and (3) A cuda-memcheck
pass on the GPU with answer checking. Given the testing framework runs in tens of seconds, this should be OK for a development cycle time length.
The ideal workflow for an effort of this type is to change as little code as possible at a time between testing cycles. That way, when a bug is introduced, it can only be due to a very limited number of lines of code. It makes maintaining a bug-free code significantly easier. Along these lines, the plan is to port one routine at a time starting at the lowest level kernels first, and working our way up from there.
We will use the Fortran iso_c_binding
to call C++ routines from Fortran. Let's start with the routine below:
pure subroutine combine_and_reorder_2str(ncol, nlay, ngpt, tau_abs, tau_rayleigh, tau, ssa, g) &
bind(C, name="combine_and_reorder_2str")
integer, intent(in) :: ncol, nlay, ngpt
real(wp), dimension(ngpt,nlay,ncol), intent(in ) :: tau_abs, tau_rayleigh
real(wp), dimension(ncol,nlay,ngpt), intent(inout) :: tau, ssa, g ! inout because components are allocated
integer :: icol, ilay, igpt
real(wp) :: t
do icol = 1, ncol
do ilay = 1, nlay
do igpt = 1, ngpt
t = tau_abs(igpt,ilay,icol) + tau_rayleigh(igpt,ilay,icol)
tau(icol,ilay,igpt) = t
g (icol,ilay,igpt) = 0._wp
if(t > 2._wp * tiny(t)) then
ssa(icol,ilay,igpt) = tau_rayleigh(igpt,ilay,icol) / t
else
ssa(icol,ilay,igpt) = 0._wp
end if
end do
end do
end do
end subroutine combine_and_reorder_2str
The first step is to create a Fortran interface for this routine:
interface
subroutine combine_and_reorder_2str(ncol, nlay, ngpt, tau_abs, tau_rayleigh, tau, ssa, g) &
bind(C, name="combine_and_reorder_2str")
use iso_c_binding
c_int , intent(in ) :: ncol, nlay, ngpt
type(c_ptr), intent(in ) :: tau_abs, tau_rayleigh
type(c_ptr), intent(inout) :: tau, ssa, g
end subroutine gator_init_c
end interface
Then, the C++ routine can be defined as follows:
typedef double real;
typedef Kokkos::View<real* , Kokkos::LayoutRight, Kokkos::HostSpace, Kokkos::MemoryTraits<Kokkos::Unmanaged> > real1d;
typedef Kokkos::View<real** , Kokkos::LayoutRight, Kokkos::HostSpace, Kokkos::MemoryTraits<Kokkos::Unmanaged> > real2d;
typedef Kokkos::View<real*** , Kokkos::LayoutRight, Kokkos::HostSpace, Kokkos::MemoryTraits<Kokkos::Unmanaged> > real3d;
typedef Kokkos::View<real**** , Kokkos::LayoutRight, Kokkos::HostSpace, Kokkos::MemoryTraits<Kokkos::Unmanaged> > real4d;
typedef Kokkos::View<real***** , Kokkos::LayoutRight, Kokkos::HostSpace, Kokkos::MemoryTraits<Kokkos::Unmanaged> > real5d;
typedef Kokkos::View<real****** , Kokkos::LayoutRight, Kokkos::HostSpace, Kokkos::MemoryTraits<Kokkos::Unmanaged> > real6d;
typedef Kokkos::View<real*******, Kokkos::LayoutRight, Kokkos::HostSpace, Kokkos::MemoryTraits<Kokkos::Unmanaged> > real7d;
#include <limits>
extern "C" void combine_and_reorder_2str( int ncol, int nlay, int ngpt, real const *tau_abs_ptr ,
real const *tau_rayleigh_ptr , real *tau_ptr , real *ssa_ptr ,
real *g_ptr ) {
// This replaces the Fortran tiny intrinsic function
real realtiny = std::numeric_limits<real>::min();
// Immediately wrap all pointers in unmanaged Kokkos Views
real3d tau_abs ( tau_abs_ptr , ncol , nlay , ngpt );
real3d tau_rayleigh( tau_rayleigh_ptr , ncol , nlay , ngpt );
real3d tau ( tau_ptr , ngpt , nlay , ncol );
real3d ssa ( ssa_ptr , ngpt , nlay , ncol );
real3d g ( g_ptr , ngpt , nlay , ncol );
// Function body
for (int icol=0; icol < ncol; icol++) {
for (int ilay=0; ilay < nlay; ilay++) {
for (int igpt=0; igpt < ngpt; igpt++) {
real t = tau_abs(icol,ilay,igpt) + tau_rayleigh(icol,ilay,igpt);
tau(igpt,ilay,icol) = t;
g (igpt,ilay,icol) = 0;
if (t > 2*realtiny) {
ssa(igpt,ilay,icol) = tau_rayleigh(icol,ilay,igpt) / t;
} else {
ssa(igpt,ilay,icol) = 0;
}
}
}
}
}
Some points to explain what's happening above:
- Make sure the Fortran real types match the C++ types (
double
versusfloat
). In this case,double
will be the default type. This is easiest to handle with a C++typedef
like above. - Fortran
intent(in)
variables will have aconst
keyword (double const *varname
). Note that in C++, you read lines from right to left. This means "varname is a pointer to a constant double", or rather, the data pointed to by the pointer does not change in this routine. - Fortran
intent(out)
andintent(inout)
will not have theconst
attribute. In most Fortran implementations,out
andinout
intents are handled identically anyway. - It's best to immediately wrap the raw pointers in unmanaged Kokkos
View
s. The reason is that we can keep the multi-dimensional array syntax, and we can use Kokkos array bounds checking. -
Kokkos::View<real**, Kokkos::LayoutRight, Kokkos::HostSpace, Kokkos::MemoryTraits<Kokkos::Unmanaged> >
-
real**
: The number of asterisks is the number of dimensions -
Kokkos::LayoutRight
: This says that the right index varies the fastest, and it's a contiguous block of memory -
Kokkos::HostSpace
: This is a CPU array, not a GPU array. (We'll GPU port later) -
Kokkos::MemoryTraits<Kokkos::Unmanaged>
: The data is already allocated. This tells Kokkos to simply wrap the data in a convenient multi-dimensional syntax rather than mange its allocation, deallocation, and reference counting.
-
-
"But don't you have to transpose data between C and Fortran?"
- In Fortran, it's a contiguous block of memory with the left index varying the fastest. In C++, we're defining this same block of memory as contiguous with the right index varying the fastest.
- The only thing you have to make sure is that the index that's varying the fastest is still varying the fastest. Therefore, when we define
LayoutRight
in theView
, we need to transpose the indices. The main reason for definingLayoutRight
is because that's what people expect when they read C / C++.
-
extern "C"
is required for linking purposes. - Once you create the C++ routine and the Fortran interface, you can then call the function exactly the same as you did before from Fortran.
- Fortran requires you to declare all local variables before the function code, but C++ allows you to declare them wherever you want. It's good form to only declare local variables in the scope where you actually need them to avoid a number of potential bugs you could otherwise introduce.
- As you port higher level routines to C++ that are calling these kernels, you can move the unmanaged Kokkos Views outside the routines and into the higher level, and you can get rid of the
extern "C"
declarations as follows:
void combine_and_reorder_2str( int ncol, int nlay, int ngpt, real3d const tau_abs ,
real3d const tau_rayleigh , real3d tau , real3d ssa ,
real3d g ) {
// This replaces the Fortran tiny intrinsic function
real realtiny = std::numeric_limits<real>::min();
// Function body
for (int icol=0; icol < ncol; icol++) {
for (int ilay=0; ilay < nlay; ilay++) {
for (int igpt=0; igpt < ngpt; igpt++) {
real t = tau_abs(icol,ilay,igpt) + tau_rayleigh(icol,ilay,igpt);
tau(igpt,ilay,icol) = t;
g (igpt,ilay,icol) = 0;
if (t > 2*realtiny) {
ssa(igpt,ilay,icol) = tau_rayleigh(icol,ilay,igpt) / t;
} else {
ssa(igpt,ilay,icol) = 0;
}
}
}
}
}
Once the code is giving the right answer with for
loops, we can then move it to Kokkos::parallel_for
with:
// for (int icol=0; icol < ncol; icol++) {
// for (int ilay=0; ilay < nlay; ilay++) {
// for (int igpt=0; igpt < ngpt; igpt++) {
Kokkos::parallel_for( ncol*nlay*ngpt , KOKKOS_LAMBDA (int iGlob) {
int icol, ilay, igpt;
unpackIndices( iGlob , ncol , nlay , ngpt , icol , ilay , igpt );
real t = tau_abs(icol,ilay,igpt) + tau_rayleigh(icol,ilay,igpt);
tau(igpt,ilay,icol) = t;
g (igpt,ilay,icol) = 0;
if (t > 2*realtiny) {
ssa(igpt,ilay,icol) = tau_rayleigh(icol,ilay,igpt) / t;
} else {
ssa(igpt,ilay,icol) = 0;
}
});
where the unpackInidices
functions can be copied from https://github.com/mrnorman/YAKL/blob/master/YAKL.h , and YAKL_INLINE is replaced with KOKKOS_INLINE.
For routines not covered by the existing testing framework in RRTMGP, we will need to create unit tests for those. In those cases, it's easy to just call the C++ routine from a test driver, meaning we do not need to modify the tested routine at all. Also, it's best to keep test code in a separate directory for unit tests so that we don't clutter the main code directory.
We should try to mimic the file structure of RRTMGP. Therefore, each file in RRTMGP will have a corrosponding .cpp
and .h
file. The reason for the header files is to be able to call any of the routines in a test driver in a different directory.