
Random number generation


Currently, there are two random number buffer classes implemented for the case where we know the number of random numbers needed per kernel or function call: one buffer for random numbers used in host code and one for random numbers used in device code. Additionally, we use on-the-fly random number generation in device code when we don't know how many random numbers we will need before execution (which is the case in our implementation of the binomial function). The details are explained below.

rand() and randn()

Whenever we use rand/randn calls, we know how many random numbers we need and can use our buffer systems, which differ between host and device code.

Host code

For synapse generation we use the same algorithm as brian2, which needs random numbers (see here). The number of needed random numbers is not known before execution, so random numbers either need to be generated on the fly (slow) or a buffer system is needed. We implemented the latter. curand_buffer.h implements the CurandBuffer class, a random number buffer for use in host code which generates random numbers on a CUDA device using the cuRAND host API. These numbers are then copied over to host memory and used for synapse generation. The class implements the operator[] method, which ignores its index and just returns the next random number. Once all numbers are used, new random numbers are generated on the device and copied to the host. This allows us to use the synapse generation mechanism from brian2 without much change.
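
To illustrate the idea, here is a minimal sketch of such a host-side buffer. The class name, float type and buffer size are illustrative, not the actual CurandBuffer from curand_buffer.h:

```cuda
// Sketch of a host-side random number buffer backed by the cuRAND host API.
// Names and sizes are illustrative only; see curand_buffer.h for the real code.
#include <curand.h>
#include <cuda_runtime.h>
#include <vector>

class HostRandBuffer {  // hypothetical stand-in for CurandBuffer
public:
    HostRandBuffer(size_t n, unsigned long long seed) : size(n), next(n) {
        curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
        curandSetPseudoRandomGeneratorSeed(gen, seed);
        cudaMalloc(&d_data, size * sizeof(float));
        h_data.resize(size);
    }
    ~HostRandBuffer() {
        curandDestroyGenerator(gen);
        cudaFree(d_data);
    }
    // operator[] ignores its index and returns the next buffered number,
    // refilling the buffer (generate on device, copy to host) when empty.
    float operator[](size_t /*ignored*/) {
        if (next == size) {
            curandGenerateUniform(gen, d_data, size);                // generate on GPU
            cudaMemcpy(h_data.data(), d_data, size * sizeof(float),
                       cudaMemcpyDeviceToHost);                      // copy to host
            next = 0;
        }
        return h_data[next++];
    }
private:
    curandGenerator_t gen;
    float* d_data;
    std::vector<float> h_data;
    size_t size, next;
};
```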

Implementation details worth noting

In the synapse generation code, _rand() and _randn() functions are called. We use a little hackery here: these function calls are just preprocessor macros mapping them to a _ptr_... variable, which is in fact not an array but an instance of the CurandBuffer class. Therefore, each call to _rand() or _randn() consumes a new random number from the buffer. It was implemented this way because it allows using the C++ synapse generation code without having to adapt the algorithm.
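
Schematically, the trick looks like this (the macro and variable names are illustrative; the generated code uses different, codeobject-specific names):

```cuda
// Schematic only: not the actual generated macros.
#define _rand(vectorisation_idx)  _ptr_rand_buffer[vectorisation_idx]
#define _randn(vectorisation_idx) _ptr_randn_buffer[vectorisation_idx]

// _ptr_rand_buffer / _ptr_randn_buffer look like arrays to the synapse
// generation code, but are in fact CurandBuffer instances whose operator[]
// ignores the index and returns the next buffered random number.
```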

TODOs

  • Cleanest solution: Implement a CUDA kernel version of the synapse generation code that uses the cuRAND device API for on-the-fly random number generation (or some device-side RNG buffer, if that turns out to be more efficient) (see #177, below).
    • Pros:
      • Algorithm might run faster on GPU (or not, depends on how well it can be parallelized)
      • No messy RNG on device, copy to host side buffer, regenerate on device
      • No _rand / _randn precompiler macro hackery
      • We could clean up the mess of having variables on both host and device and copying them around during the init process, and instead have everything that is only needed on the device generated and used on the device (such as this connectivity matrix).
    • Cons:
      • Optimizations here are probably not worth it for most simulations: synapse generation is not run every clock cycle but only once per Synapses object. The exception is very large networks, where synapse generation can take a long time; if we find a good parallelization, this could be beneficial for those cases.

Related issues

  • #88 (overloaded operator[] seems to have bad performance)
  • #177 Use a custom algorithm for synapse creation, parallelized on the GPU, which could then use cuRAND device API calls to generate numbers on the fly (if that is not slower than pregenerating them on the host, possibly by estimating the maximum number of random numbers needed)?

Device code

For codeobjects that need random numbers generated with rand/randn in their device code, a different buffer system is implemented in rand.cu (the RandomNumberBuffer class). First, the number of needed random numbers (uniform and normal) is determined by regex matching the codeobject.code.cu_file attributes for rand/randn (see here). This is probably not the best solution but works for now. There are also some tests that check that this is working correctly (see here). The number of rand/randn calls needed is then passed to the RandomNumberBuffer, where the total number of random numbers needed per codeobject and time step is calculated. Depending on the buffer size, a chunk of random numbers is generated on the device, and each time step a pointer into that device array is advanced by the correct number of entries, so that the codeobject kernels use a new set of random numbers every time step. From the known number of generated and needed random numbers, the time step at which the buffer needs to be refilled is calculated.
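
The following sketch illustrates the buffering scheme. Names, sizes and the refill logic are simplified and illustrative, not the actual RandomNumberBuffer from rand.cu:

```cuda
// Sketch of the device-side buffering idea (illustrative, simplified).
#include <curand.h>
#include <cuda_runtime.h>

struct DeviceRandBuffer {            // hypothetical stand-in for RandomNumberBuffer
    curandGenerator_t gen;
    float* d_buffer;                 // one big chunk of uniform numbers on the GPU
    size_t per_step;                 // numbers one codeobject needs per time step
    size_t steps_per_fill;           // how many time steps one chunk covers
    size_t step_in_fill;

    void init(size_t n_per_step, size_t buffer_bytes, unsigned long long seed) {
        per_step = n_per_step;
        steps_per_fill = buffer_bytes / (n_per_step * sizeof(float));
        curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
        curandSetPseudoRandomGeneratorSeed(gen, seed);
        cudaMalloc(&d_buffer, steps_per_fill * per_step * sizeof(float));
        refill();
    }
    void refill() {
        // Regenerate the whole chunk on the device with the cuRAND host API.
        curandGenerateUniform(gen, d_buffer, steps_per_fill * per_step);
        step_in_fill = 0;
    }
    // Called once per time step: returns the pointer the codeobject kernel
    // should use and advances it; refills when the chunk is used up.
    float* next_step_pointer() {
        if (step_in_fill == steps_per_fill)
            refill();
        return d_buffer + (step_in_fill++) * per_step;
    }
};
```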

TODOs

  • Currently the buffer size is hard coded to 50MB and is reduced if there is not sufficient space. The initial size should either be chosen dynamically depending on the needs and the available memory, which would require knowing how much device memory the network run will use before it is used (currently not implemented but possibly desirable in the future), or, if that is not known, the initial size in MB should be a user preference.

Binomial functions

The binomial function implementation is here. The algorithm is adapted from the cpp_standalone implementation. It uses multiple calls to rand/randn, the number of which is not known before execution. Therefore, it is not straightforward to use a buffer system (see TODOs below) and for now I use on-the-fly RNG. There are two code paths (selected by the __CUDA_ARCH__ preprocessor macro), such that the implementation can be used in both host and device code. There is some hackery involved again to make this work...

Since the functions in both code paths always need to be defined (that is, when executing on the device, the functions in the host code path still need to be defined), their definitions change depending on the template they are used in. In the host code path, we call the functions host_rand and host_randn. These have different definitions in the synapses create templates (which run on the host) and in all other templates (which run on the device):

When used in host code (in synapses_create_generator.cu, if binomials are used in conditions for Synapses.connect(...) calls), host_rand() and host_randn() are implemented such that they just return the next random number from the host-side buffer system (see here).

For all other templates, which run the binomial functions on the device, host_rand and host_randn have a dummy implementation that is never executed (see here). Instead, the device-side path is executed, which uses the cuRAND device API.
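
A simplified sketch of the __CUDA_ARCH__ trick (the function names other than host_rand are illustrative, and the real code is generated per template):

```cuda
// Simplified illustration of the two code paths: the same function is compiled
// for host and device, and __CUDA_ARCH__ selects which branch exists in each
// compilation pass.
#include <curand_kernel.h>

// Host-side stand-in: in the real templates host_rand() either reads from the
// host-side buffer (synapse creation) or is a dummy that is never executed.
inline float host_rand() { return 0.5f; }

__host__ __device__ inline float _example_rand(curandState* state)
{
#ifdef __CUDA_ARCH__
    // Device path: generate the number on the fly with the cuRAND device API.
    return curand_uniform(state);
#else
    // Host path: fall back to the host-side buffer (or dummy).
    (void)state;
    return host_rand();
#endif
}
```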

TODOs

  • I just copied the binomial implementation from cpp_standalone. It might be possible to parallelize the implementation, e.g. using dynamic parallelism.
  • Using a device-side buffer system would likely increase performance. In general, I didn't benchmark the implementation, and it could be that on-the-fly random number generation with the cuRAND device API is much slower than using a buffer. The problem with a buffer is that we don't know the number of random numbers needed before execution, so if we run out of random numbers in the buffer during execution of a binomial function, we would have to find a way to fix that: either refill the random numbers using some sort of dynamic parallelism (just call the refill kernel from the binomial kernel), or figure out the maximum number of random numbers the algorithm could need.
  • PROBLEM WITH ATOMIC ADD SITUATION

Setting brian.seed(custom_seed)

The RandomNumberBuffer class takes care of setting the correct seeds (see here). It sets two seeds: custom_seed for the cuRAND host API and custom_seed + 1 for the cuRAND device API. For the device API, we generate one curandState per thread, all with the same seed but a different sequence number per thread. This is the safest way to generate random numbers, as described in the cuRAND documentation. In each kernel call, the same thread (defined by its thread ID) uses the same curandState. These states are all reset when a new seed is set in brian.
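
A minimal sketch of such a state-initialization kernel (illustrative, not the actual brian2cuda kernel):

```cuda
// Same seed for all threads, a different sequence number per thread, zero
// offset: the pattern recommended by the cuRAND documentation for independent
// per-thread streams. Names are illustrative.
#include <curand_kernel.h>

__global__ void init_curand_states(curandState* states,
                                   unsigned long long seed,
                                   int n_states)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n_states)
        // curand_init(seed, sequence number, offset, state)
        curand_init(seed, tid, 0, &states[tid]);
}
```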