-
Notifications
You must be signed in to change notification settings - Fork 12
Random number generation
Currently, there are two random number buffer classes implemented for when we know the number of random numbers needed per kernel or function call. One buffer is for random numbers used in host code and one for random numbers used in device code. Additionally, we use on-the-fly random number generation in device code when we don't know how many random numbers we will need before execution (which is happening in our implementation of the binomial function). The details are explained below.
Whenever we use rand/randn
calls, we know how many random numbers we need
and can use our buffer systems, which differ in host and device code.
For synapse generation we use the same algorithm as brian2 which needs random
numbers
(see here).
The number of needed random numbers is not known before execution, therefore
random numbers need to be generated on the fly (slow) or a buffer system is
needed. We implemented the latter. In
curand_buffer.h
the CurandBuffer
class is implemented. It implements a random number buffer
for usage in host code, which generates random number on a CUDA device, using
the cuRAND host API. These numbers are
then copied over to host memory and used for synapses generation. The class
implementes the operator[]
method which ignores its index and just returns
the next random number. Once all numbers are used, new random numbers are
generated on the device and copied to the host. This allows us to use the
synapse generation mechanism from brian2 without much change.
In the synapse generation code _rand()
and _randn()
functions are called.
We use a little hackery here: These function calls are just precompiler macros
mapping them to a _ptr_...
variable which is in fact not an array but an
instance of the CurandBuffer
class. Therefore, at each call to _rand()
or
_randn()
, a new random number from the buffer is used. This was implemented
this way because it allowed using the cpp synapse generation code without
having to adapt the algorithm.
- Cleanest solution: Implement a CUDA kernel version of the synapse generation
code that uses the curand device api for on-the-fly random number
generation (or using some device side RNG buffer if that turns out to be
more efficient) (see #177, below).
- Pros:
- Algorithm might run faster on GPU (or not, depends on how well it can be parallelized)
- No messy RNG on device, copy to host side buffer, regenerate on device
- No
_rand
/_randn
precompiler macro hackery - We could clean up all the mess of having variables on both, host and device and copying them around during the init process and instead just have all that is only needed on the device, generated and used on the device (such as this connectivity matrix).
- Cons:
- Optimizations here are probably not worth it for most simulations. Synapse generation is not run every clock cycle but only once per synapses object. Except for very large networks, where synapse generation can take long. If we find a good parallelization, this could be beneficial for those cases.
- Pros:
- #88 (overloaded
operator[]
seems to have bad performance) - #177 Use an own algorithm for synapse creation, parallelized on GPU, which could then use curand device api calls to generate numbers on the fly (if that is not slower than pregenerating them on the host, possibly estimating the maximum number of needed random numbers)?
For codeobjects that need random numbers generated with rand/randn
in their
device code, a different buffer system is implemented in
rand.cu
(RandomNumberBuffer
class).
First, the number of needed random numbers (uniform and normal) is determined
by regex matching the codeobject.code.cu_file
attributes for rand/randn
(see
here)
This is probably not the best solution but works for now. There are also some
tests here that check that this is working correctly (see
here
The number or rand/randn calls needed are then passed to the
RandomNumberBuffer
, where the total number of needed random numbers per
codeobject and time step is calculated. Depending on buffer size, a chunk of
random numbers is generated on the device and each time step a pointer is
forwarded the correct number of memory entries in that device array such that
the codeobject kernels use a new set of random numbers each time step. From
the known number of generated and needed random numbers, the time step at
which the buffer needs to be refilled is calculated.
- Currently the buffer size is hard coded to 50MB and reduced if there is not sufficient space. The initial number should either be chosen dynamically depending on the needs and available memory. This would need knowing the amount of device memory that will be used during the network run before using it (currently not implemented but possibly deirable in the future). Or if that is not known, than the inintial MB should be a user preference.
The binomial function implementation is here. The
algorithm is adapted from the cpp_standalone
standalone implementation. It
uses multiple calls to rand
/randn
, the number of which is not known before
execution. Therefore, it is not straightforward to use a buffer system (see
TODOs below) and for now I use on-the-fly RNG. There are two code paths
(defined by the __CUDA_ARCH__
preprocessor macros), such that the
implementation can be used in both, host and device code. There is some
hackery involved again to make this work...
Since the functions in both code paths need to always be defined (that means when
executing on the device the functions in the host code path still need to be
defined), their definitions change depending on the template they are used in.
In the host code path, we call the functions host_rand
and host_randn
.
These have different definitions in synapses create templates (which run on
the host) and all other templates (which run on the device):
When used on in host code (in synapses_create_generator.cu
if binomials are
used in conditions for Synapses.connect(...)
calls), host_rand()
and
host_randn
are implemented such that they just return the next random number
buffer from the host side buffer system (see
here)
For all other templates, which run the binomial functions on the device,
host_rand
and host_randn
have a dummy implementation that is never
executed (see
here).
Instead the device side path is executed which uses the curand device API.
- I just copied the binomial implementation from
cpp_standalone
. It could be that one could parallelize the implementation using e.g. dynamic parallelism. - Using a device side buffer system would likely increase performance. In general, I didn't benchmark the implementation and it could be that on-the-fly random number generation with curand device API is much slower than using a buffer. Problem with a buffer is that we don't know the number of random numbers needed before execution. So if we run out of random numbers in the buffer during execution of a binomial function, we would have to find a way to fix that. Either refilling the random numbers using some sort of dynamic parallelism (just call the refill kernel from the binomial kernel) or figuring out what the maximum number of random numbers is that the algorithm would need.
- PROBLEM WITH ATOMIC ADD SITUATION
The RandomNumberBuffer
class takes care of setting the correct seeds (see
here).
It sets two seeds, custom_seed
for the curand host API and custom_seed + 1
for the curand device API. For the device API, we generate one curandState
per thread with a different sequence offset. This is the safest way to
generate random numbers as described in the cuRAND documentation. In each
kernel call the same thread (defined by its threadId
) uses the same
curandState
. These states are all reset when a new seed is set in brian.