WIP: Improving random number generation on gpu #599

Shihab-Shahriar · 2024-12-05T17:49:14Z

Hi,

This PR attempts to improve the random number generation component of the GPU-accelerated odgi-layout. It replaces the current generator with a counter-based one, Philox. A counter-based generator can be created, used, and discarded entirely within thread registers, which eliminates global-memory bookkeeping and communication for RNG state. It also makes the results reproducible, and it is statistically very robust. See our library page for more details.
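To make "counter-based" concrete, here is a minimal host-side C++ sketch of the Philox-4x32-10 round structure. It is not the OpenRAND code and not the code in this PR. The helper names (`philox_round`, `draw`, `to_unit_float`) and the way the key and counter are packed from `seed`, `thread_id`, and `step` are assumptions made for illustration only.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Block4 = std::array<std::uint32_t, 4>;  // 128-bit counter / output
using Key2   = std::array<std::uint32_t, 2>;  // 64-bit key

// One Philox-4x32 round: two 32x32->64 multiplies, then the high halves
// are XORed with the neighbouring counter words and the key.
inline Block4 philox_round(const Block4& c, const Key2& k) {
    const std::uint64_t p0 = 0xD2511F53ull * c[0];
    const std::uint64_t p1 = 0xCD9E8D57ull * c[2];
    const std::uint32_t hi0 = static_cast<std::uint32_t>(p0 >> 32);
    const std::uint32_t lo0 = static_cast<std::uint32_t>(p0);
    const std::uint32_t hi1 = static_cast<std::uint32_t>(p1 >> 32);
    const std::uint32_t lo1 = static_cast<std::uint32_t>(p1);
    return {hi1 ^ c[1] ^ k[0], lo1, hi0 ^ c[3] ^ k[1], lo0};
}

// Ten rounds, bumping the key by the Weyl constants between rounds.
// The bump after the last round is harmless because the key is not reused.
inline Block4 philox4x32_10(Block4 ctr, Key2 key) {
    for (int r = 0; r < 10; ++r) {
        ctr = philox_round(ctr, key);
        key[0] += 0x9E3779B9u;
        key[1] += 0xBB67AE85u;
    }
    return ctr;
}

// Hypothetical packing: the key identifies the stream (seed, thread),
// the counter identifies the position in that stream (step).
inline Block4 draw(std::uint32_t seed, std::uint32_t thread_id, std::uint32_t step) {
    const Key2 key{seed, thread_id};
    const Block4 ctr{step, 0u, 0u, 0u};
    return philox4x32_10(ctr, key);
}

// Map 32 random bits to a float in [0, 1) using the top 24 bits.
inline float to_unit_float(std::uint32_t x) {
    return static_cast<float>(x >> 8) * (1.0f / 16777216.0f);
}

// The output depends only on (seed, thread_id, step), so it is reproducible,
// and distinct counters under the same key give distinct output blocks.
inline void check_reproducible() {
    assert(draw(42u, 7u, 0u) == draw(42u, 7u, 0u));
    assert(draw(42u, 7u, 0u) != draw(42u, 7u, 1u));
}
```

Nothing here persists between calls: a GPU thread could rebuild the same values from its ID and an iteration index at any time, which is why no state has to live in global memory.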

On a V100 GPU, I measured a performance improvement of around 17% on the chr20 dataset. On an A100-SXM4, which has significantly higher memory bandwidth, the effect was a little less pronounced, but still around 10%.

This is a work in progress and needs a bit of cleanup. I am more than happy to hear your suggestions and feedback and to incorporate them into the code.

On a related note, I am still trying to figure things out, but the current code seems to assume that at most one thread block runs on an SM at a time. There appears to be one curandStateXORWOWCoalesced_t object per SM, so if more than one block is resident on an SM, that could create a race condition.

Thanks to @tonyjie, @subwaystation for helping me set odgi up.

tonyjie · 2024-12-07T03:08:37Z

Could you first compare the layout figures generated by our current odgi-layout and by yours visually, to make sure they are similar? You can render them with odgi draw; check here for reference.

Later you might also check a quantitative metric for the layouts generated with your PRNG. Check this odgi tension command to evaluate the metric. Note that it is not merged into pangenome/odgi yet; it is available in this branch.

tonyjie · 2024-12-07T03:13:33Z

Also, could you highlight the key differences and advantages of your PRNG compared with cuRAND? I took a look at the paper, but I'm not sure I fully understand it. Thanks!

Shihab-Shahriar · 2024-12-10T06:23:09Z

Hi,

Visually, they look pretty similar for both the chr20 and DRB1 datasets, with small differences. Sometimes the key structures can be "inverted" (e.g. the smaller "knot" on the right). Here is one result for the DRB1-3123 dataset:

Quantitatively, the result does tend to vary between runs, but it appears to be at least as good as the current method (top: master branch, bottom: this PR).

Shihab-Shahriar · 2024-12-10T06:33:56Z

To answer your other question, OpenRAND is a library that implements several well-known PRNGs. For example, the Philox generator used here was introduced in 2011 in this paper. These cryptography-inspired generators tend to be more statistically robust and GPU-friendly than alternatives like Xorwow.

As a library, OpenRAND's API does provide a key performance benefit over cuRAND: it doesn't require a separate kernel to initialize random states, nor memory loads and stores of that state each time the generator is used. These steps are completely unnecessary for a generator like Philox, and skipping them can save quite a bit of memory traffic.
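The difference in memory traffic can be sketched on the host in plain C++. This is only an illustration of the two usage patterns, under loud assumptions: xorshift32 stands in for a stateful generator (it is not cuRAND's XORWOW), a SplitMix64-style mixer stands in for a counter-based generator (it is not Philox), and every function name below is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// ---- Stateful pattern (stand-in: xorshift32) ----
struct StatefulRng { std::uint32_t s; };

// Separate initialization pass: one state per thread is written to memory.
inline std::vector<StatefulRng> init_states(std::size_t n_threads, std::uint32_t seed) {
    std::vector<StatefulRng> states(n_threads);
    for (std::size_t i = 0; i < n_threads; ++i) {
        // Shift left and set the low bit so the state is odd, hence nonzero.
        const std::uint32_t base = seed * 2654435761u + static_cast<std::uint32_t>(i);
        states[i].s = (base << 1) | 1u;
        assert(states[i].s != 0u);
    }
    return states;
}

// Every draw loads the state, advances it, and stores it back.
inline std::uint32_t stateful_next(StatefulRng& r) {
    std::uint32_t x = r.s;  // load
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    r.s = x;                // store
    return x;
}

// ---- Stateless pattern (stand-in: SplitMix64-style mixer) ----
inline std::uint64_t mix64(std::uint64_t z) {
    z += 0x9E3779B97F4A7C15ull;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ull;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBull;
    return z ^ (z >> 31);
}

// No init pass and no stored state: the value is a pure function of
// (seed, thread_id, step), all of which a thread already holds in registers.
inline std::uint32_t stateless_draw(std::uint64_t seed, std::uint32_t thread_id,
                                    std::uint32_t step) {
    const std::uint64_t ctr = (static_cast<std::uint64_t>(thread_id) << 32) | step;
    return static_cast<std::uint32_t>(mix64(mix64(seed) ^ ctr) >> 32);
}
```

In the stateful pattern the `states` array plays the role of the per-thread (or per-SM) state buffer that must be initialized once and then read and written on every use. The stateless pattern has no such buffer, which is the saving described above.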

use openrand to generate random numbers in gpu

919146a
