WIP: Improving random number generation on gpu #599

Open: wants to merge 1 commit into base: master

@Shihab-Shahriar Shihab-Shahriar commented Dec 5, 2024

Hi,

This PR attempts to improve the random number generation component of the GPU-accelerated odgi-layout. It replaces the current generator with a counter-based one, Philox. Counter-based generators can be created, used, and discarded entirely within thread registers, which eliminates global-memory bookkeeping and communication for RNG state. The change also makes the code reproducible, and Philox is statistically very robust. See our library page for more details.
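To make "counter-based" concrete, here is a minimal host-side C++ sketch of the Philox4x32-10 round structure. It is not OpenRAND's code: it is written from the published algorithm description, the helper name `uniform01` and the mapping of (seed, thread id, step) onto key and counter are assumptions for illustration, and it should be checked against the Random123 known-answer vectors before being relied on.

```cpp
#include <array>
#include <cstdint>

// Philox4x32: four 32-bit counter words and two 32-bit key words.
using Counter = std::array<std::uint32_t, 4>;
using Key = std::array<std::uint32_t, 2>;

// One round: two 32x32 -> 64-bit multiplies, then XOR with the key and permute.
inline Counter philox_round(const Counter& c, const Key& k) {
    constexpr std::uint64_t M0 = 0xD2511F53u;
    constexpr std::uint64_t M1 = 0xCD9E8D57u;
    const std::uint64_t p0 = M0 * c[0];
    const std::uint64_t p1 = M1 * c[2];
    const auto hi0 = static_cast<std::uint32_t>(p0 >> 32);
    const auto lo0 = static_cast<std::uint32_t>(p0);
    const auto hi1 = static_cast<std::uint32_t>(p1 >> 32);
    const auto lo1 = static_cast<std::uint32_t>(p1);
    return {hi1 ^ c[1] ^ k[0], lo1, hi0 ^ c[3] ^ k[1], lo0};
}

// Ten rounds; the key is bumped by fixed (Weyl) constants after each round.
// The output depends only on (counter, key): there is no stored state.
inline Counter philox4x32_10(Counter c, Key k) {
    constexpr std::uint32_t W0 = 0x9E3779B9u;
    constexpr std::uint32_t W1 = 0xBB67AE85u;
    for (int round = 0; round < 10; ++round) {
        c = philox_round(c, k);
        k[0] += W0;
        k[1] += W1;
    }
    return c;
}

// Hypothetical helper: a float in [0, 1) for a given (seed, thread_id, step).
// Using (seed, thread_id) as the key and step as the counter is an assumption.
inline float uniform01(std::uint32_t seed, std::uint32_t thread_id,
                       std::uint32_t step) {
    const Counter out = philox4x32_10({step, 0u, 0u, 0u}, {seed, thread_id});
    // Keep the top 24 bits so the value is exactly representable as a float.
    return static_cast<float>(out[0] >> 8) * (1.0f / 16777216.0f);
}
```

Because every draw is a pure function of its inputs, a thread can call something like `uniform01(seed, tid, step)` at any point and get the same value on every run, with nothing to load from or store to global memory.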

On a V100 GPU, I saw a performance improvement of around 17% on the chr20 dataset. On an A100-SXM4, which has significantly higher memory bandwidth, the effect was a little less pronounced, but still around 10%.

This is a work in progress and needs a bit of cleanup. I am more than happy to hear your suggestions and feedback and to incorporate them into the code.

On a related note: I am still trying to figure things out, but the current code seems to assume that at most one thread block runs on an SM at a time. There appears to be one curandStateXORWOWCoalesced_t object per SM, so if more than one block is resident on an SM, that could create a race condition.

Thanks to @tonyjie, @subwaystation for helping me set odgi up.

@tonyjie
Contributor

tonyjie commented Dec 7, 2024

Could you first compare the layout figure generated by our current odgi-layout with yours visually, to make sure they are similar? You can do this with odgi draw; check here for reference.

Later you might also check a quantitative metric for the layouts generated with your PRNG. The odgi tension command evaluates this metric. Note that it is not merged into pangenome/odgi yet; it is available in this branch.

@tonyjie
Contributor

tonyjie commented Dec 7, 2024

Also, could you highlight the key differences and advantages of your PRNG compared with cuRAND? I took a look at the paper, but I'm not sure I fully get it. Thanks!

@Shihab-Shahriar
Author

Hi,

Visually, they look pretty similar for both the chr20 and DRB1 datasets, with small differences. Sometimes key structures can be "inverted" (e.g. the smaller "knot" on the right). Here is one result for the DRB1-3123 dataset:

[Images: orig_drb, my_drb]

Quantitatively, the result does tend to vary between runs, but it appears to be at least as good as the current method (top: master branch, bottom: this PR).

[Image: quantitative comparison]

@Shihab-Shahriar
Author

To answer your other question: OpenRAND is a library that implements several well-known PRNGs. For example, the Philox generator used here was introduced in 2011 in this paper. These cryptography-inspired generators tend to be more statistically robust and GPU-friendly than alternatives like XORWOW.

As a library, the OpenRAND API provides a key performance benefit over cuRAND: it requires neither a separate kernel to initialize random state nor memory loads and stores of that state on each subsequent use. These steps are unnecessary for a generator like Philox, so dropping them can save quite a bit of memory traffic.
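The stateless usage pattern can be sketched on the host in plain C++. This is an illustration under stated assumptions, not OpenRAND or cuRAND code: `mix64` is a stand-in mixer (the SplitMix64 finalizer, not Philox), and `draw` and `run` are hypothetical names. The point is only that each (thread, iteration) pair computes its number from its own indices, so there is no state array to initialize, load, or store, and the result does not depend on execution order.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in counter-based mixer (SplitMix64 finalizer). NOT Philox; it only
// plays the role of "pure function from counter to random-looking bits".
inline std::uint64_t mix64(std::uint64_t x) {
    x ^= x >> 30; x *= 0xBF58476D1CE4E5B9ULL;
    x ^= x >> 27; x *= 0x94D049BB133111EBULL;
    x ^= x >> 31;
    return x;
}

// Stateless draw: the "state" is just (seed, thread_id, iteration), all of
// which a GPU thread already holds in registers.
inline std::uint64_t draw(std::uint64_t seed, std::uint32_t thread_id,
                          std::uint32_t iteration) {
    const std::uint64_t counter =
        (static_cast<std::uint64_t>(thread_id) << 32) | iteration;
    return mix64(seed ^ counter);
}

// Simulates a kernel launch on the host. `reversed` changes the order in
// which "threads" execute; the output must not depend on it.
inline std::vector<std::uint64_t> run(std::uint64_t seed,
                                      std::uint32_t n_threads,
                                      std::uint32_t n_iters, bool reversed) {
    std::vector<std::uint64_t> out(static_cast<std::size_t>(n_threads) * n_iters);
    for (std::uint32_t i = 0; i < n_threads; ++i) {
        const std::uint32_t t = reversed ? n_threads - 1 - i : i;
        for (std::uint32_t it = 0; it < n_iters; ++it) {
            out[static_cast<std::size_t>(t) * n_iters + it] = draw(seed, t, it);
        }
    }
    return out;
}
```

With a stateful generator, by contrast, each thread would have to read its state from an array that an earlier kernel had filled in, advance it, and write it back.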
