
Hard-coded length limits on createCUDABatchAligner causing poor performance #11

Open
SamStudio8 opened this issue Oct 23, 2019 · 1 comment


I've been trying to polish one of our mock community datasets with racon-gpu, but am seeing slow performance during the overlap alignment phase.

[Screenshot from 2019-10-23 09-06-45]

I can see that many alignments are not being run on the GPU but on the CPU instead. Admittedly, the slow performance was exacerbated by the use of only four CPU cores. I've had a little look around the code and, as I understand it, an alignment can be prevented from running on the GPU under two conditions:

* the overlap's query length exceeds the aligner's maximum query length, or
* the overlap's target length exceeds the aligner's maximum target length.

I see there is also an error mode for exceeded_max_alignment_difference, but I can't seem to find a case where it is actually raised by CUDAAligner.
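
To make sure I've understood it, the check seems to be roughly of this shape (a minimal sketch with made-up names, not the actual racon-gpu code):

```cpp
#include <cstdint>

// Illustrative sketch only; the struct and function names are my own
// assumptions, not real racon-gpu types.
struct OverlapLengths {
    uint32_t query_length;
    uint32_t target_length;
};

// If either side of an overlap exceeds the batch limit, the overlap is
// not added to the CUDA batch and is later aligned on the CPU instead.
bool fits_on_gpu(const OverlapLengths& o,
                 uint32_t max_query_size,    // currently hard-coded to 15000
                 uint32_t max_target_size)   // currently hard-coded to 15000
{
    return o.query_length <= max_query_size &&
           o.target_length <= max_target_size;
}
```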

I've checked the stats on the reads I am assembling and polishing with, and the N50 is 28.3 Kbp (nice one @joshquick), so I'm thinking perhaps our longest reads are getting thrown off the GPU and left to run on the CPU afterwards.

I've found where the CUDABatchAligner is initialised and see it has hard-coded limits of 15000 for both the max query and max target length. Is this a specific limit for performance reasons, or would it be possible to let users set these limits themselves? Does the choice here affect the memory allocation on the GPU later? Ideally we'd want to raise it to at least 25 Kbp, if not 50 Kbp.
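
Purely as a sketch of what I'm asking for (the struct and function names here are hypothetical, not the current racon-gpu API), something along these lines, keeping the current 15000 values as defaults:

```cpp
#include <cstdint>

// Sketch only: a configuration struct in place of the hard-coded constants.
struct BatchAlignerLimits {
    uint32_t max_query_size  = 15000;  // current hard-coded default
    uint32_t max_target_size = 15000;  // current hard-coded default
};

// For a read set with an N50 of ~28.3 Kbp we would want something like this.
// Larger limits presumably mean larger per-alignment buffers on the GPU,
// so fewer alignments would fit in a batch for the same device memory.
BatchAlignerLimits long_read_limits()
{
    BatchAlignerLimits limits;
    limits.max_query_size  = 50000;  // at least 25000, ideally 50000
    limits.max_target_size = 50000;
    return limits;
}
```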

Just to check I was on the right track, I filtered reads longer than 15 Kbp out of this data set and ran the polishing again; there is now very little time spent aligning overlaps on the CPU. Though, I'm not entirely sure whether this is just because all the remaining reads are <= 15 Kbp, or simply because there are fewer reads.

[Screenshot from 2019-10-23 10-27-32]
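
For completeness, the filter rule was simply to drop any read longer than 15 Kbp, i.e. something equivalent to this sketch (assuming unwrapped four-line FASTQ records; how the filtering is actually done doesn't matter):

```cpp
#include <iostream>
#include <string>

// Sketch only: echo FASTQ records from stdin to stdout, dropping any
// record whose sequence is longer than 15000 bp.
int main()
{
    std::string header, seq, plus, qual;
    while (std::getline(std::cin, header) && std::getline(std::cin, seq) &&
           std::getline(std::cin, plus) && std::getline(std::cin, qual)) {
        if (seq.size() <= 15000) {
            std::cout << header << '\n' << seq << '\n'
                      << plus << '\n' << qual << '\n';
        }
    }
    return 0;
}
```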


SamStudio8 commented Oct 23, 2019

I thought I would try raising this limit myself, but memory use seems to grow linearly with it, meaning you must run fewer batches. This ends up taking much more GPU time overall, and presumably wastes a lot of memory in cases where the read overlaps assigned to a batch are much shorter than the maximum allowed length. I wonder if there would be any point in having batches of different sizes and binning the overlaps, or in ordering the overlaps by size and creating/destroying increasingly larger batches; a rough sketch of the binning idea is below.
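
Roughly what I mean (the bin edges, struct and function names here are purely illustrative, not an existing racon-gpu or CUDAAligner interface):

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

// Sketch only: group overlaps by the longer of their two sequence lengths,
// so that each GPU batch can be created with a maximum length matching its
// contents rather than one global hard-coded limit.
struct OverlapLengths {
    uint32_t query_length;
    uint32_t target_length;
};

std::map<uint32_t, std::vector<OverlapLengths>>
bin_overlaps_by_length(const std::vector<OverlapLengths>& overlaps)
{
    // Candidate batch size classes; each overlap goes into the smallest
    // class that can hold it, and anything longer still falls back to the CPU.
    const std::vector<uint32_t> bins = {15000, 25000, 50000};

    std::map<uint32_t, std::vector<OverlapLengths>> binned;
    for (const auto& o : overlaps) {
        const uint32_t longest = std::max(o.query_length, o.target_length);
        for (uint32_t bin : bins) {
            if (longest <= bin) {
                binned[bin].push_back(o);
                break;
            }
        }
    }
    return binned;  // one batch (with its own max length) per bin
}
```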
