
Hard-coded length limits on createCUDABatchAligner causing poor performance #11

Open
SamStudio8 opened this issue Oct 23, 2019 · 1 comment


I've been trying to polish one of our mock community datasets with racon-gpu, but am seeing slow performance during the overlap alignment phase.

[Screenshot from 2019-10-23 09-06-45]

I can see that many alignments are not being run on the GPU but on the CPU instead. Admittedly, the slow performance was exacerbated by the use of only four CPU cores. I've had a little look around the code and, as I understand it, an alignment can be prevented from running on the GPU under two conditions:

* the overlap's query length exceeds the aligner's maximum query length, or
* the overlap's target length exceeds the aligner's maximum target length.

I see there is also an error mode for exceeded_max_alignment_difference, but I can't seem to find a case where it is actually raised by CUDAAligner.
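
To make sure I've understood it, the check seems to be roughly of this shape (a minimal sketch with made-up names, not the actual racon-gpu code):

```cpp
#include <cstdint>

// Illustrative sketch only; the struct and function names are my own
// assumptions, not real racon-gpu types.
struct OverlapLengths {
    uint32_t query_length;
    uint32_t target_length;
};

// If either side of an overlap exceeds the batch limit, the overlap is
// not added to the CUDA batch and is later aligned on the CPU instead.
bool fits_on_gpu(const OverlapLengths& o,
                 uint32_t max_query_size,    // currently hard-coded to 15000
                 uint32_t max_target_size)   // currently hard-coded to 15000
{
    return o.query_length <= max_query_size &&
           o.target_length <= max_target_size;
}
```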

I've checked the stats on the reads I am assembling and polishing with, and the N50 is 28.3 Kbp (nice one @joshquick), so I'm thinking perhaps our longest reads are getting thrown off the GPU and left to run on the CPU afterwards.

I've found where the CUDABatchAligner is initialised and see it has hard-coded limits of 15000 for both the max query and max target length. Is this a specific limit for performance reasons, or would it be possible to let users set these limits themselves? Does the choice here affect the memory allocation on the GPU later? Ideally we'd want to raise it to at least 25 Kbp, if not 50 Kbp.
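
Purely as a sketch of what I'm asking for (the struct and function names here are hypothetical, not the current racon-gpu API), something along these lines, keeping the current 15000 values as defaults:

```cpp
#include <cstdint>

// Sketch only: a configuration struct in place of the hard-coded constants.
struct BatchAlignerLimits {
    uint32_t max_query_size  = 15000;  // current hard-coded default
    uint32_t max_target_size = 15000;  // current hard-coded default
};

// For a read set with an N50 of ~28.3 Kbp we would want something like this.
// Larger limits presumably mean larger per-alignment buffers on the GPU,
// so fewer alignments would fit in a batch for the same device memory.
BatchAlignerLimits long_read_limits()
{
    BatchAlignerLimits limits;
    limits.max_query_size  = 50000;  // at least 25000, ideally 50000
    limits.max_target_size = 50000;
    return limits;
}
```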

Just to check I was on the right track, I filtered reads longer than 15 Kbp out of this data set and ran the polishing again; there is now very little time spent aligning overlaps on the CPU. Though, I'm not entirely sure whether this is just because all the remaining reads are <= 15 Kbp, or simply because there are fewer reads.

[Screenshot from 2019-10-23 10-27-32]
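
For completeness, the filter rule was simply to drop any read longer than 15 Kbp, i.e. something equivalent to this sketch (assuming unwrapped four-line FASTQ records; how the filtering is actually done doesn't matter):

```cpp
#include <iostream>
#include <string>

// Sketch only: echo FASTQ records from stdin to stdout, dropping any
// record whose sequence is longer than 15000 bp.
int main()
{
    std::string header, seq, plus, qual;
    while (std::getline(std::cin, header) && std::getline(std::cin, seq) &&
           std::getline(std::cin, plus) && std::getline(std::cin, qual)) {
        if (seq.size() <= 15000) {
            std::cout << header << '\n' << seq << '\n'
                      << plus << '\n' << qual << '\n';
        }
    }
    return 0;
}
```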


SamStudio8 commented Oct 23, 2019

I thought I would try raising this limit myself, but memory use seems to grow linearly with it, meaning you must run fewer batches. This ends up taking much more GPU time overall, and presumably wastes a lot of memory in cases where the read overlaps assigned to a batch are much shorter than the maximum allowed length. I wonder if there would be any point in having batches of different sizes and binning the overlaps, or in ordering the overlaps by size and creating/destroying increasingly larger batches; a rough sketch of the binning idea is below.
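
Roughly what I mean (the bin edges, struct and function names here are purely illustrative, not an existing racon-gpu or CUDAAligner interface):

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

// Sketch only: group overlaps by the longer of their two sequence lengths,
// so that each GPU batch can be created with a maximum length matching its
// contents rather than one global hard-coded limit.
struct OverlapLengths {
    uint32_t query_length;
    uint32_t target_length;
};

std::map<uint32_t, std::vector<OverlapLengths>>
bin_overlaps_by_length(const std::vector<OverlapLengths>& overlaps)
{
    // Candidate batch size classes; each overlap goes into the smallest
    // class that can hold it, and anything longer still falls back to the CPU.
    const std::vector<uint32_t> bins = {15000, 25000, 50000};

    std::map<uint32_t, std::vector<OverlapLengths>> binned;
    for (const auto& o : overlaps) {
        const uint32_t longest = std::max(o.query_length, o.target_length);
        for (uint32_t bin : bins) {
            if (longest <= bin) {
                binned[bin].push_back(o);
                break;
            }
        }
    }
    return binned;  // one batch (with its own max length) per bin
}
```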
