Reduce porechop memory consumption #3571

mvdbeek · 2021-04-05T18:35:39Z

rrwick/Porechop#77 might be an issue for us. @jennaj is going to check whether reducing the total file size helps. If it does, we should split the input into chunks of a predefined number of reads to avoid excessive memory consumption.
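
For illustration, here is a minimal sketch of splitting by a fixed number of reads. It assumes uncompressed FASTQ input where every record spans exactly four lines. The function name `split_fastq`, the chunk file naming, and the default of 100,000 reads are hypothetical and not part of the Porechop wrapper.

```python
from itertools import islice
from pathlib import Path


def split_fastq(path, out_dir, reads_per_chunk=100_000):
    """Write consecutive chunks of at most reads_per_chunk reads.

    Hypothetical helper: assumes uncompressed FASTQ with four lines per record.
    Returns the list of chunk paths in input order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks = []
    with open(path) as handle:
        while True:
            # Only one chunk is held in memory at a time.
            lines = list(islice(handle, 4 * reads_per_chunk))
            if not lines:
                break
            chunk_path = out_dir / f"chunk_{len(chunks):04d}.fastq"
            chunk_path.write_text("".join(lines))
            chunks.append(chunk_path)
    return chunks
```

Each chunk could then be run through Porechop on its own and the trimmed outputs concatenated in order, so peak memory would presumably scale with the chunk size rather than the full input.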

bernt-matthias · 2021-04-06T08:47:15Z

One might do this natively in Galaxy using Split file to dataset collection. This would even potentially process everything in parallel.

Then adding a note to the help section might be sufficient.

mvdbeek · 2021-04-06T09:36:03Z

I am all for doing things in Galaxy, but here there's no downside to doing this in the tool wrapper. You can still use the split tool upstream for parallelization, but this prevents excessive memory usage if that has not been done.

jennaj · 2021-04-06T15:54:42Z

One small wrinkle -- if the user runs Split file to dataset collection first, Porechop will not accept the results. The split tool outputs datasets with the fastq datatype, and Porechop is limited to fasta or fastqsanger.

I'm thinking that we don't want to expand the datatypes that Porechop accepts when run separately (for practical usage reasons). But we could get around that if the Split file to dataset collection operation is included in the Porechop wrapper. I was able to drag-n-drop individual collection datasets in to get around the datatype filter.

Another choice is to update Split file to dataset collection to inherit the original input datatype. That is, if the input is fastqsanger or fastqsanger.gz, assign the datatype fastqsanger instead of just fastq. The tool already uncompresses compressed data, and that seems to be intentional. Many tools still require uncompressed data. The user can then purge the original compressed non-collection input to Split to recover disk space early in the analysis. We could do this whether or not Porechop is wrapped to break up the job under the hood (and I do think that would be easier on end-users == simpler execution path). Tool form help is easy to miss, not everyone understands how to use collections, and how large a dataset the Porechop tool can process varies even between the usegalaxy.* servers.
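
To make the proposed rule concrete, here is a small sketch of the mapping. The function name `inherited_datatype` is hypothetical and this is not how the split tool is currently implemented. It only illustrates "keep the input datatype, minus any compression suffix".

```python
def inherited_datatype(input_ext: str) -> str:
    """Return the datatype split outputs would get under the proposed rule.

    Hypothetical helper: outputs are uncompressed, so a compression suffix
    on the input datatype is dropped and everything else is kept as-is.
    """
    for suffix in (".gz", ".bz2"):
        if input_ext.endswith(suffix):
            return input_ext[: -len(suffix)]
    return input_ext
```

Under this rule both `fastqsanger` and `fastqsanger.gz` inputs would produce `fastqsanger` outputs, which Porechop accepts.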

Thoughts? In short, this could involve one change or two, and I'm leaning toward two. If we agree that two is a good idea, here is the ticket for updating Split: bgruening/galaxytools#1099

mvdbeek added the enhancement label Apr 5, 2021

jennaj mentioned this issue Apr 6, 2021

Enhancement: Have the tool "Split file to collection" inherit the input datatype bgruening/galaxytools#1099

Open
