Reduce porechop memory consumption #3571

Open
mvdbeek opened this issue Apr 5, 2021 · 3 comments

Comments

@mvdbeek
Member

mvdbeek commented Apr 5, 2021

rrwick/Porechop#77 might be an issue for us. @jennaj is going to check whether reducing the total file size helps; if it does, we should split the input into chunks of a predefined number of reads to avoid excessive memory consumption.
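To make the idea concrete, here is a minimal sketch of splitting by read count. It is not the wrapper's actual implementation: the function name `chunk_fastq` and the `reads_per_chunk` parameter are hypothetical, and it assumes strict 4-line FASTQ records (no wrapped sequence lines).

```python
import itertools
from typing import Iterable, Iterator, List


def chunk_fastq(lines: Iterable[str], reads_per_chunk: int) -> Iterator[List[str]]:
    """Yield lists of FASTQ lines, each holding at most `reads_per_chunk` reads.

    Assumption: every record is exactly 4 lines (header, sequence, '+', quality).
    """
    if reads_per_chunk < 1:
        raise ValueError("reads_per_chunk must be at least 1")
    it = iter(lines)
    while True:
        # Pull the lines for up to `reads_per_chunk` records without
        # loading the whole file into memory.
        chunk = list(itertools.islice(it, 4 * reads_per_chunk))
        if not chunk:
            return
        yield chunk
```

Each chunk could then be written to its own file and trimmed separately, so peak memory would depend on the chunk size rather than on the size of the whole input.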

@bernt-matthias
Contributor

One might do this natively in Galaxy using Split file to dataset collection. This would even potentially process everything in parallel.

Then adding a note to the help section might be sufficient.

@mvdbeek
Member Author

mvdbeek commented Apr 6, 2021

I am all for doing things in Galaxy; however, here there's no downside to doing this in the tool wrapper. You can still use the split tool upstream for parallelization, but this prevents excessive memory usage if that has not been done.

@jennaj
Member

jennaj commented Apr 6, 2021

One small wrinkle -- if Split file to dataset collection is used by the user first, Porechop will not accept the results. The split tool outputs fastq datatype results, and porechop is limited to fasta or fastqsanger.

I'm thinking that we don't want to expand the datatypes that Porechop accepts when run separately (for practical usage reasons). But we could get around that if the Split file to dataset collection operation is included in the Porechop wrapper. I was able to drag-n-drop individual collection datasets in to get around the datatype filter.

Another choice is to update Split file to dataset collection to inherit the original input datatype. That is, if fastqsanger or fastqsanger.gz is input, assign the datatype fastqsanger instead of just fastq. The tool already uncompresses compressed data, and that seems to be intentional, since many tools still require uncompressed data. The user can then purge the original compressed non-collection input to Split to recover disk space early in the analysis. We could do this whether or not Porechop is wrapped to break up the job under the hood (and I do think wrapping would be easier on end users, i.e. a simpler execution path). Tool form help is easy to miss, not everyone understands how to use collections, and the size of dataset that Porechop can process varies even among the usegalaxy.* servers.
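A minimal sketch of that datatype rule, purely for illustration: the function `split_output_datatype` is hypothetical and is not part of Galaxy or of the split tool. It only handles the `.gz` case mentioned above.

```python
def split_output_datatype(input_ext: str) -> str:
    """Return the datatype to assign to each split output.

    Hypothetical helper: keep the input's specific datatype, but drop a
    trailing '.gz' because the split outputs are written uncompressed.
    """
    suffix = ".gz"
    if input_ext.endswith(suffix):
        return input_ext[: -len(suffix)]
    return input_ext
```

Under this rule, both `fastqsanger` and `fastqsanger.gz` inputs would yield `fastqsanger` outputs that Porechop accepts, while a plain `fastq` input would stay `fastq`.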

Thoughts? In short, this could involve one change or two, and I'm leaning toward two. If we agree that two is a good idea, here is the ticket for updating split: bgruening/galaxytools#1099
