Running DADA2 on clusters with pool option -- runs out of memory #2061

Open
emankhalaf opened this issue Nov 22, 2024 · 8 comments

@emankhalaf

Hello,

I am working on processing a large number of 16S sequence files generated using PacBio sequencing technology, specifically 647 and 944 files in separate runs. I am interested in using the pool option during the denoising step in DADA2. Currently, I am running the R script on a cluster with 200 GB of RAM.

As I am relatively new to using HPC clusters, I started with smaller memory allocations (64 GB) and gradually increased the allocation as the script kept failing due to insufficient memory. Upon consulting the cluster's technical support, I was advised to explore whether parallelization is possible for my code, so that it could use more cores and I could request additional CPUs, potentially speeding up the process.

Is parallelization supported in the DADA2 R package? If so, could you kindly guide me on how to implement it?

Below are the parameters I am using in my bash script to run the R script:
#SBATCH --time=0-48:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=200000M

Your help is greatly appreciated!
Thanks

@benjjneb
Owner

To run with pool=TRUE, it is required that all samples be loaded into memory at once to construct the "pool" sample that is then analyzed. Thus, there is no way to break the job apart across different nodes in a way that would reduce the memory requirement.

The pseudo-pooling approach is our path forward when pool=TRUE gets too large for available memory. It is not a perfect match, but our testing shows that it approximates pool=TRUE while only needing to load one sample into memory at a time. More info here: https://benjjneb.github.io/dada2/pseudo.html
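
A minimal sketch of what that looks like in code, assuming filtered per-sample fastq files in a "filtered/" directory (the paths and object names are placeholders):

library(dada2)

# Filtered, primer-trimmed PacBio fastq files, one per sample (hypothetical path)
filts <- list.files("filtered", pattern = "\\.fastq\\.gz$", full.names = TRUE)

# Learn the error model from the filtered reads
# (the DADA2 PacBio workflow also sets errorEstimationFunction = PacBioErrfun)
err <- learnErrors(filts, multithread = TRUE)

# pool = "pseudo" approximates pool = TRUE while only loading one sample into memory at a time
dd <- dada(filts, err = err, pool = "pseudo", multithread = TRUE)

seqtab <- makeSequenceTable(dd)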

@emankhalaf
Author

@benjjneb Thank you so much and my apologies for the delayed reply.

I have one more question: is it necessary to dereplicate (drp) the sequences prior to the denoising (dd) step, or does the denoising step (dd) inherently dereplicate the sequences in order to infer the ASVs? In other words, would it be an issue to proceed directly from the error model to the denoising step without performing a separate dereplication step?

@benjjneb
Owner

benjjneb commented Dec 2, 2024

I have one more question: is it necessary to dereplicate (drp) the sequences prior to the denoising (dd) step, or does the denoising step (dd) inherently dereplicate the sequences in order to infer the ASVs? In other words, would it be an issue to proceed directly from the error model to the denoising step without performing a separate dereplication step?

Yes, the denoising step does dereplication itself, and this is the preferred way to run denoising (i.e. without calling derepFastq explicitly) because then only one sample is dereplicated and loaded into memory at a time, instead of all at once.
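
A short sketch of that pattern, with placeholder file paths:

library(dada2)

filts <- list.files("filtered", pattern = "\\.fastq\\.gz$", full.names = TRUE)
err <- learnErrors(filts, multithread = TRUE)

# Passing the filtered fastq file names directly lets dada() dereplicate each
# sample internally, one at a time, instead of calling derepFastq() on all
# samples up front and holding every dereplicated sample in memory at once
dd <- dada(filts, err = err, multithread = TRUE)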

@cjfields

cjfields commented Dec 3, 2024

@emankhalaf just a note:

#SBATCH --time=0-48:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=200000M

This will run single-threaded, which can take a very long time and (depending on how many sequences you have) could time out.

If you have access to it, I'd recommend: 1 node, 24-48 tasks per node, and 200GB memory total (not per core). Many of the functions have multithreading options, so you'll need to set these accordingly. In the majority of cases this works great, but in some cases (really diverse samples) you may need more memory.
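
On the R side, a rough sketch of wiring the allocated cores through to DADA2's multithread arguments (this assumes the job sets --cpus-per-task so that SLURM exposes SLURM_CPUS_PER_TASK; the file paths are placeholders):

library(dada2)

# Use the core count SLURM granted; fall back to TRUE (all cores visible to R)
ncpus <- Sys.getenv("SLURM_CPUS_PER_TASK")
ncpus <- if (nzchar(ncpus)) as.integer(ncpus) else TRUE

filts <- list.files("filtered", pattern = "\\.fastq\\.gz$", full.names = TRUE)

# filterAndTrim(), learnErrors(), and dada() all take a multithread argument,
# which can be TRUE or an explicit thread count
err <- learnErrors(filts, multithread = ncpus)
dd  <- dada(filts, err = err, pool = "pseudo", multithread = ncpus)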

@emankhalaf
Author

Thanks @cjfields for your input. Do you have any idea how I can use 200GB memory total?

@emankhalaf
Author

@benjjneb I have one more question: I work on multiple projects by subsetting samples from different tissues, with the sequences from each group of tissues forming a distinct project. Some tissues are included in multiple projects. When I process the sequences separately according to each project's objectives (creating the phyloseq object at the end of the workflow, then subsetting the genotypes and subsequently the tissues associated with each genotype), I occasionally observe minor differences (1-2 taxa) in the taxa count for specific tissues across the different phyloseq objects or projects. This suggests that the quality and quantity of the sequences processed together can influence the denoising and ASV inference steps. Is that correct?

Your input is much appreciated!

@cjfields

cjfields commented Dec 4, 2024

Thanks @cjfields for your input. Do you have any idea how I can use 200GB memory total?

On our cluster we use --mem 200g. In a batch script (4 cores total, 12GB memory):

#!/bin/bash
#SBATCH -n 4
#SBATCH --mem=12g
...

@benjjneb
Owner

benjjneb commented Dec 6, 2024

@emankhalaf If the error model is identical, you would get exactly the same results for the same sample in each study. Given a fixed error model, the DADA2 denoising algorithm is deterministic. However, the error model is learned from the dataset, so there are likely some quantitative differences in the error models between your different datasets. These won't be large if everything is generated using the same amplification/sequencing technology, but those minor differences are enough to produce the signal you are observing.
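
To illustrate that point, a sketch with hypothetical file sets and default per-sample (non-pooled) denoising; whether learning a single error model across projects is appropriate is a separate design decision:

library(dada2)

# Hypothetical filtered files for two projects that share some samples
projA <- list.files("projectA/filtered", pattern = "\\.fastq\\.gz$", full.names = TRUE)
projB <- list.files("projectB/filtered", pattern = "\\.fastq\\.gz$", full.names = TRUE)

# Learning errors separately per project yields slightly different err objects,
# which is enough to shift a borderline ASV call or two in a shared sample
errA <- learnErrors(projA, multithread = TRUE)
errB <- learnErrors(projB, multithread = TRUE)
ddA  <- dada(projA, err = errA, multithread = TRUE)
ddB  <- dada(projB, err = errB, multithread = TRUE)

# With one fixed error model (and default per-sample denoising), a sample that
# appears in both projects would be denoised identically in each run
err_fixed <- learnErrors(c(projA, projB), multithread = TRUE)
ddA2 <- dada(projA, err = err_fixed, multithread = TRUE)
ddB2 <- dada(projB, err = err_fixed, multithread = TRUE)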
