Running DADA2 on clusters with pool option -- runs out of memory #2061
Comments
To run with … The pseudo-pooling approach is our path forward when …
@benjjneb Thank you so much, and my apologies for the delayed reply. I have one more question: is it necessary to dereplicate (drp) the sequences prior to the denoising (dd) step, or does the denoising step (dd) inherently dereplicate the sequences in order to infer the ASVs? In other words, would it be an issue to proceed directly from the error model to the denoising step without performing a separate dereplication step?
Yes, the denoising step does dereplication itself, and this is the preferred way to run denoising (i.e. without calling …)
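A minimal sketch of what this looks like, assuming dada2 ≥ 1.16 (where `dada()` accepts filtered fastq file paths directly) and a hypothetical `filtered/` directory:

```r
library(dada2)

# Hypothetical paths: filtered, primer-trimmed fastq files
filts <- list.files("filtered", pattern = "fastq.gz", full.names = TRUE)

# Learn the error model from the filtered reads
err <- learnErrors(filts, multithread = TRUE)

# Denoise directly from the file paths -- dada() dereplicates
# internally, so no separate dereplication call is needed
dd <- dada(filts, err = err, multithread = TRUE)

seqtab <- makeSequenceTable(dd)
```

Passing file paths rather than a pre-built dereplication object also keeps memory use lower, since only one sample's reads need to be in memory at a time when samples are processed independently.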
@emankhalaf just a note:
This will run single-threaded, which can take a very long time and (depending on how many sequences you have) could time out. If you have access to it, I'd recommend: 1 node, 24-48 tasks per node, and 200GB memory total (not per core). Many of the functions have multithreading options, so you'll need to set these accordingly. In the majority of cases this works great, but in some cases (really diverse samples) you may need more memory.
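Translated into an SBATCH header, the recommendation above might look like the sketch below. Since dada2's `multithread` option uses threads within a single R process, the cores are requested via `--cpus-per-task` rather than `--ntasks-per-node`; the script and module names are hypothetical:

```shell
#!/bin/bash
#SBATCH --time=0-48:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=24     # cores for dada2's multithread=TRUE
#SBATCH --mem=200G             # total memory for the job, not per core

module load r                  # hypothetical module name for your cluster
Rscript dada2_pipeline.R       # hypothetical script name
```

Inside the R script, setting `multithread = TRUE` (or `multithread = 24` to match the allocation) on `filterAndTrim()`, `learnErrors()`, and `dada()` lets those steps use the requested cores.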
Thanks @cjfields for your input. Do you have any idea how I can use 200GB of memory total?
@benjjneb I have one more question: I work on multiple projects by subsetting samples from different tissues, with the sequences from each group of tissues forming a distinct project. Some tissues are included in multiple projects. When I process the sequences for each project separately (creating the phyloseq object at the end of the workflow, then subsetting the genotypes and subsequently the tissues associated with each genotype), I occasionally observe minor differences (1-2 taxa) in the taxa count for specific tissues across different phyloseq objects or projects. This suggests that the quality and quantity of sequences processed together can influence the denoising and ASV inference steps. Is that correct? Your input is much appreciated!
On our cluster we use …
@emankhalaf If the error model is identical, you would get exactly the same results for the same sample in each study. Given a fixed error model, the DADA2 denoising algorithm is deterministic. However, the error model is learned from the dataset, so there are likely some quantitative differences in the error models between your different datasets. These won't be large if everything is being generated using the same amplification/sequencing tech, but those minor differences are enough to produce the signal you are observing.
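One way to make per-sample results reproducible across projects, following the logic above, is to learn the error model once and reuse the saved object in every project's script. This is a sketch under that assumption (file paths and object names are hypothetical):

```r
library(dada2)

# Learn the error model once, from the full set of filtered files
filts <- list.files("filtered", pattern = "fastq.gz", full.names = TRUE)
err <- learnErrors(filts, multithread = TRUE)
saveRDS(err, "shared_error_model.rds")

# Later, in each project's script: reuse the same error model, so that
# denoising a given sample yields identical ASVs in every project
err <- readRDS("shared_error_model.rds")
project_filts <- filts[1:10]   # hypothetical project subset
dd <- dada(project_filts, err = err, multithread = TRUE)
```

With the error model held fixed this way, any remaining differences between projects would come from the subsetting itself rather than from denoising.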
Hello,
I am working on processing a large number of 16S sequence files generated using PacBio sequencing technology, specifically 647 and 944 files in separate runs. I am interested in using the pool option during the denoising step in DADA2. Currently, I am running the R script on a cluster with 200 GB of RAM.
As I am relatively new to using HPC and clusters, I started with smaller memory allocations (64 GB) and gradually increased the allocation as the script kept failing due to insufficient memory. Upon consulting the cluster's technical support, I was advised to explore whether my code could be parallelized to utilize more cores, so that I could request additional CPUs and potentially speed up the process.
Is parallelization supported in the DADA2 R package? If so, could you kindly guide me on how to implement it?
Below are the parameters I am using in my bash script to run the R script:
#SBATCH --time=0-48:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=200000M
Your help is greatly appreciated!
Thanks