Running DADA2 on clusters with pool option -- runs out of memory #2061

Open
emankhalaf opened this issue Nov 22, 2024 · 8 comments

@emankhalaf

Hello,

I am working on processing a large number of 16S sequence files generated using PacBio sequencing technology, specifically 647 and 944 files in separate runs. I am interested in using the pool option during the denoising step in DADA2. Currently, I am running the R script on a cluster with 200 GB of RAM.

As I am relatively new to using HPC clusters, I started with smaller memory allocations (64 GB) and gradually increased the allocation as the script kept failing due to insufficient memory. Upon consulting the cluster's technical support, I was advised to explore whether parallelization is possible for my code, so that it could use more cores and I could request additional CPUs, potentially speeding up the process.

Is parallelization supported in the DADA2 R package? If so, could you kindly guide me on how to implement it?

Below are the parameters I am using in my bash script to run the R script:
#SBATCH --time=0-48:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=200000M

Your help is greatly appreciated!
Thanks

@benjjneb
Owner

To run with pool=TRUE, it is required that all samples be loaded into memory at once to construct the "pool" sample that is then analyzed. Thus, there is no way to break the job apart across different nodes in a way that would reduce the memory requirement.

The pseudo-pooling approach is our path forward when pool=TRUE gets too large for available memory. It is not a perfect match, but our testing shows that it approximates pool=TRUE while only needing to load one sample into memory at a time. More info here: https://benjjneb.github.io/dada2/pseudo.html
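
A minimal sketch of what that looks like in code, assuming filtered per-sample fastq files in a "filtered/" directory (the paths and object names are placeholders):

library(dada2)

# Filtered, primer-trimmed PacBio fastq files, one per sample (hypothetical path)
filts <- list.files("filtered", pattern = "\\.fastq\\.gz$", full.names = TRUE)

# Learn the error model from the filtered reads
# (the DADA2 PacBio workflow also sets errorEstimationFunction = PacBioErrfun)
err <- learnErrors(filts, multithread = TRUE)

# pool = "pseudo" approximates pool = TRUE while only loading one sample into memory at a time
dd <- dada(filts, err = err, pool = "pseudo", multithread = TRUE)

seqtab <- makeSequenceTable(dd)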

@emankhalaf
Author

@benjjneb Thank you so much and my apologies for the delayed reply.

I have one more question: is it necessary to dereplicate (drp) the sequences prior to the denoising (dd) step, or does the denoising step (dd) inherently dereplicate the sequences in order to infer the ASVs? In other words, would it be an issue to proceed directly from the error model to the denoising step without performing a separate dereplication step?

@benjjneb
Owner

benjjneb commented Dec 2, 2024

I have one more question: is it necessary to dereplicate (drp) the sequences prior to the denoising (dd) step, or does the denoising step (dd) inherently dereplicate the sequences in order to infer the ASVs? In other words, would it be an issue to proceed directly from the error model to the denoising step without performing a separate dereplication step?

Yes, the denoising step does dereplication itself, and this is the preferred way to run denoising (i.e. without calling derepFastq explicitly) because then only one sample is dereplicated and loaded into memory at a time, instead of all at once.
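
A short sketch of that pattern, with placeholder file paths:

library(dada2)

filts <- list.files("filtered", pattern = "\\.fastq\\.gz$", full.names = TRUE)
err <- learnErrors(filts, multithread = TRUE)

# Passing the filtered fastq file names directly lets dada() dereplicate each
# sample internally, one at a time, instead of calling derepFastq() on all
# samples up front and holding every dereplicated sample in memory at once
dd <- dada(filts, err = err, multithread = TRUE)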

@cjfields

cjfields commented Dec 3, 2024

@emankhalaf just a note:

#SBATCH --time=0-48:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=200000M

This will run single-threaded, which can take a very long time and (depending on how many sequences you have) could time out.

If you have access to it, I'd recommend: 1 node, 24-48 tasks per node, and 200GB memory total (not per core). Many of the functions have multithreading options, so you'll need to set these accordingly. In the majority of cases this works great, but in some cases (really diverse samples) you may need more memory.
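
On the R side, a rough sketch of wiring the allocated cores through to DADA2's multithread arguments (this assumes the job sets --cpus-per-task so that SLURM exposes SLURM_CPUS_PER_TASK; the file paths are placeholders):

library(dada2)

# Use the core count SLURM granted; fall back to TRUE (all cores visible to R)
ncpus <- Sys.getenv("SLURM_CPUS_PER_TASK")
ncpus <- if (nzchar(ncpus)) as.integer(ncpus) else TRUE

filts <- list.files("filtered", pattern = "\\.fastq\\.gz$", full.names = TRUE)

# filterAndTrim(), learnErrors(), and dada() all take a multithread argument,
# which can be TRUE or an explicit thread count
err <- learnErrors(filts, multithread = ncpus)
dd  <- dada(filts, err = err, pool = "pseudo", multithread = ncpus)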

@emankhalaf
Author

Thanks @cjfields for your input. Do you have any idea how I can use 200GB memory total?

@emankhalaf
Author

@benjjneb I have one more question: I work on multiple projects by subsetting samples from different tissues, with the sequences from each group of tissues forming a distinct project. Some tissues are included in multiple projects. When I process the sequences separately according to each project's objectives (creating the phyloseq object at the end of the workflow, then subsetting the genotypes and subsequently the tissues associated with each genotype), I occasionally observe minor differences (1-2 taxa) in the taxa count for specific tissues across the different phyloseq objects or projects. This suggests that the quality and quantity of the sequences processed together can influence the denoising and ASV inference steps. Is that correct?

Your input is much appreciated!

@cjfields

cjfields commented Dec 4, 2024

Thanks @cjfields for your input. Do you have any idea how I can use 200GB memory total?

On our cluster we use --mem 200g. In a batch script (4 cores total, 12GB memory):

#!/bin/bash
#SBATCH -n 4
#SBATCH --mem=12g
...

@benjjneb
Owner

benjjneb commented Dec 6, 2024

@emankhalaf If the error model is identical, you would get exactly the same results for the same sample in each study. Given a fixed error model, the DADA2 denoising algorithm is deterministic. However, the error model is learned from the dataset, so there are likely some quantitative differences in the error models between your different datasets. These won't be large if everything is generated using the same amplification/sequencing technology, but those minor differences are enough to produce the signal you are observing.
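
To illustrate that point, a sketch with hypothetical file sets and default per-sample (non-pooled) denoising; whether learning a single error model across projects is appropriate is a separate design decision:

library(dada2)

# Hypothetical filtered files for two projects that share some samples
projA <- list.files("projectA/filtered", pattern = "\\.fastq\\.gz$", full.names = TRUE)
projB <- list.files("projectB/filtered", pattern = "\\.fastq\\.gz$", full.names = TRUE)

# Learning errors separately per project yields slightly different err objects,
# which is enough to shift a borderline ASV call or two in a shared sample
errA <- learnErrors(projA, multithread = TRUE)
errB <- learnErrors(projB, multithread = TRUE)
ddA  <- dada(projA, err = errA, multithread = TRUE)
ddB  <- dada(projB, err = errB, multithread = TRUE)

# With one fixed error model (and default per-sample denoising), a sample that
# appears in both projects would be denoised identically in each run
err_fixed <- learnErrors(c(projA, projB), multithread = TRUE)
ddA2 <- dada(projA, err = err_fixed, multithread = TRUE)
ddB2 <- dada(projB, err = err_fixed, multithread = TRUE)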
