R Encountering Issues When Processing PacBio Revio 16S Data with the DADA2 Pipeline #2074

emankhalaf · 2025-01-09T16:39:19Z

I am currently working on a new batch of 16S data generated using PacBio Revio technology. While I successfully processed the sequences in Qiime2 within a reasonable time, utilizing the DADA2 plugin for the denoising step, I encountered significant challenges when using the DADA2 pipeline in R.

Each step of the pipeline takes an unusually long time, and R has crashed multiple times during the process. After each crash, I resumed the script by reloading the most recent saved output. For example, the denoising step alone took several days to process 56 sequence files, which seems unreasonable. Similarly, the alignment step ran for four days before ultimately causing RStudio to crash.
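For reference, resuming after a crash can be sketched roughly as follows. This is a minimal base-R illustration, not part of the dada2 API: `with_checkpoint` and the toy `step1` function are hypothetical names, and in a real run the wrapped function would be a slow step such as `learnErrors()` or `dada()`.

```r
# Hypothetical helper: run `compute()` only if no checkpoint exists at `path`.
with_checkpoint <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))   # resume from the saved result
  }
  result <- compute()
  saveRDS(result, path)     # persist the result before moving on
  result
}

# Toy usage: the second call loads from disk instead of recomputing.
ckpt <- file.path(tempdir(), "step1.rds")
unlink(ckpt)                # start from a clean state for the demo
calls <- 0
step1 <- function() { calls <<- calls + 1; sum(1:10) }
first  <- with_checkpoint(ckpt, step1)
second <- with_checkpoint(ckpt, step1)
```

Saving each step's output to its own file means a crash costs only the step that was running at the time.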

Given these issues, I am wondering whether there might be an incompatibility between Revio sequences and the algorithms used in the DADA2 pipeline in R. I ran the script several times, both on a physical server and in cloud environments with high computational power, but the problems persisted.

Attached, I’ve included the error plots generated from R. These plots appear unusual compared to those typically generated from PacBio Sequel II technology.

I would greatly appreciate your insights on interpreting these issues and any guidance you can provide to address them.

Thank you for your time and assistance.

benjjneb · 2025-01-09T20:54:19Z

> I am currently working on a new batch of 16S data generated using PacBio Revio technology. While I successfully processed the sequences in Qiime2 within a reasonable time, utilizing the DADA2 plugin for the denoising step, I encountered significant challenges when using the DADA2 pipeline in R.

Could you provide a bit more information about this? In particular, what is your QIIME2 version? And what are the versions of R and the relevant packages in the R environment that is failing to process the same data (e.g. `sessionInfo()` output)? Also, what is your "pre-processing" workflow on the R side, i.e. what are you doing prior to getting to learnErrors?

emankhalaf · 2025-01-09T23:03:40Z

@benjjneb

I am using QIIME 2 (qiime2-amplicon-2024.10), R 4.4.2, and dada2 1.32.0.
Code used prior to getting to learnErrors:

library(dada2)
library(ggplot2)  # provides theme_set() and theme_bw()

# Input FASTQ files and the full-length 16S primers (27F / 1492R)
fns <- list.files("/home/sequences", pattern="fq", full.names=TRUE)
F27 <- "AGAGTTTGATCMTGGCTCAG"
R1492 <- "TACGGYTACCTTGTTAYGACTT"
rc <- dada2:::rc  # reverse-complement helper
theme_set(theme_bw())

# Remove primers and orient reads
nops <- file.path("/home/sequences", "noprimers", basename(fns))
prim <- removePrimers(fns, nops, primer.fwd=F27, primer.rev=rc(R1492), orient=TRUE, verbose=TRUE)

# Filter by quality, length, and expected errors
filts <- file.path("/home/sequences", "noprimers", "filtered", basename(fns))
track <- filterAndTrim(nops, filts, minQ=3, minLen=1300, maxLen=1600, maxN=0, rm.phix=FALSE, maxEE=2, verbose=TRUE)
track

# Learn errors
err <- learnErrors(filts, errorEstimationFunction=PacBioErrfun, BAND_SIZE=32, multithread=TRUE)

Sorry, I did not copy the other loaded packages: I have been using this RStudio session for other scripts, so not all of them are relevant to the dada2 script.

sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.12.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/Toronto
tzcode source: system (glibc)

attached base packages:
[1] grid stats graphics grDevices utils datasets methods base

other attached packages:
[1] dada2_1.32.0 Rcpp_1.0.13 gridExtra_2.3 VennDiagram_1.7.3 futile.logger_1.4.3 readxl_1.4.3
[7] dplyr_1.1.4

cjfields · 2025-01-13T23:27:12Z

@emankhalaf see #1892, but note that @benjjneb recently added a new function for dealing with binned quality scores on the 'master' branch. I have found that processing any Revio data (especially from Kinnex kits) requires pretty significant compute resources.

emankhalaf · 2025-01-14T15:51:42Z

@cjfields
Thank you for your reply and your input. From the provided link, it seems I should increase the nbases parameter in the learnErrors step. For example:

errs <- learnErrors(dereps, 
    nbases = 1e9, 
    errorEstimationFunction = PacBioErrfun, 
    BAND_SIZE = 32, 
    multithread = 36, 
    verbose = TRUE)

Alternatively, I might need to use a value as high as 1e10.
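To put those values in perspective: `learnErrors` reads samples into memory until at least `nbases` total bases have been collected (the default is 1e8). A rough base-R estimate of how many reads each setting corresponds to, assuming a mean read length of about 1,450 nt after primer removal (an assumed figure for full-length 16S, not one measured on this dataset):

```r
# Assumed mean read length after primer removal; adjust to your own data.
mean_read_len <- 1450

# Candidate nbases settings: the learnErrors default and the two values discussed above.
nbases_options <- c(default = 1e8, raised = 1e9, high = 1e10)

# Approximate number of reads needed to reach each nbases threshold.
reads_needed <- ceiling(nbases_options / mean_read_len)
print(reads_needed)
```

Under that assumption, each tenfold increase in `nbases` means roughly ten times as many full-length reads used for error learning, with memory use and run time growing accordingly.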

I have not been able to find specific updates or recommendations online regarding the DADA2 workflow for processing Revio sequences. The link provided for a previous similar issue is one of the most commonly referenced sources when searching for solutions to my problem. However, I am still unclear about the exact adjustments needed to process Revio sequences properly.

What I understand so far is that the Revio system, like Illumina's NovaSeq, bins quality scores: rather than assigning a distinct quality score to each base call, it groups them into a small set of predefined categories. This affects the error-model learning step. However, I am unsure of the precise modifications required to address this issue effectively.
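One way to check whether a given file actually has binned quality scores is to tabulate the distinct Phred values it contains. The sketch below uses base R only. `qual_table` is a hypothetical helper, the toy FASTQ records are made up, and it assumes Phred+33 encoding with four lines per record:

```r
# Hypothetical helper: tabulate the distinct Phred scores in a FASTQ file.
# Assumes Phred+33 encoding and unwrapped (single-line) sequence/quality fields.
qual_table <- function(fastq_path, n_records = 1000) {
  lines <- readLines(fastq_path, n = 4 * n_records)
  qual_lines <- lines[seq(4, length(lines), by = 4)]   # every 4th line holds qualities
  scores <- unlist(lapply(qual_lines, utf8ToInt)) - 33L
  table(scores)
}

# Toy FASTQ with made-up records, only to show the shape of the output.
toy <- tempfile(fileext = ".fastq")
writeLines(c("@read1", "ACGTACGT", "+", "IIII5555",
             "@read2", "ACGTAC",   "+", "II55++"), toy)
qt <- qual_table(toy)
print(qt)
```

If a real file shows only a handful of distinct values rather than a broad, near-continuous range, that would be consistent with binning.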

Any further guidance or clarification would be greatly appreciated.
