Version 2.0 (#1)

- Use Dorado instead of Guppy (with option to use it without local installation).
- Include software versions and basecalling model in MultiQC report.
catg-umag · Feb 26, 2024 · f3706ab
1 parent 9467458
commit f3706ab
Showing 13 changed files with 410 additions and 170 deletions.
diff --git a/README.md b/README.md
@@ -1,24 +1,48 @@
 # ONT Basecalling / Demux Pipeline
 
 Small pipeline to perform basecalling and demultiplexing (optional) of ONT data, collect QC metrics and generate a MultiQC report.
-It uses Guppy for basecalling and demultiplexing.
+It uses Dorado for basecalling and demultiplexing.
 
 ## Requirements
+
 - [Nextflow](https://www.nextflow.io/) (>= 22.04)
 - [Apptainer](https://apptainer.org/) / Singularity
-- Guppy GPU (>= 6.4.6). Not distributed with the pipeline, hast to be downloaded from [ONT community](https://community.nanoporetech.com/)
+- Dorado (0.5.3 tested). It can be used via container, or installed locally from https://github.com/nanoporetech/dorado.
 
 ## Usage
+
 - Clone this repository
-- If you want to demultiplex: create a `samples.csv` file with at least the `barcode` and `sample` columns. The `barcode` column should contain the barcode used for demultiplexing (with the leading zero, e.g. `barcode01`), and the `sample` column should contain the sample name (this name with be used on the report and as name for FASTQ file).
-- Make a copy of `params.default.yml` and modify it according to your needs. Remember to point `sample_data` parameter to the file created at the previous step.
+- **If you want to demultiplex:** create a `samples.csv` file with at least the `barcode` and `sample` columns. The `barcode` column should contain the barcode used for demultiplexing (with the leading zero, e.g. `barcode01`), and the `sample` column should contain the sample name (this name will be used in the report and as the name of the FASTQ file).
+- Copy `params.example.yml` (for example to `./my_params.yml`) and modify it according to your needs. Remember to point the `sample_data` parameter to the file created in the previous step.
 - Run the pipeline passing your params file to `-params-file` option:
+
 ```
 nextflow run ont-basecalling-demultiplexing/ -params-file my_params.yml
 ```
 
+## Parameters
+
+| Parameter                                | Required | Default                              | Description                                                                                                             |
+| ---------------------------------------- | -------- | ------------------------------------ | ----------------------------------------------------------------------------------------------------------------------- |
+| `experiment_name`                        | False    | -                                    | Name of the experiment, used for final reports (title and filename).                                                    |
+| `data_dir`                               | True     | -                                    | Path to the folder containing the POD5 files.                                                                           |
+| `sample_data`                            | True     | `input/samples.csv`                  | Path to the CSV file containing the sample data (required if demultiplexing).                                           |
+| `output_dir`                             | False    | `demultiplex_results`                | Path to the folder where the results will be saved.                                                                     |
+| `fastq_output`                           | False    | `true`                               | If `true`, the pipeline will generate FASTQ files (if not, it would be UBAM files).                                     |
+| `qscore_filter`                          | False    | `10`                                 | Minimum QScore for the "pass" data, used for demultiplexing.                                                            |
+| `dorado_basecalling_model`               | False    | `[email protected]` | Model used for basecalling.                                                                                             |
+| `dorado_basecalling_extra_config`        | False    | -                                    | Extra configuration for Dorado basecalling.                                                                             |
+| `dorado_basecalling_gpus`                | False    | `1`                                  | Number of GPUs to use for basecalling.                                                                                  |
+| `skip_demultiplexing`                    | False    | `false`                              | If `true`, the pipeline will not perform demultiplexing.                                                                |
+| `dorado_demux_kit`                       | False    | `EXP-NBD196`                         | Kit used for demultiplexing.                                                                                            |
+| `dorado_demux_both_ends`                 | False    | `false`                              | If `true`, the pipeline will demultiplex using barcodes from both sides (5' and 3').                                    |
+| `dorado_demux_extra_config`              | False    | -                                    | Extra configuration for Dorado demultiplexing.                                                                          |
+| `dorado_demux_cpus`                      | False    | `16`                                 | Number of CPUs to use for demultiplexing.                                                                               |
+| `use_dorado_container`                   | False    | `true`                               | If `true`, the pipeline will use Dorado via container (~3.5GB download). If `false`, it will expect to find it locally. |
+
 ## Considerations
-- The pipeline is designed to run on a SLURM cluster, but should run on local machines as well.
+
+- It is possible to run the pipeline either locally or on SLURM clusters (using `-profile slurm`).
 - Basecalling and demultiplexing are performed on separated steps to allow for a better control of the resources used by each process, and to prevent a whole basecalling redo in case of a failure during demultiplexing, wrong kit specified, etc.
-- The basecalling process uses GPU, so make sure to have one available. The SLURM job will be submitted with `--gres=gpu:X` option (with `X` as 1 by default).
-- Demultiplexing doesn't use GPU.
+- The basecalling process uses a GPU, so make sure to have one available. If using SLURM, the job will be submitted with the `--gres=gpu:X` option.
+- The demultiplexing step doesn't use a GPU, only CPUs.
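
The `samples.csv` file described in the README's Usage section could be drafted and sanity-checked along these lines. This is only a sketch: the sample names are hypothetical, and both the column order (`barcode` first) and the two-digit barcode pattern are assumptions taken from the `barcode01` example, not rules stated by the pipeline.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sample names; the barcodes must match the kit actually used.
cat > samples.csv <<'EOF'
barcode,sample
barcode01,sample_A
barcode02,sample_B
EOF

# Assumption: the two required columns come first, in this order.
head -n 1 samples.csv | grep -q '^barcode,sample' \
  || { echo "missing barcode/sample columns" >&2; exit 1; }

# Assumed pattern: "barcode" followed by two digits (leading zero included).
# grep -c exits non-zero when the count is 0, hence the "|| true".
bad=$(tail -n +2 samples.csv | cut -d, -f1 | grep -Evc '^barcode[0-9]{2}$' || true)
echo "malformed barcodes: $bad"
```

A non-zero count points at rows such as `barcode1` that lack the leading zero.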
diff --git a/conf/containers.config b/conf/containers.config
@@ -0,0 +1,17 @@
+// containers
+process {
+  withLabel: linux    { container = 'ubuntu:22.04' }
+  withLabel: fastqc   { container = 'quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0' }
+  withLabel: nanoplot { container = 'quay.io/biocontainers/nanoplot:1.42.0--pyhdfd78af_0' }
+  withLabel: multiqc  { container = 'quay.io/biocontainers/multiqc:1.19--pyhdfd78af_0' }
+  withLabel: pigz     { container = 'ghcr.io/dialvarezs/containers/utils:latest' }
+  withLabel: pycoqc   { container = 'quay.io/biocontainers/pycoqc:2.5.2--py_0' }
+  withLabel: samtools { container = 'quay.io/biocontainers/samtools:1.19.2--h50ea8bc_0' }
+
+  withLabel: dorado   {
+    container = params.use_dorado_container
+      ? 'ghcr.io/dialvarezs/containers/dorado:0.5.3'
+      : null
+    containerOptions = '--nv'
+  }
+}
diff --git a/conf/params.config b/conf/params.config
@@ -0,0 +1,17 @@
+params {
+	experiment_name = ''
+	data_dir = null
+	sample_data = 'input/samples.csv'
+	output_dir = 'demultiplex_results/'
+	fastq_output = true
+	qscore_filter = 10
+	dorado_basecalling_model = '[email protected]'
+	dorado_basecalling_extra_config = ''
+	dorado_basecalling_gpus = 1
+	skip_demultiplexing = false
+	dorado_demux_kit = 'EXP-NBD196'
+	dorado_demux_both_ends = false
+	dorado_demux_extra_config = ''
+	dorado_demux_cpus = 16
+	use_dorado_container = true
+}
diff --git a/conf/profiles.config b/conf/profiles.config
@@ -0,0 +1,18 @@
+profiles {
+  apptainer {
+    apptainer {
+      enabled = true
+      autoMounts = true
+    }
+  }
+  slurm {
+    process {
+      executor = 'slurm'
+      module = 'apptainer'
+
+      withLabel: dorado {
+        module = params.use_dorado_container ? null : 'dorado'
+      }
+    }
+  }
+}
diff --git a/main.nf b/main.nf
@@ -1,17 +1,15 @@
 #!/usr/bin/env nextflow
-include { addDefaultParamValues; pathCheck } from './lib/groovy/utils.gvy'
-
-// load default parameters from YAML
-addDefaultParamValues(params, "${workflow.projectDir}/params.default.yml")
-
-
 include { BasecallingAndDemux } from './subworkflows/basecalling_demux.nf'
 include { QualityCheck }        from './subworkflows/quality_check.nf'
+include { GenerateReports }     from './subworkflows/reports.nf'
+include { CollectVersions }     from './subworkflows/versions.nf'
+
+include { pathCheck } from './lib/groovy/utils.gvy'
 
 
 // check and prepare input channels
 data_dir = pathCheck(params.data_dir, isDirectory = true)
-multiqc_config = pathCheck("${workflow.projectDir}/conf/multiqc_config.yaml")
+multiqc_config = pathCheck("${workflow.projectDir}/tool_conf/multiqc_config.yaml")
 
 if (params.skip_demultiplexing) {
   sample_names = channel.fromList([])
@@ -25,10 +23,18 @@ if (params.skip_demultiplexing) {
 
 workflow {
   BasecallingAndDemux(sample_names, data_dir)
+
   QualityCheck(
     BasecallingAndDemux.out.sequences,
-    BasecallingAndDemux.out.sequencing_summary,
-    BasecallingAndDemux.out.barcoding_summary,
+    BasecallingAndDemux.out.sequencing_summary
+  )
+
+  CollectVersions()
+
+  GenerateReports(
+    QualityCheck.out.software_reports,
+    CollectVersions.out.software_versions,
+    CollectVersions.out.model_versions,
     multiqc_config
   )
 }
diff --git a/nextflow.config b/nextflow.config
@@ -4,24 +4,6 @@ process {
   errorStrategy = 'finish'
 }
 
-singularity {
-  enabled = true
-  autoMounts = true
-}
-
-process {
-  executor = 'slurm'
-  module = 'apptainer'
-
-  withLabel: guppy  { module = 'guppy' }
-}
-
-// containers
-process {
-  withLabel: linux    { container = 'ubuntu:22.04' }
-  withLabel: pigz     { container = 'ghcr.io/dialvarezs/containers/pigz:2.7' }
-  withLabel: fastqc   { container = 'quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0' }
-  withLabel: nanoplot { container = 'quay.io/biocontainers/nanoplot:1.41.3--pyhdfd78af_0' }
-  withLabel: multiqc  { container = 'quay.io/biocontainers/multiqc:1.14--pyhdfd78af_0' }
-  withLabel: pycoqc   { container = 'quay.io/biocontainers/pycoqc:2.5.2--py_0' }
-}
+includeConfig 'conf/params.config'
+includeConfig 'conf/profiles.config'
+includeConfig 'conf/containers.config'
diff --git a/params.default.yml b/params.default.yml
diff --git a/params.example.yml b/params.example.yml
@@ -0,0 +1,15 @@
+experiment_name: ''
+data_dir: input/pod5/
+sample_data: input/samples.csv
+output_dir: demultiplex_results/
+fastq_output: true
+qscore_filter: 10
+dorado_basecalling_model: [email protected]
+dorado_basecalling_extra_config: ''
+dorado_basecalling_gpus: 1
+skip_demultiplexing: false
+dorado_demux_kit: EXP-NBD196
+dorado_demux_both_ends: false
+dorado_demux_extra_config: ''
+dorado_demux_cpus: 16
+use_dorado_container: true
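
Since `conf/params.config` now carries the defaults, a params file only needs the keys that differ from them. The sketch below writes a minimal file for a basecalling-only run and launches the pipeline if Nextflow is available. The file name `my_params.yml` and the chosen overrides are illustrative assumptions; the `apptainer` profile name comes from `conf/profiles.config`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical file name; any path can be passed to -params-file.
params_file="my_params.yml"

# Only the values that differ from conf/params.config are listed;
# keys left out keep their defaults.
cat > "$params_file" <<'EOF'
data_dir: input/pod5/
skip_demultiplexing: true
EOF

cmd="nextflow run ont-basecalling-demultiplexing/ -params-file $params_file -profile apptainer"

# Run only when Nextflow and the cloned repository are present;
# otherwise just print the command that would be used.
if command -v nextflow >/dev/null 2>&1 && [ -d ont-basecalling-demultiplexing ]; then
  $cmd
else
  echo "$cmd"
fi
```

On a cluster, profiles can be combined with a comma, for example `-profile apptainer,slurm`.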