Merge pull request #56 from genepi/code_refactoring

Code refactoring
genepi · Aug 7, 2023 · b8f4c6c
2 parents bae68d0 + a886f08
commit b8f4c6c

Showing 47 changed files with 500 additions and 1,110 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,8 +1,8 @@
-# ecSeq/DNAseq
+# genepi/umi-pipeline-nf
 ---
 # Releases
 
 ---
 # Prereleases
 ## v0.1.0 - 
-* Initialised repo
+* Initialised repo
diff --git a/README.md b/README.md
@@ -1,18 +1,34 @@
-[<img width="200" align="right" src="docs/images/ecseq.jpg">](https://www.ecseq.com)
 [![Nextflow](https://img.shields.io/badge/nextflow-20.07.1-brightgreen.svg)](https://www.nextflow.io/)
 [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg)](http://bioconda.github.io/)
-[![Docker](https://img.shields.io/docker/automated/ecseq/dnaseq.svg)](https://hub.docker.com/r/ecseq/dnaseq)
 
-umi-pipeline-nf Pipeline
+Umi-pipeline-nf
 ======================
 
-**umi-pipeline-nf** is based on a [snakemake pipeline](https://github.com/nanoporetech/pipeline-umi-amplicon) provided by [Oxford Nanopore Technologies (ONT)](https://nanoporetech.com/). To increase efficiency and usability the pipeline was transferred to [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with docker containers making installation simple and results highly reproducible.
+**Umi-pipeline-nf** creates highly accurate single-molecule consensus sequences for unique molecular identifier (UMI)-tagged amplicon data.  
+The pipeline can be run for the whole fastq_pass folder of your nanopore run and, per default, outputs the aligned consensus sequences of each UMI cluster in bam file. The optional variant calling creates a vcf file for all variants that are found in the consensus sequences.
+umi-pipeline-nf is based on the snakemake [ONT UMI analysis pipeline](https://github.com/nanoporetech/pipeline-umi-amplicon) (workflow originally developed by [Karst et al, Nat Methods 18:165–169, 2021](https://www.nature.com/articles/s41592-020-01041-y)). We transferred the pipeline to [Nextflow](https://www.nextflow.io) and included [additional functionalities](#main-adaptations).  
 
-## Overview
-`umi-pipeline-nf` creates highly accurate single-molecule consensus sequences based on amplicon data tagged by unique molecular identifiers (UMIs). The pipeline can be run for the whole fastq_pass folder of your nanopore run and per default, the output are the aligned consensus sequences in bam file format. 
-Additional flags can be set to perform a variant calling ( [freebayes](https://github.com/freebayes/freebayes), [lofreq](http://csb5.github.io/lofreq/) or [mutserve](https://mitoverse.readthedocs.io/mutserve/mutserve/) )
+## Workflow
 
-> See the [output documentation](docs/output.md) for more details of the results.
+1. Input reads are aligned against a reference genome.
+2. The flanking UMI sequences of all reads are extracted.
+3. The extracted UMIs are used to cluster the reads.
+4. Per cluster, highly accurate consensus sequences are created.
+5. The consensus sequences are aligned against the reference sequence.
+6. An optional variant calling step can be performed.
+
+> See the [output documentation](docs/output.md) for a detailed overview of the pipeline and its output files.
+
+## Main Adaptations
+
+* It comes with docker containers making **installation simple, portable** and **results highly reproducible**.
+* The pipeline is **optimized for parallelization**.
+* Read filtering strategy per UMI cluster was adapted to **preserve the highest quality reads**.
+* **Three commonly used variant callers** ([freebayes](https://github.com/freebayes/freebayes), [lofreq](http://csb5.github.io/lofreq/) or [mutserve](https://mitoverse.readthedocs.io/mutserve/mutserve/)) are supported by the pipeline.
+* The raw reads can be optionally **subsampled**.
+* The raw reads can be **filtered by read length and quality**.
+
+> See the [usage documentation](docs/usage.md) for all of the available parameters of the pipeline.
 
 ## Quick Start
 
@@ -21,19 +37,20 @@ Additional flags can be set to perform a variant calling ( [freebayes](https://g
 2. Download the pipeline and test it on a minimal dataset with a single command
 
 ```bash
-nextflow run AmstlerStephan/umi-pipeline-nf -profile test,docker
+nextflow run genepi/umi-pipeline-nf -profile test,docker
 ```
 
 3. Start running your own analysis!
 3.1 Download and adapt the config/custom.config with paths to your data (relative and absolute paths possible)
 
 ```bash
-nextflow run AmstlerStephan/umi-pipeline-nf -r main -c <custom.config> -profile docker 
+nextflow run genepi/umi-pipeline-nf -r main -c <custom.config> -profile docker 
 ```
 
-> See the [usage documentation](docs/usage.md) for all of the available options when running the pipeline.
-
 
 ### Credits
 
-These scripts were originally written for use by [GENEPI](https://genepi.i-med.ac.at/), by ([@StephanAmstler](https://github.com/AmstlerStephan)).
+The pipeline was written by ([@StephanAmstler](https://github.com/AmstlerStephan)).  
+Nextflow template pipeline: [EcSeq](https://github.com/ecSeq).  
+Original snakemake-based pipeline: [nanoporetech/pipeline-umi-amplicon](https://github.com/nanoporetech/pipeline-umi-amplicon).  
+Original workflow: [SorenKarst/longread_umi](https://github.com/SorenKarst/longread_umi).
diff --git a/bin/bam_to_phred.py b/bin/bam_to_phred.py
diff --git a/bin/extract_umis.py b/bin/extract_umis.py
@@ -52,13 +52,26 @@ def parse_args(argv):
         help="Length of adapter",
     )
     parser.add_argument(
-        "-t", "--threads", dest="THREADS", type=int, default=1, help="Number of threads."
+        "-t",
+        "--threads",
+        dest="THREADS",
+        type=int,
+        default=1,
+        help="Number of threads."
     )
     parser.add_argument(
-        "--tsv", dest="TSV", action="store_true", help="write TSV output file"
+        "--tsv",
+        dest="TSV",
+        action="store_true",
+        help="write TSV output file"
     )
     parser.add_argument(
-        "-o", "--output", dest="OUT", type=str, required=False, help="Output directory"
+        "-o",
+        "--output",
+        dest="OUT",
+        type=str,
+        required=False,
+        help="Output directory"
     )
     parser.add_argument(
         "--output_format",
@@ -82,7 +95,10 @@ def parse_args(argv):
         help="Reverse UMI sequence",
     )
     parser.add_argument(
-        "INPUT_FA", type=str, default="/dev/stdin", help="Filtered Reads"
+        "INPUT_FA",
+        type=str,
+        default="/dev/stdin",
+        help="Filtered Reads"
     )
 
     args = parser.parse_args(argv)
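
Review note on the `INPUT_FA` argument reformatted above: in `argparse`, a `default` on a positional argument only takes effect when the argument may be omitted, which requires `nargs="?"`. The hunk shows no `nargs`, so `/dev/stdin` would presumably never be used as a fallback. The sketch below is a minimal, standalone illustration of that behavior, not the pipeline's parser; `build_parser` is a hypothetical helper and only two of the arguments are reproduced.

```python
import argparse


def build_parser():
    # Hypothetical, reduced stand-in for the parser in bin/extract_umis.py.
    parser = argparse.ArgumentParser(description="argparse positional default sketch")
    parser.add_argument(
        "-t",
        "--threads",
        dest="THREADS",
        type=int,
        default=1,
        help="Number of threads.",
    )
    parser.add_argument(
        "INPUT_FA",
        type=str,
        nargs="?",  # assumption: added here so the default below can apply
        default="/dev/stdin",
        help="Filtered Reads",
    )
    return parser


# Without a positional argument the default is used.
no_input = build_parser().parse_args([])
print(no_input.INPUT_FA)

# With a positional argument the given path wins.
with_file = build_parser().parse_args(["-t", "4", "reads.fastq"])
print(with_file.THREADS, with_file.INPUT_FA)
```

Whether reading from standard input is actually intended for this script is not stated in the diff; if it is not, dropping the `default` would make the required positional explicit.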
@@ -109,8 +125,10 @@ def extract_umi(query_seq, query_qual, pattern, max_edit_dist, format):
     edit_dist = result["editDistance"]
     locs = result["locations"][0]
     umi = query_seq[locs[0]:locs[1]+1]
+
     if format == "fastq":
         umi_qual = query_qual[locs[0]:locs[1]+1]
+
     return edit_dist, umi, umi_qual
 
 
@@ -123,15 +141,18 @@ def extract_adapters(entry, max_adapter_length, format):
     if len(entry.sequence) > max_adapter_length:
         read_5p_seq = entry.sequence[:max_adapter_length]
         read_3p_seq = entry.sequence[-max_adapter_length:]
+
         if format == "fastq":
             read_5p_qual = entry.quality[:max_adapter_length]
             read_3p_qual = entry.quality[-max_adapter_length:]
 
     return read_5p_seq, read_3p_seq, read_5p_qual, read_3p_qual
 
+
 def get_read_name(entry):
     return entry.name.split(";")[0]
 
+
 def get_read_strand(entry):
     strand = entry.name.split("strand=")
     if len(strand) > 1:
@@ -140,6 +161,7 @@ def get_read_strand(entry):
     else:
         return "+"
 
+
 def combine_umis_fasta(seq_5p, seq_3p, strand):
     if strand == "+":
         return seq_5p + seq_3p
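
Review note on the slicing touched in `extract_umi` and `extract_adapters`: the hunks only add blank lines, but the logic they frame is easy to misread, so here is a small standalone sketch of it. It assumes the alignment result is a dict with an `editDistance` value and a list of inclusive `(start, end)` `locations`, which is the shape the diff reads from. The `Read` tuple, the `slice_umi` name, the `None` defaults, and the parameter name `fmt` (the original uses `format`) are illustrative assumptions, not code from the repository.

```python
from collections import namedtuple

# Hypothetical stand-in for the parsed read entries used by the script.
Read = namedtuple("Read", ["name", "sequence", "quality"])


def extract_adapters(entry, max_adapter_length, fmt):
    # Assumption: all four windows stay None when the read is too short.
    read_5p_seq = read_3p_seq = read_5p_qual = read_3p_qual = None
    if len(entry.sequence) > max_adapter_length:
        # Take a fixed-size window from each end of the read.
        read_5p_seq = entry.sequence[:max_adapter_length]
        read_3p_seq = entry.sequence[-max_adapter_length:]
        if fmt == "fastq":
            read_5p_qual = entry.quality[:max_adapter_length]
            read_3p_qual = entry.quality[-max_adapter_length:]
    return read_5p_seq, read_3p_seq, read_5p_qual, read_3p_qual


def slice_umi(query_seq, query_qual, result, fmt):
    # The end coordinate is inclusive, hence the +1 in the slice.
    start, end = result["locations"][0]
    umi = query_seq[start:end + 1]
    # Assumption: no quality string is returned for non-fastq input.
    umi_qual = query_qual[start:end + 1] if fmt == "fastq" else None
    return result["editDistance"], umi, umi_qual


read = Read("read1;strand=+", "AACCGGTTAACCGGTT", "IIIIHHHHGGGGFFFF")
windows = extract_adapters(read, 6, "fastq")
print(windows)  # ('AACCGG', 'CCGGTT', 'IIIIHH', 'GGFFFF')

# Hand-written result dict standing in for a real aligner call.
hit = {"editDistance": 1, "locations": [(2, 5)]}
umi_hit = slice_umi(windows[0], windows[2], hit, "fastq")
print(umi_hit)  # (1, 'CCGG', 'IIHH')
```

The sequence and quality windows are cut with identical indices, so each base in an extracted UMI stays paired with its quality score.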

diff --git a/bin/setup.py b/bin/setup.py
@@ -15,7 +15,6 @@
     description='Toolset to work with ONT amplicon sequencing using UMIs',
     zip_safe=False,
     install_requires=[
-        'tqdm',
         'pysam',
         'numpy',
         'pandas',
@@ -32,7 +31,6 @@
             'umi_extract = umi_amplicon_tools.extract_umis:main',
             'umi_reformat_consensus = umi_amplicon_tools.reformat_consensus:main',
             'umi_parse_clusters = umi_amplicon_tools.parse_clusters:main',
-            'umi_bam_to_phred = umi_amplicon_tools.bam_to_phred:main',
             'umi_stats = umi_amplicon_tools.umi_stats:main'
         ]
     },