Merge pull request #134 from EBI-Metagenomics/feature/re-structure-pipeline

Re-structure pipeline
EBI-Metagenomics · Oct 11, 2024 · 0c20229
2 parents df647fb + 1a1b878
commit 0c20229
Showing 278 changed files with 4,113 additions and 293,511 deletions.
diff --git a/.github/workflows/unit_tests.yml b/.github/workflows/unit_tests.yml
@@ -20,10 +20,17 @@ jobs:
       uses: actions/setup-python@v3
       with:
         python-version: "3.10"
+
+    - name: update pip
+      run: |
+        python -m pip install --upgrade pip
+
     - name: Install dependencies
       run: |
         pip install -r requirements-test.txt
+        pip install --upgrade numpy pandas
     - name: Unit tests
       run: |
         # TODO, improve the pythonpath handling
-        PYTHONPATH="$PYTHONPATH:bin" python -m unittest discover tests
+        export PYTHONPATH="$PYTHONPATH:bin"
+        python -m unittest discover tests
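
The TODO in the unit-test step concerns how `bin` ends up on `PYTHONPATH`. One possible alternative, sketched here and not taken from the repository, is to extend `sys.path` from Python itself. The helper name `add_import_dir` is hypothetical.

```python
import sys
from pathlib import Path


def add_import_dir(directory):
    """Put `directory` at the front of sys.path once, so its modules can be imported.

    Hypothetical helper: it roughly mirrors what exporting PYTHONPATH with
    `bin` appended achieves in the workflow, but from inside Python.
    """
    resolved = str(Path(directory).resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved
```

A test module could call `add_import_dir(Path(__file__).resolve().parent.parent / "bin")` before importing the scripts under test, which would remove the need for the `export` line.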
diff --git a/LICENSE b/LICENSE
@@ -186,7 +186,7 @@
       same "printed page" as the copyright notice for easier
       identification within third-party archives.
 
-   Copyright [yyyy] [name of copyright owner]
+   Copyright 2019 EMBL-EBI
 
    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.

diff --git a/README.md b/README.md
@@ -6,11 +6,10 @@
 
 1. [ The VIRify pipeline ](#virify)
 2. [ Nextflow execution ](#nf)
-3. [ CWL execution (discontinued) ](#cwl)
-4. [ Pipeline overview ](#overview)
-5. [ Detour: Metatranscriptomics ](#metatranscriptome)
-6. [ Resources ](#resources)
-7. [ Citations ](#cite)
+3. [ Pipeline overview ](#overview)
+4. [ Detour: Metatranscriptomics ](#metatranscriptome)
+5. [ Resources ](#resources)
+6. [ Citations ](#cite)
 
 <a name="virify"></a>
 
@@ -22,14 +21,12 @@ VIRify is a pipeline for the detection, annotation, and taxonomic classification
 
 The pipeline is implemented in [Nextflow](#nf) and additionally only Docker or Singularity are needed to run VIRify. Details about installation and usage are given below.
 
-**Please note**, that until v1.0 the pipeline was also implemented in [CWL](#cwl) as an alternative to [Nextflow](#nf). However, later updates were only included in the [Nextflow](#nf) version of the pipeline. 
-
 
 <a name="nf"></a>
 
 # Nextflow
 
-A [Nextflow](https://www.nextflow.io/) implementation of the VIRify pipeline. In the backend, the same scripts are used as in the [CWL](#cwl) implementation.
+A [Nextflow](https://www.nextflow.io/) implementation of the VIRify pipeline.
 
 ## What do I need?
 
@@ -155,21 +152,7 @@ The labels used in the Type column of the gff file correspond to the following n
 | prophage  | [SO:0001006](http://www.sequenceontology.org/browser/current_svn/term/SO:0001006) |
 | CDS | [SO:0000316](http://www.sequenceontology.org/browser/current_svn/term/SO:0000316) |
 
-Note that CDS are reported only when a ViPhOG match has been found. 
-
-
-<a name="cwl"></a>
-
-# Common Workflow Language (discontinued)
-
-**Until VIRify v1.0**, VIRify was implemented in [Common Workflow Language (CWL)](https://www.commonwl.org/) next to the Nextflow implementation. Both Workflow Management Systems were previously supported. 
-
-## What do I need?
-The implementation until v1.0 of VIRify uses CWL version 1.2. It was tested using Toil version 5.3.0 as the workflow engine and conda to manage the software dependencies.
-
-## How?
-For instructions go to the [CWL README](cwl/README.md).
-
+Note that CDS are reported only when a ViPhOG match has been found.
 
 <a name="overview"></a>
 

diff --git a/assets/methods_description_template.yml b/assets/methods_description_template.yml
@@ -0,0 +1,35 @@
+id: "ebi-metagenomics/emg-viral-pipeline-methods-description"
+description: "Suggested text and references to use when describing pipeline usage within the methods section of a publication."
+section_name: "ebi-metagenomics/emg-viral-pipeline Methods Description"
+section_href: "https://github.com/EBI-Metagenomics/emg-viral-pipeline"
+plot_type: "html"
+data: |
+  <h4>Methods</h4>
+  <p>Data was processed using ebi-metagenomics/emg-viral-pipeline v${workflow.manifest.version} (${doi_text}; <a href="https://doi.org/10.1093/nargab/lqac007">Krakau <em>et al.</em>, 2022</a>) of the nf-core collection of workflows (<a href="https://doi.org/10.1038/s41587-020-0439-x">Ewels <em>et al.</em>, 2020</a>), utilising reproducible software environments from the Bioconda (<a href="https://doi.org/10.1038/s41592-018-0046-7">Grüning <em>et al.</em>, 2018</a>) and Biocontainers (<a href="https://doi.org/10.1093/bioinformatics/btx192">da Veiga Leprevost <em>et al.</em>, 2017</a>) projects.</p>
+  <p>The pipeline was executed with Nextflow v${workflow.nextflow.version} (<a href="https://doi.org/10.1038/nbt.3820">Di Tommaso <em>et al.</em>, 2017</a>) with the following command:</p>
+  <pre><code>${workflow.commandLine}</code></pre>
+  <p>${tool_citations}</p>
+  <h4>References</h4>
+  <ul>
+    <li>
+      Informative Regions In Viral Genomes
+      <i>Viruses (2021)</i>
+      doi: <a href="https://doi.org/10.3390/v13061164">10.3390/v13061164</a>
+      Moreno-Gallego, Jaime Leonardo, and Alejandro Reyes
+    </li>
+    <li>
+      VIRify: an integrated detection, annotation and taxonomic classification pipeline using virus-specific protein profile hidden Markov models
+      <i>bioRxiv</i>
+      doi: <a href="https://doi.org/10.1101/2022.08.22.504484">10.1101/2022.08.22.504484</a>
+      Rangel-Pineros, Guillermo, et al.
+    </li>
+    ${tool_bibliography}
+  </ul>
+  <div class="alert alert-info">
+    <h5>Notes:</h5>
+    <ul>
+      ${nodoi_text}
+      <li>The command above does not include parameters contained in any configs or profiles that may have been used. Ensure the config file is also uploaded with your publication!</li>
+      <li>You should also cite all software used within this run. Check the "Software Versions" of this report to get version information.</li>
+    </ul>
+  </div>
diff --git a/assets/mgnify_logo.png b/assets/mgnify_logo.png
diff --git a/assets/multiqc_config.yml b/assets/multiqc_config.yml
@@ -0,0 +1,61 @@
+report_comment: >
+
+  This report has been generated by the <a href="https://github.com/ebi-metagenomics/emg-viral-pipeline/" target="_blank">ebi-metagenomics/emg-viral-pipeline</a> pipeline.
+
+report_section_order:
+  "ebi-metagenomics/emg-viral-pipeline-methods-description":
+    order: -1000
+  software_versions:
+    order: -1001
+  "ebi-metagenomics/emg-viral-pipeline-summary":
+    order: -1002
+
+export_plots: true
+
+data_format: "yaml"
+
+run_modules:
+  - fastqc
+  - fastp
+
+## Module order
+module_order:
+  - fastqc
+  - fastp
+
+## File name cleaning
+extra_fn_clean_exts:
+  - "_fastp"
+
+## Prettification
+custom_logo: "mgnify_logo.png"
+custom_logo_url: https://github.com/ebi-metagenomics/emg-viral-pipeline/
+custom_logo_title: "ebi-metagenomics/emg-viral-pipeline"
+
+## General Stats customisation
+table_columns_visible:
+  "fastp":
+    pct_duplication: False
+    after_filtering_q30_rate: False
+    after_filtering_q30_bases: False
+    filtering_result_passed_filter_reads: 3300
+    after_filtering_gc_content: False
+    pct_surviving: True
+    pct_adapter: True
+
+table_columns_placement:
+  "fastp":
+    pct_duplication: 3000
+    after_filtering_q30_rate: 3100
+    after_filtering_q30_bases: 3200
+    filtering_result_passed_filter_reads: 3300
+    after_filtering_gc_content: 3400
+    pct_surviving: 3500
+    pct_adapter: 3600
+
+custom_table_header_config:
+  general_stats_table:
+    "Total length":
+      hidden: True
+    N50:
+      hidden: True
diff --git a/assets/schema_input.json b/assets/schema_input.json
@@ -0,0 +1,48 @@
+{
+    "$schema": "https://json-schema.org/draft/2020-12/schema",
+    "$id": "https://raw.githubusercontent.com/ebi-metagenomics/miassembler/master/assets/schema_input.json",
+    "title": "ebi-metagenomics/emg-viral-pipeline - params.input schema",
+    "description": "Schema for the file provided with params.input",
+    "type": "array",
+    "items": {
+        "type": "object",
+        "properties": {
+            "id": {
+                "type": "string",
+                "pattern": "^\\S+$",
+                "errorMessage": "Sample identifier",
+                "minLength": 3
+            },
+            "assembly": {
+                "type": "string",
+                "pattern": "^\\S+\\.f(ast)?a(\\.gz)?$",
+                "errorMessage": "Assembly file in FASTA format",
+                "minLength": 3
+            },
+            "fastq_1": {
+                "type": "string",
+                "pattern": "^\\S+\\.f(ast)?q\\.gz$",
+                "errorMessage": "FastQ file for reads 1 must be provided, cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
+            },
+            "fastq_2": {
+                "type": "string",
+                "pattern": "^\\S+\\.f(ast)?q\\.gz$",
+                "errorMessage": "FastQ file for reads 2 must be provided, cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
+            }
+        },
+        "required": ["id"],
+        "oneOf": [
+          {
+            "required": ["assembly"],
+            "description": "An assembly file must be provided"
+          },
+          {
+            "required": ["fastq_1", "fastq_2"],
+            "description": "Both fastq_1 and fastq_2 files must be provided"
+          }
+        ],
+        "errorMessage": {
+          "oneOf": "You must specify either an assembly file or both fastq_1 and fastq_2 files."
+        }
+    }
+}
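
To make the row rules in this schema concrete, here is a minimal standard-library sketch of the same checks. It is not how the pipeline validates its input (a JSON Schema validator would normally do that). `check_row` is a hypothetical helper, and only the `pattern`, `minLength`, `required` and `oneOf` rules shown above are reproduced.

```python
import re

# Patterns copied from the schema above.
PATTERNS = {
    "id": r"^\S+$",
    "assembly": r"^\S+\.f(ast)?a(\.gz)?$",
    "fastq_1": r"^\S+\.f(ast)?q\.gz$",
    "fastq_2": r"^\S+\.f(ast)?q\.gz$",
}
MIN_LENGTH = {"id": 3, "assembly": 3}


def check_row(row):
    """Return a list of problems for one input row (an empty list means valid)."""
    problems = []
    if "id" not in row:
        problems.append("id is required")
    for key, pattern in PATTERNS.items():
        if key not in row:
            continue
        value = row[key]
        if len(value) < MIN_LENGTH.get(key, 0):
            problems.append(f"{key} is shorter than {MIN_LENGTH[key]} characters")
        if re.search(pattern, value) is None:
            problems.append(f"{key} does not match {pattern}")
    # oneOf: exactly one of the two branches may hold.
    has_assembly = "assembly" in row
    has_read_pair = "fastq_1" in row and "fastq_2" in row
    if has_assembly == has_read_pair:
        problems.append("specify either an assembly or both fastq_1 and fastq_2")
    return problems
```

Because `oneOf` requires exactly one branch to match, a row that supplies an assembly together with both read files is rejected, just like a row that supplies neither.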
diff --git a/bin/write_viral_gff.py b/bin/write_viral_gff.py
@@ -112,6 +112,14 @@ def aggregate_annotations(virify_annotation_files):
     return viral_sequences, cds_annotations
 
 
+def open_fasta_file(filename):
+    if filename.endswith('.gz'):
+        f = gzip.open(filename, "rt")
+    else:
+        f = open(filename, "rt")
+    return f
+
+
 def write_gff(
     checkv_files,
     taxonomy_files,
@@ -181,11 +189,13 @@ def empty_if_number(string):
                 taxonomy_dict[contig] = taxonomy_string
 
     # Read unmodified contig length from the renamed assembly file
-    for record in SeqIO.parse(assembly_file, "fasta"):
+    handle = open_fasta_file(assembly_file)
+    for record in SeqIO.parse(handle, "fasta"):
         contig_id = str(record.id)
         seq_len = len(str(record.seq))
         contigs_len_dict[contig_id] = seq_len
-
+    handle.close()
+
     with open(output_filename, "w") as gff:
         print("##gff-version 3", file=gff)
         # Constants
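
The handle returned by `open_fasta_file` is closed manually after the loop. A variant that uses a `with` block, so the file is also closed if parsing raises, could look like the sketch below. To stay self-contained it uses a small hand-written FASTA reader (`contig_lengths`, a hypothetical name) instead of `SeqIO.parse`; it is not the code in this commit.

```python
import gzip


def open_fasta_file(filename):
    """Open a plain or gzip-compressed FASTA file in text mode."""
    if filename.endswith(".gz"):
        return gzip.open(filename, "rt")
    return open(filename, "rt")


def contig_lengths(filename):
    """Map each record ID (first word of the header) to its sequence length."""
    lengths = {}
    current = None
    # Both open() and gzip.open() return context managers, so the handle
    # is closed on normal exit and on exceptions alike.
    with open_fasta_file(filename) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                fields = line[1:].split()
                current = fields[0] if fields else ""
                lengths[current] = 0
            elif current is not None:
                lengths[current] += len(line)
    return lengths
```

The same pattern applies with Biopython: passing the handle from a `with open_fasta_file(...) as handle:` block to `SeqIO.parse(handle, "fasta")` removes the need for the explicit `handle.close()`.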

diff --git a/configs/base.config b/configs/base.config
@@ -0,0 +1,64 @@
+/*
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+    ebi-metagenomics/emg-viral-pipeline Nextflow base config file
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+    A 'blank slate' config file, appropriate for general use on most high performance
+    compute environments. Assumes that all software is installed and available on
+    the PATH. Runs in `local` mode - all jobs will be run on the logged in environment.
+----------------------------------------------------------------------------------------
+*/
+
+process {
+
+    cpus   = { check_max( 1    * task.attempt, 'cpus'   ) }
+    memory = { check_max( 6.GB * task.attempt, 'memory' ) }
+    time   = { check_max( 4.h  * task.attempt, 'time'   ) }
+
+    errorStrategy = { task.exitStatus in ((130..145) + 104) ? 'retry' : 'finish' }
+    maxRetries    = 3
+    maxErrors     = '-1'
+
+    // Process-specific resource requirements
+    // NOTE - Please try and re-use the labels below as much as possible.
+    //        These labels are used and recognised by default in DSL2 files hosted on nf-core/modules.
+    //        If possible, it would be nice to keep the same label naming convention when
+    //        adding in your local modules too.
+    // See https://www.nextflow.io/docs/latest/config.html#config-process-selectors
+
+    withLabel:process_single {
+        cpus   = { check_max( 1                  , 'cpus'    ) }
+        memory = { check_max( 6.GB * task.attempt, 'memory'  ) }
+        time   = { check_max( 4.h  * task.attempt, 'time'    ) }
+    }
+    withLabel:process_low {
+        cpus   = { check_max( 2     * task.attempt, 'cpus'    ) }
+        memory = { check_max( 12.GB * task.attempt, 'memory'  ) }
+        time   = { check_max( 4.h   * task.attempt, 'time'    ) }
+    }
+    withLabel:process_medium {
+        cpus   = { check_max( 6     * task.attempt, 'cpus'    ) }
+        memory = { check_max( 36.GB * task.attempt, 'memory'  ) }
+        time   = { check_max( 8.h   * task.attempt, 'time'    ) }
+    }
+    withLabel:process_high {
+        cpus   = { check_max( 12    * task.attempt, 'cpus'    ) }
+        memory = { check_max( 72.GB * task.attempt, 'memory'  ) }
+        time   = { check_max( 16.h  * task.attempt, 'time'    ) }
+    }
+    withLabel:process_long {
+        time   = { check_max( 20.h  * task.attempt, 'time'    ) }
+    }
+    withLabel:process_high_memory {
+        memory = { check_max( 200.GB * task.attempt, 'memory' ) }
+    }
+    withLabel:error_ignore {
+        errorStrategy = 'ignore'
+    }
+    withLabel:error_retry {
+        errorStrategy = 'retry'
+        maxRetries    = 2
+    }
+    withName:CUSTOM_DUMPSOFTWAREVERSIONS {
+        cache = false
+    }
+}
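
The retry behaviour in this config can be read as follows: requested resources grow linearly with `task.attempt`, and a task is retried only for exit statuses 130 to 145 or 104. The Python sketch below restates that arithmetic. `check_max` is not defined in this diff; the sketch assumes it caps a request at a configured maximum, as the function of that name does in the nf-core template, and the 72 GB ceiling in the usage note is an invented example value.

```python
# (130..145) + 104 in the config; the Groovy range is inclusive.
RETRY_EXIT_CODES = set(range(130, 146)) | {104}


def error_strategy(exit_status):
    """Mirror the errorStrategy closure: retry on listed exit codes, else finish."""
    return "retry" if exit_status in RETRY_EXIT_CODES else "finish"


def scaled_request(base, attempt, maximum):
    """Scale a base request by the attempt number, capped at `maximum`.

    The cap stands in for check_max (assumed behaviour, see the lead-in).
    """
    return min(base * attempt, maximum)
```

With an assumed ceiling of 72 GB, a `process_medium` task would request `scaled_request(36, 1, 72)` GB on the first attempt, reach the ceiling on the second, and stay there on later attempts.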
diff --git a/configs/conda.config b/configs/conda.config
@@ -0,0 +1,20 @@
+process {
+    withName: ANNOTATION {         conda = "$baseDir/envs/python3.yaml" }
+    withName: ASSIGN {             conda = "$baseDir/envs/python3.yaml" }
+    withName: BALLOON {            conda = "$baseDir/envs/balloon.yaml" }
+    withLabel: basics {            conda = "$baseDir/envs/python3.yaml" }
+    withName: BLAST {              conda = "$baseDir/envs/blast.yaml" }
+    withName: HMMSCAN {            conda = "$baseDir/envs/hmmer.yaml" }
+    withName: KAIJU {              conda = "$baseDir/envs/kaiju.yaml" }
+    withName: KRONA {              conda = "$baseDir/envs/krona.yaml" }
+    withName: PLOT_CONTIG_MAP {    conda = "$baseDir/envs/r.yaml" }
+    withName: PARSE {              conda = "$baseDir/envs/python3.yaml" }
+    withName: PRODIGAL {           conda = "$baseDir/envs/prodigal.yaml" }
+    withName: PHANOTATE {          conda = "$baseDir/envs/phanotate.yaml" }
+    withLabel: python3 {           conda = "$baseDir/envs/python3.yaml" }
+    withName: RATIO_EVALUE {       conda = "$baseDir/envs/python3.yaml" }
+    withLabel: ruby {              conda = "$baseDir/envs/ruby.yaml" }
+    withName: VIRSORTER {          conda = "$baseDir/envs/virsorter.yaml" }
+    withName: VIRFINDER {          conda = "$baseDir/envs/virfinder.yaml" }
+    withName: CHECKV {             conda = "$baseDir/envs/checkv.yaml" }
+}
diff --git a/configs/local.config b/configs/local.config
@@ -0,0 +1,31 @@
+process.executor = 'local'
+
+process {
+    withName: ANNOTATION      { cpus = 1;             }
+    withName: ASSIGN          { cpus = 1;             }
+    withName: BALLOON         { cpus = 1;             } 
+    withLabel: basics         { cpus = 1;             } 
+    withName: BLAST           { cpus = params.cores;  } 
+    withName: CHROMOMAP       { cpus = 1;             } 
+    withName: CHECKV          { cpus = params.cores;  }
+    withName: FASTP           { cpus = params.cores;  } 
+    withName: FASTQC          { cpus = params.cores;  } 
+    withName: HMMSCAN         { cpus = params.cores;  }
+    withName: KAIJU           { cpus = params.cores;  }
+    withName: KRONA           { cpus = params.cores;  }
+    withName: PLOT_CONTIG_MAP { cpus = 1;             }
+    withName: PPRMETA         { cpus = params.cores;  }
+    withName: MULTIQC         { cpus = params.cores;  } 
+    withName: PARSE           { cpus = 1;             }
+    withName: PRODIGAL        { cpus = 1;             }
+    withName: PHANOTATE       { cpus = 1;             }
+    withLabel: python3        { cpus = 1;             }
+    withName: RATIO_EVALUE    { cpus = 1;             }
+    withLabel: ruby           { cpus = 1;             } 
+    withName: SPADES          { cpus = params.cores;  } 
+    withName: SANKEY          { cpus = 1;             } 
+    withName: VIRSORTER       { cpus = params.cores;  }
+    withName: VIRFINDER       { cpus = 1;             }
+    withName: MASHMAP         { cpus = params.cores;  }
+}
+