Commit

Merge pull request #134 from EBI-Metagenomics/feature/re-structure-pipeline

Re-structure pipeline
KateSakharova authored Oct 11, 2024
2 parents df647fb + 1a1b878 commit 0c20229
Showing 278 changed files with 4,113 additions and 293,511 deletions.
9 changes: 8 additions & 1 deletion .github/workflows/unit_tests.yml
@@ -20,10 +20,17 @@ jobs:
uses: actions/setup-python@v3
with:
python-version: "3.10"

- name: update pip
run: |
python -m pip install --upgrade pip
- name: Install dependencies
run: |
pip install -r requirements-test.txt
pip install --upgrade numpy pandas
- name: Unit tests
run: |
# TODO, improve the pythonpath handling
PYTHONPATH="$PYTHONPATH:bin" python -m unittest discover tests
export PYTHONPATH=$PYTHONPATH:bin
python -m unittest discover tests
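
The effect of the `PYTHONPATH` export in the step above can be reproduced outside CI. The following is a minimal sketch, assuming a throwaway project layout: `demo_module.py` and the helper `run_with_bin_on_path` are hypothetical names, not part of the repository.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def run_with_bin_on_path(project_dir: Path, code: str) -> subprocess.CompletedProcess:
    """Run `code` in a child interpreter with <project_dir>/bin appended to PYTHONPATH."""
    env = dict(os.environ)
    existing = env.get("PYTHONPATH", "")
    bin_dir = str(project_dir / "bin")
    # Append rather than overwrite, as the workflow step does.
    env["PYTHONPATH"] = os.pathsep.join(p for p in (existing, bin_dir) if p)
    return subprocess.run(
        [sys.executable, "-c", code],
        cwd=project_dir,
        env=env,
        capture_output=True,
        text=True,
    )


# Hypothetical layout: a single module under bin/, imported by name from the child process.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    (project / "bin").mkdir()
    (project / "bin" / "demo_module.py").write_text("VALUE = 42\n")
    result = run_with_bin_on_path(project, "import demo_module; print(demo_module.VALUE)")
    imported_value = result.stdout.strip()
```

Because the variable is exported once, every later command in the same step sees `bin` on the module search path, which is what lets `unittest discover` import the scripts under test.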
2 changes: 1 addition & 1 deletion LICENSE
@@ -186,7 +186,7 @@
same "printed page" as the copyright notice for easier
identification within third-party archives.

Copyright [yyyy] [name of copyright owner]
Copyright 2019 EMBL-EBI

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
29 changes: 6 additions & 23 deletions README.md
@@ -6,11 +6,10 @@

1. [ The VIRify pipeline ](#virify)
2. [ Nextflow execution ](#nf)
3. [ CWL execution (discontinued) ](#cwl)
4. [ Pipeline overview ](#overview)
5. [ Detour: Metatranscriptomics ](#metatranscriptome)
6. [ Resources ](#resources)
7. [ Citations ](#cite)
3. [ Pipeline overview ](#overview)
4. [ Detour: Metatranscriptomics ](#metatranscriptome)
5. [ Resources ](#resources)
6. [ Citations ](#cite)

<a name="virify"></a>

@@ -22,14 +21,12 @@ VIRify is a pipeline for the detection, annotation, and taxonomic classification

The pipeline is implemented in [Nextflow](#nf); beyond Nextflow itself, only Docker or Singularity is needed to run VIRify. Details about installation and usage are given below.

**Please note**, that until v1.0 the pipeline was also implemented in [CWL](#cwl) as an alternative to [Nextflow](#nf). However, later updates were only included in the [Nextflow](#nf) version of the pipeline.


<a name="nf"></a>

# Nextflow

A [Nextflow](https://www.nextflow.io/) implementation of the VIRify pipeline. In the backend, the same scripts are used as in the [CWL](#cwl) implementation.
A [Nextflow](https://www.nextflow.io/) implementation of the VIRify pipeline.

## What do I need?

@@ -155,21 +152,7 @@ The labels used in the Type column of the gff file correspond to the following n
| prophage | [SO:0001006](http://www.sequenceontology.org/browser/current_svn/term/SO:0001006) |
| CDS | [SO:0000316](http://www.sequenceontology.org/browser/current_svn/term/SO:0000316) |

Note that CDS are reported only when a ViPhOG match has been found.


<a name="cwl"></a>

# Common Workflow Language (discontinued)

**Until VIRify v1.0**, VIRify was implemented in [Common Workflow Language (CWL)](https://www.commonwl.org/) next to the Nextflow implementation. Both Workflow Management Systems were previously supported.

## What do I need?
The implementation until v1.0 of VIRify uses CWL version 1.2. It was tested using Toil version 5.3.0 as the workflow engine and conda to manage the software dependencies.

## How?
For instructions go to the [CWL README](cwl/README.md).

Note that CDS are reported only when a ViPhOG match has been found.

<a name="overview"></a>

35 changes: 35 additions & 0 deletions assets/methods_description_template.yml
@@ -0,0 +1,35 @@
id: "ebi-metagenomics/emg-viral-pipeline-methods-description"
description: "Suggested text and references to use when describing pipeline usage within the methods section of a publication."
section_name: "ebi-metagenomics/emg-viral-pipeline Methods Description"
section_href: "https://github.com/EBI-Metagenomics/emg-viral-pipeline"
plot_type: "html"
data: |
  <h4>Methods</h4>
  <p>Data was processed using ebi-metagenomics/emg-viral-pipeline v${workflow.manifest.version} (${doi_text}; <a href="https://doi.org/10.1093/nargab/lqac007">Krakau <em>et al.</em>, 2022</a>) of the nf-core collection of workflows (<a href="https://doi.org/10.1038/s41587-020-0439-x">Ewels <em>et al.</em>, 2020</a>), utilising reproducible software environments from the Bioconda (<a href="https://doi.org/10.1038/s41592-018-0046-7">Grüning <em>et al.</em>, 2018</a>) and Biocontainers (<a href="https://doi.org/10.1093/bioinformatics/btx192">da Veiga Leprevost <em>et al.</em>, 2017</a>) projects.</p>
  <p>The pipeline was executed with Nextflow v${workflow.nextflow.version} (<a href="https://doi.org/10.1038/nbt.3820">Di Tommaso <em>et al.</em>, 2017</a>) with the following command:</p>
  <pre><code>${workflow.commandLine}</code></pre>
  <p>${tool_citations}</p>
  <h4>References</h4>
  <ul>
    <li>
      Informative Regions In Viral Genomes
      <i>Viruses (2021)</i>
      doi: <a href="https://doi.org/10.3390/v13061164">10.3390/v13061164</a>
      Moreno-Gallego, Jaime Leonardo, and Alejandro Reyes
    </li>
    <li>
      VIRify: an integrated detection, annotation and taxonomic classification pipeline using virus-specific protein profile hidden Markov models
      <i>bioRxiv</i>
      doi: <a href="https://doi.org/10.1101/2022.08.22.504484">10.1101/2022.08.22.504484</a>
      Rangel-Pineros, Guillermo, et al.
    </li>
    ${tool_bibliography}
  </ul>
  <div class="alert alert-info">
    <h5>Notes:</h5>
    <ul>
      ${nodoi_text}
      <li>The command above does not include parameters contained in any configs or profiles that may have been used. Ensure the config file is also uploaded with your publication!</li>
      <li>You should also cite all software used within this run. Check the "Software Versions" of this report to get version information.</li>
    </ul>
  </div>
Binary file added assets/mgnify_logo.png
61 changes: 61 additions & 0 deletions assets/multiqc_config.yml
@@ -0,0 +1,61 @@
report_comment: >
  This report has been generated by the <a href="https://github.com/ebi-metagenomics/emg-viral-pipeline/" target="_blank">ebi-metagenomics/emg-viral-pipeline</a> pipeline.
report_section_order:
  "ebi-metagenomics/emg-viral-pipeline-methods-description":
    order: -1000
  software_versions:
    order: -1001
  "ebi-metagenomics/emg-viral-pipeline-summary":
    order: -1002

export_plots: true

data_format: "yaml"

run_modules:
  - fastqc
  - fastp

## Module order
module_order:
  - fastqc
  - fastp

## File name cleaning
extra_fn_clean_exts:
  - "_fastp"

## Prettification
custom_logo: "mgnify_logo.png"
custom_logo_url: https://github.com/ebi-metagenomics/emg-viral-pipeline/
custom_logo_title: "ebi-metagenomics/emg-viral-pipeline"

## General Stats customisation
table_columns_visible:
  "fastp":
    pct_duplication: False
    after_filtering_q30_rate: False
    after_filtering_q30_bases: False
    filtering_result_passed_filter_reads: 3300
    after_filtering_gc_content: False
    pct_surviving: True
    pct_adapter: True

table_columns_placement:
  "fastp":
    pct_duplication: 3000
    after_filtering_q30_rate: 3100
    after_filtering_q30_bases: 3200
    filtering_result_passed_filter_reads: 3300
    after_filtering_gc_content: 3400
    pct_surviving: 3500
    pct_adapter: 3600

custom_table_header_config:
  general_stats_table:
    "Total length":
      hidden: True
    N50:
      hidden: True
48 changes: 48 additions & 0 deletions assets/schema_input.json
@@ -0,0 +1,48 @@
{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "$id": "https://raw.githubusercontent.com/ebi-metagenomics/miassembler/master/assets/schema_input.json",
    "title": "ebi-metagenomics/emg-viral-pipeline - params.input schema",
    "description": "Schema for the file provided with params.input",
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "id": {
                "type": "string",
                "pattern": "^\\S+$",
                "errorMessage": "Sample identifier",
                "minLength": 3
            },
            "assembly": {
                "type": "string",
                "pattern": "^\\S+\\.f(ast)?a(\\.gz)?$",
                "errorMessage": "Assembly file in FASTA format",
                "minLength": 3
            },
            "fastq_1": {
                "type": "string",
                "pattern": "^\\S+\\.f(ast)?q\\.gz$",
                "errorMessage": "FastQ file for reads 1 must be provided, cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
            },
            "fastq_2": {
                "type": "string",
                "pattern": "^\\S+\\.f(ast)?q\\.gz$",
                "errorMessage": "FastQ file for reads 2 must be provided, cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
            }
        },
        "required": ["id"],
        "oneOf": [
            {
                "required": ["assembly"],
                "description": "An assembly file must be provided"
            },
            {
                "required": ["fastq_1", "fastq_2"],
                "description": "Both fastq_1 and fastq_2 files must be provided"
            }
        ],
        "errorMessage": {
            "oneOf": "You must specify either an assembly file or both fastq_1 and fastq_2 files."
        }
    }
}
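
The constraints in this schema can be checked by hand for a single samplesheet row. The following is a minimal sketch using only the standard library, not the validator the pipeline itself uses; `validate_row` is a hypothetical helper, and the patterns are copied from the schema with the JSON escaping removed.

```python
import re

# Patterns copied from the schema; the JSON-escaped "\\S" becomes "\S" in a raw string.
ID_PATTERN = re.compile(r"^\S+$")
ASSEMBLY_PATTERN = re.compile(r"^\S+\.f(ast)?a(\.gz)?$")
FASTQ_PATTERN = re.compile(r"^\S+\.f(ast)?q\.gz$")


def validate_row(row: dict) -> list[str]:
    """Return a list of error messages for one samplesheet row (empty if valid)."""
    errors = []

    # "required": ["id"], with pattern and minLength 3.
    sample_id = row.get("id")
    if sample_id is None:
        errors.append("id is required")
    elif len(sample_id) < 3 or not ID_PATTERN.search(sample_id):
        errors.append("Sample identifier")

    # Property-level checks apply to whichever optional fields are present.
    assembly = row.get("assembly")
    if assembly is not None and (len(assembly) < 3 or not ASSEMBLY_PATTERN.search(assembly)):
        errors.append("Assembly file in FASTA format")
    for key in ("fastq_1", "fastq_2"):
        value = row.get(key)
        if value is not None and not FASTQ_PATTERN.search(value):
            errors.append(f"{key} must end in '.fq.gz' or '.fastq.gz' and contain no spaces")

    # "oneOf": exactly one branch may match, so supplying both inputs is also an error.
    has_assembly = "assembly" in row
    has_reads = "fastq_1" in row and "fastq_2" in row
    if has_assembly == has_reads:
        errors.append("You must specify either an assembly file or both fastq_1 and fastq_2 files.")
    return errors


assembly_row = {"id": "ERR123", "assembly": "contigs.fasta.gz"}
reads_row = {"id": "ERR123", "fastq_1": "reads_1.fq.gz", "fastq_2": "reads_2.fastq.gz"}
print(validate_row(assembly_row), validate_row(reads_row))
```

Note the `oneOf` semantics: a row that carries an assembly and both read files matches two branches and is rejected, just like a row that carries neither.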
14 changes: 12 additions & 2 deletions bin/write_viral_gff.py
@@ -112,6 +112,14 @@ def aggregate_annotations(virify_annotation_files):
    return viral_sequences, cds_annotations


def open_fasta_file(filename):
    if filename.endswith('.gz'):
        f = gzip.open(filename, "rt")
    else:
        f = open(filename, "rt")
    return f


def write_gff(
    checkv_files,
    taxonomy_files,
@@ -181,11 +189,13 @@ def empty_if_number(string):
taxonomy_dict[contig] = taxonomy_string

    # Read unmodified contig length from the renamed assembly file
    for record in SeqIO.parse(assembly_file, "fasta"):
    handle = open_fasta_file(assembly_file)
    for record in SeqIO.parse(handle, "fasta"):
        contig_id = str(record.id)
        seq_len = len(str(record.seq))
        contigs_len_dict[contig_id] = seq_len

    handle.close()

    with open(output_filename, "w") as gff:
        print("##gff-version 3", file=gff)
        # Constants
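
The change to `write_viral_gff.py` lets the script read gzip-compressed assemblies. The following is a minimal sketch of the same idea with the handle managed by a `with` block, so it is closed even if parsing fails. `contig_lengths` is a hypothetical stdlib stand-in for the `SeqIO.parse` loop, used here only to keep the example free of the Biopython dependency.

```python
import gzip
import os
import tempfile


def open_fasta_file(filename):
    """Open a plain or gzip-compressed FASTA file in text mode (same logic as in the diff)."""
    if filename.endswith(".gz"):
        return gzip.open(filename, "rt")
    return open(filename, "rt")


def contig_lengths(filename):
    """Return {contig_id: sequence_length} for every record in the FASTA file."""
    lengths = {}
    current = None
    with open_fasta_file(filename) as handle:  # closed automatically, even on error
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # The record id is the first whitespace-delimited token of the header.
                fields = line[1:].split()
                current = fields[0] if fields else ""
                lengths[current] = 0
            elif current is not None:
                lengths[current] += len(line)
    return lengths


# Demo on a small compressed file written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.fasta.gz")
    with gzip.open(path, "wt") as out:
        out.write(">contig_1 description\nACGT\nAC\n>contig_2\nGGG\n")
    demo_lengths = contig_lengths(path)
```

Both `gzip.open` and the built-in `open` return objects that support the context-manager protocol, so the explicit `handle.close()` in the diff could be replaced by the same `with` pattern.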
64 changes: 64 additions & 0 deletions configs/base.config
@@ -0,0 +1,64 @@
/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    MGnify genomes-generation pipeline Nextflow base config file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    A 'blank slate' config file, appropriate for general use on most high performance
    compute environments. Assumes that all software is installed and available on
    the PATH. Runs in `local` mode - all jobs will be run on the logged in environment.
----------------------------------------------------------------------------------------
*/

process {

    cpus   = { check_max( 1    * task.attempt, 'cpus'   ) }
    memory = { check_max( 6.GB * task.attempt, 'memory' ) }
    time   = { check_max( 4.h  * task.attempt, 'time'   ) }

    errorStrategy = { task.exitStatus in ((130..145) + 104) ? 'retry' : 'finish' }
    maxRetries    = 3
    maxErrors     = '-1'

    // Process-specific resource requirements
    // NOTE - Please try and re-use the labels below as much as possible.
    //        These labels are used and recognised by default in DSL2 files hosted on nf-core/modules.
    //        If possible, it would be nice to keep the same label naming convention when
    //        adding in your local modules too.
    // See https://www.nextflow.io/docs/latest/config.html#config-process-selectors

    withLabel:process_single {
        cpus   = { check_max( 1                  , 'cpus'   ) }
        memory = { check_max( 6.GB * task.attempt, 'memory' ) }
        time   = { check_max( 4.h  * task.attempt, 'time'   ) }
    }
    withLabel:process_low {
        cpus   = { check_max( 2     * task.attempt, 'cpus'   ) }
        memory = { check_max( 12.GB * task.attempt, 'memory' ) }
        time   = { check_max( 4.h   * task.attempt, 'time'   ) }
    }
    withLabel:process_medium {
        cpus   = { check_max( 6     * task.attempt, 'cpus'   ) }
        memory = { check_max( 36.GB * task.attempt, 'memory' ) }
        time   = { check_max( 8.h   * task.attempt, 'time'   ) }
    }
    withLabel:process_high {
        cpus   = { check_max( 12    * task.attempt, 'cpus'   ) }
        memory = { check_max( 72.GB * task.attempt, 'memory' ) }
        time   = { check_max( 16.h  * task.attempt, 'time'   ) }
    }
    withLabel:process_long {
        time   = { check_max( 20.h  * task.attempt, 'time'   ) }
    }
    withLabel:process_high_memory {
        memory = { check_max( 200.GB * task.attempt, 'memory' ) }
    }
    withLabel:error_ignore {
        errorStrategy = 'ignore'
    }
    withLabel:error_retry {
        errorStrategy = 'retry'
        maxRetries    = 2
    }
    withName:CUSTOM_DUMPSOFTWAREVERSIONS {
        cache = false
    }
}
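
The retry and scaling logic in `configs/base.config` can be traced with a short model. This is a sketch under stated assumptions: `check_max` is not defined in this diff, so it is assumed here to cap a request at a configured ceiling, and the ceilings in `HYPOTHETICAL_LIMITS` are invented for illustration.

```python
# Exit codes treated as retryable, mirroring `(130..145) + 104` (the Groovy range is inclusive).
RETRY_EXIT_CODES = set(range(130, 146)) | {104}


def error_strategy(exit_status: int) -> str:
    """Retry on the listed exit codes; otherwise let running tasks finish and stop."""
    return "retry" if exit_status in RETRY_EXIT_CODES else "finish"


def check_max(requested, limit):
    """Assumed behaviour of check_max: never request more than the configured ceiling."""
    return min(requested, limit)


def resources_for_attempt(base: dict, attempt: int, limits: dict) -> dict:
    """Scale each base request by the attempt number, then cap it at its limit."""
    return {key: check_max(value * attempt, limits[key]) for key, value in base.items()}


# Base values of the process_medium label from the config above.
PROCESS_MEDIUM = {"cpus": 6, "memory_gb": 36, "time_h": 8}
# Hypothetical ceilings; the real ones would come from pipeline parameters.
HYPOTHETICAL_LIMITS = {"cpus": 16, "memory_gb": 64, "time_h": 24}

second_attempt = resources_for_attempt(PROCESS_MEDIUM, 2, HYPOTHETICAL_LIMITS)
print(error_strategy(137), second_attempt)
```

With `maxRetries = 3`, a task that keeps failing with a retryable exit code is resubmitted with progressively larger requests until either it succeeds, the cap flattens the request, or the retries run out.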
20 changes: 20 additions & 0 deletions configs/conda.config
@@ -0,0 +1,20 @@
process {
    // Nextflow selectors are case-sensitive: `withName` for process names, `withLabel` for labels.
    withName: ANNOTATION { conda = "$baseDir/envs/python3.yaml" }
    withName: ASSIGN { conda = "$baseDir/envs/python3.yaml" }
    withName: BALLOON { conda = "$baseDir/envs/balloon.yaml" }
    withLabel: basics { conda = "$baseDir/envs/python3.yaml" }
    withName: BLAST { conda = "$baseDir/envs/blast.yaml" }
    withName: HMMSCAN { conda = "$baseDir/envs/hmmer.yaml" }
    withName: KAIJU { conda = "$baseDir/envs/kaiju.yaml" }
    withName: KRONA { conda = "$baseDir/envs/krona.yaml" }
    withName: PLOT_CONTIG_MAP { conda = "$baseDir/envs/r.yaml" }
    withName: PARSE { conda = "$baseDir/envs/python3.yaml" }
    withName: PRODIGAL { conda = "$baseDir/envs/prodigal.yaml" }
    withName: PHANOTATE { conda = "$baseDir/envs/phanotate.yaml" }
    withLabel: python3 { conda = "$baseDir/envs/python3.yaml" }
    withName: RATIO_EVALUE { conda = "$baseDir/envs/python3.yaml" }
    withLabel: ruby { conda = "$baseDir/envs/ruby.yaml" }
    withName: VIRSORTER { conda = "$baseDir/envs/virsorter.yaml" }
    withName: VIRFINDER { conda = "$baseDir/envs/virfinder.yaml" }
    withName: CHECKV { conda = "$baseDir/envs/checkv.yaml" }
}
31 changes: 31 additions & 0 deletions configs/local.config
@@ -0,0 +1,31 @@
process.executor = 'local'

process {
    withName: ANNOTATION { cpus = 1; }
    withName: ASSIGN { cpus = 1; }
    withName: BALLOON { cpus = 1; }
    withLabel: basics { cpus = 1; }
    withName: BLAST { cpus = params.cores; }
    withName: CHROMOMAP { cpus = 1; }
    withName: CHECKV { cpus = params.cores; }
    withName: FASTP { cpus = params.cores; }
    withName: FASTQC { cpus = params.cores; }
    withName: HMMSCAN { cpus = params.cores; }
    withName: KAIJU { cpus = params.cores; }
    withName: KRONA { cpus = params.cores; }
    withName: PLOT_CONTIG_MAP { cpus = 1; }
    withName: PPRMETA { cpus = params.cores; }
    withName: MULTIQC { cpus = params.cores; }
    withName: PARSE { cpus = 1; }
    withName: PRODIGAL { cpus = 1; }
    withName: PHANOTATE { cpus = 1; }
    withLabel: python3 { cpus = 1; }
    withName: RATIO_EVALUE { cpus = 1; }
    withLabel: ruby { cpus = 1; }
    withName: SPADES { cpus = params.cores; }
    withName: SANKEY { cpus = 1; }
    withName: VIRSORTER { cpus = params.cores; }
    withName: VIRFINDER { cpus = 1; }
    withName: MASHMAP { cpus = params.cores; }
}
