Merge pull request #71 from EBI-Metagenomics/xlarge-release-staging
Xlarge release staging
tgurbich authored Nov 29, 2023
2 parents 6ad6ef6 + c66b327 commit 155e230
Showing 2 changed files with 34 additions and 33 deletions.
63 changes: 32 additions & 31 deletions README.md
@@ -7,36 +7,36 @@ Gurbich TA, Almeida A, Beracochea M, Burdett T, Burgin J, Cochrane G, Raj S, Ric
Detailed information about existing MGnify catalogues: https://docs.mgnify.org/src/docs/genome-viewer.html

### Tools used in the pipeline
| Tool/Database | Version | Purpose |
| ----------- | ----------- |----------- |
| CheckM | 1.1.3 | Determining genome quality |
| dRep | 3.2.2 | Genome clustering |
| Mash | 2.3 | Sketch for the catalogue; placement of genomes into clusters (update only); strain tree |
| GUNC | 1.0.3 | Quality control |
| GUNC DB | 2.0.4 | Database for GUNC |
| GTDB-Tk | 2.3.0 | Assigning taxonomy; generating alignments |
| GTDB | r214 | Database for GTDB-Tk |
| Prokka | 1.14.6 | Protein annotation |
| IQ-TREE 2 | 2.2.0.3 | Generating a phylogenetic tree |
| Kraken 2 | 2.1.2 | Generating a kraken database |
| Bracken | 2.6.2 | Generating a bracken database |
| MMseqs2 | 13.45111 | Generating a protein catalogue |
| eggNOG-mapper | 2.1.11 | Protein annotation (eggNOG, KEGG, COG, CAZy) |
| eggNOG DB | 5.0 | Database for eggNOG-mapper |
| Diamond | 2.0.11 | Protein annotation (eggNOG) |
| InterProScan | 5.62-94.0 | Protein annotation (InterPro, Pfam) |
| CRISPRCasFinder | 4.3.2 | Annotation of CRISPR arrays |
| AMRFinderPlus | 3.11.4 | Antimicrobial resistance gene annotation; virulence factors, biocide, heat, acid, and metal resistance gene annotation |
| AMRFinderPlus DB | 3.11 2023-02-23.1 | Database for AMRFinderPlus |
| SanntiS | 0.9.3.2 | Biosynthetic gene cluster annotation |
| Infernal | 1.1.4 | RNA predictions |
| tRNAscan-SE | 2.0.9 | tRNA predictions |
| Rfam | 14.9 | Identification of SSU/LSU rRNA and other ncRNAs |
| Panaroo | 1.3.2 | Pan-genome computation |
| Seqtk | 1.3 | Generating a gene catalogue |
| VIRify | - | Viral sequence annotation |
| MoMofy | 1.0.0 | Mobilome annotation |
| samtools | 1.15 | FASTA indexing |
| Tool/Database | Version | Purpose |
|----------------------------------|------------------|----------- |
| CheckM | 1.1.3 | Determining genome quality |
| dRep | 3.2.2 | Genome clustering |
| Mash | 2.3 | Sketch for the catalogue; placement of genomes into clusters (update only); strain tree |
| GUNC | 1.0.3 | Quality control |
| GUNC DB | 2.0.4 | Database for GUNC |
| GTDB-Tk | 2.3.0 | Assigning taxonomy; generating alignments |
| GTDB | r214 | Database for GTDB-Tk |
| Prokka | 1.14.6 | Protein annotation |
| IQ-TREE 2 | 2.2.0.3 | Generating a phylogenetic tree |
| Kraken 2 | 2.1.2 | Generating a kraken database |
| Bracken | 2.6.2 | Generating a bracken database |
| MMseqs2 | 13.45111 | Generating a protein catalogue |
| eggNOG-mapper | 2.1.11 | Protein annotation (eggNOG, KEGG, COG, CAZy) |
| eggNOG DB | 5.0 | Database for eggNOG-mapper |
| Diamond | 2.0.11 | Protein annotation (eggNOG) |
| InterProScan | 5.62-94.0 | Protein annotation (InterPro, Pfam) |
| CRISPRCasFinder | 4.3.2 | Annotation of CRISPR arrays |
| AMRFinderPlus | 3.11.4 | Antimicrobial resistance gene annotation; virulence factors, biocide, heat, acid, and metal resistance gene annotation |
| AMRFinderPlus DB | 3.11 2023-02-23.1 | Database for AMRFinderPlus |
| SanntiS | 0.9.3.2 | Biosynthetic gene cluster annotation |
| Infernal | 1.1.4 | RNA predictions |
| tRNAscan-SE | 2.0.9 | tRNA predictions |
| Rfam | 14.9 | Identification of SSU/LSU rRNA and other ncRNAs |
| Panaroo | 1.3.2 | Pan-genome computation |
| Seqtk | 1.3 | Generating a gene catalogue |
| VIRify | 2.0.0 | Viral sequence annotation |
| [Mobilome annotation pipeline](https://github.com/EBI-Metagenomics/mobilome-annotation-pipeline) | 2.0.0-rc.1 | Mobilome annotation |
| samtools | 1.15 | FASTA indexing |

## Setup

@@ -57,6 +57,7 @@ The pipeline needs the following reference databases and configuration files (ro
- ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/genomes-pipeline/kegg_classes.tsv
- ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/genomes-pipeline/continent_countries.csv
- https://data.ace.uq.edu.au/public/gtdb/data/releases/release214/214.0/auxillary_files/gtdbtk_r214_data.tar.gz
- ftp://ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/3.11/2023-02-23.1

### Containers

@@ -76,7 +77,7 @@ cd containers && bash build.sh

1. You need to pre-download your data to directories and make sure that genomes are uncompressed. Scripts to fetch genomes from ENA ([fetch_ena.py](https://github.com/EBI-Metagenomics/genomes-pipeline/blob/master/containers/genomes-catalog-update/scripts/fetch_ena.py)) and NCBI ([fetch_ncbi.py](https://github.com/EBI-Metagenomics/genomes-pipeline/blob/master/containers/genomes-catalog-update/scripts/fetch_ncbi.py)) are provided and need to be executed separately from the pipeline. If you have downloaded genomes from both ENA and NCBI, put them into separate folders.

2. When genomes are fetched from ENA using the `fetch_ena.py` script, a CSV file with contamination and completeness statistics is also created in the same directory where genomes are saved to. If you are downloading genomes using a different approach, a CSV file needs to be created manually (each line should be genome accession, % completeness, % contamination). The ENA fetching script also pre-filters genomes to satisfy the QS50 cut-off (QS = % completeness - 5 * % contamination).

3. You will need the following information to run the pipeline:
- catalogue name (for example, zebrafish-faecal)
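
The QS50 pre-filter described in step 2 can be sketched as follows. This is a minimal illustration, not the logic of `fetch_ena.py` itself: the function names, the placeholder accessions, the headerless three-column layout, and the strict `>` comparison at the threshold are all assumptions made for the example.

```python
import csv
import io


def quality_score(completeness, contamination):
    """QS = % completeness - 5 * % contamination, as stated in the README."""
    return completeness - 5 * contamination


def filter_qs50(rows, threshold=50.0):
    """Keep (accession, QS) pairs whose QS exceeds the threshold.

    Assumption: each row is (accession, % completeness, % contamination).
    Assumption: the cut-off is strict (QS > 50); the README does not say
    whether a genome with QS exactly 50 passes.
    """
    kept = []
    for accession, completeness, contamination in rows:
        qs = quality_score(float(completeness), float(contamination))
        if qs > threshold:
            kept.append((accession, qs))
    return kept


# Hypothetical CSV content with placeholder accessions (not real genomes).
example_csv = "GENOME_A,95.0,2.0\nGENOME_B,60.0,3.0\n"
rows = list(csv.reader(io.StringIO(example_csv)))
passing = filter_qs50(rows)
print(passing)
```

Here `GENOME_A` scores 95 - 5 * 2 = 85 and is kept, while `GENOME_B` scores 60 - 5 * 3 = 45 and is dropped, which shows how heavily contamination is weighted relative to completeness.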
4 changes: 2 additions & 2 deletions helpers/file_organiser.sh
@@ -1,6 +1,6 @@
#!/usr/bin/env bash

# The script organises output from the catalogue generation + Virify + Momofy to prepare it for upload to MGnify
# The script organises output from the catalogue generation + Virify + Mobilome annotation pipeline to prepare it for upload to MGnify


function Usage {
@@ -40,7 +40,7 @@ function GenerateRNACentralJSON {
echo "Copying GFFs"
for R in $REPS
do
cp ${RESULTS_PATH}/all_genomes/${R::-2}/${R}/${R}.gff* ${RESULTS_PATH}/additional_data/rnacentral/GFFs/
cp ${RESULTS_PATH}/all_genomes/${R::-2}/${R}/genomes1/${R}.gff* ${RESULTS_PATH}/additional_data/rnacentral/GFFs/
done

echo "Running JSON generation"
