Skip to content

Latest commit

 

History

History
246 lines (174 loc) · 13 KB

hands-on-metagenomics.md

File metadata and controls

246 lines (174 loc) · 13 KB

Workshop: Long-read Nanopore Metagenomics, Data Formats & Quality Control

Note: If internet connection is slow, we can also distribute the example data via an USB stick (ask your instructor ;) ).

General: We will usually start with an "Hands-on" part that will guide you through some of the content from the lectures. Then, there is an "Excercise" part with more open questions.

Example data: We will work with one downsampled metagenomics long-read example data set to showcase different usecases.

💡 Think about a good and descriptive folder and file structure when working on the data!

Create a folder for the hands-on work

Below are just example paths, you can also adjust them and use other folder names! Assuming you are on a Linux system on a local machine (laptop, workstation):

# Switch to a path on your system where you want to store your data and results
cd /home/$USER
# Create new folder
mkdir 2023-ont-metagenomics-workshop
cd 2023-ont-metagenomics-workshop

It's always important that you keep a clean and descriptive folder structure when doing bioinformatics or data science in general. Let's start with creating a project directory. First, change to some location on your file system where you want to store the data and create the project folder, then:

Attention: we will always start from this folder doing the analysis! Double check, e.g. via pwd, if you are in the correct folder on your system!

Install and use analysis tools

  • Note: Bioinformatics tools are regulary updated and input parameters might change (use --help or -h to see the manual for a tool!)
  • Install most of them into our environment
    • we will already install many tools that we will use over the next days!
mkdir envs
mamba create -y -p envs/qc nanoplot filtlong minimap2 samtools igv
conda activate envs/qc
# test
NanoPlot --help
minimap2 --version

Reminder: You can also install specific versions of a tool!

  • important for full reproducibility
  • e.g. mamba install minimap2==2.26
  • per default, mamba will try to install the newest tool version based on your configured channels and system architecture and dependencies to other tools

Get example long-read Nanopore data of a mock community

We will use the ZymoBIOMICS Microbial Community Standard II, Log Distribution (CSII) as example. The CSII is a mixture of the genomic DNA of eight bacterial (three Gram-negative and five Gram-positive) and two fungal strains covering a wide range of abundances, with the following theoretical composition:

  • Listeria monocytogenes - 89.1%
  • Pseudomonas aeruginosa - 8.9%
  • Bacillus subtilis - 0.89%
  • Saccharomyces cerevisiae - 0.89%
  • Escherichia coli - 0.089%
  • Salmonella enterica - 0.089%
  • Lactobacillus fermentum - 0.0089%
  • Enterococcus faecalis - 0.00089%
  • Cryptococcus neoformans - 0.00089%
  • Staphylococcus aureus - 0.000089%

We isolated DNA from the Zymo community mix yielding 660 ng/~73 µl recovered with EtOH (219 µl) precipitation + 3M NaOAc (1/10 vol ~29.3 µl) in 13 µl H2O. We used the ONT Rapid Barcoding Kit without bead step (SQK-RBK004) for library preparation, which resulted in 11 µl DNA input for a "standard" MinION flow cell (~400 ng input). The prepared library was sequenced on a MinION with an R9.4.1 flow cell (FLO-MIN106). A laptop computer with an Intel Core i5-10400H CPU and NVIDIA GeForce RTX 5000, 16 GB GDDR6 GPU was used for basecalling using MinKNOW (v22.08.4) with the Super accurate basecalling model (SUP). The used barcode was 01.

After basecalling, we combined all fastq_pass reads yielding a 4.5 GB FASTQ file. For this workshop, we downsampled the FASTQ file down to 10% of the oiginal reads (345 MB) using the following command (you dont have to do that!):

# downsampling was used to reduce the size of the example data set, dont run this!
conda activate seqkit
zcat barcode01.fastq.gz | seqkit sample -p 0.1 -o zymo-2022-barcode01-perc10.fastq.gz

The downsampled FASTQ file can be downloaded here or via this command:

mkdir reads
cd reads
wget https://osf.io/buajf/download -O zymo-2022-barcode01-perc10.fastq.gz
cd ..

Let's download the example data and have a look on the FASTQ file! We place it in a reads folder in your working directory! This should then look like this:

2023-ont-metagenomics-workshop/reads/zymo-2022-barcode01-perc10.fastq.gz

Quality control (NanoPlot)

# assuming you are in your working dir: 2023-ont-metagenomics-workshop and the "qc" environment is activated
NanoPlot -t 4 --fastq reads/zymo-2022-barcode01-perc10.fastq.gz --title "Raw reads" --color darkslategrey --N50 --loglength -f png -o nanoplot/raw

Publication | Code

Read filtering (Filtlong)

# Note: we use 1 kb as the minimum length cutoff as an example. For your "real" samples other parameters might be better. Do QC before. 
filtlong --min_length 1000 reads/zymo-2022-barcode01-perc10.fastq.gz > reads/zymo-2022-barcode01-perc10.filtered.fastq

# Check the quality again:
NanoPlot -t 4 --fastq reads/zymo-2022-barcode01-perc10.filtered.fastq --title "Filtered reads" --color darkslategrey --N50 --loglength -f png -o nanoplot/filtered

Code

Blast some reads online

Let's check the first read in the data set:

head -4 reads/zymo-2022-barcode01-perc10.filtered.fastq

Copy the nucleotide sequence and BLAST online (using Nucleotide BLAST search).

What do you found? Check also the alignment. Do you see differences (errors?).

Download reference genomes

We know which species are in our sample. So let's download reference genomes for them!

Also, later, we want to build a custom database for the 10 species included in our mock community, so let's download them from NCBI first:

# make a folder to store the genome FASTAs
mkdir -p reference-genomes
cd reference-genomes

# Now we download reference genomes for the 10 target species from NCBI
## Bacillus subtilis
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/009/045/GCF_000009045.1_ASM904v1/GCF_000009045.1_ASM904v1_genomic.fna.gz
## Enterococcus faecalis
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/393/015/GCF_000393015.1_Ente_faec_T5_V1/GCF_000393015.1_Ente_faec_T5_V1_genomic.fna.gz
## Escherichia coli
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz
## Lactobacillus fermentum
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/029/961/225/GCF_029961225.1_ASM2996122v1/GCF_029961225.1_ASM2996122v1_genomic.fna.gz
## Listeria monocytogenes
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/196/035/GCF_000196035.1_ASM19603v1/GCF_000196035.1_ASM19603v1_genomic.fna.gz
## Pseudomonas aeruginosa
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/765/GCF_000006765.1_ASM676v1/GCF_000006765.1_ASM676v1_genomic.fna.gz
## Salmonella enterica
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_genomic.fna.gz
## Staphylococcus aureus
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/013/425/GCF_000013425.1_ASM1342v1/GCF_000013425.1_ASM1342v1_genomic.fna.gz
## Saccharomyces cerevisiae
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz
## Cryptococcus neoformans
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/091/045/GCF_000091045.1_ASM9104v1/GCF_000091045.1_ASM9104v1_genomic.fna.gz

# now we extract the FASTA files
gunzip *.fna.gz

# change back to the main working directory
cd ..

Exercise

Now check your own data! We uploaded your runs here, please download your data set because we will delete the data from the online repository in the next days. If internet connection is bad, we can also provide the data on an external drive.

It's a good idea to make a new project folder for working on a new data set!

First, place the archive of your run somewhere where you want to work on the data. Then, unzip the archive. Investigate the folder. You should see one fastq_pass folder with subfolders for the barcodes.

Combine the reads in one barcode folder into a single fastq.gz file (use cat and a wildcard to do that). Perform QC. How does your own data compare to the example data in terms of yield and read length?

Bonus (and a little detour into containers)

Do basecalling by your own.

Note that this can be quite tricky and involves deeper Linux knowledge. Usually, you will be fine with the FASTQ and the already basecalled data that comes out of MinKNOW.

This might not work well on all systems. A good internet connection is needed as well as some basic knowledge in Docker container usage (for Singularity see the paragraph at the end). Also, good hardware and in the best case a GPU are recommended.

Another nice overview (even though might be slightly outdated) is provided here: Basecalling with Guppy.

From your own sequencing run, you should find raw signal FAST5 data. Re-basecall the signal data to generate a FASTQ output with guppy. Chose an appropriate basecalling model (check for details in the MinKNOW report). Unfortunately, guppy can not be installed via Conda and is only provided via the ONT community which needs an account. However, the tool can be installed in a so-called Docker container and then run. If you never used Docker, here are some introductory slides. A few other good resources:

# Assuming that you have Docker installed and configured
# Get a container image with guppy
docker pull nanozoo/guppy_gpu:6.4.6-1--2c17584

# Navigate to the folder where you have your raw FAST5 data located, e.g. in a folder called 'fast5'. 
# Then start an interactive session from the container and mount the folder you are currently in *into* the container. 
# Otherwise your local files would not be visible from inside of the container.
docker run --rm -it -w $PWD -v $PWD:$PWD nanozoo/guppy_gpu:6.4.6-1--2c17584 /bin/bash

# list the available flowcell and library kits, so you can pick an appropriate basecalling model
guppy_basecaller --print_workflows

# run basecalling, here we're using as an example a model for R10.4.1 flow cell, 
# run with 260 bp/s translocation speed (which was discontinued in summer 2023, now 400 bp/s is default) 
# and the super-acc SUP model
guppy_basecaller –i ./fast5 –s ./guppy_out –c dna_r10.4.1_e8.2_260bps_sup.cfg \
    --num_callers 48 --cpu_threads_per_caller 1

# Attention! This will take ages even on a small FAST5 like in this example. You can also cancel that with "ctrl C". 
# In this example, I ran on 48 cores (num_callers) and for the small example FAST5 this took 24 h for 70% of the reads basecalled.
# You should really run basecalling on a GPU, if possible.  

# Here is an example command using a GPU and some additional qc parameters. 
guppy_basecaller -i ./fast5 -s ./guppy_out_gpu -c dna_r10.4.1_e8.2_260bps_sup.cfg \
    -x auto -r --trim_strategy dna -q 0 --disable_pings

# On the RKI High-Performance Cluster (using Singularity, see below), this command took 2 minutes for basecalling the example data. 

Using Singularity instead of Docker to run Guppy

Note: An alternative to Docker is Singularity. The commands are slightly different then. However, on some systems (e.g. a High-performance cluster) it is not possible to use Docker due to permission settings, and then Singularity is an option. Luckily, Docker containers can be easily (and automatically) converted into Singularity. Here are some example commands assuming an HPC with SLURM as the job scheduler (like at the RKI):

# Get the Docker image and convert to Singularity
singularity pull --name guppy_gpu-6.4.6-1--2c17584.img docker://nanozoo/guppy_gpu:6.4.6-1--2c17584

# Start an interactive session on a GPU node
srun --gpus=1 --pty bash -i

# Start the Singularity image with GPU support
singularity run --nv guppy_gpu-6.4.6-1--2c17584.img

# Make sure you are in the correct directory, which contains the fast5/ subdirectory
cd 2023-08-nanopore-workshop-example-bacteria
ls
# output:
# fast5  fastq  minknow_report.html  pycoqc.html  sequencing_summary_small.txt

# Run guppy
guppy_basecaller -i fast5 -s guppy_out_gpu -c dna_r10.4.1_e8.2_260bps_sup.cfg \
    -x auto -r --trim_strategy dna -q 0 --disable_pings

# You should see in the terminal output that GPU access is activated:
# ...
# gpu device:         auto
# ...