Galaxy for virologist training Exercise 5: Illumina Mapping 101

Title	Galaxy
Training dataset:	PRJEB43037 - In August 2020, an outbreak of West Nile Virus affected 71 people with meningoencephalitis in Andalusia and 6 more in Extremadura (south-west Spain), causing a total of eight deaths. The virus belonged to lineage 1 and was relatively similar to those from previous outbreaks in the Mediterranean region. Here we present a detailed analysis of the outbreak, including an extensive phylogenetic study. This is one of the outbreak samples.
Questions:	What is mapping? What is a BAM file? Which metrics are important to check after mapping?
Objectives:	Understand the concept of mapping. Learn how to interpret mapping metrics. Learn how to visualize mapping results.
Estimated time:	40 min

1. Description

Re-sequencing is one of the most common types of massive sequencing experiment. In these experiments we sequence an already known microorganism in order to discover variation between our reads and a reference genome that has already been assembled. Mapping is a mandatory step in this kind of experiment: the short sequences (reads) in our fastq file lack any genomic context, so we need to place each of them on the reference. The mapping step turns our fastq file into a bam file that records where each read came from, that is, the coordinates at which each read is placed in the reference genome.
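
To make the idea of "placing a read on the reference" concrete, here is a minimal sketch that locates reads by exact string search. It is only an illustration: the reference and reads are made-up toy sequences, and a real mapper such as Bowtie2 uses an index and tolerates mismatches, gaps and reverse-strand reads, none of which is handled here.

```python
def map_reads(reference, reads):
    """Return (read, position) pairs for exact forward-strand matches.

    Positions are 1-based, like the POS column of a SAM/BAM file.
    A read that does not occur in the reference gets position None.
    """
    placements = []
    for read in reads:
        index = reference.find(read)  # -1 when the read is absent
        position = index + 1 if index >= 0 else None
        placements.append((read, position))
    return placements


# Hypothetical toy data, not taken from the training dataset.
reference = "ACGTTGCAAGGCTTACGATC"
reads = ["TTGCA", "GGCTT", "CCCCC"]

for read, position in map_reads(reference, reads):
    print(read, position)
```

The first two reads get a coordinate on the reference, while the third stays unplaced, which is what an unmapped read looks like after mapping.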

2. Upload data to Galaxy

Training dataset

Experiment info: PRJEB43037, WGS, Illumina MiSeq, paired-end
Fastq R1: ERR5310322_1 - url: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR531/002/ERR5310322/ERR5310322_1.fastq.gz
Fastq R2: ERR5310322_2 - url: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR531/002/ERR5310322/ERR5310322_2.fastq.gz
Reference genome NC_009942.1: fasta -- gff
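
The two fastq URLs above share a visible pattern, so they can be rebuilt from the run accession alone. The sketch below assumes that pattern (first six characters of the accession, then `00` plus its last digit) holds for 10-character run accessions. That layout is inferred from the two URLs in this exercise, so check it against the ENA documentation before relying on it for other runs. The function name `ena_fastq_urls` is hypothetical.

```python
def ena_fastq_urls(run_accession):
    """Build the R1 and R2 FTP URLs for a paired-end run.

    Assumption: the directory layout seen in this exercise applies,
    which this sketch only attempts for 10-character accessions.
    """
    if len(run_accession) != 10:
        raise ValueError("this sketch only handles 10-character accessions")
    prefix = run_accession[:6]          # e.g. ERR531
    subdir = "00" + run_accession[-1]   # e.g. 002
    base = f"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/{prefix}/{subdir}/{run_accession}"
    return [f"{base}/{run_accession}_{mate}.fastq.gz" for mate in (1, 2)]


for url in ena_fastq_urls("ERR5310322"):
    print(url)
```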

Create new history

Click the + icon at the top of the history panel and create a new history with the name mapping 101 tutorial as explained here

Upload data

Import and rename the read files ERR5310322_1 and ERR5310322_2
1. Click Upload Data.
2. Click Paste/Fetch data.
3. Copy the URL for fastq R1 (select and Ctrl+C) and paste it (Ctrl+V).
4. Click Start.
5. Wait until the job finishes (the dataset turns green in the history).
6. Do the same for fastq R2.

Rename R1 and R2 files.
1. Click the ✏️ icon in the history for ERR5310322_1.fastq.gz
2. Change the name to ERR5310322_1
3. Do the same for R2.

Import the reference genome:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/875/385/GCF_000875385.1_ViralProj30293/GCF_000875385.1_ViralProj30293_genomic.fna.gz

Rename the reference genome and gff file.
1. Click the ✏️ for the reference file in the history.
2. Change the name to NC_009942.1

Map reads using Bowtie2

Search for Bowtie2 in the tool search box on the left.

Set bowtie2 parameters:
- Is this single or paired library: paired.
- FASTA/Q file #1 : ERR5310322_1
- FASTA/Q file #2 : ERR5310322_2
- Will you select a reference genome from your history or use a built-in index? : Use a genome from the history and build index.
- Do you want to use presets? : Very sensitive local. This setting hugely affects the mapping results and must be tuned to the dataset and experiment (read the Bowtie2 manual).
- Save the bowtie2 mapping statistics to the history: Yes

Click execute and wait.
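
For reference, the Galaxy form above corresponds roughly to two command-line calls: one to build the index from the reference and one to align the paired reads. The sketch below only assembles and prints those commands, it does not run them. The file names, the index prefix and the function name `bowtie2_commands` are assumptions for illustration, and the exact command that the Galaxy wrapper runs may differ (for instance, Galaxy stores the result as BAM rather than SAM).

```python
import shlex


def bowtie2_commands(reference_fasta, r1, r2,
                     index_prefix="reference_index",
                     out_sam="alignments.sam"):
    """Return the index-building and alignment commands as argument lists."""
    # bowtie2-build <reference fasta> <index prefix>
    build = ["bowtie2-build", reference_fasta, index_prefix]
    # --very-sensitive-local is the preset chosen in the Galaxy form.
    align = [
        "bowtie2", "--very-sensitive-local",
        "-x", index_prefix,   # index built from the history genome
        "-1", r1,             # FASTA/Q file #1
        "-2", r2,             # FASTA/Q file #2
        "-S", out_sam,        # SAM output file
    ]
    return build, align


# Hypothetical local file names.
build, align = bowtie2_commands(
    "NC_009942.1.fasta", "ERR5310322_1.fastq.gz", "ERR5310322_2.fastq.gz"
)
print(shlex.join(build))
print(shlex.join(align))
```

On the command line Bowtie2 writes its mapping statistics to standard error, which is the information Galaxy saves to the history as the mapping stats dataset.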

Visualize bam file and calculate metrics

Click the 👁️ icon on the Bowtie2 alignments dataset in the history.

Interpret the columns in the BAM format according to the theory from class.
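
As a reminder of what those columns are, the sketch below splits one alignment line into the eleven mandatory SAM fields and decodes the most common FLAG bits. The example record is invented for illustration and is not taken from the tutorial's BAM file. The helper names `parse_sam_line` and `decode_flag` are hypothetical.

```python
# The eleven mandatory columns of a SAM/BAM alignment record, in order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}

# A subset of the FLAG bits defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
}


def parse_sam_line(line):
    """Split a tab-separated alignment line into named fields."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    record["TAGS"] = columns[11:]  # optional fields such as AS:i:20
    return record


def decode_flag(flag):
    """List the names of the FLAG bits that are set."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


# Invented example record.
example = ("read1\t99\tNC_009942.1\t101\t42\t10M\t=\t301\t210\t"
           "ACGTACGTAC\tIIIIIIIIII\tAS:i:20")
record = parse_sam_line(example)
print(record["RNAME"], record["POS"], record["CIGAR"])
print(decode_flag(record["FLAG"]))
```
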
Visualize mapping metrics
- Click the eye icon on the Bowtie2 mapping stats dataset in the history.
What is the mapping rate?

93.44%
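
If you download the mapping stats, the rate can also be read programmatically. Bowtie2 ends its summary with a line of the form `93.44% overall alignment rate`. The sketch below extracts that number with a regular expression. The sample text contains only that final line, and the function name `overall_alignment_rate` is hypothetical.

```python
import re


def overall_alignment_rate(stats_text):
    """Return the overall alignment rate (in percent) from Bowtie2 stats text."""
    match = re.search(r"([\d.]+)% overall alignment rate", stats_text)
    if match is None:
        raise ValueError("no overall alignment rate line found")
    return float(match.group(1))


# Only the last line of the summary; the real file has more lines above it.
stats_text = "93.44% overall alignment rate\n"
print(overall_alignment_rate(stats_text))
```
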
Calculate depth of coverage metrics using Picard CollectWgsMetrics.
- Search for CollectWgsMetrics in the tool search box.
- Select SAM/BAM dataset or dataset collection: Bowtie2 alignments
- Load reference genome from: History, and select the reference genome fasta file.
- Treat bases with coverage exceeding this value as if they had coverage at this value: 3000

Click execute and wait.

What is the mean depth of coverage?

2805

What fraction of the genome has coverage > 10x?

0.961193
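
To see how these two numbers relate to per-base depth, and what the coverage cap of 3000 does, here is a small sketch that computes them from a list of depths. The depths are invented toy values, and the function is a simplified illustration rather than Picard's actual implementation, which also applies base and mapping quality filters. The 10x threshold is taken here as "at least 10x".

```python
def coverage_metrics(depths, coverage_cap=3000, min_depth=10):
    """Return (mean depth, fraction of positions with depth >= min_depth).

    Depths above coverage_cap are treated as if they were equal to the cap,
    mirroring the "treat bases with coverage exceeding this value" option.
    """
    capped = [min(depth, coverage_cap) for depth in depths]
    mean_depth = sum(capped) / len(capped)
    covered = sum(1 for depth in capped if depth >= min_depth)
    return mean_depth, covered / len(capped)


# Invented per-base depths for a 10-position toy genome.
depths = [0, 5, 12, 2500, 4000, 3500, 40, 9, 10, 3000]
mean_depth, fraction_10x = coverage_metrics(depths)
print(mean_depth, fraction_10x)
```

Because of the cap, the reported mean depth can never exceed 3000, so a mean close to the cap suggests that the true depth at many positions is higher than reported.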

Visualize bam file using IGV

In order to visualize our mapping we will use IGV (Integrative Genomics Viewer). This is an open source, freely available and lightweight visualization tool that enables intuitive real-time exploration of diverse, large-scale genomic data sets on standard desktop computers. It supports flexible integration of a wide range of genomic data types including aligned sequence reads, mutations, copy number, RNA interference screens, gene expression, methylation and genomic annotations.

Navigation through a data set is similar to that of Google Maps, allowing the user to zoom and pan seamlessly across the genome at any level of detail, from whole genome to base pair. Data sets can be loaded from either local or remote sources, including cloud-based resources, enabling investigators to view their own genomic data sets alongside publicly available data.

Install IGV
Launch IGV on your computer
Expand the Bowtie2 alignments output in the history
Click on local in the display with IGV line to load the reads into the IGV browser
Here you have a Galaxy training document for IGV usage.

This history is available at: https://usegalaxy.eu/u/smonzon/h/mapping-101-tutorial
