Galaxy for virologist training Exercise 2: Quality control and trimming

Despite improvements in sequencing methods, no technique is error-free. Measuring sequencing quality correctly is essential for identifying problems in a run, so it must be the first step of every sequencing analysis. Once the quality control is finished, low-quality and short reads should be removed, which requires a trimming step. After trimming, it is recommended to run a second quality control to confirm that the trimming worked.

1. Illumina Quality control and trimming

Title	Pre-processing
Training dataset:	PRJEB43037 - In August 2020, an outbreak of West Nile Virus affected 71 people with meningoencephalitis in Andalusia and 6 more cases in Extremadura (south-west of Spain), causing a total of eight deaths. The virus belonged to lineage 1 and was relatively similar to those of previous outbreaks that occurred in the Mediterranean region. Here, we present a detailed analysis of the outbreak, including an extensive phylogenetic study. This is one of the outbreak samples.
Questions:	How do I check whether my Illumina data was correctly sequenced? How can I improve the quality of my data?
Objectives:	Perform a quality control on raw Illumina reads; perform a quality trimming on raw Illumina reads; perform a quality control on trimmed Illumina reads
Estimated time:	25 min

1.1. Quality control

1.1.1. Upload data

To run the quality control over the samples, follow these steps:

Create a new history named Illumina preprocessing, as explained yesterday
Upload the data as seen yesterday by copying and pasting the following URLs:

ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR531/002/ERR5310322/ERR5310322_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR531/002/ERR5310322/ERR5310322_2.fastq.gz

Add some tags to the files. A tag must start with # in order to be propagated to the datasets produced by later tools.
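The two URLs above follow the directory layout of the ENA FASTQ FTP area. The Python sketch below rebuilds them from the run accession, assuming that layout: the first six characters of the accession, then a zero-padded subdirectory made from the characters beyond the ninth. The function name `ena_fastq_urls` is hypothetical, and the rule is only checked here against the accession used in this exercise.

```python
def ena_fastq_urls(run_accession):
    """Build the paired-end FASTQ URLs for an ENA run (assumed FTP layout)."""
    prefix = run_accession[:6]
    # Accessions longer than 9 characters get an extra subdirectory made of
    # their trailing characters, zero-padded to three digits (ERR5310322 -> 002).
    extra = len(run_accession) - 9
    subdir = run_accession[-extra:].zfill(3) + "/" if extra > 0 else ""
    base = f"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/{prefix}/{subdir}{run_accession}"
    return [f"{base}/{run_accession}_{mate}.fastq.gz" for mate in (1, 2)]


urls = ena_fastq_urls("ERR5310322")
print("\n".join(urls))
```

For other accessions, it is safer to copy the links from the ENA run page than to rely on this reconstruction.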

1.1.2. Run FastQC

Search for the fastqc tool
Select FastQC Read Quality reports and set the following parameters:
In Raw read data from your current history, select the Multiple datasets option
Select the two datasets
Then go down and select Run tool

To see the results, open the output datasets with Web page in their name, for both data 1 and data 2.

Here, you can see the number of reads in each file, the maximum and minimum read length in the sample, and the quality plots for both R1 and R2. The quality looks quite good, but we are going to trim the samples anyway.

How many reads do the samples have?

265989
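The read count reported by FastQC can be cross-checked outside Galaxy. The sketch below assumes a well-formed FASTQ file with exactly four lines per read (no wrapped sequences); `count_fastq_reads` and the tiny demo file are illustrative and not part of the training material.

```python
import gzip
import os
import tempfile


def count_fastq_reads(path):
    """Count reads in a plain or gzipped FASTQ file (four lines per read)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


# Tiny made-up example with two reads, written to a temporary gzipped file.
demo = "@r1\nACGT\n+\nIIII\n@r2\nGGCA\n+\nIIII\n"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fastq.gz")
    with gzip.open(demo_path, "wt") as out:
        out.write(demo)
    n_reads = count_fastq_reads(demo_path)

print(n_reads)
```

Run on a downloaded copy of one of the FASTQ files, the result should agree with the Total Sequences value in the FastQC report.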

First question

How do I check whether my Illumina data was correctly sequenced?

Using FastQC
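The quality plots in FastQC are drawn from the Phred scores stored in the fourth line of each FASTQ record. As a minimal sketch, assuming the Phred+33 encoding, a quality string can be decoded and converted to error probabilities as follows; the function names are illustrative.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(char) - offset for char in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


scores = phred_scores("II5!")
print(scores)  # [40, 40, 20, 0]
print(error_probability(30))
```

A Phred score of 30, the threshold used later for trimming, corresponds to one expected error in 1000 base calls.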

1.2. Trimming

Once we have performed the quality control, we have to perform the quality and read length trimming:

1.2.1. Run Fastp

1.Search for fastp in the tools

2.Then select fastp - fast all-in-one preprocessing for FASTQ files

-Select custom parameters:

3.Single-end or paired reads > Paired

    4.Input 1 > Browse datasets (right folder icon) > Select ERR5310322_1.fastq.gz

    5.Input 2 > Browse datasets > Select ERR5310322_2.fastq.gz

6.Display Filter Options

    -Quality Filtering options

        7.Qualified Quality Phred = 30

        8.Unqualified percent limit = 10

    -Length Filtering Options

        9.Length required = 50

10.Read modification options

    11.PolyX tail trimming > Enable polyX tail trimming

    -Per read cutting by quality options

        12.Cut by quality in front (5') > Yes

        13.Cut by quality in tail (3') > Yes

        14.Cutting mean quality = 30

15.Finally, click on Run tool
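To make the filtering options above concrete, here is a simplified Python sketch of the read-level decision they describe: a base counts as qualified if its Phred score is at least 30, a read is discarded when more than 10% of its bases are unqualified, and reads shorter than 50 bases are discarded. It only illustrates the parameters and is not fastp's implementation; polyX trimming and the per-read cutting by quality are left out, and the function name is hypothetical.

```python
def passes_fastp_like_filters(quals, qualified_phred=30,
                              unqualified_percent_limit=10,
                              length_required=50):
    """Return True if a read (given as a list of Phred scores) is kept."""
    if len(quals) < length_required:
        return False
    unqualified = sum(1 for q in quals if q < qualified_phred)
    # Keep the read only if at most `unqualified_percent_limit` percent
    # of its bases fall below the qualified Phred score.
    return unqualified * 100 <= unqualified_percent_limit * len(quals)


good_read = [36] * 95 + [12] * 5     # 5% of bases below Q30
noisy_read = [36] * 80 + [12] * 20   # 20% of bases below Q30
short_read = [36] * 40               # high quality but only 40 bases

print(passes_fastp_like_filters(good_read))
print(passes_fastp_like_filters(noisy_read))
print(passes_fastp_like_filters(short_read))
```

With paired reads, a pair is only useful downstream if both mates survive, which is why the number of reads lost can be larger than the number of individually bad reads.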

To see the trimming statistics, open the dataset named fastp on data 2 and data 1: HTML report.

How many reads have we lost?

98664 reads

1.2.2. Other trimming tools: Trimmomatic

1.Search for trimmomatic in the tools

2.Select Trimmomatic flexible read trimming tool for Illumina NGS data

-Select custom parameters:

3.Single-end or paired-end reads? = Paired-end (two separated files)

4.Input FASTQ file (R1/first of pair) = ERR5310322_1.fastq.gz

5.Input FASTQ file (R2/second of pair) = ERR5310322_2.fastq.gz

6.Average quality required = 30

7.Insert Trimmomatic Operation:

    8.Select Trimmomatic operation to perform: **MINLEN**

    9.Minimum length of reads to be kept = 50

10.Select Run tool
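The steps above can be sketched in Python. This sketch assumes that Average quality required belongs to Trimmomatic's sliding window operation and that the window spans 4 bases; the window size is an assumption, since it is not stated in the steps. It is a simplified illustration of the idea (scan from the 5' end, cut where the window average drops below the threshold, then apply the minimum length), not Trimmomatic's exact algorithm.

```python
def sliding_window_keep(quals, window=4, required_quality=30):
    """Number of 5' bases kept before the window average drops too low."""
    if not quals:
        return 0
    last_start = max(len(quals) - window, 0)
    for start in range(last_start + 1):
        chunk = quals[start:start + window]
        if sum(chunk) / len(chunk) < required_quality:
            return start
    return len(quals)


def survives(quals, min_length=50):
    """Apply the sliding-window cut, then the MINLEN-style length check."""
    return sliding_window_keep(quals) >= min_length


long_read = [38] * 60 + [10] * 20    # quality collapses after base 60
short_read = [38] * 40 + [10] * 20   # quality collapses after base 40

print(sliding_window_keep(long_read), survives(long_read))
print(sliding_window_keep(short_read), survives(short_read))
```

The first read is cut but stays above 50 bases and is kept; the second falls below the minimum length after cutting and is dropped.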

Trimmomatic does not report statistics on the trimmed reads, so we need to run FastQC again on the Trimmomatic results.

Try to do it on your own.

Second question

How can I improve the quality of my data?

Using trimming software, such as fastp or Trimmomatic.

This hands-on history URL: https://usegalaxy.eu/u/svarona/h/illumina-preprocessing

2. Nanopore Quality control and trimming

Title	Galaxy
Training dataset:	The data we are going to work with are Nanopore amplicon sequencing data generated with ARTIC network primers for the SARS-CoV-2 genome. From the Fast5 files generated by the ONT software, we are going to select the pass reads, so they are already filtered by quality.
Questions:	How do I know if my Nanopore data was correctly sequenced?
Objectives:	Perform a quality control on raw Nanopore reads; perform a quality trimming on raw Nanopore reads; perform a quality control on trimmed Nanopore reads
Estimated time:	15 min

2.1. Quality control

To run the quality control over the samples, follow these steps:

Create a new history named Nanopore quality, as explained yesterday
Upload the data as seen yesterday by copying and pasting the following URLs:

https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/nanopore/minion/fastq_pass/barcode01/FAO93606_pass_barcode01_7650855b_0.fastq
https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/nanopore/minion/fastq_pass/barcode01/FAO93606_pass_barcode01_7650855b_1.fastq
https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/nanopore/minion/fastq_pass/barcode01/FAO93606_pass_barcode01_7650855b_2.fastq

2.1.1. PycoQC

To use PycoQC we need the sequencing_summary.txt file provided by the Nanopore sequencing machine.

Upload it as seen yesterday by copying and pasting the following URL:

https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/nanopore/minion/sequencing_summary.txt

Search for the Pycoqc tool
Select Pycoqc quality control for Nanopore sequencing data
In A sequencing_summary file: Select the sequencing_summary.txt we just uploaded
Select Run tool

Then inspect the resulting PycoQC HTML Report:

Question

How many reads do the samples have?

3K reads

Do you understand all the plots?

Basecalled reads length

This plot shows the distribution of fragment sizes in the file that was analyzed. Long reads have a variable length, and the plot shows the relative amount of each sequence fragment size. In this example, the read length distribution is quite dispersed, with a minimum read length for the passed reads of around 150 bp and a maximum of about 5000 bp. However, most of the reads are about 500 nt long, as expected for this amplicon experiment.

Basecalled reads PHRED quality:

This plot shows the distribution of the Qscores (Q) of the reads. The Qscore is meant to give a global quality score for each read; its exact definition is the average per-base error probability, expressed on the log (Phred) scale. For Nanopore data, the distribution is generally centered around 10 or 12. For old runs, it can be lower, because older basecalling models are less precise than recent ones. In our case, the median read Qscore is 13, which means that this run has good quality.
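The definition above (average the per-base error probabilities, then convert back to the Phred scale) is not the same as averaging the Phred scores directly. A minimal Python sketch, with a hypothetical function name and made-up base qualities:

```python
import math


def read_qscore(base_qualities):
    """Read-level Qscore: mean error probability, converted back to Phred."""
    error_probs = [10 ** (-q / 10) for q in base_qualities]
    mean_error = sum(error_probs) / len(error_probs)
    return -10 * math.log10(mean_error)


quals = [10, 20, 30]
qscore = read_qscore(quals)
arithmetic_mean = sum(quals) / len(quals)
print(round(qscore, 2), arithmetic_mean)
```

The low-quality bases dominate the average error, so the read Qscore (about 14.3 here) is lower than the arithmetic mean of the scores (20). By the same conversion, a Qscore of 13 corresponds to an average error probability of about 5% per base.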

Basecalled reads length vs reads PHRED quality:

This representation gives a 2D visualisation of the read Qscore as a function of read length.

Output over experiment time:

This representation shows how reads were produced over time for a single run. We can see that read production decreases over time, which can be due to most of the genetic material having been sequenced, saturation of the pores, and/or degradation of the material and/or pores. In this example, the “Cumulative” plot area (light blue) indicates that 50% of all reads and almost 50% of all bases were produced in the first 3 h of the 8 h experiment. From 6 to 8 h, only 200 reads were produced, which means that we could have ended the experiment 2 h earlier.
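The "50% of reads in the first 3 h" figure can in principle be recomputed from the sequencing summary. The sketch below assumes a tab-separated file with a `start_time` column holding seconds since the start of the run; the column name should be checked against the actual file, and the function name and miniature input are made up for illustration.

```python
import csv
import io
import math


def time_to_fraction_of_reads(summary_text, fraction=0.5):
    """Start time (s) by which the given fraction of reads had begun."""
    reader = csv.DictReader(io.StringIO(summary_text), delimiter="\t")
    starts = sorted(float(row["start_time"]) for row in reader)
    if not starts:
        raise ValueError("no reads found in the summary")
    index = max(math.ceil(fraction * len(starts)) - 1, 0)
    return starts[index]


# Made-up miniature summary: four reads, tab-separated like the real file.
demo_summary = "read_id\tstart_time\nr1\t10.0\nr2\t20.0\nr3\t30.0\nr4\t4000.0\n"
half_time = time_to_fraction_of_reads(demo_summary)
print(half_time)
```

Dividing the returned value by 3600 gives the time in hours, which can be compared with the cumulative plot.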

Read length over experiment time:

The read length over experiment time should be stable. It can increase slightly over time, as short fragments tend to be over-sequenced at the beginning and become less abundant later. In this case, as almost all the fragments have the same length, the plot is nearly constant over time.

Read quality over experiment time:

The read quality over experiment time should be stable too, but it usually decreases slightly over time as pores get saturated or degraded. In this case, we can see a clear decrease in sequencing quality over experiment time, but it stays within the range of good quality values, and this can be addressed with further post-processing of the reads.

Number of reads per barcode:

This plot shows the number of reads per barcode, that is, the number of reads per sample after demultiplexing. In a good experiment, all the barcodes should have roughly the same number of reads. In this training we only use reads from the barcode01 sample, but the plot shows that barcode08 could not be sequenced correctly.
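The same per-barcode counts can be tallied directly from the sequencing summary. This is a minimal sketch that assumes the file is tab-separated and has a `barcode_arrangement` column, which is only present when the run was demultiplexed; the function name and the miniature input are made up.

```python
import csv
import io
from collections import Counter


def reads_per_barcode(summary_text):
    """Count reads for each value of the barcode column."""
    reader = csv.DictReader(io.StringIO(summary_text), delimiter="\t")
    return Counter(row["barcode_arrangement"] for row in reader)


# Made-up miniature summary with the assumed column name.
demo_summary = (
    "read_id\tbarcode_arrangement\n"
    "r1\tbarcode01\n"
    "r2\tbarcode01\n"
    "r3\tbarcode08\n"
    "r4\tunclassified\n"
)
counts = reads_per_barcode(demo_summary)
print(counts.most_common())
```

A barcode with far fewer reads than the others, as with barcode08 in the plot, stands out immediately in such a tally.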

Channel activity over time:

It gives an overview of the available pores, pore usage during the experiment and inactive pores, and shows whether the flow cell was loaded well (almost all pores are used). In this case, the vast majority of channels/pores are inactive (white) after 6 h of the experiment, so the run should have been finished at that time. Ideally, the plot is dark near the X-axis and does not get too light/white at higher Y-values (increasing time). Depending on whether you chose “Reads” or “Bases” on the left, the colour indicates either the number of reads or the number of bases per time interval.

How do I check whether my Nanopore data was correctly sequenced?

Using NanoPlot or PycoQC and looking at the summary statistics.

2.2. Trimming

While Nanopore reads are being sequenced, the MinKNOW software splits the Fast5 reads into quality pass and quality fail. As we will select only the Fast5 pass reads, we do not need to perform quality trimming: even if the reads appear to have a poor Phred score, we know that the ONT software considered them "good quality".

We will therefore only perform read length trimming. As we are using amplicon sequencing data, we do not expect reads shorter than 400 nucleotides or longer than 600; longer reads would most likely be chimeric.

2.2.1. Artic

Search for artic tool
Select ARTIC guppyplex Filter Nanopore reads by read length and (optionally) quality
Structure of your input data: Multiple input datasets per sample
While pressing the Ctrl key, select the three samples
Remove reads longer than = 600
Remove reads shorter than = 300
Do not filter on quality score (speeds up processing) = Yes (we have already selected the pass reads)
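Conceptually, the length filter configured above keeps only reads whose length lies between the two bounds. The sketch below uses the tool settings from the steps (300 and 600) and treats both bounds as inclusive, which is an assumption; it is an illustration of the idea, not the ARTIC code, and the read names are made up.

```python
def filter_by_length(reads, min_length=300, max_length=600):
    """Keep (name, sequence) pairs whose length is within the bounds."""
    return [
        (name, seq) for name, seq in reads
        if min_length <= len(seq) <= max_length
    ]


reads = [
    ("too_short", "A" * 150),
    ("amplicon", "A" * 500),
    ("chimera", "A" * 1000),
]
kept = filter_by_length(reads)
print([name for name, _ in kept])
```

Only the read of amplicon-like length survives; the short fragment and the probable chimera are removed.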

2.2.2. Nanoplot

Now we are going to run NanoPlot on filtered data:

Search for the Nanoplot tool and select NanoPlot Plotting suite for Oxford Nanopore sequencing data and alignments
Run the tool as follows:
- In the files part, select ARTIC output file.
- Display Options for customizing the plots created:
  - Specify the bivariate format of the plots > Select all
  - Show the N50 mark in the read length histogram > Yes
- Select Execute
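The N50 mark enabled above is the read length L such that reads of length L or longer contain at least half of all sequenced bases. A minimal Python sketch of that calculation, with made-up read lengths:

```python
def n50(lengths):
    """Length L such that reads of length >= L hold at least half the bases."""
    if not lengths:
        raise ValueError("need at least one read length")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


read_lengths = [100, 200, 300, 400, 500]
print(n50(read_lengths))  # 400
```

For amplicon data filtered to a narrow length range, the N50 is expected to sit close to the amplicon length.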

Questions

Did our data length and quality improve?

Yes, now we have reads within the specified length and quality.

How many reads did we lose during the trimming step?

137 reads

This hands-on history URL: https://usegalaxy.eu/u/svarona/h/nanopore-quality

NOTE: We can't use nanofilt because it is not installed in Galaxy
