Despite improvements in sequencing methods, no technique is error-free. Measuring the sequencing quality correctly is essential for identifying problems in a run, so it must be the first step of every sequencing analysis. Once the quality control is finished, low-quality and short reads should be removed, which is the purpose of the trimming step. After trimming, it is recommended to run a second quality control step to confirm that the trimming worked.
Title | Pre-processing |
---|---|
Training dataset: | PRJEB43037 - In August 2020, an outbreak of West Nile Virus affected 71 people with meningoencephalitis in Andalusia and 6 more cases in Extremadura (south-west of Spain), causing a total of eight deaths. The virus belonged to lineage 1 and was relatively similar to previous outbreaks that occurred in the Mediterranean region. Here, we present a detailed analysis of the outbreak, including an extensive phylogenetic study. This is one of the outbreak samples. |
Questions: |
|
Objectives: |
|
Estimated time: | 25 min |
To run the quality control over the samples, follow these steps:
- Create a new history, as explained yesterday, named Illumina preprocessing
- Upload the data as seen yesterday by copying and pasting the following URLs:
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR531/002/ERR5310322/ERR5310322_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR531/002/ERR5310322/ERR5310322_2.fastq.gz
- Add some tags to the files. A tag must start with # to be propagated to the processes.
- Search for the fastqc tool
- Select FastQC Read Quality reports and set the following parameters:
- Select multiple file data set in Raw read data from your current history
- Select the two datasets
- Then go down and select Run tool
To see the results, open the outputs with Web page in their name for both data 1 and data 2.
Here, you can see the number of reads in each file, the maximum and minimum read length in the sample, and the quality plots for both R1 and R2. The quality looks quite good, but we are going to trim the samples anyway.
How many reads do the samples have?
265989
First question
How do I check whether my Illumina data was correctly sequenced?
Using FastQC
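The read count that FastQC reports can be cross-checked by hand. The following is a minimal sketch, assuming an unwrapped FASTQ file in which every read takes exactly four lines (the usual layout for Illumina data); the function names are illustrative and not part of FastQC or Galaxy.

```python
import gzip
import io


def count_fastq_reads(handle):
    """Count reads in an open FASTQ text handle.

    Assumption: each record spans exactly 4 lines
    (header, sequence, '+', qualities).
    """
    n_lines = sum(1 for line in handle if line.strip())
    if n_lines % 4 != 0:
        raise ValueError("FASTQ line count is not a multiple of 4")
    return n_lines // 4


def count_fastq_reads_in_file(path):
    """Open a plain or gzip-compressed FASTQ file and count its reads."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_fastq_reads(handle)


# Tiny in-memory example with two reads.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n")
print(count_fastq_reads(example))  # 2
```

Running `count_fastq_reads_in_file` on a downloaded copy of each `.fastq.gz` file should give the same count for R1 and R2, since paired-end files contain one mate per read pair.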
Once the quality control is done, we trim the reads by quality and length:
1. Search for fastp in the tools
2. Then select fastp - fast all-in-one preprocessing for FASTQ files
- Select custom parameters:
3. Single-end or paired reads > Paired
4. Input 1 > Browse datasets (right folder icon) > Select ERR5310322_1.fastq.gz
5. Input 2 > Browse datasets > Select ERR5310322_2.fastq.gz
6. Display Filter Options
- Quality Filtering options
7. Qualified Quality Phred = 30
8. Unqualified percent limit = 10
- Length Filtering Options
9. Length required = 50
10. Read modification options
11. PolyX tail trimming > Enable polyX tail trimming
- Per read cutting by quality options
12. Cut by quality in front (5') > Yes
13. Cut by quality in tail (3') > Yes
14. Cutting mean quality = 30
15. Finally, click on Run tool
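To make the quality and length filtering parameters above more concrete, here is a minimal sketch of the per-read decision they describe: a base is "qualified" if its Phred score is at least 30, a read is discarded if more than 10% of its bases are unqualified, and a read shorter than 50 bases is discarded. This is an illustration written for this tutorial, not fastp's code; it assumes Phred+33 quality encoding and does not model the polyX trimming or the 5'/3' quality cutting.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (assumed Phred+33) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def passes_filters(quality_string,
                   qualified_phred=30,
                   unqualified_percent_limit=10,
                   length_required=50):
    """Return True if a read would survive the length and quality filters.

    Hypothetical helper mirroring the three Galaxy parameters set above.
    """
    scores = phred_scores(quality_string)
    if len(scores) < length_required:
        return False  # too short
    unqualified = sum(1 for q in scores if q < qualified_phred)
    percent_unqualified = 100.0 * unqualified / len(scores)
    return percent_unqualified <= unqualified_percent_limit


# 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
print(passes_filters("I" * 60))             # True: long enough, all bases Q40
print(passes_filters("I" * 53 + "#" * 7))   # False: about 11.7% of bases below Q30
print(passes_filters("I" * 40))             # False: shorter than 50 bases
```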
To see the trimming statistics, open the fastp on data 2 and data 1: HTML report output.
How many reads have we lost?
98664 reads
Alternatively, the trimming can be done with Trimmomatic:
1. Search for trimmomatic in the tools
2. Select Trimmomatic flexible read trimming tool for Illumina NGS data
- Select custom parameters:
3. Single-end or paired-end reads? = Paired-end (two separated files)
4. Input FASTQ file (R1/first of pair) = ERR5310322_1.fastq.gz
5. Input FASTQ file (R2/second of pair) = ERR5310322_2.fastq.gz
6. Average quality required = 30
7. Insert Trimmomatic Operation:
8. Select Trimmomatic operation to perform: **MINLEN**
9. Minimum length of reads to be kept = 50
10. Select Run tool
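The two operations configured above can be pictured as follows: a read is dropped if the average of its Phred scores is below 30, or if it is shorter than 50 bases. This is a minimal sketch under the assumption of Phred+33 encoding; the function names are hypothetical and the code is not Trimmomatic's implementation, nor does it model how Trimmomatic handles pairs in which only one mate survives.

```python
def mean_phred(quality_string, offset=33):
    """Arithmetic mean of the Phred scores of one read (assumed Phred+33)."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


def keep_read(quality_string, avgqual=30, minlen=50):
    """Return True if the read meets both the average-quality and length thresholds."""
    if not quality_string or len(quality_string) < minlen:
        return False
    return mean_phred(quality_string) >= avgqual


# 50 bases at Q40 ('I') plus 10 bases at Q2 ('#'): the mean is still above 30.
mixed = "I" * 50 + "#" * 10
print(keep_read(mixed))       # True
print(keep_read("5" * 60))    # False: every base is Q20
print(keep_read("I" * 49))    # False: shorter than 50 bases
```

Note that an average-quality threshold tolerates a few very poor bases as long as the rest of the read is good, which is a different criterion from a limit on the percentage of low-quality bases.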
Trimmomatic does not report statistics on the trimmed reads, so we need to run FastQC again on the Trimmomatic results.
Second question
How can I improve the quality of my data?
Using trimming software, such as fastp or Trimmomatic.
- This hands-on history URL: https://usegalaxy.eu/u/svarona/h/illumina-preprocessing
Title | Galaxy |
---|---|
Training dataset: | The data we are going to work with corresponds to Nanopore amplicon sequencing data using ARTIC network primers for the SARS-CoV-2 genome. From the Fast5 files generated by the ONT software, we are going to select the pass reads, so they are already filtered by quality. |
Questions: |
|
Objectives: |
|
Estimated time: | 15 min |
To run the quality control over the samples, follow these steps:
- Create a new history, as explained yesterday, named Nanopore quality
- Upload the data as seen yesterday by copying and pasting the following URLs:
https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/nanopore/minion/fastq_pass/barcode01/FAO93606_pass_barcode01_7650855b_0.fastq
https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/nanopore/minion/fastq_pass/barcode01/FAO93606_pass_barcode01_7650855b_1.fastq
https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/nanopore/minion/fastq_pass/barcode01/FAO93606_pass_barcode01_7650855b_2.fastq
To use PycoQC we need the sequencing_summary.txt file provided by the Nanopore sequencing machine.
Upload it as seen yesterday by copying and pasting the following URL:
https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/nanopore/minion/sequencing_summary.txt
- Search for the Pycoqc tool
- Select Pycoqc quality control for Nanopore sequencing data
- In A sequencing_summary file: Select the sequencing_summary.txt we just uploaded
- Select Run tool
Then inspect the resulting PycoQC HTML Report:
Question
How many reads do the samples have?
3K reads
Do you understand all the plots?
Basecalled reads length: This plot shows the distribution of fragment sizes in the analyzed file. Long reads have a variable length, and the plot shows the relative amount of each fragment size. In this example, the read length distribution is quite dispersed, with a minimum length for the passed reads of around 150 bp and a maximum of ~5000 bp. However, most of the reads are about 500 nt long, as expected for this amplicon experiment.
Basecalled reads PHRED quality: This plot shows the distribution of the Qscores (Q) for each read. This score aims to give a global quality score for each read. The exact definition of the Qscore is the average per-base error probability, expressed on the log (Phred) scale. For Nanopore data, the distribution is generally centered around 10 or 12. For old runs, the distribution can be lower, as older basecalling models are less precise than recent ones. In our case, the median read Qscore is 13, which means that this run has good quality.
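The Phred scale relates a quality score Q to an error probability P through Q = -10 · log10(P). The sketch below illustrates the definition given above: per-base scores are converted to error probabilities, averaged, and converted back to the Phred scale. The function names are illustrative and are not taken from PycoQC.

```python
import math


def phred_to_error_prob(q):
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def read_qscore(base_qualities):
    """Read-level Qscore: mean per-base error probability on the Phred scale."""
    probs = [phred_to_error_prob(q) for q in base_qualities]
    return error_prob_to_phred(sum(probs) / len(probs))


# A Qscore of 13 corresponds to roughly a 5% per-base error probability.
print(round(phred_to_error_prob(13), 3))  # 0.05

# Averaging probabilities is not the same as averaging scores:
# bases at Q10 and Q20 give a read Qscore of about 12.6, not 15.
print(read_qscore([10, 20]))
```

Because the average is taken over error probabilities, low-quality bases weigh more heavily on the read Qscore than a simple mean of the scores would suggest.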
Basecalled reads length vs reads PHRED quality: This representation gives a 2D visualisation of the read Qscore according to the read length.
Output over experiment time: This representation gives information about the sequenced reads over time for a single run. We can see that the production of reads decreases over time, which can be due to most of the genetic material having already been sequenced, saturation of the pores, and/or degradation of the material and/or the pores. In this example, the “Cumulative” plot area (light blue) indicates that 50% of all reads and almost 50% of all bases were produced in the first 3 h of the 8 h experiment. We can see that from 6 to 8 h of the experiment only 200 reads were produced, which means that we could have ended the experiment 2 h earlier.
Read length over experiment time: The read length over experiment time should be stable. It can slightly increase over time, as short fragments tend to be over-sequenced at the beginning and become less frequent later on. In this case, as almost all the fragments have the same length, the plot is nearly constant over time.
Read quality over experiment time: The read quality over experiment time should be stable too, but it usually decreases slightly over time as pores get saturated or degraded. In this case, we can see a clear decrease in sequencing quality over the experiment time, but it stays within the range of good quality values, and this can be addressed with further post-processing of the reads.
Number of reads per barcode: This plot shows the number of reads per barcode, that is, the number of reads for each sample to be demultiplexed. In a good experiment, all the barcodes should have about the same number of reads. In this training we only used reads from the barcode01 sample, but we can see that barcode08 couldn't be correctly sequenced.
Channel activity over time: It gives an overview of the available pores, the pore usage during the experiment and the inactive pores, and shows whether the loading of the flow cell is good (almost all pores are used). In this case, the vast majority of channels/pores are inactive (white) after 6 h of the experiment, so the run should have been finished at that time. Ideally, the plot is dark near the X-axis and does not become too light/white at higher Y-values (increasing time). Depending on whether you choose “Reads” or “Bases” on the left, the colour indicates either the number of reads or the number of bases per time interval.
How do I check whether my Nanopore data was correctly sequenced?
Using NanoPlot or PycoQC and having a look at the statistics.
When Nanopore reads are being sequenced, the MinKNOW software splits Fast5 reads into quality pass and quality fail. As we will select only the Fast5 pass reads, we won't need to perform quality trimming, so even if we see that the reads have a bad Phred score, we know that the ONT software considered them "good quality".
We will therefore only perform read length trimming. As we are using amplicon sequencing data, we don't expect reads shorter than 400 nucleotides or longer than 600, as longer reads would correspond to chimeric reads.
- Search for artic tool
- Select ARTIC guppyplex Filter Nanopore reads by read length and (optionally) quality
- Structure of your input data: Multiple input datasets per sample
- While pressing the Ctrl key, select the three samples
- Remove reads longer than = 600
- Remove reads shorter than = 300
- Do not filter on quality score (speeds up processing) = Yes (we have already selected the pass reads)
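The length filter configured above can be sketched in a few lines of Python. This is only an illustration of the idea, not the ARTIC guppyplex code: the function name is hypothetical, reads are represented as plain `(name, sequence)` tuples, and treating both bounds as inclusive is an assumption.

```python
def filter_by_length(reads, min_length=300, max_length=600):
    """Keep reads whose length lies within [min_length, max_length].

    reads: iterable of (name, sequence) tuples.
    Assumption: reads exactly at either bound are kept.
    """
    return [(name, seq) for name, seq in reads
            if min_length <= len(seq) <= max_length]


# Synthetic reads: one too short, two amplicon-sized, one likely chimera.
reads = [
    ("short", "A" * 150),
    ("amplicon_1", "A" * 480),
    ("amplicon_2", "A" * 520),
    ("chimera", "A" * 1000),
]
kept = filter_by_length(reads)
print([name for name, _ in kept])  # ['amplicon_1', 'amplicon_2']
print(len(reads) - len(kept))      # 2 reads removed
```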
Now we are going to run NanoPlot on filtered data:
- Search for the Nanoplot tool and select NanoPlot Plotting suite for Oxford Nanopore sequencing data and alignments
- Run the tool as follows:
- In the files part, select ARTIC output file.
- Display Options for customizing the plots created:
- Specify the bivariate format of the plots > Select all
- Show the N50 mark in the read length histogram > Yes
- Select Execute
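The N50 mark requested above is the read length such that reads of that length or longer contain at least half of all sequenced bases. A minimal sketch of the computation, with an illustrative function name that is not taken from NanoPlot:

```python
def n50(lengths):
    """Return the N50 of a collection of read lengths."""
    if not lengths:
        raise ValueError("no read lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Total is 20 bases; the two longest reads (6 + 5 = 11) already cover half.
print(n50([2, 3, 4, 5, 6]))  # 5
```

For amplicon data filtered to a narrow length window, the N50 is expected to sit close to the typical amplicon length.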
Questions
Did our data length and quality improve?
Yes, now we have reads within the specified length and quality ranges.
How many reads did we lose during the trimming step?
137 reads
- This hands-on history URL: https://usegalaxy.eu/u/svarona/h/nanopore-quality
NOTE: We can't use NanoFilt because it is not installed in Galaxy.