Chapter 2 ‐ Quality Control And Data Preprocessing
During this section you will learn to:
- Assess the intrinsic quality of your raw reads (in fastq format) using metrics generated by the sequencing platform (e.g. quality scores)
- Pre-process data, i.e. trim low-quality bases and remove short reads.
- NanoPack: https://github.com/wdecoster/nanopack
First we will have to find our data on the system. This has been saved for you in our shared drive here:
NERC_EcologicalGenomics/june_2023/prom_data/raw_data/
Take a look:
$ ls NERC_EcologicalGenomics/june_2023/prom_data/raw_data/
You should see a collection of files, including one labelled with your sample number; this is your raw data from the sequencer, uploaded this week:
<YOUR_FASTQ_FILE>.fastq.gz
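If you haven’t met the FASTQ format before, each read occupies exactly four lines: a header starting with ‘@’, the sequence, a ‘+’ separator, and the per-base quality string. Here is a small sketch using a toy gzipped file (the read ID and sequence are made up) showing how you could peek at your real file and count its reads:

```shell
# Toy FASTQ record (hypothetical read ID and sequence) written to a
# gzipped file, then inspected the same way you would your real data
printf '@read_1\nACGTACGT\n+\nIIIIFFFF\n' | gzip > toy.fastq.gz

# View the first record (4 lines) without fully decompressing the file
zcat toy.fastq.gz | head -4

# Count the number of reads: total line count divided by 4
n_lines=$(zcat toy.fastq.gz | wc -l)
echo $((n_lines / 4))
```

On your real data, replace `toy.fastq.gz` with `<YOUR_FASTQ_FILE>.fastq.gz`.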
We’ll make a ‘soft link’ to your data in your home directory to save copying across a large data file:
$ ln -sr NERC_EcologicalGenomics/june_2023/prom_data/raw_data/<YOUR_FASTQ_FILE>.fastq.gz NERC_EcologicalGenomics/QC
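The `-s` flag makes the link symbolic and GNU ln’s `-r` makes it relative to its location, so the link keeps working from inside your home directory. A minimal sketch with toy file names (not your real paths) shows how to check where a link points:

```shell
# Toy setup: a stand-in raw-data file and a QC directory
# (file names here are hypothetical, not your real data)
mkdir -p demo/QC
printf 'placeholder\n' > demo/raw.fastq.gz   # not real gzip data, just a marker

# Relative symbolic link, as in the command above
ln -sr demo/raw.fastq.gz demo/QC/

# `ls -l` shows an arrow from the link to its target
ls -l demo/QC/
```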
Let’s move into the “QC” directory so that any files we produce in this section are kept tidy:
$ cd NERC_EcologicalGenomics/QC
We will use NanoPlot from NanoPack to assess the quality of your FASTQ file.
The help message can be displayed with:
$ NanoPlot -h
Once you’ve had a look at the options for running NanoPlot we’ll go ahead and run it on our data with fairly default settings (this may take around 15 mins):
$ NanoPlot -t 2 --fastq <YOUR_FASTQ_FILE>.fastq.gz --loglength -o <YOUR_FILE>_nanoplot --plots dot
where:
- -t Set the number of threads to be used by the script
- --fastq Data is in one or more default fastq file(s)
- --loglength Show logarithmic scaling of lengths in plots
- --plots Specify which bivariate plots have to be made (kde, hex, dot, pauvre)
Let’s take a look at the output:
$ cd <YOUR_FILE>_nanoplot
$ ls
As you can see, NanoPlot has created a few images and a stats file in text format. All of these are summarized in a nice HTML report. Let’s take a look at that:
$ google-chrome NanoPlot-report.html &
The first thing you’ll see is a table containing some useful statistics about your data:
➔ How many reads are in the sample?
➔ What is the mean quality?
➔ What’s the read length N50?
➔ Overall, how good is your data?
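The read length N50 reported by NanoPlot is the length such that reads of that length or longer contain at least half of the total bases. As a sanity check on the idea, here is a small sketch that computes N50 from a hypothetical list of read lengths (on real data you could extract lengths with `zcat <YOUR_FASTQ_FILE>.fastq.gz | awk 'NR%4==2 {print length($0)}'`):

```shell
# Hypothetical read lengths, one per line
printf '100\n200\n300\n400\n' > lengths.txt

# N50: sort lengths descending and walk down until the cumulative
# sum reaches half of the total yield; that read length is the N50
n50=$(sort -rn lengths.txt | awk '
  { lens[NR] = $1; total += $1 }
  END {
    half = total / 2
    for (i = 1; i <= NR; i++) {
      cum += lens[i]
      if (cum >= half) { print lens[i]; exit }
    }
  }')
echo "N50 = $n50"
```

For these toy lengths the total is 1000 bases, and the cumulative sum first reaches 500 at the 300 bp read, so the N50 is 300.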
We will filter the reads based on your average read quality generated above, <your_mean_qual>, and on a minimum read length using chopper. This tool can also remove reads mapping to the lambda phage genome (control DNA used in nanopore sequencing).
We’re going to have to pipe a couple of commands into each other here because chopper needs an unzipped version of our raw data file. We then write out to a new ‘_trimmed.fastq.gz’ file using a redirect ‘>’.
$ gunzip -c <YOUR_FASTQ_FILE>.fastq.gz | chopper -q <your_mean_qual> -l 500 | gzip > <YOUR_FASTQ_FILE>_trimmed.fastq.gz
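Before running the full NanoComp comparison below, a quick way to see what filtering cost you is to count reads and bases in each file. This sketch uses two toy gzipped FASTQ files as stand-ins for your raw and trimmed data (swap in `<YOUR_FASTQ_FILE>.fastq.gz` and `<YOUR_FASTQ_FILE>_trimmed.fastq.gz` on the real thing):

```shell
# Toy stand-ins: a "raw" file with two reads and a "trimmed" file
# where the short read has been filtered out (hypothetical data)
printf '@r1\nACGTACGTAC\n+\nIIIIIIIIII\n@r2\nACGT\n+\nIIII\n' | gzip > raw.fastq.gz
printf '@r1\nACGTACGTAC\n+\nIIIIIIIIII\n' | gzip > trimmed.fastq.gz

# Line 2 of every 4-line record is the sequence: count reads and bases
for f in raw.fastq.gz trimmed.fastq.gz; do
  zcat "$f" | awk -v f="$f" 'NR % 4 == 2 { n++; bases += length($0) }
                             END { print f": "n" reads, "bases" bases" }'
done
```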
We will use NanoComp to compare the data pre and post filtering. NanoComp can also be used to compare the quality of multiple samples/runs in one go.
Unlike with Illumina short reads, you may not see a massive improvement in quality after filtering. The aim here is to check that the filtering cutoffs were not too stringent and that enough data is left to carry out further analysis.
$ NanoComp --fastq <YOUR_FASTQ_FILE>.fastq.gz <YOUR_FASTQ_FILE>_trimmed.fastq.gz -t 2 --names reads_raw reads_trimmed -o <YOUR_FILE>_Nanocomp
Let’s take a look at the results:
$ cd <YOUR_FILE>_Nanocomp
$ google-chrome NanoComp-report.html &
➔ Spot the differences! Fill out the same metrics from the previous question and compare results.
➔ How big is the loss in yield?
➔ What is our average quality now?
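To put a number on the yield loss, you can express the trimmed base count as a percentage of the raw one. The counts below are hypothetical placeholders; substitute the totals NanoComp reports for your own files:

```shell
# Hypothetical base counts (replace with the totals from your own data)
raw_bases=14
trimmed_bases=10

# Percentage of the raw yield retained after filtering
awk -v r="$raw_bases" -v t="$trimmed_bases" \
    'BEGIN { printf "%.1f%% of bases retained\n", 100 * t / r }'
```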