Chapter 3 ‐ De Novo Nuclear Genome Assembly

During this session you will learn to:

Generate a de novo genome assembly using Redbean (wtdbg2)
Assess the quality of assembly

1. Software

All these software are open source and can be downloaded from the links given below:

Redbean (wtdbg2) https://github.com/ruanjue/wtdbg2
QUAST: http://quast.sourceforge.net/
Minimap2: https://github.com/lh3/minimap2
Samtools: http://samtools.github.io/

2. Data

We will be using the FASTQ data for the sample sequenced by you. We have filtered and trimmed the data using chopper. Create a folder named 'nuclear_assembly' and create a softlink to your trimmed file within this directory.

$ mkdir nuclear_assembly 

$ cd nuclear_assembly

$ ln -s NERC_training/QC/<YOUR_FASTQ_FILE>

3. Generate de novo genome assembly

Redbean is one of the widely used tools for assembling genomes de novo from long noisy reads produced by PacBio or Oxford Nanopore Technologies (ONT). It assembles raw reads without error correction and then builds the consensus from intermediate assembly output. It is one of the fastest assemblers available and is designed for a wide range of datasets from small bacteria to large mammals.
wtdbg2 has two key components: an assembler wtdbg2 and a consenser wtpoa-cns.

3.1 wtdbg2

wtdbg2 assembles raw reads and generates the contig layout and edge sequences in a file "prefix.ctg.lay.gz". In order to run this command, you need to use the trimmed fastq file you generated yesterday. Furthermore, one of the mandatory parameters the program requires is estimated genome size. You can select this from the table below based on the barcode number you used during the course. These genome size estimates are for the species we are analysing here (but not the same individuals) based on flow cytometry, a non-sequenced based method. These data were downloaded from the Kew Plant DNA C-value database (https://cvalues.science.kew.org/) Note that these are haploid values, which is the standard way to report genome sizes.

Barcode number	Expected Genome Size - April	Expected Genome size - June
1	900 MB	1.45 GB
2	1.5 GB	1.45 GB
3	1.5 GB	1.1 GB
4	600 MB	1.1 GB
5	1.5 GB	600 MB
6	1.4 GB	1.4 GB
7	1.5 GB	300 MB
8	600 MB	1.5 GB
9	300 MB	1.5 GB
10	1.45 GB	1.5 GB
11	1.5 GB	1.5 GB
12	1.1 GB	900 MB
13		600 MB
14		1.5 GB

Run wtdbg2:

$ wtdbg2 -i <YOUR_FASTQ_FILE> -o wtdbg2 -t 36 -x ont -g <estimated_genome_size>

where,
-i Long reads sequences file (can be multiple)
-o Prefix of output files
-t Number of threads
-x ONT preset (sets other params like k-mer sizes,etc. accordingly)
-g Estimated genome size (We need to select this value from the table shown above. This should correspond to your
barcode ID and can be specified as: 1.4g, 800m, etc.)

The output of wtdbg2 will show a k-mer distribution. Take a look.

This plot shows that most Kmers are composed of singletons 77M/167M showing the genome is extremely repetitive and contains erroneous reads. Also, the average kmer depth is extremely low (8) which shows the genome coverage is sparse. For an ideal assembly most k-mers are usually within 2~1000 and average k-mer depth is between 20 - 100.

➔ How does this distribution look for you?

3.2 wtpoa-cns

wtpoa-cns takes "prefix.ctg.lay.gz" as input and produces the final consensus in FASTA.

$ wtpoa-cns -t 32 -i wtdbg2.ctg.lay.gz -fo wtdbg2.ctg.fa

4. Polishing genome assemblies

We will be using mapping based polishing methods to polish the nuclear genome assembly we have generated in the previous steps.

4.1 Map the reads to assembly

We will use minimap2 for mapping the raw data to the Redbean assembly generated earlier. Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database. Minimap2 is tens of times faster than mainstream long-read mappers such as BLASR, BWA-MEM, NGMLR and GMAP. It is compatible with the long reads produced from both PacBio and ONT.

minimap2 -ax map-ont -t 36 wtdbg2.ctg.fa <YOUR_FASTQ_FILE> | samtools sort -@36 > minimap2_wtdbg2_sorted.bam

4.2 Get mapping stats

We will explore samtools to get the mapping stats and perform other operations such as sorting and indexing the BAM file.

$ samtools index minimap2_wtdbg2_sorted.bam

$ samtools flagstat minimap2_wtdbg2_sorted.bam > minimap2_wtdbg2_sorted.bam.stat

$ cat minimap2_wtdbg2_sorted.bam.stat

A read may map ambiguously to multiple locations, e.g. due to repeats. Only one of the multiple read alignments is considered primary, and this decision may be arbitrary. All other alignments have the secondary alignment flag.

4.3 Polish

We are only selecting primary alignments and excluding supplementary/secondary alignments. This is done to get more confidence which will be the main basis for polishing/error correction. Later, the consensus is called again based on the mapping data.

$ samtools view -F0x772 minimap2_wtdbg2_sorted.bam | wtpoa-cns -t 36 -d wtdbg2.ctg.fa -i - -fo wtdbg2.cns.fa

5. Assessment of assemblies

5.1 QUAST

QUAST is a quality assessment tool for evaluating and comparing genome assemblies. QUAST can evaluate assemblies with a reference genome as well as without a reference.

Help message can be accessed by:

$ quast --help

Run QUAST:

$ quast -o quast -t 36 --labels raw,polished wtdbg2.ctg.fa wtdbg2.cns.fa

QUAST produces many reports, summary tables and plots for exploration of assembly quality. The key start point is the “report.html” file where all of these are summarized.

$ cd quast

$ google-chrome report.html &

➔ Is there any difference in the stats between these assemblies (E.g. higher N50 and less number of contigs)?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly