Gene conservation informed contig alignment (GCICA)

Separation haplotigs from the genome assembly

Introduction

Some regions of genome are highly heterozygous, which contain two alleles at the same gene locus (we called it ‘primary contigs’ and associated ‘haplotigs’ according to FALCON_unzip (https://github.com/PacificBiosciences/FALCON_unzip/)). Assemblers may produce both primary contig and haplotig without labelling instead of a single haplotype-fused contig. Those ‘unidentified’ haplotigs may cause issues in the downstream analyses. Some tools are developed for separating the primary contigs and haplotigs from the assembly, such as Purge Haplotigs (https://bitbucket.org/mroachawri/purge_haplotigs), HaploMerger2 (https://github.com/mapleforest/HaploMerger2/releases/) or FALCON_unzip. Purge Haplotigs remapped short-reads or long-reads to obtain the coverage information for haplotigs separation, whereas HaploMerger2 separated haplotigs basing on all-against-all whole genome alignment.

Here, we developed a new and quick method separated haplotigs basing on sequence similarity. The idea is simple: we use gene nt sequences (can be from the same species or shorter evolutionary distance species) as the marker, mapped the marker to the assembly. If a contig (haplotig) is associated to another contig (primary contig), most of regions of them should be very similarity even the same. Therefore, we can find the same marker between them. One drawback of this method may not work for those contigs that do not contain any genes.

Requirments

Python 2.7 or higher
Matplotlib package (Python)
BLAST 2.2.30+
Mummer 4.0

Installation

No installation required, just download the pipeline from github.

git clone https://github.com/asdcid/Gene-conservation-informed-contig-alignment.git

Usage

This method requires a genome assembly (fasta format) and a whole set of gene nt sequences (fasta format) from the same species or shorter evolutionary distance species.

The path of BLAST and Mummer should be set into the environment path first.

Usage: GCICA -g assembly.fa -g genes.fa [options]

Required:
    -a      assembly file, fasta format
    -g      gene nt sequence file, fasta format

Options:
    -o      outputDir. Default is basename of assembly file + timestamp in the current directory
    -t      number of threads. Default is 10
    -i      minimun identity for blastn. Default is 80
    -m      the minimum number of aligned gene that each contig contains. Default is 3
    -l      the minimum fraction of aligned length in a gene. If the aligned length of a gene is lower than this fraction, this gene would not consider as a aligned gene. Deafult is 0.7
    -c      the percentage of minimum coverage between the possible primary and its associated haplotigs. Default is 80
    -f      fraction = the number of aligend gene in contig A which also aligned to contig B / total aligned gene in contig A. If the fracion > minGeneCover, these two contigs could be the possible primary contig and associated haplotig. Default is 0.8

OutputFiles

The final result is in the final_result directory.

separate_info file indicates the primary contigs and their associated haplotigs.

figures directory is the dot-plot between primary contig and its associated haplotig.

summary file is the datails of comparison between primary contigs and their associated haplotigs. 1-to-1: the unique mapping; m-to-m: the multiple mapping. Both 1-to-1, m-to-m and coverage information are from Mummer dnadiff result.

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
scripts		scripts
GCICA		GCICA
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Gene conservation informed contig alignment (GCICA)

Introduction

Requirments

Installation

Usage

OutputFiles

About

Releases

Packages

Languages

License

asdcid/Gene-conservation-informed-contig-alignment

Folders and files

Latest commit

History

Repository files navigation

Gene conservation informed contig alignment (GCICA)

Introduction

Requirments

Installation

Usage

OutputFiles

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages