Skip to content

Make diploid genome for F1 hybrid mouse. Modified for AlleleDB pipeline

License

Notifications You must be signed in to change notification settings

shaopei/vcf2diploid

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

README file for vcf2diploid tool distribution v_0.2.6a (19th Sept 2014)

This version of vcf2diploid integrates vcf2diploid_v_0.2.6 with generation of read depth for AlleleSeq filtering of potential SNPs residing in CNV locations.
There is also an additional option on vcf2diploid for an output folder.

The read-depth-file generator calculates a normalized read depth for a SNP in a 2000-bp window (+- 1000bp) around the SNP against an average depth computed 
from the entire genome. The output generated can be directly used for the AlleleSeq pipeline.

CITATION:
Rozowsky J, Abyzov A, Wang J, Alves P, Raha D, Harmanci A, Leng J, Bjornson R, Kong Y, Kitabayashi N, Bhardwaj N, Rubin M, Snyder M, Gerstein M.
AlleleSeq: analysis of allele-specific expression and binding in a network framework.
Mol Syst Biol. 2011 Aug 2;7:522. doi: 10.1038/msb.2011.54.

############################################################
#### makePersonalGenome.mk
############################################################

After installation/compilation of vcf2diploid, modify the file makePersonalGenome.mk to run the following
1) vcf2diploid
2) vcf2snp
3) read-depth-file generator

USAGE: make -f makePersonalGenome DATA_DIR=/path/to/your/VCFdir OUTPUT_SAMPLE_NAME=NA12878 FILE_NAME_BAM=filename.in.DATA_DIR.bam FILE_NAME_VCF=filename.in.DATA_DIR.vcf
--currently DATA_DIR is set as the directory where the BAM and VCF files are kept and where the output directory is going to be.
--options can be modified in the file

Acknowledgement: The modifications in this version are created by R. Kitchen of the Gerstein Lab@Yale.

#############################################################
#### The following section describes vcf2diploid_v_0.2.6
#############################################################

1. Compilation
==============

$ make


2. Running
==========

java -jar vcf2diploid.jar -id sample_id -chr file.fa ... [-vcf file.vcf ...]

where sample_id is the ID of individual whose genome is being constructed
(e.g., NA12878), file.fa is FASTA file(s) with reference sequence(s), and
file.vcf is VCF4.0 file(s) with variants. One can specify multiple FASTA and
VCF files at a time. Splitting the whole genome in multiple files (e.g., with
one FASTA file per chromosome) reduces memory usage.
Amount of memory used by Java can be increased as follows

java -Xmx4000m -jar vcf2diploid.jar -id sample_id -chr file.fa ... [-vcf file.vcf ...]

You can try the program by running 'test_run.sh' script in the 'example'
directory. See also "Important notes" below.


3. Constructing personal annotation and splice-junction library
===============================================================

* Using chain file one can lift over annotation of reference genome to personal
haplotpes. This can be done with the liftOver tools
(see http://hgdownload.cse.ucsc.edu/admin/exe).

For example, to lift over Gencode annotation once can do

$ liftOver -gff ref_annotation.gtf mat.chain mat_annotation.gtf not_lifted.txt


* To construct personal splice-junction library(s) for RNAseq analysis one can
use RSEQtools (http://archive.gersteinlab.org/proj/rnaseq/rseqtools).


Important notes
===============

All characters between '>' and first white space in FASTA header are used
internally as chromosome/sequence names. For instance, for the header

>chr1 human

vcf2diploid will upload the corresponding sequence into the memory under the
name 'chr1'.
Chromosome/sequence names should be consistent between FASTA and VCF files but
omission of 'chr' at the beginning is allows, i.e. 'chr1' and '1' are treated as
the same name.

The output contains (file formats are described below):
1) FASTA files with sequences for each haplotype.
2) CHAIN files relating paternal/maternal haplotype to the reference genome.
3) MAP files with base correspondence between paternal-maternal-reference
sequences.

File formats:
* FASTA -- see http://www.ncbi.nlm.nih.gov/blast/fasta.shtml
* CHAIN -- http://genome.ucsc.edu/goldenPath/help/chain.html
* MAP file represents block with equivalent bases in all three haplotypes
(paternal, maternal and reference) by one record with indices of the first
bases in each haplotype. Non-equivalent bases are represented as separate
records with '0' for haplotypes having non-equivalent base (see
clarification below).

Pat Mat Ref           MAP format
X   X   X    ____
X   X   X        \
X   X   X         --> P1 M1 R1
X   X   -    -------> P4 M4  0
X   X   -       ,--->  0 M6 R4
-   X   X    --'  ,-> P6 M7 R5
X   X   X    ----' 
X   X   X     
X   X   X     




For question and comments contact: Alexej Abyzov ([email protected]) and
Mark Gerstein ([email protected])

About

Make diploid genome for F1 hybrid mouse. Modified for AlleleDB pipeline

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •