Skip to content

🌿 Efficiently identifying shared genetic segments in large-scale data

License

Notifications You must be signed in to change notification settings

gusevlab/germline2

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🌿 GERMLINE2

Efficiently identifying shared genetic segments in large-scale data.

Reference: Saada et al. Identity-by-descent detection across 487,409 British samples reveals fine scale population structure and ultra-rare variant associations, 2020, Nature Communications

Usage

The boost/1.57.0 library is required.

make
g2 [options] <haps file> <sample file> <genetic map file> <output file>

Required inputs:

  • haps file: SHAPEIT/IMPUTE format input phased haplotypes, alleles can be any string as long as haplotype entries are 0/1
  • sample file: SHAPEIT/IMPUTE format input sample identifiers, only the second ID column is currently used for output
  • genetic map file: Each row has three fields: [physical position] [cm/Mb] [cM], and the 2nd field is ignored
  • output file: Pointer to where the outputs will go, will generate an $OUT.match file

Optional switches:

switch description
-b Binary output for large files, see parse_bmatch [default off]
-d Dynamic hash seed cutoff (for big N) [default = 0/off]
-f Minimum minor allele frequency [default = 0.0]
-g Allowed gaps between seeds [default = 1]
-h Haploid mode, do not allow switches between haplotypes [default off]
-m Minimum match length [default = 1.0]
-s Skip words with (seeds/samples) less than than this value (for big N) [default = 0.0]

Output

Output goes into a $OUT.match file with each row containing the following entries:

ID1 ID2 P0 P1 cM # words # gaps

If haploid mode is on (-h) then ".0" or ".1" is appended to the IDs to indicate a match along the first or second haplotype.

For large data, you can enable binary outputs by adding the -b switch, which will generate three files ($OUT.bmatch/bmid/bsid) that can be parsed using the provided parse_bmatch program (~3x reduction in file size).

Example

make test runs sample data in the example/ directory using the following command:

./g2 -m 0.9 \
example/SIM.NE_20000.MATCH_FREQ.SHAPEIT.haps \
example/SIM.NE_20000.MATCH_FREQ.SHAPEIT.sample \
example/genMap.1KG.b37.chr1.map \
example/SIM.NE_20000.MATCH_FREQ.INFERRED.match

The output segments are then evaluated for accuracy using the example/accuracy.sh script.

This data was simulated using the ARGON software as shown in example/sim.sh, down-sampled to a HapMap3 allele frequency distribution, and phased with SHAPEIT2.

About

🌿 Efficiently identifying shared genetic segments in large-scale data

Resources

License

Stars

Watchers

Forks