This is a demonstration project for CSE 284. It implements a smaller, simpler version of the ADMIXTURE algorithm. See the ADMIXTURE page for more details. Refer to the final-project-files directory for presentation materials and a summary document.
I plan to implement several of the improvements listed in Alexander 2011.
Installation requires NumPy and argparse.
Navigate to the directory in which you would like to download wf-admixture and install:
git clone https://github.com/WillardFord/wf-admixture.git
cd wf-admixture
python setup.py install
Note: To install locally (i.e. without root access), use the --user flag when calling setup.py. (You might then have to make sure the local installation directory is on your PATH.)
If the install was successful, typing wf-admixture -h should show a useful message.
The basic usage of wf-admixture requires .bed, .bim, and .fam files in the same directory.
Note: This is not a .bed file from the UCSC Genome Browser. Refer to the PLINK file documentation for information about input file types.
To run wf-admixture on a small test example (using files in this repo):
wf-admixture -bed toy_files/toy.bed -k 3 -o toy_files/toy
Output files will then be stored as NumPy text arrays at the following locations:
toy_files/toy.Q
toy_files/toy.F
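As a quick check on these files, the following is a minimal sketch of loading them back into NumPy and running a few sanity checks. The helper names `load_admixture_outputs` and `check_outputs` are hypothetical and not part of wf-admixture, and the sketch assumes nothing about the orientation of F beyond one of its dimensions equalling K.

```python
import numpy as np

def load_admixture_outputs(prefix):
    """Load the whitespace-separated Q and F tables stored at <prefix>.Q and <prefix>.F.

    Hypothetical helper for illustration; not part of wf-admixture.
    """
    # ndmin=2 keeps the result two-dimensional even if a file has a single row.
    Q = np.loadtxt(f"{prefix}.Q", ndmin=2)
    F = np.loadtxt(f"{prefix}.F", ndmin=2)
    return Q, F

def check_outputs(Q, F, atol=1e-6):
    """Return True if Q and F look like valid ancestry and frequency estimates."""
    # Each individual's ancestry fractions should sum to 1.
    rows_sum_to_one = np.allclose(Q.sum(axis=1), 1.0, atol=atol)
    # Allele frequencies are probabilities.
    freqs_in_range = bool(np.all((F >= 0.0) & (F <= 1.0)))
    # Assumption: F may be stored as K x J or J x K, so only require that
    # one of its dimensions matches the number of populations in Q.
    k_matches = Q.shape[1] in F.shape
    return bool(rows_sum_to_one and freqs_in_range and k_matches)
```

For the toy example above, this would be called as `Q, F = load_admixture_outputs("toy_files/toy")` followed by `check_outputs(Q, F)`.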
To compare the output to ADMIXTURE, run:
admixture toy_files/toy.bed 3
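When comparing the two sets of ancestry estimates, keep in mind that population labels are arbitrary, so the columns of one Q matrix may appear in a different order than in the other. The sketch below is one possible way to line them up before comparing: it tries every column ordering and keeps the one with the smallest mean absolute difference. The function name `align_columns` is hypothetical, and the brute-force search is only practical for small K.

```python
from itertools import permutations

import numpy as np

def align_columns(Q_ref, Q_other):
    """Find the column order of Q_other that best matches Q_ref.

    Hypothetical helper for illustration. Tries all K! orderings, so it is
    only suitable for small K.
    """
    K = Q_ref.shape[1]
    best_perm, best_err = None, np.inf
    for perm in permutations(range(K)):
        # Mean absolute difference after reordering the columns of Q_other.
        err = float(np.mean(np.abs(Q_ref - Q_other[:, list(perm)])))
        if err < best_err:
            best_perm, best_err = list(perm), err
    return best_perm, best_err
```

Assuming ADMIXTURE wrote its estimates to a file such as toy.3.Q in the working directory, a comparison might look like `perm, err = align_columns(np.loadtxt("toy_files/toy.Q"), np.loadtxt("toy.3.Q"))`, where a small `err` suggests the two tools agree.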
There are several optional inputs to wf-admixture:
- -q [int], --threads: Number of threads to use. By default, multithreading is not used. Only recommended for large inputs.
- -t [float], --threshold: Threshold on the change between successive log likelihoods that indicates completion. Smaller values will be more accurate but take significantly more time. Default is 1.
- -v, --verbose: Output log likelihood information for each iteration.
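For readers who want to see how these flags fit together, here is a rough sketch of an argparse parser that accepts the options described above. It is not the tool's actual source: the function name `build_parser`, the help strings, the choice of which arguments are required, and the default of 1 thread are all assumptions made for illustration.

```python
import argparse

def build_parser():
    """Sketch of a parser for the documented flags (not the real implementation)."""
    parser = argparse.ArgumentParser(prog="wf-admixture")
    # Arguments used in the toy example; marking them required is an assumption.
    parser.add_argument("-bed", required=True, help="path to the PLINK .bed file")
    parser.add_argument("-k", type=int, required=True, help="number of populations")
    parser.add_argument("-o", help="output prefix for the .Q and .F files")
    # Optional inputs; a default of 1 thread stands in for "no multithreading".
    parser.add_argument("-q", "--threads", type=int, default=1,
                        help="number of threads to use")
    parser.add_argument("-t", "--threshold", type=float, default=1.0,
                        help="log likelihood change that indicates completion")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print log likelihood information each iteration")
    return parser
```

With this sketch, `build_parser().parse_args(["-bed", "toy_files/toy.bed", "-k", "3", "-o", "toy_files/toy"])` yields a namespace with `threshold` equal to 1.0 and `verbose` set to False.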
Benchmark wf-admixture against ADMIXTURE using 1000 Genomes data
Available in analysis/benchmark.ipynb and analysis/benchmark/visualization.ipynb
The output file formats are the same as ADMIXTURE's: whitespace-separated tables. Any downstream analysis should be interchangeable.
UNDER CONSTRUCTION: unit tests not yet implemented
To run tests:
# Run command line tests
sh tests/cmdline_tests.sh
# Run unit tests
python -m pytest --cov=
The wf-admixture tool implements an iterative optimization algorithm based on ADMIXTURE. We are trying to estimate the contribution of each population, k, to each individual, i. Let this be represented as an I x K matrix Q. In doing so, we must also estimate the minor allele frequency at each SNP, j, in each population. Let this be represented as a K x J matrix F.
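To make the shapes concrete, here is a small illustration with made-up random values (not real data). It builds a Q and an F of the sizes described above and combines them into an I x J matrix, named `P` here for convenience, whose entry (i, j) is the minor allele frequency expected for individual i at SNP j given that individual's ancestry mix.

```python
import numpy as np

rng = np.random.default_rng(0)
I, K, J = 4, 3, 5  # individuals, populations, SNPs (arbitrary toy sizes)

# Q: I x K ancestry fractions; each row is non-negative and sums to 1.
Q = rng.dirichlet(np.ones(K), size=I)

# F: K x J minor allele frequencies, one row per population.
F = rng.uniform(0.05, 0.95, size=(K, J))

# P[i, j] = sum over k of Q[i, k] * F[k, j]: the ancestry-weighted
# minor allele frequency for individual i at SNP j.
P = Q @ F

print(Q.shape, F.shape, P.shape)
```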
- Prune input SNPs in LD with each other using centimorgan distance.
- Assume individuals are formed by the independent union of random gametes. This gives Hardy-Weinberg genotype proportions at each SNP for each individual based on Q and F.
- Calculate the log likelihood from the Hardy-Weinberg equilibrium.
- Use iterative optimization techniques to calculate the MLE from the log likelihood, primarily the Expectation-Maximization (EM) algorithm.
Given additional time I'd also like to implement the following improvements:
- Use the block relaxation algorithm described in ADMIXTURE instead of the EM algorithm. Updating only Q or only F on a given step allows a Taylor approximation under a convexity assumption, which should make this much faster.
- Add acceleration to the optimization method to speed up convergence to the optimum.
- Report some measure of confidence or error for these results.
This repository was generated by Willard Ford, with inspiration from the CSE 185 Example Repository and the work of my fellow students.
Group 27 for the purposes of CSE 284.
Please submit a pull request with any corrections or suggestions.