A command-line program for detection of copy number variants using targeted sequencing data. GeneCNV is designed for copy number analysis across a subset of genes using parameters derived from a predefined set of reference (normal) samples.
This repository is divided into the following main sections:
- The cnv folder, which contains the library code that both trains and runs the MCMC sampling algorithm to determine whether samples contain duplications or deletions.
- The test_data folder, which contains sample data used in publication validation experiments.
Read the full documentation at http://genecnv.readthedocs.io.
The recommended way to download and install geneCNV is:
git clone https://github.com/GenePeeks/geneCNV.git
cd geneCNV
pip install -r requirements.txt
python setup.py install
If you don't already have Python 2.7, you can install it for your OS here.
The package also requires several Python dependencies, listed in requirements.txt, which you can install via pip.
Note that some of these packages have their own non-Python dependencies, including several in C.
Verify that the installation succeeded by running the unit tests as follows:
./runtests.sh
The tests may take a few minutes to complete successfully.
GeneCNV involves three main sub-commands: create-matrix, train-model, and evaluate-sample, corresponding to the following main steps in the computational pipeline.
To get started, generate coverage counts across relevant targets and samples using the create-matrix command. You must provide a BED file of relevant targets in this format:
X 32867834 32867947 Ex3 DMD
X 33038245 33038327 Ex2 DMD
X 33229388 33229673 Ex1 DMD
An example BED file for the DMD gene is provided in test_data. Note that the first four fields (chromosome, start position, end position, label) are required, while the fifth is optional.
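Before running create-matrix, it can help to check that a BED file follows the layout described above. The following is a minimal sketch, not part of geneCNV: the Target record and the parse_bed_line and parse_bed helpers are hypothetical names, the fifth field is assumed to be a gene name (as in the DMD example), and fields are assumed to be separated by any whitespace.

```python
from collections import namedtuple

# Hypothetical record type for illustration; not part of the geneCNV API.
Target = namedtuple("Target", ["chrom", "start", "end", "label", "gene"])


def parse_bed_line(line):
    """Parse one whitespace-delimited BED line into a Target.

    The first four fields are required; the fifth is optional.
    """
    fields = line.split()
    if len(fields) < 4:
        raise ValueError(
            "expected at least 4 fields, got %d: %r" % (len(fields), line)
        )
    chrom, start, end, label = fields[:4]
    gene = fields[4] if len(fields) > 4 else None
    start, end = int(start), int(end)
    if start >= end:
        raise ValueError("start must be less than end: %r" % line)
    return Target(chrom, start, end, label, gene)


def parse_bed(lines):
    """Parse an iterable of BED lines, skipping blank lines."""
    return [parse_bed_line(line) for line in lines if line.strip()]


example = [
    "X 32867834 32867947 Ex3 DMD",
    "X 33038245 33038327 Ex2 DMD",
    "X 33229388 33229673 Ex1",  # fifth field omitted on purpose
]
targets = parse_bed(example)
```

A line with fewer than four fields, or with a start position that is not below its end position, raises a ValueError, so malformed targets surface before any coverage is counted.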
You must also provide a text file of paths to the sample BAM files in this format:
/path/to/file1.bam
/path/to/file2.bam
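If all of your BAM files live in one directory, the path list can be generated rather than typed by hand. This is a small sketch under that assumption; write_fofn is a hypothetical helper, not a geneCNV function, and writing absolute paths is a choice made here to match the example above.

```python
import glob
import os


def write_fofn(bam_dir, fofn_path):
    """Write the absolute path of every .bam file in bam_dir, one per line.

    Returns the sorted list of matching paths as found by glob.
    """
    bam_paths = sorted(glob.glob(os.path.join(bam_dir, "*.bam")))
    with open(fofn_path, "w") as handle:
        for path in bam_paths:
            handle.write(os.path.abspath(path) + "\n")
    return bam_paths
```

For example, write_fofn("bams", "training_samples.fofn") would produce a file suitable for the create-matrix command, assuming a directory named bams exists.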
An example create-matrix command looks like:
genecnv create-matrix test_data/example_dmd_baseline.bed training_samples.fofn \
training_sample_coverage.csv --targetArgfile dmd_baseline_targets.pickle
This command can optionally produce a serialized target/argument file, and you only need to produce one once for a specific set of targets. An example output CSV for this command is provided in test_data. This can be used to run the subsequent train-model command.
To estimate the model hyperparameters using all of the samples included in the coverage count matrix, run the following:
genecnv train-model dmd_baseline_targets.pickle test_data/training_sample_coverage.csv \
dmd_baseline_params.pickle --use_baseline_sum
Baseline autosomal targets are used to identify absolute copy number when no CNVs are present, and help provide more accurate results overall. Including baseline targets can also allow you to identify the sex of a sample when targets on the X chromosome are being tested. Baseline targets are not analyzed for copy number and are assumed to have a copy number of 2.
If you are using a large number of baseline targets (>20), it's recommended to use the optional --use_baseline_sum argument when calling train-model. This reduces the total number of baseline targets to one during training.
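To make the idea of reducing many baseline targets to one concrete, here is an illustrative sketch. It assumes, based only on the flag name, that the baseline coverage counts are summed into a single combined target; the actual geneCNV implementation may differ. The collapse_baseline function, the "BaselineSum" label, and the per-sample dictionary layout are all hypothetical.

```python
def collapse_baseline(counts, baseline_labels, summed_label="BaselineSum"):
    """Replace all baseline targets in one sample with a single summed target.

    counts maps target label -> coverage count for one sample.
    """
    baseline = set(baseline_labels)
    # Keep every non-baseline target unchanged.
    collapsed = {
        label: n for label, n in counts.items() if label not in baseline
    }
    # Add one combined target holding the total baseline coverage.
    collapsed[summed_label] = sum(
        n for label, n in counts.items() if label in baseline
    )
    return collapsed


# Made-up counts for illustration only.
sample = {"Ex1": 120, "Ex2": 95, "Base1": 200, "Base2": 180, "Base3": 220}
collapsed = collapse_baseline(sample, ["Base1", "Base2", "Base3"])
```

In this toy sample, the three baseline targets are replaced by one entry with a count of 600, while Ex1 and Ex2 keep their original counts.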
Once parameters have been estimated from an appropriate set of training samples, they can be used to perform copy number analysis for the relevant targets on a test sample with the evaluate-sample command. Here you can pass either a sample BAM file or a coverage matrix CSV (generated using the same targets).
To evaluate the first test sample in the file test_data/test_female_sample_coverage.csv, use the following command:
genecnv evaluate-sample test_data/test_female_sample_coverage.csv dmd_baseline_params.pickle \
normal_female_results
This command will produce three output files with the provided prefix: normal_female_results.txt, which provides the posterior probabilities and copy numbers for all relevant targets; normal_female_results_summary.txt, which provides a summary of any CNVs detected; and normal_female_results.pdf, which provides a visualization of the copy numbers and posterior probabilities across targets.
Depending on the number of total targets and MCMC iterations needed for convergence, the sample evaluation may take up to 10-12 minutes to complete. By default it takes advantage of multiple cores, but this can be turned off with the option --use_single_process.
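When evaluating many samples from a script, it can be convenient to assemble the evaluate-sample command and the names of its three output files programmatically. The sketch below uses only the arguments and file-name pattern shown above; build_evaluate_command and expected_outputs are hypothetical helper names, and placing --use_single_process at the end of the command is an assumption.

```python
def build_evaluate_command(sample_input, params_path, output_prefix,
                           single_process=False):
    """Build the argument list for a genecnv evaluate-sample call."""
    argv = ["genecnv", "evaluate-sample", sample_input, params_path,
            output_prefix]
    if single_process:
        # Turns off the default use of multiple cores.
        argv.append("--use_single_process")
    return argv


def expected_outputs(output_prefix):
    """Return the three output file names produced for a given prefix."""
    return [
        output_prefix + ".txt",          # posterior probabilities and copy numbers
        output_prefix + "_summary.txt",  # summary of any CNVs detected
        output_prefix + ".pdf",          # visualization across targets
    ]


cmd = build_evaluate_command(
    "test_data/test_female_sample_coverage.csv",
    "dmd_baseline_params.pickle",
    "normal_female_results",
    single_process=True,
)
outputs = expected_outputs("normal_female_results")
```

The resulting list can be passed to subprocess.check_call(cmd) from the standard library, and the names in outputs can then be checked with os.path.exists once the run finishes.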