-
Notifications
You must be signed in to change notification settings - Fork 3
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
9 changed files
with
109 additions
and
46 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,7 @@ | ||
#+title: vcfpp: a C++ API for scripting rapid variant analysis | ||
#+setupfile: latex.org | ||
#+language: en | ||
#+startup: show2levels indent hidestars hideblocks | ||
#+options: title:nil toc:nil H:4 author:nil | ||
#+bibliography: "ref.bib" | ||
|
||
|
@@ -12,8 +13,8 @@ Email: [email protected].}} | |
Given the complexity and popularity of the VCF/BCF format as well as | ||
ever-growing data size, there is always a need for fast and flexible | ||
methods to manipulate it in different programming languages. Static | ||
languages like C++ has the superiority of performance as C, and its | ||
modern standards strive to provide enlarged standard library for | ||
languages like C++ has the superiority of performance as C, and | ||
modern C++ standards strive to provide enlarged standard library for | ||
developing program quickly and easily. In this work, we present | ||
vcfpp, a C++ API of htslib, providing a rapid and intuitive | ||
interface to manipulate VCF/BCF files safely and as easily as | ||
|
@@ -30,22 +31,35 @@ Email: [email protected].}} | |
|
||
* Introduction | ||
|
||
The VCF format [cite:@danecek2011] is the standard for representing | ||
genetic variation observed in DNA sequencing studies. The strength | ||
of the VCF format is its ability to represent the location of a | ||
variant, the genotype of the sequenced individual at each locus, as | ||
well as extensive variant metadata. However, the consequence of this | ||
flexibility and the rather complicated specification of the VCF | ||
format, is that researchers require powerful software libraries to | ||
access, query and manipulate variants from VCF files. Many efforts from computaional biologists | ||
|
||
\newpage | ||
Many efforts have been contributed by computational biologists, aim | ||
for making analyses of genetic variants no pain. The genetic | ||
variants are typically stored in the VCF format, which becomes the | ||
standard for representing genetic variation information with | ||
complicated specification [cite:@danecek2011]. Later on, as the | ||
ever-growing data size, BCF format is designed to query and store | ||
big data efficiently. The C API of htslib provide the full | ||
functionalities to manipulate VCF and BCF format for both compressed | ||
and uncompressed files. As the C API is hard to use for | ||
in-proficient programmers, API of other languages are | ||
invented. Existing popular libraries include vcfR[cite:@brian2017] | ||
for R, cyvcf2 for python [cite:@pedersen2017a], hts-nim | ||
[cite:@pedersen2018] for nim and vcflib for C++ | ||
[cite:@garrison2022]. All are useful for corresponding language | ||
community and have advantages and disadvantages. In particular, | ||
vcflib is both an API and a large collection of command line tools | ||
to manipulate VCF file, with the main disadvantage being not | ||
supporting BCF file. Many methods designed for large sample size can | ||
not even take the BCF file that is tailed for big data as input like | ||
Syllbable-PBWT [cite:@wang2023]. The motivation of vcfpp is to offer | ||
full functionalities as htslib and provide easy-to-use API in a | ||
single header file that can be easily integrated for writing scripts | ||
quickly in C++, python and R. | ||
|
||
* Features | ||
|
||
vcfpp is implemented as a single header file for being easily | ||
intergrated and compiled. There are four core classes for | ||
manipulating VCF/BCF as showed in [[tb:class]]. | ||
manipulating VCF/BCF as summarized in Table [[tb:class]]. | ||
|
||
#+caption: vcfpp capabilities and implemented C++ class | ||
#+name: tb:class | ||
|
@@ -62,23 +76,39 @@ manipulating VCF/BCF as showed in [[tb:class]]. | |
|
||
In an effort to demonstrate the power and performance of vcfpp, the | ||
following sections highlight typical VCF analyses and illustrate | ||
commonly used features in cyvcf2. Other examples and further details | ||
commonly used features in vcfpp. Other examples and further details | ||
of the vcfpp can be found at https://github.com/Zilong-Li/vcfpp. | ||
|
||
** Python-like API | ||
|
||
In this example, we count the number of heterozygous sites for | ||
each sample in all records. The core is *just 10 lines*. | ||
each sample in all records. The following code first includes a | ||
single vcfpp.h file (line 1), opens a compressed BCF file with subset | ||
samples and a constrained region (line 2), and creates a | ||
variant record object associated with the header information in | ||
the BCF (line 3). Then it defines several types of objects to | ||
collect the results we want (line 4-6). Taking advantage of | ||
generic templates in C++, type of genotype can be very | ||
flexible. Then it starts iterating over the BCF file and process | ||
each variant record in the loop (line 7). We ignore variant of | ||
other types (INDEL, SV), with missing genotypes and with QUAL | ||
value smaller than 9, while we can also retrieve more information | ||
of the variant and do filtering (line 9-10). Finally, we count the | ||
heterozygous sites for each diploid sample (line 11-13). The core | ||
is just 13 lines. | ||
|
||
#+caption: Counting heterozygous genotypes per sample on chr21 | ||
#+attr_latex: :options captionpos=t | ||
#+begin_src C++ -n | ||
vcfpp::BcfReader vcf("bcf.gz", "id1,id2,id3", "chr21"); | ||
vcfpp::BcfRecord var(vcf.header); | ||
#include <vcfpp.h> | ||
vcfpp::BcfReader vcf("bcf.gz", "chr21", "id1,id2,id3"); | ||
vcfpp::BcfRecord var(vcf.header); // create a variant object | ||
vector<char> gt; // genotype can be of bool, char or int type | ||
vector<int> hetsum(vcf.nsamples, 0); // store the het counts | ||
float hwe, exchet; // capture more INFO | ||
while(vcf.getNextVariant(var)){ | ||
var.getGenotypes(gt); | ||
var.getINFO("HWE",hwe), var.getINFO("ExcHet",exchet);//more | ||
if(!var.isSNP()||!var.isNoneMissing()||var.QUAL()<9) continue; | ||
assert(var.ploidy()==2); // make sure it is diploidy | ||
for(int i=0; i<gt.size()/2; i++) | ||
|
@@ -90,7 +120,17 @@ while(vcf.getNextVariant(var)){ | |
|
||
** Working with R | ||
|
||
Stats like working with R. Rcpp can interact R and C++ seamlessly. | ||
While vcfpp is very simple for writing a C++ program, a single C++ | ||
header file can be easily ported into popular script languages | ||
like Python and R. The process of finding biological insights from | ||
variants information always involves Statistical model, and | ||
Statisticians prefer analyzing and visualizing data in R. In this | ||
example, we demonstrate how vcfpp can work with R seamlessly by | ||
using Rcpp [cite:@eddelbuettel2011]. In the "vcfpp-r.cpp" file, we | ||
write a simple C++ function as the above with only changing the | ||
output objects to Rcpp::List. Then we can easily call the | ||
/genotypes/ function and get a list of genotypes to proceed further | ||
statistical tests in R. | ||
|
||
#+caption: vcfpp-r.cpp | ||
#+attr_latex: :options captionpos=t | ||
|
@@ -119,24 +159,40 @@ List genotypes(string vcffile) { | |
#+begin_src R | ||
library(Rcpp) | ||
sourceCpp("vcfpp-r.cpp") | ||
gts <- genotypes("vcf.gz") | ||
gts <- genotypes("bcf.gz") | ||
## perform statistical modeling with gts | ||
#+end_src | ||
|
||
* Benchmarking | ||
|
||
vcfR [cite:@brian2017] is an R package that provides function to parse vcf into data tables in R. | ||
In addition to simplicity and flexibility, we show how fast and | ||
efficient scripts using vcfpp can be. In the benchmarking, we | ||
performed the same task as in cyvcf2 paper. As shown in Table [[tb:counthets]], the python script | ||
using cyvcf2 is $1.8\times$ slower than the R script using vcfpp, | ||
and both using little RAM. | ||
|
||
#+caption: Performance of counting heterozygous genotypes per sample in the 1000 Genome Project for chromosome 21. (*) used by /sourceCpp/ function. | ||
#+name: tb:counthets | ||
#+attr_latex: :align lllll :placement [H] | ||
|------------------+----------+-------+-------------| | ||
| API | Time (s) | Ratio | RAM (Gb) | | ||
|------------------+----------+-------+-------------| | ||
| vcfpp::BcfReader | 115+3* | 1.0 | 0.071+0.28* | | ||
| cyvcf2.VCF | 209 | 1.8 | 0.035 | | ||
| cyvcf2::VCF | 209 | 1.8 | 0.035 | | ||
| vcfR::read.vcfR | 7207 | 61.1 | 97 | | ||
|------------------+----------+-------+-------------| | ||
|
||
As vcfR first loads the whole VCF contents into R objects and | ||
performs analyses afterwards , it is not preferable to use vcfR for | ||
this task. In the second task, we compare the performance of | ||
converting all contents in the VCF file into R tables. As shown in | ||
Table [[tb:loadtables]], vcfpp is as fast as the well known | ||
data.table::fread while using less RAM. Besides, vcfpp has the full | ||
functionalities of htslib that is supporting reading BCF, subsetting | ||
samples and regions. | ||
|
||
#+caption: Performance of converting the gzipped VCF of 1000 Genome Project into data tables of R for chromosome 21. | ||
#+name: tb:loadtables | ||
#+attr_latex: :align lllll :placement [H] | ||
|-------------------+----------+-------+----------| | ||
| API | Time (s) | Ratio | RAM (Gb) | | ||
|
@@ -146,14 +202,20 @@ vcfR [cite:@brian2017] is an R package that provides function to parse vcf into | |
| data.table::fread | 155 | 0.7 | 38 | | ||
|-------------------+----------+-------+----------| | ||
|
||
|
||
* Discussion | ||
|
||
We have developed vcfpp, a fast and flexible C++ API for scripting | ||
high-perfomance genetic variant analyses. Its easy-to-use can be | ||
very useful for both package developers and writing of daily used | ||
scripts. | ||
|
||
* Software and Code | ||
|
||
The latest release of vcfpp.h can be downloaded from | ||
https://github.com/Zilong-Li/vcfpp/releases. Scripts for | ||
Benchmarking can be found here | ||
https://github.com/Zilong-Li/vcfpp/tree/main/scripts. | ||
|
||
#+print_bibliography: | ||
|
||
* Local setup :noexport: | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,7 +2,7 @@ | |
* @file https://github.com/Zilong-Li/vcfpp/vcfpp.h | ||
* @author Zilong Li | ||
* @email [email protected] | ||
* @version v0.1.5 | ||
* @version v0.1.6 | ||
* @breif a single C++ file for manipulating VCF | ||
* Copyright (C) 2022-2023.The use of this code is governed by the LICENSE file. | ||
******************************************************************************/ | ||
|
@@ -1091,31 +1091,26 @@ class BcfReader | |
/** | ||
* @brief construct a vcf/bcf reader with subset samples | ||
* @param file the input vcf/bcf with suffix vcf(.gz)/bcf(.gz) or stdin "-" | ||
* @param samples LIST samples to include or exclude as a comma-separated string. \n | ||
* LIST : select samples in list \n | ||
* ^LIST : exclude samples from list \n | ||
* "-" : include all samples \n | ||
* "" : exclude all samples | ||
* @param region samtools-like region "chr:start-end", skip if empty | ||
*/ | ||
BcfReader(const std::string & file, const std::string & samples) : fname(file) | ||
BcfReader(const std::string & file, const std::string & region) : fname(file) | ||
{ | ||
open(file); | ||
header.setSamples(samples); | ||
nsamples = bcf_hdr_nsamples(header.hdr); | ||
if(!region.empty()) setRegion(region); | ||
SamplesName = header.getSamples(); | ||
} | ||
|
||
/** | ||
* @brief construct a vcf/bcf reader with subset samples in target region | ||
* @param file the input vcf/bcf with suffix vcf(.gz) or bcf(.gz) | ||
* @param region samtools-like region "chr:start-end", skip if empty | ||
* @param samples LIST samples to include or exclude as a comma-separated string. \n | ||
* LIST : select samples in list \n | ||
* ^LIST : exclude samples from list \n | ||
* "-" : include all samples \n | ||
* "" : exclude all samples | ||
* @param region samtools-like region "chr:start-end", skip if empty | ||
*/ | ||
BcfReader(const std::string & file, const std::string & samples, const std::string & region) : fname(file) | ||
BcfReader(const std::string & file, const std::string & region, const std::string & samples) : fname(file) | ||
{ | ||
open(file); | ||
header.setSamples(samples); | ||
|
@@ -1168,8 +1163,8 @@ class BcfReader | |
return hts_set_threads(fp, n); | ||
} | ||
|
||
/** @brief get records of current contig to use */ | ||
uint64_t get_region_records(const std::string & region) | ||
/** @brief get the number of records of given region from index file */ | ||
uint64_t getRegionIndex(const std::string & region) | ||
{ | ||
setRegion(region); // only one chromosome | ||
int tid = 0, ret = 0, nseq = 0; | ||
|