Skip to content

Commit

Permalink
major release and bump version
Browse files Browse the repository at this point in the history
  • Loading branch information
Zilong-Li committed Sep 1, 2023
1 parent 4050a4d commit d0bf083
Show file tree
Hide file tree
Showing 2 changed files with 36 additions and 31 deletions.
65 changes: 35 additions & 30 deletions doc/paper.org
Original file line number Diff line number Diff line change
Expand Up @@ -83,21 +83,21 @@ details of vcfpp can be found at https://github.com/Zilong-Li/vcfpp.

** C++ API

In this example, we count the number of heterozygous sites for
each sample in all records. The following code first includes a
single vcfpp.h file (line 1), opens a compressed BCF file
constrained to 3 samples and "chr21" region (line 2), and creates
a variant record object associated with the header information in
the BCF (line 3). Then it defines several types of objects to
collect the results we want (line 4-6). Taking advantage of
generic templates in C++, the type of field like genotype can be
very flexible. Then it starts iterating over the BCF file and
processes each variant record in the loop (line 7). We ignore
variants of other types (INDEL, SV), or with missing genotypes, or
with QUAL value smaller than 9, while we can also retrieve more
information of the variant and do filtering (line 9-10). Finally,
we count the heterozygous sites for each diploid sample (line
11-13). The core is just 13 lines.
In this example, we count the number of heterozygous sites for each
sample in all records. The following code first includes a single
vcfpp.h file (line 1), opens a compressed BCF file constrained to 3
samples and region "chr21" (line 2), and creates a variant record
associated with the header information in the file (line 3). Then it
defines several types of objects to collect the results we want (line
4-6). Taking advantage of generic templates in C++, the type of field
like genotype can be very flexible regarding memory consumption. Then
it starts iterating over the BCF file and processes each variant
record in the loop (line 7). We ignore variants of other types (INDEL,
SV), or with missing genotypes, or with /QUAL/ value smaller than 9,
while we can also retrieve more information of the variant and do
filtering (line 9-10). Finally, we count the number of heterozygous
variant for each diploid sample (line 11-13). The core is just 13
lines.

#+caption: Counting the number of heterozygous genotypes for 3 samples on chr21
#+attr_latex: :options captionpos=t
Expand Down Expand Up @@ -180,16 +180,16 @@ perform analysis later, I also write a function to load whole VCF
content in R for such two-step comparison although it is not
preferable for this task. Notably, the "test-vcfpp-2.R" script is
only $1.9\times$ slower compared to $70\times$, $85\times$ slower
for "test-vcfR.R" and "test-fread.R". This is because genotypes made
by both vcfR and data.table::fread are characters, which are very
slow to parse in R despite a faster string library was
used. However, using vcfpp can get integer vectors directly in R for
computation. Moreover, regarding the best practice of data analysis
in R, we usually want to inspect part of the data table first to
make further decisions. And vcfpp has the full functionalities of
htslib that is supporting reading BCF, selecting samples and
regions. A quick lookup of VCF content can be achieved by passing a
region parameter.
for "test-vcfR.R" and "test-fread.R" respectively. This is because
genotypes made by both vcfR and data.table::fread are characters,
which are very slow to parse in R despite a faster string library
was used. In contrast, with vcfpp, integer vectors of genotypes can
be returned to R directly for computation. Moreover, regarding the
best practice of data analysis in R, we usually want to inspect part
of the data table first to make further decisions. And vcfpp has the
full functionalities of htslib that is supporting reading BCF,
selecting samples and regions. A rapid query of VCF content can be
achieved by passing a region parameter.

#+caption: Performance of counting heterozygous genotypes per sample in the 1000 Genome Project for chromosome 21. (^) used by /sourceCpp/ function. (*) used by loading data in two-step strategy.
#+name: tb:counthets
Expand All @@ -214,25 +214,30 @@ writing daily used scripts. Many existing packages written in C++
using customized VCF parser can be replaced with vcfpp to offer more
functionalities. For instance, imputation software STITCH
[cite:@davies2016] and QUILT [cite:@davies2021] are using vcfpp to
parse reference panel in VCF/BCF format. Also, we provide a modified
Syllable-PBWT and some useful command line tools in
https://github.com/Zilong-Li/vcfpp/tree/main/tools.
parse large reference panel in VCF/BCF format. Also, there are some
useful command line tools available, such as a modified
Syllable-PBWT with compressed VCF/BCF as input.

* Software and Code

The latest release of vcfpp.h can be downloaded from
https://github.com/Zilong-Li/vcfpp/releases. Scripts for
Benchmarking can be found here
https://github.com/Zilong-Li/vcfpp/tree/main/scripts. More helpful
example functions can be found here
https://github.com/Zilong-Li/vcfpp/tree/main/tools.
example functions can be found at
https://github.com/Zilong-Li/vcfpp.

* Acknowledgments

I would like to thank Anders Albrechtsen at Copenhagen University and
Robert W Davies at Oxford University for helpful comments. Also, this
project is inspired by open source projects SeqLib and cyvcf2.

* Funding

This work is supported by the Novo Nordisk 462 Foundation
(NNF20OC0061343).

#+print_bibliography:

* Local setup :noexport:
Expand Down
2 changes: 1 addition & 1 deletion vcfpp.h
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@
* @file https://github.com/Zilong-Li/vcfpp/vcfpp.h
* @author Zilong Li
* @email [email protected]
* @version v0.1.8
* @version v0.2.0
* @breif a single C++ file for manipulating VCF
* Copyright (C) 2022-2023.The use of this code is governed by the LICENSE file.
******************************************************************************/
Expand Down

0 comments on commit d0bf083

Please sign in to comment.