DeepVariant-TrioTrain is an automated pipeline for extending DeepVariant (DV), a deep-learning-based germline variant caller. See the original DeepVariant GitHub page to learn more.
The existing DeepVariant models were trained only on human data. Previous work built species-specific DeepVariant models for mosquito genomes and the endangered Kākāpō parrot. We built TrioTrain (DV-TT) to create custom DeepVariant models for cattle, bison, and yak genomes. Our custom models incorporate allele frequency data from over 5,500 published bovine samples, making DV-TT the first tool to expand the existing Allele Frequency model into non-human, mammalian genomes. Our work illustrates the limitations of applying models built exclusively with human-genome datasets in other species. Our findings suggest that comparative genomics approaches in deep learning model development offer performance benefits over species-specific models.
DV-TT is a SLURM-based, automated pipeline that produces new DV model(s) for germline variant-calling in any diploid organism, focusing on species without NIST-GIAB reference materials.
Currently, TrioTrain supports initializing training using an existing DV model. An index of compatible models can be found here.
Specifically, TrioTrain builds upon the existing DV model for short-read (Illumina) Whole Genome Sequence (WGS) data and, optionally, adds population-level allele frequency data from published samples. During model development, DV-TrioTrain iteratively feeds the model labeled examples from parent-offspring duos. Intuitively, a model trained on both parents should better predict inherited variants in the offspring; therefore, two training rounds are performed for each trio, one per parent. After re-training, any model built with DV-TrioTrain can be used as an alternative checkpoint with DeepVariant's one-step, single-sample variant caller.
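The round structure described above can be sketched as a simple schedule. This is an illustration only, not TrioTrain's actual code: the `Trio` and `Round` classes, the `plan_rounds` function, the checkpoint names, and the father-then-mother order are all hypothetical, and the sketch assumes each round warm-starts from the checkpoint produced by the previous round.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Trio:
    """One family: sample IDs for the offspring and both parents."""
    offspring: str
    father: str
    mother: str


@dataclass(frozen=True)
class Round:
    """One training round on a single parent-offspring duo."""
    number: int       # 1-based round counter across all trios
    parent: str       # parent sample paired with the offspring this round
    offspring: str    # offspring sample in the duo
    warm_start: str   # checkpoint the round starts from
    output: str       # hypothetical label for the checkpoint it produces


def plan_rounds(trios: List[Trio], initial_checkpoint: str) -> List[Round]:
    """Plan two rounds per trio, chaining checkpoints from round to round."""
    rounds: List[Round] = []
    checkpoint = initial_checkpoint
    for trio in trios:
        # Parent order is an assumption made for this sketch.
        for parent in (trio.father, trio.mother):
            number = len(rounds) + 1
            output = f"round{number}_{parent}"
            rounds.append(
                Round(number, parent, trio.offspring, checkpoint, output)
            )
            checkpoint = output  # next round resumes from this checkpoint
    return rounds


if __name__ == "__main__":
    # Sample IDs are placeholders for two human GIAB trios.
    trios = [Trio("HG002", "HG003", "HG004"), Trio("HG005", "HG006", "HG007")]
    for r in plan_rounds(trios, "existing_dv_wgs_model"):
        print(f"round {r.number}: {r.parent} + {r.offspring}, "
              f"start from {r.warm_start}")
```

With N trios the schedule contains 2N rounds, and only the first round starts from the existing DV model.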
Assuming the necessary training data for your favorite species already exist, TrioTrain automates customizing the DeepVariant model. Additional details about the required data can be found here.
The unique re-training approach enables the model to incorporate inheritance expectations; however, models built by DV-TrioTrain do not require trio-binned data for variant calling.
Although the DV-TT pipeline assumes re-training data come from trio-binned samples, the resulting models are trained to prioritize features of inherited variants so that they produce fewer Mendelian Inheritance Errors (MIE) when calling individual samples. This contrasts with the DeepTrio joint caller, which calls variants across family members together.
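For readers unfamiliar with the metric, a Mendelian Inheritance Error at a single site can be illustrated with a minimal check. This sketch is not how TrioTrain computes MIE; it assumes autosomal, diploid, fully called genotypes written as pairs of allele indices (0 for the reference allele), and the function names are hypothetical.

```python
from typing import Iterable, Tuple

Genotype = Tuple[int, int]  # two allele indices, e.g. (0, 1) for a 0/1 call


def is_mendelian_error(child: Genotype, father: Genotype,
                       mother: Genotype) -> bool:
    """Return True if the child's genotype cannot be explained by
    inheriting one allele from each parent."""
    a, b = child
    consistent = (a in father and b in mother) or (b in father and a in mother)
    return not consistent


def mie_rate(sites: Iterable[Tuple[Genotype, Genotype, Genotype]]) -> float:
    """Fraction of (child, father, mother) sites that violate inheritance."""
    sites = list(sites)
    if not sites:
        return 0.0
    errors = sum(is_mendelian_error(c, f, m) for c, f, m in sites)
    return errors / len(sites)


if __name__ == "__main__":
    example_sites = [
        ((0, 1), (0, 1), (0, 0)),  # consistent: alternate allele from father
        ((0, 1), (0, 0), (0, 0)),  # error: neither parent carries allele 1
        ((1, 1), (0, 1), (0, 0)),  # error: mother cannot supply allele 1
    ]
    print(f"MIE rate: {mie_rate(example_sites):.2f}")
```

A real analysis would also need to handle missing genotypes, sex chromosomes, and multi-sample VCF parsing, none of which are covered here.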
Detailed user guides for installation, configuration, and a tutorial walk-through using the Human GIAB samples are available here.
Citation to go here
A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology 36, 983–987 (2018).
Ryan Poplin, Pi-Chuan Chang, David Alexander, Scott Schwartz, Thomas Colthurst, Alexander Ku, Dan Newburger, Jojo Dijamco, Nam Nguyen, Pegah T. Afshar, Sam S. Gross, Lizzie Dorfman, Cory Y. McLean, and Mark A. DePristo.
doi: https://doi.org/10.1038/nbt.4235
Improving variant calling using population data and deep learning. BMC Bioinformatics 24, 197 (2023).
Nae-Chyun Chen, Alexey Kolesnikov, Sidharth Goel, Taedong Yun, Pi-Chuan Chang, and Andrew Carroll.
doi: https://doi.org/10.1186/s12859-023-05294-0
For questions, suggestions, or technical assistance, feel free to open an issue or reach out to Jenna Kalleberg at [email protected]
Please open a pull request if you wish to contribute to TrioTrain.
Many thanks to the developers and contributors of the many open-source packages used by TrioTrain: