Sandy is a bioinformatics tool that provides a simple engine to simulate next-generation sequencing (NGS) reads for genomic and transcriptomic pipelines. Simulated data serve as an experimental control against which hypothetical models can be compared, a key step in optimizing NGS analysis. Sandy is a straightforward, easy-to-use, fast and highly customizable tool that generates reads requiring only a fasta file as input. It can simulate single-end and paired-end reads from both DNA and RNA sequencing as if produced by the most widely used second- and third-generation platforms. The tool also ships with a built-in database of predefined models extracted from real data: sequencer quality-profiles (e.g. Illumina HiSeq, MiSeq, NextSeq), expression-matrices generated from GTExV8 data for 54 human tissues, and genomic-variations such as SNVs and Indels from 1KGP and gene fusions from COSMIC.
For full documentation, please visit https://galantelab.github.io/sandy/.
-
Simulate DNA and RNA sequencing
Simulate single-end (long and short fragments) and paired-end sequencing reads for genome and transcriptome analysis. The simulation can be customized with a random seed, sequencing coverage, number of reads, mean fragment length, output formats (`fastq`, `sam` and their compressed versions `fastq.gz` and `bam`), sequence identifier (header of entries in `fastq`) and much more.
-
Sequencer quality-profile
Sandy generates `fastq` quality entries that mimic the Illumina, PacBio and Nanopore sequencers, and can also generate phred scores using a statistical model based on the Poisson distribution.
-
RNA-Seq expression-matrix
It is possible to simulate an RNA-Seq experiment that reflects the abundance of gene expression for the transcripts and genes of a given tissue. For this purpose, expression-matrices were created from the gene expression data of 54 tissues from the GTExV8 project.
-
Whole-genome sequencing with genomic-variation
The user can tune the reference genome (e.g. GRCh38.p13.genome.fa.gz) by adding homozygous or heterozygous genomic-variations such as SNVs, Indels, gene fusions and other types of structural variations (e.g. CNVs, retroCNVs). Sandy's database includes genomic-variations obtained from 1KGP and COSMIC.
-
Custom user models
Users can include their models for quality-profile, expression-matrix and genomic-variation in order to adapt the simulation to their needs.
-
Custom sequence identifier
The sequence identifier, as the name implies, is a string that identifies a biological sequence (usually nucleotides) within sequencing data. For example, the `fasta` format always places the sequence identifier after the `>` character at the beginning of the line; the `fastq` format always places it after the `@` character at the beginning of the line; the `sam` format uses the first column (called the query template name).

Sequence identifier | File format |
---|---|
`>MYID and Optional information`<br>`ATCGATCG` | fasta |
`@MYID and Optional information`<br>`ATCGATCG`<br>`+`<br>`ABCDEFGH` | fastq |
`MYID 99 chr1 123456 20 8M chr1 123478 30 ATCGATCG ABCDEFGH` | sam |
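As an illustration of the table above, the following minimal Python sketch pulls the identifier out of one record in each format. It is not part of Sandy: the function name `split_header` and the sample records are hypothetical, and the sketch assumes the identifier is the first space-delimited token of a `fasta`/`fastq` header and the first tab-delimited column of a `sam` line.

```python
def split_header(record: str, fmt: str) -> tuple:
    """Return (identifier, optional_information) for a single record.

    Hypothetical helper for illustration only; not a Sandy API.
    """
    header = record.splitlines()[0]
    if fmt in ("fasta", "fastq"):
        marker = ">" if fmt == "fasta" else "@"
        if not header.startswith(marker):
            raise ValueError(f"{fmt} header must start with {marker!r}")
        # Everything after the first space is treated as optional information.
        identifier, _, extra = header[1:].partition(" ")
        return identifier, extra
    if fmt == "sam":
        # SAM alignment lines are tab-delimited; column 1 is the query template name.
        return header.split("\t")[0], ""
    raise ValueError(f"unsupported format: {fmt}")


# Sample records mirroring the table (the sam fields are joined with tabs).
FASTA = ">MYID and Optional information\nATCGATCG\n"
FASTQ = "@MYID and Optional information\nATCGATCG\n+\nABCDEFGH\n"
SAM = "\t".join(
    ["MYID", "99", "chr1", "123456", "20", "8M",
     "chr1", "123478", "30", "ATCGATCG", "ABCDEFGH"]
)

if __name__ == "__main__":
    for fmt, record in (("fasta", FASTA), ("fastq", FASTQ), ("sam", SAM)):
        print(fmt, split_header(record, fmt))
```

All three calls yield `MYID` as the identifier, which is the point of the table: the same identifier sits in a different position depending on the file format.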
Sequence identifiers may be customized in the output using a format string passed by the user. This format is a combination of literal and escaped characters, similar to that used by the C programming language’s `printf` function.

For example, when simulating paired-end sequencing you can add the read length, read position and mate position to all sequence identifiers with the following format:
%i.%U read=%c:%t-%n mate=%c:%T-%N length=%r
In this case, results in `fastq` format would be:
==> Into R1
@SR.1 read=chr6:979-880 mate=chr6:736-835 length=100
...
==> Into R2
@SR.1 read=chr6:736-835 mate=chr6:979-880 length=100
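To make the expansion concrete, here is a minimal Python sketch of how a printf-like format string could be turned into the identifiers shown above. It is not Sandy's implementation. The function `expand_id` is hypothetical, the meaning of each escape (`%i`, `%U`, `%c`, `%t`, `%n`, `%T`, `%N`, `%r`) is inferred only from the example output, and treating `%%` as a literal percent sign is an assumption borrowed from `printf`.

```python
import re


def expand_id(fmt: str, fields: dict) -> str:
    """Expand %-escapes in fmt using the values in fields.

    Hypothetical helper for illustration only; not a Sandy API.
    """
    def substitute(match):
        key = match.group(1)
        if key == "%":  # assumption: '%%' yields a literal '%', as in printf
            return "%"
        if key not in fields:
            raise KeyError(f"unknown escape %{key}")
        return str(fields[key])

    return re.sub(r"%(.)", substitute, fmt)


FORMAT = "%i.%U read=%c:%t-%n mate=%c:%T-%N length=%r"

# Field values copied from the example output; the escape meanings are inferred.
R1 = {"i": "SR", "U": 1, "c": "chr6", "t": 979, "n": 880,
      "T": 736, "N": 835, "r": 100}
# For the mate, the read and mate coordinates swap roles.
R2 = dict(R1, t=736, n=835, T=979, N=880)

if __name__ == "__main__":
    print("@" + expand_id(FORMAT, R1))
    print("@" + expand_id(FORMAT, R2))
```

Literal characters such as `read=`, `:` and `-` pass through unchanged, so the format string reads almost like the identifier it produces.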
There are two recommended ways to obtain Sandy: pulling the official Docker image or installing through CPAN.
Assuming that `docker` is already installed on your server, simply run the command:
$ docker pull galantelab/sandy
For more details, see the docker/README.md file.
Along with `perl`, you must have the `zlib`, `gcc`, `make` and `cpanm` packages installed:
-
Debian/Ubuntu
% apt-get install perl zlib1g-dev gcc make cpanminus
-
CentOS/Fedora
% yum install perl zlib gcc make perl-App-cpanminus
-
Archlinux
% pacman -S perl zlib gcc make cpanminus
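Before running `cpanm`, it can help to confirm that the command-line prerequisites are actually on the `PATH`. The following is a small optional Python sketch, not part of Sandy; the function name `missing_tools` is hypothetical. It only checks executables, so `zlib`, which is a library rather than a command, is not covered.

```python
import shutil

# Executables named in the prerequisites above; zlib is a library and is
# deliberately not checked here.
REQUIRED_TOOLS = ("perl", "gcc", "make", "cpanm")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All required executables were found (zlib is not checked).")
```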
Install Sandy with the following command:
% cpanm App::Sandy
If you are concerned about speed, you can skip the tests with the `--notest` flag:
% cpanm --notest App::Sandy
For more details, see the INSTALL file.
Institution | Site |
---|---|
Coordination for the Improvement of Higher Level Personnel | CAPES |
The São Paulo Research Foundation | FAPESP |
Teaching and Research Institute from Sírio-Libanês Hospital | Galantelab |
This is free software, licensed under:
The GNU General Public License, Version 3, June 2007