Skip to content

A straightforward and complete next-generation sequencing read simulator

License

Notifications You must be signed in to change notification settings

galantelab/sandy

Repository files navigation

sandy logo

A straightforward and complete next-generation sequencing read simulator

Sandy is a bioinformatics tool that provides a simple engine to simulate next-generation sequencing (NGS) reads for genomic and transcriptomic pipelines. Simulated data works as experimental control - a key step to optimize NGS analysis - in comparison to hypothetical models. Sandy is a straightforward, easy-to-use, fast and highly customizable tool that generates reads requiring only a fasta file as input. Sandy can simulate single-end and paired-end reads from both DNA and RNA sequencing as if produced from the most used second and third-generation platforms. The tool also tracks a built-in database with predefined models extracted from real data for sequencer quality-profiles (i.e. Illumina hiseq, miseq, nextseq), expression-matrices generated from GTExV8 data for 54 human tissues, and genomic-variations such as SNVs and Indels from 1KGP and gene fusions from COSMIC.

For full documentation, please visit https://galantelab.github.io/sandy/.

Features

  • Simulate DNA and RNA sequencing

    Simulate single-end (long and short fragments) and paired-end sequencing reads for genome and transcriptome analysis. The simulation can be customized with raffle seed, sequencing coverage, number of reads, fragment mean, output formats (fastq, sam and their compressed versions fastq.gz and bam), sequence identifier (header of entries in fastq) and much more.

  • Sequencer quality-profile

    Sandy generates fastq quality entries that mimic the Illumina, PacBio and Nanopore sequencers, as well as generating the phred-score using a statistical model based on the poisson distribution.

  • RNA-Seq expression-matrix

    It is possible to simulate a RNA-Seq which reflects the abundance of gene expression for transcripts and genes of a given tissue. For this purpose, expression-matrices were created from the gene expression data of 54 tissues of the GTExV8 project.

  • Whole-genome sequencing with genomic-variiation

    The user can tune the reference genome (eg GRCh38.p13.genome.fa.gz), adding homozygous or heterozygous genomic-variations such as SNVs, Indels, gene fusions and other types of structural variations (eg CNVs, retroCNVs). Sandy has in its database genomic-variations obtained from the 1KGP and from COSMIC.

  • Custom user models

    Users can include their models for quality-profile, expression-matrix and genomic-variation in order to adapt the simulation to their needs.

  • Custom sequence identifier

    The sequence identifier, as the name implies, is a string that identifies a biological sequence (usually nucleotides) within a sequencing data. For example, the fasta format includes the sequence identifier always after the > character at the beginning of the line; the fastq format always includes it after the @ character at the beginning of the line; the sam format uses the first column (called the query template name).

    Sequence identifier File format
    >MYID and Optional information
    ATCGATCG
    fasta
    @MYID and Optional information
    ATCGATCG
    +
    ABCDEFGH
    fastq
    MYID 99 chr1 123456 20 8M chr1 123478 30 ATCGATCG ABCDEFGH sam

    Sequence identifiers may be customized in output using a format string passed by the user. This format is a combination of literal and escaped characters, in a similar fashion to that used in C programming language’s printf function.

    For example, simulating a paired-end sequencing you can add the read length, read position and mate position into all sequence identifiers with the following format:

      %i.%U read=%c:%t-%n mate=%c:%T-%N length=%r
    

    In this case, results in fastq format would be:

      ==> Into R1
      @SR.1 read=chr6:979-880 mate=chr6:736-835 length=100
      ...
      ==> Into R2
      @SR.1 read=chr6:736-835 mate=chr6:979-880 length=100
    

Installation

There are two recommended ways to obtain Sandy: Pulling the official Docker image and installing through CPAN.

Docker

Assuming that docker is already installed on your server, simply run the command:

$ docker pull galantelab/sandy

For more details, see docker/README.md file.

CPAN

Prerequisites

Along with perl, you must have zlib, gcc, make and cpanm packages installed:

  • Debian/Ubuntu

      % apt-get install perl zlib1g-dev gcc make cpanminus
    
  • CentOS/Fedora

      % yum install perl zlib gcc make perl-App-cpanminus
    
  • Archlinux

      % pacman -S perl zlib gcc make cpanminus
    

Installing with cpanm

Install Sandy with the following command:

% cpanm App::Sandy

If you concern about speed, you can avoid testing with the flag --notest:

% cpanm --notest App::Sandy

For more details, see INSTALL file

Acknowledgments

Institution Site
Coordination for the Improvement of Higher Level Personnel CAPES
The São Paulo Research Foundation FAPESP
Teaching and Research Institute from Sírio-Libanês Hospital Galantelab

License

This is free software, licensed under:

The GNU General Public License, Version 3, June 2007