DEPRECATED - This project is obsolete. PacBio CLR has been replaced by HiFi (CCS), and there are far better methods for running these samples today.
See these other tools for SV detection with HiFi (CCS) or CLR:
- PAV: https://github.com/EichlerLab/pav
- pbsv: https://github.com/PacificBiosciences/pbsv
- Sniffles: https://github.com/fritzsedlazeck/Sniffles
- SVIM: https://github.com/eldariont/svim
SMRT-SV calls structural variants (SVs) using long reads (PacBio RS II or Sequel). The input is an FOFN file that lists each input file (BAX H5 or BAM) and a genome reference. Reads are aligned to the reference and local assemblies are done in kilobase-scale overlapping windows across the genome. Assemblies are aligned back to the reference, and structural variants are called from them.
Because of this approach, SMRT-SV yields both SV calls and contigs containing the SV with know breakpoints on the contig. These data feed into powerful analysis tools that use the reference and the contigs, such as the SMRT-SV genotyper, which is also included in this package.
Variant calls discovered with long reads using the SMRT-SV method can be genotyped in short-reads with the SMRT-SV genotyper, which is also included in this package. See GENOTYPE.md for information about configuring and running the SMRT-SV genotyper.
SMRT-SV is run in several steps:
- Align
- Detect SV signatures
- Assemble
- Call variants
All reads are aligned to the reference. Alignments are done in batches, which is controlled by the --batches
parameter
(default = 20). Batching is useful for distributing alignment workload over a cluster. Alignments are done in the
alignment
subdirectory, and align/alignments.fofn
contains a list of aligned BAM files.
Overlapping windows are tiled across the genome. By default, 60 kbp windows are tiled each offset by 20 kbp.
The pipeline also searches for signatures of structural variation, and it places additional windows around those. The
output from this step goes into the detect
subdirectory.
For each window, reads are pulled down and assembled with Canu. The assemblies are polished (Arrow or Quiver), and
then aligned back to the reference. This step takes the most time and compute resources. Output is in the assemble
subdirectory with assemble/local_assemblies.bam
containing all assemblies.
From the contigs, variants are called from the alignments. The output from this step is in call
, and the final
output is written to the root of the working directory (default = variantsn.vcf.gz
)
In the install directory, go into the dep
directory and run make
. This will build a set of programs SMRT-SV relies
on. This must be completed before attempting to run SMRT-SV.
A RepeatMasker
step is run by the pipeline, but it does not install this dependency. Your environment should have a
working version of RepeatMasker accessible through PATH
.
SMRT-SV is made of several steps (run
, genotype
, align
, see above), and each step is implemented by one or more rules. Each rule performs a
simple task, such as aligning a set of reads or generating a data table for variant calling. Snakemake coordinates these
rules to accomplish the step.
SMRT-SV has two sets of command-line options, global and per-step. All global options must appear before the step name and all per-step options must appear after the step name.
Global options are not specific to the pipeline step. For example, --tempdir
or --distribute
. To see help for the
global options, run smrtsv -h
.
Per-step options are specific to the step being executed. For example, align
and run
have the option --batches
,
which tells SMRT-SV how many batches to split input reads into before aligning. Since the --batches
option does not
make sense for downstream analyses or the genotyper, those steps don't have it. To see the options for a specific step,
run smrtsv <step> -h
where <step>
is the name of the step (e.g. run
or genotype
).
For example, to specify a temp directory (global option) and the number of batches (per-step option), the command would
look something like smrtsv --temp-dir /path/to/temp run --batches 15 ...
.
In the SMRT-SV directory, there is a smrtsv
and a smrtsv.py
. Be sure to run smrtsv
. It will setup the correct
runtime environment for SMRT-SV (using the dep
directory) and run it.
SMRTSV only requires that a version of RepeatMasker
be runnable and in the PATH
. It will ignore almost everthing
else in the environment and use the packages in dep
. Some distributed systems clear LD_LIBRARY_PATH
when a job
is run on a compute node, but SMRT-SV will set it back to LD_LIBRARY_PATH
from calling environment.
SMRT-SV requires an input FOFN file, which is a list of paths to the input sequence reads. Each file listed must be a ".subreads.bam" (subreads BAM) file. For the Sequel platform, one subreads BAM file is output per sequencing cell. Do not include ".scraps.bam" files in the input.
The RS II platform generates three ".bax.h5" (BAX) files for each cell, and these must be converted to subreads BAM
files before running SMRT-SV. A cell can be converted to BAM files by running dep/bin/bax2bam
and giving it all three
BAX files. To save space, remove the ".scraps.bam" files, which are rarely used for anything. If sequence reads are
aligned from BAX files, some sequence read metadata is lost and assemblies will fail.
Change to a clean working directory that has nothing in it. The SMRT-SV pipeline should be installed in another
directory. Setup your environment so all dependencies are available. All output will go into this working directory.
The command below assumes a variable SMRTSV_DIR
is defined to point to the root of the SMRT-SV install directory
containing smrtsv.py
.
To run all steps:
${SMRTSV_DIR}/smrtsv run --batches 20 --threads 8 path/to/ref.fasta path/to/reads.fofn
The run
command will execute the whole pipeline, but each step (see above) can be run on its own by replacing run
with the name of the step. Commands later in the pipeline expect to find output from previous commands (e.g. detect
depends on output from align
), so these steps still need to be run in order. Note that each command has its own set of
options. See smrtsv -h
for help, or use -h
on a command to see its options.
SMRT-SV can be run on a local machine or distributed over a cluster. Distributing is managed by Snakemake. When
it is distributed, see cluster.eichler.json
for expected resource usage for each rule. The cluster configuration
file will probably need to be modified for your environment.
Several steps in SMRT-SV will use temporary directories to store files. SMRT-SV uses Python's tempfile.gettempdir()
function to choose a temporary location (See https://docs.python.org/3.7/library/tempfile.html#tempfile.gettempdir).
By default, it uses environment variables TMPDIR
, TEMP
, and TMP
to choose a temporary location. If they are not
defined, then in falls back to /tmp
. The temporary directory can be explicitly set by using the SMRT-SV command-line
option --tempdir
option (e.g. --tempdir /path/to/temp
), which will ignore these defaults and use the specified
location. Note that --tempdir
must appear on the command-line before the SMRT-SV command name because it is a
global option. If the temporary directory does not exist, SMRT-SV will create it the first time it is accessed.
Each rule within SMRT-SV creates a unique path within this temporary directory so that one rule will not overwrite temporary files from another rule. Files in the temporary directory are never shared among rules, so once a rule completes or SMRT-SV stops, the temporary directory can be cleared without losing progress. Each rule should clear it's temporary directory unless the rule is stopped before it reaches the end.
When SMRT-SV is distributed over a cluster, the temporary directory should be set to a location on the compute node running the job. This saves distributed filesystem from unnecessary I/O. If there is a fast disk available, such as a solid-state drive (SSD), then the pipeline can be sped up by placing a temp directory on that storage.
SMRT-SV outputs variants with varying levels of confidence, and this includes false calls with weak support.
The best way to filter SMRT-SV calls is by using the CONTIG_SUPPORT
value in the INFO
field of the output VCF. By
selecting SVs with a contig support value of at least 2, most false calls can be eliminated.
The QUAL
field is based on a phred-scaled ratio of contig support and total contig depth at the SV locus
(CONTIG_SUPPORT / CONTIG_DEPTH
), which is then further scaled by the CONTIG_DEPTH
to give more weight to SVs where
there are many contigs. A maximum value of 100
is imposed. This score has limited value as a quality measurement.
SMRT-SV aligns variants to the reference and pulls them down in windows to do assemblies. These come from taking sliding-windows across the genome (default: 60 kbp widows each offset by 20 kbp) and from adding additional windows around signatures of SVs seen in the sequence read alignments. This attempts to build multiple assemblies over each SV for quality control.
When pulling reads down and performing an assembly, both haplotypes are mixed (assuming it is a diploid sample). For heterozygous SVs, the assembly could yeild either the SV-containing haplotype or the reference haplotype. By adding additional windows around SV signature, it increases the chances that the SV haplotype is represented in one or more assemblies.
As a result of heterozygosity and mixed-haplotype assemblies, many heterozygous calls go undetected. It takes a phasing approach or an approach based only on read alignments to recover these.
For a full treatment of this phenomenon using SMRT-SV, see http://genome.cshlp.org/lookup/doi/10.1101/gr.214007.116
For a newer method that employs phasing, see https://www.biorxiv.org/content/early/2018/06/13/193144
If the Snakemake process running a step is killed, it will leave behind locks, and the next invocation will result in
error Error: Directory cannot be locked.
. These locks can be manually removed with rm .snakemake/locks/*
. Be sure
the process is not running because two active instances may produce erroneous results.
SMRT-SV groups assemblies into megabase-size regions and performs assemblies in that group as one job. When distributing over a cluster, this help improve performance by pulling data for the group region from shared storage into local temp storage, and each group is assembled from the locally stored data. It also reduces the complexity of assembling so many regions.
Some degenerate groups will have many assembly windows in them from signature-based window detection ("detect" step),
but they are given the same amount of runtime as other windows. Also, because they are degenerate, the assemblies in
these windows may also take longer. This results in assembly jobs failing because they are reaching runtime limits. This
often affects around groups in a human genome. You can try increasing the group runtime (--asm-group-rt
) or the
individual assembly runtime (--asm-rt
), but pericentromeric regions with dense alhpa-satellite repeat units may still
fail to assemble.
When assemblies fail, first try re-running them. Some of the assemblies may complete with a second attempt. The remaining regions are likely poor assemblies and can be discarded (check the regions before discarding to see what they are).
To allow SMRT-SV to move past these regions without assembling, first make sure the pipeline is not running. Then you
can touch the output BAM and BAI files to create empty files (e.g. assemble/group/GROUP_ID/contig.bam assemble/group/GROUP_ID/contig.bam
). Lastly, you will need to move or delete the Snakemake directory or Snakemake will
try to re-run the rules because it knows the last run failed (e.g. mv .snakemake .snakemake.asmfail
). After touching
BAM, touching BAI files, and moving/removing .snakemake
, restart SMRT-SV, and it will move past those assembly groups.