This package is used to cluster structural variants of similar cancer cell fraction (CCF). SVclone is divided into five components: annotate, count, filter, cluster and post-assign. The annotate step infers directionality of each breakpoint (if not supplied), recalibrates breakpoint position to the soft-clip boundary and subsequently classifies SVs using a rule-based approach. The count step counts the variant and non-variant reads from breakpoint locations. Both the annotate and count steps utilise BAM-level information. The filter step removes SVs based on a number of adjustable parameters and prepares the variants for clustering. SNVs can also be added at this step as well as CNV information, which is matched to SV and SNV loci. The post-assign step then allows SVs to be assigned to the derived model from SNV clustering. Optionally, SVs and SNVs can also be post-assigned to a joint SV + SNV model.
Install Anaconda3 (or Python 3.* with Numpy and Pandas). Install R.
Now run the following:
git clone https://github.com/mcmero/SVclone.git
cd SVclone
Rscript install_R_requirements.R
pip install -r requirements.txt
python setup.py install
Example data is provided to test your SVclone installation (data contains simulated clonal deletions). Run as:
./run_example.sh
You can check the following output plot, which will summarise the clustering result:
- tumour_p80_DEL/ccube_out/tumour
You can also test the simulated SNV data by running:
./run_example_wsnvs.sh
The simulated data contains a 100% CCF clone and a 30% subclone.
An indexed whole-genome sequencing BAM and a list of paired breakpoints from an SV caller of choice is required. This step is required for clustering of SVs, however, classifiation and directionality information from your choice of SV caller can be used rather than being inferred.
./SVclone.py annotate -i <sv_input> -b <indexed_bamfile> -s <sample_name>
Input is expected in VCF format (directionality inferred from the ALT field is also supported). Each defined SV must have a matching mate, given in the MATEID value in the INFO section. Please see the VCF spec (section 3) for representing SVs using the VCF format. SVclone does not support unpaired break-ends, which means that the INFO field PARID must be specified (please see Section 5.4.4 in the VCF spec for an example).
Input may also be entered in Socrates or simple format (must be specified with --sv_format simple
or --sv_format socrates
). Simple format is as follows:
chr1 pos1 chr2 pos2
22 18240676 22 18232335
22 19940482 22 19937820
22 21383572 22 21382745
22 21395573 22 21395746
We recommend that directions from the SV caller of choice be used (use_dir
must be set to True
in the configuration file in this case). Optionally, if you already know the SV classifications, the name of the classification field can be specified in the config file (e.g. sv_class_field: classification
).
An example of the 'full' SV simple format is as follows:
chr1 pos1 dir1 chr2 pos2 dir2 classification
22 18240676 - 22 18232335 - INV
22 19940482 - 22 19937820 + DEL
22 21383572 - 22 21382745 + DUP
22 21395573 + 22 21395746 + INV
Note that your classifications in your SV input will have to match those specified in the configuration file (these may be comma-separated):
[SVclasses]
# Naming conventions used to label SV types.
inversion_class: INV
deletion_class: DEL
dna_gain_class: DUP,INTDUP
dna_loss_class: DEL,INV,TRX
itrx_class: INTRX
Note that dna_gain_class
will include any SV classification involving DNA duplication and dna_loss_class
is any intra-chromosomal rearrangement not involving a gain (including balanced rearrangements). itrx_class
refers to all inter-chromosomal translocations.
A blacklist (bed file) can also be supplied at this step to not process areas to remove SVs where any of its breakpoints fall into one of these areas.
The above input example also corresponds with the output of this step (output to <out>/<sample>_svin.txt), with an added SV ID present in the column. Events that are considered part of the same event will have the same ID (which may be multiple breakpoints).
- -i or --input : structural variants input file (see above).
- -b or --bam : bam file with corresponding index file.
- -s or --sample : Sample name. Will create processed output file as <out>/<sample>_svinfo.txt, parameters output as <out>/<sample>_params.txt.
- -o or --out <directory> : output directory to create files. Default: the sample name.
- -cgf or --config <config.ini> : SVclone configuration file with additional parameters (svclone_config.ini is the default).
- --sv_format <vcf, simple, socrates> : input format of SV calls, VCF by default, but may also be simple (see above) or from the SV caller Socrates.
- --blacklist <file.bed> : Takes a list of intervals in BED format. Skip processing of any break-pairs where either SV break-end overlaps an interval specified in the supplied bed file. Using something like the DAC blacklist is recommended.
Run SV processing submodule to obtain read counts around breakpoints on each sample BAM file like so:
./SVclone.py count -i <svs> -b <indexed_bamfile> -s <sample_name>
The classification strings are not used by the program, except for DNA-gain events (such as duplications). The classification names for these types of SVs should be specified in the svclone_config.ini file (see configuration file section).
The count step will create a tab-separated <out>/<sample>_svinfo.txt file containing count information. For example:
ID chr1 pos1 dir1 chr2 pos2 dir2 classification split_norm1 norm_olap_bp1 span_norm1 win_norm1 split1 sc_bases1 total_reads1 split_norm2 norm_olap_bp2 span_norm2 win_norm2 split2 sc_bases2 total_reads2 anomalous spanning norm1 norm2 support vaf1 vaf2
1 12 227543 + 12 228250 - DEL 12 405 13 96 4 215 189 15 473 8 94 4 149 190 32 4 25 23 12 0.32432432432432434 0.34285714285714286
2 12 333589 + 12 338298 - DEL 19 585 23 132 1 69 222 19 492 13 100 8 385 213 18 12 42 32 21 0.33333333333333331 0.39622641509433965
3 12 461142 + 12 465988 - DEL 14 490 12 120 6 247 202 12 374 16 104 6 149 214 20 6 26 28 18 0.40909090909090912 0.39130434782608697
4 12 526623 + 12 554937 - DEL 11 322 18 112 8 312 220 17 567 15 106 8 232 205 12 9 29 32 25 0.46296296296296297 0.43859649122807015
5 12 693710 + 12 696907 - DEL 13 433 15 104 9 329 212 16 446 21 138 5 245 229 20 9 28 37 23 0.45098039215686275 0.38333333333333336
The output fields are briefly described:
- split: split read count at each locus
- split_norm/span_norm: number of normal split and spanning reads crossing the boundary at locus 1 and 2 respectively.
- norm_olap_bp: count of normal read base-pairs overlapping the break (for normal reads that cross the break boundary).
- win_norm: normal read count (no soft-clips, normal insert size) for all normal reads extracted from the locus window (+/- insert size from locus).
- sc_bases: count of soft-clipped bases corresponding to split reads crossing the break.
- norm: normal read count at each locus.
- spanning: number of spanning reads supporting the break.
- support: split1 + split2 + spanning.
- anomalous: reads not counted in any other category.
- vaf: support / (norm + support).
- -i or --input : structural variants input file. This should be the output file from the annotate step.
- -b or --bam : bam file with corresponding index file.
- -s or --sample : Sample name. Will create processed output file as /_svinfo.txt, parameters output as <out>/<sample>_params.txt.
- -o or --out <directory> : output directory to create files. Default: the sample name.
- -cgf or --config <config.ini>: SVclone configuration file with additional parameters (svclone_config.ini is the default).
To filter the data obtained from the SV counting program and/or filter SNV data, can be done like so:
./SVclone.py filter -i <sv_info.txt> -s <sample_name>
Note that read length and insert sizes used by the filter step are provided as outputs from the count step (<out>/read_params.txt), based on the first 50,000 sampled reads in the bam file.
The filter step outputs the file <out>/<sample>_filtered_svs.tsv and/or <out>/<sample>_filtered_snvs.tsv depending on input. For SVs, the output is akin to the _svinfo.txt file format with added fields:
- norm_mean: average of norm1 and norm2
- gtype1/2: copy-number states at loci 1 and 2: "major, minor, CNV fraction" for example, "1,1,1.0". May be subclonal if battenberg input is supplied e.g. "1,1,0.7|2,1,0.3".
- gtype1/2_adusted: the next closest copy-number state of the given locus in the same format as above.
- adjusted_norm1/2: normal read count at the given locus adjusted for DNA-gains.
- adjusted_support: total adjusted supporting read count.
- adjusted_depth1/2: adjusted_norm + adjusted_support at the given locus.
- raw_mean_vaf: support / (mean(norm1, norm2) + support)
- adjusted_vaf1/2: adjusted_support / adjusted_depth1/2
For SNVs, example output looks like:
chrom pos gtype ref var
1 44395 1,1,1.0 33.0 15.0
1 4865339 1,1,1.0 23.0 25.0
1 13846233 1,1,1.0 28.0 25.0
1 33976797 1,1,1.0 30.0 19.0
1 51346133 1,1,1.0 33.0 22.0
Where:
- ref: total reference allele reads at locus.
- var: total variant allele reads at locus.
- -s or --sample <name> : sample name, currently only a single sample is supported. WARNING: if clustering using mutect SNVs, the sample name must match the sample name in the vcf file.
- -i or --input <svinfo.txt> : sv info file from SV count step.
- -o or --out <out> : output directory to create files. Default: the sample name.
- -cgf or --config <config.ini>: SVclone configuration file with additional parameters (svclone_config.ini is the default).
- --params <params.txt> : Parameters file from processing step containing read information. If not supplied, the default search path is <out>/<sample>_params.txt'
- -c or --cnvs <cnv file> : Battenberg subclones.txt or ASCAT caveman csv file containing segmented copy-numbers for patient sample. If not supplied, will assume that all regions are copy-number neutral (Major = 1, Minor = 1).
- -p <file> or --purity_ploidy <file>: Tumour purity and ploidy in tab-separated text file. Purity must be between 0 and 1. Ploidy can be a floating point number. Column names must be labelled 'sample', 'purity' and 'ploidy' (without quotes). Row 2 must contain the sample name and purity and ploidy values respectively.
- -g or --germline <germline_svinfo.txt> : Germline SVs; will filter out any tumour SVs which have >1 supporting read in the germline. Expects the same input format as the sv info file (you can run sv_process on the tumour SVs against the germline bam file to obtain this).
- --snvs <snv_file> : SNVs in VCF format to (optionally) compare the clustering with SVs.
- --snv_format <sanger, mutect, mutect_callstats> (default = sanger) : Specify VCF input format (only if clustering SNVs).
- --blacklist <file.bed> : Takes a list of intervals in BED format. Skip processing of any break-pairs where either SV break-end overlaps an interval specified in the supplied bed file. Using something like the DAC blacklist is recommended.
Two copy-number input formats are supported: Battenberg and ASCAT. The following shows a Battenberg subclones.txt example:
X chr startpos endpos BAF pval LogR ntot nMaj1_A nMin1_A frac1_A nMaj2_A nMin2_A frac2_A
1 1 54491 7958088 0.509836471 1 0.029104607 2.510592339 1 1 1 NA NA NA
2 1 7958594 15094665 0.519197319 6.23E-36 -0.179467225 1.74352809 1 0 0.12550915 1 1 0.87449085
3 1 15095331 39549452 0.519007705 3.72E-09 -0.050483364 2.037000342 1 1 0.86585984 2 1 0.13414016
4 1 39556288 40141768 0.840924898 1 1.617870568 9.512437457 8 1 1 NA NA NA
Battenberg provides more possible segmentations, but SVclone will only use fractions for the '_A' segments. Only the genomic coordinates, allelic copy-number and fraction fields are used by SVclone.
ASCAT's cavemam csv format is as follows:
1,1,54491,7958088,2,1,2,1
2,1,7958594,15094665,2,1,2,1
3,1,15095331,39549452,2,1,2,1
4,1,39556288,40141768,2,1,8,1
The fields are ID, chrom, start, end, normal total copy-number, normal minor allele copy-number, tumour total copy-number and tumour minor copy-number. This format gives copy-number calls at clonal resolution only.
Purity/ploidy file must have the following layout (tab-separated):
sample purity ploidy
<sample> <float between 0 - 1> <positive float or integer>
Once we have the filtered SV and/or SNV counts, we can run the clustering:
./SVclone.py cluster -s <sample_name>
NOTE: cluster step must be run from the SVclone base directory.
SVclone creates output based on the PCAWG output specification. This includes (per run):
- number_of_clusters: number of clusters found.
- <sample>_multiplicity.txt: the total copy-number, the number of copies the variant occurs on, the different multiplicity options and the probability of each.
- <sample>_assignment_probability_table.txt: probability of each variant's assignment to each cluster, based on number of times the proportion that a variant occurs in a particular cluster over all MCMC iterations.
- <sample>_cluster_certainty.txt: each variant's most likely assignment to a particular cluster and corresponding average proportion (CCF x purity).
- <sample>_subclonal_structure: clusters found, the number of variants per cluster, the proportion and CCF.
Ccube will create an RData file under <ccube_out>/<sample>_[sv/snv]_results.RData
- -s or --samples : sample names, currently only a single sample is supported.
- -o or --out : output directory (sample name by default)
- -cgf or --config <config.ini>: SVclone configuration file with additional parameters (svclone_config.ini is the default).
- --params <params.txt> : Parameters file from processing step containing read information. If not supplied, the default search path is <out>/<sample>_params.txt'
- --snvs <filtered_snvs> : SNVs output from filter step (automatically detected if present).
- --no_adjust : do not adjust read counts based on different classes of SV events.
- --subsample <integer> : (SNVs only); subsample N variants from total filtered input.
- --ss_seed (only valid if using subsampling) integer seed, can be set to ensure subsampling is replicable.
- --XX and --XY : overwrite the config file genotype with XX or XY (useful for bulk runs).
Post-assignment involves reassigning SV cluster memberships to either an SNV model, or to a joint SV + SNV model. If using the joint-model approach, SNVs may also be reassigned using the derived joint model. Run this step as follows:
Rscript SVclone/post_assign.R <sv_results.RData> <snv_results.RData> <output_dir> <sample> [--joint]
If the --joint
argument is used, SVs and SNVs will be reassigned to a joint model. If this flag is omitted, SVs will be assigned to the SNV model (default behaviour).
NOTE: post-assign step must be run from the SVclone base directory.
Post-assign creates the same output as the cluster step.
If customisation is required for parameters, these can be modified in the svclone_config.ini, or a new config file can be specified and each step run with the --config or -cfg flag for all steps.
In silico sample mixtures were generated from patient data derived from patient 001 from the Hong et al. study (doi:10.1038/ncomms7605). The data are available in the EGA Sequence Read Archive under accession EGAS00001000942. Somatic and germline variant calls, mutational signatures, subclonal reconstructions, transcript abundance, splice calls and other core data generated by the ICGC/TCGA Pan-cancer Analysis of Whole Genomes Consortium is described here and available for download at https://dcc.icgc.org/releases/PCAWG. Additional information on accessing the data, including raw read files, can be found at https://docs.icgc.org/pcawg/data/. In accordance with the data access policies of the ICGC and TCGA projects, most molecular, clinical and specimen data are in an open tier which does not require access approval. To access potentially identification information, such as germline alleles and underlying sequencing data, researchers will need to apply to the TCGA Data Access Committee (DAC) via dbGaP for access to the TCGA portion of the dataset, and to the ICGC Data Access Compliance Office (DACO) for the ICGC portion. In addition, to access somatic single-nucleotide variants derived from TCGA donors, researchers will also need to obtain dbGaP authorisation. Derived datasets described specifically in this manuscript can be found at these locations:
- Consensus SVs (consensus SVs)
- Consensus SNVs and INDELs (consensus SNVs and INDELs)
- Consensus copy-numbers (consensus copy-numbers)
All the other data supporting the findings of this study are available within the article and its supplementary information files and from the corresponding author upon reasonable request. A reporting summary for this article is available as a Supplementary Information file.
Ccube clustering code can be found under https://github.com/keyuan/ccube. Code for generating all figures in the manuscript and the in silico mixture samples can be found under https://github.com/mcmero/SVclone_Rmarkdown. Code for simulating SVs can be found under https://github.com/mcmero/sv_simu_pipe. The core computational pipelines used by the PCAWG Consortium for alignment, quality control and variant calling are available to the public at https://dockstore.org/search?search=pcawg under the GNU General Public License v3.0, which allows for reuse and distribution.