SPLASH+

SPLASH+ is a new analytic method that detects a wide range of biological processes that diversify transcripts, including but not limited to RNA splicing, mutations, RNA editing, and V(D)J recombination, directly from raw sequencing reads. It does so by integrating a micro-assembly and biological interpretation framework with the recently developed SPLASH algorithm. SPLASH is a unified reference-free algorithm that performs statistical inference directly on raw sequencing reads. SPLASH+ builds on SPLASH by adding new approaches for analyzing SPLASH's output: a new reference-free statistical approach for de novo assembly (called Compactors) and a framework for interpretation and annotation that assigns a meaningful biological class to each SPLASH call.

How to run SPLASH+

The SPLASH+ pipeline consists of 3 main steps:

Running SPLASH: to obtain sequences (anchors) that are followed by a set of sample-dependent diverse sequences (targets)
Running Compactors: for de novo local assembly of sequences called by SPLASH
Running Biological Interpretation: to assign a biologically relevant event (single base pair change, alternative splicing, ...) accounting for the observed sequence diversity.

1- SPLASH

SPLASH can be run on an input set of FASTQ files by following the steps in https://github.com/salzman-lab/SPLASH. After running SPLASH, the output file will be a list of significant anchors (we refer to it as anchors.txt in this readme), where each anchor is associated with a set of statistically significant sample-dependent target sequences. anchors.txt will then be used in the next step (Compactors) to perform a local de novo assembly and obtain extended sequences for each called anchor to facilitate and improve biological interpretation.

2- Compactors

Compactors analyze the sequence composition at each position to the right of each seed to evaluate whether the nucleotides present at that position constitute noise or biological signal. This test is applied recursively on read sets, resulting in one or more assembled sequences (compactors) for each called anchor. The Compactors step is implemented in a fully containerized Nextflow pipeline (nf-compactors) with minimal installation requirements.
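
As a rough illustration of the recursive idea described above, the following Python sketch extends a sequence one position at a time and branches wherever more than one nucleotide has enough read support. It is not the statistical test used by nf-compactors: the fixed frequency cutoff min_frac, the min_reads limit, the function name build_compactors, and the toy reads are all assumptions made for illustration only.

```python
from collections import Counter

def build_compactors(suffixes, prefix="", max_len=10, min_frac=0.2, min_reads=2):
    """Recursively extend `prefix` using read fragments that follow an anchor.

    suffixes: read fragments starting right after the anchor.
    A nucleotide is kept at a position only if its share of the reads reaching
    that position is at least `min_frac` (a simple stand-in for a noise test).
    """
    pos = len(prefix)
    # Only reads long enough to cover this position can vote.
    voters = [s for s in suffixes if len(s) > pos]
    if pos >= max_len or len(voters) < min_reads:
        return [prefix]
    counts = Counter(s[pos] for s in voters)
    kept = [nt for nt, c in sorted(counts.items()) if c / len(voters) >= min_frac]
    if not kept:
        return [prefix]
    results = []
    for nt in kept:
        # Recurse on the subset of reads that agree with this branch.
        subset = [s for s in voters if s[pos] == nt]
        results.extend(build_compactors(subset, prefix + nt, max_len, min_frac, min_reads))
    return results

# Toy fragments: two well-supported variants and one low-frequency variant.
reads = ["ACGT", "ACGT", "ACGT", "ACTT", "ACTT", "AGGT"]
compactors = build_compactors(reads, max_len=4)
print(compactors)
```

In this toy input the single AGGT read falls below the cutoff and is treated as noise, while the two well-supported variants each yield their own assembled sequence.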

Compactors need two input files:

anchors.txt: a single column file containing the list of significant anchors from SPLASH
samplesheet.csv: each line in this file provides the path to an input FASTQ file used for running SPLASH.
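
The two input files can be prepared by hand, or with a small helper such as the Python sketch below. It assumes that the samplesheet has no header line and that the anchors file needs a header only if your pipeline version expects a named column; both points should be checked against the pipeline documentation. The function names, FASTQ paths, and anchor sequence are placeholders.

```python
import tempfile
from pathlib import Path

def write_samplesheet(fastq_paths, path):
    """Write one FASTQ path per line (no header line is assumed)."""
    lines = [str(p) for p in fastq_paths]
    Path(path).write_text("\n".join(lines) + "\n")

def write_anchors(anchors, path, header=None):
    """Write one anchor per line, dropping duplicates but keeping order.

    Pass header="anchor" if a named column is expected.
    """
    unique = list(dict.fromkeys(anchors))
    lines = ([header] if header else []) + unique
    Path(path).write_text("\n".join(lines) + "\n")

# Example with placeholder paths and a made-up anchor sequence.
out_dir = Path(tempfile.mkdtemp())
write_samplesheet(
    ["/data/sample1.fastq.gz", "/data/sample2.fastq.gz"],
    out_dir / "samplesheet.csv",
)
write_anchors(
    ["ACGTACGTACGTACGTACGTACGTACG", "ACGTACGTACGTACGTACGTACGTACG"],
    out_dir / "anchors.txt",
)
print((out_dir / "samplesheet.csv").read_text())
```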

After running Compactors, two output files will be generated:

compactor_summary.tsv: contains the resulting assembled sequences (compactors) for each significant anchor. This file will then be used in the next step for biological interpretation.
sample_specificity.tsv: reports the supporting read counts for each compactor in the input samples.

Quick Start for running Compactors pipeline:

Install Nextflow (>=21.10.3)
Install either Docker or Singularity. You can also use Conda both to install Nextflow itself and to manage software within pipelines.

Create your --fastq_samplesheet and run the pipeline. The FASTQ samplesheet should list the path to one input FASTQ file per line. The --anchors_file can be any TSV that presents seeds or anchors in a column called anchor.

nextflow run salzmanlab/compactors \
    -r main \
    -latest \
    -profile test,YOURPROFILE \
    --fastq_samplesheet samplesheet.csv \
    --anchors_file anchors.txt \
    --outdir <OUTDIR>

Test run for Compactors pipeline:

We provide test data for a quick run of Compactors in the compactor_test_run folder.
You first need to download the sim_adipose_1.fq.gz FASTQ file from https://data.broadinstitute.org/Trinity/STAR_FUSION_PAPER/SupplementaryData/sim_reads/sim_50_fastq/ and update its path in the sample_sheet.csv file.
After running the pipeline with the anchors.txt file provided in the folder, you should obtain two files, compactor_summary.tsv and sample_specificity.tsv, matching those given in the folder.

3- Biological interpretation

For biological interpretation of called anchors (obtained from step 1) using their assembled compactors (obtained from step 2), we provide a script, SPLASH_plus_classification.R, that categorizes anchors into biologically meaningful events. Currently, we consider 6 categories: single base pair changes, alternative splicing, internal splicing (such as insertions and deletions), 3'UTR, centromere, and repeats. The script needs the following inputs:

directory: Directory for writing output files
compactor_file: path to the compactors file compactor_summary.tsv generated from the compactors step
STAR_executable: path to STAR executable file
samtools_executable: path to Samtools executable file
bedtools_executable: path to bedtools executable file
bowtie2_executable: path to Bowtie2 executable file
STAR_reference: path to STAR index files for the reference genome
annotated_splice_juncs: path to the file containing annotated splice junctions from the reference transcriptome (can be either downloaded or generated with SPLASH_plus_build.R)
annotated_exon_boundaries: path to the file containing annotated exon boundaries from the reference transcriptome (can be either downloaded or generated with SPLASH_plus_build.R)
gene_coordinates: path to the file containing gene coordinates from the reference transcriptome (can be either downloaded or generated with SPLASH_plus_build.R)
centromere_annotation_file: (optional) path to the centromere annotation file
repeats_annotation_file: (optional) path to annotation file for repetitive elements
UTR_annotation_file: (optional) path to UTR annotation file

The script will generate a file classified_anchors.tsv in the same directory specified by the directory input argument. The file contains significant anchors along with their compactors, biological classification, and alignment information.
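
Once classified_anchors.tsv is available, a quick tally of anchors per category can be a useful first check. The Python sketch below assumes the file is tab-delimited with a header row containing the anchor and anchor_event columns; the function name count_events and the three example rows are made up for illustration. Because each row corresponds to a compactor, an anchor is counted only the first time it appears.

```python
import csv
import io
from collections import Counter

def count_events(tsv_handle):
    """Count distinct anchors per anchor_event (anchors repeat across rows)."""
    seen = set()
    counts = Counter()
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        if row["anchor"] not in seen:
            seen.add(row["anchor"])
            counts[row["anchor_event"]] += 1
    return counts

# Made-up rows standing in for a real classified_anchors.tsv.
example = io.StringIO(
    "anchor\tcompactor\tanchor_event\n"
    "AAAC\tAAACGGT\tsplicing\n"
    "AAAC\tAAACGTT\tsplicing\n"
    "CCCG\tCCCGTTA\tbase pair change\n"
)
counts = count_events(example)
print(dict(counts))
```

For a real run, replace the in-memory example with open("classified_anchors.tsv", newline="").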

SPLASH+ output file description

The classified_anchors.tsv file, generated by running the SPLASH_plus_classification.R script on the compactors file compactor_summary.tsv, is the main output file of SPLASH+. It contains the list of significant anchors along with their compactors, the alignment information for each compactor, and, most importantly, a column named anchor_event that gives the biological classification of the anchor. Currently, we consider these 6 categories for each anchor: splicing, internal splicing, base pair change, 3’UTR, centromere, and repeat. Each row in classified_anchors.tsv corresponds to a compactor sequence generated for an anchor. The columns in the output file are described below:

anchor: anchor sequence
compactor: generated compactor sequence for the anchor
anchor_abundance: total abundance (read count) of the anchor
compactor_abundance: read count for the generated compactor
anchor_index: a unique index assigned to each anchor
compactor_index: a unique index assigned to each compactor, in the form A_B, where A is the index of the corresponding anchor and B is the compactor's rank among the compactors for that anchor (sorted by abundance)
compactor_gene: the assigned gene name to the compactor obtained by aligning the compactor sequence to the reference genome
is.aligned_STAR: specifies whether compactor was mapped by STAR
is.STAR_SJ: specifies whether STAR reported a splice junction for the compactor
is.STAR_chimeric: specifies whether STAR reported a chimeric alignment for the compactor
STAR_flag: the alignment FLAG reported by STAR for the compactor
STAR_chr: chromosome of the compactor as reported by STAR
STAR_coord: the mapping coordinate of STAR for the compactor
STAR_CIGAR: CIGAR string reported by STAR for the compactor in the BAM file
STAR_num_alignments: number of alignments reported by STAR for the compactor
anchor_event: the biological classification of the anchor, currently one of these: splicing, internal splicing, base pair change, 3’UTR, centromere, and repeat
number_nonzero_samples: number of cells (samples) expressing the anchor
all_splice_juncs: this column gives the concatenated list of the splice junctions obtained by aligning the top two compactors of the anchor to the genome. For example, if compactor1 contains splice junction SJA with coordinates ChrA:PosA1:PosA2 and compactor2 contains splice junction SJB with coordinates ChrB:PosB1:PosB2 the all_splice_juncs for this anchor would be ChrA:PosA1:PosA2--ChrB:PosB1:PosB2.
all_SS_AS_annot: this column indicates whether each splice site is annotated as being involved in alternative splicing (1 for annotated, 0 for unannotated). For example, for the splice junctions above, an all_SS_AS_annot value of 0:0--1:0 means that PosB1 is a splice site known to be involved in alternative splicing according to the reference transcriptome, while the other three splice sites are not.
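
The composite columns above can be unpacked with a few string splits. The following Python sketch assumes well-formed values (integer coordinates, integer ranks, and the same number of "--"-separated entries in all_splice_juncs and all_SS_AS_annot); the function names and example coordinates are hypothetical, and real files may contain empty or missing values that need extra handling.

```python
def parse_compactor_index(value):
    """Split 'A_B' into (anchor index as text, compactor rank as int)."""
    anchor_idx, rank = value.rsplit("_", 1)
    return anchor_idx, int(rank)

def parse_splice_juncs(juncs, annot):
    """Pair each junction in all_splice_juncs with its flags in all_SS_AS_annot."""
    parsed = []
    for junc, flags in zip(juncs.split("--"), annot.split("--")):
        # Split from the right so the two coordinates are always the last fields.
        chrom, pos1, pos2 = junc.rsplit(":", 2)
        flag1, flag2 = flags.split(":")
        parsed.append({
            "chr": chrom,
            "pos1": int(pos1),
            "pos2": int(pos2),
            "pos1_annotated": flag1 == "1",
            "pos2_annotated": flag2 == "1",
        })
    return parsed

# Made-up values following the formats described above.
index = parse_compactor_index("12_1")
juncs = parse_splice_juncs("chr1:100:200--chr1:100:350", "0:0--1:0")
print(index)
print(juncs[1])
```

Here the second junction's first splice site is the only one flagged as annotated, matching the 0:0--1:0 example in the column description.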

Building index and annotation files needed for running classification script

To run SPLASH_plus_classification.R for a reference assembly, you need a STAR index for the reference genome and three annotation files (annotated_splice_juncs, annotated_exon_boundaries, gene_coordinates) describing the annotated splice junctions, exons, and genes in the reference transcriptome. To build these files, obtain a FASTA file for the reference genome and a GTF file for the transcriptome annotation. Note that the FASTA and GTF files should come from the same assembly, since consistent coordinates and chromosome names are needed to annotate anchors accurately. With both files in hand, perform the following two steps:

STAR index: You can build the STAR index with default parameters:
STAR --runThreadN 4 --runMode genomeGenerate --genomeDir STAR_index_files --genomeFastaFiles $fasta file$ --sjdbGTFfile $gtf file$
Annotation files: the three files for annotated exon boundaries, annotated splice junctions, and gene coordinates can be built by running a script we have provided SPLASH_plus_build.R. SPLASH_plus_build.R needs 3 inputs:
- $gtf_file$ : absolute path to the gtf file,
- $hisat2_directory$ : directory containing the HISAT2 code downloaded from the HISAT2 repository (the script assumes that there are two Python scripts at $hisat2_directory$/extract_exons.py and $hisat2_directory$/extract_splice_sites.py),
- $outfile_name$ : the name used for the annotation files that script will generate.
SPLASH_plus_build.R can be run using the following command:
Rscript SPLASH_plus_build.R $gtf_file$ $hisat2_directory$ $outfile_name$
If the script finishes successfully, it will generate 3 output annotation files in the same directory as the script:
- $outfile_name$_known_splice_sites.txt for annotated splice sites (can be used as annotated_splice_juncs input for SPLASH_plus_classification.R)
- $outfile_name$_exon_coordinates.bed for annotated exon boundaries (can be used as annotated_exon_boundaries input for SPLASH_plus_classification.R)
- $outfile_name$_genes.bed for annotated gene coordinates (can be used as gene_coordinates input for SPLASH_plus_classification.R)
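
To confirm that the build step produced all three files before starting classification, a check along the following lines may help. This Python sketch only relies on the file-name suffixes listed above; the function name annotation_paths, the example outfile name, and the temporary directory standing in for the script directory are assumptions for illustration.

```python
import tempfile
from pathlib import Path

def annotation_paths(script_dir, outfile_name):
    """Map each classification input to the file the build script should create."""
    base = Path(script_dir)
    return {
        "annotated_splice_juncs": base / f"{outfile_name}_known_splice_sites.txt",
        "annotated_exon_boundaries": base / f"{outfile_name}_exon_coordinates.bed",
        "gene_coordinates": base / f"{outfile_name}_genes.bed",
    }

# An empty temporary directory stands in for the script directory here,
# so every file is reported as missing.
script_dir = tempfile.mkdtemp()
paths = annotation_paths(script_dir, "grch38")
missing = [name for name, path in paths.items() if not path.exists()]
print("missing:", missing)
```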

Downloading pre-built annotation files for human and mouse genomes:

The human files were built for both T2T assembly and GRCh38 assembly. The mouse files were built based on mm39 assembly. The annotation files can be downloaded using the following links:

Human (T2T):
- annotated_splice_juncs: https://drive.google.com/file/d/1owlOQyP1z4cyFvYcAAA-qQmc-K6jGbs9/view?usp=share_link
- annotated_exon_boundaries: https://drive.google.com/file/d/1R-4-ICDAzmIBgQmlOF22nNrCWoSgrmHi/view?usp=share_link
- gene_coordinates: https://drive.google.com/file/d/1L0A7iGXEYiOsPQ0QiJayKPybJ79ZDi2F/view?usp=sharing
Human (GRCh38):
- annotated_splice_juncs: https://drive.google.com/file/d/1izVHy1m-ddlNgJtFKfWcHdtkc_Y5bHHP/view?usp=sharing
- annotated_exon_boundaries: https://drive.google.com/file/d/1oK6OgQnFFVvybBo0EZ5aIyeoZLAtMyZF/view?usp=sharing
- gene_coordinates: https://drive.google.com/file/d/1REfnl9ZNYcsb-1jSurDHcsL7QFJ00JEp/view?usp=sharing
Mouse (mm39):
- annotated_splice_juncs: https://drive.google.com/file/d/1iJhf421nMRDC0uCo_0jh7Nkns8NAieTE/view?usp=sharing
- annotated_exon_boundaries: https://drive.google.com/file/d/1npE0rkxhsDtJk3FeMdfuZwc5Elfuk4bq/view?usp=sharing
- gene_coordinates: https://drive.google.com/file/d/1V8By-yq7AmgXY-XDhipgjjsamL0ghhJa/view?usp=sharing

Contact

Please contact Roozbeh Dehghannasiri ([email protected]).

Citation

Dehghannasiri*, R., Henderson*, G., Bierman, R., Chaung, K., Baharav, T., Wang, P., and Salzman, J. Unsupervised reference-free inference reveals unrecognized regulated transcriptomic complexity in human single cells, bioRxiv, (2023).

Ewels, P., Peltzer, A., Fillinger, S., Patel, H., Alneberg, J., Wilm, A., Garcia, M. U., Di Tommaso, P., and Nahnsen, S. The nf-core framework for community-curated bioinformatics pipelines, Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
