Skip to content

Latest commit

 

History

History
227 lines (172 loc) · 14.6 KB

README.md

File metadata and controls

227 lines (172 loc) · 14.6 KB

SCOTCH

Single-Cell Omics for Transcriptome CHaracterization (SCOTCH): isoform-level characterization of gene expression through long-read single-cell RNA sequencing. See our pre-print at here. All codes and analysis results in the manuscript can be found at here

Background

Recent development involving long-read single-cell transcriptome sequencing (lr-scRNA-Seq) represents a significant leap forward in single-cell genomics. With the recent introduction of R10 flowcells by Oxford Nanopore, computational methods should now shift focus on harnessing the unique benefits of long reads to analyze transcriptome complexity. In this context, we introduce a comprehensive suite of computational methods named Single-Cell Omics for Transcriptome CHaracterization (SCOTCH). Our method is compatible with the single-cell library preparation platform from 10X Genomics, Pacbio Biosciences, and Parse Biosciences, facilitating the analysis of special cell populations, such as neurons, hepatocytes and developing cardiomyocytes.

SCOTCH offers both a preprocessing and a statistical pipeline. The preprocessing pipeline accepts BAM files with tagged barcodes generated by vendor-supplied pipelines (e.g., wf-single-cell) and aligns reads to both known and novel isoforms. SCOTCH can leverage existing high-quality gene annotations, update these annotations using the data to enhance novel isoform identification, or operate entirely annotation-free. SCOTCH outputs count matrices at both the gene and transcript levels. The statistical pipeline supports the analysis of differential transcript usage (DTU), which is determined by the relative abundances of transcripts within the same gene.

Part1: Preprocessing Pipeline

Installation

To avoid package version conflicts, we strongly recommand to use conda to set up the environment. If you don't have conda installed, you can run codes below in linux to install.

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh 
bash Miniconda3-latest-Linux-x86_64.sh

After installing conda, SCOTCH sources can be downloaded.

git clone https://github.com/WGLab/SCOTCH.git
cd SCOTCH
conda env create --name SCOTCH --file scotch.yml
conda activate SCOTCH

Run SCOTCH preprocessing pipeline

main_preprocessing.py is the primary function for running the SCOTCH preprocessing pipeline. This pipeline consists of four steps: (1) generating gene annotation files (--task annotation), (2) generating the compatibility matrix (--task 'compatible matrix'), (3) summarizing novel gene annotations(--task summary), and (4) generating the count matrix (--task 'count matrix'). We will cover more details below. SCOTCH is currently compatible with several platforms (set: --platform 10x-ont or --platform 10x-pacbio or parse-ont)

Input and Output

In general, SCOTCH accepts BAM file(s) tagged by vendor-supplied pipelines for its preprocessing pipeline (For ONT data, please see here for more details), and output a series of information including read-isoform mappings, gene count matrix, isoform count matrix, and updated gene annotations. The --bam argument is used to assign input file(s), and --target is used to assign the output folder(s).

The SCOTCH pipeline allows users to process one sample at a time or multiple samples simultaneously depending on study designs. When preprocessing multiple samples together, SCOTCH generates a unified gene/isoform annotation based on all BAM files. This approach is particularly beneficial in studies involving multiple samples, as it ensures consistent identification and consolidation of novel isoforms across different samples. Some usage examples are below:

Example1: --bam accepts a single bam file

--bam sample1.bam --target path/to/sample1/output

Example2: --bam accepts multiple bam files, each from one sample

--bam sample1.bam sample2.bam --target path/to/sample1/output path/to/sample2/output

Example3: --bam accepts a folder of bam files from one sample

--bam path/to/sample1/bamfiles/folder --target path/to/sample1/output

Example4: --bam accepts multiple bam folders, each from one sample

--bam path/to/sample1/bamfiles/folder path/to/sample2/bamfiles/folder --target path/to/sample1/output path/to/sample2/output

Below is the directory structure of output under each target folder you can expect:

  • auxiliary/: This folder contains information of read-isoform mappings, cell barcodes, UMI, and exon coordinates of Isoforms.

  • bam/: Information of BAM files preprocessed, including read names, barcodes, read length etc.

  • compatible_matrix/: This directory holds the compatibility matrices of read - isoform alignments.

  • reference/: This folder includes the reference files used or generated by SCOTCH.

  • count_matrix/: This folder holds the count matrices on both gene and isoform levels generated by SCOTCH.

Step1: prepare annotation file

In this step, SCOTCH will generate annotation files for the reference genome and tagged BAM files. Set --tast annotation to run this step. Note that, users have to use the same gtf file throughout, including during the genome alignment step, which is typically processed by vendor-supplied pipelines.

SCOTCH offers three modes for generating gene annotations:

  1. Annotation-Only Mode: SCOTCH can rely entirely on existing gene annotations. This mode allows for the discovery of novel isoforms defined by combinations of known exons. This mode will fail to identify novel isoforms involving unknown exons, such as intron retention. Set --reference as path to gene annotation .gtf file. To save time, users can also set --reference_pkl as the path to SCOTCH generated annotation based on given gtf file. SCOTCH has pre-computated this file based on this (human hg38) provided by 10X genome. In addition, set --update_gtf_off.
  2. Enhanced-Annotation Mode: SCOTCH can use BAM files from one or multiple samples to update and refine existing gene annotations. This mode allows for the discovery of de novo (sub)exons with more types of novel isoforms than annotation-only mode. Set --reference as path to gene annotation .gtf file. To save time, users can also set --reference_pkl as the path to SCOTCH generated annotation based on given gtf file. SCOTCH has pre-computated this file based on this (human hg38) provided by 10X genome. In addition, set --update_gtf.
  3. Annotation-Free Mode: SCOTCH can generate gene and isoform annotations based solely on BAM files, allowing for the discovery of novel genes and isoforms. Set --reference None and --reference_pkl None.

arguments

  • --bam: path(s) to the bam file(s)/folder(s)
  • --workers: number of threads for parallel computing.
  • --coverage_threshold_exon: coverage threshold to support exon discovery, percentage to the maximum coverage, larger values will be more conservative, default is 0.02.
  • --coverage_threshold_splicing: threshold to support splicing discovery, percentage to the maximum splicing junctions, larger values will be more conservative, default is 0.02.
  • --z_score_threshold: z score threshold to discovery sharp changes of read coverage, larger values will be more conservative, default is 10.
  • --reference: Path to gene annotation file in gtf format.
  • --reference_pkl: Path to gene annotation file generated by SCOTCH named with geneStructureInformation.pkl file
  • --platform: 10x-ont or 10x-pacbio or parse-ont
  • --barcode_cell and --barcode_umi: if using 10x-ont platform, default setting is --barcode_cell CB --barcode_umi UB; if using 10x-pacbio platform, default setting is --barcode_cell XC --barcode_umi XM. Users can change this setting according to tags in bam files generated by different preprocessing workflows.

Below is an example of generating annotation in the Enhanced-Annotation Mode for two samples simultanuously. See example/annotation.sh for an implementation in slurm.

python3 src/main_preprocessing.py \
--task annotation \
--platform 10x-ont \
--target path/to/output/folder/of/sample1 path/to/output/folder/of/sample2 \
--bam path/to/bam/file/or/bamfolder/sample1 path/to/bam/file/or/bamfolder/sample2 \
--update_gtf \
--reference path/to/reference/genes.gtf \
--reference_pkl data/geneStructureInformation.pkl \
--workers 30

Step2: generate compatible matrix

In this step, SCOTCH aligns reads to existing gene isoforms and identifies novel isoforms based on the annotation files generated in step 1. Set --tast 'compatible matrix' to run this step. A compatibility matrix is generated for each gene, where each row represents a read and each column represents a gene isoform. Detailed information on read-isoform mappings can be found in the auxiliary folder. To speed up this step, job arrays can be submitted in SLURM to process genes in parallel. An example implementation for 100 job arrays is provided in example/compatible.sh.

  • --target: the same with step1
  • --bam the same with step1
  • --reference: the same with step1
  • --match_low exon percentage lower threshold to call a read-exon unmapping. For example, setting 0.1 means reads covers <10% of the exon length as unmapped. (default is 0.1)
  • --match_low exon percentage upper threshold to call a read-exon mapping. For example, setting 0.6 means reads covers <20% of the exon length as mapped. (default is 0.6)
  • --small_exon_threshold: dynamic exon length threshold to ignore for includsion and exclusion, default is 0.
  • --small_exon_threshold_high: the upper bound of dynamic exon length threshold to ignore for includsion and exclusion, default is 80.
  • --truncation_match: higher than this threshold at the truncation end will be adjusted to 1, default is 0.4.
  • --total_jobs: the number of batches to split genes and run in parallel, default is 1
  • --job_index: the batch/job index current task to run, default is 0
python3 src/main_preprocessing.py \
--task 'compatible matrix' \
--target path/to/output/folder/of/sample1 path/to/output/folder/of/sample2 \
--bam path/to/bam/file/or/bamfolder/sample1 path/to/bam/file/or/bamfolder/sample2 \
--reference path/to/reference/genes.gtf 

Step3: summary gene annotations

In this step, SCOTCH will generate a new gene annotation file, including all novel isoforms identified. Set --tast summary to run this step. See example/summary.sh for an implementation in SLURM.

python3 src/main_preprocessing.py \
--task 'summary' \
--target path/to/output/folder/of/sample1 path/to/output/folder/of/sample2

Step4: generate count matrix

Finally, SCOTCH will generate gene- and isoform-level copunt matrix. Set --tast 'count matrix' to run this step. See example/count.sh for an implementation in SLURM.

  • --target: the same with step1
  • '--novel_read_n' filter out novel isoform is the number of reads mapped for the sample less than novel_read_n, and reads mapped to this novel isoform will be treated as uncategorized. Default is 20.
  • --workers: number of threads for parallel computing.
  • --save_csv/--save_mtx: these settings are used to set up output format. Saving csv files takes some time.(default is to save in both csv and mtx formats)
  • --group_novel: whether group some novel isoforms that are potentially generated by read truncations together as one novel isoform.
  • --platform: 10x-ont or 10x-pacbio or parse-ont
python3 src/main_preprocessing.py \
--task 'count matrix' \
--target path/to/output/folder/of/sample1 path/to/output/folder/of/sample2 \
--platform 10x-ont
--workers 8 --group_novel --save_csv_false --save_mtx

Part2: Statistical Pipeline

Installation

In R, run below codes to install SCOTCH statistical pipeline.

if (!requireNamespace("devtools", quietly = TRUE))
install.packages("devtools")
library("devtools")
install_github("WGLab/SCOTCH")

DTU analysis

Below is sample codes for DTU analysis. Example data can be found in /data folder.

library(SCOTCH)

#----read gene-level count matrix-----#
sample8_CD4_gene=as.matrix(read.csv('gene_count_1.csv',row.names = 'X'))
sample8_CD8_gene=as.matrix(read.csv('gene_count_2.csv',row.names = 'X'))

#----read transcript-level count matrix-----#
sample8_CD4_transcript=as.matrix(read.csv("transcript_count_1.csv",row.names = 'X'))
gene_transcript_CD4_df = data.frame(genes=str_remove(colnames(sample8_CD4_transcript),"\\.(ENST|novel|uncategorized).+"),
                                    transcripts=colnames(sample8_CD4_transcript))

sample8_CD8_transcript=as.matrix(read.csv("transcript_count_2.csv",row.names = 'X'))
gene_transcript_CD8_df = data.frame(genes=str_remove(colnames(sample8_CD8_transcript),"\\.(ENST|novel|uncategorized).+"),
                                    transcripts=colnames(sample8_CD8_transcript))

#----gene-level analysis-----#
df_gene = scotch_gene(sample8_CD4_gene, sample8_CD8_gene), epsilon=0.01,ncores=10)%>%
  filter(pct1>=0.01|pct2>=0.01)

#----transcript-level analysis-----#
df_transcript = scotch_transcript(gene_transcript_CD4_df,gene_transcript_CD8_df, 
                                  sample8_CD4_transcript, sample8_CD8_transcript, ncores=10)
df_scotch = df_gene%>%left_join(df_transcript,by=c('genes'='gene'))

SCOTCH will output the following statistics in results:

genes: gene name
pct1: percent of cells expression the gene in cell population 1
pct2: percent of cells expression the gene in cell population 2
logFC: fold change in log scale
p_gene: p value on the gene level
pct_diff: difference between pct1 and pct2
p_gene_adj: adjusted p value of differential gene expression
isoforms: isoform name
alpha1: estimated parameter of the dirichlet distribution for cell population 1
alpha2: estimated parameter of the dirichlet distribution for cell population 2
TU1: relative transcript usage of cell population 1
TU2: relative transcript usage of cell population 2
TU_diff: difference of TU1 and TU2
TU_var1: variance of TU1
TU_var2: variance of TU2
dispersion1: dispersion parameter for cell population 1
dispersion2: dispersion parameter for cell population 2
isoform_switch: whether there is isoform switching event
isoform_switch_ES: effect size of isoform switching event
p_DTU_gene: p value for differential transcript usage analysis for the whole gene
p_transcript: p value for differential transcript usage analysis for the specific transcript
p_transcript_adj: adjusted p value for differential transcript usage analysis for the specific transcript
p_DTU_gene_adj: adjusted p value for differential transcript usage analysis for the whole gene