aa-tRNA-seq-pipeline

A pipeline to process ONT aa-tRNA-seq data built using snakemake.

Downstream analysis to generate figures for the initial preprint can be found at: https://github.com/rnabioco/aa-tRNA-seq

Usage

The pipeline can be configued by editing the config/config.yml file. The config file specifications will run a small example dataset through the pipeline. To download these data files:

git clone https://github.com/rnabioco/AAtRNAseqPipe.git

# download test data 
bash .test/dl_data.sh

Set up a conda environment:

mamba env create -f environment.yml
conda activate aatrnaseqpipe

Test the pipeline by invoking a dry-run snakemake in the pipeline root directory:

snakemake -n -c 1 -p

Configuration

To use on your own samples, edit config.yml and samples.tsv in config/.

See README.md in the config directory for additional details.

Workflow

Given a directory of pod5 files, this pipeline merges all files from each sample into a single pod5, rebasecalls them to generate an unmapped bam with move table information (for downstream use by Remora), converts the bam into a fastq, and aligns that fastq to a reference containing tRNA + adapter sequences with BWA MEM. The resulting data (pod5s and aligned reads) are then fed to a model trained using Remora to classify charged vs. uncharged reads in the rule cca_classify, generating numeric values indicating the likelihood of a read being aminoacylated in the ML tag of the BAM file. For classifying charged vs. uncharged reads, we treat ML values of 200-255 as aminoacylated in downstream steps, and values <200 as uncharged. This can be altered by adjusting the ml-threshold parameter in the rule get_cca_trna_cpm.

The final steps of the pipeline calculate a number of outputs that may be useful for analysis and visualization, including normalized counts for charged and uncharged tRNA (get_cca_trna_cpm), basecalling error values (bcerror), alignment statistics (align_stats) and information on raw nanopore signal from Remora (remora_signal_stats).

Remora classification

A few notes about Remora classification for charged vs. uncharged tRNA reads

this step retains only full length tRNA reads (with an allowance for signal loss at the 5´ end of nanopore direct RNA sequencing)
Additionally, due to the iterative nature of sequencing method development, the present approach does not rely on differences in adapter sequences attached to charged vs. uncharged tRNA molecules (though these sequences are retained as separate entries in the alignment reference and downstream files). While we anticipate being able to leverage this information in the future, the current pipeline relies exclusively on signal data over a 6-nt modification kmer spanning the universal CCA 3′ end of tRNA and the first three nucleotides of the 3′ adapter (CCAGGC) to distinguish charged and uncharged reads.

Cluster execution

The pipeline includes a run.sh script optimized for the LSF scheduler. For more details on configuring for HPC jobs, see cluster/config.yaml.

Notes

The dorado basecaller can be installed using pre-built binaries available from github. The conda environment.yml installs dorado 0.7.2 from an unsupported (by ONT) channel.

Name		Name	Last commit message	Last commit date
Latest commit History 97 Commits
.github		.github
.test		.test
cluster		cluster
config		config
resources		resources
workflow		workflow
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
environment.yml		environment.yml
run-preprint.sh		run-preprint.sh
run-test.sh		run-test.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

aa-tRNA-seq-pipeline

Usage

Configuration

Workflow

Remora classification

Cluster execution

Notes

About

Contributors 4

Languages

License

rnabioco/aa-tRNA-seq-pipeline

Folders and files

Latest commit

History

Repository files navigation

aa-tRNA-seq-pipeline

Usage

Configuration

Workflow

Remora classification

Cluster execution

Notes

About

Topics

Resources

License

Stars

Watchers

Forks

Contributors 4

Languages