This repository is associated with our work on modelling mRNA metabolism. A detailed description can be found in our preprint on bioRxiv, "Nuclear export is a limiting factor in eukaryotic mRNA metabolism" (doi: https://doi.org/10.1101/2023.05.04.539375). Please use SlamDunk as the mapping tool. Raw and processed sequencing data used in this work have been deposited in NCBI's Gene Expression Omnibus and are accessible through GEO Series accession number GSE233546.
Step 1: Map raw reads to the reference
Method: SlamDunk
Input: raw reads
Output: aligned reads (BAM file)
Step 2: Generate a read summary table from the aligned reads
Method: Script pipeline.py, function snp_corrected_rsummary
Command line call: python3 pipeline.py denovo_rsummary [ARGS]
Input: filtered aligned reads (filtered BAM file), reference sequence (FASTA), SNP file (gzipped vcf)
Output: 'read summary table file': a file which stores the following information on each read (see the sketch after this list):
- name of the reference sequence the read is mapped to
- read length
- positions of insertions (position before insertion event, w.r.t. reference, 1-based)
- positions of deletions (position before deletion event, w.r.t. reference, 1-based)
- table listing the number of nucleotide pairings as well as insertions and deletions within the read
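To illustrate what one entry of such a read summary table contains, here is a minimal Python sketch; the class and field names are hypothetical and do not reflect the actual on-disk format written by pipeline.py:

    from dataclasses import dataclass, field
    from typing import Dict, List

    # Hypothetical illustration of one read summary record; names are placeholders,
    # not the file format produced by pipeline.py.
    @dataclass
    class ReadSummaryRecord:
        reference_name: str                # reference sequence the read is mapped to
        read_length: int                   # length of the read
        insertion_positions: List[int]     # position before each insertion, w.r.t. reference, 1-based
        deletion_positions: List[int]      # position before each deletion, w.r.t. reference, 1-based
        pairing_counts: Dict[str, int] = field(default_factory=dict)  # nucleotide pairings, insertions, deletions

    example = ReadSummaryRecord(
        reference_name="ENSG00000123456_3UTR",
        read_length=100,
        insertion_positions=[42],
        deletion_positions=[],
        pairing_counts={"A>A": 30, "C>C": 20, "G>G": 22, "T>T": 24, "T>C": 3, "ins": 1, "del": 0},
    )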
Step 3: Assign aligned reads to the 3’UTR regions they overlap with and split them between forward and reverse 3’UTR regions
Method: Script pipeline.py, function antisense_sense_rsummary
Command line call: python3 pipeline.py filter_rsummary [ARGS]
Input: filtered aligned reads (filtered BAM file), read summary table file, BED file defining 3’UTR regions
Output: two read summary table files, one for all reverse and one for all forward 3’UTR regions
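The idea of this step can be sketched as follows: a strand-aware overlap check of read coordinates against the BED intervals. This is only an illustration, not the antisense_sense_rsummary implementation, and all function and variable names are assumptions:

    # Illustrative sketch only: assign reads to overlapping 3'UTR intervals from a
    # BED file and split them by strand; the real step operates on the read summary table.
    from collections import defaultdict

    def load_bed(path):
        """Read a 6-column BED file into {chrom: [(start, end, name, strand), ...]}."""
        regions = defaultdict(list)
        with open(path) as fh:
            for line in fh:
                chrom, start, end, name, _score, strand = line.rstrip("\n").split("\t")[:6]
                regions[chrom].append((int(start), int(end), name, strand))
        return regions

    def split_by_strand(reads, regions):
        """reads: iterable of (chrom, start, end); returns (forward_hits, reverse_hits)."""
        forward, reverse = [], []
        for chrom, r_start, r_end in reads:
            for u_start, u_end, name, strand in regions.get(chrom, []):
                if r_start < u_end and r_end > u_start:  # overlap (BED is 0-based, half-open)
                    (forward if strand == "+" else reverse).append((name, chrom, r_start, r_end))
        return forward, reverse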
Step 4: Generate a conversion counts table
Method: Script pipeline.py, function new_convcount_table
Command line call: python3 pipeline.py denovo_cctable [ARGS]
Input: mode (a string that defines what kind of input table(s) is used; in our case this is 'summary'), a list containing the reverse and forward read summary table files
Output: 'conversion counts table file': a file that stores, for each read, the total number of positions where it could potentially be labeled (all mRNA transcript 'U' positions) and the number of labeling conversions actually observed
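To illustrate what the two numbers per read mean, here is a sketch assuming the usual SLAM-seq convention (for reads from forward 3’UTRs every reference 'T' is a potentially labeled position and a T>C mismatch counts as a labeling conversion; for reverse 3’UTRs it is 'A' and A>G); this is not the pipeline.py implementation:

    # Sketch: count potentially labeled positions and observed labeling conversions
    # for one read, given the aligned reference and read sequences (gapless alignment assumed).
    def conversion_counts(ref_seq: str, read_seq: str, reverse_utr: bool = False):
        labeled_base, converted_base = ("A", "G") if reverse_utr else ("T", "C")
        potential = sum(1 for r in ref_seq if r == labeled_base)
        observed = sum(1 for r, q in zip(ref_seq, read_seq)
                       if r == labeled_base and q == converted_base)
        return potential, observed

    print(conversion_counts("ATTGCT", "ATCGCT"))  # (3, 1): three T positions, one T>C conversion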
Step 5: Estimate the labeling efficiency (for each time point, compartment, and biological sample separately!) ('conversion efficiency table file')
Method: Script pipeline.py, function conversion_efficiency_table
Command line call: python3 pipeline.py denovo_convefftable [ARGS]
Input: conversion counts table file, base error (probability of observing a labeling conversion due to a sequencing error), sequencing error (probability of losing a true labeling conversion due to a sequencing error)
Output: 'conversion efficiency table file': a file that assigns each reference sequence name in the conversion counts table (in our case, each 3’UTR identifier) a labeling efficiency estimate (and a new/total ratio estimate that we do not need for the moment)
Notes: It is very important to estimate the labeling efficiency for each measurement separately (each time point for each compartment in each biological sample; technical replicates such as sequencing lanes can be pooled), since the labeling efficiency increases over time to a different extent in the nucleus and the cytosol!
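Conceptually, this estimation can be viewed as fitting a two-component binomial mixture per measurement: each read is either 'old' (conversions arise only from errors) or 'new' (conversions arise at the labeling efficiency), and EM alternates between assigning reads to the two components and updating the parameters. The sketch below only illustrates this idea; it uses a single error probability and omits the base/sequencing error correction of the actual denovo_convefftable implementation:

    # Minimal EM sketch for a two-component binomial mixture:
    #   old reads ~ Binomial(n, p_err)   (conversions from errors only)
    #   new reads ~ Binomial(n, p_conv)  (labeling efficiency, to be estimated)
    # counts: list of (n, k) = (potentially labeled positions, observed conversions) per read.
    from math import comb

    def em_labeling_efficiency(counts, p_err=1e-3, p_conv=0.02, pi_new=0.5, n_iter=200):
        for _ in range(n_iter):
            # E-step: posterior probability that each read is 'new'
            post = []
            for n, k in counts:
                l_new = pi_new * comb(n, k) * p_conv**k * (1 - p_conv)**(n - k)
                l_old = (1 - pi_new) * comb(n, k) * p_err**k * (1 - p_err)**(n - k)
                post.append(l_new / (l_new + l_old))
            # M-step: update the new/total ratio and the labeling efficiency
            pi_new = sum(post) / len(post)
            num = sum(w * k for w, (n, k) in zip(post, counts))
            den = sum(w * n for w, (n, k) in zip(post, counts))
            p_conv = num / den if den > 0 else p_conv
        return p_conv, pi_new

    # toy data: 60 unlabeled reads, 40 labeled reads with a few conversions each
    toy = [(50, 0)] * 60 + [(50, 2)] * 30 + [(50, 1)] * 10
    print(em_labeling_efficiency(toy))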
Step 6: Estimate new/total RNA ratios with the labeling efficiency fixed
Method: Script pipeline.py, function conversion_efficiency_table
Command line call: python3 pipeline.py denovo_convefftable [ARGS]
Input: conversion counts table file, base error (probability of observing a labeling conversion due to a sequencing error), sequencing error (probability of losing a true labeling conversion due to a sequencing error), labeling efficiency estimate
Output: 'conversion efficiency table file': a file that assigns each reference sequence name in the conversion counts table (in our case, each 3’UTR identifier) a new/total ratio estimate (and a labeling efficiency that is not meaningful for this analysis step, since the labeling efficiency is fixed to our final estimate from Step 5)
Notes: We use the same EM algorithm as in Step 5, which is designed to estimate the labeling efficiency alongside the new/total ratios, but we fix the labeling efficiency to our final estimate from Step 5, so that the algorithm estimates only the new/total ratios for the labeling efficiency determined in the step before.
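With the labeling efficiency held fixed, the same mixture model reduces to estimating only the mixture weight, i.e. the new/total ratio; a minimal sketch of this reduced update (again an illustration only, with the same simplifications as above):

    # Sketch: estimate the new/total ratio with the labeling efficiency p_conv held fixed
    # (e.g. to the final estimate from Step 5); only the mixture weight pi_new is updated.
    from math import comb

    def new_total_ratio(counts, p_conv, p_err=1e-3, pi_new=0.5, n_iter=200):
        for _ in range(n_iter):
            post = []
            for n, k in counts:
                l_new = pi_new * comb(n, k) * p_conv**k * (1 - p_conv)**(n - k)
                l_old = (1 - pi_new) * comb(n, k) * p_err**k * (1 - p_err)**(n - k)
                post.append(l_new / (l_new + l_old))
            pi_new = sum(post) / len(post)
        return pi_new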
Step 7: Generate and merge summary tables
Method: Script pipeline.py, function new_summary_table
Command line call: python3 pipeline.py denovo_summarytable [ARGS]
Input: conversion efficiency table file, base error (probability of observing a labeling conversion due to a sequencing error), sequencing error (probability of losing a true labeling conversion due to a sequencing error), labeling efficiency estimate
Output: 'summary table file': a summary table generated from a given conversion counts table and conversion efficiency table file, written to an output file.
Notes: If summary tables for all time points have been generated, they can be merged by calling merge_summarytable [ARGS] with the tables as input.
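Independent of the merge_summarytable call, the merge conceptually amounts to concatenating the per-time-point tables; if the summary tables are plain tab-separated files, this could be sketched as follows (the file names, the 'timepoint' label, and the tab-separated format are assumptions):

    # Illustrative merge of per-time-point summary tables (format and names are hypothetical).
    import pandas as pd

    files = {"0h": "summary_0h.tsv", "1h": "summary_1h.tsv", "2h": "summary_2h.tsv"}
    merged = pd.concat(
        [pd.read_csv(path, sep="\t").assign(timepoint=tp) for tp, path in files.items()],
        ignore_index=True,
    )
    merged.to_csv("summary_merged.tsv", sep="\t", index=False)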
Step 8: Convert the merged summary table into raw data for the model fit
Method: Script helper_functions.R, functions expr_level_regression and summarytable_to_rawdata
Input: a merged summary table file containing the samples for one time series. The expr_level_regression function is used to calculate the expression level of each gene along the time series (needed for the assignment of reliability criteria later). The summarytable_to_rawdata function converts the summary table into a data frame that is used for the model fit.
Output: an RData object which stores the raw data used for the estimation of metabolic rates.
Step 9: Fit the model and estimate RNA metabolic rates
Method: Script modelfit.R, function fitGene
Input: The function fitGene takes each entry of the raw data generated in Step 8.
Output: an RData object which stores the estimated RNA metabolic rates.
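The actual two-compartment model and fitting procedure are defined in modelfit.R and described in the preprint. Purely as an illustration of how a metabolic rate can be obtained from new/total ratios over time, here is a generic one-compartment example (not the authors' model), fitting an exponential approach to steady state with hypothetical data:

    # Generic illustration only (NOT the two-compartment model used in modelfit.R):
    # at steady state of a one-compartment model, new/total(t) = 1 - exp(-k_deg * t),
    # so a degradation rate can be fitted from the time series of new/total ratios.
    import numpy as np
    from scipy.optimize import curve_fit

    def new_total(t, k_deg):
        return 1.0 - np.exp(-k_deg * t)

    t = np.array([0.5, 1.0, 2.0, 4.0, 8.0])            # labeling times in hours (hypothetical)
    ratio = np.array([0.10, 0.18, 0.33, 0.55, 0.80])   # hypothetical new/total ratios
    (k_deg,), _ = curve_fit(new_total, t, ratio, p0=[0.2])
    print(f"estimated degradation rate: {k_deg:.3f}/h (half-life {np.log(2) / k_deg:.2f} h)")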
A full list of the RNA metabolic rates obtained from the time-series data used in our preprint "Nuclear export is a limiting factor in eukaryotic mRNA metabolism" (doi: https://doi.org/10.1101/2023.05.04.539375) will also be made available upon publication of this manuscript.