Small ORF Expression Prediction Model

Overview

This project aims to develop a machine learning model to predict the expression of small Open Reading Frames (sORFs). Small ORFs are typically challenging to identify and predict due to their short length, but they play a significant role in various biological processes. By accurately predicting which sORFs are expressed, we can gain insights into their functional roles in the genome.

Project Workflow

1. Gene Identification with ggCaller

The first step was to identify candidate small ORFs in a dataset of 616 pneumococcal genomes (see "Population Genomic Datasets" in the bibliography). The following tool and parameters were used:

  • Tool Used: ggCaller
  • Parameters:
    • --min-orf-length = 19
    • --min-orf-score = 19

ggCaller was run on these genomes to predict small ORFs, yielding a total of 6,013 predicted genes.

Here is the length distribution of the genes after filtering out the long ORFs:

Gene Distribution
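The length filter mentioned above can be sketched in a few lines of Python. This is only an illustration: the README does not state the cutoff used to separate long ORFs from small ones, so `MAX_SORF_LENGTH` is a hypothetical value, and `read_fasta` and `keep_short_orfs` are helper names invented for this example.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (gene_id, sequence) pairs from FASTA-formatted text."""
    gene_id, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if gene_id is not None:
                yield gene_id, "".join(chunks)
            # Use the first whitespace-separated token of the header as the ID.
            gene_id, chunks = (line[1:].split() or [""])[0], []
        elif gene_id is not None:
            chunks.append(line)
    if gene_id is not None:
        yield gene_id, "".join(chunks)


def keep_short_orfs(records: Iterable[Tuple[str, str]], max_length: int) -> Dict[str, str]:
    """Keep only records whose nucleotide sequence is at most max_length long."""
    return {gene_id: seq for gene_id, seq in records if len(seq) <= max_length}


# Hypothetical cutoff in nucleotides; the actual threshold is not given in this README.
MAX_SORF_LENGTH = 300

# Toy input: one short ORF and one long ORF (606 nt).
example = ">gene_a\nATGAAATAA\n>gene_b\n" + "ATG" + "GCT" * 200 + "TAA\n"
short_orfs = keep_short_orfs(read_fasta(example), MAX_SORF_LENGTH)
print(sorted(short_orfs))
```

In practice the same function would be applied to the FASTA output of ggCaller rather than to an inline string.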

2. Transcript Alignment with Kallisto

The next step was to align the transcripts from the study by Aprianto R et al. (2018) against the identified small ORFs to determine which of these ORFs are expressed.

  • Tool Used: Kallisto
  • Index File: short_ORFs.ffn
  • Parameters:
    • kmer length = 19

The alignment was performed using the cDNA expression data from the study. The dataset was accessed from the European Nucleotide Archive (ENA).
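As a rough sketch, the indexing and quantification steps could be driven from Python as shown below. The index file name (`short_ORFs.idx`), the output directory, and the read file names are placeholders, and whether the reads are single-end or paired-end is an assumption to check against the ENA records. The helpers only build the command lines; passing them to `subprocess.run` would execute them.

```python
import shlex
from typing import List


def kallisto_index_cmd(fasta: str, index: str, k: int = 19) -> List[str]:
    """Build a `kallisto index` command; kallisto expects an odd k-mer length."""
    if k % 2 == 0:
        raise ValueError("k-mer length must be odd")
    return ["kallisto", "index", "-i", index, "-k", str(k), fasta]


def kallisto_quant_cmd(
    index: str,
    out_dir: str,
    reads: List[str],
    single_end: bool = False,
    fragment_length: float = 200.0,  # placeholder value, only used for single-end reads
    fragment_sd: float = 20.0,       # placeholder value, only used for single-end reads
) -> List[str]:
    """Build a `kallisto quant` command for paired-end or single-end reads."""
    cmd = ["kallisto", "quant", "-i", index, "-o", out_dir]
    if single_end:
        # Single-end mode needs the fragment length mean and standard deviation.
        cmd += ["--single", "-l", str(fragment_length), "-s", str(fragment_sd)]
    return cmd + reads


# Placeholder file names, for illustration only.
index_cmd = kallisto_index_cmd("short_ORFs.ffn", "short_ORFs.idx", k=19)
quant_cmd = kallisto_quant_cmd(
    "short_ORFs.idx", "kallisto_out", ["reads_1.fastq.gz", "reads_2.fastq.gz"]
)
print(shlex.join(index_cmd))
print(shlex.join(quant_cmd))
```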

3. Kallisto Output Analysis

After the alignment, the output from Kallisto was analyzed, specifically focusing on the abundance.tsv file.

The metrics in the abundance.tsv file include:

  • eff_length: The effective length of the transcript, considering the fragment length distribution.
  • est_counts: The estimated number of reads derived from this transcript.

The key metric for our analysis, however, is:

  • TPM (Transcripts Per Million): A normalized measure of transcript abundance, allowing comparison of transcript levels within and between samples.
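For illustration, the TPM column can be read from abundance.tsv with the standard library. The sketch below assumes the usual kallisto column layout (`target_id`, `length`, `eff_length`, `est_counts`, `tpm`); the two data rows are made-up values, not results from this project, and `read_tpm` is a helper name chosen for this example.

```python
import csv
import io
from typing import Dict, TextIO


def read_tpm(handle: TextIO) -> Dict[str, float]:
    """Map each target_id to its TPM value from a kallisto abundance.tsv file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["target_id"]: float(row["tpm"]) for row in reader}


# Made-up rows in the abundance.tsv layout, for illustration only.
header = "\t".join(["target_id", "length", "eff_length", "est_counts", "tpm"])
rows = [
    "\t".join(["gene_1", "60", "20.5", "0", "0"]),
    "\t".join(["gene_2", "75", "30.2", "12", "1534.7"]),
]
example_file = io.StringIO("\n".join([header] + rows) + "\n")

tpm_by_gene = read_tpm(example_file)
print(tpm_by_gene)
```

With a real run, `open("abundance.tsv")` would replace the in-memory `io.StringIO` object.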

TPM Calculation:

  1. RPK (Reads Per Kilobase): Divide the read counts by the length of each gene in kilobases.
  2. Per Million Scaling Factor: Sum all the RPK values in a sample and divide by 1,000,000.
  3. TPM: Divide the RPK values by the “per million” scaling factor.

Note: TPM is preferred over RPKM/FPKM because it ensures that the sum of TPMs in each sample is the same, making cross-sample comparisons more reliable.
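The three steps of the TPM calculation can be written out as a small function. This is a minimal sketch with toy numbers; `compute_tpm` is a name chosen for the example, and it uses plain gene lengths as in the steps above (Kallisto itself reports TPM based on effective lengths).

```python
import math
from typing import List, Sequence


def compute_tpm(counts: Sequence[float], lengths_bp: Sequence[float]) -> List[float]:
    """Compute TPM from read counts and gene lengths given in base pairs."""
    # Step 1: reads per kilobase (RPK).
    rpk = [count / (length / 1000.0) for count, length in zip(counts, lengths_bp)]
    # Step 2: "per million" scaling factor.
    scaling_factor = sum(rpk) / 1_000_000
    if scaling_factor == 0:
        # No reads at all: every gene gets a TPM of 0.
        return [0.0] * len(rpk)
    # Step 3: divide each RPK value by the scaling factor.
    return [value / scaling_factor for value in rpk]


# Toy example: three genes with the same read density.
toy_counts = [10, 20, 30]
toy_lengths = [1000, 2000, 3000]
toy_tpm = compute_tpm(toy_counts, toy_lengths)

# The TPM values of a sample always sum to one million (up to rounding).
print(math.isclose(sum(toy_tpm), 1_000_000))
```

Because all three toy genes have the same number of reads per kilobase, they receive the same TPM even though their raw counts differ.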

Here is a distribution of TPM values:

TPM Distribution

4. Labeling Genes for Model Training

To determine whether a gene is expressed (True) or not (False), a TPM cutoff of 0 was used.

  • True Genes: Genes with a TPM > 0.
  • False Genes: Genes with a TPM = 0.

The following steps were taken:

  • Gene IDs were read from genes_with_zero_tpm.txt and genes_with_nonzero_tpm.txt files.
  • The short_orfs.ffn file was parsed, and each gene was assigned a label according to its TPM status.
  • The output was a modified short_orfs.ffn file in which a label, 0 (False) or 1 (True), is written on the line below each gene's sequence.
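A minimal sketch of this labeling step is shown below. It is not the project's actual script: `parse_fasta` and `label_fasta` are names invented for the example, the ID sets stand in for the contents of genes_with_nonzero_tpm.txt and genes_with_zero_tpm.txt, and skipping genes that appear in neither list is an assumption.

```python
from typing import List, Set, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Return (gene_id, sequence) pairs from FASTA-formatted text."""
    records, gene_id, chunks = [], None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if gene_id is not None:
                records.append((gene_id, "".join(chunks)))
            gene_id, chunks = (line[1:].split() or [""])[0], []
        elif gene_id is not None:
            chunks.append(line)
    if gene_id is not None:
        records.append((gene_id, "".join(chunks)))
    return records


def label_fasta(fasta_text: str, nonzero_ids: Set[str], zero_ids: Set[str]) -> str:
    """Write each gene as header, sequence, and a 0/1 label on its own line."""
    blocks = []
    for gene_id, seq in parse_fasta(fasta_text):
        if gene_id in nonzero_ids:
            label = 1  # TPM > 0: expressed
        elif gene_id in zero_ids:
            label = 0  # TPM = 0: not expressed
        else:
            continue  # assumption: genes missing from both lists are skipped
        blocks.append(f">{gene_id}\n{seq}\n{label}\n")
    return "\n".join(blocks)


# Toy sequences and ID sets, for illustration only.
toy_fasta = ">gene_1\nATGTGA\n>gene_2\nATGTAA\n"
labeled = label_fasta(toy_fasta, nonzero_ids={"gene_2"}, zero_ids={"gene_1"})
print(labeled)
```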

5. Final Training File

The final training file, which will be used to train the machine learning model, looks something like this:

>gene_1
ATGCGT...TTGA
0

>gene_2
ATGCCA...TCAA
1

This file includes the sequence of each gene followed by a label indicating whether the gene is expressed (1) or not (0).

Example of our labeled data
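To feed this file to a model, it first has to be read back into (ID, sequence, label) examples. The sketch below shows one way to do that, assuming the three-line layout shown above; `read_labeled` is a hypothetical helper, and the sequences in the example are toy data rather than genes from the dataset.

```python
from typing import List, Tuple


def read_labeled(text: str) -> List[Tuple[str, str, int]]:
    """Parse header / sequence / label records into (gene_id, sequence, label)."""
    examples, gene_id, chunks = [], None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            gene_id, chunks = line[1:], []
        elif line in ("0", "1") and gene_id is not None:
            # A bare 0 or 1 closes the current record.
            examples.append((gene_id, "".join(chunks), int(line)))
            gene_id = None
        elif gene_id is not None:
            chunks.append(line)
    return examples


# Toy file contents in the same layout as the training file.
toy_training = ">gene_1\nATGCGTTGA\n0\n\n>gene_2\nATGCCATAA\n1\n"
training_examples = read_labeled(toy_training)
print(training_examples)
```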

Next Steps

This project is still in progress. The next steps are:

  • Model Training: The labeled dataset will be used to train two machine learning models: a CNN (model 1) and BERT (model 2).
  • Model Evaluation: Each model will be evaluated using various metrics to ensure its accuracy and reliability in predicting small ORF expression.
  • Model Deployment: The final model will be shared as part of this repository.
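As a possible starting point for the CNN, sequences are commonly one-hot encoded into fixed-length matrices. The sketch below is one option among several and not a decision made in this project: the function name, the zero padding, and the treatment of ambiguous bases such as N are all assumptions.

```python
from typing import List

ALPHABET = "ACGT"


def one_hot(sequence: str, length: int) -> List[List[int]]:
    """One-hot encode a DNA sequence, truncated or zero-padded to `length` rows."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    rows = []
    for base in sequence.upper()[:length]:
        row = [0] * len(ALPHABET)
        if base in index:  # unknown bases (e.g. N) stay as an all-zero row
            row[index[base]] = 1
        rows.append(row)
    # Pad short sequences with all-zero rows so every example has the same shape.
    rows += [[0] * len(ALPHABET) for _ in range(length - len(rows))]
    return rows


encoded = one_hot("ACGTN", 6)
for row in encoded:
    print(row)
```

A fixed `length` is needed because convolutional layers are usually trained on batches of equally sized inputs; the right value would depend on the longest sORF kept in the dataset.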

Bibliography and Related Articles

Readings for this project

Population Genomic Datasets

  • Title: Population Genomic Datasets Describing the Post-Vaccine Evolutionary Epidemiology of Streptococcus pneumoniae
  • Link: Nature Scientific Data

Transcriptome Analysis

  • Title: High-Resolution Analysis of the Pneumococcal Transcriptome Under a Wide Range of Infection-Relevant Conditions
  • Link: PubMed

Genomic Language Models

  • Title: ProkBERT Family: Genomic Language Models for Microbiome Applications

  • Link: Frontiers in Microbiology

  • Title: DNABERT: Pre-trained Bidirectional Encoder Representations from Transformers Model for DNA-language in Genome

  • Link: Bioinformatics Journal

Small ORFs and Microproteins

  • Title: sORFdb – A Database for sORFs, Small Proteins, and Small Protein Families in Bacteria

  • Link: bioRxiv

  • Title: Short Open Reading Frames (sORFs) and Microproteins: An Update on Their Identification and Validation Measures

  • Link: Journal of Biomedical Science

DNA Sequence Classification

  • Title: DNA Sequence Classification by Convolutional Neural Network
  • Link: ResearchGate

Machine Learning in Genomics

  • Title: Apply Machine Learning Algorithms for Genomics Data Classification

  • Authors: Ernest Bonat, Ph.D., Bishes Rayamajhi, MS.

  • Date: February 03, 2021

  • Title: Advanced DNA Sequence Text Classification Using Natural Language Processing

  • Link: Ernest Bonat on Medium

BERT and Deep Learning Resources

  • Title: Slides for BERT, Stanford

  • Link: Stanford NLP Seminar

  • Title: Deep Learning in Genomics Primer (Tutorial)

  • Link: GitHub Tutorial

  • Title: Application of BERT to Enable Gene Classification Based on Clinical Evidence

  • Link: NCBI