This project aims to develop a machine learning model to predict the expression of small Open Reading Frames (sORFs). Small ORFs are typically challenging to identify and predict due to their short length, but they play a significant role in various biological processes. By accurately predicting which sORFs are expressed, we can gain insights into their functional roles in the genome.
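The initial step in the project involved identifying potential small ORFs in a dataset of 616 pneumococcal genomes (see "Population Genomic Datasets Describing the Post-Vaccine Evolutionary Epidemiology of Streptococcus pneumoniae" in the reference list). The following steps were performed: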
- Tool Used: ggCaller
- Parameters: `--min-orf-length` = 19, `--min-orf-score` = 19
ggCaller was used to predict small ORFs from these genomes, yielding a total of 6,013 predicted genes.
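A minimal sketch of how long ORFs might be filtered out of the predicted genes, assuming the ggCaller calls are available as a nucleotide FASTA file. The 300-nucleotide cutoff (`MAX_SORF_LENGTH`) and the function names are hypothetical; the cutoff actually used in this project is not stated here.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical cutoff in nucleotides; the project's real threshold is not stated.
MAX_SORF_LENGTH = 300


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_short_orfs(lines: Iterable[str],
                      max_length: int = MAX_SORF_LENGTH) -> Dict[str, str]:
    """Keep only ORFs whose sequence is at most max_length nucleotides."""
    return {h: s for h, s in read_fasta(lines) if len(s) <= max_length}
```

Passing an open file handle to `filter_short_orfs` returns a dictionary of header to sequence for the retained ORFs, from which a length distribution can be plotted.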
Here is the distribution of the genes after filtering the long ORFs:
The next step was to align the transcripts from the study Aprianto R et al. 2018 to the identified small ORFs to determine which of these ORFs are expressed.
- Tool Used: Kallisto
- Index File: `short_ORFs.ffn`
- Parameters: k-mer length = 19
The alignment was performed using the cDNA expression data from the study. The dataset was accessed from the European Nucleotide Archive (ENA).
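The Kallisto step could be scripted roughly as follows. This sketch only builds the command lines; `-i`, `-k`, and `-o` are standard Kallisto options, but the index name, output directory, and read file names are hypothetical, and paired-end input is an assumption that should be checked against the ENA records.

```python
import shlex
from typing import List


def kallisto_index_cmd(fasta: str, index: str, kmer: int = 19) -> List[str]:
    """Build the `kallisto index` command for the short-ORF FASTA file."""
    # kallisto requires an odd k-mer length.
    if kmer % 2 == 0:
        raise ValueError("kallisto needs an odd k-mer length")
    return ["kallisto", "index", "-i", index, "-k", str(kmer), fasta]


def kallisto_quant_cmd(index: str, out_dir: str,
                       reads_1: str, reads_2: str) -> List[str]:
    """Build a paired-end `kallisto quant` command (paired-end is assumed)."""
    return ["kallisto", "quant", "-i", index, "-o", out_dir, reads_1, reads_2]


# Hypothetical file names, for illustration only.
index_cmd = kallisto_index_cmd("short_ORFs.ffn", "short_ORFs.idx")
quant_cmd = kallisto_quant_cmd("short_ORFs.idx", "kallisto_out",
                               "sample_1.fastq.gz", "sample_2.fastq.gz")
print(shlex.join(index_cmd))
print(shlex.join(quant_cmd))
# To execute: subprocess.run(index_cmd, check=True), then the same for quant_cmd.
```

Building the commands as lists keeps them easy to inspect before running and avoids shell-quoting problems with unusual file names.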
After the alignment, the output from Kallisto was analyzed, focusing on the `abundance.tsv` file.
The metrics in the `abundance.tsv` file include:
- eff_length: The effective length of the transcript, considering the fragment length distribution.
- est_counts: The estimated number of reads derived from this transcript.
The key metric for our analysis, however, is:
- TPM (Transcripts Per Million): A normalized measure of transcript abundance, allowing comparison of transcript levels within and between samples.

TPM is calculated in three steps:
- RPK (Reads Per Kilobase): Divide the read counts by the length of each gene in kilobases.
- Per Million Scaling Factor: Sum all the RPK values in a sample and divide by 1,000,000.
- TPM: Divide the RPK values by the "per million" scaling factor.
Note: TPM is preferred over RPKM/FPKM because it ensures that the sum of TPMs in each sample is the same, making cross-sample comparisons more reliable.
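The three-step TPM calculation can be sketched in a few lines of Python. Kallisto already reports TPM in `abundance.tsv` (computed from effective lengths), so this is purely illustrative; the function name and the example numbers are made up.

```python
from typing import List, Sequence


def compute_tpm(counts: Sequence[float],
                lengths_bp: Sequence[float]) -> List[float]:
    """Convert read counts to TPM given transcript lengths in base pairs."""
    # Step 1: reads per kilobase (RPK).
    rpk = [c / (length / 1000.0) for c, length in zip(counts, lengths_bp)]
    # Step 2: the "per million" scaling factor.
    scaling = sum(rpk) / 1_000_000
    if scaling == 0:
        # No reads at all: every transcript has TPM 0.
        return [0.0 for _ in rpk]
    # Step 3: divide each RPK value by the scaling factor.
    return [value / scaling for value in rpk]


# Made-up counts and lengths for three transcripts.
tpm = compute_tpm(counts=[10, 20, 0], lengths_bp=[100, 400, 250])
print(tpm)
```

Whatever the inputs, the returned values sum to one million (unless all counts are zero), which is the property that makes TPM comparable across samples.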
Here is a distribution of TPM values:
To determine whether a gene is expressed (True) or not (False), a TPM cutoff of 0 was used.
- True Genes: Genes with a TPM > 0.
- False Genes: Genes with a TPM = 0.
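As a rough illustration of this cutoff, the snippet below splits gene IDs by TPM using the standard `abundance.tsv` columns (`target_id`, `length`, `eff_length`, `est_counts`, `tpm`). The function name and the two-row example table are hypothetical.

```python
import csv
import io
from typing import List, TextIO, Tuple


def split_by_tpm(abundance: TextIO,
                 cutoff: float = 0.0) -> Tuple[List[str], List[str]]:
    """Return (expressed_ids, unexpressed_ids) from a kallisto abundance.tsv."""
    expressed, unexpressed = [], []
    for row in csv.DictReader(abundance, delimiter="\t"):
        if float(row["tpm"]) > cutoff:
            expressed.append(row["target_id"])
        else:
            unexpressed.append(row["target_id"])
    return expressed, unexpressed


# Tiny made-up table in the abundance.tsv column layout.
example = io.StringIO(
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "gene_1\t90\t72\t0\t0\n"
    "gene_2\t120\t102\t35\t8123.4\n"
)
expressed_ids, unexpressed_ids = split_by_tpm(example)
print(expressed_ids, unexpressed_ids)
```

In practice the function would be given an open handle on the real `abundance.tsv`, and the two lists could be written out one ID per line.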
The following steps were taken:
- Gene IDs were read from the `genes_with_zero_tpm.txt` and `genes_with_nonzero_tpm.txt` files.
- The `short_orfs.ffn` file was parsed, and each gene's sequence was modified based on its TPM status.
- The output was a modified `short_orfs.ffn` file with `0` (False) or `1` (True) appended under each gene's sequence.
The final training file, which will be used to train the machine learning model, looks something like this:
>gene_1
ATGCGT...TTGA
0
>gene_2
ATGCCA...TCAA
1
This file includes the sequence of each gene followed by a label indicating whether the gene is expressed (1) or not (0).
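One possible way to produce a file in this layout is sketched below. It assumes the gene ID is the first whitespace-delimited token of each FASTA header; the function names and the example sequences are hypothetical and are not taken from the project's scripts.

```python
from typing import Iterable, List, Optional, Set


def _label(gene_id: str, expressed_ids: Set[str]) -> str:
    """Return "1" for an expressed gene and "0" otherwise."""
    return "1" if gene_id in expressed_ids else "0"


def label_fasta(lines: Iterable[str], expressed_ids: Set[str]) -> List[str]:
    """Return FASTA lines with a 0/1 label line added after each sequence."""
    out: List[str] = []
    current_id: Optional[str] = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record: emit its label first.
            if current_id is not None:
                out.append(_label(current_id, expressed_ids))
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
        out.append(line)
    # Label the final record.
    if current_id is not None:
        out.append(_label(current_id, expressed_ids))
    return out


# Made-up sequences; gene_2 is treated as expressed (TPM > 0).
labeled = label_fasta(
    [">gene_1", "ATGCGTTGA", ">gene_2", "ATGCCATAA"],
    expressed_ids={"gene_2"},
)
print("\n".join(labeled))
```

Because the label is emitted only when a record closes, the sketch also handles sequences wrapped over several lines.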
This project is still in progress. The next steps are:
- Model Training: The labeled dataset will be used to train two machine learning models: a CNN (model 1) and BERT (model 2).
- Model Evaluation: Each model will be evaluated using various metrics to ensure its accuracy and reliability in predicting small ORF expression.
- Model Deployment: The final model will be shared as part of this repository.
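The input encoding for the models has not been fixed yet. As one hedged possibility for the CNN, sequences could be one-hot encoded and zero-padded to a common length, as in the sketch below. The function name, the `ACGT` channel order, and the padding length are assumptions; BERT-style models would normally use their own tokenizer instead.

```python
from typing import List

ALPHABET = "ACGT"  # assumed channel order


def one_hot_encode(sequence: str, max_length: int) -> List[List[int]]:
    """Encode a DNA sequence as a max_length x 4 matrix of 0/1 values."""
    matrix: List[List[int]] = []
    # Sequences longer than max_length are truncated.
    for base in sequence.upper()[:max_length]:
        row = [0, 0, 0, 0]
        index = ALPHABET.find(base)
        if index >= 0:  # ambiguous bases such as N stay all-zero
            row[index] = 1
        matrix.append(row)
    # Zero-pad so every encoded sequence has the same shape.
    matrix.extend([0, 0, 0, 0] for _ in range(max_length - len(matrix)))
    return matrix


encoded = one_hot_encode("ATGN", max_length=6)
for row in encoded:
    print(row)
```

A fixed output shape is what lets sequences of different lengths be stacked into a single batch for a convolutional model.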
References:
- Title: Population Genomic Datasets Describing the Post-Vaccine Evolutionary Epidemiology of Streptococcus pneumoniae
- Link: Nature Scientific Data
- Title: High-Resolution Analysis of the Pneumococcal Transcriptome Under a Wide Range of Infection-Relevant Conditions
- Link: PubMed
- Title: ProkBERT Family: Genomic Language Models for Microbiome Applications
- Title: DNABERT: Pre-trained Bidirectional Encoder Representations from Transformers Model for DNA-language in Genome
- Link: Bioinformatics Journal
- Title: sORFdb – A Database for sORFs, Small Proteins, and Small Protein Families in Bacteria
- Link: bioRxiv
- Title: Short Open Reading Frames (sORFs) and Microproteins: An Update on Their Identification and Validation Measures
- Title: DNA Sequence Classification by Convolutional Neural Network
- Link: ResearchGate
- Title: Apply Machine Learning Algorithms for Genomics Data Classification
- Authors: Ernest Bonat, Ph.D., Bishes Rayamajhi, MS.
- Date: February 03, 2021
- Title: Advanced DNA Sequence Text Classification Using Natural Language Processing
- Link: Ernest Bonat on Medium
- Title: Slides for BERT, Stanford
- Link: Stanford NLP Seminar
- Title: Deep Learning in Genomics Primer (Tutorial)
- Link: GitHub Tutorial
- Title: Application of BERT to Enable Gene Classification Based on Clinical Evidence
- Link: NCBI