This repository provides the source code and example datasets needed to reproduce the test environment for each model introduced in the paper "Streamlined Bacterial Essential Gene Prediction Using Only Protein Sequence Embedding", along with gene essentiality datasets for individual strains. The example code and data can be used to run protein sequence embedding and essential gene prediction and, with minor modifications, to make predictions on your own data.
- Protein Sequence Is All You Need: Predict bacterial essential genes using only their protein sequences, without integrating complex multi-feature data.
- Extended Bacterial Essential Gene Dataset: Experimental essentiality data (features: 'essentiality', 'protein_seq', 'dna_seq', 'genome_id', 'locus_tag', etc.) for approximately 280,000 bacterial genes collected from 79 studies.
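
As a rough illustration of how the listed fields might be used, the sketch below counts essential genes per genome. It assumes the data can be read as CSV rows with the column names quoted above and that 'essentiality' is encoded as 1 (essential) or 0 (non-essential). Neither the file format nor the encoding is confirmed here, and the rows are made-up placeholders, not records from the dataset. The helper name `count_essential` is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows for illustration only; these are NOT records from the dataset.
# Assumption: CSV layout and a 1/0 encoding of 'essentiality'.
TOY_CSV = """genome_id,locus_tag,essentiality,protein_seq
genome_A,tag_001,1,MKTAYIAK
genome_A,tag_002,0,MSLLTEVE
genome_B,tag_101,1,MADQLTEE
"""

def count_essential(rows):
    """Return {genome_id: (n_essential, n_total)} for an iterable of dict rows."""
    counts = defaultdict(lambda: [0, 0])
    for row in rows:
        entry = counts[row["genome_id"]]
        entry[0] += int(row["essentiality"])  # assumes 1 = essential, 0 = non-essential
        entry[1] += 1
    return {genome: tuple(pair) for genome, pair in counts.items()}

summary = count_essential(csv.DictReader(io.StringIO(TOY_CSV)))
print(summary)
```

To apply the same function to a real file, pass `csv.DictReader(open(path, newline=""))` instead of the in-memory toy data, after adjusting the column names and encoding to match the actual files.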
- data/raw_data/: Essential gene datasets (including non-essential genes) for each strain.
- data/test_exam/: Example test datasets consisting of genes from the E. coli Keio collection.
- models/: Models for predicting essential genes ('classifier ~') or encoding protein sequences ('embed_custom').
- results/: Model evaluation, prediction results, and model training history.
- sources/: Jupyter notebooks for sequence embedding ('emb ~') or model testing and prediction ('test ~').
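
After cloning, the layout described above can be checked with a few lines of Python. This is a minimal sketch: the directory names come from the list above, while the helper name `missing_dirs` and the assumption that the script runs from the repository root are illustrative.

```python
from pathlib import Path

# Directory names taken from the layout described above.
EXPECTED_DIRS = ["data/raw_data", "data/test_exam", "models", "results", "sources"]

def missing_dirs(repo_root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [name for name in expected if not (root / name).is_dir()]

if __name__ == "__main__":
    # "." assumes the script is run from the repository root.
    print(missing_dirs("."))
```

An empty list means every expected directory is present.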
- Clone the repository:

      git clone https://github.com/sblabkribb/essprotseq.git
      cd essprotseq

- Install dependencies:

      pip install -r requirements.txt

- Set options (data_path, etc.) in each source notebook:

      # Set options (example of 'test-indiv_class.ipynb')
      embed_ver = ["clstm", "esm2", "bert", "t5"]
      data_path = "../data/test_exam/"
      model_path = "../models/classifier_indiv/"
      result_path = "../results/"

- Run the notebook.
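
Before running a notebook, the options can be sanity-checked. The sketch below is not part of the repository: `check_options` is a hypothetical helper, the set of valid embedding names is taken from the example options, and it is an assumption that a notebook accepts any subset of them.

```python
from pathlib import Path

# Embedding names as they appear in the example options; whether a notebook
# accepts a subset of them is an assumption of this sketch.
KNOWN_EMBEDDINGS = ("clstm", "esm2", "bert", "t5")

def check_options(embed_ver, result_path):
    """Reject unknown embedding names and make sure the result directory exists."""
    unknown = [name for name in embed_ver if name not in KNOWN_EMBEDDINGS]
    if unknown:
        raise ValueError(f"Unknown embedding version(s): {unknown}")
    out_dir = Path(result_path)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the output directory if missing
    return out_dir
```

For example, `check_options(["esm2", "t5"], "../results/")` returns the result directory as a `Path`, while a misspelled name such as `"esm3"` raises a `ValueError` before any model is loaded.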
To cite this work, please reference:
Seongbo Heo et al. "Streamlined Bacterial Essential Gene Prediction Using Only Protein Sequence Embedding" Synthetic Biology Research Center, KRIBB.
This project was supported by the Korea Research Institute of Bioscience and Biotechnology (KRIBB) and the National Research Foundation of Korea.