riboflow is a Python package for classifying putative riboswitch sequences into one of 32 classes with > 99% accuracy. It is based on a TensorFlow deep learning model. riboflow has been tested using Python 3.5.2.
The pip package was derived from this source code. The source code of rnnApp.py and cnnApp.py can easily be altered, by changing the number of neural network layers and the hyperparameters, to help achieve better accuracy when the riboswitch labels change or when the number of classes increases or decreases during dataset construction.
1. original_datasets
a. 32_riboswitches_fasta --> FASTA format of the 32 riboswitches
b. 32_riboswitches_new_csv --> CSV format of the 32 riboswitches
2. processed_datasets
a. final_32classes.csv --> Original dataset of the 32 riboswitches, cleaned and with frequencies calculated
b. final_32train.csv --> 90% of each riboswitch class's instances in final_32classes.csv
c. final_32test.csv --> Remaining 10% of each riboswitch class's instances in final_32classes.csv
3. models
Contains the RNN and CNN models in h5 format
4. preprocess.py
Contains utilities for train:test splitting of the dataset, loading the datasets, and other preprocessing of the data. It can be used to generate k-mer frequencies, final_train.csv and final_test.csv, and is used for data preprocessing by all the models (i.e., both the base and the deep learning models: baseModels.py, rnnApp.py and cnnApp.py)
5. multiclassROC.py
Used for the ROC analysis of all models (i.e., the base and the deep learning models).
6. dynamic.py
Implements a routine to enable dynamic deep learning for new riboswitch classes. It can be used on riboswitch FASTA files with any number of classes to generate the equivalent processed CSV files containing the sequence and its k-mer (for now, mono- and di-) frequencies. These files can then be used by baseModels.py, rnnApp.py and cnnApp.py for training.
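The two preprocessing ideas described above (k-mer frequencies and a 90:10 split taken within each class) can be sketched in plain Python as follows. This is a minimal illustration, not the code in preprocess.py or dynamic.py: the function names `kmer_frequencies` and `split_per_class`, the `ACGT` alphabet and the fixed random seed are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict
from itertools import product

ALPHABET = "ACGT"  # assumption: sequences use the DNA alphabet; use "ACGU" for RNA


def kmer_frequencies(sequence, k):
    """Return the relative frequency of every possible k-mer in `sequence`.

    Windows containing characters outside ALPHABET (e.g. N) still count
    towards the total, but get no entry of their own.
    """
    sequence = sequence.upper()
    windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(windows)
    total = len(windows)
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers}


def split_per_class(rows, label_of, test_fraction=0.1, seed=0):
    """Hold out `test_fraction` of each class's rows (at least one) for testing."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label][:]
        rng.shuffle(members)
        n_test = max(1, round(len(members) * test_fraction))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Mono- (k=1) and di-nucleotide (k=2) frequencies for one sequence
mono = kmer_frequencies("ACGTAC", 1)
di = kmer_frequencies("ACGTAC", 2)

# 90:10 split taken separately within each class
rows = [("seqA%d" % i, "classA") for i in range(10)]
rows += [("seqB%d" % i, "classB") for i in range(20)]
train_rows, test_rows = split_per_class(rows, label_of=lambda row: row[1])
```

Splitting within each class, rather than over the whole dataset at once, keeps rare riboswitch classes represented in both the training and the test file.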
> python3 baseModels.py
1. Creates a pickled model for each of the sklearn classifiers listed below:
AdaBoostClassifier(),
GaussianNB(),
KNeighborsClassifier(),
DecisionTreeClassifier(),
RandomForestClassifier(),
MLPClassifier()
2. Each model is used on the test set to obtain the accuracy and to generate a classification report and the ROC-AUC values for each of the 32 classes.
3. MLPClassifier() proved to be the best among the chosen sklearn classifiers, and hence neural networks (CNN and RNN) were explored further to achieve greater accuracy.
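The per-class ROC-AUC values mentioned in step 2 are commonly obtained with a one-vs-rest scheme: each class in turn is treated as the positive class and all others as negative. The sketch below illustrates that idea in plain Python. It is not the code in baseModels.py or multiclassROC.py, and the names `binary_auc` and `one_vs_rest_auc` are hypothetical.

```python
def binary_auc(scores, is_positive):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. The pairwise loop is quadratic, which is fine
    for illustration but slow on large test sets.
    """
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(y_true, y_score, n_classes):
    """Return {class index: AUC}; y_score[i][c] is the score of class c for sample i."""
    aucs = {}
    for c in range(n_classes):
        class_scores = [row[c] for row in y_score]
        is_positive = [label == c for label in y_true]
        aucs[c] = binary_auc(class_scores, is_positive)
    return aucs


# Toy example with 3 classes and perfectly separated scores
y_true = [0, 1, 2, 0, 1, 2]
y_score = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
]
per_class_auc = one_vs_rest_auc(y_true, y_score, n_classes=3)
```

For real evaluations, sklearn's `roc_auc_score` computes the same quantity far more efficiently.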
> python3 rnnApp.py
1. Creates a .h5 RNN model using Keras with TensorFlow
2. The model is used on the test set to obtain the accuracy and to generate a classification report and the ROC-AUC values for each of the 32 classes.
3. Achieves an accuracy of 99% on the test set.
4. New layers can be added and hyperparameter values changed when dealing with a dataset that has a different number of classes.
5. The training time is fairly long (on the order of hours; best suited to high-spec systems)
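A recurrent model typically consumes each sequence as a fixed-length list of integers. The sketch below shows one way to encode and pad nucleotide sequences for that purpose. This README does not say how rnnApp.py encodes its input, so the mapping `BASE_TO_INDEX`, the `UNKNOWN_INDEX` value and the function `encode_sequence` are assumptions made purely for illustration.

```python
# Assumed mapping: 0 is reserved for padding, U is treated like T
BASE_TO_INDEX = {"A": 1, "C": 2, "G": 3, "T": 4, "U": 4}
UNKNOWN_INDEX = 5  # any other character, e.g. N


def encode_sequence(sequence, max_len):
    """Map each base to an integer, then truncate or right-pad with 0 to `max_len`."""
    indices = [BASE_TO_INDEX.get(base, UNKNOWN_INDEX) for base in sequence.upper()]
    indices = indices[:max_len]
    return indices + [0] * (max_len - len(indices))


short = encode_sequence("acgn", 6)      # padded on the right
long_ = encode_sequence("ACGUACGU", 4)  # truncated to max_len
```

Reserving index 0 for padding lets a masking-aware layer ignore the padded positions.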
> python3 cnnApp.py
1. Creates a .h5 CNN model using Keras with TensorFlow
2. The model is used on the test set to obtain the accuracy and to generate a classification report and the ROC-AUC values for each of the 32 classes.
3. Achieves an accuracy of 97% on the test set.
4. New layers can be added and hyperparameter values changed when dealing with a dataset that has a different number of classes.
5. The training time is fairly short (< 1 min; suitable for low-spec systems)
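The accuracy and classification report that each script produces in step 2 boil down to a handful of counts per class. As a rough sketch of what such a report contains (not the scripts' actual code; `per_class_report` is a hypothetical name), precision, recall and F1 can be computed like this:

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return (accuracy, {label: {"precision", "recall", "f1"}})."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this class, but it was wrong
            fn[truth] += 1  # missed an instance of the true class

    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}

    accuracy = sum(tp.values()) / len(y_true)
    return accuracy, report


# Toy example with 3 classes and two misclassified samples
accuracy, report = per_class_report(
    y_true=[0, 0, 1, 1, 2, 2],
    y_pred=[0, 1, 1, 1, 2, 0],
)
```

Looking at per-class recall alongside overall accuracy helps reveal whether a small riboswitch class is being misclassified even when the headline accuracy is high.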
Premkumar KAR, Bharanikumar R, Palaniappan A#. (2020) Riboflow: using deep learning to classify riboswitches with ~99% accuracy. Frontiers in Bioengineering and Biotechnology 8: 808. DOI: 10.3389/fbioe.2020.00808
- Keshav Aditya R.P*
- Ramit Bharanikumar*
- Ashok Palaniappan
Copyright (c) 2019, the Authors. MIT License.