Prediction model for non-coding RNA using Support Vector Machine (SVM)
- Python 3
- g++ (with C++ Standard Template Library C++20)
- Java
- Perl
- Argparse (python library)
- ThunderSVM (https://github.com/Xtra-Computing/thundersvm)
- RNAfold (https://github.com/ViennaRNA/ViennaRNA/). For installation, please refer to https://github.com/ViennaRNA/ViennaRNA/#installation
NCodR calls ViennaRNA as a Python library. Therefore, after installing ViennaRNA, make sure that the `RNA` module is included in `PYTHONPATH`. The `RNA` directory is generally located at `/usr/local/lib/python3.8/site-packages/`. In that case, add that path to `PYTHONPATH` in `.bashrc` (or in `.bash_profile`).
- genRNAStats.pl (https://web.bii.a-star.edu.sg/~stanley/Suppl_material4/genRNAStats.pl)
A modified version of `genRNAStats.pl` (with reduced verbosity) is provided with the package. Therefore, it is not necessary to download the original source code.
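A quick way to confirm that the `RNA` module mentioned above is visible to Python is to ask the import system whether it can locate it. The following is a minimal sketch using only the standard library; the helper name `module_available` is hypothetical and not part of NCodR.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module called `name` can be found on the import path."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # "RNA" is the module name given in this README for the ViennaRNA bindings.
    if module_available("RNA"):
        print("RNA module found")
    else:
        print("RNA module not found; check PYTHONPATH")
```

If the module is reported as missing, the directory containing `RNA` is probably not on `PYTHONPATH` for the Python interpreter being used.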
- Download or clone the NCodR from the gitlab repository (https://gitlab.com/sunandanmukherjee/ncodr.git)
- If downloaded, extract the ncodr folder.
- Open a terminal and navigate to the `ncodr` folder.
- Type `make` in the terminal. Upon successful compilation, it will create a new `bin` directory.
- Add the path to your `.bashrc` as `NCODR_HOME`. This can be done using the following command from the `ncodr` directory:
echo "export NCODR_PATH=`pwd`" >> ~/.bashrc
- To access `NCodR` from elsewhere, its path should be added to `.bashrc`.
NCodR can be used as a standalone tool. However, to run it locally, the dependencies and the third-party tools listed above must be installed or downloaded.
Usage:
positional arguments:
input Input file name [RNA sequences in fasta format].
optional arguments:
-h, --help show this help message and exit
-r, --redundant remove redundant sequence [default = OFF]
-c, --clean clean the intermediate files [default = ON]
-o OUTPUT, --output OUTPUT
Output file name [<input>.ncodr if not provided]
Example:
./NCodR.py <input file>
The input file should be provided in FASTA format. Upon successful completion, the program prints
the output in the terminal and also saves it in a text file with the .ncodr
extension.
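The command-line interface described above can be sketched with `argparse`. This is an illustration under stated assumptions, not the actual contents of `NCodR.py`: the function names are hypothetical, the behaviour of `-c` (passing it switches cleaning off) is a guess based on the "[default = ON]" note, and the default output name is assumed to be the input path with `.ncodr` appended.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the options listed in the usage text."""
    parser = argparse.ArgumentParser(prog="NCodR.py")
    parser.add_argument("input", help="Input file name [RNA sequences in fasta format].")
    parser.add_argument("-r", "--redundant", action="store_true",
                        help="remove redundant sequence [default = OFF]")
    # Assumption: cleaning is ON unless -c is passed; the real script may differ.
    parser.add_argument("-c", "--clean", action="store_false", default=True,
                        help="clean the intermediate files [default = ON]")
    parser.add_argument("-o", "--output", default=None,
                        help="Output file name [<input>.ncodr if not provided]")
    return parser


def parse_args(argv):
    """Parse argv and fill in the default output name when -o is absent."""
    args = build_parser().parse_args(argv)
    if args.output is None:
        # Assumption: the default name is the input path plus ".ncodr".
        args.output = args.input + ".ncodr"
    return args


if __name__ == "__main__":
    demo = parse_args(["examples/test.fa"])
    print(demo.input, demo.output, demo.redundant, demo.clean)
```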
## Test run
A sample dataset is provided in the `examples/` directory, which can be used for a test run. For a quick test run, type the following command from the terminal:
./NCodR.py examples/test.fa
This should print the prediction results on the screen, as well as generate a file (test.fa.ncodr)
with the prediction results.
Both training and testing of the final model are done using ThunderSVM.
Eight different classification algorithms were compared to select the best approach. This comparison is implemented in the `model_comparison.py` script. The classification algorithms are used as implemented in the scikit-learn module.
The following Python libraries need to be installed to run these scripts:
- Numpy
- Scikit learn
- Pickle
- Joblib
- Matplotlib
- g++ compiler with C++11 support
- javac compiler
The programs can be compiled by running make. The binaries will be available in the bin folder.
make
For a fasta file named fastafile.fa, the first step is to convert it to single-line format using fasta1line.pl. RNAfold is then run to generate the secondary structure file in dot-bracket format (.b).
makeFasta1line.py fastafile.fa fastafile_1l.fa
RNAfold -d2 --noLP --noPS <fastafile_1l.fa >fastafile_1l.b
calc_AU_MFEI fastafile_1l.b
genRNAStats.pl can be used to calculate these parameters.
The number of repeats for SSR can be calculated using the repeats program in this directory.
repeats fastafile_1l.fa
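The single-line conversion mentioned as the first step of the pipeline above joins the wrapped sequence lines of each FASTA record into one line. The sketch below illustrates that idea in plain Python; it is not the conversion script shipped with the package, and the function name `fasta_to_single_line` is hypothetical.

```python
def fasta_to_single_line(text: str) -> str:
    """Return FASTA text in which each record's sequence occupies exactly one line."""
    records = []
    header = None
    chunks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)  # sequence line belonging to the current record
    if header is not None:
        records.append((header, "".join(chunks)))
    return "".join(f"{h}\n{s}\n" for h, s in records)


if __name__ == "__main__":
    wrapped = ">seq1\nACGU\nACGU\n>seq2\nGGCC\n"
    print(fasta_to_single_line(wrapped), end="")
```

Sequence lines that appear before the first header are ignored in this sketch; a stricter version could raise an error instead.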