Skip to content

Mathematics Question Text Classification with Support Vector Machine (SVM) and Multinomial Naïve Bayes (MNB)

Notifications You must be signed in to change notification settings

manuelbenedict/TextClassification_SVM_MNB

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

37 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Mathematics Question Text Classification with Support Vector Machine (SVM) and Multinomial Naïve Bayes (MNB)

About the Project

This is my Practical Work course research project as a research assistant for Mrs. Hasmawati, S.Kom., M.Kom., one of lecturer in Data Science at School of Computing, Telkom University. The research is about text classification implementation for Mathematics multiple choice questions in elementary and junior high schools with machine learning model, i.e. Support Vector Machine (SVM) and Multinomial Naïve Bayes (MNB). The question text is in Bahasa Indonesia. All questions will classify in three categories, i.e. easy, medium, and hard.

Tools: Google Colab or Jupyter Notebook
Programming language: Python

Dataset

The dataset is prepared by lecturer, which consist of Mathematics multiple choice questions for elementary and senior high schools. As a research assistant, one of my responsibilities is managed data with removing the duplicates data and invalid questions. Invalid questions contains the senteces or phrases which refer to tables and/or pitcure, e.g. "Perhatikan gambar berikut", "Perhatikan tabel berikut", "Dari gambar disamping", etc. Next, every data needs to be labelled. There are three labels for the question classification, i.e. easy, medium, and hard. There are a guide to label the dataset. The guide is in Bahasa Indonesia (Access the PDF with title "Labelling Guide").

Preprocessing Data

The dataset needs to be prepared before it will be used into the models. For the Natural Language Processing (NLP) problem in general, there are some phases for preprocessing data, i.e. removing characters (punctuation, numbers, and symbols), removing stopwords, and stemming. Stemming text in Bahasa Indonesia is supported by Sastrawi library. After the preprocessing, the final is checking duplicates for once to keep the dataset variety.

Feature Extraction

The feature extraction purpose is to transform the text feature into numerical feature that can be processed. This project tries two feature extractions, i.e. Bag of Words (BoW) and TF-IDF (Term Frequency-Inverse Document Frequency). The BoW is the most basic feature extraction and earliest method, which the number feature is counted of each word (token) in a document. The TF-IDF is the feature extraction which is based on word occurrence frequency in the document. Both of feature extractions are supported by Scikit Learn library.

Splitting Data

The dataset is splitted into training data and testing data with ratio 80:20 that means 80% of training data and 20% of testing data. Scikit learn library is also used for spiltting data.

Machine Learning Models

Support Vector Machine (SVM)

Support Vector Machine (SVM) is one of the regression methods or classification of data based on previous data and modeling supervised first. SVM can conduct training and create a model that will be used for classification and prediction. SVM performs training by creating a line and forming a hyperplane and creating margins for each label by taking the vector (which will be referred to as the support vector) that is closest to the margins for the two labels. SVM has the advantage of grouping high-dimensional textual data. This algorithm includes the most superior machine learning algorithms in the classification. This is evidenced by the results of 5 previous studies using SVM for classification, especially text classification.

Multinomial Naïve Bayes (MNB)

Multinomial Naïve Bayes (MNB) is the development of the Naïve Bayes model which produces a probability value of the frequency or number of words that appear in a sentence. This algorithm is a Naïve Bayes algorithm which is often used in text classification. This method, whose features are assumed from a simple multinomial distribution, has the main characteristic of being a strong (naïve) assumption of the independence between variables. This method also utilizes Bayes Theorem and data mining functionality, namely Naïve Bayesian Classification. MNB is able to consider the frequency of each word that appears in the document.

Result

Model Random state Level Accuracy Precision Recall F-1 Score
SVM with TF-IDF 5 Easy 75.00% 85.00% 65.00% 74.00%
Medium 68.00% 91.00% 78.00%
Hard 100.00% 44.00% 62.00%
SVM with BoW Easy 66.18% 70.00% 54.00% 61.00%
Medium 66.00% 76.00% 70.00%
Hard 60.00% 67.00% 63.00%
MNB with TF-IDF 13 Easy 60.29% 88.00% 33.00% 48.00%
Medium 57.00% 97.00% 72.00%
Hard 0.00% 0.00% 0.00%
MNB with BoW Easy 72.06% 75.00% 71.00% 73.00%
Medium 71.00% 83.00% 76.00%
Hard 71.00% 42.00% 53.00%

Conclusion

Feature extraction and random state used when splitting the data affect the performance results built. TF-IDF and BoW testing, as well as random circumstances resulted in the following conclusions.

  • With TF-IDF, the SVM model has higher performance, with an accuracy of 75% compared to the MNB, with an accuracy of 60.29%, in classifying Mathematics problems.
  • With BoWr, the MNB model has a higher performance, with an accuracy of 72.06%, compared to the SVM, with an accuracy of 66.18%, in classifying Mathematics problems.
  • Overall, the SVM model with TF-IDF has the highest performance, by classifying questions Mathematics.

About

Mathematics Question Text Classification with Support Vector Machine (SVM) and Multinomial Naïve Bayes (MNB)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published