Mathematics Question Text Classification with Support Vector Machine (SVM) and Multinomial Naïve Bayes (MNB)

About the Project

This is my Practical Work course research project, carried out as a research assistant for Mrs. Hasmawati, S.Kom., M.Kom., a Data Science lecturer at the School of Computing, Telkom University. The research implements text classification for Mathematics multiple-choice questions from elementary and junior high schools using two machine learning models: Support Vector Machine (SVM) and Multinomial Naïve Bayes (MNB). The question text is in Bahasa Indonesia. Each question is classified into one of three categories: easy, medium, or hard.

Tools: Google Colab or Jupyter Notebook
Programming language: Python

Dataset

The dataset was prepared by the lecturer and consists of Mathematics multiple-choice questions for elementary and junior high schools. As a research assistant, one of my responsibilities was to clean the data by removing duplicate and invalid questions. Invalid questions contain sentences or phrases that refer to tables and/or pictures, e.g. "Perhatikan gambar berikut" ("Look at the following picture"), "Perhatikan tabel berikut" ("Look at the following table"), "Dari gambar disamping" ("From the picture alongside"), etc. Next, every question needs to be labelled. There are three labels for question classification: easy, medium, and hard. A labelling guide, written in Bahasa Indonesia, is provided (see the PDF titled "Labelling Guide").
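The cleaning step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the pattern list is built from the three example phrases quoted above (the real list is presumably longer), duplicates are compared case-insensitively, and the function name `clean_questions` and the sample questions are hypothetical, not taken from the project.

```python
import re

# Patterns derived from the example phrases above; assumed, not exhaustive.
INVALID_PATTERNS = [
    r"perhatikan gambar",
    r"perhatikan tabel",
    r"dari gambar",
]
INVALID_RE = re.compile("|".join(INVALID_PATTERNS), flags=re.IGNORECASE)


def clean_questions(questions):
    """Drop questions that refer to a picture or table, then drop duplicates."""
    seen = set()
    kept = []
    for question in questions:
        text = " ".join(question.split())  # collapse repeated whitespace
        if INVALID_RE.search(text):
            continue  # invalid: depends on a picture or table we do not have
        key = text.lower()  # case-insensitive duplicate check
        if key in seen:
            continue
        seen.add(key)
        kept.append(text)
    return kept


# Hypothetical sample questions, for illustration only.
sample = [
    "Hasil dari 12 + 8 adalah ...",
    "Perhatikan gambar berikut! Luas bangun tersebut adalah ...",
    "Hasil dari 12 + 8 adalah ...",
    "Dari gambar disamping, besar sudut A adalah ...",
    "Nilai dari 3 x 7 adalah ...",
]
cleaned = clean_questions(sample)
print(cleaned)
```

Only the first and last sample questions survive: two are rejected for referring to a picture, and one is a duplicate.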

Preprocessing Data

The dataset must be preprocessed before it is fed to the models. For Natural Language Processing (NLP) problems in general, preprocessing has several phases: removing characters (punctuation, numbers, and symbols), removing stopwords, and stemming. Stemming for Bahasa Indonesia text is supported by the Sastrawi library. After preprocessing, the data is checked for duplicates once more to preserve the variety of the dataset.
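A minimal sketch of these phases is shown below. To keep it runnable without extra packages, it uses a tiny hand-written stopword set and an identity "stemmer"; both are placeholders and are not what the project uses. The function name `preprocess` and its parameters are assumptions made for this example. The trailing comment shows how a Sastrawi stemmer could be plugged in instead.

```python
import re

# Tiny illustrative stopword set; NOT the Sastrawi stopword list.
DEMO_STOPWORDS = {"dari", "adalah", "yang", "dan", "di", "ke"}


def preprocess(text, stopwords=DEMO_STOPWORDS, stem=lambda word: word):
    """Lowercase, strip non-letters, remove stopwords, then stem each token."""
    text = text.lower()
    # Replace punctuation, digits, and symbols with spaces.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [tok for tok in text.split() if tok not in stopwords]
    return " ".join(stem(tok) for tok in tokens)


print(preprocess("Hasil dari 12 + 8 adalah ..."))
print(preprocess("Keliling persegi yang sisinya 5 cm adalah ..."))

# With Sastrawi installed, a real Indonesian stemmer could be passed in:
#   from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
#   stemmer = StemmerFactory().create_stemmer()
#   preprocess(question, stem=stemmer.stem)
```

Passing the stemmer and stopword set as parameters keeps the cleaning logic independent of any one library, which also makes it easy to test.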

Feature Extraction

The purpose of feature extraction is to transform text features into numerical features that the models can process. This project tries two feature extraction methods: Bag of Words (BoW) and TF-IDF (Term Frequency-Inverse Document Frequency). BoW is the earliest and most basic method: each feature is the count of a word (token) in a document. TF-IDF also starts from the frequency of a word in a document, but scales it down for words that occur in many documents, so that distinctive words receive more weight. Both methods are supported by the scikit-learn library.
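To make the difference between the two methods concrete, here is a small pure-Python sketch. It is not the project's code, which relies on scikit-learn. The IDF formula used, ln((1 + n) / (1 + df)) + 1, is the smoothed variant that scikit-learn documents as the default for `TfidfVectorizer`; this sketch omits the L2 normalization that the library applies afterwards. The function names and the three toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return the sorted vocabulary and a document-by-term count matrix."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    counts = [Counter(doc.split()) for doc in docs]
    matrix = [[count[term] for term in vocab] for count in counts]
    return vocab, matrix


def tf_idf(docs):
    """Weight raw counts by a smoothed inverse document frequency."""
    vocab, bow = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = [sum(1 for row in bow if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weights = [[tf * idf[j] for j, tf in enumerate(row)] for row in bow]
    return vocab, weights


# Toy documents (already preprocessed), for illustration only.
docs = ["luas persegi", "keliling persegi", "luas lingkaran"]
vocab, bow = bag_of_words(docs)
_, weights = tf_idf(docs)
print(vocab)
print(bow)
print(weights)
```

In the second document, "keliling" occurs in only one document while "persegi" occurs in two, so TF-IDF gives "keliling" the larger weight even though both have a raw count of 1.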

Splitting Data

The dataset is split into training and testing data with a ratio of 80:20, i.e. 80% training data and 20% testing data. The scikit-learn library is also used for splitting the data.
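The idea of a seeded 80:20 split can be sketched in plain Python as below. With scikit-learn, the equivalent call would be along the lines of `train_test_split(X, y, test_size=0.2, random_state=5)`; the exact arguments used in the project are not shown in this README, so treat that call as an assumption. The helper `split_80_20` is hypothetical and will not reproduce scikit-learn's exact partition for the same seed.

```python
import random


def split_80_20(items, labels, seed):
    """Shuffle indices with a fixed seed, then cut at 80%."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # same seed gives the same order
    cut = int(round(len(indices) * 0.8))
    train_idx, test_idx = indices[:cut], indices[cut:]
    train = [(items[i], labels[i]) for i in train_idx]
    test = [(items[i], labels[i]) for i in test_idx]
    return train, test


# Ten placeholder questions with placeholder labels.
questions = [f"question {i}" for i in range(10)]
levels = ["easy", "medium", "hard", "easy", "medium",
          "hard", "easy", "medium", "hard", "easy"]

train, test = split_80_20(questions, levels, seed=5)
train_again, test_again = split_80_20(questions, levels, seed=5)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, while a different seed generally produces a different partition, which is one reason the random state can influence the reported scores.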

Machine Learning Models

Support Vector Machine (SVM)

Support Vector Machine (SVM) is a supervised learning method for classification or regression: it learns a model from previously labelled data. The trained model is then used for classification and prediction. During training, SVM finds a separating hyperplane (a line, in two dimensions) between two labels and maximizes the margin around it; the training vectors closest to the hyperplane, which lie on the margin boundaries, are called support vectors. SVM has the advantage of handling high-dimensional textual data well. It is regarded as one of the strongest machine learning algorithms for classification, as evidenced by the results of 5 previous studies that used SVM for classification, especially text classification.
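The notions of hyperplane, margin, and support vector can be illustrated with a hand-picked example. The sketch below does not train an SVM: the weights `w` and bias `b` are chosen by hand for four toy 2-D points, whereas in practice a library (for example scikit-learn's `sklearn.svm.SVC`, named here only as an assumption about tooling) solves for them. All names and data are illustrative.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted side."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def distance_to_hyperplane(w, b, x):
    """Geometric distance from x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, b, x)) / norm


# Four toy points with labels +1 / -1.
points = [((2.0, 0.0), +1), ((3.0, 1.0), +1),
          ((0.0, 0.0), -1), ((-1.0, 1.0), -1)]

# Hyperplane x1 = 1, picked by hand for this toy data (not learned).
w, b = (1.0, 0.0), -1.0

predictions = [1 if decision_value(w, b, x) >= 0 else -1 for x, _ in points]
distances = [distance_to_hyperplane(w, b, x) for x, _ in points]

# The points closest to the hyperplane play the role of support vectors.
min_dist = min(distances)
support_vectors = [x for (x, _), d in zip(points, distances)
                   if math.isclose(d, min_dist)]
margin_width = 2 * min_dist

print(predictions)
print(support_vectors, margin_width)
```

Here the points (2, 0) and (0, 0) are the closest to the hyperplane on either side, so they act as the support vectors, and the margin between the two classes is 2 units wide.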

Multinomial Naïve Bayes (MNB)

Multinomial Naïve Bayes (MNB) is a development of the Naïve Bayes model that computes class probabilities from the frequency of the words that appear in a text. It is a Naïve Bayes variant that is often used in text classification. The features are assumed to follow a simple multinomial distribution, and the method's main characteristic is the strong (naïve) assumption that the features are independent of one another. The method applies Bayes' theorem and belongs to the data mining technique known as Naïve Bayesian classification. MNB takes into account how often each word appears in the document.
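A compact pure-Python sketch of the idea follows. It estimates a log prior per class and Laplace-smoothed log likelihoods per word, then picks the class with the highest total score. This is an illustration only: the project uses a library implementation, and the function names, the smoothing value `alpha=1.0`, the toy documents, and their labels are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def train_mnb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class word log likelihoods."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    model = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        log_prior = math.log(class_docs[label] / len(docs))
        # Laplace smoothing: unseen words get a small non-zero probability.
        log_like = {
            term: math.log((word_counts[label][term] + alpha)
                           / (total + alpha * len(vocab)))
            for term in vocab
        }
        model[label] = (log_prior, log_like)
    return model


def predict_mnb(model, doc):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label, (log_prior, log_like) in model.items():
        # Words outside the training vocabulary are ignored.
        scores[label] = log_prior + sum(
            log_like[tok] for tok in doc.split() if tok in log_like)
    return max(scores, key=scores.get)


# Toy training data with made-up labels, for illustration only.
docs = ["hasil tambah", "hasil kurang",
        "akar persamaan kuadrat", "persamaan kuadrat"]
labels = ["easy", "easy", "hard", "hard"]

model = train_mnb(docs, labels)
print(predict_mnb(model, "hasil tambah kurang"))
print(predict_mnb(model, "akar kuadrat"))
```

Because each occurrence of a word adds its log likelihood to the score, repeated words count more than once, which is how word frequency enters the decision.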

Result

Model	Random state	Level	Accuracy	Precision	Recall	F1 Score
SVM with TF-IDF	5	Easy	75.00%	85.00%	65.00%	74.00%
		Medium		68.00%	91.00%	78.00%
		Hard		100.00%	44.00%	62.00%
SVM with BoW		Easy	66.18%	70.00%	54.00%	61.00%
		Medium		66.00%	76.00%	70.00%
		Hard		60.00%	67.00%	63.00%
MNB with TF-IDF	13	Easy	60.29%	88.00%	33.00%	48.00%
		Medium		57.00%	97.00%	72.00%
		Hard		0.00%	0.00%	0.00%
MNB with BoW		Easy	72.06%	75.00%	71.00%	73.00%
		Medium		71.00%	83.00%	76.00%
		Hard		71.00%	42.00%	53.00%
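For reading the table, it may help to recall that the F1 score is the harmonic mean of precision and recall. The short sketch below recomputes it for the "SVM with TF-IDF, Easy" row. The helper `f1_score` is written for this example and is not the project's evaluation code. Since the table lists rounded precision and recall, a recomputed F1 can differ from the listed value by about one point on some rows.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# SVM with TF-IDF, Easy: precision 85%, recall 65%.
svm_tfidf_easy = round(f1_score(0.85, 0.65), 2)
# MNB with TF-IDF, Hard: precision 0%, recall 0%.
mnb_tfidf_hard = f1_score(0.0, 0.0)

print(svm_tfidf_easy, mnb_tfidf_hard)
```

The zero case corresponds to the "MNB with TF-IDF, Hard" row, where precision, recall, and F1 are all reported as 0.00%.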

Conclusion

The feature extraction method and the random state used when splitting the data both affect the performance of the resulting models. Testing TF-IDF and BoW under different random states led to the following conclusions.

With TF-IDF, the SVM model performs better, with an accuracy of 75.00% compared to 60.29% for MNB, in classifying Mathematics questions.
With BoW, the MNB model performs better, with an accuracy of 72.06% compared to 66.18% for SVM, in classifying Mathematics questions.
Overall, the SVM model with TF-IDF achieves the highest accuracy (75.00%) in classifying Mathematics questions.
