Hoax Detection on Social Media with Convolutional Neural Network (CNN) and Support Vector Machine (SVM)
This is my Final Project research, completed for my bachelor's degree. The research is about hoax detection on social media (Twitter) with a Convolutional Neural Network (CNN) and a Support Vector Machine (SVM).
Tools: Google Spreadsheet, Google Colab, Jupyter Notebook
Programming language: Python
The data was crawled from Twitter using the Snscrape library. The research focused on two topics: the Ferdy Sambo Case and the Kanjuruhan Tragedy. After preprocessing, every item was labelled 0 for fact news or 1 for hoax news. Each news item was labelled by three people to reduce subjectivity. The distribution of hoax/fact news is shown in the table below.
| Topic | Category | Number of Data |
|---|---|---|
| Kanjuruhan Tragedy | Fact | 1,279 |
| Kanjuruhan Tragedy | Hoax | 1,420 |
| Ferdy Sambo Case | Fact | 11,588 |
| Ferdy Sambo Case | Hoax | 11,038 |
| **Total** | | **25,325** |
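As a reference, here is a minimal sketch of collecting tweets with Snscrape's Python API. The query string, date range, and row limit are illustrative assumptions, not the exact queries used in this research, and the tweet text attribute is named `content` in older Snscrape versions.

```python
import snscrape.modules.twitter as sntwitter
import pandas as pd

# Hypothetical query: Indonesian-language tweets about the Kanjuruhan tragedy
query = 'kanjuruhan lang:id since:2022-10-01 until:2022-12-31'

rows = []
for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
    if i >= 1000:  # illustrative cap on the number of tweets
        break
    rows.append({"date": tweet.date, "text": tweet.rawContent})

pd.DataFrame(rows).to_csv("kanjuruhan_tweets.csv", index=False)
```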
The table below lists each preprocessing phase with an example of its output.
| Preprocessing Phase | Text |
|---|---|
| Raw Data | Ratusan Orang Meninggal dalam Tragedi Kanjuruhan, Tersangkanya Hanya 6 |
| Data Cleaning | Ratusan Orang Meninggal dalam Tragedi Kanjuruhan Tersangkanya Hanya |
| Case Folding | ratusan orang meninggal dalam tragedi kanjuruhan tersangkanya hanya |
| Stopword Removal | ratusan orang meninggal tragedi kanjuruhan tersangkanya |
| Stemming | ratus orang tinggal tragedi kanjuruhan sangka |
| Tokenizing | ['ratus', 'orang', 'tinggal', 'tragedi', 'kanjuruhan', 'sangka'] |
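The exact libraries behind these phases are not restated here; the sketch below assumes Sastrawi, a common choice for Indonesian stopword removal and stemming, plus a simple regex for data cleaning.

```python
import re
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

stemmer = StemmerFactory().create_stemmer()
stopword_remover = StopWordRemoverFactory().create_stop_word_remover()

def preprocess(text):
    # Data cleaning: strip URLs, mentions, digits, and punctuation
    text = re.sub(r"http\S+|@\w+|[^a-zA-Z\s]", " ", text)
    # Case folding
    text = text.lower()
    # Stopword removal (Sastrawi's Indonesian stopword list)
    text = stopword_remover.remove(text)
    # Stemming (Nazief-Adriani algorithm via Sastrawi)
    text = stemmer.stem(text)
    # Tokenizing
    return text.split()

print(preprocess("Ratusan Orang Meninggal dalam Tragedi Kanjuruhan, Tersangkanya Hanya 6"))
# expected per the table above: ['ratus', 'orang', 'tinggal', 'tragedi', 'kanjuruhan', 'sangka']
```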
Feature extraction converts raw text into numeric features so the data can be processed without losing the meaning of the original text. TF-IDF is a feature extraction method often used in text processing: it weights each term by how often it appears in a document (term frequency, TF), scaled by how rare the term is across all documents (inverse document frequency, IDF), producing a sparse numeric vector for every document.
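A minimal TF-IDF sketch, assuming scikit-learn's `TfidfVectorizer`; the two example documents are illustrative. `ngram_range=(1, 2)` builds unigram + bigram features, the best-performing combination in Scenario III below.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "ratus orang tinggal tragedi kanjuruhan sangka",  # preprocessed tweets
    "orang tragedi kanjuruhan",
]

# Unigram + bigram features (the best combination found in Scenario III)
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)  # sparse matrix: documents x features

print(vectorizer.get_feature_names_out())
print(X.shape)
```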
Feature expansion is a method that replaces zero-valued features with the value of a semantically similar word, using similarity learned from a corpus. GloVe derives word vectors from how frequently words co-occur across a corpus; the ratio of co-occurrence probabilities can encode various forms of meaning and helps improve performance on word-analogy problems.
The output of the GloVe embedding is a ranking of similar words. This research uses two corpora to find word similarity: Tweet, a corpus built from the tweet text in the dataset, and Tweet + News, a corpus built from the tweet text plus news data. The news data is sourced from several Indonesian news portals: CNN Indonesia, Tempo, Koran Sindo, and Republika.
| Corpus | Number of Words |
|---|---|
| Tweet | 20,733 |
| Tweet + News | 96,358 |
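The exact expansion rule is not restated here, so the sketch below assumes a common formulation: for each zero-valued TF-IDF entry, look up the word's top-N most similar words in the GloVe model and copy the weight of the first one that is nonzero in the same document. The filename `glove_tweet.txt` is hypothetical; gensim 4+ can read the plain GloVe text format with `no_header=True`.

```python
from gensim.models import KeyedVectors

# Load GloVe vectors trained on the Tweet (or Tweet + News) corpus.
glove = KeyedVectors.load_word2vec_format(
    "glove_tweet.txt", binary=False, no_header=True  # hypothetical file
)

def expand_features(X, feature_names, top_n=10):
    """Replace zero TF-IDF entries with the weight of the most
    similar word (by GloVe similarity) present in the document."""
    X = X.toarray()
    index = {w: i for i, w in enumerate(feature_names)}
    for row in X:
        for j, word in enumerate(feature_names):
            if row[j] != 0 or word not in glove:
                continue
            # Walk the top-N most similar words until one is nonzero here
            for similar, _ in glove.most_similar(word, topn=top_n):
                k = index.get(similar)
                if k is not None and row[k] != 0:
                    row[j] = row[k]
                    break
    return X
```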
Data splitting used three train:test ratios: 90:10, 80:20, and 70:30. Their detailed use appears in Scenario I under Results.
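A minimal split sketch with scikit-learn, using the 90:10 ratio that performs best in Scenario I; the placeholder data and the stratified split are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split

texts = ["tweet %d" % i for i in range(100)]  # placeholder preprocessed tweets
labels = [0, 1] * 50                          # 0 = fact, 1 = hoax

# 90:10 train:test; stratify keeps the fact/hoax ratio equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.1, stratify=labels, random_state=42
)
```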
This project built and tested three models: a Convolutional Neural Network (CNN), a Support Vector Machine (SVM), and a hybrid that combines the CNN and the SVM.
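The exact architectures and hyperparameters are not restated here; the sketch below assumes one common way to build such a hybrid: train a Conv1D text classifier, then reuse its penultimate layer as a feature extractor for an SVM. The layer sizes, toy data, and RBF kernel are all illustrative assumptions.

```python
import numpy as np
import tensorflow as tf
from sklearn.svm import SVC

np.random.seed(0)
vocab_size, seq_len = 5000, 50
X = np.random.randint(1, vocab_size, size=(200, seq_len))  # toy padded token ids
y = np.random.randint(0, 2, size=200)                      # toy 0/1 labels

# CNN text classifier
inputs = tf.keras.Input(shape=(seq_len,), dtype="int32")
h = tf.keras.layers.Embedding(vocab_size, 100)(inputs)
h = tf.keras.layers.Conv1D(128, 5, activation="relu")(h)
h = tf.keras.layers.GlobalMaxPooling1D()(h)
h = tf.keras.layers.Dense(64, activation="relu", name="feature_layer")(h)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(h)
cnn = tf.keras.Model(inputs, outputs)
cnn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
cnn.fit(X, y, epochs=2, batch_size=32, verbose=0)

# Hybrid: take features from the CNN's penultimate layer
# and train an SVM on them instead of the sigmoid head.
extractor = tf.keras.Model(inputs, cnn.get_layer("feature_layer").output)
svm = SVC(kernel="rbf")
svm.fit(extractor.predict(X, verbose=0), y)
print(svm.predict(extractor.predict(X[:5], verbose=0)))
```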
**Scenario I: splitting ratio**

| Splitting Ratio (train:test) | CNN Accuracy (%) | SVM Accuracy (%) |
|---|---|---|
| 90:10 | 93.91 | 95.52 |
| 80:20 | 93.61 | 95.25 |
| 70:30 | 93.53 | 95.15 |
**Scenario II: single n-grams** (relative change from the unigram baseline in parentheses)

| N-gram | CNN Accuracy (%) | SVM Accuracy (%) |
|---|---|---|
| Unigram (baseline) | 93.91 | 95.52 |
| Bigram | 91.95 (-2.09) | 92.37 (-3.30) |
| Trigram | 80.62 (-14.14) | 80.39 (-15.85) |
**Scenario III: combined n-grams** (relative change from the unigram baseline in parentheses)

| N-gram | CNN Accuracy (%) | SVM Accuracy (%) |
|---|---|---|
| Unigram (baseline) | 93.91 | 95.52 |
| Unigram + Bigram | 94.28 (+0.40) | 95.92 (+0.41) |
| Unigram + Bigram + Trigram | 94.05 (+0.15) | 95.86 (+0.35) |
**Scenario IV: feature expansion with GloVe similarity** (accuracy in %, relative change from the baseline in parentheses)

| Rank | CNN Baseline | CNN Tweet | CNN Tweet + News | SVM Baseline | SVM Tweet | SVM Tweet + News |
|---|---|---|---|---|---|---|
| Top 1 | 93.91 | 94.35 (+0.47) | 94.90 (+1.06) | 95.52 | 95.95 (+0.45) | 95.74 (+0.22) |
| Top 5 | | 94.66 (+0.80) | 94.93 (+1.09) | | 95.79 (+0.28) | 95.60 (+0.08) |
| Top 10 | | 94.70 (+0.85) | 94.99 (+1.15) | | 95.47 (-0.06) | 95.37 (-0.16) |
**Scenario IV (continued): Top 15 and Top 20 with the Tweet + News corpus, CNN model**

| Rank | Baseline Accuracy (%) | Tweet + News Accuracy (%) |
|---|---|---|
| Top 10 | 93.91 | 94.99 (+1.15) |
| Top 15 | | 95.11 (+1.29) |
| Top 20 | | 94.76 (+0.91) |
**Scenario V: CNN-SVM hybrid** (relative change from each baseline in the last two columns)

| Rank | CNN Baseline Accuracy (%) | Hybrid Accuracy (%) | vs. CNN Baseline (%) | vs. SVM Baseline (%) |
|---|---|---|---|---|
| Top 1 | 93.91 | 94.82 | +0.96 | -0.74 |
| Top 5 | | 95.55 | +1.75 | +0.03 |
| Top 10 | | 95.79 | +2.01 | +0.28 |
| Top 15 | | 94.99 | +1.16 | -0.55 |
Based on the test results, nearly every scenario improved on its baseline. The exception is Scenario V, which evaluated the CNN-SVM hybrid rather than the individual CNN and SVM models of the earlier scenarios, and which fell below the SVM baseline at Top 1 and Top 15. Measured against its own baseline, the CNN gained more from feature expansion than the SVM did, even though the SVM achieved the highest absolute accuracy; this suggests the CNN model can still be optimized to outperform the SVM.
- The best splitting ratio: 90:10
- The best n-gram: unigram + bigram
- The best models: SVM with 95.95% accuracy (similarity Top 1, Tweet corpus), hybrid CNN-SVM with 95.79% accuracy (similarity Top 10, Tweet + News corpus), and CNN with 95.11% accuracy (similarity Top 15, Tweet + News corpus).