
Hoax Detection on Social Media with Convolutional Neural Network (CNN) and Support Vector Machine (SVM)

About the Project

This is my Final Project research for completing my bachelor's degree. The research is about hoax detection on social media (Twitter) with a Convolutional Neural Network (CNN) and a Support Vector Machine (SVM).

Tools: Google Spreadsheet, Google Colab, Jupyter Notebook
Programming language: Python

Dataset

The data was crawled from Twitter using the Snscrape library. The research focused on two topics: the Ferdy Sambo case and the Kanjuruhan tragedy. After preprocessing, every item was labelled 0 for factual news or 1 for hoax news. Each item was labelled independently by three people to avoid subjectivity. The distribution of hoax/fact news is shown in the table below.

| Topic | Category | Number of Data |
|---|---|---|
| Kanjuruhan Tragedy | Fact | 1,279 |
| Kanjuruhan Tragedy | Hoax | 1,420 |
| Ferdy Sambo Case | Fact | 11,588 |
| Ferdy Sambo Case | Hoax | 11,038 |
| **Total** | | **25,325** |

Preprocessing Data

The table below lists the preprocessing phases, with an example of the text after each phase.

| Preprocessing Phase | Text |
|---|---|
| Raw Data | Ratusan Orang Meninggal dalam Tragedi Kanjuruhan, Tersangkanya Hanya 6 |
| Data Cleaning | Ratusan Orang Meninggal dalam Tragedi Kanjuruhan Tersangkanya Hanya |
| Case Folding | ratusan orang meninggal dalam tragedi kanjuruhan tersangkanya hanya |
| Stopword Removal | ratusan orang meninggal tragedi kanjuruhan tersangkanya |
| Stemming | ratus orang tinggal tragedi kanjuruhan sangka |
| Tokenizing | ['ratus', 'orang', 'tinggal', 'tragedi', 'kanjuruhan', 'sangka'] |
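The phases above can be sketched as a small pipeline. The stopword list and the stemming step below are hypothetical stand-ins so the example is self-contained (an Indonesian stemmer such as Sastrawi is what a project like this would typically use), and tokenization is done early here for simplicity:

```python
import re

# Illustrative stopword list and stem dictionary -- hypothetical stand-ins
# for the full Indonesian resources a real pipeline would use.
STOPWORDS = {"dalam", "hanya", "yang", "dan", "di"}
STEM_MAP = {"ratusan": "ratus", "meninggal": "tinggal", "tersangkanya": "sangka"}

def preprocess(text):
    # Data cleaning: drop URLs, digits, and punctuation
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # Case folding
    text = text.lower()
    # Tokenizing (simple whitespace split)
    tokens = text.split()
    # Stopword removal
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Stemming (dictionary-lookup stand-in for a real stemmer)
    return [STEM_MAP.get(t, t) for t in tokens]

print(preprocess("Ratusan Orang Meninggal dalam Tragedi Kanjuruhan, Tersangkanya Hanya 6"))
# → ['ratus', 'orang', 'tinggal', 'tragedi', 'kanjuruhan', 'sangka']
```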

Feature Extraction with TF-IDF

Feature extraction converts raw text into numeric features so that the data can be processed without losing the meaning of the original data. TF-IDF (Term Frequency–Inverse Document Frequency) is a feature extraction method often used in text processing. It represents each document as a sparse vector in which every term is weighted by its frequency within the document (TF), scaled down by how common the term is across the whole corpus (IDF), so terms that are distinctive for a particular document receive higher weights.
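The weighting can be illustrated in a few lines of pure Python over a toy corpus (the project itself would more plausibly use a library vectorizer such as scikit-learn's `TfidfVectorizer`):

```python
import math
from collections import Counter

# Toy corpus of already-preprocessed token lists (hypothetical data).
docs = [
    ["ratus", "orang", "tinggal", "tragedi", "kanjuruhan"],
    ["tragedi", "kanjuruhan", "sangka"],
    ["orang", "sangka"],
]

def tf_idf(docs):
    n = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Weight = (term count / doc length) * log(N / document frequency)
        vectors.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return vectors

vecs = tf_idf(docs)
# "ratus" (in 1 of 3 docs) outweighs "tragedi" (in 2 of 3 docs) in the first document
```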

Feature Expansion with GloVe

Feature expansion is a method that replaces zero-valued features with the value of a similar word, using word similarity learned from a corpus. GloVe builds word vectors from the co-occurrence statistics of words in a corpus, so the similarity between two words can be measured from their vectors. The ratio of co-occurrence probabilities can encode several forms of meaning and helps improve performance on word analogy problems.
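The statistics GloVe trains on are windowed co-occurrence counts. A minimal sketch of that counting step, on a hypothetical two-sentence corpus (the GloVe optimization itself, which fits vectors to these counts, is omitted):

```python
from collections import Counter, defaultdict

def cooccurrence(sentences, window=2):
    """Count how often each pair of words appears within `window` tokens."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

corpus = [
    ["tragedi", "kanjuruhan", "korban"],
    ["korban", "tragedi", "kanjuruhan"],
]
cc = cooccurrence(corpus)
# "tragedi" and "kanjuruhan" co-occur in both sentences
```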

The output of GloVe embedding is a ranked list of similar words. This research uses two corpora to find word similarity: the Tweet corpus, built from the tweet text in the dataset, and the Tweet + News corpus, built from the tweet text in the dataset plus news data. The news data was sourced from several Indonesian news portals, namely CNN Indonesia, Tempo, Koran Sindo, and Republika.

| Corpus | Number of Words |
|---|---|
| Tweet | 20,733 |
| Tweet + News | 96,358 |
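The expansion step itself can be sketched as: for each zero-valued feature in a document vector, borrow the weight of the most similar word (from the GloVe ranking) that does have a nonzero value. The similarity ranking and weights below are hypothetical stand-ins for the real GloVe output:

```python
# Hypothetical top-N similarity ranking; in the project this would come from
# GloVe trained on the Tweet or Tweet + News corpus (most similar first).
SIMILAR = {
    "korban": ["tragedi", "tinggal"],
    "sangka": ["kasus"],
}

def expand(vector, vocab, top_n=1):
    """Fill zero-valued features with the weight of a similar nonzero feature."""
    out = dict(vector)
    for term in vocab:
        if out.get(term, 0.0) == 0.0:
            for similar in SIMILAR.get(term, [])[:top_n]:
                if out.get(similar, 0.0) != 0.0:
                    out[term] = out[similar]
                    break
    return out

doc = {"tragedi": 0.41, "kanjuruhan": 0.41, "korban": 0.0}
expanded = expand(doc, vocab=["tragedi", "kanjuruhan", "korban"])
# "korban" takes the weight of its most similar nonzero feature, "tragedi"
```

Raising `top_n` (the Top 1 / Top 5 / Top 10 ranks tested below) lets the expansion fall back to less similar words when the closest ones are also zero.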

Splitting Data

Data splitting in this research used three ratios (train : test), i.e. 90:10, 80:20, and 70:30. The ratios are compared in Scenario I (see Testing Scenario).
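A minimal stdlib sketch of such a ratio split (a library helper such as scikit-learn's `train_test_split` is the more usual choice; the seed here is illustrative):

```python
import random

def split_data(data, test_ratio=0.10, seed=42):
    """Shuffle and split into train/test by ratio, e.g. 0.10 for 90:10."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    data = data[:]
    rng.shuffle(data)
    cut = int(len(data) * (1 - test_ratio))
    return data[:cut], data[cut:]

train, test = split_data(list(range(100)), test_ratio=0.10)
# → 90 training items, 10 test items
```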

Models

This project built and tested three models, i.e. a Convolutional Neural Network (CNN), a Support Vector Machine (SVM), and a hybrid model combining CNN and SVM.

Testing Scenario

Scenario I: Choosing the best splitting ratio

| Splitting Ratio | CNN Accuracy (%) | SVM Accuracy (%) |
|---|---|---|
| 90:10 | 93.91 | 95.52 |
| 80:20 | 93.61 | 95.25 |
| 70:30 | 93.53 | 95.15 |

Scenario II: Choosing the best n-gram

| N-gram | CNN Accuracy (%) | SVM Accuracy (%) |
|---|---|---|
| Unigram (Baseline) | 93.91 | 95.52 |
| Bigram | 91.95 (-2.09) | 92.37 (-3.30) |
| Trigram | 80.62 (-14.14) | 80.39 (-15.85) |

Scenario III: Choosing the best n-gram combination

| N-gram | CNN Accuracy (%) | SVM Accuracy (%) |
|---|---|---|
| Unigram (Baseline) | 93.91 | 95.52 |
| Unigram + Bigram | 94.28 (+0.40) | 95.92 (+0.41) |
| Unigram + Bigram + Trigram | 94.05 (+0.15) | 95.86 (+0.35) |

Scenario IV: GloVe Embedding

| Rank | CNN Baseline (%) | CNN Tweet (%) | CNN Tweet + News (%) | SVM Baseline (%) | SVM Tweet (%) | SVM Tweet + News (%) |
|---|---|---|---|---|---|---|
| Top 1 | 93.91 | 94.35 (+0.47) | 94.90 (+1.06) | 95.52 | 95.95 (+0.45) | 95.74 (+0.22) |
| Top 5 | | 94.66 (+0.80) | 94.93 (+1.09) | | 95.79 (+0.28) | 95.60 (+0.08) |
| Top 10 | | 94.70 (+0.85) | 94.99 (+1.15) | | 95.47 (-0.06) | 95.37 (-0.16) |

Trying Top 15 and Top 20 with the Tweet + News corpus for the CNN model

| Rank | CNN Baseline (%) | CNN Tweet + News (%) |
|---|---|---|
| Top 10 | 93.91 | 94.99 (+1.15) |
| Top 15 | | 95.11 (+1.29) |
| Top 20 | | 94.76 (+0.91) |

Scenario V: CNN-SVM Hybrid Model

| Rank | CNN Baseline Accuracy (%) | CNN-SVM Hybrid Accuracy (%) | vs CNN Baseline (%) | vs SVM Baseline (%) |
|---|---|---|---|---|
| Top 1 | 93.91 | 94.82 | +0.96 | +0.74 |
| Top 5 | | 95.55 | +1.75 | +0.03 |
| Top 10 | | 95.79 | +2.01 | +0.28 |
| Top 15 | | 94.99 | +1.16 | +0.55 |

Analysis

Based on the test results, every scenario improved performance over its baseline. Scenario V differs from the previous scenarios in that it tested the CNN-SVM hybrid model rather than the CNN and SVM models separately. The accompanying graph shows the relative accuracy increase over the baselines: the CNN improved more significantly than the SVM, even though the SVM achieved the highest overall accuracy. This suggests the CNN model still has room for optimization and could potentially surpass the SVM model.

Conclusion

- The best splitting ratio: 90:10
- The best n-gram: unigram + bigram
- The best models: SVM with 95.95% accuracy (similarity top 1, Tweet corpus), hybrid CNN-SVM with 95.79% accuracy (similarity top 10, Tweet + News corpus), and CNN with 95.11% accuracy (similarity top 15, Tweet + News corpus).
