
Hoax Detection on Social Media with Convolutional Neural Network (CNN) and Support Vector Machine (SVM)

About the Project

This is my Final Project research for completing my bachelor's degree. The research is about hoax detection on social media (Twitter) with a Convolutional Neural Network (CNN) and a Support Vector Machine (SVM).

Tools: Google Spreadsheet, Google Colab, Jupyter Notebook
Programming language: Python

Dataset

The data was crawled from Twitter using the Snscrape library. The research focused on two topics: the Ferdy Sambo case and the Kanjuruhan tragedy. After preprocessing, every item was labelled 0 for factual news or 1 for hoax news. Each item was labelled independently by three people to avoid subjectivity. The distribution of hoax/fact news is shown in the table below.

| Topic | Category | Number of Data |
|---|---|---|
| Kanjuruhan Tragedy | Fact | 1,279 |
| Kanjuruhan Tragedy | Hoax | 1,420 |
| Ferdy Sambo Case | Fact | 11,588 |
| Ferdy Sambo Case | Hoax | 11,038 |
| **Total** | | **25,325** |

Preprocessing Data

The table below lists the preprocessing phases, with an example of the text after each phase.

| Preprocessing Phase | Text |
|---|---|
| Raw Data | Ratusan Orang Meninggal dalam Tragedi Kanjuruhan, Tersangkanya Hanya 6 |
| Data Cleaning | Ratusan Orang Meninggal dalam Tragedi Kanjuruhan Tersangkanya Hanya |
| Case Folding | ratusan orang meninggal dalam tragedi kanjuruhan tersangkanya hanya |
| Stopword Removal | ratusan orang meninggal tragedi kanjuruhan tersangkanya |
| Stemming | ratus orang tinggal tragedi kanjuruhan sangka |
| Tokenizing | ['ratus', 'orang', 'tinggal', 'tragedi', 'kanjuruhan', 'sangka'] |
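The phases above can be sketched as a small pipeline. The stopword list and the stemming step below are hypothetical stand-ins so the example is self-contained (an Indonesian stemmer such as Sastrawi is what a project like this would typically use), and tokenization is done early here for simplicity:

```python
import re

# Illustrative stopword list and stem dictionary -- hypothetical stand-ins
# for the full Indonesian resources a real pipeline would use.
STOPWORDS = {"dalam", "hanya", "yang", "dan", "di"}
STEM_MAP = {"ratusan": "ratus", "meninggal": "tinggal", "tersangkanya": "sangka"}

def preprocess(text):
    # Data cleaning: drop URLs, digits, and punctuation
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # Case folding
    text = text.lower()
    # Tokenizing (simple whitespace split)
    tokens = text.split()
    # Stopword removal
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Stemming (dictionary-lookup stand-in for a real stemmer)
    return [STEM_MAP.get(t, t) for t in tokens]

print(preprocess("Ratusan Orang Meninggal dalam Tragedi Kanjuruhan, Tersangkanya Hanya 6"))
# → ['ratus', 'orang', 'tinggal', 'tragedi', 'kanjuruhan', 'sangka']
```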

Feature Extraction with TF-IDF

Feature extraction converts raw text into numeric features so that the data can be processed without losing the meaning of the original data. TF-IDF (Term Frequency–Inverse Document Frequency) is a feature extraction method often used in text processing. It represents each document as a sparse vector in which every term is weighted by its frequency within the document (TF), scaled down by how common the term is across the whole corpus (IDF), so terms that are distinctive for a particular document receive higher weights.
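The weighting can be illustrated in a few lines of pure Python over a toy corpus (the project itself would more plausibly use a library vectorizer such as scikit-learn's `TfidfVectorizer`):

```python
import math
from collections import Counter

# Toy corpus of already-preprocessed token lists (hypothetical data).
docs = [
    ["ratus", "orang", "tinggal", "tragedi", "kanjuruhan"],
    ["tragedi", "kanjuruhan", "sangka"],
    ["orang", "sangka"],
]

def tf_idf(docs):
    n = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Weight = (term count / doc length) * log(N / document frequency)
        vectors.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return vectors

vecs = tf_idf(docs)
# "ratus" (in 1 of 3 docs) outweighs "tragedi" (in 2 of 3 docs) in the first document
```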

Feature Expansion with GloVe

Feature expansion is a method that replaces zero-valued features with the value of a similar word, using word similarity learned from a corpus. GloVe builds word vectors from the co-occurrence statistics of words in a corpus, so the similarity between two words can be measured from their vectors. The ratio of co-occurrence probabilities can encode several forms of meaning and helps improve performance on word analogy problems.
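The statistics GloVe trains on are windowed co-occurrence counts. A minimal sketch of that counting step, on a hypothetical two-sentence corpus (the GloVe optimization itself, which fits vectors to these counts, is omitted):

```python
from collections import Counter, defaultdict

def cooccurrence(sentences, window=2):
    """Count how often each pair of words appears within `window` tokens."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

corpus = [
    ["tragedi", "kanjuruhan", "korban"],
    ["korban", "tragedi", "kanjuruhan"],
]
cc = cooccurrence(corpus)
# "tragedi" and "kanjuruhan" co-occur in both sentences
```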

The output of GloVe embedding is a ranked list of similar words. This research uses two corpora to find word similarity: the Tweet corpus, built from the tweet text in the dataset, and the Tweet + News corpus, built from the tweet text in the dataset plus news data. The news data was sourced from several Indonesian news portals, namely CNN Indonesia, Tempo, Koran Sindo, and Republika.

| Corpus | Number of Words |
|---|---|
| Tweet | 20,733 |
| Tweet + News | 96,358 |
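The expansion step itself can be sketched as: for each zero-valued feature in a document vector, borrow the weight of the most similar word (from the GloVe ranking) that does have a nonzero value. The similarity ranking and weights below are hypothetical stand-ins for the real GloVe output:

```python
# Hypothetical top-N similarity ranking; in the project this would come from
# GloVe trained on the Tweet or Tweet + News corpus (most similar first).
SIMILAR = {
    "korban": ["tragedi", "tinggal"],
    "sangka": ["kasus"],
}

def expand(vector, vocab, top_n=1):
    """Fill zero-valued features with the weight of a similar nonzero feature."""
    out = dict(vector)
    for term in vocab:
        if out.get(term, 0.0) == 0.0:
            for similar in SIMILAR.get(term, [])[:top_n]:
                if out.get(similar, 0.0) != 0.0:
                    out[term] = out[similar]
                    break
    return out

doc = {"tragedi": 0.41, "kanjuruhan": 0.41, "korban": 0.0}
expanded = expand(doc, vocab=["tragedi", "kanjuruhan", "korban"])
# "korban" takes the weight of its most similar nonzero feature, "tragedi"
```

Raising `top_n` (the Top 1 / Top 5 / Top 10 ranks tested below) lets the expansion fall back to less similar words when the closest ones are also zero.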

Splitting Data

Data splitting in this research used three ratios (train : test), i.e. 90:10, 80:20, and 70:30. The ratios are compared in Scenario I (see Testing Scenario).
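A minimal stdlib sketch of such a ratio split (a library helper such as scikit-learn's `train_test_split` is the more usual choice; the seed here is illustrative):

```python
import random

def split_data(data, test_ratio=0.10, seed=42):
    """Shuffle and split into train/test by ratio, e.g. 0.10 for 90:10."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    data = data[:]
    rng.shuffle(data)
    cut = int(len(data) * (1 - test_ratio))
    return data[:cut], data[cut:]

train, test = split_data(list(range(100)), test_ratio=0.10)
# → 90 training items, 10 test items
```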

Models

This project built and tested three models, i.e. a Convolutional Neural Network (CNN), a Support Vector Machine (SVM), and a hybrid model combining CNN and SVM.

Testing Scenario

Scenario I: Choosing the best splitting ratio

| Splitting Ratio | CNN Accuracy (%) | SVM Accuracy (%) |
|---|---|---|
| 90:10 | 93.91 | 95.52 |
| 80:20 | 93.61 | 95.25 |
| 70:30 | 93.53 | 95.15 |

Scenario II: Choosing the best n-gram

| N-gram | CNN Accuracy (%) | SVM Accuracy (%) |
|---|---|---|
| Unigram (Baseline) | 93.91 | 95.52 |
| Bigram | 91.95 (-2.09) | 92.37 (-3.30) |
| Trigram | 80.62 (-14.14) | 80.39 (-15.85) |

Scenario III: Choosing the best n-gram combination

| N-gram | CNN Accuracy (%) | SVM Accuracy (%) |
|---|---|---|
| Unigram (Baseline) | 93.91 | 95.52 |
| Unigram + Bigram | 94.28 (+0.40) | 95.92 (+0.41) |
| Unigram + Bigram + Trigram | 94.05 (+0.15) | 95.86 (+0.35) |

Scenario IV: GloVe Embedding

| Rank | CNN Baseline (%) | CNN Tweet (%) | CNN Tweet + News (%) | SVM Baseline (%) | SVM Tweet (%) | SVM Tweet + News (%) |
|---|---|---|---|---|---|---|
| Top 1 | 93.91 | 94.35 (+0.47) | 94.90 (+1.06) | 95.52 | 95.95 (+0.45) | 95.74 (+0.22) |
| Top 5 | | 94.66 (+0.80) | 94.93 (+1.09) | | 95.79 (+0.28) | 95.60 (+0.08) |
| Top 10 | | 94.70 (+0.85) | 94.99 (+1.15) | | 95.47 (-0.06) | 95.37 (-0.16) |

Trying Top 15 and Top 20 with the Tweet + News corpus for the CNN model

| Rank | CNN Baseline (%) | CNN Tweet + News (%) |
|---|---|---|
| Top 10 | 93.91 | 94.99 (+1.15) |
| Top 15 | | 95.11 (+1.29) |
| Top 20 | | 94.76 (+0.91) |

Scenario V: CNN-SVM Hybrid Model

| Rank | CNN Baseline Accuracy (%) | CNN-SVM Hybrid Accuracy (%) | vs CNN Baseline (%) | vs SVM Baseline (%) |
|---|---|---|---|---|
| Top 1 | 93.91 | 94.82 | +0.96 | +0.74 |
| Top 5 | | 95.55 | +1.75 | +0.03 |
| Top 10 | | 95.79 | +2.01 | +0.28 |
| Top 15 | | 94.99 | +1.16 | +0.55 |

Analysis

Based on the test results, every scenario improved performance over its baseline. Scenario V differs from the previous scenarios in that it tested the CNN-SVM hybrid model rather than the CNN and SVM models separately. The accompanying graph shows the relative accuracy increase over the baselines: the CNN improved more significantly than the SVM, even though the SVM achieved the highest overall accuracy. This suggests the CNN model still has room for optimization and could potentially surpass the SVM model.

Conclusion

- The best splitting ratio: 90:10
- The best n-gram: unigram + bigram
- The best models: SVM with 95.95% accuracy (similarity top 1, Tweet corpus), hybrid CNN-SVM with 95.79% accuracy (similarity top 10, Tweet + News corpus), and CNN with 95.11% accuracy (similarity top 15, Tweet + News corpus).
