Machine learning model for bot detection #8

Open
1 task done
andreaschandra opened this issue May 17, 2020 · 9 comments

Comments

@andreaschandra
Member

andreaschandra commented May 17, 2020

Given a topic or hashtag, we want to determine whether its tweets are more likely to be flooded by buzzers or to come from organic users

or

given a buzzer account, we want to identify the major topics it is buzzing about.

This task includes

  • feature engineering (need to do text cleansing, preprocessing)

  • baseline model

  • early fine-tuning

  • evaluation

  • define feature set
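
The text cleansing step in the task list could start from something like the sketch below. It is only a minimal illustration under assumptions: the function name `clean_tweet`, the choice to drop URLs and mentions while keeping hashtags, and the sample tweet are all made up for this example and are not taken from the project.

```python
import re

# Patterns are illustrative assumptions, not the project's actual rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_WORD_RE = re.compile(r"[^a-z0-9#\s]")  # keep letters, digits, '#', whitespace
SPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Lowercase a tweet, drop URLs and mentions, and strip punctuation.

    Hashtags are kept because the thread treats them as a feature source.
    """
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_WORD_RE.sub(" ", text)
    # Collapse the runs of spaces left behind by the substitutions above.
    return SPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    # Made-up sample tweet, for illustration only.
    print(clean_tweet("RT @User: Dukung #Pilkada2020!! https://t.co/abc"))
```

Whether to keep the leading `rt` token, emoji, or non-ASCII characters depends on the downstream vectorizer, so those choices are left open here.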

@andreaschandra andreaschandra changed the title Machine learning model for buzzer detection Machine learning model for bot detection Oct 17, 2020
@rubentea16
Member

rubentea16 commented Oct 17, 2020

Prepare Social Politics Word Dictionary (SPWD)

Proposed feature set:

  • username
  • name
  • is_name_social_political
  • desc
  • tweets
  • n_tweet
  • quoted_tweets
  • hashtag
  • n_tweet_use_hashtag
  • ratio_tweets_use_hashtag
  • n_photo
  • n_video
  • content_url

Feature engineering:

  • is_name_social_political (1/0), to be predicted by a separate model
  • n_tweet
  • hashtag
  • n_tweet_use_hashtag
  • ratio_tweets_use_hashtag
  • n_photo
  • n_video
  • content_url
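
The hashtag-related counts in the list above could be derived per account roughly as follows. This is a sketch under assumptions: the thread does not define these features precisely, so `ratio_tweets_use_hashtag` is taken here to mean `n_tweet_use_hashtag / n_tweet`, and the helper name `hashtag_features` is hypothetical.

```python
import re
from typing import Dict, List

HASHTAG_RE = re.compile(r"#\w+")


def hashtag_features(tweets: List[str]) -> Dict[str, float]:
    """Compute hashtag usage features for one account's tweets.

    Assumption: the ratio is the share of tweets containing at least
    one hashtag; the thread does not spell out the exact definition.
    """
    n_tweet = len(tweets)
    n_tweet_use_hashtag = sum(1 for t in tweets if HASHTAG_RE.search(t))
    # Guard against accounts with no collected tweets.
    ratio = n_tweet_use_hashtag / n_tweet if n_tweet else 0.0
    return {
        "n_tweet": n_tweet,
        "n_tweet_use_hashtag": n_tweet_use_hashtag,
        "ratio_tweets_use_hashtag": ratio,
    }


if __name__ == "__main__":
    print(hashtag_features(["a #x", "b", "c #y #z", "d"]))
```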

@andreaschandra
Member Author

@rubentea16 if the score is still poor after trying a variety of techniques, the labeling may be inconsistent, or there may not be enough labeled data

@andreaschandra
Member Author

andreaschandra commented Dec 5, 2020

Baseline model results @rubentea16

BernoulliNB
accuracy: 0.78 | precision: 0.60 | recall: 0.21 | f score: 0.32

Linear SVM
accuracy: 0.85 | precision: 0.74 | recall: 0.57 | f score: 0.64

Random Forest
accuracy: 0.82 | precision: 0.74 | recall: 0.43 | f score: 0.54

Gradient Boosting
accuracy: 0.84 | precision: 0.73 | recall: 0.55 | f score: 0.63

AdaBoost
accuracy: 0.81 | precision: 0.63 | recall: 0.58 | f score: 0.60
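
For reference, the four numbers reported per model relate to each other as in the sketch below. It assumes binary labels where 1 marks the positive (bot) class, which the thread does not state explicitly, and the function name `binary_metrics` is made up for this example. It also shows how accuracy can stay high while recall is low when the positive class is the minority.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy labels: 3 positives out of 10, only one of them recovered.
    y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
    print(binary_metrics(y_true, y_pred))
```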

@rubentea16
Member

Which features were used for the baseline results above?

@andreaschandra
Member Author

andreaschandra commented Dec 6, 2020

@rubentea16
Member

rubentea16 commented Jan 17, 2021

Performance Benchmark

Notes:

  • multiple-feat = tweets, user_desc, is_name_social_political, ratio_tweets_use_hashtag, n_tweet, n_photo, n_video
  • single-feat = tweets
  • RFC = Random Forest Classifier (n_estimators=400)

| Model | Desc | Features | Word Embedding | Accuracy | Precision | Recall | F1-score |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RFC | - | multiple-feat | TF-IDF | 0.84 | 0.75 | 0.33 | 0.45 |
| RFC | - | single-feat | TF-IDF | 0.84 | 0.72 | 0.35 | 0.47 |
| SMOTE+RFC | Oversampling train data (Minor class) | multiple-feat | TF-IDF (desc = 3K dim & tweet = 50K dim) | 0.86 | 0.66 | 0.62 | 0.64 |
| SMOTE+RFC | Oversampling train data (Minor class) | single-feat | BPE (tweet = 300 dim) | 0.86 | 0.68 | 0.57 | 0.62 |
| SMOTE+SVC(default) | Oversampling train data (Minor class) | single-feat | BPE (tweet = 300 dim) | 0.84 | 0.59 | 0.73 | 0.65 |
| SMOTE+XGBoost(default) | Oversampling train data (Minor class) | single-feat | BPE (tweet = 300 dim) | 0.86 | 0.66 | 0.62 | 0.64 |
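
For readers unfamiliar with SMOTE, the core idea behind the oversampled rows is sketched below in plain Python: each synthetic minority sample is an interpolation between a real minority sample and one of its nearest minority neighbours. The thread does not say which implementation was used, so this is a from-scratch illustration rather than the code behind the benchmark; the function name `smote_sketch` and its parameters are assumptions. In practice a library implementation such as imbalanced-learn's `SMOTE` would normally be used, applied to the training split only.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]


def smote_sketch(minority: List[Vector], n_new: int,
                 k: int = 5, seed: int = 0) -> List[List[float]]:
    """Generate n_new synthetic minority samples by neighbour interpolation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours by squared Euclidean distance.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(base, minority[j])),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


if __name__ == "__main__":
    # Toy 2-D minority samples, for illustration only.
    minority = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
    for point in smote_sketch(minority, n_new=4, k=2):
        print(point)
```

Because synthetic points lie between existing minority samples, oversampling tends to trade precision for recall, which is consistent with the direction of change between the plain RFC rows and the SMOTE rows in the table.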

@andreaschandra
Member Author

0.64

interesting

@andreaschandra
Member Author

andreaschandra commented Mar 27, 2021

Results after QA of the labels

| Algo | Accuracy | Precision | Recall | F-score |
| --- | --- | --- | --- | --- |
| Bernoulli NB | 0.78 | 0.75 | 0.21 | 0.33 |
| SVM | 0.85 | 0.75 | 0.60 | 0.67 |
| Random Forest | 0.81 | 0.77 | 0.34 | 0.47 |
| Gradient Boosting | 0.84 | 0.78 | 0.53 | 0.63 |
| AdaBoost | 0.82 | 0.67 | 0.56 | 0.61 |

@andreaschandra
Member Author

| Algo | Accuracy | Precision | Recall | F-score |
| --- | --- | --- | --- | --- |
| Bernoulli NB | 0.82 | 0.54 | 0.69 | 0.61 |
| SVM | 0.87 | 0.69 | 0.65 | 0.67 |
| RF | 0.85 | 0.74 | 0.43 | 0.54 |
| Gradient Boosting | 0.87 | 0.72 | 0.54 | 0.62 |
| AdaBoost | 0.84 | 0.60 | 0.56 | 0.58 |
