We built word embedding models from our entire COVID-19 Twitter dataset, collected from January 2020 to April 2020. After removing retweets and duplicated tweets, we ended up with 2,821,940 tweets. We consider two well-known word embedding methods: word2vec and FastText. Because these pre-trained embeddings are domain-specific (COVID-19), they should serve downstream AI tasks better than generic pre-trained word embeddings.
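As a rough sketch of how such models are trained with gensim, consider the snippet below. The hyperparameter values are illustrative, not the exact ones we used, and `load_tokenized_tweets` is a hypothetical helper standing in for your own preprocessing:

```python
# Illustrative sketch only: trains SkipGram models with gensim 3.x
# (in gensim 4.x, `size` was renamed to `vector_size`).
from gensim.models import Word2Vec, FastText

# Hypothetical helper: returns a list of tokenized tweets,
# e.g. [["كورونا", "فيروس", ...], ...]
tweets = load_tokenized_tweets()

# sg=1 selects SkipGram; window, min_count, and workers are illustrative.
w2v = Word2Vec(tweets, size=300, sg=1, window=5, min_count=5, workers=4)
ft = FastText(tweets, size=300, sg=1, window=5, min_count=5, workers=4)

w2v.save("Word2Vec-Twitter-SkipGram-300.mdl")
ft.save("FastText-Twitter-SkipGram-300.mdl")
```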
We release four Word2Vec embedding models:

| Model | Vocabulary Size | Vector Size | Download |
|---|---|---|---|
| Word2Vec-Twitter-SkipGram | 262,715 | 200 | Download URL |
| Word2Vec-Twitter-CBOW | 262,715 | 200 | Download URL |
| Word2Vec-Twitter-SkipGram | 262,715 | 300 | Download URL |
| Word2Vec-Twitter-CBOW | 262,715 | 300 | Download URL |
We release two FastText embedding models:

| Model | Vocabulary Size | Vector Size | Download |
|---|---|---|---|
| FastText-Twitter-SkipGram | 262,715 | 200 | Download URL |
| FastText-Twitter-SkipGram | 262,715 | 300 | Download URL |
Here is the 2-D t-SNE visualisation of the word embeddings. It was produced with the "Embedding Projector" using 3,500 iterations and a perplexity of 15.
- t-SNE visualisation of the Word2Vec model trained with Continuous Bag of Words (CBOW), dimension 300
- t-SNE visualisation of the FastText model trained with dimension 300
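A similar projection can be reproduced programmatically. Here is a minimal sketch using scikit-learn's `TSNE` as a stand-in for the web-based Embedding Projector (the model filename is a placeholder for a downloaded model file):

```python
# Sketch: 2-D t-SNE of the embedding matrix with scikit-learn,
# mirroring the Embedding Projector settings (perplexity 15, 3,500 iterations).
from gensim.models import Word2Vec
from sklearn.manifold import TSNE

model = Word2Vec.load("Word2Vec-Twitter-CBOW-300.mdl")  # placeholder filename
words = model.wv.index2word[:2000]  # top words by frequency (gensim 3.x attribute)
vectors = model.wv[words]           # shape: (2000, 300)

# Note: `n_iter` was renamed `max_iter` in recent scikit-learn releases.
coords = TSNE(n_components=2, perplexity=15, n_iter=3500).fit_transform(vectors)
```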
These models were built using the gensim Python library.
To load and use one of the models, you should install `gensim` and `nltk`:

- Install `gensim` >= 3.4 and `nltk` >= 3.2, using either `pip` or `conda`:

```
pip install gensim nltk
```

or

```
conda install gensim nltk
```
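A model can then be loaded and queried. Below is a minimal sketch, where the filenames are placeholders for whichever models you downloaded:

```python
# Minimal sketch: load downloaded models and query them with gensim.
from gensim.models import Word2Vec, FastText

w2v = Word2Vec.load("Word2Vec-Twitter-SkipGram-300.mdl")
ft = FastText.load("FastText-Twitter-SkipGram-300.mdl")

# Nearest neighbours of a word ("كورونا" = "corona").
print(w2v.wv.most_similar("كورونا", topn=10))

# FastText can also compose vectors for out-of-vocabulary words
# from character n-grams.
print(ft.wv["بكورونا"])
```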
- We built the Word2Vec-Twitter-SkipGram model with dimension 200 for our paper *COVID-19: What Are Arabic Tweeters Talking About?*, where it was used to determine the number of topics. If you are going to use this model, please cite this work using the following BibTeX:
```bibtex
@inproceedings{hamoui2020covid,
  title={COVID-19: What Are Arabic Tweeters Talking About?},
  author={Hamoui, Btool and Alashaikh, Abdulaziz and Alanazi, Eisa},
  booktitle={International Conference on Computational Data and Social Networks},
  pages={425--436},
  year={2020},
  organization={Springer}
}
```
- The rest of the word embedding models were built to be used with the classification algorithms in our paper *Eating Garlic Prevents COVID-19 Infection: Detecting Misinformation on the Arabic Content of Twitter*. If you are going to use our pre-trained models, please cite this work using the following BibTeX:
```bibtex
@article{alqurashi2021eating,
  title={Eating Garlic Prevents COVID-19 Infection: Detecting Misinformation on the Arabic Content of Twitter},
  author={Alqurashi, Sarah and Hamoui, Btool and Alashaikh, Abdulaziz and Alhindi, Ahmad and Alanazi, Eisa},
  journal={arXiv preprint arXiv:2101.05626},
  year={2021}
}
```