We built word embedding models from our entire COVID-19 Twitter dataset, collected from January 2020 to April 2020. After removing retweets and duplicated tweets, we ended up with 2,821,940 tweets. We consider two well-known word embedding methods: word2vec and FastText. Because these pre-trained embeddings are domain-specific (COVID-19), they should serve downstream AI tasks better than generic pre-trained word embeddings.
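As a rough sketch of how such models are trained with gensim, consider the snippet below. The hyperparameter values are illustrative, not the exact ones we used, and `load_tokenized_tweets` is a hypothetical helper standing in for your own preprocessing:

```python
# Illustrative sketch only: trains SkipGram models with gensim 3.x
# (in gensim 4.x, `size` was renamed to `vector_size`).
from gensim.models import Word2Vec, FastText

# Hypothetical helper: returns a list of tokenized tweets,
# e.g. [["كورونا", "فيروس", ...], ...]
tweets = load_tokenized_tweets()

# sg=1 selects SkipGram; window, min_count, and workers are illustrative.
w2v = Word2Vec(tweets, size=300, sg=1, window=5, min_count=5, workers=4)
ft = FastText(tweets, size=300, sg=1, window=5, min_count=5, workers=4)

w2v.save("Word2Vec-Twitter-SkipGram-300.mdl")
ft.save("FastText-Twitter-SkipGram-300.mdl")
```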
We release four Word2Vec embedding models:

| Model | Vocabulary Size | Vector Size | Download |
|---|---|---|---|
| Word2Vec-Twitter-SkipGram | 262,715 | 200 | Download URL |
| Word2Vec-Twitter-CBOW | 262,715 | 200 | Download URL |
| Word2Vec-Twitter-SkipGram | 262,715 | 300 | Download URL |
| Word2Vec-Twitter-CBOW | 262,715 | 300 | Download URL |
We release two FastText embedding models:

| Model | Vocabulary Size | Vector Size | Download |
|---|---|---|---|
| FastText-Twitter-SkipGram | 262,715 | 200 | Download URL |
| FastText-Twitter-SkipGram | 262,715 | 300 | Download URL |
Here is the 2-D t-SNE visualisation of the word embeddings. It was produced with the "Embedding Projector" using 3,500 iterations and a perplexity of 15.
- t-SNE visualisation of the Word2Vec model trained with Continuous Bag of Words (CBOW), dimension 300
- t-SNE visualisation of the FastText model trained with dimension 300
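A similar projection can be reproduced programmatically. Here is a minimal sketch using scikit-learn's `TSNE` as a stand-in for the web-based Embedding Projector (the model filename is a placeholder for a downloaded model file):

```python
# Sketch: 2-D t-SNE of the embedding matrix with scikit-learn,
# mirroring the Embedding Projector settings (perplexity 15, 3,500 iterations).
from gensim.models import Word2Vec
from sklearn.manifold import TSNE

model = Word2Vec.load("Word2Vec-Twitter-CBOW-300.mdl")  # placeholder filename
words = model.wv.index2word[:2000]  # top words by frequency (gensim 3.x attribute)
vectors = model.wv[words]           # shape: (2000, 300)

# Note: `n_iter` was renamed `max_iter` in recent scikit-learn releases.
coords = TSNE(n_components=2, perplexity=15, n_iter=3500).fit_transform(vectors)
```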
These models were built using the gensim Python library.
To load and use one of the models, you should install `gensim` and `nltk`:

- Install `gensim` >= 3.4 and `nltk` >= 3.2, using either `pip` or `conda`:

```
pip install gensim nltk
```

or

```
conda install gensim nltk
```
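A model can then be loaded and queried. Below is a minimal sketch, where the filenames are placeholders for whichever models you downloaded:

```python
# Minimal sketch: load downloaded models and query them with gensim.
from gensim.models import Word2Vec, FastText

w2v = Word2Vec.load("Word2Vec-Twitter-SkipGram-300.mdl")
ft = FastText.load("FastText-Twitter-SkipGram-300.mdl")

# Nearest neighbours of a word ("كورونا" = "corona").
print(w2v.wv.most_similar("كورونا", topn=10))

# FastText can also compose vectors for out-of-vocabulary words
# from character n-grams.
print(ft.wv["بكورونا"])
```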
- We built the Word2Vec-Twitter-SkipGram model with dimension 200 for our paper *COVID-19: What Are Arabic Tweeters Talking About?*, where it was used to determine the number of topics. If you are going to use this model, please cite this work using the following BibTeX:
```bibtex
@inproceedings{hamoui2020covid,
  title={COVID-19: What Are Arabic Tweeters Talking About?},
  author={Hamoui, Btool and Alashaikh, Abdulaziz and Alanazi, Eisa},
  booktitle={International Conference on Computational Data and Social Networks},
  pages={425--436},
  year={2020},
  organization={Springer}
}
```
- The rest of the word embedding models were built to be used with the classification algorithms in our paper *Eating Garlic Prevents COVID-19 Infection: Detecting Misinformation on the Arabic Content of Twitter*. If you are going to use our pre-trained models, please cite this work using the following BibTeX:
```bibtex
@article{alqurashi2021eating,
  title={Eating Garlic Prevents COVID-19 Infection: Detecting Misinformation on the Arabic Content of Twitter},
  author={Alqurashi, Sarah and Hamoui, Btool and Alashaikh, Abdulaziz and Alhindi, Ahmad and Alanazi, Eisa},
  journal={arXiv preprint arXiv:2101.05626},
  year={2021}
}
```