The goal of this experiment is to summarize tweets with English and Yoruba code switches.
Keywords: Code-Switching, Tweet Summarization, Language Identification, Code-switching Detection, Translation, Natural Language Processing.
The twitter_data csv file has three columns:
- Tweets: Tweets with code-switches.
- Eng_source: English source of the tweets.
- Summary: Human annotated summary of the tweets.
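As a quick check of the column layout described above, the following is a minimal sketch that reads the file with Python's standard `csv` module and confirms the three columns are present. The helper name `load_tweets` and the path `Data/twitter_data.csv` are assumptions made for illustration; they are not taken from the project's `utils` code, so adjust them to match the repository.

```python
import csv
from pathlib import Path

# Column names as listed in the dataset description.
EXPECTED_COLUMNS = ("Tweets", "Eng_source", "Summary")


def load_tweets(csv_path):
    """Read the dataset and return a list of dicts, one per tweet.

    Hypothetical helper for illustration; not part of the project's utils.
    """
    # UTF-8 is assumed so that Yoruba diacritics are read correctly.
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [name for name in EXPECTED_COLUMNS if name not in header]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        # Keep only the three documented columns for each row.
        return [{name: row[name] for name in EXPECTED_COLUMNS} for row in reader]


if __name__ == "__main__":
    # Assumed location of the data file; change it if yours differs.
    rows = load_tweets("Data/twitter_data.csv")
    print(f"{len(rows)} tweets loaded")
```

Failing early on a missing column makes a renamed or truncated file easier to spot before any summarization step runs.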
You can find the modules and libraries used in this project in the requirements.txt file. To install them, run the command below.
pip install -r requirements.txt
- Data: contains the data file used for this project.
- utils: contains the essential functions used for the project.
- data_analysis.ipynb: a Python notebook that uses the functions in utils to analyse the project data and report descriptive information about it.
- data_collection.ipynb: a Python notebook that shows how to collect tweets from Twitter using the Twitter API and the tweepy Python library.
- quick_start.ipynb: a Python notebook that shows a successful run of the project following the quick start guidelines.
- main.ipynb and main.py: a Python notebook and script that use the functions in utils to walk through summarizing tweets with English-Yoruba code switches and show the results.
- Clone the repository
git clone https://github.com/gloryodeyemi/COMP_8730_Project.git
- Change the directory to the cloned repository folder
%cd .../COMP_8730_Project
- Install the needed packages
pip install -r requirements.txt
- Run the script
python main.py
Hugging Face AutoTrain was used to train and evaluate the baseline approach on our dataset. The evaluation metric scores and testing interface can be found here.
This project is licensed under the MIT License - see the LICENSE file for details.
Glory Odeyemi is currently pursuing a Master's degree in Computer Science, with a specialization in Artificial Intelligence, at the University of Windsor, Windsor, ON, Canada. You can connect with her on LinkedIn.