Author: Jonathan Beaulieu
You can find the problem statement with details at https://competitions.codalab.org/competitions/17344. This SemEval task focuses on emoji use in tweets: given a training set of tweet texts (with the emojis removed) paired with the emoji each tweet contained, the goal is to build a program that predicts which emoji will be used in any new tweet from its text alone. For this task we are not allowed to use outside tweet data, for example grabbing more tweets to add to the training data; the only outside data we are allowed to use are emoji embeddings. The task uses Macro-F scores instead of Micro-F scores to discourage overfitting to the most frequent classes, and we train and are evaluated on the 20 most frequent emojis. The task has two subtasks: running against English tweets and running against Spanish tweets.
| Input | Output |
|---|---|
A little throwback with my favourite person @ Water Wall | ❤ |
Birthday Kisses @ Madison, Wisconsin | 😘 |
Everything about this weekend #hogvibes @ Fayetteville, Arkansas | 💯 |
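The Macro-F choice matters here: a model that always predicts the most frequent emoji can look good under Micro-F but scores poorly under Macro-F, which averages per-class F1 equally. A minimal illustration using scikit-learn (the gold and predicted labels below are made up):

```python
from sklearn.metrics import f1_score

# Made-up gold and predicted emoji class indices for six tweets.
gold = [0, 0, 0, 0, 1, 2]
pred = [0, 0, 0, 0, 0, 0]  # a model that always predicts the majority class

micro = f1_score(gold, pred, average="micro")  # rewards majority-class bias
macro = f1_score(gold, pred, average="macro")  # averages per-class F1 equally

print(round(micro, 3), round(macro, 3))  # macro is much lower than micro
```

The majority-class predictor gets a Micro-F of 0.667 on this toy data but a Macro-F of only 0.267, which is why the task organizers chose the macro average.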
Simply run `./install.sh` to install all dependencies and do all required setup to get up and running on a clean install of Ubuntu 16.04. Note: this script should work on other versions and Linux systems, but it has not been tested on them.
Simply run `./runit.sh` to run the program and see its results. `runit.sh` will train a model on both the English and Spanish train data sets and print out the results of each fold of 10-fold cross-validation.
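The 10-fold evaluation the script performs can be sketched as follows (a minimal sketch using scikit-learn's `KFold`; the tweet list here is hypothetical, as the real script reads the train data files):

```python
from sklearn.model_selection import KFold

# Hypothetical labeled tweets; the real script reads the train data files.
tweets = ["tweet %d" % i for i in range(50)]

kf = KFold(n_splits=10, shuffle=True, random_state=0)
fold_sizes = []
for fold, (train_idx, test_idx) in enumerate(kf.split(tweets)):
    # A real run would train on tweets[train_idx] and score on tweets[test_idx].
    fold_sizes.append((len(train_idx), len(test_idx)))
    print("fold %d: %d train / %d test" % (fold, len(train_idx), len(test_idx)))
```

Each fold holds out a different tenth of the data for testing, so every tweet is used for evaluation exactly once.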
The code for the Hopper system can be found under the `hopper` directory along with all accompanying documentation. All the models are described in detail in the README in that directory.
Each file lists its author(s) at the top. When a file has multiple authors, the sections each author contributed are clearly marked (we tried to keep this to a minimum).
Jonathan:
- Wrote the framework code: reading the data, running the models, and analyzing the results (including a scorer, since the provided one is poorly written).
- Wrote the code for the Naive Bayes baselines.
- Wrote the code for the Random Model.
- Wrote this README and the README sections on Bernoulli Naive Bayes baselines.
- Wrote README sections on Character and Word Based NN Models.
- Wrote code for Char-based and Word-based NN Models, scorer, confusion matrix explorer, new run script and the SVM Model.
- Performed many of the experiments.
Dennis:
- Most Frequent Class Model
- Documentation for Most Frequent Class Model in README and code
- Documentation for NaiveBayes Model in code
- RESULTS-1 and RESULTS-1.md
- Resampling (collapsed classes to understand effect of semantically similar emojis)
- Configs for running some models
- RESULTS-2
- Performed some experiments
Sai:
- Analyzed the Output and Results which can be found in Results1_1
- Explained 3rd party code in ORIGINS
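The Bernoulli Naive Bayes baseline mentioned above can be sketched roughly as follows (a minimal sketch with made-up toy tweets and labels, not the project's actual code):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Toy tweets and emoji class labels (0 and 1 stand in for two emoji classes).
tweets = ["love this view", "kisses from madison", "love love love", "birthday kisses"]
labels = [0, 1, 0, 1]

vec = CountVectorizer(binary=True)  # Bernoulli NB models binary word presence
X = vec.fit_transform(tweets)
clf = BernoulliNB().fit(X, labels)

pred = clf.predict(vec.transform(["sending kisses"]))
print(pred)
```

Word presence/absence is a simple but reasonable feature set for short tweets, which is why Naive Bayes makes a natural baseline here.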
- python >= 3.5
- nltk == 3.2.4
- scikit-learn == 0.19.0
- sklearn == 0.0
- scipy == 0.19.1
This project requires Python 3.
The Python module dependencies can be found in `requirements.txt`. Install them by running `pip install -r requirements.txt`. (Note: you may need to replace `pip` with `pip3` if Python 3 is not your default.)
All the data can be found under the `data` directory.
- The mapping of numbers to emojis is in the `mappings` directory.
- The trial data is in the `trial` directory.
  - This is ~50k tweets with labels.
  - The tweets are separated into English and Spanish.
- The train data is in the `train` directory.
  - This is ~500k tweets with labels.
  - The tweets are separated into English and Spanish using the provided script. However, upon reading the script, it separates the tweets based on the emoji used rather than on the language of the tweet.
All the output for the system is stored in the `output` directory.
Contents

- `stage_1_train.out.txt`
  - Baseline models trained and tested on the train data
  - Tested using 10-fold cross-validation
- `stage_1_trial.out.txt`
  - Baseline models trained and tested on the trial data
  - Tested using 10-fold cross-validation
This is a todo list of everything we need for Stage 1 submission that hasn't been done already.
- Write `install.sh` (Jon)
- Analyze results. (Sai) Due: 10/6
- Create baseline based on most frequent class (Dennis) Due: 10/9
- Write readme section about BernoulliNB Model. (Jon) Due: 10/9
- Write readme section about most frequent class baseline model. (Dennis) Due: 10/10
- Clarify Authorship (All)
- Final review, review each others work. (All) Due: 10/12
- Example of actual program input and output and description of problem. (Jon)
This is a todo list of the tasks needing to be done for the resubmission.
- Clean up run.py and confusion_matrix.py code-wise (Jon) Due: 10/18
- Merge result files and make the macro/micro F-scores clear (Sai and Dennis)
- Fix up the Naive Bayes explanation in the README (Jon)
- Document run.py and confusion_matrix.py (Sai)
- Meet up with Ted to double check this todo list. (All) Due 10/23 @ 10am
- Check out Twitter/Word embeddings at https://github.com/fvancesco/acmmm2016 (Jon)
- Try normalization. (Sai) Note: Sai will not start working until Monday, after his presentation
- Try NML techniques (Dennis)
- Do analysis of tweets; create a report of issues the current model is facing. (Jon)