Author: Jonathan Beaulieu
You can find the problem statement with details at https://competitions.codalab.org/competitions/17344. This SemEval task focuses on emoji use in tweets: given a training set of tweet texts (with the emojis removed) paired with the emoji each tweet contained, the goal is to build a program that predicts which emoji will be used in any new tweet from its text alone. For this task we are not allowed to use outside tweet data, for example grabbing more tweets to add to the training data; the only outside data we are allowed to use are emoji embeddings. The task uses Macro-F scores instead of Micro-F scores to discourage overfitting to the most frequent classes, and we train and are evaluated on the 20 most frequent emojis. The task has two subtasks: running against English tweets and running against Spanish tweets.
| Input | Output |
|---|---|
A little throwback with my favourite person @ Water Wall | ❤ |
Birthday Kisses @ Madison, Wisconsin | 😘 |
Everything about this weekend #hogvibes @ Fayetteville, Arkansas | 💯 |
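The Macro-F choice matters here: a model that always predicts the most frequent emoji can look good under Micro-F but scores poorly under Macro-F, which averages per-class F1 equally. A minimal illustration using scikit-learn (the gold and predicted labels below are made up):

```python
from sklearn.metrics import f1_score

# Made-up gold and predicted emoji class indices for six tweets.
gold = [0, 0, 0, 0, 1, 2]
pred = [0, 0, 0, 0, 0, 0]  # a model that always predicts the majority class

micro = f1_score(gold, pred, average="micro")  # rewards majority-class bias
macro = f1_score(gold, pred, average="macro")  # averages per-class F1 equally

print(round(micro, 3), round(macro, 3))  # macro is much lower than micro
```

The majority-class predictor gets a Micro-F of 0.667 on this toy data but a Macro-F of only 0.267, which is why the task organizers chose the macro average.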
Simply run `./install.sh` to install all dependencies and do all required setup to get up and running on a clean install of Ubuntu 16.04. Note: this script should work on other versions and Linux systems, but it has not been tested on them.
Simply run `./runit.sh` to run the program and see its results. `runit.sh` will train a model on both the English and Spanish train data sets and print out the results of each fold of 10-fold cross-validation.
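The 10-fold evaluation the script performs can be sketched as follows (a minimal sketch using scikit-learn's `KFold`; the tweet list here is hypothetical, as the real script reads the train data files):

```python
from sklearn.model_selection import KFold

# Hypothetical labeled tweets; the real script reads the train data files.
tweets = ["tweet %d" % i for i in range(50)]

kf = KFold(n_splits=10, shuffle=True, random_state=0)
fold_sizes = []
for fold, (train_idx, test_idx) in enumerate(kf.split(tweets)):
    # A real run would train on tweets[train_idx] and score on tweets[test_idx].
    fold_sizes.append((len(train_idx), len(test_idx)))
    print("fold %d: %d train / %d test" % (fold, len(train_idx), len(test_idx)))
```

Each fold holds out a different tenth of the data for testing, so every tweet is used for evaluation exactly once.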
The code for the Hopper system can be found under the `hopper` directory along with all accompanying documentation. All the models are described in detail in the README in that directory.
Each file lists its author(s) at the top. When a file has multiple authors, the sections each author contributed are clearly marked (we tried to keep this to a minimum).
Jonathan:
- Wrote the framework code: reading the data, running the models, and analyzing the results (including a scorer, since the provided one is poorly written).
- Wrote the code for the Naive Bayes baselines.
- Wrote the code for the Random Model.
- Wrote this README and the README sections on Bernoulli Naive Bayes baselines.
- Wrote README sections on Character and Word Based NN Models.
- Wrote code for Char-based and Word-based NN Models, scorer, confusion matrix explorer, new run script and the SVM Model.
- Performed many of the experiments.
Dennis:
- Most Frequent Class Model
- Documentation for Most Frequent Class Model in README and code
- Documentation for NaiveBayes Model in code
- RESULTS-1 and RESULTS-1.md
- Resampling (collapsed classes to understand effect of semantically similar emojis)
- Configs for running some models
- RESULTS-2
- Performed some experiments
Sai:
- Analyzed the Output and Results which can be found in Results1_1
- Explained 3rd party code in ORIGINS
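The Bernoulli Naive Bayes baseline mentioned above can be sketched roughly as follows (a minimal sketch with made-up toy tweets and labels, not the project's actual code):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Toy tweets and emoji class labels (0 and 1 stand in for two emoji classes).
tweets = ["love this view", "kisses from madison", "love love love", "birthday kisses"]
labels = [0, 1, 0, 1]

vec = CountVectorizer(binary=True)  # Bernoulli NB models binary word presence
X = vec.fit_transform(tweets)
clf = BernoulliNB().fit(X, labels)

pred = clf.predict(vec.transform(["sending kisses"]))
print(pred)
```

Word presence/absence is a simple but reasonable feature set for short tweets, which is why Naive Bayes makes a natural baseline here.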
- python >= 3.5
- nltk == 3.2.4
- scikit-learn == 0.19.0
- sklearn == 0.0
- scipy == 0.19.1
This project requires Python 3.
The Python module dependencies can be found in `requirements.txt`. Install them by running `pip install -r requirements.txt`. (Note: you may need to replace `pip` with `pip3` if Python 3 is not your default.)
All the data can be found under the `data` directory.
- The mapping of numbers to emojis is in the `mappings` directory.
- The trial data is in the `trial` directory.
  - This is ~50k tweets with labels.
  - The tweets are separated into English and Spanish.
- The train data is in the `train` directory.
  - This is ~500k tweets with labels.
  - The tweets are separated into English and Spanish using the provided script. However, upon reading the script, it separates the tweets based on the emoji used rather than on the language of the tweet.
All the output for the system is stored in the `output` directory.
Contents

- `stage_1_train.out.txt`
  - Baseline models trained and tested on the train data
  - Tested using 10-fold cross-validation
- `stage_1_trial.out.txt`
  - Baseline models trained and tested on the trial data
  - Tested using 10-fold cross-validation
This is a todo list of everything we need for Stage 1 submission that hasn't been done already.
- Write `install.sh` (Jon)
- Analyze results. (Sai) Due: 10/6
- Create baseline based on most frequent class (Dennis) Due: 10/9
- Write readme section about BernoulliNB Model. (Jon) Due: 10/9
- Write readme section about most frequent class baseline model. (Dennis) Due: 10/10
- Clarify Authorship (All)
- Final review, review each others work. (All) Due: 10/12
- Example of actual program input and output and description of problem. (Jon)
This is a todo list of the tasks needing to be done for the resubmission.
- Clean up run.py and confusion_matrix.py code-wise (Jon) Due: 10/18
- Merge result files and make the macro/micro F-scores clear (Sai and Dennis)
- Fix up the Naive Bayes explanation in the README (Jon)
- Document run.py and confusion_matrix.py (Sai)
- Meet up with Ted to double check this todo list. (All) Due 10/23 @ 10am
- Check out Twitter/Word embeddings at https://github.com/fvancesco/acmmm2016 (Jon)
- Try normalization. (Sai) Note: Sai will not start working until Monday, after his presentation
- Try NML techniques (Dennis)
- Do analysis of tweets; create a report of issues the current model is facing. (Jon)