Code for Evidence > Intuition: Transferability Estimation for Encoder Selection.
Elisa Bassignana, Max Müller-Eberstein, Mike Zhang, Barbara Plank
In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022)
This repository contains implementations to compute and evaluate the Logarithm of Maximum Evidence (LogME) on a wide variety of Natural Language Processing (NLP) tasks. It can be used to assess pre-trained models for transfer learning, where a pre-trained model with a high LogME value is likely to have good transfer performance (You et al., 2021).
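For reference, LogME scores an encoder by the log marginal evidence of a Bayesian linear model fitted on the frozen encoder's features for the target labels; higher is better. The snippet below is a minimal, self-contained sketch following You et al. (2021) and is only illustrative; the implementation actually used in this repository is project/src/utils/logme.py.

# Minimal LogME sketch (following You et al., 2021); illustrative only --
# the repository's own implementation is in project/src/utils/logme.py.
import numpy as np

def logme(features: np.ndarray, labels: np.ndarray,
          max_iter: int = 100, tol: float = 1e-3) -> float:
    """features: (n, d) frozen-encoder embeddings; labels: (n,) integer classes."""
    n, d = features.shape
    classes = np.unique(labels)
    targets = (labels[:, None] == classes[None, :]).astype(np.float64)  # one-hot, (n, K)
    # shared SVD of the feature matrix (assumes the common case n >= d)
    u, s, vh = np.linalg.svd(features, full_matrices=False)
    sigma = s ** 2
    evidences = []
    for k in range(targets.shape[1]):
        y = targets[:, k]
        uty = u.T @ y                      # y projected onto the left singular vectors
        alpha, beta = 1.0, 1.0             # prior / noise precisions
        for _ in range(max_iter):
            gamma = np.sum(beta * sigma / (alpha + beta * sigma))
            m = vh.T @ (beta * s * uty / (alpha + beta * sigma))   # posterior mean
            m2 = np.sum(m ** 2)
            res = np.sum((y - features @ m) ** 2)                  # residual sum of squares
            alpha_new = gamma / (m2 + 1e-12)
            beta_new = (n - gamma) / (res + 1e-12)
            converged = (abs(alpha_new - alpha) / alpha < tol
                         and abs(beta_new - beta) / beta < tol)
            alpha, beta = alpha_new, beta_new
            if converged:
                break
        evidence = (d / 2.0 * np.log(alpha) + n / 2.0 * np.log(beta)
                    - 0.5 * np.sum(np.log(alpha + beta * sigma))
                    - beta / 2.0 * res - alpha / 2.0 * m2
                    - n / 2.0 * np.log(2.0 * np.pi))
        evidences.append(evidence / n)     # per-sample log evidence for this class
    return float(np.mean(evidences))       # higher is better

If you use this repository, please cite: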
@inproceedings{bassignana-etal-2022-evidence,
title = "Evidence {\textgreater} Intuition: Transferability Estimation for Encoder Selection",
author = {Bassignana, Elisa and
M{\"u}ller-Eberstein, Max and
Zhang, Mike and
Plank, Barbara},
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.emnlp-main.283",
pages = "4218--4227",
abstract = "With the increase in availability of large pre-trained language models (LMs) in Natural Language Processing (NLP), it becomes critical to assess their fit for a specific target task a priori{---}as fine-tuning the entire space of available LMs is computationally prohibitive and unsustainable. However, encoder transferability estimation has received little to no attention in NLP. In this paper, we propose to generate quantitative evidence to predict which LM, out of a pool of models, will perform best on a target task without having to fine-tune all candidates. We provide a comprehensive study on LM ranking for 10 NLP tasks spanning the two fundamental problem types of classification and structured prediction. We adopt the state-of-the-art Logarithm of Maximum Evidence (LogME) measure from Computer Vision (CV) and find that it positively correlates with final LM performance in 94{\%} of the setups.In the first study of its kind, we further compare transferability measures with the de facto standard of human practitioner ranking, finding that evidence from quantitative metrics is more robust than pure intuition and can help identify unexpected LM candidates.",
}
project
├── resources (run setup.sh and add data)
│ ├── data (run setup.sh and add data)
│ │ └── *
│ ├── output (run setup.sh and add data)
│ │ └── *
├── src
│ ├── classification
│ │ ├── __init__.py
│ │ ├── classifiers.py
│ │ └── losses.py
│ ├── preprocessing
│ │ └── tokenize.py
│ ├── utils
│ │ ├── conll_2_string.py
│ │ ├── string_2_conll.py
│ │ ├── conlleval.perl
│ │ ├── data.py
│ │ ├── embeddings.py
│ │ ├── encode_data.py
│ │ ├── leep.py (deprecated)
│ │ ├── load_data.py
│ │ └── logme.py
│ ├── tasks
│ │ ├── crossner-news
│ │ │ ├── news-labels.json
│ │ │ ├── run_classification.sh
│ │ │ ├── run_classification_tuned.sh
│ │ │ └── run_logme.sh
│ │ ├── crossner-science
│ │ │ ├── run_classification.sh
│ │ │ ├── run_classification_tuned.sh
│ │ │ ├── run_logme.sh
│ │ │ └── science-labels.json
│ │ ├── deidentification
│ │ │ ├── run_classification.sh
│ │ │ ├── run_classification_tuned.sh
│ │ │ └── run_logme.sh
│ │ ├── deprel
│ │ │ ├── convert.py
│ │ │ ├── run_classification.sh
│ │ │ └── run_logme.sh
│ │ ├── glue
│ │ │ ├── convert.py
│ │ │ ├── run_classification.sh
│ │ │ └── run_logme.sh
│ │ ├── relclass
│ │ │ ├── run_classification.sh
│ │ │ ├── run_classification_tuned.sh
│ │ │ └── run_logme.sh
│ │ ├── sentiment
│ │ │ ├── convert.py
│ │ │ ├── run_classification.sh
│ │ │ └── run_logme.sh
│ │ ├── topic
│ │ │ ├── convert_news.py
│ │ │ ├── run_classification.sh
│ │ │ ├── run_classification_tuned.sh
│ │ │ └── run_logme.sh
│ │ ├── human
│ │ │ └── evaluate_rankings.py
├── .gitignore
├── classify.py
├── evaluate.py
├── main.py
├── README.md
├── requirements.txt
└── setup.sh
numpy
scipy
sklearn
torch
transformers
datasets
numba
pip install --user -r requirements.txt
Run bash setup.sh to create the appropriate directory paths.
There are three main scripts used in all experiments:
# LogME Calculation for a dataset-LM pair
python main.py
# Classifier training using a dataset-LM pair
python classify.py
# Evaluation of predictions
python evaluate.py
For detailed usage, please refer to the examples below, and to the help output of each script:
python main.py -h
To run LogME on your own data, pre-process it into a .csv format where the labels are converted to unique integers. If your dataset is available in HuggingFace Datasets, you can use the name of the dataset in main.py.
"text","label"
"this is a sentence , to test .","0"
...
"text","label"
"this is New York .","0 0 1 2 0"
...
Note that sequence labeling tasks require a pre-tokenized, space-separated input which has exactly as many tokens as labels.
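If you start from a dataset on the HuggingFace Hub, a small conversion helper along the following lines can produce the expected CSV (a hypothetical sketch, not part of this repository; the per-task convert.py scripts described below are the canonical converters):

# Hypothetical helper (not part of this repository): export a HuggingFace
# dataset split into the CSV format expected by main.py, mapping labels to
# unique integers.
import csv
from datasets import load_dataset

def to_csv(dataset_name: str, split: str, out_path: str,
           text_field: str = "text", label_field: str = "label") -> None:
    ds = load_dataset(dataset_name, split=split)
    label2id = {lab: idx for idx, lab in enumerate(sorted(set(ds[label_field])))}
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        writer.writerow(["text", "label"])
        for example in ds:
            writer.writerow([example[text_field], label2id[example[label_field]]])

# e.g., to_csv("ag_news", "train", "project/resources/data/topic/train.csv")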
Each experiment has a dedicated directory in project/src/tasks/ containing a script for dataset conversion into the unified CSV format (convert.py), LogME calculation (run_logme.sh), and classifier training and evaluation (run_classification.sh).
While many datasets are downloaded automatically, some require a separate, manual download (e.g., due to licensing). The tasks and corresponding datasets covered in the main paper are as follows:
- AGNews (Zhang et al., 2015) is a news topic classification dataset, the scripts for which can be found in project/src/tasks/topic/. The data is obtained from HuggingFace.
- Airline Twitter (Crowdflower, 2020) is a sentiment analysis dataset, the scripts for which can be found in project/src/tasks/sentiment/. It requires a separate download of the original data files.
- SciERC (Luan et al., 2018) is a relation classification dataset, the scripts for which can be found in project/src/tasks/relclass/. It requires a separate download of the original data files.
- MNLI (Williams et al., 2018) is a natural language inference dataset, the scripts for which can be found in project/src/tasks/glue/. The original data is downloaded automatically during the conversion process.
- QNLI (Rajpurkar et al., 2016) is a question answering / natural language inference dataset, the scripts for which can be found in project/src/tasks/glue/. The original data is downloaded automatically during the conversion process.
- RTE (Giampiccolo et al., 2007) is a natural language inference dataset, the scripts for which can be found in project/src/tasks/glue/. The original data is downloaded automatically during the conversion process.
- EWT (Silveira et al., 2014) is a syntactic dependency treebank, the scripts for which can be found in project/src/tasks/deprel/. It requires a separate download of the original data files.
- CrossNER (Liu et al., 2021) is a named entity recognition dataset, the scripts for which can be found in project/src/tasks/crossner-{news,science}/. It requires a separate download of the original data files.
- JobStack (Jensen et al., 2021) is a dataset for de-identification of job postings, the scripts for which can be found in project/src/tasks/deidentification/. The data is obtained from the authors.
To run specific configurations of the experiments above, such as "mean-pooled sequence classification on BioBERT with full fine-tuning", please refer to the examples below.
For detailed example scripts, check project/src/tasks/*.
#!/bin/bash
# path to your data
DATA_PATH=project/resources/data/airline
# the type of embedding to calculate LogME on (e.g., [cls]-token or the mean of subwords)
# [transformer, transformer+cls]
EMB_TYPE="transformer+cls"
# your favourite encoders to vectorize your data with.
ENCODERS=( "bert-base-uncased"
"roberta-base"
"distilbert-base-uncased"
"emilyalsentzer/Bio_ClinicalBERT"
"dmis-lab/biobert-v1.1"
"cardiffnlp/twitter-roberta-base"
"allenai/scibert_scivocab_uncased" )
# use POOLING="first" if you calculate LogME over the [cls] token, otherwise "mean" is default.
POOLING="first"
# prepare and split data
python project/src/tasks/sentiment/convert.py $DATA_PATH/Tweets.csv $DATA_PATH/ -rs 4012
# iterate over encoders
for enc_idx in "${!ENCODERS[@]}"; do
echo "Computing LogME using embeddings from '${ENCODERS[$enc_idx]}'"
# compute embeddings and LogME
# --task is either sequence_classification or sequence_labeling;
# --text_column and --label_column are the column headers in your .csv file
python main.py \
--task "sequence_classification" \
--train_path $DATA_PATH/train.csv \
--test_path $DATA_PATH/test.csv \
--text_column text --label_column label \
--embedding_model ${EMB_TYPE}:${ENCODERS[$enc_idx]} \
--pooling ${POOLING} | tee run_logme_cls.log
done
#!/bin/bash
DATA_PATH=project/resources/data/airline
EXP_PATH=project/resources/output/sentiment
# Experiment Parameters
ENCODERS=( "bert-base-uncased" "roberta-base" "distilbert-base-uncased" "emilyalsentzer/Bio_ClinicalBERT" "dmis-lab/biobert-v1.1" "cardiffnlp/twitter-roberta-base" "allenai/scibert_scivocab_uncased" )
#EMB_TYPE="transformer"
#POOLING="mean"
EMB_TYPE="transformer+cls"
POOLING="first"
CLASSIFIER="mlp"
SEEDS=( 4012 5060 8823 8857 9908 )
# iterate over seeds
for rsd_idx in "${!SEEDS[@]}"; do
# iterate over encoders
for enc_idx in "${!ENCODERS[@]}"; do
echo "Experiment: '${ENCODERS[$enc_idx]}' and random seed ${SEEDS[$rsd_idx]}."
exp_dir=$EXP_PATH/model${enc_idx}-${POOLING}-${CLASSIFIER}-rs${SEEDS[$rsd_idx]}
# check if experiment already exists
if [ -f "$exp_dir/best.pt" ]; then
echo "[Warning] Experiment '$exp_dir' already exists. Not retraining."
# if experiment is new, train classifier
else
echo "Training ${CLASSIFIER}-classifier using '${ENCODERS[$enc_idx]}' and random seed ${SEEDS[$rsd_idx]}."
# train classifier
python classify.py \
--task "sequence_classification" \
--train_path $DATA_PATH/train.csv \
--test_path $DATA_PATH/dev.csv \
--exp_path ${exp_dir} \
--embedding_model ${EMB_TYPE}:${ENCODERS[$enc_idx]} \
--pooling ${POOLING} \
--classifier ${CLASSIFIER} \
--seed ${SEEDS[$rsd_idx]}
# save experiment info
echo "${EMB_TYPE}:${ENCODERS[$enc_idx]} -> ${POOLING} -> ${CLASSIFIER} with RS=${SEEDS[$rsd_idx]}" > $exp_dir/experiment-info.txt
fi
# check if prediction already exists
if [ -f "$exp_dir/dev-pred.csv" ]; then
echo "[Warning] Prediction '$exp_dir/dev-pred.csv' already exists. Not re-predicting."
# if no prediction is available, run inference
else
# run prediction
python classify.py \
--task "sequence_classification" \
--train_path $DATA_PATH/train.csv \
--test_path $DATA_PATH/dev.csv \
--exp_path ${exp_dir} \
--embedding_model ${EMB_TYPE}:${ENCODERS[$enc_idx]} \
--pooling ${POOLING} \
--classifier ${CLASSIFIER} \
--seed ${SEEDS[$rsd_idx]} \
--prediction_only
fi
# run evaluation
python evaluate.py \
--gold_path ${DATA_PATH}/dev.csv \
--pred_path ${exp_dir}/dev-pred.csv \
--out_path ${exp_dir}
echo
done
done
# path to your data
DATA_PATH=~/project/resources/data/jobstack
EXP_DIR=~/project/resources/output/jobstack
# convert predictions to conll if you do sequence labeling and you have data in conll format
python project/src/utils/string_2_conll.py \
--input ${EXP_DIR}/jobstack-predictions.csv \
--output ${EXP_DIR}/jobstack-predictions.conll \
--labels ${DATA_PATH}/labels.json
# run evaluation, in this example on dev.
python evaluate.py \
--gold_path ${DATA_PATH}/dev-jobstack.conll \
--pred_path ${EXP_DIR}/jobstack-predictions.conll \
--out_path ${EXP_DIR}
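Once LogME scores and fine-tuned results are available for all candidate encoders, the two rankings can be compared, for example via correlation. The ranking evaluation used for the paper lives in project/src/tasks/human/evaluate_rankings.py; the snippet below is only an illustration with placeholder numbers.

# Illustrative only: correlate LogME scores with fine-tuned dev performance
# per encoder. All numbers below are placeholders, not results from the paper.
from scipy.stats import pearsonr, kendalltau

logme_scores = {"bert-base-uncased": -0.95, "roberta-base": -0.90,
                "distilbert-base-uncased": -1.10, "dmis-lab/biobert-v1.1": -1.05}
dev_f1 = {"bert-base-uncased": 0.88, "roberta-base": 0.90,
          "distilbert-base-uncased": 0.85, "dmis-lab/biobert-v1.1": 0.86}

encoders = sorted(logme_scores)
x = [logme_scores[e] for e in encoders]
y = [dev_f1[e] for e in encoders]
r, _ = pearsonr(x, y)
tau, _ = kendalltau(x, y)
print(f"Pearson r:   {r:.3f}")
print(f"Kendall tau: {tau:.3f}")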