Code for Evidence > Intuition: Transferability Estimation for Encoder Selection.
Elisa Bassignana, Max Müller-Eberstein, Mike Zhang, Barbara Plank
In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022)
This repository contains implementations to compute and evaluate the Logarithm of Maximum Evidence (LogME) on a wide variety of Natural Language Processing (NLP) tasks. It can be used to assess pre-trained models for transfer learning, where a pre-trained model with a high LogME value is likely to have good transfer performance (You et al., 2021).
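For reference, LogME scores an encoder by the log marginal evidence of a Bayesian linear model fitted on the frozen encoder's features for the target labels; higher is better. The snippet below is a minimal, self-contained sketch following You et al. (2021) and is only illustrative; the implementation actually used in this repository is project/src/utils/logme.py.

# Minimal LogME sketch (following You et al., 2021); illustrative only --
# the repository's own implementation is in project/src/utils/logme.py.
import numpy as np

def logme(features: np.ndarray, labels: np.ndarray,
          max_iter: int = 100, tol: float = 1e-3) -> float:
    """features: (n, d) frozen-encoder embeddings; labels: (n,) integer classes."""
    n, d = features.shape
    classes = np.unique(labels)
    targets = (labels[:, None] == classes[None, :]).astype(np.float64)  # one-hot, (n, K)
    # shared SVD of the feature matrix (assumes the common case n >= d)
    u, s, vh = np.linalg.svd(features, full_matrices=False)
    sigma = s ** 2
    evidences = []
    for k in range(targets.shape[1]):
        y = targets[:, k]
        uty = u.T @ y                      # y projected onto the left singular vectors
        alpha, beta = 1.0, 1.0             # prior / noise precisions
        for _ in range(max_iter):
            gamma = np.sum(beta * sigma / (alpha + beta * sigma))
            m = vh.T @ (beta * s * uty / (alpha + beta * sigma))   # posterior mean
            m2 = np.sum(m ** 2)
            res = np.sum((y - features @ m) ** 2)                  # residual sum of squares
            alpha_new = gamma / (m2 + 1e-12)
            beta_new = (n - gamma) / (res + 1e-12)
            converged = (abs(alpha_new - alpha) / alpha < tol
                         and abs(beta_new - beta) / beta < tol)
            alpha, beta = alpha_new, beta_new
            if converged:
                break
        evidence = (d / 2.0 * np.log(alpha) + n / 2.0 * np.log(beta)
                    - 0.5 * np.sum(np.log(alpha + beta * sigma))
                    - beta / 2.0 * res - alpha / 2.0 * m2
                    - n / 2.0 * np.log(2.0 * np.pi))
        evidences.append(evidence / n)     # per-sample log evidence for this class
    return float(np.mean(evidences))       # higher is better

If you use this repository, please cite: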
@inproceedings{bassignana-etal-2022-evidence,
title = "Evidence {\textgreater} Intuition: Transferability Estimation for Encoder Selection",
author = {Bassignana, Elisa and
M{\"u}ller-Eberstein, Max and
Zhang, Mike and
Plank, Barbara},
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.emnlp-main.283",
pages = "4218--4227",
abstract = "With the increase in availability of large pre-trained language models (LMs) in Natural Language Processing (NLP), it becomes critical to assess their fit for a specific target task a priori{---}as fine-tuning the entire space of available LMs is computationally prohibitive and unsustainable. However, encoder transferability estimation has received little to no attention in NLP. In this paper, we propose to generate quantitative evidence to predict which LM, out of a pool of models, will perform best on a target task without having to fine-tune all candidates. We provide a comprehensive study on LM ranking for 10 NLP tasks spanning the two fundamental problem types of classification and structured prediction. We adopt the state-of-the-art Logarithm of Maximum Evidence (LogME) measure from Computer Vision (CV) and find that it positively correlates with final LM performance in 94{\%} of the setups.In the first study of its kind, we further compare transferability measures with the de facto standard of human practitioner ranking, finding that evidence from quantitative metrics is more robust than pure intuition and can help identify unexpected LM candidates.",
}
project
├── resources (run setup.sh and add data)
│ ├── data (run setup.sh and add data)
│ │ └── *
│ ├── output (run setup.sh and add data)
│ │ └── *
├── src
│ ├── classification
│ │ ├── __init__.py
│ │ ├── classifiers.py
│ │ └── losses.py
│ ├── preprocessing
│ │ └── tokenize.py
│ ├── utils
│ │ ├── conll_2_string.py
│ │ ├── string_2_conll.py
│ │ ├── conlleval.perl
│ │ ├── data.py
│ │ ├── embeddings.py
│ │ ├── encode_data.py
│ │ ├── leep.py (deprecated)
│ │ ├── load_data.py
│ │ └── logme.py
│ ├── tasks
│ │ ├── crossner-news
│ │ │ ├── news-labels.json
│ │ │ ├── run_classification.sh
│ │ │ ├── run_classification_tuned.sh
│ │ │ └── run_logme.sh
│ │ ├── crossner-science
│ │ │ ├── run_classification.sh
│ │ │ ├── run_classification_tuned.sh
│ │ │ ├── run_logme.sh
│ │ │ └── science-labels.json
│ │ ├── deidentification
│ │ │ ├── run_classification.sh
│ │ │ ├── run_classification_tuned.sh
│ │ │ └── run_logme.sh
│ │ ├── deprel
│ │ │ ├── convert.py
│ │ │ ├── run_classification.sh
│ │ │ └── run_logme.sh
│ │ ├── glue
│ │ │ ├── convert.py
│ │ │ ├── run_classification.sh
│ │ │ └── run_logme.sh
│ │ ├── relclass
│ │ │ ├── run_classification.sh
│ │ │ ├── run_classification_tuned.sh
│ │ │ └── run_logme.sh
│ │ ├── sentiment
│ │ │ ├── convert.py
│ │ │ ├── run_classification.sh
│ │ │ └── run_logme.sh
│ │ ├── topic
│ │ │ ├── convert_news.py
│ │ │ ├── run_classification.sh
│ │ │ ├── run_classification_tuned.sh
│ │ │ └── run_logme.sh
│ │ ├── human
│ │ │ └── evaluate_rankings.py
├── .gitignore
├── classify.py
├── evaluate.py
├── main.py
├── README.md
├── requirements.txt
└── setup.sh
numpy
scipy
sklearn
torch
transformers
datasets
numba
pip install --user -r requirements.txt
Run bash setup.sh to create the appropriate directory paths.
There are three main scripts used in all experiments:
# LogME Calculation for a dataset-LM pair
python main.py
# Classifier training using a dataset-LM pair
python classify.py
# Evaluation of predictions
python evaluate.py
For detailed usage, please refer to the examples below, and to the help output of each script:
python main.py -h
To run LogME on your own data, pre-process it into a .csv format where the labels are converted to unique integers. If your dataset is available in HuggingFace Datasets, you can use the name of the dataset in main.py.
"text","label"
"this is a sentence , to test .","0"
...
"text","label"
"this is New York .","0 0 1 2 0"
...
Note that sequence labeling tasks require a pre-tokenized, space-separated input which has exactly as many tokens as labels.
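If you start from a dataset on the HuggingFace Hub, a small conversion helper along the following lines can produce the expected CSV (a hypothetical sketch, not part of this repository; the per-task convert.py scripts described below are the canonical converters):

# Hypothetical helper (not part of this repository): export a HuggingFace
# dataset split into the CSV format expected by main.py, mapping labels to
# unique integers.
import csv
from datasets import load_dataset

def to_csv(dataset_name: str, split: str, out_path: str,
           text_field: str = "text", label_field: str = "label") -> None:
    ds = load_dataset(dataset_name, split=split)
    label2id = {lab: idx for idx, lab in enumerate(sorted(set(ds[label_field])))}
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        writer.writerow(["text", "label"])
        for example in ds:
            writer.writerow([example[text_field], label2id[example[label_field]]])

# e.g., to_csv("ag_news", "train", "project/resources/data/topic/train.csv")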
Each experiment has a dedicated directory in project/src/tasks/ containing a script for dataset conversion into the unified CSV format (convert.py), LogME calculation (run_logme.sh), and classifier training and evaluation (run_classification.sh).
While many datasets are downloaded automatically, some require a separate, manual download (e.g., due to licensing). The tasks and corresponding datasets covered in the main paper are as follows:
- AGNews (Zhang et al., 2015) is a news topic classification dataset, the scripts for which can be found in project/src/tasks/topic/. The data is obtained from HuggingFace.
- Airline Twitter (Crowdflower, 2020) is a sentiment analysis dataset, the scripts for which can be found in project/src/tasks/sentiment/. It requires a separate download of the original data files.
- SciERC (Luan et al., 2018) is a relation classification dataset, the scripts for which can be found in project/src/tasks/relclass/. It requires a separate download of the original data files.
- MNLI (Williams et al., 2018) is a natural language inference dataset, the scripts for which can be found in project/src/tasks/glue/. The original data is downloaded automatically during the conversion process.
- QNLI (Rajpurkar et al., 2016) is a question answering / natural language inference dataset, the scripts for which can be found in project/src/tasks/glue/. The original data is downloaded automatically during the conversion process.
- RTE (Giampiccolo et al., 2007) is a natural language inference dataset, the scripts for which can be found in project/src/tasks/glue/. The original data is downloaded automatically during the conversion process.
- EWT (Silveira et al., 2014) is a syntactic dependency treebank, the scripts for which can be found in project/src/tasks/deprel/. It requires a separate download of the original data files.
- CrossNER (Liu et al., 2021) is a named entity recognition dataset, the scripts for which can be found in project/src/tasks/crossner-{news,science}/. It requires a separate download of the original data files.
- JobStack (Jensen et al., 2021) is a dataset for de-identification of job postings, the scripts for which can be found in project/src/tasks/deidentification/. The data is obtained from the authors.
To run specific configurations of the experiments above, such as "mean-pooled sequence classification on BioBERT with full fine-tuning", please refer to the examples below.
For detailed example scripts, check project/src/tasks/*.
#!/bin/bash
# path to your data
DATA_PATH=project/resources/data/airline
# the type of embedding to calculate LogME on (e.g., [cls]-token or the mean of subwords)
# [transformer, transformer+cls]
EMB_TYPE="transformer+cls"
# your favourite encoders to vectorize your data with.
ENCODERS=( "bert-base-uncased"
"roberta-base"
"distilbert-base-uncased"
"emilyalsentzer/Bio_ClinicalBERT"
"dmis-lab/biobert-v1.1"
"cardiffnlp/twitter-roberta-base"
"allenai/scibert_scivocab_uncased" )
# use POOLING="first" if you calculate LogME over the [cls] token, otherwise "mean" is default.
POOLING="first"
# prepare and split data
python project/src/tasks/sentiment/convert.py $DATA_PATH/Tweets.csv $DATA_PATH/ -rs 4012
# iterate over encoders
for enc_idx in "${!ENCODERS[@]}"; do
echo "Computing LogME using embeddings from '${ENCODERS[$enc_idx]}'"
# compute embeddings and LogME
# --task is either sequence_classification or sequence_labeling;
# --text_column and --label_column are the column headers in your .csv file
python main.py \
--task "sequence_classification" \
--train_path $DATA_PATH/train.csv \
--test_path $DATA_PATH/test.csv \
--text_column text --label_column label \
--embedding_model ${EMB_TYPE}:${ENCODERS[$enc_idx]} \
--pooling ${POOLING} | tee run_logme_cls.log
done
#!/bin/bash
DATA_PATH=project/resources/data/airline
EXP_PATH=project/resources/output/sentiment
# Experiment Parameters
ENCODERS=( "bert-base-uncased" "roberta-base" "distilbert-base-uncased" "emilyalsentzer/Bio_ClinicalBERT" "dmis-lab/biobert-v1.1" "cardiffnlp/twitter-roberta-base" "allenai/scibert_scivocab_uncased" )
#EMB_TYPE="transformer"
#POOLING="mean"
EMB_TYPE="transformer+cls"
POOLING="first"
CLASSIFIER="mlp"
SEEDS=( 4012 5060 8823 8857 9908 )
# iterate over seeds
for rsd_idx in "${!SEEDS[@]}"; do
# iterate over encoders
for enc_idx in "${!ENCODERS[@]}"; do
echo "Experiment: '${ENCODERS[$enc_idx]}' and random seed ${SEEDS[$rsd_idx]}."
exp_dir=$EXP_PATH/model${enc_idx}-${POOLING}-${CLASSIFIER}-rs${SEEDS[$rsd_idx]}
# check if experiment already exists
if [ -f "$exp_dir/best.pt" ]; then
echo "[Warning] Experiment '$exp_dir' already exists. Not retraining."
# if experiment is new, train classifier
else
echo "Training ${CLASSIFIER}-classifier using '${ENCODERS[$enc_idx]}' and random seed ${SEEDS[$rsd_idx]}."
# train classifier
python classify.py \
--task "sequence_classification" \
--train_path $DATA_PATH/train.csv \
--test_path $DATA_PATH/dev.csv \
--exp_path ${exp_dir} \
--embedding_model ${EMB_TYPE}:${ENCODERS[$enc_idx]} \
--pooling ${POOLING} \
--classifier ${CLASSIFIER} \
--seed ${SEEDS[$rsd_idx]}
# save experiment info
echo "${EMB_TYPE}:${ENCODERS[$enc_idx]} -> ${POOLING} -> ${CLASSIFIER} with RS=${SEEDS[$rsd_idx]}" > $exp_dir/experiment-info.txt
fi
# check if prediction already exists
if [ -f "$exp_dir/dev-pred.csv" ]; then
echo "[Warning] Prediction '$exp_dir/dev-pred.csv' already exists. Not re-predicting."
# if no prediction is available, run inference
else
# run prediction
python classify.py \
--task "sequence_classification" \
--train_path $DATA_PATH/train.csv \
--test_path $DATA_PATH/dev.csv \
--exp_path ${exp_dir} \
--embedding_model ${EMB_TYPE}:${ENCODERS[$enc_idx]} \
--pooling ${POOLING} \
--classifier ${CLASSIFIER} \
--seed ${SEEDS[$rsd_idx]} \
--prediction_only
fi
# run evaluation
python evaluate.py \
--gold_path ${DATA_PATH}/dev.csv \
--pred_path ${exp_dir}/dev-pred.csv \
--out_path ${exp_dir}
echo
done
done
# path to your data
DATA_PATH=~/project/resources/data/jobstack
EXP_DIR=~/project/resources/output/jobstack
# convert predictions to conll if you do sequence labeling and you have data in conll format
python project/src/utils/string_2_conll.py \
--input ${EXP_DIR}/jobstack-predictions.csv \
--output ${EXP_DIR}/jobstack-predictions.conll \
--labels ${DATA_PATH}/labels.json
# run evaluation, in this example on dev.
python evaluate.py \
--gold_path ${DATA_PATH}/dev-jobstack.conll \
--pred_path ${EXP_DIR}/jobstack-predictions.conll \
--out_path ${EXP_DIR}
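Once LogME scores and fine-tuned results are available for all candidate encoders, the two rankings can be compared, for example via correlation. The ranking evaluation used for the paper lives in project/src/tasks/human/evaluate_rankings.py; the snippet below is only an illustration with placeholder numbers.

# Illustrative only: correlate LogME scores with fine-tuned dev performance
# per encoder. All numbers below are placeholders, not results from the paper.
from scipy.stats import pearsonr, kendalltau

logme_scores = {"bert-base-uncased": -0.95, "roberta-base": -0.90,
                "distilbert-base-uncased": -1.10, "dmis-lab/biobert-v1.1": -1.05}
dev_f1 = {"bert-base-uncased": 0.88, "roberta-base": 0.90,
          "distilbert-base-uncased": 0.85, "dmis-lab/biobert-v1.1": 0.86}

encoders = sorted(logme_scores)
x = [logme_scores[e] for e in encoders]
y = [dev_f1[e] for e in encoders]
r, _ = pearsonr(x, y)
tau, _ = kendalltau(x, y)
print(f"Pearson r:   {r:.3f}")
print(f"Kendall tau: {tau:.3f}")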