This repository contains the code for reproducing the results from "An Investigation of Language Model Interpretability via Sentence Editing" (arXiv link).
The rationales can be found in `data-versioned/rationales/*.json.gz`. Each line contains a single JSON example. `words` is a tokenized version of the sentence, ready to be fed to a BERT-based model; `best_words` is an unordered list of the words that make up the rationale.
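For reference, here is a minimal sketch of reading these files, assuming the gzipped JSON-lines layout described above:

```python
# Minimal sketch: iterate over the rationale files described above.
import gzip
import json
from pathlib import Path

for path in Path("data-versioned/rationales").glob("*.json.gz"):
    with gzip.open(path, "rt", encoding="utf-8") as fd:
        for line in fd:
            example = json.loads(line)
            print(example["words"])       # tokenized sentence
            print(example["best_words"])  # unordered rationale words
            break  # peek at the first example of each file only
```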
To set up the environment:
python3 -m venv virtualenv
. ./virtualenv/bin/activate # unix
pip install -r requirements.txt
- You'll need to download the AESW data from the challenge website, save it to `./data-unversioned/aesw`, and unzip it.
To preprocess the train, validation, and test datasets for BERT:
python -m paper.aesw_to_sentences train,val,test
python -m paper ./experiments/bert_base_aesw_32_1e6/params.toml --preprocess train,val,test
To fine-tune BERT on the AESW task:
python -m paper ./experiments/bert_base_aesw_32_1e6/params.toml --main-loop
# If you are resuming an interrupted training run:
python -m paper ./experiments/bert_base_aesw_32_1e6/params.toml --main-loop --continuing
Once fine-tuned, run inference on the validation and test sets (a sketch of loading such a checkpoint outside the training scripts follows the commands):
export CHECKPOINT=./models/<some hash here>.pt
python -m paper ./experiments/bert_large_aesw_16_1e6/params.toml --inference-val $CHECKPOINT
python -m paper ./experiments/bert_large_aesw_16_1e6/params.toml --inference-test $CHECKPOINT
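If you want to poke at a fine-tuned checkpoint outside of the `paper` entry point, something like the following may work. This is a sketch only: it assumes the `.pt` file is a plain PyTorch state dict for a Hugging Face `BertForSequenceClassification` with two labels; adjust it to match how the checkpoint was actually saved (it may be nested, e.g. under a `"model"` key).

```python
# Sketch only: load a fine-tuned checkpoint for inspection.
# Assumes a plain state dict for BertForSequenceClassification (2 labels).
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-large-uncased", num_labels=2)
# Substitute the real filename from ./models/ for the placeholder below.
state_dict = torch.load("./models/<some hash here>.pt", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()
```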
- Save attention weights for the two types of edits (spelling and delete). This either requires a GPU or takes a long time (multiple hours). You must also edit the `load_*()` functions in `paper/interpret/run.py` to specify which model you want to load. (An illustrative sketch of pulling attention matrices from BERT follows the command.)
python -m paper.interpret --weight-types=spelling,delete
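As a rough illustration (this is not the repo's actual code), per-layer attention matrices can be obtained from a Hugging Face BERT by passing `output_attentions=True`:

```python
# Illustration only: extract per-layer self-attention matrices from BERT.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("This allows us to observe Saturn's moons.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one (batch, heads, seq, seq) tensor per layer.
attentions = torch.stack(outputs.attentions)  # (layers, batch, heads, seq, seq)
print(attentions.shape)
```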
- Use the attention weights to calculate similarity scores and accuracy. Once the attention weights have been calculated, this step is fast even without a GPU. (A toy version of one such overlap metric is sketched after the command.)
python -m paper.interpret --eval-types=spelling,delete
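The evaluation compares model attention against the human rationales. The repo's exact metrics may differ, but a toy sketch of one natural measure, the overlap between the k most-attended words and the `best_words` rationale, looks like this:

```python
# Toy sketch (not necessarily the repo's metric): fraction of the k
# most-attended words that fall inside the human rationale.
def topk_rationale_overlap(words, attention_scores, best_words, k=None):
    rationale = set(best_words)
    k = k or len(rationale)
    ranked = sorted(range(len(words)), key=lambda i: attention_scores[i], reverse=True)
    hits = sum(words[i] in rationale for i in ranked[:k])
    return hits / max(k, 1)

# Made-up example:
words = ["The", "algorithm", "descripted", "in", "the", "previous", "sections"]
scores = [0.01, 0.05, 0.70, 0.02, 0.01, 0.11, 0.10]
print(topk_rationale_overlap(words, scores, best_words=["descripted"], k=1))  # 1.0
```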
- Create plots.
python -m paper.interpret.plot # add --interactive to show them on screen
- Create individual plots comparing each model's attention on a single sentence (a rough plotting sketch follows the commands):
python -m paper.interpret.compare --all-models "This allows us to observe Saturn's moons."
python -m paper.interpret.compare --all-models "(We'll represent a signature as an encrypted message digest):"
python -m paper.interpret.compare --finetuning "The algorithm descripted in the previous sections has several advantages."
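As a rough, stand-alone illustration of this kind of per-word attention plot (not `paper.interpret.compare` itself, and with made-up scores):

```python
# Rough illustration with made-up numbers: bar chart of per-word attention.
import matplotlib.pyplot as plt

words = ["This", "allows", "us", "to", "observe", "Saturn's", "moons", "."]
scores = [0.05, 0.08, 0.04, 0.03, 0.30, 0.25, 0.20, 0.05]  # hypothetical attention

plt.figure(figsize=(6, 3))
plt.bar(range(len(words)), scores)
plt.xticks(range(len(words)), words, rotation=45, ha="right")
plt.ylabel("Attention")
plt.tight_layout()
plt.savefig("compare_example.png")  # or plt.show() for on-screen viewing
```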
If you use this software or data, please cite our paper and the original AESW paper:
@inproceedings{Stevens_An_Investigation_of,
  author = {Stevens, Samuel and Su, Yu},
  booktitle = {BlackboxNLP 2021},
  year = {2021},
  title = {{An Investigation of Language Model Interpretability via Sentence Editing}}
}
@inproceedings{aesw,
  title = {A Report on the Automatic Evaluation of Scientific Writing Shared Task},
  author = {Daudaravicius, Vidas and Banchs, Rafael E. and Volodina, Elena and Napoles, Courtney},
  booktitle = {Proceedings of the 11th Workshop on Innovative Use of {NLP} for Building Educational Applications},
  month = jun,
  year = {2016},
  address = {San Diego, CA},
  publisher = {Association for Computational Linguistics},
  url = {https://www.aclweb.org/anthology/W16-0506},
  doi = {10.18653/v1/W16-0506},
  pages = {53--62}
}