This project adapts the Integrated Gradients implementation provided by Captum to the named entity recognition (NER) task in order to explain BERT model predictions on the I2B2 2014 PHI dataset. It also extends the Language Interpretability Tool (LIT) to visualize and debug BERT NER models. The project can explain any BERT-based model on the NER task for any dataset in CONLL format.
The code in this project is based on HuggingFace code and on the extensive examples provided in the Captum and LIT repositories.
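For readers unfamiliar with Integrated Gradients, the idea can be sketched on a toy model. The snippet below is not the Captum implementation used by this project; it is a minimal, standard-library illustration that assumes a tiny logistic model with an analytic gradient. All names (`model`, `model_grad`, `integrated_gradients`) and the weights are hypothetical. It averages the gradients along a straight path from a baseline to the input and scales them by the input difference.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w, b):
    """Toy stand-in for a classifier score: sigmoid(w . x + b)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def model_grad(x, w, b):
    """Analytic gradient of the toy model with respect to the input x."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, b, steps=200):
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line between the baseline and the input.
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        grad = model_grad(point, w, b)
        total = [t + g for t, g in zip(total, grad)]
    # Scale the averaged gradient by the input difference, per feature.
    return [(xi - bi) * t / steps for xi, bi, t in zip(x, baseline, total)]


# Hypothetical weights and input, chosen only for illustration.
w, b = [0.5, -1.0, 2.0], 0.1
x, baseline = [1.0, 2.0, 0.5], [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline, w, b)
print(attributions)
print(sum(attributions), model(x, w, b) - model(baseline, w, b))
```

The two numbers on the last printed line should nearly agree: the attributions sum to the difference between the model output at the input and at the baseline (the completeness property). In the NER setting, the same procedure would typically be applied to token embeddings, with one attribution score per token for each predicted label.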
- Any BERT model trained on NER for a CONLL-based dataset. The model must be trained with the HuggingFace NER script.
- A dataset in CONLL format. The test dataset in the data folder is assumed to be called test.txt.
- Python 3.x (tested with 3.7)
- Run
pip install -r requirements.txt
python explainer.py --data_dir /path/to/data/folder --model_type bert \
--labels /path/to/labels.txt --model_name_or_path /path/to/trained/model \
--max_seq_length 128 --explanations_dir /path/to/store/explanations.html
python lit.py --model_path /path/to/trained/model --labels /path/to/labels.txt \
--test_data_dir /path/to/test/data/folder
The dataset used in this project is the I2B2 2014 PHI dataset. It can be requested from the Department of Biomedical Informatics and is provided free of charge to students and researchers. Any NER-annotated CONLL dataset can be used with this project.
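As a rough illustration of what a CONLL-format file looks like, the sketch below parses one. It assumes the common layout of one token per line followed by whitespace-separated columns with the label last, blank lines between sentences, and optional -DOCSTART- lines. The `read_conll` helper and the sample tokens and labels are hypothetical and are not taken from this project or from the I2B2 data.

```python
def read_conll(lines):
    """Group CONLL-style lines into (tokens, labels) pairs, one per sentence."""
    sentences = []
    tokens, labels = [], []
    for raw in lines:
        line = raw.strip()
        # Blank lines and document markers end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if tokens:
                sentences.append((tokens, labels))
                tokens, labels = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])
        # The label is assumed to be the last column; default to "O" if absent.
        labels.append(parts[-1] if len(parts) > 1 else "O")
    if tokens:
        sentences.append((tokens, labels))
    return sentences


# Invented sample with made-up labels, for illustration only.
sample = [
    "-DOCSTART- O",
    "",
    "John B-PATIENT",
    "visited O",
    "",
    "Boston B-CITY",
    "today O",
]
for sentence_tokens, sentence_labels in read_conll(sample):
    print(list(zip(sentence_tokens, sentence_labels)))
```

To inspect a real file, the same function can be called on an open file handle, for example `read_conll(open("/path/to/data/folder/test.txt", encoding="utf-8"))`.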
Explanation results are stored in an explanations.html file inside the explanations folder you provide.