This repository contains the implementations for my MSc dissertation. We adapted several state-of-the-art document-level relation extraction (RE) models and evaluated them on DocRED and DialogRE.
DocRED: please download it here, provided by the authors of DocRED: A Large-Scale Document-Level Relation Extraction Dataset.
DialogRE: please download it here, provided by the authors of Dialogue-Based Relation Extraction.
DialogRE needs to be converted to the same format as DocRED.
- Enter the directory:
- Run the shell script:
$ source process_docred.sh
Three files will be generated under the directory ../data/processed: train_annotated.json, dev.json, and test.json.
Note: their names are the same as in DocRED for convenience.
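To illustrate what converting DialogRE to the DocRED format involves, here is a minimal sketch. It is not the script shipped in this repository. It assumes each DialogRE item is a pair of a list of utterance strings and a list of relation dicts with the keys x, y, x_type, y_type and r, and that the target is the DocRED layout with title, sents, vertexSet and labels. The function names, the regex tokenizer, the exact-match mention search and the empty evidence lists are all illustrative assumptions.

```python
import re


def tokenize(text):
    """Split text into word and punctuation tokens (illustrative tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", text)


def find_mentions(sents, name, ent_type):
    """Return every exact token-level occurrence of `name` in `sents`."""
    target = tokenize(name)
    if not target:
        return []
    mentions = []
    for sent_id, tokens in enumerate(sents):
        for start in range(len(tokens) - len(target) + 1):
            end = start + len(target)
            if tokens[start:end] == target:
                mentions.append({"name": name, "sent_id": sent_id,
                                 "pos": [start, end], "type": ent_type})
    return mentions


def dialogue_to_docred(dialogue, relations, title):
    """Convert one DialogRE item into a DocRED-style document dict."""
    # Assumption: each utterance is treated as one "sentence".
    sents = [tokenize(utterance) for utterance in dialogue]
    vertex_set, index_of, labels = [], {}, []

    def entity_index(name, ent_type):
        # Register each distinct argument string once as an entity.
        if name not in index_of:
            index_of[name] = len(vertex_set)
            vertex_set.append(find_mentions(sents, name, ent_type))
        return index_of[name]

    for rel in relations:
        head = entity_index(rel["x"], rel["x_type"])
        tail = entity_index(rel["y"], rel["y_type"])
        for relation_name in rel["r"]:
            # DialogRE has no sentence-level evidence, so it is left empty here.
            labels.append({"h": head, "t": tail,
                           "r": relation_name, "evidence": []})

    return {"title": title, "sents": sents,
            "vertexSet": vertex_set, "labels": labels}
```

An entity whose name never matches the tokenized dialogue ends up with an empty mention list in this sketch, so a real conversion would need to decide how to handle such cases.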
Main directory:
Adapted from:
https://github.com/thunlp/DocRED/tree/master
Reference paper:
DocRED: A Large-Scale Document-Level Relation Extraction Dataset
python3
pytorch>=1.0
pip3 install -r requirements.txt
DocRED
Download the metadata for the baseline method from TsinghuaCloud or GoogleDrive and put it into the prepro_data folder.
DialogRE
Replace the rel2id.json under prepro_data with dialogre/data_processing/rel2id.json
- Run the script:
$ cd code
$ python3 gen_data.py --in_path ../data --out_path prepro_data
$ cd code
$ CUDA_VISIBLE_DEVICES=0 python3 train.py --model_name BiLSTM --save_name checkpoint_BiLSTM --train_prefix dev_train --test_prefix dev_dev
Note: change the self.relation_num to 37 for DialogRE
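A quick way to sanity-check the relation count before editing the configuration is to count the entries in rel2id.json. The sketch below assumes that self.relation_num is expected to equal the number of entries in that file, which the note above suggests but does not state. The helper names are hypothetical and not part of this repository.

```python
import json


def count_relations(rel2id_path):
    """Return the number of relation labels listed in a rel2id.json file."""
    with open(rel2id_path, encoding="utf-8") as f:
        rel2id = json.load(f)
    return len(rel2id)


def relation_num_matches(rel2id_path, relation_num):
    """Check a configured relation_num against the label file (assumed rule)."""
    return count_relations(rel2id_path) == relation_num
```

For example, relation_num_matches("prepro_data/rel2id.json", 37) could be called after replacing the file for DialogRE.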
$ cd code
$ CUDA_VISIBLE_DEVICES=0 python3 test.py --model_name BiLSTM --save_name checkpoint_BiLSTM --train_prefix dev_train --test_prefix dev_dev --input_theta 0.3601
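The --input_theta value is a score threshold. As a rough illustration of how such a threshold can be chosen on a development set and then applied, consider the sketch below. It assumes that predictions are ranked by score and that the threshold maximizing F1 is kept; the function names and the use of >= for the cut-off are assumptions, not code from this repository.

```python
def best_threshold(scored, num_gold):
    """Pick the score cut-off that maximizes F1.

    scored: list of (score, is_correct) pairs for candidate relation facts.
    num_gold: total number of gold relation facts.
    Returns (theta, f1); theta is None if no prediction is correct.
    """
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    best_theta, best_f1, correct = None, 0.0, 0
    for k, (score, is_correct) in enumerate(ranked, start=1):
        correct += int(is_correct)
        precision = correct / k
        recall = correct / num_gold
        if precision + recall > 0:
            f1 = 2 * precision * recall / (precision + recall)
            if f1 > best_f1:
                best_theta, best_f1 = score, f1
    return best_theta, best_f1


def apply_threshold(scored, theta):
    """Keep only the predictions whose score reaches the threshold."""
    return [item for item in scored if item[0] >= theta]
```

A threshold found this way on the development split would then be passed to the test run, as with the 0.3601 value in the command above.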
Main directory:
Adapted from:
https://github.com/hongwang600/DocRed/tree/master
Reference paper:
Fine-tune Bert for DocRED with Two-step Process
Note: Please refer to BiLSTM for data preprocessing, training, and testing.
Main directory:
Adapted from:
https://github.com/hongwang600/DocRed/tree/sent_level_enc
Reference paper:
Fine-tune Bert for DocRED with Two-step Process
Note: Please refer to BiLSTM for data preprocessing, training, and testing.
Main directory: cd LSR
Adapted from https://github.com/nanguoshun/LSR/tree/master
Reference paper: Reasoning with Latent Structure Refinement for Document-Level Relation Extraction
python==3.6.7
torch==1.3.1 + CUDA == 9.2
OR torch==1.5.1 + CUDA == 10.1
tqdm==4.29.1
numpy==1.15.4
spacy==2.1.3
networkx==2.4
DocRED
Download the metadata for the baseline method from TsinghuaCloud or GoogleDrive and put it into the prepro_data folder.
DialogRE
Replace the rel2id.json under prepro_data with dialogre/data_processing/rel2id.json
- Run the script:
$ cd code
$ python3 gen_data.py
In order to train the model, run:
$ cd code
$ python3 train.py
Note: change the self.relation_num to 37 for DialogRE
After the training process, we can test the model by:
python3 test.py
Main directory: cd LSR_BERT
Adapted from https://github.com/nanguoshun/LSR/tree/master
Reference paper: Reasoning with Latent Structure Refinement for Document-Level Relation Extraction
python==3.6.7
torch==1.3.1 + CUDA == 9.2
OR torch==1.5.1 + CUDA == 10.1
tqdm==4.29.1
numpy==1.15.4
spacy==2.1.3
networkx==2.4
pytorch-transformers==1.2.0
DocRED
Download the metadata for the baseline method from TsinghuaCloud or GoogleDrive and put it into the prepro_data folder.
- Run the script
$ cd code
$ python3 gen_data.py
DialogRE
Replace the rel2id.json under prepro_data with dialogre/data_processing/rel2id.json
- Run the script
$ cd code
$ python3 gen_data_bert.py
In order to train the model, run:
$ cd code
$ python3 train.py
After the training process, we can test the model by:
python3 test.py
Main directory:
Adapted from:
https://github.com/fenchri/edge-oriented-graph/tree/master
Reference paper:
Connecting the Dots: Document-level Relation Extraction with Edge-oriented Graphs
$ pip3 install -r requirements.txt
Download the two datasets first.
$ mkdir data && cd data
$ mkdir DocRED && mkdir Dialogue
$ # put dev_train.json dev_dev.json dev_test.json of the two datasets in each directory
$ cd ..
Both datasets must first be converted to the PubTator format.
Run the processing scripts as follows:
$ sh process_docred.sh #DocRED
$ sh process_dialogue.sh #DialogRE
In order to get the data statistics run:
- DocRED
python3 statistics.py --data ../data/DocRED/processed/dev_train.data
python3 statistics.py --data ../data/DocRED/processed/dev_dev.data
python3 statistics.py --data ../data/DocRED/processed/dev_test.data
- DialogRE
python3 statistics.py --data ../data/Dialogue/processed/dev_train.data
python3 statistics.py --data ../data/Dialogue/processed/dev_dev.data
python3 statistics.py --data ../data/Dialogue/processed/dev_test.data
This will additionally generate the gold-annotation file in the same folder, with the suffix .gold.
The original model used pre-trained PubMed embeddings.
Here, please download GloVe embeddings and put them under ./embeds
- DocRED
$ cd src/
$ python3 eog.py --config ../configs/parameters_docred.yaml --train --gpu 0
- DialogRE
$ cd src/
$ python3 eog.py --config ../configs/parameters_dialogue.yaml --train --gpu 0
$ python3 eog.py --config ../configs/parameters_docred.yaml --test --gpu 0
In order to evaluate the results, the prediction file test.preds needs to be converted to the same format as DocRED:
- DocRED
$ mkdir ../data/DocRED
$ # put the test.preds and rel2id.json under the directory
$ python3 convert2DocREDFormat --data DocRED
- DialogRE
$ mkdir ../data/Dialogue
$ # put the test.preds and rel2id.json under the directory
$ python3 convert2DocREDFormat --data Dialogue
Main directory:
Adapted from:
https://github.com/nlpdata/dialogre/tree/master
Reference Paper:
Dialogue-Based Relation Extraction
Python 3.6 and PyTorch 1.0.
- kb/Fandom_triples: relational triples from Fandom.
- kb/matching_table.txt: mapping from Fandom relational types to DialogRE relation types.
- bert folder: a re-implementation of BERT and BERTS baselines.
  - Download and unzip BERT from here, and set up the environment variable for BERT by export BERT_BASE_DIR=/PATH/TO/BERT/DIR.
  - Copy the dataset folder data to bert/.
  - In bert, execute python convert_tf_checkpoint_to_pytorch.py --tf_checkpoint_path=$BERT_BASE_DIR/bert_model.ckpt --bert_config_file=$BERT_BASE_DIR/bert_config.json --pytorch_dump_path=$BERT_BASE_DIR/pytorch_model.bin.
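Before running the conversion and training commands, it can help to confirm that BERT_BASE_DIR points at a directory holding the files those commands reference. The following is a small sketch for that check. The helper name is hypothetical, and it only looks for vocab.txt and bert_config.json, since bert_model.ckpt is a checkpoint prefix rather than a single file.

```python
import os

# File names taken from the commands in this README.
REQUIRED_FILES = ("vocab.txt", "bert_config.json")


def missing_bert_files(bert_dir):
    """Return the required file names that are absent from bert_dir."""
    return [name for name in REQUIRED_FILES
            if not os.path.isfile(os.path.join(bert_dir, name))]
```

A typical call would be missing_bert_files(os.environ["BERT_BASE_DIR"]), where an empty list means both files were found.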
To run the BERTS baseline, execute the following commands in bert:
$ cd bert
$ python run_classifier.py --task_name berts --do_train --do_eval --data_dir . --vocab_file $BERT_BASE_DIR/vocab.txt --bert_config_file $BERT_BASE_DIR/bert_config.json --init_checkpoint $BERT_BASE_DIR/pytorch_model.bin --max_seq_length 512 --train_batch_size 24 --learning_rate 3e-5 --num_train_epochs 20.0 --output_dir berts_f1 --gradient_accumulation_steps 2
$ rm berts_f1/model_best.pt && cp -r berts_f1 berts_f1c && python run_classifier.py --task_name bertsf1c --do_eval --data_dir . --vocab_file $BERT_BASE_DIR/vocab.txt --bert_config_file $BERT_BASE_DIR/bert_config.json --init_checkpoint $BERT_BASE_DIR/pytorch_model.bin --max_seq_length 512 --train_batch_size 24 --learning_rate 3e-5 --num_train_epochs 20.0 --output_dir berts_f1c --gradient_accumulation_steps 2
To evaluate the BERTS baseline, execute the following commands in bert:
$ cd bert
$ python evaluate.py --f1dev berts_f1/logits_dev.txt --f1test berts_f1/logits_test.txt --f1cdev berts_f1c/logits_dev.txt --f1ctest berts_f1c/logits_test.txt
Main directory:
Put train_annotated.json, dev.json, test.json, and the prediction results dev_test_index.json under the directory code/DocRED/re_data or code/Dialogue/re_data
- F1-score versus relation types
$ cd code
$ python3 eval_re_type.py --data DocRED|Dialogue
Specifically, to evaluate BERTS
$ cd ../dialogre/bert
$ python3 evaluate_rel_type.py
- F1-score of intra- vs. inter-sentential relations
$ cd code
$ python3 eval_re_intra_inter.py --data DocRED|Dialogue
- F1-score versus relation distances
$ cd code
$ python3 eval_re_dist.py --data DocRED|Dialogue
- Distributions of relation types
$ cd code
$ python3 get_re_type_distri.py --data DocRED|Dialogue
- Distributions of intra- vs. inter-sentential relations
$ cd code
$ python3 get_re_intra_inter_distri.py --data DocRED|Dialogue
- Distributions of relation distances
$ cd code
$ python3 get_re_dist_distri.py --data DocRED|Dialogue
- Distributions of relation distances for date_of_birth and part_of
$ cd code
$ python3 get_dist_distri_given_re_type.py --inputfile train_annotated.json|dev.json|test.json --type date_of_birth|part_of
- F1-score of intra- vs. inter-sentential relations for date_of_birth and part_of
$ cd code
$ python3 eval_intra_inter_given_re_type --data DocRED|Dialogue --type date_of_birth|part_of
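To make the intra- vs. inter-sentential split and the relation distance concrete, here is a minimal sketch over a DocRED-format document. It assumes that a relation is intra-sentential when its head and tail entities have at least one pair of mentions in the same sentence, and that the relation distance is the smallest sentence gap between any head mention and any tail mention. These definitions and the function names are assumptions for illustration; the evaluation scripts above may define them differently.

```python
def min_sentence_distance(doc, label):
    """Smallest sentence gap between any head mention and any tail mention."""
    head_sents = {m["sent_id"] for m in doc["vertexSet"][label["h"]]}
    tail_sents = {m["sent_id"] for m in doc["vertexSet"][label["t"]]}
    return min(abs(h - t) for h in head_sents for t in tail_sents)


def is_intra_sentential(doc, label):
    """True if some head mention and some tail mention share a sentence."""
    return min_sentence_distance(doc, label) == 0


def split_labels(doc):
    """Partition a document's relation labels into (intra, inter) lists."""
    intra, inter = [], []
    for label in doc["labels"]:
        (intra if is_intra_sentential(doc, label) else inter).append(label)
    return intra, inter


def f1_score(num_correct, num_predicted, num_gold):
    """Micro F1 from raw counts; returns 0.0 when it is undefined."""
    if num_correct == 0 or num_predicted == 0 or num_gold == 0:
        return 0.0
    precision = num_correct / num_predicted
    recall = num_correct / num_gold
    return 2 * precision * recall / (precision + recall)
```

Computing f1_score separately over the intra and inter subsets, or over buckets of min_sentence_distance, gives the kind of breakdown listed above.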
- A reported bug from the authors of Graph-LSR: nanguoshun/LSR#9
Our current workaround:
- Graph-LSR: change the batch size from 20 to 10.
- BERT-LSR: change the batch size from 20 to 10, and make the number of documents an integer multiple of the batch size.
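One simple way to make the number of documents a multiple of the batch size is to drop the trailing remainder. The sketch below shows that option only; the helper name is hypothetical, and how this repository actually enforces the constraint is not stated here. Dropping documents changes which examples are seen, so this should be applied with care to evaluation splits.

```python
def trim_to_batch_multiple(documents, batch_size):
    """Drop trailing documents so that len(result) % batch_size == 0."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    usable = len(documents) - len(documents) % batch_size
    return documents[:usable]
```

With a batch size of 10, a list of 25 documents would be cut to its first 20 entries, while a list of 20 would be returned unchanged.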
We acknowledge that the original models and source code belong to the authors of the officially published papers and released code listed below.
We also drew on the descriptions in these open-source repositories when writing this README.
[1] DocRED: A Large-Scale Document-Level Relation Extraction Dataset
[2] Fine-tune Bert for DocRED with Two-step Process
[3] Reasoning with Latent Structure Refinement for Document-Level Relation Extraction
[4] Connecting the Dots: Document-level Relation Extraction with Edge-oriented Graphs
[5] Dialogue-Based Relation Extraction
[1] https://github.com/thunlp/DocRED/tree/master
[2] https://github.com/hongwang600/DocRed/tree/master
[3] https://github.com/nanguoshun/LSR/tree/master
[4] https://github.com/fenchri/edge-oriented-graph/tree/master