Knowledge Base Enrichment in Conversational Domain

This repository contains the implementation for my MSc dissertation. We adapted several state-of-the-art document-level relation extraction (RE) models and evaluated them on DocRED and DialogRE.

Dataset

DocRED

Please download it here, provided by DocRED: A Large-Scale Document-Level Relation Extraction Dataset.

DialogRE

Please download it here, provided by Dialogue-Based Relation Extraction.

Pre-processing

DialogRE needs to be converted to the same format as DocRED.

  • Enter the directory:

    cd dialogre/data_processing

  • Run the shell script:

    source process_docred.sh

    Three files are generated under ../data/processed:

    train_annotated.json, dev.json, test.json

    Note: their names are the same as DocRED for convenience.
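As an illustration of what this conversion involves, the sketch below maps one DialogRE dialogue to a DocRED-style document. It is not the code behind process_docred.sh: the tokenizer, the exact-match mention search, the generated title, and the helper names (tokenize, find_mentions, dialogue_to_doc) are assumptions made for the example. It also assumes the public DialogRE layout (a list of [dialogue, relations] pairs whose relations carry x, y, x_type, y_type and r keys).

```python
import json
import re


def tokenize(text):
    # Assumption: split words and punctuation; the real script may differ.
    return re.findall(r"\w+|[^\w\s]", text)


def find_mentions(sents, name, ent_type):
    """Return DocRED-style mentions of `name` found by exact token match."""
    target = tokenize(name)
    mentions = []
    for sent_id, tokens in enumerate(sents):
        for start in range(len(tokens) - len(target) + 1):
            if tokens[start:start + len(target)] == target:
                mentions.append({"name": name, "sent_id": sent_id,
                                 "pos": [start, start + len(target)],
                                 "type": ent_type})
    return mentions


def dialogue_to_doc(index, dialogue, relations):
    """Convert one DialogRE item into a DocRED-style dict (sketch only)."""
    sents = [tokenize(utterance) for utterance in dialogue]
    vertex_set, entity_index, labels = [], {}, []

    def entity_id(name, ent_type):
        # Each distinct argument string becomes one entity in vertexSet.
        if name not in entity_index:
            entity_index[name] = len(vertex_set)
            vertex_set.append(find_mentions(sents, name, ent_type))
        return entity_index[name]

    for rel in relations:
        h = entity_id(rel["x"], rel["x_type"])
        t = entity_id(rel["y"], rel["y_type"])
        for r in rel["r"]:
            # DialogRE has no sentence-level evidence, so it stays empty.
            labels.append({"h": h, "t": t, "r": r, "evidence": []})

    return {"title": f"dialogue_{index}", "sents": sents,
            "vertexSet": vertex_set, "labels": labels}


example_dialogue = ["Speaker 1: Hi, Monica!",
                    "Speaker 2: Hey, this is my brother Ross."]
example_relations = [{"x": "Speaker 2", "y": "Ross",
                      "x_type": "PER", "y_type": "PER",
                      "r": ["per:siblings"]}]
print(json.dumps(dialogue_to_doc(0, example_dialogue, example_relations)))
```

In this sketch, an argument that never matches a token span ends up with an empty mention list, so a real conversion would need a fallback for such cases.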

BiLSTM

Main directory:

cd docred

Adapted from:

https://github.com/thunlp/DocRED/tree/master

Reference paper:

DocRED: A Large-Scale Document-Level Relation Extraction Dataset

Requirements and Installation

python3

pytorch>=1.0

pip3 install -r requirements.txt

Preprocessing data

DocRED

Download metadata from TsinghuaCloud or GoogleDrive for baseline method and put them into prepro_data folder.

DialogRE

Replace the rel2id.json under prepro_data with dialogre/data_processing/rel2id.json

  • Run the script:
$ cd code
$ python3 gen_data.py --in_path ../data --out_path prepro_data

Train

$ cd code
$ CUDA_VISIBLE_DEVICES=0 python3 train.py --model_name BiLSTM --save_name checkpoint_BiLSTM --train_prefix dev_train --test_prefix dev_dev

Note: for DialogRE, change self.relation_num to 37.
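The value of self.relation_num should agree with the relation inventory in rel2id.json. A minimal sanity check, assuming rel2id.json is a flat JSON object that maps relation names to integer ids; the helper name count_relations and the path in the usage line are illustrative only:

```python
import json


def count_relations(rel2id_path):
    """Count the distinct relation ids in a rel2id.json-style mapping."""
    with open(rel2id_path, encoding="utf-8") as f:
        rel2id = json.load(f)
    # Assumption: every relation (including any "no relation" label) has one id.
    return len(set(rel2id.values()))
```

For example, `count_relations("prepro_data/rel2id.json")` can be compared against the number set in the model configuration before training starts.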

Test

$ cd code
$ CUDA_VISIBLE_DEVICES=0 python3 test.py --model_name BiLSTM --save_name checkpoint_BiLSTM --train_prefix dev_train --test_prefix dev_dev --input_theta 0.3601
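Assuming --input_theta is a score cutoff (candidate relations scoring at or above it are kept), one common way to choose such a value is to pick the cutoff that maximizes F1 on the development set. The sketch below illustrates that idea only; best_threshold and its inputs are hypothetical and are not taken from test.py.

```python
def best_threshold(scored, num_gold):
    """Pick the score cutoff with the highest F1.

    scored:   list of (score, is_correct) pairs for candidate relations
    num_gold: total number of gold relations in the evaluation set
    Returns (theta, f1); theta is None if no cutoff yields a positive F1.
    """
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    best_theta, best_f1, correct = None, 0.0, 0
    for kept, (score, is_correct) in enumerate(ranked, start=1):
        correct += int(is_correct)
        precision = correct / kept
        recall = correct / num_gold
        if precision + recall == 0:
            continue
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            # Keeping everything scored >= `score` gives this F1.
            best_theta, best_f1 = score, f1
    return best_theta, best_f1
```

A cutoff found this way on the development split would then be reused unchanged when scoring the test split.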

BERT-Embed

Main directory:

cd DocRed-BERT

Adapted from:

https://github.com/hongwang600/DocRed/tree/master

Reference paper:

Fine-tune Bert for DocRED with Two-step Process

Note: please refer to the BiLSTM section for data preprocessing, training, and testing.

Sent-Model

Main directory:

cd DocRed-sent_level_enc

Adapted from:

https://github.com/hongwang600/DocRed/tree/sent_level_enc

Reference paper:

Fine-tune Bert for DocRED with Two-step Process

Note: please refer to the BiLSTM section for data preprocessing, training, and testing.

Graph-LSR

Main directory: cd LSR

Adapted from https://github.com/nanguoshun/LSR/tree/master

Reference paper: Reasoning with Latent Structure Refinement for Document-Level Relation Extraction

Requirement

python==3.6.7 
torch==1.3.1 + CUDA == 9.2
OR torch==1.5.1 + CUDA == 10.1
tqdm==4.29.1
numpy==1.15.4
spacy==2.1.3
networkx==2.4

Data Preprocessing

DocRED

Download metadata from TsinghuaCloud or GoogleDrive for baseline method and put them into prepro_data folder.

DialogRE

Replace the rel2id.json under prepro_data with dialogre/data_processing/rel2id.json

  • Run the script:
$ cd code
$ python3 gen_data.py 

Training

In order to train the model, run:

$ cd code
$ python3 train.py

Note: for DialogRE, change self.relation_num to 37.

Test

After the training process, we can test the model by:

python3 test.py

BERT-LSR

Main directory: cd LSR_BERT

Adapted from https://github.com/nanguoshun/LSR/tree/master

Reference paper: Reasoning with Latent Structure Refinement for Document-Level Relation Extraction

Requirement

python==3.6.7 
torch==1.3.1 + CUDA == 9.2
OR torch==1.5.1 + CUDA == 10.1
tqdm==4.29.1
numpy==1.15.4
spacy==2.1.3
networkx==2.4
pytorch-transformers==1.2.0

Data Preprocessing

DocRED

Download metadata from TsinghuaCloud or GoogleDrive for baseline method and put them into prepro_data folder.

  • Run the script
$ cd code
$ python3 gen_data.py 

DialogRE

Replace the rel2id.json under prepro_data with dialogre/data_processing/rel2id.json

  • Run the script
$ cd code
$ python3 gen_data_bert.py 

Training

In order to train the model, run:

$ cd code
$ python3 train.py

Test

After the training process, we can test the model by:

python3 test.py

Graph-EOG

Main directory:

cd edge-oriented-graph

Adapted from:

https://github.com/fenchri/edge-oriented-graph/tree/master

Reference paper:

Connecting the Dots: Document-level Relation Extraction with Edge-oriented Graphs

Environment

$ pip3 install -r requirements.txt

Datasets & Pre-processing

Download the two datasets first.

$ mkdir data && cd data
$ mkdir DocRED && mkdir Dialogue
$ # put dev_train.json dev_dev.json dev_test.json of the two datasets in each directory
$ cd ..

Both datasets must first be transformed into the PubTator format.

Run the processing scripts as follows:

$ sh process_docred.sh #DocRED
$ sh process_dialogue.sh #DialogRE

In order to get the data statistics run:

  • DocRED
python3 statistics.py --data ../data/DocRED/processed/dev_train.data
python3 statistics.py --data ../data/DocRED/processed/dev_dev.data
python3 statistics.py --data ../data/DocRED/processed/dev_test.data
  • DialogRE
python3 statistics.py --data ../data/Dialogue/processed/dev_train.data
python3 statistics.py --data ../data/Dialogue/processed/dev_dev.data
python3 statistics.py --data ../data/Dialogue/processed/dev_test.data

This will additionally generate the gold-annotation file in the same folder with suffix .gold.

Pre-trained Word Embeddings

The original model used pre-trained PubMed embeddings.

Here, please download GloVe embeddings and put them under ./embeds

Train

  • DocRED
$ cd src/
$ python3 eog.py --config ../configs/parameters_docred.yaml --train --gpu 0  
  • DialogRE
$ cd src/ 
$ python3 eog.py --config ../configs/parameters_dialogue.yaml --train --gpu 0 

Test

$ python3 eog.py --config ../configs/parameters_docred.yaml --test --gpu 0

Post-processing

To evaluate the results, the prediction file test.preds needs to be converted to the same format as DocRED:

  • DocRED
$ mkdir ../data/DocRED 
$ # put the test.preds and rel2id.json under the directory
$ python3 convert2DocREDFormat --data DocRED
  • DialogRE
$ mkdir ../data/Dialogue 
$ # put the test.preds and rel2id.json under the directory
$ python3 convert2DocREDFormat --data Dialogue
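To make the target of this conversion concrete, the sketch below builds DocRED-style result entries (title, h_idx, t_idx, r, evidence). It does not reproduce convert2DocREDFormat: the layout of test.preds is not described here, so the input is assumed to be already parsed into hypothetical (title, head index, tail index, relation) tuples, and to_docred_results is an illustrative name.

```python
import json


def to_docred_results(predictions, rel2id):
    """Turn parsed predictions into DocRED-style result dicts (sketch only).

    predictions: iterable of (title, h_idx, t_idx, relation) tuples, where
                 relation is either a name or an integer id (assumption)
    rel2id:      mapping from relation name to integer id, as in rel2id.json
    """
    id2rel = {idx: name for name, idx in rel2id.items()}
    results = []
    for title, h_idx, t_idx, relation in predictions:
        # Map integer ids back to relation names; keep names untouched.
        name = id2rel[relation] if isinstance(relation, int) else relation
        results.append({"title": title, "h_idx": h_idx, "t_idx": t_idx,
                        "r": name, "evidence": []})
    return results


example = to_docred_results([("doc A", 0, 1, 2)],
                            {"Na": 0, "P17": 1, "P131": 2})
print(json.dumps(example))
```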

DialogRE

Main directory:

cd dialogre

Adapted from:

https://github.com/nlpdata/dialogre/tree/master

Reference Paper:

Dialogue-Based Relation Extraction

Environment

Python 3.6 and PyTorch 1.0.

Preparation

  • kb/Fandom_triples: relational triples from Fandom.
  • kb/matching_table.txt: mapping from Fandom relational types to DialogRE relation types.
  • bert folder: a re-implementation of BERT and BERTS baselines.
    1. Download and unzip BERT from here, and set up the environment variable for BERT by export BERT_BASE_DIR=/PATH/TO/BERT/DIR.
    2. Copy the dataset folder data to bert/.
    3. In bert, execute python convert_tf_checkpoint_to_pytorch.py --tf_checkpoint_path=$BERT_BASE_DIR/bert_model.ckpt --bert_config_file=$BERT_BASE_DIR/bert_config.json --pytorch_dump_path=$BERT_BASE_DIR/pytorch_model.bin.

Train

To run the BERTS baseline, execute the following commands in bert:

$ cd bert

$ python run_classifier.py   --task_name berts  --do_train --do_eval   --data_dir .   --vocab_file $BERT_BASE_DIR/vocab.txt   --bert_config_file $BERT_BASE_DIR/bert_config.json   --init_checkpoint $BERT_BASE_DIR/pytorch_model.bin   --max_seq_length 512   --train_batch_size 24   --learning_rate 3e-5   --num_train_epochs 20.0   --output_dir berts_f1  --gradient_accumulation_steps 2

$ rm berts_f1/model_best.pt && cp -r berts_f1 berts_f1c && python run_classifier.py   --task_name bertsf1c --do_eval   --data_dir .   --vocab_file $BERT_BASE_DIR/vocab.txt   --bert_config_file $BERT_BASE_DIR/bert_config.json   --init_checkpoint $BERT_BASE_DIR/pytorch_model.bin   --max_seq_length 512   --train_batch_size 24   --learning_rate 3e-5   --num_train_epochs 20.0   --output_dir berts_f1c  --gradient_accumulation_steps 2

Test

To evaluate the BERTS baseline, execute the following commands in bert:

$ cd bert
$ python evaluate.py --f1dev berts_f1/logits_dev.txt --f1test berts_f1/logits_test.txt --f1cdev berts_f1c/logits_dev.txt --f1ctest berts_f1c/logits_test.txt

Evaluations

Main directory:

cd Evaluation

Put train_annotated.json, dev.json, test.json and the prediction results file dev_test_index.json under code/DocRED/re_data (for DocRED) or code/Dialogue/re_data (for DialogRE).

  • F1-score versus relation types
$ cd code
$ python3 eval_re_type.py --data DocRED|Dialogue

Specifically, to evaluate BERTS:

$ cd ../dialogre/bert
$ python3 evaluate_rel_type.py 
  • F1-score of intra- vs. inter-sentential relations
$ cd code
$ python3 eval_re_intra_inter.py --data DocRED|Dialogue 
  • F1-score versus relation distances
$ cd code
$ python3 eval_re_dist.py --data DocRED|Dialogue
  • Distributions of relation types
$ cd code
$ python3 get_re_type_distri.py --data DocRED|Dialogue 
  • Distributions of intra- vs. inter-sentential relations
$ cd code
$ python3 get_re_intra_inter_distri.py --data DocRED|Dialogue
  • Distributions of relation distances
$ cd code
$ python3 get_re_dist_distri.py --data DocRED|Dialogue
  • Distributions of relation distances for date_of_birth and part_of
$ cd code
$ python3 get_dist_distri_given_re_type.py --inputfile train_annotated.json|dev.json|test.json --type date_of_birth|part_of
  • F1 score of intra- versus inter- relations for date_of_birth and part_of
$ cd code
$ python3 eval_intra_inter_given_re_type --data DocRED|Dialogue --type date_of_birth|part_of 
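The breakdowns listed above all follow the same pattern: partition the facts by some property, then compute F1 within each partition. The sketch below shows that pattern for relation types and for intra- vs. inter-sentential pairs. It is not the code of the evaluation scripts; the function names are illustrative, and it assumes that a pair counts as intra-sentential when any head mention and any tail mention share a sentence, using the DocRED vertexSet layout.

```python
from collections import defaultdict


def is_intra(doc, h, t):
    """True if some mention of entity h shares a sentence with one of t."""
    head_sents = {m["sent_id"] for m in doc["vertexSet"][h]}
    tail_sents = {m["sent_id"] for m in doc["vertexSet"][t]}
    return bool(head_sents & tail_sents)


def f1_by_group(gold, pred, group_of):
    """F1 per group; gold and pred are sets of (title, h, t, r) facts."""
    stats = defaultdict(lambda: {"gold": 0, "pred": 0, "hit": 0})
    for fact in gold:
        stats[group_of(fact)]["gold"] += 1
    for fact in pred:
        group = stats[group_of(fact)]
        group["pred"] += 1
        group["hit"] += int(fact in gold)
    scores = {}
    for name, s in stats.items():
        precision = s["hit"] / s["pred"] if s["pred"] else 0.0
        recall = s["hit"] / s["gold"] if s["gold"] else 0.0
        total = precision + recall
        scores[name] = 2 * precision * recall / total if total else 0.0
    return scores


# Toy data: entities 0 and 1 co-occur in sentence 0, entity 2 is in sentence 2.
docs = {"d1": {"vertexSet": [[{"sent_id": 0}], [{"sent_id": 0}],
                             [{"sent_id": 2}]]}}
gold = {("d1", 0, 1, "per:friends"), ("d1", 0, 2, "per:siblings")}
pred = {("d1", 0, 1, "per:friends"), ("d1", 0, 2, "per:friends")}

by_type = f1_by_group(gold, pred, lambda fact: fact[3])
by_span = f1_by_group(
    gold, pred,
    lambda fact: "intra" if is_intra(docs[fact[0]], fact[1], fact[2]) else "inter")
print(by_type, by_span)
```

Passing a different group_of function (for example, one that buckets pairs by sentence distance) gives the other breakdowns without changing the scoring code.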

Known Issues

  1. A reported bug from the authors of Graph-LSR: nanguoshun/LSR#9

    Our current workaround:

    • Graph-LSR: change the batch size from 20 to 10.
    • BERT-LSR: change the batch size from 20 to 10, and make the number of documents an integer multiple of the batch size.
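One simple way to satisfy the "integer multiple" condition is to drop the trailing documents that do not fill a whole batch. This is a sketch of that option only; trim_to_batch_multiple is a hypothetical helper, not part of the LSR code, and padding the data up to the next multiple would be an alternative that keeps every document.

```python
def trim_to_batch_multiple(documents, batch_size):
    """Drop trailing items so len(result) is a multiple of batch_size."""
    usable = len(documents) - len(documents) % batch_size
    return documents[:usable]


# 25 documents with batch size 10 leaves 20 documents (two full batches).
print(len(trim_to_batch_multiple(list(range(25)), 10)))
```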

Acknowledgement

We acknowledge that the original models and source code belong to the authors of the published papers and released code listed below.

We also drew on the descriptions in these open-source repositories when writing this README.

References

[1] DocRED: A Large-Scale Document-Level Relation Extraction Dataset

[2] Fine-tune Bert for DocRED with Two-step Process

[3] Reasoning with Latent Structure Refinement for Document-Level Relation Extraction

[4] Connecting the Dots: Document-level Relation Extraction with Edge-oriented Graphs

[5] Dialogue-Based Relation Extraction

Open Source Repositories

[1] https://github.com/thunlp/DocRED/tree/master

[2] https://github.com/hongwang600/DocRed/tree/master

[3] https://github.com/nanguoshun/LSR/tree/master

[4] https://github.com/fenchri/edge-oriented-graph/tree/master

[5] https://github.com/nlpdata/dialogre/tree/master