Case Studies JCDL2022

This Readme belongs to our JCDL 2022 paper on Nearly-Unsupervised Information Extraction Workflows. For more information, please read:

@inproceedings{kroll2022jcdl,
	title = {A Library Perspective on Nearly-Unsupervised Information Extraction Workflows in Digital Libraries},
	booktitle = {ACM/IEEE Joint Conference on Digital Libraries (JCDL '22)},
	year = {2022},
	month = {06},
	address = {Cologne, Germany},
	doi = {10.1145/3529372.3530924},
	author = {Hermann Kroll and Jan Pirklbauer and Florian Plötzky and Wolf-Tilo Balke}
}

Content

We performed case studies in three domains:

  • PubMed (10k randomly chosen abstracts that contain a drug)
  • Pollux (10k randomly chosen abstracts)
  • Wikipedia articles (2.4k full text articles about scientists)

We provide the following data for each case study:

  • Document samples
  • Used entity vocabularies
  • Used relation vocabularies

Note that we only selected scientists who have both a Wikipedia page and a Wikidata entry. We zipped the data directory to reduce the size of the GitHub repository, so unzip the data first. The extracted folder must be located inside the case_studies folder.
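The unzip step can be sketched as follows (the archive name data.zip is an assumption; check the actual file name in the repository). To keep the sketch runnable end-to-end, a stand-in archive is created first; with the real repository, only the extraction command is needed:

```shell
# Stand-in archive so the example runs end-to-end; in the real repository,
# the zipped data directory (name data.zip assumed here) already exists.
mkdir -p case_studies/data
echo "placeholder" > case_studies/data/sample.txt
(cd case_studies && python3 -m zipfile -c data.zip data)
rm -r case_studies/data

# The actual step: extract inside case_studies/ so that the data folder
# ends up at case_studies/data.
(cd case_studies && python3 -m zipfile -e data.zip .)   # or: unzip data.zip
ls case_studies/data
```

python3 -m zipfile is used here only as a portable fallback; plain unzip works just as well.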

Due to the size of the results, we could not upload all OpenIE6 and CoreNLP extraction results to this repository (the samples are included). We made this data available on OneDrive. The folder contains two files: 1. data_all.zip with all results, and 2. a SQLite database that contains all data that we produced in our case studies.
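To get a first overview of the downloaded SQLite database, its tables can be listed with a short query (the file name results.sqlite is an assumption; substitute the actual name from the OneDrive folder). A stand-in database is created first so the sketch runs end-to-end:

```shell
# Stand-in database; with the real download, skip this line and use the
# actual file name (results.sqlite is an assumption).
python3 -c "import sqlite3; con = sqlite3.connect('results.sqlite'); con.execute('CREATE TABLE IF NOT EXISTS demo (id INTEGER)'); con.commit()"

# List all tables in the database:
python3 -c "import sqlite3; print([r[0] for r in sqlite3.connect('results.sqlite').execute(\"SELECT name FROM sqlite_master WHERE type='table'\")])"
```

The same query works in any SQLite client, e.g. via .tables in the sqlite3 command-line shell.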

This repository contains:

  • Bash scripts to reproduce our findings
  • Bash scripts to measure the performance
  • SQL scripts for data analysis

Shortcuts for scripts:

Summarized evaluation data can be found in the Summary directory.

Repository Organization

To set up the toolbox, please read the original Readme.

The repository is organized as follows:

case_studies
       ../data      -- contains the data for each collection
       ../pubmed    -- evaluation scripts + data for pharmacy
       ../pollux    -- evaluation scripts + data for political sciences
       ../wikipedia -- evaluation scripts + data for wikipedia

We include the randomly selected data that we used for our evaluation:

  • Entity linking + Stanza NER
  • Open IE 6 + PathIE
  • Relation mappings (Canonicalization Evaluation)

The evaluation data is stored in Microsoft Excel XLSX files. We also include the original .csv files exported from our case studies. You can find the data in the corresponding subfolders; the file names should be self-explanatory.

Code Changes

We implemented the following improvements for our toolbox:

  • a subject entity filter (Code)
  • enhanced verb phrase filter options (Code)
  • improved Open IE6 handling (Code)
  • Open IE6 analysis (Code)
  • sentence analysis (Code)

IJDL 2023 Submission

There is a dedicated ReadMe available.