This GitHub project belongs to our JCDL2021 submission "A Toolbox for the Nearly-Unsupervised Construction of Digital Library Knowledge Graphs". The code is shared under the MIT license. Feel free to submit any problem or issue to this GitHub project.
If you use our project in your own research, then please cite:
@inproceedings{kroll2021toolbox,
author = {H. Kroll and J. Pirklbauer and W. Balke},
booktitle = {2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL)},
title = {A Toolbox for the Nearly-Unsupervised Construction of Digital Library Knowledge Graphs},
year = {2021},
volume = {},
issn = {},
pages = {21-30},
doi = {10.1109/JCDL52503.2021.00014},
url = {https://doi.ieeecomputersociety.org/10.1109/JCDL52503.2021.00014},
publisher = {IEEE Computer Society},
address = {Los Alamitos, CA, USA},
month = {sep}
}
You can find our JCDL2021 conference talk here: YouTube
We evaluated our toolbox on three different collections: Wikipedia, Pharmacy (PubMed), and Political Sciences. See the Case Study Readme for more details.
We applied the toolbox to the pharmaceutical domain. If you are interested in a possible application, watch our ICADL2021 talk: YouTube
@inproceedings{kroll2021narrativequerygraphs,
author="Kroll, Hermann and Pirklbauer, Jan and Kalo, Jan-Christoph and Kunz, Morris and Ruthmann, Johannes and Balke, Wolf-Tilo",
editor="Ke, Hao-Ren and Lee, Chei Sian and Sugiyama, Kazunari",
title="Narrative Query Graphs for Entity-Interaction-Aware Document Retrieval",
booktitle="Towards Open and Trustworthy Digital Societies",
year="2021",
publisher="Springer International Publishing",
address="Cham",
pages="80--95",
isbn="978-3-030-91669-5"
}
There are several example files in the resources folder. You may have a look at our case studies.
The toolbox covers four central topics:
- Entity Linking and Named Entity Recognition
- Simple Text Classification
- Information Extraction
- Canonicalization and Cleaning
To use our toolbox, please follow the setup procedure below.
Install Python >= 3.8. Decide whether you want to work with a global Python version or with a conda environment (see tutorial).
Install all packages from requirements.txt:
pip3 install -r requirements.txt
You need to set up the Python path. Replace the path below with the path to your copy of the project. This step must be repeated every time you open a new shell; alternatively, add the export line to your shell startup file (e.g., ~/.bashrc).
export PYTHONPATH="$PYTHONPATH:/home/USER/KGExtractionToolbox/src/"
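To check that the path is picked up, a minimal sketch like the following can be run in a fresh shell. It assumes the package is importable under the name `kgextractiontoolbox` (inferred from the `src/kgextractiontoolbox/` directory used in the commands of this README). The helper `is_importable` is hypothetical and not part of the toolbox.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate the top-level module on sys.path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Assumed package name, taken from the src/kgextractiontoolbox/ directory.
    name = "kgextractiontoolbox"
    if is_importable(name):
        print(f"{name} found on the Python path")
    else:
        print(f"{name} not found; check the PYTHONPATH setting")
```

If the package is reported as not found, the most likely cause is that the export was done in a different shell session or points to the wrong directory.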
Please run the following script to configure NLTK:
python src/kgextractiontoolbox/setup_nltk.py
The toolbox stores all produced data in a relational database. Set up the database configuration in the project:
cd config/
cp backend.prod.json backend.json
nano backend.json
Please enter your database credentials in this file. We support Postgres and SQLite databases. If you would like to work with SQLite, enable the use_SQLite property and set a corresponding path in SQLite_path.
{
"use_SQLite": false,
"SQLite_path": "sqlitebase.db",
"POSTGRES_DB": "example",
"POSTGRES_HOST": "127.0.0.1",
"POSTGRES_PORT": "5432",
"POSTGRES_USER": "user",
"POSTGRES_PW": "pw",
"POSTGRES_SCHEMA": "public"
}
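As an illustration of how the two modes of this configuration relate, the following sketch turns a loaded configuration into an SQLAlchemy-style database URL. It is not the toolbox's own code: the functions `load_config` and `build_connection_url` are hypothetical, and only the JSON keys are taken from the example above.

```python
import json
from pathlib import Path


def load_config(path: str) -> dict:
    """Read the JSON configuration file (e.g. config/backend.json)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def build_connection_url(config: dict) -> str:
    """Build an SQLAlchemy-style URL from the configuration keys shown above."""
    if config.get("use_SQLite", False):
        # SQLite only needs the path to the database file.
        return f"sqlite:///{config['SQLite_path']}"
    # Otherwise the Postgres credentials are used.
    return (
        f"postgresql://{config['POSTGRES_USER']}:{config['POSTGRES_PW']}"
        f"@{config['POSTGRES_HOST']}:{config['POSTGRES_PORT']}"
        f"/{config['POSTGRES_DB']}"
    )


if __name__ == "__main__":
    example = {
        "use_SQLite": False,
        "SQLite_path": "sqlitebase.db",
        "POSTGRES_DB": "example",
        "POSTGRES_HOST": "127.0.0.1",
        "POSTGRES_PORT": "5432",
        "POSTGRES_USER": "user",
        "POSTGRES_PW": "pw",
        "POSTGRES_SCHEMA": "public",
    }
    print(build_connection_url(example))
```

The sketch ignores POSTGRES_SCHEMA, which is not part of the URL, and does not escape special characters in the password.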
The database schema is depicted below:
Before you can use the toolbox, you must bring your documents into a compatible format and load them:
The extraction toolbox covers three essential components:
If you would like to know how to use one of these components, follow the corresponding linked README. How data of our toolbox can be exported is described here:
You can delete all information that is stored for a document collection by running:
python src/kgextractiontoolbox/backend/delete_collection.py COLLECTION
This removes the collection's data from the database. The script asks for confirmation before deleting the collection. Pass the --force argument to delete the collection without being asked.
python src/kgextractiontoolbox/backend/delete_collection.py COLLECTION --force
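If collections need to be removed from within a Python workflow, the command above can be wrapped as in the following sketch. The helpers `build_delete_command` and `delete_collection` are hypothetical; only the script path and the --force argument are taken from this README, and the script is assumed to be run from the project root.

```python
import subprocess
import sys

# Script path as used in this README, relative to the project root.
DELETE_SCRIPT = "src/kgextractiontoolbox/backend/delete_collection.py"


def build_delete_command(collection: str, force: bool = False) -> list:
    """Assemble the command line for deleting a document collection."""
    cmd = [sys.executable, DELETE_SCRIPT, collection]
    if force:
        # Skips the interactive confirmation, as described above.
        cmd.append("--force")
    return cmd


def delete_collection(collection: str, force: bool = False) -> None:
    """Run the delete script; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_delete_command(collection, force), check=True)
```

Without `force=True`, the script's confirmation prompt is passed through to the terminal, so this wrapper is only suitable for unattended use together with --force.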