Entity Linking

Configuration

First, create a copy of the entity linking configuration.

cp config/entity_linking.prod.json config/entity_linking.json

Dictionary-based Entity Linker

The dictionary-based entity linker requires an entity vocabulary as its input. An entity vocabulary may look like:

id	type	heading	synonyms
Q1	Person	Barack Obama	Obama;Barack
Q2	Person	Angela Merkel	Merkel
Q3	Location	Honolulu
Q4	Location	Hamburg	Hansestadt
Q5	Location	America	US;USA;United States

Each entry has a unique entity id, an entity id, a heading and a list of synonyms. We encode an entity vocabulary as a TSV-file:

each line represents an entity
ids must be unique
types can be arbitrary strings
heading is a string
synonyms is a list of strings seperated by a ;

An example TSV entity vocabulary file looks like:

id	type	heading	synonyms
Q1	Person	Barack Obama	Obama;Barack
Q2	Person	Angela Merkel	Merkel
Q3	Location	Honolulu
Q4	Location	Hamburg	Hansestadt
Q5	Location	America	US;USA;United States

Next, we use the entity vocabulary to produce annotations. The entity linker requires:

document/documents as its input
the corresponding document collection
the vocabulary file

The entity linker will automatically insert documents that are not in the database yet.

python src/kgextractiontoolbox/entitylinking/vocab_entity_linking.py -i DOCUMENT -c COLLECTION -v VOCAB_FILE

You may want to link a whole collection. Leave out the input parameter:

python src/kgextractiontoolbox/entitylinking/vocab_entity_linking.py  -c COLLECTION -v VOCAB_FILE

You can also parallelize the entity linking by adding the --workers argument and specify a number of workers.

python src/kgextractiontoolbox/entitylinking/vocab_entity_linking.py -i DOCUMENT -c COLLECTION -v VOCAB_FILE --workers 10

By default, the entity linker writes its logs in a temporary directory and deletes this directory by completion. You can specify a logging directory that will not be deleted: You can also parallelize the entity linking by adding the --workers argument and specify a number of workers.

python src/kgextractiontoolbox/entitylinking/vocab_entity_linking.py -i DOCUMENT -c COLLECTION -v VOCAB_FILE --workdir test/

Note that our toolbox won't annotate the same document twice. This will be checked automatically. If your document content has changed, please delete the old table contents (document and doc_tagged_by and tags before).

If your entity vocabulary has changed, you can use the --force argument to enforce linking all documents again.

python src/kgextractiontoolbox/entitylinking/vocab_entity_linking.py -i DOCUMENT -c COLLECTION -v VOCAB_FILE --force

Full-text Documents

If your document includes sections, you must enable section linking explicitly:

python src/kgextractiontoolbox/entitylinking/vocab_entity_linking.py  -c COLLECTION -v VOCAB_FILE --sections

Only if --sections is set, sections are considered in the linking step.

Configuration

There are several options that can be specified in a configuration file.

nano config/entity_linking.json

You can adjust the setting for the dictionary-based entity linker:

#...
  "dict": {
    "max_words": 5, # specifies the maximal number of words an entity has 
    "check_abbreviation": "true", # check custom introduced abbreviations in brackets 
    "custom_abbreviations": "true", # check custom introduced abbreviations in brackets
    "min_full_tag_len": 5, # may improve the quality when working with homonys. An entity is only tagged in a document when a full mention (here 5) characters was detected at least once.
    "split_by_slash": "true" # enable word splitting rule by slash (metformin/simvastatin -> [metformin, simvastatin, metformin/simvastatin])
  }
#...

Stanza Named Entity Recognition

Stanza is not installed by default. To use it, please install:

pip install stanza~=1.2.3

Before working with Stanza, you need to setup the English model. Therefore, run:

python src/kgextractiontoolbox/setup_stanza.py

This may take a while.

Next, Stanza can be used to detect Named Entities in documents. Note that Stanza does not produce entity ids. Thus, we will use the entity mention string as the entity id.

python src/kgextractiontoolbox/entitylinking/stanza_ner.py -i DOCUMENT -c COLLECTION

Again, the input -i is optional. You may want to leave it out.

Stanza will by default run on your GPU. If no GPU is available, you can specificy the CPU flag which will cause a long runtime.

python src/kgextractiontoolbox/entitylinking/stanza_ner.py -i DOCUMENT -c COLLECTION --CPU

Note that our toolbox won't annotate the same document twice. This will be checked automatically. If your document content has changed, please delete the old table contents (document and doc_tagged_by and tags before).

Full-text Documents

If your document includes sections, you must enable section linking explicitly:

python src/kgextractiontoolbox/entitylinking/stanza_ner.py -i DOCUMENT -c COLLECTION --sections

Only if --sections is set, sections are considered in the linking step.

Stanza Config

There are several options that can be specified in a configuration file.

nano config/entity_linking.json

By default, Stanza produces many entity annotations that might not be helpful. By default, we ignore Ordinals (Number Sequences), Quantities and Percent types. You can adjust the entity filter in the configuration.

#...
  "stanza": {
    "document_batches": 1000, # how many documents will be processed in one batch (more requires more VRAM)
    "entity_type_blocked_list": ["ORDINAL", "QUANTITY", "PERCENT"] # ignored entity types
  }
#...

Biomedical Entity Linking

Instead of integrating the domain-specific entity linker directly, you may also use them next to our toolbox and only load their outputs. Additionally, our Pipeline supports two commonly used tools for entity linking in the biomedical domain. Namely, these are

TaggerOne for Chemicals and Diseases
GNormPlus for Genes and Species.

Setup

First, create a directory for the taggers:

mkdir ~/tools

Download GNormPlus and TaggerOne. Unzip both and move the directories into tools.

tools/
  GNormPlusJava/
  TaggerOne-0.2.1/

Both tools require a Java installation. Both tools need to be installed and compiled by hand. So for GNormPlus and TaggerOne, see their readme files. For TaggerOne, some models must be build manually.

Tagger Configuration

Adjust the root path configurations for both taggers in entity_linking.json:

{
  "taggerOne": {
    "root": "<path to tools>/tools/TaggerOne-0.2.1",  # Taggerone root path here
    "model": "models/model_BC5CDRJ_011.bin",
    "batchSize": 10000,
    "timeout": 15,
    "max_retries": 1
  },
  "gnormPlus": {
    "root": "<path to tools>/tools/GNormPlusJava", # GNormPlus path here
    "javaArgs": "-Xmx16G -Xms10G"
  },
  #...
}

If TaggerOne gets stuck on a file, the process will be killed after "timeout" minutes without progress. The pipeline will then restart TaggerOne and will retry to process the file "max_retries" times. If no progress is made by then, the file will be ignored.

Runing the biomedical entity linking

Below you can see a sample call for the pipeline. Run TaggerOne:

python src/kgextractiontoolbox/entitylinking/biomedical_entity_linking.py test.json --collection test --tagger-one

Run GNormPlus:

python src/kgextractiontoolbox/entitylinking/biomedical_entity_linking.py test.json --collection test --gnormplus

The pipeline will read the input file test.json and will load the contained documents into the database in collection test. It will then invoke both taggerOne and GNormPlus, which generate tags as output. The tags will also be inserted into the database.

You must either select --tagger-one or --gnormplus. Both linkers must run separately.

For more information and additional options, please see

python src/kgextractiontoolbox/entitylinking/biomedical_entity_linking.py --help

Export Annotations

For generating an output file containing the generated tags, please see 04 Export Statements.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README_01_ENTITY_LINKING.md

README_01_ENTITY_LINKING.md

Entity Linking

Configuration

Dictionary-based Entity Linker

Full-text Documents

Configuration

Stanza Named Entity Recognition

Full-text Documents

Stanza Config

Biomedical Entity Linking

Setup

Tagger Configuration

Runing the biomedical entity linking

Export Annotations

Files

README_01_ENTITY_LINKING.md

Latest commit

History

README_01_ENTITY_LINKING.md

File metadata and controls

Entity Linking

Configuration

Dictionary-based Entity Linker

Full-text Documents

Configuration

Stanza Named Entity Recognition

Full-text Documents

Stanza Config

Biomedical Entity Linking

Setup

Tagger Configuration

Runing the biomedical entity linking

Export Annotations