First, create a copy of the entity linking configuration.
cp config/entity_linking.prod.json config/entity_linking.json
The dictionary-based entity linker requires an entity vocabulary as its input. An entity vocabulary may look like:
id | type | heading | synonyms |
---|---|---|---|
Q1 | Person | Barack Obama | Obama;Barack |
Q2 | Person | Angela Merkel | Merkel |
Q3 | Location | Honolulu | |
Q4 | Location | Hamburg | Hansestadt |
Q5 | Location | America | US;USA;United States |
Each entry has a unique entity id, an entity type, a heading and a list of synonyms. We encode an entity vocabulary as a TSV file:
- each line represents an entity
- ids must be unique
- types can be arbitrary strings
- heading is a string
- synonyms is a list of strings separated by a ;
An example TSV entity vocabulary file looks like:
id type heading synonyms
Q1 Person Barack Obama Obama;Barack
Q2 Person Angela Merkel Merkel
Q3 Location Honolulu
Q4 Location Hamburg Hansestadt
Q5 Location America US;USA;United States
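To make the format concrete, here is a minimal Python sketch that parses such a vocabulary with the standard library only. The helper name `read_vocabulary` and the returned dictionary layout are illustrative assumptions and not part of the toolbox; the sketch also assumes that columns are separated by tab characters and that the first line is the header.

```python
import csv
import io


def read_vocabulary(tsv_text):
    """Parse a tab-separated entity vocabulary into a dict keyed by entity id.

    Hypothetical helper for illustration; not part of the toolbox.
    """
    vocabulary = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        entity_id = row["id"]
        if entity_id in vocabulary:
            raise ValueError(f"duplicate entity id: {entity_id}")
        # The synonyms column may be empty or missing; split non-empty values on ';'.
        raw_synonyms = row.get("synonyms") or ""
        synonyms = [s.strip() for s in raw_synonyms.split(";") if s.strip()]
        vocabulary[entity_id] = {
            "type": row["type"],
            "heading": row["heading"],
            "synonyms": synonyms,
        }
    return vocabulary


# A few rows from the example vocabulary, with explicit tab separators.
EXAMPLE_TSV = (
    "id\ttype\theading\tsynonyms\n"
    "Q1\tPerson\tBarack Obama\tObama;Barack\n"
    "Q3\tLocation\tHonolulu\t\n"
    "Q5\tLocation\tAmerica\tUS;USA;United States\n"
)

vocab = read_vocabulary(EXAMPLE_TSV)
print(vocab["Q1"]["synonyms"])
print(vocab["Q3"]["synonyms"])
```

Running a check like this before linking can catch duplicate ids or a file that was saved with spaces instead of tabs.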
Next, we use the entity vocabulary to produce annotations. The entity linker requires:
- document/documents as its input
- the corresponding document collection
- the vocabulary file
The entity linker will automatically insert documents that are not in the database yet.
python src/kgextractiontoolbox/entitylinking/vocab_entity_linking.py -i DOCUMENT -c COLLECTION -v VOCAB_FILE
To link a whole collection, leave out the input parameter:
python src/kgextractiontoolbox/entitylinking/vocab_entity_linking.py -c COLLECTION -v VOCAB_FILE
You can also parallelize the entity linking by adding the --workers argument and specifying the number of workers:
python src/kgextractiontoolbox/entitylinking/vocab_entity_linking.py -i DOCUMENT -c COLLECTION -v VOCAB_FILE --workers 10
By default, the entity linker writes its logs to a temporary directory and deletes this directory on completion. You can specify a logging directory that will not be deleted:
python src/kgextractiontoolbox/entitylinking/vocab_entity_linking.py -i DOCUMENT -c COLLECTION -v VOCAB_FILE --workdir test/
Note that our toolbox won't annotate the same document twice; this is checked automatically. If your document content has changed, please delete the old contents of the document, doc_tagged_by and tags tables first.
If your entity vocabulary has changed, you can use the --force argument to enforce linking all documents again.
python src/kgextractiontoolbox/entitylinking/vocab_entity_linking.py -i DOCUMENT -c COLLECTION -v VOCAB_FILE --force
If your document includes sections, you must enable section linking explicitly:
python src/kgextractiontoolbox/entitylinking/vocab_entity_linking.py -c COLLECTION -v VOCAB_FILE --sections
Sections are only considered in the linking step if --sections is set.
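If you call the linker from a script, the options described above can be assembled programmatically. The following is a minimal sketch: `build_linking_command` is a hypothetical helper, not part of the toolbox, and it assumes that the flags shown above can be combined freely in a single call.

```python
import sys

SCRIPT = "src/kgextractiontoolbox/entitylinking/vocab_entity_linking.py"


def build_linking_command(collection, vocab_file, document=None, workers=None,
                          workdir=None, force=False, sections=False):
    """Assemble the argument list for one entity linking run.

    Hypothetical helper; it only combines the flags described above.
    """
    command = [sys.executable, SCRIPT]
    if document is not None:
        # Without -i, the whole collection is linked.
        command += ["-i", document]
    command += ["-c", collection, "-v", vocab_file]
    if workers is not None:
        command += ["--workers", str(workers)]
    if workdir is not None:
        command += ["--workdir", workdir]
    if force:
        command.append("--force")
    if sections:
        command.append("--sections")
    return command


command = build_linking_command("COLLECTION", "VOCAB_FILE", workers=10, sections=True)
print(" ".join(command[1:]))
# To actually run it from the repository root:
#   import subprocess; subprocess.run(command, check=True)
```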
There are several options that can be specified in a configuration file.
nano config/entity_linking.json
You can adjust the settings for the dictionary-based entity linker:
#...
"dict": {
"max_words": 5, # specifies the maximal number of words an entity has
"check_abbreviation": "true", # check custom introduced abbreviations in brackets
"custom_abbreviations": "true", # check custom introduced abbreviations in brackets
"min_full_tag_len": 5, # may improve the quality when working with homonyms. An entity is only tagged in a document if a full mention with at least this many characters (here 5) was detected at least once.
"split_by_slash": "true" # enable word splitting rule by slash (metformin/simvastatin -> [metformin, simvastatin, metformin/simvastatin])
}
#...
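The two less obvious settings, split_by_slash and min_full_tag_len, can be illustrated with a small Python sketch. This is only one reading of the comments above and not the toolbox's implementation: both function names are hypothetical, and the sketch assumes that a "full mention" is any mention with at least min_full_tag_len characters.

```python
def split_by_slash(token):
    """Return the slash-separated parts followed by the original token."""
    if "/" not in token:
        return [token]
    parts = [part for part in token.split("/") if part]
    return parts + [token]


def filter_by_full_mention(mentions, min_full_tag_len=5):
    """Keep mentions only for entities that have at least one full mention.

    `mentions` is a list of (entity_id, mention_text) pairs found in one
    document. Treating "full" as "at least min_full_tag_len characters"
    is an assumption.
    """
    entities_with_full_mention = {
        entity_id
        for entity_id, text in mentions
        if len(text) >= min_full_tag_len
    }
    return [
        (entity_id, text)
        for entity_id, text in mentions
        if entity_id in entities_with_full_mention
    ]


tokens = split_by_slash("metformin/simvastatin")
print(tokens)

# Q5 has the long mention "America", so its short mention "US" is kept.
# Q9 is a made-up entity that is only mentioned by a short string.
found = [("Q5", "US"), ("Q5", "America"), ("Q9", "HH")]
kept = filter_by_full_mention(found, min_full_tag_len=5)
print(kept)
```

Under this reading, a short and possibly ambiguous synonym is only accepted when the same entity is also mentioned by a longer string somewhere in the document.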
Stanza is not installed by default. To use it, please install:
pip install stanza~=1.2.3
Before working with Stanza, you need to set up the English model. To do so, run:
python src/kgextractiontoolbox/setup_stanza.py
This may take a while.
Next, Stanza can be used to detect Named Entities in documents. Note that Stanza does not produce entity ids. Thus, we will use the entity mention string as the entity id.
python src/kgextractiontoolbox/entitylinking/stanza_ner.py -i DOCUMENT -c COLLECTION
Again, the input -i is optional; leave it out to process the whole collection.
Stanza will run on your GPU by default. If no GPU is available, you can specify the --CPU flag, which will result in a longer runtime.
python src/kgextractiontoolbox/entitylinking/stanza_ner.py -i DOCUMENT -c COLLECTION --CPU
Note that our toolbox won't annotate the same document twice; this is checked automatically. If your document content has changed, please delete the old contents of the document, doc_tagged_by and tags tables first.
If your document includes sections, you must enable section linking explicitly:
python src/kgextractiontoolbox/entitylinking/stanza_ner.py -i DOCUMENT -c COLLECTION --sections
Sections are only considered in the linking step if --sections is set.
There are several options that can be specified in a configuration file.
nano config/entity_linking.json
By default, Stanza produces many entity annotations that might not be helpful. We therefore ignore the Ordinal (number sequences), Quantity and Percent types by default. You can adjust the entity filter in the configuration.
#...
"stanza": {
"document_batches": 1000, # how many documents will be processed in one batch (more requires more VRAM)
"entity_type_blocked_list": ["ORDINAL", "QUANTITY", "PERCENT"] # ignored entity types
}
#...
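The effect of the blocked list, together with the rule that the mention string serves as the entity id, can be sketched in a few lines of Python. This is an illustration under assumptions and not the toolbox's code: `to_annotations`, the annotation dictionary layout, and the sample NER output are all hypothetical, and the configuration excerpt is written as strict JSON without comments.

```python
import json

# Assumed excerpt of the "stanza" settings, written as strict JSON (no comments).
CONFIG_TEXT = """
{
  "stanza": {
    "document_batches": 1000,
    "entity_type_blocked_list": ["ORDINAL", "QUANTITY", "PERCENT"]
  }
}
"""


def to_annotations(detected_entities, blocked_types):
    """Drop blocked entity types and use the mention string as the entity id."""
    annotations = []
    for mention, entity_type in detected_entities:
        if entity_type in blocked_types:
            continue
        annotations.append(
            {"entity_id": mention, "entity_type": entity_type, "mention": mention}
        )
    return annotations


config = json.loads(CONFIG_TEXT)
blocked = set(config["stanza"]["entity_type_blocked_list"])

# Hypothetical NER output as (mention, type) pairs.
detected = [
    ("Barack Obama", "PERSON"),
    ("first", "ORDINAL"),
    ("Hamburg", "GPE"),
    ("50 percent", "PERCENT"),
]
annotations = to_annotations(detected, blocked)
print([a["entity_id"] for a in annotations])
```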
Instead of integrating domain-specific entity linkers directly, you may also run them alongside our toolbox and only load their outputs. Additionally, our pipeline supports two commonly used tools for entity linking in the biomedical domain:
- TaggerOne for Chemicals and Diseases
- GNormPlus for Genes and Species.
First, create a directory for the taggers:
mkdir ~/tools
Download GNormPlus and TaggerOne. Unzip both and move the directories into tools.
tools/
GNormPlusJava/
TaggerOne-0.2.1/
Both tools require a Java installation and need to be installed and compiled by hand; see their readme files for details. For TaggerOne, some models must be built manually.
Adjust the root path configurations for both taggers in entity_linking.json:
{
"taggerOne": {
"root": "<path to tools>/tools/TaggerOne-0.2.1", # Taggerone root path here
"model": "models/model_BC5CDRJ_011.bin",
"batchSize": 10000,
"timeout": 15,
"max_retries": 1
},
"gnormPlus": {
"root": "<path to tools>/tools/GNormPlusJava", # GNormPlus path here
"javaArgs": "-Xmx16G -Xms10G"
},
#...
}
If TaggerOne gets stuck on a file, the process will be killed after "timeout" minutes without progress. The pipeline will then restart TaggerOne and retry processing the file up to "max_retries" times. If no progress is made by then, the file will be ignored.
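The retry behaviour can be pictured with the following simplified Python sketch. It is not the pipeline's actual code: `process_with_retries` and the `run_tagger` callable are hypothetical, the timeout itself is hidden behind that callable, and the sketch assumes that "max_retries" counts additional attempts after the first one.

```python
def process_with_retries(files, run_tagger, max_retries=1):
    """Return (processed, ignored) lists of file names.

    run_tagger(name) is a hypothetical callable that returns True when the
    tagger finished the file and False when it was killed after the timeout.
    """
    processed, ignored = [], []
    for name in files:
        attempts_allowed = 1 + max_retries  # first attempt plus retries
        for _ in range(attempts_allowed):
            if run_tagger(name):
                processed.append(name)
                break
        else:
            # Every attempt timed out, so the file is skipped.
            ignored.append(name)
    return processed, ignored


attempts = {}


def fake_tagger(name):
    """Stand-in for a tagger run: one file always hangs, one succeeds on retry."""
    attempts[name] = attempts.get(name, 0) + 1
    if name == "stuck.txt":
        return False
    if name == "flaky.txt":
        return attempts[name] > 1
    return True


processed, ignored = process_with_retries(
    ["ok.txt", "flaky.txt", "stuck.txt"], fake_tagger, max_retries=1
)
print(processed, ignored)
```

With a timeout of 15 minutes and one retry, a file that always hangs would cost about 30 minutes under these assumptions before it is skipped.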
Below you can see a sample call for the pipeline. Run TaggerOne:
python src/kgextractiontoolbox/entitylinking/biomedical_entity_linking.py test.json --collection test --tagger-one
Run GNormPlus:
python src/kgextractiontoolbox/entitylinking/biomedical_entity_linking.py test.json --collection test --gnormplus
The pipeline will read the input file test.json and load the contained documents into the database in the collection test. It will then invoke the selected tagger (TaggerOne or GNormPlus), which generates tags as output. The tags will also be inserted into the database.
You must select either --tagger-one or --gnormplus; the two linkers must be run separately.
For more information and additional options, please see:
python src/kgextractiontoolbox/entitylinking/biomedical_entity_linking.py --help
For generating an output file containing the generated tags, please see 04 Export Statements.