Our pipeline supports a JSON document format. Every document is identified by a unique integer id. Each document must belong to a document collection, and its id must be unique within that collection.
You can load your documents:
python src/kgextractiontoolbox/documents/load_document.py DOCUMENTS.json --collection COLLECTION
Document ids must be unique integers within a document collection. The loading procedure will automatically include entity annotations (tags) if contained in the document file. If you don't want to include tags, use the --ignore_tags argument.
python src/kgextractiontoolbox/documents/load_document.py DOCUMENTS.json --collection COLLECTION --ignore_tags
A document file may contain only annotations (exported by our toolbox; see export). The toolbox will only load these annotations if the corresponding documents with titles or abstracts have been inserted into the database.
By default, if a document already exists in the database, the loading script will skip it and not make any changes. However, if you have modified documents that need to replace the existing ones in the database (e.g., updated content, new sections, corrected data), you can use the --replace_existing parameter. This parameter tells the script to replace any existing documents in the database with the new ones provided.
To replace existing documents, use the following command:
python src/kgextractiontoolbox/documents/load_document.py DOCUMENTS.json --collection COLLECTION --replace_existing
Here is an example of our JSON format:
[
{
"id": 12345,
"title": "Barack Obama [...]",
"abstract": "Obama was born in Honolulu, Hawaii. After graduating from Columbia University in 1983 [..]",
"sections": [
{
"position": 0,
"title": "Introduction",
"text": "Barack Hussein Obama II is an American politician [...]"
},
{
"position": 1,
"title": "Early life and career",
"text": "Obama was born on August 4, 1961, at Kapiolani Medical Center for Women and Children [...]"
}
]
},
// more documents ...
]
The "sections" part is optional. The outermost array brackets [] can be omitted if the file contains only a single JSON document.
Note:
- a document id must be an integer
- id, title and abstract are required
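The rules above can be checked before a file is passed to the loading script. The following is a minimal sketch, not part of the toolbox: `validate_documents` is a hypothetical helper name, and it only mirrors the rules stated in this section (integer id, required id/title/abstract, unique ids, optional outermost array).

```python
REQUIRED_FIELDS = ("id", "title", "abstract")


def validate_documents(documents):
    """Return a list of problems; an empty list means all checks passed.

    Hypothetical helper (not part of the toolbox). It only mirrors the
    rules described in this section.
    """
    # A single document may be given without the outermost array brackets.
    if isinstance(documents, dict):
        documents = [documents]

    problems = []
    seen_ids = set()
    for index, doc in enumerate(documents):
        for field in REQUIRED_FIELDS:
            if field not in doc:
                problems.append(f"document {index}: missing '{field}'")
        if "id" not in doc:
            continue
        doc_id = doc["id"]
        # bool is a subclass of int in Python, so it is excluded explicitly.
        if not isinstance(doc_id, int) or isinstance(doc_id, bool):
            problems.append(f"document {index}: id {doc_id!r} is not an integer")
        elif doc_id in seen_ids:
            problems.append(f"document {index}: duplicate id {doc_id}")
        else:
            seen_ids.add(doc_id)
    return problems


# Illustrative data only: the second document has a string id and no abstract.
sample = [
    {"id": 12345, "title": "Barack Obama", "abstract": "Obama was born in Honolulu."},
    {"id": "12346", "title": "Missing abstract"},
]
for problem in validate_documents(sample):
    print(problem)
```

The check covers a single file only. Uniqueness within a collection that already exists in the database is not covered by this sketch.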
We also support the .jsonl format, where each document is stored as a JSON object on a single line. Files must have the suffix .jsonl. Here is an example of the JSONL format:
{"id": 1, "title": "Comparing Letrozole", "abstract": "Abstract 1"}
{"id": 2, "title": "A Study Investigating", "abstract": "Abstract 2"}
{"id": 3, "title": "Title 3", "abstract": "Abstract 3"}
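If your documents are already available as a JSON array, they can be converted to JSONL with the standard `json` module. This is a minimal sketch under that assumption: the function names `documents_to_jsonl` and `jsonl_to_documents` are hypothetical and not part of the toolbox.

```python
import json


def documents_to_jsonl(documents):
    """Serialize a list of document dicts as JSONL text (one object per line)."""
    # json.dumps without `indent` produces a single line; newlines inside
    # string values are escaped as \n, so each document stays on one line.
    return "".join(json.dumps(doc) + "\n" for doc in documents)


def jsonl_to_documents(text):
    """Parse JSONL text back into a list of document dicts."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


# Illustrative data only.
docs = [
    {"id": 1, "title": "Comparing Letrozole", "abstract": "Abstract 1"},
    {"id": 2, "title": "A Study Investigating", "abstract": "Line one\nline two"},
]
jsonl_text = documents_to_jsonl(docs)
print(jsonl_text, end="")
```

Write the resulting text to a file whose name ends in .jsonl so that it matches the required suffix.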
The second document format is the so-called PubTator format. A PubTator document has a document id, a document collection, a title and an abstract.
document_id|t|title text here
document_id|a|abstract text here
ATTENTION: the PubTator file must end with two \n characters. The document id must be an integer. Title and abstract can include special characters - the texts will be sanitized in our pipeline. If you want to tag several documents, you can choose from two options:
- Create a PubTator file for each document and put them into a directory
- Create a single PubTator file with several documents
document_id_1|t|title text here
document_id_1|a|abstract text here

document_id_2|t|title text here
document_id_2|a|abstract text here

document_id_3|t|title text here
document_id_3|a|abstract text here

The documents within a file are separated by two newline characters \n, i.e. a blank line follows each abstract line. ATTENTION: the PubTator file must end with two \n characters.
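The trailing blank line is easy to get wrong when a PubTator file is written by hand. The following is a minimal sketch of generating such a file; `to_pubtator` is a hypothetical helper, not part of the toolbox. Collapsing whitespace in titles and abstracts is an assumption made here so that each field stays on one line.

```python
def to_pubtator(documents):
    """Render documents (dicts with id, title, abstract) as PubTator text.

    Hypothetical helper, not part of the toolbox.
    """
    parts = []
    for doc in documents:
        doc_id = int(doc["id"])  # the document id must be an integer
        # Assumption: collapse line breaks and repeated spaces so that the
        # title and the abstract each fit on a single line.
        title = " ".join(str(doc["title"]).split())
        abstract = " ".join(str(doc["abstract"]).split())
        # The final "\n\n" ends the abstract line and adds the blank line,
        # so documents are separated and the file ends with two \n.
        parts.append(f"{doc_id}|t|{title}\n{doc_id}|a|{abstract}\n\n")
    return "".join(parts)


# Illustrative data only.
text = to_pubtator([
    {"id": 1, "title": "Title 1", "abstract": "Abstract 1"},
    {"id": 2, "title": "Title 2", "abstract": "Abstract\n2"},
])
print(text, end="")
```

When saving the result, open the file with `newline="\n"` so that the line endings are not translated on Windows.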
The following is only of interest if you are working with custom taggers. You can specify a tagger map when loading a document file. The database will then record that these documents have been processed by the corresponding taggers. This is useful if you work with custom taggers and don't want to annotate a document twice. As an example:
{
"Chemical" : ["TaggerOne", "0.2.1"],
"Disease" : ["TaggerOne", "0.2.1"],
"DosageForm": ["DosageFormTagger" , "1.0.0"],
"Gene" : ["GNormPlus", "unknown"],
"Species" : ["SR4GN", "unknown" ],
"CellLine" : ["TaggerOne", "0.2.1"],
"Variant" : [ "tmVar", "2.0"]
}
Then, run:
python src/kgextractiontoolbox/documents/load_document.py DOCUMENTS.json --tagger_map TAGGER_MAP.json --collection COLLECTION