Loading MeSH datasets

filak edited this page Jan 28, 2025 · 80 revisions

Before any data loading

The Jena assembler config file <MTW_HOME_DIR>/instance/conf/mesh.ttl MUST be set up properly!

Jena 4

https://github.com/filak/MTW-MeSH/blob/master/flask-app/instance/conf/mesh_Jena4.ttl

Jena 5

https://github.com/filak/MTW-MeSH/blob/master/flask-app/instance/conf/mesh_Jena5.ttl

Copy the file matching your Jena version to <MTW_HOME_DIR>/instance/conf/ and rename it to mesh.ttl

Adjust the paths in mesh.ttl to your <FUSEKI_DATA_DIR>

Use forward slashes

    tdb2:location  "c:/<FUSEKI_DATA_DIR>/databases/mesh" ;

    text:directory "c:/<FUSEKI_DATA_DIR>/indexes/mesh" ;
  • Validate mesh.ttl

No output = file is OK

  riot --validate mesh.ttl
  • Copy the mesh.ttl file to:

    <FUSEKI_DATA_DIR>/configuration/
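Because the assembler file expects forward slashes even on Windows, the two location strings can be derived from a native path instead of being typed by hand. A minimal sketch in Python: the helper name `jena_location` and the directory `c:\fuseki-data` are illustrative assumptions and not part of MTW.

```python
from pathlib import PureWindowsPath


def jena_location(fuseki_data_dir: str, kind: str, name: str = "mesh") -> str:
    """Return <fuseki_data_dir>/<kind>/<name> written with forward slashes.

    kind is "databases" (for tdb2:location) or "indexes" (for text:directory).
    """
    return (PureWindowsPath(fuseki_data_dir) / kind / name).as_posix()


if __name__ == "__main__":
    base = r"c:\fuseki-data"  # hypothetical <FUSEKI_DATA_DIR>
    print(f'tdb2:location  "{jena_location(base, "databases")}" ;')
    print(f'text:directory "{jena_location(base, "indexes")}" ;')
```

The printed lines can be pasted into mesh.ttl before running the validation step.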

Get the official MeSH RDF dataset

Download the official MeSH RDF dataset mesh.nt.gz from https://nlmpubs.nlm.nih.gov/projects/mesh/rdf/

You can use the curl tool to download it:

curl https://nlmpubs.nlm.nih.gov/projects/mesh/rdf/mesh.nt.gz --ssl-no-revoke -O

! IMPORTANT NOTICE !

As of this writing (Jan 2025), the download above no longer yields the current MeSH release.

The mesh.nt.gz currently available is still the MeSH 2024 version - hash c9ef004de88b9201b84f90aad2966bfd067af799
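To tell whether a downloaded mesh.nt.gz is still the 2024 file, its digest can be compared with the hash quoted above. The sketch below assumes that hash is a SHA-1 of the file contents (its 40 hex digits suggest so, but this page does not state the algorithm); the function names are illustrative.

```python
import hashlib

EXPECTED_2024 = "c9ef004de88b9201b84f90aad2966bfd067af799"  # hash quoted above


def sha1_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file so a large mesh.nt.gz is never held in memory at once."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_2024_release(path: str) -> bool:
    # Assumption: the quoted hash is a plain SHA-1 of the downloaded file.
    return sha1_of_file(path) == EXPECTED_2024
```

If `is_2024_release("mesh.nt.gz")` returns False, either the file has been updated or the assumption about the hash algorithm does not hold.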

Despite several requests (https://github.com/HHS/meshrdf/issues/212#issuecomment-2539919254) for information on when, if ever, the full RDF dataset for MeSH 2025 will be made available, NLM has stayed silent. The release notes are also outdated.

The only official MeSH 2025 RDF datasets available are here https://nlmpubs.nlm.nih.gov/projects/mesh/rdf/2025/ - BUT:

  • these are not the complete datasets - obsolete/inactive items are missing - no meshv:active triples are present
  • this is the "name-spaced" version - prefix http://id.nlm.nih.gov/mesh/2025/

The information about MeSH item status is vital, both for the translation process and for functional MTW outputs/exports. Existing data workflows (for updating obsolete MeSH items, etc.) rely on the active/inactive status being available.

So what can be done in this situation? Let's try to create the most complete MeSH 2025 RDF version possible.

You can follow this guide or skip it and just download the final files - mesh.nt.gz and mesh2024_inactive.nt

Step 1: Get the MeSH 2025 RDF without the year name-spaced prefix - mesh.nt.gz

Download all the official MeSH 2025 XML files and produce the RDF dataset mesh.nt.gz with the https://github.com/HHS/meshrdf scripts, with no year in the namespace (!)

OR

Download https://nlmpubs.nlm.nih.gov/projects/mesh/rdf/2025/mesh2025.nt.gz and update the namespace using the MTW script tools/update-ns.py

  py update-ns.py mesh2025.nt.gz http://id.nlm.nih.gov/mesh/2025/ http://id.nlm.nih.gov/mesh/ mesh.nt.gz
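For readers who want to see what such a namespace rewrite involves, here is a minimal sketch. It is not the MTW update-ns.py script, only an illustration of the same idea; the function name `rewrite_namespace` is hypothetical. It streams the gzipped N-Triples file line by line and replaces the year-specific prefix wherever it opens an IRI.

```python
import gzip


def rewrite_namespace(src: str, old_ns: str, new_ns: str, dst: str) -> int:
    """Copy gzipped N-Triples from src to dst, replacing old_ns with new_ns.

    The match is anchored on the "<" that opens an IRI, so an occurrence of
    old_ns that is not directly preceded by "<" is left unchanged.
    Returns the number of lines that were changed.
    """
    old, new = "<" + old_ns, "<" + new_ns
    changed = 0
    with gzip.open(src, "rt", encoding="utf-8") as fin, \
         gzip.open(dst, "wt", encoding="utf-8", newline="\n") as fout:
        for line in fin:
            updated = line.replace(old, new)
            changed += updated != line
            fout.write(updated)
    return changed


if __name__ == "__main__":
    n = rewrite_namespace(
        "mesh2025.nt.gz",
        "http://id.nlm.nih.gov/mesh/2025/",
        "http://id.nlm.nih.gov/mesh/",
        "mesh.nt.gz",
    )
    print(f"{n} lines updated")
```

Terms in the vocabulary namespace (http://id.nlm.nih.gov/mesh/vocab#) do not contain the year segment and therefore pass through untouched.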

Step 2: Create the inactive items dataset - mesh2024_inactive.nt

Fortunately, according to the UMLS MeSH 2025 reports, no main headings were deleted, so we can use last year's complete dataset.

Download the complete MeSH 2024 dataset mesh.nt.gz, save it as mesh2024_full.nt.gz, and extract the inactive items using the Jena tool arq with the query mesh-inactive.sparql:

  arq --data=mesh2024_full.nt.gz --query=mesh-inactive.sparql > mesh2024_inactive.ttl

  riot --output=N-TRIPLES mesh2024_inactive.ttl > mesh2024_inactive.nt
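The contents of mesh-inactive.sparql are not reproduced on this page. As a rough illustration of what "extracting the inactive items" can mean, the sketch below makes two passes over the N-Triples file: it first collects every subject carrying `meshv:active "false"`, then copies all triples with those subjects. This is one plausible reading, not the query MTW uses, and the literal form of the boolean is an assumption about how the NLM file is serialized.

```python
import gzip

# Assumed serialization of the meshv:active predicate and its false value.
ACTIVE = "<http://id.nlm.nih.gov/mesh/vocab#active>"
FALSE = '"false"^^<http://www.w3.org/2001/XMLSchema#boolean>'


def inactive_subjects(path: str) -> set[str]:
    """First pass: collect subjects that have meshv:active false."""
    subjects = set()
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            parts = line.split(" ", 2)  # subject, predicate, rest of line
            if len(parts) == 3 and parts[1] == ACTIVE and parts[2].startswith(FALSE):
                subjects.add(parts[0])
    return subjects


def extract_inactive(src: str, dst: str) -> int:
    """Second pass: write every triple whose subject is inactive to dst."""
    wanted = inactive_subjects(src)
    written = 0
    with gzip.open(src, "rt", encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8", newline="\n") as fout:
        for line in fin:
            if line.split(" ", 1)[0] in wanted:
                fout.write(line)
                written += 1
    return written
```

The result should be compared against the output of the arq command above before it is relied upon, since the real query may select a different set of triples.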

Step 3: Copy the two created files to your <IMPORT> directory

  • mesh.nt.gz
  • mesh2024_inactive.nt

Get the translation RDF dataset

If you have not translated MeSH before, you can proceed to Import.

Convert the official UMLS TSV file

Use the trans_only_YYYY_extended.txt and convert it with the mesh-trx2nt tool.

The file MUST have the following columns/items:

DescriptorUI | ConceptUI | Language | TermType | String | TermUI | ScopeNote | Tree | Created | Relation | ParentCUI	
  • the header row is optional
  • the TermUI column is always empty
  • the Relation and ParentCUI values are required only in rows for Custom Concepts (ConceptUI starting with F...) with TermType PEP
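The column rules above can be checked before running the conversion. This is a small sketch, not part of mesh-trx2nt: the function name `check_rows` is hypothetical, and the tab delimiter is an assumption based on the file being described as TSV (pass a different `delimiter` if your file uses another separator).

```python
import csv
import io

COLUMNS = ["DescriptorUI", "ConceptUI", "Language", "TermType", "String",
           "TermUI", "ScopeNote", "Tree", "Created", "Relation", "ParentCUI"]


def check_rows(text: str, delimiter: str = "\t") -> list[str]:
    """Return human-readable problems; an empty list means no problem found."""
    problems = []
    reader = csv.reader(io.StringIO(text), delimiter=delimiter,
                        quoting=csv.QUOTE_NONE)
    for lineno, row in enumerate(reader, start=1):
        if not row:
            continue  # ignore blank lines
        if lineno == 1 and row == COLUMNS:
            continue  # the header row is optional
        if len(row) != len(COLUMNS):
            problems.append(
                f"line {lineno}: expected {len(COLUMNS)} fields, got {len(row)}")
            continue
        rec = dict(zip(COLUMNS, row))
        # Custom Concept rows with TermType PEP need Relation and ParentCUI.
        custom_pep = rec["ConceptUI"].startswith("F") and rec["TermType"] == "PEP"
        if custom_pep and not (rec["Relation"] and rec["ParentCUI"]):
            problems.append(
                f"line {lineno}: Relation/ParentCUI missing for custom concept")
    return problems
```

Usage would be along the lines of `check_rows(open("trans_only_2023_extended.txt", encoding="utf-8").read())`, printing each returned problem.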

Display help - open CMD and run:

 mesh-trx2nt -h
usage: mesh-trx2nt inputFile langcode meshxPrefix [options]

Extracting translation dataset from NLM UMLS text file [trans_only_2023_expanded.txt]

positional arguments:
  inputFile    NLM UMLS text file name (plain or gzipped)
  langcode     Language code
  meshxPrefix  MeSH Translation namespace prefix ie. http://my.mesh.com/id/

options:
  -h, --help   show this help message and exit
  --out OUT    Output file name prefix

IMPORTANT

The langcode parameter MUST be the same as the TARGET_LANG value in your mtw.ini config file !

The meshxPrefix parameter MUST be the same as the TARGET_NS value in your mtw.ini config file !
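A mismatch between the command-line arguments and mtw.ini is easy to miss, so the two can be compared programmatically. The sketch below assumes the settings appear in mtw.ini as plain `KEY = value` lines, optionally quoted; that layout is an assumption, and the helper names are hypothetical. Inline comments after a value are not handled.

```python
import re


def read_setting(ini_text: str, key: str):
    """Find `KEY = value` on its own line; strip optional surrounding quotes."""
    pattern = rf"^[ \t]*{re.escape(key)}[ \t]*=[ \t]*(.*?)[ \t]*$"
    match = re.search(pattern, ini_text, re.MULTILINE)
    if match is None:
        return None
    return match.group(1).strip("'\"")


def check_args(ini_text: str, langcode: str, meshx_prefix: str) -> list[str]:
    """Return a list of mismatches between the arguments and the config."""
    problems = []
    target_lang = read_setting(ini_text, "TARGET_LANG")
    target_ns = read_setting(ini_text, "TARGET_NS")
    if target_lang != langcode:
        problems.append(f"langcode {langcode!r} != TARGET_LANG {target_lang!r}")
    if target_ns != meshx_prefix:
        problems.append(f"meshxPrefix {meshx_prefix!r} != TARGET_NS {target_ns!r}")
    return problems
```

An empty list from `check_args` means both arguments agree with the config text that was read.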

Run the conversion - open CMD and run, for example:

 mesh-trx2nt trans_only_2023_extended.txt fr http://id.mesh.fr/ 

Convert the official MTMS XML file - OBSOLETE

Download your *.xml translation file from

https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/.mtms/

Extract the translation data from the MeSH XML file as an N-Triples dataset using the mesh-xml2trx tool

Import the RDF datasets

  1. ALWAYS validate ALL the input files

    Run the validation:

    No output = dataset is OK

     riot --validate *.gz
    
  2. Move the input files into a versioned <IMPORT> directory, e.g. .../MeSH-data/2023/import/

  3. Load the MeSH dataset(s) into Apache Jena

    Stop the Fuseki server instance (if running)

    Go to your <IMPORT> directory

    Run the import:

     tdb2_tdbloader --loc %FUSEKI_BASE%/databases/mesh mesh.nt.gz mesh-trx_ ...
    

    or if you do not have a translation then just:

     tdb2_tdbloader --loc %FUSEKI_BASE%/databases/mesh mesh.nt.gz
    
  4. Create Fuseki search index

    Go to your <FUSEKI_DATA_DIR>

     cd %FUSEKI_BASE%
    

    Run the indexation - Jena v4:

     java -cp %FUSEKI_HOME%/fuseki-server.jar jena.textindexer --desc=configuration/mesh.ttl
    

    Run the indexation - Jena v5+:

     java --add-modules jdk.incubator.vector -cp %FUSEKI_HOME%/fuseki-server.jar jena.textindexer --desc=configuration/mesh.ttl
    
  5. Start Fuseki server instance
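The load and indexing commands differ only in the file list and in one extra JVM option for Jena 5+, so they can be assembled by a small helper when scripting the import. The sketch below only builds the argument lists shown above; the function names are hypothetical, and actually executing the commands (for instance through `subprocess.run`) is left out because the Jena tools are launched differently on Windows and Linux.

```python
def loader_cmd(fuseki_base: str, files: list[str]) -> list[str]:
    """Arguments for loading the given files into the mesh TDB2 database."""
    return ["tdb2_tdbloader", "--loc", f"{fuseki_base}/databases/mesh", *files]


def textindexer_cmd(fuseki_home: str, jena_major: int,
                    desc: str = "configuration/mesh.ttl") -> list[str]:
    """Arguments for the text indexer; Jena 5+ needs the vector module."""
    cmd = ["java"]
    if jena_major >= 5:
        cmd += ["--add-modules", "jdk.incubator.vector"]
    cmd += ["-cp", f"{fuseki_home}/fuseki-server.jar",
            "jena.textindexer", f"--desc={desc}"]
    return cmd


if __name__ == "__main__":
    # Placeholder values; substitute your own FUSEKI_BASE / FUSEKI_HOME.
    print(" ".join(loader_cmd("%FUSEKI_BASE%", ["mesh.nt.gz"])))
    print(" ".join(textindexer_cmd("%FUSEKI_HOME%", jena_major=5)))
```

The indexer command is meant to be run from the <FUSEKI_DATA_DIR>, since the `--desc` path is relative to it.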

Loading data from a backup

  1. Stop MTW services

  2. Stop your Fuseki instance

  3. Go to your <FUSEKI_DATA_DIR> and make sure the <mesh> directories under the databases and indexes directories are empty!

    Run the import:

     tdb2_tdbloader --loc %FUSEKI_BASE%/databases/mesh %FUSEKI_BASE%/backups/mesh_YYYY-MM-DD_....nq.gz
    

    Create the search index - Jena v4 - run:

     java -cp %FUSEKI_HOME%/fuseki-server.jar jena.textindexer --desc=configuration/mesh.ttl
    

    Create the search index - Jena v5+ - run:

     java --add-modules jdk.incubator.vector -cp %FUSEKI_HOME%/fuseki-server.jar jena.textindexer --desc=configuration/mesh.ttl
    
  4. Start your Fuseki instance

  5. Start MTW services
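The "directories must be empty" precondition of the restore can be verified before running the loader. A minimal sketch, assuming the directory layout used throughout this page (`databases/mesh` and `indexes/mesh` under the Fuseki data directory); the function name is hypothetical.

```python
from pathlib import Path


def non_empty_dirs(fuseki_data_dir: str, name: str = "mesh") -> list[Path]:
    """Return the mesh directories that still contain files or subdirectories."""
    base = Path(fuseki_data_dir)
    found = []
    for parent in ("databases", "indexes"):
        target = base / parent / name
        # A missing directory counts as empty; only existing content is reported.
        if target.is_dir() and any(target.iterdir()):
            found.append(target)
    return found
```

An empty result means the restore can proceed; otherwise the listed directories should be cleared first.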

Continue to MeSH Annual Updates