Multisensor NER Mapping

HTML Rendered version

1 Prefixes


2 Mapping

| Class / Property | Type/enum | Mapping | Notes |
| --- | --- | --- | --- |
| all | | nif:Word or nif:Phrase | |
| text | string | n/a | nif:anchorOf omitted |
| Person | | dbo:Person, foaf:Person; nerd:Person | |
| gender | male, female | dbo:gender | dbp:Male, dbp:Female |
| occupation | string | rdau:professionOrOccupation | dbo:occupation and dbo:profession are object props |
| Location | type=other | nerd:Location | No need to use dbo:Location if you can’t identify the type |
| Location | type=country | dbo:Country; nerd:Country | |
| Location | type=region | dbo:Region; nerd:AdministrativeRegion | |
| Location | type=city | dbo:City; nerd:City | |
| Location | type=street | schema:PostalAddress; nerd:Location | Put text in schema:streetAddress |
| Organisation | type=institution | dbo:Organisation, foaf:Organization; nerd:Organization | |
| Organisation | type=company | dbo:Company, foaf:Company; nerd:Company | |
| type | string | not yet | don’t know yet what makes sense here |
| Time | | time:Instant; nerd:Time | TODO: can you parse to XSD datetime components? |
| year | string | time:Instant; nerd:Time | |
| month | string | time:Instant OR yago:Months; nerd:Time | if yago:Months then dbp:January… |
| day | string | time:Instant; nerd:Time | |
| time | string | time:Instant; nerd:Time | |
| weekday | string | yago:DaysOfTheWeek; nerd:Time | dbp:Sunday,… Put text in rdfs:label |
| rel | string | nerd:Time | relative expression, eg “the last three days” |
| other | string | nerd:Time | any other time expression, eg “Valentine’s day” |
| Amount | type=price | schema:PriceSpecification; nerd:Amount | |
| unit | string | schema:priceCurrency | 3-letter ISO 4217 format |
| amount | number | schema:price | “.” as decimal separator |
| Amount | type=unit | schema:QuantitativeValue; nerd:Amount | How about percentage?? |
| unit | string | schema:unitCode | Strictly speaking, UN/CEFACT Common Code (eg GRM for grams) |
| type | string | dc:type | a type if anything can be identified, otherwise empty |


  • Classes are uppercase, Properties are lowercase
  • NERD classes are attached to the word using itsrdf:taClassRef
  • Other classes are attached to the NE node (itsrdf:taIdentRef) using rdf:type.
  • The Amount mapping uses schema.org classes/properties that were borrowed from GoodRelations
  • dbo:gender is an object property, though it doesn’t specify the values to use
  • dc:type is a literal. We attach it to the word directly
  • don’t forget to include itsrdf:taAnnotatorsRef “NER-extraction|” for each
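
The mapping table can be read as a lookup from an NER class and type to two things: the classes for the NE node (attached with rdf:type) and the NERD class for the word (attached with itsrdf:taClassRef). A minimal Python sketch, assuming the NER output arrives as a (class, type) pair. The names `MAPPING` and `classes_for` are hypothetical, and only some rows of the table are included:

```python
# Hypothetical lookup built from some rows of the mapping table.
# Key:   (NER class, type or None)
# Value: (classes for the NE node via rdf:type, NERD class for the word via itsrdf:taClassRef)
MAPPING = {
    ("Person", None): (["dbo:Person", "foaf:Person"], "nerd:Person"),
    ("Location", "other"): ([], "nerd:Location"),
    ("Location", "country"): (["dbo:Country"], "nerd:Country"),
    ("Location", "region"): (["dbo:Region"], "nerd:AdministrativeRegion"),
    ("Location", "city"): (["dbo:City"], "nerd:City"),
    ("Location", "street"): (["schema:PostalAddress"], "nerd:Location"),
    ("Organisation", "institution"): (["dbo:Organisation", "foaf:Organization"], "nerd:Organization"),
    ("Organisation", "company"): (["dbo:Company", "foaf:Company"], "nerd:Company"),
    ("Amount", "price"): (["schema:PriceSpecification"], "nerd:Amount"),
    ("Amount", "unit"): (["schema:QuantitativeValue"], "nerd:Amount"),
}


def classes_for(ner_class, ner_type=None):
    """Return (NE node classes, NERD class) for one named entity kind."""
    return MAPPING[(ner_class, ner_type)]


if __name__ == "__main__":
    node_classes, nerd_class = classes_for("Location", "city")
    print(node_classes, nerd_class)
```

An unknown combination raises `KeyError`, which is probably what you want while the mapping is still being settled.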

3 Example

./NIF-example3.ttl (./NIF-example3.ttl.html) and ./NIF-example3.jsonld include examples for each of the named entity kinds.

  • I made up some word/phrase occurrences. I use nif:anchorOf to illustrate the word/phrase, and omit nif:beginIndex and nif:endIndex. In actual use, you’ll do exactly the opposite: emit the indexes and omit nif:anchorOf, since it’s redundant
  • In a couple of cases I’ve embedded rdfs:comment and rdfs:seeAlso to illustrate a point. Of course, don’t emit these in the actual JSONLD

3.1 Named Entity URLs

We have the following options for Named Entity URLs:

  1. Global: it’s best to use global DBpedia URLs if they can be identified, as explained in ./NIF-example2.ttl
  2. Project: we could use a project-global namespace for entities, eg

    (Eg the demo uses such URLs for entities it cannot identify in global datasets). However, this won’t allow different NEs with the same name across documents

  3. Document: ./NIF-example3.ttl uses per-document URLs, eg

    (In this and the previous option, the entity URI is made from the entity text, replacing punctuation with “_”). This still does not allow two different John_Smiths in one document, but the chance of this happening is smaller.
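
Making the entity URI from the entity text could look like the following Python sketch. It is an illustration under assumptions, not the project's actual code: the base URL is a made-up example, the function names are hypothetical, and treating whitespace like punctuation and collapsing runs into a single “_” are my guesses based on the John_Smith example.

```python
import re


def entity_local_name(text):
    """Replace each run of punctuation or whitespace with a single "_" (assumed rule)."""
    return re.sub(r"[^\w]+", "_", text.strip())


def entity_uri(base, text):
    """Append the local name to a Project or Document base URL (base is hypothetical)."""
    return base + entity_local_name(text)


if __name__ == "__main__":
    # Document scheme: hash URL under a made-up document URL
    print(entity_uri("http://example.org/doc1#", "John Smith"))
```

Two different people called John Smith in the same document get the same URI, which is exactly the limitation described above.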

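Slash vs Hash: everything after a # (the fragment) is stripped by the client before the request is sent, so all hash URLs that share the same base are fetched with one HTTP request.

  • So hash is used for “sub-nodes” that will typically be served with one HTTP request
  • In contrast, slash is used with large collections. If we have a million Named Entities, we can’t use hash in the Project scheme, since a single request would have to return all of them.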
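
The difference between the two schemes can be checked with Python's standard `urllib.parse.urldefrag`, which splits off the fragment the same way an HTTP client does. This is only a sketch. The example.org URLs and the `document_url` helper are made up for illustration:

```python
from urllib.parse import urldefrag


def document_url(uri):
    """URL that is actually requested: the fragment is never sent to the server."""
    return urldefrag(uri).url


# Made-up entity URLs for illustration
hash_uris = ["http://example.org/doc1#John_Smith", "http://example.org/doc1#London"]
slash_uris = ["http://example.org/entity/John_Smith", "http://example.org/entity/London"]

if __name__ == "__main__":
    print({document_url(u) for u in hash_uris})   # a single document serves both entities
    print({document_url(u) for u in slash_uris})  # one document per entity
```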

4 Validation

Please validate generated NIF files.

4.1 NIF Validator

It says “informat=json-ld not implemented yet”, so we need to convert to ttl first (I use apache-jena-2.12.1)

rdfcat -out ttl test-out.jsonld | java -jar validate.jar -i - -o text

Unfortunately it has only 11 tests, which is disappointing.

4.2 RDFUnit Validation

A better validator is RDFUnit. It is implemented in the MS RDF_Validation_Service. We’d be glad to help you read its results.

I tried their demo site with ./NIF-test1.jsonld and ./NIF-example2.ttl:

1. Data Selection> Direct Input> JSON-LD> Load
Data loaded successfully! (162 statements)
2. Constraints Selection> Automatic> Load
Constraints loaded successfully: (foaf, nif, itsrdf, dcterms)
3. Test Generation
Completed! Generated 514 tests                 (WOW!! That's a lot)
4. Testing> Report Type> Status (all)> Run Tests
Total test cases 514, Succeeded 507, Failed 7  (NOTE: those "Succeeded" also in many cases mean errors)

4.2.1 Generated Tests per Ontology


(Even though I canceled dbo generation a bit prematurely.)

This is too much for us: we don’t want the DBO tests. In particular, the Status (all) report includes a lot of “violations” that come from the ontologies, not from our data. But it’s definitely worth investigating.

4.2.2 RDFUnit test results

Here are the results. “Resources” is a simple tabular format (basically URL-error), “Annotated Resources” provides more detail (about the errors pertaining to each URL)

4.3 Manual Validation

(Was at RDF_Validation, but will maintain it here).

I’ve been checking SIMMOs for NIF conformance for a while, maybe 100 times already. Please post only Turtle files, not JSON files, since JSON is impossible to check by eyeballing.

  • Get Jena (eg apache-jena-3.0.0.tar.gz), unzip it somewhere and add the bin directory to your path. We’ll use RIOT (RDF I/O Tool).
  • Get Turtle: You can get a Turtle representation of the SIMMO in one of two ways

4.3.1 Get Turtle from Store

  • Store the SIMMO using the RDF Storing Service
  • Get the SIMMO out using a query like this (saved as “a SIMMO graph”), and then save the result as file-noprefix.ttl (Turtle).
    construct {?s ?p ?o}
    where {graph <>
      {?s ?p ?o}}
  • There’s also a REST call to get the SIMMO out that’s easier to use from the command line

4.3.2 Get Turtle from SIMMO JSON

  • get the content of the “rdf” key out of the SIMMO JSON. Unescape quotes. Save as file.jsonld So instead of this:

    You need this:

  • You can do this manually, or with RIOT that can convert the stringified RDF field into more readable JSONLD format:
    riot --output=jsonld rdf_output_string.jsonld > new_readable_file.jsonld

    Instead of a single string, the results will be displayed as:

    "@graph" : [ {
      "@id" : "",
      "@type" : [ "", "" ],
      "name" : "10000 Euro"
    }, {
      "@id" : "",
      "@type" : [ "", "" ],
      "name" : "2000 Euro"
    }, {...  
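
The manual extraction step (take the “rdf” value, unescape it, save it as file.jsonld) can also be scripted. A minimal Python sketch, assuming the SIMMO JSON is an object with a top-level “rdf” key whose value is the JSON-LD serialized as a string. The function name `extract_rdf` and the input file name `simmo.json` are hypothetical:

```python
import json


def extract_rdf(simmo_json_text):
    """Return the JSON-LD held under the "rdf" key as readable, indented text."""
    simmo = json.loads(simmo_json_text)  # parsing the outer JSON unescapes the quotes
    rdf = simmo["rdf"]
    if isinstance(rdf, str):             # stringified JSON-LD: parse it a second time
        rdf = json.loads(rdf)
    return json.dumps(rdf, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    # simmo.json is a hypothetical input file name
    with open("simmo.json", encoding="utf-8") as src:
        readable = extract_rdf(src.read())
    with open("file.jsonld", "w", encoding="utf-8") as dst:
        dst.write(readable)
```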

No matter which of the two methods you used, the rest is the same

  • Validate it with RIOT: this is optional but recommended
    riot --validate file.jsonld
  • Convert to Turtle. Omit “WARN riot” lines which would make the Turtle invalid
    riot --output turtle file.jsonld | grep -v "WARN  riot" > file-noprefix.ttl

Prettify Turtle

Unfortunately this file doesn’t use prefixes, so the URLs are long and ugly (Boyan will fix this for the Store MULTISENSO-137)

  • Save ./prefixes.ttl (I update this file about once a month)
  • Concat the two:
    cat prefixes.ttl file-noprefix.ttl > file-withprefix.ttl
  • Prettify the Turtle to make use of the prefixes and to group all statements of the same subject together:
    riot --formatted=turtle file-withprefix.ttl > file.ttl
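
If grep and cat are not available (eg on Windows), the two text-only steps can be done in Python instead. This is a sketch that only mirrors the commands shown above. It does not call riot, and the function names are hypothetical:

```python
def drop_riot_warnings(text):
    """Like grep -v "WARN  riot": keep only the lines that do not contain the marker."""
    lines = text.splitlines(keepends=True)
    return "".join(line for line in lines if "WARN  riot" not in line)


def prepend_prefixes(prefixes, turtle):
    """Like cat prefixes.ttl file-noprefix.ttl, but adds a newline between them if missing."""
    if prefixes and not prefixes.endswith("\n"):
        prefixes += "\n"
    return prefixes + turtle


if __name__ == "__main__":
    with open("prefixes.ttl", encoding="utf-8") as f:
        prefixes = f.read()
    with open("file-noprefix.ttl", encoding="utf-8") as f:
        body = drop_riot_warnings(f.read())
    with open("file-withprefix.ttl", "w", encoding="utf-8") as f:
        f.write(prepend_prefixes(prefixes, body))
```

The result still has to go through the riot prettify step to group statements and apply the prefixes.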

Optional manual edits:

Post that last prettified file.ttl in Jira. Thanks!