Multisensor-NER-Mapping.org

Multisensor NER Mapping

HTML Rendered version

1 Prefixes

./prefixes.ttl

2 Mapping

| Class / Property | Type/enum | Mapping | Notes |
|---+---+---+---|
| all | | nif:Word or nif:Phrase | |
| text | string | n/a | nif:anchorOf omitted |
| onset | number | nif:beginIndex | start |
| offset | number | nif:endIndex | end |
| Person | | dbo:Person, foaf:Person; nerd:Person | |
| test | string | foaf:name | |
| firstname | string | foaf:firstName | |
| lastname | string | foaf:lastName | |
| gender | male, female | dbo:gender | dbp:Male, dbp:Female |
| occupation | string | rdau:professionOrOccupation | dbo:occupation and dbo:profession are object props |
| Location | type=other | nerd:Location | No need to use dbo:Location if you can’t identify the type |
| Location | type=country | dbo:Country; nerd:Country | |
| Location | type=region | dbo:Region; nerd:AdministrativeRegion | |
| Location | type=city | dbo:City; nerd:City | |
| Location | type=street | schema:PostalAddress; nerd:Location | Put text in schema:streetAddress |
| Organisation | type=institution | dbo:Organisation, foaf:Organization; nerd:Organization | |
| Organisation | type=company | dbo:Company, foaf:Company; nerd:Company | |
| Product | | nerd:Product | |
| type | string | not yet | don’t know yet what makes sense here |
| Time | | time:Instant; nerd:Time | TODO: can you parse to XSD datetime components? |
| year | string | time:Instant; nerd:Time | |
| month | string | time:Instant OR yago:Months; nerd:Time | if yago:Months then dbp:January… |
| day | string | time:Instant; nerd:Time | |
| time | string | time:Instant; nerd:Time | |
| weekday | string | yago:DaysOfTheWeek; nerd:Time | dbp:Sunday,… Put text in rdfs:label |
| rel | string | nerd:Time | relative expression, eg “the last three days” |
| other | string | nerd:Time | any other time expression, eg “Valentine’s day” |
| Amount | type=price | schema:PriceSpecification; nerd:Amount | |
| unit | string | schema:priceCurrency | 3-letter ISO 4217 format |
| amount | number | schema:price | “.” as decimal separator |
| Amount | type=unit | schema:QuantitativeValue; nerd:Amount | How about percentage?? |
| unit | string | schema:unitCode | Strictly speaking, UN/CEFACT Common Code (eg GRM for grams) |
| amount | number | schema:value | |
| Name | | nerd:Thing | |
| type | string | dc:type | a type if anything can be identified, otherwise empty |
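
The class rows of the table can be read as a lookup from the NER output's class and type to the RDF classes to emit. Below is a minimal sketch in Python. The dictionary layout and the helper name `classes_for` are illustrative assumptions, not part of the NER tool. It also assumes that the class after the semicolon in the Mapping column is the NERD one, as the Notes describe.

```python
# (class, type) -> (classes for the NE node via rdf:type, NERD class for itsrdf:taClassRef)
# Transcribed from the Mapping table; the layout and names here are illustrative only.
CLASS_MAP = {
    ("Person", None): (["dbo:Person", "foaf:Person"], "nerd:Person"),
    ("Location", "other"): ([], "nerd:Location"),
    ("Location", "country"): (["dbo:Country"], "nerd:Country"),
    ("Location", "region"): (["dbo:Region"], "nerd:AdministrativeRegion"),
    ("Location", "city"): (["dbo:City"], "nerd:City"),
    ("Location", "street"): (["schema:PostalAddress"], "nerd:Location"),
    ("Organisation", "institution"): (["dbo:Organisation", "foaf:Organization"], "nerd:Organization"),
    ("Organisation", "company"): (["dbo:Company", "foaf:Company"], "nerd:Company"),
    ("Product", None): ([], "nerd:Product"),
    ("Time", None): (["time:Instant"], "nerd:Time"),
    ("Amount", "price"): (["schema:PriceSpecification"], "nerd:Amount"),
    ("Amount", "unit"): (["schema:QuantitativeValue"], "nerd:Amount"),
    ("Name", None): ([], "nerd:Thing"),
}


def classes_for(ne_class, ne_type=None):
    """Return (rdf_types, nerd_class) for an NER class and its optional type."""
    return CLASS_MAP[(ne_class, ne_type)]


rdf_types, nerd_class = classes_for("Location", "city")
print(rdf_types, nerd_class)
```

An unknown combination raises `KeyError`, which is probably preferable to silently emitting a wrong class.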

Notes

  • Classes are uppercase, Properties are lowercase
  • NERD classes are attached to the word using itsrdf:taClassRef
  • Other classes are attached to the NE node (itsrdf:taIdentRef) using rdf:type.
  • The Amount mapping uses schema.org classes/properties, which were borrowed from GoodRelations
  • dbo:gender is an object property, though it doesn’t specify the values to use
  • dc:type is a literal. We attach it to the word directly
  • Don’t forget to include itsrdf:taAnnotatorsRef “NER-extraction|http://linguatec.com” for each
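
To illustrate where each class is attached, here is a rough sketch that renders one annotation as Turtle text. The function name, the `#char=0,13` word URL, and the typed index literals are assumptions for illustration. It emits the annotator value as a plain string exactly as quoted above, and assumes the prefixes are declared in ./prefixes.ttl.

```python
ANNOTATOR = "NER-extraction|http://linguatec.com"  # value quoted in the Notes


def annotation_turtle(word_url, entity_url, nif_class, begin, end, nerd_class, rdf_types):
    """Render one word/phrase annotation plus its NE node as Turtle (no prefix header)."""
    lines = [
        f"<{word_url}> a {nif_class} ;",
        f'  nif:beginIndex "{begin}"^^xsd:nonNegativeInteger ;',
        f'  nif:endIndex "{end}"^^xsd:nonNegativeInteger ;',
        f"  itsrdf:taClassRef {nerd_class} ;",    # the NERD class goes on the word
        f"  itsrdf:taIdentRef <{entity_url}> ;",  # link from the word to the NE node
        f'  itsrdf:taAnnotatorsRef "{ANNOTATOR}" .',
    ]
    if rdf_types:
        # Other classes go on the NE node using rdf:type ("a" in Turtle)
        lines.append(f"<{entity_url}> a {', '.join(rdf_types)} .")
    return "\n".join(lines)


print(annotation_turtle(
    "http://www.multisensorproject.eu/content/12542546#char=0,13",  # hypothetical word URL
    "http://dbpedia.org/resource/Angela_Merkel",
    "nif:Phrase", 0, 13,
    "nerd:Person", ["dbo:Person", "foaf:Person"],
))
```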

3 Example

./NIF-example3.ttl (./NIF-example3.ttl.html) and ./NIF-example3.jsonld include examples for each of the named entity kinds.

  • I made up some word/phrase occurrences. I use nif:anchorOf to illustrate the word/phrase, and omit nif:beginIndex and nif:endIndex. In actual use, you’ll do exactly the opposite (nif:anchorOf should be omitted since it’s redundant)
  • In a couple of cases I’ve embedded rdfs:comment and rdfs:seeAlso to illustrate a point. Of course, don’t emit these in the actual JSON-LD

3.1 Named Entity URLs

We have the following options for Named Entity URLs:

  1. Global: it’s best to use global DBpedia URLs if they can be identified, as explained in ./NIF-example2.ttl
    http://dbpedia.org/resource/Angela_Merkel
        
  2. Project: we could use a project-global namespace for entities, eg
    http://www.multisensorproject.eu/entity/Person/Angela_Merkel
        

    (Eg the http://tag.ontotext.com demo uses such URLs for entities it cannot identify in global datasets). However, this scheme cannot distinguish different NEs that share the same name across documents

  3. Document: ./NIF-example3.ttl uses per-document URLs, eg
    http://www.multisensorproject.eu/content/12542546#Person=Angela_Merkel
        

    (In this and the previous option, the entity URI is made from the entity text, replacing punctuation with “_”). This still does not allow two different John_Smiths in one document, but the chance of this happening is smaller.
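
The Project and Document schemes above can be sketched as follows. The helper names are hypothetical. The exact replacement rule is also an assumption: every run of non-word characters, spaces included, becomes a single “_”.

```python
import re

PROJECT_BASE = "http://www.multisensorproject.eu/entity"   # option 2
CONTENT_BASE = "http://www.multisensorproject.eu/content"  # option 3


def slug(text):
    """Replace each run of non-word characters with one underscore (assumed rule)."""
    return re.sub(r"\W+", "_", text.strip()).strip("_")


def project_url(kind, text):
    """Project-global entity URL (slash scheme)."""
    return f"{PROJECT_BASE}/{kind}/{slug(text)}"


def document_url(doc_id, kind, text):
    """Per-document entity URL (hash scheme)."""
    return f"{CONTENT_BASE}/{doc_id}#{kind}={slug(text)}"


print(project_url("Person", "Angela Merkel"))
print(document_url("12542546", "Person", "Angela Merkel"))
```

Two different people called “John Smith” in the same document would still get the same URL, which is the limitation noted above.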

Slash vs Hash: the part of a URL after # is never sent to the server, so all hash URLs that share the same base document are fetched with one HTTP request.

  • So hash is used for “sub-nodes” that will typically be served together in one HTTP request
  • In contrast, slash is used with large collections. If we have a million Named Entities, we can’t use hash in the Project scheme.
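
A small illustration of the difference, using Python's standard `urllib.parse.urldefrag` to show which URL a client actually requests. The Berlin URLs are made-up examples.

```python
from urllib.parse import urldefrag


def request_url(entity_url):
    """Return the URL an HTTP client requests: the fragment is stripped client-side."""
    return urldefrag(entity_url).url


doc_scheme = [
    "http://www.multisensorproject.eu/content/12542546#Person=Angela_Merkel",
    "http://www.multisensorproject.eu/content/12542546#Location=Berlin",  # hypothetical
]
project_scheme = [
    "http://www.multisensorproject.eu/entity/Person/Angela_Merkel",
    "http://www.multisensorproject.eu/entity/Location/Berlin",  # hypothetical
]

# Hash URLs collapse to one request URL; slash URLs each need their own request.
print({request_url(u) for u in doc_scheme})
print({request_url(u) for u in project_scheme})
```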

4 Validation

Please validate generated NIF files.

4.1 NIF Validator

It says “informat=json-ld not implemented yet”, so we need to convert to ttl first (I use apache-jena-2.12.1)

rdfcat -out ttl test-out.jsonld | java -jar validate.jar -i - -o text

Unfortunately there are only 11 tests, so it’s a disappointment

4.2 RDFUnit Validation

A better validator is RDFUnit. It is implemented in the MS RDF_Validation_Service, and we’d be glad to help you read its results.

I tried their demo site with ./NIF-test1.jsonld and ./NIF-example2.ttl:

1. Data Selection> Direct Input> JSON-LD> Load
Data loaded successfully! (162 statements)
2. Constraints Selection> Automatic> Load
Constraints loaded successfully: (foaf, nif, itsrdf, dcterms)
3. Test Generation
Completed! Generated 514 tests                 (WOW!! That's a lot)
4. Testing> Report Type> Status (all)> Run Tests
Total test cases 514, Succeeded 507, Failed 7  (NOTE: those "Succeeded" also in many cases mean errors)

4.2.1 Generated Tests per Ontology

| URI | Automatic | Manual |
|---+---+---|
| http://xmlns.com/foaf/0.1/ | 174 | - |
| http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core# | 199 | 10 |
| http://www.w3.org/2005/11/its/rdf# | 75 | - |
| http://purl.org/dc/terms/ | 56 | - |
| http://www.w3.org/2006/time# | 183 | - |
| http://dbpedia.org/ontology/ | 9281 | 14 |

(Even though I canceled dbo generation a bit prematurely.)

This is too much for us: we don’t want the DBO tests. In particular, the Status (all) report includes a lot of “violations” that come from the ontologies, not from our data. But it’s definitely worth investigating

4.2.2 RDFUnit test results

Here are the results. “Resources” is a simple tabular format (basically URL-error), “Annotated Resources” provides more detail (about the errors pertaining to each URL)

4.3 Manual Validation

(Was at RDF_Validation, but will maintain it here).

I’ve been checking SIMMOs for NIF conformance for a while, maybe 100 times already. Please post only Turtle files, not JSON files, since JSON is impossible to check by eyeballing.

  • Get Jena (eg apache-jena-3.0.0.tar.gz), unzip it somewhere and add the bin directory to your path. We’ll use RIOT (RDF I/O Tool).
  • Get Turtle: You can get a Turtle representation of the SIMMO in one of two ways

4.3.1 Get Turtle from Store

  • Store the SIMMO using the RDF Storing Service
  • Get the SIMMO out using a query like this (saved as “a SIMMO graph”), and then save the result as file-noprefix.ttl (Turtle).
    construct {?s ?p ?o}
    where {graph <http://data.multisensor.org/content/8006dcd60b292feaaef24abc9ec09e2230aab83e>
      {?s ?p ?o}}
  • There’s also a REST call to get the SIMMO out that’s easier to use from the command line

4.3.2 Get Turtle from SIMMO JSON

  • Get the content of the “rdf” key out of the SIMMO JSON. Unescape quotes. Save as file.jsonld. So instead of this:
    "rdf":["[{\"@id\":\"http://data.multisensor...[{\"@value\":\"Germany\"}]}]"],"category":""}
        

    You need this:

    [{"@id":"http://data.multisensor...[{"@value":"Germany"}]}]
        
  • You can do this manually, or with RIOT, which can convert the stringified RDF field into a more readable JSON-LD format:
    riot --output=jsonld rdf_output_string.jsonld > new_readable_file.jsonld
        

    Instead of a single string, the results will be displayed as:

    "@graph" : [ {
      "@id" : "http://data.multisensorproject.eu/content/53a0938bc4770c6ba0e7d7b9ca88a637f9e9c304#Amount=10000_Euro",
      "@type" : [ "http://schema.org/QuantitativeValue", "http://nerd.eurecom.fr/ontology#Amount" ],
      "name" : "10000 Euro"
    }, {
      "@id" : "http://data.multisensorproject.eu/content/53a0938bc4770c6ba0e7d7b9ca88a637f9e9c304#Amount=2000_Euro",
      "@type" : [ "http://schema.org/QuantitativeValue", "http://nerd.eurecom.fr/ontology#Amount" ],
      "name" : "2000 Euro"
    }, {...  
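
The manual route (pull out the “rdf” key and unescape the quotes) can also be scripted. This is a minimal sketch, assuming the “rdf” key holds a list whose first element is the stringified JSON-LD, as in the fragment shown earlier. The function names and the miniature sample SIMMO are made up for illustration.

```python
import json


def extract_jsonld(simmo_text):
    """Return the JSON-LD stored as an escaped string under the SIMMO's "rdf" key."""
    simmo = json.loads(simmo_text)
    # json.loads on the inner string does the "unescape quotes" step
    return json.loads(simmo["rdf"][0])


def save_jsonld(simmo_path, out_path="file.jsonld"):
    """Read a SIMMO JSON file and write its embedded RDF as readable JSON-LD."""
    with open(simmo_path, encoding="utf-8") as src:
        data = extract_jsonld(src.read())
    with open(out_path, "w", encoding="utf-8") as out:
        json.dump(data, out, indent=2, ensure_ascii=False)


# Hypothetical miniature SIMMO with the same shape as the real one
inner = [{"@id": "http://example.org/doc#Location=Germany",
          "http://www.w3.org/2000/01/rdf-schema#label": [{"@value": "Germany"}]}]
sample = json.dumps({"rdf": [json.dumps(inner)], "category": ""})
print(json.dumps(extract_jsonld(sample), indent=2))
```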
        

No matter which of the two methods you used, the rest is the same

  • Validate it with RIOT: this is optional but recommended
    riot --validate file.jsonld
        
  • Convert to Turtle. Omit “WARN riot” lines which would make the Turtle invalid
    riot --output turtle file.jsonld | grep -v "WARN  riot" > file-noprefix.ttl
        

4.3.2.1 Prettify Turtle

Unfortunately this file doesn’t use prefixes, so the URLs are long and ugly (Boyan will fix this for the Store MULTISENSO-137)

  • Save ./prefixes.ttl (I update this file about once a month)
  • Concat the two:
    cat prefixes.ttl file-noprefix.ttl > file-withprefix.ttl
        
  • Prettify the Turtle to make use of the prefixes and to group all statements of the same subject together:
    riot --formatted=turtle file-withprefix.ttl > file.ttl
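
The convert, concat and prettify steps above can be chained in one script. This is only a sketch: it assumes `riot` is on your path and uses just the options shown in this section. The function names are hypothetical.

```python
import subprocess


def drop_riot_warnings(text):
    """Remove RIOT warning lines, mirroring: grep -v "WARN  riot"."""
    return "".join(
        line for line in text.splitlines(keepends=True) if "WARN  riot" not in line
    )


def run_riot(*args):
    """Run riot with the given arguments and return its stdout (raises if riot fails)."""
    done = subprocess.run(["riot", *args], capture_output=True, text=True, check=True)
    return done.stdout


def prettify(jsonld_path, prefixes_path="prefixes.ttl", out_path="file.ttl"):
    """Turn a JSON-LD file into prefixed, formatted Turtle."""
    raw = drop_riot_warnings(run_riot("--output", "turtle", jsonld_path))
    with open(prefixes_path, encoding="utf-8") as f:
        prefixes = f.read()
    # Same as: cat prefixes.ttl file-noprefix.ttl > file-withprefix.ttl
    with open("file-withprefix.ttl", "w", encoding="utf-8") as f:
        f.write(prefixes + raw)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(run_riot("--formatted=turtle", "file-withprefix.ttl"))
```

Calling `prettify("file.jsonld")` should leave the prettified result in file.ttl, ready for the optional manual edits.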
        

Optional manual edits:

Post that last prettified file.ttl in Jira. Thanks!