Class / Property | Type/enum | Mapping | Notes |
---|---|---|---|
all | | nif:Word or nif:Phrase | |
text | string | n/a | nif:anchorOf omitted |
onset | number | nif:beginIndex | start |
offset | number | nif:endIndex | end |
Person | | dbo:Person, foaf:Person; nerd:Person | |
test | string | foaf:name | |
firstname | string | foaf:firstName | |
lastname | string | foaf:lastName | |
gender | male, female | dbo:gender | dbp:Male, dbp:Female |
occupation | string | rdau:professionOrOccupation | dbo:occupation and dbo:profession are object props |
Location | type=other | nerd:Location | No need to use dbo:Location if you can’t identify the type |
Location | type=country | dbo:Country; nerd:Country | |
Location | type=region | dbo:Region; nerd:AdministrativeRegion | |
Location | type=city | dbo:City; nerd:City | |
Location | type=street | schema:PostalAddress; nerd:Location | Put text in schema:streetAddress |
Organisation | type=institution | dbo:Organisation, foaf:Organization; nerd:Organization | |
Organisation | type=company | dbo:Company, foaf:Company; nerd:Company | |
Product | | nerd:Product | |
type | string | not yet | don’t know yet what makes sense here |
Time | | time:Instant; nerd:Time | TODO: can you parse to XSD datetime components? |
year | string | time:Instant; nerd:Time | |
month | string | time:Instant OR yago:Months; nerd:Time | if yago:Months then dbp:January… |
day | string | time:Instant; nerd:Time | |
time | string | time:Instant; nerd:Time | |
weekday | string | yago:DaysOfTheWeek; nerd:Time | dbp:Sunday,… Put text in rdfs:label |
rel | string | nerd:Time | relative expression, eg “the last three days” |
other | string | nerd:Time | any other time expression, eg “Valentine’s day” |
Amount | type=price | schema:PriceSpecification; nerd:Amount | |
unit | string | schema:priceCurrency | 3-letter ISO 4217 format |
amount | number | schema:price | “.” as decimal separator |
Amount | type=unit | schema:QuantitativeValue; nerd:Amount | How about percentage?? |
unit | string | schema:unitCode | Strictly speaking, UN/CEFACT Common Code (eg GRM for grams) |
amount | number | schema:value | |
Name | | nerd:Thing | |
type | string | dc:type | a type if anything can be identified, otherwise empty |
Notes
- Classes are uppercase, Properties are lowercase
- NERD classes are attached to the word using itsrdf:taClassRef
- Other classes are attached to the NE node (itsrdf:taIdentRef) using rdf:type.
- The Amount mapping uses schema.org classes/properties, which were borrowed from GoodRelations
- dbo:gender is an object property, though it doesn’t specify the values to use
- dc:type is a literal. We attach it to the word directly
- don’t forget to include itsrdf:taAnnotatorsRef “NER-extraction|http://linguatec.com” for each
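The attachment rules in these notes can be sketched in a few lines of Python. This is only an illustration: the `MAPPING` dictionary is a hand-copied excerpt of the table, the word and entity URIs are hypothetical, and attaching itsrdf:taAnnotatorsRef to the word node is an assumption that the notes do not spell out.

```python
# Hypothetical excerpt of the mapping table: (class, type) -> (NERD class, other classes).
MAPPING = {
    ("Person", None): ("nerd:Person", ["dbo:Person", "foaf:Person"]),
    ("Location", "other"): ("nerd:Location", []),
    ("Location", "country"): ("nerd:Country", ["dbo:Country"]),
    ("Location", "city"): ("nerd:City", ["dbo:City"]),
}

ANNOTATOR = "NER-extraction|http://linguatec.com"


def triples_for(word, entity, ne_class, ne_type=None):
    """Return (subject, predicate, object) triples for one recognised entity."""
    nerd_class, other_classes = MAPPING[(ne_class, ne_type)]
    triples = [
        # The NERD class goes on the word itself.
        (word, "itsrdf:taClassRef", nerd_class),
        # The word points to the NE node.
        (word, "itsrdf:taIdentRef", entity),
        # Assumption: the annotator reference is attached to the word as well.
        (word, "itsrdf:taAnnotatorsRef", '"' + ANNOTATOR + '"'),
    ]
    # All non-NERD classes go on the NE node via rdf:type.
    triples += [(entity, "rdf:type", cls) for cls in other_classes]
    return triples


# Hypothetical word and entity URIs, relative to the document base.
for s, p, o in triples_for("<#char=0,13>", "<#Person=Angela_Merkel>", "Person"):
    print(s, p, o, ".")
```

The printed lines are Turtle-like statements meant for eyeballing, not a replacement for a real RDF serializer.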
./NIF-example3.ttl (./NIF-example3.ttl.html) and ./NIF-example3.jsonld include examples for each of the named entity kinds.
- I made up some word/phrase occurrences. I use nif:anchorOf to illustrate the word/phrase, and omit nif:beginIndex and nif:endIndex. In actual use, you’ll do exactly the opposite: emit nif:beginIndex and nif:endIndex, and omit nif:anchorOf since it’s redundant
- In a couple of cases I’ve embedded rdfs:comment and rdfs:seeAlso to illustrate a point. Of course, don’t emit these in the actual JSON-LD
We have the following options for Named Entity URLs:
- Global: it’s best to use global DBpedia URLs if they can be identified, as explained in ./NIF-example2.ttl
http://dbpedia.org/resource/Angela_Merkel
- Project: we could use a project-global namespace for entities, eg
http://www.multisensorproject.eu/entity/Person/Angela_Merkel
(Eg the http://tag.ontotext.com demo uses such URLs for entities it cannot identify in global datasets). However, this won’t allow different NEs with the same name across documents
- Document: ./NIF-example3.ttl uses per-document URLs, eg
http://www.multisensorproject.eu/content/12542546#Person=Angela_Merkel
(In this and the previous option, the entity URI is made from the entity text, replacing punctuation with “_”). This still does not allow two different John_Smiths in one document, but the chance of this happening is smaller.
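The Project and Document schemes above can be sketched as follows. This is a minimal sketch under assumptions: the function names are hypothetical, and collapsing each run of punctuation or whitespace into a single “_” is a guess at the exact replacement rule, which the text only describes loosely.

```python
import re


def entity_local_name(text):
    """Turn entity text into a URL-friendly name, eg "Angela Merkel" -> "Angela_Merkel".

    Assumption: every run of non-word characters (punctuation, whitespace)
    becomes one underscore, and leading/trailing underscores are dropped.
    """
    return re.sub(r"\W+", "_", text.strip()).strip("_")


def project_entity_uri(ne_class, text, base="http://www.multisensorproject.eu/entity"):
    """Project scheme: one slash URL per entity across all documents."""
    return f"{base}/{ne_class}/{entity_local_name(text)}"


def document_entity_uri(doc_uri, ne_class, text):
    """Document scheme: a hash URL local to one document."""
    return f"{doc_uri}#{ne_class}={entity_local_name(text)}"


doc = "http://www.multisensorproject.eu/content/12542546"
print(project_entity_uri("Person", "Angela Merkel"))
print(document_entity_uri(doc, "Person", "Angela Merkel"))
```

Because the name is derived from the text alone, two different people called John Smith in the same document still collide, exactly as noted above.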
Slash vs Hash: the fragment (everything after a #) is not sent to the server, so all hash URLs that share the same document part are fetched with one HTTP request.
- So hash is used for “sub-nodes” that will typically be served with one HTTP request
- In contrast, slash is used with large collections. If we have a million Named Entities, we can’t use hash in the Project scheme.
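The difference can be seen by stripping the fragment, which is what an HTTP client does before sending a request. The sketch below uses Python's standard `urllib.parse.urldefrag`; the `Location` URLs are made-up examples in the style of the schemes above.

```python
from urllib.parse import urldefrag


def request_url(uri):
    """Return the URL an HTTP client actually requests: the URI without its fragment."""
    return urldefrag(uri).url


# Document scheme (hash): both entities live in the same document.
doc_scheme = [
    "http://www.multisensorproject.eu/content/12542546#Person=Angela_Merkel",
    "http://www.multisensorproject.eu/content/12542546#Location=Berlin",  # hypothetical
]
# Project scheme (slash): every entity is its own resource.
project_scheme = [
    "http://www.multisensorproject.eu/entity/Person/Angela_Merkel",
    "http://www.multisensorproject.eu/entity/Location/Berlin",  # hypothetical
]

print({request_url(u) for u in doc_scheme})      # a single URL to fetch
print({request_url(u) for u in project_scheme})  # one URL per entity
```

With a hash-based Project scheme, a million entities would all resolve to one enormous document, which is why slash is the better fit there.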
Please validate generated NIF files.
- doc: http://persistence.uni-leipzig.org/nlp2rdf/specification/core.html#validator
- software: http://persistence.uni-leipzig.org/nlp2rdf/specification/validate.jar
- tests: http://persistence.uni-leipzig.org/nlp2rdf/ontologies/testcase/lib/nif-2.0-suite.ttl
- You can understand them just by reading the error messages, e.g. “nif:anchorOf must match the substring of nif:isString calculated with begin and end index”
The validator says “informat=json-ld not implemented yet”, so we need to convert to Turtle first (I use apache-jena-2.12.1):
rdfcat -out ttl test-out.jsonld | java -jar validate.jar -i - -o text
Unfortunately there are only 11 tests, so it’s a disappointment
This is implemented in the MS RDF_Validation_Service. We’d be glad to help you read its results.
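The error message quoted above (nif:anchorOf must match the substring of nif:isString) describes a check that is easy to reproduce before running the validator. The sketch below is not the validator's own code. It assumes zero-based offsets with an exclusive end index, and the sample sentence is invented.

```python
def anchor_matches(is_string, begin_index, end_index, anchor_of):
    """True if anchor_of equals the substring of is_string between the two offsets.

    Assumption: begin_index is inclusive and end_index is exclusive,
    both counted in characters from 0.
    """
    return is_string[begin_index:end_index] == anchor_of


text = "Angela Merkel visited Paris."  # hypothetical nif:isString value
print(anchor_matches(text, 0, 13, "Angela Merkel"))   # offsets agree with the anchor
print(anchor_matches(text, 22, 27, "Paris"))          # offsets agree with the anchor
print(anchor_matches(text, 21, 27, "Paris"))          # off by one: leading space included
```

A generator that emits both nif:anchorOf and the indexes could run such a check on every word before serializing.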
A better validator is RDFUnit:
- home: http://aksw.org/Projects/RDFUnit.html
- demo: http://rdfunit.aksw.org/demo/
- source: https://github.com/AKSW/RDFUnit/
- paper: “NLP data cleansing based on Linguistic Ontology constraints”
I tried their demo site with ./NIF-test1.jsonld and ./NIF-example2.ttl:
1. Data Selection> Direct Input> JSON-LD> Load
   Data loaded successfully! (162 statements)
2. Constraints Selection> Automatic> Load
   Constraints loaded successfully: (foaf, nif, itsrdf, dcterms)
3. Test Generation
   Completed! Generated 514 tests (WOW!! That's a lot)
4. Testing> Report Type> Status (all)> Run Tests
   Total test cases 514, Succeeded 507, Failed 7
   (NOTE: those "Succeeded" also in many cases mean errors)
(Even though I canceled dbo generation a bit prematurely.)
This is too much for us: we don’t want the DBO tests. In particular, the Status (all) report includes a lot of “violations” that come from the ontologies, not from our data. But it’s definitely worth investigating.
Here are the results. “Resources” is a simple tabular format (basically URL-error), “Annotated Resources” provides more detail (about the errors pertaining to each URL)
- ./NIF-test1-out.xls: Status (all) and Resources
- ./NIF-test1-annotated.ttl: Annotated Resources
- ./NIF-example2-out.xls: Resources
- ./NIF-example2-annotated.ttl: Annotated Resources
(Was at RDF_Validation, but will maintain it here).
I’ve been checking SIMMOs for NIF conformance for a while, and have probably done it 100 times already. Please post only Turtle files, not JSON files, since JSON is impossible to check by eyeballing.
- Get Jena (eg apache-jena-3.0.0.tar.gz), unpack it somewhere and add the bin directory to your path. We’ll use RIOT (RDF I/O technology).
- Get Turtle: You can get a Turtle representation of the SIMMO in one of two ways
- Store the SIMMO using the RDF Storing Service
- Get the SIMMO out using a query like this (saved as “a SIMMO graph”), and then save the result as file-noprefix.ttl (Turtle).
construct {?s ?p ?o}
where {graph <http://data.multisensor.org/content/8006dcd60b292feaaef24abc9ec09e2230aab83e>
{?s ?p ?o}}
- There’s also a REST call to get the SIMMO out that’s easier to use from the command line
- Get the content of the “rdf” key out of the SIMMO JSON. Unescape quotes. Save as file.jsonld
So instead of this:
"rdf":["[{\"@id\":\"http://data.multisensor...[{\"@value\":\"Germany\"}]}]"],"category":""}
You need this:
[{"@id":"http://data.multisensor...[{"@value":"Germany"}]}]
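The unescaping can also be scripted instead of being done by hand. The sketch below assumes, based on the snippet above, that the SIMMO JSON has an “rdf” key holding a list whose first element is the stringified JSON-LD; the function names and the input file name `simmo.json` are hypothetical.

```python
import json


def extract_rdf(simmo_json_text):
    """Parse the SIMMO JSON and return the JSON-LD embedded in its "rdf" key.

    Assumption: "rdf" is a list and its first element is a JSON-LD document
    serialized as a string (hence the escaped quotes).
    """
    simmo = json.loads(simmo_json_text)
    return json.loads(simmo["rdf"][0])


def save_jsonld(simmo_path, out_path="file.jsonld"):
    """Read a SIMMO JSON file and write the unescaped JSON-LD to out_path."""
    with open(simmo_path, encoding="utf-8") as f:
        doc = extract_rdf(f.read())
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(doc, f, indent=2, ensure_ascii=False)


# Usage, with a hypothetical input file name:
# save_jsonld("simmo.json")
```

Parsing the inner string with a JSON parser handles every escape sequence, which a search-and-replace on `\"` would not.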
- You can do this manually, or with RIOT, which can convert the stringified RDF field into a more readable JSON-LD format:
riot --output=jsonld rdf_output_string.jsonld > new_readable_file.jsonld
Instead of a single string, the results will be displayed as:
"@graph" : [ {
  "@id" : "http://data.multisensorproject.eu/content/53a0938bc4770c6ba0e7d7b9ca88a637f9e9c304#Amount=10000_Euro",
  "@type" : [ "http://schema.org/QuantitativeValue", "http://nerd.eurecom.fr/ontology#Amount" ],
  "name" : "10000 Euro"
}, {
  "@id" : "http://data.multisensorproject.eu/content/53a0938bc4770c6ba0e7d7b9ca88a637f9e9c304#Amount=2000_Euro",
  "@type" : [ "http://schema.org/QuantitativeValue", "http://nerd.eurecom.fr/ontology#Amount" ],
  "name" : "2000 Euro"
}, {...
No matter which of the two methods you used, the rest is the same
- Validate it with RIOT: this is optional but recommended
riot --validate file.jsonld
- Convert to Turtle. Filter out the “WARN riot” lines, which would make the Turtle invalid
riot --output turtle file.jsonld | grep -v "WARN riot" > file-noprefix.ttl
Unfortunately this file doesn’t use prefixes, so the URLs are long and ugly (Boyan will fix this for the Store MULTISENSO-137)
- Save ./prefixes.ttl (I update this file about once a month)
- Concat the two:
cat prefixes.ttl file-noprefix.ttl > file-withprefix.ttl
- Prettify the Turtle to make use of the prefixes and to group all statements of the same subject together:
riot --formatted=turtle file-withprefix.ttl > file.ttl
Optional manual edits:
- Add a base at the top, using the actual SIMMO base, eg
@base <http://data.multisensorproject.eu/content/53a0938bc4770c6ba0e7d7b9ca88a637f9e9c304>.
- Replace “http://data.multisensorproject.eu/content/53a0938bc4770c6ba0e7d7b9ca88a637f9e9c304” with “” (I don’t know why RIOT doesn’t use the base, even if I specify the --base option)
- Sort paragraphs (i.e. statement clusters)
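The first two manual edits can be scripted. The following is a rough sketch rather than a proper Turtle rewriter: `use_base` is a hypothetical helper that works on the text level, so it would also rewrite a matching `<http://…` sequence inside a string literal.

```python
import re


def use_base(turtle_text, base):
    """Prepend an @base line and shorten URLs that start with the base.

    Only URLs where the base is followed by "#" or ">" are rewritten, so a
    longer URL that merely shares the same prefix is left untouched.
    """
    pattern = re.compile("<" + re.escape(base) + r"(?=[#>])")
    body = pattern.sub("<", turtle_text)
    return f"@base <{base}>.\n{body}"


base = "http://data.multisensorproject.eu/content/53a0938bc4770c6ba0e7d7b9ca88a637f9e9c304"
ttl = f"<{base}#Amount=2000_Euro> a <http://schema.org/QuantitativeValue> .\n"
print(use_base(ttl, base))
```

Run it on file.ttl before the @base line is added, otherwise the base declaration itself would be shortened to `<>`.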
Post that last prettified file.ttl in Jira. Thanks!