Multisensor NER Mapping

1 Prefixes

2 Mapping

Class/Property	Type/enum	Mapping	Notes
all		nif:Word or nif:Phrase
text	string	n/a	nif:anchorOf omitted
onset	number	nif:beginIndex	start
offset	number	nif:endIndex	end
Person		dbo:Person, foaf:Person; nerd:Person
text	string	foaf:name
firstname	string	foaf:firstName
lastname	string	foaf:lastName
gender	male, female	dbo:gender	dbp:Male, dbp:Female
occupation	string	rdau:professionOrOccupation	dbo:occupation and dbo:profession are object props
Location	type=other	nerd:Location	No need to use dbo:Location if you can’t identify the type
Location	type=country	dbo:Country; nerd:Country
Location	type=region	dbo:Region; nerd:AdministrativeRegion
Location	type=city	dbo:City; nerd:City
Location	type=street	schema:PostalAddress; nerd:Location	Put text in schema:streetAddress
Organisation	type=institution	dbo:Organisation, foaf:Organization; nerd:Organization
Organisation	type=company	dbo:Company, foaf:Company; nerd:Company
Product		nerd:Product
type	string	not yet	don’t know yet what makes sense here
Time		time:Instant; nerd:Time	TODO: can you parse to XSD datetime components?
year	string	time:Instant; nerd:Time
month	string	time:Instant OR yago:Months; nerd:Time	if yago:Months then dbp:January…
day	string	time:Instant; nerd:Time
time	string	time:Instant; nerd:Time
weekday	string	yago:DaysOfTheWeek; nerd:Time	dbp:Sunday,… Put text in rdfs:label
rel	string	nerd:Time	relative expression, eg “the last three days”
other	string	nerd:Time	any other time expression, eg “Valentine’s day”
Amount	type=price	schema:PriceSpecification; nerd:Amount
unit	string	schema:priceCurrency	3-letter ISO 4217 format
amount	number	schema:price	“.” as decimal separator
Amount	type=unit	schema:QuantitativeValue; nerd:Amount	How about percentage??
unit	string	schema:unitCode	Strictly speaking, UN/CEFACT Common Code (eg GRM for grams)
amount	number	schema:value
Name		nerd:Thing
type	string	dc:type	a type if anything can be identified, otherwise empty

Notes

Class names are capitalized, property names are lowercase
NERD classes are attached to the word using itsrdf:taClassRef
Other classes are attached to the NE node (itsrdf:taIdentRef) using rdf:type
The Amount mapping uses schema.org classes/properties, which were borrowed from GoodRelations
dbo:gender is an object property, though it doesn’t specify the values to use
dc:type is a literal. We attach it to the word directly
Don’t forget to include itsrdf:taAnnotatorsRef “NER-extraction|http://linguatec.com” for each
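
As a rough illustration of the two attachment rules above, the following Python sketch looks up the classes for an entity and emits triples as plain tuples. The `CLASS_MAP` table, the `annotate` function and the example node names are assumptions made for this sketch, not part of the actual pipeline. Property-level mappings (names, time components, amounts) are left out.

```python
# (class, type) -> (NERD class for the word, other classes for the NE node).
# The entries are copied from the mapping table; the dict itself is hypothetical.
CLASS_MAP = {
    ("Person", None): ("nerd:Person", ["dbo:Person", "foaf:Person"]),
    ("Location", "other"): ("nerd:Location", []),
    ("Location", "country"): ("nerd:Country", ["dbo:Country"]),
    ("Location", "region"): ("nerd:AdministrativeRegion", ["dbo:Region"]),
    ("Location", "city"): ("nerd:City", ["dbo:City"]),
    ("Location", "street"): ("nerd:Location", ["schema:PostalAddress"]),
    ("Organisation", "institution"): ("nerd:Organization", ["dbo:Organisation", "foaf:Organization"]),
    ("Organisation", "company"): ("nerd:Company", ["dbo:Company", "foaf:Company"]),
    ("Product", None): ("nerd:Product", []),
    ("Time", None): ("nerd:Time", ["time:Instant"]),
    ("Amount", "price"): ("nerd:Amount", ["schema:PriceSpecification"]),
    ("Amount", "unit"): ("nerd:Amount", ["schema:QuantitativeValue"]),
    ("Name", None): ("nerd:Thing", []),
}


def annotate(word, entity, cls, typ=None):
    """Return (subject, predicate, object) tuples for one named entity."""
    nerd_class, other_classes = CLASS_MAP[(cls, typ)]
    triples = [
        (word, "itsrdf:taClassRef", nerd_class),  # NERD class goes on the word
        (word, "itsrdf:taIdentRef", entity),      # the word points to the NE node
    ]
    # All non-NERD classes go on the NE node as rdf:type
    triples += [(entity, "rdf:type", c) for c in other_classes]
    return triples


# Hypothetical word and entity nodes, relative to a document base
for triple in annotate("<#char=0,13>", "<#Person=Angela_Merkel>", "Person"):
    print(triple)
```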

3 Example

./NIF-example3.ttl (./NIF-example3.ttl.html) and ./NIF-example3.jsonld include examples for each of the named entity kinds.

I made up some word/phrase occurrences. I use nif:anchorOf to illustrate the word/phrase, and omit nif:beginIndex and nif:endIndex. In actual use, you’ll do exactly the opposite (nif:anchorOf should be omitted since it’s redundant).
In a couple of cases I’ve embedded rdfs:comment and rdfs:seeAlso to illustrate a point. Of course, don’t emit these in the actual JSON-LD.

3.1 Named Entity URLs

We have the following options for Named Entity URLs:

Global: it’s best to use global DBpedia URLs if they can be identified, as explained in ./NIF-example2.ttl
```
http://dbpedia.org/resource/Angela_Merkel
```
Project: we could use a project-global namespace for entities, eg
```
http://www.multisensorproject.eu/entity/Person/Angela_Merkel
```
(Eg the http://tag.ontotext.com demo uses such URLs for entities it cannot identify in global datasets.) However, this scheme cannot distinguish different NEs that share the same name across documents.
Document: ./NIF-example3.ttl uses per-document URLs, eg
```
http://www.multisensorproject.eu/content/12542546#Person=Angela_Merkel
```
(In this and the previous option, the entity URI is made from the entity text, replacing punctuation with “_”). This still does not allow two different John_Smiths in one document, but the chance of this happening is smaller.
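
The Project and Document schemes can be sketched in Python as follows. The function names are made up for this sketch, and the exact replacement rule is an assumption: the text only says that punctuation is replaced with “_”, and the example URLs suggest that spaces are treated the same way.

```python
import re


def entity_local_name(text: str) -> str:
    """Replace every run of non-word characters with a single "_".

    Assumption: spaces count as punctuation here, which matches
    "Angela Merkel" -> "Angela_Merkel" in the example URLs.
    """
    return re.sub(r"\W+", "_", text.strip()).strip("_")


def project_url(cls: str, text: str) -> str:
    """Project scheme: one namespace shared by all documents."""
    return f"http://www.multisensorproject.eu/entity/{cls}/{entity_local_name(text)}"


def document_url(doc_url: str, cls: str, text: str) -> str:
    """Document scheme: the entity is a hash "sub-node" of the document."""
    return f"{doc_url}#{cls}={entity_local_name(text)}"


print(project_url("Person", "Angela Merkel"))
print(document_url("http://www.multisensorproject.eu/content/12542546", "Person", "Angela Merkel"))
```

Two different people called John Smith in the same document still collapse to one URL with this rule, as noted above.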

Slash vs Hash: everything after a # is a fragment, which the client strips before making the HTTP request. So all URLs that differ only after the # are fetched with one HTTP request.

So hash is used for “sub-nodes” that will typically be served with one HTTP request.
In contrast, slash is used with large collections. If we have a million Named Entities, we can’t use hash in the Project scheme.
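
A small Python sketch of this point, using the standard `urllib.parse.urldefrag` to show which URL actually reaches the server. The helper name `request_url` and the second entity (`Location=Germany`) are made up for illustration.

```python
from urllib.parse import urldefrag


def request_url(entity_url: str) -> str:
    """Return the URL that is sent over HTTP: everything before the '#'."""
    return urldefrag(entity_url).url


doc = "http://www.multisensorproject.eu/content/12542546"

# Two hash URLs in the same document resolve to the same HTTP request
print(request_url(doc + "#Person=Angela_Merkel"))
print(request_url(doc + "#Location=Germany"))

# A slash URL is requested as is, so each entity is fetched separately
print(request_url("http://www.multisensorproject.eu/entity/Person/Angela_Merkel"))
```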

4 Validation

Please validate generated NIF files.

4.1 NIF Validator

doc: http://persistence.uni-leipzig.org/nlp2rdf/specification/core.html#validator
software: http://persistence.uni-leipzig.org/nlp2rdf/specification/validate.jar
tests: http://persistence.uni-leipzig.org/nlp2rdf/ontologies/testcase/lib/nif-2.0-suite.ttl
- You can understand them just by reading the error messages, e.g. “nif:anchorOf must match the substring of nif:isString calculated with begin and end index”
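
The check quoted in that error message can be reproduced with a few lines of Python. This is only a sketch of the idea, not the validator’s code. It assumes zero-based offsets with an exclusive end index, counted in the same units as Python string indices; the function name and sample sentence are made up.

```python
def anchor_matches(is_string: str, begin: int, end: int, anchor: str) -> bool:
    """True if anchor equals the substring of is_string selected by [begin, end)."""
    return is_string[begin:end] == anchor


text = "Angela Merkel visited Germany"
start = text.index("Germany")

print(anchor_matches(text, 0, 13, "Angela Merkel"))
print(anchor_matches(text, start, start + len("Germany"), "Germany"))
```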

The validator says “informat=json-ld not implemented yet”, so we need to convert to ttl first (I use apache-jena-2.12.1):

rdfcat -out ttl test-out.jsonld | java -jar validate.jar -i - -o text

Unfortunately there are only 11 tests, so it’s a disappointment.

4.2 RDFUnit Validation

RDFUnit validation is implemented in the MS RDF_Validation_Service. We’d be glad to help you read its results.

RDFUnit is a better validator than the NIF Validator above:

home: http://aksw.org/Projects/RDFUnit.html
demo: http://rdfunit.aksw.org/demo/
source: https://github.com/AKSW/RDFUnit/
paper: “NLP data cleansing based on Linguistic Ontology constraints”

I tried their demo site with ./NIF-test1.jsonld and ./NIF-example2.ttl:

1. Data Selection> Direct Input> JSON-LD> Load
Data loaded successfully! (162 statements)
2. Constraints Selection> Automatic> Load
Constraints loaded successfully: (foaf, nif, itsrdf, dcterms)
3. Test Generation
Completed! Generated 514 tests                 (WOW!! That's a lot)
4. Testing> Report Type> Status (all)> Run Tests
Total test cases 514, Succeeded 507, Failed 7  (NOTE: those "Succeeded" also in many cases mean errors)

4.2.1 Generated Tests per Ontology

URI	Automatic	Manual
http://xmlns.com/foaf/0.1/	174	-
http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#	199	10
http://www.w3.org/2005/11/its/rdf#	75	-
http://purl.org/dc/terms/	56	-
http://www.w3.org/2006/time#	183	-
http://dbpedia.org/ontology/	9281	14

(Even though I canceled dbo generation a bit prematurely.)
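
A quick arithmetic check, written as Python: the rows for the four ontologies loaded in step 2 of the demo run (foaf, nif, itsrdf, dcterms) add up to the 514 tests reported in step 3. The time and dbo rows therefore presumably come from a separate generation run; that reading is an inference from the numbers, not something the demo states. A “-” in the Manual column is treated as 0.

```python
# (automatic, manual) test counts copied from the table above
generated = {
    "foaf": (174, 0),
    "nif-core": (199, 10),
    "itsrdf": (75, 0),
    "dcterms": (56, 0),
}

total = sum(auto + manual for auto, manual in generated.values())
print(total)  # 514
```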

This is too much for us; we don’t want the DBO tests. In particular, the Status (all) report includes a lot of “violations” that come from the ontologies, not from our data. But it’s definitely worth investigating.

4.2.2 RDFUnit test results

Here are the results. “Resources” is a simple tabular format (basically URL-error), “Annotated Resources” provides more detail (about the errors pertaining to each URL)

./NIF-test1-out.xls: Status (all) and Resources
./NIF-test1-annotated.ttl: Annotated Resources
./NIF-example2-out.xls: Resources
./NIF-example2-annotated.ttl: Annotated Resources

4.3 Manual Validation

(Was at RDF_Validation, but will maintain it here).

I’ve been checking SIMMOs for NIF conformance for a while, and have done it maybe 100 times already. Please post only Turtle files, not JSON files, since JSON is impossible to check by eyeballing.

Get Jena (eg apache-jena-3.0.0.tar.gz), unpack it somewhere and add the bin directory to your path. We’ll use RIOT (RDF I/O technology).
Get Turtle: You can get a Turtle representation of the SIMMO in one of two ways:

4.3.1 Get Turtle from Store

Store the SIMMO using the RDF Storing Service
Get the SIMMO out using a query like this (saved as “a SIMMO graph”), and then save the result as file-noprefix.ttl (Turtle).

construct {?s ?p ?o}
where {graph <http://data.multisensor.org/content/8006dcd60b292feaaef24abc9ec09e2230aab83e>
  {?s ?p ?o}}

There’s also a REST call to get the SIMMO out that’s easier to use from the command line.

4.3.2 Get Turtle from SIMMO JSON

Get the content of the “rdf” key out of the SIMMO JSON, unescape the quotes, and save it as file.jsonld. So instead of this:

"rdf":["[{\"@id\":\"http://data.multisensor...[{\"@value\":\"Germany\"}]}]"],"category":""}

You need this:

[{"@id":"http://data.multisensor...[{"@value":"Germany"}]}]

You can do this manually, or with RIOT that can convert the stringified RDF field into more readable JSONLD format:

riot --output=jsonld rdf_output_string.jsonld > new_readable_file.jsonld

Instead of a single string, the results will be displayed as:

"@graph" : [ {
  "@id" : "http://data.multisensorproject.eu/content/53a0938bc4770c6ba0e7d7b9ca88a637f9e9c304#Amount=10000_Euro",
  "@type" : [ "http://schema.org/QuantitativeValue", "http://nerd.eurecom.fr/ontology#Amount" ],
  "name" : "10000 Euro"
}, {
  "@id" : "http://data.multisensorproject.eu/content/53a0938bc4770c6ba0e7d7b9ca88a637f9e9c304#Amount=2000_Euro",
  "@type" : [ "http://schema.org/QuantitativeValue", "http://nerd.eurecom.fr/ontology#Amount" ],
  "name" : "2000 Euro"
}, {...
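
As an alternative to unescaping by hand, the extraction step can be sketched in Python with the standard `json` module. The sketch assumes, based on the snippet above, that the “rdf” key holds a list whose first element is the stringified JSON-LD. The function name `extract_rdf` and the tiny sample document are made up for illustration.

```python
import json


def extract_rdf(simmo_json: str) -> str:
    """Pull the stringified JSON-LD out of a SIMMO and return it pretty-printed."""
    simmo = json.loads(simmo_json)   # parsing the SIMMO unescapes the quotes
    rdf_string = simmo["rdf"][0]     # assumption: a list with one string in it
    return json.dumps(json.loads(rdf_string), indent=2, ensure_ascii=False)


# Hypothetical miniature SIMMO with the same shape as the snippet above
sample_rdf = [{"@id": "http://example.org/doc#Location=Germany",
               "http://www.w3.org/2000/01/rdf-schema#label": [{"@value": "Germany"}]}]
sample_simmo = json.dumps({"rdf": [json.dumps(sample_rdf)], "category": ""})

print(extract_rdf(sample_simmo))
```

The returned string can then be saved as file.jsonld.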

No matter which of the two methods you used, the rest is the same.

Validate it with RIOT (optional but recommended):
```
riot --validate file.jsonld
```

Convert to Turtle, filtering out the “WARN  riot” lines, which would otherwise make the Turtle invalid:

riot --output turtle file.jsonld | grep -v "WARN  riot" > file-noprefix.ttl

4.3.2.1 Prettify Turtle

Unfortunately this file doesn’t use prefixes, so the URLs are long and ugly (Boyan will fix this for the Store MULTISENSO-137)

Save ./prefixes.ttl (I update this file about once a month)

Concat the two:

cat prefixes.ttl file-noprefix.ttl > file-withprefix.ttl

Prettify the Turtle to make use of the prefixes and to group all statements of the same subject together:
```
riot --formatted=turtle file-withprefix.ttl > file.ttl
```

Optional manual edits:

Add on top a base, using the actual SIMMO base, eg

@base <http://data.multisensorproject.eu/content/53a0938bc4770c6ba0e7d7b9ca88a637f9e9c304>.

Replace “http://data.multisensorproject.eu/content/53a0938bc4770c6ba0e7d7b9ca88a637f9e9c304” with “” (I don’t know why RIOT doesn’t use the base, even if I specify the --base option)
Sort paragraphs (i.e. statement clusters)
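
These manual edits can be scripted. The following Python sketch is one possible way to do it, not an existing tool: the function name `prettify` is made up, and it assumes that statement clusters are separated by blank lines and that no literal contains a blank line. Only IRIs in angle brackets are shortened.

```python
def prettify(turtle: str, base: str) -> str:
    """Add @base, shorten IRIs that start with the base, and sort statement clusters."""
    # "<base#x>" becomes "<#x>"; other IRIs are left alone
    body = turtle.replace("<" + base, "<")
    clusters = [c.strip() for c in body.split("\n\n") if c.strip()]
    # Keep @prefix and other directives on top, sort the remaining clusters
    directives = [c for c in clusters if c.startswith("@")]
    statements = sorted(c for c in clusters if not c.startswith("@"))
    return "\n\n".join([f"@base <{base}>."] + directives + statements) + "\n"


base = "http://data.multisensorproject.eu/content/53a0938bc4770c6ba0e7d7b9ca88a637f9e9c304"
sample = (
    "@prefix nerd: <http://nerd.eurecom.fr/ontology#> .\n\n"
    f"<{base}#Amount=2000_Euro> a nerd:Amount .\n\n"
    f"<{base}#Amount=10000_Euro> a nerd:Amount .\n"
)
print(prettify(sample, base))
```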

Post that last prettified file.ttl in Jira. Thanks!
