RDFization Guide v1.0
The linked data that forms part of Bio2RDF adheres to a simple set of modeling patterns that permit our different datasets to interoperate syntactically. The best practices presented here have been inspired by the Banff Manifesto, Tim Berners-Lee's design principles and the collective experience of our community. This document provides a simple set of guidelines to help Bio2RDF users and contributors create and query our data.
This guide assumes that you have working experience creating RDF documents programmatically. If this describes you, then read on!
The more than 1800 biological databases that are currently available provide unique identifiers for every record that they host. For example, the Protein Data Bank uses a four-character string to represent its unique entries (e.g. 1Y26, http://www.rcsb.org/pdb/explore/explore.do?structureId=1y26); similarly, PubMed uses an integer to identify publications (e.g. 22359647).
All Bio2RDF URIs must use the HTTP scheme and bio2rdf.org as a base: http://bio2rdf.org/. Appended to this base must be a dataset specific namespace. The namespace must be followed by a colon and then a unique identifier. For example, the Bio2RDF URI for the UniProt record with identifier P26838 would be
http://bio2rdf.org/uniprot:P26838
All identifiers must be URL-safe. For example, if an identifier contains a round bracket i.e. ( or ), then the URL encoding of this bracket must be used in the Bio2RDF URI. A list of URL encodings for common characters is available here.
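The encoding rule above can be sketched in Python as follows. This is a minimal illustration, not part of any Bio2RDF tooling: the helper name `bio2rdf_uri` and the `example` namespace are hypothetical, and the sketch assumes that percent-encoding every reserved character in the identifier is acceptable.

```python
from urllib.parse import quote

BIO2RDF_BASE = "http://bio2rdf.org/"


def bio2rdf_uri(namespace: str, identifier: str) -> str:
    """Build a URI of the form http://bio2rdf.org/namespace:identifier.

    Hypothetical helper, for illustration only.
    """
    # safe="" percent-encodes everything except letters, digits,
    # "_", ".", "-" and "~", so "(" and ")" become %28 and %29.
    return f"{BIO2RDF_BASE}{namespace}:{quote(identifier, safe='')}"


print(bio2rdf_uri("uniprot", "P26838"))
# "example" is a made-up namespace used only to show bracket encoding.
print(bio2rdf_uri("example", "abc(1)"))
```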
There are over 1800 existing dataset-specific Bio2RDF namespaces that can be browsed at http://www.freebase.com/view/base/bio2rdf/views/bm. Before creating a new namespace, ensure that there isn't an existing namespace appropriate for your dataset, and that any new namespace you wish to use is not already in use for another dataset.
When creating Bio2RDF URIs, there are three types of namespaces that may be used. If you are creating a Bio2RDF URI for an existing dataset identifier, use the namespace of that dataset. For example, the Bio2RDF namespace for UniProt is uniprot. If, in the process of converting a dataset to RDF, you create new identifiers that did not previously exist in the dataset being converted, use a dataset_resource namespace. In the example of UniProt, this would be uniprot_resource and an example URI would be:
http://bio2rdf.org/uniprot_resource:P26848-unique-identifier-123
Finally, if you wish to create dataset-specific predicates and types, then use a dataset_vocabulary namespace. For example, the Bio2RDF URI for the UniProt Protein type would be:
http://bio2rdf.org/uniprot_vocabulary:Protein
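As a rough sketch of the three namespace types described above, the following Python snippet derives the namespace from a dataset name and the kind of identifier being minted. The names `NamespaceKind`, `qualified_namespace` and `namespaced_uri` are assumptions made for this example, not an existing Bio2RDF API.

```python
from enum import Enum


class NamespaceKind(Enum):
    """The three namespace types; the values are the suffixes to append."""
    DATASET = ""                # identifiers that already exist in the source
    RESOURCE = "_resource"      # identifiers created during conversion
    VOCABULARY = "_vocabulary"  # dataset-specific predicates and types


def qualified_namespace(dataset: str, kind: NamespaceKind) -> str:
    # e.g. ("uniprot", RESOURCE) -> "uniprot_resource"
    return f"{dataset}{kind.value}"


def namespaced_uri(dataset: str, kind: NamespaceKind, local_id: str) -> str:
    # local_id is assumed to be URL-safe already.
    return f"http://bio2rdf.org/{qualified_namespace(dataset, kind)}:{local_id}"


print(namespaced_uri("uniprot", NamespaceKind.VOCABULARY, "Protein"))
```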
All URIs that form part of Bio2RDF linked data follow a normalized pattern.
Normalized URI: http://bio2rdf.org/public_database:private_identifier
Consider, for example, the UniProt record P26838. The proposed URI for this record would be the same as its URL:
http://bio2rdf.org/uniprot:P26838
Note: Blank nodes should be avoided like the plague.
What namespaces can we use
Every resource must contain the following metadata with their corresponding predicates:
- rdf:type Define the class of object described by the resource
- rdfs:label The document's title and identifier
The first step of the RDFization process involves the use of a consistent identifier scheme. Data providers such as NCBI, EBI, etc. use unique identifiers to refer to the entities that they are hosting. The linked data that forms part of Bio2RDF distinguishes between those identifiers that refer to the original hosted entities and any other auxiliary identifiers used in the creation of the linked data graph.
For every unique entity corresponding to a record, Bio2RDF identifiers are given by the following URI pattern:
http://bio2rdf.org/namespace:identifier
where the namespace is a short name listed in our dataset registry that uniquely identifies the source (dataset/database). The identifier is the (alpha)numeric string assigned to identify that entity. For instance, the gene identified by the number 15275 in the NCBI EntrezGene Database (namespace = geneid) has the following identifier:
<code>http://bio2rdf.org/geneid:15275</code>
The Bio2RDF URI scheme is applied not just to data entries, but also to the vocabulary (types and relations) used to describe these entries.
<code>http://bio2rdf.org/namespace_term:term</code>
For example, the gene identified by geneid:15275 is a kind of Gene, as defined by Entrez Gene.
<code>http://bio2rdf.org/geneid_term:Gene</code>
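Going the other way, a consumer of the data may want to recover the namespace and identifier from a URI that follows the pattern above. The Python sketch below shows one way to do this; `split_bio2rdf_uri` is a hypothetical name, and the sketch assumes the identifier was percent-encoded when the URI was built.

```python
from typing import Tuple
from urllib.parse import unquote, urlparse


def split_bio2rdf_uri(uri: str) -> Tuple[str, str]:
    """Split http://bio2rdf.org/namespace:identifier into its two parts.

    Hypothetical helper, for illustration only.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "http" or parsed.netloc != "bio2rdf.org":
        raise ValueError(f"not a Bio2RDF URI: {uri}")
    # The path looks like "/geneid:15275"; drop the leading slash and
    # split on the first colon only.
    namespace, sep, identifier = parsed.path.lstrip("/").partition(":")
    if not sep or not namespace or not identifier:
        raise ValueError(f"expected namespace:identifier in: {uri}")
    # Undo any percent-encoding applied to the identifier.
    return namespace, unquote(identifier)


print(split_bio2rdf_uri("http://bio2rdf.org/geneid:15275"))
print(split_bio2rdf_uri("http://bio2rdf.org/geneid_term:Gene"))
```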
Each resource should contain the following annotations:
<code>http://purl.org/dc/terms/title</code> a human readable title as it appears in the source data.
<code>http://purl.org/dc/terms/identifier</code> a string that contains the identifier using the following pattern <namespace>:<identifier>
<code>rdfs:label</code> a Bio2RDF generated label containing a title followed by the identifier "title [ns:id]".
The label is used by convention in most RDF browsers to render the name of a resource instead of its URI.
Taken together,
<code> geneid:15275 rdfs:label "Hk1 [geneid:15275]" ; dc:title "Hk1" ; dc:identifier "geneid:15275" ; rdf:type geneid_term:Gene . </code>
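The annotations above could be generated along these lines. This Python sketch builds the statements as plain text rather than with an RDF library; the function name `describe_resource` is hypothetical. It assumes the prefixes (rdf, rdfs, dc and the namespace prefixes) are declared elsewhere, and that the title contains no characters that need escaping in a string literal.

```python
def describe_resource(namespace: str, identifier: str, title: str, type_term: str) -> str:
    """Return the minimal annotations for one resource as Turtle-style text.

    Hypothetical helper; prefix declarations are assumed to exist elsewhere.
    """
    curie = f"{namespace}:{identifier}"
    # The Bio2RDF label is the title followed by "[ns:id]".
    label = f"{title} [{curie}]"
    return (
        f'{curie} rdfs:label "{label}" ;\n'
        f'    dc:title "{title}" ;\n'
        f'    dc:identifier "{curie}" ;\n'
        f'    rdf:type {namespace}_term:{type_term} .'
    )


print(describe_resource("geneid", "15275", "Hk1", "Gene"))
```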
We recognize a minimum of 3 entities found in biological information resources: physical entities, records and datasets.
1. Record
Records are information objects that contain a set of statements, primarily about the subject.
<code> namespace_record:identifier bio2rdf_term:has-primary-subject namespace:identifier . </code>
<code> namespace:identifier bio2rdf_term:is-described-by namespace_record:identifier . </code>
2. Dataset
Datasets are collections of records.
<code> bio2rdf_dataset:<namespace> bio2rdf_term:has-item namespace_record:identifier . </code>
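The relationships between an entity, its record and its dataset can be sketched in Python as a list of (subject, predicate, object) tuples. The function name `record_triples` is an assumption for this example. The sketch uses prefixed names exactly as they appear in the patterns above and does not expand them to full URIs.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def record_triples(namespace: str, identifier: str) -> List[Triple]:
    """Link an entity, its record and its dataset (hypothetical helper)."""
    entity = f"{namespace}:{identifier}"
    record = f"{namespace}_record:{identifier}"
    dataset = f"bio2rdf_dataset:{namespace}"
    return [
        # The record is an information object about the entity.
        (record, "bio2rdf_term:has-primary-subject", entity),
        (entity, "bio2rdf_term:is-described-by", record),
        # The dataset is a collection of records.
        (dataset, "bio2rdf_term:has-item", record),
    ]


for s, p, o in record_triples("geneid", "15275"):
    print(f"{s} {p} {o} .")
```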
Since datasets can be versioned, we
<code> bio2rdf_dataset:namespace.version dc:hasVersion "13" ; dc:isPartOf bio2rdf_dataset:namespace . </code>
This section is about how to create mappings from your dataset-specific vocabulary to SIO.