RDFization Guide v1.0

The linked data that forms part of Bio2RDF adheres to a simple set of modeling patterns that permit our different datasets to interoperate syntactically. The best practices presented here have been inspired by the Banff Manifesto, Tim Berners-Lee's design principles and the collective experience of our community. This document provides a simple set of guidelines to help Bio2RDF users and contributors create and query our data.

This guide assumes that you have working experience in creating RDF documents programmatically. If this describes you, then read on!

Table of Contents
Creating Resources
Formatting URIs
Using existing Bio2RDF namespaces
Creating Bio2RDF namespaces
Annotating resources
Submit your RDFizer
Linked Data Rules
1: URIs have a syntactic pattern and are dereferenceable
2: Authoritative public namespaces are used
3: Every resource holds unique metadata
4: RDFizer programs must be open source
5: Every dataset must be shipped with an ontology file
Identifiers
Entities
Vocabulary
Descriptions
Minimum Annotations
Datasets, Records and Entities
Mappings
Scripts
Serialization
Loading

Creating Resources

More than 1800 biological databases are currently available, and each provides unique identifiers for every record that it hosts. For example, the Protein Data Bank uses a four-character string to represent its unique entries (e.g. 1Y26, http://www.rcsb.org/pdb/explore/explore.do?structureId=1y26); similarly, PubMed uses an integer to identify publications (e.g. 22359647).

Formatting URIs

All Bio2RDF URIs must use the HTTP scheme and bio2rdf.org as a base: http://bio2rdf.org/. Appended to this base must be a dataset specific namespace. The namespace must be followed by a colon and then a unique identifier. For example, the Bio2RDF URI for the UniProt record with identifier P26838 would be

http://bio2rdf.org/uniprot:P26838

All identifiers must be URL-safe. For example, if an identifier contains a round bracket i.e. ( or ), then the URL encoding of this bracket must be used in the Bio2RDF URI. A list of URL encodings for common characters is available here.
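
The rules above can be sketched in a few lines of Python. This is only an illustration, not an official Bio2RDF helper: the function name `bio2rdf_uri` and the `example` namespace are hypothetical, and percent-encoding every reserved character with `urllib.parse.quote` is an assumption about how strictly "URL-safe" should be read.

```python
from urllib.parse import quote

BIO2RDF_BASE = "http://bio2rdf.org/"


def bio2rdf_uri(namespace: str, identifier: str) -> str:
    """Build a Bio2RDF URI of the form base + namespace + ":" + identifier."""
    # safe="" makes quote() percent-encode every reserved character,
    # including "/", "(" and ")"; letters, digits, "_", ".", "-" and "~"
    # are left untouched.
    return f"{BIO2RDF_BASE}{namespace}:{quote(identifier, safe='')}"


# The UniProt record used in the text.
print(bio2rdf_uri("uniprot", "P26838"))
# A made-up identifier containing round brackets (hypothetical namespace).
print(bio2rdf_uri("example", "abc(1)"))
```

With this sketch, the round brackets in the second identifier are emitted as `%28` and `%29`.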

Using existing Bio2RDF namespaces

There are over 1800 existing dataset-specific Bio2RDF namespaces that can be browsed at http://www.freebase.com/view/base/bio2rdf/views/bm. Before creating a new namespace, ensure that there isn't an existing namespace appropriate for your dataset, and that any new namespace you wish to use is not already in use for another dataset.

Creating Bio2RDF namespaces

When creating Bio2RDF URIs, there are three types of namespaces that may be used. If you are creating a Bio2RDF URI for an existing dataset identifier, then the namespace used should be that of the dataset. For example, the Bio2RDF namespace for UniProt is uniprot. If, in the process of converting a dataset to RDF, you create new identifiers that did not previously exist in the dataset being converted, then use a dataset_resource namespace. In the example of UniProt, this would be uniprot_resource and an example URI would be:

http://bio2rdf.org/uniprot_resource:P26838-unique-identifier-123

Finally, if you wish to create dataset-specific predicates and types, then use a dataset_vocabulary namespace. For example, the Bio2RDF URI for the UniProt Protein type would be:

http://bio2rdf.org/uniprot_vocabulary:Protein
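
As a rough sketch, the three namespace types can be modelled as suffixes on the dataset namespace. The names `NamespaceKind`, `bio2rdf_namespace` and `build_uri` are hypothetical and chosen only for this illustration.

```python
from enum import Enum

BIO2RDF_BASE = "http://bio2rdf.org/"


class NamespaceKind(Enum):
    """The three namespace types described above, as namespace suffixes."""
    DATASET = ""                # identifiers that already exist in the dataset
    RESOURCE = "_resource"      # identifiers minted during conversion to RDF
    VOCABULARY = "_vocabulary"  # dataset-specific predicates and types


def bio2rdf_namespace(dataset: str, kind: NamespaceKind) -> str:
    """Return the namespace to use for the given dataset and namespace type."""
    return f"{dataset}{kind.value}"


def build_uri(dataset: str, kind: NamespaceKind, local_part: str) -> str:
    """Combine base, namespace and local part into a Bio2RDF URI."""
    return f"{BIO2RDF_BASE}{bio2rdf_namespace(dataset, kind)}:{local_part}"


print(build_uri("uniprot", NamespaceKind.DATASET, "P26838"))
print(build_uri("uniprot", NamespaceKind.VOCABULARY, "Protein"))
```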

Annotating resources

Submit your RDFizer

Linked Data Rules

1: URIs have a syntactic pattern and are dereferenceable

All URIs that form part of Bio2RDF linked data follow a normalized syntactic pattern and are dereferenceable. Normalized URI: http://bio2rdf.org/public_database:private_identifier

Consider, for example, the UniProt record P26838. The proposed URI for this record would be the same as its URL:

http://bio2rdf.org/uniprot:P26838
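
One way to check the syntactic half of this rule is a regular expression over the normalized pattern. The sketch below is an assumption-laden illustration: the function name `parse_bio2rdf_uri` is hypothetical, and the set of characters allowed in a namespace (lowercase letters, digits, `_`, `.` and `-`) is a guess rather than a documented Bio2RDF constraint. It does not test whether the URI actually dereferences.

```python
import re
from typing import Optional, Tuple

# Assumed namespace alphabet: lowercase letters, digits, "_", "." and "-".
# The identifier is taken to be everything after the first colon.
BIO2RDF_URI = re.compile(
    r"^http://bio2rdf\.org/(?P<namespace>[a-z0-9_.-]+):(?P<identifier>\S+)$"
)


def parse_bio2rdf_uri(uri: str) -> Optional[Tuple[str, str]]:
    """Return (namespace, identifier), or None if the URI does not match."""
    match = BIO2RDF_URI.match(uri)
    if match is None:
        return None
    return match.group("namespace"), match.group("identifier")


print(parse_bio2rdf_uri("http://bio2rdf.org/uniprot:P26838"))
print(parse_bio2rdf_uri("http://example.org/uniprot:P26838"))
```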

Note: Blank nodes should be avoided like the plague.

2: Authoritative public namespaces are used

What namespaces can we use

3: Every resource holds unique metadata

Every resource must contain the following metadata with their corresponding predicates:

rdf:type Defines the class of object described by the resource
rdfs:label The resource's title and identifier
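
A converter can check this rule before it writes its output. The following sketch assumes, purely for illustration, that triples are held in memory as `(subject, predicate, object)` tuples of prefixed names; `missing_metadata` is a hypothetical helper, not part of any Bio2RDF tooling.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]

# The two predicates required by rule 3.
REQUIRED_PREDICATES = ("rdf:type", "rdfs:label")


def missing_metadata(subject: str, triples: Iterable[Triple]) -> List[str]:
    """List the required predicates that the subject does not yet have."""
    present = {p for s, p, _ in triples if s == subject}
    return [p for p in REQUIRED_PREDICATES if p not in present]


triples = [
    ("geneid:15275", "rdf:type", "geneid_term:Gene"),
]
# Only rdf:type is asserted above, so rdfs:label is reported as missing.
print(missing_metadata("geneid:15275", triples))
```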

4: RDFizer programs must be open source

5: Every dataset must be shipped with an ontology file

Identifiers

The first step of the RDFization process involves the use of a consistent identifier scheme. Data providers such as NCBI, EBI, etc. use unique identifiers to refer to the entities that they are hosting. The linked data that forms part of Bio2RDF distinguishes between those identifiers that refer to the original hosted entities and any other auxiliary identifiers used in the creation of the linked data graph.

Entities

For every unique entity corresponding to a record, Bio2RDF identifiers are given by the following URI pattern:

http://bio2rdf.org/namespace:identifier

where the namespace is a short name listed in our dataset registry that uniquely identifies the source (dataset/database). The identifier is the (alpha)numeric string assigned to identify that entity. For instance, the gene identified by the number 15275 in the NCBI EntrezGene Database (namespace = geneid) has the following identifier:

 <code>http://bio2rdf.org/geneid:15275</code>

Vocabulary

The Bio2RDF URI scheme is applied not just to data entries, but also to the vocabulary (types and relations) used to describe these entries.

 <code>http://bio2rdf.org/namespace_term:term</code>

For example, the gene identified by geneid:15275 is a kind of Gene, as defined by Entrez Gene.

 <code>http://bio2rdf.org/geneid_term:Gene</code>

Descriptions

Minimum Annotations

Each resource should contain the following annotations:

 <code>http://purl.org/dc/terms/title</code> 
 a human readable title as it appears in the source data.

 <code>http://purl.org/dc/terms/identifier</code>
 a string that contains the identifier using the following pattern <namespace>:<identifier>

 <code>rdfs:label</code>
 a Bio2RDF-generated label containing the title followed by the identifier, in the form "title [ns:id]". By convention, most RDF browsers use this label to render the name of a resource instead of its URI.

Taken together, the minimum annotations for geneid:15275 look like this:

 <code>
  geneid:15275 
   rdfs:label "Hk1 [geneid:15275]" ;
   dc:title "Hk1" ;
   dc:identifier "geneid:15275" ;
   rdf:type geneid_term:Gene .
 </code>
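
The same description can be generated programmatically. The sketch below is one possible approach and not the Bio2RDF implementation: the function names are hypothetical, prefix declarations for `rdfs`, `dc`, `rdf` and the dataset namespaces are assumed to be emitted elsewhere, and only backslashes and double quotes are escaped in literals.

```python
def escape_literal(text: str) -> str:
    """Escape backslashes and double quotes for a short Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def minimum_annotations(namespace: str, identifier: str,
                        title: str, rdf_type: str) -> str:
    """Return a Turtle fragment holding the minimum annotations of a resource."""
    curie = f"{namespace}:{identifier}"
    # Bio2RDF label convention: title followed by "[ns:id]".
    label = f"{title} [{curie}]"
    return "\n".join([
        curie,
        f'  rdfs:label "{escape_literal(label)}" ;',
        f'  dc:title "{escape_literal(title)}" ;',
        f'  dc:identifier "{escape_literal(curie)}" ;',
        f"  rdf:type {rdf_type} .",
    ])


print(minimum_annotations("geneid", "15275", "Hk1", "geneid_term:Gene"))
```

Titles that contain line breaks would need additional escaping before they can be written this way.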

Datasets, Records and Entities

We recognize a minimum of 3 entities found in biological information resources: physical entities, records and datasets.

1. Record

Records are information objects that contain a set of statements, primarily about the subject.

 <code>
  namespace_record:identifier
    bio2rdf_term:has-primary-subject namespace:identifier .
 </code>

 <code>
  namespace:identifier
   bio2rdf_term:is-described-by namespace_record:identifier .
 </code>
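
The two statements above always come as a pair, so a converter can emit them together. This is a minimal sketch that assumes triples are represented as tuples of prefixed names and that a record shares its identifier with the entity it describes, as in the patterns above; `record_links` is a hypothetical name.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def record_links(namespace: str, identifier: str) -> List[Triple]:
    """Return the two triples that link a record and its primary subject."""
    entity = f"{namespace}:{identifier}"
    record = f"{namespace}_record:{identifier}"
    return [
        (record, "bio2rdf_term:has-primary-subject", entity),
        (entity, "bio2rdf_term:is-described-by", record),
    ]


for s, p, o in record_links("geneid", "15275"):
    print(f"{s} {p} {o} .")
```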

2. Dataset

Datasets are collections of records.

 <code>
  bio2rdf_dataset:<namespace>
    bio2rdf_term:has-item namespace_record:identifier .
 </code>

Since datasets can be versioned, each version is described as its own resource that points back to the unversioned dataset:

 <code>
  bio2rdf_dataset:namespace.version
    dc:hasVersion "13" ;
    dc:isPartOf bio2rdf_dataset:namespace .
 </code>

Mappings

This section describes how to create mappings from your dataset-specific vocabulary to the Semanticscience Integrated Ontology (SIO).

Ontologies

Scripts


Serialization

Loading

Loading the RDF database
