In the Open Data world we are encouraged to try to publish our data as “5-star” Linked Data because of the semantic richness and ease of integration that the RDF model offers. For many people and organisations this is a new world and some learning and experimenting is required in order to gain the necessary skills and experience to fully exploit this way of working with data. This workshop will re-assert the case for RDF and provide a guided tour of some examples of RDF publication that can act as a guide to those making a first venture into the field.
3. Resource Description Framework
RDF
• Initially a way of adding metadata to XML
• Subject-Predicate-Object or
• Subject-Predicate-Literal triples
[Diagram of two example triples: Scotland --has Authority--> Aberdeen City; Aberdeen City --has Population--> “218,220”]
5. “One often overlooked advantage that RDF offers is its deceptively simple data model. This data model trivializes merging of data from multiple sources and does it in such a way that data about the same things gets collated and deduplicated. In my opinion this is the most important benefit of using RDF over other open data formats.” (Ian Davis, 2011)
http://blog.iandavis.com/2011/08/18/the-real-challenge-for-rdf-is-yet-to-come/
6. [Diagram: a resource with the name “Bonnet”, living in http://place.org/Paris, owns http://ex.org/pet/2, which is called “Sasha”]
7. [Diagram: http://ex.org/pet/2 is a “ferret” and has “chicken” as its favourite food]
8. The two references to http://ex.org/pet/2 point to the same resource, so the graphs can merge.
[Diagram: the merged graph linking “Bonnet”, http://place.org/Paris, http://ex.org/pet/2, “Sasha”, “ferret” and “chicken”]
14. So RDF data is “5 star” because:
• No need for prior design discussion with data suppliers about data specification.
• No need to design a container before accepting data.
• Datasets are self-describing, with explicit semantics.
• Merged datasets are collated and de-duplicated automatically.
19. Creating RDF...
Wikis
• Use Wikipedia and let DBpedia work for you
• Semantic MediaWiki - outputs RDF and can be linked to a triplestore directly
• Drupal - creates RDFa which can be scraped, though not very widely used
20. Creating RDF....
Relational to RDF mapping
• D2R Server: Accessing databases with SPARQL
and as Linked Data
– http://opendata.tellmescotland.gov.uk
• Virtuoso RDF Views
– http://location.testproject.eu/BEL/
24. Geospatial Triplestores
• Virtuoso Universal Server (7.0, ColumnStore edition)
• Parliament (2.7.4 quickstart)
• uSeekM (1.2.0-a5, on top of PostgreSQL 8.4 and PostGIS 1.5)
• OWLIM-SE (Trial version 5.3.5849)
• Strabon (3.2.3, on top of PostgreSQL 8.4 and PostGIS 1.5)
Xen VMs for each available in Debian 6
http://blog.geoknow.eu/virtual-machines-of-geospatial-rdf-stores/
Dr. Jens Lehmann, Uni Leipzig
27. Linked Data API...
Entity Resolution:
Victoria Quay is
http://cofog01.data.scotland.gov.uk/id/facility/AB0103
...which resolves to
http://cofog01.data.scotland.gov.uk/doc/facility/AB0103
28. Linked Data API....
Different serialisations [JSON, NT, RDF/XML etc]
HTTP "Accept" headers - e.g. "application/json"
303 re-directs
http://cofog01.data.scotland.gov.uk/id/facility/AB0103.nt
http://cofog01.data.scotland.gov.uk/doc/facility/AB0103.rdf
30. Linked Data API....
• Linked Data API makes it easy
http://data.sepa.org.uk/doc/water/surfacewaters
http://data.sepa.org.uk/doc/water/surfacewaters.xml
31. FluidOps Workbench & FedX
• Built on top of Sesame RDF store
• Wiki-like structure for interaction
• Data pipelined in from external SPARQL and
other sources
• Includes widgets, graph views, facet views etc
for interacting with the aggregated data
34. DBpedia - at the heart of Open Data
September 2013
http://en.wikipedia.org/wiki/DBpedia
45 million interlinks with:
Freebase
OpenCyc
UMBEL
GeoNames
MusicBrainz
CIA World Fact Book
DBLP
Project Gutenberg
DBtune Jamendo
Eurostat
Uniprot
Bio2RDF
US Census data
Also used in:
Thomson Reuters OpenCalais
New York Times Linked Open Data
Zemanta API
DBpedia Spotlight
BBC datasets
Isn’t it awful when we’re trying to communicate and we’re misunderstood? Not only can it lead to problems as a direct result of the misunderstanding, but there can also be quite a bit of hassle in getting things straightened out after the mistake. In the case of Ginger and Fred it nearly became a showstopper. https://www.youtube.com/watch?v=zZ3fjQa5Hls
Data sharing within and between enterprises has always been a costly effort. The Open Group reckon “... that between 40% and 80% of application integration effort is spent on resolving semantic issues, a task that typically requires significant human intervention. The expanding use of Service Oriented Architecture (SOA) and Cloud Computing are further increasing the need for semantic interoperability that more efficiently aligns IT systems with business objectives.” Naturally, the Open Data programmes have similar issues. This is where RDF is playing a key role, both inside and outside enterprises.
http://www.opengroup.org/subjectareas/si
There is a lot of talk in the ‘Open Data’ world about “5-star” RDF data, which implies a meritocratic hierarchy of data models, so why is RDF ‘tops’? RDF is also key to the “Semantic Web”, also described as “Web 3”, the next generation of web technology. We are also hearing about the “internet of things”, and RDF plays a significant role there. So what is there for me, my business, my organisation in considering using data modelled as RDF?

RDF (Resource Description Framework) originated in the 1990s as a way of adding metadata to XML documents, but it is actually also a very tidy way of describing any data. RDF is a model in which data are expressed as triples comprising a Subject and an Object related by a directional Predicate.
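As a minimal illustration of the triple model, using the Scotland/Aberdeen City example from the slides (the predicate names and example.org URIs here are invented for the sketch), a triple can be modelled as a (subject, predicate, object) tuple:

```python
# Triples as (subject, predicate, object) tuples; the predicate
# names and example.org URIs are invented for illustration.
triples = {
    ("http://example.org/Scotland", "hasAuthority", "http://example.org/AberdeenCity"),
    ("http://example.org/AberdeenCity", "population", "218,220"),
}

# The object of one triple can be the subject of another, which is
# what makes RDF a graph rather than a table.
subjects = {s for s, p, o in triples}
objects = {o for s, p, o in triples}
print(subjects & objects)  # the shared node linking the two statements
```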
From a little after the Ginger and Fred era until about the late 1990s, interchanges of computerised data tended to follow detailed discussion and agreement between the two parties about the data being exchanged: the data models of the provider and recipient systems, mappings, semantic relations, and so on. Data exchanges required extraction, transformation and loading (ETL) stages, and this is still the case in many settings.
The RDF model removes much of the heavy lifting required in traditional ETL. “One often overlooked advantage that RDF offers is its deceptively simple data model. This data model trivializes merging of data from multiple sources and does it in such a way that data about the same things gets collated and de-duplicated. In my opinion this is the most important benefit of using RDF over other open data formats.” (Ian Davis, 2011)
http://blog.iandavis.com/2011/08/18/the-real-challenge-for-rdf-is-yet-to-come/
This is an example of one RDF data set
And here is another
Data entities with the same identifier allow both data sets to merge at these points
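The merge-at-shared-identifiers behaviour can be sketched with plain Python sets, reusing the pet example from the slides (the predicate names and the owner URI http://ex.org/person/1 are invented; the other URIs appear on the slides):

```python
# Two small graphs that both mention http://ex.org/pet/2.
# Predicate names and the owner URI are invented for the sketch.
graph_a = {
    ("http://ex.org/person/1", "name", "Bonnet"),
    ("http://ex.org/person/1", "livesIn", "http://place.org/Paris"),
    ("http://ex.org/person/1", "owns", "http://ex.org/pet/2"),
    ("http://ex.org/pet/2", "name", "Sasha"),
}
graph_b = {
    ("http://ex.org/pet/2", "species", "ferret"),
    ("http://ex.org/pet/2", "favouriteFood", "chicken"),
    ("http://ex.org/pet/2", "name", "Sasha"),  # statement repeated in both graphs
}

# Merging is plain set union: shared identifiers collate the data
# and repeated statements deduplicate, with no ETL step.
merged = graph_a | graph_b
print(len(merged))  # 6, not 7: the duplicate triple collapsed
```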
Another approach to data integration, particularly within enterprises, is to develop one big pot into which all the organisation’s key data fits: the enterprise data warehouse. The problem with this is brittleness. The warehouse takes an age to design, and then one can only fill it with items that it was built to contain. Anything that is the wrong ‘shape’ has to be rejected, refashioned (ETL), or be so important as to be worth the cost of adding an extra wing to the warehouse to cater for it. Lots of analysis, coding and so on.
In the RDF world the containers for native RDF are promiscuous, just like file systems, accepting any RDF and not just RDF that fits a particular schema or pattern [imagine how restrictive a file system that could only store Word documents would be]. Adding new RDF statements with relationships that are not currently present in the dataset does not require the sort of preparatory work needed in the relational model, such as adding new join tables to the database.
In conventional data management situations the semantics of the data tend to be observed in the interface and expressed in the documentation.
RDF, in contrast, has explicit semantics for the relationships between entities or from an entity to a literal, and also provides a mechanism to build in the descriptions of individual classes of entities and descriptions of the dataset.
The final aspect of RDF that I want to highlight as a potential benefit is that RDF data is a ‘graph’ of nodes and edges which, when visualised as a set of circles and arrows, has a particular shape within which one can see clustering and sparseness in ways that are difficult to achieve with other models.
So RDF data is “5 star” because I don’t need to have a dialogue with a range of data providers to unambiguously join their datasets into a “supergraph” that I can then work with. I don’t necessarily need to modify the container beforehand to tailor it to accept the data: RDF models can be merged automatically in the absence of a schema. Datasets are, at least to a minimal level, self-describing, in that it is explicit what is the same and what is different, which items are entities and which are properties/relationships. Data becomes collated and de-duplicated automatically.

In addition to these arguments in favour of the RDF model for data interchange, the increasing availability of open RDF Linked Data means that organisations not using these approaches will be unable to make effective use of openly available RDF from multiple sources in its native, efficient form: they will have to reduce it to a semantically less rich form (JSON, CSV, etc.), and this requires ETL steps prior to use.
So I am going to give a tour of some resources and illustrations that might be of help to those individuals and organisations wanting to get started in the RDF world
First step is education/training at scale: the Euclid Project [http://www.euclid-project.eu/]
Second step: starting to work with RDF publication. RDF is an efficient way to merge datasets from multiple sources unambiguously and relatively automatically. Think of it as an equivalent to RNA in biology: the main store of genetic information is held securely in another form (generally DNA, but also negative-strand or double-stranded RNA), but for working purposes (building proteins) that data is converted to a biological, globally shared form, RNA. In the same way, it is possible to hold data all the time as RDF and work effectively with it, but there are many situations where that isn’t optimal, either for historic reasons (there is existing infrastructure that works effectively on other data models) or, as in highly transactional systems, because the RDF approach isn’t suited to the normal operations.

Hand-written/scripted: one easy way to create some RDF is to write it by hand or with some simple scripts. This is often used for learning about RDF or for developing new ideas. Xturtle is an Eclipse plugin that makes this job much smoother. Generating RDF with scripts can be done using a templating approach but, as with constructing XML using specific tools, there are a range of RDF tools for constructing RDF statements in code: Java has the Jena and Sesame APIs, Python has RDFLib, Ruby has the RDF.rb gem, and several scripting languages have bindings for the Redland library, which is written in C.
http://aksw.org/Projects/Xturtle.html
http://jena.apache.org/
http://www.openrdf.org/
http://www.rdflib.net/
http://rdf.rubyforge.org/
http://librdf.org/
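The “templating” style of scripted RDF generation mentioned above can be sketched in a few lines; this is a toy N-Triples serialiser (the vocab URI is invented), and real scripts would normally lean on a library such as RDFLib or Jena for escaping and datatypes:

```python
# A toy N-Triples line "template"; real code would use a library
# such as RDFLib, which handles escaping and datatypes properly.
def ntriple(s, p, o, literal=False):
    # Literals are quoted; URIs are wrapped in angle brackets.
    obj = f'"{o}"' if literal else f"<{o}>"
    return f"<{s}> <{p}> {obj} ."

# The vocab URI below is invented for the example.
line = ntriple("http://ex.org/pet/2",
               "http://ex.org/vocab/species",
               "ferret", literal=True)
print(line)
```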
OpenRefine + RDF plugin: this is an application that makes it easy to clean up and convert a range of data types, including delimited text and spreadsheets, into RDF. There are also options to use ‘reconciliation services’, which are APIs that provide best-guess suggestions for widely used URIs for entities based on text in your data. These reconciliation services come from Freebase (Google) and, recently, the Ordnance Survey.
http://openrefine.org/
http://refine.deri.ie/
http://www.freebase.com/
http://data.ordnancesurvey.co.uk/datasets/os-linked-data/explorer/reconciliation
Conversion from relational databases: there are several ways in which RDF can be published from relational databases. This is often a good way to get RDF out of an existing system with minimal hassle, and it can be a low-risk way of getting into publishing some of your data as RDF.

Methods that use wikis:
Wikipedia and DBpedia: information in Wikipedia factboxes eventually ends up as RDF data published by DBpedia during the biannual conversion process. DBpedia Live attempts to keep abreast of the rapid rate of page updating on Wikipedia.
Semantic MediaWiki is an extension of the MediaWiki software used for Wikipedia that has an underlying RDF model for its data. Semantic MediaWiki provides routes for exporting both subsets of wiki pages and the whole wiki content as RDF.
Drupal 7.0 outputs RDFa data in core. RDFa (Resource Description Framework in attributes) is a W3C Recommendation which allows embedding RDF metadata within web documents. These RDF assertions can be ‘gleaned’ from the web pages by stylesheets and other ‘distillers’. However, RDFa isn’t used much ‘in the wild’ at the moment.
http://wiki.dbpedia.org/DBpediaLive
http://semantic-mediawiki.org/
http://enipedia.tudelft.nl/wiki/Main_Page
http://en.openei.org
https://drupal.org/node/778988
Relational to RDF mappers: these act as a “babelfish”, translating a relational database schema into an RDF model through a mapping procedure (the applications assist that process, but it often needs hand-finishing) and providing a query interface (i.e. these mapping applications create a SPARQL endpoint and return RDF, but the underlying data is maintained in a SQL database). This approach is ideal for providing RDF from an existing application which you don’t want to mess with but for which you want an RDF output. Examples of relational to RDF mapping software include D2R and the polyglot storage server Virtuoso. Examples of use of these tools include the TellMeScotland open data publication and the EC Joinup pilot linking Belgian addressing data.
http://d2rq.org/d2r-server
http://virtuoso.openlinksw.com/whitepapers/relational%20rdf%20views%20mapping.html
http://opendata.tellmescotland.gov.uk
https://joinup.ec.europa.eu/sites/default/files/D5.2.1_Core_Location_Pilot-Interconnecting_Belgian_National_and_Regional_Address_Data_v0.5.pdf
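The core idea of the mapping, each row becoming a set of triples about one resource, can be sketched by hand with an in-memory database (table, column names and the URI pattern here are invented; tools like D2R generate this translation from a declarative mapping file instead):

```python
import sqlite3

# A hand-rolled sketch of relational-to-RDF mapping. The table,
# columns and URI pattern are invented for the example; real
# mappers such as D2R drive this from a mapping file.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE council (id INTEGER, name TEXT, population INTEGER)")
db.execute("INSERT INTO council VALUES (1, 'Aberdeen City', 218220)")

def row_to_triples(row_id, name, population):
    # Each row is minted a subject URI; each column becomes a predicate.
    s = f"http://example.org/council/{row_id}"
    return [
        (s, "http://example.org/vocab/name", name),
        (s, "http://example.org/vocab/population", population),
    ]

triples = []
for row in db.execute("SELECT id, name, population FROM council"):
    triples.extend(row_to_triples(*row))
print(triples)
```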
Triplestores: native RDF can be stored either as a graph in memory or within a native RDF triplestore, a database specifically designed for RDF graph structures. These days we can get computers with huge amounts of RAM at relatively low cost.
Native RDF databases, a.k.a. triplestores: RDF can be stored as native triples in SQL databases, using long, skinny three-column tables for the triples with various indexes (SPO, OSP, etc.). This is the approach taken with the Jena SDB datastore, but to some extent it was a naive/simplistic approach using tools like MySQL and Postgres that were readily available in the early 2000s. Subsequent work has focused on developing native RDF datastores that don’t use tables in the SQL sense but use node tables and indexes, where the focus has been on optimising both storage and search for the RDF model rather than making use of a more generalised data store. Examples include TDB, Sesame and Mulgara, all of which are Java applications, and 4Store, which is built in C and only easily compiled on Linux. Other approaches include column stores (e.g. Virtuoso and Vertica).

So, if you are looking for an easy way of installing and using a triplestore, what is the best approach? The Apache Jena TDB triplestore with the Joseki or Fuseki SPARQL endpoint is one option I’ve used a lot. Other simple options that I’ve had some experience of include Sesame, Mulgara, Bigdata, Virtuoso and 4Store, but this is not a definitive list, and each has an associated SPARQL-over-HTTP query option.
http://jena.apache.org/documentation/tdb/index.html
http://joseki.sourceforge.net/
http://jena.apache.org/documentation/serving_data/
http://www.openrdf.org/
http://www.mulgara.org/
http://www.systap.com/bigdata.htm
http://virtuoso.openlinksw.com/rdf-quad-store/
http://4store.org/

Choice of triplestore will depend on your OS options (e.g. 4Store is built from source, which is easiest on Linux), how much RDF you are storing (usually measured in millions/billions of triples), and the additional functions (e.g. geo indexing is available with Virtuoso, Parliament and a small number of others; Allegrograph has social networking stats functions built in). One advantage of triplestores over SQL stores is that transferring data from one to another is simply a matter of outputting triples from one and loading them into the other, so the risk of picking the ‘wrong one’ to start with has limited negative consequences.
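The SPO/OSP indexing idea mentioned above can be sketched with in-memory dictionaries (the data and pattern names are invented; native stores use far more compact on-disk structures):

```python
from collections import defaultdict

# A toy in-memory triple index in the SPO/OSP style; the example
# triples are invented. Native stores use compact node tables and
# on-disk indexes rather than Python dicts.
triples = [
    ("pet2", "species", "ferret"),
    ("pet2", "favouriteFood", "chicken"),
    ("pet3", "species", "ferret"),
]

spo = defaultdict(set)   # subject -> {(predicate, object)}
osp = defaultdict(set)   # object  -> {(subject, predicate)}
for s, p, o in triples:
    spo[s].add((p, o))
    osp[o].add((s, p))

# "Everything known about pet2" is answered from the SPO index...
print(sorted(spo["pet2"]))
# ...while "which subjects have the value 'ferret'?" uses OSP.
print(sorted(s for s, p in osp["ferret"]))
```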
There are VM images available for some of the geo-capable triplestores, a very useful resource.
http://blog.geoknow.eu/virtual-machines-of-geospatial-rdf-stores/
Linked Data APIs: when you have data in a triplestore you don’t want to just leave potential users with a SPARQL endpoint; it’s daunting and unhelpful to many potential users of your data. A Linked Data API is a much more pleasant decoration. A couple of examples include PublishMyData (mainly Ruby) and Elda (mainly Java); the links below give the code and live examples of each.
https://github.com/swirrl/publish_my_data
http://cofog01.data.scotland.gov.uk/doc/facility/AB0103
http://code.google.com/p/elda/
http://data.sepa.org.uk/doc/water/surfacewaters.html
A Linked Data API provides a faceted HTML view of your data and also helps resolve URIs that have the base URI at your site to some HTML page. For example, the identifier for Victoria Quay is http://cofog01.data.scotland.gov.uk/id/facility/AB0103. If you put this into your browser you get redirected to an HTML page about Victoria Quay: http://cofog01.data.scotland.gov.uk/doc/facility/AB0103.
APIs also help return RDF describing resources in different machine-readable formats, either by responding to the HTTP “Accept” header or by handling HTTP 303 redirects appropriately, e.g.: http://cofog01.data.scotland.gov.uk/id/facility/AB0103.nt returns NTriples and http://cofog01.data.scotland.gov.uk/id/facility/AB0103.rdf returns RDF/XML representations of the same resource.
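The Accept-header side of this can be sketched with Python’s standard library; the request is only constructed here, not sent, and the MIME type shown is one common choice for RDF/XML:

```python
from urllib.request import Request

# Asking for a specific serialisation via content negotiation.
# The URI is the Victoria Quay identifier from the example above;
# the request is built but deliberately not sent.
uri = "http://cofog01.data.scotland.gov.uk/id/facility/AB0103"
req = Request(uri, headers={"Accept": "application/rdf+xml"})

# A Linked Data API would answer this with a 303 redirect to the
# corresponding /doc/... document in the requested format.
print(req.get_header("Accept"))
```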
FluidOps Workbench: this is a hybrid tool that provides a wiki interface for creating new content but also enables the import of data from various sources into a local Sesame triplestore. An example of the Workbench with Wikipedia/DBpedia data is at http://iwb.fluidops.com/resource/Our_Dynamic_Earth
http://www.fluidops.com/information-workbench/
This shows a timeline & animated GIF for the development of the Linked Open Data web over the past few years
DBPedia is at its heart
...with very significant interlinkage with other datasets and a very healthy user base in some major projects
So that's the end of the tour
Here is an illustration of a federated SPARQL query that initially goes to the DBpedia endpoint <http://dbpedia.org/sparql> and finds landlocked countries; it then takes those country identifiers (?country) and goes to the World Bank endpoint <http://worldbank.270a.info/sparql> to find some more information about those countries.
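A sketch of what such a federated query could look like, held as a Python string ready to POST to the DBpedia endpoint. The predicate and category choices (dct:subject, dbc:Landlocked_countries, and the wildcard pattern inside the SERVICE block) are assumptions for illustration and may not match the live datasets exactly:

```python
# A sketch of the federated query described above; the predicate
# and category names are assumptions, not verified against the
# live endpoints. The query is only constructed here, not sent.
query = """
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dbc: <http://dbpedia.org/resource/Category:>

SELECT ?country ?p ?o WHERE {
  ?country dct:subject dbc:Landlocked_countries .
  SERVICE <http://worldbank.270a.info/sparql> {
    ?country ?p ?o .
  }
}
LIMIT 10
"""
# The outer pattern runs at DBpedia; the SERVICE block is forwarded
# to the World Bank endpoint with the ?country bindings.
print("SERVICE" in query)
```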