1. Consuming Linked Data Juan F. Sequeda Department of Computer Science University of Texas at Austin SemTech 2010
2. How many people are familiar with: RDF? SPARQL? Linked Data? Web Architecture (HTTP, etc.)?
3. History: Linked Data Design Issues by TimBL, July 2006; Linked Open Data Project, WWW2007; first LOD Cloud, May 2007; 1st Linked Data on the Web Workshop, WWW2008; 1st Triplification Challenge, 2008; How to Publish Linked Data Tutorial, ISWC2008; BBC publishes Linked Data, 2008; 2nd Linked Data on the Web Workshop, WWW2009; NY Times announcement, SemTech2009/ISWC2009; 1st Linked Data-a-thon, ISWC2009; 1st How to Consume Linked Data Tutorial, ISWC2009; Data.gov.uk publishes Linked Data, 2010; 2nd How to Consume Linked Data Tutorial, WWW2010; 1st International Workshop on Consuming Linked Data, COLD2010; …
17. The Modigliani Test: show me the locations of all the original paintings of Modigliani. Daniel Koller (@dakoller) showed that you can find this with a SPARQL query on DBpedia. Thanks Richard MacManus, ReadWriteWeb.
19. Results of the Modigliani Test: Atanas Kiryakov from Ontotext used LDSR (Linked Data Semantic Repository) over DBpedia, Freebase, Geonames, UMBEL, and WordNet. Published April 26, 2010: http://www.readwriteweb.com/archives/the_modigliani_test_for_linked_data.php
34. So what is the problem? We aren't always interested in documents; we are interested in THINGS. These THINGS might be in documents. We can read an HTML document rendered in a browser and find what we are searching for. This is hard for computers: computers have to guess (even though they are pretty good at it).
35. What do we need to do? Make it easy for computers/software to find THINGS
36. How can we do that? Besides publishing documents on the web, which computers can't understand easily, let's publish something that computers can understand.
49. Resource Description Framework (RDF): a data model, i.e. a way to model data (relational databases use the relational data model). RDF is a triple data model, a labeled graph: Subject, Predicate, Object.
<Juan> <was born in> <California>
<California> <is part of> <the USA>
<Juan> <likes> <the Semantic Web>
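To make the data model concrete, here is a minimal sketch (not any RDF library's API; the class and method names are invented for illustration) of what "a dataset is just a set of subject/predicate/object triples" means:

```java
import java.util.List;

public class TripleModel {
    // A triple: one statement about a thing. Names here are illustrative.
    static final class Triple {
        final String subject, predicate, object;
        Triple(String s, String p, String o) { subject = s; predicate = p; object = o; }
        @Override public String toString() {
            return "<" + subject + "> <" + predicate + "> <" + object + ">";
        }
    }

    // The three example statements from the slide, as a tiny graph.
    static List<Triple> exampleGraph() {
        return List.of(
            new Triple("Juan", "was born in", "California"),
            new Triple("California", "is part of", "the USA"),
            new Triple("Juan", "likes", "the Semantic Web"));
    }

    public static void main(String[] args) {
        exampleGraph().forEach(System.out::println);
    }
}
```

A real RDF store adds indexing and URI handling, but the underlying shape of the data is exactly this list of three-part statements.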
50. RDF can be serialized in different ways: RDF/XML, RDFa (RDF in HTML), N3, Turtle, JSON.
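For example, the triples from slide 49 could be serialized in Turtle roughly like this (the `ex:` namespace and property names are invented for illustration):

```turtle
@prefix ex: <http://example.org/> .

ex:Juan       ex:wasBornIn ex:California ;
              ex:likes     ex:SemanticWeb .
ex:California ex:isPartOf  ex:USA .
```

The same statements in RDF/XML or JSON carry identical content; only the syntax differs.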
51. So does that mean that I have to publish my data in RDF now?
55. Databases back up documents. THINGS have PROPERTIES: a book has a title, an author, … This is a THING: a book with the title “Programming the Semantic Web” by Toby Segaran, …
56. Let's represent the data in RDF:
<book> <title> "Programming the Semantic Web"
<book> <author> "Toby Segaran"
<book> <isbn> "978-0-596-15381-6"
<book> <publisher> <Publisher>
<Publisher> <name> "O'Reilly"
57. Remember that we are on the web Everything on the web is identified by a URI
58. And now let's link the data to other data:
<http://…/isbn978> <title> "Programming the Semantic Web"
<http://…/isbn978> <author> "Toby Segaran"
<http://…/isbn978> <isbn> "978-0-596-15381-6"
<http://…/isbn978> <publisher> <http://…/publisher1>
<http://…/publisher1> <name> "O'Reilly"
59. And now consider the data from Revyu.com:
<http://…/isbn978> <hasReview> <http://…/review1>
<http://…/review1> <description> "Awesome Book"
<http://…/review1> <reviewer> <http://…/reviewer>
<http://…/reviewer> <name> "Juan Sequeda"
60. Let's start to link data: Revyu's book URI is declared sameAs the publisher's book URI, so the review data (description, reviewer) and the book data (title, author, isbn, publisher) become one connected graph.
61. Juan Sequeda publishes data too:
<http://juansequeda.com/id> <name> "Juan Sequeda"
<http://juansequeda.com/id> <livesIn> <http://dbpedia.org/Austin>
62. Let's link more data: Revyu's reviewer URI is declared sameAs <http://juansequeda.com/id>, connecting the review data to Juan's own published data.
63. And more: with both sameAs links in place, the book data, the review data, and Juan's personal data form one connected graph spanning the publisher, Revyu.com, juansequeda.com, and DBpedia.
64. Data on the Web that is in RDF and is linked to other RDF data is LINKED DATA
65. Linked Data Principles: 1) Use URIs as names for things. 2) Use HTTP URIs so that people can look up (dereference) those names. 3) When someone looks up a URI, provide useful information. 4) Include links to other URIs so that they can discover more things.
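A minimal sketch of what "looking up" an HTTP URI can mean in practice: an HTTP GET request that asks for RDF via content negotiation. This uses the Java 11 `java.net.http` API, only constructs the request (sending it with `HttpClient` is left out so the sketch stays offline), and `application/rdf+xml` is just one common media type a client might request:

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class Dereference {
    // Build a GET request for the given URI, asking the server for an
    // RDF representation instead of an HTML page.
    static HttpRequest rdfRequest(String uri) {
        return HttpRequest.newBuilder(URI.create(uri))
                .header("Accept", "application/rdf+xml")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest r = rdfRequest("http://dbpedia.org/resource/Austin");
        System.out.println(r.uri() + " Accept: "
                + r.headers().firstValue("Accept").orElse(""));
    }
}
```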
67. I can query a database with SQL. Is there a way to query Linked Data with a query language?
68. Yes! There is actually a standardized language for that: SPARQL.
69. FIND all the reviews on the book “Programming the Semantic Web” by people who live in Austin
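Over the graph sketched in the previous slides, that request could be written as a SPARQL query roughly like the following (the prefixes and property URIs are invented for illustration; the property names mirror the diagram labels):

```sparql
PREFIX ex:  <http://example.org/vocab/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT ?review WHERE {
  ?book     ex:title       "Programming the Semantic Web" .
  ?book     ex:hasReview   ?review .
  ?review   ex:hasReviewer ?reviewer .
  ?reviewer owl:sameAs     ?person .
  ?person   ex:livesIn     <http://dbpedia.org/Austin> .
}
```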
70. The query is answered by traversing the linked graph: from the book with title "Programming the Semantic Web" to its reviews, from each review to its reviewer, and via the reviewer's sameAs link to <http://juansequeda.com/id>, whose data says he lives in <http://dbpedia.org/Austin>.
71. This looks cool, but let’s be realistic. What is the incentive to publish Linked Data?
72. What was your incentive to publish an HTML page in 1990?
73. 1) To share data in documents. 2) Because your neighbor was doing it.
79. Publishing Linked Data. Legacy data in relational databases: D2R Server, Virtuoso, Triplify, Ultrawrap. CMS: Drupal 7. Native RDF stores, i.e. databases for RDF (triple stores): AllegroGraph, Jena, Sesame, Virtuoso. Talis Platform (Linked Data in the cloud). In HTML with RDFa.
87. Google and Yahoo are starting to crawl RDFa! The Semantic Web is a reality!
88. The Reality: Yahoo is crawling data in RDFa and Microformats that uses specific vocabularies (FOAF, GoodRelations, …). Google is crawling RDFa and Microformats that use the Google vocabulary.
90. Linked Data Browsers: not actually separate browsers; they run inside HTML browsers and display the data returned after looking up a URI in tabular form. (IMO) the UI lacks usability.
103. SPARQL Endpoints Linked Data sources usually provide a SPARQL endpoint for their dataset(s) SPARQL endpoint: SPARQL query processing service that supports the SPARQL protocol* Send your SPARQL query, receive the result * http://www.w3.org/TR/rdf-sparql-protocol/
104. Where can I find SPARQL Endpoints?
DBpedia: http://dbpedia.org/sparql
MusicBrainz: http://dbtune.org/musicbrainz/sparql
U.S. Census: http://www.rdfabout.com/sparql
Semantic Crunchbase: http://cb.semsol.org/sparql
More at: http://esw.w3.org/topic/SparqlEndpoints
105. Accessing a SPARQL Endpoint. SPARQL endpoints are RESTful Web services; issuing a SPARQL query to a remote endpoint is basically an HTTP GET request to the endpoint with the parameter query set to the URL-encoded string of the SPARQL query:
GET /sparql?query=PREFIX+rd... HTTP/1.1
Host: dbpedia.org
User-agent: my-sparql-client/0.1
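Assembling such a request URL is plain string work; a sketch with the Java standard library (the endpoint address is DBpedia's from the slides, the query is abbreviated):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SparqlRequest {
    // Build the URL for a SPARQL protocol GET request: the query text
    // is URL-encoded and passed in the 'query' parameter.
    static String requestUrl(String endpoint, String sparqlQuery) {
        String encoded = URLEncoder.encode(sparqlQuery, StandardCharsets.UTF_8);
        return endpoint + "?query=" + encoded;
    }

    public static void main(String[] args) {
        System.out.println(requestUrl("http://dbpedia.org/sparql",
                "SELECT * WHERE { ?s ?p ?o } LIMIT 10"));
    }
}
```

Fetching the resulting URL with any HTTP client then returns the query results.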
106. Query Results Formats SPARQL endpoints usually support different result formats: XML, JSON, plain text (for ASK and SELECT queries) RDF/XML, NTriples, Turtle, N3 (for DESCRIBE and CONSTRUCT queries)
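For example, a SELECT result with one variable and a single solution looks roughly like this in the SPARQL query results JSON format (the binding value here is invented):

```json
{
  "head":    { "vars": [ "name" ] },
  "results": { "bindings": [
    { "name": { "type": "literal", "value": "Juan Sequeda" } }
  ] }
}
```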
110. Query Result Formats. Use the Accept header to request the preferred result format:
GET /sparql?query=PREFIX+rd... HTTP/1.1
Host: dbpedia.org
User-agent: my-sparql-client/0.1
Accept: application/sparql-results+json
111. Query Result Formats. As an alternative, some SPARQL endpoint implementations (e.g. Joseki) provide an additional parameter out:
GET /sparql?out=json&query=... HTTP/1.1
Host: dbpedia.org
User-agent: my-sparql-client/0.1
112. Accessing a SPARQL Endpoint. More convenient: use a library.
SPARQL JavaScript Library: http://www.thefigtrees.net/lee/blog/2006/04/sparql_calendar_demo_a_sparql.html
ARC for PHP: http://arc.semsol.org/
RAP (RDF API for PHP): http://www4.wiwiss.fu-berlin.de/bizer/rdfapi/index.html
114. Accessing a SPARQL Endpoint. Example with Jena/ARQ:
import com.hp.hpl.jena.query.*;

String service = "..."; // address of the SPARQL endpoint
String query = "SELECT ...";  // your SPARQL query
QueryExecution e = QueryExecutionFactory.sparqlService(service, query);
ResultSet results = e.execSelect();
while ( results.hasNext() ) {
  QuerySolution s = results.nextSolution();
  // ...
}
e.close();
115. Querying a single dataset is quite boring compared to issuing SPARQL queries over multiple datasets. How can you do this? Issue follow-up queries to different endpoints; query a central collection of datasets; build a store with copies of relevant datasets; or use a query federation system.
116. Follow-up Queries. Idea: issue follow-up queries over other datasets based on results from previous queries, by substituting placeholders in query templates.
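The placeholder substitution itself is plain string formatting; a minimal sketch using a template like the deck's rdfs:comment example (the URI passed in is illustrative):

```java
public class FollowUpTemplate {
    // Query template with a %s placeholder for a URI discovered
    // in a previous query's results.
    static final String TEMPLATE =
        "SELECT ?c WHERE { <%s> rdfs:comment ?c }";

    // Instantiate the template for one discovered URI.
    static String instantiate(String uri) {
        return String.format(TEMPLATE, uri);
    }

    public static void main(String[] args) {
        System.out.println(instantiate("http://dbpedia.org/resource/Austin"));
    }
}
```

Each URI bound in the first query's result set yields one concrete follow-up query sent to the second endpoint.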
117.
String s1 = "http://cb.semsol.org/sparql";
String s2 = "http://dbpedia.org/sparql";
String qTmpl = "SELECT ?c WHERE { <%s> rdfs:comment ?c }";
String q1 = "SELECT ?s WHERE { ..."; // find a list of companies filtered by some criteria, returning DBpedia URIs for them

QueryExecution e1 = QueryExecutionFactory.sparqlService(s1, q1);
ResultSet results1 = e1.execSelect();
while ( results1.hasNext() ) {
  QuerySolution sol = results1.nextSolution();
  String q2 = String.format( qTmpl, sol.getResource("s").getURI() );
  QueryExecution e2 = QueryExecutionFactory.sparqlService(s2, q2);
  ResultSet results2 = e2.execSelect();
  while ( results2.hasNext() ) {
    // ...
  }
  e2.close();
}
e1.close();
118. Follow-up Queries Advantage Queried data is up-to-date Drawbacks Requires the existence of a SPARQL endpoint for each dataset Requires program logic Very inefficient
119. Querying a Collection of Datasets. Idea: use an existing SPARQL endpoint that provides access to a set of copies of relevant datasets. Examples: endpoints over a majority of datasets from the LOD cloud at http://uberblic.org and http://lod.openlinksw.com/sparql
120. Querying a Collection of Datasets Advantage: No need for specific program logic Drawbacks: Queried data might be out of date Not all relevant datasets in the collection
121. Own Store of Dataset Copies. Idea: build your own store with copies of relevant datasets and query it. Possible stores:
Jena TDB http://jena.hpl.hp.com/wiki/TDB
Sesame http://www.openrdf.org/
OpenLink Virtuoso http://virtuoso.openlinksw.com/
4store http://4store.org/
AllegroGraph http://www.franz.com/agraph/
etc.
122. Populating Your Store. Get the RDF dumps provided for the datasets, or do (focused) crawling. ldspider http://code.google.com/p/ldspider/ : multithreaded API for focused crawling; crawling strategies (breadth-first, load-balancing); flexible configuration with callbacks and hooks.
123. Own Store of Dataset Copies Advantages: No need for specific program logic Can include all datasets Independent of the existence, availability, and efficiency of SPARQL endpoints Drawbacks: Requires effort to set up and to operate the store Ideally, data sources provide RDF dumps; if not? How to keep the copies in sync with the originals? Queried data might be out of date
124. Federated Query Processing. Idea: query a mediator which distributes sub-queries to the relevant sources and integrates the results.
125. Federated Query Processing. Instance-based federation: each thing is described by only one data source; untypical for the Web of Data. Triple-based federation: no restrictions, but requires more distributed joins. Statistics about the datasets are required in both cases.
126. Federated Query Processing.
DARQ (Distributed ARQ) http://darq.sourceforge.net/ : query engine for federated SPARQL queries; extension of ARQ (the query engine for Jena); last update June 28, 2006.
Semantic Web Integrator and Query Engine (SemWIQ) http://semwiq.sourceforge.net/ : actively maintained.
127. Federated Query Processing Advantages: No need for specific program logic Queried data is up to date Drawbacks: Requires the existence of a SPARQL endpoint for each dataset Requires effort to set up and configure the mediator
128. In any case: You have to know the relevant data sources When developing the app using follow-up queries When selecting an existing SPARQL endpoint over a collection of dataset copies When setting up your own store with a collection of dataset copies When configuring your query federation system You restrict yourself to the selected sources
129. There is an alternative: remember, URIs link to data.
130. Automated Link Traversal Idea: Discover further data by looking up relevant URIs in your application Can be combined with the previous approaches
131. Link Traversal Based Query Execution Applies the idea of automated link traversal to the execution of SPARQL queries Idea: Intertwine query evaluation with traversal of RDF links Discover data that might contribute to query results during query execution Alternately: Evaluate parts of the query Look up URIs in intermediate solutions
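As a toy illustration of this intertwined evaluate/look-up loop, the sketch below simulates the Web as an in-memory map from URIs to the triples their documents contain, evaluates one triple pattern, and then "dereferences" the URIs found in the intermediate solutions to discover the data needed for the next pattern. All URIs and predicates are invented; a real engine (e.g. SWClLib/SQUIN) does this with actual HTTP look-ups:

```java
import java.util.*;

public class LinkTraversalSketch {
    // The simulated Web: each URI resolves to a document whose triples
    // are stored as String[]{subject, predicate, object}.
    static final Map<String, List<String[]>> WEB = Map.of(
        "http://ex.org/book", List.of(
            new String[]{"http://ex.org/book", "hasReview", "http://ex.org/review1"}),
        "http://ex.org/review1", List.of(
            new String[]{"http://ex.org/review1", "description", "Awesome Book"}));

    // Evaluate the two-pattern query
    //   <start> hasReview ?review . ?review description ?text
    // looking up URIs as they appear in intermediate solutions.
    static List<String> reviewTexts(String startUri) {
        List<String> texts = new ArrayList<>();
        for (String[] t1 : WEB.getOrDefault(startUri, List.of())) {
            if (!t1[1].equals("hasReview")) continue;
            String reviewUri = t1[2];  // intermediate solution
            // Traverse the link: dereference the review URI for more data.
            for (String[] t2 : WEB.getOrDefault(reviewUri, List.of())) {
                if (t2[1].equals("description")) texts.add(t2[2]);
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        System.out.println(reviewTexts("http://ex.org/book"));
    }
}
```

Note how no data source is fixed in advance: the second document is found only because the first one links to it.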
142. Link Traversal Based Query Execution Advantages: No need to know all data sources in advance No need for specific programming logic Queried data is up to date Does not depend on the existence of SPARQL endpoints provided by the data sources Drawbacks: Not as fast as a centralized collection of copies Unsuitable for some queries Results might be incomplete (do we care?)
143. Implementations Semantic Web Client library (SWClLib) for Java http://www4.wiwiss.fu-berlin.de/bizer/ng4j/semwebclient/ SWIC for Prolog http://moustaki.org/swic/
144. Implementations. SQUIN http://squin.org : provides the SWClLib functionality as a Web service, accessible like a SPARQL endpoint. Install package: unzip and start, in less than 5 minutes! Convenient access with the SQUIN PHP tools:
$s = 'http:// ...'; // address of the SQUIN service
$q = new SparqlQuerySock( $s, '... SELECT ...' );
$res = $q->getJsonResult(); // or getXmlResult()
147. What is a Linked Data application? A software system that makes use of data on the web from multiple datasets and that benefits from links between the datasets.
149. Discover further information by following the links between different data sources: this is what the fourth principle enables.
150. Combine the consumed Linked Data with data from other sources (not necessarily Linked Data).
151. Expose the combined data back to the web following the Linked Data principles
153. Hot Research Topics Interlinking Algorithms Provenance and Trust Dataset Dynamics UI Distributed Query Evaluation “You want a good thesis? IR is based on precision and recall. The minute you add semantics, it is a meaningless feature. Logic is based on soundness and completeness. We don’t want soundness and completeness. We want a few good answers quickly.” – Jim Hendler at WWW2009 during the LOD gathering Thanks Michael Hausenblas
154. THANKS Juan Sequeda www.juansequeda.com @juansequeda #cold www.consuminglinkeddata.org Acknowledgements: Olaf Hartig, Patrick Sinclair, Jamie Taylor Slides for Consuming Linked Data with SPARQL by Olaf Hartig