The proliferation of heterogeneous Linked Data on the Web requires data management systems to constantly improve their scalability and efficiency. Despite recent advances in distributed Linked Data management, efficiently processing large amounts of Linked Data in a scalable way is still very challenging. In spite of their seemingly simple data models, Linked Data actually encode rich and complex graphs mixing both instance and schema level data. At the same time, users are increasingly interested in investigating or visualizing large collections of online data by performing complex analytic queries. The heterogeneity of Linked Data on the Web also poses new challenges to database systems. The capacity to store, track, and query provenance data is becoming a pivotal feature of Linked Data Management Systems. In this thesis, we tackle issues revolving around processing queries on big, unstructured, and heterogeneous Linked Data graphs.
8. Web of Linked Data
[Figure: several “thing” nodes connected by links; links define relationships between things]
➢ a global database
➢ design for machines
➢ links between things
➢ explicit semantics
9. How Does it Change our Life?
➢ a company is moving to another city
○ aggregated information on taxes, prices, salaries, unemployment, climate
➢ I’m moving to another country
○ mobile providers, bank accounts
➢ you’re buying a house
○ crime statistics, weather, house prices, neighbourhood,
traffic information
➢ media annotation
○ video annotated with Linked Data always retrieves an up-to-date bio of the speaker
17. Linked Data
12-05-30 18
➢ describes a method of publishing semi-structured data on the Web
➢ data can be interlinked
➢ builds upon standard Web technologies
➢ extends the standards to share information in a way that
it can be read automatically by computers
18. Big Data
Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
19. Linked Data is Big Data
➢ Volume: data size growing exponentially
➢ Velocity: streams of data from the Internet
of Things Cloud
➢ Variety
○ semi-structured data
○ heterogeneous linked collections of data
21. Data Provenance
“Provenance is information about
entities, activities, and people involved
in producing a piece of data or thing, which can be used to form
assessments about its quality, reliability or trustworthiness.”
Which pieces of data were combined, and how, to produce the results?
23. How to efficiently store and query vast amounts
of Linked Data in the cloud?
➢ a new physiological data partitioning algorithm to efficiently and effectively partition the graph and co-locate related instances in partitions
➢ a new system architecture for handling fine-grained
partitions at scale
➢ novel data placement techniques to co-locate
semantically related pieces of data
➢ new data loading and query execution strategies
taking advantage of our system’s data partitions and
indices
24. How to store and track provenance in Linked
Data processing?
➢ a new way to express the provenance of query
results at two different granularity levels by leveraging
the concept of provenance polynomials
➢ two new storage models to represent provenance
data in a data store compactly
➢ query execution strategies to derive the
provenance polynomials while executing the queries
25. How can we efficiently support queries tailored
with provenance information?
➢ a characterization of provenance-enabled queries,
that is, queries tailored with provenance data
➢ five provenance-oriented query execution strategies
➢ storage model and indexing techniques to handle
provenance-aware query execution strategies
26. Contributions
➢ new advanced data co-location and partitioning techniques for efficient and scalable query processing in the Cloud
➢ first efficient provenance-aware database for
Linked Data
28. URI: A Uniform Resource Identifier
“A Uniform Resource Identifier (URI) provides a
simple and extensible means for identifying a
resource.” -- RFC 3986
Some URIs for “real world” things:
https://www.linkedin.com/in/mwylot
http://dbpedia.org/page/Fribourg
http://www.geonames.org/2657895
29. RDF Data Model
➢ Standard model for data interchange on the
Web
➢ Statements about resources/things (triples)
○ Subject(URI) Predicate(URI) Object(URI) .
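As a minimal sketch of the triple structure above, a statement can be modelled as a three-field record and serialized in an N-Triples-like form. The `Triple` class, the `to_ntriples` helper, and the `example.org` URIs are hypothetical names chosen for illustration; they do not come from the slides.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject, predicate, object (all URIs in this sketch)."""
    subject: str
    predicate: str
    obj: str


def to_ntriples(t: Triple) -> str:
    """Serialize a triple of URIs as '<s> <p> <o> .'."""
    return f"<{t.subject}> <{t.predicate}> <{t.obj}> ."


# Hypothetical example data (example.org URIs are placeholders).
statement = Triple(
    "http://example.org/article1",
    "http://example.org/type",
    "http://example.org/article",
)
line = to_ntriples(statement)
print(line)
```

The sketch only covers URI objects, matching the slide; literal values would need a different serialization.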
30. SPARQL Query Language
➢ query and manipulate RDF graph content
➢ required and optional graph patterns along with their conjunctions and disjunctions
➢ aggregation, subqueries, negation, creating values by
expressions, extensible value testing, and constraining
queries by source graph
➢ results can be result sets or graphs
SELECT ?t WHERE {
?a <type> <article> .
?a <tag> <Obama> .
?a <title> ?t . }
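To illustrate how the three triple patterns in this query are joined on the shared variable `?a`, here is a small sketch of basic graph pattern matching over an in-memory list of triples. It is a naive nested-loop evaluation written for clarity, not how a SPARQL engine (or the system presented here) executes queries; the data and the helper names `match_pattern` and `evaluate` are assumptions.

```python
def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`; return None on mismatch."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new


def evaluate(patterns, triples):
    """Join all patterns: every pattern must match under one consistent binding."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for b in bindings
            for t in triples
            if (extended := match_pattern(pattern, t, b)) is not None
        ]
    return bindings


# Hypothetical data mirroring the query on the slide.
triples = [
    ("a1", "type", "article"), ("a1", "tag", "Obama"), ("a1", "title", "Title One"),
    ("a2", "type", "article"), ("a2", "tag", "Sports"), ("a2", "title", "Title Two"),
]
query = [("?a", "type", "article"), ("?a", "tag", "Obama"), ("?a", "title", "?t")]
titles = [b["?t"] for b in evaluate(query, triples)]
print(titles)
```

Only `a1` satisfies all three patterns, so only its title is returned.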
31. Outline
➢ Linked Data Management System
➢ Storing and Tracing Provenance
➢ Querying Provenance Information
32. Diplodocus
A new distributed Linked Data
management system implementing
a novel hybrid storage model
based on flexible RDF templates.
35. Outline
➢ Linked Data Management System
➢ Storing and Tracing Provenance
➢ Querying Provenance Information
36. Physical Storage Models
Differences:
➢ ease of implementation
➢ memory consumption
➢ query execution
➢ interference with the original concept of molecule
1) SPOL 2) LSPO 3) SLPO 4) SPLO
S - Subject
P - Predicate
O - Object
L - graph label, context value
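The four orderings differ in where the graph label L sits relative to S, P, and O. The sketch below nests the same quads into dictionaries keyed in a chosen order, which shows why the ordering affects lookups: with SPOL the sources of a given statement are found at the leaves, while with LSPO all data of one source is grouped under a single key. This is an illustrative in-memory structure under my own assumptions, not the physical layout used by the system; `build_index` and the sample quads are hypothetical.

```python
def build_index(quads, order):
    """Nest quads (s, p, o, l) into dicts keyed in the given order, e.g. "LSPO".

    The last component of the order is stored in a set at the leaves.
    """
    positions = {"S": 0, "P": 1, "O": 2, "L": 3}
    index = {}
    for quad in quads:
        keys = [quad[positions[c]] for c in order]
        node = index
        for key in keys[:-2]:
            node = node.setdefault(key, {})
        node.setdefault(keys[-2], set()).add(keys[-1])
    return index


# Hypothetical quads: (subject, predicate, object, graph label).
quads = [
    ("a1", "tag", "Obama", "src1"),
    ("a1", "title", "Title One", "src1"),
    ("a1", "tag", "Obama", "src2"),
]
spol = build_index(quads, "SPOL")
lspo = build_index(quads, "LSPO")
print(spol["a1"]["tag"]["Obama"])  # labels of the sources stating this triple
print(sorted(lspo))                # top-level keys are the graph labels
```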
37. Provenance Polynomials
➢ Characterize the ways each source contributed
➢ Pinpoint the exact source of each result
➢ Trace back the list of sources and the way they were combined to deliver a result
Geerts, Floris, et al. "Algebraic structures for capturing the provenance of SPARQL queries." Proceedings of the 16th International Conference on Database Theory. ACM, 2013.
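A rough sketch of how such a polynomial might be assembled for one result: alternative sources for the same triple pattern are combined with ⊕, and the patterns that were joined to produce the result are combined with ⊗. The operator symbols follow a common convention and are an assumption here, as are the source labels `l1` to `l4` and the `polynomial` helper; the slides do not spell out the notation.

```python
def polynomial(sources_per_pattern):
    """Build a provenance polynomial string for one query result.

    Each element lists the sources that matched one triple pattern:
    ⊕ separates alternative sources, ⊗ separates joined patterns.
    """
    factors = []
    for sources in sources_per_pattern:
        term = " ⊕ ".join(sorted(sources))
        factors.append(f"({term})" if len(sources) > 1 else term)
    return " ⊗ ".join(factors)


# Hypothetical lineage: pattern 1 matched in l1 or l2, pattern 2 in l3,
# pattern 3 in l1 or l4.
lineage = [{"l1", "l2"}, {"l3"}, {"l1", "l4"}]
result = polynomial(lineage)
print(result)
```

Reading the output from left to right gives both the list of sources and the way they were combined.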
39. Findings
➢ the overhead of tracing provenance is considerable but acceptable, about 60-70% on average
➢ the most suitable storage model depends on data and workload characteristics
40. Outline
➢ Linked Data Management System
➢ Storing and Tracing Provenance
➢ Querying Provenance Information
41. Provenance-Enabled Query
A Workload Query is a query producing results a user is
interested in. These results are referred to as workload
query results.
A Provenance Query is a query that selects a set of data
from which the workload query results should originate.
A Provenance-Enabled Query is a pair consisting of a
Workload Query and a Provenance Query, producing results
a user is interested in (as specified by the Workload Query)
and originating only from data pre-selected by the
Provenance Query.
42. Provenance-Enabled Query: Example
SELECT ?t WHERE {
?a <type> <article> .
?a <tag> <Obama> .
?a <title> ?t . }
➢ ensure that the articles come from sources attributed to the government
SELECT ?ctx WHERE {
?ctx prov:wasAttributedTo <government> . }
➢ ensure that the data used to produce the answer was associated with a “SeniorEditor” and a “Manager”
SELECT ?ctx WHERE {
?ctx prov:wasGeneratedBy <articleProd>.
<articleProd> prov:wasAssociatedWith ?ed .
?ed rdf:type <SeniorEditor> .
<articleProd> prov:wasAssociatedWith ?m .
?m rdf:type <Manager> . }
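The pairing of a Provenance Query with a Workload Query can be sketched as follows: first select the contexts attributed to the government, then answer the workload query using only quads from those contexts. This simple pre-filtering approach is an assumption made for illustration and is not necessarily one of the execution strategies evaluated in the thesis; the data and the function name `provenance_enabled_query` are hypothetical.

```python
def provenance_enabled_query(quads, prov_triples, agent):
    """Return article titles tagged Obama, using only data attributed to `agent`."""
    # Provenance Query: contexts attributed to the given agent.
    contexts = {
        s for (s, p, o) in prov_triples
        if p == "prov:wasAttributedTo" and o == agent
    }
    # Keep only triples whose context was selected by the Provenance Query.
    scoped = [(s, p, o) for (s, p, o, ctx) in quads if ctx in contexts]
    # Workload Query: ?a <type> <article> . ?a <tag> <Obama> . ?a <title> ?t .
    articles = {s for (s, p, o) in scoped if p == "type" and o == "article"}
    tagged = {s for (s, p, o) in scoped if p == "tag" and o == "Obama"}
    matching = articles & tagged
    return sorted(o for (s, p, o) in scoped if p == "title" and s in matching)


# Hypothetical quads (subject, predicate, object, context) and provenance data.
quads = [
    ("a1", "type", "article", "ctx1"), ("a1", "tag", "Obama", "ctx1"),
    ("a1", "title", "Title One", "ctx1"),
    ("a2", "type", "article", "ctx2"), ("a2", "tag", "Obama", "ctx2"),
    ("a2", "title", "Title Two", "ctx2"),
]
prov_triples = [
    ("ctx1", "prov:wasAttributedTo", "government"),
    ("ctx2", "prov:wasAttributedTo", "blog"),
]
titles = provenance_enabled_query(quads, prov_triples, "government")
print(titles)
```

Both articles match the workload query, but only the one whose context is attributed to the government survives the provenance filter.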
45. Results
Queries tailored with provenance
information can be executed faster due
to the selectivity of provenance
information.
46. Lessons Learnt
➢ there is room for further improvement in Linked Data
management
➢ co-location of related entities is the right way
➢ provenance overhead does not have to be high
➢ we can leverage provenance information to improve
performance
48. Summary
➢ new advanced data co-location and partitioning techniques for efficient query processing
➢ cloud support for scalable query processing
➢ first efficient provenance-aware Linked Data
management system
➢ source code available online