This presentation addresses the main issues of Linked Data and scalability. In particular, it gives details on approaches and technologies for clustering, distributing, sharing, and caching data. Furthermore, it addresses the means for publishing data through cloud deployment and the relationship between Big Data and Linked Data, exploring how some of the Big Data solutions can be transferred to the context of Linked Data.
2. [Figure: EUCLID architecture diagram. Recoverable labels: EUCLID Objective; Application; Analysis & Mining Module; Visualization Module; LD Dataset; Access; SPARQL Endpoint; Publishing; RDFa; Integrated Dataset; Vocabulary Mapping; Interlinking; Cleansing; Physical Wrapper; LD Wrapper; R2R Transf.; RDF/XML; Data acquisition; Streaming providers; Downloads; Musical Content; Metadata; Other content]
EUCLID – Scaling up Linked Data
3. Motivation: Music!
• Our aim: build a music-based portal using Linked Data technologies
• So far (CH 1, CH 2, CH 3 and CH 5), we have studied different mechanisms for:
– Linked Data management via SPARQL queries
– Reasoning over Linked Data
– Linked Data access (RDF dumps, endpoints, RDFa)
– Linked Data storage in repositories
• In this chapter, we will study current research and technologies to scale up to very large volumes of Linked Data
4. Agenda
1. Introduction to Big (Linked) Data
2. NoSQL databases for Linked Data
3. Hadoop for Linked Data
4. Stream processing for Linked Data
5. … and more
6. Introduction to Big Data
• Big Data: management of data which is “too complex” to be processed with traditional solutions
• Big does not stand primarily for size, but serves as an analogy for “overwhelming”
• Big can mean “high variety”, “high volume” or “high velocity”
7. The 3 Vs of Big Data
• Variety: different forms of data
• Volume: petabytes of data
• Velocity: real-time data streams
8. The 3 Vs of Big Data
• Variety
– Data characteristic: structured, semi-structured and unstructured
– Challenge: data integration
– Solution: semantic technologies are a good fit
• Volume
– Data characteristic: large volumes of data
– Challenge: reasoning and querying
– Solution: distributed storage & processing, parallel processing
• Velocity
– Data characteristic: streams, sensors, near real-time data, IoT
– Challenge: reasoning & querying
– Solution: stream reasoning & querying
9. The Extended Vs of Big Data
Variety
Volume
Velocity
• Veracity: Uncertainty of the data
• Variability: Variation in meaning in different contexts
• Value: turning data into information into insight
• Not easy to measure
• Depend on context and intended use
• Linked Data & Semantic Technologies can help
11. Beyond Big Data (2)
Semantic Technologies
Semantic technologies extract meaning from data, ranging from quantitative
data and text, to video, voice and images. Many of these techniques have
existed for years and are based on advanced statistics, data mining, machine
learning and knowledge management. One reason they are garnering more
interest is the renewed business requirement for monetizing information as a
strategic asset. Even more pressing is the technical need. Increasing volumes,
variety and velocity — big data — in IM and business operations, requires
semantic technology that makes sense out of data for humans, or
automates decisions
Source: Gartner Inc. “Gartner Identifies Top Technology Trends Impacting Information
Infrastructure in 2013”
12. Towards Big Linked Data
• Variety: this characteristic is the most inherent to Linked Data
– Agile data model
– Different vocabularies
• Volume
– [Figure: timeline labelled 2007, 2008, 2009, 2010, 2011]
• Velocity
– RDF Streams
– Semantic Sensors
14. Big Linked Data & Linked Big Data
• Big Linked Data
– Exponential growth of Linked Data in the last five years
– Big Data approach adopted by the Linked Data community, especially to handle Volume and Velocity
• Linked Big Data
– Linked Data approach adopted by the Big Data community
– RDF data model for Variety
– Enrich Big Data with metadata and semantics
– Interlink Big Data sets & reduce duplication
– Simplify data access, discovery & integration
Source: M. Dimitrov. “Semantic Technologies for Big Data”
16. RDF Databases
• Native or RDBMS based RDF databases
– OWLIM (http://www.ontotext.com/owlim)
– Virtuoso Universal Server (http://virtuoso.openlinksw.com/)
– Stardog (http://stardog.com)
– AllegroGraph (http://www.franz.com/agraph/allegrograph/)
– Systap Bigdata (http://www.systap.com/)
– Jena TDB (http://jena.apache.org/documentation/tdb/)
– Oracle, DB2
17. RDF Database Advantages
• RDF (graph) based data model
– Global identifiers of resources/entities
– Agile schema
• Inference of implicit facts
– Forward, backward, hybrid reasoning strategy
• Expressive query language (SPARQL)
• Compliance with standards
18. NoSQL Databases
• “Not Only SQL”
• A group of database technologies which do not
follow the relational data model
• Typical requirements
– Distributed
– High availability
– Handle big data & query volumes (scalability)
– Hierarchical or graph data structures
– Flexible schema
19. NoSQL Taxonomy
• Key/value stores
– Each key is associated with a value (DHT)
– Conceptual structure: Key → Value
• Wide-column stores
– Each key is associated with many attributes, columns are stored together
– Conceptual structure:
  Artist       Album      Song
  The Beatles  Let it be  Get back
  Queen        Jazz       Fun it
• Document databases
– Each key is associated with a complex data structure
– Conceptual structure: Key → Structured document
• Graph databases
– Data is represented as nodes and edges
– Conceptual structure: Data, Relationship, Data
20. Key/Value Stores
• Efficient key/value lookups
• Schema-less
• Simpler read/write operations
– Low latency & high throughput
• Examples
– DynamoDB, Azure Table Storage, Riak, Redis, MemcacheDB,
Voldemort
21. Wide-Column Stores
• A key is associated with several attributes
• Data in the same column is stored together
• Efficient for complex aggregations over data
• Schema-less / dynamic schema
• Easy to add new columns
• Columns can be grouped together (column family)
• Examples:
– HBase (http://hbase.apache.org)
– Cassandra (http://cassandra.apache.org)
  Artist       Album      Song
  The Beatles  Let it be  Get back
  Queen        Jazz       Fun it
22. HBase
• Open source column-oriented store
• Based on Google’s BigTable
• Built on top of HDFS and Hadoop
• Horizontally scalable, automatic sharding
• High availability / automatic failover
• Strongly consistent reads/writes
• Java/REST API
23. Document Databases
• Each key associated with a complex data structure
(document)
• Documents can contain key/value pairs, key/array
pairs, or even nested structures
• Schema-less / dynamic schema
– New fields can be easily added to the document structure
• Typical document formats
– JSON, XML
• Examples:
– Couchbase (http://www.couchbase.com)
– MongoDB (http://www.mongodb.org)
25. Couchbase
• Document-oriented database
– Documents are stored as JSON
• Flexible schema
– Document structure easy to change
• Optimised to run in-memory and on several
nodes
– Ejection and eventual persistence
• Incremental views & indexes
• Scalability, rebalancing, replication, failover
• RESTful API
26. Graph Databases
Motivation
Graphs: Representation of highly connected data
[Figures: network of friends in a high school; relationships among artists in Last.fm (http://sixdegrees.hu/last.fm/); a fragment of Facebook; relationships between tweets]
27. Graph Databases
• Based on the property graph model
• Support for query languages and core graph-based
tasks
– reachability, traversal, adjacency and pattern matching
• Examples
– Neo4j (http://neo4j.org)
– Dex (http://sparsity-technologies.com/dex.php)
– HyperGraphDB (http://www.hypergraphdb.org)
28. Graph Databases
Example: Property Graph Model
[Figure: property graph. Artist nodes: The Beatles (Homepage: thebeatles.com, Origin: Liverpool) and Elvis Presley (Fullname: Elvis Aaron Presley, Homepage: elvis.com, Origin: Memphis). Album nodes such as Let it be, Revolver and Help! carry Year and Duration properties. "created" edges link artists and albums.]
• Nodes and edges may have properties
• Properties: Key-value pairs
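The property graph model above can be sketched in a few lines of Python. This is only an illustration, not the storage format of any particular graph database: the class names `Node` and `Edge` and the helper `neighbours` are hypothetical, and the property values are taken loosely from the slide's example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    # Properties are plain key-value pairs attached to the node.
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: Node
    relation: str
    target: Node
    # Edges may carry key-value properties as well.
    properties: dict = field(default_factory=dict)


beatles = Node("The Beatles", {"Homepage": "thebeatles.com", "Origin": "Liverpool"})
revolver = Node("Revolver", {"Year": 1966})
edges = [Edge(beatles, "created", revolver)]


def neighbours(node, relation, edges):
    """Follow all outgoing edges of the given type (a one-step traversal)."""
    return [e.target for e in edges if e.source is node and e.relation == relation]


created = [n.label for n in neighbours(beatles, "created", edges)]
```

Traversal, adjacency and reachability queries are built from repeated applications of a step like `neighbours`.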
29. Neo4j
• Graph database
– Nodes, Relationships, Properties, Paths
– Indexes over properties
• Flexible schema
• Cypher graph query language
• ACID transactions
• High availability, distributed clusters
• RESTful and Java APIs
30. Rya
• RDF store based on Accumulo
– Column-store, HDFS
– Sesame query parser, SAIL
implementation
• 3 table index
– SPO, POS, OSP
– Sufficient for all triple patterns
– All triple parts (S, P, O) encoded in
the RowID
– Clustered index
Source: R. Punnoose, A. Crainiceanu, D. Rapp “Rya: A Scalable RDF Triple Store for the Clouds”
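Why three orderings are sufficient for all triple patterns can be illustrated with a small in-memory sketch. It is not Rya's actual row encoding: sorted Python lists stand in for the Accumulo tables, and the separator `SEP` and the functions `build_indexes`, `choose_index` and `match` are hypothetical names introduced here. The point is that for every combination of bound S, P and O there is one ordering in which the bound parts form a row prefix, so the pattern becomes a single range scan.

```python
import bisect

# Position of S, P, O inside each index's row key.
ORDERS = {"SPO": (0, 1, 2), "POS": (1, 2, 0), "OSP": (2, 0, 1)}
SEP = "\x00"  # assumed separator; must not occur inside terms


def build_indexes(triples):
    """Encode every triple into one sorted row key per ordering."""
    return {name: sorted(SEP.join(t[i] for i in order) + SEP for t in triples)
            for name, order in ORDERS.items()}


def choose_index(s, p, o):
    """Pick the ordering in which the bound parts form a prefix."""
    if s is not None and (p is not None or o is None):
        return "SPO"   # S, S+P or S+P+O bound
    if p is not None:
        return "POS"   # P or P+O bound
    if o is not None:
        return "OSP"   # O or O+S bound
    return "SPO"       # nothing bound: full scan


def match(indexes, s=None, p=None, o=None):
    pattern = (s, p, o)
    name = choose_index(s, p, o)
    order = ORDERS[name]
    bound = []
    for i in order:
        if pattern[i] is None:
            break
        bound.append(pattern[i])
    prefix = "".join(part + SEP for part in bound)
    rows = indexes[name]
    start = bisect.bisect_left(rows, prefix)  # start of the range scan
    results = []
    for row in rows[start:]:
        if not row.startswith(prefix):
            break  # left the contiguous range sharing the prefix
        parts = row.split(SEP)[:3]
        triple = [None, None, None]
        for pos, i in enumerate(order):
            triple[i] = parts[pos]
        results.append(tuple(triple))
    return results


triples = [("beatles", "created", "revolver"),
           ("beatles", "created", "help"),
           ("elvis", "origin", "memphis")]
indexes = build_indexes(triples)
```

For example, `match(indexes, p="created")` is answered from the POS ordering, while `match(indexes, s="beatles", o="help")` is answered from OSP.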
31. Rya (2)
• Query processing
– Sesame (SPARQL) query plan translated to Accumulo range
scans & lookups
– Parallel scans for joins (x10-20 speedup)
– Batch scans (Accumulo) to reduce number of range scans
– Statistics for triple patterns selectivity, query re-ordering
• Performance evaluation (LUBM)
– No significant degradation when data grows by 2-3 orders
of magnitude
Source: R. Punnoose, A. Crainiceanu, D. Rapp “Rya: A Scalable RDF Triple Store for the Clouds”
32. “NoSQL Databases for RDF: An
Empirical Evaluation”
• Goal
– Store RDF data in HBase, Couchbase, Hive & Cassandra
– Benchmark query performance against a native
distributed RDF database (4store)
• HBase prototype
– Jena for SPARQL queries
– 3 index tables (SPO, POS, OSP)
– Row key encodes S+P+O, cells are empty
– Jena query plan translated to HBase filters & lookups
Source: Cudre-Mauroux et al. “NoSQL Databases for RDF: An Empirical Evaluation”
33. “NoSQL Databases for RDF: An
Empirical Evaluation” (2)
• Hive+HBase prototype
– SPARQL to HiveQL translation
– Property table
• Row key is S
• a column for each P
• cell value stores O
• Multi-valued attributes have different timestamps
Source: Cudre-Mauroux et al. “NoSQL Databases for RDF: An Empirical Evaluation”
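The property-table layout described above can be sketched as follows. This is an in-memory illustration only, not the HBase or Hive API: nested dictionaries stand in for rows and columns, a simple counter stands in for cell timestamps, and the function names `build_property_table` and `objects` are hypothetical.

```python
from collections import defaultdict
from itertools import count


def build_property_table(triples):
    """Row key = subject, one column per predicate, cell value = object.

    Multi-valued attributes are kept as several versions of the same
    cell, distinguished by a timestamp (here: a running counter).
    """
    clock = count(1)
    table = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        table[s][p].append((next(clock), o))
    return table


def objects(table, s, p):
    """Return all versions of the cell (row s, column p)."""
    return [o for _, o in table.get(s, {}).get(p, [])]


table = build_property_table([
    ("beatles", "origin", "Liverpool"),
    ("beatles", "created", "Revolver"),
    ("beatles", "created", "Help!"),
])
```

All facts about one subject sit in one row, which favours subject-centred lookups; patterns that bind only the object need a scan or an additional index.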
34. “NoSQL Databases for RDF: An
Empirical Evaluation” (3)
• CumulusRDF prototype
– Sesame for SPARQL queries, Cassandra for data management
– 3 index tables (SPO, POS, OSP)
– Sesame query plan translated to Cassandra index lookups
• Couchbase prototype
– Map RDF into JSON documents
• all triples with the same S stored in the same document (molecule)
• 2 JSON arrays for Ps and Os
– Jena as a SPARQL query engine
– 3 indexes (Couchbase views): SPO, POS, OSP
Source: Cudre-Mauroux et al. “NoSQL Databases for RDF: An Empirical Evaluation”
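The document mapping used for the Couchbase prototype can be sketched as below. The field names "p" and "o" and the function `to_documents` are assumptions made for illustration; the source only states that each subject's triples form one document with two JSON arrays for predicates and objects.

```python
import json
from collections import defaultdict


def to_documents(triples):
    """Group triples by subject into one JSON-serialisable document each.

    The i-th entry of "p" and the i-th entry of "o" together form one
    (predicate, object) pair of the subject.
    """
    docs = defaultdict(lambda: {"p": [], "o": []})
    for s, p, o in triples:
        docs[s]["p"].append(p)
        docs[s]["o"].append(o)
    return dict(docs)


documents = to_documents([
    ("beatles", "origin", "Liverpool"),
    ("beatles", "created", "Revolver"),
    ("elvis", "origin", "Memphis"),
])
serialised = json.dumps(documents["beatles"])
```

The document key plays the role of S, so a subject lookup is a single key access; the SPO, POS and OSP views cover the remaining access patterns.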
35. “NoSQL Databases for RDF: An
Empirical Evaluation” (4)
• Benchmarks
– BSBM 10M, 100M and 1B triples
– 1, 2, 4, 8, 16 node cluster
– AWS cost & query execution time
Source: Cudre-Mauroux et al. “NoSQL Databases for RDF: An Empirical Evaluation”
36. “NoSQL Databases for RDF: An
Empirical Evaluation” (5)
• Results
– Simple SPARQL queries can be executed more
efficiently on a NoSQL datastore
– Data loading time for some NoSQL datastores
comparable or better than the native RDF store
– Complex SPARQL queries perform significantly slower
on NoSQL systems
• Query optimisations are required
– MapReduce operations (Hive & Couchbase) introduce
high latency for view maintenance / query execution
Source: Cudre-Mauroux et al. “NoSQL Databases for RDF: An Empirical Evaluation”
38. Working with Distributed Data
• Apache Hadoop (http://hadoop.apache.org) is an open source
implementation of MapReduce
• MapReduce
– Distributed batch processing
– Map phase partitions the input set (K/V pairs), Reduce phase performs
aggregated processing over the partitions in parallel
– Shuffle intermediate results (from Map nodes to Reduce nodes)
• Allows for the processing of distributed large data sets across
clusters of computers
– On a distributed file system (HDFS)
– Scales up to thousands of nodes, each offering local processing power
and storage
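The Map, Shuffle and Reduce phases described above can be sketched on a single machine as follows. This is not the Hadoop API: the function names are hypothetical, and the job (counting triples per predicate) is chosen only because it is small. On a real cluster the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(triple):
    """Emit one key/value pair per input triple: (predicate, 1)."""
    _, predicate, _ = triple
    yield predicate, 1


def shuffle(pairs):
    """Group intermediate values by key, as done between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Aggregate all values that share a key."""
    return key, sum(values)


def run_job(triples):
    pairs = (pair for t in triples for pair in map_phase(t))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


counts = run_job([
    ("beatles", "created", "Revolver"),
    ("beatles", "created", "Help!"),
    ("elvis", "origin", "Memphis"),
])
```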
39. “Scalable Distributed Reasoning
with MapReduce”
• Goal
– Utilise Hadoop for large scale reasoning
• Approach
– Implement each RDFS rule (join) via a Map & Reduce function
– Map outputs original triple as value, and the join term as key
– Reducer receives all needed triples to perform the join
Source: Urbani et al. “Scalable Distributed Reasoning with MapReduce”
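The approach above can be sketched for a single RDFS rule, rdfs9 (if x rdf:type C and C rdfs:subClassOf D, then x rdf:type D). This is a single-machine illustration with hypothetical function names, not the authors' implementation: the map step emits the join term (the class C) as key and the original triple as value, so that the reducer for a key receives every triple it needs to perform the join. Only one pass is shown; a full closure would repeat until no new triples are derived.

```python
from collections import defaultdict

TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def map_triple(triple):
    """Key each relevant triple by the term the rule joins on."""
    s, p, o = triple
    if p == TYPE:
        yield o, triple      # join term is the class in object position
    elif p == SUBCLASS:
        yield s, triple      # join term is the subclass in subject position


def reduce_join(key, triples):
    """All triples sharing the class `key` are joined locally."""
    instances = [s for s, p, o in triples if p == TYPE]
    superclasses = [o for s, p, o in triples if p == SUBCLASS]
    return {(x, TYPE, d) for x in instances for d in superclasses}


def apply_rdfs9(triples):
    groups = defaultdict(list)
    for t in triples:
        for key, value in map_triple(t):
            groups[key].append(value)
    derived = set()
    for key, values in groups.items():
        derived |= reduce_join(key, values)
    return derived


derived = apply_rdfs9([
    ("revolver", TYPE, "Album"),
    ("help", TYPE, "Album"),
    ("Album", SUBCLASS, "MusicalWork"),
])
```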
40. “Scalable Distributed Reasoning
with MapReduce” (2)
Source: Urbani et al. “Scalable Distributed Reasoning with MapReduce”
41. “Scalable Distributed Reasoning
with MapReduce” (3)
• Challenge
– Too many duplicates (unique to derived
triple ratio of 1:50)
• Optimisations
– Replicate schema triples on each node
(in memory)
• Needed for each join; usually a small set
– Rule re-ordering
• Which rule may be triggered by another
rule?
• Reduce the number of required iterations
Source: Urbani et al. “Scalable Distributed Reasoning with MapReduce”
42. “Scalable Distributed Reasoning
with MapReduce” (4)
• Results
– Throughput of 4.5M triples / sec on a 16-node cluster
– More than 16 nodes do not improve the performance
significantly
Source: Urbani et al. “Scalable Distributed Reasoning with MapReduce”
43. Lessons Learned from Large-scale Reasoning (J. Urbani)
• 1st Law: Treat schema triples differently
– Replicate on all nodes to minimise subsequent data transfer
• 2nd Law: Data skew dominates data distribution
– No universal partitioning scheme for input data
– Computation tasks moved to the nodes storing the data
(data locality)
• 3rd Law: Certain problems only appear at a very large
scale
– Proof-of-concept prototypes are often not representative
Source: Jacopo Urbani “Three Laws Learned from Web-scale Reasoning”
45. Streaming Data
• A large amount of new data is constantly being created or
data is being updated at a rapid rate
– Traffic data, sensor networks, social networks, financial markets
• Many data sources create a constant “stream of information”
– Not always practical to store all data and then query it
– Continuous queries over transient data
• More recent data is more important
– Describes the current state of a dynamic system
46. Stream Processing
• Streams are observed through windows
• Continuous queries can be registered over the stream
• Continuous queries are iteratively evaluated over the data in the
current window
– Can leverage static background knowledge (e.g., schema information)
• Generates a stream of answers
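The evaluation loop described above can be sketched as follows. It is a simplified, offline illustration with hypothetical names (`continuous_query`, `width`, `step`): a real stream processor evaluates incrementally as data arrives, whereas this sketch replays a finished list of timestamped items. At each evaluation time t, the window holds the items with a timestamp in (t - width, t].

```python
def continuous_query(stream, width, step, query):
    """Evaluate `query` over a sliding time window.

    stream: list of (timestamp, item) pairs sorted by timestamp.
    width:  window length in time units.
    step:   time between two consecutive evaluations.
    Returns the stream of answers as (evaluation time, answer) pairs.
    """
    answers = []
    if not stream:
        return answers
    last = stream[-1][0]
    t = step
    while t < last + step:
        window = [item for ts, item in stream if t - width < ts <= t]
        answers.append((t, query(window)))
        t += step
    return answers


readings = [(1, "a"), (2, "b"), (5, "c"), (11, "d")]
answers = continuous_query(readings, width=10, step=5, query=len)
```

Static background knowledge can be made available to the query by closing over it, for example a function that filters the window against a fixed set of known sensors.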
[Figure: a window over a stream along the time axis; a continuous query, together with background knowledge, is evaluated over the window and produces a stream of answers]
47. Linked Stream Data
• A representation of sensor/stream data following the
Linked Data principles
– Sensor data can be enriched with semantics
– Facilitates data discovery and integration of heterogeneous data
sources
• Challenges
– RDF Triples must be annotated with timestamps
– Extensions to the SPARQL language – windows, continuous queries,
streaming operators
– Continuous semantics
– Scalability (Volume)
– High throughput and low latency (Velocity)
– Approximate reasoning
48. Querying Streams with
SPARQL Extensions
• Queries over streaming data are evaluated by specifying
continuous queries
• The results of a continuous query are updated as new
data arrives
• Several SPARQL extensions with streaming operators based on
CQL (Continuous Query Language)
– C-SPARQL
– SPARQLStream
– EP-SPARQL, CQELS, INSTANS
49. C-SPARQL (1)
C-SPARQL is an extension of SPARQL 1.1
1. RDF Streams: Sequence of RDF triples annotated with timestamps:
<(s,p,o), timestamp>
2. FROM STREAM extension for stream sources and windows
FromStrClause  → 'FROM' ['NAMED'] 'STREAM' StreamIRI '[ RANGE' Window ']'
Window         → LogicalWindow | PhysicalWindow
LogicalWindow  → Number TimeUnit WindowOverlap
TimeUnit       → 'MSEC' | 'SEC' | 'MIN' | 'HOUR' | 'DAY'
WindowOverlap  → 'STEP' Number TimeUnit | 'TUMBLING'
PhysicalWindow → 'TRIPLES' Number
50. C-SPARQL (2)
3. Registration
• Creates a continuous query over the data source
• The query output is variable bindings, RDF graph, or a
new stream
Registration → 'REGISTER' ('QUERY'|'STREAM') QName 'AS' Query
51. C-SPARQL (3)
Example
Query:
Retrieve the cars and the districts in which each car was registered by a toll gate.
REGISTER QUERY CarsEnteringInDistricts AS
SELECT DISTINCT ?district ?car
FROM STREAM <www.uc.eu/tollgates.trdf> [RANGE 40 SEC STEP 10 SEC]
WHERE {
?toll t:registers ?car .
?toll c:placedIn ?street .
?district c:contains ?street . }
Source: Barbieri, Davide Francesco, et al. "Querying rdf streams with c-sparql." ACM SIGMOD
Record 39.1 (2010): 20-26.
52. C-SPARQL (4)
Source: M. Balduini et al. “Tutorial on Stream Reasoning for Linked Data (ISWC’2013)”
53. SPARQLStream (1)
• Utilizes the same definition of RDF streams as in C-SPARQL:
<(s,p,o), timestamp>
• The language is defined as follows:
NamedStream → 'FROM' ['NAMED'] 'STREAM' StreamIRI '[' Window ']'
Window      → 'NOW-' Integer TimeUnit [UpperBound] [Slide]
UpperBound  → 'TO NOW-' Integer TimeUnit
Slide       → 'SLIDE' Integer TimeUnit
TimeUnit    → 'MS' | 'S' | 'MINUTES' | 'HOURS' | 'DAY'
Select      → 'SELECT' [XStream] [DISTINCT | REDUCED] …
XStream     → 'ISTREAM' | 'DSTREAM' | 'RSTREAM'
Source: Jean-Paul Calbimonte and Oscar Corcho. “SPARQLStream: Ontology-based access to data
streams.” Tutorial at ISWC 2013
54. SPARQLStream (2)
Example
Query:
Retrieve an RSTREAM of the observations captured by all sensors in the last
10 minutes.
PREFIX ssn: <http://purl.oclc.org/NET/ssnx/ssn#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT RSTREAM ?sensor ?observation
FROM STREAM <www.semsorgrid4env.eu/SensorReadings.srdf>
[FROM NOW - 10 MINUTES TO NOW STEP 1 MINUTE]
WHERE {
?observation a ssn:Observation;
ssn:observedBy ?sensor .
}
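The RSTREAM operator used in this query, and its siblings ISTREAM and DSTREAM, can be illustrated with a small sketch. It follows the usual CQL reading of these relation-to-stream operators and is an assumption about their semantics here, not SPARQLStream's implementation; the function names are hypothetical. Each operator compares the query results of the previous window evaluation with those of the current one.

```python
def rstream(previous, current):
    """Emit every result that holds in the current window."""
    return set(current)


def istream(previous, current):
    """Emit only results that are new since the previous evaluation."""
    return set(current) - set(previous)


def dstream(previous, current):
    """Emit only results that held before but no longer hold."""
    return set(previous) - set(current)


# Bindings of (?sensor, ?observation) at two consecutive evaluations.
before = {("sensor1", "obs1"), ("sensor2", "obs2")}
now = {("sensor2", "obs2"), ("sensor3", "obs3")}

everything = rstream(before, now)
inserted = istream(before, now)
deleted = dstream(before, now)
```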
56. W3C Semantic Sensor Networks
• SSN Ontology
– http://www.w3.org/2005/Incubator/ssn/ssnx/ssn
– OWL DL ontology
– Used to semantically describe sensors and sensor networks & data
– Recommendations for applying the ontology for Linked Sensor Data
57. W3C Semantic Sensor Networks
(2)
• Different perspectives
– Sensor, data/observation, system
59. A Trillion RDF Triples
• Use case
– Use RDF and Linked Data for the customer management
database of a big telecom
– Franz Inc / AllegroGraph
60. uRiKA Appliance
• YarcData
• Big Data appliance for graph
analytics
– 8K processors, 1TB RAM
– In-memory RDF database
– SPARQL 1.1 support
61. RDFS Reasoning on GPUs
• Similar approach to Urbani et al. for large scale
reasoning with Hadoop
– Handle rules with 2 antecedents
– Rule reordering
– Dictionary encoding
• Shared-memory architecture
– Efficient GPU algorithm implementation is challenging
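Dictionary encoding, one of the techniques listed above, can be sketched as follows. The class `Dictionary` and the function `encode_triples` are hypothetical names for illustration, not the authors' code: every distinct term is replaced by a small integer, so triples become fixed-size integer tuples that are cheaper to compare, join and transfer than strings.

```python
class Dictionary:
    """Bidirectional mapping between RDF terms and integer IDs."""

    def __init__(self):
        self.term_to_id = {}
        self.id_to_term = []

    def encode(self, term):
        # Assign the next free ID the first time a term is seen.
        if term not in self.term_to_id:
            self.term_to_id[term] = len(self.id_to_term)
            self.id_to_term.append(term)
        return self.term_to_id[term]

    def decode(self, term_id):
        return self.id_to_term[term_id]


def encode_triples(triples, dictionary):
    return [tuple(dictionary.encode(term) for term in t) for t in triples]


dictionary = Dictionary()
encoded = encode_triples([
    ("revolver", "rdf:type", "Album"),
    ("help", "rdf:type", "Album"),
], dictionary)
decoded = [tuple(dictionary.decode(i) for i in t) for t in encoded]
```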
Source: Norman Heino & Jeff Z. Pan “RDFS Reasoning on Massively Parallel Hardware” ISWC 2012
62. RDFS Reasoning on GPUs (2)
• Data parallelism
– Apply one rule (thread) on one instance triple, join to a schema triple
if possible
– Hundreds / thousands of threads working in parallel
• Challenge
– Duplicate removal
• Benchmark
– x5 speedup of computation
– But… memory transfer overhead is significant
Source: Norman Heino & Jeff Z. Pan “RDFS Reasoning on Massively Parallel Hardware” ISWC 2012
63. Benchmarks
• BSBM v3.1 (April 2013)
– http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/results/V7/
– Includes benchmarks with up to 150 billion triples
– x750 scale increase since the last BSBM result (200M triples)
• LDBC
– Industry neutral, non-profit organisation
– Benchmarks for RDF and graph databases, similar to TPC
– Big data volume, complex queries
65. Summary
• Linked Data is a good fit for the Variety
challenge of Big Data
• Linked Data can simplify data discovery, data
access, data integration challenges for Big Data
• Exponential growth of Linked Data
• Linked Data benchmarks target bigger
workloads
66. Summary (2)
• Ongoing R&D towards scaling up Linked Data
for high data Volume and Velocity
– NoSQL datastores for RDF data management
– Hadoop for scalable RDF reasoning
– GPUs for scalable RDF reasoning
• Adapting Linked Data & SPARQL for streaming
data scenarios
67. For exercises, quiz and further material visit our website:
http://www.euclid-project.eu
Course, eBook
Other channels: @euclid_project, euclidproject