LOD2 Plenary Vienna 2012: WP2 - Storing and Querying Very Large Knowledge Bases
1. SIB . 23.03.2011 . Page 1 http://lod2.eu
WP2
Storing and Querying
Very Large Knowledge Bases
Vienna Update
March 2012 – M18
Peter Boncz
http://lod2.eu
Table of Contents
• WP2 Refresher
• LOD Cloud Hosted on the Knowledge Store Cluster
* 50B mark reached, column-store Virtuoso deployed
• State of the Art LOD Laboratory (“Benchmarking”)
* LDBC – RDF Store Industry council
* BSBM at large scale
* RDF-H + Social Intelligence Benchmark (SIB)
• Technical work
* column-store Virtuoso cluster version
* recycling query results
• Next up
* LOD cloud @250B triples
* Virtuoso: adaptive query optimizer (and more)
* first MonetDB/SPARQL version (RDF clustering, graph indexing)
WP2 Organization
CWI (MonetDB):
• Peter Boncz (also in the VUA group of Frank van Harmelen)
• Duc Pham Minh (PhD student)
• Irini Fundulaki (1-year sabbatical from FORTH)
OpenLink (Virtuoso):
• Orri Erling
• Hugh Williams
• Ivan Mikhailov
+ FU Berlin (BSBM)
+ DERI (BSBM text+ LOD cloud + text retrieval/sindice)
+ ULEI (DBpedia benchmark)
WP2
Storing and Querying Very Large Knowledge Bases
Goal: enabling large-scale, feature-rich & enterprise-ready Linked
Data management solutions
Database Partners in LOD2:
CWI: Leading open source analytics RDBMS
OpenLink: Leading Linked Data deployment platform
Technological Excellence:
Creating and publishing metrics for choosing RDF solutions
Bringing Column Store Technology for Business Intelligence on RDF
Ground-breaking database innovations for RDF stores
(Dynamic Query optimization, Adaptive Caching of Joins,
Optimized Graph Processing, Cluster/Cloud scalability)
Task 2.1: State of the Art, Evaluation & Benchmarking
LOD cloud cache scalability
• M0: 20B triples
• M12: 50B triples
• M24: 250B triples
• M36: 1T triples
D2.4 completed: 50B triples in LOD cache @ DERI
First deployment of Virtuoso7 Cluster
• Currently hosting about 55 billion triples
• 8 node Virtuoso v7 (column store) Cluster
• 384GB RAM
• 2TB Disk Storage
• 14 bytes/quad, excluding literals
Next up:
• hardware provisioning for 250B and 1T triples
(requiring 512GB and 2TB of RAM, respectively)
Task 2.1: State of the Art, Evaluation & Benchmarking
Benchmarking
• creating new benchmarks
• BSBM-BI (FU Berlin)
• DBpedia Benchmark (ULEI) – best paper award
• RDF-H (OGL,CWI)
• Social Intelligence Benchmark (OGL,CWI)
• running benchmark evaluations
• BSBM on a large cluster (Lisa @ SARA)
• BSBM on large single-server (40cores, 1TB RAM)
• creating industry consensus
• Benchmark Auditing Service
• LOD Benchmark Council
BSBM Large Scale Experiments (still ongoing..)
New Aspects:
• The Business Intelligence Use Case (BI)
• Benchmark Rules
• BSBM V3 Results
• trying cluster versions
SARA LISA cluster
• experiments with up to 64 nodes
VectorWise high-end server
• 40-core machine with 1TB RAM
Systems benchmarked at SARA and on the VectorWise server:
• 4store 1.1.2 (Garlik): http://4store.org/
• BigData r4169 (SYSTAP LLC): http://www.systap.com/bigdata.htm
• BigOwlim 3.4.3129 (OntoText): http://www.ontotext.com/owlim/
• Jena TDB 0.8.9 (openjena.org): http://www.openjena.org/TDB/
• Fuseki 0.1.0 (openjena.org): http://openjena.org/wiki/Fuseki
• Virtuoso 7.0 (OpenLink): http://virtuoso.openlinksw.com/
Social Intelligence Benchmark
• 14 dictionaries of real data
• Facebook-style schema
• Realistic scenario simulation
• Synthetic generated data, linked to Linked Open Data
Technical Work: Recycling (D2.4)
Dynamic caching of intermediate query results
• SPARQL problem: workloads are hard to index and backward chaining is expensive
Idea: compute once, re-use many times
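The "compute once, re-use many times" idea can be sketched as a cache of materialized intermediate results keyed by a canonical form of the subplan that produced them. This is a minimal illustration only, not the actual Virtuoso/MonetDB recycler; all names are hypothetical.

```python
# Sketch of intermediate-result recycling: results of a (sub)query plan are
# cached under a canonical key, so later queries sharing that subplan skip
# recomputation. Illustrative only; not the real Virtuoso/MonetDB design.

class RecyclingCache:
    def __init__(self):
        self.store = {}   # canonical subplan key -> materialized result
        self.hits = 0     # how often recomputation was avoided

    def evaluate(self, subplan, compute):
        key = repr(subplan)          # canonical textual form of the subplan
        if key in self.store:
            self.hits += 1           # re-use: no recomputation
        else:
            self.store[key] = compute()
        return self.store[key]

# Toy triple-pattern evaluation over an in-memory triple list.
triples = [("a", "knows", "b"), ("b", "knows", "c"), ("a", "type", "Person")]
scan = lambda p: [t for t in triples if t[1] == p]

cache = RecyclingCache()
r1 = cache.evaluate(("?s", "knows", "?o"), lambda: scan("knows"))
r2 = cache.evaluate(("?s", "knows", "?o"), lambda: scan("knows"))  # cache hit
```

A real recycler must additionally decide which intermediates are worth keeping (eviction policy) and invalidate entries on updates, which this sketch omits.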
Technical Work: Virtuoso 7
Major upcoming release V7, due in 2012
• column store technology:
• aggressive compression: more data fits in RAM
• vectored execution: things run faster
• elastic cluster implementation
• partitions can migrate across nodes
• bringing computation to the data
• arbitrary recursive functions in the cluster
• geospatial support
• full openGIS support, R-tree backed, EWKT format
• future enhancements
• adaptive query optimization (CWI ROX)
• re-use of intermediates (CWI recycling)
• using SSDs as cache
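The two column-store benefits listed above can be illustrated in a few lines: run-length compression shrinks a clustered column so more of it fits in RAM, and vectorized execution processes values a batch at a time instead of tuple-at-a-time. This is a toy sketch, not Virtuoso internals.

```python
# Illustrative sketch (not Virtuoso's implementation) of the column-store
# wins listed above: compression and vector-at-a-time execution.

def rle_compress(column):
    """Run-length encode a sorted/clustered column as [value, run_length] pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def vectorized_filter(column, predicate, batch=1024):
    """Apply a predicate one batch at a time instead of one value at a time."""
    out = []
    for i in range(0, len(column), batch):
        chunk = column[i:i + batch]          # a "vector" of values
        out.extend(v for v in chunk if predicate(v))
    return out

# A clustered predicate/type column compresses extremely well:
col = ["dbp:Person"] * 5000 + ["dbp:Place"] * 3000
runs = rle_compress(col)                     # 8000 values -> 2 runs
hits = vectorized_filter(col, lambda v: v == "dbp:Place")
```

In a real engine the batches stay in CPU cache and the inner loop is compiled tight code over primitive arrays, which is where the speedup comes from.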
Next 6 months
Virtuoso: sampled query optimizer
• query optimization in SPARQL is difficult (no stats)
• use adaptive, run-time, query optimization with sampling
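The sampling idea above can be sketched as follows: with no precomputed statistics, the optimizer probes a small random sample of the data at run time to estimate how selective a pattern is. A minimal sketch under assumed names; the real Virtuoso optimizer works on index samples, not full scans.

```python
# Sketch of sampling-based cardinality estimation, the core of a run-time
# ("adaptive") query optimizer when no precomputed statistics exist.
import random

def estimate_cardinality(triples, predicate, sample_size=100, seed=7):
    """Estimate how many triples match `predicate` by probing a random sample."""
    random.seed(seed)
    n = len(triples)
    sample = random.sample(triples, min(sample_size, n))
    matching = sum(1 for t in sample if predicate(t))
    return matching / len(sample) * n        # scale sample fraction to the full set

# 10,000 toy triples, 75% of which use the "knows" predicate:
triples = [("s%d" % i, "knows" if i % 4 else "type", "o") for i in range(10000)]
est = estimate_cardinality(triples, lambda t: t[1] == "knows")
```

The estimate is only approximate (the true count here is 7500), but it is cheap enough to compute per query, which is exactly what a SPARQL optimizer without stored statistics needs for join ordering.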
MonetDB and SPARQL
• First version in sight (cooperation with FORTH)
• research tracks
• RDF clustering on Characteristic Sets
• correlated join path indexing
LOD cache at 250B triples
• what triples to use?
• what hardware to use? (need 512GB RAM)
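The "RDF clustering on Characteristic Sets" track above rests on a simple observation: the characteristic set of a subject is the set of predicates it occurs with, and subjects sharing a characteristic set behave like rows of one relational table, so they can be stored and indexed together. A minimal sketch with illustrative names:

```python
# Characteristic sets: group subjects by the set of predicates they appear
# with. Subjects in one group are structurally alike and can be clustered
# together in storage. Minimal sketch; names are illustrative.
from collections import defaultdict

def characteristic_sets(triples):
    preds = defaultdict(set)
    for s, p, o in triples:
        preds[s].add(p)                       # predicates seen per subject
    clusters = defaultdict(list)
    for s, ps in preds.items():
        clusters[frozenset(ps)].append(s)     # one cluster per predicate set
    return clusters

triples = [
    ("alice", "name", "Alice"), ("alice", "knows", "bob"),
    ("bob",   "name", "Bob"),   ("bob",   "knows", "alice"),
    ("paris", "name", "Paris"), ("paris", "country", "France"),
]
cs = characteristic_sets(triples)
# two clusters: {name, knows} -> person-like subjects, {name, country} -> place-like
```

Real RDF data typically collapses into a surprisingly small number of frequent characteristic sets, which is what makes this a viable physical clustering scheme.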
Contact
Address
Centrum Wiskunde & Informatica (CWI)
Science Park 123
1098 XG Amsterdam
The Netherlands
monetdb.cwi.nl
Thanks for your attention!
LOD2 Benchmark Auditing Service
Benchmarking needs of SPARQL engine vendors:
• vendors want to publish on their own timescale
• using new or upcoming releases (not yet public)
• using settings and hardware properly tuned for their solution
• yet they need credibility (is it fair?)
Tournaments organized by a single institution suffer from:
• bad timing (wrong version, one more bug left to fix, etc.)
• not the right hardware or settings
• potential legal liability once matters become more serious
LOD2 should reach out to the SPARQL technical community and
provide independent benchmark auditing services
• start with BSBM: an Auditing Rules Document is in progress
• maybe other benchmarks later
Speaker Notes
For the reasons above, we proposed an RDF and graph database benchmark, the Social Intelligence Benchmark (SIB), which exploits the strengths of RDF for graph representation. We aim to test graph database performance on a highly connected graph. Since social networks are a high-profile use case for graph data management, we designed the benchmark around social-network scenarios. We try to generate data that is as realistic as possible, with correlations, and to offer challenging queries over those correlations. In addition, since a very large amount of useful information is available in linked open datasets, we exploit these resources by linking to them.
Now I will describe the data specification of SIB. As Facebook is the most popular social network, with more than 800 million active users, we take the Facebook schema style as the baseline for designing SIB. To generate realistic data, we use 14 dictionaries built from real data, covering various domains such as geographical information and personal names. SIB data is designed to simulate realistic scenarios, including real user behaviors and the data distributions characteristic of social networks. As mentioned before, our synthetic data is linked with well-known linked open data: SIB links to DBpedia, one of the largest linked open datasets.
I think most of us know Facebook, and many even have a Facebook account. The logical schema of our benchmark simulates the Facebook schema: a user can have many friends, with friendship relations between them. A user can provide profile information such as their name, where they study, and where they live. They can also specify a current status, for example "in a relationship" with another user. Users can upload many photos, start discussions by writing posts, and receive comments from their friends.