Exploiting the query structure
for efficient join ordering in
SPARQL queries
Luiz Henrique Zambom Santana
Vinicius da Silveira Segalin
Agenda
•Paper and authors
•Background
•Problem and solution
•Example
•Algorithms
•Analysis
•Conclusions
Paper and authors
Gubichev, Andrey, and Thomas Neumann. "Exploiting
the query structure for efficient join ordering in
SPARQL queries." EDBT. 2014.
Extending Database Technology – Qualis A2/H-index 52
Background
•SPARQL
• W3C standard
• Semantic Web
• Inspired by SQL
•Query structure
•Join ordering (similar to matrix product)
Problem
•The join ordering problem is a fundamental challenge that has to be
solved by any query optimizer
•The computation time differs depending on the order in which the joins
are executed
•SQL solutions are not immediately capable of handling large SPARQL
queries. The paper introduces a new join ordering algorithm that
performs a SPARQL-tailored query simplification
Problem
•Cardinality estimation is an essential part of any cost-based query
optimizer
•Two different approaches:
• RDF-3X: query compilation time (dominated by finding the optimal
join order) is one order of magnitude higher than the actual
execution time
• Virtuoso 7: greedy algorithm for compilation leads to a slow run
time (sub-optimal order)
Solution
•Best of both worlds:
• A heuristic that spends a reasonable amount of time optimizing the
query and still gets a decent join order
• The paper presents a SPARQL-tailored query simplification
procedure that decomposes the query's join graph into star-shaped
subqueries and chain-shaped subqueries
Challenges
•RDF can be very verbose
• TPC-H Query 2 written in SPARQL contains joins between 26 index
scans (as opposed to joins between 5 tables in the SQL
formulation)
• Number of plans:
• 5! = 120 plans in SQL vs 26! ≈ 4 × 10^26
•Lack of schema
• Foreign keys become structural correlations
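The plan counts above can be checked with a few lines of Python. This is only a sketch of the arithmetic on the slide: it counts orderings of n inputs as n!, and the function name `orderings` is made up for the illustration.

```python
import math

def orderings(n: int) -> int:
    """Number of ways to order n join inputs (n!), as counted on the slide."""
    return math.factorial(n)

sql_plans = orderings(5)      # 5 tables in the SQL formulation
sparql_plans = orderings(26)  # 26 index scans in the SPARQL formulation

print(sql_plans)
print(f"{sparql_plans:.2e}")
```

The gap between the two numbers is the reason exhaustive enumeration that is feasible for the SQL form becomes impractical for the SPARQL form of the same query.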
Solution
• The characteristic set of a subject s is the set of properties (attributes) of that entity. It defines the entity's
class (type), in the sense that subjects with the same characteristic set tend to be similar
• Hierarchical Characterisation:
• 1. H_0 is the set of all characteristic sets of R
• 2. H_i = { argmin_{C ⊂ S ∧ |C| = |S|−1} cost(C) | ∀ S ∈ H_{i−1} }, that is, H_i consists of the subsets C of sets
from H_{i−1} that minimize cost(C)
• 3. ∀ S ∈ H_k : |S| = 2
• 4. Every S ∈ H_{i−1} stores a pointer to its cheapest subset C ∈ H_i
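A minimal Python sketch of the two definitions above, under stated assumptions: the triples are invented, and `make_cost` is a stand-in cost function (the number of subjects whose characteristic set contains all predicates of C), not the cost model of the paper. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def characteristic_sets(triples):
    """Count how many subjects share each characteristic set (set of predicates)."""
    predicates_of = {}
    for s, p, _o in triples:
        predicates_of.setdefault(s, set()).add(p)
    return Counter(frozenset(ps) for ps in predicates_of.values())

def make_cost(char_sets):
    """Toy cost (assumption): number of subjects having every predicate in c."""
    def cost(c):
        return sum(n for cs, n in char_sets.items() if c <= cs)
    return cost

def hierarchical_characterisation(char_sets, cost):
    """Map each set S with |S| > 2 to its cheapest subset of size |S| - 1."""
    pointer = {}
    level = set(char_sets)              # H_0
    while level:
        next_level = set()              # H_i, built from H_{i-1}
        for s in level:
            if len(s) <= 2 or s in pointer:
                continue
            candidates = (frozenset(c) for c in combinations(sorted(s), len(s) - 1))
            best = min(candidates, key=cost)
            pointer[s] = best           # rule 4: pointer to the cheapest subset
            next_level.add(best)
        level = next_level
    return pointer

# Invented example data, not taken from the paper.
triples = [
    ("marie", "hasName", "Marie Curie"), ("marie", "bornIn", "warsaw"),
    ("marie", "livedIn", "paris"), ("marie", "created", "radium_theory"),
    ("pierre", "hasName", "Pierre Curie"), ("pierre", "bornIn", "paris"),
    ("pierre", "livedIn", "paris"),
    ("warsaw", "label", "Warsaw"), ("warsaw", "locatedIn", "Poland"),
]
char_sets = characteristic_sets(triples)
cost = make_cost(char_sets)
pointer = hierarchical_characterisation(char_sets, cost)
```

Each level shrinks the sets by one predicate, so the loop stops once only two-element sets remain, which matches rule 3.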
Algorithm 1 (part. 1)
• Line 2: S=[{created, bornIn, livedIn, hasName},
{ bornIn, livedIn, hasName},...]
• Line 8: initialises the Banker's iteration, i.e. enumerates
the predicate subsets from the smallest to the largest
possible set
Algorithm 1 (part. 2)
• Line 12: guarantees that S_2 is smaller than S_1
• Lines 15-16: find the subsets that have a smaller
cost
• Cost
• Banker's iteration potentially enumerates all
the subsets of all predicates in the dataset. In
practice it stops relatively early, since it is
always bounded by the largest set in Sets
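The enumeration order described above (subsets by increasing size, cut off at the size of the largest set) can be sketched in Python as follows. This is an illustration, not the paper's Algorithm 1: it uses `itertools.combinations` to produce the size-ordered sequence, and `bankers_subsets` and `max_size` are names chosen for the example.

```python
from itertools import combinations

def bankers_subsets(items, max_size):
    """Yield non-empty subsets of items ordered by increasing cardinality,
    stopping at max_size (the size of the largest known set)."""
    items = sorted(items)
    for k in range(1, min(max_size, len(items)) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)

predicates = ["created", "bornIn", "livedIn", "hasName"]
subsets = list(bankers_subsets(predicates, max_size=2))
sizes = [len(s) for s in subsets]
print(len(subsets), sizes)
```

With four predicates and a bound of two, only the 4 singletons and 6 pairs are visited instead of all 15 non-empty subsets, which is the early stop the slide refers to.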
Algorithm 2 (part. 1/2)
• Objective: finding the optimal join order in (sub)queries
of the form:
select * where { ?s p_1 ?o_1 . ... . ?s p_k ?o_k }
• Idea: extract the part of the Hierarchical
Characterisation of the dataset starting with the
set S
• Input: Star-shaped graph
• Output: Order of the joins
• Lines 1-9:
• While size S > 2, find the most expensive
subset and push to front of O
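A sketch of the loop in lines 1-9, under one assumed reading: it follows the "cheapest subset" pointers of the Hierarchical Characterisation and pushes the predicate left out at each step to the front of O, so that predicate is joined later. The pointer table below is written by hand for the illustration, and `star_join_order` is a hypothetical name.

```python
def star_join_order(predicates, cheapest_subset):
    """Return a join order for a star query by following cheapest-subset pointers."""
    s = frozenset(predicates)
    order = []
    while len(s) > 2:
        smaller = cheapest_subset[s]   # pointer stored in the hierarchy
        (dropped,) = s - smaller       # exactly one predicate is left out
        order.insert(0, dropped)       # push to the front of O
        s = smaller
    return sorted(s) + order           # the remaining pair is joined first

# Hand-written pointers (assumption, for illustration only).
pointers = {
    frozenset({"hasName", "bornIn", "livedIn", "created"}):
        frozenset({"bornIn", "created", "hasName"}),
    frozenset({"bornIn", "created", "hasName"}):
        frozenset({"bornIn", "created"}),
}
order = star_join_order(["hasName", "bornIn", "livedIn", "created"], pointers)
print(order)
```

Each iteration does one pointer lookup, so the number of steps grows linearly with the number of predicates in the star.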
Algorithm 2 (part. 2/2)
• The first part yields the optimal order for star-shaped
queries in time linear in the graph size
• However, it does not find the optimal solution if
the query has constants:
select * where { ?s p_1 "Berlin" . ... . ?s p_k ?o_k }
• Then:
• Lines 12-14: only one of the bound
objects is in the triple with the key
predicate, i.e. the entire star query is
a lookup of the properties of a
specific entity
• Lines 15-16: otherwise (many objects are
key), keep pushing down the constants in
the join tree and stop when the cost of the
corresponding index scan is bigger than the
cost of the join on that level of the tree
Algorithm 3 (part. 1/4)
• Objective: ordering joins in general SPARQL queries
(s_1, hasName, "Marie Curie"),
(s_1, bornIn, s_2),
(s_2, label, "Warsaw"),
(s_2, locatedIn, "Poland")
• Problem: s_2 links person to city, corresponding to a "foreign key", but RDF does not require any
schema. Knowledge of such dependencies is extremely useful for the query optimizer: without it, the
optimizer has to assume independence between the two entities linked via the bornIn predicate, thus almost
inevitably underestimating the selectivity of the join of the corresponding triple patterns
• Thus, it uses Characteristic Pairs (PC, Paar Charakteristisch) to discover this kind of relation, where:
PC(S_c(s), S_c(o)) = {(S_c(s), S_c(o), p) | S_c(o) ≠ ∅ ∧ ∃p : (s, p, o) ∈ R}
• The CP synopsis is an in-memory structure. In theory, n distinct characteristic sets can yield up to n^2
characteristic pairs, but in real datasets only a few pairs appear frequently enough to be stored. For example,
in the YAGO-Facts dataset, of the 250000 existing pairs only 5292 appear more than 100 times.
The frequent characteristic pairs for this dataset therefore consume less than 16 KB.
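The definition of characteristic pairs above can be sketched in Python as a counting pass over the triples. The data and the frequency threshold are invented for the example, and `characteristic_pairs` is a hypothetical helper, not code from the paper.

```python
from collections import Counter

def characteristic_pairs(triples):
    """Count (S_c(s), S_c(o), p) for every triple whose object is itself a subject."""
    predicates_of = {}
    for s, p, _o in triples:
        predicates_of.setdefault(s, set()).add(p)
    char_set = {s: frozenset(ps) for s, ps in predicates_of.items()}

    pairs = Counter()
    for s, p, o in triples:
        if o in char_set:  # S_c(o) is non-empty only if o occurs as a subject
            pairs[(char_set[s], char_set[o], p)] += 1
    return pairs

# Invented example data.
triples = [
    ("marie", "hasName", "Marie Curie"), ("marie", "bornIn", "warsaw"),
    ("pierre", "hasName", "Pierre Curie"), ("pierre", "bornIn", "paris"),
    ("warsaw", "label", "Warsaw"), ("warsaw", "locatedIn", "Poland"),
    ("paris", "label", "Paris"), ("paris", "locatedIn", "France"),
]
pairs = characteristic_pairs(triples)

# Keep only frequent pairs; the slide's example uses "more than 100 times",
# a threshold of 1 is used here only because the toy dataset is tiny.
frequent = {key: n for key, n in pairs.items() if n > 1}
```

Here the only pair is (person set, city set, bornIn) with count 2, which is the "foreign key" style correlation the optimizer can use instead of assuming independence.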
Algorithm 3 (part. 2/4)
• Idea: to decompose the query into star-shaped subqueries
connected by chains, and to collapse the subqueries into
meta-nodes
• Input: SPARQL graph
• Output: join ordering for this graph
• Lines 11-24: starts with clustering the query into disjoint
star-shaped subqueries around subjects
• Line 13: order the triple patterns in the query by subject
• Line 15: group triple patterns with identical subjects, since
they potentially form star-shaped subqueries
• Lines 20-23: find stars around objects
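The subject-clustering step (lines 13 and 15) can be sketched as a sort followed by a group-by. This covers only stars around subjects, not the object stars of lines 20-23, and `subject_stars` is a name invented for the illustration.

```python
from itertools import groupby

def subject_stars(patterns):
    """Group triple patterns by subject; each group is a candidate star subquery."""
    by_subject = lambda t: t[0]
    ordered = sorted(patterns, key=by_subject)   # line 13: order by subject
    return {subj: list(group)                    # line 15: group identical subjects
            for subj, group in groupby(ordered, key=by_subject)}

patterns = [
    ("?s1", "hasName", '"Marie Curie"'),
    ("?s2", "label", '"Warsaw"'),
    ("?s1", "bornIn", "?s2"),
    ("?s2", "locatedIn", '"Poland"'),
]
stars = subject_stars(patterns)
for subject, group in stars.items():
    print(subject, [p for _s, p, _o in group])
```

For the example query this produces two stars, one around ?s1 (the person) and one around ?s2 (the city).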
Algorithm 3 (part. 3/4)
• Lines 4-5: for every star it adds the new meta-node to the
query graph and removes the intra-star edges
• Lines 6-7: the plan for the star subquery is computed using
the Hierarchical Characterisation (Algorithm 2) and added to
the DP table along with the meta-node
• Line 8: After all the star subqueries have been optimized, we
add the edges between meta-nodes to the query graph, if
the original graph has edges between the corresponding star
sub-queries
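A rough Python sketch of the collapse into meta-nodes: every star becomes one node, intra-star patterns disappear from the graph, and an edge is kept wherever a pattern of one star points at the subject of another. The `stars` dictionary is hand-written and `meta_edges` is a hypothetical helper; the planning of each star and the DP table are left out.

```python
import math

def meta_edges(stars):
    """Return edges (from_star, to_star, predicate) between distinct star subqueries."""
    edges = set()
    for subject, patterns in stars.items():
        for _s, p, o in patterns:
            if o in stars and o != subject:   # object is the centre of another star
                edges.add((subject, o, p))
    return edges

# Hand-written stars for the Marie Curie example (assumption).
stars = {
    "?s1": [("?s1", "hasName", '"Marie Curie"'), ("?s1", "bornIn", "?s2")],
    "?s2": [("?s2", "label", '"Warsaw"'), ("?s2", "locatedIn", '"Poland"')],
}
edges = meta_edges(stars)

patterns_before = sum(len(ps) for ps in stars.values())
orders_before = math.factorial(patterns_before)   # orderings over triple patterns
orders_after = math.factorial(len(stars))         # orderings over meta-nodes
print(edges, orders_before, orders_after)
```

For this query the four triple patterns collapse into two meta-nodes joined by a single bornIn edge, so the DP step works on 2 nodes instead of 4.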
Algorithm 3 (part. 4/4)
• Line 10: selectivities associated with these edges are
computed using the Characteristic Pairs synopsis, and the
regular Dynamic Programming algorithm starts working on
this simplified graph
• In the following figure, simplifying the graph from 8 nodes to
3 nodes reduces the search space from 8! = 40320 plans to 3! = 6 plans
• This algorithm is also linear in the size of the input graph
Analysis
Conclusions
•The problem is very similar to the Matrix product
•The query simplification technique reduces the search space by
simplifying the query graph before the DP algorithm starts
•The timing analysis shows how important the complexity study is
•There is no complexity analysis, though the paper mentions DP and greedy
algorithms throughout
•The tests did not turn the cache off
•The paper does not cover OPTIONAL clauses of SPARQL, which are equivalent to
left outer joins and cannot be freely reordered with other joins

Mais conteúdo relacionado

Mais procurados (20)

Basic data-structures-v.1.1
Basic data-structures-v.1.1Basic data-structures-v.1.1
Basic data-structures-v.1.1
 
Data structures
Data structuresData structures
Data structures
 
Algorithms 101 for Data Scientists
Algorithms 101 for Data ScientistsAlgorithms 101 for Data Scientists
Algorithms 101 for Data Scientists
 
Analysis of algorithms
Analysis of algorithmsAnalysis of algorithms
Analysis of algorithms
 
Hashing
HashingHashing
Hashing
 
Introduction of data structure
Introduction of data structureIntroduction of data structure
Introduction of data structure
 
Hash table
Hash tableHash table
Hash table
 
Linear search-and-binary-search
Linear search-and-binary-searchLinear search-and-binary-search
Linear search-and-binary-search
 
Searching and Sorting Techniques in Data Structure
Searching and Sorting Techniques in Data StructureSearching and Sorting Techniques in Data Structure
Searching and Sorting Techniques in Data Structure
 
Counting Sort Lowerbound
Counting Sort LowerboundCounting Sort Lowerbound
Counting Sort Lowerbound
 
Ch 1 intriductions
Ch 1 intriductionsCh 1 intriductions
Ch 1 intriductions
 
Hashing
HashingHashing
Hashing
 
Counting sort
Counting sortCounting sort
Counting sort
 
Dynamic Memory & Linked Lists
Dynamic Memory & Linked ListsDynamic Memory & Linked Lists
Dynamic Memory & Linked Lists
 
Data structures
Data structuresData structures
Data structures
 
Binary search
Binary search Binary search
Binary search
 
Unit ii data structure-converted
Unit  ii data structure-convertedUnit  ii data structure-converted
Unit ii data structure-converted
 
Week 2 - Data Structures and Algorithms
Week 2 - Data Structures and AlgorithmsWeek 2 - Data Structures and Algorithms
Week 2 - Data Structures and Algorithms
 
Elementary data structure
Elementary data structureElementary data structure
Elementary data structure
 
Search algorithms master
Search algorithms masterSearch algorithms master
Search algorithms master
 

Destaque

Graphium Chrysalis: Exploiting Graph Database
Graphium Chrysalis: Exploiting Graph DatabaseGraphium Chrysalis: Exploiting Graph Database
Graphium Chrysalis: Exploiting Graph DatabaseGraph-TA
 
LODeX: Schema Summarization and automatic SPARQL query generation for Linked ...
LODeX: Schema Summarization and automatic SPARQL query generation for Linked ...LODeX: Schema Summarization and automatic SPARQL query generation for Linked ...
LODeX: Schema Summarization and automatic SPARQL query generation for Linked ...Fabio Benedetti
 
Sistemas de federação linked data
Sistemas de federação linked dataSistemas de federação linked data
Sistemas de federação linked dataDanusa Ribeiro
 
DBpedia: A Public Data Infrastructure for the Web of Data
DBpedia: A Public Data Infrastructure for the Web of DataDBpedia: A Public Data Infrastructure for the Web of Data
DBpedia: A Public Data Infrastructure for the Web of DataSebastian Hellmann
 
Gathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia EntitiesGathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia EntitiesHeiko Paulheim
 
Open Education Challenge 2014: exploiting Linked Data in Educational Applicat...
Open Education Challenge 2014: exploiting Linked Data in Educational Applicat...Open Education Challenge 2014: exploiting Linked Data in Educational Applicat...
Open Education Challenge 2014: exploiting Linked Data in Educational Applicat...Stefan Dietze
 
Federated SPARQL query processing over the Web of Data
Federated SPARQL query processing over the Web of DataFederated SPARQL query processing over the Web of Data
Federated SPARQL query processing over the Web of DataMuhammad Saleem
 
Introduction to the Data Web, DBpedia and the Life-cycle of Linked Data
Introduction to the Data Web, DBpedia and the Life-cycle of Linked DataIntroduction to the Data Web, DBpedia and the Life-cycle of Linked Data
Introduction to the Data Web, DBpedia and the Life-cycle of Linked DataSören Auer
 
Evaluating Named Entity Recognition and Disambiguation in News and Tweets
Evaluating Named Entity Recognition and Disambiguation in News and TweetsEvaluating Named Entity Recognition and Disambiguation in News and Tweets
Evaluating Named Entity Recognition and Disambiguation in News and TweetsMarieke van Erp
 
LDQL: A Query Language for the Web of Linked Data
LDQL: A Query Language for the Web of Linked DataLDQL: A Query Language for the Web of Linked Data
LDQL: A Query Language for the Web of Linked DataOlaf Hartig
 
Fast Approximate A-box Consistency Checking using Machine Learning
Fast Approximate  A-box Consistency Checking using Machine LearningFast Approximate  A-box Consistency Checking using Machine Learning
Fast Approximate A-box Consistency Checking using Machine LearningHeiko Paulheim
 
Applying Linked Open Data to Public Procurement
Applying Linked Open Data to Public ProcurementApplying Linked Open Data to Public Procurement
Applying Linked Open Data to Public ProcurementJindřich Mynarz
 
Data Mining with Background Knowledge from the Web - Introducing the RapidMin...
Data Mining with Background Knowledge from the Web - Introducing the RapidMin...Data Mining with Background Knowledge from the Web - Introducing the RapidMin...
Data Mining with Background Knowledge from the Web - Introducing the RapidMin...Heiko Paulheim
 
A Provenance assisted Roadmap for Life Sciences Linked Open Data Cloud
A Provenance assisted Roadmap for Life Sciences Linked Open Data CloudA Provenance assisted Roadmap for Life Sciences Linked Open Data Cloud
A Provenance assisted Roadmap for Life Sciences Linked Open Data CloudSyed Muhammad Ali Hasnain
 
Unsupervised Extraction of Attributes and Their Values from Product Description
Unsupervised Extraction of Attributes and Their Values from Product DescriptionUnsupervised Extraction of Attributes and Their Values from Product Description
Unsupervised Extraction of Attributes and Their Values from Product DescriptionRakuten Group, Inc.
 
FedViz: A Visual Interface for SPARQL Queries Formulation and Execution
FedViz: A Visual Interface for SPARQL Queries Formulation and ExecutionFedViz: A Visual Interface for SPARQL Queries Formulation and Execution
FedViz: A Visual Interface for SPARQL Queries Formulation and ExecutionSyed Muhammad Ali Hasnain
 
Tutorial "An Introduction to SPARQL and Queries over Linked Data" Chapter 3 (...
Tutorial "An Introduction to SPARQL and Queries over Linked Data" Chapter 3 (...Tutorial "An Introduction to SPARQL and Queries over Linked Data" Chapter 3 (...
Tutorial "An Introduction to SPARQL and Queries over Linked Data" Chapter 3 (...Olaf Hartig
 

Destaque (20)

Graphium Chrysalis: Exploiting Graph Database
Graphium Chrysalis: Exploiting Graph DatabaseGraphium Chrysalis: Exploiting Graph Database
Graphium Chrysalis: Exploiting Graph Database
 
LODeX: Schema Summarization and automatic SPARQL query generation for Linked ...
LODeX: Schema Summarization and automatic SPARQL query generation for Linked ...LODeX: Schema Summarization and automatic SPARQL query generation for Linked ...
LODeX: Schema Summarization and automatic SPARQL query generation for Linked ...
 
Sistemas de federação linked data
Sistemas de federação linked dataSistemas de federação linked data
Sistemas de federação linked data
 
DBpedia: A Public Data Infrastructure for the Web of Data
DBpedia: A Public Data Infrastructure for the Web of DataDBpedia: A Public Data Infrastructure for the Web of Data
DBpedia: A Public Data Infrastructure for the Web of Data
 
Gathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia EntitiesGathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia Entities
 
Open Education Challenge 2014: exploiting Linked Data in Educational Applicat...
Open Education Challenge 2014: exploiting Linked Data in Educational Applicat...Open Education Challenge 2014: exploiting Linked Data in Educational Applicat...
Open Education Challenge 2014: exploiting Linked Data in Educational Applicat...
 
DBpedia InsideOut
DBpedia InsideOutDBpedia InsideOut
DBpedia InsideOut
 
NLP todo
NLP todoNLP todo
NLP todo
 
Federated SPARQL query processing over the Web of Data
Federated SPARQL query processing over the Web of DataFederated SPARQL query processing over the Web of Data
Federated SPARQL query processing over the Web of Data
 
Linked Data Fragments
Linked Data FragmentsLinked Data Fragments
Linked Data Fragments
 
Introduction to the Data Web, DBpedia and the Life-cycle of Linked Data
Introduction to the Data Web, DBpedia and the Life-cycle of Linked DataIntroduction to the Data Web, DBpedia and the Life-cycle of Linked Data
Introduction to the Data Web, DBpedia and the Life-cycle of Linked Data
 
Evaluating Named Entity Recognition and Disambiguation in News and Tweets
Evaluating Named Entity Recognition and Disambiguation in News and TweetsEvaluating Named Entity Recognition and Disambiguation in News and Tweets
Evaluating Named Entity Recognition and Disambiguation in News and Tweets
 
LDQL: A Query Language for the Web of Linked Data
LDQL: A Query Language for the Web of Linked DataLDQL: A Query Language for the Web of Linked Data
LDQL: A Query Language for the Web of Linked Data
 
Fast Approximate A-box Consistency Checking using Machine Learning
Fast Approximate  A-box Consistency Checking using Machine LearningFast Approximate  A-box Consistency Checking using Machine Learning
Fast Approximate A-box Consistency Checking using Machine Learning
 
Applying Linked Open Data to Public Procurement
Applying Linked Open Data to Public ProcurementApplying Linked Open Data to Public Procurement
Applying Linked Open Data to Public Procurement
 
Data Mining with Background Knowledge from the Web - Introducing the RapidMin...
Data Mining with Background Knowledge from the Web - Introducing the RapidMin...Data Mining with Background Knowledge from the Web - Introducing the RapidMin...
Data Mining with Background Knowledge from the Web - Introducing the RapidMin...
 
A Provenance assisted Roadmap for Life Sciences Linked Open Data Cloud
A Provenance assisted Roadmap for Life Sciences Linked Open Data CloudA Provenance assisted Roadmap for Life Sciences Linked Open Data Cloud
A Provenance assisted Roadmap for Life Sciences Linked Open Data Cloud
 
Unsupervised Extraction of Attributes and Their Values from Product Description
Unsupervised Extraction of Attributes and Their Values from Product DescriptionUnsupervised Extraction of Attributes and Their Values from Product Description
Unsupervised Extraction of Attributes and Their Values from Product Description
 
FedViz: A Visual Interface for SPARQL Queries Formulation and Execution
FedViz: A Visual Interface for SPARQL Queries Formulation and ExecutionFedViz: A Visual Interface for SPARQL Queries Formulation and Execution
FedViz: A Visual Interface for SPARQL Queries Formulation and Execution
 
Tutorial "An Introduction to SPARQL and Queries over Linked Data" Chapter 3 (...
Tutorial "An Introduction to SPARQL and Queries over Linked Data" Chapter 3 (...Tutorial "An Introduction to SPARQL and Queries over Linked Data" Chapter 3 (...
Tutorial "An Introduction to SPARQL and Queries over Linked Data" Chapter 3 (...
 

Semelhante a Exploiting the query structure for efficient join ordering in SPARQL queries

Programming in python
Programming in pythonProgramming in python
Programming in pythonIvan Rojas
 
Taxonomy of Scala
Taxonomy of ScalaTaxonomy of Scala
Taxonomy of Scalashinolajla
 
Optimizing Set-Similarity Join and Search with Different Prefix Schemes
Optimizing Set-Similarity Join and Search with Different Prefix SchemesOptimizing Set-Similarity Join and Search with Different Prefix Schemes
Optimizing Set-Similarity Join and Search with Different Prefix SchemesHPCC Systems
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learningPaco Nathan
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkCloudera, Inc.
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2Gal Marder
 
ensembles_emptytemplate_v2
ensembles_emptytemplate_v2ensembles_emptytemplate_v2
ensembles_emptytemplate_v2Shrayes Ramesh
 
Think Like Spark: Some Spark Concepts and a Use Case
Think Like Spark: Some Spark Concepts and a Use CaseThink Like Spark: Some Spark Concepts and a Use Case
Think Like Spark: Some Spark Concepts and a Use CaseRachel Warren
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in SparkPaco Nathan
 
Data structure and algorithm.
Data structure and algorithm. Data structure and algorithm.
Data structure and algorithm. Abdul salam
 
PPT Lecture 2.2.1 onn c++ data structures
PPT Lecture 2.2.1 onn c++ data structuresPPT Lecture 2.2.1 onn c++ data structures
PPT Lecture 2.2.1 onn c++ data structuresmidtushar
 
Data Step Hash Object vs SQL Join
Data Step Hash Object vs SQL JoinData Step Hash Object vs SQL Join
Data Step Hash Object vs SQL JoinGeoff Ness
 
Query Decomposition and data localization
Query Decomposition and data localization Query Decomposition and data localization
Query Decomposition and data localization Hafiz faiz
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)Paul Chao
 
Start From A MapReduce Graph Pattern-recognize Algorithm
Start From A MapReduce Graph Pattern-recognize AlgorithmStart From A MapReduce Graph Pattern-recognize Algorithm
Start From A MapReduce Graph Pattern-recognize AlgorithmYu Liu
 
Processing Large Graphs
Processing Large GraphsProcessing Large Graphs
Processing Large GraphsNishant Gandhi
 
"Building Diversified Portfolios that Outperform Out-of-Sample" by Dr. Marcos...
"Building Diversified Portfolios that Outperform Out-of-Sample" by Dr. Marcos..."Building Diversified Portfolios that Outperform Out-of-Sample" by Dr. Marcos...
"Building Diversified Portfolios that Outperform Out-of-Sample" by Dr. Marcos...Quantopian
 

Semelhante a Exploiting the query structure for efficient join ordering in SPARQL queries (20)

Programming in python
Programming in pythonProgramming in python
Programming in python
 
Taxonomy of Scala
Taxonomy of ScalaTaxonomy of Scala
Taxonomy of Scala
 
Optimizing Set-Similarity Join and Search with Different Prefix Schemes
Optimizing Set-Similarity Join and Search with Different Prefix SchemesOptimizing Set-Similarity Join and Search with Different Prefix Schemes
Optimizing Set-Similarity Join and Search with Different Prefix Schemes
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
 
QBIC
QBICQBIC
QBIC
 
ensembles_emptytemplate_v2
ensembles_emptytemplate_v2ensembles_emptytemplate_v2
ensembles_emptytemplate_v2
 
Think Like Spark: Some Spark Concepts and a Use Case
Think Like Spark: Some Spark Concepts and a Use CaseThink Like Spark: Some Spark Concepts and a Use Case
Think Like Spark: Some Spark Concepts and a Use Case
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 
Data structure and algorithm.
Data structure and algorithm. Data structure and algorithm.
Data structure and algorithm.
 
PPT Lecture 2.2.1 onn c++ data structures
PPT Lecture 2.2.1 onn c++ data structuresPPT Lecture 2.2.1 onn c++ data structures
PPT Lecture 2.2.1 onn c++ data structures
 
Data Step Hash Object vs SQL Join
Data Step Hash Object vs SQL JoinData Step Hash Object vs SQL Join
Data Step Hash Object vs SQL Join
 
Query Decomposition and data localization
Query Decomposition and data localization Query Decomposition and data localization
Query Decomposition and data localization
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Start From A MapReduce Graph Pattern-recognize Algorithm
Start From A MapReduce Graph Pattern-recognize AlgorithmStart From A MapReduce Graph Pattern-recognize Algorithm
Start From A MapReduce Graph Pattern-recognize Algorithm
 
Processing Large Graphs
Processing Large GraphsProcessing Large Graphs
Processing Large Graphs
 
"Building Diversified Portfolios that Outperform Out-of-Sample" by Dr. Marcos...
"Building Diversified Portfolios that Outperform Out-of-Sample" by Dr. Marcos..."Building Diversified Portfolios that Outperform Out-of-Sample" by Dr. Marcos...
"Building Diversified Portfolios that Outperform Out-of-Sample" by Dr. Marcos...
 

Mais de Luiz Henrique Zambom Santana

Perspectives on the use of data in Agriculture - Luiz Santana - Leaf Agricult...
Perspectives on the use of data in Agriculture - Luiz Santana - Leaf Agricult...Perspectives on the use of data in Agriculture - Luiz Santana - Leaf Agricult...
Perspectives on the use of data in Agriculture - Luiz Santana - Leaf Agricult...Luiz Henrique Zambom Santana
 
Apache Sedona: how to process petabytes of agronomic data with Spark
Apache Sedona: how to process petabytes of agronomic data with SparkApache Sedona: how to process petabytes of agronomic data with Spark
Apache Sedona: how to process petabytes of agronomic data with SparkLuiz Henrique Zambom Santana
 
De Arquiteto para Gerente: como debugar uma equipe
De Arquiteto para Gerente: como debugar uma equipeDe Arquiteto para Gerente: como debugar uma equipe
De Arquiteto para Gerente: como debugar uma equipeLuiz Henrique Zambom Santana
 
VoltDB: as vantagens e os desafios dos banco de dados NewSQL
VoltDB: as vantagens e os desafios dos banco de dados NewSQLVoltDB: as vantagens e os desafios dos banco de dados NewSQL
VoltDB: as vantagens e os desafios dos banco de dados NewSQLLuiz Henrique Zambom Santana
 
Uma visão sobre Fast-Data: Spark, VoltDB e Elasticsearch
Uma visão sobre Fast-Data: Spark, VoltDB e ElasticsearchUma visão sobre Fast-Data: Spark, VoltDB e Elasticsearch
Uma visão sobre Fast-Data: Spark, VoltDB e ElasticsearchLuiz Henrique Zambom Santana
 
Workload-Aware RDF Partitioning and SPARQL Query Caching for Massive RDF Gra...
Workload-Aware RDF Partitioning  and SPARQL Query Caching for Massive RDF Gra...Workload-Aware RDF Partitioning  and SPARQL Query Caching for Massive RDF Gra...
Workload-Aware RDF Partitioning and SPARQL Query Caching for Massive RDF Gra...Luiz Henrique Zambom Santana
 
A middleware for storing massive RDF graphs into NoSQL
A middleware for storing massive RDF graphs into NoSQLA middleware for storing massive RDF graphs into NoSQL
A middleware for storing massive RDF graphs into NoSQLLuiz Henrique Zambom Santana
 
A Workload-Aware Middleware for Storing Massive RDF Graphs into NoSQL Databases
A Workload-Aware Middleware for Storing Massive RDF Graphs into NoSQL DatabasesA Workload-Aware Middleware for Storing Massive RDF Graphs into NoSQL Databases
A Workload-Aware Middleware for Storing Massive RDF Graphs into NoSQL DatabasesLuiz Henrique Zambom Santana
 
Como modelar, integrar e desenvolver aplicações com múltiplos bancos de dados...
Como modelar, integrar e desenvolver aplicações com múltiplos bancos de dados...Como modelar, integrar e desenvolver aplicações com múltiplos bancos de dados...
Como modelar, integrar e desenvolver aplicações com múltiplos bancos de dados...Luiz Henrique Zambom Santana
 
Novidades do elasticsearch 2.0 e como usá-lo com PHP
Novidades do elasticsearch 2.0 e como usá-lo com PHPNovidades do elasticsearch 2.0 e como usá-lo com PHP
Novidades do elasticsearch 2.0 e como usá-lo com PHPLuiz Henrique Zambom Santana
 

Mais de Luiz Henrique Zambom Santana (20)

Perspectives on the use of data in Agriculture - Luiz Santana - Leaf Agricult...
Perspectives on the use of data in Agriculture - Luiz Santana - Leaf Agricult...Perspectives on the use of data in Agriculture - Luiz Santana - Leaf Agricult...
Perspectives on the use of data in Agriculture - Luiz Santana - Leaf Agricult...
 
Apache Sedona: how to process petabytes of agronomic data with Spark
Apache Sedona: how to process petabytes of agronomic data with SparkApache Sedona: how to process petabytes of agronomic data with Spark
Apache Sedona: how to process petabytes of agronomic data with Spark
 
De Arquiteto para Gerente: como debugar uma equipe
De Arquiteto para Gerente: como debugar uma equipeDe Arquiteto para Gerente: como debugar uma equipe
De Arquiteto para Gerente: como debugar uma equipe
 
VoltDB: as vantagens e os desafios dos banco de dados NewSQL
VoltDB: as vantagens e os desafios dos banco de dados NewSQLVoltDB: as vantagens e os desafios dos banco de dados NewSQL
VoltDB: as vantagens e os desafios dos banco de dados NewSQL
 
IBM Watson, Apache Spark ou TensorFlow?
IBM Watson, Apache Spark ou TensorFlow?IBM Watson, Apache Spark ou TensorFlow?
IBM Watson, Apache Spark ou TensorFlow?
 
Uma visão sobre Fast-Data: Spark, VoltDB e Elasticsearch
Uma visão sobre Fast-Data: Spark, VoltDB e ElasticsearchUma visão sobre Fast-Data: Spark, VoltDB e Elasticsearch
Uma visão sobre Fast-Data: Spark, VoltDB e Elasticsearch
 
Banco de dados nas nuvens - aula 3
Exploiting the query structure for efficient join ordering in SPARQL queries

  • 1. Exploiting the query structure for efficient join ordering in SPARQL queries Luiz Henrique Zambom Santana Vinicius da Silveira Segalin
  • 2. Agenda •Paper and authors •Background •Problem and solution •Example •Algorithms •Analysis •Conclusions
  • 3. Paper and authors Gubichev, Andrey, and Thomas Neumann. "Exploiting the query structure for efficient join ordering in SPARQL queries." EDBT. 2014. Extending Database Technology – Qualis A2/H-index 52
  • 4. Background •SPARQL • W3C standard • Semantic Web • Inspired by SQL •Query structure •Join ordering (similar to ordering a chain of matrix products)
  • 5. Problem •The join ordering problem is a fundamental challenge that has to be solved by any query optimizer •The computation time differs depending on the order of the joins •SQL solutions are not immediately capable of handling large SPARQL queries. The paper introduces a new join ordering algorithm that performs a SPARQL-tailored query simplification
  • 6. Problem •Cardinality estimation is an essential part of any cost-based query optimizer •Two different approaches: • RDF-3X: query compilation time (dominated by finding the optimal join order) is one order of magnitude higher than the actual execution time • Virtuoso 7: greedy algorithm for compilation leads to a slow run time (sub-optimal order)
  • 7. Solution •Best of both worlds: • A heuristic that spends a reasonable amount of time optimizing the query and still finds a decent join order • The paper presents a SPARQL-tailored query simplification procedure that decomposes the query’s join graph into star-shaped subqueries and chain-shaped subqueries
  • 8. Challenges •RDF can be very verbose • TPC-H Query 2 written in SPARQL contains joins between 26 index scans (as opposed to joins between 5 tables in the SQL formulation) • Number of plans: • 5! = 120 plans in SQL vs 26! ≈ 4 × 10^26 in SPARQL •Lack of schema • Foreign keys become structural correlations
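The plan counts quoted on this slide can be checked with a few lines of Python. This is only a sanity check of the arithmetic; the function name `ordering_count` is made up for illustration, and it counts plain orderings of the inputs (n!), as the slide does.

```python
import math

def ordering_count(n_inputs: int) -> int:
    """Number of ways to order n join inputs (n!), the count used on the slide."""
    return math.factorial(n_inputs)

sql_plans = ordering_count(5)      # 5 tables in the SQL formulation
sparql_plans = ordering_count(26)  # 26 index scans in the SPARQL formulation

print(sql_plans)
print(f"{sparql_plans:.2e}")
```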
  • 9. Solution • The characteristic set of a subject s defines the properties (attributes) of an entity and thus, in a sense, its class (type): subjects that have the same characteristic set tend to be similar • Hierarchical Characterization: • 1. $H_0$ is the set of all characteristic sets of $R$ • 2. $H_i = \{\arg\min_{C \subset S,\ |C| = |S| - 1} \mathrm{cost}(C) \mid S \in H_{i-1}\}$, that is, $H_i$ consists of the subsets $C$ of sets from $H_{i-1}$ that minimize $\mathrm{cost}(C)$ • 3. $\forall S \in H_k : |S| = 2$ • 4. Every $S \in H_{i-1}$ stores a pointer to its cheapest subset $C \in H_i$
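A minimal Python sketch of the four conditions above, on a toy dataset. Everything here is an assumption made for illustration: the triples, the function names, and above all `toy_cost`, which simply counts the subjects carrying every predicate of a subset and stands in for the cost function of the paper.

```python
from collections import defaultdict

def characteristic_sets(triples):
    """Map each subject to the set of predicates it appears with."""
    preds = defaultdict(set)
    for s, p, _o in triples:
        preds[s].add(p)
    return {s: frozenset(ps) for s, ps in preds.items()}

def toy_cost(subset, char_sets):
    """Hypothetical cost: number of subjects that have all predicates in subset."""
    return sum(1 for cs in char_sets.values() if subset <= cs)

def hierarchical_characterization(char_sets, cost):
    """Build levels H_0..H_k and the pointer from each set to its cheapest subset."""
    level = set(char_sets.values())  # H_0: all characteristic sets
    levels = [level]
    cheapest = {}
    while any(len(s) > 2 for s in level):
        next_level = set()
        for s in level:
            if len(s) <= 2:
                continue  # nothing smaller to descend to
            candidates = [s - {p} for p in sorted(s)]  # all subsets of size |S| - 1
            # Ties are broken by predicate names to keep the result deterministic.
            best = min(candidates, key=lambda c: (cost(c, char_sets), sorted(c)))
            cheapest[s] = best
            next_level.add(best)
        level = next_level
        levels.append(level)
    return levels, cheapest

toy_triples = [
    ("alice", "created", "w1"), ("alice", "bornIn", "c1"),
    ("alice", "livedIn", "c2"), ("alice", "hasName", "Alice"),
    ("bob", "bornIn", "c1"), ("bob", "livedIn", "c3"), ("bob", "hasName", "Bob"),
    ("carol", "bornIn", "c2"), ("carol", "hasName", "Carol"),
]
char_sets = characteristic_sets(toy_triples)
levels, cheapest = hierarchical_characterization(char_sets, toy_cost)
```

Each pass of the loop produces one level $H_i$, and the loop stops at the first level in which every set has exactly two predicates.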
  • 10. Algorithm 1 (part. 1) • Line 2: S=[{created, bornIn, livedIn, hasName}, { bornIn, livedIn, hasName},...] • Line 8: initializes Banker's iteration, i.e. the enumeration of predicate sets from the smallest to the largest possible set
  • 11. Algorithm 1 (part. 2) • Line 12: guarantees that S2 is smaller than S1 • Lines 15-16: find the subsets that have the smaller cost • Cost: although Banker’s iteration potentially enumerates all the subsets of all predicates in the dataset, in practice it stops relatively early, since it is always bounded by the largest set in Sets
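The size-bounded enumeration described here can be approximated as follows. This is a sketch, not the paper's code: it uses `itertools.combinations` to visit subsets in order of increasing size, which matches the "smallest to largest" behaviour but not necessarily the exact order within one size that a true Banker's sequence produces. The name `bankers_subsets` and the `max_size` parameter (standing in for the size of the largest set in Sets) are assumptions.

```python
from itertools import combinations

def bankers_subsets(predicates, max_size):
    """Yield non-empty subsets of predicates by increasing size, up to max_size."""
    items = sorted(predicates)
    for size in range(1, min(max_size, len(items)) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

# With three predicates and a bound of 2, the full 3-element set is never produced.
subsets = list(bankers_subsets({"bornIn", "livedIn", "hasName"}, max_size=2))
for s in subsets:
    print(sorted(s))
```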
  • 12. Algorithm 2 (part. 1/2) • Objective: finding the optimal join order in (sub)queries of the form: select * where {?s p1 ?o1 . . . . ?s pk ?ok } • Idea: extract the part of the Hierarchical Characterisation of the dataset starting with the set S • Input: star-shaped graph • Output: order of the joins • Lines 1-9: • While |S| > 2, find the most expensive subset and push it to the front of O
  • 13. Algorithm 2 (part. 2/2) • The first part leads to the optimal order for star-shaped queries in time linear in the graph size • However, it does not find the optimal solution if the query has constants: select * where {?s p1 “Berlin”. . . . ?s pk ?ok } • Then: • Lines 12-14: only one of the bound objects is in the triple with the key predicate, i.e., the entire star query is a lookup of the properties of a specific entity • Lines 15-16: otherwise (many objects are keys), keep pushing down the constants in the join tree and stop when the cost of the corresponding index scan is bigger than the cost of the join on that level of the tree
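One possible reading of the first part (lines 1-9) is sketched in Python here, under stated assumptions: every predicate set stores a pointer to its cheapest subset, and the predicate dropped when following that pointer is pushed to the front of the order, so it is joined after everything still remaining. The `cheapest` map is hand-written toy data, and `order_star` is a hypothetical name; constants in the query are not handled.

```python
def order_star(predicates, cheapest):
    """Walk the cheapest-subset pointers and return predicates in join order."""
    current = frozenset(predicates)
    order = []
    while len(current) > 2:
        smaller = cheapest[current]        # pointer stored in the hierarchy
        (dropped,) = current - smaller     # exactly one predicate is removed
        order.insert(0, dropped)           # push to the front of O
        current = smaller
    # The last two predicates form the first join; the rest follow in order.
    return sorted(current) + order

# Hypothetical pointers for one star; in practice they come from the dataset.
cheapest = {
    frozenset({"created", "bornIn", "livedIn", "hasName"}):
        frozenset({"created", "bornIn", "hasName"}),
    frozenset({"created", "bornIn", "hasName"}):
        frozenset({"created", "bornIn"}),
}
join_order = order_star({"created", "bornIn", "livedIn", "hasName"}, cheapest)
print(join_order)
```

The loop runs once per predicate beyond the first two, which is consistent with the linear running time claimed for star-shaped queries.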
  • 14. Algorithm 3 (part. 1/4) • Objective: ordering joins in general SPARQL queries (s1 , hasName, "Marie Curie"), (s1 , bornIn, s2 ), (s2 , label, "Warsaw"), (s2 , locatedIn, "Poland") • Problem: s2 links a person to a city, corresponding to a "foreign key", but RDF does not require any schema. Knowledge of such dependencies is extremely useful for the query optimizer: without it, the optimizer has to assume independence between two entities linked via the bornIn predicate, thus almost inevitably underestimating the selectivity of the join of the corresponding triple patterns • Thus, it uses Characteristic Pairs (Paar Charakteristisch) in order to discover this kind of relation, where: $PC(S_c(s), S_c(o)) = \{(S_c(s), S_c(o), p) \mid S_c(o) \neq \emptyset \wedge \exists p : (s, p, o) \in R\}$ • The set of characteristic pairs is an in-memory structure. In theory, n distinct characteristic sets can yield up to n^2 characteristic pairs, but in real datasets only a few pairs appear frequently enough to be stored. For example, in the YAGO-Facts dataset, of the 250000 existing pairs only 5292 appear more than 100 times. This way, the frequent characteristic pairs consume less than 16 KB.
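The definition of characteristic pairs can be illustrated with a small Python sketch over the example query's data. The triples, the function names and the frequency threshold are assumptions for illustration; an object counts as having a non-empty characteristic set when it also occurs as a subject.

```python
from collections import Counter, defaultdict

def characteristic_sets(triples):
    """Map each subject to the set of predicates it appears with."""
    preds = defaultdict(set)
    for s, p, _o in triples:
        preds[s].add(p)
    return {s: frozenset(ps) for s, ps in preds.items()}

def characteristic_pairs(triples):
    """Count (Sc(s), Sc(o), p) for every triple whose object is itself a subject."""
    sc = characteristic_sets(triples)
    counts = Counter()
    for s, p, o in triples:
        if o in sc:  # Sc(o) is non-empty
            counts[(sc[s], sc[o], p)] += 1
    return counts

def frequent_pairs(counts, threshold):
    """Keep only the pairs that occur more than threshold times."""
    return {pair: n for pair, n in counts.items() if n > threshold}

toy_triples = [
    ("marie", "hasName", "Marie Curie"),
    ("marie", "bornIn", "warsaw"),
    ("warsaw", "label", "Warsaw"),
    ("warsaw", "locatedIn", "Poland"),
]
pairs = characteristic_pairs(toy_triples)
person_city = (
    frozenset({"hasName", "bornIn"}),
    frozenset({"label", "locatedIn"}),
    "bornIn",
)
```

Here the only pair links the person-like characteristic set to the city-like one via bornIn, which is the "foreign key" correlation the slide refers to.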
  • 15. Algorithm 3 (part. 2/4) • Idea: decompose the query into star-shaped subqueries connected by chains, and collapse the subqueries into meta-nodes • Input: SPARQL graph • Output: join ordering for this graph • Lines 11-24: start by clustering the query into disjoint star-shaped subqueries around subjects • Line 13: order the triple patterns in the query by subject • Line 15: group triple patterns with identical subjects, since they potentially form star-shaped subqueries • Lines 20-23: find stars around objects
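The subject-clustering step (sort by subject, then group identical subjects) might look like this in Python. It is a sketch only: `subject_stars` is a hypothetical name, a group is treated as a star when it has at least two patterns, and the search for stars around objects is left out.

```python
from itertools import groupby

def subject_stars(patterns):
    """Group triple patterns by subject; groups of two or more form a star."""
    ordered = sorted(patterns, key=lambda t: t[0])  # order patterns by subject
    stars = {}
    for subject, group in groupby(ordered, key=lambda t: t[0]):
        members = list(group)
        if len(members) >= 2:
            stars[subject] = members
    return stars

query_patterns = [
    ("?s1", "hasName", '"Marie Curie"'),
    ("?s2", "label", '"Warsaw"'),
    ("?s1", "bornIn", "?s2"),
    ("?s2", "locatedIn", '"Poland"'),
]
stars = subject_stars(query_patterns)
for subject, members in stars.items():
    print(subject, [p for _s, p, _o in members])
```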
  • 16. Algorithm 3 (part. 3/4) • Lines 4-5: for every star it adds the new meta-node to the query graph and removes the intra-star edges • Lines 6-7: the plan for the star subquery is computed using the Hierarchical Characterisation (Algorithm 2) and added to the DP table along with the meta-node • Line 8: After all the star subqueries have been optimized, we add the edges between meta-nodes to the query graph, if the original graph has edges between the corresponding star sub-queries
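A small Python sketch of the last step, adding edges between meta-nodes, assuming each subject variable has already been collapsed into one meta-node. The function name `meta_edges` and the tuple layout are illustrative choices; the per-star plans and the DP table are not modelled.

```python
def meta_edges(patterns):
    """Return edges (star_a, star_b, predicate) between distinct subject stars."""
    subjects = {s for s, _p, _o in patterns}  # one meta-node per subject
    edges = set()
    for s, p, o in patterns:
        # An object that is the centre of another star links the two meta-nodes.
        if o in subjects and o != s:
            edges.add((s, o, p))
    return edges

query_patterns = [
    ("?s1", "hasName", '"Marie Curie"'),
    ("?s1", "bornIn", "?s2"),
    ("?s2", "label", '"Warsaw"'),
    ("?s2", "locatedIn", '"Poland"'),
]
edges = meta_edges(query_patterns)
print(edges)
```

For the example query, the four triple patterns shrink to two meta-nodes joined by a single bornIn edge.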
  • 17. Algorithm 3 (part. 4/4) • Line 10: selectivities associated with these edges are computed using the Characteristic Pairs synopsis, and the regular Dynamic Programming algorithm starts working on this simplified graph • In the following figure, simplifying the graph from 8 nodes to 3 nodes reduces the search space from 8! = 40320 plans to 3! = 6 plans • This algorithm is also linear in the size of the input graph
  • 19. Conclusions •The problem is very similar to the matrix product ordering problem •The query simplification technique reduces the search space size by simplifying the query graph before the DP algorithm starts •The time analysis shows how important the complexity study is •There is no complexity analysis, though the paper mentions DP and greedy algorithms throughout •The tests did not turn the cache off •The paper does not cover OPTIONAL clauses of SPARQL, which are equivalent to left outer joins and cannot be freely reordered with other joins