Linked Data Query Processing Strategies

Günter Ladwig, Thanh Tran
International Semantic Web Conference 2010, Shanghai

Institute of Applied Informatics and Formal Description Methods (AIFB)

KIT – University of the State of Baden-Württemberg and
National Large-scale Research Center of the Helmholtz Association www.kit.edu

Contents

Introduction
Challenges
Contributions
Linked Data Query Processing Strategies
Stream-based Query Processing
Corrective Source Ranking
Evaluation
Conclusion

November 11th, 2010, ISWC 2010, Shanghai, China

What is Linked Data?

Linked Data Principles
Use URIs to identify things
Use HTTP URIs that allow dereferencing
Dereferencing a URI provides information about the thing in a
standard format (RDF)
Include links to other, related URIs
Linked Data Query Processing
Evaluate queries directly over Linked Data
Dereference Linked Data URIs during query processing


Challenges

Volume of Source Collection
Each URI is a potential data source
Dynamics of Source Collection
Sources may change rapidly over time
Sources might only be discovered at run-time
Heterogeneity of Sources, Source Descriptions and
Access Methods
Sources vary in size
Descriptions of sources vary in completeness
Access methods: URI lookup, SPARQL endpoints, local cache, ...


Contributions

Discussion of Linked Data Query Processing strategies
Mixed strategy, combining local indexes and run-time
discovery
Data can arrive at any time and in any order
Suited to deal with network latency
Deals with different types of source descriptions
Ranking is refined at run-time


LINKED DATA QUERY PROCESSING STRATEGIES


Top-down Query Evaluation
SELECT ?paper ?author WHERE {
  ?paper swrc:author ?author .
  ?paper swc:isPartOf ?proc .
  ?proc swc:relatedToEvent <http://sw.org/eswc/2010> .
}

[Figure: the local source index is probed; sources are selected and ranked by score (e.g. http://sw.org/person/AB, score 0.87); the selected sources are retrieved and their data is joined]

Local index, assumed to be complete
Selection and ranking of sources
No run-time discovery
Fast, only relevant sources are retrieved
Not up-to-date, index size may become very large


Bottom-up Query Evaluation
SELECT ?paper ?author WHERE {
  ?paper swrc:author ?author .
  ?paper swc:isPartOf ?proc .
  ?proc swc:relatedToEvent <http://semweb.org/eswc/2010> .
}

[Figure: a source is retrieved, e.g. one containing
<http://sw.org/proc/eswc/2010> swc:relatedToEvent <http://sw.org/eswc/2010> .
and new sources are discovered through its links, e.g.
swc:paper1 swc:isPartOf <http://sw.org/proc/eswc/2010> .]

Sources are discovered at run-time through links
Answers can be incomplete, as links might not be discoverable
Slower, as unnecessary sources are retrieved
Always up-to-date

Mixed Strategy

Combination of top-down and bottom-up strategies
Partial local index of sources, not assumed to be complete
New sources are discovered at run-time
Addresses volume and dynamics of Linked Data
Deals with heterogeneous source descriptions
Deals with the unpredictable nature of Linked Data access
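The mixed strategy above can be illustrated with a minimal sketch. It assumes an in-memory stand-in for the Web (`WEB`), a partial index with made-up scores (`LOCAL_INDEX`), and a fixed default score for sources discovered at run-time (`DEFAULT_SCORE`). All of these names and values are hypothetical and do not come from the slides; the actual system ranks sources with the metrics described later.

```python
import heapq

# Hypothetical stand-in for the Web: source URI -> triples (s, p, o).
WEB = {
    "http://example.org/proc": [
        ("http://example.org/proc", "swc:relatedToEvent", "http://example.org/event"),
        ("http://example.org/paper1", "swc:isPartOf", "http://example.org/proc"),
    ],
    "http://example.org/paper1": [
        ("http://example.org/paper1", "swrc:author", "http://example.org/person/AB"),
    ],
}

# Partial local index with assumed scores; it does not know every source.
LOCAL_INDEX = {"http://example.org/proc": 0.9}

DEFAULT_SCORE = 0.1  # assumed score for sources discovered at run-time


def mixed_strategy(local_index, web, max_sources=10):
    """Retrieve sources best-first, seeded from the index, following links."""
    # heapq is a min-heap, so scores are negated to pop the best source first.
    queue = [(-score, uri) for uri, score in local_index.items()]
    heapq.heapify(queue)
    seen = set(local_index)
    retrieved, triples = [], []
    while queue and len(retrieved) < max_sources:
        _, uri = heapq.heappop(queue)
        retrieved.append(uri)
        for s, p, o in web.get(uri, []):
            triples.append((s, p, o))
            # Run-time discovery: every URI mentioned is a potential source.
            for link in (s, o):
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    heapq.heappush(queue, (-DEFAULT_SCORE, link))
    return retrieved, triples


order, data = mixed_strategy(LOCAL_INDEX, WEB)
print(order)
```

The indexed source is retrieved first; the remaining sources are only reachable because they are linked from data retrieved at run-time.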


STREAM-BASED QUERY PROCESSING


Stream-based Query Processing

Network latency
Do not block!
Evaluation driven by incoming data
Compile-time
Construct query plan
Probe local index for sources
Run-time
Rank sources
Retrieve sources
Push data into query plan
Discover new sources

[Figure: a query plan joining worksAt(?x, dbpedia:KIT), knows(?x, ?y) and name(?y, ?n) reports results. A source ranker orders sources from the local source index and sources discovered at run-time (e.g. Source 1, score 1.0; Source 2, score 0.7), using samples from the query plan. Source retrievers retrieve the ranked sources and push their data into the query plan.]


Push-based Symmetric Hash Join

Operation
Maintains a hash table for each input
Arriving tuples are inserted into one hash table and then the other is probed for join combinations
Push-based
Tuples are pushed into operators from the leaves to the root of the query plan
Execution driven by incoming tuples instead of results
Results reported as soon as input tuples arrive
Tuples can arrive on all inputs in any order

[Figure: tuple t7 with key b is pushed on the left input. It is inserted into the left hash table (a: t1, t3; b: t2, t7), the right hash table (b: t4, t5; c: t6) is probed, and the join results (t7, t4) and (t7, t5) are pushed as output.]
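The operation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name, the callback standing in for the parent operator, and the order in which t1 to t6 arrive are assumptions; only the tuple names and keys are taken from the slide's figure.

```python
from collections import defaultdict


class SymmetricHashJoin:
    """Push-based symmetric hash join on a single join key."""

    def __init__(self, on_result):
        # One hash table per input: join key -> tuples seen so far.
        self.tables = {"left": defaultdict(list), "right": defaultdict(list)}
        self.on_result = on_result  # callback standing in for the parent operator

    def push(self, side, key, tup):
        other = "right" if side == "left" else "left"
        # 1. Insert the arriving tuple into the hash table of its own input.
        self.tables[side][key].append(tup)
        # 2. Probe the other hash table and push every combination upwards.
        for match in self.tables[other].get(key, []):
            pair = (tup, match) if side == "left" else (match, tup)
            self.on_result(pair)


results = []
join = SymmetricHashJoin(results.append)

# Build the state shown in the figure before t7 arrives (arrival order assumed).
for side, key, tup in [
    ("left", "a", "t1"), ("left", "b", "t2"), ("left", "a", "t3"),
    ("right", "b", "t4"), ("right", "b", "t5"), ("right", "c", "t6"),
]:
    join.push(side, key, tup)

before = list(results)
join.push("left", "b", "t7")
new = results[len(before):]
print(new)  # [('t7', 't4'), ('t7', 't5')]
```

Because each arriving tuple is handled immediately and independently of the other input, the operator never waits on a slow source, which is the property the stream-based approach relies on.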

CORRECTIVE SOURCE RANKING



Prefer more relevant sources
Relevancy of a source is based on
Current query
Any available intermediate results
Overall optimization goal
Define a set of source features and derive concrete
source metrics
Not all metrics are available for all sources (heterogeneity)
Refine previously computed metrics using newly
discovered information


Source Features and Metrics

Source is more relevant if it contains data that contributes
to answers of the query
Triple Pattern Cardinality

Join Pattern Cardinality

Cardinalities stored in local index
Some patterns have high cardinality for all or many
sources (e.g. )
These patterns do not discriminate sources


Source Features and Metrics

Adopt TF-IDF concept to obtain weights for triple patterns
Importance positively correlates with how often bindings to a
pattern occur in a source (i.e. cardinality)
Importance negatively correlates with how often its bindings occur
in all sources of the source collection S
Triple Frequency – Inverse Source Frequency (TF-ISF)
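The formula itself did not survive in this transcript, so the sketch below only illustrates the TF-IDF analogy the slide describes: the weight grows with a pattern's cardinality in a source and shrinks when the pattern matches in many sources. The exact weighting (cardinality times the logarithm of the inverse source frequency), the function name and the index contents are assumptions, not the authors' definition.

```python
import math

# Hypothetical index: source -> {triple pattern -> cardinality}.
CARDINALITIES = {
    "src1": {"?x rdf:type ?y": 100, "?x swrc:author ?y": 20},
    "src2": {"?x rdf:type ?y": 80},
    "src3": {"?x rdf:type ?y": 120, "?x swc:isPartOf ?y": 5},
}


def tf_isf(pattern, source, cardinalities):
    """TF-IDF-style weight: cardinality times log inverse source frequency."""
    tf = cardinalities[source].get(pattern, 0)
    # Number of sources in the collection S with at least one binding.
    containing = sum(
        1 for cards in cardinalities.values() if cards.get(pattern, 0) > 0
    )
    if tf == 0 or containing == 0:
        return 0.0
    return tf * math.log(len(cardinalities) / containing)


common = tf_isf("?x rdf:type ?y", "src1", CARDINALITIES)
rare = tf_isf("?x swrc:author ?y", "src1", CARDINALITIES)
print(common, rare)
```

In this toy index the pattern that matches in every source gets weight zero despite its high cardinality, while the pattern found in a single source keeps a positive weight, so it is the one that discriminates between sources.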


Source Features and Metrics - Links

Source linked from many other sources is more relevant
Relevance is higher when these links match query
predicates
Links are only discovered at run-time


Metric Correction and Refinement

During query processing new information becomes
available: intermediate join results, links
Refine and correct previously computed metrics
Important in the case of non-discriminative patterns
Instantiate triple pattern of a join with samples of
intermediate results to obtain better join size estimates
Example: intermediate results in the SHJ operator are used to perform triple pattern cardinality lookups
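One way to read the sampling step is sketched below, under stated assumptions: a sample of bindings is drawn from the intermediate results, the cardinality of the instantiated triple pattern is looked up for each sampled binding, and the average is scaled to all intermediate results. The scaling rule, the function name and the lookup table are hypothetical; the slides do not specify the estimator.

```python
import random

# Hypothetical per-source lookup: cardinality of name(?y, ?n) once ?y is
# instantiated with a concrete binding taken from the intermediate results.
INSTANTIATED_CARDINALITY = {"ex:alice": 2, "ex:bob": 1}


def estimate_join_size(bindings, lookup, sample_size, rng):
    """Scale the average cardinality of sampled lookups to all bindings."""
    if not bindings:
        return 0.0
    sample = rng.sample(bindings, min(sample_size, len(bindings)))
    matched = sum(lookup.get(value, 0) for value in sample)
    return matched / len(sample) * len(bindings)


# Bindings for ?y currently held in the join operator's hash table (made up).
bindings = ["ex:alice", "ex:bob", "ex:carol", "ex:dave"]
estimate = estimate_join_size(
    bindings, INSTANTIATED_CARDINALITY, sample_size=4, rng=random.Random(0)
)
print(estimate)  # 3.0
```

With the sample covering all four bindings the estimate is exact; smaller samples trade accuracy for fewer index lookups.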


Ranking at Run-time

Optimization goal: early result reporting
Indexed sources: triple and join pattern cardinality, TF-ISF,
weighted links, sampled join size estimates
Discovered sources: weighted links
Ranking has to be refined at run-time
Parameters influencing behavior and cost of ranking
process
Invalid Score Threshold: ranking is performed when the number
of sources with invalid scores passes a threshold
Sample Size: larger samples for join size estimation give better
estimates, but are also more costly
Resampling Threshold: cache join size estimates and perform
sampling only when the hash table of join operator grows past a
given threshold
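The Invalid Score Threshold can be illustrated with a small sketch. It assumes that "passes a threshold" means "reaches or exceeds it" and that a score becomes invalid when a source is new or when new information about it arrives. The class, its method names and the toy scores are hypothetical.

```python
class SourceRanker:
    """Re-ranks only when enough sources have invalid scores."""

    def __init__(self, score_fn, invalid_score_threshold):
        self.score_fn = score_fn
        self.invalid_score_threshold = invalid_score_threshold
        self.scores = {}      # source -> last computed score
        self.invalid = set()  # sources whose score must be (re)computed
        self.rank_runs = 0    # how often the costly ranking step ran

    def add_source(self, source):
        # A newly indexed or discovered source has no valid score yet.
        self.invalid.add(source)
        self._maybe_rank()

    def invalidate(self, source):
        # Called when new links or join samples affect this source.
        self.invalid.add(source)
        self._maybe_rank()

    def _maybe_rank(self):
        # Assumption: ranking runs once the threshold is reached.
        if len(self.invalid) < self.invalid_score_threshold:
            return
        for source in self.invalid:
            self.scores[source] = self.score_fn(source)
        self.invalid.clear()
        self.rank_runs += 1

    def ranking(self):
        return sorted(self.scores, key=self.scores.get, reverse=True)


toy_scores = {"s1": 0.7, "s2": 1.0, "s3": 0.2}
ranker = SourceRanker(toy_scores.get, invalid_score_threshold=2)
ranker.add_source("s1")  # one invalid score: below threshold, no ranking
ranker.add_source("s2")  # threshold reached: both sources are scored
ranker.add_source("s3")  # one invalid score again: s3 stays unranked
print(ranker.ranking(), ranker.rank_runs)
```

A higher threshold means fewer ranking runs but a ranking that lags further behind newly discovered sources.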

EVALUATION


Evaluation

Systems: top-down (TD), bottom-up (BU), mixed (MI)
8 queries over various datasets (DBpedia, Geonames,
NYT, Freebase, ...)
To make the approaches comparable, sources were
restricted to those discoverable by the BU approach
~6200 sources, containing ~500k triples
Sources hosted on local proxy server with artificial delay of 2
seconds
25% of sources were randomly chosen to construct index for MI


Results

Overall early result reporting
25% results: MI 8.7s, BU 15.1s
50% results: MI 12.8s, BU 22.0s
Improvement of ~42%
Detailed results for two queries:

                          Query 1                       Query 6
                      BU        MI        TD        BU        MI        TD
25% Results      24810.5   10300.0   11038.0    8222.5    4743.5    5545.0
50% Results      43464.5   40782.0   15787.0   10961.5    7650.5    5634.0
Total            84066.5   86895.5   44323.5   24086.0   20711.0   16469.0
Src. Selection       0.0     853.0    1444.5       0.0    1331.0    1863.5
Ranking             25.5    2404.0     411.5      23.5     292.5     335.0
#Sources             622       612       154       236        92        49
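The improvement of ~42% quoted for overall early result reporting can be checked from the MI and BU times given on this slide. The helper below is only a worked calculation; its name is not from the slides.

```python
def improvement(baseline, improved):
    """Relative reduction in time, as a percentage of the baseline."""
    return (baseline - improved) / baseline * 100


at_25 = improvement(15.1, 8.7)   # 25% of results: BU 15.1s vs. MI 8.7s
at_50 = improvement(22.0, 12.8)  # 50% of results: BU 22.0s vs. MI 12.8s
print(round(at_25, 1), round(at_50, 1))  # 42.4 41.8
```

Both checkpoints give a reduction of roughly 42% relative to the bottom-up approach.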

Result Arrival Times


Ranking Heuristics


Conclusion

Mixed strategy for Linked Data Query Processing
Partial knowledge available beforehand, incorporated with source
discovery at run-time
Metrics for source relevancy
Refinement of ranking at run-time
Early results reported on average 42% faster
Future work
Adapt query plan to changing properties of incoming data
Query local and remote data

