1. Linked Data Query Processing Strategies
Günter Ladwig, Thanh Tran
International Semantic Web Conference 2010, Shanghai
Institute of Applied Informatics and Formal Description Methods (AIFB)
KIT – University of the State of Baden-Württemberg and
National Large-scale Research Center of the Helmholtz Association www.kit.edu
2. Contents
Introduction
Challenges
Contributions
Linked Data Query Processing Strategies
Stream-based Query Processing
Corrective Source Ranking
Evaluation
Conclusion
November 11th, 2010, ISWC 2010, Shanghai, China, Institute of Applied Informatics and Formal Description Methods (AIFB)
3. What is Linked Data?
Linked Data Principles
Use URIs to identify things
Use HTTP URIs that allow dereferencing
Dereferencing a URI provides information about the thing in a
standard format (RDF)
Include links to other, related URIs
Linked Data Query Processing
Evaluate queries directly over Linked Data
Dereference Linked Data URIs during query processing
4. Challenges
Volume of Source Collection
Each URI is a potential data source
Dynamics of Source Collection
Sources may change rapidly over time
Sources might only be discovered at run-time
Heterogeneity of Sources, Source Descriptions and
Access Methods
Sources vary in size
Descriptions of sources vary in completeness
Access methods: URI lookup, SPARQL endpoints, local cache, ...
5. Contributions
Discussion of Linked Data Query Processing strategies
Mixed strategy, combining local indexes and run-time
discovery
Stream-based Query Processing
Data can arrive at any time and in any order
Suited to deal with network latency
Corrective Source Ranking
Deals with different types of source descriptions
Ranking is refined at run-time
6. LINKED DATA QUERY PROCESSING STRATEGIES
7. Top-down Query Evaluation
SELECT ?paper ?author WHERE {
?paper swrc:author ?author . ?paper swc:isPartOf ?proc .
?proc swc:relatedToEvent <http://sw.org/eswc/2010> .
}
[Figure: Probe local source index -> Select and rank sources -> Retrieve sources -> Join data; ranked source list with columns Source URI and Score, e.g. http://sw.org/person/AB 0.87]
Local index, assumed to be complete
Selection and ranking of sources
No run-time discovery
Fast, only relevant sources are retrieved
Not up-to-date, index size may become very large
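The source selection and ranking step above can be sketched roughly as follows. This is only an illustration under assumptions: the index layout (predicate mapped to per-source triple counts), the counts, the source http://sw.org/paper/1, and the function name select_and_rank are hypothetical and not taken from the slides, and the score here is a plain sum of counts rather than the ranking the talk describes later.

```python
from collections import defaultdict

# Hypothetical local source index: predicate -> {source URI -> triple count}.
LOCAL_INDEX = {
    "swrc:author": {"http://sw.org/person/AB": 12, "http://sw.org/paper/1": 3},
    "swc:isPartOf": {"http://sw.org/paper/1": 1},
    "swc:relatedToEvent": {"http://sw.org/proc/eswc/2010": 1},
}


def select_and_rank(query_predicates, index):
    """Return (source, score) pairs for all indexed sources, best first.

    The score is simply the number of indexed triples matching any query
    predicate; sources unknown to the index are never returned, which is
    why a purely top-down strategy cannot see new sources.
    """
    scores = defaultdict(int)
    for predicate in query_predicates:
        for source, count in index.get(predicate, {}).items():
            scores[source] += count
    # Sort by descending score, then by URI for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


ranked = select_and_rank(
    ["swrc:author", "swc:isPartOf", "swc:relatedToEvent"], LOCAL_INDEX
)
```

Only the sources in ranked would then be retrieved and joined.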
8. Bottom-up Query Evaluation
SELECT ?paper ?author WHERE {
?paper swrc:author ?author . ?paper swc:isPartOf ?proc .
?proc swc:relatedToEvent <http://semweb.org/eswc/2010> . }
Retrieve source
<http://sw.org/proc/eswc/2010> swc:relatedToEvent
<http://sw.org/eswc/2010> .
...
Discover new sources
swc:paper1 swc:isPartOf
<http://sw.org/proc/eswc/2010> .
...
Sources are discovered at run-time through links
Answers can be incomplete, as links might not be discoverable
Slower, as unnecessary sources are retrieved
Always up-to-date
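A minimal sketch of run-time source discovery, assuming a breadth-first traversal over an in-memory stand-in for the Web. The WEB dictionary, the URIs http://sw.org/paper1 and the author triple, and the function name traverse are hypothetical; a real engine would dereference URIs over HTTP and would typically follow only links that match query patterns.

```python
from collections import deque

# Hypothetical in-memory "Web": source URI -> triples returned on dereferencing.
WEB = {
    "http://sw.org/eswc/2010": [
        ("http://sw.org/proc/eswc/2010", "swc:relatedToEvent",
         "http://sw.org/eswc/2010"),
    ],
    "http://sw.org/proc/eswc/2010": [
        ("http://sw.org/paper1", "swc:isPartOf",
         "http://sw.org/proc/eswc/2010"),
    ],
    "http://sw.org/paper1": [
        ("http://sw.org/paper1", "swrc:author", "http://sw.org/person/AB"),
    ],
}


def traverse(seeds, web, max_sources=100):
    """Dereference seed URIs, then every URI mentioned in retrieved triples."""
    visited, queue, triples = set(), deque(seeds), []
    while queue and len(visited) < max_sources:
        uri = queue.popleft()
        if uri in visited:
            continue
        visited.add(uri)
        # A URI missing from `web` models a source that yields no data.
        for s, p, o in web.get(uri, []):
            triples.append((s, p, o))
            for term in (s, o):
                if term.startswith("http://") and term not in visited:
                    queue.append(term)
    return visited, triples


sources, data = traverse(["http://sw.org/eswc/2010"], WEB)
```

The traversal also dereferences http://sw.org/person/AB, which returns nothing here; this is the kind of unnecessary retrieval that makes the bottom-up strategy slower.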
9. Mixed Strategy
Combination of top-down and bottom-up strategies
Partial local index of sources, not assumed to be complete
New sources are discovered at run-time
Addresses volume and dynamics of Linked Data
Corrective Source Ranking
Deal with heterogeneous source descriptions
Stream-based Query Processing
Deal with unpredictable nature of Linked Data access
10. STREAM-BASED QUERY PROCESSING
11. Stream-based Query Processing
Network latency
Do not block!
Evaluation driven by incoming data
Compile-time
Construct query plan
Probe local index for sources
Run-time
Rank sources
Retrieve sources
Push data into query plan
Discover new sources
[Figure: query plan with two Join operators over worksAt(?x, dbpedia:KIT), knows(?x, ?y) and name(?y, ?n), producing results; Source Retrievers 1 and 2 retrieve sources and push data into the plan; Source Ranker with samples from the plan, discovered sources and the local source index, listing Source 1 (score: 1.0), Source 2 (score: 0.7), ...]
12. Push-based Symmetric Hash Join
Operation
Maintains a hash table for each input
Arriving tuples are inserted into one hash table and then the other is probed for join combinations
Push-based
Tuples are pushed into operators from the leaves to the root of the query plan
Execution driven by incoming tuples instead of results
Results reported as soon as input tuples arrive
Tuples can arrive on all inputs in any order
[Figure: left input hash table (Key: T) a: t1, t3; b: t2, t7. Right input hash table (Key: T) b: t4, t5; c: t6. Pushed on left: t7(b). Insert into left table, probe right table, push output (t7, t4), (t7, t5)]
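The operator described above can be sketched in a few lines. This is a minimal illustration, assuming tuples arrive as (side, key, tuple) calls and results are handed to a callback; the class and method names are hypothetical, and the tuple labels follow the figure on this slide.

```python
from collections import defaultdict


class SymmetricHashJoin:
    """Push-based symmetric hash join over two inputs, "left" and "right"."""

    def __init__(self, emit):
        # One hash table per input, keyed by the join key.
        self.tables = {"left": defaultdict(list), "right": defaultdict(list)}
        self.emit = emit  # callback that receives each (left, right) result

    def push(self, side, key, tup):
        other = "right" if side == "left" else "left"
        # Insert into this input's table, then probe the other input's table.
        self.tables[side][key].append(tup)
        for match in self.tables[other].get(key, []):
            self.emit((tup, match) if side == "left" else (match, tup))


results = []
join = SymmetricHashJoin(results.append)
arrivals = [
    ("left", "a", "t1"), ("left", "b", "t2"), ("left", "a", "t3"),
    ("right", "b", "t4"), ("right", "b", "t5"), ("right", "c", "t6"),
]
for side, key, tup in arrivals:
    join.push(side, key, tup)

# Pushing t7 with key b on the left joins with t4 and t5 on the right.
join.push("left", "b", "t7")
```

Because neither input is treated as the build side, a result is emitted as soon as the second matching tuple arrives, whichever side it comes from.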
13. CORRECTIVE SOURCE RANKING
14. Corrective Source Ranking
Prefer more relevant sources
Relevancy of a source is based on
Current query
Any available intermediate results
Overall optimization goal
Define a set of source features and derive concrete
source metrics
Not all metrics are available for all sources (heterogeneity)
Refine previously computed metrics using newly
discovered information
15. Source Features and Metrics
Source is more relevant if it contains data that contributes
to answers of the query
Triple Pattern Cardinality
Join Pattern Cardinality
Cardinalities stored in local index
Some patterns have high cardinality for all or many
sources (e.g. )
These patterns do not discriminate sources
16. Source Features and Metrics
Adopt TF-IDF concept to obtain weights for triple patterns
Importance positively correlates with how often bindings to a
pattern occur in a source (i.e. cardinality)
Importance negatively correlates with how often its bindings occur
in all sources of the source collection S
Triple Frequency – Inverse Source Frequency (TF-ISF)
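The exact TF-ISF formula does not survive in this transcript, so the following is only a sketch by analogy with the standard TF-IDF definition: the pattern cardinality in a source is multiplied by the logarithm of the number of sources divided by the number of sources containing the pattern. The function name, the data layout, and the example patterns and counts are assumptions.

```python
import math


def tf_isf(pattern, source, cardinalities):
    """TF-IDF-style weight of a triple pattern for one source.

    cardinalities: {source: {pattern: number of matching triples}}
    Assumed form: tf * log(|S| / number of sources containing the pattern).
    """
    tf = cardinalities[source].get(pattern, 0)
    containing = sum(
        1 for counts in cardinalities.values() if counts.get(pattern, 0) > 0
    )
    if tf == 0 or containing == 0:
        return 0.0
    return tf * math.log(len(cardinalities) / containing)


# Hypothetical cardinalities for three sources.
CARDS = {
    "s1": {"?x rdf:type ?y": 50, "?p swc:isPartOf ?proc": 4},
    "s2": {"?x rdf:type ?y": 80},
    "s3": {"?x rdf:type ?y": 10},
}

common = tf_isf("?x rdf:type ?y", "s1", CARDS)
rare = tf_isf("?p swc:isPartOf ?proc", "s1", CARDS)
```

Under this assumed form, a pattern that matches in every source receives weight zero regardless of its cardinality, so it no longer influences the ranking.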
17. Source Features and Metrics - Links
Source linked from many other sources is more relevant
Relevance is higher when these links match query
predicates
Links are only discovered at run-time
18. Metric Correction and Refinement
During query processing new information becomes
available: intermediate join results, links
Refine and correct previously computed metrics
Important in the case of non-discriminative patterns
Instantiate triple pattern of a join with samples of
intermediate results to obtain better join size estimates
Example: take samples of the intermediate results in the SHJ operator and perform triple pattern cardinality lookups
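One way to read the sampling step is sketched below: draw a sample of the bindings held in a join operator, look up the cardinality of the instantiated triple pattern for each sampled binding, and scale the average to the full set of bindings. The function name, the lookup callback, and the example data are hypothetical, and the scaling rule is an assumption rather than the authors' estimator.

```python
import random


def estimate_join_size(bindings, pattern_cardinality, sample_size, rng=random):
    """Estimate how many join results a source contributes.

    bindings: intermediate results (values bound to the join variable)
    pattern_cardinality: callback returning the number of triples matching
        the triple pattern once it is instantiated with a binding
    """
    if not bindings:
        return 0.0
    sample = rng.sample(bindings, min(sample_size, len(bindings)))
    matches = sum(pattern_cardinality(binding) for binding in sample)
    # Scale the per-binding average up to all intermediate results.
    return matches / len(sample) * len(bindings)


# Hypothetical index lookups for the pattern instantiated with each binding.
LOOKUP = {"ex:a": 2, "ex:b": 0, "ex:c": 4, "ex:d": 2}
intermediate = list(LOOKUP)

# With a sample covering all bindings the estimate equals the exact count.
exact = estimate_join_size(intermediate, LOOKUP.get, sample_size=4)
```

Smaller samples cost fewer index lookups but give noisier estimates.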
19. Ranking at Run-time
Optimization goal: early result reporting
Indexed sources: triple and join pattern cardinality, TF-ISF,
weighted links, sampled join size estimates
Discovered sources: weighted links
Ranking has to be refined at run-time
Parameters influencing behavior and cost of ranking
process
Invalid Score Threshold: ranking is performed when the number
of sources with invalid scores passes a threshold
Sample Size: larger samples for join size estimation give better
estimates, but are also more costly
Resampling Threshold: cache join size estimates and perform
sampling only when the hash table of join operator grows past a
given threshold
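The three parameters can be pictured as simple guards around the ranking process. This sketch is an assumption about how they might be wired together; the names are hypothetical, and the resampling check interprets "grows past a given threshold" as growth since the last sampling, which the slides do not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class RankingParams:
    invalid_score_threshold: int  # invalid scores tolerated before re-ranking
    sample_size: int              # bindings sampled per join size estimate
    resampling_threshold: int     # hash table growth tolerated before resampling


def should_rerank(num_invalid_scores, params):
    """Re-rank once the number of sources with invalid scores passes the threshold."""
    return num_invalid_scores > params.invalid_score_threshold


def should_resample(table_size, size_at_last_sample, params):
    """Reuse the cached estimate until the join hash table has grown enough."""
    return table_size - size_at_last_sample > params.resampling_threshold


params = RankingParams(
    invalid_score_threshold=10, sample_size=20, resampling_threshold=100
)
rerank_now = should_rerank(11, params)
resample_now = should_resample(table_size=150, size_at_last_sample=100, params=params)
```

Raising either threshold trades ranking accuracy for lower ranking cost.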
20. EVALUATION
21. Evaluation
Systems: top-down (TD), bottom-up (BU), mixed (MI)
8 queries over various datasets (DBpedia, Geonames,
NYT, Freebase, ...)
To make the approaches comparable, sources were
restricted to those discoverable by the BU approach
~6200 sources, containing ~500k triples
Sources hosted on local proxy server with artificial delay of 2
seconds
25% of sources were randomly chosen to construct index for MI
22. Results
Overall early result reporting
25% results: MI 8.7s, BU 15.1s
50% results: MI 12.8s, BU 22.0s
Improvement of ~42%
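The slides do not say how the ~42% figure was derived. One plausible reading, shown here as a worked calculation, is the relative reduction in time of MI over BU, averaged over the 25% and 50% marks; the function name is hypothetical.

```python
def relative_improvement(baseline, improved):
    """Fraction of the baseline time saved by the improved system."""
    return (baseline - improved) / baseline


at_25 = relative_improvement(15.1, 8.7)    # BU 15.1s vs MI 8.7s
at_50 = relative_improvement(22.0, 12.8)   # BU 22.0s vs MI 12.8s
average = (at_25 + at_50) / 2
```

Both individual values and their average round to 42%, which is consistent with the reported improvement.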
Detailed results for two queries:
                 Query 1                         Query 6
                 BU        MI        TD          BU        MI        TD
25% Results      24810.5   10300.0   11038.0     8222.5    4743.5    5545.0
50% Results      43464.5   40782.0   15787.0     10961.5   7650.5    5634.0
Total            84066.5   86895.5   44323.5     24086.0   20711.0   16469.0
Src. Selection   0.0       853.0     1444.5      0.0       1331.0    1863.5
Ranking          25.5      2404.0    411.5       23.5      292.5     335.0
#Sources         622       612       154         236       92        49
23. Result Arrival Times
24. Ranking Heuristics
25. Conclusion
Mixed strategy for Linked Data Query Processing
Partial knowledge available beforehand, incorporated with source
discovery at run-time
Corrective Source Ranking
Metrics for source relevancy
Refinement of ranking at run-time
Stream-based Query Processing
Early results reported on average 42% faster
Future work
Adapt query plan to changing properties of incoming data
Query local and remote data