DAW: Duplicate-AWare Federated Query Processing over the Web of Data
Muhammad Saleem1, Axel-Cyrille Ngonga Ngomo1, Josiane Xavier Parreira2, Helena F. Deus3, Manfred Hauswirth2
1Agile Knowledge Engineering and Semantic Web (AKSW), University of Leipzig, Germany
lastname@informatik.uni-leipzig.de
2Digital Enterprise Research Institute (DERI), National University of Ireland, Galway
firstname.lastname@deri.org
International Semantic Web Conference (ISWC), October 21-25, 2013, Sydney, Australia
4. Motivation
[Figure: retrieved results for TP1 (?uri <p1> ?v1) and TP2 (?uri <p2> ?v2)]
Triple pattern-wise source selection and skipping
TP1 = S1 S2 S3
TP2 = S1 S2 S4
Min. number of new triples (threshold) = 20
Total triple pattern-wise selected sources = 4
Total triple pattern-wise skipped sources = 2
5. Problem Statement
• Data duplication in LOD datasets
– E.g. DrugBank and Neurocommons are duplicated at the
DERI Health Care and Life Sciences Knowledge Base
• Retrieving duplicate results increases query
execution time and network traffic
• How can the overlap between data sources be
estimated before sub-queries are federated?
6. Sketches
• Data structures that provide dataset summaries
– Min-wise Independent Permutations (MIPs)
– Bloom filters
• Estimate overlap among different ID sets
• MIPs provide good tradeoff between estimation
error and space requirements
• MIPs of different lengths can be compared
• Sketches alone cannot be used in SPARQL
federation
– SPARQL queries are highly selective when subject,
predicate, or object becomes bound in a triple pattern
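The overlap estimation described on this slide can be sketched as follows. This is a minimal illustration, not the DAW implementation: the function names, the linear hash family `(a*x + b) mod PRIME`, and the seed are assumptions made for the example. Each set of integer IDs is summarised by the minimum hash value under each of N shared pseudo-random permutations. The fraction of matching positions estimates the resemblance (Jaccard similarity), from which the overlap size follows. Vectors of different lengths are compared on their common prefix.

```python
import random

PRIME = 2_147_483_647  # 2^31 - 1; assumed modulus, IDs are expected to be smaller


def make_permutations(n, seed=42):
    """Create n linear hash functions (a, b) standing in for random permutations."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]


def mips_vector(ids, perms):
    """For each permutation keep only the minimum hashed ID of the set."""
    return [min((a * x + b) % PRIME for x in ids) for a, b in perms]


def resemblance(v1, v2):
    """Fraction of matching minima over the common prefix of both vectors."""
    n = min(len(v1), len(v2))
    return sum(1 for i in range(n) if v1[i] == v2[i]) / n


def estimate_overlap(v1, size1, v2, size2):
    """Turn resemblance r = |A n B| / |A u B| into an overlap size estimate."""
    r = resemblance(v1, v2)
    return r * (size1 + size2) / (1 + r)


if __name__ == "__main__":
    rng = random.Random(0)
    ids = rng.sample(range(1_000_000), 1500)
    set_a, set_b = ids[:1000], ids[500:]  # true overlap: 500 IDs
    perms = make_permutations(256)
    va, vb = mips_vector(set_a, perms), mips_vector(set_b, perms)
    print("estimated overlap:", estimate_overlap(va, len(set_a), vb, len(set_b)))
```

Longer vectors reduce the estimation error at the cost of more space, which is the trade-off the slide refers to.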
8. DAW
• A combination of MIPs with compact data
summaries
• Uses average selectivity values for bound
subjects and objects
• Can be combined with any existing SPARQL
endpoint federation system
• Can be used for partial result retrieval
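A rough sketch of how these pieces could fit together for one triple pattern is given below. It is an illustration under stated assumptions, not the authors' algorithm: the greedy ranking, the helper names (`estimated_size`, `select_sources`), and the summary layout `{source: (mips_vector, estimated_size)}` are hypothetical. The idea shown is that a source's result size is scaled by average selectivities when the subject or object is bound. Sources are then added in order of their estimated number of new triples, and the remaining sources are skipped once that estimate falls below a threshold.

```python
import random

PRIME = 2_147_483_647  # assumed hash modulus; IDs are expected to be smaller


def make_perms(n, seed=7):
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]


def mips(ids, perms):
    return [min((a * x + b) % PRIME for x in ids) for a, b in perms]


def resemblance(v1, v2):
    n = min(len(v1), len(v2))
    return sum(1 for i in range(n) if v1[i] == v2[i]) / n


def estimated_size(total_triples, avg_subj_sel, avg_obj_sel, subj_bound, obj_bound):
    """Scale a predicate's triple count by average selectivities (assumed model)."""
    size = float(total_triples)
    if subj_bound:
        size *= avg_subj_sel
    if obj_bound:
        size *= avg_obj_sel
    return size


def select_sources(summaries, threshold):
    """Greedy selection; summaries maps source -> (MIPs vector, estimated size)."""
    if not summaries:
        return [], []
    remaining = dict(summaries)
    first = max(remaining, key=lambda s: remaining[s][1])  # largest source first
    union_vec, union_size = remaining.pop(first)
    selected, skipped = [first], []
    while remaining:
        gains = {}
        for name, (vec, size) in remaining.items():
            r = resemblance(union_vec, vec)
            overlap = r * (union_size + size) / (1 + r)
            gains[name] = max(0.0, size - overlap)  # estimated new triples
        best = max(gains, key=gains.get)
        if gains[best] < threshold:
            # No remaining source is expected to add enough new triples.
            skipped.extend(sorted(remaining, key=gains.get, reverse=True))
            break
        vec, _ = remaining.pop(best)
        selected.append(best)
        # The MIPs vector of a union is the position-wise minimum.
        union_vec = [min(a, b) for a, b in zip(union_vec, vec)]
        union_size += gains[best]
    return selected, skipped
```

With a threshold of 0 every source is kept, so the same routine covers complete retrieval. Larger thresholds trade recall for fewer requests, which matches the partial result retrieval use case.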
14. FedX Extension with DAW
[Figure: execution time (sec) of FedX vs. DAW per query. Diseasome: STP, S-1, S-2, P-1, P-2, P-3; Publication: STP, S-1, S-2, P-1, P-2, P-3; Geo Data: STP, S-1, S-2; Movie: STP]
Overall performance evaluation (Average = average execution time in sec)
        Diseasome        Publication      Geo Data         Movie            Overall
        Average  Gain %  Average  Gain %  Average  Gain %  Average  Gain %  Average  Gain %
FedX    2.44     18.79   1.48     -12.38  4.60     14.71   1.74     7.59    2.44     9.76
DAW     1.98             1.67             3.92             1.61             2.20
15. SPLENDID Extension with DAW
[Figure: execution time (sec) of SPLENDID vs. DAW per query. Diseasome: STP, S-1, S-2, P-1, P-2, P-3; Publication: STP, S-1, S-2, P-1, P-2, P-3; Geo Data: STP, S-1, S-2; Movie: STP]
Overall performance evaluation (Average = average execution time in sec)
          Diseasome        Publication      Geo Data         Movie            Overall
          Average  Gain %  Average  Gain %  Average  Gain %  Average  Gain %  Average  Gain %
SPLENDID  3.78     19.48   2.18     -8.94   7.27     14.40   1.9      11.16   3.71     11.11
DAW       3.04             2.37             6.22             1.688            3.30
16. DARQ Extension with DAW
[Figure: execution time (sec) of DARQ vs. DAW per query. Diseasome: STP, S-1, S-2, P-1, P-2, P-3; Publication: STP, S-1, S-2, P-1, P-2, P-3; Geo Data: STP, S-1, S-2; Movie: STP]
Overall performance evaluation (Average = average execution time in sec)
        Diseasome        Publication      Geo Data         Movie            Overall
        Average  Gain %  Average  Gain %  Average  Gain %  Average  Gain %  Average  Gain %
DARQ    8.27     23.34   5.26     6.14    23.44    16.31   1.96     13.88   9.59     16.46
DAW     6.34             4.94             19.62            1.688            8.01
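The Gain % columns appear to be the relative reduction in average execution time of the DAW extension over the baseline system. The snippet below is a small sketch of that arithmetic. The formula and the function name `gain_percent` are assumptions inferred from the tabulated values, and some reported gains differ slightly from what the rounded averages yield, presumably because they were computed from unrounded measurements.

```python
def gain_percent(baseline_sec, daw_sec):
    """Relative runtime reduction of the DAW extension, in percent (assumed formula)."""
    return (baseline_sec - daw_sec) / baseline_sec * 100.0


if __name__ == "__main__":
    # DARQ vs. DAW on Diseasome, using the averages reported in the table.
    print(round(gain_percent(8.27, 6.34), 2))
    # A negative gain means the extension was slower (FedX, Publication).
    print(round(gain_percent(1.48, 1.67), 2))
```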
18. Conclusion and Future Work
• A sub-query can retrieve results that were already retrieved by another sub-query
– Resources are wasted
– Query runtime is increased
– Extra traffic is generated
• Sketches alone cannot be used due to the expressive nature of SPARQL queries
• We used MIPs applied to RDF predicates along with compact data summaries
• Performance improvement
– FedX: 9.76 %
– SPLENDID: 11.11 %
– DARQ: 16.46 %
• Future work: explore the effect of MIPs sizes and threshold values to find the
optimal trade-off between execution time and recall
saleem@informatik.uni-leipzig.de
AKSW, University of Leipzig, Germany