2. Outline
I. Introduction
II. Framework
III. Algorithms:Indexed XML Data
Keyword Proximity Search
IV. Processing Unindexed XML DATA
in XML Trees
V. Experimental Evaluation
VI. Overview
3. Introduction - Framework - Algorithms:Indexed XML Data – Processing Unindexed XML Data - Experimental Evaluation - Overview
Keyword Search Keyword Proximity Search
in XML Trees
Keyword search
user-friendly information discovery technique
extensively studied for text documents.
Keyword proximity search
well-suited to XML documents
4. Introduction - Framework - Algorithms:Indexed XML Data - Processing Unindexed XML Data - Experimental Evaluation - Overview
Notation Keyword Proximity Search
in XML Trees
XML DOCUMENT directed tree with labeles
- labled with λ(v), a tag
- 4-tuple: id(v)
start and end correspond to the first and the final times the node is
v visited in a depth-first traversal of the XML tree,
depth is the depth of the node from the root of the tree.
- if v is a leaf, it has a string value val(v) that contains a list of keywords
set of keywords k1,. . . , km.
keyword query
returns a compact representation of the set of trees that connect
the nodes that contain the keywords
5. Introduction - Framework - Algorithms:Indexed XML Data - Processing Unindexed XML Data - Experimental Evaluation - Overview
Notation Keyword Proximity Search
in XML Trees
r
c1
s1 s2 s3
p2 p5 p6
p1 p3 p4
a1 a2 a3 a4 a5 a6 a7 a8 a9 a10
6. Introduction - Framework - Algorithms:Indexed XML Data - Processing Unindexed XML Data - Experimental Evaluation - Overview
Notation Keyword Proximity Search
in XML Trees
Definition
minimum connecting tree (MCT) of nodes v1,. . . ,vm of a tree → the minimum size subtree that
connects v1, . . . ,vm.
root of the tree → the lowest common ancestor (LCA) of the nodes v1, . . . ,vm.
Examples:
r r
MCTs for the query MCTs for the query
“Tom, Harry” c1 “Tom, Dick, Harry” c1
s1 s2 s3 s1 s2 s3
p1 p2 p4 p5 p6 p1 p2 p3 p4 p5 p6
p3
a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a1 a2 a3 a7 a8
a4 a5 a6 a9 a10
7. Introduction - Framework - Algorithms:Indexed XML Data - Processing Unindexed XML Data - Experimental Evaluation - Overview
Notation Keyword Proximity Search
in XML Trees
DMCT
v1, . . . , vm Є T.
Distance MCT (DMCT) TD=d(TM) of the MCT TM of nodes v1, . . . , vm → the minimum node-labeled
and edge-labeled tree such that:
TD contains v1, . . . , vm
TD contains the LCAs u1, . . . , uk
of any pair of nodes (vi, vj)
where vi , vj Є [v1, . . . , vm], i≠ j
edge labeled with l between
any two distinct nodes n, n’ Є
{v1,...,vm, u1, . . . ,uk} if there is a
path of length l from n’ to n in
TM and the path does not
contain any node n’’ Є { u1, . . . ,
um} other than n and n’.
8. Introduction - Framework - Algorithms:Indexed XML Data - Processing Unindexed XML Data - Experimental Evaluation - Overview
Notation Keyword Proximity Search
in XML Trees
GDMCT
A Grouped DMCT of a tree T is a labeled tree where edges are labeled with numbers and nodes
are labeled with lists of node ids from T.
DMCT D Є GDMCT G if D and G are isomorphic. Assuming that f is the mapping of the nodes of D
to the nodes of G, which induces a corresponding mapping, also called f, of the edges of D to
the edges of G, the following must hold:
nD is a node of D, nG is a node of
G and f(nD)=nG, then the label
of nG contains the id of nD.
eD is an edge of D, eG is an edge
of G and f(eD) = eG, then the
label of eD and the label of eG
are the same number.
9. Introduction - Framework - Algorithms:Indexed XML Data - Processing Unindexed XML Data - Experimental Evaluation - Overview
Problems Keyword Proximity Search
in XML Trees
Problem 1 : All GDMCTs Problem
Query K Result
“Tom, Harry” 5
3
10. Introduction - Framework - Algorithms:Indexed XML Data - Processing Unindexed XML Data - Experimental Evaluation - Overview
Problems Keyword Proximity Search
in XML Trees
Problem 2 : Lowest GDMCTs Problem
Query K Result
“Tom, Harry” 5
3
11. Introduction - Framework - Algorithms:Indexed XML Data - Processing Unindexed XML Data - Experimental Evaluation - Overview
All GDMCTs: Keyword Proximity Search
in XML Trees
Nested Loop Algorithm
The nested loops algorithm (NL) for the case of indexed XML Examples of some entries in the
data operates over separate lists of nodes, L(k), one for each master index for our tree:
query keyword, k, to identify the GDMCTs whose sizes are no
more than the user-provided threshold, K.
Master index inverted index a hash table
list L(k)
each node n has path-id
(the list of node ids along the path from the root of T to n)
12. Introduction - Framework - Algorithms:Indexed XML Data - Processing Unindexed XML Data - Experimental Evaluation - Overview
All GDMCTs: Keyword Proximity Search
in XML Trees
Nested Loop Algorithm
checks all combinations of
nodes from the keyword lists.
for each combination computes
an MCT (minimum connecting
tree)
merges the resulting MCT into
the list of result GDMCTs, if its
size is within the user-specified
threshold.
13. Introduction - Framework - Algorithms:Indexed XML Data - Processing Unindexed XML Data - Experimental Evaluation - Overview
All GDMCTs: Keyword Proximity Search
in XML Trees
Nested Loop Algorithm
For example:
Query: “Tom, Harry” and K=3,
NL examine the 12 node-pairs 12 MCTs
determine 2 of them meet the
threshold(K) return 2 GDMCTs:
14. Introduction - Framework - Algorithms:Indexed XML Data - Processing Unindexed XML Data - Experimental Evaluation - Overview
All GDMCTs: Keyword Proximity Search
in XML Trees
Nested Loop Algorithm
Inefficienty:
NL checks all the combinations of nodes
from the keyword lists
The grouping of the results into GDMCTs is
not lightly integrated with the algorithm
and a lookup to the array R is required for
each relevant MCT found.
15. Introduction - Framework - Algorithms:Indexed XML Data - Processing Unindexed XML Data - Experimental Evaluation - Overview
All GDMCTs: Keyword Proximity Search
in XML Trees
Stack-Based Algorithm
Index Structure and Algorithm.
The stack-based algorithm for computing GDMCTs on indexed XML data operates over lists of
nodes, two for each query keyword.
Indexing by keyword master index contains 2 lists
o L(k) of the nodes of T that contain k in T and
o Ld(k) of the ancestors of nodes in L(k).
16. Introduction - Framework - Algorithms:Indexed XML Data - Processing Unindexed XML Data - Experimental Evaluation - Overview
All GDMCTs: Keyword Proximity Search
in XML Trees
Stack-Based Algorithm
Index Structure and Algorithm.
For example the entries for Tom, Dick and Harry are:
17. Introduction - Framework - Algorithms:Indexed XML Data - Processing Unindexed XML Data - Experimental Evaluation - Overview
All GDMCTs: Keyword Proximity Search
in XML Trees
Stack-Based Algorithm
Index Structure and Algorithm.
This is the high-level description of the SA.
It describes how the selected
list of nodes is traversed in a
depth-first manner and the
nodes are pushed and popped
from the stack.
18. Introduction - Framework - Algorithms:Indexed XML Data - Processing Unindexed XML Data - Experimental Evaluation - Overview
All GDMCTs: Keyword Proximity Search
in XML Trees
Stack-Based Algorithm
Index Structure and Algorithm.
novel part of the SA algorithm
processing and bookkeeping
performed at each stack
operation
19. Introduction - Framework - Algorithms:Indexed XML Data - Processing Unindexed XML Data - Experimental Evaluation - Overview
All GDMCTs: Keyword Proximity Search
in XML Trees
Stack-Based Algorithm
Index Structure and Algorithm.
Functions that are called from
POP(S)
20. Introduction - Framework - Algorithms:Indexed XML Data - Processing Unindexed XML Data - Experimental Evaluation - Overview
All GDMCTs: Keyword Proximity Search
in XML Trees
Stack-Based Algorithm
Illustrative Example
Query: “Tom, Harry”
K=3
Master index lists:
The intersection of the lists:
21. Introduction - Framework - Algorithms:Indexed XML Data - Processing Unindexed XML Data - Experimental Evaluation - Overview
All GDMCTs: Keyword Proximity Search
in XML Trees
Stack-Based Algorithm
Illustrative Example
Master index lists: Intersection of the La
Query: “Tom, Harry”
K=3
Some of the initial stack states of the execution of the Stack Algorithm:
1. 2. 3.
22. Introduction - Framework - Algorithms:Indexed XML Data - Processing Unindexed XML Data - Experimental Evaluation - Overview
All GDMCTs: Keyword Proximity Search
in XML Trees
Stack-Based Algorithm
Illustrative Example
Master index lists: Intersection of the La
Query: “Tom, Harry”
K=3
Some of the initial stack states of the execution of the Stack Algorithm:
4. 5. 6.
23. Introduction - Framework - Algorithms:Indexed XML Data - Processing Unindexed XML Data - Experimental Evaluation - Overview
All GDMCTs: Keyword Proximity Search
in XML Trees
Stack-Based Algorithm
Illustrative Example
Master index lists: Intersection of the La
Query: “Tom, Harry”
K=3
Some of the initial stack states of the execution of the Stack Algorithm:
7. 8. 9. Entries from the lists
continue being examined,
new GDMCTs are created and
pruned until all the answers
are output.
...
24. Introduction - Framework - Algorithms:Indexed XML Data - Processing Unindexed XML Data - Experimental Evaluation - Overview
Lowest GDMCTs: Keyword Proximity Search
in XML Trees
Stack- Based Algorithm
The key observation is that once we output the GDMCTs of a node u, none of the ancestors of u
in the stack can be LCAs of returned GDMCTs; hence, we can remove all of them from the stack!
Specifically, we can add the following lines after line 5:
25. Introduction - Framework - Algorithms:Indexed XML Data - Processing Unindexed XML Data - Experimental Evaluation - Overview
LCAs: Keyword Proximity Search
in XML Trees
Stack- Based Algorithms
The Stack Algorithm can also be easily modified to solve the All LCAs Problem and the Lowest
LCAs Problem, where the user is not interested in the GDMCTs, but only in the LCA nodes.
o First, Merge(.) could be simplified, no merging of GDMCTs would need to be done, and
line 33 could be replaced by:
o Second, we can output an LCA early when the first GDMCT (with all keywords) is
computed for that node (in Procedure CreateNewGDMCTs(.)), instead of waiting until the
node is popped from the stack.
26. Introduction - Framework - Algorithms:Indexed XML Data - Processing Unindexed XML Data - Experimental Evaluation - Overview
Complexity Keyword Proximity Search
in XML Trees
Analysis
Total number of GDMCTs
Worst case: the number of DMCTs and of GDMCTs = exponential on the number of keywords.
Under reasonable assumptions, the worst-case number of GDMCTs is smaller than that of
DMCTs
Complexity of Finding Isomorphic GDMCTs
Given this canonical representation prezented in this chapter, one can linearize the GDMCTs in
an XML-like nested representation with start and end tags, obtained from the node
annotations.
Theorem 1. The time complexity of SA is
O( L K (i 1 L(ki ) ) 2 )
m
27. Introduction - Framework - Algorithms:Indexed XML Data - Processing Unindexed XML Data - Experimental Evaluation - Overview
Processing Keyword Proximity Search
in XML Trees
Unindexed XML Data
Both the NL Algorithm and the SA have adaptations to work without index lists by doing a single
pass over the data tree.
The streaming version of the Stack Algorithm following changes to the Stack
Algorithm SA(k1,..km, K):
28. Introduction - Framework - Algorithms:Indexed XML Data - Processing Unindexed XML Data - Experimental Evaluation - Overview
Experimental Keyword Proximity Search
in XML Trees
Evaluation
Parameters affecting the performance of the presented algorithms:
1) the value of K denoting the threshold,
2) the number m of keywords,
3) the size of the data set.
Tests show that usually the algorithms based on the Stack Algorithm have better results than the
Nested Loops Algorithms both in the Indexed and Unindexed data.
29. Introduction - Framework - Algorithms:Indexed XML Data - Processing Unindexed XML Data - Experimental Evaluation - Overview
Overview Keyword Proximity Search
in XML Trees
There were presented two main problems:
1) identifying and presenting in a compact manner all MCTs which explain how the keywords are
connected
2) identifying only MCTs whose root is not an ancestor of the root of another MCT.
There are presented solutions:
1) when the XML data has been preprocessed and relevant indices have been constructed
- Nested Loop Algorithm
- Stack Algorithm
2) when the XML data has not been preprocessed, i.e., the XML data can only beprocessed
sequentially.
Benefits of the algorithms are shown by the Experimental Evaluation
30. Resource
Name
Keyword Proximity Search in XML Trees
Vangelis Hristidis, Nick Koudas,
Yannis Papakonstantinou and Diverish Srivastava
Authors
IEEE Transactions on Knoledge and Data Engineering
Publication
Vol 18, No 4, APRIL 2006