Change Detection in XML Documents using Semantic Identifiers
Kailaash Balachandran
Seminar Group: Data Deduplication and Versioning
University of Paderborn, Supervisor: Dr. Rita Hartel
Email: kailaash@mail.uni-paderborn.de
Abstract: Change detection is the process of comparing successive versions of a document to
identify what has changed. The success of XML as the standard for data exchange has paved the way for
a number of change detection techniques that focus on structural changes rather than on
semantics. Existing structural change detection mechanisms tend to break down when the
structural changes are large. This paper discusses a schema-less, semantics-based
framework that associates semantic identifiers with elements in successive versions, thus removing
the obstacle of unreliable element association when the structural change is significant.
1. INTRODUCTION
Change detection is the process of identifying differences between successive versions of a document, thereby
determining which parts of the document have changed and which have not.
Detecting changes reduces the storage space needed for historical data and supports temporal queries
[1]. In systems where data is sent across a network, traffic cost can be reduced because only the changes are sent
rather than the entire document. Temporal queries concern data that changes over time through creation,
modification, and deletion [2, 3]; they are issued against previous and current versions of the document. Change
detection thus helps in mining the historical data of a document to provide detailed information on the changes
made since its inception [4]. This is particularly relevant for data on the web, which has a higher rate of change and
whose changes need to be monitored effectively. Change detection also improves incremental query evaluation, in which
continuous queries monitor a particular data source and are updated when the data is modified; the cost of query
evaluation is reduced by using only the change. Instead of re-evaluating a query on the entire data, it suffices to
combine the result of the query on the changed data with the previous query result [5, 7, 8, 9].
The rest of this paper is organized as follows. Section 2 presents problems with existing structural approaches, using
an example to show how they tend to break down when two versions differ significantly in structure but are
semantically the same. Section 3 describes the notion of identifiers in XML documents, their classification as local
and non-local, and the algorithm used to compute them. Section 4 presents the concepts of node matching and
admits, which associate semantically identical nodes in two versions with each other. This approach is further explained
with a sample XML document in Section 5. Section 6 concludes.
2. MOTIVATION
Fig. 1 and Fig. 2 show the old and new versions of an XML document. The two versions differ in a single element
(<title>t2</title> in Fig. 1 becomes <book-title>t2</book-title> in Fig. 2); otherwise they are alike and carry exactly
the same information. Fig. 3 shows an alternative new version of the same XML document with a more substantial
change. This alternative version also contains the same information, but arranged in a different schema.
Intuitively, because of the significant structural changes, this version requires a number of insertions and
deletions even though it carries the same information as its predecessor. The high cost of change is not our
primary concern, as the cost is always on the order of the size of the document, no matter how big the change is.
The problem lies in the fact that when the structural changes are significant, it becomes difficult to associate
nodes between versions. An association relates an element in one version to an element in the next version.
Significant structural changes make associations between elements harder to establish, and they tend to break
down. For example, text node p2 in Fig. 1 should be associated with p2 in Fig. 3. But as the versions are arranged in
different schemas, the <publisher> elements in the two versions are structurally very different from each other. Thus
they cannot be associated by a structure-based change detection mechanism, even though the data present in the two
versions is semantically the same. This also creates an obstacle for temporal query support, as we are not able to
associate elements between versions.
<bib>
  <author>
    <name>n1</name>
    <book>
      <title>t1</title>
      <publisher>p1</publisher>
    </book>
  </author>
  <author>
    <name>n2</name>
    <book>
      <title>t2</title>
      <publisher>p2</publisher>
    </book>
    <book>
      <title>t1</title>
      <publisher>p1</publisher>
    </book>
  </author>
</bib>
Fig. 1 Older Version

<bib>
  <author>
    <name>n1</name>
    <book>
      <title>t1</title>
      <publisher>p1</publisher>
    </book>
  </author>
  <author>
    <name>n2</name>
    <book>
      <book-title>t2</book-title>
      <publisher>p2</publisher>
    </book>
    <book>
      <title>t1</title>
      <publisher>p1</publisher>
    </book>
  </author>
</bib>
Fig. 2 Newer Version 1

<bib>
  <publisher>p1
    <book>
      <title>t1</title>
      <author>
        <name>n1</name>
      </author>
    </book>
  </publisher>
  <publisher>p2
    <book>
      <title>t2</title>
      <author>
        <name>n1</name>
      </author>
    </book>
  </publisher>
</bib>
Fig. 3 Newer Version 2
This paper discusses a semantics-based change detection technique in which any node found semantically equivalent
to a node in another version is associated with it, regardless of structural change. First, we identify semantic
identifiers for every node in both versions and evaluate each identifier to a value. If nodes in the two
versions evaluate to the same result, they are considered semantically the same and are associated with each other.
Finding semantics-based associations provides good temporal query support, as it allows element associations to
persist across successive versions of the document.
3. IDENTIFIERS
3.1 Structural Identity
Identifiers are XPath expressions [10] that are used to identify and distinguish elements from one another based on
their type. Every element in the XML model is of a specific type (T), which corresponds to the list of labels from the
root to the element, separated by '/'. Types are calculated in a top-down fashion, from the root to the leaves of the
XML model tree. A text node and its parent node have the same type, but they are treated as two different nodes. The
notion of types corresponds to the notion of signatures in [6]: if x is a text node,
Signature(x) = /Name(x1)/…/Name(xn)/Type(x), where Name(x) denotes the node name, Type(x) denotes its type, x1 is
the root of the tree, and (x1, …, xn) is the path from the root to x. Previous change detection mechanisms match
nodes only if they have the same signature. But nodes can be structurally different yet semantically the same.
Hence we introduce the notion of semantic identity, described in the next section. A pair of elements is considered
structurally identical only if both are of the same type or have one common descendant that is structurally the
same as in the other element.
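The computation of types can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the documents are the ones from Fig. 1 and Fig. 3, the helper names element_types and types_of are hypothetical, and the standard-library ElementTree parser stands in for whatever XML model an actual implementation would use. The sketch shows why a purely signature-based matcher cannot pair the <publisher> elements of the two versions.

```python
import xml.etree.ElementTree as ET

# Documents transcribed from Fig. 1 and Fig. 3.
FIG1 = """<bib>
  <author><name>n1</name>
    <book><title>t1</title><publisher>p1</publisher></book>
  </author>
  <author><name>n2</name>
    <book><title>t2</title><publisher>p2</publisher></book>
    <book><title>t1</title><publisher>p1</publisher></book>
  </author>
</bib>"""

FIG3 = """<bib>
  <publisher>p1
    <book><title>t1</title><author><name>n1</name></author></book>
  </publisher>
  <publisher>p2
    <book><title>t2</title><author><name>n1</name></author></book>
  </publisher>
</bib>"""


def element_types(xml_text):
    """Map each element to its type: the '/'-separated labels from the root."""
    root = ET.fromstring(xml_text)
    types = {}

    def walk(elem, prefix):
        # Top-down traversal: a child's type extends its parent's type.
        path = prefix + "/" + elem.tag
        types[elem] = path
        for child in elem:
            walk(child, path)

    walk(root, "")
    return types


def types_of(xml_text, tag):
    """Distinct types of all elements carrying the given label (hypothetical helper)."""
    return sorted({t for e, t in element_types(xml_text).items() if e.tag == tag})


publisher_types_v1 = types_of(FIG1, "publisher")
publisher_types_v3 = types_of(FIG3, "publisher")
print(publisher_types_v1)  # ['/bib/author/book/publisher']
print(publisher_types_v3)  # ['/bib/publisher']
```

Because the two type strings differ, a matcher that requires equal signatures has no way to associate p2 in Fig. 1 with p2 in Fig. 3.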
3.2 Semantic Identity
Semantic identity is different from, but related to, structural identity. A pair of elements is considered
structurally identical only if both are of the same type or have one common descendant that is structurally the same
as in the other element. Otherwise they are considered structurally different. The following axioms [1] connect
structural identity to semantic identity.
3.2.1 Axiom 1: If a pair of nodes is structurally different, the nodes are semantically different.
For example, consider text nodes t1 and t2 in Fig. 3. Though both have a <title> parent, these two nodes are
considered semantically different because they are textually different (textual content is also an aspect of
structural identity). Thus t1 and t2 are both structurally and semantically different. But this need not
always be the case. Consider the text nodes n1 in both <book> elements in Fig. 3. They are structurally the same, but
are they semantically identical? Not all structurally identical nodes are semantically identical. This is stated in
Axiom 2 below.
3.2.2 Axiom 2: A pair of structurally identical nodes is semantically identical if and only if their respective
parents are semantically identical or they are both root nodes.
With Axiom 1, we distinguish nodes that have different content. When nodes have the same content, as with the n1 text
nodes under <book> in Fig. 3, Axiom 2 distinguishes them by their context. The context is the semantic
identity of the respective parent node. Hence, as per Axiom 2, the two <name>n1</name> elements in Fig. 3 are
semantically different despite their structural similarity: if we inspect the corresponding <author> elements, they
are in the context of two different books, which are structurally different as per Axiom 1. Thus the two
<name>n1</name> elements are structurally similar yet semantically different.
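The interplay of the two axioms can be sketched in Python as a recursion over parents. This is a simplified reading, not the procedure from [1]: structural identity is approximated here as "same type and same text values in the subtree", which is an assumption made only for this illustration, and all function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Document transcribed from Fig. 3.
FIG3 = """<bib>
  <publisher>p1
    <book><title>t1</title><author><name>n1</name></author></book>
  </publisher>
  <publisher>p2
    <book><title>t2</title><author><name>n1</name></author></book>
  </publisher>
</bib>"""

root = ET.fromstring(FIG3)
# ElementTree has no parent pointers, so build a child -> parent map.
parent = {child: p for p in root.iter() for child in p}


def node_type(e):
    """Labels from the root down to e, joined by '/'."""
    labels = []
    while e is not None:
        labels.append(e.tag)
        e = parent.get(e)
    return "/" + "/".join(reversed(labels))


def text_values(e):
    """Sorted non-blank text values found in the subtree of e."""
    return sorted(t.strip() for t in e.itertext() if t.strip())


def structurally_identical(a, b):
    # ASSUMPTION: same type and same subtree text approximates structural identity.
    return node_type(a) == node_type(b) and text_values(a) == text_values(b)


def semantically_identical(a, b):
    if not structurally_identical(a, b):
        return False  # Axiom 1
    pa, pb = parent.get(a), parent.get(b)
    if pa is None and pb is None:
        return True  # both are root nodes
    if pa is None or pb is None:
        return False
    return semantically_identical(pa, pb)  # Axiom 2: defer to the parents


names = root.findall("./publisher/book/author/name")
same_structure = structurally_identical(names[0], names[1])
same_semantics = semantically_identical(names[0], names[1])
print(same_structure, same_semantics)  # True False
```

The recursion stops at the two <book> elements, whose subtree text differs (t1 versus t2), so the two <name>n1</name> elements come out as structurally identical but semantically different.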
3.3 Local and Non-Local Identifiers
An identifier is an XPath expression that is evaluated against nodes in the XML tree. It can be either local or
non-local. A non-local identifier locates at least one node that is not a descendant of the context node [1]; a local
identifier locates only descendants of the context node. Consider the <name> node in Fig. 1, which has text content
(text()). Here text() is the local identifier for <name> in Fig. 1. But in Fig. 3 there are two identical <name>
nodes with the same text content. An identifier for this type cannot consist only of descendants
of the <name> nodes, so the <name> identifier has to be non-local.
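A rough way to see when a local identifier is enough is to evaluate text() on every node of a type and check for duplicate values. The Python sketch below does this for the <name> elements of Fig. 1 and Fig. 3. The helper names are hypothetical, and elements are grouped by label rather than by full type, which is a simplification that happens to be safe for these two documents.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Documents transcribed from Fig. 1 and Fig. 3.
FIG1 = """<bib>
  <author><name>n1</name>
    <book><title>t1</title><publisher>p1</publisher></book>
  </author>
  <author><name>n2</name>
    <book><title>t2</title><publisher>p2</publisher></book>
    <book><title>t1</title><publisher>p1</publisher></book>
  </author>
</bib>"""

FIG3 = """<bib>
  <publisher>p1
    <book><title>t1</title><author><name>n1</name></author></book>
  </publisher>
  <publisher>p2
    <book><title>t2</title><author><name>n1</name></author></book>
  </publisher>
</bib>"""


def local_text_values(xml_text, tag):
    """Evaluate the local identifier text() on every element with this label."""
    root = ET.fromstring(xml_text)
    return [(e.text or "").strip() for e in root.iter(tag)]


def text_is_sufficient(xml_text, tag):
    """True if no two elements with this label share the same text() value."""
    counts = Counter(local_text_values(xml_text, tag))
    return all(n == 1 for n in counts.values())


fig1_ok = text_is_sufficient(FIG1, "name")
fig3_ok = text_is_sufficient(FIG3, "name")
print(fig1_ok, fig3_ok)  # True False
```

In Fig. 1 the values n1 and n2 are distinct, so text() alone separates the <name> nodes; in Fig. 3 both evaluate to n1, so information from outside the <name> subtree is needed.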
3.4 Computing Identifiers
The identifier of each node has to be evaluated to obtain a value. If a pair of nodes in two versions evaluate to the
same value, they are considered semantically the same and are associated regardless of structural change. This
can be written as Eval(x, L) = Eval(y, L), where x and y are the nodes and L = {E1, E2, …, En} is the list of
expressions. The algorithm to compute identifiers runs in two phases [1].
3.4.1 Phase 1: This phase corresponds to Axiom 1 and tries to identify all local identifiers. It traverses the
document bottom-up, i.e. from the leaves to the root. At the end of Phase 1, all semantically different nodes that
Axiom 1 can determine have been found.
3.4.2 Phase 2: This phase corresponds to Axiom 2; it runs recursively and expands to identify (non-local) identifiers
for nodes of the remaining types. After Phase 2 terminates, all semantically different nodes have been found. Any
remaining node is a redundant copy of another node in the document. Thus we are able to identify semantically
identical nodes in both versions. The total cost of the algorithm is bounded by O(n*log(n)), where n is the size of
the document tree.
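The evaluation step Eval(x, L) can be sketched in Python as follows. This is not the two-phase algorithm from [1]; it only evaluates a given list of expressions for a node. The function eval_identifier is hypothetical and understands just a tiny subset of XPath (leading "../" steps, a child path, and a trailing text()). The expression list used for <book> in Fig. 1 is an assumption adapted to the Fig. 1 layout, where <name> is a sibling of <book>.

```python
import xml.etree.ElementTree as ET

# Document transcribed from Fig. 1.
FIG1 = """<bib>
  <author><name>n1</name>
    <book><title>t1</title><publisher>p1</publisher></book>
  </author>
  <author><name>n2</name>
    <book><title>t2</title><publisher>p2</publisher></book>
    <book><title>t1</title><publisher>p1</publisher></book>
  </author>
</bib>"""


def eval_identifier(node, exprs, parent):
    """Eval(x, L): text values located from node x by each expression in L."""
    values = []
    for expr in exprs:
        ctx, path = node, expr
        while path.startswith("../"):  # climb using the explicit parent map
            ctx, path = parent[ctx], path[3:]
        if path.endswith("/text()"):
            path = path[: -len("/text()")]
        elif path == "text()":
            path = "."  # text of the context node itself
        values.extend((e.text or "").strip() for e in ctx.findall(path))
    return tuple(values)


root = ET.fromstring(FIG1)
parent = {child: p for p in root.iter() for child in p}

# ASSUMPTION: identifier for <book> in Fig. 1, written relative to the book node.
L = ["../name/text()", "title/text()"]
book_values = [eval_identifier(b, L, parent) for b in root.iter("book")]
print(book_values)  # [('n1', 't1'), ('n2', 't2'), ('n2', 't1')]
```

Two nodes x and y from different versions are then candidates for association when their evaluated tuples are equal, which is the condition Eval(x, L) = Eval(y, L).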
4. NODE MATCHING
We are now able to identify semantically identical nodes in a document. These nodes can be associated with
the respective nodes in the other version, and this association is preserved even if the structural change is
significant. This section discusses how semantically identical nodes can be matched with each other. In order to
handle significant structural changes, we assume that identifying information remains nearby. The
following definitions introduce terms that are important for matching nodes with each other.

Type Territory: The territory of a type T, denoted TT, is the set of all text nodes that are descendants of the least
common ancestor, denoted lca(T ), of all of the type T nodes [1].
Node Territory: The territory of a type T node p, denoted Np, is TT excluding all text nodes that are descendants of
other type T nodes [1].
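The two territory definitions can be followed literally in a short Python sketch. The example chosen here, the <author> nodes of Fig. 1, is an illustration of ours and not one taken from [1]. Text nodes are represented by the elements that carry them, since ElementTree has no separate text node objects, and all helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Document transcribed from Fig. 1.
FIG1 = """<bib>
  <author><name>n1</name>
    <book><title>t1</title><publisher>p1</publisher></book>
  </author>
  <author><name>n2</name>
    <book><title>t2</title><publisher>p2</publisher></book>
    <book><title>t1</title><publisher>p1</publisher></book>
  </author>
</bib>"""

root = ET.fromstring(FIG1)
parent = {child: p for p in root.iter() for child in p}


def chain(e):
    """Elements from the root down to e (inclusive)."""
    out = []
    while e is not None:
        out.append(e)
        e = parent.get(e)
    return out[::-1]


def text_nodes(e):
    """Elements in the subtree of e that carry non-blank text, in document order."""
    return [x for x in e.iter() if (x.text or "").strip()]


def type_territory(nodes):
    """Text nodes below the least common ancestor of all nodes of one type."""
    lca = None
    for level in zip(*(chain(n) for n in nodes)):
        if all(x is level[0] for x in level):
            lca = level[0]
        else:
            break
    return text_nodes(lca)


def node_territory(p, nodes):
    """Type territory minus the text nodes below the other nodes of the type."""
    excluded = {t for n in nodes if n is not p for t in text_nodes(n)}
    return [t for t in type_territory(nodes) if t not in excluded]


def values(elems):
    return [e.text.strip() for e in elems]


authors = root.findall("author")
tt = values(type_territory(authors))
np_first = values(node_territory(authors[0], authors))
print(tt)        # ['n1', 't1', 'p1', 'n2', 't2', 'p2', 't1', 'p1']
print(np_first)  # ['n1', 't1', 'p1']
```

Here lca(T) for the <author> type is <bib>, so the type territory covers every text node in the document, while the node territory of the first <author> drops everything below the second <author>.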
The notions of node and type territory are shown in Fig. 4. Consider three nodes p1, p2, p3 whose least common
ancestor (lca) is the node p. The grey shaded area in Fig. 4 shows the node territory of p2, which excludes the
descendants of the other nodes of the same type. The dark grey parts depict the type territory of p1 and p3. For
example, consider the <book> nodes in Fig. 3: the type territory of <book> is (n1, t1, p1, n2, t2, p2), whereas the
node territory of the bottom <book> is (p2, t2, n2). In order to match nodes across versions, we now introduce the
notions of 'admits' and 'node match'. For a node match, both nodes need to admit each other.
Admits: q admits p if Eval(q, ID(q)) ⊆ Np.
Consider node p in version Vp and node q in version Vq that are candidates to be associated (matched) with each other.
Node q is identified by a list of text values q1, …, qn in Vq. If node p in Vp has at least as much information as q
does, then p should have the group of text values (q1, …, qn) in its own territory Np. A match implies semantic
equality between two nodes, so it requires admission in both directions [1]. Once two nodes admit each other,
they can be matched.
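The admits relation and the two-way match reduce to subset tests, as the following Python sketch shows. The function names are hypothetical. The identifier values (n1, t1) and (t1) and the Version 2 territory (p1, n1, t1, n2, p2) are the ones quoted in Section 5; the Version 1 territory used for the reverse direction is an assumed value chosen only to make the example complete.

```python
def admits(identifier_value_q, territory_values_p):
    """q admits p if Eval(q, ID(q)) is a subset of the text values in Np."""
    return set(identifier_value_q) <= set(territory_values_p)


def node_match(id_p, territory_p, id_q, territory_q):
    """A match requires admission in both directions."""
    return admits(id_q, territory_p) and admits(id_p, territory_q)


# q: leftmost <book> in version 1; p: leftmost <book> in version 2.
id_q = ("n1", "t1")                          # identifier value from Table 1
territory_p = ("p1", "n1", "t1", "n2", "p2")  # territory as stated in Section 5
id_p = ("t1",)                               # identifier value from Table 2
territory_q = ("n1", "t1", "p1")             # ASSUMPTION: not given in the text

q_admits_p = admits(id_q, territory_p)
matched = node_match(id_p, territory_p, id_q, territory_q)
print(q_admits_p, matched)  # True True
```

If either subset test fails, for instance an identifier value (n2, t1) checked against the territory (n1, t1, p1), the pair is not matched.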
5. MATCHING WITH SAMPLE DOCUMENT
Consider Fig. 1 and Fig. 3, between which there has been a significant change in structure. Both contain the same
data, arranged in different schemas. For clarity, we display the XML documents as trees,
as shown in Fig. 5. We first need to compute identifiers for each node in both versions.
From the algorithm and the notions of node matching, we calculate the identifier value of each <book> node using the
XPath expression (../author/name/text(), title/text()). The calculated values for all <book> identifiers in both
versions are shown in Table 1 and Table 2. The labels top, middle and bottom in Tables 1 and 2 correspond to the
respective positions of the <book> nodes in the XML document.

Table 1 – Version 1
Node            Identifier Value
book (top)      n1, t1
book (middle)   n2, t2
book (bottom)   n2, t1

Table 2 – Version 2
Node            Identifier Value
book (top)      t1
book (bottom)   t2

The territory of <book> in Version 2, as shown in Fig. 5, is (p1, n1, t1, n2, p2), whereas the identifier value of
the leftmost book in Version 1 is (n1, t1). Since the notion of admits is satisfied, the leftmost book in Version 1
admits the leftmost book in Version 2. The dashed lines in Fig. 6 depict the admits relation between the <book>
identifier values, and the black line shows the <book> node match between the two versions. The other
<book> and <author> nodes are matched with each other in the same way, as shown in Fig. 6.
Fig.5. <book> matches with admits
Fig.6. <book> and <author> matches
In the end, no node is left unmatched, i.e. no semantically distinct node is found in either
version. Hence we conclude that Fig. 1 and Fig. 3 are semantically the same and that nodes can be associated for the
book and author elements.

6. CONCLUSION
This paper discussed an efficient approach to identifying changes between successive versions of XML documents
based on semantics rather than structure. The approach does not require knowledge of a schema beforehand. The
matching algorithm looks for the information of one node in the territory of the other, which keeps the
approach flexible even when structural changes are significant. The approach is novel in that most
previous research focused on the structure of the documents. It also reduces the cost of change
caused by a large number of insertions and deletions, as semantically identical nodes are associated and
conserved across changes in the document, thereby improving support for temporal queries and incremental
evaluation.
7. REFERENCES
[1] S. Zhang, C. Dyreson, R. T. Snodgrass. “Schema-Less, Semantics-Based Change Detection for XML Documents”. In
Proceedings of WISE, November 2004.
[2] G. Cobéna, S. Abiteboul, A. Marian. “Detecting Changes in XML Documents”. In Proceedings of ICDE, San Jose, February
2002, 41–52.
[3] S. Chawathe and H. Garcia-Molina. “Meaningful Change Detection in Structured Data”. In Proceedings of SIGMOD
Conference, June 1997, 26–37.
[4] C. Dyreson, H. Ling, Y. Wang. “Managing Versions of Web Documents in a Transaction-time Web Server”. In Proc. of the
13th International World Wide Web Conference, New York City, May 2004, 421–432.
[5] Dong, Guozhu and Topor, Rodney, “Incremental evaluation of Datalog queries,” Database Theory — ICDT '92, vol. 646,
August 1992, [Lecture Notes in Computer Science].
[6] Y. Wang, D. DeWitt, J.-Y. Cai. “X-Diff: An Effective Change Detection Algorithm for XML Documents”.
www.cs.wisc.edu/niagara/papers/xdiff.pdf, current as of August 2004.
[7] Fabio Grandi. “Introducing an Annotated Bibliography on Temporal and Evolution Aspects in the World Wide Web”.
SIGMOD Record, Volume 33, Number 2, June 2004.
[8] L. Liu, C. Pu, R. Barga, and T. Zhou. “Differential Evaluation of Continual Queries”. Proc. of the International Conference
on Distributed Computing Systems, 1996, 458–465.
[9] L. Liu, C. Pu, and W. Tang. “Continual Queries for Internet Scale Event-Driven Information Delivery”. IEEE Trans.
Knowledge Data Engineering, 11(4), 610–628, 1999.
[10] “XML Path Language (XPath) 2.0”. W3C, www.w3c.org/TR/xpath20/, current as of August 2004.
