Change Detection in XML Documents using
Semantic Identifiers
Kailaash Balachandran
Seminar Group: Data Deduplication and Versioning
University of Paderborn, Supervisor: Dr. Rita Hartel
Email: kailaash@mail.uni-paderborn.de
Abstract: Change detection is the process of comparing successive versions of a document to
identify the changes. The success of XML as the standard for data exchange has paved the way for
a number of change detection techniques that focus on structural changes rather than on
semantics. Existing structural change detection mechanisms tend to break down when the
changes made are large. This paper discusses a schema-less, semantics-based
framework that associates semantic identifiers with elements in successive versions, thus removing
the obstacle of inefficient association of elements when the structural change is significant.
1. INTRODUCTION
Change detection is the process of identifying differences between successive versions of a document, thereby
determining which parts of the document have changed and which have not. Detecting changes reduces the storage
space needed for historical data and makes it possible to support temporal queries [1]. In systems where data must
be sent across a network, traffic cost can be reduced because only the changes are sent, not the entire document.
Temporal queries concern data that changes over time through creation, modification, and deletion [2, 3]; they are
issued against previous and current versions of the document. Change detection thus helps in mining the historical
data of a document to provide detailed information on the changes made since its inception [4]. This is particularly
true of data on the web, which has a higher rate of change and whose changes need to be monitored effectively.
Change detection also improves incremental query evaluation, in which continuous queries monitor a particular data
source and are updated when the data is modified; the cost of query evaluation is reduced by using only the change.
Instead of re-evaluating a query on the entire data, it suffices to combine the result of the query on the changed
data with the previous query result [5, 7, 8, 9].
The rest of this paper is organized as follows. Section 2 presents problems with existing structural approaches, using
an example to show how they tend to break down when the structural changes are significant but the documents are
semantically the same. Section 3 describes the notion of identifiers in XML documents, their classification as local
and non-local, and the algorithm used to compute them. Section 4 presents the concepts of node matching and
admits, which associate semantically identical nodes in two versions with each other. This approach is further
explained with a sample XML document in Section 5. Section 6 concludes.
2. MOTIVATION
Fig. 1 and Fig. 2 show an older and a newer version of an XML document. The underlined part marks the change between
the two versions (the <title> of one book is renamed to <book-title>); otherwise they are alike, with exactly the same
information. Fig. 3 shows an alternative newer version of the same XML document with a far more substantial change.
This alternative version also carries the same information, only arranged in a different schema. Intuitively, due to
the significant structural changes, this version requires a number of insertions and deletions even though it has the
same information as its predecessor. The high cost of change is not our primary concern, as the cost is always on the
order of the size of the document, no matter how big the change is. The problem lies in the fact that when the
structural changes are significant, it becomes difficult to associate nodes between versions. An association relates
an element in one version to an element in the next version. Significant structural changes make associations between
elements harder to establish, and they tend to break down. For example, text node p2 in Fig. 1 should be associated
with p2 in Fig. 3. But as the versions are arranged in different schemas, the <publisher> elements in the two versions
are structurally very different from each other. Thus they cannot be associated by a structure-based change detection
mechanism even though the data present in the two versions is semantically the same. This also creates an obstacle
for temporal query support, as we are not able to associate elements between versions.
<bib>
  <author>
    <name>n1</name>
    <book>
      <title>t1</title>
      <publisher>p1</publisher>
    </book>
  </author>
  <author>
    <name>n2</name>
    <book>
      <title>t2</title>
      <publisher>p2</publisher>
    </book>
    <book>
      <title>t1</title>
      <publisher>p1</publisher>
    </book>
  </author>
</bib>
Fig. 1 Older Version

<bib>
  <author>
    <name>n1</name>
    <book>
      <title>t1</title>
      <publisher>p1</publisher>
    </book>
  </author>
  <author>
    <name>n2</name>
    <book>
      <book-title>t2</book-title>
      <publisher>p2</publisher>
    </book>
    <book>
      <title>t1</title>
      <publisher>p1</publisher>
    </book>
  </author>
</bib>
Fig. 2 Newer Version 1

<bib>
  <publisher>p1
    <book>
      <title>t1</title>
      <author>
        <name>n1</name>
      </author>
    </book>
  </publisher>
  <publisher>p2
    <book>
      <title>t2</title>
      <author>
        <name>n1</name>
      </author>
    </book>
  </publisher>
</bib>
Fig. 3 Newer Version 2
This paper discusses a semantics-based change detection technique in which nodes found to be semantically equivalent
across versions are associated, regardless of structural change. First, we identify semantic identifiers for every
node in both versions and evaluate each identifier to a value. If nodes in the two versions evaluate to the same
value, they are considered semantically the same and are associated with each other. Finding semantics-based
associations provides good temporal query support, as it allows element associations to exist across successive
versions of the document.
3. IDENTIFIERS
3.1 Structural Identity
Identifiers are XPath expressions [10] that are used to identify elements and distinguish them from one another based
on their type. Every element in the XML model is of a specific type (T) that corresponds to the list of labels from
the root to the element, separated by ‘/’. Types are calculated in top-down fashion, from the root to the leaves of
the XML model tree. A text node and its parent node have the same type, but they are treated as two different nodes.
The notion of types corresponds to the notion of signatures in [6]: if x is a text node,
Signature(x) = /Name(x1)/…/Name(xn)/Type(x), where Name(x) denotes the node name, Type(x) denotes its type, x1 is the
root of the tree, and (x1..xn) is the path from the root to x. Previous change detection mechanisms match nodes only
if they have the same signature. The problem is that nodes can be structurally different yet semantically the same.
Hence we introduce the notion of semantic identity, described in the next section. A pair of elements is considered
structurally identical only if both are of the same type or have a common descendant that is structurally the same
as in the other element.
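The limitation of signature-based matching can be illustrated with a minimal sketch. It assumes Python's standard xml.etree.ElementTree module, uses two abridged, hypothetical fragments in the style of Fig. 1 and Fig. 3, and writes Type(x) of a text node as "text()"; none of these choices come from [6].

```python
import xml.etree.ElementTree as ET

# Abridged, hypothetical fragments in the style of Fig. 1 and Fig. 3.
OLD = ("<bib><author><name>n2</name>"
       "<book><title>t2</title><publisher>p2</publisher></book>"
       "</author></bib>")
NEW = ("<bib><publisher>p2"
       "<book><title>t2</title><author><name>n2</name></author></book>"
       "</publisher></bib>")


def signatures(xml_text):
    """Map each text value to the set of signatures of the text nodes holding it."""
    result = {}

    def walk(elem, path):
        path = path + "/" + elem.tag
        text = (elem.text or "").strip()
        if text:
            # "text()" stands in for Type(x) of a text node (an assumption).
            result.setdefault(text, set()).add(path + "/text()")
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return result


old_sig = signatures(OLD)
new_sig = signatures(NEW)
for value in ("n2", "t2", "p2"):
    # Same text value, but the root-to-node label paths differ between versions.
    print(value, old_sig[value], new_sig[value], old_sig[value] == new_sig[value])
```

In this sketch every text value keeps its content but changes its signature, so a matcher that requires equal signatures would associate none of them.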
3.2 Semantic Identity
Semantic identity is different from, but related to, structural identity. Elements that do not satisfy the definition
of structural identity given in Section 3.1 are considered structurally different. The following axioms [1] connect
structural identity to semantic identity.
3.2.1 Axiom 1: If a pair of nodes is found to be structurally different, the nodes are semantically different.
For example, consider text nodes t1 and t2 in Fig. 3. Though both have a <title> parent, these two nodes are
considered semantically different because they are textually different (textual content is also an aspect of
structural identity). Thus t1 and t2 are both structurally and semantically different. But this need not always be
the case. Consider the text nodes n1 in the two <book> elements in Fig. 3. They are structurally the same, but are
they semantically identical? Not all structurally identical nodes are semantically identical. This is stated in
Axiom 2 below.
3.2.2 Axiom 2: A pair of structurally identical nodes is semantically identical if and only if their respective
parents are semantically identical or they are both root nodes.
Axiom 1 distinguishes nodes that have different content. When nodes have the same content, as the n1 text nodes in
<book> (Fig. 3) do, Axiom 2 distinguishes them by their context. The context of a node is the semantic identity of
its parent node. Hence, as per Axiom 2, the <name>n1</name> elements in Fig. 3 are semantically different despite
their structural similarity: if we inspect <author> in Fig. 1 and Fig. 3, they are in the context of two different
books, which are structurally different as per Axiom 1. Thus the two <name>n1</name> elements are structurally
similar yet semantically different.
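The recursion expressed by the two axioms can be sketched as follows. This is only an illustration under a loud simplifying assumption: structural identity is reduced here to "same type and same directly contained text", which is weaker than the definition in Section 3.1. The helper names are hypothetical and the document is the one from Fig. 3.

```python
import xml.etree.ElementTree as ET

FIG3 = ("<bib>"
        "<publisher>p1<book><title>t1</title>"
        "<author><name>n1</name></author></book></publisher>"
        "<publisher>p2<book><title>t2</title>"
        "<author><name>n1</name></author></book></publisher>"
        "</bib>")


def build_index(root):
    """Return (parents, types): each element's parent and its root-to-node label path."""
    parents, types = {}, {root: "/" + root.tag}
    for parent in root.iter():          # pre-order, so a parent is typed before its children
        for child in parent:
            parents[child] = parent
            types[child] = types[parent] + "/" + child.tag
    return parents, types


def own_text(elem):
    return (elem.text or "").strip()


def structurally_identical(a, b, types):
    # Simplifying assumption: same type and same directly contained text.
    return types[a] == types[b] and own_text(a) == own_text(b)


def semantically_identical(a, b, parents, types):
    if not structurally_identical(a, b, types):
        return False                    # Axiom 1
    pa, pb = parents.get(a), parents.get(b)
    if pa is None and pb is None:
        return True                     # both are root nodes
    if pa is None or pb is None:
        return False
    return semantically_identical(pa, pb, parents, types)  # Axiom 2


root = ET.fromstring(FIG3)
parents, types = build_index(root)
names = list(root.iter("name"))
titles = list(root.iter("title"))

# t1 vs t2: different text, so Axiom 1 already separates them.
titles_differ = not structurally_identical(titles[0], titles[1], types)
# The two n1 names look alike, but their contexts diverge at <publisher>.
names_identical = semantically_identical(names[0], names[1], parents, types)
print(titles_differ, names_identical)
```

Under this simplification the two <name> elements are separated once the recursion reaches their <publisher> ancestors, whose text differs.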
3.3 Local and Non-Local Identifiers
An identifier is an XPath expression that is evaluated against nodes in the XML tree. Identifiers can be either local
or non-local. A local identifier locates only descendants of the context node, whereas a non-local identifier locates
at least one node that is not a descendant of the context node [1]. Consider the <name> node in Fig. 1, which has
text content (text()). Hence text() is the local identifier for <name> in Fig. 1. But in Fig. 3 there are two
identical <name> nodes with the same text content. An identifier for this type cannot rely on descendants of the
<name> nodes alone. Hence the identifier for <name> has to be non-local.
3.4 Computing Identifiers
The identifier for each node has to be evaluated to obtain a value. If a pair of nodes in two versions evaluate to
the same value, they are considered semantically the same and are associated regardless of structural change. This
can be written as Eval(x, L) = Eval(y, L), where x and y are the nodes and L = {E1, E2, …, En} is the list of
expressions. The algorithm to compute identifiers runs in two phases [1].
3.4.1 Phase 1: This phase corresponds to Axiom 1 and identifies all local identifiers. It traverses the document in
bottom-up fashion, i.e. from the leaves to the root. At the end of Phase 1, all semantically different nodes that
Axiom 1 can determine have been found.
3.4.2 Phase 2: This phase corresponds to Axiom 2; it runs recursively and expands to identify the nodes of the
remaining types (non-local identifiers). After Phase 2 terminates, all semantically different nodes have been found.
Each remaining node is a redundant copy of another node in the document. Thus we are able to identify semantically
identical nodes in both versions. The total cost of the algorithm is bounded by O(n*log(n)), where n is the size of
the document tree.
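How Eval(x, L) produces a comparable value can be sketched with a tiny evaluator. It does not implement the two-phase algorithm of [1]; it only evaluates a given list of expressions. The evaluator supports an assumed XPath subset ("..", child names, and a final "text()"), because ElementTree elements have no parent pointers, and the expressions in L are chosen to fit the Fig. 1 document used here.

```python
import xml.etree.ElementTree as ET

FIG1 = ("<bib>"
        "<author><name>n1</name>"
        "<book><title>t1</title><publisher>p1</publisher></book></author>"
        "<author><name>n2</name>"
        "<book><title>t2</title><publisher>p2</publisher></book>"
        "<book><title>t1</title><publisher>p1</publisher></book></author>"
        "</bib>")


def parent_map(root):
    """ElementTree keeps no parent links, so build a child -> parent map."""
    return {child: parent for parent in root.iter() for child in parent}


def eval_expr(node, expr, parents):
    """Evaluate a tiny XPath subset relative to node; returns a tuple of text values."""
    current = [node]
    for step in expr.split("/"):
        if step == "..":
            current = [parents[n] for n in current if n in parents]
        elif step == "text()":
            texts = [(n.text or "").strip() for n in current]
            return tuple(sorted(t for t in texts if t))
        else:
            current = [c for n in current for c in n if c.tag == step]
    raise ValueError("expression must end in text()")


def eval_identifier(node, exprs, parents):
    """Eval(x, L): one tuple of text values per expression in L."""
    return tuple(eval_expr(node, e, parents) for e in exprs)


root = ET.fromstring(FIG1)
parents = parent_map(root)
L = ["../name/text()", "title/text()"]   # illustrative identifier for <book>
values = [eval_identifier(book, L, parents) for book in root.iter("book")]
for v in values:
    print(v)
```

Two nodes x and y would then be associated when eval_identifier(x, L, ...) equals eval_identifier(y, L, ...), which is the comparison Eval(x, L) = Eval(y, L).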
4. NODE MATCHING
We are now able to identify semantically identical nodes in the document. These nodes can be associated with the
respective nodes in the other version, and the association is preserved even if the structural change is
significant. This section discusses how semantically identical nodes can be matched with each other. In order to
handle significant structural changes, we assume that identifying information remains nearby. The following terms
are important for matching nodes with each other.
Type Territory: The territory of a type T, denoted TT, is the set of all text nodes that are descendants of the least
common ancestor, denoted lca(T), of all of the type T nodes [1].
Node Territory: The territory of a type T node p, denoted Np, is TT excluding all text nodes that are descendants of
other type T nodes [1].
The notions of node and type territory are shown in Fig. 4. Consider three nodes p1, p2, p3 whose least common
ancestor (lca) is the node p. The grey shaded area in Fig. 4 shows the node territory of p2, which excludes
descendants of the other type T nodes. The dark grey parts depict the type territory of p1 and p3. For example,
consider the <book> nodes in Fig. 3: the type territory of <book> is (n1, t1, p1, n2, t2, p2), whereas the node
territory of the bottom <book> is (p2, t2, n2). In order to match nodes across versions, we introduce the notions of
‘Admits’ and ‘Node match’. To perform a node match, both nodes need to admit each other.
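The two territory definitions above can be sketched by applying them literally to the Fig. 1 document. This is an illustration only: text nodes are represented by the elements that carry non-empty text, and all helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

FIG1 = ("<bib>"
        "<author><name>n1</name>"
        "<book><title>t1</title><publisher>p1</publisher></book></author>"
        "<author><name>n2</name>"
        "<book><title>t2</title><publisher>p2</publisher></book>"
        "<book><title>t1</title><publisher>p1</publisher></book></author>"
        "</bib>")


def index_tree(root):
    """Return (parents, types): each element's parent and its root-to-node label path."""
    parents, types = {}, {root: "/" + root.tag}
    for parent in root.iter():
        for child in parent:
            parents[child] = parent
            types[child] = types[parent] + "/" + child.tag
    return parents, types


def ancestors_or_self(node, parents):
    chain = [node]
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    return chain                         # node first, root last


def lca(nodes, parents):
    """Least common ancestor of a non-empty list of elements."""
    common = ancestors_or_self(nodes[0], parents)
    for n in nodes[1:]:
        others = set(ancestors_or_self(n, parents))
        common = [a for a in common if a in others]
    return common[0]                     # deepest shared ancestor


def text_holders(elem):
    """Elements at or below elem with non-empty text (stand-ins for text nodes)."""
    return [e for e in elem.iter() if (e.text or "").strip()]


def type_territory(type_path, root, parents, types):
    nodes = [e for e in root.iter() if types[e] == type_path]
    return text_holders(lca(nodes, parents)), nodes


def node_territory(p, root, parents, types):
    territory, same_type = type_territory(types[p], root, parents, types)
    excluded = {e for q in same_type if q is not p for e in text_holders(q)}
    return [e for e in territory if e not in excluded]


def texts(elems):
    return [e.text.strip() for e in elems]


root = ET.fromstring(FIG1)
parents, types = index_tree(root)
books = list(root.iter("book"))
tt, _ = type_territory("/bib/author/book", root, parents, types)
top_book_territory = node_territory(books[0], root, parents, types)
print(texts(tt))
print(texts(top_book_territory))
```

Here lca(T) for the <book> type is <bib>, so the type territory covers every text node in the document, and the node territory of the top <book> drops the text nodes that sit under the other two <book> elements.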
Admits: q admits p if Eval(q, ID(q)) ⊆ Np.
Consider a node p in version Vp and a node q in version Vq that are candidates to be associated (matched) with each
other. Node q is identified by a list of text values q1, …, qn in Vq. If node p in Vp has at least as much
information as q does, then p should have the group of text values (q1, …, qn) in its own territory Np. A match
implies semantic equality between two nodes, thus it requires admission in both directions [1]. Once two nodes admit
each other, they can be matched with each other.
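The admits test and the two-way match can be sketched over plain collections of text values. The forward-direction values below are the ones quoted in Section 5 (identifier (n1, t1) and territory (p1, n1, t1, n2, p2)); the reverse-direction territory is an assumed value for illustration, and treating values as sets ignores duplicate text values.

```python
def admits(identifier_values, territory):
    """q admits p if every text value identifying q occurs in p's node territory Np."""
    return set(identifier_values) <= set(territory)


def node_match(id_p, territory_p, id_q, territory_q):
    """p and q match only if admission holds in both directions."""
    return admits(id_q, territory_p) and admits(id_p, territory_q)


# q: leftmost <book> in version 1, p: leftmost <book> in version 2.
id_q = ("n1", "t1")
territory_p = ("p1", "n1", "t1", "n2", "p2")
id_p = ("t1",)
territory_q = ("n1", "t1", "p1", "n2")    # assumed for illustration

forward = admits(id_q, territory_p)        # q admits p
matched = node_match(id_p, territory_p, id_q, territory_q)
rejected = admits(("n2", "t2"), territory_p)   # t2 is not in p's territory
print(forward, matched, rejected)
```

Requiring both directions keeps a node with very little identifying information from being matched to an unrelated node whose territory happens to contain it.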
5. MATCHING WITH SAMPLE DOCUMENT
Consider Fig. 1 and Fig. 3, where there has been a significant change in structure. Both contain the same data, just
arranged in a different schema. For clarity, we display the XML documents as trees, as shown from Fig. 5 onwards. We
first need to compute identifiers for each node in both versions.
From the algorithm and the notions of node matching, we calculate the identifier value of each <book> node using the
XPath expression (../author/name/text(), title/text()). The calculated values of all <book> identifiers in both
versions are shown in Table 1 and Table 2. The labels top, middle and bottom in Tables 1 and 2 correspond to the
respective positions of the <book> nodes in the XML document.

Table 1 – Version 1
Node            Identifier Value
book (top)      n1, t1
book (middle)   n2, t2
book (bottom)   n2, t1

Table 2 – Version
Node            Identifier Value
book (top)      t1
book (bottom)   t2

The territory of <book> in Version 2 as shown in Fig. 5 is (p1, n1, t1, n2, p2), whereas the identifier of the
leftmost book in version 1 is (n1, t1). Since the condition for admits is satisfied, the leftmost book in version 1
admits the leftmost book in version 2. The dashed lines in Fig. 6 depict the admits relation for the <book>
identifier values, and the black line shows the <book> node match between the two versions. The other <book> and
<author> nodes are matched with each other in the same way, as shown in Fig. 6.
Fig. 5. <book> matches with admits
Fig. 6. <book> and <author> matches
In the end, no node is left unmatched, i.e. no semantically distinct node is found in either version. Hence, we show
that Fig. 1 and Fig. 3 are semantically the same and that nodes can be associated for the book and author elements.
6. CONCLUSION
This paper discusses an efficient approach to identifying changes between successive versions of XML documents based
on their semantics rather than on their structure. The approach does not require knowledge of the schema beforehand.
The matching algorithm looks for the information of one node in the territory of the other, which keeps the approach
flexible even when structural changes are significant. The approach is novel because most previous research focused
on the structure of the documents. It also minimizes the cost of change caused by a significant number of insertions
and deletions, as semantically identical nodes are associated and preserved across changes in the document, thereby
improving support for temporal queries and incremental evaluation.
7. REFERENCES
[1] S. Zhang, C. Dyreson, R. T. Snodgrass. “Schema-Less, Semantics-Based Change Detection for XML Documents”. In
Proceedings of WISE 2004, Lecture Notes in Computer Science, vol. 3306, 2004.
[2] G. Cobéna, S. Abiteboul, A. Marian. “Detecting Changes in XML Documents”. In Proceedings of ICDE, San Jose,
February 2002, 41–52.
[3] S. Chawathe and H. Garcia-Molina. “Meaningful Change Detection in Structured Data”. In Proceedings of SIGMOD
Conference, June 1997, 26–37.
[4] C. Dyreson, H. Ling, Y. Wang. “Managing Versions of Web Documents in a Transaction-time Web Server”. In Proc. of
the 13th International World Wide Web Conference, New York City, May 2004, 421–432.
[5] G. Dong and R. Topor. “Incremental Evaluation of Datalog Queries”. In Database Theory, ICDT '92, Lecture Notes in
Computer Science, vol. 646, August 1992.
[6] Y. Wang, D. DeWitt, J.-Y. Cai. “X-Diff: An Effective Change Detection Algorithm for XML Documents”.
www.cs.wisc.edu/niagara/papers/xdiff.pdf, current as of August 2004.
[7] F. Grandi. “Introducing an Annotated Bibliography on Temporal and Evolution Aspects in the World Wide Web”.
SIGMOD Record, Volume 33, Number 2, June 2004.
[8] L. Liu, C. Pu, R. Barga, and T. Zhou. “Differential Evaluation of Continual Queries”. Proc. of the International
Conference on Distributed Computing Systems, 1996, 458–465.
[9] L. Liu, C. Pu, and W. Tang. “Continual Queries for Internet Scale Event-Driven Information Delivery”. IEEE Trans.
Knowledge Data Engineering, 11(4), 610–628, 1999.
[10] “XML Path Language (XPath) 2.0”. W3C, www.w3c.org/TR/xpath20/, current as of August 2004.

Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 

Último (20)

WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

Change Detection in XML Documents using Semantic Identifiers

Kailaash Balachandran
Seminar Group: Data Deduplication and Versioning
University of Paderborn, Supervisor: Dr. Rita Hartel
Email: kailaash@mail.uni-paderborn.de

Abstract: Change detection is the process of comparing successive versions of a document to identify the changes. The success of XML as the standard for data exchange has paved the way for a number of change detection techniques that focus on structural changes rather than on semantics. Existing structural change detection mechanisms tend to break down when the changes are large. This paper discusses a schema-less, semantics-based framework that associates semantic identifiers with elements in successive versions, thus removing the obstacle of unreliable element association when the structural change is significant.

1. INTRODUCTION

Change detection is the process of identifying differences between successive versions of a document, thereby determining which parts of the document have changed and which have not. Detecting changes reduces the storage space needed for historical data and supports temporal queries [1]. In systems where data is sent across a network, traffic cost can be reduced because only the changes are sent and not the entire document. Temporal queries concern data that changes over time through creation, modification, and deletion [2, 3]; they are issued against previous and current versions of the document. Change detection thus helps in mining the historical data of a document to provide detailed information on the changes made since its inception [4]. This is particularly true of data on the web, which has a higher rate of change and whose changes need to be monitored effectively.

It also improves incremental query evaluation, in which continuous queries monitor a particular data source and update their results when the data is modified; the cost of query evaluation is reduced by using only the change. Instead of re-evaluating a query on the entire data, it suffices to combine the result of the query on the changed data with the previous result [5, 7, 8, 9].

The rest of this paper is organized as follows. Section 2 presents problems with existing structural approaches, using an example to show how they tend to break down when two versions differ significantly in structure but are semantically the same. Section 3 describes the notion of identifiers in XML documents, their classification as local and non-local, and the algorithm used to compute them. Section 4 presents the concepts of node matching and admits, which associate semantically identical nodes in two versions with each other. This approach is further explained with a sample XML document in Section 5. Section 6 concludes.

2. MOTIVATION

Fig. 1 and Fig. 2 show older and newer versions of an XML document. The underlined part reflects the change between the two versions (the <title>t2</title> element of Fig. 1 appears as <book-title>t2</book-title> in Fig. 2); otherwise the versions carry exactly the same information. Fig. 3 shows an alternative version of the same XML document with an even more substantial change. This alternative version also has the same information, but arranged in a different schema. Intuitively, because of the significant structural changes, this version requires a number of insertions and deletions even though it carries the same information as its predecessor. The high cost of the change is not our primary concern, as the cost is always on the order of the size of the document, no matter how big the change is. The problem lies in the fact that when the structural changes are significant, it becomes difficult to associate nodes between versions. An association relates an element in one version to another element in the next version.
Significant structural changes make associations between elements harder to establish, and structure-based association tends to break down. For example, the text node p2 in Fig. 1 should be associated with p2 in Fig. 3. But because the two versions are arranged in different schemas, the <publisher> elements in the two versions are structurally very different from each other. Thus they cannot be associated by a structure-based change detection mechanism, even though the data present in the two versions is semantically the same. This also creates an obstacle for temporal query support, as we are not able to associate elements between versions.

    <bib>
      <author>
        <name>n1</name>
        <book>
          <title>t1</title>
          <publisher>p1</publisher>
        </book>
      </author>
      <author>
        <name>n2</name>
        <book>
          <title>t2</title>
          <publisher>p2</publisher>
        </book>
        <book>
          <title>t1</title>
          <publisher>p1</publisher>
        </book>
      </author>
    </bib>

Fig. 1 Older Version

    <bib>
      <author>
        <name>n1</name>
        <book>
          <title>t1</title>
          <publisher>p1</publisher>
        </book>
      </author>
      <author>
        <name>n2</name>
        <book>
          <book-title>t2</book-title>
          <publisher>p2</publisher>
        </book>
        <book>
          <title>t1</title>
          <publisher>p1</publisher>
        </book>
      </author>
    </bib>

Fig. 2 Newer Version 1

    <bib>
      <publisher>p1
        <book>
          <title>t1</title>
          <author>
            <name>n1</name>
          </author>
        </book>
      </publisher>
      <publisher>p2
        <book>
          <title>t2</title>
          <author>
            <name>n1</name>
          </author>
        </book>
      </publisher>
    </bib>

Fig. 3 Newer Version 2

This paper discusses a semantics-based change detection technique in which any node found semantically equivalent to a node in another version is associated with it, regardless of structural change. First, we identify semantic identifiers for every node in both versions and evaluate each identifier to a value. If nodes in the two versions evaluate to the same value, they are considered semantically the same and are associated with each other. Semantics-based associations provide good temporal query support, as they allow element associations to persist across successive versions of the document.

3. IDENTIFIERS

3.1 Structural Identity

Identifiers are XPath expressions [10] that are used to identify and distinguish elements from one another based on their type. Every element in the XML model has a specific type (T), which is the list of labels from the root to the element, separated by '/'. Types are calculated top-down, from the root to the leaves of the XML model tree. A text node and its parent node have the same type, but they are treated as two different nodes. The notion of types corresponds to the notion of signatures in [6]: if x is a text node, Signature(x) = /Name(x1)/…/Name(xn)/Type(x), where Name(x) denotes the node name, Type(x) denotes the node type, x1 is the root of the tree, and (x1..xn) is the path from the root to x. Previous change detection mechanisms match nodes only if they have the same signature. The problem is that nodes can be structurally different yet semantically the same. A pair of elements is considered structurally identical only if both are of the same type or have a common descendant that is structurally the same as in the other element. To go beyond this, we introduce the notion of semantic identity, described in the next section.

3.2 Semantic Identity

Semantic identity is different from, but related to, structural identity. As stated above, a pair of elements is considered structurally identical only if both are of the same type or have a common descendant that is structurally the same as in the other element. Otherwise they are considered structurally different. The following axioms [1] connect structural identity to semantic identity.

3.2.1 Axiom 1: If a pair of nodes is structurally different, the nodes are semantically different.

For example, consider the text nodes t1 and t2 in Fig. 3. Though both have a <title> parent, these two nodes are considered semantically different because they are textually different (text content is also taken into account when checking structural identity). Thus we rule out t1 and t2, as they are both structurally and semantically different. But this need not always be the case. Consider the text nodes n1 in the two <book> elements in Fig. 3. They are structurally the same, but are they semantically identical? Not all structurally identical nodes are semantically identical. This is stated in Axiom 2 below.

3.2.2 Axiom 2: A pair of structurally identical nodes is semantically identical if and only if their respective parents are semantically identical or both are root nodes.

Axiom 1 distinguishes nodes that have different content. When nodes have the same content, as with the n1 text nodes in the <book> elements of Fig. 3, Axiom 2 distinguishes them by their context. The context is the semantic identity of the respective parent node. Hence, as per Axiom 2, the two <name>n1</name> elements in Fig. 3 are semantically different despite their structural similarity: if we inspect <author> in Fig. 1 and Fig. 3, the elements are in the context of two different books, which are structurally different as per Axiom 1. Thus the two <name>n1</name> elements are structurally similar yet semantically different.

3.3 Local and Non-Local Identifiers

An identifier is an XPath expression that is evaluated on nodes in the XML tree. Identifiers can be either local or non-local. A non-local identifier locates at least one node which is not a descendant of the context node [1]; a local identifier locates only descendants of the context node. Consider the <name> nodes in Fig. 1, each of which has text content (text()). Hence text() is the local identifier for <name> in Fig. 1. But in Fig. 3, there are two identical <name> nodes with the same text content. An identifier built only from the descendants of these <name> nodes cannot distinguish them. Hence the identifier for <name> has to be non-local.

3.4 Computing Identifiers

The identifier of each node has to be evaluated to obtain a value. If a pair of nodes in two versions evaluates to the same value, the nodes are considered semantically the same and are associated regardless of their structural change. This can be written as Eval(x, L) = Eval(y, L), where x and y are the nodes and L = {E1, E2, …, En} is the list of expressions. The algorithm to compute identifiers runs in two phases [1].

3.4.1 Phase 1: This phase corresponds to Axiom 1 and tries to identify all local identifiers. It traverses the document bottom-up, i.e. from the leaves to the root. At the end of Phase 1, all semantically different nodes that Axiom 1 can determine have been found.

3.4.2 Phase 2: This phase corresponds to Axiom 2. It runs recursively and expands to identify (non-local) identifiers for nodes of the remaining types. After Phase 2 terminates, all semantically different nodes have been found. Each remaining node is a redundant copy of another node in the document. Thus we are able to identify semantically identical nodes in both versions.

The total cost of the algorithm is bounded by O(n*log(n)), where n is the size of the document tree.

4. NODE MATCHING

We are now able to identify semantically identical nodes in the document. These nodes can be associated with the respective nodes in the other version. This association is preserved even if the structural change is significant. This section discusses how semantically identical nodes can be matched with each other. In order to handle significant structural changes, we assume that identifying information will remain nearby. The following section presents terms that are important for matching nodes with each other.
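Before those terms are introduced, the evaluation step of Section 3.4 can be made concrete. The following is a minimal Python sketch under stated assumptions, not the two-phase algorithm of [1]: it does not discover identifiers, it only evaluates a given list of expressions L on the Fig. 1 document. The helper names (node_type, eval_expr, eval_identifier, associated), the tiny XPath subset they handle, and the identifier list BOOK_ID are all illustrative choices made for this sketch.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Fig. 1 (older version), as transcribed in Section 2.
DOC = ET.fromstring(
    "<bib>"
    "<author><name>n1</name>"
    "<book><title>t1</title><publisher>p1</publisher></book></author>"
    "<author><name>n2</name>"
    "<book><title>t2</title><publisher>p2</publisher></book>"
    "<book><title>t1</title><publisher>p1</publisher></book></author>"
    "</bib>"
)

# ElementTree keeps no parent pointers, so build a child -> parent map.
PARENTS = {child: parent for parent in DOC.iter() for child in parent}


def node_type(node):
    """Type of a node: the labels from the root to the node, joined by '/'."""
    labels = []
    while node is not None:
        labels.append(node.tag)
        node = PARENTS.get(node)
    return "/" + "/".join(reversed(labels))


def eval_expr(node, expr):
    """Evaluate a tiny XPath subset: '..', child names, and a final 'text()'."""
    current = [node]
    for step in expr.split("/"):
        if step == "..":
            current = [PARENTS[n] for n in current if n in PARENTS]
        elif step == "text()":
            return tuple((n.text or "").strip() for n in current)
        else:
            current = [c for n in current for c in n if c.tag == step]
    raise ValueError("this sketch only supports expressions ending in text()")


def eval_identifier(node, expressions):
    """Eval(x, L): the value of every expression in L, evaluated at node x."""
    return tuple(eval_expr(node, e) for e in expressions)


def associated(x, y, expressions):
    """Two nodes are associated when Eval(x, L) = Eval(y, L)."""
    return eval_identifier(x, expressions) == eval_identifier(y, expressions)


# Assumed identifier list for <book> in Fig. 1: author name plus title.
BOOK_ID = ["../name/text()", "title/text()"]

books = list(DOC.iter("book"))
groups = defaultdict(list)
for book in books:
    groups[eval_identifier(book, BOOK_ID)].append(book)

for value, members in groups.items():
    print(node_type(members[0]), value, len(members))
```

With this identifier list, each of the three <book> elements of Fig. 1 evaluates to a different value, so none of them is a redundant copy of another. The non-local step ".." is what separates the two books titled t1.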
Type Territory: The territory of a type T, denoted TT, is the set of all text nodes that are descendants of the least common ancestor, denoted lca(T), of all of the type T nodes [1].

Node Territory: The territory of a type T node p, denoted Np, is TT excluding all text nodes that are descendants of other type T nodes [1].

The notions of node and type territory are shown in Fig. 4. Consider three nodes p1, p2, p3 whose least common ancestor (lca) is the node p. The grey shaded area in Fig. 4 shows the node territory of p2, which excludes descendants of the other type T nodes. The dark grey parts depict the type territory of p1 and p3. For example, consider the <book> node in Fig. 3: the type territory of <book> is (n1, t1, p1, n2, t2, p2), whereas the node territory of the bottom <book> is (p2, t2, n2).

In order to match nodes with their counterparts in the other version, we introduce the notions of 'Admits' and 'Node match'. To perform a node match, both nodes need to admit each other.

Admits: q admits p if Eval(q, ID(q)) ⊆ Np.

Consider a node p in version Vp and a node q in version Vq that are candidates to be associated (matched) with each other. Node q is identified by a list of text values q1, …, qn in Vq. If node p in Vp has at least as much information as q does, then p should have the group of text values (q1, …, qn) in its own territory Np. A match implies semantic equality between two nodes, so it requires admission in both directions [1]. Once two nodes admit each other, they can be matched with each other.

5. MATCHING WITH SAMPLE DOCUMENT

Consider Fig. 1 and Fig. 3, between which there has been a significant change in structure. Both contain the same data, arranged in a different schema. For clarity, we display the XML documents as trees, as shown in Fig. 5. We first need to compute identifiers for each node in both versions.

From the algorithm and the notions of node matching, we calculate the identifier value of the <book> node using the XPath expression (../author/name/text(), title/text()). The calculated values of all <book> identifiers in both versions are shown in Table 1 and Table 2. The labels top, middle and bottom in Tables 1 and 2 correspond to the respective positions of the <book> node in the XML document.

Table 1. Version 1

    Node            Identifier Value
    book (top)      n1, t1
    book (middle)   n2, t2
    book (bottom)   n2, t1

Table 2. Version 2

    Node            Identifier Value
    book (top)      t1
    book (bottom)   t2

The territory of <book> in Version 2, as shown in Fig. 5, is (p1, n1, t1, n2, p2), whereas the identifier of the leftmost book in Version 1 is (n1, t1). Since the admits condition is satisfied, the leftmost book in Version 1 admits the leftmost book in Version 2. The dashed lines in Fig. 6 depict the admits relation for the identifier values of <book>, and the black line shows the <book> node match between the two versions. The other <book> and <author> nodes are matched with each other in the same way, as shown in Fig. 6.

Fig. 5. <book> matches with admits

Fig. 6. <book> and <author> matches

In the end, no node is left unmatched, i.e. no semantically distinct node is found in either version. Hence we show that Fig. 1 and Fig. 3 are semantically the same and that nodes can be associated for the book and author elements.
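The territory and admits definitions used in this section can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the matching algorithm of [1]. It works on Fig. 1 and Fig. 3 exactly as transcribed in Section 2, so the territories it computes need not coincide with every value quoted in the text above. Text nodes are approximated by each element's leading text. The helpers parents_of, node_territory, admits, id_v1 and id_v2 are hypothetical names, and the two identifier functions are simplified stand-ins for the identifiers of Tables 1 and 2.

```python
import xml.etree.ElementTree as ET

# Fig. 1 (Version 1) and Fig. 3 (Version 2), as transcribed in Section 2.
V1 = ET.fromstring(
    "<bib>"
    "<author><name>n1</name>"
    "<book><title>t1</title><publisher>p1</publisher></book></author>"
    "<author><name>n2</name>"
    "<book><title>t2</title><publisher>p2</publisher></book>"
    "<book><title>t1</title><publisher>p1</publisher></book></author>"
    "</bib>"
)
V2 = ET.fromstring(
    "<bib>"
    "<publisher>p1<book><title>t1</title>"
    "<author><name>n1</name></author></book></publisher>"
    "<publisher>p2<book><title>t2</title>"
    "<author><name>n1</name></author></book></publisher>"
    "</bib>"
)


def parents_of(root):
    """Child -> parent map (ElementTree keeps no parent pointers)."""
    return {child: parent for parent in root.iter() for child in parent}


def path_from_root(node, parents):
    """Return the list of elements [root, ..., node]."""
    path = [node]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return path[::-1]


def text_owners(node):
    """Elements in the subtree of node that carry non-blank leading text."""
    return {e for e in node.iter() if (e.text or "").strip()}


def node_territory(node, same_type, parents):
    """N_p: text under lca(T), minus text under the other type-T nodes."""
    lca = None
    for level in zip(*(path_from_root(n, parents) for n in same_type)):
        if any(e is not level[0] for e in level):
            break
        lca = level[0]
    owners = text_owners(lca)
    for other in same_type:
        if other is not node:
            owners -= text_owners(other)
    return {e.text.strip() for e in owners}


def admits(identifier_values, territory):
    """q admits p if Eval(q, ID(q)) is a subset of N_p."""
    return set(identifier_values) <= territory


par1, par2 = parents_of(V1), parents_of(V2)
books_v1 = V1.findall("./author/book")
books_v2 = V2.findall("./publisher/book")


def id_v1(book):
    # Assumed identifier for Version 1: enclosing author's name plus title.
    return (par1[book].findtext("name"), book.findtext("title"))


def id_v2(book):
    # Assumed identifier for Version 2: title only.
    return (book.findtext("title"),)


top_v1, top_v2, bottom_v2 = books_v1[0], books_v2[0], books_v2[1]
territory_top_v1 = node_territory(top_v1, books_v1, par1)
territory_top_v2 = node_territory(top_v2, books_v2, par2)
territory_bottom_v2 = node_territory(bottom_v2, books_v2, par2)

# A match requires admission in both directions.
match_top = (admits(id_v1(top_v1), territory_top_v2)
             and admits(id_v2(top_v2), territory_top_v1))
match_cross = (admits(id_v1(top_v1), territory_bottom_v2)
               and admits(id_v2(bottom_v2), territory_top_v1))
print(sorted(territory_top_v2), match_top, match_cross)
```

In this sketch the top <book> of Version 1 and the top <book> of Version 2 admit each other and are therefore matched, while the top <book> of Version 1 is not matched with the bottom <book> of Version 2, whose territory lacks t1. Comparing plain text values, rather than structural paths, is what lets the match survive the change of schema.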
6. CONCLUSION

This paper discusses an efficient approach to identifying changes between successive versions of XML documents based on semantics rather than on structure. The approach does not require knowledge of the schema beforehand. The matching algorithm looks for the information of one node in the territory of the other, which keeps the approach flexible even when structural changes are significant. The approach is novel because most previous research focused on the structure of the documents. It also reduces the cost of change caused by a large number of insertions and deletions, because semantically identical nodes are associated and preserved across changes in the document, thereby improving support for temporal queries and incremental evaluation.

7. REFERENCES

[1] S. Zhang, C. Dyreson, R. T. Snodgrass. "Schema-Less, Semantics-Based Change Detection for XML Documents". In Proceedings of WISE 2004, Lecture Notes in Computer Science, vol. 3306, 2004.
[2] G. Cobéna, S. Abiteboul, A. Marian. "Detecting Changes in XML Documents". In Proceedings of ICDE, San Jose, February 2002, 41–52.
[3] S. Chawathe and H. Garcia-Molina. "Meaningful Change Detection in Structured Data". In Proceedings of SIGMOD Conference, June 1997, 26–37.
[4] C. Dyreson, H. Ling, Y. Wang. "Managing Versions of Web Documents in a Transaction-time Web Server". In Proc. of the 13th International World Wide Web Conference, New York City, May 2004, 421–432.
[5] G. Dong and R. Topor. "Incremental Evaluation of Datalog Queries". Database Theory - ICDT '92, Lecture Notes in Computer Science, vol. 646, August 1992.
[6] Y. Wang, D. DeWitt, J.-Y. Cai. "X-Diff: An Effective Change Detection Algorithm for XML Documents". www.cs.wisc.edu/niagara/papers/xdiff.pdf, current as of August 2004.
[7] F. Grandi. "Introducing an Annotated Bibliography on Temporal and Evolution Aspects in the World Wide Web". SIGMOD Record, Volume 33, Number 2, June 2004.
[8] L. Liu, C. Pu, R. Barga, and T. Zhou. "Differential Evaluation of Continual Queries". Proc. of the International Conference on Distributed Computing Systems, 1996, 458–465.
[9] L. Liu, C. Pu, and W. Tang. "Continual Queries for Internet Scale Event-Driven Information Delivery". IEEE Trans. Knowledge Data Engineering, 11(4), 610–628, 1999.
[10] "XML Path Language (XPath) 2.0". W3C, www.w3c.org/TR/xpath20/, current as of August 2004.