20100810

A Survey of Approaches to Automatic Schema Matching
Erhard Rahm, Philip A. Bernstein
VLDB 2001

Introduction
A schema is a description of how data is structured. Schema matching is a basic problem in many database application domains. We present a taxonomy that covers many of the existing approaches to schema matching.

Match
Match is an operation that takes two schemas as input and produces a mapping between the elements of the two schemas that correspond semantically to each other.

Mapping (cont.)
A mapping element relates Cust.C# to Customer.CustID.
Its mapping expression is “Cust.C# = Customer.CustID”.
A mapping expression can also combine several elements: “Concatenate(Cust.FirstName, Cust.LastName) = Customer.Contact”.
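
A minimal sketch of how the two mapping elements above could be held in code. The `MappingElement` class and the `cardinality` helper are hypothetical names introduced for illustration; the slides do not prescribe a data structure.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MappingElement:
    """One mapping element: elements of each schema plus a mapping expression."""
    source: Tuple[str, ...]   # elements of the first schema
    target: Tuple[str, ...]   # elements of the second schema
    expression: str           # how the two sides relate


mapping = [
    MappingElement(("Cust.C#",), ("Customer.CustID",),
                   "Cust.C# = Customer.CustID"),
    MappingElement(("Cust.FirstName", "Cust.LastName"), ("Customer.Contact",),
                   "Concatenate(Cust.FirstName, Cust.LastName) = Customer.Contact"),
]


def cardinality(element: MappingElement) -> str:
    """Report how many elements of each schema the mapping element relates."""
    return f"{len(element.source)}:{len(element.target)}"


for element in mapping:
    print(cardinality(element), element.expression)
```

The second element relates two source attributes to one target attribute, which is why a mapping element cannot always be a simple pair of names.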

Application Domains
- Schema integration.
- Data warehouses.
- E-commerce.
- Semantic query processing.

Architecture for Generic Match (cont.)

Classification of Schema Matching Approaches: Overview

Classification of Schema Matching Approaches
For individual matchers, we consider the following largely orthogonal classification criteria:
1. Instance vs. schema: whether the matcher uses instance data (the data contents) or only schema-level information.
2. Element vs. structure: whether matching is performed for individual schema elements, such as attributes, or for combinations of elements, such as complex schema structures.

Classification of Schema Matching Approaches (cont.)
3. Language vs. constraint:
   - a linguistic approach is based on names and textual descriptions
   - a constraint-based approach is based on keys and relationships
4. Matching cardinality: each mapping element may interrelate one or more elements of the two schemas.
5. Auxiliary information: matchers may rely on auxiliary information such as dictionaries, global schemas, previous matching decisions, and user input.

Schema-Level Matchers
Schema-level matchers consider only schema information, such as:
- Name.
- Description.
- Data type.
- Relationship types (part-of, is-a, etc.).
- Constraints.
- Schema structure.

Granularity of Match
Element-level vs. structure-level.
Element-level:
- matches elements at the atomic level, such as attributes in an XML schema.
Structure-level:
- matches combinations of elements that appear together in a structure.

Linguistic Approaches
Language-based or linguistic matchers use names and text to find semantically similar schema elements.
We discuss two schema-level approaches:
- Name matching.
- Description matching.

Name Matching
Name-based matching matches schema elements with equal or similar names.
Similarity of names can be defined and measured in various ways:
1. Equality of names. Equal names can still be homonyms with different meanings, e.g. “line” of business vs. “line” of an order.
2. Equality of canonical names: CName -> customer name, EmpNO -> employee number.
3. Equality of synonyms: car ∼ automobile, mark ∼ brand.

Name Matching (cont.)
4. Equality of hypernyms: book is-a publication and article is-a publication imply book ∼ publication, article ∼ publication, and book ∼ article.
5. Similarity of names based on pronunciation: ShipTo ≈ Ship2.
6. User-provided name matches: reportsTo ∼ manager, issue ∼ bug.
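
The name-matching criteria above can be sketched as a small scoring function. Everything below is an illustrative assumption: the lookup tables are tiny hand-made stand-ins for a real dictionary or thesaurus, the scores 1.0, 0.9 and 0.8 are arbitrary, and pronunciation-based similarity (criterion 5) is left out.

```python
import re

# Hypothetical lookup tables, filled only with the examples from the slides.
ABBREVIATIONS = {"c": "customer", "emp": "employee", "no": "number"}
SYNONYMS = [{"car", "automobile"}, {"mark", "brand"}]
HYPERNYMS = {"book": "publication", "article": "publication"}
USER_MATCHES = {frozenset(("reportsto", "manager")), frozenset(("issue", "bug"))}


def canonical(name):
    """Split a camelCase name into tokens and expand known abbreviations."""
    tokens = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return " ".join(ABBREVIATIONS.get(t.lower(), t.lower()) for t in tokens)


def name_similarity(a, b):
    """Return an assumed similarity score in [0, 1] for two element names."""
    ca, cb = canonical(a), canonical(b)
    if ca == cb:                                  # equal (canonical) names
        return 1.0
    if frozenset((a.lower(), b.lower())) in USER_MATCHES:
        return 1.0                                # user-provided match
    if any(ca in group and cb in group for group in SYNONYMS):
        return 0.9                                # synonyms
    if len({HYPERNYMS.get(ca, ca), HYPERNYMS.get(cb, cb)}) == 1:
        return 0.8                                # same or shared hypernym
    return 0.0


print(canonical("CName"), "|", canonical("EmpNO"))
print(name_similarity("book", "article"))
```

Note that equality of names alone cannot detect homonyms; this sketch would still score two unrelated “line” attributes as 1.0.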

Description Matching
Descriptions are used to express the intended semantics of schema elements, e.g.:
S1: empn // employee name
S2: name // name of employee
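
One possible way to compare such descriptions is keyword overlap. The sketch below is an assumption, not the method of any surveyed system: it drops a few stop words and computes the Jaccard coefficient of the remaining words. `STOP_WORDS`, `keywords` and `description_similarity` are hypothetical names.

```python
STOP_WORDS = {"of", "the", "a", "an"}  # assumed, deliberately tiny list


def keywords(description):
    """Lower-case the description and drop stop words."""
    return {w for w in description.lower().split() if w not in STOP_WORDS}


def description_similarity(d1, d2):
    """Jaccard coefficient of the two keyword sets (0.0 if both are empty)."""
    k1, k2 = keywords(d1), keywords(d2)
    union = k1 | k2
    return len(k1 & k2) / len(union) if union else 0.0


# S1: empn // employee name      S2: name // name of employee
print(description_similarity("employee name", "name of employee"))
```

Under this measure the two descriptions from the slide share all their keywords, even though the element names empn and name differ.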

Constraint-based Approaches
Schemas often contain constraints to define:
- data types.
- value ranges.
- uniqueness.
- optionality.
- relationship types, and so on.
If the input schemas contain such information, a matcher can use it to determine the similarity of schema elements.

Constraint-based Approaches (cont.)
Type and key information suggest that Born matches Birthdate and Pno matches either EmpNo or DeptNo.
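
A minimal sketch of how type and key information could produce these candidates. The slide's figure is not reproduced in the text, so the column types and key flags below are assumptions chosen for illustration, as are the `Column` class and the `TYPE_CLASS` table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    data_type: str
    is_key: bool = False


# Assumed schema details; only the attribute names come from the slides.
S1 = [
    Column("EmpNo", "int", is_key=True),
    Column("EmpName", "varchar"),
    Column("DeptNo", "int", is_key=True),
    Column("DeptName", "varchar"),
    Column("Birthdate", "date"),
]
S2 = [
    Column("Pno", "int", is_key=True),
    Column("Pname", "string"),
    Column("Dept", "string"),
    Column("Born", "date"),
]

# Group concrete data types into broad compatibility classes.
TYPE_CLASS = {"int": "number", "varchar": "text", "string": "text", "date": "date"}


def constraint_candidates(column, others):
    """Names of columns with a compatible type class and the same key status."""
    return [
        o.name for o in others
        if TYPE_CLASS[o.data_type] == TYPE_CLASS[column.data_type]
        and o.is_key == column.is_key
    ]


for column in S2:
    print(column.name, "->", constraint_candidates(column, S1))
```

Constraints alone leave Pno ambiguous between EmpNo and DeptNo, which is why constraint-based matching is usually combined with other criteria.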

Auxiliary Information
Auxiliary information can improve the matching process:
1. Dictionaries.
2. Thesauri.
3. User-provided information.
Previously matched schemas can also be reused.

Reusing Schema and Mapping Information (cont.)

Instance-Level Approaches
Instance-level data can be used in two ways:
1. To enhance the effectiveness of schema-level matching.
2. To perform instance-level matching on its own.
Most of the approaches discussed previously for schema-level matching can be applied to instance-level matching.

Instance-Level Approaches (cont.)
Looking at the stored values can show that DeptName is a better match candidate for Dept than EmpName.
Take EmpNo, DeptNo and Pno as an example: based on similar value ranges, we match Pno to EmpNo rather than DeptNo.
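
The value-range argument can be sketched as follows. The sample values are invented for illustration, and `range_overlap` is a hypothetical measure (overlap of the two value ranges divided by their combined span), not one taken from the surveyed systems.

```python
def range_overlap(a, b):
    """Fraction of the combined value span that both columns cover."""
    lo, hi = max(min(a), min(b)), min(max(a), max(b))
    if hi < lo:                       # the two ranges do not intersect
        return 0.0
    span = max(max(a), max(b)) - min(min(a), min(b))
    return (hi - lo) / span if span else 1.0


# Invented instance data for the three columns.
instances = {
    "EmpNo": [1000, 1040, 1100],
    "DeptNo": [1, 7, 20],
}
pno_values = [1001, 1002, 1050]

scores = {name: range_overlap(pno_values, values)
          for name, values in instances.items()}
best = max(scores, key=scores.get)
print(scores, "->", best)
```

With these values the schema-level ambiguity between EmpNo and DeptNo is resolved in favour of EmpNo.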

Combining Different Matchers
A matcher that uses just one approach is unlikely to achieve as many good match candidates as one that combines several approaches.
Combination can be done in two ways:
1. Hybrid matchers:
   - integrate multiple matching criteria within a single matcher.
2. Composite matchers:
   - combine the results of independently executed matchers.
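
A minimal sketch of the composite idea: each matcher runs independently and returns a score, and the scores are combined afterwards. Combining by weighted average is an assumption made here; the slides do not say how results are merged. The two toy matchers and the `composite` function are hypothetical.

```python
def name_matcher(a, b):
    """1.0 if the names are equal ignoring case, else 0.0."""
    return 1.0 if a["name"].lower() == b["name"].lower() else 0.0


def type_matcher(a, b):
    """1.0 if the declared data types are equal, else 0.0."""
    return 1.0 if a["type"] == b["type"] else 0.0


def composite(matchers, weights):
    """Build a matcher that averages independently computed scores."""
    total = sum(weights)

    def combined(a, b):
        return sum(w * m(a, b) for m, w in zip(matchers, weights)) / total

    return combined


match = composite([name_matcher, type_matcher], weights=[1, 1])

born = {"name": "Born", "type": "date"}
birthdate = {"name": "Birthdate", "type": "date"}
print(match(born, birthdate))
```

A hybrid matcher would instead fold both criteria into one function, which makes it harder to add or reweight a criterion later.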

Sample Approaches From the Literature
- LSD.
- SKAT.
- TransScm.
- ARTEMIS.

Learning Source Descriptions (LSD)

Semantic Knowledge Articulation Tool (SKAT)
A rule-based approach to semi-automatically determine matches between schemas.
Rules are formulated in first-order logic to express match and mismatch relationships.
The user initially has to provide match and mismatch relationships, and then approves or rejects the generated matches.
Schemas are transformed into a graph-based object-oriented database model.

TransScm
Input schemas are transformed into labeled graphs.
Edges in the schema graphs represent component relationships.
The matching is performed node by node (element-level, 1:1).
There are several matchers, which are checked in a fixed order.
If no match is found, or if a matcher determines multiple match candidates, user intervention is required (the user provides a rule or selects a match candidate).
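
The control flow described above can be sketched as below. This is only an illustration of "matchers checked in a fixed order, otherwise ask the user"; it is not TransScm's actual rule language, and the two matchers and the `match_node` function are hypothetical.

```python
def exact_name(node, candidate):
    return node == candidate


def case_insensitive_name(node, candidate):
    return node.lower() == candidate.lower()


# Fixed order in which the matchers are tried.
MATCHERS = [exact_name, case_insensitive_name]


def match_node(node, candidates, matchers=MATCHERS):
    """Return ("matched", candidate) or ("ask_user", candidates_to_choose_from)."""
    for matcher in matchers:
        hits = [c for c in candidates if matcher(node, c)]
        if len(hits) == 1:
            return "matched", hits[0]      # unique 1:1 match
        if len(hits) > 1:
            return "ask_user", hits        # ambiguous: user selects a candidate
    return "ask_user", []                  # nothing found: user provides a rule


print(match_node("custid", ["CustID", "Name"]))
print(match_node("phone", ["CustID", "Name"]))
```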

ARTEMIS
It first computes “affinities” in the range 0 to 1 between attributes:
1. Name affinity.
2. Data type affinity.
3. Structural affinity.
It then completes the schema integration by clustering attributes based on those affinities and constructing views based on the clusters.
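
A rough sketch of the clustering step, assuming the pairwise affinities have already been computed. Grouping every pair whose affinity reaches a threshold is a simplification chosen here and should not be read as ARTEMIS's actual clustering procedure; the affinity values, the threshold and the `cluster` function are invented for illustration.

```python
def cluster(attributes, affinity, threshold=0.5):
    """Group attributes linked by an affinity of at least `threshold`."""
    parent = {a: a for a in attributes}

    def find(a):
        # Follow parent links up to the representative of a's group.
        while parent[a] != a:
            a = parent[a]
        return a

    for (a, b), score in affinity.items():
        if score >= threshold:
            parent[find(a)] = find(b)      # merge the two groups

    groups = {}
    for a in attributes:
        groups.setdefault(find(a), set()).add(a)
    return sorted(sorted(g) for g in groups.values())


attributes = ["Cust.Name", "Customer.Name", "Cust.C#", "Customer.CustID"]
affinity = {                                # invented affinities in [0, 1]
    ("Cust.Name", "Customer.Name"): 0.9,
    ("Cust.C#", "Customer.CustID"): 0.7,
    ("Cust.Name", "Customer.CustID"): 0.1,
}
print(cluster(attributes, affinity))
```

Each resulting cluster would then be the basis for one attribute of an integrated view.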

Characteristics of Proposed Schema Match Approaches

Characteristics of Proposed Schema Match Approaches (cont.)

Conclusion
We used the taxonomy to characterize and compare a variety of previous match implementations. We hope that the taxonomy will be useful to programmers who need to implement a match algorithm.
