SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"

Value Invention in Data Exchange
Patricia Arocena (1), Boris Glavic (2), Renée J. Miller (1)
(1) University of Toronto, DBGroup
(2) Illinois Institute of Technology, DBGroup
SIGMOD 2013 - June 25, 2013 - New York, USA

Outline
1 Introduction
2 Linearization
3 Exploiting Source Constraints
4 Experiments
5 Conclusions

The Data Exchange Problem [1]
Schema Mappings M = (S, T, Σ)
• Source Schema S and Target Schema T
• High-level specification Σ
• models the relationship between S and T
[Figure: the mapping M relates Source Schema S to Target Schema T, and Source Data to Target Data]
[1] R. Fagin et al., Theor. Comput. Sci. 336 (2005).
Slide 1 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Introduction

Data Exchange
• Given an instance of S
• How to materialize a target instance of T?

Example
Source Schema S: WorksOn(Department, Project, City)
Target Schema T: Projects(PId, City, ManagerId)
Source Data (WorksOn):
IT | Web | Toronto
IT | Big Data | Chicago
Sales | Mobile | New York
Target Data (Projects):
NULL | Toronto | NULL
NULL | Chicago | NULL
NULL | New York | NULL
We usually create values to represent incomplete information!

Value Invention
Source Data (WorksOn):
IT | Web | Toronto
IT | Big Data | Chicago
Sales | Mobile | New York
Target Data (Project):
f(Web) | Toronto | g(IT)
f(Big Data) | Chicago | g(IT)
f(Mobile) | New York | g(Sales)
∃f ∃g ( WorksOn(d, p, c) → Project(f(p), c, g(d)) )
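A minimal sketch of how the mapping above invents values, assuming Skolem terms are represented as uninterpreted (function name, arguments) pairs. The names `Skolem` and `exchange` are hypothetical and chosen for illustration only.

```python
from typing import NamedTuple


class Skolem(NamedTuple):
    """An invented value: a Skolem function name applied to source values."""
    fn: str
    args: tuple

    def __str__(self) -> str:
        return f"{self.fn}({', '.join(map(str, self.args))})"


def exchange(works_on):
    """Apply WorksOn(d, p, c) -> Project(f(p), c, g(d)) to every source tuple."""
    target = set()
    for d, p, c in works_on:
        target.add((Skolem("f", (p,)), c, Skolem("g", (d,))))
    return target


works_on = [
    ("IT", "Web", "Toronto"),
    ("IT", "Big Data", "Chicago"),
    ("Sales", "Mobile", "New York"),
]
project = exchange(works_on)
for row in sorted(project, key=lambda r: r[1]):
    print(" | ".join(str(v) for v in row))
```

Both IT projects receive the same invented manager g(IT), which is the kind of correlation that plain NULLs cannot express.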

Our Goal
• Understand when schema mappings specified by SO tgds . . .
  • Flexible and precise value invention
• . . . can be rewritten into nested GLAV mappings
  • Desirable computational and programmatic properties

Skolem Functions
• Introduced by Thoralf A. Skolem (1920s)
• Widely used in Mathematical Logic and Computer Science
Many important uses in Information Integration
• to model object identifier (OID) invention [a]
• to express correlation semantics (e.g., grouping and data merging) [b, c, d, e]
• to provide a precise representation of missing and incomplete information [f, b, g]
[a] R. Hull, M. Yoshikawa, In VLDB (1990).
[b] L. Popa et al., In VLDB (2002).
[c] A. Fuxman et al., In VLDB (2006).
[d] L. Libkin, C. Sirangelo, J. Comput. Syst. Sci. 77 (2011).
[e] B. Alexe et al., VLDB J. 21 (2012).
[f] Y. Papakonstantinou et al., In VLDB (1996).
[g] R. Fagin et al., TODS 30 (2005).

Schema Mapping Languages
Various logical mapping formalisms
• s-t tgds (also known as GLAV) [a]
• Nested s-t tgds (nested GLAV) [b]
• Second-Order (SO) tgds [c]
Expressiveness
• SO tgds permit arbitrary Skolems! [c]
• FO mapping languages have more desirable programmatic and computational properties [d]
[a] R. Fagin et al., Theor. Comput. Sci. 336 (2005).
[b] A. Fuxman et al., In VLDB (2006).
[c] R. Fagin et al., TODS 30 (2005).
[d] B. ten Cate, P. Kolaitis, In ICDT (2009).

Characterization of Mapping Languages [2, 3, 4]
Property             | GLAV           | nested GLAV        | SO tgds
Composition          | Not closed     | Not closed         | Closed
Value Invention      | No correlation | Linear correlation | Fully customized correlation
Target Homomorphisms | Closed         | Closed             | Not closed
Model Checking       | PTIME          | PTIME              | NP-Complete
[3] R. Fagin et al., TODS 30 (2005).
[4] B. ten Cate, P. Kolaitis, In ICDT (2009).

The Quest for FO Rewritability
Rewritability
• Many SO tgds are equivalent to FO mappings!
• We call this FO/GLAV/nested GLAV rewritable
• Some SO tgds are not FO rewritable [a]
• . . . Even testing for FO rewritability is undecidable [b]
Nash, Bernstein and Melnik
• First sufficient condition for GLAV rewritability [c]
• Tailored to consider SO tgds produced by mapping composition
[a] R. Fagin et al., TODS 30 (2005).
[b] I. Feinerer et al., In AMW (2011).
[c] A. Nash et al., TODS 32 (2007).

Our Contributions
1 Sufficient condition for nested GLAV rewritability of SO tgds
2 Linearize:
  • PTIME algorithm for rewriting SO tgds
3 Equivalence preserving transformation of SO tgds using source semantics
4 LinearizeFDs:
  • PTIME algorithm for rewriting SO tgds using source FDs
5 Extensive experimental evaluation
  • STBenchmark 2.0 [a]
  • Real-life mapping scenarios
[a] P. C. Arocena et al., “STBenchmark 2.0”, tech. rep. (Uni. of Toronto, 2013).

Intuition of Rewriting
Rewrite SO tgds into nested GLAV
• Replace second-order existentials with first-order existentials
• ∃f(x) → ∃v_f
• Apply logical equivalence of Skolemization in reverse direction
• May have to reorder universal quantifiers to create ∀x
Skolemization Equivalence
∀x ∃v_f δ(x, v_f) ≡ ∃f ∀x δ(x, v_f)[v_f ← f(x)]

UnSkolemization Revisited
Example: Key Invention
Source Schema
WorksOn (Department, Project, BudgetId)
Audit (BudgetId, Auditor)
City (Department, City)
Target Schema
Project (PId, BudgetId)
Dept (Dept, Year, Project, NumEmp)
Location (Department, DepId, City, State)
Budget (Project, Leader, Size)
∃f (∀d∀p∀b WorksOn (d, p, b) → Project (f (d, p), b))

We need to introduce ∃v_f nested within the scope of d and p:
∀d ∀p ∃v_f ∀b WorksOn(d, p, b) → Project(v_f, b)

Sufficient Rewriting Condition
Approach
• When can Unskolemization be applied to all Skolems of an SO tgd?
• Adapt notions from SO quantifier elimination methods [a]
• Consistency:
  • OK: . . . f(a) . . . f(a) → ∀a ∃v_f
  • NOT OK: . . . f(a) . . . f(b) → ∀a ∃v_f ∀b ∃v_f
• Linearity:
  • OK: . . . f(a) . . . g(a, b) → ∀a ∃v_f ∀b ∃v_g
  • NOT OK: . . . f(a, b) . . . g(b, c) → ∀a ∀b ∃v_f ∀c ∃v_g
• Partitioning scheme for multi-clause SO tgds
[a] D. Gabbay et al., Second Order Quantifier Elimination: Foundations, Computational Aspects and Applications (College Publications, 2008).

Theorem: Linearity
Given an SO tgd θ without equalities between or with Skolem terms, if θ is
• Consistent, and
• Linear
⇒ θ can be rewritten as nested GLAV
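The two conditions can be sketched as simple syntactic checks. This is one simplified reading of the slide's OK / NOT OK examples, not the paper's formal definitions: consistency is taken to mean that every occurrence of a Skolem function has the same argument list, and linearity that the argument sets form a chain under set inclusion. The function names are hypothetical.

```python
def is_consistent(skolem_terms):
    """Every occurrence of a Skolem function uses the same argument list."""
    seen = {}
    for fn, args in skolem_terms:
        if seen.setdefault(fn, tuple(args)) != tuple(args):
            return False
    return True


def is_linear(skolem_terms):
    """Argument sets can be ordered so that each one is contained in the next."""
    arg_sets = sorted({frozenset(args) for _, args in skolem_terms}, key=len)
    return all(a <= b for a, b in zip(arg_sets, arg_sets[1:]))


# The four cases from the slide, as (function name, arguments) pairs.
print(is_consistent([("f", ("a",)), ("f", ("a",))]))   # f(a) ... f(a)
print(is_consistent([("f", ("a",)), ("f", ("b",))]))   # f(a) ... f(b)
print(is_linear([("f", ("a",)), ("g", ("a", "b"))]))   # f(a) ... g(a, b)
print(is_linear([("f", ("a", "b")), ("g", ("b", "c"))]))  # f(a, b) ... g(b, c)
```

Sorting by size works because two distinct sets in a chain can never have the same size, so any violation shows up between neighbours in the sorted order.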

Linearize Algorithm
Properties of the Algorithm
• Rewrites an SO tgd into nested GLAV
• PTIME
• Size of resulting formula is linear in the size of the input
Linearize(θ)
1 Partition θ into independent sub-formulas (maximal partitioning Π)
2 For each partition
• Check consistency and linearity
3 If all partitions are linear and consistent then
• Rewrite θ into Ω
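For a single clause, the rewrite step can be pictured as interleaving ∀ and ∃ quantifiers. The sketch below is an illustration under stated assumptions, not the authors' Linearize implementation: the input is one clause whose Skolem terms are already consistent, and `quantifier_prefix` is a hypothetical helper that only builds the quantifier prefix.

```python
def quantifier_prefix(universal_vars, skolems):
    """Build a nested quantifier prefix for one clause.

    universal_vars: variables in their original order, e.g. ["d", "p", "b"]
    skolems: {function name: argument tuple}, assumed consistent
    """
    bound = []   # universal variables already quantified
    parts = []
    # Process Skolems from the smallest argument set to the largest.
    for fn, args in sorted(skolems.items(), key=lambda kv: len(set(kv[1]))):
        if not set(bound) <= set(args):
            raise ValueError("Skolem arguments are not linear")
        for v in universal_vars:
            if v in args and v not in bound:
                bound.append(v)
                parts.append(f"∀{v}")
        # The existential replaces the Skolem term and sees exactly its arguments.
        parts.append(f"∃v_{fn}")
    parts += [f"∀{v}" for v in universal_vars if v not in bound]
    return " ".join(parts)


print(quantifier_prefix(["d", "p", "b"], {"f": ("d", "p")}))
print(quantifier_prefix(["d", "p", "b"], {"f": ("d",), "g": ("d", "p")}))
```

Each existential is placed right after the universal variables it depends on, which is why a non-linear pair such as f(a, b) and g(b, c) is rejected: no single ordering gives each existential exactly its own arguments.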

A Note on Linearity
• Linearity is a syntactic but not a semantic condition
• ⇒ There is hope that an equivalent mapping exists that is linear
• ⇒ Approach: Find an equivalent mapping that is linear
• Modify Skolem arguments?
[Figure: Non-Linear SO tgd θ → (Equivalence Preserving Transformation) → Linear SO tgd θ → (Linearize) → nested GLAV Ω]

Using Source Functional Dependencies
So far
• Only considered an SO tgd θ
• Have not considered additional knowledge that may be available
∃f ∃g WorksOn (d, p, b) ∧ Audit (b, a) → Budget (p, f (d, p), g(b, a))

Source constraints
• Functional dependencies (FDs) ΣS that hold over the source
• Primary keys (and other FDs if available)
• FDs imply dependencies between the arguments of Skolem terms
• An FD x → y can be used to augment Skolem arguments: f(x, z) → f(x, z, y)
Source FDs: WorksOn: Department, Project → BudgetId; Audit: BudgetId → Auditor
Implied FD1: d, p → b, a    Implied FD2: b → a
Original: ∃f ∃g WorksOn(d, p, b) ∧ Audit(b, a) → Budget(p, f(d, p), g(b, a))
Augmented: ∃f ∃g WorksOn(d, p, b) ∧ Audit(b, a) → Budget(p, f(d, p, b, a), g(b, a))

Equivalence Preserving Transformation
Approach
• Augment Skolem arguments using implied FDs (Re-Skolemization)
• The resulting θ is equivalent to the original as long as the FDs hold.
[Figure: Non-Linear SO tgd θ and source FDs ΣS → (Re-Skolemize using implied FDs) → Linear SO tgd θ and source FDs ΣS → (Linearize) → nested GLAV Ω]
Theorem: Re-Skolemization with FDs preserves equivalence
Given an implied source FD x → y valid over θ:
θ[f (x) ← f (x, y)] ∪ ΣS ≡ θ ∪ ΣS

Why Augmentation?
Does Re-Skolemization affect Linearity?
• Augmentation (θ[f(x) ← f(x, y)])
• θaug: Result of applying augmentation until no longer possible
• Minimization (θ[f(x, y) ← f(x)]) [a]
• θmin: Result of applying minimization until no longer possible
[a] B. Marnette et al., PVLDB 3 (2010).
Theorem: Only augmentation preserves Linearity
Linear(θ) → Linear(θaug)
Linear(θ) ↛ Linear(θmin)

LinearizeFDs Algorithm
Properties of the Algorithm
• Rewrites SO tgd into nested GLAV
• PTIME
• Size of resulting formula is linear in the size of the input
LinearizeFDs(θ, ΣS)
1 Compute implied FDs
2 Augment arguments of each Skolem term based on FDs
  • Using attribute closure
  • Result: θaug
3 Return Linearize(θaug)
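Steps 1 and 2 can be sketched with the textbook attribute-closure algorithm. This is a hedged illustration, not the authors' code: FDs are assumed to be already expressed over the variables of the clause, and `closure` and `augment` are hypothetical helper names.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, given as (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def augment(skolems, fds, order):
    """Extend each Skolem's arguments to their closure, listed in `order`."""
    augmented = {}
    for fn, args in skolems.items():
        closed = closure(args, fds)
        augmented[fn] = tuple(v for v in order if v in closed)
    return augmented


# Source FDs of the Budget example over clause variables:
# WorksOn: d, p -> b   and   Audit: b -> a
fds = [({"d", "p"}, {"b"}), ({"b"}, {"a"})]
skolems = {"f": ("d", "p"), "g": ("b", "a")}
augmented = augment(skolems, fds, order=["d", "p", "b", "a"])
print(augmented)
```

In this example f(d, p) grows to f(d, p, b, a) while g(b, a) is unchanged, so the arguments of g become a subset of those of f.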

Mapping Generator and Experiments
STBenchmark
• Generator for data exchange scenarios [a]
• Schemas, Data and Mappings
• Construct complex mappings from simple primitives
• e.g., Horizontal Partitioning (HP)
• Parameterized and randomized (e.g., join path length)
[a] B. Alexe et al., PVLDB 1 (2008).
Extensions
• Arbitrary Skolem terms (SO tgds)
• New primitives (e.g., Adding and Deleting Attributes, etc.)
• Combining primitives into more complex mappings
• e.g., simulating composition and complex correlations
• Primary Keys (PKs) and Functional dependencies (FDs)

Random Scenarios
• 12,500,000 randomly generated mapping scenarios
• Measure success rate
• Compare NBM, Linearize, LinearizeFDs, LinearizeMin
• NBM (Nash, Bernstein and Melnik) rewrites only into GLAV!

Effect of Primary Keys
• Activate/Deactivate source PKs
• Vary amount of non-PK FDs
[Figure: bar chart of Success Rate (0% to 100%) for Linearize and LinearizeFDs, with "No PKs" and "With PKs", at SOURCE FDs = 0%, 25%, and 50%]

Conclusions
Rewriting SO tgds → nested GLAV
• Linearization
  • SO tgd is linear → can be rewritten
• Equivalence preserving Re-Skolemization
  • Using source FDs to augment Skolem arguments
Experimental and Theoretical Results
• Using FDs improves the chance of rewriting
  • 78% increased success rate
• Primary keys are most effective
  • > 75% increased success rate
• Augmentation is better than minimization
  • about 16% increased success rate

Future Work
Integrate insights on Re-Skolemization into . . .
• Mapping operators such as
  • Composition
  • MapMerge
• Mapping generation
FO Rewritability of SO tgds
• Combine our sufficient condition with that of [NBM07] [a]
  • we know how to do it!
• Exploit Augmentation and Minimization together
  • to simplify and optimize SO mappings
• Use target FDs
[a] A. Nash et al., TODS 32 (2007).

Questions?

Appendix
6 References
7 Additional Notation
8 A Note on Complexity
9 Partitioning Scheme
10 Additional Experiments
11 STBenchmark 2.0
12 Additional Linearization Examples
13 Example Augmentation

R. Fagin, P. Kolaitis, R. J. Miller, L. Popa, Data Exchange: Semantics and Query Answering. Theor. Comput. Sci. 336 (2005).
R. Hull, M. Yoshikawa, ILOG: Declarative Creation and Manipulation of Object Identifiers. In VLDB (1990).
L. Popa, Y. Velegrakis, R. J. Miller, M. A. Hernández, R. Fagin, Translating Web Data. In VLDB (2002).
A. Fuxman et al., Nested Mappings: Schema Mapping Reloaded. In VLDB (2006).
L. Libkin, C. Sirangelo, Data Exchange and Schema Mappings in Open and Closed Worlds. J. Comput. Syst. Sci. 77 (2011).
B. Alexe, M. A. Hernández, L. Popa, W. C. Tan, MapMerge: Correlating Independent Schema Mappings. VLDB J. 21 (2012).

Y. Papakonstantinou, S. Abiteboul, H. Garcia-Molina, Object Fusion in Mediator Systems. In VLDB (1996).
R. Fagin, P. Kolaitis, L. Popa, W.-C. Tan, Composing Schema Mappings: Second-Order Dependencies to the Rescu TODS 30 (2005).
B. ten Cate, P. Kolaitis, Structural Characterizations of Schema-Mapping Languages. In ICDT (2009).
I. Feinerer, R. Pichler, E. Sallinger, V. Savenkov, On the Undecidability of the Equivalence of Second-Order Tuple Genera In AMW (2011).
A. Nash, P. Bernstein, S. Melnik, Composition of Mappings Given by Embedded Dependencies. TODS 32 (2007).
P. C. Arocena, M. D’Angelo, B. Glavic, R. J. Miller, “STBenchmark 2.0”, tech. rep. (Uni. of Toronto, 2013).

D. Gabbay, R. Schmidt, A. Szalas, Second Order Quantifier Elimination: Foundations, Computational Aspects and Applications (College Publications, 2008).
B. Marnette, G. Mecca, P. Papotti, Scalable Data Exchange with Functional Dependencies. PVLDB 3 (2010).
B. Alexe, W. C. Tan, Y. Velegrakis, STBenchmark: Towards a Benchmark for Mapping Systems. PVLDB 1 (2008).

Notation
GLAV (s-t tgds): ∀z, x (φ(z, x) → ∃y ψ(x, y))
∀d ∀p ∀b Works(d, p, b) → ∃y1 ∃y2 Project(y1, b, y2)
nested GLAV: Q(x, y)((φ1(x) → ψ1(x, y)) ∧ . . . ∧ (φn(x) → ψn(x, y))),
where Q(x, y) is a sequence of quantifiers, that is, ∀ for x and ∃ for y
∀d ∃y1 ∀p ∃y2 ∀b Works(d, p, b) → Project(y1, b, y2)
SO tgds: ∃f ((∀x1 (φ1 → ψ1)) ∧ · · · ∧ (∀xn (φn → ψn)))
Note: we usually omit universal quantifiers
∃f ∃g (Works(d, p, b) → Project(f(p), b, g(d)))

Model Checking
Complexity
• NP-complete for SO tgds vs. P for nested GLAV
• Are we only solving the simple cases?
Approach
• Find an SO tgd for which model checking is hard
• But can be rewritten using (implied) source FDs

Model Checking: 3-colorability
Schema Mappings
θ = ∀X, Y: (E(X, Y) → C(f(X), g(Y))) ∧ (V(X, Y) → S(f(X), g(Y)))
Not linear!
Instance
• The source relations encode an undirected graph G
• For each edge (x, y) we create two tuples E(x, y) and E(y, x)
• For each vertex x we create a tuple V (x, x)
• The target relations represent a coloring of the vertices of G using three colors r, g, and b
• C: (r, g), (r, b), (g, r), (g, b), (b, r), (b, g) - colors of adjacent nodes
• S: (r, r), (g, g), (b, b) - colors of vertices
Theorem: Model Checking is 3-colorability
G is 3-colorable if θ holds over I
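The reduction can be tried out with a brute-force model checker for this particular θ. This is a sketch for small graphs only: it enumerates every candidate interpretation of the Skolem functions f and g over the three colors, which is exponential in the number of vertices. The helper names `encode` and `holds` are hypothetical.

```python
from itertools import product

COLORS = ("r", "g", "b")
C = {(x, y) for x in COLORS for y in COLORS if x != y}  # colors of adjacent nodes
S = {(x, x) for x in COLORS}                            # colors of a single vertex


def encode(vertices, edges):
    """Source instance: E holds both directions of each edge, V holds (x, x)."""
    E = {(x, y) for x, y in edges} | {(y, x) for x, y in edges}
    V = {(x, x) for x in vertices}
    return E, V


def holds(E, V):
    """Search for Skolem functions f, g such that both clauses of θ hold."""
    nodes = sorted({v for pair in E | V for v in pair})
    for f_vals, g_vals in product(product(COLORS, repeat=len(nodes)), repeat=2):
        f = dict(zip(nodes, f_vals))
        g = dict(zip(nodes, g_vals))
        if all((f[x], g[y]) in C for x, y in E) and \
           all((f[x], g[y]) in S for x, y in V):
            return True
    return False


triangle = encode("abc", [("a", "b"), ("b", "c"), ("a", "c")])
k4 = encode("abcd", [("a", "b"), ("a", "c"), ("a", "d"),
                     ("b", "c"), ("b", "d"), ("c", "d")])
print(holds(*triangle), holds(*k4))
```

The V clause forces f(x) = g(x) for every vertex, and the E clause then forces adjacent vertices to receive different colors, so a satisfying pair f, g is exactly a proper 3-coloring.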

Model Checking: 3-colorability
Schema Mappings
θ = ∀X, Y: (E(X, Y) → C(f(X, Y), g(Y))) ∧ (V(X, Y) → S(f(X, Y), g(Y)))
Linear!
FD: X → Y
Theorem: Model Checking is 3-colorability
G is 3-colorable if θ holds over I

SO tgds are closed under conjunction
• Every SO tgd θ can be written as a set of clauses (φi → ψi)
• Splitting this set of clauses to form new SO tgds θ1, . . . , θn is equivalence preserving
• If θi and θj share no Skolems (they are uncorrelated)
• then θi and θj can be rewritten independently

Maximal Partition
• Given an SO tgd θ
• Partition clauses into Π = π1, . . . , πn
1 No πi and πj share any Skolems
2 There is no Π′ with more elements than Π that fulfills condition 1
Theorem: Rewritability and Maximal Partitions
Rewritable(θ) ⇔ ∀i : Rewritable(πi)
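One way to compute such a partition is to treat clauses as nodes that are connected whenever they share a Skolem function, and take connected components. The sketch below uses union-find and assumes each clause is summarized by the set of Skolem function names it mentions; `maximal_partition` is a hypothetical name, and this is not necessarily how the authors implement it.

```python
def maximal_partition(clauses):
    """Group clause indices so that clauses sharing a Skolem stay together.

    clauses: list of sets, each holding the Skolem function names of one clause.
    """
    parent = list(range(len(clauses)))

    def find(i):
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    owner = {}  # Skolem name -> index of the first clause that used it
    for i, skolems in enumerate(clauses):
        for fn in skolems:
            if fn in owner:
                parent[find(i)] = find(owner[fn])
            else:
                owner[fn] = i

    groups = {}
    for i in range(len(clauses)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Clauses 0, 1 and 4 are linked through f and g; clauses 2 and 3 stand alone.
parts = maximal_partition([{"f"}, {"f", "g"}, {"h"}, set(), {"g"}])
print(parts)
```

Connected components are the finest grouping in which no two groups share a Skolem, which matches conditions 1 and 2 above.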

Real-Life Mappings
• Three real-life mapping scenarios from the literature
• Created SO tgds based on
• Semantics of the schemas
• Documented data transformations
• Compared all rewriting techniques

Towards STBenchmark 2.0
Noteworthy Features
• Support for arbitrary Skolem Functions (SO tgds) and various
Skolemization modes (e.g., Key, All and Random)
• Simulating some cases of composition using Skolem Noise
• Reuse of source schema elements using Source Reuse
• PKs and random multi-attribute FDs over the source
Usability Case?
• Thinking about comparing different notions of mapping inverse
• Any suggestions?

Nesting and Linearization
Example: Skolems with Overlapping Arguments
∃f ∃g (WorksOn(d, p, b) → Dept(d, f(d), p, g(d, p)))
We need to introduce two ∃ quantifiers without violating the dependencies modeled by f and g:
∀d ∃v_f ∀p ∃v_g ∀b WorksOn(d, p, b) → Dept(d, v_f, p, v_g)

Augmentation is better than Minimization
θ = ∃f ∃g WorksOn(d, p, b) ∧ Audit(b, a) → Budget(p, f(d, p), g(b, a))
θaug = ∃f ∃g . . . → Budget(p, f(d, p, b, a), g(b, a))
θmin = ∃f ∃g . . . → Budget(p, f(d, p), g(b))
FD1: d, p → b, a    FD2: b → a