This talk addresses value invention in data exchange and schema mappings. In the data exchange problem, a schema mapping specifies the relationship between a source schema and a target schema. Value invention creates values that represent incomplete information when the target instance is materialized. The goal is to understand when schema mappings specified by second-order tuple-generating dependencies (SO tgds) can be rewritten as nested global-and-local-as-view (GLAV) mappings, which have more desirable computational properties. The talk presents an algorithm called Linearize that rewrites an SO tgd as a nested GLAV mapping if it is linear and consistent. It also shows how source constraints such as functional dependencies can be exploited to find an equivalent linear mapping.
SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"
1. Value Invention in Data Exchange
Patricia Arocena (1), Boris Glavic (2), Renée J. Miller (1)
(1) University of Toronto, DBGroup
(2) Illinois Institute of Technology, DBGroup
SIGMOD 2013 - June 25, 2013 - New York, USA
3. The Data Exchange Problem [1]
Schema Mappings M = (S, T, Σ)
• Source Schema S and Target Schema T
• High-level specification Σ
• models the relationship between S and T
[Diagram: mapping M relates Source Schema S (with Source Data) to Target Schema T (with Target Data)]
[1] R. Fagin et al., Theor. Comput. Sci. 336 (2005).
Slide 1 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Introduction
4. The Data Exchange Problem [1]
Schema Mappings M = (S, T, Σ)
• Source Schema S and Target Schema T
• High-level specification Σ
• models the relationship between S and T
Data Exchange
• Given an instance of S
• How to materialize a target instance of T?
[1] R. Fagin et al., Theor. Comput. Sci. 336 (2005).
6. Example
Source Schema S: WorksOn(Department, Project, City)
Target Schema T: Projects(PId, City, ManagerId)
Source Data (WorksOn):
(IT, Web, Toronto)
(IT, Big Data, Chicago)
(Sales, Mobile, New York)
Target Data (Projects):
(NULL, Toronto, NULL)
(NULL, Chicago, NULL)
(NULL, New York, NULL)
We usually create values to represent incomplete information!
7. Value Invention
Source Schema S: WorksOn(Department, Project, City)
Target Schema T: Projects(PId, City, ManagerId)
Source Data (WorksOn):
(IT, Web, Toronto)
(IT, Big Data, Chicago)
(Sales, Mobile, New York)
Target Data (Projects):
(f(Web), Toronto, g(IT))
(f(Big Data), Chicago, g(IT))
(f(Mobile), New York, g(Sales))
We usually create values to represent incomplete information!
∃f ∃g ( WorksOn(d, p, c) → Projects(f(p), c, g(d)) )
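The value invention example can be reproduced with a few lines of Python. This is only a minimal sketch: the `Skolem` class and the `exchange` function are hypothetical names introduced here, and Skolem terms are modeled as uninterpreted (function name, arguments) pairs rather than as values chosen by any particular data exchange system.

```python
from typing import NamedTuple

class Skolem(NamedTuple):
    """An uninterpreted Skolem term such as f(Web): a function name plus arguments."""
    func: str
    args: tuple

    def __str__(self) -> str:
        return f"{self.func}({', '.join(map(str, self.args))})"

def exchange(works_on):
    """Apply WorksOn(d, p, c) -> Projects(f(p), c, g(d)) to every source tuple."""
    projects = set()
    for d, p, c in works_on:
        projects.add((Skolem("f", (p,)), c, Skolem("g", (d,))))
    return projects

# Source data from the slide.
works_on = [
    ("IT", "Web", "Toronto"),
    ("IT", "Big Data", "Chicago"),
    ("Sales", "Mobile", "New York"),
]

projects = exchange(works_on)
for pid, city, manager in sorted(projects, key=lambda t: t[1]):
    print(pid, city, manager)
```

Because g depends only on the department, both IT projects receive the same invented manager value g(IT), which is the kind of correlation a plain NULL cannot express.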
9. Our Goal
• Understand when schema mappings specified by SO tgds
• Flexible and precise value invention
• . . . can be rewritten into nested GLAV mappings
• Desirable computational and programmatic properties
10. Skolem Functions
• Introduced by Thoralf A. Skolem (1920s)
• Widely used in Mathematical Logic and Computer Science
Many important uses in Information Integration
• to model object identifier (OID) invention [a]
• to express correlation semantics (e.g., grouping and data merging) [b, c, d, e]
• to provide a precise representation of missing and incomplete information [b, f, g]
[a] R. Hull, M. Yoshikawa, In VLDB (1990).
[b] L. Popa et al., In VLDB (2002).
[c] A. Fuxman et al., In VLDB (2006).
[d] L. Libkin, C. Sirangelo, J. Comput. Syst. Sci. 77 (2011).
[e] B. Alexe et al., VLDB J. 21 (2012).
[f] Y. Papakonstantinou et al., In VLDB (1996).
[g] R. Fagin et al., TODS 30 (2005).
13. Schema Mapping Languages
Various logical mapping formalisms
• s-t tgds (also known as GLAV) [a]
• Nested s-t tgds (nested GLAV) [b]
• Second-Order (SO) tgds [c]
Expressiveness
• SO tgds permit arbitrary Skolems! [c]
• FO mapping languages have more desirable programmatic and computational properties [d]
[a] R. Fagin et al., Theor. Comput. Sci. 336 (2005).
[b] A. Fuxman et al., In VLDB (2006).
[c] R. Fagin et al., TODS 30 (2005).
[d] B. ten Cate, P. Kolaitis, In ICDT (2009).
15. Characterization of Mapping Languages [2, 3, 4]
Property | GLAV | nested GLAV | SO tgds
Composition | Not closed | Not closed | Closed
Value Invention | No correlation | Linear correlation | Fully customized correlation
Target Homomorphisms | Closed | Closed | Not closed
Model Checking | PTIME | PTIME | NP-Complete
[2] R. Fagin et al., Theor. Comput. Sci. 336 (2005).
[3] R. Fagin et al., TODS 30 (2005).
[4] B. ten Cate, P. Kolaitis, In ICDT (2009).
16. The Quest for FO Rewritability
Rewritability
• Many SO tgds are equivalent to FO mappings!
• We call such SO tgds FO/GLAV/nested GLAV rewritable
• Some SO tgds are not FO rewritable [a]
• . . . Even testing for FO rewritability is undecidable [b]
[a] R. Fagin et al., TODS 30 (2005).
[b] I. Feinerer et al., In AMW (2011).
Nash, Bernstein and Melnik
• First sufficient condition for GLAV rewritability [c]
• Tailored to consider SO tgds produced by mapping composition
[c] A. Nash et al., TODS 32 (2007).
17. Our Contributions
1 Sufficient condition for nested GLAV rewritability of SO tgds
2 Linearize:
• PTIME algorithm for rewriting SO tgds
3 Equivalence preserving transformation of SO tgds using source semantics
4 LinearizeFDs:
• PTIME algorithm for rewriting SO tgds using source FDs
5 Extensive experimental evaluation
• STBenchmark 2.0 [a]
• Real-life mapping scenarios
[a] P. C. Arocena et al., “STBenchmark 2.0”, tech. rep. (Uni. of Toronto, 2013).
19. Intuition of Rewriting
Rewrite SO tgds into nested GLAV
• Replace second-order existentials with first-order existentials
• ∃f, with terms f(x) → ∃v_f
• Apply logical equivalence of Skolemization in reverse direction
• May have to reorder universal quantifiers to create ∀x
Skolemization Equivalence
∀x ∃v_f δ(x, v_f) ≡ ∃f ∀x δ(x, v_f)[v_f ← f(x)]
20. UnSkolemization Revisited
Example: Key Invention
Source Schema
WorksOn (Department, Project, BudgetId)
Audit (BudgetId, Auditor)
City (Department, City)
Target Schema
Project (PId, BudgetId)
Dept (Dept, Year, Project, NumEmp)
Location (Department, DepId, City, State)
Budget (Project, Leader, Size)
∃f (∀d ∀p ∀b WorksOn(d, p, b) → Project(f(d, p), b))
We need to introduce ∃v_f nested within the scope of d and p
∀d ∀p ∃v_f ∀b WorksOn(d, p, b) → Project(v_f, b)
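The rewriting in this example can be spelled out step by step with the Skolemization equivalence. The derivation below is a sketch: it reads x as the pair (d, p) and δ as the subformula under ∀b, and it abbreviates WorksOn and Project as W and P (abbreviations introduced here only for brevity).

```latex
\begin{align*}
 & \exists f\, \forall d\, \forall p\, \forall b\,
   \bigl( W(d,p,b) \rightarrow P(f(d,p), b) \bigr) \\
\equiv\; & \exists f\, \forall d\, \forall p\,
   \underbrace{\forall b\, \bigl( W(d,p,b) \rightarrow P(v_f, b) \bigr)}_{\delta((d,p),\, v_f)}
   \,[\, v_f \leftarrow f(d,p) \,] \\
\equiv\; & \forall d\, \forall p\, \exists v_f\, \forall b\,
   \bigl( W(d,p,b) \rightarrow P(v_f, b) \bigr)
\end{align*}
```

The last step is the Skolemization equivalence applied from right to left. It is applicable because f(d, p) does not mention b, so ∃v_f can sit inside the scope of d and p but outside the scope of b.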
23. Sufficient Rewriting Condition
Approach
• When can Unskolemization be applied to all Skolems of an SO tgd?
• Adapt notions from SO quantifier elimination methods [a]
• Consistency:
• OK: . . . f(a) . . . f(a) → ∀a ∃v_f
• NOT OK: . . . f(a) . . . f(b) → ∀a ∃v_f ∀b ∃v_f
• Linearity:
• OK: . . . f(a) . . . g(a, b) → ∀a ∃v_f ∀b ∃v_g
• NOT OK: . . . f(a, b) . . . g(b, c) → ∀a ∀b ∃v_f ∀c ∃v_g
• Partitioning scheme for multi-clause SO tgds
[a] D. Gabbay et al., Second Order Quantifier Elimination: Foundations, Computational Aspects and Applications, (College Publications, 2008).
Theorem: Linearity
Given an SO tgd θ without equalities between or involving Skolem terms, if θ is
• Consistent
• Linear
⇒ θ can be rewritten as nested GLAV
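The two conditions can be checked mechanically. The sketch below encodes one reading of the OK / NOT OK examples on this slide, which is an assumption rather than the paper's formal definition: a set of Skolem terms is consistent if every function symbol is always applied to the same arguments, and linear if the argument sets of any two Skolem terms are comparable under inclusion. The function names are hypothetical.

```python
from itertools import combinations

def is_consistent(terms):
    """Each Skolem function must always be applied to the same argument tuple."""
    seen = {}
    for func, args in terms:
        if seen.setdefault(func, tuple(args)) != tuple(args):
            return False
    return True

def is_linear(terms):
    """Argument sets of any two Skolem terms must be comparable under inclusion."""
    arg_sets = [set(args) for _, args in terms]
    return all(a <= b or b <= a for a, b in combinations(arg_sets, 2))

# The four cases listed on the slide, as (function, arguments) pairs.
consistent_ok = [("f", ("a",)), ("f", ("a",))]
consistent_bad = [("f", ("a",)), ("f", ("b",))]
linear_ok = [("f", ("a",)), ("g", ("a", "b"))]
linear_bad = [("f", ("a", "b")), ("g", ("b", "c"))]

print(is_consistent(consistent_ok), is_consistent(consistent_bad))
print(is_linear(linear_ok), is_linear(linear_bad))
```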
25. Linearize Algorithm
Properties of the Algorithm
• Rewrites an SO tgd into nested GLAV
• PTIME
• Size of resulting formula is linear in the size of the input
Linearize(θ)
1 Partition θ into independent sub-formulas (maximal partitioning Π)
2 For each partition
• Check consistency and linearity
3 If all partitions are linear and consistent then
• Rewrite θ into Ω
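The rewriting step can be illustrated for the simplest case. The following is a simplified sketch, not the authors' Linearize: it handles a single clause only, skips the partitioning step, and assumes consistency by storing exactly one argument tuple per Skolem function. The function name `linearize_prefix` and its input encoding are assumptions made for this example.

```python
def linearize_prefix(universal_vars, skolems):
    """Build a nested GLAV quantifier prefix for one consistent clause.

    universal_vars: variables of the clause in their original order
    skolems: mapping from Skolem function name to its argument tuple
    Returns the prefix as a list of strings, or None if the clause is not linear.
    """
    # Visit Skolem terms from the smallest argument set to the largest.
    ordered = sorted(skolems.items(), key=lambda kv: (len(set(kv[1])), kv[0]))
    bound = set()
    prefix = []
    for func, args in ordered:
        if not bound <= set(args):
            return None  # argument sets are not nested, so the clause is not linear
        for var in universal_vars:
            if var in args and var not in bound:
                prefix.append(f"∀{var}")
                bound.add(var)
        prefix.append(f"∃v_{func}")  # first-order variable replacing func(args)
    # Universal variables that no Skolem term depends on go innermost.
    prefix.extend(f"∀{v}" for v in universal_vars if v not in bound)
    return prefix

# Key invention: WorksOn(d, p, b) -> Project(f(d, p), b)
print(" ".join(linearize_prefix(["d", "p", "b"], {"f": ("d", "p")})))
# Overlapping arguments: WorksOn(d, p, b) -> Dept(d, f(d), p, g(d, p))
print(" ".join(linearize_prefix(["d", "p", "b"], {"f": ("d",), "g": ("d", "p")})))
# Not linear: f(a, b) and g(b, c)
print(linearize_prefix(["a", "b", "c"], {"f": ("a", "b"), "g": ("b", "c")}))
```

Sorting by argument-set size means each ∃v_f is placed directly after the universal variables it depends on, which mirrors the prefixes shown in the linearization examples of the talk.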
27. A Note on Linearity
• Linearity is a syntactic but not a semantic condition
• ⇒ There is hope that an equivalent mapping exists that is linear
• ⇒ Approach: Find an equivalent mapping that is linear
• Modify Skolem arguments?
Non-Linear SO tgd θ → [Equivalence Preserving Transformation] → Linear SO tgd θ → [Linearize] → nested GLAV Ω
28. Using Source Functional Dependencies
So far
• Only considered an SO tgd θ
• Have not considered additional knowledge that may be available
Source Schema
WorksOn (Department, Project, BudgetId)
Audit (BudgetId, Auditor)
City (Department, City)
Target Schema
Project (PId, BudgetId)
Dept (Dept, Year, Project, NumEmp)
Location (Department, DepId, City, State)
Budget (Project, Leader, Size)
∃f ∃g WorksOn (d, p, b) ∧ Audit (b, a) → Budget (p, f (d, p), g(b, a))
29. Using Source Functional Dependencies
Source constraints
• Functional dependencies (FDs) Σ_S that hold over the source
• Primary keys (and other FDs if available)
• FDs imply dependencies between the arguments of Skolem terms
• An FD x → y can be used to augment Skolem arguments: f(x, z) → f(x, z, y)
Source Schema
WorksOn (Department, Project, BudgetId)
Audit (BudgetId, Auditor)
City (Department, City)
Target Schema
Project (PId, BudgetId)
Dept (Dept, Year, Project, NumEmp)
Location (Department, DepId, City, State)
Budget (Project, Leader, Size)
WorksOn: Department, Project → BudgetId
Audit: BudgetId → Auditor
∃f ∃g WorksOn (d, p, b) ∧ Audit (b, a) → Budget (p, f (d, p, b, a), g(b, a))
Implied FD1: d, p → b, a
Implied FD2: b → a
31. Equivalence Preserving Transformation
Approach
• Augment Skolem arguments using implied FDs (Re-Skolemization)
• The resulting θ is equivalent to the original as long as the FDs hold.
Non-Linear SO tgd θ and source FDs Σ_S → [Re-Skolemize using implied FDs] → Linear SO tgd θ and source FDs Σ_S → [Linearize] → nested GLAV Ω
Theorem: Re-Skolemization with FDs preserves equivalence
Given an implied source FD x → y valid over θ:
θ[f(x) ← f(x, y)] ∪ Σ_S ≡ θ ∪ Σ_S
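The intuition behind the theorem can be checked on data: when an FD x → y holds, a Skolem term over x and a Skolem term over (x, y) group the source tuples in exactly the same way. The sketch below illustrates this for g(b) versus g(b, a) under BudgetId → Auditor. The rows and the helper `skolem_groups` are hypothetical and serve only as an illustration, not as a proof.

```python
def skolem_groups(rows, arg_positions):
    """Partition row indexes by the argument values a Skolem term would receive."""
    groups = {}
    for i, row in enumerate(rows):
        key = tuple(row[pos] for pos in arg_positions)
        groups.setdefault(key, set()).add(i)
    return {frozenset(g) for g in groups.values()}

# Joined rows (Department, Project, BudgetId, Auditor); BudgetId -> Auditor holds.
rows_fd_holds = [
    ("IT", "Web", "B1", "Ann"),
    ("IT", "Big Data", "B2", "Bob"),
    ("Sales", "Mobile", "B1", "Ann"),
]
# Same rows, but the last one violates BudgetId -> Auditor.
rows_fd_violated = rows_fd_holds[:2] + [("Sales", "Mobile", "B1", "Cy")]

B, A = 2, 3  # column positions of BudgetId and Auditor
same_when_fd_holds = skolem_groups(rows_fd_holds, [B]) == skolem_groups(rows_fd_holds, [B, A])
same_when_violated = skolem_groups(rows_fd_violated, [B]) == skolem_groups(rows_fd_violated, [B, A])
print(same_when_fd_holds, same_when_violated)
```

The second comparison fails because, without the FD, g(b, a) separates tuples that g(b) would merge. This is why the equivalence is stated relative to Σ_S.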
32. Why Augmentation?
Does Re-Skolemization affect Linearity?
• Augmentation (θ[f(x) ← f(x, y)])
• θ_aug: Result of applying augmentation until no longer possible
• Minimization (θ[f(x, y) ← f(x)]) [a]
• θ_min: Result of applying minimization until no longer possible
[a] B. Marnette et al., PVLDB 3 (2010).
Theorem: Only augmentation preserves Linearity
Linear(θ) → Linear(θ_aug)
Linear(θ) ↛ Linear(θ_min)
33. LinearizeFDs Algorithm
Properties of the Algorithm
• Rewrites SO tgd into nested GLAV
• PTIME
• Size of resulting formula is linear in the size of the input
LinearizeFDs(θ, Σ_S)
1 Compute implied FDs
2 Augment arguments of each Skolem term based on FDs
• Using attribute closure
• Result: θ_aug
3 Return Linearize(θ_aug)
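Step 2 relies on the standard attribute closure computation. The sketch below applies it to the running Budget example. It assumes that step 1 has already produced the implied FDs over the clause variables (d, p → b and b → a), and the names `closure` and `augment` are hypothetical, not taken from the paper.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, given as (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def augment(skolems, fds, clause_vars):
    """Extend every Skolem term's arguments to their closure, in clause variable order."""
    augmented = {}
    for func, args in skolems.items():
        closed = closure(args, fds)
        augmented[func] = tuple(v for v in clause_vars if v in closed)
    return augmented

# WorksOn(d, p, b) ∧ Audit(b, a) → Budget(p, f(d, p), g(b, a))
implied_fds = [({"d", "p"}, {"b"}), ({"b"}, {"a"})]
skolems = {"f": ("d", "p"), "g": ("b", "a")}

theta_aug = augment(skolems, implied_fds, ["d", "p", "b", "a"])
print(theta_aug)
```

After augmentation the arguments of g are contained in those of f, so the two Skolem terms become nested and the result can be handed to Linearize.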
35. Mapping Generator and Experiments
STBenchmark
• Generator for data exchange scenarios [a]
• Schemas, Data and Mappings
• Construct complex mappings from simple primitives
• e.g., Horizontal Partitioning (HP)
• Parameterized and randomized (e.g., join path length)
[a] B. Alexe et al., PVLDB 1 (2008).
Extensions
• Arbitrary Skolem terms (SO tgds)
• New primitives (e.g., Adding and Deleting Attributes, etc.)
• Combining primitives into more complex mappings
• e.g., simulating composition and complex correlations
• Primary Keys (PKs) and Functional dependencies (FDs)
36. Random Scenarios
• 12,500,000 randomly generated mapping scenarios
• Measure success rate
• Compare NBM, Linearize, LinearizeFDs, LinearizeMin
• NBM (the condition of Nash, Bernstein and Melnik) rewrites only into GLAV!
37. Effect of Primary Keys
• Activate/Deactivate source PKs
• Vary amount of non-PK FDs
[Bar chart: success rate (0% to 100%) of Linearize and LinearizeFDs, without and with source PKs, for source FDs = 0%, 25%, and 50%]
39. Conclusions
Rewriting SO-tgds → nested GLAV
• Linearization
• SO tgd is linear → can be rewritten
• Equivalence preserving Re-Skolemization
• Using source FDs to augment Skolem arguments
Experimental and Theoretical Results
• Using FDs improves chance to rewrite
• 78% increased success rate
• Primary keys are most effective
• > 75% increased success rate
• Augmentation is better than minimization
• about 16% increased success rate
40. Future Work
Integrate insights on Re-Skolemization into . . .
• Mapping operators such as
• Composition
• MapMerge
• Mapping generation
FO Rewritability of SO tgds
• Combine our sufficient condition with that of [NBM07] [a]
• we know how to do it!
• Exploit Augmentation and Minimization together
• to simplify and optimize SO mappings
• Use target FDs
[a] A. Nash et al., TODS 32 (2007).
41. Questions?
42. Appendix
6 References
7 Additional Notation
8 A Note on Complexity
9 Partitioning Scheme
10 Additional Experiments
11 STBenchmark 2.0
12 Additional Linearization Examples
13 Example Augmentation
43. References
R. Fagin, P. Kolaitis, R. J. Miller, L. Popa, Data Exchange: Semantics and Query Answering. Theor. Comput. Sci. 336 (2005).
R. Hull, M. Yoshikawa, ILOG: Declarative Creation and Manipulation of Object Identifiers. In VLDB (1990).
L. Popa, Y. Velegrakis, R. J. Miller, M. A. Hernández, R. Fagin, Translating Web Data. In VLDB (2002).
A. Fuxman et al., Nested Mappings: Schema Mapping Reloaded. In VLDB (2006).
L. Libkin, C. Sirangelo, Data Exchange and Schema Mappings in Open and Closed Worlds. J. Comput. Syst. Sci. 77 (2011).
B. Alexe, M. A. Hernández, L. Popa, W. C. Tan, MapMerge: Correlating Independent Schema Mappings. VLDB J. 21 (2012).
Y. Papakonstantinou, S. Abiteboul, H. Garcia-Molina, Object Fusion in Mediator Systems. In VLDB (1996).
R. Fagin, P. Kolaitis, L. Popa, W.-C. Tan, Composing Schema Mappings: Second-Order Dependencies to the Rescue. TODS 30 (2005).
B. ten Cate, P. Kolaitis, Structural Characterizations of Schema-Mapping Languages. In ICDT (2009).
I. Feinerer, R. Pichler, E. Sallinger, V. Savenkov, On the Undecidability of the Equivalence of Second-Order Tuple Generating Dependencies. In AMW (2011).
A. Nash, P. Bernstein, S. Melnik, Composition of Mappings Given by Embedded Dependencies. TODS 32 (2007).
P. C. Arocena, M. D’Angelo, B. Glavic, R. J. Miller, “STBenchmark 2.0”, tech. rep. (Uni. of Toronto, 2013).
D. Gabbay, R. Schmidt, A. Szalas, Second Order Quantifier Elimination: Foundations, Computational Aspects and Applications. (College Publications, 2008).
B. Marnette, G. Mecca, P. Papotti, Scalable Data Exchange with Functional Dependencies. PVLDB 3 (2010).
B. Alexe, W. C. Tan, Y. Velegrakis, STBenchmark: Towards a Benchmark for Mapping Systems. PVLDB 1 (2008).
47. Notation
GLAV (s-t tgds): ∀z, x (φ(z, x) → ∃y ψ(x, y))
∀d ∀p ∀b Works(d, p, b) → ∃y_1 ∃y_2 Project(y_1, b, y_2)
nested GLAV: Q(x, y)((φ_1(x) → ψ_1(x, y)) ∧ . . . ∧ (φ_n(x) → ψ_n(x, y))),
where Q(x, y) is a sequence of quantifiers, that is, ∀ for x and ∃ for y
∀d ∃y_1 ∀p ∃y_2 ∀b Works(d, p, b) → Project(y_1, b, y_2)
SO tgds: ∃f ( (∀x_1 (φ_1 → ψ_1)) ∧ · · · ∧ (∀x_n (φ_n → ψ_n)) )
Note: we usually omit universal quantifiers
∃f ∃g (Works(d, p, b) → Project(f(p), b, g(d)))
49. Model Checking
Complexity
• NP-complete for SO tgds vs. P for nested GLAV
• Are we only solving the simple cases?
Approach
• Find an SO tgd for which model checking is hard
• But can be rewritten using (implied) source FDs
50. Model Checking: 3-colorability
Schema Mappings
θ = ∀X, Y : (E(X, Y) → C(f(X), g(Y))) ∧ (V(X, Y) → S(f(X), g(Y)))
Not linear!
Instance
• The source relations encode an undirected graph G
• For each edge (x, y) we create two tuples E(x, y) and E(y, x)
• For each vertex x we create a tuple V(x, x)
• The target relations represent a coloring of the vertices of G using three colors r, g, and b
• C: (r, g), (r, b), (g, r), (g, b), (b, r), (b, g) - colors of adjacent nodes
• S: (r, r), (g, g), (b, b) - colors of vertices
Theorem: Model Checking is 3-colorability
G is 3-colorable if and only if θ holds over I
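The reduction can be tried out on small graphs. The sketch below is a brute-force illustration under stated assumptions, not an algorithm from the talk: it enumerates every interpretation of f and g over the three colors and checks both clauses of θ. The function names and the test graphs are hypothetical. The V clause forces f(x) = g(x) for every vertex, and the E clause then forces adjacent vertices to receive different colors.

```python
from itertools import product

COLORS = ("r", "g", "b")
C = {(x, y) for x in COLORS for y in COLORS if x != y}  # colors of adjacent nodes
S = {(x, x) for x in COLORS}                            # colors of vertices

def theta_holds(vertices, edges, f, g):
    """Check both clauses of θ for fixed interpretations f and g (dicts)."""
    E = {(x, y) for x, y in edges} | {(y, x) for x, y in edges}
    V = {(x, x) for x in vertices}
    return (all((f[x], g[y]) in C for x, y in E)
            and all((f[x], g[y]) in S for x, y in V))

def model_check(vertices, edges):
    """Search all interpretations of f and g; exponential in the number of vertices."""
    for fs in product(COLORS, repeat=len(vertices)):
        f = dict(zip(vertices, fs))
        for gs in product(COLORS, repeat=len(vertices)):
            g = dict(zip(vertices, gs))
            if theta_holds(vertices, edges, f, g):
                return f  # f is a valid 3-coloring of the graph
    return None

triangle = ([1, 2, 3], [(1, 2), (2, 3), (1, 3)])
k4 = ([1, 2, 3, 4], [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)])
print(model_check(*triangle) is not None)  # a triangle is 3-colorable
print(model_check(*k4) is not None)        # the complete graph K4 is not
```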
51. Model Checking: 3-colorability
Schema Mappings
θ = ∀X, Y : (E(X, Y) → C(f(X, Y), g(Y))) ∧ (V(X, Y) → S(f(X, Y), g(Y)))
Linear!
FD: X → Y
53. SO tgds are closed under conjunction
• Every SO tgd θ can be written as a set of clauses (φ_i → ψ_i)
• Splitting this set of clauses to form new SO tgds θ_1, . . . , θ_n is equivalence preserving
• If θ_i and θ_j share no Skolems (they are uncorrelated)
• then θ_i and θ_j can be rewritten independently
54. Maximal Partition
• Given an SO tgd θ
• Partition clauses into Π = {π_1, . . . , π_n}
1 No π_i and π_j (i ≠ j) share any Skolems
2 There is no Π′ with more elements than Π that fulfills condition 1
Theorem: Rewritability and Maximal Partitions
Rewritable(θ) ⇔ ∀i : Rewritable(π_i)
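A maximal partition can be computed as the connected components of the "shares a Skolem" relation between clauses. The union-find sketch below is one possible way to do this, not necessarily the construction used in the paper. Each clause is represented only by the set of Skolem function names it uses, and `maximal_partition` is a hypothetical name.

```python
def maximal_partition(clauses):
    """Group clause indexes into connected components of the 'shares a Skolem' relation.

    clauses: list of sets, each holding the Skolem function names used by one clause.
    """
    parent = list(range(len(clauses)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    owner = {}  # first clause seen for each Skolem name
    for i, skolems in enumerate(clauses):
        for name in skolems:
            if name in owner:
                parent[find(i)] = find(owner[name])  # merge the two components
            else:
                owner[name] = i

    groups = {}
    for i in range(len(clauses)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Clauses 0, 1 and 4 are linked through f and g; clause 3 uses no Skolems.
clauses = [{"f"}, {"f", "g"}, {"h"}, set(), {"g"}]
print(maximal_partition(clauses))
```

Any two clauses that share a Skolem must end up in the same element, so the connected components form the partition with the most elements that still satisfies condition 1.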
56. Real-Life Mappings
• Three real-life mapping scenarios from the literature
• Created SO tgds based on
• Semantics of the schemas
• Documented data transformations
• Compared all rewriting techniques
58. Towards STBenchmark 2.0
Noteworthy Features
• Support for arbitrary Skolem Functions (SO tgds) and various
Skolemization modes (e.g., Key, All and Random)
• Simulating some cases of composition using Skolem Noise
• Reuse of source schema elements using Source Reuse
• PKs and random multi-attribute FDs over the source
Usability Case?
• Thinking about comparing different notions of mapping inverse
• Any suggestions?
63. Nesting and Linearization
Example: Skolems with Overlapping Arguments
Source Schema
WorksOn (Department, Project, BudgetId)
Audit (BudgetId, Auditor)
City (Department, City)
Target Schema
Project (PId, BudgetId)
Dept (Dept, Year, Project, NumEmp)
Location (Department, DepId, City, State)
Budget (Project, Leader, Size)
∃f ∃g ( WorksOn(d, p, b) → Dept(d, f(d), p, g(d, p)) )
We need to introduce two ∃ quantifiers without violating the dependencies modeled by f and g
∀d ∃v_f ∀p ∃v_g ∀b WorksOn(d, p, b) → Dept(d, v_f, p, v_g)
65. Augmentation is better than Minimization
Source Schema
WorksOn (Department, Project, BudgetId)
Audit (BudgetId, Auditor)
City (Department, City)
Target Schema
Project (PId, BudgetId)
Dept (Dept, Year, Project, NumEmp)
Location (Department, DepId, City, State)
Budget (Project, Leader, Size)
WorksOn: Department, Project → BudgetId
Audit: BudgetId → Auditor
θ = ∃f ∃g WorksOn(d, p, b) ∧ Audit(b, a) → Budget(p, f(d, p), g(b, a))
θ_aug = ∃f ∃g . . . → Budget(p, f(d, p, b, a), g(b, a))
θ_min = ∃f ∃g . . . → Budget(p, f(d, p), g(b))
FD1: d, p → b, a
FD2: b → a