Solmaz Kolahi: Functional Dependencies: Redundancy Analysis and Correcting Violations

Functional Dependencies: Redundancy Analysis
and Correcting Violations
Solmaz Kolahi
solmaz@cs.ubc.ca

Postdoctoral Research Fellow
Department of Computer Science
University of British Columbia

Joint work with Leonid Libkin and Laks Lakshmanan

Solmaz Kolahi, Sharif U. of Tech., December 2008

Motivation

Both relational and XML databases may store redundant data:

title            director  actor      year
The Departed     Scorsese  DiCaprio   2006
The Departed     Scorsese  Nicholson  2006
Shrek the Third  Miller    Myers      2007
Shrek the Third  Miller    Murphy     2007
Shrek the Third  Hui       Myers      2007
Shrek the Third  Hui       Murphy     2007

Functional Dependency: title → year

Functional Dependency: @AreaCode → @City

[Figure: XML tree with four nodes, each with AreaCode 416 and City Toronto.]


Motivation

Normalization techniques try to remove redundancies:
BCNF eliminates all redundancies.
only key dependencies are allowed.
cannot always be achieved without losing dependencies.

Example: R(A, B, C) with AB → C and C → B.

3NF eliminates some redundancies.
allows redundancy on prime attributes.
preserves dependencies.

XNF eliminates all redundancies w.r.t. XML functional dependencies.
only XML keys are allowed: if X → p.@l, then X → p.
introduced by Arenas & Libkin in 2002.
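
The R(A, B, C) example with AB → C and C → B can be checked mechanically. The following is a minimal sketch, not taken from the talk: the function names are illustrative, and candidate keys are found by brute force, which is only reasonable for small schemas. It tests each given FD against the usual textbook conditions for BCNF and 3NF.

```python
from itertools import combinations

ATTRS = frozenset("ABC")
FDS = [(frozenset("AB"), frozenset("C")), (frozenset("C"), frozenset("B"))]

def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def candidate_keys(attrs, fds):
    """Minimal attribute sets whose closure is all of `attrs` (brute force)."""
    keys = []
    for size in range(1, len(attrs) + 1):
        for combo in combinations(sorted(attrs), size):
            s = frozenset(combo)
            if closure(s, fds) == attrs and not any(k <= s for k in keys):
                keys.append(s)
    return keys

def is_bcnf(attrs, fds):
    """Every nontrivial FD must have a superkey on its left side."""
    return all(rhs <= lhs or closure(lhs, fds) == attrs for lhs, rhs in fds)

def is_3nf(attrs, fds):
    """As BCNF, but an FD also passes if its new right-side attributes are prime."""
    prime = frozenset().union(*candidate_keys(attrs, fds))
    return all(
        rhs <= lhs or closure(lhs, fds) == attrs or (rhs - lhs) <= prime
        for lhs, rhs in fds
    )

print(candidate_keys(ATTRS, FDS))
print(is_bcnf(ATTRS, FDS), is_3nf(ATTRS, FDS))
```

Under these definitions the schema is in 3NF but not in BCNF: C → B has a non-key left side, yet B is a prime attribute.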


Motivation

Traditional normalization theory
characterizes a database as redundant or non-redundant.
does not measure redundancy.
cannot provide guidelines to reduce redundancy.

The more redundant the data, the more prone to update anomalies.

Our goal is
to show that there is a spectrum of redundancy using an
information-theoretic tool.
to choose database designs with low redundancy.
to handle databases with dependency violations.


Outline

Motivation.
Reducing redundancy in relational and XML data:
Measure of redundancy.
Redundancy analysis of normal forms and schemas.
Correcting functional dependency violations.
Conclusions.
Future work.


Measure of Information Content

Proposed by Arenas & Libkin in 2003.
Used to measure the redundancy of a data value in a database
instance with respect to a set of constraints.
Intuitively, $\mathrm{RIC}_I(p \mid \Sigma)$ measures the relative information content of
position p in instance I w.r.t. constraints Σ.
Independent of data models and query languages.




Example: an instance I over R(A, B, C, D), grown one row at a time. The value beside each row is $\mathrm{RIC}_I(p \mid \Sigma)$ for a position p in column C, computed on the instance made of the rows up to and including that row.

A  B  C  D    Σ = {A → C}    Σ = {A → C, B → C}
1  2  3  4
1  2  3  5    0.875          0.781
1  2  3  6    0.781          0.629
1  2  3  7    0.711          0.522
1  2  3  8    0.658          0.446



Example: R(A, B, C), Σ = {A → B}, and instance I:

A  B  C
1  2  3
1  2  4

Here p is the position of the B value in the first row.

Pick k such that adom(I) ⊆ {1, ..., k} (k = 7).
For every X ⊆ Pos(I) − {p} compute probability distribution P(a | X) for
every a ∈ {1, ..., k}.

For one choice of X (positions outside X are blank, shown as _, and p holds a):

A  B  C
_  a  3
1  2  _

P(2 | X) = 48/(48 + 6 × 42) = 0.16
P(a | X) = 42/(48 + 6 × 42) = 0.14 for every a ≠ 2

Conditional entropy: 2.8057
Average over all possible X: $\mathrm{RIC}^k_I(p \mid \Sigma) = 2.4558$

$$\mathrm{RIC}_I(p \mid \Sigma) = \lim_{k \to \infty} \frac{\mathrm{RIC}^k_I(p \mid \Sigma)}{\log k} = 0.875$$
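
The numbers in this example can be reproduced by brute force. The sketch below is an illustration rather than the authors' code, and it rests on three assumptions: p is the B value of the first row, a filling in which the two rows become identical is not counted, and the average over X is uniform over all subsets of the other positions. All names are hypothetical.

```python
from itertools import combinations, product
from math import log2

K = 7                                  # domain {1, ..., K}
INSTANCE = [(1, 2, 3), (1, 2, 4)]      # rows over columns (A, B, C)
FDS = [((0,), 1)]                      # A -> B, as (left column indexes, right column index)
P = (0, 1)                             # position p: row 0, column B (assumed)
OTHERS = [(i, j) for i in range(2) for j in range(3) if (i, j) != P]

def satisfies(rows, fds):
    """True if no two rows agree on an FD's left side but differ on its right side."""
    for lhs, rhs in fds:
        seen = {}
        for row in rows:
            key = tuple(row[c] for c in lhs)
            if seen.setdefault(key, row[rhs]) != row[rhs]:
                return False
    return True

def count_fillings(a, known):
    """Count fillings of the positions outside `known`, with value `a` placed at p."""
    blanks = [q for q in OTHERS if q not in known]
    total = 0
    for values in product(range(1, K + 1), repeat=len(blanks)):
        rows = [list(r) for r in INSTANCE]
        rows[P[0]][P[1]] = a
        for (i, j), v in zip(blanks, values):
            rows[i][j] = v
        distinct = {tuple(r) for r in rows}
        # Assumption: fillings that collapse two rows into one are not counted.
        if len(distinct) == len(INSTANCE) and satisfies(distinct, FDS):
            total += 1
    return total

def entropy_given(known):
    """Entropy (base 2) of the distribution P(a | X) for X = `known`."""
    counts = [count_fillings(a, known) for a in range(1, K + 1)]
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def ric_k():
    """Uniform average of the entropy over every subset X of the other positions."""
    subsets = [set(c) for n in range(len(OTHERS) + 1) for c in combinations(OTHERS, n)]
    return sum(entropy_given(x) for x in subsets) / len(subsets)

X = {(0, 2), (1, 0), (1, 1)}           # known: C of row 0, A and B of row 1
print(count_fillings(2, X), count_fillings(5, X))
print(round(entropy_given(X), 4), round(ric_k(), 4))
```

Under these assumptions the counts for this X are 48 and 42, and dividing `ric_k()` by `log2(K)` already lands close to the limit value 0.875 at k = 7.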


Measure and Database Design

A schema S with constraints Σ is well-designed if for every instance I of
(S, Σ) and every position p in I, $\mathrm{RIC}_I(p \mid \Sigma) = 1$.

Known results (Arenas & Libkin, 2003):
relational databases with FDs: (S, Σ) is well-designed iff it is in BCNF.
XML documents with FDs: (S, Σ) is well-designed iff it is in XNF.





Well-designed databases cannot always be achieved:
Performance issues.
Dependency preservation.






General design goal: maximizing information content to the extent possible
by enforcing some design conditions.


Guaranteed Information Content

Given a condition C, guaranteed information content (GIC) is the smallest
information content found in instances of schemas satisfying C.

[Figure: for a schema (S, Σ) satisfying condition C, every position p in every instance I of (S, Σ) has $\mathrm{RIC}_I(p \mid \Sigma) \ge \mathrm{GIC}$.]

More formally, we look at the set of all possible values for information content

$$\mathrm{POSS}_C(m) = \{\, \mathrm{RIC}_I(p \mid \Sigma) \mid I \text{ is an instance of } (R, \Sigma),\ R \text{ has } m \text{ attributes},\ (R, \Sigma) \text{ satisfies } C \,\};$$

then $\mathrm{GIC}_C(m)$ is the infimum of $\mathrm{POSS}_C(m)$.

Price of Dependency Preservation

Design goal: minimizing redundancy while preserving FDs.

For a normal form NF, PRICE(NF) is the minimum information content that
NF loses to guarantee dependency preservation.

If c ∈ [0, 1] is the largest information content guaranteed for
decompositions into NF, then the price of dependency preservation, PRICE(NF), is 1 − c.

[Figure: an NF-decomposition of (R, Σ) into (R1, Σ1), (R2, Σ2), (R3, Σ3), with $\mathrm{RIC}_I(p \mid \Sigma) \ge c$ guaranteed.]



Theorem.
PRICE(3NF) = 1/2.
PRICE(NF) ≥ 1/2 for any dependency-preserving normal form NF.

To pay the smallest price for achieving dependency preservation, we
should do a 3NF normalization.




Not all 3NF normalizations are equal:
Special subclasses of 3NF exist (from earlier research).
Only one subclass (3NF+) achieves the smallest price.
We compare normal forms based on the guaranteed information content
or the highest redundancy they allow.


Comparing Normal Forms

Theorem. For every m > 2:

$$\mathrm{GIC}_{\mathrm{All}}(m) = 2^{1-m}, \qquad \mathrm{GIC}_{\mathrm{3NF}}(m) = 2^{2-m}, \qquad \mathrm{GIC}_{\mathrm{3NF^+}}(m) = 1/2.$$

3NF is twice as good as doing nothing.
3NF+ is exponentially better.
Similar results are obtained if we compare normal forms based on
guaranteed average information content.
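
As a quick worked instance of the theorem, take m = 5 (a value chosen here only for illustration). The ratios in the second row hold for any m > 2 and are what "twice as good" and "exponentially better" refer to.

```latex
\begin{align*}
\mathrm{GIC}_{\mathrm{All}}(5) &= 2^{1-5} = \tfrac{1}{16}, &
\mathrm{GIC}_{\mathrm{3NF}}(5) &= 2^{2-5} = \tfrac{1}{8}, &
\mathrm{GIC}_{\mathrm{3NF^+}}(5) &= \tfrac{1}{2},\\
\frac{\mathrm{GIC}_{\mathrm{3NF}}(m)}{\mathrm{GIC}_{\mathrm{All}}(m)} &= \frac{2^{2-m}}{2^{1-m}} = 2, &
\frac{\mathrm{GIC}_{\mathrm{3NF^+}}(m)}{\mathrm{GIC}_{\mathrm{3NF}}(m)} &= \frac{2^{-1}}{2^{2-m}} = 2^{m-3}.
\end{align*}
```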


Redundancy of an Arbitrary Schema

Normalizing into smaller relations is not always desirable:
it can lose constraints.
it can slow down query answering.
The normalization decision can be made based on
how much redundancy the schema allows; or
where in the spectrum of redundancy the schema lies; or
the lowest information content found in all instances of the
schema.

Design goal: decomposing the schema with the highest potential for
redundancy.



Theorem. Given an arbitrary schema R with FDs Σ, let
$\Sigma_A = \{X \mid X \to A,\ X \text{ is minimal and non-key}\}$;
#HS = the number of hitting sets of $\Sigma_A$;
$l = \left|\bigcup_{X \in \Sigma_A} X\right|$.

Then the smallest information content found in column A of instances is

$$\mathrm{GIC}^{\Sigma}_{R}(A) = \#\mathrm{HS} \cdot 2^{-l}.$$

Examples:

R1(A, B, C, D, E), Σ1 = {AB → E, D → E}: $\mathrm{GIC}^{\Sigma_1}_{R_1}(E) = 3/8$
R2(A, B, C, D, E), Σ2 = {BC → E, AC → E, BD → E}: $\mathrm{GIC}^{\Sigma_2}_{R_2}(E) = 1/2$
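
The two example values can be recomputed by enumerating hitting sets. This is a small sketch under stated assumptions: the left-hand sides passed in are already the minimal, non-key ones, and a hitting set is any subset of the union of those left-hand sides that intersects every one of them. The function name `gic` is illustrative.

```python
from fractions import Fraction
from itertools import combinations

def gic(lhs_sets):
    """#HS * 2^(-l) for a list of left-hand sides, each given as a string of attribute names."""
    sets = [frozenset(x) for x in lhs_sets]
    universe = sorted(frozenset().union(*sets))   # attributes appearing in some left-hand side
    hitting = 0
    for size in range(len(universe) + 1):
        for combo in combinations(universe, size):
            chosen = set(combo)
            # A hitting set shares at least one attribute with every left-hand side.
            if all(chosen & s for s in sets):
                hitting += 1
    return Fraction(hitting, 2 ** len(universe))

print(gic(["AB", "D"]))          # column E of R1
print(gic(["BC", "AC", "BD"]))   # column E of R2
```

For R1 the hitting sets are {A, D}, {B, D} and {A, B, D} with l = 3; for R2 there are 8 hitting sets with l = 4.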




Functional Dependency Violations

Large databases often tend to violate a set of FDs.
Σ = {cnt, arCode → reg;  cnt, reg → prov}

An inconsistent database:

    name     cnt  prov  reg  arCode  phone
t1  Smith    CAN  BC    Van  604     123 4567
t2  Adams    CAN  BC    Van  604     765 4321
t3  Simpson  CAN  BC    Man  604     345 6789
t4  Rice     CAN  AB    Vic  604     987 6543

A minimal repair (only t3 and t4 are modified):

    name     cnt  prov  reg  arCode  phone
t3  Simpson  CAN  BC    Van  604     345 6789
t4  Rice     v1   AB    Vic  604     987 6543
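
To see which tuples conflict, one can list every pair that agrees on the left side of an FD but differs on the right side. The sketch below does this for the example table and for the repaired version; the helper names are hypothetical, and t1 and t2 are assumed to be unchanged in the repair.

```python
from itertools import combinations

COLUMNS = ("name", "cnt", "prov", "reg", "arCode", "phone")
FDS = [(("cnt", "arCode"), "reg"), (("cnt", "reg"), "prov")]

def table(raw):
    """Turn {tuple id: values} into {tuple id: {column: value}}."""
    return {tid: dict(zip(COLUMNS, values)) for tid, values in raw.items()}

DATABASE = table({
    "t1": ("Smith", "CAN", "BC", "Van", "604", "123 4567"),
    "t2": ("Adams", "CAN", "BC", "Van", "604", "765 4321"),
    "t3": ("Simpson", "CAN", "BC", "Man", "604", "345 6789"),
    "t4": ("Rice", "CAN", "AB", "Vic", "604", "987 6543"),
})

# The repair changes reg of t3 to Van and cnt of t4 to v1.
REPAIR = table({
    "t1": ("Smith", "CAN", "BC", "Van", "604", "123 4567"),
    "t2": ("Adams", "CAN", "BC", "Van", "604", "765 4321"),
    "t3": ("Simpson", "CAN", "BC", "Van", "604", "345 6789"),
    "t4": ("Rice", "v1", "AB", "Vic", "604", "987 6543"),
})

def violations(db, fds):
    """Return (id, id, lhs, rhs) for each pair agreeing on lhs but differing on rhs."""
    found = []
    for (id1, r1), (id2, r2) in combinations(db.items(), 2):
        for lhs, rhs in fds:
            if all(r1[c] == r2[c] for c in lhs) and r1[rhs] != r2[rhs]:
                found.append((id1, id2, lhs, rhs))
    return found

for v in violations(DATABASE, FDS):
    print(v)
print(violations(REPAIR, FDS))
```

All conflicts in the original table come from cnt, arCode → reg, since every row shares CAN and 604 while three different regions appear.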


Handling Inconsistent Databases

Integrity constraints Σ (FDs, keys, etc.).
Inconsistent database D: does not satisfy Σ.
We can produce a repair R by inserting/deleting tuples or modifying
values in D.
∆(D, R) = number of modifications

Handling inconsistency:
Consistent query answering:
$$\text{certain answer for query } Q = \bigcap \{\, Q(R) \mid R \text{ is a minimal repair for } D \,\}$$
Producing an optimum repair $R_{opt}$ with minimum ∆.
Both approaches are intractable in general.

Our approach: producing an approximate solution $R_{app}$ for optimum repair.
$$\Delta(D, R_{app}) \le \alpha \cdot \Delta(D, R_{opt})$$
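
For repairs that only modify values, the distance ∆ can be read as a count of changed cells. The following sketch assumes exactly that restricted setting (no inserted or deleted tuples, tuples matched by id); the function name `delta` and the sample rows are illustrative.

```python
def delta(original, repaired):
    """Number of cells whose value differs, matching tuples by id."""
    if original.keys() != repaired.keys():
        # Insertions and deletions are outside the scope of this sketch.
        raise ValueError("both instances must contain the same tuple ids")
    return sum(
        1
        for tid in original
        for old, new in zip(original[tid], repaired[tid])
        if old != new
    )

# Columns: name, cnt, prov, reg, arCode, phone
D = {
    "t3": ("Simpson", "CAN", "BC", "Man", "604", "345 6789"),
    "t4": ("Rice", "CAN", "AB", "Vic", "604", "987 6543"),
}
R = {
    "t3": ("Simpson", "CAN", "BC", "Van", "604", "345 6789"),
    "t4": ("Rice", "v1", "AB", "Vic", "604", "987 6543"),
}

print(delta(D, R))
```

With such a function, the bound above becomes a plain comparison between `delta(D, R_app)` and `alpha * delta(D, R_opt)`, although computing an optimum repair to compare against is the hard part.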


Approximating Optimum Repair

Theorem. Finding an optimum solution for FD violations is NP-hard.

Theorem. Finding a constant-factor approximation for all FD violations is
NP-hard.

Theorem. For every fixed set of FDs, there is a polynomial-time algorithm
that approximates optimum repair within a factor of α, where α depends
on FDs.

Example instance, with Σ = {A → C, B → C, CD → E}:

    A   B   C   D   E
t1  a1  b1  c1  d1  e1
t2  a2  b1  c2  d2  e2
t3  a1  b3  c3  d3  e3
t4  a4  b4  c4  d4  e4
t5  a5  b4  c5  d5  e5
t6  a6  b6  c4  d5  e6

Conclusions

We analyze schemas and normal forms based on worst cases of
redundancy.
There is a spectrum of information content (redundancy) for
schemas.

[Figure: spectrum of information content from 0 (poorly designed) through 1/2 to 1 (well-designed).]

Producing optimum repair for FD violations is hard.
We introduced an approximation framework.


Future Work

Comparing quality of schemas with low / high information content in
practice.
Defining normalization concepts for XML such as:
dependency preserving decomposition.
Finding an equivalent of 3NF for XML as a normal form
that guarantees an information content of 1/2.
to which every XML document is decomposable.
Extending the repair algorithm for other integrity constraints.

