SlideShare uma empresa Scribd logo
1 de 40
Baixar para ler offline
Functional Dependencies: Redundancy Analysis
           and Correcting Violations
                      Solmaz Kolahi
                   solmaz@cs.ubc.ca

             Postdoctoral Research Fellow
            Department of Computer Science
             University of British Columbia

       Joint work with Leonid Libkin and Laks Lakshmanan




                                                   Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 1/20
Motivation

Both relational and XML databases may store redundant data:

                                        title             director      actor         year
                                        The Departed      Scorsese      DiCaprio      2006
                                        The Departed      Scorsese      Nicholson     2006
Functional Dependency:                  Shrek the Third   Miller        Myers         2007
                                        Shrek the Third   Miller        Murphy        2007
title → year
                                        Shrek the Third   Hui           Myers         2007
                                        Shrek the Third   Hui           Murphy        2007




Functional Dependency:
@AreaCode → @City


                         AreaCode                  AreaCode                    AreaCode             AreaCode
                           416                        416                         416                  416

                                      City                      City                                City               City
                                    Toronto                   Toronto                             Toronto            Toronto




                                                                                    Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 2/20
Motivation

Normalization techniques try to remove redundancies:
    BCNF eliminates all redundancies.
      only key dependencies are allowed.
      cannot always be achieved without losing dependencies.

                          AB → C     C→B
        R(A, B, C)

    3NF eliminates some redundancies.
       allows redundancy on prime attributes.
       preserves dependencies.



    XNF eliminates all redundancies w.r.t. XML functional dependencies.
      only XML keys are allowed: if X → p.@l, then X → p.
      introduced by Arenas & Libkin in 2002.

                                                       Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 3/20
Motivation

Traditional normalization theory
    characterizes a database as redundant or non-redundant.
    does not measure redundancy.
    cannot provide guidelines to reduce redundancy.


The more redundant the data, the more prone to update anomalies.

Our goal is
    to show that there is a spectrum of redundancy using an
    information-theoretic tool.
    to choose database designs with low redundancy.
    to handle databases with dependency violations.



                                                      Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 4/20
Outline

Motivation.
Reducing redundancy in relational and XML data:
   Measure of redundancy.
   Redundancy analysis of normal forms and schemas.
Correcting functional dependency violations.
Conclusions.
Future work.




                                               Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 5/20
Measure of Information Content

Proposed by Arenas & Libkin in 2003.
Used to measure the redundancy of a data value in a database
instance with respect to a set of constraints.
Intuitively, RICI (p|Σ) measures the relative information content of
position p in instance I w.r.t. constraints Σ.
Independent of data models and query languages.




                                                    Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 6/20
Measure of Information Content

    Proposed by Arenas & Libkin in 2003.
    Used to measure the redundancy of a data value in a database
    instance with respect to a set of constraints.
    Intuitively, RICI (p|Σ) measures the relative information content of
    position p in instance I w.r.t. constraints Σ.
    Independent of data models and query languages.

Σ = {A → C}            A     B     C    D
 RICI (P |Σ)            1    2     3     4
   0.875                1    2     3     5




                                                        Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 6/20
Measure of Information Content

    Proposed by Arenas & Libkin in 2003.
    Used to measure the redundancy of a data value in a database
    instance with respect to a set of constraints.
    Intuitively, RICI (p|Σ) measures the relative information content of
    position p in instance I w.r.t. constraints Σ.
    Independent of data models and query languages.

Σ = {A → C}            A     B     C    D
 RICI (P |Σ)            1    2     3     4
   0.875                1    2     3     5
   0.781                1    2     3     6




                                                        Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 6/20
Measure of Information Content

    Proposed by Arenas & Libkin in 2003.
    Used to measure the redundancy of a data value in a database
    instance with respect to a set of constraints.
    Intuitively, RICI (p|Σ) measures the relative information content of
    position p in instance I w.r.t. constraints Σ.
    Independent of data models and query languages.

Σ = {A → C}            A     B     C    D
 RICI (P |Σ)            1    2     3     4
   0.875                1    2     3     5
   0.781                1    2     3     6
   0.711                1    2     3     7




                                                        Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 6/20
Measure of Information Content

    Proposed by Arenas & Libkin in 2003.
    Used to measure the redundancy of a data value in a database
    instance with respect to a set of constraints.
    Intuitively, RICI (p|Σ) measures the relative information content of
    position p in instance I w.r.t. constraints Σ.
    Independent of data models and query languages.

Σ = {A → C}            A     B     C    D
 RICI (P |Σ)            1    2     3     4
   0.875                1    2     3     5
   0.781                1    2     3     6
   0.711                1    2     3     7
   0.658                1    2     3     8



                                                        Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 6/20
Measure of Information Content

    Proposed by Arenas & Libkin in 2003.
    Used to measure the redundancy of a data value in a database
    instance with respect to a set of constraints.
    Intuitively, RICI (p|Σ) measures the relative information content of
    position p in instance I w.r.t. constraints Σ.
    Independent of data models and query languages.

Σ = {A → C}                                     Σ = {A → C, B → C}
                       A     B     C    D
 RICI (P |Σ)                                     RICI (P |Σ)
                        1    2     3     4
   0.875                                           0.781
                        1    2     3     5
   0.781                                           0.629
                        1    2     3     6
   0.711                                           0.522
                        1    2     3     7
   0.658                                           0.446
                        1    2     3     8



                                                        Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 6/20
Measure of Information Content


                                          A     B       C
                                          1      2        3
  R(A, B, C) Σ = {A → B}
                                          1      2        4
Pick k such that adom(I) ⊆ {1, . . . , k} (k = 7).
For every X ⊆ Pos(I) − {p} compute probability distribution P (a|X) for
every a ∈ {1, . . . , k}.




                                                      Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 7/20
Measure of Information Content


                                          A     B       C
                                                 2        3
  R(A, B, C) Σ = {A → B}
                                          1      2
Pick k such that adom(I) ⊆ {1, . . . , k} (k = 7).
For every X ⊆ Pos(I) − {p} compute probability distribution P (a|X) for
every a ∈ {1, . . . , k}.

P (2|X) =




                                                      Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 7/20
Measure of Information Content


                                          A     B       C
                                          1      2        3
  R(A, B, C) Σ = {A → B}
                                          1      2        1
Pick k such that adom(I) ⊆ {1, . . . , k} (k = 7).
For every X ⊆ Pos(I) − {p} compute probability distribution P (a|X) for
every a ∈ {1, . . . , k}.

P (2|X) =




                                                      Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 7/20
Measure of Information Content


                                          A     B       C
                                          4      2        3
  R(A, B, C) Σ = {A → B}
                                          1      2        7
Pick k such that adom(I) ⊆ {1, . . . , k} (k = 7).
For every X ⊆ Pos(I) − {p} compute probability distribution P (a|X) for
every a ∈ {1, . . . , k}.

P (2|X) =




                                                      Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 7/20
Measure of Information Content


                                          A     B       C
                                          4      2        3
  R(A, B, C) Σ = {A → B}
                                          1      2        7
Pick k such that adom(I) ⊆ {1, . . . , k} (k = 7).
For every X ⊆ Pos(I) − {p} compute probability distribution P (a|X) for
every a ∈ {1, . . . , k}.

P (2|X) = 48/




                                                      Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 7/20
Measure of Information Content


                                          A     B         C
                                                    a       3
  R(A, B, C) Σ = {A → B}
                                          1         2
Pick k such that adom(I) ⊆ {1, . . . , k} (k = 7).
For every X ⊆ Pos(I) − {p} compute probability distribution P (a|X) for
every a ∈ {1, . . . , k}.

P (2|X) = 48/(48 + 6 × 42) = 0.16
P (a|X) = 42/(48 + 6 × 42) = 0.14 for every a = 2




                                                        Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 7/20
Measure of Information Content


                                          A      B        C
                                             1      2       3
  R(A, B, C) Σ = {A → B}
                                             1      2       4
Pick k such that adom(I) ⊆ {1, . . . , k} (k = 7).
For every X ⊆ Pos(I) − {p} compute probability distribution P (a|X) for
every a ∈ {1, . . . , k}.

P (2|X) = 48/(48 + 6 × 42) = 0.16
P (a|X) = 42/(48 + 6 × 42) = 0.14 for every a = 2

Conditional entropy : 2.8057
Average over all possible X: RICk = 2.4558
                                I




                                                        Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 7/20
Measure of Information Content


                                             A    B       C
                                             1      2       3
  R(A, B, C) Σ = {A → B}
                                             1      2       4
Pick k such that adom(I) ⊆ {1, . . . , k} (k = 7).
For every X ⊆ Pos(I) − {p} compute probability distribution P (a|X) for
every a ∈ {1, . . . , k}.

P (2|X) = 48/(48 + 6 × 42) = 0.16
P (a|X) = 42/(48 + 6 × 42) = 0.14 for every a = 2

Conditional entropy : 2.8057
Average over all possible X: RICk = 2.4558
                                I


                                       RICk (p| Σ)
                                          I
                  RICI (p|Σ)   = lim               = 0.875
                                          log k
                                 k→∞



                                                        Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 7/20
Measure and Database Design

A schema S with constraints Σ is well-designed if for every instance I of
(S, Σ) and every position p in I RICI (p|Σ) = 1.

Known results (Arenas & Libkin, 2003):
    relational databases with FDs: (S, Σ) is well-designed iff it is in BCNF.
    XML documents with FDs: (S, Σ) is well-designed iff it is in XNF.




                                                       Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 8/20
Measure and Database Design

A schema S with constraints Σ is well-designed if for every instance I of
(S, Σ) and every position p in I RICI (p|Σ) = 1.

Known results (Arenas & Libkin, 2003):
    relational databases with FDs: (S, Σ) is well-designed iff it is in BCNF.
    XML documents with FDs: (S, Σ) is well-designed iff it is in XNF.

Well-designed databases cannot always be achieved:
    Performance issues.
    Dependency preservation.




                                                       Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 8/20
Measure and Database Design

A schema S with constraints Σ is well-designed if for every instance I of
(S, Σ) and every position p in I RICI (p|Σ) = 1.

Known results (Arenas & Libkin, 2003):
    relational databases with FDs: (S, Σ) is well-designed iff it is in BCNF.
    XML documents with FDs: (S, Σ) is well-designed iff it is in XNF.

Well-designed databases cannot always be achieved:
    Performance issues.
    Dependency preservation.

General design goal: maximizing information content to the possible
extent by enforcing some design conditions.



                                                       Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 8/20
Guaranteed Information Content

Given a condition C, guaranteed information content (GIC) is the smallest
information content found in instances of schemas satisfying C.

                                               instances of (S, Σ)

     Schema satisfying condition C
               (S, Σ)                =⇒
                                                                           ≥ GIC
                                                       RICI (p|Σ)

More formally,
    we look at the set of all possible values for information content

             POSS C (m) = {RICI (p | Σ) | I is an instance of (R, Σ),
                                          R has m attributes,
                                          (R, Σ) satisfies C},

    then GICC (m), is the infimum of POSS C (m).
                                                         Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 9/20
Price of Dependency Preservation

Design goal: minimizing redundancy while preserving FDs.

For a normal form NF, PRICE(NF ) is the minimum information content that
NF loses to guarantee dependency preservation.

    if c ∈ [0, 1] is the largest information content guaranteed for
    decompositions into NF,

                      NF-decomposition
                                    (R1 , Σ1 )


                                                                                   ≥c
                                                                RICI (p|Σ)
                     (R, Σ)          (R2 , Σ2 )


                                     (R3 , Σ3 )



    then price of dependency preservation, PRICE(NF ), is 1 − c.

                                                        Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 10/20
Price of Dependency Preservation

Theorem
                 = 1/2.
    PRICE(3NF)

                 ≥ 1/2 for any dependency-preserving normal form NF.
    PRICE(NF )


To pay the smallest price for achieving dependency preservation, we
should do a 3NF normalization.




                                                    Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 11/20
Price of Dependency Preservation

Theorem
                 = 1/2.
    PRICE(3NF)

                 ≥ 1/2 for any dependency-preserving normal form NF.
    PRICE(NF )


To pay the smallest price for achieving dependency preservation, we
should do a 3NF normalization.

Not all 3NF normalizations are equal:
    Special subclasses of 3NF exist (old research).
    Only one subclass (3NF+ ) achieves the smallest price.
    We compare normal forms based on guaranteed information content
    or highest redundancy they allow.




                                                      Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 11/20
Comparing Normal Forms

          For every m > 2:
Theorem


                                             21−m
                         GICAll (m)  =
                                             22−m
                         GIC3NF (m)  =
                         GIC3NF+ (m) =       1/2

    3NF is twice as good as doing nothing.
    3NF+ is exponentially better.
    similar results obtained if we compare normal forms based on
    guaranteed average information content.




                                                    Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 12/20
Redundancy of an Arbitrary Schema

    Normalizing into smaller relations is not always desirable.
       losing constraints.
       slowing down query answering.
    Normalization decision can be made based on
       how much redundancy the schema allows; or
       where in the spectrum of redundancy the schema lies; or
       the lowest information content found in all instances of the
       schema.

Design goal: decomposing the schema with the highest potential for
redundancy.




                                                      Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 13/20
Redundancy of an Arbitrary Schema

          Given an arbitrary schema R with FDs Σ, let
Theorem
    ΣA = {X | X → A, X is minimal and non-key};
    #HS = the number of hitting sets of ΣA ;
    l=|            X|.
            X∈ΣA

Then the smallest information content found in column A of instances is

                          GICR (A) = #HS · 2−l .
                             Σ




                                                        Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 14/20
Redundancy of an Arbitrary Schema

          Given an arbitrary schema R with FDs Σ, let
Theorem
    ΣA = {X | X → A, X is minimal and non-key};
    #HS = the number of hitting sets of ΣA ;
    l=|            X|.
            X∈ΣA

Then the smallest information content found in column A of instances is

                          GICR (A) = #HS · 2−l .
                             Σ



           R1 (A, B, C, D, E)        R2 (A, B, C, D, E)
           Σ1 = { AB → E,            Σ2 = { BC → E,
                   D → E}                    AC → E,
                                             BD → E}




                                                        Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 14/20
Redundancy of an Arbitrary Schema

          Given an arbitrary schema R with FDs Σ, let
Theorem
    ΣA = {X | X → A, X is minimal and non-key};
    #HS = the number of hitting sets of ΣA ;
    l=|            X|.
            X∈ΣA

Then the smallest information content found in column A of instances is

                             GICR (A) = #HS · 2−l .
                                Σ



           R1 (A, B, C, D, E)           R2 (A, B, C, D, E)
           Σ1 = { AB → E,               Σ2 = { BC → E,
                   D → E}                       AC → E,
                                                BD → E}
           GICR1 (E) =                  GICR2 (E) =
                         3                            1
                1                            2
              Σ                            Σ
                         8                            2



                                                          Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 14/20
Outline

Motivation.
Reducing redundancy in relational and XML data:
   Measure of redundancy.
   Redundancy analysis of normal forms and schemas.
Correcting functional dependency violations.
Conclusions.
Future work.




                                               Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 15/20
Functional Dependency Violations

Large databases often tend to violate a set of FDs.
Σ = {cnt, arCode → reg, cnt, reg → prov }
                       An inconsistent database
             name       cnt    prov reg arCode            phone
       t1    Smith     CAN      BC     Van      604      123 4567
       t2    Adams     CAN      BC     Van      604      765 4321
       t3 Simpson CAN           BC     Man      604      345 6789
       t4     Rice     CAN      AB      Vic     604      987 6543




                                                  Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 16/20
Functional Dependency Violations

Large databases often tend to violate a set of FDs.
Σ = {cnt, arCode → reg, cnt, reg → prov }
                       An inconsistent database
             name       cnt    prov reg arCode            phone
       t1    Smith     CAN      BC     Van      604      123 4567
       t2    Adams     CAN      BC     Van      604      765 4321
       t3 Simpson CAN           BC     Man      604      345 6789
       t4     Rice     CAN      AB      Vic     604      987 6543
                            A minimal repair
             name       cnt    prov reg arCode           phone
       t1    Smith     CAN     BC     Van    604        123 4567
       t2    Adams     CAN     BC     Van    604        765 4321
       t3   Simpson    CAN     BC     Van    604        345 6789
       t4               v1
              Rice              AB    Vic    604        987 6543


                                                  Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 16/20
Handling Inconsistent Databases

Integrity constraints Σ (FDs, keys, etc.).
Inconsistent database D: does not satisfy Σ.
We can produce a repair R by inserting/deleting tuples or modifying
values in D.
                     ∆(D, R) = number of modifications




                                                    Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 17/20
Handling Inconsistent Databases

Integrity constraints Σ (FDs, keys, etc.).
Inconsistent database D: does not satisfy Σ.
We can produce a repair R by inserting/deleting tuples or modifying
values in D.
                     ∆(D, R) = number of modifications
Handling inconsistency:
    Consistent query answering:
                                    {Q(R) | R is a minimal repair for D}
     certain answer for queryQ =
    Producing an optimum repair Ropt with minimum ∆.
Both approaches are intractable in general.




                                                    Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 17/20
Handling Inconsistent Databases

Integrity constraints Σ (FDs, keys, etc.).
Inconsistent database D: does not satisfy Σ.
We can produce a repair R by inserting/deleting tuples or modifying
values in D.
                     ∆(D, R) = number of modifications
Handling inconsistency:
    Consistent query answering:
                                    {Q(R) | R is a minimal repair for D}
     certain answer for queryQ =
    Producing an optimum repair Ropt with minimum ∆.
Both approaches are intractable in general.

Our approach: producing an approximate solution Rapp for optimum repair.
                     ∆(D, Rapp ) ≤ α · ∆(D, Ropt )



                                                    Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 17/20
Approximating Optimum Repair

Theorem. Finding an optimum solution for FD violations is NP-hard.

Theorem. Finding a constant-factor approximation for all FD violations is
NP-hard.

Theorem. For every fixed set of FDs, there is a polynomial-time algorithm
that approximates optimum repair within a factor of α, where α depends
on FDs.
                                              A    B     C         D          E

                                         t1   a1   b1    c1        d1         e1
                                         t2   a2   b1    c2        d2         e2
                                         t3   a1   b3    c3        d3         e3
                                         t4   a4   b4    c4        d4         e4
                                         t5   a5   b4    c5        d5         e5
                                         t6   a6   b6    c4        d5         e6
Σ = {A → C, B → C, CD → E}
                                                        Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 18/20
Conclusions

We analyze schemas and normal forms based on worst cases of
redundancy.
There is a spectrum of information content (redundancy) for
schemas.

           0                    1                           1
                                2
       poorly-designed                              well-designed

Producing optimum repair for FD violations is hard.
We introduced an approximation framework.




                                                 Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 19/20
Future Work

Comparing quality of schemas with low / high information content in
practice.
Defining normalization concepts for XML such as:
   dependency preserving decomposition.
Finding an equivalent of 3NF for XML as a normal form
   that guarantees an information content of 1 .
                                             2
   to which every XML document is decomposable.
Extending the repair algorithm for other integrity constraints.




                                                   Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 20/20

Mais conteúdo relacionado

Destaque

Functional dependencies and normalization
Functional dependencies and normalizationFunctional dependencies and normalization
Functional dependencies and normalization
daxesh chauhan
 
Functional dependencies and normalization for relational databases
Functional dependencies and normalization for relational databasesFunctional dependencies and normalization for relational databases
Functional dependencies and normalization for relational databases
Jafar Nesargi
 
7. Relational Database Design in DBMS
7. Relational Database Design in DBMS7. Relational Database Design in DBMS
7. Relational Database Design in DBMS
koolkampus
 

Destaque (14)

Normlaization
NormlaizationNormlaization
Normlaization
 
Fd & Normalization - Database Management System
Fd & Normalization - Database Management SystemFd & Normalization - Database Management System
Fd & Normalization - Database Management System
 
Normalization
NormalizationNormalization
Normalization
 
Functional dependencies and normalization
Functional dependencies and normalizationFunctional dependencies and normalization
Functional dependencies and normalization
 
FUNCTION DEPENDENCY AND TYPES & EXAMPLE
FUNCTION DEPENDENCY  AND TYPES & EXAMPLEFUNCTION DEPENDENCY  AND TYPES & EXAMPLE
FUNCTION DEPENDENCY AND TYPES & EXAMPLE
 
Functional dependencies and normalization for relational databases
Functional dependencies and normalization for relational databasesFunctional dependencies and normalization for relational databases
Functional dependencies and normalization for relational databases
 
Binary Search Tree
Binary Search TreeBinary Search Tree
Binary Search Tree
 
Binary Search Tree in Data Structure
Binary Search Tree in Data StructureBinary Search Tree in Data Structure
Binary Search Tree in Data Structure
 
Normalization
NormalizationNormalization
Normalization
 
Trees, Binary Search Tree, AVL Tree in Data Structures
Trees, Binary Search Tree, AVL Tree in Data Structures Trees, Binary Search Tree, AVL Tree in Data Structures
Trees, Binary Search Tree, AVL Tree in Data Structures
 
Database Normalization
Database NormalizationDatabase Normalization
Database Normalization
 
7. Relational Database Design in DBMS
7. Relational Database Design in DBMS7. Relational Database Design in DBMS
7. Relational Database Design in DBMS
 
DBMS - Normalization
DBMS - NormalizationDBMS - Normalization
DBMS - Normalization
 
Database Normalization 1NF, 2NF, 3NF, BCNF, 4NF, 5NF
Database Normalization 1NF, 2NF, 3NF, BCNF, 4NF, 5NFDatabase Normalization 1NF, 2NF, 3NF, BCNF, 4NF, 5NF
Database Normalization 1NF, 2NF, 3NF, BCNF, 4NF, 5NF
 

Semelhante a Somaz Kolahi : Functional Dependencies: Redundancy Analysis and Correcting Violations

SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"
SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"
SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"
Boris Glavic
 
Presentation on Graph Clustering (vldb 09)
Presentation on Graph Clustering (vldb 09)Presentation on Graph Clustering (vldb 09)
Presentation on Graph Clustering (vldb 09)
Waqas Nawaz
 
Top-N Recommendations from Implicit Feedback leveraging Linked Open Data
Top-N Recommendations from Implicit Feedback leveraging Linked Open DataTop-N Recommendations from Implicit Feedback leveraging Linked Open Data
Top-N Recommendations from Implicit Feedback leveraging Linked Open Data
Vito Ostuni
 
Collaborative Similarity Measure for Intra-Graph Clustering
Collaborative Similarity Measure for Intra-Graph ClusteringCollaborative Similarity Measure for Intra-Graph Clustering
Collaborative Similarity Measure for Intra-Graph Clustering
Waqas Nawaz
 

Semelhante a Somaz Kolahi : Functional Dependencies: Redundancy Analysis and Correcting Violations (20)

SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"
SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"
SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"
 
Role of Semantic Web in Health Informatics
Role of Semantic Web in Health InformaticsRole of Semantic Web in Health Informatics
Role of Semantic Web in Health Informatics
 
Presentation on Graph Clustering (vldb 09)
Presentation on Graph Clustering (vldb 09)Presentation on Graph Clustering (vldb 09)
Presentation on Graph Clustering (vldb 09)
 
Property graph vs. RDF Triplestore comparison in 2020
Property graph vs. RDF Triplestore comparison in 2020Property graph vs. RDF Triplestore comparison in 2020
Property graph vs. RDF Triplestore comparison in 2020
 
Challenges for a new era
Challenges for a new eraChallenges for a new era
Challenges for a new era
 
A Statistical and Schema Independent Approach to Identify Equivalent Properti...
A Statistical and Schema Independent Approach to Identify Equivalent Properti...A Statistical and Schema Independent Approach to Identify Equivalent Properti...
A Statistical and Schema Independent Approach to Identify Equivalent Properti...
 
Presentation at MTSR 2012
Presentation at MTSR 2012Presentation at MTSR 2012
Presentation at MTSR 2012
 
Sparql semantic information retrieval by
Sparql semantic information retrieval bySparql semantic information retrieval by
Sparql semantic information retrieval by
 
Top-N Recommendations from Implicit Feedback leveraging Linked Open Data
Top-N Recommendations from Implicit Feedback leveraging Linked Open DataTop-N Recommendations from Implicit Feedback leveraging Linked Open Data
Top-N Recommendations from Implicit Feedback leveraging Linked Open Data
 
Ways to Extract Variable Insights when Data is Scarse
Ways to Extract Variable Insights when Data is ScarseWays to Extract Variable Insights when Data is Scarse
Ways to Extract Variable Insights when Data is Scarse
 
Spatio textual similarity join
Spatio textual similarity joinSpatio textual similarity join
Spatio textual similarity join
 
10.1.1.70.8789
10.1.1.70.878910.1.1.70.8789
10.1.1.70.8789
 
Mapping of extensible markup language-to-ontology representation for effectiv...
Mapping of extensible markup language-to-ontology representation for effectiv...Mapping of extensible markup language-to-ontology representation for effectiv...
Mapping of extensible markup language-to-ontology representation for effectiv...
 
Effective Semantics for Engineering NLP Systems
Effective Semantics for Engineering NLP SystemsEffective Semantics for Engineering NLP Systems
Effective Semantics for Engineering NLP Systems
 
Relational Databases_RDF Graphs_and_Constraints.pdf
Relational Databases_RDF Graphs_and_Constraints.pdfRelational Databases_RDF Graphs_and_Constraints.pdf
Relational Databases_RDF Graphs_and_Constraints.pdf
 
Wi2015 - Clustering of Linked Open Data - the LODeX tool
Wi2015 - Clustering of Linked Open Data - the LODeX toolWi2015 - Clustering of Linked Open Data - the LODeX tool
Wi2015 - Clustering of Linked Open Data - the LODeX tool
 
Image-Based Literal Node Matching for Linked Data Integration
Image-Based Literal Node Matching for Linked Data IntegrationImage-Based Literal Node Matching for Linked Data Integration
Image-Based Literal Node Matching for Linked Data Integration
 
SPARQL: SEMANTIC INFORMATION RETRIEVAL BY EMBEDDING PREPOSITIONS
SPARQL: SEMANTIC INFORMATION RETRIEVAL BY EMBEDDING PREPOSITIONSSPARQL: SEMANTIC INFORMATION RETRIEVAL BY EMBEDDING PREPOSITIONS
SPARQL: SEMANTIC INFORMATION RETRIEVAL BY EMBEDDING PREPOSITIONS
 
Collaborative Similarity Measure for Intra-Graph Clustering
Collaborative Similarity Measure for Intra-Graph ClusteringCollaborative Similarity Measure for Intra-Graph Clustering
Collaborative Similarity Measure for Intra-Graph Clustering
 
Nicoletta Fornara and Fabio Marfia | Modeling and Enforcing Access Control Ob...
Nicoletta Fornara and Fabio Marfia | Modeling and Enforcing Access Control Ob...Nicoletta Fornara and Fabio Marfia | Modeling and Enforcing Access Control Ob...
Nicoletta Fornara and Fabio Marfia | Modeling and Enforcing Access Control Ob...
 

Mais de knowdiff

Mehdi Rezagholizadeh: Image Sensor Modeling: Color Measurement at Low Light L...
Mehdi Rezagholizadeh: Image Sensor Modeling: Color Measurement at Low Light L...Mehdi Rezagholizadeh: Image Sensor Modeling: Color Measurement at Low Light L...
Mehdi Rezagholizadeh: Image Sensor Modeling: Color Measurement at Low Light L...
knowdiff
 
Sara Afshar: Scheduling and Resource Sharing in Multiprocessor Real-Time Systems
Sara Afshar: Scheduling and Resource Sharing in Multiprocessor Real-Time SystemsSara Afshar: Scheduling and Resource Sharing in Multiprocessor Real-Time Systems
Sara Afshar: Scheduling and Resource Sharing in Multiprocessor Real-Time Systems
knowdiff
 
Seyed Mehdi mohaghegh: Modelling material use within the low carbon energy pa...
Seyed Mehdi mohaghegh: Modelling material use within the low carbon energy pa...Seyed Mehdi mohaghegh: Modelling material use within the low carbon energy pa...
Seyed Mehdi mohaghegh: Modelling material use within the low carbon energy pa...
knowdiff
 

Mais de knowdiff (17)

Ut talk feb 2017
Ut talk   feb 2017Ut talk   feb 2017
Ut talk feb 2017
 
Ali khalili: Towards an Open Linked Data-based Infrastructure for Studying Sc...
Ali khalili: Towards an Open Linked Data-based Infrastructure for Studying Sc...Ali khalili: Towards an Open Linked Data-based Infrastructure for Studying Sc...
Ali khalili: Towards an Open Linked Data-based Infrastructure for Studying Sc...
 
Scheduling for cloud systems with multi level data locality
Scheduling for cloud systems with multi level data localityScheduling for cloud systems with multi level data locality
Scheduling for cloud systems with multi level data locality
 
Amin Milani Fard: Directed Model Inference for Testing and Analysis of Web Ap...
Amin Milani Fard: Directed Model Inference for Testing and Analysis of Web Ap...Amin Milani Fard: Directed Model Inference for Testing and Analysis of Web Ap...
Amin Milani Fard: Directed Model Inference for Testing and Analysis of Web Ap...
 
Knowledge based economy and power of crowd sourcing
Knowledge based economy and power of crowd sourcing Knowledge based economy and power of crowd sourcing
Knowledge based economy and power of crowd sourcing
 
Amin tayyebi: Big Data and Land Use Change Science
Amin tayyebi: Big Data and Land Use Change ScienceAmin tayyebi: Big Data and Land Use Change Science
Amin tayyebi: Big Data and Land Use Change Science
 
Mehdi Rezagholizadeh: Image Sensor Modeling: Color Measurement at Low Light L...
Mehdi Rezagholizadeh: Image Sensor Modeling: Color Measurement at Low Light L...Mehdi Rezagholizadeh: Image Sensor Modeling: Color Measurement at Low Light L...
Mehdi Rezagholizadeh: Image Sensor Modeling: Color Measurement at Low Light L...
 
Sara Afshar: Scheduling and Resource Sharing in Multiprocessor Real-Time Systems
Sara Afshar: Scheduling and Resource Sharing in Multiprocessor Real-Time SystemsSara Afshar: Scheduling and Resource Sharing in Multiprocessor Real-Time Systems
Sara Afshar: Scheduling and Resource Sharing in Multiprocessor Real-Time Systems
 
Seyed Mehdi mohaghegh: Modelling material use within the low carbon energy pa...
Seyed Mehdi mohaghegh: Modelling material use within the low carbon energy pa...Seyed Mehdi mohaghegh: Modelling material use within the low carbon energy pa...
Seyed Mehdi mohaghegh: Modelling material use within the low carbon energy pa...
 
Narjess Afzaly: Model Your Problem with Graphs and Generate your objects
Narjess Afzaly: Model Your Problem with Graphs and Generate your objectsNarjess Afzaly: Model Your Problem with Graphs and Generate your objects
Narjess Afzaly: Model Your Problem with Graphs and Generate your objects
 
Computational methods applications in air pollution modeling (Dr. Yadghar)
Computational methods applications in air pollution modeling (Dr. Yadghar)Computational methods applications in air pollution modeling (Dr. Yadghar)
Computational methods applications in air pollution modeling (Dr. Yadghar)
 
Uncalibrated Image-Based Robotic Visual Servoing (knowdiff.net)
Uncalibrated Image-Based Robotic Visual Servoing (knowdiff.net)Uncalibrated Image-Based Robotic Visual Servoing (knowdiff.net)
Uncalibrated Image-Based Robotic Visual Servoing (knowdiff.net)
 
Knowdiff visiting lecturer 140 (Azad Shademan): Uncalibrated Image-Based Robo...
Knowdiff visiting lecturer 140 (Azad Shademan): Uncalibrated Image-Based Robo...Knowdiff visiting lecturer 140 (Azad Shademan): Uncalibrated Image-Based Robo...
Knowdiff visiting lecturer 140 (Azad Shademan): Uncalibrated Image-Based Robo...
 
Mehran Shaghaghi: Quantum Mechanics Dilemmas
Mehran Shaghaghi: Quantum Mechanics DilemmasMehran Shaghaghi: Quantum Mechanics Dilemmas
Mehran Shaghaghi: Quantum Mechanics Dilemmas
 
Hossein Taghavi : Codes on Graphs
Hossein Taghavi : Codes on GraphsHossein Taghavi : Codes on Graphs
Hossein Taghavi : Codes on Graphs
 
Dr. Amir Nejat
Dr. Amir NejatDr. Amir Nejat
Dr. Amir Nejat
 
Alborz
AlborzAlborz
Alborz
 

Último

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Último (20)

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 

Somaz Kolahi : Functional Dependencies: Redundancy Analysis and Correcting Violations

  • 1. Functional Dependencies: Redundancy Analysis and Correcting Violations Solmaz Kolahi solmaz@cs.ubc.ca Postdoctoral Research Fellow Department of Computer Science University of British Columbia Joint work with Leonid Libkin and Laks Lakshmanan Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 1/20
  • 2. Motivation Both relational and XML databases may store redundant data: title director actor year The Departed Scorsese DiCaprio 2006 The Departed Scorsese Nicholson 2006 Functional Dependency: Shrek the Third Miller Myers 2007 Shrek the Third Miller Murphy 2007 title → year Shrek the Third Hui Myers 2007 Shrek the Third Hui Murphy 2007 Functional Dependency: @AreaCode → @City AreaCode AreaCode AreaCode AreaCode 416 416 416 416 City City City City Toronto Toronto Toronto Toronto Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 2/20
  • 3. Motivation Normalization techniques try to remove redundancies: BCNF eliminates all redundancies. only key dependencies are allowed. cannot always be achieved without losing dependencies. AB → C C→B R(A, B, C) 3NF eliminates some redundancies. allows redundancy on prime attributes. preserves dependencies. XNF eliminates all redundancies w.r.t. XML functional dependencies. only XML keys are allowed: if X → p.@l, then X → p. introduced by Arenas & Libkin in 2002. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 3/20
  • 4. Motivation Traditional normalization theory characterizes a database as redundant or non-redundant. does not measure redundancy. cannot provide guidelines to reduce redundancy. The more redundant the data, the more prone to update anomalies. Our goal is to show that there is a spectrum of redundancy using an information-theoretic tool. to choose database designs with low redundancy. to handle databases with dependency violations. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 4/20
  • 5. Outline Motivation. Reducing redundancy in relational and XML data: Measure of redundancy. Redundancy analysis of normal forms and schemas. Correcting functional dependency violations. Conclusions. Future work. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 5/20
  • 6. Measure of Information Content Proposed by Arenas & Libkin in 2003. Used to measure the redundancy of a data value in a database instance with respect to a set of constraints. Intuitively, RICI (p|Σ) measures the relative information content of position p in instance I w.r.t. constraints Σ. Independent of data models and query languages. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 6/20
  • 7. Measure of Information Content Proposed by Arenas & Libkin in 2003. Used to measure the redundancy of a data value in a database instance with respect to a set of constraints. Intuitively, RICI (p|Σ) measures the relative information content of position p in instance I w.r.t. constraints Σ. Independent of data models and query languages. Σ = {A → C} A B C D RICI (P |Σ) 1 2 3 4 0.875 1 2 3 5 Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 6/20
  • 8. Measure of Information Content Proposed by Arenas & Libkin in 2003. Used to measure the redundancy of a data value in a database instance with respect to a set of constraints. Intuitively, RICI (p|Σ) measures the relative information content of position p in instance I w.r.t. constraints Σ. Independent of data models and query languages. Σ = {A → C} A B C D RICI (P |Σ) 1 2 3 4 0.875 1 2 3 5 0.781 1 2 3 6 Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 6/20
  • 9. Measure of Information Content Proposed by Arenas & Libkin in 2003. Used to measure the redundancy of a data value in a database instance with respect to a set of constraints. Intuitively, RICI (p|Σ) measures the relative information content of position p in instance I w.r.t. constraints Σ. Independent of data models and query languages. Σ = {A → C} A B C D RICI (P |Σ) 1 2 3 4 0.875 1 2 3 5 0.781 1 2 3 6 0.711 1 2 3 7 Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 6/20
  • 10. Measure of Information Content Proposed by Arenas & Libkin in 2003. Used to measure the redundancy of a data value in a database instance with respect to a set of constraints. Intuitively, RICI (p|Σ) measures the relative information content of position p in instance I w.r.t. constraints Σ. Independent of data models and query languages. Σ = {A → C} A B C D RICI (P |Σ) 1 2 3 4 0.875 1 2 3 5 0.781 1 2 3 6 0.711 1 2 3 7 0.658 1 2 3 8 Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 6/20
  • 11. Measure of Information Content Proposed by Arenas & Libkin in 2003. Used to measure the redundancy of a data value in a database instance with respect to a set of constraints. Intuitively, RICI (p|Σ) measures the relative information content of position p in instance I w.r.t. constraints Σ. Independent of data models and query languages. Σ = {A → C} Σ = {A → C, B → C} A B C D RICI (P |Σ) RICI (P |Σ) 1 2 3 4 0.875 0.781 1 2 3 5 0.781 0.629 1 2 3 6 0.711 0.522 1 2 3 7 0.658 0.446 1 2 3 8 Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 6/20
  • 12. Measure of Information Content A B C 1 2 3 R(A, B, C) Σ = {A → B} 1 2 4 Pick k such that adom(I) ⊆ {1, . . . , k} (k = 7). For every X ⊆ Pos(I) − {p} compute probability distribution P (a|X) for every a ∈ {1, . . . , k}. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 7/20
  • 13. Measure of Information Content A B C 2 3 R(A, B, C) Σ = {A → B} 1 2 Pick k such that adom(I) ⊆ {1, . . . , k} (k = 7). For every X ⊆ Pos(I) − {p} compute probability distribution P (a|X) for every a ∈ {1, . . . , k}. P (2|X) = Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 7/20
  • 14. Measure of Information Content A B C 1 2 3 R(A, B, C) Σ = {A → B} 1 2 1 Pick k such that adom(I) ⊆ {1, . . . , k} (k = 7). For every X ⊆ Pos(I) − {p} compute probability distribution P (a|X) for every a ∈ {1, . . . , k}. P (2|X) = Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 7/20
  • 15. Measure of Information Content A B C 4 2 3 R(A, B, C) Σ = {A → B} 1 2 7 Pick k such that adom(I) ⊆ {1, . . . , k} (k = 7). For every X ⊆ Pos(I) − {p} compute probability distribution P (a|X) for every a ∈ {1, . . . , k}. P (2|X) = Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 7/20
  • 16. Measure of Information Content A B C 4 2 3 R(A, B, C) Σ = {A → B} 1 2 7 Pick k such that adom(I) ⊆ {1, . . . , k} (k = 7). For every X ⊆ Pos(I) − {p} compute probability distribution P (a|X) for every a ∈ {1, . . . , k}. P (2|X) = 48/ Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 7/20
  • 17. Measure of Information Content A B C a 3 R(A, B, C) Σ = {A → B} 1 2 Pick k such that adom(I) ⊆ {1, . . . , k} (k = 7). For every X ⊆ Pos(I) − {p} compute probability distribution P (a|X) for every a ∈ {1, . . . , k}. P (2|X) = 48/(48 + 6 × 42) = 0.16 P (a|X) = 42/(48 + 6 × 42) = 0.14 for every a = 2 Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 7/20
  • 18. Measure of Information Content A B C 1 2 3 R(A, B, C) Σ = {A → B} 1 2 4 Pick k such that adom(I) ⊆ {1, . . . , k} (k = 7). For every X ⊆ Pos(I) − {p} compute probability distribution P (a|X) for every a ∈ {1, . . . , k}. P (2|X) = 48/(48 + 6 × 42) = 0.16 P (a|X) = 42/(48 + 6 × 42) = 0.14 for every a = 2 Conditional entropy : 2.8057 Average over all possible X: RICk = 2.4558 I Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 7/20
  • 19. Measure of Information Content A B C 1 2 3 R(A, B, C) Σ = {A → B} 1 2 4 Pick k such that adom(I) ⊆ {1, . . . , k} (k = 7). For every X ⊆ Pos(I) − {p} compute probability distribution P (a|X) for every a ∈ {1, . . . , k}. P (2|X) = 48/(48 + 6 × 42) = 0.16 P (a|X) = 42/(48 + 6 × 42) = 0.14 for every a = 2 Conditional entropy : 2.8057 Average over all possible X: RICk = 2.4558 I RICk (p| Σ) I RICI (p|Σ) = lim = 0.875 log k k→∞ Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 7/20
  • 20. Measure and Database Design A schema S with constraints Σ is well-designed if for every instance I of (S, Σ) and every position p in I RICI (p|Σ) = 1. Known results (Arenas & Libkin, 2003): relational databases with FDs: (S, Σ) is well-designed iff it is in BCNF. XML documents with FDs: (S, Σ) is well-designed iff it is in XNF. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 8/20
  • 21. Measure and Database Design A schema S with constraints Σ is well-designed if for every instance I of (S, Σ) and every position p in I RICI (p|Σ) = 1. Known results (Arenas & Libkin, 2003): relational databases with FDs: (S, Σ) is well-designed iff it is in BCNF. XML documents with FDs: (S, Σ) is well-designed iff it is in XNF. Well-designed databases cannot always be achieved: Performance issues. Dependency preservation. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 8/20
  • 22. Measure and Database Design A schema S with constraints Σ is well-designed if for every instance I of (S, Σ) and every position p in I RICI (p|Σ) = 1. Known results (Arenas & Libkin, 2003): relational databases with FDs: (S, Σ) is well-designed iff it is in BCNF. XML documents with FDs: (S, Σ) is well-designed iff it is in XNF. Well-designed databases cannot always be achieved: Performance issues. Dependency preservation. General design goal: maximizing information content to the possible extent by enforcing some design conditions. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 8/20
  • 23. Guaranteed Information Content Given a condition C, guaranteed information content (GIC) is the smallest information content found in instances of schemas satisfying C. instances of (S, Σ) Schema satisfying condition C (S, Σ) =⇒ ≥ GIC RICI (p|Σ) More formally, we look at the set of all possible values for information content POSS C (m) = {RICI (p | Σ) | I is an instance of (R, Σ), R has m attributes, (R, Σ) satisfies C}, then GICC (m), is the infimum of POSS C (m). Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 9/20
  • 24. Price of Dependency Preservation Design goal: minimizing redundancy while preserving FDs. For a normal form NF, PRICE(NF ) is the minimum information content that NF loses to guarantee dependency preservation. if c ∈ [0, 1] is the largest information content guaranteed for decompositions into NF, NF-decomposition (R1 , Σ1 ) ≥c RICI (p|Σ) (R, Σ) (R2 , Σ2 ) (R3 , Σ3 ) then price of dependency preservation, PRICE(NF ), is 1 − c. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 10/20
  • 25. Price of Dependency Preservation Theorem = 1/2. PRICE(3NF) ≥ 1/2 for any dependency-preserving normal form NF. PRICE(NF ) To pay the smallest price for achieving dependency preservation, we should do a 3NF normalization. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 11/20
  • 26. Price of Dependency Preservation Theorem = 1/2. PRICE(3NF) ≥ 1/2 for any dependency-preserving normal form NF. PRICE(NF ) To pay the smallest price for achieving dependency preservation, we should do a 3NF normalization. Not all 3NF normalizations are equal: Special subclasses of 3NF exist (old research). Only one subclass (3NF+ ) achieves the smallest price. We compare normal forms based on guaranteed information content or highest redundancy they allow. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 11/20
  • 27. Comparing Normal Forms For every m > 2: Theorem 21−m GICAll (m) = 22−m GIC3NF (m) = GIC3NF+ (m) = 1/2 3NF is twice as good as doing nothing. 3NF+ is exponentially better. similar results obtained if we compare normal forms based on guaranteed average information content. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 12/20
  • 28. Redundancy of an Arbitrary Schema Normalizing into smaller relations is not always desirable. losing constraints. slowing down query answering. Normalization decision can be made based on how much redundancy the schema allows; or where in the spectrum of redundancy the schema lies; or the lowest information content found in all instances of the schema. Design goal: decomposing the schema with the highest potential for redundancy. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 13/20
  • 29. Redundancy of an Arbitrary Schema Given an arbitrary schema R with FDs Σ, let Theorem ΣA = {X | X → A, X is minimal and non-key}; #HS = the number of hitting sets of ΣA ; l=| X|. X∈ΣA Then the smallest information content found in column A of instances is GICR (A) = #HS · 2−l . Σ Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 14/20
  • 30. Redundancy of an Arbitrary Schema Given an arbitrary schema R with FDs Σ, let Theorem ΣA = {X | X → A, X is minimal and non-key}; #HS = the number of hitting sets of ΣA ; l=| X|. X∈ΣA Then the smallest information content found in column A of instances is GICR (A) = #HS · 2−l . Σ R1 (A, B, C, D, E) R2 (A, B, C, D, E) Σ1 = { AB → E, Σ2 = { BC → E, D → E} AC → E, BD → E} Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 14/20
  • 31. Redundancy of an Arbitrary Schema Given an arbitrary schema R with FDs Σ, let Theorem ΣA = {X | X → A, X is minimal and non-key}; #HS = the number of hitting sets of ΣA ; l=| X|. X∈ΣA Then the smallest information content found in column A of instances is GICR (A) = #HS · 2−l . Σ R1 (A, B, C, D, E) R2 (A, B, C, D, E) Σ1 = { AB → E, Σ2 = { BC → E, D → E} AC → E, BD → E} GICR1 (E) = GICR2 (E) = 3 1 1 2 Σ Σ 8 2 Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 14/20
  • 32. Outline Motivation. Reducing redundancy in relational and XML data: Measure of redundancy. Redundancy analysis of normal forms and schemas. Correcting functional dependency violations. Conclusions. Future work. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 15/20
  • 33. Functional Dependency Violations Large databases often tend to violate a set of FDs. Σ = {cnt, arCode → reg, cnt, reg → prov } An inconsistent database name cnt prov reg arCode phone t1 Smith CAN BC Van 604 123 4567 t2 Adams CAN BC Van 604 765 4321 t3 Simpson CAN BC Man 604 345 6789 t4 Rice CAN AB Vic 604 987 6543 Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 16/20
  • 34. Functional Dependency Violations Large databases often tend to violate a set of FDs. Σ = {cnt, arCode → reg, cnt, reg → prov } An inconsistent database name cnt prov reg arCode phone t1 Smith CAN BC Van 604 123 4567 t2 Adams CAN BC Van 604 765 4321 t3 Simpson CAN BC Man 604 345 6789 t4 Rice CAN AB Vic 604 987 6543 A minimal repair name cnt prov reg arCode phone t1 Smith CAN BC Van 604 123 4567 t2 Adams CAN BC Van 604 765 4321 t3 Simpson CAN BC Van 604 345 6789 t4 v1 Rice AB Vic 604 987 6543 Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 16/20
  • 35. Handling Inconsistent Databases Integrity constraints Σ (FDs, keys, etc.). Inconsistent database D: does not satisfy Σ. We can produce a repair R by inserting/deleting tuples or modifying values in D. ∆(D, R) = number of modifications Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 17/20
  • 36. Handling Inconsistent Databases Integrity constraints Σ (FDs, keys, etc.). Inconsistent database D: does not satisfy Σ. We can produce a repair R by inserting/deleting tuples or modifying values in D. ∆(D, R) = number of modifications Handling inconsistency: Consistent query answering: {Q(R) | R is a minimal repair for D} certain answer for queryQ = Producing an optimum repair Ropt with minimum ∆. Both approaches are intractable in general. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 17/20
  • 37. Handling Inconsistent Databases Integrity constraints Σ (FDs, keys, etc.). Inconsistent database D: does not satisfy Σ. We can produce a repair R by inserting/deleting tuples or modifying values in D. ∆(D, R) = number of modifications Handling inconsistency: Consistent query answering: {Q(R) | R is a minimal repair for D} certain answer for queryQ = Producing an optimum repair Ropt with minimum ∆. Both approaches are intractable in general. Our approach: producing an approximate solution Rapp for optimum repair. ∆(D, Rapp ) ≤ α · ∆(D, Ropt ) Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 17/20
  • 38. Approximating Optimum Repair Theorem. Finding an optimum solution for FD violations is NP-hard. Theorem. Finding a constant-factor approximation for all FD violations is NP-hard. Theorem. For every fixed set of FDs, there is a polynomial-time algorithm that approximates optimum repair within a factor of α, where α depends on FDs. A B C D E t1 a1 b1 c1 d1 e1 t2 a2 b1 c2 d2 e2 t3 a1 b3 c3 d3 e3 t4 a4 b4 c4 d4 e4 t5 a5 b4 c5 d5 e5 t6 a6 b6 c4 d5 e6 Σ = {A → C, B → C, CD → E} Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 18/20
  • 39. Conclusions We analyze schemas and normal forms based on worst cases of redundancy. There is a spectrum of information content (redundancy) for schemas. 0 1 1 2 poorly-designed well-designed Producing optimum repair for FD violations is hard. We introduced an approximation framework. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 19/20
  • 40. Future Work Comparing quality of schemas with low / high information content in practice. Defining normalization concepts for XML such as: dependency preserving decomposition. Finding an equivalent of 3NF for XML as a normal form that guarantees an information content of 1 . 2 to which every XML document is decomposable. Extending the repair algorithm for other integrity constraints. Solmaz Kolahi, Sharif U. of Tech., December 2008 – p. 20/20