SlideShare uma empresa Scribd logo
1 de 63
Baixar para ler offline
View-based Query Processing over Semistructured Data
                  Diego Calvanese
          Free University of Bozen-Bolzano


                 View-based Query Processing
    Diego Calvanese, Giuseppe De Giacomo, Georg Gottlob,
              Maurizio Lenzerini, Riccardo Rosati




      Corso di Dottorato – Dottorato in Ingegneria Informatica
                University of Rome “La Sapienza”
                     September–October 2005
VBQP over Semistructured Data – Outline


 1. The semistructured data model and regular path queries (RPQs)

 2. Containment of RPQs and 2RPQs

 3. View-based query rewriting for RPQs and 2RPQs

 4. View-based query answering for RPQs and 2RPQs via automata

 5. View-based query answering for RPQs and 2RPQs via CSP




D. Calvanese            View-based Query Processing over Semistructured Data   1
VBQP over Semistructured Data – Outline


⇒ The semistructured data model and regular path queries (RPQs)

 2. Containment of RPQs and 2RPQs

 3. View-based query rewriting for RPQs and 2RPQs

 4. View-based query answering for RPQs and 2RPQs via automata

 5. View-based query answering for RPQs and 2RPQs via CSP




D. Calvanese            View-based Query Processing over Semistructured Data   2
Semistructured data
Semistructured data (SSD) are an abstraction for data on the web, structured
documents, XML:
   • A (semistructured) database (DB) is a (finite) edge-labeled graph
                                                                 bib

                                                               o1
                                                                              ...                         ...
                                             article      book                      article         book

                       reference                   reference           reference               reference
                                       o15                     o37                      o42                        o58
                                                               ...                                                       title
                        author
                                   title
                                           ...                 ...        author
                                                                                     author
                                                                                                           title


                         o68           o75       ...                         o64                    o71            o83
                                                                                                                             o95

                                                                                                                          "XPath ..."
                    "Victor Vianu" "Regular ..."                          "Tova Milo"                     "Type Inference ..."
                                                                                   firstname               lastname


                                                                                              o52         o53

                                                                                          "Dan" "Suciu"



   • In some cases there are restrictions on the structure of the graph,
     e.g., in XML the graph has to be a tree

D. Calvanese                        View-based Query Processing over Semistructured Data                                                3
Formalization of semistructured databases
Definition:
 • The schema of a DB is a relational alphabet Σ of binary predicates (one
   for each edge label).
 • A SSDB is a set of binary relations.

Note that we do not allow for constraints over the relations in a SSDB.
                                                      bib
Example:                                                                                                                      Σ = {bib, article,
                                                    o1
                                                                    ...                         ...
                                  article      book                       article         book
                                                                                                                                    book, reference,
            reference
                            o15
                                        reference
                                                    o37
                                                             reference
                                                                              o42
                                                                                     reference
                                                                                                         o58
                                                                                                                                    title, author, . . .}
                                                    ...                                                        title
               author
                        title
                                ...                 ...          author
                                                                           author
                                                                                                 title


               o68          o75       ...                          o64                    o71            o83
                                                                                                                   o95

                                                                                                                "XPath ..."
         "Victor Vianu" "Regular ..."                           "Tova Milo"                     "Type Inference ..."
                                                                         firstname               lastname


                                                                                    o52         o53

                                                                                "Dan" "Suciu"




D. Calvanese                                              View-based Query Processing over Semistructured Data                                              4
Queries over SSD
Queries over SSD are typically constituted by two parts:
 • selection part: selects (tuples of) nodes that satisfy some condition
 • restructuring part: reorganizes the selected nodes into a graph (or tree)

In this course we deal with the selection part only.


Queries must provide the ability to “navigate” the graph structure to relate pairs
of nodes ; must contain some form of recursion:

   • Datalog: provides a very expressive form of recursion

   • XPath: descendant/ancestor axes refer to successor/predecessor nodes at
     arbitrary depth in the tree – rather restricted form of recursion

   • reflexive transitive closure provides a good tradeoff


D. Calvanese              View-based Query Processing over Semistructured Data   5
Path queries
Are the basic element of all proposals for query languages over SSD.
Definition: A path query Q has the form

                                   Q(x, y) ← x L y

where L is a language over the alphabet Σ of binary DB predicates.


Recall that a DB B is a set of (binary) relations over Σ, or equivalently a graph
whose edges are labeled with elements of Σ.


Definition: The answer Q(B) to Q over B is the set of pairs of nodes (a, b)
                            p1      pk
such that there is a path a → · · · → b in B, with p1 · · · pk ∈ L.


Notable example: regular path queries (RPQs), in which L is a regular
language over Σ

D. Calvanese              View-based Query Processing over Semistructured Data      6
Regular path queries
In an RPQ, we can specify the regular language through a regular expression.


Example: DB alphabet:                                Σ = {bib, article, book, reference, title, . . .}
Query:         Q(x, y) ← x ((article + book)·reference∗·title) y

Consider the DB B over Σ:
                                                               bib
                                                                                                                                      Q(B)
                                                             o1
                                                                            ...                         ...
                                           article      book                      article         book
                                                                                                                                      o1 o75
                     reference                   reference           reference               reference                                o1 o83
                                     o15                     o37                      o42                        o58
                                                             ...                                                       title          o1 o95
                      author             ...                 ...        author                           title
                                 title                                             author                                              .
                                                                                                                                       .  .
                                                                                                                                          .
                                               ...                                                                         o95         .  .
                       o68           o75                                   o64                    o71            o83
                                                                                                                        "XPath ..."
                  "Victor Vianu" "Regular ..."                          "Tova Milo"                     "Type Inference ..."
                                                                                 firstname               lastname


                                                                                            o52         o53

                                                                                        "Dan" "Suciu"




D. Calvanese                                     View-based Query Processing over Semistructured Data                                          7
Regular path queries – Observations

Expressive power of RPQs:

   • Not expressible in first-order logic

   • Are a fragment of transitive-closure logic

   • Are a fragment of binary Datalog
      – Concatenation: P (x, y) ← E1 (x, z), E2 (z, y)
      – Union: P (x, y) ← E1 (x, y)
                P (x, y) ← E2 (x, y)
        – Reflexive-transitive closure: P (x, y) ← E(x, y)
                                       P (x, y) ← E(x, z), P (z, y)




D. Calvanese               View-based Query Processing over Semistructured Data   8
VBQP over Semistructured Data – Outline

 √
       The semistructured data model and regular path queries (RPQs)

⇒ Containment of RPQs and 2RPQs

 3. View-based query rewriting for RPQs and 2RPQs

 4. View-based query answering for RPQs and 2RPQs via automata

 5. View-based query answering for RPQs and 2RPQs via CSP




D. Calvanese              View-based Query Processing over Semistructured Data   9
Path query containment
Given Q1 (x, y) ← x L1 y
               Q2 (x, y) ← x L2 y
check whether Q1 ⊑ Q2 , i.e., for every DB B, we have Q1 (B) ⊆ Q2 (B).


Language-Theoretic Lemma 1:                      Q1 ⊑ Q2                iff L1 ⊆ L2
                                                 p1           pk
Proof: “Only if”: Consider a DB             a → · · · → b with p1 · · · pk ∈ L1 and
p1 · · · pk ∈ L2 .
            /
                                                                             p1     pk
“If”: If (a, b) ∈ Q1 (B), then B contains a path a → · · · → b with
p1 · · · pk ∈ L1 . But then p1 · · · pk ∈ L2 , and (a, b) ∈ Q2 (B).


Corollary: Path query containment is
 • undecidable for context-free path queries
 • PSPACE-complete for regular path queries [Stockmeyer 1973]

D. Calvanese                 View-based Query Processing over Semistructured Data        10
Containment of RPQs
Via language containment. We exploit that L1 ⊆ L2 iff L1 − L2 = ∅.


Algorithm for checking whether L(E1 ) ⊆ L(E2 ) (for regular expr. E1 , E2 )
 1. Construct NFAs Ai such that L(Ai) = L(Ei)
    ; linear blowup
 2. Construct NFA A2 such that L(A2 ) = Σ∗ − L(A2 )
    ; exponential blowup
 3. Construct A = A1 × A2 such that L(A) = L(E1 ) − L(E2 )
    ; quadratic blowup
 4. Check whether there is a path from the initial state to a final state in A
    ; NLOGSPACE


Theorem: Containment of RPQs is in PSPACE, and hence PSPACE-complete.

D. Calvanese              View-based Query Processing over Semistructured Data   11
Two-way regular path queries (2RPQs)
   • Provide the ability to navigate DB edges in both directions.
     Allow one to capture, e.g., the predecessor axis of XPath.
   • We introduce an extended alphabet Σ± = Σ ∪ Σ−, where
     Σ− = {p− | p ∈ Σ}.
   • We call the elements of Σ− inverse edge labels.

Definition: A two-way regular path query over Σ has the form

                                    Q(x, y) ← x E y

with E a regular expression over the extended alphabet Σ±.

Example:       Q2 (x, y) ← x (article·(reference + reference−)∗·title) y


Note: the edges of the DB are still labeled with elements of Σ only.

D. Calvanese               View-based Query Processing over Semistructured Data   12
Semantics of two-way regular path queries
   • Consider a 2RPQ Q(x, y) ← x E y and a DB B over Σ.
                                                  r1                         rk
   • A semi-path a1 → a2 · · · ak → ak+1 in B is a sequence of nodes with:
                  pi
      – either ai → ai+1 in B, and ri = pi,
                 pi
      – or ai+1 → ai in B, and ri = p−.
                                      i

   • The answer Q(B) to Q over B is the set of pairs of nodes (a, b) s.t. there
                      r1      rk
     is a semi-path a → · · · → b in B, with r1 · · · rk ∈ L(E).

Example:                  Query Q2 (x, y) ← x (article·(reference + reference−)∗·title) y
Database B:                                                   bib
                                                                                                                                Q2 (B)
                                                           o1
                                                                           ...                    ...
                                        article        book                      article     book                               o1 o75
                  reference
                                  o15
                                              reference
                                                          o37
                                                                    reference
                                                                                     o42
                                                                                           reference
                                                                                                           o58
                                                                                                                                o1 o83
                                                          ...                                                    title
                                                                                                                                o1 o95
                   author
                              title
                                      ...                 ...          author
                                                                                 author
                                                                                                   title
                                                                                                                                 .
                                                                                                                                 .  .
                                                                                                                                    .
                    o68           o75       ...                           o64               o71            o83
                                                                                                                     o95
                                                                                                                                 .  .
                                                                                                                  "XPath ..."
               "Victor Vianu" "Regular ..."                            "Tova Milo"                "Type Inference ..."




D. Calvanese                                              View-based Query Processing over Semistructured Data                           13
Two-way regular path queries – Observations
Language-Theoretic Lemma 1 does not hold anymore.


Reason: sequences of direct and inverse edge labels may be “folded” away.


Example:        Σ = {P }
                Q1 (x, y) ← x P y
                Q2 (x, y) ← x P ·P −·P y
We have that:
                                               P
   • Q1 ⊑ Q2 : consider any path a → b in a DB B

   • but L(P )      L(P ·P −·P ).




D. Calvanese               View-based Query Processing over Semistructured Data   14
Foldings
Definition: Let u, v be words over Σ±. We say that v folds onto u, denoted
v ; u, if we can transform v into u by repeatedly:
  • replacing each occurrence in v of p·p−·p with p, and
  • replacing each occurrence in v of p−·p·p− with p−.

Example:     rss−st ; rst
              r   s   s   s  t   r   s   t
Pictorially: → · → · ← · → · → ; → · → · →


Definition: Let E be a RE over Σ±.
Then fold (E) = {v | v ; u, for some u ∈ L(E)}


The notion of folding allows us to reduce containment of 2RPQs to a
language-theoretic problem.


D. Calvanese            View-based Query Processing over Semistructured Data   15
Containment of 2RPQs
Consider two 2RPQs     Q1 (x, y) ← x E1 y
                       Q2 (x, y) ← x E2 y


Language-Theoretic Lemma 2:                 Q1 ⊑ Q2                iff L(E1 ) ⊆ fold (E2 )

                                                          r1           rk
Proof: by considering simple semi-paths a → · · · → b in a DB, where
r1 · · · rk ∈ L(E1 ).


To decide L(E1 ) ⊆ fold (E2 ) we resort to two-way automata on words.




D. Calvanese            View-based Query Processing over Semistructured Data                 16
Two-way automata on words (2NFA)
A 2NFA is similar to a standard one-way automaton (1NFA)

                                A = (Σ, S, S0 , δ, F )

but the transition function δ : S × Σ → 2S×{−1,0,1} maps each state to a set of
pairs
 • new state
 • moving direction (left, don’t move, or right)

Theorem [Rabin&Scott, Shepherdson 1959]: 2NFAs accept regular languages
Given a 2NFA A with n states, one can construct a 1NFA with O(2n log n) states
accepting L(A).




D. Calvanese              View-based Query Processing over Semistructured Data   17
2NFAs and foldings
                                                ˜
Theorem: Let E be a RE over Σ±. There is a 2NFA AE such that
       ˜
   • L(AE ) = fold (E)
                             ˜
   • The number of states of AE is linear in the size of E


                        ˜
In the construction of AE we exploit the fact that 2NFAs can move backwards
on a word. E.g., to fold pp−p onto p, the 2NFA:

 1. Moves forward on p.

 2. Makes a step backward and expects to see p while staying in place (this
    corresponds to moving according to p−, i.e., backward on p).

 3. Moves forward again on p.




D. Calvanese              View-based Query Processing over Semistructured Data   18
2NFAs and foldings – Example
                                                                                        ∗
Regular expression over Σ±:            E = r·(p + q)·p−·p·q −

Word in L(E) viewed as a path in a DB:                              r           p            q
                                                           d1                                          d2
                                                                r         p         q−             $



1NFA that accepts L(E)                            p
                          s0       r      s1              s2        p−    s3        p       s4              q−
                                                  q


2NFA that accepts fold (E)
                                   p, 1          −
                 s0    r, 1 s                s2 p , 1 s3 p, 1 s4
                             1                                                              q−, 1
                                q, 1                               ∗, 1
                                    ∗, −1                  p, 0 ···
                 ···        ···                                       sf
                                                    s←
                                                     2                                      ∗, 1



D. Calvanese             View-based Query Processing over Semistructured Data                                    19
2NFAs and foldings – Construction
Let E be a RE over Σ± and A = (Σ±, S, S0 , δ, F ) a 1NFA with
L(A) = L(E).
We construct the 2NFA
      ˜
      AE = (Σ± ∪ {$},          S ∪ {sf } ∪ {s← | s ∈ S},                            S0 ,   δA ,   {sf })

where δA is defined as follows:
   • (s←, −1) ∈ δA(s, ℓ), for each s ∈ S and ℓ ∈ Σ± ∪ {$}

   • (s2 , 1) ∈ δA(s1 , r) and (s2 , 0) ∈ δA(s←, r −), for each transition s2 ∈ δ(s1 , r)
                                              1
     of E .

   • (sf , 1) ∈ δA(s, ℓ), for each s ∈ F and ℓ ∈ Σ± ∪ {$}


We have that:      w ∈ fold (E)                        ˜
                                            iff w$ ∈ L(AE )
                                                       ˜
(We can also get rid of the $ at the end of words in L(AE ).)

D. Calvanese                 View-based Query Processing over Semistructured Data                          20
Containment of 2RPQs (cont’d)
Consider two 2RPQs     Q1 (x, y) ← x E1 y
                       Q2 (x, y) ← x E2 y


Summing what we have seen till now, we have that

                                Q1 ⊑ Q2                               iff
                          L(E1 ) ⊆ fold (E2 )                         iff
                                      ˜
                          L(E1 ) ⊆ L(AE2 )



                    ˜
To check L(E1 ) ⊆ L(AE2 ) we have to look into the transformation of 2NFAs
to 1NFAs.




D. Calvanese            View-based Query Processing over Semistructured Data   21
Transforming 2NFAs to 1NFAs
Theorem [Vardi 1988]: Let A = (Σ, S, S0 , δ, F ) be a 2NFA.
There is a 1NFA Ac such that
 • Ac accepts the complement of A, i.e., L(Ac) = Σ∗ − L(A)
 • Ac is exponential in A, i.e., ||Ac|| is 2O(||A||)

Proof: guess a subset-sequence counterexample.
a0 · · · ak−1 ∈ L(A) iff there is a sequence T0 , . . . , Tk of subsets of S such
              /
that:
  • S0 ⊆ T0 and Tk ∩ F = ∅
  • If s ∈ Ti and (t, +1) ∈ δ(s, ai), then t ∈ Ti+1 , for 0 ≤ i < k
  • If s ∈ Ti and (t, 0) ∈ δ(s, ai), then t ∈ Ti, for 0 ≤ i < k
  • If s ∈ Ti and (t, −1) ∈ δ(s, ai), then t ∈ Ti−1 , for 0 ≤ i < k

The 1NFA Ac guesses such a seQuence T0 , . . . , Tk and checks that it
satisfies the 4 conditions above.

D. Calvanese              View-based Query Processing over Semistructured Data      22
Containment of 2RPQs (finally done!)
Consider two 2RPQs   Q1 (x, y) ← x E1 y
                     Q2 (x, y) ← x E2 y


Summing up, we have that

                            Q1 ⊑ Q2                                     iff
                      L(E1 ) ⊆ fold (E2 )                               iff
                                  ˜
                      L(E1 ) ⊆ L(AE2 )                                  iff
                                 ˜
                      L(E1 ) ∩ L(Ac 2 ) = ∅                             iff
                                  E
                               ˜
                      L(AE1 × Ac ) = ∅     E2




Theorem: Containment of 2RPQs is in PSPACE, hence PSPACE-complete.


D. Calvanese          View-based Query Processing over Semistructured Data    23
Containment of queries over semistructured data

Complexity of containment for various classes of queries over semistructured
data:


                         Language                            Complexity
                           RPQs                                 PSPACE              [PODS’99]
                           2RPQs                                PSPACE              [PODS’00]
                        Tree-2RPQs                              PSPACE              [DBPL’01]
                     Conjunctive-2RPQs                       EXPSPACE                [KR’00]
                 Datalog in Unions of C2RPQs                  2EXPTIME              [ICDT’03]




D. Calvanese                 View-based Query Processing over Semistructured Data               24
VBQP over Semistructured Data – Outline

 √
       The semistructured data model and regular path queries (RPQs)
 √
       Containment of RPQs and 2RPQs

⇒ View-based query rewriting for RPQs and 2RPQs

 4. View-based query answering for RPQs and 2RPQs via automata

 5. View-based query answering for RPQs and 2RPQs via CSP




D. Calvanese              View-based Query Processing over Semistructured Data   25
View-based query processing

       certain
      answers        we are                                                       answers
       certQ,V    interested in                                                     to Q

                                                                            Q


                    View definition V                           Database schema
                    V1 V2 … Vn                                   R1 R2 … Rm


                                  …                                         …

                   View extension E                                  Database B




D. Calvanese         View-based Query Processing over Semistructured Data                   26
View-based query processing for SSD
We consider the setting where:
   • The schema Σ is a set of binary predicates without constraints.
   • Each view symbol in V is a binary predicate.
   • Each view definition in V Σ is an RPQ/2RPQ over Σ.
   • Each view extension in E is a binary relation.
   • The query Q is an RPQ/2RPQ over Σ.
   • Views are sound, complete, or exact views (will discuss mostly the case of
     sound views).
   • The domain is open.


Problem: Given a schema Σ, views V with definitions V Σ and extensions E , a
query Q over Σ, and a tuple t, decide whether t ∈ cert(Q, Σ, V Σ , E).

D. Calvanese               View-based Query Processing over Semistructured Data   27
View-based query answering for SSD – Example


   • Schema Σ = {article, reference, title, author, . . .}

   • Set of views V = {V1 , V2 , V3 }

   • View definitions V Σ    = {V1Σ , V2Σ , V3Σ }, with
      V1Σ :   V1 (b, a)     ← b (article) a
      V2Σ : V2 (p1 , p2 )   ← p1 (reference∗) p2
      V3Σ :   V3 (p, t)     ← p (title) t

   • View extensions E = {E1 , E2 , E3 }, where
      – E1 stores for each bibliography its articles
      – E2 stores for each publ. the ones it references directly or indirectly
      – E3 stores for each publication its title




D. Calvanese                View-based Query Processing over Semistructured Data   28
View-based query answering via rewriting

       certain
      answers      answers                                   we are
                   to RmaxQ,V                                                                answers
       certQ,V                                            interested in                        to Q

                                       RmaxQ,V                                         Q


                            View definition V                              Database schema
                            V1 V2 … Vn                                      R1 R2 … Rm


                                             …                                         …

                            View extension E                                    Database B



Note: the class C of queries in which to express the rewriting is fixed a priori.


D. Calvanese                    View-based Query Processing over Semistructured Data                   29
View-based query rewriting for SSD
Problem: Given a schema Σ, views V with definitions V Σ , a query Q over Σ,
and a class C of queries over V , compute the query R over V such that:

 1. R is a query in C .

 2. R is a sound rewriting, i.e., for every database B over Σ and every view
    extension E that is sound wrt B (i.e., such that E ⊆ V Σ (B)), we have
    R(E) ⊆ Q(B).
 3. R is the maximal (wrt containment) query in C satisfying condition (2).

Such a query is called the C -maximal rewriting of Q wrt Σ and V , and is
denoted rew C (Q, Σ, V).


In the following, we assume that the class C of queries in which the rewriting is
expressed coincides with that of Q (i.e., is that of RPQs / 2RPQs), and we do
not mention it explicitly.

D. Calvanese              View-based Query Processing over Semistructured Data   30
View-based query rewriting for 2RPQs – Example
Consider three views with definitions:

                            V1 (b, a) ← b (article) a
                          V2 (p1 , p2 ) ← p1 (reference∗) p2
                            V3 (p, t) ← p (title) t



Query:            Q(x, y) ← x (article·(reference + reference−)∗·title) y


   • R(x, y) ← x (v1 ·v2 ·v3 ) y                                is an RPQ rewriting of Q

   • R(x, y) ← x (v1 ·(v2 + v2 −)·v3 ) y                        is a 2RPQ rewriting of Q

   • R(x, y) ← x (v1 ·(v2 + v2 −)∗·v3 ) y                       is a 2RPQ-maximal rewriting of Q
                                                                that is also exact


D. Calvanese                  View-based Query Processing over Semistructured Data             31
Sound rewritings
Since RPQs and 2RPQs are monotone queries, the following characterizations
of a sound rewriting R of Q wrt Σ and V are equivalent:
   • For every database B over Σ and every view extension E with
     E ⊆ V Σ (B), we have that R(E) ⊆ Q(B).
   • For every database B over Σ and every view extension E with
     E = V Σ (B), we have that R(E) ⊆ Q(B).
   • For every database B over Σ, we have that R(V Σ (B)) ⊆ Q(B).


Observe: The following two ways of computing R(V Σ (B)) are equivalent:
   • First evaluate the view definitions V Σ over B, and then evaluate R over the
     obtained view extension.
   • First expand in R the view symbols V with the corresponding definitions
     V Σ , and then evaluate the resulting query over B.

D. Calvanese              View-based Query Processing over Semistructured Data   32
Expansion of an RPQ / 2RPQ
We call a query over Σ a Σ-query, and a query over V a V -query.

Consider:
 • a set of views V = {V1 , . . . , Vk}
                                                                   Σ
 • a set of view definitions (RPQs or 2RPQs) V Σ = {V1Σ , . . . , Vk },
   with ViΣ : Vi(x, y) ← x Ei y
 • a V -query (RPQ or 2RPQ) R : R(x, y) ← x ER y

Definition: The expansion R[V→VΣ ] of R wrt V Σ is the Σ-query obtained from
R by replacing in ER:
 • each occurrence of a view symbol Vi with Ei.
 • each occurrence of an inverse view symbol Vi− with inv (Ei), where
                inv (p) = p−          inv (e1 + e2 ) = inv (e1 ) + inv (e2 )
               inv (p−) = p             inv (e1 ·e2 ) = inv (e2 )·inv (e2 )
                                            inv (e∗) = inv (e)∗

D. Calvanese                   View-based Query Processing over Semistructured Data   33
Checking rewritings via expansion

Hence, to check whether a rewriting R of Q wrt Σ and V Σ is sound, we can:

 1. Compute R[V→VΣ ] by expanding each direct (resp., inverse) view symbol
    V (resp., V −) in R with the corresponding definition V Σ (resp., inv (V Σ )).

 2. Check whether R[V→VΣ ] ⊑ Q                (notice that both are Σ-queries).




D. Calvanese             View-based Query Processing over Semistructured Data     34
Counterexample method for rewriting of RPQs
Due to Language-Theoretic Lemma 1, we can treat RPQs simply as regular
expressions.


Consider a candidate rewriting: R = u1 · · · uk ∈ V ∗

   • R is a bad rewriting of Q if R[V→VΣ ] ⊑ Q

   • R is a bad rewriting of Q if there are witnesses w1 , . . . , wk ∈ Σ∗ such
     that w1 · · · wk ⊑ Q, where wi ∈ L(VjΣ ) if ui = Vj .


Example: Q = abcd,            V = {V1 , V2 },                V1Σ = ab,               V2Σ = cd

   • bad rewriting: V2 V1 ,      with witnesses cd and ab

   • good rewriting: V1 V2


D. Calvanese                  View-based Query Processing over Semistructured Data              35
Rewriting of RPQs
Construction is based on 1NFAs:
 1. Construct a 1DFA AQ = (Σ, S, s0 , δ, F ) for Q ; exponential blowup
 2. Construct the 1NFA Ab = (V, S, s0 , δ ′, S − F ), where
    sj ∈ δ ′(si, V ) iff there exists a word w ∈ L(V Σ ) s.t. sj ∈ δ ∗(si, w)
       Note:
        • AQ and Abad have the same states, but complementary final states.
        • Abad accepts a word u1 · · · uk ∈ V ∗ iff it is a bad rewriting of Q.
          The words w used in the definition of δ ′ act as witnesses.
        ; linear blowup
 3. Complement Abad to get a 1NFA Arew for good rewritings.
    ; exponential blowup

The construction yields the maximal path rewriting.
It is represented by a 1DFA. ; The maximal path rewriting is an RPQ.

D. Calvanese                View-based Query Processing over Semistructured Data   36
Rewriting of RPQs – Example
Query:              Q = a·(b·a + c)∗
Views:             V = {V1 , V2 , V3 },             with V1Σ = a,                 V2Σ = a·c∗·b,     V3Σ = c



                         b, c                V3                       V2                       V3
                                                                                        V1 ,
      b              a                                V1                                V2 ,         V1
                                                                                        V3
                         a      a, b, c                               V1 ,
                                                                      V2
               c                                         V3                                            V2
               Aq                                      Abad                                           Arew




D. Calvanese                         View-based Query Processing over Semistructured Data                     37
Rewriting of RPQs – Complexity
Checking nonemptiness of the maximal path-rewriting of an RPQ Query wrt a
set of RPQ views is EXPSPACE-complete.
Proof:

   • The 1DFA Arew is of double exponential size.

   • We can check for its non-emptiness on the fly in NLOGSPACE in the
     number of its states, i.e., in EXPSPACE in the size of Q

   • The matching lower-bound is by a reduction from an EXPSPACE-hard
     tiling problem.


There exists an RPQ Query Q and RPQ views V over an alphabet Σ such that
the smallest 1NFA for the rewriting rew (Q, Σ, V) is of double exponential size.



D. Calvanese             View-based Query Processing over Semistructured Data   38
Exact rewritings
Definition: A rewriting R of an RPQ Q wrt Σ and V Σ is exact if Q ⊑ R[V→VΣ ] .

Note: since R is a rewriting, we also have R[V→VΣ ] ⊑ Q.


To verify whether R is an exact rewriting of Q wrt Σ and V Σ :

 1. Construct a 1NFA B over Σ accepting R[V→VΣ ] .
       It suffices to replace each Vi edge in R with a 1NFA for ViΣ .

 2. Check whether L(AQ) ⊆ L(B), i.e.,
        • complement B to obtain an 1NFA B ,
        • check whether L(AQ) ∩ L(B) = ∅, i.e., whether L(AQ × B) = ∅.


B may be of triply exponential size in Q but we can check its emptyness
on-the-fly in 2EXPSPACE.

D. Calvanese                View-based Query Processing over Semistructured Data   39
Rewriting of RPQs – Results
Theorem:

   • Nonemptiness of the maximal rewriting is EXPSPACE-complete.

   • The maximal rewriting may be of double exponential size.

   • Existence of an exact rewriting is 2EXPSPACE-complete.




D. Calvanese             View-based Query Processing over Semistructured Data   40
Rewriting of 2RPQs
   • The query and the view definitions are 2RPQs over Σ.
   • We look for rewritings that are a 2RPQ over V .
   • We consider again candidate rewritings and try to characterize bad
     rewritings.

                                                         ∗
Candidate rewriting: R = u1 · · · uk ∈ V ±
   • To check whether a rewriting is bad, we need to expand both direct and
     inverse view symbols.
   • R is a bad rewriting of Q if R[V→VΣ ] ⊑ Q
                                                                                 ∗
   • R is a bad rewriting of Q if there are witnesses w1 , . . . , wk ∈ Σ± such
     that w1 · · · wk ⊑ Q, where
      – wi ∈ L(VjΣ ) if ui = Vj .
      – wi ∈ L(inv (VjΣ )) if ui = Vj−.

D. Calvanese              View-based Query Processing over Semistructured Data       41
Counterexample method for rewriting of 2RPQs
We consider counterexample words, which are obtained from a bad rewriting
by inserting after each symbol u its witness w.
Definition: Counterexample word u1 w1 · · · ukwk
 1. wi ∈ L(VjΣ ) if ui = Vj .
 2. wi ∈ L(inv (VjΣ )) if ui = Vj−.
 3. w1 · · · wk ⊑ Q.

Example: Q = abcd, V = {V1 , V2 }, V1Σ = ab, V2Σ = cd
 • bad rewriting: V2 V1 , with witnesses w1 = cd, w2 = ab
 • counterexample word: V2 cd V1 ab

Checking counterexample words with 2NFAs
 • Check (1) and (2) with 2NFAs for VjΣ .
 • Use folding technique to construct 2NFA to check w1 · · · wk ⊑ Q.
 • Complement resulting 2NFA.
; Complexity is exponential

D. Calvanese              View-based Query Processing over Semistructured Data   42
From counterexamples to rewritings
To construct good rewritings:

 1. Construct 1NFA A1 for counterexample words u1 w1 · · · ukwk.
    ; exponential

 2. Project out witness words wi to get 1NFA A2 for bad rewritings u1 · · · uk.
    ; linear

 3. Complement A2 to get 1NFA for good rewritings.
    ; exponential


The construction yields the maximal two-way path rewriting
It is represented by a 1DFA. ; The maximal two-way path rewriting is a
2RPQ.



D. Calvanese             View-based Query Processing over Semistructured Data   43
Rewriting of 2RPQs – Results

We get for 2RPQs the same upper (and lower) bounds as for RPQs.


Theorem:

   • Nonemptiness of the maximal rewriting is EXPSPACE-complete.

   • The maximal rewriting may be of double exponential size.

   • Existence of an exact rewriting is 2EXPSPACE-complete.




D. Calvanese             View-based Query Processing over Semistructured Data   44
VBQP over Semistructured Data – Outline

 √
       The semistructured data model and regular path queries (RPQs)
 √
       Containment of RPQs and 2RPQs
 √
       View-based query rewriting for RPQs and 2RPQs

⇒ View-based query answering for RPQs and 2RPQs via automata

 5. View-based query answering for RPQs and 2RPQs via CSP




D. Calvanese              View-based Query Processing over Semistructured Data   45
View-based query answering for RPQs / 2RPQs
Given:

   • a set V of view symbols with a corresponding set V Σ of RPQ / 2RPQ view
     definitions over a relational alphabet Σ

   • a corresponding set E of view extensions

   • a 2RPQ Q over Σ

   • a pair (c, d) of objects

check whether (c, d) ∈ cert(Q, Σ, V Σ , E).


In other words, check whether for every database B over Σ such that the view
extension E is sound wrt B (i.e, E ⊆ V(B)), we have that (c, d) ∈ Q(B).




D. Calvanese               View-based Query Processing over Semistructured Data   46
View-based query answering for 2RPQs – Idea
We search for a counterexample database, i.e., a database B such that E is
sound wrt B, but such that (c, d) is not in the answer to Q over B.


Technique:

 1. Encode counterexample databases as finite words.

 2. Construct a 2NFA that accepts such words.

 3. Check for emptiness of the automaton.




D. Calvanese              View-based Query Processing over Semistructured Data   47
Canonical counterexample databases
A database B is a counterexample to (c, d) ∈ cert(Q, Σ, V, E) if
 • E ⊆ V Σ (B), i.e., the view extension E is sound wrt B,
 • (c, d) ∈ Q(B).
          /

Observations:

   • It is sufficient to restrict the attention to counterexamples of a special form
     (canonical databases)
      d1            d2                d3
                                                r                           Set of objects in the view extension E :
           r             p        q
                q                           r                               ∆E = {d1 , d2 , d3 , d4 , d5 , . . .}
                                           d5
               d4        p    p

   • Each canonical database B can be represented as a word wB over
     ΣA = Σ± ∪ ∆E ∪ {$} of the form

                               $ d1 w1 d2 $ d3 w2 d4 $ · · · $ d2m−1 wm d2m $

D. Calvanese                                    View-based Query Processing over Semistructured Data                   48
Canonical DBs and 2NFAs
         d1         d2               d3
                                               r
               r         p       q                                                               ∗
B:                 q                       r           Q(x, y) ← x (r·(p + q)·p−·p·q − ) y
                                          d5
               d4        p   p


Word representing B:

                   $ d1 r d2 $ d4 p−p d5 $ d4 q − d2 $ d3 rr d3 $ d2 pq − d3 $


To verify whether (di, dj ) ∈ Q(B) we exploit that 2NFAs can:

   • move on the word both forward and backward

   • jump from one position in the word representing a node to any other position
     representing the same node (search mode)


; We can construct a 2NFA A(Q,di ,dj ) that accepts wB iff (di, dj ) ∈ Q(B)

D. Calvanese                              View-based Query Processing over Semistructured Data       49
View-based query answering for 2RPQs – Technique
To check whether (c, d) ∈ cert(Q, Σ, V, E), we construct a 1NFA AQA as the
                        /
intersection of:
                                                                  ∗
   • the 1NFA A0 that accepts ($·∆E ·Σ±·Σ± ·∆E )∗·$
   • the 1NFAs corresponding to the various A(ViΣ ,a,b)
       (for each sound or exact view Vi, and for each pair (a, b) ∈ Ei)
   • the 1NFAs corresponding to the complement of each AVi
       (for each complete or exact view Vi, the 2NFA AVi checks whether a pair of
       objects other than those in Ei is in Vi(B))
   • the 1NFA corresponding to the complement of A(Q,c,d)

; AQA accepts words representing counterexample DBs


We have that (c, d) ∈ cert(Q, Σ, V, E) iff AQA is nonempty.
                    /


D. Calvanese                  View-based Query Processing over Semistructured Data   50
Complexity of query answering
Can be measured in three different ways:

   • Data complexity: as a function of the size of the view extension E

   • Expression complexity: as a function of the size of the query Q and of the
     view definitions V1Σ , . . . , VkΣ


   • Combined complexity: as a function of the size of both the view extension
     E and the expressions Q , V1Σ , . . . , VkΣ




D. Calvanese              View-based Query Processing over Semistructured Data   51
View-based query answering for 2RPQs – Upper bounds
   • All 2NFAs are of linear size in the size of Q , all views in V and the view
     extensions E .
     ; The corresponding 1NFA would be exponential.

   • However, we can avoid the explicit construction of AQA , and we can
     construct it on the fly while checking for nonemptiness.


; View-based query answering for 2RPQs is in PSPACE wrt expression
complexity and combined complexity.


PSPACE-hardness follows immediately by reduction from universality of
regular languages.


What about data complexity, i.e., complexity in the size of the view extension E ?

D. Calvanese               View-based Query Processing over Semistructured Data    52
VBQP over Semistructured Data – Outline

 √
       The semistructured data model and regular path queries (RPQs)
 √
       Containment of RPQs and 2RPQs
 √
       View-based query rewriting for RPQs and 2RPQs
 √
       View-based query answering for RPQs and 2RPQs via automata

⇒ View-based query answering for RPQs and 2RPQs via CSP




D. Calvanese              View-based Query Processing over Semistructured Data   53
View-based query answering for 2RPQs – Data complexity
   • In the previous algorithm one cannot immediately single out the
     contribution of E to the nonemptiness check of the automaton AQA .

   • This can however be done by analyzing the transformation from 2NFA to
     1NFA, and modifying the construction of the automata to avoid search
     mode.


We look at an alternative way to analyze data complexity, derived from a tight
connection between view-based query answering under sound views and
constraint satisfaction (CSP).


; Better insight into view-based query answering for RPQs and 2RPQs.
; Several additional results on various forms of view-based query processing.


D. Calvanese             View-based Query Processing over Semistructured Data    54
Constraint satisfaction problems
Let A and B be relational structures over the same alphabet.

A homomorphism h is a mapping from A to B such that for every relation R,
if (c1 , . . . , cn) ∈ R(A), then (h(c1 ), . . . , h(cn)) ∈ R(B).


Non-uniform constraint satisfaction problem CSP (B): the set of relational
structures A such that there is a homomorphism from A to B.


Complexity:
   • CSP (B) is in NP.
   • There are structures B for which CSP (B) is NP-hard.
       Example:    p         p

                         p



D. Calvanese                 View-based Query Processing over Semistructured Data   55
CSP and VBQA for 2RPQs – Constraint template
From Q and V , we can define a relational structure T = CT Q,V , called
constraint template of Q wrt V :
   • The vocabulary of T is {R1 , . . . , Rk} ∪ {Uini , Ufin }, where
      – each Ri corresponds to a view Vi and denotes a binary predicate
      – Uini and Ufin denote unary predicates
   • Let Q be represented by a 1NFA (Σ±, S, S0 , δ, F ):
      – The domain ∆T of T is 2S , i.e., all sets of states of Q
      – σ ∈ Uini (T ) iff S0 ⊆ σ
      – σ ∈ Ufin (T ) iff σ ∩ F = ∅
      – (σ1 , σ2 ) ∈ Ri(T ) iff there exists a word p1 · · · pk ∈ L(ViΣ ) and a
        sequence T0 , . . . , Tk of subsets of S such that:
             1. T0 = σ1 and Tk = σ2
             2. if s ∈ Ti and t ∈ δ(s, pi+1 ), then t ∈ Ti+1
             3. if s ∈ Ti and t ∈ δ(s, p−), then t ∈ Ti−1
                                          i



D. Calvanese              View-based Query Processing over Semistructured Data    56
CSP and VBQA for 2RPQs – Constraint instance
Observations:

   • Intuitively, the constraint template CT Q,V encodes how the states of Q
     change when moving along a database according to the views.

   • The existence of a word p1 · · · pk ∈ L(ViΣ ) and of a sequence T0 , . . . , Tk
     of subsets of S satisfying conditions 1–3 can be checked in PSPACE.

   • CT Q,V can be computed in time exponential in Q and polynomial in V .


From E and two objects c and d, we can define another relational structure
I = CI c,d over the same vocabulary, called the constraint instance:
         E
  • The domain ∆I of I is ∆E ∪ {c, d}
  • Ri(I) = Ei, for i ∈ {1, . . . , k}
  • Uini (I) = {c}
  • Ufin (I) = {d}

D. Calvanese               View-based Query Processing over Semistructured Data   57
CSP and view-based query answering for 2RPQs


Theorem:       (c, d) is not a certain answer to Q wrt V and E
                                  if and only if
 there is a homomorphism from CI c,d to CT Q,V , i.e, CI c,d ∈ CSP (CT Q,V )
                                      E                  E



; Characterization of view-based query answering for 2RPQs in terms of CSP
; View-based query answering for 2RPQs is in coNP wrt data complexity




D. Calvanese               View-based Query Processing over Semistructured Data   58
Query answering for RPQs — Data complexity lower bound
coNP-hard wrt data compexity, by reduction from 3-coloring:
         Views: Vs   = Sr + Sg + Sb
                VG   = Rrg + Rgr + Rrb + Rbr + Rgb + Rbg
                Vf   = Fr + Fg + Fb
         Query: Q    =   x=y Sx·Fy +  x=y∨w=z Sx·Ryw ·Fy (color mismatch)
Only domain and view extensions depend on graph G = (N, E).
                                                                                            c

    Domain:     ∆E    =   N ∪ {c, d}                                             Vs
                                                                                        1
    Extensions: Es    =   {(c, a) | a ∈ N }
                EG    =   {(a, b), (b, a) | (a, b) ∈ E}                          VG 2                3

                Ef    =   {(a, d) | a ∈ N }                                                     4
                                                                                 Vf
                                                                                            d
Thm: G is 3-colorable iff (c, d) is not a certain answer to Q.


Note: we have used only a query and views that are unions of simple paths.

D. Calvanese              View-based Query Processing over Semistructured Data                      59
Complexity of view-based query answering for RPQs / 2RPQs


         Assumption on   Assumption on                                      Complexity
               domain           views                      data         expression       combined
                             all sound                    coNP               coNP         coNP
               closed         all exact                   coNP               coNP         coNP
                              arbitrary                   coNP               coNP         coNP
                             all sound                    coNP            PSPACE         PSPACE
               open           all exact                   coNP            PSPACE         PSPACE
                              arbitrary                   coNP            PSPACE         PSPACE


From [ICDE’00] for RPQs and [PODS’00, ICDT’05] for 2RPQs.



D. Calvanese              View-based Query Processing over Semistructured Data                      60
Consequence of complexity results


  + The view-based query answering algorithm provides a set of answers that
    is sound and complete.

   – A coNP data complexity does not allow for effective deployment of the
     query answering algorithm.

       Note that coNP-hardness holds already for queries and views that are
       unions of simple paths (no reflexive-transitive closure).


; Adopt an indirect approach to view-based query answering, via query
rewriting.




D. Calvanese               View-based Query Processing over Semistructured Data   61
Query answering by rewriting
To answer a query Q wrt V and E :
 1. re-express Q in terms of the view symbols, i.e., compute a rewriting of Q .
 2. directly evaluate the rewriting over E .

Comparison with direct approach to query answering:

  + We can consider rewritings in a class with polynomial data complexity
    (e.g., 2RPQs) ; the data complexity for query answering is polynomial.
+/– We have traded expression complexity for data complexity.
   – We may lose completeness (i.e., not obtain all certain answers).


We need to establish the “quality” of a rewriting:
 • When does the (maximal) rewriting compute all certain answers?
 • What do we gain or lose by considering rewritings in a given class?


D. Calvanese              View-based Query Processing over Semistructured Data    62

Mais conteúdo relacionado

Último

Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...RKavithamani
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 

Último (20)

Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 

Destaque

Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsPixeldarts
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthThinkNow
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Applitools
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at WorkGetSmarter
 

Destaque (20)

Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 

View-based Query Processing over Semistructured Data: The Semistructured Data Model and Regular Path Queries

  • 1. View-based Query Processing over Semistructured Data Diego Calvanese Free University of Bozen-Bolzano View-based Query Processing Diego Calvanese, Giuseppe De Giacomo, Georg Gottlob, Maurizio Lenzerini, Riccardo Rosati Corso di Dottorato – Dottorato in Ingegneria Informatica University of Rome “La Sapienza” September–October 2005
  • 2. VBQP over Semistructured Data – Outline 1. The semistructured data model and regular path queries (RPQs) 2. Containment of RPQs and 2RPQs 3. View-based query rewriting for RPQs and 2RPQs 4. View-based query answering for RPQs and 2RPQs via automata 5. View-based query answering for RPQs and 2RPQs via CSP D. Calvanese View-based Query Processing over Semistructured Data 1
  • 3. VBQP over Semistructured Data – Outline ⇒ The semistructured data model and regular path queries (RPQs) 2. Containment of RPQs and 2RPQs 3. View-based query rewriting for RPQs and 2RPQs 4. View-based query answering for RPQs and 2RPQs via automata 5. View-based query answering for RPQs and 2RPQs via CSP D. Calvanese View-based Query Processing over Semistructured Data 2
  • 4. Semistructured data Semistructured data (SSD) are an abstraction for data on the web, structured documents, XML: • A (semistructured) database (DB) is a (finite) edge-labeled graph bib o1 ... ... article book article book reference reference reference reference o15 o37 o42 o58 ... title author title ... ... author author title o68 o75 ... o64 o71 o83 o95 "XPath ..." "Victor Vianu" "Regular ..." "Tova Milo" "Type Inference ..." firstname lastname o52 o53 "Dan" "Suciu" • In some cases there are restrictions on the structure of the graph, e.g., in XML the graph has to be a tree D. Calvanese View-based Query Processing over Semistructured Data 3
  • 5. Formalization of semistructured databases Definition: • The schema of a DB is a relational alphabet Σ of binary predicates (one for each edge label). • A SSDB is a set of binary relations. Note that we do not allow for constraints over the relations in a SSDB. bib Example: Σ = {bib, article, o1 ... ... article book article book book, reference, reference o15 reference o37 reference o42 reference o58 title, author, . . .} ... title author title ... ... author author title o68 o75 ... o64 o71 o83 o95 "XPath ..." "Victor Vianu" "Regular ..." "Tova Milo" "Type Inference ..." firstname lastname o52 o53 "Dan" "Suciu" D. Calvanese View-based Query Processing over Semistructured Data 4
  • 6. Queries over SSD Queries over SSD are typically constituted by two parts: • selection part: selects (tuples of) nodes that satisfy some condition • restructuring part: reorganizes the selected nodes into a graph (or tree) In this course we deal with the selection part only. Queries must provide the ability to “navigate” the graph structure to relate pairs of nodes ; must contain some form of recursion: • Datalog: provides a very expressive form of recursion • XPath: descendant/ancestor axes refer to successor/predecessor nodes at arbitrary depth in the tree – rather restricted form of recursion • reflexive transitive closure provides a good tradeoff D. Calvanese View-based Query Processing over Semistructured Data 5
  • 7. Path queries Are the basic element of all proposals for query languages over SSD. Definition: A path query Q has the form Q(x, y) ← x L y where L is a language over the alphabet Σ of binary DB predicates. Recall that a DB B is a set of (binary) relations over Σ, or equivalently a graph whose edges are labeled with elements of Σ. Definition: The answer Q(B) to Q over B is the set of pairs of nodes (a, b) p1 pk such that there is a path a → · · · → b in B, with p1 · · · pk ∈ L. Notable example: regular path queries (RPQs), in which L is a regular language over Σ D. Calvanese View-based Query Processing over Semistructured Data 6
  • 8. Regular path queries In an RPQ, we can specify the regular language through a regular expression. Example: DB alphabet: Σ = {bib, article, book, reference, title, . . .} Query: Q(x, y) ← x ((article + book)·reference∗·title) y Consider the DB B over Σ: bib Q(B) o1 ... ... article book article book o1 o75 reference reference reference reference o1 o83 o15 o37 o42 o58 ... title o1 o95 author ... ... author title title author . . . . ... o95 . . o68 o75 o64 o71 o83 "XPath ..." "Victor Vianu" "Regular ..." "Tova Milo" "Type Inference ..." firstname lastname o52 o53 "Dan" "Suciu" D. Calvanese View-based Query Processing over Semistructured Data 7
  • 9. Regular path queries – Observations Expressive power of RPQs: • Not expressible in first-order logic • Are a fragment of transitive-closure logic • Are a fragment of binary Datalog – Concatenation: P (x, y) ← E1 (x, z), E2 (z, y) – Union: P (x, y) ← E1 (x, y) P (x, y) ← E2 (x, y) – Reflexive-transitive closure: P (x, y) ← E(x, y) P (x, y) ← E(x, z), P (z, y) D. Calvanese View-based Query Processing over Semistructured Data 8
  • 10. VBQP over Semistructured Data – Outline √ The semistructured data model and regular path queries (RPQs) ⇒ Containment of RPQs and 2RPQs 3. View-based query rewriting for RPQs and 2RPQs 4. View-based query answering for RPQs and 2RPQs via automata 5. View-based query answering for RPQs and 2RPQs via CSP D. Calvanese View-based Query Processing over Semistructured Data 9
  • 11. Path query containment Given Q1 (x, y) ← x L1 y Q2 (x, y) ← x L2 y check whether Q1 ⊑ Q2 , i.e., for every DB B, we have Q1 (B) ⊆ Q2 (B). Language-Theoretic Lemma 1: Q1 ⊑ Q2 iff L1 ⊆ L2 p1 pk Proof: “Only if”: Consider a DB a → · · · → b with p1 · · · pk ∈ L1 and p1 · · · pk ∈ L2 . / p1 pk “If”: If (a, b) ∈ Q1 (B), then B contains a path a → · · · → b with p1 · · · pk ∈ L1 . But then p1 · · · pk ∈ L2 , and (a, b) ∈ Q2 (B). Corollary: Path query containment is • undecidable for context-free path queries • PSPACE-complete for regular path queries [Stockmeyer 1973] D. Calvanese View-based Query Processing over Semistructured Data 10
  • 12. Containment of RPQs Via language containment. We exploit that L1 ⊆ L2 iff L1 − L2 = ∅. Algorithm for checking whether L(E1 ) ⊆ L(E2 ) (for regular expr. E1 , E2 ) 1. Construct NFAs Ai such that L(Ai) = L(Ei) ; linear blowup 2. Construct NFA A2 such that L(A2 ) = Σ∗ − L(A2 ) ; exponential blowup 3. Construct A = A1 × A2 such that L(A) = L(E1 ) − L(E2 ) ; quadratic blowup 4. Check whether there is a path from the initial state to a final state in A ; NLOGSPACE Theorem: Containment of RPQs is in PSPACE, and hence PSPACE-complete. D. Calvanese View-based Query Processing over Semistructured Data 11
  • 13. Two-way regular path queries (2RPQs) • Provide the ability to navigate DB edges in both directions. Allow one to capture, e.g., the predecessor axis of XPath. • We introduce an extended alphabet Σ± = Σ ∪ Σ−, where Σ− = {p− | p ∈ Σ}. • We call the elements of Σ− inverse edge labels. Definition: A two-way regular path query over Σ has the form Q(x, y) ← x E y with E a regular expression over the extended alphabet Σ±. Example: Q2 (x, y) ← x (article·(reference + reference−)∗·title) y Note: the edges of the DB are still labeled with elements of Σ only. D. Calvanese View-based Query Processing over Semistructured Data 12
  • 14. Semantics of two-way regular path queries • Consider a 2RPQ Q(x, y) ← x E y and a DB B over Σ. r1 rk • A semi-path a1 → a2 · · · ak → ak+1 in B is a sequence of nodes with: pi – either ai → ai+1 in B, and ri = pi, pi – or ai+1 → ai in B, and ri = p−. i • The answer Q(B) to Q over B is the set of pairs of nodes (a, b) s.t. there r1 rk is a semi-path a → · · · → b in B, with r1 · · · rk ∈ L(E). Example: Query Q2 (x, y) ← x (article·(reference + reference−)∗·title) y Database B: bib Q2 (B) o1 ... ... article book article book o1 o75 reference o15 reference o37 reference o42 reference o58 o1 o83 ... title o1 o95 author title ... ... author author title . . . . o68 o75 ... o64 o71 o83 o95 . . "XPath ..." "Victor Vianu" "Regular ..." "Tova Milo" "Type Inference ..." D. Calvanese View-based Query Processing over Semistructured Data 13
  • 15. Two-way regular path queries – Observations Language-Theoretic Lemma 1 does not hold anymore. Reason: sequences of direct and inverse edge labels may be “folded” away. Example: Σ = {P } Q1 (x, y) ← x P y Q2 (x, y) ← x P ·P −·P y We have that: P • Q1 ⊑ Q2 : consider any path a → b in a DB B • but L(P ) L(P ·P −·P ). D. Calvanese View-based Query Processing over Semistructured Data 14
  • 16. Foldings Definition: Let u, v be words over Σ±. We say that v folds onto u, denoted v ; u, if we can transform v into u by repeatedly: • replacing each occurrence in v of p·p−·p with p, and • replacing each occurrence in v of p−·p·p− with p−. Example: rss−st ; rst r s s s t r s t Pictorially: → · → · ← · → · → ; → · → · → Definition: Let E be a RE over Σ±. Then fold (E) = {v | v ; u, for some u ∈ L(E)} The notion of folding allows us to reduce containment of 2RPQs to a language-theoretic problem. D. Calvanese View-based Query Processing over Semistructured Data 15
  • 17. Containment of 2RPQs Consider two 2RPQs Q1 (x, y) ← x E1 y Q2 (x, y) ← x E2 y Language-Theoretic Lemma 2: Q1 ⊑ Q2 iff L(E1 ) ⊆ fold (E2 ) r1 rk Proof: by considering simple semi-paths a → · · · → b in a DB, where r1 · · · rk ∈ L(E1 ). To decide L(E1 ) ⊆ fold (E2 ) we resort to two-way automata on words. D. Calvanese View-based Query Processing over Semistructured Data 16
  • 18. Two-way automata on words (2NFA) A 2NFA is similar to a standard one-way automaton (1NFA) A = (Σ, S, S0 , δ, F ) but the transition function δ : S × Σ → 2S×{−1,0,1} maps each state to a set of pairs • new state • moving direction (left, don’t move, or right) Theorem [Rabin&Scott, Shepherdson 1959]: 2NFAs accept regular languages Given a 2NFA A with n states, one can construct a 1NFA with O(2n log n) states accepting L(A). D. Calvanese View-based Query Processing over Semistructured Data 17
  • 19. 2NFAs and foldings ˜ Theorem: Let E be a RE over Σ±. There is a 2NFA AE such that ˜ • L(AE ) = fold (E) ˜ • The number of states of AE is linear in the size of E ˜ In the construction of AE we exploit the fact that 2NFAs can move backwards on a word. E.g., to fold pp−p onto p, the 2NFA: 1. Moves forward on p. 2. Makes a step backward and expects to see p while staying in place (this corresponds to moving according to p−, i.e., backward on p). 3. Moves forward again on p. D. Calvanese View-based Query Processing over Semistructured Data 18
  • 20. 2NFAs and foldings – Example ∗ Regular expression over Σ±: E = r·(p + q)·p−·p·q − Word in L(E) viewed as a path in a DB: r p q d1 d2 r p q− $ 1NFA that accepts L(E) p s0 r s1 s2 p− s3 p s4 q− q 2NFA that accepts fold (E) p, 1 − s0 r, 1 s s2 p , 1 s3 p, 1 s4 1 q−, 1 q, 1 ∗, 1 ∗, −1 p, 0 ··· ··· ··· sf s← 2 ∗, 1 D. Calvanese View-based Query Processing over Semistructured Data 19
  • 21. 2NFAs and foldings – Construction Let E be a RE over Σ± and A = (Σ±, S, S0 , δ, F ) a 1NFA with L(A) = L(E). We construct the 2NFA ˜ AE = (Σ± ∪ {$}, S ∪ {sf } ∪ {s← | s ∈ S}, S0 , δA , {sf }) where δA is defined as follows: • (s←, −1) ∈ δA(s, ℓ), for each s ∈ S and ℓ ∈ Σ± ∪ {$} • (s2 , 1) ∈ δA(s1 , r) and (s2 , 0) ∈ δA(s←, r −), for each transition s2 ∈ δ(s1 , r) 1 of E . • (sf , 1) ∈ δA(s, ℓ), for each s ∈ F and ℓ ∈ Σ± ∪ {$} We have that: w ∈ fold (E) ˜ iff w$ ∈ L(AE ) ˜ (We can also get rid of the $ at the end of words in L(AE ).) D. Calvanese View-based Query Processing over Semistructured Data 20
  • 22. Containment of 2RPQs (cont’d) Consider two 2RPQs Q1 (x, y) ← x E1 y Q2 (x, y) ← x E2 y Summing what we have seen till now, we have that Q1 ⊑ Q2 iff L(E1 ) ⊆ fold (E2 ) iff ˜ L(E1 ) ⊆ L(AE2 ) ˜ To check L(E1 ) ⊆ L(AE2 ) we have to look into the transformation of 2NFAs to 1NFAs. D. Calvanese View-based Query Processing over Semistructured Data 21
  • 23. Transforming 2NFAs to 1NFAs Theorem [Vardi 1988]: Let A = (Σ, S, S0 , δ, F ) be a 2NFA. There is a 1NFA Ac such that • Ac accepts the complement of A, i.e., L(Ac) = Σ∗ − L(A) • Ac is exponential in A, i.e., ||Ac|| is 2O(||A||) Proof: guess a subset-sequence counterexample. a0 · · · ak−1 ∈ L(A) iff there is a sequence T0 , . . . , Tk of subsets of S such / that: • S0 ⊆ T0 and Tk ∩ F = ∅ • If s ∈ Ti and (t, +1) ∈ δ(s, ai), then t ∈ Ti+1 , for 0 ≤ i < k • If s ∈ Ti and (t, 0) ∈ δ(s, ai), then t ∈ Ti, for 0 ≤ i < k • If s ∈ Ti and (t, −1) ∈ δ(s, ai), then t ∈ Ti−1 , for 0 ≤ i < k The 1NFA Ac guesses such a seQuence T0 , . . . , Tk and checks that it satisfies the 4 conditions above. D. Calvanese View-based Query Processing over Semistructured Data 22
  • 24. Containment of 2RPQs (finally done!) Consider two 2RPQs Q1 (x, y) ← x E1 y Q2 (x, y) ← x E2 y Summing up, we have that Q1 ⊑ Q2 iff L(E1 ) ⊆ fold (E2 ) iff ˜ L(E1 ) ⊆ L(AE2 ) iff ˜ L(E1 ) ∩ L(Ac 2 ) = ∅ iff E ˜ L(AE1 × Ac ) = ∅ E2 Theorem: Containment of 2RPQs is in PSPACE, hence PSPACE-complete. D. Calvanese View-based Query Processing over Semistructured Data 23
  • 25. Containment of queries over semistructured data Complexity of containment for various classes of queries over semistructured data: Language Complexity RPQs PSPACE [PODS’99] 2RPQs PSPACE [PODS’00] Tree-2RPQs PSPACE [DBPL’01] Conjunctive-2RPQs EXPSPACE [KR’00] Datalog in Unions of C2RPQs 2EXPTIME [ICDT’03] D. Calvanese View-based Query Processing over Semistructured Data 24
  • 26. VBQP over Semistructured Data – Outline √ The semistructured data model and regular path queries (RPQs) √ Containment of RPQs and 2RPQs ⇒ View-based query rewriting for RPQs and 2RPQs 4. View-based query answering for RPQs and 2RPQs via automata 5. View-based query answering for RPQs and 2RPQs via CSP D. Calvanese View-based Query Processing over Semistructured Data 25
  • 27. View-based query processing certain answers we are answers certQ,V interested in to Q Q View definition V Database schema V1 V2 … Vn R1 R2 … Rm … … View extension E Database B D. Calvanese View-based Query Processing over Semistructured Data 26
  • 28. View-based query processing for SSD We consider the setting where: • The schema Σ is a set of binary predicates without constraints. • Each view symbol in V is a binary predicate. • Each view definition in V Σ is an RPQ/2RPQ over Σ. • Each view extension in E is a binary relation. • The query Q is an RPQ/2RPQ over Σ. • Views are sound, complete, or exact views (will discuss mostly the case of sound views). • The domain is open. Problem: Given a schema Σ, views V with definitions V Σ and extensions E , a query Q over Σ, and a tuple t, decide whether t ∈ cert(Q, Σ, V Σ , E). D. Calvanese View-based Query Processing over Semistructured Data 27
  • 29. View-based query answering for SSD – Example • Schema Σ = {article, reference, title, author, . . .} • Set of views V = {V1 , V2 , V3 } • View definitions V Σ = {V1Σ , V2Σ , V3Σ }, with V1Σ : V1 (b, a) ← b (article) a V2Σ : V2 (p1 , p2 ) ← p1 (reference∗) p2 V3Σ : V3 (p, t) ← p (title) t • View extensions E = {E1 , E2 , E3 }, where – E1 stores for each bibliography its articles – E2 stores for each publ. the ones it references directly or indirectly – E3 stores for each publication its title D. Calvanese View-based Query Processing over Semistructured Data 28
  • 30. View-based query answering via rewriting certain answers answers we are to RmaxQ,V answers certQ,V interested in to Q RmaxQ,V Q View definition V Database schema V1 V2 … Vn R1 R2 … Rm … … View extension E Database B Note: the class C of queries in which to express the rewriting is fixed a priori. D. Calvanese View-based Query Processing over Semistructured Data 29
  • 31. View-based query rewriting for SSD Problem: Given a schema Σ, views V with definitions V Σ , a query Q over Σ, and a class C of queries over V , compute the query R over V such that: 1. R is a query in C . 2. R is a sound rewriting, i.e., for every database B over Σ and every view extension E that is sound wrt B (i.e., such that E ⊆ V Σ (B)), we have R(E) ⊆ Q(B). 3. R is the maximal (wrt containment) query in C satisfying condition (2). Such a query is called the C -maximal rewriting of Q wrt Σ and V , and is denoted rew C (Q, Σ, V). In the following, we assume that the class C of queries in which the rewriting is expressed coincides with that of Q (i.e., is that of RPQs / 2RPQs), and we do not mention it explicitly. D. Calvanese View-based Query Processing over Semistructured Data 30
  • 32. View-based query rewriting for 2RPQs – Example Consider three views with definitions: V1 (b, a) ← b (article) a V2 (p1 , p2 ) ← p1 (reference∗) p2 V3 (p, t) ← p (title) t Query: Q(x, y) ← x (article·(reference + reference−)∗·title) y • R(x, y) ← x (v1 ·v2 ·v3 ) y is an RPQ rewriting of Q • R(x, y) ← x (v1 ·(v2 + v2 −)·v3 ) y is a 2RPQ rewriting of Q • R(x, y) ← x (v1 ·(v2 + v2 −)∗·v3 ) y is a 2RPQ-maximal rewriting of Q that is also exact D. Calvanese View-based Query Processing over Semistructured Data 31
  • 33. Sound rewritings Since RPQs and 2RPQs are monotone queries, the following characterizations of a sound rewriting R of Q wrt Σ and V are equivalent: • For every database B over Σ and every view extension E with E ⊆ V Σ (B), we have that R(E) ⊆ Q(B). • For every database B over Σ and every view extension E with E = V Σ (B), we have that R(E) ⊆ Q(B). • For every database B over Σ, we have that R(V Σ (B)) ⊆ Q(B). Observe: The following two ways of computing R(V Σ (B)) are equivalent: • First evaluate the view definitions V Σ over B, and then evaluate R over the obtained view extension. • First expand in R the view symbols V with the corresponding definitions V Σ , and then evaluate the resulting query over B. D. Calvanese View-based Query Processing over Semistructured Data 32
  • 34. Expansion of an RPQ / 2RPQ We call a query over Σ a Σ-query, and a query over V a V -query. Consider: • a set of views V = {V1 , . . . , Vk} Σ • a set of view definitions (RPQs or 2RPQs) V Σ = {V1Σ , . . . , Vk }, with ViΣ : Vi(x, y) ← x Ei y • a V -query (RPQ or 2RPQ) R : R(x, y) ← x ER y Definition: The expansion R[V→VΣ ] of R wrt V Σ is the Σ-query obtained from R by replacing in ER: • each occurrence of a view symbol Vi with Ei. • each occurrence of an inverse view symbol Vi− with inv (Ei), where inv (p) = p− inv (e1 + e2 ) = inv (e1 ) + inv (e2 ) inv (p−) = p inv (e1 ·e2 ) = inv (e2 )·inv (e2 ) inv (e∗) = inv (e)∗ D. Calvanese View-based Query Processing over Semistructured Data 33
  • 35. Checking rewritings via expansion Hence, to check whether a rewriting R of Q wrt Σ and V Σ is sound, we can: 1. Compute R[V→VΣ ] by expanding each direct (resp., inverse) view symbol V (resp., V −) in R with the corresponding definition V Σ (resp., inv (V Σ )). 2. Check whether R[V→VΣ ] ⊑ Q (notice that both are Σ-queries). D. Calvanese View-based Query Processing over Semistructured Data 34
  • 36. Counterexample method for rewriting of RPQs Due to Language-Theoretic Lemma 1, we can treat RPQs simply as regular expressions. Consider a candidate rewriting: R = u1 · · · uk ∈ V ∗ • R is a bad rewriting of Q if R[V→VΣ ] ⊑ Q • R is a bad rewriting of Q if there are witnesses w1 , . . . , wk ∈ Σ∗ such that w1 · · · wk ⊑ Q, where wi ∈ L(VjΣ ) if ui = Vj . Example: Q = abcd, V = {V1 , V2 }, V1Σ = ab, V2Σ = cd • bad rewriting: V2 V1 , with witnesses cd and ab • good rewriting: V1 V2 D. Calvanese View-based Query Processing over Semistructured Data 35
  • 37. Rewriting of RPQs Construction is based on 1NFAs: 1. Construct a 1DFA AQ = (Σ, S, s0 , δ, F ) for Q ; exponential blowup 2. Construct the 1NFA Ab = (V, S, s0 , δ ′, S − F ), where sj ∈ δ ′(si, V ) iff there exists a word w ∈ L(V Σ ) s.t. sj ∈ δ ∗(si, w) Note: • AQ and Abad have the same states, but complementary final states. • Abad accepts a word u1 · · · uk ∈ V ∗ iff it is a bad rewriting of Q. The words w used in the definition of δ ′ act as witnesses. ; linear blowup 3. Complement Abad to get a 1NFA Arew for good rewritings. ; exponential blowup The construction yields the maximal path rewriting. It is represented by a 1DFA. ; The maximal path rewriting is an RPQ. D. Calvanese View-based Query Processing over Semistructured Data 36
  • 38. Rewriting of RPQs – Example Query: Q = a·(b·a + c)∗ Views: V = {V1 , V2 , V3 }, with V1Σ = a, V2Σ = a·c∗·b, V3Σ = c b, c V3 V2 V3 V1 , b a V1 V2 , V1 V3 a a, b, c V1 , V2 c V3 V2 Aq Abad Arew D. Calvanese View-based Query Processing over Semistructured Data 37
  • 39. Rewriting of RPQs – Complexity Checking nonemptiness of the maximal path-rewriting of an RPQ Query wrt a set of RPQ views is EXPSPACE-complete. Proof: • The 1DFA Arew is of double exponential size. • We can check for its non-emptiness on the fly in NLOGSPACE in the number of its states, i.e., in EXPSPACE in the size of Q • The matching lower-bound is by a reduction from an EXPSPACE-hard tiling problem. There exists an RPQ Query Q and RPQ views V over an alphabet Σ such that the smallest 1NFA for the rewriting rew (Q, Σ, V) is of double exponential size. D. Calvanese View-based Query Processing over Semistructured Data 38
  • 40. Exact rewritings Definition: A rewriting R of an RPQ Q wrt Σ and V Σ is exact if Q ⊑ R[V→VΣ ] . Note: since R is a rewriting, we also have R[V→VΣ ] ⊑ Q. To verify whether R is an exact rewriting of Q wrt Σ and V Σ : 1. Construct a 1NFA B over Σ accepting R[V→VΣ ] . It suffices to replace each Vi edge in R with a 1NFA for ViΣ . 2. Check whether L(AQ) ⊆ L(B), i.e., • complement B to obtain an 1NFA B , • check whether L(AQ) ∩ L(B) = ∅, i.e., whether L(AQ × B) = ∅. B may be of triply exponential size in Q but we can check its emptyness on-the-fly in 2EXPSPACE. D. Calvanese View-based Query Processing over Semistructured Data 39
  • 41. Rewriting of RPQs – Results Theorem: • Nonemptiness of the maximal rewriting is EXPSPACE-complete. • The maximal rewriting may be of double exponential size. • Existence of an exact rewriting is 2EXPSPACE-complete. D. Calvanese View-based Query Processing over Semistructured Data 40
  • 42. Rewriting of 2RPQs • The query and the view definitions are 2RPQs over Σ. • We look for rewritings that are a 2RPQ over V . • We consider again candidate rewritings and try to characterize bad rewritings. ∗ Candidate rewriting: R = u1 · · · uk ∈ V ± • To check whether a rewriting is bad, we need to expand both direct and inverse view symbols. • R is a bad rewriting of Q if R[V→VΣ ] ⊑ Q ∗ • R is a bad rewriting of Q if there are witnesses w1 , . . . , wk ∈ Σ± such that w1 · · · wk ⊑ Q, where – wi ∈ L(VjΣ ) if ui = Vj . – wi ∈ L(inv (VjΣ )) if ui = Vj−. D. Calvanese View-based Query Processing over Semistructured Data 41
  • 43. Counterexample method for rewriting of 2RPQs We consider counterexample words, which are obtained from a bad rewriting by inserting after each symbol u its witness w. Definition: Counterexample word u1 w1 · · · ukwk 1. wi ∈ L(VjΣ ) if ui = Vj . 2. wi ∈ L(inv (VjΣ )) if ui = Vj−. 3. w1 · · · wk ⊑ Q. Example: Q = abcd, V = {V1 , V2 }, V1Σ = ab, V2Σ = cd • bad rewriting: V2 V1 , with witnesses w1 = cd, w2 = ab • counterexample word: V2 cd V1 ab Checking counterexample words with 2NFAs • Check (1) and (2) with 2NFAs for VjΣ . • Use folding technique to construct 2NFA to check w1 · · · wk ⊑ Q. • Complement resulting 2NFA. ; Complexity is exponential D. Calvanese View-based Query Processing over Semistructured Data 42
  • 44. From counterexamples to rewritings To construct good rewritings: 1. Construct 1NFA A1 for counterexample words u1 w1 · · · ukwk. ; exponential 2. Project out witness words wi to get 1NFA A2 for bad rewritings u1 · · · uk. ; linear 3. Complement A2 to get 1NFA for good rewritings. ; exponential The construction yields the maximal two-way path rewriting It is represented by a 1DFA. ; The maximal two-way path rewriting is a 2RPQ. D. Calvanese View-based Query Processing over Semistructured Data 43
  • 45. Rewriting of 2RPQs – Results We get for 2RPQs the same upper (and lower) bounds as for RPQs. Theorem: • Nonemptiness of the maximal rewriting is EXPSPACE-complete. • The maximal rewriting may be of double exponential size. • Existence of an exact rewriting is 2EXPSPACE-complete. D. Calvanese View-based Query Processing over Semistructured Data 44
  • 46. VBQP over Semistructured Data – Outline √ The semistructured data model and regular path queries (RPQs) √ Containment of RPQs and 2RPQs √ View-based query rewriting for RPQs and 2RPQs ⇒ View-based query answering for RPQs and 2RPQs via automata 5. View-based query answering for RPQs and 2RPQs via CSP D. Calvanese View-based Query Processing over Semistructured Data 45
  • 47. View-based query answering for RPQs / 2RPQs Given: • a set V of view symbols with a corresponding set V Σ of RPQ / 2RPQ view definitions over a relational alphabet Σ • a corresponding set E of view extensions • a 2RPQ Q over Σ • a pair (c, d) of objects check whether (c, d) ∈ cert(Q, Σ, V Σ , E). In other words, check whether for every database B over Σ such that the view extension E is sound wrt B (i.e, E ⊆ V(B)), we have that (c, d) ∈ Q(B). D. Calvanese View-based Query Processing over Semistructured Data 46
  • 48. View-based query answering for 2RPQs – Idea We search for a counterexample database, i.e., a database B such that E is sound wrt B, but such that (c, d) is not in the answer to Q over B. Technique: 1. Encode counterexample databases as finite words. 2. Construct a 2NFA that accepts such words. 3. Check for emptiness of the automaton. D. Calvanese View-based Query Processing over Semistructured Data 47
  • 49. Canonical counterexample databases A database B is a counterexample to (c, d) ∈ cert(Q, Σ, V, E) if • E ⊆ V Σ (B), i.e., the view extension E is sound wrt B, • (c, d) ∈ Q(B). / Observations: • It is sufficient to restrict the attention to counterexamples of a special form (canonical databases) d1 d2 d3 r Set of objects in the view extension E : r p q q r ∆E = {d1 , d2 , d3 , d4 , d5 , . . .} d5 d4 p p • Each canonical database B can be represented as a word wB over ΣA = Σ± ∪ ∆E ∪ {$} of the form $ d1 w1 d2 $ d3 w2 d4 $ · · · $ d2m−1 wm d2m $ D. Calvanese View-based Query Processing over Semistructured Data 48
  • 50. Canonical DBs and 2NFAs d1 d2 d3 r r p q ∗ B: q r Q(x, y) ← x (r·(p + q)·p−·p·q − ) y d5 d4 p p Word representing B: $ d1 r d2 $ d4 p−p d5 $ d4 q − d2 $ d3 rr d3 $ d2 pq − d3 $ To verify whether (di, dj ) ∈ Q(B) we exploit that 2NFAs can: • move on the word both forward and backward • jump from one position in the word representing a node to any other position representing the same node (search mode) ; We can construct a 2NFA A(Q,di ,dj ) that accepts wB iff (di, dj ) ∈ Q(B) D. Calvanese View-based Query Processing over Semistructured Data 49
  • 51. View-based query answering for 2RPQs – Technique To check whether (c, d) ∈ cert(Q, Σ, V, E), we construct a 1NFA AQA as the / intersection of: ∗ • the 1NFA A0 that accepts ($·∆E ·Σ±·Σ± ·∆E )∗·$ • the 1NFAs corresponding to the various A(ViΣ ,a,b) (for each sound or exact view Vi, and for each pair (a, b) ∈ Ei) • the 1NFAs corresponding to the complement of each AVi (for each complete or exact view Vi, the 2NFA AVi checks whether a pair of objects other than those in Ei is in Vi(B)) • the 1NFA corresponding to the complement of A(Q,c,d) ; AQA accepts words representing counterexample DBs We have that (c, d) ∈ cert(Q, Σ, V, E) iff AQA is nonempty. / D. Calvanese View-based Query Processing over Semistructured Data 50
  • 52. Complexity of query answering Can be measured in three different ways: • Data complexity: as a function of the size of the view extension E • Expression complexity: as a function of the size of the query Q and of the view definitions V1Σ , . . . , VkΣ • Combined complexity: as a function of the size of both the view extension E and the expressions Q , V1Σ , . . . , VkΣ D. Calvanese View-based Query Processing over Semistructured Data 51
  • 53. View-based query answering for 2RPQs – Upper bounds • All 2NFAs are of linear size in the size of Q , all views in V and the view extensions E . ; The corresponding 1NFA would be exponential. • However, we can avoid the explicit construction of AQA , and we can construct it on the fly while checking for nonemptiness. ; View-based query answering for 2RPQs is in PSPACE wrt expression complexity and combined complexity. PSPACE-hardness follows immediately by reduction from universality of regular languages. What about data complexity, i.e., complexity in the size of the view extension E ? D. Calvanese View-based Query Processing over Semistructured Data 52
  • 54. VBQP over Semistructured Data – Outline √ The semistructured data model and regular path queries (RPQs) √ Containment of RPQs and 2RPQs √ View-based query rewriting for RPQs and 2RPQs √ View-based query answering for RPQs and 2RPQs via automata ⇒ View-based query answering for RPQs and 2RPQs via CSP D. Calvanese View-based Query Processing over Semistructured Data 53
  • 55. View-based query answering for 2RPQs – Data complexity • In the previous algorithm one cannot immediately single out the contribution of E to the nonemptiness check of the automaton AQA . • This can however be done by analyzing the transformation from 2NFA to 1NFA, and modifying the construction of the automata to avoid search mode. We look at an alternative way to analyze data complexity, derived from a tight connection between view-based query answering under sound views and constraint satisfaction (CSP). ; Better insight into view-based query answering for RPQs and 2RPQs. ; Several additional results on various forms of view-based query processing. D. Calvanese View-based Query Processing over Semistructured Data 54
  • 56. Constraint satisfaction problems Let A and B be relational structures over the same alphabet. A homomorphism h is a mapping from A to B such that for every relation R, if (c1 , . . . , cn) ∈ R(A), then (h(c1 ), . . . , h(cn)) ∈ R(B). Non-uniform constraint satisfaction problem CSP (B): the set of relational structures A such that there is a homomorphism from A to B. Complexity: • CSP (B) is in NP. • There are structures B for which CSP (B) is NP-hard. Example: p p p D. Calvanese View-based Query Processing over Semistructured Data 55
  • 57. CSP and VBQA for 2RPQs – Constraint template From Q and V , we can define a relational structure T = CT Q,V , called constraint template of Q wrt V : • The vocabulary of T is {R1 , . . . , Rk} ∪ {Uini , Ufin }, where – each Ri corresponds to a view Vi and denotes a binary predicate – Uini and Ufin denote unary predicates • Let Q be represented by a 1NFA (Σ±, S, S0 , δ, F ): – The domain ∆T of T is 2S , i.e., all sets of states of Q – σ ∈ Uini (T ) iff S0 ⊆ σ – σ ∈ Ufin (T ) iff σ ∩ F = ∅ – (σ1 , σ2 ) ∈ Ri(T ) iff there exists a word p1 · · · pk ∈ L(ViΣ ) and a sequence T0 , . . . , Tk of subsets of S such that: 1. T0 = σ1 and Tk = σ2 2. if s ∈ Ti and t ∈ δ(s, pi+1 ), then t ∈ Ti+1 3. if s ∈ Ti and t ∈ δ(s, p−), then t ∈ Ti−1 i D. Calvanese View-based Query Processing over Semistructured Data 56
  • 58. CSP and VBQA for 2RPQs – Constraint instance Observations: • Intuitively, the constraint template CT Q,V encodes how the states of Q change when moving along a database according to the views. • The existence of a word p1 · · · pk ∈ L(ViΣ ) and of a sequence T0 , . . . , Tk of subsets of S satisfying conditions 1–3 can be checked in PSPACE. • CT Q,V can be computed in time exponential in Q and polynomial in V . From E and two objects c and d, we can define another relational structure I = CI c,d over the same vocabulary, called the constraint instance: E • The domain ∆I of I is ∆E ∪ {c, d} • Ri(I) = Ei, for i ∈ {1, . . . , k} • Uini (I) = {c} • Ufin (I) = {d} D. Calvanese View-based Query Processing over Semistructured Data 57
  • 59. CSP and view-based query answering for 2RPQs Theorem: (c, d) is not a certain answer to Q wrt V and E if and only if there is a homomorphism from CI c,d to CT Q,V , i.e, CI c,d ∈ CSP (CT Q,V ) E E ; Characterization of view-based query answering for 2RPQs in terms of CSP ; View-based query answering for 2RPQs is in coNP wrt data complexity D. Calvanese View-based Query Processing over Semistructured Data 58
  • 60. Query answering for RPQs — Data complexity lower bound coNP-hard wrt data compexity, by reduction from 3-coloring: Views: Vs = Sr + Sg + Sb VG = Rrg + Rgr + Rrb + Rbr + Rgb + Rbg Vf = Fr + Fg + Fb Query: Q = x=y Sx·Fy + x=y∨w=z Sx·Ryw ·Fy (color mismatch) Only domain and view extensions depend on graph G = (N, E). c Domain: ∆E = N ∪ {c, d} Vs 1 Extensions: Es = {(c, a) | a ∈ N } EG = {(a, b), (b, a) | (a, b) ∈ E} VG 2 3 Ef = {(a, d) | a ∈ N } 4 Vf d Thm: G is 3-colorable iff (c, d) is not a certain answer to Q. Note: we have used only a query and views that are unions of simple paths. D. Calvanese View-based Query Processing over Semistructured Data 59
  • 61. Complexity of view-based query answering for RPQs / 2RPQs Assumption on Assumption on Complexity domain views data expression combined all sound coNP coNP coNP closed all exact coNP coNP coNP arbitrary coNP coNP coNP all sound coNP PSPACE PSPACE open all exact coNP PSPACE PSPACE arbitrary coNP PSPACE PSPACE From [ICDE’00] for RPQs and [PODS’00, ICDT’05] for 2RPQs. D. Calvanese View-based Query Processing over Semistructured Data 60
  • 62. Consequence of complexity results + The view-based query answering algorithm provides a set of answers that is sound and complete. – A coNP data complexity does not allow for effective deployment of the query answering algorithm. Note that coNP-hardness holds already for queries and views that are unions of simple paths (no reflexive-transitive closure). ; Adopt an indirect approach to view-based query answering, via query rewriting. D. Calvanese View-based Query Processing over Semistructured Data 61
  • 63. Query answering by rewriting To answer a query Q wrt V and E : 1. re-express Q in terms of the view symbols, i.e., compute a rewriting of Q . 2. directly evaluate the rewriting over E . Comparison with direct approach to query answering: + We can consider rewritings in a class with polynomial data complexity (e.g., 2RPQs) ; the data complexity for query answering is polynomial. +/– We have traded expression complexity for data complexity. – We may lose completeness (i.e., not obtain all certain answers). We need to establish the “quality” of a rewriting: • When does the (maximal) rewriting compute all certain answers? • What do we gain or lose by considering rewritings in a given class? D. Calvanese View-based Query Processing over Semistructured Data 62