Data Warehousing
CPS 216
Advanced Database Systems

Review
Data warehousing: integrating data for OLAP
• OLAP versus OLTP
• Warehousing versus mediation
• Warehouse maintenance
   – Warehouse data as materialized views
   – Recomputation versus incremental maintenance
   – Self-maintenance




Today
• Star, snowflake, and cube
• ROLAP and MOLAP algorithms

Star schema
Dimension table: Product
   PID   name     cost
   p1    beer     10
   p2    diaper   16
   …     …        …
Dimension table: Store
   SID   city
   s1    Durham
   s2    Chapel Hill
   s3    RTP
   …     …
Fact table: Sale
   OID   date         CID   PID   SID   qty   price
   100   11/23/2001   c3    p1    s1    1     12
   102   12/12/2001   c3    p2    s1    2     17
   105   12/24/2001   c5    p1    s3    5     13
   …     …            …     …     …     …     …
   • Big
   • Constantly growing
   • Stores measures (often aggregated in queries)
Dimension table: Customer
   CID   name   address          city
   c3    Amy    100 Main St.     Durham
   c4    Ben    102 Main St.     Durham
   c5    Carl   800 Eighth St.   Durham
   …     …      …                …
   • Small
   • Updated infrequently
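To make the star schema concrete, here is a minimal sketch that loads the sample rows from the slide into an in-memory SQLite database and joins the fact table with one dimension table. The column types and the helper names `build_star_schema` and `qty_by_store_city` are assumptions made for illustration; the slide only gives the table contents.

```python
import sqlite3

def build_star_schema():
    """Create an in-memory database holding the sample rows from the slide."""
    conn = sqlite3.connect(":memory:")
    # Column types are assumed; the slide shows only names and values.
    conn.executescript("""
        CREATE TABLE Product  (PID TEXT PRIMARY KEY, name TEXT, cost INTEGER);
        CREATE TABLE Store    (SID TEXT PRIMARY KEY, city TEXT);
        CREATE TABLE Customer (CID TEXT PRIMARY KEY, name TEXT,
                               address TEXT, city TEXT);
        CREATE TABLE Sale     (OID INTEGER PRIMARY KEY, date TEXT, CID TEXT,
                               PID TEXT, SID TEXT, qty INTEGER, price INTEGER);
    """)
    conn.executemany("INSERT INTO Product VALUES (?, ?, ?)",
                     [("p1", "beer", 10), ("p2", "diaper", 16)])
    conn.executemany("INSERT INTO Store VALUES (?, ?)",
                     [("s1", "Durham"), ("s2", "Chapel Hill"), ("s3", "RTP")])
    conn.executemany("INSERT INTO Customer VALUES (?, ?, ?, ?)",
                     [("c3", "Amy", "100 Main St.", "Durham"),
                      ("c4", "Ben", "102 Main St.", "Durham"),
                      ("c5", "Carl", "800 Eighth St.", "Durham")])
    conn.executemany("INSERT INTO Sale VALUES (?, ?, ?, ?, ?, ?, ?)",
                     [(100, "11/23/2001", "c3", "p1", "s1", 1, 12),
                      (102, "12/12/2001", "c3", "p2", "s1", 2, 17),
                      (105, "12/24/2001", "c5", "p1", "s3", 5, 13)])
    return conn

def qty_by_store_city(conn):
    """Typical OLAP query: join the fact table with a dimension, then aggregate."""
    return conn.execute("""
        SELECT Store.city, SUM(Sale.qty)
        FROM Sale JOIN Store ON Sale.SID = Store.SID
        GROUP BY Store.city
        ORDER BY Store.city
    """).fetchall()

print(qty_by_store_city(build_star_schema()))
```

The measure (qty) lives only in the fact table; the dimension table supplies the attribute used for grouping.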




Dimension hierarchies
Product → Brand
Product → Product type → Product category

Product
   PID   brandID   typeID
   p1    b1        t1
   p2    b2        t2
   p3    b1        t4
   …     …         …
Brand
   brandID   name
   b1        Huggies
   b2        Budweiser
   …         …
Product type
   typeID   name       catID
   t1       diaper     cat7
   t2       beer       cat9
   t3       juice      cat9
   t4       pacifier   cat7
   …        …          …
Product category
   catID   name
   cat7    babies
   cat9    beverages
   …       …

Dimension hierarchies + star schema = snowflake schema

Star join indexes
• Queries frequently join fact table with dimension tables
» Materialize the join result to speed up queries
• For each combination of dimension attribute values, store the list of tuple ID’s in the fact table
   – Brand name, store city, customer city → sales records; Product type, store city → sales records; etc.
   – Conceptually, multi-attribute indexes on the join result
• One index to support each combination of selection conditions on attributes?
   – Too many indexes!
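The star join index idea can be sketched as a dictionary from a combination of dimension attribute values to the list of matching fact tuple IDs. The rows below are taken from the star schema slide; using the list position as the tuple ID, the choice of (product name, store city) as the indexed combination, and the function name `build_star_join_index` are assumptions for illustration.

```python
from collections import defaultdict

# Dimension attributes, keyed by the dimension table's ID.
product_name = {"p1": "beer", "p2": "diaper"}
store_city = {"s1": "Durham", "s2": "Chapel Hill", "s3": "RTP"}

# Fact rows (CID, PID, SID, qty); the list position plays the role of tuple ID.
sale = [
    ("c3", "p1", "s1", 1),
    ("c3", "p2", "s1", 2),
    ("c5", "p1", "s3", 5),
]

def build_star_join_index(fact_rows):
    """Map (product name, store city) to the tuple IDs of matching fact rows."""
    index = defaultdict(list)
    for tid, (_cid, pid, sid, _qty) in enumerate(fact_rows):
        index[(product_name[pid], store_city[sid])].append(tid)
    return dict(index)

index = build_star_join_index(sale)
print(index[("beer", "Durham")])
```

One such index answers only queries on its own attribute combination, which is why the slide concludes that covering every combination needs too many indexes.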




Bitmap join indexes
» O’Neil & Quass, SIGMOD 1997
• Bitmap and projection indexes for each dimension attribute
   – Value of the dimension attribute ↔ tuple ID’s in the fact table
• To process an arbitrary combination of selection conditions, use bitmap indexes
   – Bitmaps can be combined efficiently
• To retrieve attribute values for output, use projection indexes

Data cube
Simplified schema: Sale(CID, PID, SID, qty)
[Figure: three-dimensional cube with axes Customer (ALL, c3, c4, c5), Product (ALL, p1, p2), and Store (ALL, s1, s2, s3). Points: (c3, p1, s1) = 1, (c5, p1, s1) = 3, (c3, p2, s1) = 2, (c5, p1, s3) = 5]
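A minimal sketch of the bitmap join index idea, using Python integers as bitmaps (bit i is set when fact tuple i matches). The rows come from the star schema slide; the helper names and the use of a plain list as the projection index are assumptions for illustration, not the layout from O’Neil & Quass.

```python
product_name = {"p1": "beer", "p2": "diaper"}
store_city = {"s1": "Durham", "s2": "Chapel Hill", "s3": "RTP"}

# Fact rows (CID, PID, SID, qty); the list position is the tuple ID.
sale = [("c3", "p1", "s1", 1), ("c3", "p2", "s1", 2), ("c5", "p1", "s3", 5)]

def build_bitmap_join_index(fact_rows, fk_position, dim_attr):
    """Map each dimension attribute value to a bitmap over fact tuple IDs."""
    index = {}
    for tid, row in enumerate(fact_rows):
        value = dim_attr[row[fk_position]]
        index[value] = index.get(value, 0) | (1 << tid)
    return index

def tuple_ids(bitmap):
    """List the positions of the set bits, lowest first."""
    ids, tid = [], 0
    while bitmap:
        if bitmap & 1:
            ids.append(tid)
        bitmap >>= 1
        tid += 1
    return ids

by_product_name = build_bitmap_join_index(sale, 1, product_name)
by_store_city = build_bitmap_join_index(sale, 2, store_city)
qty_projection = [row[3] for row in sale]  # projection index on the measure

def select_qty(name, city):
    """Combine selection conditions with a bitwise AND, then fetch the output."""
    bitmap = by_product_name.get(name, 0) & by_store_city.get(city, 0)
    return [qty_projection[tid] for tid in tuple_ids(bitmap)]

print(select_qty("beer", "Durham"))
```

Only one bitmap index per attribute is needed; any combination of conditions is handled by ANDing the relevant bitmaps at query time.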




Completing the cube (slide 1)
Total quantity of sales for each product in each store
SELECT SUM(qty) FROM Sale
GROUP BY PID, SID;
Project all points onto Product-Store plane
[Figure: the data cube with the projected points added: (ALL, p1, s1) = 4, (ALL, p2, s1) = 2, (ALL, p1, s3) = 5]

Completing the cube (slide 2)
Total quantity of sales for each product
SELECT SUM(qty) FROM Sale GROUP BY PID;
Further project points onto Product axis
[Figure: the data cube with the projected points added: (ALL, p1, ALL) = 9, (ALL, p2, ALL) = 2]




Completing the cube (slide 3)
Total quantity of sales
SELECT SUM(qty) FROM Sale;
Further project points onto the origin
[Figure: the data cube with the projected point added: (ALL, ALL, ALL) = 11]

CUBE operator
» Gray et al., ICDE 1996
• Sale(CID, PID, SID, qty)
• Proposed SQL extension:
  SELECT SUM(qty) FROM Sale
  GROUP BY CUBE CID, PID, SID;
• Output contains:
   – Normal groups produced by GROUP BY
      • (c1, p1, s1, sum), (c1, p2, s3, sum), etc.
   – Groups with one or more ALL’s
      • (ALL, p1, s1, sum), (c2, ALL, ALL, sum), (ALL, ALL, ALL, sum), etc.
• Can you write a CUBE query using only GROUP BY’s?
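One way to approach the closing question on the CUBE operator slide is to run one ordinary GROUP BY per subset of the dimensions and take the union of the results. The sketch below does this in plain Python over the four points of the data cube figure; the function names `group_by` and `cube` and the string "ALL" as a placeholder value are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"

# Points from the data cube figure: (CID, PID, SID, qty).
sale = [("c3", "p1", "s1", 1), ("c5", "p1", "s1", 3),
        ("c3", "p2", "s1", 2), ("c5", "p1", "s3", 5)]

def group_by(rows, keep):
    """SUM(qty) grouped by the dimension positions in `keep`; others become ALL."""
    totals = defaultdict(int)
    for *dims, qty in rows:
        key = tuple(d if i in keep else ALL for i, d in enumerate(dims))
        totals[key] += qty
    return totals

def cube(rows, n_dims=3):
    """Union of the 2**n_dims GROUP BY results, one per subset of dimensions."""
    result = {}
    for k in range(n_dims + 1):
        for keep in combinations(range(n_dims), k):
            result.update(group_by(rows, set(keep)))
    return result

full_cube = cube(sale)
print(full_cube[(ALL, "p1", "s1")], full_cube[(ALL, "p1", ALL)],
      full_cube[(ALL, ALL, ALL)])
```

With three dimensions this needs 2³ = 8 separate GROUP BY passes over the data, which is the cost a dedicated CUBE operator tries to avoid.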




ROLLUP operator
• Sometimes CUBE is too much
   – (…, state, city, street, …, age, DOB, …)
   – CUBE state, city, street returns meaningless groups
      • (ALL, ALL, ’Main Street’): sales on any Main Street?
   – CUBE age, DOB returns useless groups
      • (ALL, DOB): DOB functionally determines age!
• Proposed SQL extension:
  GROUP BY ROLLUP state, city, street;
• Output contains groups with ALL’s only as suffix
   – (’NC’, ’Durham’, ’Main Street’), (’NC’, ’Durham’, ALL), (’NC’, ALL, ALL), (ALL, ALL, ALL)
   – But not (ALL, ALL, ’Main Street’) or (ALL, ’Durham’, ALL)

Computing GROUP BY
• ROLAP (Relational OLAP)
   – Use standard relational engine
   – Sorting and clustering
   – Using indexes
   – Automatic summary tables
• MOLAP (Multidimensional OLAP)
   – Use a sparse multidimensional array
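The ROLLUP output described on the slide can be sketched by adding each row's measure to every prefix of its grouping attributes, padding the rest with ALL. The sample rows and their quantities below are hypothetical (the slide gives no data for this example), as are the function name `rollup` and the "ALL" placeholder string.

```python
from collections import defaultdict

ALL = "ALL"

# Hypothetical rows: (state, city, street, qty).
rows = [
    ("NC", "Durham", "Main Street", 3),
    ("NC", "Durham", "Ninth Street", 2),
    ("NC", "Raleigh", "Main Street", 4),
]

def rollup(rows, n_dims):
    """SUM the measure for every prefix of the grouping attributes."""
    totals = defaultdict(int)
    for *dims, measure in rows:
        for k in range(n_dims, -1, -1):
            # Keep the first k attributes; ALL appears only as a suffix.
            key = tuple(dims[:k]) + (ALL,) * (n_dims - k)
            totals[key] += measure
    return dict(totals)

result = rollup(rows, 3)
print(result[("NC", "Durham", ALL)], result[("NC", ALL, ALL)])
print((ALL, ALL, "Main Street") in result)
```

For n attributes, ROLLUP produces n + 1 grouping levels instead of the 2ⁿ produced by CUBE.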




Sorting and clustering
• Sort (or cluster, e.g., using hashing) tuples according to GROUP BY attributes
   – Tuples in the same group are processed together
   – Only one intermediate aggregate result needs to be kept, so the memory requirement is low
• What if GROUP BY attributes ≠ sort attributes?
   – Still fine if GROUP BY attributes form a prefix of the sort order
   – Otherwise, need to keep intermediate aggregate results around

More on sort order
• Sort by the order in which GROUP BY attributes appear?
   – Not necessary; e.g., GROUP BY PID, SID can be processed just as efficiently by sorting on SID, PID
• Sort by the order in which GROUP BY ROLLUP attributes appear?
   – Useful; e.g., GROUP BY ROLLUP state, city, street can be processed efficiently by sorting on state, city, street, but not by sorting on street, city, state
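The sorting-based approach can be sketched as a single pass that keeps exactly one running aggregate. The rows below are the data cube points rearranged as (PID, SID, CID, qty) and already sorted on (PID, SID); the generator name `sorted_group_by` is an assumption for illustration.

```python
# Data cube points as (PID, SID, CID, qty), sorted on PID, then SID.
rows = [
    ("p1", "s1", "c3", 1),
    ("p1", "s1", "c5", 3),
    ("p1", "s3", "c5", 5),
    ("p2", "s1", "c3", 2),
]

def sorted_group_by(sorted_rows, key_len):
    """SUM the last field per group; the group key is the first key_len fields.

    Assumes the rows are sorted so that the group key is a prefix of the
    sort order; only one running total is held at any time.
    """
    current_key, running = None, 0
    for row in sorted_rows:
        key = row[:key_len]
        if key != current_key:
            if current_key is not None:
                yield current_key, running  # previous group is complete
            current_key, running = key, 0
        running += row[-1]
    if current_key is not None:
        yield current_key, running

print(list(sorted_group_by(rows, 2)))  # GROUP BY PID, SID
print(list(sorted_group_by(rows, 1)))  # GROUP BY PID: a prefix of the sort order
```

The same sorted input serves both calls because PID alone is a prefix of (PID, SID); grouping by SID alone would not work this way, since tuples with the same SID are not adjacent.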




Using bitmap join indexes
» O’Neil & Quass, SIGMOD 1997
• Use the bitmap join indexes on GB1, GB2, …, GBk
• For each value v1 of GB1 in order:
    For each value v2 of GB2 in order: …
       For each value vk of GBk in order:
         Intersect bitmaps to locate tuples;
         Retrieve their measures;
         Calculate aggregate for group (v1, v2, …, vk);
• Helps if data is sorted by GB1, GB2, …, GBk
   – So measures in the same group are clustered

Automatic summary tables
• Computing GROUP BY aggregates is expensive
• OLAP queries perform GROUP BY all the time
• Idea: precompute and store the aggregates!
» Automatic summary tables
   – Maintained automatically as base data changes
   – Just another index/materialized view
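The nested-loop procedure on the bitmap join index slide can be sketched with Python integers as bitmaps. For brevity the bitmaps here are built directly on the fact table's PID and SID columns rather than on joined dimension attributes, and the function names are assumptions for illustration.

```python
from itertools import product

# Data cube points: (CID, PID, SID, qty); the list position is the tuple ID.
fact = [("c3", "p1", "s1", 1), ("c5", "p1", "s1", 3),
        ("c3", "p2", "s1", 2), ("c5", "p1", "s3", 5)]

def bitmap_index(column):
    """Map each distinct value to a bitmap whose bit i marks tuple i."""
    index = {}
    for tid, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << tid)
    return index

def group_by_with_bitmaps(indexes, measures):
    """For each value combination in order: intersect bitmaps, sum measures."""
    result = {}
    for values in product(*(sorted(ix) for ix in indexes)):
        bitmap = (1 << len(measures)) - 1  # start with every tuple selected
        for ix, value in zip(indexes, values):
            bitmap &= ix[value]
        if bitmap:  # skip combinations that match no tuple
            result[values] = sum(m for tid, m in enumerate(measures)
                                 if (bitmap >> tid) & 1)
    return result

pid_index = bitmap_index([row[1] for row in fact])
sid_index = bitmap_index([row[2] for row in fact])
qty = [row[3] for row in fact]

print(group_by_with_bitmaps([pid_index, sid_index], qty))
```

The measure lookups jump around the fact table unless it is sorted on the grouping attributes, which is the point of the slide's last bullet.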




Aggregation view lattice
[Figure: lattice of GROUP BY views, top to bottom]
   Level 3: GROUP BY CID, PID, SID
   Level 2: GROUP BY CID, PID; GROUP BY CID, SID; GROUP BY PID, SID
   Level 1: GROUP BY CID; GROUP BY PID; GROUP BY SID
   Level 0: GROUP BY ∅
A child can be computed from any parent

Selecting views to materialize
• Factors in deciding what to materialize
   – What is its storage cost?
   – What is its update cost?
   – Which queries can benefit from it?
   – How much can a query benefit from it?
• Example
   – GROUP BY ∅ is small, but not useful to most queries
   – GROUP BY CID, PID, SID is useful to any query, but too large to be beneficial
» Harinarayan et al., SIGMOD 1996; Gupta & Mumick, ICDE 1999
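The remark that a child in the lattice can be computed from any parent can be checked with a small sketch. It holds here because SUM can be re-aggregated from partial sums; the function name `roll_up` and the dictionary representation of a materialized view are assumptions for illustration.

```python
from collections import defaultdict

# GROUP BY CID, PID, SID view over the data cube points: key -> SUM(qty).
base = {("c3", "p1", "s1"): 1, ("c5", "p1", "s1"): 3,
        ("c3", "p2", "s1"): 2, ("c5", "p1", "s3"): 5}

def roll_up(parent, parent_dims, child_dims):
    """Aggregate a materialized SUM view down to a subset of its attributes."""
    positions = [parent_dims.index(d) for d in child_dims]
    child = defaultdict(int)
    for key, total in parent.items():
        child[tuple(key[i] for i in positions)] += total
    return dict(child)

by_cid_pid = roll_up(base, ("CID", "PID", "SID"), ("CID", "PID"))
by_pid_sid = roll_up(base, ("CID", "PID", "SID"), ("PID", "SID"))

# GROUP BY PID computed from two different parents.
from_cid_pid = roll_up(by_cid_pid, ("CID", "PID"), ("PID",))
from_pid_sid = roll_up(by_pid_sid, ("PID", "SID"), ("PID",))
print(from_cid_pid == from_pid_sid, from_cid_pid)
```

Both parents give the same child, so the choice of parent is purely a cost decision, which is what makes view selection an optimization problem.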




Interlude: TPC-D, -H, and -R
• TPC-D: standard OLAP benchmark until 1999
  – With aggressive use of precomputation techniques (materialized views, automatic summary tables), vendors were able to “cheat” and achieve amazing performance
• Now, TPC-D has been replaced by
  – TPC-H: ad hoc OLAP queries
     • Cannot use materialized views
  – TPC-R: business-reporting OLAP queries
     • Can use materialized views
» http://www.tpc.org/

From tables to arrays
» Zhao et al., SIGMOD 1997
• “Chunk” an n-dimensional cube into n-dimensional subcubes
   – For a dense chunk (>40% full), store it as is
   – For a sparse chunk (<40% full), compress it using <coordinate, value> pairs
• To convert a table into chunks
   – Pass 1: Partition table into memory-size partitions, each of which contains a number of chunks
   – Pass 2: Read partitions back in one at a time, and chunk each partition in memory
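The dense/sparse storage rule from the "From tables to arrays" slide can be sketched as follows. The function name `store_chunk`, the use of `None` for an empty cell, the use of the cell's offset within the chunk as its coordinate, and treating a chunk that is exactly 40% full as sparse are all assumptions; the slide does not specify them.

```python
DENSE_THRESHOLD = 0.40  # fill fraction from the slide

def store_chunk(cells):
    """Choose a representation for one chunk.

    cells: flat list of the chunk's values, with None marking an empty cell.
    Returns ("dense", values) or ("sparse", [(offset, value), ...]).
    """
    filled = [(offset, v) for offset, v in enumerate(cells) if v is not None]
    if len(filled) > DENSE_THRESHOLD * len(cells):
        return ("dense", list(cells))  # store it as is
    return ("sparse", filled)          # keep only <coordinate, value> pairs

print(store_chunk([1, None, 2, None]))  # 50% full
print(store_chunk([None] * 9 + [7]))    # 10% full
```

A sparse chunk with few filled cells takes space proportional to the number of values it holds rather than to the chunk's volume.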




Dimension order
[Figure: 4 × 4 × 4 array of chunks numbered 1 to 64, with axes Dimension A, Dimension B, and Dimension C. Chunk numbers increase first along A (1 to 4), then along B (5, 9, 13), then along C (17, 33, 49)]
Dimension order: A, B, C
Sort order: C, B, A

Basic array cubing
GROUP BY B, C
Memory required: 1 chunk (could be more)
[Figure: the same 4 × 4 × 4 array of chunks, aggregated along Dimension A onto the B-C plane]
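The memory figure for GROUP BY B, C on the basic array cubing slide can be reproduced with a small simulation that scans the chunks in numbered order and counts how many partially aggregated result chunks are open at once. The scan order (first dimension varies fastest) is read off the chunk numbering in the figure; the function name `max_live_chunks` is an assumption for illustration.

```python
from itertools import product

def max_live_chunks(sizes, drop):
    """Peak number of result chunks held while aggregating away one dimension.

    sizes: number of chunks along each dimension, in dimension order.
    drop:  0-based index of the dimension being aggregated away.
    Chunks are scanned with the first dimension varying fastest.
    """
    scan = [tuple(reversed(c))
            for c in product(*(range(s) for s in reversed(sizes)))]

    def result_key(coord):
        return coord[:drop] + coord[drop + 1:]

    # Position at which each result chunk receives its last contribution.
    last_seen = {result_key(coord): pos for pos, coord in enumerate(scan)}

    live, peak = set(), 0
    for pos, coord in enumerate(scan):
        key = result_key(coord)
        live.add(key)
        peak = max(peak, len(live))
        if last_seen[key] == pos:
            live.remove(key)  # complete: can be written out
    return peak

sizes = (4, 4, 4)  # chunks along A, B, C as in the figure
print(max_live_chunks(sizes, 0))  # aggregate away A: GROUP BY B, C
print(max_live_chunks(sizes, 1))  # aggregate away B: GROUP BY A, C
print(max_live_chunks(sizes, 2))  # aggregate away C: GROUP BY A, B
```

Aggregating away the fastest-varying dimension finishes each result chunk after four consecutive input chunks, so one result chunk suffices; dimensions later in the order keep more result chunks open.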




Minimal spanning tree
• Recall the aggregation lattice
• MST of the lattice: parent is always chosen to be the one with minimum size
• Compute each node from its parent in the MST
[Figure: aggregation lattice annotated with view sizes]
   GROUP BY A, B, C: 100
   GROUP BY A, B: 20; GROUP BY A, C: 10; GROUP BY B, C: 50
   GROUP BY A: 2; GROUP BY B: 10; GROUP BY C: 5
   GROUP BY ∅: 1

Multiway array cubing
• Goal: compute all aggregates at the same time in a single pass over the array, using minimum amount of memory
• GROUP BY B, C requires 1 chunk
• GROUP BY A, C requires 4 chunks
• GROUP BY A, B requires 16 chunks
[Figure: 4 × 4 × 4 array of chunks numbered 1 to 64, with axes Dimension A, Dimension B, and Dimension C]
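The parent-selection rule on the minimal spanning tree slide can be sketched directly from the view sizes shown in the lattice figure. Representing each view as a `frozenset` of its grouping attributes and the function names `mst_parents` and `label` are assumptions for illustration.

```python
# View sizes from the lattice figure, keyed by the set of grouping attributes.
sizes = {
    frozenset("ABC"): 100,
    frozenset("AB"): 20, frozenset("AC"): 10, frozenset("BC"): 50,
    frozenset("A"): 2, frozenset("B"): 10, frozenset("C"): 5,
    frozenset(): 1,
}

def mst_parents(sizes):
    """For each view, pick the smallest view one level up that contains it."""
    parents = {}
    for view in sizes:
        candidates = [p for p in sizes
                      if len(p) == len(view) + 1 and view < p]
        if candidates:  # the top of the lattice has no parent
            parents[view] = min(candidates, key=sizes.get)
    return parents

def label(view):
    """Readable name for a view, e.g. 'AC', or '∅' for the empty grouping."""
    return "".join(sorted(view)) or "∅"

parents = mst_parents(sizes)
for view, parent in sorted(parents.items(), key=lambda kv: (-len(kv[0]), label(kv[0]))):
    print(f"GROUP BY {label(view)} <- GROUP BY {label(parent)}")
```

With these sizes, GROUP BY A and GROUP BY C are both computed from GROUP BY A, C (size 10), the smallest of their candidate parents.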




Memory requirement
• Dimension order is D1, D2, …, Dn
• Aggregate to compute projects out Dp (i.e., GROUP BY D1, …, Dp – 1, Dp + 1, …, Dn)
• The memory required is roughly | D1 | · | D2 | · … · | Dp – 1 | chunks
  – Where | Di | denotes the number of chunks along Di
» It is harder to aggregate away dimensions that come later in the dimension order

Minimum-memory spanning tree
• MMST of the aggregation lattice
   – Parent is always chosen to be the one that makes the child require the minimum memory to compute
   – Note that results are produced in dimension order too, so computation of the entire MMST can be pipelined
• Choose an optimal dimension order to minimize the total amount of memory required by MMST
   – It turns out that this optimal order is D1, D2, …, Dn, where | D1 | ≤ | D2 | ≤ … ≤ | Dn |
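The memory estimate on the memory requirement slide can be evaluated with a short sketch. The function names are assumptions, and `first_level_total` counts only the n aggregates one level below the full cube, which is a simplification of the total over the whole MMST; the chunk counts (8, 2, 4) are hypothetical.

```python
from itertools import permutations
from math import prod

def chunks_needed(chunks_per_dim, p):
    """Chunks of memory to aggregate away D_p (1-indexed): |D1| * ... * |D_{p-1}|."""
    return prod(chunks_per_dim[:p - 1])  # empty product is 1 when p == 1

def first_level_total(chunks_per_dim):
    """Sum of the estimate over the n aggregates that each drop one dimension."""
    n = len(chunks_per_dim)
    return sum(chunks_needed(chunks_per_dim, p) for p in range(1, n + 1))

def best_order(chunks_per_dim):
    """Dimension order minimizing first_level_total, found by brute force."""
    return min(permutations(chunks_per_dim), key=first_level_total)

# The 4 x 4 x 4 example: GROUP BY B,C / A,C / A,B.
print([chunks_needed((4, 4, 4), p) for p in (1, 2, 3)])

# Hypothetical chunk counts per dimension.
print(best_order((8, 2, 4)), first_level_total((2, 4, 8)))
```

On this small example the brute-force search picks the ascending order, in line with the slide's claim that the dimension with the fewest chunks should come first.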




ROLAP versus MOLAP
• Multiway array cubing algorithm (MOLAP) beats sorting-based ROLAP algorithms
  – Compressed array representation is more compact than table representation
  – Sorting-based ROLAP spends too much time on comparing and copying
  – In MOLAP, order is implied by the array positions
» An alternative ROLAP technique
  – Convert table to array
  – Do MOLAP processing
  – Dump the result cube to a table

Next time
Data mining




 
Goods Receipt Document Imaging
Goods Receipt Document Imaging Goods Receipt Document Imaging
Goods Receipt Document Imaging
 
Tridiagona Solutions Corporate Presentation
Tridiagona Solutions Corporate PresentationTridiagona Solutions Corporate Presentation
Tridiagona Solutions Corporate Presentation
 
Reporting _ Paul Vella _ OBI Analytics for JDE.pdf
Reporting _ Paul Vella _ OBI Analytics for JDE.pdfReporting _ Paul Vella _ OBI Analytics for JDE.pdf
Reporting _ Paul Vella _ OBI Analytics for JDE.pdf
 
Olap scalability
Olap scalabilityOlap scalability
Olap scalability
 
Business Intelligence and Analytics: A Command and Control Center for Supply ...
Business Intelligence and Analytics: A Command and Control Center for Supply ...Business Intelligence and Analytics: A Command and Control Center for Supply ...
Business Intelligence and Analytics: A Command and Control Center for Supply ...
 

Cube olap

  • 1. Slide 1: Data Warehousing
       CPS 216, Advanced Database Systems

       Slide 2: Review
       Data warehousing: integrating data for OLAP
       • OLAP versus OLTP
       • Warehousing versus mediation
       • Warehouse maintenance
         – Warehouse data as materialized views
         – Recomputation versus incremental maintenance
         – Self-maintenance

       Slide 3: Today
       • Star, snowflake, and cube
       • ROLAP and MOLAP algorithms

       Slide 4: Star schema
       Dimension table Product(PID, name, cost):
         (p1, beer, 10), (p2, diaper, 16), …
       Dimension table Store(SID, city):
         (s1, Durham), (s2, Chapel Hill), (s3, RTP), …
       Dimension table Customer(CID, name, address, city):
         (c3, Amy, 100 Main St., Durham), (c4, Ben, 102 Main St., Durham),
         (c5, Carl, 800 Eighth St., Durham), …
       Fact table Sale(OID, date, CID, PID, SID, qty, price):
         (100, 11/23/2001, c3, p1, s1, 1, 12), (102, 12/12/2001, c3, p2, s1, 2, 17),
         (105, 12/24/2001, c5, p1, s3, 5, 13), …
       Fact table
       • Big
       • Constantly growing
       • Stores measures (often aggregated in queries)
       Dimension table
       • Small
       • Updated infrequently

       Slide 5: Dimension hierarchies
       Brand(brandID, name):
         (b1, Huggies), (b2, Budweiser), …
       Product(PID, brandID, typeID):
         (p1, b1, t1), (p2, b2, t2), (p3, b1, t4), …
       Product type(typeID, name, catID):
         (t1, diaper, cat7), (t2, beer, cat9), (t3, juice, cat9), (t4, pacifier, cat7), …
       Product category(catID, name):
         (cat7, babies), (cat9, beverages), …
       Dimension hierarchies + star schema = snowflake schema

       Slide 6: Star join indexes
       • Queries frequently join fact table with dimension tables
         » Materialize the join result to speed up queries
       • For each combination of dimension attribute values, store the list of tuple ID's in the fact table
         – Brand name, store city, customer city → sales records; Product type, store city → sales records; etc.
         – Conceptually, multi-attribute indexes on the join result
       • One index to support each combination of selection conditions on attributes?
         – Too many indexes!
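The star join index idea from slide 6 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows come from the star schema slide, but the choice of (product name, store city) as the indexed combination, the use of OID as the fact tuple ID, and the names `build_star_join_index`, `product_name`, and `store_city` are hypothetical and not from the slides.

```python
from collections import defaultdict

# Rows copied from the star schema slide (only the columns used here).
product_name = {"p1": "beer", "p2": "diaper"}                    # PID -> name
store_city = {"s1": "Durham", "s2": "Chapel Hill", "s3": "RTP"}  # SID -> city
sale = [  # (OID, CID, PID, SID, qty); OID stands in for the fact tuple ID
    (100, "c3", "p1", "s1", 1),
    (102, "c3", "p2", "s1", 2),
    (105, "c5", "p1", "s3", 5),
]

def build_star_join_index(sale, product_name, store_city):
    """Map each (product name, store city) pair to the list of fact tuple IDs."""
    index = defaultdict(list)
    for oid, _cid, pid, sid, _qty in sale:
        # Follow the foreign keys into the dimension tables (the "join").
        index[(product_name[pid], store_city[sid])].append(oid)
    return dict(index)

star_index = build_star_join_index(sale, product_name, store_city)
print(star_index[("beer", "Durham")])  # fact tuples for beer sold in Durham
```

One such dictionary would be needed for every combination of dimension attributes a query might select on, which is the "too many indexes" problem the slide points out.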
  • 2. Slide 7: Bitmap join indexes
       » O'Neil & Quass, SIGMOD 1997
       • Bitmap and projection indexes for each dimension attribute
         – Value of the dimension attribute ↔ tuple ID's in the fact table
       • To process an arbitrary combination of selection conditions, use bitmap indexes
         – Bitmaps can be combined efficiently
       • To retrieve attribute values for output, use projection indexes

       Slide 8: Data cube
       Simplified schema: Sale(CID, PID, SID, qty)
       [Figure: 3-D cube with axes Customer (ALL, c3, c4, c5), Product (p1, p2), Store (s1, s2, s3)]
       Points: (c5, p1, s3) = 5; (c3, p2, s1) = 2; (c3, p1, s1) = 1; (c5, p1, s1) = 3

       Slide 9: Completing the cube (slide 1)
       Total quantity of sales for each product in each store
       SELECT SUM(qty) FROM Sale
       GROUP BY PID, SID;
       Project all points onto Product-Store plane
       New points: (ALL, p1, s3) = 5; (ALL, p2, s1) = 2; (ALL, p1, s1) = 4

       Slide 10: Completing the cube (slide 2)
       Total quantity of sales for each product
       SELECT SUM(qty) FROM Sale
       GROUP BY PID;
       Further project points onto Product axis
       New points: (ALL, p1, ALL) = 9; (ALL, p2, ALL) = 2

       Slide 11: Completing the cube (slide 3)
       Total quantity of sales
       SELECT SUM(qty) FROM Sale;
       Further project points onto the origin
       New point: (ALL, ALL, ALL) = 11

       Slide 12: CUBE operator
       » Gray et al., ICDE 1996
       • Sale(CID, PID, SID, qty)
       • Proposed SQL extension:
         SELECT SUM(qty) FROM Sale
         GROUP BY CUBE CID, PID, SID;
       • Output contains:
         – Normal groups produced by GROUP BY
           • (c1, p1, s1, sum), (c1, p2, s3, sum), etc.
         – Groups with one or more ALL's
           • (ALL, p1, s1, sum), (c2, ALL, ALL, sum), (ALL, ALL, ALL, sum), etc.
       • Can you write a CUBE query using only GROUP BY's?
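One way to approach the closing question on the CUBE operator slide is to run one ordinary GROUP BY per subset of the dimensions and take the union of the results. The Python sketch below mimics that on the four points of the data cube slide; it is an illustration rather than how any SQL engine evaluates CUBE, and the function names `group_by` and `cube` and the use of the string "ALL" as the placeholder value are assumptions.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"  # placeholder for a projected-out dimension (assumed not a real data value)

# Sale(CID, PID, SID, qty): the four points from the data cube slide.
sale = [
    ("c5", "p1", "s3", 5),
    ("c3", "p2", "s1", 2),
    ("c3", "p1", "s1", 1),
    ("c5", "p1", "s1", 3),
]

def group_by(rows, keep):
    """SUM(qty) grouped by the dimension positions in `keep`; others become ALL."""
    totals = defaultdict(int)
    for *dims, qty in rows:
        key = tuple(d if i in keep else ALL for i, d in enumerate(dims))
        totals[key] += qty
    return dict(totals)

def cube(rows, n_dims=3):
    """Union of one GROUP BY per subset of the dimensions (2**n_dims of them)."""
    result = {}
    for k in range(n_dims + 1):
        for keep in combinations(range(n_dims), k):
            result.update(group_by(rows, set(keep)))
    return result

full_cube = cube(sale)
print(full_cube[(ALL, "p1", "s1")], full_cube[(ALL, "p1", ALL)], full_cube[(ALL, ALL, ALL)])
```

The three printed cells correspond to the projections shown on the "Completing the cube" slides. With n dimensions this needs 2^n separate GROUP BY passes, which is why a dedicated operator is attractive.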
  • 3. Slide 13: ROLLUP operator
       • Sometimes CUBE is too much
         – (…, state, city, street, …, age, DOB, …)
         – CUBE state, city, street returns meaningless groups
           • (ALL, ALL, 'Main Street'): sales on any Main Street?
         – CUBE age, DOB returns useless groups
           • (ALL, DOB): DOB functionally determines age!
       • Proposed SQL extension:
         GROUP BY ROLLUP state, city, street;
       • Output contains groups with ALL's only as suffix
         – ('NC', 'Durham', 'Main Street'), ('NC', 'Durham', ALL), ('NC', ALL, ALL), (ALL, ALL, ALL)
         – But not (ALL, ALL, 'Main Street') or (ALL, 'Durham', ALL)

       Slide 14: Computing GROUP BY
       • ROLAP (Relational OLAP)
         – Use standard relational engine
         – Sorting and clustering
         – Using indexes
         – Automatic summary tables
       • MOLAP (Multidimensional OLAP)
         – Use a sparse multidimensional array

       Slide 15: Sorting and clustering
       • Sort (or cluster, e.g., using hashing) tuples according to GROUP BY attributes
         – Tuples in the same group are processed together
         – Only one intermediate aggregate result needs to be kept: low memory requirement
       • What if GROUP BY attributes ≠ sort attributes?
         – Still fine if GROUP BY attributes form a prefix of the sort order
         – Otherwise, need to keep intermediate aggregate results around

       Slide 16: More on sort order
       • Sort by the order in which GROUP BY attributes appear?
         – Not necessary; e.g., GROUP BY PID, SID can be processed just as efficiently by sorting on SID, PID
       • Sort by the order in which GROUP BY ROLLUP attributes appear?
         – Useful; e.g., GROUP BY ROLLUP state, city, street can be processed efficiently by sorting on state, city, street, but not by sorting on street, city, state

       Slide 17: Using bitmap join indexes
       » O'Neil & Quass, SIGMOD 1997
       • Use the bitmap join indexes on GB1, GB2, …, GBk
       • For each value v1 of GB1 in order:
           For each value v2 of GB2 in order:
             …
               For each value vk of GBk in order:
                 Intersect bitmaps to locate tuples;
                 Retrieve their measures;
                 Calculate aggregate for group (v1, v2, …, vk);
       • Helps if data is sorted by GB1, GB2, …, GBk
         – So measures in the same group are clustered

       Slide 18: Automatic summary tables
       • Computing GROUP BY aggregates is expensive
       • OLAP queries perform GROUP BY all the time
       • Idea: precompute and store the aggregates!
         » Automatic summary tables
         – Maintained automatically as base data changes
         – Just another index/materialized view
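The nested-loop procedure on the "Using bitmap join indexes" slide can be sketched as follows. This is a minimal Python illustration, not the implementation from O'Neil & Quass: bitmaps are plain Python integers (bit i set means tuple ID i matches), measures are read directly from the row list instead of a projection index, and the names `build_bitmap_index` and `group_by_bitmaps` are hypothetical.

```python
from itertools import product as value_combinations

# Sale(CID, PID, SID, qty); the list position serves as the tuple ID.
sale = [
    ("c5", "p1", "s3", 5),
    ("c3", "p2", "s1", 2),
    ("c3", "p1", "s1", 1),
    ("c5", "p1", "s1", 3),
]

def build_bitmap_index(rows, col):
    """Map each value of column `col` to an int whose set bits are matching tuple IDs."""
    index = {}
    for tid, row in enumerate(rows):
        index[row[col]] = index.get(row[col], 0) | (1 << tid)
    return index

def group_by_bitmaps(rows, cols, measure_col=3):
    """SUM(measure) per group, visiting value combinations in sorted order."""
    indexes = [build_bitmap_index(rows, c) for c in cols]
    result = {}
    for values in value_combinations(*(sorted(ix) for ix in indexes)):
        bits = (1 << len(rows)) - 1          # start with every tuple selected
        for ix, v in zip(indexes, values):
            bits &= ix[v]                    # intersect bitmaps to locate tuples
        if bits == 0:
            continue                         # empty group: no tuple matches
        total, tid = 0, 0
        while bits:                          # retrieve measures of the located tuples
            if bits & 1:
                total += rows[tid][measure_col]
            bits >>= 1
            tid += 1
        result[values] = total
    return result

totals = group_by_bitmaps(sale, cols=(1, 2))  # GROUP BY PID, SID
print(totals)
```

Because the located tuple IDs can be scattered across the fact table, fetching their measures is cheaper when the data is already sorted by the grouping attributes, as the slide notes.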
  • 4. Slide 19: Aggregation view lattice
       [Figure: lattice of aggregation views, from top to bottom]
         GROUP BY CID, PID, SID
         GROUP BY CID, PID; GROUP BY CID, SID; GROUP BY PID, SID
         GROUP BY CID; GROUP BY PID; GROUP BY SID
         GROUP BY ∅
       A child can be computed from any parent

       Slide 20: Selecting views to materialize
       • Factors in deciding what to materialize
         – What is its storage cost?
         – What is its update cost?
         – Which queries can benefit from it?
         – How much can a query benefit from it?
       • Example
         – GROUP BY ∅ is small, but not useful to most queries
         – GROUP BY CID, PID, SID is useful to any query, but too large to be beneficial
       » Harinarayan et al., SIGMOD 1996; Gupta & Mumick, ICDE 1999

       Slide 21: Interlude: TPC-D, -H, and -R
       • TPC-D: standard OLAP benchmark until 1999
         – With aggressive use of precomputation techniques (materialized views, automatic summary tables), vendors were able to "cheat" and achieve amazing performance
       • Now, TPC-D has been replaced by
         – TPC-H: ad hoc OLAP queries
           • Cannot use materialized views
         – TPC-R: business-reporting OLAP queries
           • Can use materialized views
       » http://www.tpc.org/

       Slide 22: From tables to arrays
       » Zhao et al., SIGMOD 1997
       • "Chunk" an n-dimensional cube into n-dimensional subcubes
         – For a dense chunk (>40% full), store it as is
         – For a sparse chunk (<40% full), compress it using <coordinate, value> pairs
       • To convert a table into chunks
         – Pass 1: Partition table into memory-size partitions, each of which contains a number of chunks
         – Pass 2: Read partitions back in one at a time, and chunk each partition in memory

       Slide 23: Dimension order
       [Figure: 4 × 4 × 4 array of chunks numbered 1 to 64, with axes Dimension A, Dimension B, Dimension C]
       Dimension order: A, B, C = Sort order: C, B, A

       Slide 24: Basic array cubing
       GROUP BY B, C
       Memory required: 1 chunk (could be more)
       [Figure: the same array of chunks, with axes Dimension A, Dimension B, Dimension C]
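The remark on the aggregation view lattice slide that a child can be computed from any parent can be illustrated with a small Python sketch. It assumes the aggregate is SUM (which can be re-aggregated from partial sums) and reuses the GROUP BY PID, SID totals from the "Completing the cube" slides; the function name `roll_up` and the dictionary representation of a view are hypothetical.

```python
from collections import defaultdict

def roll_up(parent, parent_dims, child_dims):
    """Compute a child view from a parent view by summing away dropped dimensions.

    `parent` maps group-key tuples (ordered as `parent_dims`) to SUM totals;
    `child_dims` must be a subset of `parent_dims`.
    """
    positions = [parent_dims.index(d) for d in child_dims]
    child = defaultdict(int)
    for key, total in parent.items():
        child[tuple(key[i] for i in positions)] += total
    return dict(child)

# Materialized parent view: GROUP BY PID, SID.
by_pid_sid = {("p1", "s1"): 4, ("p1", "s3"): 5, ("p2", "s1"): 2}

by_pid = roll_up(by_pid_sid, ("PID", "SID"), ("PID",))  # GROUP BY PID
grand_total = roll_up(by_pid, ("PID",), ())             # GROUP BY ∅
print(by_pid, grand_total)
```

The child is derived by scanning the parent rather than the fact table, so a smaller parent means less work; this is the cost the view-selection and spanning-tree slides try to minimize.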
  • 5. Slide 25: Minimal spanning tree
       • Recall the aggregation lattice
       • MST of the lattice: parent is always chosen to be the one with minimum size
       • Compute each node from its parent in the MST
       [Figure: lattice annotated with view sizes]
         GROUP BY A, B, C: 100
         GROUP BY A, B: 20; GROUP BY A, C: 10; GROUP BY B, C: 50
         GROUP BY A: 2; GROUP BY B: 10; GROUP BY C: 5
         GROUP BY ∅: 1

       Slide 26: Multiway array cubing
       • Goal: compute all aggregates at the same time in a single pass over the array, using minimum amount of memory
       • GROUP BY B, C requires 1 chunk
       • GROUP BY A, C requires 4 chunks
       • GROUP BY A, B requires 16 chunks
       [Figure: 4 × 4 × 4 array of chunks numbered 1 to 64, with axes Dimension A, Dimension B, Dimension C]

       Slide 27: Memory requirement
       • Dimension order is D_1, D_2, …, D_n
       • Aggregate to compute projects out D_p (i.e., GROUP BY D_1, …, D_{p-1}, D_{p+1}, …, D_n)
       • The memory required is roughly |D_1| · |D_2| · … · |D_{p-1}| chunks
         – Where |D_i| denotes the number of chunks along D_i
       » It is harder to aggregate away dimensions that come later in the dimension order

       Slide 28: Minimum-memory spanning tree
       • MMST of the aggregation lattice
         – Parent is always chosen to be the one that makes the child require the minimum memory to compute
         – Note that results are produced in dimension order too, so computation of the entire MMST can be pipelined
       • Choose an optimal dimension order to minimize the total amount of memory required by MMST
         – It turns out that this optimal order is D_1, D_2, …, D_n, where |D_1| ≤ |D_2| ≤ … ≤ |D_n|

       Slide 29: ROLAP versus MOLAP
       • Multiway array cubing algorithm (MOLAP) beats sorting-based ROLAP algorithms
         – Compressed array representation is more compact than table representation
         – Sorting-based ROLAP spends too much time on comparing and copying
         – In MOLAP, order is implied by the array positions
       » An alternative ROLAP technique
         – Convert table to array
         – Do MOLAP processing
         – Dump the result cube to a table

       Slide 30: Next time
       Data mining
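The memory estimate on the "Memory requirement" slide, the product of the chunk counts of all dimensions before the one being projected out, can be checked against the 1, 4, and 16 chunk figures on the "Multiway array cubing" slide. The Python sketch below is a rough illustration: it only totals the aggregates that project out a single dimension (the first level of the lattice, not the whole MMST), the chunk counts (8, 2, 4) are made up, and the names `chunks_needed` and `first_level_total` are hypothetical.

```python
from itertools import permutations
from math import prod

def chunks_needed(chunk_counts, p):
    """Approximate chunks of memory to compute the aggregate projecting out D_p.

    `chunk_counts[i]` is the number of chunks along D_(i+1); `p` is 1-based.
    """
    return prod(chunk_counts[:p - 1])  # empty product is 1

def first_level_total(chunk_counts):
    """Total memory over all aggregates that project out exactly one dimension."""
    return sum(chunks_needed(chunk_counts, p)
               for p in range(1, len(chunk_counts) + 1))

# Dimension order A, B, C with 4 chunks along each dimension, as in the figure.
abc = (4, 4, 4)
print([chunks_needed(abc, p) for p in (1, 2, 3)])  # project out A, then B, then C

# Hypothetical unequal chunk counts: try every dimension order.
best_order = min(permutations((8, 2, 4)), key=first_level_total)
print(best_order, first_level_total(best_order))
```

For these made-up chunk counts the cheapest order is the ascending one, which agrees with the ordering rule stated on the minimum-memory spanning tree slide.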