Query Processing and Optimization

                  (Review from Database II)




02/14/13                                       1
Data Organization on Secondary Storage
  Architecture of a database storage manager:

  [Figure: the buffer pool in main memory consists of frames (free
  frames and frames holding disk pages); a mapping table maps page#
  to frame#; pages are read from and written back to secondary
  storage.]
Buffer Management
  When a page is requested, the storage manager looks at the
     mapping table to see if this page is in the pool. If not:
  1. If there is a free frame, it allocates the frame, reads the disk
     page into the frame, and inserts the mapping from page# to frame#
     into the mapping table.
  2. Otherwise, it selects an allocated frame and replaces it with the
     new page. If the victim frame is dirty (has been updated), it
     writes it back to disk. It also updates the mapping table.
  Then it pins the page.




Pinning/Unpinning Data
  The page pin is an in-memory counter that tracks how many active
    requests (users) the page currently has. The pin counter and the dirty bit
    of each frame can be stored in the mapping table.
  When a program requests a page, it must pin the page, which is
    done by incrementing the pin counter. When a program doesn’t
    need the page anymore, it unpins the page, which is done by
    decrementing the pin counter.
  When the pin is zero, the frame associated with the page is a good
    candidate for replacement when the buffer pool is full.
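The buffer-management and pinning logic described above can be sketched in Python. This is a minimal illustration, not a real storage manager: `BufferPool`, `request`, and `unpin` are hypothetical names, and a dict stands in for secondary storage.

```python
class BufferPool:
    """Minimal sketch of the buffer manager described above (hypothetical API)."""

    def __init__(self, num_frames, disk):
        self.disk = disk                       # page# -> contents (stands in for disk)
        self.frames = [None] * num_frames      # frame# -> page contents
        self.frame_page = [None] * num_frames  # frame# -> page# held in the frame
        self.page_table = {}                   # mapping table: page# -> frame#
        self.pin = [0] * num_frames            # pin counter per frame
        self.dirty = [False] * num_frames      # dirty bit per frame

    def request(self, page_no):
        """Return the frame# holding page_no, reading it from disk if needed."""
        if page_no not in self.page_table:     # page is not in the pool
            f = self._victim()
            victim = self.frame_page[f]
            if victim is not None:             # replace an allocated frame
                if self.dirty[f]:              # write back a dirty victim
                    self.disk[victim] = self.frames[f]
                    self.dirty[f] = False
                del self.page_table[victim]    # update the mapping table
            self.frames[f] = self.disk[page_no]
            self.frame_page[f] = page_no
            self.page_table[page_no] = f
        f = self.page_table[page_no]
        self.pin[f] += 1                       # pin the page
        return f

    def unpin(self, page_no, dirtied=False):
        f = self.page_table[page_no]
        self.pin[f] -= 1
        self.dirty[f] = self.dirty[f] or dirtied

    def _victim(self):
        # Prefer a free frame; otherwise any unpinned (pin == 0) frame.
        for f, p in enumerate(self.frame_page):
            if p is None:
                return f
        for f, c in enumerate(self.pin):
            if c == 0:
                return f
        raise RuntimeError("all frames are pinned")
```

A pool with 2 frames over a 3-page "disk" forces a replacement on the third request, which writes back the dirty victim.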




A Higher-Level Interface
  Page I/O is very low level. You can abstract it using records, files,
    and indexes.
  A file is a sequence of records, while each record is an aggregation
    of data values.
  Records are stored on disk blocks. Small records do not cross page
    boundaries. The blocking factor for a file is the average number
    of file records stored on a disk block. Large records may span
    multiple pages. Fixed-size records are typically unspanned.
  The disk pages of a file that contain the file records may be
    contiguous, linked, or indexed. A file may be organized as a
    heap (unordered) or a sequential (ordered) file. Ordered files
    can be searched using binary search but they are hard to update.
     A file may also have a file descriptor (or header) that contains
    info about the file and the record fields.
Indexes
  The most common form of an index is a mapping of one or more
     fields of a file (this is the search key) to record ids (rids). It
     speeds up selections on the search key. An index gives an
     alternative access path to the file records on the index key.
  Types of indexes:
  1. Primary index: the search key contains the primary key. If the
     search key contains a candidate key, it is a unique index.
     Otherwise, it is a secondary index.
  2. Clustered index: when the order of records is the same as the
     order of index.
  3. Dense index: when, for each data record, there is an entry in
     the index with search key equal to the associated record fields.
     Otherwise, it is sparse. A sparse index must be clustered, so that
     records whose search-key values have no index entry can still be
     located.
B+-trees
  B+-trees are variations of search trees that are suitable for block-
     based secondary storage. They allow efficient search (both
     equality & range search), insertion, and deletion of search keys.
  Characteristics:
  • Each tree node corresponds to a disk block. The tree is kept
     height-balanced.
  • Each node is kept between half-full and completely full, except
     for the root, which can have one search key minimum.
  • An insertion into a node that is not full is very efficient. If the
     node is full, we have an overflow, and the insertion causes a
     split into two nodes. Splitting may propagate to other tree levels
     and may reach the root.
  • A deletion is very efficient if the node is kept above half full.
     Otherwise, there is an underflow, and the node must be merged
     with neighboring nodes.
B+-tree Node Structure

  [Figure: internal node layout. A node holds search keys
  K1 ≤ … ≤ Kq-1 interleaved with pointers P1, …, Pq; subtree P1 holds
  keys K ≤ K1, subtree Pi holds keys with Ki-1 < K ≤ Ki, and subtree
  Pq holds keys K > Kq-1.]

  Order p of a B+-tree is the max number of pointers in a node:
                     p = floor((b+k)/(k+d))
  where b is the block size, k is the key size, and d is the pointer
  size (p pointers and p-1 keys must fit in one block: p*d + (p-1)*k ≤ b).
  Number of pointers, q, in an internal node: ceil(p/2) ≤ q ≤ p
  Search, insert, and delete cost O(log_q N) (for N records).
Example
  For a 1024B page, 9B key, and 7B pointer size, the order is:
       p = floor((1024+9)/(9+7)) = 64
  In practice, a typical order is p=200 and a typical fill factor is 67%.
  That is, the average fanout is about 133 (≈ 200*0.67).
  Typical capacities:
  • Height 3: 2,352,637 records (= 133^3)
  • Height 4: 312,900,721 records (= 133^4)
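The order formula and the capacity figures above can be checked with a few lines of Python (`btree_order` is an illustrative name, not a standard API):

```python
def btree_order(b, k, d):
    # p pointers (d bytes each) plus p-1 keys (k bytes each) must fit in
    # one block of b bytes:  p*d + (p-1)*k <= b  =>  p <= (b+k)/(k+d)
    return (b + k) // (k + d)

print(btree_order(1024, 9, 7))     # order for the example above

fanout = 133                       # average fanout at ~67% fill of an order-200 tree
print(fanout ** 3, fanout ** 4)    # records reachable at heights 3 and 4
```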




Query Processing
  We need to evaluate database queries efficiently. Assumptions:
  • You cannot load the entire database into memory
  • You want to utilize the memory as much as possible
  • The cost depends heavily on how many pages you read
     from (and write to) disk.
  Steps:
  1. Translation: translate the query into an algebraic form
  2. Algebraic optimization: improve the algebraic form using
     heuristics
  3. Plan selection: consider available alternative implementation
     algorithms for each algebraic operation in the query and
     choose the best one (or avoid the worst) based on statistics
  4. Evaluation: evaluate the best plan against the database.
Evaluation Algorithms: Sorting
  Sorting is central to query processing. It is used for:
  1. SQL order-by clause
  2. Sort-merge join
  3. Duplicate elimination (you sort the file by all record values so
     that duplicates will be moved next to each other)
  4. Bulk loading of B+-trees
  5. Group-by with aggregations (you sort the file by the group-by
     attributes and then you aggregate on subsequent records that
     belong to the same group, since, after sorting, records with the
     same group-by values will be moved next to each other)




External Sorting

  [Figure: the file is read from disk and sorted into initial runs
  using the nB memory blocks; the runs are then merged, possibly over
  several passes, into the final sorted file.]

  Available buffer space: nB ≥ 3              Sorting cost: 2*b
  Number of blocks in file: b                 Merging cost: 2*b*ceil(log_dm nR)
  Number of initial runs: nR = ceil(b/nB)
  Degree of merging: dm = min(nB-1, nR)       Total cost: O(b*log b)
  Number of merging passes: ceil(log_dm nR)
Example
  With 5 buffer pages, to sort a 108-page file:
  1. Sorting: nR = ceil(108/5) = 22 sorted initial runs of 5 pages each
  2. Merging: dm = 4 runs are merged at a time, so we get
      ceil(22/4) = 6 sorted runs (of up to 20 pages each)
  3. Merging: we get ceil(6/4) = 2 sorted runs to be merged
  4. Merging: we get the final sorted file of 108 pages.
  Total number of pages read/written: 2*108*4 = 864 pages
  (since we have 1 sorting + 3 merging passes).
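The pass counting above can be sketched as a small Python cost function (`external_sort_cost` is an illustrative name; it counts pages read plus pages written):

```python
import math

def external_sort_cost(b, nB):
    """Pages read + written by external merge sort, per the formulas above:
    one run-formation pass plus the merging passes, each costing 2*b."""
    nR = math.ceil(b / nB)            # number of initial runs
    dm = min(nB - 1, nR)              # degree of merging
    runs, passes = nR, 1              # pass 1: run formation (sorting)
    while runs > 1:
        runs = math.ceil(runs / dm)   # one merging pass
        passes += 1
    return 2 * b * passes
```

For the example above, `external_sort_cost(108, 5)` counts 4 passes over 108 pages.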




Replacement Selection
  Question: What’s the best way to create each initial run?
  With QuickSort: you load nB pages from the file into the memory
    buffer, you sort using QuickSort, and you write the result to a
    runfile. So the runfile is always at most nB pages.
  With HeapSort with replacement selection: you load nB pages from the
    file into the memory buffer and perform BuildHeap to create the
    initial heap in memory. Then you repeatedly remove the smallest
    element of the heap (the heap root), write it to the output
    runfile, and read the next record from the input file: if its sort
    key is no smaller than the last key written, it is inserted into
    the heap (Heapify); otherwise it is set aside for the next run.
    When no current-run records remain in the heap, the run is
    complete and the next run starts.
    Result: the average size of the runfile is now 2*nB. So even
    though HeapSort is slower than QuickSort for in-memory sorting,
    it is better for external sorting, since it creates larger
    runfiles.
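The run-formation scheme above can be sketched with Python's `heapq`, tagging each buffered key with the run it belongs to. This is an illustrative sketch over plain keys (`replacement_selection` is a hypothetical name; `memory` counts records rather than pages):

```python
import heapq
from itertools import islice

def replacement_selection(records, memory):
    """Sketch of run formation with replacement selection.
    Returns the list of sorted runs produced."""
    it = iter(records)
    heap = [(0, r) for r in islice(it, memory)]   # (run#, key) pairs
    heapq.heapify(heap)
    runs, out, current = [], [], 0
    while heap:
        run, r = heapq.heappop(heap)
        if run != current:                # current run exhausted: start the next
            runs.append(out)
            out, current = [], run
        out.append(r)                     # write the smallest key to the runfile
        nxt = next(it, None)
        if nxt is not None:
            # A record smaller than the key just written cannot join the
            # current run; it is tagged for the next run instead.
            heapq.heappush(heap, (run if nxt >= r else run + 1, nxt))
    if out:
        runs.append(out)
    return runs
```

On random input the runs average about twice the buffer size, matching the 2*nB claim above.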
Join Evaluation Algorithms
  Join is the most expensive operation. Many algorithms exist.
  Join: R ⋈ S with join predicate R.A = S.B

  Nested loops join (naïve evaluation):
    res ← ∅
    for each r∈R do
        for each s∈S do
            if r.A=s.B
               then insert the concatenation of r and s into res

  Improvement: use 3 memory blocks, one for R, one for S, and one
    for the output. Cost = bR+bR*bS (plus the cost for writing the
    output), where bR and bS are the numbers of blocks in R and S.

Block Nested Loops Join
  Try to utilize all nB blocks in memory. Use one memory block for
    the inner relation, S, one block for the output, and the rest (nB-2)
    for the outer relation, R:

       while not eof(R)
       { read the next nB-2 blocks of R into memory;
         start scanning S from start one block at the time;
         while not eof(S)
         { read the next block of S;
             perform the join R ⋈ S (on R.A=S.B) between the memory-
                   resident blocks of R and S and write the result to the output;
          }
       }
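The loop above can be sketched in Python over lists of (key, payload) tuples (`block_nested_loops_join` is an illustrative name; `block_size` converts the block counts into tuple counts):

```python
def block_nested_loops_join(R, S, nB, block_size):
    """Sketch of block nested loops join: nB-2 blocks of the outer
    relation R are held in memory at a time, and the inner relation S
    is rescanned once per chunk. Tuples are (key, payload) pairs."""
    chunk = (nB - 2) * block_size       # tuples of R resident at once
    out = []
    for i in range(0, len(R), chunk):   # while not eof(R): read the next chunk
        mem = R[i:i + chunk]
        for s in S:                     # rescan S, one tuple at a time
            for r in mem:
                if r[0] == s[0]:
                    out.append(r + s)   # concatenate matching tuples
    return out
```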

Block Nested Loops Join (cont.)
  Cost: bR + ceil(bR/(nB-2))*bS
  But, if either R or S can fit entirely in memory (i.e., bR ≤ nB-2 or
    bS ≤ nB-2), then the cost is bR+bS.
  You always use the smaller relation (in number of blocks) as outer
    and the larger as inner. Why? Because the cost of S ⋈ R is
    bS + ceil(bS/(nB-2))*bR. So if bR > bS, then the latter cost is smaller.
  Rocking the inner relation: Instead of always scanning the inner
    relation R from the beginning, we can scan it top-down first,
    then bottom-up, then top-down again, etc. That way, you don’t
    have to read the first or last block twice. In that case, the cost
    formula will have bS-1 instead of bS.



Index Nested Loops Join
  If there is an index I (say a B+-tree) on S over the search key S.B,
      we can use an index nested loops join:
      for each tuple r∈R do
     { retrieve all tuples s∈S using the value r.A as the search key
                   for the index I;
        perform the join between r and the retrieved tuples from S
     }

  Cost: bR+|R|*(#of-levels-in-the-index), where |R| is the number
    of tuples in R. The number of levels in the B+-tree index is
    typically smaller than 4.
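A Python sketch of the loop above, where a dict of key → matching S tuples stands in for the B+-tree on S.B (the names `index_nested_loops_join` and `build_index` are illustrative):

```python
from collections import defaultdict

def build_index(S):
    """Build the stand-in index on S.B (the first field of each S tuple)."""
    idx = defaultdict(list)
    for s in S:
        idx[s[0]].append(s)
    return idx

def index_nested_loops_join(R, S_index):
    """For each tuple r in R, probe the index with r.A instead of scanning S."""
    out = []
    for r in R:
        for s in S_index.get(r[0], []):   # index lookup on the value r.A
            out.append(r + s)
    return out
```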



Sort-Merge Join
  Applies to equijoins only. Assume R.A is a candidate key of R.
  Steps:
  1. Sort R on R.A
  2. Sort S on S.B
  3. Merge the two sorted files as follows:

           r ← first(R)
           s ← first(S)
           repeat
           { while not eof(S) and s.B ≤ r.A
              { if r.A=s.B then write <r,s> to the output
                 s ← next(S)
              }
              r ← next(R)
           } until eof(R) or eof(S)
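The merge step above can be sketched as index-based Python over sorted lists of (key, payload) tuples (`sort_merge_join` is an illustrative name):

```python
def sort_merge_join(R, S):
    """Sketch of the merge step: R is sorted on its first field A
    (a candidate key of R) and S is sorted on its first field B."""
    out = []
    i = j = 0
    while i < len(R) and j < len(S):
        if S[j][0] < R[i][0]:
            j += 1                    # skip s tuples below the current r.A
        elif S[j][0] == R[i][0]:
            out.append(R[i] + S[j])   # match: emit <r, s>
            j += 1                    # A is a key, so only this r can match
        else:
            i += 1                    # advance r past s.B
    return out
```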

Sort-Merge Join (cont.)
  Cost of merging: bR+bS
  If R and/or S need to be sorted, then the sorting cost must be added
     too. You don’t need to sort a relation if there is a B+-tree whose
     search key is equal to the sort key (or, more generally, if the
     sort key is a prefix of the search key).
  Note that if R.A is not a candidate key, then switch R and S,
     provided that S.B is a candidate key for S. If neither is true,
     then a nested loops join must be performed between the equal
     values of R and S. In the worst case, if all tuples in R have the
     same value for r.A and all tuples of S have the same value for
     s.B, equal to r.A, then the cost will be bR*bS.
  Sorting and merging can be combined into one phase. In practice,
    since sorting typically takes fewer than 4 passes, the sort-merge
    join is close to linear.
Hash Join
  Works on equijoins only. R is called the build table and S is called
     the probe table. We assume that R can fit in memory.
  Steps:
  1. Build phase: read the build table, R, into memory as a hash table
     with hash key R.A. For example, if H is a memory hash table
     with n buckets, then tuple r of R goes to bucket h(r.A) mod n
     of H, where h maps r.A to an integer.
  2. Probe phase: scan S, one block at a time:
         res ← ∅
         for each s∈S
             for each r∈H(h(s.B) mod n)
                if r.A=s.B
                    then insert the concatenation of r and s into res
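The two phases above can be sketched in Python, with the built-in dict hashing standing in for h(.) mod n (`hash_join` is an illustrative name; tuples are (key, payload) pairs):

```python
from collections import defaultdict

def hash_join(R, S):
    """Sketch of in-memory hash join: build a hash table on R.A,
    then probe it with each s.B."""
    H = defaultdict(list)
    for r in R:                      # build phase
        H[r[0]].append(r)
    out = []
    for s in S:                      # probe phase, one tuple at a time
        for r in H.get(s[0], []):
            out.append(r + s)        # concatenate matching tuples
    return out
```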
Partitioned Hash Join
  If neither R nor S can fit in memory, you partition both R and S
      into m = min(bR,bS)/(k*nB) partitions, where k is a number
      larger than 1, e.g., 2. Steps:
  1. Partition R:
          create m partition files for R: R1,…,Rm
          for each r∈R
              put r in the partition Rj, where j = h(r.A) mod m
  2. Partition S:
          create m partition files for S: S1,…,Sm
          for each s∈S
              put s in the partition Sj, where j = h(s.B) mod m
  3. for i=1 to m
          perform in-memory hash join between Ri and Si
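The three steps above can be sketched in Python, with lists playing the role of the partition files and the built-in `hash` as h (`partitioned_hash_join` is an illustrative name):

```python
def partitioned_hash_join(R, S, m):
    """Sketch of partitioned hash join: partition both inputs with the
    same hash function, then hash-join each (Ri, Si) pair in memory.
    Tuples are (key, payload) pairs."""
    Rp = [[] for _ in range(m)]
    Sp = [[] for _ in range(m)]
    for r in R:                           # step 1: partition R
        Rp[hash(r[0]) % m].append(r)
    for s in S:                           # step 2: partition S
        Sp[hash(s[0]) % m].append(s)
    out = []
    for Ri, Si in zip(Rp, Sp):            # step 3: in-memory join per pair
        H = {}
        for r in Ri:                      # build phase on Ri
            H.setdefault(r[0], []).append(r)
        for s in Si:                      # probe phase on Si
            out.extend(r + s for r in H.get(s[0], []))
    return out
```

Only tuples that hash to the same partition can match, so each (Ri, Si) pair can be joined independently.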
Partitioned Hash Join (cont.)
  If the hash function does not partition uniformly (this is called
      data skew), one or more Ri/Si partitions may not fit in memory.
      We can apply the partitioning technique recursively until all
      partitions can fit in memory.
  Cost: 3*(bR+bS), since each block is read twice and written once.
  In most cases, it is better than sort-merge join, and it is highly
     parallelizable, but it is sensitive to data skew. Sort-merge is
     better when one or both inputs are already sorted. Sort-merge
     join also delivers its output sorted.




Aggregation with Group-by
  If it is aggregation without group-by, then simply scan the input and aggregate
        using one or more accumulators.
  For aggregations with a group-by, sort the input on the group-by attributes and
        scan the result. Example:
             select dno, count(*), avg(salary)
             from employee group by dno
  Algorithm:
           sort employee by dno into E;
           e←first(E)
           count←0; sum←0; d←e.dno;
           while not eof(E)
           { if e.dno<>d
              then { output d, count, sum/count;
                     count←0; sum←0; d←e.dno }
              count ←count+1; sum←sum+e.salary
              e←next(E);
           }
           output d, count, sum/count;
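The sort-then-scan algorithm above can be sketched in Python, with `itertools.groupby` replacing the explicit accumulator loop (`group_avg` is an illustrative name; employees are (dno, salary) pairs):

```python
from itertools import groupby
from operator import itemgetter

def group_avg(employee):
    """Sketch of 'select dno, count(*), avg(salary)
    from employee group by dno' via sorting: after the sort, records
    with the same dno are adjacent, so one scan suffices."""
    result = []
    for dno, grp in groupby(sorted(employee), key=itemgetter(0)):
        salaries = [salary for _, salary in grp]
        result.append((dno, len(salaries), sum(salaries) / len(salaries)))
    return result
```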
Other Operators
  Aggregation with group-by (cont.):
  Sorting can be combined with scanning/aggregation.
  You can also group-by using partitioned hashing, with the group-by
    attributes as the hash/partition key.
  Other operations:
  • Selection can be done with scanning the input and testing each
    tuple against the selection condition. Or you can use an index.
  • Intersection is a special case of join (the predicate is an equality
    over all attribute values).
  • Projection requires duplicate elimination, which can be done
    with sorting or hashing. You can eliminate duplicates during
    sorting or hashing.
  • Union requires duplicate elimination too.
Combining Operators
  Each relational algebraic operator reads one or two relations as
     input and returns one relation as output. Evaluation can be very
     expensive if the algorithms that implement these operators have
     to materialize their output relations into temporary files on
     disk.
  Solution: stream-based processing (pipelining).
  Iterators: open, next, close.
  Operators now work on streams of tuples. Operation next returns one
     tuple only, which is sent to the output stream. It is 'pull'
     based: to create one tuple, the operator calls the next operation
     over its input streams as many times as necessary to generate
     the tuple.
  [Figure: an example operator tree with π at the root over selections
  and joins of inputs A, B, C, D.]
Stream-Based Processing
  Selection without streams:
       table selection ( table x, bool (*pred)(record) )
       {     table result = empty
             for each e in x
                 if pred(e) then insert e into result
             return result
       }
  Stream-based selection:
       record selection ( stream s, bool (*pred)(record) )
       {     while not eof(s)
             { r = next(s)
               if pred(r) then return r
             }
             return empty_record
       }
       struct stream { record (*next_fnc)(…); stream x; stream y; args; }
       record next ( stream s )
       {     if (s.y = empty) return (s.next_fnc)(s.x, s.args)
             else return (s.next_fnc)(s.x, s.y, s.args)
       }
  [Figure: an example pipelined plan: a join of streams x and y on
  x.A=y.B above a selection with predicate x.C>10.]
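The open/next/close iterator interface above maps naturally onto Python generators, which are pull-based in exactly this sense. A minimal sketch (all names are illustrative):

```python
def scan(table):
    """Leaf iterator: streams the tuples of a stored table."""
    yield from table

def select(pred, stream):
    """Pull-based selection: each next() pulls tuples from the input
    stream until one satisfies the predicate."""
    for t in stream:
        if pred(t):
            yield t

def project(fn, stream):
    """Pull-based projection: transforms each tuple pulled from the input."""
    for t in stream:
        yield fn(t)

# A pipelined plan: no intermediate result is materialized.
plan = project(lambda t: t[1],
               select(lambda t: t[0] > 10,
                      scan([(5, 'a'), (12, 'b'), (20, 'c')])))
print(list(plan))   # ['b', 'c']
```

Each call to the generator's `__next__` pulls just enough tuples from its input to produce one output tuple, mirroring the 'pull' behavior described above.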
But …

            Streamed-based nested loops join:

            record nested-loops ( stream left, stream right, bool (pred)(record,record) )
            {         while not eof(left)
                      { x = next(left)
                         while not eof(right)
                         { y = next(right)
                            if pred(x,y) then return <x,y>
                         }
                         open(right)
                      }
                      return empty_record
            }

           • If the inner stream is the result of another operator (such as a join), it is
             better to materialize it into a file. So this works great for left-deep trees.
           • But it doesn't work well for sorting (a blocking operator).
Query Optimization
  A query of the form:
    select A1,…,An
    from R1,…,Rm
    where pred
  can be evaluated by the following algebraic expression:
       πA1,…An(σpred(R1×… × Rm))

  Algebraic optimization:
  Find a more efficient algebraic expression using heuristic rules.




Heuristics
  • If pred in σpred is a conjunction, break σpred into a cascade of σ:
           σ p1 and …pn(R) = σp1(… σpn(R))
  • Move σ as far down the query tree as possible:
           σp(R × S) = σp(R) × S       if p refers to R only
  • Convert cartesian products into joins:
           σR.A=S.B(R × S) = R ⋈R.A=S.B S
  • Rearrange joins so that there are no cartesian products
  • Move π as far down the query tree as possible (but retain
    attributes needed in joins/selections).




Example
  select e.fname, e.lname
  from project p, works_on w, employee e
  where p.plocation='Stafford' and e.bdate>'Dec-31-1957'
     and p.pnumber=4 and p.pnumber = w.pno and w.essn=e.ssn

  [Query tree: the root π e.fname, e.lname sits above the join on
  p.pnumber=w.pno. Its left input joins, on w.essn=e.ssn, the
  projection π e.ssn, e.fname, e.lname of σ e.bdate>'Dec-31-1957'
  (employee) with the projection π w.essn, w.pno of works_on. Its
  right input is the projection π p.pnumber of
  σ p.plocation='Stafford' and p.pnumber=4 (project).]
Plan Selection
  It has 3 components:
  1. Enumeration of the plan space.
           –   Only the space of left-deep plans is typically considered.
           –   Cartesian products are avoided.
           –   It is an NP-hard problem.
           –   Some exponential algorithms (O(2^N) for N joins) are still practical for
               everyday queries (<10 joins). We will study the System R dynamic
               programming algorithm.
           –   Many polynomial-time heuristics exist.
  2. Cost estimation
           –   Based on statistics, maintained in system catalog.
           –   Very rough approximations; still black magic.
           –   More accurate estimations exist based on histograms.
  3. Plan selection.
           –   Ideally, want to find the best plan. Practically, want to avoid the worst
               plans.
Cost Estimation
  The cost of a plan is the sum of the cost of all the plan operators.
  • If the intermediate result between plan operators is
    materialized, we need to consider the cost of reading/writing the
    result.
  • To estimate the cost of a plan operator (such as block-nested
    loops join), we need to estimate the size (in blocks) of the
    inputs.
           – Based on predicate selectivity. Assumed independence of predicates.
           – For each operator, both the cost and the output size need to be
             estimated (since the size may be needed for the next operator).
           – The sizes of leaves are retrieved from the catalog (statistics on
             table/index cardinalities).
  • We will study system R.
           – Very inexact, but works well in practice. Used to be widely used.
           – More sophisticated techniques known now.
Statistics
  Statistics stored in system catalogs typically contain
           – Number of tuples (cardinality) and number of blocks for each table and
             index.
           – Number of distinct key values for each index.
           – Index height, low and high key values for each index.
  Catalogs are updated periodically, but not every time data change
    (too expensive). This may introduce slight inconsistency
    between data and statistics, but usually the choice of plans is
    resilient to slight changes in statistics.
  Histograms are better approximations:

  [Figure: a histogram of # of tuples per day of week
  (M, Tu, W, Th, F, Sa, Su).]

Third Battle of Panipat detailed notes.pptx
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Magic bus Group work1and 2 (Team 3).pptx
Magic bus Group work1and 2 (Team 3).pptxMagic bus Group work1and 2 (Team 3).pptx
Magic bus Group work1and 2 (Team 3).pptx
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 

Query processing and optimization

  • 1. Query Processing and Optimization (Review from Database II) 02/14/13 1
  • 2. Data Organization on Secondary Storage Architecture of a database storage manager: a buffer pool of main-memory frames caches disk pages from secondary storage; a mapping table maps each buffered page# to its frame#, and free frames hold no page. 02/14/13 2
  • 3. Buffer Management When a page is requested, the storage manager looks at the mapping table to see if this page is in the pool. If not: 1. If there is a free frame, it allocates the frame, reads the disk page into frame, and inserts the mapping from page# to frame# into the mapping table. 2. Otherwise, it selects an allocated frame and replaces it with the new page. If the victim frame is dirty (has been updated), it writes it back to disk. It also updates the mapping table. Then it pins the page. 02/14/13 3
  • 4. Pinning/Unpinning Data The page pin is a memory counter that indicates how many times this page has been requested. The pin counter and the dirty bit of each frame can be stored in the mapping table. When a program requests a page, it must pin the page, which is done by incrementing the pin counter. When a program doesn’t need the page anymore, it unpins the page, which is done by decrementing the pin counter. When the pin is zero, the frame associated with the page is a good candidate for replacement when the buffer pool is full. 02/14/13 4
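The pin/unpin protocol above can be sketched in a few lines of Python. This is an illustrative toy, not the slides' implementation: the names are mine, and the victim policy is simply "first free or unpinned frame", not a real replacement policy such as LRU.

```python
# Toy buffer pool: a mapping table from page# to frame#, a pin counter and a
# dirty bit per frame, and write-back of dirty victims on replacement.
class BufferPool:
    def __init__(self, num_frames, disk):
        self.disk = disk                    # page# -> page contents (simulated disk)
        self.frames = [None] * num_frames   # cached page contents
        self.page_of = [None] * num_frames  # which page# each frame holds
        self.pin = [0] * num_frames         # pin counters
        self.dirty = [False] * num_frames   # dirty bits
        self.table = {}                     # mapping table: page# -> frame#

    def fetch(self, page_no):
        """Bring page_no into the pool (if absent), pin it, return its frame#."""
        if page_no in self.table:
            f = self.table[page_no]         # already buffered: just pin
        else:
            f = self._victim()
            if self.dirty[f]:               # write back a dirty victim
                self.disk[self.page_of[f]] = self.frames[f]
            if self.page_of[f] is not None:
                del self.table[self.page_of[f]]
            self.frames[f] = self.disk[page_no]
            self.page_of[f] = page_no
            self.dirty[f] = False
            self.table[page_no] = f
        self.pin[f] += 1
        return f

    def unpin(self, page_no, dirty=False):
        f = self.table[page_no]
        self.pin[f] -= 1
        self.dirty[f] = self.dirty[f] or dirty

    def _victim(self):
        # first free frame, else first unpinned frame (pin counter zero)
        for f, p in enumerate(self.page_of):
            if p is None or self.pin[f] == 0:
                return f
        raise RuntimeError("all frames pinned")
```

For example, with a 2-frame pool, fetching a third page evicts an unpinned frame and writes it back if its dirty bit is set.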
  • 5. A Higher-Level Interface Page I/O is very low level. You can abstract it using records, files, and indexes. A file is a sequence of records, while each record is an aggregation of data values. Records are stored on disk blocks. Small records do not cross page boundaries. The blocking factor for a file is the average number of file records stored on a disk block. Large records may span multiple pages. Fixed-size records are typically unspanned. The disk pages of a file that contain the file records may be contiguous, linked, or indexed. A file may be organized as a heap (unordered) or a sequential (ordered) file. Ordered files can be searched using binary search but they are hard to update. A file may also have a file descriptor (or header) that contains info about the file and the record fields. 02/14/13 5
  • 6. Indexes The most common form of an index is a mapping of one or more fields of a file (this is the search key) to record ids (rids). It speeds up selections on the search key. An index gives an alternative access path to the file records on the index key. Types of indexes: 1. Primary index: if the search key contains the primary key. If the search key contains a candidate key, it is a unique index. Otherwise, it is a secondary index. 2. Clustered index: when the order of records is the same as the order of index entries. 3. Dense index: when, for each data record, there is an entry in the index with search key equal to the associated record fields. Otherwise, it is sparse. A sparse index must be clustered, to be able to locate records that have no index entry. 02/14/13 6
  • 7. B+-trees B+-trees are variations of search trees that are suitable for block-based secondary storage. They allow efficient search (both equality & range search), insertion, and deletion of search keys. Characteristics: • Each tree node corresponds to a disk block. The tree is kept height-balanced. • Each node is kept between half-full and completely full, except for the root, which can have one search key minimum. • An insertion into a node that is not full is very efficient. If the node is full, we have an overflow, and the insertion causes a split into two nodes. Splitting may propagate to other tree levels and may reach the root. • A deletion is very efficient if the node is kept above half full. Otherwise, there is an underflow, and the node must be merged with neighboring nodes. 02/14/13 7
  • 8. B+-tree Node Structure An internal node with q pointers has the form <P1, K1, P2, K2, …, Kq-1, Pq>: subtree P1 holds keys K ≤ K1, subtree Pi (1 < i < q) holds keys with Ki-1 < K ≤ Ki, and subtree Pq holds keys K > Kq-1. The order p of a B+-tree is the max number of pointers in a node: p = (b+k)/(k+d), where b is the block size, k is the key size, and d is the pointer size. The number of pointers q in an internal node/leaf satisfies p/2 ≤ q ≤ p. Search, insert, and delete have O(log_q N) cost (for N records). 02/14/13 8
  • 9. Example For a 1024B page, 9B key, and 7B pointer size, the order is: p = (1024+9)/(9+7) = 64 In practice, a typical order is p=200 and a typical fill factor is 67%. That is, the average fanout is 133 (=200*0.67). Typical capacities: • Height 3: 2,352,637 records (=133^3) • Height 4: 312,900,721 records (=133^4) 02/14/13 9
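The order and capacity arithmetic in this example can be checked directly. A small sketch (function names are mine):

```python
# B+-tree arithmetic from the slides: order p = (b+k)/(k+d), and the capacity
# of a tree of a given height is fanout**height, where the average fanout is
# order * fill factor.
def btree_order(block, key, ptr):
    return (block + key) // (key + ptr)

def capacity(fanout, height):
    return fanout ** height

order = btree_order(1024, 9, 7)     # 64, as on the slide
fanout = int(200 * 0.67)            # typical: order 200, 67% full -> 133
print(order, capacity(fanout, 3))   # 64 2352637
```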
  • 10. Query Processing Need to evaluate database queries efficiently. Assumptions: • You cannot load the entire database into memory • You want to utilize the memory as much as possible • The cost will heavily depend on how many pages you read from (and write to) disk. Steps: 1. Translation: translate the query into an algebraic form 2. Algebraic optimization: improve the algebraic form using heuristics 3. Plan selection: consider the available implementation algorithms for each algebraic operation in the query and choose the best one (or avoid the worst) based on statistics 4. Evaluation: evaluate the best plan against the database. 02/14/13 10
  • 11. Evaluation Algorithms: Sorting Sorting is central to query processing. It is used for: 1. SQL order-by clause 2. Sort-merge join 3. Duplicate elimination (you sort the file by all record values so that duplicates will be moved next to each other) 4. Bulk loading of B+-trees 5. Group-by with aggregations (you sort the file by the group-by attributes and then you aggregate over subsequent records that belong to the same group, since, after sorting, records with the same group-by values will be next to each other) 02/14/13 11
  • 12. External Sorting A file of b blocks is sorted in memory (nB buffer blocks, nB ≥ 3) into initial runs, which are then repeatedly merged. Number of initial runs: nR = ⌈b/nB⌉. Degree of merging: dm = min(nB-1, nR). Sorting cost: 2*b. Merging cost: 2*b*⌈log_dm(nR)⌉ (the number of merge passes is ⌈log_dm(nR)⌉). Total cost is O(b*log b). 02/14/13 12
  • 13. Example With 5 buffer pages, to sort a 108-page file: 1. Sorting: nR=⌈108/5⌉ = 22 sorted initial runs of 5 pages each 2. Merging: dm=4 run files are merged at a time, so we get ⌈22/4⌉=6 sorted runs (of up to 20 pages each) that need to be merged 3. Merging: we get ⌈6/4⌉=2 sorted runs to be merged 4. Merging: we get the final sorted file of 108 pages. Total number of pages read/written: 2*108*4 = 864 pages (since we have 1 sorting + 3 merging passes). 02/14/13 13
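The cost arithmetic of this example can be sketched as a short function (names are mine):

```python
# External-sort I/O cost: nR = ceil(b/nB) initial runs, degree of merging
# dm = min(nB-1, nR), ceil(log_dm nR) merge passes, and 2*b page I/Os per
# pass (every page is read once and written once per pass).
import math

def external_sort_cost(b, nB):
    nR = math.ceil(b / nB)            # number of initial runs
    dm = min(nB - 1, nR)              # runs merged at a time
    merge_passes = math.ceil(math.log(nR, dm)) if nR > 1 else 0
    return 2 * b * (1 + merge_passes) # sort pass + merge passes

print(external_sort_cost(108, 5))  # 864, as in the example
```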
  • 14. Replacement Selection Question: What’s the best way to create each initial run? With QuickSort: you load nB pages from the file into the memory buffer, you sort them using QuickSort, and you write the result to a runfile. So the runfile is always at most nB pages. With HeapSort with replacement selection: you load nB pages from the file into the memory buffer and perform BuildHeap to create the initial heap in memory. You then continuously read more records from the input file: as long as the incoming record’s sort key is not smaller than the last key written, you Heapify it into the heap and keep removing and writing the smallest element of the heap (the heap root) to the output runfile; an incoming record with a smaller key must wait for the next run. When all records in memory belong to the next run, you complete the HeapSort in memory and dump the sorted run to the output. Result: the average size of the runfile is now 2*nB. So even though HeapSort is slower than QuickSort for in-memory sorting, it is better for external sorting, since it creates larger runfiles. 02/14/13 14
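Replacement selection can be sketched with a heap of (run#, record) pairs: a record that arrives too late for the current run is tagged with the next run number and stays in the heap. This is a simplified in-memory sketch (names and list-based I/O are mine; `capacity` plays the role of the nB-page buffer):

```python
# Run generation by replacement selection: the average run length
# approaches 2*capacity for randomly ordered input.
import heapq
from itertools import islice

def initial_runs(records, capacity):
    """Return a list of sorted initial runs."""
    it = iter(records)
    heap = [(0, r) for r in islice(it, capacity)]  # fill the buffer for run 0
    heapq.heapify(heap)
    runs, out, current = [], [], 0
    while heap:
        run, r = heapq.heappop(heap)
        if run != current:            # current run finished; start the next
            runs.append(out)
            out, current = [], run
        out.append(r)                 # emit the smallest record of this run
        nxt = next(it, None)
        if nxt is not None:
            # a record smaller than the last output must wait for the next run
            heapq.heappush(heap, (run if nxt >= r else run + 1, nxt))
    runs.append(out)
    return runs

print(initial_runs([5, 3, 8, 1, 9, 2, 7], 2))  # [[3, 5, 8, 9], [1, 2, 7]]
```

Note that the first run has 4 records even though the buffer holds only 2, which is exactly the effect the slide describes.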
  • 15. Join Evaluation Algorithms Join is the most expensive operation. Many algorithms exist. Join: R ⋈R.A=S.B S Nested loops join (naïve evaluation): res ← ∅; for each r∈R do for each s∈S do if r.A=s.B then insert the concatenation of r and s into res Improvement: use 3 memory blocks, one for R, one for S, and one for the output. Cost = bR+bR*bS (plus the cost of writing the output), where bR and bS are the numbers of blocks in R and S. 02/14/13 15
  • 16. Block Nested Loops Join Try to utilize all nB blocks in memory. Use one memory block for the inner relation, S, one block for the output, and the rest (nB-2) for the outer relation, R: while not eof(R) { read the next nB-2 blocks of R into memory; scan S from the start, one block at a time; while not eof(S) { read the next block of S; perform the join R ⋈R.A=S.B S between the memory blocks of R and S and write the result to the output; } } 02/14/13 16
  • 17. Block Nested Loops Join (cont.) Cost: bR+⌈bR/(nB-2)⌉*bS But, if either R or S can fit entirely in memory (i.e. bR≤nB-2 or bS≤nB-2) then the cost is bR+bS. You always use the smaller relation (in number of blocks) as the outer and the larger as the inner. Why? Because the cost of S ⋈R.A=S.B R is bS+⌈bS/(nB-2)⌉*bR. So if bR>bS, then the latter cost is smaller. Rocking the inner relation: Instead of always scanning the inner relation S from the beginning, we can scan it top-down first, then bottom-up, then top-down again, etc. That way, you don’t have to read the first or last block twice. In that case, the cost formula will have bS-1 instead of bS. 02/14/13 17
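The algorithm of slide 16 can be sketched over lists of "blocks" (each block a list of tuples). This is an in-memory illustration, not real block I/O; the names are mine:

```python
# Block nested loops join: read nB-2 blocks of the outer relation R at a
# time, then scan the inner relation S once per outer chunk, joining in
# memory on r["A"] == s["B"].
def block_nested_loops_join(R_blocks, S_blocks, nB):
    chunk = nB - 2                  # frames left for the outer relation
    out = []
    for i in range(0, len(R_blocks), chunk):
        # all tuples of the current outer chunk, held "in memory"
        outer = [r for blk in R_blocks[i:i + chunk] for r in blk]
        for s_blk in S_blocks:      # one full inner scan per outer chunk
            for s in s_blk:
                for r in outer:
                    if r["A"] == s["B"]:
                        out.append((r, s))
    return out
```

The number of inner scans is ⌈len(R_blocks)/(nB-2)⌉, matching the cost formula above.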
  • 18. Index Nested Loops Join If there is an index I (say a B+-tree) on S over the search key S.B, we can use an index nested loops join: for each tuple r∈R do { retrieve all tuples s∈S using the value r.A as the search key for the index I; perform the join between the r and the retrieved tuples from S } Cost: bR+|R|*(#of-levels-in-the-index), where |R| is the number of tuples in R. The number of levels in the B+-tree index is typically smaller than 4. 02/14/13 18
  • 19. Sort-Merge Join Applies to equijoins only. Assume R.A is a candidate key of R. Steps: 1. Sort R on R.A 2. Sort S on S.B 3. Merge the two sorted files as follows: r ← first(R) s ← first(S) repeat { while s.B≤r.A { if r.A=s.B then write <r,s> to the output s ← next(S) } r ← next(R) } until eof(R) //or eof(S) 02/14/13 19
  • 20. Sort-Merge Join (cont.) Cost of merging: bR+bS If R and/or S need to be sorted, then the sorting cost must be added too. You don’t need to sort a relation if there is a B+-tree whose search key is equal to the sort key (or, more generally, if the sort key is a prefix of the search key). Note that if R.A is not a candidate key, then switch R and S, provided that S.B is a candidate key for S. If neither is true, then a nested loops join must be performed between the equal values of R and S. In the worst case, if all tuples in R have the same value for r.A and all tuples of S have the same value for s.B, equal to r.A, then the cost will be bR*bS. Sorting and merging can be combined into one phase. In practice, since sorting typically needs fewer than 4 passes, the sort-merge join is close to linear. 02/14/13 20
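The merge loop of slide 19 can be sketched in Python, under the same assumption that A is a candidate key of R (S may still contain duplicates on B); the names are mine:

```python
# Sort-merge join: sort both inputs, then merge in a single pass.
# Assumes each R.A value occurs at most once (A is a candidate key of R).
def sort_merge_join(R, S):
    R = sorted(R, key=lambda r: r["A"])
    S = sorted(S, key=lambda s: s["B"])
    out, j = [], 0
    for r in R:
        while j < len(S) and S[j]["B"] < r["A"]:
            j += 1                       # skip S tuples below the current key
        k = j
        while k < len(S) and S[k]["B"] == r["A"]:
            out.append((r, S[k]))        # output all S matches for this r
            k += 1
        j = k   # safe to skip the matched S tuples: the next r.A is larger
    return out
```

If neither join attribute is a candidate key, equal-valued groups on both sides must be joined with a nested loop, as the slide notes.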
  • 21. Hash Join Works on equijoins only. R is called the build table and S is called the probe table. We assume that R can fit in memory. Steps: 1. Build phase: read the build table, R, into memory as a hash table with hash key R.A. For example, if H is a memory hash table with n buckets, then tuple r of R goes to bucket h(r.A) mod n of H, where h maps r.A to an integer. 2. Probe phase: scan S, one block at a time: res ← ∅; for each s∈S, for each r in bucket h(s.B) mod n of H, if r.A=s.B then insert the concatenation of r and s into res 02/14/13 21
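The two phases can be sketched directly, with a Python dict of lists playing the role of the hash table H (names are mine):

```python
# In-memory hash join: build a hash table on R.A (the build table),
# then probe it with each tuple of S (the probe table).
from collections import defaultdict

def hash_join(R, S):
    H = defaultdict(list)
    for r in R:                      # build phase
        H[r["A"]].append(r)
    out = []
    for s in S:                      # probe phase
        for r in H.get(s["B"], []):  # only tuples in the matching bucket
            out.append((r, s))
    return out
```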
  • 22. Partitioned Hash Join If neither R nor S can fit in memory, you partition both R and S into m=k*min(bR,bS)/nB partitions, so that each partition of the smaller relation fits in memory, where k is a fudge factor larger than 1, e.g., 2. Steps: 1. Partition R: create m partition files for R: R1,…,Rm; for each r∈R, put r in the partition Rk, where k=h(r.A) mod m 2. Partition S: create m partition files for S: S1,…,Sm; for each s∈S, put s in the partition Sk, where k=h(s.B) mod m 3. for i=1 to m, perform an in-memory hash join between Ri and Si 02/14/13 22
  • 23. Partitioned Hash Join (cont.) If the hash function does not partition uniformly (this is called data skew), one or more Ri/Si partitions may not fit in memory. We can apply the partitioning technique recursively until all partitions can fit in memory. Cost: 3*(bR+bS), since each block is read twice and written once. In most cases, it’s better than sort-merge join, and it is highly parallelizable, but sensitive to data skew. Sort-merge is better when one or both inputs are sorted. Also, sort-merge join delivers its output sorted. 02/14/13 23
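The three steps of slide 22 can be sketched in memory, with lists standing in for partition files (names are mine; Python's built-in `hash` plays the role of h):

```python
# Partitioned hash join: hash-partition both inputs into m partitions with
# the same hash function, then join matching partitions with an in-memory
# hash join. Tuples can only join within partitions of the same index.
from collections import defaultdict

def partitioned_hash_join(R, S, m):
    R_parts = [[] for _ in range(m)]
    S_parts = [[] for _ in range(m)]
    for r in R:                          # partition R on A
        R_parts[hash(r["A"]) % m].append(r)
    for s in S:                          # partition S on B
        S_parts[hash(s["B"]) % m].append(s)
    out = []
    for Ri, Si in zip(R_parts, S_parts): # join each partition pair in memory
        H = defaultdict(list)
        for r in Ri:
            H[r["A"]].append(r)
        for s in Si:
            out.extend((r, s) for r in H.get(s["B"], []))
    return out
```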
  • 24. Aggregation with Group-by If it is aggregation without group-by, then simply scan the input and aggregate using one or more accumulators. For aggregations with a group-by, sort the input on the group-by attributes and scan the result. Example: select dno, count(*), avg(salary) from employee group by dno Algorithm: sort employee by dno into E; e←first(E) count←0; sum←0; d←e.dno; while not eof(E) { if e.dno<>d then { output d, count, sum/count; count←0; sum←0; d←e.dno } count ←count+1; sum←sum+e.salary e←next(E); } output d, count, sum/count; 02/14/13 24
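The sort-and-scan algorithm above translates almost line for line into Python (names are mine; the input is a list of dicts rather than a sorted file):

```python
# Sort-based group-by for: select dno, count(*), avg(salary)
#                          from employee group by dno
# Sort on the group-by attribute, then aggregate over runs of equal keys.
def group_avg(employees):
    es = sorted(employees, key=lambda e: e["dno"])
    out, i = [], 0
    while i < len(es):
        d, count, total = es[i]["dno"], 0, 0.0
        while i < len(es) and es[i]["dno"] == d:   # consume one group
            count += 1
            total += es[i]["salary"]
            i += 1
        out.append((d, count, total / count))      # dno, count(*), avg(salary)
    return out
```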
  • 25. Other Operators Aggregation with group-by (cont.): Sorting can be combined with scanning/aggregation. You can also group-by using partitioned hashing, with the group-by attributes as the hash/partition key. Other operations: • Selection can be done by scanning the input and testing each tuple against the selection condition. Or you can use an index. • Intersection is a special case of join (the predicate is an equality over all attribute values). • Projection requires duplicate elimination, which can be done with sorting or hashing. You can eliminate duplicates during sorting or hashing. • Union requires duplicate elimination too. 02/14/13 25
  • 26. Combining Operators Each relational algebraic operator reads one or two relations as input and returns one relation as output. It can be very expensive if the evaluation algorithms that implement these operators have to materialize the output relations into temporary files on disk. Solution: stream-based processing (pipelining). Iterators: open, next, close. Operators now work on streams of tuples. The operation next returns one tuple only, which is sent to the output stream. It is ‘pull’ based: to create one tuple, the operator calls the next operation over its input streams as many times as necessary to generate the tuple. [figure: a query tree of π and σ operators over base tables A, B, C, D] 02/14/13 26
  • 27. Stream-Based Processing Selection without streams: table selection ( table x, bool (pred)(record) ) { table result = empty; for each e in x, if pred(e) then insert e into result; return result } Stream-based selection: record selection ( stream s, bool (pred)(record) ) { while not eof(s) { r = next(s); if pred(r) then return r } return empty_record } A stream is a structure: struct stream { record (next_fnc)(…); stream x; stream y; args; } record next ( stream s ) { if (s.y=empty) return (s.next_fnc)(s.x,s.args) else return (s.next_fnc)(s.x,s.y,s.args) } [figure: a plan of joins over selections, e.g. a join on x.A=y.B above a selection x.C>10] 02/14/13 27
  • 28. But … Stream-based nested loops join: record nested-loops ( stream left, stream right, bool (pred)(record,record) ) { while not eof(left) { x = next(left) while not eof(right) { y = next(right) if pred(x,y) then return <x,y> } open(right) } return empty_record } • If the inner stream is the result of another operator (such as a join), it is better to materialize it into a file. So this works great for left-deep trees. • But it doesn’t work well for sorting (a blocking operator). 02/14/13 28
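The pull-based iterator model of slides 26–28 maps naturally onto Python generators, which suspend after each yielded tuple: no intermediate relation is materialized. A sketch (operator names and the toy plan are mine):

```python
# Pipelined operators as generators: each one pulls tuples from its input
# stream one at a time and yields result tuples one at a time.
def scan(table):
    for t in table:
        yield t

def select(stream, pred):
    for t in stream:
        if pred(t):
            yield t

def nested_loops(left, right_table, pred):
    # the inner input is a materialized table, as slide 28 recommends,
    # so it can be rescanned for every tuple of the outer stream
    for x in left:
        for y in right_table:
            if pred(x, y):
                yield (x, y)

# a small left-deep plan: select over a join over a scan
plan = select(
    nested_loops(scan([1, 2, 3]), [2, 3, 4], lambda x, y: x == y),
    lambda p: p[0] > 1,
)
print(list(plan))  # [(2, 2), (3, 3)]
```

A blocking operator such as sorting breaks this pattern: it must consume its whole input before it can yield its first tuple.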
  • 29. Query Optimization A query of the form: select A1,…,An from R1,…,Rm where pred can be evaluated by the following algebraic expression: πA1,…An(σpred(R1×… × Rm)) Algebraic optimization: Find a more efficient algebraic expression using heuristic rules. 02/14/13 29
  • 30. Heuristics • If pred in σpred is a conjunction, break σpred into a cascade of σ: σ p1 and …pn(R) = σp1(… σpn(R)) • Move σ as far down the query tree as possible: σp(R × S) = σp(R) × S if p refers to R only • Convert cartesian products into joins σR.A=S.B(R × S) = R R.A=S.B S • Rearrange joins so that there are no cartesian products • Move π as far down the query tree as possible (but retain attributes needed in joins/selections). 02/14/13 30
  • 31. Example select e.fname, e.lname from project p, works_on w, employee e where p.plocation=‘Stafford’ and e.bdate>’Dec-31-1957’ and p.pnumber=4 and p.pnumber = w.pno and w.essn=e.ssn [figure: the optimized query tree — π e.fname,e.lname above a join on p.pnumber=w.pno above a join on w.essn=e.ssn, with σ p.plocation=‘Stafford’ and p.pnumber=4 pushed down onto project, σ e.bdate>’Dec-31-1957’ pushed down onto employee, and projections π pushed below the joins] 02/14/13 31
  • 32. Plan Selection It has 3 components: 1. Enumeration of the plan space. – Only the space of left-deep plans is typically considered. – Cartesian products are avoided. – It’s an NP-hard problem. – Some exponential algorithms (O(2^N) for N joins) are still practical for everyday queries (<10 joins). We will study the System R dynamic programming algorithm. – Many polynomial-time heuristics exist. 2. Cost estimation. – Based on statistics, maintained in the system catalog. – Very rough approximations; still black magic. – More accurate estimations exist based on histograms. 3. Plan selection. – Ideally, we want to find the best plan. Practically, we want to avoid the worst plans. 02/14/13 32
  • 33. Cost Estimation The cost of a plan is the sum of the costs of all the plan operators. • If the intermediate result between plan operators is materialized, we need to consider the cost of reading/writing the result. • To estimate the cost of a plan operator (such as block nested loops join), we need to estimate the size (in blocks) of its inputs. – Based on predicate selectivity. Independence of predicates is assumed. – For each operator, both the cost and the output size need to be estimated (since the size may be needed for the next operator). – The sizes of leaves are retrieved from the catalog (statistics on table/index cardinalities). • We will study System R. – Very inexact, but works well in practice. Used to be widely used. – More sophisticated techniques are known now. 02/14/13 33
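The standard textbook (System R-style) size estimates can be sketched as two formulas, where V(A,R) denotes the number of distinct values of attribute A in R, taken from catalog statistics (function names are mine):

```python
# System R-style output-size estimates, assuming uniformly distributed
# values and independent predicates.
def eq_selection_size(card, distinct):
    """Estimated size of sigma_{A=c}(R): |R| / V(A,R)."""
    return card / distinct

def join_size(card_r, card_s, distinct_a, distinct_b):
    """Estimated size of R join S on A=B: |R|*|S| / max(V(A,R), V(B,S))."""
    return card_r * card_s / max(distinct_a, distinct_b)

print(eq_selection_size(10000, 50))     # 200.0
print(join_size(10000, 2000, 50, 100))  # 200000.0
```

Histograms refine these estimates by replacing the uniformity assumption with per-bucket counts.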
  • 34. Statistics Statistics stored in system catalogs typically contain: – the number of tuples (cardinality) and the number of blocks for each table and index; – the number of distinct key values for each index; – the index height and the low and high key values for each index. Catalogs are updated periodically, but not every time the data change (too expensive). This may introduce slight inconsistency between data and statistics, but usually the choice of plans is resilient to slight changes in statistics. Histograms are better approximations. [figure: a histogram of # of tuples per day of week, M Tu W Th F Sa Su] 02/14/13 34