2. Data Organization on Secondary Storage
Architecture of a database storage manager:
[Figure: a buffer pool of frames in main memory; a mapping table maps page# to frame#; disk pages on secondary storage are read into free frames.]
02/14/13 2
3. Buffer Management
When a page is requested, the storage manager looks at the
mapping table to see if this page is in the pool. If not:
1. If there is a free frame, it allocates the frame, reads the disk page into the frame, and inserts the mapping from page# to frame# into the mapping table.
2. Otherwise, it selects an allocated frame and replaces it with the
new page. If the victim frame is dirty (has been updated), it
writes it back to disk. It also updates the mapping table.
Then it pins the page.
4. Pinning/Unpinning Data
The page pin is an in-memory counter that indicates how many times this page has been requested but not yet released. The pin counter and the dirty bit of each frame can be stored in the mapping table.
When a program requests a page, it must pin the page, which is
done by incrementing the pin counter. When a program doesn’t
need the page anymore, it unpins the page, which is done by
decrementing the pin counter.
When the pin is zero, the frame associated with the page is a good
candidate for replacement when the buffer pool is full.
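The request/pin/unpin logic above can be sketched as a small in-memory model (a sketch only: the class and method names are illustrative, and the "disk" is just a dict):

```python
class BufferManager:
    """Minimal sketch: frames, a page#->frame# mapping table,
    pin counters, dirty bits, and unpinned-victim replacement."""

    def __init__(self, disk, num_frames):
        self.disk = disk                      # "disk": page# -> contents
        self.frames = [None] * num_frames     # frame contents
        self.page_of = [None] * num_frames    # frame# -> page# (or None)
        self.pin = [0] * num_frames           # pin counters
        self.dirty = [False] * num_frames
        self.mapping = {}                     # page# -> frame#

    def request(self, page_no):
        """Return the frame# holding page_no, pinning the page."""
        if page_no not in self.mapping:       # page is not in the pool
            f = self._find_frame()
            self.frames[f] = self.disk[page_no]   # read page into frame
            self.mapping[page_no] = f
            self.page_of[f] = page_no
        f = self.mapping[page_no]
        self.pin[f] += 1                      # pin the page
        return f

    def unpin(self, page_no, dirtied=False):
        f = self.mapping[page_no]
        self.pin[f] -= 1
        self.dirty[f] = self.dirty[f] or dirtied

    def _find_frame(self):
        for f, p in enumerate(self.page_of):  # 1. look for a free frame
            if p is None:
                return f
        for f, p in enumerate(self.page_of):  # 2. pick an unpinned victim
            if self.pin[f] == 0:
                if self.dirty[f]:
                    self.disk[p] = self.frames[f]  # write back if dirty
                del self.mapping[p]
                self.dirty[f] = False
                return f
        raise RuntimeError("all frames are pinned")
```

Frames with pin counter zero are the replacement candidates; if every frame is pinned, the request cannot be satisfied.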
5. A Higher-Level Interface
Page I/O is very low level. You can abstract it using records, files,
and indexes.
A file is a sequence of records, while each record is an aggregation
of data values.
Records are stored on disk blocks. Small records do not cross page
boundaries. The blocking factor for a file is the average number
of file records stored on a disk block. Large records may span
multiple pages. Fixed-size records are typically unspanned.
The disk pages of a file that contain the file records may be
contiguous, linked, or indexed. A file may be organized as a
heap (unordered) or a sequential (ordered) file. Ordered files
can be searched using binary search but they are hard to update.
A file may also have a file descriptor (or header) that contains
info about the file and the record fields.
6. Indexes
The most common form of an index is a mapping of one or more
fields of a file (this is the search key) to record ids (rids). It
speeds up selections on the search key. An index gives an
alternative access path to the file records on the index key.
Types of indexes:
1. Primary index: when the search key contains the primary key. If the search key contains a candidate key, it is a unique index. Otherwise, it is a secondary index.
2. Clustered index: when the order of records is the same as the
order of index.
3. Dense index: when, for each data record, there is an entry in the index with search key equal to the associated record fields. Otherwise, it is sparse. A sparse index must be clustered, so that records without an index entry can still be located.
7. B+-trees
B+-trees are variations of search trees that are suitable for block-
based secondary storage. They allow efficient search (both
equality & range search), insertion, and deletion of search keys.
Characteristics:
• Each tree node corresponds to a disk block. The tree is kept
height-balanced.
• Each node is kept between half-full and completely full, except
for the root, which can have one search key minimum.
• An insertion into a node that is not full is very efficient. If the
node is full, we have an overflow, and the insertion causes a
split into two nodes. Splitting may propagate to other tree levels
and may reach the root.
• A deletion is very efficient if the node is kept above half full.
Otherwise, there is an underflow, and the node must be merged
with neighboring nodes.
8. B+-tree Node Structure
Internal node (q pointers P1,…,Pq and q-1 search keys K1,…,Kq-1):
P1 K1 P2 K2 … Pi Ki … Pq-1 Kq-1 Pq
The subtree under P1 holds keys K ≤ K1; the subtree under Pi (1 < i < q) holds keys with Ki-1 < K ≤ Ki; the subtree under Pq holds keys K > Kq-1.
The order p of a B+-tree is the max number of pointers in a node. Since a node with p pointers holds p-1 keys, p*d + (p-1)*k ≤ b, so
p = ⌊(b+k)/(k+d)⌋
where b is the block size, k is the key size, and d is the pointer size.
Number of pointers, q, in an internal node/leaf: ⌈p/2⌉ ≤ q ≤ p
Search, insert, and delete have O(log_q N) cost (for N records).
9. Example
For a 1024B page, 9B keys, and 7B pointers, the order is:
p = ⌊(1024+9)/(9+7)⌋ = ⌊64.5⌋ = 64
In practice, a typical order is p=200 and a typical fill factor is 67%. That is, the average fanout is 133 (≈ 200*0.67).
Typical capacities:
• Height 3: 2,352,637 records (= 133^3)
• Height 4: 312,900,721 records (= 133^4)
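This arithmetic can be checked directly (a quick sketch; the helper name `btree_order` is ours, not from the text):

```python
# Order p: the largest p with p*d + (p-1)*k <= b, i.e. floor((b+k)/(k+d)).
def btree_order(b, k, d):
    return (b + k) // (k + d)

order = btree_order(1024, 9, 7)   # block 1024B, key 9B, pointer 7B -> 64

# With order 200 and a 67% fill factor, the average fanout is about 133,
# so a tree of height h reaches roughly 133**h records.
fanout = 133                      # ~ 200 * 0.67
capacity_h3 = fanout ** 3
capacity_h4 = fanout ** 4
```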
10. Query Processing
Need to evaluate database queries efficiently. Assumptions:
• You cannot load the entire database into memory
• You want to utilize the memory as much as possible
• The cost will heavily depend on how many pages you read
from (and write to) disk.
Steps:
1. Translation: translate the query into an algebraic form
2. Algebraic optimization: improve the algebraic form using
heuristics
3. Plan selection: consider available alternative implementation
algorithms for each algebraic operation in the query and
choose the best one (or avoid the worst) based on statistics
4. Evaluation: evaluate the best plan against the database.
11. Evaluation Algorithms: Sorting
Sorting is central to query processing. It is used for:
1. SQL order-by clause
2. Sort-merge join
3. Duplicate elimination (you sort the file by all record values so that duplicates are moved next to each other)
4. Bulk loading of B+-trees
5. Group-by with aggregations (you sort the file by the group-by
attributes and then you aggregate on subsequent records that
belong to the same group, since, after sorting, records with the
same group-by values will be moved next to each other)
12. External Sorting
[Figure: external sorting — the file is read into memory (nB blocks), sorted into initial runs, and the runs are merged in one or more passes.]
Available buffer space: nB ≥ 3
Number of blocks in file: b
Number of initial runs: nR = ⌈b/nB⌉
Degree of merging: dm = min(nB-1, nR)
Number of passes: ⌈log_dm(nR)⌉
Sorting cost: 2*b
Merging cost: 2*b*⌈log_dm(nR)⌉
Total cost: O(b*log b)
13. Example
With 5 buffer pages, to sort a 108-page file:
1. Sorting: nR = ⌈108/5⌉ = 22 sorted initial runs of 5 pages each
2. Merging: dm = 4 run files are merged at a time, so we get ⌈22/4⌉ = 6 sorted runs (of up to 20 pages each) that need to be merged
3. Merging: we get ⌈6/4⌉ = 2 sorted runs to be merged
4. Merging: we get the final sorted file of 108 pages.
Total number of pages read/written: 2*108*4 = 864 pages (since we have 1 sorting pass + 3 merging passes).
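The pass/cost arithmetic above can be reproduced directly (a sketch; `external_sort_cost` is an illustrative helper name):

```python
import math

# Cost of external merge sort: one sorting pass plus ceil(log_dm(nR))
# merging passes, each pass reading and writing all b pages.
def external_sort_cost(b, nB):
    nR = math.ceil(b / nB)             # number of initial runs
    dm = min(nB - 1, nR)               # degree of merging
    merge_passes = math.ceil(math.log(nR, dm)) if nR > 1 else 0
    return 2 * b * (1 + merge_passes)

cost = external_sort_cost(108, 5)      # the example above: 864 pages
```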
14. Replacement Selection
Question: What’s the best way to create each initial run?
With QuickSort: you load nB pages from the file into the memory
buffer, you sort using QuickSort, and you write the result to a
runfile. So the runfile is always at most nB pages.
With HeapSort with replacement selection: you load nB pages from the file into the memory buffer and perform BuildHeap to create the initial heap in memory. But now you continuously read more records from the input file: you remove and write the smallest element of the heap (the heap root) to the output runfile, and if the next input record's sort key is no smaller than the key just written, you Heapify it into the heap; otherwise the record must be held back for the next run. When no more input records can join the current run, you complete the HeapSort in memory and dump the sorted records to the output.
Result: the average size of a runfile is now 2*nB. So even though HeapSort is slower than QuickSort for in-memory sorting, it is better for external sorting, since it creates larger runfiles (and therefore fewer runs to merge).
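A minimal sketch of run generation with replacement selection, assuming memory capacity is counted in records rather than pages and using Python's heapq (names are illustrative):

```python
import heapq

def replacement_selection(records, capacity):
    """Generate sorted runs; records that cannot extend the current
    run are held back and seed the next run."""
    it = iter(records)
    heap = [x for _, x in zip(range(capacity), it)]  # fill memory
    heapq.heapify(heap)
    runs, run, next_run = [], [], []
    for x in it:
        smallest = heapq.heappop(heap)    # write heap root to the run
        run.append(smallest)
        if x >= smallest:
            heapq.heappush(heap, x)       # x can join the current run
        else:
            next_run.append(x)            # x must wait for the next run
            if not heap:                  # current run is exhausted
                runs.append(run)
                run = []
                heap = next_run
                heapq.heapify(heap)
                next_run = []
    run.extend(sorted(heap))              # drain what is left in memory
    if run:
        runs.append(run)
    if next_run:
        runs.append(sorted(next_run))
    return runs
```

With favorable (partially ordered) input, a run can grow well beyond the memory capacity, which is where the 2*nB average comes from.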
15. Join Evaluation Algorithms
Join is the most expensive operation. Many algorithms exist.
Join: R ⋈R.A=S.B S
Nested loops join (naïve evaluation):
res ← ∅
for each r∈R do
for each s∈S do
if r.A=s.B
then insert the concatenation of r and s into res
Improvement: use 3 memory blocks, one for R, one for S, and one
for the output. Cost = bR+bR*bS (plus the cost for writing the
output), where bR and bS are the numbers of blocks in R and S.
16. Block Nested Loops Join
Try to utilize all nB blocks in memory. Use one memory block for
the inner relation, S, one block for the output, and the rest (nB-2)
for the outer relation, R:
while not eof(R)
{ read the next nB-2 blocks of R into memory;
start scanning S from the start, one block at a time;
while not eof(S)
{ read the next block of S;
perform the join R ⋈R.A=S.B S between the memory
blocks of R and S and write the result to the output;
}
}
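The loop above can be sketched as follows, assuming relations are represented as lists of blocks (each block a list of tuples) and nB-2 memory blocks hold the outer relation (a sketch, not a definitive implementation):

```python
def block_nested_loops_join(R_blocks, S_blocks, nB, r_key, s_key):
    """Join R and S on r_key(r) == s_key(s) using nB memory blocks:
    nB-2 for the outer R, one for the inner S, one for the output."""
    out = []
    chunk = nB - 2                        # blocks of R held in memory
    for i in range(0, len(R_blocks), chunk):
        # "read" the next nB-2 blocks of R into memory
        r_tuples = [r for blk in R_blocks[i:i + chunk] for r in blk]
        for s_block in S_blocks:          # rescan S for each R chunk
            for s in s_block:
                for r in r_tuples:
                    if r_key(r) == s_key(s):
                        out.append((r, s))
    return out
```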
17. Block Nested Loops Join (cont.)
Cost: bR + ⌈bR/(nB-2)⌉*bS
But, if either R or S can fit entirely in memory (i.e., bR ≤ nB-2 or bS ≤ nB-2), then the cost is bR+bS.
You always use the smaller relation (in number of blocks) as the outer and the larger as the inner. Why? Because the cost of S ⋈S.B=R.A R is bS + ⌈bS/(nB-2)⌉*bR. So if bR > bS, then the latter cost is smaller.
Rocking the inner relation: Instead of always scanning the inner relation S from the beginning, we can scan it top-down first, then bottom-up, then top-down again, etc. That way, you don't have to read the first or last block of S twice in consecutive scans. In that case, the cost formula will have bS-1 instead of bS.
18. Index Nested Loops Join
If there is an index I (say a B+-tree) on S over the search key S.B,
we can use an index nested loops join:
for each tuple r∈R do
{ retrieve all tuples s∈S using the value r.A as the search key
for the index I;
perform the join between the r and the retrieved tuples from S
}
Cost: bR+|R|*(#of-levels-in-the-index), where |R| is the number
of tuples in R. The number of levels in the B+-tree index is
typically smaller than 4.
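A sketch of this loop, with a Python dict standing in for the B+-tree index on S.B (the data and names are illustrative; the dict lookup plays the role of the index probe):

```python
def index_nested_loops_join(R, index_on_S_B, r_key):
    """For each tuple r of R, probe the 'index' with r.A and join."""
    out = []
    for r in R:
        for s in index_on_S_B.get(r_key(r), []):   # index lookup on S.B
            out.append((r, s))
    return out

# Build a toy "index" on S.B (second field of each S tuple).
S = [("b1", 10), ("b2", 10), ("b3", 20)]
index = {}
for s in S:
    index.setdefault(s[1], []).append(s)

R = [("a1", 10), ("a2", 30)]
result = index_nested_loops_join(R, index, lambda r: r[1])
```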
19. Sort-Merge Join
Applies to equijoins only. Assume R.A is a candidate key of R.
Steps:
1. Sort R on R.A
2. Sort S on S.B
3. Merge the two sorted files as follows:
r ← first(R)
s ← first(S)
repeat
{ while not eof(S) and s.B≤r.A
{ if r.A=s.B then write <r,s> to the output
s ← next(S)
}
r ← next(R)
} until eof(R) //or eof(S)
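The merge step above can be sketched with index pointers instead of first/next, under the same assumption that R.A is a candidate key of R (a sketch; names are illustrative):

```python
def merge_join(R, S, r_key, s_key):
    """Merge phase of sort-merge join: R sorted on A (no duplicates),
    S sorted on B; advance whichever side has the smaller key."""
    out = []
    i = j = 0
    while i < len(R) and j < len(S):
        a, b = r_key(R[i]), s_key(S[j])
        if b < a:
            j += 1                     # S is behind: advance S
        elif b > a:
            i += 1                     # R is behind: advance R
        else:
            out.append((R[i], S[j]))   # match: emit and keep scanning S
            j += 1
    return out

R = [(1, "r1"), (3, "r3"), (5, "r5")]          # sorted on A
S = [(1, "s1"), (3, "sA"), (3, "sB"), (4, "s4")]  # sorted on B
result = merge_join(R, S, lambda r: r[0], lambda s: s[0])
```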
20. Sort-Merge Join (cont.)
Cost of merging: bR+bS
If R and/or S need to be sorted, then the sorting cost must be added
too. You don’t need to sort a relation if there is a B+-tree whose
search key is equal to the sort key (or, more generally, if the
sort key is a prefix of the search key).
Note that if R.A is not a candidate key, then switch R and S, provided that S.B is a candidate key for S. If neither is a candidate key, then a nested loops join must be performed between the equal values of R and S. In the worst case, if all tuples in R have the same value for r.A and all tuples of S have the same value for s.B, equal to r.A, then the cost will be bR*bS.
Sorting and merging can be combined into one phase. In practice, since sorting typically takes fewer than 4 passes, the sort-merge join cost is close to linear.
21. Hash Join
Works on equijoins only. R is called the build table and S is called the probe table. We assume that R can fit in memory.
Steps:
1. Build phase: read the build table, R, into memory as a hash table with hash key R.A. For example, if H is a memory hash table with n buckets, then tuple r of R goes to bucket h(r.A) mod n of H, where h maps r.A to an integer.
2. Probe phase: scan S, one block at a time:
res ← ∅
for each s∈S
for each r in bucket h(s.B) mod n of H
if r.A=s.B
then insert the concatenation of r and s into res
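Both phases can be sketched as follows (a dict of bucket lists stands in for H, and Python's built-in hash plays the role of h; a sketch, not a definitive implementation):

```python
def hash_join(R, S, r_key, s_key, n_buckets=16):
    """In-memory hash join: build a hash table on R.A, probe with S.B."""
    H = [[] for _ in range(n_buckets)]
    for r in R:                                  # build phase
        H[hash(r_key(r)) % n_buckets].append(r)
    out = []
    for s in S:                                  # probe phase
        for r in H[hash(s_key(s)) % n_buckets]:
            if r_key(r) == s_key(s):             # resolve bucket collisions
                out.append((r, s))
    return out
```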
22. Partitioned Hash Join
If neither R nor S can fit in memory, you partition both R and S into m = k*min(bR,bS)/nB partitions, where k is a fudge factor slightly larger than 1 (e.g., 2), so that each partition of the smaller relation (about nB/k blocks) fits in memory. Steps:
1. Partition R:
create m partition files for R: R1,…,Rm
for each r∈R
put r in partition Rj, where j = h(r.A) mod m
2. Partition S:
create m partition files for S: S1,…,Sm
for each s∈S
put s in partition Sj, where j = h(s.B) mod m
3. for i=1 to m
perform an in-memory hash join between Ri and Si
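The three steps can be sketched as follows (a sketch with illustrative names; partition files become in-memory lists here, and Python's built-in hash plays the role of h):

```python
def partitioned_hash_join(R, S, r_key, s_key, m):
    """Partitioned (Grace) hash join: partition both inputs, then
    join each pair of matching partitions in memory."""
    R_parts = [[] for _ in range(m)]
    S_parts = [[] for _ in range(m)]
    for r in R:                                  # 1. partition R
        R_parts[hash(r_key(r)) % m].append(r)
    for s in S:                                  # 2. partition S
        S_parts[hash(s_key(s)) % m].append(s)
    out = []
    for i in range(m):                           # 3. join Ri with Si
        table = {}
        for r in R_parts[i]:                     # build on Ri
            table.setdefault(r_key(r), []).append(r)
        for s in S_parts[i]:                     # probe with Si
            for r in table.get(s_key(s), []):
                out.append((r, s))
    return out
```

A tuple of R can only match tuples of S in the same partition, since both sides are partitioned with the same hash function.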
23. Partitioned Hash Join (cont.)
If the hash function does not partition uniformly (this is called
data skew), one or more Ri/Si partitions may not fit in memory.
We can apply the partitioning technique recursively until all
partitions can fit in memory.
Cost: 3*(bR+bS), since each block is read twice and written once.
In most cases it is better than sort-merge join, and it is highly parallelizable. But it is sensitive to data skew. Sort-merge is better when one or both inputs are already sorted, and sort-merge join delivers its output sorted.
24. Aggregation with Group-by
For aggregation without group-by, simply scan the input and aggregate using one or more accumulators.
For aggregations with a group-by, sort the input on the group-by attributes and
scan the result. Example:
select dno, count(*), avg(salary)
from employee group by dno
Algorithm:
sort employee by dno into E;
e←first(E)
count←0; sum←0; d←e.dno;
while not eof(E)
{ if e.dno<>d
then { output d, count, sum/count;
count←0; sum←0; d←e.dno }
count ←count+1; sum←sum+e.salary
e←next(E);
}
output d, count, sum/count;
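The algorithm above can be sketched in Python, with employee represented as (dno, salary) pairs (a sketch; `group_avg` is an illustrative name):

```python
def group_avg(employee):
    """Sort-based evaluation of:
    select dno, count(*), avg(salary) from employee group by dno"""
    E = sorted(employee, key=lambda e: e[0])   # sort by dno
    out, count, total, d = [], 0, 0, None
    for dno, salary in E:
        if dno != d and count > 0:             # group boundary reached
            out.append((d, count, total / count))
            count, total = 0, 0
        d = dno
        count += 1                             # accumulate within group
        total += salary
    if count > 0:                              # emit the last group
        out.append((d, count, total / count))
    return out

rows = [(2, 50.0), (1, 30.0), (2, 70.0), (1, 40.0), (1, 20.0)]
result = group_avg(rows)
```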
25. Other Operators
Aggregation with group-by (cont.):
Sorting can be combined with scanning/aggregation.
You can also group-by using partitioned hashing, with the group-by attributes as the hash/partition key.
Other operations:
• Selection can be done with scanning the input and testing each
tuple against the selection condition. Or you can use an index.
• Intersection is a special case of join (the predicate is an equality
over all attribute values).
• Projection requires duplicate elimination, which can be done with sorting or hashing; duplicates can be eliminated during the sorting or hashing itself.
• Union requires duplicate elimination too.
26. Combining Operators
Each relational algebraic operator reads one or two relations as input and returns one relation as output. It can be very expensive if the evaluation algorithms that implement these operators have to materialize the output relations into temporary files on disk.
Solution: Stream-based processing (pipelining).
Iterators: open, next, close
Operators now work on streams of tuples. Operation next returns one tuple only, which is sent to the output stream. It is 'pull' based: to create one tuple, the operator calls the next operation over its input streams as many times as necessary to generate the tuple.
[Figure: an example plan tree with π at the root, joins combining B, C, and D, and a σ over A.]
27. Stream-Based Processing
Selection without streams:
table selection ( table x, bool (*pred)(record) )
{ table result = empty
for each e in x
if pred(e) then insert e into result
return result
}
Stream-based selection:
record selection ( stream s, bool (*pred)(record) )
{ while not eof(s)
{ r = next(s)
if pred(r) then return r
}
return empty_record
}
struct stream { record (*next_fnc)(…); stream x; stream y; args; }
record next ( stream s )
{ if (s.y = empty) return (*s.next_fnc)(s.x, s.args)
else return (*s.next_fnc)(s.x, s.y, s.args)
}
[Figure: an example plan — a join x.A=y.B whose input is a selection x.C>10.]
28. But …
Stream-based nested loops join:
record nested-loops ( stream left, stream right, bool (pred)(record,record) )
{ while not eof(left)
{ x = next(left)
while not eof(right)
{ y = next(right)
if pred(x,y) then return <x,y>
}
open(right)
}
return empty_record
}
• If the inner stream is the result of another operator (such as a join), it is better to materialize it into a file. So this works great for left-deep trees.
• But it doesn't work well for sorting (a blocking operator).
29. Query Optimization
A query of the form:
select A1,…,An
from R1,…,Rm
where pred
can be evaluated by the following algebraic expression:
πA1,…,An(σpred(R1 × … × Rm))
Algebraic optimization:
Find a more efficient algebraic expression using heuristic rules.
30. Heuristics
• If pred in σpred is a conjunction, break σpred into a cascade of σ's:
σp1 and … and pn(R) = σp1(… σpn(R) …)
• Move σ as far down the query tree as possible:
σp(R × S) = σp(R) × S if p refers to R only
• Convert cartesian products into joins
σR.A=S.B(R × S) = R ⋈R.A=S.B S
• Rearrange joins so that there are no cartesian products
• Move π as far down the query tree as possible (but retain
attributes needed in joins/selections).
31. Example
select e.fname, e.lname
from project p, works_on w, employee e
where p.plocation='Stafford' and e.bdate>'Dec-31-1957'
and p.num=4 and p.pnumber = w.pno and w.essn = e.ssn
[Figure: optimized query tree — σp.plocation='Stafford' and p.num=4 over project, followed by πp.pnumber; σe.bdate>'Dec-31-1957' over employee, followed by πe.ssn, e.fname, e.lname, joined (w.essn=e.ssn) with πw.essn, w.pno over works_on, followed by πe.fname, e.lname, w.pno; the two branches are joined on p.pnumber=w.pno, with πe.fname, e.lname at the root.]
32. Plan Selection
It has 3 components:
1. Enumeration of the plan space.
– Only the space of left-deep plans is typically considered.
– Cartesian products are avoided.
– It is an NP-hard problem.
– Some exponential algorithms (O(2^N) for N joins) are still practical for everyday queries (<10 joins). We will study the System R dynamic programming algorithm.
– Many polynomial-time heuristics exist.
2. Cost estimation.
– Based on statistics, maintained in the system catalog.
– Very rough approximations; still black magic.
– More accurate estimations exist based on histograms.
3. Plan selection.
– Ideally, we want to find the best plan. Practically, we want to avoid the worst plans.
33. Cost Estimation
The cost of a plan is the sum of the cost of all the plan operators.
• If the intermediate result between plan operators is
materialized, we need to consider the cost of reading/writing the
result.
• To estimate the cost of a plan operator (such as block nested loops join), we need to estimate the size (in blocks) of its inputs.
– Based on predicate selectivity. Independence of predicates is assumed.
– For each operator, both the cost and the output size need to be estimated (since the size may be needed for the next operator).
– The sizes of leaves are retrieved from the catalog (statistics on table/index cardinalities).
• We will study System R.
– Very inexact, but works well in practice. Used to be widely used.
– More sophisticated techniques are known now.
34. Statistics
Statistics stored in system catalogs typically contain
– Number of tuples (cardinality) and number of blocks for each table and
index.
– Number of distinct key values for each index.
– Index height, low and high key values for each index.
Catalogs are updated periodically, but not every time data change
(too expensive). This may introduce slight inconsistency
between data and statistics, but usually the choice of plans is
resilient to slight changes in statistics.
Histograms are better approximations:
[Figure: histogram of # of tuples per day of week (M, Tu, W, Th, F, Sa, Su).]