Storage Management - Lecture 8 - Introduction to Databases (1007156ANR)
1. 2 December 2005
Introduction to Databases
Storage Management
Prof. Beat Signer
Department of Computer Science
Vrije Universiteit Brussel
http://www.beatsigner.com
2. Beat Signer - Department of Computer Science - bsigner@vub.ac.be - April 21, 2017
Context of Today's Lecture
[Figure: components of a DBMS. Programmers (application programs), users (queries) and DB admins (database schema) interact with the DBMS via the DML preprocessor, query compiler and DDL compiler, which produce program object code and feed the command processor, query optimiser and catalogue manager. The database manager (authorisation control, integrity checker, transaction manager, scheduler, recovery manager) and the data manager (buffer manager, file manager) use the access methods and system buffers to reach the stored data, indices and system catalogue.
Based on 'Components of a DBMS', Database Systems, T. Connolly and C. Begg, Addison-Wesley 2010]
3. Storage Device Hierarchy
Storage devices vary in
data capacity
access speed
cost per byte
Devices with fastest
access time have
highest costs and
smallest capacity
Cache
Main Memory
Flash Memory
Magnetic Disk
Optical Disk
Magnetic Tapes
4. Cache
On-board caches on same chip as the microprocessor
level 1 (L1) cache (typical size of ~64 kB)
- temporary storage of instructions and data
level 2 (L2) cache (~1 MB) and level 3 (L3) cache (~8 MB)
Data items in the cache are copies of values in main
memory locations
If data in the cache has been updated, changes must be
reflected in the corresponding memory locations
5. Main Memory
Main memory can be several gigabytes large
Normally too small and too expensive for storing the
entire database
content is lost during power failure or crash (volatile memory)
in-memory databases (IMDB) primarily rely on main memory
- note that IMDBs lack durability (D of the ACID properties)
IMDB size limited by the maximal addressable memory space
- e.g. a maximum of 4 GB for a 32-bit address space
Random access memory (RAM)
time to access data is more or less independent of its location
(different from magnetic tapes)
Typical access time of ~10 nanoseconds (10^-8 seconds)
6. Secondary Storage (Hard Disk)
Essentially random access
Files are moved between a hard disk and main memory
(disk I/O) by the operating system (OS) or the DBMS
the transfer units are blocks
tendency for larger block sizes
Parts of the main memory are used to buffer blocks
the buffer manager of the DBMS manages the loading and
unloading of blocks for specific DBMS operations
Typical block I/O time (seek time) ~10 milliseconds
1'000'000 times slower than main memory access
Capacity of multiple terabytes; a system can use many disk units
7. Hard Disk
A hard disk contains one
or more platters and one or
more heads
The platters were originally
addressed in terms of cylinders,
heads and sectors (blocks)
cylinder-head-sector (CHS) scheme
max of 1024 cylinders, 16 heads
and 63 sectors
Current hard disks offer
logical block addressing (LBA)
hides the physical disk geometry
8. Solid-State Drives (SSD)
Storage device that uses solid-state memory
(flash memory) to persistently store data
Offers a hard disk interface with a storage capacity of up
to a few hundred gigabytes
Typical block I/O time (seek time) ~0.1 milliseconds
SSDs might help to reduce the gap between primary and
secondary storage in DBMS systems
Currently there are still some limitations of SSDs
the limited number of SSD write operations before failure can be a
problem for DBs with a lot of update operations
write operations are often still much slower than read operations
9. Tertiary Storage
No random access
access time depends on
data location
Different devices
tapes
optical disk jukeboxes
- racks of CD-ROMs (read only)
tape silos
- room-sized devices holding
racks of tapes operated by
tape robots
- e.g. StorageTek PowderHorn
with up to 28.8 petabytes
10. Models of Computation
RAM model of computation
assumes that all data is held in main memory
DBMS model of computation
assumes that data does not fit into main memory
efficient algorithms must take into account secondary and even
tertiary storage
best algorithms for processing large amounts of data often differ
from those for the RAM model of computation
minimising disk accesses plays a major role
- I/O model of computation
I/O model of computation
the time to move a block between disk and memory is much
higher than the time for the corresponding computation
11. Accelerating Secondary Storage Access
Various possible strategies to improve secondary
storage access
placement of blocks that are often accessed together on the same
disk cylinder
distribute data across multiple disks to profit from parallel disk
accesses (e.g. RAID)
mirroring of data
use of disk scheduling algorithms in OS, DBMS or disk controller
to determine order of requested block read/writes
- e.g. elevator algorithm
prefetching of disk blocks
efficient caching
- main memory
- disk controllers
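The elevator (SCAN) disk scheduling algorithm mentioned above can be sketched in a few lines: requests are served in the head's current direction of travel, and the head reverses only when no requests remain ahead of it. This is a minimal illustration, not how any particular OS or controller implements it:

```python
def elevator_schedule(requests, head, direction=1):
    """SCAN (elevator) scheduling: serve all requested cylinders in the
    current direction of head movement, then reverse direction.
    requests: cylinder numbers; head: current cylinder position;
    direction: +1 (towards higher cylinders) or -1 (towards lower)."""
    ahead = sorted(r for r in requests if (r - head) * direction >= 0)
    behind = sorted((r for r in requests if (r - head) * direction < 0),
                    reverse=True)
    if direction < 0:
        # for downward movement, serve in descending/ascending order instead
        ahead, behind = ahead[::-1], behind[::-1]
    return ahead + behind

# Head at cylinder 50, moving upwards: serve 60, 70, 90 first, then 40, 20.
print(elevator_schedule([90, 40, 60, 20, 70], head=50))  # [60, 70, 90, 40, 20]
```

The pay-off compared to first-come-first-served is fewer long seeks, since the head sweeps across the platter instead of jumping back and forth.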
12. Redundant Array of Independent Disks
The redundant array of independent disks (RAID)
organisation technique provides a single disk view for a
number (array) of disks
divide and replicate data across multiple hard disks
introduced in 1987 by D.A. Patterson, G.A. Gibson and R. Katz
The main goals of a RAID solution are
higher capacity by grouping multiple disks
- originally a RAID was also a cheaper alternative to expensive large disks
• original name: Redundant Array of Inexpensive Disks
higher performance due to parallel disk access
- multiple parallel read/write operations
increased reliability since data might be stored redundantly
- data can be restored if a disk fails
13. RAID ...
There are three main concepts in RAID systems
identical data is written to more than one disk (mirroring)
data is split across multiple disks (striping)
redundant parity data is stored on separate disks and used to
detect and fix problems (error correction)
14. RAID Reliability
The mean time between failures (MTBF) is the average
time until a disk failure occurs
e.g. a hard disk might have a MTBF of 200'000 hours (22.8 years)
- note that the MTBF decreases as disks get older
If a DBMS uses an array of disks, then the overall
system's MTBF can be much lower
e.g. the MTBF for a disk array of 100 of the disks mentioned
above is 200'000 hours/100 = 2'000 hours (83 days)
By storing information redundantly, data can be restored
in the case of a disk failure
15. RAID Reliability ...
The mean time to data loss (MTTDL) depends on the
MTBF and the mean time to repair (MTTR)
for mirroring data on two disks the MTTDL is defined by
if we mirror the information on two disks with a MTBF of 200'000
hours and a mean time to repair of 10 hours then the MTTDL is
200'000² / (2 × 10) hours ≈ 228'000 years
of course in reality it is more likely that an error occurs on multiple
disks around the same time
- drives have the same age
- power failure, earthquake, fire, ...
MTTDL = MTBF² / (2 × MTTR)
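As a quick sanity check of the numbers on this slide, the MTTDL formula for two mirrored disks can be evaluated directly:

```python
def mttdl_mirrored(mtbf_hours, mttr_hours):
    """Mean time to data loss for two mirrored disks: data is lost only
    if the second disk fails within the repair window of the first."""
    return mtbf_hours ** 2 / (2 * mttr_hours)

hours = mttdl_mirrored(200_000, 10)   # 2e9 hours
years = hours / (24 * 365)
print(round(years))                   # ~228'000 years
```

As the slide notes, this figure is optimistic: the formula assumes independent failures, which same-age drives, power failures or fires violate.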
16. RAID Levels
The different RAID levels offer different
cost-performance trade-offs
RAID 0
block level striping without any redundancy
RAID 1
mirroring without striping
RAID 2
bit level striping
multiple parity disks
RAID 3
byte level striping
one parity disk
[http://en.wikipedia.org/wiki/RAID]
17. RAID Levels ...
RAID 4
block level striping
one parity disk
similar to RAID 3
RAID 5
block level striping with distributed parity
no dedicated parity disk
RAID 6
block level striping with dual
distributed parity
no dedicated parity disk
similar to RAID 5
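The parity-based error correction used in RAID 4/5/6 relies on a property of XOR: any single missing block can be recomputed from the surviving blocks and the parity. A minimal sketch with a single parity block (as in RAID 4/5):

```python
def parity(blocks):
    """Parity block: bytewise XOR of all data blocks (as in RAID 4/5)."""
    p = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            p[i] ^= b
    return bytes(p)

# Three equally sized (hypothetical) data blocks and their parity block.
d0, d1, d2 = b"data-0 !", b"data-1 !", b"data-2 !"
p = parity([d0, d1, d2])

# If disk 1 fails, its block is the XOR of the surviving blocks and the parity.
recovered = parity([d0, d2, p])
print(recovered == d1)  # True
```

RAID 6 extends this idea with a second, differently computed parity block so that any two simultaneous disk failures can be survived.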
18. Data Representation
A DBMS has to define how the elements of its data model
(e.g. relational model) are mapped to secondary storage
a field contains a fixed- or variable-length sequence of bytes and
represents an attribute
a record contains a fixed- or variable-length sequence of fields and
represents a tuple
records are stored in fixed-length physical block storage units
representing a set of tuples
- the blocks also represent the units of data transfer
a file contains a collection of blocks and represents a relation
A database is finally mapped to a number of files
managed by the underlying operating system
index structures are stored in separate files
19. Relational Model Representation
A number of issues have to be addressed when
mapping the basic elements of the relational model
to secondary storage
how to map the SQL datatypes to fields?
how to represent tuples as records?
how to represent records in blocks?
how to represent a relation as a collection of blocks?
how to deal with record sizes that do not fit into blocks?
how to deal with variable-length records?
how to deal with schema updates and growing record lengths?
...
20. Representation of SQL Datatypes
Fixed-length character string (CHAR(n))
represented as a field which is an array of n bytes
strings that are shorter than n bytes are filled up with a special
"pad" character
Variable-length character string (VARCHAR(n))
two common representations (non-fixed length version later)
length plus content
- allocate an array of n + 1 bytes
- the first byte represents the length of the string (8-bit integer) followed by the
string content
- limited to a maximal string length of 255 characters
null-terminated string
- allocate an array of n + 1 bytes
- terminate the string with a special null character (like in C)
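The length-plus-content representation can be sketched as follows; the helper names are illustrative, not part of any DBMS API:

```python
def encode_length_prefixed(s, n):
    """VARCHAR(n) as a length byte plus content, padded to n + 1 bytes.
    The single length byte limits n to at most 255 characters."""
    data = s.encode("ascii")
    assert len(data) <= n <= 255
    return bytes([len(data)]) + data + b"\x00" * (n - len(data))

def decode_length_prefixed(field):
    """Read the length byte, then slice out exactly that many bytes."""
    return field[1:1 + field[0]].decode("ascii")

field = encode_length_prefixed("Brussels", 30)
print(len(field))                      # 31 (n + 1 bytes)
print(decode_length_prefixed(field))   # Brussels
```

A null-terminated variant would instead scan for the terminator byte, which makes reading the field O(n) and forbids null bytes inside the value.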
21. Representation of SQL Datatypes ...
Dates (DATE)
fixed-length character string
Time (TIME(n))
the precision n leads to strings of variable length and two possible
representations
fixed-precision
- limit the precision to a fixed value and store as VARCHAR(m)
true variable-length
- store the time as a true variable-length value
Bits (BIT(n))
bit values of size n can be packed into single bytes
packing of multiple bit values into a single byte is not recommended
- makes the retrieval and updating of a value more complex and error-prone
22. Storage Access
A part of the system's main memory is used as a buffer
to store copies of disk blocks
The buffer manager is responsible to move data from
secondary disk storage into memory
the number of block transfers between disk and memory should
be minimised
as many blocks as possible should be kept in memory
The buffer manager is called by the DBMS every time a
disk block has to be accessed
the buffer manager has to check whether the block is already
allocated in the buffer (main memory)
23. Buffer Manager
If the requested block is already in the buffer, the buffer
manager returns the corresponding address
If the block is not yet in the buffer, the buffer manager
performs the following steps
allocate buffer space
- if no space is available, remove an existing block from the buffer
(based on a buffer replacement strategy) and write it back to the disk if it has
been modified since it was last fetched/written to disk
read the block from the disk, add it to the buffer and return the
corresponding memory address
Note the similarities to a virtual memory manager
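The steps above can be sketched as a tiny buffer manager. This is an illustrative model only: a plain dict stands in for the disk, and LRU is used as the replacement strategy (replacement strategies are discussed on the following slides):

```python
from collections import OrderedDict

class BufferManager:
    """Minimal buffer-manager sketch: a fixed number of buffer frames,
    LRU replacement and write-back of modified (dirty) blocks on eviction."""

    def __init__(self, disk, capacity):
        self.disk = disk                  # dict standing in for block storage
        self.capacity = capacity
        self.buffer = OrderedDict()       # block_id -> (data, dirty flag)

    def get_block(self, block_id):
        if block_id in self.buffer:               # block already buffered
            self.buffer.move_to_end(block_id)     # mark as recently used
            return self.buffer[block_id][0]
        if len(self.buffer) >= self.capacity:     # no free frame: evict LRU
            victim, (data, dirty) = self.buffer.popitem(last=False)
            if dirty:                             # write back if modified
                self.disk[victim] = data
        self.buffer[block_id] = (self.disk[block_id], False)
        return self.buffer[block_id][0]

    def update_block(self, block_id, data):
        """Modify a buffered block; it becomes dirty and recently used."""
        self.buffer[block_id] = (data, True)
        self.buffer.move_to_end(block_id)

disk = {1: "block 1 data", 2: "block 2 data", 3: "block 3 data"}
bm = BufferManager(disk, capacity=2)
bm.get_block(1)
bm.get_block(2)
bm.update_block(1, "modified")
bm.get_block(3)        # evicts block 2 (clean, least recently used)
bm.get_block(2)        # evicts block 1 (dirty) and writes it back
print(disk[1])         # modified
```

Just as with a virtual memory manager, a clean victim can simply be dropped, while a dirty one must be written back first.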
24. Buffer Replacement Strategies
Most operating systems use a least recently used (LRU)
strategy where the block that was least recently used is
moved back from memory to disk
use past access pattern to predict future block access
A DBMS is able to predict future access patterns more
accurately than an operating system
a request to the DBMS involves multiple steps and the DBMS
might be able to determine which blocks will be needed by
analysing the different steps of the operation
note that LRU might not always be the best replacement strategy
for a DBMS
25. Buffer Replacement Strategies ...
Let us have a look at the procedure to compute the
following natural join query: order ⋈ customer
note that we will see more efficient solutions for this problem
when discussing query optimisation
for each tuple o of order {
for each tuple c of customer {
if o.customerID = c.customerID {
create a new tuple r with:
r.customerID := c.customerID
r.name := c.name
...
r.orderID := o.orderID
...
add tuple r to the result set of the join operation
}
}
}
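The pseudocode above is a plain nested-loop join; in Python, with small hypothetical relations holding only the attributes used on the slide:

```python
# Hypothetical sample relations with the attributes from the slide.
order = [{"orderID": 1, "customerID": 7}, {"orderID": 2, "customerID": 9}]
customer = [{"customerID": 7, "name": "Eddy Merckx"},
            {"customerID": 9, "name": "Max Frisch"}]

result = []
for o in order:                 # outer relation: each tuple processed once
    for c in customer:          # inner relation: rescanned for every o
        if o["customerID"] == c["customerID"]:
            r = {**c, **o}      # build the joined tuple from both sides
            result.append(r)

print(len(result))  # 2
```

Note the access pattern this creates: each order block is needed only once, while the customer blocks are cycled through again and again, which is exactly what the buffer-replacement discussion on the next slides exploits.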
26. Buffer Replacement Strategies ...
We further assume that the two relations order and
customer are stored in separate files
From the pseudocode we can see that
once an order tuple has been processed, it is not needed
anymore
- if a whole block of order tuples has been processed, that block is no longer
required in memory (but an LRU strategy might keep it)
- as soon as the last tuple of an order block has been processed, the buffer
manager should free the memory space (toss-immediate strategy)
once a customer tuple has been processed, it is not accessed
again until all the other customer tuples have been accessed
- when the processing of a customer block has been finished, the least recently
used customer block will be requested next
- we should replace the block that has been most recently used (MRU)
27. Buffer Replacement Strategies ...
A memory block can be marked to indicate that this block
is not allowed to be written back to disk (pinned block)
note that if we want to use an MRU strategy for the inner loop of
the previous example, the block has to be pinned
- the block has to be unpinned after the last tuple in the block has been processed
the pinning of blocks provides some control to restrict the time
when blocks can be written back to disk
- important for crash recovery
- blocks that are currently updated should not be written to disk
The prefetching of blocks might be used to further
increase the performance of the overall system
e.g. for serial scans (relation scans)
28. Buffer Replacement Strategies ...
The buffer manager can also use statistical information
about the probability that a request will reference a
particular relation (and its related blocks)
the system catalogue (data dictionary) with its metadata is one of
the most frequently accessed parts of the database
- if possible, system catalogue blocks should always be in the buffer
index files might be accessed more often than the corresponding
files themselves
- do not remove index files from the buffer if not necessary
the crash recovery manager can also provide constraints for the
buffer manager
- the recovery manager might demand that other blocks have to be written first
(force-output) before a specific block can be written to disk
29. System Catalogue / Data Dictionary
Stores metadata about the database
names of the relations
names, domain and lengths of the attributes of each relation
names of views
names of indices
- name of relation that is indexed
- name of attributes
- type of index
integrity constraints
users and their authorisations
statistical data
- number of tuples in relation, storage method, ...
...
30. File Organisation
A file is logically organised as a sequence of records
each record contains a sequence of fields
name, datatype and offset of record fields are defined by
the schema
record types (schema) might change over time
The records are mapped to disk blocks
the block size is fixed and defined by the physical properties of
the disk and the operating system
the record size might vary for different relations and even
between tuples of the same relation (variable field size)
There are different possible mappings of records to files
use multiple files and only store fixed-length records in each file
store variable-length records in a file
31. Fixed-Length Records
If we assume that an integer requires 2 bytes and
characters are represented by one byte, then the
customer record is 64 bytes long
type customer = record
  cID int;
  name varchar(30);
  street varchar(30)
end
[Record layout in a block: cID at bytes 0-2, name at bytes 2-33, street at bytes 33-64]
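Assuming the common length-plus-content encoding for the two VARCHAR(30) fields (31 bytes each) and a 2-byte integer, the 64-byte record can be packed with Python's struct module; the field values are illustrative:

```python
import struct

# 64-byte fixed-length record: a 2-byte integer cID, then two VARCHAR(30)
# fields, each stored as a length byte plus 30 bytes of (padded) content.
RECORD = struct.Struct("<h31s31s")   # 2 + 31 + 31 = 64 bytes

def pack_varchar(s, n=30):
    data = s.encode("ascii")
    return bytes([len(data)]) + data.ljust(n, b"\x00")

def unpack_varchar(field):
    return field[1:1 + field[0]].decode("ascii")

record = RECORD.pack(42, pack_varchar("Eddy Merckx"), pack_varchar("Pleinlaan 2"))
print(RECORD.size)    # 64
cid, name, street = RECORD.unpack(record)
print(cid, unpack_varchar(name), unpack_varchar(street))
```

Because every record has the same size, the record at index i of a block sits at byte offset 64 * i, which is what makes fixed-length files so easy to address.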
32. Fixed-Length Records ...
Often a record header is added to each record for
managing metadata about
the record schema (pointer s to the DBMS schema information)
timestamp t about the last access or modification time
the length l of the record
- could be computed from the schema but the information is convenient if we
want to quickly access the next record without having to consult the schema
...
[Record layout with a header: the header fields s, t and l come first, followed by cID, name and street; example byte offsets 0, 12, 16, 48 and 80]
33. Fixed-Length Records in Blocks/Files
Problems with this fixed length representation
after a record has been deleted, its space has to be filled with
another record
- could move all records after the deleted one but that is too expensive
- can move the last record to the deleted record's position but also that might
require an additional block access
if the block size is not a multiple of the record size, some records
will cross block boundaries and we need two block accesses to
read/write such a record
[Block with five fixed-length records:
record 0: 1 | Max Frisch | Bahnhofstrasse 7
record 1: 2 | Eddy Merckx | Pleinlaan 25
record 2: 5 | Claude Debussy | 12 Rue Louise
record 3: 53 | Albert Einstein | Bergstrasse 18
record 4: 8 | Max Frisch | ETH Zentrum]
34. Fixed-Length Records in Blocks/Files ...
Since insert operations tend to be more frequent than
delete operations, it might be acceptable to leave the
space of the deleted record open until a new record is
inserted
we cannot just add an additional boolean flag ("free") to the
record since it will be hard to find the free records
allocate a certain number of bytes for a file header containing
metadata about the file
The block/file header contains a pointer (address) to the
first deleted record
each deleted record contains a pointer (address) to the next
deleted record
the linked list of deleted records is called a free list
35. Fixed-Length Records in Blocks/Files ...
To insert a new record, we use the first free record pointed to
by the header and update the header to point to the next free
record (the one the used record was pointing to)
to save some space, the pointers of the free list can also be
stored in the unused space of deleted records (no additional field)
[Figure: the file header and the deleted record slots form the free list; the remaining records are (1, Max Frisch, Bahnhofstrasse 7), (5, Claude Debussy, 12 Rue Louise) and (8, Max Frisch, ETH Zentrum)]
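The free-list bookkeeping can be sketched as follows; slot indices stand in for record addresses, and the next-pointers are kept in a separate list rather than inside the unused space of the deleted records, purely for readability:

```python
class FixedLengthFile:
    """Free-list sketch for a file of fixed-length record slots.
    The header stores the index of the first free slot; each free slot
    stores the index of the next free slot (or None at the end)."""

    def __init__(self, slots):
        self.slots = [None] * slots               # None = free slot
        self.free_head = 0                        # header: first free slot
        self.next_free = list(range(1, slots)) + [None]

    def insert(self, record):
        if self.free_head is None:
            raise MemoryError("file full")
        slot = self.free_head
        self.free_head = self.next_free[slot]     # pop from the free list
        self.slots[slot] = record
        return slot

    def delete(self, slot):
        self.slots[slot] = None
        self.next_free[slot] = self.free_head     # push onto the free list
        self.free_head = slot

f = FixedLengthFile(3)
a = f.insert("record A")     # slot 0
b = f.insert("record B")     # slot 1
f.delete(a)
c = f.insert("record C")
print(c)                     # 0 -- the deleted slot is reused
```

Both insert and delete are O(1): no records are moved, and finding a free slot never requires scanning the file.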
36. Address Space
There are several ways in which the database address
space (blocks and block offsets) can be represented
physical addresses consisting of byte strings (up to 16 bytes)
that address
- host
- storage device identifier (e.g. hard disk ID)
- cylinder number of the disk
- track within the cylinder (for multi-surface disks)
- block within the track
- potential offset of record within the block
logical addresses consisting of an arbitrary string of length n
37. Address Space Mapping
A map table is stored at a known disk location and
provides a mapping between the logical and physical
address spaces
introduces some indirection since the map table has to be
consulted to get the physical address
flexibility to rearrange records within blocks or move them to other
blocks without affecting the record's logical address
different combinations of logical and physical addresses are
possible (structured address schemes)
[Figure: map table translating logical addresses to physical addresses]
38. Variable-Length Data
Records of the same type may have different lengths
We may want to represent
record fields with varying size (e.g. VARCHAR(n))
large fields (e.g. images)
...
We need an alternative data representation to deal with
these requirements
39. Variable-Length Record Fields
Scheme for records with variable-length fields
put all fixed-length fields first (e.g. cID)
add the length of the record to the record header
add the offsets of the variable-length fields to the record header
Note that if the order of the variable-length fields is
always the same, we do not have to store an offset for
the first variable-length field (e.g. name)
[Record layout: record length and field offsets in the header, followed by cID, name and street]
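A sketch of this layout: the fixed-length cID comes first, the header also stores the record length and the offset of street, and name needs no stored offset because it always directly follows the header. The exact header layout (three 2-byte values) is an assumption for illustration:

```python
import struct

HEADER = struct.Struct("<hhh")   # cID, record length, offset of street

def pack_record(cID, name, street):
    """Variable-length record: fixed-length field first, then the record
    length and the offset of the second variable-length field."""
    name_b, street_b = name.encode(), street.encode()
    street_offset = HEADER.size + len(name_b)
    length = street_offset + len(street_b)
    return HEADER.pack(cID, length, street_offset) + name_b + street_b

def unpack_record(record):
    cID, length, street_offset = HEADER.unpack_from(record)
    name = record[HEADER.size:street_offset].decode()
    street = record[street_offset:length].decode()
    return cID, name, street

r = pack_record(42, "Claude Debussy", "12 Rue Louise")
print(unpack_record(r))   # (42, 'Claude Debussy', '12 Rue Louise')
```

Every field can still be located with a single header read, without scanning for delimiters.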
40. Variable-Length Records
There are different reasons why we might have to use
variable-length records
to store records that have at least one field with a variable length
to store different record types in a single block/file
Structured address scheme (slotted-page structure)
address of a record consists of the block address in combination
with an offset table index
records can be moved around
[Figure: slotted page with the offset table at one end and record1, record2, record3 at the other; the free space sits in between]
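The slotted-page idea can be sketched as follows; this minimal version supports insertion and lookup only (deletion and compaction are discussed on later slides):

```python
class SlottedPage:
    """Slotted-page sketch: records grow from the end of the block, the
    offset table grows from the front, free space sits in between.
    A record's address is (page, slot index), so a record can be moved
    within the page without its address changing."""

    def __init__(self, size=4096):
        self.data = bytearray(size)
        self.offsets = []            # slot index -> (offset, length)
        self.free_end = size         # records are placed below this point

    def insert(self, record: bytes):
        self.free_end -= len(record)
        self.data[self.free_end:self.free_end + len(record)] = record
        self.offsets.append((self.free_end, len(record)))
        return len(self.offsets) - 1          # slot index = record address

    def get(self, slot):
        offset, length = self.offsets[slot]
        return bytes(self.data[offset:offset + length])

page = SlottedPage()
s = page.insert(b"record 1")
print(page.get(s))   # b'record 1'
```

Because only the offset table entry changes when a record is moved, compaction after deletions never invalidates addresses handed out to index structures.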
41. Large Records
Sometimes we have to deal with values that do not fit
into a single block (e.g. audio or movie clips)
a record that is split across two or more blocks is called
a spanned record
spanned records can also be used to pack blocks more efficiently
Extra header information
each record header carries a bit to indicate if it is a fragment
- fragments carry additional bits telling whether they are the first or last fragment of the record
potential pointers to previous and next fragment
[Figure: record2 spans two blocks, stored as fragment record2a in block 1 and fragment record2b in block 2; each block has a block header and per-record headers]
42. Storage of Binary Large Objects (BLOBs)
BLOB is stored as a sequence of blocks
often blocks allocated successively on a disk cylinder
BLOB might be striped across multiple disks for more
efficient retrieval
BLOB field might not be automatically fetched into
memory
the user has to explicitly load parts of the BLOB
index structures may be used to retrieve parts of a BLOB
43. Insertion of Records
If the records are not kept in a particular order, we can
just find a block with some empty space or create a new
block if there is no such space
If the record has to be inserted in a particular order, but
there is no space in the block, there are two alternatives
find space in a nearby block and rearrange some records
create an overflow block and link it from the header of the original
block
- note that an overflow block might point to another overflow block and so on
44. Deletion of Records
If we use an offset table, we may compact the free space
in the block (slide around the records)
If the records cannot be moved, we might have a free list
in the header
We might also be able to remove an overflow block after
a delete operation
45. Update of Records
If we have to update a fixed-length record there is no
problem since we will still use the same space
If the updated record is larger than the original version,
then we might have to create more space
same options as discussed for insert operation
If the updated record is smaller, then we may compact
some free space or remove overflow blocks
similar to delete operation
46. Homework
Study the following chapter of the
Database System Concepts book
chapter 10
- sections 10.1-10.9
- Storage and File Structure
47. Exercise 8
Functional Dependencies and Normalisation
48. References
H. Garcia-Molina, J.D. Ullman and J. Widom,
Database Systems – The Complete Book,
Prentice Hall, 2002
A. Silberschatz, H. Korth and S. Sudarshan, Database
System Concepts (Sixth Edition), McGraw-Hill, 2010