SlideShare uma empresa Scribd logo
1 de 63
1
Data Storage and Indexing
2
Outline
• Data Storage
– storing data: disks
– buffer manager
– representing data
– external sorting
• Indexing
– types of indexes
– B+ trees
– hash tables (maybe)
3
Physical Addresses
• Each block and each record have a physical
address that consists of:
– The host
– The disk
– The cylinder number
– The track number
– The block within the track
– For records: an offset in the block
• sometimes this is in the block’s header
4
Logical Addresses
• Logical address: a string of bytes (10-16)
• More flexible: can blocks/records around
• But need translation table:
Logical address Physical address
L1 P1
L2 P2
L3 P3
5
Main Memory Address
• When the block is read in main memory, it
receives a main memory address
• Buffer manager has another translation table
Memory address Logical address
M1 L1
M2 L2
M3 L3
6
The I/O Model of Computation
• In main memory algorithms
– we care about CPU time
• In databases time is dominated by I/O cost
• Assumption: cost is given only by I/O
• Consequence: need to redesign certain
algorithms
• Will illustrate here with sorting
7
Sorting
• Illustrates the difference in algorithm design
when your data is not in main memory:
– Problem: sort 1Gb of data with 1Mb of RAM.
• Arises in many places in database systems:
– Data requested in sorted order (ORDER BY)
– Needed for grouping operations
– First step in sort-merge join algorithm
– Duplicate removal
– Bulk loading of B+-tree indexes.
8
2-Way Merge-sort:
Requires 3 Buffers
• Pass 1: Read a page, sort it, write it.
– only one buffer page is used
• Pass 2, 3, …, etc.:
– three buffer pages used.
Main memory buffers
INPUT 1
INPUT 2
OUTPUT
DiskDisk
9
Two-Way External Merge Sort
• Each pass we read + write
each page in file.
• N pages in the file => the
number of passes
• So total cost is:
• Improvement: start with
larger runs
• Sort 1GB with 1MB memory
in 10 passes
 = +log2 1N
 ( )2 12N Nlog +
Input file
1-page runs
2-page runs
4-page runs
8-page runs
PASS 0
PASS 1
PASS 2
PASS 3
9
3,4 6,2 9,4 8,7 5,6 3,1 2
3,4 5,62,6 4,9 7,8 1,3 2
2,3
4,6
4,7
8,9
1,3
5,6 2
2,3
4,4
6,7
8,9
1,2
3,5
6
1,2
2,3
3,4
4,5
6,6
7,8
10
Can We Do Better ?
• We have more main memory
• Should use it to improve performance
11
Cost Model for Our Analysis
• B: Block size
• M: Size of main memory
• N: Number of records in the file
• R: Size of one record
12
Indexes
13
Outline
• Types of indexes
• B+ trees
• Hash tables (maybe)
14
Indexes
• An index on a file speeds up selections on the
search key field(s)
• Search key = any subset of the fields of a relation
– Search key is not the same as key (minimal set of fields
that uniquely identify a record in a relation).
• Entries in an index: (k, r), where:
– k = the key
– r = the record OR record id OR record ids
15
Index Classification
• Clustered/unclustered
– Clustered = records sorted in the key order
– Unclustered = no
• Dense/sparse
– Dense = each record has an entry in the index
– Sparse = only some records have
• Primary/secondary
– Primary = on the primary key
– Secondary = on any key
– Some books interpret these differently
• B+ tree / Hash table / …
16
Clustered Index
• File is sorted on the index attribute
• Dense index: sequence of (key,pointer) pairs
10
20
30
40
50
60
70
80
10
20
30
40
50
60
70
80
17
Clustered Index
• Sparse index: one key per data block
10
30
50
70
90
110
130
150
10
20
30
40
50
60
70
80
18
Clustered Index with Duplicate
Keys
• Dense index: point to the first record with
that key
10
20
30
40
50
60
70
80
10
10
10
20
20
20
30
40
19
Clustered Index with Duplicate
Keys
• Sparse index: pointer to lowest search key in
each block:
• Search for 20
10
10
20
30
10
10
10
20
20
20
30
40
20
Clustered Index with Duplicate
Keys
• Better: pointer to lowest new search key in
each block:
• Search for 20
10
20
30
40
50
60
70
80
10
10
10
20
30
30
40
50
21
Unclustered Indexes
• To index other attributes than primary key
• Always dense (why ?)
10
10
20
20
20
30
30
30
20
30
30
20
10
20
10
30
22
Summary Clustered vs.
Unclustered Index
Data entries
(Index File)
(Data file)
Data Records
Data entries
Data Records
CLUSTERED UNCLUSTERED
23
Composite Search Keys
• Composite Search Keys: Search
on a combination of fields.
– Equality query: Every field
value is equal to a constant
value. E.g. wrt <sal,age>
index:
• age=20 and sal =75
– Range query: Some field
value is not a constant. E.g.:
• age =20; or age=20 and
sal > 10
sue 13 75
bob
cal
joe 12
10
20
8011
12
name age sal
<sal, age>
<age, sal> <age>
<sal>
12,20
12,10
11,80
13,75
20,12
10,12
75,13
80,11
11
12
12
13
10
20
75
80
Data records
sorted by name
Data entries in index
sorted by <sal,age>
Data entries
sorted by <sal>
Examples of composite key
indexes using lexicographic order.
24
B+ Trees
• Search trees
• Idea in B Trees:
– make 1 node = 1 block
• Idea in B+ Trees:
– Make leaves into a linked list (range queries are
easier)
25
• Parameter d = the degree
• Each node has >= d and <= 2d keys (except root)
• Each leaf has >=d and <= 2d keys:
B+ Trees Basics
30 120 240
Keys k < 30
Keys 30<=k<120 Keys 120<=k<240 Keys 240<=k
40 50 60
40 50 60
Next leaf
26
B+ Tree Example
80
20 60 100 120 140
10 15 18 20 30 40 50 60 65 80 85 90
10 15 18 20 30 40 50 60 65 80 85 90
d = 2
27
B+ Tree Design
• How large d ?
• Example:
– Key size = 4 bytes
– Pointer size = 8 bytes
– Block size = 4096 byes
• 2d x 4 + (2d+1) x 8 <= 4096
• d = 170
28
Searching a B+ Tree
• Exact key values:
– Start at the root
– Proceed down, to the leaf
• Range queries:
– As above
– Then sequential traversal
Select name
From people
Where age = 25
Select name
From people
Where 20 <= age
and age <= 30
29
B+ Trees in Practice
• Typical order: 100. Typical fill-factor: 67%.
– average fanout = 133
• Typical capacities:
– Height 4: 1334
= 312,900,700 records
– Height 3: 1333
= 2,352,637 records
• Can often hold top levels in buffer pool:
– Level 1 = 1 page = 8 Kbytes
– Level 2 = 133 pages = 1 Mbyte
– Level 3 = 17,689 pages = 133 MBytes
30
Insertion in a B+ Tree
Insert (K, P)
• Find leaf where K belongs, insert
• If no overflow (2d keys or less), halt
• If overflow (2d+1 keys), split node, insert in parent:
• If leaf, keep K3 too in right node
• When root splits, new root has 1 key only
K1 K2 K3 K4 K5
P0 P1 P2 P3 P4 p5
K1 K2
P0 P1 P2
K4 K5
P3 P4 p5
(K3, ) to parent
31
Insertion in a B+ Tree
80
20 60 100 120 140
10 15 18 20 30 40 50 60 65 80 85 90
10 15 18 20 30 40 50 60 65 80 85 90
Insert K=19
32
Insertion in a B+ Tree
80
20 60 100 120 140
10 15 18 19 20 30 40 50 60 65 80 85 90
10 15 18 20 30 40 50 60 65 80 85 9019
After insertion
33
Insertion in a B+ Tree
80
20 60 100 120 140
10 15 18 19 20 30 40 50 60 65 80 85 90
10 15 18 20 30 40 50 60 65 80 85 9019
Now insert 25
34
Insertion in a B+ Tree
80
20 60 100 120 140
10 15 18 19 20 2
5
30 4
0
50 60 65 80 85 90
10 15 18 20 25 30 40 60 65 80 85 9019
After insertion
50
35
Insertion in a B+ Tree
80
20 60 100 120 140
10 15 18 19 20 2
5
30 4
0
50 60 65 80 85 90
10 15 18 20 25 30 40 60 65 80 85 9019
But now have to split !
50
36
Insertion in a B+ Tree
80
20 30 60 100 120 140
10 15 18 19 20 2
5
60 65 80 85 90
10 15 18 20 25 30 40 60 65 80 85 9019
After the split
50
30 4
0
5
0
37
Deletion from a B+ Tree
80
20 30 60 100 120 140
10 15 18 19 20 2
5
60 65 80 85 90
10 15 18 20 25 30 40 60 65 80 85 9019
Delete 30
50
30 4
0
5
0
38
Deletion from a B+ Tree
80
20 30 60 100 120 140
10 15 18 19 20 2
5
60 65 80 85 90
10 15 18 20 25 40 60 65 80 85 9019
After deleting 30
50
40 5
0
May change to
40, or not
39
Deletion from a B+ Tree
80
20 30 60 100 120 140
10 15 18 19 20 2
5
60 65 80 85 90
10 15 18 20 25 40 60 65 80 85 9019
Now delete 25
50
40 5
0
40
Deletion from a B+ Tree
80
20 30 60 100 120 140
10 15 18 19 20 60 65 80 85 90
10 15 18 20 40 60 65 80 85 9019
After deleting 25
Need to rebalance
Rotate
50
40 5
0
41
Deletion from a B+ Tree
80
19 30 60 100 120 140
10 15 18 19 2
0
60 65 80 85 90
10 15 18 20 40 60 65 80 85 9019
Now delete 40
50
40 5
0
42
Deletion from a B+ Tree
80
19 30 60 100 120 140
10 15 18 19 2
0
60 65 80 85 90
10 15 18 20 60 65 80 85 9019
After deleting 40
Rotation not possible
Need to merge nodes
50
50
43
Deletion from a B+ Tree
80
19 60 100 120 140
10 15 18 19 2
0
5
0
60 65 80 85 90
10 15 18 20 60 65 80 85 9019
Final tree
50
44
Hash Tables
• Secondary storage hash tables are much like
main memory ones
• Recall basics:
– There are n buckets
– A hash function f(k) maps a key k to {0, 1, …, n-1}
– Store in bucket f(k) a pointer to record with key k
• Secondary storage: bucket = block, use
overflow blocks when needed
45
• Assume 1 bucket (block) stores 2 keys +
pointers
• h(e)=0
• h(b)=h(f)=1
• h(g)=2
• h(a)=h(c)=3
Hash Table Example
e
b
f
g
a
c
0
1
2
3
46
• Search for a:
• Compute h(a)=3
• Read bucket 3
• 1 disk access
Searching in a Hash Table
e
b
f
g
a
c
0
1
2
3
47
• Place in right bucket, if space
• E.g. h(d)=2
Insertion in Hash Table
e
b
f
g
d
a
c
0
1
2
3
48
• Create overflow block, if no space
• E.g. h(k)=1
• More over-
flow blocks
may be needed
Insertion in Hash Table
e
b
f
g
d
a
c
0
1
2
3
k
49
Hash Table Performance
• Excellent, if no overflow blocks
• Degrades considerably when number of
keys exceeds the number of buckets (I.e.
many overflow blocks).
50
Extensible Hash Table
• Allows has table to grow, to avoid
performance degradation
• Assume a hash function h that returns
numbers in {0, …, 2k
– 1}
• Start with n = 2i
<< 2k
, only look at first i
most significant bits
51
Extensible Hash Table
• E.g. i=1, n=2, k=4
• Note: we only look at the first bit (0 or 1)
0(010)
1(011)
i=1 1
1
0
1
52
Insertion in Extensible Hash
Table
• Insert 1110
0(010)
1(011)
1(110)
i=1 1
1
0
1
53
Insertion in Extensible Hash
Table
• Now insert 1010
• Need to extend table, split blocks
• i becomes 2
0(010)
1(011)
1(110), 1(010)
i=1 1
1
0
1
54
Insertion in Extensible Hash
Table
• Now insert 1110
0(010)
10(11)
10(10)
i=2 1
2
00
01
10
11
11(10) 2
55
Insertion in Extensible Hash
Table
• Now insert 0000, then 0101
• Need to split block
0(010)
0(000), 0(101)
10(11)
10(10)
i=2 1
2
00
01
10
11
11(10) 2
56
Insertion in Extensible Hash
Table
• After splitting the block
00(10)
00(00)
10(11)
10(10)
i=2
2
2
00
01
10
11
11(10) 2
01(01) 2
57
Performance Extensible Hash
Table
• No overflow blocks: access always one read
• BUT:
– Extensions can be costly and disruptive
– After an extension table may no longer fit in
memory
58
Linear Hash Table
• Idea: extend only one entry at a time
• Problem: n= no longer a power of 2
• Let i be such that 2i
<= n < 2i+1
• After computing h(k), use last i bits:
– If last i bits represent a number > n, change msb
from 1 to 0 (get a number <= n)
59
Linear Hash Table Example
• N=3
(01)00
(11)00
(10)10
i=2
00
01
10
(01)11 BIT FLIP
60
Linear Hash Table Example
• Insert 1000: overflow blocks…
(01)00
(11)00
(10)10
i=2
00
01
10
(01)11
(10)00
61
Linear Hash Tables
• Extension: independent on overflow blocks
• Extend n:=n+1 when average number of
records per block exceeds (say) 80%
62
Linear Hash Table Extension
• From n=3 to n=4
• Only need to touch
one block (which one ?)
(01)00
(11)00
(10)10
i=2
00
01
10
(01)11
(01)11
(01)11
i=2
00
01
10
(10)10
(01)00
(11)00
11
63
Linear Hash Table Extension
• From n=3 to n=4 finished
• Extension from n=4
to n=5 (new bit)
• Need to touch every
single block (why ?)
(01)11
i=2
00
01
10
(10)10
(01)00
(11)00
11

Mais conteúdo relacionado

Destaque

Network internet
Network internetNetwork internet
Network internet
Kumar
 
Utsav Mahendra : Introduction to Database and managemnet
Utsav Mahendra : Introduction to Database and managemnetUtsav Mahendra : Introduction to Database and managemnet
Utsav Mahendra : Introduction to Database and managemnet
Utsav Mahendra
 
Database Systems Introduction (INTD-3535)
Database Systems Introduction (INTD-3535)Database Systems Introduction (INTD-3535)
Database Systems Introduction (INTD-3535)
julyprum
 
TID Chapter 10 Introduction To Database
TID Chapter 10 Introduction To DatabaseTID Chapter 10 Introduction To Database
TID Chapter 10 Introduction To Database
WanBK Leo
 
1 introduction databases and database users
1 introduction databases and database users1 introduction databases and database users
1 introduction databases and database users
Kumar
 
Database , 6 Query Introduction
Database , 6 Query Introduction Database , 6 Query Introduction
Database , 6 Query Introduction
Ali Usman
 
1. Introduction to DBMS
1. Introduction to DBMS1. Introduction to DBMS
1. Introduction to DBMS
koolkampus
 
Lecture 01 introduction to database
Lecture 01 introduction to databaseLecture 01 introduction to database
Lecture 01 introduction to database
emailharmeet
 

Destaque (20)

Network internet
Network internetNetwork internet
Network internet
 
Introduction to network ( Internet and its layer) Or how internet really works!
Introduction to network ( Internet and its layer) Or how internet really works!Introduction to network ( Internet and its layer) Or how internet really works!
Introduction to network ( Internet and its layer) Or how internet really works!
 
W 8 introduction to database
W 8  introduction to databaseW 8  introduction to database
W 8 introduction to database
 
Introduction to databases
Introduction to databasesIntroduction to databases
Introduction to databases
 
01.01 introduction to database
01.01 introduction to database01.01 introduction to database
01.01 introduction to database
 
Utsav Mahendra : Introduction to Database and managemnet
Utsav Mahendra : Introduction to Database and managemnetUtsav Mahendra : Introduction to Database and managemnet
Utsav Mahendra : Introduction to Database and managemnet
 
Database Systems Introduction (INTD-3535)
Database Systems Introduction (INTD-3535)Database Systems Introduction (INTD-3535)
Database Systems Introduction (INTD-3535)
 
database design process
database design processdatabase design process
database design process
 
Basics of Database Design
Basics of Database DesignBasics of Database Design
Basics of Database Design
 
TID Chapter 10 Introduction To Database
TID Chapter 10 Introduction To DatabaseTID Chapter 10 Introduction To Database
TID Chapter 10 Introduction To Database
 
introduction to database
 introduction to database introduction to database
introduction to database
 
1 introduction databases and database users
1 introduction databases and database users1 introduction databases and database users
1 introduction databases and database users
 
Database an introduction
Database an introductionDatabase an introduction
Database an introduction
 
Database , 6 Query Introduction
Database , 6 Query Introduction Database , 6 Query Introduction
Database , 6 Query Introduction
 
Introduction to database
Introduction to databaseIntroduction to database
Introduction to database
 
Introduction: Databases and Database Users
Introduction: Databases and Database UsersIntroduction: Databases and Database Users
Introduction: Databases and Database Users
 
1. Introduction to DBMS
1. Introduction to DBMS1. Introduction to DBMS
1. Introduction to DBMS
 
Introduction to database
Introduction to databaseIntroduction to database
Introduction to database
 
The Internet
The InternetThe Internet
The Internet
 
Lecture 01 introduction to database
Lecture 01 introduction to databaseLecture 01 introduction to database
Lecture 01 introduction to database
 

Semelhante a [Www.pkbulk.blogspot.com]file and indexing

btrees.ppt ttttttttttttttttttttttttttttt
btrees.ppt  tttttttttttttttttttttttttttttbtrees.ppt  ttttttttttttttttttttttttttttt
btrees.ppt ttttttttttttttttttttttttttttt
RAtna29
 

Semelhante a [Www.pkbulk.blogspot.com]file and indexing (20)

Cache recap
Cache recapCache recap
Cache recap
 
Cache recap
Cache recapCache recap
Cache recap
 
Cache recap
Cache recapCache recap
Cache recap
 
Cache recap
Cache recapCache recap
Cache recap
 
Cache recap
Cache recapCache recap
Cache recap
 
Cache recap
Cache recapCache recap
Cache recap
 
Cache recap
Cache recapCache recap
Cache recap
 
14 query processing-sorting
14 query processing-sorting14 query processing-sorting
14 query processing-sorting
 
Memory caching
Memory cachingMemory caching
Memory caching
 
Memory caching
Memory cachingMemory caching
Memory caching
 
Memory caching
Memory cachingMemory caching
Memory caching
 
Memory caching
Memory cachingMemory caching
Memory caching
 
Memory caching
Memory cachingMemory caching
Memory caching
 
Memory caching
Memory cachingMemory caching
Memory caching
 
Cache mapping
Cache mappingCache mapping
Cache mapping
 
btrees.ppt ttttttttttttttttttttttttttttt
btrees.ppt  tttttttttttttttttttttttttttttbtrees.ppt  ttttttttttttttttttttttttttttt
btrees.ppt ttttttttttttttttttttttttttttt
 
Lecture2-DT.pptx
Lecture2-DT.pptxLecture2-DT.pptx
Lecture2-DT.pptx
 
Managing Data and Operation Distribution In MongoDB
Managing Data and Operation Distribution In MongoDBManaging Data and Operation Distribution In MongoDB
Managing Data and Operation Distribution In MongoDB
 
Database Sizing
Database SizingDatabase Sizing
Database Sizing
 
Sasi, cassandra on the full text search ride At Voxxed Day Belgrade 2016
Sasi, cassandra on the full text search ride At  Voxxed Day Belgrade 2016Sasi, cassandra on the full text search ride At  Voxxed Day Belgrade 2016
Sasi, cassandra on the full text search ride At Voxxed Day Belgrade 2016
 

Mais de AnusAhmad

Mais de AnusAhmad (20)

[Www.pkbulk.blogspot.com]dbms12
[Www.pkbulk.blogspot.com]dbms12[Www.pkbulk.blogspot.com]dbms12
[Www.pkbulk.blogspot.com]dbms12
 
[Www.pkbulk.blogspot.com]dbms11
[Www.pkbulk.blogspot.com]dbms11[Www.pkbulk.blogspot.com]dbms11
[Www.pkbulk.blogspot.com]dbms11
 
[Www.pkbulk.blogspot.com]dbms10
[Www.pkbulk.blogspot.com]dbms10[Www.pkbulk.blogspot.com]dbms10
[Www.pkbulk.blogspot.com]dbms10
 
[Www.pkbulk.blogspot.com]dbms09
[Www.pkbulk.blogspot.com]dbms09[Www.pkbulk.blogspot.com]dbms09
[Www.pkbulk.blogspot.com]dbms09
 
[Www.pkbulk.blogspot.com]dbms07
[Www.pkbulk.blogspot.com]dbms07[Www.pkbulk.blogspot.com]dbms07
[Www.pkbulk.blogspot.com]dbms07
 
[Www.pkbulk.blogspot.com]dbms06
[Www.pkbulk.blogspot.com]dbms06[Www.pkbulk.blogspot.com]dbms06
[Www.pkbulk.blogspot.com]dbms06
 
[Www.pkbulk.blogspot.com]dbms05
[Www.pkbulk.blogspot.com]dbms05[Www.pkbulk.blogspot.com]dbms05
[Www.pkbulk.blogspot.com]dbms05
 
[Www.pkbulk.blogspot.com]dbms04
[Www.pkbulk.blogspot.com]dbms04[Www.pkbulk.blogspot.com]dbms04
[Www.pkbulk.blogspot.com]dbms04
 
[Www.pkbulk.blogspot.com]dbms03
[Www.pkbulk.blogspot.com]dbms03[Www.pkbulk.blogspot.com]dbms03
[Www.pkbulk.blogspot.com]dbms03
 
[Www.pkbulk.blogspot.com]dbms02
[Www.pkbulk.blogspot.com]dbms02[Www.pkbulk.blogspot.com]dbms02
[Www.pkbulk.blogspot.com]dbms02
 
[Www.pkbulk.blogspot.com]dbms01
[Www.pkbulk.blogspot.com]dbms01[Www.pkbulk.blogspot.com]dbms01
[Www.pkbulk.blogspot.com]dbms01
 
[Www.pkbulk.blogspot.com]dbms13
[Www.pkbulk.blogspot.com]dbms13[Www.pkbulk.blogspot.com]dbms13
[Www.pkbulk.blogspot.com]dbms13
 
9. java server faces
9. java server faces9. java server faces
9. java server faces
 
8. java script
8. java script8. java script
8. java script
 
7. struts
7. struts7. struts
7. struts
 
5. servlets
5. servlets5. servlets
5. servlets
 
4. jsp
4. jsp4. jsp
4. jsp
 
3. applets
3. applets3. applets
3. applets
 
2. http, html
2. http, html2. http, html
2. http, html
 
1. intro
1. intro1. intro
1. intro
 

Último

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Último (20)

Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 

[Www.pkbulk.blogspot.com]file and indexing

  • 2. 2 Outline • Data Storage – storing data: disks – buffer manager – representing data – external sorting • Indexing – types of indexes – B+ trees – hash tables (maybe)
  • 3. 3 Physical Addresses • Each block and each record have a physical address that consists of: – The host – The disk – The cylinder number – The track number – The block within the track – For records: an offset in the block • sometimes this is in the block’s header
  • 4. 4 Logical Addresses • Logical address: a string of bytes (10-16) • More flexible: can blocks/records around • But need translation table: Logical address Physical address L1 P1 L2 P2 L3 P3
  • 5. 5 Main Memory Address • When the block is read in main memory, it receives a main memory address • Buffer manager has another translation table Memory address Logical address M1 L1 M2 L2 M3 L3
  • 6. 6 The I/O Model of Computation • In main memory algorithms – we care about CPU time • In databases time is dominated by I/O cost • Assumption: cost is given only by I/O • Consequence: need to redesign certain algorithms • Will illustrate here with sorting
  • 7. 7 Sorting • Illustrates the difference in algorithm design when your data is not in main memory: – Problem: sort 1Gb of data with 1Mb of RAM. • Arises in many places in database systems: – Data requested in sorted order (ORDER BY) – Needed for grouping operations – First step in sort-merge join algorithm – Duplicate removal – Bulk loading of B+-tree indexes.
  • 8. 8 2-Way Merge-sort: Requires 3 Buffers • Pass 1: Read a page, sort it, write it. – only one buffer page is used • Pass 2, 3, …, etc.: – three buffer pages used. Main memory buffers INPUT 1 INPUT 2 OUTPUT DiskDisk
  • 9. 9 Two-Way External Merge Sort • Each pass we read + write each page in file. • N pages in the file => the number of passes • So total cost is: • Improvement: start with larger runs • Sort 1GB with 1MB memory in 10 passes  = +log2 1N  ( )2 12N Nlog + Input file 1-page runs 2-page runs 4-page runs 8-page runs PASS 0 PASS 1 PASS 2 PASS 3 9 3,4 6,2 9,4 8,7 5,6 3,1 2 3,4 5,62,6 4,9 7,8 1,3 2 2,3 4,6 4,7 8,9 1,3 5,6 2 2,3 4,4 6,7 8,9 1,2 3,5 6 1,2 2,3 3,4 4,5 6,6 7,8
  • 10. 10 Can We Do Better ? • We have more main memory • Should use it to improve performance
  • 11. 11 Cost Model for Our Analysis • B: Block size • M: Size of main memory • N: Number of records in the file • R: Size of one record
  • 13. 13 Outline • Types of indexes • B+ trees • Hash tables (maybe)
  • 14. 14 Indexes • An index on a file speeds up selections on the search key field(s) • Search key = any subset of the fields of a relation – Search key is not the same as key (minimal set of fields that uniquely identify a record in a relation). • Entries in an index: (k, r), where: – k = the key – r = the record OR record id OR record ids
  • 15. 15 Index Classification • Clustered/unclustered – Clustered = records sorted in the key order – Unclustered = no • Dense/sparse – Dense = each record has an entry in the index – Sparse = only some records have • Primary/secondary – Primary = on the primary key – Secondary = on any key – Some books interpret these differently • B+ tree / Hash table / …
  • 16. 16 Clustered Index • File is sorted on the index attribute • Dense index: sequence of (key,pointer) pairs 10 20 30 40 50 60 70 80 10 20 30 40 50 60 70 80
  • 17. 17 Clustered Index • Sparse index: one key per data block 10 30 50 70 90 110 130 150 10 20 30 40 50 60 70 80
  • 18. 18 Clustered Index with Duplicate Keys • Dense index: point to the first record with that key 10 20 30 40 50 60 70 80 10 10 10 20 20 20 30 40
  • 19. 19 Clustered Index with Duplicate Keys • Sparse index: pointer to lowest search key in each block: • Search for 20 10 10 20 30 10 10 10 20 20 20 30 40
  • 20. 20 Clustered Index with Duplicate Keys • Better: pointer to lowest new search key in each block: • Search for 20 10 20 30 40 50 60 70 80 10 10 10 20 30 30 40 50
  • 21. 21 Unclustered Indexes • To index other attributes than primary key • Always dense (why ?) 10 10 20 20 20 30 30 30 20 30 30 20 10 20 10 30
  • 22. 22 Summary Clustered vs. Unclustered Index Data entries (Index File) (Data file) Data Records Data entries Data Records CLUSTERED UNCLUSTERED
  • 23. 23 Composite Search Keys • Composite Search Keys: Search on a combination of fields. – Equality query: Every field value is equal to a constant value. E.g. wrt <sal,age> index: • age=20 and sal =75 – Range query: Some field value is not a constant. E.g.: • age =20; or age=20 and sal > 10 sue 13 75 bob cal joe 12 10 20 8011 12 name age sal <sal, age> <age, sal> <age> <sal> 12,20 12,10 11,80 13,75 20,12 10,12 75,13 80,11 11 12 12 13 10 20 75 80 Data records sorted by name Data entries in index sorted by <sal,age> Data entries sorted by <sal> Examples of composite key indexes using lexicographic order.
  • 24. 24 B+ Trees • Search trees • Idea in B Trees: – make 1 node = 1 block • Idea in B+ Trees: – Make leaves into a linked list (range queries are easier)
  • 25. 25 • Parameter d = the degree • Each node has >= d and <= 2d keys (except root) • Each leaf has >=d and <= 2d keys: B+ Trees Basics 30 120 240 Keys k < 30 Keys 30<=k<120 Keys 120<=k<240 Keys 240<=k 40 50 60 40 50 60 Next leaf
  • 26. 26 B+ Tree Example 80 20 60 100 120 140 10 15 18 20 30 40 50 60 65 80 85 90 10 15 18 20 30 40 50 60 65 80 85 90 d = 2
  • 27. 27 B+ Tree Design • How large d ? • Example: – Key size = 4 bytes – Pointer size = 8 bytes – Block size = 4096 byes • 2d x 4 + (2d+1) x 8 <= 4096 • d = 170
  • 28. 28 Searching a B+ Tree • Exact key values: – Start at the root – Proceed down, to the leaf • Range queries: – As above – Then sequential traversal Select name From people Where age = 25 Select name From people Where 20 <= age and age <= 30
  • 29. 29 B+ Trees in Practice • Typical order: 100. Typical fill-factor: 67%. – average fanout = 133 • Typical capacities: – Height 4: 1334 = 312,900,700 records – Height 3: 1333 = 2,352,637 records • Can often hold top levels in buffer pool: – Level 1 = 1 page = 8 Kbytes – Level 2 = 133 pages = 1 Mbyte – Level 3 = 17,689 pages = 133 MBytes
  • 30. 30 Insertion in a B+ Tree Insert (K, P) • Find leaf where K belongs, insert • If no overflow (2d keys or less), halt • If overflow (2d+1 keys), split node, insert in parent: • If leaf, keep K3 too in right node • When root splits, new root has 1 key only K1 K2 K3 K4 K5 P0 P1 P2 P3 P4 p5 K1 K2 P0 P1 P2 K4 K5 P3 P4 p5 (K3, ) to parent
  • 31. 31 Insertion in a B+ Tree 80 20 60 100 120 140 10 15 18 20 30 40 50 60 65 80 85 90 10 15 18 20 30 40 50 60 65 80 85 90 Insert K=19
  • 32. 32 Insertion in a B+ Tree 80 20 60 100 120 140 10 15 18 19 20 30 40 50 60 65 80 85 90 10 15 18 20 30 40 50 60 65 80 85 9019 After insertion
  • 33. 33 Insertion in a B+ Tree 80 20 60 100 120 140 10 15 18 19 20 30 40 50 60 65 80 85 90 10 15 18 20 30 40 50 60 65 80 85 9019 Now insert 25
  • 34. 34 Insertion in a B+ Tree 80 20 60 100 120 140 10 15 18 19 20 2 5 30 4 0 50 60 65 80 85 90 10 15 18 20 25 30 40 60 65 80 85 9019 After insertion 50
  • 35. 35 Insertion in a B+ Tree 80 20 60 100 120 140 10 15 18 19 20 2 5 30 4 0 50 60 65 80 85 90 10 15 18 20 25 30 40 60 65 80 85 9019 But now have to split ! 50
  • 36. 36 Insertion in a B+ Tree 80 20 30 60 100 120 140 10 15 18 19 20 2 5 60 65 80 85 90 10 15 18 20 25 30 40 60 65 80 85 9019 After the split 50 30 4 0 5 0
  • 37. 37 Deletion from a B+ Tree 80 20 30 60 100 120 140 10 15 18 19 20 2 5 60 65 80 85 90 10 15 18 20 25 30 40 60 65 80 85 9019 Delete 30 50 30 4 0 5 0
  • 38. 38 Deletion from a B+ Tree 80 20 30 60 100 120 140 10 15 18 19 20 2 5 60 65 80 85 90 10 15 18 20 25 40 60 65 80 85 9019 After deleting 30 50 40 5 0 May change to 40, or not
  • 39. 39 Deletion from a B+ Tree 80 20 30 60 100 120 140 10 15 18 19 20 2 5 60 65 80 85 90 10 15 18 20 25 40 60 65 80 85 9019 Now delete 25 50 40 5 0
  • 40. 40 Deletion from a B+ Tree 80 20 30 60 100 120 140 10 15 18 19 20 60 65 80 85 90 10 15 18 20 40 60 65 80 85 9019 After deleting 25 Need to rebalance Rotate 50 40 5 0
  • 41. 41 Deletion from a B+ Tree 80 19 30 60 100 120 140 10 15 18 19 2 0 60 65 80 85 90 10 15 18 20 40 60 65 80 85 9019 Now delete 40 50 40 5 0
  • 42. 42 Deletion from a B+ Tree 80 19 30 60 100 120 140 10 15 18 19 2 0 60 65 80 85 90 10 15 18 20 60 65 80 85 9019 After deleting 40 Rotation not possible Need to merge nodes 50 50
  • 43. 43 Deletion from a B+ Tree 80 19 60 100 120 140 10 15 18 19 2 0 5 0 60 65 80 85 90 10 15 18 20 60 65 80 85 9019 Final tree 50
  • 44. 44 Hash Tables • Secondary storage hash tables are much like main memory ones • Recall basics: – There are n buckets – A hash function f(k) maps a key k to {0, 1, …, n-1} – Store in bucket f(k) a pointer to record with key k • Secondary storage: bucket = block, use overflow blocks when needed
  • 45. 45 • Assume 1 bucket (block) stores 2 keys + pointers • h(e)=0 • h(b)=h(f)=1 • h(g)=2 • h(a)=h(c)=3 Hash Table Example e b f g a c 0 1 2 3
  • 46. 46 • Search for a: • Compute h(a)=3 • Read bucket 3 • 1 disk access Searching in a Hash Table e b f g a c 0 1 2 3
  • 47. 47 • Place in right bucket, if space • E.g. h(d)=2 Insertion in Hash Table e b f g d a c 0 1 2 3
  • 48. 48 • Create overflow block, if no space • E.g. h(k)=1 • More over- flow blocks may be needed Insertion in Hash Table e b f g d a c 0 1 2 3 k
  • 49. 49 Hash Table Performance • Excellent, if no overflow blocks • Degrades considerably when number of keys exceeds the number of buckets (I.e. many overflow blocks).
  • 50. 50 Extensible Hash Table • Allows has table to grow, to avoid performance degradation • Assume a hash function h that returns numbers in {0, …, 2k – 1} • Start with n = 2i << 2k , only look at first i most significant bits
  • 51. 51 Extensible Hash Table • E.g. i=1, n=2, k=4 • Note: we only look at the first bit (0 or 1) 0(010) 1(011) i=1 1 1 0 1
  • 52. 52 Insertion in Extensible Hash Table • Insert 1110 0(010) 1(011) 1(110) i=1 1 1 0 1
  • 53. 53 Insertion in Extensible Hash Table • Now insert 1010 • Need to extend table, split blocks • i becomes 2 0(010) 1(011) 1(110), 1(010) i=1 1 1 0 1
  • 54. 54 Insertion in Extensible Hash Table • Now insert 1110 0(010) 10(11) 10(10) i=2 1 2 00 01 10 11 11(10) 2
  • 55. 55 Insertion in Extensible Hash Table • Now insert 0000, then 0101 • Need to split block 0(010) 0(000), 0(101) 10(11) 10(10) i=2 1 2 00 01 10 11 11(10) 2
  • 56. 56 Insertion in Extensible Hash Table • After splitting the block 00(10) 00(00) 10(11) 10(10) i=2 2 2 00 01 10 11 11(10) 2 01(01) 2
  • 57. 57 Performance Extensible Hash Table • No overflow blocks: access always one read • BUT: – Extensions can be costly and disruptive – After an extension table may no longer fit in memory
  • 58. 58 Linear Hash Table • Idea: extend only one entry at a time • Problem: n= no longer a power of 2 • Let i be such that 2i <= n < 2i+1 • After computing h(k), use last i bits: – If last i bits represent a number > n, change msb from 1 to 0 (get a number <= n)
  • 59. 59 Linear Hash Table Example • N=3 (01)00 (11)00 (10)10 i=2 00 01 10 (01)11 BIT FLIP
  • 60. 60 Linear Hash Table Example • Insert 1000: overflow blocks… (01)00 (11)00 (10)10 i=2 00 01 10 (01)11 (10)00
  • 61. 61 Linear Hash Tables • Extension: independent on overflow blocks • Extend n:=n+1 when average number of records per block exceeds (say) 80%
  • 62. 62 Linear Hash Table Extension • From n=3 to n=4 • Only need to touch one block (which one ?) (01)00 (11)00 (10)10 i=2 00 01 10 (01)11 (01)11 (01)11 i=2 00 01 10 (10)10 (01)00 (11)00 11
  • 63. 63 Linear Hash Table Extension • From n=3 to n=4 finished • Extension from n=4 to n=5 (new bit) • Need to touch every single block (why ?) (01)11 i=2 00 01 10 (10)10 (01)00 (11)00 11

Notas do Editor

  1. 4
  2. 5
  3. 6
  4. 3
  5. 7
  6. 12
  7. 13