Database Storage Engine Internals

Demystifying data structures
and algorithms adopted By
database storage engine
Adewumi Sunkanmi D.
Demystifying data
structures and
algorithms used by
database storage engine
Adewumi Sunkanmi D.
Senior Software Engineer at Acronis
working on Advanced Automation, one
of the cloud services offered by Acronis
Cyber Cloud.
Outline
1. Overview of a three-tier application
2. Criteria for selecting the best database for an application
3. Overview of database architecture
4. Types for database storage engines and their tradeoffs
5. Q/A
client
POST
GET
client
server
POST
GET
WRITE
READ
server
client
Database
POST
GET
WRITE
READ
server
client
Database
Which database should we use🤔?
POST
GET
WRITE
READ
server
client
Database
Which database should we use🤔?
POST
GET
WRITE
READ
server
client
Database
Which database should we use🤔?
POST
GET
WRITE
READ
server
client
Database
Which database should we use🤔?
POST
GET
WRITE
READ
server
client
Database
Which database should we use🤔?
POST
GET
WRITE
READ
server
client
Database
Which database should we use🤔?
BigTable
POST
GET
WRITE
READ
server
client
Database
Which database should we use🤔?
BigTable
Neo4J
How do we select
the best database
for an application?
1. Structure of the data we want to store;
structured, unstructured, graph?
How do we select
the best database
for an application?
1. Structure of the data we want to store;
structured, unstructured, graph?
1. Scalability of the database
How do we select
the best database
for an application?
1. Structure of the data we want to store;
structured, unstructured, graph?
1. Scalability of the database
- Horizontal or Vertical scaling
How do we select
the best database
for an application?
1. Structure of the data we want to store;
structured, unstructured, graph?
1. Scalability of the database
- Horizontal or Vertical scaling
- Sharding(Partition data across nodes)
How do we select
the best database
for an application?
1. Structure of the data we want to store;
structured, unstructured, graph?
1. Scalability of the database
- Horizontal or Vertical scaling
- Sharding(Partition data across nodes)
- Replication(Copies of data on multiple nodes)
How do we select
the best database
for an application?
1. Structure of the data we want to store;
structured, unstructured, graph?
1. Scalability of the database
- Horizontal or Vertical scaling
- Sharding(Partition data across nodes)
- Replication(Copies of data on multiple nodes)
3. Support and familiarity of developers with database
How do we select
the best database
for an application?
1. Structure of the data we want to store;
structured, unstructured, graph?
1. Scalability of the database
- Horizontal or Vertical scaling
- Sharding(Partition data across nodes)
- Replication(Copies of data on multiple nodes)
3. Support and familiarity of developers with database
4. Rate of write and read and how EXACTLY are these
operations handled at the hardware level?
https://www.oreilly.com/library/view/high-performance-mysql/9781449332471/ch01.html
SELECT
COLS FROM WHERE
COL_ID students >
score 70
firstname lastname
“SELECT firstname, lastname FROM students WHERE score > 70;”
https://www.oreilly.com/library/view/high-performance-mysql/9781449332471/ch01.html
https://www.oreilly.com/library/view/high-performance-mysql/9781449332471/ch01.html
Disk
Types of storage engines
- Log Structured Merge (LSM) Tree
- Page Oriented (B-Tree)
https://www.cs.umb.edu/~poneil/lsmtree.pdf
Log Structured Merge Tree Storage
Engine
The LMS tree is an immutable disk resident data
structure and it is optimized for sequential writes while
maintaining the acceptable read performance.
Log Structured Merge Tree Storage
Engine
Three components
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 177
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 177
write
ben: 300
Memtable
e.g Red black
tree in RAM
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 177
write
ben: 300
Memtable
e.g Red black
tree in RAM
josh: 500
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 177
write
bin: 220
Memtable
e.g Red black
tree in RAM
ben: 300 josh: 500
Threshold reached!
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 177
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
SSD/HDD file (SSTable file)
T1
ben: 300
bin: 220
josh: 500
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 177
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
SSD/HDD file (SSTable file)
40MB
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 177
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
SSD/HDD file (SSTable file)
40MB
10MB
10MB
10MB
10MB
alexandar : 10
andreas : 50
…….
erik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 177
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
SSD/HDD file (SSTable file)
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 177
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
SSD/HDD file (segment file)
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
Find(apa)
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 173
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
SSD/HDD file (SSTable file)
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
eve: 173
ben: 300 mia: 220
write
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 173
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
eve: 173
ben: 177 mia: 220
write
ben: 177
eve: 173
mia: 200
flush T2
SSD/HDD file (SSTable file)
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 173
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
eve: 173
ben: 177 mia: 220
write
ben: 177
eve: 173
mia: 200
flush T2
Read(ben)
SSD/HDD file (SSTable file)
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 173
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
eve: 173
ben: 177 mia: 220
write
ben: 177
eve: 173
mia: 200
flush T2
Read(ben)
Step 1
SSD/HDD file (SSTable file)
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 173
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
eve: 173
ben: 177 mia: 220
write
ben: 177
eve: 173
mia: 200
flush T2
Read(ben)
Step 1
Step 2
SSD/HDD file (SSTable file)
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 173
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
eve: 173
ben: 177 mia: 220
write
ben: 177
eve: 173
mia: 200
flush T2
Read(ben)
Step 1
Step 2
Step 3
SSD/HDD file (SSTable file)
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 173
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
eve: 173
ben: 177 mia: 220
write
ben: 177
eve: 173
mia: 200
flush T2
Read(ben)
Step 1
Step 2
Step 3
SSD/HDD file (SSTable file)
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 173
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
eve: 173
ben: 177 mia: 220
write
ben: 177
eve: 173
mia: 200
flush T2
Read(ben)
Step 1
Step 2
Step 3
How do we handle
update?
Since we return from the most
recent memtable or segment file, we
just insert the key with the new
value,
Ben will be returned from T2 not T1
SSD/HDD file (SSTable file)
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 173
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
eve: 173
ben: 177 mia: 220
write
ben: 177
eve: 173
mia: 200
flush T2
Read(ben)
Step 1
Step 2
Step 3
How do we handle
delete?
Insert the key with a delete marker
called tombstone, since this will be
the most recent, we can tell it has
been deleted, e.g
ben->null
SSD/HDD file (SSTable file)
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 173
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
eve: 173
ben: 177 mia: 220
write
ben: 177
eve: 173
mia: 200
flush T2
Read(ben)
Step 1
Step 2
Step 3
But now we have
duplicates, space
wastage :(
SSD/HDD file (SSTable file)
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 173
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
eve: 173
ben: 177 mia: 220
write
ben: 177
eve: 173
mia: 200
flush T2
Read(ben)
Step 1
Step 2
Step 3
But now we have
duplicates, space
wastage :(
Yes, but compaction will help
SSD/HDD file (SSTable file)
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 173
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
eve: 173
ben: 177 mia: 220
write
ben: 177
eve: 173
mia: 200
flush T2
Read(ben)
Step 1
Step 2
Step 3
But now we have
duplicates, space
wastage :(
Yes, but compaction will help
Compaction
SSD/HDD file (SSTable file)
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 173
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
eve: 173
ben: 177 mia: 220
write
ben: 177
eve: 173
mia: 200
flush T2
Read(ben)
Step 1
Step 2
Step 3
What if we don’t find
the key, we search all
the SSTable files?
Compaction
SSD/HDD file (SSTable file)
Log Structured Merge Tree Storage
Engine Key present?: Strict NO if not
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 173
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
eve: 173
ben: 177 mia: 220
write
ben: 177
eve: 173
mia: 200
flush T2
Read(ben)
Step 1
Step 2
Step 3
What if we don’t find
the key, we search all
the SSTable files?
Compaction
Optimtimize reads with Bloom Filters
Maybe or Maybe
not(99% accurate)
https://brilliant.org/wiki/bloom-filter/
SSD/HDD file (SSTable file)
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 173
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
eve: 173
ben: 177 mia: 220
write
ben: 177
eve: 173
mia: 200
flush T2
Read(ben)
Step 1
Step 2
Step 3
What if power failure
happens before data
is flushed to disk?
Compaction
1. Persist write in an append only log file before
writing to in-memory table. WAL
2. Recreate memtable from last Log Sequence
Number.
SSD/HDD file (SSTable file)
Log Structured Merge Tree Storage
Engine
Where is LSM tree Storage engine
used?
1. Apache Cassandra
2. WiredTiger
3. InfluxDB
4. Yugabyte DB
5. ScyllaDB
6. CockroachDB
7. Google’s BigTable
8. RocksDB
Types of storage engines
- Log Structured Merge (LSM) Tree
- Page Oriented (B-Tree)
https://carlosproal.com/ir/papers/p121-comer.pdf
https://carlosproal.com/ir/papers/p121-comer.pdf
B-Trees
B trees are page-oriented indexing structures
https://carlosproal.com/ir/papers/p121-comer.pdf
B-Trees
Important notes on B-tree
1. Store key value pairs (sorted by key)
2. Self balancing
3. Often used for indexing
4. Mutable data structure(in place update)
5. Each node is a fixed size block/page 4KB
6. Can only read or write one page at a time
https://carlosproal.com/ir/papers/p121-comer.pdf
Anatomy of B-Tree
12 30
13 23 25 27
50 120
31 37 42 49 52 60 69 90
69 70 78 85
Key < 10
Key [120, inf)
key [12, 30) key [30, 50) key [50, 120)
key [69, 90)
val val val
B-Trees
https://sqlbak.com/academy/database-page
A database page
https://carlosproal.com/ir/papers/p121-comer.pdf
12 30
13 23 25 27
50 120
31 37 42 49 52 60 69 90
69 70 78 85
Key < 10
Key [120, inf)
key [12, 30) key [30, 50) key [50, 120)
key [69, 90)
val val val
NOTE: Leaf Page contains both
the key and value
Anatomy of B-Trees
https://carlosproal.com/ir/papers/p121-comer.pdf
12 30
13 23 25 27
50 120
31 37 42 49 52 60 69 90
Key < 10
Key [120, inf)
key [12, 30) key [30, 50) key [50, 120)
69 70 78 85
key [69, 90)
val val val
Branching factor = 5
Depth= 3
Anatomy of B-Tree
https://carlosproal.com/ir/papers/p121-comer.pdf
12 30
13 23 25 27
50 120
31 37 42 49 52 60 69 90
Key < 10
Key [120, inf)
key [12, 30) key [30, 50) key [50, 120)
69 70 78 85
key [69, 90)
val val val
READ(78)
Anatomy of B-Tree
https://carlosproal.com/ir/papers/p121-comer.pdf
12 30
13 23 25 27
50 120
31 37 42 49 52 60 69 90
Key < 10
Key [120, inf)
key [12, 30) key [30, 50) key [50, 120)
69 70 78 85
key [69, 90)
val val val
READ(78)
Anatomy of B-Tree
https://carlosproal.com/ir/papers/p121-comer.pdf
12 30
13 23 25 27
50 120
31 37 42 49 52 60 69 90
Key < 10
Key [120, inf)
key [12, 30) key [30, 50) key [50, 120)
69 70 78 85
key [69, 90)
val val val
READ(78)
Anatomy of B-Tree
https://carlosproal.com/ir/papers/p121-comer.pdf
12 30
13 23 25 27
50 120
31 37 42 49 52 60 69 90
Key < 10
Key [120, inf)
key [12, 30) key [30, 50) key [50, 120)
69 70 78 85
key [69, 90)
val val val
found!
READ(78)
Anatomy of B-Tree
https://carlosproal.com/ir/papers/p121-comer.pdf
12 30
13 23 25 27
50 120
31 37 42 49 52 60 69 90
Key < 10
Key [120, inf)
key [12, 30) key [30, 50) key [50, 120)
69 70 78 85
key [69, 90)
val val val
found!
READ(78)
Anatomy of B-Tree
Searching for a key is faster because we are not scaning
all keys but only keys within range, takes O(log n)
Where n is the total number of keys
https://carlosproal.com/ir/papers/p121-comer.pdf
60 69 90
val val val val
70 78 85
INSERT(87)
87
69 val val val val
70 78 85 86 val
Branching factor - 5
https://carlosproal.com/ir/papers/p121-comer.pdf
60 69 90
Branching
factor
exceeded! > 5
Create new
page
val val val val
70 78 85 87
69 val val val val
70 78 85 86 val 87 val
Branching factor - 5
INSERT(87)
https://carlosproal.com/ir/papers/p121-comer.pdf
60 69 90
69 70 78
val val val 85 86 87
val val val
INSERT(87)
Branching
factor
exceeded! > 5
Create new
page
https://carlosproal.com/ir/papers/p121-comer.pdf
60 69 85
69 70 78
val val val 85 86 87
val val val
90
Add 85 to parent page
INSERT(87)
https://carlosproal.com/ir/papers/p121-comer.pdf
60 69 85
69 70 78
val val val 85 86 87
val val val
90
Add 85 to parent page
What if the parent page is full?
Split it
INSERT(87)
https://carlosproal.com/ir/papers/p121-comer.pdf
60 69 85
69 70 78
val val val 85 86 87
val val val
90
Add 85 to parent page
How does update work?
1. Find the leaf page with key
2. Edit the row
3. Overwrite the page
INSERT(87)
LSM trees Vs B-Trees storage engine
LSM Tree B-Tree
Optimized for write Optimized for read
Compressed better(No
Fragmentation)
Fragmentation wastes space
There can be duplicates before
compaction
Each key exist exactly in one
place
Strong transaction support
Spikes in write can cause slow
compaction due to many
SSTable files. Can cause Out
of Memory Error(OOM)
Space optimization in B-tree
Primary index(primary key index)
Secondary index
Space optimization in B-tree
Secondary index
Primary index(primary key index)
Leaf page contains both key and value Leaf page contains both key and value
DUPLICATE !
Space optimization in B-tree
Secondary index
Primary index(primary key index)
Store value offset(smaller in size)
Store value offset (smaller in size)
val1
val2
val3
val4
val5
…
Heap File
Space optimization in B-tree
Secondary index
Primary index(primary key index)
Store value offset(smaller in size)
Store value offset (smaller in size)
val1
val2
val3
val4
val5
…
Heap File
Store value offset(smaller in size)
Extra Disk I/O
So you can store important
columns in leaf page and less
important columns in heap file
@gifted_dl
@gifted_dl
Adewumi Sunkanmi D.
1 de 79

Recomendados

IO Dubi Lebel por
IO Dubi LebelIO Dubi Lebel
IO Dubi Lebelsqlserver.co.il
1.2K visualizações104 slides
5 steps to faster web sites & HTML5 games - updated for DDDscot por
5 steps to faster web sites & HTML5 games - updated for DDDscot5 steps to faster web sites & HTML5 games - updated for DDDscot
5 steps to faster web sites & HTML5 games - updated for DDDscotMichael Ewins
1.1K visualizações73 slides
SQL Server On SANs por
SQL Server On SANsSQL Server On SANs
SQL Server On SANsQuest Software
1K visualizações46 slides
Using Oracle Database with Amazon Web Services por
Using Oracle Database with Amazon Web ServicesUsing Oracle Database with Amazon Web Services
Using Oracle Database with Amazon Web Servicesguest484c12
5.7K visualizações79 slides
Redis — memcached on steroids por
Redis — memcached on steroidsRedis — memcached on steroids
Redis — memcached on steroidsRobert Lehmann
1.8K visualizações91 slides
All about Storage - Series 2 Defining Data por
All about Storage - Series 2 Defining DataAll about Storage - Series 2 Defining Data
All about Storage - Series 2 Defining DataDAGEOP LTD
128 visualizações30 slides

Mais conteúdo relacionado

Similar a Database Storage Engine Internals

Data Replication Options in AWS (ARC302) | AWS re:Invent 2013 por
Data Replication Options in AWS (ARC302) | AWS re:Invent 2013Data Replication Options in AWS (ARC302) | AWS re:Invent 2013
Data Replication Options in AWS (ARC302) | AWS re:Invent 2013Amazon Web Services
12.7K visualizações78 slides
5 Steps to Faster Web Sites and HTML5 Games por
5 Steps to Faster Web Sites and HTML5 Games5 Steps to Faster Web Sites and HTML5 Games
5 Steps to Faster Web Sites and HTML5 GamesMichael Ewins
1.3K visualizações62 slides
How to Get a Game Changing Performance Advantage with Intel SSDs and Aerospike por
How to Get a Game Changing Performance Advantage with Intel SSDs and AerospikeHow to Get a Game Changing Performance Advantage with Intel SSDs and Aerospike
How to Get a Game Changing Performance Advantage with Intel SSDs and AerospikeAerospike, Inc.
1.5K visualizações16 slides
Amazed by AWS Series #4 por
Amazed by AWS Series #4Amazed by AWS Series #4
Amazed by AWS Series #4Amazon Web Services Korea
1.7K visualizações102 slides
Building the Perfect SharePoint 2010 Farm - Sharing the Point South America por
Building the Perfect SharePoint 2010 Farm - Sharing the Point South AmericaBuilding the Perfect SharePoint 2010 Farm - Sharing the Point South America
Building the Perfect SharePoint 2010 Farm - Sharing the Point South AmericaMichael Noel
699 visualizações35 slides
Unity Connect - Getting SQL Spinning with SharePoint - Best Practices for the... por
Unity Connect - Getting SQL Spinning with SharePoint - Best Practices for the...Unity Connect - Getting SQL Spinning with SharePoint - Best Practices for the...
Unity Connect - Getting SQL Spinning with SharePoint - Best Practices for the...Knut Relbe-Moe [MVP, MCT]
390 visualizações50 slides

Similar a Database Storage Engine Internals(20)

Data Replication Options in AWS (ARC302) | AWS re:Invent 2013 por Amazon Web Services
Data Replication Options in AWS (ARC302) | AWS re:Invent 2013Data Replication Options in AWS (ARC302) | AWS re:Invent 2013
Data Replication Options in AWS (ARC302) | AWS re:Invent 2013
Amazon Web Services12.7K visualizações
5 Steps to Faster Web Sites and HTML5 Games por Michael Ewins
5 Steps to Faster Web Sites and HTML5 Games5 Steps to Faster Web Sites and HTML5 Games
5 Steps to Faster Web Sites and HTML5 Games
Michael Ewins1.3K visualizações
How to Get a Game Changing Performance Advantage with Intel SSDs and Aerospike por Aerospike, Inc.
How to Get a Game Changing Performance Advantage with Intel SSDs and AerospikeHow to Get a Game Changing Performance Advantage with Intel SSDs and Aerospike
How to Get a Game Changing Performance Advantage with Intel SSDs and Aerospike
Aerospike, Inc. 1.5K visualizações
Building the Perfect SharePoint 2010 Farm - Sharing the Point South America por Michael Noel
Building the Perfect SharePoint 2010 Farm - Sharing the Point South AmericaBuilding the Perfect SharePoint 2010 Farm - Sharing the Point South America
Building the Perfect SharePoint 2010 Farm - Sharing the Point South America
Michael Noel699 visualizações
Unity Connect - Getting SQL Spinning with SharePoint - Best Practices for the... por Knut Relbe-Moe [MVP, MCT]
Unity Connect - Getting SQL Spinning with SharePoint - Best Practices for the...Unity Connect - Getting SQL Spinning with SharePoint - Best Practices for the...
Unity Connect - Getting SQL Spinning with SharePoint - Best Practices for the...
Knut Relbe-Moe [MVP, MCT]390 visualizações
HPCC Systems vs Hadoop por Fujio Turner
HPCC Systems vs HadoopHPCC Systems vs Hadoop
HPCC Systems vs Hadoop
Fujio Turner3.4K visualizações
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features por Amazon Web Services
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features
Amazon Web Services30.9K visualizações
Site Performance - From Pinto to Ferrari por Joseph Scott
Site Performance - From Pinto to FerrariSite Performance - From Pinto to Ferrari
Site Performance - From Pinto to Ferrari
Joseph Scott3.5K visualizações
MySpace Data Architecture June 2009 por Mark Ginnebaugh
MySpace Data Architecture June 2009MySpace Data Architecture June 2009
MySpace Data Architecture June 2009
Mark Ginnebaugh2.3K visualizações
Virtualization and SAN Basics for DBAs por Quest Software
Virtualization and SAN Basics for DBAsVirtualization and SAN Basics for DBAs
Virtualization and SAN Basics for DBAs
Quest Software901 visualizações
Maa wp-10g-racprimaryracphysicalsta-131940 por gopalchsamanta
Maa wp-10g-racprimaryracphysicalsta-131940Maa wp-10g-racprimaryracphysicalsta-131940
Maa wp-10g-racprimaryracphysicalsta-131940
gopalchsamanta177 visualizações
Sql server backup internals por Hamid J. Fard
Sql server backup internalsSql server backup internals
Sql server backup internals
Hamid J. Fard861 visualizações
Designing Information Structures For Performance And Reliability por bryanrandol
Designing Information Structures For Performance And ReliabilityDesigning Information Structures For Performance And Reliability
Designing Information Structures For Performance And Reliability
bryanrandol608 visualizações
ora_sothea por thysothea
ora_sotheaora_sothea
ora_sothea
thysothea709 visualizações
Understanding Solid State Disk and the Oracle Database Flash Cache (older ver... por Guy Harrison
Understanding Solid State Disk and the Oracle Database Flash Cache (older ver...Understanding Solid State Disk and the Oracle Database Flash Cache (older ver...
Understanding Solid State Disk and the Oracle Database Flash Cache (older ver...
Guy Harrison3.3K visualizações
The care and feeding of a MySQL database por Dave Stokes
The care and feeding of a MySQL databaseThe care and feeding of a MySQL database
The care and feeding of a MySQL database
Dave Stokes879 visualizações
AWS March 2016 Webinar Series - Managed Database Services on Amazon Web Services por Amazon Web Services
AWS March 2016 Webinar Series - Managed Database Services on Amazon Web ServicesAWS March 2016 Webinar Series - Managed Database Services on Amazon Web Services
AWS March 2016 Webinar Series - Managed Database Services on Amazon Web Services
Amazon Web Services1.4K visualizações
Best practices for using flash in hyperscale software storage architectures por Eric Carter
Best practices for using flash in hyperscale software storage architecturesBest practices for using flash in hyperscale software storage architectures
Best practices for using flash in hyperscale software storage architectures
Eric Carter477 visualizações
Experiences with Oracle SPARC S7-2 Server por JomaSoft
Experiences with Oracle SPARC S7-2 ServerExperiences with Oracle SPARC S7-2 Server
Experiences with Oracle SPARC S7-2 Server
JomaSoft273 visualizações

Último

.NET Deserialization Attacks por
.NET Deserialization Attacks.NET Deserialization Attacks
.NET Deserialization AttacksDharmalingam Ganesan
5 visualizações50 slides
Airline Booking Software por
Airline Booking SoftwareAirline Booking Software
Airline Booking SoftwareSharmiMehta
9 visualizações26 slides
Agile 101 por
Agile 101Agile 101
Agile 101John Valentino
12 visualizações20 slides
Introduction to Gradle por
Introduction to GradleIntroduction to Gradle
Introduction to GradleJohn Valentino
6 visualizações7 slides
How to build dyanmic dashboards and ensure they always work por
How to build dyanmic dashboards and ensure they always workHow to build dyanmic dashboards and ensure they always work
How to build dyanmic dashboards and ensure they always workWiiisdom
14 visualizações13 slides
What is API por
What is APIWhat is API
What is APIartembondar5
13 visualizações15 slides

Último(20)

Airline Booking Software por SharmiMehta
Airline Booking SoftwareAirline Booking Software
Airline Booking Software
SharmiMehta9 visualizações
Agile 101 por John Valentino
Agile 101Agile 101
Agile 101
John Valentino12 visualizações
Introduction to Gradle por John Valentino
Introduction to GradleIntroduction to Gradle
Introduction to Gradle
John Valentino6 visualizações
How to build dyanmic dashboards and ensure they always work por Wiiisdom
How to build dyanmic dashboards and ensure they always workHow to build dyanmic dashboards and ensure they always work
How to build dyanmic dashboards and ensure they always work
Wiiisdom14 visualizações
What is API por artembondar5
What is APIWhat is API
What is API
artembondar513 visualizações
360 graden fabriek por info33492
360 graden fabriek360 graden fabriek
360 graden fabriek
info33492165 visualizações
Bootstrapping vs Venture Capital.pptx por Zeljko Svedic
Bootstrapping vs Venture Capital.pptxBootstrapping vs Venture Capital.pptx
Bootstrapping vs Venture Capital.pptx
Zeljko Svedic15 visualizações
Dapr Unleashed: Accelerating Microservice Development por Miroslav Janeski
Dapr Unleashed: Accelerating Microservice DevelopmentDapr Unleashed: Accelerating Microservice Development
Dapr Unleashed: Accelerating Microservice Development
Miroslav Janeski15 visualizações
The Path to DevOps por John Valentino
The Path to DevOpsThe Path to DevOps
The Path to DevOps
John Valentino5 visualizações
Team Transformation Tactics for Holistic Testing and Quality (Japan Symposium... por Lisi Hocke
Team Transformation Tactics for Holistic Testing and Quality (Japan Symposium...Team Transformation Tactics for Holistic Testing and Quality (Japan Symposium...
Team Transformation Tactics for Holistic Testing and Quality (Japan Symposium...
Lisi Hocke35 visualizações
Quality Engineer: A Day in the Life por John Valentino
Quality Engineer: A Day in the LifeQuality Engineer: A Day in the Life
Quality Engineer: A Day in the Life
John Valentino7 visualizações
EV Charging App Case por iCoderz Solutions
EV Charging App Case EV Charging App Case
EV Charging App Case
iCoderz Solutions9 visualizações
Introduction to Git Source Control por John Valentino
Introduction to Git Source ControlIntroduction to Git Source Control
Introduction to Git Source Control
John Valentino7 visualizações
Electronic AWB - Electronic Air Waybill por Freightoscope
Electronic AWB - Electronic Air Waybill Electronic AWB - Electronic Air Waybill
Electronic AWB - Electronic Air Waybill
Freightoscope 5 visualizações
nintendo_64.pptx por paiga02016
nintendo_64.pptxnintendo_64.pptx
nintendo_64.pptx
paiga020166 visualizações
Top-5-production-devconMunich-2023.pptx por Tier1 app
Top-5-production-devconMunich-2023.pptxTop-5-production-devconMunich-2023.pptx
Top-5-production-devconMunich-2023.pptx
Tier1 app9 visualizações
Automated Testing of Microsoft Power BI Reports por RTTS
Automated Testing of Microsoft Power BI ReportsAutomated Testing of Microsoft Power BI Reports
Automated Testing of Microsoft Power BI Reports
RTTS10 visualizações
How To Make Your Plans Suck Less — Maarten Dalmijn at the 57th Hands-on Agile... por Stefan Wolpers
How To Make Your Plans Suck Less — Maarten Dalmijn at the 57th Hands-on Agile...How To Make Your Plans Suck Less — Maarten Dalmijn at the 57th Hands-on Agile...
How To Make Your Plans Suck Less — Maarten Dalmijn at the 57th Hands-on Agile...
Stefan Wolpers42 visualizações
Dev-HRE-Ops - Addressing the _Last Mile DevOps Challenge_ in Highly Regulated... por TomHalpin9
Dev-HRE-Ops - Addressing the _Last Mile DevOps Challenge_ in Highly Regulated...Dev-HRE-Ops - Addressing the _Last Mile DevOps Challenge_ in Highly Regulated...
Dev-HRE-Ops - Addressing the _Last Mile DevOps Challenge_ in Highly Regulated...
TomHalpin96 visualizações

Database Storage Engine Internals

  • 1. Demystifying data structures and algorithms adopted By database storage engine Adewumi Sunkanmi D.
  • 2. Demystifying data structures and algorithms used by database storage engine
  • 3. Adewumi Sunkanmi D. Senior Software Engineer at Acronis working on Advanced Automation, one of the cloud services offered by Acronis Cyber Cloud.
  • 4. Outline 1. Overview of a three-tier application 2. Criteria for selecting the best database for an application 3. Overview of database architecture 4. Types for database storage engines and their tradeoffs 5. Q/A
  • 15. How do we select the best database for an application? 1. Structure of the data we want to store; structured, unstructured, graph?
  • 16. How do we select the best database for an application? 1. Structure of the data we want to store; structured, unstructured, graph? 1. Scalability of the database
  • 17. How do we select the best database for an application? 1. Structure of the data we want to store; structured, unstructured, graph? 1. Scalability of the database - Horizontal or Vertical scaling
  • 18. How do we select the best database for an application? 1. Structure of the data we want to store; structured, unstructured, graph? 1. Scalability of the database - Horizontal or Vertical scaling - Sharding(Partition data across nodes)
  • 19. How do we select the best database for an application? 1. Structure of the data we want to store; structured, unstructured, graph? 1. Scalability of the database - Horizontal or Vertical scaling - Sharding(Partition data across nodes) - Replication(Copies of data on multiple nodes)
  • 20. How do we select the best database for an application? 1. Structure of the data we want to store; structured, unstructured, graph? 1. Scalability of the database - Horizontal or Vertical scaling - Sharding(Partition data across nodes) - Replication(Copies of data on multiple nodes) 3. Support and familiarity of developers with database
  • 21. How do we select the best database for an application? 1. Structure of the data we want to store; structured, unstructured, graph? 1. Scalability of the database - Horizontal or Vertical scaling - Sharding(Partition data across nodes) - Replication(Copies of data on multiple nodes) 3. Support and familiarity of developers with database 4. Rate of write and read and how EXACTLY are these operations handled at the hardware level?
  • 23. SELECT COLS FROM WHERE COL_ID students > score 70 firstname lastname “SELECT firstname, lastname FROM students WHERE score > 70;”
  • 26. Types of storage engines - Log Structured Merge (LSM) Tree - Page Oriented (B-Tree)
  • 28. Log Structured Merge Tree Storage Engine The LMS tree is an immutable disk resident data structure and it is optimized for sequential writes while maintaining the acceptable read performance.
  • 29. Log Structured Merge Tree Storage Engine Three components 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table)
  • 30. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 177
  • 31. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 177 write ben: 300 Memtable e.g Red black tree in RAM
  • 32. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 177 write ben: 300 Memtable e.g Red black tree in RAM josh: 500
  • 33. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 177 write bin: 220 Memtable e.g Red black tree in RAM ben: 300 josh: 500 Threshold reached!
  • 34. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 177 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 SSD/HDD file (SSTable file) T1 ben: 300 bin: 220 josh: 500
  • 35. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 177 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 SSD/HDD file (SSTable file) 40MB
  • 36. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 177 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 SSD/HDD file (SSTable file) 40MB 10MB 10MB 10MB 10MB alexandar : 10 andreas : 50 ……. erik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ………
  • 37. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 177 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 SSD/HDD file (SSTable file) 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB
  • 38. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 177 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 SSD/HDD file (segment file) 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB Find(apa)
  • 39. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 173 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 SSD/HDD file (SSTable file) 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB eve: 173 ben: 300 mia: 220 write
  • 40. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 173 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB eve: 173 ben: 177 mia: 220 write ben: 177 eve: 173 mia: 200 flush T2 SSD/HDD file (SSTable file)
  • 41. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 173 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB eve: 173 ben: 177 mia: 220 write ben: 177 eve: 173 mia: 200 flush T2 Read(ben) SSD/HDD file (SSTable file)
  • 42. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 173 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB eve: 173 ben: 177 mia: 220 write ben: 177 eve: 173 mia: 200 flush T2 Read(ben) Step 1 SSD/HDD file (SSTable file)
  • 43. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 173 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB eve: 173 ben: 177 mia: 220 write ben: 177 eve: 173 mia: 200 flush T2 Read(ben) Step 1 Step 2 SSD/HDD file (SSTable file)
  • 44. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 173 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB eve: 173 ben: 177 mia: 220 write ben: 177 eve: 173 mia: 200 flush T2 Read(ben) Step 1 Step 2 Step 3 SSD/HDD file (SSTable file)
  • 45. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 173 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB eve: 173 ben: 177 mia: 220 write ben: 177 eve: 173 mia: 200 flush T2 Read(ben) Step 1 Step 2 Step 3 SSD/HDD file (SSTable file)
  • 46. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 173 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB eve: 173 ben: 177 mia: 220 write ben: 177 eve: 173 mia: 200 flush T2 Read(ben) Step 1 Step 2 Step 3 How do we handle update? Since we return from the most recent memtable or segment file, we just insert the key with the new value, Ben will be returned from T2 not T1 SSD/HDD file (SSTable file)
  • 47. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 173 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB eve: 173 ben: 177 mia: 220 write ben: 177 eve: 173 mia: 200 flush T2 Read(ben) Step 1 Step 2 Step 3 How do we handle delete? Insert the key with a delete marker called tombstone, since this will be the most recent, we can tell it has been deleted, e.g ben->null SSD/HDD file (SSTable file)
  • 48. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 173 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB eve: 173 ben: 177 mia: 220 write ben: 177 eve: 173 mia: 200 flush T2 Read(ben) Step 1 Step 2 Step 3 But now we have duplicates, space wastage :( SSD/HDD file (SSTable file)
  • 49. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 173 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB eve: 173 ben: 177 mia: 220 write ben: 177 eve: 173 mia: 200 flush T2 Read(ben) Step 1 Step 2 Step 3 But now we have duplicates, space wastage :( Yes, but compaction will help SSD/HDD file (SSTable file)
  • 50. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 173 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB eve: 173 ben: 177 mia: 220 write ben: 177 eve: 173 mia: 200 flush T2 Read(ben) Step 1 Step 2 Step 3 But now we have duplicates, space wastage :( Yes, but compaction will help Compaction SSD/HDD file (SSTable file)
  • 51. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 173 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB eve: 173 ben: 177 mia: 220 write ben: 177 eve: 173 mia: 200 flush T2 Read(ben) Step 1 Step 2 Step 3 What if we don’t find the key, we search all the SSTable files? Compaction SSD/HDD file (SSTable file)
  • 52. Log Structured Merge Tree Storage Engine Key present?: Strict NO if not 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 173 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB eve: 173 ben: 177 mia: 220 write ben: 177 eve: 173 mia: 200 flush T2 Read(ben) Step 1 Step 2 Step 3 What if we don’t find the key, we search all the SSTable files? Compaction Optimtimize reads with Bloom Filters Maybe or Maybe not(99% accurate) https://brilliant.org/wiki/bloom-filter/ SSD/HDD file (SSTable file)
  • 53. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 173 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB eve: 173 ben: 177 mia: 220 write ben: 177 eve: 173 mia: 200 flush T2 Read(ben) Step 1 Step 2 Step 3 What if power failure happens before data is flushed to disk? Compaction 1. Persist write in an append only log file before writing to in-memory table. WAL 2. Recreate memtable from last Log Sequence Number. SSD/HDD file (SSTable file)
  • 54. Log Structured Merge Tree Storage Engine Where is LSM tree Storage engine used? 1. Apache Cassandra 2. WiredTiger 3. InfluxDB 4. Yugabyte DB 5. ScyllaDB 6. CockroachDB 7. Google’s BigTable 8. RocksDB
  • 55. Types of storage engines - Log Structured Merge (LSM) Tree - Page Oriented (B-Tree)
  • 58. https://carlosproal.com/ir/papers/p121-comer.pdf B-Trees Important notes on B-tree 1. Store key value pairs (sorted by key) 2. Self balancing 3. Often used for indexing 4. Mutable data structure(in place update) 5. Each node is a fixed size block/page 4KB 6. Can only read or write one page at a time
  • 59. https://carlosproal.com/ir/papers/p121-comer.pdf Anatomy of B-Tree 12 30 13 23 25 27 50 120 31 37 42 49 52 60 69 90 69 70 78 85 Key < 10 Key [120, inf) key [12, 30) key [30, 50) key [50, 120) key [69, 90) val val val
  • 61. https://carlosproal.com/ir/papers/p121-comer.pdf 12 30 13 23 25 27 50 120 31 37 42 49 52 60 69 90 69 70 78 85 Key < 10 Key [120, inf) key [12, 30) key [30, 50) key [50, 120) key [69, 90) val val val NOTE: Leaf Page contains both the key and value Anatomy of B-Trees
  • 62. https://carlosproal.com/ir/papers/p121-comer.pdf 12 30 13 23 25 27 50 120 31 37 42 49 52 60 69 90 Key < 10 Key [120, inf) key [12, 30) key [30, 50) key [50, 120) 69 70 78 85 key [69, 90) val val val Branching factor = 5 Depth= 3 Anatomy of B-Tree
  • 63. https://carlosproal.com/ir/papers/p121-comer.pdf 12 30 13 23 25 27 50 120 31 37 42 49 52 60 69 90 Key < 10 Key [120, inf) key [12, 30) key [30, 50) key [50, 120) 69 70 78 85 key [69, 90) val val val READ(78) Anatomy of B-Tree
  • 64. https://carlosproal.com/ir/papers/p121-comer.pdf 12 30 13 23 25 27 50 120 31 37 42 49 52 60 69 90 Key < 10 Key [120, inf) key [12, 30) key [30, 50) key [50, 120) 69 70 78 85 key [69, 90) val val val READ(78) Anatomy of B-Tree
  • 65. https://carlosproal.com/ir/papers/p121-comer.pdf 12 30 13 23 25 27 50 120 31 37 42 49 52 60 69 90 Key < 10 Key [120, inf) key [12, 30) key [30, 50) key [50, 120) 69 70 78 85 key [69, 90) val val val READ(78) Anatomy of B-Tree
  • 66. https://carlosproal.com/ir/papers/p121-comer.pdf 12 30 13 23 25 27 50 120 31 37 42 49 52 60 69 90 Key < 10 Key [120, inf) key [12, 30) key [30, 50) key [50, 120) 69 70 78 85 key [69, 90) val val val found! READ(78) Anatomy of B-Tree
  • 67. https://carlosproal.com/ir/papers/p121-comer.pdf 12 30 13 23 25 27 50 120 31 37 42 49 52 60 69 90 Key < 10 Key [120, inf) key [12, 30) key [30, 50) key [50, 120) 69 70 78 85 key [69, 90) val val val found! READ(78) Anatomy of B-Tree Searching for a key is faster because we are not scaning all keys but only keys within range, takes O(log n) Where n is the total number of keys
  • 68. https://carlosproal.com/ir/papers/p121-comer.pdf 60 69 90 val val val val 70 78 85 INSERT(87) 87 69 val val val val 70 78 85 86 val Branching factor - 5
  • 69. https://carlosproal.com/ir/papers/p121-comer.pdf 60 69 90 Branching factor exceeded! > 5 Create new page val val val val 70 78 85 87 69 val val val val 70 78 85 86 val 87 val Branching factor - 5 INSERT(87)
  • 70. https://carlosproal.com/ir/papers/p121-comer.pdf 60 69 90 69 70 78 val val val 85 86 87 val val val INSERT(87) Branching factor exceeded! > 5 Create new page
  • 71. https://carlosproal.com/ir/papers/p121-comer.pdf 60 69 85 69 70 78 val val val 85 86 87 val val val 90 Add 85 to parent page INSERT(87)
  • 72. https://carlosproal.com/ir/papers/p121-comer.pdf 60 69 85 69 70 78 val val val 85 86 87 val val val 90 Add 85 to parent page What if the parent page is full? Split it INSERT(87)
  • 73. https://carlosproal.com/ir/papers/p121-comer.pdf 60 69 85 69 70 78 val val val 85 86 87 val val val 90 Add 85 to parent page How does update work? 1. Find the leaf page with key 2. Edit the row 3. Overwrite the page INSERT(87)
  • 74. LSM trees Vs B-Trees storage engine LSM Tree B-Tree Optimized for write Optimized for read Compressed better(No Fragmentation) Fragmentation wastes space There can be duplicates before compaction Each key exist exactly in one place Strong transaction support Spikes in write can cause slow compaction due to many SSTable files. Can cause Out of Memory Error(OOM)
  • 75. Space optimization in B-tree Primary index(primary key index) Secondary index
  • 76. Space optimization in B-tree Secondary index Primary index(primary key index) Leaf page contains both key and value Leaf page contains both key and value DUPLICATE !
  • 77. Space optimization in B-tree Secondary index Primary index(primary key index) Store value offset(smaller in size) Store value offset (smaller in size) val1 val2 val3 val4 val5 … Heap File
  • 78. Space optimization in B-tree Secondary index Primary index(primary key index) Store value offset(smaller in size) Store value offset (smaller in size) val1 val2 val3 val4 val5 … Heap File Store value offset(smaller in size) Extra Disk I/O So you can store important columns in leaf page and less important columns in heap file