A Deep Dive Into Understanding Apache Cassandra

Inside Cassandra
Michael Penick

Overview
• To disk and back again
• Cassandra Internals by Aaron Morton
• Goals
– RDBMS comparison to C*
– Make educated decisions
I’m configuration

Distributed Hashing
[Diagram: keys A through P spread across Node 0 through Node 3]
Location = Hash(Key) % # Nodes

Distributed Hashing
[Diagram: Node 4 is added; keys A through P are redistributed across Node 0 through Node 4]
% Data Moved = 100 * N / (N + 1)
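The formula above can be checked with a small simulation. This is a minimal sketch, assuming MD5 as a stand-in hash function and made-up key names; it only illustrates modulo placement, not Cassandra's partitioner.

```python
import hashlib

def node_for(key: str, num_nodes: int) -> int:
    """Location = Hash(Key) % # Nodes (MD5 is an illustrative stand-in hash)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_nodes

def percent_moved(keys, old_nodes: int, new_nodes: int) -> float:
    """Percentage of keys whose node changes when the cluster is resized."""
    moved = sum(1 for k in keys if node_for(k, old_nodes) != node_for(k, new_nodes))
    return 100.0 * moved / len(keys)

keys = [f"key-{i}" for i in range(10_000)]   # hypothetical keys
measured = percent_moved(keys, 4, 5)         # grow from N = 4 to N + 1 = 5 nodes
predicted = 100.0 * 4 / (4 + 1)              # 100 * N / (N + 1)
print(f"measured {measured:.1f}%, predicted {predicted:.1f}%")
```

With enough keys the measured value should land close to the predicted one, since a key stays put only when its hash gives the same remainder for both node counts.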

Consistent Hashing
[Diagram: token ring starting at 0 with Node 1 through Node 4]

Consistent Hashing
[Diagram: keys A through P placed around the ring]
Add Node 0
[Diagram: Node 0 joins the ring; keys A through P keep their positions]
% Data Moved = 100 * 1 / N
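A minimal sketch of a token ring, assuming one token per node derived from an MD5 hash of a hypothetical node name. It shows the key property behind the formula above: when a node joins, the only keys that move are the ones the new node takes over.

```python
import bisect
import hashlib

def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is an illustrative stand-in)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

class Ring:
    def __init__(self, nodes):
        # One token per node, kept sorted so lookups can use binary search.
        self._tokens = sorted((token(n), n) for n in nodes)

    def owner(self, key: str) -> str:
        """First node whose token is >= the key's token, wrapping around the ring."""
        i = bisect.bisect_left(self._tokens, (token(key), ""))
        return self._tokens[i % len(self._tokens)][1]

keys = [f"key-{i}" for i in range(10_000)]   # hypothetical keys
before = Ring(["node1", "node2", "node3", "node4"])
after = Ring(["node0", "node1", "node2", "node3", "node4"])
moved = [k for k in keys if before.owner(k) != after.owner(k)]
print(f"{100.0 * len(moved) / len(keys):.1f}% of keys moved, all to node0")
```

With a single token per node the moved share depends on where the new token happens to land, so it only averages out to the formula; spreading each node over many tokens evens this out.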

Virtual Nodes
Found: http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2
num_tokens
initial_token

Tunable Consistency
0
Node 1
Node 2
Node 3
Node 4
Node 5
Node 6
replication_factor = 3
R1
R2
R3
Client
INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY ONE

Tunable Consistency
0
Node 1
Node 2
Node 3
Node 4
Node 5
Node 6
R1
R2
R3
Client
INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY QUORUM
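A short sketch of how many replica acknowledgements each level needs, assuming the usual majority definition of a quorum. The function names are hypothetical, and ALL is included only for comparison.

```python
def quorum(replication_factor: int) -> int:
    """A majority of the replicas."""
    return replication_factor // 2 + 1

def required_acks(level: str, replication_factor: int) -> int:
    """Replica acknowledgements the coordinator waits for at a given level."""
    levels = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return levels[level]

def reads_overlap_writes(read_acks: int, write_acks: int, replication_factor: int) -> bool:
    """True when every read set must share at least one replica with every write set."""
    return read_acks + write_acks > replication_factor

rf = 3
w = required_acks("QUORUM", rf)
r = required_acks("QUORUM", rf)
print(w, r, reads_overlap_writes(r, w, rf))
```

With replication_factor = 3, QUORUM writes and QUORUM reads each wait for two replicas, so at least one replica is common to both; ONE for both does not give that overlap.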

Hinted Handoff
0
Node 1
Node 2
Node 3
Node 4
Node 5
Node 6
and
hinted_handoff_enabled = true
R1
R2
R3
Client
INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY ANY
Write locally:
system.hints
Note: Hints do not count toward the consistency level (except ANY)

Tunable Consistency
0
Node 1
Node 2
Node 3
Node 4
Node 5
Node 6
R1
R2
R3
Client
INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY EACH_QUORUM
0
Node 1
Node 2
Node 3
Node 4
Node 5
Node 6
R1
R2
Appends FWD_TO parameter to message

Read Repair
0
Node 1
Node 2
Node 3
Node 4
Node 5
Node 6
R1
R2
R3
Client
SELECT * FROM table USING CONSISTENCY ONE
and
read_repair_chance > 0

Write
Memory
Disk
Commit Log
Memtable
K1 C1:V1 C2:V2
K1 C1:V1 C2:V2
SSTable #1
K1 C1:V1 C2:V2
…
… …
Flush when:
> commitlog_total_space_in_mb
or
> memtable_total_space_in_mb

Write
Memory
Disk
Commit Log
Memtable
K1 C3:V3
K1 C3:V3
SSTable #1 SSTable #2
K1 C1:V1 C2:V2
…
… … …
Note: All writes are sequential!
Physical Volume #1 Physical Volume #2
K1 C3:V3
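The two Write slides can be sketched as a toy store. This is an illustration only: the class name is hypothetical, the commit log is a Python list rather than a file, and the flush trigger counts columns instead of megabytes.

```python
class ToyStore:
    def __init__(self, memtable_limit: int):
        self.commit_log = []      # stands in for the append-only commit log
        self.memtable = {}        # key -> {column: value}
        self.sstables = []        # immutable, sorted snapshots of flushed memtables
        self.memtable_limit = memtable_limit  # hypothetical limit, in columns

    def write(self, key, column, value):
        self.commit_log.append((key, column, value))          # 1. sequential append
        self.memtable.setdefault(key, {})[column] = value     # 2. in-memory update
        if sum(len(cols) for cols in self.memtable.values()) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out in sorted order, then start over with an empty one.
        sstable = tuple(
            (key, tuple(sorted(cols.items())))
            for key, cols in sorted(self.memtable.items())
        )
        self.sstables.append(sstable)
        self.memtable = {}
        self.commit_log.clear()   # flushed data no longer needs to be replayed

store = ToyStore(memtable_limit=2)
store.write("K1", "C1", "V1")
store.write("K1", "C2", "V2")   # reaches the limit: flushed as SSTable #1
store.write("K1", "C3", "V3")   # stays in the commit log and the new memtable
print(store.sstables, store.memtable)
```

Nothing is ever updated in place: the log is appended to and each SSTable is written once, which is why all writes are sequential.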

Commit Log
Mutation
#3
Mutation
#2
Mutation
#1
Commit Log
Executor
Commit Log
Allocator
Segment #3 Segment #2 Segment #1 Segment #1
Commit Log
File
Memory
Disk
Commit Log
File
Commit Log
File
Flush! Write!
commitlog_segment_size_in_mb

Commit Log
• commitlog_sync
1. periodic (default)
• commitlog_sync_period_in_ms (default: 10 seconds)
2. batch
• commitlog_batch_window_in_ms

Memtable
• ConcurrentSkipListMap<RowPosition, AtomicSortedColumns> rows;
• AtomicSortedColumns.Holder
– DeletionInfo deletionInfo; // tombstone
– SnapTreeMap<ByteBuffer, Column> map;
• Goals
– Fast operations
– Fast concurrent access
– Fast in-order iteration
– Atomic/Isolated operations within a column family

Skip List
1 2 3 4 5 6 7
NIL
NIL
NIL
NIL

Skip List
Get 7
1 2 3 4 5 6 7
NIL
NIL
NIL
NIL

Skip List
Delete 4
1 2 3 4 5 6 7
NIL
NIL
NIL
NIL

Skip List
Delete 4
1 2 3 5 6 7
NIL
NIL
NIL
NIL

Skip List
Insert 4
1 2 3 5 6 7
NIL
NIL
NIL
NIL

Skip List
Insert 4
1 2 3 4 5 6 7
NIL
NIL
NIL
NIL

Skip List
ConcurrentSkipListMap uses: p = 0.5
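A minimal single-threaded skip list sketch, to show what p = 0.5 means: each new node is promoted to the next level with probability 0.5. The class, the MAX_LEVEL cap and the seed parameter are assumptions for illustration; this is not how ConcurrentSkipListMap is implemented, and it is not thread-safe.

```python
import random

MAX_LEVEL = 16   # hypothetical cap on tower height
P = 0.5          # promotion probability

class _Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level   # one next-pointer per level

class SkipList:
    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._head = _Node(None, MAX_LEVEL)
        self._level = 1

    def _random_level(self):
        level = 1
        while level < MAX_LEVEL and self._rng.random() < P:
            level += 1
        return level

    def _predecessors(self, key):
        """Last node before key on every level, found top-down."""
        update = [self._head] * MAX_LEVEL
        node = self._head
        for i in range(self._level - 1, -1, -1):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        return update

    def insert(self, key):
        update = self._predecessors(key)
        level = self._random_level()
        self._level = max(self._level, level)
        new_node = _Node(key, level)
        for i in range(level):          # splice into each level it was promoted to
            new_node.forward[i] = update[i].forward[i]
            update[i].forward[i] = new_node

    def contains(self, key):
        node = self._predecessors(key)[0].forward[0]
        return node is not None and node.key == key

    def __iter__(self):                 # in-order walk along the bottom level
        node = self._head.forward[0]
        while node is not None:
            yield node.key
            node = node.forward[0]

sl = SkipList(seed=1)
for k in [1, 2, 3, 5, 6, 7]:
    sl.insert(k)
sl.insert(4)
print(list(sl), sl.contains(4), sl.contains(8))
```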

Skip List
H 1 3 T
2
H 1 3 T
2
C
A
S

Skip List
while(true):
    next = current.next
    new_node.next = next
    if(CompareAndSwap(current.next, next, new_node)):
        break
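A runnable version of the loop above. Python has no built-in compare-and-swap, so this sketch emulates one with a per-node lock; the `cas_next` helper and the `Node` class are assumptions made for illustration.

```python
import threading

class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node
        self._lock = threading.Lock()   # only used to emulate an atomic CAS

    def cas_next(self, expected, new):
        """Set self.next to new only if it is still expected; report success."""
        with self._lock:
            if self.next is expected:
                self.next = new
                return True
            return False

def insert_after(current, new_node):
    while True:
        nxt = current.next              # read the current successor
        new_node.next = nxt             # point the new node at it
        if current.cas_next(nxt, new_node):
            break                       # nobody changed current.next in between
        # otherwise another writer won the race: retry with the fresh successor

head = Node("H", Node(1, Node(3, Node("T"))))
insert_after(head.next, Node(2))        # insert 2 between 1 and 3

values = []
node = head
while node is not None:
    values.append(node.value)
    node = node.next
print(values)
```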

Skip List
H 1 3 T
H 1 3 T
2
CAS
I’m lost!

Skip List
H 1 3 T
C
A
S
H 1 3 T
H 1 3 T
CAS

Skip List
Insert 4
1 2 3 5 6 7 8
NIL
NIL
NIL
NIL
4

Skip List
Insert 4
1 2 3 5 6 7 8
NIL
NIL
NIL
NIL
4
CAS

Skip List
Delete 4
1 2 3 NIL 5 6 7
NIL
NIL
NIL
NIL

Skip List
Delete 4
1 2 3 NIL 5 6 7
NIL
NIL
NIL
NIL
CAS

SnapTree
3
2 5
1 4 6
Node Balance Factor
1 0
2 1
3 0
4 0
5 0
6 0
Balance Factor = Height(Left-Subtree) – Height(Right-Subtree)

SnapTree
5
2 6
1 3
4
Node Balance Factor
1 0
2 -1
3 -1
4 0
5 2
6 0
Balance Factor must be -1, 0 or +1
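The balance factor tables on the two slides above can be reproduced with a few lines. A minimal sketch with a hypothetical `Node` class; heights count nodes, so an empty subtree has height 0.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right

def height(node):
    return 0 if node is None else 1 + max(height(node.left), height(node.right))

def balance_factor(node):
    """Height(Left-Subtree) - Height(Right-Subtree)."""
    return height(node.left) - height(node.right)

def balance_factors(node, out=None):
    """Collect {key: balance factor} for every node in the tree."""
    if out is None:
        out = {}
    if node is not None:
        out[node.key] = balance_factor(node)
        balance_factors(node.left, out)
        balance_factors(node.right, out)
    return out

# Tree from the first slide: every factor is -1, 0 or +1.
balanced = Node(3, Node(2, Node(1)), Node(5, Node(4), Node(6)))
# Tree from the second slide: node 5 has factor 2 and needs rebalancing.
unbalanced = Node(5, Node(2, Node(1), Node(3, None, Node(4))), Node(6))

print(balance_factors(balanced))
print(balance_factors(unbalanced))
```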

SnapTree
5
3
4
A
B C
D
5
4
3
A B
C
D
4
3 5
A B C D
Left-Right Case
Left-Left Case

SnapTree
3
5
4
D
C B
A
3
4
5
D C
B
A
4
3 5
A B C D
Right-Left Case
Right-Right Case
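The four cases reduce to two rotations. A minimal sketch using the 3, 4, 5 nodes from the slides; the `Node` class and helper names are hypothetical, and the subtrees A through D are left empty.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right

def rotate_right(root):
    """Fixes the Left-Left case: the left child becomes the new root."""
    pivot = root.left
    root.left = pivot.right
    pivot.right = root
    return pivot

def rotate_left(root):
    """Fixes the Right-Right case: the right child becomes the new root."""
    pivot = root.right
    root.right = pivot.left
    pivot.left = root
    return pivot

def fix_left_right(root):
    root.left = rotate_left(root.left)    # Left-Right becomes Left-Left
    return rotate_right(root)

def fix_right_left(root):
    root.right = rotate_right(root.right)  # Right-Left becomes Right-Right
    return rotate_left(root)

def shape(node):
    return None if node is None else (node.key, shape(node.left), shape(node.right))

left_right = Node(5, Node(3, None, Node(4)))
right_left = Node(3, None, Node(5, Node(4)))
print(shape(fix_left_right(left_right)))
print(shape(fix_right_left(right_left)))
```

Both cases end with 4 at the root and 3 and 5 as its children.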

SnapTree
5
2 6
1 3
4
Node Balance Factor
1 0
2 1
3 1
4 0
5 2
6 0
5
2
6
1
3
4
Node Balance Factor
1 0
2 -1
3 -1
4 0
5 2
6 0

SnapTree
Node Balance Factor
1 0
2 1
3 1
4 0
5 2
6 0
5
2
6
1
3
4
Node Balance Factor
1 0
2 1
3 0
4 0
5 0
6 0
3
2 5
1 4 6

Epoch
SnapTree
5
2 6
1 3
4
Root
Lock
4
Version(5) is 0
Version(2) is 0
Does Version(5) == 0?
Insert

Epoch
SnapTree
5
2 6
1 3
4
Root
Get 4
Version(5) is 0
Version(2) is 0

Epoch
SnapTree
Root
5
2
6
1
3
4
Get 4
NO! Go back to 5

Epoch
SnapTree
Root
4
3
2 5
1 4 6
Delete 3
Lock : (

Epoch
SnapTree
Root
4
3
2 5
1 4 6
Delete 3
Lock

Epoch
SnapTree
Root
4
3
2 5
1 4 6
Delete 3
Lock
SetValue(3, null)

SnapTree
Epoch #1
Root
3
2 5
1 4 6
Clone Stop
Delete
Insert

SnapTree
Epoch #2
Root
3
2 5
1 4 6
Clone
Epoch #3
Root
I’m
shared!

SnapTree
Epoch #2
Root
3
2 5
1 4 6
Epoch #3
Root
Insert 7

SnapTree
Epoch #2
Root
3
2 5
1 4 6
Epoch #3
Root
Insert 7
3
2 5
1 4 6

SnapTree
Epoch #2
Root
3
2 5
1 4 6
Epoch #3
Root
Insert 7
3
2 5
1 4 6
7
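The clone slides can be approximated with path copying: a clone starts out sharing every node, and a later insert copies only the nodes it has to change. This is a simplified sketch, not SnapTree's actual algorithm (no epochs, no locking, no rebalancing); the `Node` class and function names are hypothetical.

```python
class Node:
    __slots__ = ("key", "left", "right")

    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right

def insert(root, key):
    """Return a new root; nodes off the search path are shared, not copied."""
    if root is None:
        return Node(key)
    if key < root.key:
        return Node(root.key, insert(root.left, key), root.right)
    if key > root.key:
        return Node(root.key, root.left, insert(root.right, key))
    return root   # key already present: nothing to copy

def keys(root):
    return [] if root is None else keys(root.left) + [root.key] + keys(root.right)

epoch2 = Node(3, Node(2, Node(1)), Node(5, Node(4), Node(6)))
epoch3 = epoch2               # clone: both roots share the whole tree
epoch3 = insert(epoch3, 7)    # copies only the path 3 -> 5 -> 6

print(keys(epoch2))
print(keys(epoch3))
print(epoch3.left is epoch2.left)   # untouched subtree is still shared
```

The older epoch keeps seeing its own unchanged tree, which is what makes a cheap consistent snapshot possible.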

Snap Tree
C* 2.0.0 - File: db/AtomicSortedColumns.java Line: 307

SSTable
Filter.db Data.db
K1
K2
K3
C1
C1
C2
C2
C3
CRC.db
0xFFCC23ED
0x1FEA2321
0xCE652133
Index.db
K1
K2
K3
00001
00002
00003
CompressionInfo.db
00001
00002
00003
00001
00004
00006
Compression? No / Yes
• CASSANDRA-2319
• Promote row index
• CASSANDRA-4885
• Remove … per-row bloom filters

Delete
• Essentially a write (mutation)
• Data is not removed immediately; a tombstone record is added instead
• Once a tombstone is older than gc_grace, the data is removed (during compaction)

Bloom Filter
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Insert K1: Hash

Bloom Filter
0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
Insert K1: Hash

Bloom Filter
1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0
Insert K1: Hash, Hash, Hash
hash = murmur3(key) # creates two hashes
for i in range(num_hashes):
    result[i] = abs(hash[0] + i * hash[1]) % num_bits
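A runnable sketch of the double-hashing idea above. The Python standard library has no murmur3, so this assumes the two halves of an MD5 digest as the two base hashes; the class and method names are hypothetical.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _indexes(self, key: str):
        # Two base hashes (MD5 halves stand in for murmur3's two outputs).
        digest = hashlib.md5(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big")
        # Derive every probe position from the two base hashes.
        # Both values are unsigned here, so no abs() is needed.
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key: str):
        for i in self._indexes(key):
            self.bits[i] = 1

    def might_contain(self, key: str) -> bool:
        """False means definitely absent; True means possibly present."""
        return all(self.bits[i] for i in self._indexes(key))

bf = BloomFilter(num_bits=16, num_hashes=3)
print(bf.might_contain("K1"))   # nothing inserted yet
bf.add("K1")
print(bf.bits, bf.might_contain("K1"))
```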

Bloom Filter
Bloom filter probability calculation
Inputs: bloom_filter_fp_chance (config) and the number of rows (SSTable)
Outputs: number of hashes and number of bits per entry
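The relationship between those values can be sketched with the textbook Bloom filter formulas. This is an assumption for illustration: the function name is hypothetical, and Cassandra's own calculation may round or cap these values differently.

```python
import math

def bloom_parameters(fp_chance: float, num_rows: int):
    """Textbook sizing: bits per entry = -ln(p) / (ln 2)^2, hashes = bits per entry * ln 2."""
    bits_per_entry = -math.log(fp_chance) / (math.log(2) ** 2)
    num_hashes = max(1, round(bits_per_entry * math.log(2)))
    num_bits = math.ceil(bits_per_entry * num_rows)
    return num_hashes, bits_per_entry, num_bits

num_hashes, bits_per_entry, num_bits = bloom_parameters(fp_chance=0.01, num_rows=1000)
print(num_hashes, round(bits_per_entry, 2), num_bits)
```

A lower false-positive chance costs more bits per entry and more hash probes per lookup, which is the trade-off the setting controls.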

Read
Memory
Disk
Memtable
K1 C4:V4
SSTable #2
K1 C3:V3
SSTable #1
K1 C1:V1 C2:V2
…
… …
Memtable
K1 C5:V5
… K1 C1:V1 C2:V2 C3:V3 C4:V4 C5:V5
Row Cache
= Off-heap
row_cache_size_in_mb > 0
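A sketch of how the fragments of row K1 might be merged on read. The timestamps are made up for illustration (the slide does not show them), and the rule applied is that the newest value of each column wins.

```python
def merge_row(*sources):
    """Merge column maps of the form {column: (value, timestamp)}; newest timestamp wins."""
    merged = {}
    for columns in sources:
        for name, (value, timestamp) in columns.items():
            if name not in merged or timestamp > merged[name][1]:
                merged[name] = (value, timestamp)
    return {name: value for name, (value, _) in sorted(merged.items())}

# Fragments of row K1, as on the slide (timestamps are hypothetical).
sstable_1 = {"C1": ("V1", 1), "C2": ("V2", 1)}
sstable_2 = {"C3": ("V3", 2)}
memtable_a = {"C4": ("V4", 3)}
memtable_b = {"C5": ("V5", 4)}

row = merge_row(sstable_1, sstable_2, memtable_a, memtable_b)
print(row)
```

The more SSTables a row is spread across, the more fragments a read has to collect, which is part of what compaction is for.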

Read
Memory: Bloom Filter, Key Cache, Partition Summary, Compression Offsets
Disk: Partition Index, Data
Key Cache outcomes: Cache Hit, Cache Miss
= Off-heap
key_cache_size_in_mb > 0
index_interval = 128 (default)

Compaction (Size-tiered)
min_compaction_threshold = 4
Memtable flush!
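A simplified sketch of size-tiered selection: SSTables of similar size are grouped into buckets, and a bucket becomes a compaction candidate once it holds min_compaction_threshold tables. The `bucket_low` and `bucket_high` parameters and their values are assumptions not shown on the slide, and the function names are hypothetical.

```python
def buckets(sizes, bucket_low=0.5, bucket_high=1.5):
    """Group SSTable sizes so each bucket holds sizes close to its running average."""
    result = []   # each entry is [average_size, member_sizes]
    for size in sorted(sizes):
        for bucket in result:
            average = bucket[0]
            if bucket_low * average <= size <= bucket_high * average:
                bucket[1].append(size)
                bucket[0] = sum(bucket[1]) / len(bucket[1])
                break
        else:
            result.append([size, [size]])   # no similar bucket: start a new one
    return [members for _, members in result]

def compaction_candidates(sizes, min_compaction_threshold=4):
    """Buckets with enough similar-sized SSTables to be worth compacting."""
    return [b for b in buckets(sizes) if len(b) >= min_compaction_threshold]

sstable_sizes_mb = [10, 11, 9, 10, 100, 400]   # hypothetical sizes
print(buckets(sstable_sizes_mb))
print(compaction_candidates(sstable_sizes_mb))
```

Each memtable flush adds one small SSTable, so the smallest bucket fills up first; its merged output then lands in a larger tier.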

Compaction (Leveled)
Memtable flush!

L0: 160 MB L1: 160 MB x 10
sstable_size_in_mb = 160
L2: 160 MB x 100

L0: 160 MB L1: 160 MB x 10 L2: 160 MB x 100
…

Topics
• CAS (PAXOS)
• Anti-entropy (Merkle trees)
• Gossip (Failure detection)
