Deep Dive on Apache Flink State and Checkpointing

© 2019 Ververica
Seth Wiesman, Solutions Architect
Deep Dive on Apache Flink State

Agenda
• Serialization
• State Backends
• Checkpoint Tuning
• Schema Migration
• Upcoming Features

Flink’s Serialization System
• Natively Supported Types
• Primitive Types
• Tuples, Scala Case Classes
• POJOs
• Unsupported Types Fall Back to Kryo

Flink’s Serialization System
Benchmark Results For Flink 1.8

Serializer                      Ops/s
PojoSerializer                  305 / 293*
RowSerializer                   475
TupleSerializer                 498
Kryo                            102 / 67*
Avro (Reflect API)              127
Avro (SpecificRecord API)       297
Protobuf (via Kryo)             376
Apache Thrift (via Kryo)        129 / 112*

public static class MyPojo {
  public int id;
  private String name;
  private String[] operationNames;
  private MyOperation[] operations;
  private int otherId1;
  private Object someObject; // used with String
}
public static class MyOperation {
  int id;
  protected String name;
}

Custom Serializers
Registration with Kryo via ExecutionConfig
• registerKryoType(Class<?>)
• Registers a type with Kryo for a more compact binary format
• registerTypeWithKryoSerializer(Class<?>, Class<? extends Serializer>)
• Provides a default serializer for the given class
• Provided serializer class must extend com.esotericsoftware.kryo.Serializer
• addDefaultKryoSerializer(Class<?>, Serializer<?> serializer)
• Registers a serializer as the default serializer for the given type

Custom Serializers
@TypeInfo Annotation
@TypeInfo(MyTupleTypeInfoFactory.class)
public class MyTuple<T0, T1> {
  public T0 myfield0;
  public T1 myfield1;
}
public class MyTupleTypeInfoFactory extends TypeInfoFactory<MyTuple> {
  @Override
  public TypeInformation<MyTuple> createTypeInfo(Type t, Map<String, TypeInformation<?>> genericParameters) {
    // MyTupleTypeInfo is a user-defined TypeInformation for MyTuple (not shown)
    return new MyTupleTypeInfo(genericParameters.get("T0"), genericParameters.get("T1"));
  }
}

State Backends

Task Manager Process Memory Layout
[Figure: the Task Manager JVM process, divided into Java Heap and Off Heap / Native memory, holding Flink Framework etc., Network Buffers, Timer State, and Keyed State, each drawn at its typical size]

Keyed State Backends
• Based on Java Heap Objects
• Based on RocksDB

Heap Keyed State Backend
• State lives as Java objects on the heap
• Organized as chained hash table, key ↦ state
• One hash table per registered state
• Supports asynchronous state snapshots
• Data is de / serialized only during state snapshot and restore
• Highest Performance
• Affected by garbage collection overhead / pauses
• Currently no incremental checkpoints
• High memory overhead of representation
• State is limited by available heap memory

Heap State Table Architecture
- Hash buckets (Object[]), 4B-8B per slot
- Load factor <= 75%
- Incremental rehash
Each Entry holds:
▪ 4 References:
▪ Key (K)
▪ Namespace (N)
▪ State (S)
▪ Next
▪ 3 int:
▪ Entry Version
▪ State Version
▪ Hash Code
Entry size: 4 x (4B-8B) + 3 x 4B + ~8B-16B (Object overhead)
K, N, S: Object sizes and overhead. Some objects might be shared.
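
The per-entry figures above can be turned into a quick estimate. The sketch below is back-of-the-envelope arithmetic, not Flink code: the class and method names are made up for illustration, JVM alignment padding is ignored, and the key, namespace, and state objects themselves are excluded.

```java
final class HeapEntryOverhead {
    /** Bytes for one table entry: 4 references + 3 ints + object header. */
    public static long entryBytes(int referenceBytes, int objectHeaderBytes) {
        int references = 4; // key, namespace, state, next
        int ints = 3;       // entry version, state version, hash code
        return (long) references * referenceBytes + ints * 4L + objectHeaderBytes;
    }

    /**
     * Adds the share of the bucket array. With a load factor of at most 75%
     * there are at least 1 / 0.75 bucket slots per entry.
     */
    public static double bytesPerKey(int referenceBytes, int objectHeaderBytes, double loadFactor) {
        return entryBytes(referenceBytes, objectHeaderBytes) + referenceBytes / loadFactor;
    }

    public static void main(String[] args) {
        System.out.println(entryBytes(4, 8));  // lower bound: 36
        System.out.println(entryBytes(8, 16)); // upper bound: 60
        System.out.println(bytesPerKey(8, 16, 0.75));
    }
}
```

So the table itself costs roughly 36 to 60 bytes per entry before any user data is counted, which is why the heap backend has a high memory overhead of representation for small state values.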

Heap State Table Snapshot
[Figure: an original table and its snapshot, both pointing at the same entries A, B, C]
• Copy of hash bucket array is snapshot overhead
• No conflicting modification = no overhead (a new entry D is added to the original only)
• Modifications trigger deep copy of entry - only as much as required. This depends on what was modified and what is immutable (as determined by type serializer). The original then holds A’ while the snapshot keeps A.
• Worst case overhead = size of original at time of snapshot.
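
The copy-on-write idea can be sketched in a few lines. This is a simplified model and not Flink's implementation: it uses a plain HashMap instead of the chained hash table, a single version per entry instead of separate entry and state versions, and all names (CowStateTable, Snapshot, deepCopies) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class CowStateTable {
    /** One entry; version is the table version at which the entry was created. */
    private static final class Entry {
        final int[] state; // stands in for a mutable state object
        final int version;
        Entry(int[] state, int version) { this.state = state; this.version = version; }
    }

    /** Read-only view that shares entries with the live table. */
    static final class Snapshot {
        private final Map<String, Entry> frozen;
        private Snapshot(Map<String, Entry> frozen) { this.frozen = frozen; }
        boolean contains(String key) { return frozen.containsKey(key); }
        int get(String key) { return frozen.get(key).state[0]; }
    }

    private final Map<String, Entry> table = new HashMap<>();
    private int tableVersion = 0; // bumped on every snapshot
    int deepCopies = 0;           // counts copy-on-write events

    void put(String key, int value) {
        table.put(key, new Entry(new int[] {value}, tableVersion));
    }

    int get(String key) { return table.get(key).state[0]; }

    /** Mutates in place unless a snapshot may still reference the entry. */
    void increment(String key) {
        Entry e = table.get(key);
        if (e.version < tableVersion) {
            e = new Entry(e.state.clone(), tableVersion); // deep copy of this entry only
            table.put(key, e);
            deepCopies++;
        }
        e.state[0]++;
    }

    /** Only the mapping is copied; the entries themselves are shared. */
    Snapshot snapshot() {
        tableVersion++;
        return new Snapshot(new HashMap<>(table));
    }

    public static void main(String[] args) {
        CowStateTable table = new CowStateTable();
        table.put("A", 1);
        table.put("B", 2);
        table.put("C", 3);
        Snapshot snapshot = table.snapshot();
        table.put("D", 4);    // new entry: no conflict, no copy
        table.increment("A"); // conflict: A is copied, the snapshot keeps the old A
        System.out.println(table.get("A") + " " + snapshot.get("A") + " " + table.deepCopies);
        // prints "2 1 1"
    }
}
```

A second increment of A after the copy mutates the new entry in place, so the overhead is bounded by the number of distinct entries modified while the snapshot is alive.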

Heap Backend Tuning Considerations
• Choose TypeSerializers with efficient copy-methods
• Flag immutability of objects where possible to avoid copy completely
• Flatten POJOs / avoid deep objects
• Reduces object overheads and following references
• GC choice / tuning
• Scale out using multiple task managers per node

RocksDB Keyed State Backend Characteristics
• State lives as serialized byte-strings in off-heap memory and on local disk
• One column family per registered state (~table)
• Key / Value store, organized as a log-structured merge tree (LSM tree)
• Key: serialized bytes of <keygroup, key, namespace>
• LSM naturally supports MVCC
• Data is de / serialized on every read and update
• Not affected by garbage collection
• Relatively low overhead of representation
• LSM naturally supports incremental snapshots
• State size is limited by available local disk space
• Lower performance (~ order of magnitude compared to Heap state backend)

RocksDB Architecture
[Figure: RocksDB write and read path across Memory and the Persistent Store on Local Disk]
• WriteOp: writes go to the Active MemTable (and to the WAL)
• Full/Switch: a full Active MemTable becomes a ReadOnly MemTable
• Flush: ReadOnly MemTables are written to Local Disk as SST files
• Compaction: SST files are merged on Local Disk
• ReadOp: reads are served from the MemTables, the Read Only Block Cache, and the SST files
• MemTables and SST files: set per column family (~table)
In Flink:
- disable WAL and sync
- persistence via checkpoints

RocksDB Resource Consumption
• One RocksDB instance per operator subtask
• block_cache_size
• Size of the block cache
• write_buffer_size
• Max size of a MemTable
• max_write_buffer_number
• The maximum number of MemTables allowed in memory before flush to SST file
• Indexes and bloom filters
• Optional
• Table Cache
• Caches open file descriptors to SST files
• Default: unlimited!
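
To get a feel for how these settings add up, here is a rough estimate in code. It is a sketch under stated assumptions, not a Flink or RocksDB API: it assumes each registered state (column family) gets its own block cache and its own write buffers, ignores indexes, bloom filters, and the table cache, and uses example values rather than recommended ones.

```java
final class RocksDbMemoryEstimate {
    static final long MB = 1024L * 1024L;

    /** Rough upper bound for one RocksDB instance (one operator subtask). */
    public static long perInstanceBytes(int columnFamilies, long blockCacheSize,
                                        long writeBufferSize, int maxWriteBufferNumber) {
        long memTables = writeBufferSize * maxWriteBufferNumber; // write_buffer_size * max_write_buffer_number
        return columnFamilies * (blockCacheSize + memTables);
    }

    /** One instance per stateful operator subtask running in the Task Manager. */
    public static long perTaskManagerBytes(int statefulSubtasks, long perInstanceBytes) {
        return statefulSubtasks * perInstanceBytes;
    }

    public static void main(String[] args) {
        // Example values (assumptions): 3 registered states, 8 MB block cache,
        // 64 MB write buffer, at most 2 write buffers.
        long instance = perInstanceBytes(3, 8 * MB, 64 * MB, 2);
        System.out.println(instance / MB + " MB per instance");              // 408 MB
        System.out.println(perTaskManagerBytes(4, instance) / MB + " MB for 4 subtasks"); // 1632 MB
    }
}
```

The point of the arithmetic is that native memory grows with the number of registered states and the number of subtasks per Task Manager, not only with the configured sizes.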

Performance Tuning
Amplification Factors
[Figure: the parameter space spanned by Write Amplification, Read Amplification, and Space Amplification]
Example: More compaction effort = increased write amplification and reduced read amplification
More details: https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide

General Performance Considerations
• Use efficient TypeSerializers and serialization formats
• Decompose user code objects
• ValueState<List<Integer>> → ListState<Integer>
• ValueState<Map<Integer, Integer>> → MapState<Integer, Integer>
• Use the correct configuration for your hardware setup
• Consider enabling RocksDB native metrics to profile your applications
• File Systems
• Working directory on fast storage, ideally local SSD. Could even be memory.
• EBS performance can be problematic
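
Why decomposing objects matters with a backend that (de)serializes on every access can be shown with a small model. The sketch below does not use Flink: it serializes a list with plain java.io streams and counts bytes, on the assumption that a single list value is read and rewritten as a whole on each append, while an element-wise append writes only the new element. Class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class DecomposeStateCost {
    /** Serializes the whole list as one value: length prefix plus 4 bytes per element. */
    static byte[] serializeList(List<Integer> list) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(list.size());
            for (int v : list) out.writeInt(v);
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static List<Integer> deserializeList(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int n = in.readInt();
            List<Integer> list = new ArrayList<>(n);
            for (int i = 0; i < n; i++) list.add(in.readInt());
            return list;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Bytes (de)serialized when appending n elements to one monolithic list value. */
    public static long monolithicAppendBytes(int n) {
        byte[] stored = serializeList(new ArrayList<>());
        long total = 0;
        for (int i = 0; i < n; i++) {
            List<Integer> list = deserializeList(stored); // read the whole value
            total += stored.length;
            list.add(i);
            stored = serializeList(list);                 // write the whole value back
            total += stored.length;
        }
        return total;
    }

    /** Bytes serialized when each append only writes the new 4-byte element. */
    public static long elementWiseAppendBytes(int n) {
        return 4L * n;
    }

    public static void main(String[] args) {
        System.out.println(monolithicAppendBytes(1000));  // 4008000
        System.out.println(elementWiseAppendBytes(1000)); // 4000
    }
}
```

The monolithic variant grows quadratically with the list length, the element-wise variant linearly, which is the motivation for preferring ListState and MapState over ValueState holding a collection.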

Timer Service

Heap Timers
Binary heap of timers in array
Each Timer holds:
▪ 2 References:
▪ Key (K)
▪ Namespace (N)
▪ 1 long:
▪ Timestamp
▪ 1 int:
▪ Array Index
K, N: Object sizes and overhead. Some objects might be shared.
Peek: O(1)
Poll: O(log(n))
Insert: O(log(n))
Delete: O(n)
Contains: O(n)

Heap Timers
HashMap<Timer, Timer> : fast deduplication and deletes
[Figure: each MapEntry (Key, Value) of the HashMap points at a Timer in the binary heap array]
Peek: O(1)
Poll: O(log(n))
Insert: O(log(n))
Delete: O(log(n))
Contains: O(1)
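
The combination of a binary heap, an array index stored in each timer, and a HashMap for lookups can be sketched as follows. This is an illustrative model rather than Flink's timer heap: names are hypothetical, keys and namespaces are plain strings, and timers are ordered by timestamp only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

final class TimerHeap {
    static final class Timer {
        final long timestamp;
        final String key;
        final String namespace;
        int index = -1; // position in the heap array, maintained by the heap

        Timer(long timestamp, String key, String namespace) {
            this.timestamp = timestamp;
            this.key = key;
            this.namespace = namespace;
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof Timer)) return false;
            Timer t = (Timer) o;
            return timestamp == t.timestamp && key.equals(t.key) && namespace.equals(t.namespace);
        }

        @Override public int hashCode() { return Objects.hash(timestamp, key, namespace); }
    }

    private final List<Timer> heap = new ArrayList<>();      // binary min-heap by timestamp
    private final Map<Timer, Timer> dedup = new HashMap<>(); // finds the stored instance in O(1)

    boolean insert(Timer t) {
        if (dedup.containsKey(t)) return false; // deduplication
        dedup.put(t, t);
        t.index = heap.size();
        heap.add(t);
        siftUp(t.index);
        return true;
    }

    Timer peek() { return heap.isEmpty() ? null : heap.get(0); }

    Timer poll() {
        Timer top = peek();
        if (top != null) delete(top);
        return top;
    }

    boolean contains(Timer t) { return dedup.containsKey(t); }

    /** O(log(n)): the stored index avoids a linear scan of the array. */
    boolean delete(Timer probe) {
        Timer stored = dedup.remove(probe);
        if (stored == null) return false;
        int i = stored.index;
        Timer last = heap.remove(heap.size() - 1);
        if (i < heap.size()) { // fill the hole with the last element and restore heap order
            heap.set(i, last);
            last.index = i;
            siftUp(i);
            siftDown(last.index);
        }
        stored.index = -1;
        return true;
    }

    private void siftUp(int i) {
        while (i > 0) {
            int parent = (i - 1) / 2;
            if (heap.get(parent).timestamp <= heap.get(i).timestamp) break;
            swap(i, parent);
            i = parent;
        }
    }

    private void siftDown(int i) {
        int n = heap.size();
        while (true) {
            int left = 2 * i + 1, right = left + 1, smallest = i;
            if (left < n && heap.get(left).timestamp < heap.get(smallest).timestamp) smallest = left;
            if (right < n && heap.get(right).timestamp < heap.get(smallest).timestamp) smallest = right;
            if (smallest == i) return;
            swap(i, smallest);
            i = smallest;
        }
    }

    private void swap(int a, int b) {
        Timer x = heap.get(a), y = heap.get(b);
        heap.set(a, y); y.index = a;
        heap.set(b, x); x.index = b;
    }
}
```

Without the map and the stored index, delete and contains would have to scan the array, which is the O(n) cost listed for the plain binary heap.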

Heap Timers
HashMap<Timer, Timer> : fast deduplication and deletes
Snapshot (net values of a timer are immutable)

RocksDB Timers
Column Family - only key, no value
Lexicographically ordered byte sequences as key, no value

Key Group   Timestamp   Key   Namespace
0           20          A     X
0           40          D     Z
1           10          D     Z
1           20          C     Y
2           50          B     Y
2           60          A     X
…           …           …     …
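
How a byte-ordered key yields timers sorted by key group and then timestamp can be illustrated without RocksDB. The sketch below uses a TreeSet with an unsigned lexicographic comparator as a stand-in for the column family. The exact byte layout is an assumption for illustration (2-byte key group, 8-byte timestamp, UTF-8 key and namespace without length prefixes, non-negative timestamps only), not Flink's actual format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.TreeSet;

final class TimerKeyOrder {
    /** Big-endian encoding, so unsigned byte order matches numeric order for non-negative values. */
    static byte[] encode(int keyGroup, long timestamp, String key, String namespace) {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        byte[] n = namespace.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(2 + 8 + k.length + n.length)
                .putShort((short) keyGroup)
                .putLong(timestamp)
                .put(k)
                .put(n)
                .array();
    }

    static int keyGroupOf(byte[] encoded) {
        return ByteBuffer.wrap(encoded).getShort(0) & 0xFFFF;
    }

    static long timestampOf(byte[] encoded) {
        return ByteBuffer.wrap(encoded).getLong(2);
    }

    /** Sorted set of keys only, no values, ordered like a byte-wise key comparator. */
    static TreeSet<byte[]> newStore() {
        return new TreeSet<>((a, b) -> Arrays.compareUnsigned(a, b));
    }

    public static void main(String[] args) {
        TreeSet<byte[]> store = newStore();
        store.add(encode(2, 60, "A", "X"));
        store.add(encode(0, 40, "D", "Z"));
        store.add(encode(1, 20, "C", "Y"));
        store.add(encode(0, 20, "A", "X"));
        store.add(encode(2, 50, "B", "Y"));
        store.add(encode(1, 10, "D", "Z"));
        // Iteration order matches the table: (0,20) (0,40) (1,10) (1,20) (2,50) (2,60)
        for (byte[] k : store) {
            System.out.println(keyGroupOf(k) + " " + timestampOf(k));
        }
    }
}
```

Because the key group is the leading component, the earliest timer of each key group is simply the first key with that prefix.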

RocksDB Timers
Column Family - only key, no value
Key group queues (caching first k timers)
Priority queue of key group queues

3 Task Manager Memory Layouts
[Figure: three Task Manager memory layouts; each splits the process into Java Heap and Off Heap / Native memory and places Network Buffers, Keyed State, and Timer State in one of the two]

Full Checkpoint
[Figure: state at @t1 (A, B, C, D), @t2 (A, F, C, D, E), and @t3 (G, H, C, D, I, E); Checkpoint 1, Checkpoint 2, and Checkpoint 3 each store a complete copy of the state at that time]

Full Checkpoint Overview
• Creation iterates and writes full database snapshots as a stream to stable storage
• Restore reads data as a stream from stable storage and re-inserts into the state backend
• Each checkpoint is self-contained, and its size is proportional to the size of the full state
• Optional: compression with snappy

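Incremental Checkpoint
[Figure: state at @t1 (A, B, C, D), @t2 (A, F, C, D, E), and @t3 (G, H, C, D, I, E); Checkpoint 1 stores A, B, C, D, Checkpoint 2 builds upon it and stores only the delta E, F, and Checkpoint 3 builds upon Checkpoint 2 and stores only the delta G, H, I]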

Incremental Checkpoints with RocksDB
[Figure: the RocksDB write path (WriteOp, Active MemTable, Full/Switch, ReadOnly MemTable, Flush, SST files on Local Disk, Merge by Compaction)]
Incremental checkpoint: Observe created/deleted SST files since last checkpoint

Incremental Checkpoint Overview
• Expected trade-off: faster* checkpoints, slower recovery
• Creation only copies deltas (new local SST files) to stable storage
• Creates write amplification because we also upload compacted SST files so that we can prune checkpoint history
• Sum of all increments that we read from stable storage can be larger than the full state size
• No rebuild is required because we simply re-open the RocksDB backend from the SST files
• SST files are snappy compressed by default
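
The delta computation behind these points can be modeled with plain sets of file names. The sketch below is not Flink code: the class, method, and file names are hypothetical, and it only models which SST files are new since the previous checkpoint and which are no longer part of the current state.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

final class IncrementalSstDelta {
    /** SST files created since the previous checkpoint: these must be uploaded. */
    public static Set<String> toUpload(Set<String> previous, Set<String> current) {
        Set<String> delta = new TreeSet<>(current);
        delta.removeAll(previous);
        return delta;
    }

    /** SST files deleted since the previous checkpoint: not referenced by the new one. */
    public static Set<String> dropped(Set<String> previous, Set<String> current) {
        Set<String> gone = new TreeSet<>(previous);
        gone.removeAll(current);
        return gone;
    }

    private static Set<String> files(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    public static void main(String[] args) {
        Set<String> cp1 = files("1.sst", "2.sst");
        Set<String> cp2 = files("1.sst", "2.sst", "3.sst"); // one MemTable flush
        Set<String> cp3 = files("4.sst");                   // compaction merged 1, 2, 3 into 4

        System.out.println(toUpload(cp1, cp2)); // [3.sst]
        System.out.println(toUpload(cp2, cp3)); // [4.sst]
        System.out.println(dropped(cp2, cp3));  // [1.sst, 2.sst, 3.sst]
    }
}
```

The third checkpoint uploads 4.sst even though its data was already uploaded in 1.sst to 3.sst. That is the write amplification mentioned above, and it is also what allows older checkpoints and their files to be pruned.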

Anatomy of a Flink Stream Job Upgrade
[Figure: Flink job user code performs local reads / writes that manipulate state in the Local State Backend, alongside a Persistent Savepoint]