Flink Forward San Francisco 2022.
Task Managers constantly running out of memory? Flink job keeps restarting from cryptic Akka exceptions? Flink job running but doesn’t seem to be processing any records? We share practical learnings from running thousands of Flink jobs for different use cases and look at the common challenges those jobs run into, such as out-of-memory errors, timeouts and job instability. We will cover memory tuning and S3 and Akka configurations that address common pitfalls, as well as the approaches we take to automating health monitoring and management of Flink jobs at scale.
by
Hong Teoh & Usamah Jassat
In a typical setup of a Flink cluster on Kubernetes:
JobManager (JM) containers
TaskManager (TM) containers
Instance 2 taskmanager containers -> Java process, Python process, Kinesis producers
Each measures memory differently:
The Java process measures heap and non-heap memory using JMX beans (MXBeans). This only covers the memory use of the Java process itself and excludes the Python and Kinesis processes
Containers’ memory use is measured using cAdvisor
On the instance, the OS bridges the gap between virtual memory and real (physical) memory
This means that if the Java process requests 100 GB of memory, that memory is allocated in virtual address space
In reality, it is mapped to a “zero page”, meaning it doesn’t actually take up 100 GB of physical memory until it is accessed/written to.
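The gap between reserved and used memory can be observed on a Linux host. The following is a minimal sketch, not something from the talk: it maps an anonymous region and compares VmSize (virtual) with VmRSS (resident) from /proc/self/status before and after touching part of it. The helper names and the 64 MiB size are illustrative assumptions.

```python
import mmap
import os


def parse_status(text):
    """Return the Vm* fields of a /proc/<pid>/status dump as {name: kB}."""
    fields = {}
    for line in text.splitlines():
        if line.startswith("Vm"):
            name, value = line.split(":", 1)
            fields[name] = int(value.split()[0])  # the kernel reports these in kB
    return fields


def self_memory():
    """Read the Vm* fields of the current process (Linux only)."""
    with open("/proc/self/status") as f:
        return parse_status(f.read())


def demo(size=64 * 1024 * 1024):
    """Reserve `size` bytes, then touch a quarter of the pages."""
    snapshots = [("before", self_memory())]
    # Private anonymous mapping: address space is reserved, pages are not yet backed.
    region = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE)
    snapshots.append(("reserved", self_memory()))
    for offset in range(0, size // 4, mmap.PAGESIZE):
        region[offset] = 1  # writing forces the kernel to back this page
    snapshots.append(("touched", self_memory()))
    region.close()
    return snapshots


if __name__ == "__main__" and os.path.exists("/proc/self/status"):
    for label, snap in demo():
        print(label, "VmSize kB:", snap["VmSize"], "VmRSS kB:", snap["VmRSS"])
```

On a typical Linux machine, VmSize should jump at the "reserved" step while VmRSS only grows at the "touched" step. Exact numbers depend on the kernel and the Python runtime.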
In Kubernetes setups, it is recommended to set vm.overcommit_memory to 1 -> do not limit virtual memory.
When processes actually use the memory they reserved and physical memory runs out, the OOM killer (a kernel mechanism) steps in and terminates a process.
It does not care which container the process belongs to, so it can leave containers in an unhealthy state.
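To check which overcommit policy a node is actually running with, the setting can be read from /proc. This is a small sketch assuming a Linux host. The function names are hypothetical, and the mode descriptions paraphrase the kernel documentation.

```python
from pathlib import Path

# Meaning of each vm.overcommit_memory value, paraphrased from the kernel docs.
OVERCOMMIT_MODES = {
    0: "heuristic: the kernel refuses only obviously excessive requests",
    1: "always: every request for virtual memory succeeds",
    2: "never: commits are capped at swap plus a configurable share of RAM",
}


def describe_overcommit(raw):
    """Translate the raw content of overcommit_memory into a description."""
    mode = int(raw.strip())
    return OVERCOMMIT_MODES.get(mode, "unknown mode %d" % mode)


def read_overcommit(path="/proc/sys/vm/overcommit_memory"):
    """Return the description for this host, or None if the file is absent."""
    p = Path(path)
    return describe_overcommit(p.read_text()) if p.exists() else None


if __name__ == "__main__":
    print(read_overcommit())
```

With mode 1, allocation requests do not fail up front, so memory pressure only shows up later, when pages are touched and the OOM killer has to act.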
Takeaways
Measure BOTH Java memory and OS memory (cAdvisor)
Use data to drill down into which bucket of memory is over the limit
Use cgroups v2 to treat all processes in a container like a single process
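Comparing the two measurements side by side could look like the sketch below. It assumes you already export the container's memory use (for example from cAdvisor) and the JVM's heap and non-heap use (from the JMX beans) as byte counts. The function name, field names and numbers are hypothetical.

```python
GIB = 1024 ** 3


def memory_breakdown(container_bytes, limit_bytes, jvm_heap_bytes, jvm_non_heap_bytes):
    """Split container memory into what the JVM beans report and what they do not see."""
    jvm_reported = jvm_heap_bytes + jvm_non_heap_bytes
    return {
        "jvm_reported": jvm_reported,
        # Everything the beans do not cover: other processes in the container
        # (Python, Kinesis producers) plus memory the beans do not report.
        "outside_jvm_view": container_bytes - jvm_reported,
        # Remaining room before the container hits its memory limit.
        "headroom": limit_bytes - container_bytes,
    }


if __name__ == "__main__":
    # Hypothetical numbers: 8 GiB limit, 7 GiB in use, JVM beans report 4 + 1 GiB.
    result = memory_breakdown(7 * GIB, 8 * GIB, 4 * GIB, 1 * GIB)
    for name, value in result.items():
        print(name, value / GIB, "GiB")
```

A large "outside_jvm_view" value suggests the pressure comes from outside the part of the JVM that the beans cover, which is where drilling down by bucket becomes useful.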
Java memory
NOT JUST HEAP
Metaspace – for classes
Direct – ByteBuffers allocated outside the heap -> significantly faster for I/O than heap buffers, since data does not have to be copied through the heap
Mapped – useful when reading from file -> Map a file into a ByteBuffer
Java Native Interface (JNI) – when running non-Java code (RocksDB state backend)
Overhead – used for the JVM itself (GC, thread stacks, symbols)
Map these buckets to the Flink TaskManager memory configurations
Explain limits for each bucket
When each limit is exceeded, this is how it surfaces
Gotchas
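One way to picture the mapping from memory buckets to TaskManager settings is a small lookup table. This is a sketch based on Flink's documented TaskManager memory model, not the speakers' own tooling. The option names and JVM flags are real Flink and JVM settings, while the table layout and the diagnose helper are hypothetical.

```python
# Bucket -> Flink options that size it, the JVM flag that enforces it (if any),
# and the error text that typically appears when the limit is exceeded.
BUCKETS = {
    "heap": {
        "flink_options": [
            "taskmanager.memory.framework.heap.size",
            "taskmanager.memory.task.heap.size",
        ],
        "jvm_flag": "-Xmx",
        "symptom": "java.lang.OutOfMemoryError: Java heap space",
    },
    "direct": {
        "flink_options": [
            "taskmanager.memory.framework.off-heap.size",
            "taskmanager.memory.task.off-heap.size",
            "taskmanager.memory.network.min",
            "taskmanager.memory.network.max",
        ],
        "jvm_flag": "-XX:MaxDirectMemorySize",
        "symptom": "java.lang.OutOfMemoryError: Direct buffer memory",
    },
    "metaspace": {
        "flink_options": ["taskmanager.memory.jvm-metaspace.size"],
        "jvm_flag": "-XX:MaxMetaspaceSize",
        "symptom": "java.lang.OutOfMemoryError: Metaspace",
    },
    "native": {
        # Managed memory, used by the RocksDB state backend; no JVM flag limits it,
        # so overuse shows up as the container being killed, not as a Java error.
        "flink_options": ["taskmanager.memory.managed.size"],
        "jvm_flag": None,
        "symptom": None,
    },
}


def diagnose(log_line):
    """Return the bucket whose typical error text appears in log_line, else None."""
    for bucket, info in BUCKETS.items():
        if info["symptom"] and info["symptom"] in log_line:
            return bucket
    return None


if __name__ == "__main__":
    print(diagnose("Caused by: java.lang.OutOfMemoryError: Metaspace"))
```

Buckets without a JVM-enforced limit have no matching log line, which is one reason to also watch the container-level measurement.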