Flink Forward San Francisco 2022.
Resource elasticity is a frequently requested feature in Apache Flink: users want to adjust their clusters easily to changing workloads, for resource efficiency and cost savings. Flink 1.13 introduced the initial implementation of Reactive Mode, and later releases added improvements to make the feature production ready. In this talk, we’ll explain how to deploy Reactive Mode in various environments to achieve autoscaling and resource elasticity. We’ll discuss the constraints to consider when planning to use this feature, as well as potential improvements from the Flink roadmap. For those interested in the internals of Flink, we’ll also briefly explain how the feature is implemented and, if time permits, conclude with a short demo.
by Robert Metzger
3. Reasons for changing loads
- Seasonality:
  - day / night
  - weekend / weekday
- Product popularity: new feature launches, ad campaigns
- Upstream system outages: load spikes during recovery
4. Solutions in Flink to Rescale
- Flink 1.2 (2017): Rescalable State
  - Flink can restore from a savepoint with a different parallelism, so no data is lost and all computations stay correct
  - When used for scaling: requires custom tooling to orchestrate operations, and bookkeeping
- Flink 1.13 (2021): Reactive Mode (beta)
  - Flink automatically adjusts when TaskManagers are added or removed
  - Requires an outside entity to decide on the number of TaskManagers
- Since Flink 1.15 (2022): Reactive Mode is out of beta
Further reading: https://flink.apache.org/features/2017/07/04/flink-rescalable-state.html
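To make the rescalable-state bullet more concrete, here is a simplified sketch of how keyed state could be redistributed when the parallelism changes. It assumes the key-group scheme described in the linked article: keys are hashed into a fixed number of key groups (equal to the max parallelism), and each parallel subtask owns a contiguous range of groups. The function names are illustrative assumptions, and the sketch is not Flink's actual code.

```python
from typing import Dict, List


def operator_index_for(key_group: int, parallelism: int, max_parallelism: int) -> int:
    """Map a key group to the parallel subtask that owns it (contiguous ranges)."""
    return key_group * parallelism // max_parallelism


def assignment(parallelism: int, max_parallelism: int) -> Dict[int, List[int]]:
    """Return, for each subtask index, the list of key groups it owns."""
    owners: Dict[int, List[int]] = {i: [] for i in range(parallelism)}
    for key_group in range(max_parallelism):
        owner = operator_index_for(key_group, parallelism, max_parallelism)
        owners[owner].append(key_group)
    return owners


# Assumed toy setup: 8 key groups, rescaling from 2 to 4 subtasks.
before = assignment(parallelism=2, max_parallelism=8)
after = assignment(parallelism=4, max_parallelism=8)
print(before)
print(after)
```

Because state is stored per key group, a restore with a different parallelism only has to hand whole key groups to their new owners; no individual key needs to be re-hashed.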
5. How to use Reactive Mode?
- Reactive Mode works with all standalone deployments
  - E.g. Kubernetes, Docker or via the provided deployment scripts
- Set the configuration:
  scheduler-mode=reactive
- Start the JobManager, and add as many TaskManagers as you need
- (optionally) Use a service to determine the number of TaskManagers
  - Kubernetes Horizontal Pod Autoscaler
  - AWS AutoScaling Groups
  - Google Cloud Managed Instance Groups
6. Reactive Mode: How does it work?
Flink automatically adjusts when TaskManagers are added or removed
Example: Load is increasing
[Diagram: one JobManager and two TaskManagers, job parallelism = 2, load rising]
7. Reactive Mode: How does it work?
Flink automatically adjusts when TaskManagers are added or removed
Example: Load is increasing → add more TaskManagers
[Diagram: one JobManager and four TaskManagers, two of them marked NEW, job parallelism = 4]
8. Reactive Mode: How does it work?
- The JobManager adjusts the job parallelism depending on the number of available TaskManagers
- When the number of TaskManagers changes, the Flink job restarts and restores from the latest checkpoint
- Possible metrics for the scaling decision: CPU load / Kafka lag (recommended) / throughput / latency
- Scaling model similar to Kafka Streams
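The first bullet can be read as a simple rule, sketched below under stated assumptions: the job runs with as many parallel subtasks as there are slots, capped by the job's max parallelism. The function name, the slots-per-TaskManager parameter, and the default cap of 128 are illustrative assumptions; the real scheduler also has to account for details such as slot sharing.

```python
def reactive_parallelism(task_managers: int, slots_per_tm: int = 1,
                         max_parallelism: int = 128) -> int:
    """Parallelism a job would run with, given the currently available TaskManagers."""
    available_slots = task_managers * slots_per_tm
    # Use everything that is available, but never exceed the job's max parallelism.
    return min(available_slots, max_parallelism)


# The two diagrams in this deck: 2 TaskManagers, then 4 TaskManagers (one slot each).
print(reactive_parallelism(2))
print(reactive_parallelism(4))
# Adding hardware beyond the max parallelism no longer helps.
print(reactive_parallelism(task_managers=50, slots_per_tm=4, max_parallelism=128))
```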
9. Reactive Mode example: Kubernetes HPA
- Kubernetes has a built-in component called HorizontalPodAutoscaler
- Automatically adjusts the scale of a deployment based on a metric
[Diagram: a Flink JobManager Job with one JobManager pod; a Flink TaskManager Deployment with three TaskManager pods, their number adjusted dynamically; a HorizontalPodAutoscaler (min=1, max=15, cpu=80%) acting on the TaskManager deployment]
Source: https://flink.apache.org/2021/05/06/reactive-mode.html
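As a rough illustration of how the autoscaler in this example arrives at a TaskManager count, the sketch below applies the replica formula documented for the Kubernetes HorizontalPodAutoscaler (current replicas times observed metric over target metric, rounded up) with the values from the diagram (min=1, max=15, cpu=80%). The function and parameter names are assumptions for illustration, and the sketch ignores the tolerance band and stabilization windows that the real controller applies.

```python
def desired_task_managers(current_replicas: int, observed_cpu_percent: int,
                          target_cpu_percent: int = 80,
                          min_replicas: int = 1, max_replicas: int = 15) -> int:
    """Number of TaskManager pods the autoscaler would ask for."""
    # Integer ceiling division avoids floating-point rounding surprises.
    scaled = current_replicas * observed_cpu_percent
    raw = -(-scaled // target_cpu_percent)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, raw))


print(desired_task_managers(3, observed_cpu_percent=100))  # overloaded: scale up
print(desired_task_managers(4, observed_cpu_percent=80))   # on target: stay
print(desired_task_managers(4, observed_cpu_percent=20))   # idle: scale down
```

Each change in the replica count then shows up in Flink as TaskManagers being added or removed, which is the only signal Reactive Mode needs.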
10. Reactive Mode and Flink Deployments
→ Reactive Mode only works with “standalone mode”
Passive Deployment
Flink resources managed externally (“Standalone mode”)
→ “a bunch of JVMs”
Deployed on bare metal, Docker, Kubernetes
Pros / Cons:
+ DIY scenarios
+ Fast deployments
- Restart
→ Reactive Scaling (outside entity decides)
Active Deployment
Flink actively manages resources
→ Flink talks to a resource manager
Implementations: Native Kubernetes, YARN
Pros / cons:
+ Automatically restarts failed resources
+ Allocates only required resources
- Requires a lot of K8s permissions
→ Autoscaling (Flink decides)
11. Autoscaling with Flink? Enter Adaptive Scheduler
- Benefits
  - Flink can make better scaling decisions
  - Example: rescale only right after a checkpoint has completed → avoid reprocessing
  - Fewer components required (“batteries included”)
- How?
  - Reactive Mode is based on a new internal workload scheduler (introduced in Flink 1.13), called Adaptive Scheduler
  - It is currently configured to behave “reactively”, but can also be changed to scale automatically
12. Internals: Adaptive Scheduler
Source / Further reading: https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler
https://cwiki.apache.org/confluence/display/FLINK/FLIP-138%3A+Declarative+Resource+management
[Diagram: Adaptive Scheduler, Requirements, and the SlotManager inside the ResourceManager (active K8s / YARN); speech bubbles: “I need 15 slots”, “I have 8 slots”]
13. Adaptive Scheduler for Autoscaling (future)
Source / Further reading: https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler
https://cwiki.apache.org/confluence/display/FLINK/FLIP-138%3A+Declarative+Resource+management
[Diagram: same components as on the previous slide, plus a pluggable Autoscaler attached to the Adaptive Scheduler; speech bubbles: “I need x slots”, “I have 8 slots”]
14. Ideas for autoscaler implementations
- REST Interface
  - Set desired parallelism via REST call to JobManager
  - Either for the entire job (and let the JM decide on per-operator parallelism) or per operator
- User Code + provided autoscaling strategies
  - User provides Flink with custom scaling logic that has access to metrics
  - Problem: we want to avoid user code on the JobManager
- JobGraph configuration
  - Users configure min, target, max parallelism per operator
15. Closing remarks
- Autoscaling with Flink is possible today: it’s called “Reactive Mode” :-)
- Getting started guide: https://flink.apache.org/2021/05/06/reactive-mode.html
- Limitations of Adaptive Scheduler / Reactive Mode
  - Only works with Application Mode
  - Task-local recovery not yet supported
  - Lack of good UI support (history of rescale events)
Space between actual load and # of workers == wasted resources
You want your resource allocation to be close to actual load
Rescalable state: stop with savepoint, restore
Good when scaling manually and very rarely
Reactive Mode == Kafka Streams deployment model
How does Reactive Mode work?
“Just add more hardware”
Rescaling is the same operation as recovering from a failure: restore from the latest checkpoint
Can be expensive with large state … only rescale rarely!
Example implementation in Kubernetes, the most popular deployment option of Flink at the moment
Relationship of scaling and deployment modes.
Passive deployment: manually launch the Flink components (K8s HA also works here!)
Active deployment: Flink takes care of the launch itself (mostly)
Blue line / states: interesting path
Source code:
hide empty description
skinparam monochrome false
skinparam defaultFontSize 15
[*] -> Created
Created --> Waiting : Start scheduling
state "Waiting for resources" as Waiting #lightblue
state Executing #lightblue
state Restarting #lightblue
Waiting --> Waiting : Resources are not stable yet
Waiting -[#blue,bold]-> Executing : Resources are stable
Waiting --> Finished : Cancel, suspend or not \nenough resources
Executing --> Canceling : Cancel
Executing --> Failing : Unrecoverable fault
Executing --> Finished : Suspend terminal state
Executing -[#blue,bold]-> Restarting : Recoverable fault
Restarting --> Finished : Suspend
Restarting --> Canceling : Cancel
Restarting -[#blue,bold]-> Waiting : Cancelation complete
Canceling --> Finished : Cancelation complete
Failing --> Finished : Failing complete
Finished -> [*]
https://www.planttext.com/?text=RPB1RiCW38RlF8NLOxM-m0wxLEi3h9fsw7PmYTim4OZ0JEtRpoHbB2YdHFYp_zy_zAOZe67aEtGKTJ0Z6--KEcs_OFS2-q38rAd75tPoze66ZRl2CnmP0qFKFNN9of6AB1Hi2d7n0G95duAck06CfLSLOZdlhR20WS1vcSrujWHtuaNBwurqMcsQ6nRmmJWJnQAmUtIQx1F454To7OY_h4BEfsiFd-xFx6ITYeggUddWF6LMd_yRu83cKNwNaTh_K9ZMk62otBBLtR6w-lPdIGvpii0K1kFGmfHkqoxRvqieKRHQ_yhhOYsnibj3rEkQwvWV36W_Z9R4NXsmcdr3bwGQjXnNhjI4awVv2m00
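The state diagram source above can also be read as a transition table. The sketch below transcribes its edges into a dictionary and replays the highlighted (blue) path. It assumes, following the speaker notes, that a change in the number of TaskManagers is handled like the "Recoverable fault" edge; the `run` helper and the event strings are illustrative, not Flink code.

```python
# Edges transcribed from the diagram source: state -> {event: next state}.
TRANSITIONS = {
    "Created": {"Start scheduling": "Waiting"},
    "Waiting": {
        "Resources are not stable yet": "Waiting",
        "Resources are stable": "Executing",
        "Cancel, suspend or not enough resources": "Finished",
    },
    "Executing": {
        "Cancel": "Canceling",
        "Unrecoverable fault": "Failing",
        "Suspend terminal state": "Finished",
        "Recoverable fault": "Restarting",
    },
    "Restarting": {
        "Suspend": "Finished",
        "Cancel": "Canceling",
        "Cancelation complete": "Waiting",
    },
    "Canceling": {"Cancelation complete": "Finished"},
    "Failing": {"Failing complete": "Finished"},
    "Finished": {},
}


def run(events, state="Created"):
    """Replay a list of events and return the sequence of visited states."""
    path = [state]
    for event in events:
        state = TRANSITIONS[state][event]  # KeyError means an illegal transition
        path.append(state)
    return path


# One rescale cycle along the blue path of the diagram.
rescale_cycle = run([
    "Start scheduling",
    "Resources are stable",
    "Recoverable fault",      # assumed to also cover a TaskManager being added or removed
    "Cancelation complete",
    "Resources are stable",
])
print(" -> ".join(rescale_cycle))
```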