5. 5
Section 1: Background
Oceanus: Tencent Real-Time Compute Platform
Data Distribution, Monitoring, Real-Time BI, Online Learning
1200 jobs
35 trillion messages per day
4.5 PB of data per day
6. 6
The Problem
Job submission & execution are unreliable.
The client becomes a bottleneck & poses security risks.
A flexible programmatic interface is missing.
Start with the workflow of job submission!
22. 22
Section 2: Reliability
Failure Case: Job Manager failed to start in time
[Diagram: Client on the client side; Dispatcher, JobGraphStore, JobRegistry, and Job Manager on the cluster side. The Job Manager is shown as still starting.]
23. 23
Failure Case: Job Manager crashed after job finished
[Diagram, cluster side: Dispatcher, JobGraphStore, JobRegistry, Job Manager. JobStatus is set to DONE.]
24. 24
Failure Case: Job Manager crashed after job finished
[Diagram, cluster side: Dispatcher, JobGraphStore, JobRegistry, Job Manager. The job has finished, the Job Manager crashes, and JobStatus is cleared.]
26. 26
Failure Case: Standby starts after JobStatus cleared
[Diagram, cluster side: Dispatcher, JobGraphStore, JobRegistry, Job Manager. A Job Manager is started again.]
The job is executed twice! (FLINK-11813)
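The failure sequence on these slides can be replayed with a toy model. This is a minimal sketch, not Flink code: the `ToyCluster` class, its dictionaries, and the recovery rule (a standby re-runs any stored JobGraph that has no status entry) are assumptions made purely for illustration.

```python
class ToyCluster:
    """Toy stand-in for the cluster-side state; not Flink's actual classes."""

    def __init__(self):
        self.job_graph_store = {}  # job_id -> job graph (hypothetical)
        self.job_registry = {}     # job_id -> status string (hypothetical)
        self.executions = []       # every run is recorded to expose duplicates

    def submit(self, job_id, graph):
        self.job_graph_store[job_id] = graph

    def run(self, job_id):
        self.job_registry[job_id] = "RUNNING"
        self.executions.append(job_id)
        self.job_registry[job_id] = "DONE"

    def clear_status(self, job_id):
        # Cleanup after the job finished: the status entry disappears.
        self.job_registry.pop(job_id, None)

    def standby_recover(self):
        # Assumed recovery rule: re-run stored graphs that have no status.
        for job_id in list(self.job_graph_store):
            if job_id not in self.job_registry:
                self.run(job_id)


cluster = ToyCluster()
cluster.submit("job-1", graph="toy-graph")
cluster.run("job-1")           # job finishes, status is DONE
cluster.clear_status("job-1")  # status cleared; assume the stored graph is still there
cluster.standby_recover()      # standby sees no status and runs the job again
print(cluster.executions)      # ['job-1', 'job-1']
```

In this model the duplicate run comes from treating a missing status as "not yet executed", which is why the cleared status is the dangerous step.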
27. 27
What defines a job as submitted & executed?
JobRegistry: modified by Dispatcher & Job Manager
JobGraphStore: modified by Dispatcher
(internal) JobManagerRunnerFuture: handled in Dispatcher
Many factors define the job status!
Goal: Achieve atomic job submission and execution
28. 28
How to achieve atomic job submission
• What is the sign of a successful submission?
JobGraph persisted in JobGraphStore
30. 30
How to achieve atomic job submission
• What is the sign of a successful submission?
JobGraph persisted in JobGraphStore
• What is the sign of a successful execution?
DONE in JobRegistry (only the Dispatcher modifies it; never cleared once written)
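The two signs above can be sketched in a few lines. This is an illustrative model under stated assumptions, not Flink's implementation: the `JobRegistry` and `Dispatcher` classes below are hypothetical stand-ins that only borrow the names from the slides, and the store is a plain dictionary.

```python
class JobRegistry:
    """Hypothetical registry in which DONE is terminal and never cleared."""

    def __init__(self):
        self._status = {}

    def set_status(self, job_id, status):
        if self._status.get(job_id) == "DONE":
            raise ValueError(f"{job_id} is already DONE")
        self._status[job_id] = status

    def is_done(self, job_id):
        return self._status.get(job_id) == "DONE"


class Dispatcher:
    """Hypothetical dispatcher: the only writer of the registry."""

    def __init__(self, store, registry):
        self.store = store
        self.registry = registry
        self.executions = []

    def submit(self, job_id, graph):
        # Submission counts as successful once the graph is persisted.
        self.store[job_id] = graph
        return True

    def execute(self, job_id):
        if self.registry.is_done(job_id):
            return False  # DONE is the sign of a successful execution
        self.registry.set_status(job_id, "RUNNING")
        self.executions.append(job_id)
        self.registry.set_status(job_id, "DONE")
        return True

    def recover(self):
        # A recovering dispatcher skips every job already marked DONE.
        for job_id in list(self.store):
            self.execute(job_id)


store, registry = {}, JobRegistry()
active = Dispatcher(store, registry)
active.submit("job-1", "toy-graph")
active.execute("job-1")

standby = Dispatcher(store, registry)  # takes over after a failure
standby.recover()
print(active.executions, standby.executions)
```

Because DONE is never removed, a standby that finds the JobGraph still persisted has an unambiguous signal not to run it again.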
31. 31
Successful job execution defined by DONE
[Diagram, cluster side: Dispatcher, JobGraphStore, JobRegistry, Job Manager. Step 6: set JobStatus to RUNNING. Step 7: start the Job Manager.]
41. 41
Section 3: Isolation
Application Mode: Package User Program
[Diagram, client side: the CommandLine Interface passes program metadata (ShipFiles, MainClass, Parallelism, SavepointSettings, Arguments, ...) to the Deployer.]
42. 42
Application Mode: Package User Program
[Diagram, client side: CommandLine Interface and Deployer. The user program is packaged together with its metadata (MainClass, Parallelism, SavepointSettings, Arguments, ...).]
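One way to picture the packaging step is as a small metadata record that travels with the deployment request. The sketch below is hypothetical: `ProgramMetadata`, `package_user_program`, and `unpack_user_program` are illustrative names, JSON is an arbitrary choice of encoding, and none of this reflects Flink's actual classes or wire format. The field names mirror the items listed on the slides.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ProgramMetadata:
    """Hypothetical container for the fields named on the slides."""
    main_class: str
    parallelism: int = 1
    savepoint_settings: dict = field(default_factory=dict)
    arguments: list = field(default_factory=list)
    ship_files: list = field(default_factory=list)


def package_user_program(meta: ProgramMetadata) -> str:
    # Serialize the metadata so the deployer can ship it to the cluster.
    return json.dumps(asdict(meta), sort_keys=True)


def unpack_user_program(payload: str) -> ProgramMetadata:
    # Cluster-side counterpart: rebuild the metadata from the payload.
    return ProgramMetadata(**json.loads(payload))


meta = ProgramMetadata(
    main_class="com.example.WordCount",  # made-up class name
    parallelism=4,
    arguments=["--input", "/tmp/in"],
)
payload = package_user_program(meta)
restored = unpack_user_program(payload)
print(restored == meta)  # True
```

Keeping the record small and serializable means the client only forwards a description of the program rather than doing work on it.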
50. 50
Deployment: Recap
Session Mode
User Program: execute as is
Client Perspective: cluster deployed & job submitted
High Availability: configured JobGraphStore
Job Mode
User Program: abort on execute
Client Perspective: cluster deployed with bundled job
High Availability: special JobGraphStore
Application Mode
User Program: execute as is
Client Perspective: cluster deployed & (local) job submitted
High Availability: configured JobGraphStore
53. 53
Section 4: Unification
Client Interface: The Problem
Flink does not provide a public & stable client interface.
Customized submission workflows require a programmatic interface.
Goal: Expose a unified, layered client interface!