Apache Flink is a distributed stream processing framework that allows users to process and analyze data in real time. At LinkedIn, we developed a fully managed stream processing platform on Flink, running on Kubernetes, to power hundreds of stream processing pipelines in production. This platform is the backbone for other infrastructure systems such as Search, Espresso (LinkedIn's internal document store), and feature management. We provide a rich authoring and testing environment that allows users to create, test, and deploy their streaming jobs in a self-serve fashion within minutes. Users can focus on their business logic, leaving the Flink platform to take care of management aspects such as split deployment, resource provisioning, auto-scaling, job monitoring, alerting, failure recovery, and much more. In this talk, we will introduce the overall platform architecture, highlight the unique value propositions it brings to stream processing at LinkedIn, and share the experiences and lessons we have learned.
Building a fully managed stream processing platform on Flink at scale for LinkedIn
1. Vaibhav Maheshwari
Software Engineer, Brooklin
Building a Fully Managed Stream Processing
Platform on Flink at Scale for LinkedIn
Weiqing Yang
Software Engineer
08/03/2022
Yixing Zhang
Software Engineer
Sonam Mandal
Software Engineer
4. Managed Stream SQL Processing Platform at LinkedIn
Fully managed solution: the user focuses ONLY on the app logic, and the Stream SQL team takes care of resource and app management.
User Responsibilities
● Author app logic, test and deploy
Stream Team Responsibilities
● Operational aspects: framework lib upgrades, config management, alert/failure handling, etc.
● Resource management: hardware, scaling, etc.
6. Managed Stream SQL
Vision: A platform that enables users to create stream processing pipelines within minutes and manage them easily.
Productive
• Various DSLs, extensible with customer-added UDFs
• Zero config required
• Orchestration layer to validate SQL statements, create missing resources, schemas, etc.
Smart
• Auto-scaling
• Smart alerting system
• Custom dashboards based on inputs/outputs
Reliable
• At-least-once processing semantics
• Fault tolerance and fast recovery
9. Typical Stream SQL Use Cases
Stateless Use Cases
○ Change capture views
○ Materialized views
○ Data migration
○ Re-partitioning
○ Caching with Couchbase
Stateful Use Cases
○ Aggregations
○ Windowing
○ Joins
11. Flink SQL @ LinkedIn
● Flink SQL language and connector capabilities leveraged at LinkedIn:
○ Testability
Leveraged and extended existing Flink abstractions like Catalog and Table to make Flink SQL and Table applications testable.
○ SQL language capabilities
Support more features: aggregation support, windowing, event-time support and watermarks, native support for handling complex and nested records, etc.
○ UDF support
Support: strongly typed Avro UDFs, UDAFs (user-defined aggregation functions), and UDTFs (user-defined table functions).
○ Table API and SQL support
Support both Table API and SQL-style Flink applications.
● Kubernetes for compute and GPFS for storage
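To make the windowing, event-time, and watermark capabilities concrete, here is a toy, pure-Python sketch of tumbling-window event-time counting with a trailing watermark. This is our own illustration of the semantics, not Flink code; the function name and the simplifications (windows fire as events arrive, late events for fired windows are dropped) are ours:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_ms, max_out_of_orderness_ms):
    """Toy event-time tumbling-window counter.

    `events` is an iterable of (event_time_ms, key) pairs. A watermark
    trails the maximum event time seen by `max_out_of_orderness_ms`;
    a window fires once the watermark passes its end, and late events
    for already-fired windows are dropped.
    """
    open_windows = defaultdict(lambda: defaultdict(int))  # window_start -> key -> count
    fired = {}                                            # window_start -> {key: count}
    watermark = float("-inf")
    for ts, key in events:
        watermark = max(watermark, ts - max_out_of_orderness_ms)
        start = (ts // window_ms) * window_ms
        if start in fired:
            continue  # late event for an already-fired window: dropped
        open_windows[start][key] += 1
        # fire every open window whose end the watermark has passed
        for s in sorted(w for w in open_windows if w + window_ms <= watermark):
            fired[s] = dict(open_windows.pop(s))
    # end of input: fire the remaining open windows
    for s, counts in open_windows.items():
        fired[s] = dict(counts)
    return fired
```

In a real Flink SQL job the same behavior would be expressed declaratively with a window function and a watermark definition on the source table; this sketch only shows what the engine does underneath.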
16. Managed Stream Processing With Flink SQL: Integrations with the LinkedIn Stack
Control Plane
● Built Flink cluster and Flink job Kubernetes operators to manage the Flink resources and job lifecycle in a Kubernetes environment.
● Integrated with the unified control plane orchestrator (CPO) to ease the deployment and management of streaming jobs across both Flink and Samza.
● Integrated with auto-sizing (via CPO) to enable scaling Flink jobs as resource requirements change, without requiring manual intervention (WIP).
Data Plane
● Extended Flink with multiple source / sink connectors:
○ Brooklin (change-capture streams)
○ LiKafka
○ Espresso (document DB)
○ Venice (KV store)
○ Couchbase
● Integrated with the LinkedIn stack for config and dependency management.
● Integrated with monitoring, alerting (via CPO) and logging infrastructure to store application logs in Azure Data Explorer for ease of debuggability.
17. Flink SQL Architecture
[Diagram: the Control Plane Orchestrator (CPO), driven by a UI and an auto-sizer, manages namespaced Flink Job and Flink Cluster operators in a Kubernetes cluster; Flink jobs read from and write to Venice, Couchbase, Brooklin, and Kafka.]
Control flow:
1. Issue the deploy / undeploy / auto-sizing change of the Flink job (via the UI or the auto-sizer).
2. Submit / update the Flink Job CR.
3. Delete the existing Flink cluster & submit a new CR for the Flink cluster.
4. Create the Flink cluster components.
5. Submit the Flink SQL job to the Job Manager.
6. Flink connectors in the Flink job process the inputs.
7. Flink connectors in the Flink job write out the outputs.
Notes:
● 1-to-1 mapping between Flink job and Flink cluster.
● Each Flink job can have multiple SQL statements.
20. Authoring and Testing
● Provide templates for users to easily author Flink SQL jobs via the use of the following APIs:
○ SQL statements
○ Table API
● Local developer testing can be performed prior to deploying streaming applications to staging / production:
○ Users can unit-test via Datagen (extended to add support for lookup sources and complex Avro schema data generation) and FsCatalog (our custom catalog to provide I/O abstractions).
○ Users can start a local Flink cluster and use either the Flink SQL shell, or a LinkedIn version of /bin/flink, to run their application locally for testing.
○ Local testing runs against the staging environment.
● Canary support allows users to test updates to their application in parallel with a version already running in staging / production, preventing downtime due to bugs / issues.
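As an illustration of Datagen-style unit testing, the sketch below generates seeded random records for a toy nested schema, so a transformation can be exercised against reproducible synthetic input. It is a stand-in we wrote for this writeup, not LinkedIn's Datagen or FsCatalog:

```python
import random

def generate_records(schema, n, seed=42):
    """Generate `n` random records for a toy nested schema.

    `schema` maps field names to 'int', 'string', or a nested dict
    (standing in for a nested Avro record type).
    """
    rng = random.Random(seed)  # seeded so unit tests are reproducible

    def gen_value(spec):
        if spec == "int":
            return rng.randint(0, 100)
        if spec == "string":
            return "".join(rng.choice("abcdef") for _ in range(5))
        if isinstance(spec, dict):  # nested record
            return {f: gen_value(s) for f, s in spec.items()}
        raise ValueError(f"unsupported type: {spec!r}")

    return [{f: gen_value(s) for f, s in schema.items()} for _ in range(n)]
```

A unit test would feed the generated records through the job's transformation logic and assert on the output, without touching any real Kafka topic or database.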
23. Deployment with CPO
Control Plane Orchestrator (CPO) is a new service we have built at LinkedIn for managing and deploying stream processing jobs, including Flink jobs and Samza jobs.
New Flink job registration workflow:
1. The user creates a job via the CLI.
2. The CLI sends the request to CPO.
Flink job deployment workflow:
1. The user deploys / undeploys / rolls back a job via the Web UI or the CPO CLI.
2. The request is sent to CPO.
3. CPO submits the job to the Kubernetes cluster.
24. Control Plane Orchestrator (CPO)
[Diagram: the user-facing CPO layer (dashboard, UI, CLI) sits on a CPO backend with a metadata store, which integrates with internal services and tools (auto-sizing, Kafka, …) and with runtime environments (YARN, K8s).]
● Connected components
○ Job lifecycle notifications
○ Monitoring and alerting
○ Split deployment
○ Auto-sizing* (WIP)
● Future features
○ Framework version management
○ Auto-creation of I/O resources
* Auto-sizing for stream processing applications at LinkedIn
25. Flink Job Operator and Flink Cluster Operator
Flink Job Operator: manages the lifecycle of Flink SQL jobs on Kubernetes
● Flink cluster creation via the Flink Cluster Operator
● Job deployment / undeployment / update / upgrade via the Flink REST APIs
● Savepoint and checkpoint management via the Flink REST APIs
● Storage integrations for savepoints and checkpoints
● Job deletion handling
Flink Cluster Operator: manages Flink cluster resources on Kubernetes
● Job Manager Service
● Job Manager Deployment
● Task Manager Deployment
● ConfigMaps
● Security management (application certificates)
● Storage integrations for cluster-level data
● Cluster deletion handling
We have also built a CI/CD pipeline to test both operators and guarantee good code quality.
26. Resource Ownership Overview
Resource ownership among Kubernetes resources via owner references:
● FlinkJob owns the FlinkCluster (etc.)
● FlinkCluster owns the ConfigMap, Task Manager Deployment, Job Manager Deployment, and Job Manager Service
28. Deployment Workflow
New Flink job creation workflow (trigger: null status, state RUNNING / STOPPED):
Flink New Job Creating → Flink Cluster Creating → Job JAR Uploading → Flink Job Submitting → Flink Job Submitted → Flink Job Running (state RUNNING) or Flink Job Stopped (state STOPPED). A running job moves to Flink Job Stopped when the job finishes.
Flink job update / undeploy workflow (trigger: spec changed, state RUNNING / STOPPED):
Flink Job Updating / Undeploying →
● Savepoint enabled (or job finished): Flink Job Savepointing → Flink Cluster Deleting
● Savepoint disabled: Flink Job Canceling → Flink Cluster Deleting
Then:
● Updating (state RUNNING): Flink Cluster Creating → Job JAR Uploading → Flink Job Submitting → Flink Job Submitted → Flink Job Running
● Undeploying (state STOPPED): Flink Job Stopped
Job submission, cancellation, and savepointing are performed via the Flink REST API.
29. Failure Recovery
● Today we rely on Kubernetes to restart pods that die for the Job Manager and Task Manager deployments. We plan to add health-monitoring capabilities for such resources in the future.
● Flink provides fault tolerance for jobs via the use of job retry configs for job errors.
● If a deploy or undeploy fails, manual intervention is needed to restart or stop the Flink job via CPO.
● Provide both checkpoints and savepoints for state restoration in job restart / upgrade / update scenarios.
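The "job retry configs" above refer to Flink's restart strategies, e.g. a fixed-delay strategy that re-runs a failed job a bounded number of times with a pause between attempts. The toy Python below mimics that policy in plain code; it is an illustration of the idea, not Flink's implementation:

```python
def run_with_fixed_delay_restarts(task, max_attempts, delay_s, sleep=None):
    """Re-run `task` up to `max_attempts` times, waiting `delay_s` seconds
    between attempts, mimicking a fixed-delay restart strategy."""
    import time
    sleep = sleep or time.sleep  # injectable so tests need not really wait
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # retries exhausted: the job fails permanently
            sleep(delay_s)
```

In Flink itself the equivalent knobs live in the job configuration (number of restart attempts and the delay between them) rather than in user code.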
34. When to Update a Flink Cluster to Ready
● All the child K8s resources (dependents) are Ready.
● Ask the Job Manager service for the cluster overview and validate that the response returned is OK. Then, validate the result of this response:
○ Validate that "taskmanagers" matches the expected number of Task Manager replicas.
○ Validate that "slots-total" matches (taskSlots × expected number of Task Managers).
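The readiness validation can be sketched as a simple predicate over the parsed cluster-overview response (Flink's REST overview does report `taskmanagers` and `slots-total`; the surrounding function is our own toy version, not the operator's code):

```python
def cluster_is_ready(overview, expected_taskmanagers, task_slots_per_tm,
                     dependents_ready=True):
    """Decide whether a Flink cluster should be marked Ready.

    `overview` is the parsed JSON body of the Job Manager's cluster
    overview. We require all child K8s resources to be Ready, the TM
    count to match the expected replicas, and the total slot count to
    equal taskSlots * expected Task Managers.
    """
    if not dependents_ready:
        return False
    return (overview.get("taskmanagers") == expected_taskmanagers
            and overview.get("slots-total")
                == task_slots_per_tm * expected_taskmanagers)
```

Checking both counts guards against the window where TM pods exist but have not yet registered all their slots with the Job Manager.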
36. High-Level Architecture: Flink Cluster K8s Components and Interactions
[Diagram] Each Flink cluster consists of:
● A Flink Job Manager Deployment and a Flink Task Manager Deployment, each pod with an init container, an identity service (managing the lifecycle of app certificates), a metrics sidecar that scrapes metrics to the monitoring service, and a GPFS mount.
● A Flink Job Manager Service, used for Flink job submission / status checks / cancellation / savepoints via REST.
● ConfigMaps: the Flink ConfigMap plus Job Manager and Task Manager scrape ConfigMaps.
● GPFS (one per K8s cluster), which stores checkpoints / savepoints / logs*.
An automated metrics-service creates dashboards / alerts from the monitoring service.
* We are working on integrating with a centralized logging platform for application logs at LinkedIn, backed by Microsoft Azure Data Explorer (Kusto).
38. Use Cases
● The managed stream processing platform is the backbone for other infra systems such as Search, Espresso (internal document store), and feature management.
● The first production use cases (~60 applications) on Flink have been deployed and are serving production traffic at LinkedIn.
39. Use Case: Search Infra
Provide search capabilities in a fully hosted, self-serve, cloud-based fashion. Customers often require indexing and searching of joined or transformed records.
● Joins refer to joining records from different database tables.
● Transformations refer to various operations on / changes to the records.
40. Search Infra: Joins and Transformations Support
[Diagram: Flink SQL join jobs perform the joins and transformations in the search indexing pipeline.]
* Rest.li: A framework for building RESTful architectures at scale
* Brooklin: change-capture streams
* Couchbase: a highly scalable, distributed data store
41. Use Case: Espresso Couchbase Caching
[Diagram] Client applications issue writes (e.g., PUT / UPDATE / DELETE on DB/Table/a/b/c) to Espresso, which caches data in Couchbase on reads. Brooklin provides change capture from Espresso, and Stream SQL apps consume those streams to expire cached data on change, including hierarchical deletes (DELETE DB/Table/a/b/c, DELETE DB/Table/a/b, DELETE DB/Table/a). Couchbase expires cached data with a TTL. Nuage provisions buckets and streams, evolves the Espresso schema, and lets a client admin configure the store cache with an appropriate TTL (adding a cache to an Espresso DB/Table with a TTL).
Stream SQL apps: cache invalidation.
* Nuage: a Data Systems Management platform
* Brooklin: change-capture streams
* Couchbase: a highly scalable, distributed data store
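The cache-invalidation logic in this pipeline boils down to: for each change-capture event, evict the affected key, and for a delete of a parent path, evict everything cached under it. A toy in-memory version (our own sketch of the idea, not the production Stream SQL app):

```python
def invalidate(cache, event):
    """Apply one change-capture event to an in-memory cache.

    `event` is (op, key) where key is a path like 'DB/Table/a/b/c'.
    UPDATEs invalidate the single key; DELETEs also invalidate every
    cached key nested under the deleted path.
    """
    op, key = event
    cache.pop(key, None)
    if op == "DELETE":
        prefix = key + "/"  # path-aware: 'DB/Table/a' must not match 'DB/Table/ab'
        for k in [k for k in cache if k.startswith(prefix)]:
            del cache[k]

cache = {"DB/Table/a/b/c": "v1", "DB/Table/a/b/d": "v2", "DB/Table/x": "v3"}
invalidate(cache, ("UPDATE", "DB/Table/a/b/c"))  # evicts only that key
invalidate(cache, ("DELETE", "DB/Table/a"))      # evicts the subtree under a
```

The trailing-slash prefix check is what makes the hierarchical DELETE cascade in the diagram (DB/Table/a/b/c, then DB/Table/a/b, then DB/Table/a) safe to apply in any order.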
43. Lessons Learned (1)
● Invest in testability
○ Make Flink SQL jobs easily testable
■ Unit tests via Datagen and FsCatalog (our custom catalog to provide I/O abstractions)
■ Local dev testing: start a local Flink cluster and test apps locally against the staging environment
○ Vet newer apps before promoting to prod, e.g. via canary support
● Build proactive platforms
○ App validation support during authoring
○ Detect issues during provisioning
● Offer more authoring language options
○ Users can choose SQL, Java (via the Table API), or a hybrid
46. Lessons Learned (2)
● Automate dashboard generation and alert setup
○ Custom dashboards based on inputs/outputs
○ Having the Stream SQL team receive all the alerts causes on-call overload, so build a smart alerting system that routes alerts caused by user logic errors directly to users
● Invest in auto-scaling and auto-remediation
○ Aggressive scale-downs cause lag
○ Aggressive scale-ups cause capacity crunches
● Build a reaper service
○ Garbage-collect unused user applications
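A reaper can be as simple as a periodic sweep that garbage-collects applications idle for longer than a threshold. A toy sketch of that sweep (the function name, data shape, and threshold semantics are our own illustration):

```python
def reap_idle_apps(apps, now, max_idle_s):
    """Return (kept, reaped) app names given last-activity timestamps.

    `apps` maps app name -> last activity time (epoch seconds); apps
    idle for longer than `max_idle_s` are reaped.
    """
    reaped = sorted(name for name, last in apps.items()
                    if now - last > max_idle_s)
    kept = sorted(set(apps) - set(reaped))
    return kept, reaped
```

A production reaper would of course notify owners and undeploy through CPO rather than deleting directly, but the selection logic is this simple at its core.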
53. Terminal State Handling
Handling a spec change when the FlinkJob is in a STOPPED state:
● No change, or update with state STOPPED: the job remains in Flink Job Stopped.
● Update with state RUNNING: Flink Job Stopped → Flink New Job Creating → remaining new-job creation steps.
Handling a spec change when the FlinkJob is in the RUNNING state:
● No change: the job remains in Flink Job Running.
● Update with state RUNNING: Flink Job Running → Flink Job Updating → remaining job update steps.
● Update with state STOPPED: Flink Job Running → Flink Job Undeploying → remaining undeploy steps.
Handling a spec change when the FlinkJob is in a FAILED state:
● Job running, state RUNNING: Flink Job Failed → Flink Job Updating → remaining job update steps.
● Job running, state STOPPED: Flink Job Failed → Flink Job Undeploying → remaining undeploy steps.
● Exception, or no job / cluster inaccessible: Flink Job Failed → Flink Cluster Deleting → remaining cluster deletion steps.
All transitions are driven via the Flink REST API.