Apache Flink is a distributed stream processing framework that allows users to process and analyze data in real time. At LinkedIn, we developed a fully managed stream processing platform on Flink, running on Kubernetes, to power hundreds of stream processing pipelines in production. This platform is the backbone for other infrastructure systems such as Search, Espresso (LinkedIn's internal document store), and feature management. We provide a rich authoring and testing environment that allows users to create, test, and deploy their streaming jobs in a self-serve fashion within minutes. Users can focus on their business logic, leaving the Flink platform to take care of management aspects such as split deployment, resource provisioning, auto-scaling, job monitoring, alerting, failure recovery, and much more. In this talk, we will introduce the overall platform architecture, highlight the unique value propositions it brings to stream processing at LinkedIn, and share the experiences and lessons we have learned.
Building a fully managed stream processing platform on Flink at scale for LinkedIn
1. Vaibhav Maheshwari
Software Engineer, Brooklin
Building a Fully Managed Stream Processing
Platform on Flink at Scale for LinkedIn
Weiqing Yang
Software Engineer
08/03/2022
Yixing Zhang
Software Engineer
Sonam Mandal
Software Engineer
4. Managed Stream SQL Processing Platform at LinkedIn
Fully managed solution: the user focuses ONLY on the app logic, and the Stream SQL team takes care of resource and app management.
User Responsibilities
● Author app logic, test and deploy
Stream Team Responsibilities
● Operational aspects: framework lib upgrades, config management, alert/failure handling, etc.
● Resource management: hardware, scaling, etc.
6. Managed Stream SQL
Vision: A platform that enables users to create stream processing pipelines within minutes and manage them easily.
Productive
• Various DSLs, extensible with customer-added UDFs
• Zero config required
• Orchestration layer to validate SQL statements, create missing resources, schemas, etc.
Smart
• Auto-scaling
• Smart alerting system
• Custom dashboards based on inputs/outputs
Reliable
• At-least-once processing semantics
• Fault tolerance and fast recovery
9. Typical Stream SQL Use Cases
Stateless Use Cases
○ Change capture views
○ Materialized views
○ Data migration
○ Re-partitioning
○ Caching with Couchbase
Stateful Use Cases
○ Aggregations
○ Windowing
○ Joins
11. Flink SQL @ LinkedIn
● Flink SQL language and connector capabilities leveraged at LinkedIn:
○ Testability
Leveraged and extended existing Flink abstractions like Catalog and Table to make Flink SQL and Table applications testable.
○ SQL language capabilities
Support more features: aggregation support, windowing, event-time support and watermarks, native support for handling complex and nested records, etc.
○ UDF support
Support: strongly typed Avro UDFs, UDAFs (user-defined aggregation functions), and UDTFs (user-defined table functions).
○ Table API and SQL support
Support both Table API and SQL-style Flink applications.
● Kubernetes for compute and GPFS for storage
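To make the windowing, event-time, and watermark capabilities concrete, here is a toy, pure-Python sketch of tumbling-window event-time counting with a trailing watermark. This is our own illustration of the semantics, not Flink code; the function name and the simplifications (windows fire as events arrive, late events for fired windows are dropped) are ours:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_ms, max_out_of_orderness_ms):
    """Toy event-time tumbling-window counter.

    `events` is an iterable of (event_time_ms, key) pairs. A watermark
    trails the maximum event time seen by `max_out_of_orderness_ms`;
    a window fires once the watermark passes its end, and late events
    for already-fired windows are dropped.
    """
    open_windows = defaultdict(lambda: defaultdict(int))  # window_start -> key -> count
    fired = {}                                            # window_start -> {key: count}
    watermark = float("-inf")
    for ts, key in events:
        watermark = max(watermark, ts - max_out_of_orderness_ms)
        start = (ts // window_ms) * window_ms
        if start in fired:
            continue  # late event for an already-fired window: dropped
        open_windows[start][key] += 1
        # fire every open window whose end the watermark has passed
        for s in sorted(w for w in open_windows if w + window_ms <= watermark):
            fired[s] = dict(open_windows.pop(s))
    # end of input: fire the remaining open windows
    for s, counts in open_windows.items():
        fired[s] = dict(counts)
    return fired
```

In a real Flink SQL job the same behavior would be expressed declaratively with a window function and a watermark definition on the source table; this sketch only shows what the engine does underneath.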
16. Managed Stream Processing With Flink SQL: Integrations with the LinkedIn Stack
Control Plane
● Built Flink cluster and Flink job Kubernetes operators to manage the Flink resources and job lifecycle in a Kubernetes environment.
● Integrated with the unified control plane orchestrator (CPO) to ease the deployment and management of streaming jobs across both Flink and Samza.
● Integrated with auto-sizing (via CPO) to enable scaling Flink jobs as resource requirements change, without requiring manual intervention (WIP).
Data Plane
● Extended Flink with multiple source / sink connectors:
○ Brooklin (change-capture streams)
○ LiKafka
○ Espresso (document DB)
○ Venice (KV store)
○ Couchbase
● Integrated with the LinkedIn stack for config and dependency management.
● Integrated with monitoring, alerting (via CPO) and logging infrastructure to store application logs in Azure Data Explorer for ease of debuggability.
17. Flink SQL Architecture
[Diagram: the Control Plane Orchestrator (CPO), driven by a UI and an auto-sizer, manages namespaced Flink Job and Flink Cluster operators in a Kubernetes cluster; Flink jobs read from and write to Venice, Couchbase, Brooklin, and Kafka.]
Control flow:
1. Issue the deploy / undeploy / auto-sizing change of the Flink job (via the UI or the auto-sizer).
2. Submit / update the Flink Job CR.
3. Delete the existing Flink cluster & submit a new CR for the Flink cluster.
4. Create the Flink cluster components.
5. Submit the Flink SQL job to the Job Manager.
6. Flink connectors in the Flink job process the inputs.
7. Flink connectors in the Flink job write out the outputs.
Notes:
● 1-to-1 mapping between Flink job and Flink cluster.
● Each Flink job can have multiple SQL statements.
20. Authoring and Testing
● Provide templates for users to easily author Flink SQL jobs via the use of the following APIs:
○ SQL statements
○ Table API
● Local developer testing can be performed prior to deploying streaming applications to staging / production:
○ Users can unit-test via Datagen (extended to add support for lookup sources and complex Avro schema data generation) and FsCatalog (our custom catalog to provide I/O abstractions).
○ Users can start a local Flink cluster and use either the Flink SQL shell, or a LinkedIn version of /bin/flink, to run their application locally for testing.
○ Local testing runs against the staging environment.
● Canary support allows users to test updates to their application in parallel with a version already running in staging / production, preventing downtime due to bugs / issues.
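As an illustration of Datagen-style unit testing, the sketch below generates seeded random records for a toy nested schema, so a transformation can be exercised against reproducible synthetic input. It is a stand-in we wrote for this writeup, not LinkedIn's Datagen or FsCatalog:

```python
import random

def generate_records(schema, n, seed=42):
    """Generate `n` random records for a toy nested schema.

    `schema` maps field names to 'int', 'string', or a nested dict
    (standing in for a nested Avro record type).
    """
    rng = random.Random(seed)  # seeded so unit tests are reproducible

    def gen_value(spec):
        if spec == "int":
            return rng.randint(0, 100)
        if spec == "string":
            return "".join(rng.choice("abcdef") for _ in range(5))
        if isinstance(spec, dict):  # nested record
            return {f: gen_value(s) for f, s in spec.items()}
        raise ValueError(f"unsupported type: {spec!r}")

    return [{f: gen_value(s) for f, s in schema.items()} for _ in range(n)]
```

A unit test would feed the generated records through the job's transformation logic and assert on the output, without touching any real Kafka topic or database.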
23. Deployment with CPO
Control Plane Orchestrator (CPO) is a new service we have built at LinkedIn for managing and deploying stream processing jobs, including Flink jobs and Samza jobs.
New Flink job registration workflow:
1. The user creates a job via the CLI.
2. The CLI sends the request to CPO.
Flink job deployment workflow:
1. The user deploys / undeploys / rolls back a job via the Web UI or the CPO CLI.
2. The request is sent to CPO.
3. CPO submits the job to the Kubernetes cluster.
24. Control Plane Orchestrator (CPO)
[Diagram: the user-facing CPO layer (dashboard, UI, CLI) sits on a CPO backend with a metadata store, which integrates with internal services and tools (auto-sizing, Kafka, …) and with runtime environments (YARN, K8s).]
● Connected components
○ Job lifecycle notifications
○ Monitoring and alerting
○ Split deployment
○ Auto-sizing* (WIP)
● Future features
○ Framework version management
○ Auto-creation of I/O resources
* Auto-sizing for stream processing applications at LinkedIn
25. Flink Job Operator and Flink Cluster Operator
Flink Job Operator: manages the lifecycle of Flink SQL jobs on Kubernetes
● Flink cluster creation via the Flink Cluster Operator
● Job deployment / undeployment / update / upgrade via the Flink REST APIs
● Savepoint and checkpoint management via the Flink REST APIs
● Storage integrations for savepoints and checkpoints
● Job deletion handling
Flink Cluster Operator: manages Flink cluster resources on Kubernetes
● Job Manager Service
● Job Manager Deployment
● Task Manager Deployment
● ConfigMaps
● Security management (application certificates)
● Storage integrations for cluster-level data
● Cluster deletion handling
We have also built a CI/CD pipeline to test both operators and guarantee good code quality.
26. Resource Ownership Overview
Resource ownership among Kubernetes resources via owner references:
● FlinkJob owns the FlinkCluster (etc.)
● FlinkCluster owns the ConfigMap, Task Manager Deployment, Job Manager Deployment, and Job Manager Service
28. Deployment Workflow
New Flink job creation workflow (trigger: null status, state RUNNING / STOPPED):
Flink New Job Creating → Flink Cluster Creating → Job JAR Uploading → Flink Job Submitting → Flink Job Submitted → Flink Job Running (state RUNNING) or Flink Job Stopped (state STOPPED). A running job moves to Flink Job Stopped when the job finishes.
Flink job update / undeploy workflow (trigger: spec changed, state RUNNING / STOPPED):
Flink Job Updating / Undeploying →
● Savepoint enabled (or job finished): Flink Job Savepointing → Flink Cluster Deleting
● Savepoint disabled: Flink Job Canceling → Flink Cluster Deleting
Then:
● Updating (state RUNNING): Flink Cluster Creating → Job JAR Uploading → Flink Job Submitting → Flink Job Submitted → Flink Job Running
● Undeploying (state STOPPED): Flink Job Stopped
Job submission, cancellation, and savepointing are performed via the Flink REST API.
29. Failure Recovery
● Today we rely on Kubernetes to restart pods that die for the Job Manager and Task Manager deployments. We plan to add health-monitoring capabilities for such resources in the future.
● Flink provides fault tolerance for jobs via the use of job retry configs for job errors.
● If a deploy or undeploy fails, manual intervention is needed to restart or stop the Flink job via CPO.
● Provide both checkpoints and savepoints for state restoration in job restart / upgrade / update scenarios.
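The "job retry configs" above refer to Flink's restart strategies, e.g. a fixed-delay strategy that re-runs a failed job a bounded number of times with a pause between attempts. The toy Python below mimics that policy in plain code; it is an illustration of the idea, not Flink's implementation:

```python
def run_with_fixed_delay_restarts(task, max_attempts, delay_s, sleep=None):
    """Re-run `task` up to `max_attempts` times, waiting `delay_s` seconds
    between attempts, mimicking a fixed-delay restart strategy."""
    import time
    sleep = sleep or time.sleep  # injectable so tests need not really wait
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # retries exhausted: the job fails permanently
            sleep(delay_s)
```

In Flink itself the equivalent knobs live in the job configuration (number of restart attempts and the delay between them) rather than in user code.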
34. When to Update a Flink Cluster to Ready
● All the child K8s resources (dependents) are Ready.
● Ask the Job Manager service for the cluster overview and validate that the response returned is OK. Then, validate the result of this response:
○ Validate that "taskmanagers" matches the expected number of Task Manager replicas.
○ Validate that "slots-total" matches (taskSlots × expected number of Task Managers).
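The readiness validation can be sketched as a simple predicate over the parsed cluster-overview response (Flink's REST overview does report `taskmanagers` and `slots-total`; the surrounding function is our own toy version, not the operator's code):

```python
def cluster_is_ready(overview, expected_taskmanagers, task_slots_per_tm,
                     dependents_ready=True):
    """Decide whether a Flink cluster should be marked Ready.

    `overview` is the parsed JSON body of the Job Manager's cluster
    overview. We require all child K8s resources to be Ready, the TM
    count to match the expected replicas, and the total slot count to
    equal taskSlots * expected Task Managers.
    """
    if not dependents_ready:
        return False
    return (overview.get("taskmanagers") == expected_taskmanagers
            and overview.get("slots-total")
                == task_slots_per_tm * expected_taskmanagers)
```

Checking both counts guards against the window where TM pods exist but have not yet registered all their slots with the Job Manager.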
36. High-Level Architecture: Flink Cluster K8s Components and Interactions
[Diagram] Each Flink cluster consists of:
● A Flink Job Manager Deployment and a Flink Task Manager Deployment, each pod with an init container, an identity service (managing the lifecycle of app certificates), a metrics sidecar that scrapes metrics to the monitoring service, and a GPFS mount.
● A Flink Job Manager Service, used for Flink job submission / status checks / cancellation / savepoints via REST.
● ConfigMaps: the Flink ConfigMap plus Job Manager and Task Manager scrape ConfigMaps.
● GPFS (one per K8s cluster), which stores checkpoints / savepoints / logs*.
An automated metrics-service creates dashboards / alerts from the monitoring service.
* We are working on integrating with a centralized logging platform for application logs at LinkedIn, backed by Microsoft Azure Data Explorer (Kusto).
38. Use Cases
● The managed stream processing platform is the backbone for other infra systems such as Search, Espresso (internal document store), and feature management.
● The first production use cases (~60 applications) on Flink have been deployed and are serving production traffic at LinkedIn.
39. Use Case: Search Infra
Provide search capabilities in a fully hosted, self-serve, cloud-based fashion. Customers often require indexing and searching of joined or transformed records.
● Joins refer to joining records from different database tables.
● Transformations refer to various operations on / changes to the records.
40. Search Infra: Joins and Transformations Support
[Diagram: Flink SQL join jobs perform the joins and transformations in the search indexing pipeline.]
* Rest.li: A framework for building RESTful architectures at scale
* Brooklin: change-capture streams
* Couchbase: a highly scalable, distributed data store
41. Use Case: Espresso Couchbase Caching
[Diagram] Client applications issue writes (e.g., PUT / UPDATE / DELETE on DB/Table/a/b/c) to Espresso, which caches data in Couchbase on reads. Brooklin provides change capture from Espresso, and Stream SQL apps consume those streams to expire cached data on change, including hierarchical deletes (DELETE DB/Table/a/b/c, DELETE DB/Table/a/b, DELETE DB/Table/a). Couchbase expires cached data with a TTL. Nuage provisions buckets and streams, evolves the Espresso schema, and lets a client admin configure the store cache with an appropriate TTL (adding a cache to an Espresso DB/Table with a TTL).
Stream SQL apps: cache invalidation.
* Nuage: a Data Systems Management platform
* Brooklin: change-capture streams
* Couchbase: a highly scalable, distributed data store
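The cache-invalidation logic in this pipeline boils down to: for each change-capture event, evict the affected key, and for a delete of a parent path, evict everything cached under it. A toy in-memory version (our own sketch of the idea, not the production Stream SQL app):

```python
def invalidate(cache, event):
    """Apply one change-capture event to an in-memory cache.

    `event` is (op, key) where key is a path like 'DB/Table/a/b/c'.
    UPDATEs invalidate the single key; DELETEs also invalidate every
    cached key nested under the deleted path.
    """
    op, key = event
    cache.pop(key, None)
    if op == "DELETE":
        prefix = key + "/"  # path-aware: 'DB/Table/a' must not match 'DB/Table/ab'
        for k in [k for k in cache if k.startswith(prefix)]:
            del cache[k]

cache = {"DB/Table/a/b/c": "v1", "DB/Table/a/b/d": "v2", "DB/Table/x": "v3"}
invalidate(cache, ("UPDATE", "DB/Table/a/b/c"))  # evicts only that key
invalidate(cache, ("DELETE", "DB/Table/a"))      # evicts the subtree under a
```

The trailing-slash prefix check is what makes the hierarchical DELETE cascade in the diagram (DB/Table/a/b/c, then DB/Table/a/b, then DB/Table/a) safe to apply in any order.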
43. Lessons Learned (1)
● Invest in testability
○ Make Flink SQL jobs easily testable
■ Unit tests via Datagen and FsCatalog (our custom catalog to provide I/O abstractions)
■ Local dev testing: start a local Flink cluster and test apps locally against the staging environment
○ Vet newer apps before promoting to prod, e.g. via canary support
● Build proactive platforms
○ App validation support during authoring
○ Detect issues during provisioning
● Offer more authoring language options
○ Users can choose SQL, Java (via the Table API), or a hybrid
46. Lessons Learned (2)
● Automate dashboard generation and alert setup
○ Custom dashboards based on inputs/outputs
○ Having the Stream SQL team receive all the alerts causes on-call overload, so build a smart alerting system that routes alerts caused by user logic errors directly to users
● Invest in auto-scaling and auto-remediation
○ Aggressive scale-downs cause lag
○ Aggressive scale-ups cause capacity crunches
● Build a reaper service
○ Garbage-collect unused user applications
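A reaper can be as simple as a periodic sweep that garbage-collects applications idle for longer than a threshold. A toy sketch of that sweep (the function name, data shape, and threshold semantics are our own illustration):

```python
def reap_idle_apps(apps, now, max_idle_s):
    """Return (kept, reaped) app names given last-activity timestamps.

    `apps` maps app name -> last activity time (epoch seconds); apps
    idle for longer than `max_idle_s` are reaped.
    """
    reaped = sorted(name for name, last in apps.items()
                    if now - last > max_idle_s)
    kept = sorted(set(apps) - set(reaped))
    return kept, reaped
```

A production reaper would of course notify owners and undeploy through CPO rather than deleting directly, but the selection logic is this simple at its core.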
53. Terminal State Handling
Handling a spec change when the FlinkJob is in a STOPPED state:
● No change, or update with state STOPPED: the job remains in Flink Job Stopped.
● Update with state RUNNING: Flink Job Stopped → Flink New Job Creating → remaining new-job creation steps.
Handling a spec change when the FlinkJob is in the RUNNING state:
● No change: the job remains in Flink Job Running.
● Update with state RUNNING: Flink Job Running → Flink Job Updating → remaining job update steps.
● Update with state STOPPED: Flink Job Running → Flink Job Undeploying → remaining undeploy steps.
Handling a spec change when the FlinkJob is in a FAILED state:
● Job running, state RUNNING: Flink Job Failed → Flink Job Updating → remaining job update steps.
● Job running, state STOPPED: Flink Job Failed → Flink Job Undeploying → remaining undeploy steps.
● Exception, or no job / cluster inaccessible: Flink Job Failed → Flink Cluster Deleting → remaining cluster deletion steps.
All transitions are driven via the Flink REST API.