Vaibhav Maheshwari
Software Engineer, Brooklin
Building a Fully Managed Stream Processing
Platform on Flink at Scale for LinkedIn
Weiqing Yang
Software Engineer
08/03/2022
Yixing Zhang
Software Engineer
Sonam Mandal
Software Engineer
Agenda
1. Introduction & Background
2. Flink SQL Architecture Overview
3. Component Deep Dives
4. Use Case & Lessons Learned
Introduction & Background
Managed Stream SQL Processing Platform at LinkedIn
Fully managed solution: users focus ONLY on the app logic, and the Stream SQL team takes care of resource and app management.
User Responsibilities
● Author app logic, test and deploy
Stream Team Responsibilities
● Operational aspects: framework lib upgrades, config management, alert / failure handling, etc.
● Resource management: hardware, scaling, etc.
Managed Stream SQL
Vision: Platform that enables users to create stream processing pipelines within minutes and manage them easily.
Productive
• Various DSLs, extensible with customer-added UDFs
• Zero config required
• Orchestration layer to validate SQL statements, create missing resources, schema, etc.
Reliable
• At-least-once processing semantics
• Fault tolerance and fast recovery
Smart
• Auto-scale
• Smart alerting system
• Custom dashboard based on inputs/outputs
Typical Stream SQL Use Cases
Stateless Use Cases
○ Change Capture Views
○ Materialized Views
○ Data Migration
○ Re-partitioning
○ Caching with Couchbase
Stateful Use Cases
○ Aggregations
○ Windowing
○ Joins
Flink SQL Architecture Overview
Flink SQL @ LinkedIn
● Flink SQL language and connector capabilities leveraged at LinkedIn:
○ Testability
Leveraged and extended existing Flink abstractions like Catalog and Table to make Flink SQL and Table applications testable.
○ SQL language capabilities
Support for more features: aggregations, windowing, event time and watermarks, native handling of complex and nested records, etc.
○ UDF support
Strongly typed Avro UDFs, UDAFs (user-defined aggregation functions), UDTFs (user-defined table functions)
○ Table API and SQL support
Support for both Table API and SQL style Flink applications
● Kubernetes for compute and GPFS for storage
Managed Stream Processing With Flink SQL: Integrations with the LinkedIn Stack
Control Plane
● Built Flink cluster and Flink job Kubernetes operators to manage the Flink resources and job lifecycle in a Kubernetes environment.
● Integrated with the unified control plane orchestrator (CPO) to ease the deployment and management of streaming jobs across both Flink and Samza.
● Integrated with auto-sizing (via CPO) to enable scaling Flink jobs as resource requirements change without requiring manual intervention (WIP).
Data Plane
● Extended Flink to have multiple source / sink connectors:
○ Brooklin (change-capture streams)
○ LiKafka
○ Espresso (document DB)
○ Venice (KV store)
○ Couchbase
● Integrated with the LinkedIn stack for config and dependency management.
● Integrated with monitoring, alerting (via CPO) and logging infrastructure to store application logs in Azure Data Explorer for ease of debugging.
Flink SQL Architecture
[Diagram: a UI and the auto-sizer in front of the Control Plane Orchestrator (CPO); a Kubernetes cluster containing the Flink Job CR, the Flink Cluster CR, a namespaced Flink Job Operator, a namespaced Flink Cluster Operator, and Flink Clusters each running a Flink Job; external systems Venice, Couchbase, Brooklin and Kafka]
Control Flow
1. Issue the deploy / undeploy / auto-sizing change of the Flink Job
2. Submit / update the Flink Job
3. Delete the existing Flink cluster & submit a new CR for the Flink Cluster
4. Create the Flink Cluster components
5. Submit the Flink SQL Job to the JM
6. Flink connectors in the Flink Job process the inputs
7. Flink connectors in the Flink Job write out the outputs
Notes
● 1-to-1 mapping between Flink Job and Flink Cluster
● Each Flink Job can have multiple SQL statements
Component Deep Dives
Component Deep Dive Topics
1. Authoring and Testing
2. Job Deployment via CPO
3. Flink Job Operator
4. Flink Cluster Operator
Authoring and Testing
● Provide templates for users to easily author Flink SQL jobs using either of the following APIs:
○ SQL statements
○ Table API
● Local developer testing can be performed prior to deploying streaming applications to staging / production:
○ Users can unit test via Datagen (extended to support lookup sources and complex Avro schema data generation) and FsCatalog (our custom catalog that provides I/O abstractions)
○ Users can start a local Flink cluster and use either:
■ Flink SQL shell, or
■ a LinkedIn version of /bin/flink to run their application locally for testing.
○ Local tests run against the staging environment
● Canary support lets users test updates to their application in parallel with a version already running in staging / production, to prevent downtime due to bugs / issues.
Deployment with CPO
Control Plane Orchestrator (CPO) is a new service we have built at LinkedIn for managing and deploying stream processing jobs, including Flink jobs and Samza jobs.
New Flink Job Registration Workflow (User, CLI, CPO)
1. Create Job
2. Send Request
Flink Job Deployment Workflow (User, Web UI / CLI, CPO, Kubernetes Cluster)
1. Deploy / Undeploy / Rollback Job
2. Send Request
3. Submit Job
Control Plane Orchestrator (CPO)
● Connected Components
○ Job Lifecycle Notifications
○ Monitoring and Alerting
○ Split Deployment
○ Auto-sizing* (WIP)
● Future Features
○ Framework Version management
○ Auto-creation for I/O Resources
[Diagram: CPO user-facing layer (Dashboard, UI, CLI); CPO backend with a Metadata Store; internal services and tools (Auto-sizing, Kafka, …); runtime environments (Yarn, K8s)]
* Auto-sizing for stream processing applications at LinkedIn
Flink Job Operator
Manages the lifecycle of Flink SQL jobs on Kubernetes
● Flink Cluster creation via the Flink Cluster Operator
● Job deployment / undeployment / update / upgrade via the Flink REST APIs
● Savepoint and checkpoint management via the Flink REST APIs
● Storage integrations for savepoints and checkpoints
● Job deletion handling
Flink Cluster Operator
Manages Flink cluster resources on Kubernetes
● Job Manager Service
● Job Manager Deployment
● Task Manager Deployment
● ConfigMaps
● Security management (application certificates)
● Storage integrations for cluster level data
● Cluster deletion handling
We have also built a CI/CD pipeline to test both operators and guarantee good code quality.
Resource Ownership Overview
Resource ownership among Kubernetes resources via owner references:
● FlinkJob
○ FlinkCluster
■ Job Manager Service
■ Job Manager Deployment
■ Task Manager Deployment
■ ConfigMap
■ etc…
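As a rough illustration of how such an ownership chain can be expressed, the sketch below builds the `metadata.ownerReferences` entries as plain Python dictionaries. The field names (`apiVersion`, `kind`, `name`, `uid`, `controller`, `blockOwnerDeletion`) are the standard Kubernetes ones; the object names, the uid values, the `apiVersion` used for FlinkCluster, and the FlinkJob → FlinkCluster → children ordering are assumptions made for the example, not details confirmed by the slides.

```python
from typing import Any, Dict


def owner_reference(owner: Dict[str, Any]) -> Dict[str, Any]:
    """Build one ownerReferences entry that points at `owner`."""
    return {
        "apiVersion": owner["apiVersion"],
        "kind": owner["kind"],
        "name": owner["metadata"]["name"],
        "uid": owner["metadata"]["uid"],
        "controller": True,          # marks `owner` as the managing controller
        "blockOwnerDeletion": True,  # foreground deletion of the owner waits for this dependent
    }


def with_owner(child: Dict[str, Any], owner: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `child` whose metadata names `owner` as its owner."""
    metadata = dict(child.get("metadata", {}))
    metadata["ownerReferences"] = [owner_reference(owner)]
    return {**child, "metadata": metadata}


# Hypothetical objects: names and uid values are placeholders.
flink_job = {
    "apiVersion": "flink.k8s.org/v1alpha1",
    "kind": "FlinkJob",
    "metadata": {"name": "simple-job", "uid": "uid-job"},
}
flink_cluster = with_owner(
    {
        "apiVersion": "flink.k8s.org/v1alpha1",  # assumed to match the FlinkJob group
        "kind": "FlinkCluster",
        "metadata": {"name": "simple-job-cluster", "uid": "uid-cluster"},
    },
    flink_job,
)
config_map = with_owner(
    {"apiVersion": "v1", "kind": "ConfigMap", "metadata": {"name": "flink-config"}},
    flink_cluster,
)
```

With references like these in place, Kubernetes garbage collection can cascade a delete of the owner down to its dependents, which is presumably why the operators lean on owner references for cleanup.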
Flink Job Controller: Deployment Workflow
New Flink Job Creation Workflow
[Diagram: state machine]
● Trigger: null status, State: RUNNING / STOPPED
● States: Flink New Job Creating → Flink Cluster Creating → Job JAR Uploading → Flink Job Submitting → Flink Job Submitted → Flink Job Running
● Terminal states: Flink Job Running, Flink Job Stopped
● Transition labels: STOPPED, RUNNING, Job Finished
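One possible reading of the creation workflow, written as a small Python sketch. The state names come from the diagram; the assumption that a desired state of STOPPED moves a new job straight to "Flink Job Stopped" without creating a cluster is an interpretation of the diagram labels, and the function name is hypothetical.

```python
from enum import Enum
from typing import List


class JobStatus(Enum):
    NEW_JOB_CREATING = "Flink New Job Creating"
    CLUSTER_CREATING = "Flink Cluster Creating"
    JAR_UPLOADING = "Job JAR Uploading"
    JOB_SUBMITTING = "Flink Job Submitting"
    JOB_SUBMITTED = "Flink Job Submitted"
    JOB_RUNNING = "Flink Job Running"
    JOB_STOPPED = "Flink Job Stopped"


# Ordered path for a new job whose desired state is RUNNING.
RUNNING_PATH = [
    JobStatus.NEW_JOB_CREATING,
    JobStatus.CLUSTER_CREATING,
    JobStatus.JAR_UPLOADING,
    JobStatus.JOB_SUBMITTING,
    JobStatus.JOB_SUBMITTED,
    JobStatus.JOB_RUNNING,
]


def creation_steps(desired_state: str) -> List[JobStatus]:
    """Return the statuses a brand-new FlinkJob passes through (assumed reading)."""
    if desired_state == "RUNNING":
        return list(RUNNING_PATH)
    if desired_state == "STOPPED":
        # Assumption: nothing is deployed for a job created in the STOPPED state.
        return [JobStatus.NEW_JOB_CREATING, JobStatus.JOB_STOPPED]
    raise ValueError(f"unknown desired state: {desired_state!r}")
```

Keeping the path as data rather than as nested conditionals makes it easy to persist the current status in the custom resource and resume from it after an operator restart.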
Flink Job Update / Undeploy Workflow
[Diagram: state machine]
● Trigger: spec changed, State: RUNNING / STOPPED
● States: Flink Job Updating / Undeploying, Flink Job Savepointing, Flink Job Canceling, Flink Cluster Deleting, Flink Cluster Creating, Job JAR Uploading, Flink Job Submitting, Flink Job Submitted, Flink Job Running, Flink Job Stopped
● Transition labels: Savepoint: Enabled, Savepoint: Disabled, Job Finished, Undeploying: State STOPPED, Updating: State RUNNING
● Legend: Flink REST API
Failure Recovery
● Today we rely on Kubernetes to restart pods that die for the Job Manager and Task Manager deployments. We plan to add health monitoring capabilities for such resources in the future.
● Flink provides fault tolerance for job errors via job retry configs.
● If a deploy or undeploy fails, manual intervention is needed to restart or stop the Flink job via CPO.
● Both checkpoints and savepoints are provided for state restoration in job restart / upgrade / update scenarios.
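To make the "job retry configs" bullet concrete, here is a minimal Python sketch of fixed-delay restart semantics, similar in spirit to Flink's fixed-delay restart strategy (configured in Flink 1.x through `restart-strategy.fixed-delay.attempts` and `restart-strategy.fixed-delay.delay`). The slides do not say which restart strategy is used at LinkedIn, so this is an illustration only; `run_with_restarts` and its parameters are hypothetical names.

```python
import time
from typing import Callable


def run_with_restarts(
    run_job: Callable[[], None],
    attempts: int,
    delay_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `run_job`, restarting it up to `attempts` times after failures.

    Returns the number of restarts that were needed. Re-raises the last
    error once the restart budget is exhausted.
    """
    restarts = 0
    while True:
        try:
            run_job()
            return restarts
        except Exception:
            if restarts >= attempts:
                raise  # budget exhausted: the job is declared failed
            restarts += 1
            sleep(delay_s)  # fixed delay between restart attempts
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. Once the budget is exhausted the error propagates, which corresponds to the point where a job ends up in a failed state and needs handling outside Flink.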
Flink Cluster Controller
When to update a Flink cluster to Ready
● All the child K8s resources (dependents) are Ready
● Ask the Job Manager Service for the cluster overview and validate that the response returned is OK. Then validate the contents of the response:
○ "taskmanagers" matches the expected number of Task Manager replicas
○ "slots-total" matches (taskSlots * expected number of Task Managers)
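The readiness check above can be sketched in a few lines of Python. The `taskmanagers` and `slots-total` fields are the ones named on the slide and returned by Flink's REST `/overview` endpoint; the function names, the URL argument, and the timeout are assumptions for illustration, and the real controller is not necessarily structured this way.

```python
import json
import urllib.request
from typing import Any, Mapping


def fetch_overview(job_manager_url: str, timeout_s: float = 5.0) -> Mapping[str, Any]:
    """GET <job_manager_url>/overview and return the parsed JSON body."""
    with urllib.request.urlopen(f"{job_manager_url}/overview", timeout=timeout_s) as resp:
        if resp.status != 200:
            raise RuntimeError(f"unexpected status {resp.status}")
        return json.load(resp)


def cluster_is_ready(
    overview: Mapping[str, Any],
    expected_task_managers: int,
    task_slots: int,
) -> bool:
    """Check the two conditions from the slide against an overview response."""
    return (
        overview.get("taskmanagers") == expected_task_managers
        and overview.get("slots-total") == expected_task_managers * task_slots
    )
```

For a cluster declared with `taskManagerCount: 2` and `taskSlots: 2`, the check passes only once the overview reports 2 Task Managers and 4 total slots.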
High-Level Architecture
Flink Cluster K8s Components and Interactions
[Diagram]
● Flink Cluster: Flink Job Manager Deployment, Flink Task Manager Deployment, Flink Job Manager Service, Flink ConfigMap, Job Manager Scrape ConfigMap, Task Manager Scrape ConfigMap
● The Job Manager and Task Manager deployments each have an init container, a metrics sidecar and a GPFS mount
● GPFS (one per K8s cluster) stores checkpoints / savepoints / logs *
● Identity Service manages the lifecycle of app certificates
● Metrics are scraped to the monitoring service; an automated metrics-service creates dashboards / alerts from the monitoring service
● Flink job submission / status check / cancel / savepoints go via REST
* We are working on integrating with a centralized logging platform for application logs at LinkedIn, backed by Microsoft Azure Data Explorer (Kusto)
Use Cases & Lessons Learned
Use Cases
● The managed stream processing platform is the backbone for other infra systems such as Search, Espresso (internal document store) and feature management.
● The first production use cases (~ 60 applications) on Flink have been deployed and are serving production traffic at LinkedIn.
Use Case: Search Infra
Provide search capabilities in a fully hosted, self-serve, cloud-based fashion. Customers
often require indexing and searching of joined or transformed records.
● Joins refer to joining records from different database tables.
● Transformations refer to various operations / changes on the records.
Search Infra: Joins and Transformations Support
[Diagram: a Flink SQL join job and a Flink SQL job]
* Rest.li: A framework for building RESTful architectures at scale
* Brooklin: change-capture streams
* Couchbase: a highly scalable, distributed data store
Use Case: Espresso Couchbase Caching
[Diagram: client applications, Espresso, Brooklin change capture, Stream SQL apps (cache invalidation), Couchbase, Nuage, client admin]
● Client requests: PUT DB/Table/a/b/c, UPDATE DB/Table/a/b/c
● Couchbase: cache data on reads; expires data with TTL
● Stream SQL apps (cache invalidation): expire cached data on change, issuing DELETE DB/Table/a/b/c, DELETE DB/Table/a/b, DELETE DB/Table/a
● Client admin: add cache to Espresso DB/Table with TTL
● Other labels: provisions buckets, provisions stream, evolves Espresso schema, configures store cache with appropriate TTL
* Nuage: a Data Systems Management platform
* Brooklin: change-capture streams
* Couchbase: a highly scalable, distributed data store
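The DELETE labels in the caching diagram suggest that a change to `DB/Table/a/b/c` invalidates the cached entry for that key and for each shorter key prefix down to `DB/Table/a`. The Python sketch below illustrates that reading with an in-memory dictionary standing in for Couchbase. The prefix rule, the function names, and the two-segment `DB/Table` root are assumptions drawn from the diagram labels, not a description of the production job.

```python
from typing import Dict, List


def keys_to_invalidate(resource_path: str, root_segments: int = 2) -> List[str]:
    """List the cache keys affected by a change to `resource_path`.

    The first `root_segments` segments (database and table) are never
    invalidated on their own; every longer prefix is.
    """
    parts = resource_path.split("/")
    if len(parts) <= root_segments:
        raise ValueError(f"path has no key segments: {resource_path!r}")
    return ["/".join(parts[:end]) for end in range(len(parts), root_segments, -1)]


def invalidate(cache: Dict[str, object], resource_path: str) -> List[str]:
    """Delete affected keys from `cache` and return the ones that were present."""
    removed = []
    for key in keys_to_invalidate(resource_path):
        if key in cache:
            del cache[key]
            removed.append(key)
    return removed
```

In the real pipeline the change events would arrive from the Brooklin change-capture stream and the deletes would go to Couchbase; TTL expiry remains the fallback for anything invalidation misses.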
Use Case: Espresso Materialized Views
[Diagram: changes to the Espresso database Db1.Table1 flow through Brooklin as a change capture stream into Flink SQL (filtering, repartition, projections, joins, UDFs), which writes to the Espresso database Db2.Table2]
* Brooklin: change-capture streams
* Espresso: distributed document store
Lessons Learned (1)
● Invest in testability
○ Make Flink SQL jobs easily testable
■ Unit tests via Datagen and FsCatalog (our custom catalog that provides I/O abstractions)
■ Local dev testing: start a local Flink cluster and test apps locally against the staging environment
○ Vet newer apps before promoting to prod, e.g. with canary support
● Build proactive platforms
○ App validation support during authoring
○ Detect issues during provisioning
● More authoring language options
○ Users can choose SQL, Java (via Table API), or a hybrid
Lessons Learned (2)
● Automatic dashboard generation and alert setup
○ Custom dashboards based on inputs/outputs
○ Routing every alert to the Stream SQL team causes on-call overload, so build a smart alerting system that routes alerts caused by user logic errors directly to users
● Invest in auto-scaling and auto-remediation
○ Aggressive scale-downs cause lag
○ Aggressive scale-ups cause capacity crunches
● Build a reaper service
○ Garbage collect unused user applications
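One way to act on the auto-scaling lesson above is to limit how far a single scaling decision may move in either direction. The Python sketch below is a generic illustration of that idea, not LinkedIn's auto-sizer: the function name, the step limits, and the use of parallelism as the scaled quantity are all assumptions.

```python
def next_parallelism(
    current: int,
    desired: int,
    max_step_up: int,
    max_step_down: int,
    minimum: int = 1,
) -> int:
    """Move from `current` toward `desired`, bounded per decision.

    A small `max_step_down` guards against lag from shrinking too fast;
    a bounded `max_step_up` guards against sudden capacity demand.
    """
    if desired > current:
        target = min(desired, current + max_step_up)
    else:
        target = max(desired, current - max_step_down)
    return max(minimum, target)
```

With `max_step_up=4` and `max_step_down=1`, a job at parallelism 4 that wants 16 moves to 8 on the first decision, while a job at 8 that wants 2 only drops to 7, so scale-downs are deliberately slower than scale-ups.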
Q&A
Backup Slides
Example Flink Job Custom Resource
apiVersion: flink.k8s.org/v1alpha1
kind: FlinkJob
metadata:
  name: simple-job
  annotations:
    liAppName: simple-job
spec:
  flinkCluster:
    image:
      imageName: <image-registry-URI>:<port>/flink-li-image:0.0.39
      imagePullPolicy: Always
    jobManagerConfig:
      resources:
        requests:
          memory: 1024Mi
          cpu: 700m
    taskManagerConfig:
      resources:
        requests:
          memory: 1024Mi
          cpu: 700m
      taskManagerCount: 2
      taskSlots: 2
  flinkJob:
    jobState: RUNNING
    parallelism: 4
    jobArtifact:
      jarUri: "<artifactory-URI>:port/flink-sql-sample-app.jar"
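In the example above, `taskManagerCount: 2` and `taskSlots: 2` give 4 task slots, which matches `parallelism: 4`. A Flink job generally needs at least as many slots as its highest operator parallelism (with default slot sharing), so a spec validator could check this before anything is deployed. The Python sketch below shows such a check; the function names are hypothetical, and whether the operators validate specs this way is not stated in the slides.

```python
def total_slots(task_manager_count: int, task_slots: int) -> int:
    """Total task slots offered by the cluster described in the spec."""
    return task_manager_count * task_slots


def parallelism_fits(parallelism: int, task_manager_count: int, task_slots: int) -> bool:
    """True if the requested job parallelism can be scheduled on the cluster."""
    return 1 <= parallelism <= total_slots(task_manager_count, task_slots)
```

Raising `parallelism` to 5 in the same spec would fail the check unless `taskManagerCount` or `taskSlots` is increased as well.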
Example Flink Cluster Custom Resource
kind: FlinkCluster
metadata:
  name: simple-cluster
  annotations:
    liAppName: flinkSampleApp
spec:
  image:
    imageName: <image-registry-URI>:<port>/flink-li-image:0.0.39
  jobManagerConfig:
    resources:
      requests:
        memory: 362Mi
        cpu: 300m
  taskManagerConfig:
    resources:
      requests:
        memory: 362Mi
        cpu: 300m
    taskManagerCount: 1
    taskSlots: 2
Terminal State Handling
Handling spec change when the FlinkJob is in a STOPPED state
● No Change, or Update: State STOPPED: stays in Flink Job Stopped
● Update: State RUNNING: Flink New Job Creating, then the remaining New Job creation steps
Handling spec change when the FlinkJob is in RUNNING state
● No Change: stays in Flink Job Running
● Update: State RUNNING: Flink Job Updating, then the remaining Job Update / Undeploy steps
● Update: State STOPPED: Flink Job Undeploying, then the remaining Job Update / Undeploy steps
Handling spec change when the FlinkJob is in a FAILED state
● Job Running: State RUNNING: Flink Job Updating
● Job Running: State STOPPED: Flink Job Undeploying
● No Job / Cluster Inaccessible: Flink Cluster Deleting
● Followed by the remaining Job Update / Job Undeploy / Cluster deletion steps
● Other labels: Exception
● Legend: Flink REST API
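The RUNNING and STOPPED cases can be summarized as a small decision function. The sketch below is a Python illustration of one reading of the diagrams: the status strings come from the slides, while the function name and signature are hypothetical. The FAILED case is left out because its transitions depend on what the Flink REST API reports.

```python
def on_reconcile(status: str, spec_changed: bool, desired_state: str) -> str:
    """Return the next status for a FlinkJob resting in a terminal status."""
    if desired_state not in ("RUNNING", "STOPPED"):
        raise ValueError(f"unknown desired state: {desired_state!r}")

    if status == "Flink Job Stopped":
        # Only an update that asks for RUNNING restarts the creation workflow.
        if spec_changed and desired_state == "RUNNING":
            return "Flink New Job Creating"
        return "Flink Job Stopped"

    if status == "Flink Job Running":
        if not spec_changed:
            return "Flink Job Running"
        # A changed spec either redeploys the job or takes it down.
        if desired_state == "RUNNING":
            return "Flink Job Updating"
        return "Flink Job Undeploying"

    raise ValueError(f"not a handled terminal status: {status!r}")
```

Treating "no change" as a no-op keeps reconciliation idempotent, so repeated reconcile calls on an unchanged resource do not disturb a healthy job.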
Use Case: Change Capture Views
[Diagram: Espresso / Oracle / MySQL → Brooklin → Brooklin change capture stream → Flink SQL (filtering, repartition, projections, joins, UDFs) → Brooklin change capture view]
Use Case: Data Caching
[Diagram: Espresso / MySQL / Oracle → Brooklin → Flink SQL (cache population) → Couchbase cache, used by the Application]
Use Case: Data Migration
[Diagram: Espresso / MySQL / Oracle → Brooklin → Flink SQL (data migration) → Espresso / MySQL]

Mais conteúdo relacionado

Mais procurados

Using Queryable State for Fun and Profit
Using Queryable State for Fun and ProfitUsing Queryable State for Fun and Profit
Using Queryable State for Fun and ProfitFlink Forward
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxFlink Forward
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorFlink Forward
 
Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!Flink Forward
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...Flink Forward
 
Extending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesExtending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesFlink Forward
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
 
Changelog Stream Processing with Apache Flink
Changelog Stream Processing with Apache FlinkChangelog Stream Processing with Apache Flink
Changelog Stream Processing with Apache FlinkFlink Forward
 
The top 3 challenges running multi-tenant Flink at scale
The top 3 challenges running multi-tenant Flink at scaleThe top 3 challenges running multi-tenant Flink at scale
The top 3 challenges running multi-tenant Flink at scaleFlink Forward
 
CDC Stream Processing with Apache Flink
CDC Stream Processing with Apache FlinkCDC Stream Processing with Apache Flink
CDC Stream Processing with Apache FlinkTimo Walther
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022Flink Forward
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Flink Forward
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergFlink Forward
 
Apache Flink Worst Practices
Apache Flink Worst PracticesApache Flink Worst Practices
Apache Flink Worst PracticesKonstantin Knauf
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
 
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022CDC Stream Processing With Apache Flink With Timo Walther | Current 2022
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022HostedbyConfluent
 
Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System OverviewFlink Forward
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
 

Mais procurados (20)

Using Queryable State for Fun and Profit
Using Queryable State for Fun and ProfitUsing Queryable State for Fun and Profit
Using Queryable State for Fun and Profit
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptx
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
 
Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
 
Extending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesExtending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use cases
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Changelog Stream Processing with Apache Flink
Changelog Stream Processing with Apache FlinkChangelog Stream Processing with Apache Flink
Changelog Stream Processing with Apache Flink
 
The top 3 challenges running multi-tenant Flink at scale
The top 3 challenges running multi-tenant Flink at scaleThe top 3 challenges running multi-tenant Flink at scale
The top 3 challenges running multi-tenant Flink at scale
 
CDC Stream Processing with Apache Flink
CDC Stream Processing with Apache FlinkCDC Stream Processing with Apache Flink
CDC Stream Processing with Apache Flink
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 
Apache Flink Worst Practices
Apache Flink Worst PracticesApache Flink Worst Practices
Apache Flink Worst Practices
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
 
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022CDC Stream Processing With Apache Flink With Timo Walther | Current 2022
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022
 
Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System Overview
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 

Semelhante a Building a fully managed stream processing platform on Flink at scale for LinkedIn

Workshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con FlinkWorkshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con Flinkconfluent
 
Flink and Hive integration - unifying enterprise data processing systems
Flink and Hive integration - unifying enterprise data processing systemsFlink and Hive integration - unifying enterprise data processing systems
Flink and Hive integration - unifying enterprise data processing systemsBowen Li
 
Unify Enterprise Data Processing System Platform Level Integration of Flink a...
Building a fully managed stream processing platform on Flink at scale for LinkedIn

  • 1. Building a Fully Managed Stream Processing Platform on Flink at Scale for LinkedIn. Weiqing Yang (Software Engineer), Yixing Zhang (Software Engineer), Sonam Mandal (Software Engineer), Vaibhav Maheshwari (Software Engineer, Brooklin). 08/03/2022
  • 2. Agenda: (1) Introduction & Background, (2) Flink SQL Architecture Overview, (3) Component Deep Dives, (4) Use Case & Lessons Learned
  • 4. Managed Stream SQL Processing Platform at LinkedIn
    Fully managed solution: the user focuses only on the app logic, and the Stream SQL team takes care of resource and app management.
    User responsibilities:
      ● Author app logic, test, and deploy
    Stream team responsibilities:
      ● Operational aspects: framework library upgrades, config management, alert/failure handling, etc.
      ● Resource management: hardware, scaling, etc.
  • 8. Managed Stream SQL
    Vision: a platform that enables users to create stream processing pipelines within minutes and manage them easily.
    Pillars: Productive, Reliable, Smart
      ● Various DSLs, extensible with customer-added UDFs
      ● Zero config required
      ● Orchestration layer to validate SQL statements, create missing resources, schema, etc.
      ● Auto-scale
      ● Smart alerting system
      ● Custom dashboard based on inputs/outputs
      ● At-least-once processing semantics
      ● Fault tolerance and fast recovery
  • 9. Typical Stream SQL Use Cases
    Stateless use cases:
      ○ Change capture views
      ○ Materialized views
      ○ Data migration
      ○ Re-partitioning
      ○ Caching with Couchbase
    Stateful use cases:
      ○ Aggregations
      ○ Windowing
      ○ Joins
  • 15. Flink SQL @ LinkedIn
    ● Flink SQL language and connector capabilities leveraged at LinkedIn:
      ○ Testability: leveraged and extended existing Flink abstractions such as Catalog and Table to make Flink SQL and Table applications testable.
      ○ SQL language capabilities: aggregation, windowing, event time and watermarks, native handling of complex and nested records, etc.
      ○ UDF support: strongly typed Avro UDFs, UDAFs (user-defined aggregation functions), and UDTFs (user-defined table functions).
      ○ Table API and SQL support: both Table API and SQL style Flink applications are supported.
    ● Kubernetes for compute and GPFS for storage
  • 16. Managed Stream Processing With Flink SQL: Integrations with the LinkedIn Stack
    Control Plane:
      ● Built Flink cluster and Flink job Kubernetes operators to manage the Flink resources and job lifecycle in a Kubernetes environment.
      ● Integrated with the unified control plane orchestrator (CPO) to ease the deployment and management of streaming jobs across both Flink and Samza.
      ● Integrated with auto-sizing (via CPO) so that Flink jobs can scale as resource requirements change, without manual intervention (WIP).
    Data Plane:
      ● Extended Flink with multiple source / sink connectors:
        ○ Brooklin (change-capture streams)
        ○ LiKafka
        ○ Espresso (document DB)
        ○ Venice (KV store)
        ○ Couchbase
      ● Integrated with the LinkedIn stack for config and dependency management.
      ● Integrated with monitoring, alerting (via CPO), and logging infrastructure to store application logs in Azure Data Explorer for easier debugging.
  • 17. Flink SQL Architecture
    Diagram components: UI, auto-sizer, Control Plane Orchestrator (CPO), Kubernetes Cluster, Flink Job Operator (namespaced), Flink Cluster Operator (namespaced), Flink Job CR, Flink Cluster CR, Flink Clusters, Flink Job, Brooklin, Kafka, Venice, Couchbase.
    Control flow:
      1. Issue the deploy / undeploy / auto-sizing change of the Flink Job
      2. Submit / update the Flink Job
      3. Delete the existing Flink cluster and submit a new CR for the Flink Cluster
      4. Create the Flink Cluster components
      5. Submit the Flink SQL Job to the Job Manager (JM)
      6. Flink connectors in the Flink Job process the inputs
      7. Flink connectors in the Flink Job write out the outputs
    Notes:
      ● 1-to-1 mapping between Flink Job and Flink Cluster
      ● Each Flink Job can have multiple SQL statements
  • 19. Component Deep Dive Topics: (1) Authoring and Testing, (2) Job Deployment via CPO, (3) Flink Job Operator, (4) Flink Cluster Operator
  • 22. Authoring and Testing
    ● Templates let users author Flink SQL jobs with either of the following APIs:
      ○ SQL statements
      ○ Table API
    ● Local developer testing can be performed before deploying streaming applications to staging / production:
      ○ Users can unit test via Datagen (extended to support lookup sources and complex Avro schema data generation) and FsCatalog (our custom catalog that provides I/O abstractions)
      ○ Users can start a local Flink cluster and use either:
        ■ the Flink SQL shell, or
        ■ a LinkedIn version of /bin/flink
        to run their application locally for testing.
      ○ Local testing runs against the staging environment
    ● Canary support lets users test updates to their application in parallel with a version already running in staging / production, to prevent downtime due to bugs / issues.
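The Datagen-based unit testing described on this slide can be sketched with open-source Flink only. LinkedIn's extended Datagen and FsCatalog are not shown in the deck, so this minimal sketch uses the stock `datagen` and `print` connectors instead; the table names, columns, and helper functions are hypothetical. PyFlink is imported lazily so that the DDL helpers work without it installed.

```python
from typing import Dict, List


def datagen_source_ddl(table: str, columns: Dict[str, str],
                       rows_per_second: int = 5, number_of_rows: int = 100) -> str:
    """Build a CREATE TABLE statement backed by Flink's stock datagen connector."""
    cols = ",\n  ".join(f"`{name}` {sql_type}" for name, sql_type in columns.items())
    return (
        f"CREATE TABLE `{table}` (\n  {cols}\n) WITH (\n"
        "  'connector' = 'datagen',\n"
        f"  'rows-per-second' = '{rows_per_second}',\n"
        # A row limit makes the source bounded, so a local test job can finish.
        f"  'number-of-rows' = '{number_of_rows}'\n"
        ")"
    )


def print_sink_ddl(table: str, columns: Dict[str, str]) -> str:
    """Build a CREATE TABLE statement for Flink's stock print connector."""
    cols = ",\n  ".join(f"`{name}` {sql_type}" for name, sql_type in columns.items())
    return f"CREATE TABLE `{table}` (\n  {cols}\n) WITH (\n  'connector' = 'print'\n)"


def run_locally(statements: List[str]) -> None:
    """Run the statements in a local PyFlink environment (requires apache-flink)."""
    from pyflink.table import EnvironmentSettings, TableEnvironment

    t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
    for stmt in statements[:-1]:
        t_env.execute_sql(stmt)
    # Block until the final (INSERT) statement's job completes.
    t_env.execute_sql(statements[-1]).wait()


# Hypothetical schema for illustration only.
SCHEMA = {"member_id": "BIGINT", "score": "DOUBLE"}
PIPELINE = [
    datagen_source_ddl("events", SCHEMA),
    print_sink_ddl("sink", SCHEMA),
    "INSERT INTO `sink` SELECT `member_id`, `score` FROM `events`",
]
```

Calling `run_locally(PIPELINE)` on a machine with PyFlink installed would generate 100 random rows and print them, which is enough to check that a query parses and runs before it touches real inputs.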
  • 23. Deployment with CPO
    Control Plane Orchestrator (CPO) is a new service we have built at LinkedIn for managing and deploying stream processing jobs, including Flink jobs and Samza jobs.
    New Flink Job Registration Workflow (User, CLI, CPO):
      1. Create Job
      2. Send Request
    Flink Job Deployment Workflow (User, Web UI, CLI, CPO, Kubernetes Cluster):
      1. Deploy / Undeploy / Rollback Job
      2. Send Request
      3. Submit Job
  • 24. Control Plane Orchestrator (CPO)
    Diagram layers: CPO user-facing (Dashboard UI, CLI), CPO backend, Metadata Store, internal services and tools (Auto-sizing, Kafka, …), runtime environments (Yarn, K8s).
    ● Connected components:
      ○ Job lifecycle notifications
      ○ Monitoring and alerting
      ○ Split deployment
      ○ Auto-sizing* (WIP)
    ● Future features:
      ○ Framework version management
      ○ Auto-creation of I/O resources
    * Auto-sizing for stream processing applications at LinkedIn
  • 25. Flink Job Operator and Flink Cluster Operator
    Flink Job Operator: manages the lifecycle of Flink SQL jobs on Kubernetes
      ● Flink Cluster creation via the Flink Cluster Operator
      ● Job deployment / undeployment / update / upgrade via the Flink REST APIs
      ● Savepoint and checkpoint management via the Flink REST APIs
      ● Storage integrations for savepoints and checkpoints
      ● Job deletion handling
    Flink Cluster Operator: manages Flink cluster resources on Kubernetes
      ● Job Manager Service
      ● Job Manager Deployment
      ● Task Manager Deployment
      ● ConfigMaps
      ● Security management (application certificates)
      ● Storage integrations for cluster-level data
      ● Cluster deletion handling
    We have also built a CI/CD pipeline that tests both operators to maintain code quality.
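To make "via the Flink REST APIs" concrete, the sketch below builds request descriptors for a few endpoints of the open-source Flink REST API (running an uploaded JAR, triggering a savepoint, polling its status, cancelling a job). The deck does not say which calls LinkedIn's operator actually issues, so the helper names and the choice of endpoints are assumptions; no network call is made.

```python
from typing import Any, Dict, List, Optional, Tuple

# (HTTP method, path, JSON body or None)
Request = Tuple[str, str, Optional[Dict[str, Any]]]


def run_jar(jar_id: str, entry_class: str,
            savepoint_path: Optional[str] = None) -> Request:
    """Start a job from a previously uploaded JAR, optionally restoring state."""
    body: Dict[str, Any] = {"entryClass": entry_class}
    if savepoint_path is not None:
        body["savepointPath"] = savepoint_path
    return ("POST", f"/jars/{jar_id}/run", body)


def trigger_savepoint(job_id: str, target_directory: str,
                      cancel_job: bool) -> Request:
    """Trigger a savepoint; with cancel_job=True the job is cancelled afterwards."""
    body = {"target-directory": target_directory, "cancel-job": cancel_job}
    return ("POST", f"/jobs/{job_id}/savepoints", body)


def savepoint_status(job_id: str, trigger_id: str) -> Request:
    """Poll the asynchronous savepoint operation started by trigger_savepoint."""
    return ("GET", f"/jobs/{job_id}/savepoints/{trigger_id}", None)


def cancel(job_id: str) -> Request:
    """Cancel a job without taking a savepoint."""
    return ("PATCH", f"/jobs/{job_id}", None)


def stop_requests(job_id: str, savepoint_dir: Optional[str]) -> List[Request]:
    """Hypothetical helper: how a job might be stopped before an update."""
    if savepoint_dir is not None:
        return [trigger_savepoint(job_id, savepoint_dir, cancel_job=True)]
    return [cancel(job_id)]
```

A caller would send these with any HTTP client against the Job Manager service, then pass the savepoint location reported by the status endpoint to `run_jar` when resubmitting.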
  • 26. Resource Ownership Overview
    Resource ownership among Kubernetes resources is expressed via owner references: FlinkJob owns FlinkCluster, which owns the ConfigMap, Task Manager Deployment, Job Manager Deployment, Job Manager Service, etc.
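Owner references are a standard Kubernetes mechanism: a dependent object lists its owner under `metadata.ownerReferences`, and the garbage collector removes dependents when the owner is deleted. The following is a minimal sketch of attaching such a reference to a manifest held as a Python dict. The API group `example.com/v1`, the resource names, and the UID are hypothetical, since the deck does not show the actual custom resource definitions.

```python
import copy
from typing import Any, Dict

Manifest = Dict[str, Any]


def owned_by(child: Manifest, owner: Manifest) -> Manifest:
    """Return a copy of `child` with an ownerReference pointing at `owner`."""
    reference = {
        "apiVersion": owner["apiVersion"],
        "kind": owner["kind"],
        "name": owner["metadata"]["name"],
        "uid": owner["metadata"]["uid"],
        "controller": True,          # the owner is the managing controller
        "blockOwnerDeletion": True,  # foreground deletion waits for this child
    }
    result = copy.deepcopy(child)
    result.setdefault("metadata", {}).setdefault("ownerReferences", []).append(reference)
    return result


# Hypothetical custom resource and dependent, for illustration only.
flink_cluster = {
    "apiVersion": "example.com/v1",
    "kind": "FlinkCluster",
    "metadata": {"name": "my-job-cluster", "uid": "1111-2222"},
}
jm_service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "my-job-cluster-jobmanager"},
}
owned_service = owned_by(jm_service, flink_cluster)
```

With references chained this way, deleting the top-level resource is enough to clean up everything beneath it, which is presumably why the operators rely on it for deletion handling.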
  • 28. Deployment Workflow
    New Flink Job Creation Workflow (trigger: null status, State: RUNNING / STOPPED)
      States, in diagram order: Flink New Job, Flink Cluster Creating, Job JAR Uploading, Flink Job Submitting, Flink Job Submitted, Flink Job Running, Flink Job Stopped
      Edge labels: STOPPED, RUNNING, Job Finished
    Flink Job Update / Undeploy Workflow (trigger: spec changed, State: RUNNING / STOPPED)
      States, in diagram order: Flink Job Updating / Undeploying, Flink Job Savepointing (Savepoint: Enabled), Flink Job Canceling (Savepoint: Disabled), Flink Cluster Deleting, Flink Cluster Creating, Job JAR Uploading, Flink Job Submitting, Flink Job Submitted, Flink Job Running, Flink Job Stopped
      Edge labels: Job Finished; Undeploying: State STOPPED; Updating: State RUNNING
    Job-level steps go through the Flink REST API.
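The workflow diagram survives only as a flat list of labels, so the exact arrows are not recoverable. The sketch below encodes one plausible reading as a small state machine: a new job either builds its cluster and runs or goes straight to stopped, and an update or undeploy first takes a savepoint (if enabled) or cancels, then deletes the cluster, then either rebuilds it or stops. All names and transitions here are assumptions, not the operator's actual code.

```python
from enum import Enum
from typing import List


class Phase(Enum):
    NEW = "Flink New Job"
    CLUSTER_CREATING = "Flink Cluster Creating"
    JAR_UPLOADING = "Job JAR Uploading"
    JOB_SUBMITTING = "Flink Job Submitting"
    JOB_SUBMITTED = "Flink Job Submitted"
    JOB_RUNNING = "Flink Job Running"
    JOB_STOPPED = "Flink Job Stopped"
    UPDATING = "Flink Job Updating / Undeploying"
    SAVEPOINTING = "Flink Job Savepointing"
    CANCELING = "Flink Job Canceling"
    CLUSTER_DELETING = "Flink Cluster Deleting"


def next_phase(phase: Phase, desired_state: str, savepoint_enabled: bool) -> Phase:
    """Assumed transition function; terminal phases map to themselves."""
    wants_running = desired_state == "RUNNING"
    if phase in (Phase.NEW, Phase.CLUSTER_DELETING):
        return Phase.CLUSTER_CREATING if wants_running else Phase.JOB_STOPPED
    if phase is Phase.UPDATING:
        return Phase.SAVEPOINTING if savepoint_enabled else Phase.CANCELING
    linear = {
        Phase.SAVEPOINTING: Phase.CLUSTER_DELETING,
        Phase.CANCELING: Phase.CLUSTER_DELETING,
        Phase.CLUSTER_CREATING: Phase.JAR_UPLOADING,
        Phase.JAR_UPLOADING: Phase.JOB_SUBMITTING,
        Phase.JOB_SUBMITTING: Phase.JOB_SUBMITTED,
        Phase.JOB_SUBMITTED: Phase.JOB_RUNNING,
    }
    return linear.get(phase, phase)


def walk(start: Phase, desired_state: str, savepoint_enabled: bool) -> List[Phase]:
    """Follow transitions from `start` until a terminal phase is reached."""
    path = [start]
    while True:
        following = next_phase(path[-1], desired_state, savepoint_enabled)
        if following is path[-1]:
            return path
        path.append(following)
```

For example, `walk(Phase.UPDATING, "STOPPED", True)` models an undeploy with savepoints enabled, and `walk(Phase.NEW, "RUNNING", False)` models a first deployment.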
  • 32. Failure Recovery
    ● Today we rely on Kubernetes to restart pods that die in the Job Manager and Task Manager deployments. We plan to add health monitoring capabilities for these resources in the future.
    ● Flink provides fault tolerance for job errors through job retry configs.
    ● If a deploy or undeploy fails, manual intervention is needed to restart or stop the Flink job via CPO.
    ● Both checkpoints and savepoints are provided for state restoration in job restart / upgrade / update scenarios.
  • 35. When to update a Flink cluster to Ready
    ● All the child K8s resources (dependents) are Ready
    ● Ask the Job Manager Service for the cluster overview and validate that the response returned is OK. Then validate the contents of the response:
      ○ "taskmanagers" matches the expected number of Task Manager replicas
      ○ "slots-total" matches (taskSlots * expected number of Task Managers)
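The readiness rule above can be sketched as a pure function over the cluster overview JSON. The `taskmanagers` and `slots-total` field names come from the slide and match the open-source Flink `/overview` response; the function name, its parameters, and the use of HTTP 200 for "OK" are assumptions for illustration.

```python
from typing import Any, Dict


def cluster_is_ready(children_ready: bool, status_code: int,
                     overview: Dict[str, Any],
                     expected_task_managers: int, task_slots: int) -> bool:
    """Return True only if every readiness condition from the slide holds."""
    if not children_ready:
        return False  # some dependent K8s resource is not Ready yet
    if status_code != 200:
        return False  # the Job Manager did not answer the overview call with OK
    expected_slots = task_slots * expected_task_managers
    return (
        overview.get("taskmanagers") == expected_task_managers
        and overview.get("slots-total") == expected_slots
    )
```

For instance, with 3 expected Task Managers and 4 slots each, the cluster is Ready only once the overview reports `taskmanagers` of 3 and `slots-total` of 12; a response showing 2 Task Managers and 8 slots means a Task Manager has not registered yet.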
  • 36. High-Level Architecture: Flink Cluster K8s Components and Interactions [diagram] Components: Flink ConfigMap; Job Manager Scrape ConfigMap; Task Manager Scrape ConfigMap; Flink Job Manager Deployment; Flink Task Manager Deployment; Flink Job Manager Service; GPFS mounts; Init Container and Metrics Sidecar (shown for both deployments); Identity Service (managing the lifecycle of app certificates); GPFS (one per K8s cluster), which stores checkpoints / savepoints / logs*. Interactions: Flink job submission / status check / cancel / savepoints via REST; metrics are scraped to the monitoring service; an automated metrics-service creates dashboards / alerts from the monitoring service. * We are working on integrating with a centralized logging platform for application logs at LinkedIn, backed by Microsoft Azure Data Explorer (Kusto)
  • 37. Use Cases & Lessons Learned
  • 38. Use Cases ● The managed stream processing platform is the backbone for other infra systems such as Search, Espresso (internal document store), and feature management. ● The first production use cases (~60 applications) on Flink have been deployed and are serving production traffic at LinkedIn.
  • 39. Use Case: Search Infra Provide search capabilities in a fully hosted, self-serve, cloud-based fashion. Customers often require indexing and searching of joined or transformed records. ● Joins refer to the join of records from different database tables ● Transformations refer to various operations/changes on the records.
  • 40. Search Infra: Joins and Transformations Support [diagram; labels: Flink SQL join job; Flink SQL] * Rest.li: a framework for building RESTful architectures at scale * Brooklin: change-capture streams * Couchbase: a highly scalable, distributed data store
  • 41. Use Case: Espresso Couchbase Caching. Stream SQL apps: Cache Invalidation [diagram] Components: Client Applications; Espresso; Brooklin Change Capture; Couchbase; Nuage; Client Admin. Labeled actions: PUT DB/Table/a/b/c; UPDATE DB/Table/a/b/c; DELETE DB/Table/a/b/c; DELETE DB/Table/a/b; DELETE DB/Table/a; Cache Data on Reads; Expires Data with TTL; Expires Cached Data on Change; Provisions Buckets; Provisions Stream; Evolves Espresso Schema; Configures Store cache with appropriate TTL; Add Cache to Espresso DB/Table with TTL. * Nuage: a Data Systems Management platform * Brooklin: change-capture streams * Couchbase: a highly scalable, distributed data store
  • 42. Use Case: Espresso Materialized Views [diagram] Components: Espresso database Db1.Table1; Brooklin change capture stream (changes); Flink SQL (Filtering, Repartition, Projections, Joins, UDFs); Espresso database Db2.Table2. * Brooklin: change-capture streams * Espresso: distributed document store
  • 45. Lessons Learned (1) ● Invest in testability ○ Make Flink SQL jobs easily testable ■ Unit tests via Datagen and FsCatalog (our custom catalog to provide I/O abstractions) ■ Local dev testing: start a local Flink cluster and test apps locally against the staging environment ○ Vet newer apps before promoting them to prod, e.g. with canary support ● Build proactive platforms ○ App validation support during authoring ○ Detect issues during provisioning ● Offer more authoring language options ○ Users can choose SQL, Java (via Table API), or a hybrid of the two
  • 48. Lessons Learned (2) ● Automatic dashboard generation and alert setup ○ Custom dashboard based on inputs/outputs ○ Routing every alert to the Stream SQL team causes on-call overload, so build a smart alerting system that routes alerts caused by user logic errors directly to users ● Invest in auto-scaling and auto-remediation ○ Aggressive scale-downs cause lag ○ Aggressive scale-ups cause a capacity crunch ● Build a reaper service ○ Garbage-collect unused user applications
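The auto-scaling lesson (aggressive scale-downs cause lag, aggressive scale-ups cause a capacity crunch) suggests damping each scaling step. The sketch below is one simple way to do that and is not the platform's actual autoscaler; `ScalePolicy`, `next_replicas`, and all default values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScalePolicy:
    max_step_up: int = 2     # cap on replicas added per decision
    max_step_down: int = 1   # cap on replicas removed per decision
    min_replicas: int = 1
    max_replicas: int = 16   # stand-in for the available capacity


def next_replicas(current: int, target: int, policy: ScalePolicy) -> int:
    """Move from current toward target, limited by the per-step caps."""
    if target > current:
        proposed = min(target, current + policy.max_step_up)
    elif target < current:
        proposed = max(target, current - policy.max_step_down)
    else:
        proposed = current
    # Clamp to the allowed replica range.
    return max(policy.min_replicas, min(policy.max_replicas, proposed))
```

With the default policy, a job at 4 replicas whose target jumps to 10 is moved to 6 first, and a job whose target drops to 1 is moved to 3, so neither direction changes capacity abruptly.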
  • 49. Q&A
  • 51. Example Flink Job Custom Resource
      apiVersion: flink.k8s.org/v1alpha1
      kind: FlinkJob
      metadata:
        name: simple-job
        annotations:
          liAppName: simple-job
      spec:
        flinkCluster:
          image:
            imageName: <image-registry-URI>:<port>/flink-li-image:0.0.39
            imagePullPolicy: Always
          jobManagerConfig:
            resources:
              requests:
                memory: 1024Mi
                cpu: 700m
          taskManagerConfig:
            resources:
              requests:
                memory: 1024Mi
                cpu: 700m
            taskManagerCount: 2
            taskSlots: 2
        flinkJob:
          jobState: RUNNING
          parallelism: 4
          jobArtifact:
            jarUri: "<artifactory-URI>:port/flink-sql-sample-app.jar"
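In this example the job parallelism (4) equals taskManagerCount (2) times taskSlots (2). A pre-deployment check along these lines could catch specs that request more parallelism than the cluster has slots. This is a sketch under the assumption that each parallel subtask chain needs its own slot (Flink's default slot sharing); `fits_in_slots` is a hypothetical helper, not a documented platform feature.

```python
def fits_in_slots(parallelism: int, task_manager_count: int, task_slots: int) -> bool:
    """Return True if the cluster offers at least `parallelism` task slots."""
    if min(parallelism, task_manager_count, task_slots) < 1:
        raise ValueError("all values must be positive integers")
    total_slots = task_manager_count * task_slots
    return parallelism <= total_slots
```

For the spec above, `fits_in_slots(4, 2, 2)` holds; raising parallelism to 5 without adding Task Managers or slots would fail the check.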
  • 52. Example Flink Cluster Custom Resource
      kind: FlinkCluster
      metadata:
        name: simple-cluster
        annotations:
          liAppName: flinkSampleApp
      spec:
        image:
          imageName: <image-registry-URI>:<port>/flink-li-image:0.0.39
        jobManagerConfig:
          resources:
            requests:
              memory: 362Mi
              cpu: 300m
        taskManagerConfig:
          resources:
            requests:
              memory: 362Mi
              cpu: 300m
          taskManagerCount: 1
          taskSlots: 2
  • 53. Terminal State Handling [three state diagrams; legend: Flink REST API] Handling spec change when the FlinkJob is in a STOPPED state: Flink Job Stopped; Update: State RUNNING; Flink New Job Creating; Remaining New Job creation steps; No Change / Update: State STOPPED. Handling spec change when the FlinkJob is in a RUNNING state: Flink Job Running; Update: State RUNNING; Flink Job Updating; Update: State STOPPED; Flink Job Undeploying; No Change; Remaining Job Update / Undeploy steps. Handling spec change when the FlinkJob is in a FAILED state: Flink Job Failed; Job Running: State RUNNING; Flink Job Updating; Job Running: State STOPPED; Flink Job Undeploying; No Job / Cluster Inaccessible; Flink Cluster Deleting; Exception; Remaining Job Update / Job Undeploy / Cluster deletion steps.
  • 54. Use Case: Change Capture Views [diagram] Components: Espresso / Oracle / MySQL; Brooklin change capture stream; Flink SQL (Filtering, Repartition, Projections, Joins, UDFs); Brooklin change capture view.
  • 55. Use Case: Data Caching [diagram] Components: Espresso / MySQL / Oracle; Brooklin; Flink SQL (Cache Population); Couchbase cache; Application.
  • 56. Use Case: Data Migration [diagram] Components: Espresso / MySQL / Oracle; Brooklin; Flink SQL (Data migration); Espresso / MySQL.