SPaaS
Stream Processing with Flink at Netflix
Monal Daxini
Engineering Manager,
Stream Processing
FlinkForward 2017
@monaldax
Challenges & Lessons Learnt
We’ll look at our Ingest pipeline, SPaaS, Use cases, Challenges, and Lessons learnt
SPaaS
Stream Processing as a Service
SPaaS Use Cases
Ingest Pipeline
Let’s start with Keystone Ingest pipeline
Event
Producers
Sinks
To reliably publish and process events, one needs highly available
Ingest pipelines - the backbone of a real-time data infrastructure
Keystone
Stream Processing
(SPaaS)
Keystone
Management
Keystone
Messaging
Keystone pipeline is built on 3 core subsystems
It does not impact members' ability to play videos
100% in AWS
Keystone Management – Provision a data stream (mini pipeline) 📽
Keystone Management – Update filter 📽
Keystone Management - Filter DSL & Message Parser
* We would like to move away from xpath & our custom parser
Keystone Management - Configure output (sinks / destination)
Keystone Management – ElasticSearch sink config
Keystone Management – Projection 📽
Provisioning of a data stream generates dashboard & alert configurations
Event
Processing
Keystone Management – Tooling 📽
Keystone Management – Sample report, Inactive streams
Keystone Management – Sample Report, Under provisioned streams
Let’s explore event flow in Keystone pipeline, and its capabilities
Keystone pipeline system boundary
[Diagram components: Event Producer, KSGateway, KCW, Fronting Kafka, SPaaS Router (Flink), Consumer Kafka, Stream Consumers, EMR, Keystone Management]
* Does not impact video playability
Events are published via a proxy or a Kafka client wrapper (KCW)
Events land in the fronting Kafka cluster
Events are polled by router, filter and projection applied
Router sends events to destination
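The router step described above (poll events, apply a filter and a projection, send the result on) can be sketched as a small function. This is only an illustration in Python, not Keystone's implementation: the `route` function, the predicate-style filter, the field-list projection, and the event fields are all assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Event = Dict[str, object]  # hypothetical event shape: a flat dict of fields


def route(events: Iterable[Event],
          keep: Callable[[Event], bool],
          fields: List[str]) -> Iterator[Event]:
    """Apply a filter, then a projection, to each polled event."""
    for event in events:
        if not keep(event):
            continue  # filter: drop events this data stream does not want
        # projection: keep only the requested fields that are present
        yield {name: event[name] for name in fields if name in event}


# Hypothetical events, standing in for a batch polled from the fronting Kafka.
polled = [
    {"type": "play", "device": "tv", "title_id": 1},
    {"type": "pause", "device": "tv", "title_id": 1},
    {"type": "play", "device": "phone", "title_id": 2},
]
routed = list(route(polled, lambda e: e["type"] == "play", ["device", "title_id"]))
```

Here `routed` holds the two play events reduced to `device` and `title_id`; a real router would then write them to the configured sink.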
Keystone Pipeline Capabilities
• At-least-once delivery semantics*
• Data stream isolation
• Inject event metadata
  • GUID, timestamp, host, app
Keep data loss < 1% per day per data stream for infrastructure migration or deployments
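A minimal sketch of metadata injection, assuming an envelope that wraps the producer's payload. The slide only names the four metadata items (GUID, timestamp, host, app); the function name, the key names, and the envelope layout are hypothetical.

```python
import socket
import time
import uuid


def inject_metadata(payload: dict, app: str) -> dict:
    """Wrap a payload with GUID, timestamp, host, and app metadata."""
    return {
        "guid": str(uuid.uuid4()),                # unique id per event
        "timestamp_ms": int(time.time() * 1000),  # publish time, epoch millis
        "host": socket.gethostname(),             # host that produced the event
        "app": app,                               # producing application name
        "payload": payload,
    }


event = inject_metadata({"type": "play"}, app="example-app")
```

With at-least-once delivery, a per-event GUID is one way a consumer could recognize a redelivered event, although the slides do not say how the GUID is used.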
Keystone Pipeline Capabilities
• Scales based on traffic (externally driven)
• Producer & Router
• Kafka Cluster failover
• Kafka Kong
Kafka Kong
Keystone pipeline is up 24x7, and availability is key.
In the spirit of Chaos Kong, we run Kafka Kong once a week
Automated Kafka cluster failover
[Diagram components: Event Producer, Fronting Kafka (failed, marked X), Flink Router; bring up a backup Kafka cluster]
Fronting Kafka Failover
Time is of the essence: fully automated failover in as little as 5 minutes
Let’s look at Stream Processing as a Service
SPaaS enables point & click pipelines and custom jobs
Flink is currently used for two broad classes of use cases in SPaaS
• Point & Click Pipelines: Filtering & Projection (in prod)
• Custom Code (staging with prod data), DSL (future)
High level SPaaS Architecture
[Diagram: 1. Create a stream processing job (Point & Click, Job DSL (future), or custom code (upcoming)); 2. Submit the job, or the Job DSL (future); 3. Launch the job. Components shown: Keystone Management, Continuous Delivery Platform, Container Runtime (Titus).]
The stream processing platform is a layer cake that offers flexible services
to our internal customers (engineers)
AWS EC2
Container Runtime
SPaaS-Core
Reusable Blocks
ES, Kafka, & Hive Sink (Flink in test)
Routers Stream Processing Applications
Reusable Blocks - Early days
Keystone & Kafka Sink, Complex Sessionization
Flink Job deployment on container runtime
[Diagram: Job Managers (master and standby) and Task Managers run as Titus jobs spread across Titus hosts 1 to 5, each container with its own IP (AWS VPC ENI). Components shown: Zookeeper, Fronting Kafka (offset checkpointing), Checkpoint / Snapshot.]
SPaaS runs on Titus, Netflix's in-house container runtime
[Diagram components: Titus UI, Titus API, Rhea, CI/CD, Titus Master (Job Management & Scheduler, Fenzo), Mesos Master, Cassandra, Zookeeper, S3, Docker Registry, EC2 Autoscaling API; Titus Agent (Mesos agent, Titus executor, Docker, ZFS, metrics agent, logging agent) running containers, including the SPaaS job]
Let’s look at Stream Processing use cases
1. Keystone pipeline Router is a massively parallel use case
• Every data stream is independent and isolated
• No dependencies between tasks
• Chained operator
• Only state – Kafka offset checkpointing
Job Plan from JobManager UI
Broker
Router – each Flink job reads from one topic, and each task
independently polls events from assigned partitions
ES Router Kafka Router
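To make "each task independently polls events from assigned partitions" concrete, here is a sketch that spreads a topic's partitions over parallel tasks. The modulo rule and the `assign_partitions` name are assumptions chosen for illustration; the slides do not say how the Flink Kafka source actually assigns partitions.

```python
from typing import Dict, List


def assign_partitions(num_partitions: int, parallelism: int) -> Dict[int, List[int]]:
    """Spread partition ids over task indexes with a simple modulo rule."""
    assignment: Dict[int, List[int]] = {task: [] for task in range(parallelism)}
    for partition in range(num_partitions):
        assignment[partition % parallelism].append(partition)
    return assignment


# One topic with 8 partitions read by a job with 3 parallel tasks.
plan = assign_partitions(num_partitions=8, parallelism=3)
```

Each partition belongs to exactly one task, so tasks share no state and need no coordination beyond checkpointing their own Kafka offsets.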
Prod Scale processed by
ES & Kafka Flink routers, and Hive Samza routers
• 1,300,000,000,000+ events processed / day
• 3+ PB in, 9+ PB out / day
• 99%+ availability ytd
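As a rough sanity check on these daily totals, the arithmetic below converts them to averages: roughly 15 million events per second and about 2.3 KB per ingested event. It assumes decimal petabytes and a flat rate across the day, neither of which the slides state, so real peaks would be higher than these averages.

```python
SECONDS_PER_DAY = 24 * 60 * 60

events_per_day = 1.3e12      # "1,300,000,000,000+ events processed / day"
bytes_in_per_day = 3e15      # "3+ PB in", assuming decimal petabytes
bytes_out_per_day = 9e15     # "9+ PB out", assuming decimal petabytes

avg_events_per_sec = events_per_day / SECONDS_PER_DAY
avg_bytes_per_event_in = bytes_in_per_day / events_per_day
fan_out = bytes_out_per_day / bytes_in_per_day  # bytes written per byte ingested

print(f"{avg_events_per_sec / 1e6:.1f}M events/sec on average")
print(f"{avg_bytes_per_event_in:.0f} bytes per ingested event on average")
print(f"{fan_out:.0f}x bytes out per byte in")
```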
Prod – trending events (approximate)
≅ 80B to 1.3T
Prod Scale – only Kafka and ES Flink routers are
deployed in prod, (Hive output Flink routers are in test, unaccounted below)
• 4000+ Kafka brokers, 50+ clusters
• 100’s of Data Streams (Flink Jobs)
• 3700+ Docker containers running
• 1400+ nodes with 22K+ cpu cores
Router has large scale in terms of volume and overall deployed
streams in the cloud, which leads to challenges unnoticed otherwise
• S3 checkpointing backend
• S3 outage = router downtime,
rely on Kafka offset commit only, like Samza
• Pressure on S3 if deployment is not staggered
• Disable distributed checkpointing, only JobManager writes
to checkpointing backend.
Router has large scale in terms of volume and overall deployed
streams in the cloud, which leads to challenges unnoticed otherwise
• A failed task can cause the job to restart (JVM running)
• Need Fine grained recovery (Phase 1), FLIP-1
• Failures can at times cause a few more duplicates than Samza
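Why a restart produces duplicates under at-least-once delivery can be shown with a toy simulation. It assumes offsets are checkpointed every fixed number of events and that a restarted job rewinds to the last checkpointed offset; the function and its parameters are hypothetical and do not model Flink's actual checkpointing.

```python
from typing import List


def run_with_failure(num_events: int, checkpoint_every: int, fail_at: int) -> List[int]:
    """Return the offsets emitted downstream when the job fails once and restarts."""
    emitted: List[int] = []
    checkpointed = 0          # offset to resume from after a restart
    offset = 0
    failed = False
    while offset < num_events:
        if not failed and offset == fail_at:
            failed = True
            offset = checkpointed   # restart: rewind to the last checkpoint
            continue
        emitted.append(offset)      # event is sent to the sink
        offset += 1
        if offset % checkpoint_every == 0:
            checkpointed = offset   # offsets before this point are now durable
    return emitted


out = run_with_failure(num_events=10, checkpoint_every=4, fail_at=6)
duplicates = len(out) - len(set(out))
```

Events emitted after the last checkpoint and before the failure are sent again, so the number of duplicates grows with the checkpoint interval and with how much of the job restarts.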
Flink Job deployment on container runtime
[Diagram: the Titus deployment of Job Managers (master and standby), Task Managers, Zookeeper, Fronting Kafka (offset checkpointing), and Checkpoint / Snapshot, with a failure marked X that causes a Flink job restart.]
Measurable cost savings from moving ES and Kafka routers from Samza to Flink
• Disclaimer: When comparing Flink and Samza, you may observe
different results in your own environment and setup
• This is not an exact apples-to-apples comparison
Observed significant savings by migrating ES and Kafka routers to Flink
on the new container runtime, compared with Samza on the old container runtime
2. Enriching User Video Plays with “discovery” attributes using Flink
• Talk to other live services
• Integration with IPC ecosystem
• Needs high throughput
• Small state
• O(100M) events / day
Not in production. No Keystone Management support yet.
The challenges with the event enrichment use case
● Access data (slow / fast changing) from live or static sources
● Play nice, avoid member streaming impact
● Reliability and stability
● Dependency Isolation (Jar Hell)
● Backfill – (historical data / deal with bugs)
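One common way to "play nice" with a live service during enrichment is to cache lookups for a short time so repeated plays of the same title do not each trigger a call. The sketch below assumes that approach; the `CachedLookup` class, the TTL value, and the shape of the discovery attributes are hypothetical, and the slides do not describe how Netflix implemented this.

```python
import time
from typing import Callable, Dict, Tuple


class CachedLookup:
    """Cache lookups for `ttl_seconds` to limit calls to a live service."""

    def __init__(self, fetch: Callable[[str], dict], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, dict]] = {}
        self.calls = 0  # number of real service calls made

    def get(self, key: str) -> dict:
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # fresh enough: no service call
        self.calls += 1
        value = self._fetch(key)
        self._cache[key] = (now, value)
        return value


def enrich(play: dict, lookup: CachedLookup) -> dict:
    """Return a copy of the play event with discovery attributes attached."""
    return {**play, "discovery": lookup.get(play["title_id"])}


# Hypothetical stand-in for a call to a live service.
lookup = CachedLookup(lambda title_id: {"row": "trending"}, ttl_seconds=60.0)
plays = [{"title_id": "t1", "user": "a"}, {"title_id": "t1", "user": "b"}]
enriched = [enrich(p, lookup) for p in plays]
```

The TTL trades freshness for load: slow-changing data tolerates a long TTL, while fast-changing data needs a short one or no cache at all.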
3. Complex sessionization of user events using Flink
• Create sessions with start and end events, determined
based on event payload and event time order (punctuated)
• Handle late, and out-of-order events
• 2 to 24 hour session window duration
• O(10B) events per day, testing with a small fraction of
this volume (flink job state 100GB+)
Not in production. No Keystone Management support yet.
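The idea of sessions bounded by start and end events, rather than by an inactivity gap, can be sketched in batch form. This is an assumption-laden illustration: the event kinds (`start`, `play`, `end`), the tuple layout, and the sort-then-scan approach are hypothetical, and a streaming job would rely on watermarks and windowing instead of sorting a complete list.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Event = Tuple[str, int, str]  # (user_id, event_time, kind); kinds are hypothetical


def sessionize(events: List[Event]) -> Dict[str, List[List[int]]]:
    """Group each user's events into sessions delimited by start/end markers."""
    by_user: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for user, ts, kind in events:
        by_user[user].append((ts, kind))

    sessions: Dict[str, List[List[int]]] = {}
    for user, items in by_user.items():
        items.sort()  # restore event-time order for out-of-order arrivals
        current: List[int] = []
        closed: List[List[int]] = []
        for ts, kind in items:
            if kind == "start":
                current = [ts]          # a start event opens a new session
            elif current:
                current.append(ts)
                if kind == "end":       # an end event closes the open session
                    closed.append(current)
                    current = []
        sessions[user] = closed
    return sessions


arrivals = [
    ("u1", 10, "start"),
    ("u1", 30, "end"),
    ("u1", 20, "play"),   # arrives late, but belongs between start and end
    ("u2", 5, "start"),
    ("u2", 7, "end"),
]
result = sessionize(arrivals)
```

Events that arrive with no open session are dropped in this sketch; a production job would need an explicit policy for them and for sessions that never see an end event.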
The challenges with complex sessionization use case with large state
● Flink supported session window with gap duration is not sufficient
● Developed custom, complex session windows - done
● Large state & large scale
● Quick checkpoints, and fast recovery from job failures
● Incremental checkpointing
● Exploring other storage strategies with Flink community
We have realized that these three use cases represent a large
set of the challenges and requirements the SPaaS platform must meet
● Router - Massively Parallel - almost no state, very large scale
● Event enrichment - small state, medium scale
● Complex sessionization – large state, large scale
In addition, there are several other challenges across use cases
● Developer tooling & Testing
● Insight and Operations
● Continuity through upgrades & deployments
● Data parity & Canary tooling
● Thinking streaming first – always on, operational responsibilities
● Cross region event aggregation and routing
● Auto scaling & capacity planning
Community
Contributions
We are contributing by running Flink at scale
in the cloud (pioneer tax), and more
● Metrics, Operations, Deployment
● Custom, complex session windows
● Fault tolerance, large State management
● Challenges related to massively parallel codebase
● Adaptation of our Patch - https://github.com/apache/flink/pull/3312
Conclusion
You got a glimpse of how we are leveraging Flink as part of our
stream processing platform to serve the business insight needs of
other engineers at Netflix.
We have come a long way; however, we have just begun the
journey in our quest for fast data. If you are on a similar
journey, have ideas, or would like to collaborate to move
Flink forward, we would like to hear from you.

08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 

Recently uploaded (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 

Flink forward-2017-netflix keystones-paas

  • 1. SPaaS Stream Processing with Flink at Netflix Monal Daxini Engineering Manager, Stream Processing FlinkForward 2017 @monaldax
  • 2. Challenges & Lessons Learnt We’ll look at our Ingest pipeline, SPaaS, Use cases, Challenges, and Lessons learnt SPaaS Stream Processing as a Service SPaaS Use Cases Ingest Pipeline
  • 3. Challenges & Lessons Learnt Let’s start with Keystone Ingest pipeline SPaaS Stream Processing as a Service SPaaS Use Cases Ingest Pipeline
  • 4. Event Producers Sinks To reliably publish and process events, one needs highly available Ingest pipelines - the backbone of a real-time data infrastructure
  • 5. Keystone pipeline is built on 3 core subsystems: Keystone Stream Processing (SPaaS), Keystone Management, and Keystone Messaging. It does not impact members' ability to play videos. 100% in AWS
  • 6. Keystone Management – Provision a data stream (mini pipeline) 📽
  • 7. Keystone Management – Update filter 📽
  • 8. Keystone Management - Filter DSL & Message Parser * We would like to move away from xpath & our custom parser
  • 9. Keystone Management - Configure output (sinks / destination)
  • 10. Keystone Management – ElasticSearch sink config
  • 11. Keystone Management – Projection 📽
  • 12. Provisioning of a data stream generates dashboard & alert configurations Event Processing
  • 13. Keystone Management – Tooling 📽
  • 14. Keystone Management – Sample report, Inactive streams
  • 15. Keystone Management – Sample Report, Under provisioned streams
  • 16. Challenges & Lessons Learnt Let’s explore event flow in Keystone pipeline, and its capabilities SPaaS Stream Processing as a Service SPaaS Use Cases Ingest Pipeline Event Flow / Capabilities
  • 17. Keystone pipeline system boundary. Diagram labels: Event Producer, KCW, KSGateway, Fronting Kafka, Flink SPaaS Router, EMR, Consumer Kafka, Stream Consumers, Keystone Management. * Does not impact video playability
  • 18. Events are published via a proxy or a Kafka client wrapper (KCW) (same diagram as slide 17)
  • 19. Events land in the fronting Kafka cluster (same diagram as slide 17)
  • 20. Events are polled by the router, and filter and projection are applied (same diagram as slide 17)
  • 21. Router sends events to the destination (same diagram as slide 17)
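The filter and projection step described in slides 20 and 21 can be illustrated with a minimal sketch in plain Python. This is not the Keystone router or the Flink API; the function name `route`, the event fields, and the predicate are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterable, Iterator

Event = dict


def route(events: Iterable[Event],
          keep: Callable[[Event], bool],
          fields: list[str]) -> Iterator[Event]:
    """Apply a filter, then a projection, to each polled event."""
    for event in events:
        if keep(event):
            # Projection: forward only the requested fields that are present.
            yield {name: event[name] for name in fields if name in event}


# Hypothetical events standing in for records polled from the fronting Kafka.
events = [
    {"type": "play", "title_id": 1, "device": "tv", "debug": "x"},
    {"type": "pause", "title_id": 1, "device": "tv"},
    {"type": "play", "title_id": 2, "device": "phone"},
]

routed = list(route(events, lambda e: e["type"] == "play", ["title_id", "device"]))
```

Here the filter keeps only `play` events and the projection drops every field except `title_id` and `device` before the event would be handed to a sink.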
  • 22. Keystone Pipeline Capabilities • At-least-once delivery semantics* • Data stream Isolation • Inject event metadata • GUID, timestamp, host, app
  • 23. Keep data loss < 1% per day per data stream for infrastructure migration or deployments
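The loss target on slide 23 can be expressed as a simple per-stream, per-day check. The sketch below assumes that published and delivered event counts are available for each data stream; the function names and the way the counts are obtained are assumptions, not something the slides describe.

```python
def loss_fraction(published: int, delivered: int) -> float:
    """Fraction of published events that never reached the sink."""
    if published <= 0:
        return 0.0
    # With at-least-once delivery, duplicates can push delivered above
    # published, so the loss is clamped at zero.
    return max(published - delivered, 0) / published


def within_budget(published: int, delivered: int, budget: float = 0.01) -> bool:
    """True when the daily loss for one data stream stays under the budget."""
    return loss_fraction(published, delivered) < budget


ok = within_budget(1_000_000, 995_000)        # 0.5% lost
breached = within_budget(1_000_000, 980_000)  # 2% lost
```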
  • 24. Keystone Pipeline Capabilities • Scales based on traffic (externally driven) • Producer & Router • Kafka Cluster failover • Kafka Kong
  • 25. Kafka Kong, once a week. Keystone pipeline is up 24x7 and availability is key. In the spirit of Chaos Kong, we do Kafka Kong
  • 26. Automated kafka cluster failover Flink Router Fronting Kafka Event Producer X Bring up backup Kafka cluster Flink Router1
  • 28. Time is of the essence - failover in as little as 5 minutes. Fully automated fronting Kafka failover
  • 29. Challenges & Lessons Learnt Let’s look at Stream Processing as a Service SPaaS Stream Processing as a Service SPaaS Use Cases Ingest Pipeline Event Flow / Capabilities
  • 30. SPaaS enables point & click pipelines and custom jobs. Flink is currently used for two broad classes of use cases in SPaaS. Slide labels: Point & Click Pipelines; Filtering & Projection (in prod); DSL (future); Custom Code (staging with prod data)
  • 31. High level SPaaS Architecture. Diagram labels: Container Runtime (Titus); Point & Click or Job DSL (Future); Custom Code; Continuous Delivery Platform; Keystone Management. Steps: 1. Create, 2. Submit Stream Processing Job / Submit Job DSL (Future), 3. Launch Job. Paths: Point and Click; Custom code (upcoming)
  • 32. Stream processing platform layered cake offers flexible services to our internal customers (engineers) AWS EC2 Container Runtime SPaaS-Core Reusable Blocks ES, Kafka, & Hive Sink (Flink in test) Routers Stream Processing Applications Reusable Blocks - Early days Keystone & Kafka Sink, Complex Sessionization
  • 33. Flink Job deployment on container runtime. Diagram labels: Zookeeper; Job Manager (master); Job Manager (standby); Task Manager; Titus Job; Titus Host 1 … 5; IP (AWS VPC ENI); Fronting Kafka (offset checkpointing); Checkpoint / Snapshot; steps 1. and 2.
  • 34. SPaaS runs on Titus, Netflix's in-house container runtime. Diagram labels: Titus UI; Titus UI (CI/CD); Titus API; Titus Master (Job Management & Scheduler); Fenzo; Mesos Master; Cassandra; Zookeeper; S3; EC2 Autoscaling API; Docker Registry; Rhea; Titus Agent (mesos agent, Titus executor, docker, zfs, metrics agent, logging agent, containers, SPaaS-Job)
  • 35. Challenges & Lessons Learnt Let’s look at Stream Processing use cases SPaaS Stream Processing as a Service 3 SPaaS Use Cases Ingest Pipeline Event Flow / Capabilities
  • 36. 1. Keystone pipeline Router is a massively parallel use case • Every data stream is independent and isolated • No dependencies between tasks • Chained operator • Only state – Kafka offset checkpointing Job Plan from JobManager UI
  • 37. Broker Router – each Flink job reads from one topic, and each task independently polls events from assigned partitions ES Router Kafka Router
  • 38. Prod Scale processed by ES & Kafka Flink routers, and Hive Samza routers • 1,300,000,000,000+ events processed / day • 3+ PB in, 9+ PB out / day • 99%+ availability YTD
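The daily totals on slide 38 can be turned into rough per-second and per-event figures. The arithmetic below treats the slide's lower bounds as exact values, assumes decimal petabytes (10^15 bytes), and assumes traffic is spread evenly over the day, so the results are averages rather than peak numbers.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Lower bounds quoted on the slide, treated as exact for this estimate.
events_per_day = 1_300_000_000_000
bytes_in_per_day = 3 * 10**15    # assumption: 1 PB = 10**15 bytes
bytes_out_per_day = 9 * 10**15

events_per_second = events_per_day / SECONDS_PER_DAY
avg_event_bytes = bytes_in_per_day / events_per_day
fan_out = bytes_out_per_day / bytes_in_per_day
```

Under these assumptions the pipeline averages roughly 15 million events per second, with an average inbound event size of about 2.3 KB, and writes about three bytes out for every byte in.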
  • 39. Prod – trending events (approximate) ≅ 80B to 1.3T
  • 40. Prod Scale – only Kafka and ES Flink routers are deployed in prod (Hive output Flink routers are in test, unaccounted below) • 4000+ Kafka brokers, 50+ clusters • 100s of Data Streams (Flink Jobs) • 3700+ Docker containers running • 1400+ nodes with 22K+ CPU cores
  • 41. Router has large scale in terms of volume and overall deployed streams in the cloud, which leads to challenges unnoticed otherwise • S3 checkpointing backend • S3 outage = router downtime, rely on Kafka offset commit only, like Samza • Pressure on S3 if deployment is not staggered • Disable distributed checkpointing, only JobManager writes to checkpointing backend.
  • 42. Router has large scale in terms of volume and overall deployed streams in the cloud, which leads to challenges unnoticed otherwise • A failed task can cause the job to restart (JVM running) • Need Fine grained recovery (Phase 1), FLIP-1 • Failures at times can cause a few more duplicates than Samza
  • 43. Flink Job deployment on container runtime. Diagram labels: Zookeeper; Job Manager (master); Job Manager (standby); Task Manager; Titus Job; Titus Host 1 … 5; IP (AWS VPC ENI); Fronting Kafka (offset checkpointing); Checkpoint / Snapshot; steps 1. and 2.; X Causes Flink Job Restart
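Why a job restart produces duplicates under at-least-once delivery with offset checkpointing can be shown with a small simulation. This is a plain Python model, not Flink or Kafka code; the function name, the checkpoint interval, and the failure point are all hypothetical.

```python
def run_with_failure(records, checkpoint_every, fail_at):
    """Deliver records in order, failing once and restarting from the last checkpoint."""
    delivered = []
    committed = 0   # next offset to read after a restart
    offset = 0
    failed = False
    while offset < len(records):
        if offset == fail_at and not failed:
            # Simulated task failure: the job restarts and rewinds to the
            # last checkpointed offset, so later records are replayed.
            failed = True
            offset = committed
            continue
        delivered.append(records[offset])
        offset += 1
        if offset % checkpoint_every == 0:
            committed = offset  # checkpoint the consumed offset
    return delivered


delivered = run_with_failure(list(range(10)), checkpoint_every=4, fail_at=6)
duplicates = len(delivered) - len(set(delivered))
```

In this run nothing is lost, but the two records delivered after the last checkpoint are delivered again after the restart. The number of duplicates per failure is bounded by the checkpoint interval, which is one reason restart scope and checkpoint frequency matter at this scale.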
  • 44. Measurable cost savings moving ES and Kafka routers from Samza to Flink • Disclaimer: When comparing Flink and Samza, you may observe different results in your own environment and setup • This is not an exact apples-to-apples comparison. Observed significant savings by migrating ES and Kafka routers to Flink on the new container runtime vs Samza on the old container runtime
  • 45. 2. Enriching User Video Plays with “discovery” attributes using Flink • Talk to other live services • Integration with IPC ecosystem • Needs high throughput • Small state • O(100M) events / day Not in production No Keystone Management support yet
  • 46. The challenges with the event enrichment use case ● Access data (slow / fast changing) from live or static sources ● Play nice, avoid member streaming impact ● Reliability and stability ● Dependency Isolation (Jar Hell) ● Backfill – (historical data / deal with bugs)
  • 47. 3. Complex sessionization of user events using Flink • Create sessions with start and end events, determined based on event payload and event time order (punctuated) • Handle late, and out-of-order events • 2 to 24 hour session window duration • O(10B) events per day, testing with a small fraction of this volume (flink job state 100GB+) Not in production No Keystone Management support yet
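The idea of sessions delimited by start and end events, with out-of-order arrivals, can be sketched in plain Python. This is not the custom Flink session window mentioned in the slides. Sorting a finite batch by event time stands in for buffering events until they can be ordered, and the field names `user`, `kind`, and `ts` are assumptions for illustration.

```python
from collections import defaultdict


def sessionize(events, max_duration=24 * 3600):
    """Group each user's events into sessions bounded by 'start' and 'end' markers."""
    by_user = defaultdict(list)
    for e in events:
        by_user[e["user"]].append(e)

    sessions = []
    for user, user_events in by_user.items():
        # Re-order by event time so out-of-order arrivals land in the right session.
        user_events.sort(key=lambda e: e["ts"])
        current = None
        for e in user_events:
            if current is not None and e["ts"] - current[0]["ts"] > max_duration:
                sessions.append((user, current))  # cap the session duration
                current = None
            if e["kind"] == "start":
                if current is not None:
                    sessions.append((user, current))  # previous end marker missing
                current = [e]
            elif current is not None:
                current.append(e)
                if e["kind"] == "end":
                    sessions.append((user, current))
                    current = None
            # Events outside any open session are dropped in this sketch.
        if current is not None:
            sessions.append((user, current))  # still open at end of input
    return sessions


events = [
    {"user": "a", "kind": "end", "ts": 30},    # arrives before its start event
    {"user": "a", "kind": "start", "ts": 10},
    {"user": "a", "kind": "play", "ts": 20},
    {"user": "b", "kind": "start", "ts": 5},
    {"user": "a", "kind": "start", "ts": 100},
]
result = sessionize(events)
```

A streaming implementation cannot sort the whole input; it would need keyed state and timers to hold events until they can be ordered, which is where the large state mentioned on the slide comes from.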
  • 48. The challenges with complex sessionization use case with large state ● Flink supported session window with gap duration is not sufficient ● Developed custom, complex session windows - done ● Large state & large scale ● Quick checkpoints, and fast recovery from job failures ● Incremental checkpointing ● Exploring other storage strategies with Flink community
  • 49. We have realized that these three use cases represent a large set of challenges / requirements needed from SPaaS platform ● Router - Massively Parallel - almost no state, very large scale ● Event enrichment - small state, medium scale ● Complex sessionization – large state, large scale
  • 50. In addition, there are several other challenges across use cases ● Developer tooling & Testing ● Insight and Operations ● Continuity through upgrades & deployments ● Data parity & Canary tooling ● Thinking streaming first – always on, operational responsibilities ● Cross region event aggregation and routing ● Auto scaling & capacity planning
  • 52. We are contributing by running Flink at scale in the cloud (pioneer tax), and more ● Metrics, Operations, Deployment ● Custom, complex session windows ● Fault tolerance, large State management ● Challenges related to massively parallel codebase ● Adaptation of our Patch - https://github.com/apache/flink/pull/3312
  • 53. Conclusion: You got a glimpse of how we are leveraging Flink as part of our stream processing platform to serve the business insights of other engineers at Netflix. We have come a long way, however we have just begun the journey in our quest for Fast data. If you are on a similar journey or have ideas, or would like to collaborate to move Flink forward, we would like to hear from you.