The need to glean answers from unbounded data streams is moving from a nicety to a necessity. Netflix is a data-driven company, and needs to process over 1 trillion events a day, amounting to 3 PB of data, to derive business insights.
To ease extracting insight, we are building a self-serve, scalable, fault-tolerant, multi-tenant "Stream Processing as a Service" platform so users can focus on data analysis. I'll share our experience using Flink to help build the platform.
1. SPaaS
Stream Processing with Flink at Netflix
Monal Daxini
Engineering Manager,
Stream Processing
FlinkForward 2017
@monaldax
2. Agenda
We'll look at our Ingest Pipeline, SPaaS (Stream Processing as a Service), SPaaS Use Cases, and Challenges & Lessons Learnt.
3. Let's start with the Keystone Ingest Pipeline.
16. Let's explore the event flow in the Keystone pipeline, and its capabilities.
17. Keystone pipeline system boundary
[Diagram: Event Producers publish through a Kafka client wrapper (KCW) or the KSGateway into Fronting Kafka; the SPaaS Router (Flink) reads from Fronting Kafka and writes to Consumer Kafka and EMR for Stream Consumers; Keystone Management controls the pipeline.]
* Does not impact video playability
18. Events are published via a proxy (KSGateway) or a Kafka client wrapper (KCW)
[Diagram: same pipeline, highlighting the Event Producer publishing through the KCW or KSGateway.]
19. Events land in the Fronting Kafka cluster
[Diagram: same pipeline, highlighting Fronting Kafka.]
20. Events are polled by the router; filter and projection are applied
[Diagram: same pipeline, highlighting the SPaaS Router (Flink) reading from Fronting Kafka.]
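The per-event work of the router (apply a user-configured filter, then project the event down to the requested fields) can be sketched outside any framework. A minimal Python illustration; the event fields and helper names are hypothetical, not Keystone's actual schema or API:

```python
# Illustrative sketch of the router's per-event work: apply a
# configured filter predicate, then project the event down to
# the requested fields before handing it to a sink.

def make_router(filter_fn, projected_fields):
    """Build a routing function from a filter predicate and a projection."""
    def route(event):
        if not filter_fn(event):
            return None  # event dropped by the filter
        return {k: event[k] for k in projected_fields if k in event}
    return route

route = make_router(
    filter_fn=lambda e: e.get("type") == "play",   # hypothetical predicate
    projected_fields=["device", "title_id"],
)

print(route({"type": "play", "device": "tv", "title_id": 42, "raw": "..."}))
# -> {'device': 'tv', 'title_id': 42}
print(route({"type": "pause", "device": "tv"}))
# -> None (filtered out)
```

In the real pipeline this logic runs inside a chained Flink operator per stream; the sketch only shows the filter-then-project shape.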
28. Time is of the essence: failover in as fast as 5 minutes
Fully automated Fronting Kafka failover
29. Let's look at Stream Processing as a Service.
30. SPaaS enables point & click pipelines and custom jobs
Flink is currently used for two broad classes of use cases in SPaaS:
• Point & Click pipelines with filtering & projection (in prod)
• Custom code (staging with prod data)
• DSL (future)
31. High-level SPaaS architecture
[Diagram: 1. Create a stream processing job via Point & Click, Job DSL (future), or custom code. 2. Submit it through Keystone Management (point and click) or the Continuous Delivery Platform (custom code, upcoming). 3. Launch the job on the container runtime (Titus).]
33. Flink job deployment on the container runtime
[Diagram: a Titus Job spans Titus Hosts 1 through 5, each task getting its own IP via an AWS VPC ENI. A master JobManager and a standby JobManager coordinate through Zookeeper; TaskManagers run on the Titus Hosts. (1) Offsets are checkpointed to Fronting Kafka; (2) checkpoints/snapshots are written to the checkpoint backend.]
34. SPaaS runs on Titus, Netflix's in-house container runtime
[Diagram: Titus architecture. The Titus Master (Job Management & Scheduler, Fenzo) runs atop a Mesos Master, with Zookeeper, Cassandra, S3, and the EC2 Autoscaling API; the Titus API (Rhea) serves the Titus UI and CI/CD. Each Titus Agent host runs a Titus executor plus metrics and logging agents, a mesos agent, zfs, and Docker containers (including the SPaaS job) pulled from a Docker Registry.]
35. Let's look at the three SPaaS use cases.
36. 1. The Keystone pipeline Router is a massively parallel use case
• Every data stream is independent and isolated
• No dependencies between tasks
• Chained operators
• Only state: Kafka offset checkpointing
Job plan from the JobManager UI
37. Router: each Flink job reads from one topic, and each task independently polls events from its assigned partitions
[Diagram: Kafka brokers feeding an ES router and a Kafka router.]
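The partition-per-task model, with committed Kafka offsets as the only state, can be sketched as follows. This is an illustrative Python toy with an in-memory "broker" dict; it is not Kafka's or Flink's client API:

```python
# Sketch: each router task polls only its assigned partitions and
# tracks a per-partition offset -- the job's only state. On recovery,
# a task resumes from its last committed offsets.

class RouterTask:
    def __init__(self, assigned_partitions):
        # offset per partition; this is what gets checkpointed
        self.offsets = {p: 0 for p in assigned_partitions}

    def poll(self, broker):
        """Drain all new events from the task's assigned partitions."""
        out = []
        for partition, offset in self.offsets.items():
            log = broker[partition]
            out.extend(log[offset:])
            self.offsets[partition] = len(log)  # advance committed offset
        return out

broker = {0: ["a", "b"], 1: ["c"], 2: ["d", "e", "f"]}
task = RouterTask(assigned_partitions=[0, 2])  # partition 1 belongs to another task
print(task.poll(broker))   # -> ['a', 'b', 'd', 'e', 'f']
print(task.offsets)        # -> {0: 2, 2: 3}
print(task.poll(broker))   # -> [] (nothing new)
```

Because tasks share no state and each owns its partitions, the job scales by adding tasks, which is what makes the Router "massively parallel".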
38. Prod scale processed by the ES and Kafka Flink routers, and the Hive Samza routers
• 1,300,000,000,000+ events processed / day
• 3+ PB in, 9+ PB out / day
• 99%+ availability YTD
40. Prod scale: only the Kafka and ES Flink routers are deployed in prod (Hive-output Flink routers are in test, unaccounted below)
• 4,000+ Kafka brokers, 50+ clusters
• 100s of data streams (Flink jobs)
• 3,700+ Docker containers running
• 1,400+ nodes with 22K+ CPU cores
41. The Router is large scale in terms of volume and overall deployed streams in the cloud, which surfaces challenges that would otherwise go unnoticed
• S3 checkpointing backend
• S3 outage = router downtime; rely on Kafka offset commits only, like Samza
• Pressure on S3 if deployments are not staggered
• Disable distributed checkpointing; only the JobManager writes to the checkpointing backend
42. The Router is large scale in terms of volume and overall deployed streams in the cloud, which surfaces challenges that would otherwise go unnoticed
• A failed task can cause the whole job to restart (with JVMs still running)
• Need fine-grained recovery (Phase 1), FLIP-1
• Failures can at times cause a few more duplicates than Samza
43. Flink job deployment on the container runtime
[Diagram: the same Titus deployment as slide 33 (JobManagers coordinating via Zookeeper, TaskManagers on Titus Hosts 1 through 5 with per-task IPs via AWS VPC ENI, offset checkpointing to Fronting Kafka, checkpoints/snapshots to the backend), annotated to show that a failed TaskManager (X) causes a Flink job restart.]
44. Measurable cost savings moving the ES and Kafka routers from Samza to Flink
• Disclaimer: when comparing Flink and Samza, you may observe different results in your own environment and setup
• This is not an exact apples-to-apples comparison
We observed significant savings by migrating the ES and Kafka routers to Flink on the new container runtime, versus Samza on the old container runtime
45. 2. Enriching user video plays with "discovery" attributes using Flink
• Talks to other live services
• Integration with the IPC ecosystem
• Needs high throughput
• Small state
• O(100M) events / day
Not in production; no Keystone Management support yet
46. The challenges with the event enrichment use case
● Accessing data (slow- or fast-changing) from live or static sources
● Playing nice: avoiding impact on member streaming
● Reliability and stability
● Dependency isolation (JAR hell)
● Backfill (historical data / dealing with bugs)
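One common way to access slow-changing data without hitting a live source on every event is a TTL cache in front of the lookup. A hedged Python sketch of that pattern; the service call, field names, and attribute values are hypothetical stand-ins, not Netflix's IPC ecosystem:

```python
import time

class SlowChangingCache:
    """Caches lookups with a TTL so every event doesn't hit the source."""
    def __init__(self, loader, ttl_seconds=300):
        self.loader = loader
        self.ttl = ttl_seconds
        self.entries = {}  # key -> (value, fetched_at)

    def get(self, key):
        value, fetched_at = self.entries.get(key, (None, None))
        if fetched_at is None or time.monotonic() - fetched_at > self.ttl:
            value = self.loader(key)  # refresh from the source
            self.entries[key] = (value, time.monotonic())
        return value

def lookup_discovery_attributes(title_id):
    # Stand-in for a call to a live "discovery" service.
    return {"row": 3, "rank": 7}

cache = SlowChangingCache(lookup_discovery_attributes)

def enrich(play_event):
    """Merge discovery attributes into the play event."""
    return {**play_event, **cache.get(play_event["title_id"])}

print(enrich({"member": 1, "title_id": 42}))
# -> {'member': 1, 'title_id': 42, 'row': 3, 'rank': 7}
```

Fast-changing attributes would instead go through the live (ideally asynchronous) service call per event; the TTL bounds how stale the cached, slow-changing side can get.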
47. 3. Complex sessionization of user events using Flink
• Create sessions with start and end events, determined from the event payload and event-time order (punctuated)
• Handle late and out-of-order events
• 2- to 24-hour session window durations
• O(10B) events per day; testing with a small fraction of this volume (Flink job state 100 GB+)
Not in production; no Keystone Management support yet
48. The challenges with the complex sessionization use case with large state
● Flink's built-in session window with a gap duration is not sufficient
● Developed custom, complex session windows (done)
● Large state & large scale
● Quick checkpoints and fast recovery from job failures
● Incremental checkpointing
● Exploring other storage strategies with the Flink community
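The punctuated sessionization described above (session boundaries taken from start/end markers in the event payload, with buffered events restored to event-time order) can be sketched roughly as follows. This is an illustrative Python toy; the field names and the batch-sorting shortcut for handling out-of-order arrival are assumptions, not the actual custom Flink windows:

```python
# Sketch of punctuated sessionization: sessions are delimited by
# 'start'/'end' markers in the payload rather than a fixed gap.
# Buffered events are first restored to event-time order so late,
# out-of-order arrivals land inside the right session.

def sessionize(events):
    """Group events into sessions delimited by start/end markers."""
    sessions, current = [], None
    for event in sorted(events, key=lambda e: e["ts"]):  # event-time order
        if event["kind"] == "start":
            current = [event]            # open a new session
        elif current is not None:
            current.append(event)
            if event["kind"] == "end":
                sessions.append(current)  # close and emit the session
                current = None
    return sessions

events = [
    {"ts": 1, "kind": "start"},
    {"ts": 3, "kind": "end"},
    {"ts": 2, "kind": "beacon"},  # arrived late / out of order
    {"ts": 5, "kind": "start"},
    {"ts": 6, "kind": "end"},
]
print([len(s) for s in sessionize(events)])  # -> [3, 2]
```

A streaming implementation cannot sort a complete batch; it keeps per-key session state across checkpoints and uses watermarks to bound how long it waits for late events, which is exactly where the large-state and fast-recovery challenges above come from.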
49. We have realized that these three use cases represent a large set of the challenges / requirements the SPaaS platform must meet
● Router: massively parallel, almost no state, very large scale
● Event enrichment: small state, medium scale
● Complex sessionization: large state, large scale
50. In addition, there are several other challenges across use cases
● Developer tooling & Testing
● Insight and Operations
● Continuity through upgrades & deployments
● Data parity & Canary tooling
● Thinking streaming first – always on, operational responsibilities
● Cross region event aggregation and routing
● Auto scaling & capacity planning
52. We are contributing by running Flink at scale in the cloud (pioneer tax), and more
● Metrics, operations, deployment
● Custom, complex session windows
● Fault tolerance, large state management
● Challenges related to a massively parallel codebase
● Adaptation of our patch: https://github.com/apache/flink/pull/3312
53. Conclusion
You got a glimpse of how we are leveraging Flink as part of our stream processing platform to serve the business-insight needs of other engineers at Netflix.
We have come a long way; however, we have only just begun the journey in our quest for fast data. If you are on a similar journey, have ideas, or would like to collaborate to move Flink forward, we would like to hear from you.