Over 100 million subscribers from over 190 countries enjoy the Netflix service. This leads to over a trillion events, amounting to 3 PB, flowing through the Keystone infrastructure per day to help improve customer experience and glean business insights. The self-serve Keystone stream processing service processes these messages in near real-time, with at-least-once semantics, in the cloud. This enables users to focus on extracting insights rather than building out scalable infrastructure. I’ll share the details of this platform and our experience building it.
2. What Do I Get Out Of This Talk? (@monaldax)
Organized based on different roles or perspectives:
● Data Engineer: Why stream processing, and what does the platform offer?
● Data Leader: Product / vision of a stream processing platform
● Platform Engineer: How we build and operate a stream processing platform
3. Scope Of This Talk
● I will focus on the stream processing platform for business insights, which my team builds, mostly based on Flink
● I won’t:
  ● Address operational insights, for which we have different systems
  ● Compare stream processing engines, or cover stream processing concepts
6. Why Real-Time Data?
● Low-latency business insights and analytics
● Processing data as it arrives helps spread workload over time and reduce processing redundancy
● The need to process unbounded data sets is becoming increasingly common
7. Why Build A Stream Processing Platform?
● Enable users to focus on data and business insights, and not worry about building stream processing infrastructure and tooling
9. The Platform Needs To Offer A Robust Way To Process Streams, Allowing A Tradeoff Between Ease, Capability, And Flexibility – SPaaS
10. The Stream Processing as a Service Platform Offers
● Point & click routing, filtering, and projection
● Streaming jobs
● Streaming SQL support (future)
● Interactive exploration of streams for quick prototyping (future)
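The point & click tier boils down to declarative routing: a filter plus a projection applied to each event. A minimal Python sketch of that idea – the config shape and field names are illustrative assumptions, not the platform's actual API:

```python
# Illustrative sketch (not Netflix's actual API): a point & click router
# reduces to a declarative config of filter predicates and projected fields.

def make_router(config):
    """Build a routing function from a declarative filter + projection config."""
    def route(event):
        # Filtering: drop events that fail any equality predicate.
        for field, expected in config.get("filter", {}).items():
            if event.get(field) != expected:
                return None
        # Projection: keep only the requested fields.
        return {field: event.get(field) for field in config["project"]}
    return route

route = make_router({
    "filter": {"event_type": "play"},          # hypothetical field names
    "project": ["member_id", "title_id"],
})

events = [
    {"event_type": "play", "member_id": 1, "title_id": 70143836, "ua": "tv"},
    {"event_type": "pause", "member_id": 1, "title_id": 70143836, "ua": "tv"},
]
routed = [r for r in (route(e) for e in events) if r is not None]
```

The same config could be produced by a UI form, which is what makes the point & click experience possible without writing job code.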
28. Data Stream Operations Are Managed
• Fully managed scaling
• Managed capacity planning
• 24 x 7 availability at scale
• Garbage-collect unused streams
29. Keystone Pipeline – The Road Ahead
• Additional components – UDFs, Data Hygiene, Data Alerting, etc.
• Component chaining in the UI
• Schema Support
• Data Lineage
• Cost attribution
49. Stateless Streaming Job Use Case: High-Level Architecture – Enriching And Identifying Certain Plays
[Diagram: a streaming job consumes Play Logs and enriches them with lookup data from a live Playback History Service and Video Metadata]
52. Search Personalization – Custom Windowing On Out-of-order Events
[Diagram: session windows built from start (S) and end (E) events that arrive out of order, with sessions spanning hours]
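The custom windowing above can be illustrated with a toy sessionizer: events separated by more than an inactivity gap start a new session. This sketch simply buffers and sorts to tolerate out-of-order arrival, whereas a real engine such as Flink handles late data incrementally with watermarks; the gap value and event shape are assumptions:

```python
# Minimal session-windowing sketch: illustrates the semantics, not the
# incremental watermark-based mechanics of a real stream processor.

def sessionize(timestamps, gap):
    """Group event timestamps into sessions separated by > `gap` of inactivity."""
    sessions = []
    for ts in sorted(timestamps):       # tolerate out-of-order arrival
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)     # within the gap: extend current session
        else:
            sessions.append([ts])       # inactivity gap exceeded: new session
    return sessions

# Events arrive out of order; a gap of more than 5 time units closes a session.
arrivals = [10, 1, 2, 30, 11]
```

Here `sessionize(arrivals, 5)` yields three sessions even though the events arrived interleaved.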
53. Stateful Streaming Application With Local State, Checkpoints, And Savepoints
[Diagram: a streaming application on the Flink engine reads from sources, keeps local state, and writes to sinks; checkpoints are taken automatically, savepoints are explicitly triggered]
54. Streaming Job (Flink) Savepoint Tooling Support
• Amazon S3-based multi-tenant storage management
• Automatic savepoint, and resume from the savepoint on redeploy
• Resume from an existing savepoint
55. Streaming Job (Flink) High-Level Features
• Stateless jobs
• Event enrichment support, by accessing services using platform thick clients
• Stateful jobs with state in the 100s of GB, with larger-state support in the works
• Reusable blocks (in progress)
• Job development, deployment, and monitoring tooling (alpha)
56. Streaming Jobs – The Road Ahead
• Easy resource-provisioning estimates
• Flink support for reading from and writing to the data warehouse, and for backfill
• Continue to evolve tooling and support for large state
• Reusable components – sources, sinks, operators, schema support, data hygiene
• Tooling support for Spark Streaming
58. Prod – Trending Events And Scale, With Events Flowing To Hive, Elasticsearch, And Kafka
• Events processed per day grew from ≅ 80B to 1.3T+
• 600B to 1T unique events per day
• 2+ PB in, 4.5+ PB out per day
• Peak: 12M events / sec and 36 GB / sec
61. RTDI Consists Of 4 Systems. The Keystone Pipeline Runs 24 x 7 And Does Not Impact Members’ Ability To Play Videos
[Diagram: Keystone Stream Processing (SPaaS), Keystone Management, and Keystone Messaging, running 24 x 7 across dev, test, and prod, with granular shadowing]
69. Why Kafka?
• Handles message sizes > 1 MB and up to 10 MB
• Large-scale Keystone ingest pipelines result in large fan-out
• Lower latency – used for ad-hoc messaging as well
• Open source – we can enhance, patch, or extend it
• Con: it is not a managed service
70. Scale For Large Fan-out And Isolation – Cascading Topology
[Diagram: a fronting Kafka cluster cascades into multiple consumer Kafka clusters]
71. Alternative: Logical Stream (Topic) Spread Across Multiple Topics Across Multiple Clusters (WIP)
[Diagram: a multi-cluster producer writes one logical topic across several Kafka clusters; a multi-cluster consumer reads it back as a single stream]
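One way to picture the multi-cluster producer: keyed messages hash to a stable (cluster, partition) pair so per-key ordering survives, while unkeyed messages round-robin across clusters. This is a hypothetical sketch of the placement logic only, not Netflix's work-in-progress client:

```python
# Hypothetical placement logic for one logical topic spread across clusters.
import itertools
import zlib

class MultiClusterProducer:
    def __init__(self, clusters, partitions_per_topic):
        self.clusters = clusters
        self.partitions = partitions_per_topic
        self._rr = itertools.cycle(range(len(clusters)))  # round-robin cursor

    def place(self, key=None):
        """Return (cluster, partition) for a message."""
        if key is None:
            # Unkeyed: spread load round-robin; any partition will do.
            return self.clusters[next(self._rr)], None
        # Keyed: stable hash (unlike Python's salted built-in hash()) so the
        # same key always lands on the same cluster and partition.
        h = zlib.crc32(key.encode())
        return self.clusters[h % len(self.clusters)], h % self.partitions

producer = MultiClusterProducer(["kafka-a", "kafka-b", "kafka-c"], 8)
```

A matching multi-cluster consumer would subscribe to all physical clusters and merge the streams back into the logical topic.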
72. Kafka Deployment Strategies – Version 0.10 (YMMV)
• Dedicated Zookeeper cluster per Kafka cluster
• Small clusters: < 200 brokers, <= 10K partitions
• Partitions distributed evenly across brokers
• Rack-aware replica assignment, brokers spread across 3 zones
• 2 copies, with unclean leader election enabled
• Non-transactional
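The replica-placement bullets map onto standard Kafka 0.10 broker settings; the values below are illustrative, not Netflix's actual configuration:

```properties
# Broker side (server.properties) – illustrative values for the strategy above
broker.rack=us-east-1a                 # rack-aware replica assignment, 3 zones
default.replication.factor=2           # "2 copies"
unclean.leader.election.enable=true    # favor availability over consistency
```

Enabling unclean leader election trades potential message loss for availability, which fits a pipeline that already guarantees only at-least-once delivery.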
76. Streaming Jobs (Flink 1.3.2)
• The Keystone pipeline is built on Flink routers
• Each Flink router is a stream processing job
• Router provisioning is based on incoming traffic or estimates
• Runs in containers atop EC2
• Island mode – a single AWS region
77. High-level Stream Processing Platform Architecture
[Diagram: Keystone Management drives a container runtime]
1. Create a streaming job – point & click, or a custom streaming job
2. Launch the job with config overrides
3. Launch containers – immutable image, user-driven config overrides
79. Flink Job Cluster In HA Mode
[Diagram: a leader Job Manager (with WebUI) and a second Job Manager coordinate through Zookeeper and drive multiple Task Managers]
One dedicated Zookeeper cluster serves all streaming jobs
83. Checkpoints Are Taken Often
[Diagram: inside an AWS VPC, Task Managers run as a Titus job across Titus hosts; a master and a standby Job Manager coordinate via Zookeeper; state – checkpoints and the Kafka offset – is saved regularly]
84. Checkpoints Are Taken Often. A Container Could Fail…
[Diagram: the same Titus deployment, with one Task Manager container failing while checkpoints and the Kafka offset continue to be saved]
85. Failed Container Automatically Replaced. State Restored To Last Checkpoint; Partial Recovery Supported
[Diagram: a replacement Task Manager container is launched, and state – checkpoints and the Kafka offset – is restored from the last checkpoint]
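The recovery behavior in these slides can be sketched in a few lines: checkpoint the (state, offset) pair together, and on failure restore both and re-read the source. Re-reading from the checkpointed offset is what yields the at-least-once semantics mentioned at the start of the talk. A toy simulation, with all names hypothetical:

```python
# Toy simulation of checkpoint/restore. `fail_at` injects a container failure
# when the job reaches that offset; recovery rolls back to the last checkpoint.

def run(events, checkpoint_every, fail_at=None):
    state, offset = {"count": 0}, 0
    checkpoint = ({"count": 0}, 0)              # (state snapshot, Kafka offset)
    processed = []
    while offset < len(events):
        if offset == fail_at:
            # Container fails: restore state AND offset from last checkpoint.
            state, offset = dict(checkpoint[0]), checkpoint[1]
            fail_at = None                      # replacement container continues
            continue
        processed.append(events[offset])        # may reprocess => at-least-once
        state["count"] += 1
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (dict(state), offset)  # atomic (state, offset) snapshot
    return state, processed
```

With a failure injected between checkpoints, the final count is still correct, but one event is processed twice – exactly the at-least-once guarantee.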
90. Keystone Management Unique Features
• The ability to pass data along the chain of joblets within a job
• Locks and semaphores on resources spanning jobs
• Customization and integration into the Netflix ecosystem – Eureka, etc.
92. We Run What We Build!
• No separate Ops team
• No separate QA team
• No separate Dev team
• It’s all done by the developers of the Real Time Data Infrastructure
93. We Leverage Other Netflix Systems
• We rely on metrics, monitoring, alerting & paging, and automation
• A separate metrics system – Atlas
• A separate alert-configuration and alert-action system
• Options for a separate system to run cross-system automation tasks
104. Launch A Backup Kafka Cluster With The Same Number Of Instances, But A Smaller Instance Type
[Diagram: the fronting Kafka cluster has failed; a failover Kafka cluster is brought up, with metadata copied from Zookeeper, while the event producer and Flink router still point at the original cluster]
105. Change The Producer Config To Produce To The Failover Cluster, And Launch Routers For The Failover Traffic
[Diagram: the event producer now writes to the failover Kafka cluster, which is read by a failover Flink router]
106. Change The Producer Config Back To The Original Cluster, And Finish Draining Events From The Failover Flink Router
[Diagram: the producer writes to the recovered fronting Kafka cluster while the failover Flink router drains the remaining events]
107. Decommission The Backup Cluster And Router Once The Original Cluster Is Fixed, Or A Replacement Cluster Is Live
[Diagram: the failover Kafka cluster and the failover Flink router are shut down; traffic flows through the fronting Kafka cluster and Flink router again]
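The four failover slides form an ordered runbook. A toy state-transition sketch of that sequence – step names and the state shape are illustrative, not Netflix's automation:

```python
# Sketch of the Kafka failover sequence as explicit state transitions.

def failover_sequence():
    state = {
        "producer_target": "fronting",
        "clusters": {"fronting"},
        "routers": {"main"},
    }
    log = []
    # 1. Launch backup cluster (same instance count, smaller instance type).
    state["clusters"].add("failover"); log.append("backup cluster up")
    # 2. Point the producer at the failover cluster; launch failover routers.
    state["producer_target"] = "failover"; state["routers"].add("failover")
    log.append("producing to failover")
    # 3. Point the producer back; the failover router finishes draining.
    state["producer_target"] = "fronting"; log.append("draining failover")
    # 4. Decommission the backup cluster and failover router.
    state["clusters"].discard("failover"); state["routers"].discard("failover")
    log.append("decommissioned")
    return state, log
```

After the sequence completes, the system is back to its original topology, which is what makes the same automation reusable as a chaos tool.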
109. Consumer Kafka Clusters
• Failover is currently supported only for fronting Kafka clusters
• We are working on a multi-consumer client with keyed-message support, to enable failover of consumer Kafka clusters
110. This Automation Also Serves As Kafka Kong, A Tool That Follows The Principles Of Chaos Engineering
Kafka Kong runs are planned and regular
111. Kafka Operation Strategies (YMMV)
• Over-provision for traffic variations and for failover
• Broker health and outlier detection, with automatic termination
• 99th-percentile response time
• Broker TCP timeouts, errors, retransmissions
• Producer send latency
112. Kafka Operation Strategies (YMMV)
• Scale up by
  • Adding partitions on new brokers – requires that messages are not keyed
  • Partition reassignment – in small batches, with a custom tool
• Scale down by
  • Creating new topics / new clusters
  • Creating new clusters – using the Kafka failover automation
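The caveat that adding partitions requires unkeyed messages follows from default hash partitioning: a key's placement is hash(key) % partition_count, so changing the count remaps keys and breaks per-key ordering. A sketch using a stable CRC32 hash (the real Kafka partitioner uses a different hash, but the effect is the same):

```python
# Demonstrates why growing the partition count remaps keyed messages.
import zlib

def partition_for(key, num_partitions):
    # Stable hash modulo partition count, mimicking hash partitioning.
    return zlib.crc32(key.encode()) % num_partitions

keys = [f"member-{i}" for i in range(100)]       # hypothetical key space
before = {k: partition_for(k, 8) for k in keys}  # original partition count
after = {k: partition_for(k, 12) for k in keys}  # after adding partitions
moved = [k for k in keys if before[k] != after[k]]
```

Every key in `moved` would start landing on a different partition, interleaving its new messages with old ones still sitting on the previous partition – hence partition reassignment, not partition addition, for keyed topics.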
114. Routers And Streaming Jobs: Fault Tolerance By Design
• Container replacement
• Checkpoints and savepoints
• Keep retrying as long as the event data format is valid
• Isolation – an issue with one sink does not impact another
115. Router Deployment Automation
• Provision new or updated streams
• Bulk updates – terminate routers and re-deploy
• Automatic partial recovery allows zero-touch migration of the underlying container infrastructure
• Manual – KSRunbook
117. Router Capacity Planning And Provisioning
• Per-stream provisioning based on the past week’s traffic, or on a bit-rate estimate
• Provision buffer capacity
• Run 1 additional container for latency-sensitive consumers
• Manual, percentage-based increases – easy to compute and deploy
• Plan capacity to handle service failover and holiday peaks
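The provisioning bullets suggest a simple formula: size from observed or estimated peak traffic, add buffer capacity, and run one extra container for latency-sensitive consumers. A hypothetical sketch – the buffer percentage and per-container throughput are assumptions, not Netflix's actual numbers:

```python
# Illustrative router capacity formula: peak traffic + buffer + 1 extra
# container for latency-sensitive consumers.
import math

def containers_needed(peak_mb_per_sec, per_container_mb_per_sec,
                      buffer_pct=50, latency_sensitive=True):
    # Buffer capacity absorbs traffic variation and failover spillover.
    base = math.ceil(peak_mb_per_sec * (1 + buffer_pct / 100)
                     / per_container_mb_per_sec)
    # One standby container keeps latency low for sensitive consumers.
    return base + (1 if latency_sensitive else 0)
```

Because the inputs are just last week's peak and a percentage, the result is easy to compute, review, and deploy by hand – matching the "manual, percentage-based" approach above.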
118. Admin Tooling To Scale Up Manually, Or To Deploy A New Build
124. Flink Streaming Job
● Split between application and infrastructure
● Metrics and monitoring
● Alerts
● Paging and on-call rotations
● Platform customers follow the same “we run what we build” model
127. Operations – The Road Ahead
● True auto-scaling
● Bootstrap capacity planning for stateful streaming jobs
● Automated canary tooling and data parity
● Quick testing and performance profiling of point & click components
  ● E.g., iterating over a Filter definition
128. I Want To Learn More
● http://bit.ly/mLOOP – Deep dive into unbounded data processing systems
● http://bit.ly/m17FF – Keynote: Stream Processing with Flink at Netflix
● http://bit.ly/2BoYAq0 – Multi-tenant, multi-cluster Kafka messaging service