The need to glean answers from unbounded data streams is moving from a nicety to a necessity. Netflix is a data-driven company, and needs to process over 1 trillion events a day, amounting to 3 PB of data, to derive business insights.
To ease extracting insight, we are building a self-serve, scalable, fault-tolerant, multi-tenant "Stream Processing as a Service" platform so users can focus on data analysis. I'll share our experience using Flink to help build the platform.
1. SPaaS
Stream Processing with Flink at Netflix
Monal Daxini
Engineering Manager,
Stream Processing
FlinkForward 2017
@monaldax
2. Agenda
We'll look at our Ingest Pipeline, SPaaS (Stream Processing as a Service), SPaaS Use Cases, and Challenges & Lessons Learnt.
3. Let's start with the Keystone Ingest Pipeline.
16. Let's explore the event flow in the Keystone pipeline, and its capabilities.
17. Keystone pipeline system boundary
[Diagram: Event Producers publish through a Kafka client wrapper (KCW) or the KSGateway into Fronting Kafka; the SPaaS Router (Flink) reads from Fronting Kafka and writes to Consumer Kafka and EMR for Stream Consumers; Keystone Management controls the pipeline.]
* Does not impact video playability
18. Events are published via a proxy (KSGateway) or a Kafka client wrapper (KCW)
[Diagram: same pipeline, highlighting the Event Producer publishing through the KCW or KSGateway.]
19. Events land in the Fronting Kafka cluster
[Diagram: same pipeline, highlighting Fronting Kafka.]
20. Events are polled by the router; filter and projection are applied
[Diagram: same pipeline, highlighting the SPaaS Router (Flink) reading from Fronting Kafka.]
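The per-event work of the router (apply a user-configured filter, then project the event down to the requested fields) can be sketched outside any framework. A minimal Python illustration; the event fields and helper names are hypothetical, not Keystone's actual schema or API:

```python
# Illustrative sketch of the router's per-event work: apply a
# configured filter predicate, then project the event down to
# the requested fields before handing it to a sink.

def make_router(filter_fn, projected_fields):
    """Build a routing function from a filter predicate and a projection."""
    def route(event):
        if not filter_fn(event):
            return None  # event dropped by the filter
        return {k: event[k] for k in projected_fields if k in event}
    return route

route = make_router(
    filter_fn=lambda e: e.get("type") == "play",   # hypothetical predicate
    projected_fields=["device", "title_id"],
)

print(route({"type": "play", "device": "tv", "title_id": 42, "raw": "..."}))
# -> {'device': 'tv', 'title_id': 42}
print(route({"type": "pause", "device": "tv"}))
# -> None (filtered out)
```

In the real pipeline this logic runs inside a chained Flink operator per stream; the sketch only shows the filter-then-project shape.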
28. Time is of the essence: failover in as fast as 5 minutes
Fully automated Fronting Kafka failover
29. Let's look at Stream Processing as a Service.
30. SPaaS enables point & click pipelines and custom jobs
Flink is currently used for two broad classes of use cases in SPaaS:
• Point & Click pipelines with filtering & projection (in prod)
• Custom code (staging with prod data)
• DSL (future)
31. High-level SPaaS architecture
[Diagram: 1. Create a stream processing job via Point & Click, Job DSL (future), or custom code. 2. Submit it through Keystone Management (point and click) or the Continuous Delivery Platform (custom code, upcoming). 3. Launch the job on the container runtime (Titus).]
33. Flink job deployment on the container runtime
[Diagram: a Titus Job spans Titus Hosts 1 through 5, each task getting its own IP via an AWS VPC ENI. A master JobManager and a standby JobManager coordinate through Zookeeper; TaskManagers run on the Titus Hosts. (1) Offsets are checkpointed to Fronting Kafka; (2) checkpoints/snapshots are written to the checkpoint backend.]
34. SPaaS runs on Titus, Netflix's in-house container runtime
[Diagram: Titus architecture. The Titus Master (Job Management & Scheduler, Fenzo) runs atop a Mesos Master, with Zookeeper, Cassandra, S3, and the EC2 Autoscaling API; the Titus API (Rhea) serves the Titus UI and CI/CD. Each Titus Agent host runs a Titus executor plus metrics and logging agents, a mesos agent, zfs, and Docker containers (including the SPaaS job) pulled from a Docker Registry.]
35. Let's look at the three SPaaS use cases.
36. 1. The Keystone pipeline Router is a massively parallel use case
• Every data stream is independent and isolated
• No dependencies between tasks
• Chained operators
• Only state: Kafka offset checkpointing
Job plan from the JobManager UI
37. Router: each Flink job reads from one topic, and each task independently polls events from its assigned partitions
[Diagram: Kafka brokers feeding an ES router and a Kafka router.]
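The partition-per-task model, with committed Kafka offsets as the only state, can be sketched as follows. This is an illustrative Python toy with an in-memory "broker" dict; it is not Kafka's or Flink's client API:

```python
# Sketch: each router task polls only its assigned partitions and
# tracks a per-partition offset -- the job's only state. On recovery,
# a task resumes from its last committed offsets.

class RouterTask:
    def __init__(self, assigned_partitions):
        # offset per partition; this is what gets checkpointed
        self.offsets = {p: 0 for p in assigned_partitions}

    def poll(self, broker):
        """Drain all new events from the task's assigned partitions."""
        out = []
        for partition, offset in self.offsets.items():
            log = broker[partition]
            out.extend(log[offset:])
            self.offsets[partition] = len(log)  # advance committed offset
        return out

broker = {0: ["a", "b"], 1: ["c"], 2: ["d", "e", "f"]}
task = RouterTask(assigned_partitions=[0, 2])  # partition 1 belongs to another task
print(task.poll(broker))   # -> ['a', 'b', 'd', 'e', 'f']
print(task.offsets)        # -> {0: 2, 2: 3}
print(task.poll(broker))   # -> [] (nothing new)
```

Because tasks share no state and each owns its partitions, the job scales by adding tasks, which is what makes the Router "massively parallel".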
38. Prod scale processed by the ES and Kafka Flink routers, and the Hive Samza routers
• 1,300,000,000,000+ events processed / day
• 3+ PB in, 9+ PB out / day
• 99%+ availability YTD
40. Prod scale: only the Kafka and ES Flink routers are deployed in prod (Hive-output Flink routers are in test, unaccounted below)
• 4,000+ Kafka brokers, 50+ clusters
• 100s of data streams (Flink jobs)
• 3,700+ Docker containers running
• 1,400+ nodes with 22K+ CPU cores
41. The Router is large scale in terms of volume and overall deployed streams in the cloud, which surfaces challenges that would otherwise go unnoticed
• S3 checkpointing backend
• S3 outage = router downtime; rely on Kafka offset commits only, like Samza
• Pressure on S3 if deployments are not staggered
• Disable distributed checkpointing; only the JobManager writes to the checkpointing backend
42. The Router is large scale in terms of volume and overall deployed streams in the cloud, which surfaces challenges that would otherwise go unnoticed
• A failed task can cause the whole job to restart (with JVMs still running)
• Need fine-grained recovery (Phase 1), FLIP-1
• Failures can at times cause a few more duplicates than Samza
43. Flink job deployment on the container runtime
[Diagram: the same Titus deployment as slide 33 (JobManagers coordinating via Zookeeper, TaskManagers on Titus Hosts 1 through 5 with per-task IPs via AWS VPC ENI, offset checkpointing to Fronting Kafka, checkpoints/snapshots to the backend), annotated to show that a failed TaskManager (X) causes a Flink job restart.]
44. Measurable cost savings moving the ES and Kafka routers from Samza to Flink
• Disclaimer: when comparing Flink and Samza, you may observe different results in your own environment and setup
• This is not an exact apples-to-apples comparison
We observed significant savings by migrating the ES and Kafka routers to Flink on the new container runtime, versus Samza on the old container runtime
45. 2. Enriching user video plays with "discovery" attributes using Flink
• Talks to other live services
• Integration with the IPC ecosystem
• Needs high throughput
• Small state
• O(100M) events / day
Not in production; no Keystone Management support yet
46. The challenges with the event enrichment use case
● Accessing data (slow- or fast-changing) from live or static sources
● Playing nice: avoiding impact on member streaming
● Reliability and stability
● Dependency isolation (JAR hell)
● Backfill (historical data / dealing with bugs)
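One common way to access slow-changing data without hitting a live source on every event is a TTL cache in front of the lookup. A hedged Python sketch of that pattern; the service call, field names, and attribute values are hypothetical stand-ins, not Netflix's IPC ecosystem:

```python
import time

class SlowChangingCache:
    """Caches lookups with a TTL so every event doesn't hit the source."""
    def __init__(self, loader, ttl_seconds=300):
        self.loader = loader
        self.ttl = ttl_seconds
        self.entries = {}  # key -> (value, fetched_at)

    def get(self, key):
        value, fetched_at = self.entries.get(key, (None, None))
        if fetched_at is None or time.monotonic() - fetched_at > self.ttl:
            value = self.loader(key)  # refresh from the source
            self.entries[key] = (value, time.monotonic())
        return value

def lookup_discovery_attributes(title_id):
    # Stand-in for a call to a live "discovery" service.
    return {"row": 3, "rank": 7}

cache = SlowChangingCache(lookup_discovery_attributes)

def enrich(play_event):
    """Merge discovery attributes into the play event."""
    return {**play_event, **cache.get(play_event["title_id"])}

print(enrich({"member": 1, "title_id": 42}))
# -> {'member': 1, 'title_id': 42, 'row': 3, 'rank': 7}
```

Fast-changing attributes would instead go through the live (ideally asynchronous) service call per event; the TTL bounds how stale the cached, slow-changing side can get.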
47. 3. Complex sessionization of user events using Flink
• Create sessions with start and end events, determined from the event payload and event-time order (punctuated)
• Handle late and out-of-order events
• 2- to 24-hour session window durations
• O(10B) events per day; testing with a small fraction of this volume (Flink job state 100 GB+)
Not in production; no Keystone Management support yet
48. The challenges with the complex sessionization use case with large state
● Flink's built-in session window with a gap duration is not sufficient
● Developed custom, complex session windows (done)
● Large state & large scale
● Quick checkpoints and fast recovery from job failures
● Incremental checkpointing
● Exploring other storage strategies with the Flink community
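The punctuated sessionization described above (session boundaries taken from start/end markers in the event payload, with buffered events restored to event-time order) can be sketched roughly as follows. This is an illustrative Python toy; the field names and the batch-sorting shortcut for handling out-of-order arrival are assumptions, not the actual custom Flink windows:

```python
# Sketch of punctuated sessionization: sessions are delimited by
# 'start'/'end' markers in the payload rather than a fixed gap.
# Buffered events are first restored to event-time order so late,
# out-of-order arrivals land inside the right session.

def sessionize(events):
    """Group events into sessions delimited by start/end markers."""
    sessions, current = [], None
    for event in sorted(events, key=lambda e: e["ts"]):  # event-time order
        if event["kind"] == "start":
            current = [event]            # open a new session
        elif current is not None:
            current.append(event)
            if event["kind"] == "end":
                sessions.append(current)  # close and emit the session
                current = None
    return sessions

events = [
    {"ts": 1, "kind": "start"},
    {"ts": 3, "kind": "end"},
    {"ts": 2, "kind": "beacon"},  # arrived late / out of order
    {"ts": 5, "kind": "start"},
    {"ts": 6, "kind": "end"},
]
print([len(s) for s in sessionize(events)])  # -> [3, 2]
```

A streaming implementation cannot sort a complete batch; it keeps per-key session state across checkpoints and uses watermarks to bound how long it waits for late events, which is exactly where the large-state and fast-recovery challenges above come from.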
49. We have realized that these three use cases represent a large set of the challenges / requirements the SPaaS platform must meet
● Router: massively parallel, almost no state, very large scale
● Event enrichment: small state, medium scale
● Complex sessionization: large state, large scale
50. In addition, there are several other challenges across use cases
● Developer tooling & Testing
● Insight and Operations
● Continuity through upgrades & deployments
● Data parity & Canary tooling
● Thinking streaming first – always on, operational responsibilities
● Cross region event aggregation and routing
● Auto scaling & capacity planning
52. We are contributing by running Flink at scale in the cloud (pioneer tax), and more
● Metrics, operations, deployment
● Custom, complex session windows
● Fault tolerance, large state management
● Challenges related to a massively parallel codebase
● Adaptation of our patch: https://github.com/apache/flink/pull/3312
53. Conclusion
You got a glimpse of how we are leveraging Flink as part of our stream processing platform to serve the business-insight needs of other engineers at Netflix.
We have come a long way; however, we have only just begun the journey in our quest for fast data. If you are on a similar journey, have ideas, or would like to collaborate to move Flink forward, we would like to hear from you.