SlideShare uma empresa Scribd logo
1 de 71
Baixar para ler offline
Real Time Data Analytics @
UberAnkur Bansal
Apache Big Data Europe
November 14, 2016
About Me
● Sr. Software Engineer, Streaming Team @ Uber
○ Streaming team supports platform for real time data
analytics: Kafka, Samza, Flink, Pinot.. and plenty more
○ Focused on scaling Kafka at Uber’s pace
● Staff software Engineer @ Ebay
○ Build & scale Ebay’s cloud using openstack
● Apache Kylin: Committer, Emeritus PMC
Agenda
● Real time Use Cases
● Kafka Infrastructure Deep Dive
● Our own Development:
○ Rest Proxy & Clients
○ Local Agent
○ uReplicator (Mirrormaker)
○ Chaperone (Auditing)
● Operations/Tooling
Important Use Cases
Stream
Processing
Real-time Price Surging
SURGE
MULTIPLIERS
Rider eyeballs
Open car information
KAFKA
Real-time Machine Learning - UberEats ETD
● Fraud detection
● Share my ETA
And many more ...
Apache Kafka is Uber’s Lifeline
Kafka ecosystem @ Uber
100s of billion
100s TB
Messages/day
bytes/day
Kafka cluster stats
Multiple data centers
Kafka Infrastructure Deep Dive
Requirements
● Scale to 100s Billions/day → 1 Trillion/day
● High Throughput ( Scale: 100s TB → PB)
● Low Latency for most use cases(<5ms )
● Reliability - 99.99% ( #Msgs Available /#Msgs Produced)
● Multi-Language Support
● Tens of thousands of simultaneous clients.
● Reliable data replication across DC
Kafka Pipeline
Local
Agent
uReplicator
Kafka Pipeline: Data Flow
1
2
3 5 7
64 8
Kafka Clusters
Local
Agent
uReplicator
Kafka Clusters
● Use case based clusters
○ Data (async, reliable)
○ Logging (High throughput)
○ Time Sensitive (Low Latency e.g. Surge, Push
notifications)
○ High Value Data (At-least once, Sync e.g. Payments)
● Secondary cluster as fallback
● Aggregate clusters for all data topics.
Kafka Clusters
● Scale to 100s Billions/day → 1 Trillion/day
● High Throughput ( Scale: 100s TB → PB)
● Low Latency for most use cases(<5ms )
● Reliability - 99.99% ( #Msgs Available /#Msgs Produced)
● Multi-Language Support
● Tens of thousands of simultaneous clients.
● Reliable data replication across DC
Kafka Rest Proxy
Local
Agent
uReplicator
Why Kafka Rest Proxy ?
● Simplified Client API
● Multi-lang support (Java, NodeJs, Python, Golang)
● Decouple client from Kafka broker
○ Thin clients = operational ease
○ Less connections to Kafka brokers
○ Future kafka upgrade
● Enhanced Reliability
○ Primary & Secondary Kafka Clusters
Kafka Rest Proxy: Internals
Kafka Rest Proxy: Internals
Kafka Rest Proxy: Internals
● Based on Confluent’s open sourced Rest Proxy
● Performance enhancements
○ Simple http servlets on jetty instead of Jersey
○ Optimized for binary payloads.
○ Performance increase from 7K* to 45-50K QPS/box
● Caching of topic metadata.
● Reliability improvements*
○ Support for Fallback cluster
○ Support for multiple Producers (SLA based segregation)
● Plan to contribute back to community
*Based on benchmarking & analysis done in Jun ’2015
Rest Proxy: performance (1 box)
Message rate (K/second) at single node
End-endLatency(ms)
Kafka Clusters + Rest Proxy
● Scale to 100s Billions/day → 1 Trillion/day
● High Throughput ( Scale: 100s TB → PB)
● Low Latency for most use cases(<5ms )
● Reliability - 99.99% ( #Msgs Available /#Msgs Produced)
● Multi-Language Support
● Tens of thousands of simultaneous clients.
● Reliable data replication across DC
Kafka Clients
Local
Agent
uReplicator
Client Libraries
● Support for multiple clusters.
● High Throughput
○ Non-blocking, async, batching
○ <1ms produce latency for clients
○ Handles Throttling/BackOff signals from Rest Proxy
● Topic Discovery
○ Discovers the kafka cluster a topic belongs
○ Able to multiplex to different kafka clusters
● Integration with Local Agent for critical data
Client Libraries
Add
Figure
What if there is
network glitch /
outage?
Client Libraries
Add
Figure
Kafka Clusters + Rest Proxy + Clients
● Scale to 100s Billions/day → 1 Trillion/day
● High Throughput ( Scale: 100s TB → PB)
● Low Latency for most use cases(<5ms )
● Reliability - 99.99% ( #Msgs Available /#Msgs Produced)
● Multi-Language Support
● Tens of thousands of simultaneous clients.
● Reliable data replication across DC
Local Agent
Local
Agent
uReplicator
Local Agent
● Local spooling in case of downstream outage/backpressure
● Backfills at the controlled rate to avoid hammering
infrastructure recovering from outage
● Implementation:
○ Reuses code from rest-proxy and kafka’s log module.
○ Appends all topics to same file for high throughput.
Local Agent Architecture
Add
Figure
Local Agent in Action
Add
Figure
Kafka Clusters + Rest Proxy + Clients + Local Agent
● Scale to 100s Billions/day → 1 Trillion/day
● High Throughput ( Scale: 100s TB → PB)
● Low Latency for most use cases(<5ms )
● Reliability - 99.99% ( #Msgs Available /#Msgs Produced)
● Multi-Language Support
● Tens of thousands of simultaneous clients.
● Reliable data replication across DC
uReplicator
Local
Agent
uReplicator
Multi-DC data flow
CONFIDENTIAL
>> INSERT SCREENSHOT HERE <<
Mirrormaker : existing problems
● New Topic added
● New partitions added
● Mirrormaker bounced
● New mirrormaker added
uReplicator: In-house solution
Zookeeper
Helix MM
Controller
Helix
Agent
Thread 1
Thread N
Topic-partition
Helix
Agent
Thread 1
Thread N
Topic-partition
Helix
Agent
Thread 1
Thread N
Topic-partition
MM worker1 MM worker2 MM worker3
uReplicator
Zookeeper
Helix MM
Controller
Helix
Agent
Thread 1
Thread N
Topic-partition
Helix
Agent
Thread 1
Thread N
Topic-partition
Helix
Agent
Thread 1
Thread N
Topic-partition
MM worker1 MM worker2 MM worker3
Kafka Clusters + Rest Proxy + Clients + Local Agent
● Scale to 100s Billions/day → 1 Trillion/day
● High Throughput ( Scale: 100s TB → PB)
● Low Latency for most use cases(<5ms )
● Reliability - 99.99% ( #Msgs Available /#Msgs Produced)
● Multi-Language Support
● Tens of thousands of simultaneous clients.
● Reliable data replication across DC
uReplicator
● Running in production for 1+ year
● Open sourced: https://github.com/uber/uReplicator
● Blog: https://eng.uber.com/ureplicator/
Chaperone - E2E Auditing
Chaperone Architecture
CONFIDENTIAL
>> INSERT SCREENSHOT HERE <<
Chaperone : Track counts
CONFIDENTIAL
>> INSERT SCREENSHOT HERE <<
Chaperone : Track Latency
Chaperone
● Running in production for 1+ year
● Planning to open source in ~2 Weeks
At-least Once Kafka
Why do we need it?
1
2
3 5 7
64 8
● Most of infrastructure tuned for high throughput
○ Batching at each stage
○ Ack before produce (ack’ed != committed)
● Single node failure in any stage leads to data loss
● Need a reliable pipeline for High Value Data e.g. Payments
How did we achieve it?
● Brokers:
○ min.insync.replicas=2, can only torrent one node failure
○ unclean.leader.election= false, need to wait until the old
leader comes back
● Rest Proxy:
○ Partition Failover
● Improved Operations:
○ Replication throttling, to reduce impact of node bootstrap
○ Prevent catching up nodes to become ISR
Operations/Tooling
Partition Rebalancing
Add
Figure
Partition Rebalancing
● Calculates partition
imbalance and inter-broker
dependency.
● Generates & Executes
Rebalance Plan.
● Rebalance plans are
incremental, can be stopped
and resumed.
● Currently on-demand,
Automated in the future.
XFS vs EXT4
Add
Figure
Summary: Scale
● Kafka Brokers:
○ Multiple Clusters per DC
○ Use case based tuning
● Rest Proxy to reduce connections and better batching
● Rest Proxy & Clients
○ Batch everywhere, Async produce
○ Replace Jersey with Jetty
● XFS
Summary: Reliability
● Local Agent
● Secondary Clusters
● Multi Producer support in Rest Proxy
● uReplicator
● Auditing via Chaperone
Future Work
● Open source contribution
○ Chaperone
○ Toolkit
● Data Lineage
● Active Active Kafka
● Chargeback
● Exactly once mirroring via uReplicator
Questions ?
ankur@uber.com
Extra Slides
Kafka Durability (acks=1)
Broker 1
100
101
102
103
Broker 2
100
101
Broker 3
100
101
Leader
Committed
Producer
Acked
Kafka Durability (acks=1)
Broker 1
100
101
102
103
Broker 2
100
101
Broker 3
100
101
Leader
Committed
Producer
Failed
Acked
Kafka Durability (acks=1)
Broker 1
100
101
102
103
Broker 2
100
101
Broker 3
100
101
Leader
Committed
Producer
Kafka Durability (acks=1)
Broker 1
100
101
102
103
Broker 2
100
101
104
105
106
Broker 3
100
101
104
105
Leader
Committed
Producer
Old HW
Kafka Durability (acks=1)
Broker 1
100
101
102
103
Broker 2
100
101
104
105
106
Broker 3
100
101
104
105
Leader
Committed
Producer
X
Old HW
X
Kafka Durability (acks=1)
Broker 1
100
101
104
105
106
Broker 2
100
101
104
105
106
Broker 3
100
101
105
106
Leader
Committed
Producer
data loss!!
Distributed Messaging system
* Supported in Kafka 0.8+
● High throughput
● Low latency
● Scalable
● Centralized
● Real-time
What is Kafka?
● Distributed
● Partitioned
● Replicated
● Commit Log
Broker 1 Broker 2 Broker 3
ZooKeeper
What is Kafka?
● Distributed
● Partitioned
● Replicated
● Commit Log
Broker 1
Partition 0
Broker 2
Partition 1
Broker 3
Partition 2
ZooKeeper
What is Kafka?
● Distributed
● Partitioned
● Replicated
● Commit Log
Broker 1
Partition 0
Partition 2
Broker 2
Partition 1
Partition 0
Broker 3
Partition 2
Partition 1
ZooKeeper
What is Kafka?
● Distributed
● Partitioned
● Replicated
● Commit Log
Broker 1
Partition 0
0 1 2 3
Partition 2
0 1 2 3
Broker 2
Partition 1
0 1 2 3
Partition 0
0 1 2 3
Broker 3
Partition 2
0 1 2 3
Partition 1
0 1 2 3
ZooKeeper
Kafka Concepts

Mais conteúdo relacionado

Mais procurados

Disaster Recovery and High Availability with Kafka, SRM and MM2
Disaster Recovery and High Availability with Kafka, SRM and MM2Disaster Recovery and High Availability with Kafka, SRM and MM2
Disaster Recovery and High Availability with Kafka, SRM and MM2Abdelkrim Hadjidj
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache KafkaChhavi Parasher
 
Apache Flink, AWS Kinesis, Analytics
Apache Flink, AWS Kinesis, Analytics Apache Flink, AWS Kinesis, Analytics
Apache Flink, AWS Kinesis, Analytics Araf Karsh Hamid
 
Flink Forward San Francisco 2018: Andrew Gao & Jeff Sharpe - "Finding Bad Ac...
Flink Forward San Francisco 2018: Andrew Gao &  Jeff Sharpe - "Finding Bad Ac...Flink Forward San Francisco 2018: Andrew Gao &  Jeff Sharpe - "Finding Bad Ac...
Flink Forward San Francisco 2018: Andrew Gao & Jeff Sharpe - "Finding Bad Ac...Flink Forward
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMatei Zaharia
 
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, PresetStreaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, PresetHostedbyConfluent
 
Getting the most out of MariaDB MaxScale
Getting the most out of MariaDB MaxScaleGetting the most out of MariaDB MaxScale
Getting the most out of MariaDB MaxScaleMariaDB plc
 
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!Guido Schmutz
 
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity PlanningFrom Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planningconfluent
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 
Schema Registry 101 with Bill Bejeck | Kafka Summit London 2022
Schema Registry 101 with Bill Bejeck | Kafka Summit London 2022Schema Registry 101 with Bill Bejeck | Kafka Summit London 2022
Schema Registry 101 with Bill Bejeck | Kafka Summit London 2022HostedbyConfluent
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseDatabricks
 
Kafka and Avro with Confluent Schema Registry
Kafka and Avro with Confluent Schema RegistryKafka and Avro with Confluent Schema Registry
Kafka and Avro with Confluent Schema RegistryJean-Paul Azar
 
What is Apache Kafka and What is an Event Streaming Platform?
What is Apache Kafka and What is an Event Streaming Platform?What is Apache Kafka and What is an Event Streaming Platform?
What is Apache Kafka and What is an Event Streaming Platform?confluent
 
End-to-end Data Governance with Apache Avro and Atlas
End-to-end Data Governance with Apache Avro and AtlasEnd-to-end Data Governance with Apache Avro and Atlas
End-to-end Data Governance with Apache Avro and AtlasDataWorks Summit
 
Stream Processing with Flink and Stream Sharing
Stream Processing with Flink and Stream SharingStream Processing with Flink and Stream Sharing
Stream Processing with Flink and Stream Sharingconfluent
 
Reliability Guarantees for Apache Kafka
Reliability Guarantees for Apache KafkaReliability Guarantees for Apache Kafka
Reliability Guarantees for Apache Kafkaconfluent
 

Mais procurados (20)

Disaster Recovery and High Availability with Kafka, SRM and MM2
Disaster Recovery and High Availability with Kafka, SRM and MM2Disaster Recovery and High Availability with Kafka, SRM and MM2
Disaster Recovery and High Availability with Kafka, SRM and MM2
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache Kafka
 
Apache Flink, AWS Kinesis, Analytics
Apache Flink, AWS Kinesis, Analytics Apache Flink, AWS Kinesis, Analytics
Apache Flink, AWS Kinesis, Analytics
 
Flink Forward San Francisco 2018: Andrew Gao & Jeff Sharpe - "Finding Bad Ac...
Flink Forward San Francisco 2018: Andrew Gao &  Jeff Sharpe - "Finding Bad Ac...Flink Forward San Francisco 2018: Andrew Gao &  Jeff Sharpe - "Finding Bad Ac...
Flink Forward San Francisco 2018: Andrew Gao & Jeff Sharpe - "Finding Bad Ac...
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
 
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, PresetStreaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
 
Getting the most out of MariaDB MaxScale
Getting the most out of MariaDB MaxScaleGetting the most out of MariaDB MaxScale
Getting the most out of MariaDB MaxScale
 
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!
 
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity PlanningFrom Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
 
Kafka 101
Kafka 101Kafka 101
Kafka 101
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Schema Registry 101 with Bill Bejeck | Kafka Summit London 2022
Schema Registry 101 with Bill Bejeck | Kafka Summit London 2022Schema Registry 101 with Bill Bejeck | Kafka Summit London 2022
Schema Registry 101 with Bill Bejeck | Kafka Summit London 2022
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
Kafka and Avro with Confluent Schema Registry
Kafka and Avro with Confluent Schema RegistryKafka and Avro with Confluent Schema Registry
Kafka and Avro with Confluent Schema Registry
 
What is Apache Kafka and What is an Event Streaming Platform?
What is Apache Kafka and What is an Event Streaming Platform?What is Apache Kafka and What is an Event Streaming Platform?
What is Apache Kafka and What is an Event Streaming Platform?
 
End-to-end Data Governance with Apache Avro and Atlas
End-to-end Data Governance with Apache Avro and AtlasEnd-to-end Data Governance with Apache Avro and Atlas
End-to-end Data Governance with Apache Avro and Atlas
 
Stream Processing with Flink and Stream Sharing
Stream Processing with Flink and Stream SharingStream Processing with Flink and Stream Sharing
Stream Processing with Flink and Stream Sharing
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
Reliability Guarantees for Apache Kafka
Reliability Guarantees for Apache KafkaReliability Guarantees for Apache Kafka
Reliability Guarantees for Apache Kafka
 

Destaque

Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at UberDatabricks
 
ML and Data Science at Uber - GITPro talk 2017
ML and Data Science at Uber - GITPro talk 2017ML and Data Science at Uber - GITPro talk 2017
ML and Data Science at Uber - GITPro talk 2017Sudhir Tonse
 
Stream Computing & Analytics at Uber
Stream Computing & Analytics at UberStream Computing & Analytics at Uber
Stream Computing & Analytics at UberSudhir Tonse
 
Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived Vinoth Chandar
 
Uber Analytics Test
Uber Analytics TestUber Analytics Test
Uber Analytics TestCoursetake
 
Metrics at Scale @ UBER (Mantas Klasavicius Technology Stream)
Metrics at Scale @ UBER (Mantas Klasavicius Technology Stream)Metrics at Scale @ UBER (Mantas Klasavicius Technology Stream)
Metrics at Scale @ UBER (Mantas Klasavicius Technology Stream)IT Arena
 
Go and Uber’s time series database m3
Go and Uber’s time series database m3Go and Uber’s time series database m3
Go and Uber’s time series database m3Rob Skillington
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development KitJen Aman
 
Beyond Flipped Classrooms and MOOCs: The future of engineering and management...
Beyond Flipped Classrooms and MOOCs: The future of engineering and management...Beyond Flipped Classrooms and MOOCs: The future of engineering and management...
Beyond Flipped Classrooms and MOOCs: The future of engineering and management...Jeffrey Funk Business Models
 
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron SchildkroutKafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkroutconfluent
 
Implementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data GovernanceImplementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data GovernanceHortonworks
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHortonworks
 
Time Series Analysis
Time Series AnalysisTime Series Analysis
Time Series AnalysisQAware GmbH
 
Using NCC Group Web Performance Data Creatively
Using NCC Group Web Performance Data CreativelyUsing NCC Group Web Performance Data Creatively
Using NCC Group Web Performance Data CreativelyGareth Hughes
 

Destaque (20)

Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
 
ML and Data Science at Uber - GITPro talk 2017
ML and Data Science at Uber - GITPro talk 2017ML and Data Science at Uber - GITPro talk 2017
ML and Data Science at Uber - GITPro talk 2017
 
Stream Computing & Analytics at Uber
Stream Computing & Analytics at UberStream Computing & Analytics at Uber
Stream Computing & Analytics at Uber
 
Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived
 
Uber's Business Model
Uber's Business ModelUber's Business Model
Uber's Business Model
 
Uber Analytics Test
Uber Analytics TestUber Analytics Test
Uber Analytics Test
 
Metrics at Scale @ UBER (Mantas Klasavicius Technology Stream)
Metrics at Scale @ UBER (Mantas Klasavicius Technology Stream)Metrics at Scale @ UBER (Mantas Klasavicius Technology Stream)
Metrics at Scale @ UBER (Mantas Klasavicius Technology Stream)
 
Go and Uber’s time series database m3
Go and Uber’s time series database m3Go and Uber’s time series database m3
Go and Uber’s time series database m3
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
 
Expliseat - World's Lightest Seats
Expliseat - World's Lightest SeatsExpliseat - World's Lightest Seats
Expliseat - World's Lightest Seats
 
Biz Models for High-Tech Products
Biz Models for High-Tech ProductsBiz Models for High-Tech Products
Biz Models for High-Tech Products
 
Garena Online
Garena OnlineGarena Online
Garena Online
 
Beyond Flipped Classrooms and MOOCs: The future of engineering and management...
Beyond Flipped Classrooms and MOOCs: The future of engineering and management...Beyond Flipped Classrooms and MOOCs: The future of engineering and management...
Beyond Flipped Classrooms and MOOCs: The future of engineering and management...
 
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron SchildkroutKafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
 
Implementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data GovernanceImplementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data Governance
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
 
Time Series Analysis
Time Series AnalysisTime Series Analysis
Time Series Analysis
 
Product management
Product managementProduct management
Product management
 
Using NCC Group Web Performance Data Creatively
Using NCC Group Web Performance Data CreativelyUsing NCC Group Web Performance Data Creatively
Using NCC Group Web Performance Data Creatively
 
Pro_Tools_Tier_2
Pro_Tools_Tier_2Pro_Tools_Tier_2
Pro_Tools_Tier_2
 

Semelhante a Uber Real Time Data Analytics

Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Monal Daxini
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayDataWorks Summit
 
Hadoop summit - Scaling Uber’s Real-Time Infra for Trillion Events per Day
Hadoop summit - Scaling Uber’s Real-Time Infra for  Trillion Events per DayHadoop summit - Scaling Uber’s Real-Time Infra for  Trillion Events per Day
Hadoop summit - Scaling Uber’s Real-Time Infra for Trillion Events per DayAnkur Bansal
 
Netflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipelineNetflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipelineMonal Daxini
 
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016Monal Daxini
 
Kafka Practices @ Uber - Seattle Apache Kafka meetup
Kafka Practices @ Uber - Seattle Apache Kafka meetupKafka Practices @ Uber - Seattle Apache Kafka meetup
Kafka Practices @ Uber - Seattle Apache Kafka meetupMingmin Chen
 
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecNetflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecPeter Bakas
 
BDX 2016- Monal daxini @ Netflix
BDX 2016-  Monal daxini  @ NetflixBDX 2016-  Monal daxini  @ Netflix
BDX 2016- Monal daxini @ NetflixIdo Shilon
 
Keystone - ApacheCon 2016
Keystone - ApacheCon 2016Keystone - ApacheCon 2016
Keystone - ApacheCon 2016Peter Bakas
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaSteven Wu
 
Beaming flink to the cloud @ netflix ff 2016-monal-daxini
Beaming flink to the cloud @ netflix   ff 2016-monal-daxiniBeaming flink to the cloud @ netflix   ff 2016-monal-daxini
Beaming flink to the cloud @ netflix ff 2016-monal-daxiniMonal Daxini
 
Monal Daxini - Beaming Flink to the Cloud @ Netflix
Monal Daxini - Beaming Flink to the Cloud @ NetflixMonal Daxini - Beaming Flink to the Cloud @ Netflix
Monal Daxini - Beaming Flink to the Cloud @ NetflixFlink Forward
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2aspyker
 
Twitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
Twitter’s Apache Kafka Adoption Journey | Ming Liu, TwitterTwitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
Twitter’s Apache Kafka Adoption Journey | Ming Liu, TwitterHostedbyConfluent
 
Insta clustr seattle kafka meetup presentation bb
Insta clustr seattle kafka meetup presentation   bbInsta clustr seattle kafka meetup presentation   bb
Insta clustr seattle kafka meetup presentation bbNitin Kumar
 
A Tour of Apache Kafka
A Tour of Apache KafkaA Tour of Apache Kafka
A Tour of Apache Kafkaconfluent
 
From Three Nines to Five Nines - A Kafka Journey
From Three Nines to Five Nines - A Kafka JourneyFrom Three Nines to Five Nines - A Kafka Journey
From Three Nines to Five Nines - A Kafka JourneyAllen (Xiaozhong) Wang
 
Introduction to apache kafka
Introduction to apache kafkaIntroduction to apache kafka
Introduction to apache kafkaSamuel Kerrien
 
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...confluent
 
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...Athens Big Data
 

Semelhante a Uber Real Time Data Analytics (20)

Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per day
 
Hadoop summit - Scaling Uber’s Real-Time Infra for Trillion Events per Day
Hadoop summit - Scaling Uber’s Real-Time Infra for  Trillion Events per DayHadoop summit - Scaling Uber’s Real-Time Infra for  Trillion Events per Day
Hadoop summit - Scaling Uber’s Real-Time Infra for Trillion Events per Day
 
Netflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipelineNetflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipeline
 
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
 
Kafka Practices @ Uber - Seattle Apache Kafka meetup
Kafka Practices @ Uber - Seattle Apache Kafka meetupKafka Practices @ Uber - Seattle Apache Kafka meetup
Kafka Practices @ Uber - Seattle Apache Kafka meetup
 
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecNetflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
 
BDX 2016- Monal daxini @ Netflix
BDX 2016-  Monal daxini  @ NetflixBDX 2016-  Monal daxini  @ Netflix
BDX 2016- Monal daxini @ Netflix
 
Keystone - ApacheCon 2016
Keystone - ApacheCon 2016Keystone - ApacheCon 2016
Keystone - ApacheCon 2016
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
Beaming flink to the cloud @ netflix ff 2016-monal-daxini
Beaming flink to the cloud @ netflix   ff 2016-monal-daxiniBeaming flink to the cloud @ netflix   ff 2016-monal-daxini
Beaming flink to the cloud @ netflix ff 2016-monal-daxini
 
Monal Daxini - Beaming Flink to the Cloud @ Netflix
Monal Daxini - Beaming Flink to the Cloud @ NetflixMonal Daxini - Beaming Flink to the Cloud @ Netflix
Monal Daxini - Beaming Flink to the Cloud @ Netflix
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
Twitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
Twitter’s Apache Kafka Adoption Journey | Ming Liu, TwitterTwitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
Twitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
 
Insta clustr seattle kafka meetup presentation bb
Insta clustr seattle kafka meetup presentation   bbInsta clustr seattle kafka meetup presentation   bb
Insta clustr seattle kafka meetup presentation bb
 
A Tour of Apache Kafka
A Tour of Apache KafkaA Tour of Apache Kafka
A Tour of Apache Kafka
 
From Three Nines to Five Nines - A Kafka Journey
From Three Nines to Five Nines - A Kafka JourneyFrom Three Nines to Five Nines - A Kafka Journey
From Three Nines to Five Nines - A Kafka Journey
 
Introduction to apache kafka
Introduction to apache kafkaIntroduction to apache kafka
Introduction to apache kafka
 
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
 
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
 

Último

Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHC Sai Kiran
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONjhunlian
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...121011101441
 
Vishratwadi & Ghorpadi Bridge Tender documents
Vishratwadi & Ghorpadi Bridge Tender documentsVishratwadi & Ghorpadi Bridge Tender documents
Vishratwadi & Ghorpadi Bridge Tender documentsSachinPawar510423
 
Class 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm SystemClass 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm Systemirfanmechengr
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catcherssdickerson1
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleAlluxio, Inc.
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionMebane Rash
 
Risk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfRisk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfROCENODodongVILLACER
 
National Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdfNational Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdfRajuKanojiya4
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvLewisJB
 
Solving The Right Triangles PowerPoint 2.ppt
Solving The Right Triangles PowerPoint 2.pptSolving The Right Triangles PowerPoint 2.ppt
Solving The Right Triangles PowerPoint 2.pptJasonTagapanGulla
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating SystemRashmi Bhat
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...asadnawaz62
 

Último (20)

Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECH
 
POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...
 
Vishratwadi & Ghorpadi Bridge Tender documents
Vishratwadi & Ghorpadi Bridge Tender documentsVishratwadi & Ghorpadi Bridge Tender documents
Vishratwadi & Ghorpadi Bridge Tender documents
 
Class 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm SystemClass 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm System
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at Scale
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of Action
 
Risk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfRisk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdf
 
young call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Serviceyoung call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Service
 
National Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdfNational Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdf
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvv
 
Solving The Right Triangles PowerPoint 2.ppt
Solving The Right Triangles PowerPoint 2.pptSolving The Right Triangles PowerPoint 2.ppt
Solving The Right Triangles PowerPoint 2.ppt
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating System
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 

Uber Real Time Data Analytics

  • 1. Real Time Data Analytics @ UberAnkur Bansal Apache Big Data Europe November 14, 2016
  • 2. About Me ● Sr. Software Engineer, Streaming Team @ Uber ○ Streaming team supports platform for real time data analytics: Kafka, Samza, Flink, Pinot.. and plenty more ○ Focused on scaling Kafka at Uber’s pace ● Staff software Engineer @ Ebay ○ Build & scale Ebay’s cloud using openstack ● Apache Kylin: Committer, Emeritus PMC
  • 3. Agenda ● Real time Use Cases ● Kafka Infrastructure Deep Dive ● Our own Development: ○ Rest Proxy & Clients ○ Local Agent ○ uReplicator (Mirrormaker) ○ Chaperone (Auditing) ● Operations/Tooling
  • 6. Real-time Machine Learning - UberEats ETD
  • 7.
  • 8. ● Fraud detection ● Share my ETA And many more ...
  • 9. Apache Kafka is Uber’s Lifeline
  • 11. 100s of billion 100s TB Messages/day bytes/day Kafka cluster stats Multiple data centers
  • 13. Requirements ● Scale to 100s Billions/day → 1 Trillion/day ● High Throughput ( Scale: 100s TB → PB) ● Low Latency for most use cases(<5ms ) ● Reliability - 99.99% ( #Msgs Available /#Msgs Produced) ● Multi-Language Support ● Tens of thousands of simultaneous clients. ● Reliable data replication across DC
  • 15. Kafka Pipeline: Data Flow 1 2 3 5 7 64 8
  • 17. Kafka Clusters ● Use case based clusters ○ Data (async, reliable) ○ Logging (High throughput) ○ Time Sensitive (Low Latency e.g. Surge, Push notifications) ○ High Value Data (At-least once, Sync e.g. Payments) ● Secondary cluster as fallback ● Aggregate clusters for all data topics.
  • 18. Kafka Clusters ● Scale to 100s Billions/day → 1 Trillion/day ● High Throughput ( Scale: 100s TB → PB) ● Low Latency for most use cases(<5ms ) ● Reliability - 99.99% ( #Msgs Available /#Msgs Produced) ● Multi-Language Support ● Tens of thousands of simultaneous clients. ● Reliable data replication across DC
  • 20. Why Kafka Rest Proxy ? ● Simplified Client API ● Multi-lang support (Java, NodeJs, Python, Golang) ● Decouple client from Kafka broker ○ Thin clients = operational ease ○ Less connections to Kafka brokers ○ Future kafka upgrade ● Enhanced Reliability ○ Primary & Secondary Kafka Clusters
  • 21. Kafka Rest Proxy: Internals
  • 22. Kafka Rest Proxy: Internals
  • 23. Kafka Rest Proxy: Internals ● Based on Confluent’s open sourced Rest Proxy ● Performance enhancements ○ Simple http servlets on jetty instead of Jersey ○ Optimized for binary payloads. ○ Performance increase from 7K* to 45-50K QPS/box ● Caching of topic metadata. ● Reliability improvements* ○ Support for Fallback cluster ○ Support for multiple Producers (SLA based segregation) ● Plan to contribute back to community *Based on benchmarking & analysis done in Jun ’2015
  • 24. Rest Proxy: performance (1 box) Message rate (K/second) at single node End-endLatency(ms)
  • 25. Kafka Clusters + Rest Proxy ● Scale to 100s Billions/day → 1 Trillion/day ● High Throughput ( Scale: 100s TB → PB) ● Low Latency for most use cases(<5ms ) ● Reliability - 99.99% ( #Msgs Available /#Msgs Produced) ● Multi-Language Support ● Tens of thousands of simultaneous clients. ● Reliable data replication across DC
  • 27. Client Libraries ● Support for multiple clusters. ● High Throughput ○ Non-blocking, async, batching ○ <1ms produce latency for clients ○ Handles Throttling/BackOff signals from Rest Proxy ● Topic Discovery ○ Discovers the kafka cluster a topic belongs ○ Able to multiplex to different kafka clusters ● Integration with Local Agent for critical data
  • 28. Client Libraries Add Figure What if there is network glitch / outage?
  • 30. Kafka Clusters + Rest Proxy + Clients ● Scale to 100s Billions/day → 1 Trillion/day ● High Throughput ( Scale: 100s TB → PB) ● Low Latency for most use cases(<5ms ) ● Reliability - 99.99% ( #Msgs Available /#Msgs Produced) ● Multi-Language Support ● Tens of thousands of simultaneous clients. ● Reliable data replication across DC
  • 32. Local Agent ● Local spooling in case of downstream outage/backpressure ● Backfills at the controlled rate to avoid hammering infrastructure recovering from outage ● Implementation: ○ Reuses code from rest-proxy and kafka’s log module. ○ Appends all topics to same file for high throughput.
  • 34. Local Agent in Action Add Figure
  • 35. Kafka Clusters + Rest Proxy + Clients + Local Agent ● Scale to 100s Billions/day → 1 Trillion/day ● High Throughput ( Scale: 100s TB → PB) ● Low Latency for most use cases(<5ms ) ● Reliability - 99.99% ( #Msgs Available /#Msgs Produced) ● Multi-Language Support ● Tens of thousands of simultaneous clients. ● Reliable data replication across DC
  • 38. CONFIDENTIAL >> INSERT SCREENSHOT HERE << Mirrormaker : existing problems ● New Topic added ● New partitions added ● Mirrormaker bounced ● New mirrormaker added
  • 39. uReplicator: In-house solution Zookeeper Helix MM Controller Helix Agent Thread 1 Thread N Topic-partition Helix Agent Thread 1 Thread N Topic-partition Helix Agent Thread 1 Thread N Topic-partition MM worker1 MM worker2 MM worker3
  • 40. uReplicator Zookeeper Helix MM Controller Helix Agent Thread 1 Thread N Topic-partition Helix Agent Thread 1 Thread N Topic-partition Helix Agent Thread 1 Thread N Topic-partition MM worker1 MM worker2 MM worker3
  • 41. Kafka Clusters + Rest Proxy + Clients + Local Agent ● Scale to 100s Billions/day → 1 Trillion/day ● High Throughput ( Scale: 100s TB → PB) ● Low Latency for most use cases(<5ms ) ● Reliability - 99.99% ( #Msgs Available /#Msgs Produced) ● Multi-Language Support ● Tens of thousands of simultaneous clients. ● Reliable data replication across DC
  • 42. uReplicator ● Running in production for 1+ year ● Open sourced: https://github.com/uber/uReplicator ● Blog: https://eng.uber.com/ureplicator/
  • 43. Chaperone - E2E Auditing
  • 45. CONFIDENTIAL >> INSERT SCREENSHOT HERE << Chaperone : Track counts
  • 46. CONFIDENTIAL >> INSERT SCREENSHOT HERE << Chaperone : Track Latency
  • 47. Chaperone ● Running in production for 1+ year ● Planning to open source in ~2 Weeks
  • 49. Why do we need it? 1 2 3 5 7 64 8 ● Most of infrastructure tuned for high throughput ○ Batching at each stage ○ Ack before produce (ack’ed != committed) ● Single node failure in any stage leads to data loss ● Need a reliable pipeline for High Value Data e.g. Payments
  • 50. How did we achieve it? ● Brokers: ○ min.insync.replicas=2, can only torrent one node failure ○ unclean.leader.election= false, need to wait until the old leader comes back ● Rest Proxy: ○ Partition Failover ● Improved Operations: ○ Replication throttling, to reduce impact of node bootstrap ○ Prevent catching up nodes to become ISR
  • 53. Partition Rebalancing ● Calculates partition imbalance and inter-broker dependency. ● Generates & Executes Rebalance Plan. ● Rebalance plans are incremental, can be stopped and resumed. ● Currently on-demand, Automated in the future.
  • 55. Summary: Scale ● Kafka Brokers: ○ Multiple Clusters per DC ○ Use case based tuning ● Rest Proxy to reduce connections and better batching ● Rest Proxy & Clients ○ Batch everywhere, Async produce ○ Replace Jersey with Jetty ● XFS
  • 56. Summary: Reliability ● Local Agent ● Secondary Clusters ● Multi Producer support in Rest Proxy ● uReplicator ● Auditing via Chaperone
  • 57. Future Work ● Open source contribution ○ Chaperone ○ Toolkit ● Data Lineage ● Active Active Kafka ● Chargeback ● Exactly once mirroring via uReplicator
  • 60. Kafka Durability (acks=1) Broker 1 100 101 102 103 Broker 2 100 101 Broker 3 100 101 Leader Committed Producer Acked
  • 61. Kafka Durability (acks=1) Broker 1 100 101 102 103 Broker 2 100 101 Broker 3 100 101 Leader Committed Producer Failed Acked
  • 62. Kafka Durability (acks=1) Broker 1 100 101 102 103 Broker 2 100 101 Broker 3 100 101 Leader Committed Producer
  • 63. Kafka Durability (acks=1) Broker 1 100 101 102 103 Broker 2 100 101 104 105 106 Broker 3 100 101 104 105 Leader Committed Producer Old HW
  • 64. Kafka Durability (acks=1) Broker 1 100 101 102 103 Broker 2 100 101 104 105 106 Broker 3 100 101 104 105 Leader Committed Producer X Old HW X
  • 65. Kafka Durability (acks=1) Broker 1 100 101 104 105 106 Broker 2 100 101 104 105 106 Broker 3 100 101 105 106 Leader Committed Producer data loss!!
  • 66. Distributed Messaging system * Supported in Kafka 0.8+ ● High throughput ● Low latency ● Scalable ● Centralized ● Real-time
  • 67. What is Kafka? ● Distributed ● Partitioned ● Replicated ● Commit Log Broker 1 Broker 2 Broker 3 ZooKeeper
  • 68. What is Kafka? ● Distributed ● Partitioned ● Replicated ● Commit Log Broker 1 Partition 0 Broker 2 Partition 1 Broker 3 Partition 2 ZooKeeper
  • 69. What is Kafka? ● Distributed ● Partitioned ● Replicated ● Commit Log Broker 1 Partition 0 Partition 2 Broker 2 Partition 1 Partition 0 Broker 3 Partition 2 Partition 1 ZooKeeper
  • 70. What is Kafka? ● Distributed ● Partitioned ● Replicated ● Commit Log Broker 1 Partition 0 0 1 2 3 Partition 2 0 1 2 3 Broker 2 Partition 1 0 1 2 3 Partition 0 0 1 2 3 Broker 3 Partition 2 0 1 2 3 Partition 1 0 1 2 3 ZooKeeper