SlideShare uma empresa Scribd logo
1 de 30
From a Kafkaesque Story to
the Promised Land
7/7/2013
Ran Silberman
Open Source paradigm
The Cathedral & the Bazaar by Eric S Raymond, 1999
the struggle between top-down and bottom-up design
Challenges of data platform[1]
• High throughput
• Horizontal scale to address growth
• High availability of data services
• No Data loss
• Satisfy Real-Time demands
• Enforce structural data with schemas
• Process Big Data and Enterprise Data
• Single Source of Truth (SSOT)
SLA's of data platform
BI DWH
Real-time
Customers
Real-time
dashboards
Data Bus
Offline
Customers
SLA:
1. 98% in < 1/2 hr
2. 99.999% < 4 hrs
SLA:
1. 98% in < 500 msec
2. No send > 2 sec
Real-time
servers
Legacy Data flow in LivePerson
BI DWH
(Oracle)
RealTime
servers
View
Reports
Customers
ETL
Sessionize
Modeling
Schema
View
1st phase - move to Hadoop
ETL
Sessionize
Modeling
Schema
View
RealTime
servers
BI DWH
(Vertica)
View
Reports
HDFS
Hadoop
MR Job transfers
data to BI DWH
Customers
BI DWH
(Oracle)
2. move to Kafka
6
RealTime
servers
HDFS
BI DWH
(Vertica)
Hadoop
MR Job transfers
data to BI DWH
Kafka
Topic-1
View
Reports
Customers
3. Integrate with new producers
6
RealTime
servers
HDFS
BI DWH
(Vertica)
Hadoop
MR Job transfers
data to BI DWH
Kafka
Topic-1 Topic-2
New
RealTime
servers
View
Reports
Customers
4. Add Real-time BI
View
Reports
6
Customers
RealTime
servers
HDFS
BI DWH
(Vertica)
Hadoop
MR Job transfers
data to BI DWH
Kafka
Topic-1 Topic-2
New
RealTime
servers
Storm
Topology
5. Standardize Data-Model using Avro
View
Reports
6
Customers
RealTime
servers
HDFS
BI DWH
(Vertica)
Hadoop
MR Job transfers
data to BI DWH
Kafka
Topic-1 Topic-2
New
RealTime
servers
Storm
Topology
Camus
6. Define Single Source of Truth (SSOT)
View
Reports
6
Customers
RealTime
servers
HDFS
BI DWH
(Vertica)
Hadoop
MR Job transfers
data to BI DWH
Kafka
Topic-1 Topic-2
New
RealTime
servers
Storm
Topology
Camus
Kafka[2] as Backbone for Data
• Central "Message Bus"
• Support multiple topics (MQ style)
• Write ahead to files
• Distributed & Highly Available
• Horizontal Scale
• High throughput (10s MB/Sec per server)
• Service is agnostic to consumers' state
• Retention policy
Kafka Architecture
Kafka Architecture cont.
Node 1
Zookeeper
Producer 1 Producer 2 Producer 3
Node 2 Node 3
Consumer 1 Consumer 1Consumer 1
Group1
Kafka Architecture cont.
Node 1
Zookeeper
Producer 1 Producer 2
Node 3 Node 4
Consumer 2 Consumer 3Consumer 1
Node 2
Topic1 Topic2
Kafka replay messages.
Zookeeper
Node 3 Node 4
Min Offset ->
Max Offset ->
fetchRequest = new fetchRequest(topic, partition, offset, size);
currentOffset : taken from zookeeper
Earliest offset: -2
Latest offset : -1
Kafka API[3]
• Producer API
• Consumer API
o High-level API
 using zookeeper to access brokers and to save
offsets
o SimpleConsumer API
 direct access to Kafka brokers
• Kafka-Spout, Camus, and KafkaHadoopConsumer all
use SimpleConsumer
Kafka API[3]
• Producer
messages = new List<KeyedMessage<K, V>>()
Messages.add(new KeyedMessage(“topic1”, null, msg1));
Send(messages);
• Consumer
streams[] = Consumer.createMessageStream((“topic1”, 1);
for (message: streams[0]{
//do something with message
}
Kafka in Unit Testing
• Use of class KafkaServer
• Run embedded server
Introducing Avro[5]
• Schema representation using JSON
• Support types
o Primitive types: boolean, int, long, string, etc.
o Complex types:
Record, Enum, Union, Arrays, Maps, Fixed
• Data is serialized using its schema
• Avro files include file-header of the schema
Add Avro protocol to the story
Topic 1
Schema Repo
Producer 1
Topic 2
Consumers:
Camus/Storm
Create Message
according to
Schema 1.0
Register schema 1.0
Add revision to
message header
Send message
Read message Extract header
and obtain
schema version
Get schema by
version 1.0
Encode message
with Schema 1.0
Decode message
with schema 1.0
{event1:{header:{sessionId:"102122"),{timestamp:"12346")}...
Header Avro message
Kafka message
Pass message
1.0
Kafka + Storm + Avro example
• Demonstrating use of Avro data passing from Kafka to
Storm
• Explains Avro revision evolution
• Requires Kafka and Zookeeper installed
• Uses Storm artifact and Kafka-Spout artifact in Maven
• Plugin generates Java classes from Avro Schema
• https://github.com/ransilberman/avro-kafka-storm
Producer machine
Resiliency
Producer
Consistent
Topic
Send message
to Kafka
local file
Persist message to
local disk
Kafka Bridge
Send message
to Kafka
Fast
Topic
Real-time
Consumer:
Storm
Offline
Consumer:
Hadoop
Challenges of Kafka
• Still not mature enough
• Not enough supporting tools (viewers, maintenance)
• Duplications may occur
• API not documented enough
• Open Source - support by community only
• Difficult to replay messages from specific point in time
• Eventually Consistent...
Eventually Consistent
Because it is a distributed system -
• No guarantee for delivery order
• No way to tell to which broker message is sent
• Kafka do not guarantee that there are no duplications
• ...But eventually, all message will arrive!
Event
generated
Event
destination
Desert
Major Improvements in Kafka 0.8[4]
• Partitions replication
• Message send guarantee
• Consumer offsets are represented numbers instead of
bytes (e.g., 1, 2, 3, ..)
Addressing Data Challenges
• High throughput
o Kafka, Hadoop
• Horizontal scale to address growth
o Kafka, Storm, Hadoop
• High availability of data services
o Kafka, Storm, Zookeeper
• No Data loss
o Highly Available services, No ETL
Addressing Data Challenges Cont.
• Satisfy Real-Time demands
o Storm
• Enforce structural data with schemas
o Avro
• Process Big Data and Enterprise Data
o Kafka, Hadoop
• Single Source of Truth (SSOT)
o Hadoop, No ETL
References
• [1]Satisfying new requirements for Data Integration By
David Loshin
• [2]Apache Kafka
• [3]Kafka API
• [4]Kafka 0.8 Quick Start
• [5]Apache Avro
• [5]Storm
Thank you!

Mais conteĂşdo relacionado

Mais procurados

Samza at LinkedIn: Taking Stream Processing to the Next Level
Samza at LinkedIn: Taking Stream Processing to the Next LevelSamza at LinkedIn: Taking Stream Processing to the Next Level
Samza at LinkedIn: Taking Stream Processing to the Next Level
Martin Kleppmann
 
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetes at Scale – Real-time Ano...
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetesat Scale – Real-time Ano...ApacheCon2019 Talk: Kafka, Cassandra and Kubernetesat Scale – Real-time Ano...
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetes at Scale – Real-time Ano...
Paul Brebner
 
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ UberKafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
confluent
 

Mais procurados (20)

Samza at LinkedIn: Taking Stream Processing to the Next Level
Samza at LinkedIn: Taking Stream Processing to the Next LevelSamza at LinkedIn: Taking Stream Processing to the Next Level
Samza at LinkedIn: Taking Stream Processing to the Next Level
 
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetes at Scale – Real-time Ano...
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetesat Scale – Real-time Ano...ApacheCon2019 Talk: Kafka, Cassandra and Kubernetesat Scale – Real-time Ano...
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetes at Scale – Real-time Ano...
 
Beaming flink to the cloud @ netflix ff 2016-monal-daxini
Beaming flink to the cloud @ netflix   ff 2016-monal-daxiniBeaming flink to the cloud @ netflix   ff 2016-monal-daxini
Beaming flink to the cloud @ netflix ff 2016-monal-daxini
 
Building a distributed Key-Value store with Cassandra
Building a distributed Key-Value store with CassandraBuilding a distributed Key-Value store with Cassandra
Building a distributed Key-Value store with Cassandra
 
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ UberKafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
 
The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data Problems
 
Data Pipeline with Kafka
Data Pipeline with KafkaData Pipeline with Kafka
Data Pipeline with Kafka
 
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life ExampleKafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
 
Netflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipelineNetflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipeline
 
Experience with Kafka & Storm
Experience with Kafka & StormExperience with Kafka & Storm
Experience with Kafka & Storm
 
Bulletproof Kafka with Fault Tree Analysis (Andrey Falko, Lyft) Kafka Summit ...
Bulletproof Kafka with Fault Tree Analysis (Andrey Falko, Lyft) Kafka Summit ...Bulletproof Kafka with Fault Tree Analysis (Andrey Falko, Lyft) Kafka Summit ...
Bulletproof Kafka with Fault Tree Analysis (Andrey Falko, Lyft) Kafka Summit ...
 
Introduction to Streaming Distributed Processing with Storm
Introduction to Streaming Distributed Processing with StormIntroduction to Streaming Distributed Processing with Storm
Introduction to Streaming Distributed Processing with Storm
 
Hadoop summit - Scaling Uber’s Real-Time Infra for Trillion Events per Day
Hadoop summit - Scaling Uber’s Real-Time Infra for  Trillion Events per DayHadoop summit - Scaling Uber’s Real-Time Infra for  Trillion Events per Day
Hadoop summit - Scaling Uber’s Real-Time Infra for Trillion Events per Day
 
Ingesting Healthcare Data, Micah Whitacre
Ingesting Healthcare Data, Micah WhitacreIngesting Healthcare Data, Micah Whitacre
Ingesting Healthcare Data, Micah Whitacre
 
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
 
Deploying Kafka at Dropbox, Mark Smith, Sean Fellows
Deploying Kafka at Dropbox, Mark Smith, Sean FellowsDeploying Kafka at Dropbox, Mark Smith, Sean Fellows
Deploying Kafka at Dropbox, Mark Smith, Sean Fellows
 
ApacheCon BigData Europe 2015
ApacheCon BigData Europe 2015 ApacheCon BigData Europe 2015
ApacheCon BigData Europe 2015
 
Easily Build a Smart Pulsar Stream Processor_Simon Crosby
Easily Build a Smart Pulsar Stream Processor_Simon CrosbyEasily Build a Smart Pulsar Stream Processor_Simon Crosby
Easily Build a Smart Pulsar Stream Processor_Simon Crosby
 
High cardinality time series search: A new level of scale - Data Day Texas 2016
High cardinality time series search: A new level of scale - Data Day Texas 2016High cardinality time series search: A new level of scale - Data Day Texas 2016
High cardinality time series search: A new level of scale - Data Day Texas 2016
 
Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013
Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013
Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013
 

Destaque

Kafka replication apachecon_2013
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013
Jun Rao
 
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
Alexey Kharlamov
 

Destaque (17)

Clash of clans data structures
Clash of clans   data structuresClash of clans   data structures
Clash of clans data structures
 
Sharding with spider solutions 20160721
Sharding with spider solutions 20160721Sharding with spider solutions 20160721
Sharding with spider solutions 20160721
 
Serialization (Avro, Message Pack, Kryo)
Serialization (Avro, Message Pack, Kryo)Serialization (Avro, Message Pack, Kryo)
Serialization (Avro, Message Pack, Kryo)
 
Kafka replication apachecon_2013
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
 
Dev ops for big data cluster management tools
Dev ops for big data  cluster management toolsDev ops for big data  cluster management tools
Dev ops for big data cluster management tools
 
Using spider for sharding in production
Using spider for sharding in productionUsing spider for sharding in production
Using spider for sharding in production
 
歡迎回來:全面圖譜,金融 3.0 顧客行銷新視界
歡迎回來:全面圖譜,金融 3.0 顧客行銷新視界歡迎回來:全面圖譜,金融 3.0 顧客行銷新視界
歡迎回來:全面圖譜,金融 3.0 顧客行銷新視界
 
Spider storage engine (dec212016)
Spider storage engine (dec212016)Spider storage engine (dec212016)
Spider storage engine (dec212016)
 
終歸:分群消費者x多元商機的實現
終歸:分群消費者x多元商機的實現終歸:分群消費者x多元商機的實現
終歸:分群消費者x多元商機的實現
 
Introduction to Kafka Streams
Introduction to Kafka StreamsIntroduction to Kafka Streams
Introduction to Kafka Streams
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
 
Spiderストレージエンジンのご紹介
Spiderストレージエンジンのご紹介Spiderストレージエンジンのご紹介
Spiderストレージエンジンのご紹介
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
 
Kafka and Storm - event processing in realtime
Kafka and Storm - event processing in realtimeKafka and Storm - event processing in realtime
Kafka and Storm - event processing in realtime
 
Introduction to Kafka and Zookeeper
Introduction to Kafka and ZookeeperIntroduction to Kafka and Zookeeper
Introduction to Kafka and Zookeeper
 

Semelhante a From a kafkaesque story to The Promised Land

Data Models and Consumer Idioms Using Apache Kafka for Continuous Data Stream...
Data Models and Consumer Idioms Using Apache Kafka for Continuous Data Stream...Data Models and Consumer Idioms Using Apache Kafka for Continuous Data Stream...
Data Models and Consumer Idioms Using Apache Kafka for Continuous Data Stream...
Erik Onnen
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
Joe Stein
 
Capital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream ProcessingCapital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream Processing
confluent
 
NoSQL afternoon in Japan Kumofs & MessagePack
NoSQL afternoon in Japan Kumofs & MessagePackNoSQL afternoon in Japan Kumofs & MessagePack
NoSQL afternoon in Japan Kumofs & MessagePack
Sadayuki Furuhashi
 
NoSQL afternoon in Japan kumofs & MessagePack
NoSQL afternoon in Japan kumofs & MessagePackNoSQL afternoon in Japan kumofs & MessagePack
NoSQL afternoon in Japan kumofs & MessagePack
Sadayuki Furuhashi
 
F_1330_Narkhede_Kafka .pptx
F_1330_Narkhede_Kafka .pptxF_1330_Narkhede_Kafka .pptx
F_1330_Narkhede_Kafka .pptx
NIMITJAIN71
 

Semelhante a From a kafkaesque story to The Promised Land (20)

From a Kafkaesque Story to The Promised Land at LivePerson
From a Kafkaesque Story to The Promised Land at LivePersonFrom a Kafkaesque Story to The Promised Land at LivePerson
From a Kafkaesque Story to The Promised Land at LivePerson
 
Reducing Microservice Complexity with Kafka and Reactive Streams
Reducing Microservice Complexity with Kafka and Reactive StreamsReducing Microservice Complexity with Kafka and Reactive Streams
Reducing Microservice Complexity with Kafka and Reactive Streams
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
 
Building a company-wide data pipeline on Apache Kafka - engineering for 150 b...
Building a company-wide data pipeline on Apache Kafka - engineering for 150 b...Building a company-wide data pipeline on Apache Kafka - engineering for 150 b...
Building a company-wide data pipeline on Apache Kafka - engineering for 150 b...
 
Keystone - ApacheCon 2016
Keystone - ApacheCon 2016Keystone - ApacheCon 2016
Keystone - ApacheCon 2016
 
Kafka Explainaton
Kafka ExplainatonKafka Explainaton
Kafka Explainaton
 
World of Tanks Experience of Using Kafka
World of Tanks Experience of Using KafkaWorld of Tanks Experience of Using Kafka
World of Tanks Experience of Using Kafka
 
Data Models and Consumer Idioms Using Apache Kafka for Continuous Data Stream...
Data Models and Consumer Idioms Using Apache Kafka for Continuous Data Stream...Data Models and Consumer Idioms Using Apache Kafka for Continuous Data Stream...
Data Models and Consumer Idioms Using Apache Kafka for Continuous Data Stream...
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
 
Capital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream ProcessingCapital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream Processing
 
Apache Kafka - Scalable Message Processing and more!
Apache Kafka - Scalable Message Processing and more!Apache Kafka - Scalable Message Processing and more!
Apache Kafka - Scalable Message Processing and more!
 
Kafka presentation
Kafka presentationKafka presentation
Kafka presentation
 
NoSQL afternoon in Japan Kumofs & MessagePack
NoSQL afternoon in Japan Kumofs & MessagePackNoSQL afternoon in Japan Kumofs & MessagePack
NoSQL afternoon in Japan Kumofs & MessagePack
 
NoSQL afternoon in Japan kumofs & MessagePack
NoSQL afternoon in Japan kumofs & MessagePackNoSQL afternoon in Japan kumofs & MessagePack
NoSQL afternoon in Japan kumofs & MessagePack
 
F_1330_Narkhede_Kafka .pptx
F_1330_Narkhede_Kafka .pptxF_1330_Narkhede_Kafka .pptx
F_1330_Narkhede_Kafka .pptx
 
Actors or Not: Async Event Architectures
Actors or Not: Async Event ArchitecturesActors or Not: Async Event Architectures
Actors or Not: Async Event Architectures
 
Event Streaming Architectures with Confluent and ScyllaDB
Event Streaming Architectures with Confluent and ScyllaDBEvent Streaming Architectures with Confluent and ScyllaDB
Event Streaming Architectures with Confluent and ScyllaDB
 

Último

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Último (20)

Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 

From a kafkaesque story to The Promised Land

  • 1. From a Kafkaesque Story to the Promised Land 7/7/2013 Ran Silberman
  • 2. Open Source paradigm The Cathedral & the Bazaar by Eric S Raymond, 1999 the struggle between top-down and bottom-up design
  • 3. Challenges of data platform[1] • High throughput • Horizontal scale to address growth • High availability of data services • No Data loss • Satisfy Real-Time demands • Enforce structural data with schemas • Process Big Data and Enterprise Data • Single Source of Truth (SSOT)
  • 4. SLA's of data platform BI DWH Real-time Customers Real-time dashboards Data Bus Offline Customers SLA: 1. 98% in < 1/2 hr 2. 99.999% < 4 hrs SLA: 1. 98% in < 500 msec 2. No send > 2 sec Real-time servers
  • 5. Legacy Data flow in LivePerson BI DWH (Oracle) RealTime servers View Reports Customers ETL Sessionize Modeling Schema View
  • 6. 1st phase - move to Hadoop ETL Sessionize Modeling Schema View RealTime servers BI DWH (Vertica) View Reports HDFS Hadoop MR Job transfers data to BI DWH Customers BI DWH (Oracle)
  • 7. 2. move to Kafka 6 RealTime servers HDFS BI DWH (Vertica) Hadoop MR Job transfers data to BI DWH Kafka Topic-1 View Reports Customers
  • 8. 3. Integrate with new producers 6 RealTime servers HDFS BI DWH (Vertica) Hadoop MR Job transfers data to BI DWH Kafka Topic-1 Topic-2 New RealTime servers View Reports Customers
  • 9. 4. Add Real-time BI View Reports 6 Customers RealTime servers HDFS BI DWH (Vertica) Hadoop MR Job transfers data to BI DWH Kafka Topic-1 Topic-2 New RealTime servers Storm Topology
  • 10. 5. Standardize Data-Model using Avro View Reports 6 Customers RealTime servers HDFS BI DWH (Vertica) Hadoop MR Job transfers data to BI DWH Kafka Topic-1 Topic-2 New RealTime servers Storm Topology Camus
  • 11. 6. Define Single Source of Truth (SSOT) View Reports 6 Customers RealTime servers HDFS BI DWH (Vertica) Hadoop MR Job transfers data to BI DWH Kafka Topic-1 Topic-2 New RealTime servers Storm Topology Camus
  • 12. Kafka[2] as Backbone for Data • Central "Message Bus" • Support multiple topics (MQ style) • Write ahead to files • Distributed & Highly Available • Horizontal Scale • High throughput (10s MB/Sec per server) • Service is agnostic to consumers' state • Retention policy
  • 14. Kafka Architecture cont. Node 1 Zookeeper Producer 1 Producer 2 Producer 3 Node 2 Node 3 Consumer 1 Consumer 1Consumer 1
  • 15. Group1 Kafka Architecture cont. Node 1 Zookeeper Producer 1 Producer 2 Node 3 Node 4 Consumer 2 Consumer 3Consumer 1 Node 2 Topic1 Topic2
  • 16. Kafka replay messages. Zookeeper Node 3 Node 4 Min Offset -> Max Offset -> fetchRequest = new fetchRequest(topic, partition, offset, size); currentOffset : taken from zookeeper Earliest offset: -2 Latest offset : -1
  • 17. Kafka API[3] • Producer API • Consumer API o High-level API  using zookeeper to access brokers and to save offsets o SimpleConsumer API  direct access to Kafka brokers • Kafka-Spout, Camus, and KafkaHadoopConsumer all use SimpleConsumer
  • 18. Kafka API[3] • Producer messages = new List<KeyedMessage<K, V>>() Messages.add(new KeyedMessage(“topic1”, null, msg1)); Send(messages); • Consumer streams[] = Consumer.createMessageStream((“topic1”, 1); for (message: streams[0]{ //do something with message }
  • 19. Kafka in Unit Testing • Use of class KafkaServer • Run embedded server
  • 20. Introducing Avro[5] • Schema representation using JSON • Support types o Primitive types: boolean, int, long, string, etc. o Complex types: Record, Enum, Union, Arrays, Maps, Fixed • Data is serialized using its schema • Avro files include file-header of the schema
  • 21. Add Avro protocol to the story Topic 1 Schema Repo Producer 1 Topic 2 Consumers: Camus/Storm Create Message according to Schema 1.0 Register schema 1.0 Add revision to message header Send message Read message Extract header and obtain schema version Get schema by version 1.0 Encode message with Schema 1.0 Decode message with schema 1.0 {event1:{header:{sessionId:"102122"),{timestamp:"12346")}... Header Avro message Kafka message Pass message 1.0
  • 22. Kafka + Storm + Avro example • Demonstrating use of Avro data passing from Kafka to Storm • Explains Avro revision evolution • Requires Kafka and Zookeeper installed • Uses Storm artifact and Kafka-Spout artifact in Maven • Plugin generates Java classes from Avro Schema • https://github.com/ransilberman/avro-kafka-storm
  • 23. Producer machine Resiliency Producer Consistent Topic Send message to Kafka local file Persist message to local disk Kafka Bridge Send message to Kafka Fast Topic Real-time Consumer: Storm Offline Consumer: Hadoop
  • 24. Challenges of Kafka • Still not mature enough • Not enough supporting tools (viewers, maintenance) • Duplications may occur • API not documented enough • Open Source - support by community only • Difficult to replay messages from specific point in time • Eventually Consistent...
  • 25. Eventually Consistent Because it is a distributed system - • No guarantee for delivery order • No way to tell to which broker message is sent • Kafka do not guarantee that there are no duplications • ...But eventually, all message will arrive! Event generated Event destination Desert
  • 26. Major Improvements in Kafka 0.8[4] • Partitions replication • Message send guarantee • Consumer offsets are represented numbers instead of bytes (e.g., 1, 2, 3, ..)
  • 27. Addressing Data Challenges • High throughput o Kafka, Hadoop • Horizontal scale to address growth o Kafka, Storm, Hadoop • High availability of data services o Kafka, Storm, Zookeeper • No Data loss o Highly Available services, No ETL
  • 28. Addressing Data Challenges Cont. • Satisfy Real-Time demands o Storm • Enforce structural data with schemas o Avro • Process Big Data and Enterprise Data o Kafka, Hadoop • Single Source of Truth (SSOT) o Hadoop, No ETL
  • 29. References • [1]Satisfying new requirements for Data Integration By David Loshin • [2]Apache Kafka • [3]Kafka API • [4]Kafka 0.8 Quick Start • [5]Apache Avro • [5]Storm