SlideShare uma empresa Scribd logo
1 de 41
Real-time Data and Apache Kafka
Jay Kreps
I ♥ Log
The Plan
1. Apache Kafka
2. Logs and Distributed Systems
3. Logs and Data Integration
4. Logs and Stream Processing
Apache Kafka
A
brief
history
of
Kafka
Three principles
1. One pipeline to rule them all
2. Stream processing >> messaging
3. Clusters not servers
Characteristics
• Scalability of a filesystem
– Hundreds of MB/sec/server throughput
– Many TB per server
• Guarantees of a database
– Messages strictly ordered
– All data persistent
• Distributed by default
– Replication
– Partitioning model
Kafka At LinkedIn
• 175 TB of in-flight log data per colo
• Low-latency: ~1.5 ms
• Replicated to each datacenter
• Tens of thousands of data producers
• Thousands of consumers
• 7 million messages written/sec
• 35 million messages read/sec
• Hadoop integration
Open source
• Apache Software Foundation
• Very healthy usage outside LinkedIn
• Broad base of committers
• 30 clients in 15 languages
• Great ecosystem of supporting tools
The Plan
1. Apache Kafka
2. Logs and Distributed Systems
3. Logs and Data Integration
4. Logs and Stream Processing
Kafka is about logs
What is a log?
Partitioning
Logs: pub/sub done right
Logs And Distributed Systems
Example:
A Fault-tolerant CEO Hash Table
Operations Final State
State-machine Replication
Primary-backup
What use is a log?
The Plan
1. Apache Kafka
2. Logs and Distributed Systems
3. Logs and Data Integration
4. Logs and Stream Processing
Data Integration
Types of Data
• Database data
– Users, products, orders, etc
• Events
– Clicks, Impressions, Pageviews, etc
• Application metrics
– CPU usage, requests/sec
• Application logs
– Service calls, errors
Systems at LinkedIn
• Live Stores
– Voldemort
– Espresso
– Graph
– OLAP
– Search
– InGraphs
• Offline
– Hadoop
– Teradata
Bad
Good
Example: User views job
The Plan
1. Apache Kafka
2. Logs and Distributed Systems
3. Logs and Data Integration
4. Logs and Stream Processing
Stream Processing
Stream processing is a
generalization
of batch processing
Examples
• Monitoring
• Security
• Content processing
• Recommendations
• Newsfeed
• ETL
Stream Processing = Logs + Jobs
Systems Can Help
Samza Architecture
Example: Top Articles By Company
Log-centric Architecture
Kafka
http://kafka.apache.org
Samza
http://samza.incubator.apache.org
Log Blog
http://linkd.in/199iMwY
Me
http://www.linkedin.com/in/jaykreps
@jaykreps

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Design Patterns for working with Fast Data
Design Patterns for working with Fast DataDesign Patterns for working with Fast Data
Design Patterns for working with Fast Data
 
The Many Faces of Apache Kafka: Leveraging real-time data at scale
The Many Faces of Apache Kafka: Leveraging real-time data at scaleThe Many Faces of Apache Kafka: Leveraging real-time data at scale
The Many Faces of Apache Kafka: Leveraging real-time data at scale
 
Kafka and Spark Streaming
Kafka and Spark StreamingKafka and Spark Streaming
Kafka and Spark Streaming
 
Kafka blr-meetup-presentation - Kafka internals
Kafka blr-meetup-presentation - Kafka internalsKafka blr-meetup-presentation - Kafka internals
Kafka blr-meetup-presentation - Kafka internals
 
kafka for db as postgres
kafka for db as postgreskafka for db as postgres
kafka for db as postgres
 
Singer, Pinterest's Logging Infrastructure
Singer, Pinterest's Logging InfrastructureSinger, Pinterest's Logging Infrastructure
Singer, Pinterest's Logging Infrastructure
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using Kafka
 
Real time Messages at Scale with Apache Kafka and Couchbase
Real time Messages at Scale with Apache Kafka and CouchbaseReal time Messages at Scale with Apache Kafka and Couchbase
Real time Messages at Scale with Apache Kafka and Couchbase
 
Kafka internals
Kafka internalsKafka internals
Kafka internals
 
Kafka presentation
Kafka presentationKafka presentation
Kafka presentation
 
Building Realtim Data Pipelines with Kafka Connect and Spark Streaming
Building Realtim Data Pipelines with Kafka Connect and Spark StreamingBuilding Realtim Data Pipelines with Kafka Connect and Spark Streaming
Building Realtim Data Pipelines with Kafka Connect and Spark Streaming
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache Kafka
 
Introduction to Apache Kafka and why it matters - Madrid
Introduction to Apache Kafka and why it matters - MadridIntroduction to Apache Kafka and why it matters - Madrid
Introduction to Apache Kafka and why it matters - Madrid
 
Apache Kafka - Martin Podval
Apache Kafka - Martin PodvalApache Kafka - Martin Podval
Apache Kafka - Martin Podval
 
Introduction to Kafka
Introduction to KafkaIntroduction to Kafka
Introduction to Kafka
 
Fundamentals and Architecture of Apache Kafka
Fundamentals and Architecture of Apache KafkaFundamentals and Architecture of Apache Kafka
Fundamentals and Architecture of Apache Kafka
 

Semelhante a I Heart Log: Real-time Data and Apache Kafka

Semelhante a I Heart Log: Real-time Data and Apache Kafka (20)

Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
 
Hadoop Ecosystem and Low Latency Streaming Architecture
Hadoop Ecosystem and Low Latency Streaming ArchitectureHadoop Ecosystem and Low Latency Streaming Architecture
Hadoop Ecosystem and Low Latency Streaming Architecture
 
Data Streaming For Big Data
Data Streaming For Big DataData Streaming For Big Data
Data Streaming For Big Data
 
MyHeritage Kakfa use cases - Feb 2014 Meetup
MyHeritage Kakfa use cases - Feb 2014 Meetup MyHeritage Kakfa use cases - Feb 2014 Meetup
MyHeritage Kakfa use cases - Feb 2014 Meetup
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
 
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...
 
ADV Slides: Trends in Streaming Analytics and Message-oriented Middleware
ADV Slides: Trends in Streaming Analytics and Message-oriented MiddlewareADV Slides: Trends in Streaming Analytics and Message-oriented Middleware
ADV Slides: Trends in Streaming Analytics and Message-oriented Middleware
 
Distributed Kafka Architecture Taboola Scale
Distributed Kafka Architecture Taboola ScaleDistributed Kafka Architecture Taboola Scale
Distributed Kafka Architecture Taboola Scale
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
 
Streaming Analytics
Streaming AnalyticsStreaming Analytics
Streaming Analytics
 
Realtime traffic analyser
Realtime traffic analyserRealtime traffic analyser
Realtime traffic analyser
 
Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)
Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)
Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)
 
JustGiving – Serverless Data Pipelines, API, Messaging and Stream Processing
JustGiving – Serverless Data Pipelines,  API, Messaging and Stream ProcessingJustGiving – Serverless Data Pipelines,  API, Messaging and Stream Processing
JustGiving – Serverless Data Pipelines, API, Messaging and Stream Processing
 
JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing
JustGiving | Serverless Data Pipelines, API, Messaging and Stream ProcessingJustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing
JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing
 
Cloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inCloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation in
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
 
Messaging, storage, or both? The real time story of Pulsar and Apache Distri...
Messaging, storage, or both?  The real time story of Pulsar and Apache Distri...Messaging, storage, or both?  The real time story of Pulsar and Apache Distri...
Messaging, storage, or both? The real time story of Pulsar and Apache Distri...
 
Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream Processing
 
Devfest uk & ireland using apache nifi with apache pulsar for fast data on-r...
Devfest uk & ireland  using apache nifi with apache pulsar for fast data on-r...Devfest uk & ireland  using apache nifi with apache pulsar for fast data on-r...
Devfest uk & ireland using apache nifi with apache pulsar for fast data on-r...
 

Último

AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
ankushspencer015
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoorTop Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
dharasingh5698
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 

Último (20)

AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
 
Unit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfUnit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdf
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoorTop Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 

I Heart Log: Real-time Data and Apache Kafka

Notas do Editor

  1. Who are you?What is this talk about? What is a log and what is it good forExciting topic
  2. Producers, consumers distributed
  3. First project was an open source clone of Amazon Dynamo (Project Voldemort)Makes explaining things easier
  4. 1 Pipeline for database data1 Pipeline for metrics1 Pipeline for events1 Pipeline for real-time processingNo pipeline for application logs300ActiveMQ brokers
  5. 10,000 messages/sec * 100 byte messages = ~1MB/sec
  6. 200 Kafka-related projects on github1000+ emails/month
  7. The log is fundamental abstraction Kafka providesYou can use a log as a drop-in replacement for a messaging system, but it can also do a lot more
  8. What is a log?Traditional uses?Non-traditional uses…
  9. Time orderedSemi-structured
  10. List of changesContents of record doesn’t matterIndexed by “time”Not application log (i.e. text file)
  11. Data model of Kafka: A topicPartitions can be spread over machines, replicated
  12. The whole system is one big distributed system
  13. Paxos,Zookeeper (Zab), Raft, etc.Traditional databases, Hbase/Bigtable, Spanner, HDFS namenode, etcLog has two purposes:ReplicationConsistency
  14. Very important problem
  15. What if replica is down?Ordering is importantTime is important
  16. Log is list of changesKey point: can re-create any point-in-timeIn banking: credits and debitsIn software: the version control changelog
  17. State-machine replication: log the incoming requests (logical logging)
  18. Log the changed rows (physical logging)
  19. Outside of distributed systems internals…
  20. AKA ETLMany systemsEvent dataMost important problem for data-centric companiesIntegration >> ML
  21. Two exacerbating factors
  22. One-size fits all
  23. Database cache coherencyData deployment from HadoopNever get to full connectivity
  24. Metcalfe’s lawAll data in multi-subscriber, real-time logsThe company is a big distributed system
  25. Batch is dominant paradigm for data processing, why?First thing you want when you have real-time data streams is real-time transformations
  26. 1790Collected data by Networks=>stream processing3,929,214 people$44kHorses and wagons are a high latency, batch channel
  27. Service: One input = one outputBatch job: All inputs = all outputsStream computing: any window = output for that window
  28. Importance of the log—buffering, multisubscriberOutput goes to a live serving system or another batch processing system (Hadoop, DWH)Examples: RecommendationsEmailMonitoringSecurity
  29. Storm and SamzaAbout process management – both integrate with KafkaMapReduce and HDFS