Jay Kreps is a Principal Staff Engineer at LinkedIn, where he is the lead architect for online data infrastructure. He is among the original authors of several open source projects, including a distributed key-value store called Project Voldemort, a messaging system called Kafka, and a stream processing system called Samza. This talk gives an introduction to Apache Kafka, a distributed messaging system. It covers both how Kafka works and how it is used at LinkedIn for log aggregation, messaging, ETL, and real-time stream processing.
5. Characteristics
• Scalability of a filesystem
– Hundreds of MB/sec/server throughput
– Many TB per server
• Guarantees of a database
– Messages strictly ordered
– All data persistent
• Distributed by default
– Replication
– Partitioning model
16. Kafka At LinkedIn
• 175 TB of in-flight log data per colo
• Replicated to each datacenter
• Tens of thousands of data producers
• Thousands of consumers
• 7 million messages written/sec
• 35 million messages read/sec
• Hadoop integration
Who are you?
What is this talk about?
Exciting topic
More
Messaging system, like JMS (but different!)
Producers, consumers distributed
Start with state at LinkedIn, describe each pipeline
1 Pipeline for database data
1 Pipeline for metrics
1 Pipeline for events
1 JMS-based pipeline
No pipeline for application logs
300 ActiveMQ brokers
The log is fundamental abstraction Kafka provides
You can use a log as a drop-in replacement for a messaging system, but it can also do a lot more
What is a log?
Traditional uses?
Non-traditional uses…
Time ordered
Semi-structured
Data structure not a text file
List of changes
Contents of a record don't matter
Indexed by “time”
Not application log (i.e. text file)
Remotely accessible
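The notes above describe a log as an append-only data structure whose records are indexed by an offset that serves as "time". A minimal sketch of that idea (an illustrative toy, not Kafka's actual API):

```python
class Log:
    """A minimal append-only log: records indexed by offset ("time")."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Read up to max_records starting at a given offset."""
        return self._records[offset:offset + max_records]

log = Log()
log.append({"user": "alice", "action": "login"})
off = log.append({"user": "bob", "action": "view"})
assert log.read(off) == [{"user": "bob", "action": "view"}]
```

The key properties from the slides fall out of this shape: appends are strictly ordered, old records stay readable, and any consumer can re-read from any offset.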
State machine replication
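State machine replication is the classic use of such a log: if every replica is a deterministic state machine and applies the same ordered sequence of operations, all replicas end in the same state. A tiny illustration (hypothetical ops, not Kafka code):

```python
def apply(state, op):
    """Deterministic state transition: op is a (key, value) put."""
    key, value = op
    new_state = dict(state)
    new_state[key] = value
    return new_state

# Shared, ordered log of operations.
ops_log = [("x", 1), ("y", 2), ("x", 3)]

replica_a = {}
replica_b = {}
for op in ops_log:
    replica_a = apply(replica_a, op)
for op in ops_log:
    replica_b = apply(replica_b, op)

# Same log + deterministic transitions => identical replicas.
assert replica_a == replica_b == {"x": 3, "y": 2}
```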
Data model of Kafka: A topic
Partitions can be spread over machines, replicated
Path of a write
Leadership failover
Guarantees
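The partitioning model above can be sketched as: a topic is split into partitions, a producer hashes the message key to pick a partition, and ordering is guaranteed within each partition. A toy version (Kafka's default partitioner actually uses murmur2 over the key bytes; crc32 here is just a stand-in stable hash):

```python
import zlib

NUM_PARTITIONS = 3
partitions = [[] for _ in range(NUM_PARTITIONS)]

def partition_for(key):
    # Stable hash of the key picks the partition.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

def produce(key, value):
    p = partition_for(key)
    partitions[p].append((key, value))
    return p, len(partitions[p]) - 1  # (partition, offset)

# All writes for one key land in one partition, so their order is preserved.
produce("user-42", "login")
produce("user-42", "view")
p, _ = produce("user-42", "logout")
assert [v for _, v in partitions[p]] == ["login", "view", "logout"]
```

Spreading partitions over machines gives horizontal scalability, and replicating each partition (with leadership failover) gives the durability guarantees on the earlier slide.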
AKA ETL
Many systems
Event data
Most important problem for data-centric companies
Integration >> ML
Maslow’s Hierarchy
Abraham Maslow, Psychologist, 1943
Physiological – eat, drink, sleep
Safety – Not being attacked
Love/Belonging – friends and family
Esteem – respect of others
Self-Actualization – morality, creativity, spontaneity
Want to do Deep Learning
Instead finding that their CSV data ALSO has commas inside the values
Copying files around
Ugh The Caveman
Data Warehousing has a bad reputation
Two exacerbating factors
15 years ago, just the first one (transactional data)
New categories are very high volume, maybe 100x the transactional data
Look like events
Internet of things
One-size fits all
Tell story:
Started with Hadoop, added arrows to get data there
Want to build fancy algorithms, need data (expectation 90% of time for fancy, 10% for data)
Holy shit this is hard!
Data is missing, data is late, computation runs on wrong data
Hadoop without good data is just a very expensive space heater
Never get to full connectivity
Metcalfe’s law
Each new system connects to get/give data
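The Metcalfe's-law point is back-of-envelope arithmetic: with point-to-point integration, every pair of systems needs its own pipeline, so the count grows quadratically; with a central multi-subscriber log, each system connects once. A quick illustration (my numbers, not from the talk):

```python
def point_to_point(n):
    # Each pair of systems needs its own pipeline: O(n^2).
    return n * (n - 1) // 2

def via_central_log(n):
    # Each system connects once, to the log: O(n).
    return n

# 10 systems: 45 bespoke pipelines vs 10 log connections.
assert point_to_point(10) == 45
assert via_central_log(10) == 10
```

This is why the pipelines at the start of the talk (database data, metrics, events, JMS) kept multiplying until they were consolidated onto Kafka.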
All data in multi-subscriber, real-time logs
The company is a big distributed system
The data center is the distributed system
Three dims:
Throughput
Guarantees
Latency
Advantages over messaging:
Huge data backlog
Order
Advantages over files
Real-time
Advantage over both: principled notion of time
Whole organization is big distributed system
Commit log = data transfer
Stream processing = triggers
Batch is the dominant paradigm for data processing, why?
Service: One input = one output
Batch job: All inputs = all outputs
Stream computing: any window = output for that window
No different from batch processing flow (instead of files/tables, logs)
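The service/batch/stream distinction above can be made concrete: a batch job maps all inputs to one output, while a stream computation emits an output for each window of the log, and processing every window recovers the batch result. A sketch with made-up records:

```python
log = [("t1", 3), ("t2", 5), ("t3", 2), ("t4", 7)]  # (timestamp, value)

# Batch job: all inputs -> all outputs.
batch_total = sum(v for _, v in log)

def windows(records, size):
    """Split the log into consecutive fixed-size windows."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

# Stream computation: any window -> output for that window.
stream_totals = [sum(v for _, v in w) for w in windows(log, 2)]

assert batch_total == 17
assert stream_totals == [8, 9]
# Processing the whole log window by window matches the batch answer.
assert sum(stream_totals) == batch_total
```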
Storm and Samza
About process management – both integrate with Kafka
MapReduce and HDFS