Multi-Tier, Multi-Tenant, Multi-Problem Kafka
Todd Palino
Site Reliability Engineering, LinkedIn
©2016 LinkedIn Corporation. All Rights Reserved.
Who Am I?
What Will We Talk About?
 Multi-Tenant Pipelines
 Multi-Tier Architecture
 Interesting Problems (a.k.a. “Why I Drink”)
 Conclusion
Multi-Tenant Pipelines
Tracking and Data Deployment
 Tracking – Data going to HDFS
 Data Deployment – Hadoop job results going to online applications
 Many shared topics
 Schemas require a common header (see the sketch below)
 All message counts are audited
 Special Problems
– Hard to tell what application is dropping messages
– Some of these messages are copied 42 times!
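The common header is what makes cross-application auditing possible. A minimal sketch of what such a header schema might look like, built with Avro's Java API; the field names here are hypothetical stand-ins, since the real header schema is LinkedIn-internal:

```java
import org.apache.avro.Schema;

// Hypothetical sketch of a shared event header; the actual header schema
// used at LinkedIn is internal and its field names will differ.
public class TrackingHeader {
    static final Schema HEADER_SCHEMA = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"EventHeader\", \"fields\": ["
        + "{\"name\": \"timestamp\", \"type\": \"long\"},"   // event time, used for audit bucketing
        + "{\"name\": \"server\", \"type\": \"string\"},"    // producing host
        + "{\"name\": \"service\", \"type\": \"string\"},"   // producing application, for attributing loss
        + "{\"name\": \"guid\", \"type\": \"string\"}"       // unique message ID
        + "]}");
}
```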
Metrics
 Application and OS metrics
 Deployment and build system events
 Service calls – sampling of timing information for individual application calls
 Some application logs
 Special Problems
– Every server in the datacenter produces to this cluster at least twice
– Graphing/Alerting system consumes the metrics 20 times
Logging
 Application logging messages destined for ELK clusters
 Lower retention than other clusters
 Loosest restrictions on message schema and encoding
 Special Problems
– Not many – it’s still overprovisioned
– Customers starting to ask about aggregation
Queuing
 Everything else
 Primarily messages internal to applications
 Also emails and user messaging
 Messages are Avro encoded, but do not require headers
 Special Problems
– Many messages which use unregistered schemas
– Clusters can have very high message rates (but not large data)
Special Case Clusters
 Not all use cases fit multi-tenancy
– Custom configurations are needed
– Tighter performance guarantees
– Use of topic deletion
 Espresso (KV store) internal replication
 Brooklin – Change capture
 Replication from Hadoop to Voldemort
Tiered Cluster Architecture
One Kafka Cluster
Multiple Clusters – Message Aggregation
Why Not Direct?
 Network Concerns
– Bandwidth
– Network partitioning
– Latency
 Security Concerns
– Firewalls and ACLs
– Encrypting data in transit
 Resource Concerns
– A misbehaving application can swamp production resources
What Do We Lose?
 You may lose message ordering
– Mirror maker breaks apart message batches and redistributes them
 You may lose key to partition affinity
– Mirror maker will partition based on the key
– Differing partition counts in source and target will result in differing distribution
– Mirror maker does not (without work) honor custom partitioning
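The key-affinity caveats follow directly from how partitions are chosen. For keyed messages, Kafka's open source DefaultPartitioner hashes the key bytes modulo the partition count, so the same key maps to different partitions when the counts differ; a small sketch of that arithmetic:

```java
import org.apache.kafka.common.utils.Utils;

// The same placement arithmetic the default partitioner applies to keyed
// messages: murmur2 hash of the key bytes, modulo the partition count.
public class PartitionAffinityDemo {
    static int partitionFor(byte[] keyBytes, int numPartitions) {
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    public static void main(String[] args) {
        byte[] key = "member-12345".getBytes();
        System.out.println(partitionFor(key, 8));   // partition in the source cluster
        System.out.println(partitionFor(key, 12));  // likely a different partition downstream
    }
}
```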
Aggregation Rules
 Aggregate clusters are only for consuming messages
– Producing to an aggregate cluster is not allowed
– This ensures that all aggregate clusters have the same content
 Not every topic appears in PROD aggregate-tracking clusters
– Trying to discourage aggregate cluster usage in PROD
– All topics are available in CORP
 Aggregate-queuing is whitelist only and very restricted
– Please discuss your use case with us before developing
Interesting Problems
Buy The Book!
Early Access available now. Covers all aspects of Kafka, from setup to client development to ongoing administration and troubleshooting. Also discusses stream processing and other use cases.
Monitoring Using Kafka
 Monitoring and alerting are self-service
– No gatekeeper on what metrics are collected and stored
 Applications use a common container
– EventBus Kafka producer
– Simple annotation of metrics to collect (see the sketch below)
– Sampled service calls
– Application logs
 Everything is produced to Kafka and consumed by the monitoring infrastructure
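As a rough illustration of the "simple annotation" style, a hypothetical sketch is below; the real container and its annotations are LinkedIn-internal, so the names here are invented:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical metric annotation; the real internal container's API differs.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Sensor {
    String name();  // metric name to publish to the metrics Kafka cluster
}

class PaymentService {
    // The container discovers annotated methods, times the calls, and
    // produces the samples to Kafka for the monitoring infrastructure.
    @Sensor(name = "payments.process.time")
    public void processPayment() {
        // application logic
    }
}
```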
Monitoring Kafka
 Kafka is great for monitoring your applications
KMon and EnlightIN
 Developed a separate monitoring and notification system
– Metrics are only retained long enough to alert on them
– One rule: we can’t use Kafka
 Alerting is simplified from our self-service system
– Nothing complex like regular expressions or RPNs
– Only used for critical Kafka and Zookeeper alerts
– Faster and more reliable
 Notifications are cleaner
– Alerts are grouped into incidents for fewer notifications when things break
– Notification system is generic and subscribable so we can use it for other things
Broker Monitoring
 Bytes In and Out, Messages In
– Why not messages out?
 Partitions
– Count and Leader Count
– Under Replicated and Offline
 Threads
– Network pool, Request pool
– Max Dirty Percent
 Requests
– Rates and times - total, queue, local, and send
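All of these are standard JMX metrics on the broker. As one concrete example, the sketch below polls the under-replicated partition count (a real MBean name in open source Kafka); the hostname, port, and remote-JMX setup are assumptions:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Minimal sketch: read a broker's under-replicated partition count over JMX.
// Assumes the broker exposes remote JMX on port 9999; hostname is a placeholder.
public class UnderReplicatedCheck {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker1.example.com:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            ObjectName urp = new ObjectName(
                    "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions");
            // A sustained non-zero value here is the clearest sign of trouble.
            System.out.println("UnderReplicatedPartitions: " + mbsc.getAttribute(urp, "Value"));
        }
    }
}
```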
Is Kafka Working?
 Knowing that the cluster is up isn’t always enough
– Network problems
– Metrics can lie
 Customers still ask us first if something breaks
– Part of the solution is educating them as to what to monitor
– Need to be absolutely sure of the answer “There’s nothing wrong with Kafka”
Kafka Monitoring Framework
 Producer to consumer testing of a Kafka cluster
– Ensures that producers and consumers actually work
– Measures how long messages take to get through
 We have an SLO of 99.99% availability for all clusters
 Working on multi-tier support
– Answers the question of how long messages take to get to Hadoop
 LinkedIn Kafka Open Source
– https://github.com/linkedin/streaming
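Conceptually the framework is a produce-to-consume canary: write a timestamped message, read it back, and measure the round trip. A heavily simplified sketch of that idea, not the actual framework; the broker address and topic name are placeholders:

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaCanary {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1.example.com:9092");
        props.put("group.id", "kafka-canary");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("canary-topic"));
            // Send a probe carrying its creation time, then wait for it to come back.
            producer.send(new ProducerRecord<>("canary-topic",
                    Long.toString(System.currentTimeMillis())));
            producer.flush();
            ConsumerRecords<String, String> records = consumer.poll(10000);
            for (ConsumerRecord<String, String> record : records) {
                long latencyMs = System.currentTimeMillis() - Long.parseLong(record.value());
                System.out.println("End-to-end latency: " + latencyMs + " ms");
            }
        }
    }
}
```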
Is Mirroring Working?
 Most critical data flows through Kafka
– Most of that depends on mirror makers
– How do we make sure it all gets where it’s going?
 Mirror maker pipelines can have over a thousand topics
– Different message rates
– Some are more important than others
 Lag threshold monitoring doesn’t work
– Traffic spikes cause false alerts
– What should the threshold be?
– No easy way to monitor 1000 topics and over 10k partitions
Kafka Audit
 Audit tracks topic completeness across all clusters in the pipeline
– Primarily tracking messages
– Schema must have a valid header
– Alerts for DWH topics are set for 0.1% message loss
 Provided as an integrated part of the internal Kafka libraries
 Used for data completeness checks before Hadoop jobs run
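At its core, audit is counting: each tier buckets messages by topic and by a time window taken from the common header, and the resulting counts are compared across tiers to find loss. A sketch of that counting core; the ten-minute window and map layout are assumptions, not LinkedIn's implementation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class AuditCounter {
    private static final long BUCKET_MS = 10 * 60 * 1000;  // assumed 10-minute windows

    // topic + time bucket -> message count, periodically emitted to an audit
    // topic so counts can be compared across every cluster in the pipeline.
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    // Called for each message; the timestamp comes from the common schema
    // header, so every tier buckets the same message into the same window.
    public void record(String topic, long headerTimestamp) {
        long bucket = headerTimestamp - (headerTimestamp % BUCKET_MS);
        counts.computeIfAbsent(topic + "@" + bucket, k -> new LongAdder()).increment();
    }
}
```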
Auditing Message Flows
Burrow
 Burrow is an advanced Kafka consumer monitoring system
– Provides an objective view of consumer status
– Much more powerful than threshold-based lag monitoring
 Burrow is Open Source!
– Used by many other companies, including Wikimedia and Blizzard
– Used internally to ensure all Mirror Makers and Audit are running correctly
 Exports metrics for all consumers to self-service monitoring
 https://github.com/linkedin/Burrow
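Burrow serves consumer status over HTTP, so checking a group is a simple GET. The sketch below uses the endpoint layout of older Burrow releases (newer versions moved to /v3 paths); host, port, cluster, and group names are placeholders:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class BurrowStatusCheck {
    public static void main(String[] args) throws Exception {
        // Older Burrow endpoint layout; adjust the path for your version.
        URL url = new URL(
                "http://burrow.example.com:8000/v2/kafka/prod-cluster/consumer/my-group/status");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            // Burrow returns JSON with an overall evaluation (OK/WARN/ERR)
            // derived from the group's offset commits, not a lag threshold.
            reader.lines().forEach(System.out::println);
        }
    }
}
```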
MTTF Is Not Your Friend
 We have over 1800 Kafka brokers
– All have at least 12 drives, most have 16
– Dual CPUs, at least 64 GB of memory
– Really lousy Megaraid controllers
 This means hardware fails daily
– We don’t always know when it happens, if it doesn’t take the system down
– It can’t always be fixed immediately
– We can take one broker down, but not two
Moving Partitions
 Prior to Kafka 0.8, moving partitions was basically impossible
– It’s still not easy – you have to be explicit about what you are moving
– There’s no good way to balance partitions in a cluster
 We developed kafka-assigner to solve the problem
– A single command to remove a broker and distribute its partitions
– Chainable modules for balancing partitions
– Open source! https://github.com/linkedin/kafka-tools
 Also working on “Cruise Control” for Kafka
– An add-on service that will handle redistributing partitions automatically
Pushing Data from Hadoop
 To help Hadoop jobs, we maintain a KafkaPushJob
– A mapper that produces messages to Kafka
– Pushes to data-deployment, which then gets mirrored to production
 Hadoop jobs tend to push a lot of data all at once
– Some jobs spin up hundreds of mappers
– Pushing many gigabytes of data in a very short period of time
 This overwhelms a Kafka cluster
– Spurious alerts for under replicated partitions
– Problems with mirroring the messages out
Kafka Quotas
 Quotas limit traffic based on client ID
– Specified in bytes/sec on a per-broker basis
– Not per-topic or per-partition
 Should be transparent to clients
– Accomplished by delaying the response to requests
– Newer clients have metrics specific to quotas for clarity
 We use it to protect the replication of the cluster
– Set it as high as possible while protecting against a single bad client
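From the application side, the main obligation is to identify yourself, since quotas are keyed on the client ID. A minimal producer sketch; the broker address and client ID are placeholders:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

public class QuotaAwareProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1.example.com:9092");
        // Broker-side quotas are applied per client ID, so use a stable,
        // meaningful name rather than the default empty string.
        props.put(ProducerConfig.CLIENT_ID_CONFIG, "hadoop-push-job");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        // When throttled, newer clients show it in the producer's
        // produce-throttle-time-avg / produce-throttle-time-max metrics
        // rather than failing requests.
        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            // produce as usual; the broker delays responses if the quota is exceeded
        }
    }
}
```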
Delete Topic
 Feature has been under development for almost 3 years
– Only recently has it even worked a little bit
– We’re still not sure about it (from SRE’s point of view)
 Recently performed additional testing so we can use it
– Found that even when disabled for a cluster, something was happening
– Some brokers claimed the topic was gone, some didn’t
– Mirror makers broke for the topic
 One of the code paths in the controller was not blocked
– Metadata change went out, but it was hard to diagnose
Brokers are Independent
 When there’s a problem in the cluster, brokers might have bad information
– The controller should tell them what the topic metadata is
– Brokers get out of sync due to connection issues or bugs
 There’s no good tool for just sending a request to a broker and reading the response
– We had to write a Java application just to send a metadata request (see the sketch below)
 Coming soon – kafka-protocol
– Simple CLI tool for sending individual requests to Kafka brokers
– Will be part of the https://github.com/linkedin/kafka-tools repository
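Short of a dedicated tool, one workable approach is to point a stock Java client at a single broker and request metadata; partitionsFor() then reflects that broker's view, which is essentially what the one-off application described above did. A sketch, with broker and topic names as placeholders:

```java
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.PartitionInfo;

public class BrokerMetadataCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Pin the client to one broker at a time to see that broker's view.
        props.put("bootstrap.servers", "broker1.example.com:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            for (PartitionInfo info : consumer.partitionsFor("suspect-topic")) {
                // Compare leader/replica assignments across brokers to spot drift.
                System.out.println(info);
            }
        }
    }
}
```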
Conclusion
Broker Improvement - JBOD
 We use RAID-10 on all brokers
– Trade off a lot of performance for a little resiliency
– Lose half of our disk space
 Current JBOD implementation isn’t great
– No admin tools for moving partitions
– Assignment is round-robin
– Broker shuts down if a single disk fails
 Looking at options
– Might try to fix the JBOD implementation in Kafka
– Testing running multiple brokers on a single server
Mirror Maker Improvements
 Mirror Maker has performance issues
– Has to decompress and recompress every message
– Loses information about partition affinity and strict ordering
 Developed an Identity message handler (see the sketch below)
– Messages in source partition 0 get produced directly to partition 0
– Requires mirror maker to maintain downstream partition counts
 Working on the next steps
– No decompression of message batches
– Looking at other options on how to run mirror makers
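A sketch of the identity idea follows. MirrorMaker's pluggable message handler (passed with --message.handler) hands each consumed record to user code; the identity variant re-emits the record pinned to its source partition number, which is why the target topic must have at least as many partitions as the source. The class below is a simplified stand-in for the real handler interface, whose signature varies across Kafka versions:

```java
import java.util.Collections;
import java.util.List;

import org.apache.kafka.clients.producer.ProducerRecord;

// Simplified stand-in for a MirrorMaker message handler; the real interface
// takes a consumed record object, but the core logic is just this.
public class IdentityHandlerSketch {
    public List<ProducerRecord<byte[], byte[]>> handle(
            String topic, int partition, byte[] key, byte[] value) {
        // An explicit partition in the ProducerRecord bypasses the partitioner,
        // preserving both key affinity and per-partition ordering.
        return Collections.singletonList(
                new ProducerRecord<>(topic, partition, key, value));
    }
}
```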
Administrative Improvements
 Multiple cluster management
– Topic management across clusters
– Visualization of mirror maker paths
 Better client monitoring
– Burrow for consumer monitoring
– No open source solution for producer monitoring (audit)
 End-to-end availability monitoring
Getting Involved With Kafka
 http://kafka.apache.org
 Join the mailing lists
– users@kafka.apache.org
– dev@kafka.apache.org
 irc.freenode.net - #apache-kafka
 Meetups
– Bay Area – https://www.meetup.com/Stream-Processing-Meetup-LinkedIn/
 Contribute code
39
Multi tier, multi-tenant, multi-problem kafka

Mais conteĂșdo relacionado

Mais procurados

Kafka at Peak Performance
Kafka at Peak PerformanceKafka at Peak Performance
Kafka at Peak PerformanceTodd Palino
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsDatabricks
 
Grafana Loki: like Prometheus, but for Logs
Grafana Loki: like Prometheus, but for LogsGrafana Loki: like Prometheus, but for Logs
Grafana Loki: like Prometheus, but for LogsMarco Pracucci
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Jean-Paul Azar
 
ksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database SystemksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database Systemconfluent
 
MySQL Monitoring using Prometheus & Grafana
MySQL Monitoring using Prometheus & GrafanaMySQL Monitoring using Prometheus & Grafana
MySQL Monitoring using Prometheus & GrafanaYoungHeon (Roy) Kim
 
Getting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on KubernetesGetting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on KubernetesDatabricks
 
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkDBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkTimothy Spann
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentFlink Forward
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...StampedeCon
 
Kafka at scale facebook israel
Kafka at scale   facebook israelKafka at scale   facebook israel
Kafka at scale facebook israelGwen (Chen) Shapira
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 
Kafka Streams State Stores Being Persistent
Kafka Streams State Stores Being PersistentKafka Streams State Stores Being Persistent
Kafka Streams State Stores Being Persistentconfluent
 
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin KnaufWebinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin KnaufVerverica
 
The top 3 challenges running multi-tenant Flink at scale
The top 3 challenges running multi-tenant Flink at scaleThe top 3 challenges running multi-tenant Flink at scale
The top 3 challenges running multi-tenant Flink at scaleFlink Forward
 

Mais procurados (20)

Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
 
Kafka at Peak Performance
Kafka at Peak PerformanceKafka at Peak Performance
Kafka at Peak Performance
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
Envoy and Kafka
Envoy and KafkaEnvoy and Kafka
Envoy and Kafka
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
 
Grafana Loki: like Prometheus, but for Logs
Grafana Loki: like Prometheus, but for LogsGrafana Loki: like Prometheus, but for Logs
Grafana Loki: like Prometheus, but for Logs
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
 
ksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database SystemksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database System
 
MySQL Monitoring using Prometheus & Grafana
MySQL Monitoring using Prometheus & GrafanaMySQL Monitoring using Prometheus & Grafana
MySQL Monitoring using Prometheus & Grafana
 
Getting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on KubernetesGetting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on Kubernetes
 
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkDBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
 
Kafka at scale facebook israel
Kafka at scale   facebook israelKafka at scale   facebook israel
Kafka at scale facebook israel
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Kafka Streams State Stores Being Persistent
Kafka Streams State Stores Being PersistentKafka Streams State Stores Being Persistent
Kafka Streams State Stores Being Persistent
 
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin KnaufWebinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf
 
The top 3 challenges running multi-tenant Flink at scale
The top 3 challenges running multi-tenant Flink at scaleThe top 3 challenges running multi-tenant Flink at scale
The top 3 challenges running multi-tenant Flink at scale
 

Semelhante a Multi tier, multi-tenant, multi-problem kafka

More Datacenters, More Problems
More Datacenters, More ProblemsMore Datacenters, More Problems
More Datacenters, More ProblemsTodd Palino
 
Data stream with cruise control
Data stream with cruise controlData stream with cruise control
Data stream with cruise controlBill Liu
 
Spark Streaming the Industrial IoT
Spark Streaming the Industrial IoTSpark Streaming the Industrial IoT
Spark Streaming the Industrial IoTJim Haughwout
 
Modern Software Development
Modern Software DevelopmentModern Software Development
Modern Software DevelopmentAngel Conde Manjon
 
Symantec SDN Deployment
Symantec SDN DeploymentSymantec SDN Deployment
Symantec SDN DeploymentRudrajit Tapadar
 
Security and Virtualization in the Data Center
Security and Virtualization in the Data CenterSecurity and Virtualization in the Data Center
Security and Virtualization in the Data CenterCisco Canada
 
Working with Hybrid Clouds and Data Architectures
Working with Hybrid Clouds and Data ArchitecturesWorking with Hybrid Clouds and Data Architectures
Working with Hybrid Clouds and Data ArchitecturesDave McAllister
 
Oracle Ravello Presentation 7Dec16 v1
Oracle Ravello Presentation 7Dec16 v1Oracle Ravello Presentation 7Dec16 v1
Oracle Ravello Presentation 7Dec16 v1Kurt Liu
 
SP Virtual Managed Services (VMS) for Intelligent WAN (IWAN)
SP Virtual Managed Services (VMS) for Intelligent WAN (IWAN)SP Virtual Managed Services (VMS) for Intelligent WAN (IWAN)
SP Virtual Managed Services (VMS) for Intelligent WAN (IWAN)Cisco Canada
 
Adobe Ask the AEM Community Expert Session Oct 2016
Adobe Ask the AEM Community Expert Session Oct 2016Adobe Ask the AEM Community Expert Session Oct 2016
Adobe Ask the AEM Community Expert Session Oct 2016AdobeMarketingCloud
 
An Integrated Approach to Manage IT Network Traffic - An Overview
An Integrated Approach to Manage IT Network Traffic - An OverviewAn Integrated Approach to Manage IT Network Traffic - An Overview
An Integrated Approach to Manage IT Network Traffic - An OverviewManageEngine
 
I'm No Hero: Full Stack Reliability at LinkedIn
I'm No Hero: Full Stack Reliability at LinkedInI'm No Hero: Full Stack Reliability at LinkedIn
I'm No Hero: Full Stack Reliability at LinkedInTodd Palino
 
Oracle Commerce as a Secure, Scalable Hybrid Cloud Service, webinar slides
Oracle Commerce as a Secure,  Scalable Hybrid Cloud Service, webinar slidesOracle Commerce as a Secure,  Scalable Hybrid Cloud Service, webinar slides
Oracle Commerce as a Secure, Scalable Hybrid Cloud Service, webinar slidesGrid Dynamics
 
OSMC 2022 | Current State of icinga by Bernd Erk
OSMC 2022 | Current State of icinga by Bernd ErkOSMC 2022 | Current State of icinga by Bernd Erk
OSMC 2022 | Current State of icinga by Bernd ErkNETWAYS
 
Introduction to Kafka Cruise Control
Introduction to Kafka Cruise ControlIntroduction to Kafka Cruise Control
Introduction to Kafka Cruise ControlJiangjie Qin
 
Updates to Apache CloudStack and LINBIT SDS
Updates to Apache CloudStack and LINBIT SDSUpdates to Apache CloudStack and LINBIT SDS
Updates to Apache CloudStack and LINBIT SDSShapeBlue
 
Serverless: Market Overview and Investment Opportunities
Serverless: Market Overview and Investment OpportunitiesServerless: Market Overview and Investment Opportunities
Serverless: Market Overview and Investment OpportunitiesUnderscore VC
 
Unlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insightsUnlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insightsconfluent
 
Simplifying and Future-Proofing Hadoop
Simplifying and Future-Proofing HadoopSimplifying and Future-Proofing Hadoop
Simplifying and Future-Proofing HadoopPrecisely
 
Building a Modern Enterprise SOA at LinkedIn
Building a Modern Enterprise SOA at LinkedInBuilding a Modern Enterprise SOA at LinkedIn
Building a Modern Enterprise SOA at LinkedInJens Pillgram-Larsen
 

Semelhante a Multi tier, multi-tenant, multi-problem kafka (20)

More Datacenters, More Problems
More Datacenters, More ProblemsMore Datacenters, More Problems
More Datacenters, More Problems
 
Data stream with cruise control
Data stream with cruise controlData stream with cruise control
Data stream with cruise control
 
Spark Streaming the Industrial IoT
Spark Streaming the Industrial IoTSpark Streaming the Industrial IoT
Spark Streaming the Industrial IoT
 
Modern Software Development
Modern Software DevelopmentModern Software Development
Modern Software Development
 
Symantec SDN Deployment
Symantec SDN DeploymentSymantec SDN Deployment
Symantec SDN Deployment
 
Security and Virtualization in the Data Center
Security and Virtualization in the Data CenterSecurity and Virtualization in the Data Center
Security and Virtualization in the Data Center
 
Working with Hybrid Clouds and Data Architectures
Working with Hybrid Clouds and Data ArchitecturesWorking with Hybrid Clouds and Data Architectures
Working with Hybrid Clouds and Data Architectures
 
Oracle Ravello Presentation 7Dec16 v1
Oracle Ravello Presentation 7Dec16 v1Oracle Ravello Presentation 7Dec16 v1
Oracle Ravello Presentation 7Dec16 v1
 
SP Virtual Managed Services (VMS) for Intelligent WAN (IWAN)
SP Virtual Managed Services (VMS) for Intelligent WAN (IWAN)SP Virtual Managed Services (VMS) for Intelligent WAN (IWAN)
SP Virtual Managed Services (VMS) for Intelligent WAN (IWAN)
 
Adobe Ask the AEM Community Expert Session Oct 2016
Adobe Ask the AEM Community Expert Session Oct 2016Adobe Ask the AEM Community Expert Session Oct 2016
Adobe Ask the AEM Community Expert Session Oct 2016
 
An Integrated Approach to Manage IT Network Traffic - An Overview
An Integrated Approach to Manage IT Network Traffic - An OverviewAn Integrated Approach to Manage IT Network Traffic - An Overview
An Integrated Approach to Manage IT Network Traffic - An Overview
 
I'm No Hero: Full Stack Reliability at LinkedIn
I'm No Hero: Full Stack Reliability at LinkedInI'm No Hero: Full Stack Reliability at LinkedIn
I'm No Hero: Full Stack Reliability at LinkedIn
 
Oracle Commerce as a Secure, Scalable Hybrid Cloud Service, webinar slides
Oracle Commerce as a Secure,  Scalable Hybrid Cloud Service, webinar slidesOracle Commerce as a Secure,  Scalable Hybrid Cloud Service, webinar slides
Oracle Commerce as a Secure, Scalable Hybrid Cloud Service, webinar slides
 
OSMC 2022 | Current State of icinga by Bernd Erk
OSMC 2022 | Current State of icinga by Bernd ErkOSMC 2022 | Current State of icinga by Bernd Erk
OSMC 2022 | Current State of icinga by Bernd Erk
 
Introduction to Kafka Cruise Control
Introduction to Kafka Cruise ControlIntroduction to Kafka Cruise Control
Introduction to Kafka Cruise Control
 
Updates to Apache CloudStack and LINBIT SDS
Updates to Apache CloudStack and LINBIT SDSUpdates to Apache CloudStack and LINBIT SDS
Updates to Apache CloudStack and LINBIT SDS
 
Serverless: Market Overview and Investment Opportunities
Serverless: Market Overview and Investment OpportunitiesServerless: Market Overview and Investment Opportunities
Serverless: Market Overview and Investment Opportunities
 
Unlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insightsUnlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insights
 
Simplifying and Future-Proofing Hadoop
Simplifying and Future-Proofing HadoopSimplifying and Future-Proofing Hadoop
Simplifying and Future-Proofing Hadoop
 
Building a Modern Enterprise SOA at LinkedIn
Building a Modern Enterprise SOA at LinkedInBuilding a Modern Enterprise SOA at LinkedIn
Building a Modern Enterprise SOA at LinkedIn
 

Mais de Todd Palino

Leading Without Managing: Becoming an SRE Technical Leader
Leading Without Managing: Becoming an SRE Technical LeaderLeading Without Managing: Becoming an SRE Technical Leader
Leading Without Managing: Becoming an SRE Technical LeaderTodd Palino
 
From Operations to Site Reliability in Five Easy Steps
From Operations to Site Reliability in Five Easy StepsFrom Operations to Site Reliability in Five Easy Steps
From Operations to Site Reliability in Five Easy StepsTodd Palino
 
Code Yellow: Helping Operations Top-Heavy Teams the Smart Way
Code Yellow: Helping Operations Top-Heavy Teams the Smart WayCode Yellow: Helping Operations Top-Heavy Teams the Smart Way
Code Yellow: Helping Operations Top-Heavy Teams the Smart WayTodd Palino
 
Why Does (My) Monitoring Suck?
Why Does (My) Monitoring Suck?Why Does (My) Monitoring Suck?
Why Does (My) Monitoring Suck?Todd Palino
 
URP? Excuse You! The Three Kafka Metrics You Need to Know
URP? Excuse You! The Three Kafka Metrics You Need to KnowURP? Excuse You! The Three Kafka Metrics You Need to Know
URP? Excuse You! The Three Kafka Metrics You Need to KnowTodd Palino
 
Redefine Operations in a DevOps World: The New Role for Site Reliability Eng...
Redefine Operations in a DevOps World: The New Role for Site Reliability Eng...Redefine Operations in a DevOps World: The New Role for Site Reliability Eng...
Redefine Operations in a DevOps World: The New Role for Site Reliability Eng...Todd Palino
 
Running Kafka for Maximum Pain
Running Kafka for Maximum PainRunning Kafka for Maximum Pain
Running Kafka for Maximum PainTodd Palino
 
Putting Kafka Into Overdrive
Putting Kafka Into OverdrivePutting Kafka Into Overdrive
Putting Kafka Into OverdriveTodd Palino
 
Tuning Kafka for Fun and Profit
Tuning Kafka for Fun and ProfitTuning Kafka for Fun and Profit
Tuning Kafka for Fun and ProfitTodd Palino
 
Enterprise Kafka: Kafka as a Service
Enterprise Kafka: Kafka as a ServiceEnterprise Kafka: Kafka as a Service
Enterprise Kafka: Kafka as a ServiceTodd Palino
 

Mais de Todd Palino (10)

Leading Without Managing: Becoming an SRE Technical Leader
Leading Without Managing: Becoming an SRE Technical LeaderLeading Without Managing: Becoming an SRE Technical Leader
Leading Without Managing: Becoming an SRE Technical Leader
 
From Operations to Site Reliability in Five Easy Steps
From Operations to Site Reliability in Five Easy StepsFrom Operations to Site Reliability in Five Easy Steps
From Operations to Site Reliability in Five Easy Steps
 
Code Yellow: Helping Operations Top-Heavy Teams the Smart Way
Code Yellow: Helping Operations Top-Heavy Teams the Smart WayCode Yellow: Helping Operations Top-Heavy Teams the Smart Way
Code Yellow: Helping Operations Top-Heavy Teams the Smart Way
 
Why Does (My) Monitoring Suck?
Why Does (My) Monitoring Suck?Why Does (My) Monitoring Suck?
Why Does (My) Monitoring Suck?
 
URP? Excuse You! The Three Kafka Metrics You Need to Know
URP? Excuse You! The Three Kafka Metrics You Need to KnowURP? Excuse You! The Three Kafka Metrics You Need to Know
URP? Excuse You! The Three Kafka Metrics You Need to Know
 
Redefine Operations in a DevOps World: The New Role for Site Reliability Eng...
Redefine Operations in a DevOps World: The New Role for Site Reliability Eng...Redefine Operations in a DevOps World: The New Role for Site Reliability Eng...
Redefine Operations in a DevOps World: The New Role for Site Reliability Eng...
 
Running Kafka for Maximum Pain
Running Kafka for Maximum PainRunning Kafka for Maximum Pain
Running Kafka for Maximum Pain
 
Putting Kafka Into Overdrive
Putting Kafka Into OverdrivePutting Kafka Into Overdrive
Putting Kafka Into Overdrive
 
Tuning Kafka for Fun and Profit
Tuning Kafka for Fun and ProfitTuning Kafka for Fun and Profit
Tuning Kafka for Fun and Profit
 
Enterprise Kafka: Kafka as a Service
Enterprise Kafka: Kafka as a ServiceEnterprise Kafka: Kafka as a Service
Enterprise Kafka: Kafka as a Service
 

Último

VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSRajkumarAkumalla
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...RajaP95
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝soniya singh
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingrknatarajan
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)Suman Mia
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 

Último (20)

VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 

Multi tier, multi-tenant, multi-problem kafka

  • 1. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Multi-Tier, Multi-Tenant, Multi-Problem Kafka
  • 2. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Todd Palino
  • 3. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Who Am I? 3
  • 4. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. What Will We Talk About?  Multi-Tenant Pipelines  Multi-Tier Architecture  Why I Drink Interesting Problems  Conclusion 4
  • 5. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Multi-Tenant Pipelines 5
  • 6. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Tracking and Data Deployment  Tracking – Data going to HDFS  Data Deployment – Hadoop job results going to online applications  Many shared topics  Schemas require a common header  All message counts are audited  Special Problems – Hard to tell what application is dropping messages – Some of these messages are copied 42 times! 6
  • 7. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Metrics  Application and OS metrics  Deployment and build system events  Service calls – sampling of timing information for individual application calls  Some application logs  Special Problems – Every server in the datacenter produces to this cluster at least twice – Graphing/Alerting system consumes the metrics 20 times 7
  • 8. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Logging  Application logging messages destined for ELK clusters  Lower retention than other clusters  Loosest restrictions on message schema and encoding  Special Problems – Not many – it’s still overprovisioned – Customers starting to ask about aggregation 8
  • 9. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Queuing  Everything else  Primarily messages internal to applications  Also emails and user messaging  Messages are Avro encoded, but do not require headers  Special Problems: – Many messages which use unregistered schemas – Clusters can have very high message rates (but not large data) 9
  • 10. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Special Case Clusters  Not all use cases fit multi-tenancy – Custom configurations that are needed – Tighter performance guarantees – Use of topic deletion  Espresso (KV store) internal replication  Brooklin – Change capture  Replication from Hadoop to Voldemort 10
  • 11. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Tiered Cluster Architecture 11
  • 12. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. One Kafka Cluster 12
  • 13. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Multiple Clusters – Message Aggregation 13
  • 14. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Why Not Direct?  Network Concerns – Bandwidth – Network partitioning – Latency  Security Concerns – Firewalls and ACLs – Encrypting data in transit  Resource Concerns – A misbehaving application can swamp production resources 14
  • 15. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. What Do We Lose?  You may lose message ordering – Mirror maker breaks apart message batches and redistributes them  You may lose key to partition affinity – Mirror maker will partition based on the key – Differing partition counts in source and target will result in differing distribution – Mirror maker does not (without work) honor custom partitioning 15
  • 16. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Aggregation Rules  Aggregate clusters are only for consuming messages – Producing to an aggregate cluster is not allowed – This assures all aggregate clusters have the same content  Not every topic appears in PROD aggregate-tracking clusters – Trying to discourage aggregate cluster usage in PROD – All topics are available in CORP  Aggregate-queuing is whitelist only and very restricted – Please discuss your use case with us before developing 16
  • 17. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Interesting Problems 17
  • 18. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Buy The Book! 18 Early Access available now. Covers all aspects of Kafka, from setup to client development to ongoing administration and troubleshooting. Also discusses stream processing and other use cases.
  • 19. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Monitoring Using Kafka  Monitoring and alerting are self-service – No gatekeeper on what metrics are collected and stored  Applications use a common container – EventBus Kafka producer – Simple annotation of metrics to collect – Sampled service calls – Application logs  Everything is produced to Kafka and consumed by the monitoring infrastructure 19
  • 20. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Monitoring Kafka  Kafka is great for monitoring your applications 20
  • 21. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. KMon and EnlightIN  Developed a separate monitoring and notification system – Metrics are only retained long enough to alert on them – One rule: we can’t use Kafka  Alerting is simplified from our self-service system – Nothing complex like regular expressions or RPNs – Only used for critical Kafka and Zookeeper alerts – Faster and more reliable  Notifications are cleaner – Alerts are grouped into incidents for fewer notifications when things break – Notification system is generic and subscribable so we can use it for other things 21
  • 22. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Broker Monitoring  Bytes In and Out, Messages In – Why not messages out?  Partitions – Count and Leader Count – Under Replicated and Offline  Threads – Network pool, Request pool – Max Dirty Percent  Requests – Rates and times - total, queue, local, and send 22
  • 23. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Is Kafka Working?  Knowing that the cluster is up isn’t always enough – Network problems – Metrics can lie  Customers still ask us first if something breaks – Part of the solution is educating them as to what to monitor – Need to be absolutely sure of the answer “There’s nothing wrong with Kafka” 23
  • 24. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Kafka Monitoring Framework  Producer to consumer testing of a Kafka cluster – Assures that producers and consumers actually work – Measures how long messages take to get through  We have a SLO of 99.99% availability for all clusters  Working on multi-tier support – Answers the question of how long messages take to get to Hadoop  LinkedIn Kafka Open Source – https://github.com/linkedin/streaming 24
  • 25. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Is Mirroring Working?  Most critical data flows through Kafka – Most of that depends on mirror makers – How do we make sure it all gets where it’s going?  Mirror maker pipelines can have over a thousand topics – Different message rates – Some are more important than others  Lag threshold monitoring doesn’t work – Traffic spikes cause false alerts – What should the threshold be? – No easy way to monitor 1000 topics and over 10k partitions 25
  • 26. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Kafka Audit  Audit tracks topic completeness across all clusters in the pipeline – Primarily tracking messages – Schema must have a valid header – Alerts for DWH topics are set for 0.1% message loss  Provided as an integrated part of the internal Kafka libraries  Used for data completeness checks before Hadoop jobs run 26
  • 27. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Auditing Message Flows 27
  • 28. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Burrow  Burrow is an advanced Kafka consumer monitoring system – Provides an objective view of consumer status – Much more powerful than threshold-based lag monitoring  Burrow is Open Source! – Used by many other companies, including Wikimedia and Blizzard – Used internally to assure all Mirror Makers and Audit are running correctly  Exports metrics for all consumers to self-service monitoring  https://github.com/linkedin/Burrow 28
  • 29. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. MTTF Is Not Your Friend  We have over 1800 Kafka brokers – All have at least 12 drives, most have 16 – Dual CPUs, at least 64 GB of memory – Really lousy Megaraid controllers  This means hardware fails daily – We don’t always know when it happens, if it doesn’t take the system down – It can’t always be fixed immediately – We can take one broker down, but not two 29
  • 30. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Moving Partitions  Prior to Kafka 0.8, moving partitions was basically impossible – It’s still not easy – you have to be explicit about what you are moving – There’s no good way to balance partitions in a cluster  We developed kafka-assigner to solve the problem – A single command to remove a broker and distribute it’s partitions – Chainable modules for balancing partitions – Open source! https://github.com/linkedin/kafka-tools  Also working on “Cruise Control” for Kafka – An add-on service that will handle redistributing partitions automatically 30
  • 31. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Pushing Data from Hadoop  To help Hadoop jobs, we maintain a KafkaPushJob – A mapper that produces messages to Kafka – Pushes to data-deployment, which then gets mirrored to production  Hadoop jobs tend to push a lot of data all at once – Some jobs spin up hundreds of mappers – Pushing many gigabytes of data in a very short period of time  This overwhelms a Kafka cluster – Spurious alerts for under replicated partitions – Problems with mirroring the messages out 31
  • 32. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Kafka Quotas  Quotas limit traffic based on client ID – Specified in bytes/sec on a per-broker basis – Not per-topic or per-partition  Should be transparent to clients – Accomplished by delaying the response to requests – Newer clients have metrics specific to quotas for clarity  We use it to protect the replication of the cluster – Set it as high as possible while protecting against a single bad client 32
  • 33. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Delete Topic  Feature has been under development for almost 3 years – Only recently has it even worked a little bit – We’re still not sure about it (from SRE’s point of view)  Recently performed additional testing so we can use it – Found that even when disabled for a cluster, something was happening – Some brokers claimed the topic was gone, some didn’t – Mirror makers broke for the topic  One of the code paths in the controller was not blocked – Metadata change went out, but it was hard to diagnose 33
  • 34. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Brokers are Independent  When there’s a problem in the cluster, brokers might have bad information – The controller should tell them what the topic metadata is – Brokers get out of sync due to connection issues or bugs  There’s no good tool for just sending a request to a broker and reading the response – We had to write a Java application just to send a metadata request  Coming soon – kafka-protocol – Simple CLI tool for sending individual requests to Kafka brokers – Will be part of the https://github.com/linkedin/kafka-tools repository 34
  • 35. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Conclusion 35
  • 36. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Broker Improvement - JBOD  We use RAID-10 on all brokers – Trade off a lot of performance for a little resiliency – Lose half of our disk space  Current JBOD implementation isn’t great – No admin tools for moving partitions – Assignment is round-robin – Broker shuts down if a single disk fails  Looking at options – Might try to fix the JBOD implementation in Kafka – Testing running multiple brokers on a single server 36
  • 37. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Mirror Maker Improvements  Mirror Maker has performance issues – Has to decompress and recompress every message – Loses information about partition affinity and strict ordering  Developed an Identity message handler – Messages in source partition 0 get produced directly to partition 0 – Requires mirror maker to maintain downstream partition counts  Working on the next steps – No decompression of message batches – Looking at other options on how to run mirror makers 37
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Administrative Improvements
 Multiple cluster management
– Topic management across clusters
– Visualization of mirror maker paths
 Better client monitoring
– Burrow for consumer monitoring (example check below)
– No open source solution for producer monitoring (audit)
 End-to-end availability monitoring
38
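As an example of what consumer monitoring with Burrow looks like from the outside, here's a hedged sketch polling its HTTP status endpoint (the /v2 path reflects Burrow's API of this era; host, port, and path may differ in your deployment):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Usage: BurrowCheck <cluster> <consumer-group>
// Fetches Burrow's evaluation of a consumer group; the JSON body includes
// an overall "status" field (anything other than OK means trouble).
public class BurrowCheck {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://burrow.example.com:8000/v2/kafka/"
        + args[0] + "/consumer/" + args[1] + "/status");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    try (BufferedReader reader =
        new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```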
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Getting Involved With Kafka
 http://kafka.apache.org
 Join the mailing lists
– users@kafka.apache.org
– dev@kafka.apache.org
 irc.freenode.net - #apache-kafka
 Meetups
– Bay Area – https://www.meetup.com/Stream-Processing-Meetup-LinkedIn/
 Contribute code
39

Editor's Notes

1. So who am I, and why am I qualified to stand up here? I am a member of the Data Infrastructure Streaming SRE team at LinkedIn. We're responsible for Kafka and Zookeeper operations, as well as Samza and a couple of iterations of our change capture systems. SRE stands for Site Reliability Engineering. Many of you, like myself before I started in this role, may not be familiar with the title. SRE combines several roles that fit together into one operations position. Foremost, we are administrators: we manage all of the systems in our area. We are also architects: we do capacity planning for our deployments, plan out our infrastructure in new datacenters, and make sure all the pieces fit together. And we are also developers: we identify tools we need, both to make our jobs easier and to keep our users happy, and we write and maintain them. At the end of the day, our job is to keep the site running, always.
2. What are the things we are going to cover in this talk? I'm going to assume some basic knowledge of what Kafka is and how it works, so I won't be covering the basics. I'll start by describing the Kafka pipelines we have set up at LinkedIn in our multi-tenant environment. This will transition into the tiered architecture that many of those pipelines use. But I'll spend most of our time on the interesting problems that we've run into in running Kafka at such a large scale. We'll wrap up talking about a couple of the things that we're working on now, and hopefully have some time for Q&A.
  3. I won’t be going into too much detail on how Kafka works. If you do not have a basic understanding of Kafka itself, I suggest checking out some of the resources listed in the Reference slides at the end of this deck. Here’s what a single Kafka cluster looks like at LinkedIn. I’ll get into some details on the TrackerProducer/TrackerConsumer components later, but they are internal libraries that wrap the open source Kafka producer and consumer components and integrate with our schema registry and our monitoring systems. Every cluster has multiple Kafka brokers, storing their metadata in a Zookeeper ensemble. We have producers sending messages in, and consumers reading messages out. At the present time, our consumers talk to Zookeeper as well and everything works well. In LinkedIn’s environment, all of these components live in the same datacenter, in the same network. What happens when you have two sites to deal with?
4. Now we iterate on the architecture. We add the concept of an aggregate Kafka cluster, which contains all of the messages from each of the primary datacenters' local clusters. We also have a copy of this cluster in the secondary datacenter, C, for consumers there to access. We still have cross-datacenter traffic – that can't be avoided if we need to move data around. But we have isolated it to one application, mirror maker, which we can monitor and ensure works properly. This is a better situation than needing to have each consumer worry about it for themselves. We've definitely added complexity here, but it serves a purpose. By having the infrastructure be a little more complex, we simplify the usage of Kafka for our customers. Producers know that if they send messages to their local cluster, they will show up in the appropriate places without additional work on their part. Consumers can select which view of the data they need, and have assurances that they will see everything that is produced. The intricacies of how the data gets moved around are left to people like me, who run the Kafka infrastructure itself.
  5. We’ve chosen to keep all of our clients local to the clusters and use a tiered architecture due to several major concerns. The primary concern is around the networking itself. Kafka enables multiple consumers to read the same topic, which means if we are reading remotely, we are copying messages over expensive inter-datacenter connections multiple times. We also have to handle problems like network partitioning in every client. Granted, you can have a partition even within a single datacenter, but it happens much more frequently when you are dealing with large distances. There’s also the concern of latency in connections – distance increases latency. Latency can cause interesting problems in client applications, and I like life to be boring. There are also security concerns around talking across datacenters. If we keep all of our clients local, we do not have to worry about ACL problems between the clients and the brokers (and Zookeeper as well). We can also deal with the problem of encrypting data in transit much more easily. This is one problem we have not worried about as much, but it is becoming a big concern now. The last concern is over resource usage. Everything at LinkedIn talks to Kafka, and a problem that takes out a production cluster is a major event. It could mean we have to shift traffic out of the datacenter until we resolve it, or it could result in inconsistent behavior in applications. Any application could overwhelm a cluster, but there are some, such as applications that run in Hadoop, that are more prone to this. By keeping those clients talking to a cluster that is separate from the front end, we mitigate resource contention.
  6. More components means that we have more places to poke and prod to get the most efficiency out of our system. With multiple tiers most of this revolves around making sure the sizes of everything are correct.
  7. This is as good a time as any for a little self-promotion. Many of the questions around how to set up and lay out Kafka clusters, including specific performance concerns and tuning, are covered in this fine book that I am co-authoring. You’ll also find a trove of information about client development, stream processing, and a variety of use cases for Kafka. We currently have 4 chapters complete, and it’s available from O’Reilly under their early access program. We expect to have the book completed late this year, or early next, with chapters being released as soon as we can write them.
8. Many of us use Kafka for monitoring applications. At LinkedIn, every application and server writes metrics and logs to Kafka. We have central applications that read out these metrics and provide pretty graphs, thresholding, and other tools to make sure that everything is running properly within LinkedIn. Kafka itself is no exception, which leads to this: as soon as I say "monitoring Kafka with Kafka", we know this is not a good thing.
9. For the broker, what are the critical metrics that I'm keeping an eye on every day? Bytes in, bytes out, and messages in are all critical metrics for us from a growth point of view. While we don't alert on these, we do keep an eye on them because they help us to understand how the usage of the cluster is growing over time, and they let us plan for the next expansion. You may ask why I don't have messages out on this list. It's because there is no messages-out metric: Kafka consumers read batches of messages, not single messages, and it's not easy for Kafka to count messages on the outbound side. There's a metric on the number of fetches, but it's less interesting to me.

For partitions, we start with the number of partitions per broker, and the number of leader partitions per broker. As we know, there is a single broker responsible for leadership of a given partition. In a healthy cluster, I want to make sure that each broker has approximately the same number of partitions, and that each broker is leading about 50% of those, because we have a replication factor of 2 for most things. We can also see this reflected in the bytes rates, because if the partitions are imbalanced, the bytes rates will be as well. This gives us uneven load, and that can cause a lot of problems.

More importantly, though, we monitor the number of under replicated partitions that each broker is reporting. I'm going to get into this in much more detail in a few slides, but this indicates the number of partitions that the broker is leader for where at least one of the replicas has fallen behind. This is the single most important metric to monitor and alert on. It indicates a number of problems, and a single alert here will provide coverage of most Kafka issues.

Lastly, there are metrics on the thread pool usage, both network and request pools, as well as rate and time metrics on the different types of requests. These are all examples of metrics that are good to have, but they're difficult to alert on. If you are able to establish a good baseline on some of the request time metrics, I do recommend doing it, however, as rising request times can indicate a problem that is building up, and you may be able to see it before it becomes under replicated partitions.

Buried in the middle there is the "max dirty percent" metric. This is a measurement of how many log segments are able to be compacted but are not currently compacted. Right now, this is the only way to monitor the health of log compaction within Kafka, which is critical for the consumer offsets topic at the very least. If the thread doing log compaction dies (which it can do frequently), the only way you will know is by this metric increasing and staying high. Normal behavior is for the metric to spike up and immediately drop back down again.
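Since under replicated partitions is singled out as the one metric to alert on, here is a hedged sketch of reading that gauge over JMX. The MBean name is the standard kafka.server ReplicaManager bean; the JMX port (9999) is hypothetical and depends on how the broker's JMX agent is configured:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Usage: UnderReplicatedCheck <broker-host>
// Reads a broker's UnderReplicatedPartitions gauge; alerting when this is
// nonzero catches most Kafka problems, as described in the note above.
public class UnderReplicatedCheck {
  public static void main(String[] args) throws Exception {
    JMXServiceURL url = new JMXServiceURL(
        "service:jmx:rmi:///jndi/rmi://" + args[0] + ":9999/jmxrmi");
    try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
      MBeanServerConnection mbeans = connector.getMBeanServerConnection();
      ObjectName gauge = new ObjectName(
          "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions");
      System.out.println("UnderReplicatedPartitions = "
          + mbeans.getAttribute(gauge, "Value"));
    }
  }
}
```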
10. There are a number of things that can be improved upon, both in the brokers and in the mirror maker, to make it easier to set up and manage multiple datacenters. Another big problem is that we are using RAID and providing a single mount point to the Kafka brokers for a log dir. This is because there are some issues with the way JBOD is handled in the broker. Specifically, the brokers assign partitions to log dirs round-robin, without taking current size into account. In addition, there are no administrative functions to move partitions from one directory to another. And if a single disk fails, the entire broker fails. If JBOD were more robust, we could have replication factors of 3 or 4 without an increase in hardware cost, which would allow us to have "no data loss" configurations.
11. The big improvement to mirror maker is the creation of an identity mirror maker, which would keep message batches together in the exact same partition from source to target cluster. This would completely eliminate the compression overhead from the mirror maker, making it much faster and more efficient. Of course, this requires maintaining the partition counts in the clusters properly, and allowing the mirror maker to increase partition counts in a target cluster if needed.
12. That leads into the idea of multi-cluster management. While there are a couple of people making some headway on this in the open source world, we still lack a solid interface for managing Kafka clusters as part of an overall infrastructure. This would include maintaining topic configurations across multiple clusters and easily configuring and visualizing the mirror maker links between them. Another piece needed is better client monitoring overall. Burrow provides us with a good view of what the consumers are doing, but there's nothing available yet for producer client monitoring. We, of course, have our internal audit system for this, and other companies have their own versions as well. It would be nice to have an open source solution that anyone can use for assuring that the producers are working properly. We could also use better end-to-end monitoring of our Kafka clusters, so we can know that they are available. We have a lot of metrics that can track information about the individual components, but without a client view of the cluster, we don't know if the cluster is actually available. We also have a hard time making sure that the entire pipeline is working properly. There's not a lot available for this right now, but watch this space.

13. So how can you get more involved in the Kafka community? The most obvious answer is to go to kafka.apache.org. From there you can join the mailing lists, on either the development or the user side. You'll find people on the #apache-kafka channel on Freenode IRC if you have questions. We also coordinate meetups for both Kafka and Samza in the Bay Area, with streaming available if you are not local. You can also dive into the source repository, and work on and contribute your own tools back. Kafka may be young, but it's a critical piece of data infrastructure for many of us.