2. instaclustr.com
Introduction
# Who am I?
/bin/whoami
● Ben Bromhead, CTO, Instaclustr
# Who is Instaclustr?
/bin/id -g -n
● Experts in reliability at scale
● Manage/Support 3k+ Cassandra, Spark and Elassandra nodes
● Platform provides automated provisioning, monitoring and management
● Available on AWS, GCP, Azure and IBM Cloud
● Managed Apache Kafka released May 21st
Agenda
● What is Kafka
● A quick intro to how it works
● Context - our offering and development process
● Hardware choice and benchmarking
● Topic and user management
● Broker security configuration
● Monitoring
● Backup and Restore
What Is Apache Kafka?
Key Characteristics
● Horizontally scalable, distributed system
● Performance
○ Low latency, high throughput
● Scalability
○ Linear broker scalability via partitioned topics
○ Linear consumer scalability via consumer groups
● Fault-tolerance
○ Data is replicated across multiple brokers
○ Automatic broker failover when primary replica goes offline
○ Automatic consumer failover when a consumer in a consumer group goes offline
● Apache Foundation Open Source
● Production Proven
• Publish & Subscribe to streams of data (reliable message transport)
• Transform and/or aggregate data streams using distributed processing applications (stream processing)
Why use Apache Kafka?
● Provide a buffering mechanism in front of a processing application (i.e. cope with a temporary incoming message rate greater than the processing app can handle)
● A special case of buffering is to allow producers to publish messages with guaranteed delivery even if the consumers are down when the message is published
● As an event store for event sourcing or a Kappa architecture
● Facilitate flexible, configurable architectures with many producers -> many consumers by decoupling the details of who or what consumes messages from the apps that produce them (and vice versa)
● Perform stream analytics (with Kafka Streams)
How does it work: producing records
● Each topic has a fixed number of partitions
● Records published to a topic by a producer are divided amongst the topic’s partitions
● Partitions are ordered, immutable lists
● Each new record is appended to the end of a partition
● Each partition is stored on a single leader broker, and may optionally be replicated to one or more follower brokers
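The producing model above can be sketched in a few lines of Python. This is a toy in-memory model, not the Kafka client API: the `Topic` class, its method names and the CRC32 hash are all assumptions made for illustration (Kafka's own partitioner uses a different hash function).

```python
import zlib


class Topic:
    """Toy model of a topic: a fixed number of append-only partitions."""

    def __init__(self, num_partitions):
        # The partition count is fixed when the topic is created.
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Stand-in hash (assumption); the same key always maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def publish(self, key, value):
        # Records are only ever appended to the end of a partition.
        index = self.partition_for(key)
        self.partitions[index].append((key, value))
        offset = len(self.partitions[index]) - 1
        return index, offset


topic = Topic(num_partitions=3)
first = topic.publish("user-42", "login")
second = topic.publish("user-42", "logout")
print(first, second)
```

Because both records share a key, they land in the same partition with consecutive offsets, which is what preserves per-key ordering.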
How does it work: consuming records
● A consumer reads from one or more partitions
● Each consumer maintains an offset marking the last record it has read in each partition
● The consumer requests a micro-batch of records from Kafka. The broker uses the offset to return the next unread records to the consumer
● Once the consumer has finished processing a record, it must commit the new offset
● Because Kafka does not delete records immediately after they are read, consumers may reset the offset to a previous value to replay records
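A minimal sketch of the poll, commit and replay cycle described above. `PartitionLog` and `Consumer` are hypothetical in-memory stand-ins, not the real Kafka consumer API, and the offset here is modelled as the position of the next record to read.

```python
class PartitionLog:
    """Toy partition: records stay in the log after they are read."""

    def __init__(self, records):
        self.records = list(records)

    def fetch(self, offset, max_records):
        # Return a micro-batch starting at the requested offset.
        return self.records[offset:offset + max_records]


class Consumer:
    def __init__(self, log):
        self.log = log
        self.position = 0  # offset of the next record to read

    def poll(self, max_records=2):
        return self.log.fetch(self.position, max_records)

    def commit(self, processed):
        # Advance the offset only after the records were processed.
        self.position += processed

    def seek(self, offset):
        # Resetting the offset replays records that are still in the log.
        self.position = offset


consumer = Consumer(PartitionLog(["a", "b", "c", "d", "e"]))
seen = []
while True:
    batch = consumer.poll()
    if not batch:
        break
    seen.extend(batch)
    consumer.commit(len(batch))

consumer.seek(1)
replayed = consumer.poll(max_records=3)
print(seen, replayed)
```

If the consumer crashed between `poll` and `commit`, the same batch would be delivered again on restart, which is why processing in this model should tolerate duplicates.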
How does it work: consumer groups
● Multiple consumers reading from a topic may be arranged into Consumer Groups
● A Consumer Group load-balances partitions amongst consumers
● If a consumer goes offline, the consumer group will automatically re-distribute its partitions amongst the remaining consumers
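The rebalancing behaviour above can be illustrated with a simple round-robin assignment. This is a sketch only: the `assign` function is hypothetical, and Kafka's actual assignment strategies are configurable and more involved.

```python
def assign(partitions, consumers):
    """Spread partitions round-robin over the live consumers in a group."""
    members = sorted(consumers)
    if not members:
        return {}
    assignment = {member: [] for member in members}
    for i, partition in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(partition)
    return assignment


partitions = range(6)
before = assign(partitions, ["c1", "c2", "c3"])
# c2 goes offline: its partitions are re-distributed to the survivors.
after = assign(partitions, ["c1", "c3"])
print(before)
print(after)
```

Every partition stays owned by exactly one consumer in the group before and after the failure, so no records are skipped and none are processed by two group members at once.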
How it works: Easier Abstractions
Kafka Connect
● High-level API
● Drop-in source (import) & sink (export) connectors exist for many popular technologies, including Amazon S3, Amazon Kinesis, Apache Cassandra, HDFS and JDBC
Kafka Streams
● Provides functionality to aggregate data, join multiple topics and perform complex transformations to live data as it arrives
● The API abstracts away most of the difficult scalability, fault-tolerance and consistency problems associated with performing live aggregations on a distributed system
Instaclustr Managed Kafka - Key Features
● Available Now:
○ Open source Apache Kafka (Brokers) and Zookeeper
automatically provisioned in AWS, GCP and Azure
○ Broker Monitoring
○ Instaclustr monitoring and provisioning API support
○ Private network clusters (AWS only)
○ Run in your cloud provider account or ours
● For GA (end June):
○ SOC2 compliant
○ User & credential management
○ More cluster config options
○ Topic Level and Synthetic transaction monitoring
○ Infrastructure config tuning
● Likely future release scope:
○ Topic Management UI
○ Cluster “copy”
○ Managed:
■ Kafka Connect
■ Schema Registry
■ Mirror Maker
○ Dynamic scaling
Instaclustr Managed Kafka - Development Process
● First customer requests in 2016
● Internal infrastructure deployment and usage of Kafka from mid 2017
● Managed service platform development commenced November 2017
● Early access program with 4 customers commenced December 2017
● Public preview release 21 May 2018
● GA expected 25 June 2018
Hardware Choice and Benchmarking - GP2 vs ST1
● AWS benchmark: r4.large with 500GB disks
● Avg 10% improved throughput with ST1 vs GP2 EBS
● ST1 is 45% of the cost of GP2
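Combining the two figures above gives a rough throughput-per-dollar comparison. The helper below is a back-of-the-envelope sketch; the function name is made up, and it assumes both ratios apply to the same benchmark workload.

```python
def relative_throughput_per_dollar(throughput_gain, cost_ratio):
    """Throughput per unit cost, relative to the baseline volume type."""
    return (1.0 + throughput_gain) / cost_ratio


# Figures from the benchmark: +10% throughput at 45% of the cost.
st1_vs_gp2 = relative_throughput_per_dollar(0.10, 0.45)
print(f"ST1 delivers about {st1_vs_gp2:.2f}x the throughput per dollar of GP2")
```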
Hardware Choice and Benchmarking - SSL vs non-SSL
● AWS benchmark: r4.large with 1500GB ST1 disks
● 512 byte messages
● ~30% decrease in throughput with Broker and Client SSL enabled
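For capacity planning, the ~30% figure can be turned into a broker count. This sketch assumes throughput scales linearly with the number of brokers and that the measured drop applies uniformly; `brokers_needed` is a hypothetical helper, not part of any tool.

```python
def brokers_needed(baseline_brokers, throughput_drop_percent):
    """Brokers required to keep the same throughput after a per-broker drop.

    Uses integer arithmetic (ceiling division) to avoid float rounding.
    """
    remaining_percent = 100 - throughput_drop_percent
    return -(-baseline_brokers * 100 // remaining_percent)


# A 6 broker cluster without SSL, re-sized for a 30% throughput drop with SSL.
with_ssl = brokers_needed(6, 30)
print(with_ssl)
```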
Hardware Choice and Benchmarking - Number of Topics
● Increasing the number of topics causes a small reduction in performance
● However, more topics = more partitions, which significantly slows recovery time from node failure
[Charts: benchmark results for 10, 100, 1000 and 5000 topics]
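To see why topic count affects recovery, it helps to count partition replicas per broker. The numbers below (3 partitions per topic, replication factor 3, 6 brokers, even spread) are assumptions for illustration and are not taken from the benchmark.

```python
def replicas_per_broker(topics, partitions_per_topic, replication_factor, brokers):
    """Average number of partition replicas each broker hosts."""
    total_replicas = topics * partitions_per_topic * replication_factor
    return total_replicas / brokers


for topics in (10, 100, 1000, 5000):
    load = replicas_per_broker(topics, partitions_per_topic=3,
                               replication_factor=3, brokers=6)
    print(f"{topics} topics -> {load:.0f} replicas per broker")
```

When a node fails, every replica it hosted needs leadership changes or re-synchronisation, so the per-broker replica count is a rough proxy for recovery work.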
Hardware Choice and Benchmarking - Colocated Zookeeper
● It is often recommended to host Zookeeper separately from Kafka.
● However, recent changes have significantly reduced load on Zookeeper from Kafka.
○ Consumer offsets are no longer stored in Zookeeper.
● Our benchmarking showed no measurable difference in performance, at least for smaller clusters.
[Charts: Consumer Rate - Colocated vs Consumer Rate - Separate, 6 Broker Test with Node Restart]
Topic and User Configuration Management
● Existing Kafka utilities for managing topic and user configuration required direct access to Zookeeper
● However, Zookeeper does not have a robust external security model (TLS support, node-to-node auth, etc.)
● Providing Zookeeper access to customers introduces a whole class of very strange ways to break a cluster by corrupting Zookeeper
● Solutions:
○ Developed a command line tool that uses the Kafka API for topic configuration (https://github.com/instaclustr/ic-kafka-tools)
■ may add to Instaclustr console later, although we think maintaining topic config as a version-controlled file in your repo is a better approach
○ Adding user management to Instaclustr console
■ we do not want to keep cluster passwords in our central management system, so this feature will require users to enter existing Kafka credentials to be temporarily used by our system
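The version-controlled topic config idea can be sketched as a diff between a desired state (the file in your repo) and the state reported by the cluster. This is a hypothetical illustration: the dictionary layout and `plan_changes` are assumptions and do not reflect the actual format or behaviour of ic-kafka-tools.

```python
def plan_changes(desired, actual):
    """Compare desired topic config against the cluster and list the work to do."""
    create = sorted(set(desired) - set(actual))
    update = sorted(name for name in desired
                    if name in actual and desired[name] != actual[name])
    # Topics on the cluster that the file does not mention are only reported.
    unmanaged = sorted(set(actual) - set(desired))
    return {"create": create, "update": update, "unmanaged": unmanaged}


# Desired state, as it might be loaded from a file in the repo (assumed layout).
desired = {
    "orders": {"partitions": 6, "replication_factor": 3},
    "audit": {"partitions": 3, "replication_factor": 3},
}
# Current state, as it might be read back through the Kafka API.
actual = {
    "orders": {"partitions": 3, "replication_factor": 3},
    "legacy": {"partitions": 1, "replication_factor": 1},
}
plan = plan_changes(desired, actual)
print(plan)
```

Keeping the plan step separate from the apply step means a change can be reviewed like any other commit before it touches the cluster.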
Broker Security Configuration
● Using SCRAM (Salted Challenge Response Authentication Mechanism) for authentication
○ More secure
○ Allows easier rotation of credentials
○ Initial release for client->broker only with plain text for broker to broker
○ Decided to also use for broker->broker to allow rapid rotation of credentials as part of SOC2 security measures
● TLS built on existing Cassandra infrastructure
○ New CA created per cluster
○ CA used to generate certificates for each node
○ CA pub cert available for clients to download for full validation of certificates
● Access to managed clusters also follows same model as Cassandra
○ Public IPs and whitelisting in firewall (security group or equivalent)
○ Private IPs with VPC Peering (or equivalent in other cloud providers)
○ Private Network Clusters where nodes are not allocated public IPs and gateway box is used for admin access
○ Zookeeper is not exposed through the firewall due to its weak security model
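As background on why SCRAM is considered more secure and easier to rotate, the sketch below derives the values a server stores for a SCRAM user, following the key derivation in RFC 5802 with SHA-256. It is illustrative only: the function name, the 4096 iteration count and the example password are assumptions, password normalisation (SASLprep) is skipped, and this is not how Kafka or Instaclustr manage credentials internally.

```python
import hashlib
import hmac
import os


def scram_credentials(password, salt, iterations=4096, hash_name="sha256"):
    """Derive the server-side SCRAM values; the password itself is never stored."""
    # SaltedPassword := Hi(password, salt, iterations), which is PBKDF2-HMAC.
    salted = hashlib.pbkdf2_hmac(hash_name, password.encode("utf-8"), salt, iterations)
    client_key = hmac.new(salted, b"Client Key", hash_name).digest()
    stored_key = hashlib.new(hash_name, client_key).digest()
    server_key = hmac.new(salted, b"Server Key", hash_name).digest()
    return {"salt": salt, "iterations": iterations,
            "stored_key": stored_key, "server_key": server_key}


# Rotating a credential amounts to deriving fresh values with a new random salt.
credentials = scram_credentials("example-password", os.urandom(16))
print(credentials["stored_key"].hex())
```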
Monitoring
● Metrics exposed via JMX allowing us to use our existing Cassandra monitoring
○ Custom agent -> RabbitMQ (planned to migrate to Kafka) -> Riemann -> Cassandra + Spark -> Console, APIs, Grafana
● Exposing broker-level and per-topic metrics
● Alerting?
○ The basics: service state, disk usage / free space, server still exists
○ Kafka metrics: offline partitions, active controllers != 1, under-replicated partitions
○ Synthetic transactions: publish and consume message to controlled topic, measure success and latency
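The alerting rules above can be expressed as a small check function plus a synthetic transaction probe. This is a sketch under stated assumptions: the metric key names are invented for the example, and an in-memory queue stands in for the controlled Kafka topic.

```python
import queue
import time


def kafka_alerts(metrics):
    """Evaluate the cluster-level rules; metric key names are hypothetical."""
    alerts = []
    if metrics["offline_partitions"] > 0:
        alerts.append("offline partitions")
    if metrics["active_controllers"] != 1:
        alerts.append("active controllers != 1")
    if metrics["under_replicated_partitions"] > 0:
        alerts.append("under-replicated partitions")
    return alerts


def synthetic_transaction(publish, consume, payload, timeout_s=5.0):
    """Publish a message, read it back, and report success and latency."""
    start = time.monotonic()
    publish(payload)
    received = consume(timeout_s)
    return received == payload, time.monotonic() - start


# In-memory stand-in for a controlled topic (assumption for the example).
probe_topic = queue.Queue()
ok, latency = synthetic_transaction(
    probe_topic.put, lambda timeout: probe_topic.get(timeout=timeout), "ping")
healthy = kafka_alerts({"offline_partitions": 0, "active_controllers": 1,
                        "under_replicated_partitions": 0})
print(ok, latency, healthy)
```

The synthetic probe catches failures that broker metrics can miss, since it exercises the same publish and consume path a real client would use.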
Backup and Restore
● Internet wisdom = Kafka backups are not a thing
○ Rely on replication within cluster or mirror maker replication to another cluster
● Hmm - we rarely use backups for Cassandra but there have been a few times we’ve been very glad to have them
○ Hardware failure is not an issue but corruption due to app bugs or user error can occur and be spread by replication
● Working on regular automated backup and restore of topic and security configuration
● Consider using Kafka Connect to write important messages to an offline backup