You have learned about Kafka event sourcing with streams and using Kafka as a database, but you may be having a tough time wrapping your head around what that means and what challenges you will face. Kafka’s exactly-once semantics, data retention rules, and Streams DSL make it a great database for real-time transaction processing. This talk will focus on how to use Kafka events as a database. We will talk about using KTables vs GlobalKTables, and how to apply them to patterns we use with traditional databases. We will go over a real-world example of joining events against existing data and some issues to be aware of. We will finish by covering some important things to remember about state stores, partitions, and streams to help you avoid problems when your data sets become large.
Using Kafka as a Database For Real-Time Transaction Processing | Chad Preisler, Northwestern Mutual
1. The Northwestern Mutual Life Insurance Company – Milwaukee, WI
Using Kafka as a Database
Chad Preisler
2. What does that mean?
Event Sourcing
Transactional database
All data is stored in Kafka Topics.
No traditional Relational Database.
Using Streams DSL, KTable, GlobalKTable, and Stores to process and search for data.
3. Why Did We Do It?
• Decoupled services
• Easily manage record retention
• Real-time processing
• Immutable log
• Topics can be safely shared using ACLs
• Fault Tolerance: never miss a record
• Confluent Cloud: SLA, Support. Broker just works.
4. What makes it work?
• Exactly Once Semantics
• Data Retention
• Stream DSL
– KTable
– Stream to KTable joins
– Stream to Stream joins
5. What is a Stream?
Read-process-write operation on a Kafka topic
Java Stream DSL
• Read from multiple topics and write to one output topic
• Read from and output to one topic
• Read from multiple topics and write to multiple topics
6. Exactly Once
Guarantees that all of the following happen together or are all rolled back:
• Source topic commit.
• Sink topic commit.
• State Store commit.
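Exactly-once is enabled with a single Streams configuration property. A minimal sketch, assuming hypothetical application id and broker address:

```java
import java.util.Properties;

public class ExactlyOnceConfig {
    // Build a minimal Kafka Streams configuration with exactly-once enabled.
    static Properties build() {
        Properties props = new Properties();
        props.put("application.id", "txn-processor");     // hypothetical app id
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
        // Ties the source commit, sink commit, and state store commit into a
        // single transaction ("exactly_once_v2" in newer Streams releases).
        props.put("processing.guarantee", "exactly_once");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(build().getProperty("processing.guarantee"));
    }
}
```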
7. Topic Retention
• Every topic has a retention period
• Retention periods can be any length of time including indefinitely
• Can easily manage retention times to meet business requirements.
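Retention is a per-topic config; for example, an event-sourcing topic can be kept indefinitely by setting retention.ms to -1. A sketch of that config change using the Kafka Admin client (topic name and broker address are hypothetical):

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RetentionExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker

        try (Admin admin = Admin.create(props)) {
            // retention.ms = -1 keeps records forever; any millisecond
            // value can be used to match a business requirement.
            ConfigResource topic =
                new ConfigResource(ConfigResource.Type.TOPIC, "transactions"); // hypothetical topic
            AlterConfigOp keepForever =
                new AlterConfigOp(new ConfigEntry("retention.ms", "-1"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(keepForever))).all().get();
        }
    }
}
```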
8. Stream Joins
• The Java DSL allows you to join streams together, similar to joins in a relational database.
• Join Stream to Stream
• Join Stream to KTable
• Join Stream to GlobalKTable
• Join KTable to KTable
• All support inner and left joins.
• Some support outer joins.
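A stream-to-KTable join in the Java DSL reads much like a SQL join on the record key. A minimal sketch, with hypothetical topic names and value formats:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class JoinTopology {
    static void build(StreamsBuilder builder) {
        // Incoming transaction events, keyed by account id (hypothetical topic).
        KStream<String, String> transactions = builder.stream("transactions");
        // Latest account record per key, materialized as a table (hypothetical topic).
        KTable<String, String> accounts = builder.table("accounts");

        // Inner join: emits only when the account key exists in the table.
        // Use leftJoin(...) to also emit transactions with no matching account.
        KStream<String, String> enriched =
            transactions.join(accounts, (txn, account) -> txn + "|" + account);

        enriched.to("enriched-transactions");
    }
}
```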
9. KTable
• KTable is an abstraction over a Stream.
• Each data record represents an update.
• You can treat it like a read only table.
• Backed by a RocksDB instance on the application’s machine.
• Each instance of the stream app will get a portion of the topic data.
• Partitions are split across all instances of the stream application.
– Not all running instances get all the data.
– If three instances of a stream application are running and the topic has 12 partitions, each
instance gets 4 partitions’ worth of data.
10. GlobalKTable
• Like a KTable
• Main difference: Each instance of the application gets all the records.
– Data is not split across instances by partition.
• Are completely loaded before the stream starts processing.
11. KTable Pros/Cons
Pros:
• Loads fast if you run more than one pod, since each instance loads only its share of partitions.
• Fast lookup
• Starts processing based on record timestamps.
– It will always process in the same order.
Cons:
• The KTable topic and the stream topic must share the same key and be co-partitioned.
– The Streams 2.4 API allows KTable to KTable joins on foreign keys.
• If your keys are not evenly distributed over partitions loading becomes an issue.
• Will start processing before all records are loaded.
12. GlobalKTable Pros/Cons
Pros:
• All records load before the stream starts.
• Very fast once loaded.
• Allows joins on non-key values.
Cons:
• Can take a very long time to load.
– Can reuse RocksDB if machine has attached storage
– Just builds the delta if DB already exists
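Because every instance holds the full table, a GlobalKTable join can derive the join key from the stream record instead of requiring co-partitioned keys. A sketch, where the topic names and the product-id extraction are hypothetical:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;

public class GlobalJoinTopology {
    static void build(StreamsBuilder builder) {
        KStream<String, String> orders = builder.stream("orders");              // hypothetical topic
        GlobalKTable<String, String> products = builder.globalTable("products"); // hypothetical topic

        // The KeyValueMapper picks the lookup key out of the order VALUE,
        // so the two topics do not have to be co-partitioned.
        orders.join(
                products,
                (orderKey, orderValue) -> orderValue.split(",")[0], // hypothetical: product id is the first field
                (order, product) -> order + "|" + product)
              .to("orders-with-products");
    }
}
```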
16. Kafka Read/Update: Transformer
Use the Apache Kafka Streams Processor API.
• A little bit of work to implement the classes.
• Still get the benefits of a KTable:
– Auto updates via the Kafka stream.
– State restored on start-up.
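A read-then-update flow with the Processor API can be sketched with transformValues and an attached state store. All topic names, the store name, and the merge logic below are hypothetical:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.ValueTransformerWithKey;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class ReadUpdateTopology {
    static void build(StreamsBuilder builder) {
        // Local RocksDB-backed store; restored from its changelog on start-up.
        builder.addStateStore(Stores.keyValueStoreBuilder(
            Stores.persistentKeyValueStore("account-state"), // hypothetical store name
            Serdes.String(), Serdes.String()));

        KStream<String, String> events = builder.stream("account-events"); // hypothetical topic

        events.transformValues(() -> new ValueTransformerWithKey<String, String, String>() {
            private KeyValueStore<String, String> store;

            @Override
            @SuppressWarnings("unchecked")
            public void init(ProcessorContext context) {
                store = (KeyValueStore<String, String>) context.getStateStore("account-state");
            }

            @Override
            public String transform(String key, String value) {
                // Read the previous state, apply the update, write it back.
                String previous = store.get(key);
                String updated = previous == null ? value : previous + ";" + value;
                store.put(key, updated);
                return updated;
            }

            @Override
            public void close() {}
        }, "account-state").to("account-current"); // hypothetical output topic
    }
}
```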
17. Things to remember
• Small values work better than larger values.
• KTables and GlobalKTables load quickly when the keys are evenly distributed across partitions.
• The first time a stream application reads from a topic, it starts from the
beginning.
– Regular Kafka consumers start from the end by default.
– If an existing stream application changes its input topic(s) and there are no committed offsets
for that topic, it will start from the beginning.
– Can change default behavior for new input topics with auto.offset.reset
– Only applies when stream application has not committed offsets
– After offsets are committed will continue where it left off
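Kafka Streams defaults auto.offset.reset to "earliest", which is why a fresh stream application reads its input topics from the beginning; the default can be overridden. A sketch, with hypothetical application id and broker:

```java
import java.util.Properties;

public class OffsetResetConfig {
    // Kafka Streams defaults auto.offset.reset to "earliest" (regular
    // consumers default to "latest"). Overriding it makes NEW input topics
    // with no committed offsets start from the end; once offsets have been
    // committed, this setting is ignored.
    static Properties build() {
        Properties props = new Properties();
        props.put("application.id", "txn-processor");     // hypothetical app id
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
        props.put("auto.offset.reset", "latest");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(build().getProperty("auto.offset.reset"));
    }
}
```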
18. Things to remember
• Don’t write to topics “out of band” in your stream application.
– Use the stream DSL to convert records and write them to topics.
– Don’t create producers while processing in your stream.
• Do be careful with exactly once and external systems.
– After a crash, exactly once redelivers the last record that was processed but not committed.
– If your application crashes after the external system call, that call will be repeated each time
the application restarts.
• Do make sure to set an uncaught exception handler and runtime shutdown hook to
log exceptions and handle shutting down the JVM.
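The handler-and-hook setup can be sketched as follows (using the Streams 2.8+ handler; older versions take a plain Thread.UncaughtExceptionHandler instead). The topology and configuration are assumed to be filled in:

```java
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.errors.StreamsUncaughtExceptionHandler;

public class StreamsLifecycle {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        // ... build the topology here ...
        Properties props = new Properties(); // application.id, bootstrap.servers, etc.

        KafkaStreams streams = new KafkaStreams(builder.build(), props);

        // Log every uncaught exception and shut the client down cleanly
        // instead of leaving a dead stream thread behind.
        streams.setUncaughtExceptionHandler(exception -> {
            System.err.println("Stream thread died: " + exception);
            return StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse.SHUTDOWN_CLIENT;
        });

        // Close the streams instance (flushing state) when the JVM exits.
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));

        streams.start();
    }
}
```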