Scaling a core banking engine
using Apache Kafka
Peter Dudbridge (aka. Dudders)
Engineering Director
Thought Machine
What will we cover?
The relationship between correctness and scale, through the lens of a core banking system
● To guide us through this subject we will tell a story starting with the monolith and ending with a
scalable distributed system
● We will encounter some patterns and anti-patterns on the way
● We will see some tricky trade-offs, but also show that with the right design, they need not hold us back!
● Loosely based on real life experiences!
Note: I assume you know what Kafka and microservices are (if not you may be lost)
What is core banking and why do we care?
The most important system that you’ve never heard of (unless something goes
wrong!)
● We need both correctness and scale!
● You want 24/7 access to your money
● You don’t want to lose all your money!
● You also don’t want to pay for your current account…
Core banking, at its heart, is:
● Product engine
● Ledger (accounts + balances)
The monolith
Works great!
● Everything is on one machine
● No network calls! Yay!
● Total ordering, single clock
● Need more resources?
> Mainframe: we’ve got you covered!
Waaait what year is it again?
● Glass ceiling
● Very expensive
● Core banking traffic is very spiky
● HA / DR
You know the answer to the problem: microservices!
● We need an architecture that can scale horizontally, on demand: that
means microservices
● Has anyone tried carving up a monolith?
● We quickly realize the things that worked well in the monolith fall flat on
their face when distributed
● We also quickly realize that we’re now playing a game of trade-offs
○ Latency vs throughput
○ Consistency vs availability
○ Correctness vs everything!
● Usually the first thing we realize is we need to embrace asynchrony
Towards microservices
Why Kafka
Why am I speaking at this conference... we just need an async queue type thing right?
Why use Kafka?
● Kafka topics can be partitioned, which allows them to scale
● Kafka can be configured to be durable (good for banking!)
● Kafka is a commit log!
○ This has changed the game with regards to stream processing
○ Towards a general theory of table and stream relativity:
The aggregation of a stream of updates over time yields a table
The observation of changes to a table over time yields a stream
- Streaming Systems
Some prerequisites
Here are some prerequisites for correctness that aren’t the focus of today’s talk
● We choose consistency over availability (when we are forced to make a choice)
● Idempotency and retryability are baked in
● This gives us effectively-once delivery
● Holistic consistency
● Dual writes - pick your favourite pattern (we use transactional outbox)
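As an illustration of the transactional outbox, here’s a minimal JDBC sketch. The postings and outbox tables are hypothetical; a separate relay process would poll outbox, publish to Kafka, and mark rows as sent. The point is that both writes share one DB transaction, so there’s no dual-write anomaly.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TransactionalOutbox {
    // Hypothetical schema: postings(id, account_id, amount) and
    // outbox(id, topic, payload). A separate relay reads the outbox
    // and publishes to Kafka.
    public void recordPosting(Connection conn, String postingId,
                              String accountId, long amountMinorUnits) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement posting = conn.prepareStatement(
                 "INSERT INTO postings (id, account_id, amount) VALUES (?, ?, ?)");
             PreparedStatement outbox = conn.prepareStatement(
                 "INSERT INTO outbox (id, topic, payload) VALUES (?, ?, ?)")) {
            posting.setString(1, postingId);
            posting.setString(2, accountId);
            posting.setLong(3, amountMinorUnits);
            posting.executeUpdate();

            outbox.setString(1, postingId);
            outbox.setString(2, "postings");
            outbox.setString(3, "{\"id\":\"" + postingId + "\",\"account\":\"" + accountId + "\"}");
            outbox.executeUpdate();

            conn.commit(); // both rows or neither
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}
```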
Let’s solutionize!
So let’s start with the general problem we’re trying to solve (then immediately start solutionizing)
● We’re processing a payment. In core banking lingo this is a posting
● At Thought Machine we have a super flexible product engine that is driven by an abstraction we
refer to as Smart Contracts (not dogecoin, sorry)
● The trade-off of super flexibility is that we have to do super loads of work on the hot path
Correctness - our first battle
Since streams are data in motion and tables are data at rest, we decide to store our ledger (accounts +
balances) in a database
We build our prototype but quickly realize a problem
● When two users concurrently access one account, we see a race condition
● Fortunately there’s a common pattern for solving this!
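The common pattern here is a pessimistic lock on the account row, serializing the read-modify-write. A minimal JDBC sketch, with a hypothetical balances table:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class PessimisticBalanceUpdate {
    // SELECT ... FOR UPDATE blocks any other transaction touching this
    // account row until we commit, eliminating the race condition.
    public void applyPosting(Connection conn, String accountId, long delta) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement select = conn.prepareStatement(
                 "SELECT amount FROM balances WHERE account_id = ? FOR UPDATE");
             PreparedStatement update = conn.prepareStatement(
                 "UPDATE balances SET amount = ? WHERE account_id = ?")) {
            select.setString(1, accountId);
            long current;
            try (ResultSet rs = select.executeQuery()) {
                if (!rs.next()) throw new SQLException("unknown account " + accountId);
                current = rs.getLong(1);
            }
            update.setLong(1, current + delta);
            update.setString(2, accountId);
            update.executeUpdate();
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}
```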
Scale - our first battle
So now our system is ‘correct’, however this has come at a price, let’s open up our observability dashboard
to see if we can’t offset this cost!
● We’re spending all of our time on the network
● We’re having to scale out our application to handle more requests, but our database is running out
of connections!
● We’re burning cash and we seem to be hitting the exact problem we moved away from the
monolith to solve
● Batching is the answer. Don’t worry - it’s a subset of streaming
Batching strategies
The big question is - what should the batch size be?
● No ‘one size fits all’ answer
● Consider what work we are batching:
○ Kafka consumer batches
○ Calls our consumer might be making to another microservice
○ Database queries / commands
○ Kafka producing batches
○ Kafka committing offsets
● Rule of thumb - try a few strategies out!
Batching strategies
#strategy 1 - one batch to rule them all
● Maintain whatever batch size we pull from Kafka in our consumers
● We can tweak our batch size via the Kafka consumer settings:
fetch.max.wait.ms | fetch.min.bytes | fetch.max.bytes
● Works great if our batch is homogeneous
● Easy to reason about architecture, single place to configure our batch sizes
● Assumes the batch size that works for Kafka also works for the DB, calls to other microservices, etc.
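To make those settings concrete, a consumer that waits up to 500ms for roughly 1MB batches might look like the sketch below (the values are illustrative, not recommendations):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class PostingsConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "postings-processor");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        // Batch shaping: wait up to 500ms for at least ~1MB, cap at ~8MB.
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500);
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1_048_576);
        props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 8_388_608);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("postings"));
            var records = consumer.poll(Duration.ofSeconds(1));
            // ... process the whole fetched batch as one unit of work ...
        }
    }
}
```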
Batching strategies
#strategy 2 - fan out
● If our batch is heterogeneous we might see that some elements of the batch process quicker than
others, in #strategy 1 we always pay the price of the slowest element
● If our batch contains different types of work (e.g. different messages need routing to different
places) fan out might be our only option - if this is the case first consider separate topics
● If we fan out we might get benefits with parallelizing work (always test this!)
● The long tail problem isn’t trivial to solve
● Consume a batch of 3 from Kafka
● Fan out the request
● Decide how long to wait, save the long tail to local storage
● Commit all offsets and pull next batch
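A sketch of that flow in Java, with hypothetical process and saveToLocalStore helpers:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class FanOutBatch {
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    public void handleBatch(List<String> records) {
        // Fan out: process every element of the batch in parallel.
        List<CompletableFuture<Void>> futures = new ArrayList<>();
        for (String record : records) {
            futures.add(CompletableFuture.runAsync(() -> process(record), pool));
        }
        // Decide how long to wait for the whole batch.
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(200);
        for (int i = 0; i < futures.size(); i++) {
            try {
                futures.get(i).get(Math.max(0, deadline - System.nanoTime()),
                                   TimeUnit.NANOSECONDS);
            } catch (TimeoutException | InterruptedException | ExecutionException e) {
                // The long tail (or a failure): park it locally so the batch
                // can complete and all offsets can be committed.
                saveToLocalStore(records.get(i));
            }
        }
        // The caller now commits all offsets and pulls the next batch.
    }

    private void process(String record) { /* hypothetical downstream call */ }

    private void saveToLocalStore(String record) { /* hypothetical durable spill, retried later */ }
}
```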
Batching strategies
#strategy 3 - dynamic batch sizing
If our batches are super heterogeneous / unpredictable, a batch size that works well today might not work
so well tomorrow (Singles’ Day, anyone?)
● Monitor our batch response time / error rate over time
● If the response time gets quicker increase the batch size
● If the response time gets slower and we observe timeouts, decrease the batch size
● Use sensible bounds! Lots of things can affect response times, so it’s easy to scale on a false positive!
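A minimal sketch of such a controller - the thresholds and bounds are illustrative assumptions. The returned size could be applied via max.poll.records or by chunking the polled batch:

```java
public class DynamicBatchSizer {
    private static final int MIN_BATCH = 50;    // sensible bounds: never
    private static final int MAX_BATCH = 5_000; // scale on a false positive
    private int batchSize = 500;

    // Call after each batch with its observed latency and timeout count.
    public int adjust(long batchMillis, int timeouts) {
        if (timeouts > 0 || batchMillis > 2_000) {
            batchSize = Math.max(MIN_BATCH, batchSize / 2);   // back off fast
        } else if (batchMillis < 500) {
            batchSize = Math.min(MAX_BATCH, batchSize + 50);  // grow slowly
        }
        return batchSize;
    }
}
```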
Correctness: we meet again...
We play around with batch sizes / strategies, and we find that sweet spot that gives us the best
throughput / latency profile, however we spot something odd when perusing the observability
dashboards:
Woah - so we introduced batching to mitigate the cost of correctness, but it looks like we hit a point
where it stops helping, and even makes things worse!
● The bigger the batch, the more likely we are to have a lock conflict with a batch being processed by
another consumer
Pro tip - know your bottlenecks!
Correctness: we meet again...
If only there was a way we could reduce the number of lock
conflicts then maybe we could further increase our batch size and
get more perf gainz?
… what if there was a way we could make sure that messages that might conflict always go to the same
consumer? That way we could resolve the conflict in the application, or evict it to the next batch?
Ordering
Ordering!
● A key feature of Kafka is the consistent hashing of user defined partition keys, guaranteeing that
any message with a given partition key’s value will always go to the same consumer (until a
rebalance happens)
● If we partition on Account ID, we can capitalize on the affinity this gives us
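Capitalizing on this is just a matter of setting the record key: Kafka’s default partitioner hashes the key (murmur2) to choose a partition. A minimal sketch:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PostingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String accountId = "acc-42"; // hypothetical account ID
            // The key picks the partition, so every posting for acc-42 lands
            // on the same partition - and, barring a rebalance, the same consumer.
            producer.send(new ProducerRecord<>("postings", accountId,
                                               "{\"amount\": 1000}"));
        }
    }
}
```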
Ordering
Warning! Ordering is a double edged sword
You may recall that we had a total ordering in our monolith, which made things nice
● The only way to have a total ordering with Kafka is to limit our topic to 1 partition (and as a result a
single consumer) - very bad for scale!
Do we need a total ordering? Maybe not
1. Identify the causal relationships in your data model
2. Hopefully everything isn’t causally related. Banking is great here, as generally only stuff that happens
within an account is causally related. My banking activity doesn’t (usually) affect yours. If this is
the case, partition on that
3. $$$ profit from a partial ordering $$$
Warning! Be cautious about having a hard dependency on ordering. While Kafka guarantees an
ordering per partition, it can be difficult / restrictive to maintain that ordering when processing, e.g.:
○ Fan out
○ Rebalancing
○ Stream joins
Scale: turning the tide of the war
By capitalizing on bank account affinity within our consumers we mitigated lock contention
● We can use larger batch sizes
● We now hit expected issues, such as max message size errors, and poor latency profiles (we have to
wait for batches to fill up!). We choose a batch size that sits comfortably below the point we hit our
known bottlenecks
Does this mean we can get rid of our lock? That way our perf would be even better?
● Not wise, we still want our safety net (see warnings from last slide!)
● But maybe there’s a better way to do it…
Effective concurrency control
Sometimes the best lock is no lock
● Our locking mechanism is quite primitive, and crucially, pessimistic
● Pessimistic locks work best when we expect collisions
● Optimistic ‘locking’ doesn’t use a lock so is cheaper in the happy case, but more expensive when we
have a collision
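One common implementation of optimistic concurrency control is a compare-and-swap on a version column (the balances schema here is hypothetical). Zero rows updated means someone got there first, and we retry - or evict the posting to the next batch:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class OptimisticBalanceUpdate {
    // No lock held in the happy case: the conditional UPDATE only succeeds
    // if nobody has bumped the version since we read it.
    public boolean tryApply(Connection conn, String accountId,
                            long newAmount, long expectedVersion) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                 "UPDATE balances SET amount = ?, version = version + 1 " +
                 "WHERE account_id = ? AND version = ?")) {
            ps.setLong(1, newAmount);
            ps.setString(2, accountId);
            ps.setLong(3, expectedVersion);
            return ps.executeUpdate() == 1; // 0 rows updated = collision, retry
        }
    }
}
```

The version column here also serves as a logical timestamp, which is where the next slide picks up.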
Effective concurrency control
Optimistic concurrency control is sometimes implemented using real timestamps. In a distributed system
it’s often better to use logical timestamps
● Let’s imagine we want to support rounding up to the nearest dollar; here are our requirements
○ Rounding MUST NOT happen on the hot path
○ Rounding MUST happen strictly after the original posting (i.e. no intermediate postings are allowed)
Scale - bringing it home
Our system is beginning to flow. We’ve got some smart batching
strategies and effective use of partitioning helps reduce transaction
contention in our database
We can now handle hundreds of concurrent users, however we hit a
wall as we try to scale out more: our application is stateless so it can
scale horizontally, but our database is a single point of contention and can
only scale vertically
● Read replicas - synchronous replicas can serve consistent reads
keeping contention off the master
● NewSQL - emerging databases such as Spanner and
CockroachDB might help (not a silver bullet!)
● Shared nothing architecture
● AZ affinity (but don’t rely on it!)
“I made it go 10x faster without sacrificing correctness!”
Client: “It needs to go 100x faster”
Data locality
While these strategies help, they’re only really buying us time, and certainly not getting us the stonks
we need
We take a good hard look at our system: we still seem to be spending most of our time on the network going
to the database. What’s more, looking at these interactions, there seems to be a lot of commonality -
we keep asking the same questions and getting the same answers
If only there was a way we could avoid these redundant interactions??
Data locality!
● Bringing the data closer to the process. Or the other way around, but stored procs aren’t cool
anymore :(
● This can mean geographically closer, but can also mean computationally closer, e.g. caching the
result of a costly computation
Data locality
Scenario: our contract execution requires the current month’s postings, to check for reward eligibility.
Current state: every time we process a posting we’re fetching the same month’s worth of data (+1) from the
database
#strategy 1 - read through cache
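A minimal sketch of the read-through pattern - an in-process map stands in here for what would more likely be a shared cache (e.g. Redis) in #strategy 1, but the shape is the same: serve from the cache, fall back to the DB once, keep the cache warm as new postings commit:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

public class MonthlyPostingsCache {
    record Posting(String id, long amountMinorUnits) {}

    private final ConcurrentHashMap<String, List<Posting>> byAccount = new ConcurrentHashMap<>();

    // Read-through: serve from the cache, falling back to one DB load per
    // account; repeat processing for the same account never re-reads the
    // month's history.
    public List<Posting> currentMonthPostings(String accountId) {
        return byAccount.computeIfAbsent(accountId, this::loadMonthFromDatabase);
    }

    // The cache must also be fed as new postings commit, or it goes stale.
    public void append(String accountId, Posting posting) {
        byAccount.computeIfPresent(accountId, (id, postings) -> {
            List<Posting> updated = new ArrayList<>(postings);
            updated.add(posting);
            return List.copyOf(updated);
        });
    }

    // Hypothetical loader: in reality a query bounded to the current month.
    private List<Posting> loadMonthFromDatabase(String accountId) {
        return List.of();
    }
}
```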
Data locality
#strategy 2 - in memory
The intermediate cache works nicely, however we’re still having to transfer a lot of data over the wire,
and while the cache is more performant, it’s still a single point of contention
If only there was a nice way to partition the cached data - wait a minute…
Disclaimer: bootstrapping can be a difficult problem to solve!
Data locality
#strategy 3 - stateful services 😲
If we’re already storing our postings in memory, and we’re only ever fetching the most recent posting
from the DB… but isn’t our processor the thing that processed that most recent posting?
Data locality
… but this is blasphemy! Our services should be stateLESS ?!
For a long time now stateless services have been the royal road to scalability. Nearly every treatise on scalability declares
statelessness as the best practices approved method for building scalable systems. A stateless architecture is easy to scale
horizontally and only requires simple round-robin load balancing.
What’s not to love? Perhaps the increased latency from the roundtrips to the database. Or maybe the complexity of the caching
layer required to hide database latency problems. Or even the troublesome consistency issues.
- Todd Hoff
But what about the bootstrapping problem? … Hello Kafka! Remember: The aggregation of a stream of
updates over time yields a table
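A hand-rolled sketch of that bootstrap: replay an owned partition from the beginning and fold the stream of posting amounts back into a balance table (Kafka Streams over a compacted changelog topic is the production-grade version of the same idea):

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class StateBootstrapper {
    // Replays one owned partition from offset 0, aggregating the stream of
    // posting amounts (keyed by account ID) back into a balance table.
    // Assumes a consumer configured with String key / Long value deserializers.
    public Map<String, Long> hydrate(KafkaConsumer<String, Long> consumer,
                                     TopicPartition partition) {
        Map<String, Long> balances = new HashMap<>();
        consumer.assign(List.of(partition));
        consumer.seekToBeginning(List.of(partition));
        long end = consumer.endOffsets(List.of(partition)).get(partition);
        while (consumer.position(partition) < end) {
            for (ConsumerRecord<String, Long> r : consumer.poll(Duration.ofMillis(500))) {
                balances.merge(r.key(), r.value(), Long::sum);
            }
        }
        return balances;
    }
}
```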
Core banking’s lust for correctness rears its head
● We’ve got by so far with a partial (per account) ordering
● We now realize that we have to obey the laws of accounting
○ We need to construct a trial balance at the end of the business day that will feed into the daily balance sheet
● This means we need to calculate a balance for every account - even internal accounts (double
entry)
● Do we need a total ordering after all?
Pro tip: don’t push your consistency
problems onto your clients (we want
holistic consistency)
Correctness: boss fight
Watermarks
Scenario: End of business day reconciliation
We need a way to calculate a balance for every account in the bank, including internal accounts
● We can’t calculate balances for internal accounts on the hot path as we’d get massive lock
contention
● If our cut off is 12pm, how do we know when we’ve received all postings before this time?
Watermarks!
A watermark is a monotonically increasing timestamp of the oldest work not yet completed
This means that if a watermark advances past some time T (e.g. 12pm) we are guaranteed that no
more processing will happen for events at or previous to T
● Balances are calculated from the timestamp set by the ledger, aka processing time (not
event time)
● Because our cut off is a fixed point, we’re dealing with fixed windows
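The heart of any watermark implementation is tracking the oldest incomplete work and only ever advancing. A minimal sketch, assuming the ledger assigns processing timestamps monotonically at ingress:

```java
import java.util.TreeMap;

public class WatermarkTracker {
    // In-flight work items, counted per ingress (processing-time) timestamp.
    // Assumes ingress timestamps are assigned monotonically.
    private final TreeMap<Long, Integer> pending = new TreeMap<>();
    private long watermark = Long.MIN_VALUE;

    public synchronized void started(long ts) {
        pending.merge(ts, 1, Integer::sum);
    }

    public synchronized void completed(long ts) {
        pending.computeIfPresent(ts, (k, n) -> n == 1 ? null : n - 1);
        // Everything strictly before the oldest incomplete item is done;
        // if nothing is pending, everything up to ts is done. The watermark
        // only ever moves forward.
        long frontier = pending.isEmpty() ? ts : pending.firstKey() - 1;
        watermark = Math.max(watermark, frontier);
    }

    // True once no more processing can happen for events at or before t,
    // e.g. t = the 12pm cut-off of our fixed window.
    public synchronized boolean passed(long t) {
        return watermark >= t;
    }
}
```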
Watermarks
#strategy 1 - heuristic watermarks
Perhaps the easiest approach would be just to wait a bit of time after the cut off
● How long do we wait? The client needs this ASAP!
● How do we tell the difference between idleness and a network partition?
● We don’t want to smash the ledger to get the result - let’s use Kafka :)
● [if idle] the system clock gives us an indication when the fixed window is closing
● [if active] watch for postings arriving with a timestamp after T
● Health check Kafka + the processors
● Publish to a watermark stream (out of band) for a downstream balance processor to read
Watermarks
#strategy 2 - perfect watermarks
The strategy works great, although given it’s a heuristic approach we will inherently always have the
possibility of late data - what could go wrong!
● Since we are using processing timestamps, i.e. we have ingress timestamping, it is possible to have a
perfect watermark approach
● [if active] detect if we write a posting past the close timestamp of a window
● [if idle] the system clock indicates a window is closing; check for open DB transactions
● Publish the watermark in-band, explicitly to every partition (shocking, I know). The downstream
balance processor waits for a watermark from all partitions
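Publishing the watermark in-band to every partition needs nothing more than the plain producer API; a sketch:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.PartitionInfo;

public class WatermarkPublisher {
    // Writes the watermark into every partition of the postings topic so a
    // downstream balance processor can close a window only once it has seen
    // the watermark from all partitions.
    public void publish(KafkaProducer<String, String> producer, long watermarkTs) {
        for (PartitionInfo p : producer.partitionsFor("postings")) {
            producer.send(new ProducerRecord<>("postings", p.partition(),
                                               "watermark", Long.toString(watermarkTs)));
        }
        producer.flush();
    }
}
```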
Correctness - a final word
A quick note on high availability and disaster recovery
● Since we’ve optimized for batching, and we only hit the DB when we need to, we have
inadvertently optimized our system not to depend on tight DB commit latencies
● For banks, we need zero data loss when it comes to DR. We can achieve this with sync replication!
● Can Kafka do multi region replication like the database can?
○ Probably not synchronously! Tables are streams at rest! We can hydrate a new Kafka cluster from our journal
● Can we have multi region active-active?
Scale - a final word
To achieve massive throughput in your microservices architecture you need
to let your data flow
● Commit your offsets fast
○ Avoid service fan out (i.e. to different services)
○ Avoid chaining service calls
○ Avoid sync service calls for writes
○ Don’t await on async request / response
● Consider choreography over orchestration
○ Aim to avoid any sort of shared contention
○ Think carefully about saga state. Consider the Routing Slip pattern
● Don’t try and solve distributed 2PC
● … but beware the saga rollback
● Observability!
○ Tracing can tell you a lot
○ CPU over time can tell you if your services aren’t being utilized (e.g. your
data isn’t flowing)
Summary
● Building a system that is both correct, and moves at scale, is hard - but certainly not impossible!
● We can only get some of the way with patterns / anti-patterns
● The rest requires some creativity - architecture is the sum of its parts
○ Synergies between ordering and concurrency control
○ Choose a batching strategy that works best for the whole system
○ Your caching strategy should complement your system, not paper over the cracks
○ The fewer moving parts the better - building something simple is hard!
○ A bad design compounds problems - make a hard choice or pay a high cloud bill
○ If the end result is something the client can’t use or doesn’t want - we have failed
● Building on this, we can see trade-offs as relationships. We don’t have to choose one or the other
○ Correctness AND scale
○ High throughput AND low latency
○ Consistency AND availability
Thanks for listening!
Feedback appreciated:
peter.dudbridge@gmail.com
linkedin.com/in/peterdudbridge

Mais conteúdo relacionado

Mais procurados

Kafka at Peak Performance
Kafka at Peak PerformanceKafka at Peak Performance
Kafka at Peak PerformanceTodd Palino
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
Introduction to Kafka Cruise Control
Introduction to Kafka Cruise ControlIntroduction to Kafka Cruise Control
Introduction to Kafka Cruise ControlJiangjie Qin
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaJiangjie Qin
 
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)Kai Wähner
 
Maintaining Consistency for a Financial Event-Driven Architecture (Iago Borge...
Maintaining Consistency for a Financial Event-Driven Architecture (Iago Borge...Maintaining Consistency for a Financial Event-Driven Architecture (Iago Borge...
Maintaining Consistency for a Financial Event-Driven Architecture (Iago Borge...confluent
 
Apache Kafka® Use Cases for Financial Services
Apache Kafka® Use Cases for Financial ServicesApache Kafka® Use Cases for Financial Services
Apache Kafka® Use Cases for Financial Servicesconfluent
 
From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.Taras Matyashovsky
 
Kafka replication apachecon_2013
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013Jun Rao
 
Handle Large Messages In Apache Kafka
Handle Large Messages In Apache KafkaHandle Large Messages In Apache Kafka
Handle Large Messages In Apache KafkaJiangjie Qin
 
Apache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and DevelopersApache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and Developersconfluent
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache KafkaJeff Holoman
 
Kappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology ComparisonKappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology ComparisonKai Wähner
 
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry confluent
 
Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Kevin Weil
 

Mais procurados (20)

Kafka at Peak Performance
Kafka at Peak PerformanceKafka at Peak Performance
Kafka at Peak Performance
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Introduction to Kafka Cruise Control
Introduction to Kafka Cruise ControlIntroduction to Kafka Cruise Control
Introduction to Kafka Cruise Control
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache Kafka
 
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
 
Maintaining Consistency for a Financial Event-Driven Architecture (Iago Borge...
Maintaining Consistency for a Financial Event-Driven Architecture (Iago Borge...Maintaining Consistency for a Financial Event-Driven Architecture (Iago Borge...
Maintaining Consistency for a Financial Event-Driven Architecture (Iago Borge...
 
Apache Kafka® Use Cases for Financial Services
Apache Kafka® Use Cases for Financial ServicesApache Kafka® Use Cases for Financial Services
Apache Kafka® Use Cases for Financial Services
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
 
Apache Kafka at LinkedIn
Apache Kafka at LinkedInApache Kafka at LinkedIn
Apache Kafka at LinkedIn
 
From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
 
Kafka replication apachecon_2013
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013
 
Handle Large Messages In Apache Kafka
Handle Large Messages In Apache KafkaHandle Large Messages In Apache Kafka
Handle Large Messages In Apache Kafka
 
Apache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and DevelopersApache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and Developers
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Kappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology ComparisonKappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology Comparison
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
 
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
 
Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)
 

Semelhante a Scaling a Core Banking Engine Using Apache Kafka | Peter Dudbridge, Thought Machine

Reasoning about data and consistency in systems
Reasoning about data and consistency in systemsReasoning about data and consistency in systems
Reasoning about data and consistency in systemsDaniel Norman
 
DevOps Days Vancouver 2014 Slides
DevOps Days Vancouver 2014 SlidesDevOps Days Vancouver 2014 Slides
DevOps Days Vancouver 2014 SlidesAlex Cruise
 
Microservices Coordination using Saga
Microservices Coordination using SagaMicroservices Coordination using Saga
Microservices Coordination using SagaEran Levy
 
Distributed System explained (with NodeJS) - Bruno Bossola - Codemotion Milan...
Distributed System explained (with NodeJS) - Bruno Bossola - Codemotion Milan...Distributed System explained (with NodeJS) - Bruno Bossola - Codemotion Milan...
Distributed System explained (with NodeJS) - Bruno Bossola - Codemotion Milan...Codemotion
 
Don’t give up, You can... Cache!
Don’t give up, You can... Cache!Don’t give up, You can... Cache!
Don’t give up, You can... Cache!Stefano Fago
 
designing distributed scalable and reliable systems
designing distributed scalable and reliable systemsdesigning distributed scalable and reliable systems
designing distributed scalable and reliable systemsMauro Servienti
 
Distributed systems and consistency
Distributed systems and consistencyDistributed systems and consistency
Distributed systems and consistencyseldo
 
Scalable, good, cheap
Scalable, good, cheapScalable, good, cheap
Scalable, good, cheapMarc Cluet
 
Doing Enterprise Business with Processes & Rules
Doing Enterprise Business with Processes & RulesDoing Enterprise Business with Processes & Rules
Doing Enterprise Business with Processes & RulesSrinath Perera
 
Ensuring Your Technology Will Scale
Ensuring Your Technology Will ScaleEnsuring Your Technology Will Scale
Ensuring Your Technology Will Scalebasissetventures
 
Retaining Goodput with Query Rate Limiting
Retaining Goodput with Query Rate LimitingRetaining Goodput with Query Rate Limiting
Retaining Goodput with Query Rate LimitingScyllaDB
 
Lessons from managing a Pulsar cluster (Nutanix)
Lessons from managing a Pulsar cluster (Nutanix)Lessons from managing a Pulsar cluster (Nutanix)
Lessons from managing a Pulsar cluster (Nutanix)StreamNative
 
CAP Theorem - Theory, Implications and Practices
CAP Theorem - Theory, Implications and PracticesCAP Theorem - Theory, Implications and Practices
CAP Theorem - Theory, Implications and PracticesYoav Francis
 
Distributed System explained (with Java Microservices)
Distributed System explained (with Java Microservices)Distributed System explained (with Java Microservices)
Distributed System explained (with Java Microservices)Mario Romano
 

Semelhante a Scaling a Core Banking Engine Using Apache Kafka | Peter Dudbridge, Thought Machine (20)

Reasoning about data and consistency in systems
Reasoning about data and consistency in systemsReasoning about data and consistency in systems
Reasoning about data and consistency in systems
 
DevOps Days Vancouver 2014 Slides
DevOps Days Vancouver 2014 SlidesDevOps Days Vancouver 2014 Slides
DevOps Days Vancouver 2014 Slides
 
Microservices Coordination using Saga
Microservices Coordination using SagaMicroservices Coordination using Saga
Microservices Coordination using Saga
 
Messaging
MessagingMessaging
Messaging
 
Messaging
MessagingMessaging
Messaging
 
Distributed System explained (with NodeJS) - Bruno Bossola - Codemotion Milan...
Distributed System explained (with NodeJS) - Bruno Bossola - Codemotion Milan...Distributed System explained (with NodeJS) - Bruno Bossola - Codemotion Milan...
Distributed System explained (with NodeJS) - Bruno Bossola - Codemotion Milan...
 
Distributed Systems
Distributed SystemsDistributed Systems
Distributed Systems
 
Bloom plseminar-sp15
Bloom plseminar-sp15Bloom plseminar-sp15
Bloom plseminar-sp15
 
Don’t give up, You can... Cache!
Don’t give up, You can... Cache!Don’t give up, You can... Cache!
Don’t give up, You can... Cache!
 
designing distributed scalable and reliable systems
designing distributed scalable and reliable systemsdesigning distributed scalable and reliable systems
designing distributed scalable and reliable systems
 
Distributed systems and consistency
Distributed systems and consistencyDistributed systems and consistency
Distributed systems and consistency
 
Scalable, good, cheap
Scalable, good, cheapScalable, good, cheap
Scalable, good, cheap
 
Hadoop bank
Hadoop bankHadoop bank
Hadoop bank
 
Doing Enterprise Business with Processes & Rules
Doing Enterprise Business with Processes & RulesDoing Enterprise Business with Processes & Rules
Doing Enterprise Business with Processes & Rules
 
Ensuring Your Technology Will Scale
Ensuring Your Technology Will ScaleEnsuring Your Technology Will Scale
Ensuring Your Technology Will Scale
 
Retaining Goodput with Query Rate Limiting
Retaining Goodput with Query Rate LimitingRetaining Goodput with Query Rate Limiting
Retaining Goodput with Query Rate Limiting
 
Software + Babies
Software + BabiesSoftware + Babies
Software + Babies
 
Lessons from managing a Pulsar cluster (Nutanix)
Lessons from managing a Pulsar cluster (Nutanix)Lessons from managing a Pulsar cluster (Nutanix)
Lessons from managing a Pulsar cluster (Nutanix)
 
CAP Theorem - Theory, Implications and Practices
CAP Theorem - Theory, Implications and PracticesCAP Theorem - Theory, Implications and Practices
CAP Theorem - Theory, Implications and Practices
 
Distributed System explained (with Java Microservices)
Distributed System explained (with Java Microservices)Distributed System explained (with Java Microservices)
Distributed System explained (with Java Microservices)
 

Mais de HostedbyConfluent

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Renaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit LondonRenaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit LondonHostedbyConfluent
 
Evolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at TrendyolEvolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at TrendyolHostedbyConfluent
 
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking TechniquesEnsuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking TechniquesHostedbyConfluent
 
Exactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and KafkaExactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and KafkaHostedbyConfluent
 
Fish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit LondonFish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit LondonHostedbyConfluent
 
Tiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit LondonTiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit LondonHostedbyConfluent
 
Building a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And WhyBuilding a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And WhyHostedbyConfluent
 
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...HostedbyConfluent
 
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...HostedbyConfluent
 
Navigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka ClustersNavigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka ClustersHostedbyConfluent
 
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data PlatformApache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data PlatformHostedbyConfluent
 
Explaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy PubExplaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy PubHostedbyConfluent
 
TL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit LondonTL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit LondonHostedbyConfluent
 
A Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSLA Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSLHostedbyConfluent
 
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing PerformanceMastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing PerformanceHostedbyConfluent
 
Data Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and BeyondData Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and BeyondHostedbyConfluent
 
Code-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink AppsCode-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink AppsHostedbyConfluent
 
Debezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC EcosystemDebezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC EcosystemHostedbyConfluent
 
Beyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local DisksBeyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local DisksHostedbyConfluent
 

Mais de HostedbyConfluent (20)

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Renaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit LondonRenaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit London
 
Evolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at TrendyolEvolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at Trendyol
 
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking TechniquesEnsuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
 
Exactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and KafkaExactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and Kafka
 
Fish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit LondonFish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit London
 
Tiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit LondonTiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit London
 
Building a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And WhyBuilding a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And Why
 
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
 
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
 
Navigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka ClustersNavigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka Clusters
 
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data PlatformApache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
 
Explaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy PubExplaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy Pub
 
TL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit LondonTL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit London
 
A Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSLA Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSL
 
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing PerformanceMastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
 
Data Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and BeyondData Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and Beyond
 
Code-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink AppsCode-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink Apps
 
Debezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC EcosystemDebezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC Ecosystem
 
Beyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local DisksBeyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local Disks
 

Último

Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 

Último (20)

Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 

Scaling a Core Banking Engine Using Apache Kafka | Peter Dudbridge, Thought Machine

  • 1. Scaling a core banking engine using Apache Kafka Peter Dudbridge (aka. Dudders) Engineering Director Thought Machine
  • 2. What will we cover? The relationship between correctness and scale, through the lens of a core banking system ● To guide us through this subject we will tell a story starting with the monolith and ending with a scalable distributed system ● We will encounter some patterns and anti patterns on the way ● We will see some tricky tradeoffs, but also show that with the right design, the need not hold us back! ● Loosely based on real life experiences! Note: I assume you know what kafka and microservices are (if not you may be lost)
  • 3. What is core banking and why do we care? The most important system that you’ve never heard of (unless something goes wrong!) ● We need both correctness and scale! ● You want 24/7 access to your money ● You don’t want to lose all your money! ● You also don’t want to pay for your current account… Core banking, at its heart, is: ● Product engine ● Ledger (accounts + balances)
  • 4. The monolith Works great! ● Everything is one machine ● No network calls! Yay! ● Total ordering, single clock ● Need more resources? > Mainframe: we’ve got you covered! Waaait what year is it again? ● Glass ceiling ● Very expensive ● Core banking traffic is very spiky ● HA / DR
  • 5. You know the answer to the problem: microservices! ● We need an architecture that can scale horizontally, on demand: that means microservices ● Has anyone tried carving up a monolith? ● We quickly realize the things that worked well in the monolith fall flat on their face when distributed ● We also quickly realize that we’re now playing a game of trade offs ○ Latency vs throughput ○ Consistency vs availability ○ Correctness vs everything! ● Usually the first thing we realize is we need to embrace asynchronousity Towards microservices
  • 6. Why Kafka Why am I speaking at this conference... we just need an async queue type thing right? why use Kafka? ● Kafka streams can be partitioned, which allows it to scale ● Kafka can be configured to be durable (good for banking!) ● Kafka is a commit log! ○ This has changed the game with regards to stream processing ○ Towards a general theory of table and stream relativity: The aggregation of a stream of updates over time yields a table The observation of changes to a table over time yields a stream - Streaming Systems
  • 7. Some prerequisites Here are some prerequisites for correctness that aren’t the focus of today’s talk ● We choose consistency over availability (when we are forced to make a choice) ● Idempotency and retryability is baked in ● This gives us effective once delivery ● Holistic consistency ● Dual writes - pick your favourite pattern (we use transactional outbox)
  • 8. Let’s solutionize! So let’s start with the general problem we’re trying to solve (then immediately start solutionizing) ● We’re processing a payment. In core banking lingo this is a posting ● In Thought Machine we have a super flexible product engine that is driven by an abstraction we refer to as Smart Contracts (not dogecoin, sorry) ● The trade off with having super flexibility, is we have to do super loads of work on the hot path
  • 9. Correctness - our first battle Since streams are data in motion, tables are data at rest, we decide to store our ledger / balances and accounts in a database We build our prototype but quickly realize a problem ● When two users concurrently access one account, we see a race condition ● Fortunately there’s a common pattern for solving this!
  • 10. Scale - our first battle So now our system is ‘correct’, however this has come at a price, let’s open up our observability dashboard to see if we can’t offset this cost! ● We’re spending all of our time on the network ● We’re having to scale out our application to handle more requests, but our database is running out of connections! ● We’re burning cash and we seem to be hitting the exact problem we moved away from the monolith to solve ● Batching is the answer. Don’t worry - it’s a subset of streaming
  • 11. Batching strategies The big questions is - what should the batch size be? ● No ‘one size fits all’ answer ● Consider what work we are batching: ○ Kafka consumer batches ○ Calls our consumer might be making to another microservice ○ Database queries / commands ○ Kafka producing batches ○ Kafka committing offsets ● Rule of thumb - try a few strategies out!
  • 12. Batching strategies #strategy 1 - one batch to rule them all ● Maintain whatever batch size we pull from Kafka in our consumers ● We can tweak our batch size via the Kafka consumer settings: fetch.max.wait.ms | fetch.min.bytes | fetch.max.bytes ● Works great if our batch is homogeneous ● Easy to reason about architecture, single place to configure out batch sizes ● Assumes the batch size the works for kafka also works for the DB, calls to other microservices, etc.
  • 13. Batching strategies #strategy 2 - fan out ● If our batch is heterogeneous we might see that some elements of the batch process quicker than others, in #strategy 1 we always pay the price of the slowest element ● If our batch contains different types of work (e.g. different messages need routing to different places) fan out might be our only option - if this is the case first consider separate topics ● If we fan out we might get benefits with parallelizing work (always test this!) ● The long tail problem isn’t trivial to solve ● Consume a batch of 3 from Kafka ● Fan out the request ● Decide how long to wait, save the long tail to local storage ● Commit all offsets and pull next batch
  • 14. Batching strategies #strategy 3 - dynamic batch sizing If our batches are super heterogeneous / unpredictable a batch size that works well today might not work so well tomorrow (singles day anyone?) ● Monitor our batch response time / error rate over time ● If the response time gets quicker increase the batch size ● If the response time gets slower and we observe timeouts, decrease the batch size ● Use sensible bounds! Lots of things can affect response times so is easy to scale on a false positive!
  • 15. Correctness: we meet again... We play around with batch sizes / strategies, and we find that sweet spot that gives us the best throughput / latency profile, however we spot something odd when perusing the observability dashboards: Woah - so we introduced batching to mitigate the cost of correctness, but it looks like we hit a point where it stops to help, and even makes it worse! ● The bigger the batch, the more likely we are to have a lock conflict with a batch being processed by another consumer Pro tip - know your bottlenecks!
  • 16. Correctness: we meet again... If only there was a way we could reduce the number of lock conflicts then maybe we could further increase our batch size and get more perf gainz? … what if there was a way we could make sure that messages that might conflict always go to the same consumer? That way we could resolve the conflict in the application, or evict it to the next batch?
  • 17. Ordering Ordering! ● A key feature of Kafka is the consistent hashing of user defined partition keys, guaranteeing that any message with a given partition keys’ value, will always go to the same consumer (until a rebalance happens) ● If we partition on Account ID, we can capitalize on the affinity this gives us
  • 18. Ordering Warning! Ordering is a double edged sword You may recall that we had a total ordering in our monolith, which made things nice ● The only way to have a total ordering with Kafka is to limit our topic to 1 partition (and as a result a single consumer) - very bad for scale! Do we need a total ordering? Maybe not 1. Identify the causal relationships in your data model 2. Hopefully everything isn’t causally related. Banking is great as generally only stuff that happens within an account is causally related. My banking activity doesn’t (usually) affect to yours. If this is the case, partition on that 3. $$$ profit from a partial ordering $$$ Warning! Be cautious about having a hard dependency on ordering. Whereas Kafka guarantees an ordering per partition, it can be difficult / restrictive to maintain that ordering when processing, e.g.: ○ Fan out ○ Rebalancing ○ Stream joins
• 19. Scale: turning the tide of the war
By capitalizing on bank account affinity within our consumers we mitigated lock contention
● We can use larger batch sizes
● We now hit expected issues, such as max message size errors and poor latency profiles (we have to wait for batches to fill up!). We choose a batch size that sits comfortably below the point where we hit our known bottlenecks
Does this mean we can get rid of our lock? That way our perf would be even better?
● Not wise - we still want our safety net (see warnings from the last slide!)
● But maybe there's a better way to do it...
• 20. Effective concurrency control
Sometimes the best lock is no lock
● Our locking mechanism is quite primitive, and crucially, pessimistic
● Pessimistic locks work best when we expect collisions
● Optimistic 'locking' doesn't use a lock, so is cheaper in the happy case, but more expensive when we have a collision
• 21. Effective concurrency control
Optimistic concurrency control is sometimes implemented using real timestamps; in a distributed system it's often better to use logical timestamps (see the sketch below)
● Let's imagine we want to support rounding up to the nearest dollar; here are our requirements:
○ Rounding MUST NOT happen on the hot path
○ Rounding MUST happen strictly after the original posting (i.e. no intermediate postings are allowed)
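One common way to realize optimistic concurrency with logical timestamps is a compare-and-swap on a version column. The sketch below assumes a hypothetical balances table with amount and version columns; it is an illustration of the general technique, not necessarily how Thought Machine implements it:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class OptimisticBalanceStore {
    /**
     * Compare-and-swap on a logical version column: the UPDATE only succeeds
     * if nobody else has bumped the version since we read it. No row lock is
     * held across the read; on a collision we simply re-read and retry, which
     * is cheap when collisions are rare (the optimistic happy case).
     */
    boolean tryApply(Connection conn, String accountId,
                     long newBalanceMinorUnits, long expectedVersion) throws SQLException {
        String sql = "UPDATE balances SET amount = ?, version = version + 1 "
                   + "WHERE account_id = ? AND version = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, newBalanceMinorUnits);
            ps.setString(2, accountId);
            ps.setLong(3, expectedVersion);
            return ps.executeUpdate() == 1;   // 0 rows => someone beat us; retry
        }
    }
}
```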
• 22. Scale - bringing it home
Our system is beginning to flow. We've got some smart batching strategies, and effective use of partitioning helps reduce transaction contention in our database
We can now handle hundreds of concurrent users. However, we hit a wall as we try to scale out more: our application is stateless so can scale horizontally, but our database is a single point of contention and can only scale vertically
● Read replicas - synchronous replicas can serve consistent reads, keeping contention off the master
● NewSQL - emerging databases such as Spanner and CockroachDB might help (not a silver bullet!)
● Shared-nothing architecture
● AZ affinity (but don't rely on it!)
I made it go 10x faster without sacrificing correctness! Client: it needs to go 100x faster
• 23. Data locality
Whereas these strategies help, they're only really buying us time, and certainly not getting us the stonks we need
We take a good hard look at our system: we still seem to be spending most of our time on the network going to the database. What's more, looking at these interactions there seems to be a lot of commonality - we keep asking the same questions and getting the same answers
If only there was a way we could avoid these redundant interactions?
Data locality!
● Bringing the data closer to the process. Or the other way around, but stored procs aren't cool anymore :(
● This can mean geographically closer, but can also mean computationally closer, e.g. caching the result of a costly computation
• 24. Data locality
Scenario: our contract execution requires the current month's postings, to check for reward eligibility.
Current state: every time we process a posting we're fetching the same month's worth of data (+1) from the database
#strategy 1 - read-through cache
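A minimal read-through sketch using ConcurrentHashMap.computeIfAbsent; a production cache would add TTLs and size bounds (e.g. a library like Caffeine). All names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

public class PostingsCache {
    // accountId -> current month's postings. Eviction / TTLs elided for brevity.
    private final ConcurrentHashMap<String, List<Posting>> cache = new ConcurrentHashMap<>();

    /** Read-through: serve from the cache, fall back to the DB only on a miss. */
    List<Posting> currentMonthPostings(String accountId) {
        return cache.computeIfAbsent(accountId, this::loadFromDatabase);
    }

    /** On each new posting, append to the cached view instead of re-fetching. */
    void onNewPosting(String accountId, Posting p) {
        // computeIfPresent runs atomically per key, so the append is safe with
        // respect to other writers; copy-on-write is an alternative for readers.
        cache.computeIfPresent(accountId, (id, postings) -> {
            postings.add(p);
            return postings;
        });
    }

    private List<Posting> loadFromDatabase(String accountId) {
        return new ArrayList<>();  // placeholder for the real SELECT
    }
    record Posting(String id, long amountMinorUnits) {}
}
```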
• 25. Data locality
#strategy 2 - in memory
The intermediate cache works nicely. However, we're still having to transfer a lot of data over the wire, and whereas the cache is more performant, it's still a single point of contention
If only there was a nice way to partition the cached data - wait a minute...
Disclaimer: bootstrapping can be a difficult problem to solve!
• 26. Data locality
#strategy 3 - stateful services 😲
If we're already storing our postings in memory, and we're only ever fetching the most recent posting from the DB... but isn't our processor the thing that processed that most recent posting?
• 27. Data locality
... but this is blasphemy! Our services should be stateLESS?!
"For a long time now stateless services have been the royal road to scalability. Nearly every treatise on scalability declares statelessness as the best practices approved method for building scalable systems. A stateless architecture is easy to scale horizontally and only requires simple round-robin load balancing. What's not to love? Perhaps the increased latency from the roundtrips to the database. Or maybe the complexity of the caching layer required to hide database latency problems. Or even the troublesome consistency issues." - Todd Hoff
But what about the bootstrapping problem? ... Hello Kafka! Remember:
The aggregation of a stream of updates over time yields a table (sketched below)
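A sketch of that bootstrapping idea: replay the stream from the beginning into a map for the partitions this instance owns. It assumes a compacted topic keyed by account ID so the replay stays bounded; all names are hypothetical:

```java
import java.time.Duration;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class StateBootstrapper {
    /**
     * Rebuild the in-memory table for the partitions we own by aggregating the
     * stream of updates from the beginning: the stream becomes a table.
     */
    Map<String, Long> hydrate(KafkaConsumer<String, Long> consumer,
                              Collection<TopicPartition> owned) {
        Map<String, Long> balances = new HashMap<>();
        consumer.assign(owned);
        consumer.seekToBeginning(owned);
        Map<TopicPartition, Long> end = consumer.endOffsets(owned);
        while (!caughtUp(consumer, end)) {
            ConsumerRecords<String, Long> records = consumer.poll(Duration.ofMillis(200));
            for (ConsumerRecord<String, Long> r : records) {
                balances.put(r.key(), r.value());   // last write wins per account
            }
        }
        return balances;   // now serve reads locally; no DB round trips
    }

    private boolean caughtUp(KafkaConsumer<String, Long> consumer,
                             Map<TopicPartition, Long> end) {
        return end.entrySet().stream()
                  .allMatch(e -> consumer.position(e.getKey()) >= e.getValue());
    }
}
```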
• 28. Correctness: boss fight
Core banking's lust for correctness rears its head
● We've got by so far with a partial (per-account) ordering
● We now realize that we have to obey the laws of accounting
○ We need to construct a trial balance at the end of the business day that will feed into the daily balance sheet
● This means we need to calculate a balance for every account - even internal accounts (double entry)
● Do we need a total ordering after all?
Pro tip: don't push your consistency problems onto your clients (we want holistic consistency)
• 29. Watermarks
Scenario: end-of-business-day reconciliation
We need a way to calculate a balance for every account in the bank, including internal accounts
● We can't calculate balances for internal accounts on the hot path, as we'd get massive lock contention
● If our cut-off is 12pm, how do we know when we've received all postings before this time?
Watermarks! A watermark is a monotonically increasing timestamp of the oldest work not yet completed
This means that if a watermark advances past some time T (e.g. 12pm), we are guaranteed that no more processing will happen for events at or previous to T
● Balances are calculated from the timestamp set by the ledger, aka. processing time (not event time)
● Because our cut-off is a fixed point, we're dealing with fixed windows
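A toy illustration of the definition (not the talk's implementation): track in-flight work keyed by its processing timestamp, and report the oldest incomplete timestamp as the watermark. For brevity it assumes processing timestamps are unique:

```java
import java.util.concurrent.ConcurrentSkipListMap;

public class WatermarkTracker {
    // In-flight postings keyed by their ledger (processing) timestamp.
    private final ConcurrentSkipListMap<Long, String> inFlight = new ConcurrentSkipListMap<>();

    void started(long processingTs, String postingId) { inFlight.put(processingTs, postingId); }
    void completed(long processingTs) { inFlight.remove(processingTs); }

    /**
     * The watermark is the timestamp of the oldest work not yet completed;
     * everything at or before it is guaranteed done. Once the watermark passes
     * a fixed window's close time T, that window can safely be finalized.
     * If nothing is in flight, the watermark advances with the clock.
     */
    long watermark(long nowTs) {
        return inFlight.isEmpty() ? nowTs : inFlight.firstKey();
    }
}
```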
• 30. Watermarks
#strategy 1 - heuristic watermarks
Perhaps the easiest approach would be just to wait a bit of time after the cut-off
● How long do we wait? The client needs this ASAP!
● How do we tell the difference between idleness and a network partition?
● We don't want to smash the ledger to get the result - let's use Kafka :)
[if idle] the system clock gives us an indication when the fixed window is closing
[if active] watch out for receiving postings with a timestamp after T
Health-check Kafka + the processors
Publish to a watermark stream (out of band) for a downstream balance processor to read
• 31. Watermarks
#strategy 2 - perfect watermarks
Strategy 1 works great, although given it's a heuristic approach we will inherently always have the possibility of late data - what could go wrong!
● Since we are using processing timestamps, i.e. we have ingress timestamping, it is possible to have a perfect watermark approach
[if active] detect if we write a posting past the close timestamp of a window
[if idle] the system clock indicates a window is closing; check for any open DB transactions
Publish the watermark in-band, explicitly to every partition (shocking, I know). The downstream balance processor waits for a watermark from all partitions (see the sketch below)
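A sketch of the in-band broadcast, assuming a JSON control record and a known partition count; the control-record format here is invented for illustration:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class WatermarkBroadcaster {
    private final KafkaProducer<String, String> producer;
    private final String topic;
    private final int partitionCount;

    WatermarkBroadcaster(Properties props, String topic, int partitionCount) {
        this.producer = new KafkaProducer<>(props);
        this.topic = topic;
        this.partitionCount = partitionCount;
    }

    /**
     * Write the watermark explicitly to every partition, so each downstream
     * balance consumer sees it in order with the postings on its partition.
     * The window is only closed once a watermark >= T has arrived on ALL
     * partitions - a watermark is meaningless until every partition has one.
     */
    void broadcast(long watermarkTs) {
        String control = "{\"type\":\"WATERMARK\",\"ts\":" + watermarkTs + "}";
        for (int p = 0; p < partitionCount; p++) {
            // ProducerRecord(topic, partition, key, value) pins the partition.
            producer.send(new ProducerRecord<>(topic, p, "watermark", control));
        }
        producer.flush();
    }
}
```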
• 32. Correctness - a final word
A quick note on high availability and disaster recovery
● Since we've optimized for batching, and we only hit the DB when we need to, we have inadvertently optimized our system not to depend on tight DB commit latencies
● For banks, we need zero data loss when it comes to DR. We can achieve this with sync replication!
● Can Kafka do multi-region replication like the database can?
○ Probably not synchronously! Tables are streams at rest! We can hydrate a new Kafka cluster from our journal
● Can we have multi-region active-active?
• 33. Scale - a final word
To achieve massive throughput in your microservices architecture you need to let your data flow
● Commit your offsets fast
○ Avoid service fan out (i.e. to different services)
○ Avoid chaining service calls
○ Avoid sync service calls for writes
○ Don't await on async request / response
● Consider choreography over orchestration
○ Aim to avoid any sort of shared contention
○ Think carefully about saga state. Consider the Routing Slip pattern
● Don't try and solve distributed 2PC
● ... but beware the saga rollback
● Observability!
○ Tracing can tell you a lot
○ CPU over time can tell you if your services aren't being utilized (e.g. your data isn't flowing)
• 34. Summary
● Building a system that is both correct and moves at scale is hard - but certainly not impossible!
● We can only get some of the way with patterns / anti-patterns
● The rest requires some creativity - architecture is the sum of its parts
○ Synergies between ordering and concurrency control
○ Choose a batching strategy that works best for the whole system
○ Your caching strategy should complement your system, not paper over the cracks
○ The fewer moving parts the better - building something simple is hard!
○ A bad design compounds problems - make a hard choice or pay a high cloud bill
○ If the end result is something the client can't use or doesn't want - we have failed
● Building on this, we can see trade-offs as relationships. We don't have to choose one or the other
○ Correctness AND scale
○ High throughput AND low latency
○ Consistency AND availability
• 35. Thanks for listening!
Feedback appreciated: peter.dudbridge@gmail.com
linkedin.com/in/peterdudbridge