Core banking is one of the last bastions of the mainframe. As many other industries have moved to the cloud, why are most of the world's banks yet to follow?
The answer lies in a bank's conflicting needs: correctness and scale, historically achievable using a monolithic application running on a large mainframe. The clock is ticking for the banks as we approach an inflection point where mainframes become too expensive, and aren't flexible enough to meet the modern banking consumer's needs.
A simple lift and shift onto the cloud does not work. As we distribute our core processing we spend an increasing amount of time on the network, and race conditions lurk that threaten 'correctness'.
This session explores how Thought Machine's core banking system 'Vault' was built in a cloud-first manner, leveraging Kafka to enable asynchronous and parallel processing at scale, specifically focusing on the architectural patterns we have used to ensure 'correctness' in such an environment.
Scaling a Core Banking Engine Using Apache Kafka | Peter Dudbridge, Thought Machine
1. Scaling a core banking engine
using Apache Kafka
Peter Dudbridge (aka. Dudders)
Engineering Director
Thought Machine
2. What will we cover?
The relationship between correctness and scale, through the lens of a core banking system
● To guide us through this subject we will tell a story starting with the monolith and ending with a
scalable distributed system
● We will encounter some patterns and anti patterns on the way
● We will see some tricky tradeoffs, but also show that with the right design, they need not hold us
back!
● Loosely based on real life experiences!
Note: I assume you know what Kafka and microservices are (if not, you may be lost)
3. What is core banking and why do we care?
The most important system that you’ve never heard of (unless something goes
wrong!)
● We need both correctness and scale!
● You want 24/7 access to your money
● You don’t want to lose all your money!
● You also don’t want to pay for your current account…
Core banking, at its heart, is:
● Product engine
● Ledger (accounts + balances)
4. The monolith
Works great!
● Everything is one machine
● No network calls! Yay!
● Total ordering, single clock
● Need more resources?
> Mainframe: we’ve got you covered!
Waaait what year is it again?
● Glass ceiling
● Very expensive
● Core banking traffic is very spiky
● HA / DR
5. You know the answer to the problem: microservices!
● We need an architecture that can scale horizontally, on demand: that
means microservices
● Has anyone tried carving up a monolith?
● We quickly realize the things that worked well in the monolith fall flat on
their face when distributed
● We also quickly realize that we’re now playing a game of trade offs
○ Latency vs throughput
○ Consistency vs availability
○ Correctness vs everything!
● Usually the first thing we realize is we need to embrace asynchrony
Towards microservices
6. Why Kafka
Why am I speaking at this conference... we just need an async queue type thing, right? So
why use Kafka?
● Kafka topics can be partitioned, which allows them to scale
● Kafka can be configured to be durable (good for banking!)
● Kafka is a commit log!
○ This has changed the game with regards to stream processing
○ Towards a general theory of table and stream relativity:
The aggregation of a stream of updates over time yields a table
The observation of changes to a table over time yields a stream
- Streaming Systems
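The stream/table duality quoted above can be sketched in a few lines. This is an illustrative toy (the names and data are hypothetical), showing that folding a stream of updates yields a table, and recording each change to that table yields a stream again:

```python
# Stream/table duality sketch: a "stream" here is a list of
# (account, delta) update events.

def aggregate(stream):
    """The aggregation of a stream of updates over time yields a table."""
    table = {}
    for account, delta in stream:
        table[account] = table.get(account, 0) + delta
    return table

def changelog(stream):
    """The observation of changes to a table over time yields a stream
    of (account, new_balance) change events."""
    table = {}
    for account, delta in stream:
        table[account] = table.get(account, 0) + delta
        yield (account, table[account])

postings = [("alice", 100), ("bob", 50), ("alice", -30)]
```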
7. Some prerequisites
Here are some prerequisites for correctness that aren’t the focus of today’s talk
● We choose consistency over availability (when we are forced to make a choice)
● Idempotency and retryability is baked in
● This gives us effectively-once delivery
● Holistic consistency
● Dual writes - pick your favourite pattern (we use transactional outbox)
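A minimal sketch of the transactional outbox pattern mentioned above, using SQLite for brevity (the table names and the in-process "relay" are illustrative; in practice a separate relay process would drain the outbox to Kafka):

```python
# Transactional outbox: the business write and the outbox write commit
# atomically, so we never publish an event for a posting that didn't
# commit, and never drop an event for one that did.
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE postings (id INTEGER PRIMARY KEY, account TEXT, amount INTEGER);
    CREATE TABLE outbox   (id INTEGER PRIMARY KEY, payload TEXT, published INTEGER DEFAULT 0);
""")

def record_posting(account, amount):
    with db:  # one transaction covers both inserts
        db.execute("INSERT INTO postings (account, amount) VALUES (?, ?)",
                   (account, amount))
        db.execute("INSERT INTO outbox (payload) VALUES (?)",
                   (json.dumps({"account": account, "amount": amount}),))

def relay_unpublished():
    """Stand-in for the relay that drains the outbox to Kafka."""
    rows = db.execute("SELECT id, payload FROM outbox WHERE published = 0").fetchall()
    db.executemany("UPDATE outbox SET published = 1 WHERE id = ?",
                   [(row_id,) for row_id, _ in rows])
    db.commit()
    return [json.loads(payload) for _, payload in rows]

record_posting("alice", 100)
```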
8. Let’s solutionize!
So let’s start with the general problem we’re trying to solve (then immediately start solutionizing)
● We’re processing a payment. In core banking lingo this is a posting
● In Thought Machine we have a super flexible product engine that is driven by an abstraction we
refer to as Smart Contracts (not dogecoin, sorry)
● The trade off with having super flexibility, is we have to do super loads of work on the hot path
9. Correctness - our first battle
Since streams are data in motion and tables are data at rest, we decide to store our ledger (accounts and
balances) in a database
We build our prototype but quickly realize a problem
● When two users concurrently access one account, we see a race condition
● Fortunately there’s a common pattern for solving this!
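To make the race condition concrete, here is a deterministic sketch of the classic lost update (the interleaving is simulated in-process rather than with real threads, and the function names are illustrative):

```python
# Two consumers each do a read-modify-write against the same account.

def lost_update(start, delta_a, delta_b):
    """Both consumers read the balance before either writes: the second
    write clobbers the first, and one update is silently lost."""
    read_a = start          # consumer A reads the balance
    read_b = start          # consumer B reads the same balance
    balance = read_a + delta_a   # A writes its result
    balance = read_b + delta_b   # B overwrites it: A's update is lost
    return balance

def serialized(start, delta_a, delta_b):
    """With per-account mutual exclusion (e.g. a pessimistic row lock),
    each read-modify-write completes before the next begins."""
    balance = start
    for delta in (delta_a, delta_b):
        balance = balance + delta
    return balance
```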
10. Scale - our first battle
So now our system is ‘correct’, however this has come at a price, let’s open up our observability dashboard
to see if we can’t offset this cost!
● We’re spending all of our time on the network
● We’re having to scale out our application to handle more requests, but our database is running out
of connections!
● We’re burning cash and we seem to be hitting the exact problem we moved away from the
monolith to solve
● Batching is the answer. Don’t worry - it’s a subset of streaming
11. Batching strategies
The big question is - what should the batch size be?
● No ‘one size fits all’ answer
● Consider what work we are batching:
○ Kafka consumer batches
○ Calls our consumer might be making to another microservice
○ Database queries / commands
○ Kafka producing batches
○ Kafka committing offsets
● Rule of thumb - try a few strategies out!
12. Batching strategies
#strategy 1 - one batch to rule them all
● Maintain whatever batch size we pull from Kafka in our consumers
● We can tweak our batch size via the Kafka consumer settings:
fetch.max.wait.ms | fetch.min.bytes | fetch.max.bytes
● Works great if our batch is homogeneous
● Easy-to-reason-about architecture, single place to configure our batch sizes
● Assumes the batch size that works for Kafka also works for the DB, calls to other microservices, etc.
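The consumer settings named above might be tuned along these lines (the values are placeholders, not recommendations; `max.poll.records` is an additional setting that bounds the batch handed to the application per poll):

```python
# Illustrative Kafka consumer settings for tuning the pulled batch size.
consumer_config = {
    "fetch.max.wait.ms": 500,            # wait up to 500 ms for fetch.min.bytes to accumulate
    "fetch.min.bytes": 64 * 1024,        # don't return less than 64 KiB unless the wait expires
    "fetch.max.bytes": 8 * 1024 * 1024,  # hard cap on data returned in one fetch
    "max.poll.records": 500,             # cap on records handed to the app per poll
}
```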
13. Batching strategies
#strategy 2 - fan out
● If our batch is heterogeneous we might see that some elements of the batch process quicker than
others, in #strategy 1 we always pay the price of the slowest element
● If our batch contains different types of work (e.g. different messages need routing to different
places) fan out might be our only option - if this is the case first consider separate topics
● If we fan out we might get benefits with parallelizing work (always test this!)
● The long tail problem isn’t trivial to solve
● Consume a batch of 3 from Kafka
● Fan out the request
● Decide how long to wait, save the long tail to local storage
● Commit all offsets and pull next batch
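The fan-out steps above can be sketched as follows. This is a toy, assuming a hypothetical `process` call standing in for a downstream service, and a plain list standing in for durable local storage of the long tail:

```python
# Fan out a batch, wait a bounded time, park stragglers so the whole
# batch's offsets can still be committed.
from concurrent.futures import ThreadPoolExecutor, wait
import time

def process(msg):
    if msg.startswith("slow"):
        time.sleep(1.0)   # simulate a long-tail call to a downstream service
    return msg.upper()

def handle_batch(batch, timeout_s=0.25):
    long_tail = []        # stand-in for durable local storage
    pool = ThreadPoolExecutor(max_workers=len(batch))
    futures = {pool.submit(process, msg): msg for msg in batch}
    done, not_done = wait(futures, timeout=timeout_s)
    results = sorted(f.result() for f in done)
    for f in not_done:
        long_tail.append(futures[f])   # park stragglers, don't block the batch
    pool.shutdown(wait=False)
    # With the long tail parked locally, it's safe to commit all offsets
    # and pull the next batch.
    return results, sorted(long_tail)
```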
14. Batching strategies
#strategy 3 - dynamic batch sizing
If our batches are super heterogeneous / unpredictable a batch size that works well today might not work
so well tomorrow (singles day anyone?)
● Monitor our batch response time / error rate over time
● If the response time gets quicker increase the batch size
● If the response time gets slower and we observe timeouts, decrease the batch size
● Use sensible bounds! Lots of things can affect response times, so it's easy to scale on a false positive!
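One way to realize the rules above is additive-increase / multiplicative-decrease (AIMD), with hard bounds so a false positive can't run away. All constants here are illustrative:

```python
# Dynamic batch sizing sketch: probe upward gently when responses are
# quick, back off sharply on timeouts or slow responses, and clamp to
# sensible bounds either way.

MIN_BATCH, MAX_BATCH = 10, 5000

def next_batch_size(current, response_time_s, timed_out,
                    target_s=1.0, step=50):
    if timed_out or response_time_s > target_s:
        proposed = current // 2      # multiplicative decrease on trouble
    else:
        proposed = current + step    # additive increase when healthy
    return max(MIN_BATCH, min(MAX_BATCH, proposed))
```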
15. Correctness: we meet again...
We play around with batch sizes / strategies, and we find that sweet spot that gives us the best
throughput / latency profile, however we spot something odd when perusing the observability
dashboards:
Woah - so we introduced batching to mitigate the cost of correctness, but it looks like we hit a point
where it stops helping, and even makes things worse!
● The bigger the batch, the more likely we are to have a lock conflict with a batch being processed by
another consumer
Pro tip - know your bottlenecks!
16. Correctness: we meet again...
If only there was a way we could reduce the number of lock
conflicts then maybe we could further increase our batch size and
get more perf gainz?
… what if there was a way we could make sure that messages that might conflict always go to the same
consumer? That way we could resolve the conflict in the application, or evict it to the next batch?
17. Ordering
Ordering!
● A key feature of Kafka is the consistent hashing of user-defined partition keys, guaranteeing that
any message with a given partition key's value will always go to the same consumer (until a
rebalance happens)
● If we partition on Account ID, we can capitalize on the affinity this gives us
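The property we're relying on can be sketched as below. Kafka's default partitioner actually hashes keys with murmur2; here a stable stand-in hash (MD5) is used purely to show the behaviour that matters: the same Account ID always lands on the same partition, so one consumer sees all of an account's messages:

```python
# Key-based partitioning sketch (hash function is a stand-in for
# Kafka's murmur2 default partitioner).
import hashlib

NUM_PARTITIONS = 12

def partition_for(account_id: str) -> int:
    digest = hashlib.md5(account_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS
```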
18. Ordering
Warning! Ordering is a double edged sword
You may recall that we had a total ordering in our monolith, which made things nice
● The only way to have a total ordering with Kafka is to limit our topic to 1 partition (and as a result a
single consumer) - very bad for scale!
Do we need a total ordering? Maybe not
1. Identify the causal relationships in your data model
2. Hopefully everything isn't causally related. Banking is great as generally only stuff that happens
within an account is causally related. My banking activity doesn't (usually) affect yours. If this is
the case, partition on that
3. $$$ profit from a partial ordering $$$
Warning! Be cautious about having a hard dependency on ordering. Whereas Kafka guarantees an
ordering per partition, it can be difficult / restrictive to maintain that ordering when processing, e.g.:
○ Fan out
○ Rebalancing
○ Stream joins
19. Scale: turning the tide of the war
By capitalizing on bank account affinity within our consumers we mitigated lock contention
● We can use larger batch sizes
● We now hit expected issues, such as max message size errors, and poor latency profiles (we have to
wait for batches to fill up!). We choose a batch size that sits comfortably below the point we hit our
known bottlenecks
Does this mean we can get rid of our lock? That way our perf would be even better?
● Not wise, we still want our safety net (see warnings from last slide!)
● But maybe there’s a better way to do it…
20. Effective concurrency control
Sometimes the best lock, is no lock
● Our locking mechanism is quite primitive, and crucially, pessimistic
● Pessimistic locks work best when we expect collisions
● Optimistic ‘locking’ doesn’t use a lock so is cheaper in the happy case, but more expensive when we
have a collision
21. Effective concurrency control
Optimistic concurrency control is sometimes implemented using real timestamps. In a distributed system
it's often better to use logical timestamps
● Let’s imagine we want to support rounding up to the nearest dollar, here’s our requirements
○ Rounding MUST NOT happen on the hot path
○ Rounding MUST happen strictly after the original posting (i.e. no intermediate postings are allowed)
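A sketch of optimistic concurrency control with a logical timestamp (a per-account version counter). In a real system the check-and-write would be a single conditional `UPDATE` in the database; here it's simulated in-process with hypothetical names:

```python
# Compare-and-set on a version number: no lock is held in the happy
# case; a writer that lost the race simply re-reads and retries.

store = {}  # account_id -> (version, balance)

def read(account_id):
    return store.get(account_id, (0, 0))

def compare_and_set(account_id, expected_version, new_balance):
    version, _ = read(account_id)
    if version != expected_version:
        return False          # lost the race: caller must re-read and retry
    store[account_id] = (version + 1, new_balance)
    return True

def post(account_id, delta):
    while True:               # retry loop; cheap when collisions are rare
        version, balance = read(account_id)
        if compare_and_set(account_id, version, balance + delta):
            return
```

Note the fit with the round-up requirement: a posting that must apply strictly after the original can simply carry the original's version as its expected version, failing (and retrying) if anything intervened.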
22. Scale - bringing it home
Our system is beginning to flow. We’ve got some smart batching
strategies and effective use of partitioning helps reduce transaction
contention in our database
We can now handle hundreds of concurrent users, however we hit a
wall as we try to scale out more: our application is stateless so can
scale horizontally, our database is a single point of contention and can
only scale vertically
● Read replicas - synchronous replicas can serve consistent reads
keeping contention off the master
● NewSQL - emerging databases such as Spanner and
CockroachDB might help (not a silver bullet!)
● Shared nothing architecture
● AZ affinity (but don’t rely on it!)
[Cartoon] "I made it go 10x faster without sacrificing correctness!" - Client: "It needs to go 100x faster"
23. Data locality
While these strategies help, they're only really buying us time, and certainly not getting us the stonks
we need
We take a good hard look at our system; we still seem to be spending most of our time on the network going
to the database. What's more, looking at these interactions, there seems to be a lot of commonality:
we keep asking the same questions and getting the same answers
If only there was a way we could avoid these redundant interactions??
Data locality!
● Bringing the data closer to the process. Or the other way around, but stored procs aren’t cool
anymore :(
● This can mean geographically closer, but can also mean computationally closer, e.g. caching the
result of a costly computation
24. Data locality
Scenario: our contract execution requires the current month’s postings, to check for reward eligibility.
Current state: every time we process a posting we're fetching the same month's worth of data (+1) from the
database
#strategy 1 - read through cache
25. Data locality
#strategy 2 - in memory
The intermediate cache works nicely, however we’re still having to transfer a lot of data over the wire,
and whereas the cache is more performant, it’s still a single point of contention
If only there was a nice way to partition the cached data - wait a minute…
Disclaimer: bootstrapping can be a difficult problem to solve!
26. Data locality
#strategy 3 - stateful services 😲
If we’re already storing our postings in memory, and we’re only ever fetching the most recent posting
from the DB… but isn’t our processor the thing that processed that most recent posting?
27. Data locality
… but this is blasphemy! Our services should be stateLESS ?!
For a long time now stateless services have been the royal road to scalability. Nearly every treatise on scalability declares
statelessness as the best practices approved method for building scalable systems. A stateless architecture is easy to scale
horizontally and only requires simple round-robin load balancing.
What’s not to love? Perhaps the increased latency from the roundtrips to the database. Or maybe the complexity of the caching
layer required to hide database latency problems. Or even the troublesome consistency issues.
- Todd Hoff
But what about the bootstrapping problem? …. Hello Kafka! Remember: The aggregation of a stream of
updates over time yields a table
28. Core banking's lust for correctness rears its head
● We’ve got by so far with a partial (per account) ordering
● We now realize that we have to obey the laws of accounting
○ We need to construct a trial balance at the end of the business day that will feed into the daily balance sheet
● This means we need to calculate a balance for every account - even internal accounts (double
entry)
● Do we need a total ordering after all?
Pro tip: don’t push your consistency
problems onto your clients (we want
holistic consistency)
Correctness: boss fight
29. Watermarks
Scenario: End of business day reconciliation
We need a way to calculate a balance for every account in the bank, including internal accounts
● We can’t calculate balances for internal accounts on the hot path as we’d get massive lock
contention
● If our cut off is 12pm, how do we know when we’ve received all postings before this time?
Watermarks!
A watermark is a monotonically increasing timestamp of the oldest work not yet completed
This means that if a watermark advances past some time T (e.g. 12pm) we are guaranteed that no
more processing will happen for events at or previous to T
● Balances are calculated from the timestamp set by the ledger aka. processing time (not
event time)
● Because our cut off is a fixed point, we’re dealing with fixed windows
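The definition above can be sketched as a tiny tracker (names are illustrative): the watermark is the timestamp of the oldest in-flight work, and it advances as old work drains (monotonically, provided new work arrives with timestamps at or after the current watermark).

```python
# Watermark tracker: once the watermark passes a cutoff T, no more
# processing will happen for events at or before T.

in_flight = {}   # work_id -> processing timestamp (illustrative units)
HIGH = float("inf")

def start(work_id, ts):
    in_flight[work_id] = ts

def complete(work_id):
    in_flight.pop(work_id)

def watermark():
    # The oldest work not yet completed (or +inf when idle).
    return min(in_flight.values(), default=HIGH)
```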
30. Watermarks
#strategy 1 - heuristic watermarks
Perhaps the easiest approach would be just to wait a bit of time after the cut off
● How long do we wait? The client needs this ASAP!
● How do we tell the difference between idleness and a network partition?
● We don’t want to smash the ledger to get the result - let’s use Kafka :)
[if idle] system clock gives us an indication
when the fixed window is closing
[if active] watch out for if we receive
postings with timestamp after T
Health check kafka + processors
Publish to watermark stream (out of band)
for a downstream balance processor to
read
31. Watermarks
#strategy 2 - perfect watermarks
This strategy works great, although given it's a heuristic approach we will inherently always have the
possibility of late data - what could go wrong!
● Since we are using processing timestamps, i.e. we have ingress timestamping, it is possible to have a
perfect watermark approach
[if active] detect if we write a posting past the close
timestamp of a window
[if idle] system clock indicates a window is closing; check for
any open DB transactions
Publish watermark in-band, explicitly to every partition
(shocking, I know). Downstream balances waits for a
watermark from all partitions
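The downstream side of those in-band watermarks can be sketched as follows (a toy with hypothetical names): the balance processor tracks the latest watermark seen per partition, and the effective watermark is the minimum across all partitions, so the window only closes once every partition has advanced past the cutoff.

```python
# Downstream waits for a watermark from all partitions before closing
# the fixed window at the cutoff.

NUM_PARTITIONS = 3
seen = {p: 0 for p in range(NUM_PARTITIONS)}   # latest watermark per partition

def on_watermark(partition, ts):
    seen[partition] = max(seen[partition], ts)  # watermarks only advance

def window_closed(cutoff):
    # The effective watermark is the minimum across partitions.
    return min(seen.values()) >= cutoff
```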
32. Correctness - a final word
A quick note on high availability and disaster recovery
● Since we’ve optimized for batching, and we only hit the DB when we need to, we have
inadvertently optimized our system not to depend on tight DB commit latencies
● For banks, we need zero data loss when it comes to DR. We can achieve this with sync replication!
● Can Kafka do multi region replication like the database can?
○ Probably not synchronously! Tables are streams at rest! We can hydrate a new Kafka cluster from our journal
● Can we have multi region active-active?
33. Scale - a final word
To achieve massive throughput in your microservices architecture you need
to let your data flow
● Commit your offsets fast
○ Avoid service fan out (i.e. to different services)
○ Avoid chaining service calls
○ Avoid sync service calls for writes
○ Don’t await on async request / response
● Consider choreography over orchestration
○ Aim to avoid any sort of shared contention
○ Think carefully about saga state. Consider the Routing Slip pattern
● Don’t try and solve distributed 2PC
● … but beware the saga rollback
● Observability!
○ Tracing can tell you a lot
○ CPU over time can tell you if your services aren’t being utilized (e.g. your
data isn’t flowing)
34. Summary
● Building a system that is both correct, and moves at scale, is hard - but certainly not impossible!
● We can only get some of the way with patterns / anti-patterns
● The rest requires some creativity - architecture is the sum of its parts
○ Synergies between ordering and concurrency control
○ Choose a batching strategy that works best for the whole system
○ Your caching strategy should complement your system, not paper over the cracks
○ The fewer moving parts the better - building something simple is hard!
○ A bad design compounds problems - make a hard choice or pay a high cloud bill
○ If the end result is something the client can’t use or doesn’t want - we have failed
● Building on this, we can see trade offs as relationships. We don’t have to choose one or the other
○ Correctness AND scale
○ High throughput AND low latency
○ Consistency AND availability