Speakers: Ravi Dubey, Senior Manager, Software Engineering, Capital One + Jeff Sharpe, Software Engineer, Capital One
Capital One supports interactions with real-time streaming transactional data using Apache Kafka®. Kafka helps deliver information to internal operations teams and bank tellers, assisting them in assessing risk and protecting customers in a myriad of ways.
Inside the bank, Kafka allows Capital One to build a real-time system that takes advantage of modern data and cloud technologies without exposing customers to unnecessary data breaches or violating privacy regulations. These examples demonstrate how a streaming platform enables Capital One to act on its vision faster and more scalably, helping establish Capital One as an innovator in the banking space.
Join us for this online talk on lessons learned, best practices and technical patterns of Capital One’s deployment of Apache Kafka.
-Find out how Kafka delivers on a 5-second service-level agreement (SLA) for in-branch tellers.
-Learn how to combine and host data in-memory and prevent personally identifiable information (PII) violations of in-flight transactions.
-Understand how Capital One manages Kafka Docker containers using Kubernetes.
Watch the recording: https://videos.confluent.io/watch/6e6ukQNnmASwkf9Gkdhh69?.
Capital One Delivers Risk Insights in Real Time with Stream Processing
1. Confidential
Capital One Delivers Risk Insights in
Real Time with Stream Processing
Jeff Sharpe and Ravi Dubey
Capital One Retail Bank
Confluent Online Talk
May 30, 2018
2. Speakers
Ravi is a senior manager working for Capital One in Virginia. Ravi
has over 25 years of software development and management
experience across a range of products in support of government
and commercial industries. His most recent experience includes
full stack development of web apps, cloud-based enterprise-facing
support applications and a high-throughput, low-latency,
distributed cloud-hosted data processing platform.
Ravi Dubey
Senior Manager, Software Engineering, Capital One
Jeff is a senior software engineer working for Capital One in
Virginia. He’s been an engineer for almost 18 years, with major
projects spanning five different languages. Though he began his
work on kernel drivers and web applications, he’s been repeatedly
drawn into high volume, high throughput data processing
projects.
Jeff Sharpe
Senior Software Engineer, Capital One
3. Housekeeping Items
● This session will last about an hour.
● This session will be recorded.
● You can submit your questions by entering them into the GoToWebinar panel.
● The last 10-15 minutes will consist of Q&A.
● The slides and recording will be available after the talk.
4. Thanks…
• Bobby Calderwood
– @bobbycalderwood
– https://www.confluent.io/blog/author/bobby/
• Keith Gasser
– Keith.Gasser@capitalone.com
5. Real Time Decisioning Platform - Introduction
• Decisioning using ML models and rules, with low-latency
processing
• Streamed, batched, or micro-batched messages
7. RT Decisioning Platform - Introduction
• High Speed Durable Message Bus – Apache Kafka
• Enterprise Data Sources – Streams, Databases, and
Warehouses
• ETL – Apache NiFi, Kafka Connect, Confluent Schema Registry
• Distributed Processing – Apache Flink and others
• Feature Caching – Apache Flink, Redis, Kafka Compacted
Topics
• Prometheus, Grafana – Metrics, Alert Management
• Supplemented with Cloud compute, RDBMS, and Caching
services
• Containerization – Docker and Kubernetes
8. RT Decisioning Platform - Kafka Messaging
• Durable, fast, and clustered Kafka topics act as data streams
regarding decisioning input and decision scoring output
• DataStream window intervals correlate to Kafka Topic
log.retention.ms, typically between 30 and 180+ days
• DataStream objects are aggregated into cached features,
such as average daily balance for a specific account holder
• Ten brokers in total per AWS region, dozens of topics
• Producers include NiFi, Kafka Connect, external Streams
[Diagram: a producer (data source) writes messages with producer-maintained transaction IDs plus payload to a Kafka topic; the IDs can arrive out of order.]
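The aggregation described above, turning out-of-order DataStream events into cached features such as average daily balance, can be sketched in a few lines. This is an illustrative Python sketch, not the platform's Flink code; the event shape and field names are assumptions.

```python
from collections import defaultdict
from statistics import mean

def aggregate_daily_balances(events):
    """events: iterable of (account_id, day, balance), in any arrival order."""
    per_account = defaultdict(dict)
    for account_id, day, balance in events:
        # Keyed by day rather than arrival position, so out-of-order
        # delivery does not change the result.
        per_account[account_id][day] = balance
    return {acct: mean(days.values()) for acct, days in per_account.items()}

events = [
    ("acct-1", "2018-05-02", 120.0),  # arrives before an earlier day
    ("acct-1", "2018-05-01", 100.0),
    ("acct-1", "2018-05-03", 80.0),
]
print(aggregate_daily_balances(events))  # → {'acct-1': 100.0}
```

Because the aggregation keys on the event's own day field, replaying the same window in a different order yields the same cached feature.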
11. Enterprise Compliance: Image Rehydration
• Cloud VM Machine Images require periodic update
• RT Platform stack has 100+ distinct containers – underlying
image rehydration best handled with an abstraction layer
• Simple Blue-Green approaches can work for stateless
components, BUT…
• Network Storage and other Disk Volumes add complexity for
stateful components such as Kafka Brokers
• Kafka Clustering provides fault tolerance and failover during
rehydration, though we needed a solution to manage Kafka
logs mounted on Cloud Storage
Storage mount points broken
during instance recreation
12. Kubernetes
• Kubernetes (k8s) is OSS that manages container lifecycle,
addressing, and networking among other things
• The Scheduler “moves” Pods and their associated storage volumes,
defined in StatefulSets, between VM nodes in a coordinated way,
enabling clean rolling rehydration of Kafka Brokers
• Services allow Kafka Brokers and Kafka Connect to be accessed
by a logical service name by all platform components.
• Software Networking enables single TLS solution between all
components, common DNS, and integrated cloud Load
Balancing
• For external access to Kafka on the RT Platform, we recycle
external DNS mapping IP to common name at configurable
intervals (20 sec)
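The StatefulSet mechanics described above can be sketched in a minimal manifest. This is an illustrative fragment under assumed names, image, and sizes; it is not the RT Platform's actual configuration.

```yaml
# Sketch of a Kafka broker StatefulSet: volumeClaimTemplates keep each
# broker's log volume bound to its Pod identity, so rolling rehydration of
# VM nodes does not break storage mount points.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka            # stable per-Pod DNS: kafka-0.kafka, kafka-1.kafka, ...
  replicas: 3
  selector:
    matchLabels: {app: kafka}
  template:
    metadata:
      labels: {app: kafka}
    spec:
      containers:
      - name: broker
        image: kafka:placeholder   # placeholder image name
        ports:
        - containerPort: 9092
        volumeMounts:
        - name: kafka-logs
          mountPath: /var/lib/kafka
  volumeClaimTemplates:         # one PVC per Pod, following it across node recreation
  - metadata:
      name: kafka-logs
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests: {storage: 500Gi}
```

The `serviceName` gives each broker a stable DNS identity, and the `volumeClaimTemplates` bind one PersistentVolumeClaim per Pod, which is what lets the scheduler recreate a broker on a fresh VM without losing its log volume.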
13. Kafka Considerations – Cluster 1
• RT Platform hosts all containers on instance types with 150 GB RAM,
40 cores, and 10 Gb network performance. Good for most stack
components
– Instance node affinity set so each node hosts at most one Kafka broker and at most one ZK node
– Shared ZooKeeper cluster with other RT Platform components
– In AWS, st1 EBS volume types are optimized for write throughput, a good
fit for Kafka
• Brokers increase demand on instance and platform shared
resources
– Platform Zookeeper state
– Instance OS open files
– Instance RAM
– Instance Network Access
– Instance Storage IO
14. Kafka Considerations – Cluster 1
• Kafka Brokers utilize RAM, including Java heap and page cache,
correlating to the size of topics
• Replication factor of 3 means three times the disk space consumed
Deeper topics = more disk space and more page cache RAM
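The disk-space point can be made concrete with back-of-envelope arithmetic. This is an illustrative helper, not a tool from the talk: a topic stores `replication_factor` full copies of everything retained in its window.

```python
def topic_disk_bytes(msgs_per_sec, avg_msg_bytes, retention_days, replication_factor):
    # Bytes retained in the window, times the number of replicated copies.
    retained = msgs_per_sec * avg_msg_bytes * retention_days * 86_400
    return retained * replication_factor

# e.g. 1,000 msgs/s of 1 KiB retained for 30 days at replication factor 3:
print(round(topic_disk_bytes(1_000, 1_024, 30, 3) / 1e9))  # → 7963 (GB across the cluster)
```

Deepening retention or raising the replication factor scales this linearly, which is why deeper topics also demand more page cache RAM per broker.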
16. Kafka Considerations – Cluster 1
Larger (m4.10xlarge, n1-standard-32, n1-highmem-32)
instance/machine types: faster network speeds, 100+ GB of RAM,
30+ cores; but noisier neighbors competing for RAM and network IO,
and a larger “blast radius”
[Diagram: two large shared instances connected over TLS, each hosting a Kafka broker, a ZooKeeper node, and many co-located containers competing for network and disk IO.]
17. Kafka Considerations – Cluster 2
Smaller instance/machine types (m4.2xlarge, n1-highmem-4,
standard-8) with dedicated ZK, single-broker node affinity, Kafka
Connect, and/or Schema Registry. Tradeoff: lower risk, better
predictability, and simplicity vs. faster networking and high-end CPU
[Diagram: the shared container cluster plus several small dedicated instances, each hosting a single Kafka broker alongside its own ZK, Kafka Connect (KC), and Schema Registry (SR) containers.]
18. Kafka Real-Time Upgrades
• RT Platform supports multiple active tenants, so
uniform downtime during version upgrades is not
usually an option.
• Rolling upgrades potentially pose compatibility risks
between Kafka versions.
19. Kafka Real-Time Upgrades
1- Green Cluster provisioned and Topic Offsets
captured
[Diagram: the producer writes to the existing cluster via the Kafka1Svc service; each topic's current offset is captured.]
20. Kafka Real-Time Upgrades
2- Tooling Backfills new Topics
• Depending on desired window size, tooling may be used to
backfill data for topics on new clusters, respecting time
stamp for consistent retention policy.
• Possible Candidate Process for Mirroring
[Diagram: backfill tooling (possibly mirroring) copies messages from the Kafka1Svc topics into the new cluster's topics while the producer continues writing to Kafka1Svc.]
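The timestamp-respecting backfill in step 2 can be sketched as a filter. This is a simplified Python sketch with assumed message shapes, not the actual tooling: only messages whose timestamps still fall inside the retention window are copied, so the new cluster ends up with the same effective retention as the old one.

```python
import time

def backfill_batch(messages, retention_ms, now_ms=None):
    """messages: iterable of (timestamp_ms, payload); returns those to copy."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    cutoff = now_ms - retention_ms
    # Messages older than the retention cutoff would already have been
    # deleted on the new cluster, so there is no point copying them.
    return [(ts, payload) for ts, payload in messages if ts >= cutoff]

now = 1_000_000
msgs = [(now - 5_000, "old"), (now - 500, "recent")]
print(backfill_batch(msgs, retention_ms=1_000, now_ms=now))  # → [(999500, 'recent')]
```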
21. Kafka Real-Time Upgrades
3- Producer flows set to load second Kafka cluster as
required
• Producers reference newly upgraded Kafka Clusters by
new k8s service name and upgrade to new cluster
independently
[Diagram: the producer switches from Kafka1Svc to Kafka2Svc; during cutover both clusters hold overlapping messages.]
22. Kafka Real-Time Upgrades - Consequences
• Overlaps between steps 2 and 3 are likely to create
duplicates (better than gaps)
• If downstream state is based on the original cluster, or the
original offsets are not preserved, all messages in the
window may need to be replayed to recover
[Diagram: overlapping message ranges on the old and new clusters during cutover.]
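Since the overlap produces duplicates rather than gaps, consumers can deduplicate on the producer-maintained transaction IDs mentioned earlier. A minimal sketch, assuming IDs are unique per logical message:

```python
def dedupe(stream, seen=None):
    """Yield (txn_id, payload) pairs, dropping repeated transaction IDs."""
    seen = set() if seen is None else seen
    for txn_id, payload in stream:
        if txn_id in seen:
            continue  # duplicate from the cutover overlap; drop it
        seen.add(txn_id)
        yield txn_id, payload

stream = [(1, "a"), (2, "b"), (2, "b"), (3, "c")]
print(list(dedupe(stream)))  # → [(1, 'a'), (2, 'b'), (3, 'c')]
```

In a real deployment the `seen` set would need to be bounded (e.g. to the overlap window) rather than growing forever.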
23. Kafka Across Regions
• Regional Clusters
• Why Do This?
– Partitioned Strategy
• Active-Active
• Latency or Partition Routed, Increased
Performance and Efficiency
– Disaster Recovery
• Active-Passive, Active-Active
• Redundantly Constructed and Routed,
Increased Reliability
• Issues
– Syncing Data
– Latency
• Inefficient Operation Across Great
Distance
• Kafka Cluster Replication not
recommended
24. Kafka Across Regions – Data Syncing Options
• Duplicate Common Upstream Sources
• Producer-Driven Replication
• Mirroring
• Mirroring + Consolidation
25. Kafka Across Regions – Data Syncing Options
Common Upstream
• Local Producers use Common Source
• 2 Topics Represent 1 Logical Topic
• Pros
• Fewest Number of Topics
• Consumer behavior minimally impacted
• Cons
• Each Local Producer needs to know about Each Regional
Deployment
26. Kafka Across Regions – Data Syncing Options
Common Upstream
[Diagram: producers in Region A and Region B each ETL-pull from the common upstream source into a regional topic; 2 topics represent 1 logical set of messages, read by local consumers in each region.]
27. Kafka Across Regions – Data Syncing Options
Producer-Driven Replication
• Producers maintain Topic consistency across multiple
regions
• 2 Topics Represent 1 Logical Topic, Clusters
• Pros
• Fewest Number of Topics
• Consumer behavior minimally impacted
• Cons
• Each Producer needs to know about Each Regional
Deployment
• Failure strategy, reliability tracking, SLA, etc. must be
implemented by each Producer, likely using shadow topics
28. Kafka Across Regions – Data Syncing Options
Producer-Driven Replication
[Diagram: each region's producer writes to both regional topics (Topic AB in Region A, Topic BA in Region B), with shadow topics tracking the data routed from the other region; 2 topics represent 1 logical set of messages for local consumers.]
29. Kafka Across Regions – Data Syncing Options
Mirroring
• Tooling Automatically Replicates Topics
• Confluent Replicator (Licensed)
• Mirror Maker, uReplicator (OSS)
• 4 Topics Represent 1 Logical Topic
• Pros
• Producer behavior minimally impacted
• Cons
• Each Consumer needs to know about Each Replicated
Topic
• Complexity: more topics
30. Kafka Across Regions – Data Syncing Options
Mirroring
[Diagram: producers write to Topic A (Region A) and Topic B (Region B); a mirror replicates each topic into the other region as Topic A′ and Topic B′; consumers in each region read both the local and mirrored topics; 4 topics represent 1 logical set of messages.]
31. Kafka Across Regions – Data Syncing Options
Mirroring + Consolidation
• Tooling Automatically Replicates Topics
• Additional Tooling merges Topics for Consumers
• ETL Tooling, NiFi, etc.
• Kafka Connect
• 6 Topics Represent 1 Logical Topic
• Pros
• Producer behavior minimally impacted
• Consumer behavior minimally impacted
• Cons
• Custom tooling must implement failure strategy, reliability
tracking, etc.
• Complexity: lots more topics, flow logic, and associated
resource consumption
32. Kafka Across Regions – Data Syncing Options
Mirroring + Consolidation
[Diagram: producers write to Topic A (Region A) and Topic B (Region B); mirroring replicates them across regions as Topic A′ and Topic B′; ETL tooling in each region consolidates the local and mirrored topics into Topic AB / Topic BA, which consumers read; 6 topics represent 1 logical set of messages.]
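The consolidation step can be sketched as a timestamp-ordered merge of a local topic and its mirrored counterpart. This is an in-memory illustrative sketch (the talk suggests tooling such as NiFi or Kafka Connect for the real flow), assuming each input is already time-ordered:

```python
import heapq

def consolidate(local_msgs, mirrored_msgs):
    """Each input: iterable of (timestamp_ms, payload), already time-ordered.
    Returns one merged, timestamp-ordered stream for consumers."""
    return list(heapq.merge(local_msgs, mirrored_msgs))

local = [(1, "a1"), (3, "a2")]
mirrored = [(2, "b1"), (4, "b2")]
print(consolidate(local, mirrored))  # → [(1, 'a1'), (2, 'b1'), (3, 'a2'), (4, 'b2')]
```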
33. Kafka Across Regions – Data Syncing Options
So What Do We Use?
• Multiple tenant use cases and risk tolerances
• Combination of solutions
– Common Upstream
– Confluent Replication
34. Kafka – Moving Forward
• Exactly Once Semantics/Transactionality
• Hyper Partitioning
• Alternate Backends to Support Indefinite
Retention (S3, etc.)
35. Kafka for Real Time Bank Decisions
Handling Private Information
Real-Time Request and Response
36. Handling PII (not) on Kafka
Goal:
Remove the possibility of exposing PII
38. Encrypted Volume:
Following the Path of Least Resistance
Good
• Highly durable across Kafka
restarts
• Simple disaster recovery
planning
• Follows recommended
Kafka configuration
practices
Not So Good
• Information privacy
regulations require extra
levels of protection
• Durability is based on
additional storage volumes
being managed with the
Kafka service
39. Volatile Storage: Performance & Privacy
[Diagram: the producer tokenizes PII before it reaches the Kafka topic (e.g. Library Card# 8675309 becomes Library Card# TOK:113581321); topic persistence sits on tmpfs storage, with initial state copied from durable storage on startup; consumers read only tokenized data.]
40. Volatile Storage: Strange Trade-offs
Improvements
• Noticeably better
performance
• Data is always “in flight”, so
extra encryption shouldn’t
be needed
• Effectively stateless images
Complications
• Needs scripting to bootstrap
• Topic contents are cleared
on host reboot
• Zookeeper won’t be able to
manage offsets between
reboots
41. Volatile Storage: Why We Aren’t Using It
• We need long-term storage of data and RAM is already a
precious resource.
• Our recovery strategy is built on Kafka as our state storage
mechanism. Losing that state complicates recovery efforts.
• Host disk caching gives us most of the benefit of volatile
storage.
42. Request-Response Pattern
/rəˈkwest rəˈspans ˈpadərn/
noun
1. A pattern of interaction with a remote service where the
local task submits a request for remote work and
expects a response before continuing work.
2. A specialized use of Kafka using dedicated topic pairs
to communicate with a shared service
43. Request Response Basics
1. Initialize Consumer
2. Initialize Producer
3. Prepare data
4. Assign a unique Request ID
5. Put the request on the request topic
(Service does the work and builds a response
tagged with the Request ID)
6. Read the response topic until the Request ID is seen
[Diagram: the application's producer writes to the request topic; the service replies on the response topic, where responses with many different IDs interleave until the matching ID arrives.]
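The six steps above can be simulated end to end with in-memory queues standing in for the topic pair. This is a sketch of the pattern, not real Kafka client code:

```python
import uuid
from collections import deque

request_topic, response_topic = deque(), deque()  # stand-ins for the topic pair

def submit(payload):
    request_id = str(uuid.uuid4())               # step 4: assign a unique ID
    request_topic.append((request_id, payload))  # step 5: put on request topic
    return request_id

def service_poll():
    # The shared service reads one request, does the work, and tags the
    # response with the originating Request ID.
    request_id, payload = request_topic.popleft()
    response_topic.append((request_id, payload.upper()))

def await_response(request_id):
    # Step 6: read the response topic until our Request ID is seen.
    while True:
        rid, result = response_topic.popleft()
        if rid == request_id:
            return result
        # Responses for other callers would be skipped or re-queued here.

rid = submit("hello")   # steps 3-5
service_poll()
print(await_response(rid))  # → HELLO
```

The blocking loop in `await_response` is exactly what makes this, as a later slide points out, really the "background job" pattern in disguise.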
48. The Request-Response Pattern
This is actually the
“Background Job” pattern:
1. Submit Job
2. Get assigned a Job ID
3. Poll the service until the Job ID is
marked as complete
4. Retrieve the results of the job
49. Request Response: Serverless Considerations
• Try to reuse Producers and Consumers
• Explicitly assign Consumer partitions
• Attempt to read from the Consumer before
submitting to the Producer
• Remember to commit offsets before sending
responses
50. Slightly Better: The Real-Time Tap Pattern
[Diagram: the application sends its request over a session-based protocol (REST, gRPC, etc.) to a real-time service; the real-time service reads the request, processes it, and sends the response using precomputed values, while a processing service consuming the input topic delivers that data ahead of time.]
51. Real-Time Tap Pattern
• Real-time request is handled by a session-based
protocol
• Resilient data processing is handled by Kafka
• Failures are reported when they happen, via the
real-time protocol
• Kafka interactions can be optimized by the handler
service, rather than relying on clients
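The division of labor in the tap pattern can be sketched as follows: a Kafka-fed processing loop keeps a precomputed store warm, while the session-based handler answers from that store. Names and data shapes here are illustrative, not from the talk:

```python
from collections import deque

input_topic = deque([("acct-1", 100.0), ("acct-1", 120.0)])  # stand-in for Kafka
precomputed = {}

def processing_service():
    # Kafka side: consume the input topic and maintain precomputed values
    # (here, the latest balance per account).
    while input_topic:
        key, value = input_topic.popleft()
        precomputed[key] = value

def realtime_handler(key):
    # Session-protocol side (standing in for REST/gRPC): never touches Kafka
    # directly; failures surface synchronously, as the slide notes.
    if key not in precomputed:
        raise KeyError(f"no precomputed value for {key}")
    return precomputed[key]

processing_service()
print(realtime_handler("acct-1"))  # → 120.0
```

Keeping all Kafka interaction inside the processing service is what lets the handler optimize consumers and offsets centrally instead of relying on every client to do so.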