South Gate Tech was invited to hold a webinar for Cheltenham Geek Nights, a London-based engineering community, and to present a case study on implementing Kafka, Faust and Python, hosted in AWS. The case study describes the solution that South Gate Tech’s engineering team developed and deployed for Tide as part of a customer project delivered by a dedicated team. Presentation by Georgi Tenev - Senior Engineer at South Gate Tech.
South Gate Tech is a software development, outsourcing and staff augmentation service provider, specialising in big data, cloud, digital transformation and software engineering. Go to https://southgate.tech for more case studies and to get in touch about your project needs.
Real time stream processing with Kafka, Python and Faust
1. Real Time Stream Processing
with Kafka & Python
Georgi Tenev
May, 2020
2. About me
● Georgi
● UK + BE + BG; Dev + DevOps
● Nike Running Blue Level - 2000 km milestone by end of 2020 :) Into cars,
motorbikes & mountain bikes.
● Member of the family of a boutique software company based in Sofia - South
Gate Tech.
4. Tide
● Provides bank accounts for businesses, fully online
● Approaching 200,000 members
● Data Driven
5. What’s the pain
● Don’t onboard fraudulent members
● Automate the decision process, where possible
○ faster and less error-prone compared to manual approval
● Decide quickly
○ auto-approve
○ send for manual approval
7. Kafka
● 10,000ft. overview - message broker which durably persists messages,
designed for massive scale
○ consists of a cluster of brokers; a topic is the main entity
● Why so popular
○ scale
○ HA, durability & replication
○ Disk IO optimization
○ lightweight consumers
○ turn your db “inside out”
8. Kafka Topic
● Producers write to topic
● Consumers within a consumer group
read from topic
● Retention, Replication, Durability
9. Partitions
● Topic consists of >= 1 partitions
● Append only write-ahead log
○ OS optimization
● Offset
● Ordering
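The partition ideas above (append-only log, offset, ordering) can be sketched with a toy in-memory model. This is not Kafka client code: the `Topic` class and its methods are hypothetical names used only for illustration, and `crc32` stands in for the hash that a real partitioner would use.

```python
import zlib
from typing import List, Tuple


class Topic:
    """Toy in-memory topic (hypothetical): a fixed number of append-only logs."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions: List[List[Tuple[str, str]]] = [
            [] for _ in range(num_partitions)
        ]

    def partition_for(self, key: str) -> int:
        # A stable hash, so the same key always maps to the same partition.
        # crc32 is a stand-in here, not the hash Kafka itself uses.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key: str, value: str) -> Tuple[int, int]:
        """Append a record to the end of its partition; return (partition, offset)."""
        partition = self.partition_for(key)
        log = self.partitions[partition]
        log.append((key, value))
        # The offset is simply the record's position in the log.
        return partition, len(log) - 1

    def read(self, partition: int, offset: int) -> List[Tuple[str, str]]:
        """Read every record in one partition, starting at the given offset."""
        return self.partitions[partition][offset:]


topic = Topic(num_partitions=3)
p1, o1 = topic.append("member-42", "signed_up")
p2, o2 = topic.append("member-42", "email_verified")
```

Both records share a key, so they land in the same partition with consecutive offsets, and a reader sees them in the order they were written. Ordering is only guaranteed within a partition, not across the whole topic.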
11. Consumer group
● Pub-sub semantics for the whole Consumer group
● Queue semantics for consumers within a consumer group
● Each consumer gets a subset of the partitions
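A minimal sketch of how partitions might be shared out inside a consumer group, assuming a simple round-robin strategy. The function `assign_partitions` and the consumer names are hypothetical; a real Kafka cluster negotiates assignments through its group coordinator and supports several strategies.

```python
from typing import Dict, List


def assign_partitions(partitions: List[int], consumers: List[str]) -> Dict[str, List[int]]:
    """Round-robin sketch: every partition goes to exactly one consumer in the group."""
    assignment: Dict[str, List[int]] = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(sorted(partitions)):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2, 3]

# Queue semantics inside a group: the two consumers split the partitions.
fraud_group = assign_partitions(partitions, ["fraud-1", "fraud-2"])

# Pub-sub semantics across groups: a second group still covers all partitions.
audit_group = assign_partitions(partitions, ["audit-1"])
```

Here `fraud_group` is `{"fraud-1": [0, 2], "fraud-2": [1, 3]}` and `audit_group` is `{"audit-1": [0, 1, 2, 3]}`: each group as a whole receives every message, while within a group each message is handled by one consumer only.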
14. What is Faust
● Asynchronous Stream Processing Python Framework
● Developed by Robinhood
○ “a pioneer of commission-free investing”
○ “build scalable and reliable distributed systems much faster”
18. Faust Agent
● Main processing actor in a Faust App
● A unary async function - receives a stream as its argument
19. Faust Stream
● ~ async python generator
● Abstraction over a kafka topic
● Can apply operations on the stream (e.g. orders.filter(), orders.take(5))
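To make the "agent receives a stream" and "stream is roughly an async generator" ideas concrete, here is a sketch in plain `asyncio`. It deliberately does not use Faust: `orders_stream`, `filter_stream`, `take` and `large_orders_agent` are hypothetical helpers that only mimic the shape of the operations named above.

```python
import asyncio
from typing import AsyncIterator, Callable, List


async def orders_stream() -> AsyncIterator[dict]:
    """Stand-in for a topic: yields events one at a time."""
    for amount in [5, 120, 40, 300, 75]:
        await asyncio.sleep(0)  # hand control back to the loop, as a network read would
        yield {"amount": amount}


async def filter_stream(
    stream: AsyncIterator[dict], predicate: Callable[[dict], bool]
) -> AsyncIterator[dict]:
    """Pass through only the events for which predicate(event) is true."""
    async for event in stream:
        if predicate(event):
            yield event


async def take(stream: AsyncIterator[dict], n: int) -> AsyncIterator[List[dict]]:
    """Group events into lists of up to n items (simple batching)."""
    batch: List[dict] = []
    async for event in stream:
        batch.append(event)
        if len(batch) == n:
            yield batch
            batch = []
    if batch:  # flush the final, possibly shorter, batch
        yield batch


async def large_orders_agent(stream: AsyncIterator[dict]) -> List[List[dict]]:
    """Agent-like unary async function: its only argument is the stream."""
    seen: List[List[dict]] = []
    large = filter_stream(stream, lambda event: event["amount"] >= 50)
    async for batch in take(large, 2):
        seen.append(batch)
    return seen


result = asyncio.run(large_orders_agent(orders_stream()))
```

In this sketch the operations compose lazily: nothing is read until the agent iterates. In a Faust app the stream would be backed by a Kafka topic and would not end, so an agent would process batches as they arrive instead of returning a list.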
20. Faust Record
● DTO, Represents events with fully fledged python class instances
● Serialization & Deserialization
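The record idea can be sketched with a standard-library `dataclass` plus JSON. This is an analogy, not Faust code: the `NewMember` class and its fields are hypothetical, and JSON is assumed as the wire format purely for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class NewMember:
    """Hypothetical event type: typed fields, like a DTO."""

    member_id: str
    name: str
    email: str

    def dumps(self) -> bytes:
        """Serialize the event to JSON bytes, ready to be written to a topic."""
        return json.dumps(asdict(self)).encode("utf-8")

    @classmethod
    def loads(cls, payload: bytes) -> "NewMember":
        """Deserialize JSON bytes back into a fully fledged class instance."""
        return cls(**json.loads(payload.decode("utf-8")))


original = NewMember(member_id="m-1", name="Ada", email="ada@example.com")
restored = NewMember.loads(original.dumps())
```

The round trip gives back an equal instance, so consumers work with attributes such as `restored.email` instead of raw dictionaries or bytes.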
27. Challenges
● Shadowing of different engine versions (~ canary deployment)
● Joining streams
● Using a single kafka topic
28. Interesting resources
● Book - Designing Event-Driven Systems by Confluent
● Article - Benchmarking Apache Kafka: 2 Million Writes Per Second (On Three
Cheap Machines)
● Faust's documentation
● Skeleton Faust project
● AvroSchema documentation
image from https://kafka.apache.org/documentation/
Similar to Python’s built-in `dataclass`
● Receive raw data about a new member
○ name, email, etc.
● Enrich the raw data with additional information that we can collect
● Perform feature extraction
○ raw data “go6ko@fraudster.xyz” -> feature email_domain_risky=yes
● Persist new features into a feature store, keyed by the given member id
● Invoke the core Decision Engine
● Send decision to downstream consumers
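The steps above can be sketched end to end in plain Python. Everything here is an assumption made for illustration: the function names, the in-memory `feature_store`, the `RISKY_EMAIL_DOMAINS` set and the decision rule are hypothetical and are not taken from the actual solution.

```python
from typing import Dict

# Hypothetical reference data and storage, for illustration only.
RISKY_EMAIL_DOMAINS = {"fraudster.xyz"}
feature_store: Dict[str, Dict[str, bool]] = {}


def enrich(raw: Dict[str, str]) -> Dict[str, str]:
    """Add derived information; a real system might also call external sources."""
    domain = raw["email"].rsplit("@", 1)[-1].lower()
    return {**raw, "email_domain": domain}


def extract_features(enriched: Dict[str, str]) -> Dict[str, bool]:
    """Turn enriched data into features the decision step can consume."""
    return {"email_domain_risky": enriched["email_domain"] in RISKY_EMAIL_DOMAINS}


def decide(features: Dict[str, bool]) -> str:
    """Toy decision rule: any risky feature sends the member to manual approval."""
    return "manual_approval" if any(features.values()) else "auto_approve"


def process_new_member(raw: Dict[str, str]) -> str:
    enriched = enrich(raw)
    features = extract_features(enriched)
    feature_store[raw["member_id"]] = features  # persist, keyed by member id
    # The returned decision is what would be sent to downstream consumers.
    return decide(features)


risky = process_new_member(
    {"member_id": "m-1", "name": "Go6ko", "email": "go6ko@fraudster.xyz"}
)
clean = process_new_member(
    {"member_id": "m-2", "name": "Ada", "email": "ada@example.com"}
)
```

With this toy rule, the first member is routed to manual approval and the second is auto-approved. In a streaming setup each function would typically sit behind its own topic, so that stages can be scaled and replaced independently.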