Build an event processing pipeline that scales to millions of events/second with sub-millisecond latencies, all while ingesting multiple streams and remaining resilient in the face of host failures. We present a production-ready reference architecture that ingests from multiple Kafka streams and produces results to downstream Kafka topics. Common use cases include fraud detection, customer 360, and data enrichment.
Build Low Latency, Windowless Event Processing Pipelines with Quine and ScyllaDB
1. Build Low Latency, Windowless Event Processing Pipelines with Quine and ScyllaDB
Matthew Cullum, Director of Engineering, thatDot
sponsored by
2. Matthew
■ Director of Engineering @thatDot, makers of Quine
■ 18+ years experience in enterprise software
■ Industrial automation and distributed systems
■ Senior engineer, senior architect, CTO roles
@mattcullum
@brackishman
3. Presentation Agenda
■ Complex Event Processing
■ Graph Data Structure
■ Design an architecture
■ Quine + ScyllaDB real-world performance
■ Questions
4. Build Low Latency, Windowless Event Processing Pipelines with Quine and ScyllaDB
Our Goal:
■ Fits within existing high-volume streams
■ Scales linearly to meet any enterprise scale with < 1 ms latency
■ Maintains a single stateful graph data structure
■ Performs complex multi-node queries in real time
■ No time windows
6. Complex Event Processing
■ High volume of events in one or more streams of data
■ Kafka, Kinesis, Pulsar, etc.
■ Infer complex relationships between the events
■ Use Cases:
■ Fraud Detection
■ Network Management
■ Advanced Persistent Threat Detection
■ XDR/EDR
■ Monitoring State Change (CDC)
7. Graph Data Structure
■ Nodes with properties, connected by edges
■ Categorical data (rich information that would otherwise be awkwardly encoded or discarded)
■ No costly joins
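To make the "no costly joins" point concrete, here is a minimal sketch of a property graph (all class and method names are hypothetical, not Quine's API): each node stores its properties and adjacency locally, so following an edge is a direct lookup rather than a join over an edge table.

```python
class Node:
    """A node with a label set, a property map, and local adjacency."""
    def __init__(self, node_id, labels=None, properties=None):
        self.id = node_id
        self.labels = set(labels or [])
        self.properties = dict(properties or {})
        self.edges = {}  # edge label -> set of neighbor node ids

class Graph:
    def __init__(self):
        self.nodes = {}

    def add_node(self, node_id, labels=None, properties=None):
        self.nodes[node_id] = Node(node_id, labels, properties)
        return self.nodes[node_id]

    def add_edge(self, src, label, dst):
        self.nodes[src].edges.setdefault(label, set()).add(dst)

    def neighbors(self, node_id, label):
        # One hop = one dictionary lookup per neighbor; no join needed.
        return [self.nodes[n] for n in self.nodes[node_id].edges.get(label, ())]

g = Graph()
g.add_node("a1", ["login"], {"result": "FAILURE"})
g.add_node("a2", ["login"], {"result": "SUCCESS"})
g.add_edge("a1", "NEXT", "a2")
```

In a relational store the same hop would be a join between a node table and an edge table; here the edge list travels with the node.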
8. Scaling Graph for Production Use Cases
■ Current graph databases can't:
■ Process at event streaming scale (1M+ events/second) while…
■ Completing multi-node traversals (complex queries) in < 1 ms
■ What we want to accomplish:
■ Event processing performance: 1M+ events/sec ingest
■ Match a relatively rare (2%) 4-node complex pattern in real time
■ Resilient in the face of infrastructure/network failures
■ Infrastructure cost-effective
9. What-If Architecture
■ Start with a database that already has similar characteristics
■ Design a graph data structure over a key-value store
■ Translate graph queries into per-node queries

MATCH (attempt1:login)-[:NEXT]->(attempt2:login)-[:NEXT]->(attempt3:login)
WHERE attempt1.result = "FAILURE"
  AND attempt2.result = "FAILURE"
  AND attempt3.result = "SUCCESS"

■ Find a node of type login WHERE .result = "FAILURE", follow NEXT edges
■ Look for a neighbor of type login WHERE .result = "FAILURE", follow NEXT edges
■ Look for a neighbor of type login WHERE .result = "SUCCESS"
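The translation from a graph query to per-node queries can be sketched as follows. This is an illustrative model, not Quine's implementation: the key-value store is a plain dict keyed by node id, each value holds the node's type, properties, and outgoing NEXT edges, and the three-login Cypher pattern above becomes three chained point lookups.

```python
# Hypothetical key-value layout: node id -> record with type, properties,
# and outgoing NEXT edge targets. Names and shapes are illustrative.
kv = {
    "n1": {"type": "login", "result": "FAILURE", "NEXT": ["n2"]},
    "n2": {"type": "login", "result": "FAILURE", "NEXT": ["n3"]},
    "n3": {"type": "login", "result": "SUCCESS", "NEXT": []},
}

def match_failed_login_pattern(store, start_id):
    """Per-node translation of FAILURE -[:NEXT]-> FAILURE -[:NEXT]-> SUCCESS.

    Returns the matching (attempt1, attempt2, attempt3) ids, or None.
    """
    a1 = store.get(start_id)
    if not a1 or a1["type"] != "login" or a1["result"] != "FAILURE":
        return None
    for id2 in a1["NEXT"]:                      # follow NEXT edges
        a2 = store[id2]
        if a2["type"] == "login" and a2["result"] == "FAILURE":
            for id3 in a2["NEXT"]:              # follow NEXT edges again
                a3 = store[id3]
                if a3["type"] == "login" and a3["result"] == "SUCCESS":
                    return (start_id, id2, id3)
    return None
```

Each step of the match touches exactly one key, which is what lets the pattern run against a key-value store that already scales and replicates well.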
10. Processing Event Streams
Quine ingests data → builds a graph → persists to pluggable storage → runs
live computation on the graph to compute results → streams them out.
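The five stages above can be sketched as composable generator stages. Stage names, event shapes, and the `flag` predicate are all hypothetical stand-ins, not Quine's API; the point is only the shape of the flow: ingest → build graph → persist → compute → stream out.

```python
def ingest(events):
    """Stage 1: pull events from an upstream source (here, a plain list)."""
    for e in events:
        yield e

def build_graph(events, graph):
    """Stage 2: upsert a node per event into the in-memory graph."""
    for e in events:
        graph[e["id"]] = e
        yield e

def persist(events, store):
    """Stage 3: write-through each touched node to pluggable storage."""
    for e in events:
        store[e["id"]] = dict(e)
        yield e

def compute(events, graph):
    """Stage 4/5: run a live check on each update, emit results downstream.
    The `flag` check stands in for a real graph pattern query."""
    for e in events:
        if e.get("flag"):
            yield {"match": e["id"]}

graph, store = {}, {}
out = list(compute(persist(build_graph(ingest(
    [{"id": 1}, {"id": 2, "flag": True}]), graph), store), graph))
```

Because each stage is a generator, an event flows through all five stages before the next is pulled, which is also a simple model of the backpressure behavior mentioned later in the deck.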
12. Quine Guard Band Test Example
■ A script is used to generate events.
■ Quine and DB host failures are manually triggered.
■ Kafka is pre-loaded with enough events to sustain one million events/second for two hours.
■ GitHub repo available for a reproducible test.
Component | # of hosts | Typical host types
Quine Cluster | 140 | c2-standard-30 (30 vCPUs, 120 GB RAM); JVM max heap set to 12 GB; 1 hot spare
DB Cluster | 66 | n1-highmem-32 (32 vCPU, 208 GB RAM); x 375 GB local SSD each; r1 x 375 GB local SSD each
Kafka | 3 | n2-standard-4 (4 vCPU, 16 GB RAM); 420 partitions
13. 1 Million Events/Second With Failures
#1 Initial peak of 1.25M events/sec
#2 Quine settles into a steady ingest rate > 1M events/sec
#3 Quine recovers nicely after a single node is killed
#4 DB maintenance event exactly 1 hour into the test
#5 Quine has no problem with two-node failure events
#6 Stopped and resumed a Quine host for about 1 minute to inject high latency
#7 Stopped and resumed a persistor host for about 1 minute to inject high latency
#8 Single DB host killed and quickly recovered
#9 DB maintenance event exactly 2 hours into the test
#10 Remaining data consumed from Kafka
14. 21,000 Standing Query Results/Second
MATCH (p0)-[:parent]->(p1)-[:parent]->(p2)-[:parent]->(p3)
WHERE
EXISTS(p0.customer_id) AND
EXISTS(p0.sensor_id) AND
EXISTS(p0.process_id) AND
EXISTS(p0.filename) AND
EXISTS(p0.command_line) AND
EXISTS(p0.user_id) AND
EXISTS(p0.timestamp_unix_utc) AND
EXISTS(p0.sha256)
RETURN
id(p0) AS id
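A standing query like the one above is evaluated incrementally: each incoming event updates one node, and only that node's part of the pattern is re-checked, so no time window is needed. The sketch below illustrates that idea for the 4-node parent chain; every name and structure here is hypothetical (real Quine standing queries are declared in Cypher as shown above), and re-checking descendants when an ancestor changes, plus de-duplication of results, is omitted for brevity.

```python
# Properties the matched root node (p0) must carry, per the query above.
REQUIRED = {"customer_id", "sensor_id", "process_id", "filename",
            "command_line", "user_id", "timestamp_unix_utc", "sha256"}

graph = {}    # node id -> {"props": {...}, "parent": parent id or None}
matches = []  # streamed standing-query results (id(p0) values)

def has_chain(node_id, depth=4):
    """True if node_id starts a parent chain of `depth` nodes."""
    for _ in range(depth - 1):
        node = graph.get(node_id)
        if node is None or node["parent"] is None:
            return False
        node_id = node["parent"]
    return node_id in graph

def on_event(node_id, props, parent=None):
    """Apply one event, then re-check the pattern only from this node."""
    node = graph.setdefault(node_id, {"props": {}, "parent": None})
    node["props"].update(props)
    if parent is not None:
        node["parent"] = parent
        graph.setdefault(parent, {"props": {}, "parent": None})
    if REQUIRED <= node["props"].keys() and has_chain(node_id):
        matches.append(node_id)   # emit id(p0) downstream

# Feed a p0 -> p1 -> p2 -> p3 chain, root properties arriving last.
on_event("p3", {})
on_event("p2", {}, parent="p3")
on_event("p1", {}, parent="p2")
on_event("p0", {k: "x" for k in REQUIRED}, parent="p1")
```

The result is emitted the moment the last piece of the pattern arrives, however far apart in time the four events were, which is what "windowless" means in practice.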
15. 62% Cost Savings!
Component | Original hosts @ $130/hr | New hosts @ $50/hr | % Difference
Quine Cluster | 140 x c2-standard-30 | 120 x n2d-standard-16 | 54%
DB Cluster | 66 x n1-highmem-32 | 40 x n2d-highmem-16 | 70%
18. Complex Event Processing Is No Longer Hard
■ Quine + ScyllaDB scales up linearly
■ Easily achieves 1M events/sec and 21,000 query results/sec
■ Resilient to infrastructure failures, with backpressure
■ Should scale ~linearly to 10M+ events/sec