In this talk at Snowplow London Meetup #3 I introduced Tupilak, Snowplow’s unified log fabric. Putting a real-time event pipeline into production has many challenges: we need the pipeline to scale automatically based on event volumes, we need constant monitoring to prevent data loss and minimise end-to-end lag, and we need the ability to upgrade and extend the pipeline with zero downtime. We call software which does all this a “unified log fabric”, to distinguish it from the unified logs (e.g. Kafka and Kinesis) and stream processing frameworks (e.g. Spark Streaming and Kafka Streams) which such a fabric monitors and orchestrates.
As part of incorporating Snowplow’s Kinesis-based event pipeline into our Managed Service, we developed our own unified log fabric, called Tupilak. In the talk I explained Tupilak’s core monitoring and scaling functions and showed live real-time pipelines visualised in the Tupilak UI. I dived into Tupilak’s architecture, shared its basic scaling algorithm and took a look at how Tupilak itself is built on a Snowplow event stream. I also covered the roadmap for Tupilak, including our plans for introducing lag-based auto-scaling and porting Tupilak to Kubernetes.
2. Quick show of hands
• Batch pipeline: how many here run the Snowplow batch pipeline?
• Real-time pipeline: how many here run the Snowplow RT pipeline?
• Orchestration: how are you running, scaling and monitoring the real-time pipeline?
• Anything else: who here is evaluating Snowplow or just curious?
3. From the beginning, Snowplow RT was designed around small, composable workers…
[Diagram from our Feb 2014 Snowplow v0.9.0 release post]
4. … based on the insight that RT pipelines can be composed a little like Unix pipes
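To make the Unix-pipe analogy concrete, here is a minimal, purely illustrative Scala sketch (none of these names or types come from Snowplow): each worker is just a transformation from one stream of events to the next, so a pipeline is stages chained end to end, much like commands chained together on the command line.

  // Illustrative sketch only: stages as functions over streams of events.
  // The stage names and types are hypothetical, not Snowplow code.
  object PipelineSketch {

    type Raw      = String
    type Enriched = String

    // Each "worker" transforms one stream into the next
    val collect: Iterator[Array[Byte]] => Iterator[Raw] =
      _.map(bytes => new String(bytes, "UTF-8"))

    val enrich: Iterator[Raw] => Iterator[Enriched] =
      _.map(raw => raw + "\tenriched")

    val sink: Iterator[Enriched] => Unit =
      _.foreach(println)

    // Composition reads like a Unix pipe: collect | enrich | sink
    val pipeline: Iterator[Array[Byte]] => Unit =
      collect andThen enrich andThen sink
  }

Because each stage only knows about the stream it reads and the stream it writes, stages can be added, removed or swapped independently, which is the property the rest of the talk builds on.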
5. Today, we see a growing number of async micro-services making up Snowplow RT
• Stream Collector
• Stream Enrich
• Kinesis S3
• Kinesis Elasticsearch
• Kinesis Tee (coming soon)
• Redshift dripfeeder (design stage)
• User’s AWS Lambda function
• User’s KCL worker app
• User’s Spark Streaming job
6. But managing this kind of complexity has some major challenges
• “How do we monitor this topology, and alert if something (data loss; event lag) is going wrong?”
• “How do we scale our streams and micro-services to handle event peaks and troughs smoothly?”
• “How do we re-configure or upgrade our micro-services without breaking things?”
7. Snowplow Batch has evolved a deep technical stack to handle these challenges
8. We asked, what should the equivalent underlying fabric be for Snowplow RT?
9. Enter Tupilak!
“A tupilak was an avenging monster fabricated by a shaman by using animal parts (bone, skin, hair, sinew, etc). The creature was given life by ritualistic chants. It was then placed into the sea to seek and destroy a specific enemy.”
10. Today Tupilak serves 3 key functions for the Snowplow RT pipeline (Managed Service)
• Monitoring: visualizing the complex stream + worker topology in one place, and indicating micro-services which are failing or falling behind (“lagging”)
• Auto-scaling: scaling the number of shards in each Kinesis stream, and the number of EC2 instances running each micro-service
• Alerting: notifying our ops team via PagerDuty in the case of a failing or lagging micro-service
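As a rough illustration of the alerting function, here is a hedged Scala sketch of the kind of lag check that could drive a page; the MicroService type, the needsPage helper and the 5-minute threshold are hypothetical, not Tupilak’s actual code or policy.

  // Hypothetical sketch of a failing-or-lagging check feeding an alert
  object AlertingSketch {

    final case class MicroService(name: String, lagMillis: Long, healthy: Boolean)

    // Assumed policy: page if a worker has died, or if it has fallen more than
    // maxLagMillis behind the head of the stream it consumes
    def needsPage(svc: MicroService, maxLagMillis: Long = 5 * 60 * 1000): Boolean =
      !svc.healthy || svc.lagMillis > maxLagMillis

    def main(args: Array[String]): Unit = {
      val services = List(
        MicroService("stream-enrich", lagMillis = 12000, healthy = true),
        MicroService("kinesis-s3", lagMillis = 900000, healthy = true)
      )
      services.filter(s => needsPage(s)).foreach { svc =>
        // In the real system this would raise a PagerDuty incident
        println(s"ALERT: ${svc.name} is failing or lagging")
      }
    }
  }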
11. Let’s look at auto-scaling in particular
• We scale the number of shards in each Kinesis stream up or down based on the read/write throughput we are seeing
• We scale the number of EC2 instances running each micro-service up or down based on some fixed assumptions about the ratio between shards and workers
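A minimal Scala sketch of these two scaling decisions follows; the utilisation thresholds, the doubling/halving policy and the 2-shards-per-instance ratio are assumptions for illustration, not Tupilak’s actual values or code.

  // Sketch of throughput-based shard scaling plus ratio-based instance scaling
  object ScalingSketch {

    final case class StreamMetrics(
      currentShards: Int,
      writeUtilisation: Double, // fraction of the per-shard write limit in use
      readUtilisation: Double   // fraction of the per-shard read limit in use
    )

    // Scale shards on observed read/write throughput against per-shard limits
    def targetShards(m: StreamMetrics): Int = {
      val utilisation = math.max(m.writeUtilisation, m.readUtilisation)
      if (utilisation > 0.75) m.currentShards * 2                             // split shards
      else if (utilisation < 0.25 && m.currentShards > 1) m.currentShards / 2 // merge shards
      else m.currentShards
    }

    // Scale EC2 instances from a fixed assumed ratio of shards per worker instance
    def targetInstances(shards: Int, shardsPerInstance: Int = 2): Int =
      math.max(1, math.ceil(shards.toDouble / shardsPerInstance).toInt)

    def main(args: Array[String]): Unit = {
      val metrics = StreamMetrics(currentShards = 4, writeUtilisation = 0.8, readUtilisation = 0.3)
      val shards  = targetShards(metrics)
      println(s"shards: ${metrics.currentShards} -> $shards, instances: ${targetInstances(shards)}")
    }
  }

The point of the fixed shards-per-instance ratio is that instance scaling falls out of shard scaling for free; the trade-off, as the next slide shows, is that it ignores how far behind the consumers actually are.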
14. What’s next for Tupilak? 1. Better auto-scaling
• We will scale the number of shards in each stream based on the read/write throughput we are seeing, and on the lag of any services consuming this stream or downstream of this stream (performance metrics relative to the stream)
• The number of EC2 instances running each micro-service then scales up or down with the shard count
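A hedged sketch of how the planned lag input could feed the shard target; the 5-minute lag threshold and the doubling policy are assumptions for illustration, not the algorithm we have committed to.

  // Sketch of lag-aware shard scaling (assumed thresholds, not Tupilak's actual algorithm)
  object LagAwareScalingSketch {

    // throughputTarget: the shard count the throughput-based rule (see earlier sketch) would pick;
    // consumerLagsMillis: lag of each service consuming this stream or downstream of it
    def targetShardsWithLag(
      currentShards: Int,
      throughputTarget: Int,
      consumerLagsMillis: Seq[Long]
    ): Int = {
      val worstLag = if (consumerLagsMillis.isEmpty) 0L else consumerLagsMillis.max
      // Assumed policy: if any consumer is more than 5 minutes behind, add capacity
      // even when throughput alone would not call for a scale-up
      if (worstLag > 5 * 60 * 1000) math.max(throughputTarget, currentShards * 2)
      else throughputTarget
    }
  }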
15. 2. Replacing our use of EC2 Auto-Scaling Groups with Docker + Kubernetes