Triage Presentation

Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Carregando em…3
×

Confira estes a seguir

1 de 74 Anúncio
Anúncio

Mais Conteúdo rRelacionado

Mais recentes (20)

Anúncio

Triage Presentation

  1. 1. A consumer proxy that solves head-of-line blocking for Kafka consumers 1 Hey everyone, and welcome to our presentation. I’m Aashish and together with Aryan, Jordan, and Michael - our team built Triage, a consumer proxy that solves head-of-line blocking for Kafka consumers.
  2. 2. Overview 1. Microservices & Event-Driven Architecture (EDA) 2. Message Queues & Apache Kafka 3. Head-of-Line Blocking 4. Existing Solutions 5. Introducing Triage 6. Triage Design Challenges 7. Future Work 8. Q&A 2 Here’s a quick overview of what you can expect. First, we’ll address the larger context of microservices and event-driven architecture. From there, we’ll take a look at message queues and focus on Apache Kafka, with a few details on how it works. Next, we’ll examine the problem of head-of-line blocking and its consequences, after which we’ll share our research on some existing solutions. At that point, we’ll present Triage and our approach to solving head-of-line blocking, along with some interesting design challenges we faced. We’ll end with some ideas for future work, and leave some room for a Q&A. We’re excited to show you what we built so let’s get started!
  3. 3. “63% of enterprises have adopted microservice architectures, and it’s only expected to grow in the coming decade.” 3 Microservice architecture has really gained in popularity over the last decade and in 2020, it was estimated that over 63% of enterprises had adopted microservices and were satisfied with the tradeoffs.
  4. 4. 4 Shopping App API Logic DB Orders Microservice API Logic DB Products Microservice API Logic DB Stock Microservice Here’s an example of a microservice architecture for a shopping app. The takeaway here is to notice how the services are isolated into separate pieces. The orders, products, and stock inventory services all have their own logic and data stores, and the shopping app can communicate with all of them.
  5. 5. What do microservices offer? 1. Development work can occur in parallel 2. Scalability becomes easier 3. Polyglot environment 5 Since services can be decoupled in this way, work can be done in parallel which leads to faster development times. Additionally, there’s a benefit in the ability to take individual components and scale them independently. Often, multiple technologies and programming languages are used in these setups, which is known as a polyglot microservice environment. Given the use of these different languages, an important question is:
  6. 6. “How do we successfully achieve intra-system communication?” 6 How do we successfully achieve the required intra-system communication, for the system to function properly? One option is to use a request-response model, which is commonly used on the web.
  7. 7. Request Response 7 Request Response Request Response Imagine a number of interconnected microservices where services can send a request, and wait for responses. The issue is that if a single service in this chain experiences a slowdown, the request lifecycle of any connected service will also be delayed. To overcome this problem, a common choice is to implement an event-driven architecture, or an EDA.
  8. 8. EDAs are centered around events - which are changes in state - or notifications about a change. 8 EDAs are centered around events, which can be thought of as changes in state, or notifications about a change.
  9. 9. In an EDA, services can operate independently without concern for the state of any other service. 9 The key here is that services can operate independently without concern for the state of any other service.
  10. 10. 10 Event-Driven Architecture The service on the left can communicate with all 3 services on the right, independently. This architecture bypasses the problem where a delayed service causes a slowdown throughout the entire system. In order to achieve this decoupling, EDAs can be implemented using message queues.
  11. 11. Message Queue Functionality Queue Producer 11 Consumer Here we have two producers to the left of the message queue. These applications write events to the queue. The consumer, which is to the right, reads these events off of the queue.
  12. 12. Traditional message queues: events are read and then removed. Log-based message queues: events are persisted on a log. 12 In traditional message queues, events are read and then removed from the queue. An alternative approach is to use log-based message queues. Here, all the events are persisted on a log so you don’t lose them once they’re read.
  13. 13. 13 Powered by Among log-based message queues, Kafka is the most popular - over 80% of Fortune 100 companies across industries use it as part of their architecture.
  14. 14. What does Kafka offer? •Scalability •Parallelism •Decoupling 14 Kafka is designed for scalability and parallelism, and it maintains the intended decoupling of an EDA. It’s worth taking a look at what’s unique about Kafka and how it works.
  15. 15. In Kafka, events are called messages. 15 In the context of Kafka, events are called messages and this is how we’ll refer to them.
  16. 16. Topic Kafka Partition 2 Partition 1 16 In this image, messages are grouped using a named identifier - called a topic. Kafka achieves scalability by writing all the messages of a topic to partitions. So in this example, messages in a single topic are written to two different partitions.
  17. 17. 17 Topic 1 Topic 2 Partition 2 Consumer Group A Consumer 1 Consumer 2 Consumer 3 Consumer 4 Producer 1 Producer 2 Partition 1 Kafka Partition 2 Partition 1 Consumer Group B If we add the other pieces of the architecture, it’ll look something like this. Producers, seen on the left, write messages to a topic. Consumers, on the right, are organized into groups with a group ID. If a consumer wants to read messages, it can subscribe to a specific topic; then, individual consumer instances can read messages from a partition.
  18. 18. Want more scalability? Add more partitions. 18 Need more parallelism? Use consumer instances. To achieve more scalability, you could simply increase the number of partitions per topic. Additionally, the use of multiple consumer instances means that the messages can be processed in parallel.
  19. 19. While a consumer instance can consume from more than one partition, a partition can only be consumed by a single consumer instance. 19 It is important to note that while a consumer instance can consume from more than one partition, a partition can only be consumed by a single consumer instance. In other words, 2 different consumer instances can’t consume from the same partition.
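As a concrete illustration of topics, groups, and partition assignment, here is a minimal Go sketch of a consumer joining a consumer group with the confluent-kafka-go library. The broker address, group ID, and topic name are placeholders of our own; the slides do not prescribe a particular client library.

```go
package main

import (
	"fmt"
	"log"

	"github.com/confluentinc/confluent-kafka-go/v2/kafka"
)

func main() {
	// Consumers sharing the same group.id form a consumer group;
	// Kafka assigns each partition of the topic to exactly one member.
	consumer, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers": "localhost:9092", // placeholder broker
		"group.id":          "orders-service", // placeholder group ID
		"auto.offset.reset": "earliest",
	})
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()

	// Subscribe to a specific topic; partition assignment is handled by Kafka.
	if err := consumer.SubscribeTopics([]string{"orders"}, nil); err != nil {
		log.Fatal(err)
	}

	for {
		msg, err := consumer.ReadMessage(-1) // block until a message arrives
		if err != nil {
			log.Printf("consumer error: %v", err)
			continue
		}
		fmt.Printf("partition %d, offset %d: %s\n",
			msg.TopicPartition.Partition, msg.TopicPartition.Offset, string(msg.Value))
	}
}
```

Running a second copy of this program with the same group ID would cause Kafka to split the topic's partitions between the two instances, which is the parallelism the slide describes.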
  20. 20. Kafka commits 20 • Offset: A number that indicates the position of the message in the queue. • A consumer periodically commits offsets back to Kafka to acknowledge the last message it successfully processed. • In case of a crash, Kafka will remember where to resume message delivery from. Kafka uses commits to know which messages have been successfully processed. The way this works is that every message on a Kafka partition has an offset - this is a number that indicates the position of the message in the queue. Think of it like an index in an array. A consumer periodically commits offsets back to Kafka, indicating the last message it successfully processed. If a consumer instance crashes, Kafka will remember where to resume message delivery from.
  21. 21. 21 Offset 48 49 50 51 Producer Consumer Last Committed Offset Kafka Here, once the consumer commits offset #50, Kafka knows that the messages from 48-50 have all been successfully processed. The consumer can continue consuming before it commits the next offset.
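To make the commit flow concrete, here is a minimal sketch of manual offset commits with confluent-kafka-go, assuming automatic commits are disabled; the broker, group, and topic names are placeholders, and the tiny process function stands in for whatever the application does with a message.

```go
package main

import (
	"log"

	"github.com/confluentinc/confluent-kafka-go/v2/kafka"
)

// process stands in for the consumer application's real processing logic.
func process(value []byte) error {
	log.Printf("processing %d bytes", len(value))
	return nil
}

func main() {
	// Disable auto-commit so the application decides exactly which offset
	// gets acknowledged back to Kafka.
	consumer, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers":  "localhost:9092", // placeholder broker
		"group.id":           "orders-service", // placeholder group ID
		"enable.auto.commit": false,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()

	if err := consumer.SubscribeTopics([]string{"orders"}, nil); err != nil {
		log.Fatal(err)
	}

	for {
		msg, err := consumer.ReadMessage(-1)
		if err != nil {
			log.Printf("read error: %v", err)
			continue
		}
		if err := process(msg.Value); err != nil {
			log.Printf("processing failed at offset %v: %v", msg.TopicPartition.Offset, err)
			continue
		}
		// Committing this message tells Kafka that everything up to and
		// including its offset has been processed; after a crash, delivery
		// resumes from the last committed offset.
		if _, err := consumer.CommitMessage(msg); err != nil {
			log.Printf("commit error: %v", err)
		}
	}
}
```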
  22. 22. 1. Producers write messages to a specific topic. 2. Kafka routes these messages to partitions. 3. Consumers subscribe to a specific topic to receive messages and commit offsets. 4. Each partition in a topic can only be consumed by one consumer instance. 22 Recap To recap, producers write messages to a specific topic. Kafka then routes these messages to partitions. Consumers subscribe to a topic to receive messages and commit offsets. Each partition in a topic can only be consumed by one consumer instance.
  23. 23. Overview 23 1. Microservices & Event-Driven Architecture (EDA) 2. Message Queues & Apache Kafka 3. Head-of-Line Blocking 4. Existing Solutions 5. Introducing Triage 6. Triage Design Challenges 7. Future Work 8. Q&A Now that we've shown the larger context, Jordan from our team will explain the problem of head-of-line blocking in message queues.
  24. 24. Head-of-Line Blocking 24 A real-world example of head of line blocking that we are all likely familiar with is when you're at the supermarket and the person in the front of the line is taking a long time to finish paying. Perhaps they're trying to use expired coupons or have multiple fruits each with their own ID or they're trying to pay with bitcoin. It slows down the entire line and everyone behind them has to wait.
  25. 25. Head-of-Line Blocking - Message Queues 25 Processing in Progress Message queues can also suffer from head of line blocking. In this example, there are four messages. The first green message is processed quickly. Animation The orange one though takes longer to process, and crucially, while it’s being processed, all of the other messages have to wait. Animation Once the slow message is processed, the rest of the queue can proceed. Animation
  26. 26. 26 Poison Pills Non-Uniform Consumer Latency There are two major causes of head of line blocking when it comes to message queues. The first is poison pills.
  27. 27. Head-of-Line Blocking - Poison Pills 27 In this example, the circles are regular messages and the skull and crossbones represents a poison pill. A poison pill message is one that the consumer does not know how to handle. For example, if the application developer is expecting an order quantity as an integer but receives one as a string, and has not written error handling for this scenario, the application may crash. This will prevent processing of all of the messages behind the poison pill message in the queue. The first message is consumed quickly. Animate but the poison pill message crashes the consumer application. Animate No further messages can be processed.
  28. 28. Head-of-Line Blocking - Non-Uniform Consumer Latency Orange Service Green Service 28 Processing in Progress The second main cause of head of line blocking is non-uniform consumer latency. Suppose we have a consumer application that calls one of two external services depending on the content of a message… for green messages the application calls the green service and for orange messages the application calls the orange service. The first message is processed normally since the green service is healthy. Trigger Animation Now imagine that the orange service is slower than usual to respond, perhaps due to network issues. Trigger Animation This means that the processing of all the messages in the queue is slowed, even though the green messages have nothing to do with the orange service. The messages are not able to be processed until the orange service completes. Once the orange service finishes, the block is lifted, and the rest of the messages can be processed. Trigger Animation
  29. 29. Solution Requirements Polyglot Data Loss Prevention 29 Open Source Handle Poison Pills Handle Non-Uniform Consumer Latency In determining our desired approach to solving head of line blocking, we decided on five solution requirements. The first two were handling the two main causes of head-of-line blocking. The third requirement was that data loss was prevented. A naive way of handling head-of-line blocking would be to just drop messages that are causing it. This might be appropriate for non-critical scenarios, such as tracking likes on social media where it's not critical that every like is captured. However, for critical situations such as those involving orders, it is crucial that every order is captured, otherwise potential revenue may be lost. We wanted a solution that could prevent data loss. The fourth requirement was that the potential solution could be easily integrated into polyglot microservice environments. Lastly, the fifth requirement was that the potential solution would be open source and easily available to developers.
  30. 30. Overview 30 1. Microservices & Event-Driven Architecture (EDA) 2. Message Queues & Apache Kafka 3. Head-of-Line Blocking 4. Existing Solutions 5. Introducing Triage 6. Triage Design Challenges 7. Future Work 8. Q&A With these solution requirements in mind, we’ll now look at the existing solutions we found that address head-of-line blocking…
  31. 31. Existing Solutions 31 1. Confluent Parallel Consumer 2. DoorDash's Worker Model 3. Uber’s Consumer Proxy - The three solutions we found were Confluent Parallel Consumer, DoorDash’s Worker Model, and Uber’s Consumer Proxy.
  32. 32. Existing Solutions Comparison 32 Polyglot Data Loss Prevention Open Source DoorDash Kafka Workers Uber Consumer Proxy Confluent Parallel Consumer Handles Poison Pill Handle Non-Uniform Consumer Latency - Confluent Parallel Consumer fixes head-of-line blocking caused by both poison pills as well as non-uniform consumer latency. But it doesn’t have a way to store poison pill messages, and since we cannot tolerate data loss, this solution was not viable for our use case. Also, their library is written in Java, meaning developers would have to write their applications in Java as well; this was counter to our goal of finding a solution that worked well in a polyglot environment. - While using Kafka, DoorDash experienced spikes in latency in their consumer applications. Individual slow messages were causing delayed processing for all messages in a given partition - a real-world example of non-uniform consumer latency. To address this, they introduced something they called "Kafka Workers". This solution, however, failed to address poison pills, and with no mechanism to prevent data loss, this solution was insufficient. - Lastly, Uber’s Consumer Proxy solves head-of-line blocking resulting from both poison pills and from non-uniform consumer latency - Poison pills are handled without data loss, and non-uniform consumer latency is addressed by parallel consumption of messages. Uber built Consumer Proxy as its own piece of infrastructure in order to work well in polyglot environments. However, as an in-house solution, it is not available for us or other developers to use.
  33. 33. Overview 1. Microservices & Event-Driven Architecture (EDA) 2. Message Queues & Apache Kafka 3. Head-of-Line Blocking 4. Existing Solutions 5. Introducing Triage 6. Triage Design Challenges 7. Future Work 8. Q&A 33 - Given that none of the existing solutions fit all of our requirements, we decided to build Triage. Next, Aryan will discuss what Triage is, and how it handles both causes of head of line blocking.
  34. 34. What is Triage? 34 Kafka Cluster Consumer Application Thanks, Jordan… Triage acts as a proxy for consumer applications. It ingests messages from the Kafka Cluster and sends them to downstream consumer applications.
  35. 35. Triage Instance Triage at a high level 35 Partition Application Logic DynamoDB Instance Consumer Application Partition Partition Kafka Topic - Here’s a high-level view of a Triage instance in the cloud. - Triage consumes from a single partition, just like any other Kafka consumer. - Triage's functionality consists of the application logic, running in an AWS container, and a DynamoDB instance. - Problematic messages are stored in Dynamo for examination at a later time - This pattern is known as the "dead-letter pattern"
  36. 36. Messages 36 Dead Letter Store Dead-Letter Pattern - In dead-letter patterns, problematic messages (referred to as dead letters) are removed from the consumer application and persisted to an external data store for later processing.
  37. 37. Partition Commit Tracker - Overview 37 Triage Application Logic msg ack/nack ack nack ack ack Consumer Instance Consumer Instance Consumer Instance To manage commits back to Kafka, Triage uses an internal system of acknowledgements with a component we call Commit Tracker. Consumers can send an “ack”, a positive acknowledgement, back to Triage, indicating that a message was successfully processed or a “nack”, a negative acknowledgement, to indicate a poison pill message.
  38. 38. Commit Tracker - Ack/Nack 38 0 1 2 3 Offset: 4 5 6 7 8 9 offset msg acked? 0 1 2 3 4 5 6 7 8 9 false false false false false false false false false false Ack true true true true true Stored Ack Nack true true true true Commit Tracker Using the Commit Tracker, Triage can calculate which offsets to commit back to Kafka. This ensures that the health of the partition is maintained. Let's take a look at how Commit Tracker works, since it's central to the functionality of Triage. Triage first ingests a large batch of messages and stores them in a hashmap. TRIGGER ANIMATION The keys of the hashmap are the message offsets and the values are a custom struct with two fields: the message itself and a boolean, indicating whether it has been acknowledged. As Triage receives "acks" from consumers, we update the commit hash accordingly. TRIGGER ANIMATION
  39. 39. When a message is "nacked", however, we cannot update the commit hash immediately. TRIGGER ANIMATION We must first ensure the message has been successfully written to our dead-letter store, which is a DynamoDB table, and only then do we update the commit hash. TRIGGER ANIMATION Next, the rest of the messages are processed by the consumers, including one, the orange message, that takes a long time to be processed by the consumer. As a result, the faster green messages are processed and acked before the orange one is. TRIGGER ANIMATION
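The slides describe the Commit Tracker as a hashmap keyed by offset whose values hold the message and an acknowledged flag. A minimal Go sketch of that data structure might look like the following; the type and method names are our own illustration, not the actual Triage source.

```go
package tracker

import (
	"fmt"
	"sync"
)

// entry is the value stored per offset: the message itself plus a flag
// recording whether it has been acknowledged (acked, or nacked and stored).
type entry struct {
	msg   []byte
	acked bool
}

// CommitTracker tracks which offsets of a partition have been acknowledged.
type CommitTracker struct {
	mu      sync.Mutex
	entries map[int64]*entry
}

func NewCommitTracker() *CommitTracker {
	return &CommitTracker{entries: make(map[int64]*entry)}
}

// Track records a freshly ingested message under its offset.
func (t *CommitTracker) Track(offset int64, msg []byte) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.entries[offset] = &entry{msg: msg}
}

// Ack marks an offset as successfully processed by a consumer.
func (t *CommitTracker) Ack(offset int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if e, ok := t.entries[offset]; ok {
		e.acked = true
	}
}

// Nack marks an offset as acknowledged only after the message has been
// persisted to the dead-letter store (e.g. DynamoDB), mirroring the rule
// that a nacked message may not be committed until it is safely stored.
func (t *CommitTracker) Nack(offset int64, persist func(msg []byte) error) error {
	t.mu.Lock()
	e, ok := t.entries[offset]
	t.mu.Unlock()
	if !ok {
		return fmt.Errorf("unknown offset %d", offset)
	}
	if err := persist(e.msg); err != nil {
		return err // do not mark acknowledged if the dead-letter write failed
	}
	t.mu.Lock()
	e.acked = true
	t.mu.Unlock()
	return nil
}
```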
  40. 40. 39 0 1 2 3 Offset Committed: Commit 5 Offset: 4 5 6 7 8 9 offset msg acked? 0 1 2 3 4 5 6 7 8 9 false true true true true true 39 Commit Tracker - Commit Calculator true true true true Commit Tracker It's important to note that since we always wait for confirmation from Dynamo before updating the commit hash, at this point, whether a message has been "acked" or "nacked" isn't important - we only want to know that a message has been acknowledged in some way. So, how do we calculate which offset to commit back to Kafka? We want to commit as many offsets as we can, so we need to find the greatest committable offset. Periodically, a component called "Commit Calculator" runs in the background. It checks the commit hash to see the greatest offset with a value of true, for which all lower offsets also have a value of true. TRIGGER ANIMATION Triage can then commit this offset back to Kafka.
  41. 41. TRIGGER ANIMATION Once we receive confirmation from Kafka that the commit was successful, we can then delete all entries up to and including that offset from Commit Tracker, since they're no longer needed. TRIGGER ANIMATION
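Extending the CommitTracker sketch above, the greatest committable offset described on these slides can be computed by scanning contiguously upward from the last committed offset, and entries at or below a confirmed commit can then be pruned. This is again our own illustrative Go, under the assumption that offsets are tracked contiguously per partition.

```go
// GreatestCommittable returns the highest offset such that it and every
// lower tracked offset (starting just after lastCommitted) are acknowledged.
// The boolean is false when nothing new can be committed yet.
func (t *CommitTracker) GreatestCommittable(lastCommitted int64) (int64, bool) {
	t.mu.Lock()
	defer t.mu.Unlock()

	offset := lastCommitted
	for {
		e, ok := t.entries[offset+1]
		if !ok || !e.acked {
			break // gap or unacknowledged message: stop here
		}
		offset++
	}
	return offset, offset > lastCommitted
}

// Prune removes every entry up to and including the offset that Kafka
// has confirmed as committed, since those messages no longer need tracking.
func (t *CommitTracker) Prune(committed int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for off := range t.entries {
		if off <= committed {
			delete(t.entries, off)
		}
	}
}
```

A background committer Goroutine could periodically call GreatestCommittable, commit the returned offset back to Kafka, and call Prune once Kafka confirms the commit, matching the flow the slides describe.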
  42. 42. How Triage Solves Head-of-Line Blocking 40 With this understanding of Commit Tracker and the core functionality of Triage, let's take a look at how we solve Head of Line Blocking due to both Poison Pills and Non-Uniform Consumer Latency
  43. 43. msg ack/nack nack ack DynamoDB Dead Letter Store ack ack How Triage Solves Poison Pills 41 Let's start with Poison Pills - here we can see a consumer application receiving a poison pill message. - Trigger Animation Consumer applications can tell Triage that the message they've received is a poison pill by sending a "nack". - Trigger Animation Triage sends that message to a DynamoDB table, so that it can be handled at a later time. This frees up the consumer to continue processing messages. - Trigger Animation
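As a rough illustration of the dead-letter write described here, the following sketch persists a nacked message to DynamoDB using the AWS SDK for Go v2. The table name, attribute names, and function signature are placeholders we chose for the example, not Triage's actual schema.

```go
package deadletter

import (
	"context"
	"strconv"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

// StoreDeadLetter writes a poison pill message to a DynamoDB table so it
// can be inspected later, instead of blocking the partition.
func StoreDeadLetter(ctx context.Context, topic string, partition int32, offset int64, value []byte) error {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return err
	}
	client := dynamodb.NewFromConfig(cfg)

	_, err = client.PutItem(ctx, &dynamodb.PutItemInput{
		TableName: aws.String("triage-dead-letters"), // placeholder table name
		Item: map[string]types.AttributeValue{
			"topic":     &types.AttributeValueMemberS{Value: topic},
			"partition": &types.AttributeValueMemberN{Value: strconv.FormatInt(int64(partition), 10)},
			"offset":    &types.AttributeValueMemberN{Value: strconv.FormatInt(offset, 10)},
			"message":   &types.AttributeValueMemberB{Value: value},
		},
	})
	return err
}
```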
  44. 44. How Triage Solves Non-Uniform Consumer Latency 42 To address non-uniform consumer latency, Triage enables the parallel consumption of messages from a single partition. Here, we have two instances of a single consumer application that rely on one of two external services based on the contents of a message. For orange messages, the application calls the orange external service; for greens, the green service. - Trigger Animation Here, you can see that because the orange service is slow, the consumer instance at the top is taking an unusually long time to process a message. - Trigger Animation (TALK OVER) Because of the one-to-many pattern enabled by Triage, healthy consumer instances are able to continue consumption, so the queue keeps moving.
  45. 45. Overview 43 1. Microservices & Event-Driven Architecture (EDA) 2. Message Queues & Apache Kafka 3. Head-of-Line Blocking 4. Existing Solutions 5. Introducing Triage 6. Triage Design Challenges 7. Future Work 8. Q&A Now that you know how Triage solves head-of-line blocking, Mike will cover some of the challenges that we faced when building Triage as well as our plans for some improvements we'd like to build out.
  46. 46. Triage Design Challenges • Achieving Parallel Consumption via Concurrency • Polyglot Support • Ease of Deployment 44 Based on our requirements and our intended design for Triage, there were three notable challenges that we'd like to discuss. - Achieving Parallel Consumption via Concurrency, - Polyglot Support, - and Ease of Deployment. For each of these challenges, I'll talk a little about them and discuss our respective solutions. Let's start with parallel consumption via concurrency.
  47. 47. Parallel Consumption Kafka Partition 45 We need a one-to-many relationship between Triage and instances of a consumer application to solve head-of-line blocking caused by non-uniform consumer latency.
  48. 48. Challenge: Achieving Concurrency 46 Our solution was to write the application logic of Triage in Go. Go is designed with concurrency in mind via what are called Goroutines. We can think of Goroutines as non-blocking function loops. Many Goroutines (think thousands of them) can run in the background with very little resource overhead.
  49. 49. Challenge: Parallel Consumption Solution: Go & Goroutines Goroutine C Goroutine B Goroutine A 47 Triage Within Triage, we run a dedicated Goroutine for each downstream consumer instance. These Goroutines pull messages and send them to consumer instances, allowing us to consume from a single partition in parallel.
  50. 50. Concurrency in Triage 48 Concurrency via Go also allowed us to implement Triage as a single application. Each major component of Triage exists as a Goroutine, and these Goroutines in turn use other Goroutines. We achieved communication across these Goroutines using channels.
  51. 51. Challenge: Achieving Concurrency 49 Goroutine 1 Goroutine 2 Channel Channels are strongly-typed, queue-like structures. Goroutines can place messages on the channel for other Goroutines to receive. When messages are received, it's important to know that they are removed from the channel.
  52. 52. Challenge: Achieving Concurrency 50 Goroutine 2 Goroutine 1 Goroutine 3 Goroutine 5 Goroutine 4 Goroutine 6 Channel Because messages are removed, we can have multiple senders and receivers without worrying about unintended data duplication. Animate
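A minimal Go sketch of this property: several sender Goroutines place values on one channel, several receivers take them off, and each value is delivered to exactly one receiver. The counts and names here are purely illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	messages := make(chan int) // strongly typed: this channel only carries ints

	var senders sync.WaitGroup
	// Two senders place values onto the same channel.
	for s := 0; s < 2; s++ {
		senders.Add(1)
		go func(sender int) {
			defer senders.Done()
			for i := 0; i < 3; i++ {
				messages <- sender*10 + i
			}
		}(s)
	}

	var receivers sync.WaitGroup
	// Three receivers pull from the same channel. Receiving removes the
	// value, so no message is ever processed by more than one receiver.
	for r := 0; r < 3; r++ {
		receivers.Add(1)
		go func(receiver int) {
			defer receivers.Done()
			for msg := range messages {
				fmt.Printf("receiver %d got %d\n", receiver, msg)
			}
		}(r)
	}

	senders.Wait()
	close(messages) // closing lets the receivers' range loops finish
	receivers.Wait()
}
```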
  53. 53. Concurrency in Triage 51 Fetcher Commit Tracker Consumer Manager messagesChannel newConsumersChannel Dispatch senderRoutine A senderRoutine B Connection Request Let's take a look at some of the major components of Triage and how we take advantage of concurrency. At a high level, we need a process to continually ingest messages from Kafka - this Goroutine is called Fetcher, in blue. It then needs to pipe these messages via the "messages channel" to a Goroutine called Dispatch and write them to our Commit Tracker in green. While all this is happening, we need another process to listen for incoming connection requests from consumer instances - we call this Goroutine "Consumer Manager". When it receives a request, after authenticating it, Consumer Manager places the network address of the consumer instance onto a "newConsumers" channel. When Dispatch receives a network address via this channel, it creates yet another Goroutine called "senderRoutine" that pulls messages from the messagesChannel. These senderRoutines, as their names imply, send messages to their respective consumer instances.
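To make that wiring concrete, here is a heavily simplified sketch of how a Fetcher, a Consumer Manager, and a Dispatch Goroutine could be connected by channels, with one senderRoutine spawned per consumer instance. The function bodies, addresses, and types are stand-ins of our own, not Triage's real implementation.

```go
package main

import (
	"fmt"
	"time"
)

type message struct {
	offset int64
	value  []byte
}

// fetcher continually ingests messages (here, fabricated ones) and pipes
// them onto the messages channel, like the Fetcher Goroutine in Triage.
func fetcher(messages chan<- message) {
	for off := int64(0); ; off++ {
		messages <- message{offset: off, value: []byte(fmt.Sprintf("msg %d", off))}
		time.Sleep(100 * time.Millisecond)
	}
}

// consumerManager listens for connection requests and forwards each new
// consumer's network address onto the newConsumers channel.
func consumerManager(newConsumers chan<- string) {
	for _, addr := range []string{"10.0.0.1:50051", "10.0.0.2:50051"} { // placeholder addresses
		newConsumers <- addr
	}
}

// dispatch spawns a dedicated senderRoutine for every consumer address it
// receives; each senderRoutine pulls from the shared messages channel.
func dispatch(messages <-chan message, newConsumers <-chan string) {
	for addr := range newConsumers {
		go senderRoutine(addr, messages)
	}
}

func senderRoutine(addr string, messages <-chan message) {
	for msg := range messages {
		// In Triage this would be a gRPC call to the consumer instance.
		fmt.Printf("sending offset %d to consumer at %s\n", msg.offset, addr)
	}
}

func main() {
	messages := make(chan message)
	newConsumers := make(chan string)

	go fetcher(messages)
	go consumerManager(newConsumers)
	go dispatch(messages, newConsumers)

	time.Sleep(2 * time.Second) // let the sketch run briefly
}
```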
  54. 54. 52 Connection Request Dynamo DB Fetcher Commit Tracker Consumer Manager messagesChannel newConsumersChannel Dispatch senderRoutine A senderRoutine B Commit Calculator commitsChannel messages commits acknowledgementsChannel Filter Reaper deadLettersChannel consumerRoutine Triage Application Logic committerRoutine Zooming out a little bit really hammers home the benefits we gain from concurrency. All of the components inside Triage, that you can see on the screen, are Goroutines, many of which rely on other Goroutines. While implementing all of this functionality without Go is certainly possible, Go made it very intuitive for us, cementing it as the correct language for the job.
  55. 55. Triage Design Challenges • Achieving Parallel Consumption via Concurrency • Polyglot Support • Ease of Deployment 53 The next challenge we faced was polyglot support.
  56. 56. Challenge: Polyglot Support Java Consumer Go Consumer NodeJS Consumer 54 Kafka Cluster Triage Instances As you can see on the right side of the diagram, we needed Triage to be able to support consumer applications written in a host of different languages.
  57. 57. Challenge: Polyglot Support • Solution: • Implementation: Service + Thin Client Library • Network Protocol: gRPC 55 Our solution was to implement Triage as a service coupled with a thin client library, in addition to our choice of gRPC as our primary network communication protocol.
  58. 58. Service vs Client Library 56 Before choosing our implementation model, we considered both a pure client library and a pure service approach.
  59. 59. Potential Client Library Implementation Consumer Application 57 Kafka Cluster A potential pure client library implementation would have all the application logic of Triage exist as imported code within the consumer application. This comes with the benefit of not having to introduce new pieces of infrastructure to a user's system and makes testing Triage simpler. But, supporting additional languages would require a complete rewrite of Triage. Maintaining Triage would be pretty difficult, since any change to a system's Kafka version would require updating all versions of Triage. We considered these to be poor tradeoffs. An alternative would be to implement Triage as a service.
  60. 60. Service Implementation 58 Kafka Cluster Triage Service Consumer Application With the pure service approach, Triage would act as a piece of infrastructure that sits between the Kafka cluster and consumer applications. This allows us to avoid the aforementioned cons of a client library implementation, but we still wanted to make connecting to Triage simple for developers.
  61. 61. Challenge: Polyglot Support Solution: Service + Thin Client Library 59 Triage Service Kafka Cluster Consumer Application Triage Client We decided on a hybrid approach. The core application logic of Triage exists on a container running in AWS. Consumer applications use a thin client library to manage communicating with Triage. This lightweight client exists within each instance of a consumer application. It provides convenience methods for sending an initial connection request and exposing an endpoint to receive messages from Triage.
  62. 62. Multi-language Support 60 Kafka Cluster Triage Service Triage Client Triage Client Triage Client While we don't gain the full language agnosticism that a pure service approach might offer, building out multi-language support only requires us to rewrite our simple client library in another language.
  63. 63. 1. Sends an initial HTTP request to Triage to request a connection. 2. Runs a gRPC server to receive messages from Triage. 61 The client library: Ultimately, the client library only 1) sends an initial HTTP request to Triage to request a connection and 2) runs a gRPC server to receive messages. Because it's operationally very simple, rewriting the client library is far more manageable than rewriting Triage in its entirety.
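A rough sketch of what such a thin client could look like in Go is shown below: it sends one HTTP request to register the consumer's address with Triage, then starts a gRPC server on which Triage will deliver messages. The endpoint path, JSON body, and the commented-out generated registration are placeholders of our own, since the actual client library interface isn't shown in the slides.

```go
package triageclient

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net"
	"net/http"

	"google.golang.org/grpc"
)

// Connect sends the initial connection request to Triage, advertising the
// address on which this consumer instance will accept gRPC calls.
func Connect(triageAddr, consumerAddr, authKey string) error {
	body, err := json.Marshal(map[string]string{
		"address": consumerAddr, // placeholder request shape
		"authKey": authKey,
	})
	if err != nil {
		return err
	}
	resp, err := http.Post("http://"+triageAddr+"/connect", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("connection request rejected: %s", resp.Status)
	}
	return nil
}

// Serve starts the gRPC server that receives messages from Triage. The
// generated service registration is omitted here because it depends on the
// gRPC service definition (see the following slides).
func Serve(consumerAddr string) error {
	lis, err := net.Listen("tcp", consumerAddr)
	if err != nil {
		return err
	}
	s := grpc.NewServer()
	// pb.RegisterTriageServer(s, &handler{}) // hypothetical generated registration
	return s.Serve(lis)
}
```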
  64. 64. Machine A Machine B doWork() code execution gRPC - an RPC Framework Machine A doWork() code execution 62 Local Procedure Call Remote Procedure Call To manage the communication between Triage and consumer instances, we chose gRPC as a network protocol, primarily for the ease with which we could build out multi-language support. I think it's helpful to talk a little bit about what gRPC is. gRPC is an RPC framework, created by Google, where RPC stands for remote procedure call. We can think of procedure calls as simple function calls or invocations. With a local procedure call, everything exists on a single host machine. In the figure on the left, the function "doWork" is executed on Machine A resulting in code being executed on Machine A. Remote procedure calls, however, allow us to execute code on a different machine. In the figure on the right, "doWork" is being called on Machine A, resulting in code being executed on Machine B.
  65. 65. gRPC & Triage 63 gRPC Client processMessage(message) Triage Consumer Instance gRPC Server code execution It's helpful to understand that gRPC uses the same client-server model that we're familiar with. With Triage, the Triage container acts as a gRPC client, and calls "processMessage()", with the message as an argument. The consumer instance runs a gRPC server that listens for this procedure call. It then executes code to process the message before sending a response, the "ack" or "nack" we've talked about before, back to Triage.
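Here is a rough, self-contained Go sketch of the consumer-side handler for such a processMessage call. MessageRequest, Ack, and handler are stand-in types we define only for illustration; in practice they would be generated from the gRPC service definition covered on the next slide, and the real Triage types may differ.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// MessageRequest and Ack stand in for the message types that gRPC code
// generation would normally produce from the Triage service definition.
type MessageRequest struct {
	Offset int64
	Value  []byte
}

type Ack struct {
	Acked bool // true = ack, false = nack (poison pill)
}

// handler plays the role of the consumer instance's gRPC server: Triage,
// acting as the gRPC client, remotely invokes ProcessMessage per message.
type handler struct{}

func (h *handler) ProcessMessage(ctx context.Context, req *MessageRequest) (*Ack, error) {
	// Toy processing rule: treat any message containing "bad" as a poison pill.
	if strings.Contains(string(req.Value), "bad") {
		return &Ack{Acked: false}, nil // nack: Triage will dead-letter it
	}
	fmt.Printf("processed offset %d\n", req.Offset)
	return &Ack{Acked: true}, nil // ack: counts toward the committable offset
}

func main() {
	h := &handler{}
	ack, _ := h.ProcessMessage(context.Background(), &MessageRequest{Offset: 50, Value: []byte("order created")})
	fmt.Println("acked:", ack.Acked)
}
```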
  66. 66. 64 Code Generation gRPC Server gRPC Server gRPC Server • function name • Parameters • Return value gRPC Service Definition The biggest reason we decided on gRPC is its code generation feature. Using what's called a gRPC service definition, client and server implementations can be automatically generated in all major programming languages. Creating a gRPC service definition is pretty straightforward. You simply define a function interface - that is, what is the name of the function, what parameters does it have, and what does it return. Because the most complicated part of building the Triage client library is handled for us via this code generation, we can write support for other languages with relative ease.
  67. 67. Triage Design Challenges • Achieving Parallel Consumption via Concurrency • Polyglot Support • Ease of Deployment 65 The final challenge we faced was making Triage easy to deploy for application developers.
  68. 68. Challenge: Ease of Deployment • Solution: • AWS CDK • Triage CLI 66 Our solution was to create an automated deployment script using AWS's Cloud Development Kit. Developers can use our command line tool, Triage CLI, to easily deploy Triage to AWS using this CDK script.
  69. 69. Challenge: Ease of Deployment 67 Kafka Topic ECS Partition Partition Partition Solution: AWS CDK Because Triage operates on a per-partition basis, we needed to deploy a container running Triage for each partition in a given Kafka topic. To do this, we used Elastic Container Service, specifically with Fargate as our deployment vehicle. With ECS, we can define a minimum number of Triage containers running at any given time - were one to crash, for some reason, another would be provisioned to replace it automatically. Using Fargate means management of individual compute resources is abstracted away for our users and allows them to only think about containers. The key for us was that by using CDK, we could write a reusable script to deploy Triage containers via Fargate. That being said, we still needed to answer the question of how to interpolate user-specific information, such as Kafka authentication credentials, into Triage during deployment.
  70. 70. Challenge: Ease of Deployment Solution: Triage CLI • triage init • triage deploy • Triage network address • Authentication Key 68 To do so, we created a command line tool called Triage CLI. It can be downloaded as an NPM package and features a 2-step deployment process. triage init installs any necessary dependencies for deployment and generates a configuration file where developers can supply authentication and Kafka-specific information. triage deploy interpolates the data in this configuration file into the CDK script. It also creates an internal config file used by individual Triage containers. It then deploys these containers to AWS. Finally, it returns the network address and authentication key needed for consuming applications to connect to Triage. Using Triage CLI, a developer can leverage our CDK Script to deploy Triage to the cloud.
  71. 71. Triage Design Challenges • Achieving Parallel Consumption via Concurrency • Polyglot Support • Ease of Deployment 69 Having solved these major challenges, we were able to build Triage without compromising on any of our design requirements. For a more in-depth exploration of how Triage works and implementation details, check out our write up, linked in the Zoom meeting description.
  72. 72. Overview 70 1. Microservices & Event-Driven Architecture (EDA) 2. Message Queues & Apache Kafka 3. Head-of-Line Blocking 4. Existing Solutions 5. Introducing Triage 6. Triage Design Challenges 7. Future Work 8. Q&A Before we open up for questions, we'd like to cover some features we'd like to add.
  73. 73. Future Work 1. Extend client library language support 2. Cause of Failure for Dead Letter Table 3. Dead Letter Notifications 71 We'd first like to build out additional language support for our thin client library. As we've discussed, doing so shouldn't be difficult, since the majority of the work is done for us via gRPC code generation. Supporting other popular languages like JavaScript or Ruby would help us serve more developers. We'd also like to add a cause of failure column to our table that stores dead-letter messages - it would contain failure reasons that developers could supply when sending a "nack" back to Triage for poison pills. This would aid in analyzing and remedying faulty messages. Finally, we'd like to add a simple notification system that could alert developers when poison pills are stored in the dead-letter table, allowing for rapid response. We think this is perhaps the easiest to implement and is likely our next step.
  74. 74. 72 Questions? github.com/Team-Triage Aashish Balaji Jordan Swartz Michael Jung Aryan Binazir Toronto, Canada San Diego, CA Los Angeles, CA Chapel Hill, NC With that, I'd like to thank you all for joining us this afternoon and we'll open the floor for questions!
