Triage Presentation

Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Carregando em…3
×

Confira estes a seguir

1 de 74 Anúncio
Anúncio

Mais Conteúdo rRelacionado

Mais recentes (20)

Anúncio

Triage Presentation

  1. 1. A consumer proxy that solves head-of-line blocking for Kafka consumers 1 Hey everyone, and welcome to our presentation. I’m Aashish and together with Aryan, Jordan, and Michael - our team built Triage, a consumer proxy that solves head-of-line blocking for Kafka consumers.
  2. 2. Overview 1. Microservices & Event-Driven Architecture (EDA) 2. Message Queues & Apache Kafka 3. Head-of-Line Blocking 4. Existing Solutions 5. Introducing Triage 6. Triage Design Challenges 7. Future Work 8. Q&A 2 Here’s a quick overview of what you can expect. First, we’ll address the larger context of microservices and event-driven architecture. From there, we’ll take a look at message queues and focus on Apache Kafka, with a few details on how it works. Next, we’ll examine the problem of head-of-line blocking and its consequences, after which we’ll share our research on some existing solutions. At that point, we’ll present Triage and our approach to solving head-of-line blocking, along with some interesting design challenges we faced. We’ll end with some ideas for future work, and leave some room for a Q&A. We’re excited to show you what we built so let’s get started!
  3. 3. “63% of enterprises have adopted microservice architectures, and it’s only expected to grow in the coming decade.” 3 Microservice architecture has really gained in popularity over the last decade and in 2020, it was estimated that over 63% of enterprises had adopted microservices and were satisfied with the tradeoffs.
  4. 4. 4 Shopping App API Logic DB Orders Microservice API Logic DB Products Microservice API Logic DB Stock Microservice Here’s an example of a microservice architecture for a shopping app. The takeaway here is to notice how the services are isolated into separate pieces. The orders, products, and stock inventory services all have their own logic and data stores, and the shopping app can communicate with all of them.
  5. 5. What do microservices offer? 1. Development work can occur in parallel 2. Scalability becomes easier 3. Polyglot environment 5 Since services can be decoupled in this way, work can be done in parallel which leads to faster development times. Additionally, there’s a benefit in the ability to take individual components and scale them independently. Often, multiple technologies and programming languages are used in these setups, which is known as a polyglot microservice environment. Given the use of these different languages, an important question is:
  6. 6. “How do we successfully achieve intra-system communication?” 6 How do we successfully achieve the required intra-system communication, for the system to function properly? One option is to use a request-response model, which is commonly used on the web.
  7. 7. Request Response 7 Request Response Request Response Imagine a number of interconnected microservices where services can send a request, and wait for responses. The issue is that if a single service in this chain experiences a slowdown, the request lifecycle of any connected service will also be delayed. To overcome this problem, a common choice is to implement an event-driven architecture, or an EDA.
  8. 8. EDAs are centered around events - which are changes in state - or notifications about a change. 8 EDAs are centered around events, which can be thought of as changes in state, or notifications about a change.
  9. 9. In an EDA, services can operate independently without concern for the state of any other service. 9 The key here is that services can operate independently without concern for the state of any other service.
  10. 10. 10 Event-Driven Architecture The service on the left can communicate with all 3 services on the right, independently. This architecture bypasses the problem where a delayed service causes a slowdown throughout the entire system. In order to achieve this decoupling, EDAs can be implemented using message queues.
  11. 11. Message Queue Functionality Queue Producer 11 Consumer Here we have two producers to the left of the message queue. These applications write events to the queue. The consumer, which is to the right, reads these events off of the queue.
  12. 12. Traditional message queues: events are read and then removed. Log-based message queues: events are persisted on a log. 12 In traditional message queues, events are read and then removed from the queue. An alternative approach is to use log-based message queues. Here, all the events are persisted on a log so you don’t lose them once they’re read.
  13. 13. 13 Powered by Among log-based message queues, Kafka is the most popular - over 80% of Fortune 100 companies across industries use it as part of their architecture.
  14. 14. What does Kafka offer? •Scalability •Parallelism •Decoupling 14 Kafka is designed for scalability and parallelism, and it maintains the intended decoupling of an EDA. It’s worth taking a look at what’s unique about Kafka and how it works.
  15. 15. In Kafka, events are called messages. 15 In the context of Kafka, events are called messages and this is how we’ll refer to them.
  16. 16. Topic Kafka Partition 2 Partition 1 16 In this image, messages are grouped using a named identifier - called a topic. Kafka achieves scalability by writing all the messages of a topic to partitions. So in this example, messages in a single topic are written to two different partitions.
  17. 17. 17 Topic 1 Topic 2 Partition 2 Consumer Group A Consumer 1 Consumer 2 Consumer 3 Consumer 4 Producer 1 Producer 2 Partition 1 Kafka Partition 2 Partition 1 Consumer Group B If we add the other pieces of the architecture, it’ll look something like this. Producers, seen on the left, write messages to a topic. Consumers, on the right, are organized into groups with a group ID. If a consumer wants to read messages, it can subscribe to a specific topic; then, individual consumer instances can read messages from a partition.
  18. 18. Want more scalability? Add more partitions. 18 Need more parallelism? Use consumer instances. To achieve more scalability, you could simply increase the number of partitions per topic. Additionally, the use of multiple consumer instances means that the messages can be processed in parallel.
  19. 19. While a consumer instance can consume from more than one partition, a partition can only be consumed by a single consumer instance. 19 It is important to note that while a consumer instance can consume from more than one partition, a partition can only be consumed by a single consumer instance. In other words, 2 different consumer instances can’t consume from the same partition.
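As a concrete illustration of topics, groups, and partition assignment, here is a minimal Go sketch of a consumer joining a consumer group with the confluent-kafka-go library. The broker address, group ID, and topic name are placeholders of our own; the slides do not prescribe a particular client library.

```go
package main

import (
	"fmt"
	"log"

	"github.com/confluentinc/confluent-kafka-go/v2/kafka"
)

func main() {
	// Consumers sharing the same group.id form a consumer group;
	// Kafka assigns each partition of the topic to exactly one member.
	consumer, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers": "localhost:9092", // placeholder broker
		"group.id":          "orders-service", // placeholder group ID
		"auto.offset.reset": "earliest",
	})
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()

	// Subscribe to a specific topic; partition assignment is handled by Kafka.
	if err := consumer.SubscribeTopics([]string{"orders"}, nil); err != nil {
		log.Fatal(err)
	}

	for {
		msg, err := consumer.ReadMessage(-1) // block until a message arrives
		if err != nil {
			log.Printf("consumer error: %v", err)
			continue
		}
		fmt.Printf("partition %d, offset %d: %s\n",
			msg.TopicPartition.Partition, msg.TopicPartition.Offset, string(msg.Value))
	}
}
```

Running a second copy of this program with the same group ID would cause Kafka to split the topic's partitions between the two instances, which is the parallelism the slide describes.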
  20. 20. Kafka commits 20 • Offset: A number that indicates the position of the message in the queue. • A consumer periodically commits offsets back to Kafka to acknowledge the last message it successfully processed. • In case of a crash, Kafka will remember where to resume message delivery from. Kafka uses commits to know which messages have been successfully processed. The way this works is that every message on a Kafka partition has an offset - this is a number that indicates the position of the message in the queue. Think of it like an index in an array. A consumer periodically commits offsets back to Kafka, indicating the last message it successfully processed. If a consumer instance crashes, Kafka will remember where to resume message delivery from.
  21. 21. 21 Offset 48 49 50 51 Producer Consumer Last Committed Offset Kafka Here, once the consumer commits offset #50, Kafka knows that the messages from 48-50 have all been successfully processed. The consumer can continue consuming before it commits the next offset.
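To make the commit flow concrete, here is a minimal sketch of manual offset commits with confluent-kafka-go, assuming automatic commits are disabled; the broker, group, and topic names are placeholders, and the tiny process function stands in for whatever the application does with a message.

```go
package main

import (
	"log"

	"github.com/confluentinc/confluent-kafka-go/v2/kafka"
)

// process stands in for the consumer application's real processing logic.
func process(value []byte) error {
	log.Printf("processing %d bytes", len(value))
	return nil
}

func main() {
	// Disable auto-commit so the application decides exactly which offset
	// gets acknowledged back to Kafka.
	consumer, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers":  "localhost:9092", // placeholder broker
		"group.id":           "orders-service", // placeholder group ID
		"enable.auto.commit": false,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()

	if err := consumer.SubscribeTopics([]string{"orders"}, nil); err != nil {
		log.Fatal(err)
	}

	for {
		msg, err := consumer.ReadMessage(-1)
		if err != nil {
			log.Printf("read error: %v", err)
			continue
		}
		if err := process(msg.Value); err != nil {
			log.Printf("processing failed at offset %v: %v", msg.TopicPartition.Offset, err)
			continue
		}
		// Committing this message tells Kafka that everything up to and
		// including its offset has been processed; after a crash, delivery
		// resumes from the last committed offset.
		if _, err := consumer.CommitMessage(msg); err != nil {
			log.Printf("commit error: %v", err)
		}
	}
}
```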
  22. 22. 1. Producers write messages to a specific topic. 2. Kafka routes these messages to partitions. 3. Consumers subscribe to a specific topic to receive messages and commit offsets. 4. Each partition in a topic can only be consumed by one consumer instance. 22 Recap To recap, producers write messages to a specific topic. Kafka then routes these messages to partitions. Consumers subscribe to a topic to receive messages and commit offsets. Each partition in a topic can only be consumed by one consumer instance.
  23. 23. Overview 23 1. Microservices & Event-Driven Architecture (EDA) 2. Message Queues & Apache Kafka 3. Head-of-Line Blocking 4. Existing Solutions 5. Introducing Triage 6. Triage Design Challenges 7. Future Work 8. Q&A Now that we've shown the larger context, Jordan from our team will explain the problem of head-of-line blocking in message queues.
  24. 24. Head-of-Line Blocking 24 A real-world example of head of line blocking that we are all likely familiar with is when you're at the supermarket and the person in the front of the line is taking a long time to finish paying. Perhaps they're trying to use expired coupons or have multiple fruits each with their own ID or they're trying to pay with bitcoin. It slows down the entire line and everyone behind them has to wait.
  25. 25. Head-of-Line Blocking - Message Queues 25 Processing in Progress Message queues can also suffer from head of line blocking. In this example, there are four messages. The first green message is processed quickly. Animation The orange one though takes longer to process, and crucially, while it’s being processed, all of the other messages have to wait. Animation Once the slow message is processed, the rest of the queue can proceed. Animation
  26. 26. 26 Poison Pills Non-Uniform Consumer Latency There are two major causes of head of line blocking when it comes to message queues. The first is poison pills.
  27. 27. Head-of-Line Blocking - Poison Pills 27 In this example, the circles are regular messages and the skull and crossbones represents a poison pill. A poison pill message is one that the consumer does not know how to handle. For example, if the application developer is expecting an order quantity as an integer but receives one as a string, and has not written error handling for this scenario, the application may crash. This will prevent processing of all of the messages behind the poison pill message in the queue. The first message is consumed quickly. Animate but the poison pill message crashes the consumer application. Animate No further messages can be processed.
  28. 28. Head-of-Line Blocking - Non-Uniform Consumer Latency Orange Service Green Service 28 Processing in Progress The second main cause of head of line blocking is non-uniform consumer latency. Suppose we have a consumer application that calls one of two external services depending on the content of a message… for green messages the application calls the green service and for orange messages the application calls the orange service. The first message is processed normally since the green service is healthy. Trigger Animation Now imagine that the orange service is slower than usual to respond, perhaps due to network issues. Trigger Animation This means that the processing of all the messages in the queue is slowed, even though the green messages have nothing to do with the orange service. The messages are not able to be processed until the orange service completes. Once the orange service finishes, the block is lifted, and the rest of the messages can be processed. Trigger Animation
  29. 29. Solution Requirements Polyglot Data Loss Prevention 29 Open Source Handle Poison Pills Handle Non-Uniform Consumer Latency In determining our desired approach to solving head of line blocking, we decided on five solution requirements. The first two were handling the two main causes of head-of-line blocking. The third requirement was that data loss was prevented. A naive way of handling head-of-line blocking would be to just drop messages that are causing it. This might be appropriate for non-critical scenarios, such as tracking likes on social media where it's not critical that every like is captured. However, for critical situations such as those involving orders, it is crucial that every order is captured, otherwise potential revenue may be lost. We wanted a solution that could prevent data loss. The fourth requirement was that the potential solution could be easily integrated into polyglot microservice environments. Lastly, the fifth requirement was that the potential solution would be open source and easily available to developers.
  30. 30. Overview 30 1. Microservices & Event-Driven Architecture (EDA) 2. Message Queues & Apache Kafka 3. Head-of-Line Blocking 4. Existing Solutions 5. Introducing Triage 6. Triage Design Challenges 7. Future Work 8. Q&A With these solution requirements in mind, we’ll now look at the existing solutions we found that address head-of-line blocking…
  31. 31. Existing Solutions 31 1. Confluent Parallel Consumer 2. DoorDash's Worker Model 3. Uber’s Consumer Proxy - The three solutions we found were Confluent Parallel Consumer, DoorDash’s Worker Model, and Uber’s Consumer Proxy.
  32. 32. Existing Solutions Comparison 32 Polyglot Data Loss Prevention Open Source DoorDash Kafka Workers Uber Consumer Proxy Confluent Parallel Consumer Handles Poison Pill Handle Non-Uniform Consumer Latency - Confluent Parallel Consumer fixes head-of-line blocking caused by both poison pills as well as non-uniform consumer latency. But it doesn’t have a way to store poison pill messages, and since we cannot tolerate data loss, this solution was not viable for our use case. Also, their library is written in Java, meaning developers would have to write their applications in Java as well; this was counter to our goal of finding a solution that worked well in a polyglot environment. - While using Kafka, DoorDash experienced spikes in latency in their consumer applications. Individual slow messages were causing delayed processing for all messages in a given partition - a real-world example of non-uniform consumer latency. To address this, they introduced something they called "Kafka Workers". This solution, however, failed to address poison pills, and with no mechanism to prevent data loss, this solution was insufficient. - Lastly, Uber’s Consumer Proxy solves head-of-line blocking resulting from both poison pills and from non-uniform consumer latency - Poison pills are handled without data loss, and non-uniform consumer latency is addressed by parallel consumption of messages. Uber built Consumer Proxy as its own piece of infrastructure in order to work well in polyglot environments. However, as an in-house solution, it is not available for us or other developers to use.
  33. 33. Overview 1. Microservices & Event-Driven Architecture (EDA) 2. Message Queues & Apache Kafka 3. Head-of-Line Blocking 4. Existing Solutions 5. Introducing Triage 6. Triage Design Challenges 7. Future Work 8. Q&A 33 - Given that none of the existing solutions fit all of our requirements, we decided to build Triage. Next, Aryan will discuss what Triage is, and how it handles both causes of head of line blocking.
  34. 34. What is Triage? 34 Kafka Cluster Consumer Application Thanks, Jordan… Triage acts as a proxy for consumer applications. It ingests messages from the Kafka Cluster and sends them to downstream consumer applications.
  35. 35. Triage Instance Triage at a high level 35 Partition Application Logic DynamoDB Instance Consumer Application Partition Partition Kafka Topic - Here’s a high-level view of a Triage instance in the cloud. - Triage consumes from a single partition, just like any other Kafka consumer. - Triage's functionality consists of the application logic, running in an AWS container, and a DynamoDB instance. - Problematic messages are stored in Dynamo for examination at a later time - This pattern is known as the "dead-letter pattern"
  36. 36. Messages 36 Dead Letter Store Dead-Letter Pattern - In dead-letter patterns, problematic messages (referred to as dead letters) are removed from the consumer application and persisted to an external data store for later processing.
  37. 37. Partition Commit Tracker - Overview 37 Triage Application Logic msg ack/nack ack nack ack ack Consumer Instance Consumer Instance Consumer Instance To manage commits back to Kafka, Triage uses an internal system of acknowledgements with a component we call Commit Tracker. Consumers can send an “ack”, a positive acknowledgement, back to Triage, indicating that a message was successfully processed or a “nack”, a negative acknowledgement, to indicate a poison pill message.
  38. 38. Commit Tracker - Ack/Nack 38 0 1 2 3 Offset: 4 5 6 7 8 9 offset msg acked? 0 1 2 3 4 5 6 7 8 9 false false false false false false false false false false Ack true true true true true Stored Ack Nack true true true true Commit Tracker Using the Commit Tracker, Triage can calculate which offsets to commit back to Kafka. This ensures that the health of the partition is maintained. Let's take a look at how Commit Tracker works, since it's central to the functionality of Triage. Triage first ingests a large batch of messages and stores them in a hashmap. TRIGGER ANIMATION The keys of the hashmap are the message offsets and the values are a custom struct with two fields: the message itself and a boolean, indicating whether it has been acknowledged. As Triage receives "acks" from consumers, we update the commit hash accordingly. TRIGGER ANIMATION
  39. 39. When a message is "nacked", however, we cannot update the commit hash immediately. TRIGGER ANIMATION We must first ensure the message has been successfully written to our dead-letter store, which is a DynamoDB table, and only then do we update the commit hash. TRIGGER ANIMATION Next, the rest of the messages are processed by the consumers, including one, the orange message, that takes a long time to be processed by the consumer. As a result, the faster green messages are processed and acked before the orange one is. TRIGGER ANIMATION
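The slides describe the Commit Tracker as a hashmap keyed by offset whose values hold the message and an acknowledged flag. A minimal Go sketch of that data structure might look like the following; the type and method names are our own illustration, not the actual Triage source.

```go
package tracker

import (
	"fmt"
	"sync"
)

// entry is the value stored per offset: the message itself plus a flag
// recording whether it has been acknowledged (acked, or nacked and stored).
type entry struct {
	msg   []byte
	acked bool
}

// CommitTracker tracks which offsets of a partition have been acknowledged.
type CommitTracker struct {
	mu      sync.Mutex
	entries map[int64]*entry
}

func NewCommitTracker() *CommitTracker {
	return &CommitTracker{entries: make(map[int64]*entry)}
}

// Track records a freshly ingested message under its offset.
func (t *CommitTracker) Track(offset int64, msg []byte) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.entries[offset] = &entry{msg: msg}
}

// Ack marks an offset as successfully processed by a consumer.
func (t *CommitTracker) Ack(offset int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if e, ok := t.entries[offset]; ok {
		e.acked = true
	}
}

// Nack marks an offset as acknowledged only after the message has been
// persisted to the dead-letter store (e.g. DynamoDB), mirroring the rule
// that a nacked message may not be committed until it is safely stored.
func (t *CommitTracker) Nack(offset int64, persist func(msg []byte) error) error {
	t.mu.Lock()
	e, ok := t.entries[offset]
	t.mu.Unlock()
	if !ok {
		return fmt.Errorf("unknown offset %d", offset)
	}
	if err := persist(e.msg); err != nil {
		return err // do not mark acknowledged if the dead-letter write failed
	}
	t.mu.Lock()
	e.acked = true
	t.mu.Unlock()
	return nil
}
```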
  40. 40. 39 0 1 2 3 Offset Committed: Commit 5 Offset: 4 5 6 7 8 9 offset msg acked? 0 1 2 3 4 5 6 7 8 9 false true true true true true 39 Commit Tracker - Commit Calculator true true true true Commit Tracker It's important to note that since we always wait for confirmation from Dynamo before updating the commit hash, at this point, whether a message has been "acked" or "nacked" isn't important - we only want to know that a message has been acknowledged in some way. So, how do we calculate which offset to commit back to Kafka? We want to commit as many offsets as we can, so we need to find the greatest committable offset. Periodically, a component called "Commit Calculator" runs in the background. It checks the commit hash to see the greatest offset with a value of true, for which all lower offsets also have a value of true. TRIGGER ANIMATION Triage can then commit this offset back to Kafka.
  41. 41. TRIGGER ANIMATION Once we receive confirmation from Kafka that the commit was successful, we can then delete all entries up to and including that offset from Commit Tracker, since they're no longer needed. TRIGGER ANIMATION
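Extending the CommitTracker sketch above, the greatest committable offset described on these slides can be computed by scanning contiguously upward from the last committed offset, and entries at or below a confirmed commit can then be pruned. This is again our own illustrative Go, under the assumption that offsets are tracked contiguously per partition.

```go
// GreatestCommittable returns the highest offset such that it and every
// lower tracked offset (starting just after lastCommitted) are acknowledged.
// The boolean is false when nothing new can be committed yet.
func (t *CommitTracker) GreatestCommittable(lastCommitted int64) (int64, bool) {
	t.mu.Lock()
	defer t.mu.Unlock()

	offset := lastCommitted
	for {
		e, ok := t.entries[offset+1]
		if !ok || !e.acked {
			break // gap or unacknowledged message: stop here
		}
		offset++
	}
	return offset, offset > lastCommitted
}

// Prune removes every entry up to and including the offset that Kafka
// has confirmed as committed, since those messages no longer need tracking.
func (t *CommitTracker) Prune(committed int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for off := range t.entries {
		if off <= committed {
			delete(t.entries, off)
		}
	}
}
```

A background committer Goroutine could periodically call GreatestCommittable, commit the returned offset back to Kafka, and call Prune once Kafka confirms the commit, matching the flow the slides describe.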
  42. 42. How Triage Solves Head-of-Line Blocking 40 With this understanding of Commit Tracker and the core functionality of Triage, let's take a look at how we solve Head of Line Blocking due to both Poison Pills and Non-Uniform Consumer Latency
  43. 43. msg ack/nack nack ack DynamoDB Dead Letter Store ack ack How Triage Solves Poison Pills 41 Let's start with Poison Pills - here we can see a consumer application receiving a poison pill message. - Trigger Animation Consumer applications can tell Triage that the message they've received is a poison pill by sending a "nack". - Trigger Animation Triage sends that message to a DynamoDB table, so that it can be handled at a later time. This frees up the consumer to continue processing messages. - Trigger Animation
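As a rough illustration of the dead-letter write described here, the following sketch persists a nacked message to DynamoDB using the AWS SDK for Go v2. The table name, attribute names, and function signature are placeholders we chose for the example, not Triage's actual schema.

```go
package deadletter

import (
	"context"
	"strconv"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

// StoreDeadLetter writes a poison pill message to a DynamoDB table so it
// can be inspected later, instead of blocking the partition.
func StoreDeadLetter(ctx context.Context, topic string, partition int32, offset int64, value []byte) error {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return err
	}
	client := dynamodb.NewFromConfig(cfg)

	_, err = client.PutItem(ctx, &dynamodb.PutItemInput{
		TableName: aws.String("triage-dead-letters"), // placeholder table name
		Item: map[string]types.AttributeValue{
			"topic":     &types.AttributeValueMemberS{Value: topic},
			"partition": &types.AttributeValueMemberN{Value: strconv.FormatInt(int64(partition), 10)},
			"offset":    &types.AttributeValueMemberN{Value: strconv.FormatInt(offset, 10)},
			"message":   &types.AttributeValueMemberB{Value: value},
		},
	})
	return err
}
```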
  44. 44. How Triage Solves Non-Uniform Consumer Latency 42 To address non-uniform consumer latency, Triage enables the parallel consumption of messages from a single partition. Here, we have two instances of a single consumer application that rely on one of two external services based on the contents of a message. For orange messages, the application calls the orange external service; for greens, the green service. - Trigger Animation Here, you can see that because the orange service is slow, the consumer instance at the top is taking an unusually long time to process a message. - Trigger Animation (TALK OVER) Because of the one-to-many pattern enabled by Triage, healthy consumer instances are able to continue consumption, so the queue keeps moving.
  45. 45. Overview 43 1. Microservices & Event-Driven Architecture (EDA) 2. Message Queues & Apache Kafka 3. Head-of-Line Blocking 4. Existing Solutions 5. Introducing Triage 6. Triage Design Challenges 7. Future Work 8. Q&A Now that you know how Triage solves head-of-line blocking, Mike will cover some of the challenges that we faced when building Triage as well as our plans for some improvements we'd like to build out.
  46. 46. Triage Design Challenges • Achieving Parallel Consumption via Concurrency • Polyglot Support • Ease of Deployment 44 Based on our requirements and our intended design for Triage, there were three notable challenges that we'd like to discuss. - Achieving Parallel Consumption via Concurrency, - Polyglot Support, - and Ease of Deployment. For each of these challenges, I'll talk a little about them and discuss our respective solutions. Let's start with parallel consumption via concurrency.
  47. 47. Parallel Consumption Kafka Partition 45 We need a one-to-many relationship between Triage and instances of a consumer application to solve head-of-line blocking caused by non-uniform consumer latency.
  48. 48. Challenge: Achieving Concurrency 46 Our solution was to write the application logic of Triage in Go. Go is designed with concurrency in mind via what are called Goroutines. We can think of Goroutines as non-blocking function loops. Many Goroutines (think thousands of them) can run in the background with very little resource overhead.
  49. 49. Challenge: Parallel Consumption Solution: Go & Goroutines Goroutine C Goroutine B Goroutine A 47 Triage Within Triage, we run a dedicated Goroutine for each downstream consumer instance. These Goroutines pull messages and send them to consumer instances, allowing us to consume from a single partition in parallel.
  50. 50. Concurrency in Triage 48 Concurrency via Go also allowed us to implement Triage as a single application. Each major component of Triage exists as a Goroutine, and these Goroutines in turn use other Goroutines. We achieved communication across these Goroutines using channels.
  51. 51. Challenge: Achieving Concurrency 49 Goroutine 1 Goroutine 2 Channel Channels are strongly-typed, queue-like structures. Goroutines can place messages on the channel for other Goroutines to receive. When messages are received, it's important to know that they are removed from the channel.
  52. 52. Challenge: Achieving Concurrency 50 Goroutine 2 Goroutine 1 Goroutine 3 Goroutine 5 Goroutine 4 Goroutine 6 Channel Because messages are removed, we can have multiple senders and receivers without worrying about unintended data duplication. Animate
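A minimal Go sketch of this property: several sender Goroutines place values on one channel, several receivers take them off, and each value is delivered to exactly one receiver. The counts and names here are purely illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	messages := make(chan int) // strongly typed: this channel only carries ints

	var senders sync.WaitGroup
	// Two senders place values onto the same channel.
	for s := 0; s < 2; s++ {
		senders.Add(1)
		go func(sender int) {
			defer senders.Done()
			for i := 0; i < 3; i++ {
				messages <- sender*10 + i
			}
		}(s)
	}

	var receivers sync.WaitGroup
	// Three receivers pull from the same channel. Receiving removes the
	// value, so no message is ever processed by more than one receiver.
	for r := 0; r < 3; r++ {
		receivers.Add(1)
		go func(receiver int) {
			defer receivers.Done()
			for msg := range messages {
				fmt.Printf("receiver %d got %d\n", receiver, msg)
			}
		}(r)
	}

	senders.Wait()
	close(messages) // closing lets the receivers' range loops finish
	receivers.Wait()
}
```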
  53. 53. Concurrency in Triage 51 Fetcher Commit Tracker Consumer Manager messagesChannel newConsumersChannel Dispatch senderRoutine A senderRoutine B Connection Request Let's take a look at some of the major components of Triage and how we take advantage of concurrency. At a high level, we need a process to continually ingest messages from Kafka - this Goroutine is called Fetcher, in blue. It then needs to pipe these messages via the "messages channel" to a Goroutine called Dispatch and write them to our Commit Tracker in green. While all this is happening, we need another process to listen for incoming connection requests from consumer instances - we call this Goroutine "Consumer Manager". When it receives a request, after authenticating it, Consumer Manager places the network address of the consumer instance onto a "newConsumers" channel. When Dispatch receives a network address via this channel, it creates yet another Goroutine called "senderRoutine" that pulls messages from the messagesChannel. These senderRoutines, as their names imply, send messages to their respective consumer instances.
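To make that wiring concrete, here is a heavily simplified sketch of how a Fetcher, a Consumer Manager, and a Dispatch Goroutine could be connected by channels, with one senderRoutine spawned per consumer instance. The function bodies, addresses, and types are stand-ins of our own, not Triage's real implementation.

```go
package main

import (
	"fmt"
	"time"
)

type message struct {
	offset int64
	value  []byte
}

// fetcher continually ingests messages (here, fabricated ones) and pipes
// them onto the messages channel, like the Fetcher Goroutine in Triage.
func fetcher(messages chan<- message) {
	for off := int64(0); ; off++ {
		messages <- message{offset: off, value: []byte(fmt.Sprintf("msg %d", off))}
		time.Sleep(100 * time.Millisecond)
	}
}

// consumerManager listens for connection requests and forwards each new
// consumer's network address onto the newConsumers channel.
func consumerManager(newConsumers chan<- string) {
	for _, addr := range []string{"10.0.0.1:50051", "10.0.0.2:50051"} { // placeholder addresses
		newConsumers <- addr
	}
}

// dispatch spawns a dedicated senderRoutine for every consumer address it
// receives; each senderRoutine pulls from the shared messages channel.
func dispatch(messages <-chan message, newConsumers <-chan string) {
	for addr := range newConsumers {
		go senderRoutine(addr, messages)
	}
}

func senderRoutine(addr string, messages <-chan message) {
	for msg := range messages {
		// In Triage this would be a gRPC call to the consumer instance.
		fmt.Printf("sending offset %d to consumer at %s\n", msg.offset, addr)
	}
}

func main() {
	messages := make(chan message)
	newConsumers := make(chan string)

	go fetcher(messages)
	go consumerManager(newConsumers)
	go dispatch(messages, newConsumers)

	time.Sleep(2 * time.Second) // let the sketch run briefly
}
```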
  54. 54. 52 Connection Request Dynamo DB Fetcher Commit Tracker Consumer Manager messagesChannel newConsumersChannel Dispatch senderRoutine A senderRoutine B Commit Calculator commitsChannel messages commits acknowledgementsChannel Filter Reaper deadLettersChannel consumerRoutine Triage Application Logic committerRoutine Zooming out a little bit really hammers home the benefits we gain from concurrency. All of the components inside Triage, that you can see on the screen, are Goroutines, many of which rely on other Goroutines. While implementing all of this functionality without Go is certainly possible, Go made it very intuitive for us, cementing it as the correct language for the job.
  55. 55. Triage Design Challenges • Achieving Parallel Consumption via Concurrency • Polyglot Support • Ease of Deployment 53 The next challenge we faced was polyglot support.
  56. 56. Challenge: Polyglot Support Java Consumer Go Consumer NodeJS Consumer 54 Kafka Cluster Triage Instances As you can see on the right side of the diagram, we needed Triage to be able to support consumer applications written in a host of different languages.
  57. 57. Challenge: Polyglot Support • Solution: • Implementation: Service + Thin Client Library • Network Protocol: gRPC 55 Our solution was to implement Triage as a service coupled with a thin client library, in addition to our choice of gRPC as our primary network communication protocol.
  58. 58. Service vs Client Library 56 Before choosing our implementation model, we considered both a pure client library and a pure service approach.
  59. 59. Potential Client Library Implementation Consumer Application 57 Kafka Cluster A potential pure client library implementation would have all the application logic of Triage exist as imported code within the consumer application. This comes with the benefit of not having to introduce new pieces of infrastructure to a user's system and makes testing Triage simpler. But, supporting additional languages would require a complete rewrite of Triage. Maintaining Triage would be pretty difficult, since any change to a system's Kafka version would require updating all versions of Triage. We considered these to be poor tradeoffs. An alternative would be to implement Triage as a service.
  60. 60. Service Implementation 58 Kafka Cluster Triage Service Consumer Application With the pure service approach, Triage would act as a piece of infrastructure that sits between the Kafka cluster and consumer applications. This allows us to avoid the aforementioned cons of a client library implementation, but we still wanted to make connecting to Triage simple for developers.
  61. 61. Challenge: Polyglot Support Solution: Service + Thin Client Library 59 Triage Service Kafka Cluster Consumer Application Triage Client We decided on a hybrid approach. The core application logic of Triage exists on a container running in AWS. Consumer applications use a thin client library to manage communicating with Triage. This lightweight client exists within each instance of a consumer application. It provides convenience methods for sending an initial connection request and exposing an endpoint to receive messages from Triage.
  62. 62. Multi-language Support 60 Kafka Cluster Triage Service Triage Client Triage Client Triage Client While we don't gain the full language agnosticism that a pure service approach might offer, building out multi-language support only requires us to rewrite our simple client library in another language.
  63. 63. 1. Sends an initial HTTP request to Triage to request a connection. 2. Runs a gRPC server to receive messages from Triage. 61 The client library: Ultimately, the client library only 1) sends an initial HTTP request to Triage to request a connection and 2) runs a gRPC server to receive messages. Because it's operationally very simple, rewriting the client library is far more manageable than rewriting Triage in its entirety.
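A rough sketch of what such a thin client could look like in Go is shown below: it sends one HTTP request to register the consumer's address with Triage, then starts a gRPC server on which Triage will deliver messages. The endpoint path, JSON body, and the commented-out generated registration are placeholders of our own, since the actual client library interface isn't shown in the slides.

```go
package triageclient

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net"
	"net/http"

	"google.golang.org/grpc"
)

// Connect sends the initial connection request to Triage, advertising the
// address on which this consumer instance will accept gRPC calls.
func Connect(triageAddr, consumerAddr, authKey string) error {
	body, err := json.Marshal(map[string]string{
		"address": consumerAddr, // placeholder request shape
		"authKey": authKey,
	})
	if err != nil {
		return err
	}
	resp, err := http.Post("http://"+triageAddr+"/connect", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("connection request rejected: %s", resp.Status)
	}
	return nil
}

// Serve starts the gRPC server that receives messages from Triage. The
// generated service registration is omitted here because it depends on the
// gRPC service definition (see the following slides).
func Serve(consumerAddr string) error {
	lis, err := net.Listen("tcp", consumerAddr)
	if err != nil {
		return err
	}
	s := grpc.NewServer()
	// pb.RegisterTriageServer(s, &handler{}) // hypothetical generated registration
	return s.Serve(lis)
}
```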
  64. 64. Machine A Machine B doWork() code execution gRPC - an RPC Framework Machine A doWork() code execution 62 Local Procedure Call Remote Procedure Call To manage the communication between Triage and consumer instances, we chose gRPC as a network protocol, primarily for the ease with which we could build out multi-language support. I think it's helpful to talk a little bit about what gRPC is. gRPC is an RPC framework, created by Google, where RPC stands for remote procedure call. We can think of procedure calls as simple function calls or invocations. With a local procedure call, everything exists on a single host machine. In the figure on the left, the function "doWork" is executed on Machine A resulting in code being executed on Machine A. Remote procedure calls, however, allow us to execute code on a different machine. In the figure on the right, "doWork" is being called on Machine A, resulting in code being executed on Machine B.
  65. 65. gRPC & Triage 63 gRPC Client processMessage(message) Triage Consumer Instance gRPC Server code execution It's helpful to understand that gRPC uses the same client-server model that we're familiar with. With Triage, the Triage container acts as a gRPC client, and calls "processMessage()", with the message as an argument. The consumer instance runs a gRPC server that listens for this procedure call. It then executes code to process the message before sending a response, the "ack" or "nack" we've talked about before, back to Triage.
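Here is a rough, self-contained Go sketch of the consumer-side handler for such a processMessage call. MessageRequest, Ack, and handler are stand-in types we define only for illustration; in practice they would be generated from the gRPC service definition covered on the next slide, and the real Triage types may differ.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// MessageRequest and Ack stand in for the message types that gRPC code
// generation would normally produce from the Triage service definition.
type MessageRequest struct {
	Offset int64
	Value  []byte
}

type Ack struct {
	Acked bool // true = ack, false = nack (poison pill)
}

// handler plays the role of the consumer instance's gRPC server: Triage,
// acting as the gRPC client, remotely invokes ProcessMessage per message.
type handler struct{}

func (h *handler) ProcessMessage(ctx context.Context, req *MessageRequest) (*Ack, error) {
	// Toy processing rule: treat any message containing "bad" as a poison pill.
	if strings.Contains(string(req.Value), "bad") {
		return &Ack{Acked: false}, nil // nack: Triage will dead-letter it
	}
	fmt.Printf("processed offset %d\n", req.Offset)
	return &Ack{Acked: true}, nil // ack: counts toward the committable offset
}

func main() {
	h := &handler{}
	ack, _ := h.ProcessMessage(context.Background(), &MessageRequest{Offset: 50, Value: []byte("order created")})
	fmt.Println("acked:", ack.Acked)
}
```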
  66. 66. 64 Code Generation gRPC Server gRPC Server gRPC Server • function name • Parameters • Return value gRPC Service Definition The biggest reason we decided on gRPC is its code generation feature. Using what's called a gRPC service definition, client and server implementations can be automatically generated in all major programming languages. Creating a gRPC service definition is pretty straightforward. You simply define a function interface - that is, what is the name of the function, what parameters does it have, and what does it return. Because the most complicated part of building the Triage client library is handled for us via this code generation, we can write support for other languages with relative ease.
  67. 67. Triage Design Challenges • Achieving Parallel Consumption via Concurrency • Polyglot Support • Ease of Deployment 65 The final challenge we faced was making Triage easy to deploy for application developers.
  68. 68. Challenge: Ease of Deployment • Solution: • AWS CDK • Triage CLI 66 Our solution was to create an automated deployment script using AWS's Cloud Development Kit. Developers can use our command line tool, Triage CLI, to easily deploy Triage to AWS using this CDK script.
  69. 69. Challenge: Ease of Deployment 67 Kafka Topic ECS Partition Partition Partition Solution: AWS CDK Because Triage operates on a per-partition basis, we needed to deploy a container running Triage for each partition in a given Kafka topic. To do this, we used Elastic Container Service, specifically with Fargate as our deployment vehicle. With ECS, we can define a minimum number of Triage containers running at any given time - were one to crash, for some reason, another would be provisioned to replace it automatically. Using Fargate means management of individual compute resources is abstracted away for our users and allows them to only think about containers. The key for us was that by using CDK, we could write a reusable script to deploy Triage containers via Fargate. That being said, we still needed to answer the question of how to interpolate user-specific information, such as Kafka authentication credentials, into Triage during deployment.
  70. 70. Challenge: Ease of Deployment Solution: Triage CLI • triage init • triage deploy • Triage network address • Authentication Key 68 To do so, we created a command line tool called Triage CLI. It can be downloaded as an NPM package and features a 2-step deployment process. triage init installs any necessary dependencies for deployment and generates a configuration file where developers can supply authentication and Kafka-specific information. triage deploy interpolates the data in this configuration file into the CDK script. It also creates an internal config file used by individual Triage containers. It then deploys these containers to AWS. Finally, it returns the network address and authentication key needed for consuming applications to connect to Triage. Using Triage CLI, a developer can leverage our CDK Script to deploy Triage to the cloud.
  71. 71. Triage Design Challenges • Achieving Parallel Consumption via Concurrency • Polyglot Support • Ease of Deployment 69 Having solved these major challenges, we were able to build Triage without compromising on any of our design requirements. For a more in-depth exploration of how Triage works and implementation details, check out our write up, linked in the Zoom meeting description.
  72. 72. Overview 70 1. Microservices & Event-Driven Architecture (EDA) 2. Message Queues & Apache Kafka 3. Head-of-Line Blocking 4. Existing Solutions 5. Introducing Triage 6. Triage Design Challenges 7. Future Work 8. Q&A Before we open up for questions, we'd like to cover some features we'd like to add.
  73. 73. Future Work 1. Extend client library language support 2. Cause of Failure for Dead Letter Table 3. Dead Letter Notifications 71 We'd first like to build out additional language support for our thin client library. As we've discussed, doing so shouldn't be difficult, since the majority of the work is done for us via gRPC code generation. Supporting other popular languages like JavaScript or Ruby would help us serve more developers. We'd also like to add a cause of failure column to our table that stores dead-letter messages - it would contain failure reasons that developers could supply when sending a "nack" back to Triage for poison pills. This would aid in analyzing and remedying faulty messages. Finally, we'd like to add a simple notification system that could alert developers when poison pills are stored in the dead-letter table, allowing for rapid response. We think this is perhaps the easiest to implement and is likely our next step.
  74. 74. 72 Questions? github.com/Team-Triage Aashish Balaji Jordan Swartz Michael Jung Aryan Binazir Toronto, Canada San Diego, CA Los Angeles, CA Chapel Hill, NC With that, I'd like to thank you all for joining us this afternoon and we'll open the floor for questions!
