Building High-Fidelity
Data Streams
Some Historical Context
(Let’s go back to 2017)
Some Historical Context
Circa Dec 2017, I was tasked with building PayPal’s Change Data Capture
(CDC) system
Some Historical Context
Circa Dec 2017, I was tasked with building PayPal’s Change Data Capture
(CDC) system
CDC provides a streaming alternative to repeatedly querying a database!
It’s an ideal replacement for “polling”-style query workloads like the following!
Some Historical Context
select * from transaction where account_id=<x> and txn_date
between <t1> and <t2>
select * from transaction where account_id=<x> and txn_date
between <t1> and <t2>
PayPal Activity Feed
A query like this powers the PayPal user activity feed!
Some Historical Context
select * from transaction where account_id=<x> and txn_date
between <t1> and <t2>
The query above will need to be run very frequently to meet
users’ timeliness expectations!
PayPal Activity Feed
A query like this powers the PayPal user activity feed!
Users expect to see a completed transaction in their feed as
soon as they complete a money transfer, crypto purchase, or
merchant checkout!
Some Historical Context
Some Historical Context
As PayPal traffic grew, queries like this one became more expensive to run and presented a burden on DB capacity!
Some Historical Context
As PayPal traffic grew, queries like this one became more expensive to run and presented a burden on DB capacity!
Some Historical Context
There was a strong interest in offloading DB traffic to something like CDC
However, client teams wanted guarantees around non-functional requirements
(NFRs) similar to what they had with DBs
3 nines (99.9%) Availability
Sub-second data latency (P95)
Scalability
No off-the-shelf solutions were available to meet these requirements!
Reliability - i.e. No data loss!
(…So, we built our own solution!)
The Core Data Highway
(Change Data Capture @ PayPal)
Some Historical Context
If building our own solution, we knew that we needed to make NFRs first-class citizens!
e.g. scalability, availability, reliability/data quality, observability, etc…
Some Historical Context
When outages occur, resolving incidents comes with an added time pressure, since any data lag is very visible and impacts customers!
In streaming systems, any gaps in NFRs can be unforgiving
If building our own solution, we knew that we needed to make NFRs first-class citizens!
e.g. scalability, availability, reliability/data quality, observability, etc…
Some Historical Context
If you don't build your systems with the -ilities as first-class citizens, you pay a steep operational tax
In streaming systems, any gaps in NFRs can be unforgiving
If building our own solution, we knew that we needed to make NFRs first-class citizens!
e.g. scalability, availability, reliability/data quality, observability, etc…
When outages occur, resolving incidents comes with an added time pressure, since any data lag is very visible and impacts customers!
Core Data Highway
• 70+ RAC primaries
• 70k+ tables
• 20-30 PBs of data
Back to the more Recent Past
(Circa 2020)
Moving to the Cloud
I joined Datazoom, a company specializing in video telemetry collection!
But, with a few key differences!
Built on bare-metal in PP DCs
Deep Pockets
Internal Customers
Build for the public cloud
Shoestring (Seed Funded)
External Customers
Datazoom wanted similar streaming delivery guarantees as PayPal!
PayPal Datazoom
Building High Fidelity Data Streams!
Now, that we have some context….
Let's build a high-fidelity streaming system from the ground up!
Start Simple
Start Simple
Goal : Build a system that can deliver messages from source S to destination D
S D
Start Simple
Goal : Build a system that can deliver messages from source S to destination D
S D
But first, let's decouple S and D by putting messaging infrastructure between them
E
S D
Events topic
Start Simple
Make a few more implementation decisions about this system
E
S D
Start Simple
Make a few more implementation decisions about this system
E
S D
Run our system on a cloud platform (e.g. AWS)
Start Simple
Make a few more implementation decisions about this system
Operate at low scale
E
S D
Run our system on a cloud platform (e.g. AWS)
Start Simple
Make a few more implementation decisions about this system
Operate at low scale
Kafka with a single partition
E
S D
Run our system on a cloud platform (e.g. AWS)
Start Simple
Make a few more implementation decisions about this system
Operate at low scale
Kafka with a single partition
Kafka across 3 brokers split across AZs with RF=3 (min in-sync replicas =2)
E
S D
Run our system on a cloud platform (e.g. AWS)
Start Simple
Make a few more implementation decisions about this system
Operate at low scale
Kafka with a single partition
Kafka across 3 brokers split across AZs with RF=3, min in-sync replicas = 2 (see the topic-creation sketch below)
Run S & D on single, separate EC2 Instances
E
S D
Run our system on a cloud platform (e.g. AWS)
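As a concrete illustration, a topic like this could be created with Kafka's AdminClient. This is only a sketch: the topic name "events" and the broker addresses are assumptions, not details from the talk.

// Sketch: create a single-partition topic with RF=3 and min.insync.replicas=2.
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateEventsTopic {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");
    try (AdminClient admin = AdminClient.create(props)) {
      NewTopic events = new NewTopic("events", 1, (short) 3)      // 1 partition, RF=3
          .configs(Map.of("min.insync.replicas", "2"));           // writes survive 1 broker loss
      admin.createTopics(Collections.singleton(events)).all().get();
    }
  }
}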
Start Simple
To make things a bit more interesting, let's provide our stream as a service
We define our system boundary using a blue box as shown below!
È
S D
Reliability
(Is This System Reliable?)
Reliability
Goal : Build a system that can deliver messages reliably from S to D
È
S D
Reliability
Goal : Build a system that can deliver messages reliably from S to D
È
S D
Concrete Goal : 0 message loss
Reliability
Goal : Build a system that can deliver messages reliably from S to D
È
S D
Concrete Goal : 0 message loss
Once S has ACKd a message to a remote sender, D must deliver that message to
a remote receiver
Reliability
How do we build reliability into our system?
È
S D
Reliability
Let's first generalize our system!
`
A B C
m1 m1
Reliability
In order to make this system reliable
`
A B C
m1 m1
Reliability
`
A B C
m1 m1
Treat the messaging system like a chain — it's only as strong as its weakest link
In order to make this system reliable
Reliability
`
A B C
m1 m1
Treat the messaging system like a chain — it's only as strong as its weakest link
Insight : If each process/link is transactional in nature, the chain will be
transactional!
In order to make this system reliable
Reliability
`
A B C
m1 m1
Treat the messaging system like a chain — it's only as strong as its weakest link
In order to make this system reliable
Insight : If each process/link is transactional in nature, the chain will be
transactional!
Transactionality = At least once delivery
Reliability
`
A B C
m1 m1
Treat the messaging system like a chain — it's only as strong as its weakest link
How do we make each link transactional?
In order to make this system reliable
Insight : If each process/link is transactional in nature, the chain will be
transactional!
Transactionality = At least once delivery
Reliability
Let's first break this chain into its component processing links
B̀
m1 m1
`
A
m1 m1
` C m1
m1
Reliability
Let's first break this chain into its component processing links
B̀
m1 m1
`
A
m1 m1
` C m1
m1
A is an ingest node
Reliability
Let's first break this chain into its component processing links
B̀
m1 m1
`
A
m1 m1
` C m1
m1
B is an internal node
Reliability
Let's first break this chain into its component processing links
B̀
m1 m1
`
A
m1 m1
` C m1
m1
C is an expel node
Reliability
But, how do we handle edge nodes A & C?
B̀
m1 m1
`
A
m1 m1
` C m1
m1
What does A need to do?
• Receive a Request (e.g. REST)
• Do some processing
• Reliably send data to Kafka
• kProducer.send(topic, message)
• kProducer.flush()
• Producer Config
• acks = all
• Send HTTP Response to caller (see the producer sketch below)
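To make A's checklist concrete, here is a minimal sketch of a reliable ingest path, assuming String-serialized payloads; the class name, method shape, and status codes are illustrative, not from the talk.

// A minimal sketch of node A: the HTTP response is sent only after Kafka has ACKd the write.
import java.util.Properties;
import java.util.concurrent.Future;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class IngestNode {
  private final KafkaProducer<String, String> kProducer;

  public IngestNode(String bootstrapServers) {
    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    props.put(ProducerConfig.ACKS_CONFIG, "all");                 // wait for all in-sync replicas
    props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");  // avoid duplicates on internal retries
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringSerializer");
    this.kProducer = new KafkaProducer<>(props);
  }

  /** Returns the HTTP status to send back to the remote caller. */
  public int handleRequest(String topic, String message) {
    try {
      Future<RecordMetadata> ack = kProducer.send(new ProducerRecord<>(topic, message));
      kProducer.flush();   // push the record out now instead of waiting for linger.ms
      ack.get();           // surfaces any broker-side failure
      return 200;          // safe to ACK the remote sender
    } catch (Exception e) {
      return 503;          // NACK: the remote sender should retry
    }
  }
}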
Reliability
But, how do we handle edge nodes A & C?
B̀
m1 m1
`
A
m1 m1
` C m1
m1
What does C need to do?
• Read data (a batch) from Kafka
• Do some processing
• Reliably send data out
• ACK / NACK Kafka
• Consumer Config
• enable.auto.commit = false
• ACK moves the read checkpoint forward
• NACK forces a reread of the same data (see the consumer sketch below)
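A minimal sketch of C's loop, assuming the "events" topic from earlier; the group id and sendDownstream() are placeholders for the real egress call.

// A minimal sketch of node C: commit (ACK) only after the batch has been delivered downstream.
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ExpelNode {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "expel-node");
    props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");  // we move the checkpoint ourselves
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("events"));
      while (true) {
        ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(500));
        try {
          for (ConsumerRecord<String, String> record : batch) {
            sendDownstream(record.value());   // must succeed before we ACK
          }
          consumer.commitSync();              // ACK: move the read checkpoint forward
        } catch (Exception e) {
          // NACK: skip the commit so these offsets are re-delivered after a restart or
          // rebalance (or seek back explicitly to re-read immediately)
        }
      }
    }
  }

  private static void sendDownstream(String message) { /* placeholder for the real egress call */ }
}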
Reliability
But, how do we handle edge nodes A & C?
B̀
m1 m1
`
A
m1 m1
` C m1
m1
B is a combination of A and C
Reliability
But, how do we handle edge nodes A & C?
B̀
m1 m1
`
A
m1 m1
` C m1
m1
B is a combination of A and C
B needs to act like a reliable Kafka Producer
Reliability
But, how do we handle edge nodes A & C?
B̀
m1 m1
`
A
m1 m1
` C m1
m1
B is a combination of A and C
B needs to act like a reliable Kafka Producer
B needs to act like a reliable Kafka Consumer
`
A B C
m1 m1
Reliability
How reliable is our system now?
Reliability
How reliable is our system now?
What happens if a process crashes?
`
A B C
m1 m1
Reliability
How reliable is our system now?
What happens if a process crashes?
If A crashes, we will have a complete outage at ingestion!
`
A B C
m1 m1
Reliability
How reliable is our system now?
If C crashes, we will stop delivering messages to external consumers!
What happens if a process crashes?
If A crashes, we will have a complete outage at ingestion!
`
A B C
m1 m1
Reliability
`
A B C
m1 m1
Solution : Place each service in an autoscaling group of size T
`
A B C
m1 m1
T-1 concurrent
failures
Reliability
`
A B C
m1 m1
Solution : Place each service in an autoscaling group of size T
`
A B C
m1 m1
T-1 concurrent
failures
For now, we appear to have a pretty reliable data stream
But how do we measure its reliability?
Observability
(A story about Lag & Loss Metrics)
(This brings us to …)
Lag : What is it?
Lag : What is it?
Lag is simply a measure of message delay in a system
Lag : What is it?
Lag is simply a measure of message delay in a system
The longer a message takes to transit a system, the greater its lag
Lag : What is it?
Lag is simply a measure of message delay in a system
The longer a message takes to transit a system, the greater its lag
The greater the lag, the greater the impact to the business
Lag : What is it?
Lag is simply a measure of message delay in a system
The longer a message takes to transit a system, the greater its lag
The greater the lag, the greater the impact to the business
Hence, our goal is to minimize lag in order to deliver insights as quickly as possible
Lag : How do we compute it?
Lag : How do we compute it?
eventTime : the creation time of an event message
Lag can be calculated for any message m1 at any node N in the system as
lag(m1, N) = current_time(m1, N) - eventTime(m1)
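A tiny sketch of that formula, assuming eventTime travels with the message (e.g. as a field or header set at creation time); taking this measurement on arrival vs. on departure at N yields the two lag flavors discussed below.

// Sketch: lag of a message at node N = wall-clock time at N minus the message's eventTime.
public final class Lag {
  public static long lagMillis(long eventTimeMillis) {
    return System.currentTimeMillis() - eventTimeMillis;
  }
}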
`
A B C
m1 m1
T0
eventTime:
Lag : How do we compute it?
`
A B C
T0 = 12p
eventTime:
m1
Lag : How do we compute it?
`
A B C
T1 = 12:01p
T0 = 12p
eventTime:
m1
Lag : How do we compute it?
`
A B C
T1 = 12:01p T3 = 12:04p
T0 = 12p
eventTime:
m1
Lag : How do we compute it?
`
A B C
T1 = 12:01p T3 = 12:04p T5 = 12:10p
T0 = 12p
eventTime:
m1 m1
Lag : How do we compute it?
`
A B C
T1 = 12:01p T3 = 12:04p T5 = 12:10p
T0 = 12p
eventTime:
m1 m1
lag(m1, A) = T1-T0 lag(m1, B) = T3-T0 lag(m1, C) = T5-T0
lag(m1, A) = 1m lag(m1, B) = 4m lag(m1, C) = 10m
`
A B C
Arrival Lag (Lag-in): time message arrives - eventTime
T1 T3 T5
T0
eventTime:
m1 m1
Lag : How do we compute it?
Lag-in @
A = T1 - T0 (e.g 1 ms)
B = T3 - T0 (e.g 3 ms)
C = T5 - T0 (e.g 8 ms)
`
A B C
Arrival Lag (Lag-in): time message arrives - eventTime
T1 T3 T5
T0
eventTime:
m1 m1
Lag : How do we compute it?
Lag-in @
A = T1 - T0 (e.g 1 ms)
B = T3 - T0 (e.g 3 ms)
C = T5 - T0 (e.g 8 ms)
A B C
Observation: Lag is Cumulative
Lag : How do we compute it?
Lag-in @
A = T1 - T0 (e.g 1 ms)
B = T3 - T0 (e.g 3 ms)
C = T5 - T0 (e.g 8 ms)
Lag-out @
A = T2 - T0 (e.g 2 ms)
B = T4 - T0 (e.g 4 ms)
C = T6 - T0 (e.g 10 ms)
`
A B C
Arrival Lag (Lag-in): time message arrives - eventTime
T1 T3 T5
Departure Lag (Lag-out): time message leaves - eventTime
T2 T4 T6
T0
eventTime:
m1
Lag : How do we compute it?
Lag-in @
A = T1 - T0 (e.g 1 ms)
B = T3 - T0 (e.g 3 ms)
C = T5 - T0 (e.g 8 ms)
Lag-out @
A = T2 - T0 (e.g 2 ms)
B = T4 - T0 (e.g 4 ms)
C = T6 - T0 (e.g 10 ms)
`
A B C
Arrival Lag (Lag-in): time message arrives - eventTime
T1 T3 T5
Departure Lag (Lag-out): time message leaves - eventTime
T2 T4
T0
eventTime:
m1
E2E Lag
The most important metric for Lag in any streaming system
is E2E Lag: the total time a message spent in the system
T6
Lag : How do we compute it?
While it is interesting to know the lag for a particular message m1, it is of little use
since we typically deal with millions of messages
Lag : How do we compute it?
While it is interesting to know the lag for a particular message m1, it is of little use
since we typically deal with millions of messages
Instead, we prefer statistics (e.g. P95) to capture population behavior
Lag : How do we compute it?
Some useful Lag statistics are:
E2E Lag (p95) : 95th percentile time of messages spent in the system
Lag_[in|out](N, p95): P95 Lag_in or Lag_out at any Node N
Lag : How do we compute it?
Some useful Lag statistics are:
E2E Lag (p95) : 95th percentile time of messages spent in the system
Lag_[in|out](N, p95): P95 Lag_in or Lag_out at any Node N
Process_Duration(N, p95) : Time Spent at any node in the chain!
Lag_out(N, p95) - Lag_in(N, p95)
m1 m1
Process_Duration Graphs show you the contribution to overall Lag from each hop
Lag : How do we compute it?
Loss : What is it?
Loss : What is it?
Loss is simply a measure of messages lost while transiting the system
Loss : What is it?
Loss is simply a measure of messages lost while transiting the system
Messages can be lost for various reasons, most of which we can mitigate!
Loss : What is it?
Loss is simply a measure of messages lost while transiting the system
Messages can be lost for various reasons, most of which we can mitigate!
The greater the loss, the lower the data quality
Loss : What is it?
Loss is simply a measure of messages lost while transiting the system
Messages can be lost for various reasons, most of which we can mitigate!
The greater the loss, the lower the data quality
Hence, our goal is to minimize loss in order to deliver high quality insights
Loss : How do we compute it?
Loss : How do we compute it?
Concepts : Loss
Loss can be computed as the set difference of messages between any 2
points in the system
Loss : How do we compute it?
Message Id | Node 1 | Node 2 | Node 3 | Node 4 | E2E Loss | E2E Loss %
m1 | 1 | 1 | 1 | 1 | |
m2 | 1 | 1 | 1 | 1 | |
m3 | 1 | 0 | 0 | 0 | |
… | … | … | … | … | |
m10 | 1 | 1 | 0 | 0 | |
Count | 10 | 9 | 7 | 5 | |
Per Node Loss(N) | 0 | 1 | 2 | 2 | 5 | 50%
Loss : How do we compute it?
In a streaming data system, messages never stop flowing. So, how do we know when to count?
Loss : How do we compute it?
In a streaming data system, messages never stop flowing. So, how do we know when to count?
Solution
Allocate messages to 1-minute wide time buckets using message eventTime
Loss : How do we compute it?
Message Id | Node 1 | Node 2 | Node 3 | Node 4 | E2E Loss | E2E Loss %
m1 | 1 | 1 | 1 | 1 | |
m2 | 1 | 1 | 1 | 1 | |
m3 | 1 | 0 | 0 | 0 | |
… | … | … | … | … | |
m10 | 1 | 1 | 0 | 0 | |
Count | 10 | 9 | 7 | 5 | |
Per Node Loss(N) | 0 | 1 | 2 | 2 | 5 | 50%
@12:34p
Loss : How do we compute it?
@12:34p @12:35p @12:36p @12:37p @12:38p @12:39p @12:40p
Loss : How do we compute it?
@12:34p @12:35p @12:36p @12:37p @12:38p @12:39p @12:40p
Now
Loss : How do we compute it?
@12:34p @12:35p @12:36p @12:37p @12:38p @12:39p @12:40p
Now
Update late arrival data
Loss : How do we compute it?
@12:34p @12:35p @12:36p @12:37p @12:38p @12:39p @12:40p
Now
Update late arrival data
Compute
Loss
Loss : How do we compute it?
@12:34p @12:35p @12:36p @12:37p @12:38p @12:39p @12:40p
Now
Update late arrival data
Compute
Loss
Age Out
Loss : How do we compute it?
In a streaming data system, messages never stop flowing. So, how do we know when to count?
Solution
Allocate messages to 1-minute wide time buckets using message eventTime
Wait a few minutes for messages to transit, then compute loss (e.g. 12:35p loss table)
Raise alarms if loss exceeds a configured threshold (e.g. > 1%), as computed in the sketch below
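A sketch of that bucketing logic, assuming each node reports the set of message ids it has seen per 1-minute eventTime bucket; the class, method names, and alarm mechanism are illustrative.

// Sketch: bucket messages by eventTime minute, then compute loss per settled bucket
// as the set difference between the ingest node and a downstream node.
import java.util.Map;
import java.util.Set;

public final class LossCalculator {

  /** Truncate an eventTime (epoch millis) to its 1-minute bucket. */
  public static long bucketOf(long eventTimeMillis) {
    return eventTimeMillis - (eventTimeMillis % 60_000L);
  }

  /** Loss % at a node for one bucket = messages seen at ingest but not at the node. */
  public static double lossPercent(Set<String> idsAtIngest, Set<String> idsAtNode) {
    if (idsAtIngest.isEmpty()) return 0.0;
    long missing = idsAtIngest.stream().filter(id -> !idsAtNode.contains(id)).count();
    return 100.0 * missing / idsAtIngest.size();
  }

  /** Raise an alarm for any settled bucket whose loss exceeds the configured threshold. */
  public static void checkBuckets(Map<Long, Set<String>> ingestBuckets,
                                  Map<Long, Set<String>> expelBuckets,
                                  double thresholdPercent) {
    ingestBuckets.forEach((bucket, ingestIds) -> {
      Set<String> expelIds = expelBuckets.getOrDefault(bucket, Set.of());
      double loss = lossPercent(ingestIds, expelIds);
      if (loss > thresholdPercent) {
        System.out.printf("ALARM: bucket %d lost %.2f%% of messages%n", bucket, loss);
      }
    });
  }
}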
We now have a way to measure the reliability (via Loss metrics) and latency (via Lag
metrics) of our system.
Loss : How do we compute it?
But wait…
Performance
(have we tuned our system for performance yet??)
Performance
Goal : Build a system that can deliver messages reliably from S to D with low latency
To understand streaming system performance, let's understand the components of E2E Lag
Performance
Ingest Time
Ingest Time : Time from Last_Byte_In_of_Request to First_Byte_Out_of_Response
Performance
Ingest Time
• This time includes overhead of reliably sending messages to Kafka
Ingest Time : Time from Last_Byte_In_of_Request to First_Byte_Out_of_Response
Performance
Ingest Time Expel Time
Expel Time : Time to process and egest a message at D. This includes time to wait for
ACKs from external services!
Performance
E2E Lag = Ingest Time + Transit Time + Expel Time
E2E Lag : Total time messages spend in the system, from message ingest to expel!
Performance Penalties
(Trade-off: Latency for Reliability)
Performance
Challenge 1 : Ingest Penalty
In the name of reliability, S needs to call kProducer.flush() on every inbound API request
S also needs to wait for 3 ACKs from Kafka before sending its API response
E2E Lag = Ingest Time + Transit Time + Expel Time
Performance
Challenge 1 : Ingest Penalty
Approach : Amortization
Support Batch APIs (i.e. multiple messages per web request) to amortize the
ingest penalty
E2E Lag = Ingest Time + Transit Time + Expel Time
Performance
E2E Lag = Ingest Time + Transit Time + Expel Time
Challenge 2 : Expel Penalty
Observations
Kafka is very fast — many orders of magnitude faster than HTTP RTTs
The majority of the expel time is the HTTP RTT
Performance
E2E Lag = Ingest Time + Transit Time + Expel Time
Challenge 2 : Expel Penalty
Approach : Amortization
In each D node, add batch + parallelism
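One way to read "batch + parallelism" (a sketch under assumptions, not the talk's implementation): fan a polled batch out over a bounded thread pool, wait for every send to finish, then commit the batch as a whole.

// Sketch: amortize the expel penalty by delivering a polled batch in parallel,
// committing only after every message in the batch has gone out.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ParallelExpel {
  private final ExecutorService pool = Executors.newFixedThreadPool(16);

  void expelBatch(KafkaConsumer<String, String> consumer, ConsumerRecords<String, String> batch) {
    List<CompletableFuture<Void>> sends = new ArrayList<>();
    for (ConsumerRecord<String, String> record : batch) {
      sends.add(CompletableFuture.runAsync(() -> sendDownstream(record.value()), pool));
    }
    // Wait for the whole batch; any failure propagates and the commit below is skipped (NACK).
    CompletableFuture.allOf(sends.toArray(new CompletableFuture[0])).join();
    consumer.commitSync();  // ACK the batch only after every message is out
  }

  private void sendDownstream(String message) { /* e.g. an HTTP POST to the destination */ }
}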
Performance
Challenge 3 : Retry Penalty (@ D)
Concepts
In order to run a zero-loss pipeline, we need to retry messages @ D that will
succeed given enough attempts
Performance
Challenge 3 : Retry Penalty (@ D)
Concepts
In order to run a zero-loss pipeline, we need to retry messages @ D that will
succeed given enough attempts
We call these Recoverable Failures
Performance
Challenge 3 : Retry Penalty (@ D)
Concepts
In order to run a zero-loss pipeline, we need to retry messages @ D that will
succeed given enough attempts
We call these Recoverable Failures
In contrast, we should never retry a message that has 0 chance of success!
We call these Non-Recoverable Failures
Performance
Challenge 3 : Retry Penalty (@ D)
Concepts
In order to run a zero-loss pipeline, we need to retry messages @ D that will
succeed given enough attempts
We call these Recoverable Failures
In contrast, we should never retry a message that has 0 chance of success!
We call these Non-Recoverable Failures
E.g. any 4xx HTTP response code, except for 429 (Too Many Requests)
Performance
Challenge 3 : Retry Penalty
Approach
We pay a latency penalty on retry, so we need to be smart about
What we retry — Don’t retry any non-recoverable failures
How we retry
Performance
Challenge 3 : Retry Penalty
Approach
We pay a latency penalty on retry, so we need to be smart about
What we retry — Don’t retry any non-recoverable failures
How we retry — One Idea : Tiered Retries
Performance - Tiered Retries
Local Retries
Try to send the message a configurable number of times @ D
Global Retries
Performance - Tiered Retries
Local Retries
Try to send the message a configurable number of times @ D
If we exhaust local retries, D
transfers the message to a Global
Retrier
Global Retries
Performance - Tiered Retries
Local Retries
Try to send the message a configurable number of times @ D
If we exhaust local retries, D
transfers the message to a Global
Retrier
Global Retries
The Global Retrier then retries the message over a longer span of time
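A sketch of the two tiers; the retry counts, backoff, failure types, and the global-retry topic name are all illustrative assumptions.

// Sketch: local retries at D, then hand-off to the Global Retrier via a Kafka topic.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TieredRetry {
  private static final int LOCAL_ATTEMPTS = 3;
  private static final String GLOBAL_RETRY_TOPIC = "global-retry";  // hypothetical name

  private final KafkaProducer<String, String> producer;

  public TieredRetry(KafkaProducer<String, String> producer) {
    this.producer = producer;
  }

  /** Called by D for each message it fails to deliver on the first try. */
  public void deliverWithRetries(String message) throws InterruptedException {
    for (int attempt = 1; attempt <= LOCAL_ATTEMPTS; attempt++) {
      try {
        sendDownstream(message);          // recoverable failures throw
        return;                           // success: nothing more to do
      } catch (RecoverableFailure e) {
        Thread.sleep(100L * attempt);     // simple backoff between local attempts
      } catch (NonRecoverableFailure e) {
        return;                           // e.g. HTTP 4xx (other than 429): never retry
      }
    }
    // Local retries exhausted: hand the message to the Global Retrier over a Kafka topic.
    producer.send(new ProducerRecord<>(GLOBAL_RETRY_TOPIC, message));
  }

  private void sendDownstream(String message) throws RecoverableFailure, NonRecoverableFailure {
    /* placeholder for the real egress call */
  }

  static class RecoverableFailure extends Exception {}
  static class NonRecoverableFailure extends Exception {}
}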
Performance - 2 Tiered Retries
RI : Retry_In
RO : Retry_Out
At this point, we have a system that works well at low scale
Performance
(But how does this system scale with increasing traffic?)
Scalability
Scalability
Goal : Build a system that can deliver messages reliably from S to D with low latency
up to a scale limit
Scalability
Goal : Build a system that can deliver messages reliably from S to D with low latency
up to a scale limit
What do I mean by a scale limit?
Scalability
First, Let’s dispel a myth!
Scalability
First, Let’s dispel a myth!
There is no such thing as a system that can handle infinite scale
Scalability
First, Let’s dispel a myth!
Every system is capacity limited — in the case of AWS, some of these are artificial!
Capacity limits & performance bottlenecks can only be revealed under load
There is no such thing as a system that can handle infinite scale
Scalability
First, Let’s dispel a myth!
Every system is capacity limited — in the case of AWS, some of these are artificial!
Capacity limits & performance bottlenecks can only be revealed under load
There is no such thing as a system that can handle infinite scale
Hence, we need to periodically load test our system at increasingly higher load to find and remove performance bottlenecks
Each successful test raises our scale rating!
Scalability
First, Let’s dispel a myth!
Every system is capacity limited — in the case of AWS, some of these are artificial!
Capacity limits & performance bottlenecks can only be revealed under load
There is no such thing as a system that can handle infinite scale
Hence, we need to periodically load test our system at increasingly higher load to find and remove performance bottlenecks
By executing these tests ahead of organic load, we avoid customer-impacting
message delays!
Each successful test raises our scale rating!
Scalability
How do we set a target scale rating for our load tests?
Desired Scale Rating: 1 MM events/second @ ingest
Scalability
Is it enough to only specify the target throughput?
How do we set a target scale rating for our load tests?
Desired Scale Rating: 1 MM events/second @ ingest
So, a scale rating must be qualified by a latency condition to be useful!
Scalability
Is it enough to only specify the target throughput? No
Most systems experience increasing latency with increasing traffic!
A scale rating which violates our E2E lag SLA is not useful!
How do we set a target scale rating for our load tests?
Desired Scale Rating: 1 MM events/second @ ingest
So, a scale rating must be qualified by a latency condition to be useful!
Scalability
Is it enough to only specify the target throughput? No
Most systems experience increasing latency with increasing traffic!
A scale rating which violates our E2E lag SLA is not useful!
Updated Scale Rating: 1 MM events/second with <1s (p95) E2E lag
How do we set a target scale rating for our load tests?
Desired Scale Rating: 1 MM events/second @ ingest
Scalability
But, how do we select a scale rating? What should it be based on?
Scalability
But, how do we select a scale rating? What should it be based on?
We maintain traffic alarms that tell us when our organic traffic exceeds 1/m of our previous scale rating
This tells us that a new test needs to be scheduled!
@ Datazoom, we target a scale rating that is a multiple m of our organic peak!
We spend the duration of our load tests diagnosing system performance and
identifying bottlenecks that result in higher than expected latency!
Once we resolve all bottlenecks, we update our traffic alarms for our new, higher scale rating based on a new target multiple m'
Scalability - Autoscaling
Autoscaling Goals (for data streams):
Goal 1: Automatically scale out to maintain low latency (e.g. E2E Lag)
Goal 2: Automatically scale in to minimize cost
Scalability - Autoscaling
Autoscaling Goals (for data streams):
Goal 1: Automatically scale out to maintain low latency (e.g. E2E Lag)
Goal 2: Automatically scale in to minimize cost
Scalability - Autoscaling
Autoscaling Goals (for data streams):
Goal 1: Automatically scale out to maintain low latency (e.g. E2E Lag)
Goal 2: Automatically scale in to minimize cost
Autoscaling Considerations
What can autoscale? What can’t autoscale?
Scalability - Autoscaling
Autoscaling Goals (for data streams):
Goal 1: Automatically scale out to maintain low latency (e.g. E2E Lag)
Goal 2: Automatically scale in to minimize cost
Autoscaling Considerations
What can autoscale? What can’t autoscale?
MSK Serverless aims to bridge this, but
it’s expensive!
Scalability - Autoscaling
@ Datazoom, we use K8s — this gives us all of the benefits of containerization!
But since we run on the cloud, K8s pods need to scale on top of EC2, which itself can be autoscaled!
Hence, we have a 2-level autoscaling problem!
EC2 EC2 EC2
Scalability - HPA Journey
K8s : HPA a.k.a. Horizontal Pod Autoscaler (CPU)
EC2 : EC2 autoscaling (Memory)
CloudWatch (CW) provides the scaling signal
First Attempt : Independently Scale EC2 & K8s pods
Nov 25, 2020 : Major Kinesis outage in US-East-1
EC2 EC2 EC2
HPA
EC2
CW
Scalability - HPA Journey
Datazoom doesn’t directly use Kinesis, but CloudWatch does!
Nov 25, 2020 : Major Kinesis outage in US-East-1
EC2 EC2 EC2
HPA
EC2
CW
EC2 EC2 EC2
HPA
EC2
CW
K8s scaled (in) to Min Pod counts, resulting in a gray failure (high lag & some message loss)
Scalability - HPA Journey
Switching to KEDA (the K8s Event-Driven Autoscaler) handles this situation
We also switched to using the K8s Cluster Autoscaler to coordinate EC2 autoscaling with HPA!
EC2 EC2 EC2
HPA
EC2
CW
KEDA
Scalability
At this point, we have a system that works well within a traffic or scale rating
Availability
(But, what can we say about its availability??)
Availability
What does Availability mean?
Availability
What does Availability mean?
Availability measures the consistency with which a service meets the expectations of its
users
Availability
What does Availability mean?
Availability measures the consistency with which a service meets the expectations of its
users
For example, if I want to ssh into a machine, I want it to be running
However, if its performance is degraded (e.g. low available memory), uptime is not
enough! Any command I issue will take too long to complete
Availability
Up
• 0 message loss
• E2E Lag (p95) <= SLA (e.g. 1s)
Gray Zone
Down
• Either message loss or the system is not accepting new messages
• E2E Lag (p95) >>> SLA
Availability
Up
• 0 message loss
• E2E Lag (p95) <= SLA (e.g. 1s)
Down
• Either message loss or the system is not accepting new messages
• E2E Lag (p95) > SLA
Availability
Up
• 0 message loss
• E2E Lag (p95) <= SLA (e.g. 1s)
Down
• Either message loss or the system is not accepting new messages
• E2E Lag (p95) > SLA
Let's use this approach to define an Availability SLA for Data Streams!
Availability
Consider that there are 1440 minutes in a day.
For a given minute (e.g. 12:04p), calculate the E2E lag (p95) for all events whose event_time falls in the 12:04p minute!
If E2E lag must be < 1 second, compare this statistic to 1 second: is E2E_Lag(p95) < 1 second?
If yes, then this 1-minute period is within Lag SLAs.
If no, then this 1-minute period is out of Lag SLAs. In the parlance of the "nines" model, this would constitute a minute of downtime!
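A sketch of turning the per-minute check into a daily number. It covers only the lag half of the Up/Down definition (message loss would be folded in the same way); the 1-second SLA is the example value from above.

// Sketch: daily availability from per-minute E2E Lag (p95), assuming a 1-second Lag SLA.
// A minute is counted as "down" if its p95 exceeds the SLA.
import java.util.List;

public final class AvailabilityCalc {
  private static final double LAG_SLA_MILLIS = 1000.0;

  /** lagP95PerMinute holds one p95 value (in ms) for each of the day's 1440 minutes. */
  public static double dailyAvailabilityPercent(List<Double> lagP95PerMinute) {
    long downMinutes = lagP95PerMinute.stream().filter(p95 -> p95 > LAG_SLA_MILLIS).count();
    return 100.0 * (lagP95PerMinute.size() - downMinutes) / lagP95PerMinute.size();
  }
}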
Availability
You may be familiar with the “nines of availability” chart below:
Availability % | Downtime per year | Downtime per quarter | Downtime per month | Downtime per week | Downtime per day (24 hours)
99% ("two nines") | 3.65 days | 21.9 hours | 7.31 hours | 1.68 hours | 14.40 minutes
99.5% ("two and a half nines") | 1.83 days | 10.98 hours | 3.65 hours | 50.40 minutes | 7.20 minutes
99.9% ("three nines") | 8.77 hours | 2.19 hours | 43.83 minutes | 10.08 minutes | 1.44 minutes
99.95% ("three and a half nines") | 4.38 hours | 65.7 minutes | 21.92 minutes | 5.04 minutes | 43.20 seconds
99.99% ("four nines") | 52.60 minutes | 13.15 minutes | 4.38 minutes | 1.01 minutes | 8.64 seconds
Availability
The next question we must ask about our system is whether the downtime SLAs should
be computed on a daily, weekly, monthly, etc… basis
Availability % | Downtime per year | Downtime per quarter | Downtime per month | Downtime per week | Downtime per day (24 hours)
99.9% ("three nines") | 8.77 hours | 2.19 hours | 43.83 minutes | 10.08 minutes | 1.44 minutes
For 3 nines’ uptime computed daily, no more than 1.44 minutes of downtime is allowed!
Availability
Can we enforce availability?
No!
Availability is intrinsic to a system - it is impacted by calamities, many of which are
out of our control!
The best we can do is to measure & monitor it!
Availability
We now have a system that reliably delivers messages from source S to destination D
at low latency below its scale rating. We also have a way to measure its availability!
Can we enforce availability?
No!
Availability is intrinsic to a system - it is impacted by calamities, many of which are
out of our control!
The best we can do is to measure & monitor it!
• We have a system with the NFRs that we desire!
Availability
Practical Challenges
(What are some practical challenges that we face?)
Practical Challenges
For example, what factors influence latency in the real world?
Aside from building a lossless system, the next biggest challenge is keeping latency
low!
Practical Challenges
Minimizing latency is a practice in noise reduction!
Practical Challenges
Minimizing latency is a practice in noise reduction!
The following factors impact latency in a streaming system!
Network lookups needed for processing
Autoscaling - Kafka Rebalancing
AWS Maintenance Tasks (e.g. Kafka)
Bad Releases
Practical Challenges - Latency
Consider a modern pipeline as shown below
Ingest → Normalize → Enrich → Route → Transform → Transmit
A very common practice is to forbid any network lookups along the stream processing chain above!
However, most of the services above need to look up some configuration data as part of processing, e.g. Enrichment, Router, Transformer, Transmitter
Practical Challenges - Latency
A common approach would be to pull data from a central cache such as Redis
Ingest → Normalize → Enrich → Route → Transform → Transmit
If lookups can be achieved in 1-2 ms, this lookup penalty may be acceptable.
Get
Practical Challenges - Latency
In practice, we observe 1-minute outages whenever failovers occur in Redis HA Clusters!
Ingest → Normalize → Enrich → Route → Transform → Transmit
To shield ourselves against any latency gaps, we adopt some caching practices!
Get
Failover
Practical Challenges - Latency
Practice 1 : Adopt Caffeine (Local Cache) across all services
• Source : https://github.com/ben-manes/caffeine
• Caffeine is a high performance, near optimal caching library.
LoadingCache<Object, Object> cache = Caffeine.newBuilder()
    .refreshAfterWrite(<Staleness Interval>, TimeUnit)
    .expireAfterWrite(<Expiration Interval>, TimeUnit)
    .maximumSize(maxNumOfEntries)
    .build(key -> loadEntry(key)); // build with a CacheLoader; loadEntry is a placeholder (e.g. a Redis lookup)
Practical Challenges - Latency
Load a cache entry
Time
Caffeine<Object, Object> cache = Caffeine.newBuilder()
.refreshAfterWrite(2, TimeUnit.MINUTES)
.expireAfterWrite( 30, TimeUnit.DAYS)
.maximumSize( maxNumOfEntries );
v1
Practical Challenges - Latency
Load a cache entry
refreshAfter
Write = 2 mins
Time
expireAfter
Write = ∞
Caffeine<Object, Object> cache = Caffeine.newBuilder()
.refreshAfterWrite(2, TimeUnit.MINUTES)
.expireAfterWrite( 30, TimeUnit.DAYS)
.maximumSize( maxNumOfEntries );
v1
Practical Challenges - Latency
refreshAfter
Write = 2 mins
Lookup entry
Time
expireAfter
Write = ∞
v1
Practical Challenges - Latency
refreshAfter
Write = 2 mins
Lookup entry
Time
expireAfter
Write = ∞
Practical Challenges - Latency
refreshAfter
Write = 2 mins
Lookup entry
Time
expireAfter
Write = ∞
v1
v2
Practical Challenges - Latency
refreshAfter
Write = 2 mins
expireAfter
Write = ∞
Lookup entry
Time
v1
Practical Challenges - Latency
refreshAfter
Write = 2 mins
expireAfter
Write = ∞
Lookup entry
Time
v2
Practical Challenges - Latency
Practice 2 : Eager load all cache entries for all caches into Caffeine at application
start
Microservice application (pod) not made ready until caches load
Hence, until caches load, no traffic passes through the initializing application pod!
With this in place, we avoid latency spikes due to Redis issues & lazy init
penalties!
Practical Challenges
Minimizing latency is a practice in noise reduction!
The following factors impact latency in a streaming system!
Network lookups needed for processing
AWS Maintenance Tasks (e.g. Kafka)
Internal Operational Processes
Autoscaling - Kafka Rebalancing
Practical Challenges - Latency
Anytime AWS patched our Kafka clusters, we noticed large latency spikes
Ingest → Normalize → Enrich → Route → Transform → Transmit
Since AWS patches brokers in a rolling fashion, we typically see about 3 latency spikes
over 30 minutes
These violated our availability SLAs since we cannot afford more than 1 latency spike
out of Lag SLA.
Recall: 3 nines honored daily —> (1.44 minutes out of Lag SLA)
MSK Patching Impacts Lag
(Chart legend: Typical E2E Lag; Lag Spikes due to MSK Maintenance)
Practical Challenges - Latency
To solve this issue, we adopted a Blue-Green operating model!
Between weekly releases or scheduled maintenance, one color side receives traffic (light), while the other color side receives none (dark)
(Light) Customer Traffic: Ingest → Normalize → Enrich → Route → Transform → Transmit
(Dark) Patches applied: Ingest → Normalize → Enrich → Route → Transform → Transmit
Practical Challenges - Latency
(Light) Customer Traffic: Ingest → Normalize → Enrich → Route → Transform → Transmit
(Dark) Patches applied: Ingest → Normalize → Enrich → Route → Transform → Transmit
Cost Efficiency: The Dark side is scaled to 0 to save on costs!
Practical Challenges - Latency
By switching traffic away from the side scheduled for AWS maintenance, we avoid a large cause of latency spikes
With this success, we immediately began leveraging the Blue-Green model for other operations as well, such as:
Practical Challenges - Latency
By switching traffic away from the side scheduled for AWS maintenance, we avoid a large cause of latency spikes
With this success, we immediately began leveraging the Blue-Green model for other
operations as well, such as:
Software Releases
Canary Testing
Load Testing
Scheduled Maintenance
Technology Migrations
Outage Recovery
Doubling Capacity due to Sustained Traffic Burst
Practical Challenges
Minimizing latency is a practice in noise reduction!
The following factors impact latency in a streaming system!
Network lookups needed for processing
AWS Maintenance Tasks (e.g. Kafka)
Internal Operational Processes
Autoscaling - Kafka Rebalancing
Practical Challenges - Latency
During autoscaling, we noticed latency spikes! These lasted less than 1 minute, but compromised our availability SLAs
This has to do with Kafka consumer rebalancing!
Kafka consumers subscribe to topics
Topics are split into physical partitions! These partitions are load balanced across multiple consuming K8s Pods in the same Consumer Group
During autoscaling activity, the number of consumers in a Consumer Group
changes, causing Kafka to execute a partition reassignment (a.k.a. rebalance)
Practical Challenges - Latency
The current default rebalancing behavior (via the range assignor) is a stop-the-world
action. This causes pretty large pauses in consumption, which result in latency spikes!
An alternative rebalancing algorithm came out called KICR or Kafka Incremental
Cooperative Rebalancing (via the cooperative sticky assignor)
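Adopting the cooperative sticky assignor is a consumer-side configuration change; a minimal sketch in Java (the same property can equally be set in a properties file):

// Switch a consumer group to incremental cooperative rebalancing via
// partition.assignment.strategy.
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;

public final class RebalanceConfig {
  public static Properties consumerProps() {
    Properties props = new Properties();
    props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
        CooperativeStickyAssignor.class.getName());
    return props;
  }
}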
Practical Challenges - Latency
The graph below shows the impact of using KICR.
(Chart: KICR vs. Default Rebalancing)
Practical Challenges - Latency
Unfortunately, in using the Cooperative Sticky Assignor, we noticed significant duplicate consumption during autoscaling. In one load test, duplicate consumption @ one service exceeded 100%. ( https://issues.apache.org/jira/browse/KAFKA-14757 )
Conclusion
• We now have a system with the NFRs that we desire!
• While we’ve covered many key elements, we had to skip some for
time (e.g. Isolation, Containerization, K8s Autoscaling).
• These will be covered in upcoming blogs! Follow (@r39132) for updates.
• We have also shown a real-life system and challenges that go with it!
Acknowledgments
PayPal Datazoom
• Vincent Chen
• Anisha Nainani
• Pramod Garre
• Harsh Bhimani
• Nirmalya Ghosh
• Yash Shah
• Deepak Mohankumar Chandramouli
• Dheeraj Rampally
• Romit Mehta
• Kafka team (Maulin Vasavada,
Na Yang, Doron Mimon)
• Sandhu Santhakumar
• Akhil S. Varughese
• Shiju V.
• Sunny Dabas
• Shahanas S.
• Naveen Chandra
• Thasni M. Salim
• Serena Eldhose
• Gordon Lai
• Fajib Aboobacker
• Josh Evans
• Bob Carlson
• Tony Gentille

Microsoft AI Transformation Partner Playbook.pdf
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 

Building High Fidelity Data Streams (QCon London 2023)

  • 18. Building High Fidelity Data Streams! Now that we have some context…. Let's build a high-fidelity stream system from the ground up!
  • 20. Start Simple Goal : Build a system that can deliver messages from source S to destination D S D
  • 21. Start Simple Goal : Build a system that can deliver messages from source S to destination D. But first, let's decouple S and D by putting messaging infrastructure between them: S → E → D, where E is an events topic
  • 22. Start Simple Make a few more implementation decisions about this system E S D
  • 23. Start Simple Make a few more implementation decisions about this system E S D Run our system on a cloud platform (e.g. AWS)
  • 24. Start Simple Make a few more implementation decisions about this system Operate at low scale E S D Run our system on a cloud platform (e.g. AWS)
  • 25. Start Simple Make a few more implementation decisions about this system Operate at low scale Kafka with a single partition E S D Run our system on a cloud platform (e.g. AWS)
  • 26. Start Simple Make a few more implementation decisions about this system Operate at low scale Kafka with a single partition Kafka across 3 brokers split across AZs with RF=3 (min in-sync replicas =2) E S D Run our system on a cloud platform (e.g. AWS)
  • 27. Start Simple Make a few more implementation decisions about this system Operate at low scale Kafka with a single partition Kafka across 3 brokers split across AZs with RF=3 (min in-sync replicas =2) Run S & D on single, separate EC2 Instances E S D Run our system on a cloud platform (e.g. AWS)
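For concreteness, here is a minimal sketch (not from the deck) of how a topic with these settings might be created using Kafka's AdminClient; the broker addresses and the topic name "events" are illustrative assumptions.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateEventsTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Hypothetical bootstrap servers; in practice, one broker per AZ.
        props.put("bootstrap.servers", "b-1:9092,b-2:9092,b-3:9092");

        try (Admin admin = Admin.create(props)) {
            // 1 partition, replication factor 3, and at least 2 in-sync replicas
            // required before a write is acknowledged.
            NewTopic events = new NewTopic("events", 1, (short) 3)
                    .configs(Map.of("min.insync.replicas", "2"));
            admin.createTopics(List.of(events)).all().get();
        }
    }
}
```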
  • 28. Start Simple To make things a bit more interesting, let's provide our stream as a service. We define our system boundary using a blue box as shown below! (diagram: S → E → D inside the blue box)
  • 30. Reliability Goal : Build a system that can deliver messages reliably from S to D (S → E → D)
  • 31. Reliability Goal : Build a system that can deliver messages reliably from S to D. Concrete Goal : 0 message loss
  • 32. Reliability Goal : Build a system that can deliver messages reliably from S to D. Concrete Goal : 0 message loss. Once S has ACKd a message to a remote sender, D must deliver that message to a remote receiver
  • 33. Reliability How do we build reliability into our system? (S → E → D)
  • 35. Reliability In order to make this system reliable… (diagram: message m1 flows through the chain A → B → C)
  • 36. Reliability In order to make this system reliable, treat the messaging system like a chain — it's only as strong as its weakest link
  • 37. Reliability In order to make this system reliable, treat the messaging system like a chain — it's only as strong as its weakest link. Insight : If each process/link is transactional in nature, the chain will be transactional!
  • 38. Reliability In order to make this system reliable, treat the messaging system like a chain — it's only as strong as its weakest link. Insight : If each process/link is transactional in nature, the chain will be transactional! Transactionality = At-least-once delivery
  • 39. Reliability In order to make this system reliable, treat the messaging system like a chain — it's only as strong as its weakest link. Insight : If each process/link is transactional in nature, the chain will be transactional! Transactionality = At-least-once delivery. How do we make each link transactional?
  • 40. Reliability Let's first break this chain into its component processing links: A, B, and C (each link takes m1 in and passes m1 on)
  • 41. Reliability Let's first break this chain into its component processing links. A is an ingest node
  • 42. Reliability Let's first break this chain into its component processing links. B is an internal node
  • 43. Reliability Let's first break this chain into its component processing links. C is an expel node
  • 44. Reliability But, how do we handle edge nodes A & C? What does A need to do? • Receive a Request (e.g. REST) • Do some processing • Reliably send data to Kafka • kProducer.send(topic, message) • kProducer.flush() • Producer Config • acks = all • Send HTTP Response to caller
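A minimal Java sketch of what A's write path could look like under these settings; the class name, topic name, and request handling are illustrative assumptions, not the deck's actual implementation.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class IngestNodeA {
    private final KafkaProducer<String, String> producer;

    public IngestNodeA(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all");                // wait for the in-sync replicas to acknowledge
        props.put("enable.idempotence", "true"); // avoid duplicates on producer retries
        this.producer = new KafkaProducer<>(props);
    }

    /** Handle one inbound request; only ACK the remote sender after Kafka has the message. */
    public boolean ingest(String key, String payload) {
        try {
            // The deck calls out send() + flush(); blocking on the returned future has the
            // same effect for a single record and also surfaces any delivery failure here.
            producer.send(new ProducerRecord<>("events", key, payload)).get();
            return true;   // now it is safe to return an HTTP 2xx to the caller
        } catch (Exception e) {
            return false;  // return an error so the remote sender can retry
        }
    }
}
```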
  • 45. Reliability But, how do we handle edge nodes A & C? What does C need to do? • Read data (a batch) from Kafka • Do some processing • Reliably send data out • ACK / NACK Kafka • Consumer Config • enable.auto.commit = false • ACK moves the read checkpoint forward • NACK forces a reread of the same data
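And a matching sketch for C; again illustrative (topic, group id, and the downstream call are assumptions), not the deck's code.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ExpelNodeC {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // illustrative
        props.put("group.id", "expel-nodes");                // illustrative
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");            // manual checkpointing

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));
            while (true) {
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(500));
                if (batch.isEmpty()) continue;

                if (sendDownstream(batch)) {
                    consumer.commitSync();                   // ACK: move the read checkpoint forward
                } else {
                    // NACK: rewind each partition to the start of this batch so the same
                    // records are re-read on the next poll (at-least-once delivery).
                    for (TopicPartition tp : batch.partitions()) {
                        consumer.seek(tp, batch.records(tp).get(0).offset());
                    }
                }
            }
        }
    }

    /** e.g. an HTTP POST to the external receiver; true only if it ACKs the whole batch. */
    private static boolean sendDownstream(ConsumerRecords<String, String> batch) {
        for (ConsumerRecord<String, String> r : batch) { /* deliver r.value() */ }
        return true;
    }
}
```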
  • 46. Reliability But, how do we handle edge nodes A & C? B is a combination of A and C
  • 47. Reliability But, how do we handle edge nodes A & C? B is a combination of A and C. B needs to act like a reliable Kafka Producer
  • 48. Reliability But, how do we handle edge nodes A & C? B is a combination of A and C. B needs to act like a reliable Kafka Producer. B needs to act like a reliable Kafka Consumer
  • 49. Reliability How reliable is our system now? (A → B → C)
  • 50. Reliability How reliable is our system now? What happens if a process crashes?
  • 51. Reliability How reliable is our system now? What happens if a process crashes? If A crashes, we will have a complete outage at ingestion!
  • 52. Reliability How reliable is our system now? What happens if a process crashes? If A crashes, we will have a complete outage at ingestion! If C crashes, we will stop delivering messages to external consumers!
  • 53. Reliability Solution : Place each service in an autoscaling group of size T, which can tolerate T-1 concurrent failures
  • 54. Reliability Solution : Place each service in an autoscaling group of size T, which can tolerate T-1 concurrent failures. For now, we appear to have a pretty reliable data stream
  • 55. But how do we measure its reliability?
  • 56. Observability (A story about Lag & Loss Metrics) (This brings us to …)
  • 57. Lag : What is it?
  • 58. Lag : What is it? Lag is simply a measure of message delay in a system
  • 59. Lag : What is it? Lag is simply a measure of message delay in a system The longer a message takes to transit a system, the greater its lag
  • 60. Lag : What is it? Lag is simply a measure of message delay in a system The longer a message takes to transit a system, the greater its lag The greater the lag, the greater the impact to the business
  • 61. Lag : What is it? Lag is simply a measure of message delay in a system The longer a message takes to transit a system, the greater its lag The greater the lag, the greater the impact to the business Hence, our goal is to minimize lag in order to deliver insights as quickly as possible
  • 62. Lag : How do we compute it?
  • 63. Lag : How do we compute it? eventTime : the creation time of an event message. Lag can be calculated for any message m1 at any node N in the system as lag(m1, N) = current_time(m1, N) - eventTime(m1)
  • 64. Lag : How do we compute it? (m1 is created at the source with eventTime T0 = 12p)
  • 65. Lag : How do we compute it? (m1 reaches A at T1 = 12:01p; eventTime T0 = 12p)
  • 66. Lag : How do we compute it? (m1 reaches B at T3 = 12:04p; eventTime T0 = 12p)
  • 67. Lag : How do we compute it? (m1 reaches C at T5 = 12:10p; eventTime T0 = 12p)
  • 68. Lag : How do we compute it? With eventTime T0 = 12p: lag(m1, A) = T1 - T0 = 1m, lag(m1, B) = T3 - T0 = 4m, lag(m1, C) = T5 - T0 = 10m
  • 69. Lag : How do we compute it? Arrival Lag (Lag-in): time the message arrives - eventTime. Lag-in @ A = T1 - T0 (e.g. 1 ms), B = T3 - T0 (e.g. 3 ms), C = T5 - T0 (e.g. 8 ms)
  • 70. Lag : How do we compute it? Arrival Lag (Lag-in): time the message arrives - eventTime. Lag-in @ A = T1 - T0 (e.g. 1 ms), B = T3 - T0 (e.g. 3 ms), C = T5 - T0 (e.g. 8 ms). Observation: Lag is cumulative across A → B → C
  • 71. Lag : How do we compute it? Arrival Lag (Lag-in): time the message arrives - eventTime. Departure Lag (Lag-out): time the message leaves - eventTime. Lag-in @ A = T1 - T0 (e.g. 1 ms), B = T3 - T0 (e.g. 3 ms), C = T5 - T0 (e.g. 8 ms). Lag-out @ A = T2 - T0 (e.g. 2 ms), B = T4 - T0 (e.g. 4 ms), C = T6 - T0 (e.g. 10 ms)
  • 72. Lag : How do we compute it? Arrival Lag (Lag-in): time the message arrives - eventTime. Departure Lag (Lag-out): time the message leaves - eventTime. Lag-in @ A = T1 - T0 (e.g. 1 ms), B = T3 - T0 (e.g. 3 ms), C = T5 - T0 (e.g. 8 ms). Lag-out @ A = T2 - T0 (e.g. 2 ms), B = T4 - T0 (e.g. 4 ms), C = T6 - T0 (e.g. 10 ms). The most important metric for Lag in any streaming system is E2E Lag: the total time a message spent in the system
  • 73. Lag : How do we compute it? While it is interesting to know the lag for a particular message m1, it is of little use since we typically deal with millions of messages
  • 74. Lag : How do we compute it? While it is interesting to know the lag for a particular message m1, it is of little use since we typically deal with millions of messages Instead, we prefer statistics (e.g. P95) to capture population behavior
  • 75. Lag : How do we compute it? Some useful Lag statistics are: E2E Lag (p95) : 95th percentile time of messages spent in the system Lag_[in|out](N, p95): P95 Lag_in or Lag_out at any Node N
  • 76. Lag : How do we compute it? Some useful Lag statistics are: E2E Lag (p95) : 95th percentile time of messages spent in the system Lag_[in|out](N, p95): P95 Lag_in or Lag_out at any Node N Process_Duration(N, p95) : Time Spent at any node in the chain! Lag_out(N, p95) - Lag_in(N, p95)
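As a sketch of how these per-node statistics might be collected, each node can stamp arrival and departure times per message and derive the three lag numbers; the class and method names below are illustrative, not from the deck. In practice, each node would publish these values to a metrics system and report the p95 over a window.

```java
/** Illustrative per-node lag bookkeeping for a single message. */
public final class NodeLag {
    private final long eventTimeMs;   // creation time stamped at the source
    private long arrivalTimeMs;
    private long departureTimeMs;

    public NodeLag(long eventTimeMs) { this.eventTimeMs = eventTimeMs; }

    public void onArrival()   { arrivalTimeMs = System.currentTimeMillis(); }
    public void onDeparture() { departureTimeMs = System.currentTimeMillis(); }

    public long lagIn()           { return arrivalTimeMs - eventTimeMs; }    // arrival lag
    public long lagOut()          { return departureTimeMs - eventTimeMs; }  // departure lag
    public long processDuration() { return lagOut() - lagIn(); }             // time spent at this node
}
```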
  • 77. Lag : How do we compute it? Process_Duration graphs show you the contribution to overall Lag from each hop
  • 78. Loss : What is it?
  • 79. Loss : What is it? Loss is simply a measure of messages lost while transiting the system
  • 80. Loss : What is it? Loss is simply a measure of messages lost while transiting the system Messages can be lost for various reasons, most of which we can mitigate!
  • 81. Loss : What is it? Loss is simply a measure of messages lost while transiting the system Messages can be lost for various reasons, most of which we can mitigate! The greater the loss, the lower the data quality
  • 82. Loss : What is it? Loss is simply a measure of messages lost while transiting the system Messages can be lost for various reasons, most of which we can mitigate! The greater the loss, the lower the data quality Hence, our goal is to minimize loss in order to deliver high quality insights
  • 83. Loss : How do we compute it?
  • 84. Loss : How do we compute it? Concepts : Loss. Loss can be computed as the set difference of messages between any 2 points in the system
  • 85. Loss : How do we compute it? Track each Message Id across the pipeline's checkpoints (1 = seen, 0 = missing): m1: 1 1 1 1; m2: 1 1 1 1; m3: 1 0 0 0; …; m10: 1 1 0 0. Count per checkpoint: 10, 9, 7, 5. Per Node Loss(N): 0, 1, 2, 2. E2E Loss: 5. E2E Loss %: 50%
  • 86. Loss : How do we compute it? In a streaming data system, messages never stop flowing. So, how do we know when to count?
  • 87. Loss : How do we compute it? In a streaming data system, messages never stop flowing. So, how do we know when to count? Solution : Allocate messages to 1-minute wide time buckets using message eventTime
  • 88. Loss : How do we compute it? The same loss table, now computed per eventTime bucket (here @12:34p): m1: 1 1 1 1; m2: 1 1 1 1; m3: 1 0 0 0; …; m10: 1 1 0 0. Count per checkpoint: 10, 9, 7, 5. Per Node Loss(N): 0, 1, 2, 2. E2E Loss: 5. E2E Loss %: 50%
  • 89. Loss : How do we compute it? (minute buckets @12:34p through @12:40p, each holding the messages whose eventTime falls in that minute)
  • 90. Loss : How do we compute it? ('Now' marks the newest bucket)
  • 91. Loss : How do we compute it? (recent buckets are still updated with late-arrival data)
  • 92. Loss : How do we compute it? (once a bucket has settled, its Loss is computed)
  • 93. Loss : How do we compute it? (the oldest buckets age out)
  • 94. Loss : How do we compute it? In a streaming data system, messages never stop flowing. So, how do we know when to count? Solution : Allocate messages to 1-minute wide time buckets using message eventTime. Wait a few minutes for messages to transit, then compute loss (e.g. the 12:35p loss table). Raise alarms if loss occurs over a configured threshold (e.g. > 1%)
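A minimal sketch of this bucketed loss accounting; the class, node names, and method names are illustrative assumptions. A bucket would only be evaluated a few minutes after it closes (to absorb late arrivals), alarmed above the configured threshold, and then aged out.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative per-minute loss accounting: each node reports the message ids it has seen. */
public class LossTracker {
    // minuteBucket (eventTime / 60_000) -> node -> message ids seen at that node
    private final Map<Long, Map<String, Set<String>>> seen = new ConcurrentHashMap<>();

    public void record(String node, String messageId, long eventTimeMs) {
        long minuteBucket = eventTimeMs / 60_000;
        seen.computeIfAbsent(minuteBucket, b -> new ConcurrentHashMap<>())
            .computeIfAbsent(node, n -> ConcurrentHashMap.newKeySet())
            .add(messageId);
    }

    /** E2E loss % between ingest and expel for one closed minute bucket. */
    public double lossPercent(long minuteBucket, String ingestNode, String expelNode) {
        Map<String, Set<String>> byNode = seen.getOrDefault(minuteBucket, Map.of());
        Set<String> in  = byNode.getOrDefault(ingestNode, Set.of());
        Set<String> out = byNode.getOrDefault(expelNode, Set.of());
        long lost = in.stream().filter(id -> !out.contains(id)).count();
        return in.isEmpty() ? 0.0 : 100.0 * lost / in.size();
    }
}
```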
  • 95. Loss : How do we compute it? But wait… We now have a way to measure the reliability (via Loss metrics) and latency (via Lag metrics) of our system.
  • 96. Performance (have we tuned our system for performance yet??)
  • 97. Performance Goal : Build a system that can deliver messages reliably from S to D with low latency. To understand streaming system performance, let's understand the components of E2E Lag
  • 98. Performance Ingest Time : Time from Last_Byte_In_of_Request to First_Byte_Out_of_Response
  • 99. Performance Ingest Time : Time from Last_Byte_In_of_Request to First_Byte_Out_of_Response • This time includes the overhead of reliably sending messages to Kafka
  • 100. Performance Expel Time : Time to process and egest a message at D. This includes time to wait for ACKs from external services!
  • 101. Performance E2E Lag : Total time messages spend in the system from message ingest to expel! (E2E Lag = Ingest Time + Transit Time + Expel Time)
  • 102. Performance Penalties (Trade-off: Latency for Reliability)
  • 103. Performance Challenge 1 : Ingest Penalty. In the name of reliability, S needs to call kProducer.flush() on every inbound API request. S also needs to wait for 3 ACKs from Kafka before sending its API response
  • 104. Performance Challenge 1 : Ingest Penalty. Approach : Amortization. Support Batch APIs (i.e. multiple messages per web request) to amortize the ingest penalty
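A sketch of what the amortized (batch) ingest path might look like, assuming a hypothetical batch endpoint hands the producer a list of messages; one flush and one round of acknowledgements covers the whole request.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Future;

public class BatchIngest {
    private final KafkaProducer<String, String> producer;

    public BatchIngest(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    /** Accept many messages per web request so the ack-wait is paid once per batch. */
    public void ingestBatch(List<String> messages) throws Exception {
        List<Future<RecordMetadata>> acks = new ArrayList<>();
        for (String m : messages) {
            acks.add(producer.send(new ProducerRecord<>("events", m))); // buffered, asynchronous
        }
        producer.flush();                      // one blocking wait amortized over the batch
        for (Future<RecordMetadata> ack : acks) {
            ack.get();                         // surface any per-record failure before responding
        }
    }
}
```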
  • 105. Performance Challenge 2 : Expel Penalty. Observations : Kafka is very fast — many orders of magnitude faster than HTTP RTTs. The majority of the expel time is the HTTP RTT
  • 106. Performance Challenge 2 : Expel Penalty. Approach : Amortization. In each D node, add batch + parallelism
  • 107. Performance Challenge 3 : Retry Penalty (@ D) Concepts In order to run a zero-loss pipeline, we need to retry messages @ D that will succeed given enough attempts
  • 108. Performance Challenge 3 : Retry Penalty (@ D) Concepts In order to run a zero-loss pipeline, we need to retry messages @ D that will succeed given enough attempts We call these Recoverable Failures
  • 109. Performance Challenge 3 : Retry Penalty (@ D) Concepts In order to run a zero-loss pipeline, we need to retry messages @ D that will succeed given enough attempts We call these Recoverable Failures In contrast, we should never retry a message that has 0 chance of success! We call these Non-Recoverable Failures
  • 110. Performance Challenge 3 : Retry Penalty (@ D) Concepts In order to run a zero-loss pipeline, we need to retry messages @ D that will succeed given enough attempts. We call these Recoverable Failures. In contrast, we should never retry a message that has 0 chance of success! We call these Non-Recoverable Failures. E.g. Any 4xx HTTP response code, except for 429 (Too Many Requests)
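A small illustrative helper capturing that classification (the class and method names are placeholders):

```java
/** Illustrative sketch: decide whether a failed delivery attempt is worth retrying. */
public final class FailureClassifier {

    public static boolean isRecoverable(int httpStatus) {
        if (httpStatus == 429) return true;                       // Too Many Requests: back off and retry
        if (httpStatus >= 400 && httpStatus < 500) return false;  // other 4xx: will never succeed
        return true;                                              // 5xx and other transient errors: retry may succeed
    }

    private FailureClassifier() {}
}
```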
  • 111. Performance Challenge 3 : Retry Penalty Approach We pay a latency penalty on retry, so we need to be smart about What we retry — Don't retry any non-recoverable failures How we retry
  • 112. Performance Challenge 3 : Retry Penalty Approach We pay a latency penalty on retry, so we need to be smart about What we retry — Don't retry any non-recoverable failures How we retry — One Idea : Tiered Retries
  • 113. Performance - Tiered Retries Local Retries : Try to send the message a configurable number of times @ D
  • 114. Performance - Tiered Retries Local Retries : Try to send the message a configurable number of times @ D. If we exhaust local retries, D transfers the message to a Global Retrier
  • 115. Performance - Tiered Retries Local Retries : Try to send the message a configurable number of times @ D. If we exhaust local retries, D transfers the message to a Global Retrier. Global Retries : The Global Retrier then retries the message over a longer span of time
  • 116. Performance - 2 Tiered Retries (diagram: the S → E → D pipeline plus a global retrier gr attached to D via retry topics; RI : Retry_In, RO : Retry_Out)
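A rough sketch of the 2-tiered idea at D, assuming a hypothetical "global-retries" topic that the global retrier consumes; the attempt count and names are illustrative, not the deck's implementation.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

/**
 * Illustrative 2-tiered retry at D: a few quick local attempts, then hand off to a
 * global retry topic that a separate retrier service replays over a longer time span.
 */
public class TieredRetrySender {
    private static final int MAX_LOCAL_ATTEMPTS = 3;

    private final KafkaProducer<String, String> producer;

    public TieredRetrySender(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    public void deliver(String message) {
        for (int attempt = 1; attempt <= MAX_LOCAL_ATTEMPTS; attempt++) {
            if (trySend(message)) return;    // success: done
        }
        // Local retries exhausted: publish to the global retrier's topic instead of
        // blocking the hot path any longer.
        producer.send(new ProducerRecord<>("global-retries", message));
    }

    private boolean trySend(String message) {
        // e.g. an HTTP POST to the external destination; return true only on a 2xx
        return false; // placeholder
    }
}
```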
  • 117. Performance At this point, we have a system that works well at low scale
  • 118. Scalability (But how does this system scale with increasing traffic?)
  • 119. Scalability Goal : Build a system that can deliver messages reliably from S to D with low latency up to a scale limit
  • 120. Scalability Goal : Build a system that can deliver messages reliably from S to D with low latency up to a scale limit What do I mean by a scale limit?
  • 122. Scalability First, let's dispel a myth! There is no such thing as a system that can handle infinite scale
  • 123. Scalability First, let's dispel a myth! There is no such thing as a system that can handle infinite scale. Every system is capacity limited — in the case of AWS, some of these limits are artificial! Capacity limits & performance bottlenecks can only be revealed under load
  • 124. Scalability First, let's dispel a myth! There is no such thing as a system that can handle infinite scale. Every system is capacity limited — in the case of AWS, some of these limits are artificial! Capacity limits & performance bottlenecks can only be revealed under load. Hence, we need to periodically load test our system at increasingly higher load to find and remove performance bottlenecks. Each successful test raises our scale rating!
  • 125. Scalability First, let's dispel a myth! There is no such thing as a system that can handle infinite scale. Every system is capacity limited — in the case of AWS, some of these limits are artificial! Capacity limits & performance bottlenecks can only be revealed under load. Hence, we need to periodically load test our system at increasingly higher load to find and remove performance bottlenecks. Each successful test raises our scale rating! By executing these tests ahead of organic load, we avoid customer-impacting message delays!
  • 126. Scalability How do we set a target scale rating for our load tests? Desired Scale Rating: 1 MM events/second @ ingest
  • 127. Scalability Is it enough to only specify the target throughput? How do we set a target scale rating for our load tests? Desired Scale Rating: 1 MM events/second @ ingest
  • 128. Scalability How do we set a target scale rating for our load tests? Desired Scale Rating: 1 MM events/second @ ingest. Is it enough to only specify the target throughput? No! Most systems experience increasing latency with increasing traffic! A scale rating which violates our E2E lag SLA is not useful! So, a scale rating must be qualified by a latency condition to be useful!
  • 129. Scalability How do we set a target scale rating for our load tests? Is it enough to only specify the target throughput? No! Most systems experience increasing latency with increasing traffic! A scale rating which violates our E2E lag SLA is not useful! So, a scale rating must be qualified by a latency condition to be useful! Updated Scale Rating: 1 MM events/second with <1s (p95) E2E lag
  • 130. Scalability But, how do we select a scale rating? What should it be based on?
  • 131. Scalability But, how do we select a scale rating? What should it be based on? @ Datazoom, we target a scale rating that is a multiple m of our organic peak! We maintain traffic alarms that tell us when our organic traffic exceeds 1/m of our previous scale rating. This tells us that a new test needs to be scheduled! We spend the duration of our load tests diagnosing system performance and identifying bottlenecks that result in higher than expected latency! Once we resolve all bottlenecks, we update our traffic alarms for our new higher scale rating based on a new target multiple m'
  • 132. Scalability - Autoscaling Autoscaling Goals (for data streams): Goal 1: Automatically scale out to maintain low latency (e.g. E2E Lag) Goal 2: Automatically scale in to minimize cost
  • 133. Scalability - Autoscaling Autoscaling Goals (for data streams): Goal 1: Automatically scale out to maintain low latency (e.g. E2E Lag) Goal 2: Automatically scale in to minimize cost
  • 134. Scalability - Autoscaling Autoscaling Goals (for data streams): Goal 1: Automatically scale out to maintain low latency (e.g. E2E Lag) Goal 2: Automatically scale in to minimize cost Autoscaling Considerations What can autoscale? What can’t autoscale?
  • 135. Scalability - Autoscaling Autoscaling Goals (for data streams): Goal 1: Automatically scale out to maintain low latency (e.g. E2E Lag) Goal 2: Automatically scale in to minimize cost Autoscaling Considerations What can autoscale? What can’t autoscale? MSK Serverless aims to bridge this, but it’s expensive!
  • 136. Scalability - Autoscaling @ Datazoom, we use K8s — this gives us all of the benefits of containerization! But since we run on the cloud, K8s pods need to scale on top of EC2, which itself can be autoscaled! Hence, we have a 2-level autoscaling problem!
  • 137. Scalability - HPA Journey First Attempt : Independently scale EC2 & K8s pods. K8s : HPA a.k.a. Horizontal Pod Autoscaler (scales on CPU). EC2 : EC2 autoscaling (scales on Memory). CloudWatch (CW) provides the scaling signal. Nov 25, 2020 : Major Kinesis outage in US-East-1
  • 138. Scalability - HPA Journey Nov 25, 2020 : Major Kinesis outage in US-East-1. Datazoom doesn't directly use Kinesis, but CloudWatch does! K8s scaled (in) to Min Pod counts, resulting in a gray failure (high lag & some message loss)
  • 139. Scalability - HPA Journey Switching to KEDA (the K8s Event Driven Autoscaler) handles this situation. We also switched to using the K8s Cluster Autoscaler to coordinate EC2 autoscaling with HPA!
  • 140. Scalability At this point, we have a system that works well within a traffic or scale rating
  • 141. Availability (But, what can we say about its availability??)
  • 143. Availability What does Availability mean? Availability measures the consistency with which a service meets the expectations of its users
  • 144. Availability What does Availability mean? Availability measures the consistency with which a service meets the expectations of its users For example, if I want to ssh into a machine, I want it to be running However, if its performance is degraded (e.g. low available memory), uptime is not enough! Any command I issue will take too long to complete
  • 145. Availability Up : • 0 message loss • E2E Lag (p95) <= SLA (e.g. 1s). Down : • Either message loss or the system is not accepting new messages • E2E Lag (p95) >>> SLA. (A Gray Zone sits between Up and Down)
  • 146. Availability Up : • 0 message loss • E2E Lag (p95) <= SLA (e.g. 1s). Down : • Either message loss or the system is not accepting new messages • E2E Lag (p95) > SLA
  • 147. Availability Up : • 0 message loss • E2E Lag (p95) <= SLA (e.g. 1s). Down : • Either message loss or the system is not accepting new messages • E2E Lag (p95) > SLA. Let's use this approach to define an Availability SLA for Data Streams!
  • 148. Availability Consider that there are 1440 minutes in a day. For a given minute (e.g. 12:04p), calculate the E2E lag (p95) for all events whose event_time falls in the 12:04p minute! If E2E lag must be < 1 second, compare this statistic to 1 second: Is E2E_Lag(p95) < 1 second? If yes, then this 1-minute period is within Lag SLAs. If no, then this 1-minute period is out of Lag SLAs; in the parlance of the "nines" model, this would constitute a minute of downtime!
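A small sketch of that bookkeeping, assuming the per-minute E2E lag p95 values are already available (in a fuller version, minutes with message loss would also be counted as downtime):

```java
import java.util.Map;

/** Illustrative daily availability calculation from per-minute E2E lag p95 values (ms). */
public class AvailabilityCalculator {

    public static double dailyAvailabilityPercent(Map<String, Double> p95LagByMinute, double slaMs) {
        long totalMinutes = 1440; // minutes in a day
        long downMinutes = p95LagByMinute.values().stream()
                .filter(p95 -> p95 >= slaMs)   // this minute breached the lag SLA
                .count();
        return 100.0 * (totalMinutes - downMinutes) / totalMinutes;
    }
}
```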
  • 149. Availability You may be familiar with the "nines of availability" chart below (downtime per year / quarter / month / week / day): 99% ("two nines"): 3.65 days / 21.9 hours / 7.31 hours / 1.68 hours / 14.40 minutes. 99.5% ("two and a half nines"): 1.83 days / 10.98 hours / 3.65 hours / 50.40 minutes / 7.20 minutes. 99.9% ("three nines"): 8.77 hours / 2.19 hours / 43.83 minutes / 10.08 minutes / 1.44 minutes. 99.95% ("three and a half nines"): 4.38 hours / 65.7 minutes / 21.92 minutes / 5.04 minutes / 43.20 seconds. 99.99% ("four nines"): 52.60 minutes / 13.15 minutes / 4.38 minutes / 1.01 minutes / 8.64 seconds
  • 150. Availability The next question we must ask about our system is whether the downtime SLAs should be computed on a daily, weekly, monthly, etc… basis. 99.9% ("three nines") allows 8.77 hours of downtime per year, 2.19 hours per quarter, 43.83 minutes per month, 10.08 minutes per week, and 1.44 minutes per day. For 3 nines' uptime computed daily, no more than 1.44 minutes of downtime is allowed!
  • 151. Availability Can we enforce availability? No! Availability is intrinsic to a system - it is impacted by calamities, many of which are out of our control! The best we can do is to measure & monitor it!
  • 152. Availability We now have a system that reliably delivers messages from source S to destination D at low latency below its scale rating. We also have a way to measure its availability! Can we enforce availability? No! Availability is intrinsic to a system - it is impacted by calamities, many of which are out of our control! The best we can do is to measure & monitor it!
  • 153. Availability • We have a system with the NFRs that we desire!
  • 154. Practical Challenges (What are some practical challenges that we face?)
  • 155. Practical Challenges Aside from building a lossless system, the next biggest challenge is keeping latency low! For example, what factors influence latency in the real world?
  • 156. Practical Challenges Minimizing latency is a practice in noise reduction!
  • 157. Practical Challenges Minimizing latency is a practice in noise reduction! The following factors impact latency in a streaming system! Network lookups needed for processing, Autoscaling - Kafka Rebalancing, AWS Maintenance Tasks (e.g. Kafka), Bad Releases
  • 158. Practical Challenges - Latency Consider a modern pipeline as shown below: Ingest → Normalize → Enrich → Route → Transform → Transmit. A very common practice is to forbid any network lookups along the stream processing chain above! However, most of the services above need to look up some configuration data as part of processing, e.g. the Enrichment, Router, Transformer, and Transmitter services
  • 159. Practical Challenges - Latency A common approach would be to pull data from a central cache such as Redis (Get). If lookups can be achieved in 1-2 ms, this lookup penalty may be acceptable.
  • 160. Practical Challenges - Latency In practice, we observe 1-minute outages whenever failovers occur in Redis HA clusters! To shield ourselves against any latency gaps, we adopt some caching practices!
  • 161. Practical Challenges - Latency Practice 1 : Adopt Caffeine (Local Cache) across all services • Source : https://github.com/ben-manes/caffeine • Caffeine is a high performance, near optimal caching library. LoadingCache<K, V> cache = Caffeine.newBuilder() .refreshAfterWrite(stalenessInterval, timeUnit) .expireAfterWrite(expirationInterval, timeUnit) .maximumSize(maxNumOfEntries) .build(loader);
  • 162. Practical Challenges - Latency Load a cache entry (v1). LoadingCache<K, V> cache = Caffeine.newBuilder() .refreshAfterWrite(2, TimeUnit.MINUTES) .expireAfterWrite(30, TimeUnit.DAYS) .maximumSize(maxNumOfEntries) .build(loader);
  • 163. Practical Challenges - Latency Load a cache entry (v1): refreshAfterWrite = 2 mins, expireAfterWrite = ∞ (30 days in the snippet). LoadingCache<K, V> cache = Caffeine.newBuilder() .refreshAfterWrite(2, TimeUnit.MINUTES) .expireAfterWrite(30, TimeUnit.DAYS) .maximumSize(maxNumOfEntries) .build(loader);
  • 164–168. Practical Challenges - Latency (timeline animation: lookups within the 2-minute refreshAfterWrite window return v1; a lookup after the window triggers an asynchronous reload, which keeps serving v1 until the refreshed value v2 is loaded, after which lookups return v2; expireAfterWrite = ∞)
  • 169. Practical Challenges - Latency Practice 2 : Eager load all cache entries for all caches into Caffeine at application start. The microservice application (pod) is not made ready until its caches load. Hence, until caches load, no traffic passes through the initializing application pod! With this in place, we avoid latency spikes due to Redis issues & lazy-init penalties!
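A sketch of Practices 1 and 2 together: build the Caffeine cache, bulk-load every known key at startup, and only flip the readiness flag afterwards. The fetchFromStore method and the key set are hypothetical stand-ins for the real lookups.

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;

import java.time.Duration;
import java.util.Set;

public class ConfigCache {
    private final LoadingCache<String, String> cache = Caffeine.newBuilder()
            .refreshAfterWrite(Duration.ofMinutes(2))  // async refresh keeps entries fresh
            .expireAfterWrite(Duration.ofDays(30))     // effectively "never" for our purposes
            .maximumSize(10_000)
            .build(this::fetchFromStore);

    private volatile boolean ready = false;

    /** Called once at application start, before the readiness probe is allowed to pass. */
    public void warmUp(Set<String> knownKeys) {
        cache.getAll(knownKeys); // eager bulk load: no lazy-init penalty on the hot path
        ready = true;            // the readiness probe can now return success
    }

    public boolean isReady() { return ready; }

    public String get(String key) { return cache.get(key); }

    private String fetchFromStore(String key) {
        return "value-for-" + key; // placeholder for the Redis / central-store lookup
    }
}
```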
  • 170. Practical Challenges Minimizing latency is a practice in noise reduction! The following factors impact latency in a streaming system! Network lookups needed for processing, AWS Maintenance Tasks (e.g. Kafka), Internal Operational Processes, Autoscaling - Kafka Rebalancing
  • 171. Practical Challenges - Latency Anytime AWS patched our Kafka clusters, we noticed large latency spikes across the pipeline (Ingest → Normalize → Enrich → Route → Transform → Transmit). Since AWS patches brokers in a rolling fashion, we typically see about 3 latency spikes over 30 minutes. These violated our availability SLAs since we cannot afford more than 1 latency spike out of Lag SLA. Recall: 3 nines honored daily -> 1.44 minutes out of Lag SLA
  • 172. MSK Patching Impacts Lag (graph: typical E2E Lag vs. lag spikes due to MSK maintenance)
  • 173. Practical Challenges - Latency To solve this issue, we adopted a Blue-Green operating model! Between weekly releases or scheduled maintenance, one color side receives traffic (Light), while the other color side receives none (Dark). (diagram: two identical pipelines Ingest → Normalize → Enrich → Route → Transform → Transmit; customer traffic goes to the Light side while patches are applied to the Dark side)
  • 174. Practical Challenges - Latency Cost Efficiency : The Dark side is scaled to 0 to save on costs! (diagram: customer traffic on the Light side; patches applied on the Dark side)
  • 175. Practical Challenges - Latency By switching traffic away from the side scheduled for AWS maintenance, we avoid a large cause of latency spikes. With this success, we immediately began leveraging the Blue-Green model for other operations as well, such as:
  • 176. Practical Challenges - Latency By switching traffic away from the side scheduled for AWS maintenance, we avoid a large cause of latency spikes. With this success, we immediately began leveraging the Blue-Green model for other operations as well, such as: Software Releases, Canary Testing, Load Testing, Scheduled Maintenance, Technology Migrations, Outage Recovery, Doubling Capacity due to Sustained Traffic Burst
  • 177. Practical Challenges Minimizing latency is a practice in noise reduction! The following factors impact latency in a streaming system! Network lookups needed for processing, AWS Maintenance Tasks (e.g. Kafka), Internal Operational Processes, Autoscaling - Kafka Rebalancing
  • 178. Practical Challenges - Latency During autoscaling, we noticed latency spikes! These lasted less than 1 minute, but compromised our availability SLAs. This has to do with Kafka consumer rebalancing! Kafka consumers subscribe to topics. Topics are split into physical partitions! These partitions are load balanced across multiple consuming K8s Pods in the same Consumer Group. During autoscaling activity, the number of consumers in a Consumer Group changes, causing Kafka to execute a partition reassignment (a.k.a. rebalance)
  • 179. Practical Challenges - Latency The current default rebalancing behavior (via the range assignor) is a stop-the-world action. This causes pretty large pauses in consumption, which result in latency spikes! An alternative rebalancing algorithm, called KICR or Kafka Incremental Cooperative Rebalancing, is available via the cooperative sticky assignor
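Switching a consumer to incremental cooperative rebalancing is essentially a one-line configuration change; a sketch (bootstrap servers, group id, and the factory name are illustrative):

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.util.Properties;

public class CooperativeConsumerFactory {

    /** Builds a consumer that uses incremental cooperative rebalancing. */
    public static KafkaConsumer<String, String> create(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        // Swap the default (eager, stop-the-world) assignor for the cooperative sticky one,
        // so rebalances during autoscaling only pause the partitions that actually move.
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                CooperativeStickyAssignor.class.getName());
        return new KafkaConsumer<>(props);
    }
}
```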
  • 180. Practical Challenges - Latency The graph below shows the impact of using KICR (KICR vs. default rebalancing)
  • 181. Practical Challenges - Latency Unfortunately, in using the Cooperative Sticky Assignor, we noticed significant duplicate consumption during autoscaling behavior. In one load test, duplicate consumption @ one service was > 100%. ( https://issues.apache.org/jira/browse/KAFKA-14757 )
  • 182. Conclusion • We now have a system with the NFRs that we desire! • While we’ve covered many key elements, we had to skip some for time (e.g. Isolation, Containerization, K8s Autoscaling). • These will be covered in upcoming blogs! Follow for updates on (@r39132) • We have also shown a real-life system and challenges that go with it!
  • 183. Acknowledgments PayPal Datazoom • Vincent Chen • Anisha Nainani • Pramod Garre • Harsh Bhimani • Nirmalya Ghosh • Yash Shah • Deepak Mohankumar Chandramouli • Dheeraj Rampally • Romit Mehta • Kafka team (Maulin Vasavada, Na Yang, Doron Mimon) • Sandhu Santhakumar • Akhil S. Varughese • Shiju V. • Sunny Dabas • Shahanas S. • Naveen Chandra • Thasni M. Salim • Serena Eldhose • Gordon Lai • Fajib Aboobacker • Josh Evans • Bob Carlson • Tony Gentille