How to build observability into Serverless (BuildStuff 2018)

How to build observability into Serverless
Yan Cui @theburningmonk

Abraham Wald
Wald noted that the study only
considered the aircraft that had survived
their missions—the bombers that had
been shot down were not present for the
damage assessment.
The holes in the returning aircraft, then,
represented areas where a bomber could
take damage and still return home safely.

survivor bias in monitoring
We only focus on failure modes that we successfully
identified through investigation and postmortems in the past.
The bullet holes that shot us down but that we couldn’t identify stay
invisible, and will continue to shoot us down.

What do I mean by “observability”?

Monitoring
watching out for
known failure modes
in the system,
e.g. network I/O, CPU,
memory usage, …

Observability
being able to debug
the system, and gain
insights into the
system’s behaviour

In control theory, observability is a measure of how well
internal states of a system can be inferred from
knowledge of its external outputs.
https://en.wikipedia.org/wiki/Observability

Known Success
Known Errors
easy to monitor!

Known Unknowns

Known Unknowns
Unknown Unknowns

invisible bullet
holes

only alert on
this

alert on the
absence of this!

what went wrong?

“These are the four pillars of the Observability Engineering
team’s charter:
• Monitoring
• Alerting/Visualization
• Distributed systems tracing infrastructure
• Log aggregation/analytics”
- Observability Engineering at Twitter, http://bit.ly/2DnjyuW

microservices death stars circa 2015

mm… I wonder what’s
going on here…

I got this!

About me
▪ Principal Engineer at DAZN
▪ AWS Serverless Hero
▪ Author of Production-Ready Serverless* by Manning
▪ Blogger**
▪ Speaker
* https://bit.ly/production-ready-serverless
** https://theburningmonk.com

https://www.ft.com/content/07d375ee-6ee5-11e8-92d3-6c13e5c92914

https://www.theguardian.com/media/2018/may/14/streaming-service-dazn-netflix-sport-us-boxing-eddie-hearn

About DAZN
▪ Available in 7 countries - Austria, Switzerland, Germany, Japan, Canada,
Italy and USA
▪ Available on 30+ platforms

About DAZN
▪ ~1,000,000 concurrent viewers at peak

follow @dazneng for
updates about the
engineering team
We’re hiring! Visit
engineering.dazn.com
to learn more.

NOWHERE
to install agents/daemons

new challenges
• nowhere to install agents/daemons

[Diagram: seven user requests, each served by a handler]
critical paths:
minimise user-facing latency
[Diagram: user requests served by handlers on the critical paths, with StatsD and rsyslog alongside]
background processing:
batched, asynchronous, low
overhead
[Diagram: user requests served by handlers on the critical paths; StatsD and rsyslog do batched, asynchronous, low-overhead background processing]
NO background processing
except what platform provides

new challenges
• no background processing

EC2
concurrency used to be
handled by your code

[Diagram: one EC2 instance next to five concurrent Lambda executions]
now, it’s handled by the
AWS Lambda platform

EC2
logs & metrics used to be
batched here

[Diagram: one EC2 instance next to five concurrent Lambda executions]
now, they are batched in each
concurrent execution, at best…

HIGHER concurrency to log
aggregation/telemetry system

new challenges
• higher concurrency to telemetry system

[Timeline: Lambda invocation, idle, garbage collection]
data is batched between
invocations
HIGH chance of data loss

new challenges
• high chance of data loss (if batching)

my code
send metrics
internet internet
press button something happens

?
functions are often chained together
via asynchronous invocations

?
[Diagram: SNS, Kinesis, CloudWatch Events, CloudWatch Logs, IoT, DynamoDB, S3, SES]
tracing ASYNCHRONOUS
invocations through so many
different event sources is difficult

new challenges
• asynchronous invocations
• high chance of data loss (if batching)

2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae
GOT is off air, what do I do now?
(in order: UTC timestamp, request ID, your log message)
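
As a rough illustration of the three parts of the log line above, the sketch below splits a raw line into timestamp, request ID and message. The function name `parseLambdaLogLine` is hypothetical, and it assumes the parts are separated by whitespace (tabs, spaces or a line break).

```javascript
// Hypothetical helper: splits "<timestamp> <request id> <message>" into parts.
// Assumes the first two fields contain no whitespace, as in the example above.
function parseLambdaLogLine(line) {
  const match = /^(\S+)\s+(\S+)\s+([\s\S]*)$/.exec(line.trim());
  if (!match) return null; // not in the expected three-part shape
  const [, timestamp, requestId, message] = match;
  return { timestamp, requestId, message };
}

const parsedLine = parseLambdaLogLine(
  "2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now?"
);
console.log(parsedLine);
```

A free-text message like this one is hard to query, which is part of the motivation for structured logging later in the talk.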

one log group per
function
one log stream for each
concurrent invocation

logs are not easily searchable in
CloudWatch Logs
me

CloudWatch Logs is an async event source for Lambda
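
A minimal sketch of what a log shipping function might look like when subscribed to CloudWatch Logs. It relies on the subscription event shape (`event.awslogs.data` holding base64-encoded, gzipped JSON with `logGroup`, `logStream` and `logEvents`). The names `decodeLogsEvent` and `shipLogs`, and the sample values, are hypothetical; the forwarding step is left as a placeholder.

```javascript
const zlib = require("zlib");

// Decodes a CloudWatch Logs subscription event into flat log entries.
function decodeLogsEvent(event) {
  const compressed = Buffer.from(event.awslogs.data, "base64");
  const payload = JSON.parse(zlib.gunzipSync(compressed).toString("utf8"));
  return payload.logEvents.map((e) => ({
    logGroup: payload.logGroup,
    logStream: payload.logStream,
    timestamp: e.timestamp,
    message: e.message,
  }));
}

// Hypothetical handler body for the shipping function.
async function shipLogs(event) {
  const entries = decodeLogsEvent(event);
  for (const entry of entries) {
    // Placeholder: forward the entry to your log aggregation system here.
    console.log(JSON.stringify(entry));
  }
  return entries.length;
}

// Local usage with a hand-built event in the same shape (values are made up).
const samplePayload = {
  logGroup: "/aws/lambda/get-index",
  logStream: "example-stream",
  logEvents: [{ id: "1", timestamp: 1468326277571, message: "GOT is off air, what do I do now?" }],
};
const sampleEvent = {
  awslogs: { data: zlib.gzipSync(JSON.stringify(samplePayload)).toString("base64") },
};
shipLogs(sampleEvent);
```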

[Chart: concurrent executions over time, against the regional max concurrency]
functions that are
delivering business value
ship logs

either set concurrency limit on the log shipping function
(and potentially lose logs due to throttling)
or…

1 shard = 1 concurrent execution
i.e. control the no. of concurrent
executions with no. of shards

use structured logging with JSON

https://stackify.com/what-is-structured-logging-and-why-developers-need-it/
https://blog.treasuredata.com/blog/2012/04/26/log-everything-as-json/

https://www.loggly.com/blog/8-handy-tips-consider-logging-json/

traditional loggers are too heavy for Lambda
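
One possible shape for a lightweight structured logger is sketched below. It is an assumption about what "lightweight" could mean here, not the logger used in the talk: `createLogger`, the level names and the injectable `write` function are all hypothetical.

```javascript
// Hypothetical minimal JSON logger: one JSON object per log line.
const LOG_LEVELS = { DEBUG: 0, INFO: 1, WARN: 2, ERROR: 3 };

function createLogger(minLevel = "INFO", write = console.log) {
  const log = (level, message, params = {}) => {
    // Drop anything below the configured level (e.g. DEBUG in production).
    if (LOG_LEVELS[level] < LOG_LEVELS[minLevel]) return;
    // level and message come last so params cannot overwrite them.
    write(JSON.stringify({ ...params, level, message }));
  };
  return {
    debug: (message, params) => log("DEBUG", message, params),
    info: (message, params) => log("INFO", message, params),
    warn: (message, params) => log("WARN", message, params),
    error: (message, params) => log("ERROR", message, params),
  };
}

const exampleLogger = createLogger("INFO");
exampleLogger.debug("not written at INFO level");
exampleLogger.info("fetching user info from user-api", { userId: "example-user" });
```

Because every line is a JSON object, fields such as `level` or `userId` can be filtered on directly once the logs reach the aggregation system.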

CloudWatch Logs
$0.50 per GB ingested
$0.03 per GB archived per month
1M invocations of a 128MB function (billed at 100ms each) =
$0.000000208 * 1M + $0.20 =
$0.408
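
The arithmetic above can be reproduced and compared against log ingestion cost with a short sketch. The prices are the ones quoted on the slide; the 100ms billed duration and the 1,000 bytes of logs per invocation are assumptions chosen only for illustration, and the function names are hypothetical.

```javascript
// Prices as quoted on the slide.
const PRICE_PER_100MS_AT_128MB = 0.000000208;
const PRICE_PER_1M_REQUESTS = 0.2;
const LOG_INGESTION_PRICE_PER_GB = 0.5;

// Cost of running the function, assuming each invocation is billed
// `billedUnits` blocks of 100ms (1 block by default).
function lambdaCost(invocations, billedUnits = 1) {
  const compute = invocations * billedUnits * PRICE_PER_100MS_AT_128MB;
  const requests = (invocations / 1e6) * PRICE_PER_1M_REQUESTS;
  return compute + requests;
}

// Cost of ingesting the logs those invocations write (decimal GB for simplicity).
function logIngestionCost(invocations, bytesPerInvocation) {
  const gigabytes = (invocations * bytesPerInvocation) / 1e9;
  return gigabytes * LOG_INGESTION_PRICE_PER_GB;
}

console.log(lambdaCost(1e6).toFixed(3)); // "0.408"
console.log(logIngestionCost(1e6, 1000).toFixed(2)); // "0.50"
```

Under these assumptions, writing about 1KB of logs per invocation already costs more to ingest than the function costs to run.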

DON’T leave debug logging ON in production

have to redeploy ALL the
functions along the call path to
collect all relevant debug logs

https://github.com/middyjs/middy

[Diagram: one EC2 instance next to five concurrent Lambda executions]
Concurrency is handled by
the AWS Lambda platform

sampling decision has to be
followed by an entire call chain
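
A sketch of how a sampling decision might be made once and then followed down the call chain. The `Debug-Log-Enabled` header name appears later in the deck; the 1% sample rate, the string values "true"/"false" and the function names are assumptions for illustration.

```javascript
// Decide whether this invocation should log at debug level.
// If an upstream caller already decided, follow that decision; otherwise sample.
function shouldEnableDebug(headers = {}, sampleRate = 0.01, random = Math.random) {
  const upstream = headers["Debug-Log-Enabled"];
  if (upstream !== undefined) return upstream === "true";
  return random() < sampleRate; // first function in the chain makes the call
}

// Headers to pass to downstream calls so they follow the same decision.
function debugHeaders(debugEnabled) {
  return { "Debug-Log-Enabled": String(debugEnabled) };
}

const decision = shouldEnableDebug({}); // entry point: sampled
const downstream = shouldEnableDebug(debugHeaders(decision)); // follows the caller
console.log(decision === downstream); // true
```

This way debug logs for a sampled request are collected from every function on its path without redeploying any of them.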

Initial Request ID
User ID
Session ID
User-Agent
Order ID
…

every function needs to do the right thing and
propagate information such as correlation IDs
along to APIs, streams, queues, etc.

invest in tools to make it easy to do the “right thing”

nonintrusive
extensible
consistent
works for streams

store correlation IDs in global variable

use middleware to auto-capture incoming correlation IDs

extract correlation IDs from
invocation event, and store them in
the correlation-ids module
reset

logger to always include captured correlation IDs

HTTP and AWS SDK clients to auto-forward correlation IDs on

[Diagram: get-index uses context.awsRequestId as its x-correlation-id]

{
  "headers": {
    "x-correlation-id": "…"
  },
  …
}
get-index

{
  "body": null,
  "resource": "/restaurants",
  "headers": {
    "x-correlation-id": "…"
  },
  …
}
get-index get-restaurants

[Diagram: get-index calls get-restaurants; values are captured from the function event into global.CONTEXT, included by log.info(…), and forwarded on]
global.CONTEXT
x-correlation-id = …
x-correlation-xxx = …
headers["User-Agent"]
headers["Debug-Log-Enabled"]
headers["x-correlation-id"]

those extra 10-20ms for
sending custom metrics would
compound when you have
microservices and multiple
APIs are called within one slice
of user event

Amazon found every 100ms of latency cost them 1% in sales.
http://bit.ly/2EXPfbA

console.log("hydrating yubls from db…");
console.log("fetching user info from user-api");
console.log("MONITORING|1489795335|27.4|latency|user-api-latency");
console.log("MONITORING|1489795335|8|count|yubls-served");
logs: the first two lines
metrics: the MONITORING lines, in the form MONITORING|timestamp|metric value|metric type|metric name
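
A sketch of both sides of this convention: writing a metric as a log line, and picking metric lines back out on the log processing side. The pipe-delimited layout follows the example above; the function names `formatMetric` and `parseMetric` are hypothetical.

```javascript
// Function side: emit a metric as a log line instead of calling a metrics API.
function formatMetric(name, type, value, timestamp) {
  return ["MONITORING", timestamp, value, type, name].join("|");
}

// Log processing side: returns the metric, or null for ordinary log messages.
function parseMetric(message) {
  const parts = message.trim().split("|");
  if (parts.length !== 5 || parts[0] !== "MONITORING") return null;
  const [, timestamp, value, type, name] = parts;
  return { timestamp: Number(timestamp), value: Number(value), type, name };
}

console.log(formatMetric("user-api-latency", "latency", 27.4, 1489795335));
// MONITORING|1489795335|27.4|latency|user-api-latency
console.log(parseMetric("MONITORING|1489795335|8|count|yubls-served"));
console.log(parseMetric("fetching user info from user-api")); // null
```

Since the function only writes to stdout, publishing the metric adds no API call to the invocation; the trade-off is that the metric arrives after the logs have been processed.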

[Diagram: CloudWatch Logs and AWS Lambda; logs go to the ELK stack, metrics go to CloudWatch]

trade-off
delay
cost
concurrency
no latency overhead

API Gateway: send custom metrics
asynchronously
SNS, Kinesis, S3, …: send custom metrics as
part of function invocation

don’t span over async
invocations
good for identifying dependencies of a function,
but not good enough for tracing the entire call
chain as user request/data flows through the
system via async event sources.

don’t span over non-AWS services

make it easy to do the right thing

API Gateway and Kinesis
Authentication & authorisation (IAM, Cognito)
Testing
Running & Debugging functions locally
Log aggregation
Monitoring & Alerting
X-Ray
Correlation IDs
CI/CD
Performance and Cost optimisation
Error Handling
Configuration management
VPC
Security
Leading practices (API Gateway, Kinesis, Lambda)
Canary deployments
http://bit.ly/prod-ready-serverless
get 40% off with:
ytcui

@theburningmonk
theburningmonk.com
github.com/theburningmonk
