This is a near duplicate of my previous keynote deck, in which I talk about three examples of where I really felt the pain of not applying core observability techniques. The three techniques covered are:
- No pre-aggregation
- Arbitrarily wide events
- Exploration over dashboarding
Observability - Experiencing the “why” behind the jargon (FlowCon 2019)
1. @A_Bangser @FlowConFR #FlowCon
Observability - Experiencing the “why” behind the jargon
Abby Bangser
My slides are / will be available for you at: https://www.slideshare.net/AbigailBangser
7. @A_Bangser @FlowConFR #FlowCon
“measure of how well” means observability is a scale
How easy is it to answer a new question without deploying new code?
[Diagram labels: incident triage, “happening?!”, observability]
13. @A_Bangser @FlowConFR #FlowCon
So you might be thinking… “right, monitoring”
https://bravenewgeek.com/wp-content/uploads/2019/10/monitoring_vs_observability_overlay-1024x539.png
16. @A_Bangser @FlowConFR #FlowCon
True observability is discovering new behaviours
https://bravenewgeek.com/wp-content/uploads/2019/10/monitoring_vs_observability_overlay-1024x539.png
18. @A_Bangser @FlowConFR #FlowCon
Characteristics of what generates valuable outputs
https://thenewstack.io/observability-a-3-year-retrospective/
➔ raw events
➔ no pre-aggregation
➔ structured data
➔ arbitrarily wide events
➔ schema-less-ness
➔ high cardinality dimensions
➔ oriented around the lifecycle of the request
➔ batched up context
➔ static dashboards don’t work, it must be exploratory
By Twitter, CC BY 4.0, https://commons.wikimedia.org/w/index.php?curid=76921548
20. @A_Bangser @FlowConFR #FlowCon
Let’s understand a couple of these through examples
➔ raw events
➔ no pre-aggregation
➔ structured data
➔ arbitrarily wide events
➔ schema-less-ness
➔ high cardinality dimensions
➔ oriented around the lifecycle of the request
➔ batched up context
➔ static dashboards don’t work, it must be exploratory
22. @A_Bangser @FlowConFR #FlowCon
The promise of monitoring vs my reality
My rollercoaster journey with understanding metrics and pre-aggregation starts back in 2016...
23. @A_Bangser @FlowConFR #FlowCon
Monitorama 2016 - an awakening
Lessons include…
➔ It is not just testing that is dead
➔ Wow! There is a world of available data I have no idea about
➔ These tools are so cool...wait, what are these tools?
24. @A_Bangser @FlowConFR #FlowCon
Metrics can track success (and failure) of changes made
https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/#xref_monitoring_golden-signals
25. @A_Bangser @FlowConFR #FlowCon
An ask: I want to monitor live systems
An opportunity: Help create a client’s first cloud infrastructure
29. @A_Bangser @FlowConFR #FlowCon
Two years and many projects later Hobbsy had a plan
Track latency over 4 weeks and alert when current trends exceed 2 standard deviations
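As a rough sketch of that plan (not the code we actually ran): take the latency samples from the baseline window, compute their mean and standard deviation, and flag a current value that sits more than 2 standard deviations above the mean. The function name, the sample values and the choice to check only the high side are all assumptions for illustration.

```python
from statistics import mean, stdev


def exceeds_baseline(baseline, current, n_sigma=2.0):
    """Return True when `current` is more than `n_sigma` sample standard
    deviations above the mean of the `baseline` samples."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    threshold = mean(baseline) + n_sigma * stdev(baseline)
    return current > threshold


# Hypothetical latency samples (seconds) standing in for the 4-week window.
baseline_latency_s = [0.20, 0.22, 0.25, 0.21, 0.24, 0.23, 0.26, 0.22]

print(exceeds_baseline(baseline_latency_s, 0.24))  # within the usual range
print(exceeds_baseline(baseline_latency_s, 0.40))  # well above the usual range
```

Note that this needs the individual latency values (or at least their sum and sum of squares), which is exactly the assumption that bit us later.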
31. @A_Bangser @FlowConFR #FlowCon
To do this at MOO
s / MOO / any company over a few years old /
➔ 40 services
➔ 4 core languages
➔ 3 eras of architectural decisions
➔ 2 transport protocols (HTTP and gRPC)
...and a partridge in a pear tree
36. @A_Bangser @FlowConFR #FlowCon
Our data collection made certain assumptions which, in the end, required re-collecting the data in a different way
37. @A_Bangser @FlowConFR #FlowCon
How histograms get generated in a time series DB
http_requests_seconds_bucket
le=0.05 | le=0.1 | le=0.5 | le=1 | le=5 | le=+inf
* “le” stands for “less than or equal to”
www.moo.com/big_file in 5 seconds
www.moo.com in 0.25 seconds
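A minimal sketch of the bucketing above, assuming the buckets are cumulative (each request is counted in every bucket whose `le` bound it fits under, which is how Prometheus-style histograms behave). The `observe` helper and the dict layout are made up for illustration; the bounds and the two requests come from the slide.

```python
import math

# Upper bounds of the buckets, in seconds ("le" = less than or equal to).
BUCKET_BOUNDS = [0.05, 0.1, 0.5, 1.0, 5.0, math.inf]


def observe(bucket_counts, duration_s):
    """Increment every bucket whose upper bound is >= the observed duration."""
    for bound in BUCKET_BOUNDS:
        if duration_s <= bound:
            bucket_counts[bound] += 1


counts = {bound: 0 for bound in BUCKET_BOUNDS}
observe(counts, 0.25)  # www.moo.com in 0.25 seconds
observe(counts, 5.0)   # www.moo.com/big_file in 5 seconds

for bound, count in counts.items():
    print(f"le={bound}: {count}")
```

Note what is left afterwards: six counters. The URL and the exact duration of each request are gone.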
49. @A_Bangser @FlowConFR #FlowCon
So, while consistent metrics trending over time was a big step forward... in retrospect, these experiences were not mature observability
50. @A_Bangser @FlowConFR #FlowCon
Why avoid pre-aggregation?
Because you can never regain the original context and detail; you can only ever ask predetermined questions
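To make that concrete, here is a small sketch with made-up events (the field names `uri` and `duration_s` are assumptions). The pre-aggregated average is a single number and cannot tell you which URL is slow; the raw events can answer that question even though nobody planned for it.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw events, one per request.
raw_events = [
    {"uri": "/", "duration_s": 0.25},
    {"uri": "/", "duration_s": 0.30},
    {"uri": "/big_file", "duration_s": 5.0},
    {"uri": "/big_file", "duration_s": 4.0},
]

# Pre-aggregated view: one number, all per-request context is gone.
avg_latency_s = mean(event["duration_s"] for event in raw_events)


def mean_by(events, key, value):
    """Group events by `key` and return the mean of `value` for each group."""
    groups = defaultdict(list)
    for event in events:
        groups[event[key]].append(event[value])
    return {group: mean(values) for group, values in groups.items()}


# A new question, asked after the fact, answered from the raw events.
by_uri = mean_by(raw_events, "uri", "duration_s")
print(avg_latency_s)
print(by_uri)
```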
52. @A_Bangser @FlowConFR #FlowCon
Data is not the same as information
Step one is accepting that while sentences may be readable, <key : value> pairs are more easily queried.
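A hedged sketch of the difference, using JSON lines as one possible <key : value> format. The helper names and the field values are invented for illustration; the point is that a query becomes a plain field comparison instead of a regex over a sentence.

```python
import json


def to_log_line(**fields):
    """Serialise the given key/value pairs as one JSON log line."""
    return json.dumps(fields, sort_keys=True)


def query(lines, **criteria):
    """Return the parsed records whose fields equal all of the given criteria."""
    matches = []
    for line in lines:
        record = json.loads(line)
        if all(record.get(key) == value for key, value in criteria.items()):
            matches.append(record)
    return matches


# Hypothetical requests, logged as key/value pairs rather than sentences.
lines = [
    to_log_line(uri="/", status=200, duration_s=0.25),
    to_log_line(uri="/big_file", status=500, duration_s=5.0),
]

failures = query(lines, status=500)
print(failures)
```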
56. @A_Bangser @FlowConFR #FlowCon
So then we backfilled in structure
grok {
  # Break the "Request" field into scheme, host, optional port, path and optional query
  match => [
    "Request",
    "%{URIPROTO:request_uri_scheme}://%{HOSTNAME:request_uri_host}(?::%{POSINT:request_uri_port})?%{URIPATH:request_uri_path}(?:%{URIPARAM:request_uri_query})?"
  ]
}
57. @A_Bangser @FlowConFR #FlowCon
And of course, from there we wanted more
mutate {
  # Split the URI path into its segments on "/"
  split => { "uri_array" => "/" }
  # Expose the leading path segments as separately queryable fields
  add_field => {
    "uri_root" => ["/%{[uri_array][1]}"]
    "uri_first" => ["/%{[uri_array][2]}"]
    "uri_second" => ["/%{[uri_array][3]}"]
    "uri_root_first" => "%{uri_root}%{uri_first}"
    "uri_root_second" => "%{uri_root}%{uri_first}%{uri_second}"
  }
}
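For readers who do not speak Logstash, the two filters above do roughly the following. This is an approximation in Python, not a port: `uri_fields` is a hypothetical helper, `urlsplit` stands in for the grok pattern, and leaving out fields for missing path segments is a choice of this sketch.

```python
from urllib.parse import urlsplit


def uri_fields(request):
    """Split a request URL into the fields the grok and mutate filters produce."""
    parts = urlsplit(request)
    fields = {
        "request_uri_scheme": parts.scheme,
        "request_uri_host": parts.hostname,
        "request_uri_port": parts.port,    # None when the URL has no port
        "request_uri_path": parts.path,
        "request_uri_query": parts.query,  # without the leading "?"
    }

    # Name the first three path segments, as the mutate filter does.
    segments = [segment for segment in parts.path.split("/") if segment]
    for name, segment in zip(["uri_root", "uri_first", "uri_second"], segments):
        fields[name] = "/" + segment

    # Combined prefixes, useful for grouping requests at different depths.
    if len(segments) >= 2:
        fields["uri_root_first"] = fields["uri_root"] + fields["uri_first"]
    if len(segments) >= 3:
        fields["uri_root_second"] = fields["uri_root_first"] + fields["uri_second"]
    return fields


print(uri_fields("http://www.moo.com/a/b/c?x=1"))
```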
86. @A_Bangser @FlowConFR #FlowCon
In order to combat guessing based on tribal knowledge
when debugging our complex systems, we need:
A low-friction way to add fields to your logs for structure and searchability
Allowing application and user context to be wrapped in a business context
CustomerID: 234567
VersionOfApp: 2
RequestedUri: www.
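One way "low friction" could look in code, as a sketch only: a single event object that lives for the duration of a request, that any layer can add fields to, and that is emitted once at the end. `RequestEvent` and its methods are hypothetical names; `CustomerID` and `VersionOfApp` are the fields from the slide, while `status` and `duration_s` are made up.

```python
import json


class RequestEvent:
    """Collects arbitrary key/value context for one request, emitted as one wide event."""

    def __init__(self, **initial_fields):
        self.fields = dict(initial_fields)

    def add(self, **fields):
        """Attach more context; no schema change is needed to add a new field."""
        self.fields.update(fields)
        return self

    def emit(self):
        """Serialise everything collected so far as a single JSON line."""
        return json.dumps(self.fields, sort_keys=True)


# Business context is added where it is known...
request_event = RequestEvent(CustomerID=234567)
request_event.add(VersionOfApp=2)
# ...and application context is added as the request completes.
request_event.add(status=500, duration_s=5.0)

print(request_event.emit())
```

Because every field lands on the same event, a later question such as "are the 500s limited to one app version?" needs no new deploy.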
89. @A_Bangser @FlowConFR #FlowCon
Hmmm, a warning alert has come in
This is an automated warning alert triggered by a production service returning a high percentage of 500s!
102. @A_Bangser @FlowConFR #FlowCon
Let’s break down what this dashboard shows
[Dashboard screenshot: Request Counts and Response Latency panels, with series for Original Images, Enhanced Images, and Enhanced and resized]
107. @A_Bangser @FlowConFR #FlowCon
Why ditch the dashboards?
The scar tissue of your past outages is not a sufficient replacement for the creativity required to investigate your future incidents
https://www.needpix.com/photo/907639/images-leash-leash-polaroid-free-pictures-free-photos-free-images-royalty-free
108. @A_Bangser @FlowConFR #FlowCon
Let’s revisit those characteristics
➔ raw events
➔ no pre-aggregation
➔ structured data
➔ arbitrarily wide events
➔ schema-less-ness
➔ high cardinality dimensions
➔ oriented around the lifecycle of the request
➔ batched up context
➔ static dashboards don’t work, it must be exploratory
By Twitter, CC BY 4.0, https://commons.wikimedia.org/w/index.php?curid=80936515
The only way to ask new questions is to keep the original raw data available and queryable
Make data easy to add details to and easy to query
Empower creative and shared exploration based on business context
113. @A_Bangser @FlowConFR #FlowCon
Looking back, journeys are never clear, so why do we still expect them to be when we start a new one?
➔ QA
➔ TWU
➔ Political Science Major
➔ Data analysis for investments
➔ A desire to learn how to code
➔ Automation FTW!
➔ An “analyst” computer
➔ A “DevOps” friend engaged me in his work
➔ Monitorama
➔ An infrastructure project
➔ Platform Engineering @
➔ Professional scuba diver
➔ A (slight) obsession with observability
115. @A_Bangser @FlowConFR #FlowCon
➔ All of tech and product is now asking more interesting questions
➔ We are expecting more of our tooling
➔ We are building new awareness about our services and system
Start where you are.
Use what you have.
Do what you can.
- Arthur Ashe