Prometheus: From technical metrics to business observability

Prometheus
From technical monitoring to business
observability
Julien Pivotto (@roidelapluie)
Sysadmin Days Paris
October 18th, 2018

user{name="roidelapluie"} 1
I like Open Source
I like monitoring
I like automation
... and all of that is my daily job at Inuits

Sysadmin
Creative Commons Zero https://www.flickr.com/photos/freestocks/25668265836

Sysadmin's view
Access to a lot of components
Range from the frontends to the databases
With 24x7 oncall shifts

DevOps
In a DevOps world, more data, more awareness
More changes, different scale
Evolution
How can we keep up?

The DevOps principles: CAMS
(a definition of DevOps)
Culture
Automation
Measurement
Sharing
(Damon Edwards and John Willis, 2010 http://devopsdictionary.com/wiki/CAMS)
This talk is about all of it.

Monitoring
Creative Commons Attribution 2.0 https://www.flickr.com/photos/24375810@N06/3719090065

Creative Commons Attribution ShareAlike 2.0 https://www.flickr.com/photos/grendelkhan/400428874

Creative Commons Public Domain https://pxhere.com/en/photo/265717

Traditional Monitoring
It works - OK
It does not work - CRITICAL
It kinda works - WARNING
I don't know - UNKNOWN

Creative Commons Public Domain https://pxhere.com/fr/photo/952999

Creative Commons Attribution 2.0 https://www.flickr.com/photos/wwarby/2460655511

Creative Commons Attribution-Share Alike 3.0 Unported
https://commons.wikimedia.org/wiki/File:CUPE3903-picketLine-20180504.jpg

Real world
It works; it does not work; it kinda works; it
maybe works; no one uses it; it is broken; some
things are broken; it should work but it does not;
where are my users? help me...

The Technical bias
By looking only at technical services, we miss the
actual point
Are we serving our users correctly?
Just looking at the traffic light will not tell you
about the traffic jams.

Observability

Metrics
Creative Commons Attribution-Share Alike 2.0 https://www.flickr.com/photos/tillwe/11892564676/

Metric
Name
Labels (Key-Value Pairs)
Value (Number)
Timestamp
Fetched at a high frequency

Name: Number of HTTP requests
Labels:
status: 200
vhost: inuits.eu
method: post
Value: 1823
Timestamp: Thu Oct 18 10:18:06 CEST 2018

Name: Number of HTTP requests
Labels:
status: 200
vhost: inuits.eu
method: post
Value: 2123
Timestamp: Thu Oct 18 10:18:36 CEST 2018

300 requests in 30 s = 10 requests per second
(POST for inuits.eu with response code 200)
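The arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration using the two samples shown above; the function name `per_second_rate` is an assumption, and unlike the real PromQL `rate()` function this sketch does not handle counter resets.

```python
from datetime import datetime


def per_second_rate(v1, t1, v2, t2):
    """Average per-second increase of a counter between two samples."""
    elapsed = (t2 - t1).total_seconds()
    if elapsed <= 0:
        raise ValueError("samples must be in chronological order")
    return (v2 - v1) / elapsed


# The two samples from the slides, taken 30 seconds apart.
first = datetime(2018, 10, 18, 10, 18, 6)
second = datetime(2018, 10, 18, 10, 18, 36)

rate = per_second_rate(1823, first, 2123, second)
print(rate)  # 10.0
```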

http_request_total{job="fe",instance="fe1",code="200"}

Types of metrics
Counters
Gauges
Histograms
Summaries

Counters
Always go up
start from zero
rate, increase
e.g. number of http requests

Gauges
Go up and down
Average, sum, max, ... (over time)
e.g. concurrent users

Histograms and summaries
Sets of requests
Using "buckets"
Useful to get duration, percentiles, SLA
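As a rough illustration of how percentiles can be read from buckets, the sketch below estimates a quantile from cumulative bucket counts by interpolating linearly inside the matching bucket. The bucket bounds and counts are made-up example values, and `estimate_quantile` is a hypothetical helper, not the Prometheus implementation (PromQL offers `histogram_quantile` for this).

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with an infinite bound,
    each count including all observations at or below that bound.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # No upper limit to interpolate towards in the last bucket.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Illustrative request durations in seconds (assumed values).
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
p50 = estimate_quantile(0.5, buckets)
p95 = estimate_quantile(0.95, buckets)
print(round(p50, 3), round(p95, 3))
```

The estimate is only as precise as the bucket layout: the 95th percentile here is known to lie between 0.5 s and 1.0 s, and the interpolation picks a point inside that range.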

Metrics and monitoring
Metrics do not represent problems
Metrics represent a state, give insights
Metrics can be graphed
You can alert based on them

Exposed metrics are "raw"
In general you can just expose counters, and let
the monitoring server do the real maths.
That keeps the overhead on the apps very low.
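To show how little an application has to do, here is a minimal sketch that renders a raw counter in the Prometheus text format without doing any maths. The helper `render_counter` and the `method` label are assumptions for illustration; real applications would normally use an official client library, and this sketch does not escape special characters in label values.

```python
def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format.

    samples maps a tuple of (label, value) pairs to the raw counter value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# The application only keeps a running total; rates are computed server-side.
samples = {(("code", "200"), ("method", "post")): 2123}
text = render_counter("http_request_total", "Number of HTTP requests", samples)
print(text)
```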

Tooling
Creative Commons Attribution 2.0 https://www.flickr.com/photos/psd/5298483

What are the needs?
Ingest metrics at high frequency
React to changes
Empower people
Alert on metrics

Use one toolchain
Creative Commons Attribution-ShareAlike 2.0
https://www.flickr.com/photos/161054138@N08/37880775085

Stop with:
Having one "monitoring" stack plus one "graphing" stack
Big all-in-one tools: think decentralized, think scale
Auto Discovery (use service discovery instead)
Manual config
Fragile monitoring (think HA)

Prometheus
https://prometheus.io/

Prometheus
Open Source monitoring tool
Complete Ecosystem
For cloud and on prem
Built around metrics

Cloud Native
Easy to configure, deploy, maintain
Designed in multiple services
Container ready
Orchestration ready (dynamic config)
Fuzziness

Efficient
"Scrapes" millions of metrics
Scales
Manages its own optimized db
(prometheus/tsdb)

Pull vs Push
Prometheus pulls metrics
But does not know what it will get!
The target decides what to expose
(short term batches can still push to a
"pushgateway")

Exporters
Expose metrics with an HTTP API
Bindings available for many languages (for
"native" metrics)
Exporters do not save data; they are not
"proxies" and don't "cache" anything

Common exporters
Node Exporter: Linux System Metrics
Grok Exporter: Metrics from log files
SNMP Exporter: Network devices
Blackbox Exporter: TCP, DNS, HTTP requests

Grafana
Open Source (Apache 2.0)
Web app
Specialized in visualization
Pluggable
Multiple datasources: prometheus, graphite,
influxdb...
Has an API!

Grafana and Prometheus
Prometheus used to ship its own consoles
It now recommends Grafana and has deprecated
its own consoles

Business Metrics

What are business metrics?
Metrics that effectively tell you how well you fulfill
your customers' requests
They show the quality and level of service
provided to customers

CPU usage is not money
https://www.flickr.com/photos/nox_noctis_silentium/3960497840

Where to get them?
Frontends
Databases
Caching systems (sessions, ...)
...
Each one of them requires a cross-team
understanding of the business.

Where to start?
Creative Commons Attribution 2.0 https://www.flickr.com/photos/franckmichel/16265376747/

USE
Brendan Gregg's USE method
U = Utilisation S = Saturation E = Errors
For resources like network, CPU, memory,...
Also asynchronous processes, ...

RED
Tom Wilkie's RED method
R = Requests E = Errors D = Duration
HTTP requests, synchronous processes, ...

What to get?
Request Rate
Saturation
Error Rate
Duration

Before we dig in...
What we will see now is monitoring data. It should
not be used where exact figures are required, such as invoicing.

Caching System Monitoring
(USE)

What do we learn?
Users can connect to the platform: The
authentication works
The platform is currently used

Benefits
Connected users = they can use the platform
Know when you can do maintenance
Know about your users' general habits (trends)

Database
Using SQL exporters to query the data from
your database
Requires a cross team approach
Gets you fine grained, quality data

Database trap
Do not try to replace BI/Reporting
Do not take too many labels -- stay in the
monitoring area

sum by (instance, env) (
rate(http_requests_duration_count[5m])
)

sum by (code, env) (
rate(http_requests_duration_count{code!="200"}[5m])
) / ignoring (code) group_left
sum by (env) (
rate(http_requests_duration_count[5m])
)

sum by (env) (
rate(http_requests_duration_sum[5m])
) /
sum by (env) (
rate(http_requests_duration_count[5m])
)
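The queries above boil down to simple arithmetic on counter increases over a 5 minute window: request rate, share of non-200 responses, and mean duration. The sketch below reproduces that arithmetic on two snapshots; all sample values and the `red_summary` helper are assumptions made for illustration, not output from a real system.

```python
import math

WINDOW_SECONDS = 300  # matches the [5m] range used in the queries


def red_summary(count_then, count_now, sum_then, sum_now):
    """Compute request rate, error ratio per code and mean duration.

    count_* map an HTTP status code to the cumulative request count,
    sum_* are the cumulative request durations in seconds.
    """
    deltas = {code: count_now[code] - count_then.get(code, 0)
              for code in count_now}
    total = sum(deltas.values())
    request_rate = total / WINDOW_SECONDS
    if total == 0:
        return request_rate, {}, math.nan
    error_ratio = {code: delta / total
                   for code, delta in deltas.items() if code != "200"}
    mean_duration = (sum_now - sum_then) / total
    return request_rate, error_ratio, mean_duration


# Two snapshots taken five minutes apart (made-up numbers).
rate, errors, duration = red_summary(
    {"200": 1000, "500": 10}, {"200": 3970, "500": 40}, 100.0, 700.0)
print(rate, errors, duration)
```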

What can we learn?
We have traffic from outside
How much traffic
Quality of the traffic
How long it really takes (end to end)

Adding Time
Creative Commons Attribution-ShareAlike 2.0 https://www.flickr.com/photos/rswilson74/3375654385

Timeseries
How we use time: We take the metrics for the
last 7 weeks
We take the median value (exclude 3 top and 3
low)
Excludes anomalies due to
incidents/holidays...

http_requests:rate5m offset 1w
offset queries data in the past

- record: past_request_rate
  expr: http_requests:rate5m offset 1w
  labels:
    when: 1w

- record: past_request_rate
  expr: http_requests:rate5m offset 2w
  labels:
    when: 2w

max without(when) (
  bottomk(1,
    topk(4,
      past_request_rate
    )
  )
)
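With seven weekly values, keeping the top 4 and then taking the lowest of those yields the 4th largest value, which is the median. The following sketch checks this on made-up weekly request rates; the function name `fourth_largest` and the numbers are assumptions for illustration.

```python
import statistics


def fourth_largest(values):
    """Mimic bottomk(1, topk(4, ...)) on a plain list of numbers."""
    top_four = sorted(values, reverse=True)[:4]  # topk(4, ...)
    return min(top_four)                         # bottomk(1, ...)


# Request rate at the same time of day for the last 7 weeks (assumed values).
# 15 stands for a week with an incident, 120 for an unusual peak.
weekly = [120, 95, 101, 15, 99, 104, 98]

baseline = fourth_largest(weekly)
print(baseline)  # 99
```

The outliers at both ends do not influence the result, which is why the median is a more stable baseline than the average here.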

What do we learn?
Predict users habits
Deviations from the norm are not normal
They mean that users cannot reach us/use our
services
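One possible way to turn this idea into a check is to compare the current value with the historical baseline and flag large relative differences. The 50% tolerance and the `deviates` helper below are assumptions for illustration only; a suitable threshold depends on the service.

```python
def deviates(current, baseline, tolerance=0.5):
    """Return True when current differs from baseline by more than tolerance.

    tolerance is a relative difference, e.g. 0.5 means 50% (assumed value).
    """
    if baseline == 0:
        return current != 0
    return abs(current - baseline) / baseline > tolerance


print(deviates(40, 99))  # True: far fewer requests than usual
print(deviates(90, 99))  # False: within the usual range
```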

Why do business metrics matter?
Good service depends on: Linux health, DNS,
network, NTP, disk space, CPU, open files, database,
cache systems, load balancers, partners,
electricity, virtualization stack, NFS, ... and it changes
over time
Customers won't call you because your disk is full!

Partners
Creative Commons Attribution 2.0 https://www.flickr.com/photos/deanhochman/27248626739

Given that the End User matters
We have decided to standardize metrics
exchange between partners
Prometheus format used (soon to be
OpenMetrics)
Everyone knows HTTP!

What do we exchange?
We are not interested in our partners' internals (and
don't want to expose ours)
We are exchanging precomputed metrics (rate
over 5 minutes, duration over 5 minutes),
excluding servers, instances, ...
Identify, in the chain, the bottlenecks and the
issues
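Hiding internals before sharing can be as simple as summing series after removing the labels that identify servers. The sketch below shows that aggregation on made-up rates; the label names in `hidden` and the `drop_labels` helper are assumptions, and in PromQL the same effect is typically obtained with a `sum without (...)` aggregation.

```python
from collections import defaultdict


def drop_labels(series, hidden=("instance", "server")):
    """Sum series values after removing internal labels."""
    totals = defaultdict(float)
    for labels, value in series:
        public = tuple(sorted(
            (key, val) for key, val in labels.items() if key not in hidden))
        totals[public] += value
    return dict(totals)


# Per-instance request rates over 5 minutes (assumed values).
series = [
    ({"env": "prod", "instance": "fe1"}, 6.0),
    ({"env": "prod", "instance": "fe2"}, 4.0),
    ({"env": "staging", "instance": "fe3"}, 1.5),
]

shared = drop_labels(series)
print(shared)
```

The partner only sees one value per environment and learns nothing about how many frontends serve it.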

Dashboards
We define our dashboards in two parts:
10 graphs on top about the business: RED,
USE, Alerts, data from partners, monitoring
robots, state of the monitoring
hidden by default: Technical Health - ntp, disk,
db, network, jvm, ...

Limited number of graphs
Errors in RED
Attention points in Yellow/Orange

Technical view; more graphs; empty when OK

Dashboards
Duplicate the dashboard to have a historical
view

Dashboards
Easy drill-down between dashboards, with
predefined variables

Dashboard
Provide relevant help where needed
(from the haproxy documentation)

Dashboards
On product launch / change / ... extract
relevant data from the service and build a
"temporary dashboard"
Share with the teams and managers, show on
big screen

Conventions
Color conventions, general:
RED = Bad
Yellow = Attention
Blue/Green = OK
Also:
RED = problem on our side
Yellow/orange = problem on the partners' side

HTTP Codes
2xx: Greens
3xx: Yellows
4xx: Blues (404: grey)
5xx: Orange/Red
! Same across all dashboards to enable easy
reading
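Keeping the convention in one small function makes it easier to apply the same colors everywhere. The sketch below encodes the mapping from this slide; the function name `color_for_status` and the `None` result for unknown codes are assumptions for illustration.

```python
def color_for_status(code):
    """Map an HTTP status code to the color family used on the dashboards."""
    if code == 404:
        return "grey"
    families = {2: "green", 3: "yellow", 4: "blue", 5: "orange/red"}
    # Codes outside 2xx-5xx have no convention on the slide.
    return families.get(code // 100)


for code in (200, 301, 403, 404, 503):
    print(code, color_for_status(code))
```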

Side note: github.com/grafana/grafonnet-lib/
A Jsonnet library to write Grafana dashboards.

Conclusion
https://www.flickr.com/photos/willy_photoshop/34829332342/

Quick Answers
Business monitoring allows you to know early
when things are wrong
Provides clear answers to your customers in
minutes (no more "I will check")
Draw parallels between technical and business
metrics (to find causes)

What happened?
Is it REALLY fixed?
When?
Until when (technical and business)?
What did I miss? What is the impact?

Metrics benefits
Because you run queries and alerts from a
central location
You can run queries across targets/jobs
Detect faulty instances, alert for server X
based on metrics of server Y

Metrics benefits
Trends
Dynamic thresholds
Predictions

Do not underestimate the monitoring of the
development / staging environments.

Business metrics are good candidates
to wake up someone at night.

Prometheus benefits
Pull-based, metrics-centric
The targets (e.g. developers) choose the
metrics they expose => Empowering people
HTTP permits TLS, Client Auth, ... and cross
org sharing of metrics
Becoming a standard in the industry

Grafana
Central point for all teams
Show current and past status
Should give you the opportunity to answer
questions

Focusing on business metrics is hard work that
will show benefits across teams and provide
visibility towards management, enabling you to gain
trust and move more quickly towards a DevOps
model.

Julien Pivotto
roidelapluie
roidelapluie@inuits.eu
Inuits
https://inuits.eu
info@inuits.eu
Contact
