Monitoring Microservices
with Prometheus
Tobias Schmidt - MicroCPH May 17, 2017
github.com/grobie @dagrobie tobidt@gmail.com
Monitoring
● Ability to observe and understand systems and their behavior.
○ Know when things go wrong
○ Understand and debug service misbehavior
○ Detect trends and act in advance
● Blackbox vs. Whitebox monitoring
○ Blackbox: Observes systems externally with periodic checks
○ Whitebox: Provides internally observed metrics
● Whitebox: Different levels of granularity
○ Logging
○ Tracing
○ Metrics
Monitoring
● Metrics monitoring system and time series database
○ Instrumentation (client libraries and exporters)
○ Metrics collection, processing and storage
○ Querying, alerting and dashboards
○ Analysis, trending, capacity planning
○ Focused on infrastructure, not business metrics
● Key features
○ Powerful query language for metrics with label dimensions
○ Stable and simple operation
○ Built for modern dynamic deploy environments
○ Easy setup
● What it’s not
○ Logging system
○ Designed for perfect answers
Prometheus
Instrumentation case study
Gusta: a simple like service
● Service to handle everything around liking a resource
○ List all likes on a resource
○ Create a like on a resource
○ Delete a like on a resource
● Implementation
○ Written in golang
○ Uses the gokit.io toolkit
Gusta overview
// Like represents all information of a single like.
type Like struct {
ResourceID string `json:"resourceID"`
UserID string `json:"userID"`
CreatedAt time.Time `json:"createdAt"`
}
// Service describes all methods provided by the gusta service.
type Service interface {
ListResourceLikes(resourceID string) ([]Like, error)
LikeResource(resourceID, userID string) error
UnlikeResource(resourceID, userID string) error
}
Gusta core
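The talk does not show how the interface is implemented. As a minimal sketch, assuming an in-memory map and two hypothetical error values (ErrNotFound and ErrConflict are illustrative names, not taken from the gusta code), an implementation could look like this:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// Like mirrors the struct shown on the slide.
type Like struct {
	ResourceID string    `json:"resourceID"`
	UserID     string    `json:"userID"`
	CreatedAt  time.Time `json:"createdAt"`
}

// Service mirrors the interface shown on the slide.
type Service interface {
	ListResourceLikes(resourceID string) ([]Like, error)
	LikeResource(resourceID, userID string) error
	UnlikeResource(resourceID, userID string) error
}

// Hypothetical error values; the talk only shows "not found" in its logs.
var (
	ErrNotFound = errors.New("not found")
	ErrConflict = errors.New("already liked")
)

// memoryService is a hypothetical in-memory implementation of Service.
type memoryService struct {
	mu    sync.Mutex
	likes map[string]map[string]Like // resourceID -> userID -> Like
}

func newMemoryService() *memoryService {
	return &memoryService{likes: map[string]map[string]Like{}}
}

func (m *memoryService) ListResourceLikes(resourceID string) ([]Like, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	out := make([]Like, 0, len(m.likes[resourceID]))
	for _, l := range m.likes[resourceID] {
		out = append(out, l)
	}
	return out, nil
}

func (m *memoryService) LikeResource(resourceID, userID string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, ok := m.likes[resourceID][userID]; ok {
		return ErrConflict
	}
	if m.likes[resourceID] == nil {
		m.likes[resourceID] = map[string]Like{}
	}
	m.likes[resourceID][userID] = Like{ResourceID: resourceID, UserID: userID, CreatedAt: time.Now()}
	return nil
}

func (m *memoryService) UnlikeResource(resourceID, userID string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, ok := m.likes[resourceID][userID]; !ok {
		return ErrNotFound
	}
	delete(m.likes[resourceID], userID)
	return nil
}

func main() {
	var s Service = newMemoryService()
	fmt.Println("like:", s.LikeResource("r1", "u1"))
	fmt.Println("unlike by another user:", s.UnlikeResource("r1", "u2"))
	likes, _ := s.ListResourceLikes("r1")
	fmt.Println("likes on r1:", len(likes))
}
```

Keeping the interface this small is what makes it easy to wrap with logging and instrumentation middleware later.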
// main.go
var store gusta.Store
store = gusta.NewMemoryStore()
var s gusta.Service
s = gusta.NewService(store)
s = gusta.LoggingMiddleware(logger)(s)
var h http.Handler
h = gusta.MakeHTTPHandler(s, log.With(logger, "component", "HTTP"))
http.Handle("/", h)
if err := http.ListenAndServe(*httpAddr, nil); err != nil {
logger.Log("exit error", err)
}
Gusta server
./gusta
ts=2017-05-16T19:39:34.938108068Z transport=HTTP addr=:8080
ts=2017-05-16T19:38:24.203071341Z method=LikeResource ResourceID=r1ee85512 UserID=ue86d7a01 took=10.466µs err=null
ts=2017-05-16T19:38:24.323002316Z method=ListResourceLikes ResourceID=r8669fd29 took=17.812µs err=null
ts=2017-05-16T19:38:24.343061775Z method=ListResourceLikes ResourceID=rd4ac47c6 took=30.986µs err=null
ts=2017-05-16T19:38:24.363022818Z method=LikeResource ResourceID=r1ee85512 UserID=u19597d1e took=10.757µs err=null
ts=2017-05-16T19:38:24.38303722Z method=ListResourceLikes ResourceID=rfc9a393a took=41.554µs err=null
ts=2017-05-16T19:38:24.40303802Z method=ListResourceLikes ResourceID=r8669fd29 took=28.115µs err=null
ts=2017-05-16T19:38:24.423045585Z method=ListResourceLikes ResourceID=r8669fd29 took=23.842µs err=null
ts=2017-05-16T19:38:20.843121594Z method=UnlikeResource ResourceID=r1ee85512 UserID=ub5e42f43 took=8.57µs err="not found"
ts=2017-05-16T19:38:20.863037026Z method=ListResourceLikes ResourceID=rfc9a393a took=27.839µs err=null
ts=2017-05-16T19:38:20.883081162Z method=ListResourceLikes ResourceID=r8669fd29 took=16.999µs err=null
Gusta server
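The log lines above come from gusta.LoggingMiddleware, whose body is not shown in the talk. The following is only a sketch of the wrapping pattern, assuming the standard library logger instead of the Go kit logger used by gusta; stubService and demo are hypothetical helpers, and the output format only approximates the one above.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"log"
	"strings"
	"time"
)

type Like struct{ ResourceID, UserID string }

type Service interface {
	ListResourceLikes(resourceID string) ([]Like, error)
	LikeResource(resourceID, userID string) error
	UnlikeResource(resourceID, userID string) error
}

// Middleware wraps a Service with additional behavior.
type Middleware func(Service) Service

type loggingService struct {
	logger *log.Logger
	next   Service
}

// LoggingMiddleware returns a Middleware that logs every call as key=value pairs.
func LoggingMiddleware(w io.Writer) Middleware {
	return func(next Service) Service {
		return &loggingService{logger: log.New(w, "", 0), next: next}
	}
}

func (l *loggingService) log(method, args string, begin time.Time, err error) {
	l.logger.Printf("method=%s %s took=%s err=%v", method, args, time.Since(begin), err)
}

func (l *loggingService) ListResourceLikes(id string) (likes []Like, err error) {
	// The deferred call reads the named result err after the wrapped call returns.
	defer func(begin time.Time) {
		l.log("ListResourceLikes", "ResourceID="+id, begin, err)
	}(time.Now())
	return l.next.ListResourceLikes(id)
}

func (l *loggingService) LikeResource(id, user string) (err error) {
	defer func(begin time.Time) {
		l.log("LikeResource", "ResourceID="+id+" UserID="+user, begin, err)
	}(time.Now())
	return l.next.LikeResource(id, user)
}

func (l *loggingService) UnlikeResource(id, user string) (err error) {
	defer func(begin time.Time) {
		l.log("UnlikeResource", "ResourceID="+id+" UserID="+user, begin, err)
	}(time.Now())
	return l.next.UnlikeResource(id, user)
}

// stubService is a hypothetical stand-in for the real gusta service.
type stubService struct{}

func (stubService) ListResourceLikes(string) ([]Like, error) { return nil, nil }
func (stubService) LikeResource(string, string) error        { return nil }
func (stubService) UnlikeResource(string, string) error      { return errors.New("not found") }

// demo runs two calls through the middleware and returns the captured log.
func demo() string {
	var buf strings.Builder
	s := LoggingMiddleware(&buf)(stubService{})
	_ = s.LikeResource("r1ee85512", "ue86d7a01")
	_ = s.UnlikeResource("r1ee85512", "ub5e42f43")
	return buf.String()
}

func main() {
	fmt.Print(demo())
}
```

Logs like these answer "what happened to this request", but they are awkward to aggregate, which is the motivation for the metrics instrumentation that follows.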
Basic Instrumentation
Providing operational insight
● “Four golden signals” cover the essentials
○ Latency
○ Traffic
○ Errors
○ Saturation
● Similar concepts: RED and USE methods
○ Request: Rate, Errors, Duration
○ Utilization, Saturation, Errors
● Information about the service itself
● Interaction with dependencies (other services, databases, etc.)
What information should be provided?
● Direct instrumentation
○ Traffic, Latency, Errors, Saturation
○ Service specific metrics (and interaction with dependencies)
○ Prometheus client libraries provide packages to instrument HTTP requests out of the box
● Exporters
○ Utilization, Saturation
○ node_exporter CPU, memory, IO utilization per host
○ wmi_exporter does the same for Windows
○ cAdvisor (Container advisor) provides similar metrics for each container
Where to get the information from?
// main.go
import "github.com/prometheus/client_golang/prometheus"
var registry = prometheus.NewRegistry()
registry.MustRegister(
prometheus.NewGoCollector(),
prometheus.NewProcessCollector(os.Getpid(), ""),
)
// Pass down registry when creating HTTP handlers.
h = gusta.MakeHTTPHandler(s, log.With(logger, "component", "HTTP"), registry)
Initializing Prometheus client library
var h http.Handler = listResourceLikesHandler
var method, path string = "GET", "/api/v1/likes/{id}"
requests := prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "gusta_http_server_requests_total",
Help: "Total number of requests handled by the HTTP server.",
ConstLabels: prometheus.Labels{"method": method, "path": path},
},
[]string{"code"},
)
registry.MustRegister(requests)
h = promhttp.InstrumentHandlerCounter(requests, h)
Counting HTTP requests
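To make the idea behind promhttp.InstrumentHandlerCounter concrete, here is a standard-library-only sketch of the same pattern: wrap the handler, record the status code it writes, and increment a counter keyed by that code. This is an illustration, not the promhttp implementation; codeCounter, statusRecorder, instrumentCounter and demo are hypothetical names.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
	"sync"
)

// codeCounter is a hypothetical stand-in for a counter vector with one "code" label.
type codeCounter struct {
	mu     sync.Mutex
	counts map[string]int
}

func newCodeCounter() *codeCounter { return &codeCounter{counts: map[string]int{}} }

func (c *codeCounter) inc(code string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[code]++
}

func (c *codeCounter) get(code string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[code]
}

// statusRecorder remembers the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// instrumentCounter wraps next and counts each request under its response code.
func instrumentCounter(c *codeCounter, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		// The status defaults to 200 when the handler never calls WriteHeader.
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, req)
		c.inc(strconv.Itoa(rec.status))
	})
}

// demo sends three requests through an instrumented handler and returns the counter.
func demo() *codeCounter {
	c := newCodeCounter()
	h := instrumentCounter(c, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Query().Get("missing") != "" {
			http.NotFound(w, r)
			return
		}
		fmt.Fprintln(w, "[]")
	}))
	targets := []string{"/api/v1/likes/r1", "/api/v1/likes/r2", "/api/v1/likes/r3?missing=1"}
	for _, target := range targets {
		h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest("GET", target, nil))
	}
	return c
}

func main() {
	c := demo()
	fmt.Println("200:", c.get("200"), "404:", c.get("404"))
}
```

Because the counter only ever goes up, a monitoring server can derive request and error rates from the difference between two scrapes, even if it misses some scrapes in between.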
var h http.Handler = listResourceLikesHandler
var method, path string = "GET", "/api/v1/likes/{id}"
requestDuration := prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "gusta_http_server_request_duration_seconds",
Help: "A histogram of latencies for requests.",
Buckets: []float64{.0025, .005, 0.01, 0.025, 0.05, 0.1},
ConstLabels: prometheus.Labels{"method": method, "path": path},
},
[]string{},
)
registry.MustRegister(requestDuration)
h = promhttp.InstrumentHandlerDuration(requestDuration, h)
Observing HTTP request latency
Exposing metrics
Observing the current state
● Prometheus is a pull based monitoring system
○ Instances expose their metrics via an HTTP endpoint
○ Prometheus uses service discovery or static target lists to collect the state periodically
● Centralized management
○ Prometheus decides how often to scrape instances
● Prometheus stores the data on local disk
○ In a big outage, you could run Prometheus on your laptop!
How to collect the metrics?
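The pull model can be sketched in a few lines: the collector issues an HTTP GET against each target and parses the plain-text response. The sketch below is a deliberately simplified illustration, not how Prometheus itself is implemented; it only handles unlabelled "name value" samples, and scrape, parseSamples and parseString are hypothetical names.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strconv"
	"strings"
)

// parseSamples reads unlabelled "name value" samples from a metrics payload.
// Comment lines and labelled samples are skipped to keep the sketch short.
func parseSamples(r io.Reader) (map[string]float64, error) {
	samples := map[string]float64{}
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") || strings.Contains(line, "{") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		v, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			continue
		}
		samples[fields[0]] = v
	}
	return samples, sc.Err()
}

// parseString is a convenience wrapper around parseSamples.
func parseString(s string) (map[string]float64, error) {
	return parseSamples(strings.NewReader(s))
}

// scrape fetches a metrics endpoint once and parses the response.
func scrape(url string) (map[string]float64, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	return parseSamples(resp.Body)
}

func main() {
	// A throwaway local server stands in for a service instance.
	payload := "# HELP process_open_fds Number of open file descriptors.\n" +
		"# TYPE process_open_fds gauge\n" +
		"process_open_fds 23\n" +
		"process_max_fds 1024\n"
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, payload)
	}))
	defer srv.Close()

	samples, err := scrape(srv.URL + "/metrics")
	if err != nil {
		fmt.Println("scrape failed:", err)
		return
	}
	fmt.Println("open fds:", samples["process_open_fds"], "max fds:", samples["process_max_fds"])
}
```

A failed scrape is itself a signal: the collector knows which targets it expected to reach, so an unreachable instance can be alerted on.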
// main.go
// ...
http.Handle("/metrics", promhttp.HandlerFor(
registry,
promhttp.HandlerOpts{},
))
Exposing the metrics via HTTP
curl -s http://localhost:8080/metrics | grep requests
# HELP gusta_http_server_requests_total Total number of requests handled by the gusta HTTP server.
# TYPE gusta_http_server_requests_total counter
gusta_http_server_requests_total{code="200",method="DELETE",path="/api/v1/likes"} 3
gusta_http_server_requests_total{code="200",method="GET",path="/api/v1/likes/{id}"} 429
gusta_http_server_requests_total{code="200",method="POST",path="/api/v1/likes"} 51
gusta_http_server_requests_total{code="404",method="DELETE",path="/api/v1/likes"} 14
gusta_http_server_requests_total{code="409",method="POST",path="/api/v1/likes"} 3
Request metrics
curl -s http://localhost:8080/metrics | grep request_duration
# HELP gusta_http_server_request_duration_seconds A histogram of latencies for requests.
# TYPE gusta_http_server_request_duration_seconds histogram
...
gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.00025"} 414
gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.0005"} 423
gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.001"} 429
gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.0025"} 429
gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.005"} 429
gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.01"} 429
gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="+Inf"} 429
gusta_http_server_request_duration_seconds_sum{method="GET",path="/api/v1/likes/{id}"} 0.047897984
gusta_http_server_request_duration_seconds_count{method="GET",path="/api/v1/likes/{id}"} 429
...
Latency metrics
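The bucket counts are cumulative: each le bucket counts all observations less than or equal to its bound. A quantile can therefore be estimated by finding the bucket that contains the target rank and interpolating linearly inside it, which is the approach histogram_quantile takes. The sketch below applies that idea to the numbers shown above; quantile, bucket and slideBuckets are hypothetical names, and edge cases are handled only roughly.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// bucket is one cumulative histogram bucket: the number of observations <= upperBound.
type bucket struct {
	upperBound float64 // the "le" label; math.Inf(1) for the last bucket
	count      float64
}

// slideBuckets holds the GET /api/v1/likes/{id} bucket counts shown above.
var slideBuckets = []bucket{
	{0.00025, 414}, {0.0005, 423}, {0.001, 429}, {0.0025, 429},
	{0.005, 429}, {0.01, 429}, {math.Inf(1), 429},
}

// quantile estimates the q-quantile from cumulative buckets that are sorted
// by ascending bound and end with a +Inf bucket.
func quantile(q float64, buckets []bucket) float64 {
	if len(buckets) < 2 || q < 0 || q > 1 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	// Find the first finite bucket whose cumulative count reaches the rank.
	i := sort.Search(len(buckets)-1, func(i int) bool { return buckets[i].count >= rank })
	if i == len(buckets)-1 {
		// The rank falls into the +Inf bucket: report the highest finite bound.
		return buckets[len(buckets)-2].upperBound
	}
	lower, below := 0.0, 0.0
	if i > 0 {
		lower, below = buckets[i-1].upperBound, buckets[i-1].count
	}
	if buckets[i].count == below {
		return buckets[i].upperBound
	}
	// Assume observations are spread evenly inside the bucket.
	return lower + (buckets[i].upperBound-lower)*(rank-below)/(buckets[i].count-below)
}

func main() {
	fmt.Printf("p95 estimate:  %.3f ms\n", quantile(0.95, slideBuckets)*1000)
	fmt.Printf("p99 estimate:  %.3f ms\n", quantile(0.99, slideBuckets)*1000)
	// The _sum and _count series give the mean latency.
	fmt.Printf("mean latency:  %.3f ms\n", 0.047897984/429*1000)
}
```

For this data the 95th percentile lands in the first bucket (roughly 0.25 ms) and the mean is about 0.11 ms. The estimate can only be as precise as the bucket layout, so bounds should be chosen around the latencies that matter for alerting.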
curl -s http://localhost:8080/metrics | grep process
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 892.78
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1024
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 23
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 9.3446144e+07
...
Out-of-the-box process metrics
Collecting metrics
Scraping all service instances
# Scrape all targets every 5 seconds by default.
global:
scrape_interval: 5s
evaluation_interval: 5s
scrape_configs:
# Scrape the Prometheus server itself.
- job_name: prometheus
static_configs:
- targets: [localhost:9090]
# Scrape the Gusta service.
- job_name: gusta
static_configs:
- targets: [localhost:8080]
Static configuration
scrape_configs:
# Scrape the Gusta service using Consul.
- job_name: consul
consul_sd_configs:
- server: localhost:8500
relabel_configs:
- source_labels: [__meta_consul_tags]
regex: .*,prod,.*
action: keep
- source_labels: [__meta_consul_service]
target_label: job
Consul service discovery
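The keep rule above is easier to read once two details are spelled out: relabel regexes are anchored at both ends, and the __meta_consul_tags label holds the tags joined by the separator with a leading and trailing separator, so ",prod," matches regardless of tag position. The sketch below reproduces that matching with the standard library, assuming the default "," separator; consulTags and keep are hypothetical helper names.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// consulTags renders tags the way the __meta_consul_tags label is built:
// joined by the separator, with a leading and trailing separator.
func consulTags(tags []string) string {
	return "," + strings.Join(tags, ",") + ","
}

// keep reports whether a target survives a "keep" relabel rule.
// The pattern is anchored at both ends, as relabeling regexes are.
func keep(pattern, value string) bool {
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	return re.MatchString(value)
}

func main() {
	const pattern = ".*,prod,.*"
	examples := [][]string{{"prod", "web"}, {"web", "prod"}, {"production"}, {"staging"}}
	for _, tags := range examples {
		v := consulTags(tags)
		fmt.Printf("%-14s keep=%v\n", v, keep(pattern, v))
	}
}
```

Without the surrounding separators, a pattern for "prod" would either have to special-case the first and last tag or would accidentally match "production".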
Target overview
Simple Graph UI
Simple Graph UI
Dashboards
Human-readable metrics
Grafana example
Alerts
Actionable metrics
ALERT InstanceDown
IF up == 0
FOR 2m
LABELS { severity = "warning" }
ANNOTATIONS {
summary = "Instance down for more than 2 minutes.",
description = "{{ $labels.instance }} of job {{ $labels.job }} has been down for >= 2 minutes.",
}
ALERT RunningOutOfFileDescriptors
IF process_open_fds / process_max_fds * 100 > 95
FOR 2m
LABELS { severity = "warning" }
ANNOTATIONS {
summary = "Instance has many open file descriptors.",
description = "{{ $labels.instance }} of job {{ $labels.job }} has {{ $value }}% open descriptors.",
}
Alert examples
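The FOR clause delays notification until the condition has held continuously for the given duration; until then the alert is pending, and a single evaluation where the condition is false resets it. The following is a small sketch of that state machine under simplified assumptions (one alert instance, evaluations at a fixed step); alert, eval, run and demo are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

type state string

const (
	inactive state = "inactive"
	pending  state = "pending"
	firing   state = "firing"
)

// alert tracks one alert instance across rule evaluations.
type alert struct {
	hold     time.Duration // the FOR duration
	activeAt time.Time     // when the condition first became true
	state    state
}

// eval feeds the result of one rule evaluation into the alert.
func (a *alert) eval(now time.Time, conditionTrue bool) state {
	switch {
	case !conditionTrue:
		a.state = inactive // any false evaluation resets the timer
	case a.state != pending && a.state != firing:
		a.state, a.activeAt = pending, now
	}
	if a.state == pending && now.Sub(a.activeAt) >= a.hold {
		a.state = firing
	}
	return a.state
}

// run evaluates a sequence of condition results at a fixed step.
func run(hold, step time.Duration, results []bool) []state {
	a := &alert{hold: hold}
	now := time.Unix(0, 0)
	out := make([]state, 0, len(results))
	for _, r := range results {
		out = append(out, a.eval(now, r))
		now = now.Add(step)
	}
	return out
}

// demo mimics "FOR 2m" with one evaluation per minute.
func demo() []state {
	return run(2*time.Minute, time.Minute, []bool{true, true, true, false, true})
}

func main() {
	fmt.Println(demo()) // prints [pending pending firing inactive pending]
}
```

The hold duration trades detection speed for fewer notifications about short blips, which is why it is usually a multiple of the scrape and evaluation intervals.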
ALERT GustaHighErrorRate
IF sum without(code, instance) (rate(gusta_http_server_requests_total{code=~"5.."}[1m]))
/ sum without(code, instance) (rate(gusta_http_server_requests_total[1m]))
* 100 > 0.1
FOR 2m
LABELS { severity = "critical" }
ANNOTATIONS {
summary = "Gusta service endpoints have a high error rate.",
description = "Gusta endpoint {{ $labels.method }} {{ $labels.path }} returns {{ $value }}% errors.",
}
ALERT GustaHighLatency
IF histogram_quantile(0.95, rate(gusta_http_server_request_duration_seconds_bucket[1m])) > 0.1
LABELS { severity = "critical" }
ANNOTATIONS {
summary = "Gusta service endpoints have a high latency.",
description = "Gusta endpoint {{ $labels.method }} {{ $labels.path }} has a 95th percentile latency of {{ $value }} seconds.",
}
Alert examples
ALERT FilesystemRunningFull
IF predict_linear(node_filesystem_avail{mountpoint!="/var/lib/docker/aufs"}[6h], 24 * 60 * 60) < 0
FOR 1h
LABELS { severity = "warning" }
ANNOTATIONS {
summary = "Filesystem space is filling up.",
description = "Filesystem on {{ $labels.device }} at {{ $labels.instance }} is predicted to run out of space within the next 24 hours.",
}
Alert examples
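predict_linear fits a line through the samples in the range using simple linear regression and extrapolates it a given number of seconds ahead. The sketch below shows the underlying arithmetic on made-up data (a filesystem losing one unit of free space per hour); predictLinear, sample and demoSamples are hypothetical names, and the values are illustrative rather than taken from the talk.

```go
package main

import "fmt"

type sample struct {
	t float64 // timestamp in seconds
	v float64 // observed value
}

// predictLinear fits a least-squares line through the samples and
// extrapolates it to `ahead` seconds after evalTime.
func predictLinear(samples []sample, evalTime, ahead float64) (float64, bool) {
	if len(samples) < 2 {
		return 0, false
	}
	n := float64(len(samples))
	var sumT, sumV, sumTV, sumTT float64
	for _, s := range samples {
		dt := s.t - evalTime // measure time relative to the evaluation time
		sumT += dt
		sumV += s.v
		sumTV += dt * s.v
		sumTT += dt * dt
	}
	denom := n*sumTT - sumT*sumT
	if denom == 0 {
		return 0, false // all samples share one timestamp
	}
	slope := (n*sumTV - sumT*sumV) / denom
	intercept := (sumV - slope*sumT) / n // value of the fitted line at evalTime
	return intercept + slope*ahead, true
}

// demoSamples returns hourly samples over six hours: 10 units free at the
// start, shrinking by one unit per hour.
func demoSamples() []sample {
	var out []sample
	for h := 0; h <= 6; h++ {
		out = append(out, sample{t: float64(h) * 3600, v: 10 - float64(h)})
	}
	return out
}

func main() {
	predicted, ok := predictLinear(demoSamples(), 6*3600, 24*3600)
	if !ok {
		fmt.Println("not enough data")
		return
	}
	fmt.Printf("predicted free space in 24h: %.1f units\n", predicted)
	fmt.Println("alert condition (< 0):", predicted < 0)
}
```

With four units left and a loss of one unit per hour, the extrapolated value 24 hours ahead is negative, so the rule's condition would hold. Alerting on the trend rather than on a fixed threshold gives a warning while there is still time to act.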
Summary
● Monitoring is essential to run, understand and operate services.
● Prometheus
○ Client instrumentation
○ Scrape configuration
○ Querying
○ Dashboards
○ Alert rules
● Important Metrics
○ Four golden signals: Latency, Traffic, Errors, Saturation
● Best practices
Recap
● https://prometheus.io
● Talks, Articles, Videos https://www.reddit.com/r/PrometheusMonitoring/
● Our “StackOverflow” https://www.robustperception.io/blog/
● Ask the community https://prometheus.io/community/
● Google’s SRE book https://landing.google.com/sre/book/index.html
● USE method http://www.brendangregg.com/usemethod.html
● My philosophy on alerting https://goo.gl/UnvYhQ
Sources
Thank you
Tobias Schmidt - MicroCPH May 17, 2017
github.com/grobie - @dagrobie
● High availability
○ Run two identical servers
● Scaling
○ Shard by datacenter / team / service ( / instance )
● Aggregation across Prometheus servers
○ Federation
● Retention time
○ Generic remote storage support available.
● Pull vs. Push
○ Doesn’t matter in practice. Advantages depend on use case.
● Security
○ Prometheus focuses on being a monitoring system; securing it is left to the user.
FAQ

 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 

Monitoring microservices with Prometheus

  • 1. Monitoring Microservices with Prometheus Tobias Schmidt - MicroCPH May 17, 2017 github.com/grobie @dagrobie tobidt@gmail.com
  • 3. Monitoring
    ● Ability to observe and understand systems and their behavior.
      ○ Know when things go wrong
      ○ Understand and debug service misbehavior
      ○ Detect trends and act in advance
    ● Blackbox vs. Whitebox monitoring
      ○ Blackbox: Observes systems externally with periodic checks
      ○ Whitebox: Provides internally observed metrics
    ● Whitebox: Different levels of granularity
      ○ Logging
      ○ Tracing
      ○ Metrics
  • 4. Prometheus
    ● Metrics monitoring system and time series database
      ○ Instrumentation (client libraries and exporters)
      ○ Metrics collection, processing and storage
      ○ Querying, alerting and dashboards
      ○ Analysis, trending, capacity planning
      ○ Focused on infrastructure, not business metrics
    ● Key features
      ○ Powerful query language for metrics with label dimensions
      ○ Stable and simple operation
      ○ Built for modern dynamic deploy environments
      ○ Easy setup
    ● What it’s not
      ○ Logging system
      ○ Designed for perfect answers
  • 5. Instrumentation case study Gusta: a simple like service
  • 6. Gusta overview
    ● Service to handle everything around liking a resource
      ○ List all likes on a resource
      ○ Create a like on a resource
      ○ Delete a like on a resource
    ● Implementation
      ○ Written in Go
      ○ Uses the gokit.io toolkit
  • 7. Gusta core
    // Like represents all information of a single like.
    type Like struct {
        ResourceID string    `json:"resourceID"`
        UserID     string    `json:"userID"`
        CreatedAt  time.Time `json:"createdAt"`
    }
    // Service describes all methods provided by the gusta service.
    type Service interface {
        ListResourceLikes(resourceID string) ([]Like, error)
        LikeResource(resourceID, userID string) error
        UnlikeResource(resourceID, userID string) error
    }
  • 8. Gusta server
    // main.go
    var store gusta.Store
    store = gusta.NewMemoryStore()
    var s gusta.Service
    s = gusta.NewService(store)
    s = gusta.LoggingMiddleware(logger)(s)
    var h http.Handler
    h = gusta.MakeHTTPHandler(s, log.With(logger, "component", "HTTP"))
    http.Handle("/", h)
    if err := http.ListenAndServe(*httpAddr, nil); err != nil {
        logger.Log("exit error", err)
    }
  • 9. Gusta server
    ./gusta
    ts=2017-05-16T19:39:34.938108068Z transport=HTTP addr=:8080
    ts=2017-05-16T19:38:24.203071341Z method=LikeResource ResourceID=r1ee85512 UserID=ue86d7a01 took=10.466µs err=null
    ts=2017-05-16T19:38:24.323002316Z method=ListResourceLikes ResourceID=r8669fd29 took=17.812µs err=null
    ts=2017-05-16T19:38:24.343061775Z method=ListResourceLikes ResourceID=rd4ac47c6 took=30.986µs err=null
    ts=2017-05-16T19:38:24.363022818Z method=LikeResource ResourceID=r1ee85512 UserID=u19597d1e took=10.757µs err=null
    ts=2017-05-16T19:38:24.38303722Z method=ListResourceLikes ResourceID=rfc9a393a took=41.554µs err=null
    ts=2017-05-16T19:38:24.40303802Z method=ListResourceLikes ResourceID=r8669fd29 took=28.115µs err=null
    ts=2017-05-16T19:38:24.423045585Z method=ListResourceLikes ResourceID=r8669fd29 took=23.842µs err=null
    ts=2017-05-16T19:38:20.843121594Z method=UnlikeResource ResourceID=r1ee85512 UserID=ub5e42f43 took=8.57µs err="not found"
    ts=2017-05-16T19:38:20.863037026Z method=ListResourceLikes ResourceID=rfc9a393a took=27.839µs err=null
    ts=2017-05-16T19:38:20.883081162Z method=ListResourceLikes ResourceID=r8669fd29 took=16.999µs err=null
  • 11. What information should be provided?
    ● “Four golden signals” cover the essentials
      ○ Latency
      ○ Traffic
      ○ Errors
      ○ Saturation
    ● Similar concepts: RED and USE methods
      ○ Request: Rate, Errors, Duration
      ○ Utilization, Saturation, Errors
    ● Information about the service itself
    ● Interaction with dependencies (other services, databases, etc.)
  • 12. Where to get the information from?
    ● Direct instrumentation
      ○ Traffic, Latency, Errors, Saturation
      ○ Service specific metrics (and interaction with dependencies)
      ○ Prometheus client libraries provide packages to instrument HTTP requests out of the box
    ● Exporters
      ○ Utilization, Saturation
      ○ node_exporter: CPU, memory, IO utilization per host
      ○ wmi_exporter does the same for Windows
      ○ cAdvisor (Container advisor) provides similar metrics for each container
  • 13. Initializing Prometheus client library
    // main.go
    import "github.com/prometheus/client_golang/prometheus"
    var registry = prometheus.NewRegistry()
    registry.MustRegister(
        prometheus.NewGoCollector(),
        prometheus.NewProcessCollector(os.Getpid(), ""),
    )
    // Pass down registry when creating HTTP handlers.
    h = gusta.MakeHTTPHandler(s, log.With(logger, "component", "HTTP"), registry)
  • 14. Counting HTTP requests
    var h http.Handler = listResourceLikesHandler
    var method, path string = "GET", "/api/v1/likes/{id}"
    requests := prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name:        "gusta_http_server_requests_total",
            Help:        "Total number of requests handled by the HTTP server.",
            ConstLabels: prometheus.Labels{"method": method, "path": path},
        },
        []string{"code"},
    )
    registry.MustRegister(requests)
    h = promhttp.InstrumentHandlerCounter(requests, h)
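To make the idea behind the request counter concrete, here is a minimal stdlib-only sketch of a handler wrapper that counts responses per status code. It is not the client library's implementation; `statusRecorder`, `codeCounter`, `countByCode` and `demo` are hypothetical names introduced only for this illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
	"sync"
)

// statusRecorder (hypothetical) remembers the status code a handler writes.
type statusRecorder struct {
	http.ResponseWriter
	code int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.code = code
	r.ResponseWriter.WriteHeader(code)
}

// codeCounter (hypothetical) stands in for a counter vector with one "code" label.
type codeCounter struct {
	mu     sync.Mutex
	counts map[string]int
}

func newCodeCounter() *codeCounter { return &codeCounter{counts: map[string]int{}} }

func (c *codeCounter) inc(code string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[code]++
}

func (c *codeCounter) get(code string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[code]
}

// countByCode wraps next and increments the counter for each response code.
func countByCode(c *codeCounter, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Default to 200: handlers that never call WriteHeader respond with 200.
		rec := &statusRecorder{ResponseWriter: w, code: http.StatusOK}
		next.ServeHTTP(rec, r)
		c.inc(strconv.Itoa(rec.code))
	})
}

// demo sends two successful requests and one 404 through the wrapper.
func demo() *codeCounter {
	c := newCodeCounter()
	h := countByCode(c, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path == "/missing" {
			w.WriteHeader(http.StatusNotFound)
			return
		}
		fmt.Fprint(w, "ok")
	}))
	for _, p := range []string{"/a", "/b", "/missing"} {
		h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest("GET", p, nil))
	}
	return c
}

func main() {
	c := demo()
	fmt.Println("200:", c.get("200"), "404:", c.get("404"))
}
```

Keeping the code as a label rather than a separate metric per status is what later allows queries to aggregate over it or filter on it.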
  • 15. Observing HTTP request latency
    var h http.Handler = listResourceLikesHandler
    var method, path string = "GET", "/api/v1/likes/{id}"
    requestDuration := prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:        "gusta_http_server_request_duration_seconds",
            Help:        "A histogram of latencies for requests.",
            Buckets:     []float64{.0025, .005, 0.01, 0.025, 0.05, 0.1},
            ConstLabels: prometheus.Labels{"method": method, "path": path},
        },
        []string{},
    )
    registry.MustRegister(requestDuration)
    h = promhttp.InstrumentHandlerDuration(requestDuration, h)
  • 17. How to collect the metrics?
    ● Prometheus is a pull-based monitoring system
      ○ Instances expose their metrics on an HTTP endpoint
      ○ Prometheus uses service discovery or static target lists to collect the state periodically
    ● Centralized management
      ○ Prometheus decides how often to scrape instances
    ● Prometheus stores the data on local disk
      ○ In a big outage, you could run Prometheus on your laptop!
  • 18. Exposing the metrics via HTTP
    // main.go
    // ...
    http.Handle("/metrics", promhttp.HandlerFor(
        registry,
        promhttp.HandlerOpts{},
    ))
  • 19. Request metrics
    curl -s http://localhost:8080/metrics | grep requests
    # HELP gusta_http_server_requests_total Total number of requests handled by the gusta HTTP server.
    # TYPE gusta_http_server_requests_total counter
    gusta_http_server_requests_total{code="200",method="DELETE",path="/api/v1/likes"} 3
    gusta_http_server_requests_total{code="200",method="GET",path="/api/v1/likes/{id}"} 429
    gusta_http_server_requests_total{code="200",method="POST",path="/api/v1/likes"} 51
    gusta_http_server_requests_total{code="404",method="DELETE",path="/api/v1/likes"} 14
    gusta_http_server_requests_total{code="409",method="POST",path="/api/v1/likes"} 3
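As a rough illustration of how this text output can be consumed, the sketch below sums the sample values per status class. It is a deliberately naive parser written for this example (the names `exposition`, `codeLabel` and `sumByCodeClass` are hypothetical), and it assumes no timestamps and no spaces inside label values, which the real exposition format does allow.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// exposition holds the sample lines copied from the /metrics output above.
const exposition = `# HELP gusta_http_server_requests_total Total number of requests handled by the gusta HTTP server.
# TYPE gusta_http_server_requests_total counter
gusta_http_server_requests_total{code="200",method="DELETE",path="/api/v1/likes"} 3
gusta_http_server_requests_total{code="200",method="GET",path="/api/v1/likes/{id}"} 429
gusta_http_server_requests_total{code="200",method="POST",path="/api/v1/likes"} 51
gusta_http_server_requests_total{code="404",method="DELETE",path="/api/v1/likes"} 14
gusta_http_server_requests_total{code="409",method="POST",path="/api/v1/likes"} 3`

// codeLabel captures the first digit of a three-digit code label.
var codeLabel = regexp.MustCompile(`code="(\d)\d\d"`)

// sumByCodeClass adds up sample values per status class ("2xx", "4xx", ...).
func sumByCodeClass(text string) map[string]float64 {
	sums := map[string]float64{}
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		i := strings.LastIndex(line, " ")
		if i < 0 {
			continue
		}
		value, err := strconv.ParseFloat(line[i+1:], 64)
		if err != nil {
			continue
		}
		m := codeLabel.FindStringSubmatch(line[:i])
		if m == nil {
			continue
		}
		sums[m[1]+"xx"] += value
	}
	return sums
}

func main() {
	sums := sumByCodeClass(exposition)
	total := sums["2xx"] + sums["4xx"]
	fmt.Printf("2xx=%v 4xx=%v share of 4xx=%.1f%%\n", sums["2xx"], sums["4xx"], 100*sums["4xx"]/total)
}
```

In practice this aggregation is done with a query on the Prometheus server rather than in the service; the sketch only shows what the counters encode.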
  • 20. Latency metrics
    curl -s http://localhost:8080/metrics | grep request_duration
    # HELP gusta_http_server_request_duration_seconds A histogram of latencies for requests.
    # TYPE gusta_http_server_request_duration_seconds histogram
    ...
    gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.00025"} 414
    gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.0005"} 423
    gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.001"} 429
    gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.0025"} 429
    gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.005"} 429
    gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.01"} 429
    gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="+Inf"} 429
    gusta_http_server_request_duration_seconds_sum{method="GET",path="/api/v1/likes/{id}"} 0.047897984
    gusta_http_server_request_duration_seconds_count{method="GET",path="/api/v1/likes/{id}"} 429
    ...
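The bucket counts are cumulative: each `le` bucket counts every observation less than or equal to its bound. The sketch below shows, under stated assumptions, how a quantile can be estimated from such buckets by linear interpolation, which is the general approach behind `histogram_quantile`. It assumes the buckets listed above are the complete finite set and that the lowest bucket starts at zero; `bucket`, `gustaBuckets` and `estimateQuantile` are hypothetical names, and this is not the server's actual code.

```go
package main

import "fmt"

// bucket is one cumulative histogram bucket: the number of observations <= le.
type bucket struct {
	le    float64
	count float64
}

// gustaBuckets returns the finite buckets shown in the latency output above.
func gustaBuckets() []bucket {
	return []bucket{
		{0.00025, 414}, {0.0005, 423}, {0.001, 429},
		{0.0025, 429}, {0.005, 429}, {0.01, 429},
	}
}

// estimateQuantile interpolates linearly inside the first bucket whose
// cumulative count reaches the requested rank. Buckets must be sorted by le.
func estimateQuantile(q float64, buckets []bucket, total float64) float64 {
	rank := q * total
	prevLe, prevCount := 0.0, 0.0
	for _, b := range buckets {
		if b.count >= rank {
			if b.count == prevCount {
				return prevLe // empty bucket, nothing to interpolate
			}
			return prevLe + (b.le-prevLe)*(rank-prevCount)/(b.count-prevCount)
		}
		prevLe, prevCount = b.le, b.count
	}
	return prevLe // rank lies beyond the highest finite bucket
}

func main() {
	const sum, count = 0.047897984, 429.0
	fmt.Printf("mean: %.6fs\n", sum/count)
	fmt.Printf("p95:  %.6fs\n", estimateQuantile(0.95, gustaBuckets(), count))
	fmt.Printf("p99:  %.6fs\n", estimateQuantile(0.99, gustaBuckets(), count))
}
```

Because the estimate can never be more precise than the bucket boundaries, the choice of buckets should bracket the latencies the service is expected to show.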
  • 21. Out-of-the-box process metrics
    curl -s http://localhost:8080/metrics | grep process
    # HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
    # TYPE process_cpu_seconds_total counter
    process_cpu_seconds_total 892.78
    # HELP process_max_fds Maximum number of open file descriptors.
    # TYPE process_max_fds gauge
    process_max_fds 1024
    # HELP process_open_fds Number of open file descriptors.
    # TYPE process_open_fds gauge
    process_open_fds 23
    # HELP process_resident_memory_bytes Resident memory size in bytes.
    # TYPE process_resident_memory_bytes gauge
    process_resident_memory_bytes 9.3446144e+07
    ...
  • 22. Collecting metrics Scraping all service instances
  • 23. Static configuration
    # Scrape all targets every 5 seconds by default.
    global:
      scrape_interval: 5s
      evaluation_interval: 5s
    scrape_configs:
      # Scrape the Prometheus server itself.
      - job_name: prometheus
        static_configs:
          - targets: [localhost:9090]
      # Scrape the Gusta service.
      - job_name: gusta
        static_configs:
          - targets: [localhost:8080]
  • 24. Consul service discovery
    scrape_configs:
      # Scrape the Gusta service using Consul.
      - job_name: consul
        consul_sd_configs:
          - server: localhost:8500
        relabel_configs:
          - source_labels: [__meta_consul_tags]
            regex: .*,prod,.*
            action: keep
          - source_labels: [__meta_consul_service]
            target_label: job
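The `keep` rule above only retains targets whose Consul tags include `prod`. The sketch below illustrates why the regex is written with surrounding commas. It assumes, to the best of my understanding, that the tags arrive joined by commas with a leading and trailing separator, and that relabel regexes are anchored at both ends; `joinTags` and `keep` are hypothetical helpers, not part of Prometheus.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// joinTags mimics the assumed shape of __meta_consul_tags:
// tags joined by "," with a separator at the start and at the end.
func joinTags(tags []string) string {
	return "," + strings.Join(tags, ",") + ","
}

// keep reports whether a target would survive an "action: keep" rule.
// The pattern is anchored so that it must match the whole label value.
func keep(pattern, value string) bool {
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	return re.MatchString(value)
}

func main() {
	const pattern = ".*,prod,.*"
	for _, tags := range [][]string{
		{"web", "prod"},
		{"preprod"},
		{"staging", "web"},
	} {
		fmt.Printf("%-16v keep=%v\n", tags, keep(pattern, joinTags(tags)))
	}
}
```

The surrounding commas make the rule match the exact tag `prod`, so a tag such as `preprod` does not slip through.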
  • 31. Alert examples
    ALERT InstanceDown
      IF up == 0
      FOR 2m
      LABELS { severity = "warning" }
      ANNOTATIONS {
        summary = "Instance down for more than 2 minutes.",
        description = "{{ $labels.instance }} of job {{ $labels.job }} has been down for >= 2 minutes.",
      }
    ALERT RunningOutOfFileDescriptors
      IF process_open_fds / process_max_fds * 100 > 95
      FOR 2m
      LABELS { severity = "warning" }
      ANNOTATIONS {
        summary = "Instance has many open file descriptors.",
        description = "{{ $labels.instance }} of job {{ $labels.job }} has {{ $value }}% open descriptors.",
      }
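The `FOR` clause means the condition has to hold continuously for the given duration before the alert fires; until then it is only pending. A minimal sketch of that state machine follows, assuming a fixed evaluation interval. `alert`, `eval` and `simulate` are hypothetical names used only for this illustration.

```go
package main

import (
	"fmt"
	"time"
)

// alert tracks since when the condition has been continuously true.
type alert struct {
	holdFor     time.Duration
	activeSince time.Time
	state       string // "inactive", "pending" or "firing"
}

// eval is called once per rule evaluation with the current condition result.
func (a *alert) eval(now time.Time, condition bool) string {
	if !condition {
		a.state = "inactive" // any gap resets the timer
		return a.state
	}
	if a.state == "" || a.state == "inactive" {
		a.activeSince = now
		a.state = "pending"
	}
	if now.Sub(a.activeSince) >= a.holdFor {
		a.state = "firing"
	}
	return a.state
}

// simulate evaluates the conditions at a fixed step and returns each state.
func simulate(conds []bool, stepSeconds, forSeconds int) []string {
	a := &alert{holdFor: time.Duration(forSeconds) * time.Second}
	start := time.Unix(0, 0)
	states := make([]string, 0, len(conds))
	for i, c := range conds {
		now := start.Add(time.Duration(i*stepSeconds) * time.Second)
		states = append(states, a.eval(now, c))
	}
	return states
}

func main() {
	// Evaluate every 30s with FOR 2m: fires at the fifth evaluation.
	fmt.Println(simulate([]bool{true, true, true, true, true}, 30, 120))
	// A single healthy evaluation in between restarts the wait.
	fmt.Println(simulate([]bool{true, true, false, true, true, true, true}, 30, 120))
}
```

This is why a short `FOR` duration helps suppress alerts for brief blips such as a single failed scrape.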
  • 32. Alert examples
    ALERT GustaHighErrorRate
      IF sum without(code, instance) (rate(gusta_http_server_requests_total{code=~"5.."}[1m]))
         / sum without(code, instance) (rate(gusta_http_server_requests_total[1m])) * 100 > 0.1
      FOR 2m
      LABELS { severity = "critical" }
      ANNOTATIONS {
        summary = "Gusta service endpoints have a high error rate.",
        description = "Gusta endpoint {{ $labels.method }} {{ $labels.path }} returns {{ $value }}% errors.",
      }
    ALERT GustaHighLatency
      IF histogram_quantile(0.95, rate(gusta_http_server_request_duration_seconds_bucket[1m])) > 0.1
      LABELS { severity = "critical" }
      ANNOTATIONS {
        summary = "Gusta service endpoints have a high latency.",
        description = "Gusta endpoint {{ $labels.method }} {{ $labels.path }} has a 95th percentile latency of {{ $value }} seconds.",
      }
  • 33. Alert examples
    ALERT FilesystemRunningFull
      IF predict_linear(node_filesystem_avail{mountpoint!="/var/lib/docker/aufs"}[6h], 24 * 60 * 60) < 0
      FOR 1h
      LABELS { severity = "warning" }
      ANNOTATIONS {
        summary = "Filesystem space is filling up.",
        description = "Filesystem on {{ $labels.device }} at {{ $labels.instance }} is predicted to run out of space within the next 24 hours.",
      }
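`predict_linear` fits a straight line through the samples in the range and extrapolates it into the future. The sketch below shows the idea with an ordinary least-squares fit over made-up data: a filesystem losing 1 GB of free space per hour over the last six hours. `point`, `shrinkingDisk` and `predictLinear` are hypothetical names, the numbers are invented for illustration, and this is not the server's implementation.

```go
package main

import "fmt"

// point is one sample: t in seconds relative to now, v in bytes available.
type point struct{ t, v float64 }

// shrinkingDisk returns hourly samples for the last 6 hours (made-up data):
// 20 GB free six hours ago, 14 GB free now.
func shrinkingDisk() []point {
	var pts []point
	for h := -6; h <= 0; h++ {
		pts = append(pts, point{t: float64(h * 3600), v: 14e9 - 1e9*float64(h)})
	}
	return pts
}

// predictLinear fits v = intercept + slope*t by least squares and
// returns the fitted value at time t = at (seconds from now).
func predictLinear(pts []point, at float64) float64 {
	n := float64(len(pts))
	if n == 0 {
		return 0
	}
	var sumT, sumV, sumTV, sumTT float64
	for _, p := range pts {
		sumT += p.t
		sumV += p.v
		sumTV += p.t * p.v
		sumTT += p.t * p.t
	}
	denom := n*sumTT - sumT*sumT
	if denom == 0 {
		return sumV / n // a single timestamp: no trend to extrapolate
	}
	slope := (n*sumTV - sumT*sumV) / denom
	intercept := (sumV - slope*sumT) / n
	return intercept + slope*at
}

func main() {
	predicted := predictLinear(shrinkingDisk(), 24*60*60)
	fmt.Printf("predicted bytes available in 24h: %.0f\n", predicted)
	fmt.Println("would alert:", predicted < 0)
}
```

A negative prediction means the trend line crosses zero within the horizon, which is the condition the alert expression checks.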
  • 35. Recap
    ● Monitoring is essential to run, understand and operate services.
    ● Prometheus
      ○ Client instrumentation
      ○ Scrape configuration
      ○ Querying
      ○ Dashboards
      ○ Alert rules
    ● Important metrics
      ○ Four golden signals: Latency, Traffic, Errors, Saturation
    ● Best practices
  • 36. Sources
    ● https://prometheus.io
    ● Talks, Articles, Videos https://www.reddit.com/r/PrometheusMonitoring/
    ● Our “StackOverflow” https://www.robustperception.io/blog/
    ● Ask the community https://prometheus.io/community/
    ● Google’s SRE book https://landing.google.com/sre/book/index.html
    ● USE method http://www.brendangregg.com/usemethod.html
    ● My philosophy on alerting https://goo.gl/UnvYhQ
  • 37. Thank you Tobias Schmidt - MicroCPH May 17, 2017 github.com/grobie - @dagrobie
  • 38. FAQ
    ● High availability
      ○ Run two identical servers
    ● Scaling
      ○ Shard by datacenter / team / service ( / instance )
    ● Aggregation across Prometheus servers
      ○ Federation
    ● Retention time
      ○ Generic remote storage support available.
    ● Pull vs. Push
      ○ Doesn’t matter in practice. Advantages depend on use case.
    ● Security
      ○ Focused on writing a monitoring system, left to the user.