Monitor the World: Meaningful Metrics for Containerized Apps and Clusters (CON408) - AWS re:Invent 2018

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Monitor the World: Meaningful Metrics for Kubernetes Applications and Clusters
Nick Turner
Software Development Engineer
Amazon EKS
CON408

Who am I?
• Amazon Elastic Container Service for Kubernetes (Amazon EKS) team
• Formerly worked at two Seattle startups using Kubernetes: Porch and OfferUp

Agenda
Why Monitor
Monitoring Methodology
Metrics Sources & Instrumentation
Applications
Control Plane

Why Do We Monitor?
Problem Detection
Outage Prevention
We are nosy

But It’s Hard
• Microservices
• Wealth of metrics
• Complex interactions
• Containers
• More transient
• OS is not the complete picture
• Need new tools

A Method to the Madness
Resources
USE method by Brendan Gregg
For every resource, check:
• Utilization
• Saturation
• Errors
Services
RED method by Tom Wilkie
For every service, monitor request:
• Rate
• Errors
• Duration
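As a rough illustration of the RED method, the sketch below computes rate, error ratio, and a duration percentile from one window of request records. The `Request` record, the choice to count only 5xx responses as errors, and the nearest-rank percentile are assumptions made for this example, not something the talk prescribes.

```python
from dataclasses import dataclass
import math


@dataclass
class Request:
    """Hypothetical request record: HTTP status and latency in seconds."""
    status: int
    duration_s: float


def red_summary(requests, window_s, percentile=0.99):
    """Return (rate per second, error ratio, duration percentile) for one window."""
    if not requests or window_s <= 0:
        return 0.0, 0.0, 0.0
    rate = len(requests) / window_s
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_s for r in requests)
    # Nearest-rank percentile: smallest value with at least p of the samples at or below it.
    rank = max(1, math.ceil(percentile * len(durations)))
    return rate, errors / len(requests), durations[rank - 1]
```

In practice these three numbers would usually come from counters and histograms exported by the service rather than from raw request logs.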

Metrics Sources
Diagram: metrics sources in a cluster.
Each node (Node A, Node B):
• Pod
• Kubelet (with cAdvisor)
• Node Problem Detector
• Node Exporter
Cluster-wide:
• Prometheus
• Kube State Metrics
• Pod
• Metrics Server
• fluentd
• Grafana

You Should Know
3 Built-In Metrics APIs
• metrics.k8s.io
• custom.metrics.k8s.io
• external.metrics.k8s.io
Kubelet cAdvisor
• Currently used by the kubelet to expose the summary API
• The cAdvisor port is deprecated in 1.10 and disabled in 1.11
• You might eventually need to run a standalone cAdvisor
• Will cAdvisor be replaced by CRI metrics?
HPA
• Uses the metrics server for resources
• Uses a custom metrics pipeline for custom metrics
Metrics Server
• No historical data
• Node & Pod, CPU & Mem

You Should Know
Kube State Metrics
• Derives metrics from the state of objects in the Kubernetes API
• Can be resource intensive for large clusters
Node Problem Detector
• Adds conditions to nodes
Node Exporter
• Exposes lots of metrics at the node level, including the basics such as CPU, Memory, Network

A Quick Look
kubectl top
kubectl logs
kubectl get events

Prometheus
• Why Prometheus?
  • Community
  • Number of integrations
  • Ease of use
• Why not Prometheus?
  • Manage it yourself
  • Complexity in large setups
• Possibility: Hybrid Approach
  • Use Prometheus to collect metrics that are exposed on /metrics endpoints
  • Send a subset of critical metrics to Amazon CloudWatch or a third-party solution
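One way to picture the hybrid approach is as a filter over what a /metrics endpoint returns. The sketch below keeps only the sample lines whose metric name is on an allow-list; the function names and the example metric names are hypothetical, and actually forwarding the result to Amazon CloudWatch or another backend is deliberately left out.

```python
def metric_name(sample_line):
    """Return the metric name of one text-format sample line."""
    # The name ends at the first '{' (start of labels) or whitespace (before the value).
    for i, ch in enumerate(sample_line):
        if ch == "{" or ch.isspace():
            return sample_line[:i]
    return sample_line


def select_critical(exposition_text, allowed_names):
    """Keep only sample lines whose metric name is in allowed_names."""
    kept = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        if metric_name(line) in allowed_names:
            kept.append(line)
    return kept
```

Keeping the allow-list small is the point of the design: the full set of metrics stays in Prometheus, and only the few that drive alerts or dashboards leave the cluster.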

Federation
Diagram: a Prometheus aggregation layer federates metrics from one Prometheus server per Availability Zone (AZ1, AZ2, AZ3).

If you had to pick one metric…
What matters?
• User experience
• Your sleep and sanity

Start with Your Users
Business Metrics
• E.g. orders fulfilled successfully
Application Request Errors
• Tells you where to start
• Use tracing and logs to determine where to look next

Wait for It
Application Latency
• Critical measurement of user experience

A Complete Picture
Request Rate & Saturation
• Understand how your application behaves under load

What Else Causes Outages?
Know Your Code and Configuration Version
• Know what version your code is, and where it has been deployed
• The same goes for configuration!
In Kubernetes:
• Add a version label to your PodSpecs

Versioning a Deployment
# Using kube_pod_labels
sum(kube_pod_labels{label_version != "", label_app = "autostore"}) by (label_version)
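To make the query above concrete: `kube_pod_labels` has the value 1 for each pod, so summing by `label_version` counts pods per version. The sketch below does the same over plain label dictionaries; the function name and the input shape are assumptions for illustration.

```python
from collections import Counter


def pods_per_version(pod_labels, app):
    """Count pods of one app per version label, skipping pods without a version."""
    counts = Counter()
    for labels in pod_labels:
        version = labels.get("version", "")
        if labels.get("app") == app and version != "":
            counts[version] += 1
    return dict(counts)
```

During a rollout the result would briefly hold two versions, which is what makes a deployment visible on a graph.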

Visualizing a Deployment

Take Advantage of Kube State Metrics
Of note:
• Container restarts
• % Pods available

Drilling deeper
Resources
• CPU
• Memory
• Network
• Disk

Monitoring Resources with USE
Start with a correct setup:
• Requests and limits for all pods
• --kube-reserved
• Namespace ResourceQuotas if desired
Where can we perform aggregation?
• Container
• Pod
• Deployment
• Node
• Namespace
• Cluster

CPU Utilization
container_cpu_usage_seconds_total
# namespace:container_cpu_usage_seconds_total:sum_rate
sum(
  rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container_name!=""}[5m])
) by (namespace)
# namespace_pod_name_container_name:container_cpu_usage_seconds_total:sum_rate
sum by (namespace, pod_name, container_name) (
  rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container_name!=""}[5m])
)
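The queries above turn a counter of CPU seconds into CPU cores in use by taking its per-second rate. The sketch below shows that idea on raw samples. It is a simplification: Prometheus's `rate()` also extrapolates to the edges of the window, which is omitted here, and the function name is hypothetical.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp_s, value) samples, oldest first."""
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter reset (for example, a container restart): count from zero.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None
```

For example, a counter that grows by 60 CPU seconds over 120 seconds of wall-clock time gives 0.5, meaning half a core on average.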

CPU Saturation
node_load1
sum(node_load1{job="node-exporter"})
/
sum(node:node_num_cpu:sum)
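The saturation query above is a single ratio: total 1-minute load average over total CPU count. A minimal sketch with hypothetical per-node inputs:

```python
def load_per_cpu(load1_by_node, cpus_by_node):
    """Cluster-wide 1-minute load average divided by total CPU count."""
    total_cpus = sum(cpus_by_node.values())
    if total_cpus == 0:
        return None
    return sum(load1_by_node.values()) / total_cpus
```

A value persistently above 1 suggests more runnable work than there are CPUs, although on Linux the load average also counts tasks waiting on I/O, so it is a hint rather than proof.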

Memory Utilization
# namespace:container_memory_usage_bytes:sum
sum(container_memory_usage_bytes{job="kubelet", image!="", container_name!=""}) by (namespace)
# :node_memory_utilisation:
1 - sum(
  node_memory_MemFree{job="node-exporter"}
  + node_memory_Cached{job="node-exporter"}
  + node_memory_Buffers{job="node-exporter"}
)
/
sum(node_memory_MemTotal{job="node-exporter"})
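The node memory utilisation query treats free, cached, and buffer memory as available and reports everything else as used. The same arithmetic as a sketch, with hypothetical per-node dictionaries of byte counts:

```python
def memory_utilisation(nodes):
    """1 - (free + cached + buffers) / total, summed over all nodes (bytes)."""
    total = sum(n["MemTotal"] for n in nodes)
    if total == 0:
        return None
    available = sum(n["MemFree"] + n["Cached"] + n["Buffers"] for n in nodes)
    return 1 - available / total
```

Subtracting cache and buffers matters because the kernel hands that memory back under pressure; counting it as used would make most nodes look nearly full.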

Start with RED
Monitor the API Server with RED
• Errors
• Duration (Latency)
• Rate
• Saturation
Also:
• Pod restarts

As Your Cluster Scales
Where are the bottlenecks?
• Pod scheduling Latency
• Metrics Resource Usage
• API Server Resource Usage

How Do I Monitor Etcd?
• Leader Elections
  • etcd_server_has_leader
  • etcd_server_leader_changes_seen_total
• Disk Write Performance
  • etcd_disk_wal_fsync_duration_seconds_bucket
  • etcd_disk_backend_commit_duration_seconds_bucket
• Database Size
  • When etcd_mvcc_db_total_size_in_bytes reaches the quota limit, etcd triggers a NOSPACE alarm
• Corruption
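The two disk write metrics listed above are histograms, so they are normally read as a quantile (for example, the 99th percentile fsync duration). The sketch below estimates a quantile from cumulative buckets using linear interpolation inside the bucket, in the spirit of Prometheus's `histogram_quantile()`. The function itself and the sample bucket layout are assumptions for illustration, not etcd's actual bucket boundaries.

```python
def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets [(upper_bound_s, count), ...].

    Buckets are sorted by upper bound and end with float("inf").
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return None
    target = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= target:
            if upper_bound == float("inf"):
                # Quantile falls in the overflow bucket: report the last finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (target - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound
```

The estimate is only as fine as the bucket boundaries, so a quantile that lands in a wide bucket should be read as a range rather than an exact duration.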

Thank you!
Nick Turner
nic@amazon.com
