Monitor the World: Meaningful Metrics for Containerized Apps and Clusters (CON408) - AWS re:Invent 2018
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Monitor The World: Meaningful
Metrics for Kubernetes
Applications and Clusters
Nick Turner
Software Development Engineer
Amazon EKS
CON408
Who am I?
• Amazon Elastic Container Service for Kubernetes (Amazon EKS) team
• Formerly worked at two Seattle startups using Kubernetes, Porch and
OfferUp
Agenda
Why Monitor
Monitoring Methodology
Metrics Sources & Instrumentation
Applications
Control Plane
Why Do We Monitor?
Problem Detection
Outage Prevention
We are nosy
But It’s Hard
• Microservices
• Wealth of metrics
• Complex interactions
• Containers
• More transient
• OS is not the complete picture
• Need new tools
A Method to the Madness
Resources
USE method by Brendan Gregg
For every resource, check:
• Utilization
• Saturation
• Errors
Services
RED method by Tom Wilkie
For every service, monitor request:
• Rate
• Errors
• Duration
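As a rough illustration of the RED method, the sketch below derives request rate, error ratio, and mean duration for one service from a window of request records. The `Request` type, its fields, and the convention that a 5xx status counts as an error are assumptions made for this example only.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    status: int        # HTTP status code (hypothetical field)
    duration_s: float  # time taken to serve the request, in seconds


def red_summary(requests, window_s):
    """Return request rate, error ratio, and mean duration for one window."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_per_s": total / window_s,
        "error_ratio": errors / total if total else 0.0,
        "mean_duration_s": mean(r.duration_s for r in requests) if total else 0.0,
    }


sample = [Request(200, 0.1), Request(200, 0.3), Request(500, 0.2), Request(200, 0.2)]
summary = red_summary(sample, window_s=60)
```

In practice a mean hides tail latency, so duration is usually tracked as a percentile instead.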
Metrics Sources
[Diagram: Node A and Node B each run Pods, the Kubelet with cAdvisor, Node Problem Detector, and Node Exporter. Cluster-level components: Prometheus, Kube State Metrics, Metrics Server, fluentd, and Grafana.]
You Should Know
3 Built-In Metrics APIs
• metrics.k8s.io
• custom.metrics.k8s.io
• external.metrics.k8s.io
Kubelet cAdvisor
• Currently used by the kubelet to expose the summary API
• The cAdvisor port is deprecated in 1.10 and disabled in 1.11
• You might eventually need to run a standalone cAdvisor
• Will cAdvisor be replaced by CRI metrics?
HPA
• Uses the metrics server for resources
• Uses a custom metrics pipeline for
custom metrics
Metrics Server
• No historical data
• Node & Pod, CPU & Mem
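For the HPA bullet above, the Kubernetes documentation describes the core scaling rule as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The sketch below shows only that arithmetic; the function name is hypothetical, and the real controller also applies a tolerance, min/max replica bounds, and stabilization, all omitted here.

```python
import math


def desired_replicas(current_replicas, current_value, target_value):
    """Scale the replica count by the ratio of observed to target metric value."""
    return math.ceil(current_replicas * (current_value / target_value))


# 4 replicas averaging 90% CPU against a 60% target
scaled = desired_replicas(4, 90, 60)
```

Because the resource metrics come from the metrics server, which keeps no history, each decision is based on the latest values only.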
You Should Know
Kube State Metrics
• Derives metrics from API
• Can be resource intensive for large
clusters
Node Problem Detector
• Adds conditions to nodes
Node Exporter
• Exposes lots of metrics at the node
level, including the basics such as
CPU, Memory, Network
A Quick Look
kubectl top
kubectl logs
kubectl get events
Prometheus
• Why Prometheus?
• Community
• Number of integrations
• Ease of use
• Why not Prometheus?
• Manage it yourself
• Complexity in large setups
• Possibility: Hybrid Approach
• Use Prometheus to collect metrics
that are exposed on /metrics
endpoints
• Send a subset of critical metrics to
Amazon CloudWatch or a third party
solution.
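One way to picture the hybrid approach is a small filter that reads the text exposed on a /metrics endpoint and keeps only an allowlist of critical series for forwarding. This is a minimal sketch: the metric names and sample text are made up, the parser assumes simple `name{labels} value` lines without timestamps, and the forwarding step itself is left out.

```python
def parse_samples(text):
    """Parse simple 'name{labels} value' lines, skipping comments and blanks."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, value = line.rsplit(" ", 1)
        name = series.split("{", 1)[0]
        samples.append((name, series, float(value)))
    return samples


def select_critical(samples, critical_names):
    """Keep only the samples whose metric name is in the allowlist."""
    return [s for s in samples if s[0] in critical_names]


# Illustrative exposition text, not output from a real service
EXPOSITION = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{code="200"} 1027
http_requests_total{code="500"} 3
process_open_fds 42
"""

critical = select_critical(parse_samples(EXPOSITION), {"http_requests_total"})
```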
Federation
[Diagram: a Prometheus aggregation layer federating from one Prometheus server per Availability Zone (AZ1, AZ2, AZ3).]
If you had to pick one metric…
What matters?
• User experience
• Your sleep and sanity
Start with Your Users
Business Metrics
• E.g. orders fulfilled successfully
Application Request Errors
• Tells you where to start
• Use tracing and logs to determine where to look next
Wait for It
Application Latency
• Critical measurement of user experience
A Complete Picture
Request Rate & Saturation
• Understand how your application behaves under load
What Else Causes Outages?
Know Your Code and Configuration Version
• Know what version your code is, and where it has been deployed
• The same goes for configuration!
In Kubernetes:
• Add a version label to your PodSpecs
Versioning a Deployment
# Using kube_pod_labels
sum(kube_pod_labels{label_version != "", label_app = "autostore"}) by (label_version)
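To make the query above concrete, the sketch below performs the same grouping in plain Python: it counts the pods of one app per version label and skips pods that have no version. The pod label dictionaries are invented sample data, and the function name is hypothetical.

```python
from collections import Counter


def pods_by_version(pods, app):
    """Count pods of one app per version label, skipping unlabeled pods."""
    return Counter(
        labels["version"]
        for labels in pods
        if labels.get("app") == app and labels.get("version")
    )


pods = [
    {"app": "autostore", "version": "v1"},
    {"app": "autostore", "version": "v1"},
    {"app": "autostore", "version": "v2"},
    {"app": "autostore"},                  # no version label: excluded
    {"app": "checkout", "version": "v2"},  # different app: excluded
]
counts = pods_by_version(pods, "autostore")
```

During a rollout, the count for the old version should fall to zero as the count for the new version rises.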
Visualizing a Deployment
Take Advantage of Kube State Metrics
Of note:
• Container restarts
• % Pods available
Drilling deeper
Resources
• CPU
• Memory
• Network
• Disk
Monitoring Resources with USE
Start with a correct setup:
• Requests and limits for all pods
• --kube-reserved
• Namespace ResourceQuotas if desired
Where can we perform aggregation?
• Container
• Pod
• Deployment
• Node
• Namespace
• Cluster
CPU Utilization
container_cpu_usage_seconds_total
# namespace:container_cpu_usage_seconds_total:sum_rate
sum(rate(container_cpu_usage_seconds_total{job="kubelet", image!="",
container_name!=""}[5m])) by (namespace)
# namespace_pod_name_container_name:container_cpu_usage_seconds_total:sum_rate
sum by (namespace, pod_name, container_name) (
rate(container_cpu_usage_seconds_total{job="kubelet", image!="",
container_name!=""}[5m])
)
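The queries above rely on `rate()` turning an ever-increasing CPU-seconds counter into cores used per second. The sketch below is a simplified model of that idea, not Prometheus's implementation: it sums the increases between samples, treats any drop as a counter reset, and skips the extrapolation to window boundaries that Prometheus performs. The sample values are made up.

```python
def counter_rate(samples):
    """Per-second increase over (timestamp, value) samples of a counter.

    A drop in value is treated as a counter reset (the counter restarted at 0).
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# CPU-seconds sampled every 60 s; the container restarts after the third sample
cpu_seconds = [(0, 100.0), (60, 130.0), (120, 160.0), (180, 20.0), (240, 50.0)]
cores_used = counter_rate(cpu_seconds)
```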
CPU Saturation
node_load1
sum(node_load1{job="node-exporter"})
/
sum(node:node_num_cpu:sum)
Memory Utilization
# namespace:container_memory_usage_bytes:sum
sum(container_memory_usage_bytes{job="kubelet", image!="", container_name!=""}) by
(namespace)
# :node_memory_utilisation:
1 - sum(
node_memory_MemFree{job="node-exporter"}
+ node_memory_Cached{job="node-exporter"}
+ node_memory_Buffers{job="node-exporter"})
/
sum(node_memory_MemTotal{job="node-exporter"})
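As a worked example of the node memory utilisation expression, the sketch below applies the same arithmetic to two hypothetical nodes: free, cached, and buffer memory are treated as available, and utilisation is one minus their share of the total. The node figures are invented for illustration.

```python
def memory_utilisation(nodes):
    """1 - (free + cached + buffers) / total, summed across all nodes."""
    available = sum(n["MemFree"] + n["Cached"] + n["Buffers"] for n in nodes)
    total = sum(n["MemTotal"] for n in nodes)
    return 1 - available / total


GIB = 1024 ** 3
nodes = [
    {"MemTotal": 16 * GIB, "MemFree": 2 * GIB, "Cached": 5 * GIB, "Buffers": 1 * GIB},
    {"MemTotal": 16 * GIB, "MemFree": 6 * GIB, "Cached": 1 * GIB, "Buffers": 1 * GIB},
]
utilisation = memory_utilisation(nodes)
```

Here 16 GiB of the 32 GiB total counts as available, so the cluster-wide utilisation is 0.5.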
Start with RED
Monitor the API Server with RED
• Errors
• Duration (Latency)
• Rate
• Saturation
Also:
• Pod restarts
As Your Cluster Scales
Where are the bottlenecks?
• Pod scheduling Latency
• Metrics Resource Usage
• API Server Resource Usage
How Do I Monitor Etcd?
• Leader Elections
• etcd_server_has_leader
• etcd_server_leader_changes_seen_total
• Disk Write Performance
• etcd_disk_wal_fsync_duration_seconds_bucket
• etcd_disk_backend_commit_duration_seconds_bucket
• Database Size
• When etcd_mvcc_db_total_size_in_bytes reaches the quota limit, etcd will trigger a
NOSPACE alarm
• Corruption
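The `_bucket` metrics listed under Disk Write Performance are cumulative histograms, so a latency percentile has to be estimated from bucket counts. The sketch below follows the general idea behind Prometheus's `histogram_quantile` (linear interpolation inside the bucket that contains the target rank) in simplified form. The bucket bounds and counts are made-up numbers, and the function is an illustration rather than the Prometheus implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with math.inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into the +Inf bucket
            in_bucket = count - prev_count
            fraction = (rank - prev_count) / in_bucket if in_bucket else 0.0
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Cumulative fsync counts per latency bucket, in seconds (made-up numbers)
wal_fsync = [(0.001, 50), (0.002, 80), (0.004, 95), (0.008, 99), (math.inf, 100)]
p90 = histogram_quantile(0.90, wal_fsync)
```

With these numbers the 90th percentile falls in the 2 ms to 4 ms bucket, and the estimate is only as precise as the bucket boundaries allow.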
Thank you!
Nick Turner
nic@amazon.com