© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Monitor The World: Meaningful
Metrics for Kubernetes
Applications and Clusters
Nick Turner
Software Development Engineer
Amazon EKS
CON408
Who am I?
• Amazon Elastic Container Service for Kubernetes (Amazon EKS) team
• Formerly worked at two Seattle startups using Kubernetes, Porch and
OfferUp
Agenda
Why Monitor
Monitoring Methodology
Metrics Sources & Instrumentation
Applications
Control Plane
Why Do We Monitor?
Problem Detection
Outage Prevention
We are nosy
But It’s Hard
• Microservices
• Wealth of metrics
• Complex interactions
• Containers
• More transient
• OS is not the complete picture
• Need new tools
A Method to the Madness
Resources
USE method by Brendan Gregg
For every resource, check:
• Utilization
• Saturation
• Errors
Services
RED method by Tom Wilkie
For every service, monitor request:
• Rate
• Errors
• Duration
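A minimal sketch of the RED method applied to one window of request records. The `Request` record and `red_summary` helper are hypothetical names for illustration, and counting HTTP 5xx responses as errors is an assumption. A real service would more likely export counters and histograms than compute these from raw records.

```python
from dataclasses import dataclass


@dataclass
class Request:
    timestamp: float  # seconds since the start of the window
    duration: float   # seconds spent serving the request
    status: int       # HTTP status code


def red_summary(requests, window_seconds):
    """Return (rate per second, error ratio, mean duration) for one window."""
    total = len(requests)
    if total == 0:
        return 0.0, 0.0, 0.0
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    rate = total / window_seconds
    error_ratio = errors / total
    mean_duration = sum(r.duration for r in requests) / total
    return rate, error_ratio, mean_duration


sample = [
    Request(0.0, 0.25, 200),
    Request(1.0, 0.5, 200),
    Request(2.0, 0.25, 500),
    Request(3.0, 1.0, 200),
]
rate, error_ratio, mean_duration = red_summary(sample, window_seconds=4)
print(rate, error_ratio, mean_duration)
```

Mean duration hides tail latency; percentiles from a histogram are usually the more useful Duration signal.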
Metrics Sources
[Diagram: Node A and Node B each run a Pod, the Kubelet with cAdvisor, Node Problem Detector, and Node Exporter. Cluster-level components shown: Prometheus, Kube State Metrics, Pod, Metrics Server, fluentd, and grafana.]
You Should Know
3 Built-In Metrics APIs
• metrics.k8s.io
• custom.metrics.k8s.io
• external.metrics.k8s.io
Kubelet cAdvisor
• Currently used by kubelet to expose
the summary API
• The kubelet's cAdvisor port is deprecated in
Kubernetes 1.10 and disabled in 1.11
• You might eventually need to run a standalone
cAdvisor
• Will cAdvisor be replaced by CRI
metrics?
HPA
• Uses the metrics server for resources
• Uses a custom metrics pipeline for
custom metrics
Metrics Server
• No historical data
• Node & Pod, CPU & Mem
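To make the HPA bullet concrete, here is a hedged sketch of the scaling rule described in the Kubernetes documentation: desired replicas are the current replicas scaled by the ratio of the observed metric to its target, rounded up, with no change when the ratio is within a tolerance. The function name and the 0.1 default tolerance are assumptions for illustration. The real controller also handles missing metrics, pod readiness, and min/max bounds.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Approximate the HPA replica calculation for a single metric."""
    ratio = current_value / target_value
    # Within the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# 4 replicas averaging 90% CPU against a 60% target scale out.
print(desired_replicas(4, 90, 60))
# 4 replicas averaging 30% CPU against a 60% target scale in.
print(desired_replicas(4, 30, 60))
```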
You Should Know
Kube State Metrics
• Derives metrics from API
• Can be resource intensive for large
clusters
Node Problem Detector
• Adds conditions to nodes
Node Exporter
• Exposes lots of metrics at the node
level, including the basics such as
CPU, Memory, Network
A Quick Look
kubectl top
kubectl logs
kubectl get events
Prometheus
• Why Prometheus?
• Community
• Number of integrations
• Ease of use
• Why not Prometheus?
• Manage it yourself
• Complexity in large setups
• Possibility: Hybrid Approach
• Use Prometheus to collect metrics
that are exposed on /metrics
endpoints
• Send a subset of critical metrics to
Amazon CloudWatch or a third party
solution.
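The hybrid approach can be sketched as: read what a /metrics endpoint exposes, then forward only a short allow-list of series. The parser below is a deliberately simplified assumption (no timestamps, no spaces inside label values), the helper names are hypothetical, and the actual call to CloudWatch or another backend is left out.

```python
def parse_samples(text):
    """Parse simple 'name{labels} value' lines of the Prometheus text format."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        samples.append((name, series, float(value)))
    return samples


def select_critical(samples, critical_names):
    """Keep only the series whose metric name is on the allow-list."""
    return [s for s in samples if s[0] in critical_names]


EXPOSITION = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{code="200"} 1027
http_requests_total{code="500"} 3
process_open_fds 12
"""

critical = select_critical(parse_samples(EXPOSITION), {"http_requests_total"})
for name, series, value in critical:
    print(series, value)
```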
Federation
[Diagram: a Prometheus aggregation layer federating one Prometheus per Availability Zone (AZ1, AZ2, AZ3).]
If you had to pick one metric…
What matters?
• User experience
• Your sleep and sanity
Start with Your Users
Business Metrics
• E.g. orders fulfilled successfully
Application Request Errors
• Tells you where to start
• Use tracing and logs to determine where to look next
Wait for It
Application Latency
• Critical measurement of user experience
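Latency is usually tracked as a percentile taken from histogram buckets. The sketch below estimates a quantile by linear interpolation inside the matching bucket, which is the general idea behind Prometheus's `histogram_quantile`. It is a simplified approximation, not that function's exact behaviour, and `bucket_quantile` and the sample buckets are hypothetical.

```python
import math


def bucket_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            in_bucket = count - lower_count
            # The +Inf bucket has no width, so fall back to its lower bound.
            if math.isinf(upper_bound) or in_bucket == 0:
                return lower_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Cumulative request counts for latencies up to 0.25s, 0.5s, 1s, and +Inf.
BUCKETS = [(0.25, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(bucket_quantile(0.75, BUCKETS))
```

The estimate is only as precise as the bucket boundaries, so buckets should be chosen around the latencies you care about.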
A Complete Picture
Request Rate & Saturation
• Understand how your application behaves under load
What Else Causes Outages?
Know Your Code and Configuration Version
• Know what version your code is, and where it has been deployed
• The same goes for configuration!
In Kubernetes:
• Add a version label to your PodSpecs
Versioning a Deployment
# Using kube_pod_labels
sum(kube_pod_labels{label_version != "", label_app = "autostore"}) by (label_version)
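The query above counts pods per version label for one app. The same grouping, sketched in Python over a hypothetical list of pod records (the dictionary shape is an assumption, not a Kubernetes client type):

```python
from collections import Counter


def pods_by_version(pods, app):
    """Count pods of one app per version label, skipping unlabelled pods."""
    return Counter(
        pod["labels"]["version"]
        for pod in pods
        if pod["labels"].get("app") == app and pod["labels"].get("version")
    )


pods = [
    {"name": "autostore-1", "labels": {"app": "autostore", "version": "v1"}},
    {"name": "autostore-2", "labels": {"app": "autostore", "version": "v2"}},
    {"name": "autostore-3", "labels": {"app": "autostore", "version": "v2"}},
    {"name": "other-1", "labels": {"app": "other", "version": "v1"}},
    {"name": "autostore-4", "labels": {"app": "autostore"}},
]
counts = pods_by_version(pods, "autostore")
print(dict(counts))
```

During a rollout, more than one version with a non-zero count is expected; a second version that never drains is worth a look.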
Visualizing a Deployment
Take Advantage of Kube State Metrics
Of note:
• Container restarts
• % Pods available
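Both signals reduce to small calculations once the counts are available. A sketch with hypothetical helper names, assuming the inputs come from kube-state-metrics series such as available and desired replica counts and a container restart counter:

```python
def percent_available(available, desired):
    """Percentage of desired pods that are currently available."""
    if desired == 0:
        return 100.0  # nothing requested, so nothing is missing
    return 100.0 * available / desired


def restart_increase(samples):
    """Restarts added over a window, given counter readings oldest first."""
    if len(samples) < 2:
        return 0
    # Clamp at zero in case the counter was recreated during the window.
    return max(samples[-1] - samples[0], 0)


print(percent_available(3, 4))
print(restart_increase([2, 2, 5]))
```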
Drilling deeper
Resources
• CPU
• Memory
• Network
• Disk
Monitoring Resources with USE
Start with a correct setup:
• Requests and limits for all pods
• --kube-reserved
• Namespace ResourceQuotas if desired
Where can we perform aggregation?
• Container
• Pod
• Deployment
• Node
• Namespace
• Cluster
CPU Utilization
container_cpu_usage_seconds_total
# namespace:container_cpu_usage_seconds_total:sum_rate
sum(rate(container_cpu_usage_seconds_total{job="kubelet", image!="",
container_name!=""}[5m])) by (namespace)
# namespace_pod_name_container_name:container_cpu_usage_seconds_total:sum_rate
sum by (namespace, pod_name, container_name) (
rate(container_cpu_usage_seconds_total{job="kubelet", image!="",
container_name!=""}[5m])
)
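`container_cpu_usage_seconds_total` is a counter, so the queries above turn it into cores used by taking its per-second rate. A rough sketch of that idea follows. It handles counter resets but omits the extrapolation Prometheus's `rate()` performs, and `simple_rate` is a hypothetical name.

```python
def simple_rate(samples):
    """Per-second increase of a counter from (timestamp, value) samples.

    Samples must be ordered oldest first. Returns None if there are fewer
    than two samples or no time has elapsed.
    """
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter reset (e.g. a container restart),
        # so the new value is counted from zero.
        increase += (cur - prev) if cur >= prev else cur
    return increase / elapsed


# CPU-seconds consumed, sampled every 60 seconds.
cpu_seconds = [(0, 100.0), (60, 130.0), (120, 160.0)]
print(simple_rate(cpu_seconds))
```

A result of 0.5 here would mean the container used half a core on average over the window.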
CPU Saturation
node_load1
sum(node_load1{job="node-exporter"})
/
sum(node:node_num_cpu:sum)
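The ratio above is the cluster-wide 1-minute load average per CPU. A sketch of the same arithmetic with hypothetical per-node inputs; a ratio that stays above 1 usually suggests more runnable or blocked tasks than cores, though on Linux the load average also counts tasks in uninterruptible I/O wait.

```python
def load_per_cpu(load1_by_node, cpus_by_node):
    """Cluster-wide 1-minute load average divided by total CPU count."""
    total_cpus = sum(cpus_by_node.values())
    if total_cpus == 0:
        raise ValueError("no CPUs reported")
    return sum(load1_by_node.values()) / total_cpus


load1 = {"node-a": 3.0, "node-b": 5.0}  # hypothetical node_load1 readings
cpus = {"node-a": 4, "node-b": 4}       # hypothetical CPU counts per node
print(load_per_cpu(load1, cpus))
```

Summing across nodes can hide one overloaded node behind several idle ones, so a per-node version of the same ratio is worth keeping alongside it.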
Memory Utilization
# namespace:container_memory_usage_bytes:sum
sum(container_memory_usage_bytes{job="kubelet", image!="", container_name!=""}) by
(namespace)
# :node_memory_utilisation:
1 - sum(
node_memory_MemFree{job="node-exporter"}
+ node_memory_Cached{job="node-exporter"}
+ node_memory_Buffers{job="node-exporter"})
/
sum(node_memory_MemTotal{job="node-exporter"})
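The node-level expression treats free, cached, and buffer memory as available and reports the rest as used. A sketch of the same calculation for a single node, reading values in the `/proc/meminfo` layout. The sample text is made up and the helper names are hypothetical.

```python
def parse_meminfo(text):
    """Parse 'Key:   value kB' lines into a dict of integer values."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def memory_utilisation(meminfo):
    """1 - (MemFree + Cached + Buffers) / MemTotal, as in the query above."""
    available = meminfo["MemFree"] + meminfo["Cached"] + meminfo["Buffers"]
    return 1 - available / meminfo["MemTotal"]


MEMINFO = """\
MemTotal:        8000 kB
MemFree:         1000 kB
Buffers:         1000 kB
Cached:          2000 kB
"""

print(memory_utilisation(parse_meminfo(MEMINFO)))
```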
Start with RED
Monitor the API Server with RED
• Errors
• Duration (Latency)
• Rate
• Saturation
Also:
• Pod restarts
As Your Cluster Scales
Where are the bottlenecks?
• Pod scheduling Latency
• Metrics Resource Usage
• API Server Resource Usage
How Do I Monitor Etcd?
• Leader Elections
• etcd_server_has_leader
• etcd_server_leader_changes_seen_total
• Disk Write Performance
• etcd_disk_wal_fsync_duration_seconds_bucket
• etcd_disk_backend_commit_duration_seconds_bucket
• Database Size
• When etcd_mvcc_db_total_size_in_bytes reaches the quota limit, etcd will trigger a
NOSPACE alarm
• Corruption
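For the database size bullet, a simple check is how close `etcd_mvcc_db_total_size_in_bytes` is to the backend quota. The sketch below assumes a 2 GiB quota (believed to be etcd's default when `--quota-backend-bytes` is not set; verify against your cluster) and an arbitrary 80% warning threshold. `quota_status` is a hypothetical helper.

```python
GIB = 1024 ** 3
ASSUMED_QUOTA_BYTES = 2 * GIB  # assumption: default quota, check your etcd flags


def quota_status(db_size_bytes, quota_bytes=ASSUMED_QUOTA_BYTES, warn_ratio=0.8):
    """Classify etcd database size relative to its backend quota."""
    ratio = db_size_bytes / quota_bytes
    if ratio >= 1.0:
        return "nospace"  # etcd would raise the NOSPACE alarm at this point
    if ratio >= warn_ratio:
        return "warning"
    return "ok"


print(quota_status(1 * GIB))
print(quota_status(int(1.7 * GIB)))
print(quota_status(2 * GIB))
```

Alerting at the warning level leaves time to compact and defragment before writes are rejected.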
Thank you!
Nick Turner
nic@amazon.com
Monitor the World: Meaningful Metrics for Containerized Apps and Clusters (CON408) - AWS re:Invent 2018
