ZMON - OS monitoring in the cloud
Atmosphere 2016 | Krakow 17.5.2016 | jan.mussler@zalando.de | @JanMussler
15 countries
3 fulfillment centers
18+ million active customers
3.0+ billion € revenue
135+ million visits per month
1,000+ employees in tech
Europe's Leading Fashion Platform
Visit us: tech.zalando.com
Zalando’s Technology History
RADICAL AGILITY
AUTONOMY
➊ One AWS account per Team
➋ Deployment with Docker
➌ Managed SSH Access
➍ REST/OAuth 2.0 mandatory
➎ Traceability of changes
IN A NUTSHELL
STUPS
AWS
DEPLOYMENT
Senza CLI
Deploy Tool
Pier One
Docker Registry
docker pull
docker push
Taupage
AMI
Internet
*.abc.example.org *.xyz.example.org
Team ABC Team XYZ
ISOLATED AWS ACCOUNTS
EC2 EC2
ELB ELB
EC2
ZMON
Flexible and extensible: Checks & Alerts in Python
Integrate: REST APIs, OAuth 2.0, AWS Auto Discovery
Fully configurable via UI / API: no restarts required!
Great for teams: team dashboards, alert inheritance
Fast/scaling metrics: Redis, KairosDB + Grafana 2
Hackweek 2015 - iOS app and Android app ;-)
ZMON - Highlights ;-)
Display historic data using Grafana 2
Notifications plus iOS and Android App
E-Mail
Full authentication for all endpoints
OAuth 2.0 login flow (e.g. via GitHub login)
“TV Tokens” for “read-only” dashboard login
Grafana 2 bundled and API implemented
● ZMON stores dashboards incl. tags/stars
● KairosDB proxy
● ElasticSearch proxy (in progress)
ZMON Controller -> UI + REST API
Example
Plan B Deployment - Multi Region Setup (JWT issue/verification)
[Diagram: two regions, each showing ELB, Provider (Java), ELB, Tokeninfo (Go), and C* Nodes]
Creates “entities” that describe the deployment:
ELBs, ASGs, applications, instances, ...
Crawls the AWS API every 60 seconds to update them
ZMON AWS Agent - Auto Discovery
➜ ~ zmon entities get "planb-tokeninfo-cd44-oFM6x[aws:999:eu-west-1]"
id: planb-tokeninfo-cd44-oFM6x[aws:999:eu-west-1]
type: instance
application_id: planb-tokeninfo
host: 172.31.169.6
infrastructure_account: aws:999
instance_type: c4.xlarge
ip: 172.31.169.6
ports: { '9020': 9020, '9021': 9021 }
region: eu-west-1
source: registry.opensource.zalan.do/stups/planb-tokeninfo:cd44
stack_name: planb-tokeninfo-eu-west-1
stack_version: cd44
Example Instance Entity
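
The entity above is a flat key-value record. As a rough illustration only (this is not the agent's actual code), the sketch below assembles a record of the same shape from instance details. The helper name `build_instance_entity` and its parameters are hypothetical, and the id format is inferred solely from the example shown.

```python
def build_instance_entity(name, account, region, ip, instance_type,
                          application_id, stack_name, stack_version,
                          ports, source):
    """Assemble an instance entity shaped like the slide's example.

    Hypothetical helper: field names mirror the slide, but the real
    discovery agent may derive them differently.
    """
    return {
        # Id format assumed from the example: name[account:region]
        "id": "{}[{}:{}]".format(name, account, region),
        "type": "instance",
        "application_id": application_id,
        "host": ip,
        "infrastructure_account": account,
        "instance_type": instance_type,
        "ip": ip,
        # Ports are keyed by their string form, as in the example
        "ports": {str(port): port for port in ports},
        "region": region,
        "source": source,
        "stack_name": stack_name,
        "stack_version": stack_version,
    }


entity = build_instance_entity(
    name="planb-tokeninfo-cd44-oFM6x",
    account="aws:999",
    region="eu-west-1",
    ip="172.31.169.6",
    instance_type="c4.xlarge",
    application_id="planb-tokeninfo",
    stack_name="planb-tokeninfo-eu-west-1",
    stack_version="cd44",
    ports=[9020, 9021],
    source="registry.opensource.zalan.do/stups/planb-tokeninfo:cd44",
)
```

Because every attribute is an ordinary key, checks can later select entities by any of them (type, application_id, region, and so on).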
Instance Metrics
● Memory usage
● Disk space usage
● CPU usage
● Application logs
● Application metrics
Monitoring Plan-B instances on AWS
Scalyr Agent
Log shipping
Prometheus
Node Agent
:9100/metrics
Taupage AMI (Ubuntu base)
Application Container
Go / Spring Boot / Cassandra
Docker runtime
:8080 -> app
:7979 -> metrics
Jolokia Request Example
Check Results
Alert on application metrics
HTTP requests reading JSON application metrics
Read JMX data via Jolokia/HTTP for Cassandra
Read Prometheus Node data for EC2 metrics
CloudWatch() queries for ELB metrics
Scalyr API queries for application logs
Check commands used so far
Annotated Metric Data in Grafana
Entities
● hosts, databases, applications, instances ...
● generic key value object
● 10000+ entities in our deployment
Entities
{
"id": "node01:8080",
"type": "instance",
"host": "node01",
"ports": {"8080":8080,"8181":8181},
"application_id": "zmon",
"application_version": "0.1.0",
"dc":"dc1"
}
Entity "node01:8080"
Entity Service (part of controller)
id: localhost:5432
type: postgres
host: localhost
port: 5432
shards:
local_zmon_db: "localhost:5432/local_zmon_db"
local-postgres.yaml
Integrated easy-to-use entity store with REST API
Build your own discovery agent (K8S, …)
>zmon entities push local-postgres.yaml
Checks
● select subset of entities
● executes Python expression
○ powerful using eval with custom context
○ Builtins: HTTP, PostgreSQL, MySQL, CloudWatch,
Redis, SNMP/NRPE, tcp, SOAP, Scalyr, ES, ...
● returns "value" object
○ In practice, nearly every check soon returned dicts
Checks
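
To make "select subset of entities" concrete, here is a minimal sketch of filter matching. It assumes that a check's `entities` list is a set of alternative filters and that an entity matches a filter when every key/value pair in the filter equals the entity's. That reading is an assumption, and the function names and sample entities are hypothetical.

```python
def matches(entity, entity_filter):
    """True if every key/value pair in the filter is present in the entity."""
    return all(entity.get(key) == value for key, value in entity_filter.items())


def select_entities(entities, filters):
    """Keep entities that satisfy at least one of the filters."""
    return [e for e in entities if any(matches(e, f) for f in filters)]


# Hypothetical sample data in the generic key-value style of the slides
entities = [
    {"id": "node01:8080", "type": "instance", "dc": "dc1"},
    {"id": "db01:5432", "type": "postgresql", "dc": "dc1"},
    {"id": "db02:5432", "type": "postgresql", "dc": "dc2"},
]

# One filter with two keys: both must match
selected = select_entities(entities, [{"type": "postgresql", "dc": "dc1"}])
selected_ids = [e["id"] for e in selected]
```

With this data, `selected_ids` contains only `db01:5432`.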
REST API to update or use web front end
zmon check-definitions update select-1-check.yaml
Managing checks
name: "Select 1"
owning_team: "Team ZMON"
command: |
sql().execute("select 1 as a").results()
entities:
- type: postgresql
interval: 15
description: "Test connection to PostgreSQL"
select-1-check.yaml
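
The check command is a Python expression evaluated with a custom context. The sketch below shows that idea in isolation: `FakeSql` is a hypothetical stand-in that returns canned rows instead of querying PostgreSQL, and `run_check` is an illustrative helper, not ZMON's worker code.

```python
class FakeSql:
    """Hypothetical stand-in for a SQL builtin; returns canned rows."""

    def __init__(self, rows):
        self._rows = rows
        self.last_statement = None

    def execute(self, statement):
        # Remember the statement and return self so calls can be chained
        self.last_statement = statement
        return self

    def results(self):
        return list(self._rows)


def run_check(command, context):
    """Evaluate a check command against the names provided in `context`.

    Emptying __builtins__ limits the expression to the given names; this
    is an illustration, not a security sandbox.
    """
    return eval(command.strip(), {"__builtins__": {}}, dict(context))


command = 'sql().execute("select 1 as a").results()'
context = {"sql": lambda: FakeSql([{"a": 1}])}
value = run_check(command, context)
```

Here `value` is a list with a single row dict, which is the kind of "value" object an alert would then be evaluated against.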
Trial Run - Quick feedback and easier development
Alerts
● Evaluated against a check’s value; each alert is bound to a single check
● Defines team and responsible team
● Can inherit from another alert
● Evaluates a Python expression yielding True/False
● No "WARNING" state, no "UNKNOWN" state
● Priorities (color) and tags
Alerts
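
A minimal sketch of the two ideas above: an alert condition as a Python expression over the check's value, and an inherited alert that overrides some fields of its parent. The field names (`condition`, `priority`, `team`), the helper names, and the override-by-copy behaviour are assumptions made for illustration.

```python
def evaluate_alert(condition, value):
    """Return True when the alert condition holds for a check value."""
    # Only `value` is visible to the expression; builtins are emptied.
    return bool(eval(condition, {"__builtins__": {}}, {"value": value}))


def inherit_alert(parent, overrides):
    """Copy the parent alert and replace only the overridden fields."""
    child = dict(parent)
    child.update(overrides)
    return child


# Hypothetical alert definitions
base_alert = {
    "name": "Too many connections",
    "condition": "value['count'] > 100",
    "priority": 2,
    "team": "Team ZMON",
}
team_alert = inherit_alert(
    base_alert, {"condition": "value['count'] > 50", "team": "Team ABC"}
)

check_value = {"count": 75}
fired_base = evaluate_alert(base_alert["condition"], check_value)
fired_team = evaluate_alert(team_alert["condition"], check_value)
```

The same check value leaves the base alert inactive but triggers the stricter inherited one, and there are only two outcomes, matching the "no WARNING, no UNKNOWN" rule.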
Downtimes
● Set or schedule downtimes using the UI
● Use API to automate downtimes, e.g. in deployment tool
Anyone can add alerts to checks
Alerts are owned by team
Monitor application boundaries/dependencies
Make use of inheritance to customize
Sharing and reuse of alerts and checks
Deployment
ZMON Core + UI + KairosDB
● Scheduler (JVM)
● Workers (Python)
● Controller (Java)
● Frontend (AngularJS)
● CLI (Python)
● KairosDB (Java)
● Redis
● PostgreSQL
● Cassandra
● Metric Cache
[Diagram labels: Queue/State, Check/Alert definition, Entity data]
ZMON in AWS / Multi DC Setup
[Diagram: Team "Foo" (*.foo.example.org) and Team "Bar" (*.bar.example.org), each with EC2 instances and a ZMON Appliance; ELBs; ZMON Data Service]
● Scheduler supports queue filters by entity
○ e.g. {"dc":"dc1"} vs {"dc":"dc2"} queue filters
● Scheduler can apply base filter
○ only handles entities with {"dc":"dc1"}
● Worker can report home using:
○ Redis (we use this across DCs)
○ HTTPS (AWS->DC)
Multi DC / Zone deployment possible
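
The queue filters and base filter described above can be sketched as simple dictionary matching. The queue names, function names, and the "first matching filter wins" rule are assumptions for illustration, not the scheduler's actual behaviour.

```python
def matches(entity, entity_filter):
    """True if every key/value pair in the filter is present in the entity."""
    return all(entity.get(key) == value for key, value in entity_filter.items())


def pick_queue(entity, queue_filters, default_queue="default"):
    """Return the first queue whose filter matches, else the default queue."""
    for queue_name, entity_filter in queue_filters:
        if matches(entity, entity_filter):
            return queue_name
    return default_queue


# Hypothetical queue names, one per data center
queue_filters = [
    ("queue-dc1", {"dc": "dc1"}),
    ("queue-dc2", {"dc": "dc2"}),
]

entities = [
    {"id": "node01:8080", "dc": "dc1"},
    {"id": "node02:8080", "dc": "dc2"},
    {"id": "node03:8080"},
]

# Queue filter: route each entity's work to a per-DC queue
routing = {e["id"]: pick_queue(e, queue_filters) for e in entities}

# Base filter: a scheduler that only handles entities in dc1
dc1_only = [e["id"] for e in entities if matches(e, {"dc": "dc1"})]
```

Routing by entity attributes is what lets workers sit close to the systems they check, while results are reported home over Redis or HTTPS.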
Microservices
Expose your data / Convention on key names/structure
{
"zmon.response.200.GET.checks.all-active-check-definitions.count": 10,
"zmon.response.200.GET.checks.all-active-check-definitions.fifteenMinuteRate": 0.18071,
"zmon.response.200.GET.checks.all-active-check-definitions.fiveMinuteRate": 0.15181,
"zmon.response.200.GET.checks.all-active-check-definitions.oneMinuteRate": 0.10512,
"zmon.response.200.GET.checks.all-active-check-definitions.75thPercentile": 1173,
"zmon.response.200.GET.checks.all-active-check-definitions.95thPercentile": 1233,
"zmon.response.200.GET.checks.all-active-check-definitions.999thPercentile": 1282,
"zmon.response.200.GET.checks.all-active-check-definitions.99thPercentile": 1282,
"zmon.response.200.GET.checks.all-active-check-definitions.max": 1282,
"zmon.response.200.GET.checks.all-active-check-definitions.median": 1161,
"zmon.response.200.GET.checks.all-active-check-definitions.min": 1114
}
Application metrics
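
A key-name convention pays off because a check can split the keys mechanically. The sketch below assumes the layout `zmon.response.<status>.<method>.<path segments>.<metric>`, inferred only from the sample above; the function names are hypothetical.

```python
def parse_metric_key(key):
    """Split a flat metric key into status, method, path, and metric name."""
    parts = key.split(".")
    if len(parts) < 6 or parts[:2] != ["zmon", "response"]:
        raise ValueError("unexpected metric key: {}".format(key))
    return {
        "status": int(parts[2]),
        "method": parts[3],
        # Everything between the method and the metric name is the path
        "path": "/".join(parts[4:-1]),
        "metric": parts[-1],
    }


def group_metrics(metrics):
    """Group flat metrics by (status, method, path)."""
    grouped = {}
    for key, value in metrics.items():
        parsed = parse_metric_key(key)
        endpoint = (parsed["status"], parsed["method"], parsed["path"])
        grouped.setdefault(endpoint, {})[parsed["metric"]] = value
    return grouped


sample = {
    "zmon.response.200.GET.checks.all-active-check-definitions.count": 10,
    "zmon.response.200.GET.checks.all-active-check-definitions.median": 1161,
    "zmon.response.200.GET.checks.all-active-check-definitions.99thPercentile": 1282,
}
grouped = group_metrics(sample)
endpoint = (200, "GET", "checks/all-active-check-definitions")
```

After grouping, `grouped[endpoint]` holds the count, median, and 99th percentile for that one endpoint, which is a convenient shape for an alert condition.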
Continued ...
Spring Boot (extending metrics)
https://github.com/zalando/zmon-actuator
Python (Swagger first on Flask)
https://github.com/zalando/connexion
Clojure (Swagger first)
https://github.com/zalando-stups/friboo/
Example libraries and framework support ...
Demo:
https://demo.zmon.io
ZMON on GitHub:
https://github.com/zalando/zmon
Documentation:
https://docs.zmon.io
Zalando Tech:
https://tech.zalando.com

Atmosphere 2016 - Jan Mussler - ZMON: Zalando's OS approach to monitoring in the cloud and DCs

  • 1. ZMON - OS monitoring in the cloud Atmosphere 2016 | Krakow 17.5.2016 | jan.mussler@zalando.de | @JanMussler
  • 2. Europe's Leading Fashion Platform: 15 countries, 3 fulfillment centers, 18+ million active customers, 3.0+ billion € revenue, 135+ million visits per month, 1,000+ employees in tech. Visit us: tech.zalando.com
  • 5. STUPS in a nutshell: ➊ One AWS account per team ➋ Deployment with Docker ➌ Managed SSH access ➍ REST/OAuth 2.0 mandatory ➎ Traceability of changes
  • 6. AWS deployment. [Diagram: Senza CLI (deploy tool), Pier One (Docker registry, docker push / docker pull), Taupage AMI]
  • 7. Isolated AWS accounts. [Diagram: Internet, Team ABC (*.abc.example.org), Team XYZ (*.xyz.example.org), ELBs, EC2 instances]
  • 9. ZMON - Highlights ;-)
      ● Flexible and extendable: checks & alerts in Python
      ● Integrate: REST APIs, OAuth2, AWS auto discovery
      ● Fully configurable via UI / API: no restarts required!
      ● Great for teams: team dashboards, alert inheritance
      ● Fast/scaling metrics: Redis, KairosDB + Grafana 2
      ● Hackweek 2015: iOS app and Android app ;-)
  • 14. Display historic data using Grafana 2
  • 15. Notifications: E-Mail, plus iOS and Android app
  • 16. ZMON Controller -> UI + REST API
      ● Full authentication for all endpoints
      ● OAuth2 login flow (e.g. via GitHub login)
      ● “TV Tokens” for “read-only” dashboard login
      ● Grafana 2 bundled and API implemented
        ○ ZMON stores dashboards incl. tags/stars
        ○ KairosDB proxy
        ○ ElasticSearch proxy (in progress)
  • 18. Plan B Deployment - Multi Region Setup (JWT issue/verification). [Diagram components: ELB + Provider (Java), ELB + Tokeninfo (Go), C* nodes, repeated across regions]
  • 19. ZMON AWS Agent - Auto Discovery: creates “entities” to describe the deployment (ELBs, ASGs, applications, instances, ...) and crawls the AWS API every 60 sec to update them.
  • 20. Example Instance Entity
      ➜ ~ zmon entities get "planb-tokeninfo-cd44-oFM6x[aws:999:eu-west-1]"
      id: planb-tokeninfo-cd44-oFM6x[aws:999:eu-west-1]
      type: instance
      application_id: planb-tokeninfo
      host: 172.31.169.6
      infrastructure_account: aws:999
      instance_type: c4.xlarge
      ip: 172.31.169.6
      ports: { '9020': 9020, '9021': 9021 }
      region: eu-west-1
      source: registry.opensource.zalan.do/stups/planb-tokeninfo:cd44
      stack_name: planb-tokeninfo-eu-west-1
      stack_version: cd44
  • 21. Monitoring Plan-B instances on AWS
      Instance metrics:
      ● Memory usage
      ● Disk space usage
      ● CPU usage
      ● Application logs
      ● Application metrics
      [Diagram: Taupage AMI (Ubuntu base) with Scalyr Agent (log shipping), Prometheus Node Agent (:9100/metrics), and Docker runtime hosting the application container (Go / Spring Boot / Cassandra); :8080 -> app, :7979 -> metrics]
  • 25. Check commands used so far
      ● HTTP requests reading JSON application metrics
      ● Read JMX data via Jolokia/HTTP for Cassandra
      ● Read Prometheus Node data for EC2 metrics
      ● CloudWatch() queries for ELB metrics
      ● Scalyr API queries for application logs
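As a rough illustration of the "Read Prometheus Node data" item, the sketch below parses the plain-text format served on :9100/metrics into a dict. The function name `parse_prometheus_text` and the sample payload are made up for this example; this is not ZMON's own implementation, and it assumes that no line carries a trailing timestamp.

```python
from typing import Dict


def parse_prometheus_text(body: str) -> Dict[str, float]:
    """Parse 'name{labels} value' lines, skipping comments and blank lines."""
    samples: Dict[str, float] = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Assumption: no trailing timestamp, so the value is the last token.
        series, raw_value = line.rsplit(None, 1)
        samples[series] = float(raw_value)
    return samples


# Illustrative payload only; real exporters expose many more series.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.21
node_filesystem_avail_bytes{mountpoint="/"} 1.5e+09
"""

metrics = parse_prometheus_text(SAMPLE)
print(metrics["node_load1"])
```

A check could return such a dict as its value, which an alert expression can then inspect.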
  • 26. Annotated Metric Data in Grafana
  • 27. Annotated Metric Data in Grafana
  • 29. Entities
      ● hosts, databases, applications, instances ...
      ● generic key value object
      ● 10000+ entities in our deployment
      Entity "node01:8080":
      {
        "id": "node01:8080",
        "type": "instance",
        "host": "node01",
        "ports": {"8080": 8080, "8181": 8181},
        "application_id": "zmon",
        "application_version": "0.1.0",
        "dc": "dc1"
      }
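Since an entity is just a generic key-value object, a custom discovery agent mostly has to emit dicts of this shape. The following is a minimal sketch that rebuilds the "node01:8080" entity from slide 29. The helper `build_instance_entity` and its parameters are hypothetical, and taking the first port for the id is an assumption read off the example, not a documented rule.

```python
from typing import Any, Dict, List


def build_instance_entity(host: str, ports: List[int], application_id: str,
                          **extra: Any) -> Dict[str, Any]:
    """Build an 'instance' entity dict shaped like the slide's example."""
    if not ports:
        raise ValueError("at least one port is required to form the id")
    entity: Dict[str, Any] = {
        # Assumption: the id is '<host>:<first port>', as in "node01:8080".
        "id": f"{host}:{ports[0]}",
        "type": "instance",
        "host": host,
        # The slide maps port names (strings) to port numbers.
        "ports": {str(p): p for p in ports},
        "application_id": application_id,
    }
    entity.update(extra)  # free-form keys such as dc or application_version
    return entity


entity = build_instance_entity(
    "node01", [8080, 8181], "zmon", application_version="0.1.0", dc="dc1"
)
print(entity["id"])
```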
  • 30. Entity Service (part of controller)
      ● Integrated easy-to-use entity store with REST API
      ● Build your own discovery agent (K8S, …)
      local-postgres.yaml:
        id: localhost:5432
        type: postgres
        host: localhost
        port: 5432
        shards:
          local_zmon_db: "localhost:5432/local_zmon_db"
      > zmon entities push local-postgres.yaml
  • 32. Checks
      ● select subset of entities
      ● executes Python expression
        ○ powerful using eval with custom context
        ○ Builtins: HTTP, PostgreSQL, MySQL, CloudWatch, Redis, SNMP/NRPE, tcp, SOAP, Scalyr, ES, ...
      ● returns "value" object
        ○ in practice, every check quickly ended up returning dicts
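The check flow described on this slide (select entities, evaluate a Python expression with a custom context, return a value) can be sketched roughly as below. This is not the ZMON worker code: `run_check`, `matches`, `demo_context`, and the `port_count()` builtin are names invented for illustration, and a bare `eval` like this is not a sandbox for untrusted input.

```python
from typing import Any, Callable, Dict, List

Entity = Dict[str, Any]


def matches(entity: Entity, entity_filter: Dict[str, Any]) -> bool:
    """True if every key/value pair of the filter is present in the entity."""
    return all(entity.get(k) == v for k, v in entity_filter.items())


def run_check(command: str, entity_filters: List[Dict[str, Any]],
              entities: List[Entity],
              make_context: Callable[[Entity], Dict[str, Any]]) -> Dict[str, Any]:
    """Evaluate the command once per matching entity, keyed by entity id."""
    results: Dict[str, Any] = {}
    for entity in entities:
        if not any(matches(entity, f) for f in entity_filters):
            continue
        # Custom context: only the names we hand in are visible to the command.
        scope = {"__builtins__": {}, "entity": entity, **make_context(entity)}
        try:
            results[entity["id"]] = {"value": eval(command, scope)}
        except Exception as exc:  # a failing check is recorded, not raised
            results[entity["id"]] = {"value": None, "exception": str(exc)}
    return results


def demo_context(entity: Entity) -> Dict[str, Any]:
    """Hypothetical builtin bound to the current entity."""
    return {"port_count": lambda: len(entity.get("ports", {}))}


ENTITIES = [
    {"id": "node01:8080", "type": "instance", "ports": {"8080": 8080, "8181": 8181}},
    {"id": "localhost:5432", "type": "postgres", "port": 5432},
]

results = run_check('{"ports": port_count()}', [{"type": "instance"}],
                    ENTITIES, demo_context)
print(results)
```

Binding the builtins per entity is what lets a command stay short: the expression never has to name the host or port it runs against.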
  • 33. Managing checks: REST API to update or use web front end
      zmon check-definitions update select-1-check.yaml
      select-1-check.yaml:
        name: "Select 1"
        owning_team: "Team ZMON"
        command: |
          sql().execute("select 1 as a").results()
        entities:
          - type: postgresql
        interval: 15
        description: "Test connection to PostgreSQL"
  • 35. Trial Run - Quick feedback and easier development
  • 37. Alerts
      ● Executes using a check’s value, bound to a single check
      ● Defines team and responsible team
      ● Allows inheritance from another alert
      ● Evaluates Python expression yielding True/False
      ● No "WARNING" state, no "UNKNOWN" state
      ● Priorities (color) and tags
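To make the alert model concrete, here is a small sketch of a condition evaluated against a check value, with a child alert inheriting from a parent. The field names (`parent_id`, `condition`, `priority`) and both helper functions are assumptions for illustration, not ZMON's actual schema or code.

```python
from typing import Any, Dict

Alert = Dict[str, Any]


def resolve_alert(alert_id: int, alerts: Dict[int, Alert]) -> Alert:
    """Merge an alert with its ancestors; the child's fields win."""
    alert = alerts[alert_id]
    parent_id = alert.get("parent_id")
    if parent_id is None:
        return dict(alert)
    resolved = resolve_alert(parent_id, alerts)
    resolved.update(alert)
    return resolved


def is_alerting(alert: Alert, value: Any) -> bool:
    """Evaluate the condition against the check value; only True/False exist."""
    return bool(eval(alert["condition"], {"__builtins__": {}, "value": value}))


ALERTS: Dict[int, Alert] = {
    1: {"id": 1, "name": "Slow responses", "team": "Team ZMON",
        "condition": "value['median'] > 1000", "priority": 2},
    # Inherits name and team, overrides threshold and priority.
    2: {"id": 2, "parent_id": 1,
        "condition": "value['median'] > 2000", "priority": 1},
}

check_value = {"median": 1161}
parent = resolve_alert(1, ALERTS)
child = resolve_alert(2, ALERTS)
print(is_alerting(parent, check_value), is_alerting(child, check_value))
```

Because the result is coerced to a boolean, there is no room for a "WARNING" or "UNKNOWN" state; severity is expressed through the priority field instead.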
  • 40. Downtimes ● Set or schedule downtimes using the UI ● Use API to automate downtimes, e.g. in deployment tool
  • 41. Sharing and reuse of alerts and checks
      ● Anyone can add alerts to checks
      ● Alerts are owned by team
      ● Monitor application boundaries/dependencies
      ● Make use of inheritance to customize
  • 43. ZMON Core + UI + KairosDB. [Diagram components: Frontend (AngularJS), CLI (Python), Controller (Java), PostgreSQL, Scheduler (JVM), Redis, Workers (Python), KairosDB (Java), Cassandra; labels: Queue/State, Check/Alert definition, Entity data, Metric Cache]
  • 44. ZMON in AWS / Multi DC Setup. [Diagram components: Team "Foo" (*.foo.example.org), Team "Bar" (*.bar.example.org), EC2 instances, ZMON Appliances, ELBs, ZMON Data Service]
  • 45. Multi DC / Zone deployment possible
      ● Scheduler supports queue filters by entity
        ○ e.g. {"dc":"dc1"} vs {"dc":"dc2"} queue filters
      ● Scheduler can apply base filter
        ○ only handles entities with {"dc":"dc1"}
      ● Worker can report home using:
        ○ Redis (we use this across DCs)
        ○ HTTPS (AWS->DC)
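The base filter and queue filters can be pictured as simple key-value matching on entities. The sketch below is one possible reading of the slide, not the scheduler's real logic; the `route` function and the queue names are invented for the example.

```python
from typing import Any, Dict, Optional

Entity = Dict[str, Any]
Filter = Dict[str, Any]


def matches(entity: Entity, flt: Filter) -> bool:
    """An empty filter matches everything; otherwise all pairs must match."""
    return all(entity.get(k) == v for k, v in flt.items())


def route(entity: Entity, base_filter: Filter,
          queue_filters: Dict[str, Filter],
          default_queue: str = "queue:default") -> Optional[str]:
    """Return the queue for this entity, or None if this scheduler skips it."""
    if not matches(entity, base_filter):
        return None  # outside this scheduler's responsibility
    for queue, flt in queue_filters.items():
        if matches(entity, flt):
            return queue
    return default_queue


# Hypothetical queue names, one per data center.
QUEUE_FILTERS = {"queue:dc1": {"dc": "dc1"}, "queue:dc2": {"dc": "dc2"}}

node_dc1 = {"id": "node01:8080", "type": "instance", "dc": "dc1"}
node_dc2 = {"id": "node02:8080", "type": "instance", "dc": "dc2"}

print(route(node_dc1, {}, QUEUE_FILTERS))             # central scheduler
print(route(node_dc2, {"dc": "dc1"}, QUEUE_FILTERS))  # dc1-only scheduler
```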
  • 48. Expose your data / Convention on key names/structure
      {
        "zmon.response.200.GET.checks.all-active-check-definitions.count": 10,
        "zmon.response.200.GET.checks.all-active-check-definitions.fifteenMinuteRate": 0.18071,
        "zmon.response.200.GET.checks.all-active-check-definitions.fiveMinuteRate": 0.15181,
        "zmon.response.200.GET.checks.all-active-check-definitions.oneMinuteRate": 0.10512,
        "zmon.response.200.GET.checks.all-active-check-definitions.75thPercentile": 1173,
        "zmon.response.200.GET.checks.all-active-check-definitions.95thPercentile": 1233,
        "zmon.response.200.GET.checks.all-active-check-definitions.999thPercentile": 1282,
        "zmon.response.200.GET.checks.all-active-check-definitions.99thPercentile": 1282,
        "zmon.response.200.GET.checks.all-active-check-definitions.max": 1282,
        "zmon.response.200.GET.checks.all-active-check-definitions.median": 1161,
        "zmon.response.200.GET.checks.all-active-check-definitions.min": 1114
      }
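A naming convention pays off because generic code can take the keys apart again. The sketch below assumes the layout `zmon.response.<status>.<method>.<path>.<stat>`, which is inferred from the sample keys on this slide rather than taken from a specification; `MetricKey`, `parse_key`, and `group_by_endpoint` are illustrative names.

```python
from collections import defaultdict
from typing import Dict, NamedTuple, Tuple


class MetricKey(NamedTuple):
    status: int
    method: str
    path: str
    stat: str


def parse_key(key: str) -> MetricKey:
    """Split a dotted key; the path itself may contain dots."""
    parts = key.split(".")
    if len(parts) < 6 or parts[:2] != ["zmon", "response"]:
        raise ValueError(f"unexpected metric key: {key}")
    return MetricKey(int(parts[2]), parts[3], ".".join(parts[4:-1]), parts[-1])


def group_by_endpoint(
    metrics: Dict[str, float]
) -> Dict[Tuple[int, str, str], Dict[str, float]]:
    """Group flat metrics into {(status, method, path): {stat: value}}."""
    grouped: Dict[Tuple[int, str, str], Dict[str, float]] = defaultdict(dict)
    for key, value in metrics.items():
        parsed = parse_key(key)
        grouped[(parsed.status, parsed.method, parsed.path)][parsed.stat] = value
    return dict(grouped)


PREFIX = "zmon.response.200.GET.checks.all-active-check-definitions"
sample = {f"{PREFIX}.count": 10, f"{PREFIX}.median": 1161, f"{PREFIX}.max": 1282}

grouped = group_by_endpoint(sample)
print(grouped)
```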
  • 51. Example libraries and framework support ...
      ● Spring Boot (extending metrics): https://github.com/zalando/zmon-actuator
      ● Python (Swagger first on Flask): https://github.com/zalando/connexion
      ● Clojure (Swagger first): https://github.com/zalando-stups/friboo/