SlideShare uma empresa Scribd logo
1 de 24
Baixar para ler offline
Regain control thanks to Prometheus
Guillaume Lefevre, DevOps Engineer, OCTO Technology
Etienne Coutaud, DevOps Engineer, OCTO Technology
About us
Guillaume Lefevre
DevOps Engineer, OCTO Technology
@guillaumelfv
Etienne Coutaud
DevOps Engineer, OCTO Technology
@etiennecoutaud
What we are dealing with
• Track every events of a package lifecycle
• Aggregate and compute events to generate relevant business data
• Store and serve results for consultation or further computing
What we are dealing with
2,000,000
packages / day
20,000,000
events / day
200 processes
running 24/7
CQRS Architecture
Kafka
Spark
streaming
API
Scala
Front
Cassandra
ElasticSearch
Final
users
Package
Flashcode
Events
● ~ 90 servers
● Whole configuration managed with
Zookeeper
Monitoring system metrics
Kafka
Spark
streaming
API
Scala
Front
Cassandra
ElasticSearch
Final
users
Package
Flashcode
Events
prometheus node_exporter
Zookeeper
Monitoring middleware metrics
Kafka
Spark
streaming
API
Scala
Front
Cassandra
ElasticSearch
Final
users
Package
Flashcode
Events
prometheus node_exporter
third-party exporters (specific exporter / JMX)
JMX
HAproxy
exporter
HAproxy
exporter
JMX
Homemade
exporter
JMX
Zookeeper
Zookeeper
exporter
Elasticsearch
exporter
Monitoring business metrics
Kafka
Spark
streaming
API
Scala
Front
Cassandra
ElasticSearch
Final
users
Package
Flashcode
Events
prometheus node_exporter
third-party exporters (specific exporter / JMX)
JMX
JMX JMX
HAproxy
exporter
HAproxy
exporter
Homemade
exporter
client library
business
metrics
business
metrics
Zookeeper
Zookeeper
exporter
Elasticsearch
exporter
Full stack
prometheus node_exporter
third-party exporters (specific exporter / JMX)
client library
Prometheus
AlertManager Mail
SlackGrafana
targets
Kafka
Spark
streaming
API
Scala
Front
Cassandra
ElasticSearch
Final
users
Package
Flashcode
Events
JMX
JMX JMX
Elasticsearch
exporter
HAproxy
exporter
HAproxy
exporter
Homemade
exporter
business
metrics
business
metrics
Zookeeper
Zookeeper
exporter
Data volume
~ 250
endpoints scrapped
~ 15,000
different metrics
~ 50 GB
data / day
Enabling metrics: node_exporter
$ ./node_exporter --web.listen-address 0.0.0.0:9100 --collector.systemd
Have a look at the “disabled by default” collectors
https://github.com/prometheus/node_exporter
Wrapped with systemd
Enabling metrics: JMX
KAFKA_OPTS="-javaagent:/var/lib/jmx-exporter/jmx_prometheus_javaagent.jar=9999:/var/lib/jmx-exporter/kafka.yml"
Use the exporter as a java-agent at JVM startup
Kafka
JMX exporter
Prometheus
HTTP : 9999
JVM
JMX
Enabling metrics: Scala app
object InjectorReacApp extends SparkApp ({ (ssc: StreamingContext) =>
ReacUserMetricsSystem.load("InjectorReacMetrics")(ssc.sparkContext)
object SparkAppMetrics extends ContextLogger {
private lazy val handledXMLCounter: SparkCounter = ReacUserMetricsSystem.counter("handledXMLCounter")
private lazy val measuredXMLCounter: SparkCounter = ReacUserMetricsSystem.counter("measuredXMLCounter")
private lazy val handledMessagesCounter: SparkCounter = ReacUserMetricsSystem.counter("handledMessagesCounter")
private lazy val rejectedMessagesCounter: SparkCounter = ReacUserMetricsSystem.counter("rejectedMessagesCounter")
private lazy val succeededMessagesCounter: SparkCounter = ReacUserMetricsSystem.counter("succeededMessagesCounter")
Use the scala library to define metrics
Inject metrics into Spark context
SelfHealing using AlertManager + Jenkins
ALERT SparkWorkerDataFull
IF node_filesystem_free{job='spark_system_metrics} < 0.05
FOR 5m
LABELS {
severity = "high",
environment = "production",
selfhealing = "true",
jenkins_job = "clean_data_worker"
}
ANNOTATIONS {
summary = "Disk /data on {{ $labels.instance }} 95% full"
}
route:
- match:
selfhealing: true
jenkins_job: clean_data_worker
receiver: jenkins-selfhealing-clean_data_worker
receivers:
- name: jenkins-selfhealing-clean_data_worker
webhook_configs:
- url: "http://jenkins:8080/job/production/clean_data_worker/build"
AlertRule in Prometheus server
AlertManager config file
Our favorite Prometheus queries
● Topk: used to get largest K elements by sample value, useful to extract components from the
whole metrics
topk(3,(1 - node_memory_MemFree / node_memory_MemTotal))
Example: Get the 3 servers with the least free RAM percentage
● predict_linear: used to predict a value with a simple linear regression, useful to extrapolate disk
filling
predict_linear(node_filesystem_free{mountpoint='/data'}[1h], 6 * 3600) < 0
Example: Predict filesystem fullness in less than 6h
Demo time
Code available at: https://github.com/EtienneCoutaud/kubecon2017-demo.git
Kafka
App
Grafana
Prometheus Alert Manager
Jenkins
Slack
system metrics
middleware metrics
service metrics
inject read
start
How do ops use it?
• Self-healing & Alerting help you focus on delivering more value
• Predict failure and correct it to prevent outage and business loss
• Faster troubleshooting and root cause pinpointing
• Tune and improve configuration middleware parameters
• Discover bottleneck and improve overall platform performance
How do dev use it?
• Easier performance testing and benchmarking
• Bring visibility and confidence into the platform
• Give a comprehensible view of the platform & KPIs to top management
• Allow developers to add their own metrics, dashboards and alerts at
application level
Tip: Never stop tweaking
• Choose the right metrics
• Choose accurate thresholds
• Prevent alert fatigue
Continuous improvement is the best way to:
Tip: Mind your NTP
• Ensure all your applications, servers and monitoring platform are in-sync
• Servers and browser clocks must not be too far apart
Trick: Scraping interval tradeoff
Choose wisely based on your business requirements!
low scraping interval (5s) higher scraping interval (15s)
• Almost real-time
• CPU intensive (increase with endpoints)
• Robust Prom server or federation needed
• Metrics “lag effect”
• Use less server resources
• Sustain more endpoints
Golden tip
Work with your users!
What’s next?
• Upgrade to 2.0 to improve performance
• Explore advanced features (federation, HA & third party-storage)
• Improve trigger precision and alarms
• Set up dashboard drilling for faster investigation
• Automate incident responses because we are lazy
• Monitor the underlying infrastructure
Take-away
• FullStack Monitoring
• Fast to deploy, long to master
• Share it with everybody / Provide feedback to everyone
• Changes the way of working
• Prometheus has many differents exporters, you will surely find yours!
• The format makes it easy to build a custom exporter

Mais conteúdo relacionado

Mais procurados

Monitoring Weave Cloud with Prometheus
Monitoring Weave Cloud with PrometheusMonitoring Weave Cloud with Prometheus
Monitoring Weave Cloud with PrometheusWeaveworks
 
Docker at and with SignalFx
Docker at and with SignalFxDocker at and with SignalFx
Docker at and with SignalFxSignalFx
 
Cncf k8s_network_03 (Ingress introduction)
Cncf k8s_network_03 (Ingress introduction)Cncf k8s_network_03 (Ingress introduction)
Cncf k8s_network_03 (Ingress introduction)Erhwen Kuo
 
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise Edition
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise EditionMarcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise Edition
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise EditionNagios
 
Nagios World Conference 2015 - Scott Wilkerson Opening
Nagios World Conference 2015 - Scott Wilkerson OpeningNagios World Conference 2015 - Scott Wilkerson Opening
Nagios World Conference 2015 - Scott Wilkerson OpeningNagios
 
Spinnaker Summit 2018: CI/CD Patterns for Kubernetes with Spinnaker
Spinnaker Summit 2018: CI/CD Patterns for Kubernetes with SpinnakerSpinnaker Summit 2018: CI/CD Patterns for Kubernetes with Spinnaker
Spinnaker Summit 2018: CI/CD Patterns for Kubernetes with SpinnakerAndrew Phillips
 
Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | Influ...
Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | Influ...Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | Influ...
Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | Influ...InfluxData
 
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.Nagios
 
InfluxDB Client Libraries and Applications | Miroslav Malecha | Bonitoo
InfluxDB Client Libraries and Applications | Miroslav Malecha | BonitooInfluxDB Client Libraries and Applications | Miroslav Malecha | Bonitoo
InfluxDB Client Libraries and Applications | Miroslav Malecha | BonitooInfluxData
 
Altitude NY 2018: 132 websites, 1 service: Your local news runs on Fastly
Altitude NY 2018: 132 websites, 1 service: Your local news runs on FastlyAltitude NY 2018: 132 websites, 1 service: Your local news runs on Fastly
Altitude NY 2018: 132 websites, 1 service: Your local news runs on FastlyFastly
 
AWS Loft Talk: Behind the Scenes with SignalFx
AWS Loft Talk: Behind the Scenes with SignalFxAWS Loft Talk: Behind the Scenes with SignalFx
AWS Loft Talk: Behind the Scenes with SignalFxSignalFx
 
Altitude NY 2018: Don't let the weeds overwhelm the garden
Altitude NY 2018: Don't let the weeds overwhelm the gardenAltitude NY 2018: Don't let the weeds overwhelm the garden
Altitude NY 2018: Don't let the weeds overwhelm the gardenFastly
 
Serverless stream processing of Debezium data change events with Knative | De...
Serverless stream processing of Debezium data change events with Knative | De...Serverless stream processing of Debezium data change events with Knative | De...
Serverless stream processing of Debezium data change events with Knative | De...Red Hat Developers
 
Lightning Fast Monitoring against Lightning Fast Outages
Lightning Fast Monitoring against Lightning Fast OutagesLightning Fast Monitoring against Lightning Fast Outages
Lightning Fast Monitoring against Lightning Fast OutagesMaxime Petazzoni
 
Building an Enterprise CaaS with Kubernetes and Rancher 2.0
Building an Enterprise CaaS with Kubernetes and Rancher 2.0Building an Enterprise CaaS with Kubernetes and Rancher 2.0
Building an Enterprise CaaS with Kubernetes and Rancher 2.0Shannon Williams
 
Building streaming applications using a managed Kafka service | DevNation Tec...
Building streaming applications using a managed Kafka service | DevNation Tec...Building streaming applications using a managed Kafka service | DevNation Tec...
Building streaming applications using a managed Kafka service | DevNation Tec...Red Hat Developers
 
Cncf k8s_network_part1
Cncf k8s_network_part1Cncf k8s_network_part1
Cncf k8s_network_part1Erhwen Kuo
 
An Introduction to Rancher
An Introduction to RancherAn Introduction to Rancher
An Introduction to RancherConner Swann
 
Why observability matters - now and in the future (w/guest Grafana)
Why observability matters - now and in the future (w/guest Grafana)Why observability matters - now and in the future (w/guest Grafana)
Why observability matters - now and in the future (w/guest Grafana)Weaveworks
 

Mais procurados (20)

Monitoring Weave Cloud with Prometheus
Monitoring Weave Cloud with PrometheusMonitoring Weave Cloud with Prometheus
Monitoring Weave Cloud with Prometheus
 
Docker at and with SignalFx
Docker at and with SignalFxDocker at and with SignalFx
Docker at and with SignalFx
 
Cncf k8s_network_03 (Ingress introduction)
Cncf k8s_network_03 (Ingress introduction)Cncf k8s_network_03 (Ingress introduction)
Cncf k8s_network_03 (Ingress introduction)
 
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise Edition
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise EditionMarcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise Edition
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise Edition
 
Nagios World Conference 2015 - Scott Wilkerson Opening
Nagios World Conference 2015 - Scott Wilkerson OpeningNagios World Conference 2015 - Scott Wilkerson Opening
Nagios World Conference 2015 - Scott Wilkerson Opening
 
Spinnaker Summit 2018: CI/CD Patterns for Kubernetes with Spinnaker
Spinnaker Summit 2018: CI/CD Patterns for Kubernetes with SpinnakerSpinnaker Summit 2018: CI/CD Patterns for Kubernetes with Spinnaker
Spinnaker Summit 2018: CI/CD Patterns for Kubernetes with Spinnaker
 
Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | Influ...
Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | Influ...Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | Influ...
Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | Influ...
 
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
 
InfluxDB Client Libraries and Applications | Miroslav Malecha | Bonitoo
InfluxDB Client Libraries and Applications | Miroslav Malecha | BonitooInfluxDB Client Libraries and Applications | Miroslav Malecha | Bonitoo
InfluxDB Client Libraries and Applications | Miroslav Malecha | Bonitoo
 
Altitude NY 2018: 132 websites, 1 service: Your local news runs on Fastly
Altitude NY 2018: 132 websites, 1 service: Your local news runs on FastlyAltitude NY 2018: 132 websites, 1 service: Your local news runs on Fastly
Altitude NY 2018: 132 websites, 1 service: Your local news runs on Fastly
 
AWS Loft Talk: Behind the Scenes with SignalFx
AWS Loft Talk: Behind the Scenes with SignalFxAWS Loft Talk: Behind the Scenes with SignalFx
AWS Loft Talk: Behind the Scenes with SignalFx
 
Altitude NY 2018: Don't let the weeds overwhelm the garden
Altitude NY 2018: Don't let the weeds overwhelm the gardenAltitude NY 2018: Don't let the weeds overwhelm the garden
Altitude NY 2018: Don't let the weeds overwhelm the garden
 
Serverless stream processing of Debezium data change events with Knative | De...
Serverless stream processing of Debezium data change events with Knative | De...Serverless stream processing of Debezium data change events with Knative | De...
Serverless stream processing of Debezium data change events with Knative | De...
 
Breaking the monolith
Breaking the monolithBreaking the monolith
Breaking the monolith
 
Lightning Fast Monitoring against Lightning Fast Outages
Lightning Fast Monitoring against Lightning Fast OutagesLightning Fast Monitoring against Lightning Fast Outages
Lightning Fast Monitoring against Lightning Fast Outages
 
Building an Enterprise CaaS with Kubernetes and Rancher 2.0
Building an Enterprise CaaS with Kubernetes and Rancher 2.0Building an Enterprise CaaS with Kubernetes and Rancher 2.0
Building an Enterprise CaaS with Kubernetes and Rancher 2.0
 
Building streaming applications using a managed Kafka service | DevNation Tec...
Building streaming applications using a managed Kafka service | DevNation Tec...Building streaming applications using a managed Kafka service | DevNation Tec...
Building streaming applications using a managed Kafka service | DevNation Tec...
 
Cncf k8s_network_part1
Cncf k8s_network_part1Cncf k8s_network_part1
Cncf k8s_network_part1
 
An Introduction to Rancher
An Introduction to RancherAn Introduction to Rancher
An Introduction to Rancher
 
Why observability matters - now and in the future (w/guest Grafana)
Why observability matters - now and in the future (w/guest Grafana)Why observability matters - now and in the future (w/guest Grafana)
Why observability matters - now and in the future (w/guest Grafana)
 

Semelhante a Regain Control Thanks To Prometheus

Apache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San JoseApache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San JoseHao Chen
 
Keptn: Unbreakable Continuous Delivery - Berlin CI/CD Meetup
Keptn: Unbreakable Continuous Delivery - Berlin CI/CD MeetupKeptn: Unbreakable Continuous Delivery - Berlin CI/CD Meetup
Keptn: Unbreakable Continuous Delivery - Berlin CI/CD MeetupJürgen Etzlstorfer
 
IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...
IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...
IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...IRJET Journal
 
Performance eng prakash.sahu
Performance eng prakash.sahuPerformance eng prakash.sahu
Performance eng prakash.sahuDr. Prakash Sahu
 
Managing and supporting PowerApps & Flow at scale by Daniel Laskewitz
Managing and supporting PowerApps & Flow at scale by Daniel LaskewitzManaging and supporting PowerApps & Flow at scale by Daniel Laskewitz
Managing and supporting PowerApps & Flow at scale by Daniel LaskewitzDaniel Laskewitz
 
Cloud Platform Symantec Meetup Nov 2014
Cloud Platform Symantec Meetup Nov 2014Cloud Platform Symantec Meetup Nov 2014
Cloud Platform Symantec Meetup Nov 2014Miguel Zuniga
 
PaaSTA: Autoscaling at Yelp
PaaSTA: Autoscaling at YelpPaaSTA: Autoscaling at Yelp
PaaSTA: Autoscaling at YelpNathan Handler
 
Exploring the Final Frontier of Data Center Orchestration: Network Elements -...
Exploring the Final Frontier of Data Center Orchestration: Network Elements -...Exploring the Final Frontier of Data Center Orchestration: Network Elements -...
Exploring the Final Frontier of Data Center Orchestration: Network Elements -...Puppet
 
Building Autonomous Operations for Kubernetes with keptn
Building Autonomous Operations for Kubernetes with keptnBuilding Autonomous Operations for Kubernetes with keptn
Building Autonomous Operations for Kubernetes with keptnJohannes Bräuer
 
Monitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInDataMonitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInDataGetInData
 
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)Brian Brazil
 
DevoxxUK: Optimizating Application Performance on Kubernetes
DevoxxUK: Optimizating Application Performance on KubernetesDevoxxUK: Optimizating Application Performance on Kubernetes
DevoxxUK: Optimizating Application Performance on KubernetesDinakar Guniguntala
 
[OpenInfra Days Korea 2018] Day 2 - E6 - OpenInfra monitoring with Prometheus
[OpenInfra Days Korea 2018] Day 2 - E6 - OpenInfra monitoring with Prometheus[OpenInfra Days Korea 2018] Day 2 - E6 - OpenInfra monitoring with Prometheus
[OpenInfra Days Korea 2018] Day 2 - E6 - OpenInfra monitoring with PrometheusOpenStack Korea Community
 
(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...
(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...
(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...Amazon Web Services
 
NET Aspire - NET Conf IL 2024 - Tamir Dresher.pdf
NET Aspire - NET Conf IL 2024 - Tamir Dresher.pdfNET Aspire - NET Conf IL 2024 - Tamir Dresher.pdf
NET Aspire - NET Conf IL 2024 - Tamir Dresher.pdfTamir Dresher
 
IDEAS Global A.I. Conference 2022.pdf
IDEAS Global A.I. Conference 2022.pdfIDEAS Global A.I. Conference 2022.pdf
IDEAS Global A.I. Conference 2022.pdfManimuthu Ayyannan
 
6 tips for improving ruby performance
6 tips for improving ruby performance6 tips for improving ruby performance
6 tips for improving ruby performanceEngine Yard
 
Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Databricks
 

Semelhante a Regain Control Thanks To Prometheus (20)

CV_RishabhDixit
CV_RishabhDixitCV_RishabhDixit
CV_RishabhDixit
 
Apache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real TimeApache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real Time
 
Apache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San JoseApache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San Jose
 
Keptn: Unbreakable Continuous Delivery - Berlin CI/CD Meetup
Keptn: Unbreakable Continuous Delivery - Berlin CI/CD MeetupKeptn: Unbreakable Continuous Delivery - Berlin CI/CD Meetup
Keptn: Unbreakable Continuous Delivery - Berlin CI/CD Meetup
 
IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...
IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...
IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...
 
Performance eng prakash.sahu
Performance eng prakash.sahuPerformance eng prakash.sahu
Performance eng prakash.sahu
 
Managing and supporting PowerApps & Flow at scale by Daniel Laskewitz
Managing and supporting PowerApps & Flow at scale by Daniel LaskewitzManaging and supporting PowerApps & Flow at scale by Daniel Laskewitz
Managing and supporting PowerApps & Flow at scale by Daniel Laskewitz
 
Cloud Platform Symantec Meetup Nov 2014
Cloud Platform Symantec Meetup Nov 2014Cloud Platform Symantec Meetup Nov 2014
Cloud Platform Symantec Meetup Nov 2014
 
PaaSTA: Autoscaling at Yelp
PaaSTA: Autoscaling at YelpPaaSTA: Autoscaling at Yelp
PaaSTA: Autoscaling at Yelp
 
Exploring the Final Frontier of Data Center Orchestration: Network Elements -...
Exploring the Final Frontier of Data Center Orchestration: Network Elements -...Exploring the Final Frontier of Data Center Orchestration: Network Elements -...
Exploring the Final Frontier of Data Center Orchestration: Network Elements -...
 
Building Autonomous Operations for Kubernetes with keptn
Building Autonomous Operations for Kubernetes with keptnBuilding Autonomous Operations for Kubernetes with keptn
Building Autonomous Operations for Kubernetes with keptn
 
Monitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInDataMonitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInData
 
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
 
DevoxxUK: Optimizating Application Performance on Kubernetes
DevoxxUK: Optimizating Application Performance on KubernetesDevoxxUK: Optimizating Application Performance on Kubernetes
DevoxxUK: Optimizating Application Performance on Kubernetes
 
[OpenInfra Days Korea 2018] Day 2 - E6 - OpenInfra monitoring with Prometheus
[OpenInfra Days Korea 2018] Day 2 - E6 - OpenInfra monitoring with Prometheus[OpenInfra Days Korea 2018] Day 2 - E6 - OpenInfra monitoring with Prometheus
[OpenInfra Days Korea 2018] Day 2 - E6 - OpenInfra monitoring with Prometheus
 
(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...
(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...
(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...
 
NET Aspire - NET Conf IL 2024 - Tamir Dresher.pdf
NET Aspire - NET Conf IL 2024 - Tamir Dresher.pdfNET Aspire - NET Conf IL 2024 - Tamir Dresher.pdf
NET Aspire - NET Conf IL 2024 - Tamir Dresher.pdf
 
IDEAS Global A.I. Conference 2022.pdf
IDEAS Global A.I. Conference 2022.pdfIDEAS Global A.I. Conference 2022.pdf
IDEAS Global A.I. Conference 2022.pdf
 
6 tips for improving ruby performance
6 tips for improving ruby performance6 tips for improving ruby performance
6 tips for improving ruby performance
 
Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0
 

Último

08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 

Último (20)

08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 

Regain Control Thanks To Prometheus

  • 1. Regain control thanks to Prometheus Guillaume Lefevre, DevOps Engineer, OCTO Technology Etienne Coutaud, DevOps Engineer, OCTO Technology
  • 2. About us Guillaume Lefevre DevOps Engineer, OCTO Technology @guillaumelfv Etienne Coutaud DevOps Engineer, OCTO Technology @etiennecoutaud
  • 3. What we are dealing with • Track every events of a package lifecycle • Aggregate and compute events to generate relevant business data • Store and serve results for consultation or further computing
  • 4. What we are dealing with 2,000,000 packages / day 20,000,000 events / day 200 processes running 24/7
  • 7. Monitoring middleware metrics Kafka Spark streaming API Scala Front Cassandra ElasticSearch Final users Package Flashcode Events prometheus node_exporter third-party exporters (specific exporter / JMX) JMX HAproxy exporter HAproxy exporter JMX Homemade exporter JMX Zookeeper Zookeeper exporter Elasticsearch exporter
  • 8. Monitoring business metrics Kafka Spark streaming API Scala Front Cassandra ElasticSearch Final users Package Flashcode Events prometheus node_exporter third-party exporters (specific exporter / JMX) JMX JMX JMX HAproxy exporter HAproxy exporter Homemade exporter client library business metrics business metrics Zookeeper Zookeeper exporter Elasticsearch exporter
  • 9. Full stack prometheus node_exporter third-party exporters (specific exporter / JMX) client library Prometheus AlertManager Mail SlackGrafana targets Kafka Spark streaming API Scala Front Cassandra ElasticSearch Final users Package Flashcode Events JMX JMX JMX Elasticsearch exporter HAproxy exporter HAproxy exporter Homemade exporter business metrics business metrics Zookeeper Zookeeper exporter
  • 10. Data volume ~ 250 endpoints scrapped ~ 15,000 different metrics ~ 50 GB data / day
  • 11. Enabling metrics: node_exporter $ ./node_exporter --web.listen-address 0.0.0.0:9100 --collector.systemd Have a look at the “disabled by default” collectors https://github.com/prometheus/node_exporter Wrapped with systemd
  • 12. Enabling metrics: JMX KAFKA_OPTS="-javaagent:/var/lib/jmx-exporter/jmx_prometheus_javaagent.jar=9999:/var/lib/jmx-exporter/kafka.yml" Use the exporter as a java-agent at JVM startup Kafka JMX exporter Prometheus HTTP : 9999 JVM JMX
  • 13. Enabling metrics: Scala app object InjectorReacApp extends SparkApp ({ (ssc: StreamingContext) => ReacUserMetricsSystem.load("InjectorReacMetrics")(ssc.sparkContext) object SparkAppMetrics extends ContextLogger { private lazy val handledXMLCounter: SparkCounter = ReacUserMetricsSystem.counter("handledXMLCounter") private lazy val measuredXMLCounter: SparkCounter = ReacUserMetricsSystem.counter("measuredXMLCounter") private lazy val handledMessagesCounter: SparkCounter = ReacUserMetricsSystem.counter("handledMessagesCounter") private lazy val rejectedMessagesCounter: SparkCounter = ReacUserMetricsSystem.counter("rejectedMessagesCounter") private lazy val succeededMessagesCounter: SparkCounter = ReacUserMetricsSystem.counter("succeededMessagesCounter") Use the scala library to define metrics Inject metrics into Spark context
  • 14. SelfHealing using AlertManager + Jenkins ALERT SparkWorkerDataFull IF node_filesystem_free{job='spark_system_metrics} < 0.05 FOR 5m LABELS { severity = "high", environment = "production", selfhealing = "true", jenkins_job = "clean_data_worker" } ANNOTATIONS { summary = "Disk /data on {{ $labels.instance }} 95% full" } route: - match: selfhealing: true jenkins_job: clean_data_worker receiver: jenkins-selfhealing-clean_data_worker receivers: - name: jenkins-selfhealing-clean_data_worker webhook_configs: - url: "http://jenkins:8080/job/production/clean_data_worker/build" AlertRule in Prometheus server AlertManager config file
  • 15. Our favorite Prometheus queries ● Topk: used to get largest K elements by sample value, useful to extract components from the whole metrics topk(3,(1 - node_memory_MemFree / node_memory_MemTotal)) Example: Get the 3 servers with the least free RAM percentage ● predict_linear: used to predict a value with a simple linear regression, useful to extrapolate disk filling predict_linear(node_filesystem_free{mountpoint='/data'}[1h], 6 * 3600) < 0 Example: Predict filesystem fullness in less than 6h
  • 16. Demo time Code available at: https://github.com/EtienneCoutaud/kubecon2017-demo.git Kafka App Grafana Prometheus Alert Manager Jenkins Slack system metrics middleware metrics service metrics inject read start
  • 17. How do ops use it? • Self-healing & Alerting help you focus on delivering more value • Predict failure and correct it to prevent outage and business loss • Faster troubleshooting and root cause pinpointing • Tune and improve configuration middleware parameters • Discover bottleneck and improve overall platform performance
  • 18. How do dev use it? • Easier performance testing and benchmarking • Bring visibility and confidence into the platform • Give a comprehensible view of the platform & KPIs to top management • Allow developers to add their own metrics, dashboards and alerts at application level
  • 19. Tip: Never stop tweaking • Choose the right metrics • Choose accurate thresholds • Prevent alert fatigue Continuous improvement is the best way to:
  • 20. Tip: Mind your NTP • Ensure all your applications, servers and monitoring platform are in-sync • Servers and browser clocks must not be too far apart
  • 21. Trick: Scraping interval tradeoff Choose wisely based on your business requirements! low scraping interval (5s) higher scraping interval (15s) • Almost real-time • CPU intensive (increase with endpoints) • Robust Prom server or federation needed • Metrics “lag effect” • Use less server resources • Sustain more endpoints
  • 22. Golden tip Work with your users!
  • 23. What’s next? • Upgrade to 2.0 to improve performance • Explore advanced features (federation, HA & third party-storage) • Improve trigger precision and alarms • Set up dashboard drilling for faster investigation • Automate incident responses because we are lazy • Monitor the underlying infrastructure
  • 24. Take-away • FullStack Monitoring • Fast to deploy, long to master • Share it with everybody / Provide feedback to everyone • Changes the way of working • Prometheus has many differents exporters, you will surely find yours! • The format makes it easy to build a custom exporter