SlideShare uma empresa Scribd logo
1 de 58
Baixar para ler offline
Monitoring @ Facebook
Ran Leibman
Production Engineer
Monitoring Tools, Components, & Mentality at Facebook
Who Am I ?
Agenda
• Problems in today’s monitoring, solutions & approaches
• Facebook Monitoring Architecture
• Dive into each component
• Show Use cases
The Problems
Problems - Nagios Checks
•Good for binary checks
•Monitoring based on a “point in time” when the script executes
•Can’t monitor on a time window (can do clowny checks with a
temp file but that is not so elegant …)
•Perf data
•Not all plugins implement this
•Data gathering is coupled with the way you want to alert
•Hard to aggregate from perf data from multiple checks
Problems - Cron Scripts
•Difficult to put this scripts in “maintenance mode” while deploying
code or acknowledging an issue
•How do you know that your script actually runs?
•Who stopped crond!?
•Error handling?
Problems - Metrics R 2nd Class Citizen
•We are not always treating our metric store like we treat our
application data
•Storing metrics in clowny temp files or some unmaintained mysql
•Metric is DATA like any other and we should treat it as such!
•How & from where we are going to query it?
•What is the best way to store it?
•Retention?
Problems - Ops Ownership
•Usually only the ops teams
owns monitoring
•Even if the developers wants
to add metrics and alerts they
have a steep learning curve in
order to achieve this
Facebook Monitoring Architecture
Operational Data Store - ODS
Operational Data Store - ODS
•“key —> float” type of metrics, associated with an entity
•system.load1, system.cpu-user, system.n_eth-txbyt
•chef.run_sucess
•chef.last_run_time
•myapp.request_num
•myapp.request_median
Entity / Key-Value
Operational Data Store - ODS
•Use Gorilla in-memory TSDB for short term data (24 hours)
•Store permanent data in HBase on top of HDFS
•Aggregation
•by rack / cluster / tier
•by custom tags - app, tier name, etc …
•cross datacenter aggregations
Data Store
Collecting Metrics
Operational Data Store - ODS
• API modules for every imaginable language
• Thrift Endpoint - https://thrift.apache.org
• Implements fb303 counters
• fb303 counters are collected from the service by FBAgent
• FBAgent submit the metrics over to ODS
Retention
Operational Data Store - ODS
•Metrics is A LOT of data!
•All ODS metrics are being rolled up in the same way
•Daily - 2 Weeks (depends on you)
•Weekly - 2 Weeks (1min)
•Monthly - 1 Month (1h)
•Yearly - until the end of time (1h)
•How do I solve the data loss??
Aggregation
Operational Data Store
•Aggregate important metrics save spikes
•p50, p90, p99
•top(N)
•count
•min, max
•Aggregate by cluster | rack | custom
Scuba - Real-Time Deep Dive Log Monitoring
Whats is going on RIGHT NOW with my service?
Scuba - Real-Time Log Monitoring
•Was started as a hackathon ! today we can’t live without it
•Combine application logs from all servers & containers into a
single table
•Data is stored in memory
•Very small lag (<1min)
•Super fast queries (median of ~300ms)
•SQL like query syntax
Logging to scuba
Scuba - Real-Time Log Monitoring
• Libraries for every imaginable language
• PHP, Python, C++, Bash, etc …
• Scuba supports:
• String
• Ints
• Set of String
• Stack of strings (usually for stack traces)
So… What’s the catch ?
Scuba - Real-Time Log Monitoring
• Strict quota policy
• by size
• by time
• Use sampling in order to reduce load
• Not good for pipelining of data
Alarm System
What is an Alert?
Alarm System
•Creating an alert does not mean you’ll be notified!
•Alert is an event stating that something happened
•Can’t ssh to server
•p50 of request time is slower than 100ms 80% of the time in the last 10min
•the application tier in us-east is 50% down
•The alert should contain ALL relevant data about the event
•Alerts can be suppressed in case of maintenance
Alarm System - Alert Structure
FBAR - Facebook Auto-Remediation
Automation Automation Automation !
FBAR
• Most alarms could be auto remediate
without human intervention
• Code it once, never do it again
• Doing the work of 136,000 engineering
hours (29/04/2015)
•136,000 / 8 = 17,000 engineers a day !
FBAR
FBAR
FBAR
Notifications & Subscriptions
What should I be paged about?
Notification & Subscriptions
•Actionable alerts
•Impactful alerts
•Before you subscribe to an alert ask yourself:
•Can I automate this? (FBAR!)
•Is this actionable?
•Should an engineer wake up because of this?
Notification & Subscriptions
Dashboards
What are they good for ?
Dashboards
• Awesome tool for debugging production issues
• Making your case:
• We need to fix this code path
• If we had the BLABLA tool it would reduce this by a factor of X
• Since we deployed the last release engagement dropped by
10% in west Europe
• Dashboards are cool to look at =) it’s the best way to get an
understanding of what is going on with the service
Cubism
based on Cubism.js
? ‫לנו‬ ‫היה‬ ‫מה‬ ‫אז‬
Use data!
Think before you alert
Surfacing problems you never thought
existed
Monitoring is not an “Ops Job”
Questions?
Ran Leibman
Production Engineer
Monitoring Tools, Components, & Mentality at Facebook

Mais conteúdo relacionado

Mais procurados

Devoxx university - Kafka de haut en bas
Devoxx university - Kafka de haut en basDevoxx university - Kafka de haut en bas
Devoxx university - Kafka de haut en basFlorent Ramiere
 
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and ParquetBig Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and ParquetDataWorks Summit
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Cassandra Troubleshooting 3.0
Cassandra Troubleshooting 3.0Cassandra Troubleshooting 3.0
Cassandra Troubleshooting 3.0J.B. Langston
 
Monitoring with prometheus
Monitoring with prometheusMonitoring with prometheus
Monitoring with prometheusKasper Nissen
 
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...Databricks
 
Ceph Block Devices: A Deep Dive
Ceph Block Devices:  A Deep DiveCeph Block Devices:  A Deep Dive
Ceph Block Devices: A Deep DiveRed_Hat_Storage
 
Deep Dive into Building Streaming Applications with Apache Pulsar
Deep Dive into Building Streaming Applications with Apache Pulsar Deep Dive into Building Streaming Applications with Apache Pulsar
Deep Dive into Building Streaming Applications with Apache Pulsar Timothy Spann
 
Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?confluent
 
Performance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsPerformance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsDatabricks
 
Container Performance Analysis
Container Performance AnalysisContainer Performance Analysis
Container Performance AnalysisBrendan Gregg
 
Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4Timothy Spann
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patternshadooparchbook
 
Real time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafkaReal time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafkaTimothy Spann
 
Introduction to data flow management using apache nifi
Introduction to data flow management using apache nifiIntroduction to data flow management using apache nifi
Introduction to data flow management using apache nifiAnshuman Ghosh
 
Open Source DataViz with Apache Superset
Open Source DataViz with Apache SupersetOpen Source DataViz with Apache Superset
Open Source DataViz with Apache SupersetCarl W. Handlin
 
Apache BookKeeper: A High Performance and Low Latency Storage Service
Apache BookKeeper: A High Performance and Low Latency Storage ServiceApache BookKeeper: A High Performance and Low Latency Storage Service
Apache BookKeeper: A High Performance and Low Latency Storage ServiceSijie Guo
 
Apache Knox setup and hive and hdfs Access using KNOX
Apache Knox setup and hive and hdfs Access using KNOXApache Knox setup and hive and hdfs Access using KNOX
Apache Knox setup and hive and hdfs Access using KNOXAbhishek Mallick
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta LakeDatabricks
 
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안SANG WON PARK
 

Mais procurados (20)

Devoxx university - Kafka de haut en bas
Devoxx university - Kafka de haut en basDevoxx university - Kafka de haut en bas
Devoxx university - Kafka de haut en bas
 
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and ParquetBig Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Cassandra Troubleshooting 3.0
Cassandra Troubleshooting 3.0Cassandra Troubleshooting 3.0
Cassandra Troubleshooting 3.0
 
Monitoring with prometheus
Monitoring with prometheusMonitoring with prometheus
Monitoring with prometheus
 
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
 
Ceph Block Devices: A Deep Dive
Ceph Block Devices:  A Deep DiveCeph Block Devices:  A Deep Dive
Ceph Block Devices: A Deep Dive
 
Deep Dive into Building Streaming Applications with Apache Pulsar
Deep Dive into Building Streaming Applications with Apache Pulsar Deep Dive into Building Streaming Applications with Apache Pulsar
Deep Dive into Building Streaming Applications with Apache Pulsar
 
Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?
 
Performance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsPerformance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud Environments
 
Container Performance Analysis
Container Performance AnalysisContainer Performance Analysis
Container Performance Analysis
 
Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patterns
 
Real time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafkaReal time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafka
 
Introduction to data flow management using apache nifi
Introduction to data flow management using apache nifiIntroduction to data flow management using apache nifi
Introduction to data flow management using apache nifi
 
Open Source DataViz with Apache Superset
Open Source DataViz with Apache SupersetOpen Source DataViz with Apache Superset
Open Source DataViz with Apache Superset
 
Apache BookKeeper: A High Performance and Low Latency Storage Service
Apache BookKeeper: A High Performance and Low Latency Storage ServiceApache BookKeeper: A High Performance and Low Latency Storage Service
Apache BookKeeper: A High Performance and Low Latency Storage Service
 
Apache Knox setup and hive and hdfs Access using KNOX
Apache Knox setup and hive and hdfs Access using KNOXApache Knox setup and hive and hdfs Access using KNOX
Apache Knox setup and hive and hdfs Access using KNOX
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
 
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
 

Destaque

OSMC 2015: Monitoring at Spotify-When things go ping in the night by Martin Parm
OSMC 2015: Monitoring at Spotify-When things go ping in the night by Martin ParmOSMC 2015: Monitoring at Spotify-When things go ping in the night by Martin Parm
OSMC 2015: Monitoring at Spotify-When things go ping in the night by Martin ParmNETWAYS
 
Monitoring Challenges - Monitorama 2016 - Monitoringless
Monitoring Challenges - Monitorama 2016 - MonitoringlessMonitoring Challenges - Monitorama 2016 - Monitoringless
Monitoring Challenges - Monitorama 2016 - MonitoringlessAdrian Cockcroft
 
Prometheus (Monitorama 2016)
Prometheus (Monitorama 2016)Prometheus (Monitorama 2016)
Prometheus (Monitorama 2016)Brian Brazil
 
Production testing through monitoring
Production testing through monitoringProduction testing through monitoring
Production testing through monitoringLeon Fayer
 
Scaling Pinterest's Monitoring
Scaling Pinterest's MonitoringScaling Pinterest's Monitoring
Scaling Pinterest's MonitoringBrian Overstreet
 
SREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsSREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsBrendan Gregg
 
Monitoring As a Service
Monitoring As a ServiceMonitoring As a Service
Monitoring As a ServiceJames Turnbull
 
DevOpsDays Amsterdam - Monitoring at Service Provider Scale
DevOpsDays Amsterdam - Monitoring at Service Provider ScaleDevOpsDays Amsterdam - Monitoring at Service Provider Scale
DevOpsDays Amsterdam - Monitoring at Service Provider ScaleChris Jackson
 
Microservices Workshop All Topics Deck 2016
Microservices Workshop All Topics Deck 2016Microservices Workshop All Topics Deck 2016
Microservices Workshop All Topics Deck 2016Adrian Cockcroft
 
Bruce Lawson: Progressive Web Apps: the future of Apps
Bruce Lawson: Progressive Web Apps: the future of AppsBruce Lawson: Progressive Web Apps: the future of Apps
Bruce Lawson: Progressive Web Apps: the future of Appsbrucelawson
 
How We Analyzed 1000 Dumps in One Day - Dina Goldshtein, Brightsource - DevOp...
How We Analyzed 1000 Dumps in One Day - Dina Goldshtein, Brightsource - DevOp...How We Analyzed 1000 Dumps in One Day - Dina Goldshtein, Brightsource - DevOp...
How We Analyzed 1000 Dumps in One Day - Dina Goldshtein, Brightsource - DevOp...DevOpsDays Tel Aviv
 
AppSphere 15 - Mining the World’s Largest Healthcare Data Warehouse while Ens...
AppSphere 15 - Mining the World’s Largest Healthcare Data Warehouse while Ens...AppSphere 15 - Mining the World’s Largest Healthcare Data Warehouse while Ens...
AppSphere 15 - Mining the World’s Largest Healthcare Data Warehouse while Ens...AppDynamics
 
AppSphere 15 - Containers and Microservices Create New Performance Challenges
AppSphere 15 - Containers and Microservices Create New Performance ChallengesAppSphere 15 - Containers and Microservices Create New Performance Challenges
AppSphere 15 - Containers and Microservices Create New Performance ChallengesAppDynamics
 
Visibility from user to infrastructure on AWS
Visibility from user to infrastructure on AWSVisibility from user to infrastructure on AWS
Visibility from user to infrastructure on AWSAppDynamics
 
AppSphere 15 - Application Analytics helping DevOps with Data Driven Decision...
AppSphere 15 - Application Analytics helping DevOps with Data Driven Decision...AppSphere 15 - Application Analytics helping DevOps with Data Driven Decision...
AppSphere 15 - Application Analytics helping DevOps with Data Driven Decision...AppDynamics
 
Unified Monitoring Webinar with Dustin Whittle
Unified Monitoring Webinar with Dustin WhittleUnified Monitoring Webinar with Dustin Whittle
Unified Monitoring Webinar with Dustin WhittleAppDynamics
 
AppSphere 2016 - Automate performance testing with AppDynamics using continuo...
AppSphere 2016 - Automate performance testing with AppDynamics using continuo...AppSphere 2016 - Automate performance testing with AppDynamics using continuo...
AppSphere 2016 - Automate performance testing with AppDynamics using continuo...Brad Stoner
 

Destaque (20)

OSMC 2015: Monitoring at Spotify-When things go ping in the night by Martin Parm
OSMC 2015: Monitoring at Spotify-When things go ping in the night by Martin ParmOSMC 2015: Monitoring at Spotify-When things go ping in the night by Martin Parm
OSMC 2015: Monitoring at Spotify-When things go ping in the night by Martin Parm
 
2016 metrics-as-culture
2016 metrics-as-culture2016 metrics-as-culture
2016 metrics-as-culture
 
Monitoring Challenges - Monitorama 2016 - Monitoringless
Monitoring Challenges - Monitorama 2016 - MonitoringlessMonitoring Challenges - Monitorama 2016 - Monitoringless
Monitoring Challenges - Monitorama 2016 - Monitoringless
 
Prometheus (Monitorama 2016)
Prometheus (Monitorama 2016)Prometheus (Monitorama 2016)
Prometheus (Monitorama 2016)
 
Statistics for Engineers
Statistics for EngineersStatistics for Engineers
Statistics for Engineers
 
Production testing through monitoring
Production testing through monitoringProduction testing through monitoring
Production testing through monitoring
 
Scaling Pinterest's Monitoring
Scaling Pinterest's MonitoringScaling Pinterest's Monitoring
Scaling Pinterest's Monitoring
 
SREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsSREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREs
 
Monitoring As a Service
Monitoring As a ServiceMonitoring As a Service
Monitoring As a Service
 
DevOpsDays Amsterdam - Monitoring at Service Provider Scale
DevOpsDays Amsterdam - Monitoring at Service Provider ScaleDevOpsDays Amsterdam - Monitoring at Service Provider Scale
DevOpsDays Amsterdam - Monitoring at Service Provider Scale
 
Microservices Workshop All Topics Deck 2016
Microservices Workshop All Topics Deck 2016Microservices Workshop All Topics Deck 2016
Microservices Workshop All Topics Deck 2016
 
Bruce Lawson: Progressive Web Apps: the future of Apps
Bruce Lawson: Progressive Web Apps: the future of AppsBruce Lawson: Progressive Web Apps: the future of Apps
Bruce Lawson: Progressive Web Apps: the future of Apps
 
How We Analyzed 1000 Dumps in One Day - Dina Goldshtein, Brightsource - DevOp...
How We Analyzed 1000 Dumps in One Day - Dina Goldshtein, Brightsource - DevOp...How We Analyzed 1000 Dumps in One Day - Dina Goldshtein, Brightsource - DevOp...
How We Analyzed 1000 Dumps in One Day - Dina Goldshtein, Brightsource - DevOp...
 
AppSphere 15 - Mining the World’s Largest Healthcare Data Warehouse while Ens...
AppSphere 15 - Mining the World’s Largest Healthcare Data Warehouse while Ens...AppSphere 15 - Mining the World’s Largest Healthcare Data Warehouse while Ens...
AppSphere 15 - Mining the World’s Largest Healthcare Data Warehouse while Ens...
 
Metals, nonmetals, metalloids
Metals, nonmetals, metalloidsMetals, nonmetals, metalloids
Metals, nonmetals, metalloids
 
AppSphere 15 - Containers and Microservices Create New Performance Challenges
AppSphere 15 - Containers and Microservices Create New Performance ChallengesAppSphere 15 - Containers and Microservices Create New Performance Challenges
AppSphere 15 - Containers and Microservices Create New Performance Challenges
 
Visibility from user to infrastructure on AWS
Visibility from user to infrastructure on AWSVisibility from user to infrastructure on AWS
Visibility from user to infrastructure on AWS
 
AppSphere 15 - Application Analytics helping DevOps with Data Driven Decision...
AppSphere 15 - Application Analytics helping DevOps with Data Driven Decision...AppSphere 15 - Application Analytics helping DevOps with Data Driven Decision...
AppSphere 15 - Application Analytics helping DevOps with Data Driven Decision...
 
Unified Monitoring Webinar with Dustin Whittle
Unified Monitoring Webinar with Dustin WhittleUnified Monitoring Webinar with Dustin Whittle
Unified Monitoring Webinar with Dustin Whittle
 
AppSphere 2016 - Automate performance testing with AppDynamics using continuo...
AppSphere 2016 - Automate performance testing with AppDynamics using continuo...AppSphere 2016 - Automate performance testing with AppDynamics using continuo...
AppSphere 2016 - Automate performance testing with AppDynamics using continuo...
 

Semelhante a Monitoring Tools, Components, & Mentality at Facebook

Building data intensive applications
Building data intensive applicationsBuilding data intensive applications
Building data intensive applicationsAmit Kejriwal
 
Itsummit2015 blizzard
Itsummit2015 blizzardItsummit2015 blizzard
Itsummit2015 blizzardkevin_donovan
 
Building high performance and scalable share point applications
Building high performance and scalable share point applicationsBuilding high performance and scalable share point applications
Building high performance and scalable share point applicationsTalbott Crowell
 
Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...
Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...
Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...Amazon Web Services
 
Building an Open Source AppSec Pipeline - 2015 Texas Linux Fest
Building an Open Source AppSec Pipeline - 2015 Texas Linux FestBuilding an Open Source AppSec Pipeline - 2015 Texas Linux Fest
Building an Open Source AppSec Pipeline - 2015 Texas Linux FestMatt Tesauro
 
Data Care, Feeding, and Maintenance
Data Care, Feeding, and MaintenanceData Care, Feeding, and Maintenance
Data Care, Feeding, and MaintenanceMercedes Coyle
 
Tech for the Non Technical - Anatomy of an Application Stack
Tech for the Non Technical - Anatomy of an Application StackTech for the Non Technical - Anatomy of an Application Stack
Tech for the Non Technical - Anatomy of an Application StackIntelligent_ly
 
Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...
Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...
Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...Startupfest
 
Oracle EBS Production Support - Recommendations
Oracle EBS Production Support - RecommendationsOracle EBS Production Support - Recommendations
Oracle EBS Production Support - RecommendationsVigilant Technologies
 
EM12c Monitoring, Metric Extensions and Performance Pages
EM12c Monitoring, Metric Extensions and Performance PagesEM12c Monitoring, Metric Extensions and Performance Pages
EM12c Monitoring, Metric Extensions and Performance PagesEnkitec
 
Performance tuning Grails applications
 Performance tuning Grails applications Performance tuning Grails applications
Performance tuning Grails applicationsGR8Conf
 
Cashing in on logging and exception data
Cashing in on logging and exception dataCashing in on logging and exception data
Cashing in on logging and exception dataStackify
 
Building an Open Source AppSec Pipeline
Building an Open Source AppSec PipelineBuilding an Open Source AppSec Pipeline
Building an Open Source AppSec PipelineMatt Tesauro
 
StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis
StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark SonisStatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis
StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark SonisStatsCraft
 
Stateful Stream Processing at In-Memory Speed
Stateful Stream Processing at In-Memory SpeedStateful Stream Processing at In-Memory Speed
Stateful Stream Processing at In-Memory SpeedJamie Grier
 
Oracle Management Cloud - introduction, overview and getting started (AMIS, 2...
Oracle Management Cloud - introduction, overview and getting started (AMIS, 2...Oracle Management Cloud - introduction, overview and getting started (AMIS, 2...
Oracle Management Cloud - introduction, overview and getting started (AMIS, 2...Lucas Jellema
 
Five Ways to Fix Your SQL Server Dev-Test Problems
Five Ways to Fix Your SQL Server Dev-Test Problems Five Ways to Fix Your SQL Server Dev-Test Problems
Five Ways to Fix Your SQL Server Dev-Test Problems Catalogic Software
 

Semelhante a Monitoring Tools, Components, & Mentality at Facebook (20)

Building data intensive applications
Building data intensive applicationsBuilding data intensive applications
Building data intensive applications
 
Itsummit2015 blizzard
Itsummit2015 blizzardItsummit2015 blizzard
Itsummit2015 blizzard
 
Building high performance and scalable share point applications
Building high performance and scalable share point applicationsBuilding high performance and scalable share point applications
Building high performance and scalable share point applications
 
Metrics & more
Metrics & more Metrics & more
Metrics & more
 
Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...
Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...
Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...
 
Building an Open Source AppSec Pipeline - 2015 Texas Linux Fest
Building an Open Source AppSec Pipeline - 2015 Texas Linux FestBuilding an Open Source AppSec Pipeline - 2015 Texas Linux Fest
Building an Open Source AppSec Pipeline - 2015 Texas Linux Fest
 
Data Care, Feeding, and Maintenance
Data Care, Feeding, and MaintenanceData Care, Feeding, and Maintenance
Data Care, Feeding, and Maintenance
 
Tech for the Non Technical - Anatomy of an Application Stack
Tech for the Non Technical - Anatomy of an Application StackTech for the Non Technical - Anatomy of an Application Stack
Tech for the Non Technical - Anatomy of an Application Stack
 
Redundant devops
Redundant devopsRedundant devops
Redundant devops
 
Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...
Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...
Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...
 
Oracle EBS Production Support - Recommendations
Oracle EBS Production Support - RecommendationsOracle EBS Production Support - Recommendations
Oracle EBS Production Support - Recommendations
 
EM12c Monitoring, Metric Extensions and Performance Pages
EM12c Monitoring, Metric Extensions and Performance PagesEM12c Monitoring, Metric Extensions and Performance Pages
EM12c Monitoring, Metric Extensions and Performance Pages
 
Performance tuning Grails applications
 Performance tuning Grails applications Performance tuning Grails applications
Performance tuning Grails applications
 
Cashing in on logging and exception data
Cashing in on logging and exception dataCashing in on logging and exception data
Cashing in on logging and exception data
 
Building an Open Source AppSec Pipeline
Building an Open Source AppSec PipelineBuilding an Open Source AppSec Pipeline
Building an Open Source AppSec Pipeline
 
StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis
StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark SonisStatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis
StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis
 
Stateful Stream Processing at In-Memory Speed
Stateful Stream Processing at In-Memory SpeedStateful Stream Processing at In-Memory Speed
Stateful Stream Processing at In-Memory Speed
 
Oracle Management Cloud - introduction, overview and getting started (AMIS, 2...
Oracle Management Cloud - introduction, overview and getting started (AMIS, 2...Oracle Management Cloud - introduction, overview and getting started (AMIS, 2...
Oracle Management Cloud - introduction, overview and getting started (AMIS, 2...
 
Five Ways to Fix Your SQL Server Dev-Test Problems
Five Ways to Fix Your SQL Server Dev-Test Problems Five Ways to Fix Your SQL Server Dev-Test Problems
Five Ways to Fix Your SQL Server Dev-Test Problems
 
Ioug oow12 em12c
Ioug oow12 em12cIoug oow12 em12c
Ioug oow12 em12c
 

Mais de DevOpsDays Tel Aviv

YOUR OPEN SOURCE PROJECT IS LIKE A STARTUP, TREAT IT LIKE ONE, EYAR ZILBERMAN...
YOUR OPEN SOURCE PROJECT IS LIKE A STARTUP, TREAT IT LIKE ONE, EYAR ZILBERMAN...YOUR OPEN SOURCE PROJECT IS LIKE A STARTUP, TREAT IT LIKE ONE, EYAR ZILBERMAN...
YOUR OPEN SOURCE PROJECT IS LIKE A STARTUP, TREAT IT LIKE ONE, EYAR ZILBERMAN...DevOpsDays Tel Aviv
 
GRAPHQL TO THE RES(T)CUE, ELLA SHARAKANSKI, Salto
GRAPHQL TO THE RES(T)CUE, ELLA SHARAKANSKI, SaltoGRAPHQL TO THE RES(T)CUE, ELLA SHARAKANSKI, Salto
GRAPHQL TO THE RES(T)CUE, ELLA SHARAKANSKI, SaltoDevOpsDays Tel Aviv
 
MICROSERVICES ABOVE THE CLOUD - DESIGNING THE INTERNATIONAL SPACE STATION FOR...
MICROSERVICES ABOVE THE CLOUD - DESIGNING THE INTERNATIONAL SPACE STATION FOR...MICROSERVICES ABOVE THE CLOUD - DESIGNING THE INTERNATIONAL SPACE STATION FOR...
MICROSERVICES ABOVE THE CLOUD - DESIGNING THE INTERNATIONAL SPACE STATION FOR...DevOpsDays Tel Aviv
 
THE (IR)RATIONAL INCIDENT RESPONSE: HOW PSYCHOLOGICAL BIASES AFFECT INCIDENT ...
THE (IR)RATIONAL INCIDENT RESPONSE: HOW PSYCHOLOGICAL BIASES AFFECT INCIDENT ...THE (IR)RATIONAL INCIDENT RESPONSE: HOW PSYCHOLOGICAL BIASES AFFECT INCIDENT ...
THE (IR)RATIONAL INCIDENT RESPONSE: HOW PSYCHOLOGICAL BIASES AFFECT INCIDENT ...DevOpsDays Tel Aviv
 
PRINCIPLES OF OBSERVABILITY // DANIEL MAHER, DataDog
PRINCIPLES OF OBSERVABILITY // DANIEL MAHER, DataDogPRINCIPLES OF OBSERVABILITY // DANIEL MAHER, DataDog
PRINCIPLES OF OBSERVABILITY // DANIEL MAHER, DataDogDevOpsDays Tel Aviv
 
NUDGE AND SLUDGE: DRIVING SECURITY WITH DESIGN // J. WOLFGANG GOERLICH, Duo S...
NUDGE AND SLUDGE: DRIVING SECURITY WITH DESIGN // J. WOLFGANG GOERLICH, Duo S...NUDGE AND SLUDGE: DRIVING SECURITY WITH DESIGN // J. WOLFGANG GOERLICH, Duo S...
NUDGE AND SLUDGE: DRIVING SECURITY WITH DESIGN // J. WOLFGANG GOERLICH, Duo S...DevOpsDays Tel Aviv
 
(Ignite) TAKE A HIKE: PREVENTING BATTERY CORROSION - LEAH VOGEL, CHEGG
(Ignite) TAKE A HIKE: PREVENTING BATTERY CORROSION - LEAH VOGEL, CHEGG(Ignite) TAKE A HIKE: PREVENTING BATTERY CORROSION - LEAH VOGEL, CHEGG
(Ignite) TAKE A HIKE: PREVENTING BATTERY CORROSION - LEAH VOGEL, CHEGGDevOpsDays Tel Aviv
 
BUILDING A DR PLAN FOR YOUR CLOUD INFRASTRUCTURE FROM THE GROUND UP, MOSHE BE...
BUILDING A DR PLAN FOR YOUR CLOUD INFRASTRUCTURE FROM THE GROUND UP, MOSHE BE...BUILDING A DR PLAN FOR YOUR CLOUD INFRASTRUCTURE FROM THE GROUND UP, MOSHE BE...
BUILDING A DR PLAN FOR YOUR CLOUD INFRASTRUCTURE FROM THE GROUND UP, MOSHE BE...DevOpsDays Tel Aviv
 
THE THREE DISCIPLINES OF CI/CD SECURITY, DANIEL KRIVELEVICH, Cider Security
THE THREE DISCIPLINES OF CI/CD SECURITY, DANIEL KRIVELEVICH, Cider SecurityTHE THREE DISCIPLINES OF CI/CD SECURITY, DANIEL KRIVELEVICH, Cider Security
THE THREE DISCIPLINES OF CI/CD SECURITY, DANIEL KRIVELEVICH, Cider SecurityDevOpsDays Tel Aviv
 
THE PLEASURES OF ON-PREM, TOMER GABEL
THE PLEASURES OF ON-PREM, TOMER GABELTHE PLEASURES OF ON-PREM, TOMER GABEL
THE PLEASURES OF ON-PREM, TOMER GABELDevOpsDays Tel Aviv
 
CONFIGURATION MANAGEMENT IN THE CLOUD NATIVE ERA, SHAHAR MINTZ, EggPack
CONFIGURATION MANAGEMENT IN THE CLOUD NATIVE ERA, SHAHAR MINTZ, EggPackCONFIGURATION MANAGEMENT IN THE CLOUD NATIVE ERA, SHAHAR MINTZ, EggPack
CONFIGURATION MANAGEMENT IN THE CLOUD NATIVE ERA, SHAHAR MINTZ, EggPackDevOpsDays Tel Aviv
 
SOLVING THE DEVOPS CRISIS, ONE PERSON AT A TIME, CHRISTINA BABITSKI, Develeap
SOLVING THE DEVOPS CRISIS, ONE PERSON AT A TIME, CHRISTINA BABITSKI, DeveleapSOLVING THE DEVOPS CRISIS, ONE PERSON AT A TIME, CHRISTINA BABITSKI, Develeap
SOLVING THE DEVOPS CRISIS, ONE PERSON AT A TIME, CHRISTINA BABITSKI, DeveleapDevOpsDays Tel Aviv
 
OPTIMIZING PERFORMANCE USING CONTINUOUS PRODUCTION PROFILING ,YONATAN GOLDSCH...
OPTIMIZING PERFORMANCE USING CONTINUOUS PRODUCTION PROFILING ,YONATAN GOLDSCH...OPTIMIZING PERFORMANCE USING CONTINUOUS PRODUCTION PROFILING ,YONATAN GOLDSCH...
OPTIMIZING PERFORMANCE USING CONTINUOUS PRODUCTION PROFILING ,YONATAN GOLDSCH...DevOpsDays Tel Aviv
 
HOW TO SCALE YOUR ONCALL OPERATION, AND SURVIVE TO TELL, ANTON DRUKH
HOW TO SCALE YOUR ONCALL OPERATION, AND SURVIVE TO TELL, ANTON DRUKHHOW TO SCALE YOUR ONCALL OPERATION, AND SURVIVE TO TELL, ANTON DRUKH
HOW TO SCALE YOUR ONCALL OPERATION, AND SURVIVE TO TELL, ANTON DRUKHDevOpsDays Tel Aviv
 
HOW TO OPTIMIZE NON-CODING TIME, ORI KEREN, LinearB
HOW TO OPTIMIZE NON-CODING TIME, ORI KEREN, LinearBHOW TO OPTIMIZE NON-CODING TIME, ORI KEREN, LinearB
HOW TO OPTIMIZE NON-CODING TIME, ORI KEREN, LinearBDevOpsDays Tel Aviv
 
FLYING BLIND - ACCESSIBILITY IN MONITORING, FEU MOUREK, Icinga
FLYING BLIND - ACCESSIBILITY IN MONITORING, FEU MOUREK, IcingaFLYING BLIND - ACCESSIBILITY IN MONITORING, FEU MOUREK, Icinga
FLYING BLIND - ACCESSIBILITY IN MONITORING, FEU MOUREK, IcingaDevOpsDays Tel Aviv
 
(Ignite) WHAT'S BURNING THROUGH YOUR CLOUD BILL - GIL BAHAT, CIDER SECURITY
(Ignite) WHAT'S BURNING THROUGH YOUR CLOUD BILL - GIL BAHAT, CIDER SECURITY(Ignite) WHAT'S BURNING THROUGH YOUR CLOUD BILL - GIL BAHAT, CIDER SECURITY
(Ignite) WHAT'S BURNING THROUGH YOUR CLOUD BILL - GIL BAHAT, CIDER SECURITYDevOpsDays Tel Aviv
 
SLO DRIVEN DEVELOPMENT, ALON NATIV, Tomorrow.io
SLO DRIVEN DEVELOPMENT, ALON NATIV, Tomorrow.ioSLO DRIVEN DEVELOPMENT, ALON NATIV, Tomorrow.io
SLO DRIVEN DEVELOPMENT, ALON NATIV, Tomorrow.ioDevOpsDays Tel Aviv
 
ONBOARDING IN LOCKDOWN, HILA FOX, Augury
ONBOARDING IN LOCKDOWN, HILA FOX, AuguryONBOARDING IN LOCKDOWN, HILA FOX, Augury
ONBOARDING IN LOCKDOWN, HILA FOX, AuguryDevOpsDays Tel Aviv
 
DON'T PANIC: GETTING YOUR INFRASTRUCTURE DRIFT UNDER CONTROL, ERAN BIBI, Firefly
DON'T PANIC: GETTING YOUR INFRASTRUCTURE DRIFT UNDER CONTROL, ERAN BIBI, FireflyDON'T PANIC: GETTING YOUR INFRASTRUCTURE DRIFT UNDER CONTROL, ERAN BIBI, Firefly
DON'T PANIC: GETTING YOUR INFRASTRUCTURE DRIFT UNDER CONTROL, ERAN BIBI, FireflyDevOpsDays Tel Aviv
 

Mais de DevOpsDays Tel Aviv (20)

YOUR OPEN SOURCE PROJECT IS LIKE A STARTUP, TREAT IT LIKE ONE, EYAR ZILBERMAN...
YOUR OPEN SOURCE PROJECT IS LIKE A STARTUP, TREAT IT LIKE ONE, EYAR ZILBERMAN...YOUR OPEN SOURCE PROJECT IS LIKE A STARTUP, TREAT IT LIKE ONE, EYAR ZILBERMAN...
YOUR OPEN SOURCE PROJECT IS LIKE A STARTUP, TREAT IT LIKE ONE, EYAR ZILBERMAN...
 
GRAPHQL TO THE RES(T)CUE, ELLA SHARAKANSKI, Salto
GRAPHQL TO THE RES(T)CUE, ELLA SHARAKANSKI, SaltoGRAPHQL TO THE RES(T)CUE, ELLA SHARAKANSKI, Salto
GRAPHQL TO THE RES(T)CUE, ELLA SHARAKANSKI, Salto
 
MICROSERVICES ABOVE THE CLOUD - DESIGNING THE INTERNATIONAL SPACE STATION FOR...
MICROSERVICES ABOVE THE CLOUD - DESIGNING THE INTERNATIONAL SPACE STATION FOR...MICROSERVICES ABOVE THE CLOUD - DESIGNING THE INTERNATIONAL SPACE STATION FOR...
MICROSERVICES ABOVE THE CLOUD - DESIGNING THE INTERNATIONAL SPACE STATION FOR...
 
THE (IR)RATIONAL INCIDENT RESPONSE: HOW PSYCHOLOGICAL BIASES AFFECT INCIDENT ...
THE (IR)RATIONAL INCIDENT RESPONSE: HOW PSYCHOLOGICAL BIASES AFFECT INCIDENT ...THE (IR)RATIONAL INCIDENT RESPONSE: HOW PSYCHOLOGICAL BIASES AFFECT INCIDENT ...
THE (IR)RATIONAL INCIDENT RESPONSE: HOW PSYCHOLOGICAL BIASES AFFECT INCIDENT ...
 
PRINCIPLES OF OBSERVABILITY // DANIEL MAHER, DataDog
PRINCIPLES OF OBSERVABILITY // DANIEL MAHER, DataDogPRINCIPLES OF OBSERVABILITY // DANIEL MAHER, DataDog
PRINCIPLES OF OBSERVABILITY // DANIEL MAHER, DataDog
 
NUDGE AND SLUDGE: DRIVING SECURITY WITH DESIGN // J. WOLFGANG GOERLICH, Duo S...
NUDGE AND SLUDGE: DRIVING SECURITY WITH DESIGN // J. WOLFGANG GOERLICH, Duo S...NUDGE AND SLUDGE: DRIVING SECURITY WITH DESIGN // J. WOLFGANG GOERLICH, Duo S...
NUDGE AND SLUDGE: DRIVING SECURITY WITH DESIGN // J. WOLFGANG GOERLICH, Duo S...
 
(Ignite) TAKE A HIKE: PREVENTING BATTERY CORROSION - LEAH VOGEL, CHEGG
(Ignite) TAKE A HIKE: PREVENTING BATTERY CORROSION - LEAH VOGEL, CHEGG(Ignite) TAKE A HIKE: PREVENTING BATTERY CORROSION - LEAH VOGEL, CHEGG
(Ignite) TAKE A HIKE: PREVENTING BATTERY CORROSION - LEAH VOGEL, CHEGG
 
BUILDING A DR PLAN FOR YOUR CLOUD INFRASTRUCTURE FROM THE GROUND UP, MOSHE BE...
BUILDING A DR PLAN FOR YOUR CLOUD INFRASTRUCTURE FROM THE GROUND UP, MOSHE BE...BUILDING A DR PLAN FOR YOUR CLOUD INFRASTRUCTURE FROM THE GROUND UP, MOSHE BE...
BUILDING A DR PLAN FOR YOUR CLOUD INFRASTRUCTURE FROM THE GROUND UP, MOSHE BE...
 
THE THREE DISCIPLINES OF CI/CD SECURITY, DANIEL KRIVELEVICH, Cider Security
THE THREE DISCIPLINES OF CI/CD SECURITY, DANIEL KRIVELEVICH, Cider SecurityTHE THREE DISCIPLINES OF CI/CD SECURITY, DANIEL KRIVELEVICH, Cider Security
THE THREE DISCIPLINES OF CI/CD SECURITY, DANIEL KRIVELEVICH, Cider Security
 
THE PLEASURES OF ON-PREM, TOMER GABEL
THE PLEASURES OF ON-PREM, TOMER GABELTHE PLEASURES OF ON-PREM, TOMER GABEL
THE PLEASURES OF ON-PREM, TOMER GABEL
 
CONFIGURATION MANAGEMENT IN THE CLOUD NATIVE ERA, SHAHAR MINTZ, EggPack
CONFIGURATION MANAGEMENT IN THE CLOUD NATIVE ERA, SHAHAR MINTZ, EggPackCONFIGURATION MANAGEMENT IN THE CLOUD NATIVE ERA, SHAHAR MINTZ, EggPack
CONFIGURATION MANAGEMENT IN THE CLOUD NATIVE ERA, SHAHAR MINTZ, EggPack
 
SOLVING THE DEVOPS CRISIS, ONE PERSON AT A TIME, CHRISTINA BABITSKI, Develeap
SOLVING THE DEVOPS CRISIS, ONE PERSON AT A TIME, CHRISTINA BABITSKI, DeveleapSOLVING THE DEVOPS CRISIS, ONE PERSON AT A TIME, CHRISTINA BABITSKI, Develeap
SOLVING THE DEVOPS CRISIS, ONE PERSON AT A TIME, CHRISTINA BABITSKI, Develeap
 
OPTIMIZING PERFORMANCE USING CONTINUOUS PRODUCTION PROFILING ,YONATAN GOLDSCH...
OPTIMIZING PERFORMANCE USING CONTINUOUS PRODUCTION PROFILING ,YONATAN GOLDSCH...OPTIMIZING PERFORMANCE USING CONTINUOUS PRODUCTION PROFILING ,YONATAN GOLDSCH...
OPTIMIZING PERFORMANCE USING CONTINUOUS PRODUCTION PROFILING ,YONATAN GOLDSCH...
 
HOW TO SCALE YOUR ONCALL OPERATION, AND SURVIVE TO TELL, ANTON DRUKH
HOW TO SCALE YOUR ONCALL OPERATION, AND SURVIVE TO TELL, ANTON DRUKHHOW TO SCALE YOUR ONCALL OPERATION, AND SURVIVE TO TELL, ANTON DRUKH
HOW TO SCALE YOUR ONCALL OPERATION, AND SURVIVE TO TELL, ANTON DRUKH
 
HOW TO OPTIMIZE NON-CODING TIME, ORI KEREN, LinearB
HOW TO OPTIMIZE NON-CODING TIME, ORI KEREN, LinearBHOW TO OPTIMIZE NON-CODING TIME, ORI KEREN, LinearB
HOW TO OPTIMIZE NON-CODING TIME, ORI KEREN, LinearB
 
FLYING BLIND - ACCESSIBILITY IN MONITORING, FEU MOUREK, Icinga
FLYING BLIND - ACCESSIBILITY IN MONITORING, FEU MOUREK, IcingaFLYING BLIND - ACCESSIBILITY IN MONITORING, FEU MOUREK, Icinga
FLYING BLIND - ACCESSIBILITY IN MONITORING, FEU MOUREK, Icinga
 
(Ignite) WHAT'S BURNING THROUGH YOUR CLOUD BILL - GIL BAHAT, CIDER SECURITY
(Ignite) WHAT'S BURNING THROUGH YOUR CLOUD BILL - GIL BAHAT, CIDER SECURITY(Ignite) WHAT'S BURNING THROUGH YOUR CLOUD BILL - GIL BAHAT, CIDER SECURITY
(Ignite) WHAT'S BURNING THROUGH YOUR CLOUD BILL - GIL BAHAT, CIDER SECURITY
 
SLO DRIVEN DEVELOPMENT, ALON NATIV, Tomorrow.io
SLO DRIVEN DEVELOPMENT, ALON NATIV, Tomorrow.ioSLO DRIVEN DEVELOPMENT, ALON NATIV, Tomorrow.io
SLO DRIVEN DEVELOPMENT, ALON NATIV, Tomorrow.io
 
ONBOARDING IN LOCKDOWN, HILA FOX, Augury
ONBOARDING IN LOCKDOWN, HILA FOX, AuguryONBOARDING IN LOCKDOWN, HILA FOX, Augury
ONBOARDING IN LOCKDOWN, HILA FOX, Augury
 
DON'T PANIC: GETTING YOUR INFRASTRUCTURE DRIFT UNDER CONTROL, ERAN BIBI, Firefly
DON'T PANIC: GETTING YOUR INFRASTRUCTURE DRIFT UNDER CONTROL, ERAN BIBI, FireflyDON'T PANIC: GETTING YOUR INFRASTRUCTURE DRIFT UNDER CONTROL, ERAN BIBI, Firefly
DON'T PANIC: GETTING YOUR INFRASTRUCTURE DRIFT UNDER CONTROL, ERAN BIBI, Firefly
 

Último

Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 

Último (20)

Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 

Monitoring Tools, Components, & Mentality at Facebook

  • 1. Monitoring @ Facebook Ran Leibman Production Engineer Monitoring Tools, Components, & Mentality at Facebook
  • 3. Agenda • Problems in today’s monitoring, solutions & approaches • Facebook Monitoring Architecture • Dive into each component • Show Use cases
  • 5.
  • 6. Problems - Nagios Checks •Good for binary checks •Monitoring based on a “point in time” when the script executes •Can’t monitor on a time window (can do clowny checks with a temp file but that is not so elegant …) •Perf data •Not all plugins implement this •Data gathering is coupled with the way you want to alert •Hard to aggregate from perf data from multiple checks
  • 7. Problems - Cron Scripts •Difficult to put this scripts in “maintenance mode” while deploying code or acknowledging an issue •How do you know that your script actually runs? •Who stopped crond!? •Error handling?
  • 8. Problems - Metrics R 2nd Class Citizen •We are not always treating our metric store like we treat our application data •Storing metrics in clowny temp files or some unmaintained mysql •Metric is DATA like any other and we should treat it as such! •How & from where we are going to query it? •What is the best way to store it? •Retention?
  • 9. Problems - Ops Ownership •Usually only the ops teams owns monitoring •Even if the developers wants to add metrics and alerts they have a steep learning curve in order to achieve this
  • 12. Operational Data Store - ODS •“key —> float” type of metrics, associated with an entity •system.load1, system.cpu-user, system.n_eth-txbyt •chef.run_sucess •chef.last_run_time •myapp.request_num •myapp.request_median Entity / Key-Value
  • 13. Operational Data Store - ODS •Use Gorilla in-memory TSDB for short term data (24 hours) •Store permanent data in HBase on top of HDFS •Aggregation •by rack / cluster / tier •by custom tags - app, tier name, etc … •cross datacenter aggregations Data Store
  • 14. Collecting Metrics Operational Data Store - ODS • API modules for every imaginable language • Thrift Endpoint - https://thrift.apache.org • Implements fb303 counters • fb303 counters are collected from the service by FBAgent • FBAgent submit the metrics over to ODS
  • 15. Retention Operational Data Store - ODS •Metrics is A LOT of data! •All ODS metrics are being rolled up in the same way •Daily - 2 Weeks (depends on you) •Weekly - 2 Weeks (1min) •Monthly - 1 Month (1h) •Yearly - until the end of time (1h) •How do I solve the data loss??
  • 16. Aggregation Operational Data Store •Aggregate important metrics save spikes •p50, p90, p99 •top(N) •count •min, max •Aggregate by cluster | rack | custom
  • 17. Scuba - Real-Time Deep Dive Log Monitoring
  • 18. Whats is going on RIGHT NOW with my service? Scuba - Real-Time Log Monitoring •Was started as a hackathon ! today we can’t live without it •Combine application logs from all servers & containers into a single table •Data is stored in memory •Very small lag (<1min) •Super fast queries (median of ~300ms) •SQL like query syntax
  • 19. Logging to scuba Scuba - Real-Time Log Monitoring • Libraries for every imaginable language • PHP, Python, C++, Bash, etc … • Scuba supports: • String • Ints • Set of String • Stack of strings (usually for stack traces)
  • 20. So… What’s the catch ? Scuba - Real-Time Log Monitoring • Strict quota policy • by size • by time • Use sampling in order to reduce load • Not good for pipelining of data
  • 21.
  • 22.
  • 23.
  • 24.
  • 25.
  • 27.
  • 28. What is an Alert? Alarm System •Creating an alert does not mean you’ll be notified! •Alert is an event stating that something happened •Can’t ssh to server •p50 of request time is slower than 100ms 80% of the time in the last 10min •the application tier in us-east is 50% down •The alert should contain ALL relevant data about the event •Alerts can be suppressed in case of maintenance
  • 29. Alarm System - Alert Structure
  • 30.
  • 31.
  • 32.
  • 33.
  • 34. FBAR - Facebook Auto-Remediation
  • 35. Automation Automation Automation ! FBAR • Most alarms could be auto remediate without human intervention • Code it once, never do it again • Doing the work of 136,000 engineering hours (29/04/2015) •136,000 / 8 = 17,000 engineers a day !
  • 36. FBAR
  • 37. FBAR
  • 38. FBAR
  • 40.
  • 41.
  • 42. What should I be paged about? Notification & Subscriptions •Actionable alerts •Impactful alerts •Before you subscribe to an alert ask yourself: •Can I automate this? (FBAR!) •Is this actionable? •Should an engineer wake up because of this?
  • 43.
  • 44.
  • 47. What are they good for ? Dashboards • Awesome tool for debugging production issues • Making your case: • We need to fix this code path • If we had the BLABLA tool it would reduce this by a factor of X • Since we deployed the last release engagement dropped by 10% in west Europe • Dashboards are cool to look at =) it’s the best way to get an understanding of what is going on with the service
  • 49.
  • 50.
  • 51.
  • 52. ? ‫לנו‬ ‫היה‬ ‫מה‬ ‫אז‬
  • 55. Surfacing problems you never thought existed
  • 56. Monitoring is not an “Ops Job”