Winning the metrics battle (finally)
Simon Hildrew, Infrastructure Developer, The Guardian
Nick Satterly, Monitoring Engineer, The Guardian
The metrics battlefield
Total metrics (chart values): 1,400; 2,800; 50,000; 180,000
http://www.flickr.com/photos/ghostsigns/6676069121



5 minutes
every 15 seconds
http://www.flickr.com/photos/millynet/134071210
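If the two durations above are read as metric collection intervals (an assumption; the slide gives only the figures), the change in data volume can be worked out with a little arithmetic. The sketch below is illustrative only: the helper name is hypothetical, and applying the 15-second interval to all 180,000 metrics from the "Total metrics" slide is an assumption, not something the deck states.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_day(interval_seconds, metric_count=1):
    """Number of data points collected per day at a fixed sampling interval."""
    return metric_count * (SECONDS_PER_DAY // interval_seconds)


old = samples_per_day(5 * 60)  # one metric sampled every 5 minutes
new = samples_per_day(15)      # one metric sampled every 15 seconds
print(old, new, new // old)

# Assumption: every one of the 180,000 metrics is sampled every 15 seconds.
print(samples_per_day(15, metric_count=180_000))
```

Under these assumptions each metric produces 20 times as many data points per day, which is worth keeping in mind when sizing storage.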
developer dashboards
Physical screens   Screensaver hacks
Bar chart: axis 0 to 20; series: dev, hack
business dashboards
metrics + dashboards = culture change
http://www.flickr.com/photos/chrisjames_taylor/5454315456
our approach
         Side project    ➡   Prioritise
Incremental upgrade      ➡   Understand the real problem
Use off the shelf tool   ➡   Question the tools
  Pragmatic solution     ➡   Be ambitious
      Done in a year     ➡   Keep learning
Prioritise
drowning in work




http://www.flickr.com/photos/iampeas/246738971
a dedicated monitoring and metrics engineer
Understand the real problem
Urgent issue: current tool end of life
The story so far...
metrics were not helping us solve production outages
ballooning number of applications
but... difficult to instrument applications
T.T. Fix = T.T. Detect + T.T. Diagnose + T.T. Resolve
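Reading "T.T." as "time to", the decomposition above can be illustrated with a small worked example. This is a minimal sketch: the function name and the incident timestamps are hypothetical and chosen only to show how the three phases add up to the time to fix.

```python
from datetime import datetime, timedelta


def time_to_fix(started, detected, diagnosed, resolved):
    """Break an incident into detect, diagnose and resolve phases plus their sum."""
    detect = detected - started
    diagnose = diagnosed - detected
    resolve = resolved - diagnosed
    return {
        "detect": detect,
        "diagnose": diagnose,
        "resolve": resolve,
        "fix": detect + diagnose + resolve,
    }


# Hypothetical incident timeline (illustrative values only).
started = datetime(2013, 1, 1, 10, 0)
detected = datetime(2013, 1, 1, 10, 20)
diagnosed = datetime(2013, 1, 1, 11, 5)
resolved = datetime(2013, 1, 1, 11, 15)

phases = time_to_fix(started, detected, diagnosed, resolved)
for name, duration in phases.items():
    print(f"time to {name}: {duration}")
```

Splitting the total this way shows which phase dominates an outage, and therefore where better metrics would help most.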
inaccessible tools




             http://www.flickr.com/photos/kdashy/2678539087
inconsistent data



http://www.flickr.com/photos/sybrenstuvel/2468506922
hypothesising & arguing easier than measuring


               http://www.flickr.com/photos/nouqraz/200049988
The ‘right’ thing
• measure everything
• measure frequently
• measure each data point once
• input and output must be open
Question the tools
Brute force?




http://www.flickr.com/photos/epublicist/3546059144
The safe option?




http://www.flickr.com/photos/alicebartlett/2361209195
Unintuitive?




http://www.flickr.com/photos/merlijnhoek/2841785343
Imposing a flawed model?
http://www.flickr.com/photos/evansville/8953838/
Too difficult / no progress?
http://www.flickr.com/photos/ginja_andy/4165849136/
Nagios
• the “IBM” of monitoring tools
• compromise over quantity and frequency of checks
• < insert your criticism of Nagios here >
Zabbix
• metric collection tightly coupled to monitoring tool
• confusing UI with poor visualisation
• needed brute force to make limited API work
The ‘right’ thing
• measure everything
• measure frequently
• measure each data point once
• input and output must be open
don’t compromise
Be ambitious
http://www.flickr.com/photos/mugley/2961131550




                                 Throw work away
Draw your dream
http://www.flickr.com/photos/sk8geek/7358702704




                             Get as far as you can
Architecture sketch (diagram):
consumers: screens, users
components: Etsy dashboard, graphite, FITB, ganglia, message queue
open questions: db? alerting? SNMP? syslog? api?
sources: network, hosts, applications
Develop missing pieces




              http://www.flickr.com/photos/kalexanderson/5969012589
Architecture diagram:
consumers: screens, users
components: Etsy dashboard, graphite, FITB, ganglia, ganglia-api, message queue, mongodb, alerta, elasticsearch
alert feeds: ganglia alerts, syslog alerts, SNMP alerts
sources: network, hosts, applications
Guardian Management
https://github.com/guardian/guardian-management
Ganglia API
https://github.com/guardian/ganglia-api
                       Alerta
https://github.com/guardian/alerta
Current stack
• Ganglia
• FITB
• Graphite
• Etsy dashboards
• Guardian management: https://github.com/guardian/guardian-management
• Guardian ganglia-api: https://github.com/guardian/ganglia-api
• Guardian alerta: https://github.com/guardian/alerta
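As an illustration of the "open input" principle with one tool from this stack: Graphite's Carbon daemon accepts data points over a plain-text line protocol, one `<path> <value> <timestamp>` line per data point, by default on TCP port 2003. The sketch below is not taken from the talk; the function names, the metric path and the host argument are hypothetical.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one data point as '<path> <value> <timestamp>' plus a newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_to_graphite(lines, host, port=2003, timeout=5.0):
    """Send pre-formatted lines to a Carbon plaintext listener over TCP."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)


# Hypothetical metric name and value; nothing is sent over the network here.
line = graphite_line("apps.example.requests.count", 42, timestamp=1356998400)
print(line, end="")
```

To actually submit the data point, one would call `send_to_graphite([line], host)` with the address of a Carbon listener; the host is left out because the deck does not name one.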
Keep learning
we are not there yet
Watch the cultural changes
detecting
diagnosis
performance testing
confirmation
#monitoringsucks
➡ Prioritise
➡ Understand the real problem
➡ Question the tools
➡ Be ambitious
➡ Keep learning
tools can change culture
Thank you
http://github.com/guardian
http://gu.com/p/3ap5f
Simon Hildrew, @sihil, simon.hildrew@guardian.co.uk
Nick Satterly, @nicksatterly, nick.satterly@guardian.co.uk
