
Next generation alerting and fault detection, SRECon Europe 2016

There is a common belief that in order to solve more advanced alerting cases and get more complete coverage, we need complex, often math-heavy solutions based on machine learning or stream processing.
This talk sets out the context and the pros and cons of such approaches, and provides anecdotal examples from the industry that nuance the applicability of these methods.

We then explore how we can get dramatically better alerting, and make our lives a lot easier, by optimizing workflow and machine-human interaction through an alerting IDE (exemplified by bosun), basic logic, basic math, and metric metadata, even for complicated alerting problems such as detecting faults in seasonal timeseries data.

https://www.usenix.org/conference/srecon16europe/program/presentation/plaetinck


  1. Next-generation alerting and fault detection Dieter Plaetinck, raintank SRECon16 Europe – Dublin, Ireland July 12, 2016
  2. Alerting, fault & anomaly detection through: machine learning, event & stream processing, alerting IDEs
  3. Also on ● support for Graphite and Grafana
  4. Also on ● support for Graphite and Grafana ● hosted Graphite and Grafana
  5. Presumptions ● Monitoring using metrics in place
  6. Presumptions ● Monitoring using metrics in place ● Alerting on metrics
  7. Presumptions ● Monitoring using metrics in place ● Alerting on metrics ● Alerts need high signal/noise ratio
  8. www.quora.com/unanswered
  9. Google trends
  10. Static thresholds → automated anomaly detection ● Not scaling / too much data
  11. Static thresholds → automated anomaly detection ● Not scaling / too much data ● Infrastructure complexity
  12. Static thresholds → automated anomaly detection ● Not scaling / too much data ● Infrastructure complexity ● Alerting on patterns
  13. “Machine learning is a subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. In 1959, Arthur Samuel defined machine learning as a field of study that gives computers the ability to learn without being explicitly programmed. Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms operate by building a model from an example training set of input observations in order to make data-driven predictions or decisions expressed as outputs, rather than following strictly static program instructions.” https://en.wikipedia.org/wiki/Machine_learning
  14. http://www.extremetech.com/extreme/224445-its-2-0-how-googles-deepmind-is-beating-the-best-in-go-and-why-that-matters
  15. https://research.googleblog.com/2014/09/building-deeper-understanding-of-images.html
  16. Using machine learning for automated anomaly detection
  17. Challenges.
  18. Challenges 1: context ● e.g. Amazon, Facebook, LinkedIn
  19. Challenges 1: context ● e.g. Amazon, Facebook, LinkedIn ● e.g. infrastructure change
  20. Challenges 2: changing rules ● Games vs your infra
  21. Challenges 2: changing rules ● Games vs your infra ● Trained model doesn’t work on new scenarios
  22. Challenges 3: signal strength ● Image recognition, security vs ops metrics
  23. Challenges 4: relevancy ● e.g. super fast to fast
  24. Challenges 4: relevancy ● e.g. super fast to fast ● e.g. redundancy failover
  25. Challenges 4: relevancy ● e.g. super fast to fast ● e.g. redundancy failover ● operator knows best
  26. Challenges 5: effort ● data prep: filtering, selection, cleaning ● statistical modeling, model selection ● training, testing ● track performance & maintenance ● operate infrastructure ● fitting UX/UI
  27. Challenges 6: complexity ● Intrinsic
  28. Challenges 6: complexity ● Intrinsic ● Incidental https://engineering.quora.com/Avoiding-Complexity-of-Machine-Learning-Systems
  29. ML / AD for operations has merits BUT:
  30. ML / AD for operations has merits BUT: ● Anomalies != Faults. Signal/noise trap
  31. ML / AD for operations has merits BUT: ● Anomalies != Faults. Signal/noise trap ● Significant effort & complexity
  32. ML / AD for operations has merits BUT: ● Anomalies != Faults. Signal/noise trap ● Significant effort & complexity ● Limited use cases
  33. What might help ● Enrich metric metadata (metrics20.org)
  34. What might help ● Enrich metric metadata (metrics20.org): clustered, stronger signals with more context
  35. What might help ● Enrich metric metadata (metrics20.org): clustered, stronger signals with more context; classification for model selection
  36. What might help ● Enrich metric metadata (metrics20.org): clustered, stronger signals with more context; classification for model selection; derive relevancy
  37. What might help ● Enrich metric metadata (metrics20.org): clustered, stronger signals with more context; classification for model selection; derive relevancy ● integration with CM, PaaS
  38. What might help ● Enrich metric metadata (metrics20.org): clustered, stronger signals with more context; classification for model selection; derive relevancy ● integration with CM, PaaS: awareness of infrastructure
  39. What might help ● Enrich metric metadata (metrics20.org): clustered, stronger signals with more context; classification for model selection; derive relevancy ● integration with CM, PaaS: awareness of infrastructure; awareness of infrastructure change
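The metadata slides above stay abstract, so here is a minimal Python sketch of how metrics 2.0-style metadata (metrics20.org) could drive the clustering, model selection, and relevancy the slides mention. The tag keys and the pick_model heuristic are illustrative assumptions of this sketch, not anything defined by metrics 2.0 or by the talk.

# Illustrative metrics 2.0-style series: intrinsic "tags" identify the series,
# "meta" carries orthogonal context (both key sets are assumptions here).
from collections import defaultdict

metrics = [
    {"tags": {"what": "requests", "unit": "Req/s", "host": "web1"}, "meta": {"agent": "statsd"}},
    {"tags": {"what": "requests", "unit": "Req/s", "host": "web2"}, "meta": {"agent": "statsd"}},
    {"tags": {"what": "disk_usage", "unit": "B", "host": "db1"}, "meta": {"agent": "collectd"}},
]

# Clustering: series that measure the same quantity across the fleet can be
# aggregated into one stronger signal instead of many weak per-host ones.
clusters = defaultdict(list)
for m in metrics:
    clusters[(m["tags"]["what"], m["tags"]["unit"])].append(m)
for key, members in clusters.items():
    print(key, "->", [m["tags"]["host"] for m in members])

# Classification for model selection: the unit alone already hints at which
# alerting model fits (rates tend to be seasonal, byte counts trend-like).
def pick_model(tags):
    return "seasonal-band" if tags["unit"].endswith("/s") else "threshold-or-trend"

for m in metrics:
    print(m["tags"]["what"], "->", pick_model(m["tags"]))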
  40. HOW do THEY do it?
  41. https://codeascraft.com/2013/06/11/introducing-kale/
  42. https://www.oreilly.com/ideas/monitoring-distributed-systems “it’s important that monitoring systems - especially the critical path from the onset of a production problem, through a page to a human, through basic triage and deep debugging - be kept simple and comprehensible by everyone on the team.” “Similarly, to keep noise low and signal high, the elements of your monitoring system that direct to a pager need to be very simple and robust. Rules that generate alerts for humans should be simple to understand and represent a clear failure.”
  43. Conclusion
  44. CEP & stream processing
  45. CEP & stream processing e.g. storm, riemann.io, spark streaming
  46. in → logic → out
  47. Riemann.io
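As a rough illustration of the "in → logic → out" model from slide 46, here is a tiny stream-processing alert pipeline in plain Python rather than Riemann, Storm, or Spark. The event shape, threshold, and streak rule are assumptions of this sketch, not Riemann's API.

from typing import Iterable, Iterator

Event = dict  # e.g. {"service": "api", "metric": "latency_ms", "value": 250.0}

def logic(events: Iterable[Event], threshold: float = 200.0,
          min_streak: int = 3) -> Iterator[Event]:
    """Emit an alert event after min_streak consecutive breaches per service.

    Per-service state is what CEP engines give you out of the box; here it is
    just a dict, to show the shape of the computation.
    """
    streaks: dict = {}
    for ev in events:
        svc = ev["service"]
        if ev["value"] > threshold:
            streaks[svc] = streaks.get(svc, 0) + 1
            if streaks[svc] == min_streak:
                yield {"service": svc, "state": "critical",
                       "description": f"{ev['metric']} > {threshold} x{min_streak}"}
        else:
            streaks[svc] = 0

def out(alerts: Iterable[Event]) -> None:
    for alert in alerts:          # stand-in for paging / notification
        print("ALERT:", alert)

# in: a stand-in for a live event stream
incoming = [{"service": "api", "metric": "latency_ms", "value": v}
            for v in (120, 210, 230, 250, 90)]
out(logic(incoming))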
  48. CEP & stream processing Compared to query-based alerting systems:
  49. CEP & stream processing Compared to query-based alerting systems: ● Good scheduling guarantees/execution timeliness
  50. CEP & stream processing Compared to query-based alerting systems: ● Good scheduling guarantees/execution timeliness ● Unfamiliar paradigm (maybe)
  51. CEP & stream processing Compared to query-based alerting systems: ● Good scheduling guarantees/execution timeliness ● Unfamiliar paradigm (maybe) ● Performance/scalability (maybe)
  52. CEP & stream processing Compared to query-based alerting systems: ● Good scheduling guarantees/execution timeliness ● Unfamiliar paradigm (maybe) ● Performance/scalability (maybe) ● Operational complexity (maybe)
  53. Conclusion
  54. Not a bad idea… But doesn’t get to the root of the alerting problems.
  55. Aha!
  56. Picture by Matt Simmons
  57. IDE for alerting ● Support programmers building and maintaining software
  58. IDE for alerting ● Support programmers building and maintaining software ● Support operators building and maintaining alerting
  59. Key features 1: [historical] testing ● vs traditional alerting, machine learning
  60. Key features 2: data juggling ● Arbitrary scope
  61. Key features 2: data juggling ● Arbitrary scope ● Arbitrary data
  62. Key features 3: dependencies
  63. http://www.slideshare.net/adriancockcroft/gluecon-monitoring-microservices-and-containers-a-challenge
  64. Key features 4: transience
  65. Key features 5: DRY
  66. Key insights.
  67. Key insights 1: remove hassle wrt improving signal/noise
  68. Key insights 1: remove hassle wrt improving signal/noise ● ongoing maintenance & tuning is critical
  69. Key insights 1: remove hassle wrt improving signal/noise ● ongoing maintenance & tuning is critical ● code for UI and logic > knobs
  70. Key insights 1: remove hassle wrt improving signal/noise ● ongoing maintenance & tuning is critical ● code for UI and logic > knobs ● leveraging additional data
  71. Key insights 2: communication ● Author to recipient
  72. Key insights 2: communication ● Author to recipient ● Alert often primary UI
  73. Key insights 3: Human > computer
  74. Key insights 4: attention is scarce, expensive
  75. Key insights 4: attention is scarce, expensive ● “provide monitoring platform that enables operators to efficiently utilize their attention”
  76. Fault detection with bosun
  77. Classify series & find KPIs
  78. Smoothly seasonal: good
  79. Smoothly seasonal: offset
  80. Smoothly seasonal: spikes
  81. Smoothly seasonal: erratic
  82. Solution 1/2: strength ● Band(), graphiteBand() ● bosun.org/expressions.html
  83. Solution 1/2: strength
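Slides 82-83 only name bosun's Band()/graphiteBand(), so here is a pure-Python sketch of the underlying idea: compare the current window against a band built from the same window in previous periods. The parameters mirror the shape of bosun's band() arguments, but the function bodies, the tolerance, and the synthetic data are assumptions of this sketch, not bosun's implementation.

import statistics

def historical_band(series, window, period, num):
    """Collect the last `window` points from each of `num` previous periods.

    Callers must supply at least num * period points of history.
    """
    chunks = []
    for i in range(1, num + 1):
        end = len(series) - i * period
        chunks.extend(series[end - window:end])
    return chunks

def strength_ok(series, window=60, period=7*24*60, num=4, tolerance=0.5):
    """True if the current median sits near the historical median."""
    now = statistics.median(series[-window:])
    hist = statistics.median(historical_band(series, window, period, num))
    return abs(now - hist) <= tolerance * abs(hist)

# usage with synthetic minutely data: 5 flat weeks, then a drop in the last hour
series = [100.0] * (5 * 7 * 24 * 60)
series[-60:] = [40.0] * 60
print(strength_ok(series))  # False: current median 40 vs historical ~100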
  84. Solution 2/2: erraticness
  85. Solution 2/2: erraticness ● erraticness-now = deviation-now / deviation-historical
  86. Solution 2/2: erraticness ● erraticness-now = (deviation-now / deviation-historical) * (median-historical / median-now)
  87. Solution 2/2: erraticness ● erraticness-now = (deviation-now * median-historical) / ((deviation-historical * median-now) + 0.01)
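A minimal Python sketch of the final erraticness formula from slide 87, assuming "deviation" means standard deviation and windows like those in the strength sketch above; the window contents here are made up. The 0.01 term guards against division by zero on flat historical data, as on the slide.

import statistics

def erraticness(now_window, hist_window):
    dev_now = statistics.pstdev(now_window)
    dev_hist = statistics.pstdev(hist_window)
    med_now = statistics.median(now_window)
    med_hist = statistics.median(hist_window)
    # normalizing by the medians makes the ratio robust to level shifts
    return (dev_now * med_hist) / ((dev_hist * med_now) + 0.01)

# a calm history vs a suddenly spiky current window
hist = [100, 101, 99, 100, 102, 98] * 10
now = [100, 160, 40, 150, 55, 145] * 10
print(erraticness(now, hist))  # much greater than 1: far more erratic than usual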
  88. More details: dieter.plaetinck.be/post/practical-fault-detection-on-timeseries-part-2 ● Bosun macro, template & example Grafana dashboard
  89. Static thresholds → automated anomaly detection ● Not scaling / too much data
  90. Static thresholds → automated anomaly detection ● Not scaling / too much data ● Infrastructure complexity
  91. Static thresholds → automated anomaly detection ● Not scaling / too much data ● Infrastructure complexity ● Alerting on patterns
  92. Conclusion
  93. ● All about the workflow
  94. ● All about the workflow ● An IDE like bosun dramatically boosts the ability to maintain high signal/noise alerting
  95. ● All about the workflow ● An IDE like bosun dramatically boosts the ability to maintain high signal/noise alerting ● Build & share!
  96. Want more? ● bosun.org/resources presentations by Kyle Brandt (LISA 2014 + Monitorama 2015) ● “my philosophy on alerting” by Rob Ewaschuk ● kitchensoap.com/2015/05/01/openlettertomonitoringproducts ● kitchensoap.com/2013/07/22/owning-attention-considerations-for-alert-design ● “monitoring microservices” by Adrian Cockcroft ● (dieter.plaetinck.be/post/practical-fault-detection-alerting-dont-need-to-be-data-scientist) ● dieter.plaetinck.be/post/practical-fault-detection-on-timeseries-part-2 ● metrics20.org/media ● mabrek.github.io ● iwringer.wordpress.com @Dieter_be – @raintanksaas – slack.raintank.io – raintank.io – bosun.org – grafana.org
