In today’s world, a company must be a “Learning Organization” to be successful and innovative. Learning from both failure and success in order to implement small, incremental improvements is critical. But until you implement and apply new information, you haven’t truly “learned” anything, and you certainly haven’t improved.
According to the 2015 Monitoring Survey, most companies leverage metrics from monitoring and logging purely for performance analytics and trending. If high availability and reliability are important, they also use metrics to alert on faults and anomalies. Despite these “best practices”, the metrics are primarily used as context to keep things “running” or to return them to “normal” if there’s a problem. Rarely is that data used to identify areas of improvement once services have been restored. When your system suffers an outage, you will absolutely repair and restore services as best you know how, but are you paying attention to the data from the recovery efforts? What were operators seeing during diagnosis and remediation? What were their actions? What was going on with everyone, including conversations? What you want is a step-by-step replay of exactly what took place during that outage.
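One way to picture that step-by-step replay is as a single timeline that merges what the monitoring showed, what people said, and what operators did. The sketch below is only an illustration under stated assumptions: the `Event` record, the `build_timeline` helper, the source labels, and the sample events are all hypothetical and made up for the example, not taken from any particular tool or real outage.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """One thing that happened during the incident (hypothetical record)."""
    at: datetime   # when it happened
    source: str    # assumed labels: "alert", "chat", "action"
    detail: str    # free-text description


def build_timeline(*streams):
    """Merge any number of event streams into one list ordered by time."""
    merged = [event for stream in streams for event in stream]
    return sorted(merged, key=lambda e: e.at)


# Made-up sample data, purely for illustration.
alerts = [Event(datetime(2015, 1, 1, 9, 0), "alert", "API latency above threshold")]
chat = [Event(datetime(2015, 1, 1, 9, 2), "chat", "Anyone else seeing timeouts?")]
actions = [
    Event(datetime(2015, 1, 1, 9, 5), "action", "Restarted API workers"),
    Event(datetime(2015, 1, 1, 9, 1), "action", "Acknowledged page"),
]

timeline = build_timeline(alerts, chat, actions)
for e in timeline:
    print(f"{e.at:%H:%M} [{e.source}] {e.detail}")
```

Keeping alerts, conversations, and actions in one ordered list is a design choice that makes it easier to ask afterwards what operators were seeing at the moment they acted.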
This “old-view” perspective on the purpose of monitoring, logging, and alerting leaves the full value of metrics unrealized. It fails to address what’s important to the overall business objective and it lacks any hope of seeking out innovation or disruption of the status quo.
This talk will illustrate how to identify whether your company is making the best use of metrics, and ways not only to learn from failure but to become a “Learning Company”.
9. WHY ARE YOU COLLECTING THIS DATA?
NOTE: You may choose more than one
▸ Performance analysis and trending
▸ Fault and Anomaly detection
▸ Capacity Planning
▸ A/B Testing
@jasonhand | VictorOps | #AllDayDevOps
10. THE RESULTS
NOTE: Respondents may have chosen more than one
▸ Performance analysis and trending - 63%
▸ Fault and Anomaly detection - 53%
▸ Capacity Planning - 45%
▸ A/B Testing - 11%
27. The result of underutilizing monitoring & alerting
is that the IT department and the organization have
no chance to...
LEARN,
IMPROVE, OR
INNOVATE.
28. CONTINUALLY UNDERSTANDING & RESPONDING
TO THE FEEDBACK
from
monitoring, logging, & alerting
allows you to use information about events in the past to drive future
actions.
39. RE·SIL·IENT /RƏˈZILYƏNT/
The ability to resist, absorb, recover from or successfully adapt to
adversity or a change in conditions
43. Without deviation from the norm,
progress is not possible
— Frank Zappa
44. What Did You
LEARN
From the Recovery Efforts?
(including monitoring & alerting)
45. POSTMORTEMS / LEARNING REVIEWS:
Stories of:
WHAT TOOK PLACE
leading up to & during
the disruption & recovery efforts
62. INNOVATE
Learning from both success & failure
to develop & implement
small incremental improvements
is critical.
78. LEARNING & INNOVATING
leads to uncovering new ways of
BUILDING, DEPLOYING, AND MAINTAINING
SOFTWARE & INFRASTRUCTURE
Which leads to...