I've been working on Observability things for many years, and while I didn't join Stripe just for the purposes of Observability, I quickly decided that's what I needed to do. How does one go about changing a company to have a culture of observability, measuring, and monitoring? Let's see whether my ideas worked and what you can learn from my experiences!
2. Cory “gphat” Watson
• Joined Stripe in August 2015
• Previously at Keen IO and Twitter
• Generalist
3. Starting Point
• Stripe had some visibility, but not enough.
• No clear ownership, broken windows.
• Lack of confidence and of a vision for the future.
• Very reactive.
11. Start Over, Kinda
• Spend time with the tools
• Improve if possible
• Replace if not
• Leverage past knowledge
12. Empathy and Respect
• People aren't generally evil, but they are busy!
• Stressed, doing best with what they have
• Being a hater is lazy
• Help people be great at their jobs
13. Replaced Existing System
• Maybe a bad call, technically better
• Overcoming momentum is hard, adds work
• Declaring bankruptcy
• Saved us ops headaches
• Still going
14. Tip: Nemawashi (quietly laying the groundwork)
• Start small, you’re a great guinea pig
• Quietly lay a foundation and gather feedback
• Ask how you can improve, follow up!
• Engage discontent! Usually fine. Sometimes you need whisky.
15. Identify Power Users
• Find interested parties
• Talk to them, give them what they need
• Empower them to help others
• Watch them grow!
16. Value
• What are you improving?
• How can you measure it?
• Is this the best way?
21. Flat Org Work Ethic
• Probably the biggest challenge, getting started
• So, ya know, get started
• Be willing to do the work, shave the preposterous line of yaks
• Stigmergy (indirect coordination: work that leaves visible traces prompts more work)
• Strike when good opportunities arise (incidents, etc)
22. Advertise
• Don’t be afraid!
• Promote team accomplishments.
• Even more so, promote the accomplishments of others.
• Humbly ask to help, then learn.
• We send monthly “State of” addresses…
23. Make It Easy & Good
• Harder than it sounds (email!)
• Make it easy/automatic to do things right and hard to do wrong.
• Quality is important.
24. Automated Monitors
• Baseline monitoring
• Common problems, common solutions
• Users arrive with no context, and are surprised
• People care when you show them failure and how to fix it.
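The "common problems, common solutions" idea can be sketched as a default monitor that every service gets without its owners configuring anything. A minimal, hypothetical sketch in Python; the service name, threshold, and runbook hint are illustrative, not Stripe's actual tooling:

```python
def error_rate(status_counts):
    """Fraction of requests that returned a 5xx status code."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    errors = sum(v for k, v in status_counts.items() if 500 <= k < 600)
    return errors / total

def baseline_monitor(service, status_counts, threshold=0.05):
    """Automated baseline check: return an alert message when the error
    rate exceeds a default threshold, pointing at how to fix or tune it.
    Returns None when the service is healthy."""
    rate = error_rate(status_counts)
    if rate > threshold:
        return (f"[{service}] error rate {rate:.1%} exceeds baseline "
                f"{threshold:.1%}; see the runbook, or tune your own monitor")
    return None

# A service with mostly 200s and a burst of 500s trips the default monitor.
alert = baseline_monitor("api", {200: 900, 500: 100})
```

Showing people the failure and a path to the fix in the same message is what makes the default alert land, rather than get muted.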
31. Yes, but not done.
• Some teams? Hell yes. Strong champions, huge improvement.
• Some other teams, kinda the same.
• Some other other teams, what is Observability and why do I care? Rare!
32. Usage?
• 200+ dashboards created, vs. 339 in the old system (over 2 years)
• 200+ monitors created, vs. dozens in the old system (nobody trusted it; it was unreliable!)
• ~3000 distinct metrics (not directly comparable, we use tags now!)
• All positive feedback from automation. (Avg 4.5, 2.5% response rate)
33. Tools?
• Dozens of OSS PRs, an OSS *StatsD library (Scala), and internal libraries (we own them)
• Vast improvement over old pipeline, no loss
• New styles, better naming, more consistency
• Being tied to a commercial product cuts both ways
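The tag-based pipeline mentioned above emits metrics in something like the DogStatsD wire format. A minimal sketch of rendering one tagged metric line (the metric names and tags here are made up for illustration, and this is not the actual internal library):

```python
def format_metric(name, value, metric_type, tags=None):
    """Render one tagged metric line: name:value|type|#tag:val,tag:val
    metric_type is the StatsD type code, e.g. "c" (counter) or "ms" (timer)."""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        # Sort tags so the same logical metric always serializes identically.
        line += "|#" + ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return line

# With tags, one metric name covers many series; under old hierarchical
# naming, each (endpoint, status) pair would have been a distinct metric.
print(format_metric("http.requests", 1, "c",
                    {"endpoint": "/charges", "status": "200"}))
# http.requests:1|c|#endpoint:/charges,status:200
```

This is why the distinct-metric counts on the previous slide can't be compared directly: tags collapse what used to be thousands of dotted names into a handful of tagged ones.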
34. Adjustments?
• Embracing other tools (log analysis, error catching)
• Beginning to work on strategic things (global timers, histograms, and sets)
• Need to improve metrics on our own work (we got off easy for a while)
• Monitoring is hard; we need to fix that.
35. Summary
• Start small
• Seek feedback
• Think on your value
• Measure effectiveness
• Enjoy!