Mark Imbriaco has over 20 years of experience leading operations at several tech companies. In this document, he shares lessons learned about making small, incremental decisions; fighting "hero culture," in which engineers take on too much; practicing plans to build confidence under stress; designing for collaboration and visibility; and building a culture of shipping. The overall message is that processes should stay flexible and incorporate what is learned from both successes and failures.
DOES SFO 2016 - Mark Imbriaco - Lessons From the Bleeding Edge
1. Mark Imbriaco @markimbriaco
Lessons From the Bleeding Edge
What I learned leading Ops at GitHub,
Heroku, DigitalOcean, and more...
Mark Imbriaco
mark@operable.io
2. Mark Imbriaco @markimbriaco
• Building and operating Internet services
for over 20 years.
• TechOps leadership at 37signals, Heroku,
LivingSocial, GitHub, and DigitalOcean.
• Founder of Operable.
• Frequently opinionated.
Who am I?
5. Mark Imbriaco @markimbriaco
Break large decisions into smaller decisions
whenever you can. Not only is it easier to
make small decisions, it's also easier to
change them when you find out you're
wrong.
Make tiny decisions.
6. Mark Imbriaco @markimbriaco
Engineers have a hard time leaving
problems unsolved, but there are always
more problems. Push back and enforce
healthy balance.
Fight hero culture.
8. Mark Imbriaco @markimbriaco
If you haven't practiced your plan, you don't
have a plan. Build deliberate practice and
feedback mechanisms into your processes
to increase confidence when working
under stress.
Practice makes perfect.
9. Mark Imbriaco @markimbriaco
Be prescriptive where possible to allow
people to focus on the areas where they
provide the most value.
Don't make me think.
10. Mark Imbriaco @markimbriaco
Learn from both successes and failures.
Learning reviews should be a habit, not an
opportunity for assigning blame.
Make it safe to learn.
11. Mark Imbriaco @markimbriaco
• Apologize. And mean it.
• Demonstrate a thorough understanding of
the problem.
• Explain what you're doing to reduce the
likelihood of similar problems. Don't
overpromise.
... and share the results publicly.
13. Mark Imbriaco @markimbriaco
Go the extra mile to understand the
problems that your internal customers have
and demonstrate that you understand them,
especially when you have to say no.
Empathy is a core value.
15. Mark Imbriaco @markimbriaco
Design collaboration into your processes.
Make sharing the default and bias toward
visibility. Remember, visibility is the ultimate
compensating control.
Collaborate by default.
16. Mark Imbriaco @markimbriaco
If you're building a web tool, enlist help from
a friendly designer. If all else fails, pick a UI
framework like Bootstrap and fake it. A little
bit of visual design goes a long way.
Ops tools don't have to be ugly.
18. Mark Imbriaco @markimbriaco
Celebrate your wins and share in the
celebration of others to build a virtuous
cycle of forward progress. And remember,
shipping isn't just for software.
Build a culture of shipping.
20. Mark Imbriaco @markimbriaco
Big design up front does not work well in
software. Processes are no different. Be
flexible, adaptable, and constantly apply
what you learn.
Do the simplest thing that could work.
21. Mark Imbriaco @markimbriaco
Beware the illusion of agreement and be
explicit. Make sure that your hard-won
knowledge is shared across your entire
organization and deliberately considered in
new projects.
Close the feedback loop.
22. Mark Imbriaco @markimbriaco
If you're interested in ChatOps and have
security or audit requirements, I'd love to
talk to you.
Mark Imbriaco
mark@operable.io
Thanks!
Editor's Notes
37signals: 37signals, now called Basecamp, are the creators of Ruby on Rails. I was the 7th employee and first Ops hire. I built the infrastructure for Basecamp, Highrise, and the other 37signals products and was the first Ops manager.
Heroku: Director of Cloud Operations. Grew from 60k to 1.5MM apps in 18 months. Acquired by Salesforce.
LivingSocial: VP, TechOps. Roughly $1BB in transaction volume when I joined. Seemed like a good idea at the time.
GitHub: TechOps Leader. Focused on infrastructure and process.
DigitalOcean: VP, TechOps. Public cloud focused on ease of use for developers and growing incredibly fast. ~7 regions when I started and we brought up 4 more in a year.
When Gene asked me to speak I was honored. I was kind of down on the term DevOps, because I felt like it had largely been reframed to be a synonym for continuous delivery rather than a cultural movement. Now don't get me wrong, I'm all for continuous delivery and I love automation at least as much as the next person, but I care much more about people.
Then I was lucky enough to be in town the same time as DOES15 and Gene invited me to stop in. I was blown away.
Much harder for enterprises, etc.
Database sharding. Mr. Moore.
Was the whole Ops team for way too long. Kept insisting that I could handle it, even when it was clear that I was burning myself out. Well past the point where the workload was way too much.
Not an outlier.
So I finally went on vacation. Traveling down the mountain, etc.
Heroku playbooks. All alerts must be actionable and link to playbook. Include steps for validation, decision criteria, links to relevant docs, and contact info for errata. Include simulation steps in every runbook, track last simulation date, make new hires go through it in the first couple of weeks before they go on-call to make sure they're fresh and to build confidence.
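The playbook rules above lend themselves to an automated check. The following is a minimal sketch, not Heroku's actual tooling: the `Alert` and `Playbook` types, the section names, and the 90-day freshness window are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Dict, List, Optional, Set

# Illustrative labels for the items the notes say a playbook should include.
REQUIRED_SECTIONS = (
    "validation_steps",
    "decision_criteria",
    "relevant_docs",
    "errata_contact",
    "simulation_steps",
)


@dataclass
class Playbook:
    url: str
    sections: Set[str] = field(default_factory=set)
    last_simulated: Optional[date] = None


@dataclass
class Alert:
    name: str
    playbook_url: Optional[str] = None


def lint_alert(alert: Alert, playbooks: Dict[str, Playbook], today: date,
               max_age_days: int = 90) -> List[str]:
    """Return a list of problems; an empty list means the alert passes."""
    if not alert.playbook_url:
        return [f"{alert.name}: no playbook link"]
    playbook = playbooks.get(alert.playbook_url)
    if playbook is None:
        return [f"{alert.name}: playbook link does not resolve"]

    # Every required section must be present in the linked playbook.
    problems = [
        f"{alert.name}: playbook missing '{section}'"
        for section in REQUIRED_SECTIONS
        if section not in playbook.sections
    ]
    # The simulation date must exist and fall inside the (assumed) window.
    if playbook.last_simulated is None:
        problems.append(f"{alert.name}: playbook has never been simulated")
    elif today - playbook.last_simulated > timedelta(days=max_age_days):
        problems.append(
            f"{alert.name}: last simulation is older than {max_age_days} days"
        )
    return problems
```

Run as part of CI for the alerting configuration, a check like this could reject any alert that has no playbook or whose playbook has gone stale.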
I don't generally like prescriptive processes, particularly runbooks. It's usually a sign that the remediation can be automated. But there are exceptions.
Be prescriptive with details like when to open a status event so I can focus on what matters: actually resolving the event. Heroku example rules of thumb. Quickly go to "Investigating" status. Maybe number of customers affected or service or ...
Don't make me decide about things that aren't in the path of resolution though.
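A prescriptive rule of this kind can be written down so that nobody has to debate it mid-incident. This is a rough sketch only: the talk gives no concrete numbers or service names, so `CUSTOMER_THRESHOLD`, `CRITICAL_SERVICES`, and the `Incident` type are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical values: the talk does not specify thresholds or services.
CUSTOMER_THRESHOLD = 10
CRITICAL_SERVICES = frozenset({"api", "routing"})


@dataclass(frozen=True)
class Incident:
    service: str
    customers_affected: int


def initial_status(incident: Incident) -> Optional[str]:
    """Return "Investigating" if a public status event should be opened now,
    or None if the incident does not meet the (assumed) criteria."""
    if incident.service in CRITICAL_SERVICES:
        return "Investigating"
    if incident.customers_affected >= CUSTOMER_THRESHOLD:
        return "Investigating"
    return None
```

The point is less the specific rule than that the decision is made ahead of time, leaving the responder free to work on resolution.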
Retro retro retro.
Skynet outage. On vacation, team reacted, etc. Lessons learned, timeline constructed, etc.
Lemons into lemonade.
Teams didn't get along. Neither side believed the other was being reasonable. It really came down to a lack of understanding and empathy. Ops didn't understand the pressure to ship and the obstacles they were creating. Dev didn't understand the depth of the Ops backlog and technical debt. What helped: sitting down and having an open conversation, unpacking requests, and, instead of saying "No" when someone asked for a new piece of technology, trying to understand the context of the request and the problem being solved. Offering alternate solutions, explaining rationale, and sometimes turning No into Not Right Now were all good paths.
ChatOps is great for this. Move workflow to the place where you collaborate to improve visibility both in the moment and retrospectively. Teammates can see new app deploys, see troubleshooting steps both positive and negative, and build some intuition about the "flow" of work. It helps with training new hires -- instead of asking how to deploy software, new team members see it happen in chat on their first day. Serendipitous interactions.
Even without ChatOps there are plenty of things you can do. Set up monitoring on your status site that pages your communications or community or support team when you go public with an incident, etc.
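The status-site idea can be sketched as a small polling step. This is an assumption-laden illustration: the incident dictionaries, the team names, and the injected `page` callable stand in for whatever status page and paging service are actually in use.

```python
from typing import Callable, Dict, Iterable, List, Set


def new_public_incidents(seen_ids: Set[str],
                         current: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return incidents on the public status site that have not been seen yet."""
    return [incident for incident in current if incident["id"] not in seen_ids]


def page_on_new_incidents(seen_ids: Set[str],
                          current: Iterable[Dict[str, str]],
                          page: Callable[[str, str], None],
                          teams: Iterable[str] = ("communications", "support")) -> int:
    """Page each team once per newly published incident; return how many were new.

    `page(team, message)` is a hypothetical hook for the real paging system.
    """
    teams = tuple(teams)
    fresh = new_public_incidents(seen_ids, current)
    for incident in fresh:
        for team in teams:
            page(team, f"Public incident posted: {incident['title']}")
        # Remember the incident so the next poll does not page again.
        seen_ids.add(incident["id"])
    return len(fresh)
```

Calling `page_on_new_incidents` on a schedule with the latest list of public incidents would notify the non-engineering teams without the responder having to remember to do it.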
This page was clearly not designed by someone in Ops. It's beautiful, glanceable, and actionable.
GitHub celebrated shipping via Team (internal intranet), serendipitously in chat, and via "Toasts".
And when shipping, especially when shipping new processes, do the simplest thing that could work. "Tank" rotation.
Team waking up every night for pages. Wakes up, bandaids the problem, goes back to sleep. Gets up in morning and asks a developer for help with the real fix, applies fix, and goes on about their day.
I tell the dev leads that my guys are being woken up every night. They ask, "What do you mean?"
"What do you mean, what do I mean? How can you not know? YOU helped them with it this morning! ..."