HumanOps is a set of principles which focus on the human aspects of running infrastructure.
It deliberately highlights the importance of the teams running systems, not just the systems themselves.
The health of your infrastructure is not just about hardware, software, automation and uptime; it also includes the health and wellbeing of your team.
The goal of HumanOps is to improve and maintain the good health of your team: easing communication, reducing fatigue and reducing stress.
6. ● Humans are part of any system
● Initial design, ongoing improvements
● Maintenance
● Upgrades
● Issues, Incident response
7. ● System issues = error rates + SLA + ...
● Human issues = alerts out of hours + interruptions + ...
● System issues = Human issues
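One way to make "alerts out of hours" measurable is to count them from alert timestamps. The following is a minimal sketch, not a method taken from the talk: the working hours (09:00 to 18:00, Monday to Friday), the function names and the sample timestamps are all assumptions made for illustration.

```python
from datetime import datetime, time

# Assumed working hours: 09:00-18:00, Monday to Friday (hypothetical policy).
WORK_START = time(9, 0)
WORK_END = time(18, 0)


def is_out_of_hours(ts: datetime) -> bool:
    """Return True if an alert fired outside the assumed working hours."""
    if ts.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return True
    return not (WORK_START <= ts.time() < WORK_END)


def out_of_hours_count(alert_times: list[datetime]) -> int:
    """Count alerts that reached someone outside working hours."""
    return sum(is_out_of_hours(ts) for ts in alert_times)


# Made-up sample data, for illustration only.
alerts = [
    datetime(2024, 1, 8, 10, 30),  # Monday 10:30, within working hours
    datetime(2024, 1, 9, 2, 15),   # Tuesday 02:15, out of hours
    datetime(2024, 1, 13, 14, 0),  # Saturday 14:00, out of hours
]
print(out_of_hours_count(alerts))
```

Tracked per person and per week, a count like this can sit next to error rates and SLA figures, so that human issues are reviewed alongside system issues.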
8. ● Downtime = loss of users, reputation, revenue
● Downtime caused by unreliable systems
● Unhealthy teams reduce reliability
● Unhealthy teams = loss of users, reputation, revenue
13. ● Power failure to half of our servers
● Automated failover unavailable (known failure condition)
● Manual DNS switch required
● Expected impact: 20 min
● Actual impact: 43 min
18. ● First responder, acknowledge alert
● Load incident response checklist
● Log into #ops-war-room in Slack
● Log incident into JIRA
● Begin investigation
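The first-responder steps above can be held as data rather than prose, so that each step is ticked off with a timestamp. This is a rough sketch under assumptions: the `Step` and `Checklist` classes are hypothetical names, and nothing here talks to Slack or JIRA; it only tracks which steps are done.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Step:
    description: str
    done_at: Optional[datetime] = None  # set when the step is completed


@dataclass
class Checklist:
    name: str
    steps: list[Step] = field(default_factory=list)

    def complete(self, index: int) -> None:
        """Mark a step as done and record when (UTC)."""
        self.steps[index].done_at = datetime.now(timezone.utc)

    def remaining(self) -> list[str]:
        """Return descriptions of the steps not yet completed."""
        return [s.description for s in self.steps if s.done_at is None]


# Steps taken from the slide; the class names are illustrative only.
incident = Checklist(
    name="Incident response",
    steps=[
        Step("Acknowledge alert"),
        Step("Load incident response checklist"),
        Step("Log into #ops-war-room in Slack"),
        Step("Log incident into JIRA"),
        Step("Begin investigation"),
    ],
)
incident.complete(0)
print(incident.remaining())
```

Recording a timestamp per step also gives the post-incident review a timeline for free.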
19. 1. Extended use of checklists
2. Not to be followed blindly; use knowledge and experience
3. Independent system
4. Searchable
5. List of known issues and documented workarounds/fixes
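A searchable list of known issues can start as something very small. The sketch below assumes an in-memory list and a plain keyword match; `KnownIssue` and `search` are hypothetical names, and the second entry is invented sample data. A real version would live on a system independent of the infrastructure it documents, as point 3 requires.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownIssue:
    title: str
    symptoms: str
    workaround: str


def search(issues: list[KnownIssue], query: str) -> list[KnownIssue]:
    """Return issues whose text contains every word of the query."""
    terms = query.lower().split()
    results = []
    for issue in issues:
        text = f"{issue.title} {issue.symptoms} {issue.workaround}".lower()
        if all(term in text for term in terms):
            results.append(issue)
    return results


issues = [
    # Based on the power failure incident described in this deck.
    KnownIssue(
        title="Power failure, automated failover unavailable",
        symptoms="Half of the servers unreachable",
        workaround="Manual DNS switch",
    ),
    # Invented entry, for illustration only.
    KnownIssue(
        title="Log volume full",
        symptoms="Writes fail on the logging host",
        workaround="Rotate and compress old logs",
    ),
]

for match in search(issues, "failover power"):
    print(match.title, "->", match.workaround)
```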
20. ● The “limits of human memory and attention”
○ Complexity
○ Stress and fatigue
○ Ego
● Pilots, doctors, divers: “Bruce Willis Ruins All Films” (BCD, weights, releases, air, final)
22. ● Realistic replica environment, or mock command line
● Record actions and timing
● Multiple failures
● Unexpected results
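"Record actions and timing" during a drill could be as simple as the sketch below. It is an illustration under assumptions, not the tooling used in the talk: `DrillRecorder` is a hypothetical name, and the clock is injectable only so the example gives repeatable timings.

```python
import time
from typing import Callable


class DrillRecorder:
    """Records what a responder did during a drill and when."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._start = clock()  # drill start time
        self.events: list[tuple[float, str]] = []

    def record(self, action: str) -> None:
        """Store the action with the seconds elapsed since the drill began."""
        self.events.append((self._clock() - self._start, action))

    def total_seconds(self) -> float:
        """Elapsed time at the last recorded action, or 0.0 if none."""
        return self.events[-1][0] if self.events else 0.0


# A fake clock stands in for real time so the example is deterministic.
ticks = iter([0.0, 5.0, 65.0])
recorder = DrillRecorder(clock=lambda: next(ticks))
recorder.record("Acknowledge alert")
recorder.record("Switch DNS manually")

for elapsed, action in recorder.events:
    print(f"{elapsed:6.1f}s  {action}")
```

Comparing the recorded timings with the expected impact (20 min expected against 43 min actual in the power failure above) shows which steps take longer than planned.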