Unicorn On-call :: DevOpsDays Portugal 2019, Lisbon
  1. Unicorn On-call :: DevOpsDays Portugal 2019, Lisbon
  2. Unicorn On-call :: DevOpsDays Portugal 2019, Lisbon
  3. Unicorn On-call :: DevOpsDays Portugal 2019, Lisbon
  4. Hi there! I’m Pedro!
      • Engineering Director @
      • Impact-driven person
      • Passionate about People, Technology, and Products
      • Agile, Lean and DevOps aficionado
      • 10+ years of experience running engineering teams
  5. On-call :: Definition
      (of a person) able to be contacted in order to provide a professional service if necessary, but not formally on duty.
      ‘The team is on call 24 hours-a-day, and is trained in resuscitation techniques and how to use life-saving defibrillators.’
      ‘If you work in a global organization, you might be on call 24 hours a day for troubleshooting or consulting.’
      ‘You have to get up in the middle of the night if you're on call.’
  6. On-call :: You need: Shifts, People, Systems
  7. On-call :: You need: Rotas, People, Systems
  8. On-call :: You need: Rotas, Heroes, Systems
  9. On-call :: You need: Rotas, The Critical Ones, Heroes
  10. On-call :: Why: Customers, Production Systems, Engineers, Company, Job
  11. Our On-call journey (so far)
  12. Choose your civilization: Greek, Egyptian, Mesopotamian, Asian
  13. Stone Age: Town Center, 200 (wood)
  14. Stone Age
      • Everyone was on-call (even the CEO)
  15. Tool Age
  16. Tool Age
      • “Operations team” aka “DevOps team”
      • “DevOps engineers” were on-call (not all of them)
  17. Tool Age
      • Officially, we only had two engineers on-call
  18. Tool Age
      • 3-day rotas
  19. Tool Age
      • After business hours, everyone available on Slack would help out
  20. Tool Age
      • VictorOps
  21. Tool Age
      • Tons of alarms
      • False positives (Broken windows theory https://en.wikipedia.org/wiki/Broken_windows_theory)
      • MTTA not tracked
      • MTTR “over 9000”
      • All systems were on-call (because none was… so all of them were)
  22. Tool Age
      • No compensation (voluntary and pro bono)
  23. Tool Age
      • Alarms in staging environments
  24. Tool Age
      • Fatigue and burnout
  25. Tool Age
      • Churn
  26. Tool Age
      • Blameless post-mortems (PMs) for PIs and PEs
  27. Bronze Age
  28. Bronze Age
      • We evaluated 3 scenarios: “Primary / Secondary”, “Just primary” and “Primary / Secondary (SRE)”
      • SRE team covering its own rota (the infra one) –> we rebranded the Ops team as the SRE team
      • Development teams with rotas (dedicated to their systems)
      • One engineer per rota (no secondaries)
      • Engineers on-call (eat your own dog food: you develop it… you maintain it in PROD!)
  29. Bronze Age
      • We had more than twenty engineers on-call
  30. Bronze Age
      • Tools: one hotspot per rota (no smartphones, so that people don’t have to carry two devices) + VictorOps app
  31. Bronze Age
      • One-week rotas (four rotas in total)
      • Rotas start and end every Tuesday (i.e. end-of-sprint day), aligning the rota calendar with the sprint calendar
  32. Bronze Age
      • Only critical systems covered by the program (defined by Engineering and agreed with stakeholders, e.g. Product, Customer Services, Support)
  33. Bronze Age
      • On-call playbooks
  34. Bronze Age
      • Incident commander defined: the Incident Commander (IC) holds the high-level state about the incident. They structure the incident response task force, assigning responsibilities according to need and priority
  35. Bronze Age
      • Weekly fire drills (or, as Google calls it, "Wheel of Misfortune")
  36. Bronze Age
      • Compensation defined (money + time off). Flat fee. No compensation per incident
  37. Bronze Age
      • Alarms fine-tuned
      • Defined time to ack: under 5 minutes
      • Redefined thresholds
      • Distinguished alarms from notifications: an alarm requires immediate action; a notification can wait for the next day or so
      • Cleaned up alarms from non-production environments
  38. Bronze Age
      • Volunteer-based, not compulsory (yeah… we ran into “trouble” and I went on-call because of that: eat your own dog food… lead by example… I took 4 consecutive weeks on-call)
  39. Bronze Age
      • Engineers participating in multiple rotas
      • Avoiding engineers doing rotas back to back
  40. Bronze Age
      • PTO/vacations and unexpected leave self-managed (with facilitation) by each rota
  41. Bronze Age
      • Acacio’s list when joining the program (origin: internal meetup with Acacio Cruz –> Google SRE and co-author of the Google SRE book)
  42. Bronze Age
      • Shadowing when joining the program
  43. Bronze Age
      • Little time to work on the resiliency of systems (hard to prioritize and hard to complete action points from PMs during sprints)
  44. Bronze Age
      • On-call procedure
      • Updating the company’s status page
      • Keeping the organization/stakeholders informed of the incident status every 5 minutes
  45. Bronze Age
      • 24x7x365 coverage
  46. Bronze Age
      • Performance reviews completely disassociated from the on-call program (no one gets a worse review for not participating in the program)
  47. Bronze Age
      • Although we have offices in different time zones, we didn’t use a “follow the sun” strategy (lack of engineers in the US)
  48. Bronze Age
      • P0s are all hands on deck and we are “entitled” to call all engineers that can help
      • Panic button on Slack with Zapier integration
  49. Iron Age
  50. Iron Age
      • Vanguard program (thank you Raoul, Bruno, and Sean)
  51. Iron Age
      • Gamification
  52. Iron Age
      • On-call engineers stay off their regular sprint to work on Vanguard’s backlog
  53. Final thoughts
  54. Final thoughts
      • On-call doesn’t need to suck
  55. Final thoughts
      • Be fair
      • Be honest
      • Be respectful
  56. Final thoughts
      • Although the engineers are being paid to be on-call… don’t forget that they are doing us a favor!
  57. Final thoughts
      • Don’t aim for perfection and don’t overthink things… Any start is better than none
  58. Final thoughts
      • The Google SRE book is a great inspiration (and reading the entire book is a Herculean task… 552 pages!)
  59. Final thoughts
      • #oncallselfie
  60. Final thoughts
      • Burnout is a real thing… it affects performance and churn… but most importantly… health!
  61. Final thoughts
      • Tune those alarms! It’s one of the main factors of success!
  62. Final thoughts
      • Don’t make rushed decisions because you are getting too many alerts (e.g. turning off alarms)
  63. Final thoughts
      • Take advantage of business hours (when you have the entire engineering team at the office) to tackle issues that might come up outside business hours (when you “only” have the on-call engineers available)
  64. Final thoughts
      • Being on-call doesn’t mean that you need to save the world. We don’t need “Rambos”… so play it safe, stick to the playbooks and don’t make risky decisions under stress
  65. Final thoughts
      • Don’t hesitate to jump into a (video) call to coordinate the incident resolution (usually Slack is not enough) – sync vs async comms
  66. Final thoughts
      • Don’t forget to keep the stakeholders in the loop (we are in the heat zone… but they are suffering from the sideline… and they need to know what is happening)
  67. Final thoughts
      • Action items from (blameless) post-mortems should be tracked, and someone should ensure they are executed
  68. Final thoughts
      • Don’t fall into the wishful thinking game: if you believe/suspect that an alarm is triggered by something harmless that you “can’t control” (e.g. network glitch)… be ready to prove that… otherwise don’t stop investigating the root cause
  69. Final thoughts
      • Always write PMs (for PEs and PIs) and bear in mind that you should have public versions of the PM (sooner or later your customers will ask for them)
  70. Thank you :: DevOpsDays Portugal 2019, Lisbon
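
The rota rules from slides 31 and 39 (one-week rotas that start and end on Tuesday, one engineer per rota, engineers allowed in several rotas but not back to back) can be sketched in a few lines of Python. This is only an illustration, not the tooling used in the talk: the function names, the rota names, and the engineer names are all hypothetical, and the "fewest shifts so far" tie-break is an assumption added to keep the load even.

```python
from __future__ import annotations

from datetime import date, timedelta

TUESDAY = 1  # date.weekday() counts Monday as 0


def next_tuesday(day: date) -> date:
    """Return `day` itself if it is a Tuesday, otherwise the following Tuesday."""
    return day + timedelta(days=(TUESDAY - day.weekday()) % 7)


def build_schedule(rotas: dict, start: date, weeks: int) -> list:
    """Assign one engineer per rota per week (hypothetical helper).

    `rotas` maps a rota name to its list of volunteers. An engineer may
    belong to several rotas, but is never picked for two rotas in the same
    week and, whenever someone else is available, not for consecutive weeks.
    """
    shifts = {name: 0 for members in rotas.values() for name in members}
    previous_week = set()
    schedule = []
    week_start = next_tuesday(start)
    for _ in range(weeks):
        this_week = {}
        for rota, members in rotas.items():
            # Engineers not already on call in another rota this week.
            free = [m for m in members if m not in this_week.values()]
            # Prefer engineers who were not on call last week.
            rested = [m for m in free if m not in previous_week]
            pool = rested or free
            if not pool:
                raise ValueError(f"no engineer available for rota {rota!r}")
            # Assumption: balance the load by picking the least-used engineer.
            chosen = min(pool, key=lambda m: shifts[m])
            this_week[rota] = chosen
            shifts[chosen] += 1
        schedule.append((week_start, this_week))
        previous_week = set(this_week.values())
        week_start += timedelta(weeks=1)
    return schedule


if __name__ == "__main__":
    demo_rotas = {  # hypothetical rotas and volunteers
        "infra": ["ana", "bea", "caio"],
        "payments": ["bea", "dani", "edu"],
    }
    for start, assignment in build_schedule(demo_rotas, date(2019, 6, 3), 4):
        print(start.isoformat(), assignment)
```

With only two volunteers in a rota, the "rested" preference can no longer be satisfied every week, which is one reason the back-to-back rule depends on having enough volunteers.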

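
Slide 37 separates alarms (immediate action) from notifications (can wait a day or so), removes alarms from non-production environments, and sets a time-to-ack target of under 5 minutes. A minimal Python sketch of that triage follows. The `Alert` fields, the `route` rule, and the helper names are assumptions made for illustration; they are not VictorOps concepts or the configuration used by the team.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional

ACK_TARGET = timedelta(minutes=5)  # "time to ack under 5 minutes"


@dataclass
class Alert:
    """Hypothetical alert record; the field names are illustrative only."""
    name: str
    environment: str              # e.g. "production" or "staging"
    needs_immediate_action: bool
    raised_at: datetime
    acked_at: Optional[datetime] = None


def route(alert: Alert) -> str:
    """Page only for production issues that need immediate action."""
    if alert.environment == "production" and alert.needs_immediate_action:
        return "alarm"
    # Assumption: everything else, including all non-production alerts,
    # becomes a notification that can wait for business hours.
    return "notification"


def ack_breaches(alerts: list, now: datetime) -> list:
    """Return alarms acknowledged late, or still unacknowledged past the target."""
    breaches = []
    for alert in alerts:
        if route(alert) != "alarm":
            continue
        waited = (alert.acked_at or now) - alert.raised_at
        if waited > ACK_TARGET:
            breaches.append(alert)
    return breaches


def mtta(alerts: list) -> Optional[timedelta]:
    """Mean time to acknowledge across acknowledged alarms, or None if there are none."""
    waits = [
        (a.acked_at - a.raised_at).total_seconds()
        for a in alerts
        if route(a) == "alarm" and a.acked_at is not None
    ]
    return timedelta(seconds=mean(waits)) if waits else None
```

Tracking `mtta` over time gives the MTTA figure that slide 21 lists as not tracked in the earlier period.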