HipChat operates a ‘You Build It, You Run It’ service model, where developers are responsible for building, testing, and operating their systems. While we have a high speed of development, things can break – but we also recover quickly. Learn about how we've integrated best practices within our planning, building, operating and learning processes to optimize for speed and efficiency but also mitigate, prepare for, and handle incidents.
The presenter will walk you through four steps for operating at a high speed of development while preparing for any incident (planning, prevention, preparation, and collecting feedback) and show you how to build these processes into your Atlassian workflow (including JIRA Software, HipChat, Bitbucket, Confluence, Bamboo, and StatusPage).
Learn about:
- Planning: How we use JIRA Software and Confluence to plan roadmaps and sync up with teams
- Prevention: Best practices during code reviews and testing
- Preparation: How we prepare for incidents with war games
- Review: Collecting feedback, assessing incident causes and improving our processes
Come out of this session with a newfound understanding of how to use Atlassian products within your DevOps workflow!
Mickie Betz, Software Developer, Atlassian
DACI: Decision Making Framework
• Gather Stakeholders
• Gather Data
• Consider Options
20. Benefits of DACIs
• Transparent: All team members can provide insight and watch the decision making process
• Documented: Provides historical context on why decisions were made, by whom, and with what information
• Quick: The hard deadline ensures decisions are made fast
Try it yourself! Available as a Confluence template or PDF here:
https://www.atlassian.com/team-playbook/plays/daci
22. To ensure a continuous flow of work:
• Cards should represent work that:
  • is immediately consumable
  • will take no longer than 2 days to accomplish
• Obey ‘Work in Progress’ limits for stages of work
24. Column Exceeds WIP Limit
Code Review has a max of 2 Work In Progress tickets, yet 3 have been pulled in. The column turns red to indicate it needs attention.
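The WIP limit rule on this slide can be sketched in a few lines of Python. This is only an illustration of the rule, not how JIRA Software implements it: the `Column` class, the `columns_needing_attention` helper, and the ticket keys are all hypothetical names made up for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Column:
    """One stage of work on a board (hypothetical model, not a JIRA API object)."""
    name: str
    wip_limit: Optional[int] = None  # None means the stage has no limit
    tickets: list = field(default_factory=list)

    def exceeds_wip_limit(self) -> bool:
        """True when the column holds more tickets than its limit allows."""
        return self.wip_limit is not None and len(self.tickets) > self.wip_limit


def columns_needing_attention(board: list) -> list:
    """Names of the columns that a board would highlight in red."""
    return [col.name for col in board if col.exceeds_wip_limit()]


# Mirrors the slide: Code Review allows 2 tickets, but 3 were pulled in.
# Ticket keys are invented for illustration.
board = [
    Column("In Progress", wip_limit=3, tickets=["HC-1", "HC-2"]),
    Column("Code Review", wip_limit=2, tickets=["HC-3", "HC-4", "HC-5"]),
    Column("Done", tickets=["HC-6"]),
]
print(columns_needing_attention(board))  # ['Code Review']
```

A column with no limit (such as Done here) is never flagged, which is an assumption of this sketch.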
25. MOST IMPORTANTLY…
We Plan To Fail*
* But don’t fail to plan
29. Failing well is no different from any other skill. If you want to be good at it you need to practice.
30. Pick a System
Identify the team that delivers the services for that system, invite them to participate, and think of all the ways the system can fail.
War Game Process: Generate Scenarios, Role Play, Plan the Game, Break things!
31. Talk through Assumptions
Use the output of this session to:
• compare with the results of the actual War Game exercise
• identify and make necessary improvements before the actual War Game
Attribution: http://www.nerdlikeyou.com/4-low-cost-costume-tips/
32. Talk through Assumptions
• What are the expected impacts?
• How would you detect them?
• What monitoring would trip?
• Who is responsible for responding, and what would they do to restore the service?
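One way to capture the answers to these questions is a small record per scenario, so they can be compared with what happens during the actual War Game. The sketch below is an assumption about how that might look: the `ScenarioAssumptions` class, its field names, and the sample values are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass
class ScenarioAssumptions:
    """Answers agreed during role play for one failure scenario (hypothetical fields)."""
    scenario: str
    expected_impacts: list
    detection: str
    monitors_expected_to_trip: list
    responder: str
    restore_steps: list

    def unanswered(self) -> list:
        """Names of the questions that still have an empty answer."""
        return [name for name, value in vars(self).items() if not value]


# Illustrative values only; none of these come from the talk.
draft = ScenarioAssumptions(
    scenario="Primary database becomes unreachable",
    expected_impacts=["Message history fails to load"],
    detection="",
    monitors_expected_to_trip=[],
    responder="On-call developer",
    restore_steps=["Fail over to the replica"],
)
print(draft.unanswered())  # ['detection', 'monitors_expected_to_trip']
```

Listing the unanswered questions makes gaps in the team's assumptions visible before anything is broken on purpose.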
33. Choose a scenario
Good scenario candidates:
• have no customer impact
• match an actual failure as closely as possible
Video Source: Giphy
34. War Game
• Induce a failure
• Observe expected vs actual impact
• Practice all parts of running an incident, including communication
• Prioritize actions to improve
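Observing expected vs actual impact amounts to a set comparison. The following is a minimal sketch under that assumption; the `compare_impacts` function, its result keys, and the sample impacts are hypothetical and only illustrate the idea.

```python
def compare_impacts(expected: set, actual: set) -> dict:
    """Split impacts into confirmed, expected-only, and actual-only groups."""
    return {
        "confirmed": sorted(expected & actual),
        "did_not_happen": sorted(expected - actual),
        "surprises": sorted(actual - expected),
    }


# Illustrative impacts only; none of these come from the talk.
expected = {"elevated error rate", "pager alert fires"}
actual = {"elevated error rate", "login latency increases"}
result = compare_impacts(expected, actual)
print(result["surprises"])  # ['login latency increases']
```

The surprises and the impacts that did not happen are natural inputs when prioritizing actions to improve: here the expected alert never fired, which would point at a monitoring gap.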
35. Benefits of War Games
• Proactive: We introduce monitoring, alerting, and logging as we build the system, so we can identify problems early
• Practice Makes Perfect: Preparing for incidents strengthens our systems and our incident management and communication skills
37. Under conditions of complexity, not only are checklists a help, they are required for success
ATUL GAWANDE, THE CHECKLIST MANIFESTO: HOW TO GET THINGS RIGHT
38. OPERATIONAL READINESS
Example tasks to complete before a service is operationally ready:
• Performance tests are run for anticipated peak load
• Disaster Recovery plan is covered and documented
• Data is encrypted in-transit and at-rest
• System is externally monitored
• Logs are accessible
• On-call schedule is set up
• Automatic testing occurs before changes are made to production
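A readiness checklist like this can be treated as data and checked before launch. The sketch below assumes the checklist is a plain list of task names and that a service is ready only when every task is done; `READINESS_TASKS`, `outstanding_tasks`, and `is_operationally_ready` are hypothetical names, and the task wording is paraphrased from the slide.

```python
# Task names paraphrased from the slide; the structure is an assumption.
READINESS_TASKS = [
    "Performance tests for anticipated peak load",
    "Disaster recovery plan documented",
    "Data encrypted in transit and at rest",
    "External monitoring in place",
    "Logs accessible",
    "On-call schedule set up",
    "Automated tests run before production changes",
]


def outstanding_tasks(completed: set) -> list:
    """Readiness tasks that have not been completed, in checklist order."""
    return [task for task in READINESS_TASKS if task not in completed]


def is_operationally_ready(completed: set) -> bool:
    """A service is ready only when no checklist task is outstanding."""
    return not outstanding_tasks(completed)


done = set(READINESS_TASKS) - {"Logs accessible", "On-call schedule set up"}
print(outstanding_tasks(done))       # ['Logs accessible', 'On-call schedule set up']
print(is_operationally_ready(done))  # False
```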
40. Peer code reviews are the single biggest thing you can do to improve your code
JEFF ATWOOD, CO-FOUNDER OF STACK OVERFLOW
Write and Ship Good Code, Fast: Code Reviews, CI/CD, Automated Tests
41. HipChat Code Reviews:
• Review all code: nothing is too short or simple!
• Code changes require two approvers
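The two-approver rule can be expressed as a simple check. This sketch is not how Bitbucket enforces merge checks: `can_merge` and the reviewer names are hypothetical, and excluding the author's own approval and counting each reviewer once are assumptions added for the example.

```python
REQUIRED_APPROVALS = 2  # from the slide: code changes require two approvers


def can_merge(author: str, approvers: list) -> bool:
    """True when at least two distinct people other than the author approved."""
    distinct_reviewers = set(approvers) - {author}
    return len(distinct_reviewers) >= REQUIRED_APPROVALS


# Names are invented for illustration.
print(can_merge("dana", ["alex", "sam"]))   # True
print(can_merge("dana", ["alex", "alex"]))  # False: one reviewer counted once
print(can_merge("dana", ["dana", "alex"]))  # False: self-approval is ignored
```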
45. Continuous Integration/Continuous Deployment
Merge small amounts of code often to reduce risk, and deploy changes once the build goes green
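The "deploy once the build goes green" rule can be sketched as a gate in front of the deploy step. This is an illustration under stated assumptions, not Bamboo configuration: `BuildStatus`, `should_deploy`, `process_build`, and the commit identifiers are hypothetical, and the deploy step is passed in as a plain callable.

```python
from enum import Enum


class BuildStatus(Enum):
    """Possible outcomes of a CI build (hypothetical model)."""
    RUNNING = "running"
    GREEN = "green"
    RED = "red"


def should_deploy(status: BuildStatus) -> bool:
    """Only a green build is allowed through to deployment."""
    return status is BuildStatus.GREEN


def process_build(commit: str, status: BuildStatus, deploy) -> str:
    """Call the deploy step for green builds and report what happened."""
    if should_deploy(status):
        deploy(commit)
        return f"deployed {commit}"
    return f"skipped {commit}: build is {status.value}"


deployed = []  # stands in for a real deploy step
print(process_build("abc123", BuildStatus.GREEN, deployed.append))  # deployed abc123
print(process_build("def456", BuildStatus.RED, deployed.append))    # skipped def456: build is red
```

Because each merge is small, a red build blocks only a small change, which keeps the cost of the gate low.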
47. Automated Tests
Triggered by our CI/CD builds, automated tests:
• Increase developer confidence in changes
• Minimize risk of code changes and deploys
• Increase code quality
48. I want to be woken up in the middle of the night for an incident.
NO ONE EVER
49. You Build It, You Run It
• Proactive: Encourages monitoring so we can identify issues before they’re problems
• Transparent: Encourages knowledge sharing so on-call members are informed
• Ownership: Encourages autonomy and accountability
52. Post Incident Reviews
• Blameless: Focus on learning, not punishment
• Actionable: Find follow-up actions to prevent repeat occurrences
• Learn: Understand the root cause of the incident.
53. Finding the root cause of a failure is like finding the root cause of a success.
POST INCIDENT REVIEWS