A crucial transition is taking place at Atlassian, and we can feel our DNA evolving a little each day. Our focus is always on the future, and that future means rallying behind a cloud-first strategy. In doing so, we have a unique opportunity to re-imagine the way we run our services and adopt a modern approach that distributes our operations function and optimizes for scale. This talk covers the steps we've taken on that journey as we build site reliability engineering, an operations approach pioneered by industry champs like Netflix. We'll talk about the concept, how it applies at Atlassian, the wins we have achieved, and lessons you can bring back to your team.
24. The journey to SRE
Vision
Define how the team will work and how we measure success
Build
Get the team up and running!
Improve
Revisit regularly - if it's not working, tweak, change, refine.
25. Vision
Goals and Metrics
In 6 months we will:
• Replace monitoring
• DR Plan and Test
How we measure success?
• Number of Incidents
• PIR Coverage
Responsibilities
• Service list
• Service Owners
• Team Duties
Team Structure
Size and structure of team
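The two success metrics on this slide can be sketched in a few lines. This is a minimal illustration, not Atlassian's actual tooling: it assumes "PIR" means post-incident review, that coverage is the share of incidents with a completed review, and that incidents are available as simple records. The `Incident` class, its fields, and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    key: str             # hypothetical incident identifier
    pir_completed: bool  # True once a post-incident review has been written up


def pir_coverage(incidents):
    """Return the fraction of incidents with a completed PIR (0.0 to 1.0)."""
    if not incidents:
        # Assumption: with no incidents, nothing is missing a review.
        return 1.0
    reviewed = sum(1 for incident in incidents if incident.pir_completed)
    return reviewed / len(incidents)


# Made-up sample data for one reporting period.
incidents = [
    Incident("INC-1", True),
    Incident("INC-2", False),
    Incident("INC-3", True),
    Incident("INC-4", True),
]

incident_count = len(incidents)
coverage = pir_coverage(incidents)

print(f"Number of incidents: {incident_count}")
print(f"PIR coverage: {coverage:.0%}")
```

Tracking both numbers together matters: a falling incident count with low PIR coverage says little about whether the team is learning from the incidents it does have.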
27. Build
Hiring
Get the team in place
• Start Early!
• Promotion Opportunities
• Existing hiring pipeline
Tools
Set things up so they can work!
• Last part of the talk!
Training
• Bootcamps
• Wheel of Misfortune!
33. “The SRE team runs ahead of the rest of the team on reliability and encourages everyone to lift their game”
ANDRE SERNA, DEV MANAGER
34. “In the past the separate ops and dev teams would often pick the solution they were best positioned to implement. I like that our SRE team is able to pick the best solution to the problem instead.”
JAMES BUNTON, DEV-ON-ROTATION