Love DevOps? Wait 'Til You Meet SRE


A crucial transition is taking place at Atlassian... we can feel our DNA evolving a little each day. Our focus is always on the future, and that future will mean rallying behind a cloud-first strategy. In doing so, we have the unique opportunity to re-imagine the way we run our services and get behind a modern approach to distribute our operations function and optimize for scale. This talk will cover the steps we've taken on that journey as we build site reliability engineering, an operations approach pioneered by industry champs like Netflix. We'll talk about the concept, how it applies at Atlassian, the wins we have achieved, and lessons you can take back to your team.


NICK WRIGHT • SRE MANAGER • ATLASSIAN
Love DevOps? Wait ‘til you meet SRE!

Agenda
• SETTING THE SCENE
• SRE AND HOW IT CAN HELP
• GETTING STARTED
• OPS TOOLCHAIN

Too much firefighting
Caters News Agency

Fixing the same thing repeatedly
America’s Funniest Home Videos

Agenda: SRE AND HOW IT CAN HELP

Site Reliability Engineering
Decentralised
Multiple distinct operations teams, or a You-Build-It, You-Run-It model.
Specialised
Engineers focus on a single service or group of related services.
Preventative
Primary focus: get away from break-fix, do work that prevents outages.

SRE vs DevOps?
SRE
• Operations
• Incident response
• Post Mortems
• Monitoring, Events, Alerting
• Capacity planning
• Primary focus: Reliability
DevOps
• Delivery
• Release automation
• Environment builds
• Config management
• Infrastructure as code
• Primary focus: Delivery Speed

Balance
Interrupt vs Preventative work
GravityGlue.com

Hire Devs!
And have a common hiring pool

Agenda: GETTING STARTED

The journey to SRE
Vision
Define how the team will work and how we measure success
Build
Get the team up and running!
Improve
Revisit regularly - if it's not working, tweak, change, refine.

Vision
Goals and Metrics
In 6 months we will:
• Replace monitoring
• DR Plan and Test
How do we measure success?
• Number of Incidents
• PIR Coverage
Responsibilities
• Service list
• Service Owners
• Team Duties
Team Structure
Size and structure of team

Build
Hiring
Get the team in place
• Start Early!
• Promotion Opportunities
• Existing hiring pipeline
Tools
Set things up so they can work!
• Last part of the talk!
Training
• Bootcamps
• Wheel of Misfortune!

Improve
• Regular check-ins
• Review decisions
• Change where needed
• Blog success stories!

100% Post Incident Review Completion Rate
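
A completion rate like this can be derived from ticket data. The following is a minimal sketch, assuming each incident record carries a flag saying whether its Post Incident Review is done. The `Incident` type, its field names, and the ticket keys are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    key: str             # hypothetical ticket key, e.g. "OPS-1"
    pir_completed: bool  # True once the Post Incident Review is finished


def pir_completion_rate(incidents: list[Incident]) -> float:
    """Return the percentage of incidents that have a completed PIR."""
    if not incidents:
        # Assumption: with no incidents there is nothing outstanding.
        return 100.0
    done = sum(1 for incident in incidents if incident.pir_completed)
    return 100.0 * done / len(incidents)


# Made-up sample data for illustration only.
incidents = [
    Incident("OPS-1", True),
    Incident("OPS-2", True),
    Incident("OPS-3", False),
]
print(f"{pir_completion_rate(incidents):.1f}%")
```

Treating an empty list as 100% is a design choice; a team might prefer to report "not applicable" instead.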

“The SRE team runs ahead of the rest of the team on reliability and encourages everyone to lift their game”
ANDRE SERNA, DEV MANAGER

“In the past the separate ops and dev teams would often pick the solution they were best positioned to implement. I like that our SRE team is able to pick the best solution to the problem instead.”
JAMES BUNTON, DEV-ON-ROTATION

[Diagram: Alerts, Dashboard, Incident Ticket, HOT room, Ops room, SREs, Atlassians, Ops JIRA, Confluence, Run Book]

[Diagram: JIRA, HipChat, Discussions, Incident Ticket, HOT room]

Incident Ticket
Pending → Fixing → Reviewing → Closed
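
The four ticket statuses can be modelled as a small state machine. This is a sketch only: the slide lists the statuses but not the allowed transitions, so the strictly forward, one-step-at-a-time rule below is an assumption, and the `Status` and `transition` names are hypothetical.

```python
from enum import Enum


class Status(Enum):
    PENDING = "Pending"
    FIXING = "Fixing"
    REVIEWING = "Reviewing"
    CLOSED = "Closed"


# Assumption: tickets only move forward, one status at a time.
ALLOWED = {
    Status.PENDING: {Status.FIXING},
    Status.FIXING: {Status.REVIEWING},
    Status.REVIEWING: {Status.CLOSED},
    Status.CLOSED: set(),
}


def transition(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


# Walk one ticket through the full lifecycle.
status = Status.PENDING
for step in (Status.FIXING, Status.REVIEWING, Status.CLOSED):
    status = transition(status, step)
print(status.value)
```

A real workflow would likely allow reopening or moving back from review to fixing; those edges can be added to `ALLOWED` without changing `transition`.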

Incident Ticket
[Matrix with axes ALL / MOST / FEW / ONE and Minor Impact / Moderate Impact / Severe Impact / Outage]

Incident Ticket
Fail → Detect → Respond → Fix → Close
JIRA ticket
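
If a ticket records a timestamp for each stage, the time spent between stages falls out directly. The sketch below assumes the stages run in the order Fail, Detect, Respond, Fix, Close; the function name, the dictionary keys, and the sample timestamps are hypothetical and for illustration only.

```python
from datetime import datetime, timedelta

# Assumed stage order for an incident timeline.
STAGES = ["fail", "detect", "respond", "fix", "close"]


def stage_durations(timeline: dict[str, datetime]) -> dict[str, timedelta]:
    """Return the time elapsed between each pair of consecutive stages."""
    durations = {}
    for earlier, later in zip(STAGES, STAGES[1:]):
        delta = timeline[later] - timeline[earlier]
        if delta < timedelta(0):
            raise ValueError(f"{later} is recorded before {earlier}")
        durations[f"{earlier}_to_{later}"] = delta
    return durations


# Made-up timestamps for one incident.
start = datetime(2016, 1, 1, 9, 0)
timeline = {
    "fail": start,
    "detect": start + timedelta(minutes=5),
    "respond": start + timedelta(minutes=8),
    "fix": start + timedelta(minutes=50),
    "close": start + timedelta(hours=2),
}
for name, delta in stage_durations(timeline).items():
    print(name, delta)
```

Tracking these gaps separately shows whether time is being lost in detection, in response, or in the fix itself.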

[Diagram: Incident Ticket, HOT room, Ops room, SREs, Ops JIRA, Confluence, JIRA, Actions!]

[Diagram: JIRA workflows Pending → Fixing → Reviewing → Closed and Draft → Approval → Published → Completed]

[Diagram: JIRA, Team 1, Team 2, Team 3, Reporting]

Pedro Canahuati, “Scaling the Operations Organisation at Facebook”
Ben Treynor, “Keys to SRE”

Thank you!
NICK WRIGHT • SRE MANAGER • ATLASSIAN


Love DevOps? Wait 'Til You Meet SRE

1. NICK WRIGHT • SRE MANAGER • ATLASSIAN Love DevOps? Wait ‘til you meet SRE!

2. Agenda: SETTING THE SCENE • SRE AND HOW IT CAN HELP • GETTING STARTED • OPS TOOLCHAIN

4. 10+ incidents per month

5. 100+ incidents per month

6. 400+ incidents per month

7. 900+ incidents per month

10. Too much firefighting Caters News Agency

11. Fixing the same thing repeatedly America’s Funniest Home Videos

12. Job Satisfaction NASA

13. Service Ops Application Development

14. Agenda: SRE AND HOW IT CAN HELP

15. Site Reliability Engineering. Decentralised: Multiple distinct operations teams, or a You-Build-It, You-Run-It model. Specialised: Engineers focus on a single service or group of related services. Preventative: Primary focus: get away from break-fix, do work that prevents outages.

16. SRE vs DevOps? SRE: • Operations • Incident response • Post Mortems • Monitoring, Events, Alerting • Capacity planning • Primary focus: Reliability. DevOps: • Delivery • Release automation • Environment builds • Config management • Infrastructure as code • Primary focus: Delivery Speed

17. Solutions

18. Balance Interrupt vs Preventative work GravityGlue.com

19. Hire Devs! And have a common hiring pool

20. Always do Post-Mortems

21. Scrap the release meeting!

22. Agenda: GETTING STARTED

23. ? ? ? ? ? ??

24. The journey to SRE. Vision: Define how the team will work and how we measure success. Build: Get the team up and running! Improve: Revisit regularly - if it's not working, tweak, change, refine.

25. Vision. Goals and Metrics: In 6 months we will: • Replace monitoring • DR Plan and Test. How do we measure success? • Number of Incidents • PIR Coverage. Responsibilities: • Service list • Service Owners • Team Duties. Team Structure: Size and structure of team.

26. Team Structure: Developer Teams, SRE

27. Build. Hiring: Get the team in place • Start Early! • Promotion Opportunities • Existing hiring pipeline. Tools: Set things up so they can work! • Last part of the talk! Training: • Bootcamps • Wheel of Misfortune!

28. Improve: Regular check-ins • Review decisions • Change where needed • Blog success stories!

29. Does it work?!

30.

31. 100% Post Incident Review Completion Rate

32. DR Compliance

33. “The SRE team runs ahead of the rest of the team on reliability and encourages everyone to lift their game” ANDRE SERNA, DEV MANAGER

34. “In the past the separate ops and dev teams would often pick the solution they were best positioned to implement. I like that our SRE team is able to pick the best solution to the problem instead.” JAMES BUNTON, DEV-ON-ROTATION

35. Agenda: OPS TOOLCHAIN

36. Incident

37. Alerts Dashboard Incident Ticket HOT room Ops room SREs Atlassians Ops JIRA Confluence Run Book

38. Ops room Ops JIRA JQL Select Action

39. JIRA HipChat Discussions Incident Ticket HOT room

40. Incident Ticket Pending Fixing Reviewing Closed

41. Incident Ticket ALL MOST FEW ONE Minor Impact Moderate Impact Severe Impact Outage

42. Incident Ticket: Fail → Detect → Respond → Fix → Close. JIRA ticket

43. Post Mortem

44. Incident Ticket HOT room Ops room SREs Ops JIRA Confluence JIRA Actions!

45. Confluence

46. Confluence Actions Linked Here