Site reliability engineering

SITE RELIABILITY ENGINEERING
FOR GROWING ORGANIZATIONS

My company in 20 seconds
• End-to-end payments platform
• API-first
• Docker, C#, ASP.NET, Java, PowerShell, SQL
• #31 Nilson top merchant acquirer
• Inc. 5000 fastest-growing company
• STL Top Place to Work
• Sound fun? It is. Come see me.

WHAT IS SRE?
• “Ops, if everything is treated as a software problem”
• Typically experienced software devs with a passion for automation & infrastructure
• Sort of like DevOps, but with more of a focus on production automation, resiliency and scalability
• Google wrote the book on it (Site Reliability Engineering), and many companies are now adopting and exploring the practice
• SRE for Google won't be SRE for your team!

GROWING COMPANY PROBLEMS
EXPECTATIONS OFTEN OUTPACE CAPACITY
• If you don't set a service level expectation, expectations will form around 100% uptime
• Keeping everything running as-is gets treated as a sunk cost
• Code atrophies or gets frozen, but the business keeps changing

FINANCES GET MORE FORMAL - YOU NEED METRICS TO JUSTIFY ENHANCEMENTS
• How many more users can we support at our current growth rate?
• If you buy X, will that let us scale?
• If we agree to buy X, can we wait until next year's budget?
These are very hard questions to answer without data and documented expectations for performance & uptime.
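
With data in hand, the first question on this slide becomes simple arithmetic. A minimal sketch, assuming users compound at a fixed monthly rate and that load testing has produced a known user capacity; the function name and the sample figures are hypothetical, not numbers from the talk:

```python
import math


def months_until_capacity(current_users: int, capacity_users: int,
                          monthly_growth_rate: float) -> float:
    """Months of headroom left, assuming compound monthly growth (an assumption)."""
    if current_users <= 0 or capacity_users <= 0:
        raise ValueError("user counts must be positive")
    if monthly_growth_rate <= 0:
        return math.inf  # flat or shrinking usage never reaches capacity
    if current_users >= capacity_users:
        return 0.0  # already at or past the measured capacity
    # Solve current * (1 + rate) ** months == capacity for months.
    return math.log(capacity_users / current_users) / math.log(1 + monthly_growth_rate)


if __name__ == "__main__":
    # Hypothetical inputs: 40k users today, 100k measured capacity, 8% monthly growth.
    months = months_until_capacity(40_000, 100_000, 0.08)
    print(f"{months:.1f} months of headroom")
```

If the answer is shorter than the budget cycle, "can we wait until next year?" has a documented answer.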

COMPLEXITY IS EVER INCREASING
You will need more automation to keep it running, not more people
• People are an ongoing cost; automation is a capitalizable investment
• With a bigger customer base, five-minute outages become damaging experiences. Machines can react faster.
• 4 nines (99.99%) is < 5 minutes of downtime per month. How quickly can you triage an alert?
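
The "4 nines" figure can be checked with a few lines of arithmetic. A minimal sketch, assuming a 30-day month; the function name is hypothetical:

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float = 30.0) -> float:
    """Minutes of downtime permitted by an availability target over a period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.2f} minutes per 30-day month")
```

At 99.99% the budget works out to roughly 4.3 minutes per 30-day month, which leaves little room for a human to be paged, log in, and triage.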

WHY YOU WANT A DEDICATED SRE TEAM
• Improve reaction time to incidents
‒ Focus can be spent on documenting tribal knowledge, minimizing mistakes and improving RTO (recovery time objective)
• Learn from mistakes, turn them into opportunities
‒ SRE teams can focus on blameless postmortems, extracting as much marrow as possible from incidents, then championing change
• Raise awareness of system behavior, weaknesses & strengths
‒ SRE can be an independent consulting agency or PR firm for dev teams
‒ SRE will create and/or publicize metrics to show facts
• Bandwidth dedicated to forward-looking system behaviors
‒ Without a dedicated team, this is usually done only as time permits (which is limited when companies grow fast)

SOUNDS GOOD! HOW IS IT DONE?
AUTOMATED MONITORING
‒ Monitor externally, the way your customers see you, AND internally, the way you see yourself
‒ There will be false alarms, so not everyone should see these alerts

LOG INDEXING AND AGGREGATION

• Build self-healing systems when we can
‒ Service health checks & automated recovery actions
‒ Desired state configuration
‒ Service Orchestration
• Document procedures/playbooks/runbooks when we can't
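
One way to picture the split between self-healing and runbooks is a small supervisor loop: retry automated recovery a bounded number of times, then hand off to a human. This is a sketch under stated assumptions, not any particular orchestrator's API; `Supervisor`, its callbacks, and the thresholds are all hypothetical:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Supervisor:
    check_health: Callable[[], bool]   # hypothetical probe; True means healthy
    restart: Callable[[], None]        # hypothetical automated recovery action
    escalate: Callable[[str], None]    # hypothetical hook that pages a human
    failures_before_restart: int = 3   # tolerate brief blips before acting
    max_restarts: int = 2              # stop self-healing after this many tries
    consecutive_failures: int = 0
    restarts: int = 0

    def tick(self) -> str:
        """Run one check cycle and return the action taken."""
        if self.check_health():
            self.consecutive_failures = 0
            self.restarts = 0
            return "healthy"
        self.consecutive_failures += 1
        if self.consecutive_failures < self.failures_before_restart:
            return "waiting"
        self.consecutive_failures = 0
        if self.restarts >= self.max_restarts:
            # Automation has run out of options; a person takes over.
            self.escalate("automated recovery exhausted; follow the runbook")
            return "escalated"
        self.restarts += 1
        self.restart()
        return "restarted"
```

Calling `tick()` on a schedule gives the machine the first attempts at recovery and reserves the documented playbook for the cases automation cannot fix.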

HEALTH CHECKS
• More than just a socket connection
‒ Does a typical request return a 200 OK?
‒ How many 2xx/3xx responses vs. 4xx/5xx?
‒ Can you connect to your downstream dependencies?
‒ How long have you been up?
• Provide rich info, but quickly
‒ Other endpoints can give more expensive info
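
The questions on this slide map onto a small health report that a service could return from a health endpoint. A minimal, framework-agnostic sketch; the function name, the JSON field names, and the choice of 503 for a degraded service are assumptions made for illustration:

```python
import time
from typing import Callable, Dict, Mapping, Optional, Tuple


def health_report(
    started_at: float,
    status_counts: Mapping[int, int],
    dependency_checks: Mapping[str, Callable[[], bool]],
    now: Optional[float] = None,
) -> Tuple[int, Dict[str, object]]:
    """Build a (status_code, body) pair answering the health-check questions."""
    now = time.time() if now is None else now
    ok = sum(n for code, n in status_counts.items() if 200 <= code < 400)
    errors = sum(n for code, n in status_counts.items() if 400 <= code < 600)

    dependencies: Dict[str, bool] = {}
    for name, check in dependency_checks.items():
        try:
            dependencies[name] = bool(check())  # checks should be cheap and time-boxed
        except Exception:
            dependencies[name] = False  # a check that raises counts as unreachable

    healthy = all(dependencies.values())
    body = {
        "status": "ok" if healthy else "degraded",
        "uptime_seconds": round(now - started_at, 1),
        "responses": {"2xx_3xx": ok, "4xx_5xx": errors},
        "dependencies": dependencies,
    }
    return (200 if healthy else 503), body
```

Keeping this report cheap to compute is the point of "rich info, but quickly"; anything slower belongs on a separate diagnostic endpoint.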

SERVICE LEVELS
• SLOs – Service Level Objectives
‒ Where you’d like to be
• SLAs – Service Level Agreements
‒ Where you tell your customers you’ll be
‒ Carry penalties if missed
‒ More lenient than your SLOs
• Error Budgets
‒ Based on your SLO, how much risk can you tolerate?
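
An error budget is the failure allowance implied by the SLO. A minimal sketch, assuming a request-based SLO (good requests divided by total requests); the function name and the figures in the usage example are hypothetical:

```python
import math
from typing import Dict


def error_budget(slo_pct: float, total_requests: int, failed_requests: int) -> Dict[str, float]:
    """Summarize how much of a request-based error budget has been spent."""
    if not 0 <= slo_pct <= 100:
        raise ValueError("SLO must be a percentage between 0 and 100")
    allowed_failures = total_requests * (1 - slo_pct / 100)
    spent = failed_requests / allowed_failures if allowed_failures > 0 else math.inf
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "budget_spent_fraction": spent,
    }


if __name__ == "__main__":
    # Hypothetical month: 99.9% SLO, 1,000,000 requests, 250 failures.
    print(error_budget(99.9, 1_000_000, 250))
```

While budget remains, the team can afford riskier changes; once it is spent, reliability work takes priority.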

SIGNAL VS NOISE
• Alert fatigue is real. Keep your alerts actionable.
• Rare errors can be the most interesting, but error velocity (how fast errors are arriving) is an indicator of trouble.
• Improve the signal-to-noise ratio to combat fatigue.
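
Alerting on error velocity rather than on every individual error is one way to keep alerts actionable. A minimal sketch of a sliding-window counter; the class name, window, and threshold are hypothetical, and timestamps are assumed to arrive in non-decreasing order:

```python
from collections import deque
from typing import Deque


class ErrorVelocityAlert:
    """Fire only when errors arrive faster than a threshold within a time window."""

    def __init__(self, window_seconds: float, threshold: int) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._timestamps: Deque[float] = deque()

    def record_error(self, timestamp: float) -> bool:
        """Record one error; return True if the window now holds >= threshold errors."""
        self._timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop errors that have aged out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold
```

A single stray error stays in the logs for later investigation, while a burst inside the window is what reaches a person.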

MY EXPERIENCES
• Team was formed from various departments
• Carried forward some SRE-related projects from dev
• Matured & documented processes
‒ Playbooks
‒ Dependencies, metrics, app catalog
• Sharing responsibility for prod incidents with operations and dev teams
• Finding ways to consult on app design & rollout
• We are first-responders, but the dev & ops teams are on call

5,124 HOURS
AKA CISCO FIELD NOTICE FN-64291

AUTO-IMMUNE DISORDER
AGGRESSIVE HEALTH CHECKING

STORIES FROM THE FIELD
WHAT’S THE STRANGEST PLACE YOU’VE WORKED A PRODUCTION INCIDENT?

THANK YOU!
• Twitter: jmloeffler
• G+: jmloeffler
• Github: jmloeffler