1. Reanimating DevOps
DevOps has been about putting software engineering know-how into operations.
Without the reverse, it is just continuously deployed carcasses.
Theo Schlossnagle - @postwait
Founder & CEO - @Circonus
2. DevOps
a way for
technology
organizations
to move faster
with less risk
Continuous Integration and Deployment
A brief rant on the impedance between those.
http://oncoscape.sttrcancer.org/
3. Carcass
may be a bit
harsh
• Shit breaks
• All the time
• Senior engineers know
– the Internet is held together with string and hope
• Because of this, CI/CD gives us tremendous power
– to rapidly replace broken software in production
– with other broken software
or perhaps not
4. DevOps’
pompous
assumption
DevOps Software Engineers:
• Operations people struggle because they don’t have
the tooling to automate their jobs. We can help them!
Operations People:
• We struggle because software engineers write
software that is a fucking tire fire of failure and when it
breaks in production we never know why. ¯_(ツ)_/¯
6. DTrace
• DTrace
– Instrumentation of single systems.
– Seamlessly crossed user-space/kernel-space divide.
– From user-space probes to hardware.
– Simple awk-like interface.
– Open source, free to consume.
– Did not require cooperation.
• eBPF
– Finally provides plumbing on Linux.
– Same breadth as DTrace.
– No simple consumption tools - yet.
Admit it.
You still want it.
7. Real Behavior
First a short story
About the revolution in web monitoring
That most forgot happened
8. Where is RUM for Systems?
9. Inconvenient
realities
• 2000: collecting data at human scale became possible.
• 2017:
– More humans and more devices per human
– Internet of Things is starting hypergrowth
– Systems “do a lot” for each “user-facing” interaction
(100,000x - 1,000,000x ; my speculation)
• Extrinsic growth: 1 EB → 870 EB [1], 7% → 40% [2]
• Intrinsic magnification: 100,000 – 1,000,000
• 10 million to 100 million (WAG)
• Moore’s law
– 2^((2017 − 2000)/1.5) ≅ 2500
[1] https://en.wikipedia.org/wiki/Internet_traffic
[2] https://en.wikipedia.org/wiki/Global_Internet_usage
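The arithmetic on this slide can be checked directly. The sketch below uses only the figures given above (1 EB to 870 EB, an 18-month doubling period between 2000 and 2017, and the 10 million to 100 million guess); the variable names are illustrative and not from the talk.

```python
# Figures are taken from the slide; names are illustrative.
traffic_growth = 870 / 1                      # 1 EB -> 870 EB
moore_growth = 2 ** ((2017 - 2000) / 1.5)     # one doubling every 18 months

# The slide's wild guess for how much the data to be collected has grown.
data_growth_low, data_growth_high = 10e6, 100e6

print(f"Traffic growth:     {traffic_growth:,.0f}x")
print(f"Moore's law factor: {moore_growth:,.0f}x")
print(
    f"Data outpaces compute by {data_growth_low / moore_growth:,.0f}x "
    f"to {data_growth_high / moore_growth:,.0f}x"
)
```

The Moore's law factor works out to roughly 2,580, which the slide rounds to 2500. Even at the low end of the guess, the data has grown thousands of times faster than the compute available to process it.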
10. Compromise
Surgical Tracing
• For some small bit of systems activity
• Complete dimensionality
• Useful for debugging and troubleshooting
Generalized Behavior
• For all systems activity
• Very limited dimensionality
• Robust understanding of behavior
(and changes thereof)
11. To where?
Surgical Tracing
• eBPF will likely meet OpenTracing
– (TBD user-tooling + zipkin + services)
– Maybe honeycomb.io
Generalized Behavior
• Better user-level system instrumentation.
• eBPF extraction of systems information
• More scalable/economical TSDBs
– Circonus
12. Never undervalue grace in failure.
Rule 𝛌1
Crash landings should be both fast and
controlled.
13. What it means to
fail quickly &
safely
• The scope of failure should collapse
completely.
• The time to failure should be
measured in small multiples of
normal service time
• Nothing outside the scope of failure
should be impacted.
https://www.youtube.com/watch?v=5SL1A2d2e7M
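One way to read "small multiples of normal service time" is as a deadline derived from how long a call usually takes. The sketch below is a minimal illustration under that assumption; `call_with_deadline`, `DEADLINE_MULTIPLE`, and `slow_dependency` are hypothetical names, and the multiple of 3 is an arbitrary choice.

```python
import concurrent.futures
import time

# Hypothetical knob: how many multiples of normal service time we tolerate.
DEADLINE_MULTIPLE = 3


def call_with_deadline(fn, normal_service_time_s, multiple=DEADLINE_MULTIPLE):
    """Run fn(), giving up after a small multiple of its normal service time."""
    deadline_s = normal_service_time_s * multiple
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=deadline_s)
    except concurrent.futures.TimeoutError:
        # Fail fast: the caller learns quickly, although the worker thread
        # cannot be killed and finishes in the background.
        raise TimeoutError(f"no answer within {deadline_s:.3f}s") from None
    finally:
        # Do not wait for a stuck worker; that would defeat the deadline.
        pool.shutdown(wait=False)


def slow_dependency():
    time.sleep(0.5)  # stands in for a hung backend
    return "late"
```

With a normal service time of 20 ms, `call_with_deadline(slow_dependency, 0.02)` raises after about 60 ms instead of blocking for the full half second. The failure is quick and confined to the one call.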
14. Autopsies: not just for medicine.
Rule 𝛌2
Post-mortems are fundamental.
15. Pragmatic analysis is required to
understand failure’s
true nature
• Post-mortem analysis is critical
• Stack traces
• Forensic logs
• Images (cores, dumps, etc.)
16. The difference between a shock and electrocution is real.
Rule 𝛌3
Use circuit breakers.
17. Circuit breakers are designed to
avoid
cascading failure
• it’s not all about you,
especially with microservices
• protect yourselves and others
• circuit breakers of many types
• timing
• queue depth
• concurrency
http://melissaomarkham.com
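Of the breaker types listed above, the concurrency breaker is the simplest to sketch: once too many calls are in flight, new ones are rejected immediately instead of queueing behind a struggling dependency. The class and exception names below are hypothetical, and this is one possible shape and not a reference implementation.

```python
import threading


class BreakerOpen(Exception):
    """Raised when the breaker sheds a call (hypothetical name)."""


class ConcurrencyBreaker:
    """Reject calls once too many are already in flight."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: shed load right away instead of waiting.
        if not self._slots.acquire(blocking=False):
            raise BreakerOpen("too many calls in flight")
        try:
            return fn(*args, **kwargs)
        finally:
            # Always free the slot, whether fn returned or raised.
            self._slots.release()
```

A caller wraps each outbound request in `breaker.call(...)` and treats `BreakerOpen` as a fast, local failure. That protects the caller from piling up blocked work and protects the downstream service from a flood it cannot absorb.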
18. You cannot understand what you cannot measure.
Rule 𝛌4
Behavior is complex.
Understand it.
19. Don’t measure to assess availability
measure to understand
Build robust models of behavior
Understand performance changes
Don’t use averages
Don’t use percentiles alone
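A small made-up data set shows why averages, and single percentiles on their own, mislead. The latencies and bucket bounds below are invented for illustration.

```python
import bisect
import statistics
from collections import Counter

# Made-up latencies in ms: 95 fast requests and 5 very slow ones.
latencies_ms = [10] * 95 + [1000] * 5

mean_ms = statistics.mean(latencies_ms)      # 59.5: matches no real request
median_ms = statistics.median(latencies_ms)  # 10: hides the slow tail entirely

# A histogram keeps the shape. Each key is a bucket's upper bound in ms.
upper_bounds_ms = [1, 10, 100, 1000, 10000]
histogram = Counter(
    upper_bounds_ms[bisect.bisect_left(upper_bounds_ms, v)] for v in latencies_ms
)

print(mean_ms, median_ms, dict(histogram))
```

No request took 59.5 ms, and the median says nothing about the 5% of requests that took a full second. The histogram shows both modes at once, which is the starting point for a model of behavior.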
21. It’s easy to demand perfection; it’s also stupid.
Rule 𝛌5
Have a failure budget.
22. Avoiding failure is simply impossible;
expect and
manage failure
• use failure budgets
• set expectations reasonably
• define and reward successes on
improvement and competency,
not just uptime.
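A failure budget turns an availability target into a concrete allowance. The sketch below assumes a budget measured in minutes of downtime; the 99.9% target and 30-day window in the usage lines are example values and not figures from the talk.

```python
def failure_budget_minutes(slo, window_days):
    """Minutes of allowed unavailability for an availability target."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo, window_days, downtime_minutes):
    """Fraction of the budget still unspent (negative means overspent)."""
    budget = failure_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


# Example values only: a 99.9% target over a 30-day window.
budget = failure_budget_minutes(0.999, 30)
print(f"Budget: {budget:.1f} minutes")
print(f"Remaining after 21.6 minutes down: {budget_remaining(0.999, 30, 21.6):.0%}")
```

A 99.9% target over 30 days allows about 43.2 minutes of failure. Tracking how much of that is spent gives a team a reasonable expectation to manage against, in place of a demand for zero downtime.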
23. Justice should be blind; operations should not.
Rule 𝛌6
Instrumentation & Observability have no
equals.
24. For every “I wonder what X is right now?”
in production,
you must have
answers
DTrace
eBPF
Instrument code for observability
https://www.pinterest.com/pin/441775044670412234/
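Instrumenting code for observability can start very small. The sketch below records how long a named operation takes so that "what is X right now?" has an answer; `observed`, `timings_s`, and `handle_request` are hypothetical names, and a real system would ship these measurements to a metrics backend and not keep them in a dictionary.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# In-process store of raw timings per operation name (illustration only).
timings_s = defaultdict(list)


@contextmanager
def observed(operation):
    """Record how long the wrapped block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings_s[operation].append(time.perf_counter() - start)


def handle_request():
    with observed("db.query"):
        time.sleep(0.01)  # stands in for real work
```

Because the measurement sits in a `finally` clause, failures are timed as well as successes, and the failing calls are often the ones worth looking at.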
25. Thanks
Software Engineers can deliver us observability…
if they choose to.