This presentation is for Go developers and operators of Go applications who are interested in reducing costs and latency, or in debugging problems in such applications, such as memory leaks, infinite loops, and performance regressions. We'll start with a brief description of the unique aspects of the Go runtime, and then take a look at the built-in profilers as well as Go's execution tracer. Additionally, we'll look at interoperability with popular observability tools such as Linux perf and bpftrace. After this presentation you should have a good idea of the various tools available, and which ones might be most useful to you in a production environment.
Continuous Go Profiling & Observability
1. Brought to you by
Continuous Go Profiling &
Observability
Felix Geisendörfer
Staff Engineer at
2. ■ Go developers and operators of Go applications
■ Interested in reducing costs and latency, or debugging problems such as
memory leaks, infinite loops and performance regressions
■ Focus is on Go’s built-in tools, but we’ll also cover Linux perf and eBPF
Target Audience
3. Felix Geisendörfer
Staff Engineer at Datadog
■ Working on continuous Go profiling as a product
■ Previous 6.5 years working for Apple (Factory Traceability)
■ Open Source Contributor (node.js, Go): github.com/felixge
5. What is profiling?
■ Anything that produces a weighted list of stack traces
■ Example: CPU Profiler that interrupts process every 10ms of CPU time,
captures a stack trace and aggregates their counts
stack trace count
main;foo 5
main;foo;bar 4
main;foobar 4
6. What is Continuous Profiling?
■ Profiling in production
■ Continuously upload profiles to a backend for later analysis
7. Why profile in production?
■ Data distributions have a big impact on performance
■ Production profiles can help mitigate and root cause incidents
■ Profiling is usually low overhead (1-10%)
8. About Go
■ Compiled language like C/C++/Rust
■ Should work well with industry standard observability tools … right?
10. Goroutines
■ Green threads scheduled onto OS thread by Go runtime
■ Tightly integrated with Go’s network stack (epoll on Linux)
■ Tiny 2 KiB stacks that grow dynamically
■ Fast context switching (~170ns), 10x faster than Linux threads
see https://dtdg.co/3n6kBoC
■ Data sharing via mutexes and channels (CSP)
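The CSP model above can be sketched in a few lines; a minimal example (the `answer` helper is a made-up name for illustration):

```go
package main

import "fmt"

// answer spawns a goroutine (initial stack ~2 KiB) that shares
// data with its parent via a channel instead of shared memory.
func answer() int {
	ch := make(chan int)
	go func() {
		ch <- 42 // send blocks until the receiver is ready
	}()
	return <-ch
}

func main() {
	fmt.Println(answer()) // → 42
}
```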
11. The trouble with goroutines
uprobe:./example:main.Foo {
    @start[tid] = nsecs;
}
uretprobe:./example:main.Foo {
    @msecs = hist((nsecs - @start[tid]) / 1000000);
    delete(@start[tid]);
}
END {
    clear(@start);
}
15. ■ Does not follow System V AMD64 ABI 🙈
■ Arguments are passed on the stack rather than using registers (slowish)
■ Go 1.17 switched to a register calling convention, but still idiosyncratic (to
support goroutine scalability, multiple return arguments, etc.)
■ ABI0 remains in use to support legacy assembly code
Go’s Calling Convention
See Proposal: Register-based Go calling convention: https://dtdg.co/2VIPOSV
16. ■ Requires a separate stack for C call frames, which need to be static
■ High complexity and some overhead (~60ns) to switch between stacks
see https://dtdg.co/2X1HvTq
Calling C Code
17. ■ Go pushes frame pointers onto the stack and has no -fomit-frame-pointer equivalent
■ Go also generates DWARF unwind/symbol tables by default
■ Leads to good interoperability with tools such as Linux perf
■ Go runtime uses idiosyncratic gopclntab unwinding and symbol tables
(DWARF is strippable and $@!%^# Turing complete, so this is good)
Less odd: Stack Traces
18. Duck Test: Go is an odd duck
Pay attention when using 3rd party tools in production
Ashley Willis (CC BY-NC-SA 4.0)
19. ■ Quirky runtime, Pedestrian language, limited type system, but ...
■ What Go lacks as language, it makes up for in tooling
■ Built-in documentation, testing, benchmarking, code formatting, tracing,
profiling and more!
So why bother with Go?
20. ■ Five different profilers: CPU, Heap, Mutex, Block, Goroutine
go test -cpuprofile cpu.prof -memprofile mem.prof -bench .
■ pprof visualization and analysis tool
go tool pprof -http=:6060 cpu.prof
Built-in observability tools
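For long-running services, the standard library also exposes these profilers over HTTP via net/http/pprof. A minimal sketch (the in-process httptest self-check and the port in the comment are illustrative choices, not from the slides):

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

// pprofIndexStatus hits the registered pprof index handler in-process.
func pprofIndexStatus() int {
	req := httptest.NewRequest("GET", "/debug/pprof/", nil)
	rec := httptest.NewRecorder()
	http.DefaultServeMux.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	// In production you would serve it instead, e.g.:
	//   go http.ListenAndServe("localhost:6060", nil)
	fmt.Println(pprofIndexStatus()) // → 200
}
```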
25. ■ Annotate goroutines with arbitrary key/value pairs
■ Understand CPU consumption of individual requests, users, endpoints, etc.
CPU Profiler: Labels
labels := pprof.Labels("user_id", "123")
pprof.Do(ctx, labels, func(ctx context.Context) {
    // handle request
    go update(ctx) // child goroutine inherits labels
})
26. ■ Uses setitimer(2) to receive SIGPROF signal for every 10ms of CPU time
■ Signal handler takes stack traces and aggregates them into a profile
■ setitimer(2) has thread delivery bias and can’t keep up when utilizing more
than 2.5 cores 🙄
■ Rhys Hiltner (Twitch) and I are working on an upstream patch to use
timer_create(2)
See: runtime/pprof: Linux CPU profiles inaccurate beyond 250% CPU use #35057: https://dtdg.co/3CAeApm
CPU Profiler: Implementation Details
27. ■ Samples mutex wait (both) and channel wait (block profiler) events
■ Why the overlap?
● Block captures Lock(), i.e. the blocked mutexes
● Mutex captures Unlock(), i.e. the mutexes doing the blocking
■ Block profile used to be biased. Fix contributed for Go 1.17.
see https://go-review.googlesource.com/c/go/+/299991
Mutex & Block Profiler
29. Allocation & Heap Profiler
func malloc(size):
    object = ... // alloc magic
    if poisson_sample(size):
        s = stacktrace()
        profile[s].allocs++
        profile[s].alloc_bytes += sizeof(object)
        track_profiled(object, s)
    return object

func sweep(object):
    // do gc stuff to free object
    if is_profiled(object):
        s = alloc_stacktrace(object)
        profile[s].frees++
        profile[s].free_bytes += sizeof(object)
30. ■ Allocations per stack trace
■ Memory remaining inuse on the heap (allocs-frees)
■ Can identify the source of memory leaks, but not the refs retaining things
Allocation & Heap Profiler
31. ■ Can sometimes guide CPU optimizations better than CPU profiler
Allocation & Heap Profiler
made using tweetpik.com
32. ■ Second-Order Effects: Reducing allocs can make unrelated code faster (!)
■ 💡 Reduce allocations and number of pointers on the heap
Allocation & Heap Profiler
made using tweetpik.com
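The effect of reducing allocations can be measured directly with testing.AllocsPerRun. A small sketch (concatNaive/concatBuilder are made-up names) comparing naive string concatenation against a preallocated strings.Builder:

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

func concatNaive(parts []string) string {
	s := ""
	for _, p := range parts {
		s += p // each += allocates a new string
	}
	return s
}

func concatBuilder(parts []string) string {
	var b strings.Builder
	n := 0
	for _, p := range parts {
		n += len(p)
	}
	b.Grow(n) // one allocation up front
	for _, p := range parts {
		b.WriteString(p)
	}
	return b.String()
}

func main() {
	parts := []string{"hello", "world", "foo", "bar", "baz", "qux"}
	naive := testing.AllocsPerRun(100, func() { concatNaive(parts) })
	built := testing.AllocsPerRun(100, func() { concatBuilder(parts) })
	fmt.Println(built < naive) // the builder version allocates less
}
```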
33. ■ Briefly stops all goroutines and captures their stack traces (⚠ Latency)
■ Useful for debugging goroutine leaks
■ Text output format also includes waiting times for debugging “stuck
programs” (block/mutex don’t show this until the blocking event has finished)
■ fgprof captures goroutine profiles at 100 Hz -> Wallclock Profile
https://github.com/felixge/fgprof
Goroutine Profiler
35. ■ Frame pointers & DWARF tables lead to good interoperability
■ perf offers better accuracy (but accuracy of builtin profilers is decent enough)
■ Deals with dual Go and C stacks (no need for runtime.SetCgoTraceback())
■ Downsides: Linux only, Security, Permissions, Lack of Profiler Labels
■ Example: perf record -F 99 -g ./myapp && perf report
Linux perf
36. ■ Example: bpftrace -e 'profile:hz:99 { @[ustack()] = count(); }' -c ./myapp
■ Should require less context switching, stacks aggregated in kernel
■ Otherwise similar caveats as Linux perf
eBPF (bpftrace)
38. ■ Go is a bit odd for a compiled language, but ...
■ Wide variety of profiling and observability tools can be used
■ Most should be safe for production (⚠ goroutine profiler, execution tracer,
uretprobes)
■ Continuous Profiling makes sure you always have the data at your fingertips
Recap