Real-time in the real world: DIRT in production
1. Real-time in the real world: DIRT in production
Bryan Cantrill, SVP Engineering (bryan@joyent.com, @bcantrill)
Brendan Gregg, Lead Performance Engineer (brendan@joyent.com, @brendangregg)
2. Previously, on #surgecon...
• Two years ago at Surge, we described the emergence
of real-time data semantics in web-facing applications
• We dubbed this data-intensive real-time (DIRT)
• Last year at Surge 2011, we presented experiences
building a DIRTy system of our own — a facility for real-
time analytics of latency in the cloud
• While this system is interesting, it is somewhat synthetic
in nature in that it does not need to scale (much) with
respect to users...
3. #surgecon 2012
• Accelerated by the rise of mobile applications, DIRTy
systems are becoming increasingly common
• In the past year, we’ve seen apps in production at scale
• There are many examples of this, but for us, a paragon
of the emerging DIRTy apps has been Voxer
• Voxer is a push-to-talk mobile app that can be thought of
as the confluence of voice mail and SMS
• A canonical DIRTy app: latency and scale both matter!
• Our experiences debugging latency bubbles with Voxer
over the past year have taught us quite a bit about the
new challenges that DIRTy apps pose...
4. The challenge of DIRTy apps
• DIRTy applications tend to have the human in the loop
• Good news: deadlines are soft — microseconds only
matter when they add up to tens of milliseconds
• Bad news: because humans are in the loop, demand
for the system can be non-linear
• One must deal not only with the traditional challenge of
scalability, but also the challenge of a real-time system
• Worse, emerging DIRTy apps have mobile devices at
their edge — network transience makes clients seem ill-
behaved with respect to connection state!
5. The lessons of DIRTy apps
• Many latency bubbles originate deep in the stack; OS
understanding and instrumentation have been essential
even when the OS is not at fault
• For up-stack problems, tooling has been essential
• Latency outliers can come from many sources:
application restarts, dropped connections, slow disks,
boundless memory growth
• We have also seen some traditional real-time problems
with respect to CPU scheduling, e.g. priority inversions
• Enough foreplay; on with the DIRTy disaster pr0n!
6. Application restarts
• Modern internet-facing architectures are designed to be
resilient with respect to many failure modes…
• ...but application restarts can induce pathological,
cascading latency bubbles, as clients reconnect,
clusters reconverge, etc.
• For example, Voxer ran into a node.js bug where it
would terminate on ECONNABORTED from accept(2)
• Classic difference in OS semantics: BSD and illumos
variants (including SmartOS) do this; Linux doesn’t
• Much more likely over a transient network!
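• To confirm from the outside whether a listener is actually hitting this, a minimal DTrace sketch (not from the deck) counts failed accept(2) calls by process and errno; on illumos/SmartOS, ECONNABORTED is errno 130:
/*
 * Minimal sketch: count accept(2) failures by process and errno, to see
 * whether a listener is receiving ECONNABORTED (errno 130 on illumos).
 */
syscall::accept:return
/(int)arg0 == -1/
{
	@fails[execname, pid, errno] = count();
}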
7. Dropped connections
• If an application can’t keep up, its TCP listen backlog fills and
packets (SYNs) are dropped:
$ netstat -s | grep Drop
tcpTimRetransDrop = 56 tcpTimKeepalive = 2582
tcpTimKeepaliveProbe= 1594 tcpTimKeepaliveDrop = 41
tcpListenDrop =3089298 tcpListenDropQ0 = 0
tcpHalfOpenDrop = 0 tcpOutSackRetrans =1400832
icmpOutDrops = 0 icmpOutErrors = 0
sctpTimRetrans = 0 sctpTimRetransDrop = 0
sctpTimHearBeatProbe= 0 sctpTimHearBeatDrop = 0
sctpListenDrop = 0 sctpInClosed = 0
• Client waits, then retransmits (after 1 or 3 seconds),
inducing tremendous latency outliers; terrible for DIRTy
apps!
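• The netstat counters above are cumulative; to watch drops as they happen, a minimal sketch (not from the deck) uses the same mib probe as the scripts that follow:
# dtrace -n 'mib:::tcpListenDrop { @drops = count(); }
    tick-1s { printa("tcpListenDrop: %@d in the last second\n", @drops); clear(@drops); }'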
8. Dropped connections, cont.
• The fix for dropped connections:
• If due to a surge, increase the TCP backlog
• If due to sustained load, increase CPU resources, decrease CPU
consumption, or scale the app
• If fixed by increasing the TCP backlog, check that the system backlog
tunable actually took effect! (a sketch for checking this follows the list)
• If not, does the app need to be restarted?
• If not, is the application setting its own backlog value that takes
precedence?
• How close are we to dropping?
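• A minimal sketch for that check, reusing the same illumos tcp_t members as the script on the following slides: report the effective backlog maximum each listener is using when a SYN arrives, keyed by the PID that created the listener:
fbt::tcp_input_listener:entry
{
	this->connp = (conn_t *)arg0;
	this->tcp = (tcp_t *)this->connp->conn_proto_priv.cp_tcp;
	/* effective backlog maximum, keyed by the listener's creating PID */
	@max_backlog[this->connp->conn_cpid] = max(this->tcp->tcp_conn_req_max);
}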
10. Dropped connections, cont.
The kernel code (usr/src/uts/common/inet/tcp/tcp_input.c):
/*
* THIS FUNCTION IS DIRECTLY CALLED BY IP VIA SQUEUE FOR SYN.
* tcp_input_data will not see any packets for listeners since the listener
* has conn_recv set to tcp_input_listener.
*/
/* ARGSUSED */
static void
tcp_input_listener(void *arg, mblk_t *mp, void *arg2, ip_recv_attr_t *ira)
{
[...]
	if (listener->tcp_conn_req_cnt_q >= listener->tcp_conn_req_max) {
		mutex_exit(&listener->tcp_eager_lock);
		TCP_STAT(tcps, tcp_listendrop);
		TCPS_BUMP_MIB(tcps, tcpListenDrop);
		if (lconnp->conn_debug) {
			(void) strlog(TCP_MOD_ID, 0, 1, SL_TRACE|SL_ERROR,
			    "tcp_input_listener: listen backlog (max=%d) "
			    "overflow (%d pending) on %s",
			    listener->tcp_conn_req_max,
			    listener->tcp_conn_req_cnt_q,
			    tcp_display(listener, NULL, DISP_PORT_ONLY));
		}
		goto error2;
	}
[...]
11. Dropped connections, cont.
SEE ALL THE THINGS!
tcp_conn_req_cnt_q distributions:
cpid:3063 max_q:8
value ------------- Distribution ------------- count
-1 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
1 | 0
cpid:11504 max_q:128
value ------------- Distribution ------------- count
-1 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 7279
1 |@@ 405
2 |@ 255
4 |@ 138
8 | 81
16 | 83
32 | 62
64 | 67
128 | 34
256 | 0
tcpListenDrops:
cpid:11504 max_q:128 34
12. Dropped connections, cont.
• Uses DTrace to get a distribution of the TCP backlog queue length on
each SYN, keyed per-process; max_q is the listener’s backlog maximum:
fbt::tcp_input_listener:entry
{
	this->connp = (conn_t *)arg0;
	this->tcp = (tcp_t *)this->connp->conn_proto_priv.cp_tcp;
	self->max = strjoin("max_q:", lltostr(this->tcp->tcp_conn_req_max));
	self->pid = strjoin("cpid:", lltostr(this->connp->conn_cpid));
	@[self->pid, self->max] = quantize(this->tcp->tcp_conn_req_cnt_q);
}

mib:::tcpListenDrop
{
	this->max = self->max != NULL ? self->max : "<null>";
	this->pid = self->pid != NULL ? self->pid : "<null>";
	@drops[this->pid, this->max] = count();
	printf("%Y %s:%s %s\n", walltimestamp, probefunc, probename, this->pid);
}
• Script is on http://github.com/brendangregg/dtrace-cloud-tools as
net/tcpconnreqmaxq-pid*.d
13. Dropped connections, cont.
Or, snoop each drop:
# ./tcplistendrop.d
TIME SRC-IP PORT DST-IP PORT
2012 Jan 19 01:22:49 10.17.210.103 25691 -> 192.192.240.212 80
2012 Jan 19 01:22:49 10.17.210.108 18423 -> 192.192.240.212 80
2012 Jan 19 01:22:49 10.17.210.116 38883 -> 192.192.240.212 80
2012 Jan 19 01:22:49 10.17.210.117 10739 -> 192.192.240.212 80
2012 Jan 19 01:22:49 10.17.210.112 27988 -> 192.192.240.212 80
2012 Jan 19 01:22:49 10.17.210.106 28824 -> 192.192.240.212 80
2012 Jan 19 01:22:49 10.12.143.16 65070 -> 192.192.240.212 80
2012 Jan 19 01:22:49 10.17.210.100 56392 -> 192.192.240.212 80
2012 Jan 19 01:22:49 10.17.210.99 24628 -> 192.192.240.212 80
2012 Jan 19 01:22:49 10.17.210.98 11686 -> 192.192.240.212 80
2012 Jan 19 01:22:49 10.17.210.101 34629 -> 192.192.240.212 80
[...]
14. Dropped connections, cont.
• That code parsed IP and TCP headers from the in-kernel packet
buffer:
fbt::tcp_input_listener:entry  { self->mp = args[1]; }
fbt::tcp_input_listener:return { self->mp = 0; }

mib:::tcpListenDrop
/self->mp/
{
	this->iph = (ipha_t *)self->mp->b_rptr;
	/* assumes a 20-byte IP header (no IP options) */
	this->tcph = (tcph_t *)(self->mp->b_rptr + 20);
	printf("%-20Y %-18s %-5d -> %-18s %-5d\n", walltimestamp,
	    inet_ntoa(&this->iph->ipha_src),
	    ntohs(*(uint16_t *)this->tcph->th_lport),
	    inet_ntoa(&this->iph->ipha_dst),
	    ntohs(*(uint16_t *)this->tcph->th_fport));
}
• Script is tcplistendrop*.d, also on github
15. Dropped connections, cont.
• To summarize, dropped connections induce acute
latency bubbles
• With Voxer, we found that failures often cascade: high CPU utilization
from unrelated issues induces TCP listen drops
• Tunables don’t always take effect: need confirmation
• Having a quick tool to check scalability issues (DTrace)
has been invaluable
16. Slow disks
• Slow I/O in a cloud computing environment can be
caused by multi-tenancy — which is to say, neighbors:
• Neighbor running a backup
• Neighbor running a benchmark
• Neighbors can’t be seen by tenants...
• ...but is it really a neighbor?
24. Slow disks, cont.
• Correlating latency across the layers narrows down where it originates
• Or you can associate the layers in the same D script
• As text output, filtering on slow I/O works fine (a sketch follows this list)
• For high-frequency I/O, use heat maps
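• A minimal sketch of the text approach, assuming the DTrace io provider (not one of the scripts from this deck): print any block I/O slower than an arbitrary 100 ms threshold, for correlating by time with latency seen up-stack:
#!/usr/sbin/dtrace -s
#pragma D option quiet

io:::start
{
	/* timestamp each I/O by device and block address */
	iostart[args[0]->b_edev, args[0]->b_blkno] = timestamp;
}

io:::done
/iostart[args[0]->b_edev, args[0]->b_blkno] &&
    (timestamp - iostart[args[0]->b_edev, args[0]->b_blkno]) > 100000000/
{
	/* print only I/O slower than 100 ms */
	printf("%Y %10s %6d KB %4d ms\n", walltimestamp,
	    args[1]->dev_statname, (int)(args[0]->b_bcount / 1024),
	    (int)((timestamp - iostart[args[0]->b_edev, args[0]->b_blkno]) /
	    1000000));
}

io:::done
{
	iostart[args[0]->b_edev, args[0]->b_blkno] = 0;
}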
30. Slow disks, cont.
• On Joyent’s IaaS architecture, it’s usually not the disks
or filesystem; useful to rule that out quickly
• Some of the time it is, due to bad disks (1000+ms I/O);
heat map or iosnoop correlation matches
• Some of the time it’s due to big I/O (how quick is a 40
Mbyte read from cache?); see the size one-liner after this list
• Some of the time it is other tenants (benchmarking!);
much less for us now with ZFS I/O throttling
• With ZFS and an SSD-based intent log, HW RAID is not
just unobservable, but entirely unnecessary — adios
PERC!
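• To check whether outliers line up with big I/O, a minimal sketch (a standard io-provider one-liner, not from the deck) shows the I/O size distribution by process name; note that for asynchronous writes the on-CPU process may be sched rather than the tenant:
# dtrace -n 'io:::start { @[execname] = quantize(args[0]->b_bcount); }'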
31. Memory growth
• Riak had endless memory growth
• Expected 9 GB; after two days:
$ prstat -c 1
Please wait...
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
21722 103 43G 40G cpu0 59 0 72:23:41 2.6% beam.smp/594
15770 root 7760K 540K sleep 57 0 23:28:57 0.9% zoneadmd/5
95 root 0K 0K sleep 99 -20 7:37:47 0.2% zpool-zones/166
12827 root 128M 73M sleep 100 - 0:49:36 0.1% node/5
10319 bgregg 10M 6788K sleep 59 0 0:00:00 0.0% sshd/1
10402 root 22M 288K sleep 59 0 0:18:45 0.0% dtrace/1
[...]
• Eventually hits paging and terrible performance, needing a restart
• Remember, application restarts are a latency disaster!
33. Memory growth, cont.
• Is this a memory leak?
• In the app logic: Voxer?
• In the DB logic: Riak?
• In the DB’s Erlang VM?
• In the OS libraries (libc, lib*)?
• In the OS kernel?
• Or is it application growth?
• Where would you guess?
34. Memory growth, cont.
• Voxer (App): don’t think it’s us
• Basho (Riak): don’t think it’s us
• Joyent (OS): don’t think it’s us
• This sort of issue is usually app growth...
• ...but we can check libs & kernel to be sure
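• A quick first check of the native layers: is the heap growing via brk(), and which user-level code paths drive it? A minimal sketch (hypothetical one-liner, not the exact tooling used; PID is the beam.smp process ID):
# dtrace -n 'syscall::brk:entry /pid == $target/ { @[ustack()] = count(); }' -p PID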
35. Memory growth, cont.
• libumem was in use for allocations
• fast, scalable, object-caching, with multi-threaded support
• user-land version of kmem (slab allocator, Bonwick)
36. Memory growth, cont.
• Fix by experimentation (backend=mmap, other
allocators) wasn’t working.
• Detailed observability can be enabled in libumem,
allowing heap profiling and leak detection
• While designed with speed and production use in
mind, it still comes with some cost (time and space),
and isn’t on by default: restart required.
• UMEM_DEBUG=audit
41. Memory growth, cont.
• More DTrace showed the size of the malloc()s causing the brk()s:
# dtrace -x dynvarsize=4m -n '
pid$target::malloc:entry { self->size = arg0; }
syscall::brk:entry /self->size/ { printf("%d bytes", self->size); }
pid$target::malloc:return { self->size = 0; }' -p 17472
dtrace: description 'pid$target::malloc:entry ' matched 7 probes
CPU ID FUNCTION:NAME
0 44 brk:entry 8343520 bytes
0 44 brk:entry 8343520 bytes
[...]
• These 8 Mbyte malloc()s grew the heap
• Even though the heap has Gbytes not in use
• This is starting to look like an OS issue
42. Memory growth, cont.
• More tools were created:
• Show memory entropy (+ malloc, - free) along with heap growth,
over time (a sketch of this idea follows the list)
• Show the code path taken for allocations;
compare successful with unsuccessful (heap growth)
• Show allocator internals: sizes, options, flags
• And run them in the production environment
• Briefly: tracing frequent allocations does cost overhead
• Casting light into what was a black box
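• A minimal sketch of just the entropy idea (hypothetical; the production tools were more involved), totalling malloc()ed and free()d bytes alongside the brk() high-water mark, printed each second; run briefly with -p against the target process, since tracing every allocation costs overhead:
#!/usr/sbin/dtrace -s
#pragma D option quiet
#pragma D option dynvarsize=16m

pid$target::malloc:entry
{
	self->size = arg0;
}

pid$target::malloc:return
/self->size/
{
	@malloced = sum(self->size);
	allocsize[arg1] = self->size;	/* remember size by returned pointer */
	self->size = 0;
}

pid$target::free:entry
/allocsize[arg0]/
{
	@freed = sum(allocsize[arg0]);
	allocsize[arg0] = 0;
}

syscall::brk:entry
/pid == $target/
{
	@brkmax = max(arg0);		/* highest address requested of brk() */
}

tick-1s
{
	printa("malloc %@d bytes, free %@d bytes, brk high-water 0x%@x\n",
	    @malloced, @freed, @brkmax);
}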
44. Memory growth, cont.
• These new tools and metrics pointed to the allocation
algorithm “instant fit”
• This had been hypothesized earlier; the tools provided
solid evidence that this really was the case here
• A new version of libumem was built to force use of
VM_BESTFIT
• ...and added by Robert Mustacchi as a tunable:
UMEM_OPTIONS=allocator=best
• Riak was restarted with new libumem version, solving
the problem
45. Memory growth, cont.
• Not the first issue with the system memory allocator;
depending on configuration, Riak may use libc’s
malloc(), which isn’t designed to be scalable
• The man page does say it isn’t multi-thread scalable
• libumem was the answer (with the fix)
46. Memory growth, cont.
• The fragmentation problem was interesting because it
was unusual; it is not the most common source of
memory growth!
• DIRTy systems are often event-oriented…
• ...in event-oriented systems, memory growth can be a
consequence of either surging or drowning
• In an interpreted environment, memory growth can also
come from memory that is semantically leaked
• Voxer — like many emerging DIRTy apps — has a
substantial node.js component; how to debug node.js
memory growth?
47. Memory growth, cont.
• We have developed a postmortem technique for making
sense of a node.js heap:
OBJECT #OBJECTS #PROPS CONSTRUCTOR: PROP
fe806139 1 1 Object: Queue
fc424131 1 1 Object: Credentials
fc424091 1 1 Object: version
fc4e3281 1 1 Object: message
fc404f6d 1 1 Object: uncaughtException
...
fafcb229 1007 23 ClientRequest: outputEncodings, _headerSent, ...
fafc5e75 1034 5 Timing: req_start, res_end, res_bytes, req_end, ...
fafcbecd 1037 3 Object: aborted, data, end
8045475 1060 1 Object:
fb0cee9d 1220 9 HTTPParser: socket, incoming, onHeadersComplete, ...
fafc58d5 1271 25 Socket: _connectQueue, bytesRead, _httpMessage, ...
fafc4335 1311 16 ServerResponse: outputEncodings, statusCode, ...
• Used by @izs to debug a nasty node.js leak
• Search for “findjsobjects” (one word) for details
54. CPU scheduling, cont.
• Findings:
• FSS scheduler class bug:
• FSS uses a more complex technique to avoid CPU starvation; a thread’s priority
could stay high, keeping it on-CPU for many seconds, before the priority decayed
enough to allow another thread to run.
• Analyzed (more DTrace) and fixed (thanks Jerry Jelinek)
• DTrace analysis of the scheduler was invaluable
• Under (too) high CPU load, your runtime can be bound by how
well you schedule, not by how fast you do work
• Not the only scheduler issue we’ve encountered
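• For reference, dispatcher-queue latency can be quantified with the sched provider; a minimal sketch (the classic run-queue latency pattern, not the exact analysis scripts used here):
sched:::enqueue
{
	/* timestamp each thread as it is placed on a run queue */
	start[args[0]->pr_lwpid, args[1]->pr_pid] = timestamp;
}

sched:::dequeue
/start[args[0]->pr_lwpid, args[1]->pr_pid]/
{
	/* quantize how long the thread waited before leaving the queue */
	@queued_ns[args[1]->pr_fname] = quantize(timestamp -
	    start[args[0]->pr_lwpid, args[1]->pr_pid]);
	start[args[0]->pr_lwpid, args[1]->pr_pid] = 0;
}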
55. CPU scheduling, cont.
• CPU caps to throttle tenants in our cloud
• Experiment: add hot-CPU threads (saturation):
56. CPU scheduling, cont.
• CPU caps to throttle tenants in our cloud
• Experiment: add hot-CPU threads:
:-(
57. Visualizing CPU latency
• Using a node.js ustack helper and the DTrace profile
provider, we can determine the relative frequency of
stack backtraces in terms of CPU consumption
• Stacks can be visualized with flame graphs, a stack
visualization we developed:
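• The samples behind such a flame graph come from a profiling one-liner like the following minimal sketch (the 97 Hz rate and jstack() sizes are illustrative); the aggregated stacks are then fed to the flame graph tools:
# dtrace -n 'profile-97 /execname == "node" && arg1/ {
    @[jstack(100, 8000)] = count(); } tick-60s { exit(0); }'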
58. DIRT in production
• node.js is particularly well suited to the DIRTy apps that
typify the real-time web
• The ability to understand latency must be considered
when deploying node.js-based systems into production!
• Understanding latency requires dynamic instrumentation
and novel visualization
• At Joyent, we have added DTrace-based dynamic
instrumentation for node.js to SmartOS, and novel
visualization into our cloud and software offerings
• Better production support — better observability, better
debuggability — remains an important area of node.js
development!
59. Beyond node.js
• node.js is adept at connecting components in the
system; it is unlikely to be the only component!
• As such, when using node.js to develop a DIRTy app,
you can expect to spend as much time (if not more!)
understanding the other components as understanding the app itself
• When selecting components — operating system, in-
memory data store, database, distributed data store —
observability must be a primary consideration!
• When building a team, look for full-stack engineers —
DIRTy apps pose a full-stack challenge!
60. Thank you!
• @mranney for being an excellent guinea pig customer
• @dapsays for the V8 DTrace ustack helper and V8
debugging support
• More information: http://dtrace.org/blogs/brendan,
http://dtrace.org/blogs/dap, and http://smartos.org