Real-time in the real world: DIRT in production
1. Real-time in the real world: DIRT in production
Bryan Cantrill, SVP Engineering (bryan@joyent.com, @bcantrill)
Brendan Gregg, Lead Performance Engineer (brendan@joyent.com, @brendangregg)
2. Previously, on #surgecon...
• Two years ago at Surge, we described the emergence
of real-time data semantics in web-facing applications
• We dubbed this data-intensive real-time (DIRT)
• Last year at Surge 2011, we presented experiences
building a DIRTy system of our own — a facility for real-
time analytics of latency in the cloud
• While this system is interesting, it is somewhat synthetic
in nature in that it does not need to scale (much) with
respect to users...
3. #surgecon 2012
• Accelerated by the rise of mobile applications, DIRTy
systems are becoming increasingly common
• In the past year, we’ve seen apps in production at scale
• There are many examples of this, but for us, a paragon
of the emerging DIRTy apps has been Voxer
• Voxer is a push-to-talk mobile app that can be thought of
as the confluence of voice mail and SMS
• A canonical DIRTy app: latency and scale both matter!
• Our experiences debugging latency bubbles with Voxer
over the past year have taught us quite a bit about the
new challenges that DIRTy apps pose...
4. The challenge of DIRTy apps
• DIRTy applications tend to have the human in the loop
• Good news: deadlines are soft — microseconds only
matter when they add up to tens of milliseconds
• Bad news: because humans are in the loop, demand
for the system can be non-linear
• One must deal not only with the traditional challenge of
scalability, but also the challenge of a real-time system
• Worse, emerging DIRTy apps have mobile devices at
their edge — network transience makes clients seem ill-
behaved with respect to connection state!
5. The lessons of DIRTy apps
• Many latency bubbles originate deep in the stack; OS
understanding and instrumentation have been essential
even when the OS is not at fault
• For up-stack problems, tooling has been essential
• Latency outliers can come from many sources:
application restarts, dropped connections, slow disks,
boundless memory growth
• We have also seen some traditional real-time problems
with respect to CPU scheduling, e.g. priority inversions
• Enough foreplay; on with the DIRTy disaster pr0n!
6. Application restarts
• Modern internet-facing architectures are designed to be
resilient with respect to many failure modes…
• ...but application restarts can induce pathological,
cascading latency bubbles, as clients reconnect,
clusters reconverge, etc.
• For example, Voxer ran into a node.js bug where it
would terminate on ECONNABORTED from accept(2)
• Classic difference in OS semantics: BSD and illumos
variants (including SmartOS) do this; Linux doesn’t
• Much more likely over a transient network!
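• To confirm from the outside whether a listener is actually hitting this, a minimal DTrace sketch (not from the deck) counts failed accept(2) calls by process and errno; on illumos/SmartOS, ECONNABORTED is errno 130:
/*
 * Minimal sketch: count accept(2) failures by process and errno, to see
 * whether a listener is receiving ECONNABORTED (errno 130 on illumos).
 */
syscall::accept:return
/(int)arg0 == -1/
{
	@fails[execname, pid, errno] = count();
}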
7. Dropped connections
• If an application can’t keep up, its TCP listen backlog fills and
packets (SYNs) are dropped:
$ netstat -s | grep Drop
tcpTimRetransDrop = 56 tcpTimKeepalive = 2582
tcpTimKeepaliveProbe= 1594 tcpTimKeepaliveDrop = 41
tcpListenDrop =3089298 tcpListenDropQ0 = 0
tcpHalfOpenDrop = 0 tcpOutSackRetrans =1400832
icmpOutDrops = 0 icmpOutErrors = 0
sctpTimRetrans = 0 sctpTimRetransDrop = 0
sctpTimHearBeatProbe= 0 sctpTimHearBeatDrop = 0
sctpListenDrop = 0 sctpInClosed = 0
• Client waits, then retransmits (after 1 or 3 seconds),
inducing tremendous latency outliers; terrible for DIRTy
apps!
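• The netstat counters above are cumulative; to watch drops as they happen, a minimal sketch (not from the deck) uses the same mib probe as the scripts that follow:
# dtrace -n 'mib:::tcpListenDrop { @drops = count(); }
    tick-1s { printa("tcpListenDrop: %@d in the last second\n", @drops); clear(@drops); }'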
8. Dropped connections, cont.
• The fix for dropped connections:
• If due to a surge, increase the TCP backlog
• If due to sustained load, increase CPU resources, decrease CPU
consumption, or scale the app
• If fixed by increasing the TCP backlog, check that the system backlog
tunable actually took effect! (a sketch for checking this follows the list)
• If not, does the app need to be restarted?
• If not, is the application setting its own backlog value that takes
precedence?
• How close are we to dropping?
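• A minimal sketch for that check, reusing the same illumos tcp_t members as the script on the following slides: report the effective backlog maximum each listener is using when a SYN arrives, keyed by the PID that created the listener:
fbt::tcp_input_listener:entry
{
	this->connp = (conn_t *)arg0;
	this->tcp = (tcp_t *)this->connp->conn_proto_priv.cp_tcp;
	/* effective backlog maximum, keyed by the listener's creating PID */
	@max_backlog[this->connp->conn_cpid] = max(this->tcp->tcp_conn_req_max);
}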
10. Dropped connections, cont.
The kernel code (usr/src/uts/common/inet/tcp/tcp_input.c):
/*
* THIS FUNCTION IS DIRECTLY CALLED BY IP VIA SQUEUE FOR SYN.
* tcp_input_data will not see any packets for listeners since the listener
* has conn_recv set to tcp_input_listener.
*/
/* ARGSUSED */
static void
tcp_input_listener(void *arg, mblk_t *mp, void *arg2, ip_recv_attr_t *ira)
{
[...]
	if (listener->tcp_conn_req_cnt_q >= listener->tcp_conn_req_max) {
		mutex_exit(&listener->tcp_eager_lock);
		TCP_STAT(tcps, tcp_listendrop);
		TCPS_BUMP_MIB(tcps, tcpListenDrop);
		if (lconnp->conn_debug) {
			(void) strlog(TCP_MOD_ID, 0, 1, SL_TRACE|SL_ERROR,
			    "tcp_input_listener: listen backlog (max=%d) "
			    "overflow (%d pending) on %s",
			    listener->tcp_conn_req_max,
			    listener->tcp_conn_req_cnt_q,
			    tcp_display(listener, NULL, DISP_PORT_ONLY));
		}
		goto error2;
	}
[...]
11. Dropped connections, cont.
SEE ALL THE THINGS!
tcp_conn_req_cnt_q distributions:
cpid:3063 max_q:8
value ------------- Distribution ------------- count
-1 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
1 | 0
cpid:11504 max_q:128
value ------------- Distribution ------------- count
-1 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 7279
1 |@@ 405
2 |@ 255
4 |@ 138
8 | 81
16 | 83
32 | 62
64 | 67
128 | 34
256 | 0
tcpListenDrops:
cpid:11504 max_q:128 34
12. Dropped connections, cont.
• Uses DTrace to get a distribution of the TCP backlog queue length on
each SYN, keyed per-process; max_q is the listener’s backlog maximum:
fbt::tcp_input_listener:entry
{
	this->connp = (conn_t *)arg0;
	this->tcp = (tcp_t *)this->connp->conn_proto_priv.cp_tcp;
	self->max = strjoin("max_q:", lltostr(this->tcp->tcp_conn_req_max));
	self->pid = strjoin("cpid:", lltostr(this->connp->conn_cpid));
	@[self->pid, self->max] = quantize(this->tcp->tcp_conn_req_cnt_q);
}

mib:::tcpListenDrop
{
	this->max = self->max != NULL ? self->max : "<null>";
	this->pid = self->pid != NULL ? self->pid : "<null>";
	@drops[this->pid, this->max] = count();
	printf("%Y %s:%s %s\n", walltimestamp, probefunc, probename, this->pid);
}
• Script is on http://github.com/brendangregg/dtrace-cloud-tools as
net/tcpconnreqmaxq-pid*.d
13. Dropped connections, cont.
Or, snoop each drop:
# ./tcplistendrop.d
TIME SRC-IP PORT DST-IP PORT
2012 Jan 19 01:22:49 10.17.210.103 25691 -> 192.192.240.212 80
2012 Jan 19 01:22:49 10.17.210.108 18423 -> 192.192.240.212 80
2012 Jan 19 01:22:49 10.17.210.116 38883 -> 192.192.240.212 80
2012 Jan 19 01:22:49 10.17.210.117 10739 -> 192.192.240.212 80
2012 Jan 19 01:22:49 10.17.210.112 27988 -> 192.192.240.212 80
2012 Jan 19 01:22:49 10.17.210.106 28824 -> 192.192.240.212 80
2012 Jan 19 01:22:49 10.12.143.16 65070 -> 192.192.240.212 80
2012 Jan 19 01:22:49 10.17.210.100 56392 -> 192.192.240.212 80
2012 Jan 19 01:22:49 10.17.210.99 24628 -> 192.192.240.212 80
2012 Jan 19 01:22:49 10.17.210.98 11686 -> 192.192.240.212 80
2012 Jan 19 01:22:49 10.17.210.101 34629 -> 192.192.240.212 80
[...]
14. Dropped connections, cont.
• That code parsed IP and TCP headers from the in-kernel packet
buffer:
fbt::tcp_input_listener:entry  { self->mp = args[1]; }
fbt::tcp_input_listener:return { self->mp = 0; }

mib:::tcpListenDrop
/self->mp/
{
	this->iph = (ipha_t *)self->mp->b_rptr;
	/* assumes a 20-byte IP header (no IP options) */
	this->tcph = (tcph_t *)(self->mp->b_rptr + 20);
	printf("%-20Y %-18s %-5d -> %-18s %-5d\n", walltimestamp,
	    inet_ntoa(&this->iph->ipha_src),
	    ntohs(*(uint16_t *)this->tcph->th_lport),
	    inet_ntoa(&this->iph->ipha_dst),
	    ntohs(*(uint16_t *)this->tcph->th_fport));
}
• Script is tcplistendrop*.d, also on github
15. Dropped connections, cont.
• To summarize, dropped connections induce acute
latency bubbles
• With Voxer, we found that failures often cascade: high CPU utilization
from unrelated issues induces TCP listen drops
• Tunables don’t always take effect: need confirmation
• Having a quick tool to check scalability issues (DTrace)
has been invaluable
16. Slow disks
• Slow I/O in a cloud computing environment can be
caused by multi-tenancy — which is to say, neighbors:
• Neighbor running a backup
• Neighbor running a benchmark
• Neighbors can’t be seen by tenants...
• ...but is it really a neighbor?
24. Slow disks, cont.
• Correlating latency across the layers narrows down where it originates
• Or you can associate the layers in the same D script
• As text output, filtering on slow I/O works fine (a sketch follows this list)
• For high-frequency I/O, use heat maps
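• A minimal sketch of the text approach, assuming the DTrace io provider (not one of the scripts from this deck): print any block I/O slower than an arbitrary 100 ms threshold, for correlating by time with latency seen up-stack:
#!/usr/sbin/dtrace -s
#pragma D option quiet

io:::start
{
	/* timestamp each I/O by device and block address */
	iostart[args[0]->b_edev, args[0]->b_blkno] = timestamp;
}

io:::done
/iostart[args[0]->b_edev, args[0]->b_blkno] &&
    (timestamp - iostart[args[0]->b_edev, args[0]->b_blkno]) > 100000000/
{
	/* print only I/O slower than 100 ms */
	printf("%Y %10s %6d KB %4d ms\n", walltimestamp,
	    args[1]->dev_statname, (int)(args[0]->b_bcount / 1024),
	    (int)((timestamp - iostart[args[0]->b_edev, args[0]->b_blkno]) /
	    1000000));
}

io:::done
{
	iostart[args[0]->b_edev, args[0]->b_blkno] = 0;
}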
30. Slow disks, cont.
• On Joyent’s IaaS architecture, it’s usually not the disks
or filesystem; useful to rule that out quickly
• Some of the time it is, due to bad disks (1000+ms I/O);
heat map or iosnoop correlation matches
• Some of the time it’s due to big I/O (how quick is a 40
Mbyte read from cache?); see the size one-liner after this list
• Some of the time it is other tenants (benchmarking!);
much less for us now with ZFS I/O throttling
• With ZFS and an SSD-based intent log, HW RAID is not
just unobservable, but entirely unnecessary — adios
PERC!
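• To check whether outliers line up with big I/O, a minimal sketch (a standard io-provider one-liner, not from the deck) shows the I/O size distribution by process name; note that for asynchronous writes the on-CPU process may be sched rather than the tenant:
# dtrace -n 'io:::start { @[execname] = quantize(args[0]->b_bcount); }'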
31. Memory growth
• Riak had endless memory growth
• Expected 9 GB; after two days:
$ prstat -c 1
Please wait...
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
21722 103 43G 40G cpu0 59 0 72:23:41 2.6% beam.smp/594
15770 root 7760K 540K sleep 57 0 23:28:57 0.9% zoneadmd/5
95 root 0K 0K sleep 99 -20 7:37:47 0.2% zpool-zones/166
12827 root 128M 73M sleep 100 - 0:49:36 0.1% node/5
10319 bgregg 10M 6788K sleep 59 0 0:00:00 0.0% sshd/1
10402 root 22M 288K sleep 59 0 0:18:45 0.0% dtrace/1
[...]
• Eventually hits paging and terrible performance, needing a restart
• Remember, application restarts are a latency disaster!
33. Memory growth, cont.
• Is this a memory leak?
• In the app logic: Voxer?
• In the DB logic: Riak?
• In the DB’s Erlang VM?
• In the OS libraries (libc, lib*)?
• In the OS kernel?
• Or is it application growth?
• Where would you guess?
34. Memory growth, cont.
• Voxer (App): don’t think it’s us
• Basho (Riak): don’t think it’s us
• Joyent (OS): don’t think it’s us
• This sort of issue is usually app growth...
• ...but we can check libs & kernel to be sure
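• A quick first check of the native layers: is the heap growing via brk(), and which user-level code paths drive it? A minimal sketch (hypothetical one-liner, not the exact tooling used; PID is the beam.smp process ID):
# dtrace -n 'syscall::brk:entry /pid == $target/ { @[ustack()] = count(); }' -p PID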
35. Memory growth, cont.
• libumem was in use for allocations
• fast, scalable, object-caching, with multi-threaded support
• user-land version of kmem (slab allocator, Bonwick)
36. Memory growth, cont.
• Fix by experimentation (backend=mmap, other
allocators) wasn’t working.
• Detailed observability can be enabled in libumem,
allowing heap profiling and leak detection
• While designed with speed and production use in
mind, it still comes with some cost (time and space),
and isn’t on by default: restart required.
• UMEM_DEBUG=audit
41. Memory growth, cont.
• More DTrace showed the size of the malloc()s causing the brk()s:
# dtrace -x dynvarsize=4m -n '
pid$target::malloc:entry { self->size = arg0; }
syscall::brk:entry /self->size/ { printf("%d bytes", self->size); }
pid$target::malloc:return { self->size = 0; }' -p 17472
dtrace: description 'pid$target::malloc:entry ' matched 7 probes
CPU ID FUNCTION:NAME
0 44 brk:entry 8343520 bytes
0 44 brk:entry 8343520 bytes
[...]
• These 8 Mbyte malloc()s grew the heap
• Even though the heap has Gbytes not in use
• This is starting to look like an OS issue
42. Memory growth, cont.
• More tools were created:
• Show memory entropy (+ malloc, - free) along with heap growth,
over time (a sketch of this idea follows the list)
• Show the code path taken for allocations;
compare successful with unsuccessful (heap growth)
• Show allocator internals: sizes, options, flags
• And run them in the production environment
• Briefly: tracing frequent allocations does cost overhead
• Casting light into what was a black box
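• A minimal sketch of just the entropy idea (hypothetical; the production tools were more involved), totalling malloc()ed and free()d bytes alongside the brk() high-water mark, printed each second; run briefly with -p against the target process, since tracing every allocation costs overhead:
#!/usr/sbin/dtrace -s
#pragma D option quiet
#pragma D option dynvarsize=16m

pid$target::malloc:entry
{
	self->size = arg0;
}

pid$target::malloc:return
/self->size/
{
	@malloced = sum(self->size);
	allocsize[arg1] = self->size;	/* remember size by returned pointer */
	self->size = 0;
}

pid$target::free:entry
/allocsize[arg0]/
{
	@freed = sum(allocsize[arg0]);
	allocsize[arg0] = 0;
}

syscall::brk:entry
/pid == $target/
{
	@brkmax = max(arg0);		/* highest address requested of brk() */
}

tick-1s
{
	printa("malloc %@d bytes, free %@d bytes, brk high-water 0x%@x\n",
	    @malloced, @freed, @brkmax);
}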
44. Memory growth, cont.
• These new tools and metrics pointed to the allocation
algorithm “instant fit”
• This had been hypothesized earlier; the tools provided
solid evidence that this really was the case here
• A new version of libumem was built to force use of
VM_BESTFIT
• ...and added by Robert Mustacchi as a tunable:
UMEM_OPTIONS=allocator=best
• Riak was restarted with new libumem version, solving
the problem
45. Memory growth, cont.
• Not the first issue with the system memory allocator;
depending on configuration, Riak may use libc’s
malloc(), which isn’t designed to be scalable
• The man page does say it isn’t multi-thread scalable
• libumem was the answer (with the fix)
46. Memory growth, cont.
• The fragmentation problem was interesting because it
was unusual; it is not the most common source of
memory growth!
• DIRTy systems are often event-oriented…
• ...in event-oriented systems, memory growth can be a
consequence of either surging or drowning
• In an interpreted environment, memory growth can also
come from memory that is semantically leaked
• Voxer — like many emerging DIRTy apps — has a
substantial node.js component; how to debug node.js
memory growth?
47. Memory growth, cont.
• We have developed a postmortem technique for making
sense of a node.js heap:
OBJECT #OBJECTS #PROPS CONSTRUCTOR: PROP
fe806139 1 1 Object: Queue
fc424131 1 1 Object: Credentials
fc424091 1 1 Object: version
fc4e3281 1 1 Object: message
fc404f6d 1 1 Object: uncaughtException
...
fafcb229 1007 23 ClientRequest: outputEncodings, _headerSent, ...
fafc5e75 1034 5 Timing: req_start, res_end, res_bytes, req_end, ...
fafcbecd 1037 3 Object: aborted, data, end
8045475 1060 1 Object:
fb0cee9d 1220 9 HTTPParser: socket, incoming, onHeadersComplete, ...
fafc58d5 1271 25 Socket: _connectQueue, bytesRead, _httpMessage, ...
fafc4335 1311 16 ServerResponse: outputEncodings, statusCode, ...
• Used by @izs to debug a nasty node.js leak
• Search for “findjsobjects” (one word) for details
54. CPU scheduling, cont.
• Findings:
• FSS scheduler class bug:
• FSS uses a more complex technique to avoid CPU starvation; a thread’s priority
could stay high, keeping it on-CPU for many seconds, before the priority decayed
enough to allow another thread to run.
• Analyzed (more DTrace) and fixed (thanks Jerry Jelinek)
• DTrace analysis of the scheduler was invaluable
• Under (too) high CPU load, your runtime can be bound by how
well you schedule, not by how fast you do work
• Not the only scheduler issue we’ve encountered
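• For reference, dispatcher-queue latency can be quantified with the sched provider; a minimal sketch (the classic run-queue latency pattern, not the exact analysis scripts used here):
sched:::enqueue
{
	/* timestamp each thread as it is placed on a run queue */
	start[args[0]->pr_lwpid, args[1]->pr_pid] = timestamp;
}

sched:::dequeue
/start[args[0]->pr_lwpid, args[1]->pr_pid]/
{
	/* quantize how long the thread waited before leaving the queue */
	@queued_ns[args[1]->pr_fname] = quantize(timestamp -
	    start[args[0]->pr_lwpid, args[1]->pr_pid]);
	start[args[0]->pr_lwpid, args[1]->pr_pid] = 0;
}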
55. CPU scheduling, cont.
• CPU caps to throttle tenants in our cloud
• Experiment: add hot-CPU threads (saturation):
56. CPU scheduling, cont.
• CPU caps to throttle tenants in our cloud
• Experiment: add hot-CPU threads:
:-(
57. Visualizing CPU latency
• Using a node.js ustack helper and the DTrace profile
provider, we can determine the relative frequency of
stack backtraces in terms of CPU consumption
• Stacks can be visualized with flame graphs, a stack
visualization we developed:
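• The samples behind such a flame graph come from a profiling one-liner like the following minimal sketch (the 97 Hz rate and jstack() sizes are illustrative); the aggregated stacks are then fed to the flame graph tools:
# dtrace -n 'profile-97 /execname == "node" && arg1/ {
    @[jstack(100, 8000)] = count(); } tick-60s { exit(0); }'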
58. DIRT in production
• node.js is particularly well suited to the DIRTy apps that
typify the real-time web
• The ability to understand latency must be considered
when deploying node.js-based systems into production!
• Understanding latency requires dynamic instrumentation
and novel visualization
• At Joyent, we have added DTrace-based dynamic
instrumentation for node.js to SmartOS, and novel
visualization into our cloud and software offerings
• Better production support — better observability, better
debuggability — remains an important area of node.js
development!
59. Beyond node.js
• node.js is adept at connecting components in the
system; it is unlikely to be the only component!
• As such, when using node.js to develop a DIRTy app,
you can expect to spend as much time (if not more!)
understanding the other components as understanding the app itself
• When selecting components — operating system, in-
memory data store, database, distributed data store —
observability must be a primary consideration!
• When building a team, look for full-stack engineers —
DIRTy apps pose a full-stack challenge!
60. Thank you!
• @mranney for being an excellent guinea pig customer
• @dapsays for the V8 DTrace ustack helper and V8
debugging support
• More information: http://dtrace.org/blogs/brendan,
http://dtrace.org/blogs/dap, and http://smartos.org