Real-time in the
real world:
DIRT in production
Bryan Cantrill     Brendan Gregg
SVP, Engineering   Lead Performance Engineer

bryan@joyent.com   brendan@joyent.com
@bcantrill         @brendangregg
Previously, on #surgecon...

    • Two years ago at Surge, we described the emergence
     of real-time data semantics in web-facing applications
    • We dubbed this data-intensive real-time (DIRT)
    • Last year at Surge 2011, we presented experiences
     building a DIRTy system of our own — a facility for real-
     time analytics of latency in the cloud
    • While this system is interesting, it is somewhat synthetic
     in nature in that it does not need to scale (much) with
     respect to users...
#surgecon 2012

   • Accelerated by the rise of mobile applications, DIRTy
    systems are becoming increasingly common
   • In the past year, we’ve seen apps in production at scale
   • There are many examples of this, but for us, a paragon
    of the emerging DIRTy apps has been Voxer
   • Voxer is a push-to-talk mobile app that can be thought of
    as the confluence of voice mail and SMS
   • A canonical DIRTy app: latency and scale both matter!
   • Our experiences debugging latency bubbles with Voxer
    over the past year have taught us quite a bit about the
    new challenges that DIRTy apps pose...
The challenge of DIRTy apps

   • DIRTy applications tend to have the human in the loop
      • Good news: deadlines are soft — microseconds only
        matter when they add up to tens of milliseconds

      • Bad news: because humans are in the loop, demand
        for the system can be non-linear

   • One must deal not only with the traditional challenge of
     scalability, but also the challenge of a real-time system
   • Worse, emerging DIRTy apps have mobile devices at
     their edge — network transience makes clients seem ill-
     behaved with respect to connection state!
The lessons of DIRTy apps

   • Many latency bubbles originate deep in the stack; OS
     understanding and instrumentation have been essential
     even when the OS is not at fault
   • For up-stack problems, tooling has been essential
   • Latency outliers can come from many sources:
     application restarts, dropped connections, slow disks,
     boundless memory growth
   • We have also seen some traditional real-time problems
     with respect to CPU scheduling, e.g. priority inversions
   • Enough foreplay; on with the DIRTy disaster pr0n!
Application restarts

    • Modern internet-facing architectures are designed to be
     resilient with respect to many failure modes…
    • ...but application restarts can induce pathological,
     cascading latency bubbles, as clients reconnect,
     clusters reconverge, etc.
    • For example, Voxer ran into a node.js bug where it
     would terminate on ECONNABORTED from accept(2)
    • Classic difference in OS semantics: BSD and illumos
     variants (including SmartOS) do this; Linux doesn’t
    • Much more likely over a transient network!
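    • A quick way to see which errnos failed accept(2) calls are hitting is a
     DTrace one-liner; a sketch, assuming the syscall provider exposes
     accept on your platform:

# dtrace -n 'syscall::accept*:return /(int)arg0 == -1/ { @[execname, errno] = count(); }'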
Dropped connections

   • If an application can’t keep up with TCP backlog,
    packets (SYNs) are dropped:
     $ netstat -s | grep Drop
        tcpTimRetransDrop   =    56    tcpTimKeepalive       = 2582
        tcpTimKeepaliveProbe= 1594     tcpTimKeepaliveDrop   =    41
        tcpListenDrop       =3089298   tcpListenDropQ0       =     0
        tcpHalfOpenDrop     =     0    tcpOutSackRetrans     =1400832
        icmpOutDrops        =     0    icmpOutErrors         =     0
        sctpTimRetrans      =     0    sctpTimRetransDrop    =     0
        sctpTimHearBeatProbe=     0    sctpTimHearBeatDrop   =     0
        sctpListenDrop      =     0    sctpInClosed          =     0


   • Client waits, then retransmits (after 1 or 3 seconds),
    inducing tremendous latency outliers; terrible for DIRTy
    apps!
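   • A quick check for whether drops are happening right now; a minimal
    sketch using the same mib provider as the scripts that follow:

# dtrace -n 'mib:::tcpListenDrop { @d = count(); }
    tick-1s { printa("listen drops/s: %@d\n", @d); trunc(@d); }'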
Dropped connections, cont.

   • The fix for dropped connections:
      • If due to a surge, increase TCP backlog
       • If due to sustained load, increase CPU resources,
         decrease CPU consumption or scale the app

   • If fixed by increasing the TCP backlog, check that the
     system backlog tunable took effect!
   • If not, does the app need to be restarted?
   • If not, is the application providing its own backlog that is taking
     precedence?
   • How close are we to dropping?
Dropped connections, cont.

• Networking 101

  [Diagram: inbound SYNs land on the listener’s TCP backlog queue; the
   application drains the queue via accept(); once the queue reaches max,
   further SYNs suffer listen drops]
Dropped connections, cont.
 The kernel code (usr/src/uts/common/inet/tcp/tcp_input.c):
  /*
    * THIS FUNCTION IS DIRECTLY CALLED BY IP VIA SQUEUE FOR SYN.
    * tcp_input_data will not see any packets for listeners since the listener
    * has conn_recv set to tcp_input_listener.
    */
  /* ARGSUSED */
  static void
  tcp_input_listener(void *arg, mblk_t *mp, void *arg2, ip_recv_attr_t *ira)
  {
  [...]
           if (listener->tcp_conn_req_cnt_q >= listener->tcp_conn_req_max) {
                   mutex_exit(&listener->tcp_eager_lock);
                   TCP_STAT(tcps, tcp_listendrop);
                   TCPS_BUMP_MIB(tcps, tcpListenDrop);
                   if (lconnp->conn_debug) {
                           (void) strlog(TCP_MOD_ID, 0, 1, SL_TRACE|SL_ERROR,
                                "tcp_input_listener: listen backlog (max=%d) "
                                "overflow (%d pending) on %s",
                                listener->tcp_conn_req_max,
                                listener->tcp_conn_req_cnt_q,
                                tcp_display(listener, NULL, DISP_PORT_ONLY));
                   }
                   goto error2;
           }
  [...]
Dropped connections, cont.
SEE ALL THE THINGS!
tcp_conn_req_cnt_q distributions:

  cpid:3063                                                 max_q:8
              value     ------------- Distribution ------------- count
                 -1   |                                          0
                  0   |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
                  1   |                                          0

  cpid:11504                                                max_q:128
           value        ------------- Distribution ------------- count
              -1      |                                          0
               0      |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       7279
               1      |@@                                        405
               2      |@                                         255
               4      |@                                         138
               8      |                                          81
              16      |                                          83
              32      |                                          62
              64      |                                          67
             128      |                                          34
             256      |                                          0

tcpListenDrops:
   cpid:11504                         max_q:128                          34
Dropped connections, cont.

• Uses DTrace to get a distribution of the TCP backlog queue length on each
 SYN; max_q is the backlog limit (tcp_conn_req_max), shown per process:
   fbt::tcp_input_listener:entry
   {
       this->connp = (conn_t *)arg0;
       this->tcp = (tcp_t *)this->connp->conn_proto_priv.cp_tcp;
       self->max = strjoin("max_q:", lltostr(this->tcp->tcp_conn_req_max));
       self->pid = strjoin("cpid:", lltostr(this->connp->conn_cpid));
       @[self->pid, self->max] = quantize(this->tcp->tcp_conn_req_cnt_q);
   }

   mib:::tcpListenDrop
   {
       this->max = self->max;
       this->pid = self->pid;
        this->max = this->max != NULL ? this->max : "<null>";
        this->pid = this->pid != NULL ? this->pid : "<null>";
       @drops[this->pid, this->max] = count();
       printf("%Y %s:%s %sn", walltimestamp, probefunc, probename, this->pid);
   }

• Script is on http://github.com/brendangregg/dtrace-cloud-tools as
 net/tcpconnreqmaxq-pid*.d
Dropped connections, cont.

 Or, snoop each drop:

# ./tcplistendrop.d
TIME                   SRC-IP          PORT         DST-IP            PORT
2012 Jan 19 01:22:49   10.17.210.103   25691   ->   192.192.240.212     80
2012 Jan 19 01:22:49   10.17.210.108   18423   ->   192.192.240.212     80
2012 Jan 19 01:22:49   10.17.210.116   38883   ->   192.192.240.212     80
2012 Jan 19 01:22:49   10.17.210.117   10739   ->   192.192.240.212     80
2012 Jan 19 01:22:49   10.17.210.112   27988   ->   192.192.240.212     80
2012 Jan 19 01:22:49   10.17.210.106   28824   ->   192.192.240.212     80
2012 Jan 19 01:22:49   10.12.143.16    65070   ->   192.192.240.212     80
2012 Jan 19 01:22:49   10.17.210.100   56392   ->   192.192.240.212     80
2012 Jan 19 01:22:49   10.17.210.99    24628   ->   192.192.240.212     80
2012 Jan 19 01:22:49   10.17.210.98    11686   ->   192.192.240.212     80
2012 Jan 19 01:22:49   10.17.210.101   34629   ->   192.192.240.212     80
[...]
Dropped connections, cont.

• That code parsed IP and TCP headers from the in-kernel packet
 buffer:

fbt::tcp_input_listener:entry { self->mp = args[1]; }
fbt::tcp_input_listener:return { self->mp = 0; }

mib:::tcpListenDrop
/self->mp/
{
        this->iph = (ipha_t *)self->mp->b_rptr;
        this->tcph = (tcph_t *)(self->mp->b_rptr + 20);
        printf("%-20Y %-18s %-5d -> %-18s %-5dn", walltimestamp,
            inet_ntoa(&this->iph->ipha_src),
            ntohs(*(uint16_t *)this->tcph->th_lport),
            inet_ntoa(&this->iph->ipha_dst),
            ntohs(*(uint16_t *)this->tcph->th_fport));
}



• Script is tcplistendrop*.d, also on github
Dropped connections, cont.

   • To summarize, dropped connections induce acute
     latency bubbles
   • With Voxer, we found that failures often cascaded: high
     CPU utilization due to unrelated issues would induce TCP
     listen drops
   • Tunables don’t always take effect: need confirmation
   • Having a quick tool to check scalability issues (DTrace)
     has been invaluable
Slow disks

   • Slow I/O in a cloud computing environment can be
     caused by multi-tenancy — which is to say, neighbors:

      • Neighbor running a backup
      • Neighbor running a benchmark
   • Neighbors can’t be seen by tenants...
   • ...but is it really a neighbor?
Slow disks, cont.

• Unix 101

  [Diagram of the I/O stack: Process → syscall interface → VFS → ZFS (and
   other filesystems) → block device interface → disks]
Slow disks, cont.

• Unix 101

  [Same I/O stack, annotated: latency observed at the syscall/VFS level is
   synchronous with the process, while iostat(1) observes the block device
   interface, which is often asynchronous due to write buffering and
   read-ahead]
Slow disks, cont.

  • VFS-level iostat: vfsstat
# vfsstat -Z 1
  r/s   w/s kr/s     kw/s ractv wactv read_t writ_t   %r   %w   d/s   del_t   zone
  1.2   2.8    0.6    0.2   0.0   0.0    0.0    0.0    0    0   0.0     0.0   global (0)
  0.1   0.0    0.1    0.0   0.0   0.0    0.0    0.0    0    0   0.0    34.9   9cc2d0d3 (2)
  0.1   0.0    0.1    0.0   0.0   0.0    0.0    0.0    0    0   0.0    46.5   72188ca0 (3)
  0.0   0.0    0.0    0.0   0.0   0.0    0.0    0.0    0    0   0.0    16.5   4d2a62bb (4)
  0.3   0.1    0.1    0.3   0.0   0.0    0.0    0.0    0    0   0.0    27.6   8bbc4000 (5)
  5.9   0.2    0.5    0.1   0.0   0.0    0.0    0.0    0    0   5.0    11.3   d305ee44 (6)
  0.1   0.0    0.1    0.0   0.0   0.0    0.0    0.0    0    0   0.0   132.0   9897c8f5 (7)
  0.1   0.0    0.1    0.0   0.0   0.0    0.0    0.1    0    0   0.0    40.7   5f3c7d9e (9)
  0.2   0.8    0.5    0.6   0.0   0.0    0.0    0.0    0    0   0.0    31.9   22ef87fc (10)


  • Kernel changes, new kstats (thanks Bill Pijewski)
Slow disks, cont.

  • zfsslower.d:
# ./zfsslower.d 10
TIME                   PROCESS   D       B   ms   FILE
2012 Sep 27 13:45:33   zlogin    W     372   11   /zones/b8b2464c/var/adm/wtmpx
2012 Sep 27 13:45:36   bash      R       8   14   /zones/b8b2464c/opt/local/bin/zsh
2012 Sep 27 13:45:58   mysqld    R 1048576   19   /zones/b8b2464c/var/mysql/ibdata1
2012 Sep 27 13:45:58   mysqld    R 1048576   22   /zones/b8b2464c/var/mysql/ibdata1
2012 Sep 27 13:46:14   master    R       8   6    /zones/b8b2464c/root/opt/local/
libexec/postfix/qmgr
2012 Sep 27 13:46:14   master    R    4096   5    /zones/b8b2464c/root/opt/local/etc/
postfix/master.cf
[...]




  • Go-to tool. Are there VFS-level I/Os slower than 10 ms? (the threshold is an argument)
  • Stupidly easy to do
Slow disks, cont.

• Written in DTrace
[...]
fbt::zfs_read:entry,
fbt::zfs_write:entry
{
    self->path = args[0]->v_path;
    self->kb = args[1]->uio_resid / 1024;
    self->start = timestamp;
}

fbt::zfs_read:return,
fbt::zfs_write:return
/self->start && (timestamp - self->start) >= min_ns/
{
    this->iotime = (timestamp - self->start) / 1000000;
    this->dir = probefunc == "zfs_read" ? "R" : "W";
    printf("%-20Y %-16s %1s %4d %6d %sn", walltimestamp,
        execname, this->dir, self->kb, this->iotime,
        self->path != NULL ? stringof(self->path) : "<null>");
}
[...]


• zfsslower.d, also on github, originated from the DTrace book
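• The elided preamble presumably converts the command-line argument (10 ms in
  the example above) into min_ns; a minimal sketch of that step, as an
  assumption rather than the script’s actual text:

dtrace:::BEGIN
{
    /* first argument: minimum I/O latency to report, ms -> ns */
    min_ns = $1 * 1000000;
}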
Slow disks, cont.

• Traces VFS/ZFS interface (kernel)
 from usr/src/uts/common/fs/zfs/zfs_vnops.c:

/*
 * Regular file vnode operations template
 */
vnodeops_t *zfs_fvnodeops;
const fs_operation_def_t zfs_fvnodeops_template[] = {
        VOPNAME_OPEN,           { .vop_open = zfs_open },
        VOPNAME_CLOSE,          { .vop_close = zfs_close },
        VOPNAME_READ,           { .vop_read = zfs_read },
        VOPNAME_WRITE,          { .vop_write = zfs_write },
        VOPNAME_IOCTL,          { .vop_ioctl = zfs_ioctl },
        VOPNAME_GETATTR,        { .vop_getattr = zfs_getattr },
[...]
Slow disks, cont.

• Unix 101

  [Same I/O stack, annotated with tools: zfsslower.d observes the VFS/ZFS
   interface, iosnoop the block device interface; correlate the two]
Slow disks, cont.

• Correlating the layers narrows the latency location
   • Or you can associate in the same D script
• Via text, filtering on slow I/O, works fine
• For high frequency I/O, heat maps
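• A minimal sketch of the block-device half of that correlation (a standard
  io-provider idiom, not a tool from this deck): quantize physical I/O latency

io:::start
{
    /* key on the buf pointer for this I/O */
    ts[arg0] = timestamp;
}

io:::done
/ts[arg0]/
{
    @["block I/O latency (ns)"] = quantize(timestamp - ts[arg0]);
    ts[arg0] = 0;
}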
Slow disks, cont.

• WHAT DOES IT MEAN?
Slow disks, cont.

• Latency outliers:
Slow disks, cont.

• Latency outliers:

  [Heat map annotated, from top to bottom: Inconceivable, Very Bad, Bad, Good]
Slow disks, cont.

• Inconceivably bad, 1000+ms VFS-level latency:
   • Queueing behind large ZFS SPA syncs (tunable)
   • Other tenants benchmarking (before we added I/O throttling to
     SmartOS)
   • Reads queueing behind writes; needed to tune ZFS and the LSI PERC
     (shakes fist!)

     [Scatter plot of latency vs. time (s), latency axis marked at 60 ms;
      read = red, write = blue]
Slow disks, cont.
• Deeper tools rolled as needed. Anywhere in ZFS.
# dtrace -n 'io:::start { @[stack()] = count(); }'
dtrace: description 'io:::start ' matched 6 probes
^C
              genunix`ldi_strategy+0x53
              zfs`vdev_disk_io_start+0xcc
              zfs`zio_vdev_io_start+0xab
              zfs`zio_execute+0x88
              zfs`zio_nowait+0x21
              zfs`vdev_mirror_io_start+0xcd
              zfs`zio_vdev_io_start+0x250
              zfs`zio_execute+0x88
              zfs`zio_nowait+0x21
              zfs`arc_read_nolock+0x4f9
              zfs`arc_read+0x96
              zfs`dsl_read+0x44
              zfs`dbuf_read_impl+0x166
              zfs`dbuf_read+0xab
              zfs`dmu_buf_hold_array_by_dnode+0x189
              zfs`dmu_buf_hold_array+0x78
              zfs`dmu_read_uio+0x5c
              zfs`zfs_read+0x1a3
              genunix`fop_read+0x8b
              genunix`read+0x2a7
              143
Slow disks, cont.

    • On Joyent’s IaaS architecture, it’s usually not the disks
     or filesystem; useful to rule that out quickly
    • Some of the time it is, due to bad disks (1000+ms I/O);
     heat map or iosnoop correlation matches
    • Some of the time it’s due to big I/O (how quick is a 40
     Mbyte read from cache?)
    • Some of the time it is other tenants (benchmarking!);
     much less for us now with ZFS I/O throttling
    • With ZFS and an SSD-based intent log, HW RAID is not
     just unobservable, but entirely unnecessary — adios
     PERC!
Memory growth

• Riak had endless memory growth
• Expected 9GB, after two days:
$ prstat -c 1
Please wait...
   PID USERNAME SIZE    RSS STATE   PRI NICE       TIME    CPU   PROCESS/NLWP
 21722 103        43G   40G cpu0     59    0   72:23:41   2.6%   beam.smp/594
 15770 root     7760K 540K sleep     57    0   23:28:57   0.9%   zoneadmd/5
    95 root        0K    0K sleep    99 -20     7:37:47   0.2%   zpool-zones/166
 12827 root      128M   73M sleep   100    -    0:49:36   0.1%   node/5
 10319 bgregg     10M 6788K sleep    59    0    0:00:00   0.0%   sshd/1
 10402 root       22M 288K sleep     59    0    0:18:45   0.0%   dtrace/1
[...]



• Eventually hits paging and terrible performance, needing a restart
• Remember, application restarts are a latency disaster!
Memory growth, cont.

• What is in the heap?
$ pmap 14719
14719:   beam.smp
0000000000400000       2168K   r-x--   /opt/riak/erts-5.8.5/bin/beam.smp
000000000062D000        328K   rw---   /opt/riak/erts-5.8.5/bin/beam.smp
000000000067F000    4193540K   rw---   /opt/riak/erts-5.8.5/bin/beam.smp
00000001005C0000    4194296K   rw---     [ anon ]
00000002005BE000    4192016K   rw---     [ anon ]
0000000300382000    4193664K   rw---     [ anon ]
00000004002E2000    4191172K   rw---     [ anon ]
00000004FFFD3000    4194040K   rw---     [ anon ]
00000005FFF91000    4194028K   rw---     [ anon ]
00000006FFF4C000    4188812K   rw---     [ anon ]
00000007FF9EF000     588224K   rw---     [ heap ]
[...]


• ... and why does it keep growing?
Memory growth, cont.


  • Is this a memory leak?




     [Diagram: the software stack, top to bottom: Voxer, Riak, Erlang VM,
      libc/lib*, kernel, with a “?” pointing at it]

     • In the app logic: Voxer?
     • In the DB logic: Riak?
     • In the DB’s Erlang VM?
     • In the OS libraries?
     • In the OS kernel?
  • Or application growth?
  • Where would you guess?
Memory growth, cont.

   • Voxer (App): don’t think it’s us
   • Basho (Riak): don’t think it’s us
   • Joyent (OS): don’t think it’s us
       • This sort of issue is usually app growth...
       • ...but we can check libs & kernel to be sure
Memory growth, cont.

   • libumem was in use for allocations
      • fast, scalable, object-caching, multi-threaded
        support

      • user-land version of kmem (slab allocator, Bonwick)
Memory growth, cont.

   • Fix by experimentation (backend=mmap, other
     allocators) wasn’t working.
   • Detailed observability can be enabled in libumem,
     allowing heap profiling and leak detection

      • While designed with speed and production use in
        mind, it still comes with some cost (time and space),
        and isn’t on by default: restart required.

      • UMEM_DEBUG=audit
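   • Even without a restart, the pid provider can give a crude heap profile
     by summing requested bytes by user stack; a sketch (run only briefly,
     since malloc is hot; substitute the target process ID for PID):

# dtrace -n 'pid$target::malloc:entry { @[ustack()] = sum(arg0); }' -p PID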
Memory growth, cont.

• libumem provides some default observability
   • E.g., slabs:

 > ::umem_malloc_info
 CACHE             BUFSZ MAXMAL BUFMALLC   AVG_MAL    MALLOCED    OVERHEAD    %OVER
 0000000000707028       8     0        0         0           0           0     0.0%
 000000000070b028      16     8     8730         8       69836     1054998   1510.6%
 000000000070c028      32    16     8772        16      140352     1130491   805.4%
 000000000070f028      48    32 1148038         25    29127788   156179051   536.1%
 0000000000710028      64    48   344138        40    13765658    58417287   424.3%
 0000000000711028      80    64       36        62        2226        4806   215.9%
 0000000000714028      96    80     8934        79      705348     1168558   165.6%
 0000000000715028     112    96 1347040         87   117120208   190389780   162.5%
 0000000000718028     128   112   253107       111    28011923    42279506   150.9%
 000000000071a028     160   144    40529       118     4788681     6466801   135.0%
 000000000071b028     192   176      140       155       21712       25818   118.9%
 000000000071e028     224   208       43       188        8101        6497    80.1%
 000000000071f028     256   240      133       229       30447       26211    86.0%
 0000000000720028     320   304       56       276       15455       12276    79.4%
 0000000000723028     384   368       35       335       11726        7220    61.5%
 [...]
Memory growth, cont.

    • ... and heap (captured @14GB RSS):
> ::vmem
ADDR             NAME                         INUSE         TOTAL   SUCCEED FAIL
fffffd7ffebed4a0 sbrk_top                9090404352   14240165888   4298117 84403
fffffd7ffebee0a8   sbrk_heap             9090404352    9090404352   4298117     0
fffffd7ffebeecb0      vmem_internal       664616960     664616960     79621     0
fffffd7ffebef8b8        vmem_seg          651993088     651993088     79589     0
fffffd7ffebf04c0        vmem_hash          12583424      12587008        27     0
fffffd7ffebf10c8        vmem_vmem             46200         55344        15     0
00000000006e7000      umem_internal       352862464     352866304     88746     0
00000000006e8000        umem_cache           113696        180224        44     0
00000000006e9000        umem_hash          13091328      13099008        86     0
00000000006ea000      umem_log                    0             0         0     0
00000000006eb000      umem_firewall_va            0             0         0     0
00000000006ec000        umem_firewall             0             0         0     0
00000000006ed000      umem_oversize      5218777974    5520789504   3822051     0
00000000006f0000      umem_memalign               0             0         0     0
0000000000706000      umem_default       2552131584    2552131584    307699     0


• The heap is 9 GB (as expected), but sbrk_top total is 14 GB (equal
 to RSS). And growing.

    • Are there Gbyte-sized malloc()/free()s?
Memory growth, cont.
# dtrace -n 'pid$target::malloc:entry { @ = quantize(arg0); }' -p 17472
dtrace: description 'pid$target::malloc:entry ' matched 3 probes
^C
           value ------------- Distribution ------------- count
               2 |                                         0
               4 |                                         3
               8 |@                                        5927
              16 |@@@@                                     41818
              32 |@@@@@@@@@                                81991
              64 |@@@@@@@@@@@@@@@@@@                       169888
             128 |@@@@@@@                                  69891
             256 |                                         2257
             512 |                                         406
            1024 |                                         893
            2048 |                                         146
            4096 |                                         1467
            8192 |                                         755
           16384 |                                         950
           32768 |                                         83
           65536 |                                         31
          131072 |                                         11
          262144 |                                         15
          524288 |                                         0
         1048576 |                                         1
         2097152 |                                         0


• No huge malloc()s, but RSS continues to climb.
Memory growth, cont.

• Tracing why the heap grows via brk():
# dtrace -n 'syscall::brk:entry /execname == "beam.smp"/ { ustack(); }'
dtrace: description 'syscall::brk:entry ' matched 1 probe
CPU     ID                    FUNCTION:NAME
 10     18                        brk:entry
              libc.so.1`_brk_unlocked+0xa
              libumem.so.1`vmem_sbrk_alloc+0x84
              libumem.so.1`vmem_xalloc+0x669
              libumem.so.1`vmem_alloc+0x14f
              libumem.so.1`vmem_xalloc+0x669
              libumem.so.1`vmem_alloc+0x14f
              libumem.so.1`umem_alloc+0x72
              libumem.so.1`malloc+0x59
              libstdc++.so.6.0.14`_Znwm+0x20
              libstdc++.so.6.0.14`_Znam+0x9
              eleveldb.so`_ZN7leveldb9ReadBlockEPNS_16RandomAccessFileERKNS_11Rea...
              eleveldb.so`_ZN7leveldb5Table11BlockReaderEPvRKNS_11ReadOptionsERKN...
              eleveldb.so`_ZN7leveldb12_GLOBAL__N_116TwoLevelIterator13InitDataBl...
              eleveldb.so`_ZN7leveldb12_GLOBAL__N_116TwoLevelIterator4SeekERKNS_5...
              eleveldb.so`_ZN7leveldb12_GLOBAL__N_116TwoLevelIterator4SeekERKNS_5...
              eleveldb.so`_ZN7leveldb12_GLOBAL__N_115MergingIterator4SeekERKNS_5S...
              eleveldb.so`_ZN7leveldb12_GLOBAL__N_16DBIter4SeekERKNS_5SliceE+0xcc
              eleveldb.so`eleveldb_get+0xd3
              beam.smp`process_main+0x6939
              beam.smp`sched_thread_func+0x1cf
              beam.smp`thr_wrapper+0xbe
Memory growth, cont.

• More DTrace showed the size of the malloc()s causing the brk()s:

 # dtrace -x dynvarsize=4m -n '
 pid$target::malloc:entry { self->size = arg0; }
 syscall::brk:entry /self->size/ { printf("%d bytes", self->size); }
 pid$target::malloc:return { self->size = 0; }' -p 17472

 dtrace:     description 'pid$target::malloc:entry ' matched 7 probes
 CPU         ID                    FUNCTION:NAME
   0         44                        brk:entry 8343520 bytes
   0         44                        brk:entry 8343520 bytes
 [...]


• These 8 Mbyte malloc()s grew the heap
         • Even though the heap has Gbytes not in use
         • This is starting to look like an OS issue
Memory growth, cont.

• More tools were created:
   • Show memory entropy (+ malloc - free)
     along with heap growth, over time

   • Show codepath taken for allocations
     compare successful with unsuccessful (heap growth)

   • Show allocator internals: sizes, options, flags
• And run in the production environment
   • Briefly. Tracing frequent allocs does cost overhead
• Casting light into what was a black box
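• The flow trace that follows is the kind of output such a script produces;
  a rough sketch (the probe list and traced arguments are assumptions, not
  the actual tool):

#pragma D option flowindent

pid$target:libumem.so.1:vmem_alloc:entry,
pid$target:libumem.so.1:vmem_xalloc:entry,
pid$target:libumem.so.1:vmem_sbrk_alloc:entry
{
        trace(arg1);            /* requested size */
}

pid$target:libumem.so.1:vmem_alloc:return,
pid$target:libumem.so.1:vmem_xalloc:return,
pid$target:libumem.so.1:vmem_sbrk_alloc:return
{
        trace(arg1);            /* return value */
}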
Memory growth, cont.
  4            <- vmem_xalloc                                   0
  4            -> _sbrk_grow_aligned                         4096
  4            <- _sbrk_grow_aligned                 17155911680
  4            -> vmem_xalloc                             7356400
  4             | vmem_xalloc:entry              umem_oversize
  4              -> vmem_alloc                            7356416
  4                -> vmem_xalloc                         7356416
  4                 | vmem_xalloc:entry          sbrk_heap
  4                  -> vmem_sbrk_alloc                   7356416
  4                    -> vmem_alloc                      7356416
  4                      -> vmem_xalloc                   7356416
  4                       | vmem_xalloc:entry    sbrk_top
  4                        -> vmem_reap                 16777216
  4                        <- vmem_reap         3178535181209758
  4                       | vmem_xalloc:return vmem_xalloc() == NULL, vm:
sbrk_top, size: 7356416, align: 4096, phase: 0, nocross: 0, min: 0, max: 0,
vmflag: 1
              libumem.so.1`vmem_xalloc+0x80f
              libumem.so.1`vmem_sbrk_alloc+0x33
              libumem.so.1`vmem_xalloc+0x669
              libumem.so.1`vmem_alloc+0x14f
              libumem.so.1`vmem_xalloc+0x669
              libumem.so.1`vmem_alloc+0x14f
              libumem.so.1`umem_alloc+0x72
              libumem.so.1`malloc+0x59
              libstdc++.so.6.0.3`_Znwm+0x2b
              libstdc++.so.6.0.3`_ZNSs4_Rep9_S_createEmmRKSaIcE+0x7e
Memory growth, cont.

   • These new tools and metrics pointed to the allocation
     algorithm “instant fit”
   • This had been hypothesized earlier; the tools provided
     solid evidence that this really was the case here
   • A new version of libumem was built to force use of
     VM_BESTFIT
   • ...and added by Robert Mustacchi as a tunable:
     UMEM_OPTIONS=allocator=best
   • Riak was restarted with new libumem version, solving
     the problem
Memory growth, cont.

   • Not the first issue with the system memory allocator;
     depending on configuration, Riak may use libc’s
     malloc(), which isn’t designed to be scalable

       • man page does say it isn’t multi-thread scalable
   • libumem was the answer (with the fix)
Memory growth, cont.

   • The fragmentation problem was interesting because it
     was unusual; it is not the most common source of
     memory growth!
   • DIRTy systems are often event-oriented…
   • ...in event-oriented systems, memory growth can be a
     consequence of either surging or drowning
   • In an interpreted environment, memory growth can also
     come from memory that is semantically leaked
   • Voxer — like many emerging DIRTy apps — has a
     substantial node.js component; how to debug node.js
     memory growth?
Memory growth, cont.

• We have developed a postmortem technique for making
  sense of a node.js heap:

  OBJECT #OBJECTS #PROPS CONSTRUCTOR: PROP
fe806139        1      1 Object: Queue
fc424131        1      1 Object: Credentials
fc424091        1      1 Object: version
fc4e3281        1      1 Object: message
fc404f6d        1      1 Object: uncaughtException
...
fafcb229     1007     23 ClientRequest: outputEncodings, _headerSent, ...
fafc5e75     1034      5 Timing: req_start, res_end, res_bytes, req_end, ...
fafcbecd     1037      3 Object: aborted, data, end
 8045475     1060      1 Object:
fb0cee9d     1220      9 HTTPParser: socket, incoming, onHeadersComplete, ...
fafc58d5     1271     25 Socket: _connectQueue, bytesRead, _httpMessage, ...
fafc4335     1311     16 ServerResponse: outputEncodings, statusCode, ...


• Used by @izs to debug a nasty node.js leak
• Search for “findjsobjects” (one word) for details
CPU scheduling

 • Problem: occasional latency outliers
 • Analysis: no smoking gun. No slow I/O or locks. Some random
   dispatcher queue latency, but with CPU headroom.



$ prstat -mLc 1
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT   VCX ICX SCL SIG PROCESS/LWPID
 17930 103       21 7.6 0.0 0.0 0.0 53 16 9.1     57K   1 73K   0 beam.smp/265
 17930 103       20 7.0 0.0 0.0 0.0 57 16 0.4     57K   2 70K   0 beam.smp/264
 17930 103       20 7.4 0.0 0.0 0.0 53 18 1.7     63K   0 78K   0 beam.smp/263
 17930 103       19 6.7 0.0 0.0 0.0 60 14 0.4     52K   0 65K   0 beam.smp/266
 17930 103      2.0 0.7 0.0 0.0 0.0 96 1.6 0.0     6K   0 8K    0 beam.smp/267
 17930 103      1.0 0.9 0.0 0.0 0.0 97 0.9 0.0      4   0 47    0 beam.smp/280
[...]
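 • Dispatcher queue latency itself can be quantified with the sched provider;
   a minimal sketch of the standard enqueue/dequeue idiom:

sched:::enqueue
{
    start[args[0]->pr_lwpid, args[1]->pr_pid] = timestamp;
}

sched:::dequeue
/start[args[0]->pr_lwpid, args[1]->pr_pid]/
{
    @["dispatcher queue latency (ns)"] = quantize(timestamp -
        start[args[0]->pr_lwpid, args[1]->pr_pid]);
    start[args[0]->pr_lwpid, args[1]->pr_pid] = 0;
}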
CPU scheduling, cont.

• Unix 101
  Threads: R = ready to run (waiting on a run queue), O = on-CPU

  [Diagram: threads wait on a run queue while one is on-CPU; the scheduler
   preempts the on-CPU thread so that queued threads get to run]
CPU scheduler, cont.

• Unix 102
• TS (and FSS) check for CPU starvation


  [Diagram: a thread suffering CPU starvation at the back of the run queue
   receives a priority promotion so it can get on-CPU]
CPU scheduling, cont.

• Experimentation: run 2 CPU-bound threads, 1 CPU
• Subsecond offset heat maps:
CPU scheduling, cont.

• Experimentation: run 2 CPU-bound threads, 1 CPU
• Subsecond offset heat maps:




                       [heat map annotation: THIS SHOULDN'T HAPPEN]
CPU scheduling, cont.

 • Worst case (4 threads 1 CPU), 44 sec dispq latency
# dtrace -n 'sched:::off-cpu /execname == "burn1"/ { self->s = timestamp; }
 sched:::on-cpu /self->s/ { @["off-cpu (ms)"] =
 lquantize((timestamp - self->s) / 1000000, 0, 100000, 1000); self->s = 0; }'

  off-cpu (ms)
           value     ------------- Distribution ------------- count
             < 0   |                                          0
               0   |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   387184
            1000   |                                          2256
            2000   |                                          1078
            3000   |                                          862
            4000   |                                          1070
            5000   |                                          637
            6000   |                                          535
  [...]
           41000   |                                          3
           42000   |                                          2
           43000   |                                          2
           44000   |                                          1
           45000   |                                          0

  (slide callouts label the distribution from Expected through Bad to
   Inconceivable as the off-CPU time increases)
                        ts_maxwait @pri 59 = 32s, FSS uses ?
CPU scheduling, cont.

• Findings:
   • FSS scheduler class bug:
       • FSS uses a more complex technique to avoid CPU starvation. A thread priority
         could stay high and on-CPU for many seconds before the priority is decayed to
         allow another thread to run.

       • Analyzed (more DTrace) and fixed (thanks Jerry Jelinek)

   • DTrace analysis of the scheduler was invaluable
   • Under (too) high CPU load, your runtime can be bound by how well you
     schedule, not by how quickly you do work

   • Not the only scheduler issue we’ve encountered
CPU scheduling, cont.
• CPU caps to throttle tenants in our cloud
• Experiment: add hot-CPU threads (saturation):
CPU scheduling, cont.
• CPU caps to throttle tenants in our cloud
• Experiment: add hot-CPU threads:




                       [resulting heat map annotated: :-( ]
Visualizing CPU latency

   • Using a node.js ustack helper and the DTrace profile
     provider, we can determine the relative frequency of
     stack backtraces in terms of CPU consumption
   • Stacks can be visualized with flame graphs, a stack
     visualization we developed:
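   • The sampling step is roughly this kind of one-liner (the rate and
     jstack() sizes here are assumptions); the aggregated stacks are then
     folded and rendered by the flame graph tools:

# dtrace -n 'profile-97 /execname == "node" && arg1/ { @[jstack(80, 8192)] = count(); }
    tick-60s { exit(0); }'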
DIRT in production

   • node.js is particularly amenable to the DIRTy apps that
     typify the real-time web
   • The ability to understand latency must be considered
     when deploying node.js-based systems into production!
   • Understanding latency requires dynamic instrumentation
     and novel visualization
   • At Joyent, we have added DTrace-based dynamic
     instrumentation for node.js to SmartOS, and novel
     visualization into our cloud and software offerings
   • Better production support — better observability, better
     debuggability — remains an important area of node.js
     development!
Beyond node.js

   • node.js is adept at connecting components in the
    system; it is unlikely to be the only component!
   • As such, when using node.js to develop a DIRTy app,
    you can expect to spend as much time (if not more!)
    understanding the components as the app
   • When selecting components — operating system, in-
    memory data store, database, distributed data store —
    observability must be a primary consideration!
   • When building a team, look for full-stack engineers —
    DIRTy apps pose a full-stack challenge!
Thank you!

   • @mranney for being an excellent guinea pig customer
   • @dapsays for the V8 DTrace ustack helper and V8
    debugging support
   • More information: http://dtrace.org/blogs/brendan,
     http://dtrace.org/blogs/dap, and http://smartos.org

Mais conteúdo relacionado

Mais procurados

Linux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsLinux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsBrendan Gregg
 
Lisa12 methodologies
Lisa12 methodologiesLisa12 methodologies
Lisa12 methodologiesBrendan Gregg
 
Velocity 2015 linux perf tools
Velocity 2015 linux perf toolsVelocity 2015 linux perf tools
Velocity 2015 linux perf toolsBrendan Gregg
 
DTrace Topics: Introduction
DTrace Topics: IntroductionDTrace Topics: Introduction
DTrace Topics: IntroductionBrendan Gregg
 
Linux Profiling at Netflix
Linux Profiling at NetflixLinux Profiling at Netflix
Linux Profiling at NetflixBrendan Gregg
 
EuroBSDcon 2017 System Performance Analysis Methodologies
EuroBSDcon 2017 System Performance Analysis MethodologiesEuroBSDcon 2017 System Performance Analysis Methodologies
EuroBSDcon 2017 System Performance Analysis MethodologiesBrendan Gregg
 
Linux BPF Superpowers
Linux BPF SuperpowersLinux BPF Superpowers
Linux BPF SuperpowersBrendan Gregg
 
Performance Analysis: new tools and concepts from the cloud
Performance Analysis: new tools and concepts from the cloudPerformance Analysis: new tools and concepts from the cloud
Performance Analysis: new tools and concepts from the cloudBrendan Gregg
 
Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing ToolsBrendan Gregg
 
FreeBSD 2014 Flame Graphs
FreeBSD 2014 Flame GraphsFreeBSD 2014 Flame Graphs
FreeBSD 2014 Flame GraphsBrendan Gregg
 
QCon 2015 Broken Performance Tools
QCon 2015 Broken Performance ToolsQCon 2015 Broken Performance Tools
QCon 2015 Broken Performance ToolsBrendan Gregg
 
Java Performance Analysis on Linux with Flame Graphs
Java Performance Analysis on Linux with Flame GraphsJava Performance Analysis on Linux with Flame Graphs
Java Performance Analysis on Linux with Flame GraphsBrendan Gregg
 
Linux Performance Tools
Linux Performance ToolsLinux Performance Tools
Linux Performance ToolsBrendan Gregg
 
ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016Brendan Gregg
 
Netflix: From Clouds to Roots
Netflix: From Clouds to RootsNetflix: From Clouds to Roots
Netflix: From Clouds to RootsBrendan Gregg
 
Blazing Performance with Flame Graphs
Blazing Performance with Flame GraphsBlazing Performance with Flame Graphs
Blazing Performance with Flame GraphsBrendan Gregg
 
Systems Performance: Enterprise and the Cloud
Systems Performance: Enterprise and the CloudSystems Performance: Enterprise and the Cloud
Systems Performance: Enterprise and the CloudBrendan Gregg
 
OSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPFOSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPFBrendan Gregg
 
Stop the Guessing: Performance Methodologies for Production Systems
Stop the Guessing: Performance Methodologies for Production SystemsStop the Guessing: Performance Methodologies for Production Systems
Stop the Guessing: Performance Methodologies for Production SystemsBrendan Gregg
 

Mais procurados (20)

Linux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsLinux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old Secrets
 
Lisa12 methodologies
Lisa12 methodologiesLisa12 methodologies
Lisa12 methodologies
 
Velocity 2015 linux perf tools
Velocity 2015 linux perf toolsVelocity 2015 linux perf tools
Velocity 2015 linux perf tools
 
DTrace Topics: Introduction
DTrace Topics: IntroductionDTrace Topics: Introduction
DTrace Topics: Introduction
 
Linux Profiling at Netflix
Linux Profiling at NetflixLinux Profiling at Netflix
Linux Profiling at Netflix
 
EuroBSDcon 2017 System Performance Analysis Methodologies
EuroBSDcon 2017 System Performance Analysis MethodologiesEuroBSDcon 2017 System Performance Analysis Methodologies
EuroBSDcon 2017 System Performance Analysis Methodologies
 
Linux BPF Superpowers
Linux BPF SuperpowersLinux BPF Superpowers
Linux BPF Superpowers
 
Performance Analysis: new tools and concepts from the cloud
Performance Analysis: new tools and concepts from the cloudPerformance Analysis: new tools and concepts from the cloud
Performance Analysis: new tools and concepts from the cloud
 
Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing Tools
 
FreeBSD 2014 Flame Graphs
FreeBSD 2014 Flame GraphsFreeBSD 2014 Flame Graphs
FreeBSD 2014 Flame Graphs
 
QCon 2015 Broken Performance Tools
QCon 2015 Broken Performance ToolsQCon 2015 Broken Performance Tools
QCon 2015 Broken Performance Tools
 
Speeding up ps and top
Speeding up ps and topSpeeding up ps and top
Speeding up ps and top
 
Java Performance Analysis on Linux with Flame Graphs
Java Performance Analysis on Linux with Flame GraphsJava Performance Analysis on Linux with Flame Graphs
Java Performance Analysis on Linux with Flame Graphs
 
Linux Performance Tools
Linux Performance ToolsLinux Performance Tools
Linux Performance Tools
 
ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016
 
Netflix: From Clouds to Roots
Netflix: From Clouds to RootsNetflix: From Clouds to Roots
Netflix: From Clouds to Roots
 
Blazing Performance with Flame Graphs
Blazing Performance with Flame GraphsBlazing Performance with Flame Graphs
Blazing Performance with Flame Graphs
 
Systems Performance: Enterprise and the Cloud
Systems Performance: Enterprise and the CloudSystems Performance: Enterprise and the Cloud
Systems Performance: Enterprise and the Cloud
 
OSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPFOSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPF
 
Stop the Guessing: Performance Methodologies for Production Systems
Stop the Guessing: Performance Methodologies for Production SystemsStop the Guessing: Performance Methodologies for Production Systems
Stop the Guessing: Performance Methodologies for Production Systems
 

Semelhante a Real-time in the real world: DIRT in production

Reconsider TCPdump for Modern Troubleshooting
Reconsider TCPdump for Modern TroubleshootingReconsider TCPdump for Modern Troubleshooting
Reconsider TCPdump for Modern TroubleshootingAvi Networks
 
Abandon Decades-Old TCPdump for Modern Troubleshooting
Abandon Decades-Old TCPdump for Modern TroubleshootingAbandon Decades-Old TCPdump for Modern Troubleshooting
Abandon Decades-Old TCPdump for Modern TroubleshootingAvi Networks
 
OpenZFS data-driven performance
OpenZFS data-driven performanceOpenZFS data-driven performance
OpenZFS data-driven performanceahl0003
 
Tuning TCP and NGINX on EC2
Tuning TCP and NGINX on EC2Tuning TCP and NGINX on EC2
Tuning TCP and NGINX on EC2Chartbeat
 
Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Andriy Berestovskyy
 
Dccp evaluation for sip signaling ict4 m
Dccp evaluation for sip signaling   ict4 m Dccp evaluation for sip signaling   ict4 m
Dccp evaluation for sip signaling ict4 m Agus Awaludin
 
Handy Networking Tools and How to Use Them
Handy Networking Tools and How to Use ThemHandy Networking Tools and How to Use Them
Handy Networking Tools and How to Use ThemSneha Inguva
 
(NET404) Making Every Packet Count
(NET404) Making Every Packet Count(NET404) Making Every Packet Count
(NET404) Making Every Packet CountAmazon Web Services
 
AWS re:Invent 2016: Making Every Packet Count (NET404)
AWS re:Invent 2016: Making Every Packet Count (NET404)AWS re:Invent 2016: Making Every Packet Count (NET404)
AWS re:Invent 2016: Making Every Packet Count (NET404)Amazon Web Services
 
Please help with the below 3 questions, the python script is at the.pdf
Please help with the below 3  questions, the python script is at the.pdfPlease help with the below 3  questions, the python script is at the.pdf
Please help with the below 3 questions, the python script is at the.pdfsupport58
 
XDP in Practice: DDoS Mitigation @Cloudflare
XDP in Practice: DDoS Mitigation @CloudflareXDP in Practice: DDoS Mitigation @Cloudflare
XDP in Practice: DDoS Mitigation @CloudflareC4Media
 
Chapter 3. sensors in the network domain
Chapter 3. sensors in the network domainChapter 3. sensors in the network domain
Chapter 3. sensors in the network domainPhu Nguyen
 
Linux Systems Performance 2016
Linux Systems Performance 2016Linux Systems Performance 2016
Linux Systems Performance 2016Brendan Gregg
 
Ultra fast DDoS Detection with FastNetMon at Coloclue (AS 8283)
Ultra	fast	DDoS Detection	with	FastNetMon at	 Coloclue	(AS	8283)Ultra	fast	DDoS Detection	with	FastNetMon at	 Coloclue	(AS	8283)
Ultra fast DDoS Detection with FastNetMon at Coloclue (AS 8283)Pavel Odintsov
 
DPDK layer for porting IPS-IDS
DPDK layer for porting IPS-IDSDPDK layer for porting IPS-IDS
DPDK layer for porting IPS-IDSVipin Varghese
 
Troubleshooting TCP/IP
Troubleshooting TCP/IPTroubleshooting TCP/IP
Troubleshooting TCP/IPvijai s
 
Sobanski odl summit_2015
Sobanski odl summit_2015Sobanski odl summit_2015
Sobanski odl summit_2015John Sobanski
 
Socket programming using C
Socket programming using CSocket programming using C
Socket programming using CAjit Nayak
 
DIY Internet: Snappy, Secure Networking with MinimaLT (JSConf EU 2013)
DIY Internet: Snappy, Secure Networking with MinimaLT (JSConf EU 2013)DIY Internet: Snappy, Secure Networking with MinimaLT (JSConf EU 2013)
DIY Internet: Snappy, Secure Networking with MinimaLT (JSConf EU 2013)Igalia
 

Semelhante a Real-time in the real world: DIRT in production (20)

Reconsider TCPdump for Modern Troubleshooting
Reconsider TCPdump for Modern TroubleshootingReconsider TCPdump for Modern Troubleshooting
Reconsider TCPdump for Modern Troubleshooting
 
Abandon Decades-Old TCPdump for Modern Troubleshooting
Abandon Decades-Old TCPdump for Modern TroubleshootingAbandon Decades-Old TCPdump for Modern Troubleshooting
Abandon Decades-Old TCPdump for Modern Troubleshooting
 
OpenZFS data-driven performance
OpenZFS data-driven performanceOpenZFS data-driven performance
OpenZFS data-driven performance
 
Tuning TCP and NGINX on EC2
Tuning TCP and NGINX on EC2Tuning TCP and NGINX on EC2
Tuning TCP and NGINX on EC2
 
Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)
 
Dccp evaluation for sip signaling ict4 m
Dccp evaluation for sip signaling   ict4 m Dccp evaluation for sip signaling   ict4 m
Dccp evaluation for sip signaling ict4 m
 
Handy Networking Tools and How to Use Them
Handy Networking Tools and How to Use ThemHandy Networking Tools and How to Use Them
Handy Networking Tools and How to Use Them
 
(NET404) Making Every Packet Count
(NET404) Making Every Packet Count(NET404) Making Every Packet Count
(NET404) Making Every Packet Count
 
AWS re:Invent 2016: Making Every Packet Count (NET404)
AWS re:Invent 2016: Making Every Packet Count (NET404)AWS re:Invent 2016: Making Every Packet Count (NET404)
AWS re:Invent 2016: Making Every Packet Count (NET404)
 
Please help with the below 3 questions, the python script is at the.pdf
Please help with the below 3  questions, the python script is at the.pdfPlease help with the below 3  questions, the python script is at the.pdf
Please help with the below 3 questions, the python script is at the.pdf
 
A22 Introduction to DTrace by Kyle Hailey
A22 Introduction to DTrace by Kyle HaileyA22 Introduction to DTrace by Kyle Hailey
A22 Introduction to DTrace by Kyle Hailey
 
XDP in Practice: DDoS Mitigation @Cloudflare
XDP in Practice: DDoS Mitigation @CloudflareXDP in Practice: DDoS Mitigation @Cloudflare
XDP in Practice: DDoS Mitigation @Cloudflare
 
Chapter 3. sensors in the network domain
Chapter 3. sensors in the network domainChapter 3. sensors in the network domain
Chapter 3. sensors in the network domain
 
Linux Systems Performance 2016
Linux Systems Performance 2016Linux Systems Performance 2016
Linux Systems Performance 2016
 
Ultra fast DDoS Detection with FastNetMon at Coloclue (AS 8283)
Ultra	fast	DDoS Detection	with	FastNetMon at	 Coloclue	(AS	8283)Ultra	fast	DDoS Detection	with	FastNetMon at	 Coloclue	(AS	8283)
Ultra fast DDoS Detection with FastNetMon at Coloclue (AS 8283)
 
DPDK layer for porting IPS-IDS
DPDK layer for porting IPS-IDSDPDK layer for porting IPS-IDS
DPDK layer for porting IPS-IDS
 
Troubleshooting TCP/IP
Troubleshooting TCP/IPTroubleshooting TCP/IP
Troubleshooting TCP/IP
 
Sobanski odl summit_2015
Sobanski odl summit_2015Sobanski odl summit_2015
Sobanski odl summit_2015
 
Socket programming using C
Socket programming using CSocket programming using C
Socket programming using C
 
DIY Internet: Snappy, Secure Networking with MinimaLT (JSConf EU 2013)
DIY Internet: Snappy, Secure Networking with MinimaLT (JSConf EU 2013)DIY Internet: Snappy, Secure Networking with MinimaLT (JSConf EU 2013)
DIY Internet: Snappy, Secure Networking with MinimaLT (JSConf EU 2013)
 

Mais de bcantrill

Predicting the Present
Predicting the PresentPredicting the Present
Predicting the Presentbcantrill
 
Sharpening the Axe: The Primacy of Toolmaking
Sharpening the Axe: The Primacy of ToolmakingSharpening the Axe: The Primacy of Toolmaking
Sharpening the Axe: The Primacy of Toolmakingbcantrill
 
Coming of Age: Developing young technologists without robbing them of their y...
Coming of Age: Developing young technologists without robbing them of their y...Coming of Age: Developing young technologists without robbing them of their y...
Coming of Age: Developing young technologists without robbing them of their y...bcantrill
 
I have come to bury the BIOS, not to open it: The need for holistic systems
I have come to bury the BIOS, not to open it: The need for holistic systemsI have come to bury the BIOS, not to open it: The need for holistic systems
I have come to bury the BIOS, not to open it: The need for holistic systemsbcantrill
 
Towards Holistic Systems
Towards Holistic SystemsTowards Holistic Systems
Towards Holistic Systemsbcantrill
 
The Coming Firmware Revolution
The Coming Firmware RevolutionThe Coming Firmware Revolution
The Coming Firmware Revolutionbcantrill
 
Hardware/software Co-design: The Coming Golden Age
Hardware/software Co-design: The Coming Golden AgeHardware/software Co-design: The Coming Golden Age
Hardware/software Co-design: The Coming Golden Agebcantrill
 
Tockilator: Deducing Tock execution flows from Ibex Verilator traces
Tockilator: Deducing Tock execution flows from Ibex Verilator tracesTockilator: Deducing Tock execution flows from Ibex Verilator traces
Tockilator: Deducing Tock execution flows from Ibex Verilator tracesbcantrill
 
No Moore Left to Give: Enterprise Computing After Moore's Law
No Moore Left to Give: Enterprise Computing After Moore's LawNo Moore Left to Give: Enterprise Computing After Moore's Law
No Moore Left to Give: Enterprise Computing After Moore's Lawbcantrill
 
Andreessen's Corollary: Ethical Dilemmas in Software Engineering
Andreessen's Corollary: Ethical Dilemmas in Software EngineeringAndreessen's Corollary: Ethical Dilemmas in Software Engineering
Andreessen's Corollary: Ethical Dilemmas in Software Engineeringbcantrill
 
Visualizing Systems with Statemaps
Visualizing Systems with StatemapsVisualizing Systems with Statemaps
Visualizing Systems with Statemapsbcantrill
 
Platform values, Rust, and the implications for system software
Platform values, Rust, and the implications for system softwarePlatform values, Rust, and the implications for system software
Platform values, Rust, and the implications for system softwarebcantrill
 
Is it time to rewrite the operating system in Rust?
Is it time to rewrite the operating system in Rust?Is it time to rewrite the operating system in Rust?
Is it time to rewrite the operating system in Rust?bcantrill
 
dtrace.conf(16): DTrace state of the union
dtrace.conf(16): DTrace state of the uniondtrace.conf(16): DTrace state of the union
dtrace.conf(16): DTrace state of the unionbcantrill
 
The Hurricane's Butterfly: Debugging pathologically performing systems
The Hurricane's Butterfly: Debugging pathologically performing systemsThe Hurricane's Butterfly: Debugging pathologically performing systems
Real-time in the real world: DIRT in production

  • 1. Real-time in the real world: DIRT in production. Bryan Cantrill (SVP, Engineering, bryan@joyent.com, @bcantrill) and Brendan Gregg (Lead Performance Engineer, brendan@joyent.com, @brendangregg)
  • 2. Previously, on #surgecon... • Two years ago at Surge, we described the emergence of real-time data semantics in web-facing applications • We dubbed this data-intensive real-time (DIRT) • Last year at Surge 2011, we presented experiences building a DIRTy system of our own — a facility for real-time analytics of latency in the cloud • While this system is interesting, it is somewhat synthetic in nature in that it does not need to scale (much) with respect to users...
  • 3. #surgecon 2012 • Accelerated by the rise of mobile applications, DIRTy systems are becoming increasingly common • In the past year, we’ve seen apps in production at scale • There are many examples of this, but for us, a paragon of the emerging DIRTy apps has been Voxer • Voxer is a push-to-talk mobile app that can be thought of as the confluence of voice mail and SMS • A canonical DIRTy app: latency and scale both matter! • Our experiences debugging latency bubbles with Voxer over the past year have taught us quite a bit about the new challenges that DIRTy apps pose...
  • 4. The challenge of DIRTy apps • DIRTy applications tend to have the human in the loop • Good news: deadlines are soft — microseconds only matter when they add up to tens of milliseconds • Bad news: because humans are in the loop, demand for the system can be non-linear • One must deal not only with the traditional challenge of scalability, but also the challenge of a real-time system • Worse, emerging DIRTy apps have mobile devices at their edge — network transience makes clients seem ill-behaved with respect to connection state!
  • 5. The lessons of DIRTy apps • Many latency bubbles originate deep in the stack; OS understanding and instrumentation have been essential even when the OS is not at fault • For up-stack problems, tooling has been essential • Latency outliers can come from many sources: application restarts, dropped connections, slow disks, boundless memory growth • We have also seen some traditional real-time problems with respect to CPU scheduling, e.g. priority inversions • Enough foreplay; on with the DIRTy disaster pr0n!
  • 6. Application restarts • Modern internet-facing architectures are designed to be resilient with respect to many failure modes… • ...but application restarts can induce pathological, cascading latency bubbles, as clients reconnect, clusters reconverge, etc. • For example, Voxer ran into a node.js bug where it would terminate on ECONNABORTED from accept(2) • Classic difference in OS semantics: BSD and illumos variants (including SmartOS) do this; Linux doesn’t • Much more likely over a transient network!
  • 7. Dropped connections • If an application can’t keep up with TCP backlog, packets (SYNs) are dropped: $ netstat -s | grep Drop tcpTimRetransDrop = 56 tcpTimKeepalive = 2582 tcpTimKeepaliveProbe= 1594 tcpTimKeepaliveDrop = 41 tcpListenDrop =3089298 tcpListenDropQ0 = 0 tcpHalfOpenDrop = 0 tcpOutSackRetrans =1400832 icmpOutDrops = 0 icmpOutErrors = 0 sctpTimRetrans = 0 sctpTimRetransDrop = 0 sctpTimHearBeatProbe= 0 sctpTimHearBeatDrop = 0 sctpListenDrop = 0 sctpInClosed = 0 • Client waits, then retransmits (after 1 or 3 seconds), inducing tremendous latency outliers; terrible for DIRTy apps!
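  One quick way to watch these drops accumulate in real time is to poll the counter itself; a minimal sketch for illumos/SmartOS (the kstat name mirrors the netstat counter above, but verify the module and instance on your system):

      $ kstat -p tcp:0:tcp:tcpListenDrop 1    # print the counter each second; a rising value means SYNs are being dropped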
  • 8. Dropped connections, cont. • The fix for dropped connections: • If due to a surge, increase the TCP backlog • If due to sustained load, increase CPU resources, decrease CPU consumption, or scale the app • If fixed by increasing the TCP backlog, check that the system backlog tunable took effect! • If not, does the app need to be restarted? • If not, is the application providing its own backlog that is taking precedence? • How close are we to dropping?
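  As a sketch of the "did the tunable take effect?" check on illumos/SmartOS (tunable names vary by platform and release, and the application's own listen() backlog argument can still cap the effective value):

      # ndd -get /dev/tcp tcp_conn_req_max_q      # current ceiling on the listen backlog
      # ndd -get /dev/tcp tcp_conn_req_max_q0     # ceiling on the half-open (SYN) queue
      # ndd -set /dev/tcp tcp_conn_req_max_q 2048 # raise the ceiling; the app must still listen() with a large enough backlog

  The per-socket maximum actually in effect (max_q) can then be confirmed with the DTrace script on the following slides.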
  • 9. Dropped connections, cont. • Networking 101 [diagram: incoming SYNs queue on the TCP listen backlog (up to max) until the app calls accept(); when the backlog is full, further SYNs are listen-dropped]
  • 10. Dropped connections, cont. The kernel code (usr/src/uts/common/inet/tcp/tcp_input.c):

      /*
       * THIS FUNCTION IS DIRECTLY CALLED BY IP VIA SQUEUE FOR SYN.
       * tcp_input_data will not see any packets for listeners since the listener
       * has conn_recv set to tcp_input_listener.
       */
      /* ARGSUSED */
      static void
      tcp_input_listener(void *arg, mblk_t *mp, void *arg2, ip_recv_attr_t *ira)
      {
      [...]
              if (listener->tcp_conn_req_cnt_q >= listener->tcp_conn_req_max) {
                      mutex_exit(&listener->tcp_eager_lock);
                      TCP_STAT(tcps, tcp_listendrop);
                      TCPS_BUMP_MIB(tcps, tcpListenDrop);
                      if (lconnp->conn_debug) {
                              (void) strlog(TCP_MOD_ID, 0, 1, SL_TRACE|SL_ERROR,
                                  "tcp_input_listener: listen backlog (max=%d) "
                                  "overflow (%d pending) on %s",
                                  listener->tcp_conn_req_max,
                                  listener->tcp_conn_req_cnt_q,
                                  tcp_display(listener, NULL, DISP_PORT_ONLY));
                      }
                      goto error2;
              }
      [...]
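  Since this is the code path that bumps tcpListenDrop, the DTrace mib provider gives a cheap live count of drops as they happen (a minimal sketch; the probe names mirror the netstat counters shown earlier):

      # dtrace -n 'mib:::tcpListenDrop, mib:::tcpListenDropQ0 { @[probename] = count(); }'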
  • 11. Dropped connections, cont. SEE ALL THE THINGS! tcp_conn_req_cnt_q distributions: cpid:3063 max_q:8 value ------------- Distribution ------------- count -1 | 0 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 1 | 0 cpid:11504 Text max_q:128 value ------------- Distribution ------------- count -1 | 0 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 7279 1 |@@ 405 2 |@ 255 4 |@ 138 8 | 81 16 | 83 32 | 62 64 | 67 128 | 34 256 | 0 tcpListenDrops: cpid:11504 max_q:128 34
  • 12. Dropped connections, cont. • Uses DTrace to get a distribution of the TCP backlog queue length on SYN arrival; max_q is the listener's backlog maximum, keyed per process:

      fbt::tcp_input_listener:entry
      {
          this->connp = (conn_t *)arg0;
          this->tcp = (tcp_t *)this->connp->conn_proto_priv.cp_tcp;
          self->max = strjoin("max_q:", lltostr(this->tcp->tcp_conn_req_max));
          self->pid = strjoin("cpid:", lltostr(this->connp->conn_cpid));
          @[self->pid, self->max] = quantize(this->tcp->tcp_conn_req_cnt_q);
      }

      mib:::tcpListenDrop
      {
          this->max = self->max;
          this->pid = self->pid;
          this->max = this->max != NULL ? this->max : "<null>";
          this->pid = this->pid != NULL ? this->pid : "<null>";
          @drops[this->pid, this->max] = count();
          printf("%Y %s:%s %s\n", walltimestamp, probefunc, probename, this->pid);
      }

  • Script is on http://github.com/brendangregg/dtrace-cloud-tools as net/tcpconnreqmaxq-pid*.d
  • 13. Dropped connections, cont. Or, snoop each drop: # ./tcplistendrop.d TIME SRC-IP PORT DST-IP PORT 2012 Jan 19 01:22:49 10.17.210.103 25691 -> 192.192.240.212 80 2012 Jan 19 01:22:49 10.17.210.108 18423 -> 192.192.240.212 80 2012 Jan 19 01:22:49 10.17.210.116 38883 -> 192.192.240.212 80 2012 Jan 19 01:22:49 10.17.210.117 10739 -> 192.192.240.212 80 2012 Jan 19 01:22:49 10.17.210.112 27988 -> 192.192.240.212 80 2012 Jan 19 01:22:49 10.17.210.106 28824 -> 192.192.240.212 80 2012 Jan 19 01:22:49 10.12.143.16 65070 -> 192.192.240.212 80 2012 Jan 19 01:22:49 10.17.210.100 56392 -> 192.192.240.212 80 2012 Jan 19 01:22:49 10.17.210.99 24628 -> 192.192.240.212 80 2012 Jan 19 01:22:49 10.17.210.98 11686 -> 192.192.240.212 80 2012 Jan 19 01:22:49 10.17.210.101 34629 -> 192.192.240.212 80 [...]
  • 14. Dropped connections, cont. • That code parsed IP and TCP headers from the in-kernel packet buffer:

      fbt::tcp_input_listener:entry  { self->mp = args[1]; }
      fbt::tcp_input_listener:return { self->mp = 0; }

      mib:::tcpListenDrop
      /self->mp/
      {
          this->iph = (ipha_t *)self->mp->b_rptr;
          this->tcph = (tcph_t *)(self->mp->b_rptr + 20);
          printf("%-20Y %-18s %-5d -> %-18s %-5d\n", walltimestamp,
              inet_ntoa(&this->iph->ipha_src),
              ntohs(*(uint16_t *)this->tcph->th_lport),
              inet_ntoa(&this->iph->ipha_dst),
              ntohs(*(uint16_t *)this->tcph->th_fport));
      }

  • Script is tcplistendrop*.d, also on github
  • 15. Dropped connections, cont. • To summarize, dropped connections induce acute latency bubbles • With Voxer, found that failures often cascaded: high CPU utilization due to unrelated issues will induce TCP listen drops • Tunables don’t always take effect: need confirmation • Having a quick tool to check scalability issues (DTrace) has been invaluable
  • 16. Slow disks • Slow I/O in a cloud computing environment can be caused by multi-tenancy — which is to say, neighbors: • Neighbor running a backup • Neighbor running a benchmark • Neighbors can’t be seen by tenants... • ...but is it really a neighbor?
  • 17. Slow disks, cont. • Unix 101 [diagram of the I/O stack, top to bottom: Process, Syscall Interface, VFS, ZFS (and other filesystems), Block Device Interface, Disks]
  • 18. Slow disks, cont. • Unix 101 [same I/O stack diagram, annotated: the syscall/VFS level is synchronous with the process, while iostat(1) observes the block device level, which is often asynchronous due to write buffering and read-ahead]
  • 19. Slow disks, cont. • VFS-level-iostat: vfsstat # vfsstat -Z 1 r/s w/s kr/s kw/s ractv wactv read_t writ_t %r %w d/s del_t zone 1.2 2.8 0.6 0.2 0.0 0.0 0.0 0.0 0 0 0.0 0.0 global (0) 0.1 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0 0 0.0 34.9 9cc2d0d3 (2) 0.1 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0 0 0.0 46.5 72188ca0 (3) 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.0 16.5 4d2a62bb (4) 0.3 0.1 0.1 0.3 0.0 0.0 0.0 0.0 0 0 0.0 27.6 8bbc4000 (5) 5.9 0.2 0.5 0.1 0.0 0.0 0.0 0.0 0 0 5.0 11.3 d305ee44 (6) 0.1 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0 0 0.0 132.0 9897c8f5 (7) 0.1 0.0 0.1 0.0 0.0 0.0 0.0 0.1 0 0 0.0 40.7 5f3c7d9e (9) 0.2 0.8 0.5 0.6 0.0 0.0 0.0 0.0 0 0 0.0 31.9 22ef87fc (10) • Kernel changes, new kstats (thanks Bill Pijewski)
  • 20. Slow disks, cont. • zfsslower.d: # ./zfsslower.d 10 TIME PROCESS D B ms FILE 2012 Sep 27 13:45:33 zlogin W 372 11 /zones/b8b2464c/var/adm/wtmpx 2012 Sep 27 13:45:36 bash R 8 14 /zones/b8b2464c/opt/local/bin/zsh 2012 Sep 27 13:45:58 mysqld R 1048576 19 /zones/b8b2464c/var/mysql/ibdata1 2012 Sep 27 13:45:58 mysqld R 1048576 22 /zones/b8b2464c/var/mysql/ibdata1 2012 Sep 27 13:46:14 master R 8 6 /zones/b8b2464c/root/opt/local/ libexec/postfix/qmgr 2012 Sep 27 13:46:14 master R 4096 5 /zones/b8b2464c/root/opt/local/etc/ postfix/master.cf [...] • Go-to tool. Are there VFS-level I/O > 10ms? (arg) • Stupidly easy to do
  • 21. Slow disks, cont. • Written in DTrace

      [...]
      fbt::zfs_read:entry, fbt::zfs_write:entry
      {
          self->path = args[0]->v_path;
          self->kb = args[1]->uio_resid / 1024;
          self->start = timestamp;
      }

      fbt::zfs_read:return, fbt::zfs_write:return
      /self->start && (timestamp - self->start) >= min_ns/
      {
          this->iotime = (timestamp - self->start) / 1000000;
          this->dir = probefunc == "zfs_read" ? "R" : "W";
          printf("%-20Y %-16s %1s %4d %6d %s\n", walltimestamp,
              execname, this->dir, self->kb, this->iotime,
              self->path != NULL ? stringof(self->path) : "<null>");
      }
      [...]

  • zfsslower.d, also on github, originated from the DTrace book
  • 22. Slow disks, cont. • Traces VFS/ZFS interface (kernel) from usr/src/uts/common/fs/zfs/zfs_vnops.c:

      /*
       * Regular file vnode operations template
       */
      vnodeops_t *zfs_fvnodeops;
      const fs_operation_def_t zfs_fvnodeops_template[] = {
          VOPNAME_OPEN,    { .vop_open = zfs_open },
          VOPNAME_CLOSE,   { .vop_close = zfs_close },
          VOPNAME_READ,    { .vop_read = zfs_read },
          VOPNAME_WRITE,   { .vop_write = zfs_write },
          VOPNAME_IOCTL,   { .vop_ioctl = zfs_ioctl },
          VOPNAME_GETATTR, { .vop_getattr = zfs_getattr },
      [...]
  • 23. Slow disks, cont. • Unix 101 [I/O stack diagram again: zfsslower.d traces at the VFS/ZFS interface, iosnoop traces at the Block Device Interface; correlate the two layers]
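  As a companion to iosnoop at the block device layer, one hedged approach is a latency distribution from the DTrace io provider (a minimal sketch, keyed on the buf pointer in arg0):

      # dtrace -n 'io:::start { start[arg0] = timestamp; }
          io:::done /start[arg0]/ {
              @["block I/O latency (us)"] = quantize((timestamp - start[arg0]) / 1000);
              start[arg0] = 0;
          }'

  If the VFS-level outliers from zfsslower.d do not appear here, the latency is being added above the block device layer.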
  • 24. Slow disks, cont. • Correlating the layers narrows the latency location • Or you can associate in the same D script • Via text, filtering on slow I/O, works fine • For high frequency I/O, heat maps
  • 25. Slow disks, cont. • WHAT DOES IT MEAN?
  • 26. Slow disks, cont. • Latency outliers:
  • 27. Slow disks, cont. • Latency outliers: [latency heat map with annotated bands, from worst to best: Inconceivable, Very Bad, Bad, Good]
  • 28. Slow disks, cont. • Inconceivably bad, 1000+ms VFS-level latency: • Queueing behind large ZFS SPA syncs (tunable) • Other tenants benchmarking (before we added I/O throttling to SmartOS) • Reads queueing behind writes; needed to tune ZFS and the LSI PERC (shakes fist!) [heat map: read = red, write = blue; y-axis latency (around 60ms shown), x-axis time (s)]
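  To test the SPA-sync hypothesis directly, one sketch is to time spa_sync() in the kernel with fbt probes (fbt instruments private functions, so names and behavior can vary by release):

      # dtrace -n 'fbt::spa_sync:entry { self->ts = timestamp; }
          fbt::spa_sync:return /self->ts/ {
              @["spa_sync (ms)"] = quantize((timestamp - self->ts) / 1000000);
              self->ts = 0;
          }'

  Long spa_sync times that line up with the VFS-level outliers support the queueing-behind-syncs explanation.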
  • 29. Slow disks, cont. • Deeper tools rolled as needed. Anywhere in ZFS. # dtrace -n 'io:::start { @[stack()] = count(); }' dtrace: description 'io:::start ' matched 6 probes ^C genunix`ldi_strategy+0x53 zfs`vdev_disk_io_start+0xcc zfs`zio_vdev_io_start+0xab zfs`zio_execute+0x88 zfs`zio_nowait+0x21 zfs`vdev_mirror_io_start+0xcd zfs`zio_vdev_io_start+0x250 zfs`zio_execute+0x88 zfs`zio_nowait+0x21 zfs`arc_read_nolock+0x4f9 zfs`arc_read+0x96 zfs`dsl_read+0x44 zfs`dbuf_read_impl+0x166 zfs`dbuf_read+0xab zfs`dmu_buf_hold_array_by_dnode+0x189 zfs`dmu_buf_hold_array+0x78 zfs`dmu_read_uio+0x5c zfs`zfs_read+0x1a3 genunix`fop_read+0x8b genunix`read+0x2a7 143
  • 30. Slow disks, cont. • On Joyent’s IaaS architecture, it’s usually not the disks or filesystem; useful to rule that out quickly • Some of the time it is, due to bad disks (1000+ms I/O); heat map or iosnoop correlation matches • Some of the time it’s due to big I/O (how quick is a 40 Mbyte read from cache?) • Some of the time it is other tenants (benchmarking!); much less for us now with ZFS I/O throttling • With ZFS and an SSD-based intent log, HW RAID is not just unobservable, but entirely unnecessary — adios PERC!
  • 31. Memory growth • Riak had endless memory growth • Expected 9GB, after two days: $ prstat -c 1 Please wait... PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP 21722 103 43G 40G cpu0 59 0 72:23:41 2.6% beam.smp/594 15770 root 7760K 540K sleep 57 0 23:28:57 0.9% zoneadmd/5 95 root 0K 0K sleep 99 -20 7:37:47 0.2% zpool-zones/166 12827 root 128M 73M sleep 100 - 0:49:36 0.1% node/5 10319 bgregg 10M 6788K sleep 59 0 0:00:00 0.0% sshd/1 10402 root 22M 288K sleep 59 0 0:18:45 0.0% dtrace/1 [...] • Eventually hits paging and terrible performance, needing a restart • Remember, application restarts are a latency disaster!
  • 32. Memory growth, cont. • What is in the heap? $ pmap 14719 14719: beam.smp 0000000000400000 2168K r-x-- /opt/riak/erts-5.8.5/bin/beam.smp 000000000062D000 328K rw--- /opt/riak/erts-5.8.5/bin/beam.smp 000000000067F000 4193540K rw--- /opt/riak/erts-5.8.5/bin/beam.smp 00000001005C0000 4194296K rw--- [ anon ] 00000002005BE000 4192016K rw--- [ anon ] 0000000300382000 4193664K rw--- [ anon ] 00000004002E2000 4191172K rw--- [ anon ] 00000004FFFD3000 4194040K rw--- [ anon ] 00000005FFF91000 4194028K rw--- [ anon ] 00000006FFF4C000 4188812K rw--- [ anon ] 00000007FF9EF000 588224K rw--- [ heap ] [...] • ... and why does it keep growing?
  • 33. Memory growth, cont. • Is this a memory leak? • In the app logic: Voxer? • In the DB logic: Riak? • In the DB's Erlang VM? • In the OS libraries? • In the OS kernel? • Or application growth? • Where would you guess? [diagram of the software stack, top to bottom: Voxer, Riak, Erlang VM, libc and other libraries, kernel]
  • 34. Memory growth, cont. • Voxer (App): don’t think it’s us • Basho (Riak): don’t think it’s us • Joyent (OS): don’t think it’s us • This sort of issue is usually app growth... • ...but we can check libs & kernel to be sure
  • 35. Memory growth, cont. • libumem was in use for allocations • fast, scalable, object-caching, multi-threaded support • user-land version of kmem (slab allocator, Bonwick)
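  A quick sanity check that the process is actually running on libumem (a sketch; beam.smp is the Riak Erlang VM from the earlier prstat output, and pgrep -ox picks the oldest exact match):

      $ pldd $(pgrep -ox beam.smp) | grep libumem    # lists loaded libraries; expect libumem.so.1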
  • 36. Memory growth, cont. • Fix by experimentation (backend=mmap, other allocators) wasn’t working. • Detailed observability can be enabled in libumem, allowing heap profiling and leak detection • While designed with speed and production use in mind, it still comes with some cost (time and space), and isn’t on by default: restart required. • UMEM_DEBUG=audit
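  A sketch of enabling it for a test restart, then checking for leaks with mdb; UMEM_DEBUG and UMEM_LOGGING are standard libumem controls, but the riak start command and whether the environment reaches beam.smp are assumptions about the deployment:

      $ UMEM_DEBUG=audit UMEM_LOGGING=transaction riak start
      ...
      # mdb -p $(pgrep -ox beam.smp)    # note: stops the target while attached
      > ::findleaks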
  • 37. Memory growth, cont. • libumem provides some default observability • Eg, slabs: > ::umem_malloc_info CACHE BUFSZ MAXMAL BUFMALLC AVG_MAL MALLOCED OVERHEAD %OVER 0000000000707028 8 0 0 0 0 0 0.0% 000000000070b028 16 8 8730 8 69836 1054998 1510.6% 000000000070c028 32 16 8772 16 140352 1130491 805.4% 000000000070f028 48 32 1148038 25 29127788 156179051 536.1% 0000000000710028 64 48 344138 40 13765658 58417287 424.3% 0000000000711028 80 64 36 62 2226 4806 215.9% 0000000000714028 96 80 8934 79 705348 1168558 165.6% 0000000000715028 112 96 1347040 87 117120208 190389780 162.5% 0000000000718028 128 112 253107 111 28011923 42279506 150.9% 000000000071a028 160 144 40529 118 4788681 6466801 135.0% 000000000071b028 192 176 140 155 21712 25818 118.9% 000000000071e028 224 208 43 188 8101 6497 80.1% 000000000071f028 256 240 133 229 30447 26211 86.0% 0000000000720028 320 304 56 276 15455 12276 79.4% 0000000000723028 384 368 35 335 11726 7220 61.5% [...]
  • 38. Memory growth, cont. • ... and heap (captured @14GB RSS): > ::vmem ADDR NAME INUSE TOTAL SUCCEED FAIL fffffd7ffebed4a0 sbrk_top 9090404352 14240165888 4298117 84403 fffffd7ffebee0a8 sbrk_heap 9090404352 9090404352 4298117 0 fffffd7ffebeecb0 vmem_internal 664616960 664616960 79621 0 fffffd7ffebef8b8 vmem_seg 651993088 651993088 79589 0 fffffd7ffebf04c0 vmem_hash 12583424 12587008 27 0 fffffd7ffebf10c8 vmem_vmem 46200 55344 15 0 00000000006e7000 umem_internal 352862464 352866304 88746 0 00000000006e8000 umem_cache 113696 180224 44 0 00000000006e9000 umem_hash 13091328 13099008 86 0 00000000006ea000 umem_log 0 0 0 0 00000000006eb000 umem_firewall_va 0 0 0 0 00000000006ec000 umem_firewall 0 0 0 0 00000000006ed000 umem_oversize 5218777974 5520789504 3822051 0 00000000006f0000 umem_memalign 0 0 0 0 0000000000706000 umem_default 2552131584 2552131584 307699 0 • The heap is 9 GB (as expected), but sbrk_top total is 14 GB (equal to RSS). And growing. • Are there Gbyte-sized malloc()/free()s?
  • 39. Memory growth, cont. # dtrace -n 'pid$target::malloc:entry { @ = quantize(arg0); }' -p 17472 dtrace: description 'pid$target::malloc:entry ' matched 3 probes ^C value ------------- Distribution ------------- count 2 | 0 4 | 3 8 |@ 5927 16 |@@@@ 41818 32 |@@@@@@@@@ 81991 64 |@@@@@@@@@@@@@@@@@@ 169888 128 |@@@@@@@ 69891 256 | 2257 512 | 406 1024 | 893 2048 | 146 4096 | 1467 8192 | 755 16384 | 950 32768 | 83 65536 | 31 131072 | 11 262144 | 15 524288 | 0 1048576 | 1 2097152 | 0 • No huge malloc()s, but RSS continues to climb.
  • 40. Memory growth, cont. • Tracing why the heap grows via brk(): # dtrace -n 'syscall::brk:entry /execname == "beam.smp"/ { ustack(); }' dtrace: description 'syscall::brk:entry ' matched 1 probe CPU ID FUNCTION:NAME 10 18 brk:entry libc.so.1`_brk_unlocked+0xa libumem.so.1`vmem_sbrk_alloc+0x84 libumem.so.1`vmem_xalloc+0x669 libumem.so.1`vmem_alloc+0x14f libumem.so.1`vmem_xalloc+0x669 libumem.so.1`vmem_alloc+0x14f libumem.so.1`umem_alloc+0x72 libumem.so.1`malloc+0x59 libstdc++.so.6.0.14`_Znwm+0x20 libstdc++.so.6.0.14`_Znam+0x9 eleveldb.so`_ZN7leveldb9ReadBlockEPNS_16RandomAccessFileERKNS_11Rea... eleveldb.so`_ZN7leveldb5Table11BlockReaderEPvRKNS_11ReadOptionsERKN... eleveldb.so`_ZN7leveldb12_GLOBAL__N_116TwoLevelIterator13InitDataBl... eleveldb.so`_ZN7leveldb12_GLOBAL__N_116TwoLevelIterator4SeekERKNS_5... eleveldb.so`_ZN7leveldb12_GLOBAL__N_116TwoLevelIterator4SeekERKNS_5... eleveldb.so`_ZN7leveldb12_GLOBAL__N_115MergingIterator4SeekERKNS_5S... eleveldb.so`_ZN7leveldb12_GLOBAL__N_16DBIter4SeekERKNS_5SliceE+0xcc eleveldb.so`eleveldb_get+0xd3 beam.smp`process_main+0x6939 beam.smp`sched_thread_func+0x1cf beam.smp`thr_wrapper+0xbe
  • 41. Memory growth, cont. • More DTrace showed the size of the malloc()s causing the brk()s: # dtrace -x dynvarsize=4m -n ' pid$target::malloc:entry { self->size = arg0; } syscall::brk:entry /self->size/ { printf("%d bytes", self->size); } pid$target::malloc:return { self->size = 0; }' -p 17472 dtrace: description 'pid$target::malloc:entry ' matched 7 probes CPU ID FUNCTION:NAME 0 44 brk:entry 8343520 bytes 0 44 brk:entry 8343520 bytes [...] • These 8 Mbyte malloc()s grew the heap • Even though the heap has Gbytes not in use • This is starting to look like an OS issue
  • 42. Memory growth, cont. • More tools were created: • Show memory entropy (+ malloc - free) along with heap growth, over time • Show the codepath taken for allocations; compare successful with unsuccessful (heap growth) • Show allocator internals: sizes, options, flags • And run in the production environment • Briefly: tracing frequent allocs does cost overhead • Casting light into what was a black box
  • 43. Memory growth, cont. 4 <- vmem_xalloc 0 4 -> _sbrk_grow_aligned 4096 4 <- _sbrk_grow_aligned 17155911680 4 -> vmem_xalloc 7356400 4 | vmem_xalloc:entry umem_oversize 4 -> vmem_alloc 7356416 4 -> vmem_xalloc 7356416 4 | vmem_xalloc:entry sbrk_heap 4 -> vmem_sbrk_alloc 7356416 4 -> vmem_alloc 7356416 4 -> vmem_xalloc 7356416 4 | vmem_xalloc:entry sbrk_top 4 -> vmem_reap 16777216 4 <- vmem_reap 3178535181209758 4 | vmem_xalloc:return vmem_xalloc() == NULL, vm: sbrk_top, size: 7356416, align: 4096, phase: 0, nocross: 0, min: 0, max: 0, vmflag: 1 libumem.so.1`vmem_xalloc+0x80f libumem.so.1`vmem_sbrk_alloc+0x33 libumem.so.1`vmem_xalloc+0x669 libumem.so.1`vmem_alloc+0x14f libumem.so.1`vmem_xalloc+0x669 libumem.so.1`vmem_alloc+0x14f libumem.so.1`umem_alloc+0x72 libumem.so.1`malloc+0x59 libstdc++.so.6.0.3`_Znwm+0x2b libstdc++.so.6.0.3`_ZNSs4_Rep9_S_createEmmRKSaIcE+0x7e
  • 44. Memory growth, cont. • These new tools and metrics pointed to the allocation algorithm “instant fit” • This had been hypothesized earlier; the tools provided solid evidence that this really was the case here • A new version of libumem was built to force use of VM_BESTFIT • ...and added by Robert Mustacchi as a tunable: UMEM_OPTIONS=allocator=best • Riak was restarted with new libumem version, solving the problem
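  A sketch of how the workaround is applied at restart; the library path and the riak start command are illustrative assumptions, while UMEM_OPTIONS=allocator=best is the tunable named above:

      $ LD_PRELOAD=/opt/custom/lib/libumem.so.1 UMEM_OPTIONS=allocator=best riak start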
  • 45. Memory growth, cont. • Not the first issue with the system memory allocator; depending on configuration, Riak may use libc's malloc(), which isn't designed to be scalable • the man page does say it isn't multi-thread scalable • libumem was the answer (with the fix)
  • 46. Memory growth, cont. • The fragmentation problem was interesting because it was unusual; it is not the most common source of memory growth! • DIRTy systems are often event-oriented… • ...in event-oriented systems, memory growth can be a consequence of either surging or drowning • In an interpreted environment, memory growth can also come from memory that is semantically leaked • Voxer — like many emerging DIRTy apps — has a substantial node.js component; how to debug node.js memory growth?
  • 47. Memory growth, cont. • We have developed a postmortem technique for making sense of a node.js heap: OBJECT #OBJECTS #PROPS CONSTRUCTOR: PROP fe806139 1 1 Object: Queue fc424131 1 1 Object: Credentials fc424091 1 1 Object: version fc4e3281 1 1 Object: message fc404f6d 1 1 Object: uncaughtException ... fafcb229 1007 23 ClientRequest: outputEncodings, _headerSent, ... fafc5e75 1034 5 Timing: req_start, res_end, res_bytes, req_end, ... fafcbecd 1037 3 Object: aborted, data, end 8045475 1060 1 Object: fb0cee9d 1220 9 HTTPParser: socket, incoming, onHeadersComplete, ... fafc58d5 1271 25 Socket: _connectQueue, bytesRead, _httpMessage, ... fafc4335 1311 16 ServerResponse: outputEncodings, statusCode, ... • Used by @izs to debug a nasty node.js leak • Search for “findjsobjects” (one word) for details
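  A sketch of that workflow with the mdb_v8 module (the core-file name is illustrative; gcore writes core.<pid>, and ::load assumes v8.so is installed in mdb's module path):

      $ gcore $(pgrep -ox node)    # take a core of the running node process
      $ mdb core.12345
      > ::load v8.so
      > ::findjsobjects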
  • 48. CPU scheduling • Problem: occasional latency outliers • Analysis: no smoking gun. No slow I/O or locks. Some random dispatcher queue latency, but with CPU headroom. $ prstat -mLc 1 PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID 17930 103 21 7.6 0.0 0.0 0.0 53 16 9.1 57K 1 73K 0 beam.smp/265 17930 103 20 7.0 0.0 0.0 0.0 57 16 0.4 57K 2 70K 0 beam.smp/264 17930 103 20 7.4 0.0 0.0 0.0 53 18 1.7 63K 0 78K 0 beam.smp/263 17930 103 19 6.7 0.0 0.0 0.0 60 14 0.4 52K 0 65K 0 beam.smp/266 17930 103 2.0 0.7 0.0 0.0 0.0 96 1.6 0.0 6K 0 8K 0 beam.smp/267 17930 103 1.0 0.9 0.0 0.0 0.0 97 0.9 0.0 4 0 47 0 beam.smp/280 [...]
  • 49. CPU scheduling, cont. • Unix 101 [diagram: threads on a CPU run queue, R = ready to run, O = on-CPU; scheduler preemption kicks the on-CPU thread back onto the run queue so another ready thread can run]
  • 50. CPU scheduler, cont. • Unix 102 • TS (and FSS) check for CPU starvation [diagram: a long run queue indicating CPU starvation; priority promotion moves a starved thread toward the CPU]
  • 51. CPU scheduling, cont. • Experimentation: run 2 CPU-bound threads, 1 CPU • Subsecond offset heat maps:
  • 52. CPU scheduling, cont. • Experimentation: run 2 CPU-bound threads, 1 CPU • Subsecond offset heat maps, annotated: THIS SHOULDN'T HAPPEN
  • 53. CPU scheduling, cont. • Worst case (4 threads 1 CPU), 44 sec dispq latency # dtrace -n 'sched:::off-cpu /execname == "burn1"/ { self->s = timestamp; } sched:::on-cpu /self->s/ { @["off-cpu (ms)"] = lquantize((timestamp - self->s) / 1000000, 0, 100000, 1000); self->s = 0; }' off-cpu (ms) value ------------- Distribution ------------- count < 0 | 0 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 387184 1000 | 2256 2000 | 1078 3000 | Expected 862 4000 | 1070 5000 | Bad 637 [...] 6000 | Inconceivable 535 41000 | 3 42000 | 2 43000 | 2 44000 | 1 45000 | 0 ts_maxwait @pri 59 = 32s, FSS uses ?
  • 54. CPU scheduling, cont. • Findings: • FSS scheduler class bug: • FSS uses a more complex technique to avoid CPU starvation. A thread's priority could stay high, keeping it on-CPU for many seconds, before the priority decayed enough to let another thread run. • Analyzed (more DTrace) and fixed (thanks Jerry Jelinek) • DTrace analysis of the scheduler was invaluable • Under (too) high CPU load, your runtime can be bound by how well you schedule, not by how much work you do • Not the only scheduler issue we've encountered
  • 55. CPU scheduling, cont. • CPU caps to throttle tenants in our cloud • Experiment: add hot-CPU threads (saturation):
  • 56. CPU scheduling, cont. • CPU caps to throttle tenants in our cloud • Experiment: add hot-CPU threads: :-(
  • 57. Visualizing CPU latency • Using a node.js ustack helper and the DTrace profile provider, we can determine the relative frequency of stack backtraces in terms of CPU consumption • Stacks can be visualized with flame graphs, a stack visualization we developed:
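  A sketch of that workflow (the 97 Hz rate, 60-second window, and file names are illustrative; jstack() uses the ustack helper to label JavaScript frames, and stackcollapse.pl/flamegraph.pl are from the FlameGraph tools):

      # dtrace -n 'profile-97 /execname == "node" && arg1/ { @[jstack(100, 8000)] = count(); }
          tick-60s { exit(0); }' -o node.stacks
      $ stackcollapse.pl node.stacks | flamegraph.pl > node-cpu.svg

  The width of each frame in the resulting SVG is proportional to how often it appeared in the samples, which makes the hottest code paths easy to spot.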
  • 58. DIRT in production • node.js is particularly amenable for the DIRTy apps that typify the real-time web • The ability to understand latency must be considered when deploying node.js-based systems into production! • Understanding latency requires dynamic instrumentation and novel visualization • At Joyent, we have added DTrace-based dynamic instrumentation for node.js to SmartOS, and novel visualization into our cloud and software offerings • Better production support — better observability, better debuggability — remains an important area of node.js development!
  • 59. Beyond node.js • node.js is adept at connecting components in the system; it is unlikely to be the only component! • As such, when using node.js to develop a DIRTy app, you can expect to spend as much time (if not more!) understanding the components as the app • When selecting components — operating system, in- memory data store, database, distributed data store — observability must be a primary consideration! • When building a team, look for full-stack engineers — DIRTy apps pose a full-stack challenge!
  • 60. Thank you! • @mranney for being an excellent guinea pig customer • @dapsays for the V8 DTrace ustack helper and V8 debugging support • More information: http://dtrace.org/blogs/brendan, http://dtrace.org/blogs/dap, and http://smartos.org