Lorenzo Alberton
                       @lorenzoalberton


        Monitoring at scale:
intuitive dashboard design

                     Make decisions, fast




          PHP UK, Saturday 23rd February 2013
                                                1
Lorenzo Alberton
             Chief Technical Architect, DataSift
             http://alberton.info
             @lorenzoalberton




                                     http://bit.ly/scaleds

                                                             2
Big Data, little clue?
                               Monitoring is crucial




  http://www.flickr.com/photos/mrflip/5150336351/lightbox/   3
Complex architectures




                        4
Identify (and prevent) failures?

                          No output data:
                             where is the
                              problem???

                                                5
Monitoring mindset

    "You can’t control what you can’t measure" - Tom DeMarco

    Design systems to be monitored

    Observe patterns and automate most things

    Good reporting: difference between noticing and not having a clue

                    http://www.threesixtymag.co.uk/2012/12/state-of-mind-tee/           6
Monitoring mindset

    The
  hardest
   part
                         Good reporting:
                     difference between noticing
                        and not having a clue
                                                   7
Dashboard Design
   Learning the appropriate language




                                       8
Dashboard: what is it?

       Tool to display
        PIs and KPIs
           quantitative analysis


         Immediacy, intuitiveness
         and appropriate context
                                    9
Operational
    monitors functions which need constant, real-time,
    minute-by-minute attention
    immediacy and practicality
    no statistics or analysis

Strategic
    quick overview of an organization’s health
    assists with executive decisions
    what is going on right now is not important -
    what is pressing is what has been going on

Analytic
    comparisons, reviewing extensive histories,
    evaluating performance
    assists with data analysis
    doesn’t require real-time data

                                                               10
Multiple dashboard views

  Operational:        Strategic:          Analytic:
 Ops / Engineering    CEO / CIO     Marketing / Accountancy




           Different view for each audience:
        keep metrics relevant to each group
                                                         11
Multiple dashboard views

  Operational:
 Ops / Engineering                    This talk
                                      is about
                                      this one

              (but the others are important too)
                                                   12
Effective Monitoring
       Understanding how we think




                                    13
Thinking, Fast and Slow




                          14
A tale of two systems

   Intuition
       operates automatically and quickly, with little or
       no effort and no sense of voluntary control
       2 + 2 = ?
       involuntary, fast, effortless, invisible

   Reasoning
       consciously allocates attention to the effortful
       mental activities that demand it
       216 × 725 = ?
       voluntary, slow, difficult, visible
                                                          15
A tale of two systems

   Intuition
       operates automatically and quickly, with little or
       no effort and no sense of voluntary control
       2 + 2 = ?
       involuntary, fast, effortless, invisible

   Monitoring should rely on System 1
                                               16
A tale of two systems

   Reasoning
       consciously allocates attention to the effortful
       mental activities that demand it
       216 × 725 = ?
       voluntary, slow, difficult, visible

   System 2 regulates our intuition and is ready to jump in
   when attention is required
                                                         17
Model “Normality”




         http://www.flickr.com/photos/fwooper7/4942474212/   18
Be surprised by anomalies




   http://animal.discovery.com/tv-shows/wild-kingdom/about-animals/lions-elephant-hunters-pictures.htm   19
Create surprise with alerts




                              20
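One way to turn "model normality, be surprised by anomalies" into an alert rule is sketched below. It is only an illustration: the function name `isAnomaly`, the threshold of 3 standard deviations and the sample data are assumptions made for this sketch, not something taken from the talk.

```php
<?php
// Hypothetical helper: treats the recent baseline as "normal" and flags a
// new sample when it lies more than $k standard deviations from the mean.
function isAnomaly(array $baseline, float $value, float $k = 3.0): bool
{
    $n = count($baseline);
    if ($n < 2) {
        return false; // not enough history to model "normal"
    }
    $mean = array_sum($baseline) / $n;
    $sumSq = 0.0;
    foreach ($baseline as $x) {
        $sumSq += ($x - $mean) ** 2;
    }
    $stdDev = sqrt($sumSq / ($n - 1)); // sample standard deviation
    if ($stdDev == 0.0) {
        return $value != $mean; // flat baseline: any change is a surprise
    }
    return abs($value - $mean) > $k * $stdDev;
}

// Made-up requests-per-minute history.
$requestsPerMinute = [980, 1010, 1005, 995, 1020, 990, 1000];
var_dump(isAnomaly($requestsPerMinute, 1008)); // within normal variation
var_dump(isAnomaly($requestsPerMinute, 400));  // far outside the baseline
```

A fixed multiple of the standard deviation is a deliberately simple model; metrics with daily or weekly cycles would need a baseline that accounts for seasonality.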
Over-Use of color

     [Chart: Revenue vs. Goal by month, Jan-Jun; y-axis 0-80]

     Only attract attention when things go bad
                                                    21
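The rule "only attract attention when things go bad" can be expressed as a colour choice per data point. The sketch below is a hypothetical illustration: the function name `barColour`, the two hex colours and the revenue figures are all made up.

```php
<?php
// Hypothetical palette: a muted grey by default, and a single alert colour
// reserved for values that miss the goal.
function barColour(float $revenue, float $goal): string
{
    return $revenue < $goal ? '#d32f2f' : '#9e9e9e';
}

// Made-up monthly figures on the slide's 0-80 scale.
$goal = 50;
$revenue = ['Jan' => 62, 'Feb' => 58, 'Mar' => 41, 'Apr' => 66];
foreach ($revenue as $month => $value) {
    printf("%s %s\n", $month, barColour($value, $goal));
}
```

With this rule, only March is drawn in the alert colour, so it is the one bar that stands out.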
Dashboard best practices



      Show, don’t tell

        Keep text/numbers to a minimum



                                         22
Clarity and immediacy FTW
    Charles Joseph Minard, Napoleon’s March on Moscow




             worst
“Probably the best statistical graphic ever drawn” - Edward Tufte
                 http://www.edwardtufte.com/tufte/posters           23
Graphs fit short-term memory

     Sales    Jan    Feb    Mar    Apr    May    Jun    Jul
     US     23923  21695  20032  24030  24302  25032  26203
     EU     14390  16400  17303  21900  23547  20142  27321

                    give values a visual shape

     [Line chart: Sales (10000-30000) by month, Jan-Aug; series: US, EU]

                                                                   24
Dashboard best practices


       Communicate
        with clarity
           Simplicity is key


                               25
Dashboard design mistakes




                            26
Busy Dashboards Are Busy




      http://img.photobucket.com/albums/v254/tomklipp/Misc/C-130e-flight-station.jpg   27
Dashboard design mistakes


       Too much data,
    too little information

          At a glance, tell if there’s a
        problem, not a precise analysis

                                           28
The only thing I want to know

             Everything is alright




          http://www.x929.ca/shows/newsboy/?cat=28&paged=2   29
Attention as limited resource




        http://www.climateshifts.org/wp-content/uploads/2010/12/coal_hands.jpg   30
Attention has a limited budget


          Attention
          depletion
           Leverage intuition
           whenever possible

                                 31
Strain and effort ➔ Heuristics



It takes 5 machines 5 minutes to make 5 widgets,
             how long would it take
       100 machines to make 100 widgets?




                                               32
Strain and effort ➔ Heuristics
 Tendency to answer questions with the first idea that
         comes to mind, without checking it

It takes 5 machines 5 minutes to make 5 widgets,
             how long would it take
       100 machines to make 100 widgets?

       Intuitive answer: 100 minutes (wrong)
       Correct answer: 5 minutes
                                                        33
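The arithmetic behind the correct answer can be checked in a few lines. This is a small sketch: the function name `minutesNeeded` is made up, and the only input is the slide's statement that 5 machines take 5 minutes to make 5 widgets.

```php
<?php
// From the slide: 5 machines x 5 minutes produce 5 widgets, so one widget
// costs 5 machine-minutes of work.
function minutesNeeded(int $machines, int $widgets): float
{
    $machineMinutesPerWidget = (5 * 5) / 5;
    // Total work divided by the number of machines working in parallel.
    return $widgets * $machineMinutesPerWidget / $machines;
}

var_dump(minutesNeeded(100, 100)); // float(5)
```

Scaling machines and widgets by the same factor leaves the time unchanged, which is exactly the step the intuitive answer skips.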
Swap out difficult tasks for easier ones


  Heuristic, n.
          simple procedure that helps find
          adequate, though often imperfect,
          answers to difficult questions.


                                              34
Human-centric software?

   Attention is LAZY
       Too subtle: didn’t notice
       Too tired: didn’t care
                                       35
Let the visual cortex do the work




         http://chariotsolutions.com/presentations/the-programming-ape   36
Dashboard best practices


  Organise information
  to support meaning
 Apply the latest understanding of human visual perception
         to the visual presentation of information


                                                             37
Organised by means of production
CPU Load          DB queries




Bandwidth



                    BAD
                                   38
Organised by context

         Shopping Cart    Product Catalog   Auth Service
Memory
Traffic
DB




                         BETTER
                                                           39
Correlate events to add context

   [Timeline chart: last 7 days, annotated with releases / events]

   Releases / Events       Symptoms
   Feature X               5% users locked out
   Performance hotfix      DB load -40%
   TV Ads                  90th percentile latency +730%

                                                            40
Dashboard best practices



   Reduce Visual Noise

  Clutter, Distractions, Clichés, Animations, Embellishments
                       create confusion


                                                               41
Gauges / Speedometers

  3D effect
  Glass reflection
  Bouncing needle
  ...
  Bacon?
                                 42
(3D) Pie charts

   [3D pie chart; slice labels: 23%, 21%, 17%, 17%, 13%, 4%, 2%, 1%]

   Size of round areas difficult to evaluate
   Distortion in the perceived size (and value of data)
   ➡ They sacrifice accuracy for aesthetic appeal

   http://www.dashboardinsight.com/articles/digital-dashboards/building-dashboards/the-case-against-3d-charts-in-dashboards.aspx   43
Pie chart vs. Bar chart

   [Pie chart and horizontal bar chart (axis 0-100) of the same data]

        A    27%
        B    23%
        C    22%
        D    16%
        E     6%
        F     5%

   About the same screen estate
   Easier to compare size of bars (i.e. the value of the data)
                                                                    44
Mind tricks




              45
Mind tricks
                                                         WHAT I IF TOLD
                                                             YOU




                                                           YOU READ THAT
                                                              WRONG
       http://www.quora.com/Optical-Illusions/What-are-some-great-optical-illusions   46
A machine for jumping to conclusions



          WYSIATI
          What You See Is All There Is



             Intuitive thinking
           jumps to conclusions
      on the basis of limited evidence
                                         47
Neglect of ambiguity




          Suppression of doubt
                                 48
Neglect of ambiguity


          Ann
       approached
        the bank


      Fabrication of coherent stories
          http://www.flickr.com/photos/27000501@N08/5613967601   49
WYSIATI and the need for more data

   [Chart: Data Throughput over time for Server 1, Server 2 and Server 3]

   First reaction:  oh cr*p.
   Conclusion:      Surely, we’re losing data :-( No doubt about it.
   Second look:     wait, all other metrics are OK....
   Actual cause:    Platform OK. Metrics couldn’t reach the stats server.
                    (Stats server rebooted without eth1 interface)
                                             50
Multiple perspectives / facets
                             Examine data
                             from multiple
                              perspectives
                            simultaneously
                           (one of them will
                         hopefully make sense)


                         Uncover meaningful
                          relationships that
                           exist in the data
                                            51
Grids / Crosstabs

   [Grid of small bar charts, each comparing US vs. EU on a 0-20K scale]

   Columns (failures by service): Auth Mgr, Product Catalog, Shopping Cart
   Rows (failures by type):       Out Of Memory, Timeout, Unreachable

                                                                                                           52
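A grid like this one is a count of failures per (type, service, region) cell. The sketch below shows one way to build such a crosstab from raw events. The function name `crosstab`, the field names `service`, `type` and `region`, and the sample events are assumptions for illustration only.

```php
<?php
// Count failures per (type, service, region) cell.
// Each event is assumed to carry 'type', 'service' and 'region' keys.
function crosstab(array $events): array
{
    $grid = [];
    foreach ($events as $e) {
        $current = $grid[$e['type']][$e['service']][$e['region']] ?? 0;
        $grid[$e['type']][$e['service']][$e['region']] = $current + 1;
    }
    return $grid;
}

// Made-up failure events.
$failures = [
    ['service' => 'Auth Mgr',      'type' => 'Timeout',       'region' => 'EU'],
    ['service' => 'Auth Mgr',      'type' => 'Timeout',       'region' => 'EU'],
    ['service' => 'Auth Mgr',      'type' => 'Timeout',       'region' => 'US'],
    ['service' => 'Shopping Cart', 'type' => 'Out Of Memory', 'region' => 'EU'],
];

// One output line per grid cell: failure type, service, counts per region.
foreach (crosstab($failures) as $type => $byService) {
    foreach ($byService as $service => $byRegion) {
        ksort($byRegion);
        $cells = [];
        foreach ($byRegion as $region => $count) {
            $cells[] = "$region=$count";
        }
        printf("%-14s %-16s %s\n", $type, $service, implode(' ', $cells));
    }
}
```

Keeping the same row and column order in every view lets the eye compare cells by position rather than by reading labels.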
Halo effect - Biases


      Judgement influenced
     by previous information

        Information processed earlier
   might skew our perception of new data.
 No evidence required to jump to conclusions.
                                                53
Halo effect - Biases

              C++                 Java


   C++

   Ruby

     R

          0   20   40   60   80
                                  Garbage Collection


                                                       54
Biases stronger than hard evidence

   Data In  ➔  A (C++)  ➔  B (Java)  ➔  No Data Out

       Which component is broken? A or B ?
         Don’t guess, look at metrics!!!
                                                55
Priming effect




WASH
                 56
Priming effect




 S _ AP
                 57
Priming effect




 SOAP
                 58
Priming effect




  SLAP
                 59
Priming effect




 SNAP
                 60
Priming effect




SWAP
                 61
Pattern detection
  Colors                  Shapes                            Sounds

 GOOD
 BAD
              Our brain is good at
            creating associations
           and detecting patterns
              http://www.vladstudio.com/wallpaper/?violin            62
Shapes that create emotions




                              63
Normalise data, keep patterns consistent




Normalised


                                      64
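One possible reading of "normalise data" is to rescale every series to a common 0..1 range, so that graphs of very different metrics share the same visual scale. The sketch below uses min-max scaling. The function name `normalise` and the sample series are assumptions, and the slide itself may have a different normalisation in mind. It needs PHP 7.4 or later for the arrow function.

```php
<?php
// Min-max scaling: maps the smallest value to 0.0 and the largest to 1.0.
// A constant series maps to all zeros.
function normalise(array $series): array
{
    if ($series === []) {
        return [];
    }
    $min = min($series);
    $range = max($series) - $min;
    return array_map(
        fn($v) => $range == 0 ? 0.0 : (float) ($v - $min) / $range,
        $series
    );
}

// Made-up metrics with very different units and magnitudes.
$cpuPercent = [20, 35, 80];
$requestsPerSec = [1200, 2100, 4800];

print_r(normalise($cpuPercent));
print_r(normalise($requestsPerSec));
```

After scaling, both series trace the same shape, which makes it easier to see that they move together.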
Going Real-Time




                  65
Monitoring At Different Levels

    UX / Business metrics
           Is there a problem?

       System monitors
          Where is the problem?

    Application monitors
          What is the problem?
                    66
Instrumentation: Monitoring + Alerting



                              www.android-zenoss.info




                                                        67
Instrumentation: Monitoring + Alerting

Unconventional
 alerting tools
     can be
  surprisingly
    effective

                                         67
Getting started with monitoring
Monigusto

A single-server box that contains the most
common/current tools for monitoring like
graphite, statsd, collectd, nagios, logstash,
jmxtrans, tasseo, gdash, librato and sensu
https://github.com/monigusto




Real-Time Graphing With Graphite
http://bit.ly/rt-graphite

                                                68
StatsD + Graphite


                       Example
 StatsD: Node.js daemon. Listens for messages on a UDP port, aggregates
 the metrics they carry and periodically flushes them to Graphite for
 storage and visualisation.

 Graphite: real-time graphing system. Data is sent to carbon (the
 processing back-end), which stores it in Graphite’s database. The data
 is then visualised via Graphite’s web interface.

                                                                         69
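The messages StatsD listens for are short plain-text lines of the form `bucket:value|type`, where the type is `c` for a counter, `ms` for a timer or `g` for a gauge. A minimal sketch of building and sending one over UDP follows. The helper names `statsdPacket` and `statsdSend` are made up for this example, and the host is assumed to be a local StatsD daemon on the default port 8125.

```php
<?php
// Builds one StatsD line, e.g. "workerX.processing_time:320|ms".
function statsdPacket(string $bucket, $value, string $type): string
{
    return sprintf('%s:%s|%s', $bucket, $value, $type);
}

// Sends a packet over UDP. UDP is fire-and-forget: there is no handshake
// and no delivery guarantee, so a missing daemon does not block the caller.
function statsdSend(string $packet, string $host = '127.0.0.1', int $port = 8125): bool
{
    $socket = @fsockopen('udp://' . $host, $port, $errno, $errstr, 1.0);
    if ($socket === false) {
        return false;
    }
    $written = @fwrite($socket, $packet);
    fclose($socket);
    return $written === strlen($packet);
}

statsdSend(statsdPacket('workerX.processing_time', 320, 'ms')); // timer, in ms
statsdSend(statsdPacket('workerX.received.type.tweet', 1, 'c')); // counter
```

Because the transport is UDP, instrumentation adds very little overhead to the application, at the cost of occasionally dropped samples.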
StatsD metrics
                                                          ; statsd.ini
<?php                                                     [statsd]
                                                          host = yourhost
foreach ($items as $item) {
                                                          port = 8125
    // time how long it takes
    // to process this item...
    $time_start = microtime(true);
    // ... process item here ...
    $time = (int)(1000 * (microtime(true) - $time_start));
    StatsD::timing('workerX.processing_time', $time); // in ms


    // count items by type
    StatsD::increment('workerX.received.type.'.$item['type']);
}

                        https://github.com/etsy/statsd/                 70
StatsD metrics: define a hierarchy of event names
(e.g. workerX.processing_time, workerX.received.type.<type>)
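A hierarchy of event names is easier to keep consistent when the names are built in one place. The sketch below is one possible approach, not something prescribed by StatsD: the helper `metric_name` is hypothetical, and replacing unexpected characters with an underscore is an assumption made so that a stray dot or space in a dynamic segment (such as an item type) cannot create an extra level in the tree.

```php
<?php
/**
 * Join name segments into a dotted metric path,
 * e.g. "workerX.received.type.tweet".
 * Characters outside [A-Za-z0-9_-] become "_" (an assumed convention).
 */
function metric_name(string ...$segments): string
{
    $clean = array_map(
        fn (string $s): string => preg_replace('/[^A-Za-z0-9_-]/', '_', $s),
        $segments
    );
    return implode('.', $clean);
}

// "blog.post" is a made-up item type containing a dot.
echo metric_name('workerX', 'received', 'type', 'blog.post'), "\n";
// prints: workerX.received.type.blog_post
```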
Graphite output
[Graphs: workerX.processing_time.mean and workerX.processing_time.90percentile]

                                 http://graphite.wikidot.com/                    71
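The same series can be fetched as data through Graphite's render API, which takes one `target` parameter per series plus `from` and `format`. A small sketch that only builds the URL; the host `graphite.example.com` and the helper name are placeholders, and the exact metric paths depend on the StatsD version and configuration, so the names below are simply the ones shown on the slide:

```php
<?php
/** Build a Graphite render-API URL that returns JSON for the given targets. */
function graphite_render_url(string $baseUrl, array $targets, string $from = '-1h'): string
{
    $query = [];
    foreach ($targets as $target) {
        // Graphite expects a repeated "target" parameter, one per series.
        $query[] = 'target=' . rawurlencode($target);
    }
    $query[] = 'from=' . rawurlencode($from);
    $query[] = 'format=json';
    return rtrim($baseUrl, '/') . '/render?' . implode('&', $query);
}

// Placeholder host; metric names as shown on the slide.
echo graphite_render_url('http://graphite.example.com', [
    'workerX.processing_time.mean',
    'workerX.processing_time.90percentile',
]), "\n";
```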
Understanding Distribution
              Why averages suck




                                  72
Bell curve
[Figure: bell curve of # of requests (y-axis) against response time (x-axis); the average/median sits at the peak, with "below average" to its left and "above average" to its right]

“Normal” distribution of response times: Average = Median,
i.e. observed performance represents the majority of the transactions

          http://apmblog.compuware.com/2012/11/14/why-averages-suck-and-percentiles-are-great/              73
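Bell curve - Alerting levels
[Figure: bell curve of # of requests against response time, median marked, area within 1 standard deviation of the mean highlighted]

Within 1 standard deviation of the mean: about 68% of transactions,
with the mean in the middle

                                                            74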
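Bell curve - Alerting levels
[Figure: bell curve of # of requests against response time, median marked, area within 2 standard deviations of the mean highlighted]

Within 2 standard deviations of the mean: about 95% of transactions
(the vast majority)

                                                                75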
Bell curve - Alerting levels
[Figure: bell curve of # of requests against response time, median marked, both tails beyond 2 standard deviations of the mean highlighted]

Everything outside 2 standard deviations of the mean: outlier

                                                                     76
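The alerting levels above can be sketched as a small classifier. This assumes response times are roughly normally distributed; the level names `normal`, `warning` and `outlier`, the function names and the sample data are illustrative choices rather than anything defined in the talk.

```php
<?php
/** Arithmetic mean of a non-empty list of numbers. */
function mean(array $values): float
{
    return array_sum($values) / count($values);
}

/** Population standard deviation of a non-empty list of numbers. */
function std_dev(array $values): float
{
    $mean = mean($values);
    $squares = array_map(fn ($v) => ($v - $mean) ** 2, $values);
    return sqrt(array_sum($squares) / count($values));
}

/** Map a sample to a level by its distance from the mean (assumed level names). */
function alert_level(float $sample, float $mean, float $stdDev): string
{
    $distance = abs($sample - $mean);
    if ($distance <= $stdDev) {
        return 'normal';   // within 1 standard deviation
    }
    if ($distance <= 2 * $stdDev) {
        return 'warning';  // within 2 standard deviations
    }
    return 'outlier';      // everything outside
}

$responseTimes = [90, 100, 110, 95, 105, 100, 98, 102]; // ms, made-up sample
$mean = mean($responseTimes);
$sd = std_dev($responseTimes);
foreach ([101, 109, 150] as $sample) {
    echo $sample, ' ms: ', alert_level($sample, $mean, $sd), "\n";
}
```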
“Normal” vs. Real distribution
Real life: a few very heavy outliers and a long tail
Median ≠ Average
[Figure: histogram of number of requests (0-8) against response time (1-12), marking the 20th percentile, the average and the median, in that order along the x-axis]

~20% of very fast transactions
The average looks a lot faster than most transactions

       http://apmblog.compuware.com/2012/11/14/why-averages-suck-and-percentiles-are-great/          77
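To see how a small group of very fast transactions pulls the average away from the median, here is a sketch with made-up response times. The `percentile` helper uses the nearest-rank method, which is only one of several common percentile definitions.

```php
<?php
/** Nearest-rank percentile (0 < $p <= 100) of a non-empty list of numbers. */
function percentile(array $values, float $p): float
{
    sort($values);
    $rank = (int) ceil($p * count($values) / 100);
    return (float) $values[max($rank, 1) - 1];
}

// Made-up response times: 20% very fast, the rest clustered around 8-12.
$times = [1, 1, 8, 8, 9, 9, 9, 10, 10, 12];

$average = array_sum($times) / count($times);
printf("20th percentile: %.1f\n", percentile($times, 20)); // 1.0
printf("average:         %.1f\n", $average);               // 7.7
printf("median:          %.1f\n", percentile($times, 50)); // 9.0
```

In this sample the average (7.7) is faster than eight of the ten transactions, while the median (9.0) sits where most of them actually are.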
Averages vs. Percentiles
[Figure: load time (ms, 0-200) from 8AM to 4PM, plotting the average, the 50th percentile and the 90th percentile]

Percentiles allow us to understand the distribution
The 50th percentile is more stable than the average
                                                                    78
Automatic Baselining and Alerts
[Figure: load time (ms, 0-200) from 8AM to 4PM, plotting the 50th and 90th percentiles, with "threshold X" marked]

    Alert if std deviation of 50th percentile is over X
                                                                          79
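The alert rule on this slide can be sketched in a few lines. The window of recent medians, the threshold value and the function names are assumptions for illustration; in practice the medians would come from the monitoring system, one per interval.

```php
<?php
/** Population standard deviation of a non-empty list of numbers. */
function std_dev(array $values): float
{
    $mean = array_sum($values) / count($values);
    $squares = array_map(fn ($v) => ($v - $mean) ** 2, $values);
    return sqrt(array_sum($squares) / count($values));
}

/** True when the spread of the recent 50th-percentile values exceeds threshold X. */
function median_is_unstable(array $recentMedians, float $thresholdX): bool
{
    return std_dev($recentMedians) > $thresholdX;
}

// Hypothetical 50th-percentile load times (ms), one per monitoring interval.
$steady   = [48, 50, 52, 49, 51];
$drifting = [48, 50, 75, 110, 140];
$thresholdX = 10.0; // assumed value

var_dump(median_is_unstable($steady, $thresholdX));   // bool(false)
var_dump(median_is_unstable($drifting, $thresholdX)); // bool(true)
```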
Tips And Tricks
Patterns our brain should recognise




                                      80
Normalise + Add baseline

                 let machines determine the baseline

                                      81
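One simple way to let a machine determine the baseline is a trailing average, with each new value expressed as a ratio to it. This is only a sketch under that assumption (the slide does not say which baselining method was used); the function name, the window size and the data are made up.

```php
<?php
/**
 * For every point that has $window predecessors, divide the value by the
 * mean of those predecessors. A ratio of 1.0 means "exactly on baseline".
 */
function normalise_against_baseline(array $series, int $window): array
{
    $ratios = [];
    for ($i = $window; $i < count($series); $i++) {
        $baseline = array_sum(array_slice($series, $i - $window, $window)) / $window;
        // A zero baseline cannot be used as a divisor; report null instead.
        $ratios[$i] = $baseline > 0 ? $series[$i] / $baseline : null;
    }
    return $ratios;
}

$requestsPerSecond = [100, 104, 96, 100, 150]; // made-up data
print_r(normalise_against_baseline($requestsPerSecond, 4)); // index 4 => 1.5
```

Plotting the ratio instead of the raw value puts metrics with very different scales on the same axis, so a 50% jump looks the same on every graph.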
Anomaly detection in fluctuating traffic
IOPS




                                         82
Derivative (Detect big spikes)
[Figure: derivative(IOPS) over time, with regions labelled "OK" and "Anomalies"]

                                 83
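Graphite provides a derivative() function for this; the idea itself is just the difference between consecutive points, flagged when it leaves an acceptable band. A sketch with made-up IOPS samples, where the function names and the band of 100 are assumptions:

```php
<?php
/** Difference between each point and the previous one (the first point has none). */
function derivative(array $series): array
{
    $deltas = [];
    for ($i = 1; $i < count($series); $i++) {
        $deltas[$i] = $series[$i] - $series[$i - 1];
    }
    return $deltas;
}

/** Indexes whose change from the previous point falls outside the allowed band. */
function spikes(array $series, float $maxChange): array
{
    return array_keys(array_filter(
        derivative($series),
        fn ($delta) => abs($delta) > $maxChange
    ));
}

$iops = [500, 520, 510, 530, 1400, 540, 525]; // made-up samples
print_r(spikes($iops, 100.0)); // indexes 4 and 5: the jump up and the drop back
```

Normal fluctuation in the raw series disappears in the derivative, so a fixed band works even when traffic itself rises and falls during the day.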
Different visuals to spot differences
Stacked
Area




                                        84
Different visuals to spot differences
Overlapping
Lines




                                         85
Flattening effect
          saturation of a resource
         or discontinuation of flow




          Slawek Ligus, “Effective Monitoring and Alerting”, O’Reilly 2012   86
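A flattened graph can also be detected automatically by looking for a long run of nearly identical consecutive values. The sketch below is one possible heuristic, not a method taken from the book cited above; the function name, the tolerance and the minimum run length of 4 are assumptions.

```php
<?php
/**
 * Length of the longest run of consecutive points in which each point differs
 * from its predecessor by at most $tolerance.
 */
function longest_flat_run(array $series, float $tolerance): int
{
    $longest = $current = count($series) > 0 ? 1 : 0;
    for ($i = 1; $i < count($series); $i++) {
        $isFlat = abs($series[$i] - $series[$i - 1]) <= $tolerance;
        $current = $isFlat ? $current + 1 : 1;
        $longest = max($longest, $current);
    }
    return $longest;
}

$cpuPercent = [62, 71, 85, 100, 100, 100, 100, 93]; // made-up samples
$flattened = longest_flat_run($cpuPercent, 0.5) >= 4; // 4 is an assumed minimum run
var_dump($flattened); // bool(true)
```

A run stuck at the maximum suggests saturation of a resource, while a run stuck at zero suggests the flow has stopped.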
Regular anomalies
            check your cron jobs




         Slawek Ligus, “Effective Monitoring and Alerting”, O’Reilly 2012   87
Advanced Heatmaps




                    88
Heat-Maps




            89
Look! Rib cages! Network load viz




       http://www.network-weathermap.com/   http://cacti.net   90
10-40GB links - Bandwidth monitor



               Great, but not enough!

                    Contextualise
                      metrics




       http://www.network-weathermap.com/   http://cacti.net   91
HeatMaps: Cacti + WeatherMap




 Cacti: Network graphing solution harnessing the power of RRDTool’s
 data storage and graphing functionality. Provides a fast poller, graph
 templating, multiple data acquisition methods.

 Weathermap: Cacti plugin to integrate network maps into the
 Cacti web UI. Includes a web-based map editor.

                                                                          92
Network throughput / latency
[Figure: weathermap of the architecture; every link is labelled with its throughput (e.g. 4410/s, 5320/s, 7312/s)]

Callouts:
  augmentation service timing out?
  consumer slower than producer?

          Graphite datasource for Weathermap: https://github.com/alexforrow/php-weathermap-graphite           93
Server load: memory, CPU, disk...
[Figure: weathermap of server load per node; one node is labelled 500%]

Callouts:
  CPU/memory overload on filtering node?
  Slow DB queries?
  Disk storage running out of space?

     Graphite datasource for Weathermap: https://github.com/alexforrow/php-weathermap-graphite   94
Conclusions
   Almost beer time...




                         95
Guidelines: dashboards for humans

       Make the subtle obvious
Make the complex/busy simple/clean
Group data by context, not by means of production
Detect anomalies/deviation from norm
   Turn raw numbers into graphs
Appeal to intuition, conserve attention
                                          96
References

http://www.alberton.info/talks
Daniel Kahneman, “Thinking, Fast and Slow”, Penguin Books 2012
Slawek Ligus, “Effective Monitoring and Alerting”, O’Reilly 2012
Stephen Few, http://www.perceptualedge.com/
http://www.dashboardinsight.com
Coda Hale, The Programming APE

                                                                      97
We’re Hiring!




http://datasift.com/about-us/careers

      lorenzo@datasift.com
                                   98
Lorenzo Alberton
          @LorenzoAlberton




   Thank you!
       lorenzo@alberton.info
http://www.alberton.info/talks




            http://joind.in/8060
                                   99

The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...Aggregage
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfJamie (Taka) Wang
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8DianaGray10
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?IES VE
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesDavid Newbury
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1DianaGray10
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 

Último (20)

Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
201610817 - edge part1
201610817 - edge part1201610817 - edge part1
201610817 - edge part1
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 

Monitoring at scale - Intuitive dashboard design

  • 1. Lorenzo Alberton @lorenzoalberton Monitoring at scale: intuitive dashboard design Make decisions, fast PHP UK, Saturday 23rd February 2013 1
  • 2. Lorenzo Alberton Chief Technical Architect, DataSift http://alberton.info @lorenzoalberton http://bit.ly/scaleds 2
  • 3. Big Data, little clue? Monitoring is crucial http://www.flickr.com/photos/mrflip/5150336351/lightbox/ 3
  • 5. Identify (and prevent) failures? No output data: where is the problem? 5
  • 7. Monitoring mindset: “You can’t control what you can’t measure” (Tom DeMarco). Design systems to be monitored. Observe patterns and automate most things. Good reporting: the difference between noticing and not having a clue. http://www.threesixtymag.co.uk/2012/12/state-of-mind-tee/ 6
  • 8. Monitoring mindset The hardest part Good reporting: difference between noticing and not having a clue 7
  • 9. Dashboard Design Learning the appropriate language 8
  • 10. Dashboard: what is it? Tool to display PIs and KPIs quantitative analysis Immediacy, intuitiveness and appropriate context 9
  • 11. Operational: monitors functions which need constant, real-time, minute-by-minute attention; immediacy and practicality; no statistics or analyzing. Strategic: quick overview of an organization’s health; assists with executive decisions; what is going on right now is not important - what is pressing is what has been going on. Analytic: comparisons, reviewing extensive histories, evaluating performance; assists with data analysis; doesn’t require real-time data. 10
  • 12. Multiple dashboard views. Operational: Ops / Engineering. Strategic: CEO / CIO. Analytic: Marketing / Accountancy. Different view for each audience: keep metrics relevant to each group 11
  • 13. Multiple dashboard views Operational: Ops / Engineering This talk is about this one (but the others are important too) 12
  • 14. Effective Monitoring Understanding how we think 13
  • 16. A tale of two systems. Intuition: operates automatically and quickly, with little or no effort and no sense of voluntary control (2+2=?): involuntary, fast, effortless, invisible. Reasoning: consciously allocates attention to the effortful mental activities that demand it (216 × 725 = ?): voluntary, slow, difficult, visible 15
  • 17. A tale of two systems. Intuition: operates automatically and quickly, with little or no effort and no sense of voluntary control (2+2=?): involuntary, fast, effortless, invisible. Monitoring should rely on System 1 16
  • 18. A tale of two systems. Reasoning: consciously allocates attention to the effortful mental activities that demand it (216 × 725 = ?): voluntary, slow, difficult, visible. System 2 regulates our intuition and is ready to jump in when attention is required 17
  • 19. Model “Normality” http://www.flickr.com/photos/fwooper7/4942474212/ 18
  • 20. Be surprised by anomalies http://animal.discovery.com/tv-shows/wild-kingdom/about-animals/lions-elephant-hunters-pictures.htm 19
  • 21. Create surprise with alerts 20
  • 24. Over-Use of color [Bar chart: Revenue vs. Goal, Jan to Jun, on a 0 to 80 scale]. Only attract attention when things go bad 21
  • 25. Dashboard best practices Show, don’t tell Keep text/numbers to a minimum 22
  • 26. Clarity and immediacy FTW Charles Joseph Minard, Napoleon’s March on Moscow “Probably the best statistical graphic ever drawn” - Edward Tufte http://www.edwardtufte.com/tufte/posters 23
  • 27. Clarity and immediacy FTW Charles Joseph Minard, Napoleon’s March on Moscow worst “Probably the best statistical graphic ever drawn” - Edward Tufte http://www.edwardtufte.com/tufte/posters 23
  • 29. Graphs fit short-term memory: give values a visual shape 24
      Sales    Jan    Feb    Mar    Apr    May    Jun    Jul
      US     23923  21695  20032  24030  24302  25032  26203
      EU     14390  16400  17303  21900  23547  20142  27321
      [Line chart: the same US and EU sales by month, Jan to Aug, on a 10000 to 30000 scale]
  • 30. Dashboard best practices Communicate with clarity Simplicity is key 25
  • 32. Busy Dashboards Are Busy http://img.photobucket.com/albums/v254/tomklipp/Misc/C-130e-flight-station.jpg 27
  • 33. Dashboard design mistakes Too much data, too little information At a glance, tell if there’s a problem, not a precise analysis 28
  • 34. The only thing I want to know Everything is alright http://www.x929.ca/shows/newsboy/?cat=28&paged=2 29
  • 35. Attention as limited resource http://www.climateshifts.org/wp-content/uploads/2010/12/coal_hands.jpg 30
  • 36. Attention has a limited budget Attention depletion Leverage intuition whenever possible 31
  • 40. Strain and effort ➔ Heuristics. It takes 5 machines 5 minutes to make 5 widgets; how long would it take 100 machines to make 100 widgets? Intuitive answer: 100! Correct answer: 5. Tendency to answer questions with the first idea that comes to mind, without checking it 33
  • 41. Swap out difficult tasks for easier ones Heuristic, n. simple procedure that helps find adequate, though often imperfect, answers to difficult questions. 34
  • 43. Human-centric software? Attention is LAZY. Too subtle: didn’t notice. Too tired: didn’t care 35
  • 44. Let the visual cortex do the work http://chariotsolutions.com/presentations/the-programming-ape 36
  • 45. Dashboard best practices Organise information to support meaning Apply the latest understanding of human visual perception to the visual presentation of information 37
  • 46. Organised by means of production CPU Load DB queries Bandwidth BAD 38
  • 47. Organised by context Shopping Cart Product Catalog Auth Service Memory Traffic DB BETTER 39
  • 49. Correlate events to add context. Releases / Events: Feature X, TV Ads, hotfix. Performance: last 7 days. Symptoms: 5% users locked out, DB load -40%, 90th percentile latency +730% 40
  • 50. Dashboard best practices Reduce Visual Noise Clutter, Distractions, Clichés, Animations, Embellishments create confusion 41
  • 56. Gauges / Speedometers: 3D effect, glass reflection, bouncing needle, ... Bacon? 42
  • 57. (3D) Pie charts: size of round areas difficult to evaluate. Distortion in the perceived size (and value of data) ➡ they sacrifice accuracy for aesthetic appeal. [3D pie chart with percentage-labelled slices] http://www.dashboardinsight.com/articles/digital-dashboards/building-dashboards/the-case-against-3d-charts-in-dashboards.aspx 43
  • 61. Pie chart vs. Bar chart (A 27%, B 23%, C 22%, D 16%, E 6%, F 5%). About the same screen estate. Easier to compare size of bars (i.e. the value of the data) 44
  • 63. Mind tricks WHAT I IF TOLD YOU YOU READ THAT WRONG http://www.quora.com/Optical-Illusions/What-are-some-great-optical-illusions 46
  • 64. A machine for jumping to conclusions. WYSIATI: What You See Is All There Is. Intuitive thinking jumps to conclusions on the basis of limited evidence 47
  • 65. Neglect of ambiguity Suppression of doubt 48
  • 67. Neglect of ambiguity Ann approached the bank Fabrication of coherent stories http://www.flickr.com/photos/27000501@N08/5613967601 49
  • 70. WYSIATI and the need for more data [Chart: data throughput for Server 1, Server 2 and Server 3] 50
  • 71. WYSIATI and the need for more data: oh cr*p. 50
  • 72. WYSIATI and the need for more data: Surely, we’re losing data :-( No doubt about it. 50
  • 73. WYSIATI and the need for more data: wait, all other metrics are OK.... 50
  • 74. WYSIATI and the need for more data: Platform OK. Metrics couldn’t reach the stats server (stats server rebooted without eth1 interface). 50
  • 75. Multiple perspectives / facets Examine data from multiple perspectives simultaneously (one of them will hopefully make sense) Uncover meaningful relationships that exist in the data 51
  • 76. Grids / Crosstabs [Grid of small charts: failures by service (Auth Mgr, Product Catalog, Shopping Cart) in columns, failures by type (Out Of Memory, Timeout, Unreachable) in rows, each comparing US and EU on a 0 to 20K scale] 52
  • 80. Halo effect - Biases Judgement influenced by previous information Information processed earlier might skew our perception of new data. No evidence required to jump to conclusions. 53
  • 81. Halo effect - Biases [Bar chart: Garbage Collection, comparing languages including C++, Java and Ruby on a 0 to 80 scale] 54
  • 84. Biases stronger than hard evidence. Data In ➔ A (C++) ➔ B (Java) ➔ No Data Out. Which component is broken? A or B? Don’t guess, look at metrics!!! 55
  • 86. Priming effect S _ AP 57
  • 88. Priming effect SLAP 59
  • 91. Pattern detection Colors Shapes Sounds GOOD BAD Our brain is good at creating associations and detecting patterns http://www.vladstudio.com/wallpaper/?violin 62
  • 92. Shapes that create emotions 63
  • 94. Normalise data, keep patterns consistent Normalised 64
  • 98. Monitoring At Different Levels. UX / Business metrics: Is there a problem? System monitors: Where is the problem? Application monitors: What is the problem? 66
  • 99. Instrumentation: Monitoring + Alerting www.android-zenoss.info 67
  • 100. Instrumentation: Monitoring + Alerting Unconventional alerting tools can be surprisingly effective 67
  • 101. Getting started with monitoring Monigusto A single-server box that contains the most common/current tools for monitoring like graphite, statsd, collectd, nagios, logstash, jmxtrans, tasseo, gdash, librato and sensu https://github.com/monigusto Real-Time Graphing With Graphite http://bit.ly/rt-graphite 68
  • 102. StatsD + Graphite Example StatsD: Node.JS daemon. Listens for messages over a UDP port and extracts metrics, which are dumped to Graphite for further processing and visualisation. Graphite: Real-time graphing system. Data is sent to carbon (processing back-end) which stores data into Graphite’s db. Data visualised via Graphite’s web interface. 69
  • 105. StatsD metrics: define a hierarchy of event names https://github.com/etsy/statsd/ 70

      ; statsd.ini
      [statsd]
      host = yourhost
      port = 8125

      <?php
      foreach ($items as $item) {
          // time how long it takes to process this item...
          $time_start = microtime(true);
          // ... process item here ...
          $time = (int)(1000 * (microtime(true) - $time_start));
          StatsD::timing('workerX.processing_time', $time); // in ms

          // count items by type
          StatsD::increment('workerX.received.type.'.$item['type']);
      }
  • 106. Graphite output workerX.processing_time.mean workerX.processing_time.90percentile http://graphite.wikidot.com/ 71
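
The PHP calls on the StatsD metrics slide end up as short plain-text datagrams sent to the StatsD daemon over UDP. The following is a minimal Python sketch of that wire format, assuming the usual StatsD line protocol (`name:value|ms` for timers, `name:count|c` for counters). The helper names `format_timing`, `format_increment` and `send_metric` are hypothetical, the `tweet` item type is an invented example, and the default host is a placeholder standing in for `yourhost` from statsd.ini.

```python
import socket


def format_timing(name: str, millis: int) -> str:
    """Build a StatsD timer line, e.g. 'workerX.processing_time:42|ms'."""
    return f"{name}:{millis}|ms"


def format_increment(name: str, count: int = 1) -> str:
    """Build a StatsD counter line, e.g. 'workerX.received.type.tweet:1|c'."""
    return f"{name}:{count}|c"


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget over UDP: no connection is opened and no reply is awaited."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


if __name__ == "__main__":
    # Dotted names form the hierarchy that Graphite later shows as a tree.
    send_metric(format_timing("workerX.processing_time", 42))
    send_metric(format_increment("workerX.received.type.tweet"))
```

Because the metric name is just a dotted string, the hierarchy of event names is decided entirely by the code that emits it, so it is worth agreeing on a naming scheme before instrumenting many workers.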
  • 107. Understanding Distribution Why averages suck 72
  • 108. Bell curve: “normal” distribution of response times. Average = Median, i.e. observed perf. represents the majority of the transactions. [Chart: # of requests vs. response time, with Below Average and Above Average on either side of the Average / Median] http://apmblog.compuware.com/2012/11/14/why-averages-suck-and-percentiles-are-great/ 73
  • 109. Bell curve - Alerting levels. Within 1 std deviation of the mean: about 68% of transactions, with the mean as the middle. [Chart: # of requests vs. response time, median marked] 74
  • 110. Bell curve - Alerting levels. Within 2 times std deviation of the mean: about 95% of transactions (the vast majority). [Chart: # of requests vs. response time, median marked] 75
  • 111. Bell curve - Alerting levels. Everything outside 2 times std deviation of the mean: outlier. [Chart: # of requests vs. response time, median marked] 76
  • 112. “Normal” vs. Real distribution. Real life: few very heavy outliers and long tail. Median ≠ Average. Average looks a lot faster than most transactions; ~20% of very fast transactions. [Histogram: number of requests (0 to 8) vs. response time (1 to 12), marking Average, 20th percentile and Median] http://apmblog.compuware.com/2012/11/14/why-averages-suck-and-percentiles-are-great/ 77
  • 114. Averages vs. Percentiles. Percentiles allow us to understand the distribution. The 50th percentile is more stable than the average. [Chart: load time (ms), 0 to 200, from 8AM to 4PM, showing Average, 50th percentile and 90th percentile] 78
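
To make the averages-versus-percentiles point concrete, here is a small Python sketch. The load times are invented for illustration, and `percentile` is a hypothetical helper using the nearest-rank definition (other definitions interpolate and give slightly different values).

```python
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, -(-len(ordered) * pct // 100))  # integer ceil(n * pct / 100)
    return ordered[rank - 1]


# Hypothetical load times in ms: mostly fast requests plus two heavy outliers.
load_times = [40, 42, 45, 47, 50, 52, 55, 60, 900, 1200]

mean = statistics.mean(load_times)
p50 = percentile(load_times, 50)
p90 = percentile(load_times, 90)

if __name__ == "__main__":
    print(f"mean={mean:.1f} ms, p50={p50} ms, p90={p90} ms")
```

With this sample the two outliers pull the mean far away from what a typical request experiences, while the 50th percentile stays with the bulk of the requests and the 90th percentile exposes the tail.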
  • 116. Automatic Baselining and Alerts. Alert if std deviation of 50th percentile is over threshold X. [Chart: load time (ms), 0 to 200, from 8AM to 4PM, showing 50th percentile, 90th percentile and threshold X] 79
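
One possible reading of the alert rule on this slide, sketched in Python under stated assumptions: the class name `MedianStabilityAlert`, the window size and the threshold value are all hypothetical, and the population standard deviation over a sliding window is only one of several reasonable ways to measure how much the 50th percentile is moving.

```python
import statistics
from collections import deque


class MedianStabilityAlert:
    """Alert when recent 50th-percentile samples spread by more than threshold_x."""

    def __init__(self, window: int, threshold_x: float):
        self.samples = deque(maxlen=window)  # oldest sample drops out automatically
        self.threshold_x = threshold_x

    def observe(self, p50_ms: float) -> bool:
        """Record one p50 sample; return True if the alert should fire."""
        self.samples.append(p50_ms)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough history yet to form a baseline
        return statistics.pstdev(self.samples) > self.threshold_x


alert = MedianStabilityAlert(window=5, threshold_x=10.0)
steady = [alert.observe(v) for v in [50, 52, 49, 51, 50]]  # quiet period
spike = alert.observe(120)                                 # sudden slowdown

if __name__ == "__main__":
    print(steady, spike)
```

The baseline here is implicit: whatever the last few samples looked like. Nobody has to hard-code what a normal load time is for each service.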
  • 117. Tips And Tricks Patterns our brain should recognise 80
  • 122. Normalise + Add baseline: let machines determine the baseline 81
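
A minimal Python sketch of one way to normalise metrics against a machine-determined baseline. It assumes the baseline is the median of a recent history window; the function name `normalise` and all sample values are hypothetical.

```python
import statistics


def normalise(series, history):
    """Express each sample as a percentage of the baseline (median of history)."""
    baseline = statistics.median(history)
    return [round(100 * value / baseline, 1) for value in series]


# Two metrics with very different units end up on the same 0-to-100%-and-above scale.
requests_per_s = normalise([900, 1000, 1500], history=[950, 1000, 1050])
latency_ms = normalise([45, 50, 75], history=[48, 50, 52])

if __name__ == "__main__":
    print(requests_per_s)
    print(latency_ms)
```

Once every graph is expressed relative to its own baseline, the same shape means the same thing on every panel, which is what lets the eye pick out the odd one.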
  • 123. Anomaly detection in fluctuating traffic IOPS 82
  • 128. Derivative (Detect big spikes): derivative(IOPS) [Chart regions labelled OK and Anomalies] 83
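
The idea behind plotting derivative(IOPS) can be sketched in plain Python: the raw series fluctuates a lot, but its sample-to-sample change stays small unless something abrupt happens. The IOPS values, the `max_step` threshold and the helper name `spikes` are assumptions made for illustration.

```python
def derivative(series):
    """Discrete derivative: change between consecutive samples."""
    return [b - a for a, b in zip(series, series[1:])]


def spikes(series, max_step):
    """Indices of samples whose jump from the previous sample exceeds max_step."""
    return [i + 1 for i, d in enumerate(derivative(series)) if abs(d) > max_step]


# Hypothetical IOPS samples: a smooth ramp with one abrupt jump and drop.
iops = [100, 120, 145, 170, 400, 190, 210, 230]

if __name__ == "__main__":
    print(derivative(iops))
    print(spikes(iops, max_step=100))
```

A fixed threshold on the raw IOPS value would either fire during normal daily peaks or miss the jump entirely; a threshold on the derivative only reacts to how fast the value changes.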
  • 129. Different visuals to spot differences Stacked Area 84
  • 131. Different visuals to spot differences Overlapping Lines 85
  • 134. Flattening effect saturation of a resource or discontinuation of flow Slawek Ligus, “Effective Monitoring and Alerting”, O’Reilly 2012 86
  • 136. Regular anomalies check your cron jobs Slawek Ligus, “Effective Monitoring and Alerting”, O’Reilly 2012 87
  • 138. Heat-Maps 89
  • 140. Look! Rib cages! Network load viz http://www.network-weathermap.com/ http://cacti.net 90
  • 142. 10-40GB links - Bandwidth monitor Great, but not enough! Contextualise metrics http://www.network-weathermap.com/ http://cacti.net 91
  • 143. HeatMaps: Cacti + WeatherMap Cacti: Network graphing solution harnessing the power of RRDTool’s data storage and graphing functionality. Provides a fast poller, graph templating, multiple data acquisition methods. Weathermap: Cacti plugin to integrate network maps into the Cacti web UI. Includes a web-based map editor. 92
  • 146. Network throughput / latency [Network map with each link annotated with its rate, from 13/s to 7312/s]. Augmentation service timing out? Consumer slower than producer? Graphite datasource for Weathermap: https://github.com/alexforrow/php-weathermap-graphite 93
  • 150. Server load: memory, CPU, disk... [Map with nodes annotated by load, e.g. 500%]. CPU/memory overload on filtering node? Slow DB queries? Disk storage running out of space? Graphite datasource for Weathermap: https://github.com/alexforrow/php-weathermap-graphite 94
  • 151. Conclusions Almost beer time... 95
  • 152. Guidelines: dashboards for humans. Make the subtle obvious. Make the complex/busy simple/clean. Group data by context, not means of production. Detect anomalies/deviation from norm. Turn raw numbers into graphs. Appeal to intuition, conserve attention 96
  • 153. References: http://www.alberton.info/talks; Daniel Kahneman, “Thinking, Fast and Slow”, Penguin Books 2012; Slawek Ligus, “Effective Monitoring and Alerting”, O’Reilly 2012; Stephen Few, http://www.perceptualedge.com/; http://www.dashboardinsight.com; Coda Hale, “The Programming APE” 97
  • 155. Lorenzo Alberton @LorenzoAlberton Thank you! lorenzo@alberton.info http://www.alberton.info/talks http://joind.in/8060 99