Capacity: It's Not
  All About U!
         (née: “RegardingCapacity”)




Bob Sneed - Sr. Staff Engineer
   Sun Microsystems, Inc.
 Performance & Applications
      Engineering (PAE)

 Hotsos Symposium 2008, March 2-6 @ Dallas
          Rev 1.9c – March 19, 2008
  Copyright © 2008, Sun Microsystems, Inc.
            All Rights Reserved.
Abstract
  When it comes to managing computer capacity, the state-of-the-industry
  is wildly diverse -- but often both primitive and inconsistent in the
  area of enterprise computing. Indeed, most discussions regarding
  capacity don't even involve appropriate engineering units of measure!
  It's no surprise that the relationship between capacity management,
  performance management, and Quality of Service (QoS) management is so
  uneven in practice. This session will survey modern quandaries in
  Performance and Capacity Management, and offer some insights and
  abstractions aimed at stimulating constructive discussion, progressive
  engineering development, and intelligent practices in this area.

Disclaimers
  Opinions and views expressed herein are those of the author,
    Bob Sneed, and do not represent any official opinion of Sun
           Microsystems, Incorporated - or anyone else.
I'm not a doctor and I don't even play one on TV - but I do regard
          Tom Baker and Chris Eccleston as role models.
 There is no warranty, expressed or implied, in the quality of the
      information herein, or its fitness for any given purpose.
  If you goof up applying this stuff and have a bad outcome or
        destroy a bunch of data – it's not my fault or Sun's.
                   This is version 1.x material.
                    Batteries not included.
                 Your mileage may vary (YMMV).
Agenda

•   Motivations                                  [10]
•   Let's Talk PerfCap                        [15]
•   Case Study                                   [10]
•   Ruminations on the State of the Art          [ 5]
•   Heterogeneity, Elasticity, and Covariance    [15]
•   Concluding Remarks                           [ 5]
(All times in Bob-minutes; YMMV ...)




Motivations




Concerns and Premises
• Primitivism: Many customers are doing capacity
  wrong with the result being variously massive over-
  provisioning, surprises in production, or much ado
  about normal!
• I'm annoyed: Many “capacity crises” are actually
  either chaos in action or misunderstandings about
  The Way Things Work.
• Advancing the art: Investments are required to
  make industry advances in managing Performance
  and Capacity (PerfCap).
• Customer value: Right-sizing is a win-win scenario.

How widespread is “wrong”?
• It's not that everyone is doing it wrong ...
  >   ... though even many who do PerfCap right are crippled by
      organizational behaviour and GIGO constraints ...
• In some places, PerfCap tends to get done right ...
  >   Technical computing (HPC, HPTC)
  >   Embedded computing & realtime systems
  >   In well-defined tiers with homogeneous workloads
• In some places, PerfCap tends to get done wrong ...
  >   Commercial IT – especially around big databases
  >   Heterogeneous workloads - some inherently complex,
      some resulting from consolidation or virtualization
• Bob says: “Tiers are for people who have not
  discovered resource and workload management!”
PerfCap / Physics Metaphor
• Primitivism, pre-science ~ state of the practice
  >   Wonder; everything is mystery and magic
  >   Underlying causes attributed to nature or deities
  >   Stagnant - “Because we've always done it that way”
• Newtonian physics ~ state of the art
  >   Causality; testable hypotheses, repeatable outcomes
  >   Mathematical relationships determined
  >   Enables the modern era
• Einsteinian physics ~ the horizon
  >   Relativity; frames of reference
  >   True nature of things theorized; testability gets harder
  >   Propels the post-modern era
Over-Provisioning; so What?
• Pros ...
  >   Hardware is cheap. Sun sells hardware. Good for Bob!
  >   Feature/function time-to-market has priority.
  >   Performance expertise scarce and inconsistent.
  >   No time for learning “new tricks”.
  >   “Throwing Iron” at problems has a fixed cost and a set
      delivery date - and it often “works”.
• Cons ...
  >   Capital costs
  >   Operational costs (power, cooling, space, administration)
  >   Stagnation: The applicable math, science, and vocabulary
      have ended up deferred – for nearly an entire era.


Let's Talk PerfCap




PerfCap Language: Goals
• Business Metrics
  >   System performance in business terms, such as
      transactions per second, batch run time, or percent of
      jobs/transactions meeting some performance criteria
      (Service Level Agreement, or SLA)
  >   Business objectives are typically diverse in terms of
      importance and resource demands
• Business Metrics and Indicators (BMIs)
  >   Business metrics plus secondary indicator variables, such
      as aggregate packet rate or commit rate
  >   These are observables one might monitor and alarm on

PerfCap: Solving the Right Problem
• “The Goal” - Goldratt
   >   Written as a novel; an unusual approach
       to conveying principles from Operations
       Research


 • “Are Your Lights On?” - Gause &
   Weinberg
   >   A fun and easy read
   >   From the same Weinberg as the classic
       “Psychology of Computer Programming”

PerfCap Language: Capacity
 • Some definitions
   >   English: The ability to do a job.
   >   Technical: The maximum reliable throughput with
       acceptable response times.
   >   Geek: The throughput limitation of the bottleneck device.
 • Supermarket metaphors
   >   What percent of cashiers should be always idle?
   >   What purposes do “express lanes” serve?
 • Submarine metaphor
   >   Compare “100% underwater” with “crush depth”; which
       one represents capacity?
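
The “Geek” definition above corresponds to the bottleneck bound from operational analysis: if each transaction needs D_k seconds of service at device k, throughput cannot exceed 1 / max(D_k). A minimal sketch follows; the device names and demand figures are invented for illustration and do not come from the talk.

```python
def bottleneck_capacity(service_demands):
    """Return (bottleneck device, throughput ceiling in tx/sec).

    service_demands maps a device name to the seconds of service that
    one transaction needs at that device (hypothetical figures).
    """
    device = max(service_demands, key=service_demands.get)
    return device, 1.0 / service_demands[device]


# Hypothetical per-transaction service demands, in seconds.
demands = {"cpu": 0.020, "disk": 0.050, "net": 0.005}
device, x_max = bottleneck_capacity(demands)
print(f"bottleneck={device}, capacity={x_max:.1f} tx/sec")
```

Note that this ceiling says nothing about response time; the “Technical” definition adds that constraint on top of it.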


PerfCap Language: Capacity Planning
 • Capacity Planning defined - with footnotes
   >   Estimating[A] capacity requirements[B] in time to be able
       to order, receive, provision, and deploy – before you run out
       of capacity.
   [A] Prognostication and prestidigitation, usually based on B.S. forecasts
      from marketing departments
   [B] NOTE: Related disciplines increase capacity without capital outlays
        ● Efficiency – doing more with less; tuning; optimization
        ● Software Performance Engineering (SPE) – the discipline of
          engineering to meet performance requirements
 • It's not all about U! (Utilization)
   >   It's mostly about R (response time), X (throughput),
       service demands, and efficiency (which relates to U) and
       The Way Things Work
PerfCap Language: Queuology
• Queueing Theory = math used for PerfCap work
  >   Too bad it does not have a simple one-word name like
      arithmetic, calculus, topology, trigonometry, or sadistics
      (how about “queuology”?)
• Response-time = Queue wait + Service time
  >   R=W+S
  >   NOTE: This is not Plain English. It must be taught in
      context to enable meaningful conversations.
• Bottleneck = scaling constraint
  >   NOTE: This is not Plain English. In PerfCap, this term has
      no negative emotional connotation.
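
As a rough illustration of R = W + S, the sketch below uses the textbook single-server M/M/1 result R = S / (1 - U). The M/M/1 assumptions (random arrivals, exponential service, one server) and the 10 ms service time are choices made here for illustration, not something the slides specify; real systems will deviate.

```python
def response_time(wait, service):
    """R = W + S, all in seconds."""
    return wait + service


def mm1_response_time(service, utilization):
    """Mean response time of an M/M/1 queue: R = S / (1 - U)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service / (1.0 - utilization)


S = 0.010  # assumed service time: 10 ms
for u in (0.10, 0.50, 0.80, 0.90, 0.95):
    r = mm1_response_time(S, u)
    w = r - S  # queue wait is whatever response time exceeds service time
    print(f"U={u:4.0%}  W={w * 1000:7.2f} ms  R={r * 1000:7.2f} ms")
```

Under these assumptions the service time never changes; all of the growth in R comes from W.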

PerfCap Language: Crazy about U!
• Utilization (U)
  >   The percent of time a resource is not idle
  >   Physics analogy: Work = Force * Displacement
       ● No displacement means no work
• Another physical metaphor ...
  >   Helicopter: What does a helicopter's engine tachometer
      tell you about the helicopter's performance?




PerfCap Language: U is for Useless?
• “Utilization is Virtually Useless as a Metric” - Adrian
  Cockcroft, CMG 2006
  > http://perfcap.blogspot.com/2005/12/cmg05-trip-comments-and-utilization-is.html
  > http://www.cmg.org/membersonly/2006/papers/6133.pdf

      “We have all been conditioned over the years to use utilization or %busy as the
      primary metric for capacity planning. Unfortunately, with increasing use of CPU
      virtualization and sophisticated CPU optimization techniques such as hyper-threading
      and power management the measurements we get from systems are "virtually
      useless". This paper will explain many of the fundamental alternatives, and express
      capacity in terms of headroom, in units of throughput within a response time limit.”

• Adrian wins 2007 CMG Michelson Award
  >   http://perfcap.blogspot.com/2007/12/a-michelson-award-acceptance-speech.html
      "Those who ask questions about utilization don't understand that their questions
      have no meaning so the answers are irrelevant :-)"

Aggregate Utilization: U-all?
• Business Logic
  >   Workload classes (eg: OLTP, BATCH, pseudo-BATCH)
      ● Vary in business priority
      ● Vary in relative I/O content
      ● Vary in propensity to compute
  >   Per-class utilization varies based on many system factors
      (CPU architecture, OS scheduling, space/speed tradeoffs,
      efficiency tradeoffs, virtualization), and also due to often-
      uncontrolled competition for resources
  >   Cycles-per-instruction (CPI) varies with compile/build
      factors and competition factors
  >   Utilization is limited by concurrency of demand and
      bounded by serialization per Amdahl's Law
  >   Utilization often largely due to bad app code and/or bugs
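
The Amdahl's Law bullet can be made concrete. For a single job with serial fraction s on n CPUs, speedup is 1 / (s + (1 - s) / n), and the average fraction of the machine that job can keep busy is speedup / n. This is a sketch under that single-job idealization; the 5% serial fraction is an assumed figure.

```python
def amdahl_speedup(serial_fraction, n_cpus):
    """Amdahl's Law: speedup of one job on n_cpus processors."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cpus)


def utilization_ceiling(serial_fraction, n_cpus):
    """Largest fraction of n_cpus that the job can keep busy on average."""
    return amdahl_speedup(serial_fraction, n_cpus) / n_cpus


SERIAL = 0.05  # assumed: 5% of the work is serialized
for n in (1, 4, 24, 64):
    print(f"{n:3d} CPUs: speedup {amdahl_speedup(SERIAL, n):5.2f}, "
          f"utilization ceiling {utilization_ceiling(SERIAL, n):6.1%}")
```

In this idealization a low aggregate utilization can simply reflect serialization in the workload rather than spare capacity.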
Aggregate Utilization: U what !?@#!
• Overhead categories
  >   Polling operations
  >   Lock and latch spins (adaptive)
  >   Locking and latching cache coherency
  >   Memory management (a maze of twisty passages ...)
  >   Re-work (fail-and-retry logic)
  >   Migrations & cache invalidations
  >   Context switches (voluntary and involuntary)
  >   Hardware thread-switching (some cheap, some not)
      ● SMP, VMT, SMT, CMT – all different!
  >   Performance monitoring and management tools
      ● Significant “probe effect” can occur from some tools
      ● The aggregate impact of tools is often a root cause of problems
  >   Bad tuning and bugs - outside of the business logic
PerfCap Language: Like, U-know?
• Workload Characterization
  >   PerfCap definition: Attribution of resource utilization to
      various distinct business processes or technical
      functionality
      ● Essential to understanding resource usage
  >   Engineering definition: Characterization of platform
      response factors under a given workload
      ● Interesting to drive systems engineering
  >   Vernacular definition: Various broad terms like OLTP,
      BATCH, DSS, DW, PROD, UETP, DVLP, TEST, OLAP,
      ERP, ETL, ad-hoc, and my personal favourite - “mixed”
      ● Suggestive of requirements, but non-quantitative

Hockey Sticks and Knees 4 U




     Excerpted from “Analyzing Computer System Performance” by Neil J. Gunther,
     Springer-Verlag 2005. ISBN 3540208658 (Used with permission.)

So, what do U know?
 •   Do you know your overhead/work ratio?
 •   Do you know your ratio of OLTP to pseudo-BATCH?
 •   Do you know how these vary under load?
 •   Do you know how to observe, measure, and manage
     these things?




PerfCap Language: Method Rrrrrr!
           Right                                                            Wrong
 • Performance                                     • Performance
   > Response time                                   > CPU %busy, %usr/%sys ratio
   > Throughput                                      > IOPS, disk latency, %wio
   > Variance                                        > Graphs of aggregated data
 • Capacity                                        • Capacity
   > Latent performance                              > Whatever you get at 100% utilization
 • Headroom                                        • Headroom
   > ((100% capacity) –                              > (100% – utilization)
        (current peak performance))
 • Utilization                                     • Utilization
   > (100% – %idle)                                  > (100% – headroom)
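
The “Right” and “Wrong” headroom definitions above give very different answers for the same system. The sketch below contrasts them; the throughput and utilization figures are illustrative assumptions, and “capacity” here means the throughput at which the response-time limit is just met.

```python
def headroom_right(capacity_tps, current_peak_tps):
    """Headroom in units of throughput: capacity minus current peak."""
    return capacity_tps - current_peak_tps


def headroom_wrong(utilization_pct):
    """The utilization-based shortcut: 100% minus utilization."""
    return 100.0 - utilization_pct


# Assumed figures: SLA holds up to 400 tx/sec; today's peak is 100 tx/sec
# and it happens to show 80% CPU busy.
spare_tps = headroom_right(capacity_tps=400.0, current_peak_tps=100.0)
print(f"right: {spare_tps:.0f} tx/sec spare ({spare_tps / 100.0:.0%} of current peak)")
print(f"wrong: {headroom_wrong(80.0):.0f}% 'headroom'")
```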


Case Study




Case Study: Scenario
• Financial E10K user upgraded to E2900
  >   CPU power of E2900 was 125% that of the 10K system
       ● E10K: #64 US-II @ (64 “slow” cores)
       ● E2900: #12 US-IV+ @ (24 “fast” cores)
  >   Result: Utilization on E2900 was greater than on E10K!
  >   Impact: Great angst! Management wanted %idle > 20!
      E2900 dissed. Move to E6900 contemplated. (Focus was
      on utilization (U) ... response-time (R) and throughput (X)
      were essentially ignored)
  >   Breakthrough! Customer agreed to a test-to-fail exercise!
       ● Monitor response times per-transaction-class
       ● Increase benchmark workload until SLA not met
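
The test-to-fail procedure can be sketched as a scan over measurements taken at increasing load: the capacity is the last load level at which the SLA was still met. The measurement values below are invented for illustration; they are not the customer's data.

```python
def last_load_meeting_sla(samples, sla_seconds):
    """Return the highest load level reached before the first SLA miss.

    samples: (users, response_time_seconds) pairs in increasing load order.
    Returns None if even the first sample misses the SLA.
    """
    last_ok = None
    for users, response_time in samples:
        if response_time > sla_seconds:
            break  # first failure: stop ramping
        last_ok = users
    return last_ok


# Hypothetical ramp for one transaction class with a 0.5 sec SLA.
measurements = [(100, 0.12), (200, 0.15), (300, 0.21), (400, 0.38), (500, 0.95)]
capacity_users = last_load_meeting_sla(measurements, sla_seconds=0.5)
print(f"SLA met up to {capacity_users} users")
```

In practice the scan would be run per transaction class, since each class has its own SLA.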
It's not all about U!
[Figure: three stacked plots against number of users (x-axis: Users, 0 to 500)
   RTX2: response time, 0 to 600 sec scale; SLA = 600 sec
   RTX1: response time, 0 to 0.5 sec scale; SLA = 0.5 sec
   UCPU: CPU utilization, 0 to 100% scale; Max = 100%
   Annotations on the UCPU plot: “OMG! 20% Headroom?” and “No! 300% Headroom”]
Case Study: Experimental Results
• The new system had plenty of latent capacity!
  >   Test-to-fail revealed 300% headroom at 80% utilization!
  >   All they needed was 1X headroom at 100 users!
  >   Workload characterization revealed that a single CPU-
      greedy transaction of no business importance was vastly
      over-achieving its SLA
  >   The CPU-greedy transaction under Solaris TS scheduling
      automatically fell to priority 0 - thus having zero impact on
      real OLTP as OLTP demand ramped up to 4x the level that
      corresponded with 80% aggregate CPU utilization
  >   At the “tipping point”, the chaos may have been due to
      LGWR priority dropping to 0 under Solaris TS scheduling
Case Study: Business Outcome
• Customer made an emergency upgrade to an E6900
  >   CPU power of E6900 was 200% that of the 10K system
  >   Rumor has it that they got a really good discount
  >   E6900 showed a “comforting” 20%+ idle under full test load
• Moral
  >   Science is often secondary in commercial IT
  >   Due to issues of organizational behaviour, even empirical
      results might fail to triumph over rules of thumb
  >   The cost of hardware is a minor issue in many IT
      managers' decision-making
  >   Get over it ... or - develop new metrics and methods by
      which IT managers can be made comfortable!

Ruminations on the
  State of the Art




Common PerfCap Mistakes
• Absence of business metrics
  >   What Problem are You Trying to Solve?
• Equating usage with demand or requirement
  >   In other words, assuming that demand is inelastic
• Failure to do performance first and often
  >   Why scale waste and inefficiency?
• Assuming supply is inelastic
  >   In other words, assuming service times are constant
• Misinterpreting “the device with the highest utilization
  is the bottleneck device”
  >   Hmm, what about polling loops?
• Decisions based on intuition and rules of thumb
  >   Sophistication can pay great rewards!
What's the Right Way to do PerfCap?
1) Empirical Methods (The Best & Most Expensive)
  ● Benchmarks, stress testing, test-to-scale, test-to-fail – with known Best
    Practices & basic performance analysis and tuning
2) Modeling (Highly Recommended & Moderate Cost)
  ● Using tools such as TeamQuest Model (TQM), BMC Perform/Predict,
    HyPerformix, Gunther's PDQ or other application of proper science and math
3) Expert Opinions (The Minimum & Cheapest)
  ● Listening to the right experts for Best Practices, analysis and tuning
    methods, and sizing
4) Guesswork (The Norm)
  ● Straight-line extrapolations, naïve use of reference benchmarks, massive
    over-provisioning, bogus testing, luck
5) Opportunism (Commonplace)
  ● Spend the available budget
RTFM: PerfCap Resources
• Dr. Neil Gunther – prolific, readable, digestible
  >   “The Practical Performance Analyst” - foundational
      http://www.amazon.com/dp/059512674X/
  >   “Guerrilla Capacity Planning” -
      http://www.perfdynamics.com/Manifesto/gcaprules.html




RTFM: PerfCap Resources
• Cary Millsap – digestible, practical, methodical
  >   “Optimizing Oracle Performance”
      ● Chapters 1 & 2 – a great intro to the art of PerfCap, whether
        or not one applies it to Oracle
      ● Method R




RTFM: PerfCap Resources
• Raj Jain - “The Art of Computer Systems
  Performance Analysis”
  > Fundamental, foundational, readable




When Models Break
 • Good models break due to factors that are
   exogenous to the model (ie: not considered)
   >   Examples: bus saturation, cache saturation, lock
       contention, covariance
 • Bad models break because they are bad models
   >   Examples: “straight line” projections, models that do
       not consider basic queuing phenomena
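
To see why “straight line” projections count as bad models, the sketch below fits a line through two low-load points and extrapolates to 90% utilization, then compares that with what a simple queueing model predicts. The M/M/1 formula and the 10 ms service time are assumptions chosen for illustration only.

```python
def mm1_response_time(service, utilization):
    """Mean M/M/1 response time: R = S / (1 - U)."""
    return service / (1.0 - utilization)


def straight_line(p1, p2, x):
    """Extrapolate linearly through points p1 and p2 to position x."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    return y1 + slope * (x - x1)


S = 0.010  # assumed service time: 10 ms
low = (0.2, mm1_response_time(S, 0.2))   # "measured" at 20% busy
mid = (0.4, mm1_response_time(S, 0.4))   # "measured" at 40% busy

projected = straight_line(low, mid, 0.9)
queueing = mm1_response_time(S, 0.9)
print(f"straight-line projection at U=90%: {projected * 1000:.1f} ms")
print(f"queueing model at U=90%:           {queueing * 1000:.1f} ms")
```

Even this simplest queueing model diverges sharply from the straight line well before saturation.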




What Breaks Existing Models
 • Heterogeneity
   >   There is diversity in both supply and demand factors
   >   For example, OLTP, BATCH, and DSS are classical
       characterizations for common workload elements
 • Elasticity
   >   Resource supply and demand factors are each elastic
   >   For example, per-transaction demand might diminish under
       increasing load and supply might become more efficient
 • Covariance
   >   Competition for resources impacts all competitors -
       sometimes adversely or pathologically

Heterogeneity, Elasticity,
    and Covariance




Heterogeneity: Many Dimensions

 • Business priority
   >   Importance to the enterprise
 • Service demand
   >   Resource requirement, including deadline constraints
 • Technical priority
   >   Solaris scheduling priority
 • Quality (versus quantity)
   >   Not all CPU-seconds are created equal
 • Urgency
   >   Importance, as distinct from priority or share
       ● (example: princes and paupers)

Heterogeneity: Early Warning Signs

 •   “ERP”
 •   “Consolidation”
 •   “RDBMS”
 •   “Ad-hoc”
 •   “Custom”
 •   “Producer/Consumer”
 •   “Client/Server”
 •   “Dispatcher thread/process”
 •   Testimony to the contrary (eg: “It's entirely
     homogeneous OLTP!”)
Heterogeneity: Example(s)
# prstat -m
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
 13632 oracle    50  50 0.0 0.0 0.0 0.0 0.0 0.0   0   0 48K   0 sqlplus/1
 13633 oracle   0.0  96 0.0 0.0 0.0 0.0  48 0.0   0   0 46K   0 sqlplus/1
 15849 oracle    92 0.1 0.0 0.0 0.0 100 100 0.1  13  45  1K   0 oracle/11
 27639 oracle    91 0.1 0.0 0.0 0.0 100 100 0.1  24  50  2K   0 oracle/11
 13601 root      18  54 0.0 0.0 0.0 0.0  36 0.0 178 178 87K   0 ps/1
 13551 root     0.0  68 0.0 0.0 0.0 0.0  39 0.0 244 195 93K   0 prstat/1
 12614 oracle    64 0.2 0.0 0.0 0.0 100 100 0.1  50  38  3K   0 oracle/11
 24020 oracle    47 0.5 0.0 0.0 0.0 100 100 0.1 190  36 10K   0 oracle/11
[...]
 11087 oracle   9.3 0.1 0.0 0.0 0.0 0.0  90 0.0   5   6  6K   0 oracle/1
 13490 root     0.0 8.5 0.0 0.0 0.0 0.0  93 0.0 380   0 25K   0 sh/1
  2154 oracle   7.9 0.2 0.0 0.0 0.0 100 100 0.0  53   5  3K   0 oracle/11
  9656 oracle   7.1 0.1 0.0 0.0 0.0 0.0  92 0.0  37   5  2K   0 oracle/1
 24156 oracle   6.7 0.1 0.0 0.0 0.0 100 100 0.0   6   4  2K   0 oracle/11
 13496 oracle   6.2 0.0 0.0 0.0 0.0 0.0  93 0.0 341   0 19K   0 sh/1
 13488 oracle   6.0 0.0 0.0 0.0 0.0 0.0  96 0.0 330   0 19K   0 sh/1
 25478 oracle   3.9 0.1 0.0 0.0 0.0 0.0  96 0.0  46   3  2K   0 oracle/1
  8098 oracle   2.9 0.1 0.0 0.0 0.0 0.0  97 0.0  60   3  2K   0 oracle/1
[...]
Total: 295 processes, 2869 lwps, load averages: 11.64, 12.02, 12.05




Heterogeneity: Exploring

 • Fun commands you can use at home ...
 # Taking U apart
 prstat -n 8192 -m                 # Microstate accounting
 prstat -n 8192 -mL                # Per-thread microstate accounting
 # Thread count ...
 awk '{print $15}' < prstat-sample.1 | grep oracle | sort | uniq -c | more
 # CPU intensity ...
 grep oracle/ prstat-sample.1 | awk '{print $3}' | sort -n | uniq -c | more
 # Diverse priorities ...
 ps -e -o pid,class,pri,args
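
The same “taking U apart” idea can be sketched in Python for a saved prstat -m sample: sum the USR and SYS percentages per process name. The parsing assumes the 15-column layout shown on the previous slide, and the sample rows are a subset of that output; the function name is hypothetical.

```python
from collections import defaultdict

# A few rows copied from the prstat -m output shown earlier.
SAMPLE = """\
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
 13632 oracle    50  50 0.0 0.0 0.0 0.0 0.0 0.0   0   0 48K   0 sqlplus/1
 13633 oracle   0.0  96 0.0 0.0 0.0 0.0  48 0.0   0   0 46K   0 sqlplus/1
 15849 oracle    92 0.1 0.0 0.0 0.0 100 100 0.1  13  45  1K   0 oracle/11
 27639 oracle    91 0.1 0.0 0.0 0.0 100 100 0.1  24  50  2K   0 oracle/11
 13601 root      18  54 0.0 0.0 0.0 0.0  36 0.0 178 178 87K   0 ps/1
"""


def usr_sys_by_process(text):
    """Sum USR% and SYS% per process name from prstat -m style text."""
    totals = defaultdict(lambda: [0.0, 0.0])
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 15 or not fields[0].isdigit():
            continue  # skip the header, "[...]" and the Total line
        name = fields[14].split("/")[0]  # "oracle/11" -> "oracle"
        totals[name][0] += float(fields[2])  # USR column
        totals[name][1] += float(fields[3])  # SYS column
    return {name: tuple(pair) for name, pair in totals.items()}


totals = usr_sys_by_process(SAMPLE)
for name, (usr, sys_pct) in sorted(totals.items()):
    print(f"{name:10s} USR={usr:6.1f} SYS={sys_pct:6.1f}")
```

Here the two sqlplus sessions are dominated by SYS time while the oracle processes are almost entirely USR, which is the kind of heterogeneity an aggregate %busy figure hides.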

Heterogeneity: Deal with it!
 • Identify it
   >   This is one aspect of workload characterization in the
       language of PerfCap
   >   Consider its many dimensions (business priority, service
       demand, technical priority, urgency, deadlines)
 • Tell the OS about it
   >   The OS does not know your priorities, so tell it!
   >   Automating this is a good investment
 • Model it
   >   w.r.t. competition and covariance – TBD



Elasticity: Supply Factors

 • In general, “supply” is net of competing demands
   >   “I'm giving ya all I got, captain!”
   >   FCFS – who got in line first?
 • In a specific configuration, elastic factors abound
   >   With mixed-speed CPUs, Q(CPU-second) = f(MHz)
   >   With CMT, Q(CPU-second) = f(core loading)
   >   Q(CPU-second) = f(ISA & pipeline sophistication)
 • Unmanaged, the probability of thread pinning will
   increase with increasing interrupt load


Elasticity: Supply Factors

 • Priority preemption
   >   Good – under TS, compute hogs will drift to priority 0
   >   Bad - unmanaged, a large population of homogeneous
       threads may frivolously preempt each other
   >   Ugly – interrupts have top priority; they can even interrupt
       and “pin” realtime (RT) threads
   >   Hideous – it's really tragically bad when TS demotes your
       highest-importance thread (eg: Oracle LGWR)




Elasticity: Supply Factors
# mpstat 5
CPU minf mjf xcal intr ithr  csw icsw migr smtx srw syscl usr sys  wt idl
  0    0   0  211  449  142  423    4   21   25   0   460  17   2   0  82
  1    1   0  127  155    2  296    2    6   23   0   199  13   1   0  86
  2    0   0   30   30    0   56    0    3    9   0    64   1   0   0  98
  3    0   0    0    2    0    2    0    1    4   0     0   0   0   0 100
  8    1   0  199  278    0  548    4   11   37   0   470  23   1   0  76
  9    0   0    0    2    0    2    0    1    4   0     0   0   0   0 100
 10    0   0   30   53    0  104    0    3   11   0   155   4   0   0  95
 11    0   0    0    2    0    2    0    1    3   0     0   0   0   0 100
 16    1   0  178  258    0  508    3   10   29   0   521  16   1   0  82
 17    0   0    3    5    3    4    0    1    6   0     2   0   0   0 100
[...]
104    1   0  222  194    4  377    1    6   28   0   281  16   1   0  83
105    0   0    0    2    0    2    0    1    2   0     0   0   0   0 100
106    0   0    0    3    0    4    0    1    3   0    13   0   0   0 100
107    0   0    0    2    0    2    0    1    2   0     0   0   0   0 100
112    1   0  141  229    1  451    2    3   23   0   289  18   1   0  81
113    0   0    1    3    1    2    0    1    1   0     0   0   0   0 100
114    0   0    0    6    0    9    0    2    2   0     3   0   0   0 100
115    0   0    0    2    0    2    0    1    1   0     0   0   0   0 100
120    4   0  397  409    3  804    4    3   44   0   450  23   3   0  74
121    0   0    1    3    1    2    0    1    2   0     0   0   0   0 100
122    0   0   13   15    0   28    0    2    3   0    13   1   0   0  99
123    0   0    0    2    0    2    0    1    1   0     0   0   0   0 100
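
One way to read a sample like this is to compare the aggregate idle figure with the per-CPU spread. The sketch below does that for the first eight rows of the mpstat output above; the parsing assumes the 16-column layout shown, and the function name is hypothetical.

```python
# First eight data rows of the mpstat output shown above.
SAMPLE = """\
CPU minf mjf xcal intr ithr  csw icsw migr smtx srw syscl usr sys  wt idl
  0    0   0  211  449  142  423    4   21   25   0   460  17   2   0  82
  1    1   0  127  155    2  296    2    6   23   0   199  13   1   0  86
  2    0   0   30   30    0   56    0    3    9   0    64   1   0   0  98
  3    0   0    0    2    0    2    0    1    4   0     0   0   0   0 100
  8    1   0  199  278    0  548    4   11   37   0   470  23   1   0  76
  9    0   0    0    2    0    2    0    1    4   0     0   0   0   0 100
 10    0   0   30   53    0  104    0    3   11   0   155   4   0   0  95
 11    0   0    0    2    0    2    0    1    3   0     0   0   0   0 100
"""


def idle_by_cpu(text):
    """Map CPU id -> %idle from mpstat-style text (16 columns per row)."""
    idle = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 16 and fields[0].isdigit():
            idle[int(fields[0])] = int(fields[15])  # idl is the last column
    return idle


idle = idle_by_cpu(SAMPLE)
mean_idle = sum(idle.values()) / len(idle)
busiest_cpu = min(idle, key=idle.get)
print(f"mean idle across {len(idle)} CPUs: {mean_idle:.1f}%")
print(f"busiest CPU: {busiest_cpu} at {idle[busiest_cpu]}% idle")
```

The mean hides the fact that the load sits on a few CPUs while most of the others do nothing.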
Elasticity: Supply Factors
$ awk '{print $3,$4}' ps-sample.out | sort | uniq -c | sort -nr +2
   1 RT 157
   1 RT 140
   1 RT 100
   1 SYS 98
   1 SYS 96
   3 TS 60
   2 FX 60
   1 SYS 60
8238 TS 59
   1 TS 58
   3 TS 54
  11 TS 53
   2 TS 52
   1 TS 51
   6 TS 50
  14 TS 49
   1 TS 36
   1 TS 34
   1 TS 29
   1 TS 22
   1 TS 12
   3 TS 0
$ grep lgw ps-sample.out
10494      1  TS 34 ora_lgwr_XYZP

Slide callouts (placement approximate):
 • Near the top of the list (RT and SYS entries): “Important!”
 • Around TS 59 and just below: “Primary modality; OLTP shadows”
 • Lower TS priorities: “CPU hogs, punished by TS”
 • LGWR at TS 34: “Hey! Wait a minute! I'm really important!
   Why didn't anyone tell the OS? Help!”

NOTE: ps-sample.out data was from 'ps -e -o pid,ppid,class,pri,args'
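
The awk pipeline above can also be sketched in Python: tally processes by (scheduling class, priority) from text in the 'pid ppid class pri args' layout. Only the LGWR row below comes from the slide; the other rows and the function name are hypothetical and exist to make the example self-contained.

```python
from collections import Counter

# Layout: pid ppid class pri args. Only the first row is from the slide;
# the rest are invented for illustration.
SAMPLE = """\
10494     1 TS   34 ora_lgwr_XYZP
20001     1 RT  100 hypothetical_rt_daemon
20002     1 TS   59 oracleXYZP (LOCAL=NO)
20003     1 TS   59 oracleXYZP (LOCAL=NO)
20004     1 TS    0 hypothetical_cpu_hog
"""


def priority_histogram(text):
    """Count processes per (scheduling class, priority) pair."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split(None, 4)  # args may contain spaces
        if len(fields) < 4 or not fields[0].isdigit():
            continue  # skip headers and blank lines
        counts[(fields[2], int(fields[3]))] += 1
    return counts


hist = priority_histogram(SAMPLE)
# Highest priority first, like 'sort -nr' on the priority column.
for (sched_class, pri), n in sorted(hist.items(), key=lambda kv: -kv[0][1]):
    print(f"{n:5d} {sched_class} {pri}")
```

A follow-up check worth automating is a list of business-critical process names whose priority has drifted below the main population, which is the LGWR situation on this slide.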
Elasticity: Demand Factors
• “The mythical CPU-second”
  >   Sensitivity to compile options – eg: branch mispredicts,
      pipelining, inlined macro-operations versus library calls
  >   Sensitivity to link options – eg: locality versus I$ and D$
      behaviour
  >   Sensitivity to competition – could be viewed as elasticity of
      demand or supply, or as covariance ... depending on one's
      point of perspective
  >   Adaptive algorithms – eg: decisions to yield and re-queue
      (rather than spin) might be made as a function of system
      load, and that can reduce the CPU-sec/transaction as
      load increases
Elasticity: Demand Factors
• Under high load, frivolous migrations should
  decrease, leading to improved cache utilization and
  reduced memory waits
• Demand can vary in both quality (overhead/work) and
  quantity (overhead+work) as load is varied
  >   Ratio of business logic to spins for locks and latches
  >   Write coalescing by LGWR
  >   Checkpoint write deferral by DBWR




Elasticity: Deal with it!
 • Demand
   >   Seek out and destroy inefficiency – but keep the 80/20 rule
       in mind
   >   Use Resource Management (RM) at the app, OS, and DB
       levels – maybe Oracle Resource Manager (ORM)?
   >   The final constraints are the speed of your components
       and the speed of light
 • Supply
   >   Invest in getting required factor-level QoS to various
       processes in relation to their business criticality

Covariance: Pigs at the Trough
• Workloads often unmanaged and multi-modal
  >   Spectrum is wide, but simple case is BATCH vs. OLTP
• What if your OLTP SLA outliers are due to I/O
  competition from your BATCH?
  >   Maybe your BATCH is being over-served for I/O?
  >   Maybe you could throttle your BATCH I/O demands?
• What if your BATCH SLA outliers are due to CPU
  competition from your OLTP?
  >   Maybe your OLTP is being over-served for CPU?
  >   Maybe you could dynamically compromise on your OLTP
      CPU priority?
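Throttling BATCH I/O can be as simple as pacing its requests. The following is a minimal token-bucket sketch, not anything from the original deck: the class name, the `rate` and `burst` parameters, and the injectable `clock` are all assumptions made for illustration.

```python
import time


class TokenBucket:
    """Allow roughly `rate` operations per second, with bursts of up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def _refill(self):
        # Credit tokens for the time elapsed since the last call, capped at burst.
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def delay_for(self, cost=1.0):
        """Charge `cost` tokens; return the seconds the caller should sleep first."""
        self._refill()
        self.tokens -= cost
        if self.tokens >= 0:
            return 0.0
        # The bucket is in debt: the debt is repaid at `rate` tokens per second.
        return -self.tokens / self.rate
```

A batch job would call `time.sleep(bucket.delay_for())` before each I/O request. Lowering `rate` leaves more of the I/O subsystem to OLTP at the cost of a longer batch run, which is exactly the trade the slide asks about.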

Covariance: Some Examples

• “Foxes and chickens” problem: mixing incompatible
  species in the same cage
• Most famously: “batch versus OLTP”
  >   I/O demand by batch is what typically slows OLTP, but
      CPU demand by batch should not impact OLTP
  >   OLTP demand for I/O or CPU might impact batch
• Harder to see: “cache-sensitive” versus “cache-
  polluting” competition
  >   Cache-sensitive workload elements can be slowed by
      elements that constantly spoil the cache
• Heads-up! Virtualization means increased sharing!
Covariance: Deal with It!
 • Expensive: Physical segregation and isolation
   >   e.g. - run BATCH or reports on another system
   >   e.g. - dedicate disks, channels, buses, and CPU to
       business or technical functions as required
 • Primitive: Temporal segregation and isolation
   >   e.g. - run BATCH at night
 • Refined: Prioritization, throttling, deadline scheduling
   >   e.g. - run BATCH at low priority, inject delays, increase
       priorities as deadlines get closer
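The "refined" option, raising BATCH priority as its deadline approaches, can be sketched as a small policy function. This is an illustration under stated assumptions: the function name is hypothetical, priority is expressed as a Unix nice value (19 = most polite, 0 = normal), and the estimate of remaining work is assumed to come from somewhere else. How the value is applied (renice, priocntl, a resource manager) is deliberately left out.

```python
def batch_nice(work_remaining_s, time_to_deadline_s, lowest=19, highest=0):
    """Map deadline pressure to a nice value.

    Returns `lowest` (19) when there is ample slack and `highest` (0)
    when the remaining work no longer fits before the deadline.
    """
    if time_to_deadline_s <= 0 or work_remaining_s >= time_to_deadline_s:
        return highest
    # Pressure runs from 0.0 (nothing left to do) towards 1.0 (no slack left).
    pressure = work_remaining_s / time_to_deadline_s
    return round(lowest - pressure * (lowest - highest))
```

Re-evaluating this periodically lets a batch job stay out of the way while it has slack and compete harder only when the SLA is genuinely at risk.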


Concluding Remarks




Parting Thoughts
 • Participate in CMG http://www.cmg.org
   “Ignorance of the law is no excuse!”
 • Go where you may not have gone before
   >   Test-to-fail
   >   Analyse
   >   Fix or manage
   >   Repeat
 • If you are not managing to Business Metrics, you are
   wasting time and energy!



Q&A?
Special Thanks to ...
• Adrian Cockcroft, Cary Millsap, Jim Holtman, Dr. Neil Gunther
  > mentors and provocateurs
• David J. Miller, Benoit Chaffanjon
  > editorial services & peer review
• Glenn Fawcett
  > smoke-jumping brotherhood & cool graphics
• Jim Mauro
  > northern star
• Larry Klein
  > inspiration from “It's all about U” ... and in general
Extended Discussion Slides




Primitivism
• “You might be a redneck if ...”
  >   You think "capacity" is when you pass out.
  >   You cannot imagine why anyone would model a cue.
  >   You have only seen a queue on Hop Sing or David
      Carradine.
  >   You believe chaos past 80% utilization is a law of
      nature.
  >   You make no effort whatsoever to control what's
      important to you.


  [... with a tip of the hat to Jeff Foxworthy ...]

“Some people think that
  once they know the
tricks of the trade, that
 they know the trade.”

     “A little bit of
  knowledge can be a
   dangerous thing.”


Paths Forward
 • Increased education in PerfCap
   >   Math, science, language/vocabulary
   >   “Do performance first, then capacity.”
 • Increased usage of available tools
   >   Extract benefits, learn limitations, develop art
 • Increased networking amongst stakeholders
   >   Build awareness of what can go wrong; seek synergy
 • Breaking new ground
   >   CMT and Virtualization challenges
   >   Power management
   >   Automating workload management
   >   “PerfViz” - CMG focus area
   >   “Regarding Capacity” - Our focus for the rest of the hour ...
Water Glass Metaphors
 • Is it 50% full or 50% empty?
   >   CMG-speak: Is it 80% busy, or 20% under-utilized?
 • “Big Rocks”
   >   Demonstrates heterogeneity and priority




Two Views of “Best Practices”
• Bob Sneed's
  >   “Best Practices are time-proven and customer-proven
      practices which are well-documented and believed to
      have little or no downside potential.”
  >   “... practical workarounds for product design limitations”
  >   “... contrast with just works; needs no practices”
  >   “... contrast with tuning, which implies trial and error”
• Dr. Neil Gunther's
  >   “Best Practices are an admission of failure.”
  >   “... trading workarounds, practices, and 'rules of thumb'
      does not advance the science or deepen understanding”
  >   “... contrast with decomposing, understanding, modeling,
      proper engineering”
  >   “... just another form of trial and error”
Pop Quiz #1

• SITUATION: A system runs at 100% CPU usage for 1
  hour each day completing a single compute-bound
  task. The SLA requires the task to complete in 4
  hours.
• Q1: How much “headroom” does this system have?
• Q2: How can this task's resource footprint be
  managed to never exceed 80% CPU usage?



Pop Quiz #1: Answers
• SITUATION: A system runs at 100% CPU usage for 1
  hour each day completing a single compute-bound
  task. The SLA requires the task to complete in 4
  hours.
• Q1: How much “headroom” does this system have?
• A1: 300% (in workload terms) or 75% (in percent-of-
  system terms) - it can do 4x the work it now does and
  remain within the SLA.
• Q2: How can this task's resource footprint be
  managed to never exceed 80% CPU usage?
• A2a: Huh? Why would anyone want to do that?
• A2b: Resource management.
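The arithmetic behind A1, and what a cap would do in Q2, can be written out as a short sketch. The function names are illustrative, and the capped run time assumes the task slows down in direct proportion to the CPU it is denied.

```python
def headroom(run_hours, sla_hours):
    """Return (workload multiple, headroom in workload terms, unused share of the SLA window)."""
    multiple = sla_hours / run_hours
    return multiple, multiple - 1.0, 1.0 - run_hours / sla_hours


def capped_run_hours(run_hours, cpu_cap):
    """Elapsed time if the task is held to `cpu_cap` (0..1) of the CPU it used before.

    Assumes run time scales inversely with the CPU share granted.
    """
    return run_hours / cpu_cap
```

With the quiz numbers, `headroom(1, 4)` gives a 4x multiple, 300% headroom in workload terms, and 75% of the window unused. Under the linear-scaling assumption, an 80% cap would stretch the run to about 1.25 hours, still well inside the 4-hour SLA.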
Pop Quiz #2

• SITUATION: An 8-way 1000-BogoMIPs box runs at
  75% CPU busy, with a workload that includes four
  compute-bound threads plus some OLTP. The new
  target system is a 4-way 2000-BogoMIPs system.
• Q1: What is the new system's projected CPU
  utilization?
• Q2: How can this system's workload be managed to
  never exceed 75% CPU utilization?


Pop Quiz #2: Answers
• SITUATION: An 8-way 1000-BogoMIPs box runs at
  75% CPU busy, with a workload that includes four
  compute-bound threads plus some OLTP. The new
  target system is a 4-way 2000-BogoMIPs system.
• Q1: What is the new system's projected CPU
  utilization?
• A1: 100%. Each of the four compute-bound threads
  will keep one CPU 100% busy.
• Q2: How can this system's workload be managed to
  never exceed 75% CPU utilization?
• A2a: Huh? Why would anyone want to do that?
• A2b: Resource management.
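A rough projection for A1 can be sketched as follows. The assumptions are mine, not the deck's: each compute-bound thread occupies a whole CPU on either box, everything else on the old box (the OLTP) accounts for the rest of its busy time, and that remainder shrinks in proportion to per-CPU BogoMIPS on the new box.

```python
def projected_utilization(old_cpus, old_mips, old_util,
                          new_cpus, new_mips, compute_threads):
    """Return (CPUs demanded on the new box, projected utilization capped at 1.0)."""
    old_per_cpu = old_mips / old_cpus
    new_per_cpu = new_mips / new_cpus
    # Busy CPUs on the old box that are not explained by the compute-bound threads.
    other_old_cpus = max(0.0, old_util * old_cpus - compute_threads)
    # The same work expressed in new-box CPUs (assumes demand scales with BogoMIPS).
    other_new_cpus = other_old_cpus * old_per_cpu / new_per_cpu
    demand = compute_threads + other_new_cpus
    return demand, min(1.0, demand / new_cpus)
```

For the quiz system, `projected_utilization(8, 1000, 0.75, 4, 2000, 4)` yields a demand of 4.5 CPUs against 4 available, so utilization pins at 100% even though the raw BogoMIPS total doubled.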
Pop Quiz #3

• SITUATION: An 8-way 1000-BogoMIPs box runs at
  75% CPU busy, with a workload that includes four
  compute-bound threads plus some OLTP. The new
  target system is a 4-way 2000-BogoMIPs system.
  (Same as last quiz, OK?)
• Q1: How will the compute-bound threads'
  performance be impacted by the upgrade? (Just
  roughly speaking – no need for precision here!)


Pop Quiz #3: Answers
• SITUATION: An 8-way 1000-BogoMIPs box runs at
  75% CPU busy, with a workload that includes four
  compute-bound threads plus some OLTP. The new
  target system is a 4-way 2000-BogoMIPs system.
  (Same as last quiz, OK?)
• Q1: How will the compute-bound threads'
  performance be impacted by the upgrade? (Just
  roughly speaking – no need for precision here!)
• A1: Each should run almost 4x faster. Each new CPU is
  4x faster than an old one: (2000/4)/(1000/8) = 4.
  The OLTP will use some of the CPU cycles, but its
  service demand pales next to the compute jobs.
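The per-CPU ratio in A1, and the effect of sharing a CPU with OLTP, can be checked with a small sketch. The function names and the `cpu_share` parameter are illustrative assumptions; the share a thread really gets depends on the scheduler and on how much the OLTP actually demands.

```python
def per_cpu_speedup(old_cpus, old_mips, new_cpus, new_mips):
    """Ratio of per-CPU BogoMIPS, new box over old box."""
    return (new_mips / new_cpus) / (old_mips / old_cpus)


def thread_speedup(per_cpu_ratio, cpu_share):
    """Speedup of a thread that gets `cpu_share` (0..1) of a new CPU,
    compared with having a whole old CPU to itself."""
    return per_cpu_ratio * cpu_share
```

`per_cpu_speedup(8, 1000, 4, 2000)` gives 4.0, matching the slide. If OLTP were to take, say, an eighth of each new CPU (a made-up figure), `thread_speedup(4.0, 0.875)` gives 3.5, which is the sense in which the answer is "almost" 4x.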
Pop Quiz #4



• ESSAY QUESTION: “At what point do these
  principles become difficult?”





Hotsos 08 regarding_capacity_1_9c

  • 1. Capacity: It's Not All About U! (née: “RegardingCapacity”) Bob Sneed - Sr. Staff Engineer Sun Microsystems, Inc. Performance & Applications Engineering (PAE) Hotsos Symposium 2008, March 2-6 @ Dallas Rev 1.9c – March 19, 2008 Copyright © 2008, Sun Microsystems, Inc. All Rights Reserved.
  • 2. Abstract When it comes to managing computer capacity, the state-of- the-industry is wildly diverse -- but often both primitive and inconsistent in the area of enterprise computing. Indeed, most discussions regarding capacity don't even involve appropriate engineering units of measure! It's no surprise that the relationship between capacity management, performance management, and Quality of Service (QoS) management is so uneven in practice. This session will survey modern quandaries in Performance and Capacity Management, and offer some insights and abstractions aimed at stimulating constructive discussion, progressive engineering development, and intelligent practices in this area. Copyright © 2008, Sun Microsystems, Inc. All rights reserved. 2
  • 3. Disclaimers Opinions and views expressed herein are those of the author, Bob Sneed, and do not represent any official opinion of Sun Microsystems, Incorporated - or anyone else. I'm not a doctor and I don't even play one on TV - but I do regard Tom Baker and Chris Eccleston as role models. There is no warranty, expressed or implied, in the quality of the information herein, or its fitness for any given purpose. If you goof up applying this stuff and have a bad outcome or destroy a bunch of data – it's not my fault or Sun's. This is version 1.x material. Batteries not included. Your mileage may vary (YMMV). Copyright © 2008, Sun Microsystems, Inc. All rights reserved. 3
  • 4. Agenda • Motivations [10] • Let's Talk PerfCap [15] • Case Study [10] • Ruminations on the State of the Art [ 5] • Heterogeneity, Elasticity, and Covariance [15] • Concluding Remarks [ 5] (All times in Bob-minutes; YMMV ...) Copyright © 2008, Sun Microsystems, Inc. All rights reserved. 4
  • 5. Motivations Copyright © 2008, Sun Microsystems, Inc. All rights reserved. 5
  • 6. Concerns and Premises • Primitivism: Many customers are doing capacity wrong with the result being variously massive over- provisioning, surprises in production, or much ado about normal! • I'm annoyed: Many "capacity crises” are actually either chaos in action or misunderstandings about The Way Things Work. • Advancing the art: Investments are required to make industry advances in managing Performance and Capacity (PerfCap). • Customer value: Right-sizing is a win-win scenario. Copyright © 2008, Sun Microsystems, Inc. All rights reserved. 6
  • 7. How widespread is “wrong”? • It's not that everyone is doing it wrong ... > ... though even many who do PerfCap right are crippled by organizational behaviour and GIGO constraints ... • In some places, PerfCap tends to get done right ... > Technical computing (HPC, HPTC) > Embedded computing & realtime systems > In well-defined tiers with homogeneous workloads • In some places, PerfCap tends to get done wrong ... > Commercial IT – especially around big databases > Heterogeneous workloads - some inherently complex, some resulting from consolidation or virtualization • Bob says: “Tiers are for people who have not discovered resource and workload management!”
  • 8. PerfCap / Physics Metaphor • Primitivism, pre-science ~ state of the practice > Wonder; everything is mystery and magic > Underlying causes attributed to nature or deities > Stagnant - “Because we've always done it that way” • Newtonian physics ~ state of the art > Causality; testable hypotheses, repeatable outcomes > Mathematical relationships determined > Enables the modern era • Einsteinian physics ~ the horizon > Relativity; frames of reference > True nature of things theorized; testability gets harder > Propels the post-modern era
  • 9. Over-Provisioning; so What? • Pros ... > Hardware is cheap. Sun sells hardware. Good for Bob! > Feature/function time-to-market has priority. > Performance expertise scarce and inconsistent. > No time for learning “new tricks”. > “Throwing Iron” at problems has a fixed cost and a set delivery date - and it often “works”. • Cons ... > Capital costs > Operational costs (power, cooling, space, administration) > Stagnation: The applicable math, science, and vocabulary has ended up deferred – for nearly an entire era.
  • 10. Let's Talk PerfCap
  • 11. PerfCap Language: Goals • Business Metrics > System performance in business terms, such as transactions per second, batch run time, or percent of jobs/transactions meeting some performance criteria (Service Level Agreement, or SLA) > Business objectives are typically diverse in terms of importance and resource demands • Business Metrics and Indicators (BMIs) > Business metrics plus secondary indicator variables, such as aggregate packet rate or commit rate > These are observables one might monitor and alarm on
  • 12. PerfCap: Solving the Right Problem • “The Goal” - Goldratt > Written as a novel; an unusual approach to conveying principles from Operations Research • “Are Your Lights On?” - Gause & Weinberg > A fun and easy read > From the same Weinberg as the classic “Psychology of Computer Programming”
  • 13. PerfCap Language: Capacity • Some definitions > English: The ability to do a job. > Technical: The maximum reliable throughput with acceptable response times. > Geek: The throughput limitation of the bottleneck device. • Supermarket metaphors > What percent of cashiers should be always idle? > What purposes do “express lanes” serve? • Submarine metaphor > Compare “100% underwater” with “crush depth”; which one represents capacity?
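The “Geek” definition above can be made concrete with the Utilization Law from operational analysis (U = X · D, where D is the per-transaction service demand at a device). This is a minimal sketch: the device names and service demands are hypothetical and do not come from the slides.

```python
# Hypothetical per-transaction service demands in seconds (not from the slides).
service_demand = {"cpu": 0.020, "disk": 0.050, "network": 0.005}


def bottleneck(demands):
    """Return (device, max_throughput) for the device with the largest demand.

    The device with the largest service demand saturates first, so it caps
    system throughput at 1 / D_max transactions per second.
    """
    device = max(demands, key=demands.get)
    return device, 1.0 / demands[device]


def utilization(demands, throughput):
    """Utilization Law: U = X * D for each device at throughput X."""
    return {dev: throughput * d for dev, d in demands.items()}


device, x_max = bottleneck(service_demand)
print(f"bottleneck={device}, capacity={x_max:.1f} tx/sec")
print(utilization(service_demand, 10.0))
```

Note that capacity here is expressed in throughput units, not as a percentage of anything, which is the point the slide is making.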
  • 14. PerfCap Language: Capacity Planning • Capacity Planning defined - with footnotes > Estimating[A] capacity requirements[B] in time to be able to order, receive, provision, and deploy – before you run out of capacity. [A] Prognostication and prestidigitation, usually based on B.S. forecasts from marketing departments [B] NOTE: Related disciplines increase capacity without capital outlays ● Efficiency – doing more with less; tuning; optimization ● Software Performance Engineering (SPE) – the discipline of engineering to meet performance requirements • It's not all about U! (Utilization) > It's mostly about R (response time), X (throughput), service demands, and efficiency (which relates to U) and The Way Things Work
  • 15. PerfCap Language: Queuology • Queueing Theory = math used for PerfCap work > Too bad it does not have a simple one-word name like arithmetic, calculus, topology, trigonometry, or sadistics (how about “queuology”?) • Response-time = Queue wait + Service time > R=W+S > NOTE: This is not Plain English. It must be taught in context to enable meaningful conversations. • Bottleneck = scaling constraint > NOTE: This is not Plain English. In PerfCap, this term has no negative emotional connotation.
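The relation R = W + S can be tabulated under one simple assumption. The sketch below assumes a single M/M/1 queue (random arrivals, one server), which the slides do not specify; under that assumption the queue wait is W = S · U / (1 − U). The service time of 10 ms is a hypothetical value.

```python
def mm1_response_time(service_time, utilization):
    """Return R = W + S for an M/M/1 queue (an assumed model, see lead-in)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    wait = service_time * utilization / (1.0 - utilization)  # W
    return wait + service_time                               # R = W + S


SERVICE_TIME = 0.010  # hypothetical 10 ms service time

for u in (0.10, 0.50, 0.80, 0.90, 0.95):
    r = mm1_response_time(SERVICE_TIME, u)
    print(f"U={u:4.0%}  R={r * 1000:6.1f} ms  ({r / SERVICE_TIME:4.1f} x S)")
```

Under this model the response time doubles at 50% busy and grows without bound as U approaches 100%, which is one way to see why U alone says little without R.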
  • 16. PerfCap Language: Crazy about U! • Utilization (U) > The percent of time a resource is not idle > Physics analogy: Work = Force * Displacement ● No displacement means no work • Another physical metaphor ... > Helicopter: What does a helicopter's engine tachometer tell you about the helicopter's performance?
  • 17. PerfCap Language: U is for Useless? • “Utilization is Virtually Useless as a Metric” - Adrian Cockcroft, CMG 2006 > http://perfcap.blogspot.com/2005/12/cmg05-trip-comments-and-utilization-is.html > http://www.cmg.org/membersonly/2006/papers/6133.pdf “We have all been conditioned over the years to use utilization or %busy as the primary metric for capacity planning. Unfortunately, with increasing use of CPU virtualization and sophisticated CPU optimization techniques such as hyper-threading and power management the measurements we get from systems are "virtually useless". This paper will explain many of the fundamental alternatives, and express capacity in terms of headroom, in units of throughput within a response time limit.” • Adrian wins 2007 CMG Michelson Award > http://perfcap.blogspot.com/2007/12/a-michelson-award-acceptance-speech.html "Those who ask questions about utilization don't understand that their questions have no meaning so the answers are irrelevant :-)"
  • 18. Aggregate Utilization: U-all? • Business Logic > Workload classes (eg: OLTP, BATCH, pseudo-BATCH) ● Varies in business priority ● Varies in relative I/O content ● Varies in propensity to compute > Per-class utilization varies based on many system factors (CPU architecture, OS scheduling, space/speed tradeoffs, efficiency tradeoffs, virtualization), and also due to often-uncontrolled competition for resources > Cycles-per-instruction (CPI) varies with compile/build factors and competition factors > Utilization is limited by concurrency of demand and bounded by serialization per Amdahl's Law > Utilization often largely due to bad app code and/or bugs
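The Amdahl's Law bound mentioned above can be sketched as follows. The parallel fraction and CPU count are hypothetical inputs chosen for illustration; real workloads rarely have one clean value for the parallel fraction.

```python
def amdahl_speedup(parallel_fraction, cpus):
    """Amdahl's Law: speedup on `cpus` processors when only
    `parallel_fraction` of the work can run concurrently."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cpus)


def max_utilization_pct(parallel_fraction, cpus):
    """Upper bound on aggregate CPU utilization for one such job.

    The speedup equals the average number of CPUs the job keeps busy,
    so dividing by the CPU count gives the utilization ceiling.
    """
    return 100.0 * amdahl_speedup(parallel_fraction, cpus) / cpus


# Hypothetical job: 90% parallelizable, running alone on 8 CPUs.
print(f"speedup: {amdahl_speedup(0.9, 8):.2f}x")
print(f"utilization ceiling: {max_utilization_pct(0.9, 8):.1f}%")
```

In this sketch a job with a 10% serial portion cannot drive eight CPUs much past 59% busy, however much work is queued behind it.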
  • 19. Aggregate Utilization: U what !?@#! • Overhead categories > Polling operations > Lock and latch spins (adaptive) > Locking and latching cache coherency > Memory management (a maze of twisty passages ...) > Re-work (fail-and-retry logic) > Migrations & cache invalidations > Context switches (voluntary and involuntary) > Hardware thread-switching (some cheap, some not) ● SMP, VMT, SMT, CMT – all different! > Performance monitoring and management tools ● Significant “probe effect” can occur from some tools ● The aggregate impact of tools is often a root cause of problems > Bad tuning and bugs - outside of the business logic
  • 20. PerfCap Language: Like, U-know? • Workload Characterization > PerfCap definition: Attribution of resource utilization to various distinct business processes or technical functionality ● Essential to understanding resource usage > Engineering definition: Characterization of platform response factors under a given workload ● Interesting to drive systems engineering > Vernacular definition: Various broad terms like OLTP, BATCH, DSS, DW, PROD, UETP, DVLP, TEST, OLAP, ERP, ETL, ad-hoc, and my personal favourite - “mixed” ● Suggestive of requirements, but non-quantitative
  • 21. Hockey Sticks and Knees 4 U Excerpted from “Analyzing Computer System Performance” by Neil J. Gunther, Springer-Verlag 2005. ISBN 3540208658 (Used with permission.)
  • 22. So, what do U know? • Do you know your overhead/work ratio? • Do you know your ratio of OLTP to pseudo-BATCH? • Do you know how these vary under load? • Do you know how to observe, measure, and manage these things?
  • 23. PerfCap Language: Method Rrrrrr!
      Term          Right                                              Wrong
      Performance   Response time; Throughput; Variance                CPU %busy, %usr/%sys ratio; IOPS, disk latency, %wio; Graphs of aggregated data
      Capacity      Latent performance                                 Whatever you get at 100% utilization
      Headroom      ((100% capacity) – (current peak performance))     (100% – utilization)
      Utilization   (100% – %idle)                                     (100% – headroom)
  • 24. Case Study
  • 25. Case Study: Scenario • Financial E10K user upgraded to E2900 > CPU power of E2900 was 125% that of the 10K system ● E10K: #64 US-II @ (64 “slow” cores) ● E2900: #12 US-IV+ @ (24 “fast” cores) > Result: Utilization on E2900 was greater than on E10K! > Impact: Great angst! Management wanted %idle > 20! E2900 dissed. Move to E6900 contemplated. (Focus was on utilization (U) ... response-time (R) and throughput (X) were essentially ignored) > Breakthrough! Customer agreed to a test-to-fail exercise! ● Monitor response times per-transaction-class ● Increase benchmark workload until SLA not met
  • 26. It's not all about U! [Figure: three stacked plots against Users (0 to 500). Top: RTX2, y-axis 0 to 600, SLA = 600 sec. Middle: RTX1, y-axis 0 to 0.5, SLA = 0.5 sec. Bottom: UCPU, y-axis 0 to 100, Max = 100%. Callouts on the UCPU plot: “OMG! 20% Headroom?” and “No! 300% Headroom”.]
  • 27. Case Study: Experimental Results • The new system had plenty of latent capacity! > Test-to-fail revealed 300% headroom at 80% utilization! > All they needed was 1X headroom at 100 users! > Workload characterization revealed that a single CPU-greedy transaction of no business importance was vastly over-achieving its SLA > The CPU-greedy transaction under Solaris TS scheduling automatically fell to priority 0 - thus having zero impact on real OLTP as OLTP demand ramped up to 4x the level that corresponded with 80% aggregate CPU utilization > At the “tipping point”, the chaos may have been due to LGWR priority dropping to 0 under Solaris TS scheduling
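The test-to-fail arithmetic can be sketched as below. The measurement pairs are hypothetical placeholders shaped to match the outcome reported on the slide (an SLA of 0.5 sec, 100 users required, the SLA still met at 4x that load); they are not the customer's data.

```python
SLA_SECONDS = 0.5      # RTX1 SLA from the case study
REQUIRED_USERS = 100   # load the business actually needed

# Hypothetical (users, response time in seconds) pairs from a load ramp.
measurements = [(100, 0.10), (200, 0.14), (300, 0.22), (400, 0.45), (500, 0.90)]


def capacity_within_sla(samples, sla_seconds):
    """Largest tested load reached before the first SLA violation."""
    capacity = 0
    for users, response_time in sorted(samples):
        if response_time > sla_seconds:
            break
        capacity = users
    return capacity


def headroom_pct(capacity, required):
    """Headroom in workload terms: extra load available beyond what is required."""
    return 100.0 * (capacity - required) / required


capacity = capacity_within_sla(measurements, SLA_SECONDS)
print(f"capacity: {capacity} users, headroom: {headroom_pct(capacity, REQUIRED_USERS):.0f}%")
```

Headroom computed this way depends only on response time and load, so CPU %busy never enters the calculation.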
  • 28. Case Study: Business Outcome • Customer made an emergency upgrade to an E6900 > CPU power of E6900 was 200% that of the 10K system > Rumor has it that they got a really good discount > E6900 showed a “comforting” 20%+ idle under full test load • Moral > Science is often secondary in commercial IT > Due to issues of organizational behaviour, even empirical results might fail to triumph over rules of thumb > The cost of hardware is a minor issue in many IT managers' decision-making > Get over it ... or - develop new metrics and methods by which IT managers can be made comfortable!
  • 29. Ruminations on the State of the Art
  • 30. Common PerfCap Mistakes • Absence of business metrics > What Problem are You Trying to Solve? • Equating usage with demand or requirement > In other words, assuming that demand is inelastic • Failure to do performance first and often > Why scale waste and inefficiency? • Assuming supply is inelastic > In other words, assuming service times are constant • Misinterpreting “the device with the highest utilization is the bottleneck device” > Hmm, what about polling loops? • Decisions based on intuition and rules of thumb > Sophistication can pay great rewards!
  • 31. What's the Right Way to do PerfCap? 1) Empirical Methods (The Best & Most Expensive) ● Benchmarks, stress testing, test-to-scale, test-to-fail – with known Best Practices & basic performance analysis and tuning 2) Modeling (Highly Recommended & Moderate Cost) ● Using tools such as TeamQuest Model (TQM), BMC Perform/Predict, HyPerformix, Gunther's PDQ or other application of proper science and math 3) Expert Opinions (The Minimum & Cheapest) ● Listening to the right experts for Best Practices, analysis and tuning methods, and sizing 4) Guesswork (The Norm) ● Straight-line extrapolations, naïve use of reference benchmarks, massive over-provisioning, bogus testing, luck 5) Opportunism (Commonplace) ● Spend the available budget
  • 32. RTFM: PerfCap Resources • Dr. Neil Gunther – prolific, readable, digestible > “The Practical Performance Analyst” - foundational http://www.amazon.com/dp/059512674X/ > “Guerrilla Capacity Planning” - http://www.perfdynamics.com/Manifesto/gcaprules.html
  • 33. RTFM: PerfCap Resources • Cary Millsap – digestible, practical, methodical > “Optimizing Oracle Performance” ● Chapters 1 & 2 – a great intro to the art of PerfCap, whether or not one applies it to Oracle ● Method R
  • 34. RTFM: PerfCap Resources • Raj Jain - “The Art of Computer Systems Performance Analysis” > Fundamental, foundational, readable
  • 35. When Models Break • Good models break due to factors that are exogenous to the model (ie: not considered) > Examples: bus saturation, cache saturation, lock contention, covariance • Bad models break because they are bad models > Examples: “straight line” projections, models that do not consider basic queueing phenomena
  • 36. What Breaks Existing Models • Heterogeneity > There is diversity in both supply and demand factors > For example, OLTP, BATCH, and DSS are classical characterizations for common workload elements • Elasticity > Resource supply and demand factors are each elastic > For example, per-transaction demand might diminish under increasing load and supply might become more efficient • Covariance > Competition for resources impacts all competitors - sometimes adversely or pathologically
  • 37. Heterogeneity, Elasticity, and Covariance
  • 38. Heterogeneity: Many Dimensions • Business priority > Importance to the enterprise • Service demand > Resource requirement, including deadline constraints • Technical priority > Solaris scheduling priority • Quality (versus quantity) > Not all CPU-seconds are created equal • Urgency > Importance, as distinct from priority or share ● (example: princes and paupers)
  • 39. Heterogeneity: Early Warning Signs • “ERP” • “Consolidation” • “RDBMS” • “Ad-hoc” • “Custom” • “Producer/Consumer” • “Client/Server” • “Dispatcher thread/process” • Testimony to the contrary (eg: “It's entirely homogeneous OLTP!”)
  • 40. Heterogeneity: Example(s)
      # prstat -m
         PID USERNAME  USR  SYS  TRP  TFL  DFL  LCK  SLP  LAT  VCX  ICX  SCL  SIG PROCESS/NLWP
       13632 oracle     50   50  0.0  0.0  0.0  0.0  0.0  0.0    0    0  48K    0 sqlplus/1
       13633 oracle    0.0   96  0.0  0.0  0.0  0.0   48  0.0    0    0  46K    0 sqlplus/1
       15849 oracle     92  0.1  0.0  0.0  0.0  100  100  0.1   13   45   1K    0 oracle/11
       27639 oracle     91  0.1  0.0  0.0  0.0  100  100  0.1   24   50   2K    0 oracle/11
       13601 root       18   54  0.0  0.0  0.0  0.0   36  0.0  178  178  87K    0 ps/1
       13551 root      0.0   68  0.0  0.0  0.0  0.0   39  0.0  244  195  93K    0 prstat/1
       12614 oracle     64  0.2  0.0  0.0  0.0  100  100  0.1   50   38   3K    0 oracle/11
       24020 oracle     47  0.5  0.0  0.0  0.0  100  100  0.1  190   36  10K    0 oracle/11
      [...]
       11087 oracle    9.3  0.1  0.0  0.0  0.0  0.0   90  0.0    5    6   6K    0 oracle/1
       13490 root      0.0  8.5  0.0  0.0  0.0  0.0   93  0.0  380    0  25K    0 sh/1
        2154 oracle    7.9  0.2  0.0  0.0  0.0  100  100  0.0   53    5   3K    0 oracle/11
        9656 oracle    7.1  0.1  0.0  0.0  0.0  0.0   92  0.0   37    5   2K    0 oracle/1
       24156 oracle    6.7  0.1  0.0  0.0  0.0  100  100  0.0    6    4   2K    0 oracle/11
       13496 oracle    6.2  0.0  0.0  0.0  0.0  0.0   93  0.0  341    0  19K    0 sh/1
       13488 oracle    6.0  0.0  0.0  0.0  0.0  0.0   96  0.0  330    0  19K    0 sh/1
       25478 oracle    3.9  0.1  0.0  0.0  0.0  0.0   96  0.0   46    3   2K    0 oracle/1
        8098 oracle    2.9  0.1  0.0  0.0  0.0  0.0   97  0.0   60    3   2K    0 oracle/1
      [...]
      Total: 295 processes, 2869 lwps, load averages: 11.64, 12.02, 12.05
  • 41. Heterogeneity: Exploring • Fun commands you can use at home ...
      # Taking U apart
      prstat -n 8192 -m     # Microstate accounting
      prstat -n 8192 -mL    # Per-thread microstate accounting
      # Thread count ...
      awk '{print $15}' < prstat-sample.1 | sort | grep oracle | uniq -c | more
      # CPU intensity ...
      grep oracle/ prstat-sample.1 | awk '{print $3}' | sort -n | uniq -c | more
      # Diverse priorities ...
      ps -e -o pid,class,pri,args
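The “thread count” and “CPU intensity” pipelines above can also be expressed in a few lines of Python, which may be easier to extend. This is a sketch only: it assumes the 15-column `prstat -m` layout shown on the previous slide, the embedded sample is a handful of rows copied from that slide, and the function names are made up for illustration.

```python
from collections import Counter

# A few rows copied from the prstat -m sample on the previous slide.
SAMPLE = """\
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
 13632 oracle 50 50 0.0 0.0 0.0 0.0 0.0 0.0 0 0 48K 0 sqlplus/1
 15849 oracle 92 0.1 0.0 0.0 0.0 100 100 0.1 13 45 1K 0 oracle/11
 27639 oracle 91 0.1 0.0 0.0 0.0 100 100 0.1 24 50 2K 0 oracle/11
 11087 oracle 9.3 0.1 0.0 0.0 0.0 0.0 90 0.0 5 6 6K 0 oracle/1
Total: 295 processes, 2869 lwps, load averages: 11.64, 12.02, 12.05
"""


def data_rows(text):
    """Yield the fields of each process row, skipping header and summary lines."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 15 and fields[0].isdigit():
            yield fields


def thread_counts(text):
    """Count processes by their PROCESS/NLWP column (like awk '{print $15}' | uniq -c)."""
    return Counter(fields[14] for fields in data_rows(text))


def usr_values(text, name_prefix):
    """Sorted USR percentages for processes whose name starts with name_prefix."""
    return sorted(float(f[2]) for f in data_rows(text) if f[14].startswith(name_prefix))


print(thread_counts(SAMPLE))
print(usr_values(SAMPLE, "oracle/"))
```

Even in this tiny sample, single-LWP and 11-LWP Oracle processes show very different USR figures, which is the heterogeneity the slide is pointing at.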
  • 42. Heterogeneity: Deal with it! • Identify it > This is one aspect of workload characterization in the language of PerfCap > Consider its many dimensions (business priority, service demand, technical priority, urgency, deadlines) • Tell the OS about it > The OS does not know your priorities, so tell it! > Automating this is a good investment • Model it > w.r.t. competition and covariance – TBD
  • 43. Elasticity: Supply Factors • In general, “supply” is net of competing demands > “I'm giving ya all I got, captain!” > FCFS – who got in line first? • In a specific configuration, elastic factors abound > With mixed-speed CPUs, Q(CPU-second) = f(MHz) > With CMT, Q(CPU-second) = f(core loading) > Q(CPU-second) = f(ISA & pipeline sophistication) • Unmanaged, the probability of thread pinning will increase with increasing interrupt load
  • 44. Elasticity: Supply Factors • Priority preemption > Good – under TS, compute hogs will drift to priority 0 > Bad - unmanaged, a large population of homogeneous threads may frivolously preempt each other > Ugly – interrupts have top priority; they can even interrupt and “pin” realtime (RT) threads > Hideous – it's really tragically bad when TS demotes your highest-importance thread (eg: Oracle LGWR)
  • 45. Elasticity: Supply Factors
      # mpstat 5
        CPU minf  mjf xcal intr ithr  csw icsw migr smtx  srw syscl  usr  sys   wt  idl
          0    0    0  211  449  142  423    4   21   25    0   460   17    2    0   82
          1    1    0  127  155    2  296    2    6   23    0   199   13    1    0   86
          2    0    0   30   30    0   56    0    3    9    0    64    1    0    0   98
          3    0    0    0    2    0    2    0    1    4    0     0    0    0    0  100
          8    1    0  199  278    0  548    4   11   37    0   470   23    1    0   76
          9    0    0    0    2    0    2    0    1    4    0     0    0    0    0  100
         10    0    0   30   53    0  104    0    3   11    0   155    4    0    0   95
         11    0    0    0    2    0    2    0    1    3    0     0    0    0    0  100
         16    1    0  178  258    0  508    3   10   29    0   521   16    1    0   82
         17    0    0    3    5    3    4    0    1    6    0     2    0    0    0  100
      [...]
        104    1    0  222  194    4  377    1    6   28    0   281   16    1    0   83
        105    0    0    0    2    0    2    0    1    2    0     0    0    0    0  100
        106    0    0    0    3    0    4    0    1    3    0    13    0    0    0  100
        107    0    0    0    2    0    2    0    1    2    0     0    0    0    0  100
        112    1    0  141  229    1  451    2    3   23    0   289   18    1    0   81
        113    0    0    1    3    1    2    0    1    1    0     0    0    0    0  100
        114    0    0    0    6    0    9    0    2    2    0     3    0    0    0  100
        115    0    0    0    2    0    2    0    1    1    0     0    0    0    0  100
        120    4    0  397  409    3  804    4    3   44    0   450   23    3    0   74
        121    0    0    1    3    1    2    0    1    2    0     0    0    0    0  100
        122    0    0   13   15    0   28    0    2    3    0    13    1    0    0   99
        123    0    0    0    2    0    2    0    1    1    0     0    0    0    0  100
  • 46. Elasticity: Supply Factors
      $ awk '{print $3,$4}' ps-sample.out | sort | uniq -c | sort -nr +2
            1 RT  157
            1 RT  140
            1 RT  100
            1 SYS  98
            1 SYS  96
            3 TS   60
            2 FX   60
            1 SYS  60
         8238 TS   59
            1 TS   58
            3 TS   54
           11 TS   53
            2 TS   52
            1 TS   51
            6 TS   50
           14 TS   49
            1 TS   36
            1 TS   34
            1 TS   29
            1 TS   22
            1 TS   12
            3 TS    0
      $ grep lgw ps-sample.out
      10494 1 TS 34 ora_lgwr_XYZP
      Callouts on the slide: “Important!” (beside the RT and SYS rows); “Primary modality; OLTP shadows” (beside 8238 TS 59); “CPU hogs, punished by TS” (beside the low TS priorities); “Hey! Wait a minute! I'm really important! Why didn't anyone tell the OS? Help!” (LGWR, at TS 34).
      NOTE: ps-sample.out data was from 'ps -e -o pid,ppid,class,pri,args'
  • 47. Elasticity: Demand Factors • “The mythical CPU-second” > Sensitivity to compile options – eg: branch mispredicts, pipelining, inlined macro-operations versus library calls > Sensitivity to link options – eg: locality versus I$ and D$ behaviour > Sensitivity to competition – could be viewed as elasticity of demand or supply, or as covariance ... depending on one's perspective > Adaptive algorithms – eg: decisions to yield and re-queue (rather than spin) might be made as a function of system load – and that can reduce the CPU-sec/transaction as load increases
  • 48. Elasticity: Demand Factors • Under high load, frivolous migrations should decrease, leading to improved cache utilization and reduced memory waits • Demand can vary in both quality (overhead/work) and quantity (overhead+work) as load is varied > Ratio of business logic to spins for locks and latches > Write coalescing by LGWR > Checkpoint write deferral by DBWR
  • 49. Elasticity: Deal with it! • Demand > Seek out and destroy inefficiency – but keep the 80/20 rule in mind > Use Resource Management (RM) at the app, OS, and DB levels – maybe Oracle Resource Manager (ORM)? > The final constraints are the speed of your components and the speed of light • Supply > Invest in getting required factor-level QoS to various processes in relation to their business criticality
  • 50. Covariance: Pigs at the Trough • Workloads often unmanaged and multi-modal > Spectrum is wide, but simple case is BATCH vs. OLTP • What if your OLTP SLA outliers are due to I/O competition from your BATCH? > Maybe your BATCH is being over-served for I/O? > Maybe you could throttle your BATCH I/O demands? • What if your BATCH SLA outliers are due to CPU competition from your OLTP? > Maybe your OLTP is being over-served for CPU? > Maybe you could dynamically compromise on your OLTP CPU priority?
  • 51. Covariance: Some Examples • “Foxes and chickens” problem: mixing incompatible species in the same cage • Most famously: “batch versus OLTP” > I/O demand by batch is what typically slows OLTP, but CPU demand by batch should not impact OLTP > OLTP demand for I/O or CPU might impact batch • Harder to see: “cache-sensitive” versus “cache-polluting” competition > Cache-sensitive workload elements can be slowed by elements that constantly spoil the cache • Heads-up! Virtualization means increased sharing!
  • 52. Covariance: Deal with It! • Expensive: Physical segregation and isolation > e.g. - run BATCH or reports on another system > e.g. - dedicate disks, channels, buses, and CPU to business or technical functions as required • Primitive: Temporal segregation and isolation > e.g. - run BATCH at night • Refined: Prioritization, throttling, deadline scheduling > e.g. - run BATCH at low priority, inject delays, increase priorities as deadlines get closer
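The “refined” option of raising BATCH priority as its deadline approaches can be sketched as a small policy function. Everything here is hypothetical: the function name, the linear mapping from slack to a Unix-style nice value (19 is least favoured, 0 is normal), and the example numbers are illustrative assumptions, not a mechanism described in the slides.

```python
def batch_nice(seconds_to_deadline, est_seconds_remaining, lowest=19, normal=0):
    """Suggest a nice value for a batch job based on its remaining slack.

    With plenty of slack the job runs at the lowest priority; as the
    deadline approaches the value moves linearly toward normal priority.
    """
    if seconds_to_deadline <= est_seconds_remaining:
        return normal  # no slack left: stop deferring to other work
    slack_ratio = (seconds_to_deadline - est_seconds_remaining) / seconds_to_deadline
    return normal + round((lowest - normal) * slack_ratio)


# Hypothetical job: one hour of work left, checked at several points in time.
for hours_left in (4.0, 2.0, 1.5, 1.0):
    print(hours_left, batch_nice(hours_left * 3600, 3600))
```

A real deployment would feed such a value to whatever resource-management facility is in use; the point of the sketch is only that the priority is derived from the business deadline rather than fixed.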
  • 53. Concluding Remarks
  • 54. Parting Thoughts • Participate in CMG http://www.cmg.org “Ignorance of the law is no excuse!” • Go where you may not have gone before > Test-to-fail > Analyse > Fix or manage > Repeat • If you are not managing to Business Metrics, you are wasting time and energy!
  • 55. Q&A? Special Thanks to ... • Adrian Cockcroft, Cary Millsap, Jim Holtman, Dr. Neil Gunther > mentors and provocateurs • David J. Miller, Benoit Chaffanjon > editorial services & peer review • Glenn Fawcett > smoke-jumping brotherhood & cool graphics • Jim Mauro > northern star • Larry Klein > inspiration from “It's all about U” ... and in general
  • 56. Extended Discussion Slides
  • 57. Primitivism • “You might be a redneck if ...” > You think "capacity" is when you pass out. > You cannot imagine why anyone would model a cue. > You have only seen a queue on Hop Sing or David Carradine. > You believe chaos past 80% utilization is a law of nature. > You make no effort whatsoever to control what's important to you. [... with a tip of the hat to Jeff Foxworthy ...]
  • 58. “Some people think that once they know the tricks of the trade, that they know the trade.” “A little bit of knowledge can be a dangerous thing.”
  • 59. Paths Forward • Increased education in PerfCap > Math, science, language/vocabulary > “Do performance first, then capacity.” • Increased usage of available tools > Extract benefits, learn limitations, develop art • Increased networking amongst stakeholders > Build awareness of what can go wrong; seek synergy • Breaking new ground > CMT and Virtualization challenges > Power management > Automating workload management > “PerfViz” - CMG focus area > “Regarding Capacity” - Our focus for the rest of the hour ...
  • 60. Water Glass Metaphors • Is it 50% full or 50% empty? > CMG-speak: Is it 80% busy, or 20% under-utilized? • “Big Rocks” > Demonstrates heterogeneity and priority
  • 61. Two Views of “Best Practices” • Bob Sneed's > “Best Practices are time-proven and customer-proven practices which are well-documented and believed to have little or no downside potential.” > “... practical workarounds for product design limitations” > “... contrast with just works; needs no practices” > “... contrast with tuning, which implies trial and error” • Dr. Neil Gunther's > “Best Practices are an admission of failure.” > “... trading workarounds, practices, and 'rules of thumb' does not advance the science or deepen understanding” > “... contrast with decomposing, understanding, modeling, proper engineering” > “... just another form of trial and error”
  • 62. Pop Quiz #1 • SITUATION: A system runs at 100% CPU usage for 1 hour each day completing a single compute-bound task. The SLA requires the task to complete in 4 hours. • Q1: How much “headroom” does this system have? • Q2: How can this task's resource footprint be managed to never exceed 80% CPU usage?
  • 63. Pop Quiz #1: Answers • SITUATION: A system runs at 100% CPU usage for 1 hour each day completing a single compute-bound task. The SLA requires the task to complete in 4 hours. • Q1: How much “headroom” does this system have? • A1: 300% (in workload terms) or 75% (in percent-of-system terms) - it can do 4x the work it now does and remain within the SLA. • Q2: How can this task's resource footprint be managed to never exceed 80% CPU usage? • A2a: Huh? Why would anyone want to do that? • A2b: Resource management.
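The arithmetic behind A1 can be checked in a few lines. The last function goes slightly beyond the slide: it assumes the task's run time scales inversely with its CPU share, which is a simplification, to show that a hypothetical 80% cap would still leave the task well inside its SLA.

```python
RUN_HOURS = 1.0   # task takes 1 hour at 100% CPU
SLA_HOURS = 4.0   # SLA allows 4 hours


def headroom_workload_pct(run_hours, sla_hours):
    """Extra work the system could absorb and still meet the SLA."""
    return 100.0 * (sla_hours / run_hours - 1.0)


def headroom_system_pct(run_hours, sla_hours):
    """Share of the SLA window's capacity that the task does not need."""
    return 100.0 * (1.0 - run_hours / sla_hours)


def run_hours_at_cap(run_hours, cpu_cap_pct):
    """Run time under a CPU cap, assuming time scales inversely with CPU share."""
    return run_hours * 100.0 / cpu_cap_pct


print(headroom_workload_pct(RUN_HOURS, SLA_HOURS))  # workload terms
print(headroom_system_pct(RUN_HOURS, SLA_HOURS))    # percent-of-system terms
print(run_hours_at_cap(RUN_HOURS, 80.0))            # hours under an 80% cap
```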
  • 64. Pop Quiz #2 • SITUATION: An 8-way 1000-BogoMIPs box runs at 75% CPU busy, with a workload that includes four compute-bound threads plus some OLTP. The new target system is a 4-way 2000-BogoMIPs system. • Q1: What is the new system's projected CPU utilization? • Q2: How can this system's workload be managed to never exceed 75% CPU utilization?
  • 65. Pop Quiz #2: Answers • SITUATION: An 8-way 1000-BogoMIPs box runs at 75% CPU busy, with a workload that includes four compute-bound threads plus some OLTP. The new target system is a 4-way 2000-BogoMIPs system. • Q1: What is the new system's projected CPU utilization? • A1: 100%. Each of the four compute-bound threads will keep one CPU 100% busy. • Q2: How can this system's workload be managed to never exceed 75% CPU utilization? • A2a: Huh? Why would anyone want to do that? • A2b: Resource management.
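A1 follows from the fact that a compute-bound thread occupies one CPU at a time regardless of how fast that CPU is. The sketch below encodes just that observation; the function name is made up for illustration, and OLTP demand is deliberately left out.

```python
def compute_bound_utilization_pct(threads, cpus):
    """CPU utilization contributed by single-threaded compute-bound work.

    Each thread can keep at most one CPU busy, so the contribution is
    capped by whichever is smaller: the thread count or the CPU count.
    """
    return 100.0 * min(threads, cpus) / cpus


old = compute_bound_utilization_pct(threads=4, cpus=8)
new = compute_bound_utilization_pct(threads=4, cpus=4)
print(f"old 8-way box: {old:.0f}% from compute-bound threads")
print(f"new 4-way box: {new:.0f}% from compute-bound threads")
```

On the old box the four threads account for 50 of the observed 75 points of CPU busy; on the new box they alone fill every CPU, which is why a straight-line BogoMIPs projection misleads here.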
  • 66. Pop Quiz #3 • SITUATION: An 8-way 1000-BogoMIPs box runs at 75% CPU busy, with a workload that includes four compute-bound threads plus some OLTP. The new target system is a 4-way 2000-BogoMIPs system. (Same as last quiz, OK?) • Q1: How will the compute-bound threads' performance be impacted by the upgrade? (Just roughly speaking – no need for precision here!)
  • 67. Pop Quiz #3: Answers • SITUATION: An 8-way 1000-BogoMIPs box runs at 75% CPU busy, with a workload that includes four compute-bound threads plus some OLTP. The new target system is a 4-way 2000-BogoMIPs system. (Same as last quiz, OK?) • Q1: How will the compute-bound threads' performance be impacted by the upgrade? (Just roughly speaking – no need for precision here!) • A1: It should run almost 4x faster. Each new CPU is 4x faster than the old ones. (2000/4)/(1000/8) = 4. The OLTP will use some of the CPU cycles, but its service demand pales next to the compute jobs.
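The per-CPU ratio in A1 can be reproduced directly. This is only the slide's own arithmetic wrapped in a function (the name is illustrative); it treats BogoMIPs as evenly divided across CPUs and ignores the OLTP share, as the answer does.

```python
def per_cpu_speed(total_bogomips, cpus):
    """Throughput rating of a single CPU, assuming an even split."""
    return total_bogomips / cpus


old_cpu = per_cpu_speed(1000, 8)   # 8-way 1000-BogoMIPs box
new_cpu = per_cpu_speed(2000, 4)   # 4-way 2000-BogoMIPs box

print(f"old: {old_cpu:.0f} per CPU, new: {new_cpu:.0f} per CPU")
print(f"single-thread speedup: {new_cpu / old_cpu:.0f}x")
```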
  • 68. Pop Quiz #4 • ESSAY QUESTION: “At what point do these principles become difficult?”