System Performance

    Build Fuel Tune

          Simar Singh
      simar.singh@redknee.com
         learn@ssimar.com
Learn and Apply
Topics
•   Performance
•   Concurrency (Threads)
•   Troubleshooting
•   Processing (CPU/Cores)
•   Memory (System / Process)
•   Thread Dumps
•   Garbage Collection
•   Heap Dumps
•   Core Dumps & Postmortem
•   Java (jstack, jmap, jstat, VisualVM)
•   Solaris (prstat vmstat mpstat pstack)
Index (Click Links in Slide Show)
• Concepts
• Processing
• Memory
Concepts
Concurrency and Performance
        (Part 1)
What will we Discuss?
– LEARN
– There are laws and principles that govern concurrency and performance.
– Performance can be built, fueled and/or tuned.
– How do we measure performance and capacity in abstract terms?
– Capacity (throughput) and Load are often used interchangeably, but
  incorrectly.
– What is the difference between resource utilization and saturation?
– How are performance & capacity measured on a live system (CPU & Memory)?

–   APPLY
–   Find out how your system is being used or abused.
–   Find out how your system is performing as a whole.
–   Find out how a particular process in the system is performing.
–   Find out how a particular thread in the process is performing.
–   Find the bottlenecks. What is scarce or missing?
Performance – Built, Fueled or Tuned
• Built (Implementation and Techniques)
   – Binary Search O(log n) is more efficient than Linear Search O(n)
   – Caching can reduce Disk I/O, significantly boosting
     performance.

• Fueled (More Resources)
   – Simply get a machine with more CPU(s) and Memory if
     constrained.
   – Implement RAID to improve Disk I/O

• Tuned (Settings and Configurations)
   – Tune Garbage Collection to optimize Java processes
   – Tune Oracle parameters to get optimum database performance
Capacity and Load
•   Load is an Expectation out of the system
     –   It is the rate of work that we put on the system.
     –   It is a factor external to the system.
     –   Load may vary with time and events.
     –   It has no upper cap; it can increase indefinitely.
•   Capacity is a Potential of the system
     – It is the max rate of work the system supports efficiently, effectively & indefinitely.
     – It is a factor internal to the system.
     – Maximum capacity of a system is finite and stays fairly constant.
     – We often call Throughput the System’s Capacity for Load.
•   Chemistry between Load & Capacity
     – LOAD = CAPACITY?                         Good. Expectation matches the potential. Hired
     – LOAD > CAPACITY?                         Bad. Expectation is more than the potential. Fired
     – LOAD < CAPACITY?                         Ugly. Expectation is less than the potential. Find another one
     –   If not good, better be ugly than bad.
Performance Measurement of a System
           Measures of System’s Capacity
•   Response Time or Latency
      – Measures time spent executing a request
              • Round-trip time (RTT) for a Transaction
      – Good for understanding user experience
      – Least scalable; developers focus on how much time each transaction takes
•   Throughput
      – Measures the number of transactions executed over a period of time
              • Output Transactions per second (TPS)
      – A measure of the system's capacity for load
      – Depending upon the resource type, it could be hit rate (for a cache)
•   Resource Utilization
      – Measures the use of a resource
              • Memory, disk space, CPU, network bandwidth
      – Helpful for system sizing; generally the easiest measurement to understand
      – Throughput and Response Time can conflict, because resources are limited
              • Locking, resource contention, container activity
It is time for System Capacity to be Loaded with work
                              (Throttling & Buffering Techniques)


•   Nothing stops us from loading a system beyond its capacity (Max Throughput).

•   Transactions Per Second – a misconception; real traffic may come in bursts
     – Received 3600 transactions in an hour; not sure they arrived evenly at 1 per second
     – Perhaps we received them in bursts – all in the first 10 minutes and nothing for the last 50
     – So we really can’t say at what TPS. We can regulate bursts with throttling and buffering
•   Throttling – (Implemented by the producer to smooth output)
     – Spreads bursts over time to smooth the output from a process
     – We may add throttles to control the output rate from threads to each external interface
     – A throttle of 10 tps ensures max output is 10 tps regardless of the load & capacity.
     – Throttling is a scheme for producers (check production to the rate the consumer can accept)

•   Buffering – (Implemented by the consumer to smooth input)
     – Spreads bursts over time to smooth the input from an external interface
     – We add buffering to control the input rate to threads from each external interface
     – If the application processes input at 10 tps, load above that is buffered & processed later
     – Buffering is a scheme for consumers (take whatever is produced, consume at our own pace)
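A minimal Java sketch of both schemes (not from the slides; the class name, rates
and sizes are illustrative): a Semaphore refilled once per second acts as the
producer-side throttle, and a bounded BlockingQueue acts as the consumer-side buffer.

    import java.util.concurrent.*;

    // Sketch, assuming a 10 tps budget and a 1000-element buffer.
    public class ThrottleBufferSketch {
        public static void main(String[] args) throws InterruptedException {
            Semaphore permits = new Semaphore(10);                 // producer-side throttle
            ScheduledExecutorService refill = Executors.newSingleThreadScheduledExecutor();
            refill.scheduleAtFixedRate(() -> {                     // reset to 10 permits each second
                permits.drainPermits();
                permits.release(10);
            }, 1, 1, TimeUnit.SECONDS);

            BlockingQueue<String> buffer = new LinkedBlockingQueue<>(1000); // consumer-side buffer

            new Thread(() -> {                                     // consumer drains at its own pace
                try { while (true) process(buffer.take()); }
                catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            }).start();

            for (int i = 0; i < 50; i++) {                         // producer: max ~10 tps output
                permits.acquire();                                 // blocks once this second's budget is spent
                buffer.put("tx-" + i);                             // blocks if the buffer is full
            }
            refill.shutdown();
        }
        static void process(String tx) { System.out.println("processed " + tx); }
    }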
Supply Chain Principle
                   (Apply it to define an optimum Thread Pool Size)

•   The more throughput you want, the more resources you will consume.

•   You may apply this principle to define the optimum thread-pool size for a
    system/application.

     – To support a throughput of (t) transactions per second:        (t) = 20 tps

     – Where each transaction takes (d) seconds to complete:          (d) = 5 seconds

     – We need at least (d*t) threads (min size of the thread pool):  (d*t) = 100 threads

•   A thread is an abstract CPU unit resource here.
To support a Throughput (t) of 20 tps
                     Where each transaction takes (d) 5 seconds
                     We need 100 (d*t) threads at least

[Timing diagram: 20 transactions start each second and each takes 5 seconds, so the
transactions started in seconds 1–5 are all still in flight during second 5; from the
5th second onward, 20 transactions complete every second, requiring 20 × 5 = 100
concurrent threads to sustain the rate.]
Quantify Resource Consumption
         Utilization & Saturation
• Resource Utilization
    – Utilization measures how busy a resource is.
    – It is usually represented as a percentage average over a time interval.
• Resource Saturation
    – Saturation is often a measure of work that has queued waiting for the resource.
    – It can be measured both
         • As an average over time
         • And at a particular point in time.
    – For some resources that do not queue, saturation may be synthesized from error counts.
      Example: Page-Faults reveal memory saturation.
• Load (input rate of requests) is an independent/external variable
• Resource consumption and Throughput (output rate of responses) are
  dependent/internal variables, a function of load.
How are Load, Resource Consumption and
       Throughput related?
•   As load increases, throughput increases, until maximum resource utilization on the
    bottleneck device is reached. At this point, maximum possible throughput is
    reached and saturation occurs.
•   Then, queuing (waiting for saturated resources) starts to occur.
•   Queuing typically manifests itself as degradation in response times.
•   This phenomenon is described by Little’s Law:
           L = X * R
           L (LOAD, requests in the system), X (THROUGHPUT) and R (RESPONSE TIME)
•   As L increases, X increases (R also increases slightly, because there is always some
    level of contention at the component level).
•   At some point, X reaches Xmax – the maximum throughput of the system. At this
    point, as L continues to increase, the response time R increases in proportion and
    throughput may then start to decrease, both due to resource contention.
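A quick worked example (numbers are illustrative): at X = 200 tps with R = 0.05
seconds per request, L = X * R = 200 * 0.05 = 10 requests are in the system on
average; if contention pushes R to 0.5 seconds at the same throughput, L grows to
200 * 0.5 = 100 concurrent requests.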
Performance pattern of a Concurrent Process
Example
                How are Throughput and Resource Consumption related?

•   Throughput & Latency can have an inverse or direct relationship
     – Concurrent tasks (Threads) often contend for resources (locking & contention)
         • Single-Threaded – Higher Throughput = Lower Latency
              – Consistent throughput; does not increase with incoming load & resources
              – Processes serially; good for batch jobs
              – Response Time varies linearly with request order.
         • Multi-Threaded – Higher Throughput = Higher Latency (most of the time)
              – Throughput may increase linearly with load, but starts to drop after a threshold
              – Processes concurrently; good for interactive modules (Web Apps)
              – Near-consistent Response Time; doesn’t vary much with order, but with load.

             Single Threaded – 10 CPU(s)          Multi Threaded – 10 CPU(s)
                  Threads = 1                            Threads = 10
                  Latency = .1 seconds                   Latency = .1 seconds
                  Throughput = 1/.1 = 10 tx/sec          Throughput = 1/.1 * 10 = 100 tx/sec
                  Threads = 1                            Threads = 100
                  Latency = .001 second                  Latency = .2 seconds
                  Throughput = 1/.001 = 1000 tx/sec      Throughput = 1/.2 * 100 = 500 tx/sec
Producer Consumer Principle
                                     Predicting Maximum Throughput
                                   Identify Bottleneck Device/Resource

•   The Utilization Law:             Ui = T * Di
•   Where Ui is the utilization of device i in the application, T is the application
    throughput, and Di is the service demand of the application on device i.
•   The maximum throughput of an application, Tmax, is limited by the largest service demand
    among all of the devices in the application.
•   EXAMPLE - A load test reports 200 kb/sec average throughput:
    CPUavg = 80%                Dcpu = 0.8 / 200 kb/sec = 0.004 sec/kb
    Memoryavg = 30%             Dmemory = 0.3 / 200 kb/sec = 0.0015 sec/kb
    Diskavg = 8%                Ddisk = 0.08 / 200 kb/sec = 0.0004 sec/kb
    Network I/Oavg = 40%        Dnetwork I/O = 0.4 / 200 kb/sec = 0.002 sec/kb
•   In this case, Dmax corresponds to the CPU. So, the CPU is the bottleneck device.
•   We can use this to predict the maximum throughput of the application by setting the CPU utilization to
    100% and dividing by Dcpu. In other words, for this example:
             Tmax = 1 / Dcpu = 250 kb/sec
•   In order to increase the capacity of this application, it would first be necessary to increase CPU capacity.
    Increasing memory, network capacity or disk capacity would have little or no effect on performance until
    CPU capacity has been increased sufficiently.
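The same bottleneck analysis, sketched in Java (values copied from the example
above; the class name is illustrative):

    // Sketch: Utilization Law, Di = Ui / T and Tmax = 1 / Dmax.
    public class UtilizationLaw {
        public static void main(String[] args) {
            double T = 200.0;                              // measured throughput, kb/sec
            String[] device = {"cpu", "memory", "disk", "network"};
            double[] U = {0.80, 0.30, 0.08, 0.40};         // measured utilizations
            double dmax = 0.0; String bottleneck = null;
            for (int i = 0; i < device.length; i++) {
                double d = U[i] / T;                       // service demand, sec/kb
                System.out.printf("D%-8s = %.4f sec/kb%n", device[i], d);
                if (d > dmax) { dmax = d; bottleneck = device[i]; }
            }
            // Prints: bottleneck = cpu, Tmax = 250 kb/sec
            System.out.printf("bottleneck = %s, Tmax = %.0f kb/sec%n", bottleneck, 1.0 / dmax);
        }
    }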
Work Pools & Thread Pools
                          Working Together
•   Work Pools are queues of work to be performed by a software application or component.
     – If all threads in the thread pool are busy, incoming work can be
       queued in the work pool
     – Threads from the thread pool, when freed, can execute it later


•   Work Pools relieve congestion & smooth bursts
     – A queue consisting of units of work to be performed
     – CONGESTION: allows the current (client) threads to submit
       work and return
     – BURSTS: over-capacity transactions can be buffered in the work pool and
       executed later
     – Allow for batching of units of work to reduce system-intensive
       calls
           •   Can perform a bulk fetch from a database instead of fetching one record at a time
Queuing Tasks may be risky
•   One task could lock up another that would be able to continue if the queued task
    were to run.

•   Queuing can smooth incoming traffic bursts limited in time (depending upon the
    rate of traffic and the queue size).

•   It fails if traffic arrives, on average, faster than it can be processed.

•   In general, Work Pools are in memory, so it is important to understand the
    impact of restarting a system, as in-memory elements will be lost.

     – Is it acceptable to lose the queued work?
     – Is the queue backed up on disk?
Bounded & Unbounded Pools
                  (Load Shedding)
•   If not bounded, pools can grow freely but can cause the system to exhaust resources.
     – Work Pool / Queue Unbounded – (May overload Memory / Heap &
       crash)
            • Each work object in the queue keeps holding space until consumed
     – Thread Pool Unbounded – (May overload CPU / Native Space and
       crash)
            • Each thread asks to be scheduled on a CPU and consumes native stack space
•   If the queue size is bounded, incoming execute requests block when it is full. We can apply different
    policies to handle it, for example (see the sketch after this list):
     – Reject if there is no space (can have side effects)
     – Remove based on priority – (e.g., priority may be a function of time –
       Timeouts)
•   Thread Pools can have different policies when the Work Pool is full:
     – Block till there is available space – Starve (VERY BAD – sometimes
       needed)
     – Run in the current thread (Very Dangerous!)
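In Java these choices map directly onto ThreadPoolExecutor; a minimal sketch,
assuming illustrative sizes (the saturation policies named are standard JDK classes):

    import java.util.concurrent.*;

    // Sketch: bounded thread pool + bounded work pool with an explicit saturation policy.
    public class BoundedPools {
        public static void main(String[] args) {
            ThreadPoolExecutor pool = new ThreadPoolExecutor(
                10, 10,                                    // bounded thread pool: 10 threads
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1000),            // bounded work pool: 1000 tasks
                // AbortPolicy = reject; DiscardOldestPolicy ~ drop by age;
                // CallerRunsPolicy = "run in current thread" (dangerous, but throttles the caller)
                new ThreadPoolExecutor.CallerRunsPolicy());
            pool.execute(() -> System.out.println("work item executed"));
            pool.shutdown();
        }
    }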
Work pool & thread pool sizes can
often be traded off for each other
Large work pools and small thread pools

– Minimize CPU usage, OS resources, and context-switching overhead.

– Can lead to artificially low throughput, especially if tasks frequently block (e.g., I/O bound)



Small work pools generally require larger thread pools

– Keeps CPUs busier

– May cause scheduling overhead (context switching) and may lessen throughput,
  especially if the number of CPUs is small.
Processing (CPU) Performance &
        Troubleshooting
            (Part 2)
CPU
• Many modern systems from Sun boast numerous CPUs or virtual CPUs
  (which may be cores or hardware threads).

• The CPUs are shared by applications on the system, according to a policy
  prescribed by the operating system and scheduler

• If the system becomes CPU resource limited, then application or kernel
  threads have to wait on a queue to be scheduled on a processor,
  potentially degrading system performance.

• The time spent on these queues, the length of these queues and the
  utilization of the system processor are important metrics for quantifying
  CPU-related performance bottlenecks.
Process – User and Kernel Level
                 Threads
• A process includes the set of executable programs, address
  space, stack, and process control block. One or more threads
  may execute the program(s).
• User-level threads (threads library)
   – Invisible to the OS; maintained by a thread library.
   – Are the interface for application parallelism
• Kernel threads
   – The unit that can be dispatched on a processor; its structures are
     maintained by the kernel
• Lightweight processes (LWP)
   – Each LWP supports one or more user-level threads and maps to exactly one
     kernel-level thread. Maintains the state of a thread.
CPU Consumption Model




By default Solaris 10 uses the 1:1 model, the fourth process model in the figure
(each user thread bound to its own LWP); the other models are obsolete.
Dispatcher and Run Queue at CPU
User Thread over a Solaris LWP
     State of User Thread and LWP may be different
Solaris Threading Model
If you are in a thread, the thread library must schedule it on an LWP.
Each LWP has a kernel thread, which the kernel schedules on a CPU.
Threading models define the mapping between Solaris (user) threads and LWPs.
JVM Organization
JVM Memory Organization & Threads
•   Method Area
     – JVM loads class files, their type info and binary data into this area
     – This memory area is shared by all threads
•   Heap Area
     – JVM places all objects the program instantiates onto the heap
     – This memory area is shared by all threads
     – This memory can be adjusted with the VM options -Xmx & -Xms as required
•   Java Stack and Program Counter (PC) Register
     – Each new thread that executes, gets its own pc register & Java stack.
     – The value of the pc register indicates the next instruction to execute.
     – A thread's Java stack stores the state of Java method invocations for the
       thread. The state of a Java method invocation includes
             • its local variables & the parameters with which it was invoked,
             • its return value (if any), and intermediate calculations.
     – This memory may be adjusted by VM option –Xss, typically 1m for RK Apps
     – The state of native method (JVM method) invocations is stored in an
       implementation-dependent way in native method stacks, as well as possibly in
       registers or other implementation-dependent memory areas.
A Java thread’s Stack Memory
•   The Java stack is composed of stack frames (or frames).
•   A stack frame contains the state of one Java method invocation.

     – When a thread invokes a method, the Java virtual
       machine pushes a new frame onto that thread's
       Java stack.
     – When the method completes, the virtual machine
       pops and discards the frame for that method.
Thread Modes
         Kernel & User Mode Privilege
• An LWP may execute in either kernel (sys) or user (usr) privilege mode.
• Operations like processing data in local memory, and communication
  between threads of the same process, do not require
  kernel mode privilege for the thread executing the user program.
• However, inter-process communication and hardware access are done by
  kernel programs, so the executing thread requires kernel mode privilege.
• User programs call kernel programs by making system calls.
• An LWP runs in user mode until it makes a system call that requires kernel
  mode privilege. The mode switch then happens, which is costly.
LWP/Thread Modes
User Mode and Kernel Mode
   Don’t confuse the modes with the thread types (Kernel and User)
Complete Process State Diagram
     The state of a process is a superset of thread states.
     A process’s state is defined by the states of its threads.
VMSTAT – Glimpse of CPU Behavior
The vmstat tool provides a glimpse of the system's behavior: one line indicates
both CPU utilization and saturation.
The first line is the summary since boot, followed by samples every five seconds.




  On the far right is cpu:id, percent idle, which lets us determine how utilized the CPUs are.
        In this example, the idle time for the 5-second samples was always 0, indicating
        100% utilization.
  On the far left is kthr:r, the total number of threads on the ready-to-run queues.
  If the value is more than the number of CPUs, it indicates CPU saturation.
        Here, kthr:r was mostly 2 and sustained, indicating modest
        saturation for this single-CPU server. A value of 4 would indicate high
        saturation.
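A typical invocation (interval in seconds, then an optional sample count):

    vmstat 5 12    # summary-since-boot line, then 12 samples at 5-second intervals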
More about VMSTAT


Counter   Description
kthr
  r       Total number of runnable threads on the dispatcher queues

faults
  in      Number of interrupts per second
  sy      Number of system calls per second
  cs      Number of context switches per second, both voluntary and involuntary

cpu
  us      Percent user time; time the CPUs spent processing user-mode threads
  sy      Percent system time; time the CPUs spent processing system calls on behalf of user-mode threads, plus
          the time spent processing kernel threads
  id      Percent idle; time the CPUs are waiting for runnable threads. This value can be used to determine CPU
          utilization
CPU Utilization
•   You can calculate CPU utilization from vmstat by subtracting id from 100 or by adding us
    and sy.
•   100% utilized may be fine—it can be the price of doing business.
•   When a Solaris system hits 100% CPU utilization, there is no sudden dip in performance;
    the performance degradation is gradual. Because of this, CPU saturation is often a
    better indicator of performance issues than is CPU utilization.
•   The measurement interval is important: 5% utilization sounds close to idle; however, for
    a 60-minute sample it may mean 100% utilization for 3 minutes and 0% utilization for
    57 minutes. It is useful to have both short- and long-duration measurements.
A server running at 10% CPU utilization sounds like 90% of the CPU is available for
    "free," that is, it could be used without affecting the existing application. This isn't quite
    true. When an application on a server with 10% CPU utilization wants the CPUs, they
    will almost always be available immediately. On a server with 100% CPU utilization, the
    same application will find that the CPUs are already busy—and will need to preempt
    the currently running thread or wait to be scheduled. This can increase latency.
CPU Saturation
• The kthr:r metric from vmstat is useful as a measure for CPU saturation.
  However, since this is the total across all the CPU run queues, divide kthr:r
  by the CPU count for a value that can be compared with other servers.

• Any sustained non-zero value is likely to degrade performance. The
  performance degradation is gradual (unlike the case with memory
  saturation, where it is rapid).

• Interval time is still quite important. It is possible to see CPU saturation
  (kthr:r) while a CPU is idle (cpu:id). You may find that the run queue is
  quite long for a short period of time, followed by idle time. Averaging over
  the interval gives both a non-zero run queue length and idle time.
Solaris Performance Tools
Tool      Uses           Description
vmstat    kstat          For an initial view of overall CPU behavior

psrinfo   kstat          For physical CPU properties

uptime    getloadavg()   For the load averages, to gauge recent CPU activity

sar       kstat, sadc    For overall CPU behavior and dispatcher queue
                         statistics; sar also allows historical data collection

mpstat    kstat          For per-CPU statistics

prstat    procfs         To identify process CPU consumption

dtrace    DTrace         For detailed analysis of CPU activity, including
                         scheduling events and dispatcher analysis
uptime Command
Prints the up time with CPU load averages. They represent both
utilization and saturation of the CPUs.


•   The numbers are the 1-, 5-, and 15-minute load averages.

•   The load average is often approximated as the average number of runnable
    and running threads, which is a reasonable description.

•   A value equal to your CPU count usually means 100% utilization; less than
    your CPU count is proportionally less than 100% utilization; and greater
    than your CPU count is a measure of saturation.

•   A consistent load average higher than your CPU count may cause degraded
    performance. Solaris handles CPU saturation very well, so load averages
    should not be used for anything more than an initial approximation of CPU
    load.
sar - The system activity reporter
Provides live statistics, or can be activated to record historical
CPU statistics. Prints the user (%usr), system (%sys), wait I/O
(%wio), and idle (%idle) times.
Identifies long-term patterns that may be missed when taking a
quick look at the system. Historical data also provides a
reference for what is "normal" for your system.
The following example shows the default output of sar, which is
also the -u option to sar. An interval of 1 second and a count of
5 were specified.
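The invocation just described would be:

    sar -u 1 5    # %usr, %sys, %wio, %idle at 1-second intervals, 5 samples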
sar –q - Statistics on the run queues




runq-sz (run queue size). Equivalent to the kthr:r field from vmstat; can be
used as a measure of CPU saturation.
swpq-sz (swapped-out queue size). Number of swapped-out threads. Swapping
out threads is a last resort for relieving memory pressure, so this field will be
zero unless there was a dire memory shortage.
%runocc (run queue occupancy). Helps prevent a danger when intervals are
used: short bursts of activity can be averaged down to unnoticeable
values. The run queue occupancy can identify whether short bursts of run queue
activity occurred.
%swpocc (swapped-out occupancy). Percentage of time there were swapped-out
threads. If one thread of a process is swapped out, all other threads of the process must be too.
Is my system performing well?




           About the Individual Processors
 The psrinfo -v command determines the number of processors in the system and their
 speed. In Solaris 10, -vp prints additional information.




The mpstat command summarizes the utilization statistics for each CPU. Following
is an example of a four-CPU machine, being sampled every 1 second.
 syscl (system calls)                 csw (context switches)
 icsw (involuntary context switches)  migr (migrations of threads between processors)
 intr (interrupts)                    ithr (interrupts as threads)
 smtx (kernel mutexes)                srw (kernel reader/writer mutexes)
What are sampling and clock-tick
                  woes?
•   While most counters you see in Solaris are highly accurate, sampling issues remain
    in a few minor places. In particular, the run queue length as seen from vmstat
    (kthr:r) is based on a sample taken once every second. For example, one problem was
    caused by a program that deliberately created numerous short-lived threads every
    second, such that the one-second run queue sample usually missed the activity.

•   The runq-sz from sar -q suffers from the same problem, as does %runocc (which for
    short-interval measurements defeats the purpose of %runocc).

•   These are all minor issues, and a valid workaround is to use DTrace, with which
    statistics can be created at any accuracy desired.
Who Is Using the CPU?
The default output from the prstat command shows one line of output
per process, with the CPU utilization sampled from just before the prstat
command was executed.

The system load average indicates the demand and queuing for
CPU resources, averaged over 1-, 5-, and 15-minute periods; if it
exceeds the number of CPUs, the system is overloaded.
How is the CPU being consumed?
•   Use options -m (show microstates) & -L (show per-thread) to observe per-thread microstates.
•   Microstates represent a time-based summary, broken into percentages, for each thread.
•   USR through LAT sum to 100% of the time spent for each thread during the prstat sample.
•   USR (user time) and SYS (system time): time the thread spent running on the CPU.
•   LAT (latency) is the amount of time the thread spent waiting for a CPU. A non-zero number means there
    was some queuing/saturation for CPU resources.
•   SLP indicates the time the thread spent blocked on waiting events such as Disk I/O.
•   TFL & DTL determine if, and how much, the thread is waiting for memory paging.
•   TRP indicates the time spent on software traps.


       Each thread is waiting for CPU about 0.2% of the time – CPU resources are not constrained.




       Each thread is waiting for CPU about 80% of the time – CPU resources are constrained.
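The commands behind these samples look like this (interval in seconds):

    prstat -mL 5          # per-thread microstates for all processes, every 5 seconds
    prstat -mLp <pid> 5   # the same, restricted to one process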
How are threads inside the process
             performing?

The example shows us that thread number two in the target process is using the most CPU, and
spending 83% of its time waiting for CPU. We can look further at thread
number two with the pstack <pid>/<LWPID> command. Plain pstack <pid> shows all threads.




 Take a Java thread dump and identify the thread with native thread id = 2. This is the one. This
 way we can relate the code in Java that called the native system call or library method on the
 system.
Process Stack on a Java Virtual
               Machine: pstack
•   Use the “C++ stack unmangler” with Java virtual machine (JVM) targets to turn the
    raw C stack into readable native Java function calls.
Tracing Processes
                                  truss
truss traces system calls made on behalf of a process. It includes the user LWP
(thread) number, system call name, arguments and return codes for each system call.




  The truss –c option counts system calls instead of tracing each one.
Why Memory Saturation degrades
 performance more rapidly than
      CPU saturation
• Memory saturation may cause rapid degradation in performance. To overcome
  saturation, the OS resorts to page-in/out and swapping, which are themselves
  heavy tasks; with processes competing for memory, a race
  condition may occur.

• The available memory on a server may be artificially constrained, either
  through pre-allocation of memory or through the use of a garbage
  collection mechanism that doesn’t free up memory until some threshold is
  reached.
Thread Dumps

•   What exactly is a "Thread dump"?
     – A thread dump gives you information on what
       each thread in the VM is doing at a given
       point in time.

•   If an application seems stuck, or is running out of resources, a thread dump will reveal
    the state of the server. Java's thread dumps are a vital tool for server debugging, for
    scenarios like:
     –   PERFORMANCE RELATED ISSUES
     –   DEADLOCK (SYSTEM LOCKS UP)
     –   TIMEOUT ISSUES
     –   SYSTEM STOPS PROCESSING TRAFFIC
Thread dumps in Redknee Applications
•   Java thread dumps are obtained by doing:
     – Send (kill -3 <pid>) - On Unix           → See
       thread dump in ctl logs
     – Press (Ctrl + Break) – on Windows        → See
       thread dumps on xbuild console
     – $JAVA_HOME/bin/jstack <pid>              → See
       thread dumps on Shell console
•   Java thread dumps list all of the threads in an application

•   Threads are output in the order that they are created, newest thread at the
    top

•   Threads should be named with a useful name describing what they do or what they are
    responsible for (Open Tickets)
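Besides the external tools above, a thread dump can also be produced from inside
the JVM with the standard Thread.getAllStackTraces() API; a minimal sketch
(class name is illustrative):

    import java.util.Map;

    // Sketch: print a thread dump from within the application itself.
    public class DumpThreads {
        public static void main(String[] args) {
            for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
                Thread t = e.getKey();
                System.out.println("\"" + t.getName() + "\" daemon=" + t.isDaemon()
                        + " prio=" + t.getPriority() + " state=" + t.getState());
                for (StackTraceElement frame : e.getValue())
                    System.out.println("\tat " + frame);
            }
        }
    }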
Common Threads in Redknee
•   "Idle"
     –      CORBA threads to handle incoming requests, which are currently not doing any work
•   "RMI TCP Connection(<port>)-<IP>"
     –      Outbound connection over RMI to a specific host and port
•   "FileLogger"
     –      Framework thread for logging
•   "JavaIDL Reader for <host>:<port>"
     –      CORBA thread reading requests from a server
•   "TP-Processor8"
     –      Tomcat Web Thread
•   "Thread-<#>"
     –      Thread that has not been named (BAD)
•   "ChannelHome ForwardingThread"
     –      Thread used to cluster transactions over to a peer
     –      One of these threads per Home that is clustered (DB table)
•   "Worker#1"
     –      Worker threads doing work
Thread Dump May Give you Clues
•   C:\learn\classes>java Test
•   Full thread dump Java HotSpot(TM) Client VM (1.4.2_04-b05 mixed mode):

•   "Signal Dispatcher" daemon prio=10 tid=0x0091db28 nid=0x744 waiting on condition [0..0]

•   "Finalizer" daemon prio=9 tid=0x0091ab78 nid=0x73c in Object.wait() [1816f000..1816fd88]
•       at java.lang.Object.wait(Native Method)
•       - waiting on <0x10010498> (a java.lang.ref.ReferenceQueue$Lock)
•       at java.lang.ref.ReferenceQueue.remove(Unknown Source)
•       - locked <0x10010498> (a java.lang.ref.ReferenceQueue$Lock)
•       at java.lang.ref.ReferenceQueue.remove(Unknown Source)
•       at java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source)

•   "Reference Handler" daemon prio=10 tid=0x009196f0 nid=0x738 in Object.wait() [1812f000..1812fd88]
•       at java.lang.Object.wait(Native Method)
•       - waiting on <0x10010388> (a java.lang.ref.Reference$Lock)
•       at java.lang.Object.wait(Unknown Source)
•       at java.lang.ref.Reference$ReferenceHandler.run(Unknown Source)
•       - locked <0x10010388> (a java.lang.ref.Reference$Lock)


•   "main" prio=5 tid=0x00234998 nid=0x4c8 runnable [6f000..6fc3c]
•      at Test.findNewLine(Test.java:13)
•      at Test.<init>(Test.java:4)
•      at Test.main(Test.java:20)

•   "VM Thread" prio=5 tid=0x00959370 nid=0x6e8 runnable

•   "VM Periodic Task Thread" prio=10 tid=0x0023e718 nid=0x74c waiting on condition
•   "Suspend Checker Thread" prio=10 tid=0x0091cd58 nid=0x740 runnable
What is there in the Thread Dump?
•   In this case we can see that, at the time we took the thread dump, there were seven threads:
     –   Signal Dispatcher
     –   Finalizer
     –   Reference Handler
     –   main
     –   VM Thread
     –   VM Periodic Task Thread
     –   Suspend Checker Thread


•   Each thread name is followed by whether the thread is a daemon thread or not.
•   Then comes prio, the priority of the thread [ex: prio=5].
•   tid and nid are the Java thread id and the native thread id.
•   Then follows the state of the thread. It is one of:
     –   Runnable [marked as R in some VMs]: This state indicates that the thread is either running currently or is ready to run the next time the OS
         thread scheduler schedules it.
     –   Suspended [marked as S in some VMs]: Presumably indicates that the thread is not in a runnable state.
     –   Object.wait() [marked as CW in some VMs]: indicates that the thread is waiting on an object using Object.wait()
     –   Waiting for monitor entry [marked as MW in some VMs]: indicates that the thread is waiting to enter a synchronized block
•   What follows the thread description line is a regular stack trace.
Threads in a Deadlock
•   A set of threads is said to be in a deadlock when there is a cyclic wait condition, i.e. each thread in the
    deadlock is waiting on a resource locked by some other thread in the set of deadlocked threads. In newer
    JDKs deadlocks are detected and reported automatically in the thread dump:
     –   Found one Java-level deadlock:
     –   =============================
     –   "Thread-1":
     –    waiting to lock monitor 0x0091a27c (object 0x140fa790, a java.lang.Class),
     –    which is held by "Thread-0"

     –   "Thread-0":
     –    waiting to lock monitor 0x0091a25c (object 0x14026800, a java.lang.Class),
     –    which is held by "Thread-1"

     –   Java stack information for the threads listed above:
     –   ===================================================
     –   "Thread-1":
     –       at Deadlock$2.run(Deadlock.java:48)
     –       - waiting to lock <0x140fa790> (a java.lang.Class)
     –       - locked <0x14026800> (a java.lang.Class)
     –   "Thread-0":
     –       at Deadlock$1.run(Deadlock.java:33)
     –       - waiting to lock <0x14026800> (a java.lang.Class)
     –       - locked <0x140fa790> (a java.lang.Class)

     –   Found 1 deadlock.
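A minimal sketch of the kind of code that produces such a report (two threads
taking the same two locks in opposite order; all names are illustrative):

    // Sketch: Thread-0 locks A then wants B; Thread-1 locks B then wants A.
    public class Deadlock {
        static final Object A = new Object();
        static final Object B = new Object();

        public static void main(String[] args) {
            new Thread(() -> { synchronized (A) { pause(); synchronized (B) { } } }).start();
            new Thread(() -> { synchronized (B) { pause(); synchronized (A) { } } }).start();
            // kill -3 / jstack on this process reports "Found one Java-level deadlock".
        }

        static void pause() {
            try { Thread.sleep(100); } catch (InterruptedException e) { }
        }
    }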
Memory
Performance & Troubleshooting
             (Part 3)
Memory
• Memory includes
     Physical memory (RAM)
     Swap space

• Swap space is a part of disk storage acting as memory.

• Memory is a more complicated subject than CPU.

• Memory saturation triggers CPU saturation (Page Faults / GC)
Memory Utilization and Saturation
• To sustain a higher throughput, an application spawns more threads
  and holds more request data.

• Each thread occupies memory for the data it operates on and its own
  stack.

• At the point where memory demanded by a process can no longer be
  met from available memory, saturation occurs.

• Sudden increases in utilization without accompanying increases in
  throughput can also be used to detect degraded performance
  modes caused by software ‘aging’ issues, such as memory leaks.
VMSTAT – Glimpse of Memory
               Utilization


If the scan rate (sr) is continuously over 200 pages per second then there
is a memory shortage on the system.

Counter      Description
swap         Available swap space in Kbytes.
free         Combined size of the cache list and free list.
re           Page reclaims—The number of pages reclaimed from the cache list.
mf           Minor faults—The number of pages attached to an address space.
fr           Page-frees—Kilobytes that have been freed.
pi and po    Kilobytes paged in and paged out, respectively.
de           Anticipated short-term memory shortfall in kilobytes, to free ahead.
sr           The number of pages scanned by the page scanner per second.
Memory Consumption Model
Relieving Memory Pressure




After free memory is exhausted, pages are reclaimed from the cache list (file system,
I/O and other caches). Next the swapper swaps out entire threads, seriously degrading the
performance of swapped-out applications. The page scanner selects pages to free,
and is characterized by the scan rate (sr) from vmstat. Both use some form
of the Not Recently Used algorithm.
The swapper and the page scanner are only used when appropriate. Since
Solaris 8, the cyclic page cache, which maintains lists for Least Recently
Used selection, is preferred.
Heap and Non-Heap Memory
• Heap Memory
  Storage for Java objects
  -Xmx<size> & -Xms<size>


• Non Heap Memory
  Per-class structures such as runtime constant pool, field and method data,
  Code for methods and constructors, as well as interned Strings.
  Store loaded classes and other meta-data
  JVM code itself, JVM internal structures, loaded profiler agent code and data, etc.
  -XX:MaxPermSize=<size>


• Other
  Space the system/OS takes for the process
  Stacks of threads (-Xss & -Xoss)
  System & Native space
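Putting the options together, a typical launch line; the sizes and the application
class are examples only, not from the slides:

    java -Xms512m -Xmx512m -Xss1m -XX:MaxPermSize=128m com.example.Server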
What is Garbage Collection?



Reclaims memory from inaccessible objects.
Stack Overflow or Out of Memory
•   If you see OutOfMemoryError: unable to create native thread
     –    This means your application is falling short of native memory space – C space
     –    Either there is insufficient memory to allocate the stack or PC for the new thread,
     –    Or the application has crossed the JVM’s memory limit (3.2 GB in a 32-bit environment)
     –    The JVM/application hangs with this error; we need to restart.
             • See if you can reduce active threads, which ate away the system’s memory
             • Or see if you can decrease the stack size to lower memory use per thread
             • If you can’t bring memory consumption down, you need more system memory
•   If you see StackOverflowError
     – It means the thread that threw this error fell short of stack memory
       space
     – A thread stacks the states of the methods it invokes on to the stack memory
     – For the number of nested invocations the thread makes, memory is
       insufficient
     – Only the thread dies with this error; the application doesn’t hang.
             • See if you can bring down the number of nested invocations by the thread
             • Or else, increase the stack size with VM option –Xss; by default it is 1m
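A minimal sketch of the StackOverflowError case: each nested call pushes a frame
onto the thread's stack until the -Xss limit is hit, and only that thread dies
(class name is illustrative):

    // Sketch: unbounded recursion exhausts one thread's stack (-Xss), not the heap.
    public class Overflow {
        static int depth = 0;

        static void recurse() {
            depth++;
            recurse();                    // each call pushes another stack frame
        }

        public static void main(String[] args) {
            try {
                recurse();
            } catch (StackOverflowError e) {
                // Only this thread's stack overflowed; the JVM itself keeps running.
                System.out.println("Overflowed at depth " + depth);
            }
        }
    }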
Pros and Cons of Garbage Collection?




Advantages                        Disadvantages
  Increased reliability             Unpredictable application pauses
  Easier to write complex apps      Increased CPU/memory utilization
  No memory leaks or                Brutally complex
  invalid pointers
GC Logging
• Java Garbage Collection activity may be recorded in a log
  file. VM options:
   –   -verbose:gc (Enable GC logging, outputs to stdout)
   –   -Xloggc:<file> (GC logging to file)
   –   –XX:+PrintGCDetails (Detailed GC records)
   –   -XX:+PrintGCDateStamps (absolute instead of relative timestamps)
   –   Note: From relative timestamps in a GC log we can find absolute times either by tracing forward from
       application/GC start or backwards from application/GC stop

• Asynchronous garbage collection occurs whenever
  available memory is low.
• System.gc() does not force a synchronous garbage
  collection; it just gives a hint to the VM. VM option:
   – -XX:+DisableExplicitGC - Disable explicit GC
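For example, to log detailed GC records with absolute timestamps to a file
(flags as above; the application class name is illustrative):

    java -verbose:gc -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps com.example.Server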
What to look for in GC Logs?
• Important information from GC logs
   – The size of the heap after garbage collection
   – The time taken to run the garbage collection
   – The number of bytes reclaimed by garbage collection

• Heap Size after GC may give us a good idea of
  memory requirement.
   – 36690K->35325K(458752K), 4.3713348 secs – (1365K reclaimed)

• The other two help us assess the cost of GC to your
  application.
• All of them together help us tune GC.
How to Calculate Impact of GC on your
            Application?
• Run a test (60 sec; collect GC logs)
  – 36690K->35325K(458752K), 4.3713348 secs – (1365K reclaimed)
  – 42406K->41504K(458752K), 4.4044878 secs – (902K reclaimed)
  – 48617K->47874K(458752K), 4.5652409 secs – (770K reclaimed)

• Measure
  – Out of 60 sec, GC ran for 17.2 sec in total (the full log holds more
    records than the three shown), i.e. 29% of the time.
  – Considering relative CPU utilization, GC cost may be even higher.
  – 3037K of memory was recycled in 60 secs, i.e. 51831 bytes/second

• Analyze
  – 29% of time consumed by GC is too high (should be between 5-15%)
  – Is 51831 bytes/sec of recycled memory justifiable for the operations performed?
  – For average 50-byte objects, it churned around 1036 objects/sec
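The arithmetic, spelled out:

    1365K + 902K + 770K            = 3037K reclaimed in the three samples shown
    3037 * 1024 bytes / 60 sec     ≈ 51831 bytes/second recycled
    51831 / 50 bytes per object    ≈ 1036 objects/second churned
    17.2 sec GC / 60 sec wall time ≈ 29% of time in GC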
Heap Ranges – Xms to Xmx
• Heap Range can be defined
  – VM args –Xmx & -Xms define the upper & lower bounds of the heap size

• What causes the VM to expand the heap?
  – Expansion of the heap is CPU intensive and leaves the heap fragmented
  – The VM tries GC, defragmentation, compaction, etc. to free up memory.
  – If it is still unable to free the required memory, the VM decides to expand the heap

  – The VM may not wait till the brink; it keeps some free space for temp objects
  – By default, Sun tries to keep the proportion of free space to living objects at each
    garbage collection within the 40%-70% range.
      • If less than 40% of the heap is free after GC, expand the heap
      • If more than 70% of the heap is free after GC, contract the heap
  – VM args that customize the default ratio
      • -XX:MinHeapFreeRatio
      • -XX:MaxHeapFreeRatio
Gross Heap Tuning
• Consequences of large heap sizes
    – GC cycles occur less frequently, but each sweep takes longer
    – Long GC cycles may induce perceptible pauses in the system.
    – If the heap grows to a size larger than available RAM, paging/swapping may occur.
• Consequences of small heap sizes
    – GC runs too frequently, with less recovery in each cycle
    – The cost of GC increases
    – Since GC has to sweep less space each time, pauses are imperceptible.
• Max versus Min heap sizes
    –   Contraction & expansion of the heap is costly and should be worth the cause.
    –   Frequent contraction/expansion also leads to a fragmented heap.
    –   Keep Xmx=Xms for a transaction-oriented system which frequently peaks.
    –   Keep Xms<Xmx if the application infrequently operates at upper capacity.
We Just Learnt Gross Heap
                    Tuning
                  There might just be need for Fine Tuning

• We can fine-tune the GC considering the intricacies of the
  GC algorithm & heap structure. We will learn this shortly.

• Gross heap tuning is quite simple yet effective &
  empirically established.

• Gross techniques are fairly effective irrespective of the
  variables and, most important, we can always afford to
  apply them.
What is the advanced heap made of?
                  The one that works with the Generational Garbage Collector in the JVM



• The HEAP is made up of
  – Old Space or Tenured Space
      • Objects, when they get old in the young space, are transferred here.

  – Young Space or Eden Space
      • Young objects are held here.

  – Scratch (Survivor) Space
      • Working space for the copying algorithms

  – New Space
     • <Young Space> + <Scratch Space>
jmap -heap
Generational Garbage Collector
        Modern Heap
Fine Tuning the Heap
Are there better GC implementations to choose? JDK 1.4.x options


Young generation (sized with -XX:NewSize, -XX:MaxNewSize, -XX:SurvivorRatio)
  Low pause, 1 CPU:      Serial Copying Collector (default)
  Low pause, 2+ CPUs:    Parallel Copying Collector (-XX:+UseParNewGC)
  Throughput, 1 CPU:     Copying Collector (default)
  Throughput, 2+ CPUs:   Parallel Scavenge Collector (-XX:+UseParallelGC,
                         -XX:+UseAdaptiveSizePolicy, -XX:+AggressiveHeap)

Old generation (sized with -Xms, -Xmx)
  Low pause, 1 CPU:      Mark-Compact Collector (default)
  Low pause, 2+ CPUs:    Concurrent Collector (-XX:+UseConcMarkSweepGC)
  Throughput, any CPUs:  Mark-Compact Collector (default)

Permanent generation (sized with -XX:PermSize, -XX:MaxPermSize)
  Collection can be turned off with –Xnoclassgc (use with care)
jstat




Reference http://docs.oracle.com/javase/1.5.0/docs/tooldocs/share/jstat.html
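A common invocation samples GC utilization at a fixed interval, e.g.:

    jstat -gcutil <pid> 1000 10    # GC utilization summary every 1000 ms, 10 samples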
Heap Dump (Java)
      Snapshot of the memory at a point in time
                               VMs usually invoke a GC before dumping the heap

It contains
•   Objects (Class, fields, primitive values and references)
•   Classes (Classloader, name, super class, static fields)
•   GC Roots (Objects defined to be reachable by the JVM)
•   Thread Stacks (at dump time, with per-frame info about local objects)
It does not contain
•   Allocation information
    Who created the objects, and where were they created?
•   Live & Stale
    Used memory consists of both live and dead objects.
    The JVM usually does a GC before generating a heap dump.
    Tools may attempt to remove objects unreachable from the GC roots
    when loading the dump.
Heap Dump (Java)
                    How to take it?
• On Demand
   VM-arg (> JDK1.4.2_12) # -XX:+HeapDumpOnCtrlBreak
   Tools # JDK6 JConsole, VisualVM, MAT
   jmap -d64 -dump:file=<file-ascii-hdump> <pid>
   jmap -d64 -dump:format=b,file=<file-bin-hdump> <pid>

• Automatic on Crash
   VM-arg # -XX:+HeapDumpOnOutOfMemoryError

• Postmortem after crash; from Core-Dump
   jmap -d64 -dump:format=b,file=<file> <java-bin> <core-file>
Heap Dump (Java)
               Shallow vs Retained Heap
Shallow heap
• Memory held by an object’s primitive fields and reference variables
• Excludes referenced objects; counts just the references (32/64 bits)
Retained heap
• Object’s shallow size plus the shallow sizes of the objects that are
  accessible, directly or indirectly, only from this object.
• Memory that’s freed by the GC when this object is collected.
Garbage Collection Roots
•   A garbage collection root is an object accessible from outside the heap.
•   GC root objects will not be collected by the Garbage Collector at the time
    of measuring. Typical roots: locals (Java/native), threads, system classes, JNI
    references, monitors, finalizers.
Shallow vs. Retained Heap




http://help.eclipse.org/indigo/index.jsp?topic=%2Forg.eclipse.mat.ui.help%2Fconcept
s%2Fshallowretainedheap.html

In general, the retained size of a GC root is an integral measure, which helps to understand
memory consumption by object graphs.
Dominator Tree
                               (Object Dependencies)
•   Identifies chunks of retained memory & the objects that keep them alive
•   In the dominator tree each object is the immediate dominator of its children, so
    dependencies between the objects are easily identified.




•   The edges in the dominator tree do not directly correspond to object references in
    the object graph. The same object may actually be in the retained set of multiple roots.
•   http://help.eclipse.org/indigo/index.jsp?topic=%2Forg.eclipse.mat.ui.help%2Fconcepts%2Fshallowretainedheap.html
OQL (Object Query Language)
            Heap Dump not just for
                Troubleshooting
•   OQL is an Object Query Language that lets us query the heap dump in SQL
    fashion.

•   This enables us to analyze the heap not only after problems, but to proactively search for
    patterns. Example: a select to see if there are more than two objects for Boolean; ideally the
    two singletons .TRUE and .FALSE (singletons, like Enums) are sufficient –
                                select toHtml(a) + " = " + a.value from java.lang.Boolean a
                                    where objectid(a.clazz.statics.TRUE) != objectid(a)
                                     && objectid(a.clazz.statics.FALSE) != objectid(a)
                                                    (Runs on VisualVM)

•   VisualVM and MAT both support nice interfaces for OQL
          http://docs.oracle.com/javase/6/docs/technotes/tools/share/jhat.html
          http://help.eclipse.org/indigo/index.jsp?topic=%2Forg.eclipse.mat.ui.help%2Fwelcome.html
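Another proactive query, from the jhat OQL documentation: find all Strings of
length 100 or more (oversized strings often point at buffering or logging leaks):

    select s from java.lang.String s where s.count >= 100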
References
•   Thread Dump Analyzer (Thread Dumps)
    (http://java.net/projects/tda/)
•   GC Viewer (GC logs)
    (http://www.tagtraum.com/gcviewer.html)
•   Eclipse Memory Analyzer Tool (Heap Dumps, OQL)
    (http://help.eclipse.org/indigo/topic/org.eclipse.mat.ui.help/welcome.html)
•   VisualVM / JConsole / JMX – (Inspect live applications, snapshots, dumps, OQL)
    Bundled with the Java SDK
Feedback – Q&A
  simar.singh@redknee.com
     learn@ssimar.com

Mais conteúdo relacionado

Semelhante a Performance Concurrency Troubleshooting Final

Azug - successfully breeding rabits
Azug - successfully breeding rabitsAzug - successfully breeding rabits
Azug - successfully breeding rabits
Yves Goeleven
 
Network emulator
Network emulatorNetwork emulator
Network emulator
jeromy fu
 
The rice and fail of an IoT solution
The rice and fail of an IoT solutionThe rice and fail of an IoT solution
The rice and fail of an IoT solution
Radu Vunvulea
 
Doc 2011101412020074
Doc 2011101412020074Doc 2011101412020074
Doc 2011101412020074
Rhythm Sun
 

Semelhante a Performance Concurrency Troubleshooting Final (20)

Load Test Drupal Site Using JMeter and Amazon AWS
Load Test Drupal Site Using JMeter and Amazon AWSLoad Test Drupal Site Using JMeter and Amazon AWS
Load Test Drupal Site Using JMeter and Amazon AWS
 
Capacity Planning
Capacity PlanningCapacity Planning
Capacity Planning
 
Measuring CDN performance and why you're doing it wrong
Measuring CDN performance and why you're doing it wrongMeasuring CDN performance and why you're doing it wrong
Measuring CDN performance and why you're doing it wrong
 
Azug - successfully breeding rabits
Azug - successfully breeding rabitsAzug - successfully breeding rabits
Azug - successfully breeding rabits
 
Expecto Performa! The Magic and Reality of Performance Tuning
Expecto Performa! The Magic and Reality of Performance TuningExpecto Performa! The Magic and Reality of Performance Tuning
Expecto Performa! The Magic and Reality of Performance Tuning
 
Network emulator
Network emulatorNetwork emulator
Network emulator
 
Reactor, Reactive streams and MicroServices
Reactor, Reactive streams and MicroServicesReactor, Reactive streams and MicroServices
Reactor, Reactive streams and MicroServices
 
05. performance-concepts-26-slides
05. performance-concepts-26-slides05. performance-concepts-26-slides
05. performance-concepts-26-slides
 
The rice and fail of an IoT solution
The rice and fail of an IoT solutionThe rice and fail of an IoT solution
The rice and fail of an IoT solution
 
Patterns of enterprise application architecture
Patterns of enterprise application architecturePatterns of enterprise application architecture
Patterns of enterprise application architecture
 
Storm 2012 03-29
Storm 2012 03-29Storm 2012 03-29
Storm 2012 03-29
 
Real World Performance - OLTP
Real World Performance - OLTPReal World Performance - OLTP
Real World Performance - OLTP
 
Storage and I/O
Storage and I/OStorage and I/O
Storage and I/O
 
Presto meetup 2015-03-19 @Facebook
Presto meetup 2015-03-19 @FacebookPresto meetup 2015-03-19 @Facebook
Presto meetup 2015-03-19 @Facebook
 
Internals of Presto Service
Internals of Presto ServiceInternals of Presto Service
Internals of Presto Service
 
Doc 2011101412020074
Doc 2011101412020074Doc 2011101412020074
Doc 2011101412020074
 
Data Stream Management
Data Stream ManagementData Stream Management
Data Stream Management
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
 
Protecting Your API with Redis by Jane Paek - Redis Day Seattle 2020
Protecting Your API with Redis by Jane Paek - Redis Day Seattle 2020Protecting Your API with Redis by Jane Paek - Redis Day Seattle 2020
Protecting Your API with Redis by Jane Paek - Redis Day Seattle 2020
 
Fdp embedded systems
Fdp embedded systemsFdp embedded systems
Fdp embedded systems
 

Performance Concurrency Troubleshooting Final

  • 1. System Performance Build Fuel Tune Simar Singh simar.singh@redknee.com learn@ssimar.com
  • 2. Learn and Apply Topics Index (Click Links in Slide Show) • Performance • Concepts • Concurrency (Threads) • Troubleshooting • Processing (CPU/Cores) • Processing • Memory (System / Process) • Thread Dumps • Memory • Garbage Collection • Heap Dumps • Core Dumps & Postmortem • Java (jstack, jmap, jstat, VisualVM) • Solaris (prstat vmstat mpstat pstack)
  • 4. What will we Discuss? – LEARN – There are laws and principals that govern concurrency and performance. – Performance can be built, fueled and/or tuned. – How do we measure performance and capacity in abstract terms? – Capacity (throughput) and Load are often used interchangeably but incorrectly. – What is the difference between Resource utilization and saturation? – How performance & capacity are measured on a live system (CPU & Memory)? – APPLY – Find out how is your system being used or abused? – Find out how your system is performing as a whole? – Find out how a particular process in the system is performing? – Find out how a particular thread in the process performing? – Find out the bottle-necks? What is less or missing?
  • 5. Performance – Built, Fueled or Tuned • Built (Implementation and Techniques) – Binary Search O(log n) is more efficient than Linear Search O(n) – Caching can improve Disk I/O significantly boosting performance. • Fueled (More Resources) – Simply get a machine with more CPU(s) and Memory if constrained. – Implement RAID to improve Disk I/O • Tuned (Settings and Configurations) Tune Garbage Collection to optimize Java Processes – Tune Oracle parameters to get optimum database performance
  • 6. Capacity and Load • Load is an Expectation out of system – It is the rate of work that we put on the system. – It is an factor external to the system. – Load may vary with time and events. – It has no upper cap, can increase infinitely • Capacity is a Potential of the system – It is the max rate of work, the system supports efficiently, effectively & infinitely – It is a factor, internal to the system. Maximum capacity of a system is finite and stays fairly constant. We often call Throughput as the System’s Capacity for Load. • Chemistry between Load & Capacity – LOAD = CAPACITY? Good Expectation matches the potential. Hired – LOAD > CAPACITY? Bad Expectations is more than potential. Fired – LOAD < CAPACITY? Ugly Expectations is less then potential. Find another one – If not good better be ugly than bad.
  • 7. Performance Measurement of a System • Measures of System’s Capacity Response Time or Latency – Measures time spent executing a request • Round-trip time (RTT) for a Transaction – Good for understanding user experience – Least scalable, Developers focus on how much time each transaction takes • Throughput – Measures the number of transactions executed over a period of time • Output Transactions per second (TPS) – A measure of the system's capacity for load – Depending upon the resource type, It could be hit rate (for cache) • Resource Utilization – Measures the use of a resource • Memory, disk space, CPU, network bandwidth – Helpful for system sizing, is generally the easiest measurement to Understand – Throughput and Response Time can conflict, because resources are limited • Locking, resource contention, container activity
• 8. It is Time for System Capacity to be Loaded with Work (Throttling & Buffering Techniques)
  • Nothing stops us from loading a system beyond its capacity (max throughput).
  • "Transactions per second" invites a misconception – real traffic may arrive in bursts
    – Receiving 3600 transactions in an hour does not mean exactly 60 were pumped in every second
    – We may have received them in bursts – all in the first 10 minutes and nothing for the last 50
    – So we really can't say at what tps. We can regulate bursts with throttling and buffering.
  • Throttling – (implemented by the producer to smoothen output)
    – Spreads bursts over time to smoothen the output from a process
    – We may add throttles to control the output rate from threads to each external interface
    – A throttle of 10 tps ensures the max output is 10 tps, regardless of load & capacity
    – Throttling is a scheme for producers (checks production down to the rate the consumer can accept)
  • Buffering – (implemented by the consumer to smoothen input)
    – Spreads bursts over time to smoothen the input from an external interface
    – We add buffering to control the input rate to threads from each external interface
    – If the application processes input at 10 tps, load above that is buffered & processed later
    – Buffering is a scheme for consumers (take whatever is produced, consume at our own pace)
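As an illustration, here is a minimal Java sketch of both techniques. The send(), process() and onRequest() methods are hypothetical stand-ins for real interface calls; the rates and queue size are arbitrary.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.TimeUnit;

    public class ThrottleAndBuffer {
        // Throttling (producer side): pace output to at most tps sends per second,
        // regardless of how bursty the input was.
        static void throttledSend(Iterable<String> messages, int tps) throws InterruptedException {
            long intervalNanos = TimeUnit.SECONDS.toNanos(1) / tps;
            for (String m : messages) {
                long start = System.nanoTime();
                send(m);                                   // hypothetical external interface
                long sleep = intervalNanos - (System.nanoTime() - start);
                if (sleep > 0) TimeUnit.NANOSECONDS.sleep(sleep);
            }
        }

        // Buffering (consumer side): accept bursts into a bounded queue and
        // drain it at our own pace.
        static final BlockingQueue<String> buffer = new ArrayBlockingQueue<>(10_000);

        static void onRequest(String request) throws InterruptedException {
            buffer.put(request);                           // blocks only if the buffer is full
        }

        static void consumerLoop() throws InterruptedException {
            while (true) {
                process(buffer.take());                    // the consumer sets the processing rate
            }
        }

        static void send(String m) { /* write to an external interface */ }
        static void process(String m) { /* business logic */ }
    }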
• 9. Supply Chain Principle (Apply It to Define an Optimum Thread Pool Size)
  • The more throughput you want, the more resources you will consume.
  • You may apply this principle to define the optimum thread-pool size for a system/application:
    – To support a throughput of (t) transactions per second – (t) = 20 tps
    – Where each transaction takes (d) seconds to complete – (d) = 5 seconds
    – We need at least (d*t) threads (the minimum size of the thread pool) – (d*t) = 100 threads
  • A thread is an abstract unit of CPU resource here.
• 10. To support a throughput (t) of 20 tps, where each transaction takes (d) 5 seconds, we need at least 100 (d*t) threads.
  [Timing diagram: every second a new batch of 20 transactions starts, each running for 5 seconds; after 5 seconds the batches overlap so that 20 * 5 = 100 transactions are in flight at any instant.]
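The same arithmetic as a runnable Java sketch; only the pool sizing matters here, task submission is elided.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class PoolSizing {
        public static void main(String[] args) {
            int t = 20;            // target throughput, transactions per second
            int d = 5;             // duration of one transaction, seconds
            int poolSize = d * t;  // supply chain principle: at least d*t threads
            ExecutorService pool = Executors.newFixedThreadPool(poolSize); // 100 threads
            // With 100 threads, 20 new 5-second transactions can be accepted
            // every second at steady state without queuing.
            pool.shutdown();
        }
    }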
• 11. Quantify Resource Consumption: Utilization & Saturation
  • Resource Utilization
    – Utilization measures how busy a resource is.
    – It is usually represented as a percentage averaged over a time interval.
  • Resource Saturation
    – Saturation is often a measure of work that has queued, waiting for the resource
    – It can be measured both
      • as an average over time
      • and at a particular point in time.
    – For some resources that do not queue, saturation may be synthesized from error counts. Example: page faults reveal memory saturation.
  • Load (the input rate of requests) is an independent/external variable.
  • Resource consumption and throughput (the output rate of responses) are dependent/internal variables, a function of load.
• 12. How are Load, Resource Consumption and Throughput Related?
  • As load increases, throughput increases, until maximum resource utilization on the bottleneck device is reached. At this point the maximum possible throughput is reached and saturation occurs.
  • Then queuing (waiting for saturated resources) starts to occur.
  • Queuing typically manifests itself as degradation in response times.
  • This phenomenon is described by Little's Law: L = X * R, where L is the load in the system (the number of requests in flight), X is the throughput and R is the response time.
  • As L increases, X increases (R also increases slightly, because there is always some level of contention at the component level).
  • At some point X reaches Xmax – the maximum throughput of the system. At this point, as L continues to increase, the response time R increases in proportion, and throughput may then start to decrease, both due to resource contention.
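A quick worked example of Little's Law: if the system sustains X = 100 tps at R = 0.2 seconds per transaction, then L = X * R = 20 requests are in flight on average. If contention pushes R up to 1 second while X stays pinned at its ceiling of 100 tps, L grows to 100 – the extra 80 requests are simply sitting in queues.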
  • 13. Performance pattern of a Concurrent Process
• 14. Example: How are Throughput and Resource Consumption Related?
  • Throughput & latency can have an inverse or a direct relationship
    – Concurrent tasks (threads) often contend for resources (locking & contention)
  • Single-threaded
    – Higher throughput = lower latency
    – Consistent throughput; does not increase with incoming load & resources
    – Processes serially; good for batch jobs
    – Response time varies linearly with request order
  • Multi-threaded
    – Higher throughput = higher latency (most of the time)
    – Throughput may increase linearly with load, but starts to drop after a threshold
    – Processes concurrently; good for interactive modules (web apps)
    – Near-consistent response time; varies not with order but with load
  • Single-threaded – 10 CPU(s)
    – Threads = 1, Latency = .1 sec, Throughput = 1/.1 = 10 tx/sec
    – Threads = 1, Latency = .001 sec, Throughput = 1/.001 = 1000 tx/sec
  • Multi-threaded – 10 CPU(s)
    – Threads = 10, Latency = .1 sec, Throughput = (1/.1) * 10 = 100 tx/sec
    – Threads = 100, Latency = .2 sec, Throughput = (1/.2) * 100 = 500 tx/sec
• 15. Producer Consumer Principle: Predicting Maximum Throughput, Identifying the Bottleneck Device/Resource
  • The Utilization Law: Ui = T * Di
  • Where Ui is the percentage utilization of a device in the application, T is the application throughput, and Di is the service demand of the application on that device.
  • The maximum throughput of an application, Tmax, is limited by the maximum service demand across all of the devices in the application.
  • EXAMPLE – A load test reports 200 kb/sec average throughput:
    – CPUavg = 80%, Dcpu = 0.8 / 200 kb/sec = 0.004 sec/kb
    – Memoryavg = 30%, Dmemory = 0.3 / 200 kb/sec = 0.0015 sec/kb
    – Diskavg = 8%, Ddisk = 0.08 / 200 kb/sec = 0.0004 sec/kb
    – Network I/Oavg = 40%, Dnetwork I/O = 0.4 / 200 kb/sec = 0.002 sec/kb
  • In this case Dmax corresponds to the CPU, so the CPU is the bottleneck device.
  • We can predict the maximum throughput of the application by setting CPU utilization to 100% and dividing by Dcpu. In other words, for this example: Tmax = 1 / Dcpu = 250 kb/sec
  • To increase the capacity of this application, it would first be necessary to increase CPU capacity. Increasing memory, network or disk capacity would have little or no effect on performance until CPU capacity has been increased sufficiently.
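The slide's calculation reproduced as a small Java sketch (the utilization figures are the ones from the example above):

    public class UtilizationLaw {
        public static void main(String[] args) {
            double t = 200.0;                                  // measured throughput, kb/sec
            String[] name = {"cpu", "memory", "disk", "network I/O"};
            double[] util = {0.80, 0.30, 0.08, 0.40};          // measured utilizations
            double dMax = 0;
            String bottleneck = "";
            for (int i = 0; i < util.length; i++) {
                double d = util[i] / t;                        // Di = Ui / T, in sec/kb
                System.out.printf("D(%s) = %.4f sec/kb%n", name[i], d);
                if (d > dMax) { dMax = d; bottleneck = name[i]; }
            }
            // Tmax = 1 / Dmax; prints: bottleneck = cpu, Tmax = 250 kb/sec
            System.out.printf("bottleneck = %s, Tmax = %.0f kb/sec%n", bottleneck, 1.0 / dMax);
        }
    }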
• 16. Work Pools & Thread Pools Working Together
  • Work pools are queues of work to be performed by a software application or component.
    – If all threads in the thread pool are busy, incoming work can be queued in the work pool
    – Threads from the thread pool, when freed, can execute it later
  • Work pools absorb congestion & smoothen bursts
    – A queue consisting of units of work to be performed
    – CONGESTION: allows the current (client) threads to submit work and return
    – BURSTS: over-capacity transactions can be buffered in the work pool and executed later
    – Allows caching of units of work to reduce system-intensive calls
      • Can perform a bulk fetch from a database instead of fetching one record at a time (see the sketch below)
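A minimal sketch of the bulk-fetch idea, assuming a hypothetical saveAll() bulk database call: the worker blocks for the first unit of work, then drains whatever else has accumulated and handles the whole batch in one call.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class BatchingWorker {
        static class Record { /* one unit of work */ }

        // The work pool: a bounded queue of units of work.
        static final BlockingQueue<Record> workPool = new LinkedBlockingQueue<>(50_000);

        static void workerLoop() throws InterruptedException {
            List<Record> batch = new ArrayList<>();
            while (true) {
                batch.add(workPool.take());      // wait for at least one unit of work
                workPool.drainTo(batch, 99);     // grab up to 99 more without blocking
                saveAll(batch);                  // one bulk call instead of up to 100
                batch.clear();
            }
        }

        static void saveAll(List<Record> batch) { /* hypothetical bulk database insert */ }
    }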
• 17. Queuing Tasks May Be Risky
  • One task could lock up another that would be able to continue if the queued task were to run.
  • Queuing can smoothen an incoming traffic burst that is limited in time (depending on the rate of traffic and the queue size).
  • It fails if traffic arrives, on average, faster than it can be processed.
  • In general, work pools are in memory, so it is important to understand the impact of restarting a system, as in-memory elements will be lost.
    – Is it acceptable to lose the queued work?
    – Is the queue backed up on disk?
• 18. Bounded & Unbounded Pools (Load Shedding)
  • If not bounded, pools can grow freely but can cause the system to exhaust resources.
    – Work pool / queue unbounded – (may overload memory/heap & crash)
      • Each work object in the queue keeps holding space until consumed
    – Thread pool unbounded – (may overload CPU / native space and crash)
      • Each thread asks to be scheduled on a CPU and consumes native stack space
  • If the queue size is bounded, incoming execute requests block when it is full. We can apply different policies to handle it, for example:
    – Reject if there is no space (can have side effects)
    – Remove based on priority – (e.g., priority may be a function of time – timeouts)
  • Thread pools can apply different policies when the work pool is full (see the sketch below):
    – Block till there is available space
    – Starve (VERY BAD – sometimes needed)
    – Run in the current thread (very dangerous!)
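These policies map directly onto java.util.concurrent; here is a sketch with a bounded thread pool, a bounded work pool, and a pluggable saturation policy (the sizes are arbitrary):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class BoundedPools {
        public static void main(String[] args) {
            ThreadPoolExecutor pool = new ThreadPoolExecutor(
                    10, 10,                           // bounded thread pool: 10 threads
                    0L, TimeUnit.MILLISECONDS,
                    new ArrayBlockingQueue<>(1_000),  // bounded work pool: 1000 queued tasks
                    new ThreadPoolExecutor.AbortPolicy()); // reject when full (can have side effects)

            // Alternative policies for when the work pool is full:
            //   new ThreadPoolExecutor.CallerRunsPolicy()    - run in the current thread (very dangerous!)
            //   new ThreadPoolExecutor.DiscardOldestPolicy() - drop the oldest queued task
            //   new ThreadPoolExecutor.DiscardPolicy()       - silently drop the new task

            pool.execute(() -> System.out.println("task ran"));
            pool.shutdown();
        }
    }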
• 19. Work Pool & Thread Pool Sizes Can Often Be Traded Off for Each Other
  • Large work pool and small thread pool
    – Minimizes CPU usage, OS resources, and context-switching overhead.
    – Can lead to artificially low throughput, especially if tasks frequently block (e.g., I/O bound)
  • Small work pool generally requires a larger thread pool
    – Keeps the CPUs busier
    – May cause scheduling overhead (context switching) and may lessen throughput, especially if the number of CPUs is small.
  • 20. Processing (CPU) Performance & Troubleshooting (Part 2)
• 21. CPU
  • Many modern systems from Sun boast numerous CPUs or virtual CPUs (which may be cores or hardware threads).
  • The CPUs are shared by the applications on the system, according to a policy prescribed by the operating system and its scheduler.
  • If the system becomes CPU-limited, application or kernel threads have to wait on a queue to be scheduled on a processor, potentially degrading system performance.
  • The time spent on these queues, the length of these queues and the utilization of the system's processors are important metrics for quantifying CPU-related performance bottlenecks.
• 22. Process – User- and Kernel-Level Threads
  • A process includes the set of executable programs, address space, stack, and process control block. One or more threads may execute the program(s).
  • User-level threads (threads library)
    – Invisible to the OS; maintained by a thread library
    – Are the interface for application parallelism
  • Kernel threads
    – The unit that can be dispatched on a processor; its structures are maintained by the kernel
  • Lightweight processes (LWP)
    – Each LWP supports one or more user-level threads and maps to exactly one kernel-level thread. It maintains the state of a thread.
• 23. CPU Consumption Model – By default, Solaris 10 uses the 1:1 model (model 4); the rest are obsolete.
  • 24. Dispatcher and Run Queue at CPU
• 25. User Thread over a Solaris LWP – the state of the user thread and the LWP may be different.
• 26. Solaris Threading Model
  • If you are in a thread, the thread library must schedule it on an LWP.
  • Each LWP has a kernel thread, which schedules it on a CPU.
  • Threading models describe the mapping between Solaris threads & LWPs.
• 28. JVM Memory Organization & Threads
  • Method Area
    – The JVM loads class files, their type info and binary data into this area
    – This memory area is shared by all threads
  • Heap Area
    – The JVM places all objects the program instantiates onto the heap
    – This memory area is shared by all threads
    – This memory can be adjusted with the VM options -Xmx & -Xms as required
  • Java Stack and Program Counter (PC) Register
    – Each new thread that executes gets its own PC register & Java stack.
    – The value of the PC register indicates the next instruction to execute.
    – A thread's Java stack stores the state of the Java method invocations for that thread. The state of a Java method invocation includes
      • its local variables & the parameters with which it was invoked,
      • its return value (if any), and intermediate calculations.
    – This memory may be adjusted with the VM option -Xss; typically 1m for RK apps
    – The state of native method invocations is stored in an implementation-dependent way in native method stacks, as well as possibly in registers or other implementation-dependent memory areas.
• 29. A Java Thread's Stack Memory
  • The Java stack is composed of stack frames (or frames).
  • A stack frame contains the state of one Java method invocation.
    – When a thread invokes a method, the Java virtual machine pushes a new frame onto that thread's Java stack.
    – When the method completes, the virtual machine pops and discards the frame for that method.
• 30. Thread Modes: Kernel- & User-Mode Privilege
  • An LWP may execute in either kernel (sys) or user (usr) privilege mode.
  • Operations like processing data in local memory, or communication between threads of the same process, do not require kernel-mode privilege for the thread executing the user program.
  • However, inter-process communication and hardware access are done by kernel programs, so the executing thread requires kernel-mode privilege.
  • User programs call kernel programs by making system calls.
  • An LWP runs in user mode until it makes a system call that requires kernel-mode privilege. A mode switch then happens, which is costly.
• 31. LWP/Thread Modes: User Mode and Kernel Mode – don't confuse the modes with the thread types (kernel and user).
• 32. Complete Process State Diagram – the state of a process is a superset of the thread states; a process's state is defined by the states of its threads.
• 33. VMSTAT – A Glimpse of CPU Behavior
  • The vmstat tool provides a glimpse of the system's behavior: each line indicates both CPU utilization and saturation.
  • The first line is the summary since boot, followed by samples every five seconds.
  • On the far right, cpu:id (percent idle) lets us determine how utilized the CPUs are. In this example, the idle time for the 5-second samples was always 0, indicating 100% utilization.
  • On the far left, kthr:r is the total number of threads on the ready-to-run queues. If the value is more than the number of CPUs, it indicates CPU saturation. Here kthr:r was mostly 2 and sustained, indicating modest saturation for this single-CPU server. A value of 4 would indicate high saturation.
• 34. More about VMSTAT
  – kthr:r – Total number of runnable threads on the dispatcher queues
  – faults:in – Number of interrupts per second
  – faults:sy – Number of system calls per second
  – faults:cs – Number of context switches per second, both voluntary and involuntary
  – cpu:us – Percent user time; time the CPUs spent processing user-mode threads
  – cpu:sy – Percent system time; time the CPUs spent processing system calls on behalf of user-mode threads, plus the time spent processing kernel threads
  – cpu:id – Percent idle; time the CPUs are waiting for runnable threads. This value can be used to determine CPU utilization
• 35. CPU Utilization
  • You can calculate CPU utilization from vmstat by subtracting id from 100, or by adding us and sy.
  • 100% utilized may be fine – it can be the price of doing business.
  • When a Solaris system hits 100% CPU utilization there is no sudden dip in performance; the degradation is gradual. Because of this, CPU saturation is often a better indicator of performance issues than CPU utilization.
  • The measurement interval is important: 5% utilization sounds close to idle; however, for a 60-minute sample it may mean 100% utilization for 3 minutes and 0% utilization for 57 minutes. It is useful to have both short- and long-duration measurements.
  • A server running at 10% CPU utilization sounds like 90% of the CPU is available for "free", that is, it could be used without affecting the existing application. This isn't quite true. When an application on a server with 10% CPU utilization wants the CPUs, they will almost always be available immediately. On a server with 100% CPU utilization, the same application will find that the CPUs are already busy – and will need to preempt the currently running thread or wait to be scheduled. This can increase latency.
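For example, a hypothetical vmstat sample showing us=45, sy=15, id=40 gives a CPU utilization of 45 + 15 = 60%, or equivalently 100 - 40 = 60%.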
• 36. CPU Saturation
  • The kthr:r metric from vmstat is useful as a measure of CPU saturation. However, since it is the total across all the CPU run queues, divide kthr:r by the CPU count for a value that can be compared across servers.
  • Any sustained non-zero value is likely to degrade performance. The degradation is gradual (unlike the case with memory saturation, where it is rapid).
  • The interval time is still quite important. It is possible to see CPU saturation (kthr:r) while a CPU is idle (cpu:id): the run queue may be quite long for a short period of time, followed by idle time. Averaging over the interval gives both a non-zero run queue length and idle time.
• 37. Solaris Performance Tools
  – vmstat (kstat) – For an initial view of overall CPU behavior
  – psrinfo (kstat) – For physical CPU properties
  – uptime (getloadavg()) – For the load averages, to gauge recent CPU activity
  – sar (kstat, sadc) – For overall CPU behavior and dispatcher queue statistics; sar also allows historical data collection
  – mpstat (kstat) – For per-CPU statistics
  – prstat (procfs) – To identify process CPU consumption
  – dtrace (DTrace) – For detailed analysis of CPU activity, including scheduling events and dispatcher analysis
• 38. uptime Command
  • Prints the up time with the CPU load averages. They represent both utilization and saturation of the CPUs.
  • The numbers are the 1-, 5-, and 15-minute load averages.
  • The load average is often approximated as the average number of runnable and running threads, which is a reasonable description.
  • A value equal to your CPU count usually means 100% utilization; less than your CPU count is proportionally less than 100% utilization; greater than your CPU count is a measure of saturation.
  • A consistent load average higher than your CPU count may cause degraded performance. Solaris handles CPU saturation very well, so load averages should not be used for anything more than an initial approximation of CPU load.
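For example, a load average of 8.0 on a 4-CPU server suggests roughly 4 threads running and 4 waiting: 100% utilization plus a saturation factor of 2. A load average of 2.0 on the same server is roughly 50% utilization with no queuing.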
• 39. sar – The System Activity Reporter
  • Provides live statistics, or can be activated to record historical CPU statistics; prints the user (%usr), system (%sys), wait I/O (%wio), and idle (%idle) times.
  • Identifies long-term patterns that may be missed when taking a quick look at the system. Historical data also provides a reference for what is "normal" for your system.
  • The following example shows the default output of sar (which is also the -u option to sar), with an interval of 1 second and a count of 5.
• 40. sar -q – Statistics on the Run Queues
  – runq-sz (run queue size): equivalent to the kthr:r field from vmstat; can be used as a measure of CPU saturation.
  – swpq-sz (swapped-out queue size): the number of swapped-out threads. Swapping out threads is a last resort for relieving memory pressure, so this field will be zero unless there was a dire memory shortage.
  – %runocc (run queue occupancy): helps prevent a danger when intervals are used, namely that short bursts of activity can be averaged down to unnoticeable values. The run queue occupancy can identify whether short bursts of run queue activity occurred.
  – %swpocc (swapped-out occupancy): the percentage of time there were swapped-out threads. If one thread of a process is swapped out, all other threads of the process must also be.
• 41. Is My System Performing Well? About the Individual Processors
  • The psrinfo -v command determines the number of processors in the system and their speed. In Solaris 10, -vp prints additional information.
  • The mpstat command summarizes the utilization statistics for each CPU. The following is an example of a four-CPU machine being sampled every 1 second. Key columns:
    – syscl (system calls)
    – csw (context switches)
    – icsw (involuntary context switches)
    – migr (migrations of threads between processors)
    – intr (interrupts)
    – ithr (interrupts as threads)
    – smtx (kernel mutexes)
    – srw (kernel reader/writer mutexes)
• 42. Sampling and Clock-Tick Woes
  • While most counters you see in Solaris are highly accurate, sampling issues remain in a few minor places. In particular, the run queue length as seen from vmstat (kthr:r) is based on a sample taken once per second. For example, one problem was caused by a program that deliberately created numerous short-lived threads every second, such that the one-second run queue sample usually missed the activity.
  • The runq-sz from sar -q suffers from the same problem, as does %runocc (which, for short-interval measurements, defeats the purpose of %runocc).
  • These are all minor issues, and a valid workaround is to use DTrace, with which statistics can be created at any accuracy desired.
• 43. Who Is Using the CPU?
  • The default output from the prstat command shows one line per process, with the CPU utilization sampled over a recent interval before prstat produced its output.
  • The system load average indicates the demand and queuing for CPU resources, averaged over 1-, 5-, and 15-minute periods; if it exceeds the number of CPUs, the system is overloaded.
• 44. How Is the CPU Being Consumed?
  • Use the options -m (show microstates) & -L (show per-LWP) to observe per-thread microstates.
  • Microstates represent a time-based summary, broken into percentages, for each thread.
  • USR through LAT sum to 100% of the time spent by each thread during the prstat sample.
  • USR (user time) and SYS (system time) are the time the thread spent running on the CPU.
  • LAT (latency) is the amount of time the thread spent waiting for a CPU. A non-zero number means there was some queuing/saturation for CPU resources.
  • SLP indicates the time the thread spent blocked, waiting for blocking events like disk I/O.
  • TFL & DTL determine if, and how much, the thread is waiting for memory paging.
  • TRP indicates the time spent on software traps.
  • Example readings: each thread waiting for CPU about 0.2% of the time – CPU resources are not constrained; each thread waiting for CPU about 80% of the time – CPU resources are constrained.
• 45. How Are the Threads Inside the Process Performing?
  • The example shows that thread number two in the target process is using the most CPU and spending 83% of its time waiting for CPU.
  • We can look further at thread number two with the pstack <pid>/<lwpid> command. Plain pstack <pid> shows all threads.
  • Take a Java thread dump and identify the thread with native thread id (nid) = 2: that is the one. This way we can relate the Java code to the native system call or library method it invoked.
• 46. Process Stack on a Java Virtual Machine: pstack
  • Use a C++ stack demangler with Java virtual machine (JVM) targets to map the native C stack back to the Java function calls.
• 47. Tracing Processes: truss
  • truss traces the system calls made on behalf of a process. It includes the user LWP (thread) number, system call name, arguments and return codes for each system call.
  • The truss -c option counts system calls instead of tracing each one.
• 48. Why Does Memory Saturation Degrade Performance More Rapidly than CPU Saturation?
  • Memory saturation may cause rapid degradation in performance. To get over saturation, the OS resorts to page-in/page-out and swapping, which are themselves heavy tasks; with processes competing for memory, a race for memory may occur.
  • The available memory on a server may be artificially constrained, either through pre-allocation of memory or through the use of a garbage collection mechanism that doesn't free up memory until some threshold is reached.
• 49. Thread Dumps
  • What exactly is a "thread dump"?
    – A thread dump basically gives you information on what each thread in the VM is doing at a given point in time.
  • If an application seems stuck, or is running out of resources, a thread dump will reveal the state of the server. Java's thread dumps are a vital tool for server debugging, for scenarios like:
    – PERFORMANCE RELATED ISSUES
    – DEADLOCK (SYSTEM LOCKS UP)
    – TIMEOUT ISSUES
    – SYSTEM STOPS PROCESSING TRAFFIC
• 50. Thread Dumps in Redknee Applications
  • Java thread dumps are obtained by:
    – Sending kill -3 <pid> on Unix – see the thread dump in the ctl logs
    – Pressing Ctrl+Break on Windows – see the thread dump on the xbuild console
    – Running $JAVA_HOME/bin/jstack <pid> – see the thread dump on the shell console
  • Java thread dumps list all of the threads in an application.
  • Threads are output in the order they were created, the newest thread being at the top.
  • Threads should be named with a useful name describing what they do or what they are responsible for (Open Tickets).
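A thread dump can also be captured from inside the JVM; here is a minimal sketch using the standard ThreadMXBean (JDK 6+), which prints each thread's name, state and (truncated) stack trace, similar in spirit to the kill -3 output:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public class DumpThreads {
        public static void main(String[] args) {
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();
            // true, true -> also report locked monitors and locked synchronizers
            for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
                System.out.print(info);  // name, state and stack trace of one thread
            }
        }
    }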
• 51. Common Threads in Redknee
  • "Idle" – CORBA threads available to handle incoming requests, but currently not doing any work
  • "RMI TCP Connection(<port>)-<IP>" – outbound connection over RMI to a specific host and port
  • "FileLogger" – framework thread for logging
  • "JavaIDL Reader for <host>:<port>" – CORBA thread reading requests from a server
  • "TP-Processor8" – Tomcat web thread
  • "Thread-<#>" – a thread that has not been named (BAD)
  • "ChannelHome ForwardingThread" – thread used to cluster transactions over to a peer; one of these threads per Home that is clustered (DB table)
  • "Worker#1" – worker threads doing work
• 52. A Thread Dump May Give You Clues
  C:\learn\classes>java Test
  Full thread dump Java HotSpot(TM) Client VM (1.4.2_04-b05 mixed mode):
  "Signal Dispatcher" daemon prio=10 tid=0x0091db28 nid=0x744 waiting on condition [0..0]
  "Finalizer" daemon prio=9 tid=0x0091ab78 nid=0x73c in Object.wait() [1816f000..1816fd88]
      at java.lang.Object.wait(Native Method)
      - waiting on <0x10010498> (a java.lang.ref.ReferenceQueue$Lock)
      at java.lang.ref.ReferenceQueue.remove(Unknown Source)
      - locked <0x10010498> (a java.lang.ref.ReferenceQueue$Lock)
      at java.lang.ref.ReferenceQueue.remove(Unknown Source)
      at java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source)
  "Reference Handler" daemon prio=10 tid=0x009196f0 nid=0x738 in Object.wait() [1812f000..1812fd88]
      at java.lang.Object.wait(Native Method)
      - waiting on <0x10010388> (a java.lang.ref.Reference$Lock)
      at java.lang.Object.wait(Unknown Source)
      at java.lang.ref.Reference$ReferenceHandler.run(Unknown Source)
      - locked <0x10010388> (a java.lang.ref.Reference$Lock)
  "main" prio=5 tid=0x00234998 nid=0x4c8 runnable [6f000..6fc3c]
      at Test.findNewLine(Test.java:13)
      at Test.<init>(Test.java:4)
      at Test.main(Test.java:20)
  "VM Thread" prio=5 tid=0x00959370 nid=0x6e8 runnable
  "VM Periodic Task Thread" prio=10 tid=0x0023e718 nid=0x74c waiting on condition
  "Suspend Checker Thread" prio=10 tid=0x0091cd58 nid=0x740 runnable
• 53. What Is in the Thread Dump?
  • In this case we can see that, at the time we took the thread dump, there were seven threads:
    – Signal Dispatcher
    – Finalizer
    – Reference Handler
    – main
    – VM Thread
    – VM Periodic Task Thread
    – Suspend Checker Thread
  • Each thread name is followed by whether the thread is a daemon thread or not.
  • Then comes prio, the priority of the thread [e.g., prio=5].
  • tid and nid are the Java thread id and the native thread id.
  • Then follows the state of the thread, which is one of:
    – Runnable [marked as R in some VMs]: the thread is either currently running or ready to run the next time the OS thread scheduler schedules it.
    – Suspended [marked as S in some VMs]: the thread has been suspended and is not currently runnable.
    – Object.wait() [marked as CW in some VMs]: the thread is waiting on an object using Object.wait().
    – Waiting for monitor entry [marked as MW in some VMs]: the thread is waiting to enter a synchronized block.
  • What follows the thread description line is a regular stack trace.
• 54. Threads in a Deadlock
  • A set of threads is said to be in a deadlock when there is a cyclic wait condition, i.e., each thread in the deadlock is waiting on a resource locked by another thread in the set of deadlocked threads. Newer JDKs detect them automatically in the thread dump:
    Found one Java-level deadlock:
    =============================
    "Thread-1":
      waiting to lock monitor 0x0091a27c (object 0x140fa790, a java.lang.Class),
      which is held by "Thread-0"
    "Thread-0":
      waiting to lock monitor 0x0091a25c (object 0x14026800, a java.lang.Class),
      which is held by "Thread-1"
    Java stack information for the threads listed above:
    ===================================================
    "Thread-1":
      at Deadlock$2.run(Deadlock.java:48)
      - waiting to lock <0x140fa790> (a java.lang.Class)
      - locked <0x14026800> (a java.lang.Class)
    "Thread-0":
      at Deadlock$1.run(Deadlock.java:33)
      - waiting to lock <0x14026800> (a java.lang.Class)
      - locked <0x140fa790> (a java.lang.Class)
    Found 1 deadlock.
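A self-contained sketch that manufactures the same cyclic wait with two locks and then detects it with the standard ThreadMXBean (JDK 6+):

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadMXBean;

    public class DeadlockDemo {
        static final Object lockA = new Object();
        static final Object lockB = new Object();

        public static void main(String[] args) throws InterruptedException {
            // Thread-0 takes A then wants B; Thread-1 takes B then wants A.
            new Thread(() -> { synchronized (lockA) { pause(); synchronized (lockB) { } } }, "Thread-0").start();
            new Thread(() -> { synchronized (lockB) { pause(); synchronized (lockA) { } } }, "Thread-1").start();

            Thread.sleep(500);  // give both threads time to lock up
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();
            long[] ids = mx.findDeadlockedThreads();  // null when no deadlock exists
            System.out.println(ids == null ? "no deadlock" : ids.length + " threads deadlocked");
        }

        static void pause() {
            try { Thread.sleep(100); } catch (InterruptedException ignored) { }
        }
    }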
• 56. Memory
  • Memory includes physical memory (RAM) and swap space.
  • Swap space is a part of storage acting as memory.
  • Memory is a more complicated subject than CPU.
  • Memory saturation triggers CPU saturation (page faults / GC).
• 57. Memory Utilization and Saturation
  • To sustain a higher throughput, an application spawns more threads and holds more request data.
  • Each thread occupies memory for the data it operates on and for its own stack.
  • At the point where the memory demanded by a process can no longer be met from available memory, saturation occurs.
  • Sudden increases in utilization without accompanying increases in throughput can also be used to detect degraded performance modes caused by software 'aging' issues, such as memory leaks.
• 58. VMSTAT – A Glimpse of Memory Utilization
  • If the scan rate (sr) is continuously over 200 pages per second, there is a memory shortage on the system.
  – swap – Available swap space in Kbytes
  – free – Combined size of the cache list and free list
  – re – Page reclaims: the number of pages reclaimed from the cache list
  – mf – Minor faults: the number of pages attached to an address space
  – fr – Page-frees: kilobytes that have been freed
  – pi and po – Kilobytes paged in and paged out, respectively
  – de – Anticipated short-term memory shortfall, in kilobytes, to free ahead
  – sr – The number of pages scanned by the page scanner per second
• 60. Relieving Memory Pressure
  • After free memory is exhausted, memory is reclaimed from the cache list (file system, I/O, etc. caches).
  • The page scanner selects pages to free, and is characterized by the scan rate (sr) from vmstat.
  • Next, the swapper swaps out entire threads, seriously degrading the performance of the swapped-out applications.
  • Both use some form of the Not Recently Used algorithm, and both are used only when appropriate.
  • Since Solaris 8, the cyclic page cache, which maintains lists for Least Recently Used selection, is preferred.
• 61. Heap and Non-Heap Memory
  • Heap memory
    – Storage for Java objects: -Xmx<size> & -Xms<size>
  • Non-heap memory
    – Per-class structures such as the runtime constant pool, field and method data, code for methods and constructors, as well as interned Strings
    – Stores loaded classes and other meta-data: -XX:MaxPermSize=<size>
    – The JVM code itself, JVM internal structures, loaded profiler agent code and data, etc.
  • Other space the system/OS takes for the process
    – Stacks of threads (-Xss & -Xoss)
    – System & native space
• 62. What is Garbage Collection? Reclaiming memory from inaccessible objects.
• 63. Stack Overflow or Out of Memory
  • If you see OutOfMemoryError: unable to create native thread
    – Your application is falling short of native memory space (C space)
    – Either there is insufficient memory to allocate the stack and PC register for the new thread,
    – or the application has crossed the JVM's memory limit (3.2 GB in a 32-bit environment)
    – The JVM/application hangs with this error; we need to restart
    – See if you can reduce the number of active threads that ate away the system's memory
    – Or decrease the stack size to reduce the memory used per thread
    – If you can't bring memory consumption down, you need more system memory
  • If you see StackOverflowError (see the sketch below)
    – The thread that threw it fell short of stack memory space
    – A thread stacks the states of the methods it invokes onto its stack memory
    – For the number of nested invocations the thread makes, its stack memory was insufficient
    – Only the thread dies with this error; the application doesn't hang
    – See if you can bring down the number of nested invocations by the thread
    – Or else increase the stack size with the VM option -Xss; by default it is 1m
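A small demo of the second case: each recursive call pushes one more frame onto the thread's stack until it overflows. Running it with different -Xss values (e.g., java -Xss256k StackDepth vs. java -Xss2m StackDepth) changes the depth reached; note that the JVM itself survives.

    public class StackDepth {
        static int depth = 0;

        static void recurse() {
            depth++;
            recurse();  // one more frame on this thread's stack
        }

        public static void main(String[] args) {
            try {
                recurse();
            } catch (StackOverflowError e) {
                // only this thread's call chain dies; the application keeps running
                System.out.println("stack overflowed after " + depth + " frames");
            }
        }
    }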
• 64. Pros and Cons of Garbage Collection
  • Advantages
    – Increased reliability
    – Easier to write complex apps
    – No memory leaks or invalid pointers
  • Disadvantages
    – Unpredictable application pauses
    – Increased CPU/memory utilization
    – Brutally complex
• 65. GC Logging
  • Java garbage collection activity may be recorded in a log file. VM options:
    – -verbose:gc (enable GC logging to standard output)
    – -Xloggc:<file> (GC logging to a file)
    – -XX:+PrintGCDetails (detailed GC records)
    – -XX:+PrintGCDateStamps (absolute instead of relative timestamps)
    – Note: from relative timestamps in a GC log we can find absolute times, either by tracing forward from the application/GC start or backwards from the application/GC stop
  • Asynchronous garbage collection occurs whenever available memory is low.
  • System.gc() does not force a synchronous garbage collection; it just gives a hint to the VM. VM option:
    – -XX:+DisableExplicitGC (disable explicit GC)
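Putting the flags together, a typical launch line might look like the following (MyApp is a hypothetical main class; exact flag availability depends on the JVM version):

    java -verbose:gc -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps MyApp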
• 66. What to Look for in GC Logs
  • Important information from GC logs:
    – The size of the heap after garbage collection
    – The time taken to run the garbage collection
    – The number of bytes reclaimed by the garbage collection
  • The heap size after GC gives a good idea of the memory requirement:
    – 36690K->35325K(458752K), 4.3713348 secs – (1365K reclaimed)
  • The other two help us assess the cost of GC to the application.
  • All of them together help us tune GC.
• 67. How to Calculate the Impact of GC on Your Application
  • Run a test (60 sec, collect GC logs):
    – 36690K->35325K(458752K), 4.3713348 secs – (1365K reclaimed)
    – 42406K->41504K(458752K), 4.4044878 secs – (902K reclaimed)
    – 48617K->47874K(458752K), 4.5652409 secs – (770K reclaimed)
  • Measure
    – Out of 60 sec, GC ran for 17.2 sec, i.e., 29% of the time.
    – Considering relative CPU utilization, the GC cost may be even higher.
    – 3037K of memory was recycled in 60 secs, i.e., 51831 bytes/second.
  • Analyze
    – 29% of time consumed by GC is too high (it should be between 5-15%)
    – Is 51831 bytes/sec of recycled memory justifiable for the operations performed?
    – For average 50-byte objects, it churned around 1036 objects/sec.
• 68. Heap Ranges – Xms to Xmx
  • A heap range can be defined
    – The VM args -Xmx & -Xms define the upper & lower bounds of the heap size
  • What causes the VM to expand the heap?
    – Expansion of the heap is CPU-intensive and leaves the heap fragmented
    – The VM tries GC, defragmentation, compaction, etc. to free up memory
    – If it is still unable to free the required memory, the VM decides to expand the heap
    – The VM may not wait until the brink; it keeps some free space for temporary objects
    – By default, Sun tries to keep the proportion of free space to living objects at each garbage collection within the 40%-70% range:
      • If less than 40% of the heap is free after GC, expand the heap
      • If more than 70% of the heap is free after GC, contract the heap
    – VM args that customize the default ratio:
      • -XX:MinHeapFreeRatio
      • -XX:MaxHeapFreeRatio
• 69. Gross Heap Tuning
  • Consequences of large heap sizes
    – GC cycles occur less frequently, but each sweep takes longer
    – Long GC cycles may induce perceptible pauses in the system
    – If the heap grows to a size larger than available RAM, paging/swapping may occur
  • Consequences of small heap sizes
    – GC runs too frequently, with less recovery in each cycle
    – The cost of GC becomes higher
    – Since GC has to sweep less space each time, pauses are imperceptible
  • Max versus min heap sizes
    – Contraction & expansion of the heap is costly and should be worth the cause
    – Frequent contraction and expansion also leads to a fragmented heap
    – Keep Xmx=Xms for a transaction-oriented system that frequently peaks
    – Keep Xms<Xmx if the application only infrequently operates at upper capacity
• 70. We Just Learnt Gross Heap Tuning; There May Still Be a Need for Fine Tuning
  • We can fine-tune GC considering the intricacies of the GC algorithm & heap structure, as we will learn shortly.
  • Gross heap tuning is quite simple, yet effective & empirically established.
  • Gross techniques are fairly effective irrespective of the variables and, most importantly, we can always afford to apply them.
• 71. What Is the Advanced Heap Made Of? (The One That Works with the Generational Garbage Collector in the JVM)
  • The HEAP is made up of:
    – Old space or tenured space
      • Objects, when they get old in the young space, are transferred here
    – Young space or Eden space
      • Young objects are held here
    – Scratch space
      • Working space for the algorithms
    – New space
      • <young space> + <scratch space>
• 75. Are There Better GC Implementations to Choose? (JDK 1.4.x Options)
  • Young generation (sized with -XX:NewSize, -XX:MaxNewSize, -XX:SurvivorRatio)
    – Serial Copying Collector (default; 1 CPU)
    – Parallel Copying Collector: -XX:+UseParNewGC (2+ CPUs, low pause)
    – Parallel Scavenge Collector: -XX:+UseParallelGC with -XX:+UseAdaptiveSizePolicy and -XX:+AggressiveHeap (2+ CPUs, throughput)
  • Old generation (sized with -Xms, -Xmx)
    – Mark-Compact Collector (default)
    – Concurrent Collector: -XX:+UseConcMarkSweepGC (low pause)
  • Permanent generation (sized with -XX:PermSize, -XX:MaxPermSize)
    – Collection can be turned off with -Xnoclassgc (use with care)
• 77. Heap Dump (Java)
  • A snapshot of the memory at a point in time. The VM usually invokes a GC before dumping the heap.
  • It contains:
    – Objects (class, fields, primitive values and references)
    – Classes (classloader, name, super class, static fields)
    – GC roots (objects defined to be reachable by the JVM)
    – Thread stacks (at the time of the dump, with per-frame information about local objects)
  • It does not contain:
    – Allocation information: who created the objects, and where were they created?
  • Live & stale: used memory consists of both live and dead objects. Since the JVM usually does a GC before generating a heap dump, tools may also attempt to remove objects unreachable from the GC roots when loading the dump.
• 78. Heap Dump (Java) – How to Take It?
  • On demand
    – VM arg (JDK > 1.4.2_12): -XX:+HeapDumpOnCtrlBreak
    – Tools (JDK 6): JConsole, VisualVM, MAT
    – jmap -d64 -dump:file=<file-ascii-hdump> <pid>
    – jmap -d64 -dump:format=b,file=<file-bin-hdump> <pid>
  • Automatic on crash
    – VM arg: -XX:+HeapDumpOnOutOfMemoryError
  • Postmortem after a crash, from a core dump
    – jmap -d64 -dump:format=b,file=<file> <java-bin> <core-file>
• 79. Heap Dump (Java): Shallow vs. Retained Heap
  • Shallow heap
    – Memory held by an object's primitive fields and reference variables
    – Excludes the referenced objects themselves, counting just the references (32/64 bits)
  • Retained heap
    – The object's shallow size plus the shallow sizes of the objects that are accessible, directly or indirectly, only from this object
    – The memory that is freed by the GC when this object is collected
  • Garbage collection roots
    – A garbage collection root is an object accessible from outside the heap.
    – GC root objects will not be collected by the garbage collector at the time of measuring (locals (Java/native), threads, system classes, JNI, monitors, finalizers)
• 80. Shallow vs. Retained Heap
  • In general, retained size is an integral measure which helps in understanding memory consumption by object graphs.
  • http://help.eclipse.org/indigo/index.jsp?topic=%2Forg.eclipse.mat.ui.help%2Fconcepts%2Fshallowretainedheap.html
• 81. Dominator Tree (Object Dependencies)
  • Identifies chunks of retained memory & the objects that keep them alive.
  • In the dominator tree, each object is the immediate dominator of its children, so dependencies between objects are easily identified.
  • The edges in the dominator tree do not directly correspond to object references in the object graph; the same object may actually appear under the retained set of multiple roots.
  • http://help.eclipse.org/indigo/index.jsp?topic=%2Forg.eclipse.mat.ui.help%2Fconcepts%2Fshallowretainedheap.html
• 82. OQL (Object Query Language) – Heap Dumps Not Just for Troubleshooting
  • OQL is an object query language that lets us query a heap dump in SQL fashion.
  • This enables us to analyze the heap not only after problems occur, but to proactively search for patterns. For example, a select to check whether there are more than two objects for Boolean, where ideally the two singletons .TRUE and .FALSE (like enums) are sufficient:
    select toHtml(a) + " = " + a.value from java.lang.Boolean a where objectid(a.clazz.statics.TRUE) != objectid(a) && objectid(a.clazz.statics.FALSE) != objectid(a)
    (runs in VisualVM)
  • VisualVM and MAT both support nice interfaces for OQL.
  • http://docs.oracle.com/javase/6/docs/technotes/tools/share/jhat.html
  • http://help.eclipse.org/indigo/index.jsp?topic=%2Forg.eclipse.mat.ui.help%2Fwelcome.html
• 83. References
  • Thread Dump Analyzer (thread dumps) – http://java.net/projects/tda/
  • GC Viewer (GC logs) – http://www.tagtraum.com/gcviewer.html
  • Eclipse Memory Analyzer tool (heap dumps, OQL) – http://help.eclipse.org/indigo/topic/org.eclipse.mat.ui.help/welcome.html
  • VisualVM / JConsole / JMX (inspect live applications, snapshots, dumps, OQL) – bundled with the Java SDK
  • 84. Feedback – Q&A simar.singh@redknee.com learn@ssimar.com