This document describes a workshop on high performance statistical computing. It discusses principles for improving computational performance, including identifying goals, problems, algorithms, and architectures that can impact performance. It outlines steps for achieving faster results, such as predicting resource needs, identifying bottlenecks, discovering inefficient code sections, decomposing problems, and distributing work across multiple resources. The document provides examples of how algorithmic complexity and problem complexity can greatly influence runtime. It also reviews computer system architectures and hierarchies from registers to offline storage.
High Performance Statistical Computing
1. High Performance Statistical Computing with Applications in the Social Sciences
Micah Altman
Senior Research Scientist
"Introduction to the RCE" by Earl Robert Kinney
Manager, Research Computing Environment
Institute for Quantitative Social Science
Harvard University
2. Goals for today
Analysis
Describe performance goals
Identify resource use patterns
Identify resource bottlenecks
Identify performance hot-spots
Select problem decomposition
Application
Connect to RCE
Use the RCE to analyze larger data sets
Use the RCE to run interactive analyses more quickly
Use the RCE to run large numbers of analyses independently
[Source: Wikimedia Commons]
M. Altman & B. Kinney High Perf. Stat. Computing (v.9/10/11) 25
3. Organization of this Workshop
Motivation
Principles
Introduction to RCE
Measuring Resource Use
Scaling Up
Tuning Up
Scaling Out
(Parallelization)
Additional Resources
4. Nine Steps to Faster Results
1. Predict your resource needs through benchmarks, models, and algorithmic analysis
2. Select alternate algorithms when resource needs grow very rapidly with problem size
3. Identify resource bottlenecks using systems performance analysis tools
4. Address bottlenecks by increasing resources and/or changing program resource management
5. Discover hot-spots in programs using profiling tools
6. Adapt hot-spots to system architecture
7. Decompose the problem into independent subproblems
8. Distribute subproblems across pools of resources
9. Repeat analysis after making any changes
5. FREE! With every first class!
Coffee!
Chocolate!!
Consulting!!!
Time off for good behavior !!!!
6. IQSS (and affiliates) offer you support across all stages of your quantitative research:
Research design, including: design of surveys, selection of statistical methods
Primary and secondary data collection, including: the collection of geospatial and survey data
Data management, including: storage, cataloging, permanent archiving, and distribution
Data analysis, including: survey consulting, statistical software training, GIS consulting, high performance research computing
http://iq.harvard.edu/
7. The IQSS grants administration team helps with every aspect of the grant process. Contact us when you are planning your proposal.
Assisting in identifying research funding opportunities
Consulting on writing proposals
Assisting IQSS affiliates with:
preparation, review, and submission of all grant applications ("pre-award support")
management of their sponsored research portfolio ("post-award support")
Interpreting sponsor policies
Coordinating with FAS Research Administration and the Central Office for Sponsored Programs
… And, of course, support for seminars like this!
8. "One's Reach should exceed One's Grasp"
High Performance Statistical Computing: Principles
Leading-edge statistical methods (such as MCMC) can require lots of computing power
Ensuring robust results can multiply (and re-multiply) the number of analyses done:
Sensitivity analysis
Parameterization studies
Alternative models, Bayesian model averaging
Performance benchmarks provide information for budgeting computing $$$
9. “I Want it Now!”
Deadlines abound: conferences, trials, publication dates
New observations, variables, corrections, or model specifications may necessitate speedy reanalysis
10. "My strength is as the strength of ten because my heart is pure."
Selection of algorithms can change the nature of the computational resource usage
Tuning for a particular system can increase performance approximately ten-fold
In some circumstances work can be split across thousands of systems.
11. Principles
Goals matter
Problems matter
Algorithms matter
Answers matter
Architecture matters
12. Types of Performance Goals
Task completion time – wait time to finish
Efficiency – resource use for task
Throughput – work done by system overall
Latency – delay before response
Responsiveness – perception of response
Reliability – probability task/system will fail during time interval
"If you don't know where you're going, any road will take you." – Proverb
"If you come to a fork in the road, take it." – Yogi Berra
13. Performance Goals – Rules of Thumb
Rules of thumb:
Users of interactive software want responsiveness
Users of batch jobs want small completion times
Systems administrators want maximum throughput, reliability
Definitions:
Completion time: work(i)/resource(i)
Throughput: maximize (work/resource) for all jobs
Latency: time elapsed before first response to input
Real-time: complete task within fixed interval
"Responsiveness": perceived latency, task completion time, task progress indicators
14. Size of Factors Affecting Performance
If the runtime to solve a small instance of a problem (n=10) on a single system is one minute, how long will it take to solve a larger instance of n=1000?
Run time for the large instance (n=1000):
NP-Hard (worst case): 10^292 years
Very inefficient algorithm, O(N^3): 1.6 years
Inefficient algorithm, O(N^2): 16 hours
Very poor memory access patterns: 11 hours
Un-optimized code: 67 minutes
Optimized code: 7 minutes
Local multiprocessing: 2 minutes
Fully parallel / full cluster: 4 seconds
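The slide's back-of-the-envelope scaling can be sketched in a few lines. The helper name `extrapolate` is illustrative; the slide's exact figures also fold in constant factors, so this models only the dominant O(n^k) term:

```python
def extrapolate(t_small_s, n_small, n_large, exponent):
    # Scale a measured runtime by the dominant O(n^k) term:
    # t_large = t_small * (n_large / n_small)^k
    return t_small_s * (n_large / n_small) ** exponent

base = 60.0  # one minute to solve the n=10 instance
for k in (2, 3):
    secs = extrapolate(base, 10, 1000, k)
    print(f"O(N^{k}): {secs:,.0f} s = {secs / 86400:,.1f} days")
```

Running this shows why exponent matters: each extra power of N multiplies the runtime by another factor of 100 when the problem grows 100-fold.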
15. Problem Complexity Classes
Problem complexity class: the set of problems that can be solved in O(f(n)) for some f
More general than algorithmic complexity – encompasses all possible algorithms to solve the given problem
A polynomial-time algorithm is necessary for large problem instances
[Diagram: decision problems divide into decidable and undecidable; nested complexity classes: EXPSPACE, EXPTIME, PSPACE, CO-NP and NP (including NP-complete), BQP, P = BPP(?)]
16. Some Problems Are HARD
Traveling Salesperson Problem (weighted Hamiltonian cycle): Plot a route through N locations, visiting each once, that minimizes cost.
NP-Hard: worst-case instances require exponential time for an optimal, certain solution
NP-Complete: equivalent to a large class of hard problems
Source: Applegate, Bixby, Chvátal, and Cook (1998)
17. How to "Solve" the Unsolvable
Think small: use only a small number of cities. Aggregate to regions and treat as quasi-cities.
Restrict the problem: Euclidean distances are easier than travel cost.
Solve a different problem: minimum spanning tree.
Approximate the solution: for Euclidean distances, there is an algorithm based on the minimum spanning tree that is at most 50% longer.
Randomize: can a randomized algorithm find a solution with probability p? (No one knows…, probably not.)
Be lucky: maybe the "average" problem isn't that hard?
Heuristics: apply simulated annealing (etc.), cross fingers.
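The minimum-spanning-tree idea can be sketched in Python. This is the simpler "double-tree" heuristic (a preorder walk of the MST), which for metric instances is at most 100% longer than optimal; the 50% bound quoted above belongs to the stronger Christofides algorithm. All function names here are illustrative, not from the workshop materials:

```python
import math
import random

def mst_prim(points):
    # Prim's algorithm on the complete Euclidean graph.
    # Returns adjacency lists of the MST and its total weight.
    n = len(points)
    dist = lambda a, b: math.dist(points[a], points[b])
    in_tree = [False] * n
    best = [math.inf] * n
    parent = [-1] * n
    best[0] = 0.0
    adj = [[] for _ in range(n)]
    weight = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        weight += best[u]
        if parent[u] >= 0:
            adj[parent[u]].append(u)
            adj[u].append(parent[u])
        for v in range(n):
            if not in_tree[v] and dist(u, v) < best[v]:
                best[v], parent[v] = dist(u, v), u
    return adj, weight

def mst_tour(points):
    # Preorder walk of the MST, shortcutting repeated vertices:
    # a 2-approximation for metric TSP (triangle inequality).
    adj, mst_weight = mst_prim(points)
    order, stack, seen = [], [0], {0}
    while stack:
        u = stack.pop()
        order.append(u)
        for v in reversed(adj[u]):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    tour = order + [0]
    length = sum(math.dist(points[a], points[b]) for a, b in zip(tour, tour[1:]))
    return tour, length, mst_weight

random.seed(1)
pts = [(random.random(), random.random()) for _ in range(25)]
tour, length, mst_weight = mst_tour(pts)
print(f"cities: 25  MST: {mst_weight:.3f}  tour: {length:.3f} (at most 2x MST)")
```

By the triangle inequality the shortcut tour is never longer than twice the MST weight, and any tour minus one edge is a spanning tree, so the MST weight is also a lower bound on the optimum.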
18. How to recognize hard problems…
Is the problem routinely solved by existing systems?
Are efficient algorithms known?
Does it appear in lists of hard problems?
Is the problem universal? (Any computing problem, sufficiently generalized, is hard [Papadimitriou 1994])
Is run time growing exponentially in practice?
19. Algorithmic Complexity
Measures the complexity of a particular solution to a problem
Resource complexity: a measure of the resources used to solve a problem, as a function of input size
Common resource measures:
Time, usually represented as number of operations executed
Space, usually represented as number of discrete scalar values stored
20. Algorithmic Complexity: Sorting
bubbleSort(list)
  while (not finished) {
    finished <- true
    for i in (1 to length(list)-1) {
      if (list[i] > list[i+1]) {
        swap(list[i], list[i+1])
        finished <- false
      }
    }
  }
Number of operations: O(n^2)

quicksort(list)
  if (length(list) <= 1) return(list)
  select pivot from (list)
  for x in (list) {
    if x = pivot, add x to pivotList
    if x > pivot, add x to greaterList
    if x < pivot, add x to lessList
  }
  return(quicksort(lessList) + pivotList + quicksort(greaterList))
Number of operations: O(n log n) on average

*illustrations courtesy of wikipedia
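A quick way to feel the O(n^2) vs O(n log n) gap is to time both approaches on the same data. Python is used here for illustration; absolute timings will vary by machine:

```python
import random
import time

def bubble_sort(xs):
    # O(n^2) worst case: repeatedly swap adjacent out-of-order pairs.
    xs = list(xs)
    finished = False
    while not finished:
        finished = True
        for i in range(len(xs) - 1):
            if xs[i] > xs[i + 1]:
                xs[i], xs[i + 1] = xs[i + 1], xs[i]
                finished = False
    return xs

data = [random.random() for _ in range(2000)]
t0 = time.perf_counter()
a = bubble_sort(data)               # quadratic
t1 = time.perf_counter()
b = sorted(data)                    # built-in O(n log n) sort
t2 = time.perf_counter()
print(f"bubble O(n^2): {t1 - t0:.3f}s   built-in O(n log n): {t2 - t1:.4f}s")
```

Even at n=2000 the gap is typically several orders of magnitude; doubling n roughly quadruples the bubble-sort time but barely moves the built-in sort.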
21. Sorting Complexity Continued
Tally sort: items in a fixed range, no duplicates
inlist = logical(length = max - min + 1)
for (i in 1:length(items)) { inlist[items[i] - min + 1] = TRUE }
for (i in min:max) { if (inlist[i - min + 1]) dowork(i) }
How fast is this?

Algorithm Recurse_sort(array L, i = 0, j = length(L)-1)
  if L[j] < L[i] then
    L[i] ↔ L[j]
  if j - i > 1 then
    t = (j - i + 1)/3
    Recurse_sort(L, i, j-t)
    Recurse_sort(L, i+t, j)
    Recurse_sort(L, i, j-t)
  return L
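The tally-sort pseudocode above can be made concrete. A minimal runnable sketch for distinct integers in a known range, running in O(n + range) time:

```python
def tally_sort(items, lo, hi):
    # Mark which values in [lo, hi] are present, then sweep the
    # range in order. Assumes distinct integers within the range.
    present = [False] * (hi - lo + 1)
    for x in items:
        present[x - lo] = True
    return [lo + i for i, flag in enumerate(present) if flag]

print(tally_sort([7, 2, 9, 4], 1, 10))  # -> [2, 4, 7, 9]
```

The cost is linear in the number of items plus the size of the range, so it beats comparison sorts only when the range is not much larger than the data.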
22. Answers Matter
Before optimization, verify the answer
"Right" can mean "right enough" if well-defined
Correct code may have different performance characteristics than incorrect code
Returning the wrong answer can always be done quickly
23. Simple von Neumann Architecture
[Diagram: Processor ↔ Memory ↔ Input/Output]
24. More Modern
[Diagram: multiple processors, each with cores (FPU, L1 cache) sharing an L2 cache; memory; a RAID controller with disks; a network card; a GPU]
26. Deep Inside the Core
27. Resource Hierarchy: Big, Fast, Cheap*
Registers (<1 KB)
Cache (1 MB)
RAM (10 gigabytes)
Local storage (10's of terabytes)
ONLINE storage (100's of petabytes)
OFFLINE storage (10's of exabytes)
• Big, Fast, Cheap – pick 2
• Latency increases with each step down
• Storage increases
• Throughput decreases (except with some offline storage)
28. Reading One Byte: x <- m[1,3]
CPU: loads 8 bytes into a register
Cache: 256-byte cache line
RAM: 4 KB page
Disk: 8 KB block from NFS (Networked File System)
29. General Performance Implications of Architecture
Talking to external devices can cause waits… (latency)
Information transmitted to the CPU is limited by the bus (throughput)
In practice, expect 80% of theoretical data-path bandwidth at best
Some optimizations are highly specific to architectural details
Hidden parallelism at low levels
Information travels in chunks (at least bus size)
Complexity makes theoretical performance analysis difficult – use benchmarks
30. From Principles to Practice
Practice = Principles * Optimization Goals * Problem Type * Computing Environment
Optimization Goals
Throughput
Latency
Reliability
Scaling up
Scaling out
Problem Decomposition
Independent data
Independent calculations
Coupled calculations
31. Principled Preparation Checklist
Verify that your problem is tractable:
Substitute an easier problem
Restrict or limit the problem
Be lucky or clever
Establish performance goals
Identify possible algorithms
What is their resource complexity?
Are better algorithms known?
Identify potential system characteristics
Communications costs
Systems resources
32. Lab 0: Problem Definition
Define your computing problem as formally as you can.
What algorithms are used to solve the problem?
What are your performance goals?
[Source: http://andreymath.wikidot.com/ Creative Commons ShareAlike License]
33. An Introduction to the IQSS RCE
High Performance Statistical Computing: Introduction to RCE
What is it?
Why use it?
How does it work?
How do we use it?
34. What is the RCE?
Virtual Desktop:
•Full virtual desktop environment – connect anywhere
•Many research software packages available
•Persistent session – connect anytime
Interactive Nodes:
•For large interactive jobs
•Large amounts of memory available on demand
•Stata, Matlab, Mathematica
•Easy to run from your virtual desktop
Batch Processing:
•Run hundreds of jobs at once
•Optimized for non-interactive, independent work
35. Why use the RCE?
For Research
An environment customized for quantitative social science research
A wide variety of research software packages are available
For Convenience
The RCE enables you to access a research desktop from almost any computer
Sessions are persistent – disconnect from your office, reconnect from home
File storage is central – never worry about which computer has your files
For Resources
Large analysis jobs are offloaded to high-powered servers
Large resource pools: 800 processors, 3.3 TB of memory, 40 TB of disk storage
Regularly updated software
For Collaboration
Offers an ideal environment for collaborative research projects
Share project files, desktops, software
For Reliability
System performance and availability are constantly monitored
Research files are regularly backed up and stored securely
IQSS has full-time staff dedicated to supporting the RCE
36. RCE Architecture
[Diagram: client sessions connect through login nodes to virtual desktops; jobs are dispatched to interactive nodes and batch nodes, all backed by shared disk]
37. RCE Architecture Rules of Thumb
Connect to the interactive pool
Small problems – run directly (on an interactive node)
Large-memory problems use interactive nodes
Interactive problems use interactive nodes
Large-compute jobs use batch submit – but the problem must be decomposed
38. RCE Powered Apps – How it Works
User clicks on an application from the menu
RCE checks for availability of interactive nodes
If a node is available: RCE submits a special condor job to the interactive master node (~30 s); a window appears on the RCE desktop and the application runs on the node
If no node is available: the user receives a notice and is offered a batch node to run the job; if the user clicks "yes", RCE submits a special condor job to the batch master node (~120 s)
39. RCE Desktop
Application Menu – application launching
Quick Launch – quick access to e-mail, web, and office applications
File Browser – graphical view of your home directory and files
HMDC Outage Notifier – updates to reflect status of environment
Desktop Shortcuts – shortcuts to home directory and trash
Status Bar – shows open applications
40. Login Nodes
Number of servers: 8
Number of processors: 32
RAM per session: ~6 GB
41. Apps on Login Nodes
Features:
Easiest way to launch applications
Limitations:
Smaller amounts of RAM
Competition for resources with interactive processes
42. Interactive Nodes
Number of servers: 13
Number of processors: 84
RAM per job: 1-64 GB
43. Apps on Interactive Nodes
Features:
More memory available for the application
Dedicated processor reduces competition for resources
Multiple cores available (e.g. for Stata-MP)
Limitations:
Interactive nodes are limited in number
Time limit on applications (currently 72 hours); time can be extended by request
44. Batch Nodes
Number of servers: 61
Number of processors: 258
RAM per job: 2-4 GB
45. Running Statistical Apps on Batch Nodes
Features:
Nearly 400 nodes can run at the same time
Well suited for loosely-coupled parallel problems
Limitations:
Memory is more limited
Application must be designed to harness the power of all nodes
No failover to other pools
46. Memory Limitations
Login nodes:
Each user on the machine is allowed to use a portion of available memory
No enforcement of login limits (can be oversubscribed)
Interactive/batch nodes:
Each node has a share of memory based on request
Physical hardware will only run a number of jobs equal to processor cores (not oversubscribed)
47. Get started with the RCE: Checklist
Apply for an RCE account: support@help.hmdc.harvard.edu
Install the free NX software
Connect to rce.hmdc.harvard.edu
Run interactive programs with menus
Run large interactive jobs with the "RCE Powered" menu
Run large batch jobs using a simple launcher script
48. Lab 1: Connecting to the RCE
In this lab, we will log in to the RCE and launch Stata on an interactive node.
[Source: http://andreymath.wikidot.com/ Creative Commons ShareAlike License]
49. Systems Resource Use
High Performance Statistical Computing: Analyzing Resource Use
Benchmarks
Timing
System resource monitoring
System resource limits
50. Benchmarks
What patterns of usage are likely to occur?
What are the 80% cases?
Are there 10% cases that have unusual patterns of data access, or unusual input?
Can you construct a plausible worst case?
Parameterize benchmarks:
Parameterize problem size
Vary order of magnitude
Create benchmarks based on real cases:
Use real problems for full benchmarking
Miniaturize real problems for quick tests
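A size-parameterized benchmark harness can be sketched in a few lines. The names `benchmark` and `timed` are illustrative, and the sorted-random-numbers workload is a stand-in for a real analysis:

```python
import random
import time

def timed(task, n):
    # Wall-clock time for a single run of task(n).
    t0 = time.perf_counter()
    task(n)
    return time.perf_counter() - t0

def benchmark(task, sizes, reps=3):
    # Run task(n) at each problem size, keeping the best of `reps`
    # timings; varying n by orders of magnitude exposes the
    # empirical scaling curve.
    return {n: min(timed(task, n) for _ in range(reps)) for n in sizes}

workload = lambda n: sorted(random.random() for _ in range(n))
for n, secs in benchmark(workload, sizes=[10**3, 10**4, 10**5]).items():
    print(f"n={n:>7}: {secs:.4f}s")
```

Plotting these timings against n (or log n) gives the performance curve discussed later; a clean power-law slope suggests the expected complexity class, and a kink suggests a resource bottleneck.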
51. Common Benchmarks
Artificial benchmarks
Simple "unit" benchmarks
Real application + random data
Real application + real data
Real application + worst-case data
Mix of applications
52. Timing
Why measure timings:
Direct or indirect measure of performance
Establish a baseline for changes
Empirical measure of scaling
Limitations:
Timers are often imprecise for brief events
Other activity on the system adds "noise"
Many tools aggregate all phases of execution
Many tools aggregate all areas of resource use
CPU timings may exclude system resource use
Must use condor_submit to run these on non-interactive nodes
Heisenbugs
53. Alternative: Queuing Models
A formalist's alternative to benchmarks
Can be useful for capacity planning
Model services as a network of queues:
Different classes of "customers"
Resources with different delay characteristics
Transition probabilities
Distribution of "service events"
Poisson events: discrete, independent, no memory
Number of events is Poisson distributed; interarrival time is exponentially distributed
Little's law: length of queue = arrival rate * time in queue
Limitations:
Heroic assumptions are often required
State-space explosion
Only the simplest models are solvable in closed form
Source: Takefusa, et al. 1999
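Little's law is simple enough to apply directly. A minimal worked example with made-up numbers:

```python
def queue_length(arrival_rate_per_s, time_in_system_s):
    # Little's law: L = lambda * W
    # (average jobs in system = arrival rate * average time in system)
    return arrival_rate_per_s * time_in_system_s

# Jobs arriving at 2 per second, each spending 30 s in the system,
# imply an average of 60 jobs in the system at any moment.
print(queue_length(2.0, 30.0))  # -> 60.0
```

The law holds regardless of the arrival or service distributions, which is why it is the first sanity check in capacity planning.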
54. Wall-Clock Time
Measure completion time
Show phases of execution by inserting calls:
Linux: date
OS X: date
Windows: DATE
R: Sys.time()
Stata: display "$S_TIME $S_DATE"
Matlab: clock; tic
C: time(), getitimer()
> print(Sys.time())
[1] "2010-04-28 10:21:45 EDT"
> res <- optim(sq, distance, genseq, method="SANN",
+ control = list(maxit=30000, temp=2000))
> print(Sys.time())
[1] "2010-04-28 10:21:55 EDT"
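The same bracketing pattern as the Sys.time() example above, sketched in Python with a monotonic wall-clock timer; the two phases are toy stand-ins for real load and analysis steps:

```python
import time

t0 = time.perf_counter()
data = [i ** 2 for i in range(1_000_000)]   # "load/prepare" phase
t1 = time.perf_counter()
total = sum(data)                            # "analysis" phase
t2 = time.perf_counter()
print(f"prepare: {t1 - t0:.3f}s   analysis: {t2 - t1:.3f}s")
```

Inserting timestamps between phases shows where wall-clock time actually goes, which aggregate end-to-end timing hides.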
55. CPU Time
Measure CPU time used by a program
Show "system"-state and "user"-state time
Some tools show other resources:
Linux: /usr/bin/time -v
OS X: /usr/bin/time -l
Windows: timeit.exe*
R: system.time()
Stata: timer
Matlab: cputime
C: getrusage()
$ /usr/bin/time -v /usr/local/stata11/stata -b mycommand.do
User time (seconds): 0.00
System time (seconds): 0.01
Percent of CPU this job got: 64%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.03
...
*Optional tool, may require installation on your system
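The user/system/wall split can also be measured from inside a program. A Python sketch using `os.times()`: the CPU-bound loop accrues user time, while the sleep adds wall-clock time but almost no CPU time, mimicking a process that waits on a resource:

```python
import os
import time

wall0 = time.perf_counter()
cpu0 = os.times()
_ = sum(i * i for i in range(2_000_000))   # CPU-bound work
time.sleep(0.2)                            # waiting: consumes no CPU
cpu1 = os.times()
wall1 = time.perf_counter()

user = cpu1.user - cpu0.user
system = cpu1.system - cpu0.system
print(f"user: {user:.2f}s  system: {system:.2f}s  wall: {wall1 - wall0:.2f}s")
```

As the next slide notes, wall time much larger than user + system is the signature of a bottleneck or a sleeping process.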
56. Interpreting CPU Time
User time (seconds): 0.00
System time (seconds): 0.01
Percent of CPU this job got: 64%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.03
If (system)/(system + user) > .1: possibly inefficient use of system calls, I/O
If elapsed time >> (system + user): possible resource bottleneck; possible sleep
If CPU percent is low: possible CPU contention
57. Monitoring Running Processes
Show a list of processes running
See current and accumulated CPU usage
See CPU utilization
Linux: top; gnome-system-monitor
OS X: top; Utilities -> "Activity Monitor"; atMonitor (3rd party, highly recommended)
Windows: taskmgr.exe; top.exe*
$ gnome-system-monitor &
*Optional tool, may require installation on your system
58. Interpreting Process Monitor Results
$ gnome-system-monitor &
Show the list of processes running
Sort processes by CPU use
Show the # of processes waiting to use the CPU
See current and accumulated CPU & memory usage
See CPU utilization
59. Sample Performance Curves
• Best case: linear in size of problem
• Nonlinearities could mean…
• inefficient algorithm (case 2)
• hard problem (case 3)
• poor data access patterns (case 4)
60. System Resource Monitoring
Why monitor system resources?
Identify bottlenecks
Identify processes using resources – may affect overall throughput and capacity
Identify processes actively using resources – may affect performance
Limitations:
Tools are often imprecise for brief events
Other activity on the system adds "noise"
Many tools aggregate all phases of execution
Many tools aggregate all system use
Many tools aggregate sub-resource use
Must use condor_submit to run these on non-interactive nodes
Heisenbugs
61. Monitoring System Resources
See system aggregated use and activity for memory, disk, network
See memory use by process
See resource use by process (varies by platform)
Linux: gnome-system-monitor; /usr/bin/time -v; sar; iostat; vmstat
OS X: Utilities -> "Activity Monitor"; /usr/bin/time -v; sar; iostat
Windows: perfmon.exe; taskmgr.exe
$ gnome-system-monitor &
$ sar -A 1 10
$ /usr/bin/time -v stata -b somefile.do
62. Detailed System Resource Tracing
See system use/calls for a process as it runs:
Linux: strace; SystemTap (add-on)
OS X: dtrace
Windows: procmon.exe (add-on)
$ strace -o strace.log myProgram
$ sudo dtrace -n 'syscall:::entry { @[execname] = count() }' -c ls
63. Interpreting Process Memory Use
$ gnome-system-monitor &
Use Monitor -> Preferences to add the "Resident Memory" column
Memory – amount of virtual memory requested
Resident Memory – amount of memory currently in RAM for the process
64. Interpreting System Activity
$ sar -bB 1 10
System memory activity:
01:37:57 PM pgpgin/s pgpgout/s fault/s majflt/s
01:37:58 PM 0.00 0.00 14.71 0.00
System disk activity:
01:37:57 PM tps rtps wtps bread/s bwrtn/s
01:37:58 PM 0.00 0.00 0.00 0.00 0.00
Page faults – indicate memory activity or resource contention
File I/O – indicates file activity
65. Interpreting System Activity
$ perfmon
Page faults – indicate memory activity or resource contention
File I/O – indicates file activity
66. Interpreting Process Resource Use
$ /usr/bin/time -v stata -b command
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 149
Voluntary context switches: 1280
Involuntary context switches: 460
Swaps: 0
File system inputs: 0
File system outputs: 0
Page faults – indicate memory activity or resource contention (often memory related)
Voluntary context switches – indicate waiting on I/O or memory
Swaps – indicate a severe system memory shortage
File I/O – indicates file activity (process disk I/O)
If a number is always 0 – it's a lie
67. Symptoms of a CPU-Bound System/Problem
CPU user+sys activity near 100% while there are active processes (if # of procs > # of CPUs)
Performance curve for your problem is continuous
This is usually good:
CPU is the most expensive resource
You can trust code-profiling reports
More likely to have gains from parallelization
However, if CPU %sys is high, suspect inefficient use of system calls, or borderline I/O or memory bottlenecks
68. Symptoms of Resource Bottleneck
Memory bottlenecks:
Severe: processes in swap queue (or waiting on swap); lots of space in use (see swap -m), swapping activity, free memory low
Moderate: high context switches + high page (validity) faults + active processes with memory >> resident memory
I/O bottlenecks:
Moderate: high %sys activity in CPU, high # of system calls, # of interrupts
Severe: I/O rate high; context switches, waits on I/O, or processes sleeping on I/O; physical disk activity high
Performance curve: discontinuous regions of accelerated performance decline
69. Tune Against Bottlenecks
Typically, a single resource will be the bottleneck point:
CPU
Memory
I/O: graphics, network, disk
If you don't address the bottleneck, optimizations elsewhere won't matter
Bottlenecks may depend on usage scenario and phase of operation
Fixing one bottleneck may reveal others
Don't expect speedup of the entire program to be proportional to the code you just tuned!
Programs interact; try to profile on a quiet system first
70. Resource Analysis: Checklist
Identify benchmarks:
Small instances of your problem
Can vary size
Target an isolated system:
Minimize other activity
Time benchmarks at various sizes
Monitor systems resources
Look for non-linearities in the performance curve
Look for bottlenecks
71. Lab: Analyzing Resource Use
In this lab, we will log in to the RCE, run a simple set of benchmarks, use timing tools and performance analysis, and identify bottlenecks and performance curves.
[Source: http://andreymath.wikidot.com/ Creative Commons ShareAlike License]
72. Scaling Up
High Performance Statistical Computing: Scaling Up
Addressing resource bottlenecks
System and application limits
Storing/accessing large datasets
Visualizing large datasets
73. When to Scale Up
If resource analysis identifies a memory bottleneck
If resource analysis identifies an I/O bottleneck (maybe…)
If problem size prevents the program from starting
If the program crashes or hangs in the middle of solving large problems (maybe…)
If planning ahead for significant usage changes:
- size of problem data > ~1/2 available physical memory (RAM)
- change of algorithm
- change of data structure
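The half-of-RAM rule of thumb is easy to check before starting a job. A back-of-the-envelope sketch (the 8 GB machine and the 10M x 100 dataset are assumed purely for illustration):

```python
def dataset_bytes(n_rows, n_cols, bytes_per_cell=8):
    """Rough in-memory size of a numeric dataset stored as
    double-precision floats (8 bytes per cell)."""
    return n_rows * n_cols * bytes_per_cell

ram = 8 * 2**30                          # assume an 8 GB machine
need = dataset_bytes(10_000_000, 100)    # 10M rows x 100 numeric columns
print(f"dataset ~{need / 2**30:.1f} GiB; "
      f"exceeds half of RAM: {need > ram / 2}")
```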
74. Addressing Memory Bottlenecks
Review: symptoms of a memory bottleneck
- Discontinuity in the performance curve
- Memory size of process increasing
- Resident memory size of process relatively large
- System activity shows memory activity
Principles of addressing memory bottlenecks
- Memory hierarchy
- Locality of reference
- Programming patterns
- Add more resources
- Modify data types
- Modify data structures
- Modify algorithms
75. Memory Hierarchy
Registers (<1 KB)
Cache (1 MB)
RAM (10s of gigabytes)
Local storage (10s of terabytes)
Online storage (100s of petabytes)
Offline storage (10s of exabytes)
If a register access took a second, tape access would take a few centuries.
[Tape ad: "Buy one, get 8092 free!"]
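The closing comparison can be reproduced with rough, assumed latencies (the numbers below are order-of-magnitude illustrations, not measurements of any particular system):

```python
# Rescale the memory hierarchy so that one register access takes one
# second, and see what the other levels become at that scale.
latencies_ns = {
    "registers": 1,                     # ~1 ns (assumed)
    "cache": 10,                        # ~10 ns (assumed)
    "RAM": 100,                         # ~100 ns (assumed)
    "local disk": 10_000_000,           # ~10 ms (assumed)
    "offline tape": 10_000_000_000,     # ~10 s to mount and seek (assumed)
}
seconds_per_year = 3600 * 24 * 365
for level, ns in latencies_ns.items():
    scaled_s = ns / latencies_ns["registers"]   # 1 ns -> 1 s
    print(f"{level:12s} {scaled_s:>14,.0f} s  "
          f"(~{scaled_s / seconds_per_year:,.2f} years)")
# Tape comes out around 317 years: "a few centuries", as the slide says.
```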
76. Locality of Reference
Temporal locality: reuse the same data elements
Spatial locality: use elements that are "near" each other in memory
What is "near"?
- For vectors and files: sequential ordering
- For matrices: either row or column ordering, depending on language
- For complex data structures: use experimentation and analysis
[Figure: row-major order]
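A quick way to see spatial locality is to traverse the same matrix in two orders. A Python sketch (Python nests lists row-wise, so row order is the "near" order here; column-major languages like R and Fortran are the opposite):

```python
import timeit

# Traverse a matrix (list of rows) in row order vs. column order.
# Both compute the same total; the row-order version follows the
# memory layout of nested lists and is typically faster.
N = 300
matrix = [[i * N + j for j in range(N)] for i in range(N)]

def sum_by_rows(m):
    return sum(x for row in m for x in row)

def sum_by_cols(m):
    n = len(m)
    return sum(m[i][j] for j in range(n) for i in range(n))

print("row order:", timeit.timeit(lambda: sum_by_rows(matrix), number=20))
print("col order:", timeit.timeit(lambda: sum_by_cols(matrix), number=20))
```

Interpreted Python hides much of the cache effect behind indexing overhead; in compiled code or with large numeric arrays the gap from traversal order alone is usually much larger.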
77. Adding More Resources
"$$$" optimization:
Buy more memory, or…
use the RCE to request a larger share
This is effective if local set size < share size
78. System and Application Resource Limits
Limits imposed by the system or application:
Virtual memory
- Logical memory space for a process
- Limits the maximum size of memory requested
- Can prevent a program from starting, or from loading large data
Physical memory
- Physical RAM installed in the system
- Usually smaller than virtual memory, but not always
- Determines the maximum efficient local set
Resident size limits
- Affect the maximum efficient local set, though not as severely as physical limits
79. Limits in Linux and OS X
Where limits are set:
- Set at bootup
- Set by the system at login: group/user-level total memory limits
- Set in the shell at process creation: request a new limit (up to the user maximum)
- Set in code via setrlimit
- Set in the application
Know your limits:
- Linux/OS X: ulimit -a
- R: none for Linux
- Stata: query memory
Limits on 32- vs. 64-bit systems:
- A 32-bit OS has a limit of 4 GB for virtual and physical memory
- A 64-bit OS:
No practical limit on virtual memory
Physical memory still limited by hardware configuration and design
Data structures may require more memory to store, since pointers and default data types are larger
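The setrlimit mechanism mentioned above can be exercised directly from code. A Unix-only Python sketch using the standard resource module (the same limits the shell's ulimit reports):

```python
import resource  # Unix-only standard-library module

# Query the process's virtual-memory limit via getrlimit, the
# programmatic counterpart of `ulimit -a`.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
print("virtual memory limit (soft, hard):", soft, hard)
# RLIM_INFINITY means "no limit" -- common on 64-bit systems.
print("unlimited:", soft == resource.RLIM_INFINITY)

# setrlimit takes (soft, hard); re-applying the current values is a
# no-op, while lowering the soft limit would cap future allocations.
resource.setrlimit(resource.RLIMIT_AS, (soft, hard))
```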
80. Limits in Windows Systems
Where limits are set:
- Limits implied by configuration at boot
- Virtual memory typically depends on the paging space configured on disk (the pagefile)
- R: memory.limit()
Limits on 32- vs. 64-bit systems:
- Most 32-bit Windows systems have a limit of 3 GB physical memory:
32-bit addressing allows 4 GB, but 1 GB is reserved for memory-mapped hardware, so only 3 GB is left over in most Windows configurations
- A 64-bit OS:
No practical limit on virtual memory (8 TB)
Physical memory still limited by hardware configuration and design
Data structures may require more memory to store, since pointers and default data types are larger
Some Windows applications are 32-bit versions, so they are still limited to 4 GB of virtual memory.
81. Basic Memory Management in Statistical Software

                           Matlab    R                              Stata
Memory limit               ---       memory.size() [Windows only]   set memory
Remove objects             CLEAR     rm()                           clear
Shrink data types          ---       as.integer(real_val),          compress
                                     as.factor(string_val)
Measure data size          ---       object.size(), gc()            memory
Order for virtual memory   PACK      gc()                           set virtual
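The "shrink data types" row can be illustrated outside the three packages. A Python sketch of the same idea (a packed array of 4-byte C ints vs. boxed floats, roughly what as.integer() or Stata's compress achieves), assuming values that fit in an int:

```python
import sys
from array import array

# The same values held as a list of boxed Python floats vs. a
# packed array of 4-byte C ints.
values = [float(i) for i in range(100_000)]
packed = array("i", (int(v) for v in values))

list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
print("list of floats  :", list_bytes, "bytes")
print("packed int array:", sys.getsizeof(packed), "bytes")
# The packed form is several times smaller: the same principle as
# compress in Stata or as.integer()/as.factor() in R.
```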