2. Overview
Morning Session (Innovation Center, Room 105)
• 09:00 – 10:15 Overview: Event-Based Program Analysis
• 10:15 – 10:45 Break
• 10:45 – 11:45 Instrumentation and Runtime Measurement
• 11:45 – 13:00 Lunch break
Afternoon Session
• 13:00 – 13:45 Using PAPI Performance Counters
• 13:45 – 14:00 Break
• 14:00 – 15:00 Trace Visualization
• 15:00 – 15:30 Break
• 15:30 – 18:00 Hands-On (Wrubel Computing Center, Building WCC, Room 107)
3. We do have computers in Germany too (although quite old ones)
TU DRESDEN, ZIH, AND HPC
4. Dresden University of Technology
• Founded in 1828
• One of the oldest technical universities in Germany
• 14 faculties and a number of specialized institutes
• More than 35,000 students, about 4,000 employees, 438 professors
• International courses of study, bachelor's and master's degrees
• One of the largest faculties for computer science in Germany
• 110 million euros of annual third-party funding
• http://tu-dresden.de
5. Center for Information Services and HPC (ZIH)
• Central Scientific Unit at TU Dresden
• Competence center for "Parallel Computing and Software Tools"
• Strong commitment to supporting real users
• Development of algorithms and methods: cooperation with users from all departments
• Providing infrastructure and qualified services for TU Dresden and Saxony
6. Structure of ZIH
• Management
– Director: Prof. Dr. Wolfgang E. Nagel
– Assistant directors: Dr. Peter Fischer (COO), Dr. Matthias S. Müller (CTO)
• Administration (7 employees)
• Departments (ca. 100 employees, incl. trainees)
– Department of interdisciplinary function support and coordination (IAK)
– Department of networking and communication services (NK)
– Department of central systems and services (ZSD)
– Department of innovative methods of computing (IMC)
– Department of programming and software tool-kits (PSW)
– Department of distributed and data intensive computing (VDR)
8. Areas of Expertise
• Research topics
– Architecture and performance analysis of high performance computers
– Programming methods and techniques for HPC systems
– Grid Computing
– Software tools to support programming and optimization
– Modeling algorithms of biological processes
– Mathematical models, algorithms, and efficient implementations
• Role of mediator between vendors, developers, and users
• Picking up and preparing new concepts, methods, and techniques
• Teaching and Education
9. Performance Analysis Tools
• The Vampir performance analysis toolkit
– Vampir: Scalable event trace visualization
– VampirTrace: Instrumentation and run-time data collection
– Open Trace Format (OTF): Event trace data format
10. Performance Analysis Tools
Vampir-Team
Ronny Brendel Matthias Jurenz Prof. Wolfgang E. Nagel
Jens Doleschal Dr. Andreas Knüpfer Michael Peter
Ronald Geisler Matthias Lieber Heide Rohling
Daniel Hackenberg Holger Mickler Matthias Weber
Robert Henschel Dr. Hartmut Mix Thomas William
Dr. Matthias Müller
http://www.tu-dresden.de/zih/ptools
http://www.vampir.eu
12. Why performance analysis?
• Moore's Law still holds, so no need to tune performance?
• Increasingly difficult to get close to peak performance
– for sequential computation
• memory wall
• optimum pipelining, ...
– for parallel interaction
• Amdahl's law
• synchronization with single late-comer, ...
• Efficiency is important because of limited resources
• Scalability is important to cope with next bigger simulation
14. Motivation
• Reasons for parallel programming:
– Higher Performance
• Solve the same problem in shorter time
• Solve larger problems in the same time
– Higher Capability
• Solve problems that cannot be solved on a single processor
• Larger memory on parallel computers
• Time constraints limit the possible problem size (e.g. weather forecast: turnaround within a working day)
• In both cases performance is one of the major concerns:
– Also consider sequential performance within the parallel sections
15. Parallelization Strategies
• General strategy for parallelization:
– Distribute the work to many workers
Limitations:
– Not all tasks can be split into smaller sub-tasks
– Dependencies between sub-tasks
– Coordination overhead
– (same as for human teams)
Algorithms:
– Different algorithms for the same problem differ in terms of parallelization
– Different “best” algorithms for serial vs. parallel execution or for different parallelization schemes
17. Speed-up
• Definition of speed-up:
S = T_S / T_P
– T_S: serial execution time
– T_P: parallel execution time with P CPUs
Speed-up versus number of used processors:
[Figure: ideal vs. real speed-up for 1–8 CPUs]
Actual speed-up often lower than optimal one due to aforementioned limitations.
18. Parallel Efficiency
• Alternative definition: parallel efficiency
E = S / P = T_S / (P · T_P)
– T_S: serial execution time
– T_P: parallel execution time with P CPUs
Parallel efficiency versus number of used processors:
[Figure: ideal vs. real parallel efficiency for 1–8 CPUs]
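A small worked example with assumed timings makes both measures concrete: with T_S = 100 s and T_P = 30 s on P = 4 CPUs, S = 100 / 30 ≈ 3.3 and E = 3.3 / 4 ≈ 0.83, i.e. each CPU does useful work about 83% of the time.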
19. Amdahl’s law
• Fundamental limit of parallelization
S(P) = 1 / ( (1 - F) + F / S_P )
• Only a fraction F of the algorithm is parallel, with speed-up S_P
• A fraction (1 - F) is serial
Then the maximum resulting speed-up is:
[Figure: maximum speed-up versus 1–16 CPUs for F = 99%, 95%, 90%, 80%, with the ideal curve]
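These curves flatten out at 1 / (1 - F): for F = 99% the asymptotic limit is 100, for F = 95% it is 20, for F = 90% it is 10, and for F = 80% it is 5, no matter how many CPUs are used.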
20. Amdahl’s law
• If you know your desired speed-up S you can calculate F:
F = 1 - 1 / S
– F gives you the fraction of your program that has to be executed in parallel in order to achieve a speed-up S (asymptotically).
– In order to estimate the resulting effort you need to know in which parts of your program the remaining (1 - F) of the time is spent.
– This is even before considering the actual parallelization method, which:
• Might add new serial sections
• Brings coordination overhead
• Will not scale arbitrarily high, i.e. the time in the parallel section will stay > 0
21. Amdahl’s law, example
• Example program with some subroutines calling one another:
# calls    Time (%)   Accumulated Time (%)   Call
155648     31.22      31.22                  Calc
603648     22.24      53.46                  Multiply
155648     10.05      63.51                  Matmul
214528      9.33      72.84                  Copy
603648      7.87      80.71                  Find
– For a maximum speed-up of 2 one needs to parallelize Calc and Multiply.
– For a maximum speed-up of 5 all of them need to be parallelized!
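A quick check with F = 1 - 1/S from the previous slide: a speed-up of 2 requires F = 1 - 1/2 = 50% of the run time to be parallel, and Calc plus Multiply cover 53.46%; a speed-up of 5 requires F = 1 - 1/5 = 80%, which only all five routines together (80.71%) reach.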
22. General Parallelization Strategy
• Therefore, successful parallelization requires:
– Finding the actual hot-spots of work
– Sufficient potential for parallelization
– Parallelization strategy that introduces minimum coordination overhead
• There are no general rules! Things that help to achieve high performance:
– Know your application
– Know your compiler
– Understand the performance tool
– Know the characteristics of the hardware
24. Profiling
• Profiling gives an overview of the distribution of run time
• Usually at the level of subroutines, sometimes line by line
• Rather low overhead
• Usually good enough to find computational hot spots
• Too little detail to detect performance problems and their causes
• More sophisticated ways of profiling:
– Based on hardware performance counters
– Phase-based profiles
– Call-path profiles
25. Profiling
• Profile Recording
– Of aggregated information (Time, Counts, …)
– About program and system entities
• Functions, loops, basic blocks
• Application, processes, threads, …
• Methods of Profile Creation
– PC sampling (statistical approach)
– Direct measurement (deterministic approach)
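To illustrate the statistical approach, here is a minimal PC-sampling sketch in C, assuming a POSIX system (setitimer/SIGPROF). A real profiler would inspect the interrupted program counter from the signal context and bin it by function; this sketch only counts ticks:

#include <signal.h>
#include <stdio.h>
#include <sys/time.h>

static volatile unsigned long samples = 0;

/* Invoked on every profiling tick; a real profiler would record the
   interrupted PC here to attribute time to functions or lines. */
static void on_profile_tick(int sig) {
    (void)sig;
    samples++;
}

int main(void) {
    /* 10 ms period and 10 ms initial delay: roughly 100 samples/s */
    struct itimerval iv = { { 0, 10000 }, { 0, 10000 } };
    signal(SIGPROF, on_profile_tick);
    setitimer(ITIMER_PROF, &iv, NULL);

    volatile double x = 0.0;                     /* some workload to sample */
    for (long i = 0; i < 200000000L; i++) x += i * 0.5;

    printf("collected %lu samples\n", samples);
    return 0;
}

Since ITIMER_PROF counts CPU time consumed by the process, idle waiting is not sampled.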
26. Profiling with gprof
– Compile with profiling support
• Using -pg for GNU, -p -g for Intel
• Optimization (-O3) might obscure the output somewhat
%> mpicc -p -g -O2 heat-mpi-slow-big.c -o heat-mpi-slow-big
– Execute normally
• Used to be only for sequential programs
• Parallel only with the GMON_OUT_PREFIX trick
%> export GMON_OUT_PREFIX=ggg
%> mpirun -np 4 heat-mpi-slow-big
%> ls
ggg.11762 ggg.11763 ggg.11764 ggg.11765
27. Profiling with gprof
– Pre-process profiling output with gprof:
• Text output
• There are also GUI front-ends like
– pgprof (PGI)
– kprof (KDE)
– For a single rank:
%> gprof [-b] heat-mpi-slow-big ggg.11765 | less
– Combine results for all ranks:
%> gprof -s heat-mpi-slow-big ggg.*
%> gprof [-b] heat-mpi-slow-big gmon.sum | less
28. Profiling with gprof
– Flat profile for one of four ranks:
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
100.00 2.08 2.08 1 2.08 2.08 Algorithm
0.00 2.08 0.00 1 0.00 0.00 CalcBoundaries
0.00 2.08 0.00 1 0.00 0.00 DistributeNodes
– Flat profile for all four ranks combined:
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
100.00 8.59 8.59 4 2.15 2.15 Algorithm
0.00 8.59 0.00 4 0.00 0.00 CalcBoundaries
0.00 8.59 0.00 4 0.00 0.00 DistributeNodes
29. Profiling with gprof
– Annotated call graph for one of four ranks:
Call graph
granularity: each sample hit covers 4 byte(s) for 0.48% of 2.08 seconds
index % time self children called name
2.08 0.00 1/1 main [2]
[1] 100.0 2.08 0.00 1 Algorithm [1]
-----------------------------------------------
<spontaneous>
[2] 100.0 0.00 2.08 main [2]
2.08 0.00 1/1 Algorithm [1]
0.00 0.00 1/1 DistributeNodes [4]
0.00 0.00 1/1 CalcBoundaries [3]
-----------------------------------------------
0.00 0.00 1/1 main [2]
[3] 0.0 0.00 0.00 1 CalcBoundaries [3]
-----------------------------------------------
0.00 0.00 1/1 main [2]
[4] 0.0 0.00 0.00 1 DistributeNodes [4]
-----------------------------------------------
30. Profiling
• Simple profiling is a good starting point
• Reveals computational hot spots
• Hides away outlier values in the average
• More details needed for
• Parallel analysis and identification of performance problems
• Finding optimization opportunities
• Advanced profiling tools:
• TAU http://www.cs.uoregon.edu/research/tau/
• HPCToolkit http://hpctoolkit.org/
32. Event Tracing
• Collect more detailed information for more insight
• Do not summarize run-time information
• Collect individual events with their properties during run time
• Event Tracing can be used for:
– Visualization (VampirSuite)
– Automatic analysis (Scalasca)
– Debugging or replay (VampirSuite + Scalasca)
33. Tracing
• Recording of run-time events (points of interest)
– During program execution
– Enter/leave of functions/subroutines
– Send/receive of messages, synchronization
– More …
– Saved as event records
• Timestamp, process, thread, event type
• Event specific information
• Sorted by time stamp
– Collected via instrumentation & trace library
34. Profiling vs Tracing
• Tracing Advantages
– Preserve temporal and spatial relationships (context)
– Allow reconstruction of dynamic behavior at any required abstraction level
– Profiles can be calculated from trace
• Tracing Disadvantages
– Traces can become very large
– May cause perturbation
– Instrumentation and tracing are complicated
• Event buffering, clock synchronization, …
35. Common Event Types
• Enter/leave of function/routine/region
– Time stamp, process/thread, function ID
• Send/receive of P2P message (MPI)
– Time stamp, sender, receiver, length, tag, communicator
• Collective communication (MPI)
– Time stamp, process, root, communicator, # bytes
• Hardware performance counter values
– Time stamp, process, counter ID, value
• Etc.
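One way to picture such records in C (a sketch with illustrative field names, not tied to any particular trace format):

#include <stdint.h>

typedef enum {
    EV_ENTER, EV_LEAVE,        /* function enter/leave */
    EV_SEND, EV_RECV,          /* P2P messages */
    EV_COLLECTIVE,             /* collective operations */
    EV_COUNTER                 /* hardware counter sample */
} ev_type_t;

typedef struct {
    uint64_t ts;               /* time stamp: common to all events */
    uint32_t process;          /* process/thread the event belongs to */
    ev_type_t type;
    union {                    /* event-specific information */
        struct { uint32_t func; } region;                  /* enter/leave */
        struct { uint32_t peer, len, tag, comm; } p2p;     /* send/recv */
        struct { uint32_t root, comm, bytes; } coll;       /* collective */
        struct { uint32_t id; uint64_t value; } counter;   /* HW counter */
    } u;
} event_record_t;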
36. Parallel Trace
Definition records:
DEF TIMERRES 1000000000
DEF PROCESS 1 `Master`
DEF PROCESS 2 `Slave`
DEF FUNCTION 5 `main`
DEF FUNCTION 6 `foo`
DEF FUNCTION 9 `bar`
DEF FUNCTION 12 `MPI_Send`
DEF FUNCTION 13 `MPI_Recv`
Event records (sorted by time stamp):
10010 P 1 ENTER 5
10020 P 2 ENTER 5
10090 P 1 ENTER 6
10095 P 2 ENTER 6
10110 P 1 ENTER 12
10110 P 1 SEND TO 3 LEN 1024 ...
10120 P 2 ENTER 13
10300 P 2 RECV FROM 3 LEN 1024 ...
10330 P 1 LEAVE 12
10350 P 2 LEAVE 13
10400 P 1 LEAVE 6
10450 P 2 LEAVE 6
10520 P 1 ENTER 9
10550 P 1 LEAVE 9
10620 P 2 ENTER 9
10650 P 2 LEAVE 9
...
37. Instrumentation
• Instrumentation: process of modifying programs to detect and report events by calling instrumentation functions.
– Instrumentation functions provided by trace library
– Call == notification about run-time event
– There are various ways of instrumentation
38. Source Code Instrumentation
Original code:
int foo(void* arg) {
    if (cond) {
        return 1;
    }
    return 0;
}

Instrumented code:
int foo(void* arg) {
    enter(6);        /* notify trace library: entering function 6 */
    if (cond) {
        leave(6);    /* every exit path must report the leave */
        return 1;
    }
    leave(6);
    return 0;
}
Manually or Automatically
39. Source Code Instrumentation
Manually
– Large effort, error prone
– Difficult to manage
Automatically
– Via source-to-source translation
– Program Database Toolkit (PDT)
http://www.cs.uoregon.edu/research/pdt/
– OpenMP Pragma And Region Instrumentor (Opari)
http://www.fz-juelich.de/zam/kojak/opari/
40. Wrapper Function Instrumentation
• Provide wrapper functions
• Call instrumentation function for notification
• Call original target for functionality
– Via preprocessor directives:
#define MPI_Init WRAPPER_MPI_Init
#define MPI_Send WRAPPER_MPI_Send
– Via library preload:
• Preload an instrumented dynamic library
– Suitable for standard libraries (e.g. MPI, glibc)
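A minimal sketch of the preload approach on a Linux/glibc system, wrapping write() as an example; the call counter stands in for a real trace library's event recording:

#define _GNU_SOURCE                 /* for RTLD_NEXT */
#include <dlfcn.h>
#include <stdio.h>
#include <unistd.h>

static unsigned long write_calls = 0;   /* "instrumentation": count calls */

/* Wrapper with the exact signature of the original glibc function. */
ssize_t write(int fd, const void *buf, size_t count) {
    static ssize_t (*real_write)(int, const void *, size_t) = NULL;
    if (!real_write)                     /* look up the real write() once */
        real_write = (ssize_t (*)(int, const void *, size_t))
                     dlsym(RTLD_NEXT, "write");
    write_calls++;                       /* notification (kept trivially simple) */
    return real_write(fd, buf, count);   /* original functionality */
}

/* Report at program exit. */
__attribute__((destructor))
static void report(void) {
    fprintf(stderr, "write() was called %lu times\n", write_calls);
}

Compiled with gcc -shared -fPIC wrap.c -o libwrap.so -ldl and activated with LD_PRELOAD=./libwrap.so ./app, it intercepts the application's write() calls without modifying the application.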
41. The MPI Profiling Interface
– Each MPI function has two names:
• MPI_xxx and PMPI_xxx
– Selective replacement of MPI routines at link time
user program:    calls MPI_Send
wrapper library: provides MPI_Send (records the event, then calls PMPI_Send)
MPI library:     implements MPI_Send and PMPI_Send (the actual communication)
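A wrapper for MPI_Send might look as follows; this is a sketch, with trace_enter/trace_leave as hypothetical stand-ins for a trace library's notification calls (the const qualifier on buf follows the MPI-3 binding):

#include <mpi.h>
#include <stdio.h>

/* Hypothetical stand-ins for a trace library's notification calls. */
static void trace_enter(const char *name) { fprintf(stderr, "enter %s\n", name); }
static void trace_leave(const char *name) { fprintf(stderr, "leave %s\n", name); }

/* This definition replaces MPI_Send when linked before the MPI library;
   PMPI_Send still provides the original functionality. */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    trace_enter("MPI_Send");                               /* notification */
    int ret = PMPI_Send(buf, count, datatype, dest, tag, comm);
    trace_leave("MPI_Send");
    return ret;
}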
42. Compiler Instrumentation
• gcc -finstrument-functions -c foo.c
void __cyg_profile_func_enter( <args> );
void __cyg_profile_func_exit( <args> );
• Many compilers support instrumentation:
(GCC, Intel, IBM, PGI, NEC, Hitachi, Sun Fortran, …)
• No source modification
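The hook signatures are fixed by GCC: each receives the address of the instrumented function and of its call site. A minimal sketch (the hooks must not be instrumented themselves, or they would recurse):

#include <stdio.h>

void __cyg_profile_func_enter(void *this_fn, void *call_site)
     __attribute__((no_instrument_function));
void __cyg_profile_func_exit(void *this_fn, void *call_site)
     __attribute__((no_instrument_function));

/* Called on entry to every instrumented function. */
void __cyg_profile_func_enter(void *this_fn, void *call_site) {
    fprintf(stderr, "enter %p (from %p)\n", this_fn, call_site);
}

/* Called on exit from every instrumented function. */
void __cyg_profile_func_exit(void *this_fn, void *call_site) {
    fprintf(stderr, "leave %p (from %p)\n", this_fn, call_site);
}

The raw addresses can be translated back to function names with addr2line (or dladdr at run time).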
45. Trace Libraries
• Provide instrumentation functions
• Receive events of various types
• Collect event properties
– Time stamp
– Location (thread, process, cluster node, MPI rank)
– Event specific properties
– Perhaps hardware performance counter values
• Record to memory buffer, flush eventually
• Try to be fast, minimize overhead
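The record-to-buffer, flush-eventually pattern might look like this (a sketch with illustrative names; real trace libraries encode records portably instead of dumping raw structs):

#include <stdio.h>

#define BUF_EVENTS 4096

typedef struct { unsigned long long ts; int process, type, data; } trace_event_t;

static trace_event_t buf[BUF_EVENTS];
static int fill = 0;

/* Write the buffered records out and reset the buffer. */
static void flush_buffer(FILE *f) {
    fwrite(buf, sizeof(trace_event_t), (size_t)fill, f);
    fill = 0;
}

/* Fast path: one store into a preallocated array; flushing to disk
   happens only when the buffer is full (and once at the very end). */
void record_event(FILE *f, trace_event_t ev) {
    if (fill == BUF_EVENTS)
        flush_buffer(f);
    buf[fill++] = ev;
}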
46. Trace Files & Formats
• TAU Trace Format (Univ. of Oregon)
• Epilog (ZAM, FZ Jülich)
• STF (Pallas, now Intel)
• Open Trace Format (OTF)
– ZIH, TU Dresden in coop. with Oregon & Jülich
– Single or multiple files per trace
– Fast sequential and random access
– Including API for writing/reading
– Supports auxiliary information
– See http://www.tu-dresden.de/zih/otf/
48. Other Tools
• TAU profiling (University of Oregon, USA)
– Extensive profiling and tracing for parallel applications, with visualization, comparison, etc.
http://www.cs.uoregon.edu/research/tau/
• Paraver (CEPBA, Barcelona, Spain)
– Trace based parallel performance analysis and visualization
http://www.cepba.upc.edu/paraver/
• Scalasca (FZ Jülich)
– Tracing and automatic detection of performance problems
http://www.scalasca.org
• Intel Trace Collector & Analyzer
– Very similar to Vampir