Performance Analysis
using the Vampir Toolchain
   Robert Henschel (HPA-IU)
   David Cronk (CS-UTK)
   Thomas William (PSW-ZIH)
Overview
Morning Session (Innovation Center, Room 105)
• 09:00 – 10:15 Overview: Event Based Program Analysis
• 10:15 – 10:45 Break
• 10:45 – 11:45 Instrumentation and Runtime Measurement
• 11:45 – 13:00 Lunch break

Afternoon Session
• 13:00 – 13:45 Using PAPI Performance Counters
• 13:45 – 14:00 Break
• 14:00 – 15:00 Trace Visualization
• 15:00 – 15:30 Break
• 15:30 – 18:00 Hands On (Wrubel Computing Center, Building
  WCC, Room 107)
We do have computers in Germany too (although quite old ones)

TU DRESDEN, ZIH, AND HPC
Dresden University of Technology
• Founded in 1828
• One of the oldest technical
  universities in Germany
• 14 faculties and a number of
  specialized institutes
• More than 35000 Students, about
  4000 Employees, 438 professors
• International courses of studies,
  bachelor, masters
• One of the largest faculties for
  computer science in Germany
• 110 million Euro annual third party
  funding
• http://tu-dresden.de
Center for Information Services and High Performance Computing (ZIH)
• Central Scientific Unit at TU
  Dresden
• Competence Center for
  „Parallel Computing and
  Software Tools“
• Strong commitment to
  support real users
• Development of algorithms
  and methods: Cooperation
  with users from all
  departments
• Providing infrastructure and
  qualified service for TU
  Dresden and Saxony
Structure of ZIH
•   Management
     – Director: Prof. Dr. Wolfgang E. Nagel
     – Assistant directors: Dr. Peter Fischer (COO), Dr. Matthias S. Müller (CTO)
•   Administration (7 Employees)

•   Departments (ca. 100 Employees; incl. Trainees)
     – Department of interdisciplinary function support and
       coordination (IAK)
     – Department of networking and communication services (NK)
     – Department of central systems and services (ZSD)
     – Department of innovative methods of computing (IMC)
     – Department of programming and software tool-kits (PSW)
     – Department of distributed and data intensive computing (VDR)
Today's Main HPC Infrastructure

[Diagram: the HPC component (6.5 TB main memory) is connected at 8 GB/s to the HPC-SAN (68 TB hard-disk capacity); the PC farm is connected via two 4 GB/s links to the PC-SAN (68 TB hard-disk capacity); both SANs feed a petabyte tape storage (1 PB capacity) at 1.8 GB/s; installed in 2006.]
Areas of Expertise
• Research topics
   – Architecture and performance analysis of High
     Performance Computers
   – Programming methods and techniques for
     HPC systems
   – Grid Computing
   – Software tools to support programming and
     optimization
   – Modeling algorithms of biological processes
   – Mathematical models, algorithms, and
     efficient implementations
• Role of mediator between vendors,
  developers, and users
• Picking up and preparing new concepts,
  methods, and techniques
• Teaching and Education
Performance Analysis Tools
• The Vampir performance analysis toolkit
   – Vampir: Scalable event trace visualization
   – VampirTrace: Instrumentation and run-time data collection
   – Open Trace Format (OTF): Event trace data format
Performance Analysis Tools


                    Vampir-Team

Ronny Brendel            Matthias Jurenz       Prof. Wolfgang E. Nagel
Jens Doleschal           Dr. Andreas Knüpfer   Michael Peter
Ronald Geisler           Matthias Lieber       Heide Rohling
Daniel Hackenberg        Holger Mickler        Matthias Weber
Robert Henschel          Dr. Hartmut Mix       Thomas William
                         Dr. Matthias Müller




  http://www.tu-dresden.de/zih/ptools
  http://www.vampir.eu
EVENT BASED PROGRAM ANALYSIS
Why performance analysis?
• Moore's law still holds, so no need to tune performance?
• Increasingly difficult to get close to peak performance
   – for sequential computation
      • memory wall
      • optimum pipelining, ...
   – for parallel interaction
      • Amdahl's law
      • synchronization with single late-comer, ...


• Efficiency is important because of limited resources
• Scalability is important to cope with next bigger simulation
Basics about Parallelization
Performance Analysis with Profiling
Instrumentation and Tracing

OVERVIEW
Motivation
• Reasons for parallel programming:
  – Higher Performance
     • Solve the same problem in shorter time
     • Solve larger problems in the same time
  – Higher Capability
     • Solve problems that cannot be solved on a single processor
     • Larger memory on parallel computers
      • Time constraints limit the possible problem size
        (e.g. weather forecast: turnaround within a working day)


• In both cases performance is one of the major
  concerns:
  – Also consider sequential performance within the parallel
    sections
Parallelization Strategies
• General strategy for parallelization:
   – Distribute the work to many workers

   Limitations:
   – Not all tasks can be split into smaller sub-tasks
   – Dependencies between sub-tasks
   – Coordination overhead
   – (same as for human teams)

   Algorithms:
   – Different algorithms for the same problem differ in terms of
     parallelization
   – Different “best” algorithms for serial vs. parallel execution or
     for different parallelization schemes
BASICS ABOUT PARALLELIZATION
Speed-up
• Definition of speed-up S:

        S = Ts / Tp

  Ts: serial execution time
  Tp: parallel execution time with P CPUs

[Chart: speed-up versus number of processors (1 to 8); the ideal speed-up grows linearly with the number of CPUs, while the real speed-up increasingly falls behind.]

Actual speed-up is often lower than the ideal one due to the aforementioned limitations.
Parallel Efficiency
• Alternative definition: parallel efficiency E:

        E = S / P = Ts / (Tp · P)

  Ts: serial execution time
  Tp: parallel execution time with P CPUs

[Chart: parallel efficiency versus number of processors (1 to 8); the ideal efficiency stays at 1, while the real efficiency drops as the number of CPUs grows.]
Amdahl’s law
• Fundamental limit of parallelization:
   – Only a fraction F of the algorithm is parallel, with speed-up Sp
   – A fraction (1 - F) is serial

  Then the maximum resulting speed-up is:

        S = 1 / ((1 - F) + F / Sp)

  and in the limit of arbitrarily many processors (P → ∞):

        S = 1 / (1 - F)

[Chart: maximum speed-up versus number of CPUs (1 to 16) for F = 99%, 95%, 90%, 80%; the smaller F is, the earlier the curve flattens out below the ideal line.]
Amdahl’s law
• If you know your desired speed-up S, you can calculate F:

        F = 1 - 1/S

   – F gives you the fraction of your program that has to be
     executed in parallel in order to achieve a speed-up S
     (asymptotically).
   – In order to estimate the resulting effort you need to know in
     which parts of your program the remaining (1 - F) of the time is
     spent.

   – This is even before considering the actual parallelization method
      • Might add new serial sections
      • Brings coordination overhead
      • Will not scale arbitrarily high, i.e. the time spent in the
        parallel section will stay > 0
Amdahl’s law, example
• Example program with some sub-routines calling
  one another:
   # calls   Time (%)   Accumulated Time (%)   Call
   155648    31.22      31.22                  Calc
   603648    22.24      53.46                  Multiply
   155648    10.05      63.51                  Matmul
   214528     9.33      72.84                  Copy
   603648     7.87      80.71                  Find

  – For a maximum speed-up of 2 one needs to parallelize
    Calc and Multiply.
  – For a maximum speed-up of 5 all need to be
    parallelized!
General Parallelization Strategy
• Therefore, successful parallelization requires:
   – Finding the actual hot-spots of work
   – Sufficient potential for parallelization
   – Parallelization strategy that introduces minimum coordination
     overhead



• There are no general rules! Things that help to achieve
  high performance:
   –   Know your application
   –   Know your compiler
   –   Understand the performance tool
   –   Know the characteristics of the hardware
PERFORMANCE ANALYSIS WITH
PROFILING
Profiling
• Profiling gives an overview about the distribution of
  run time
• Usually on the level of subroutines, also at line-by-line
  level
• Rather low overhead
• Usually good enough to find computation hot spots
• Provides little detail for detecting performance problems
  and their causes

• More sophisticated ways of profiling:
   – Based on hardware performance counters
   – Phase-based profiles
   – Call-path profiles
Profiling
• Profile Recording
  – Of aggregated information (Time, Counts, …)
  – About program and system entities
     • Functions, loops, basic blocks
     • Application, processes, threads, …


• Methods of Profile Creation
  – PC sampling (statistical approach)
  – Direct measurement (deterministic approach)
Profiling with gprof
 – Compile with profiling support
    • Using -pg for GNU, -p -g for Intel
    • Optimization (-O3) might obscure the output somewhat
%> mpicc -p -g -O2 heat-mpi-slow-big.c -o heat-mpi-slow-big


 – Execute normally
    • Used to be only for sequential programs
    • Parallel only with the GMON_OUT_PREFIX trick

%> export GMON_OUT_PREFIX=ggg
%> mpirun -np 4 heat-mpi-slow-big
%> ls
  ggg.11762   ggg.11763   ggg.11764   ggg.11765
Profiling with gprof
 – Pre-process profiling output with gprof:
    • Text output
    • There are also GUI front-ends like
       – pgprof (PGI)
       – kprof (KDE)
 – For a single rank:
%> gprof [-b] heat-mpi-slow-big ggg.11765 | less

 – Combine results for all ranks:
%> gprof -s heat-mpi-slow-big ggg.*
%> gprof [-b] heat-mpi-slow-big gmon.sum | less
Profiling with gprof
 – Flat profile for one of four ranks:
Flat profile:


Each sample counts as 0.01 seconds.
  %     cumulative    self             self     total
 time    seconds     seconds   calls   s/call   s/call   name
100.00        2.08     2.08        1     2.08     2.08   Algorithm
  0.00        2.08     0.00        1     0.00     0.00   CalcBoundaries
  0.00        2.08     0.00        1     0.00     0.00   DistributeNodes

 – Flat profile for all four ranks combined:
Flat profile:


Each sample counts as 0.01 seconds.
  %     cumulative    self             self     total
 time    seconds     seconds   calls   s/call   s/call   name
100.00        8.59     8.59        4     2.15     2.15   Algorithm
  0.00        8.59     0.00        4     0.00     0.00   CalcBoundaries
  0.00        8.59     0.00        4     0.00     0.00   DistributeNodes
Profiling with gprof
– Annotated call graph for one of four ranks:
Call graph
granularity: each sample hit covers 4 byte(s) for 0.48% of 2.08 seconds
index % time    self   children   called     name
                2.08     0.00     1/1             main [2]
[1]    100.0    2.08     0.00      1        Algorithm [1]
-----------------------------------------------
                                            <spontaneous>
[2]    100.0    0.00     2.08                main [2]
                2.08     0.00     1/1             Algorithm [1]
                0.00     0.00     1/1             DistributeNodes [4]
                0.00     0.00     1/1             CalcBoundaries [3]
-----------------------------------------------
                0.00     0.00     1/1             main [2]
[3]      0.0    0.00     0.00      1        CalcBoundaries [3]
-----------------------------------------------
                0.00     0.00     1/1             main [2]
[4]      0.0    0.00     0.00      1        DistributeNodes [4]
-----------------------------------------------
Profiling

• Simple profiling is a good starting point
      • Reveals computational hot spots
      • But hides outlier values in the average


• More details needed for
      • Parallel analysis and identification of performance problems
      • Finding optimization opportunities


• Advanced profiling tools:
      • TAU http://www.cs.uoregon.edu/research/tau/
      • HPCToolkit http://hpctoolkit.org/
INSTRUMENTATION AND TRACING
Event Tracing
• Collect more detailed information for more
  insight
• Do not summarize run-time information
• Collect individual events with properties during
  run-time

• Event Tracing can be used for:
  – Visualization (VampirSuite)
  – Automatic analysis (Scalasca)
  – Debugging or re-play (VampirSuite + Scalasca)
Tracing
• Recording of run-time events (points of interest)
  – During program execution
  – Enter/leave of functions/subroutines
  – Send/receive of messages, synchronization
  – More …
  – Saved as event records
     • Timestamp, process, thread, event type
     • Event specific information
     • Sorted by time stamp
  – Collected via instrumentation & trace library
Profiling vs Tracing
• Tracing Advantages
  – Preserve temporal and spatial relationships (context)
  – Allow reconstruction of dynamic behavior on any
    required abstraction level
  – Profiles can be calculated from trace
• Tracing Disadvantages
  – Traces can become very large
  – May cause perturbation
  – Instrumentation and tracing is complicated
     • Event buffering, clock synchronization, …
Common Event Types
• Enter/leave of function/routine/region
  – Time stamp, process/thread, function ID
• Send/receive of P2P message (MPI)
  – Time stamp, sender, receiver, length, tag, communicator
• Collective communication (MPI)
  – Time stamp, process, root, communicator, # bytes
• Hardware performance counter values
  – Time stamp, process, counter ID, value
• Etc.
Parallel Trace
Definition records (written once per trace):

   DEF TIMERRES 1000000000
   DEF PROCESS 1 `Master`
   DEF PROCESS 2 `Slave`
   DEF FUNCTION 5  `main`
   DEF FUNCTION 6  `foo`
   DEF FUNCTION 9  `bar`
   DEF FUNCTION 12 `MPI_Send`
   DEF FUNCTION 13 `MPI_Recv`

Event records, sorted by time stamp per process:

   Process 1                              Process 2
   10010 P 1 ENTER 5                      10020 P 2 ENTER 5
   10090 P 1 ENTER 6                      10095 P 2 ENTER 6
   10110 P 1 ENTER 12                     10120 P 2 ENTER 13
   10110 P 1 SEND TO 3 LEN 1024 ...       10300 P 2 RECV FROM 3 LEN 1024 ...
   10330 P 1 LEAVE 12                     10350 P 2 LEAVE 13
   10400 P 1 LEAVE 6                      10450 P 2 LEAVE 6
   10520 P 1 ENTER 9                      10620 P 2 ENTER 9
   10550 P 1 LEAVE 9                      10650 P 2 LEAVE 9
   ...                                    ...
Instrumentation
• Instrumentation: Process of modifying programs
  to detect and report events by calling
  instrumentation functions.

  – Instrumentation functions provided by trace library
  – Call == notification about run-time event

  – There are various ways of instrumentation
Source Code Instrumentation

Original:                      Instrumented:

int foo(void* arg) {           int foo(void* arg) {
                                   enter(6);
    if (cond) {                    if (cond) {
                                       leave(6);
        return 1;                      return 1;
    }                              }
                                   leave(6);
    return 0;                      return 0;
}                              }

                Manually or Automatically
Source Code Instrumentation
Manually
  – Large effort, error prone
  – Difficult to manage
Automatically
  – Via source to source translation
  – Program Database Toolkit (PDT)
     http://www.cs.uoregon.edu/research/pdt/
  – OpenMP Pragma And Region Instrumentor (Opari)
     http://www.fz-juelich.de/zam/kojak/opari/
Wrapper Function Instrumentation
• Provide wrapper functions
   – Call instrumentation function for notification
   – Call original target for functionality
• Via preprocessor directives:
        #define MPI_Init WRAPPER_MPI_Init
        #define MPI_Send WRAPPER_MPI_Send
• Via library preload:
   – Preload an instrumented dynamic library
• Suitable for standard libraries (e.g. MPI, glibc)
The MPI Profiling Interface
– Each MPI function has two names:
   • MPI_xxx and PMPI_xxx
– Selective replacement of MPI routines at link time


      user program       calls MPI_Send
            |
      wrapper library    MPI_Send: records the event, then calls PMPI_Send
            |
      MPI library        exports both names, MPI_Send and PMPI_Send
Compiler Instrumentation
• gcc -finstrument-functions -c foo.c

         void __cyg_profile_func_enter( <args> );
         void __cyg_profile_func_exit( <args> );



• Many compilers support instrumentation:
  (GCC, Intel, IBM, PGI, NEC, Hitachi, Sun Fortran, …)
• No source modification
Dynamic Instrumentation
• Modify binary executable in memory
• Insert instrumentation calls
• Very platform/machine dependent, expensive

• DynInst project (http://www.dyninst.org)
  – Common interface
  – Alpha/Tru64, MIPS/IRIX, PowerPC/AIX, Sparc/Solaris,
    x86/Linux+Windows, ia64/Linux
Instrumentation & Trace Overhead

Overhead (ticks) for an empty function call; uninstrumented call: 15 ticks.

              manual   PDT   GCC   DynInst
  dummy           59    60    52       568
  f.addr.        117   117   115       638
  f.symbol       120   121   278       637
  f.id           119   120   219       633
  id+timer       299   300   451       937
Trace Libraries
• Provide instrumentation functions
• Receive events of various types
• Collect event properties
  – Time stamp
  – Location (thread, process, cluster node, MPI rank)
  – Event specific properties
  – Perhaps hardware performance counter values
• Record to memory buffer, flush eventually
• Try to be fast, minimize overhead
Trace Files & Formats
•   TAU Trace Format (Univ. of Oregon)
•   Epilog (ZAM, FZ Jülich)
•   STF (Pallas, now Intel)
•   Open Trace Format (OTF)
    – ZIH, TU Dresden in coop. with Oregon & Jülich
    – Single/multiple files per trace
    – Fast sequential and random access
    – Including API for writing/reading
    – Supports auxiliary information
    – See http://www.tu-dresden.de/zih/otf/
Interoperability
Other Tools
• TAU profiling (University of Oregon, USA)
   – Extensive profiling and tracing for parallel applications, with
     visualization, comparison, etc.
   http://www.cs.uoregon.edu/research/tau/
• Paraver (CEPBA, Barcelona, Spain)
   – Trace based parallel performance analysis and visualization
   http://www.cepba.upc.edu/paraver/
• Scalasca (FZ Jülich)
   – Tracing and automatic detection of performance problems
   http://www.scalasca.org
• Intel Trace Collector & Analyzer
   – Very similar to Vampir

 

Overview: Event Based Program Analysis

  • 1. Performance Analysis using the Vampir Toolchain Robert Henschel (HPA-IU) David Cronk (CS-UTK) Thomas William (PSW-ZIH)
  • 2. Overview Morning Session (Innovation Center, Room 105) • 09:00 – 10:15 Overview: Event Based Program Analysis • 10:15 – 10:45 Break • 10:45 – 11:45 Instrumentation and Runtime Measurement • 11:45 – 13:00 Lunch break Afternoon Session • 13:00 – 13:45 Using PAPI Performance Counters • 13:45 – 14:00 Break • 14:00 – 15:00 Trace Visualization • 15:00 – 15:30 Break • 15:30 – 18:00 Hands On (Wrubel Computing Center, Building WCC, Room 107)
  • 3. We do have computers in Germany too (although quite old ones) TU DRESDEN, ZIH, AND HPC
  • 4. Dresden University of Technology • Founded in 1828 • One of the oldest technical universities in Germany • 14 faculties and a number of specialized institutes • More than 35000 Students, about 4000 Employees, 438 professors • International courses of studies, bachelor, masters • One of the largest faculties for computer science in Germany • 110 million Euro annual third party funding • http://tu-dresden.de
  • 5. Center for Information Services and HPC (ZIH) • Central Scientific Unit at TU Dresden • Competence Center for „Parallel Computing and Software Tools“ • Strong commitment to support real users • Development of algorithms and methods: Cooperation with users from all departments • Providing infrastructure and qualified service for TU Dresden and Saxony
  • 6. Structure of ZIH • Management – Director: Prof. Dr. Wolfgang E. Nagel – Assistant directors: Dr. Peter Fischer (COO), Dr. Matthias S. Müller (CTO) • Administration (7 Employees) • Departments (ca. 100 Employees; incl. Trainees) – Department of interdisciplinary function support and coordination (IAK) – Department of networking and communication services (NK) – Department of central systems and services (ZSD) – Department of innovative methods of computing (IMC) – Department of programming and software tool-kits (PSW) – Department of distributed and data intensive computing (VDR)
  • 7. Today's Main HPC Infrastructure: HPC component with 6.5 TB of main memory and an HPC-SAN with 68 TB of disk capacity; PC farm with a PC-SAN, also with 68 TB of disk capacity; petabyte tape storage with 1 PB capacity, installed in 2006. [diagram: interconnect bandwidths of 8 GB/s, 4 GB/s, and 1.8 GB/s between the components]
  • 8. Areas of Expertise • Research topics – Architecture and performance analysis of High Performance Computers – Programming methods and techniques for HPC systems – Grid Computing – Software tools to support programming and optimization – Modeling algorithms of biological processes – Mathematical models, algorithms, and efficient implementations • Role of mediator between vendors, developers, and users • Pick up and preparation of new concepts, methods, and techniques • Teaching and Education
  • 9. Performance Analysis Tools • The Vampir performance analysis toolkit – Vampir: Scalable event trace visualization – VampirTrace: Instrumentation and run-time data collection – Open Trace Format (OTF): Event trace data format
  • 10. Performance Analysis Tools Vampir-Team Ronny Brendel Matthias Jurenz Prof. Wolfgang E. Nagel Jens Doleschal Dr. Andreas Knüpfer Michael Peter Ronald Geisler Matthias Lieber Heide Rohling Daniel Hackenberg Holger Mickler Matthias Weber Robert Henschel Dr. Hartmut Mix Thomas William Dr. Matthias Müller http://www.tu-dresden.de/zih/ptools http://www.vampir.eu
  • 12. Why performance analysis? • Moore's Law still in charge, no need to tune performance? • Increasingly difficult to get close to peak performance – for sequential computation • memory wall • optimum pipelining, ... – for parallel interaction • Amdahl's law • synchronization with single late-comer, ... • Efficiency is important because of limited resources • Scalability is important to cope with next bigger simulation
  • 13. Basics about Parallelization Performance Analysis with Profiling Instrumentation and Tracing OVERVIEW
  • 14. Motivation • Reasons for parallel programming: – Higher Performance • Solve the same problem in shorter time • Solve larger problems in the same time – Higher Capability • Solve problems that cannot be solved on a single processor • Larger memory on parallel computers • Time constraints limit the possible problem size ( Weather forecast, turn around within working day) • In both cases performance is one of the major concerns: – Also consider sequential performance within the parallel sections
  • 15. Parallelization Strategies • General strategy for parallelization: – Distribute the work to many workers Limitations: – Not all tasks can be split into smaller sub-tasks – Dependencies between sub-tasks – Coordination overhead – (same as for human teams) Algorithms: – Different algorithms for the same problem differ in terms of parallelization – Different “best” algorithms for serial vs. parallel execution or for different parallelization schemes
  • 17. Speed-up • Definition of speed-up: S = Ts / Tp, where Ts is the serial execution time and Tp the parallel execution time with P CPUs. [chart: ideal vs. real speed-up versus number of processors, 1–8 CPUs] Actual speed-up is often lower than the ideal one due to the aforementioned limitations.
  • 18. Parallel Efficiency • Alternative definition: parallel efficiency E = Ts / (P · Tp), where Ts is the serial execution time and Tp the parallel execution time with P CPUs. [chart: ideal vs. real parallel efficiency versus number of processors, 1–8 CPUs]
  • 19. Amdahl’s law • Fundamental limit of parallelization: only a fraction F of the algorithm is parallel with speed-up Sp, and a fraction (1-F) is serial. Then the maximum resulting speed-up is S(P) = 1 / ((1-F) + F/Sp), which approaches 1 / (1-F) as Sp grows. [chart: maximum speed-up versus number of CPUs (1–16) for F = 99%, 95%, 90%, 80%, compared against the ideal]
  • 20. Amdahl’s law • If you know your desired speed-up S you can calculate F: F = 1 - 1/S. – F gives you the percentage of your program that has to be executed in parallel in order to achieve a speed-up S (asymptotically). – In order to estimate the resulting effort you need to know in which parts of your program (1-F) of the time is spent. – This is even before considering the actual parallelization method • Might add new serial sections • Brings coordination overhead • Will not scale arbitrarily high, i.e. the serial section will stay > 0
  • 21. Amdahl’s law, example • Example program with some sub-routines calling one another:
        Call       # calls   Time (%)   Accumulated Time (%)
        Calc        155648     31.22      31.22
        Multiply    603648     22.24      53.46
        Matmul      155648     10.05      63.51
        Copy        214528      9.33      72.84
        Find        603648      7.87      80.71
    – For a maximum speed-up of 2 one needs to parallelize Calc and Multiply. – For a maximum speed-up of 5 all need to be parallelized!
  • 22. General Parallelization Strategy • Therefore, successful parallelization requires: – Finding the actual hot-spots of work – Sufficient potential for parallelization – Parallelization strategy that introduces minimum coordination overhead • There are no general rules! Things that help to achieve high performance: – Know your application – Know your compiler – Understand the performance tool – Know the characteristics of the hardware
  • 24. Profiling • Profiling gives an overview about the distribution of run time • Usually on the level of subroutines, also at line-by-line level • Rather low overhead • Usually good enough to find computation hot spots • Little details to detect performance problems and their causes • More sophisticated ways of profiling: – Based on hardware performance counters – Phase-based profiles – Call-path profiles
  • 25. Profiling • Profile Recording – Of aggregated information (Time, Counts, …) – About program and system entities • Functions, loops, basic blocks • Application, processes, threads, … • Methods of Profile Creation – PC sampling (statistical approach) – Direct measurement (deterministic approach)
  • 26. Profiling with gprof – Compile with profiling support • Using -pg for GNU, -p -g for Intel • Optimization -O3 might obscure the output somewhat %> mpicc -p -g -O2 heat-mpi-slow-big.c -o heat-mpi-slow-big – Execute normally • Used to be only for sequential programs • Parallel only with the GMON_OUT_PREFIX trick %> export GMON_OUT_PREFIX=ggg %> mpirun -np 4 heat-mpi-slow-big %> ls ggg.11762 ggg.11763 ggg.11764 ggg.11765
  • 27. Profiling with gprof – Pre-process profiling output with gprof: • Text output • There are also GUI front-ends like – pgprof (PGI) – kprof (KDE) – For a single rank: %> gprof [–b] heat-mpi-slow-big ggg.11765 | less – Combine results for all ranks: %> gprof -s heat-mpi-slow-big ggg.* %> gprof [–b] heat-mpi-slow-big gmon.sum | less
  • 28. Profiling with gprof – Flat profile for one of four ranks:
    Flat profile:
    Each sample counts as 0.01 seconds.
      %   cumulative   self              self     total
     time   seconds   seconds    calls   s/call   s/call  name
    100.00      2.08     2.08        1     2.08     2.08  Algorithm
      0.00      2.08     0.00        1     0.00     0.00  CalcBoundaries
      0.00      2.08     0.00        1     0.00     0.00  DistributeNodes
    – Flat profile for all four ranks combined:
    Flat profile:
    Each sample counts as 0.01 seconds.
      %   cumulative   self              self     total
     time   seconds   seconds    calls   s/call   s/call  name
    100.00      8.59     8.59        4     2.15     2.15  Algorithm
      0.00      8.59     0.00        4     0.00     0.00  CalcBoundaries
      0.00      8.59     0.00        4     0.00     0.00  DistributeNodes
  • 29. Profiling with gprof – Annotated call graph for one of four ranks:
    Call graph granularity: each sample hit covers 4 byte(s) for 0.48% of 2.08 seconds
    index  % time    self  children    called     name
                     2.08      0.00       1/1         main [2]
    [1]     100.0    2.08      0.00       1         Algorithm [1]
    -----------------------------------------------
                                                     <spontaneous>
    [2]     100.0    0.00      2.08                 main [2]
                     2.08      0.00       1/1         Algorithm [1]
                     0.00      0.00       1/1         DistributeNodes [4]
                     0.00      0.00       1/1         CalcBoundaries [3]
    -----------------------------------------------
                     0.00      0.00       1/1         main [2]
    [3]       0.0    0.00      0.00       1         CalcBoundaries [3]
    -----------------------------------------------
                     0.00      0.00       1/1         main [2]
    [4]       0.0    0.00      0.00       1         DistributeNodes [4]
    -----------------------------------------------
  • 30. Profiling • Simple profiling is a good starting point • Reveals computational hot spots • Hides away outlier values in the average • More details needed for • Parallel analysis and identification of performance problems • Finding optimization opportunities • Advanced profiling tools: • TAU http://www.cs.uoregon.edu/research/tau/ • HPCToolkit http://hpctoolkit.org/
  • 32. Event Tracing • Collect more detailed information for more insight • Do not summarize run-time information • Collect individual events with properties during run-time • Event Tracing can be used for: – Visualization (VampirSuite) – Automatic analysis (Scalasca) – Debugging or for re-play (VampirSuite + Scalasca)
  • 33. Tracing • Recording of run-time events (points of interest) – During program execution – Enter/leave of functions/subroutines – Send/receive of messages, synchronization – More … – Saved as event records • Timestamp, process, thread, event type • Event specific information • Sorted by time stamp – Collected via instrumentation & trace library
  • 34. Profiling vs Tracing • Tracing Advantages – Preserve temporal and spatial relationships (context) – Allow reconstruction of dynamic behavior on any required abstraction level – Profiles can be calculated from trace • Tracing Disadvantages – Traces can become very large – May cause perturbation – Instrumentation and tracing is complicated • Event buffering, clock synchronization, …
  • 35. Common Event Types • Enter/leave of function/routine/region – Time stamp, process/thread, function ID • Send/receive of P2P message (MPI) – Time stamp, sender, receiver, length, tag, communicator • Collective communication (MPI) – Time stamp, process, root, communicator, # bytes • Hardware performance counter values – Time stamp, process, counter ID, value • Etc.
  • 36. Parallel Trace
    Definition records:
    DEF TIMERRES 1000000000
    DEF PROCESS 1 `Master`
    DEF PROCESS 2 `Slave`
    DEF FUNCTION 5 `main`
    DEF FUNCTION 6 `foo`
    DEF FUNCTION 9 `bar`
    DEF FUNCTION 12 `MPI_Send`
    DEF FUNCTION 13 `MPI_Recv`
    Event stream, process 1:
    10010 P 1 ENTER 5
    10090 P 1 ENTER 6
    10110 P 1 ENTER 12
    10110 P 1 SEND TO 3 LEN 1024 ...
    10330 P 1 LEAVE 12
    10400 P 1 LEAVE 6
    10520 P 1 ENTER 9
    10550 P 1 LEAVE 9
    ...
    Event stream, process 2:
    10020 P 2 ENTER 5
    10095 P 2 ENTER 6
    10120 P 2 ENTER 13
    10300 P 2 RECV FROM 3 LEN 1024 ...
    10350 P 2 LEAVE 13
    10450 P 2 LEAVE 6
    10620 P 2 ENTER 9
    10650 P 2 LEAVE 9
    ...
  • 37. Instrumentation • Instrumentation: Process of modifying programs to detect and report events by calling instrumentation functions. – Instrumentation functions provided by trace library – Call == notification about run-time event – There are various ways of instrumentation
  • 38. Source Code Instrumentation
    Original:
    int foo(void* arg){
      if (cond){
        return 1;
      }
      return 0;
    }
    Instrumented:
    int foo(void* arg){
      enter(6);
      if (cond){
        leave(6);
        return 1;
      }
      leave(6);
      return 0;
    }
    Manually or Automatically
  • 39. Source Code Instrumentation Manually – Large effort, error prone – Difficult to manage Automatically – Via source to source translation – Program Database Toolkit (PDT) http://www.cs.uoregon.edu/research/pdt/ – OpenMP Pragma And Region Instrumentor (Opari) http://www.fz-juelich.de/zam/kojak/opari/
  • 40. Wrapper Function Instrumentation • Provide wrapper functions • Call instrumentation function for notification • Call original target for functionality • Via preprocessor directives: #define MPI_Init WRAPPER_MPI_Init #define MPI_Send WRAPPER_MPI_Send – Via library preload: • preload instrumented dynamic library – Suitable for standard libraries (e.g. MPI, glibc)
  • 41. The MPI Profiling Interface – Each MPI function has two names: • MPI_xxx and PMPI_xxx – Selective replacement of MPI routines at link time [diagram: the user program calls MPI_Send; the wrapper library’s MPI_Send intercepts it and calls PMPI_Send, which is the real MPI_Send inside the MPI library]
  • 42. Compiler Instrumentation • gcc -finstrument-functions -c foo.c void __cyg_profile_func_enter( <args> ); void __cyg_profile_func_exit( <args> ); • Many compilers support instrumentation: (GCC, Intel, IBM, PGI, NEC, Hitachi, Sun Fortran, …) • No source modification
  • 43. Dynamic Instrumentation • Modify binary executable in memory • Insert instrumentation calls • Very platform/machine dependent, expensive • DynInst project (http://www.dyninst.org) – Common interface – Alpha/Tru64, MIPS/IRIX, PowerPC/AIX, Sparc/Solaris, x86/Linux+Windows, ia64/Linux
  • 44. Instrumentation & Trace Overhead – Overhead for an empty function call, in ticks (15 ticks without instrumentation):
                  manual   PDT   GCC   DynInst
    dummy             59    60    52       568
    f.addr.          117   117   115       638
    f.symbol         120   121   278       637
    f.id             119   120   219       633
    id+timer         299   300   451       937
  • 45. Trace Libraries • Provide instrumentation functions • Receive events of various types • Collect event properties – Time stamp – Location (thread, process, cluster node, MPI rank) – Event specific properties – Perhaps hardware performance counter values • Record to memory buffer, flush eventually • Try to be fast, minimize overhead
  • 46. Trace Files & Formats • TAU Trace Format (Univ. of Oregon) • Epilog (ZAM, FZ Jülich) • STF (Pallas, now Intel) • Open Trace Format (OTF) – ZIH, TU Dresden in coop. with Oregon & Jülich – Single/multiple files per trace with – Fast sequential and random access – Including API for writing/reading – Supports auxiliary information – See http://www.tu-dresden.de/zih/otf/
  • 48. Other Tools • TAU profiling (University of Oregon, USA) – Extensive profiling and tracing for parallel applications with visualization, comparison, etc. http://www.cs.uoregon.edu/research/tau/ • Paraver (CEPBA, Barcelona, Spain) – Trace based parallel performance analysis and visualization http://www.cepba.upc.edu/paraver/ • Scalasca (FZ Jülich) – Tracing and automatic detection of performance problems http://www.scalasca.org • Intel Trace Collector & Analyzer – Very similar to Vampir