HPCMPUG2011 cray tutorial
- 2. Review of XT6 Architecture
AMD Opteron
Cray Networks
Lustre Basics
Programming Environment
PGI Compiler Basics
The Cray Compiler Environment
Cray Scientific Libraries
Cray Message Passing Toolkit
Cray Performance Analysis Tools
ATP
CCM
Optimizations
CPU
Communication
I/O
- 3. AMD CPU Architecture
Cray Architecture
Lustre Filesystem Basics
- 5. AMD Opteron evolution:
                  2003           2005           2007           2008           2009           2010
                  Opteron(TM)    Opteron(TM)    "Barcelona"    "Shanghai"     "Istanbul"     "Magny-Cours"
Mfg. Process      130nm SOI      90nm SOI       65nm SOI       45nm SOI       45nm SOI       45nm SOI
CPU Core          K8             K8             Greyhound      Greyhound+     Greyhound+     Greyhound+
L2/L3             1MB/0          1MB/0          512kB/2MB      512kB/6MB      512kB/6MB      512kB/12MB
HyperTransport    3x 1.6GT/s     3x 1.6GT/s     3x 2GT/s       3x 4.0GT/s     3x 4.8GT/s     4x 6.4GT/s
Memory            2x DDR1 300    2x DDR1 400    2x DDR2 667    2x DDR2 800    2x DDR2 800    4x DDR3 1333
- 6. (Die diagram: cores numbered 1 through 12)
12 cores: 1.7-2.2 GHz, 105.6 Gflops
8 cores: 1.8-2.4 GHz, 76.8 Gflops
Power (ACP): 80 Watts
Stream: 27.5 GB/s
Cache: 12x 64KB L1, 12x 512KB L2, 12MB L3
- 7. (Die diagram: Core 0 through Core 11, each with a private L2 cache, sharing an L3 cache and the memory controllers, with four HT links off the package)
- 8. A cache line is 64B
Unique L1 and L2 cache attached to each core
L1 cache is 64 kbytes
L2 cache is 512 kbytes
L3 Cache is shared between 6 cores
Cache is a “victim cache”
All loads go to L1 immediately and get evicted down the caches
Hardware prefetcher detects forward and backward strides through
memory
Each core can perform a 128-bit add and a 128-bit multiply per clock cycle
This requires packed SSE instructions
"Stride-one vectorization" (see the sketch below)
6 cores share a “flat” memory
Non-uniform-memory-access (NUMA) beyond a node
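As an illustration (not from the slides), here is the kind of unit-stride loop with independent iterations that a compiler can turn into packed 128-bit SSE adds and multiplies; the file and array names are hypothetical.

    /* Illustration: stride-one, no loop-carried dependence, so the
     * compiler can emit packed (2 x double) SSE add and multiply. */
    #include <stdio.h>

    #define N 1024

    int main(void)
    {
        double a[N], b[N], c[N];

        for (int i = 0; i < N; i++) {   /* initialize operands */
            b[i] = (double)i;
            c[i] = 2.0;
        }

        /* Unit-stride loop: eligible for "stride-one vectorization". */
        for (int i = 0; i < N; i++)
            a[i] = b[i] * c[i] + b[i];

        printf("a[%d] = %f\n", N - 1, a[N - 1]);  /* keep the loop live */
        return 0;
    }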
- 9. Processor        Frequency    Peak (Gflops)    Bandwidth (GB/sec)    Balance (bytes/flop)
Istanbul (XT5)        2.6          62.4             12.8                  0.21
MC-8                  2.0          64               42.6                  0.67
MC-8                  2.3          73.6             42.6                  0.58
MC-8                  2.4          76.8             42.6                  0.55
MC-12                 1.9          91.2             42.6                  0.47
MC-12                 2.1          100.8            42.6                  0.42
MC-12                 2.2          105.6            42.6                  0.40
(Peak = cores x GHz x 4 flops/clock, e.g. 12 x 2.2 x 4 = 105.6 Gflops; balance = bandwidth / peak, e.g. 42.6 / 105.6 = 0.40 bytes/flop.)
- 11. Microkernel on Compute PEs, full-featured Linux on Service PEs
Service PEs specialize by function: Login, Network, System, and I/O PEs make up the service partition (specialized Linux nodes)
The software architecture eliminates OS "jitter" and enables reproducible run times
Large machines boot in under 30 minutes, including the filesystem
- 12. (Diagram: XE6 system with an external login server, boot RAID, 10 GbE, and IB QDR connections)
- 13. XT6 node characteristics:
Number of Cores: 16 or 24 (MC), 32 (IL)
Peak Performance: 153 Gflops/sec with MC-8 (2.4 GHz), 211 Gflops/sec with MC-12 (2.2 GHz)
Memory Size: 32 or 64 GB per node
Memory Bandwidth: 83.5 GB/sec direct connect memory
6.4 GB/sec direct connect HyperTransport
Cray SeaStar2+ interconnect
- 14. (Diagram: two Magny-Cours multi-chip modules; each die carries six Greyhound cores and a 6MB L3 cache, the dies are linked with HT3, each package drives DDR3 channels, and an HT1/HT3 link goes to the interconnect)
2 Multi-Chip Modules, 4 Opteron Dies
8 Channels of DDR3 Bandwidth to 8 DIMMs
24 (or 16) Computational Cores, 24 MB of L3 cache
Dies are fully connected with HT3
The Snoop Filter feature allows the 4-die SMP to scale well
- 15. Without the snoop filter, a streams test shows 25 GB/sec out of a possible 51.2 GB/sec, or 48% of peak bandwidth
- 16. With the snoop filter, a streams test shows 42.3 GB/sec out of a possible 51.2 GB/sec, or 82% of peak bandwidth
This feature will be key for two-socket Magny-Cours nodes, which are architecturally the same
- 17. New compute blade with 8 AMD Magny-Cours processors
Plug-compatible with XT5 cabinets and backplanes
Upgradeable to AMD's "Interlagos" series
XE6 systems ship with the current SIO blade
- 19. Gemini (diagram: two HyperTransport 3 links feed NIC 0 and NIC 1, joined by a Netlink block to a 48-port YARC router)
Supports 2 Nodes per ASIC
168 GB/sec routing capacity
Scales to over 100,000 network endpoints
Link-level reliability and adaptive routing
Advanced resiliency features
Provides global address space
Advanced NIC designed to efficiently support MPI, one-sided MPI, Shmem, UPC, and Coarray Fortran
- 20. Cray Baker node (diagram: 10 12X Gemini channels; each Gemini acts like two nodes on the 3D torus; high-radix YARC router with adaptive routing; 168 GB/sec capacity)
Number of Cores: 16 or 24
Peak Performance: 140 or 210 Gflops/s
Memory Size: 32 or 64 GB per node
Memory Bandwidth: 85 GB/sec
- 21. (Diagram: a module with SeaStar vs. a module with Gemini, shown on the X, Y, Z torus axes)
- 22. (Block diagram: Gemini NIC internals, including the HT3 Cave, FMA, BTE, AMO, CQ, NAT, ORB, NPT, RMT, RAT, and LB ring, with request/response paths out to the router tiles)
FMA (Fast Memory Access)
Mechanism for most MPI transfers
Supports tens of millions of MPI requests per second
BTE (Block Transfer Engine)
Supports asynchronous block transfers between local and remote memory, in either direction
For use for large MPI transfers that happen in the background (see the sketch below)
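A generic MPI illustration of the background-transfer pattern the BTE serves: post a large non-blocking exchange, compute, then wait. Sizes and names are made up; this is plain MPI, not Gemini-specific code.

    #include <mpi.h>
    #include <stdlib.h>

    #define N (1 << 20)                 /* 1M doubles: a "large" message */

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double *sendbuf = malloc(N * sizeof(double));
        double *recvbuf = malloc(N * sizeof(double));
        for (int i = 0; i < N; i++)
            sendbuf[i] = (double)rank;

        int right = (rank + 1) % size;
        int left  = (rank + size - 1) % size;
        MPI_Request req[2];

        /* Post the large transfers, then overlap them with computation. */
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);

        double local = 0.0;
        for (int i = 0; i < N; i++)     /* stand-in for useful work that
                                           runs while the transfer proceeds */
            local += (double)i * 1.0e-9;

        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return (int)(local < 0.0);      /* keep 'local' live */
    }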
- 23. Two Gemini ASICs are
packaged on a pin-compatible
mezzanine card
Topology is a 3-D torus
Each lane of the torus is
composed of 4 Gemini router
“tiles”
Systems with SeaStar
interconnects can be upgraded
by swapping this card
100% of the 48 router tiles on
each Gemini chip are used
- 25. Name    Architecture    Processor                      Network        # Cores    Memory/Core
Jade          XT-4            AMD Budapest (2.1 GHz)         SeaStar 2.1    8584       2GB DDR2-800
Einstein      XT-5            AMD Shanghai (2.4 GHz)         SeaStar 2.1    12827      2GB (some nodes have 4GB/core) DDR2-800
MRAP          XT-5            AMD Barcelona (2.3 GHz)        SeaStar 2.1    10400      4GB DDR2-800
Garnet        XE-6            Magny-Cours 8-core 2.4 GHz     Gemini 1.0     20160      2GB DDR3-1333
Raptor        XE-6            Magny-Cours 8-core 2.4 GHz     Gemini 1.0     43712      2GB DDR3-1333
Chugach       XE-6            Magny-Cours 8-core 2.3 GHz     Gemini 1.0     11648      2GB DDR3-1333
- 29. (Diagram: cabinet airflow, alternating low-velocity and high-velocity airflow regions)
- 30. Cool air is released into the computer room
The hot air stream passes through the evaporator (liquid R134a in, liquid/vapor mixture out) and rejects heat to the R134a via a liquid-vapor phase change (evaporation)
R134a absorbs energy only in the presence of heated air
Phase change is 10x more efficient than pure water cooling
- 31. (Photo: R134a piping, inlet evaporator, and exit evaporators)
- 33. Term       Meaning                    Purpose
MDS              Metadata Server            Manages all file metadata for the filesystem. 1 per FS.
OST              Object Storage Target      The basic "chunk" of data written to disk. Max 160 per file.
OSS              Object Storage Server      Communicates with disks, manages 1 or more OSTs. 1 or more per FS.
Stripe Size      Size of chunks             Controls the size of file chunks stored to OSTs. Can't be changed once the file is written.
Stripe Count     Number of OSTs per file    Controls the parallelism of the file. Can't be changed once the file is written.
- 36. 32 MB per OST (32 MB - 5 GB) and 32 MB transfer size
Unable to take advantage of file system parallelism
Access to multiple disks adds overhead which hurts performance
(Chart: Single Writer Write Performance, Write (MB/s) vs. stripe count from 1 to 160, for 1 MB and 32 MB stripes)
- 37. Single OST, 256 MB file size
Performance can be limited by the process (transfer size) or the file system (stripe size)
(Chart: Single Writer Transfer vs. Stripe Size, Write (MB/s) vs. stripe size from 1 MB to 128 MB, for 32 MB, 8 MB, and 1 MB transfers)
- 38. Use the lfs command, libLUT, or MPI-IO hints to adjust your stripe count and possibly stripe size (see the MPI-IO sketch below)
lfs setstripe -c -1 -s 4M <file or directory>   (160 OSTs, 4MB stripe)
lfs setstripe -c 1 -s 16M <file or directory>   (1 OST, 16M stripe)
export MPICH_MPIIO_HINTS='*:striping_factor=160'
Files inherit striping information from the parent directory; this cannot be changed once the file is written
Set the striping before copying in files
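A minimal sketch of the MPI-IO hints route, assuming the ROMIO hint names striping_factor (stripe count) and striping_unit (stripe size in bytes); the output file name is illustrative. Hints must be supplied when the file is created.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;
        MPI_Info info;

        MPI_Init(&argc, &argv);
        MPI_Info_create(&info);
        MPI_Info_set(info, "striping_factor", "160");    /* stripe count */
        MPI_Info_set(info, "striping_unit", "4194304");  /* 4 MB stripes */

        /* Striping takes effect only at creation, before data is written. */
        MPI_File_open(MPI_COMM_WORLD, "output.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

        MPI_File_close(&fh);
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }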
- 39. Available Compilers
Cray Scientific Libraries
Cray Message Passing Toolkit
- 40. Cray XT/XE supercomputers come with compiler wrappers to simplify
building parallel applications (similar to mpicc/mpif90)
Fortran Compiler: ftn
C Compiler: cc
C++ Compiler: CC
Using these wrappers ensures that your code is built for the compute
nodes and linked against important libraries
Cray MPT (MPI, Shmem, etc.)
Cray LibSci (BLAS, LAPACK, etc.)
…
The underlying compiler is chosen via the PrgEnv-* modules; do not call
the PGI, Cray, etc. compilers directly.
Always load the appropriate xtpe-<arch> module for your machine
Enables proper compiler target
Links optimized math libraries
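A minimal sketch showing the wrappers at work: the program below needs no MPI include paths or libraries on the command line, because the cc wrapper supplies them. The module and file names in the comment follow the usage described above.

    /* Build and run, for example:
     *   % module load PrgEnv-cray     (or PrgEnv-pgi, ...)
     *   % cc -o hello hello.c        (no -lmpi or -I flags needed)
     *   % aprun -n 4 ./hello
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("hello from rank %d\n", rank);
        MPI_Finalize();
        return 0;
    }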
- 41. …from Cray’s Perspective
PGI – Very good Fortran and C, pretty good C++
Good vectorization
Good functional correctness with optimization enabled
Good manual and automatic prefetch capabilities
Very interested in the Linux HPC market, although that is not their only focus
Excellent working relationship with Cray, good bug responsiveness
Pathscale – Good Fortran, C, possibly good C++
Outstanding scalar optimization for loops that do not vectorize
Fortran front end uses an older version of the CCE Fortran front end
OpenMP uses a non-pthreads approach
Scalar benefits will not get as much mileage with longer vectors
Intel – Good Fortran, excellent C and C++ (if you ignore vectorization)
Automatic vectorization capabilities are modest, compared to PGI and CCE
Use of inline assembly is encouraged
Focus is more on best speed for scalar, non-scaling apps
Tuned for Intel architectures, but actually works well for some applications on
AMD
- 42. …from Cray’s Perspective
GNU – so-so Fortran, outstanding C and C++ (if you ignore vectorization)
Obviously the best for gcc compatibility
Scalar optimizer was recently rewritten and is very good
Vectorization capabilities focus mostly on inline assembly
Note the last three releases (4.3, 4.4, and 4.5) have been incompatible
with each other and required recompilation of Fortran modules
CCE – Outstanding Fortran, very good C, and okay C++
Very good vectorization
Very good Fortran language support; only real choice for Coarrays
C support is quite good, with UPC support
Very good scalar optimization and automatic parallelization
Clean implementation of OpenMP 3.0, with tasks
Sole delivery focus is on Linux-based Cray hardware systems
Best bug turnaround time (if it isn’t, let us know!)
Cleanest integration with other Cray tools (performance tools, debuggers,
upcoming productivity tools)
No inline assembly support
- 43. PGI
-fast -Mipa=fast(,safe)
If you can be flexible with precision, also try -Mfprelaxed
Compiler feedback: -Minfo=all -Mneginfo
man pgf90; man pgcc; man pgCC; or pgf90 -help
Cray
<none, turned on by default>
Compiler feedback: -rm (Fortran) -hlist=m (C)
If you know you don’t want OpenMP: -xomp or -Othread0
man crayftn; man craycc ; man crayCC
Pathscale
-Ofast (note: this is a little looser with precision than other compilers)
Compiler feedback: -LNO:simd_verbose=ON
man eko (“Every Known Optimization”)
GNU
-O2 / -O3
Compiler feedback: good luck
man gfortran; man gcc; man g++
Intel
-fast
Compiler feedback:
man ifort; man icc; man iCC
- 45. Traditional (scalar) optimizations are controlled via -O# compiler flags
Default: -O2
More aggressive optimizations (including vectorization) are enabled with
the -fast or -fastsse metaflags
These translate to: -O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline
-Mvect=sse -Mscalarsse -Mcache_align -Mflushz -Mpre
Interprocedural analysis allows the compiler to perform whole-program
optimizations. This is enabled with -Mipa=fast
See man pgf90, man pgcc, or man pgCC for more information about
compiler options.
- 46. Compiler feedback is enabled with -Minfo and -Mneginfo
This can provide valuable information about what optimizations were
or were not done and why.
To debug an optimized code, the -gopt flag will insert debugging
information without disabling optimizations
It’s possible to disable optimizations included with -fast if you believe one
is causing problems
For example: -fast -Mnolre enables -fast and then disables loop-carried
redundancy elimination (-Mlre)
To get more information about any compiler flag, add -help with the
flag in question: pgf90 -help -fast will give more information about
the -fast flag
OpenMP is enabled with the -mp flag
- 47. Some compiler options may affect both performance and accuracy. Lower
accuracy often means higher performance, but the compiler can also be made to enforce accuracy.
-Kieee: All FP math strictly conforms to IEEE 754 (off by default)
-Ktrap: Turns on processor trapping of FP exceptions
-Mdaz: Treat all denormalized numbers as zero
-Mflushz: Set SSE to flush-to-zero (on with -fast)
-Mfprelaxed: Allow the compiler to use relaxed (reduced) precision to
speed up some floating point optimizations
Some other compilers turn this on by default; PGI chooses to favor
accuracy over speed by default.
- 49. Cray has a long tradition of high performance compilers on Cray
platforms (Traditional vector, T3E, X1, X2)
Vectorization
Parallelization
Code transformation
More…
Investigated leveraging an open source compiler called LLVM
First release December 2008
- 50. (Diagram: CCE compiler structure)
Fortran Front End; C & C++ Front End supplied by Edison Design Group, with Cray-developed code for extensions and interface support
Interprocedural Analysis
Compiler Optimization and Parallelization (Cray Inc. compiler technology)
X86 Code Generator (X86 code generation from open-source LLVM, with additional Cray-developed optimizations and interface support); Cray X2 Code Generator
Object File
- 51. Standard conforming languages and programming models
Fortran 2003
UPC & CoArray Fortran
Fully optimized and integrated into the compiler
No preprocessor involved
Target the network appropriately:
GASNet with Portals
DMAPP with Gemini & Aries
Ability and motivation to provide high-quality support for custom
Cray network hardware
Cray technology focused on scientific applications
Takes advantage of Cray’s extensive knowledge of automatic
vectorization
Takes advantage of Cray’s extensive knowledge of automatic
shared memory parallelization
Supplements, rather than replaces, the available compiler
choices
- 52. Make sure it is available
module avail PrgEnv-cray
To access the Cray compiler
module load PrgEnv-cray
To target the various chips
module load xtpe-[barcelona,shanghai,mc8]
Once you have loaded the module, "cc" and "ftn" are the Cray
compilers
Recommend just using default options
Use -rm (Fortran) and -hlist=m (C) to find out what happened
man crayftn
- 53. Excellent Vectorization
Vectorize more loops than other compilers
OpenMP 3.0
Task and Nesting
PGAS: Functional UPC and CAF available today
C++ Support
Automatic Parallelization
Modernized version of Cray X1 streaming capability
Interacts with OMP directives
Cache optimizations
Automatic Blocking
Automatic Management of what stays in cache
Prefetching, Interchange, Fusion, and much more…
- 54. Loop Based Optimizations
Vectorization
OpenMP
Autothreading
Interchange
Pattern Matching
Cache blocking/ non-temporal / prefetching
Fortran 2003 Standard; working on 2008
PGAS (UPC and Co-Array Fortran)
Some performance optimizations available in 7.1
Optimization Feedback: Loopmark
Focus
- 55. Cray compiler supports a full and growing set of directives
and pragmas
!dir$ concurrent
!dir$ ivdep
!dir$ interchange
!dir$ unroll
!dir$ loop_info [max_trips] [cache_na] ... Many more
!dir$ blockable
man directives
man loop_info
- 56. The compiler can generate a filename.lst file.
Contains an annotated listing of your source code, with letters indicating important optimizations
%%% L o o p m a r k   L e g e n d %%%
Primary Loop Type             Modifiers
A - Pattern matched           a - vector atomic memory operation
C - Collapsed                 b - blocked
D - Deleted                   f - fused
E - Cloned                    i - interchanged
I - Inlined                   m - streamed but not partitioned
M - Multithreaded             p - conditional, partial and/or computed
P - Parallel/Tasked           r - unrolled
V - Vectorized                s - shortloop
W - Unwound                   t - array syntax temp used
                              w - unwound
- 57. • ftn -rm … or cc -hlist=m …
29. b-------< do i3=2,n3-1
30. b b-----< do i2=2,n2-1
31. b b Vr--< do i1=1,n1
32. b b Vr u1(i1) = u(i1,i2-1,i3) + u(i1,i2+1,i3)
33. b b Vr > + u(i1,i2,i3-1) + u(i1,i2,i3+1)
34. b b Vr u2(i1) = u(i1,i2-1,i3-1) + u(i1,i2+1,i3-1)
35. b b Vr > + u(i1,i2-1,i3+1) + u(i1,i2+1,i3+1)
36. b b Vr--> enddo
37. b b Vr--< do i1=2,n1-1
38. b b Vr r(i1,i2,i3) = v(i1,i2,i3)
39. b b Vr > - a(0) * u(i1,i2,i3)
40. b b Vr > - a(2) * ( u2(i1) + u1(i1-1) + u1(i1+1) )
41. b b Vr > - a(3) * ( u2(i1-1) + u2(i1+1) )
42. b b Vr--> enddo
43. b b-----> enddo
44. b-------> enddo
- 58. ftn-6289 ftn: VECTOR File = resid.f, Line = 29
A loop starting at line 29 was not vectorized because a recurrence was found on "U1" between lines
32 and 38.
ftn-6049 ftn: SCALAR File = resid.f, Line = 29
A loop starting at line 29 was blocked with block size 4.
ftn-6289 ftn: VECTOR File = resid.f, Line = 30
A loop starting at line 30 was not vectorized because a recurrence was found on "U1" between lines 32
and 38.
ftn-6049 ftn: SCALAR File = resid.f, Line = 30
A loop starting at line 30 was blocked with block size 4.
ftn-6005 ftn: SCALAR File = resid.f, Line = 31
A loop starting at line 31 was unrolled 4 times.
ftn-6204 ftn: VECTOR File = resid.f, Line = 31
A loop starting at line 31 was vectorized.
ftn-6005 ftn: SCALAR File = resid.f, Line = 37
A loop starting at line 37 was unrolled 4 times.
ftn-6204 ftn: VECTOR File = resid.f, Line = 37
A loop starting at line 37 was vectorized.
- 59. -hbyteswapio
Link time option
Applies to all unformatted fortran IO
Assign command
With the PrgEnv-cray module loaded do this:
setenv FILENV assign.txt
assign -N swap_endian g:su
assign -N swap_endian g:du
Can use assign to be more precise
- 60. OpenMP is ON by default
Optimizations controlled by -Othread#
To shut off, use -Othread0, -xomp, or -hnoomp
Autothreading is NOT on by default;
-hautothread to turn on
Modernized version of Cray X1 streaming capability
Interacts with OMP directives
If you do not want to use OpenMP and have OMP directives
in the code, make sure to make a run with OpenMP shut
off at compile time
- 62. Cray have historically played a role in scientific library
development
BLAS3 were largely designed for Crays
Standard libraries were tuned for Cray vector processors
(later COTS)
Cray have always tuned standard libraries for Cray
interconnect
In the 90s, Cray provided many non-standard libraries
Sparse direct, sparse iterative
These days the goal is to remain portable (standard APIs)
whilst providing more performance
Advanced features, tuning knobs, environment variables
- 63. Cray scientific libraries by domain:
FFT: CRAFFT, FFTW, P-CRAFFT
Dense: BLAS, LAPACK, ScaLAPACK, IRT, CASE
Sparse: CASK, PETSc, Trilinos
IRT - Iterative Refinement Toolkit
CASK - Cray Adaptive Sparse Kernels
CRAFFT - Cray Adaptive FFT
CASE - Cray Adaptive Simple Eigensolver
- 64. There are many libsci libraries on the systems
One for each of
Compiler (intel, cray, gnu, pathscale, pgi )
Single thread, multiple thread
Target (istanbul, mc12 )
Best way to use libsci is to ignore all of this
Load the xtpe-module (some sites set this by default)
E.g. module load xtpe-shanghai / xtpe-istanbul / xtpe-mc8
Cray’s drivers will link the library automatically
PETSc, Trilinos, fftw, acml all have their own module
Tip: make sure you have the correct library loaded, e.g.
-Wl,-ydgemm_
- 65. Perhaps you want to link another library such as ACML
This can be done. If the library is provided by Cray, then load
the module. The link will be performed with the libraries in the
correct order.
If the library is not provided by Cray and has no module, add it
to the link line.
Items you add to the explicit link will be in the correct place
Note, to get explicit BLAS from ACML but scalapack from libsci
Load acml module. Explicit calls to BLAS in code resolve
from ACML
BLAS calls from the scalapack code will be resolved from
libsci (no way around this)
- 66. Threading capabilities in previous libsci versions were poor
Used PTHREADS (more explicit affinity etc)
Required explicit linking to a _mp version of libsci
Was a source of concern for some applications that need
hybrid performance and interoperability with OpenMP
LibSci 10.4.2 February 2010
OpenMP-aware LibSci
Allows calling of BLAS inside or outside parallel region
Single library supported (there is still a single thread lib)
Usage – load the xtpe module for your system (mc12)
GOTO_NUM_THREADS outmoded – use OMP_NUM_THREADS
- 67. Allows seamless calling of the BLAS within or without a parallel region (a C sketch follows below)
e.g. with OMP_NUM_THREADS=12:

    call dgemm(…)        ! outside a parallel region: threaded dgemm with 12 threads

    !$OMP PARALLEL DO
    do
       call dgemm(…)     ! inside a parallel region: single-thread dgemm
    end do

Some users are requesting a further layer of parallelism here (see later)
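A C rendering of the same behavior, as a sketch: it assumes libsci's Fortran BLAS entry point dgemm_ (linked automatically by the cc wrapper) and gives each in-region call its own output block so the calls don't race.

    #include <stddef.h>

    /* Fortran BLAS prototype; the wrappers link libsci automatically. */
    void dgemm_(const char *transa, const char *transb,
                const int *m, const int *n, const int *k,
                const double *alpha, const double *a, const int *lda,
                const double *b, const int *ldb,
                const double *beta, double *c, const int *ldc);

    /* Outside a parallel region: with OMP_NUM_THREADS=12 this call
     * runs as a threaded dgemm on 12 threads. */
    void outside(int n, const double *a, const double *b, double *c)
    {
        const double one = 1.0, zero = 0.0;
        dgemm_("N", "N", &n, &n, &n, &one, a, &n, b, &n, &zero, c, &n);
    }

    /* Inside a parallel region: each call runs single-threaded.  Here c
     * must hold 4 independent n x n blocks to avoid races. */
    void inside(int n, const double *a, const double *b, double *c)
    {
        const double one = 1.0, zero = 0.0;
    #pragma omp parallel for
        for (int i = 0; i < 4; i++)
            dgemm_("N", "N", &n, &n, &n, &one, a, &n, b, &n,
                   &zero, c + (size_t)i * n * n, &n);
    }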
- 68. (Chart: Libsci DGEMM efficiency, GFLOPs vs. square matrix dimension, for 1, 3, 6, 9, and 12 threads)
- 69. (Chart: Libsci-10.5.2 performance on 2 x MC12 2.0 GHz (Cray XE6), GFLOPS vs. number of threads from 1 to 24, for K = 64, 128, 200, 228, 256, 300, 400, 500, 600, 700, 800)
- 70. All BLAS libraries are optimized for the rank-k update (tall x wide = square)
However, a huge % of dgemm usage is not from solvers but from explicit calls with differently shaped matrices; e.g. DCA++ matrices are of this form
How can we very easily provide an optimization for these types of matrices?
- 71. Cray BLAS existed on every Cray machine between Cray-2 and Cray
X2
Cray XT line did not include Cray BLAS
Cray’s expertise was in vector processors
GotoBLAS was the best performing x86 BLAS
LibGoto is now discontinued
In Q3 2011 LibSci will be released with Cray BLAS
- 72. 1. Customers require more OpenMP features unobtainable with the current library
2. Customers require more adaptive performance for unusual problems, e.g. DCA++
3. Interlagos / Bulldozer is a dramatic shift in ISA/architecture/performance
4. Our auto-tuning framework has advanced to the point that we can tackle this problem (good BLAS is easy, excellent BLAS is very hard)
5. Need for bit-reproducible BLAS at high performance
- 73. "Anything that can be represented in C, Fortran or ASM code can be generated automatically by one instance of an abstract operator in high-level code"
In other words, if we can create a purely general model of matrix multiplication, and create every instance of it, then at least one of the generated schemes will perform well
- 74. Start with a completely general formulation of the BLAS
Use a DSL that expresses every important optimization
Auto-generate every combination of orderings, buffering, and
optimization
For every combination of the above, sweep all possible sizes
For a given input set ( M, N, K, datatype, alpha, beta ) map the
best dgemm routine to the input
The current library should be a specific instance of the above
Worst-case performance can be no worse than current library
The lowest level of blocking is a hand-written assembly kernel
- 75. (Chart: GFLOPS for the auto-generated "bframe" kernels vs. libsci across matrix dimensions from 2 to 143, both in the 7.05-7.5 GFLOPS range)
- 76. New optimizations for Gemini network in the ScaLAPACK LU and Cholesky
routines
1. Change the default broadcast topology to match the Gemini network
2. Give tools to allow the topology to be changed by the user
3. Give guidance on how grid-shape can affect the performance
- 77. Parallel version of LAPACK GETRF
Panel factorization
Only a single column block is involved
The rest of the PEs are waiting
Trailing matrix update
Major part of the computation
Column-wise broadcast (blocking)
Row-wise broadcast (asynchronous)
Data is packed before sending using PBLAS
Broadcast uses the BLACS library
These broadcasts are the major communication patterns
- 78. MPI default
Binomial tree + node-aware broadcast
All PEs make an implicit barrier to ensure completion
Not suitable for rank-k update
Bidirectional-ring broadcast
The root PE makes 2 MPI Send calls, one in each direction
The immediate neighbor finishes first
ScaLAPACK's default
Better than MPI
- 79. Increasing Ring Broadcast (our new default)
Root makes a single MPI call to the immediate neighbor
Pipelining
Better than bidirectional ring
The immediate neighbor finishes first
Multi-Ring Broadcast (2, 4, 8 etc)
The immediate neighbor finishes first
The root PE sends to multiple sub-rings
Can be done with tree algorithm
2 rings seems the best for row-wise broadcast of LU
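For illustration only, here is the pipelined increasing-ring idea expressed in plain MPI point-to-point calls; ScaLAPACK's actual implementation lives in BLACS and is selected via the environment variables described on the following slides.

    #include <mpi.h>

    /* The root sends once to its immediate neighbor; every other rank
     * receives from its predecessor and forwards to its successor, so
     * messages flow around the ring in a pipeline and the root's
     * immediate neighbor finishes first. */
    void ring_bcast(double *buf, int count, int root, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        int next = (rank + 1) % size;
        int prev = (rank + size - 1) % size;

        if (rank != root)
            MPI_Recv(buf, count, MPI_DOUBLE, prev, 0, comm,
                     MPI_STATUS_IGNORE);

        /* Stop before the message would wrap back around to the root. */
        if (next != root)
            MPI_Send(buf, count, MPI_DOUBLE, next, 0, comm);
    }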
- 80. Hypercube
Behaves like MPI default
Too many collisions in the message traffic
Decreasing Ring
The immediate neighbor finishes last
No benefit in LU
Modified Increasing Ring
Best performance in HPL
As good as increasing ring
- 81. (Chart: XDLU performance, 3072 cores, size=65536; Gflops for SRING vs. IRING broadcasts across NB / P / Q combinations, roughly 3000-10000 Gflops)
- 82. (Chart: XDLU performance, 6144 cores, size=65536; Gflops for SRING vs. IRING broadcasts across NB / P / Q combinations, up to roughly 14000 Gflops)
- 83. A row-major process grid puts adjacent PEs in the same row
Adjacent PEs are most probably located in the same node (diagram: Node 0, Node 1, Node 2)
In flat MPI, 16 or 24 PEs are in the same node
In hybrid mode, several are in the same node
Most MPI sends in the I-ring happen in the same node
MPI has a good shared-memory device
Good pipelining
- 84. These environment variables let users choose the broadcast algorithm:
For PxGETRF: SCALAPACK_LU_CBCAST, SCALAPACK_LU_RBCAST
For PxPOTRF: SCALAPACK_LLT_CBCAST, SCALAPCK_LLT_RBCAST, SCALAPACK_UTU_CBCAST, SCALAPACK_UTU_RBCAST
Values:
IRING   increasing ring (default value)
DRING   decreasing ring
SRING   split ring (old default value)
MRING   multi-ring
HYPR    hypercube
MPI     mpi_bcast
TREE    tree
FULL    fully connected
There is also a set function, allowing the user to change these on the fly
- 85. Grid shape / size
A square grid is most common, but square grids are not often the best
Try to use Q = x * P grids, where x = 2, 4, 6, 8
Blocksize
Unlike HPL, fine-tuning is not important; 64 is usually the best
Ordering
Try using column-major ordering; it can be better
BCAST
The new default will be a huge improvement if you can make your grid the right way. If you cannot, play with the environment variables.
- 87. Full MPI2 support (except process spawning) based on ANL MPICH2
Cray used the MPICH2 Nemesis layer for Gemini
Cray-tuned collectives
Cray-tuned ROMIO for MPI-IO
Current Release: 5.3.0 (MPICH 1.3.1)
Improved MPI_Allreduce and MPI_Alltoallv
Initial support for checkpoint/restart for MPI or Cray SHMEM on XE
systems
Improved support for MPI thread safety.
module load xt-mpich2
Tuned SHMEM library
module load xt-shmem
- 88. (Chart: MPI_Alltoall with 10,000 processes, comparing original vs. optimized algorithms on Cray XE6 systems; time in microseconds vs. message size from 256 to 32768 bytes)
- 89. MPI_Allgather and MPI_Allgatherv algorithms optimized for Cray XE6
(Chart: 8-byte MPI_Allgather and MPI_Allgatherv scaling, original vs. optimized algorithms on Cray XE6 systems; microseconds vs. number of processes from 1024 to 32768)
- 90. Default is 8192 bytes
Maximum size message that can go through the eager protocol
May help for apps that send medium-size messages and do better when loosely coupled. Does the application spend a large amount of time in MPI_Waitall? Setting this environment variable higher may help.
Max value is 131072 bytes
Remember that for this path it helps to pre-post receives if possible
Note that a 40-byte CH3 header is included when accounting for the message size
- 91. Default is 64 32K buffers (2M total)
Controls the number of 32K DMA buffers available for each rank to use in the eager protocol described earlier
May help to modestly increase, but other resources constrain the usability of a large number of buffers
- 93. What do I mean by PGAS?
Partitioned Global Address Space
UPC
CoArray Fortran ( Fortran 2008 )
SHMEM (I will count as PGAS for convenience)
SHMEM: Library based
Not part of any language standard
Compiler independent
The compiler has no knowledge that it is compiling a PGAS code and
does nothing different, i.e. no transformations or optimizations
- 94. UPC
Specification that extends the ISO/IEC 9899 standard for C
Integrated into the language
Heavily compiler dependent
Compiler intimately involved in detecting and executing remote
references
Flexible, but filled with challenges like pointers, a lack of true
multidimensional arrays, and many options for distributing data
Fortran 2008
Now incorporates coarrays
Compiler dependent
Philosophically different from UPC
Replication of arrays on every image with “easy and obvious” ways
to access those remote locations.
- 96. Translate the UPC source code into hardware executable operations that
produce the proper behavior, as defined by the specification
Storing to a remote location?
Loading from a remote location?
When does the transfer need to be complete?
Are there any dependencies between this transfer and anything else?
No ordering guarantees are provided by the network; the compiler is
responsible for making sure everything gets to its destination in the
correct order.
- 97. for ( i = 0; i < ELEMS_PER_THREAD; i+=1 ) {
local_data[i] += global_2d[i][target];
}
for ( i = 0; i < ELEMS_PER_THREAD; i+=1 ) {
temp = pgas_get(&global_2d[i]); // Initiate the get
pgas_fence(); // makes sure the get is complete
local_data[i] += temp; // Use the local location to complete the operation
}
The compiler must
Recognize you are referencing a shared location
Initiate the load of the remote data
Make sure the transfer has completed
Proceed with the calculation
Repeat for all iterations of the loop
- 98. for ( i = 0; i < ELEMS_PER_THREAD; i+=1 ) {
temp = pgas_get(&global_2d[i]); // Initiate the get
pgas_fence(); // makes sure the get is complete
local_data[i] += temp; // Use the local location to complete the operation
}
Simple translation results in
Single word references
Lots of fences
Little to no latency hiding
No use of special hardware
Nothing here says “fast”
- 99. Want the compiler to generate code that will run as fast as possible given
what the user has written, or allow the user to get fast performance with
simple modifications.
Increase message size
Do multi / many word transfers whenever possible, not single word.
Minimize fences
Delay fence “as much as possible”
Eliminate the fence in some circumstances
Use the appropriate hardware
Use on-node hardware for on-node transfers
Use transfer mechanism appropriate for this message size
Overlap communication and computation
Use hardware atomic functions where appropriate
- 100. Primary Loop Type            Modifiers
A - Pattern matched                 a - atomic memory operation
C - Collapsed                       b - blocked
D - Deleted                         c - conditional and/or computed
E - Cloned                          f - fused
G - Accelerated                     g - partitioned
I - Inlined                         i - interchanged
M - Multithreaded                   m - partitioned
V - Vectorized                      n - non-blocking remote transfer
                                    p - partial
                                    r - unrolled
                                    s - shortloop
                                    w - unwound
- 102. 15. shared long global_1d[MAX_ELEMS_PER_THREAD * THREADS];
…
83. 1 before = upc_ticks_now();
84. 1 r8------< for ( i = 0, j = target; i < ELEMS_PER_THREAD ;
85. 1 r8 i += 1, j += THREADS ) {
86. 1 r8 n local_data[i]= global_1d[j];
87. 1 r8------> }
88. 1 after = upc_ticks_now();
1D get BW= 0.027598 Gbytes/s
- 103. 15. shared long global_1d[MAX_ELEMS_PER_THREAD * THREADS];
…
101. 1 before = upc_ticks_now();
102. 1 upc_memget(&local_data[0],&global_1d[target],8*ELEMS_PER_THREAD);
103. 1
104. 1 after = upc_ticks_now();
1D get BW= 0.027598 Gbytes/s
1D upc_memget BW= 4.972960 Gbytes/s
upc_memget is about 180 times faster!!
- 104. 16. shared long global_2d[MAX_ELEMS_PER_THREAD][THREADS];
…
121. 1 A-------< for ( i = 0; i < ELEMS_PER_THREAD; i+=1) {
122. 1 A local_data[i] = global_2d[i][target];
123. 1 A-------> }
1D get BW= 0.027598 Gbytes/s
1D upc_memget BW= 4.972960 Gbytes/s
2D get time BW= 4.905653 Gbytes/s
Pattern matching can give you the same
performance as if using upc_memget
- 106. PGAS data references made by the single statement immediately following the pgas defer_sync directive will not be synchronized until the next fence instruction
Only applies to the next UPC/CAF statement
Does not apply to upc "routines"
Does not apply to shmem routines
Normally the compiler synchronizes the references in a statement as late as possible without violating program semantics. The purpose of the defer_sync directive is to synchronize the references even later, beyond where the compiler can determine it is safe.
Extremely powerful!
Can easily overlap communication and computation with this statement (see the sketch below)
Can apply to both "gets" and "puts"
Can be used to implement a variety of "tricks". Use your imagination!
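A hedged UPC sketch of the overlap pattern: the pragma spelling is assumed from the directive name above, and the array and variable names are illustrative.

    #include <upc.h>

    shared double global_data[1024 * THREADS];

    double fetch_and_work(int remote_idx)
    {
        double local_copy;
        double other_work = 0.0;

        /* The get in the next statement is not synchronized until the
         * next fence, so it proceeds in the background. */
    #pragma pgas defer_sync
        local_copy = global_data[remote_idx];

        for (int i = 0; i < 1024; i++)   /* independent computation */
            other_work += (double)i;

        upc_fence;                       /* transfer complete after this */
        return other_work + local_copy;  /* safe to use the fetched value */
    }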
- 108. Future system basic characteristics:
Many-core, hybrid multi-core computing
Increase in on-node concurrency
10s-100s of cores sharing memory
With or without a companion accelerator
Vector hardware at the low level
Impact on applications:
Restructure / evolve applications while using existing programming
models to take advantage of increased concurrency
Expand on use of mixed-mode programming models (MPI + OpenMP +
accelerated kernels, etc.)
- 109. Focus on automation (simplify tool usage, provide feedback based on
analysis)
Enhance support for multiple programming models within a program (MPI,
PGAS, OpenMP, SHMEM)
Scaling (larger jobs, more data, better tool response)
New processors and interconnects
Extend performance tools to include pre-runtime optimization information
from the Cray compiler
- 110. New predefined wrappers (ADIOS, ARMCI, PETSc, PGAS libraries)
More UPC and Co-array Fortran support
Support for non-record locking file systems
Support for applications built with shared libraries
Support for Chapel programs
pat_report tables available in Cray Apprentice2
- 111. Enhanced PGAS support is available in perftools 5.1.3 and later
Profiles of a PGAS program can be created to show:
Top time consuming functions/line numbers in the code
Load imbalance information
Performance statistics attributed to user source by default
Can expose statistics by library as well
To see underlying operations, such as wait time on barriers
Data collection is based on methods used for MPI library
PGAS data is collected by default when using Automatic Profiling Analysis
(pat_build -O apa)
Predefined wrappers for runtime libraries (caf, upc, pgas) enable attribution of
samples or time to user source
UPC and SHMEM heap tracking coming in subsequent release
-g heap will track shared heap in addition to local heap
- 112. Table 1: Profile by Function
Samp % | Samp | Imb. | Imb. |Group
| | Samp | Samp % | Function
| | | | PE='HIDE'
100.0% | 48 | -- | -- |Total
|------------------------------------------
| 95.8% | 46 | -- | -- |USER
||-----------------------------------------
|| 83.3% | 40 | 1.00 | 3.3% |all2all
|| 6.2% | 3 | 0.50 | 22.2% |do_cksum
|| 2.1% | 1 | 1.00 | 66.7% |do_all2all
|| 2.1% | 1 | 0.50 | 66.7% |mpp_accum_long
|| 2.1% | 1 | 0.50 | 66.7% |mpp_alloc
||=========================================
| 4.2% | 2 | -- | -- |ETC
||-----------------------------------------
|| 4.2% | 2 | 0.50 | 33.3% |bzero
|==========================================
- 113. Table 2: Profile by Group, Function, and Line
Samp % | Samp | Imb. | Imb. |Group
| | Samp | Samp % | Function
| | | | Source
| | | | Line
| | | | PE='HIDE'
100.0% | 48 | -- | -- |Total
|--------------------------------------------
| 95.8% | 46 | -- | -- |USER
||-------------------------------------------
|| 83.3% | 40 | -- | -- |all2all
3| | | | | mpp_bench.c
4| | | | | line.298
|| 6.2% | 3 | -- | -- |do_cksum
3| | | | | mpp_bench.c
||||-----------------------------------------
4||| 2.1% | 1 | 0.25 | 33.3% |line.315
4||| 4.2% | 2 | 0.25 | 16.7% |line.316
||||=========================================
- 114. Table 1: Profile by Function and Callers, with Line Numbers
Samp % | Samp |Group
| | Function
| | Caller
| | PE='HIDE'
100.0% | 47 |Total
|---------------------------
| 93.6% | 44 |ETC
||--------------------------
|| 85.1% | 40 |upc_memput
3| | | all2all:mpp_bench.c:line.298
4| | | do_all2all:mpp_bench.c:line.348
5| | | main:test_all2all.c:line.70
|| 4.3% | 2 |bzero
3| | | (N/A):(N/A):line.0
|| 2.1% | 1 |upc_all_alloc
3| | | mpp_alloc:mpp_bench.c:line.143
4| | | main:test_all2all.c:line.25
|| 2.1% | 1 |upc_all_reduceUL
3| | | mpp_accum_long:mpp_bench.c:line.185
4| | | do_cksum:mpp_bench.c:line.317
5| | | do_all2all:mpp_bench.c:line.341
6| | | main:test_all2all.c:line.70
||==========================
- 115. Table 1: Profile by Function and Callers, with Line Numbers
Time % | Time | Calls |Group
| | | Function
| | | Caller
| | | PE='HIDE'
100.0% | 0.795844 | 73904.0 |Total
|-----------------------------------------
| 78.9% | 0.628058 | 41121.8 |PGAS
||----------------------------------------
|| 76.1% | 0.605945 | 32768.0 |__pgas_put
3| | | | all2all:mpp_bench.c:line.298
4| | | | do_all2all:mpp_bench.c:line.348
5| | | | main:test_all2all.c:line.70
|| 1.5% | 0.012113 | 10.0 |__pgas_barrier
3| | | | (N/A):(N/A):line.0
…
- 116. …
||========================================
| 15.7% | 0.125006 | 3.0 |USER
||----------------------------------------
|| 12.2% | 0.097125 | 1.0 |do_all2all
3| | | | main:test_all2all.c:line.70
|| 3.5% | 0.027668 | 1.0 |main
3| | | | (N/A):(N/A):line.0
||========================================
| 5.4% | 0.042777 | 32777.2 |UPC
||----------------------------------------
|| 5.3% | 0.042321 | 32768.0 |upc_memput
3| | | | all2all:mpp_bench.c:line.298
4| | | | do_all2all:mpp_bench.c:line.348
5| | | | main:test_all2all.c:line.70
|=========================================
- 117. (Screenshot: the new text table icon in Cray Apprentice2; right-click for table generation options)
- 119. Scalability
New .ap2 data format and client / server model
Reduced pat_report processing and report generation times
Reduced app2 data load times
Graphical presentation handled locally (not passed through ssh
connection)
Better tool responsiveness
Minimizes data loaded into memory at any given time
Reduced server footprint on Cray XT/XE service node
Larger jobs supported
Distributed Cray Apprentice2 (app2) client for Linux
app2 client for Mac and Windows laptops coming later this year
- 120. CPMD: MPI, instrumented with pat_build -u, HWPC=1, 960 cores
                  Perftools 5.1.3    Perftools 5.2.0
.xf -> .ap2       88.5 seconds       22.9 seconds
ap2 -> report     1512.27 seconds    49.6 seconds

VASP: MPI, instrumented with pat_build -gmpi -u, HWPC=3, 768 cores
                  Perftools 5.1.3    Perftools 5.2.0
.xf -> .ap2       45.2 seconds       15.9 seconds
ap2 -> report     796.9 seconds      28.0 seconds
- 121. From a Linux desktop:
% module load perftools
% app2
% app2 kaibab:
% app2 kaibab:/lus/scratch/heidi/swim+pat+10302-0t.ap2
(':' signifies a remote host instead of an ap2 file)
Or use File->Open Remote…
- 122. Optional app2 client for Linux desktop available as of 5.2.0
Can still run app2 from Cray service node
Improves response times as X11 traffic is no longer passed through the ssh
connection
Replaces 32-bit Linux desktop version of Cray Apprentice2
Uses libssh to establish connection
app2 clients for Windows and Mac coming in subsequent release
- 123. (Diagram: Linux desktop with X Window System; all data from my_program.ap2 passes over the X11 protocol from the Cray XT login node running app2, with performance data collected from the compute nodes into my_program.ap2 / my_program+apa)
Log into a Cray XT/XE login node
% ssh -Y seal
Launch Cray Apprentice2 on the Cray XT/XE login node
% app2 /lus/scratch/mydir/my_program.ap2
User interface displayed on the desktop via ssh trusted X11 forwarding
The entire my_program.ap2 file is loaded into memory on the XT login node (can be Gbytes of data)
- 124. (Diagram: the app2 client on the Linux desktop requests data from the app2 server on the Cray XT login node; only user-requested data from my_program.ap2 crosses the connection)
Launch Cray Apprentice2 on the desktop, pointing at the data
% app2 seal:/lus/scratch/mydir/my_program.ap2
User interface displayed on the desktop via X Windows-based software
A minimal subset of data from my_program.ap2 is loaded into memory on the Cray XT/XE service node at any given time
Only the data requested is sent from server to client
- 126. Major change to the way HW counters are collected, starting with CrayPat 5.2.1 and CLE 4.0 (in conjunction with Interlagos support)
Linux has officially incorporated support for accessing counters through the perf_events subsystem. Until this, Linux kernels had to be patched to add support for perfmon2, which provided access to the counters for PAPI and for CrayPat.
Seamless to users, except:
Overhead incurred when accessing counters has increased
Creates additional application perturbation
Working to bring this back in line with perfmon2 overhead
- 127. When possible, CrayPat will identify dominant communication grids
(communication patterns) in a program
Example: nearest neighbor exchange in 2 or 3 dimensions
Sweep3d uses a 2-D grid for communication
Determine whether or not a custom MPI rank order will produce a
significant performance benefit
Custom rank orders are helpful for programs with significant point-to-point
communication
Doesn’t interfere with MPI collective communication optimizations
- 128. Focuses on intra-node communication (place ranks that communicate
frequently on the same node, or close by)
Option to focus on other metrics, such as memory bandwidth
Determine rank order used during run that produced data
Determine grid that defines the communication
Produce a custom rank order if it’s beneficial based on grid size, grid order
and cost metric
Summarize findings in report
Describe how to re-run with custom rank order
- 129. For Sweep3d with 768 MPI ranks:
This application uses point-to-point MPI communication between nearest
neighbors in a 32 X 24 grid pattern. Time spent in this communication
accounted for over 50% of the execution time. A significant fraction (but
not more than 60%) of this time could potentially be saved by using the
rank order in the file MPICH_RANK_ORDER.g which was generated along
with this report.
To re-run with a custom rank order …
- 130. Assist the user with application performance analysis and optimization
Help user identify important and meaningful information from
potentially massive data sets
Help user identify problem areas instead of just reporting data
Bring optimization knowledge to a wider set of users
Focus on ease of use and intuitive user interfaces
Automatic program instrumentation
Automatic analysis
Target scalability issues in all areas of tool development
Data management
Storage, movement, presentation
- 131. Supports traditional post-mortem performance analysis
Automatic identification of performance problems
Indication of causes of problems
Suggestions of modifications for performance improvement
CrayPat
pat_build: automatic instrumentation (no source code changes needed)
run-time library for measurements (transparent to the user)
pat_report for performance analysis reports
pat_help: online help utility
Cray Apprentice2
Graphical performance analysis and visualization tool
- 132. CrayPat
Instrumentation of optimized code
No source code modification required
Data collection transparent to the user
Text-based performance reports
Derived metrics
Performance analysis
Cray Apprentice2
Performance data visualization tool
Call tree view
Source code mappings
- 133. When performance measurement is triggered
External agent (asynchronous)
Sampling
Timer interrupt
Hardware counters overflow
Internal agent (synchronous)
Code instrumentation
Event based
Automatic or manual instrumentation
How performance data is recorded
Profile ::= Summation of events over time
run time summarization (functions, call sites, loops, …)
Trace file ::= Sequence of events over time
- 134. Millions of lines of code
Automatic profiling analysis
Identifies top time consuming routines
Automatically creates instrumentation template customized to your
application
Lots of processes/threads
Load imbalance analysis
Identifies computational code regions and synchronization calls that
could benefit most from load balance optimization
Estimates savings if corresponding section of code were balanced
Long running applications
Detection of outliers
- 135. Important performance statistics:
Top time consuming routines
Load balance across computing resources
Communication overhead
Cache utilization
FLOPS
Vectorization (SSE instructions)
Ratio of computation versus communication
- 136. No source code or makefile modification required
Automatic instrumentation at group (function) level
Groups: mpi, io, heap, math SW, …
Performs link-time instrumentation
Requires object files
Instruments optimized code
Generates stand-alone instrumented program
Preserves original binary
Supports sample-based and event-based instrumentation
- 137. Analyze the performance data and direct the user to meaningful
information
Simplifies the procedure to instrument and collect performance data for
novice users
Based on a two phase mechanism
1. Automatically detects the most time consuming functions in the
application and feeds this information back to the tool for further
(and focused) data collection
2. Provides performance information on the most significant parts of the
application
- 138. Performs data conversion
Combines information from binary with raw performance
data
Performs analysis on data
Generates text report of performance results
Formats data for input into Cray Apprentice2
- 139. Craypat / Cray Apprentice2 5.0 released September 10, 2009
New internal data format
FAQ
Grid placement support
Better caller information (ETC group in pat_report)
Support larger numbers of processors
Client/server version of Cray Apprentice2
Panel help in Cray Apprentice2
- 140. Access performance tools software
% module load perftools
Build application keeping .o files (CCE: -h keepfiles)
% make clean
% make
Instrument application for automatic profiling analysis
You should get an instrumented program a.out+pat
% pat_build -O apa a.out
Run application to get top time consuming routines
You should get a performance file (“<sdatafile>.xf”) or
multiple files in a directory <sdatadir>
% aprun … a.out+pat (or qsub <pat script>)
- 141. Generate report and .apa instrumentation file
% pat_report -o my_sampling_report [<sdatafile>.xf | <sdatadir>]
Inspect .apa file and sampling report
Verify if additional instrumentation is needed
- 142. # You can edit this file, if desired, and use it
# to reinstrument the program for tracing like this:
#
#   pat_build -O mhd3d.Oapa.x+4125-401sdt.apa
#
# These suggested trace options are based on data from:
#
#   /home/crayadm/ldr/mhd3d/run/mhd3d.Oapa.x+4125-401sdt.ap2,
#   /home/crayadm/ldr/mhd3d/run/mhd3d.Oapa.x+4125-401sdt.xf
# ----------------------------------------------------------------------
# HWPC group to collect by default.
-Drtenv=PAT_RT_HWPC=1 # Summary with instructions metrics.
# ----------------------------------------------------------------------
# Libraries to trace.
-g mpi
# ----------------------------------------------------------------------
# User-defined functions to trace, sorted by % of samples.
# Limited to top 200. A function is commented out if it has < 1%
# of samples, or if a cumulative threshold of 90% has been reached,
# or if it has size < 200 bytes.
# Note: -u should NOT be specified as an additional option.
# 43.37% 99659 bytes
-T mlwxyz_
# 16.09% 17615 bytes
-T half_
# 6.82% 6846 bytes
-T artv_
# 1.29% 5352 bytes
-T currenh_
# 1.03% 25294 bytes
-T bndbo_
# Functions below this point account for less than 10% of samples.
# 1.03% 31240 bytes
# -T bndto_
...
# ----------------------------------------------------------------------
-o mhd3d.x+apa # New instrumented program.
/work/crayadm/ldr/mhd3d/mhd3d.x # Original program.
- 143. Trace groups (-g option):
biolib      Cray Bioinformatics library routines
blacs       Basic Linear Algebra communication subprograms
blas        Basic Linear Algebra subprograms
caf         Co-Array Fortran (Cray X2 systems only)
fftw        Fast Fourier Transform library (64-bit only)
hdf5        manages extremely large and complex data collections
heap        dynamic heap
io          includes stdio and sysio groups
lapack      Linear Algebra Package
lustre      Lustre File System
math        ANSI math
mpi         MPI
netcdf      network common data form (manages array-oriented scientific data)
omp         OpenMP API (not supported on Catamount)
omp-rtl     OpenMP runtime library (not supported on Catamount)
portals     Lightweight message passing API
pthreads    POSIX threads (not supported on Catamount)
scalapack   Scalable LAPACK
shmem       SHMEM
stdio       all library functions that accept or return the FILE* construct
sysio       I/O system calls
system      system calls
upc         Unified Parallel C (Cray X2 systems only)
- 144. HWPC groups:
0   Summary with instruction metrics
1   Summary with TLB metrics
2   L1 and L2 metrics
3   Bandwidth information
4   Hypertransport information
5   Floating point mix
6   Cycles stalled, resources idle
7   Cycles stalled, resources full
8   Instructions and branches
9   Instruction cache
10  Cache hierarchy
11  Floating point operations mix (2)
12  Floating point operations mix (vectorization)
13  Floating point operations mix (SP)
14  Floating point operations mix (DP)
15  L3 (socket-level)
16  L3 (core-level reads)
17  L3 (core-level misses)
18  L3 (core-level fills caused by L2 evictions)
19  Prefetches
- 145. Regions, useful to break up long routines
int PAT_region_begin (int id, const char *label)
int PAT_region_end (int id)
Disable/Enable Profiling, useful for excluding initialization
int PAT_record (int state)
Flush buffer, useful when program isn’t exiting cleanly
int PAT_flush_buffer (void)
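A minimal sketch using the API above, assuming the PAT_STATE_ON/PAT_STATE_OFF constants from pat_api.h; the application routines are stand-ins.

    #include <pat_api.h>

    static void initialize(void)  { /* ... setup, not profiled ... */ }
    static void solver_step(void) { /* ... the interesting work ... */ }

    int main(void)
    {
        PAT_record(PAT_STATE_OFF);      /* exclude initialization */
        initialize();
        PAT_record(PAT_STATE_ON);

        PAT_region_begin(1, "solver");  /* label appears in reports */
        for (int i = 0; i < 100; i++)
            solver_step();
        PAT_region_end(1);

        PAT_flush_buffer();             /* in case of an unclean exit */
        return 0;
    }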
- 146. Instrument application for further analysis (a.out+apa)
% pat_build -O <apafile>.apa
Run application
% aprun … a.out+apa (or qsub <apa script>)
Generate text report and visualization file (.ap2)
% pat_report -o my_text_report.txt [<datafile>.xf | <datadir>]
View report in text and/or with Cray Apprentice2
% app2 <datafile>.ap2
- 147. MUST run on Lustre ( /work/… , /lus/…, /scratch/…, etc.)
Number of files used to store raw data
1 file created for program with 1 – 256 processes
√n files created for program with 257 – n processes
Ability to customize with PAT_RT_EXPFILE_MAX
- 148. Full trace files show transient events but are too large
Current run-time summarization misses transient events
Plan to add ability to record:
Top N peak values (N small)
Approximate std dev over time
For time, memory traffic, etc.
During tracing and sampling
- 149. Cray Apprentice2 features:
Call graph profile
Communication statistics
Time-line view (communication and I/O)
Activity view
Pair-wise communication statistics
Text reports
Source code mapping
Cray Apprentice2 is targeted to help identify and correct:
Load imbalance
Excessive communication
Network contention
Excessive serialization
I/O problems
- 154. (Chart legend: min, avg, and max values; -1/+1 standard deviation marks)