© 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
HW and SW Architecture of the Intel® Xeon Phi™ Coprocessor
Leo Borges (leonardo.borges@intel.com)
Intel - Software and Services Group
iStep-Brazil, August 2013
Introduction
High-level overview of the Intel® Xeon Phi™ platform:
Hardware and Software
Performance and Thread Parallelism
Conclusions & References
Big Gains for Selected Applications
Efficient vectorization, threading, and parallel execution drive higher performance for many applications.
[Chart: theoretical performance (1.00x to 7.00x) as a function of the fraction of the code that runs parallel (0.00 to 1.00) and the fraction vectorized (0% to 100%)]
* Theoretical acceleration using a highly-parallel Intel® Xeon Phi™ coprocessor versus a standard multi-core Intel® Xeon® processor
Vectorize, parallelize, and scale to manycore. Example domains: medical imaging and biophysics; computer-aided design & manufacturing; climate modeling & weather prediction; financial analyses, trading; energy & oil exploration; digital content creation.
Evaluating Your Applications for Intel® Xeon Phi™
Ask three questions; a YES to each points toward the coprocessor:
• Can your workload scale to over 100 threads?
• Can your workload benefit from large vectors?
• Can your workload benefit from more memory bandwidth?
Use Intel® Xeon Phi™ coprocessors for applications that scale with:
• Threads • Vectors • Memory Bandwidth
Introduction
High-level overview of the Intel® Xeon Phi™ platform:
Hardware and Software
Performance and Thread Parallelism
Conclusions & References
Intel® Xeon Phi™ Product Family: Based on the Intel MIC Architecture
Intel Many Integrated Core (MIC, pronounced “Mike”) – product family/architecture for highly parallel applications
• Based on a large number of smaller, low-power Intel Architecture cores
• 512-bit wide vector engine
• Complements the Intel Xeon processor product line
• Provides breakthrough performance for highly parallel apps
– Familiar x86 programming model
– Same source code supports both Intel Xeon processor & Intel Xeon Phi coprocessor
– Initially a coprocessor with PCI Express form factor
First products announced at SC12, code named Knights Corner (KNC):
• Up to 61 cores, 4 threads per core
• Up to 16GB GDDR5 memory (up to 352 GB/s)
• 225-300W (cooling: both passive & active SKUs)
• x16 PCIe form factor (requires an IA host)
Each Intel® Xeon Phi™ coprocessor core is a fully functional multi-thread execution unit
• >50 in-order cores
• Ring interconnect
• 64-bit addressing
• Scalar unit based on the Intel® Pentium® processor family
• Two pipelines
– Dual issue with scalar instructions
• One-per-clock scalar pipeline throughput
– 4 clock latency from issue to resolution
• 4 hardware threads per core
• Each thread issues instructions in turn
• Round-robin execution hides scalar unit latency
[Core diagram: scalar registers, vector registers, 512K L2 cache, 32K L1 I-cache, 32K L1 D-cache, instruction decode, scalar unit, vector unit, ring stop]
Each Intel® Xeon Phi™ coprocessor core has a fully functional vector unit
• Optimized for single and double precision
• All-new vector unit
• 512-bit SIMD instructions – not Intel® SSE, MMX™, or Intel® AVX
• 32 512-bit wide vector registers
– Hold 16 singles or 8 doubles per register
• Fully-coherent L1 and L2 caches
Takeaway: Vectorization is important
Individual cores are tied together via fully coherent caches into a bidirectional ring
• L1: 32K I-cache and 32K D-cache per core; 3-cycle access; up to 8 concurrent accesses
• L2: 512K cache per core; 11-cycle best access; up to 32 concurrent accesses
• GDDR5 memory: 16 memory channels at up to 5.5 GT/s; 16 GB with ~300 ns access
• Bidirectional ring: 115 GB/sec
• Distributed Tag Directory (DTD) reduces ring snoop traffic
• The PCIe port has its own ring stop
Takeaway: Parallelization and data placement are important
Each Xeon Phi can be addressed as an individual node in the cluster
• 6 to 16 GB GDDR5 memory
Intel® Xeon Phi™ Coprocessors
• 3 Family – outstanding parallel computing solution; performance/$ leadership
 3120P, 3120A: 6GB GDDR5, 240 GB/s, >1 TFlops DP
• 5 Family – optimized for high density environments; performance/watt leadership
 5120P, 5120D: 8GB GDDR5, >300 GB/s, >1 TFlops DP
• 7 Family – highest level of features; performance leadership
 7120P, 7120X: 16GB GDDR5, 352 GB/s, >1.2 TFlops DP, Turbo
Introduction
High-level overview of the Intel® Xeon Phi™ platform:
Hardware and Software
Performance Considerations
Performance and Thread Parallelism
Conclusions & References
Reminder: Vectorization, What is it?

for (i=0;i<=MAX;i++)
 c[i]=a[i]+b[i];

• Scalar: one instruction, one mathematical operation (A + B → C)
• Vector: one instruction, eight mathematical operations¹ (a[i..i+7] + b[i..i+7] → c[i..i+7])
• Vectorization is core-level parallelism
1. The number of operations per instruction varies based on which SIMD instruction is used and the width of the operands
SIMD Vector Instructions per Family

Instruction | Instruction Width | Operand Width | Operations per Instruction | Family
SSE     | 128-bit | 32-bit (SP) |  4 | Westmere
SSE     | 128-bit | 64-bit (DP) |  2 | Westmere
AVX     | 256-bit | 32-bit (SP) |  8 | SandyBridge
AVX     | 256-bit | 64-bit (DP) |  4 | SandyBridge
MIC ISA | 512-bit | 32-bit (SP) | 16 | Xeon Phi
MIC ISA | 512-bit | 64-bit (DP) |  8 | Xeon Phi

Each step (SSE → AVX → MIC ISA) doubles the operations per instruction (2X, 2X).
Theoretical Peak Flops on Xeon and Xeon Phi
Sandy Bridge/Ivy Bridge: two 256-bit SIMD operations per cycle
 8 MUL (32b) and 8 ADD (32b): 16 single-precision flops/cycle
 4 MUL (64b) and 4 ADD (64b): 8 double-precision flops/cycle
Theoretical peak for a 2-socket E5-2697 v2 (12 cores @ 2.7 GHz):
 16 [flops/cycle] * 2 [sockets] * 12 [cores] * 2.7 [Gcycles/sec] = 1036.8 [Gflops/sec] SP
 8 [flops/cycle] * 2 [sockets] * 12 [cores] * 2.7 [Gcycles/sec] = 518.4 [Gflops/sec] DP
Xeon Phi: one 512-bit SIMD FMA per cycle
 16 MUL (32b) and 16 ADD (32b): 32 single-precision flops/cycle
 8 MUL (64b) and 8 ADD (64b): 16 double-precision flops/cycle
Theoretical peak for a KNC 7120X (61 cores @ 1.24 GHz):
 32 [flops/cycle] * 61 [cores] * 1.24 [Gcycles/sec] = 2420.5 [Gflops/sec] SP
 16 [flops/cycle] * 61 [cores] * 1.24 [Gcycles/sec] = 1210.2 [Gflops/sec] DP
Theoretical Memory Bandwidth on Xeon and Xeon Phi
Basic rule for theoretical memory BW [bytes/second]:
 [bytes/channel] * memory frequency [Gcycles/sec] * number of channels * number of sockets
Sandy Bridge/Ivy Bridge: 4 channels, 2 sockets, 1600/1866 MHz memory
 8 * 1.600 * 4 * 2 = 102 GB/s peak (ST: 80 GB/s) on SNB-EP
 8 * 1.866 * 4 * 2 = 120 GB/s peak (ST: 90 GB/s) on IVB-EP
Xeon Phi: 16 channels, 5.5 GT/s memory
 4 [bytes/channel] * 5.5 [GT/s] * 16 [channels] = 352 GB/s peak (ST: 170 GB/s*) on KNC 7120X
 *ECC enabled
Synthetic Benchmarks: Intel® Xeon Phi™ Coprocessor and Intel® MKL
Higher is better; 2S Intel® Xeon® vs. Intel Xeon Phi (ECC on):
• STREAM Triad (GB/s): 75 vs. 171 – up to 2.2X
• SMP Linpack (GF/s): 330 vs. 802 – up to 2.4X (75% efficient)
• DGEMM (GF/s): 347 vs. 887 – up to 2.5X (83% efficient)
• SGEMM (GF/s): 728 vs. 1,796 – up to 2.4X (84% efficient)
Notes
1. Intel® Xeon® Processor E5-2680 used for all: SGEMM matrix = 12800 x 12800, DGEMM matrix = 10752 x 10752, SMP Linpack matrix = 26000 x 26000
2. Intel® Xeon Phi™ coprocessor SE10P (ECC on) with “Gold” SW stack: SGEMM matrix = 12800 x 12800, DGEMM matrix = 12800 x 12800, SMP Linpack matrix = 26872 x 28672
3. Average single-node results from measurements across a set of nodes from the TACC+ Stampede* cluster++
+ Texas Advanced Computing Center (TACC) at the University of Texas at Austin.
++ Measured on the TACC+ Stampede cluster
Coprocessor results: benchmark run 100% on the coprocessor, no help from the Intel® Xeon® processor host (aka native)
Introduction
High-level overview of the Intel® Xeon Phi™ platform:
Hardware and Software
Native, Offload and Variations
Performance and Thread Parallelism
Conclusions & References
Wide Spectrum of Execution Models
Range of models to meet application needs, from multicore centric (Intel® Xeon® processors) to many-core centric (Intel® Many Integrated Core coprocessors):
• Multi-core-hosted – general purpose serial and parallel computing; main(), foo(), and MPI_*() all run on the multicore host
• Offload – codes with highly-parallel phases; main() runs on the host and offloads foo() to the many-core coprocessor
• Symmetric – codes with balanced needs; main(), foo(), and MPI_*() run on both host and coprocessor
• Many-core-hosted – highly-parallel codes; main(), foo(), and MPI_*() all run on the coprocessor
The Intel Manycore Platform Software Stack (MPSS) provides Linux on the coprocessor
Host processor and Intel® Xeon Phi™ coprocessor each run their own Linux* OS, connected over the PCI-E bus:
• Host side: Intel® Xeon Phi™ coprocessor support libraries, tools, and drivers (system-level code), plus user-level code
• Coprocessor side: Intel® Xeon Phi™ coprocessor communication and application-launch support (system-level code), plus user-level code
Runs either as an accelerator for offloaded host computation…
• Host side: host-side offload application (user code), plus offload libraries, user-level driver, and user-accessible APIs and libraries
• Coprocessor side: target-side offload application (user code), plus offload libraries and user-accessible APIs and libraries
Advantages
• More memory available
• Better file access
• Host better on serial code
• Better use of resources
…Or runs as a native or MPI* compute node via IP or OFED
A virtual terminal session (ssh or telnet connection to the coprocessor IP address, or over the IB fabric) runs a target-side “native” application: user code plus standard OS libraries and any 3rd-party or Intel libraries.
Advantages
• Simpler model
• No directives
• Easier port
• Good kernel test
Use if
• Not serial
• Modest memory
• Complex code
Intel® Xeon Phi™ Coprocessor Becomes a Network Node
Each Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor pair is joined by a virtual network connection: Intel® Xeon Phi™ architecture + Linux enables IP addressability.
Flexible: Enables Multiple Programming Models
• Coprocessor only – MPI ranks and data live entirely on the coprocessors: a homogenous network of many-core CPUs
• Host + Offload – MPI ranks on the host CPUs offload work and data to the coprocessors: a homogenous network of heterogeneous nodes
• Symmetric – MPI ranks and data on both CPUs and coprocessors: a heterogeneous network of homogeneous CPUs
Copyright© 2013, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Xeon Phi can work as a Node
The Intel® Manycore Platform Software Stack (Intel® MPSS) provides Linux* on the coprocessor
• Authenticated users can treat it like another node
• Add -mmic to compiles to create native programs
• Intel MPSS supplies a virtual FS and native execution

icc -O3 -g -mmic -o nativeMIC myNativeProgram.c
sudo scp /opt/intel/composerxe/lib/mic/libiomp5.so root@mic0:/lib64
scp native.exe mic0:/tmp
ssh mic0 "/tmp/native.exe <my-args>"

ssh mic0 top
Mem: 298016K used, 7578640K free, 0K shrd, 0K buff, 100688K cached
CPU: 0.0% usr 0.3% sys 0.0% nic 99.6% idle 0.0% io 0.0% irq 0.0% sirq
Load average: 1.00 1.04 1.01 1/2234 7265
  PID  PPID USER    STAT VSZ   %MEM CPU %CPU COMMAND
 7265  7264 fdkew   R    7060   0.0  14  0.3 top
   43     2 root    SW   0      0.0  13  0.0 [ksoftirqd/13]
 5748     1 root    S    119m   1.5 226  0.0 ./sep_mic_server3.8
 5670     1 micuser S    97872  1.2   0  0.0 /bin/coi_daemon --coiuser=micuser
Xeon Phi can work as a Coprocessor
Compiler Assisted Offload: Examples
• Offload a section of code to the coprocessor:

float pi = 0.0f;
#pragma offload target(mic)
#pragma omp parallel for reduction(+:pi)
for (i=0; i<count; i++) {
  float t = (float)((i+0.5f)/count);
  pi += 4.0f/(1.0f+t*t);
}
pi /= count;

• Offload any function call to the coprocessor:

#pragma offload target(mic) \
  in(transa, transb, N, alpha, beta) \
  in(A:length(matrix_elements)) \
  in(B:length(matrix_elements)) \
  in(C:length(matrix_elements)) \
  out(C:length(matrix_elements) alloc_if(0))
{ sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N,
        &beta, C, &N); }
Compiler Assisted Offload: Example
• An example in Fortran:

!DEC$ ATTRIBUTES OFFLOAD : TARGET( MIC ) :: SGEMM
!DEC$ OMP OFFLOAD TARGET( MIC ) &
!DEC$ IN( TRANSA, TRANSB, M, N, K, ALPHA, BETA, LDA, LDB, LDC ), &
!DEC$ IN( A: LENGTH( NCOLA * LDA )), &
!DEC$ IN( B: LENGTH( NCOLB * LDB )), &
!DEC$ INOUT( C: LENGTH( N * LDC ))
CALL SGEMM( TRANSA, TRANSB, M, N, K, ALPHA, &
            A, LDA, B, LDB, BETA, C, LDC )
Offload directives are independent of function boundaries
Execution semantics:
• If at the first offload the target is available, the target program is loaded
• At each offload, if the target is available the statement is run on the target, else it is run on the host
• At program termination the target program is unloaded

Host code (Intel® Xeon® processor):

f() {
  #pragma offload
  a = b + g();
  h();
}
__attribute__ ((target(mic)))
g() {
  ...
}
h() {
  ...
}

Target code (Intel® Xeon Phi™ coprocessor):

f_part1() {
  a = b + g();
}
__attribute__ ((target(mic)))
g() {
  ...
}
Example – share work between coprocessor and host using OpenMP*

omp_set_nested(1);
#pragma omp parallel private(ip)          /* top level, runs on host */
{
  #pragma omp sections
  {
    #pragma omp section                   /* runs on coprocessor */
    /* use pointer to copy back only part of potential array,
       to avoid overwriting host */
    #pragma offload target(mic) in(xp) in(yp) in(zp) out(ppot:length(np1))
    #pragma omp parallel for private(ip)
    for (i=0;i<np1;i++) {
      ppot[i] = threed_int(x0,xn,y0,yn,z0,zn,nx,ny,nz,xp[i],yp[i],zp[i]);
    }
    #pragma omp section                   /* runs on host */
    #pragma omp parallel for private(ip)
    for (i=0;i<np2;i++) {
      pot[i+np1] =
        threed_int(x0,xn,y0,yn,z0,zn,nx,ny,nz,xp[i+np1],yp[i+np1],zp[i+np1]);
    }
  }
}
Pragmas and directives mark data and code to be offloaded and executed

C/C++ Syntax
• Offload pragma: #pragma offload <clauses> <statement>
 Allow the next statement to execute on the coprocessor or the host CPU
• Variable/function offload properties: __attribute__((target(mic)))
 Compile a function for, or allocate a variable on, both host CPU and coprocessor
• Entire blocks of data/code defs: #pragma offload_attribute(push, target(mic)) … #pragma offload_attribute(pop)
 Mark entire files or large blocks of code to compile for both host CPU and coprocessor

Fortran Syntax
• Offload directive: !dir$ omp offload <clauses> <statement>
 Execute an OpenMP* parallel block on the coprocessor
 !dir$ offload <clauses> <statement>
 Execute the next statement or function on the coprocessor
• Variable/function offload properties: !dir$ attributes offload:<mic> :: <ret-name> OR <var1,var2,…>
 Compile a function or variable for CPU and coprocessor
• Entire code blocks: !dir$ offload begin <clauses> … !dir$ end offload
Options on offloads can control data copying and manage coprocessor dynamic allocation

Clauses
• Multiple coprocessors: target(mic[:unit]) – select specific coprocessors
• Conditional offload: if (condition) / mandatory – select coprocessor or host compute
• Inputs: in(var-list modifiers) – copy from host to coprocessor
• Outputs: out(var-list modifiers) – copy from coprocessor to host
• Inputs & outputs: inout(var-list modifiers) – copy host to coprocessor and back when offload completes
• Non-copied data: nocopy(var-list modifiers) – data is local to target

Modifiers (optional)
• Specify copy length: length(N) – copy N elements of the pointer’s type
• Coprocessor memory allocation: alloc_if ( bool ) – allocate coprocessor space on this offload (default: TRUE)
• Coprocessor memory release: free_if ( bool ) – free coprocessor space at the end of this offload (default: TRUE)
• Control target data alignment: align ( N bytes ) – specify minimum memory alignment on coprocessor
• Array partial allocation & variable relocation: alloc ( array-slice ), into ( var-expr ) – enables partial array allocation and data copy into other vars & ranges
Data Persistence with Compiler Offload

__declspec(target(mic)) static float *A, *B, *C, *C1;

// Transfer matrices A, B, and C to coprocessor and do not de-allocate matrices A and B
#pragma offload target(mic) \
  in(transa, transb, M, N, K, alpha, beta, LDA, LDB, LDC) \
  in(A:length(NCOLA * LDA) free_if(0)) \
  in(B:length(NCOLB * LDB) free_if(0)) \
  inout(C:length(N * LDC))
{ sgemm(&transa, &transb, &M, &N, &K, &alpha, A, &LDA, B, &LDB, &beta, C, &LDC);
}

// Transfer matrix C1 to coprocessor and reuse matrices A and B
#pragma offload target(mic) \
  in(transa1, transb1, M, N, K, alpha1, beta1, LDA, LDB, LDC1) \
  nocopy(A:length(NCOLA * LDA) alloc_if(0) free_if(0)) \
  nocopy(B:length(NCOLB * LDB) alloc_if(0) free_if(0)) \
  inout(C1:length(N * LDC1))
{ sgemm(&transa1, &transb1, &M, &N, &K, &alpha1, A, &LDA, B, &LDB, &beta1, C1, &LDC1);
}

// Deallocate A and B on the coprocessor
#pragma offload target(mic) \
  nocopy(A:length(NCOLA * LDA) free_if(1)) \
  nocopy(B:length(NCOLB * LDB) free_if(1))
{ }
Data Persistence with Compiler Offload

#define ALLOC alloc_if(1) free_if(0)
#define REUSE alloc_if(0) free_if(0)
#define FREE  alloc_if(0) free_if(1)

__declspec(target(mic)) static float *A, *B, *C, *C1;

// Transfer matrices A, B, and C to coprocessor and do not de-allocate matrices A and B
#pragma offload target(mic) \
  in(transa, transb, M, N, K, alpha, beta, LDA, LDB, LDC) \
  in(A:length(NCOLA * LDA) ALLOC) \
  in(B:length(NCOLB * LDB) ALLOC) \
  inout(C:length(N * LDC))
{ sgemm(&transa, &transb, &M, &N, &K, &alpha, A, &LDA, B, &LDB, &beta, C, &LDC);
}

// Transfer matrix C1 to coprocessor and reuse matrices A and B
#pragma offload target(mic) \
  in(transa1, transb1, M, N, K, alpha1, beta1, LDA, LDB, LDC1) \
  nocopy(A:length(NCOLA * LDA) REUSE) \
  nocopy(B:length(NCOLB * LDB) REUSE) \
  inout(C1:length(N * LDC1))
{ sgemm(&transa1, &transb1, &M, &N, &K, &alpha1, A, &LDA, B, &LDB, &beta1, C1, &LDC1);
}

// Deallocate A and B on the coprocessor
#pragma offload_transfer target(mic) \
  nocopy(A:length(NCOLA * LDA) FREE) \
  nocopy(B:length(NCOLB * LDB) FREE)
To handle more complex data structures on the coprocessor, use Virtual Shared Memory
An identical range of virtual addresses is reserved on both host and coprocessor; changes are shared at offload points, allowing:
• Seamless sharing of complex data structures, including linked lists
• Elimination of manual data marshaling and shared array management
• Freer use of new C++ features and standard classes
[Diagram: the C/C++ executable maps the same virtual address range in the host VM and the coprocessor VM; offload code accesses the shared data from either side]
Example: Virtual Shared Memory
• Shared between host and Xeon Phi:

// Shared variable declaration
_Cilk_shared T in1[SIZE];
_Cilk_shared T in2[SIZE];
_Cilk_shared T res[SIZE];

_Cilk_shared void compute_sum()
{
  int i;
  for (i=0; i<SIZE; i++) {
    res[i] = in1[i] + in2[i];
  }
}
(...)
// Call compute_sum on target
_Cilk_offload compute_sum();
Virtual Shared Memory uses special allocation to manage data sharing at offload boundaries
Declare virtual shared data using the _Cilk_shared allocation specifier
Allocate virtual dynamic shared data using these special functions:
 _Offload_shared_malloc(), _Offload_shared_aligned_malloc(),
 _Offload_shared_free(), _Offload_shared_aligned_free()
Shared data copying occurs automatically around offload sections
• Memory is only synchronized on entry to or exit from an offload call
• Only modified data blocks are transferred between host and coprocessor
Allows transfer of C++ objects
• Pointers are transportable when they point to “shared” data addresses
Well-known methods can be used to synchronize access to shared data and prevent data races within offloaded code
• E.g., locks, critical sections, etc.
This model is integrated with the Intel® Cilk™ Plus parallel extensions
Note: Not supported in Fortran – available for C/C++ only
Data sharing between host and coprocessor can be enabled using this Intel® Cilk™ Plus syntax
• Function: int _Cilk_shared f(int x){ return x+1; }
 Code emitted for host and target; may be called from either side
• Global: _Cilk_shared int x = 0;
 Datum is visible on both sides
• File/Function static: static _Cilk_shared int x;
 Datum visible on both sides, only to code within the file/function
• Class: class _Cilk_shared x {…};
 Class methods, members, and operators available on both sides
• Pointer to shared data: int _Cilk_shared *p;
 p is local (not shared), can point to shared data
• A shared pointer: int *_Cilk_shared p;
 p is shared; should only point at shared data
• Entire blocks of code: #pragma offload_attribute(push, _Cilk_shared) … #pragma offload_attribute(pop)
 Mark entire files or blocks of code _Cilk_shared using this pragma
Intel® Cilk™ Plus syntax can also specify the offloading of computation to the coprocessor
• Offloading a function call:
 x = _Cilk_offload func(y);
 func executes on the coprocessor if possible
 x = _Cilk_offload_to (card_num) func(y);
 func must execute on the specified coprocessor or an error occurs
• Offloading asynchronously:
 x = _Cilk_spawn _Cilk_offload func(y);
 func executes on the coprocessor; the continuation is available for stealing
• Offloading a parallel for-loop:
 _Cilk_offload _Cilk_for(i=0; i<N; i++){
   a[i] = b[i] + c[i];
 }
 The loop executes in parallel on the coprocessor; it is implicitly “un-inlined” as a function call
Introduction
High-level overview of the Intel® Xeon Phi™ platform:
Hardware and Software
Performance and Thread Parallelism
Conclusions & References
Comprehensive set of SW tools for Xeon and Xeon Phi Programming

Code Analysis
• Advisor XE
• VTune Amplifier XE
• Inspector XE
• Trace Analyzer

Programming Models
• Intel Cilk Plus
• Threading Building Blocks
• OpenMP
• OpenCL
• MPI
• Offload/Native/MYO

Libraries & Compilers
• Math Kernel Library
• Integrated Performance Primitives
• Intel Compilers
Options for Thread Parallelism
From ease of use / code maintainability to programmer control:
• Intel® Math Kernel Library
• OpenMP*
• Intel® Threading Building Blocks
• Intel® Cilk™ Plus
• OpenCL*
• Pthreads* and other threading libraries
Choice of unified programming to target both Intel® Xeon® and Intel® Xeon Phi™ Architecture!
Introduction
High-level overview of the Intel® Xeon Phi™ platform:
Hardware and Software
Performance and Thread Parallelism: OpenMP
Conclusions & References
OpenMP* on the Coprocessor
• The basics work just like on the host CPU
• For both native and offload models
• Need to specify -openmp
• There are 4 hardware thread contexts per core
• Need at least 2 x ncore threads for good performance
– For all except the most memory-bound workloads
– Often, 3x or 4x (number of available cores) is best
– Very different from hyperthreading on the host!
– -opt-threads-per-core=n advises compiler how many
threads to optimize for
• If you don’t saturate all available threads, be sure to
set KMP_AFFINITY to control thread distribution
Thread Affinity Interface
Allows OpenMP threads to be bound to physical or logical cores
• export environment variable KMP_AFFINITY=
 – physical: use all physical cores before assigning threads to other logical cores (other hardware thread contexts)
 – compact: assign threads to consecutive h/w contexts on the same physical core (e.g. to benefit from shared cache)
 – scatter: assign consecutive threads to different physical cores (e.g. to maximize access to memory)
 – balanced: blend of compact & scatter (currently only available for Intel® MIC Architecture)
• Helps optimize access to memory or cache
• Particularly important if not all available h/w threads are used
 – else some physical cores may be idle while others run multiple threads
• See the compiler documentation for (much) more detail
OpenMP defaults
• OMP_NUM_THREADS defaults to
• 1 x ncore for host (or 2x if hyperthreading enabled)
• 4 x ncore for native coprocessor applications
• 4 x (ncore-1) for offload applications
– one core is reserved for offload daemons and OS
• Defaults may be changed via environment variables
or via API calls on either the host or the coprocessor
Target OpenMP environment (offload)
Use target-specific APIs to set values for the coprocessor target only, e.g.
omp_set_num_threads_target() (called from the host),
omp_set_nested_target(), etc.
• Protect with #ifdef __INTEL_OFFLOAD (undefined with -no-offload)
• Fortran: USE MIC_LIB and OMP_LIB; C: #include <offload.h>
Or define MIC-specific versions of env vars using
MIC_ENV_PREFIX=MIC (no underscore)
• Values on MIC no longer default to values on host
• Set values specific to MIC using
export MIC_OMP_NUM_THREADS=120 (applies to all cards)
export MIC_2_OMP_NUM_THREADS=180 (for card #2, etc.)
export MIC_3_ENV="OMP_NUM_THREADS=240|KMP_AFFINITY=balanced"
Introduction
High-level overview of the Intel® Xeon Phi™ platform:
Hardware and Software
Performance and Thread Parallelism: MKL
Conclusions & References
MKL Usage Models on the Intel® Xeon Phi™ Coprocessor
• Automatic Offload
– No code changes required
– Automatically uses both host and target
– Transparent data transfer and execution management
• Compiler Assisted Offload
– Explicit control of data transfer and remote execution using compiler offload pragmas/directives
– Can be used together with Automatic Offload
• Native Execution
– Uses the coprocessors as independent nodes
– Input data is copied to targets in advance
MKL Execution Models
Execution models span a spectrum from multicore-centric (Intel® Xeon®) to many-core-centric (Intel® Xeon Phi™):
• Multicore Hosted: general-purpose serial and parallel computing, on the host only
• Offload: codes with highly parallel phases
• Symmetric: codes with balanced needs, running on host and coprocessor together
• Many-Core Hosted: highly parallel codes, on the coprocessor only
Work Division Control in MKL Automatic Offload
• API call: mkl_mic_set_workdivision(MKL_TARGET_MIC, 0, 0.5)
offloads 50% of the computation to the 1st card only.
• Environment variable: MKL_MIC_0_WORKDIVISION=0.5
has the same effect.
How to Use MKL with Compiler Assisted Offload
• The same way you would offload any function call
to the coprocessor.
• An example in C:
#pragma offload target(mic) \
    in(transa, transb, N, alpha, beta) \
    in(A:length(matrix_elements)) \
    in(B:length(matrix_elements)) \
    in(C:length(matrix_elements)) \
    out(C:length(matrix_elements) alloc_if(0))
{
    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N,
          &beta, C, &N);
}
Introduction
High-level overview of the Intel® Xeon Phi™ platform:
Hardware and Software
Performance and Thread Parallelism
Conclusions & References
Conclusions
Intel® Xeon Phi™ coprocessor advantages:
• Comparable performance potential to other accelerators
• Faster time to solution due to reduced development effort
• Better investment protection with a single code base for
processors and coprocessors
A flexible and wide range of programming models: from pure native to offloaded, and all variants in between
All with the familiar Intel development environment
Intel® Xeon Phi™ Coprocessor Developer Site: http://software.intel.com/mic-developer
One stop shop for:
• Tools & Software Downloads
• Getting Started Development Guides
• Video Workshops, Tutorials, & Events
• Code Samples & Case Studies
• Articles, Forums, & Blogs
• Associated Product Links
Thank you. (Obrigado.)
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY
ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS
DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR
IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES
RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY
PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on
Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using
specific computer systems, components, software, operations and functions. Any change to any of
those factors may cause the results to vary. You should consult other information and performance
tests to assist you in fully evaluating your contemplated purchases, including the performance of that
product when combined with other products.
Copyright © 2013, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core,
VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that
are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and
other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended
for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for
Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information
regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Legal Disclaimer & Optimization Notice
Do Multicore ao Manycore: Práticas de Configuração, Compilação e Execução no coprocessador Intel® Xeon Phi™ - Intel Software Conference 2013

  • 1. © 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. HW and SW Architecture of the Intel® Xeon Phi™ Coprocessor Leo Borges (leonardo.borges@intel.com) Intel - Software and Services Group iStep-Brazil, August 2013 1
  • 2. Click to edit Master title style 2 Introduction High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software Performance and Thread Parallelism Conclusions & References
  • 3. © 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 3 * Theoretical acceleration using a highly-parallel Intel® Xeon Phi™ coprocessor versus a standard multi-core Intel® Xeon® processor Efficient vectorization, threading, and parallel execution drives higher performance for many applications Fraction Parallel % Vector Performance 7.00 5.00 3.00 1.00 1.00 0.20 0.00 0.40 0.60 0.80 0% 100% 50% 75% 25% Big Gains for Selected Applications Scale to manycore Parallelize Vectorize Medical imaging and biophysics Computer Aided Design & Manufacturing Climate modeling & weather prediction Financial analyses, trading Energy &oil exploration Digital content creation
  • 4. © 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 4 YES Evaluating Your Applications for Intel® Xeon Phi™ NO YES YES YES Can your workload benefit from more memory bandwidth? Can your workload benefit from large vectors? NO NO Can your workload scale to over 100 threads? Use Intel® Xeon Phi™ coprocessors for applications that scale with: • Threads • Vectors • Memory Bandwidth
  • 5. Click to edit Master title style 5 Introduction High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software Performance and Thread Parallelism Conclusions & References
  • 6. © 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 6 Intel Many Integrated Core (MIC, pronounced “Mike”) Product Family/Architecture for Highly Parallel Applications • Based on large number of smaller, low power, Intel Arch. Cores • 512-bit wide vector engine • Compliments Intel Xeon processor product line • Provides breakthrough performance for highly parallel apps – Familiar x86 programming model – Same source code supports both Intel Xeon processor & Intel Xeon Phi coprocessor – Initially a coprocessor with PCI Express form factor First products announced at SC12: Code named Knights Corner (KNC) • Up to 61 cores, 4 threads per core • Up to 16GB GDDR5 memory (up to 352 GB/s) • 225-300W (Cooling: Both passive & active SKUs) • x16 PCIe Form-Factor (requires IA host) 6 Intel® Xeon® Phi™ Product Family Based on the Intel MIC Architecture
  • 7. © 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 7 Each Intel® Xeon Phi™ Coprocessor core is a fully functional multi-thread Execution unit • >50 in-order cores • Ring interconnect • 64-bit addressing • Scalar unit based on Intel® Pentium® processor family • Two pipelines - Dual issue with scalar instructions • One-per-clock scalar pipeline throughput - 4 clock latency from issue to resolution • 4 hardware threads per core • Each thread issues instructions in turn • Round-robin execution hides scalar unit latencyRing Scalar Registers Vector Registers 512K L2 Cache 32K L1 I-cache 32K L1 D-cache Instruction Decode Vector Unit Scalar Unit
  • 8. © 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 8 Each Intel® Xeon Phi™ Coprocessor core is a fully functional multi-thread Vector unit Ring Scalar Registers Vector Registers 512K L2 Cache 32K L1 I-cache 32K L1 D-cache Instruction Decode Vector Unit Scalar Unit • Optimized • Single and Double precision • All new vector unit • 512-bit SIMD Instructions – not Intel® SSE, MMX™, or Intel® AVX • 32 512-bit wide vector registers - Hold 16 singles or 8 doubles per register • Fully-coherent L1 and L2 caches Takeaway: Vectorization is important
  • 9. © 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 9 Individual cores are tied together via fully coherent caches into a bidirectional ring • 9 GDDR GDDR GDDR GDDR PCIexp L1 32K I- D-cache per core 3 cycle access Up to 8 concurrent accesses L2 512K cache per core 11 cycle best access Up to 32 concurrent accesses GDDR5 Memory 16 memory channels - Up to 5.5 Gb/sec 16 GB 300ns access Bidirectional ring 115 GB/sec Distributed Tag Directory (DTD) reduces ring snoop traffic PCIe port has its own ring stop Takeaway: Parallelization and data placement are important
  • 10. © 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 10 Each Xeon Phi can be addressed as an Individual Node in the Cluster • 1 0 6 to 16 GB GDDR5 memory
  • 11. INTEL CONFIDENTIAL • Click to edit Master text styles ‒ Second level  Third level o Fourth level  Fifth level Click to edit Master title style 11 © 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 3 Family Outstanding Parallel Computing Solution Performance/$ leadership Intel® Xeon Phi™ Coprocessors 3120P 3120A 5 Family Optimized for High Density Environments Performance/watt leadership 5120D 7 Family Highest Level of Features Performance leadership 7120P 7120X 16GB GDDR5 352 GB/s > 1.2 TFlops DP Turbo T 8GB GDDR5 >300 GB/s >1 TFlops DP 6GB GDDR5 240 GB/s >1 TFlops DP 5120P
  • 12. Click to edit Master title style 12 Introduction High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software Performance Considerations Performance and Thread Parallelism Conclusions & References
  • 13. © 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 13 Reminder: Vectorization, What is it? for (i=0;i<=MAX;i++) c[i]=a[i]+b[i]; + c[i+7] c[i+6] c[i+5] c[i+4] c[i+3] c[i+2] c[i+1] c[i] b[i+7] b[i+6] b[i+5] b[i+4] b[i+3] b[i+2] b[i+1] b[i] a[i+7] a[i+6] a[i+5] a[i+4] a[i+3] a[i+2] a[i+1] a[i] Vector - One Instruction - Eight Mathematical Operations1 1. Number of operations per instruction varies based on the which SIMD instruction is used and the width of the operands + C B A Scalar - One Instruction - One Mathematical Operation • Vectorizations is Core-Level Parallelism
  • 14. SIMD Vector Instructions per Family:
Instruction | Instruction Width | Operand Width | Operations per Instruction | Family
SSE | 128-bit | 32-bit (SP) | 4 | Westmere
SSE | 128-bit | 64-bit (DP) | 2 | Westmere
AVX | 256-bit | 32-bit (SP) | 8 | Sandy Bridge
AVX | 256-bit | 64-bit (DP) | 4 | Sandy Bridge
MIC ISA | 512-bit | 32-bit (SP) | 16 | Xeon Phi
MIC ISA | 512-bit | 64-bit (DP) | 8 | Xeon Phi
Each generation doubles (2X) the vector width of the previous one.
  • 15. Theoretical Peak Flops on Xeon and Xeon Phi.
Sandy Bridge/Ivy Bridge: two 256-bit SIMD operations per cycle; 8 MUL (32b) and 8 ADD (32b) = 16 single-precision flops/cycle; 4 MUL (64b) and 4 ADD (64b) = 8 double-precision flops/cycle.
Theoretical peak for a 2-socket E5-2697 v2 (12 cores @ 2.7 GHz):
16 [flops/cycle] * 2 [sockets] * 12 [cores] * 2.7 [Gcycles/sec] = 1036.8 [Gflops/sec] SP
8 [flops/cycle] * 2 [sockets] * 12 [cores] * 2.7 [Gcycles/sec] = 518.4 [Gflops/sec] DP
Xeon Phi: one 512-bit SIMD FMA per cycle; 16 MUL (32b) and 16 ADD (32b) = 32 single-precision flops/cycle; 8 MUL (64b) and 8 ADD (64b) = 16 double-precision flops/cycle.
Theoretical peak for a KNC 7120X (61 cores @ 1.24 GHz):
32 [flops/cycle] * 61 [cores] * 1.24 [Gcycles/sec] = 2420.5 [Gflops/sec] SP
16 [flops/cycle] * 61 [cores] * 1.24 [Gcycles/sec] = 1210.2 [Gflops/sec] DP
  • 16. Theoretical Memory Bandwidth on Xeon and Xeon Phi.
Basic rule for theoretical memory BW [bytes/second]: [bytes/channel] * memory frequency [Gcycles/sec] * number of channels * number of sockets.
Sandy Bridge/Ivy Bridge: 4 channels, 2 sockets and 1600/1866 MHz memory.
8 * 1.600 * 4 * 2 = 102 GB/s peak (ST: 80 GB/s) on SNB-EP
8 * 1.866 * 4 * 2 = 120 GB/s peak (ST: 90 GB/s) on IVB-EP
Xeon Phi: 16 channels, 5.5 GT/s memory.
4 [bytes/channel] * 5.5 [GT/s] * 16 [channels] = 352 GB/s peak (ST: 170 GB/s*) on KNC 7120X. *ECC enabled.
  • 17. Synthetic Benchmarks: Intel® Xeon Phi™ Coprocessor and Intel® MKL (higher is better; 2S Intel® Xeon® vs. Intel Xeon Phi, ECC on):
SGEMM: 728 vs. 1,796 GF/s, up to 2.4X (84% efficient)
DGEMM: 347 vs. 887 GF/s, up to 2.5X (83% efficient)
SMP Linpack: 330 vs. 802 GF/s, up to 2.4X (75% efficient)
STREAM Triad: 75 vs. 171 GB/s, up to 2.2X
Coprocessor results: benchmark run 100% on coprocessor, no help from Intel® Xeon® processor host (aka native).
Notes: 1. Intel® Xeon® Processor E5-2680 used for all; SGEMM matrix = 12800 x 12800, DGEMM matrix 10752 x 10752, SMP Linpack matrix 26000 x 26000. 2. Intel® Xeon Phi™ coprocessor SE10P (ECC on) with “Gold” SW stack; SGEMM matrix = 12800 x 12800, DGEMM matrix 12800 x 12800, SMP Linpack matrix 26872 x 28672. 3. Average single-node results from measurements across a set of nodes from the TACC+ Stampede* cluster. + Texas Advanced Computing Center (TACC) at the University of Texas at Austin. ++ Measured on the TACC+ Stampede cluster.
  • 18. Introduction · High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software · Native, Offload and Variations · Performance and Thread Parallelism · Conclusions & References
  • 19. Wide Spectrum of Execution Models, from multicore-centric (Intel® Xeon® processors) to many-core-centric (Intel® Many Integrated Core coprocessors): Multi-core-hosted for general purpose serial and parallel computing; Offload for codes with highly-parallel phases; Symmetric for codes with balanced needs; Many-core-hosted for highly-parallel codes. A range of models to meet application needs.
  • 20. The Intel Manycore Platform Software Stack (MPSS) provides Linux on the coprocessor. The host processor runs a Linux* OS with Intel® Xeon Phi™ coprocessor support libraries, tools, and drivers; the coprocessor runs its own Linux* OS with communication and application-launch support. Both sides host system-level and user-level code, connected over the PCI-E bus.
  • 21. Runs either as an accelerator for offloaded host computation: user code in the host-side offload application links against offload libraries, a user-level driver, and user-accessible APIs and libraries, which communicate over the PCI-E bus with the target-side offload application on the coprocessor. Advantages: • More memory available • Better file access • Host better on serial code • Better use of resources
  • 22. ...or runs as a native or MPI* compute node via IP or OFED: a target-side “native” application (user code plus standard OS libraries and any 3rd-party or Intel libraries) is reached through an ssh or telnet connection to the coprocessor IP address (virtual terminal session) or over the IB fabric. Advantages: • Simpler model • No directives • Easier port • Good kernel test. Use if: • Not serial • Modest memory • Complex code
  • 23. Intel® Xeon Phi™ Coprocessor Becomes a Network Node: each node pairs an Intel® Xeon® processor with an Intel® Xeon Phi™ coprocessor over a virtual network connection. Intel® Xeon Phi™ architecture + Linux enables IP addressability.
  • 24. Flexible: Enables Multiple Programming Models. Coprocessor only: a homogeneous network of many-core CPUs (MPI ranks on the coprocessors, data over the network). Symmetric: a heterogeneous network of homogeneous CPUs (MPI ranks on both hosts and coprocessors). Host + Offload: a homogeneous network of heterogeneous nodes (MPI ranks on the hosts, offload to the coprocessors).
  • 25. Xeon Phi can work as a Node. The Intel® Manycore Platform Software Stack (Intel® MPSS) provides Linux* on the coprocessor: authenticated users can treat it like another node. Add -mmic to compiles to create native programs; Intel MPSS supplies a virtual FS and native execution.
icc –O3 –g –mmic –o nativeMIC myNativeProgram.c
sudo scp /opt/intel/composerxe/lib/mic/libiomp5.so root@mic0:/lib64
scp native.exe mic0:/tmp
ssh mic0 “/tmp/native.exe <my-args>”
ssh mic0 top
Mem: 298016K used, 7578640K free, 0K shrd, 0K buff, 100688K cached
CPU: 0.0% usr 0.3% sys 0.0% nic 99.6% idle 0.0% io 0.0% irq 0.0% sirq
Load average: 1.00 1.04 1.01 1/2234 7265
PID PPID USER STAT VSZ %MEM CPU %CPU COMMAND
7265 7264 fdkew R 7060 0.0 14 0.3 top
43 2 root SW 0 0.0 13 0.0 [ksoftirqd/13]
5748 1 root S 119m 1.5 226 0.0 ./sep_mic_server3.8
5670 1 micuser S 97872 1.2 0 0.0 /bin/coi_daemon --coiuser=micuser
  • 26. Xeon Phi can work as a Coprocessor. Compiler Assisted Offload examples: • Offload a section of code to the coprocessor. • Offload any function call to the coprocessor.
#pragma offload target(mic) in(transa, transb, N, alpha, beta) \
    in(A:length(matrix_elements)) in(B:length(matrix_elements)) \
    in(C:length(matrix_elements)) out(C:length(matrix_elements) alloc_if(0))
{
    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
}

float pi = 0.0f;
#pragma offload target(mic)
#pragma omp parallel for reduction(+:pi)
for (i=0; i<count; i++) {
    float t = (float)((i+0.5f)/count);
    pi += 4.0f/(1.0f+t*t);
}
pi /= count;
  • 27. Compiler Assisted Offload: an example in Fortran:
!DEC$ ATTRIBUTES OFFLOAD : TARGET( MIC ) :: SGEMM
!DEC$ OMP OFFLOAD TARGET( MIC ) &
!DEC$ IN( TRANSA, TRANSB, M, N, K, ALPHA, BETA, LDA, LDB, LDC ), &
!DEC$ IN( A: LENGTH( NCOLA * LDA )), &
!DEC$ IN( B: LENGTH( NCOLB * LDB )), &
!DEC$ INOUT( C: LENGTH( N * LDC ))
CALL SGEMM( TRANSA, TRANSB, M, N, K, ALPHA, &
            A, LDA, B, LDB, BETA, C, LDC )
  • 28. Offload directives are independent of function boundaries. Execution: at the first offload, if the target is available, the target program is loaded; at each offload, if the target is available, the statement is run on the target, else it is run on the host; at program termination the target program is unloaded.
Host (Intel® Xeon® processor):
f() {
    #pragma offload
    a = b + g();
    h();
}
__attribute__ ((target(mic))) g() { ... }
h() { ... }
Target (Intel® Xeon Phi™ coprocessor):
f_part1() { a = b + g(); }
__attribute__ ((target(mic))) g() { ... }
  • 29. Example: share work between coprocessor and host using OpenMP*.
omp_set_nested(1);
#pragma omp parallel private(ip)   /* top level, runs on host */
{
    #pragma omp sections
    {
        #pragma omp section        /* runs on coprocessor */
        /* use pointer to copy back only part of potential array, to avoid overwriting host */
        #pragma offload target(mic) in(xp) in(yp) in(zp) out(ppot:length(np1))
        #pragma omp parallel for private(ip)
        for (i=0;i<np1;i++) {
            ppot[i] = threed_int(x0,xn,y0,yn,z0,zn,nx,ny,nz,xp[i],yp[i],zp[i]);
        }
        #pragma omp section        /* runs on host */
        #pragma omp parallel for private(ip)
        for (i=0;i<np2;i++) {
            pot[i+np1] = threed_int(x0,xn,y0,yn,z0,zn,nx,ny,nz,xp[i+np1],yp[i+np1],zp[i+np1]);
        }
    }
}
  • 30. Pragmas and directives mark data and code to be offloaded and executed.
C/C++ syntax:
Offload pragma: #pragma offload <clauses> <statement> ... allow next statement to execute on coprocessor or host CPU
Variable/function offload properties: __attribute__((target(mic))) ... compile function for, or allocate variable on, both host CPU and coprocessor
Entire blocks of data/code defs: #pragma offload_attribute(push, target(mic)) ... #pragma offload_attribute(pop) ... mark entire files or large blocks of code to compile for both host CPU and coprocessor
Fortran syntax:
Offload directive: !dir$ omp offload <clauses> <statement> ... execute OpenMP* parallel block on coprocessor; !dir$ offload <clauses> <statement> ... execute next statement or function on coprocessor
Variable/function offload properties: !dir$ attributes offload:<mic> :: <ret-name> OR <var1,var2,...> ... compile function or variable for CPU and coprocessor
Entire code blocks: !dir$ offload begin <clauses> ... !dir$ end offload
  • 31. Options on offloads can control data copying and manage coprocessor dynamic allocation.
Clauses:
Multiple coprocessors: target(mic[:unit]) ... select specific coprocessors
Conditional offload: if (condition) / mandatory ... select coprocessor or host compute
Inputs: in(var-list modifiers[opt]) ... copy from host to coprocessor
Outputs: out(var-list modifiers[opt]) ... copy from coprocessor to host
Inputs & outputs: inout(var-list modifiers[opt]) ... copy host to coprocessor and back when offload completes
Non-copied data: nocopy(var-list modifiers[opt]) ... data is local to target
Modifiers:
Specify copy length: length(N) ... copy N elements of pointer’s type
Coprocessor memory allocation: alloc_if ( bool ) ... allocate coprocessor space on this offload (default: TRUE)
Coprocessor memory release: free_if ( bool ) ... free coprocessor space at the end of this offload (default: TRUE)
Control target data alignment: align ( N bytes ) ... specify minimum memory alignment on coprocessor
Array partial allocation & variable relocation: alloc ( array-slice ) into ( var-expr ) ... enables partial array allocation and data copy into other vars & ranges
  • 32. Data Persistence with Compiler Offload
__declspec(target(mic)) static float *A, *B, *C, *C1;

// Transfer matrices A, B, and C to coprocessor and do not de-allocate matrices A and B
#pragma offload target(mic) in(transa, transb, M, N, K, alpha, beta, LDA, LDB, LDC) \
    in(A:length(NCOLA * LDA) free_if(0)) \
    in(B:length(NCOLB * LDB) free_if(0)) \
    inout(C:length(N * LDC))
{
    sgemm(&transa, &transb, &M, &N, &K, &alpha, A, &LDA, B, &LDB, &beta, C, &LDC);
}

// Transfer matrix C1 to coprocessor and reuse matrices A and B
#pragma offload target(mic) in(transa1, transb1, M, N, K, alpha1, beta1, LDA, LDB, LDC1) \
    nocopy(A:length(NCOLA * LDA) alloc_if(0) free_if(0)) \
    nocopy(B:length(NCOLB * LDB) alloc_if(0) free_if(0)) \
    inout(C1:length(N * LDC1))
{
    sgemm(&transa1, &transb1, &M, &N, &K, &alpha1, A, &LDA, B, &LDB, &beta1, C1, &LDC1);
}

// Deallocate A and B on the coprocessor
#pragma offload target(mic) \
    nocopy(A:length(NCOLA * LDA) free_if(1)) \
    nocopy(B:length(NCOLB * LDB) free_if(1))
{ }
  • 33. Data Persistence with Compiler Offload (with helper macros)
#define ALLOC alloc_if(1) free_if(0)
#define REUSE alloc_if(0) free_if(0)
#define FREE  alloc_if(0) free_if(1)

__declspec(target(mic)) static float *A, *B, *C, *C1;

// Transfer matrices A, B, and C to coprocessor and do not de-allocate matrices A and B
#pragma offload target(mic) in(transa, transb, M, N, K, alpha, beta, LDA, LDB, LDC) \
    in(A:length(NCOLA * LDA) ALLOC) \
    in(B:length(NCOLB * LDB) ALLOC) \
    inout(C:length(N * LDC))
{
    sgemm(&transa, &transb, &M, &N, &K, &alpha, A, &LDA, B, &LDB, &beta, C, &LDC);
}

// Transfer matrix C1 to coprocessor and reuse matrices A and B
#pragma offload target(mic) in(transa1, transb1, M, N, K, alpha1, beta1, LDA, LDB, LDC1) \
    nocopy(A:length(NCOLA * LDA) REUSE) \
    nocopy(B:length(NCOLB * LDB) REUSE) \
    inout(C1:length(N * LDC1))
{
    sgemm(&transa1, &transb1, &M, &N, &K, &alpha1, A, &LDA, B, &LDB, &beta1, C1, &LDC1);
}

// Deallocate A and B on the coprocessor
#pragma offload_transfer target(mic) \
    nocopy(A:length(NCOLA * LDA) FREE) \
    nocopy(B:length(NCOLB * LDB) FREE)
  • 34. To handle more complex data structures on the coprocessor, use Virtual Shared Memory. An identical range of virtual addresses is reserved on both host and coprocessor (the host and coprocessor executables map the same virtual address range); changes are shared at offload points, allowing: • seamless sharing of complex data structures, including linked lists • elimination of manual data marshaling and shared array management • freer use of new C++ features and standard classes.
  • 35. Example: Virtual Shared Memory, shared between host and Xeon Phi.
// Shared variable declaration
_Cilk_shared T in1[SIZE];
_Cilk_shared T in2[SIZE];
_Cilk_shared T res[SIZE];

_Cilk_shared void compute_sum() {
    int i;
    for (i=0; i<SIZE; i++) {
        res[i] = in1[i] + in2[i];
    }
}
(...)
// Call compute_sum on Target
_Cilk_offload compute_sum();
  • 36. Virtual Shared Memory uses special allocation to manage data sharing at offload boundaries. Declare virtual shared data using the _Cilk_shared allocation specifier; allocate virtual dynamic shared data using these special functions: _Offload_shared_malloc(), _Offload_shared_aligned_malloc(), _Offload_shared_free(), _Offload_shared_aligned_free(). Shared data copying occurs automatically around offload sections: memory is only synchronized on entry to or exit from an offload call, and only modified data blocks are transferred between host and coprocessor. Allows transfer of C++ objects: pointers are transportable when they point to “shared” data addresses. Well-known methods can be used to synchronize access to shared data and prevent data races within offloaded code, e.g. locks, critical sections, etc. This model is integrated with the Intel® Cilk™ Plus parallel extensions. Note: not supported in Fortran; available for C/C++ only.
  • 37. Data sharing between host and coprocessor can be enabled using this Intel® Cilk™ Plus syntax:
Function: int _Cilk_shared f(int x){ return x+1; } ... code emitted for host and target; may be called from either side
Global: _Cilk_shared int x = 0; ... datum is visible on both sides
File/Function static: static _Cilk_shared int x; ... datum visible on both sides, only to code within the file/function
Class: class _Cilk_shared x {...}; ... class methods, members and operators available on both sides
Pointer to shared data: int _Cilk_shared *p; ... p is local (not shared), can point to shared data
A shared pointer: int *_Cilk_shared p; ... p is shared; should only point at shared data
Entire blocks of code: #pragma offload_attribute(push, _Cilk_shared) ... #pragma offload_attribute(pop) ... mark entire files or blocks of code _Cilk_shared using this pragma
  • 38. Intel® Cilk™ Plus syntax can also specify the offloading of computation to the coprocessor:
Offloading a function call: x = _Cilk_offload func(y); ... func executes on coprocessor if possible; x = _Cilk_offload_to (card_num) func(y); ... func must execute on the specified coprocessor or an error occurs
Offloading asynchronously: x = _Cilk_spawn _Cilk_offload func(y); ... func executes on coprocessor; continuation available for stealing
Offloading a parallel for-loop: _Cilk_offload _Cilk_for(i=0; i<N; i++){ a[i] = b[i] + c[i]; } ... loop executes in parallel on coprocessor; the loop is implicitly “un-inlined” as a function call
  • 39. Introduction · High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software · Performance and Thread Parallelism · Conclusions & References
  • 40. Comprehensive set of SW tools for Xeon and Xeon Phi programming. Programming models: Offload/Native/MYO; Intel Cilk Plus, Threading Building Blocks, OpenMP, OpenCL, MPI. Libraries & compilers: Math Kernel Library, Integrated Performance Primitives, Intel Compilers. Code analysis: Advisor XE, VTune Amplifier XE, Inspector XE, Trace Analyzer.
  • 41. Options for Thread Parallelism, from greatest ease of use / code maintainability to most programmer control: Intel® Math Kernel Library, OpenMP*, Intel® Threading Building Blocks, Intel® Cilk™ Plus, OpenCL*, Pthreads* and other threading libraries. A choice of unified programming to target both Intel® Xeon® and Intel® Xeon Phi™ architecture!
  • 42. Introduction · High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software · Performance and Thread Parallelism: OpenMP · Conclusions & References
  • 43. OpenMP* on the Coprocessor
• The basics work just like on the host CPU
• For both native and offload models
• Need to specify -openmp
• There are 4 hardware thread contexts per core
• Need at least 2 x ncore threads for good performance
– For all except the most memory-bound workloads
– Often, 3x or 4x (number of available cores) is best
– Very different from hyperthreading on the host!
– -opt-threads-per-core=n advises the compiler how many threads to optimize for
• If you don’t saturate all available threads, be sure to set KMP_AFFINITY to control thread distribution
  • 44. Thread Affinity Interface: allows OpenMP threads to be bound to physical or logical cores.
• export environment variable KMP_AFFINITY=
– physical: use all physical cores before assigning threads to other logical cores (other hardware thread contexts)
– compact: assign threads to consecutive h/w contexts on the same physical core (e.g. to benefit from shared cache)
– scatter: assign consecutive threads to different physical cores (e.g. to maximize access to memory)
– balanced: blend of compact & scatter (currently only available for Intel® MIC Architecture)
• Helps optimize access to memory or cache
• Particularly important if not all available h/w threads are used, else some physical cores may be idle while others run multiple threads
• See compiler documentation for (much) more detail
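As a hedged sketch, affinity settings for a 60-core card when deliberately undersubscribing (the specific values are illustrative; tune per workload):

```shell
# 60 physical cores with 4 hw threads each: use 180 of the 240 threads (3x cores)
export OMP_NUM_THREADS=180
# balanced: spread threads across all cores first, then pack neighbors together
export KMP_AFFINITY=balanced
# alternative when maximizing per-thread memory bandwidth:
#   export KMP_AFFINITY=scatter
echo "threads=$OMP_NUM_THREADS affinity=$KMP_AFFINITY"
```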
  • 45. OpenMP defaults: OMP_NUM_THREADS defaults to
• 1 x ncore for the host (or 2x if hyperthreading is enabled)
• 4 x ncore for native coprocessor applications
• 4 x (ncore-1) for offload applications; one core is reserved for offload daemons and the OS
Defaults may be changed via environment variables or via API calls on either the host or the coprocessor.
  • 46. Target OpenMP environment (offload). Use target-specific APIs to set values for the coprocessor target only, called from the host, e.g. omp_set_num_threads_target(), omp_set_nested_target(), etc.
• Protect with #ifdef __INTEL_OFFLOAD, undefined with –no-offload
• Fortran: USE MIC_LIB and OMP_LIB; C: #include <offload.h>
Or define MIC-specific versions of env vars using MIC_ENV_PREFIX=MIC (no underscore)
• Values on MIC no longer default to values on the host
• Set values specific to MIC using:
export MIC_OMP_NUM_THREADS=120 (all cards)
export MIC_2_OMP_NUM_THREADS=180 (for card #2, etc.)
export MIC_3_ENV=“OMP_NUM_THREADS=240|KMP_AFFINITY=balanced”
  • 47. Introduction · High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software · Performance and Thread Parallelism: MKL · Conclusions & References
  • 49. MKL Usage Models on Intel® Xeon Phi™ Coprocessor
• Automatic Offload: no code changes required; automatically uses both host and target; transparent data transfer and execution management
• Compiler Assisted Offload: explicit control of data transfer and remote execution using compiler offload pragmas/directives; can be used together with Automatic Offload
• Native Execution: uses the coprocessors as independent nodes; input data is copied to targets in advance
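A minimal sketch of the Automatic Offload model above, enabled purely from the environment with no source changes (MKL_MIC_ENABLE and MKL_MIC_WORKDIVISION are the documented switches; the application binary name is hypothetical):

```shell
# Enable Automatic Offload for any MKL-linked binary; MKL decides at run
# time whether a call (e.g. a large ?GEMM) is worth sending to the card.
export MKL_MIC_ENABLE=1
# Optionally cap how much of the computation goes to the coprocessor(s):
export MKL_MIC_WORKDIVISION=0.5
# ./my_mkl_app    # hypothetical application, source unchanged
echo "AO=$MKL_MIC_ENABLE workdivision=$MKL_MIC_WORKDIVISION"
```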
  • 50. MKL Execution Models, from multicore-centric (Intel® Xeon®) to many-core-centric (Intel® Xeon Phi™): Multicore Hosted for general purpose serial and parallel computing; Offload for codes with highly-parallel phases; Symmetric for codes with balanced needs; Many Core Hosted for highly-parallel codes.
  • 51. Work Division Control in MKL Automatic Offload.
API example: mkl_mic_set_workdivision(MKL_TARGET_MIC, 0, 0.5) ... offload 50% of the computation only to the 1st card.
Environment variable example: MKL_MIC_0_WORKDIVISION=0.5 ... offload 50% of the computation only to the 1st card.
  • 52. How to Use MKL with Compiler Assisted Offload: the same way you would offload any function call to the coprocessor. An example in C:
#pragma offload target(mic) in(transa, transb, N, alpha, beta) \
    in(A:length(matrix_elements)) in(B:length(matrix_elements)) \
    in(C:length(matrix_elements)) out(C:length(matrix_elements) alloc_if(0))
{
    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
}
  • 53. Introduction · High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software · Performance and Thread Parallelism · Conclusions & References
  • 54. Conclusions. Intel® Xeon Phi™ coprocessor advantages: • comparable performance potential to other accelerators • faster time to solution due to reduced development effort • better investment protection with a single code base for processors and coprocessors. A flexible and wide range of programming models, from pure native to offloaded and all variants in between, all with the familiar Intel development environment.
  • 55. Intel® Xeon Phi™ Coprocessor Developer Site: http://software.intel.com/mic-developer. One stop shop for: tools & software downloads; getting started development guides; video workshops, tutorials, & events; code samples & case studies; articles, forums, & blogs; associated product links.
  • 57. Legal Disclaimer & Optimization Notice. INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries. Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations.
Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804