© 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
HW and SW Architecture of the Intel® Xeon Phi™ Coprocessor
Leo Borges (leonardo.borges@intel.com)
Intel - Software and Services Group
iStep-Brazil, August 2013
Introduction
High-level overview of the Intel® Xeon Phi™ platform:
Hardware and Software
Performance and Thread Parallelism
Conclusions & References
Big Gains for Selected Applications
Efficient vectorization, threading, and parallel execution drive higher performance for many applications.
[Chart: theoretical performance (1.00x to 7.00x) as a function of the fraction of the code that runs parallel (0.00 to 1.00) and the fraction vectorized (0% to 100%)]
* Theoretical acceleration using a highly-parallel Intel® Xeon Phi™ coprocessor versus a standard multi-core Intel® Xeon® processor
Vectorize, parallelize, and scale to manycore. Example domains: medical imaging and biophysics; computer-aided design & manufacturing; climate modeling & weather prediction; financial analyses, trading; energy & oil exploration; digital content creation.
Evaluating Your Applications for Intel® Xeon Phi™
Ask three questions; a YES to each points toward the coprocessor:
• Can your workload scale to over 100 threads?
• Can your workload benefit from large vectors?
• Can your workload benefit from more memory bandwidth?
Use Intel® Xeon Phi™ coprocessors for applications that scale with:
• Threads • Vectors • Memory Bandwidth
Introduction
High-level overview of the Intel® Xeon Phi™ platform:
Hardware and Software
Performance and Thread Parallelism
Conclusions & References
Intel® Xeon Phi™ Product Family: Based on the Intel MIC Architecture
Intel Many Integrated Core (MIC, pronounced “Mike”) – product family/architecture for highly parallel applications
• Based on a large number of smaller, low-power Intel Architecture cores
• 512-bit wide vector engine
• Complements the Intel Xeon processor product line
• Provides breakthrough performance for highly parallel apps
– Familiar x86 programming model
– Same source code supports both Intel Xeon processor & Intel Xeon Phi coprocessor
– Initially a coprocessor with PCI Express form factor
First products announced at SC12, code named Knights Corner (KNC):
• Up to 61 cores, 4 threads per core
• Up to 16GB GDDR5 memory (up to 352 GB/s)
• 225-300W (cooling: both passive & active SKUs)
• x16 PCIe form factor (requires an IA host)
Each Intel® Xeon Phi™ coprocessor core is a fully functional multi-thread execution unit
• >50 in-order cores
• Ring interconnect
• 64-bit addressing
• Scalar unit based on the Intel® Pentium® processor family
• Two pipelines
– Dual issue with scalar instructions
• One-per-clock scalar pipeline throughput
– 4 clock latency from issue to resolution
• 4 hardware threads per core
• Each thread issues instructions in turn
• Round-robin execution hides scalar unit latency
[Core diagram: scalar registers, vector registers, 512K L2 cache, 32K L1 I-cache, 32K L1 D-cache, instruction decode, scalar unit, vector unit, ring stop]
Each Intel® Xeon Phi™ coprocessor core has a fully functional vector unit
• Optimized for single and double precision
• All-new vector unit
• 512-bit SIMD instructions – not Intel® SSE, MMX™, or Intel® AVX
• 32 512-bit wide vector registers
– Hold 16 singles or 8 doubles per register
• Fully-coherent L1 and L2 caches
Takeaway: Vectorization is important
Individual cores are tied together via fully coherent caches into a bidirectional ring
• L1: 32K I-cache and 32K D-cache per core; 3-cycle access; up to 8 concurrent accesses
• L2: 512K cache per core; 11-cycle best access; up to 32 concurrent accesses
• GDDR5 memory: 16 memory channels at up to 5.5 GT/s; 16 GB with ~300 ns access
• Bidirectional ring: 115 GB/sec
• Distributed Tag Directory (DTD) reduces ring snoop traffic
• The PCIe port has its own ring stop
Takeaway: Parallelization and data placement are important
Each Xeon Phi can be addressed as an individual node in the cluster
• 6 to 16 GB GDDR5 memory
Intel® Xeon Phi™ Coprocessors
• 3 Family – outstanding parallel computing solution; performance/$ leadership
 3120P, 3120A: 6GB GDDR5, 240 GB/s, >1 TFlops DP
• 5 Family – optimized for high density environments; performance/watt leadership
 5120P, 5120D: 8GB GDDR5, >300 GB/s, >1 TFlops DP
• 7 Family – highest level of features; performance leadership
 7120P, 7120X: 16GB GDDR5, 352 GB/s, >1.2 TFlops DP, Turbo
Introduction
High-level overview of the Intel® Xeon Phi™ platform:
Hardware and Software
Performance Considerations
Performance and Thread Parallelism
Conclusions & References
Reminder: Vectorization, What is it?

for (i=0;i<=MAX;i++)
 c[i]=a[i]+b[i];

• Scalar: one instruction, one mathematical operation (A + B → C)
• Vector: one instruction, eight mathematical operations¹ (a[i..i+7] + b[i..i+7] → c[i..i+7])
• Vectorization is core-level parallelism
1. The number of operations per instruction varies based on which SIMD instruction is used and the width of the operands
SIMD Vector Instructions per Family

Instruction | Instruction Width | Operand Width | Operations per Instruction | Family
SSE     | 128-bit | 32-bit (SP) |  4 | Westmere
SSE     | 128-bit | 64-bit (DP) |  2 | Westmere
AVX     | 256-bit | 32-bit (SP) |  8 | SandyBridge
AVX     | 256-bit | 64-bit (DP) |  4 | SandyBridge
MIC ISA | 512-bit | 32-bit (SP) | 16 | Xeon Phi
MIC ISA | 512-bit | 64-bit (DP) |  8 | Xeon Phi

Each step (SSE → AVX → MIC ISA) doubles the operations per instruction (2X, 2X).
Theoretical Peak Flops on Xeon and Xeon Phi
Sandy Bridge/Ivy Bridge: two 256-bit SIMD operations per cycle
 8 MUL (32b) and 8 ADD (32b): 16 single-precision flops/cycle
 4 MUL (64b) and 4 ADD (64b): 8 double-precision flops/cycle
Theoretical peak for a 2-socket E5-2697 v2 (12 cores @ 2.7 GHz):
 16 [flops/cycle] * 2 [sockets] * 12 [cores] * 2.7 [Gcycles/sec] = 1036.8 [Gflops/sec] SP
 8 [flops/cycle] * 2 [sockets] * 12 [cores] * 2.7 [Gcycles/sec] = 518.4 [Gflops/sec] DP
Xeon Phi: one 512-bit SIMD FMA per cycle
 16 MUL (32b) and 16 ADD (32b): 32 single-precision flops/cycle
 8 MUL (64b) and 8 ADD (64b): 16 double-precision flops/cycle
Theoretical peak for a KNC 7120X (61 cores @ 1.24 GHz):
 32 [flops/cycle] * 61 [cores] * 1.24 [Gcycles/sec] = 2420.5 [Gflops/sec] SP
 16 [flops/cycle] * 61 [cores] * 1.24 [Gcycles/sec] = 1210.2 [Gflops/sec] DP
Theoretical Memory Bandwidth on Xeon and Xeon Phi
Basic rule for theoretical memory BW [bytes/second]:
 [bytes/channel] * memory frequency [Gcycles/sec] * number of channels * number of sockets
Sandy Bridge/Ivy Bridge: 4 channels, 2 sockets, 1600/1866 MHz memory
 8 * 1.600 * 4 * 2 = 102 GB/s peak (ST: 80 GB/s) on SNB-EP
 8 * 1.866 * 4 * 2 = 120 GB/s peak (ST: 90 GB/s) on IVB-EP
Xeon Phi: 16 channels, 5.5 GT/s memory
 4 [bytes/channel] * 5.5 [GT/s] * 16 [channels] = 352 GB/s peak (ST: 170 GB/s*) on KNC 7120X
 *ECC enabled
Synthetic Benchmarks: Intel® Xeon Phi™ Coprocessor and Intel® MKL
Higher is better; 2S Intel® Xeon® vs. Intel Xeon Phi (ECC on):
• STREAM Triad (GB/s): 75 vs. 171 – up to 2.2X
• SMP Linpack (GF/s): 330 vs. 802 – up to 2.4X (75% efficient)
• DGEMM (GF/s): 347 vs. 887 – up to 2.5X (83% efficient)
• SGEMM (GF/s): 728 vs. 1,796 – up to 2.4X (84% efficient)
Notes
1. Intel® Xeon® Processor E5-2680 used for all: SGEMM matrix = 12800 x 12800, DGEMM matrix = 10752 x 10752, SMP Linpack matrix = 26000 x 26000
2. Intel® Xeon Phi™ coprocessor SE10P (ECC on) with “Gold” SW stack: SGEMM matrix = 12800 x 12800, DGEMM matrix = 12800 x 12800, SMP Linpack matrix = 26872 x 28672
3. Average single-node results from measurements across a set of nodes from the TACC+ Stampede* cluster++
+ Texas Advanced Computing Center (TACC) at the University of Texas at Austin.
++ Measured on the TACC+ Stampede cluster
Coprocessor results: benchmark run 100% on the coprocessor, no help from the Intel® Xeon® processor host (aka native)
Introduction
High-level overview of the Intel® Xeon Phi™ platform:
Hardware and Software
Native, Offload and Variations
Performance and Thread Parallelism
Conclusions & References
Wide Spectrum of Execution Models
Range of models to meet application needs, from multicore centric (Intel® Xeon® processors) to many-core centric (Intel® Many Integrated Core coprocessors):
• Multi-core-hosted – general purpose serial and parallel computing; main(), foo(), and MPI_*() all run on the multicore host
• Offload – codes with highly-parallel phases; main() runs on the host and offloads foo() to the many-core coprocessor
• Symmetric – codes with balanced needs; main(), foo(), and MPI_*() run on both host and coprocessor
• Many-core-hosted – highly-parallel codes; main(), foo(), and MPI_*() all run on the coprocessor
The Intel Manycore Platform Software Stack (MPSS) provides Linux on the coprocessor
Host processor and Intel® Xeon Phi™ coprocessor each run their own Linux* OS, connected over the PCI-E bus:
• Host side: Intel® Xeon Phi™ coprocessor support libraries, tools, and drivers (system-level code), plus user-level code
• Coprocessor side: Intel® Xeon Phi™ coprocessor communication and application-launch support (system-level code), plus user-level code
Runs either as an accelerator for offloaded host computation…
• Host side: host-side offload application (user code), plus offload libraries, user-level driver, and user-accessible APIs and libraries
• Coprocessor side: target-side offload application (user code), plus offload libraries and user-accessible APIs and libraries
Advantages
• More memory available
• Better file access
• Host better on serial code
• Better use of resources
…Or runs as a native or MPI* compute node via IP or OFED
A virtual terminal session (ssh or telnet connection to the coprocessor IP address, or over the IB fabric) runs a target-side “native” application: user code plus standard OS libraries and any 3rd-party or Intel libraries.
Advantages
• Simpler model
• No directives
• Easier port
• Good kernel test
Use if
• Not serial
• Modest memory
• Complex code
Intel® Xeon Phi™ Coprocessor Becomes a Network Node
Each Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor pair is joined by a virtual network connection: Intel® Xeon Phi™ architecture + Linux enables IP addressability.
Flexible: Enables Multiple Programming Models
• Coprocessor only – MPI ranks and data live entirely on the coprocessors: a homogenous network of many-core CPUs
• Host + Offload – MPI ranks on the host CPUs offload work and data to the coprocessors: a homogenous network of heterogeneous nodes
• Symmetric – MPI ranks and data on both CPUs and coprocessors: a heterogeneous network of homogeneous CPUs
Copyright© 2013, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Xeon Phi can work as a Node
The Intel® Manycore Platform Software Stack (Intel® MPSS) provides Linux* on the coprocessor
• Authenticated users can treat it like another node
• Add -mmic to compiles to create native programs
• Intel MPSS supplies a virtual FS and native execution

icc -O3 -g -mmic -o nativeMIC myNativeProgram.c
sudo scp /opt/intel/composerxe/lib/mic/libiomp5.so root@mic0:/lib64
scp native.exe mic0:/tmp
ssh mic0 "/tmp/native.exe <my-args>"

ssh mic0 top
Mem: 298016K used, 7578640K free, 0K shrd, 0K buff, 100688K cached
CPU: 0.0% usr 0.3% sys 0.0% nic 99.6% idle 0.0% io 0.0% irq 0.0% sirq
Load average: 1.00 1.04 1.01 1/2234 7265
  PID  PPID USER    STAT VSZ   %MEM CPU %CPU COMMAND
 7265  7264 fdkew   R    7060   0.0  14  0.3 top
   43     2 root    SW   0      0.0  13  0.0 [ksoftirqd/13]
 5748     1 root    S    119m   1.5 226  0.0 ./sep_mic_server3.8
 5670     1 micuser S    97872  1.2   0  0.0 /bin/coi_daemon --coiuser=micuser
Xeon Phi can work as a Coprocessor
Compiler Assisted Offload: Examples
• Offload a section of code to the coprocessor:

float pi = 0.0f;
#pragma offload target(mic)
#pragma omp parallel for reduction(+:pi)
for (i=0; i<count; i++) {
  float t = (float)((i+0.5f)/count);
  pi += 4.0f/(1.0f+t*t);
}
pi /= count;

• Offload any function call to the coprocessor:

#pragma offload target(mic) \
  in(transa, transb, N, alpha, beta) \
  in(A:length(matrix_elements)) \
  in(B:length(matrix_elements)) \
  in(C:length(matrix_elements)) \
  out(C:length(matrix_elements) alloc_if(0))
{ sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N,
        &beta, C, &N); }
Compiler Assisted Offload: Example
• An example in Fortran:

!DEC$ ATTRIBUTES OFFLOAD : TARGET( MIC ) :: SGEMM
!DEC$ OMP OFFLOAD TARGET( MIC ) &
!DEC$ IN( TRANSA, TRANSB, M, N, K, ALPHA, BETA, LDA, LDB, LDC ), &
!DEC$ IN( A: LENGTH( NCOLA * LDA )), &
!DEC$ IN( B: LENGTH( NCOLB * LDB )), &
!DEC$ INOUT( C: LENGTH( N * LDC ))
CALL SGEMM( TRANSA, TRANSB, M, N, K, ALPHA, &
            A, LDA, B, LDB, BETA, C, LDC )
Offload directives are independent of function boundaries
Execution semantics:
• If at the first offload the target is available, the target program is loaded
• At each offload, if the target is available the statement is run on the target, else it is run on the host
• At program termination the target program is unloaded

Host code (Intel® Xeon® processor):

f() {
  #pragma offload
  a = b + g();
  h();
}
__attribute__ ((target(mic)))
g() {
  ...
}
h() {
  ...
}

Target code (Intel® Xeon Phi™ coprocessor):

f_part1() {
  a = b + g();
}
__attribute__ ((target(mic)))
g() {
  ...
}
Example – share work between coprocessor and host using OpenMP*

omp_set_nested(1);
#pragma omp parallel private(ip)          /* top level, runs on host */
{
  #pragma omp sections
  {
    #pragma omp section                   /* runs on coprocessor */
    /* use pointer to copy back only part of potential array,
       to avoid overwriting host */
    #pragma offload target(mic) in(xp) in(yp) in(zp) out(ppot:length(np1))
    #pragma omp parallel for private(ip)
    for (i=0;i<np1;i++) {
      ppot[i] = threed_int(x0,xn,y0,yn,z0,zn,nx,ny,nz,xp[i],yp[i],zp[i]);
    }
    #pragma omp section                   /* runs on host */
    #pragma omp parallel for private(ip)
    for (i=0;i<np2;i++) {
      pot[i+np1] =
        threed_int(x0,xn,y0,yn,z0,zn,nx,ny,nz,xp[i+np1],yp[i+np1],zp[i+np1]);
    }
  }
}
Pragmas and directives mark data and code to be offloaded and executed

C/C++ Syntax
• Offload pragma: #pragma offload <clauses> <statement>
 Allow the next statement to execute on the coprocessor or the host CPU
• Variable/function offload properties: __attribute__((target(mic)))
 Compile a function for, or allocate a variable on, both host CPU and coprocessor
• Entire blocks of data/code defs: #pragma offload_attribute(push, target(mic)) … #pragma offload_attribute(pop)
 Mark entire files or large blocks of code to compile for both host CPU and coprocessor

Fortran Syntax
• Offload directive: !dir$ omp offload <clauses> <statement>
 Execute an OpenMP* parallel block on the coprocessor
 !dir$ offload <clauses> <statement>
 Execute the next statement or function on the coprocessor
• Variable/function offload properties: !dir$ attributes offload:<mic> :: <ret-name> OR <var1,var2,…>
 Compile a function or variable for CPU and coprocessor
• Entire code blocks: !dir$ offload begin <clauses> … !dir$ end offload
Options on offloads can control data copying and manage coprocessor dynamic allocation

Clauses
• Multiple coprocessors: target(mic[:unit]) – select specific coprocessors
• Conditional offload: if (condition) / mandatory – select coprocessor or host compute
• Inputs: in(var-list modifiers) – copy from host to coprocessor
• Outputs: out(var-list modifiers) – copy from coprocessor to host
• Inputs & outputs: inout(var-list modifiers) – copy host to coprocessor and back when offload completes
• Non-copied data: nocopy(var-list modifiers) – data is local to target

Modifiers (optional)
• Specify copy length: length(N) – copy N elements of the pointer’s type
• Coprocessor memory allocation: alloc_if ( bool ) – allocate coprocessor space on this offload (default: TRUE)
• Coprocessor memory release: free_if ( bool ) – free coprocessor space at the end of this offload (default: TRUE)
• Control target data alignment: align ( N bytes ) – specify minimum memory alignment on coprocessor
• Array partial allocation & variable relocation: alloc ( array-slice ), into ( var-expr ) – enables partial array allocation and data copy into other vars & ranges
Data Persistence with Compiler Offload

__declspec(target(mic)) static float *A, *B, *C, *C1;

// Transfer matrices A, B, and C to coprocessor and do not de-allocate matrices A and B
#pragma offload target(mic) \
  in(transa, transb, M, N, K, alpha, beta, LDA, LDB, LDC) \
  in(A:length(NCOLA * LDA) free_if(0)) \
  in(B:length(NCOLB * LDB) free_if(0)) \
  inout(C:length(N * LDC))
{ sgemm(&transa, &transb, &M, &N, &K, &alpha, A, &LDA, B, &LDB, &beta, C, &LDC);
}

// Transfer matrix C1 to coprocessor and reuse matrices A and B
#pragma offload target(mic) \
  in(transa1, transb1, M, N, K, alpha1, beta1, LDA, LDB, LDC1) \
  nocopy(A:length(NCOLA * LDA) alloc_if(0) free_if(0)) \
  nocopy(B:length(NCOLB * LDB) alloc_if(0) free_if(0)) \
  inout(C1:length(N * LDC1))
{ sgemm(&transa1, &transb1, &M, &N, &K, &alpha1, A, &LDA, B, &LDB, &beta1, C1, &LDC1);
}

// Deallocate A and B on the coprocessor
#pragma offload target(mic) \
  nocopy(A:length(NCOLA * LDA) free_if(1)) \
  nocopy(B:length(NCOLB * LDB) free_if(1))
{ }
Data Persistence with Compiler Offload

#define ALLOC alloc_if(1) free_if(0)
#define REUSE alloc_if(0) free_if(0)
#define FREE  alloc_if(0) free_if(1)

__declspec(target(mic)) static float *A, *B, *C, *C1;

// Transfer matrices A, B, and C to coprocessor and do not de-allocate matrices A and B
#pragma offload target(mic) \
  in(transa, transb, M, N, K, alpha, beta, LDA, LDB, LDC) \
  in(A:length(NCOLA * LDA) ALLOC) \
  in(B:length(NCOLB * LDB) ALLOC) \
  inout(C:length(N * LDC))
{ sgemm(&transa, &transb, &M, &N, &K, &alpha, A, &LDA, B, &LDB, &beta, C, &LDC);
}

// Transfer matrix C1 to coprocessor and reuse matrices A and B
#pragma offload target(mic) \
  in(transa1, transb1, M, N, K, alpha1, beta1, LDA, LDB, LDC1) \
  nocopy(A:length(NCOLA * LDA) REUSE) \
  nocopy(B:length(NCOLB * LDB) REUSE) \
  inout(C1:length(N * LDC1))
{ sgemm(&transa1, &transb1, &M, &N, &K, &alpha1, A, &LDA, B, &LDB, &beta1, C1, &LDC1);
}

// Deallocate A and B on the coprocessor
#pragma offload_transfer target(mic) \
  nocopy(A:length(NCOLA * LDA) FREE) \
  nocopy(B:length(NCOLB * LDB) FREE)
To handle more complex data structures on the coprocessor, use Virtual Shared Memory
An identical range of virtual addresses is reserved on both host and coprocessor; changes are shared at offload points, allowing:
• Seamless sharing of complex data structures, including linked lists
• Elimination of manual data marshaling and shared array management
• Freer use of new C++ features and standard classes
[Diagram: the C/C++ executable maps the same virtual address range in the host VM and the coprocessor VM; offload code accesses the shared data from either side]
Example: Virtual Shared Memory
• Shared between host and Xeon Phi:

// Shared variable declaration
_Cilk_shared T in1[SIZE];
_Cilk_shared T in2[SIZE];
_Cilk_shared T res[SIZE];

_Cilk_shared void compute_sum()
{
  int i;
  for (i=0; i<SIZE; i++) {
    res[i] = in1[i] + in2[i];
  }
}
(...)
// Call compute_sum on target
_Cilk_offload compute_sum();
Virtual Shared Memory uses special allocation to manage data sharing at offload boundaries
Declare virtual shared data using the _Cilk_shared allocation specifier
Allocate virtual dynamic shared data using these special functions:
 _Offload_shared_malloc(), _Offload_shared_aligned_malloc(),
 _Offload_shared_free(), _Offload_shared_aligned_free()
Shared data copying occurs automatically around offload sections
• Memory is only synchronized on entry to or exit from an offload call
• Only modified data blocks are transferred between host and coprocessor
Allows transfer of C++ objects
• Pointers are transportable when they point to “shared” data addresses
Well-known methods can be used to synchronize access to shared data and prevent data races within offloaded code
• E.g., locks, critical sections, etc.
This model is integrated with the Intel® Cilk™ Plus parallel extensions
Note: Not supported in Fortran – available for C/C++ only
Data sharing between host and coprocessor can be enabled using this Intel® Cilk™ Plus syntax
• Function: int _Cilk_shared f(int x){ return x+1; }
 Code emitted for host and target; may be called from either side
• Global: _Cilk_shared int x = 0;
 Datum is visible on both sides
• File/Function static: static _Cilk_shared int x;
 Datum visible on both sides, only to code within the file/function
• Class: class _Cilk_shared x {…};
 Class methods, members, and operators available on both sides
• Pointer to shared data: int _Cilk_shared *p;
 p is local (not shared), can point to shared data
• A shared pointer: int *_Cilk_shared p;
 p is shared; should only point at shared data
• Entire blocks of code: #pragma offload_attribute(push, _Cilk_shared) … #pragma offload_attribute(pop)
 Mark entire files or blocks of code _Cilk_shared using this pragma
Intel® Cilk™ Plus syntax can also specify the offloading of computation to the coprocessor
• Offloading a function call:
 x = _Cilk_offload func(y);
 func executes on the coprocessor if possible
 x = _Cilk_offload_to (card_num) func(y);
 func must execute on the specified coprocessor or an error occurs
• Offloading asynchronously:
 x = _Cilk_spawn _Cilk_offload func(y);
 func executes on the coprocessor; the continuation is available for stealing
• Offloading a parallel for-loop:
 _Cilk_offload _Cilk_for(i=0; i<N; i++){
   a[i] = b[i] + c[i];
 }
 The loop executes in parallel on the coprocessor; it is implicitly “un-inlined” as a function call
Introduction
High-level overview of the Intel® Xeon Phi™ platform:
Hardware and Software
Performance and Thread Parallelism
Conclusions & References
Comprehensive set of SW tools for Xeon and Xeon Phi Programming

Code Analysis
• Advisor XE
• VTune Amplifier XE
• Inspector XE
• Trace Analyzer

Programming Models
• Intel Cilk Plus
• Threading Building Blocks
• OpenMP
• OpenCL
• MPI
• Offload/Native/MYO

Libraries & Compilers
• Math Kernel Library
• Integrated Performance Primitives
• Intel Compilers
Options for Thread Parallelism
From ease of use / code maintainability to programmer control:
• Intel® Math Kernel Library
• OpenMP*
• Intel® Threading Building Blocks
• Intel® Cilk™ Plus
• OpenCL*
• Pthreads* and other threading libraries
Choice of unified programming to target both Intel® Xeon® and Intel® Xeon Phi™ Architecture!
Introduction
High-level overview of the Intel® Xeon Phi™ platform:
Hardware and Software
Performance and Thread Parallelism: OpenMP
Conclusions & References
OpenMP* on the Coprocessor
• The basics work just like on the host CPU
• For both native and offload models
• Need to specify -openmp
• There are 4 hardware thread contexts per core
• Need at least 2 x ncore threads for good performance
– For all except the most memory-bound workloads
– Often, 3x or 4x (number of available cores) is best
– Very different from hyperthreading on the host!
– -opt-threads-per-core=n advises compiler how many
threads to optimize for
• If you don’t saturate all available threads, be sure to
set KMP_AFFINITY to control thread distribution
Thread Affinity Interface
Allows OpenMP threads to be bound to physical or logical cores
• export environment variable KMP_AFFINITY=
 – physical: use all physical cores before assigning threads to other logical cores (other hardware thread contexts)
 – compact: assign threads to consecutive h/w contexts on the same physical core (e.g. to benefit from shared cache)
 – scatter: assign consecutive threads to different physical cores (e.g. to maximize access to memory)
 – balanced: blend of compact & scatter (currently only available for Intel® MIC Architecture)
• Helps optimize access to memory or cache
• Particularly important if not all available h/w threads are used
 – else some physical cores may be idle while others run multiple threads
• See the compiler documentation for (much) more detail
OpenMP defaults
• OMP_NUM_THREADS defaults to
• 1 x ncore for host (or 2x if hyperthreading enabled)
• 4 x ncore for native coprocessor applications
• 4 x (ncore-1) for offload applications
– one core is reserved for offload daemons and OS
• Defaults may be changed via environment variables
or via API calls on either the host or the coprocessor
Target OpenMP environment (offload)
Use target-specific APIs to set values for the coprocessor target only, e.g.
omp_set_num_threads_target() (called from the host),
omp_set_nested_target(), etc.
• Protect with #ifdef __INTEL_OFFLOAD (undefined with -no-offload)
• Fortran: USE MIC_LIB and OMP_LIB; C: #include <offload.h>
Or define MIC-specific versions of env vars using
MIC_ENV_PREFIX=MIC (no underscore)
• Values on MIC no longer default to values on host
• Set values specific to MIC using
export MIC_OMP_NUM_THREADS=120 (applies to all cards)
export MIC_2_OMP_NUM_THREADS=180 (for card #2, etc.)
export MIC_3_ENV="OMP_NUM_THREADS=240|KMP_AFFINITY=balanced"
Introduction
High-level overview of the Intel® Xeon Phi™ platform:
Hardware and Software
Performance and Thread Parallelism: MKL
Conclusions & References
MKL Usage Models on the Intel® Xeon Phi™ Coprocessor
• Automatic Offload
– No code changes required
– Automatically uses both host and target
– Transparent data transfer and execution management
• Compiler Assisted Offload
– Explicit control of data transfer and remote execution using compiler offload pragmas/directives
– Can be used together with Automatic Offload
• Native Execution
– Uses the coprocessors as independent nodes
– Input data is copied to targets in advance
MKL Execution Models
Execution models span a spectrum from multicore-centric (Intel® Xeon®) to many-core-centric (Intel® Xeon Phi™):
• Multicore Hosted: general-purpose serial and parallel computing, on the host only
• Offload: codes with highly parallel phases
• Symmetric: codes with balanced needs, running on host and coprocessor together
• Many-Core Hosted: highly parallel codes, on the coprocessor only
Work Division Control in MKL Automatic Offload
• API call: mkl_mic_set_workdivision(MKL_TARGET_MIC, 0, 0.5)
offloads 50% of the computation to the 1st card only.
• Environment variable: MKL_MIC_0_WORKDIVISION=0.5
has the same effect.
How to Use MKL with Compiler Assisted Offload
• The same way you would offload any function call
to the coprocessor.
• An example in C:
#pragma offload target(mic) \
    in(transa, transb, N, alpha, beta) \
    in(A:length(matrix_elements)) \
    in(B:length(matrix_elements)) \
    in(C:length(matrix_elements)) \
    out(C:length(matrix_elements) alloc_if(0))
{
    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N,
          &beta, C, &N);
}
Introduction
High-level overview of the Intel® Xeon Phi™ platform:
Hardware and Software
Performance and Thread Parallelism
Conclusions & References
Conclusions
Intel® Xeon Phi™ coprocessor advantages:
• Comparable performance potential to other accelerators
• Faster time to solution due to reduced development effort
• Better investment protection with a single code base for
processors and coprocessors
A flexible and wide range of programming models: from pure native to offloaded, and all variants in between
All with the familiar Intel development environment
Intel® Xeon Phi™ Coprocessor Developer Site: http://software.intel.com/mic-developer
One stop shop for:
• Tools & Software Downloads
• Getting Started Development Guides
• Video Workshops, Tutorials, & Events
• Code Samples & Case Studies
• Articles, Forums, & Blogs
• Associated Product Links
Thank you. (Obrigado.)
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY
ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS
DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR
IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES
RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY
PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on
Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using
specific computer systems, components, software, operations and functions. Any change to any of
those factors may cause the results to vary. You should consult other information and performance
tests to assist you in fully evaluating your contemplated purchases, including the performance of that
product when combined with other products.
Copyright © 2013, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core,
VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that
are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and
other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended
for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for
Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information
regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Legal Disclaimer & Optimization Notice
Do Multicore ao Manycore: Práticas de Configuração, Compilação e Execução no coprocessador Intel® Xeon Phi™ - Intel Software Conference 2013

  • 1. © 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. HW and SW Architecture of the Intel® Xeon Phi™ Coprocessor Leo Borges (leonardo.borges@intel.com) Intel - Software and Services Group iStep-Brazil, August 2013 1
  • 2. Click to edit Master title style 2 Introduction High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software Performance and Thread Parallelism Conclusions & References
  • 3. © 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 3 * Theoretical acceleration using a highly-parallel Intel® Xeon Phi™ coprocessor versus a standard multi-core Intel® Xeon® processor Efficient vectorization, threading, and parallel execution drives higher performance for many applications Fraction Parallel % Vector Performance 7.00 5.00 3.00 1.00 1.00 0.20 0.00 0.40 0.60 0.80 0% 100% 50% 75% 25% Big Gains for Selected Applications Scale to manycore Parallelize Vectorize Medical imaging and biophysics Computer Aided Design & Manufacturing Climate modeling & weather prediction Financial analyses, trading Energy &oil exploration Digital content creation
  • 4. © 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 4 YES Evaluating Your Applications for Intel® Xeon Phi™ NO YES YES YES Can your workload benefit from more memory bandwidth? Can your workload benefit from large vectors? NO NO Can your workload scale to over 100 threads? Use Intel® Xeon Phi™ coprocessors for applications that scale with: • Threads • Vectors • Memory Bandwidth
  • 5. Click to edit Master title style 5 Introduction High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software Performance and Thread Parallelism Conclusions & References
  • 6. © 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 6 Intel Many Integrated Core (MIC, pronounced “Mike”) Product Family/Architecture for Highly Parallel Applications • Based on large number of smaller, low power, Intel Arch. Cores • 512-bit wide vector engine • Compliments Intel Xeon processor product line • Provides breakthrough performance for highly parallel apps – Familiar x86 programming model – Same source code supports both Intel Xeon processor & Intel Xeon Phi coprocessor – Initially a coprocessor with PCI Express form factor First products announced at SC12: Code named Knights Corner (KNC) • Up to 61 cores, 4 threads per core • Up to 16GB GDDR5 memory (up to 352 GB/s) • 225-300W (Cooling: Both passive & active SKUs) • x16 PCIe Form-Factor (requires IA host) 6 Intel® Xeon® Phi™ Product Family Based on the Intel MIC Architecture
  • 7. © 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 7 Each Intel® Xeon Phi™ Coprocessor core is a fully functional multi-thread Execution unit • >50 in-order cores • Ring interconnect • 64-bit addressing • Scalar unit based on Intel® Pentium® processor family • Two pipelines - Dual issue with scalar instructions • One-per-clock scalar pipeline throughput - 4 clock latency from issue to resolution • 4 hardware threads per core • Each thread issues instructions in turn • Round-robin execution hides scalar unit latencyRing Scalar Registers Vector Registers 512K L2 Cache 32K L1 I-cache 32K L1 D-cache Instruction Decode Vector Unit Scalar Unit
  • 8. © 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 8 Each Intel® Xeon Phi™ Coprocessor core is a fully functional multi-thread Vector unit Ring Scalar Registers Vector Registers 512K L2 Cache 32K L1 I-cache 32K L1 D-cache Instruction Decode Vector Unit Scalar Unit • Optimized • Single and Double precision • All new vector unit • 512-bit SIMD Instructions – not Intel® SSE, MMX™, or Intel® AVX • 32 512-bit wide vector registers - Hold 16 singles or 8 doubles per register • Fully-coherent L1 and L2 caches Takeaway: Vectorization is important
  • 9. © 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 9 Individual cores are tied together via fully coherent caches into a bidirectional ring • 9 GDDR GDDR GDDR GDDR PCIexp L1 32K I- D-cache per core 3 cycle access Up to 8 concurrent accesses L2 512K cache per core 11 cycle best access Up to 32 concurrent accesses GDDR5 Memory 16 memory channels - Up to 5.5 Gb/sec 16 GB 300ns access Bidirectional ring 115 GB/sec Distributed Tag Directory (DTD) reduces ring snoop traffic PCIe port has its own ring stop Takeaway: Parallelization and data placement are important
  • 10. © 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 10 Each Xeon Phi can be addressed as an Individual Node in the Cluster • 1 0 6 to 16 GB GDDR5 memory
  • 11. INTEL CONFIDENTIAL • Click to edit Master text styles ‒ Second level  Third level o Fourth level  Fifth level Click to edit Master title style 11 © 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 3 Family Outstanding Parallel Computing Solution Performance/$ leadership Intel® Xeon Phi™ Coprocessors 3120P 3120A 5 Family Optimized for High Density Environments Performance/watt leadership 5120D 7 Family Highest Level of Features Performance leadership 7120P 7120X 16GB GDDR5 352 GB/s > 1.2 TFlops DP Turbo T 8GB GDDR5 >300 GB/s >1 TFlops DP 6GB GDDR5 240 GB/s >1 TFlops DP 5120P
  • 12. Click to edit Master title style 12 Introduction High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software Performance Considerations Performance and Thread Parallelism Conclusions & References
  • 13. © 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 13 Reminder: Vectorization, What is it? for (i=0;i<=MAX;i++) c[i]=a[i]+b[i]; + c[i+7] c[i+6] c[i+5] c[i+4] c[i+3] c[i+2] c[i+1] c[i] b[i+7] b[i+6] b[i+5] b[i+4] b[i+3] b[i+2] b[i+1] b[i] a[i+7] a[i+6] a[i+5] a[i+4] a[i+3] a[i+2] a[i+1] a[i] Vector - One Instruction - Eight Mathematical Operations1 1. Number of operations per instruction varies based on the which SIMD instruction is used and the width of the operands + C B A Scalar - One Instruction - One Mathematical Operation • Vectorizations is Core-Level Parallelism
  • 14. SIMD Vector Instructions per Family:
Instruction | Instruction Width | Operand Width | Operations per Instruction | Family
SSE | 128-bit | 32-bit (SP) | 4 | Westmere
SSE | 128-bit | 64-bit (DP) | 2 | Westmere
AVX | 256-bit | 32-bit (SP) | 8 | Sandy Bridge
AVX | 256-bit | 64-bit (DP) | 4 | Sandy Bridge
MIC ISA | 512-bit | 32-bit (SP) | 16 | Xeon Phi
MIC ISA | 512-bit | 64-bit (DP) | 8 | Xeon Phi
Each generation doubles (2X) the vector width of the previous one.
  • 15. Theoretical Peak Flops on Xeon and Xeon Phi.
Sandy Bridge/Ivy Bridge: two 256-bit SIMD operations per cycle; 8 MUL (32b) and 8 ADD (32b) = 16 single-precision flops/cycle; 4 MUL (64b) and 4 ADD (64b) = 8 double-precision flops/cycle.
Theoretical peak for a 2-socket E5-2697 v2 (12 cores @ 2.7 GHz):
16 [flops/cycle] * 2 [sockets] * 12 [cores] * 2.7 [Gcycles/sec] = 1036.8 [Gflops/sec] SP
8 [flops/cycle] * 2 [sockets] * 12 [cores] * 2.7 [Gcycles/sec] = 518.4 [Gflops/sec] DP
Xeon Phi: one 512-bit SIMD FMA per cycle; 16 MUL (32b) and 16 ADD (32b) = 32 single-precision flops/cycle; 8 MUL (64b) and 8 ADD (64b) = 16 double-precision flops/cycle.
Theoretical peak for a KNC 7120X (61 cores @ 1.24 GHz):
32 [flops/cycle] * 61 [cores] * 1.24 [Gcycles/sec] = 2420.5 [Gflops/sec] SP
16 [flops/cycle] * 61 [cores] * 1.24 [Gcycles/sec] = 1210.2 [Gflops/sec] DP
  • 16. Theoretical Memory Bandwidth on Xeon and Xeon Phi.
Basic rule for theoretical memory BW [bytes/second]: [bytes/channel] * memory frequency [Gcycles/sec] * number of channels * number of sockets.
Sandy Bridge/Ivy Bridge: 4 channels, 2 sockets and 1600/1866 MHz memory.
8 * 1.600 * 4 * 2 = 102 GB/s peak (ST: 80 GB/s) on SNB-EP
8 * 1.866 * 4 * 2 = 120 GB/s peak (ST: 90 GB/s) on IVB-EP
Xeon Phi: 16 channels, 5.5 GT/s memory.
4 [bytes/channel] * 5.5 [GT/s] * 16 [channels] = 352 GB/s peak (ST: 170 GB/s*) on KNC 7120X. *ECC enabled.
  • 17. Synthetic Benchmarks: Intel® Xeon Phi™ Coprocessor and Intel® MKL (higher is better; 2S Intel® Xeon® vs. Intel Xeon Phi, ECC on):
SGEMM: 728 vs. 1,796 GF/s, up to 2.4X (84% efficient)
DGEMM: 347 vs. 887 GF/s, up to 2.5X (83% efficient)
SMP Linpack: 330 vs. 802 GF/s, up to 2.4X (75% efficient)
STREAM Triad: 75 vs. 171 GB/s, up to 2.2X
Coprocessor results: benchmark run 100% on coprocessor, no help from Intel® Xeon® processor host (aka native).
Notes: 1. Intel® Xeon® Processor E5-2680 used for all; SGEMM matrix = 12800 x 12800, DGEMM matrix 10752 x 10752, SMP Linpack matrix 26000 x 26000. 2. Intel® Xeon Phi™ coprocessor SE10P (ECC on) with “Gold” SW stack; SGEMM matrix = 12800 x 12800, DGEMM matrix 12800 x 12800, SMP Linpack matrix 26872 x 28672. 3. Average single-node results from measurements across a set of nodes from the TACC+ Stampede* cluster. + Texas Advanced Computing Center (TACC) at the University of Texas at Austin. ++ Measured on the TACC+ Stampede cluster.
  • 18. Introduction · High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software · Native, Offload and Variations · Performance and Thread Parallelism · Conclusions & References
  • 19. Wide Spectrum of Execution Models, from multicore-centric (Intel® Xeon® processors) to many-core-centric (Intel® Many Integrated Core coprocessors): Multi-core-hosted for general purpose serial and parallel computing; Offload for codes with highly-parallel phases; Symmetric for codes with balanced needs; Many-core-hosted for highly-parallel codes. A range of models to meet application needs.
  • 20. The Intel Manycore Platform Software Stack (MPSS) provides Linux on the coprocessor. The host processor runs a Linux* OS with Intel® Xeon Phi™ coprocessor support libraries, tools, and drivers; the coprocessor runs its own Linux* OS with communication and application-launch support. Both sides host system-level and user-level code, connected over the PCI-E bus.
  • 21. Runs either as an accelerator for offloaded host computation: user code in the host-side offload application links against offload libraries, a user-level driver, and user-accessible APIs and libraries, which communicate over the PCI-E bus with the target-side offload application on the coprocessor. Advantages: • More memory available • Better file access • Host better on serial code • Better use of resources
  • 22. ...or runs as a native or MPI* compute node via IP or OFED: a target-side “native” application (user code plus standard OS libraries and any 3rd-party or Intel libraries) is reached through an ssh or telnet connection to the coprocessor IP address (virtual terminal session) or over the IB fabric. Advantages: • Simpler model • No directives • Easier port • Good kernel test. Use if: • Not serial • Modest memory • Complex code
  • 23. Intel® Xeon Phi™ Coprocessor Becomes a Network Node: each node pairs an Intel® Xeon® processor with an Intel® Xeon Phi™ coprocessor over a virtual network connection. Intel® Xeon Phi™ architecture + Linux enables IP addressability.
  • 24. Flexible: Enables Multiple Programming Models. Coprocessor only: a homogeneous network of many-core CPUs (MPI ranks on the coprocessors, data over the network). Symmetric: a heterogeneous network of homogeneous CPUs (MPI ranks on both hosts and coprocessors). Host + Offload: a homogeneous network of heterogeneous nodes (MPI ranks on the hosts, offload to the coprocessors).
  • 25. Xeon Phi can work as a Node. The Intel® Manycore Platform Software Stack (Intel® MPSS) provides Linux* on the coprocessor: authenticated users can treat it like another node. Add -mmic to compiles to create native programs; Intel MPSS supplies a virtual FS and native execution.
icc –O3 –g –mmic –o nativeMIC myNativeProgram.c
sudo scp /opt/intel/composerxe/lib/mic/libiomp5.so root@mic0:/lib64
scp native.exe mic0:/tmp
ssh mic0 “/tmp/native.exe <my-args>”
ssh mic0 top
Mem: 298016K used, 7578640K free, 0K shrd, 0K buff, 100688K cached
CPU: 0.0% usr 0.3% sys 0.0% nic 99.6% idle 0.0% io 0.0% irq 0.0% sirq
Load average: 1.00 1.04 1.01 1/2234 7265
PID PPID USER STAT VSZ %MEM CPU %CPU COMMAND
7265 7264 fdkew R 7060 0.0 14 0.3 top
43 2 root SW 0 0.0 13 0.0 [ksoftirqd/13]
5748 1 root S 119m 1.5 226 0.0 ./sep_mic_server3.8
5670 1 micuser S 97872 1.2 0 0.0 /bin/coi_daemon --coiuser=micuser
  • 26. Xeon Phi can work as a Coprocessor. Compiler Assisted Offload examples: • Offload a section of code to the coprocessor. • Offload any function call to the coprocessor.
#pragma offload target(mic) in(transa, transb, N, alpha, beta) \
    in(A:length(matrix_elements)) in(B:length(matrix_elements)) \
    in(C:length(matrix_elements)) out(C:length(matrix_elements) alloc_if(0))
{
    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
}

float pi = 0.0f;
#pragma offload target(mic)
#pragma omp parallel for reduction(+:pi)
for (i=0; i<count; i++) {
    float t = (float)((i+0.5f)/count);
    pi += 4.0f/(1.0f+t*t);
}
pi /= count;
  • 27. Compiler Assisted Offload: an example in Fortran:
!DEC$ ATTRIBUTES OFFLOAD : TARGET( MIC ) :: SGEMM
!DEC$ OMP OFFLOAD TARGET( MIC ) &
!DEC$ IN( TRANSA, TRANSB, M, N, K, ALPHA, BETA, LDA, LDB, LDC ), &
!DEC$ IN( A: LENGTH( NCOLA * LDA )), &
!DEC$ IN( B: LENGTH( NCOLB * LDB )), &
!DEC$ INOUT( C: LENGTH( N * LDC ))
CALL SGEMM( TRANSA, TRANSB, M, N, K, ALPHA, &
            A, LDA, B, LDB, BETA, C, LDC )
  • 28. Offload directives are independent of function boundaries. Execution: at the first offload, if the target is available, the target program is loaded; at each offload, if the target is available, the statement is run on the target, else it is run on the host; at program termination the target program is unloaded.
Host (Intel® Xeon® processor):
f() {
    #pragma offload
    a = b + g();
    h();
}
__attribute__ ((target(mic))) g() { ... }
h() { ... }
Target (Intel® Xeon Phi™ coprocessor):
f_part1() { a = b + g(); }
__attribute__ ((target(mic))) g() { ... }
  • 29. Example: share work between coprocessor and host using OpenMP*.
omp_set_nested(1);
#pragma omp parallel private(ip)   /* top level, runs on host */
{
    #pragma omp sections
    {
        #pragma omp section        /* runs on coprocessor */
        /* use pointer to copy back only part of potential array, to avoid overwriting host */
        #pragma offload target(mic) in(xp) in(yp) in(zp) out(ppot:length(np1))
        #pragma omp parallel for private(ip)
        for (i=0;i<np1;i++) {
            ppot[i] = threed_int(x0,xn,y0,yn,z0,zn,nx,ny,nz,xp[i],yp[i],zp[i]);
        }
        #pragma omp section        /* runs on host */
        #pragma omp parallel for private(ip)
        for (i=0;i<np2;i++) {
            pot[i+np1] = threed_int(x0,xn,y0,yn,z0,zn,nx,ny,nz,xp[i+np1],yp[i+np1],zp[i+np1]);
        }
    }
}
  • 30. Pragmas and directives mark data and code to be offloaded and executed.
C/C++ syntax:
Offload pragma: #pragma offload <clauses> <statement> ... allow next statement to execute on coprocessor or host CPU
Variable/function offload properties: __attribute__((target(mic))) ... compile function for, or allocate variable on, both host CPU and coprocessor
Entire blocks of data/code defs: #pragma offload_attribute(push, target(mic)) ... #pragma offload_attribute(pop) ... mark entire files or large blocks of code to compile for both host CPU and coprocessor
Fortran syntax:
Offload directive: !dir$ omp offload <clauses> <statement> ... execute OpenMP* parallel block on coprocessor; !dir$ offload <clauses> <statement> ... execute next statement or function on coprocessor
Variable/function offload properties: !dir$ attributes offload:<mic> :: <ret-name> OR <var1,var2,...> ... compile function or variable for CPU and coprocessor
Entire code blocks: !dir$ offload begin <clauses> ... !dir$ end offload
  • 31. Options on offloads can control data copying and manage coprocessor dynamic allocation.
Clauses:
Multiple coprocessors: target(mic[:unit]) ... select specific coprocessors
Conditional offload: if (condition) / mandatory ... select coprocessor or host compute
Inputs: in(var-list modifiers[opt]) ... copy from host to coprocessor
Outputs: out(var-list modifiers[opt]) ... copy from coprocessor to host
Inputs & outputs: inout(var-list modifiers[opt]) ... copy host to coprocessor and back when offload completes
Non-copied data: nocopy(var-list modifiers[opt]) ... data is local to target
Modifiers:
Specify copy length: length(N) ... copy N elements of pointer’s type
Coprocessor memory allocation: alloc_if ( bool ) ... allocate coprocessor space on this offload (default: TRUE)
Coprocessor memory release: free_if ( bool ) ... free coprocessor space at the end of this offload (default: TRUE)
Control target data alignment: align ( N bytes ) ... specify minimum memory alignment on coprocessor
Array partial allocation & variable relocation: alloc ( array-slice ) into ( var-expr ) ... enables partial array allocation and data copy into other vars & ranges
  • 32. Data Persistence with Compiler Offload
__declspec(target(mic)) static float *A, *B, *C, *C1;

// Transfer matrices A, B, and C to coprocessor and do not de-allocate matrices A and B
#pragma offload target(mic) in(transa, transb, M, N, K, alpha, beta, LDA, LDB, LDC) \
    in(A:length(NCOLA * LDA) free_if(0)) \
    in(B:length(NCOLB * LDB) free_if(0)) \
    inout(C:length(N * LDC))
{
    sgemm(&transa, &transb, &M, &N, &K, &alpha, A, &LDA, B, &LDB, &beta, C, &LDC);
}

// Transfer matrix C1 to coprocessor and reuse matrices A and B
#pragma offload target(mic) in(transa1, transb1, M, N, K, alpha1, beta1, LDA, LDB, LDC1) \
    nocopy(A:length(NCOLA * LDA) alloc_if(0) free_if(0)) \
    nocopy(B:length(NCOLB * LDB) alloc_if(0) free_if(0)) \
    inout(C1:length(N * LDC1))
{
    sgemm(&transa1, &transb1, &M, &N, &K, &alpha1, A, &LDA, B, &LDB, &beta1, C1, &LDC1);
}

// Deallocate A and B on the coprocessor
#pragma offload target(mic) \
    nocopy(A:length(NCOLA * LDA) free_if(1)) \
    nocopy(B:length(NCOLB * LDB) free_if(1))
{ }
  • 33. Data Persistence with Compiler Offload (with helper macros)
#define ALLOC alloc_if(1) free_if(0)
#define REUSE alloc_if(0) free_if(0)
#define FREE  alloc_if(0) free_if(1)

__declspec(target(mic)) static float *A, *B, *C, *C1;

// Transfer matrices A, B, and C to coprocessor and do not de-allocate matrices A and B
#pragma offload target(mic) in(transa, transb, M, N, K, alpha, beta, LDA, LDB, LDC) \
    in(A:length(NCOLA * LDA) ALLOC) \
    in(B:length(NCOLB * LDB) ALLOC) \
    inout(C:length(N * LDC))
{
    sgemm(&transa, &transb, &M, &N, &K, &alpha, A, &LDA, B, &LDB, &beta, C, &LDC);
}

// Transfer matrix C1 to coprocessor and reuse matrices A and B
#pragma offload target(mic) in(transa1, transb1, M, N, K, alpha1, beta1, LDA, LDB, LDC1) \
    nocopy(A:length(NCOLA * LDA) REUSE) \
    nocopy(B:length(NCOLB * LDB) REUSE) \
    inout(C1:length(N * LDC1))
{
    sgemm(&transa1, &transb1, &M, &N, &K, &alpha1, A, &LDA, B, &LDB, &beta1, C1, &LDC1);
}

// Deallocate A and B on the coprocessor
#pragma offload_transfer target(mic) \
    nocopy(A:length(NCOLA * LDA) FREE) \
    nocopy(B:length(NCOLB * LDB) FREE)
  • 34. To handle more complex data structures on the coprocessor, use Virtual Shared Memory. An identical range of virtual addresses is reserved on both host and coprocessor (the host and coprocessor executables map the same virtual address range); changes are shared at offload points, allowing: • seamless sharing of complex data structures, including linked lists • elimination of manual data marshaling and shared array management • freer use of new C++ features and standard classes.
  • 35. Example: Virtual Shared Memory, shared between host and Xeon Phi.
// Shared variable declaration
_Cilk_shared T in1[SIZE];
_Cilk_shared T in2[SIZE];
_Cilk_shared T res[SIZE];

_Cilk_shared void compute_sum() {
    int i;
    for (i=0; i<SIZE; i++) {
        res[i] = in1[i] + in2[i];
    }
}
(...)
// Call compute_sum on Target
_Cilk_offload compute_sum();
  • 36. Virtual Shared Memory uses special allocation to manage data sharing at offload boundaries. Declare virtual shared data using the _Cilk_shared allocation specifier; allocate virtual dynamic shared data using these special functions: _Offload_shared_malloc(), _Offload_shared_aligned_malloc(), _Offload_shared_free(), _Offload_shared_aligned_free(). Shared data copying occurs automatically around offload sections: memory is only synchronized on entry to or exit from an offload call, and only modified data blocks are transferred between host and coprocessor. Allows transfer of C++ objects: pointers are transportable when they point to “shared” data addresses. Well-known methods can be used to synchronize access to shared data and prevent data races within offloaded code, e.g. locks, critical sections, etc. This model is integrated with the Intel® Cilk™ Plus parallel extensions. Note: not supported in Fortran; available for C/C++ only.
  • 37. Data sharing between host and coprocessor can be enabled using this Intel® Cilk™ Plus syntax:
Function: int _Cilk_shared f(int x){ return x+1; } ... code emitted for host and target; may be called from either side
Global: _Cilk_shared int x = 0; ... datum is visible on both sides
File/Function static: static _Cilk_shared int x; ... datum visible on both sides, only to code within the file/function
Class: class _Cilk_shared x {...}; ... class methods, members and operators available on both sides
Pointer to shared data: int _Cilk_shared *p; ... p is local (not shared), can point to shared data
A shared pointer: int *_Cilk_shared p; ... p is shared; should only point at shared data
Entire blocks of code: #pragma offload_attribute(push, _Cilk_shared) ... #pragma offload_attribute(pop) ... mark entire files or blocks of code _Cilk_shared using this pragma
  • 38. Intel® Cilk™ Plus syntax can also specify the offloading of computation to the coprocessor:
Offloading a function call: x = _Cilk_offload func(y); ... func executes on coprocessor if possible; x = _Cilk_offload_to (card_num) func(y); ... func must execute on the specified coprocessor or an error occurs
Offloading asynchronously: x = _Cilk_spawn _Cilk_offload func(y); ... func executes on coprocessor; continuation available for stealing
Offloading a parallel for-loop: _Cilk_offload _Cilk_for(i=0; i<N; i++){ a[i] = b[i] + c[i]; } ... loop executes in parallel on coprocessor; the loop is implicitly “un-inlined” as a function call
  • 39. Introduction · High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software · Performance and Thread Parallelism · Conclusions & References
  • 40. Comprehensive set of SW tools for Xeon and Xeon Phi programming. Programming models: Offload/Native/MYO; Intel Cilk Plus, Threading Building Blocks, OpenMP, OpenCL, MPI. Libraries & compilers: Math Kernel Library, Integrated Performance Primitives, Intel Compilers. Code analysis: Advisor XE, VTune Amplifier XE, Inspector XE, Trace Analyzer.
  • 41. Options for Thread Parallelism, from greatest ease of use / code maintainability to most programmer control: Intel® Math Kernel Library, OpenMP*, Intel® Threading Building Blocks, Intel® Cilk™ Plus, OpenCL*, Pthreads* and other threading libraries. A choice of unified programming to target both Intel® Xeon® and Intel® Xeon Phi™ architecture!
  • 42. Introduction · High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software · Performance and Thread Parallelism: OpenMP · Conclusions & References
  • 43. OpenMP* on the Coprocessor
• The basics work just like on the host CPU
• For both native and offload models
• Need to specify -openmp
• There are 4 hardware thread contexts per core
• Need at least 2 x ncore threads for good performance
– For all except the most memory-bound workloads
– Often, 3x or 4x (number of available cores) is best
– Very different from hyperthreading on the host!
– -opt-threads-per-core=n advises the compiler how many threads to optimize for
• If you don’t saturate all available threads, be sure to set KMP_AFFINITY to control thread distribution
  • 44. Thread Affinity Interface: allows OpenMP threads to be bound to physical or logical cores.
• export environment variable KMP_AFFINITY=
– physical: use all physical cores before assigning threads to other logical cores (other hardware thread contexts)
– compact: assign threads to consecutive h/w contexts on the same physical core (e.g. to benefit from shared cache)
– scatter: assign consecutive threads to different physical cores (e.g. to maximize access to memory)
– balanced: blend of compact & scatter (currently only available for Intel® MIC Architecture)
• Helps optimize access to memory or cache
• Particularly important if not all available h/w threads are used, else some physical cores may be idle while others run multiple threads
• See compiler documentation for (much) more detail
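As a hedged sketch, affinity settings for a 60-core card when deliberately undersubscribing (the specific values are illustrative; tune per workload):

```shell
# 60 physical cores with 4 hw threads each: use 180 of the 240 threads (3x cores)
export OMP_NUM_THREADS=180
# balanced: spread threads across all cores first, then pack neighbors together
export KMP_AFFINITY=balanced
# alternative when maximizing per-thread memory bandwidth:
#   export KMP_AFFINITY=scatter
echo "threads=$OMP_NUM_THREADS affinity=$KMP_AFFINITY"
```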
  • 45. OpenMP defaults: OMP_NUM_THREADS defaults to
• 1 x ncore for the host (or 2x if hyperthreading is enabled)
• 4 x ncore for native coprocessor applications
• 4 x (ncore-1) for offload applications; one core is reserved for offload daemons and the OS
Defaults may be changed via environment variables or via API calls on either the host or the coprocessor.
  • 46. Target OpenMP environment (offload). Use target-specific APIs to set values for the coprocessor target only, called from the host, e.g. omp_set_num_threads_target(), omp_set_nested_target(), etc.
• Protect with #ifdef __INTEL_OFFLOAD, undefined with –no-offload
• Fortran: USE MIC_LIB and OMP_LIB; C: #include <offload.h>
Or define MIC-specific versions of env vars using MIC_ENV_PREFIX=MIC (no underscore)
• Values on MIC no longer default to values on the host
• Set values specific to MIC using:
export MIC_OMP_NUM_THREADS=120 (all cards)
export MIC_2_OMP_NUM_THREADS=180 (for card #2, etc.)
export MIC_3_ENV=“OMP_NUM_THREADS=240|KMP_AFFINITY=balanced”
  • 47. Introduction · High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software · Performance and Thread Parallelism: MKL · Conclusions & References
  • 49. MKL Usage Models on Intel® Xeon Phi™ Coprocessor
• Automatic Offload: no code changes required; automatically uses both host and target; transparent data transfer and execution management
• Compiler Assisted Offload: explicit control of data transfer and remote execution using compiler offload pragmas/directives; can be used together with Automatic Offload
• Native Execution: uses the coprocessors as independent nodes; input data is copied to targets in advance
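A minimal sketch of the Automatic Offload model above, enabled purely from the environment with no source changes (MKL_MIC_ENABLE and MKL_MIC_WORKDIVISION are the documented switches; the application binary name is hypothetical):

```shell
# Enable Automatic Offload for any MKL-linked binary; MKL decides at run
# time whether a call (e.g. a large ?GEMM) is worth sending to the card.
export MKL_MIC_ENABLE=1
# Optionally cap how much of the computation goes to the coprocessor(s):
export MKL_MIC_WORKDIVISION=0.5
# ./my_mkl_app    # hypothetical application, source unchanged
echo "AO=$MKL_MIC_ENABLE workdivision=$MKL_MIC_WORKDIVISION"
```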
  • 50. MKL Execution Models, from multicore-centric (Intel® Xeon®) to many-core-centric (Intel® Xeon Phi™): Multicore Hosted for general purpose serial and parallel computing; Offload for codes with highly-parallel phases; Symmetric for codes with balanced needs; Many Core Hosted for highly-parallel codes.
  • 51. Work Division Control in MKL Automatic Offload.
API example: mkl_mic_set_workdivision(MKL_TARGET_MIC, 0, 0.5) ... offload 50% of the computation only to the 1st card.
Environment variable example: MKL_MIC_0_WORKDIVISION=0.5 ... offload 50% of the computation only to the 1st card.
  • 52. How to Use MKL with Compiler Assisted Offload: the same way you would offload any function call to the coprocessor. An example in C:
#pragma offload target(mic) in(transa, transb, N, alpha, beta) \
    in(A:length(matrix_elements)) in(B:length(matrix_elements)) \
    in(C:length(matrix_elements)) out(C:length(matrix_elements) alloc_if(0))
{
    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
}
  • 53. Introduction · High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software · Performance and Thread Parallelism · Conclusions & References
  • 54. Conclusions. Intel® Xeon Phi™ coprocessor advantages: • comparable performance potential to other accelerators • faster time to solution due to reduced development effort • better investment protection with a single code base for processors and coprocessors. A flexible and wide range of programming models, from pure native to offloaded and all variants in between, all with the familiar Intel development environment.
  • 55. Intel® Xeon Phi™ Coprocessor Developer Site: http://software.intel.com/mic-developer. One stop shop for: tools & software downloads; getting started development guides; video workshops, tutorials, & events; code samples & case studies; articles, forums, & blogs; associated product links.
  • 57. Legal Disclaimer & Optimization Notice. INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries. Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations.
Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804