6. Motivation
• "The most economic number of components in an IC will double every year"
• Historically, CPUs get faster
  – Hardware reaching frequency limitations
• Now, CPUs get wider
slide by Matthew Bolitho
7. Motivation
• Rather than expecting CPUs to get twice as fast, expect to have twice as many!
• Parallel processing for the masses
• Unfortunately, parallel programming is hard...
  – Algorithms and Data Structures must be fundamentally redesigned
slide by Matthew Bolitho
15. Getting your feet wet
Algorithm X v1.0 Profiling Analysis on Input 10x10x10
[Bar chart, time (s) per routine: load_data() 50, foo() 29, bar() 10, yey() 11 — load_data() is sequential in nature; foo(), bar(), yey() are 100% parallelizable]
Q: What is the maximum speedup?
16. Getting your feet wet
(Same profiling chart as above.)
A: 2X! :-(
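(That bound is Amdahl's law. A minimal sketch in C — the function amdahl() and the printed numbers are illustrative, taken from the chart above:)

#include <stdio.h>

/* Amdahl's law: with parallelizable fraction p and n workers,
 * speedup(n) = 1 / ((1 - p) + p/n); as n -> infinity it tends to 1/(1-p). */
static double amdahl(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    /* 10x10x10 input: load_data() = 50 s sequential,
     * foo()+bar()+yey() = 29+10+11 = 50 s parallelizable -> p = 0.5 */
    double p = 50.0 / 100.0;
    printf("max speedup: %.2fx\n", 1.0 / (1.0 - p));  /* 2.00x */
    printf("on 8 cores:  %.2fx\n", amdahl(p, 8.0));   /* ~1.78x */
    return 0;
}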
17. Getting your feet wet
Algorithm X v1.0 Profiling Analysis on Input 100x100x100
[Bar chart, time (s) per routine: load_data() 350, foo() 9,000, bar() 250, yey() 300 — again only load_data() is sequential in nature; the rest is 100% parallelizable]
Q: and now?
18. You need to...
• ... understand the problem (duh!)
• ... study the current (sequential?) solutions and
their constraints
• ... know the input domain
• ... profile accordingly
• ... “refactor” based on new constraints (hw/sw)
19. A better way?
... doesn't scale!
Speculation: (input) domain-aware optimization using some sort of probabilistic modeling?
20. Some Perspective
The “problem tree” for scientific problem solving
Technical Problem to be Analyzed
→ Consultation with experts
→ Scientific Model "A" / Model "B"
→ Theoretical analysis
→ Discretization "A" / Discretization "B" / Experiments
→ Iterative equation solver / Direct elimination equation solver
→ Parallel implementation / Sequential implementation

Figure: The "problem tree" for scientific problem solving. There are many options to try to achieve the same goal.
from Scott et al. “Scientific Parallel Computing” (2005)
21. Computational Thinking
• translate/formulate domain problems into
computational models that can be solved
efficiently by available computing resources
• requires a deep understanding of their
relationships
adapted from Hwu & Kirk (PASI 2011)
22. Getting ready...
[Diagram: Applications rest on Parallel Computing, which rests on Parallel Thinking, which draws on Architecture, Programming Models, Algorithms, Languages, Compilers, and Patterns]
adapted from Scott et al. “Scientific Parallel Computing” (2005)
23. Fundamental Skills
• Computer architecture
• Programming models and compilers
• Algorithm techniques and patterns
• Domain knowledge
24. Computer Architecture
critical for understanding trade-offs between algorithms
• memory organization, bandwidth and latency;
caching and locality (memory hierarchy)
• floating-point precision vs. accuracy
• SISD, SIMD, MISD, MIMD vs. SIMT, SPMD
25. Programming models
for optimal data structure and code execution
• parallel execution models (threading hierarchy)
• optimal memory access patterns
• array data layout and loop transformations
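
For example, data layout and loop order together decide whether the inner loop walks memory with unit stride. A small C sketch (names are illustrative; C stores 2-D arrays row-major):

#define N 1024
double a[N][N];

void scale_rowmajor(double s)
{
    for (int i = 0; i < N; ++i)       /* good: unit-stride inner loop */
        for (int j = 0; j < N; ++j)
            a[i][j] *= s;
}

void scale_colmajor(double s)
{
    for (int j = 0; j < N; ++j)       /* bad: inner loop strides by N doubles */
        for (int i = 0; i < N; ++i)
            a[i][j] *= s;
}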
26. Algorithms and patterns
• toolbox for designing good parallel algorithms
• it is critical to understand their scalability and
efficiency
• many have been exposed and documented
• sometimes hard to “extract”
• ... but keep trying!
27. Domain Knowledge
• abstract modeling
• mathematical properties
• accuracy requirements
• coming back to the drawing board to expose more/better parallelism?
28. You can do it!
• thinking parallel is not as hard as you may think
• many techniques have been thoroughly explained...
• ... and are now “accessible” to non-experts!
30. Architecture
• What’s in a (basic) computer?
• Basic Subsystems
• Machine Language
• Memory Hierarchy
• Pipelines
• CPUs to GPUs
32. What’s in a computer?
adapted from Berger & Klöckner (NYU 2010)
33. What’s in a computer?
Processor
Intel Q6600 Core2 Quad, 2.4 GHz
adapted from Berger & Klöckner (NYU 2010)
34. What’s in a computer?
Die: (2×) 143 mm², 2 × 2 cores
Intel Q6600 Core2 Quad, 2.4 GHz
582,000,000 transistors, ~100 W
adapted from Berger & Klöckner (NYU 2010)
35. What’s in a computer?
adapted from Berger & Klöckner (NYU 2010)
36. What’s in a computer?
Memory
adapted from Berger & Klöckner (NYU 2010)
37. Architecture
• What’s in a (basic) computer?
• Basic Subsystems
• Machine Language
• Memory Hierarchy
• Pipelines
38. A Basic Processor
[Block diagram, loosely based on the Intel 8086: a Memory Interface with an Address ALU drives the Address Bus and Data Bus; a Register File with Flags, the instruction fetch unit with its PC, and the Data ALU all sit on an Internal Bus, coordinated by the Control Unit]
adapted from Berger & Klöckner (NYU 2010)
39. How all of this fits together
Everything synchronizes to the Clock.
Control Unit ("CU"): the brains of the operation — everything connects to it.
Bus entries/exits are gated and (potentially) buffered.
The CU controls the gates and tells the other units 'what' and 'how':
• What operation?
• Which register?
• Which addressing mode?
adapted from Berger & Klöckner (NYU 2010)
40. What is... an ALU?
Arithmetic Logic Unit
One or two operands A, B
Operation selector (Op):
• (Integer) Addition, Subtraction
• (Logical) And, Or, Not
• (Bitwise) Shifts (equivalent to
multiplication by power of two)
• (Integer) Multiplication, Division
Specialized ALUs:
• Floating Point Unit (FPU)
• Address ALU
Operates on binary representations of
numbers. Negative numbers represented
by two’s complement.
adapted from Berger & Klöckner (NYU 2010)
41. What is... a Register File?
Registers (%r0, %r1, ..., %r7) are On-Chip Memory
• Directly usable as operands in Machine Language
• Often "general-purpose"
• Sometimes special-purpose: Floating point, Indexing, Accumulator
• Small: x86_64 has 16 × 64-bit GPRs
• Very fast (near-zero latency)
adapted from Berger & Klöckner (NYU 2010)
42. How does computer memory work?
One (reading) memory transaction (simplified):
[Timing diagram: Processor and Memory joined by data bus D0..15, address bus A0..15, read/write select line R/W, and clock CLK]
adapted from Berger & Klöckner (NYU 2010)
48. How does computer memory work?
One (reading) memory transaction, as in the diagram above (simplified).
Observation: Access (and addressing) happens in bus-width-size "chunks".
adapted from Berger & Klöckner (NYU 2010)
49. What is... a Memory Interface?
Memory Interface gets and stores binary
words in off-chip memory.
Smallest granularity: Bus width
Tells outside memory
• “where” through address bus
• “what” through data bus
Computer main memory is “Dynamic RAM”
(DRAM): Slow, but small and cheap.
adapted from Berger & Klöckner (NYU 2010)
50. Architecture
• What’s in a (basic) computer?
• Basic Subsystems
• Machine Language
• Memory Hierarchy
• Pipelines
• CPUs to GPUs
51. A Very Simple Program
int a = 5;
int b = 17;
int z = a * b;

 4: c7 45 f4 05 00 00 00  movl  $0x5,-0xc(%rbp)
 b: c7 45 f8 11 00 00 00  movl  $0x11,-0x8(%rbp)
12: 8b 45 f4              mov   -0xc(%rbp),%eax
15: 0f af 45 f8           imul  -0x8(%rbp),%eax
19: 89 45 fc              mov   %eax,-0x4(%rbp)
1c: 8b 45 fc              mov   -0x4(%rbp),%eax

Things to know:
• Addressing modes (Immediate, Register, Base plus Offset)
• The 0x... prefix means hexadecimal
• "AT&T Form" (we'll use this): <opcode><size> <source>, <dest>
adapted from Berger & Klöckner (NYU 2010)
52. A Very Simple Program: Intel Form
 4: c7 45 f4 05 00 00 00  mov  DWORD PTR [rbp-0xc],0x5
 b: c7 45 f8 11 00 00 00  mov  DWORD PTR [rbp-0x8],0x11
12: 8b 45 f4              mov  eax,DWORD PTR [rbp-0xc]
15: 0f af 45 f8           imul eax,DWORD PTR [rbp-0x8]
19: 89 45 fc              mov  DWORD PTR [rbp-0x4],eax
1c: 8b 45 fc              mov  eax,DWORD PTR [rbp-0x4]
• “Intel Form”: (you might see this on the net)
<opcode> <sized dest>, <sized source>
• Goal: Reading comprehension.
• Don’t understand an opcode?
Google “<opcode> intel instruction”.
adapted from Berger & Klöckner (NYU 2010)
53. Machine Language Loops
int main()
{
    int y = 0, i;

    for (i = 0; y < 10; ++i)
        y += i;

    return y;
}

 0: 55                    push  %rbp
 1: 48 89 e5              mov   %rsp,%rbp
 4: c7 45 f8 00 00 00 00  movl  $0x0,-0x8(%rbp)
 b: c7 45 fc 00 00 00 00  movl  $0x0,-0x4(%rbp)
12: eb 0a                 jmp   1e <main+0x1e>
14: 8b 45 fc              mov   -0x4(%rbp),%eax
17: 01 45 f8              add   %eax,-0x8(%rbp)
1a: 83 45 fc 01           addl  $0x1,-0x4(%rbp)
1e: 83 7d f8 09           cmpl  $0x9,-0x8(%rbp)
22: 7e f0                 jle   14 <main+0x14>
24: 8b 45 f8              mov   -0x8(%rbp),%eax
27: c9                    leaveq
28: c3                    retq

Things to know:
• Condition Codes (Flags): Zero, Sign, Carry, etc.
• Call Stack: Stack frame, stack pointer, base pointer
• ABI: Calling conventions
adapted from Berger & Klöckner (NYU 2010)
54. Machine Language Loops
(Same program and disassembly as above.)

Want to make those yourself? Write myprogram.c, then:

$ cc -c myprogram.c
$ objdump --disassemble myprogram.o

Things to know:
• Condition Codes (Flags): Zero, Sign, Carry, etc.
• Call Stack: Stack frame, stack pointer, base pointer
• ABI: Calling conventions
adapted from Berger & Klöckner (NYU 2010)
57. We know how a computer works!
All of this can be built in about 4000 transistors.
(e.g. MOS 6502 in Apple II, Commodore 64, Atari 2600)
So what exactly is Intel doing with the other 581,996,000
transistors?
Answer: Make things go faster!
Goal now:
Understand sources of slowness, and how they get addressed.
Remember: High Performance Computing
adapted from Berger & Klöckner (NYU 2010)
58. The High-Performance Mindset
Writing high-performance Codes
Mindset: What is going to be the limiting
factor?
• ALU?
• Memory?
• Communication? (if multi-machine)
Benchmark the assumed limiting factor right
away.
Evaluate
• Know your peak throughputs (roughly)
• Are you getting close?
• Are you tracking the right limiting factor?
adapted from Berger & Klöckner (NYU 2010)
59. Architecture
• What’s in a (basic) computer?
• Basic Subsystems
• Machine Language
• Memory Hierarchy
• Pipelines
• CPUs to GPUs
61. Source of Slowness: Memory
Memory is slow.
Distinguish two different versions of “slow”:
• Bandwidth
• Latency
→ Memory has long latency, but can have large bandwidth.
Idea: put a look-up table of recently-used data onto the chip → "Cache".
Size of die vs. distance to memory: big!
Dynamic RAM: long intrinsic latency!
adapted from Berger & Klöckner (NYU 2010)
63. Performance of computer system
[Plot: performance of the computer system versus size of the problem being solved. Performance steps down each time the problem outgrows a level — entire problem fits within registers, then within cache, then within main memory, then requires secondary (disk) memory, and finally is too big for the system]
from Scott et al. "Scientific Parallel Computing" (2005)
64. The Memory Hierarchy
Hierarchy of increasingly bigger, slower memories:
Registers: 1 kB, 1 cycle
L1 Cache: 10 kB, 10 cycles
L2 Cache: 1 MB, 100 cycles
DRAM: 1 GB, 1000 cycles
Virtual Memory (hard drive): 1 TB, 1 M cycles

How might data locality factor into this? What is a working set?
adapted from Berger & Klöckner (NYU 2010)
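
A back-of-the-envelope consequence of those numbers (a sketch — real hardware overlaps misses): average access time = hit time + miss rate × miss penalty.

#include <stdio.h>

int main(void)
{
    /* L1 hit = 10 cycles, DRAM penalty = 1000 cycles (figures from above) */
    const double hit = 10.0, penalty = 1000.0;
    for (double miss = 0.01; miss < 0.5; miss *= 4)
        printf("miss rate %5.1f%% -> ~%6.1f cycles/access\n",
               100.0 * miss, hit + miss * penalty);
    return 0;
}

Even a 4% miss rate already quintuples the average access cost — which is why locality dominates performance tuning.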
65. Cache: Actual Implementation
Demands on cache implementation:
• Fast, small, cheap, low-power
• Fine-grained
• High “hit”-rate (few “misses”)
Problem:
Goals at odds with each other: Access matching logic expensive!
Solution 1: More data per unit of access matching logic
→ Larger “Cache Lines”
Solution 2: Simpler/less access matching logic
→ Less than full “Associativity”
Other choices: Eviction strategy, size
adapted from Berger & Klöckner (NYU 2010)
67. Cache: Associativity
[Diagram: Direct Mapped vs. 2-way set associative — direct mapped: each memory block maps to exactly one of the four cache slots; 2-way: each block may go in either way of its set]
[Plot: miss rate versus cache size on the integer portion of SPEC CPU2000 — Cantin & Hill 2003]
adapted from Berger & Klöckner (NYU 2010)
68. Cache Example: Intel Q6600/Core2 Quad
--- L1 data cache ---
fully associative cache     = false
threads sharing this cache  = 0x0 (0)
processor cores on this die = 0x3 (3)
system coherency line size  = 0x3f (63)
ways of associativity       = 0x7 (7)
number of sets - 1 (s)      = 63

--- L1 instruction cache ---
fully associative cache     = false
threads sharing this cache  = 0x0 (0)
processor cores on this die = 0x3 (3)
system coherency line size  = 0x3f (63)
ways of associativity       = 0x7 (7)
number of sets - 1 (s)      = 63

--- L2 unified cache ---
fully associative cache     = false
threads sharing this cache  = 0x1 (1)
processor cores on this die = 0x3 (3)
system coherency line size  = 0x3f (63)
ways of associativity       = 0xf (15)
number of sets - 1 (s)      = 4095
More than you care to know about your CPU:
http://www.etallen.com/cpuid.html
adapted from Berger & Klöckner (NYU 2010)
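
A quick sanity check on those numbers (the cpuid fields encode value − 1): line size 64 B × 8 ways × 64 sets = 32 kiB per L1 cache, and 64 B × 16 ways × 4096 sets = 4 MiB for the shared L2 — matching the Q6600's published 32 kiB L1 / 4 MiB L2 per core pair.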
69. Measuring the Cache I
void go(unsigned count, unsigned stride)
{
    const unsigned arr_size = 64 * 1024 * 1024;
    int *ary = (int *) malloc(sizeof(int) * arr_size);

    for (unsigned it = 0; it < count; ++it)
    {
        for (unsigned i = 0; i < arr_size; i += stride)
            ary[i] *= 17;
    }

    free(ary);
}
adapted from Berger & Klöckner (NYU 2010)
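
As written, go() only allocates, sweeps, and frees; to see the cache you have to time it. A minimal harness (a sketch, assuming POSIX clock_gettime; the classic observation is that runtime stays roughly flat while stride × sizeof(int) fits within one 64 B cache line, because the same number of lines is touched either way):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void go(unsigned count, unsigned stride);   /* as defined above */

int main(void)
{
    for (unsigned stride = 1; stride <= 64; stride *= 2)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        go(1, stride);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double s = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        printf("stride %2u: %.3f s\n", stride, s);
    }
    return 0;
}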
71. Measuring the Cache II
void go(unsigned array_size, unsigned steps)
{
    int *ary = (int *) malloc(sizeof(int) * array_size);
    unsigned asm1 = array_size - 1;

    for (unsigned i = 0; i < steps; ++i)
        ary[(i * 16) & asm1]++;

    free(ary);
}
adapted from Berger & Klöckner (NYU 2010)
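
The & asm1 mask wraps the index without a division — which only works because array_size is a power of two. Sweeping array_size while holding steps fixed makes the time per access jump each time the working set outgrows another cache level.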
73. Measuring the Cache III
void go(unsigned array_size, unsigned stride, unsigned steps)
{
    char *ary = (char *) malloc(sizeof(int) * array_size);
    unsigned p = 0;

    for (unsigned i = 0; i < steps; ++i)
    {
        ary[p]++;
        p += stride;
        if (p >= array_size)
            p = 0;
    }

    free(ary);
}
adapted from Berger & Klöckner (NYU 2010)
77. Architecture
• What’s in a (basic) computer?
• Basic Subsystems
• Machine Language
• Memory Hierarchy
• Pipelines
• CPUs to GPUs
78. Source of Slowness: Sequential Operation
IF Instruction fetch
ID Instruction Decode
EX Execution
MEM Memory Read/Write
WB Result Writeback
adapted from Berger & Klöckner (NYU 2010)
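
The payoff of pipelining, in rough numbers: with the five stages above and one instruction in each stage, a filled pipeline retires one instruction per cycle — ideally ~5× an unpipelined design that spends all five stages on each instruction in turn. Stalls and hazards are what keep real code below that ideal.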
84. Programming for the Pipeline
How to upset a processor pipeline:
for (int i = 0; i < 1000; ++i)
    for (int j = 0; j < 1000; ++j)
    {
        if (j % 2 == 0)
            do_something(i, j);
    }

... why is this bad?
adapted from Berger & Klöckner (NYU 2010)
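
Why it's bad: the branch sits in the innermost loop, taken and not-taken on alternating iterations. Here the condition depends only on j, so one hedged fix is to restructure the loop so the branch disappears entirely (do_something is the slide's placeholder):

for (int i = 0; i < 1000; ++i)
    for (int j = 0; j < 1000; j += 2)   /* visit only even j: no branch left */
        do_something(i, j);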
85. A Puzzle
int steps = 256 * 1024 * 1024;
int a[2] = {0, 0};

// Loop 1
for (int i = 0; i < steps; i++) { a[0]++; a[0]++; }

// Loop 2
for (int i = 0; i < steps; i++) { a[0]++; a[1]++; }

Which is faster? ... and why?
adapted from Berger & Klöckner (NYU 2010)
86. Two useful Strategies
Loop unrolling:

    for (int i = 0; i < 1000; ++i)
        do_something(i);

becomes

    for (int i = 0; i < 1000; i += 2)
    {
        do_something(i);
        do_something(i + 1);
    }

Software pipelining:

    for (int i = 0; i < 1000; ++i)
    {
        do_a(i);
        do_b(i);
    }

becomes

    for (int i = 0; i < 1000; i += 2)
    {
        do_a(i);
        do_a(i + 1);
        do_b(i);
        do_b(i + 1);
    }

adapted from Berger & Klöckner (NYU 2010)
87. SIMD
Control Units are large and expensive; Functional Units are simple and cheap.
→ Increase the Function/Control ratio: control several functional units with one control unit, all executing the same operation (SIMD: one instruction pool, one data pool).

GCC vector extensions:

typedef int v4si __attribute__((vector_size(16)));
v4si a, b, c;
c = a + b;
// +, -, *, /, unary minus, ^, |, &, ~, %

Will revisit for OpenCL, GPUs.
adapted from Berger & Klöckner (NYU 2010)
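
A minimal, self-contained use of that extension (a sketch, assuming a reasonably recent GCC or Clang, which also allow subscripting vector types):

#include <stdio.h>

typedef int v4si __attribute__((vector_size(16)));   /* 4 x 32-bit ints */

int main(void)
{
    v4si a = {1, 2, 3, 4};
    v4si b = {10, 20, 30, 40};
    v4si c = a + b;                  /* one SIMD add, no loop */
    for (int i = 0; i < 4; ++i)
        printf("%d ", c[i]);         /* prints: 11 22 33 44 */
    printf("\n");
    return 0;
}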
88. Architecture
• What’s in a (basic) computer?
• Basic Subsystems
• Machine Language
• Memory Hierarchy
• Pipelines
• CPUs to GPUs
90. “CPU-style” Cores
[Diagram: a CPU-style core — Fetch/Decode, ALU (Execute), Execution Context, plus out-of-order control logic, a fancy branch predictor, a memory pre-fetcher, and a big data cache]
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
Credit: Kayvon Fatahalian (Stanford)
91. Slimming down
Idea #1: Remove components that help a single instruction stream run fast.
[Diagram: the slimmed core keeps only Fetch/Decode, ALU (Execute), and Execution Context]
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
Credit: Kayvon Fatahalian (Stanford)
slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
92. More Space: Double the Number of Cores
Two cores (two fragments in parallel): each core runs the same fragment shader on its own fragment.

<diffuseShader>:
sample r0, v4, t0, s0
mul  r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul  o0, r0, r3
mul  o1, r1, r3
mul  o2, r2, r3
mov  o3, l(1.0)

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
Credit: Kayvon Fatahalian (Stanford)
slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
93. ...again
Four cores (four fragments in parallel).
[Diagram: four slimmed cores, each with Fetch/Decode, ALU (Execute), and Execution Context]
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
Credit: Kayvon Fatahalian (Stanford)
slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
94. ...and again
Sixteen cores (sixteen fragments in parallel).
16 cores = 16 simultaneous instruction streams
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
Credit: Kayvon Fatahalian (Stanford)
slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
95. ...and again
16 cores → 16 independent instruction streams.
Reality: the instruction streams are not actually very different/independent.
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
Credit: Kayvon Fatahalian (Stanford)
slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
96. Saving Yet More Space
Recall the simple processing core: Fetch/Decode, ALU (Execute), Execution Context.
Credit: Kayvon Fatahalian (Stanford)
slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
98. Saving Yet More Space
Idea #2: Amortize the cost/complexity of managing an instruction stream across many ALUs → SIMD processing.
[Diagram: one Fetch/Decode unit now feeds ALUs 1-8; each ALU keeps a small context (Ctx), backed by Shared Ctx Data]
Credit: Kayvon Fatahalian (Stanford)
slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
101. Gratuitous Amounts of Parallelism!
http://www.youtube.com/watch?v=1yH_j8-VVLo
Example: 128 fragments in parallel — 16 cores = 128 ALUs = 16 simultaneous instruction streams, i.e. 16 independent groups of 8 synchronized streams.
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
Credit: Kayvon Fatahalian (Stanford)
slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
103. Remaining Problem: Slow Memory
Problem: Memory still has very high latency...
...but we've removed most of the hardware that helps us deal with that:
• caches
• branch prediction
• out-of-order execution
So what now? Idea #3: Even more parallelism + some extra memory = a solution!
slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
106. GPU Architecture Summary
Core Ideas:
1. Many slimmed-down cores → lots of parallelism
2. More ALUs, fewer Control Units
3. Avoid memory stalls by interleaving execution of SIMD groups ("warps")
Credit: Kayvon Fatahalian (Stanford)
slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
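
Rough arithmetic behind Idea 3 (illustrative numbers, not a spec): if a memory access costs ~1000 cycles and a warp has ~50 cycles of arithmetic between accesses, a core needs on the order of 1000 / 50 = 20 warps resident, so that whenever one warp stalls on memory, another is ready to issue.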
107. Is it free?
• What are the consequences?
• Program must be more "predictable"
• Data access coherency
• Program flow
slide by Matthew Bolitho
108. Some terminology
[Diagram: distributed memory — each processor P has a private memory M, all joined by an Interconnection Network; shared memory — processors P reach shared memories M through the Interconnection Network]
"distributed memory" vs. "shared memory" — hybrid approaches increasingly common; now: mostly hybrid
109. Some More Terminology
One way to classify machines distinguishes between:

Shared memory: global memory can be accessed by all processors or cores. Information is exchanged between threads using shared variables written by one thread and read by another. Access to shared variables must be coordinated.

Distributed memory: private memory for each processor, only accessible by that processor, so no synchronization of memory accesses is needed. Information is exchanged by sending data from one processor to another via an interconnection network using explicit communication operations.

[Diagram: memories M and processors P attached to an Interconnection Network]
113. Connection: Hardware ↔ Programming Model
[Diagram: a grid of cores, each with Fetch/Decode, 32 kiB private context ("Registers"), and 16 kiB shared context — "Who cares how many cores?"]
Idea:
• Program as if there were "infinitely" many cores.
• Program as if there were "infinitely" many ALUs per core.
slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
114. Connection: Hardware ↔ Programming Model
(Same diagram and idea as above.)
Consider: Which is easy to do automatically?
• Parallel program → sequential hardware, or
• Sequential program → parallel hardware?
slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
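
One way to see the asymmetry, as a C sketch (saxpy_kernel and launch are illustrative names; gid plays the role of OpenCL's get_global_id(0)):

/* SPMD sketch: write the per-element body once; a GPU would launch n
 * logical threads of saxpy_kernel, one per gid. */
void saxpy_kernel(int gid, float a, const float *x, float *y)
{
    y[gid] = a * x[gid] + y[gid];
}

/* "Parallel program -> sequential hardware" is the easy direction:
 * just run the grid as a loop. The reverse, automatically parallelizing
 * an arbitrary sequential loop nest, is hard. */
void launch(int n, float a, const float *x, float *y)
{
    for (int gid = 0; gid < n; ++gid)
        saxpy_kernel(gid, a, x, y);
}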