6. Motivation
• "The most economic number of components in an IC will double every year"
• Historically, CPUs get faster
  – Hardware reaching frequency limitations
• Now, CPUs get wider
slide by Matthew Bolitho
7. Motivation
• Rather than expecting CPUs to get twice as fast, expect to have twice as many!
• Parallel processing for the masses
• Unfortunately, parallel programming is hard...
  – Algorithms and Data Structures must be fundamentally redesigned
slide by Matthew Bolitho
15. Getting your feet wet
Algorithm X v1.0 Profiling Analysis on Input 10x10x10
[Bar chart, time (s) per routine: load_data() 50, foo() 29, bar() 10, yey() 11 — load_data() is sequential in nature; foo(), bar(), yey() are 100% parallelizable]
Q: What is the maximum speedup?
16. Getting your feet wet
(Same profiling chart as above.)
A: 2X! :-(
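(That bound is Amdahl's law. A minimal sketch in C — the function amdahl() and the printed numbers are illustrative, taken from the chart above:)

#include <stdio.h>

/* Amdahl's law: with parallelizable fraction p and n workers,
 * speedup(n) = 1 / ((1 - p) + p/n); as n -> infinity it tends to 1/(1-p). */
static double amdahl(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    /* 10x10x10 input: load_data() = 50 s sequential,
     * foo()+bar()+yey() = 29+10+11 = 50 s parallelizable -> p = 0.5 */
    double p = 50.0 / 100.0;
    printf("max speedup: %.2fx\n", 1.0 / (1.0 - p));  /* 2.00x */
    printf("on 8 cores:  %.2fx\n", amdahl(p, 8.0));   /* ~1.78x */
    return 0;
}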
17. Getting your feet wet
Algorithm X v1.0 Profiling Analysis on Input 100x100x100
[Bar chart, time (s) per routine: load_data() 350, foo() 9,000, bar() 250, yey() 300 — again only load_data() is sequential in nature; the rest is 100% parallelizable]
Q: and now?
18. You need to...
• ... understand the problem (duh!)
• ... study the current (sequential?) solutions and
their constraints
• ... know the input domain
• ... profile accordingly
• ... “refactor” based on new constraints (hw/sw)
19. A better way?
... doesn't scale!
Speculation: (input) domain-aware optimization using some sort of probabilistic modeling?
20. Some Perspective
The “problem tree” for scientific problem solving
Technical Problem to be Analyzed
→ Consultation with experts
→ Scientific Model "A" / Model "B"
→ Theoretical analysis
→ Discretization "A" / Discretization "B" / Experiments
→ Iterative equation solver / Direct elimination equation solver
→ Parallel implementation / Sequential implementation

Figure: The "problem tree" for scientific problem solving. There are many options to try to achieve the same goal.
from Scott et al. “Scientific Parallel Computing” (2005)
21. Computational Thinking
• translate/formulate domain problems into
computational models that can be solved
efficiently by available computing resources
• requires a deep understanding of their
relationships
adapted from Hwu & Kirk (PASI 2011)
22. Getting ready...
[Diagram: Applications rest on Parallel Computing, which rests on Parallel Thinking, which draws on Architecture, Programming Models, Algorithms, Languages, Compilers, and Patterns]
adapted from Scott et al. “Scientific Parallel Computing” (2005)
23. Fundamental Skills
• Computer architecture
• Programming models and compilers
• Algorithm techniques and patterns
• Domain knowledge
24. Computer Architecture
critical for understanding trade-offs between algorithms
• memory organization, bandwidth and latency;
caching and locality (memory hierarchy)
• floating-point precision vs. accuracy
• SISD, SIMD, MISD, MIMD vs. SIMT, SPMD
25. Programming models
for optimal data structure and code execution
• parallel execution models (threading hierarchy)
• optimal memory access patterns
• array data layout and loop transformations
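
For example, data layout and loop order together decide whether the inner loop walks memory with unit stride. A small C sketch (names are illustrative; C stores 2-D arrays row-major):

#define N 1024
double a[N][N];

void scale_rowmajor(double s)
{
    for (int i = 0; i < N; ++i)       /* good: unit-stride inner loop */
        for (int j = 0; j < N; ++j)
            a[i][j] *= s;
}

void scale_colmajor(double s)
{
    for (int j = 0; j < N; ++j)       /* bad: inner loop strides by N doubles */
        for (int i = 0; i < N; ++i)
            a[i][j] *= s;
}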
26. Algorithms and patterns
• toolbox for designing good parallel algorithms
• it is critical to understand their scalability and
efficiency
• many have been exposed and documented
• sometimes hard to “extract”
• ... but keep trying!
27. Domain Knowledge
• abstract modeling
• mathematical properties
• accuracy requirements
• coming back to the drawing board to expose more/better parallelism?
28. You can do it!
• thinking parallel is not as hard as you may think
• many techniques have been thoroughly explained...
• ... and are now “accessible” to non-experts!
30. Architecture
• What’s in a (basic) computer?
• Basic Subsystems
• Machine Language
• Memory Hierarchy
• Pipelines
• CPUs to GPUs
32. What’s in a computer?
adapted from Berger & Klöckner (NYU 2010)
33. What’s in a computer?
Processor
Intel Q6600 Core2 Quad, 2.4 GHz
adapted from Berger & Klöckner (NYU 2010)
34. What’s in a computer?
Die: (2×) 143 mm², 2 × 2 cores
Intel Q6600 Core2 Quad, 2.4 GHz
582,000,000 transistors, ~100 W
adapted from Berger & Klöckner (NYU 2010)
35. What’s in a computer?
adapted from Berger & Klöckner (NYU 2010)
36. What’s in a computer?
Memory
adapted from Berger & Klöckner (NYU 2010)
37. Architecture
• What’s in a (basic) computer?
• Basic Subsystems
• Machine Language
• Memory Hierarchy
• Pipelines
38. A Basic Processor
[Block diagram, loosely based on the Intel 8086: a Memory Interface with an Address ALU drives the Address Bus and Data Bus; a Register File with Flags, the instruction fetch unit with its PC, and the Data ALU all sit on an Internal Bus, coordinated by the Control Unit]
adapted from Berger & Klöckner (NYU 2010)
39. How all of this fits together
Everything synchronizes to the Clock.
Control Unit ("CU"): the brains of the operation — everything connects to it.
Bus entries/exits are gated and (potentially) buffered.
The CU controls the gates and tells the other units 'what' and 'how':
• What operation?
• Which register?
• Which addressing mode?
adapted from Berger & Klöckner (NYU 2010)
40. What is... an ALU?
Arithmetic Logic Unit
One or two operands A, B
Operation selector (Op):
• (Integer) Addition, Subtraction
• (Logical) And, Or, Not
• (Bitwise) Shifts (equivalent to
multiplication by power of two)
• (Integer) Multiplication, Division
Specialized ALUs:
• Floating Point Unit (FPU)
• Address ALU
Operates on binary representations of
numbers. Negative numbers represented
by two’s complement.
adapted from Berger & Klöckner (NYU 2010)
41. What is... a Register File?
Registers (%r0, %r1, ..., %r7) are On-Chip Memory
• Directly usable as operands in Machine Language
• Often "general-purpose"
• Sometimes special-purpose: Floating point, Indexing, Accumulator
• Small: x86_64 has 16 × 64-bit GPRs
• Very fast (near-zero latency)
adapted from Berger & Klöckner (NYU 2010)
42. How does computer memory work?
One (reading) memory transaction (simplified):
[Timing diagram: Processor and Memory joined by data bus D0..15, address bus A0..15, read/write select line R/W, and clock CLK]
adapted from Berger & Klöckner (NYU 2010)
48. How does computer memory work?
One (reading) memory transaction, as in the diagram above (simplified).
Observation: Access (and addressing) happens in bus-width-size "chunks".
adapted from Berger & Klöckner (NYU 2010)
49. What is... a Memory Interface?
Memory Interface gets and stores binary
words in off-chip memory.
Smallest granularity: Bus width
Tells outside memory
• “where” through address bus
• “what” through data bus
Computer main memory is “Dynamic RAM”
(DRAM): Slow, but small and cheap.
adapted from Berger & Klöckner (NYU 2010)
50. Architecture
• What’s in a (basic) computer?
• Basic Subsystems
• Machine Language
• Memory Hierarchy
• Pipelines
• CPUs to GPUs
51. A Very Simple Program
int a = 5;
int b = 17;
int z = a * b;

 4: c7 45 f4 05 00 00 00  movl  $0x5,-0xc(%rbp)
 b: c7 45 f8 11 00 00 00  movl  $0x11,-0x8(%rbp)
12: 8b 45 f4              mov   -0xc(%rbp),%eax
15: 0f af 45 f8           imul  -0x8(%rbp),%eax
19: 89 45 fc              mov   %eax,-0x4(%rbp)
1c: 8b 45 fc              mov   -0x4(%rbp),%eax

Things to know:
• Addressing modes (Immediate, Register, Base plus Offset)
• The 0x... prefix means hexadecimal
• "AT&T Form" (we'll use this): <opcode><size> <source>, <dest>
adapted from Berger & Klöckner (NYU 2010)
52. A Very Simple Program: Intel Form
 4: c7 45 f4 05 00 00 00  mov  DWORD PTR [rbp-0xc],0x5
 b: c7 45 f8 11 00 00 00  mov  DWORD PTR [rbp-0x8],0x11
12: 8b 45 f4              mov  eax,DWORD PTR [rbp-0xc]
15: 0f af 45 f8           imul eax,DWORD PTR [rbp-0x8]
19: 89 45 fc              mov  DWORD PTR [rbp-0x4],eax
1c: 8b 45 fc              mov  eax,DWORD PTR [rbp-0x4]
• “Intel Form”: (you might see this on the net)
<opcode> <sized dest>, <sized source>
• Goal: Reading comprehension.
• Don’t understand an opcode?
Google “<opcode> intel instruction”.
adapted from Berger & Klöckner (NYU 2010)
53. Machine Language Loops
int main()
{
    int y = 0, i;

    for (i = 0; y < 10; ++i)
        y += i;

    return y;
}

 0: 55                    push  %rbp
 1: 48 89 e5              mov   %rsp,%rbp
 4: c7 45 f8 00 00 00 00  movl  $0x0,-0x8(%rbp)
 b: c7 45 fc 00 00 00 00  movl  $0x0,-0x4(%rbp)
12: eb 0a                 jmp   1e <main+0x1e>
14: 8b 45 fc              mov   -0x4(%rbp),%eax
17: 01 45 f8              add   %eax,-0x8(%rbp)
1a: 83 45 fc 01           addl  $0x1,-0x4(%rbp)
1e: 83 7d f8 09           cmpl  $0x9,-0x8(%rbp)
22: 7e f0                 jle   14 <main+0x14>
24: 8b 45 f8              mov   -0x8(%rbp),%eax
27: c9                    leaveq
28: c3                    retq

Things to know:
• Condition Codes (Flags): Zero, Sign, Carry, etc.
• Call Stack: Stack frame, stack pointer, base pointer
• ABI: Calling conventions
adapted from Berger & Klöckner (NYU 2010)
54. Machine Language Loops
(Same program and disassembly as above.)

Want to make those yourself? Write myprogram.c, then:

$ cc -c myprogram.c
$ objdump --disassemble myprogram.o

Things to know:
• Condition Codes (Flags): Zero, Sign, Carry, etc.
• Call Stack: Stack frame, stack pointer, base pointer
• ABI: Calling conventions
adapted from Berger & Klöckner (NYU 2010)
57. We know how a computer works!
All of this can be built in about 4000 transistors.
(e.g. MOS 6502 in Apple II, Commodore 64, Atari 2600)
So what exactly is Intel doing with the other 581,996,000
transistors?
Answer: Make things go faster!
Goal now:
Understand sources of slowness, and how they get addressed.
Remember: High Performance Computing
adapted from Berger & Klöckner (NYU 2010)
58. The High-Performance Mindset
Writing high-performance Codes
Mindset: What is going to be the limiting
factor?
• ALU?
• Memory?
• Communication? (if multi-machine)
Benchmark the assumed limiting factor right
away.
Evaluate
• Know your peak throughputs (roughly)
• Are you getting close?
• Are you tracking the right limiting factor?
adapted from Berger & Klöckner (NYU 2010)
59. Architecture
• What’s in a (basic) computer?
• Basic Subsystems
• Machine Language
• Memory Hierarchy
• Pipelines
• CPUs to GPUs
61. Source of Slowness: Memory
Memory is slow.
Distinguish two different versions of “slow”:
• Bandwidth
• Latency
→ Memory has long latency, but can have large bandwidth.
Idea: put a look-up table of recently-used data onto the chip → "Cache".
Size of die vs. distance to memory: big!
Dynamic RAM: long intrinsic latency!
adapted from Berger & Klöckner (NYU 2010)
63. Performance of computer system
[Plot: performance of the computer system versus size of the problem being solved. Performance steps down each time the problem outgrows a level — entire problem fits within registers, then within cache, then within main memory, then requires secondary (disk) memory, and finally is too big for the system]
from Scott et al. "Scientific Parallel Computing" (2005)
64. The Memory Hierarchy
Hierarchy of increasingly bigger, slower memories:
Registers: 1 kB, 1 cycle
L1 Cache: 10 kB, 10 cycles
L2 Cache: 1 MB, 100 cycles
DRAM: 1 GB, 1000 cycles
Virtual Memory (hard drive): 1 TB, 1 M cycles

How might data locality factor into this? What is a working set?
adapted from Berger & Klöckner (NYU 2010)
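
A back-of-the-envelope consequence of those numbers (a sketch — real hardware overlaps misses): average access time = hit time + miss rate × miss penalty.

#include <stdio.h>

int main(void)
{
    /* L1 hit = 10 cycles, DRAM penalty = 1000 cycles (figures from above) */
    const double hit = 10.0, penalty = 1000.0;
    for (double miss = 0.01; miss < 0.5; miss *= 4)
        printf("miss rate %5.1f%% -> ~%6.1f cycles/access\n",
               100.0 * miss, hit + miss * penalty);
    return 0;
}

Even a 4% miss rate already quintuples the average access cost — which is why locality dominates performance tuning.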
65. Cache: Actual Implementation
Demands on cache implementation:
• Fast, small, cheap, low-power
• Fine-grained
• High “hit”-rate (few “misses”)
Problem:
Goals at odds with each other: Access matching logic expensive!
Solution 1: More data per unit of access matching logic
→ Larger “Cache Lines”
Solution 2: Simpler/less access matching logic
→ Less than full “Associativity”
Other choices: Eviction strategy, size
adapted from Berger & Klöckner (NYU 2010)
67. Cache: Associativity
[Diagram: Direct Mapped vs. 2-way set associative — direct mapped: each memory block maps to exactly one of the four cache slots; 2-way: each block may go in either way of its set]
[Plot: miss rate versus cache size on the integer portion of SPEC CPU2000 — Cantin & Hill 2003]
adapted from Berger & Klöckner (NYU 2010)
68. Cache Example: Intel Q6600/Core2 Quad
--- L1 data cache ---
fully associative cache     = false
threads sharing this cache  = 0x0 (0)
processor cores on this die = 0x3 (3)
system coherency line size  = 0x3f (63)
ways of associativity       = 0x7 (7)
number of sets - 1 (s)      = 63

--- L1 instruction cache ---
fully associative cache     = false
threads sharing this cache  = 0x0 (0)
processor cores on this die = 0x3 (3)
system coherency line size  = 0x3f (63)
ways of associativity       = 0x7 (7)
number of sets - 1 (s)      = 63

--- L2 unified cache ---
fully associative cache     = false
threads sharing this cache  = 0x1 (1)
processor cores on this die = 0x3 (3)
system coherency line size  = 0x3f (63)
ways of associativity       = 0xf (15)
number of sets - 1 (s)      = 4095
More than you care to know about your CPU:
http://www.etallen.com/cpuid.html
adapted from Berger & Klöckner (NYU 2010)
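
A quick sanity check on those numbers (the cpuid fields encode value − 1): line size 64 B × 8 ways × 64 sets = 32 kiB per L1 cache, and 64 B × 16 ways × 4096 sets = 4 MiB for the shared L2 — matching the Q6600's published 32 kiB L1 / 4 MiB L2 per core pair.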
69. Measuring the Cache I
void go(unsigned count, unsigned stride)
{
    const unsigned arr_size = 64 * 1024 * 1024;
    int *ary = (int *) malloc(sizeof(int) * arr_size);

    for (unsigned it = 0; it < count; ++it)
    {
        for (unsigned i = 0; i < arr_size; i += stride)
            ary[i] *= 17;
    }

    free(ary);
}
adapted from Berger & Klöckner (NYU 2010)
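
As written, go() only allocates, sweeps, and frees; to see the cache you have to time it. A minimal harness (a sketch, assuming POSIX clock_gettime; the classic observation is that runtime stays roughly flat while stride × sizeof(int) fits within one 64 B cache line, because the same number of lines is touched either way):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void go(unsigned count, unsigned stride);   /* as defined above */

int main(void)
{
    for (unsigned stride = 1; stride <= 64; stride *= 2)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        go(1, stride);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double s = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        printf("stride %2u: %.3f s\n", stride, s);
    }
    return 0;
}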
71. Measuring the Cache II
void go(unsigned array_size, unsigned steps)
{
    int *ary = (int *) malloc(sizeof(int) * array_size);
    unsigned asm1 = array_size - 1;

    for (unsigned i = 0; i < steps; ++i)
        ary[(i * 16) & asm1]++;

    free(ary);
}
adapted from Berger & Klöckner (NYU 2010)
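
The & asm1 mask wraps the index without a division — which only works because array_size is a power of two. Sweeping array_size while holding steps fixed makes the time per access jump each time the working set outgrows another cache level.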
73. Measuring the Cache III
void go(unsigned array_size, unsigned stride, unsigned steps)
{
    char *ary = (char *) malloc(sizeof(int) * array_size);
    unsigned p = 0;

    for (unsigned i = 0; i < steps; ++i)
    {
        ary[p]++;
        p += stride;
        if (p >= array_size)
            p = 0;
    }

    free(ary);
}
adapted from Berger & Klöckner (NYU 2010)
77. Architecture
• What’s in a (basic) computer?
• Basic Subsystems
• Machine Language
• Memory Hierarchy
• Pipelines
• CPUs to GPUs
78. Source of Slowness: Sequential Operation
IF Instruction fetch
ID Instruction Decode
EX Execution
MEM Memory Read/Write
WB Result Writeback
adapted from Berger & Klöckner (NYU 2010)
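
The payoff of pipelining, in rough numbers: with the five stages above and one instruction in each stage, a filled pipeline retires one instruction per cycle — ideally ~5× an unpipelined design that spends all five stages on each instruction in turn. Stalls and hazards are what keep real code below that ideal.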
84. Programming for the Pipeline
How to upset a processor pipeline:
for (int i = 0; i < 1000; ++i)
    for (int j = 0; j < 1000; ++j)
    {
        if (j % 2 == 0)
            do_something(i, j);
    }

... why is this bad?
adapted from Berger & Klöckner (NYU 2010)
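
Why it's bad: the branch sits in the innermost loop, taken and not-taken on alternating iterations. Here the condition depends only on j, so one hedged fix is to restructure the loop so the branch disappears entirely (do_something is the slide's placeholder):

for (int i = 0; i < 1000; ++i)
    for (int j = 0; j < 1000; j += 2)   /* visit only even j: no branch left */
        do_something(i, j);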
85. A Puzzle
int steps = 256 * 1024 * 1024;
int a[2] = {0, 0};

// Loop 1
for (int i = 0; i < steps; i++) { a[0]++; a[0]++; }

// Loop 2
for (int i = 0; i < steps; i++) { a[0]++; a[1]++; }

Which is faster? ... and why?
adapted from Berger & Klöckner (NYU 2010)
86. Two useful Strategies
Loop unrolling:

    for (int i = 0; i < 1000; ++i)
        do_something(i);

becomes

    for (int i = 0; i < 1000; i += 2)
    {
        do_something(i);
        do_something(i + 1);
    }

Software pipelining:

    for (int i = 0; i < 1000; ++i)
    {
        do_a(i);
        do_b(i);
    }

becomes

    for (int i = 0; i < 1000; i += 2)
    {
        do_a(i);
        do_a(i + 1);
        do_b(i);
        do_b(i + 1);
    }

adapted from Berger & Klöckner (NYU 2010)
87. SIMD
Control Units are large and expensive; Functional Units are simple and cheap.
→ Increase the Function/Control ratio: control several functional units with one control unit, all executing the same operation (SIMD: one instruction pool, one data pool).

GCC vector extensions:

typedef int v4si __attribute__((vector_size(16)));
v4si a, b, c;
c = a + b;
// +, -, *, /, unary minus, ^, |, &, ~, %

Will revisit for OpenCL, GPUs.
adapted from Berger & Klöckner (NYU 2010)
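
A minimal, self-contained use of that extension (a sketch, assuming a reasonably recent GCC or Clang, which also allow subscripting vector types):

#include <stdio.h>

typedef int v4si __attribute__((vector_size(16)));   /* 4 x 32-bit ints */

int main(void)
{
    v4si a = {1, 2, 3, 4};
    v4si b = {10, 20, 30, 40};
    v4si c = a + b;                  /* one SIMD add, no loop */
    for (int i = 0; i < 4; ++i)
        printf("%d ", c[i]);         /* prints: 11 22 33 44 */
    printf("\n");
    return 0;
}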
88. Architecture
• What’s in a (basic) computer?
• Basic Subsystems
• Machine Language
• Memory Hierarchy
• Pipelines
• CPUs to GPUs
90. “CPU-style” Cores
[Diagram: a CPU-style core — Fetch/Decode, ALU (Execute), Execution Context, plus out-of-order control logic, a fancy branch predictor, a memory pre-fetcher, and a big data cache]
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
Credit: Kayvon Fatahalian (Stanford)
91. Slimming down
Idea #1: Remove components that help a single instruction stream run fast.
[Diagram: the slimmed core keeps only Fetch/Decode, ALU (Execute), and Execution Context]
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
Credit: Kayvon Fatahalian (Stanford)
slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
92. More Space: Double the Number of Cores
Two cores (two fragments in parallel): each core runs the same fragment shader on its own fragment.

<diffuseShader>:
sample r0, v4, t0, s0
mul  r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul  o0, r0, r3
mul  o1, r1, r3
mul  o2, r2, r3
mov  o3, l(1.0)

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
Credit: Kayvon Fatahalian (Stanford)
slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
93. ...again
Four cores (four fragments in parallel).
[Diagram: four slimmed cores, each with Fetch/Decode, ALU (Execute), and Execution Context]
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
Credit: Kayvon Fatahalian (Stanford)
slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
94. ...and again
Sixteen cores (sixteen fragments in parallel).
16 cores = 16 simultaneous instruction streams
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
Credit: Kayvon Fatahalian (Stanford)
slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
95. ...and again
16 cores → 16 independent instruction streams.
Reality: the instruction streams are not actually very different/independent.
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
Credit: Kayvon Fatahalian (Stanford)
slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
96. Saving Yet More Space
Recall the simple processing core: Fetch/Decode, ALU (Execute), Execution Context.
Credit: Kayvon Fatahalian (Stanford)
slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
98. Saving Yet More Space
Idea #2: Amortize the cost/complexity of managing an instruction stream across many ALUs → SIMD processing.
[Diagram: one Fetch/Decode unit now feeds ALUs 1-8; each ALU keeps a small context (Ctx), backed by Shared Ctx Data]
Credit: Kayvon Fatahalian (Stanford)
slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
101. Gratuitous Amounts of Parallelism!
http://www.youtube.com/watch?v=1yH_j8-VVLo
Example: 128 fragments in parallel — 16 cores = 128 ALUs = 16 simultaneous instruction streams, i.e. 16 independent groups of 8 synchronized streams.
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
Credit: Kayvon Fatahalian (Stanford)
slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
103. Remaining Problem: Slow Memory
Problem: Memory still has very high latency...
...but we've removed most of the hardware that helps us deal with that:
• caches
• branch prediction
• out-of-order execution
So what now? Idea #3: Even more parallelism + some extra memory = a solution!
slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
106. GPU Architecture Summary
Core Ideas:
1. Many slimmed-down cores → lots of parallelism
2. More ALUs, fewer Control Units
3. Avoid memory stalls by interleaving execution of SIMD groups ("warps")
Credit: Kayvon Fatahalian (Stanford)
slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
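
Rough arithmetic behind Idea 3 (illustrative numbers, not a spec): if a memory access costs ~1000 cycles and a warp has ~50 cycles of arithmetic between accesses, a core needs on the order of 1000 / 50 = 20 warps resident, so that whenever one warp stalls on memory, another is ready to issue.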
107. Is it free?
• What are the consequences?
• Program must be more "predictable"
• Data access coherency
• Program flow
slide by Matthew Bolitho
108. Some terminology
[Diagram: distributed memory — each processor P has a private memory M, all joined by an Interconnection Network; shared memory — processors P reach shared memories M through the Interconnection Network]
"distributed memory" vs. "shared memory" — hybrid approaches increasingly common; now: mostly hybrid
109. Some More Terminology
One way to classify machines distinguishes between:

Shared memory: global memory can be accessed by all processors or cores. Information is exchanged between threads using shared variables written by one thread and read by another. Access to shared variables must be coordinated.

Distributed memory: private memory for each processor, only accessible by that processor, so no synchronization of memory accesses is needed. Information is exchanged by sending data from one processor to another via an interconnection network using explicit communication operations.

[Diagram: memories M and processors P attached to an Interconnection Network]
113. Connection: Hardware ↔ Programming Model
[Diagram: a grid of cores, each with Fetch/Decode, 32 kiB private context ("Registers"), and 16 kiB shared context — "Who cares how many cores?"]
Idea:
• Program as if there were "infinitely" many cores.
• Program as if there were "infinitely" many ALUs per core.
slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
114. Connection: Hardware ↔ Programming Model
(Same diagram and idea as above.)
Consider: Which is easy to do automatically?
• Parallel program → sequential hardware, or
• Sequential program → parallel hardware?
slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
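
One way to see the asymmetry, as a C sketch (saxpy_kernel and launch are illustrative names; gid plays the role of OpenCL's get_global_id(0)):

/* SPMD sketch: write the per-element body once; a GPU would launch n
 * logical threads of saxpy_kernel, one per gid. */
void saxpy_kernel(int gid, float a, const float *x, float *y)
{
    y[gid] = a * x[gid] + y[gid];
}

/* "Parallel program -> sequential hardware" is the easy direction:
 * just run the grid as a loop. The reverse, automatically parallelizing
 * an arbitrary sequential loop nest, is hard. */
void launch(int n, float a, const float *x, float *y)
{
    for (int gid = 0; gid < n; ++gid)
        saxpy_kernel(gid, a, x, y);
}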