COMPUTER ORGANIZATION &
ARCHITECTURE
Text Book: Computer Architecture: A Quantitative
Approach by Hennessy and Patterson
Prof. Prasanta Kumar Dash
GITA, Bhubaneswar
Pipelining: Basic and Intermediate
Concepts
RISC Instruction Set Basics
(from Hennessy and Patterson)
• Properties of RISC architectures:
– All ops on data apply to data in registers and typically
change the entire register (32-bits or 64-bits).
– The only ops that affect memory are load/store
operations. Memory to register, and register to
memory.
– Load and store ops on data less than a full size of a
register (32, 16, 8 bits) are often available.
– Usually instructions are few in number (this can be
relative) and are typically one size.
RISC Instruction Set Basics
Types Of Instructions
• ALU Instructions:
• Arithmetic operations either take two registers as
operands or take one register and a sign-extended
immediate value as an operand. The result is stored in
a third register.
• Logical operations AND, OR, XOR do not usually
differentiate between 32-bit and 64-bit.
• Load/Store Instructions:
• Usually take a register (base register) as an operand
and a 16-bit immediate value. The sum of the two will
create the effective address. A second register acts as the
destination in the case of a load operation.
RISC Instruction Set Basics
Types Of Instructions (continued)
• In the case of a store operation the second register
contains the data to be stored.
• Branches and Jumps
• Conditional branches are transfers of control. As
described before, a branch causes an immediate value
to be added to the current program counter.
RISC Instruction Set Implementation
• We first need to look at how instructions in the
MIPS64 instruction set are implemented without
pipelining. Assume that any instruction (MIPS) can be
executed in at most 5 clock cycles.
• The five clock cycles will be broken up into the
following steps:
• Instruction Fetch Cycle
• Instruction Decode/Register Fetch Cycle
• Execution Cycle
• Memory Access Cycle
• Write-Back Cycle
Instruction cycle
Instruction Fetch (IF) Cycle
• Send the program counter (PC) to memory
and fetch the current instruction from
memory.
• Update the PC to the next sequential PC by
adding 4 (since each instruction is 4 bytes) to
the PC.
Instruction Decode (ID)/Register Fetch Cycle
• Decode the instruction and at the same time read
the values of the registers involved. As the
registers are being read, perform an equality test in
case the instruction decodes as a branch or jump.
• The offset field of the instruction is sign-extended
in case it is needed. The possible branch effective
address is computed by adding the sign-extended
offset to the incremented PC. The branch can be
completed at this stage if the equality test is true
and the instruction decodes as a branch.
Instruction Decode (ID)/Register Fetch
Cycle (continued)
• The instruction can be decoded in parallel with
reading the registers because the register
specifiers are at fixed positions in the instruction format.
Execution (EX)/Effective Address
Cycle
• If a branch or jump did not occur in the
previous cycle, the arithmetic logic unit (ALU)
can execute the instruction.
• At this point the instruction falls into one of three
types:
• Memory Reference: ALU adds the base register and
the offset to form the effective address.
• Register-Register: ALU performs the arithmetic,
logical, etc… operation as per the opcode.
• Register-Immediate: ALU performs operation based on
the register and the immediate value (sign extended).
Memory Access (MEM) Cycle
• If a load, the effective address computed from
the previous cycle is referenced and the
memory is read. The actual data transfer to
the register does not occur until the next
cycle.
• If a store, the data from the register is written
to the effective address in memory.
Write-Back (WB) Cycle
• Occurs with Register-Register ALU instructions
or load instructions.
• Whether the operation is a register-register ALU
operation or a memory load operation, the
resulting data is written to the appropriate
register in the register file.
What Is A Pipeline?
• Pipelining is used by virtually all modern
microprocessors to enhance performance by
overlapping the execution of instructions.
• A common analogy for a pipeline is a factory
assembly line. Assume that there are three
stages:
• Welding
• Painting
• Polishing
• For simplicity, assume that each task takes one
hour.
What Is A Pipeline?
• If a single person were to work on the product it
would take three hours to produce one product.
• If we had three people, one person could work on
each stage; upon completing their stage, they
pass their product on to the next person
(since each stage takes one hour there will be no
waiting).
• We could then produce one product per hour
assuming the assembly line has been filled.
What Is A Pipeline?
Pipelining is an implementation technique
whereby multiple instructions are overlapped
in execution.
• It takes advantage of parallelism that exists
among the actions needed to execute an
instruction.
• Pipelining is the key implementation
technique used to make fast CPUs.
Characteristics Of Pipelining
• If the stages are perfectly balanced, then the
time per instruction on the pipelined
processor (assuming ideal conditions) is equal to:
Time per instruction (unpipelined) / Number of pipe stages
• Under these conditions, the speedup from
pipelining equals the number of pipe stages.
Contd…
• Usually, however, the stages will not be
perfectly balanced; furthermore, pipelining
does involve some overhead.
• The previous expression is ideal. We will see
later that there are many ways in which a
pipeline cannot function in a perfectly
balanced fashion.
Characteristics Of Pipelining
• In terms of a CPU, the implementation of
pipelining has the effect of reducing the
average instruction time, therefore reducing
the average CPI.
• EX: If each instruction in a microprocessor
takes 5 clock cycles (unpipelined) and we have
a 4-stage pipeline, the ideal average CPI with
the pipeline will be 5/4 = 1.25.
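As a quick sanity check, the same calculation as a minimal Python sketch (the 5-cycle and 4-stage figures are the ones from the example above; assumes no stalls or other overheads):

  # Ideal pipelined CPI: unpipelined cycle count spread over the stages.
  unpipelined_cycles = 5
  pipeline_stages = 4
  ideal_cpi = unpipelined_cycles / pipeline_stages
  print(ideal_cpi)  # 1.25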
Serial Vs Pipeline
[Figure: overlapped execution of six instructions, each passing
through IFetch, Dcd, Exec, Mem, and WB stages; program flow on one
axis, time on the other]
Pipelined Execution
Precedence relation
A task T is divided into a set of subtasks { T1, T2, …, Tn } such
that a subtask Tj cannot start until some earlier subtask Ti
(i < j) finishes.
A pipeline consists of a cascade of processing stages.
Stages are combinational circuits operating on the data stream
flowing through the pipe.
Stages are separated by high-speed interface latches
(holding intermediate results between stages).
Control must be under a common clock.
Pipeline Cycle
Pipeline cycle:
Determined by the time required by the
slowest stage.
Pipeline designers try to balance the length
(i.e. the processing time) of each pipeline
stage.
For a perfectly balanced pipeline, the
execution time per instruction is t/n, where
t is the execution time per instruction on the
nonpipelined machine and n is the number
of pipe stages.
Pipeline Cycle
However, it is very difficult to make the
different pipeline stages perfectly balanced.
Besides, pipelining itself involves some
overhead.
Synchronous Pipeline
[Figure: k pipeline stages S1, S2, …, Sk separated by latches (L),
with input at S1 and output at Sk; latch delay d, stage delay τm,
all stages driven by a common clock]
- Transfers between stages are simultaneous.
- One task or operation enters the pipeline per
cycle.
Asynchronous Pipeline
[Figure: k pipeline stages S1, S2, …, Sk connected by Ready/Ack
handshake signals between neighbouring stages, from Input to Output]
- Transfers performed when individual stages
are ready.
- Handshaking protocol between stages.
- Different amounts of delay may be experienced at
different stages.
- Can display variable throughput rate.
A Few Pipeline Concepts
[Figure: two adjacent stages Si and Si+1 separated by a latch;
stage delay τm, latch delay d]
Pipeline cycle: τ
Latch delay: d
τ = max{τm} + d
Pipeline frequency: f = 1/τ
Example on Clock period
Suppose the time delays of the 4 stages are τ1 = 60 ns,
τ2 = 50 ns, τ3 = 90 ns, τ4 = 80 ns, and each interface
latch has a delay of d = 10 ns.
The cycle time of this pipeline is then:
τ = max{τm} + d = 90 + 10 = 100 ns
Clock frequency of the pipeline: f = 1/τ = 1/100 ns = 10 MHz
If it were non-pipelined, the total delay would be
60 + 50 + 90 + 80 = 280 ns.
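The same calculation as a small Python sketch (stage and latch delays taken from the example above):

  # Pipeline cycle time is set by the slowest stage plus the latch delay.
  stage_delays_ns = [60, 50, 90, 80]
  latch_delay_ns = 10
  cycle_ns = max(stage_delays_ns) + latch_delay_ns  # 100 ns
  freq_mhz = 1e3 / cycle_ns                         # 10 MHz
  nonpipelined_ns = sum(stage_delays_ns)            # 280 ns
  print(cycle_ns, freq_mhz, nonpipelined_ns)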
Ideal Pipeline Speedup
k-stage pipeline processes n tasks in k + (n-1)
clock cycles:
k cycles for the first task and n-1
cycles for the remaining n-1 tasks.
Total time to process n tasks:
Tk = [k + (n-1)] τ
For the non-pipelined processor:
T1 = n k τ
Pipeline Speedup Expression
Speedup:
Sk = T1 / Tk = n k τ / [k + (n-1)] τ = n k / [k + (n-1)]
Maximum speedup: Sk → k, for n >> k
Observe that the memory bandwidth
must increase by a factor of Sk:
Otherwise, the processor would stall
waiting for data to arrive from memory.
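A minimal Python sketch of the speedup expression (k and n as defined above; τ cancels out):

  # Speedup of a k-stage pipeline over a non-pipelined processor, n tasks.
  def pipeline_speedup(k, n):
      return (n * k) / (k + (n - 1))  # Sk = nk / [k + (n-1)]

  print(pipeline_speedup(k=5, n=100))  # ~4.81, approaching k = 5 for n >> k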
Efficiency of pipeline
The percentage of busy time-space span over the
total time-space span.
n: number of tasks or instructions
k: number of pipeline stages
τ: clock period of the pipeline
Hence pipeline efficiency can be defined by:
η = n k τ / { k [k τ + (n-1) τ] } = n / [k + (n-1)]
Throughput of pipeline
The number of tasks (results) that can be completed by a
pipeline per unit time.
W = n / { [k τ + (n-1) τ] } = n / { [k + (n-1)] τ } = η / τ
Ideal case: W = 1/τ = f when η = 1.
Maximum throughput = frequency of a linear pipeline.
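A short Python sketch combining the efficiency and throughput formulas above (τ in ns, as in the earlier clock-period example):

  # Efficiency and throughput of a linear k-stage pipeline over n tasks.
  def efficiency(k, n):
      return n / (k + (n - 1))            # eta, dimensionless

  def throughput_per_ns(k, n, tau_ns):
      return efficiency(k, n) / tau_ns    # W = eta / tau

  # 4 stages, 100 tasks, tau = 100 ns: eta ~ 0.97, W ~ 0.0097 tasks/ns
  print(efficiency(4, 100), throughput_per_ns(4, 100, 100))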
Pipelines: A Few Basic Concepts
Historically, there are two different types of pipelines:
Instruction pipelines
Arithmetic pipelines
Arithmetic pipelines (e.g. FP multiplication) are not
popular in general purpose computers:
Need a continuous stream of arithmetic
operations.
E.g. Vector processors operating on an array.
On the other hand, instruction pipelines are used in
almost every modern processor.
Pipelines: A Few Basic Concepts
Pipeline increases instruction throughput:
But, does not decrease the execution time of the
individual instructions.
In fact, slightly increases execution time of each
instruction due to pipeline overheads.
Pipeline overhead arises due to a combination
of:
Pipeline register delay
Clock skew
Pipelines: A Few Basic Concepts
Pipeline register delay:
Caused by the setup time of the latches.
Clock skew:
The maximum delay between clock arrival at
any two registers.
Once the clock cycle is as small as the pipeline
overhead:
No further pipelining would be useful.
Very deep pipelines may not be useful.
Pipeline Registers
Pipeline registers are an essential part of pipelines:
There are 4 groups of pipeline registers in a 5-stage pipeline.
Each group saves the output from one stage and passes it as input
to the next stage:
IF/ID
ID/EX
EX/MEM
MEM/WB
This way, each time “something is computed”…
Effective address, immediate value, register content, etc.
…it is saved safely in the context of the instruction that needs
it.
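To make the idea concrete, a toy Python sketch that models the four register groups as dictionaries handed from stage to stage (the field names are illustrative, not from the slides):

  # Toy model: each pipeline register group carries an instruction's
  # context (everything computed so far) on to the next stage.
  if_id  = {"ir": "ADD R1,R2,R3", "npc": 0x1004}   # fetched instruction
  id_ex  = {**if_id, "a": 5, "b": 7, "imm": 0}     # values read in ID
  ex_mem = {**id_ex, "alu_out": 12}                # result computed in EX
  mem_wb = {**ex_mem, "lmd": None}                 # no load data (ALU op)
  # WB finally writes mem_wb["alu_out"] into the register file for R1.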
Looking At The Big Picture
• Overall, the most time that a non-pipelined
instruction can take is 5 clock cycles. Below is
a summary:
• Branch - 2 clock cycles
• Store - 4 clock cycles
• Other - 5 clock cycles
• EX: Assuming branch instructions account for
12% of all instructions and stores account for
10%, what is the average CPI of a non-
pipelined CPU?
ANS: 0.12*2+0.10*4+0.78*5 = 4.54
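The same weighted average as a minimal Python sketch (instruction mix and cycle counts from the example above):

  # Average CPI as an instruction-mix weighted sum.
  mix = {"branch": (0.12, 2), "store": (0.10, 4), "other": (0.78, 5)}
  avg_cpi = sum(freq * cycles for freq, cycles in mix.values())
  print(avg_cpi)  # 4.54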
Assignment
Find the total time to process 100 tasks in a
2-stage pipeline with a cycle time of 10 ns.
Repeat the above problem assuming latching
in the pipeline requires 2 ns.
A pipeline has 4 stages with time delays τ1 =
60 ns, τ2 = 50 ns, τ3 = 90 ns, τ4 = 80 ns, and each
interface latch has a delay of d = 10 ns. What
is the cycle time of this pipeline?
What is the clock frequency of the above
pipeline?
Instruction-Level Parallelism
• What is ILP (Instruction-Level Parallelism)?
– Parallel execution of different instructions
belonging to the same thread.
• A thread usually consists of several basic
blocks:
– As well as several branches and loops.
• Basic block:
– A sequence of instructions not having a branch
instruction.
Cont…
• Instruction pipelines can effectively exploit
parallelism in a basic block:
– An n-stage pipeline can improve performance up
to n times.
– Does not require much investment in hardware
– Transparent to the programmers.
• Pipelining can be viewed to:
– Decrease average CPI, and/or
– Decrease clock cycle time for instructions.
Drags on Pipeline Performance
• Factors that can degrade pipeline performance
– Unbalanced stages
– Pipeline overheads
– Clock skew
– Hazards
• Hazards cause the worst drag on the
performance of a pipeline.
The Classical RISC: 5 Stage Pipeline
• In an ideal case to implement a pipeline we
just need to start a new instruction at each
clock cycle.
• Unfortunately there are many problems while
trying to implement this.
• If we look at each stage of instruction execution
as being independent, we can see how
instructions can be “overlapped”.
Problems With The Previous Figure
• The memory is accessed twice during each clock
cycle. This problem is avoided by using separate data
and instruction caches.
• It is important to note that if the clock period is the
same for a pipelined processor and a non-pipelined
processor, the memory must work five times faster.
• Another problem that we can observe is that the
registers are accessed twice every clock cycle. To try
to avoid a resource conflict we perform the register
write in the first half of the cycle and the read in the
second half of the cycle.
Problems With The Previous Figure
(continued)
• Because the write occurs in the first half of the
cycle, the value written can be read by another
instruction further down the pipeline in the same cycle.
• A third problem arises with the interaction of
the pipeline with the PC. We use an adder to
increment PC by the end of IF. Within ID we
may branch and modify PC. How does this
affect the pipeline?
Pipeline Hazards
• The performance gain from using pipelining
occurs because we can start the execution of a
new instruction each clock cycle. In a real
implementation this is not always possible.
• What is a pipeline hazard?
A situation that prevents an instruction from executing
during its designated clock cycle.
• Pipeline hazards prevent the execution of the
next instruction during the appropriate clock
cycle.
Types Of Hazards
Structural hazards arise from resource conflicts
when the hardware cannot support all possible
combinations of instructions simultaneously in
overlapped execution.
Data hazards arise when an instruction depends on
the results of a previous instruction in a way that
is exposed by the overlapping of instructions in
the pipeline.
Control hazards arise from the pipelining of
branches and other instructions that change the
PC.
Structural Hazard: Example
[Figure: four instructions in consecutive cycles, each passing through
IF, ID, EXE, MEM, and WB stages]
An Example of a Structural
Hazard
[Figure: a Load followed by Instructions 1–4, drawn as overlapped
Mem–Reg–ALU–DM–Reg stage sequences over time; in one cycle the
Load's data-memory (DM) access and Instruction 3's instruction fetch
(Mem) both need the memory]
Would there be a hazard here?
Performance with Stalls
• Stalls degrade performance of a
pipeline:
–Result in deviation from 1 instruction
executing/clock cycle.
–Let’s examine by how much stalls can
impact CPI…
A Hazard Will Cause A Pipeline Stall
• Some performance expressions involve a
realistic pipeline in terms of CPI. It is assumed
that the clock period is the same for pipelined
and unpipelined implementations.
Speedup = CPI unpipelined / CPI pipelined
= Pipeline depth / (1 + stall cycles per instruction)
= Avg. instruction time unpipelined / Avg. instruction time pipelined
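A minimal Python sketch of this expression (assuming, as stated above, equal clock periods for the pipelined and unpipelined machines):

  # Speedup of a pipeline whose ideal CPI of 1 is inflated by stalls.
  def speedup_with_stalls(pipeline_depth, stalls_per_instr):
      return pipeline_depth / (1 + stalls_per_instr)

  print(speedup_with_stalls(5, 0.0))  # 5.0: the ideal, no-stall case
  print(speedup_with_stalls(5, 0.4))  # ~3.57 with 0.4 stalls/instruction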
Dealing With Structural Hazards
• Arise from resource conflicts among instructions
executing concurrently:
–Same resource is required by two (or more)
concurrently executing instructions at the
same time.
• Easy ways to avoid structural hazards:
–Duplicate resources (sometimes not practical)
–Memory interleaving (lower- and higher-order)
Contd…
• Examples of Resolution of Structural Hazard:
–An ALU to perform an arithmetic operation
and an adder to increment PC.
–Separate data cache and instruction cache
accessed simultaneously in the same cycle.
How is it Resolved?
[Figure: Load, Instruction 1, Instruction 2, a Stall, and Instruction 3
drawn as overlapped stage sequences; the bubble occupies one cycle in
every stage, delaying Instruction 3 by one clock cycle]
A Pipeline can be stalled by inserting a “bubble” or NOP
Dealing With Structural Hazards
• A structural hazard is dealt with by inserting a
stall or pipeline bubble into the pipeline.
• This means that for that clock cycle, nothing
happens for that instruction.
• This effectively “slides” that instruction, and
subsequent instructions, back by one clock cycle.
• The net effect is an increase in the average CPI.
Dealing With Structural Hazards
(continued)
• We can see that even though the clock speed
of the processor with the hazard is a little
faster, the speedup is still less than 1.
• Therefore the hazard has quite an effect on
the performance.
• Sometimes computer architects will opt to
design a processor that exhibits a structural
hazard. Why?
• A: The improvement to the processor data path is too costly.
• B: The hazard occurs rarely enough so that the processor will still
perform to specifications.
An Example of Performance
Impact of Structural Hazard
• Assume:
– Pipelined processor.
– Data references constitute 40% of an instruction
mix.
– Ideal CPI of the pipelined machine is 1.
– Consider two cases:
• Unified data and instruction cache vs. separate data and
instruction cache.
• What is the impact on performance?
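A hedged Python sketch of the comparison (the 1-cycle stall per data reference for the unified cache is an illustrative assumption, not a figure from the slides):

  # Assumption: with a unified cache, every data reference (40% of
  # instructions) collides with an instruction fetch, costing 1 stall.
  ideal_cpi = 1.0
  data_ref_freq = 0.40
  unified_cpi = ideal_cpi + data_ref_freq * 1   # 1.4
  split_cpi = ideal_cpi                         # 1.0, no structural stalls
  print(unified_cpi / split_cpi)  # unified cache is ~1.4x slower per clock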
Data Dependences and Hazards
• Determining how one instruction depends on
another is critical to determining how much
parallelism exists in a program and how that
parallelism can be exploited.
Data Dependences
There are three different types of dependences:
• Data Dependences (also called true data
dependences), Name Dependences and
Control Dependences.
• An instruction j is data dependent on
instruction i if either of the following holds:
– Instruction i produces a result that may be used by
instruction j, or
– Instruction j is data dependent on instruction k, and
instruction k is data dependent on instruction i.
Consider the MIPS code sequence below,
which adds a scalar in register F2 to each element of a
vector of values in memory (starting at 0(R1), with the
last element at 8(R2)):
Loop: L.D F0, 0(R1) ;F0=array element
ADD.D F4, F0, F2 ;add scalar in F2
S.D F4, 0(R1) ;store result
DADDUI R1, R1, #-8 ;decrement pointer 8 bytes
BNE R1, R2, LOOP ;branch R1!=R2
Data Dependences
Contd…
• A data value may flow between instructions
either through registers or through memory
locations.
• When the data flow occurs in a register,
detecting the dependence is straightforward,
since the register names are fixed in the
instructions,
• although it gets more complicated when
branches intervene.
Contd…
• Dependences that flow through memory
locations are more difficult to detect
• Since two addresses may refer to the same
location but look different:
• For example, 100(R4) and 20(R6) may be
identical memory addresses.
• The effective address of a load or store may change
from one execution of the instruction to another
(so that 20(R4) in one execution and 20(R4) in
another may name different locations).
Detecting Data Dependences
• A data value may flow between instructions:
– (i) through registers
– (ii) through memory locations.
• When data flow is through a register:
– Detection is rather straightforward.
• When data flow is through a memory location:
– Detection is difficult.
– Two addresses may refer to the same memory
location but look different.
100(R4) and 20(R6)
Name Dependences
• A Name Dependence occurs when two
instructions use the same register or memory
location, called a name
• There are two types of name dependences
between an instruction i that precedes
instruction j in program order:
• Antidependence,
• Output Dependence
Contd…
• An antidependence between instruction i and
instruction j occurs when instruction j writes a
register or memory location that instruction i
reads.
• The original ordering must be preserved to
ensure that i reads the correct value. There is
an antidependence between S.D and DADDUI
on register R1 in the MIPS code sequence on the
next slide.
Consider the MIPS code sequence again,
which adds a scalar in register F2 to each element of a
vector of values in memory (starting at 0(R1), with the
last element at 8(R2)):
Loop: L.D F0, 0(R1) ;F0=array element
ADD.D F4, F0, F2 ;add scalar in F2
S.D F4, 0(R1) ;store result
DADDUI R1, R1, #-8 ;decrement pointer 8 bytes
BNE R1, R2, LOOP ;branch R1!=R2
Contd…
• An Output Dependence occurs when
instruction i and instruction j write the same
register or memory location.
• The ordering between the instructions must be
preserved to ensure that the value finally
written corresponds to instruction j.
Data Hazards
• Occur when an instruction under execution
depends on:
– Data from an instruction ahead in pipeline.
• Example:
– Dependent instruction uses old data:
• Results in wrong computations
[Figure: two overlapped instructions, A=B+C followed by D=A+E; the
second instruction reads A before the first has written it back]
Types of Data Hazards
• Data hazards are of three types:
– Read After Write (RAW)
– Write After Read (WAR)
– Write After Write (WAW)
• With an in-order execution machine:
– WAW, WAR hazards cannot occur.
• Assume instruction i is issued before j.
Read after Write (RAW) Hazards
• Hazard between two instructions i & j may
occur when j attempts to read some data
object that has been modified by i.
– instruction j tries to read its operand before
instruction i writes it.
– j would incorrectly receive an old or incorrect
value.
• Example (instruction i is a write instruction issued
before j; instruction j is a read instruction issued
after i):
i: ADD R1, R2, R3
j: SUB R4, R1, R6
Read after Write (RAW) Hazards
[Figure: instruction I writes its range R(I); instruction J reads its
domain D(J); a RAW hazard exists when the two overlap]
Condition: R(I) ∩ D(J) ≠ Ø for RAW
RAW Dependency: More Examples
• Example program (a):
–i1: load r1, addr;
–i2: add r2, r1,r1;
• Program (b):
–i1: mul r1, r4, r5;
–i2: add r2, r1, r1;
• In both cases, i2 does not get its operand until i1
has completed writing the result:
–In (a) this is due to a load-use dependency
–In (b) this is due to a define-use dependency
Write after Read (WAR) Hazards
– Instruction j tries to write its destination operand
before instruction i reads it.
– i would incorrectly receive a new (incorrect)
value.
• WAR hazards do not usually occur because of the
amount of time between the read cycle and write
cycle in a pipeline.
• Example (instruction i is a read instruction issued
before j; instruction j is a write instruction issued
after i):
i: ADD R1, R2, R3
j: SUB R2, R4, R6
• WAR hazards occur due to antidependency.
Write after Read (WAR) Hazards
[Figure: instruction I reads its domain D(I); instruction J writes its
range R(J); a WAR hazard exists when the two overlap]
Condition: D(I) ∩ R(J) ≠ Ø for WAR
Write After Write (WAW) Hazards
• WAW hazard:
– Both i & j want to modify the same data object.
– Instruction j tries to write an operand before
instruction i writes it.
– Writes end up being performed in the wrong order.
• Example (instruction i is a write instruction issued
before j; instruction j is a write instruction issued
after i):
i: DIV F1, F2, F3
j: SUB F1, F4, F6
(How can this happen? Only if writes can complete out
of order, e.g. when a long-latency DIV finishes after
the SUB that follows it.)
• WAW hazards occur due to output dependence.
Write After Write (WAW) Hazards
[Figure: instructions I and J both write; a WAW hazard exists when
their ranges overlap]
Condition: R(I) ∩ R(J) ≠ Ø for WAW
Inter-Instruction Dependences
• Data dependence (Read-after-Write, RAW):
r3 ← r1 op r2
r5 ← r3 op r4
• Anti-dependence (Write-after-Read, WAR):
r3 ← r1 op r2
r1 ← r4 op r5
• Output dependence (Write-after-Write, WAW):
r3 ← r1 op r2
r5 ← r3 op r4
r3 ← r6 op r7
• Control dependence
(WAR and WAW are false dependences.)
Data Dependencies: Summary
Data dependencies in straight-line code:
• RAW (Read After Write) dependency / flow dependency:
– load-use dependency
– define-use dependency
– A true dependency: cannot be overcome.
• WAR (Write After Read) dependency / anti dependency:
– A false dependency: can be eliminated by register renaming.
• WAW (Write After Write) dependency / output dependency:
– A false dependency: can be eliminated by register renaming.
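To illustrate the taxonomy above, a small Python sketch that classifies the hazard type between two instructions from their read/write register sets (a simplification: real detectors must also consider memory locations and pipeline timing):

  # Classify dependences between instruction i (earlier) and j (later).
  def classify(i_reads, i_writes, j_reads, j_writes):
      deps = []
      if i_writes & j_reads:  deps.append("RAW (true/flow)")
      if i_reads & j_writes:  deps.append("WAR (anti)")
      if i_writes & j_writes: deps.append("WAW (output)")
      return deps

  # ADD R1,R2,R3 followed by SUB R4,R1,R5 -> RAW on R1
  print(classify({"R2", "R3"}, {"R1"}, {"R1", "R5"}, {"R4"}))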
Recollect Data Hazards
What causes them?
– Pipelining changes the order of read/write
accesses to operands.
– Order differs from that of an unpipelined
machine.
• Example:
– ADD R1, R2, R3
– SUB R4, R1, R5
For MIPS, ADD writes
the register in WB but
SUB needs it in ID.
This is a data hazard
Illustration of a Data Hazard
[Figure: ADD R1,R2,R3 followed by SUB R4,R1,R5; AND R6,R1,R7;
OR R8,R1,R9; XOR R10,R1,R11, drawn as overlapped stage sequences
(Mem, Reg, ALU, DM, Reg) over time]
ADD instruction causes a hazard in next 3 instructions
because register not written until after those 3 read it.
Solutions to Data Hazard
• Operand forwarding
• Pipeline interlock
• By S/W (NOP)
• Reordering the instruction
Forwarding
• Simplest solution to data hazard:
– forwarding
• Result of the ADD instruction not really
needed:
– until after ADD actually produces it.
• Can we move the result from EX/MEM
register to the beginning of ALU (where SUB
needs it)?
– Yes!
Forwarding
cont…
• Generally speaking:
–Forwarding occurs when a result is
passed directly to the functional unit
that requires it.
–Result goes from output of one pipeline
stage to input of another.
Forwarding Technique
[Figure: the ALU result latched between the EXECUTE and WRITE
RESULT stages is fed back along a forwarding path to the ALU inputs]
When Can We Forward?
[Figure: the same five instructions with forwarding paths drawn in:
SUB gets R1 from the EX/MEM pipe register;
AND gets R1 from the MEM/WB pipe register;
OR gets R1 by forwarding through the register file (write in the
first half of the cycle, read in the second half)]
If a line goes “forward” in time, you can do forwarding.
If it is drawn backward, it is physically impossible.
General Data Forwarding
• It is easy to see how data forwarding can be
used by drawing out the pipelined execution
of each instruction.
• Now consider the following instructions:
DADD R1, R2, R3
LD R4, 0(R1)
SD R4, 12(R1)
Problems
• Can data forwarding prevent all data hazards?
• NO!
• The following operations will still cause a data
hazard. This happens because the further
down the pipeline we get, the less we can use
forwarding.
LD R1, 0(R2)
DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
Problems
• We can avoid the hazard by using a Pipeline
interlock.
• The pipeline interlock will detect when data
forwarding will not be able to get the data to
the next instruction in time.
• A stall is introduced until the instruction can
get the appropriate data from the previous
instruction.
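A minimal Python sketch of such a load-use interlock check (the dict encoding and one-cycle stall correspond to the standard 5-stage case; the helper name is illustrative):

  # Stall when the instruction in ID reads a register that the load
  # currently in EX will only produce at the end of MEM.
  def needs_stall(instr_in_ex, instr_in_id):
      return (instr_in_ex["op"] == "LD" and
              instr_in_ex["dest"] in instr_in_id["srcs"])

  ld   = {"op": "LD",   "dest": "R1", "srcs": {"R2"}}
  dsub = {"op": "DSUB", "dest": "R4", "srcs": {"R1", "R5"}}
  print(needs_stall(ld, dsub))  # True: insert one bubble, then forward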
Handling data hazard by S/W
• The compiler introduces NOPs in between two
instructions.
• NOP = a piece of code which keeps a gap
between two instructions.
• Detection of the dependency is left entirely
to the S/W.
• Advantage: it enables an easy technique
known as instruction reordering.
Instruction Reordering
Before:
ADD R1, R2, R3
SUB R4, R1, R5
XOR R8, R6, R7
AND R9, R10, R11
After:
ADD R1, R2, R3
XOR R8, R6, R7
AND R9, R10, R11
SUB R4, R1, R5
Instruction Execution:
MIPS Data path
• Can break down the process of “running” an
instruction into stages.
• These stages are the steps needed to
complete the execution of each instruction.
Some instructions will not require some stages.
MIPS Data path
The DLX (MIPS) datapath allows every instruction to be
executed in 4 or 5 cycles
1. Instruction Fetch (IF) - Get the instruction to be
executed.
IR ← Mem[PC]
NPC ← PC + 4
IR – Instruction register
NPC – Next program counter
2. Instruction Decode/Register Fetch (ID) –
Figure out what the instruction is supposed to
do and what it needs.
A ← Register File[Rs]
B ← Register File[Rt]
Imm ← {(IR16)^16, IR[15..0]} (sign-extended immediate)
A, B, and Imm are temporary registers that hold
inputs to the ALU, which is in the Execute stage.
3. Execution (EX) - The instruction has been decoded, so
execution can be split according to instruction type.
Reg-Reg ALU instr: ALUout ← A op B
Reg-Imm: ALUout ← A op Imm
Branch: ALUout ← NPC + Imm
Cond ← (A {==, !=} 0)
LD/ST: ALUout ← A + Imm
(to form the effective address)
4. Memory Access/Branch Completion (MEM) – Besides
the IF stage, this is the only stage that accesses
memory, to load and store data.
Load: LMD ← Mem[ALUout]
Store: Mem[ALUout] ← B
Branch: if (cond) PC ← ALUout
Jump: PC ← ALUout
else: PC ← NPC
LMD = Load Memory Data register
5. Write-Back (WB) – Store all the results and loads
back to registers.
Reg-Reg ALU instr: Rd ← ALUout
Load: Rd ← LMD
Reg-Imm: Rt ← ALUout
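Tying the five stages together, a toy Python walk-through of one register-register ADD (a sketch for intuition only; the tuple encoding and dict register file are stand-ins for the real datapath):

  # One trip through IF, ID, EX, MEM, WB for ADD R1, R2, R3.
  regs = {"R1": 0, "R2": 5, "R3": 7}
  mem = {0x1000: ("ADD", "R1", "R2", "R3")}  # simplified encoding
  pc = 0x1000

  ir = mem[pc]; npc = pc + 4        # IF: fetch instruction, bump PC
  op, rd, rs, rt = ir               # ID: decode ...
  a, b = regs[rs], regs[rt]         #     ... and register fetch
  alu_out = a + b                   # EX: ALU operation
                                    # MEM: no memory access for Reg-Reg
  regs[rd] = alu_out; pc = npc      # WB: write result back
  print(regs["R1"])                 # 12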
Control Hazards
• Result from branch and other instructions that change
the flow of a program (i.e. change PC).
• Example:
1: if (cond) {
2: s1 }
3: s2
• The statement in line 2 is control dependent on the
statement at line 1.
• Until the condition evaluation completes:
– It is not known whether s1 or s2 will execute next.
• Control hazards are caused by branches in the
code.
• During the IF stage remember that the PC is
incremented by 4 in preparation for the next IF
cycle of the next instruction.
• What happens if a branch is performed and
we aren't simply incrementing the PC by 4?
• The easiest way to deal with the occurrence of a
branch is to perform the IF stage again once the
branch occurs.
• The following solutions assume that we are
dealing with static branches (compile time),
meaning that the actions taken during a
branch do not change.
Four Simple Control/Branch Hazard Solutions
#1. Flush Pipeline / Stall
#2. Predict Branch Not Taken
#3. Predict Branch Taken
#4. Delayed Branch
Branch Hazard Solutions
#1. Flush Pipeline / Stall
• Stall until the branch direction is clear, flushing the
pipe once an instruction is detected to be a branch
during the ID stage.
• As an example, we will stall the pipeline until the
branch is resolved (in that case we repeat the IF
stage until the branch is resolved and modifies the
PC).
Performing IF Twice
• We take a big performance hit by
performing the instruction fetch again whenever
a branch occurs. Note, this happens whether
the branch is taken or not.
• This guarantees that the PC will get the
correct value.
[Figure: the branch passes through IF–WB normally; the following
instruction repeats its IF stage once the branch is resolved]
Control Hazards solutions
#2. Predict Branch Not Taken:
• What if we treat every branch as “not taken”?
Remember that not only do we read the registers
during ID, but we also perform an equality test in
case we need to branch.
• We can improve performance by assuming that
the branch will not be taken.
–Execute successor instructions in sequence as if there
is no branch
–Undo instructions in the pipeline if the branch is actually taken
Control Hazards solutions:
Predict Branch Not Taken: Contd..
• The “branch-not-taken” scheme is the same as
performing the IF stage a second time in our 5-stage
pipeline if the branch is taken.
• If it is not taken, there is no performance degradation.
• 47% of branches are not taken on average.
Control Hazards solutions
#3 Predict Branch Taken
– The “branch taken” scheme is of no benefit in our case
because we evaluate the branch target address in the ID stage.
– 53% branches taken on average.
– But branch target address not available after IF in
MIPS
• MIPS still incurs 1 cycle branch penalty even with
predict taken
• In some other machines (e.g. with loop branches), the
branch target is known before the branch outcome is
computed, and significant benefits can be accrued.
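A small Python sketch of what that 1-cycle penalty costs (the branch frequency is an illustrative assumption; the penalty is the one stated above):

  # Effective CPI when every branch costs one extra cycle.
  ideal_cpi = 1.0
  branch_freq = 0.17        # assumed fraction of branches in the mix
  branch_penalty = 1        # cycles, as for MIPS predict-taken above
  effective_cpi = ideal_cpi + branch_freq * branch_penalty
  print(effective_cpi)      # 1.17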
Control Hazards solutions
#4: Delayed Branch
• The fourth method for dealing with a control
hazard is to implement a “delayed” branch
scheme.
• In this scheme an instruction is inserted into the
pipeline that is useful and not dependent on
whether the branch is taken or not. It is the job of
the compiler to determine the delayed branch
instruction.
• If the branch is actually taken, we need to clear
the pipeline of any code loaded in from the “not-
taken” path.
Control Hazards solutions
Delayed Branch Contd…
• Likewise we can assume that the branch is
always taken. Does this work in our “5-stage”
pipeline?
– No, the branch target is computed during the ID cycle.
• Some processors will have the target address
computed in time for the IF stage of the next
instruction so there is no delay.
Control Hazards solutions
cont…
#4: Delayed Branch
–Insert an unrelated successor in the branch delay
slot:
branch instruction
sequential successor1
sequential successor2
........
sequential successorn
branch target (if taken)
–The successors occupy a branch delay of length n;
only 1 delay slot is required in the 5-stage pipeline.
The behavior of a delayed branch
Delayed Branch
• Simple idea: Put an instruction that would be
executed anyway right after a branch.
• Question: What instruction do we put in the delay slot?
• Answer: one that can safely be executed no matter what the
branch does.
– The compiler decides this.
[Figure: the branch occupies IF–WB; the delay-slot instruction issues
in the next cycle and always executes; the branch target or sequential
successor follows once the branch is resolved]
Delayed Branch
• One possibility: an instruction from before the branch.
• Example:
Before:
DADD R1, R2, R3
if R2 == 0 then
. . . (delay slot)
After:
if R2 == 0 then
DADD R1, R2, R3
• The DADD instruction is executed no matter what
happens in the branch:
– Because it was executed before the branch anyway!
– Therefore, it can be moved into the delay slot.
Delayed Branch
[Figure: branch, then the DADD instruction in the delay slot, then the
branch target/successor; by the time the target issues, we know whether
or not to take the branch]
• We get to execute the “DADD” instruction
“for free”.
Delayed Branch
• Another possibility: an instruction from the branch
target.
• Example:
Before:
DSUB R4, R5, R6
...
DADD R1, R2, R3
if R1 == 0 then
. . . (delay slot)
After:
DSUB R4, R5, R6
...
DADD R1, R2, R3
if R1 == 0 then
DSUB R4, R5, R6
• The DSUB instruction can be replicated into the delay
slot, and the branch target can be changed.
Delayed Branch
• A third possibility: an instruction from inside the
fall-through (not-taken) path.
• Example:
Before:
DADD R1, R2, R3
if R1 == 0 then
. . . (delay slot)
OR R7, R8, R9
DSUB R4, R5, R6
After:
DADD R1, R2, R3
if R1 == 0 then
OR R7, R8, R9
DSUB R4, R5, R6
• The OR instruction can be moved into the delay slot
ONLY IF its execution doesn't disrupt the program
execution when the branch is taken (e.g., R7 is
overwritten later on the taken path).
Introduction to Parallel
Processors
Multiprocessor
Flynn’s Classification
SISD (Single Instruction Single Data):
Uniprocessors.
MISD (Multiple Instruction Single Data):
No practical examples exist
SIMD (Single Instruction Multiple Data):
Specialized processors(Vector architectures,
Multimedia extensions, Graphics processor units)
MIMD (Multiple Instruction Multiple Data):
General purpose, commercially important
(Tightly-coupled MIMD, Loosely-coupled MIMD)
SISD
[Figure: a single control unit issues an instruction stream (IS) to one
processing unit, which exchanges a data stream (DS) with one memory
module]
SIMD
[Figure: one control unit broadcasts a single instruction stream (IS)
to processing units 1…n, each exchanging its own data stream (DS1…DSn)
with its own memory module]
MIMD
[Figure: control units 1…n each issue their own instruction stream (IS)
to processing units 1…n, each exchanging its own data stream (DS1…DSn)
with its own memory module]
A Broad Classification of Computers
• Shared-memory multiprocessors
–Also called UMA
• Distributed memory computers
–Also called NUMA:
• Distributed Shared-memory (DSM)
architectures
• Clusters
• Grids, etc.
UMA vs. NUMA Computers
[Figure (a) UMA model: processors P1…Pn, each with a cache, share main
memory over a single bus; latency = 100s of ns.
Figure (b) NUMA model: processors P1…Pn, each with a cache and local
main memory, connected by a network; latency = several milliseconds
to seconds]
Distributed Memory Computers
• Distributed memory computers use:
–Message Passing Model
• Explicit message send and receive instructions
have to be written by the programmer.
–Send: specifies local buffer + receiving process (id)
on remote computer (address).
–Receive: specifies sending process on remote
computer + local buffer to place data.
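An illustrative Python sketch of the send/receive pattern (an in-process queue stands in for the interconnect; the function names are hypothetical):

  # Message-passing sketch: explicit send and receive between processes.
  from queue import Queue

  channel = Queue()            # stands in for the network/interconnect

  def send(buf):               # send: ship a local buffer to the receiver
      channel.put(buf)

  def receive():               # receive: place incoming data in a buffer
      return channel.get()     # blocks until data arrives (implicit sync)

  send([1, 2, 3])
  print(receive())             # [1, 2, 3]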
Advantages of Message-Passing
Communication
• Hardware for communication and
synchronization is much simpler:
–Compared to communication in a shared memory
model.
• Explicit communication:
–Programs simpler to understand, helps to reduce
maintenance and development costs.
• Synchronization is implicit:
–Naturally associated with sending/receiving
messages.
–Easier to debug.
Disadvantages of Message-Passing
Communication
• Programmer has to write explicit message
passing constructs.
–Also, precisely identify the processes (or
threads) with which communication is to
occur.
• Explicit calls to operating system:
–Higher overhead.
DSM
• Physically separate memories are accessed
as one logical address space.
• Processors running on a multi-computer
system share their memory.
–Implemented by operating system.
• DSM multiprocessors are NUMA:
–Access time depends on the exact location of
the data.
Distributed Shared-Memory
Architecture (DSM)
• Underlying mechanism is message passing:
– Shared memory convenience provided to the
programmer by the operating system.
– Basically, an operating system facility takes care
of message passing implicitly.
• Advantage of DSM:
– Ease of programming
Disadvantage of DSM
• High communication cost:
–A program not specifically optimized for
DSM by the programmer will perform
extremely poorly.
–Data (variables) accessed by specific
program segments have to be collocated.
–Useful only for process-level (coarse-
grained) parallelism.
Symmetric Multiprocessors (SMPs)
• SMPs are a popular shared memory
multiprocessor architecture:
–Processors share Memory and I/O
–Bus based: access time for all memory locations is
equal --- “Symmetric MP”
[Figure: processors P1…P4, each with a private cache, share main
memory and the I/O system over a single bus]
SMPs: Some Insights
• In any multiprocessor, main memory access is
a bottleneck:
–Multilevel caches reduce the memory demand of a
processor.
–Multilevel caches in fact make it possible for more
than one processor to meaningfully share the
memory bus.
–Hence multilevel caches are a must in a
multiprocessor!
Pros of SMPs
• Ease of programming:
–Especially when communication
patterns are complex or vary
dynamically during execution.
Cons of SMPs
• As the number of processors increases,
contention for the bus increases.
– Scalability of the SMP model restricted.
– One way out may be to use switches (crossbar,
multistage networks, etc.) instead of a bus.
– Switches set up parallel point-to-point
connections.
– Again, switches are not without disadvantages:
they make the implementation of cache
coherence difficult.
An Important Problem with
Shared-Memory: Coherence
• When shared data are cached:
–These are replicated in multiple caches.
–The data in the caches of different
processors may become inconsistent.
• How do we enforce cache coherence?
– How does a processor know about changes in the
caches of other processors?
The Cache Coherency Problem
[Figure: processors P1, P2, P3 share a memory location U, initially 5.
P1 and P3 read U into their caches (U:5); P3 then writes U:7.
What value will P1 and P2 read afterwards?]
Cache Coherence Solutions
(Protocols)
• The key to maintain cache coherence:
– Track the state of sharing of every data
block.
• Based on this idea, following can be an
overall solution:
–Dynamically recognize any potential
inconsistency at run-time and carry out
preventive action.
Pros and Cons of the Solution
• Pro:
–Consistency maintenance becomes
transparent to programmers,
compilers, as well as to the operating
system.
• Con:
–Increased hardware complexity.
Two Important Cache Coherency
Protocols
• Snooping protocol:
– Each cache “snoops” the bus to find out which
data is being used by whom.
• Directory-based protocol:
– Keep track of the sharing state of each data
block using a directory.
– A directory is a centralized record of the state of
all memory blocks.
– Allows coherency protocol to avoid broadcasts.
Snooping vs. Directory-based
Protocols
• Snooping protocol reduces memory traffic.
– More efficient.
• Snooping protocol requires broadcasts:
– Can meaningfully be implemented only when
there is a shared bus.
– Even when there is a shared bus, scalability is a
problem.
– Some workarounds have been tried: the Sun
Enterprise server has up to 4 buses.
Snooping Protocol
• As soon as a request for any data block by a
processor is put out on the bus:
–Other processors “snoop” to check if they have a
copy and respond accordingly.
• Works well with bus interconnection:
–All transmissions on a bus are essentially broadcast:
• Snooping is therefore effortless.
–Dominates almost all small scale machines.
Categories of Snoopy
Protocols
• Essentially two types:
–Write Invalidate Protocol
–Write Broadcast Protocol
• Write invalidate protocol:
–When one processor writes to its cache, all other
processors having a copy of that data block
invalidate that block.
• Write broadcast:
–When one processor writes to its cache, all other
processors having a copy of that data block
update that block with the recent written value.
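A toy Python sketch of the write-invalidate idea (one dict per cache stands in for the hardware; this simplification ignores real protocol states such as exclusive or modified):

  # Write invalidate: a write drops the block from every other cache.
  caches = [{"U": 5}, {"U": 5}, {}]    # three processors' caches

  def write(writer, addr, value):
      for i, cache in enumerate(caches):
          if i != writer:
              cache.pop(addr, None)    # invalidate other copies
      caches[writer][addr] = value     # writer keeps the new value

  write(2, "U", 7)
  print(caches)  # [{}, {}, {'U': 7}]: the stale copies are gone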
Write Invalidate vs. Write Update Protocols
[Figure: the same bus-based SMP organization as before: four processors
with private caches sharing main memory and I/O over a bus]
Write Invalidate Protocol
• Handling a write to shared data:
–An invalidate command is sent on bus --- all
caches snoop and invalidate any copies they
have.
• Handling a read miss:
–Write-through: memory is always up-to-date.
–Write-back: snooping finds the most recent copy.
Write Invalidate in Write Through
Caches
• Simple implementation.
• Writes:
– Write to shared data: broadcast on bus; processors
snoop and invalidate any copies they hold.
– Read miss: memory is always up-to-date.
• Concurrent writes:
– Write serialization automatically achieved since bus
serializes requests.
– Bus provides the basic arbitration support.
Write Invalidate versus
Broadcast cont…
• Invalidate exploits spatial locality:
–Only one bus transaction for any number of
writes to the same block.
–Obviously, more efficient.
• Broadcast has lower latency for writes and reads:
–As compared to invalidate.

Mais conteúdo relacionado

Mais procurados

Pipeline processing and space time diagram
Pipeline processing and space time diagramPipeline processing and space time diagram
Pipeline processing and space time diagramRahul Sharma
 
Arithmatic pipline
Arithmatic piplineArithmatic pipline
Arithmatic piplineA. Shamel
 
Instruction pipeline: Computer Architecture
Instruction pipeline: Computer ArchitectureInstruction pipeline: Computer Architecture
Instruction pipeline: Computer ArchitectureInteX Research Lab
 
Computer Organozation
Computer OrganozationComputer Organozation
Computer OrganozationAabha Tiwari
 
Pipelining
PipeliningPipelining
PipeliningAmin Omi
 
Concept of Pipelining
Concept of PipeliningConcept of Pipelining
Concept of PipeliningSHAKOOR AB
 
Pipelining powerpoint presentation
Pipelining powerpoint presentationPipelining powerpoint presentation
Pipelining powerpoint presentationbhavanadonthi
 
Pipeline hazard
Pipeline hazardPipeline hazard
Pipeline hazardAJAL A J
 
Loop parallelization & pipelining
Loop parallelization & pipeliningLoop parallelization & pipelining
Loop parallelization & pipeliningjagrat123
 
Pipelining
PipeliningPipelining
PipeliningAJAL A J
 
Instruction pipelining
Instruction pipeliningInstruction pipelining
Instruction pipeliningTech_MX
 
Chapter 04 the processor
Chapter 04   the processorChapter 04   the processor
Chapter 04 the processorBảo Hoang
 
INSTRUCTION LEVEL PARALLALISM
INSTRUCTION LEVEL PARALLALISMINSTRUCTION LEVEL PARALLALISM
INSTRUCTION LEVEL PARALLALISMKamran Ashraf
 
Unit 3-pipelining &amp; vector processing
Unit 3-pipelining &amp; vector processingUnit 3-pipelining &amp; vector processing
Unit 3-pipelining &amp; vector processingvishal choudhary
 

Mais procurados (17)

Pipeline processing and space time diagram
Pipeline processing and space time diagramPipeline processing and space time diagram
Pipeline processing and space time diagram
 
Arithmatic pipline
Arithmatic piplineArithmatic pipline
Arithmatic pipline
 
Instruction pipeline: Computer Architecture
Instruction pipeline: Computer ArchitectureInstruction pipeline: Computer Architecture
Instruction pipeline: Computer Architecture
 
Computer Organozation
Computer OrganozationComputer Organozation
Computer Organozation
 
Pipelining
PipeliningPipelining
Pipelining
 
Pipelining
PipeliningPipelining
Pipelining
 
Concept of Pipelining
Concept of PipeliningConcept of Pipelining
Concept of Pipelining
 
Pipelining powerpoint presentation
Pipelining powerpoint presentationPipelining powerpoint presentation
Pipelining powerpoint presentation
 
Pipeline hazard
Pipeline hazardPipeline hazard
Pipeline hazard
 
Loop parallelization & pipelining
Loop parallelization & pipeliningLoop parallelization & pipelining
Loop parallelization & pipelining
 
Pipelining
PipeliningPipelining
Pipelining
 
Pipelining
PipeliningPipelining
Pipelining
 
Instruction pipelining
Instruction pipeliningInstruction pipelining
Instruction pipelining
 
Chapter 04 the processor
Chapter 04   the processorChapter 04   the processor
Chapter 04 the processor
 
pipelining
pipeliningpipelining
pipelining
 
INSTRUCTION LEVEL PARALLALISM
INSTRUCTION LEVEL PARALLALISMINSTRUCTION LEVEL PARALLALISM
INSTRUCTION LEVEL PARALLALISM
 
Unit 3-pipelining &amp; vector processing
Unit 3-pipelining &amp; vector processingUnit 3-pipelining &amp; vector processing
Unit 3-pipelining &amp; vector processing
 

Semelhante a Coa.ppt2

CMPN301-Pipelining_V2.pptx
CMPN301-Pipelining_V2.pptxCMPN301-Pipelining_V2.pptx
CMPN301-Pipelining_V2.pptxNadaAAmin
 
Pipelining of Processors
Pipelining of ProcessorsPipelining of Processors
Pipelining of ProcessorsGaditek
 
Pipelining in Computer System Achitecture
Pipelining in Computer System AchitecturePipelining in Computer System Achitecture
Pipelining in Computer System AchitectureYashiUpadhyay3
 
pipelining ppt.pdf
pipelining ppt.pdfpipelining ppt.pdf
pipelining ppt.pdfWilliamTom9
 
Design pipeline architecture for various stage pipelines
Design pipeline architecture for various stage pipelinesDesign pipeline architecture for various stage pipelines
Design pipeline architecture for various stage pipelinesMahmudul Hasan
 
lec04-pipelining-intro&hazards.ppt
lec04-pipelining-intro&hazards.pptlec04-pipelining-intro&hazards.ppt
lec04-pipelining-intro&hazards.pptAfolabiEmmanuel12
 
What to do when detect deadlock
What to do when detect deadlockWhat to do when detect deadlock
What to do when detect deadlockSyed Zaid Irshad
 
Advanced Pipelining in ARM Processors.pptx
Advanced Pipelining  in ARM Processors.pptxAdvanced Pipelining  in ARM Processors.pptx
Advanced Pipelining in ARM Processors.pptxJoyChowdhury30
 
INCREASING THE THROUGHPUT USING EIGHT STAGE PIPELINING
INCREASING THE THROUGHPUT USING EIGHT STAGE PIPELININGINCREASING THE THROUGHPUT USING EIGHT STAGE PIPELINING
INCREASING THE THROUGHPUT USING EIGHT STAGE PIPELININGijiert bestjournal
 
Topic2a ss pipelines
Topic2a ss pipelinesTopic2a ss pipelines
Topic2a ss pipelinesturki_09
 
Computer arithmetic in computer architecture
Computer arithmetic in computer architectureComputer arithmetic in computer architecture
Computer arithmetic in computer architectureishapadhy
 
Pipelining 16 computers Artitacher pdf
Pipelining   16 computers Artitacher  pdfPipelining   16 computers Artitacher  pdf
Pipelining 16 computers Artitacher pdfMadhuGupta99385
 
Clock-8086 bus cycle
Clock-8086 bus cycleClock-8086 bus cycle
Clock-8086 bus cycleRani Rahul
 
Cpu performance matrix
Cpu performance matrixCpu performance matrix
Cpu performance matrixRehman baig
 

Semelhante a Coa.ppt2 (20)

Unit - 5 Pipelining.pptx
Unit - 5 Pipelining.pptxUnit - 5 Pipelining.pptx
Unit - 5 Pipelining.pptx
 
CMPN301-Pipelining_V2.pptx
CMPN301-Pipelining_V2.pptxCMPN301-Pipelining_V2.pptx
CMPN301-Pipelining_V2.pptx
 
Pipelining of Processors
Pipelining of ProcessorsPipelining of Processors
Pipelining of Processors
 
Pipelining in Computer System Achitecture
Pipelining in Computer System AchitecturePipelining in Computer System Achitecture
Pipelining in Computer System Achitecture
 
pipelining ppt.pdf
pipelining ppt.pdfpipelining ppt.pdf
pipelining ppt.pdf
 
COA Unit-5.pptx
COA Unit-5.pptxCOA Unit-5.pptx
COA Unit-5.pptx
 
Design pipeline architecture for various stage pipelines
Design pipeline architecture for various stage pipelinesDesign pipeline architecture for various stage pipelines
Design pipeline architecture for various stage pipelines
 
lec04-pipelining-intro&hazards.ppt
lec04-pipelining-intro&hazards.pptlec04-pipelining-intro&hazards.ppt
lec04-pipelining-intro&hazards.ppt
 
What to do when detect deadlock
What to do when detect deadlockWhat to do when detect deadlock
What to do when detect deadlock
 
Advanced Pipelining in ARM Processors.pptx
Advanced Pipelining  in ARM Processors.pptxAdvanced Pipelining  in ARM Processors.pptx
Advanced Pipelining in ARM Processors.pptx
 
Unit 4 COA.pptx
Unit 4 COA.pptxUnit 4 COA.pptx
Unit 4 COA.pptx
 
INCREASING THE THROUGHPUT USING EIGHT STAGE PIPELINING
INCREASING THE THROUGHPUT USING EIGHT STAGE PIPELININGINCREASING THE THROUGHPUT USING EIGHT STAGE PIPELINING
INCREASING THE THROUGHPUT USING EIGHT STAGE PIPELINING
 
Topic2a ss pipelines
Topic2a ss pipelinesTopic2a ss pipelines
Topic2a ss pipelines
 
Computer arithmetic in computer architecture
Computer arithmetic in computer architectureComputer arithmetic in computer architecture
Computer arithmetic in computer architecture
 
Parallel Algorithms
Parallel AlgorithmsParallel Algorithms
Parallel Algorithms
 
Presentation on risc pipeline
Presentation on risc pipelinePresentation on risc pipeline
Presentation on risc pipeline
 
Pipelining 16 computers Artitacher pdf
Pipelining   16 computers Artitacher  pdfPipelining   16 computers Artitacher  pdf
Pipelining 16 computers Artitacher pdf
 
Clock-8086 bus cycle
Clock-8086 bus cycleClock-8086 bus cycle
Clock-8086 bus cycle
 
CO Module 5
CO Module 5CO Module 5
CO Module 5
 
Cpu performance matrix
Cpu performance matrixCpu performance matrix
Cpu performance matrix
 

Último

Measures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataMeasures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataBabyAnnMotar
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...Postal Advocate Inc.
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4JOYLYNSAMANIEGO
 
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptxmary850239
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Seán Kennedy
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
Integumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptIntegumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptshraddhaparab530
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfJemuel Francisco
 
Presentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptxPresentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptxRosabel UA
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfPatidar M
 
Expanded definition: technical and operational
Expanded definition: technical and operationalExpanded definition: technical and operational
Expanded definition: technical and operationalssuser3e220a
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
EmpTech Lesson 18 - ICT Project for Website Traffic Statistics and Performanc...
EmpTech Lesson 18 - ICT Project for Website Traffic Statistics and Performanc...EmpTech Lesson 18 - ICT Project for Website Traffic Statistics and Performanc...
  • 12. Memory Access (MEM) Cycle • If the instruction is a load, memory is read at the effective address computed in the previous cycle. The actual data transfer to the register does not occur until the next cycle. • If the instruction is a store, the data from the register is written to the effective address in memory.
  • 13. Write-Back (WB) Cycle • Occurs for register-register ALU instructions and for load instructions. • Whether the result comes from the ALU (register-register operation) or from memory (load), the resulting data is written to the appropriate register in the register file.
  • 15. What Is A Pipeline? • Pipelining is used by virtually all modern microprocessors to enhance performance by overlapping the execution of instructions. • A common analogy for a pipeline is a factory assembly line. Assume that there are three stages: • Welding • Painting • Polishing • For simplicity, assume that each task takes one hour.
  • 16. What Is A Pipeline? • If a single person were to work on the product, it would take three hours to produce one product. • If we had three people, one could work on each stage; upon completing a stage, each passes the product on to the next person (since each stage takes one hour, there is no waiting). • We could then produce one product per hour, once the assembly line has been filled.
  • 17. What Is A Pipeline? • Pipelining is an implementation technique whereby multiple instructions are overlapped in execution. • It takes advantage of parallelism that exists among the actions needed to execute an instruction. • Pipelining is the key implementation technique used to make fast CPUs.
  • 18. Characteristics Of Pipelining • If the stages are perfectly balanced, then the time per instruction on the pipelined processor (assuming ideal conditions) is equal to: time per instruction on the unpipelined machine / number of pipe stages. • Under these conditions, the speedup from pipelining equals the number of pipe stages.
  • 19. Contd… • Usually, however, the stages will not be perfectly balanced; furthermore, pipelining does involve some overhead. • The previous expression is ideal. We will see later that there are many ways in which a pipeline cannot function in a perfectly balanced fashion.
  • 20. Characteristics Of Pipelining • In terms of a CPU, the implementation of pipelining has the effect of reducing the average instruction time, therefore reducing the average CPI. • EX: If each instruction in a microprocessor takes 5 clock cycles (unpipelined) and we have a 4-stage pipeline, the ideal average CPI with the pipeline will be 5/4 = 1.25.
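To make the arithmetic concrete, here is a minimal Python sketch of the example above (the numbers are the ones from the slide):

    # Ideal pipelined CPI: unpipelined cycles per instruction divided by pipeline depth.
    unpipelined_cpi = 5      # clock cycles per instruction without pipelining
    pipeline_stages = 4      # depth of the pipeline

    ideal_cpi = unpipelined_cpi / pipeline_stages
    print(ideal_cpi)         # 1.25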
  • 22. Pipelined Execution [Figure: program flow versus time; each instruction passes through IFetch, Dcd, Exec, Mem, and WB, one stage behind the instruction ahead of it.]
  • 23. Precedence relation: a set of subtasks {T1, T2, …, Tn} for a given task T, such that a subtask Tj cannot start until some earlier subtask Ti (i < j) finishes. A pipeline consists of a cascade of processing stages. Stages are combinational circuits operating on the data stream flowing through the pipe. Stages are separated by high-speed interface latches (holding intermediate results between stages). Control must be under a common clock.
  • 24. Pipeline Cycle Pipeline cycle: determined by the time required by the slowest stage. Pipeline designers try to balance the length (i.e. the processing time) of each pipeline stage. For a perfectly balanced pipeline, the execution time per instruction is t/n, where t is the execution time per instruction on the non-pipelined machine and n is the number of pipe stages.
  • 25. Pipeline Cycle However, it is very difficult to make the different pipeline stages perfectly balanced. Besides, pipelining itself involves some overhead.
  • 26. Synchronous Pipeline [Figure: stages S1 … Sk in cascade, separated by latches (L), all driven by a common clock.] - Transfers between stages are simultaneous. - One task or operation enters the pipeline per cycle.
  • 27. Asynchronous Pipeline [Figure: stages S1 … Sk connected by ready/acknowledge handshake lines.] - Transfers are performed when individual stages are ready. - Handshaking protocol between stages. - Different amounts of delay may be experienced at different stages. - Can display a variable throughput rate.
  • 28. A Few Pipeline Concepts • Let τm be the delay of stage Sm and d the latch delay. • Pipeline cycle: τ = max{τm} + d • Pipeline frequency: f = 1/τ
  • 29. Example on Clock Period • Suppose the time delays of the 4 stages are τ1 = 60 ns, τ2 = 50 ns, τ3 = 90 ns, τ4 = 80 ns, and the interface latch has a delay of d = 10 ns. • The cycle time of this pipeline is τ = max{τm} + d = 90 + 10 = 100 ns. • Clock frequency of the pipeline: f = 1/τ = 1/(100 ns) = 10 MHz. • If it is non-pipelined, the time per instruction is 60 + 50 + 90 + 80 = 280 ns.
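The same computation as a short Python sketch (the delays are the values from the slide):

    # Pipeline cycle time = slowest stage delay + latch delay.
    stage_delays_ns = [60, 50, 90, 80]   # tau_1 .. tau_4
    latch_delay_ns = 10                  # d

    tau_ns = max(stage_delays_ns) + latch_delay_ns   # 100 ns
    freq_mhz = 1000 / tau_ns                         # 1/(100 ns) = 10 MHz
    nonpipelined_ns = sum(stage_delays_ns)           # 280 ns
    print(tau_ns, freq_mhz, nonpipelined_ns)         # 100 10.0 280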
  • 30. Ideal Pipeline Speedup • A k-stage pipeline processes n tasks in k + (n−1) clock cycles: k cycles for the first task and n−1 cycles for the remaining n−1 tasks. • Total time to process n tasks: Tk = [k + (n−1)]τ • For the non-pipelined processor: T1 = n k τ
  • 31. Pipeline Speedup Expression • Speedup: Sk = T1 / Tk = n k τ / ([k + (n−1)]τ) = n k / (k + n − 1) • Maximum speedup: Sk → k for n >> k. • Observe that the memory bandwidth must increase by a factor of Sk: otherwise, the processor would stall waiting for data to arrive from memory.
  • 32. Efficiency of Pipeline • The percentage of busy time-space span over the total time-space span. • n: number of tasks or instructions; k: number of pipeline stages; τ: clock period of the pipeline. • Pipeline efficiency: η = n k τ / (k [k + (n−1)]τ) = n / (k + n − 1)
  • 33. Throughput of Pipeline • The number of tasks that can be completed by a pipeline per unit time. • W = n / ([k + (n−1)]τ) = η / τ = η f • Ideal case: W = 1/τ = f when η = 1. • Maximum throughput = frequency of a linear pipeline.
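The three expressions above can be bundled into one small Python helper (a sketch; the function name and the sample numbers are illustrative, not from the slides):

    # k-stage pipeline metrics for n tasks with cycle time tau (slides 30-33).
    def pipeline_metrics(k, n, tau):
        t_pipelined = (k + (n - 1)) * tau       # Tk = [k + (n-1)] * tau
        t_unpipelined = n * k * tau             # T1 = n * k * tau
        speedup = t_unpipelined / t_pipelined   # n*k / (k + n - 1)
        efficiency = n / (k + n - 1)            # busy fraction of the time-space span
        throughput = n / t_pipelined            # tasks per unit time = efficiency / tau
        return speedup, efficiency, throughput

    # For n >> k the speedup approaches k:
    print(pipeline_metrics(k=5, n=1000, tau=1))   # speedup ~ 4.98, efficiency ~ 0.996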
  • 34. Pipelines: A Few Basic Concepts Historically, there are two different types of pipelines: instruction pipelines and arithmetic pipelines. Arithmetic pipelines (e.g. FP multiplication) are not popular in general-purpose computers: they need a continuous stream of arithmetic operations, e.g. vector processors operating on an array. On the other hand, instruction pipelines are used in almost every modern processor.
  • 35. Pipelines: A Few Basic Concepts A pipeline increases instruction throughput, but it does not decrease the execution time of the individual instructions. In fact, it slightly increases the execution time of each instruction due to pipeline overheads. Pipeline overhead arises from a combination of: pipeline register delay and clock skew.
  • 36. Pipelines: A Few Basic Concepts Pipeline register delay: caused by the setup time of the pipeline latches. Clock skew: the maximum delay between clock arrival at any two registers. Once the clock cycle is as small as the pipeline overhead, no further pipelining is useful; very deep pipelines may not be useful.
  • 37. Pipeline Registers • Pipeline registers are an essential part of pipelines: there are 4 groups of pipeline registers in a 5-stage pipeline. • Each group saves the output from one stage and passes it as input to the next stage: IF/ID, ID/EX, EX/MEM, MEM/WB. • This way, each time “something is computed”… (effective address, immediate value, register content, etc.) it is saved safely in the context of the instruction that needs it.
  • 38. Looking At The Big Picture • Overall, the most time that a non-pipelined instruction can take is 5 clock cycles. Below is a summary: • Branch - 2 clock cycles • Store - 4 clock cycles • Other - 5 clock cycles • EX: Assuming branch instructions account for 12% of all instructions and stores account for 10%, what is the average CPI of a non-pipelined CPU? ANS: 0.12*2 + 0.10*4 + 0.78*5 = 4.54
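The answer can be checked with a one-line weighted average (Python; the numbers are from the slide):

    # Weighted average CPI of the non-pipelined CPU.
    mix = {"branch": (0.12, 2), "store": (0.10, 4), "other": (0.78, 5)}  # (fraction, cycles)
    avg_cpi = sum(frac * cycles for frac, cycles in mix.values())
    print(avg_cpi)   # 4.54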
  • 39. Assignment • Find the total time to process 100 tasks in a 2-stage pipeline with a cycle time of 10 ns. • Repeat the above problem assuming latching in the pipeline requires 2 ns. • A pipeline has 4 stages with time delays τ1 = 60 ns, τ2 = 50 ns, τ3 = 90 ns, τ4 = 80 ns, and the interface latch has a delay of d = 10 ns. What is the cycle time of this pipeline? What is the clock frequency of the above pipeline?
  • 40. Instruction-Level Parallelism • What is ILP (Instruction-Level Parallelism)? – Parallel execution of different instructions belonging to the same thread. • A thread usually consists of several basic blocks, as well as several branches and loops. • Basic block: a sequence of instructions not containing a branch instruction.
  • 41. Cont… • Instruction pipelines can effectively exploit parallelism in a basic block: – An n-stage pipeline can improve performance up to n times. – Does not require much investment in hardware. – Transparent to the programmers. • Pipelining can be viewed as: – Decreasing the average CPI, and/or – Decreasing the clock cycle time for instructions.
  • 42. Drags on Pipeline Performance • Factors that can degrade pipeline performance – Unbalanced stages – Pipeline overheads – Clock skew – Hazards • Hazards cause the worst drag on the performance of a pipeline.
  • 43. The Classical RISC: 5 Stage Pipeline • In an ideal case, to implement a pipeline we just need to start a new instruction at each clock cycle. • Unfortunately there are many problems with trying to implement this. • If we look at each stage of instruction execution as being independent, we can see how instructions can be “overlapped”.
  • 45. Problems With The Previous Figure • The memory is accessed twice during each clock cycle. This problem is avoided by using separate data and instruction caches. • It is important to note that if the clock period is the same for a pipelined processor and a non-pipelined processor, the memory must work five times faster. • Another problem is that the registers are accessed twice every clock cycle. To avoid a resource conflict we perform the register write in the first half of the cycle and the read in the second half of the cycle.
  • 46. Problems With The Previous Figure (continued) • Because the write occurs in the first half of the cycle, a value written by one instruction can be read in the same cycle by another instruction further down the pipeline. • A third problem arises from the interaction of the pipeline with the PC. We use an adder to increment the PC by the end of IF, but within ID we may branch and modify the PC. How does this affect the pipeline?
  • 47. Pipeline Hazards • The performance gain from pipelining comes from starting the execution of a new instruction each clock cycle. In a real implementation this is not always possible. • What is a pipeline hazard? A situation that prevents an instruction from executing during its designated clock cycle. • Pipeline hazards prevent the execution of the next instruction during the appropriate clock cycle.
  • 48. Types Of Hazards Structural hazards arise from resource conflicts when the hardware cannot support all possible combinations of instructions simultaneously in overlapped execution. Data hazards arise when an instruction depends on the results of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline. Control hazards arise from the pipelining of branches and other instructions that change the PC.
  • 49. Structural Hazard: Example [Figure: four instructions in flight; their IF, ID, EXE, MEM, and WB stages overlap in successive cycles.]
  • 50. An Example of a Structural Hazard [Figure: a load followed by Instructions 1–4, each flowing through Mem, Reg, ALU, DM, and Reg stages over time.] Would there be a hazard here?
  • 51. Performance with Stalls • Stalls degrade performance of a pipeline: –Result in deviation from 1 instruction executing/clock cycle. –Let’s examine by how much stalls can impact CPI…
  • 52. A Hazard Will Cause A Pipeline Stall • Some performance expressions for a realistic pipeline, in terms of CPI (it is assumed that the clock period is the same for the pipelined and unpipelined implementations): Speedup = CPI unpipelined / CPI pipelined = Pipeline depth / (1 + Stalls per instruction) = Avg instruction time unpipelined / Avg instruction time pipelined
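A quick sketch of how stalls eat into the ideal speedup (Python; the sample stall rates are illustrative, not from the slides):

    # Speedup of a pipelined machine over an unpipelined one with the same clock.
    def speedup_with_stalls(pipeline_depth, stalls_per_instruction):
        return pipeline_depth / (1 + stalls_per_instruction)

    print(speedup_with_stalls(5, 0.0))   # 5.0 - the ideal case
    print(speedup_with_stalls(5, 0.5))   # ~3.33 - half the instructions stall one cycle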
  • 53. Dealing With Structural Hazards • Arise from resource conflicts among instructions executing concurrently: – The same resource is required by two (or more) concurrently executing instructions at the same time. • Easy ways to avoid structural hazards: – Duplicate resources (sometimes not practical) – Memory interleaving (lower- and higher-order)
  • 54. Contd… • Examples of Resolution of Structural Hazard: –An ALU to perform an arithmetic operation and an adder to increment PC. –Separate data cache and instruction cache accessed simultaneously in the same cycle.
  • 56. How is it Resolved? [Figure: the load proceeds while a stall delays Instruction 3 by one cycle; bubbles fill the vacated stages.] A pipeline can be stalled by inserting a “bubble” or NOP.
  • 57. Dealing With Structural Hazards • A structural hazard is dealt with by inserting a stall or pipeline bubble into the pipeline. • This means that for that clock cycle, nothing happens for that instruction. • This effectively “slides” that instruction, and subsequent instructions, by one clock cycle. • This effectively increases the average CPI.
  • 58. Dealing With Structural Hazards (continued) • We can see that even though the clock speed of the processor with the hazard is a little faster, the speedup is still less than 1. • Therefore the hazard has quite an effect on the performance. • Sometimes computer architects will opt to design a processor that exhibits a structural hazard. Why? • A: The improvement to the processor data path is too costly. • B: The hazard occurs rarely enough so that the processor will still perform to specifications.
  • 59. An Example of Performance Impact of Structural Hazard • Assume: – Pipelined processor. – Data references constitute 40% of an instruction mix. – Ideal CPI of the pipelined machine is 1. – Consider two cases: • Unified data and instruction cache vs. separate data and instruction cache. • What is the impact on performance?
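One way to work the example (a sketch; it assumes the unified cache forces a one-cycle stall on every data reference, the usual textbook assumption, and ignores any clock-rate difference between the two designs):

    # CPI impact of a unified instruction/data cache.
    ideal_cpi = 1.0            # separate caches: no structural hazard
    data_ref_fraction = 0.40   # 40% of instructions reference data
    stall_cycles = 1           # each data reference collides with an instruction fetch

    unified_cpi = ideal_cpi + data_ref_fraction * stall_cycles   # 1.4
    print(unified_cpi / ideal_cpi)   # the unified-cache machine is 1.4x slower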
  • 60. Data Dependences and Hazards • Determining how one instruction depends on another is critical to determining how much parallelism exists in a program and how that parallelism can be exploited.
  • 61. Data Dependences There are three different types of dependences: • Data Dependences (also called true data dependences), Name Dependences and Control Dependences. • An instruction j is data dependent on instruction i if either of the following holds:  Instruction i produces a result that may be used by instruction j, or  Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.
  • 62. Consider the MIPS code sequence that adds a scalar in register F2 to each element of a vector of values in memory (starting at 0(R1), with the last element at 8(R2)):
Loop: L.D F0, 0(R1) ;F0 = array element
ADD.D F4, F0, F2 ;add scalar in F2
S.D F4, 0(R1) ;store result
DADDUI R1, R1, #-8 ;decrement pointer by 8 bytes
BNE R1, R2, Loop ;branch if R1 != R2
  • 64. Data Dependences Contd… • A data value may flow between instructions either through registers or through memory locations. • When the data flow occurs in a register, detecting the dependence is straightforward since the register names are fixed in the instructions, although it gets more complicated when branches intervene.
  • 65. Contd… • Dependences that flow through memory locations are more difficult to detect, since two addresses may refer to the same location but look different: for example, 100(R4) and 20(R6) may be identical memory addresses. • Also, the effective address of a load or store may change from one execution of the instruction to another, so that 20(R4) in one execution and 20(R4) in another may refer to different locations.
  • 66. Detecting Data Dependences • A data value may flow between instructions: (i) through registers, (ii) through memory locations. • When data flow is through a register, detection is rather straightforward. • When data flow is through a memory location, detection is difficult: two addresses may refer to the same memory location but look different, e.g. 100(R4) and 20(R6).
  • 67. Name Dependences • A name dependence occurs when two instructions use the same register or memory location, called a name. • There are two types of name dependences between an instruction i that precedes instruction j in program order: antidependence and output dependence.
  • 68. Contd… • An antidependence between instruction i and instruction j occurs when instruction j writes a register or memory location that instruction i reads. • The original ordering must be preserved to ensure that i reads the correct value. There is an antidependence between S.D and DADDUI on register R1 in the MIPS code sequence on the next slide.
  • 69. Consider the MIPS code sequence again (it adds a scalar in register F2 to each element of the vector):
Loop: L.D F0, 0(R1) ;F0 = array element
ADD.D F4, F0, F2 ;add scalar in F2
S.D F4, 0(R1) ;store result
DADDUI R1, R1, #-8 ;decrement pointer by 8 bytes
BNE R1, R2, Loop ;branch if R1 != R2
  • 70. Contd… • An Output Dependence occurs when instruction i and instruction j write the same register or memory location. • The ordering between the instructions must be preserved to ensure that the value finally written corresponds to instruction j.
  • 71. Data Hazards • Occur when an instruction under execution depends on data from an instruction ahead of it in the pipeline. • Example: A = B + C; D = A + E; — if the dependent instruction reads A before the first instruction has written it, it uses old data and the computation is wrong. [Figure: the two instructions’ IF/ID/EX/MEM/WB stages overlapped in time.]
  • 72. Types of Data Hazards • Data hazards are of three types: – Read After Write (RAW) – Write After Read (WAR) – Write After Write (WAW) • With an in-order execution machine: – WAW, WAR hazards can not occur. • Assume instruction i is issued before j.
  • 73. Read after Write (RAW) Hazards • A hazard between two instructions i and j may occur when j attempts to read a data object that is modified by i: instruction j tries to read its operand before instruction i writes it, so j would incorrectly receive an old value. • Example (i is a write instruction issued before j; j is a read instruction issued after i): i: ADD R1, R2, R3 j: SUB R4, R1, R6
  • 74. Read after Write (RAW) Hazards [Figure: instruction I writes its range R(I); a later instruction J reads its domain D(J).] Condition: R(I) ∩ D(J) ≠ Ø for RAW.
  • 75. RAW Dependency: More Examples • Example program (a): i1: load r1, addr; i2: add r2, r1, r1; • Program (b): i1: mul r1, r4, r5; i2: add r2, r1, r1; • In both cases, i2 does not get its operand until i1 has completed writing the result: in (a) this is due to a load-use dependency, in (b) due to a define-use dependency.
  • 76. Write after Read (WAR) Hazards • Instruction j tries to write its destination before instruction i reads it, so i would incorrectly receive the new value. • WAR hazards do not usually occur in a simple pipeline because of the amount of time between the read cycle and the write cycle. • Example (i is a read instruction issued before j; j is a write instruction issued after i): i: ADD R1, R2, R3 j: SUB R2, R4, R6 • WAR hazards occur due to antidependence.
  • 77. Write after Read (WAR) Hazards [Figure: instruction I reads its domain D(I); a later instruction J writes its range R(J).] Condition: D(I) ∩ R(J) ≠ Ø for WAR.
  • 78. Write After Write (WAW) Hazards • WAW hazard: both i and j want to modify the same data object; instruction j tries to write an operand before instruction i writes it, so the writes are performed in the wrong order. • Example (i is a write instruction issued before j; j is a write instruction issued after i): i: DIV F1, F2, F3 j: SUB F1, F4, F6 (This can only happen if writes complete out of order, e.g. the long-latency DIV is still executing when the later SUB writes F1.) • WAW hazards occur due to output dependence.
  • 79. Write After Write (WAW) Hazards [Figure: instructions I and J both write their ranges R(I) and R(J).] Condition: R(I) ∩ R(J) ≠ Ø for WAW.
  • 80. Inter-Instruction Dependences • Data dependence (Read-after-Write, RAW): r3 ← r1 op r2; r5 ← r3 op r4 • Anti-dependence (Write-after-Read, WAR): r3 ← r1 op r2; r1 ← r4 op r5 • Output dependence (Write-after-Write, WAW): r3 ← r1 op r2; r5 ← r3 op r4; r3 ← r6 op r7 • Anti- and output dependences are false dependencies; control dependence is a separate category.
  • 81. Data Dependencies: Summary (data dependencies in straight-line code) • RAW, Read After Write (flow dependency): load-use and define-use dependency; a true dependency, cannot be overcome. • WAR, Write After Read (anti dependency): a false dependency, can be eliminated by register renaming. • WAW, Write After Write (output dependency): a false dependency, can be eliminated by register renaming.
  • 82. Recollect Data Hazards What causes them? – Pipelining changes the order of read/write accesses to operands. – Order differs from that of an unpipelined machine. • Example: – ADD R1, R2, R3 – SUB R4, R1, R5 For MIPS, ADD writes the register in WB but SUB needs it in ID. This is a data hazard
  • 83. Illustration of a Data Hazard [Figure: ADD R1, R2, R3 followed by SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; XOR R10, R1, R11 in successive cycles.] The ADD instruction causes a hazard in the next 3 instructions because register R1 is not written until after those 3 have read it.
  • 84. Solutions to Data Hazard • Operand forwarding • Pipeline interlock • By S/W (NOP) • Reordering the instruction
  • 85. Forwarding • Simplest solution to data hazard: – forwarding • Result of the ADD instruction not really needed: – until after ADD actually produces it. • Can we move the result from EX/MEM register to the beginning of ALU (where SUB needs it)? – Yes!
  • 86. Forwarding cont… • Generally speaking: –Forwarding occurs when a result is passed directly to the functional unit that requires it. –Result goes from output of one pipeline stage to input of another.
  • 88. When Can We Forward? [Figure: the same five-instruction sequence as before.] • SUB gets its value from the EX/MEM pipeline register. • AND gets its value from the MEM/WB pipeline register. • OR gets its value from the register file (write in the first half of the cycle, read in the second half). • If the forwarding line goes “forward” in time you can do forwarding; if it is drawn backward, it is physically impossible.
  • 89. General Data Forwarding • It is easy to see how data forwarding can be used by drawing out the pipelined execution of each instruction. • Now consider the following instructions: DADD R1, R2, R3 LD R4, 0(R1) SD R4, 12(R1)
  • 91. Problems • Can data forwarding prevent all data hazards? NO! • The following operations will still cause a data hazard, because the further down the pipeline we get, the less we can use forwarding: LD R1, 0(R2) DSUB R4, R1, R5 AND R6, R1, R7 OR R8, R1, R9
  • 93. Problems • We can avoid the hazard by using a Pipeline interlock. • The pipeline interlock will detect when data forwarding will not be able to get the data to the next instruction in time. • A stall is introduced until the instruction can get the appropriate data from the previous instruction.
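The interlock's detection condition can be sketched in a few lines of Python (the dictionary field names are hypothetical; real hardware compares register specifiers held in the pipeline latches):

    # Minimal sketch of load-use hazard detection between the EX and ID stages.
    def must_stall(ex_stage, id_stage):
        """Stall if the instruction in EX is a load whose destination register
        is a source register of the instruction currently in ID."""
        return (ex_stage["is_load"] and
                ex_stage["rd"] in (id_stage["rs"], id_stage["rt"]))

    # LD R1, 0(R2) followed by DSUB R4, R1, R5 must stall one cycle:
    print(must_stall({"is_load": True, "rd": 1}, {"rs": 1, "rt": 5}))   # True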
  • 95. Handling Data Hazards by S/W • The compiler introduces a NOP between two dependent instructions. • NOP = a piece of code that keeps a gap between the two instructions. • Detection of the dependency is left entirely to the software. • Advantage: enables a simple technique called instruction reordering.
  • 96. Instruction Reordering • ADD R1 , R2 , R3 • SUB R4 , R1 , R5 • XOR R8 , R6 , R7 • AND R9 , R10 , R11 • ADD R1 , R2 , R3 • XOR R8 , R6 , R7 • AND R9 , R10 , R11 • SUB R4 , R1 , R5 Before After
  • 97. Instruction Execution: MIPS Data Path • We can break down the process of “running” an instruction into stages. • These stages are what needs to be done to complete the execution of each instruction. Some instructions will not require some stages.
  • 98. MIPS Data Path • The DLX (MIPS) datapath allows every instruction to be executed in 4 or 5 cycles.
  • 99. Instruction Execution Contd… 1. Instruction Fetch (IF) — get the instruction to be executed: IR ← M[PC]; NPC ← PC + 4. (IR = instruction register; NPC = next program counter.)
  • 100. Instruction Execution Contd… 2. Instruction Decode/Register Fetch (ID) — figure out what the instruction is supposed to do and what it needs: A ← RegisterFile[Rs]; B ← RegisterFile[Rt]; Imm ← sign-extended immediate field, {(IR16)^16, IR15..0}. A, B, and Imm are temporary registers that hold inputs to the ALU, which is in the Execute stage.
  • 101. Instruction Execution Contd… 3. Execution (EX) — the instruction has been decoded, so execution can be split according to instruction type: Reg-Reg ALU: ALUout ← A op B. Reg-Imm: ALUout ← A op Imm. Branch: ALUout ← NPC + Imm; Cond ← (A {==, !=} 0). Load/Store: ALUout ← A + Imm to form the effective address.
  • 102. Instruction Execution Contd… 4. Memory Access/Branch Completion (MEM) — besides the IF stage, this is the only stage that accesses the memory, to load and store data: Load: LMD ← Mem[ALUout]. Store: Mem[ALUout] ← B. Branch: if (Cond) PC ← ALUout, else PC ← NPC. Jump: PC ← ALUout. (LMD = load memory data register.)
  • 103. Instruction Execution Contd… 5. Write-Back (WB) — store ALU results and loaded data back to registers: Reg-Reg ALU: Rd ← ALUout. Load: Rd ← LMD. Reg-Imm: Rt ← ALUout.
  • 104. Control Hazards • Result from branch and other instructions that change the flow of a program (i.e. change the PC). • Example: 1: if (cond) { 2: s1 } 3: s2 — the statement in line 2 is control dependent on the statement at line 1. • Until the condition evaluation completes, it is not known whether s1 or s2 will execute next.
  • 105. • Control hazards are caused by branches in the code. • Remember that during the IF stage the PC is incremented by 4 in preparation for the next instruction fetch. • What happens if a branch is performed and we aren’t simply incrementing the PC by 4? • The easiest way to deal with the occurrence of a branch is to perform the IF stage again once the branch occurs.
  • 106. Four Simple Control/Branch Hazard Solutions • The following solutions assume that we are dealing with static branches (compile time), meaning that the actions taken during a branch do not change: #1. Flush pipeline / stall. #2. Predict branch not taken. #3. Predict branch taken. #4. Delayed branch.
  • 107. Branch Hazard Solutions #1. Flush Pipeline / Stall • Stall until the branch direction is clear, flushing the pipe once an instruction is detected to be a branch during the ID stage. • Example: we stall the pipeline until the branch is resolved (we repeat the IF stage until the branch resolves and modifies the PC).
  • 108. Performing IF Twice • We take a big performance hit by performing the instruction fetch again whenever a branch occurs; note that this happens whether the branch is taken or not. • This guarantees that the PC will get the correct value. [Figure: the instruction after the branch repeats its IF stage.]
  • 109. Control Hazards Solutions #2. Predict Branch Not Taken • What if we treat every branch as “not taken”? Remember that during ID we not only read the registers but also perform an equality test, in case we need to branch. • We can improve performance by assuming that the branch will not be taken: – Execute successor instructions in sequence as if there were no branch. – Undo the instructions in the pipeline if the branch is actually taken.
  • 110. Control Hazards solutions: Predict Branch Not Taken: Contd.. • The “branch-not taken” scheme is the same as performing the IF stage a second time in our 5 stage pipeline if the branch is taken. • If not there is no performance degradation. • 47% branches not taken on average
  • 111. Control Hazards solutions: Predict Branch Not Taken: Contd..
  • 112. Control Hazards Solutions #3. Predict Branch Taken – The “branch taken” scheme is of no benefit in our case because we evaluate the branch target address in the ID stage. – 53% of branches are taken on average. – The branch target address is not available after IF in MIPS, so MIPS still incurs a 1-cycle branch penalty even with predict-taken. – In some other machines, the branch target is known before the branch outcome is computed, and significant benefits can be accrued.
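These statistics let us estimate the cost of branches (a sketch; it assumes a base CPI of 1 and a one-cycle penalty on taken branches under predict-not-taken, and the 17% branch frequency is an assumed figure, not from the slides):

    # Effective CPI under predict-not-taken with a one-cycle taken-branch penalty.
    base_cpi = 1.0
    branch_fraction = 0.17   # assumed fraction of branches in the instruction mix
    taken_fraction = 0.53    # 53% of branches are taken on average (slide figure)
    penalty_cycles = 1

    effective_cpi = base_cpi + branch_fraction * taken_fraction * penalty_cycles
    print(round(effective_cpi, 2))   # ~1.09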
  • 113. Control Hazards Solutions #4: Delayed Branch • The fourth method for dealing with a control hazard is to implement a “delayed” branch scheme. • In this scheme an instruction that is useful and not dependent on whether the branch is taken is inserted into the pipeline; it is the job of the compiler to choose the delayed-branch instruction. • If the branch is actually taken, we need to clear the pipeline of any code loaded in from the “not-taken” path.
  • 114. Control Hazards solutions Delayed Branch Contd… • Likewise we can assume that the branch is always taken. Does this work in our “5-stage” pipeline?  No, the branch target is computed during the ID cycle. • Some processors will have the target address computed in time for the IF stage of the next instruction so there is no delay.
  • 115. Control Hazards Solutions cont… #4: Delayed Branch – Insert an unrelated successor in the branch delay slot: branch instruction; sequential successor_1; …; sequential successor_n; branch target (if taken). This gives a branch delay of length n. – 1 delay slot is required in the 5-stage pipeline.
  • 116. [Figure: the behavior of a delayed branch]
  • 117. Delayed Branch • Simple idea: put an instruction that would be executed anyway right after a branch. • Question: what instruction do we put in the delay slot? • Answer: one that can safely be executed no matter what the branch does; the compiler decides this. [Figure: branch, then the delay-slot instruction, then the branch target or successor.]
  • 118. Delayed Branch • One possibility: an instruction from before the branch. • Example (before): DADD R1, R2, R3; if R2 == 0 then … (delay slot). After scheduling: if R2 == 0 then DADD R1, R2, R3 (the DADD fills the delay slot). • The DADD instruction is executed no matter what happens in the branch, because it originally executes before the branch; therefore it can be moved.
  • 119. Delayed Branch [Figure: branch, the DADD in the delay slot, then the branch target/successor; by the end of the delay slot we know whether to take the branch.] • We get to execute the DADD “for free”.
  • 120. Delayed Branch • Another possibility: an instruction from the branch target, appearing well before the branch. • Example: DSUB R4, R5, R6 … DADD R1, R2, R3; if R1 == 0 then … (delay slot) • The DSUB instruction can be replicated into the delay slot, and the branch target can be changed.
  • 121. Delayed Branch • The same example after scheduling: DSUB R4, R5, R6 … DADD R1, R2, R3; if R1 == 0 then DSUB R4, R5, R6 • The DSUB instruction has been replicated into the delay slot, and the branch target changed accordingly.
  • 122. Delayed Branch • Yet another possibility: an instruction from the fall-through (not-taken) path. • Example: DADD R1, R2, R3; if R1 == 0 then … (delay slot); OR R7, R8, R9; DSUB R4, R5, R6 • The OR instruction can be moved into the delay slot ONLY IF its execution doesn’t disrupt the program execution (e.g., R7 is overwritten later if the branch is taken).
  • 123. Delayed Branch • The same example after scheduling: DADD R1, R2, R3; if R1 == 0 then OR R7, R8, R9 (delay slot); DSUB R4, R5, R6 • Again, the OR instruction can be moved into the delay slot ONLY IF its execution doesn’t disrupt the program execution (e.g., R7 is overwritten later).
  • 125. Flynn’s Classification SISD (Single Instruction Single Data): Uniprocessors. MISD (Multiple Instruction Single Data): No practical examples exist SIMD (Single Instruction Multiple Data): Specialized processors(Vector architectures, Multimedia extensions, Graphics processor units) MIMD (Multiple Instruction Multiple Data): General purpose, commercially important (Tightly-coupled MIMD, Loosely-coupled MIMD)
  • 127. SIMD [Figure: one control unit broadcasts a single instruction stream (IS) to processing units 1…n, each operating on its own data stream (DS) from its own memory module.]
  • 128. MIMD [Figure: multiple control units, each with its own instruction stream, drive processing units 1…n operating on separate data streams and memory modules.]
  • 129. A Broad Classification of Computers • Shared-memory multiprocessors –Also called UMA • Distributed memory computers –Also called NUMA: • Distributed Shared-memory (DSM) architectures • Clusters • Grids, etc.
  • 130. UMA vs. NUMA Computers [Figure: (a) UMA model — processors P1…Pn, each with a cache, share main memory over a bus; latency is 100s of ns. (b) NUMA model — each processor has its own main memory, and processors communicate over a network; latency ranges from several milliseconds to seconds.]
  • 131. Distributed Memory Computers • Distributed memory computers use: –Message Passing Model • Explicit message send and receive instructions have to be written by the programmer. –Send: specifies local buffer + receiving process (id) on remote computer (address). –Receive: specifies sending process on remote computer + local buffer to place data.
  • 132. Advantages of Message-Passing Communication • Hardware for communication and synchronization is much simpler, compared to communication in a shared-memory model. • Explicit communication: programs are simpler to understand, which helps reduce maintenance and development costs. • Synchronization is implicit: naturally associated with sending/receiving messages; easier to debug.
  • 133. Disadvantages of Message-Passing Communication • Programmer has to write explicit message passing constructs. –Also, precisely identify the processes (or threads) with which communication is to occur. • Explicit calls to operating system: –Higher overhead.
  • 134. DSM • Physically separate memories are accessed as one logical address space. • Processors running on a multi-computer system share their memory. –Implemented by operating system. • DSM multiprocessors are NUMA: –Access time depends on the exact location of the data.
  • 135. Distributed Shared-Memory Architecture (DSM) • Underlying mechanism is message passing: – Shared memory convenience provided to the programmer by the operating system. – Basically, an operating system facility takes care of message passing implicitly. • Advantage of DSM: – Ease of programming
  • 136. Disadvantage of DSM • High communication cost: – A program not specifically optimized for DSM by the programmer will perform extremely poorly. – Data (variables) accessed by specific program segments have to be collocated. – Useful only for process-level (coarse-grained) parallelism.
  • 137. Symmetric Multiprocessors (SMPs) • SMPs are a popular shared-memory multiprocessor architecture: – Processors share memory and I/O. – Bus based: access time for all memory locations is equal — hence “symmetric” MP. [Figure: processors P, each with a cache, sharing a bus to main memory and the I/O system.]
  • 138. SMPs: Some Insights • In any multiprocessor, main memory access is a bottleneck: –Multilevel caches reduce the memory demand of a processor. –Multilevel caches in fact make it possible for more than one processor to meaningfully share the memory bus. –Hence multilevel caches are a must in a multiprocessor!
  • 139. Pros of SMPs • Ease of programming: –Especially when communication patterns are complex or vary dynamically during execution.
  • 140. Cons of SMPs • As the number of processors increases, contention for the bus increases: – Scalability of the SMP model is restricted. – One way out may be to use switches (crossbar, multistage networks, etc.) instead of a bus. – Switches set up parallel point-to-point connections. – But switches are not without disadvantages: they make implementation of cache coherence difficult.
  • 141. An Important Problem with Shared-Memory: Coherence • When shared data are cached: –These are replicated in multiple caches. –The data in the caches of different processors may become inconsistent. • How to enforce cache coherency? – How does a processor know changes in the caches of other processors?
  • 142. The Cache Coherency Problem [Figure: P1, P2, and P3 each cache a copy of U = 5; one processor then writes U = 7, leaving stale copies in the other caches.] What value will P1 and P2 read?
  • 143. Cache Coherence Solutions (Protocols) • The key to maintain cache coherence: – Track the state of sharing of every data block. • Based on this idea, following can be an overall solution: –Dynamically recognize any potential inconsistency at run-time and carry out preventive action.
  • 144. Pros and Cons of the Solution • Pro: –Consistency maintenance becomes transparent to programmers, compilers, as well as to the operating system. • Con: –Increased hardware complexity .
  • 145. Two Important Cache Coherency Protocols • Snooping protocol: – Each cache “snoops” the bus to find out which data is being used by whom. • Directory-based protocol: – Keep track of the sharing state of each data block using a directory. – A directory is a centralized register for all memory blocks. – Allows coherency protocol to avoid broadcasts.
  • 146. Snooping vs. Directory-based Protocols • Snooping protocol reduces memory traffic: more efficient. • Snooping protocol requires broadcasts: – Can meaningfully be implemented only when there is a shared bus. – Even when there is a shared bus, scalability is a problem. – Some workarounds have been tried: the Sun Enterprise server has up to 4 buses.
  • 147. Snooping Protocol • As soon as a request for any data block by a processor is put out on the bus: –Other processors “snoop” to check if they have a copy and respond accordingly. • Works well with bus interconnection: –All transmissions on a bus are essentially broadcast: • Snooping is therefore effortless. –Dominates almost all small scale machines.
  • 148. Categories of Snoopy Protocols • Essentially two types: –Write Invalidate Protocol –Write Broadcast Protocol • Write invalidate protocol: –When one processor writes to its cache, all other processors having a copy of that data block invalidate that block. • Write broadcast: –When one processor writes to its cache, all other processors having a copy of that data block update that block with the recent written value.
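A toy Python model of the two policies (purely illustrative; real protocols track per-block states such as MESI rather than whole cache contents):

    # Toy write-invalidate vs. write-update on a three-processor "bus".
    caches = [{"u": 5}, {"u": 5}, {"u": 5}]   # every cache holds a copy of block "u"

    def write_invalidate(writer, block, value):
        caches[writer][block] = value
        for i, cache in enumerate(caches):
            if i != writer:
                cache.pop(block, None)        # other copies are invalidated

    def write_update(writer, block, value):
        for cache in caches:
            if block in cache:
                cache[block] = value          # other copies get the new value

    write_invalidate(0, "u", 7)
    print(caches)   # [{'u': 7}, {}, {}]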
  • 149. Write Invalidate vs. Write Update Protocols [Figure: processors P, each with a cache, on a shared bus to main memory and the I/O system.]
  • 150. Write Invalidate Protocol • Handling a write to shared data: an invalidate command is sent on the bus — all caches snoop and invalidate any copies they have. • Handling a read miss: – Write-through: memory is always up-to-date. – Write-back: snooping finds the most recent copy.
  • 151. Write Invalidate in Write-Through Caches • Simple implementation. • Writes: – Write to shared data: the invalidate is broadcast on the bus; processors snoop and invalidate any copies they hold. – Read miss: memory is always up-to-date. • Concurrent writes: – Write serialization is automatically achieved since the bus serializes requests. – The bus provides the basic arbitration support.
  • 152. Write Invalidate versus Broadcast cont… • Invalidate exploits spatial locality: –Only one bus transaction for any number of writes to the same block. –Obviously, more efficient. • Broadcast has lower latency for writes and reads: –As compared to invalidate.