Text Book: Computer Architecture: A Quantitative
Approach by Hennessy and Patterson
Prof. Prasanta Kumar Dash
Pipelining: Basic and Intermediate
RISC Instruction Set Basics
(from Hennessy and Patterson)
• Properties of RISC architectures:
– All ops on data apply to data in registers and typically
change the entire register (32-bits or 64-bits).
– The only ops that affect memory are load/store
operations: memory to register, and register to memory.
– Load and store ops on data less than a full size of a
register (32, 16, 8 bits) are often available.
– Usually instructions are few in number (this can be
relative) and are typically one size.
RISC Instruction Set Basics
Types Of Instructions
• ALU Instructions:
• Arithmetic operations, either take two registers as
operands or take one register and a sign extended
immediate value as an operand. The result is stored in
a third register.
• Logical operations (AND, OR, XOR) do not usually
differentiate between 32-bit and 64-bit.
• Load/Store Instructions:
• Usually take a register (base register) as an operand
and a 16-bit immediate value. The sum of the two will
create the effective address. A second register acts as the
destination in the case of a load operation.
RISC Instruction Set Basics
Types Of Instructions (continued)
• In the case of a store operation the second register
contains the data to be stored.
• Branches and Jumps
• Conditional branches are transfers of control. As
described before, a branch causes an immediate value
to be added to the current program counter.
RISC Instruction Set Implementation
• We first need to look at how instructions in the
MIPS64 instruction set are implemented without
pipelining. Assume that any instruction (MIPS) can be
executed in at most 5 clock cycles.
• The five clock cycles will be broken up into the
following steps:
• Instruction Fetch Cycle
• Instruction Decode/Register Fetch Cycle
• Execution Cycle
• Memory Access Cycle
• Write-Back Cycle
Instruction cycle
Instruction Fetch (IF) Cycle
• Send the program counter (PC) to memory
and fetch the current instruction from memory.
• Update the PC to the next sequential PC by
adding 4 (since each instruction is 4 bytes) to
the PC.
Instruction Decode (ID)/Register Fetch Cycle
• Decode the instruction and at the same time read
in the values of the registers involved. As the
registers are being read, do an equality test in case
the instruction decodes as a branch or jump.
• The offset field of the instruction is sign-extended
in case it is needed. The possible branch effective
address is computed by adding the sign-extended
offset to the incremented PC. The branch can be
completed at this stage if the equality test is true
and the instruction decoded as a branch.
Instruction Decode (ID)/Register Fetch
Cycle (continued)
• Instruction can be decoded in parallel with
reading the registers because the register
addresses are at fixed locations.
Execution (EX)/Effective Address Cycle
• If a branch or jump did not occur in the
previous cycle, the arithmetic logic unit (ALU)
can execute the instruction.
• At this point the instruction falls into three
different types:
• Memory Reference: ALU adds the base register and
the offset to form the effective address.
• Register-Register: ALU performs the arithmetic,
logical, etc… operation as per the opcode.
• Register-Immediate: ALU performs operation based on
the register and the immediate value (sign extended).
Memory Access (MEM) Cycle
• If a load, the effective address computed from
the previous cycle is referenced and the
memory is read. The actual data transfer to
the register does not occur until the next cycle.
• If a store, the data from the register is written
to the effective address in memory.
Write-Back (WB) Cycle
• Occurs with Register-Register ALU instructions
or load instructions.
• A simple operation: whether the instruction is a
register-register operation or a memory load,
the resulting data is written to the
appropriate register in the register file.
What Is A Pipeline?
• Pipelining is used by virtually all modern
microprocessors to enhance performance by
overlapping the execution of instructions.
• A common analogy for a pipeline is a factory
assembly line. Assume that there are three stages:
• Welding
• Painting
• Polishing
• For simplicity, assume that each task takes one hour.
What Is A Pipeline?
• If a single person were to work on the product it
would take three hours to produce one product.
• If we had three people, one person could work on
each stage. Upon completing their stage, they
could pass the product on to the next person
(since each stage takes one hour there will be no waiting).
• We could then produce one product per hour
assuming the assembly line has been filled.
What Is A Pipeline?
Pipelining is an implementation technique
whereby multiple instructions are overlapped
in execution.
• It takes advantage of parallelism that exists
among the actions needed to execute an instruction.
• Pipelining is the key implementation
technique used to make fast CPUs.
Characteristics Of Pipelining
• If the stages are perfectly balanced, then the
time per instruction on the pipelined
processor (assuming ideal conditions) is
equal to:
(time per instruction on the unpipelined machine) / (number of pipe stages)
• Under these conditions, the speedup from
pipelining equals the number of pipe stages.
• Usually, however, the stages will not be
perfectly balanced; furthermore, pipelining
does involve some overhead.
• The previous expression is ideal. We will see
later that there are many ways in which a
pipeline cannot function in a perfectly
balanced fashion.
Characteristics Of Pipelining
• In terms of a CPU, the implementation of
pipelining has the effect of reducing the
average instruction time, therefore reducing
the average CPI.
• EX: If each instruction in a microprocessor
takes 5 clock cycles (unpipelined) and we have
a 4 stage pipeline, the ideal average CPI with
the pipeline will be 1.25 .
Serial Vs Pipeline
[Figure: pipelined execution. Six instructions in program flow, each passing through the IFetch, Dcd, Exec, Mem, and WB stages.]
Precedence relation
Given a set of subtasks {T1, T2, ..., Tn} for a task T, a precedence
relation states that some subtask Tj cannot start until some earlier
subtask Ti (where i < j) finishes.
Pipeline consists of cascade of processing stages.
Stages are combinational circuits over data stream
flowing through pipe.
Stages are separated by high speed interface latches
(Holding intermediate results between stages.)
Control must be under a common clock.
Pipeline Cycle
Pipeline cycle:
Determined by the time required by the
slowest stage.
Pipeline designers try to balance the length
(i.e. the processing time) of each pipeline stage.
For a perfectly balanced pipeline, the
execution time per instruction is t/n, where
t is the execution time per instruction on
nonpipelined machine and n is the number
of pipe stages.
Pipeline Cycle
However, it is very difficult to make the
different pipeline stages perfectly balanced.
Besides, pipelining itself involves some overhead.
Synchronous Pipeline
[Figure: Input, stages S1, S2, ..., Sk separated by latches, Output; latch delay $d$, stage delay $\tau_m$]
- Transfers between stages are simultaneous.
- One task or operation enters the pipeline per cycle.
Asynchronous Pipeline
[Figure: stages S1, S2, ..., Sk in sequence]
- Transfers performed when individual stages
are ready.
- Handshaking protocol between processors.
- Different amounts of delay may be experienced at
different stages.
- Can display variable throughput rate.
A Few Pipeline Concepts
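[Figure: stage $S_i$ with delay $\tau_m$ connected to stage $S_{i+1}$ through a latch with delay $d$]
Pipeline cycle: $\tau$
Latch delay: $d$
$\tau = \max\{\tau_m\} + d$
Pipeline frequency: $f = 1/\tau$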
Example on Clock period
Suppose the time delays of the 4 stages are $\tau_1 = 60$ ns,
$\tau_2 = 50$ ns, $\tau_3 = 90$ ns, $\tau_4 = 80$ ns, and the
interface latch has a delay of $d = 10$ ns.
Since $\tau = \max\{\tau_m\} + d$, the cycle time of this pipeline is
$\tau = 90 + 10 = 100$ ns.
Clock frequency of the pipeline: $f = 1/\tau = 1/(100\ \text{ns}) = 10$ MHz.
If it is non-pipelined, the time per task is $60 + 50 + 90 + 80 = 280$ ns.
Ideal Pipeline Speedup
A k-stage pipeline processes n tasks in k + (n-1)
clock cycles:
k cycles for the first task and n-1
cycles for the remaining n-1 tasks.
Total time to process n tasks:
$T_k = [k + (n-1)]\,\tau$
For the non-pipelined processor:
$T_1 = n\,k\,\tau$
Pipeline Speedup Expression
$S_k = \frac{T_1}{T_k} = \frac{n\,k\,\tau}{[k + (n-1)]\,\tau} = \frac{n\,k}{k + (n-1)}$
Maximum speedup: $S_k \to k$ for $n \gg k$.
Observe that the memory bandwidth
must increase by a factor of $S_k$;
otherwise, the processor would stall
waiting for data to arrive from memory.
Efficiency of pipeline
The percentage of busy time-space span over the
total time-space span.
n: number of tasks or instructions
k: number of pipeline stages
$\tau$: clock period of the pipeline
Hence pipeline efficiency can be defined by:
$\eta = \frac{n\,k\,\tau}{k\,[k\,\tau + (n-1)\,\tau]} = \frac{n}{k + (n-1)}$
Throughput of pipeline
Number of tasks (results) that can be completed by a
pipeline per unit time:
$w = \frac{n}{[k + (n-1)]\,\tau} = \frac{\eta}{\tau} = \eta f$
Ideal case: $w = 1/\tau = f$ when $\eta = 1$.
Maximum throughput = frequency of the linear pipeline.
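The cycle-time, speedup, efficiency, and throughput expressions for a k-stage linear pipeline can be checked numerically. The following is a minimal sketch, assuming throughput is taken as n tasks divided by the pipelined time; the function names and the choice of n = 1000 tasks are illustrative and not from the slides, while the stage and latch delays reuse the clock-period example.

```python
def cycle_time(stage_delays, latch_delay):
    """Pipeline cycle: slowest stage delay plus the latch delay."""
    return max(stage_delays) + latch_delay


def pipeline_metrics(n, k, tau):
    """Return (T_k, speedup, efficiency, throughput) for n tasks on k stages."""
    t_k = (k + (n - 1)) * tau   # pipelined time: k cycles to fill, then 1 per task
    t_1 = n * k * tau           # non-pipelined time, each task takes k cycles
    speedup = t_1 / t_k
    efficiency = speedup / k    # algebraically n / (k + n - 1)
    throughput = n / t_k        # tasks completed per unit time
    return t_k, speedup, efficiency, throughput


# Stage delays (ns) and latch delay from the clock-period example.
stage_delays_ns = [60, 50, 90, 80]
tau_ns = cycle_time(stage_delays_ns, latch_delay=10)

# n = 1000 is an arbitrary illustrative task count.
t_k, speedup, efficiency, throughput = pipeline_metrics(
    n=1000, k=len(stage_delays_ns), tau=tau_ns
)
print(tau_ns, t_k, round(speedup, 3), round(efficiency, 3))
```

As n grows, the speedup approaches k (here 4) and the efficiency approaches 1, which is the ideal case in which throughput equals the clock frequency.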
Pipelines: A Few Basic Concepts
Historically, there are two different types of pipelines:
Instruction pipelines
Arithmetic pipelines
Arithmetic pipelines (e.g. FP multiplication) are not
popular in general purpose computers:
They need a continuous stream of arithmetic operations.
E.g. vector processors operating on an array.
On the other hand, instruction pipelines are used in
almost every modern processor.
Pipelines: A Few Basic Concepts
Pipeline increases instruction throughput:
But, does not decrease the execution time of the
individual instructions.
In fact, slightly increases execution time of each
instruction due to pipeline overheads.
Pipeline overhead arises due to a combination of:
Pipeline register delay
Clock skew
Pipelines: A Few Basic Concepts
Pipeline register delay:
Caused due to set up time
Clock skew:
the maximum delay between clock arrival at
any two registers.
Once the clock cycle is as small as the pipeline overhead,
no further pipelining would be useful.
Very deep pipelines may not be useful.
Pipeline Registers
 Pipeline registers are essential part of pipelines:
There are 4 groups of pipeline registers in 5 stage pipeline.
 Each group saves output from one stage and passes it as input
to the next stage:
 This way, each time “something is computed”...
Effective address, immediate value, register content, etc.
It is saved safely in the context of the instruction that needs it.
Looking At The Big Picture
• Overall, the most time that a non-pipelined
instruction can take is 5 clock cycles. Below is
a summary:
• Branch - 2 clock cycles
• Store - 4 clock cycles
• Other - 5 clock cycles
• EX: Assuming branch instructions account for
12% of all instructions and stores account for
10%, what is the average CPI of a non-
pipelined CPU?
ANS: 0.12*2+0.10*4+0.78*5 = 4.54
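The weighted average above can be reproduced in a few lines. This is a small sketch; the dictionary layout and variable names are illustrative, while the frequencies and cycle counts are the ones given in the example.

```python
# (fraction of instructions, clock cycles) for each instruction class
mix = {
    "branch": (0.12, 2),
    "store": (0.10, 4),
    "other": (0.78, 5),
}

# Average CPI is the frequency-weighted sum of the per-class cycle counts.
avg_cpi = sum(fraction * cycles for fraction, cycles in mix.values())
total_fraction = sum(fraction for fraction, _ in mix.values())

print(round(avg_cpi, 2))
```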
Find the total time to process 100 tasks in a
2-stage pipeline with a cycle time of 10 ns.
Repeat the above problem assuming latching
in the pipeline requires 2 ns.
A pipeline has 4 stages with time delays $\tau_1 = 60$ ns,
$\tau_2 = 50$ ns, $\tau_3 = 90$ ns, $\tau_4 = 80$ ns, and the
interface latch has a delay of $d = 10$ ns. What
is the cycle time of this pipeline?
What is the clock frequency of the above pipeline?
Instruction-Level Parallelism
• What is ILP (Instruction-Level Parallelism)?
– Parallel execution of different instructions
belonging to the same thread.
• A thread usually consists of several basic blocks
– As well as several branches and loops.
• Basic block:
– A sequence of instructions not having a branch
• Instruction pipelines can effectively exploit
parallelism in a basic block:
– An n-stage pipeline can improve performance up
to n times.
– Does not require much investment in hardware
– Transparent to the programmers.
• Pipelining can be viewed as a way to:
– Decrease average CPI, and/or
– Decrease clock cycle time for instructions.
Drags on Pipeline Performance
• Factors that can degrade pipeline performance
– Unbalanced stages
– Pipeline overheads
– Clock skew
– Hazards
• Hazards cause the worst drag on the
performance of a pipeline.
The Classical RISC: 5 Stage Pipeline
• In an ideal case to implement a pipeline we
just need to start a new instruction at each
clock cycle.
• Unfortunately there are many problems while
trying to implement this.
• If we look at each stage of instruction execution
as being independent, we can see how
instructions can be “overlapped”.
Problems With The Previous Figure
• The memory is accessed twice during each clock
cycle. This problem is avoided by using separate data
and instruction caches.
• It is important to note that if the clock period is the
same for a pipelined processor and a non-pipelined
processor, the memory must work five times faster.
• Another problem that we can observe is that the
registers are accessed twice every clock cycle. To try
to avoid a resource conflict we perform the register
write in the first half of the cycle and the read in the
second half of the cycle.
Problems With The Previous Figure
• Because we write in the first half, a value being
written can be read in the same cycle by another
instruction further down the pipeline.
• A third problem arises with the interaction of
the pipeline with the PC. We use an adder to
increment PC by the end of IF. Within ID we
may branch and modify PC. How does this
affect the pipeline?
Pipeline Hazards
• The performance gain from using pipelining
occurs because we can start the execution of a
new instruction each clock cycle. In a real
implementation this is not always possible.
• What is a pipeline hazard?
– A situation that prevents an instruction from executing
during its designated clock cycles.
• Pipeline hazards prevent the execution of the
next instruction during the appropriate clock cycle.
Types Of Hazards
Structural hazards arise from resource conflicts
when the hardware cannot support all possible
combinations of instructions simultaneously in
overlapped execution.
Data hazards arise when an instruction depends on
the results of a previous instruction in a way that
is exposed by the overlapping of instructions in
the pipeline.
Control hazards arise from the pipelining of
branches and other instructions that change the PC.
Structural Hazard: Example
An Example of a Structural Hazard
[Figure: pipeline diagram of five overlapped instructions (a first instruction followed by Instructions 1 to 4), each using the Mem, Reg, DM, and Reg resources in successive clock cycles.]
Would there be a hazard here?
Performance with Stalls
• Stalls degrade performance of a pipeline:
–Result in deviation from 1 instruction
executing/clock cycle.
–Let’s examine by how much stalls can
impact CPI…
A Hazard Will Cause A Pipeline Stall
• Some performance expressions involve a
realistic pipeline in terms of CPI. It is assumed
that the clock period is the same for pipelined
and unpipelined implementations.
Speedup = Avg Inst Time Unpipelined / Avg Inst Time Pipelined
= CPI Unpipelined / CPI Pipelined
= Pipeline Depth / (1 + Stalls per Inst)
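The last form of the speedup expression is easy to evaluate for different stall rates. The sketch below assumes, as the slide does, equal clock periods for the pipelined and unpipelined machines and an ideal pipelined CPI of 1; the function name and sample stall rates are illustrative.

```python
def pipeline_speedup(depth, stalls_per_instruction):
    """Speedup = pipeline depth / (1 + stall cycles per instruction)."""
    return depth / (1 + stalls_per_instruction)


ideal = pipeline_speedup(5, 0.0)         # no stalls: speedup equals the depth
with_stalls = pipeline_speedup(5, 0.25)  # one stall cycle every 4 instructions

print(ideal, with_stalls)
```

Even a modest stall rate removes a noticeable share of the ideal speedup, which is why the following slides concentrate on reducing stalls.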
Dealing With Structural Hazards
• Arise from resource conflicts among instructions
executing concurrently:
–Same resource is required by two (or more)
concurrently executing instructions at the
same time.
• Easy way to avoid structural hazards:
–Duplicate resources (sometimes not practical)
–Memory interleaving ( lower & higher order )
• Examples of Resolution of Structural Hazard:
–An ALU to perform an arithmetic operation
and an adder to increment PC.
–Separate data cache and instruction cache
accessed simultaneously in the same cycle.
How is it Resolved?
[Figure: pipeline diagram of Instructions 1 to 3 with a row of five bubbles inserted to stall the pipeline.]
A Pipeline can be stalled by inserting a “bubble” or NOP
Dealing With Structural Hazards
• A structural hazard is dealt with by inserting a
stall or pipeline bubble into the pipeline.
• This means that for that clock cycle, nothing
happens for that instruction.
• This effectively “slides” that instruction, and
subsequent instructions, by one clock cycle.
• This effectively increases the average CPI.
Dealing With Structural Hazards
• We can see that even though the clock speed
of the processor with the hazard is a little
faster, the speedup is still less than 1.
• Therefore the hazard has quite an effect on
the performance.
• Sometimes computer architects will opt to
design a processor that exhibits a structural
hazard. Why?
• A: The improvement to the processor data path is too costly.
• B: The hazard occurs rarely enough so that the processor will still
perform to specifications.
An Example of Performance
Impact of Structural Hazard
• Assume:
– Pipelined processor.
– Data references constitute 40% of an instruction mix.
– Ideal CPI of the pipelined machine is 1.
– Consider two cases:
• Unified data and instruction cache vs. separate data and
instruction cache.
• What is the impact on performance?
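A rough way to quantify the impact is sketched below. It assumes (this is not stated on the slide) that with a unified cache every data reference collides with an instruction fetch and costs exactly one stall cycle, and that both designs run at the same clock rate. The variable names are illustrative.

```python
ideal_cpi = 1.0
data_ref_fraction = 0.40   # from the slide: 40% of instructions reference data
stall_cycles = 1           # assumption: one stall per conflicting data reference

# Separate instruction and data caches: no structural hazard on memory.
cpi_separate = ideal_cpi

# Unified cache: each data reference stalls the instruction fetch once.
cpi_unified = ideal_cpi + data_ref_fraction * stall_cycles

# With equal clock rates, relative execution time is the CPI ratio.
slowdown = cpi_unified / cpi_separate
print(cpi_separate, cpi_unified, slowdown)
```

Under these assumptions the unified-cache machine is about 1.4 times slower; a faster clock on the unified design would narrow, but not necessarily close, that gap.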
Data Dependences and Hazards
• Determining how one instruction depends on
another is critical to determining how much
parallelism exists in a program and how that
parallelism can be exploited.
Data Dependences
There are three different types of dependences:
• Data Dependences (also called true data
dependences), Name Dependences and
Control Dependences.
• An instruction j is data dependent on
instruction i if either of the following holds:
 Instruction i produces a result that may be used by
instruction j, or
 Instruction j is data dependent on instruction k, and
instruction k is data dependent on instruction i.
Consider the MIPS code sequence
that increments a vector of values in memory
(starting at 0(R1)) by a scalar in register F2:
Loop: L.D    F0, 0(R1)    ;F0=array element
      ADD.D  F4, F0, F2   ;add scalar in F2
      S.D    F4, 0(R1)    ;store result
      DADDUI R1, R1, #-8  ;decrement pointer 8 bytes
      BNE    R1, R2, Loop ;branch R1!=R2
Data Dependences
• A data value may flow between instructions
either through registers or through memory locations.
• When the data flow occurs in a register,
detecting the dependence is straightforward
since the register names are fixed in the instructions.
• Although it gets more complicated when
branches intervene.
• Dependences that flow through memory
locations are more difficult to detect,
• since two addresses may refer to the same
location but look different:
• For example, 100(R4) and 20(R6) may be
identical memory addresses.
• The effective address of a load or store may change
from one execution of the instruction to another
(so that 20(R4) and 20(R4) may be different).
Detecting Data Dependences
• A data value may flow between instructions:
– (i) through registers
– (ii) through memory locations.
• When data flow is through a register:
– Detection is rather straight forward.
• When data flow is through a memory location:
– Detection is difficult.
– Two addresses may refer to the same memory
location but look different.
100(R4) and 20(R6)
Name Dependences
• A Name Dependence occurs when two
instructions use the same register or memory
location, called a name, but there is no flow of data
between the instructions associated with that name.
• There are two types of name dependences
between an instruction i that precedes
instruction j in program order:
• Antidependence,
• Output Dependence
• An Antidependence between instruction i and
instruction j occurs when instruction j writes a
register or memory location that instruction i reads.
• The original ordering must be preserved to
ensure that i reads the correct value. There is
an antidependence between S.D and DADDUI
on register R1 in the MIPS code sequence shown next.
Consider the MIPS code sequence
that increments a vector of values in memory
(starting at 0(R1)) by a scalar in register F2:
Loop: L.D    F0, 0(R1)    ;F0=array element
      ADD.D  F4, F0, F2   ;add scalar in F2
      S.D    F4, 0(R1)    ;store result
      DADDUI R1, R1, #-8  ;decrement pointer 8 bytes
      BNE    R1, R2, Loop ;branch R1!=R2
• An Output Dependence occurs when
instruction i and instruction j write the same
register or memory location.
• The ordering between the instructions must be
preserved to ensure that the value finally
written corresponds to instruction j.
Data Hazards
• Occur when an instruction under execution
depends on:
– Data from an instruction ahead in pipeline.
• Example:
– Dependent instruction uses old data:
• Results in wrong computations
Types of Data Hazards
• Data hazards are of three types:
– Read After Write (RAW)
– Write After Read (WAR)
– Write After Write (WAW)
• With an in-order execution machine:
– WAW, WAR hazards can not occur.
• Assume instruction i is issued before j.
Read after Write (RAW) Hazards
• Hazard between two instructions i & j may
occur when j attempts to read some data
object that has been modified by i.
– Instruction j tries to read its operand before
instruction i writes it.
– j would incorrectly receive an old or incorrect value.
• Example: instruction i is a write instruction issued before j;
instruction j is a read instruction issued after i.
i: ADD R1, R2, R3
j: SUB R4, R1, R6
Read after Write (RAW) Hazards
[Figure: domain D (operands read) and range R (operands written) of instructions I and J]
R(I) ∩ D(J) ≠ Ø for RAW
RAW Dependency: More Examples
• Example program (a):
–i1: load r1, addr;
–i2: add r2, r1,r1;
• Program (b):
–i1: mul r1, r4, r5;
–i2: add r2, r1, r1;
• In both cases, i2 does not get its operand until i1
has completed writing the result:
–In (a) this is due to load-use dependency
–In (b) this is due to define-use dependency
Write after Read (WAR) Hazards
– Instruction j tries to write its destination operand
before instruction i reads it.
– i would incorrectly receive a new or incorrect value.
• WAR hazards do not usually occur because of the
amount of time between the read cycle and write
cycle in a pipeline.
• Example: instruction i is a read instruction issued before j;
instruction j is a write instruction issued after i.
WAR hazards occur due to antidependence.
i: ADD R1, R2, R3
j: SUB R2, R4, R6
Write after Read (WAR) Hazards
[Figure: domain D and range R of instructions I and J]
D(I) ∩ R(J) ≠ Ø for WAR
Write After Write (WAW) Hazards
• WAW hazard:
– Both i and j want to modify the same data object.
– Instruction j tries to write an operand before
instruction i writes it.
– Writes are performed in the wrong order.
• Example: instruction i is a write instruction issued before j;
instruction j is a write instruction issued after i.
WAW hazards occur due to output dependence.
i: DIV F1, F2, F3
j: SUB F1, F4, F6
Write After Write (WAW) Hazards
[Figure: domain D and range R of instructions I and J]
R(I) ∩ R(J) ≠ Ø for WAW
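The three set conditions (RAW, WAR, WAW) translate directly into set intersections. The sketch below represents each instruction by the set of registers it reads (its domain D) and the set it writes (its range R); the function name and tuple layout are illustrative. It reports potential hazards only, since whether a dependence actually causes a stall depends on the pipeline.

```python
def classify_hazards(instr_i, instr_j):
    """Return the potential hazards between i and a later instruction j.

    Each instruction is a pair (reads, writes) of register-name sets.
    """
    d_i, r_i = instr_i
    d_j, r_j = instr_j
    hazards = []
    if r_i & d_j:            # j reads something i writes
        hazards.append("RAW")
    if d_i & r_j:            # j writes something i reads
        hazards.append("WAR")
    if r_i & r_j:            # both write the same location
        hazards.append("WAW")
    return hazards


# The instruction pairs used on the RAW, WAR, and WAW slides.
add_r1 = ({"R2", "R3"}, {"R1"})     # ADD R1, R2, R3
sub_r4 = ({"R1", "R6"}, {"R4"})     # SUB R4, R1, R6
sub_r2 = ({"R4", "R6"}, {"R2"})     # SUB R2, R4, R6
div_f1 = ({"F2", "F3"}, {"F1"})     # DIV F1, F2, F3
sub_f1 = ({"F4", "F6"}, {"F1"})     # SUB F1, F4, F6

raw_case = classify_hazards(add_r1, sub_r4)
war_case = classify_hazards(add_r1, sub_r2)
waw_case = classify_hazards(div_f1, sub_f1)
print(raw_case, war_case, waw_case)
```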
Inter-Instruction Dependences
• Data dependence: Read-after-Write (RAW)
r3 ← r1 op r2
r5 ← r3 op r4
• Anti-dependence: Write-after-Read (WAR)
r3 ← r1 op r2
r1 ← r4 op r5
• Output dependence: Write-after-Write (WAW)
r3 ← r1 op r2
r5 ← r3 op r4
r3 ← r6 op r7
• Control dependence
Data Dependencies : Summary
Data dependencies in straight-line code:
• Read After Write (flow dependency): true dependency. Cannot be overcome.
• Write After Read (anti dependency): false dependency. Can be eliminated by register renaming.
• Write After Write (output dependency): false dependency. Can be eliminated by register renaming.
Recollect Data Hazards
What causes them?
– Pipelining changes the order of read/write
accesses to operands.
– Order differs from that of an unpipelined machine.
• Example:
– ADD R1, R2, R3
– SUB R4, R1, R5
For MIPS, ADD writes
the register in WB but
SUB needs it in ID.
This is a data hazard
Illustration of a Data Hazard
[Figure: pipeline diagram (Mem, Reg, DM, Reg) for the following instruction sequence]
ADD R1, R2, R3
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11
ADD instruction causes a hazard in next 3 instructions
because register not written until after those 3 read it.
Solutions to Data Hazard
• Operand forwarding
• Pipeline interlock
• By S/W (NOP)
• Reordering the instruction
• Simplest solution to data hazard:
– forwarding
• Result of the ADD instruction is not really needed by SUB
– until after ADD actually produces it.
• Can we move the result from EX/MEM
register to the beginning of ALU (where SUB
needs it)?
– Yes!
• Generally speaking:
–Forwarding occurs when a result is
passed directly to the functional unit
that requires it.
–Result goes from output of one pipeline
stage to input of another.
Forwarding Technique
[Figure: two pipeline latches connected by a forwarding path]
When Can We Forward?
[Figure: pipeline diagram (Mem, Reg, DM, Reg) for the following instruction sequence, with forwarding lines drawn between stages]
ADD R1, R2, R3
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11
SUB gets info. from the EX/MEM pipe register.
AND gets info. from the MEM/WB pipe register.
OR gets info. by forwarding from the register file.
If the line goes “forward” you can do forwarding.
If it’s drawn backward, it’s physically impossible.
General Data Forwarding
• It is easy to see how data forwarding can be
used by drawing out the pipelined execution
of each instruction.
• Now consider the following instructions:
DADD R1, R2, R3
LD R4, 0(R1)
SD R4, 12(R1)
• Can data forwarding prevent all data hazards?
• NO!
• The following operations will still cause a data
hazard. This happens because the further
down the pipeline we get, the less we can use forwarding.
LD R1, 0(R2)
DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
• We can avoid the hazard by using a pipeline interlock.
• The pipeline interlock will detect when data
forwarding will not be able to get the data to
the next instruction in time.
• A stall is introduced until the instruction can
get the appropriate data from the previous instruction.
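The interlock check for the load case can be sketched as a scan over adjacent instruction pairs. This is a simplified model, assuming a classic 5-stage pipeline with full forwarding in which the only remaining stall is a load immediately followed by an instruction that needs the loaded register at the start of EX. The `Instr` tuple and `schedule_with_interlock` are hypothetical names for illustration, and store-data forwarding is not modeled.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]   # registers needed at the start of EX


def schedule_with_interlock(program):
    """Return the issue sequence, inserting 'stall' after a load-use pair."""
    issued = []
    prev = None
    for instr in program:
        # A load's data is only available after MEM, one cycle too late
        # for the EX stage of the instruction that immediately follows it.
        if prev is not None and prev.op == "LD" and prev.dest in instr.srcs:
            issued.append("stall")
        issued.append(instr.op)
        prev = instr
    return issued


program = [
    Instr("LD", "R1", ("R2",)),         # LD   R1, 0(R2)
    Instr("DSUB", "R4", ("R1", "R5")),  # DSUB R4, R1, R5
    Instr("AND", "R6", ("R1", "R7")),   # AND  R6, R1, R7
    Instr("OR", "R8", ("R1", "R9")),    # OR   R8, R1, R9
]
issue_order = schedule_with_interlock(program)
print(issue_order)
```

Only DSUB triggers a stall; after the bubble, AND and OR are far enough behind the load for forwarding or the register file to supply R1.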
Handling data hazard by S/W
• The compiler introduces NOPs in between two
dependent instructions.
• NOP = an instruction that does nothing and only keeps
a gap between two instructions.
• Detection of the dependency is left entirely
to the software.
• Advantage: it enables an easy technique
called instruction reordering.
Instruction Reordering
Before reordering:
• ADD R1 , R2 , R3
• SUB R4 , R1 , R5
• XOR R8 , R6 , R7
• AND R9 , R10 , R11
After reordering:
• ADD R1 , R2 , R3
• XOR R8 , R6 , R7
• AND R9 , R10 , R11
• SUB R4 , R1 , R5
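The benefit of reordering can be made concrete by counting the NOPs a compiler would have to insert. The sketch below assumes a 5-stage pipeline with no forwarding, where the register file is written in the first half of a cycle and read in the second half, so a consumer must issue at least three slots after its producer. `count_nops` and the `(dest, sources)` encoding are illustrative.

```python
# Producer issued in slot p writes back in cycle p + 4; a consumer issued in
# slot c reads registers in cycle c + 1. Requiring c + 1 >= p + 4 gives c >= p + 3.
MIN_DISTANCE = 3


def count_nops(program):
    """Count NOPs needed so every source is read after it has been written."""
    last_write_slot = {}
    slot = 0
    nops = 0
    for dest, srcs in program:
        earliest = slot
        for reg in srcs:
            if reg in last_write_slot:
                earliest = max(earliest, last_write_slot[reg] + MIN_DISTANCE)
        nops += earliest - slot      # pad with NOPs up to the earliest legal slot
        slot = earliest
        last_write_slot[dest] = slot
        slot += 1
    return nops


before = [
    ("R1", ("R2", "R3")),    # ADD R1, R2, R3
    ("R4", ("R1", "R5")),    # SUB R4, R1, R5
    ("R8", ("R6", "R7")),    # XOR R8, R6, R7
    ("R9", ("R10", "R11")),  # AND R9, R10, R11
]
after = [before[0], before[2], before[3], before[1]]

nops_before = count_nops(before)
nops_after = count_nops(after)
print(nops_before, nops_after)
```

Under these assumptions the original order needs two NOPs between ADD and SUB, while the reordered sequence fills both slots with the independent XOR and AND and needs none.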
Instruction Execution:
MIPS Data path
• Can break down the process of “running” an
instruction into stages.
• These stages are what needs to be done to
complete the execution of each instruction.
Some instructions will not require some stages.
MIPS Data path
The DLX (MIPS) datapath allows every instruction to be
executed in 4 or 5 cycles.
1. Instruction Fetch (IF) - Get the instruction to be executed.
IR ← M[PC]
NPC ← PC + 4
IR – Instruction register
NPC – Next program counter
Instruction Execution
2. Instruction Decode/Register Fetch (ID) –
Figure out what the instruction is supposed to
do and what it needs.
A ← Register File[Rs]
B ← Register File[Rt]
Imm ← {(IR15)^16, IR15..0} (the 16-bit immediate field, sign-extended)
A, B, and Imm are temporary registers that hold
inputs to the ALU, which is in the Execute stage.
Instruction Execution
3. Execution (EX) - The instruction has been decoded, so
execution can be split according to instruction type.
ALU instr: ALUout ← A op B
Reg-Imm: ALUout ← A op Imm
Branch: ALUout ← NPC + Imm
Cond ← (A {==, !=} 0)
LD/ST: ALUout ← A + Imm
to form the effective address
Instruction Execution
4. Memory Access/Branch Completion (MEM) – Besides
the IF stage, this is the only stage that accesses
memory, to load and store data.
Load: LMD ← Mem[ALUout]
Store: Mem[ALUout] ← B
Branch: if (cond) PC ← ALUout
Jump: PC ← ALUout
LMD = Load Memory Data register
Instruction Execution
5. Write-Back (WB) – Write ALU results and loaded data
back to registers.
ALU instr: Rd ← ALUout
Load: Rt ← LMD
Reg-Imm: Rt ← ALUout
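The five steps can be traced in software with a small register-transfer sketch. Everything about the encoding here is an assumption made for illustration: instructions are pre-decoded Python dictionaries rather than fetched bit patterns, memory is a dictionary keyed by byte address, loads write the register named in the rt field (as in MIPS), the branch offset is added without shifting (as on the slide), and only a branch-if-zero condition is modeled.

```python
import operator


def sign_extend16(value):
    """Sign-extend a 16-bit field to a Python int."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def execute(instr, regs, mem, pc):
    """Carry one pre-decoded instruction through IF/ID/EX/MEM/WB; return next PC."""
    # IF: the instruction is already in hand; NPC <- PC + 4
    npc = pc + 4
    # ID: A <- Regs[rs], B <- Regs[rt], Imm <- sign-extended immediate
    a = regs[instr.get("rs", 0)]
    b = regs[instr.get("rt", 0)]
    imm = sign_extend16(instr.get("imm", 0))
    kind = instr["kind"]
    # EX: ALU operation, branch target, or effective address
    if kind == "alu":
        alu_out = instr["op"](a, b)
    elif kind == "alu_imm":
        alu_out = instr["op"](a, imm)
    elif kind == "branch":
        alu_out = npc + imm
        cond = (a == 0)
    else:  # load or store
        alu_out = a + imm
    # MEM: memory access or branch completion
    if kind == "load":
        lmd = mem[alu_out]
    elif kind == "store":
        mem[alu_out] = b
    elif kind == "branch" and cond:
        npc = alu_out
    # WB: write results back to the register file
    if kind == "alu":
        regs[instr["rd"]] = alu_out
    elif kind == "alu_imm":
        regs[instr["rt"]] = alu_out
    elif kind == "load":
        regs[instr["rt"]] = lmd
    return npc


regs = [0] * 32
regs[1], regs[2] = 100, 7
mem = {}
pc = 0
pc = execute({"kind": "store", "rs": 1, "rt": 2, "imm": 8}, regs, mem, pc)
pc = execute({"kind": "load", "rs": 1, "rt": 3, "imm": 8}, regs, mem, pc)
pc = execute({"kind": "alu", "op": operator.add, "rs": 2, "rt": 3, "rd": 4},
             regs, mem, pc)
pc_before_branch = pc
pc = execute({"kind": "branch", "rs": 0, "imm": 0xFFF4}, regs, mem, pc)
print(mem, regs[3], regs[4], pc_before_branch, pc)
```

The final instruction is a taken branch with offset -12 (0xFFF4 sign-extended), so the next PC is NPC + Imm = 16 - 12 = 4.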
Instruction Execution
Control Hazards
• Result from branch and other instructions that change
the flow of a program (i.e. change PC).
• Example:
• Statement in line 2 is control dependent on statement
at line 1.
• Until condition evaluation completes:
– It is not known whether s1 or s2 will execute next.
1: If(cond){
2: s1}
3: s2
• Control hazards are caused by branches in the
• During the IF stage remember that the PC is
incremented by 4 in preparation for the next IF
cycle of the next instruction.
• What happens if a branch is performed and
we aren’t simply incrementing the PC by 4?
• The easiest way to deal with the occurrence of a
branch is to perform the IF stage again once the
branch occurs.
Four Simple Control/Branch Hazard Solutions
• The following solutions assume that we are
dealing with static branches (compile time),
meaning that the actions taken during a
branch do not change.
#1. Flush Pipeline/Stall
#2. Predict Branch Not Taken
#3. Predict Branch Taken
#4. Delayed Branch
Branch Hazard Solutions
#1. Flush Pipeline/ Stall
• Stall until the branch direction is clear, flushing the
pipe once an instruction is detected to be a branch
during the ID stage.
• As an example, we stall the
pipeline until the branch is resolved (in that
case we repeat the IF stage until the branch
is resolved and modifies the PC).
Performing IF Twice
• We take a big performance hit by
repeating the instruction fetch whenever
a branch occurs. Note that this happens whether
or not the branch is taken.
• This guarantees that the PC will get the
correct value.
Control Hazards solutions
#2. Predict Branch Not Taken:
• What if we treat every branch as “not taken”?
Remember that not only do we read the registers
during ID, but we also perform an equality test to
determine whether we need to branch.
• We can improve performance by assuming that
the branch will not be taken.
–Execute successor instructions in sequence as if there
is no branch
–undo instructions in pipeline if branch actually taken
Control Hazards solutions:
Predict Branch Not Taken: Contd..
• The “branch-not taken” scheme is the same as
performing the IF stage a second time in our 5
stage pipeline if the branch is taken.
• If not there is no performance degradation.
• 47% branches not taken on average
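The gain from predict-not-taken over always stalling can be estimated with the usual CPI expression. The sketch below assumes an ideal CPI of 1, a one-cycle branch penalty, and a branch frequency of 12% (an illustrative value borrowed from the earlier average-CPI example, not a measurement for this scheme); the 47% not-taken figure is from the slide.

```python
def cpi_with_branches(branch_freq, penalty_cycles, penalized_fraction=1.0):
    """Ideal CPI of 1 plus the average branch stall cycles per instruction."""
    return 1 + branch_freq * penalized_fraction * penalty_cycles


branch_freq = 0.12           # assumption: 12% of instructions are branches
not_taken_fraction = 0.47    # from the slide

# Always repeat IF: every branch pays the one-cycle penalty.
cpi_stall = cpi_with_branches(branch_freq, 1)

# Predict not taken: only taken branches pay the penalty.
cpi_predict_not_taken = cpi_with_branches(
    branch_freq, 1, penalized_fraction=1 - not_taken_fraction
)
print(round(cpi_stall, 4), round(cpi_predict_not_taken, 4))
```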
Control Hazards solutions
#3 Predict Branch Taken
– The “branch taken” scheme is of no benefit in our case because
we evaluate the branch target address in the ID stage.
– 53% of branches are taken on average.
– But the branch target address is not available after IF in MIPS.
• MIPS still incurs a 1 cycle branch penalty even with
predict taken.
• For loops, or in some other machines where the branch target is
known before the branch outcome is computed, significant
benefits can be accrued.
Control Hazards solutions
#4: Delayed Branch
• The fourth method for dealing with a control
hazard is to implement a “delayed” branch
• In this scheme an instruction is inserted into the
pipeline that is useful and not dependent on
whether the branch is taken or not. It is the job of
the compiler to determine the delayed branch instruction.
• If the branch is actually taken, we need to clear
the pipeline of any code loaded in from the “not-
taken” path.
Control Hazards solutions
Delayed Branch Contd…
• Likewise we can assume that the branch is
always taken. Does this work in our “5-stage” pipeline?
– No, the branch target is computed during the ID cycle.
• Some processors will have the target address
computed in time for the IF stage of the next
instruction so there is no delay.
Control Hazards solutions
#4: Delayed Branch
–Insert unrelated successor in the branch delay slot:
branch instruction
sequential successor 1
sequential successor 2
...
sequential successor n (branch delay of length n)
branch target (if taken)
–1 slot delay required in 5 stage pipeline
[Figure: The behavior of a delayed branch]
Delayed Branch
• Simple idea: Put an instruction that would be
executed anyway right after a branch.
• Question: What instruction do we put in the delay slot?
• Answer: one that can safely be executed no matter what the
branch does.
– The compiler decides this.
[Figure: a branch followed by its delay slot instruction, then the branch target OR successor]
Delayed Branch
• One possibility: An instruction from before the branch.
• Example:
Before:
DADD R1, R2, R3
if R2 == 0 then
    delay slot
After:
if R2 == 0 then
    DADD R1, R2, R3
• The DADD instruction is executed no matter what
happens in the branch:
– Because it is executed before the branch!
– Therefore, it can be moved.
Delayed Branch
[Figure: branch, then the add instruction in the delay slot, then the branch target/successor]
By this time, we know whether or not
to take the branch.
• We get to execute the “DADD” instruction
“for free”.
Delayed Branch
• Another possibility: An instruction from the branch target.
• Example:
Before:
DSUB R4, R5, R6
DADD R1, R2, R3
if R1 == 0 then
    delay slot
After:
DSUB R4, R5, R6
DADD R1, R2, R3
if R1 == 0 then
    DSUB R4, R5, R6
• The DSUB instruction can be replicated into the delay
slot, and the branch target can be changed.
Delayed Branch
• Third possibility: An instruction from the fall-through
(not-taken) path.
• Example:
Before:
DADD R1, R2, R3
if R1 == 0 then
    delay slot
OR R7, R8, R9
DSUB R4, R5, R6
After:
DADD R1, R2, R3
if R1 == 0 then
    OR R7, R8, R9
DSUB R4, R5, R6
• The OR instruction can be moved into the delay slot
ONLY IF its execution doesn’t disrupt the program
execution (e.g., R7 is overwritten later).
Introduction to Parallel
Flynn’s Classification
• SISD (Single Instruction Single Data):
– Uniprocessors
• MISD (Multiple Instruction Single Data):
– No practical examples exist
• SIMD (Single Instruction Multiple Data):
– Specialized processors (vector architectures,
multimedia extensions, graphics processing units)
• MIMD (Multiple Instruction Multiple Data):
– General purpose, commercially important
(tightly coupled MIMD, loosely coupled MIMD)
[Figure: processing units Unit 1, Unit 2, ..., Unit n, with data stream DS n]
A Broad Classification of Computers
• Shared-memory multiprocessors
–Also called UMA
• Distributed memory computers
–Also called NUMA:
• Distributed Shared-memory (DSM)
• Clusters
• Grids, etc.
UMA vs. NUMA Computers
(a) UMA Model: latency = 100s of ns
(b) NUMA Model: latency = several milliseconds to seconds
Distributed Memory Computers
• Distributed memory computers use:
–Message Passing Model
• Explicit message send and receive instructions
have to be written by the programmer.
–Send: specifies local buffer + receiving process (id)
on remote computer (address).
–Receive: specifies sending process on remote
computer + local buffer to place data.
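The send/receive pairing described above can be sketched in Python. This is an illustration under stated assumptions, not a real network layer: threads stand in for the remote computers, a `queue.Queue` per process id stands in for the link, and the names `mailboxes`, `send`, and `receive` are hypothetical.

```python
import queue
import threading

# One mailbox per process id; a Queue stands in for the network link.
mailboxes = {0: queue.Queue(), 1: queue.Queue()}

def send(local_buffer, dest_id, my_id):
    """Send: names the local buffer and the receiving process."""
    mailboxes[dest_id].put((my_id, list(local_buffer)))

def receive(src_id, my_id):
    """Receive: names the sending process and returns the data for a
    local buffer. Blocks until a message arrives."""
    sender, data = mailboxes[my_id].get()
    assert sender == src_id  # sketch: one sender per mailbox
    return data

def worker():
    """Process 1: receive a list from process 0, reply with its sum."""
    data = receive(src_id=0, my_id=1)
    send([sum(data)], dest_id=0, my_id=1)

t = threading.Thread(target=worker)
t.start()
send([1, 2, 3], dest_id=1, my_id=0)
result = receive(src_id=1, my_id=0)
t.join()
```

Because `receive` blocks until data arrives, the two sides synchronize without any separate lock or flag.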
Advantages of Message-Passing
• Hardware for communication and
synchronization is much simpler:
–Compared to communication in a shared-memory
system.
• Explicit communication:
–Programs are simpler to understand, which helps to reduce
maintenance and development costs.
• Synchronization is implicit:
–Naturally associated with sending/receiving
messages.
–Easier to debug.
Disadvantages of Message-Passing
• Programmer has to write explicit message
passing constructs.
–Also, precisely identify the processes (or
threads) with which communication is to
occur.
• Explicit calls to the operating system:
–Higher overhead.
• Physically separate memories are accessed
as one logical address space.
• Processors running on a multi-computer
system share their memory.
–Implemented by operating system.
• DSM multiprocessors are NUMA:
–Access time depends on the exact location of
the data.
Distributed Shared-Memory
Architecture (DSM)
• Underlying mechanism is message passing:
– Shared memory convenience provided to the
programmer by the operating system.
– Basically, an operating system facility takes care
of message passing implicitly.
• Advantage of DSM:
– Ease of programming
Disadvantage of DSM
• High communication cost:
–A program not specifically optimized for
DSM by the programmer can perform
extremely poorly.
–Data (variables) accessed by specific
program segments have to be co-located.
–Useful only for process-level (coarse-grained)
parallelism.
Symmetric Multiprocessors (SMPs)
• SMPs are a popular shared memory
multiprocessor architecture:
–Processors share Memory and I/O
–Bus based: access time for all memory locations is
equal --- “Symmetric MP”
[Figure: four processors, each with its own cache, sharing main memory and the I/O system]
SMPs: Some Insights
• In any multiprocessor, main memory access is
a bottleneck:
–Multilevel caches reduce the memory demand of a
processor.
–Multilevel caches in fact make it possible for more
than one processor to meaningfully share the
memory bus.
–Hence multilevel caches are a must in a
multiprocessor.
Pros of SMPs
• Ease of programming:
–Especially when communication
patterns are complex or vary
dynamically during execution.
Cons of SMPs
• As the number of processors increases,
contention for the bus increases.
– Scalability of the SMP model is restricted.
– One way out may be to use switches (crossbar,
multistage networks, etc.) instead of a bus.
– Switches set up parallel point-to-point
connections.
– Switches have their own disadvantage: they make
implementing cache coherence difficult.
An Important Problem with
Shared-Memory: Coherence
• When shared data are cached:
–These are replicated in multiple caches.
–The data in the caches of different
processors may become inconsistent.
• How to enforce cache coherency?
– How does a processor know changes in the
caches of other processors?
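The Cache Coherency Problem
[Figure: processors P1, P2, and P3 with private caches. Copies of U hold the value 5; P3's copy then becomes U:7, while P1 and P2 are shown as U:?]
What value will P1 and P2 read?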
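The question can be made concrete with a small Python sketch. It assumes write-back caches with no coherence protocol at all, and it assumes an event order (P1 reads, P3 reads, P3 writes 7, then P1 and P2 read) because the numbering on the original figure did not survive. The `Cache` class is hypothetical.

```python
memory = {"U": 5}

class Cache:
    """Private write-back cache with no coherence protocol (deliberately)."""
    def __init__(self):
        self.lines = {}

    def read(self, addr):
        if addr not in self.lines:          # miss: fetch from memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value            # write-back: memory not updated yet

p1, p2, p3 = Cache(), Cache(), Cache()
p1.read("U")            # P1 caches U = 5
p3.read("U")            # P3 caches U = 5
p3.write("U", 7)        # P3's copy becomes 7; memory still holds 5
p1_sees = p1.read("U")  # hit on P1's stale copy
p2_sees = p2.read("U")  # miss, served by stale memory
p3_sees = p3.read("U")  # P3 sees its own write
```

Under these assumptions P1 and P2 both read the old value while P3 reads the new one, which is exactly the inconsistency a coherence protocol has to prevent.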
Cache Coherence Solutions
• The key to maintaining cache coherence:
– Track the state of sharing of every data
block.
• Based on this idea, the following can be an
overall solution:
–Dynamically recognize any potential
inconsistency at run-time and carry out
preventive action.
Pros and Cons of the Solution
• Pro:
–Consistency maintenance becomes
transparent to programmers,
compilers, as well as to the operating
system.
• Con:
–Increased hardware complexity.
Two Important Cache Coherency Protocols
• Snooping protocol:
– Each cache “snoops” the bus to find out which
data is being used by whom.
• Directory-based protocol:
– Keep track of the sharing state of each data
block using a directory.
– A directory is a centralized register for all
memory blocks.
– Allows coherency protocol to avoid broadcasts.
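How a directory avoids broadcasts can be sketched in a few lines of Python. The `Directory` class below is a hypothetical simplification: it records only the set of sharers per block and ignores block states, memory updates, and message timing.

```python
class Directory:
    """Hypothetical centralized directory: block -> set of sharer ids."""
    def __init__(self):
        self.sharers = {}

    def read(self, block, proc):
        """Record that `proc` now holds a copy of `block`."""
        self.sharers.setdefault(block, set()).add(proc)

    def write(self, block, proc):
        """Return the processors that must be sent invalidations."""
        targets = self.sharers.get(block, set()) - {proc}
        self.sharers[block] = {proc}        # writer is now the sole holder
        return sorted(targets)

d = Directory()
d.read("U", 1)
d.read("U", 3)
to_invalidate = d.write("U", 3)   # only P1 is contacted, not every cache
```

The write consults the directory and contacts only the recorded sharers, so no bus-wide broadcast is needed.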
Snooping vs. Directory-based Protocols
• Snooping protocol reduces memory traffic.
– More efficient.
• Snooping protocol requires broadcasts:
– Can meaningfully be implemented only when
there is a shared bus.
– Even when there is a shared bus, scalability is a
problem.
– Some workarounds have been tried: Sun
Enterprise server has up to 4 buses.
Snooping Protocol
• As soon as a request for any data block by a
processor is put out on the bus:
–Other processors “snoop” to check if they have a
copy and respond accordingly.
• Works well with bus interconnection:
–All transmissions on a bus are essentially broadcast:
• Snooping is therefore effortless.
–Dominates almost all small scale machines.
Categories of Snoopy Protocols
• Essentially two types:
–Write Invalidate Protocol
–Write Broadcast Protocol
• Write invalidate protocol:
–When one processor writes to its cache, all other
processors having a copy of that data block
invalidate that block.
• Write broadcast:
–When one processor writes to its cache, all other
processors having a copy of that data block
update that block with the recent written value.
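The difference between the two policies can be seen in a toy Python model. `SnoopyBus` is a hypothetical class in which every cache observes every write; it is a sketch of the two reactions described above, not of any particular machine's protocol.

```python
class SnoopyBus:
    """Toy bus: every cache sees every write (hypothetical model)."""
    def __init__(self, n_caches, policy):
        self.caches = [{} for _ in range(n_caches)]
        self.policy = policy            # "invalidate" or "update"

    def load(self, cid, addr, value):
        """Place a copy of `addr` in cache `cid`."""
        self.caches[cid][addr] = value

    def write(self, cid, addr, value):
        """Cache `cid` writes; all other caches snoop and react."""
        self.caches[cid][addr] = value
        for other, cache in enumerate(self.caches):
            if other == cid or addr not in cache:
                continue
            if self.policy == "invalidate":
                del cache[addr]         # other copies are dropped
            else:
                cache[addr] = value     # other copies get the new value

inv = SnoopyBus(3, "invalidate")
upd = SnoopyBus(3, "update")
for bus in (inv, upd):
    bus.load(0, "U", 5)
    bus.load(2, "U", 5)
    bus.write(2, "U", 7)
```

After the write, cache 0 is empty under write invalidate and holds the new value under write broadcast; in neither case can it return the stale 5.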
Write Invalidate Vs. Write Update
[Figure: four processors, each with its own cache, sharing main memory and the I/O system]
Write Invalidate Protocol
• Handling a write to shared data:
–An invalidate command is sent on the bus --- all
caches snoop and invalidate any copies they
hold.
• Handling a read miss:
–Write-through: memory is always up-to-date.
–Write-back: snooping finds the most recent copy.
Write Invalidate in Write Through
• Simple implementation.
• Writes:
– Write to shared data: broadcast on bus, processors
snoop, and invalidate any copies.
• Reads:
– Read miss: memory is always up-to-date.
• Concurrent writes:
– Write serialization automatically achieved since bus
serializes requests.
– Bus provides the basic arbitration support.
Write Invalidate versus
Broadcast cont…
• Invalidate exploits spatial locality:
–Only one bus transaction for any number of
writes to the same block.
–Therefore uses less bus bandwidth.
• Broadcast has lower latency between a write and a
later read by another processor:
–The reader's copy is already up to date, so it does not miss.
–As compared to invalidate.
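The bandwidth argument can be made concrete with a small count. The Python sketch below assumes a simplified model: one processor writes repeatedly to a block that other caches initially share, and no other processor touches the block in between. The function name `bus_transactions` is hypothetical.

```python
def bus_transactions(n_writes, policy):
    """Bus transactions for `n_writes` consecutive writes by one processor
    to a block that other caches initially share (simplified model)."""
    if policy == "invalidate":
        # The first write invalidates the other copies; later writes find
        # the block exclusive and stay off the bus.
        return min(n_writes, 1)
    if policy == "update":
        # Every write is broadcast so the other copies stay current.
        return n_writes
    raise ValueError(f"unknown policy: {policy}")

inv_cost = bus_transactions(8, "invalidate")
upd_cost = bus_transactions(8, "update")
```

With eight writes the model gives one transaction for invalidate and eight for broadcast; the price of invalidate is that the next reader on another processor takes a miss.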

  • 1. COMPUTER ORGANIZATION & ARCHITECTURE Text Book: Computer Architecture: A Quantitative Approach by Hennessy and Patterson. Prof. Prasanta Kumar Dash, GITA, BHUBANESWAR
  • 2. Pipelining: Basic and Intermediate Concepts
  • 3. RISC Instruction Set Basics (from Hennessy and Patterson) • Properties of RISC architectures: – All ops on data apply to data in registers and typically change the entire register (32 bits or 64 bits). – The only ops that affect memory are load/store operations: memory to register, and register to memory. – Load and store ops on data less than the full size of a register (32, 16, 8 bits) are often available. – Usually instructions are few in number (this can be relative) and are typically one size.
  • 4. RISC Instruction Set Basics Types Of Instructions • ALU Instructions: • Arithmetic operations either take two registers as operands or take one register and a sign-extended immediate value as an operand. The result is stored in a third register. • Logical operations AND, OR, XOR do not usually differentiate between 32-bit and 64-bit. • Load/Store Instructions: • Usually take a register (base register) as an operand and a 16-bit immediate value. The sum of the two will create the effective address. A second register acts as the destination in the case of a load operation.
  • 5. RISC Instruction Set Basics Types Of Instructions (continued) • In the case of a store operation the second register contains the data to be stored. • Branches and Jumps • Conditional branches are transfers of control. As described before, a branch causes an immediate value to be added to the current program counter.
  • 6. RISC Instruction Set Implementation • We first need to look at how instructions in the MIPS64 instruction set are implemented without pipelining. Assume that any instruction (MIPS) can be executed in at most 5 clock cycles. • The five clock cycles will be broken up into the following steps: • Instruction Fetch Cycle • Instruction Decode/Register Fetch Cycle • Execution Cycle • Memory Access Cycle • Write-Back Cycle
  • 8. Instruction Fetch (IF) Cycle • Send the program counter (PC) to memory and fetch the current instruction from memory. • Update the PC to the next sequential PC by adding 4 (since each instruction is 4 bytes) to the PC.
  • 9. Instruction Decode (ID)/Register Fetch Cycle • Decode the instruction and at the same time read in the values of the registers involved. As the registers are being read, do the equality test in case the instruction decodes as a branch or jump. • The offset field of the instruction is sign-extended in case it is needed. The possible branch effective address is computed by adding the sign-extended offset to the incremented PC. The branch can be completed at this stage if the equality test is true and the instruction decoded as a branch.
  • 10. Instruction Decode (ID)/Register Fetch Cycle (continued) • Instruction can be decoded in parallel with reading the registers because the register addresses are at fixed locations.
  • 11. Execution (EX)/Effective Address Cycle • If a branch or jump did not occur in the previous cycle, the arithmetic logic unit (ALU) can execute the instruction. • At this point the instruction falls into three different types: • Memory Reference: ALU adds the base register and the offset to form the effective address. • Register-Register: ALU performs the arithmetic, logical, etc… operation as per the opcode. • Register-Immediate: ALU performs operation based on the register and the immediate value (sign extended).
  • 12. Memory Access (MEM) Cycle • If a load, the effective address computed from the previous cycle is referenced and the memory is read. The actual data transfer to the register does not occur until the next cycle. • If a store, the data from the register is written to the effective address in memory.
  • 13. Write-Back (WB) Cycle • Occurs with Register-Register ALU instructions or load instructions. • A simple operation: whether the instruction is a register-register operation or a memory load, the resulting data is written to the appropriate register in the register file.
  • 15. What Is A Pipeline? • Pipelining is used by virtually all modern microprocessors to enhance performance by overlapping the execution of instructions. • A common analogue for a pipeline is a factory assembly line. Assume that there are three stages: • Welding • Painting • Polishing • For simplicity, assume that each task takes one hour.
  • 16. What Is A Pipeline? • If a single person were to work on the product it would take three hours to produce one product. • If we had three people, one person could work on each stage, upon completing their stage they could pass their product on to the next person (since each stage takes one hour there will be no waiting). • We could then produce one product per hour assuming the assembly line has been filled.
  • 17. What Is A Pipeline? Pipelining: is an implementation technique whereby multiple instructions are overlapped in execution. • It takes advantage of parallelism that exists among the actions needed to execute an instruction. • Pipelining is the key implementation technique used to make fast CPUs.
  • 18. Characteristics Of Pipelining • If the stages are perfectly balanced, then the time per instruction on the pipelined processor (assuming ideal conditions) is equal to $\frac{\text{time per instruction on unpipelined machine}}{\text{number of pipe stages}}$ • Under these conditions, the speedup from pipelining equals the number of pipe stages.
  • 19. Contd… • Usually, however, the stages will not be perfectly balanced; furthermore, pipelining does involve some overhead. • The previous expression is ideal. We will see later that there are many ways in which a pipeline cannot function in a perfectly balanced fashion.
  • 20. Characteristics Of Pipelining • In terms of a CPU, the implementation of pipelining has the effect of reducing the average instruction time, therefore reducing the average CPI. • EX: If each instruction in a microprocessor takes 5 clock cycles (unpipelined) and we have a 4 stage pipeline, the ideal average CPI with the pipeline will be 1.25 .
  • 22. Pipelined Execution [Figure: six overlapped instructions, each passing through the stages IFetch, Dcd, Exec, Mem, WB; axes: Time and Program Flow]
  • 23. Precedence relation: for a set of subtasks {T1, T2, ..., Tn} of a given task T, some subtask Tj cannot start until an earlier subtask Ti (i < j) finishes. A pipeline consists of a cascade of processing stages. Stages are combinational circuits operating on the data stream flowing through the pipe. Stages are separated by high-speed interface latches (holding intermediate results between stages). Control must be under a common clock.
  • 24. Pipeline Cycle Pipeline cycle: Determined by the time required by the slowest stage. Pipeline designers try to balance the length (i.e. the processing time) of each pipeline stage. For a perfectly balanced pipeline, the execution time per instruction is t/n, where t is the execution time per instruction on nonpipelined machine and n is the number of pipe stages.
  • 25. Pipeline Cycle However, it is very difficult to make the different pipeline stages perfectly balanced. Besides, pipelining itself involves some overhead.
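  • 26. Synchronous Pipeline [Figure: stages S1, S2, ..., Sk separated by latches L between Input and Output, driven by a common clock; stage delay $\tau_m$ and latch delay $d$ are marked] - Transfers between stages are simultaneous. - One task or operation enters the pipeline per cycle.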
  • 27. Asynchronous Pipeline [Figure: stages S1, S2, ..., Sk between Input and Output, with Ready/Ack signal pairs between adjacent stages] - Transfers performed when individual stages are ready. - Handshaking protocol between processors. - Different amounts of delay may be experienced at different stages. - Can display variable throughput rate.
  • 28. A Few Pipeline Concepts [Figure: adjacent stages $S_i$ and $S_{i+1}$ with stage delay $\tau_m$ and latch delay $d$] Pipeline cycle: $\tau$. Latch delay: $d$. $\tau = \max\{\tau_m\} + d$. Pipeline frequency: $f = 1/\tau$.
  • 29. Example on Clock Period: Suppose the time delays of the 4 stages are $\tau_1 = 60$ ns, $\tau_2 = 50$ ns, $\tau_3 = 90$ ns, $\tau_4 = 80$ ns, and the interface latch has a delay of $d = 10$ ns. From $\tau = \max\{\tau_m\} + d$, the cycle time of this pipeline is $\tau = 90 + 10 = 100$ ns. Clock frequency of the pipeline: $f = 1/(100\ \text{ns}) = 10$ MHz. If it is non-pipelined, the time is $60 + 50 + 90 + 80 = 280$ ns.
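The arithmetic of this example (cycle time = slowest stage delay plus latch delay) can be checked with a few lines of Python. The variable names are illustrative; the numbers are the ones given on the slide.

```python
stage_delays_ns = [60, 50, 90, 80]   # delays of the 4 stages, in ns
latch_delay_ns = 10                  # interface latch delay, in ns

# Cycle time: the slowest stage sets the pace, plus the latch delay.
cycle_ns = max(stage_delays_ns) + latch_delay_ns
# Frequency in MHz: 1 / (cycle in ns) is in GHz, so scale by 1000.
freq_mhz = 1e3 / cycle_ns
# Without pipelining, one task passes through all stages back to back.
non_pipelined_ns = sum(stage_delays_ns)
```

The 90 ns stage sets the clock for every stage, which is why designers try to balance stage delays.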
  • 30. Ideal Pipeline Speedup: A k-stage pipeline processes n tasks in $k + (n-1)$ clock cycles: k cycles for the first task and $n-1$ cycles for the remaining $n-1$ tasks. Total time to process n tasks: $T_k = [k + (n-1)]\,\tau$. For the non-pipelined processor: $T_1 = n k \tau$.
  • 31. Pipeline Speedup Expression: Speedup $S_k = \frac{T_1}{T_k} = \frac{n k \tau}{[k + (n-1)]\,\tau} = \frac{n k}{k + (n-1)}$. Maximum speedup: $S_k \to k$ for $n \gg k$. Observe that the memory bandwidth must increase by a factor of $S_k$; otherwise, the processor would stall waiting for data to arrive from memory.
  • 32. Efficiency of pipeline: The percentage of busy time-space span over the total time-space span. n: number of tasks or instructions; k: number of pipeline stages; $\tau$: clock period of the pipeline. Hence pipeline efficiency can be defined by $\eta = \frac{n k \tau}{k\,[k\tau + (n-1)\tau]} = \frac{n}{k + (n-1)}$.
  • 33. Throughput of pipeline: Number of tasks (results) that can be completed by a pipeline per unit time: $W = \frac{n}{k\tau + (n-1)\tau} = \frac{n}{[k + (n-1)]\,\tau} = \eta f$. Ideal case: $W = 1/\tau = f$ when $\eta = 1$. Maximum throughput = frequency of the linear pipeline.
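The total-time, speedup, efficiency, and throughput expressions from the last few slides can be evaluated together. The Python sketch below uses illustrative values (k = 4 stages, n = 100 tasks, a 100 ns clock period) that are assumptions and do not come from the slides; `pipeline_metrics` is a hypothetical helper name.

```python
def pipeline_metrics(k, n, tau):
    """k stages, n tasks, clock period tau (any time unit)."""
    t_pipe = (k + (n - 1)) * tau      # pipelined time: k cycles, then n-1 more
    t_serial = n * k * tau            # non-pipelined time: k cycles per task
    speedup = t_serial / t_pipe       # equals n*k / (k + n - 1)
    efficiency = speedup / k          # equals n / (k + n - 1)
    throughput = n / t_pipe           # tasks per time unit
    return t_pipe, speedup, efficiency, throughput

# Illustrative values only; tau is in ns, so throughput is tasks per ns.
t_pipe, s, eta, w = pipeline_metrics(k=4, n=100, tau=100)
```

As n grows with k fixed, the speedup approaches k and the efficiency approaches 1, which matches the n >> k limit stated on the speedup slide.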
  • 34. Pipelines: A Few Basic Concepts Historically, there are two different types of pipelines: instruction pipelines and arithmetic pipelines. Arithmetic pipelines (e.g. FP multiplication) are not popular in general purpose computers: they need a continuous stream of arithmetic operations, e.g. vector processors operating on an array. On the other hand, instruction pipelines are used in almost every modern processor.
  • 35. Pipelines: A Few Basic Concepts Pipeline increases instruction throughput: But, does not decrease the execution time of the individual instructions. In fact, slightly increases execution time of each instruction due to pipeline overheads. Pipeline overhead arises due to a combination of: Pipeline register delay Clock skew
  • 36. Pipelines: A Few Basic Concepts Pipeline register delay: caused by the setup time of the pipeline registers. Clock skew: the maximum delay between clock arrival at any two registers. Once the clock cycle is as small as the pipeline overhead, no further pipelining is useful, so very deep pipelines may not be useful.
  • 37. Pipeline Registers • Pipeline registers are an essential part of pipelines: there are 4 groups of pipeline registers in a 5-stage pipeline. • Each group saves output from one stage and passes it as input to the next stage: IF/ID, ID/EX, EX/MEM, MEM/WB. • This way, each time “something is computed” (effective address, immediate value, register content, etc.), it is saved safely in the context of the instruction that needs it.
  • 38. Looking At The Big Picture • Overall, the most time that a non-pipelined instruction can take is 5 clock cycles. Below is a summary: • Branch - 2 clock cycles • Store - 4 clock cycles • Other - 5 clock cycles • EX: Assuming branch instructions account for 12% of all instructions and stores account for 10%, what is the average CPI of a non-pipelined CPU? ANS: 0.12*2 + 0.10*4 + 0.78*5 = 4.54
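The average CPI in the example above is a weighted sum over the instruction mix. A short sketch of the same arithmetic follows; the dictionary name `mix` and the function name `average_cpi` are illustrative only.

```python
def average_cpi(mix):
    """mix maps an instruction class to (fraction of instructions, cycles per instruction)."""
    return sum(fraction * cycles for fraction, cycles in mix.values())


# Instruction mix taken from the slide: 12% branches, 10% stores, 78% other.
mix = {
    "branch": (0.12, 2),
    "store": (0.10, 4),
    "other": (0.78, 5),
}
print(round(average_cpi(mix), 2))  # 4.54
```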
  • 39. Assignment Find out total time to processes 100 tasks in a 2-stage pipeline with a cycle time 10ns. Repeat the above problem assuming latching in pipeline require 2ns. A pipeline has 4-stage with time delays 1 = 60ns, 2 = 50ns, 3 = 90ns, 4 = 80ns & the interface latch has a delay of ld = 10ns. What is the cycle time of this pipeline? What is the clock frequency of the above pipeline?
  • 40. Instruction-Level Parallelism • What is ILP (Instruction-Level Parallelism)? – Parallel execution of different instructions belonging to the same thread. • A thread usually consists of several basic blocks: – As well as several branches and loops. • Basic block: – A sequence of instructions not having a branch instruction.
  • 41. Cont… • Instruction pipelines can effectively exploit parallelism in a basic block: – An n-stage pipeline can improve performance up to n times. – Does not require much investment in hardware – Transparent to the programmers. • Pipelining can be viewed to: – Decrease average CPI, and/or – Decrease clock cycle time for instructions.
  • 42. Drags on Pipeline Performance • Factors that can degrade pipeline performance – Unbalanced stages – Pipeline overheads – Clock skew – Hazards • Hazards cause the worst drag on the performance of a pipeline.
  • 43. The Classical RISC: 5 Stage Pipeline • In an ideal case, to implement a pipeline we just need to start a new instruction at each clock cycle. • Unfortunately, there are many problems in trying to implement this. • If we look at each stage of instruction execution as being independent, we can see how instructions can be “overlapped”.
  • 44.
  • 45. Problems With The Previous Figure • The memory is accessed twice during each clock cycle. This problem is avoided by using separate data and instruction caches. • It is important to note that if the clock period is the same for a pipelined processor and a non-pipelined processor, the memory must work five times faster. • Another problem that we can observe is that the registers are accessed twice every clock cycle. To try to avoid a resource conflict we perform the register write in the first half of the cycle and the read in the second half of the cycle.
  • 46. Problems With The Previous Figure (continued) • Because we write in the first half, a value being written can be read in the same cycle by another instruction further down the pipeline. • A third problem arises with the interaction of the pipeline with the PC. We use an adder to increment PC by the end of IF. Within ID we may branch and modify PC. How does this affect the pipeline?
  • 47. Pipeline Hazards • The performance gain from using pipelining occurs because we can start the execution of a new instruction each clock cycle. In a real implementation this is not always possible. • What is a pipeline hazard? A situation that prevents an instruction from executing during its designated clock cycle. • Pipeline hazards prevent the execution of the next instruction during the appropriate clock cycle.
  • 48. Types Of Hazards Structural hazards arise from resource conflicts when the hardware cannot support all possible combinations of instructions simultaneously in overlapped execution. Data hazards arise when an instruction depends on the results of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline. Control hazards arise from the pipelining of branches and other instructions that change the PC.
  • 49. Structural Hazard: Example IF ID EXE MEM WB | IF ID EXE MEM WB | IF ID EXE MEM WB | IF ID EXE MEM WB
  • 50. An Example of a Structural Hazard [Figure: pipeline timing diagram (Mem, Reg, ALU, DM, Reg for each instruction) for a Load followed by Instructions 1 to 4, plotted against time.] Would there be a hazard here?
  • 51. Performance with Stalls • Stalls degrade performance of a pipeline: –Result in deviation from 1 instruction executing/clock cycle. –Let’s examine by how much stalls can impact CPI…
  • 52. A Hazard Will Cause A Pipeline Stall • Some performance expressions involve a realistic pipeline in terms of CPI. It is assumed that the clock period is the same for pipelined and unpipelined implementations. Speedup = CPI Unpipelined / CPI pipelined = Pipeline Depth / ( 1 + Stalls per Inst) = Avg Inst Time Unpipelined / Avg Inst Time Pipelined
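The speedup expression above can be evaluated directly. This is a small sketch under the slide's assumption that the clock period is the same for the pipelined and unpipelined implementations; the function name and the sample stall rates are illustrative.

```python
def pipeline_speedup(pipeline_depth, stalls_per_instruction):
    """Speedup = pipeline depth / (1 + stall cycles per instruction)."""
    return pipeline_depth / (1 + stalls_per_instruction)


print(pipeline_speedup(5, 0.0))   # ideal 5-stage pipeline, no stalls
print(pipeline_speedup(5, 0.25))  # one stall cycle every four instructions (assumed rate)
```

With no stalls the speedup equals the pipeline depth; every stall cycle per instruction adds directly to the denominator.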
  • 53. Dealing With Structural Hazards • Arise from resource conflicts among instructions executing concurrently: –Same resource is required by two (or more) concurrently executing instructions at the same time. • Easy way to avoid structural hazards: –Duplicate resources (sometimes not practical) –Memory interleaving ( lower & higher order )
  • 54. Contd… • Examples of Resolution of Structural Hazard: –An ALU to perform an arithmetic operation and an adder to increment PC. –Separate data cache and instruction cache accessed simultaneously in the same cycle.
  • 55.
  • 56. How is it Resolved? [Figure: the same pipeline timing diagram for Load, Instruction 1, Instruction 2, Stall, Instruction 3; the stalled slot is filled with a row of bubbles.] A pipeline can be stalled by inserting a “bubble” or NOP.
  • 57. Dealing With Structural Hazards • A structural hazard is dealt with by inserting a stall or pipeline bubble into the pipeline. • This means that for that clock cycle, nothing happens for that instruction. • This effectively “slides” that instruction, and subsequent instructions, by one clock cycle. • This effectively increases the average CPI.
  • 58. Dealing With Structural Hazards (continued) • We can see that even though the clock speed of the processor with the hazard is a little faster, the speedup is still less than 1. • Therefore the hazard has quite an effect on the performance. • Sometimes computer architects will opt to design a processor that exhibits a structural hazard. Why? • A: The improvement to the processor data path is too costly. • B: The hazard occurs rarely enough so that the processor will still perform to specifications.
  • 59. An Example of Performance Impact of Structural Hazard • Assume: – Pipelined processor. – Data references constitute 40% of an instruction mix. – Ideal CPI of the pipelined machine is 1. – Consider two cases: • Unified data and instruction cache vs. separate data and instruction cache. • What is the impact on performance?
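A rough sketch of this comparison follows. It assumes that, with a unified cache, every data reference stalls the pipeline for one cycle, and it treats the clock-rate advantage of the machine with the hazard as a hypothetical parameter (1.05 here); neither number is given on the slide.

```python
def avg_instruction_time(ideal_cpi, stall_freq, stall_cycles, cycle_time):
    """Average instruction time = (ideal CPI + stall cycles per instruction) * cycle time."""
    return (ideal_cpi + stall_freq * stall_cycles) * cycle_time


# Separate caches: no structural hazard, reference cycle time of 1.0.
separate = avg_instruction_time(1.0, 0.0, 1, 1.0)

# Unified cache: 40% of instructions are data references, each assumed to stall 1 cycle.
# The 1.05x faster clock is an assumed, illustrative value.
unified = avg_instruction_time(1.0, 0.4, 1, 1.0 / 1.05)

print(unified / separate)  # average instruction time relative to the hazard-free machine
print(separate / unified)  # speedup of the machine with the hazard (below 1)
```

Under these assumptions the unified-cache machine is about 1.33 times slower per instruction, even with its slightly faster clock.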
  • 60. Data Dependences and Hazards • Determining how one instruction depends on another is critical to determining how much parallelism exists in a program and how that parallelism can be exploited.
  • 61. Data Dependences There are three different types of dependences: • Data Dependences (also called true data dependences), Name Dependences and Control Dependences. • An instruction j is data dependent on instruction i if either of the following holds: – Instruction i produces a result that may be used by instruction j, or – Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.
  • 62. Consider the MIPS code sequence that increments a vector of values in memory (starting at 0(R1), and with the last element at 8(R2)) by a scalar in register F2:
      Loop: L.D    F0, 0(R1)     ;F0=array element
            ADD.D  F4, F0, F2    ;add scalar in F2
            S.D    F4, 0(R1)     ;store result
            DADDUI R1, R1, #-8   ;decrement pointer 8 bytes
            BNE    R1, R2, Loop  ;branch R1!=R2
  • 63.
  • 64. Data Dependences Contd… • A data value may flow between instructions either through registers or through memory locations. • When the data flow occurs in a register, detecting the dependence is straight forward since the register names are fixed in the instructions • Although it gets more complicated when branches intervene
  • 65. Contd… • Dependences that flow through memory locations are more difficult to detect, • since two addresses may refer to the same location but look different: • For example, 100(R4) and 20(R6) may be identical memory addresses. • The effective address of a load or store may change from one execution of the instruction to another (so that 20(R4) and 20(R4) may be different).
  • 66. Detecting Data Dependences • A data value may flow between instructions: – (i) through registers – (ii) through memory locations. • When data flow is through a register: – Detection is rather straight forward. • When data flow is through a memory location: – Detection is difficult. – Two addresses may refer to the same memory location but look different. 100(R4) and 20(R6)
  • 67. Name Dependences • A Name Dependence occurs when two instructions use the same register or memory location, called a name. • There are two types of name dependences between an instruction i that precedes instruction j in program order: • Antidependence, • Output Dependence
  • 68. Contd… • An Antidependence between instruction i and instruction j occurs when instruction j writes a register or memory location that instruction i reads. • The original ordering must be preserved to ensure that i reads the correct value. There is an antidependence between S.D and DADDUI on register R1 in the MIPS code sequence on the next slide.
  • 69. Consider the MIPS code sequence that increments a vector of values in memory (starting at 0(R1), and with the last element at 8(R2)) by a scalar in register F2:
      Loop: L.D    F0, 0(R1)     ;F0=array element
            ADD.D  F4, F0, F2    ;add scalar in F2
            S.D    F4, 0(R1)     ;store result
            DADDUI R1, R1, #-8   ;decrement pointer 8 bytes
            BNE    R1, R2, Loop  ;branch R1!=R2
  • 70. Contd… • An Output Dependence occurs when instruction i and instruction j write the same register or memory location. • The ordering between the instructions must be preserved to ensure that the value finally written corresponds to instruction j.
  • 71. Data Hazards • Occur when an instruction under execution depends on: – Data from an instruction ahead in the pipeline. • Example: A = B + C; D = A + E; – The dependent instruction uses old data: • Results in wrong computations. [Figure: A = B + C passes through IF ID EX MEM WB; D = A + E follows one cycle behind through IF ID EX MEM WB.]
  • 72. Types of Data Hazards • Data hazards are of three types: – Read After Write (RAW) – Write After Read (WAR) – Write After Write (WAW) • With an in-order execution machine: – WAW, WAR hazards can not occur. • Assume instruction i is issued before j.
  • 73. Read after Write (RAW) Hazards • A hazard between two instructions i & j may occur when j attempts to read some data object that has been modified by i. – Instruction j tries to read its operand before instruction i writes it. – j would incorrectly receive an old or incorrect value. • Example: i: ADD R1, R2, R3 (a write instruction issued before j); j: SUB R4, R1, R6 (a read instruction issued after i).
  • 74. Read after Write (RAW) Hazards [Figure: instruction I writes its range R(I); instruction J reads its domain D(J).] R(I) ∩ D(J) ≠ Ø for RAW, where D(·) is the set of locations an instruction reads and R(·) the set it writes.
  • 75. RAW Dependency: More Examples • Example program (a): –i1: load r1, addr; –i2: add r2, r1,r1; • Program (b): –i1: mul r1, r4, r5; –i2: add r2, r1, r1; • Both cases, i2 does not get operand until i1 has completed writing the result –In (a) this is due to load-use dependency –In (b) this is due to define-use dependency
  • 76. Write after Read (WAR) Hazards – Instruction j tries to write its destination before instruction i reads it. – i would incorrectly receive a new or incorrect value. • WAR hazards do not usually occur because of the amount of time between the read cycle and write cycle in a pipeline. • Example: i: ADD R1, R2, R3 (a read instruction issued before j); j: SUB R2, R4, R6 (a write instruction issued after i). WAR hazards occur due to antidependence.
  • 77. Write after Read (WAR) Hazards [Figure: instruction I reads its domain D(I); instruction J writes its range R(J).] D(I) ∩ R(J) ≠ Ø for WAR
  • 78. Write After Write (WAW) Hazards • WAW hazard: – Both i & j want to modify the same data object. – Instruction j tries to write an operand before instruction i writes it. – Writes are performed in the wrong order. • Example: i: DIV F1, F2, F3 (a write instruction issued before j); j: SUB F1, F4, F6 (a write instruction issued after i). (How can this happen???) WAW hazards occur due to output dependence.
  • 79. Write After Write (WAW) Hazards [Figure: instruction I writes its range R(I); instruction J writes its range R(J).] R(I) ∩ R(J) ≠ Ø for WAW
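The three set conditions can be turned into a small classifier. The sketch below takes the registers each instruction reads (its domain D) and writes (its range R) and reports which hazard conditions hold between an earlier instruction i and a later instruction j. The function name `classify_hazards` is hypothetical; the test cases are the examples from the preceding slides.

```python
def classify_hazards(reads_i, writes_i, reads_j, writes_j):
    """Return the set of potential hazards between earlier instruction i and later instruction j."""
    hazards = set()
    if writes_i & reads_j:    # R(I) ∩ D(J) ≠ Ø
        hazards.add("RAW")
    if reads_i & writes_j:    # D(I) ∩ R(J) ≠ Ø
        hazards.add("WAR")
    if writes_i & writes_j:   # R(I) ∩ R(J) ≠ Ø
        hazards.add("WAW")
    return hazards


# i: ADD R1, R2, R3   j: SUB R4, R1, R6
print(classify_hazards({"R2", "R3"}, {"R1"}, {"R1", "R6"}, {"R4"}))  # {'RAW'}
# i: ADD R1, R2, R3   j: SUB R2, R4, R6
print(classify_hazards({"R2", "R3"}, {"R1"}, {"R4", "R6"}, {"R2"}))  # {'WAR'}
# i: DIV F1, F2, F3   j: SUB F1, F4, F6
print(classify_hazards({"F2", "F3"}, {"F1"}, {"F4", "F6"}, {"F1"}))  # {'WAW'}
```

These conditions only say that a dependence exists; whether it actually causes a stall depends on the pipeline organization.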
  • 80. Inter-Instruction Dependences • Data dependence, Read-after-Write (RAW): r3 ← r1 op r2; r5 ← r3 op r4 • Anti-dependence, Write-after-Read (WAR), a false dependency: r3 ← r1 op r2; r1 ← r4 op r5 • Output dependence, Write-after-Write (WAW), a false dependency: r3 ← r1 op r2; r5 ← r3 op r4; r3 ← r6 op r7 • Control dependence
  • 81. Data Dependencies: Summary Data dependencies in straight-line code: • RAW, Read After Write dependency (flow dependency): load-use dependency and define-use dependency. A true dependency; cannot be overcome. • WAR, Write After Read dependency (anti dependency), and WAW, Write After Write dependency (output dependency): false dependencies; can be eliminated by register renaming.
  • 82. Recollect Data Hazards What causes them? – Pipelining changes the order of read/write accesses to operands. – Order differs from that of an unpipelined machine. • Example: – ADD R1, R2, R3 – SUB R4, R1, R5 For MIPS, ADD writes the register in WB but SUB needs it in ID. This is a data hazard
  • 83. Illustration of a Data Hazard [Figure: pipeline timing diagram against time for ADD R1, R2, R3; SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; XOR R10, R1, R11.] The ADD instruction causes a hazard in the next 3 instructions because the register is not written until after those 3 read it.
  • 84. Solutions to Data Hazard • Operand forwarding • Pipeline interlock • By S/W (NOP) • Reordering the instruction
  • 85. Forwarding • Simplest solution to data hazard: – forwarding • Result of the ADD instruction not really needed: – until after ADD actually produces it. • Can we move the result from EX/MEM register to the beginning of ALU (where SUB needs it)? – Yes!
  • 86. Forwarding cont… • Generally speaking: –Forwarding occurs when a result is passed directly to the functional unit that requires it. –Result goes from output of one pipeline stage to input of another.
  • 88. When Can We Forward? [Figure: pipeline timing diagram against time for ADD R1, R2, R3; SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; XOR R10, R1, R11.] SUB gets info. from the EX/MEM pipe register. AND gets info. from the MEM/WB pipe register. OR gets info. by forwarding from the register file. If the line goes “forward” you can do forwarding. If it’s drawn backward, it’s physically impossible.
  • 89. General Data Forwarding • It is easy to see how data forwarding can be used by drawing out the pipelined execution of each instruction. • Now consider the following instructions: DADD R1, R2, R3; LD R4, 0(R1); SD R4, 12(R1)
  • 90.
  • 91. Problems • Can data forwarding prevent all data hazards? • NO! • The following operations will still cause a data hazard. This happens because the further down the pipeline we get, the less we can use forwarding. LD R1, 0(R2); DSUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9
  • 92.
  • 93. Problems • We can avoid the hazard by using a Pipeline interlock. • The pipeline interlock will detect when data forwarding will not be able to get the data to the next instruction in time. • A stall is introduced until the instruction can get the appropriate data from the previous instruction.
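The interlock check described above can be sketched in a few lines. This is a simplified model, assuming a 5-stage pipeline with full forwarding, where the only case forwarding cannot cover is an instruction that uses the result of the load immediately before it; one bubble is then enough. The tuple format `(opcode, destination, sources)` and the function name are illustrative.

```python
def insert_interlock_stalls(program):
    """Insert a STALL after a load whose result is used by the very next instruction.

    Each instruction is a tuple: (opcode, destination register, tuple of source registers).
    """
    scheduled = []
    previous = None
    for instr in program:
        opcode, dest, sources = instr
        # Load-use hazard: the loaded value is not available until after MEM,
        # too late to forward to the EX stage of the next instruction.
        if previous is not None and previous[0] == "LD" and previous[1] in sources:
            scheduled.append(("STALL", None, ()))
        scheduled.append(instr)
        previous = instr
    return scheduled


program = [
    ("LD", "R1", ("R2",)),
    ("DSUB", "R4", ("R1", "R5")),
    ("AND", "R6", ("R1", "R7")),
    ("OR", "R8", ("R1", "R9")),
]
for instr in insert_interlock_stalls(program):
    print(instr[0])
```

Only DSUB needs the stall: by the time AND and OR reach EX, the loaded value can be forwarded or read from the register file.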
  • 94.
  • 95. Handling data hazard by S/W • The compiler introduces NOPs between two dependent instructions. • NOP = an instruction that does nothing and simply keeps a gap between two instructions. • Detection of the dependency is left entirely to the S/W. • Advantage: it leads to a simple technique called instruction reordering.
  • 96. Instruction Reordering Before: ADD R1, R2, R3; SUB R4, R1, R5; XOR R8, R6, R7; AND R9, R10, R11. After: ADD R1, R2, R3; XOR R8, R6, R7; AND R9, R10, R11; SUB R4, R1, R5.
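The effect of the reordering can be measured as the distance, in instructions, between a register's producer and its first consumer. The sketch below computes the smallest such distance in a straight-line sequence; the tuple format and the function name `min_raw_distance` are assumptions for illustration.

```python
def min_raw_distance(program):
    """Smallest distance (in instructions) between a register write and a later read of it.

    Each instruction is a tuple: (opcode, destination register, tuple of source registers).
    Returns None if no instruction reads a register written earlier in the sequence.
    """
    last_write = {}
    smallest = None
    for index, (opcode, dest, sources) in enumerate(program):
        for reg in sources:
            if reg in last_write:
                distance = index - last_write[reg]
                smallest = distance if smallest is None else min(smallest, distance)
        last_write[dest] = index
    return smallest


before = [
    ("ADD", "R1", ("R2", "R3")),
    ("SUB", "R4", ("R1", "R5")),
    ("XOR", "R8", ("R6", "R7")),
    ("AND", "R9", ("R10", "R11")),
]
after = [before[0], before[2], before[3], before[1]]

print(min_raw_distance(before))  # 1: SUB immediately follows ADD
print(min_raw_distance(after))   # 3: two independent instructions fill the gap
```

The independent XOR and AND take the place of the NOPs the compiler would otherwise have to insert between ADD and SUB.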
  • 97. Instruction Execution: MIPS Data path • Can break down the process of “running” an instruction into stages. • These stages are what needs to be done to complete the execution of each instruction. Some instructions will not require some stages.
  • 98. MIPS Data path The DLX (MIPS) datapath allows every instruction to be executed in 4 or 5 cycles
  • 99. Instruction Execution Contd… 1. Instruction Fetch (IF) - Get the instruction to be executed. IR ← M[PC]; NPC ← PC + 4. IR: instruction register; NPC: next program counter.
  • 100. Instruction Execution Contd… 2. Instruction Decode/Register Fetch (ID) - Figure out what the instruction is supposed to do and what it needs. A ← Register File[Rs]; B ← Register File[Rt]; Imm ← {(IR16)^16, IR15..0} (the 16-bit immediate, sign-extended). A, B and Imm are temporary registers that hold inputs to the ALU, which is in the Execute stage.
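The immediate computed in ID is the low 16 bits of the instruction with the sign bit replicated into the upper bits. A minimal sketch of that sign extension follows; the function name `sign_extend16` and the sample instruction words are illustrative, not real MIPS encodings.

```python
def sign_extend16(instruction_word):
    """Return the low 16 bits of the instruction as a signed integer."""
    imm = instruction_word & 0xFFFF      # IR15..0
    if imm & 0x8000:                     # sign bit set: the value is negative
        return imm - 0x10000
    return imm


print(sign_extend16(0x1234FFF8))  # -8, e.g. the #-8 used to decrement a pointer
print(sign_extend16(0x12340010))  # 16
```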
  • 101. Instruction Execution Contd… 3. Execution (EX) - The instruction has been decoded, so execution can be split according to instruction type. Reg-Reg ALU instr: ALUout ← A op B. Reg-Imm: ALUout ← A op Imm. Branch: ALUout ← NPC + Imm; Cond ← (A {==, !=} 0). LD/ST: ALUout ← A op Imm, to form the effective address.
  • 102. Instruction Execution Contd… 4. Memory Access/Branch Completion (MEM) - Besides the IF stage, this is the only stage that accesses memory, to load and store data. Load: LMD ← Mem[ALUout]. Store: Mem[ALUout] ← B. Branch: if (cond) PC ← ALUout. Jump: PC ← ALUout. Else: PC ← NPC. LMD: load memory data register.
  • 103. Instruction Execution Contd… 5. Write-Back (WB) - Store all the results and loads back to registers. Reg-Reg ALU instr: Rd ← ALUoutput. Load: Rd ← LMD. Reg-Imm: Rt ← ALUoutput.
  • 104. Control Hazards • Result from branch and other instructions that change the flow of a program (i.e. change PC). • Example: • Statement in line 2 is control dependent on statement at line 1. • Until condition evaluation completes: – It is not known whether s1 or s2 will execute next. 1: If(cond){ 2: s1} 3: s2
  • 105. • Control hazards are caused by branches in the code. • During the IF stage, remember that the PC is incremented by 4 in preparation for the IF cycle of the next instruction. • What happens if a branch is performed and we aren’t simply incrementing the PC by 4? • The easiest way to deal with the occurrence of a branch is to perform the IF stage again once the branch occurs.
  • 106. Four Simple Control/Branch Hazard Solutions • The following solutions assume that we are dealing with static branches (compile time), meaning that the actions taken during a branch do not change. #1. Flush Pipeline/Stall #2. Predict Branch Not Taken #3. Predict Branch Taken #4. Delayed Branch
  • 107. Branch Hazard Solutions #1. Flush Pipeline/Stall • Stall until the branch direction is clear, flushing the pipe once an instruction is detected to be a branch during the ID stage. • For example, we stall the pipeline until the branch is resolved (in that case we repeat the IF stage until the branch is resolved and modifies the PC).
  • 108. Performing IF Twice • We take a big performance hit by performing the instruction fetch twice whenever a branch occurs. Note that this happens whether the branch is taken or not. • This guarantees that the PC will get the correct value. IF ID EX MEM WB IF ID EX MEM WB IF IF ID EX MEM WB branch
  • 109. Control Hazards solutions #2. Predict Branch Not Taken: • What if we treat every branch as “not taken”? Remember that during ID we not only read the registers but also perform an equality test to decide whether to branch. • We can improve performance by assuming that the branch will not be taken. –Execute successor instructions in sequence as if there is no branch –Undo instructions in the pipeline if the branch is actually taken
  • 110. Control Hazards solutions: Predict Branch Not Taken: Contd.. • The “branch-not taken” scheme is the same as performing the IF stage a second time in our 5 stage pipeline if the branch is taken. • If not there is no performance degradation. • 47% branches not taken on average
  • 111. Control Hazards solutions: Predict Branch Not Taken: Contd..
  • 112. Control Hazards solutions #3 Predict Branch Taken – The “branch taken” scheme is of no benefit in our case because we evaluate the branch target address in the ID stage. – 53% of branches are taken on average. – But the branch target address is not available after IF in MIPS • MIPS still incurs a 1-cycle branch penalty even with predict taken • For loops, or in some other machines where the branch target is known before the branch outcome is computed, significant benefits can be accrued.
  • 113. Control Hazards solutions #4: Delayed Branch • The fourth method for dealing with a control hazard is to implement a “delayed” branch scheme. • In this scheme an instruction is inserted into the pipeline that is useful and not dependent on whether the branch is taken or not. It is the job of the compiler to determine the delayed branch instruction. • If the branch is actually taken, we need to clear the pipeline of any code loaded in from the “not-taken” path.
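The cost of the first three schemes can be compared with a simple CPI model: each branch that is handled wrongly, or that must wait, adds the branch penalty. The sketch below uses the 1-cycle penalty and the 53% taken rate quoted on the slides; the 12% branch frequency is an assumed, illustrative value, and the function name is hypothetical.

```python
def effective_cpi(branch_freq, penalized_fraction, penalty_cycles, base_cpi=1.0):
    """CPI = base CPI + (branches per instruction) * (fraction paying the penalty) * penalty."""
    return base_cpi + branch_freq * penalized_fraction * penalty_cycles


BRANCH_FREQ = 0.12  # assumed for illustration
TAKEN = 0.53        # from the slides: 53% of branches are taken on average
PENALTY = 1         # from the slides: 1-cycle branch penalty in the 5-stage MIPS pipeline

# Stall on every branch: all branches pay the penalty.
print(effective_cpi(BRANCH_FREQ, 1.0, PENALTY))
# Predict not taken: only taken branches pay the penalty.
print(effective_cpi(BRANCH_FREQ, TAKEN, PENALTY))
```

Per the slide, predict-taken gains nothing over stalling in this pipeline, because the target address is not known until ID.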
  • 114. Control Hazards solutions Delayed Branch Contd… • Likewise we can assume that the branch is always taken. Does this work in our “5-stage” pipeline?  No, the branch target is computed during the ID cycle. • Some processors will have the target address computed in time for the IF stage of the next instruction so there is no delay.
  • 115. Control Hazards solutions cont… #4: Delayed Branch –Insert unrelated successors in the branch delay slots: branch instruction; sequential successor 1; sequential successor 2; ...; sequential successor n; branch target (if taken). The n sequential successors form a branch delay of length n. –A 1-slot delay is required in the 5-stage pipeline.
  • 116. The behavior of a delayed branch
  • 117. Delayed Branch • Simple idea: Put an instruction that would be executed anyway right after a branch. • Question: What instruction do we put in the delay slot? • Answer: one that can safely be executed no matter what the branch does. – The compiler decides this. [Figure: pipeline diagram showing the branch, the delay slot instruction, and the branch target or successor, each passing through IF ID EX MEM WB.]
  • 118. Delayed Branch • One possibility: An instruction from before the branch • Example: Before: DADD R1, R2, R3; if R2 == 0 then; (delay slot). After: if R2 == 0 then; DADD R1, R2, R3 (in the delay slot). • The DADD instruction is executed no matter what happens in the branch: – Because it is executed before the branch! – Therefore, it can be moved.
  • 119. Delayed Branch [Figure: pipeline diagram showing the branch, the add instruction, and the branch target/successor, each passing through IF ID EX MEM WB. By the time the target/successor is fetched, we know whether to take the branch or not.] • We get to execute the “DADD” instruction “for free”.
  • 120. Delayed Branch • Another possibility: An instruction from the branch target • Example: • The DSUB instruction can be replicated into the delay slot, and the branch target can be changed. DSUB R4, R5, R6 ... DADD R1, R2, R3; if R1 == 0 then; (delay slot)
  • 121. Delayed Branch • Example: • The DSUB instruction can be replicated into the delay slot, and the branch target can be changed DSUB R4, R5, R6 ... DADD R1, R2, R3 if R1 == 0 then DSUB R4, R5, R6
  • 122. Delayed Branch • Yet another possibility: An instruction from the fall-through (not-taken) path • Example: • The OR instruction can be moved into the delay slot ONLY IF its execution doesn’t disrupt the program execution (e.g., R7 is overwritten later). DADD R1, R2, R3; if R1 == 0 then; (delay slot); OR R7, R8, R9; DSUB R4, R5, R6
  • 123. Delayed Branch • Third possibility: An instruction from the fall-through (not-taken) path • Example: • The OR instruction can be moved into the delay slot ONLY IF its execution doesn’t disrupt the program execution (e.g., R7 is overwritten later). DADD R1, R2, R3; if R1 == 0 then; OR R7, R8, R9 (in the delay slot); DSUB R4, R5, R6
  • 125. Flynn’s Classification SISD (Single Instruction Single Data): Uniprocessors. MISD (Multiple Instruction Single Data): No practical examples exist SIMD (Single Instruction Multiple Data): Specialized processors(Vector architectures, Multimedia extensions, Graphics processor units) MIMD (Multiple Instruction Multiple Data): General purpose, commercially important (Tightly-coupled MIMD, Loosely-coupled MIMD)
  • 127. SIMD [Figure: one control unit sends a single instruction stream (IS) to processing units 1 to n; each processing unit exchanges its own data stream (DS1 to DSn) with a memory module.]
  • 128. MIMD [Figure: n control units each send their own instruction stream (IS) to one of processing units 1 to n; each processing unit exchanges its own data stream (DS1 to DSn) with a memory module.]
  • 129. A Broad Classification of Computers • Shared-memory multiprocessors –Also called UMA • Distributed memory computers –Also called NUMA: • Distributed Shared-memory (DSM) architectures • Clusters • Grids, etc.
  • 130. UMA vs. NUMA Computers [Figure: (a) UMA model: processors P1 to Pn, each with a cache, share main memory over a bus; latency = 100s of ns. (b) NUMA model: processors P1 to Pn, each with a cache and its own main memory, are connected by a network; latency = several milliseconds to seconds.]
  • 131. Distributed Memory Computers • Distributed memory computers use: –Message Passing Model • Explicit message send and receive instructions have to be written by the programmer. –Send: specifies local buffer + receiving process (id) on remote computer (address). –Receive: specifies sending process on remote computer + local buffer to place data.
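The send and receive operations described above can be imitated on a single machine with threads and queues. This is only a toy sketch: the process ids `P0` and `P1`, the `channels` table, and the `send`/`receive` signatures are all hypothetical, and real distributed-memory systems use a message-passing library over a network rather than in-process queues.

```python
import queue
import threading

# One one-way channel per (sender, receiver) pair; created up front.
channels = {("P0", "P1"): queue.Queue(), ("P1", "P0"): queue.Queue()}


def send(sender, receiver, local_buffer):
    """Send: specifies the local buffer and the receiving process id."""
    channels[(sender, receiver)].put(list(local_buffer))  # copy, as if sent over a network


def receive(receiver, sender, local_buffer):
    """Receive: specifies the sending process and the local buffer to place the data in."""
    local_buffer[:] = channels[(sender, receiver)].get()  # blocks until a message arrives


def worker_p1():
    data = []
    receive("P1", "P0", data)
    send("P1", "P0", [2 * x for x in data])


t = threading.Thread(target=worker_p1)
t.start()
send("P0", "P1", [1, 2, 3])
reply = []
receive("P0", "P1", reply)
t.join()
print(reply)  # [2, 4, 6]
```

The blocking receive also illustrates why synchronization comes naturally with message passing: P0 cannot read the reply before P1 has produced it.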
  • 132. Advantages of Message-Passing Communication • Hardware for communication and synchronization are much simpler: –Compared to communication in a shared memory model. • Explicit communication: –Programs simpler to understand, helps to reduce maintenance and development costs. • Synchronization is implicit: –Naturally associated with sending/receiving messages. –Easier to debug.
  • 133. Disadvantages of Message-Passing Communication • Programmer has to write explicit message passing constructs. –Also, precisely identify the processes (or threads) with which communication is to occur. • Explicit calls to operating system: –Higher overhead.
  • 134. DSM • Physically separate memories are accessed as one logical address space. • Processors running on a multi-computer system share their memory. –Implemented by operating system. • DSM multiprocessors are NUMA: –Access time depends on the exact location of the data.
  • 135. Distributed Shared-Memory Architecture (DSM) • Underlying mechanism is message passing: – Shared memory convenience provided to the programmer by the operating system. – Basically, an operating system facility takes care of message passing implicitly. • Advantage of DSM: – Ease of programming
  • 136. Disadvantage of DSM • High communication cost: –A program not specifically optimized for DSM by the programmer shall perform extremely poorly. –Data (variables) accessed by specific program segments have to be collocated. –Useful only for process-level (coarse- grained) parallelism.
  • 137. Symmetric Multiprocessors (SMPs) • SMPs are a popular shared memory multiprocessor architecture: –Processors share Memory and I/O –Bus based: access time for all memory locations is equal --- “Symmetric MP” [Figure: four processors (P), each with its own cache, connected by a bus to main memory and the I/O system.]
  • 138. SMPs: Some Insights • In any multiprocessor, main memory access is a bottleneck: –Multilevel caches reduce the memory demand of a processor. –Multilevel caches in fact make it possible for more than one processor to meaningfully share the memory bus. –Hence multilevel caches are a must in a multiprocessor!
  • 139. Pros of SMPs • Ease of programming: –Especially when communication patterns are complex or vary dynamically during execution.
  • 140. Cons of SMPs • As the number of processors increases, contention for the bus increases. – Scalability of the SMP model restricted. – One way out may be to use switches (crossbar, multistage networks, etc.) instead of a bus. – Switches set up parallel point-to-point connections. – Again switches are not without any disadvantages: make implementation of cache coherence difficult.
  • 141. An Important Problem with Shared-Memory: Coherence • When shared data are cached: –These are replicated in multiple caches. –The data in the caches of different processors may become inconsistent. • How to enforce cache coherency? – How does a processor know changes in the caches of other processors?
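  • 142. The Cache Coherency Problem [Figure: processors P1, P2 and P3 with private caches over a shared memory that holds U:5. In five numbered steps, copies of U:5 are read into caches, one cached copy is written as U:7, and U is then read again (U:?).] What value will P1 and P2 read?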
  • 143. Cache Coherence Solutions (Protocols) • The key to maintain cache coherence: – Track the state of sharing of every data block. • Based on this idea, following can be an overall solution: –Dynamically recognize any potential inconsistency at run-time and carry out preventive action.
  • 144. Pros and Cons of the Solution • Pro: –Consistency maintenance becomes transparent to programmers, compilers, as well as to the operating system. • Con: –Increased hardware complexity .
  • 145. Two Important Cache Coherency Protocols • Snooping protocol: – Each cache “snoops” the bus to find out which data is being used by whom. • Directory-based protocol: – Keep track of the sharing state of each data block using a directory. – A directory is a centralized register for all memory blocks. – Allows coherency protocol to avoid broadcasts.
  • 146. Snooping vs. Directory-based Protocols • Snooping protocol reduces memory traffic. – More efficient. • Snooping protocol requires broadcasts: – Can meaningfully be implemented only when there is a shared bus. – Even when there is a shared bus, scalability is a problem. – Some work arounds have been tried: Sun Enterprise server has up to 4 buses.
  • 147. Snooping Protocol • As soon as a request for any data block by a processor is put out on the bus: –Other processors “snoop” to check if they have a copy and respond accordingly. • Works well with bus interconnection: –All transmissions on a bus are essentially broadcast: • Snooping is therefore effortless. –Dominates almost all small scale machines.
  • 148. Categories of Snoopy Protocols • Essentially two types: –Write Invalidate Protocol –Write Broadcast Protocol • Write invalidate protocol: –When one processor writes to its cache, all other processors having a copy of that data block invalidate that block. • Write broadcast: –When one processor writes to its cache, all other processors having a copy of that data block update that block with the recent written value.
  • 149. Write Invalidate Vs. Write Update Protocols [Figure: four processors (P), each with its own cache, connected by a bus to main memory and the I/O system.]
  • 150. Write Invalidate Protocol • Handling a write to shared data: –An invalidate command is sent on bus --- all caches snoop and invalidate any copies they have. • Handling a read Miss: –Write-through: memory is always up-to-date. –Write-back: snooping finds most recent copy.
  • 151. Write Invalidate in Write Through Caches • Simple implementation. • Writes: – Write to shared data: broadcast on bus, processors snoop, and update any copies. – Read miss: memory is always up-to-date. • Concurrent writes: – Write serialization automatically achieved since bus serializes requests. – Bus provides the basic arbitration support.
  • 152. Write Invalidate versus Broadcast cont… • Invalidate exploits spatial locality: –Only one bus transaction for any number of writes to the same block. –Obviously, more efficient. • Broadcast has lower latency for writes and reads: –As compared to invalidate.
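The bus-traffic difference between the two protocols can be illustrated with a toy count. The sketch below assumes one processor writes repeatedly to a block that other caches initially share, and that no other processor touches the block in between; the function name and this access pattern are assumptions, not a model of any real protocol implementation.

```python
def bus_transactions(protocol, n_writes):
    """Count bus transactions for n_writes by one processor to a single shared block."""
    others_hold_copy = True
    transactions = 0
    for _ in range(n_writes):
        if protocol == "update":
            # Write broadcast: every write is sent on the bus so sharers can update.
            transactions += 1
        elif protocol == "invalidate":
            # Write invalidate: only the first write needs the bus; afterwards
            # no other cache holds a copy, so later writes stay local.
            if others_hold_copy:
                transactions += 1
                others_hold_copy = False
        else:
            raise ValueError(f"unknown protocol: {protocol}")
    return transactions


print(bus_transactions("invalidate", 8))  # 1
print(bus_transactions("update", 8))      # 8
```

The flip side, as the slide notes, is latency: under invalidate, the next read by another processor misses and must fetch the block again, while under broadcast the sharers already hold the new value.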