Unit 3

CONTENTS
- What is Pipelining
- How Pipelines Work
- Advantages/Disadvantages
- Characterize Pipelines
- Pipeline classification

What is Pipelining
- A technique used in advanced microprocessors in which the microprocessor begins executing a second instruction before the first has been completed.
- A pipeline is a series of stages, where some work is done at each stage. The work is not finished until it has passed through all stages.
- With pipelining, the computer architecture allows the next instructions to be fetched while the processor is performing arithmetic operations, holding them in a buffer close to the processor until each instruction operation can be performed.

How Pipelines Work
- The pipeline is divided into segments, and each segment can execute its operation concurrently with the other segments. Once a segment completes an operation, it passes the result to the next segment in the pipeline and fetches the next operation from the preceding segment.

Instruction Fetch
- The Instruction Fetch (IF) stage is responsible for obtaining the requested instruction from memory. The instruction and the program counter (which is incremented to the next instruction) are stored in the IF/ID pipeline register as temporary storage, so that they may be used in the next stage at the start of the next clock cycle.

Instruction Decode
- The Instruction Decode (ID) stage is responsible for decoding the instruction and sending out the various control lines to the other parts of the processor. The instruction is sent to the control unit, where it is decoded, and the registers are fetched from the register file.

Execution
- The Execution (EX) stage is where any calculations are performed. The main component in this stage is the ALU, which provides the arithmetic and logic capabilities.

Memory and IO
- The Memory and IO (MEM) stage is responsible for storing and loading values to and from memory. It is also responsible for input or output from the processor. If the current instruction is not of Memory or IO type, then the result from the ALU is passed through to the Write Back stage.

Write Back
- The Write Back (WB) stage is responsible for writing the result of a calculation, memory access or input into the register file.
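The overlap described in the stage slides can be drawn as a space-time diagram. The minimal Python sketch below assumes an ideal five-stage pipeline in which every stage takes one clock cycle, a new instruction enters IF every cycle, and no hazards or stalls occur. The names `STAGES` and `space_time` are hypothetical.

```python
# Ideal 5-stage pipeline: one stage per cycle, no stalls (assumption).
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def space_time(num_instructions):
    """Return one dict per clock cycle mapping stage name -> instruction index."""
    total_cycles = len(STAGES) + num_instructions - 1
    rows = []
    for cycle in range(total_cycles):
        row = {}
        for s, stage in enumerate(STAGES):
            instr = cycle - s  # instruction i is in stage s during cycle i + s
            if 0 <= instr < num_instructions:
                row[stage] = instr
        rows.append(row)
    return rows


if __name__ == "__main__":
    for cycle, row in enumerate(space_time(3), start=1):
        cells = [f"{st}:I{row[st] + 1}" if st in row else f"{st}:--" for st in STAGES]
        print(f"cycle {cycle}: " + "  ".join(cells))
```

With three instructions the diagram needs 5 + 3 - 1 = 7 cycles. Five cycles fill the pipeline, and each additional instruction adds one more.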

Operation Timings
- Estimated timings for each of the stages:

  Stage                 Time
  Instruction Fetch     2 ns
  Instruction Decode    1 ns
  Execution             2 ns
  Memory and IO         2 ns
  Write Back            1 ns
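The estimated timings above allow a short calculation comparing execution time with and without pipelining. This sketch rests on two stated assumptions. First, the pipelined clock period is set by the slowest stage (2 ns). Second, the delay of the pipeline registers is ignored.

```python
# Estimated stage latencies from the table above, in nanoseconds.
STAGE_NS = {"IF": 2, "ID": 1, "EX": 2, "MEM": 2, "WB": 1}


def non_pipelined_ns(n):
    """Each instruction passes through every stage before the next one starts."""
    return n * sum(STAGE_NS.values())


def pipelined_ns(n):
    """k + n - 1 cycles; the clock is set by the slowest stage (register delay ignored)."""
    clock = max(STAGE_NS.values())
    k = len(STAGE_NS)
    return (k + n - 1) * clock


if __name__ == "__main__":
    n = 1000
    print(non_pipelined_ns(n), pipelined_ns(n), non_pipelined_ns(n) / pipelined_ns(n))
```

For 1000 instructions this gives 8000 ns unpipelined versus (5 + 999) × 2 = 2008 ns pipelined, a speedup of roughly 4. A single instruction, however, is slower when pipelined (10 ns versus 8 ns), because every stage is stretched to the 2 ns clock.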

Advantages/Disadvantages
Advantages:
- More efficient use of the processor
- Quicker execution of a large number of instructions
Disadvantages:
- Pipelining involves adding hardware to the chip
- Inability to continuously run the pipeline at full speed because of pipeline hazards, which disrupt the smooth execution of the pipeline

Characterize Pipelines
1) Hardware or software implementation – pipelining can be implemented in either software or hardware.
2) Large or small scale – stations in a pipeline can range from simplistic to powerful, and a pipeline can range in length from short to long.
3) Synchronous or asynchronous flow – a synchronous pipeline operates like an assembly line: at a given time, each station is processing some amount of information. An asynchronous pipeline allows a station to forward information at any time.

Characterize Pipelines
4) Buffered or unbuffered flow – one stage of the pipeline either sends data directly to the next stage, or a buffer is placed between each pair of stages.
5) Finite chunks or continuous bit streams – the digital information that passes through a pipeline can consist of a sequence of small data items or an arbitrarily long bit stream.
6) Automatic data feed or manual data feed – some implementations of pipelines use a separate mechanism to move information, and other implementations require each stage to participate in moving information.

Linear pipelines
- A linear pipeline processor is a series of processing stages and memory accesses.
- In pipelining, we divide a task into a set of subtasks.

Linear pipelines
- The precedence relation of a set of subtasks {T1, T2, …, Tk} for a given task T implies that some task Tj cannot start until some earlier task Ti (i < j) finishes.
- The interdependencies of all subtasks form the precedence graph.

Linear Pipeline processor
- A linear pipeline processor is a cascade of processing stages which are linearly connected.
- It performs a fixed function over a stream of data flowing from one end to the other.
- External inputs are fed into the pipeline at the first stage, and the final result emerges at the last stage of the pipeline.
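A linear pipeline can also be simulated clock by clock in software. The sketch below assumes that each stage is a Python function taking exactly one cycle, with a latch after every stage and all latches updating together on each clock. The names `run_linear_pipeline` and `EMPTY` (a bubble marker) are illustrative, not taken from the slides.

```python
EMPTY = object()  # marks a latch holding no data (a bubble)


def run_linear_pipeline(stages, inputs):
    """Clock a linear pipeline of stage functions over a stream of inputs.

    Every stage takes one cycle and all latches update together on each clock.
    Returns (outputs, cycles_used).
    """
    k = len(stages)
    latches = [EMPTY] * k  # latches[i] holds the output of stage i
    pending = list(inputs)
    total = len(pending)
    outputs, cycles = [], 0
    while len(outputs) < total:
        cycles += 1
        # Update from the back so each stage consumes the value latched last cycle.
        for i in range(k - 1, 0, -1):
            prev = latches[i - 1]
            latches[i] = stages[i](prev) if prev is not EMPTY else EMPTY
        # External input enters at the first stage.
        latches[0] = stages[0](pending.pop(0)) if pending else EMPTY
        # The final result emerges from the last stage.
        if latches[-1] is not EMPTY:
            outputs.append(latches[-1])
    return outputs, cycles


if __name__ == "__main__":
    stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
    print(run_linear_pipeline(stages, [1, 2, 3]))
```

For k stages and n inputs the loop runs k + n - 1 cycles. The first result appears after k cycles, and one more result emerges on every cycle after that.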

Non-linear or dynamic pipelines
- A non-linear pipeline (also called a dynamic pipeline) can be configured to perform various functions at different times. A dynamic pipeline also has feed-forward or feedback connections. A non-linear pipeline also allows very long instruction words.

Non-linear pipelines
- Traditional linear pipelines are static pipelines, because they are used to perform fixed functions.
- A non-linear pipeline allows feed-forward and feedback connections in addition to the linear (streamline) connections.

Instruction pipeline design
- This pipeline reads consecutive instructions from memory while previous instructions are being executed in the other segments.

How instructions execute
- Instruction execution consists of a sequence of phases, each of which requires one or more clock cycles. These phases are:
  - Instruction fetch
  - Decode
  - Operand fetch
  - Execute
  - Result storage

Basic terms used in instruction pipeline
- Instruction pipeline cycle: the clock period of the pipeline.
- Instruction issue latency: the time (in clock cycles) between the issuing of two adjacent instructions.
- Instruction issue rate: the number of instructions issued per cycle.
- Simple operation latency: the latency of simple operations such as add, load, store, branch and move. Complex operations, such as divides, have much longer latencies.

Mechanism of instruction pipeline
- For smooth flow and working of the instruction pipeline, the following mechanisms are used:
  - Prefetch buffer
  - Sequential buffer
  - Target buffer
  - Loop buffer

Mechanism of instruction pipeline
- Internal data forwarding:
  - It improves the throughput of the pipeline processor.
  - Its core idea is to replace unnecessary memory accesses with register-to-register transfers in a sequence of load-arithmetic-store operations.

Mechanism of instruction pipeline
- Internal data forwarding can be further divided into three types:
  - Store-load forwarding: a store followed by a load of the same word can be replaced by two parallel operations, the store and a register transfer.
  - Load-load forwarding: two loads of the same word can be replaced by one load and one register transfer.
  - Store-store forwarding: two memory updates of the same word can be combined into one, because the second store overwrites the first.
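Store-load and load-load forwarding can be illustrated as a rewrite over a straight-line instruction sequence. This sketch relies on assumptions that do not come from the slides. It uses a hypothetical three-field instruction format, treats memory addresses as symbolic names that never alias, and assumes nothing else writes to memory. Store-store forwarding is not handled.

```python
def forward_loads(program):
    """Replace redundant LOADs with register-to-register MOVEs.

    Hypothetical instruction format (not from the slides), all 3-tuples:
      ("STORE", reg, addr)  memory[addr] <- reg
      ("LOAD",  reg, addr)  reg <- memory[addr]
      (op, dst, src)        any other op is assumed to overwrite dst
    """
    holder = {}  # addr -> register currently known to hold memory[addr]
    out = []

    def clobber(reg):
        # reg is about to be overwritten, so it no longer mirrors any memory word.
        for addr in [a for a, r in holder.items() if r == reg]:
            del holder[addr]

    for op, x, y in program:
        if op == "STORE":
            out.append((op, x, y))
            holder[y] = x  # later loads of y can read register x instead
        elif op == "LOAD" and y in holder:
            src = holder[y]
            if src != x:  # if x already holds the word, the load is simply dropped
                clobber(x)
                out.append(("MOVE", x, src))  # register transfer instead of memory read
        elif op == "LOAD":
            clobber(x)
            out.append((op, x, y))
            holder[y] = x
        else:
            clobber(x)
            out.append((op, x, y))
    return out


if __name__ == "__main__":
    print(forward_loads([("STORE", "R1", "X"), ("LOAD", "R2", "X"), ("LOAD", "R3", "X")]))
```

Before a load is converted into a move, the `clobber` step checks that the register being read has not been overwritten since it last mirrored that memory word.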

Difficulties with instruction pipeline
- Resource conflicts: caused when two segments access memory at the same time.
- Data dependencies: when an instruction depends on the result of a previous instruction that is not yet available.
- Branch difficulties: arise from branch and other instructions that change the sequence of instructions.

Branch difficulties
- The main difficulty arises with conditional branch instructions.
- Until the instruction is actually executed, it is impossible to determine whether the branch will be taken or not.
- Prefetching the branch target and using loop buffers are two ways to handle branch difficulties.

- Branch prediction uses additional logic to guess the outcome of a conditional branch instruction before it is executed.
- Branches can be predicted in two ways:
  - Statically
  - Dynamically

- Static prediction is usually wired into the processor.
- A dynamic branch strategy uses recent branch history to predict whether or not the branch will be taken the next time it occurs.
- Dynamic prediction needs additional hardware, called a branch target buffer. The delayed branch is another technique for handling branches.
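The 2-bit saturating counter is a widely used dynamic scheme, though these slides do not name it. It is sketched below with one counter per branch address. The class and method names are hypothetical, and the initial "weakly not taken" state is an assumption. A real branch target buffer would also store the predicted target address, which this sketch omits.

```python
class TwoBitPredictor:
    """Per-branch 2-bit saturating counters: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, initial=1):
        self.counters = {}      # branch address -> counter value 0..3
        self.initial = initial  # assumed starting state: weakly not taken

    def predict(self, pc):
        return self.counters.get(pc, self.initial) >= 2

    def update(self, pc, taken):
        c = self.counters.get(pc, self.initial)
        self.counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)


def count_mispredictions(predictor, pc, outcomes):
    """Feed a branch's actual outcomes (True = taken) and count wrong guesses."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


if __name__ == "__main__":
    # Backward loop branch: taken 9 times, then falls through; the loop runs twice.
    loop_branch = ([True] * 9 + [False]) * 2
    print(count_mispredictions(TwoBitPredictor(), 0x400, loop_branch))
```

For this loop branch the predictor mispredicts three times: once while warming up and once at each loop exit. A single loop exit moves the counter only from 3 to 2, so the next entry into the loop is still predicted correctly.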

- In a delayed branch, the compiler detects the branch instructions and rearranges the machine-language code sequence by inserting useful instructions that keep the pipeline operating without interruption.

Arithmetic Pipeline Design
- This technique is used to speed up numerical arithmetic computations.
- Arithmetic operations are performed with finite precision due to the use of fixed-size memory words or registers.

- Depending on the function to be implemented, different pipeline stages in an arithmetic unit require different hardware logic.
- All arithmetic operations can be implemented with basic add and shift operations.

- For high-speed addition we require a carry-propagation adder (CPA) and a carry-save adder (CSA).
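The carry-save idea can be shown on integers. A CSA reduces three operands to a sum vector and a carry vector without propagating carries between bit positions. A single carry-propagate addition at the end then produces the result. In this sketch Python's built-in `+` stands in for the final CPA, the operands are assumed to be non-negative, and the function names are illustrative.

```python
def carry_save_add(a, b, c):
    """Reduce three non-negative integers to (sum_vec, carry_vec) with
    a + b + c == sum_vec + carry_vec and no carry propagation."""
    sum_vec = a ^ b ^ c                             # bitwise sum, carries ignored
    carry_vec = ((a & b) | (a & c) | (b & c)) << 1  # bitwise majority, moved up one position
    return sum_vec, carry_vec


def add_many(values):
    """Add several operands with repeated CSA steps and one final carry-propagate add."""
    vals = list(values)
    while len(vals) > 2:
        s, c = carry_save_add(*vals[:3])  # three operands in, two out
        vals = vals[3:] + [s, c]
    return sum(vals)  # Python's + stands in for the final CPA


if __name__ == "__main__":
    print(carry_save_add(5, 3, 6), add_many([1, 2, 3, 4, 5, 6, 7]))
```

Each CSA step uses only bitwise logic and one shift, so its delay does not depend on the word length. The only slow carry chain is the single CPA at the end.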

Shortcut Method of Finding Latency & Collision Vector
- Forbidden latency set: $F = \{5\} \cup \{2\} \cup \{2\} = \{2, 5\}$

State Diagram
- The initial collision vector (ICV) is a binary vector formed from F such that
  $C = (C_n \ldots C_2 C_1)$, where $C_i = 1$ if $i \in F$ and $C_i = 0$ otherwise.
- Thus, in our example (n = 5, the largest forbidden latency):
  $F = \{2, 5\}$
  $C = (1\,0\,0\,1\,0)$
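The collision vector and the resulting state diagram can be generated mechanically from F. This sketch assumes the usual construction. A latency p is permissible from a state when bit C_p is 0. The next state is the current state shifted right by p bits and ORed with the ICV. Any latency greater than n returns to the ICV. The function names are illustrative.

```python
def collision_vector(forbidden):
    """ICV as the bit string C_n ... C_1, where n is the largest forbidden latency."""
    n = max(forbidden)
    return "".join("1" if i in forbidden else "0" for i in range(n, 0, -1))


def state_diagram(forbidden):
    """Map (state, latency) -> next state for all states reachable from the ICV.

    States are ints with bit i-1 holding C_i. Latency n + 1 stands for
    "any latency greater than n", which always returns to the ICV.
    """
    n = max(forbidden)
    icv = sum(1 << (i - 1) for i in forbidden)
    edges, todo, seen = {}, [icv], {icv}
    while todo:
        state = todo.pop()
        for p in range(1, n + 2):
            if p <= n and (state >> (p - 1)) & 1:
                continue  # C_p = 1: initiating after p cycles would collide
            nxt = (state >> p) | icv
            edges[(state, p)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return edges


if __name__ == "__main__":
    F = {2, 5}
    print(collision_vector(F))
    for (state, p), nxt in sorted(state_diagram(F).items()):
        print(f"{state:05b} --{p}--> {nxt:05b}")
```

For F = {2, 5} the reachable states are 10010, 11011 and 10011. For example, latency 1 from the initial state gives 01001 OR 10010 = 11011.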

Multifunctional pipeline
- A pipeline processor which can perform p distinct functions can be described by p reservation tables overlaid together.
- Each task to be initiated can be associated with a function tag identifying the reservation table to be used.
- Collisions may occur between two or more tasks with the same function tag or with distinct function tags.

Multifunctional pipeline
- The stage usage for each function can be displayed with a different tag in the overlaid reservation table.

Pipeline Hazards
- Data hazards – an instruction depends on the result of a previous instruction. A hazard occurs exactly when an instruction tries to read a register in its ID stage that an earlier instruction intends to write in its WB stage.
- Control hazards – the location of an instruction depends on a previous instruction, due to branches and other instructions that affect the PC.

Pipeline Hazards
- Structural hazards – two instructions need to access the same resource.
  - Resource conflict.
  - The hardware cannot support all possible combinations of instructions in simultaneous overlapped execution.

Stalling
- Stalling involves halting the flow of instructions until the required result is ready to be used. However, stalling wastes processor time by doing nothing while waiting for the result.
- A stall is the delay in cycles caused by any of the hazards mentioned above.
- How to calculate speedup:
  $\text{Speedup} = \dfrac{\text{Number of stages}}{1 + \text{pipeline stalls per instruction}}$

Stalling
- So what is the speedup for an ideal pipeline with no stalls? With zero stalls per instruction, the speedup equals the number of stages.
- The number of cycles needed to initially fill up the pipeline could be included in the computation of the average stalls per instruction.
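The speedup formula can be evaluated directly. The formula assumes balanced stages and an ideal CPI of 1 apart from stalls. The stall rate in the second call below is a hypothetical example value.

```python
def pipeline_speedup(num_stages, stalls_per_instruction):
    """Speedup = number of stages / (1 + pipeline stall cycles per instruction).

    Assumes balanced stages and an ideal CPI of 1 apart from stalls.
    """
    return num_stages / (1 + stalls_per_instruction)


if __name__ == "__main__":
    print(pipeline_speedup(5, 0))     # ideal pipeline: equals the number of stages
    print(pipeline_speedup(5, 0.25))  # hypothetical 0.25 stall cycles per instruction
```

With 5 stages and no stalls the speedup is 5. If, hypothetically, one instruction in four incurred a single stall cycle (0.25 stalls per instruction), the speedup would fall to 5 / 1.25 = 4.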

Structural hazards
- When more than one instruction in the pipeline needs to access a resource, the data path is said to have a structural hazard.
- Examples of resources: register file, memory, ALU.
- Solution: stall the pipeline for one clock cycle when the conflict is detected. This results in a pipeline bubble.

Data hazard - solution
- Usually solved by data or register forwarding (also called bypassing or short-circuiting).
- How is it done?
- The data selected is not really used

Data hazard classification
- RAW (Read After Write): instruction j tries to read a source before instruction i writes it. This is the most common type and is solved by data forwarding.
- WAW (Write After Write): instruction i (a load) comes before instruction j (an add), and both write to the same register.
- WAR (Write After Read): instruction j tries to write a destination before it is read by i, so i incorrectly gets the new value.
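The three classes can be told apart by comparing the registers that two instructions read and write. This sketch describes each instruction only by its destination and source registers, which is an assumed, simplified representation. It reports dependences. Whether a dependence becomes an actual hazard depends on whether the pipeline overlaps or reorders the two instructions.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    dest: Optional[str]          # register written, or None (e.g. a store)
    srcs: Tuple[str, ...] = ()   # registers read


def classify_hazards(i, j):
    """Data dependences between an earlier instruction i and a later instruction j."""
    found = set()
    if i.dest is not None and i.dest in j.srcs:
        found.add("RAW")  # j reads a register that i writes
    if i.dest is not None and i.dest == j.dest:
        found.add("WAW")  # i and j write the same register
    if j.dest is not None and j.dest in i.srcs:
        found.add("WAR")  # j writes a register that i still reads
    return found


if __name__ == "__main__":
    load = Instr("R1", ("R2",))      # i: LOAD R1, 0(R2)
    add = Instr("R1", ("R3", "R4"))  # j: ADD  R1, R3, R4
    print(classify_hazards(load, add))
```

The load/add pair in the usage example matches the WAW case on the slide: both instructions write R1.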

Dynamic Instruction Scheduling
- With dynamic scheduling, the hardware tries to rearrange the instructions during run time to reduce pipeline stalls.
- It allows a simpler compiler and handles dependencies not known at compile time.
- It allows code compiled for a different machine to run efficiently.

Out-Of-Order Execution
- With out-of-order execution, the SUBD is allowed to execute before the ADD.
- This can lead to out-of-order completion, which can cause WAW and WAR hazards.

Scoreboarding
- The scoreboard implements a centralized control scheme that:
  - detects all resource and data hazards
  - allows instructions to execute out of order when there are no resource hazards or data dependencies
- First implemented in 1964 by the CDC 6600, which had 18 separate functional units:
  - 4 FP units (2 multiply, 1 add, 1 divide)
  - 7 memory units (5 loads, 2 stores)
  - 7 integer units (add, shift, logical, compare, etc.)

Scoreboard Implications
- Our dynamic pipeline (much simpler):
  - 2 FP multiply (10 EX cycles)
  - 1 FP add (2 EX cycles)
  - 1 FP divide (40 EX cycles)
  - 1 integer unit (1 EX cycle)

- Out-of-order completion can lead to WAR and WAW hazards.
- Solution for WAW:
  - Detect the WAW hazard at issue, before reading operands.
  - Stall the new instruction until the other instruction completes.

- Solution for WAR:
  - Detect WAR hazards before writing back to the register file, and stall the write back.
- This scoreboard does not take advantage of forwarding (i.e. bypasses), since it waits until results are written back to the register file.
- The scoreboard replaces ID, EX, WB with 4 stages.

Stages of Scoreboard Control
- Decode + Issue (Issue)
- Read operands (Read)
- Execution (EX)
- Write result (WB)

Parts of the Scoreboard
- Instruction status: which of the 4 steps the instruction is in (Issue, Read, EX, or WB).
- Functional unit status: indicates the state of the functional unit (FU), with 9 fields for each functional unit.
- Register result status: indicates which functional unit will write each register, if one exists. It is blank when no pending instruction will write that register.

Parts of the Scoreboard
- Busy: indicates whether the unit is busy or not
- Op: operation to perform in the unit (e.g., + or –)
- Fi: destination register
- Fj, Fk: source-register numbers
- Qj, Qk: functional units producing source registers Fj, Fk
- Rj, Rk: flags indicating when Fj, Fk are ready
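The scoreboard fields above can be modelled as a small data structure, together with the checks made at issue, read operands and write result. This is a simplified sketch in the style of the classic CDC 6600 scoreboard. It tracks registers by name only, does not model the timing of the EX stage, expects `write_result` to be called only on busy units, and uses hypothetical class and method names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FUStatus:
    """The nine functional-unit status fields listed above."""
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None  # destination register
    fj: Optional[str] = None  # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None  # unit producing fj (None: value is in the register file)
    qk: Optional[str] = None
    rj: bool = False          # fj ready and not yet read
    rk: bool = False


class Scoreboard:
    def __init__(self, unit_names):
        self.fu = {name: FUStatus() for name in unit_names}
        self.result = {}  # register result status: register -> unit that will write it

    def issue(self, unit, op, dest, src1, src2):
        # Stall on a structural hazard (unit busy) or a WAW hazard (dest already pending).
        if self.fu[unit].busy or dest in self.result:
            return False
        qj, qk = self.result.get(src1), self.result.get(src2)
        self.fu[unit] = FUStatus(True, op, dest, src1, src2, qj, qk, qj is None, qk is None)
        self.result[dest] = unit
        return True

    def read_operands(self, unit):
        s = self.fu[unit]
        if not (s.rj and s.rk):
            return False  # RAW: a source is still being produced
        s.rj = s.rk = False  # operands have now been read
        return True

    def write_result(self, unit):
        d = self.fu[unit].fi
        # WAR: stall while another unit still has to read the old value of d.
        for name, f in self.fu.items():
            if name != unit and ((f.fj == d and f.rj) or (f.fk == d and f.rk)):
                return False
        for f in self.fu.values():  # wake up units waiting on this result
            if f.qj == unit:
                f.qj, f.rj = None, True
            if f.qk == unit:
                f.qk, f.rk = None, True
        del self.result[d]
        self.fu[unit] = FUStatus()
        return True


if __name__ == "__main__":
    sb = Scoreboard(["Mult1", "Add"])
    sb.issue("Mult1", "MULTD", "F0", "F2", "F4")
    sb.issue("Add", "ADDD", "F2", "F6", "F8")
    sb.read_operands("Add")
    print(sb.write_result("Add"))  # False: Mult1 has not read F2 yet (WAR)
    sb.read_operands("Mult1")
    print(sb.write_result("Add"))  # True
```

In the usage example, ADDD may issue and read its operands while the multiply is pending. However, `write_result` does not let ADDD overwrite F2 until the multiply has read the old value of F2.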

Tomasulo Algorithm for Dynamic Scheduling
- Developed for the IBM 360/91 in 1967, about 3 years after the CDC 6600.
- Goal: high performance without special compilers.
- Differences between the IBM 360 and the CDC 6600:
  - IBM has only 2 register specifiers per instruction vs. 3 in the CDC 6600.
  - IBM has register-memory instructions.
  - IBM has 4 FP registers vs. 8 in the CDC 6600.
  - IBM has pipelined functional units (3 adds, 2 multiplies).

Tomasulo Algorithm
- Tomasulo's algorithm is designed to handle name dependencies (WAW and WAR hazards) efficiently.

Tomasulo Algorithm Advantages
- Prevents registers from being the bottleneck.
- Eliminates WAR and WAW hazards, which allows loop unrolling in hardware.
- Common data bus: broadcasts results to multiple instructions, but is a central bottleneck.
- Provides dynamic scheduling.
- Provides register renaming.
- Load/store disambiguation.
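Register renaming is the mechanism that lets Tomasulo's algorithm remove WAR and WAW hazards. It can be sketched with a register status table that maps each register to the tag of the reservation station that will produce its value. This sketch covers renaming only: it has no common data bus, no execution, and no load/store handling. It also assumes an unlimited supply of stations and uses hypothetical tags `RS1`, `RS2`, and so on.

```python
def rename(instructions):
    """Rename destinations to reservation-station tags, Tomasulo style.

    instructions: list of (dest, src1, src2) register names.
    Each result gets a fresh tag; a source becomes the tag of its latest pending
    producer, or stays a register name if its value is read at issue time.
    """
    status = {}  # register status table: register -> tag of the pending producer
    renamed = []
    for n, (dest, src1, src2) in enumerate(instructions, start=1):
        tag = f"RS{n}"  # hypothetical tag naming, one station per instruction
        ops = (status.get(src1, src1), status.get(src2, src2))  # read before remapping dest
        status[dest] = tag  # later readers of dest wait for this tag instead
        renamed.append((tag,) + ops)
    return renamed


if __name__ == "__main__":
    code = [("F0", "F2", "F4"),    # DIVD  F0, F2, F4
            ("F6", "F0", "F8"),    # ADDD  F6, F0, F8
            ("F8", "F10", "F14"),  # SUBD  F8, F10, F14
            ("F6", "F10", "F8")]   # MULTD F6, F10, F8
    for row in rename(code):
        print(row)
```

In the example, SUBD's write to F8 no longer conflicts with ADDD's earlier read of F8, because ADDD captured the value of F8 at issue. The second write to F6 simply remaps F6 to RS4, so the earlier result tagged RS2 never overwrites the register.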
