2. What is pipelining?
• Instruction pipelining improves CPU performance by
overlapping instruction fetch with instruction execution.
• The 8086 microprocessor has two blocks:
BIU (Bus Interface Unit)
EU (Execution Unit)
• The BIU performs all bus operations, such as fetching
instructions, reading and writing operands in memory, and
calculating the addresses of memory operands. The
instruction bytes are transferred to the instruction queue.
• The EU executes instructions from the instruction byte
queue.
• Both units operate asynchronously, giving the 8086 an
overlapping instruction fetch and execution mechanism,
which is called pipelining.
3. INSTRUCTION PIPELINING
The first stage fetches an instruction and buffers it.
When the second stage is free, the first stage
passes it the buffered instruction.
While the second stage is executing the
instruction, the first stage takes advantage of
any unused memory cycles to fetch and buffer the
next instruction.
This is called instruction prefetch or fetch
overlap.
4. Inefficiency in two-stage instruction pipelining
There are two reasons:
• The execution time will generally be longer than
the fetch time. Thus the fetch stage may have to
wait for some time before it can empty its buffer.
• When a conditional branch occurs, the address
of the next instruction to be fetched is
unknown. The execution stage then has to wait
while the next instruction is fetched.
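The two effects above can be put into numbers with a small timing model. This is only a sketch under stated assumptions: every instruction has the same fetch and execute time, the fetch stage buffers one instruction, and the function name and parameters are made up for illustration.

```python
def two_stage_time(n, fetch, execute, taken_branches=frozenset()):
    """Total time for n instructions on a fetch/execute pipeline.

    taken_branches holds 0-based positions of taken branches; the
    instruction after each one cannot be fetched until the branch
    has finished executing (the prefetched instruction is discarded).
    """
    fetch_done = fetch      # instruction 0 is fetched with no overlap
    exec_done = 0
    for i in range(n):
        # Execution needs both a buffered instruction and a free stage.
        exec_start = max(fetch_done, exec_done)
        exec_done = exec_start + execute
        if i in taken_branches:
            # Fetch restarts only after the branch has executed.
            fetch_done = exec_done + fetch
        else:
            # The next fetch overlaps the execution of instruction i.
            fetch_done = exec_start + fetch
    return exec_done


print(two_stage_time(4, fetch=1, execute=2))
print(two_stage_time(4, fetch=1, execute=2, taken_branches={1}))
```

With four instructions, a fetch time of 1 and an execute time of 2, the model gives 9 time units against 12 without overlap, and 10 if the second instruction is a taken branch. The fetch stage sits idle for part of every execute period, which is the first inefficiency listed above.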
5. Two stage instruction pipelining
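[Figure: two-stage instruction pipeline. Simplified view: Instruction -> Fetch -> Execute -> Result. Expanded view: the Fetch stage may wait, receives a new address from Execute, and may discard a prefetched instruction; the Execute stage may also wait.]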
6. Decomposition of instruction processing
To gain further speedup, the pipeline must have more
stages. Instruction processing is decomposed into six stages:
Fetch instruction (FI)
Decode instruction (DI)
Calculate operands, i.e. effective addresses (CO)
Fetch operands (FO)
Execute instruction (EI)
Write operand (WO)
7. SIX STAGES OF INSTRUCTION PIPELINING
Fetch Instruction (FI)
Read the next expected instruction into a buffer.
Decode Instruction (DI)
Determine the opcode and the operand specifiers.
Calculate Operands (CO)
Calculate the effective address of each source operand.
Fetch Operands (FO)
Fetch each operand from memory. Operands in registers
need not be fetched.
Execute Instruction (EI)
Perform the indicated operation and store the result.
Write Operand (WO)
Store the result in memory.
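The way the six stages overlap can be visualised with a short script that prints a timing diagram. This is a sketch that assumes every stage takes exactly one time unit and that nothing ever stalls; the function name is made up for illustration.

```python
STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]


def timing_diagram(num_instructions):
    """Row i is instruction i+1; column t is time unit t+1."""
    total_time = len(STAGES) + num_instructions - 1
    rows = []
    for i in range(num_instructions):
        row = ["--"] * total_time
        for s, name in enumerate(STAGES):
            # Instruction i occupies stage s during time unit i + s + 1.
            row[i + s] = name
        rows.append(row)
    return rows


for number, row in enumerate(timing_diagram(4), start=1):
    print(f"I{number}: " + " ".join(row))
```

Each row is shifted one column to the right of the row above it, so in any time unit after the pipeline has filled, all six stages are busy with different instructions.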
9. High efficiency of instruction pipelining
The diagram assumes all of the following:
• All stages are of equal duration.
• Each instruction goes through all six stages of
the pipeline.
• All the stages can be performed in parallel.
• There are no memory conflicts.
• All the accesses occur simultaneously.
Under these assumptions, the instruction pipeline in the
previous diagram works very efficiently and gives high performance.
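Under the same ideal assumptions, the benefit can be estimated with simple arithmetic: the first instruction needs k time units to pass through k stages, and each later instruction completes one time unit after its predecessor. The helper names below are hypothetical, and the sketch ignores branches and conflicts entirely.

```python
def pipeline_time_units(k, n):
    """Time units for n >= 1 instructions on an ideal k-stage pipeline."""
    return k + (n - 1)


def speedup(k, n):
    """Unpipelined time (n * k) divided by ideal pipelined time."""
    return (n * k) / pipeline_time_units(k, n)


print(pipeline_time_units(6, 9))
print(round(speedup(6, 9), 2))
```

For nine instructions on six stages this gives 14 time units instead of 54, a speedup of about 3.86. As n grows, the speedup approaches the number of stages, k.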
10. Limits to performance enhancement
The factors affecting performance are:
1. If the six stages are not of equal duration, there will
be some waiting time at various stages.
2. Conditional branch instructions, which can invalidate
several instruction fetches.
3. Interrupts, which are unpredictable events.
4. Register and memory conflicts.
5. The CO stage may depend on the contents of a register
that could be altered by a previous instruction that
is still in the pipeline.
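The register conflict in item 5 can be illustrated by scanning a program for an instruction that reads a register written by a nearby earlier instruction. This is a sketch only: the instruction format (destination register plus a tuple of source registers), the register names, and the overlap distance of 4 are all assumptions made for the example.

```python
def find_register_conflicts(program, distance=4):
    """Return (producer, consumer, register) read-after-write conflicts.

    program:  list of (destination_register, source_registers) tuples
    distance: how many following instructions may still overlap the
              producer in the pipeline (an assumed value)
    """
    conflicts = []
    for i, (dest, _) in enumerate(program):
        last = min(i + 1 + distance, len(program))
        for j in range(i + 1, last):
            j_dest, j_sources = program[j]
            if dest in j_sources:
                conflicts.append((i, j, dest))
            if j_dest == dest:
                break  # a later write supersedes this one
    return conflicts


example = [
    ("R1", ("R2", "R3")),   # 0: writes R1
    ("R4", ("R1", "R5")),   # 1: reads R1 while 0 is still in flight
    ("R6", ("R7", "R8")),   # 2
    ("R1", ("R6",)),        # 3: reads R6, rewrites R1
    ("R9", ("R1",)),        # 4: reads the R1 written by 3
]
print(find_register_conflicts(example))
```

Each reported triple marks a point where the pipeline would have to wait, or where extra control logic would be needed to supply the correct register value.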
12. Conditional branch instructions
Assume that instruction 3 is a conditional
branch to instruction 15.
Until the instruction is executed, there is no way of
knowing which instruction will come next.
The pipeline simply loads the next instruction
in sequence and proceeds.
The branch is not determined until the end of time unit
7.
During time unit 8, instruction 15 enters the
pipeline.
No instruction completes during time units 9
through 12.
This is the performance penalty incurred because
the branch could not be anticipated.
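The time units quoted in this example can be reproduced with a small model of the six-stage pipeline. It is a sketch under assumptions: one time unit per stage, the branch outcome known at the end of the EI stage, a forward branch target, and hypothetical function and parameter names.

```python
NUM_STAGES = 6      # FI DI CO FO EI WO, one time unit each (assumed)
EXECUTE_STAGE = 5   # EI is the fifth stage


def completion_times(program, branch_at, target):
    """Map instruction number -> time unit in which it completes WO.

    program:   instruction numbers in sequential order
    branch_at: the instruction that is a taken conditional branch
    target:    the (later) instruction the branch jumps to
    """
    done = {}
    time = 1        # time unit in which the next FI takes place
    i = 0
    while i < len(program):
        instr = program[i]
        done[instr] = time + NUM_STAGES - 1
        if instr == branch_at:
            # Outcome known at the end of EI; instructions fetched in
            # the meantime are discarded and the target is fetched next.
            time = time + EXECUTE_STAGE
            i = program.index(target)
        else:
            time += 1
            i += 1
    return done


done = completion_times(list(range(1, 18)), branch_at=3, target=15)
print(done)
```

In this model instruction 3 completes in time unit 8 and instruction 15 in time unit 13, so nothing completes in time units 9 through 12, matching the description above.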
13. Simple pattern for high performance
• Two factors that frustrate this simple pattern for
high performance are:
1. At each stage of the pipeline, there is some
overhead involved in moving data from buffer to
buffer and in performing various preparation and
delivery functions. This overhead can lengthen
the execution time of a single instruction. This is
significant when sequential instructions are
logically dependent, either through heavy use of
branching or through memory access
dependencies.
2. The amount of control logic required to handle
memory and register dependencies and to
optimize the use of the pipeline increases
15. Dealing with branches
A variety of approaches have been taken for dealing
with conditional branches.
Multiple streams
Prefetch branch target
Loop buffer
Branch prediction
Delayed branch
16. Multiple streams
A simple pipeline must choose one of the two
instructions to fetch next and may make the wrong
choice.
Multiple streams allow the pipeline to fetch both
instructions, making use of two streams.
Problems with this approach:
• With multiple pipelines there are contention delays
for access to the registers and to memory.
• Additional branch instructions may enter the
pipeline (in either stream) before the original branch
decision is resolved. Each such instruction needs
an additional stream.
Examples:
• IBM 370/168 and IBM 3033.
17. Prefetch Branch Target
When a conditional branch is recognized, the target
of the branch is prefetched, in addition to the instruction
following the branch.
This target is then saved until the branch instruction is
executed.
If the branch is taken, the target has already been
prefetched.
The IBM 360/91 uses this approach.
18. Loop buffer
A loop buffer is a small, very high-speed memory
maintained by the instruction fetch stage.
It contains the n most recently fetched instructions, in
sequence.
If a branch is to be taken, the hardware first checks
whether the branch target is within the buffer.
If so, the next instruction is fetched from the buffer.
19. Benefits of loop buffer
Instructions fetched in sequence will be available
without the usual memory access time.
If a branch occurs to a target just a few locations
ahead of the address of the branch instruction, the
target will already be in the buffer. This is useful for
the rather common occurrence of IF-THEN and
IF-THEN-ELSE sequences.
The buffer is well suited to loops or iterations, hence
the name loop buffer. If the loop buffer is large enough
to contain all the instructions in a loop, then those
instructions need to be fetched from memory only
once, for the first iteration.
For subsequent iterations, all the needed instructions
are already in the buffer.
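20. Loop buffer (cont.)
A loop buffer is similar to a cache.
The least significant 8 bits of the branch address are used
to index the buffer, and the remaining most significant bits
are checked to determine whether the branch target is in the buffer.
[Figure: loop buffer organisation. The low 8 bits of the branch address index a 256-byte loop buffer; the most significant address bits are compared, and in case of a hit the buffer supplies the instruction to be decoded.]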
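The indexing scheme described on this slide can be sketched in a few lines. This is an illustration under assumptions, not a description of any particular machine: the buffer is modelled as 256 one-byte entries, each tagged with the high-order address bits, and the class and method names are hypothetical.

```python
class LoopBuffer:
    """Sketch of a 256-byte loop buffer (assumed organisation)."""

    SIZE = 256  # bytes, so the low 8 address bits select an entry

    def __init__(self):
        self.tags = [None] * self.SIZE   # high-order address bits
        self.data = [None] * self.SIZE   # buffered instruction byte

    def fill(self, address, byte):
        """Record a byte the fetch stage has just read from memory."""
        index = address & 0xFF
        self.tags[index] = address >> 8
        self.data[index] = byte

    def lookup(self, address):
        """Return the buffered byte on a hit, or None on a miss."""
        index = address & 0xFF
        if self.tags[index] == address >> 8:
            return self.data[index]
        return None


buf = LoopBuffer()
buf.fill(0x1234, 0x90)
print(buf.lookup(0x1234) is not None)   # same address: hit
print(buf.lookup(0x5634) is not None)   # same low 8 bits, other tag: miss
```

A branch target is served from the buffer only when both the index and the high-order bits match; otherwise the fetch goes to memory as usual.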
21. Branch prediction
Various techniques are used to predict whether a
branch will be taken:
Static
1. Predict Never Taken
2. Predict Always Taken
3. Predict by Opcode
Dynamic
4. Taken/Not Taken Switch
5. Branch History Table
22. Static branch strategies
• Static strategies (1, 2, 3) do not depend on the
execution history.
• Predict Never Taken
Always assume that the branch will not be
taken and continue to fetch instructions in sequence.
• Predict Always Taken
Always assume that the branch will be taken
and always fetch from the branch target.
• Predict by Opcode
The decision is based on the opcode of the
branch instruction. The processor assumes that the
branch will be taken for certain branch opcodes and
not for others.
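The three static strategies can be compared on a recorded sequence of branch outcomes. The sketch below is illustrative only: the opcode names, the table of opcodes assumed to be taken, and the trace are all invented for the example.

```python
def predict_never_taken(opcode):
    return False


def predict_always_taken(opcode):
    return True


# Hypothetical opcode table: branch opcodes assumed to be taken.
TAKEN_OPCODES = {"LOOP"}


def predict_by_opcode(opcode):
    return opcode in TAKEN_OPCODES


def accuracy(predictor, trace):
    """trace is a list of (opcode, actually_taken) pairs."""
    correct = sum(predictor(op) == taken for op, taken in trace)
    return correct / len(trace)


# Invented trace: a loop-closing branch and a rarely taken test.
trace = ([("LOOP", True)] * 9 + [("LOOP", False)]
         + [("JZ", False)] * 4 + [("JZ", True)])

for predictor in (predict_never_taken, predict_always_taken, predict_by_opcode):
    print(predictor.__name__, round(accuracy(predictor, trace), 2))
```

On this made-up trace, predicting by opcode does best because the two branch types behave differently. A different mix of branches could favour either of the simpler strategies.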
23. Dynamic branch strategies
Dynamic strategies (4, 5) depend on the execution
history.
They attempt to improve the accuracy of prediction
by recording the history of conditional branch
instructions in a program.
For example, one or more bits can be associated
with each conditional branch instruction to reflect its
recent history.
These bits are referred to as a taken/not taken switch.
The history bits are stored in temporary high-speed
memory.
The bits are then associated with a conditional branch
instruction and used to make the decision.
Another possibility is to maintain a small table of
recent branch history with one or more bits in each entry.
24. Dynamic branch strategies (cont.)
With only one bit of history, a prediction error will occur
twice for each use of the loop: once on entering the loop
and once on exiting.
With two history bits, the decision process can be represented
by a finite-state machine with four states.
25. Dynamic branch strategies (cont.)
If the last two executions of a given branch instruction
took the same path, the prediction is to take
that path again.
If the prediction is wrong once, it stays the same for the next
time.
Only when the prediction is wrong a second time in a row is the
opposite path selected.
Greater efficiency could be achieved if the
instruction fetch could be initiated as soon as the
branch decision is made.
For this purpose, more information must be saved, in what
is known as a branch target buffer or branch
history table.
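The difference between one bit and two bits of history can be seen by counting mispredictions for a loop-closing branch. The sketch below uses a two-bit saturating counter, which is one common way to build a four-state predictor and may differ in detail from the state diagram in the original slides. The function names and the trace are assumptions.

```python
def mispredictions_one_bit(outcomes, last=True):
    """Predict that the branch repeats its previous outcome."""
    misses = 0
    for taken in outcomes:
        if taken != last:
            misses += 1
        last = taken
    return misses


def mispredictions_two_bit(outcomes, counter=3):
    """Saturating counter: 0-1 predict not taken, 2-3 predict taken."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        if taken:
            counter = min(counter + 1, 3)
        else:
            counter = max(counter - 1, 0)
    return misses


# Loop-closing branch: taken 4 times, not taken on exit; loop run 3 times.
loop_trace = ([True] * 4 + [False]) * 3

print(mispredictions_one_bit(loop_trace))
print(mispredictions_two_bit(loop_trace))
```

On this trace the one-bit scheme mispredicts 5 times and the two-bit scheme 3 times. After the first pass, the one-bit scheme misses on both entry and exit of every run of the loop, while the two-bit scheme misses only on exit.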
26. Branch history table
It is a small cache memory associated with the
instruction fetch stage.
Each entry in the table consists of three elements:
Address of a branch instruction
Some number of history bits
Information about the target instruction
• The third field may contain the target address or the target
instruction itself.
28. Branching strategies
If a branch is taken, some logic in the processor
detects this and instructs that the next instruction be fetched
from the target address.
Each prefetch triggers a lookup in the branch
history table.
If no match is found, the next sequential instruction
address is used for the fetch.
If a match is found, a prediction is made based on the
state of the instruction.
When the branch instruction is executed, the
execute stage signals the branch history table logic
with the result.
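The lookup and update cycle described above can be sketched as a small table keyed by branch address. This is a simplified model under assumptions: a Python dictionary stands in for the small cache (so there is no capacity limit or replacement), the history is a two-bit counter, and all names are hypothetical.

```python
class BranchHistoryTable:
    """Sketch: branch address -> [two-bit counter, target address]."""

    def __init__(self):
        self.entries = {}

    def predict_next_fetch(self, address, instruction_length):
        """Return the address to fetch after the one at `address`."""
        entry = self.entries.get(address)
        if entry is not None and entry[0] >= 2:
            return entry[1]                    # match, predicted taken
        # No match, or match predicted not taken: fetch sequentially.
        return address + instruction_length

    def update(self, address, taken, target):
        """Called by the execute stage with the actual branch result."""
        if address not in self.entries:
            # A new entry starts weakly biased toward what was observed.
            self.entries[address] = [2 if taken else 1, target]
            return
        entry = self.entries[address]
        if taken:
            entry[0] = min(entry[0] + 1, 3)
        else:
            entry[0] = max(entry[0] - 1, 0)
        entry[1] = target


bht = BranchHistoryTable()
print(hex(bht.predict_next_fetch(0x100, 2)))   # no entry yet
bht.update(0x100, taken=True, target=0x80)
print(hex(bht.predict_next_fetch(0x100, 2)))   # entry predicts taken
```

Before the branch at the example address has executed, the table has no match and the fetch continues sequentially. After one taken execution, the next prefetch of that address is redirected to the stored target.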
29. Delayed branch
It is possible to improve pipeline performance by
automatically rearranging instructions within the
program so that branch instructions occur later than
actually desired.
30. Intel 80486 Pipelining
• Fetch
— From cache or external memory
— Put in one of two 16-byte prefetch buffers
— Fill buffer with new data as soon as old data consumed
— Average 5 instructions fetched per load
— Independent of other stages to keep buffers full
• Decode stage 1
— Opcode & address-mode info
— At most first 3 bytes of instruction
— Can direct D2 stage to get rest of instruction
• Decode stage 2
— Expand opcode into control signals
— Computation of complex address modes
• Execute
— ALU operations, cache access, register update
• Writeback
— Update registers & flags
— Results sent to cache & bus interface write buffers