2. What is Pipelining
• Pipelining is an implementation technique whereby multiple instructions are
overlapped in execution; it takes advantage of parallelism that exists among
the actions needed to execute an instruction. Today, pipelining is the key
implementation technique used to make fast CPUs.
• A technique used in advanced microprocessors where the microprocessor
begins executing a second instruction before the first has been completed.
• A Pipeline is a series of stages, where some work is done at each stage. The
work is not finished until it has passed through all stages.
• With pipelining, the computer architecture allows the next instructions to
be fetched while the processor is performing arithmetic operations, holding
them in a buffer close to the processor until each instruction operation can be
performed.
3. Pipelining: It’s Natural!
• Laundry Example
• Ann, Brian, Cathy, Dave
each have one load of clothes
to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes
4. Sequential Laundry
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?
[Timeline: loads A–D run one after another from 6 PM to midnight; each load occupies the washer (30 min), then the dryer (40 min), then the folder (20 min) before the next load starts.]
5. Pipelined Laundry
• Start work ASAP
• Pipelined laundry takes 3.5 hours for 4 loads
[Timeline: loads A–D overlap between 6 PM and 9:30 PM; after the first 30-min wash the dryer runs back-to-back (4 × 40 min) and the final 20-min fold drains the pipeline.]
6. Pipelining Lessons
• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it also reduce speedup (worked through in the sketch after this slide)
[Same pipelined-laundry timeline as the previous slide, shown again for reference: loads A–D overlapped between 6 PM and about 9:30 PM.]
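The timing claims on the last two slides can be checked with a short Python sketch (the 30/40/20-minute stage times and the four loads come from the laundry example above; the formulas are mine):

wash, dry, fold = 30, 40, 20      # minutes per stage
loads = 4

# Sequential: each load finishes all three stages before the next one starts.
sequential = loads * (wash + dry + fold)      # 4 * 90 = 360 min = 6 hours

# Pipelined: the 40-minute dryer is the slowest stage and sets the rate;
# the first wash fills the pipeline and the last fold drains it.
pipelined = wash + loads * dry + fold         # 30 + 4*40 + 20 = 210 min = 3.5 hours

print(sequential / 60, "hours sequential")    # 6.0
print(pipelined / 60, "hours pipelined")      # 3.5
print("speedup:", round(sequential / pipelined, 2))   # 1.71, below the 3-stage ideal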
7. How Pipelines Work
• The pipeline is divided into segments, and each
segment can execute its operation concurrently with
the other segments. Once a segment completes an
operation, it passes the result to the next segment in
the pipeline and fetches the next operation from the
preceding segment.
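As a rough software analogy (my own sketch, not from the slides), the segments can be pictured as chained functions where each one consumes results from the preceding segment and passes its own result to the next; real pipeline hardware runs the segments concurrently, which this sequential Python sketch does not show:

# Three chained "segments": each takes its input from the preceding one
# and hands its result to the next, like stations on an assembly line.
def fetch_segment(values):
    for v in values:
        yield v                      # stand-in for "fetch an operation"

def compute_segment(upstream):
    for v in upstream:
        yield v * 2                  # stand-in for "do the work"

def output_segment(upstream):
    for v in upstream:
        print("result:", v)          # stand-in for "pass the result onward"

output_segment(compute_segment(fetch_segment([1, 2, 3])))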
8. Before there was pipelining…
• Single-cycle control: hardwired
– Low CPI (1)
– Long clock period (to accommodate slowest instruction)
• Multi-cycle control: micro-programmed
– Short clock period
– High CPI
Single-cycle: | insn0.(fetch,decode,exec) | insn1.(fetch,decode,exec) |
Multi-cycle:  | insn0.fetch | insn0.dec | insn0.exec | insn1.fetch | insn1.dec | insn1.exec |
(time runs left to right)
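To see the trade-off in numbers, here is a tiny Python calculation with made-up figures (the 50 ns and 10 ns clock periods and the CPI of 4.2 are assumptions for illustration, not values from the lecture): execution time = instruction count × CPI × clock period.

insns = 1_000_000

single_cycle_cpi, single_cycle_clk = 1.0, 50e-9   # CPI = 1, long 50 ns clock (assumed)
multi_cycle_cpi,  multi_cycle_clk  = 4.2, 10e-9   # higher CPI, short 10 ns clock (assumed)

t_single = insns * single_cycle_cpi * single_cycle_clk
t_multi  = insns * multi_cycle_cpi  * multi_cycle_clk

print(f"single-cycle: {t_single * 1e3:.1f} ms")   # 50.0 ms
print(f"multi-cycle:  {t_multi * 1e3:.1f} ms")    # 42.0 ms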
12. Pipelining
Multi-cycle: insn0.fetch → insn0.dec → insn0.exec → insn1.fetch → insn1.dec → insn1.exec
Pipelined (one row per cycle, time running downward):
  cycle 1: insn0.fetch
  cycle 2: insn0.dec    insn1.fetch
  cycle 3: insn0.exec   insn1.dec    insn2.fetch
  cycle 4:              insn1.exec   insn2.dec
  cycle 5:                           insn2.exec
• Start with multi-cycle design
• When insn0 goes from stage 1 to stage 2
… insn1 starts stage 1
• Each instruction passes through all stages
… but instructions enter and leave at a faster rate
Can have as many insns in flight as there are stages
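The overlap can be printed with a few lines of Python (a sketch of the diagram above, using a 3-stage fetch/dec/exec pipeline and three instructions): instruction i simply sits in stage s during cycle i + s + 1.

stages = ["fetch", "dec", "exec"]
n_insns = 3

for cycle in range(1, n_insns + len(stages)):
    slots = []
    for s, stage in enumerate(stages):
        i = cycle - 1 - s                  # which instruction occupies this stage?
        slots.append(f"{stage}:insn{i}" if 0 <= i < n_insns else f"{stage}:-")
    print(f"cycle {cycle}:", "  ".join(slots))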
16. Instruction Pipeline
• To implement pipelining, a designer divides a processor's data path
into sections (stages) and places pipeline latches (also called buffers)
between each pair of stages
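In the software sketches that accompany the following slides, each pipeline latch is modelled as a plain dictionary whose keys mirror the fields named on the stage diagrams; the key names are my own choice, not terminology from the lecture.

# The four latches of a 5-stage pipeline, modelled as dictionaries.
if_id  = {"instruction": None, "pc_plus_1": None}
id_ex  = {"opcode": None, "dest": None, "valA": None,
          "valB": None, "offset": None, "pc_plus_1": None}
ex_mem = {"opcode": None, "dest": None, "alu_result": None,
          "valB": None, "target": None}
mem_wb = {"opcode": None, "dest": None, "alu_result": None,
          "loaded_data": None}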
19. A Simple Implementation of a RISC Instruction Set
• Every instruction in this RISC subset can be implemented in at most
5 clock cycles. The 5 clock cycles are as follows:
Instruction Fetch (IF)
• The Instruction Fetch (IF) stage is responsible for obtaining the requested
instruction from memory. The instruction and the program counter (which is
incremented to point to the next instruction) are stored in the IF/ID pipeline register
as temporary storage so that they may be used in the next stage at the start of the
next clock cycle.
• Send the program counter (PC) to memory and fetch the current instruction
from memory. Update the PC to the next sequential PC by adding 4 (since
each instruction is 4 bytes) to the PC.
20. Stage 1: Fetch
• Fetch an instruction from memory every cycle
– Use PC to index memory
– Increment PC (assume no branches for now)
• Write state to the pipeline register (IF/ID)
– The next stage will read this pipeline register
21. Stage 1: Fetch Diagram
[Diagram: the PC indexes the instruction cache; an adder produces PC+1 and a MUX chooses between PC+1 and a branch target fed back from Decode; the instruction bits and PC+1 are written into the IF/ID pipeline register.]
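A possible software sketch of the fetch stage (the instruction strings and field names are assumptions made for this toy model): index the instruction memory with the PC, latch the instruction and PC+1 into IF/ID, and advance the PC.

instr_mem = ["addi r1, r0, 5",
             "addi r2, r0, 7",
             "add  r3, r1, r2"]

def fetch(pc, if_id):
    if_id["instruction"] = instr_mem[pc]   # instruction bits
    if_id["pc_plus_1"] = pc + 1            # saved for later stages
    return pc + 1                          # next sequential PC (no branches modelled)

if_id = {}
pc = fetch(0, if_id)
print(if_id)   # {'instruction': 'addi r1, r0, 5', 'pc_plus_1': 1}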
22. Instruction Decode
• The Register Fetch (REG) and Instruction Decode (ID) stage is responsible for
decoding the instruction and sending out the various control lines to the other
parts of the processor. The instruction is sent to the control unit, where it is
decoded, and the registers are fetched from the register file.
23. Stage 2: Decode
• Decodes opcode bits
– Set up Control signals for later stages
• Read input operands from register file
– Specified by decoded instruction bits
• Write state to the pipeline register (ID/EX)
– Opcode
– Register contents
– PC+1 (even though decode didn’t use it)
– Control signals (from insn) for opcode and destReg
24. Stage 2: Decode Diagram
[Diagram: the instruction bits and PC+1 arrive from the IF/ID pipeline register; the register file is read at ports regA and regB (its write port, destReg/data, is driven later by Write-back); regA contents, regB contents, PC+1, and the control signals are written into the ID/EX pipeline register, which feeds Execute.]
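Continuing the same toy model (the text format "op rd, rs, rt-or-immediate" is an assumption): decode splits the instruction, reads the register file, and fills the ID/EX latch with the operands, the destination, and PC+1.

regs = [0] * 8                                   # register file r0..r7

def decode(if_id, id_ex):
    op, rest = if_id["instruction"].split(maxsplit=1)
    dest, src_a, src_b = [f.strip() for f in rest.split(",")]
    id_ex["opcode"] = op                         # control signals derive from this
    id_ex["dest"] = int(dest[1:])                # destination register number
    id_ex["valA"] = regs[int(src_a[1:])]         # regA contents
    if src_b.startswith("r"):
        id_ex["valB"], id_ex["offset"] = regs[int(src_b[1:])], None
    else:
        id_ex["valB"], id_ex["offset"] = None, int(src_b)   # immediate / offset
    id_ex["pc_plus_1"] = if_id["pc_plus_1"]      # carried along, e.g. for branches

if_id = {"instruction": "addi r1, r0, 5", "pc_plus_1": 1}
id_ex = {}
decode(if_id, id_ex)
print(id_ex)   # opcode 'addi', dest 1, valA 0, offset 5, ...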
25. Execution
• The Effective Address/Execution (EX) stage is where any calculations are performed.
The main component in this stage is the ALU, which provides the arithmetic and
logic capabilities.
• The ALU operates on the operands prepared in the prior cycle, performing one of
three functions depending on the instruction type.
■ Memory reference—The ALU adds the base register and the offset to form
the effective address.
■ Register-Register ALU instruction—The ALU performs the operation
specified by the ALU opcode on the values read from the register file.
■ Register-Immediate ALU instruction—The ALU performs the operation specified by the ALU
opcode on the first value read from the register file and the sign-extended immediate.
• In a load-store architecture the effective address and execution cycles can be
combined into a single clock cycle, since no instruction needs to simultaneously
calculate a data address and perform an operation on the data.
26. Stage 3: Execute
• Perform ALU operations
– Calculate result of instruction
• Control signals select operation
• Contents of regA used as one input
• Either regB or constant offset (from insn) used as second input
– Calculate PC-relative branch target
• PC+1+(constant offset)
• Write state to the pipeline register (EX/Mem)
– ALU result, contents of regB, and PC+1+offset
– Control signals (from insn) for opcode and destReg
27. Stage 3: Execute Diagram
[Diagram: regA contents, regB contents, PC+1, and control signals arrive from the ID/EX pipeline register; a MUX selects regB contents or the sign-extended offset as the second ALU input, and an adder forms PC+1+offset (the branch target); the ALU result, regB contents, the target, and the control signals (including destReg) are written into the EX/Mem pipeline register, which feeds Memory.]
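The execute stage of the same toy model (only add/addi and the lw/sw address calculation are handled; the rest is deliberately left out): a MUX, here just a conditional, picks regB contents or the offset as the second ALU input, and an adder forms PC+1+offset as the branch target.

def execute(id_ex, ex_mem):
    op = id_ex["opcode"]
    a = id_ex["valA"]
    b = id_ex["valB"] if id_ex["valB"] is not None else id_ex["offset"]   # ALU input MUX

    if op in ("add", "addi"):
        result = a + b                           # register-register / register-immediate
    elif op in ("lw", "sw"):
        result = a + id_ex["offset"]             # effective address = base + offset
    else:
        result = None                            # other opcodes not modelled

    ex_mem["opcode"] = op
    ex_mem["dest"] = id_ex["dest"]
    ex_mem["alu_result"] = result
    ex_mem["valB"] = id_ex["valB"]               # store data travels with the instruction
    ex_mem["target"] = id_ex["pc_plus_1"] + (id_ex["offset"] or 0)   # PC+1+offset

id_ex = {"opcode": "addi", "dest": 1, "valA": 0, "valB": None,
         "offset": 5, "pc_plus_1": 1}
ex_mem = {}
execute(id_ex, ex_mem)
print(ex_mem)   # alu_result 5, target 6, dest 1, ...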
28. Memory and IO
• The Memory Access and IO (MEM) stage is responsible for
storing and loading values to and from memory. It is also
responsible for input to and output from the processor. If the
current instruction is not a memory or IO instruction, then the result
from the ALU is passed through to the write-back stage.
• If the instruction is a load, the memory does a read using the
effective address computed in the previous cycle. If it is a
store, then the memory writes the data from the second
register read from the register file using the effective address.
29. Stage 4: Memory
• Perform data cache access
– ALU result contains address for LD or ST
– Opcode bits control R/W and enable signals
• Write state to the pipeline register (Mem/WB)
– ALU result and Loaded data
– Control signals (from insn) for opcode and destReg
30. Stage 4: Memory Diagram
[Diagram: the ALU result from the EX/Mem pipeline register drives the data cache address (in_addr) and regB contents drive the write data (in_data); control signals set the R/W and enable inputs; the ALU result, the loaded data, and the control signals (including destReg) are written into the Mem/WB pipeline register, which feeds Write-back.]
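The memory stage of the toy model (the 32-word data_mem list standing in for the data cache is an assumption): a load reads at the ALU-computed address, a store writes regB contents there, and every other instruction just passes its ALU result through.

data_mem = [0] * 32                              # stand-in for the data cache

def memory(ex_mem, mem_wb):
    op, addr = ex_mem["opcode"], ex_mem["alu_result"]
    loaded = None
    if op == "lw":
        loaded = data_mem[addr]                  # read using the effective address
    elif op == "sw":
        data_mem[addr] = ex_mem["valB"]          # write the second register's value
    mem_wb["opcode"] = op
    mem_wb["dest"] = ex_mem["dest"]
    mem_wb["alu_result"] = ex_mem["alu_result"]  # passed through for ALU instructions
    mem_wb["loaded_data"] = loaded

ex_mem = {"opcode": "addi", "dest": 1, "alu_result": 5, "valB": None, "target": 6}
mem_wb = {}
memory(ex_mem, mem_wb)
print(mem_wb)   # opcode 'addi', alu_result 5, loaded_data None, dest 1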
31. Write Back
• The Write Back (WB) stage is responsible for writing
the result of a calculation, memory access or input
into the register file.
• Register-Register ALU instruction or load instruction:
• Write the result into the register file, whether it
comes from the memory system (for a load) or from
the ALU (for an ALU instruction).
32. Stage 5: Write-back
• Writing result to register file (if required)
– Write Loaded data to destReg for LD
– Write ALU result to destReg for arithmetic insn
– Opcode bits control register write enable signal
33. Stage 5: Write-back Diagram
[Diagram: the ALU result and the loaded data arrive from the Mem/WB pipeline register; a MUX driven by the control signals selects one of them, and the chosen value is sent to the register file's destReg/data write port back in the Decode-stage hardware.]
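And the write-back stage of the toy model: the MUX becomes an if/else on the opcode, choosing the loaded data for a load and the ALU result for an arithmetic instruction, and the register file write is a list assignment.

regs = [0] * 8

def write_back(mem_wb):
    op = mem_wb["opcode"]
    if op == "lw":
        regs[mem_wb["dest"]] = mem_wb["loaded_data"]   # result from the memory system
    elif op in ("add", "addi"):
        regs[mem_wb["dest"]] = mem_wb["alu_result"]    # result from the ALU
    # stores (and branches) write no destination register

write_back({"opcode": "addi", "dest": 1, "alu_result": 5, "loaded_data": None})
print(regs)   # [0, 5, 0, 0, 0, 0, 0, 0]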
34. Putting It All Together
[Diagram: the full 5-stage datapath. The PC indexes the instruction cache; IF/ID holds the instruction and PC+1; the register file (R0–R7, with R0 read as 0) is read in Decode; ID/EX holds op, dest, valA, valB, the offset, and PC+1; the ALU, a target adder, and an eq? comparison sit in Execute; EX/Mem holds op, dest, the ALU result, valB, and the branch target; the data cache sits in Memory; Mem/WB holds op, dest, the ALU result, and the loaded data (mdata); MUXes select the write-back value for the register file's dest/data port and the next PC.]
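To tie the stage sketches together, here is one possible self-contained toy simulator (my construction, not the lecture's design): each cycle it evaluates Write-back, Memory, Execute, Decode, and Fetch in that order, so every stage reads the latch its upstream neighbour wrote in the previous cycle, and a register write becomes visible to a decode in the same cycle. Forwarding, stalling, loads/stores, and branches are not modelled, so the test program is written to be hazard-free (two filler instructions separate the dependent ones).

instr_mem = ["addi r1, r0, 5",       # r1 = 5
             "addi r2, r0, 7",       # r2 = 7
             "addi r4, r0, 1",       # filler, keeps the dependent add far enough away
             "addi r5, r0, 2",       # filler
             "add  r3, r1, r2"]      # r3 = 12
regs = [0] * 8                       # r0 is never written here, so it stays 0

if_id = id_ex = ex_mem = mem_wb = None   # empty latches = bubbles
pc = 0

for cycle in range(1, len(instr_mem) + 5):
    # Write-back: uses the Mem/WB latch written last cycle.
    if mem_wb and mem_wb["opcode"] in ("add", "addi"):
        regs[mem_wb["dest"]] = mem_wb["alu_result"]

    # Memory: no loads/stores in this program, so just pass the ALU result along.
    mem_wb = dict(ex_mem) if ex_mem else None

    # Execute: second ALU input is valB for add, the immediate for addi.
    ex_mem = None
    if id_ex:
        b = id_ex["valB"] if id_ex["valB"] is not None else id_ex["offset"]
        ex_mem = {"opcode": id_ex["opcode"], "dest": id_ex["dest"],
                  "alu_result": id_ex["valA"] + b}

    # Decode: runs after write-back, so this cycle's WB result is already visible.
    id_ex = None
    if if_id:
        op, rest = if_id["instruction"].split(maxsplit=1)
        dest, src_a, src_b = [f.strip() for f in rest.split(",")]
        val_b, offset = ((regs[int(src_b[1:])], None) if src_b.startswith("r")
                         else (None, int(src_b)))
        id_ex = {"opcode": op, "dest": int(dest[1:]),
                 "valA": regs[int(src_a[1:])], "valB": val_b, "offset": offset}

    # Fetch: bring in the next instruction while the earlier ones keep moving.
    if_id = None
    if pc < len(instr_mem):
        if_id = {"instruction": instr_mem[pc], "pc_plus_1": pc + 1}
        pc += 1

print(regs)   # [0, 5, 7, 12, 1, 2, 0, 0] after the pipeline drains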
35. Characterize Pipelines
1) Hardware or software implementation – pipelining can be implemented in
either software or hardware.
2) Large or Small Scale – Stations in a pipeline can range from simplistic to
powerful, and a pipeline can range in length from short to long.
3) Synchronous or asynchronous flow – A synchronous pipeline operates like an
assembly line: at a given time, each station is processing some amount of
information. An asynchronous pipeline allows a station to forward information at
any time.
4) Buffered or unbuffered flow – One stage of the pipeline either sends data directly
to the next, or a buffer is placed between each pair of stages (see the software
sketch after this list).
5) Finite Chunks or Continuous Bit Streams – The digital information that passes
through a pipeline can consist of a sequence of small data items or an arbitrarily
long bit stream.
6) Automatic Data Feed Or Manual Data Feed – Some implementations of
pipelines use a separate mechanism to move information, and other
implementations require each stage to participate in moving information.
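For points 3) and 4), a minimal software illustration (assumed, not from the slides) is a two-stage pipeline built from threads with a bounded queue as the buffer between the stations: each station forwards information whenever it is ready, which is the asynchronous, buffered case.

import queue
import threading

buf = queue.Queue(maxsize=2)         # the buffer between the two stations

def station1(items):
    for x in items:
        buf.put(x * x)               # forward whenever ready (asynchronous flow)
    buf.put(None)                    # sentinel: no more work

def station2():
    while True:
        x = buf.get()
        if x is None:
            break
        print("station2 received", x)

t1 = threading.Thread(target=station1, args=([1, 2, 3],))
t2 = threading.Thread(target=station2)
t1.start(); t2.start()
t1.join(); t2.join()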