CS4109 Computer System Architecture

CS4109
Computer System Architecture
By
Prof. K.Sridhar Patnaik
Department of Computer Science and
Engineering,
BIT Mesra, Ranchi

Course Objectives
• To Learn:
1)How computers work,basic principles.
2)How to analyze their performance(or how not
to)
3)How computers are designed and built.
4)Issues affecting modern
processors(cache,pipelines etc)

Course Motivation
This knowledge will be useful if you need to
1)Design/built a new computer-rare opportunity
2)Design/built a new version of a computer.
3)Improve software performance.
4)Purchase a computer
5)Provide a solution with embedded computer.

Introduction
• I think it’s fair to say that personal computers
have become the most empowering tool
we’ve ever created. They’re tools of
communication, they’re tools of creativity, and
they can be shaped by their user.
Bill Gates, February 24, 2004

Computers?
A computer is a general purpose device that can be programmed to process
information, and yield meaningful results.
McGrawHill

Computers for..
• Processing and Communication
• For processing (ex. Numbers) requires
processor , memory and I/O(input-output)
Processor-for processing.
Memory-for storage.
I/O -can be considered as extended
memory.(Also helps us in interacting with m/c)

Computer Architecture Vs Computer
Organization
You can study Computer Systems from users(car driver)
point of view or designer’s(car mechanic) point of view
Computer Architecture :The view of a computer as
presented to software designers.
i.e. from designers(hardware) point of view.
Building Architecture: Structural Design(Civil Engg)
Computer Architecture: Circuit Design(EE)
Computer Organization: The actual implementation of a
computer in hardware.
i.e. from users(programmers/software) point of view.

• Ex.Mutiplier,from organization point of view you know there is a
multiplier, one need not bother about how it is designed .Similarly there is
an instruction set and the user know the instruction that is enough and he
is not bothered about how it is implemented.
• So how multiplier and instruction set are implemented is the job of the
designers.
• Application spectrum

Computer
Program Information
store
results
• Program – List of instructions given to the computer
• Information store – data, images, files, videos
• Computer – Process the information store according to the
instructions in the program
McGrawHill

Inside
CPU
Memory
Hard disk
 Let us take the lid off a desktop computer

Let us make it a full system ...
Computer
Memory Hard disk
Keyboard
Mouse
Monitor
Printer

MEMORY Storage
Control
Datapath

• Processor keep addressing the memory and get the
data from memory for processing.
• Processor consists(DP ckts+CP ckts)
• Processor must be able to set some kind of path to
deal or handle the data for processing ( data path
setting).
• Also able to carry out the processing through a
sequence of control signals.
• The one which interact the CPU indirectly is the
storage(magnetic tape, disc system etc)
• The CPU deals with memory in the same way as it
deals with I/O.

SOFTWARE ABSTRACTION
• int sum(int x,int y) HLL(C)
• {
• int z=x+y;
• return z;}
• 0x401040<sum>:0x55 MACHINE CODE
0x89
• 0xe5
• 0x8b
• 0x45
• 0x0C
• 0x03
• 0x45
• 0x08
• 0x89
• 0xec
• 0x5d
• 0xc3
• _sum: ASSEMBLY
• pushl%ebp
• movl%esp,%ebp
• movl 12(%ebp),%eax
• addl 8(%ebp),%eax
• movl %ebp,%esp
• popl %ebp
• ret

HARDWARE ABSTRACTION
CPU
Register file
System bus Memory bus
I/O BUS
Expansion slots
for other devices
such as network
adapters
PC
BUS INTERFACE
ALU
BRIDGE MAIN
MEMORY
GRAPHICS
Adapter
USB
Controller
DISK
Controller
DISK
MOUSE KEYBOARD
E
DISPLAY

HARDWARE/SOFTWARE INTERFACE
SOFTWARE
Our focus
HARDWARE
C++
M/C Intruction
Reg,Adder
Transistors
…………………………………………………………………………………………………

How does an Electronic Computer
Differ from our Brain ?
• Computers are ultra-fast and ultra-dumb
Feature Computer Our Brilliant Brain
Intelligence Dumb Intelligent
Speed of basic calculations Ultra-fast Slow
Can get tired Never After sometime
Can get bored Never Almost always

What Can a Computer Understand ?
 Computer can clearly NOT understand
instructions of the form
 Multiply two matrices
 Compute the determinant of a matrix
 Find the shortest path between Mumbai and Delhi
 They understand :
 Add a + b to get c
 Multiply a + b to get c

Architecture levels
• Instruction set architecture:
Lowest level visible to programmer
• Micro architecture:
Fills the gap between instructions and logic
modules
The semantics of all the instructions supported by a processor is known
as its instruction set architecture (ISA). This includes the semantics of
the instructions themselves, along with their operands, and interfaces
with peripheral devices.

Abstract Machine
Addresses
Instructions
Data
CPU
MEMORY

• Programmer -Visible State:
PC-Program Counter
Register File-Heavily used data
Condition Codes
Memory-Byte Array
-Code+Data
-Stack

Why different processors?
• What is the difference between processors
used in desk-tops, lap-tops,mobile phones,
washing machines etc.?
• Performance/speed
• Power consumption
• Cost
• General purpose/special purpose

Topics to be Covered
• Performance issues
• A specific instruction set architecture
• Arithmetic and how to build an ALU
• Constructing a processor to execute
instruction
• Pipelining to improve performance
• Memory:Caches and Virtual memory
• Input/output

Features of an ISA
 Example of instructions in an ISA
 Arithmetic instructions : add, sub, mul, div
 Logical instructions : and, or, not
 Data transfer/movement instructions
 Complete
 It should be able to implement all the
programs that users may write.

Features of an ISA – II
 Concise
 The instruction set should have a limited size.
Typically an ISA contains 32-1000 instructions.
 Generic
 Instructions should not be too specialized, e.g.
add14 (adds a number with 14) instruction is too
specialized
 Simple
 Should not be very complicated.

Designing an ISA
 Important questions that need to be answered :
 How many instructions should we have ?
 What should they do ?
 How complicated should they be ?
Two different paradigms : RISC and CISC
RISC
(Reduced Instruction Set
Computer)
CISC
(Complex Instruction
Set Computer)

RISC vs CISC
A reduced instruction set computer (RISC) implements
simple instructions that have a simple and regular
structure. The number of instructions is typically a small
number (64 to 128). Examples: ARM, IBM PowerPC,
HP PA-RISC
A complex instruction set computer (CISC) implements
complex instructions that are highly irregular, take multiple
operands, and implement complex functionalities.
Secondly, the number of instructions is large (typically
500+). Examples: Intel x86, VAX

Completeness of an ISA – II
How to ensure that we have just enough instructions such that
we can implement every possible program that we might want
to write ?
Answer:
 Let us look at results in theoretical computer science
 Is there an universal ISA ?
The universal machine has a set of basic actions, and each such
action can be interpreted as an instruction.
Universal ISA Universal Machine

The Turing Machine – Alan Turing
 Facts about Alan Turing
 Known as the father of computer science
 Discovered the Turing machine that is the most
powerful computing device known to man
 Indian connection : His father worked with the
Indian Civil Service at the time he was born. He
was posted in Chhatrapur, Odisha.

Turing Machine
Infinite Tape
State Register Tape Head
L R
Action Table• The tape
head can
only move
left or right
(old state, old symbol) -> (new state, new symbol, left/right)

Operation of a Turing Machine
 There is an inifinite tape that extends to the left and right. It
consists of an infinite number of cells.
 The tape head points to a cell, and can either move 1 cell to the left
or right
 Based on the symbol in the cell, and its current state, the Turing
machine computes the transition :
 Computes the next state
 Overwrites the symbol in the cell (or keeps it the same)
 Moves to the left or right by 1 cell
 The action table records the rules for the transitions.

Example of a Turing Machine
• Design a Turing machine to increment a number by 1.
 Start from the rightmost position. (state = 1)
 If (state = 1), replace a number x, by x+1 mod 10
 The new state is equal to the value of the carry
 Keep going left till the '$' sign
Tape Head
$3 4 6 9$ 7

More about the Turing Machine
 This machine is extremely simple, and
extremely powerful
 We can solve all kinds of problems – mathematical
problems, engineering analyses, protein folding,
computer games, …
 Try to use the Turing machine to solve many more
types of problems (TO DO)

Church-Turing Thesis
Church-Turing thesis: Any real-world computation can be translated
into an equivalent computation involving a Turing machine.
(source: Wolfram Mathworld).
Any computing system that is equivalent to a Turing machine is said to be
Turing complete.
Universal Turing Machine
For every problem in the world, we can design a Turing Machine (Church-Turing thesis)
Can we design a universal Turing machine that can simulate any Turing machine.
This will make it a universal machine (UTM)
Why not? The logic of a Turing machine is really simple.
We need to move the tape head left, or right, and update the symbol and
state based on the action table. A UTM can easily do this.
A UTM needs to have an action table, state register, and tape that can
simulate any arbitrary Turing machine.

Prog. 1 Prog. 2 Prog. 3
Turing Machine 1 Turing Machine 2 Turing Machine 3

A Universal Turing Machine
Generic State Register Tape Head
L R
Generic Action Table
Simulated Action Table
Simulated State Register
Work Area
37

A Universal Turing Machine - II
Generic State Register Tape Head
L R
Generic Action Table
Simulated Action Table
Simulated State Register
Work Area
CPU
Data Memory
Instruction
Memory
Program Counter
(PC)

Computer Inspired from the Turing
Machine
CPU
Program
Counter (PC)
Program Data
Memory
Control
Unit
Arithmetic
Unit
Instruction
39

Elements of a Computer
 Memory (array of bytes) contains
 The program, which is a sequence of instructions
 The program data → variables, and constants
 The program counter(PC) points to an instruction in a program
 After executing an instruction, it points to the next instruction
by default
 A branch instruction makes the PC point to another instruction
(not in sequence)
 CPU (Central Processing Unit) contains the
 Program counter, instruction execution units

Designing Practical Machines
Harvard Architecture
CPU
Control
ALU
Instruction
memory
Data
memory
I/O devices

Von-Neumann Architecture
CPU
Control
ALU
Memory I/O devices

Problems with Harvard/ Von-Neumann
Architectures
 The memory is assumed to be one large array of
bytes
 It is very very slow
 Solution:
 Have a small array of named locations (registers) that can
be used by instructions
 This small array is very fast
General Rule: Larger is a structure, slower it is
Insight: Accesses exhibit locality (tend to use the same
variables frequently in the same window of time)
43

Uses of Registers
 A CPU (Processor) contains set of registers (16-64)
 These are named storage locations.
 Typically values are loaded from memory to registers.
 Arithmetic/logical instructions use registers as input
operands
 Finally, data is stored back into their memory locations.
44

Example of a Program in Machine
Language with Registers
 r1, r2, and r3, are registers
 mem → array of bytes representing memory
1: r1 = mem[b] // load b
2: r2 = mem[c] // load c
3: r3 = r1 + r2 // add b and c
4: mem[a] = r3 // save the result
45

Machine with Registers
CPU
Control
ALU
Memory I/O devices
Registers

Performance Measure
• When we say one computer has better performance than
another , what do we mean?
Airplane Passenger
Capacity
Cruising
Range(miles)
Cruising
speed(mph)
Passenger
throughput
(passenger X mph)
A1 300 4000 600 180000
A2 400 3500 600 240000
A3 130 3400 1000 130000
A4 140 8000 500 70000

• Consider different measures of performance, the plane with highest
cruising speed is A3 ,the plane with the longest range is A4,and the
plane with the largest capacity is the A2 .
• Suppose we define performance in terms of speed.Then the fastest
plane as the one with the highest cruising speed,taking less
passengers from one point to another in the least time.
• If you are interested in transporting 400 passengers than A2 would
clearly be the fastest.
• Similarly , we can define computer performance in several different
ways

• If you are running a program on two desktop computers, The faster
one is the one that gets the job done first.
• If you were running a datacenter that had several servers running
jobs submitted by many users,the faster computer was the one that
completed the most jobs during a day.
• As an individual computer user we are interested in reducing
response time(execution time).
• Datacenter managers are often interested in increasing throughput
or bandwidth(total amount of work done in a given time)
• Hence in most cases we need different performance metrics as
well as different sets of applications to benchmark embedded and
desktop computers,which are more focused on response
time,versus servers,which are more focused on throughput

Throughput and Response time
• Do the following changes to a computer system increase throughput, decrease
response time:
1.Replacing the processor in a computer with a faster version.
2.Adding additional processors to a system that uses multiple processors for separate
task-ex searching the www.
In case 1 decreasing response time almost always improves throughput, hence case 1
improves both.
In case 2 throughput only increases.
Note-The demand for processing in the second case was almost as large as the
throughput,the system might force requests to queue up.
In this case the throughput could also improve response time, since it would reduce
the waiting time in the queue.
Thus, in many real computer systems, changing either execution time or throughput
often affects the other.

Performance Measure Equations
• Performance=1/Execution time
• For two computers X and Y,if the performance of X is greater than Y ,
• 𝑃𝑋 > 𝑃𝑌 =
1
𝐸𝑥𝑒.𝑡𝑖𝑚𝑒 𝑋
>
1
𝐸𝑥𝑒.𝑡𝑖𝑚𝑒 𝑌
=𝐸𝑥𝑒. 𝑡𝑖𝑚𝑒 𝑌 > 𝐸𝑥𝑒. 𝑡𝑖𝑚𝑒𝑋
• Do this-If computer A runs a program in 10 sec and computer B runs the same
program in 15 sec,how much faster is A than B?
• Response time,or elapsed time-Total time to complete a task,including disk
accesses,memory accesses,I/O activities,OS overhead
• CPU Execution time/CPU time-The actual time the CPU spends computing
specific task.(does not include time spent waiting for I/O or running other
programs.
• CPU time=user CPU time+system CPU time
User CPU time(CPU time spent in the the program)
System CPU time(CPU time spent in the OS performing tasks on behalf of the
program)

• Differentiating user and system CPU time is difficult to do accurately
,because it is hard to assign responsibility for OS activities to one user
program rather than another and because of the functionality differences
among OSs.
• CPU execution time for a program=CPU clock cycles for a program X Clock
cycle time.
=CPU clock cycles for a program /Clock rate
Note-The hardware designer can improve performance by reducing the
number of clock cycles required for a program or the length of the clock cycle.
The designer often faces a trade off between the number of clock cycles
needed for a program and the length of each cycle.
Many techniques that decrease the number of clock cycles may also increase
the clock cycle time.

• Do this-A program runs in 10secs on computer A,which has a 2GHz clock.
You as a designer build a Computer B which will run the same program in
6secs.As a designer you determined that a substantial increase in the clock
rate is possible, but this increase will affect the rest of the CPU
design,causing computer B to require 1.2 times as many clock cycles as
computer A for this program. What clock rate will the designer has to
target?

• CPU time 𝐴 =CPU clock cycles 𝐴 /Clock rate 𝐴
10 secs=CPU clock cycles 𝐴/2 × 109
cycles/secs
CPU clock cycles 𝐴= 20 × 109 cycles
CPU time 𝐵=1.2 × CPU clock cycles 𝐴/Clock rate 𝐵
6 secs=1.2 × 20 × 109 cycles /Clock rate 𝐵
Clock rate 𝐵=4× 109 cycles/secs =4GHz.
To run the program in 6 secs, B must have twice the clock rate of A.

Instruction Performance
• CPU execution time for a program=CPU clock cycles for a program X Clock
cycle time.
• CPU clock cycles= instructions for a program X Avg clock cycles per
instruction
• Avg clock cycles per instruction(CPI).
• Do this-Suppose we have two implementations of the same ISA.
Computer A has a clock cycle time of 250ps and a CPI of 2.0 for some
program, and computer B has a clock cycle time of 500ps and a CPI of 1.2
for the same program. Which computer is faster for this program and by
how much?
• The classic CPU performance equation=
CPU time=Instruction count X CPI X Clock cycle time
=(Instruction count X CPI)/Clock rate

SPEC CPU Benchmark
• To evaluate two computer systems, a user would simply compare the
execution time of the workload on the two computers.
• Workload=A set of programs run on computer that is either the actual
collection of applications run by a user or constructed from real programs
to approximate such a mix.
• Benchmark: A program selected for use in comparing computer
performance
• SPEC:(Standard Performance Evaluation Cooperation)is an effort funded
and supported by a number of computer vendors to create standard sets
of benchmarks for modern computer systems.
• In 1989, SPEC created a benchmark set focusing on processor
performance(SPEC89) which evolved through 5 generations.
• SPEC CPU2006 (consists of a set of 12 integer benchmarks(CINT2006) and
17 floating point benchmarks(CFP2006))

• The integers benchmarks – C compiler,chess program,quantum computer
simulations.
• The floating point benchmarks: Structure grid codes for finite element
modeling, particle method codes for molecular dynamics,sparse linear
algebra codes for fluid dynamics.
• Check for SPEC14.
• SPECRatio=𝐸𝑋𝐸 𝑡𝑖𝑚𝑒 𝑅𝐸𝐹/EXE Time
• Take Geometric mean of SPECRatios
• When comparing two computers using SPEC ratios,use the GM so that it
gives the same relative answer no matter what computer is used to
normalize the results.If we averaged the normalized exe time values with
an AM,the results would vary depending on the computer we choose as
the reference.

SPECINTC2006 benchmarks running on
AMD Opteron X4 model
Des Name Inst
count
X𝟏𝟎 𝟗
CPI Clock
cycle
time
(sec
X𝟏𝟎 𝟗
)
Exe time
(secs)
Ref time
(secs)
SPEC
ratio
Interpreted
String
processing
Perl 2118 0.75 0.4 637 9770 15.3
GNU C
compiler
Gcc 1050 1.72 0.4 724 8050 11.1

Q1:Show that the ratio of the geometric means is equal to the
geometric mean of the performance ratios, and that the reference
computer of SPECRatio matters not.
Assume two computers A and B and a set of SPEC ratios for each.
𝐺𝑒𝑜𝑚𝑒𝑡𝑟𝑖𝑐 𝑚𝑒𝑎𝑛 𝐴
𝐺𝑒𝑜𝑚𝑒𝑡𝑟𝑖𝑐 𝑚𝑒𝑎𝑛 𝐵
=
𝑆𝑃𝐸𝐶𝑅𝑎𝑡𝑖𝑜 𝐴 𝑖
𝑛
𝑖=1
𝑛
𝑆𝑃𝐸𝐶𝑅𝑎𝑡𝑖𝑜 𝐵 𝑖
𝑛
𝑖=1
𝑛
=
𝑆𝑃𝐸𝐶𝑅𝑎𝑡𝑖𝑜 𝐴 𝑖
𝑆𝑃𝐸𝐶𝑅𝑎𝑡𝑖𝑜 𝐵 𝑖
𝑛
𝑖<1
𝑛
=
𝑒𝑥𝑒𝑐𝑢𝑡𝑖𝑜𝑛 𝑡𝑖𝑚𝑒 𝐵 𝑖
𝑒𝑥𝑒𝑐𝑢𝑡𝑖𝑜𝑛 𝑡𝑖𝑚𝑒 𝐴 𝑖
𝑛
𝑖<1
𝑛
=
𝑝𝑒𝑟𝑓𝑜𝑟𝑚𝑎𝑛𝑐𝑒 𝐴 𝑖
𝑝𝑒𝑟𝑓𝑜𝑟𝑚𝑎𝑛𝑐𝑒 𝐵 𝑖
𝑛
𝑖<1
𝑛
The geometric mean of the ratios is the same as the ratio of the
geometric means.

Fallacies and Pitfalls
• Pitfall: Expecting the improvement of one aspect of a computer to
increase overall performance by an amount proportional to the size
of the improvement.
• This pitfall has visited designers of both h/w and s/w.
• Ex. Suppose a program runs in 100secs on a computer,with multiply
operations responsible for 80secs of this time.How much do you
have to improve the speed of multiplication if you want your
program to run five times faster.
• The exe time of the program after making the improvement is given
by the equation known as Amdahl’s Law.
• Exe time after improvement = (Exe time affected by
improvement/Amount of improvement ) +Exe time unaffected

Exe time after improvement=80/n +(100-80)
20=80/n +(100-80)
0=80/n(i.e there is no amount by which we can enhance –
multiply to achieve a fivefold increase in performance ,if multiple
accounts for only 80% of the workload.
The performance enhancement possible with a given
improvement is limited by the amount that the improved feature
is used(it’s the law of diminishing returns).
We can use Amdahl’s law to estimate performance
improvements when we know the time consumed for some
function and its potential speedup.

Amdahl’s Law
Parent thread
Time
Spawn child threads
Child
threads
Thread join operation
Sequential
section
Initialisation

 For P parallel processors, we can expect a speedup of
P (in the ideal case)
 Let us assume that a program takes Told units of time
 Let us divide into two parts – sequential and
parallel
 Sequential portion : Told * fseq
 Parallel portion : Told * (1 - fseq )
fseq =fraction of time spend in sequential part
(1 - fseq )= fraction of time spend in parallel part

 Only, the parallel portion gets sped up P times
 The sequential portion is unaffected
 Equation for the time taken with parallelisation
 The speedup is thus : :
Told
𝑇𝑛𝑒𝑤

Implications
 Consider multiple values of fseq
 Speedup vs Number of processors
45
40
35
30
25
20
15
10
5
0
0 50 100 150 200
Speedup(S)
Number of processors (P)
10%
5%
2%

Conclusions
 We observe that with an increasing number of processors the
speedup gradually saturates and tends to the limiting
value,1
𝑓𝑠𝑒𝑞
.We observe diminishing returns as we increase the
number of processors beyond certain point.
 We are limited by the size of the sequential section
 For a very large number of processors, the parallel section is
actually very small
 Ideally, a parallel workload should have as small a sequential
section as possible.

• Fallacy: Computers at low utilization use little power.
Power efficiency matters at low utilizations because server
workloads vary. CPU utilization for servers at Google,is between
10% to 50% most of the time.(SPEC Power Benchmark)
Server
Manufacturer
Microprocessor Total
cores/sockets
Clock
rate
Peak
Perfm
100%
load
power
50%
load
power
10%load
power
Idle
power
HP XenonE5440 8/2 3GHz 308022 269W 227W 174W 160W
Dell XenonE5440 8/2 2.8GHz 305413 276W 230W 173W 157W
Fujitsu Seimens XenonX3220 4/1 2.4GHz 143742 132W 110W 85W 60W

• Even servers that are only 10% utilized burn about two-third of
their peak power.
• Since servers’ workloads vary but use a large fraction of peak
power,so we should redesign h/w to achieve “energy proportional
computing”.If future servers are used ,say 10% of peak power at
10% workload,we could reduce the electricity bill of
datacentres(also a Concern of CO2 emissions)
• Pitfall: Using a subset of the performance equation as a
performance metric.
Measuring performance using clock rate or CPI is a fallacy ,using two or
three factors to compare performance may be valid in a limited
context or misused.
Alternative to time is MIPS(Million inst per secs)=
𝐼𝑛𝑠𝑡 𝑐𝑜𝑢𝑛𝑡/(𝐸𝑥𝑒𝑡𝑖𝑚𝑒 × 106)

• Problems with MIPS:
Cannot compare computers with different instruction sets using
MIPS.
MIPS varies between programs on the same
computer.(computer can not have a single MIPS rating)
If a new program executes more instructions but each
instruction is faster, MIPS can vary independently from
performance.
𝑀𝐼𝑃𝑆 =
𝐼𝐶
𝐼𝐶×𝐶𝑃𝐼×106
𝑐𝑙𝑜𝑐𝑘 𝑟𝑎𝑡𝑒
=
𝑐𝑙𝑜𝑐𝑘 𝑟𝑎𝑡𝑒
𝐶𝑃𝐼×106

ARM Instruction Set
• 32 bit instruction set.
• Arthmetic(Add,Sub….)
• Data Transfer(load register,store register,mov etc.)
• Logical(and,or,not…)
• Conditional branch(branch on EQ,NE,LT,LE…)
• Unconditional branch(branch(always),branch and link)
Name Example Comments
16 registers r0,r1,r2…..r11,….r12…..
sp…lr,pc
Fast locations for data,in
ARM,data must be in
registers to perform
arithmetic.
230 memory words Memory[0],memory[4+…
Memory[4294967292]
ARM uses byte addresses
,so sequential address
differ by 4

MIPS Instruction Set
Name Example Comments
32 registers $s0-$s7;
$t0-$t9,$zero,
$a0-$a3,
$v0-$v1,$gp,$fp,$sp,$ra,$at
Fast locations for data,in MIPS,data must be
in registers to perform arithmetic.register
$zero always equals 0,and register $at is
reserved by the assembler to handle large
constants
230 memory words Memory*0+,memory*4+…
Memory[4294967292]
MIPS uses byte addresses ,so sequential
address differ by 4,Memory holds DS,
arrays, and spilled registers

comparison
Category Instruction ARM MIPS
Arithmetic Add
Sub
Add immediate
ADD r1,r2,r3
SUB r1,r2,r3
add $s1,$s2,$s3
sub $s1,$s2,$s3
addi $s1,$s2,20
Data Transfer Load register/
Load word
Store register/
store word
……
…..
LDR r1,[r2,#20]
{r1=memory[r2+20]}
STR r1,[r2,#20]
Lw $s1,20($s2)
Sw $s1,20($s2)

ARM assembly language notation
ARM arithmetic instruction performs only one operation and
must always have exactly three variables.Ex.place the sum of
four variables b,c,d,e into variable a.
• ADD a,b,c (sum of b and c placed in a)
• ADD a,a,d (sum of b,c and d is now in a)
• ADD a,a,e(sum of b,c,d,and e is now in a)
It takes three instructions to sum the four variables.
First Design Principle: Simplicity favours regularity
Requiring every instruction to have exactly three operands,no
more and no less. Conforms to the philosophy of keeping the
h/w simple;h/w for variable number of operands is more
complicated than h/w for a fixed number.

C assignment to ARM assembly
• Second Design Principle: smaller is faster(A very large number of
registers may increase the clock cycle time because it takes
electronic signals longer when they must travel farther. smaller is
faster ;15 registers may not be faster than 16 .Energy is also the
major concern ,fewer registers is to conserve energy)
• a=b+c;d=a-e; ADD a,b,c , SUB d,a,e
• f=(g+h)-(i+j);ADD t0,g,h ,ADD t1,i,j,SUB f,t0,t1.
Memory operands: for simple variables single data elements is used
but for complex data structures –arrays, structure etc.The DSs can
contain many more elements than there are registers in computer.
How can a computer represent and access such large structures?
The processor can keep only small amount of data in registers, but
memory contains billions of data elements, Hence DSs(array and
structures)are kept in memory.

• Compile the C assignment , g=h+A[8]
LDR r1,[r2,#32]
r2=base address
r1=temp register
• ADD r3,r4,r1
• Offset=8
• In ARM word must
Start at address that are
Multiples of 4
(alignment restriction)
32 8
24 7
20 6
16 5
12 4
8 3
4 2
0 1
Byte Address data element

Compile using load and store A[12]=h+A[8]
LDR r1,[r2,#32]
ADD r4,r3,r1
STR r4,[r2,#48]
Many programs have more variables than computers have registers.
Consequently the compilers tries to keep the most frequently used variables
in registers and places the rest in memory,using loads and stores to move
variables between registers and memory.The process of putting less
commonly used variables (or those needed later) into memory is called
spilling registers.
To achieve highest performance and conserve energy,compilers must use
registers efficiently.

Constant or Immediate Operands
Many times a program will use a constant in an operation. Ex .increment an
index to point to the next element of an array. More than half of the ARM
arithmetic instructions have a constant as an operand when running the
SPEC2006 benchmarks.
Add a constant 4 to register r3.
LDR r5,[r1,#AddrConstant4]
ADD r3,r3,r5
r1+AddrConstant4 is the memory address of the constant.
ADD r3,r3,#4 {r3=r3+4},immediate operand.
Third Design Principle: Make the common case fast(Constant operands occur
frequently ,and by including constants inside arithmetic instructions,
operations are much faster and use less energy than if constants were loaded
from memory)

Signed and Unsigned Numbers
31 30 29 28 27 26 9 8 7 6 5 4 3 2 1 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1
LSB=0MSB=31
ARM word is 32 bits long.
So we can represent the numbers from 0 to 232-1(4294967295)
0000 0000 0000 0000 0000 0000 0000 0000=0
0000 0000 0000 0000 0000 0000 0000 0001=1
0000 0000 0000 0000 0000 0000 0000 0010=2
……… …….
1111 1111 1111 1111 1111 1111 1111 1111
=4294967295

Positive and negative numbers represented by separate sign by
single bit 0(positive) and 1(negative).
Integers representation: a)Signed Magnitude representation
b)Signed-1’s complement representation
c) Signed-2’s complement representation
Ex 14 in 8- bit register=0000 1110
Three different ways to represent -14=
i) Signed Magnitude representation=1 0001110
ii) Signed-1’s complement representation=1 1110001
iii)Signed-2’s complement representation=1 1110010

Conclusions
• Signed magnitude system is used in ordinary arithmetic but is
awkward when employed in computer arithmetic.
• Hence signed complement is used ,but 1’s complement
impose difficulties because it has two representation of 0(+0
and -0).1’s complement is used for logical operation.
• 2’s complement is used for representing negative numbers.
2’s complement addition:6 00000110
13 00001101
19 00010011
Do -6+13,6-13,-6-13.

Translating ARM assembly into m/c
instruction
• ADD r5,r1,r2
The decimal representation:
Fourth field tells that the instruction performs addition
Instruction in binary form:instruction format
ARM Fields:
Cond: conditional branch instruction
F:Instruction format,I=immediate(if 0 the second source operand is register ,if
1 than 12 bit immediate),S=set condition(related to cond.branch)
14 0 0 4 0 1 5 2
1110 00 0 0100 0 0001 0101 000000000010
4bits 2bits 1bit 4bits 1bit 4bits 4bits 12bits
Cond F I Opcode S Rn Rd Operand2

ADD r3,r3,#4 , r3=r3+4
LDR r5,[r3,#32] ,temp reg r5 gets A[8]
Load and store instruction use 6 fields
F=1 data transfer instruction,24=load word
14 0 1 4 0 3 3 4
Cond F Opcode Rn Rd Offset2
4bits 2bits 6bit 4bits 4bit 12bits
14 1 24 3 5 32
4bits 2bits 6bit 4bits 4bit 12bits

Instruction Format Cond F I Op S Rn Rd Operand2
ADD DP 14 0 0 4 0 Reg Reg Reg
SUB DP 14 0 0 2 0 Reg Reg Reg
ADD(imme
dite)
DP 14 0 1 4 0 Reg Reg Constant
LDR DT 14 1 Na 24 Na Reg Reg Address
STR DT 14 1 Na 25 Na Reg Reg Address

A[30]=h+A[30]
LDR r5,[r3,#120]
ADD r5,r2,r5
STR r5,[r3,#120]
Cond F I Op S Rn Rd Operand2
14 1 24 3 5 120
14 0 0 4 0 2 5 5
14 1 25 3 5 120
Cond F I Op S Rn Rd Operand2
1110 1 11000 0011 0101 0000 1111 0000
1110 0 0 100 0 0010 0101 0000 0000 0101
1110 1 11001 0011 0101 0000 1111 0000

Logical Operations
AND r5,r1,r2,reg5=reg1 & reg2
MVN r5,r1 ,reg5=~reg r1,[move Not]
MOV r6,r5 ,reg6=reg5
MIPS Code for f=(g+h)-(i+j)
Assign variables to registers:$s0=f,$s1=g,$s2=h,$s3=i,$s4=j
add $t0,$s1,$s2
add $t1,$s3,$s4
sub $s0,$t0,$t1
For g=h+A[8]
lw $t0,32($s3) ,tem reg t0 gets A[8]
add $s1,$s2,$t0

ADD r5,r1,r2 in ARM(decimal representation)
MIPS decimal representation
add $t0,$s1,$s2
17=$s1,18=$s2,8=$t0,
6bits 5bits 5bits 5bits 5bits 6bits
0 17 18 8 0 32

MIPS Fields
op=opcode
rs=first register source operand
rt=second register source operand
rd=register destination operand
shamt=shift amount
funct=function,this field selects the specific variant of the operation in the op
field also called as function code.
op rs rt rd shamt funct
6bits 5bits 5bits 5bits 5bits 6bits

• The problem occurs when an instruction needs longer fields than those
shown.Ex the load word instruction must specify two registers and a
constant.If the address uses one of the 5 fields of the format,the constant
within the load word would be limited to 25 = 32.This constant is used to
select elements from the arrays or data structures and is often much
larger than 32.
• We have conflict between desire to keep all the instructions the same
length and the desire to have a single instruction format.
• Design principle four: Good design demands good compromises.
• The compromise choosen is to keep all instructions the same
length,thereby by requiring different kinds of instruction formats for
different kinds of instructions.
• R-type(register format)(as shown above)
• I-type(immediate or data transfer)

I-format
16 bit address means load word instruction can load any word within a region
of ±215 or 8192 words of address in the base register rs.
Lw $t0 32($s3) temp reg t0 gets A[8].
op rs rt Const or
address
6bits 5bits 5bits 16bits
Inst Format Op Rs Rt Rd Shamt Funct address
add R 0 Reg Reg Reg O 32 Na
sub R 0 Reg Reg Reg O 34 Na
addi I 8 Reg Reg Na Na Na Const
lw I 35 Reg Reg Na Na Na Address
sw I 43 Reg Reg Na Na Na Address

Translate MIPS assembly into m/c code
• A[300]=h+A[300]
Lw $t0,1200($t1) #temp reg t0 gets A[300]
add $t0,$s1,$t0 #temp reg to gets h+A[300]
sw $t0,1200($t1) # stores h+A[300] back into A[300]
op rs rt rd shamt/
address
funct
35 9 8 12000
0 18 8 8 0 32
43 9 8 1200

m/c code
$s0-$s7 mapped to 16-23
$t0-$t7 mapped to 8-15
Op Rs Rt Rd Shamt/address funct
100011 01001 01000 0000 0100
1011 0000
0 10010 01000 01000 0 100000
101011 01001 01000 0000 0100
1011 0000

• In MIPS
• Shift left logical(sll) , ex. sll $t0,$s0,4 # reg $t0=$regs0<<4 bits
• Shift right logical(srl)
• In ARM
• Logical shift left(LSL),ex LSL r5,r1 LSL #2,# r5=r1<<2
• Logical shift right(LSR),ex MOV r6,r5 LSR #4 , #r6=r5>>4
• In MIPS
0 0 16 10 4 0

i=j 𝑖 ≠ 𝑗
Exit:
f=g+h f=g-h
i==j

• Shifting left by i bits gives same result as multiplying by 2𝑖
• Shifting right by i bits gives same result as divide by 2𝑖
Instructions for making decisions:in ARM
• If(i==j),f=g+h;else f=g-h;
Assigning variables to registers:r0=f,r1=g,r2=h,r3=i,r4=j
CMP r3,r4
BNE Else;go to else if i≠j
ADD r0,r1,r2;
B Exit;go to exit
Else :SUB r0,r1,r2;
Exit:

Instructions for making decisions
• If(i==j),f=g+h;else f=g-h; in MIPS
beq reg1,reg2,L1 #go to statement L1 if reg1 ==reg2
bne reg1,reg2,L1 #go to statement L1 if reg1 ≠reg2
f,g,h,i,j corresponds to five registers $s0-$4
bne $s3,$s4,Else #go to Else if i ≠j
add $s0,$s1,$s2 # f=g+h skipped if i ≠j
J Exit # go to exit #unconditional branch
Else:sub $s0,$s2,$s1 # f=g-h skipped if i==j
Exit:

ARM and MIPS assembly for the loop
In ARM
while (save[i]==k)
i+=1;
Assume i and k corresponds to registers r3 and r5 and the base of the array save in r6
First step:save[i] in temp reg.before that add i to the base of the array save to form the
address,multiply the index i by 4 due to the byte addressing problem.
Use LSL since shifting left 2 bits means multiply by 22
Loop:ADD r12,r6,r3,LSL #2 ;r12=address of save[i]
LDR r0,[r12,#0] ; temp reg r0=save[i]
CMP r0,r5
BNE Exit ;go to Exit if save[i]≠ 𝑘
ADD r3,r3,#1 ;i=i+1
B loop ;go to loop
Exit;

In MIPS
Loop:sll $t1,$s3,2 ;temp reg $t1=4*i
add $t1,$t1,$s6 ;$t1=address of save[i]
lw $t0,0($t1) ;temp reg $t0=save[i]
bne $t0,$s5,Exit ;go to exit if save[i]≠k
add $s3,$s3,1 ;i=i+1
J loop
Exit:
Example:
Let r0=1111 1111 and r1=0000 0001
CMP r0,r1
Which conditional branch is taken?
BLO L1 ;unsigned branch(branch on lower unsigned instruction is not taken to L1,since
decimal of 1111 1111>1
BLT L2 ;signed branch(the branch on less than instruction is taken to L2,since
in decimal -1<1
Note:an unsigned comparison of x<y also checks if x is negative as well as if x is less than y

In MIPS the instruction slt(set on less than) used in for loops
slt $t0,$s3,$s4 ;means register $t0 is set to 1 if the $s3<$s4
otherwise $t0 set to 0.
slti $t0,$s2,10 ;$t0=1 if $s2<10
The MIPS architecture does not include branch on less than because it is too
complicated;either it would stretch the clock cycle time or it would take extra
clock cycles per instruction.
MIPS compilers use slt,slti,bne,beq and the fixed value 0($zero) to create all
relative conditions,equal,not equal,less than,less than or equal,greater
than,greater than or equal.

Case/Switch statements
• The simplest way to implement switch via a sequence of conditional
tests,turning the switch statement into a chain of if-then-else statements.
• Sometimes the alternatives may be more efficiently encoded as a table of
addresses of alternative instruction sequences,called a jump address
table/jump table.And the program needs only to index into the table and then
jump to the appropriate sequence.
• The jump table is then just an array of words containing addresses that
correspond to labels in the code.The program need to jump using the address
in the appropriate entry from the jump table.
• ARM handles such situations implicitly by stored program concept .
• Need to have a register to hold the address of the current instruction being
executed (PC-program counter)
• In ARM register 15 is the PC(also instruction address register)
• Any instruction with register 15 as destination register is an unconditional
branch to the address at the value.

• Encoding branch instruction in ARM
• The cond field encodes the many versions of the conditional branch like
Cond 12 address
4bits 4bits 24bits
Value Meaning Value Meaning
0 EQ(EQual) 8 HI(unsigned higher)
1 NE(Not Equal) 9 LS(unsigned Lower or Same)
2 HS(unsigned Higher or Same) 10 GE(signed Greater than or Equal)
3 LO(unsigned LOwer) 11 LT(signed Less Than)
4 MI(MInus,<0) 12 GT(signed Greater Than)
5 PL(PLus,>=0) 13 LE(signed Less Than or Equal)
6 VS(oVerflow Set,overflow) 14 AL(ALways)
7 VC(oVerflow Clear,no overflow) 15 NV(reserved)

• The cond field encodes the many versions of the conditional branch (shown in
the table)
• The 24 bit address limit the programs to 224
or 16MB which would be fine for
many programs but constrains large ones.
• An alternative would be a register that added to branch address,so the branch
instruction would calculate PC=Reg+branch address.
• The sum would allow the programs to be as large as 232
,solving the branch
address size problem.’
• Which is that register?
• Conditional branches are found in loops and in if statements,so they tend to
branch to a nearby instruction.Ex about half of all conditional branches in SPEC
benchmarks go to locations less than 16 instructions away.
• Since the PC contains the address of the current instruction,we can branch
within ±224
words of the current instruction if we use the PC as the register
to be added to the address.All the loops and if statements are much smaller
than ±224
words,so the PC is the ideal choice.
• Also called as PC-relative addressing.

Conditional Execution
• Unusual feature of ARM is most instruction can be conditionally executed
,not just branches.This is the purpose of the 4 bit cond field found in the
most ARM instruction formats.
• The assembly language programmer simply appends the desired condition
to the instruction name to perform the operation only if the condition is
true based on the last time the condition flags were set.
• Ex
CMP r3,r4
ADDEQ r0,r1,r2
SUBNE r0,r1,r2

Procedures in Computer Hardware
• Procedure: A stored subroutine that performs a specific task based on the
parameters with which it is provided.
• Registers are the fastest place to hold data in a computer,so use them as
much as possible .ARM s/w follows the following conventions for
procedure calling in allocating 16 registers.
• r0-r3(four argument registers in which to pass parameters.
• lr:one link register containing the return address register to return to the
point of origin.
• BL:It jumps to an address and simultaneously save the address of the
following instruction in register lr.(Branch-and-link instruction).
• BL:ProcedureAddress.
• (BL instruction that jumps to an address and simultaneously and saves the
address of the following instruction in a register (lr or register 14).

• MOV pc,lr
• The calling program ,or caller,puts the parameter values in r0-r3 and uses BL X
to jump to procedure X(callee).The callee then performs the calculations,
places the results (if any) into r0 and r1,and returns control to the caller using
MOV pc,lr.
• The BL instruction actually saves PC+4 in register lr to link to the following
instruction to set up the procedure return.
• Stack:
• Suppose a compiler needs more registers for a procedure than the four
arguments and two return value registers.Since we must cover our tracks after
our mission is complete,
any registers needed by the caller must be restored to the values that they
contained before the procedure was invoked.This situation is an example in which
we need to spill registers to memory.
The ideal data structure for spilling registers is a stack –a LIFO queue

• A stack needs a pointer to the most recently allocated address in the stack
to show where the next procedure should place the registers to be spilled
or where old register values are found.
• The stack pointer is adjusted by one word for each register that is saved or
restored. ARM reserves register 13 for SP.
• Compile the C procedure that does not call another procedure
int example(int g,int h,int i, int j)
{
int f;
f=(g+h)-(i+j)
return f;
}

• Let g,h,i,j correspond to the argument registers r0,r1,r2,r3 and f
corresponds to r4.
• Label of the procedure is ex_procedure:
• Save three registers:r4,r5,r6.
• “push” the old values onto the stack by creating space for three words(12
bytes) on the stack and then store them:
SUB sp,sp,#12 :adjust stack to make room for 3 items
STR r6,[sp,#8] :save register r6 for use afterwords
ADD r5,r0,r1 :reg r5 contains g+h
ADD r6,r2,r3 :reg r6 contains i+j
SUB r4,r5,r6 :f gets r5-r6,which is (g+h)-(i+j)

Values of sp and the stack:
before,during and after the procedure call
High address
sp sp
Contents of r6
Contents of r5
sp Contents of r4
Low address

To return the value of f we can copy it into a return value register r0:
MOV r0,r4 ; return f(r0=r4)
Before returning ,we restore the three old values of the registers we saved by
“popping”them from the stack.
LDR r4,[sp,#0] ;restore register r4 for caller
ADD sp,sp,#12 ;adjust stack to delete 3 items
The procedure ends with a jump register using the return address:
MOV pc,lr ;jump back to calling routine.
We used temporary registers and assumed their old values must be saved and
restored.To avoid saving and restoring a register whose value is never used,which
might happen with a temporary register,ARM software separates 12 of the
registers into two groups:

r0-r3,r12:argument or scratch registers that are not preserved by the
callee(called procedure)on a procedure call
r4-r11:eight variable registers that must be preserved on a procedure call(if
used ,the callee saves and restores them)
Note:This simple convention reduces register spilling .In the example above,if
we could rewrite the code to use r12 and reuse one of the r0 to r3,we can
drop two stores and two loads from the code. We still must save and restore
r4,since the callee must assume that the caller needs its value.
In MIPS:
MIPS s/w follows the following convention in allocating its 32 registers for
procedure calling:
$a0-$a3 : four argument register in which to pass parameters
$v0-$v1: two value registers in which to return values.
$ra : one return address register to return to the point of origin.

jal: jump-and-link instruction(an instruction that jumps to an address and
simultaneously saves the address of the following instruction in a register($ra
in MIPS)
jal ProcedureAddress
jal instruction saves PC+4 in register $ra to link to the following instruction to
set up the procedure return.
jr $ra ;jump register instruction meaning an unconditional jump
to the address specified in a register.
The calling program ,or caller,puts the parameter values in $a0-$a3 and uses
jal X to jump to procedure X(callee).The callee then performs the
calculations, places the results (if any) into $v0 and $v1,and returns control to
the caller using jr $ra.

• Let g,h,i,j correspond to the argument registers $a0,$a1,$a2,$a3 and f
corresponds to $s0.
• Label of the procedure is ex_procedure:
• Save three registers:$s0,$t0,$t1.
• “push” the old values onto the stack by creating space for three words(12
bytes) on the stack and then store them:
addi $sp,$sp,-12 :adjust stack to make room for 3 items
sw $t1,8[$sp] :save register $t1 for use afterwords
sw $t0,4[$sp] :save register $t0 for use afterwords
sw $s0,0[$sp] :save register $s0 for use afterwords
add $t0,$a0,$a1 :reg $t0 contains g+h
add $t1,$a2,$a3 :reg $t1 contains i+j
sub $s0,$t0,$t1 :f gets $t0-$t1,which is (g+h)-(i+j)
add $v0,$s0,$zero :returns f ($v0=$s0+0)

Before returning ,we restore the three old values of the registers we saved by
“popping”them from the stack.
lw $s0,0($sp) ;restore register $s0 for caller
lw $t0,4($sp) ;restore register $t0 for caller
lw $t1,8($sp) ;restore register $t1 for caller
addi sp,sp,12 ;adjust stack to delete 3 items
The procedure ends with a jump register using the return address:
jr $ra ;jump back to calling routine.
We used temporary registers and assumed their old values must be saved and
restored.To avoid saving and restoring a register whose value is never used,which
might happen with a temporary register,MIPS software separates 18 of the
registers into two groups:
$t0-$t9 :10 temp registers that are not preserved by the callee(called
procedure)on a procedure call
$s0-$s7:eight saved registers that must be preserved on a procedure call(if used
,the callee saves and restores them)

Note:This simple convention reduces register spilling .In the example above,
since the caller (procedure doing the calling) does not expect registers $t0
and $t1 to be preserved across a procedure call ,we can drop two stores and
two loads from the code. We still must save and restore $s0,since the callee
must assume that the caller needs its value.
Nested Procedures:
Ex.Suppose that the main program calls procedure A with an argument of
3,by placing the value 3 into register r0 and then using BL A. Then suppose
that procedure A calls procedure B via BL B with an argument of 7,also placed
in r0.Since A has not finished its task yet,there is a conflict over the use of
register r0.Similarly there is a conflict over the return address in register
lr,since it now has the return address for B.
Solution:push all the other registers that must be preserved onto the stack.

• The caller pushes any argument registers(r0-r3) that are needed after the
call. The callee pushes the return address register lr and any variable
registers (r4-r11)used by the callee. The sp is adjusted to account for the
number of registers placed on the stack.Upon the return, the registers are
restored from memory and the sp is readjusted.
• Convert into into ARM assembly code?
int fact(int n)
{
if(n<1)return (1);
else return(n*fact(n-1));
}

fact:ARM fact:MIPS
SUB sp,sp,#8 ;adjust stack for 2 items addi sp,sp,-8;adjust stack for 2 items
STR lr ,[sp,#4] ;save return address sw $ra,4($sp) ;save return address
STR r0,[sp,#0] ;save the argument n sw $a0,0($sp) ;save the argument n
CMP r0,#1 ;compare n to 1 slti $t0,$a0,1;test for n<1
BGE L1 ;if n>=1,go to L1 beq $t0,$zero,L1;if n>=1,go to L1
MOV r0,#1 ;return 1 addi $v0,$zero,1 ;return 1
ADD sp,sp,#8 ;pop two items off stack addi $sp,$sp,#8;pop two items off stack
MOV pc,lr ;return to the caller jr $ra ;return to after jal
L1: SUB r0,r0,#1 ;n>=1:argument gets(n-1) L1:addi $a0,$a0,-1; n>=1:argument
gets(n-1)
BL fact ;call fact with (n-1) jal fact ; call fact with (n-1)
MOV r12,r0 ;save the return value lw $a0,0($sp); return from jal;restore
argument n
LDR r0,[sp,#0] ;return from BL;restore argument n lw $ra,4($sp); restore the return address
LDR lr [sp,#0] ;restore the return address addi $sp,$sp,8 ;adjust sp to pop 2 items
ADD sp,sp,#8 ;adjust sp to pop 2 items mul $v0,$a0,$v0; return n * fact(n-1)
MUL r0,r0,r12 ;return n * fact(n-1) jr $ra ; return to caller

Stack allocation before during and
after procedural call
High address $fp
$fp $fp
$sp sp
saved argument
registers(if any)
saved ret address
$sp Saved saved registers(if any)
Local arrays and
Structures(if any)
Low address

What is and what not preserved across
a procedure call
Preserved Not Preserved
Variable registers:r4-r11 Argument registers:r0-r3
Stack pointer register:sp Intra procedure-call scratch register:r12
Link register:lr Stack below the sp
Stack above the sp

• Storage classes in C:automatic (local to procedure)and static(variables outside
procedures).
• Global pointer($gp):To simplify access to global data MIPS s/w reserves
another register called global pointer.
• Allocating space for new data on stack: The final complexity is that stack is also
used to store variables that are local to the procedure that do not fit in the
registers such as local arrays and structures.
• The segment of the stack containing a procedure’s saved registers and local
variables is called :procedure frame(activation record).
• Some MIPS s/w uses a frame pointer($fp) to point to first word of the frame of
the procedure.
• $fp is the value denoting the location of the saved registers and local variables
for a given procedure.
• $sp points to the top of the stack.when a $fp is used,it is initialized using the
address in $sp on a call and $sp is restored using $sp.

Allocating space for new data on the
heap,MIPS convention
$sp 7fff fffchex
$gp 1000 8000hex
1000 0000hex
$pc 0040 0000hex
0
Stack
Dynamic data
Static data
Text
Reserved

• In addition to the automatic variables that are local to procedures, C
programmers need space in memory for static variables and for dynamic
data structures.
• Text segment: Segment of the unix object file that containsthe m/c code
for routines in the source file.
• Static data segment:place for constants and other static variables.
• Data structures like linked lists tend to grow and shrink during there
lifetimes.The segments for such DSs is the heap.
• Stack and heap grow towards each other allowing efficient use of
memory as two segments wax and wane.
• C allocates and free space on the heap with explicit functions,malloc()
allocates space on the heap and returns a pointers to it and free()
releases space on the stack to which the pointer points.

• Memory allocation is controlled by programs in C,and it is the source of
many common and difficult bugs.
• Forgetting to free space leads to “memory leak”,which eventually uses up
so much memory that the OS may crash.Freeing space to early leads to
“dangling pointers”,which can cause pointers to point to things that the
program never intended.
• GNU MIPS C compiler uses a frame pointer.C compiler from MIPS/Silicon
graphics does not use fp,it uses register 30 as another save register.

ARM register conventions
Name Register No Usage Preserved on call?
a1-a2 0-1 Argument/return
result/scratch
register
no
a3-a4 2-3 Argument/scratch
register
no
v1-v8 4-11 Variables for local
routine
yes
ip 12 Intra-procedure-
call scratch register
no
sp 13 Stack pointer yes
lr 14 Link register yes
pc 15 Program counter n.a

MIPS register Convention
Name Register No Usage Preserved on call?
$zero 0 The constant value 0 n.a
$v0-$v1 2-3 Values for results and
expression evaluation
no
$a0-$a3 4-7 Arguments no
$t0-$t7 8-15 Temporaries no
$s0-$s7 16-23 Saved yes
$t8-$t9 24-25 More temporaries no
$gp 28 Global pointer yes
$sp 29 Stack pointer yes
$fp 30 Frame pointer yes
$ra 31 Return address yes

32-bit immediate operand
• Although constants are frequently short and fit into 16 bit field,sometimes
they are bigger.The MIPS instruction set includes the instruction load
upper immediate(lui)
• Specifically to set the upper 16bits of a constant in a register, allowing a
subsequent instruction to specify the lower 16bits of the constant.

m/c code of lui $t0,255 ;t0 is register 8
Contents of register $t0 after executing lui $t0,255
001111 00000 01000 0000 0000 1111 1111
0000 0000 1111 1111 0000 0000 0000 0000

• Loading 32-bit constant
• What is the MIPS assembly code to load 32-bit constant into register $s0?
0000 0000 0011 1101 0000 1001 0000 0000
First load the upper 16 bits which is 61 in decimal ,using lui;
Lui $s0,61
The value of register $s0 afterwords is
0000 0000 0011 1100 0000 0000 0000 0000
Add the lower 16 bits whose decimal is 2304
ori $s0,$s0,2304
The final value in the register $s0
0000 0000 0011 1101 0000 1001 0000 0000

Compiling a String Copy Procedure
void strcpy(char x[],char y[]
{ int i;
i=0;
while((x*i+=y*i+)!=‘0’)
i+=1;
}
• In ARM:
Assume the base addresses for arrays x and y are found in r0 and r1,while i is in
r4.strcpy adjusts the stack pointer and then saves the saved register r4 on the
stack
strcpy:
SUB sp,sp,#4 ;adjust stack for 1 more item
STR r4,[sp,#0] ;save r4
To initialize i to 0,the next instruction sets r4 to 0 by adding 0 to 0 and placing that
sum in r4:
MOV r4,#0 ;i=0+0

L1:ADD r2,r4,r1 ;address of y[i] in r2 {this is the beginning of the loop,the
address of y[i] is first formed by adding i to y[](assume array of bytes).
To load the character in y[i]
LDRB r3,[r2,#0] ;r3=y[i] and set condition flags {load register byte;loads a byte
from memory
ADD r12,r4,r0 ;address of x[i] in r12
STRB r3,[r12,#0] ;x[i]=y[i]
BEQ L2 ;if y[i]==0; go to L2
Increment i and loop back
ADD r4,r4,#1 ;i=i+1
B L1 ;go to L1
If we don’t loop back it was the last character of the strings ;we restore r4 and the
stack pointer,and then return
L2:LDR r4,[sp,#0] ;y[i]==0;end of string ,restore old r4
ADD sp,sp,#4 ;pop 1 word off stack
MOV pc,lr ;return

In MIPS:
strcpy:
addi $sp,$sp,-4
sw $sp,0($sp)
add $s0,$zero,$zero,
L1:ADD $t1,$s0,$a1 ;address of y[i] in $t1
lb $t2,0($t1) ;$t2=y[i]
add $t3,$s0,$a0 ;address of x[i] in $t3
sb $t2,0($t3) ;x[i]=y[i]
beq $t2,$zero ,L2
addi $s0,$s0,1
J L1
L2: lw $s0,0($sp)
addi $sp,$sp,4
jr $ra

Addressing in Branches and Jumps
J 10000 ;go to location 10000
bne $s0,$s1,Exit ;go to Exit if $s0≠ $s1
If the address of the program is bigger than 16 bits than PC=Reg+branch
address ,sum becomes 32 bits
2 10000
6 bits 26bits
5 16 17 Exit
6bits 5bits 5bits 16bits

Branching far away: Given beq $s0,$s1,L1,replace with a pair of instruction that
offer a much greater branching distance.
Short address conditional branch:
bne $1,$s0,L2
J L1
L2:
MIPS Addressing Modes:
1)Register addressing(where the operand is a register)
2)Base or Displacement addressing(where the operand is at the memory location
whose is the sum of the register and a constant in the instruction)
3)Immediate addressing(operand is the constant within the instruction itself)
4)PC relative addressing(where the address is the sum of the PC and a constant in
the instruction)
5)Pseudodirect addrerssing(where the jump address is the 26 bits of the
instruction concatenated with the upper bits of the PC

• 1)convert into assembly for m/c instruction 00af8020(hex)
0000 0000 1010 1111 1000 1000 0010 0000
Find the op field
31-29 and 28-26 are 000 ,000 ,hence R-format instruction
000000 00101 01111 10000 00000 100000
Bit 5-3 ,100 and bits 2-0,000,hence it represent add
The decimal values are 5 for rs and 15 for rt,16 for rd(shamt is unused).these
numbers represents registers $a1,$t7 and $s0
add $s0,$a1,$t7

Translating and starting a program
C-program----COMPILER----Assembly program--ASSEMBLER--
object:m/c language module---LINKER--exe m/c language program--
LOADER--memory
object:library routine(m/c language)
Source file-x.c
Assembly file-x.s
Object file-x.o
Statically linked library routines are x.a and dynamically linked library routes
are x.so
Executable files by default are called a.out.
MS-DOS uses .C,.ASM,.OBJ,.LIB,.DLL,and .EXE

Compiling java program
Java program----COMPILER---classfiles(bytecodes)
java library routines(m/c language)
JIT JVM
compiled java methods(m/c language)
JVM-s/w interpreter,can execute bytecodes,it’s a program that simulates an
ISA.portable and found in devices –mobile phones to internet browsers.
To preserve portability and improve execution speed the next phase is JIT
JIT-compiler that operates on runtime, translating the interpreted code
segments into the native code of the computer.

Arithmetic for computers
• Two’s complement representation:The positive and negative numbers 32 bit
numbers can be represented as
• 𝑥31 × −231
+ 𝑥30 × 230
+ 𝑥29 × 229
+ ⋯ + 𝑥1 × 21
+ 𝑥0 × 20
• The sign bit is multiplied by −231
and the rest of the bits are then multiplied
by positive versions of their respective base values.
• Decimal value of two’s complement number:
1111 1111 1111 1111 1111 1111 1111 1100
Multiplication: Sequential version of the Multiplication Algorithm and Hardware
Let’s assume that the multiplier is in the 32-bit Multiplier register and that the 64-
bit Product register is initialized to 0.
We need to move the multiplicand left one digit each step, as it may be added to
the intermediate products.
Over 32 steps, a 32 bit multiplicand would move 32 bits to the left.Hence,we need
a 64bit multiplicand register, initialized with the 32 bit multiplicand in the right
half and zero in the left half.This register is then shifted left 1 bit each step to align
the multiplicand with the sum being accumulated in the 64-bit Product register.

First version of the multiplication H/w
64bits
32bits
ALU 64bit
64bits
Multiplicand
Shift left
Product
write
Multiplier
Shift right
Control
test

First multiplication algorithm
• flowchart_multiply.pdf
• Three basic steps needed for each bit. These three basic steps are
repeated 32 times to obtain the product.If each step took a clock
cycle,this algo would require almost 100 clock cycles to multiply two 32 bit
numbers.

Multiply 0010 X 0011
Iteration Step Multiplier Multiplicand Product
0 Initial value 0011 0000 0010 0000 0000
1 1a:1=>Prod=Prod
+ Multplicand
2:shift left
multiplicand
3:shift right
multiplier
0011
0011
0001
0000 0010
0000 0100
0000 0100
0000 0010
0000 0010
0000 0010
2 1a:1=>Prod=Prod
+ Multplicand
2:shift left
multiplicand
3:shift right
multiplier
0001
0001
0000
0000 0100
0000 1000
0000 1000
0000 0110
0000 0110
0000 0110
3 1a:1=>No
operation
2:shift left
multiplicand
3:shift right
multiplier
0000
0000
0000
0000 1000
0001 0000
0001 0000
0000 0110
0000 0110
0000 0110
4 1a:1=>No
operation
2:shift left
multiplicand
3:shift right
multiplier
0000
0000
0000
0001 0000
0010 0000
0010 0000
0000 0110
0000 0110
0000 0110

Refined version of Multiplication H/w
32bits
32bit ALU
64 bits
Multiplicand
Product shift right
write
Control
test

=10 =01
=00
=11
=/= =0
Multipcand in BR
Multiplier in QR
AC0
Qn+10
SC0
QnQn+1
ACAC+BR’
+1
ACAC+BR
Ashr(AC&QR)
SCSC-1
SC END

h/w for Booth Algo
Qn Qn+1
BR Register
Complementer and parallel
Adder
Sequence counter
QR registerAC register

First version of division h/w
64bits
ALU
64bit
32bits
64bits
DIVISOR
Shift right
QUOTIENT
Shift left
REMAINDER
write
Control
test

Floating Point
Reals in mathematics:
3.14159265….(𝜋)
2.71828…(e)
0.000000001 or 1.0 × 10;9
3,155,760,000 or 3.15576× 109
The last number did not represent a small fraction,but it was bigger than we
could represent with a 32 bit signed integer.
The alternative notation of the last two numbers is called scientific notation.
Which has a single digit to the left of the decimal point.
A number in scientific notation that has no leading 0s is called a normalized
number,which is the usual way to write it.1.0 × 10;9 is a normalized scientific
notation but 0.1 × 109 𝑎𝑛𝑑 10.0 × 1010 𝑎𝑟𝑒 𝑛𝑜𝑡
Note:all numbers in decimal

• We can also show binary numbers in scientific notation.1.0 × 2;1
Floating point: computer arithmetic that represent numbers in which the
binary point is not fixed
In scientific notation (1. 𝑥𝑥𝑥𝑥𝑥𝑥)2× 2 𝑦𝑦𝑦𝑦
Advantages of scientific notation in normalized form:
1)It simplifies exchange of data that includes floating point numbers.
2)It simplifies the floating point arithmetic algorithms.
3)Increases the accuracy of the numbers that can be stored in a word.
General Form:
(−1) × 𝐹 × 2 𝐸
31 30-23 22-0
S(1 bit) Exponent(8 bits) Fraction(23 bits)

• These chosen sizes of exponent and fraction give MIPS computer arithmetic an
extra ordinary range. Fractions almost as small as
2.0 × 10;38
𝑎𝑛𝑑 𝑎𝑠 𝑙𝑎𝑟𝑔𝑒 𝑎𝑠 2.0 × 1038
Overflow: positive exponent becomes too large to fit in the exponent field.
Underflow:negative exponent becomes too large to fit in the exponent field.
Double precision: Floating point value represented two 32 bit words.
Single precision: Floating point value represented in a single 32 bit word.
MIPS double precision:large number:2 × 10308
Small number:2 × 10;308
These formats go beyond MIPS:they are part of IEEE754 floating point standard
found in virtually every computer invented since 1980.
31 30-20 19-0
S(1 bit) Exponent(11 bits) Fraction(20 bits)
Fraction (continued)32 bits

To pack into more bits into significand(fraction),the IEEE754 makes the
leading 1 bit of normalized binary numbers implicit.
Hence,the number is actually 24 bits long in single precision(1 implied+23 bit
fraction).
For double precision for 53 bits long (1+52).
General form:
(−1) 𝑆× 1 + 𝑓𝑟𝑎𝑐𝑡𝑖𝑜𝑛 × 2 𝐸
Where the bits of the fraction represents a number between 0 and 1 and E
specifies the value in the exponent field.
If we number the bits of the fraction from left to right s1,s2,s3…then the
value is
(−1) 𝑆× (1 + 𝑠1 × 2;1 + 𝑠2 × 2;2 + ⋯ . ) × 2 𝐸

• 1.0 × 2;1
• 1.0 × 2:1
31 30-23 22-0
0 11111111 0000000000……
31 30-23 22-0
0 0000001 0000000000……

The Processor:
4
Add Add
ALU
P
C
Address Instruction
Instruction Memory
Data
Register#
Register# REGISTERS
Register#
Address
Data
memory
Data

• Abstract view of the implementation of MIPS subset showing the major
functional units and the major connections between them.
a)All the instructions start by using PC to supply the instruction
address to instruction memory.
b)After the instruction is fetched, the register operands used by
an instruction are specified by the fields of that instruction.
c)Once the register operands have been fetched, they can be
operated on to compute the memory address(for load or
store),to compute an arithmetic result(for an integer arithmetic
logical instruction) or compare (for branch).

d)If the instruction is arithmetic logical instruction, the result
from ALU must be written to a register.
e)If the operation is load or store,the ALU result is used as an
address to either store a value from the registers or load a value
from memory into registers. The result from the ALU or memory
is written back into the register file.

f)Branches require the use of ALU output to determine the next
instruction address, which comes from either the ALU(where the
PC and the branch offset are summed) or from an adder that
increments the current PC by 4.
g)All the instructions(memory reference, arithmetic-logical and
branch) except jump uses ALU after reading the registers.

Logic design conventions
• The functional units in MIPS implementation consists of two
different types of logic elements: elements that operate on data
values and elements that contain state.
• Elements that operate on data values are called combinational.
Which means that their output depends only on the current inputs.
Given the same inputs a combinational element always produces
the same output.(Ex ALU).
• An element contain state if it has some internal storage :state
elements, because if we pulled the plug on the m/c,we could
restart it by loading the state elements with the values they
contained before we pulled the plug.
• The instruction and data memories ,registers are state elements.A
state element has two inputs and one output.

• Inputs are the data value to be written into the element and the
clock,which determines when the data value is written.Output from the
state element provides the value that was written in an earlier clock cycle.
Ex D-type flip-flop.
• Logic components that contain state: sequential(output depends on both
inputs and contents of internal state.Ex registers.
• Clocking Methodology: when the data is valid and stable relative to the
clock.
• Edge-triggered clocking:A clocking scheme in which all state changes occur
on a clock edge.It means any values stored in a sequential logic element
are updated only on a clock edge.
• The inputs are the values that were written in previous clock cycle,while
the outputs are values that can be used in a following clock cycle.

Combinational Logic,state element and
the clock are closely related
Clock cycle
State
element 1
State
element 2
Combinational logic

Edge triggered methodology allows a state element to be read
and written in the same clock cycle without creating a race that
could lead to indeterminate data values
State
element 1
Combinational logic

• Control signal: A signal used for multiplexor
selection or for directing the operation of a
functional unit.
• Contrasts with a data signal ,which contains
information that is operated on by a
functional unit.

Building a datapath
a)Instruction Memory b)PC
Add Sum
c)Adder
Instruction address
Instruction
Instruction Memory
PC

Portion of datapath used for fetching
instructions and incrementing the PC
Add Sum
4
Read address
Instruction
Instruction Memory
PC

Register file
5
Register Nos 5
5 Data
Data
Read Read
Register1 data1
Read
Register 2
REGISTERS
Write
Register
Write Read
data RegWrite data2
data2

ALU Operation
4
zero
ALU
ALU result

MemWrite
16 32
MemRead
Address Read
data
DATA MEMORY
Write
data
Sign
Extend

Datapath for branch uses the ALU to evaluate the branch condition and
a separate adder to compute the branch target as the some of the
incremented PC and the sign-extended,lower 16 bits of the instruction
shifted left two bits.
PC+4 from instruction datapath
ALU sum Branch
target
ALUoperation
Inst
ALU zero To branch
Control logic
RegWrite
16 32
Read reg1 Read data1
Read reg2
Write reg
Write data Read data2
Sign
extend
Shiftleft2

Creating single datapath
1)To share a datapath element between two different instruction
classes,we may need to allow multiple connections to the input
of the element using a multiplexor and control signal to select
among the multiple inputs.
2)The operations of R-type and memory instructions datapath
are quite similar but the differences are:
a)The R-types uses ALU with inputs coming from two
registers.The memory instruction also use ALU for address
calculation.Although the second input is the sign-extended 16 bit
offset field from instruction.
b)The value stored into the destination register comes from
ALU(R-type) or the memory(for load)

ALU Control
ALU Control Lines Function
0000 AND
0001 OR
0010 add
0110 subtract
0111 Set on less than
1100 NOR

• We can generate the 4 bit ALU control input using a small
control unit that has as inputs the function field of the
instruction and the two bit control field,which is ALUOp.
• ALUOp indicates whether the operation to be performed
should be add(00) for loads and stores, subtract(01) for beq
,or determined by the operation encoded in the function
field(10).
• The output of the ALU control unit is a 4-bit signal that
directly controls the ALU by generating one of the 4-bit
combinations.

How to set the ALU control inputs
based on the 2-bit ALUOp control and
the 6-bit function code
Instruction
opcode
ALUop Instruction
operation
Function
Field
Desired ALU
action
ALU control
input
Lw 00 Load word xxxxxx add 0010
Sw 00 Store word xxxxxx add 0010
Branch
equal
01 Branch
equal
xxxxxx subtract 0110
R-type 10 Add 100000 add 0010
R-type 10 Subtract 100010 subtract 0110
R-type 10 AND 100100 AND 0000
R-type 10 OR 100101 OR 0001
R-type 10 Set on less
than
101010 Set on less
than
0111

• The main control unit generates the ALUOp bits,which then are used as
input to the ALU control that generates the actual signals to control the
ALU unit.
• Using multiple levels of control can reduce the size of the main control
unit.
• Using several smaller control units may also potentially increase the speed
of the control unit.

Truth table for three ALU control bits
ALUOp1 ALUop2
Function Field
F5 F4 F3 F2 F1 F0
operation
0 0 X x x x x x 0010
X 1 X x x x x x 0110
1 X 100000 0010
1 X 100010 0110
1 X 100100 0000
1 X 100101 0001
1 X 101010 0111

Designing the Main Control Unit
• To connect the fields of an instruction to the datapath ,the
instruction formats must be reviewed.
• Op field:31-26 bits
• Two registers to be read rs and rt :25-21,20-16.true for R-
type,branch equal and for store.
• The base register for load and store instructions is always in
bit positions 25-21(rs).
• R-type:
0
31-26
Rs
25-21
Rt
20-16
Rd
15-11
Shamt
10-6
Funct
5-0
31-26 25-21 20-16 15-11 10-6

Load or store instruction
load:35,store:43
branch instruction
35 or 43
31-26
Rs
25-21
Rt
20-16
Address
15-0
4
31-26
Rs
25-21
Rt
20-16
Address
15-0

Simple datapath with control unit
• The input to the control unit is the 6-bit opcode field from the instruction.
The outputs of the control unit consists of three 1-bit signals that are used
to control multiplexors(RegDst,ALUSrc and MemtoReg),
• Three signals for controlling the reads and writes in register file and data
memory(RegWrite,MemRead,MemWrite).
• A 1-bit signal used in determining whether to possibly branch(Branch),and
a two bit control signal for ALU(ALUOp).
• An AND gate is used to combine the branch control signal and the zero
output from the ALU.
• The AND gate output controls the selection of the next PC

• The setting of the control lines is completely determined by the opcode
fields of the instruction
• The first row: R-type(add,sub and,or,slt).For all these instructions,the
source register fields are rs and rt and the destination rd.this define how
the signals ALUSrc and RegDst are set.
Instruction RegDst ALUSrc MemtoReg RegWrite Memread MemWrite Branch ALUOp1 ALUOp0
R-format 1 0 0 1 0 0 0 1 0
Lw 0 1 1 1 1 0 0 0 0
Sw X 1 X 0 0 1 0 0 0
beq X 0 X 0 0 0 1 0 1

• An R-type instruction writes a register(RegWrite=1),but neither reads nor
writes data memory.
• When the branch control signal is 0,the PC is unconditionally replaced with
PC+4.Otherwise the PC is replaced by branch target if the zero output of
ALU is also high.
• The ALUOp field of R-type instruction is set to 10 to indicate the ALU
control should be generated from the function field.
• The second and third row shows control signal settings for lw and sw.
• These ALUSrc and ALUOp fields are set to perform the address
calculation.
• The MemRead and MemWrite are set to perform memory access.
• Finally the RegDst and RegWrite are set for a load to cause the result to be
stored into the rt tregister.

• The branch instruction is similar to R-type,since it sends the rs and rt
registers to the ALU.
• The ALUOp field for branch is set for a subtract(ALU Control=01),which is
used to test for equality.
• Note:MemtoReg field is irrelevant when the RegWrite Signal is 0,Since the
register is not being written,the value of the data on the register data
write port is not used.
• Thus the entry MemtoReg in the last two rows of the table is replaced
with X for don’t care.
• Don’t care can also be added to RegDst when RegWrite is 0,this type of
don’t care must be added by the designer,since it depends on knowledge
of how the datapath works.

Finalizing the control:The control function for single
cycle implementation specified by truth table
Input or
output
Signal name R-Format Lw Sw beq
Inputs Op5 0 1 1 0
Op4 0 0 0 0
Op3 0 0 1 0
Op2 0 0 0 1
Op1 0 1 1 0
Op0 0 1 1 0
Outputs Regdst 1 0 X X
ALUSrc 0 1 1 0
MemtoReg 0 1 X X
RegWrite 1 1 0 0
MemRead 0 1 0 0
MemWrite 0 0 1 0
Branch 0 0 0 1
ALUOp1 1 0 0 0
ALUOp0 0 0 0 1

Implementing jumps
• The jump instruction look somewhat like branch instruction but computes
the target PC and is not conditional.
• Like a branch the lower order two bits of jump address are always 00,the
next lower 26 bits of this 32 bit address come from the 16 bit immediate
field in the instruction
31:26 25:0
000010 address

• The upper 4 bits of the address that should replace the PC come from the
PC of the jump instruction plus 4,we can implement the jump by storing
into the PC the concatenation of:
• -the upper 4 bits of the current PC+4(31:28)
• -The 26bit immediate field of jump instruction
• -The bit 00

Why single cycle implementation is not
used today?
• All though the single cycle design will work correctly, it would not be used
in modern designs because it is inefficient. The clock cycle must have the
same length for every instruction in this single cycle ,the CPI therefore is
1.
• The clock cycle is determined by the longest possible path in the machine.
• This path is almost the load instruction, which uses five functional units in
series, the instruction memory, register file, the ALU, the data memory
and the register file.
• Although the CPI is 1,the overall performance of single cycle
implementation is not likely to be very good, since several of the
instruction classes could fit in a shorter clock cycle.

• Unfortunately implementing the variable speed clock for each instruction
class is extremely difficult, an alternative is to use a shorter clock cycle
that does less work and then vary the number of clock cycles for the
different instruction classes.
• The single cycle implementation violates the design principle of making
the common case fast.
• In this single cycle implementation each functional unit can be used only
once per clock; therefore some functional units must be duplicated raising
the cost of implementation.
• Hence inefficient in its performance and in its hardware cost.
• These difficulties can be avoided by using shorter clock cycle-derived from
the basic functional unit delays, and that requires multiple clock cycles for
each instruction.

Multicycle implementation
• Also called as multiple clock cycle implementation. An
implementation in which an instruction is executed in
multiple clock cycles.
• The multicycle implementation allows a functional unit to be
used more than once per instruction as long as it is used on
different clock cycles.
• The sharing can help reduce the amount of hardware
required.
• The ability to allow instructions to take different numbers of
clock cycles and the ability to share functional units within the
execution of a single instruction are the major advantages of a
multi cycle design

Difference with single cycle version
• A single memory unit is used for both instructions and data.
• There is a single ALU, rather than ALU and two adders.
• One or more registers are added after every major functional unit to hold
the output of that unit until the value is used in a subsequent clock cycle.
• At the end of the clock cycle,all data that is used in subsequent clock
cycles must be stored in a state element.
• Data used by subsequent instructions in a later clock cycle is stored into
one of the programmer-visible state elements:register file,PC or the
memory.
• Data used by same instruction in a later cycle must be stored into one of
these additional registers.

• The position of additional registers is determined by two factors:
What combinational unit will fit in one clock cycle and what data are needed
in later cycles implementing the instruction.
-In the multicycle design it is assumed that the clock cycle can accommodate
at most one of the following operations: A memory access, a register file
access(two reads or one write) or an ALU operation.
-Any data produced by one of theses functional units must be saved into a
temporary registers for use on later cycle
-If we are not saved than the possibility of timing race could occur,leading to
then use of incorrect value.
-All the registers except IR hold data only between a pair of adjacent clock
cycles and will thus not need a write control signal.
-Thus IR needs to hold instruction until the end of the execution of that
instruction and thus will require a write control signal

Pipelining
Up till now: we have designed a processor that
execute all the SimpleRisc instructions.
Two styles:
Hardwired control unit
Microprogrammed control unit:
microprogrammed datapath
microassembly language
microinstruction

Designing efficient processors
 Microprogrammed processors are much
slower that hardwired processors
 Even hardwired processors
 Have a lot of waste !!!
 We have 5 stages.
 What is the IF stage doing, when the MA stage is
active ?
 ANSWER : It is idling

Resource utilization
• Single cycle design: each resource is tied up for the entire
duration of the instruction execution.
• Multi-cycle design:resource utilized in cycle t of instruction I
is available for cycle t+1 of instruction I.
• Pipelined design:resource utilized in cycle t of instruction I is
available again for cycle t of instruction I+1.

Problems with single cycle design
• Slowest instruction pulls down the clock frequency
• Resource utilization is poor
• There are some instructions which are impossible to be implemented in
this manner.
HIGH MULTI-CYCLE DESIGN
CPI
LOW PIPELINED DESIGN SINGLE CYCLE DESIGN
SHORT CYCLE TIME LONG

The Notion of Pipelining
 Let us go back to the car assembly line
 Is the engine shop idle, when the paint shop is
painting a car ?
 NO : It is building the engine of another car
 When this engine goes to the body shop, it builds
the engine of another car, and so on ….
 Insight :
 Multiple cars are built at the same time.
 A car proceeds from one stage to the next

• Pipelining is an implementation technique in which multiple
instructions are overlapped in execution. Today pipelining is
the key to making processors fast.
• Ex. Laundry analogy for pipelining: see Fig.
time 6pm 7pm…… 2am
task
A W D F S
B W D F S
C W D F S
D W D F S

time 6pm 7pm… 9.5pm
task
The washer ,dryer, folder and storer each take 30 minutes for their
task. Sequential laundry takes 8 hrs for four loads of wash while
pipelined laundry takes just 3.5 hrs
we show the pipeline stage of different loads overtime by showing
copies of the four resources on this two dimensional time line, but we
really have just one of each resource.
A W D F S
B W D F S
C W D F S
D W D F S

Observed so far..
• The pipeline paradox is that the time from placing a single
dirty sock in the washer until is dried ,folded and put away is
not shorter for pipelining.
• The reason pipelining is faster for many loads is that
everything is working in parallel, so more loads are finished
per hour,
• Pipelining improves throughput of the laundry system without
improving the time to complete a single load.
• Hence ,pipelining would not decrease the time to complete
one load of laundry, but when we have many loads of laundry
to do,the improvement in throughput decreases the total
time to complete the work.

• If all the stages take about the same amount of time and there is enough
work to do,then the speedup due to pipelining is equal to the number of
stages in the pipeline(in this case four stages).
• 20 loads would take about 5 times as long as 1 load,while 20 loads of
sequential laundry takes 20 times as long as 1 load.
• Its only 8/3.5=2.3 times faster becoz only four loads are shown.
• The beginning and end of the workload in the pipelined version,the
pipeline is not completely full.
• This start-up and wind-down affects performance when the number of
tasks is not large as compared to the number of stages in the pipeline, if
the number of loads is much larger than 4,then the stages will be full most
of the time and the increase in throughput will be very close to 4.

• The same principles apply to processors, where we pipeline
instruction execution. MIPS instructions classically takes five
steps:
1)Fetch instruction from memory
2)Read register while decoding the instruction(The format of
MIPS allows reading and decoding to occur simultaneously).
3)Execute the operation or calculate the address.
4)Access an operand in data memory.
5)Write the result into a register.

Single cycle Vs pipelined performance
Compare the avg time between instructions of a single cycle
implementation, in which all instructions take one clock cycle to
a pipelined implementation.
The operation times for major functional units as example is:
(in the single cycle model every instruction takes exactly one
clock cycle,so the clock cycle must be stretched to accommodate
the lowest instruction.)
Inst class Inst fetch Reg read ALU oper Data
access
Reg write total
Load 200ps 100ps 200ps 200ps 100ps 800ps
Store 200ps 100ps 200ps 200ps 700ps
R-format 200ps 100ps 200ps 100ps 600ps
branch 200ps 100ps 200ps 500ps

200 300 500 700 800 1000 1200 1300 1500 1700 1800
lw $1,100($0)
lw $1,200($0)
lw $3,300($0)
800ps
800ps
IF Reg ALU Data Ac Reg
IF Reg ALU Data Ac Reg
IF

200 400 600 800 1000 1200 1400
a)200ps IF Reg ALU Data Ac Reg
b) 200ps IF Reg ALU Data Ac Reg
c) 200ps IF Reg ALU Data Ac Reg
a)lw $1,100($0)
b)lw $2,200($0)
c)lw $3,300($0)

• Avg time between the instructions reduced from 800ps to 200 ps
• Computer pipeline stage times are limited by the slowest resource. Either
the ALU operation or the memory access.
• We assume that the write to register file occurs in the first half of the
clock cycle and the read from the register file occurs in the second half of
the clock cycle.
• For speedup:
time between instructions(pipelined)=time between
instruction(nonpipelined)/number of pipe stages
Improvement is 800/5=160ps clock cycle.(stages may be imperfectly balanced
or some overheads).
Thus the time per instruction in the pipelined processor will exceed the
minimum possible and speedup will be less than the number of pipeline
stages.

• Our claim of fourfold improvement is not reflected in the total execution
of three instructions(2400/1400=1.8).this is becoz the number of
instructions is not large.Let us increase the number of instructions:
• Let us add 1000,000 instructions in the pipeline,each instruction adds
200ps ,to the total execution time.
• Total execution time=1000,000*200ps+1400ps=200,001400ps
• For nonpipelined:1000,000*800+2400=800,002400ps
• Ratio=800,002400/200,001400=4=800/200
• Note:pipeline improves performance by increasing instruction
throughput,as opposed to decreasing the execution time on an individual
instruction.

Pipeline hazards
• There are situations in the pipeline when the next instruction can not execute in
the following clock cycle, these events are called hazards and there are three
types:
• Structural, data and control hazards.
• Structural Hazard: An occurrence in which planned instruction can not execute in
the proper clock cycle because the hardware cannot support the combinations of
instructions that are set to execute in the given clock cycle.
• With reference to laundry example the structural hazard would occur if we use the
combination of washer-dryer instead of separate washer and dryer, or you
roommate was busy doing something else and wouldn’t put clothes away.
• Hence, the carefully scheduled pipeline plans would than be foiled.
• If the pipeline in the previous Fig,has fourth instruction then we would see that
the first instruction is accessing data from the memory and the fourth instruction
is fetching an instruction from that same memory.
• Without two memories our pipeline could have a structural hazard.

 A structural hazard may occur when two instructions have a conflict on
the same set of resources in a cycle
 Example :
 Assume that we have an add instruction that can read one operand
from memory
 add r1, r2, 10[r3]
 This code will have a structural hazard
 [3] tries to read 10[r3] (MA unit) in cycle 4
 [1] tries to write to 20[r5] (MA unit) in cycle 4
 Does not happen in our pipeline
[1]: st r4, 20[r5]
[2]: sub r8, r9, r10
[3]: add r1, r2, 10[r3]

• Data Hazard: An occurrence in which a planned instruction can not
execute in the proper clock cycle because the data that is needed to
execute the instruction is not yet available.
• Occurs when the pipeline must be stalled because one step must wait for
another to complete.
• In the computer pipeline, data hazards arise from the dependence of one
instruction on an earlier one that is still in the pipeline(relationship that
doesn’t really exists when doing laundry).
add $s0 ,$t0,$t1
sub $t2,$s0,$t3
Data hazard could severely stall the pipeline, the add instruction does not
write its result until the fifth stage, meaning that we would have to add three
bubbles to the pipeline.

Data Hazard
[1]: add r1, r2, r3
[2]: sub r3, r1, r4
1
1
1
1
1
2
2
2
2
2
IF
OF
EX
MA
RW
1 2 3 4 5 6 7 8 9
clock cycles
 Instruction 2 will read incorrect values !!!

This situation represents a data hazard
In specific,
it is a RAW (read after write) hazard
The earliest we can dispatch instruction 2, is cycle 5

• Although we would try to rely on compilers to remove all such hazards,
the result would not be satisfactory. These dependences happen just too
often and the delay is just too long to expect the compiler to rescue us
from this dilemma.
• Forwarding or bypassing:
• For the code sequence mentioned earlier,as soon as the ALU creates the
sum for add, we can supply it as input for the subtract.Adding extra h/w to
retrieve the missing item early from the internal resources is called
forwarding or bypassing.

Graphical representation
Fig.2 add $s0,$t0,$t1
200 400 600 800 1000 1200 1400
a)200ps IF Reg ALU Data Ac Reg
b) 200ps IF Reg ALU Data Ac Reg
c) 200ps IF Reg ALU Data Ac Reg
a)lw $1,100($0)
b)lw $2,200($0)
c)lw $3,300($0)
200 400 600 800 1000
I
F
I
D
a l
u
MEM W
B

Forwarding
• add $s0,$t0,$t1
• sub $t2,$s0,$t3

• If the correct value is already there in another
stage, we can forward it.
[1]: add r1, r2, r3
[2]: sub r4, r1, r2
1
1
1
1
1
2
2
2
2
1 2 3 4 5 6 7 8 9
Clock cycles
2
IF
OF
EX
MA
RW

Forwarding from MA to EX
 Fowarding in cycle 4 from instruction [1]
to [2]
[1]: add r1, r2, r3
[2]: sub r4, r1, r2
1
1
1
1
1
2
2
2
2
1 2 3 4 5 6 7 8 9
Clock cycles
2
IF
OF
EX
MA
RW

Different Forwarding Paths
 We need to add a multitude of forwarding
paths
 Rules for creating forwarding paths
 Add a path from a later stage to an earlier stage
 Try to add a forwarding path as late as possible. For,
example, we avoid the EX → OF forwarding path, since
we have the MA → EX forwarding path
 The IF stage is not a part of any forwarding path.

Forwarding Path
 3 Stage Paths
 RW → OF
 2 Stage Paths
 RW → EX
 MA → OF (X Not Required)
 1 Stage Paths
 RW → MA (load to store)
 MA → EX (ALU Instructions, load, store)
 EX → OF (X Not Required)

Forwarding Paths : RW → MA
[1]: ld r1, 4[r2]
[2]: sw r1, 10[r3]
1
1
1
1
1
2
2
2
2
1 2 3 4 5 6 7 8 9
Clock cycles
2
IF
OF
EX
MA
RW

Forwarding Paths : RW → EX
[1]: ld r1, 4[r2]
[2]: sw r8, 10[r3]
[3]: add r2, r1, r4
1
1
1
1
1
2
2
2
2
1 2 3 4 5 6 7 8 9
Clock cycles
2
IF
OF
EX
MA
RW
3
3
3
3
3

Forwarding Path : MA → EX
[1]: add r1, r2, r3
[2]: sub r4, r1, r2
1
1
1
1
1
2
2
2
2
1 2 3 4 5 6 7 8 9
Clock cycles
2
IF
OF
EX
MA
RW

Forwarding Path : RW → OF
[1]: ld r1, 4[r2]
[2]: sw r4, 10[r3]
1
1
1
1
1
2
2
2
2
1 2 3 4 5 6 7 8 9
Clock cycles
2
IF
OF
EX
MA
RW
[3]: sw r5, 10[r6]
3
3
3
3
3
4
4
4
4
4
[4]: sub r7, r1, r2

• Forwarding works very well but cannot prevent all pipeline stalls.
• Suppose the first instruction were load of $s0,instead of an add, from the
figure it is clear that the desired data would be available only after the
fourth stage of the first instruction in the dependence,which is too late for
the input of the third stage of sub.’
• Even,with forwarding,we would have to stall one stage for a load use data
hazard(specific form of data hazard in which the data requested by a load
instruction has not yet become available when it is requested)

 Cannot forward (arrow goes backwards in time)
 Need to add a bubble (then use RW → EX
forwarding)
[1]: ld r1, 10[r2]
1
1
1
1
1
2
2
2
2
1 2 3 4 5 6 7 8 9
Clock cycles
2
IF
OF
EX
MA
RW
[2]: sub r4, r1, r2

• We need to stall even with forwarding when an R-format instruction
following a load tries to use data.
• A stall initiated in order to resolve a hazard(pipeline stall)

Data Hazards with Forwarding
 Forwarding has unfortunately not eliminated all data hazards
 We are left with one special case.
 Load-use hazard
 The instruction immediately after a load instruction has a RAW
dependence with it.
Consider the c segment: A=B+E and C=B+F,in MIPS
lw $t1,0($t0)
lw $t2,4($t0)
add $t3,$t1,$t2
sw $t3,12($t0)
lw $t4,8($01)
add$t5,$t1,$t4
sw $t5,16($t0)
Find the hazards in instructions and reorder to avoid any pipeline stalls.

Control Hazards
• Also called as branch hazard,An occurrence in which the proper
instruction can not execute in the proper clock cycle because the
instruction that was fetched is not the one that is needed, i.e the flow of
instruction addresses is not what pipeline expected.

Performance of “stall on branch”

• Let us assume that we put in enough extra h/w so that we can test
registers, calculate the branch address, and update the PC during the
second stage of the pipeline. Even with this extra h/w the pipeline
involving conditional branches would look like the fig in previous slide.
• Estimate the impact on the clock cycles per instruction(CPI) of stalling on
branches. Assume all other instructions have a CPI 1.
• It has been found that branches are 13% of the instructions executed in
SPECint2000,since other instructions run have a CPI of 1 and branches
took one extra clock cycle for the stall,hence the CPI 1.13,slowdown of
1.13 versus the ideal case.jumps also incur stalls.

• If we cannot resolve the branch in the second stage,as is often the
case for longer pipelines, then we would see larger slowdown if we
stall on branches.
• The cost of this option is too high for most computers to use and
motivates a second solution to the control hazard.
Predict:with reference to laundry example,if you are pretty sure of
having the right formula to wash uniforms,then just predict that it will
work and wash the second load while waiting for the first load to dry.
This option does not slowdown the pipeline when you are correct.
When you are wrong ,then you need to redo the load that was
washed while guessing the decision.

• Computers do indeed use prediction to handle branches.One
simple approach is to always predict that the branches will be
untaken.
• When you are right,the pipeline proceeds at full speed.Only
when branches are taken does the pipeline stall.
•

Predicting that branches are not taken as a solution to
control hazard

Branch prediction
• Some branches predicted as taken and some untaken.In laundry
example,the dark or home uniforms might take one formula ,while the
light or road uniforms might take another.
• As a computer example,at the bottom of loops are branches that jump
back to the top of the loop.Since they are likely to be taken and they
branch backwards,we could always predict taken for branches that jump
to an earlier address.
• Dynamic h/w predictors make their guesses depending on the behaviour
of each branch and may change predictions for a branch over the life fo a
program.
• In our analogy, a dynamic prediction, a person would look at how dirty the
uniform was and guess at the formula, adjusting the next guess
depending on the success of recent guesses

• One popular approach to dynamic prediction of branches is
keeping a history for each branch as taken or untaken, and
then using the recent past behaviour to predict the future.
• Survey says, dynamic branch predictors can correctly predict
branches with over 90% accuracy.
• When the guess is wrong, the pipeline control must ensure
that the instructions following the wrongly guessed branch
have no effect and must restart the pipeline from the proper
branch address.
• In laundry analogy, we must stop taking new loads so that we
can restart the load that we incorrectly predicted.

• Third approach to Control hazard: delayed decision
• The delayed branch always execute the next sequential
instruction with the branch taking place after that one
instruction delay.
• MIPS s/w will place an instruction immediately after the
delayed branch instruction that is not affected by branch and
a taken branch changes the address of the instruction that
follows this safe instruction

• The add instruction before the branch in Fig. does not affect
the branch and can be moved after the branch to fully hide

CS4109 Computer System Architecture

CS4109 Computer System Architecture

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Destaque

Destaque (20)

Semelhante a CS4109 Computer System Architecture

Semelhante a CS4109 Computer System Architecture (20)

Mais de ktosri

Mais de ktosri (6)

Último

Último (20)

CS4109 Computer System Architecture