CS4109
Computer System Architecture
By
Prof. K.Sridhar Patnaik
Department of Computer Science and
Engineering,
BIT Mesra, Ranchi
Course Objectives
• To learn:
1) How computers work: basic principles.
2) How to analyze their performance (or how not to).
3) How computers are designed and built.
4) Issues affecting modern processors (caches, pipelines, etc.)
Course Motivation
This knowledge will be useful if you need to:
1) Design/build a new computer (a rare opportunity)
2) Design/build a new version of a computer
3) Improve software performance
4) Purchase a computer
5) Provide a solution with an embedded computer
Introduction
• I think it’s fair to say that personal computers
have become the most empowering tool
we’ve ever created. They’re tools of
communication, they’re tools of creativity, and
they can be shaped by their user.
Bill Gates, February 24, 2004
Computers?
A computer is a general purpose device that can be programmed to process
information, and yield meaningful results.
McGrawHill
Computers for..
• Processing and Communication
• Processing (e.g. of numbers) requires a processor, memory, and I/O (input/output)
  Processor: for processing.
  Memory: for storage.
  I/O: can be considered as extended memory (it also lets us interact with the machine).
Computer Architecture Vs Computer
Organization
You can study computer systems from the user's (car driver's) point of view or the designer's (car mechanic's) point of view.
Computer Architecture: the view of a computer as presented to software designers,
i.e. the programmer's (software) point of view.
  Building architecture: structural design (Civil Engg)
  Computer architecture: circuit design (EE)
Computer Organization: the actual implementation of a computer in hardware,
i.e. the hardware designer's point of view.
• Ex. the multiplier: from the architecture point of view you know there is a multiplier; one need not bother about how it is designed. Similarly, there is an instruction set, and knowing the instructions is enough for the user; he is not bothered about how they are implemented.
• How the multiplier and the instruction set are implemented is the job of the designers (organization).
• Application spectrum
(Diagram: a program and an information store go into the computer, which stores results.)
• Program – List of instructions given to the computer
• Information store – data, images, files, videos
• Computer – Process the information store according to the
instructions in the program
McGrawHill
Inside
 Let us take the lid off a desktop computer: inside are the CPU, the memory, and the hard disk.
 Let us make it a full system: the computer (CPU, memory, hard disk) plus keyboard, mouse, monitor, and printer.
(Diagram: the processor, containing a control unit and a datapath, connects to memory and to storage.)
• The processor keeps addressing the memory and gets data from memory for processing.
• The processor consists of datapath circuits plus control circuits.
• The processor must be able to set up some kind of path to handle the data for processing (datapath setting).
• It must also be able to carry out the processing through a sequence of control signals.
• What interacts with the CPU indirectly is the storage (magnetic tape, disk systems, etc.)
• The CPU deals with memory in the same way as it deals with I/O.
SOFTWARE ABSTRACTION
• High-level language (C):
int sum(int x, int y)
{
    int z = x + y;
    return z;
}
• Machine code (bytes at 0x401040 <sum>):
0x55 0x89 0xe5 0x8b 0x45 0x0c 0x03 0x45 0x08 0x89 0xec 0x5d 0xc3
• Assembly:
_sum:
    pushl %ebp
    movl  %esp, %ebp
    movl  12(%ebp), %eax
    addl  8(%ebp), %eax
    movl  %ebp, %esp
    popl  %ebp
    ret
HARDWARE ABSTRACTION
(Diagram: the CPU, containing the program counter (PC), register file, ALU, and bus interface, connects over the system bus and an I/O bridge to main memory (via the memory bus) and to the I/O bus. On the I/O bus sit a USB controller (mouse, keyboard), a graphics adapter (display), a disk controller (disk), and expansion slots for other devices such as network adapters.)
HARDWARE/SOFTWARE INTERFACE
Software: C++ programs, compiled down to machine instructions.
Hardware: registers and adders, built from transistors.
Our focus: the boundary between machine instructions and the hardware that executes them.
How does an Electronic Computer
Differ from our Brain ?
• Computers are ultra-fast and ultra-dumb
Feature                     | Computer   | Our Brilliant Brain
Intelligence                | Dumb       | Intelligent
Speed of basic calculations | Ultra-fast | Slow
Can get tired               | Never      | After some time
Can get bored               | Never      | Almost always
What Can a Computer Understand ?
 Computer can clearly NOT understand
instructions of the form
 Multiply two matrices
 Compute the determinant of a matrix
 Find the shortest path between Mumbai and Delhi
 They understand:
 Add a and b to get c
 Multiply a and b to get c
Architecture levels
• Instruction set architecture:
Lowest level visible to programmer
• Micro architecture:
Fills the gap between instructions and logic
modules
The semantics of all the instructions supported by a processor is known
as its instruction set architecture (ISA). This includes the semantics of
the instructions themselves, along with their operands, and interfaces
with peripheral devices.
Instruction Set Architecture
Abstract Machine
(Diagram: the CPU exchanges addresses, instructions, and data with memory.)
• Programmer-visible state:
  PC: program counter
  Register file: heavily used data
  Condition codes
  Memory: byte array holding code + data, and the stack
Why different processors?
• What is the difference between the processors used in desktops, laptops, mobile phones, washing machines, etc.?
• Performance/speed
• Power consumption
• Cost
• General purpose/special purpose
Topics to be Covered
• Performance issues
• A specific instruction set architecture
• Arithmetic and how to build an ALU
• Constructing a processor to execute instructions
• Pipelining to improve performance
• Memory: caches and virtual memory
• Input/output
Features of an ISA
 Example of instructions in an ISA
 Arithmetic instructions : add, sub, mul, div
 Logical instructions : and, or, not
 Data transfer/movement instructions
 Complete
 It should be able to implement all the
programs that users may write.
Features of an ISA – II
 Concise
 The instruction set should have a limited size.
Typically an ISA contains 32-1000 instructions.
 Generic
 Instructions should not be too specialized, e.g.
add14 (adds a number with 14) instruction is too
specialized
 Simple
 Should not be very complicated.
Designing an ISA
 Important questions that need to be answered :
 How many instructions should we have ?
 What should they do ?
 How complicated should they be ?
Two different paradigms : RISC and CISC
RISC
(Reduced Instruction Set
Computer)
CISC
(Complex Instruction
Set Computer)
RISC vs CISC
A reduced instruction set computer (RISC) implements
simple instructions that have a simple and regular
structure. The number of instructions is typically a small
number (64 to 128). Examples: ARM, IBM PowerPC,
HP PA-RISC
A complex instruction set computer (CISC) implements
complex instructions that are highly irregular, take multiple
operands, and implement complex functionalities.
Secondly, the number of instructions is large (typically
500+). Examples: Intel x86, VAX
Completeness of an ISA – II
How to ensure that we have just enough instructions such that
we can implement every possible program that we might want
to write ?
Answer:
 Let us look at results in theoretical computer science
 Is there a universal ISA ?
The universal machine has a set of basic actions, and each such action can be interpreted as an instruction.
(A universal ISA corresponds to a universal machine.)
The Turing Machine – Alan Turing
 Facts about Alan Turing
 Known as the father of computer science
 Invented the Turing machine, the most powerful
computing model known
 Indian connection : His father worked with the
Indian Civil Service at the time he was born. He
was posted in Chhatrapur, Odisha.
Turing Machine
(Diagram: an infinite tape; a tape head that can only move left (L) or right (R); a state register; and an action table.)
(old state, old symbol) -> (new state, new symbol, left/right)
Operation of a Turing Machine
 There is an infinite tape that extends to the left and right. It
consists of an infinite number of cells.
 The tape head points to a cell, and can either move 1 cell to the left
or right
 Based on the symbol in the cell, and its current state, the Turing
machine computes the transition :
 Computes the next state
 Overwrites the symbol in the cell (or keeps it the same)
 Moves to the left or right by 1 cell
 The action table records the rules for the transitions.
Example of a Turing Machine
• Design a Turing machine to increment a number by 1.
 Start from the rightmost digit. (state = 1)
 If (state = 1), replace the digit x by (x + 1) mod 10
 The new state is equal to the value of the carry
 Keep going left till the '$' sign
(Tape: the digits 3 4 6 9 7 between '$' markers, with the tape head at the rightmost digit.)
More about the Turing Machine
 This machine is extremely simple, and
extremely powerful
 We can solve all kinds of problems – mathematical
problems, engineering analyses, protein folding,
computer games, …
 Try to use the Turing machine to solve many more
types of problems (TO DO)
Church-Turing Thesis
Church-Turing thesis: Any real-world computation can be translated
into an equivalent computation involving a Turing machine.
(source: Wolfram Mathworld).
Any computing system that is equivalent to a Turing machine is said to be
Turing complete.
Universal Turing Machine
For every problem in the world, we can design a Turing Machine (Church-Turing thesis)
Can we design a universal Turing machine that can simulate any other Turing machine?
This would make it a universal machine (UTM).
Why not? The logic of a Turing machine is really simple.
We need to move the tape head left, or right, and update the symbol and
state based on the action table. A UTM can easily do this.
A UTM needs to have an action table, state register, and tape that can
simulate any arbitrary Turing machine.
Universal Turing Machine
(Diagram: Prog. 1, Prog. 2, and Prog. 3 each correspond to their own Turing machine, and all of these can be simulated by a single Universal Turing Machine.)
A Universal Turing Machine
(Diagram: a generic state register, a tape head that moves left (L) or right (R), and a generic action table operate on a tape holding the simulated action table, the simulated state register, and a work area.)
A Universal Turing Machine - II
(Diagram: the same UTM with its parts mapped to computer terms: the generic action table plays the role of the CPU, the simulated action table is the instruction memory, the work area is the data memory, and the simulated state register is the program counter (PC).)
Computer Inspired from the Turing Machine
(Diagram: a CPU containing the program counter (PC), a control unit, and an arithmetic unit fetches instructions from memory, which holds both the program and its data.)
Elements of a Computer
 Memory (array of bytes) contains
 The program, which is a sequence of instructions
 The program data → variables, and constants
 The program counter(PC) points to an instruction in a program
 After executing an instruction, it points to the next instruction
by default
 A branch instruction makes the PC point to another instruction
(not in sequence)
 CPU (Central Processing Unit) contains the
 Program counter, instruction execution units
Designing Practical Machines
Harvard architecture: the CPU (control + ALU) is connected to a separate instruction memory and data memory, plus I/O devices.
Von Neumann architecture: the CPU (control + ALU) is connected to a single memory, holding both instructions and data, and to I/O devices.
Problems with Harvard/ Von-Neumann
Architectures
 The memory is assumed to be one large array of
bytes
 It is very very slow
 Solution:
 Have a small array of named locations (registers) that can
be used by instructions
 This small array is very fast
General rule: the larger a structure is, the slower it is.
Insight: accesses exhibit locality (programs tend to use the same
variables frequently in the same window of time)
Uses of Registers
 A CPU (Processor) contains a set of registers (16-64)
 These are named storage locations.
 Typically values are loaded from memory to registers.
 Arithmetic/logical instructions use registers as input
operands
 Finally, data is stored back into their memory locations.
Example of a Program in Machine
Language with Registers
 r1, r2, and r3, are registers
 mem → array of bytes representing memory
1: r1 = mem[b] // load b
2: r2 = mem[c] // load c
3: r3 = r1 + r2 // add b and c
4: mem[a] = r3 // save the result
Machine with Registers
CPU
Control
ALU
Memory I/O devices
Registers
Performance Measure
• When we say one computer has better performance than
another , what do we mean?
Airplane | Passenger capacity | Cruising range (miles) | Cruising speed (mph) | Passenger throughput (passengers x mph)
A1       | 300                | 4000                   | 600                  | 180,000
A2       | 400                | 3500                   | 600                  | 240,000
A3       | 130                | 3400                   | 1000                 | 130,000
A4       | 140                | 8000                   | 500                  | 70,000
• Considering different measures of performance: the plane with the highest cruising speed is A3, the plane with the longest range is A4, and the plane with the largest capacity is A2.
• Suppose we define performance in terms of speed. Then the fastest plane is the one with the highest cruising speed, taking a passenger from one point to another in the least time.
• If you are interested in transporting 400 passengers, then A2 (with the highest throughput) would clearly be the fastest.
• Similarly , we can define computer performance in several different
ways
• If you are running a program on two desktop computers, the faster one is the one that gets the job done first.
• If you were running a datacenter that had several servers running jobs submitted by many users, the faster computer is the one that completes the most jobs during a day.
• As individual computer users we are interested in reducing response time (execution time).
• Datacenter managers are often interested in increasing throughput, or bandwidth (the total amount of work done in a given time).
• Hence in most cases we need different performance metrics, as well as different sets of applications, to benchmark embedded and desktop computers, which are more focused on response time, versus servers, which are more focused on throughput.
Throughput and Response time
• Do the following changes to a computer system increase throughput, decrease response time, or both?
1. Replacing the processor in a computer with a faster version.
2. Adding additional processors to a system that uses multiple processors for separate tasks, e.g. searching the web.
In case 1, decreasing response time almost always improves throughput, hence case 1 improves both.
In case 2, only throughput increases.
Note: if the demand for processing in the second case were almost as large as the throughput, the system might force requests to queue up. In that situation, increasing the throughput could also improve response time, since it would reduce the waiting time in the queue.
Thus, in many real computer systems, changing either execution time or throughput often affects the other.
Performance Measure Equations
• Performance = 1 / Execution time
• For two computers X and Y, if the performance of X is greater than that of Y:
  Performance_X > Performance_Y
  1 / Execution time_X > 1 / Execution time_Y
  Execution time_Y > Execution time_X
• Do this: if computer A runs a program in 10 sec and computer B runs the same program in 15 sec, how much faster is A than B?
• Response time, or elapsed time: the total time to complete a task, including disk accesses, memory accesses, I/O activities, and OS overhead.
• CPU execution time (CPU time): the actual time the CPU spends computing a specific task (does not include time spent waiting for I/O or running other programs).
• CPU time = user CPU time + system CPU time
  User CPU time: CPU time spent in the program.
  System CPU time: CPU time spent in the OS performing tasks on behalf of the program.
• Differentiating user and system CPU time accurately is difficult, because it is hard to assign responsibility for OS activities to one user program rather than another, and because of the functionality differences among OSs.
• CPU execution time for a program = CPU clock cycles for the program × clock cycle time
  = CPU clock cycles for the program / clock rate
Note: the hardware designer can improve performance by reducing the number of clock cycles required for a program or the length of the clock cycle. The designer often faces a trade-off between the number of clock cycles needed for a program and the length of each cycle: many techniques that decrease the number of clock cycles also increase the clock cycle time.
• Do this: a program runs in 10 secs on computer A, which has a 2 GHz clock. You, as a designer, will build a computer B that runs the same program in 6 secs. You have determined that a substantial increase in the clock rate is possible, but this increase will affect the rest of the CPU design, causing computer B to require 1.2 times as many clock cycles as computer A for this program. What clock rate should the designer target?
• CPU time_A = CPU clock cycles_A / clock rate_A
  10 secs = CPU clock cycles_A / (2 × 10^9 cycles/sec)
  CPU clock cycles_A = 20 × 10^9 cycles
  CPU time_B = 1.2 × CPU clock cycles_A / clock rate_B
  6 secs = 1.2 × 20 × 10^9 cycles / clock rate_B
  Clock rate_B = 4 × 10^9 cycles/sec = 4 GHz
To run the program in 6 secs, B must have twice the clock rate of A.
Instruction Performance
• CPU execution time for a program = CPU clock cycles for the program × clock cycle time
• CPU clock cycles = instructions for the program × average clock cycles per instruction
• Average clock cycles per instruction is abbreviated CPI.
• Do this: suppose we have two implementations of the same ISA. Computer A has a clock cycle time of 250 ps and a CPI of 2.0 for some program, and computer B has a clock cycle time of 500 ps and a CPI of 1.2 for the same program. Which computer is faster for this program, and by how much?
• The classic CPU performance equation:
  CPU time = instruction count × CPI × clock cycle time
           = (instruction count × CPI) / clock rate
SPEC CPU Benchmark
• To evaluate two computer systems, a user would simply compare the execution time of the workload on the two computers.
• Workload: a set of programs run on a computer that is either the actual collection of applications run by a user or is constructed from real programs to approximate such a mix.
• Benchmark: a program selected for use in comparing computer performance.
• SPEC (Standard Performance Evaluation Corporation) is an effort funded and supported by a number of computer vendors to create standard sets of benchmarks for modern computer systems.
• In 1989, SPEC created a benchmark set focusing on processor performance (SPEC89), which has evolved through five generations.
• SPEC CPU2006 consists of a set of 12 integer benchmarks (CINT2006) and 17 floating-point benchmarks (CFP2006).
• The integer benchmarks include a C compiler, a chess program, and quantum computer simulations.
• The floating-point benchmarks include structured grid codes for finite element modeling, particle method codes for molecular dynamics, and sparse linear algebra codes for fluid dynamics.
• Check for SPEC14.
• SPECratio = execution time_REF / execution time
• Take the geometric mean of the SPECratios.
• When comparing two computers using SPEC ratios, use the geometric mean so that it gives the same relative answer no matter which computer is used to normalize the results. If we averaged the normalized execution time values with an arithmetic mean, the results would vary depending on the computer chosen as the reference.
SPECINT2006 benchmarks running on the AMD Opteron X4 model

Description                   | Name | Inst count (x10^9) | CPI  | Clock cycle time (sec x10^-9) | Exe time (secs) | Ref time (secs) | SPECratio
Interpreted string processing | perl | 2118               | 0.75 | 0.4                           | 637             | 9770            | 15.3
GNU C compiler                | gcc  | 1050               | 1.72 | 0.4                           | 724             | 8050            | 11.1
Q1: Show that the ratio of the geometric means is equal to the geometric mean of the performance ratios, and that the reference computer of the SPECratio does not matter.
Assume two computers A and B and a set of SPECratios for each.

Geometric mean_A / Geometric mean_B
  = (Π_{i=1}^{n} SPECratio A_i)^{1/n} / (Π_{i=1}^{n} SPECratio B_i)^{1/n}
  = (Π_{i=1}^{n} SPECratio A_i / SPECratio B_i)^{1/n}
  = (Π_{i=1}^{n} execution time B_i / execution time A_i)^{1/n}
  = (Π_{i=1}^{n} performance A_i / performance B_i)^{1/n}

That is, the ratio of the geometric means equals the geometric mean of the performance ratios; since the reference time cancels in SPECratio A_i / SPECratio B_i, the choice of reference computer does not matter.
Fallacies and Pitfalls
• Pitfall: Expecting the improvement of one aspect of a computer to
increase overall performance by an amount proportional to the size
of the improvement.
• This pitfall has visited designers of both h/w and s/w.
• Ex. Suppose a program runs in 100 secs on a computer, with multiply operations responsible for 80 secs of this time. How much do you have to improve the speed of multiplication if you want your program to run five times faster?
• The execution time of the program after making the improvement is given by the equation known as Amdahl's law:
• Exe time after improvement = (exe time affected by improvement / amount of improvement) + exe time unaffected
  Exe time after improvement = 80/n + (100 - 80)
Running five times faster means finishing in 20 secs:
  20 = 80/n + (100 - 80)
  0 = 80/n
i.e. there is no amount by which we can speed up multiply to achieve a fivefold increase in performance, if multiply accounts for only 80% of the workload.
The performance enhancement possible with a given improvement is limited by the amount that the improved feature is used (a law of diminishing returns).
We can use Amdahl's law to estimate performance improvements when we know the time consumed by some function and its potential speedup.
Amdahl’s Law
(Diagram: a parent thread runs an initialisation and a sequential section, spawns child threads that execute in parallel, and finally performs a thread join operation.)
 For P parallel processors, we can expect a speedup of P (in the ideal case)
 Let us assume that a program takes Told units of time
 Let us divide it into two parts: sequential and parallel
   Sequential portion : Told * fseq
   Parallel portion : Told * (1 - fseq)
   fseq = fraction of time spent in the sequential part
   (1 - fseq) = fraction of time spent in the parallel part
 Only the parallel portion gets sped up P times
 The sequential portion is unaffected
 Equation for the time taken with parallelisation:
   Tnew = Told * (fseq + (1 - fseq)/P)
 The speedup is thus:
   S = Told / Tnew = 1 / (fseq + (1 - fseq)/P)
Implications
 Consider multiple values of fseq : 10%, 5%, and 2%
(Plot: speedup S versus the number of processors P, for P from 0 to 200; each curve rises steeply at first and then saturates, with larger fseq saturating at a lower speedup.)
Conclusions
 We observe that with an increasing number of processors the speedup gradually saturates and tends to the limiting value 1/fseq. We observe diminishing returns as we increase the number of processors beyond a certain point.
 We are limited by the size of the sequential section
 For a very large number of processors, the parallel section's share of the total time is actually very small
 Ideally, a parallel workload should have as small a sequential section as possible.
• Fallacy: computers at low utilization use little power.
Power efficiency matters at low utilizations because server workloads vary. CPU utilization for servers at Google is between 10% and 50% most of the time. (SPECpower benchmark)
Server manufacturer | Microprocessor | Total cores/sockets | Clock rate | Peak perf. | 100% load power | 50% load power | 10% load power | Idle power
HP                  | Xeon E5440     | 8/2                 | 3 GHz      | 308022     | 269 W           | 227 W          | 174 W          | 160 W
Dell                | Xeon E5440     | 8/2                 | 2.8 GHz    | 305413     | 276 W           | 230 W          | 173 W          | 157 W
Fujitsu Siemens     | Xeon X3220     | 4/1                 | 2.4 GHz    | 143742     | 132 W           | 110 W          | 85 W           | 60 W
• Even servers that are only 10% utilized burn about two-thirds of their peak power.
• Since server workloads vary but use a large fraction of peak power, we should redesign hardware to achieve "energy-proportional computing". If future servers used, say, 10% of peak power at 10% workload, we could reduce the electricity bill of datacenters (CO2 emissions are also a concern).
• Pitfall: using a subset of the performance equation as a performance metric.
Measuring performance using the clock rate or the CPI alone is a fallacy; using two of the three factors to compare performance may be valid in a limited context, but is easily misused.
An alternative to time is MIPS (million instructions per second):
  MIPS = instruction count / (execution time × 10^6)
• Problems with MIPS:
  We cannot compare computers with different instruction sets using MIPS.
  MIPS varies between programs on the same computer (a computer cannot have a single MIPS rating).
  If a new program executes more instructions but each instruction is faster, MIPS can vary independently of performance.
MIPS = IC / (execution time × 10^6)
     = IC / ((IC × CPI / clock rate) × 10^6)
     = clock rate / (CPI × 10^6)
ARM Instruction Set
• 32-bit instruction set.
• Arithmetic (add, sub, ...)
• Data transfer (load register, store register, mov, etc.)
• Logical (and, or, not, ...)
• Conditional branch (branch on EQ, NE, LT, LE, ...)
• Unconditional branch (branch (always), branch and link)
Name              | Example                                       | Comments
16 registers      | r0, r1, r2, ..., r12, sp, lr, pc              | Fast locations for data; in ARM, data must be in registers to perform arithmetic.
2^30 memory words | Memory[0], Memory[4], ..., Memory[4294967292] | ARM uses byte addresses, so sequential word addresses differ by 4.
MIPS Instruction Set
Name              | Example                                                            | Comments
32 registers      | $s0-$s7, $t0-$t9, $zero, $a0-$a3, $v0-$v1, $gp, $fp, $sp, $ra, $at | Fast locations for data; in MIPS, data must be in registers to perform arithmetic. Register $zero always equals 0, and register $at is reserved by the assembler to handle large constants.
2^30 memory words | Memory[0], Memory[4], ..., Memory[4294967292]                      | MIPS uses byte addresses, so sequential word addresses differ by 4. Memory holds data structures, arrays, and spilled registers.
Comparison

Category      | Instruction                 | ARM                               | MIPS
Arithmetic    | add                         | ADD r1,r2,r3                      | add $s1,$s2,$s3
              | sub                         | SUB r1,r2,r3                      | sub $s1,$s2,$s3
              | add immediate               |                                   | addi $s1,$s2,20
Data transfer | load register / load word   | LDR r1,[r2,#20] {r1=Memory[r2+20]}| lw $s1,20($s2)
              | store register / store word | STR r1,[r2,#20]                   | sw $s1,20($s2)
ARM assembly language notation
An ARM arithmetic instruction performs only one operation and must always have exactly three operands. Ex.: place the sum of four variables b, c, d, e into variable a.
• ADD a,b,c (the sum of b and c is placed in a)
• ADD a,a,d (the sum of b, c and d is now in a)
• ADD a,a,e (the sum of b, c, d and e is now in a)
It takes three instructions to sum the four variables.
First design principle: simplicity favours regularity.
Requiring every instruction to have exactly three operands, no more and no less, conforms to the philosophy of keeping the hardware simple; hardware for a variable number of operands is more complicated than hardware for a fixed number.
C assignment to ARM assembly
• Second design principle: smaller is faster. (A very large number of registers may increase the clock cycle time because electronic signals take longer when they must travel farther. "Smaller is faster" is not absolute: 15 registers may not be faster than 16. Energy is also a major concern; fewer registers also conserve energy.)
• a=b+c; d=a-e;  becomes  ADD a,b,c ; SUB d,a,e
• f=(g+h)-(i+j);  becomes  ADD t0,g,h ; ADD t1,i,j ; SUB f,t0,t1
Memory operands: single data elements suffice for simple variables, but complex data structures (arrays, structures, etc.) can contain many more elements than there are registers in the computer. How can a computer represent and access such large structures? The processor can keep only a small amount of data in registers, while memory contains billions of data elements; hence data structures (arrays and structures) are kept in memory.
• Compile the C assignment g = h + A[8]:
  LDR r1,[r2,#32]   ; r2 = base address of A, r1 = temporary register
  ADD r3,r4,r1      ; so r3 holds g and r4 holds h
• The offset is 8 words = 32 bytes.
• In ARM, words must start at addresses that are multiples of 4 (an alignment restriction).
(Diagram: the array A laid out in memory, one word per element: byte address 0 holds the first element, address 4 the second, and so on up to address 32 for the element at offset 8.)
Compile A[12] = h + A[8] using load and store:
  LDR r1,[r2,#32]
  ADD r4,r3,r1
  STR r4,[r2,#48]
Many programs have more variables than computers have registers. Consequently, the compiler tries to keep the most frequently used variables in registers and places the rest in memory, using loads and stores to move variables between registers and memory. The process of putting less commonly used variables (or those needed later) into memory is called spilling registers.
To achieve the highest performance and conserve energy, compilers must use registers efficiently.
Constant or Immediate Operands
Many times a program will use a constant in an operation, e.g. incrementing an index to point to the next element of an array. More than half of the ARM arithmetic instructions have a constant as an operand when running the SPEC2006 benchmarks.
To add the constant 4 to register r3 via memory:
  LDR r5,[r1,#AddrConstant4]
  ADD r3,r3,r5
where r1 + AddrConstant4 is the memory address of the constant.
Or, with an immediate operand:
  ADD r3,r3,#4   ; r3 = r3 + 4
Third design principle: make the common case fast. (Constant operands occur frequently, and by including constants inside arithmetic instructions, operations are much faster and use less energy than if constants were loaded from memory.)
Signed and Unsigned Numbers
(Diagram: a 32-bit ARM word, from bit 31 (MSB) down to bit 0 (LSB); the pattern shown, 0000 ... 1011, represents the number 11.)
An ARM word is 32 bits long, so we can represent the numbers from 0 to 2^32 - 1 (4,294,967,295):
0000 0000 0000 0000 0000 0000 0000 0000 = 0
0000 0000 0000 0000 0000 0000 0000 0001 = 1
0000 0000 0000 0000 0000 0000 0000 0010 = 2
... ...
1111 1111 1111 1111 1111 1111 1111 1111 = 4,294,967,295
Positive and negative numbers are distinguished by a single sign bit: 0 (positive) and 1 (negative).
Integer representations: a) signed-magnitude representation
b) signed 1's complement representation
c) signed 2's complement representation
Ex. 14 in an 8-bit register = 0000 1110
Three different ways to represent -14:
i) signed-magnitude representation = 1000 1110
ii) signed 1's complement representation = 1111 0001
iii) signed 2's complement representation = 1111 0010
Conclusions
• The signed-magnitude system is used in ordinary arithmetic but is awkward when employed in computer arithmetic.
• Hence signed complement representations are used, but 1's complement poses difficulties because it has two representations of 0 (+0 and -0); 1's complement is mainly used for logical operations.
• 2's complement is used for representing negative numbers.
2's complement addition:   6  0000 0110
                          13  0000 1101
                          19  0001 0011
Do: -6 + 13, 6 - 13, -6 - 13.
Translating ARM assembly into m/c
instruction
• ADD r5,r1,r2
The decimal representation (the fourth field tells us the instruction performs addition):
14   | 00 | 0 | 4    | 0 | 1    | 5    | 2
The instruction in binary form (the instruction format):
1110 | 00 | 0 | 0100 | 0 | 0001 | 0101 | 0000 0000 0010
Cond | F  | I | Opcode | S | Rn | Rd | Operand2
4 bits | 2 bits | 1 bit | 4 bits | 1 bit | 4 bits | 4 bits | 12 bits

ARM fields:
Cond: condition for conditional execution/branching
F: instruction format
I: immediate (if 0, the second source operand is a register; if 1, a 12-bit immediate)
S: set condition codes (related to conditional branches)
ADD r3,r3,#4   ; r3 = r3 + 4
Cond | F | I | Opcode | S | Rn | Rd | Operand2
14   | 0 | 1 | 4      | 0 | 3  | 3  | 4

LDR r5,[r3,#32]   ; temporary register r5 gets A[8]
Load and store instructions use 6 fields; F = 1 marks a data transfer instruction, and opcode 24 = load word.
Cond   | F      | Opcode | Rn     | Rd     | Offset
14     | 1      | 24     | 3      | 5      | 32
4 bits | 2 bits | 6 bits | 4 bits | 4 bits | 12 bits
Instruction     | Format | Cond | F | I   | Op | S   | Rn  | Rd  | Operand2
ADD             | DP     | 14   | 0 | 0   | 4  | 0   | Reg | Reg | Reg
SUB             | DP     | 14   | 0 | 0   | 2  | 0   | Reg | Reg | Reg
ADD (immediate) | DP     | 14   | 0 | 1   | 4  | 0   | Reg | Reg | Constant
LDR             | DT     | 14   | 1 | n/a | 24 | n/a | Reg | Reg | Address
STR             | DT     | 14   | 1 | n/a | 25 | n/a | Reg | Reg | Address
A[30] = h + A[30]
LDR r5,[r3,#120]
ADD r5,r2,r5
STR r5,[r3,#120]
Decimal (the LDR/STR rows use the 6-field DT format; the ADD row uses the 8-field DP format):
14 | 1 | 24 | 3 | 5 | 120
14 | 0 | 0 | 4 | 0 | 2 | 5 | 5
14 | 1 | 25 | 3 | 5 | 120
Binary (the 12-bit offset 120 = 0000 0111 1000):
1110 | 01 | 011000 | 0011 | 0101 | 0000 0111 1000
1110 | 00 | 0 | 0100 | 0 | 0010 | 0101 | 0000 0000 0101
1110 | 01 | 011001 | 0011 | 0101 | 0000 0111 1000
Logical Operations
AND r5,r1,r2   ; reg5 = reg1 & reg2
MVN r5,r1      ; reg5 = ~reg1 (move NOT)
MOV r6,r5      ; reg6 = reg5
MIPS code for f = (g+h) - (i+j)
Assign variables to registers: $s0=f, $s1=g, $s2=h, $s3=i, $s4=j
add $t0,$s1,$s2
add $t1,$s3,$s4
sub $s0,$t0,$t1
For g = h + A[8]:
lw $t0,32($s3)    # temporary register $t0 gets A[8]
add $s1,$s2,$t0
MIPS decimal representation of add $t0,$s1,$s2 (register numbers: $s1=17, $s2=18, $t0=8):
op     | rs     | rt     | rd     | shamt  | funct
0      | 17     | 18     | 8      | 0      | 32
6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits
MIPS Fields
op = opcode
rs = first register source operand
rt = second register source operand
rd = register destination operand
shamt = shift amount
funct = function; this field selects the specific variant of the operation in the op field and is also called the function code.
op     | rs     | rt     | rd     | shamt  | funct
6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits
• A problem occurs when an instruction needs longer fields than those shown. Ex.: the load word instruction must specify two registers and a constant. If the address were to use one of the 5-bit fields of the format, the constant within the load word would be limited to 2^5 = 32. This constant is used to select elements from arrays or data structures and is often much larger than 32.
• We have a conflict between the desire to keep all instructions the same length and the desire to have a single instruction format.
• Fourth design principle: good design demands good compromises.
• The compromise chosen is to keep all instructions the same length, thereby requiring different kinds of instruction formats for different kinds of instructions.
• R-type (register format), as shown above.
• I-type (immediate or data transfer).
I-format
A 16-bit address means a load word instruction can load any word within a region
of ±2^15 bytes (±8192 words) of the address in the base register rs.
lw $t0,32($s3) # temp reg t0 gets A[8]
op rs rt Const or
address
6bits 5bits 5bits 16bits
Inst Format Op Rs Rt Rd Shamt Funct address
add R 0 Reg Reg Reg O 32 Na
sub R 0 Reg Reg Reg O 34 Na
addi I 8 Reg Reg Na Na Na Const
lw I 35 Reg Reg Na Na Na Address
sw I 43 Reg Reg Na Na Na Address
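The field layouts in the table above can be checked with a short Python sketch (illustrative only, not part of the slides; the function names are my own):

```python
def r_format(op, rs, rt, rd, shamt, funct):
    """Pack the six R-format fields (6/5/5/5/5/6 bits) into a 32-bit word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

def i_format(op, rs, rt, imm):
    """Pack the I-format fields: op, rs, rt, and a 16-bit constant/address."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

# add $t0,$s1,$s2  ->  op=0, rs=17 ($s1), rt=18 ($s2), rd=8 ($t0), shamt=0, funct=32
word = r_format(0, 17, 18, 8, 0, 32)
print(f"{word:032b}")  # 00000010001100100100000000100000
```

Reading the printed bits in 6/5/5/5/5/6 groups reproduces the decimal row 0, 17, 18, 8, 0, 32 shown earlier.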
Translate MIPS assembly into m/c code
• A[300]=h+A[300]
lw $t0,1200($t1) # temp reg t0 gets A[300]
add $t0,$s1,$t0 # temp reg t0 gets h+A[300]
sw $t0,1200($t1) # stores h+A[300] back into A[300]
op rs rt rd shamt/address funct
35 9 8 1200
0 18 8 8 0 32
43 9 8 1200
m/c code
$s0-$s7 mapped to 16-23
$t0-$t7 mapped to 8-15
Op Rs Rt Rd Shamt/address funct
100011 01001 01000 0000 0100 1011 0000
000000 10010 01000 01000 00000 100000
101011 01001 01000 0000 0100 1011 0000
• In MIPS
• Shift left logical(sll) , ex. sll $t0,$s0,4 # reg $t0=$regs0<<4 bits
• Shift right logical(srl)
• In ARM
• Logical shift left(LSL),ex LSL r5,r1 LSL #2,# r5=r1<<2
• Logical shift right(LSR),ex MOV r6,r5 LSR #4 , #r6=r5>>4
• In MIPS
op rs rt rd shamt funct
0 0 16 10 4 0
[Flowchart: test i==j; if equal, f=g+h; if i≠j, f=g-h; both paths join at Exit]
• Shifting left by i bits gives the same result as multiplying by 2^i
• Shifting right by i bits gives the same result as dividing by 2^i
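A quick illustrative check of these shift identities (not from the slides):

```python
x = 13
for i in range(6):
    assert x << i == x * 2**i      # left shift by i == multiply by 2^i
    assert (x * 2**i) >> i == x    # right shift by i == divide by 2^i
# With truncation: 52 >> 3 is 6, matching integer division 52 // 8
print(13 << 2, 52 >> 3)  # 52 6
```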
Instructions for making decisions:in ARM
• If(i==j),f=g+h;else f=g-h;
Assigning variables to registers:r0=f,r1=g,r2=h,r3=i,r4=j
CMP r3,r4
BNE Else;go to else if i≠j
ADD r0,r1,r2;
B Exit;go to exit
Else :SUB r0,r1,r2;
Exit:
Instructions for making decisions
• If(i==j),f=g+h;else f=g-h; in MIPS
beq reg1,reg2,L1 #go to statement L1 if reg1 ==reg2
bne reg1,reg2,L1 #go to statement L1 if reg1 ≠reg2
f,g,h,i,j correspond to the five registers $s0-$s4
bne $s3,$s4,Else #go to Else if i ≠j
add $s0,$s1,$s2 # f=g+h skipped if i ≠j
J Exit # go to exit #unconditional branch
Else: sub $s0,$s1,$s2 # f=g-h skipped if i==j
Exit:
ARM and MIPS assembly for the loop
In ARM
while (save[i]==k)
i+=1;
Assume i and k corresponds to registers r3 and r5 and the base of the array save in r6
First step:save[i] in temp reg.before that add i to the base of the array save to form the
address,multiply the index i by 4 due to the byte addressing problem.
Use LSL, since shifting left 2 bits means multiplying by 2^2
Loop:ADD r12,r6,r3,LSL #2 ;r12=address of save[i]
LDR r0,[r12,#0] ; temp reg r0=save[i]
CMP r0,r5
BNE Exit ;go to Exit if save[i]≠ 𝑘
ADD r3,r3,#1 ;i=i+1
B loop ;go to loop
Exit:
In MIPS
Loop:sll $t1,$s3,2 ;temp reg $t1=4*i
add $t1,$t1,$s6 ;$t1=address of save[i]
lw $t0,0($t1) ;temp reg $t0=save[i]
bne $t0,$s5,Exit ;go to exit if save[i]≠k
addi $s3,$s3,1 ;i=i+1
j Loop ;go to Loop
Exit:
Example:
Let r0=1111 1111 and r1=0000 0001
CMP r0,r1
Which conditional branch is taken?
BLO L1 ;unsigned branch (branch on lower): not taken, since as an unsigned
number 1111 1111 = 255 > 1
BLT L2 ;signed branch (branch on less than): taken, since as a signed
number 1111 1111 = -1 < 1
Note:an unsigned comparison of x<y also checks if x is negative as well as if x is less than y
In MIPS the instruction slt(set on less than) used in for loops
slt $t0,$s3,$s4 ;means register $t0 is set to 1 if the $s3<$s4
otherwise $t0 set to 0.
slti $t0,$s2,10 ;$t0=1 if $s2<10
The MIPS architecture does not include branch on less than because it is too
complicated;either it would stretch the clock cycle time or it would take extra
clock cycles per instruction.
MIPS compilers use slt,slti,bne,beq and the fixed value 0($zero) to create all
relative conditions,equal,not equal,less than,less than or equal,greater
than,greater than or equal.
Case/Switch statements
• The simplest way to implement switch is via a sequence of conditional
tests, turning the switch statement into a chain of if-then-else statements.
• Sometimes the alternatives may be more efficiently encoded as a table of
addresses of alternative instruction sequences,called a jump address
table/jump table.And the program needs only to index into the table and then
jump to the appropriate sequence.
• The jump table is then just an array of words containing addresses that
correspond to labels in the code.The program need to jump using the address
in the appropriate entry from the jump table.
• ARM handles such situations implicitly by stored program concept .
• Need to have a register to hold the address of the current instruction being
executed (PC-program counter)
• In ARM register 15 is the PC(also instruction address register)
• Any instruction with register 15 as destination register is an unconditional
branch to the address at the value.
• Encoding branch instruction in ARM
• The cond field encodes the many versions of the conditional branch like
Cond 12 address
4bits 4bits 24bits
Value Meaning Value Meaning
0 EQ(EQual) 8 HI(unsigned higher)
1 NE(Not Equal) 9 LS(unsigned Lower or Same)
2 HS(unsigned Higher or Same) 10 GE(signed Greater than or Equal)
3 LO(unsigned LOwer) 11 LT(signed Less Than)
4 MI(MInus,<0) 12 GT(signed Greater Than)
5 PL(PLus,>=0) 13 LE(signed Less Than or Equal)
6 VS(oVerflow Set,overflow) 14 AL(ALways)
7 VC(oVerflow Clear,no overflow) 15 NV(reserved)
• The cond field encodes the many versions of the conditional branch (shown in
the table)
• The 24-bit address limits programs to 2^24 or 16MB, which would be fine for
many programs but constrains large ones.
• An alternative would be a register added to the branch address, so the branch
instruction would calculate PC = Reg + branch address.
• The sum would allow programs to be as large as 2^32, solving the branch
address size problem.
• Which is that register?
• Conditional branches are found in loops and in if statements,so they tend to
branch to a nearby instruction.Ex about half of all conditional branches in SPEC
benchmarks go to locations less than 16 instructions away.
• Since the PC contains the address of the current instruction, we can branch
within ±2^24 words of the current instruction if we use the PC as the register
to be added to the address. All loops and if statements are much smaller
than ±2^24 words, so the PC is the ideal choice.
• Also called as PC-relative addressing.
Conditional Execution
• Unusual feature of ARM is most instruction can be conditionally executed
,not just branches.This is the purpose of the 4 bit cond field found in the
most ARM instruction formats.
• The assembly language programmer simply appends the desired condition
to the instruction name to perform the operation only if the condition is
true based on the last time the condition flags were set.
• Ex
CMP r3,r4
ADDEQ r0,r1,r2
SUBNE r0,r1,r2
Procedures in Computer Hardware
• Procedure: A stored subroutine that performs a specific task based on the
parameters with which it is provided.
• Registers are the fastest place to hold data in a computer,so use them as
much as possible .ARM s/w follows the following conventions for
procedure calling in allocating 16 registers.
• r0-r3: four argument registers in which to pass parameters.
• lr: one link register containing the return address, used to return to the
point of origin.
• BL ProcedureAddress: the branch-and-link instruction jumps to an address
and simultaneously saves the address of the following instruction in the
link register (lr, register 14).
• MOV pc,lr
• The calling program ,or caller,puts the parameter values in r0-r3 and uses BL X
to jump to procedure X(callee).The callee then performs the calculations,
places the results (if any) into r0 and r1,and returns control to the caller using
MOV pc,lr.
• The BL instruction actually saves PC+4 in register lr to link to the following
instruction to set up the procedure return.
• Stack:
• Suppose a compiler needs more registers for a procedure than the four
arguments and two return value registers.Since we must cover our tracks after
our mission is complete,
any registers needed by the caller must be restored to the values that they
contained before the procedure was invoked.This situation is an example in which
we need to spill registers to memory.
The ideal data structure for spilling registers is a stack, a last-in-first-out (LIFO) structure.
• A stack needs a pointer to the most recently allocated address in the stack
to show where the next procedure should place the registers to be spilled
or where old register values are found.
• The stack pointer is adjusted by one word for each register that is saved or
restored. ARM reserves register 13 for SP.
• Compile the C procedure that does not call another procedure
int example(int g,int h,int i, int j)
{
int f;
f=(g+h)-(i+j)
return f;
}
• Let g,h,i,j correspond to the argument registers r0,r1,r2,r3 and f
corresponds to r4.
• Label of the procedure is ex_procedure:
• Save three registers:r4,r5,r6.
• “push” the old values onto the stack by creating space for three words(12
bytes) on the stack and then store them:
SUB sp,sp,#12 ;adjust stack to make room for 3 items
STR r6,[sp,#8] ;save register r6 for use afterwards
STR r5,[sp,#4] ;save register r5 for use afterwards
STR r4,[sp,#0] ;save register r4 for use afterwards
ADD r5,r0,r1 ;reg r5 contains g+h
ADD r6,r2,r3 ;reg r6 contains i+j
SUB r4,r5,r6 ;f gets r5-r6, which is (g+h)-(i+j)
Values of sp and the stack:
before,during and after the procedure call
[Figure: the stack grows from high to low addresses; sp starts high, moves down by 12 bytes to hold the contents of r6, r5, and r4 during the call, and returns to its original value afterwards]
To return the value of f we can copy it into a return value register r0:
MOV r0,r4 ; return f(r0=r4)
Before returning ,we restore the three old values of the registers we saved by
“popping”them from the stack.
LDR r4,[sp,#0] ;restore register r4 for caller
LDR r5,[sp,#4] ;restore register r5 for caller
LDR r6,[sp,#8] ;restore register r6 for caller
ADD sp,sp,#12 ;adjust stack to delete 3 items
The procedure ends with a jump register using the return address:
MOV pc,lr ;jump back to calling routine.
We used temporary registers and assumed their old values must be saved and
restored.To avoid saving and restoring a register whose value is never used,which
might happen with a temporary register,ARM software separates 12 of the
registers into two groups:
r0-r3,r12:argument or scratch registers that are not preserved by the
callee(called procedure)on a procedure call
r4-r11:eight variable registers that must be preserved on a procedure call(if
used ,the callee saves and restores them)
Note:This simple convention reduces register spilling .In the example above,if
we could rewrite the code to use r12 and reuse one of the r0 to r3,we can
drop two stores and two loads from the code. We still must save and restore
r4,since the callee must assume that the caller needs its value.
In MIPS:
MIPS s/w follows the following convention in allocating its 32 registers for
procedure calling:
$a0-$a3 : four argument register in which to pass parameters
$v0-$v1: two value registers in which to return values.
$ra : one return address register to return to the point of origin.
jal: jump-and-link instruction(an instruction that jumps to an address and
simultaneously saves the address of the following instruction in a register($ra
in MIPS)
jal ProcedureAddress
jal instruction saves PC+4 in register $ra to link to the following instruction to
set up the procedure return.
jr $ra ;jump register instruction meaning an unconditional jump
to the address specified in a register.
The calling program ,or caller,puts the parameter values in $a0-$a3 and uses
jal X to jump to procedure X(callee).The callee then performs the
calculations, places the results (if any) into $v0 and $v1,and returns control to
the caller using jr $ra.
• Let g,h,i,j correspond to the argument registers $a0,$a1,$a2,$a3 and f
corresponds to $s0.
• Label of the procedure is ex_procedure:
• Save three registers:$s0,$t0,$t1.
• “push” the old values onto the stack by creating space for three words(12
bytes) on the stack and then store them:
addi $sp,$sp,-12 # adjust stack to make room for 3 items
sw $t1,8($sp) # save register $t1 for use afterwards
sw $t0,4($sp) # save register $t0 for use afterwards
sw $s0,0($sp) # save register $s0 for use afterwards
add $t0,$a0,$a1 # reg $t0 contains g+h
add $t1,$a2,$a3 # reg $t1 contains i+j
sub $s0,$t0,$t1 # f gets $t0-$t1, which is (g+h)-(i+j)
add $v0,$s0,$zero # returns f ($v0=$s0+0)
Before returning ,we restore the three old values of the registers we saved by
“popping”them from the stack.
lw $s0,0($sp) ;restore register $s0 for caller
lw $t0,4($sp) ;restore register $t0 for caller
lw $t1,8($sp) ;restore register $t1 for caller
addi $sp,$sp,12 ;adjust stack to delete 3 items
The procedure ends with a jump register using the return address:
jr $ra ;jump back to calling routine.
We used temporary registers and assumed their old values must be saved and
restored.To avoid saving and restoring a register whose value is never used,which
might happen with a temporary register,MIPS software separates 18 of the
registers into two groups:
$t0-$t9 :10 temp registers that are not preserved by the callee(called
procedure)on a procedure call
$s0-$s7:eight saved registers that must be preserved on a procedure call(if used
,the callee saves and restores them)
Note:This simple convention reduces register spilling .In the example above,
since the caller (procedure doing the calling) does not expect registers $t0
and $t1 to be preserved across a procedure call ,we can drop two stores and
two loads from the code. We still must save and restore $s0,since the callee
must assume that the caller needs its value.
Nested Procedures:
Ex.Suppose that the main program calls procedure A with an argument of
3,by placing the value 3 into register r0 and then using BL A. Then suppose
that procedure A calls procedure B via BL B with an argument of 7,also placed
in r0.Since A has not finished its task yet,there is a conflict over the use of
register r0.Similarly there is a conflict over the return address in register
lr,since it now has the return address for B.
Solution:push all the other registers that must be preserved onto the stack.
• The caller pushes any argument registers(r0-r3) that are needed after the
call. The callee pushes the return address register lr and any variable
registers (r4-r11)used by the callee. The sp is adjusted to account for the
number of registers placed on the stack.Upon the return, the registers are
restored from memory and the sp is readjusted.
• Convert the following into ARM and MIPS assembly code:
int fact(int n)
{
if(n<1)return (1);
else return(n*fact(n-1));
}
fact: ARM
SUB sp,sp,#8 ;adjust stack for 2 items
STR lr,[sp,#4] ;save return address
STR r0,[sp,#0] ;save the argument n
CMP r0,#1 ;compare n to 1
BGE L1 ;if n>=1, go to L1
MOV r0,#1 ;return 1
ADD sp,sp,#8 ;pop two items off stack
MOV pc,lr ;return to the caller
L1: SUB r0,r0,#1 ;n>=1: argument gets (n-1)
BL fact ;call fact with (n-1)
MOV r12,r0 ;save the return value
LDR r0,[sp,#0] ;return from BL; restore argument n
LDR lr,[sp,#4] ;restore the return address
ADD sp,sp,#8 ;adjust sp to pop 2 items
MUL r0,r0,r12 ;return n * fact(n-1)
MOV pc,lr ;return to the caller

fact: MIPS
addi $sp,$sp,-8 # adjust stack for 2 items
sw $ra,4($sp) # save return address
sw $a0,0($sp) # save the argument n
slti $t0,$a0,1 # test for n<1
beq $t0,$zero,L1 # if n>=1, go to L1
addi $v0,$zero,1 # return 1
addi $sp,$sp,8 # pop two items off stack
jr $ra # return to after jal
L1: addi $a0,$a0,-1 # n>=1: argument gets (n-1)
jal fact # call fact with (n-1)
lw $a0,0($sp) # return from jal; restore argument n
lw $ra,4($sp) # restore the return address
addi $sp,$sp,8 # adjust sp to pop 2 items
mul $v0,$a0,$v0 # return n * fact(n-1)
jr $ra # return to caller
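For comparison, the recursion both assembly versions implement can be written directly in Python; here the language's call stack plays the role of the explicit sp-managed stack (illustrative sketch, not from the slides):

```python
def fact(n):
    # base case: n < 1 returns 1 (the MOV r0,#1 / addi $v0,$zero,1 path)
    if n < 1:
        return 1
    # recursive case: call fact(n-1), then multiply by n on the way back
    return n * fact(n - 1)

print(fact(5))  # 120
```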
Stack allocation before during and
after procedural call
[Figure: $fp marks the start of the procedure frame; below it, from high to low addresses, sit the saved argument registers (if any), the saved return address, saved registers (if any), and local arrays and structures (if any); $sp points to the top of the stack]
What is and what not preserved across
a procedure call
Preserved Not Preserved
Variable registers:r4-r11 Argument registers:r0-r3
Stack pointer register:sp Intra procedure-call scratch register:r12
Link register:lr Stack below the sp
Stack above the sp
• Storage classes in C:automatic (local to procedure)and static(variables outside
procedures).
• Global pointer($gp):To simplify access to global data MIPS s/w reserves
another register called global pointer.
• Allocating space for new data on stack: The final complexity is that stack is also
used to store variables that are local to the procedure that do not fit in the
registers such as local arrays and structures.
• The segment of the stack containing a procedure’s saved registers and local
variables is called :procedure frame(activation record).
• Some MIPS s/w uses a frame pointer($fp) to point to first word of the frame of
the procedure.
• $fp is the value denoting the location of the saved registers and local variables
for a given procedure.
• $sp points to the top of the stack. When a $fp is used, it is initialized using
the address in $sp on a call, and $sp is restored using $fp.
Allocating space for new data on the
heap,MIPS convention
[Figure: MIPS memory layout, from high to low addresses —
$sp -> 7fff fffc(hex): Stack (grows downward)
Dynamic data (heap, grows upward)
$gp -> 1000 8000(hex); Static data starting at 1000 0000(hex)
$pc -> 0040 0000(hex): Text (program code)
0: Reserved]
• In addition to the automatic variables that are local to procedures, C
programmers need space in memory for static variables and for dynamic
data structures.
• Text segment: the segment of a UNIX object file that contains the machine
code for routines in the source file.
• Static data segment:place for constants and other static variables.
• Data structures like linked lists tend to grow and shrink during their
lifetimes. The segment for such data structures is the heap.
• Stack and heap grow towards each other allowing efficient use of
memory as two segments wax and wane.
• C allocates and frees space on the heap with explicit functions: malloc()
allocates space on the heap and returns a pointer to it, and free()
releases space on the heap to which the pointer points.
• Memory allocation is controlled by programs in C,and it is the source of
many common and difficult bugs.
• Forgetting to free space leads to a "memory leak", which eventually uses up
so much memory that the OS may crash. Freeing space too early leads to
"dangling pointers", which can cause pointers to point to things that the
program never intended.
• GNU MIPS C compiler uses a frame pointer.C compiler from MIPS/Silicon
graphics does not use fp,it uses register 30 as another save register.
ARM register conventions
Name   Register number  Usage                                    Preserved on call?
a1-a2  0-1              Argument/return result/scratch register  no
a3-a4  2-3              Argument/scratch register                no
v1-v8  4-11             Variables for local routine              yes
ip     12               Intra-procedure-call scratch register    no
sp     13               Stack pointer                            yes
lr     14               Link register                            yes
pc     15               Program counter                          n.a.
MIPS register Convention
Name     Register number  Usage                                         Preserved on call?
$zero    0                The constant value 0                          n.a.
$v0-$v1  2-3              Values for results and expression evaluation  no
$a0-$a3  4-7              Arguments                                     no
$t0-$t7  8-15             Temporaries                                   no
$s0-$s7  16-23            Saved                                         yes
$t8-$t9  24-25            More temporaries                              no
$gp      28               Global pointer                                yes
$sp      29               Stack pointer                                 yes
$fp      30               Frame pointer                                 yes
$ra      31               Return address                                yes
32-bit immediate operand
• Although constants are frequently short and fit into 16 bit field,sometimes
they are bigger.The MIPS instruction set includes the instruction load
upper immediate(lui)
• Specifically to set the upper 16bits of a constant in a register, allowing a
subsequent instruction to specify the lower 16bits of the constant.
Machine code of lui $t0,255 (t0 is register 8):
001111 00000 01000 0000 0000 1111 1111
Contents of register $t0 after executing lui $t0,255:
0000 0000 1111 1111 0000 0000 0000 0000
• Loading 32-bit constant
• What is the MIPS assembly code to load 32-bit constant into register $s0?
0000 0000 0011 1101 0000 1001 0000 0000
First load the upper 16 bits, which is 61 in decimal, using lui:
lui $s0,61
The value of register $s0 afterwards is
0000 0000 0011 1101 0000 0000 0000 0000
Then add the lower 16 bits, whose decimal value is 2304:
ori $s0,$s0,2304
The final value in the register $s0
0000 0000 0011 1101 0000 1001 0000 0000
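The arithmetic of the lui/ori pair can be verified with a small Python sketch (illustrative, not from the slides):

```python
upper, lower = 61, 2304           # decimal values used in the example
value = (upper << 16) | lower     # lui sets the top 16 bits, ori ORs in the rest
print(f"{value:032b}")            # 00000000001111010000100100000000
```

The printed bits match the final register contents shown above (0x003D0900).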
Compiling a String Copy Procedure
void strcpy(char x[], char y[])
{ int i;
i = 0;
while ((x[i] = y[i]) != '\0') /* copy and test byte */
i += 1;
}
• In ARM:
Assume the base addresses for arrays x and y are found in r0 and r1,while i is in
r4.strcpy adjusts the stack pointer and then saves the saved register r4 on the
stack
strcpy:
SUB sp,sp,#4 ;adjust stack for 1 more item
STR r4,[sp,#0] ;save r4
To initialize i to 0,the next instruction sets r4 to 0 by adding 0 to 0 and placing that
sum in r4:
MOV r4,#0 ;i=0+0
L1:ADD r2,r4,r1 ;address of y[i] in r2 {this is the beginning of the loop,the
address of y[i] is first formed by adding i to y[](assume array of bytes).
To load the character in y[i]
LDRB r3,[r2,#0] ;r3=y[i] and set condition flags {load register byte;loads a byte
from memory
ADD r12,r4,r0 ;address of x[i] in r12
STRB r3,[r12,#0] ;x[i]=y[i]
BEQ L2 ;if y[i]==0; go to L2
Increment i and loop back
ADD r4,r4,#1 ;i=i+1
B L1 ;go to L1
If we don’t loop back it was the last character of the strings ;we restore r4 and the
stack pointer,and then return
L2:LDR r4,[sp,#0] ;y[i]==0;end of string ,restore old r4
ADD sp,sp,#4 ;pop 1 word off stack
MOV pc,lr ;return
In MIPS:
strcpy:
addi $sp,$sp,-4 # adjust stack for 1 more item
sw $s0,0($sp) # save $s0
add $s0,$zero,$zero # i = 0
L1: add $t1,$s0,$a1 # address of y[i] in $t1
lb $t2,0($t1) # $t2 = y[i]
add $t3,$s0,$a0 # address of x[i] in $t3
sb $t2,0($t3) # x[i] = y[i]
beq $t2,$zero,L2 # if y[i] == 0, go to L2
addi $s0,$s0,1 # i = i + 1
j L1 # go to L1
L2: lw $s0,0($sp)
addi $sp,$sp,4
jr $ra
Addressing in Branches and Jumps
J 10000 ;go to location 10000
bne $s0,$s1,Exit ;go to Exit if $s0≠ $s1
If the address of the program is bigger than 16 bits, then PC = Reg + branch
address; the sum becomes 32 bits.
2 10000
6 bits 26bits
5 16 17 Exit
6bits 5bits 5bits 16bits
Branching far away: Given beq $s0,$s1,L1,replace with a pair of instruction that
offer a much greater branching distance.
Short-range conditional branch plus a jump:
bne $s0,$s1,L2
j L1
L2:
MIPS Addressing Modes:
1)Register addressing(where the operand is a register)
2)Base or displacement addressing (where the operand is at the memory location
whose address is the sum of a register and a constant in the instruction)
3)Immediate addressing(operand is the constant within the instruction itself)
4)PC relative addressing(where the address is the sum of the PC and a constant in
the instruction)
5)Pseudodirect addressing (where the jump address is the 26 bits of the
instruction concatenated with the upper bits of the PC)
• 1)Convert the machine instruction 00af8020 (hex) into assembly:
0000 0000 1010 1111 1000 0000 0010 0000
Find the op field
Bits 31-29 and 28-26 are 000 and 000; hence it is an R-format instruction
000000 00101 01111 10000 00000 100000
op rs rt rd shamt funct
Bits 5-3 are 100 and bits 2-0 are 000; hence funct = 100000, which represents add.
The decimal values are 5 for rs, 15 for rt, and 16 for rd (shamt is unused). These
numbers represent registers $a1, $t7 and $s0:
add $s0,$a1,$t7
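The same field extraction can be done mechanically in Python (an illustrative sketch, not part of the slides):

```python
word = 0x00AF8020
op    = (word >> 26) & 0x3F   # bits 31-26
rs    = (word >> 21) & 0x1F   # bits 25-21
rt    = (word >> 16) & 0x1F   # bits 20-16
rd    = (word >> 11) & 0x1F   # bits 15-11
shamt = (word >> 6)  & 0x1F   # bits 10-6
funct =  word        & 0x3F   # bits 5-0
# op=0 and funct=32 identify add; rs=5 ($a1), rt=15 ($t7), rd=16 ($s0)
print(op, rs, rt, rd, shamt, funct)  # 0 5 15 16 0 32
```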
Translating and starting a program
C-program----COMPILER----Assembly program--ASSEMBLER--
object:m/c language module---LINKER--exe m/c language program--
LOADER--memory
object:library routine(m/c language)
Source file-x.c
Assembly file-x.s
Object file-x.o
Statically linked library routines are x.a and dynamically linked library routes
are x.so
Executable files by default are called a.out.
MS-DOS uses .C,.ASM,.OBJ,.LIB,.DLL,and .EXE
Compiling java program
Java program----COMPILER---classfiles(bytecodes)
java library routines(m/c language)
JIT JVM
compiled java methods(m/c language)
JVM-s/w interpreter,can execute bytecodes,it’s a program that simulates an
ISA.portable and found in devices –mobile phones to internet browsers.
To preserve portability and improve execution speed the next phase is JIT
JIT-compiler that operates on runtime, translating the interpreted code
segments into the native code of the computer.
Arithmetic for computers
• Two's complement representation: positive and negative 32-bit numbers can
be represented as
x31 × (-2^31) + x30 × 2^30 + x29 × 2^29 + … + x1 × 2^1 + x0 × 2^0
• The sign bit is multiplied by -2^31, and the rest of the bits are then multiplied
by positive versions of their respective base values.
• Decimal value of two’s complement number:
1111 1111 1111 1111 1111 1111 1111 1100
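Applying the formula above to this bit pattern in Python (illustrative sketch) shows the value works out to -4:

```python
bits = "11111111111111111111111111111100"
# sign bit times -2^31, remaining bits times their positive weights
value = -int(bits[0]) * 2**31 + sum(int(b) * 2**(30 - i) for i, b in enumerate(bits[1:]))
print(value)  # -4
```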
Multiplication: Sequential version of the Multiplication Algorithm and Hardware
Let’s assume that the multiplier is in the 32-bit Multiplier register and that the 64-
bit Product register is initialized to 0.
We need to move the multiplicand left one digit each step, as it may be added to
the intermediate products.
Over 32 steps, a 32 bit multiplicand would move 32 bits to the left.Hence,we need
a 64bit multiplicand register, initialized with the 32 bit multiplicand in the right
half and zero in the left half.This register is then shifted left 1 bit each step to align
the multiplicand with the sum being accumulated in the 64-bit Product register.
First version of the multiplication H/w
[Figure: a 64-bit Multiplicand register (shift left) and the 64-bit Product register (write) feed a 64-bit ALU; a 32-bit Multiplier register (shift right) supplies its low bit to the control logic for testing]
First multiplication algorithm
• flowchart_multiply.pdf
• Three basic steps are needed for each bit. These three basic steps are
repeated 32 times to obtain the product. If each step took a clock
cycle, this algorithm would require almost 100 clock cycles to multiply two
32-bit numbers.
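The three steps can be sketched in Python (illustrative only; the hardware shifts a 64-bit multiplicand register, which a Python integer models directly):

```python
def multiply(multiplicand, multiplier, bits=32):
    product = 0
    for _ in range(bits):
        if multiplier & 1:           # step 1: test the multiplier's low bit
            product += multiplicand  # step 1a: add multiplicand to product
        multiplicand <<= 1           # step 2: shift multiplicand left
        multiplier >>= 1             # step 3: shift multiplier right
    return product

print(multiply(0b0010, 0b0011))  # 6, matching the 0010 x 0011 example below
```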
Multiply 0010 x 0011 (multiplicand 0010, multiplier 0011):

Iteration  Step                                  Multiplier  Multiplicand  Product
0          Initial values                        0011        0000 0010     0000 0000
1          1a: 1 => Prod = Prod + Multiplicand   0011        0000 0010     0000 0010
           2: shift Multiplicand left            0011        0000 0100     0000 0010
           3: shift Multiplier right             0001        0000 0100     0000 0010
2          1a: 1 => Prod = Prod + Multiplicand   0001        0000 0100     0000 0110
           2: shift Multiplicand left            0001        0000 1000     0000 0110
           3: shift Multiplier right             0000        0000 1000     0000 0110
3          1: 0 => no operation                  0000        0000 1000     0000 0110
           2: shift Multiplicand left            0000        0001 0000     0000 0110
           3: shift Multiplier right             0000        0001 0000     0000 0110
4          1: 0 => no operation                  0000        0001 0000     0000 0110
           2: shift Multiplicand left            0000        0010 0000     0000 0110
           3: shift Multiplier right             0000        0010 0000     0000 0110

Final product: 0000 0110 = 6.
Refined version of Multiplication H/w
[Figure: the 32-bit Multiplicand feeds a 32-bit ALU; the 64-bit Product register (shift right, write) holds the running sum with the multiplier initially in its right half; control logic tests the product's low bit]
Booth's algorithm (flowchart):
Multiplicand in BR, Multiplier in QR
AC <- 0, Qn+1 <- 0, SC <- n
Repeat:
  Test the bit pair Qn Qn+1:
    = 10: AC <- AC + BR' + 1 (subtract BR)
    = 01: AC <- AC + BR (add BR)
    = 00 or = 11: no operation
  ashr(AC & QR) (arithmetic shift right of the combined register)
  SC <- SC - 1
Until SC = 0: END
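The flowchart can be sketched in Python for small word sizes (an illustrative model, not part of the slides; the register names AC, BR, QR, Qn+1, and SC follow the flowchart):

```python
def booth_multiply(m, q, n=8):
    """Booth's algorithm for n-bit two's complement operands."""
    mask = (1 << n) - 1
    br = m & mask                     # BR holds the multiplicand
    ac, qr, q_extra = 0, q & mask, 0  # AC <- 0, Qn+1 <- 0
    for _ in range(n):                # SC counts down from n to 0
        qn = qr & 1
        if (qn, q_extra) == (1, 0):
            ac = (ac - br) & mask     # AC <- AC + BR' + 1 (subtract)
        elif (qn, q_extra) == (0, 1):
            ac = (ac + br) & mask     # AC <- AC + BR (add)
        # ashr(AC & QR): shift the combined register right, replicating AC's sign
        q_extra = qn
        qr = ((qr >> 1) | ((ac & 1) << (n - 1))) & mask
        ac = ((ac >> 1) | (ac & (1 << (n - 1)))) & mask
    result = (ac << n) | qr
    if result & (1 << (2 * n - 1)):   # reinterpret the 2n-bit result as signed
        result -= 1 << (2 * n)
    return result

print(booth_multiply(7, 3))    # 21
print(booth_multiply(7, -3))   # -21
```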
h/w for Booth Algo
[Figure: the BR register (multiplicand) feeds a complementer and parallel adder into the AC register; AC and the QR register (multiplier) form the combined shift register with bits Qn and Qn+1; a sequence counter SC tracks the iterations]
First version of division h/w
[Figure: a 64-bit Divisor register (shift right) and the 64-bit Remainder register (write) feed a 64-bit ALU; a 32-bit Quotient register shifts left; control logic tests the sign of the remainder]
Floating Point
Reals in mathematics:
3.14159265… (π)
2.71828… (e)
0.000000001 or 1.0 × 10^-9
3,155,760,000 or 3.15576 × 10^9
The last number did not represent a small fraction,but it was bigger than we
could represent with a 32 bit signed integer.
The alternative notation of the last two numbers is called scientific notation.
Which has a single digit to the left of the decimal point.
A number in scientific notation that has no leading 0s is called a normalized
number, which is the usual way to write it. 1.0 × 10^-9 is in normalized scientific
notation, but 0.1 × 10^9 and 10.0 × 10^10 are not.
Note: all numbers here are in decimal.
• We can also show binary numbers in scientific notation: 1.0 × 2^-1
Floating point: computer arithmetic that represent numbers in which the
binary point is not fixed
In scientific notation: (1.xxxxxx)_2 × 2^yyyy
Advantages of scientific notation in normalized form:
1)It simplifies exchange of data that includes floating point numbers.
2)It simplifies the floating point arithmetic algorithms.
3)Increases the accuracy of the numbers that can be stored in a word.
General form:
(-1)^S × F × 2^E
31 30-23 22-0
S(1 bit) Exponent(8 bits) Fraction(23 bits)
• These chosen sizes of exponent and fraction give MIPS computer arithmetic an
extraordinary range: fractions almost as small as
2.0 × 10^-38 and numbers as large as 2.0 × 10^38.
Overflow: positive exponent becomes too large to fit in the exponent field.
Underflow:negative exponent becomes too large to fit in the exponent field.
Double precision: Floating point value represented two 32 bit words.
Single precision: Floating point value represented in a single 32 bit word.
MIPS double precision: large numbers up to 2 × 10^308,
small numbers down to 2 × 10^-308.
These formats go beyond MIPS:they are part of IEEE754 floating point standard
found in virtually every computer invented since 1980.
31 30-20 19-0
S(1 bit) Exponent(11 bits) Fraction(20 bits)
Fraction (continued)32 bits
To pack into more bits into significand(fraction),the IEEE754 makes the
leading 1 bit of normalized binary numbers implicit.
Hence,the number is actually 24 bits long in single precision(1 implied+23 bit
fraction).
For double precision for 53 bits long (1+52).
General form:
(-1)^S × (1 + Fraction) × 2^E
where the bits of the fraction represent a number between 0 and 1 and E
specifies the value in the exponent field.
If we number the bits of the fraction from left to right s1,s2,s3,…, then the
value is
(-1)^S × (1 + s1 × 2^-1 + s2 × 2^-2 + …) × 2^E
• 1.0 × 2^-1 (exponent -1 written in two's complement):
S(31)=0 | Exponent(30-23)=11111111 | Fraction(22-0)=000…0
• 1.0 × 2^+1 (exponent +1):
S(31)=0 | Exponent(30-23)=00000001 | Fraction(22-0)=000…0
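The field split can be inspected from a running program with Python's struct module (illustrative sketch; note the IEEE 754 standard actually stores the exponent with a bias of 127, not in two's complement):

```python
import struct

def decode_float(x):
    """Split a single-precision value into its S, exponent, and fraction fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # IEEE 754 stores this biased by 127
    fraction = bits & 0x7FFFFF       # 23 stored bits; the leading 1 is implicit
    return sign, exponent, fraction

print(decode_float(1.0))   # (0, 127, 0): (-1)^0 x (1+0) x 2^(127-127)
print(decode_float(-0.5))  # (1, 126, 0): (-1)^1 x (1+0) x 2^(126-127)
```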
The Processor:
[Figure: abstract view of the MIPS implementation — the PC supplies an Address to the Instruction Memory; instruction fields supply Register numbers to the Registers block; register data and the ALU produce results or addresses for the Data Memory; two adders compute PC+4 and the branch target]
• Abstract view of the implementation of MIPS subset showing the major
functional units and the major connections between them.
a)All the instructions start by using PC to supply the instruction
address to instruction memory.
b)After the instruction is fetched, the register operands used by
an instruction are specified by the fields of that instruction.
c)Once the register operands have been fetched, they can be
operated on to compute the memory address(for load or
store),to compute an arithmetic result(for an integer arithmetic
logical instruction) or compare (for branch).
d)If the instruction is arithmetic logical instruction, the result
from ALU must be written to a register.
e)If the operation is load or store,the ALU result is used as an
address to either store a value from the registers or load a value
from memory into registers. The result from the ALU or memory
is written back into the register file.
f)Branches require the use of ALU output to determine the next
instruction address, which comes from either the ALU(where the
PC and the branch offset are summed) or from an adder that
increments the current PC by 4.
g)All the instructions(memory reference, arithmetic-logical and
branch) except jump uses ALU after reading the registers.
Logic design conventions
• The functional units in MIPS implementation consists of two
different types of logic elements: elements that operate on data
values and elements that contain state.
• Elements that operate on data values are called combinational.
Which means that their output depends only on the current inputs.
Given the same inputs a combinational element always produces
the same output.(Ex ALU).
• An element contain state if it has some internal storage :state
elements, because if we pulled the plug on the m/c,we could
restart it by loading the state elements with the values they
contained before we pulled the plug.
• The instruction and data memories, and the registers, are state elements. A
state element has two inputs and one output.
• Inputs are the data value to be written into the element and the
clock,which determines when the data value is written.Output from the
state element provides the value that was written in an earlier clock cycle.
Ex D-type flip-flop.
• Logic components that contain state: sequential(output depends on both
inputs and contents of internal state.Ex registers.
• Clocking Methodology: when the data is valid and stable relative to the
clock.
• Edge-triggered clocking:A clocking scheme in which all state changes occur
on a clock edge.It means any values stored in a sequential logic element
are updated only on a clock edge.
• The inputs are the values that were written in previous clock cycle,while
the outputs are values that can be used in a following clock cycle.
Combinational logic, state elements and the clock are closely related
(Figure: state element 1 → combinational logic → state element 2, all
within one clock cycle.)
Edge-triggered methodology allows a state element to be read
and written in the same clock cycle without creating a race that
could lead to indeterminate data values
(Figure: state element 1 → combinational logic → back into state
element 1.)
• Control signal: a signal used for multiplexor
selection or for directing the operation of a
functional unit.
• Contrasts with a data signal, which contains
information that is operated on by a
functional unit.
Building a datapath
a) Instruction memory b) PC c) Adder
(Figure: instruction memory with an instruction-address input and an
instruction output; the PC register; an adder producing Add Sum.)
Portion of the datapath used for fetching
instructions and incrementing the PC
(Figure: the PC feeds the instruction memory's read address and,
together with the constant 4, an adder that computes the next PC.)
Register file
(Figure: the register file has two read ports — Read register 1 and
Read register 2, each a 5-bit register number, producing Read data 1
and Read data 2 — and one write port, Write register plus Write data,
enabled by the RegWrite control signal.)
(Figure: the ALU takes a 4-bit ALU operation control input and produces
an ALU result and a Zero output. The data memory has Address and
Write data inputs and a Read data output, controlled by MemRead and
MemWrite. A sign-extension unit widens the 16-bit offset to 32 bits.)
The datapath for a branch uses the ALU to evaluate the branch condition and
a separate adder to compute the branch target as the sum of the
incremented PC and the sign-extended, lower 16 bits of the instruction
shifted left two bits.
(Figure: PC+4 from the instruction datapath and the sign-extended,
shifted-left-2 offset feed an adder that produces the branch target;
the register file's two read outputs feed the ALU, whose Zero output
goes to the branch control logic.)
Creating single datapath
1)To share a datapath element between two different instruction
classes, we may need to allow multiple connections to the input
of the element, using a multiplexor and a control signal to select
among the multiple inputs.
2)The operations of the R-type and memory-instruction datapaths
are quite similar, but the differences are:
a)The R-type uses the ALU with inputs coming from two
registers. The memory instructions also use the ALU for address
calculation, although the second input is the sign-extended 16-bit
offset field from the instruction.
b)The value stored into the destination register comes from the
ALU (R-type) or the memory (for load).
ALU Control
ALU Control Lines Function
0000 AND
0001 OR
0010 add
0110 subtract
0111 Set on less than
1100 NOR
• We can generate the 4-bit ALU control input using a small
control unit that has as inputs the function field of the
instruction and a 2-bit control field called ALUOp.
• ALUOp indicates whether the operation to be performed
should be add (00) for loads and stores, subtract (01) for beq,
or determined by the operation encoded in the function
field (10).
• The output of the ALU control unit is a 4-bit signal that
directly controls the ALU by generating one of the 4-bit
combinations above.
How to set the ALU control inputs based on the 2-bit ALUOp
control and the 6-bit function code:

Opcode   ALUOp  Operation         Funct field  Desired ALU action  ALU control input
lw       00     load word         xxxxxx       add                 0010
sw       00     store word        xxxxxx       add                 0010
beq      01     branch equal      xxxxxx       subtract            0110
R-type   10     add               100000       add                 0010
R-type   10     subtract          100010       subtract            0110
R-type   10     AND               100100       AND                 0000
R-type   10     OR                100101       OR                  0001
R-type   10     set on less than  101010       set on less than    0111
• The main control unit generates the ALUOp bits, which are then used as
input to the ALU control that generates the actual signals to control the
ALU unit.
• Using multiple levels of control can reduce the size of the main control
unit.
• Using several smaller control units may also potentially increase the speed
of the control unit.
Truth table for the 4 ALU control bits

ALUOp1  ALUOp0  Function field (F5..F0)  Operation
0       0       xxxxxx                   0010
x       1       xxxxxx                   0110
1       x       100000                   0010
1       x       100010                   0110
1       x       100100                   0000
1       x       100101                   0001
1       x       101010                   0111
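The two-level decode described above maps directly to code. A minimal Python sketch of the ALU control unit (the function name is illustrative): given the 2-bit ALUOp and the 6-bit function field, it produces the 4-bit ALU control lines from the truth table.

```python
def alu_control(alu_op, funct):
    """2-bit ALUOp + 6-bit function field -> 4-bit ALU control lines."""
    if alu_op == 0b00:               # lw / sw: always add (address calculation)
        return 0b0010
    if alu_op == 0b01:               # beq: always subtract (test for equality)
        return 0b0110
    # alu_op == 0b10: R-type, decode the function field
    funct_map = {
        0b100000: 0b0010,            # add
        0b100010: 0b0110,            # subtract
        0b100100: 0b0000,            # AND
        0b100101: 0b0001,            # OR
        0b101010: 0b0111,            # set on less than
    }
    return funct_map[funct]          # unknown funct codes raise KeyError

assert alu_control(0b00, 0b000000) == 0b0010    # lw -> add
assert alu_control(0b01, 0b000000) == 0b0110    # beq -> subtract
assert alu_control(0b10, 0b101010) == 0b0111    # slt
```

The small dictionary plays the role of the R-type rows of the truth table; the first two branches are the don't-care rows, where the function field is ignored.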
Designing the Main Control Unit
• To connect the fields of an instruction to the datapath, the
instruction formats must be reviewed.
• Op field: bits 31-26.
• The two registers to be read, rs and rt, are in bits 25-21 and
20-16. This is true for R-type, branch-equal and store instructions.
• The base register for load and store instructions is always in
bit positions 25-21 (rs).
• R-type:
  op (0)   rs      rt      rd      shamt   funct
  31-26    25-21   20-16   15-11   10-6    5-0

• Load or store instruction (load: op 35, store: op 43):
  op (35 or 43)   rs      rt      address
  31-26           25-21   20-16   15-0

• Branch instruction:
  op (4)   rs      rt      address
  31-26    25-21   20-16   15-0
Simple datapath with control unit
• The input to the control unit is the 6-bit opcode field from the instruction.
The outputs of the control unit consist of three 1-bit signals that are used
to control multiplexors (RegDst, ALUSrc and MemtoReg),
• Three signals for controlling the reads and writes in the register file and data
memory (RegWrite, MemRead, MemWrite),
• A 1-bit signal used in determining whether to possibly branch (Branch), and
a 2-bit control signal for the ALU (ALUOp).
• An AND gate is used to combine the Branch control signal and the Zero
output from the ALU.
• The AND gate output controls the selection of the next PC.
• The setting of the control lines is completely determined by the opcode
field of the instruction.
• The first row: R-type (add, sub, and, or, slt). For all these instructions, the
source register fields are rs and rt and the destination is rd. This defines
how the signals ALUSrc and RegDst are set.
Instruction RegDst ALUSrc MemtoReg RegWrite Memread MemWrite Branch ALUOp1 ALUOp0
R-format 1 0 0 1 0 0 0 1 0
Lw 0 1 1 1 1 0 0 0 0
Sw X 1 X 0 0 1 0 0 0
beq X 0 X 0 0 0 1 0 1
• An R-type instruction writes a register (RegWrite = 1), but neither reads nor
writes data memory.
• When the Branch control signal is 0, the PC is unconditionally replaced with
PC+4. Otherwise the PC is replaced by the branch target if the Zero output of
the ALU is also high.
• The ALUOp field of an R-type instruction is set to 10 to indicate that the ALU
control should be generated from the function field.
• The second and third rows show the control signal settings for lw and sw.
• The ALUSrc and ALUOp fields are set to perform the address
calculation.
• MemRead and MemWrite are set to perform the memory access.
• Finally, RegDst and RegWrite are set for a load to cause the result to be
stored into the rt register.
• The branch instruction is similar to R-type, since it sends the rs and rt
registers to the ALU.
• The ALUOp field for branch is set for a subtract (ALUOp = 01), which is
used to test for equality.
• Note: the MemtoReg field is irrelevant when the RegWrite signal is 0. Since
the register is not being written, the value of the data on the register
write port is not used.
• Thus the MemtoReg entry in the last two rows of the table is replaced
with X for don't care.
• A don't care can also be added to RegDst when RegWrite is 0. This type of
don't care must be added by the designer, since it depends on knowledge
of how the datapath works.
Finalizing the control: the control function for the single-cycle
implementation, specified by a truth table
Input or
output
Signal name R-Format Lw Sw beq
Inputs Op5 0 1 1 0
Op4 0 0 0 0
Op3 0 0 1 0
Op2 0 0 0 1
Op1 0 1 1 0
Op0 0 1 1 0
Outputs Regdst 1 0 X X
ALUSrc 0 1 1 0
MemtoReg 0 1 X X
RegWrite 1 1 0 0
MemRead 0 1 0 0
MemWrite 0 0 1 0
Branch 0 0 0 1
ALUOp1 1 0 0 0
ALUOp0 0 0 0 1
Implementing jumps
• The jump instruction looks somewhat like a branch instruction but computes
the target PC and is not conditional.
• As with a branch, the low-order two bits of the jump address are always 00; the
next lower 26 bits of this 32-bit address come from the 26-bit immediate
field in the instruction:

  31:26     25:0
  000010    address

• The upper 4 bits of the address that should replace the PC come from the
PC of the jump instruction plus 4. We can implement the jump by storing
into the PC the concatenation of:
• -the upper 4 bits of the current PC+4 (bits 31:28)
• -the 26-bit immediate field of the jump instruction
• -the bits 00
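The concatenation just described can be written out explicitly. A minimal Python sketch (function name and example addresses are illustrative):

```python
def jump_target(pc, instruction):
    """New PC for j: upper 4 bits of PC+4, the 26-bit field, then '00'."""
    upper4 = (pc + 4) & 0xF0000000       # bits 31:28 of PC+4
    imm26 = instruction & 0x03FFFFFF     # 26-bit immediate (address) field
    return upper4 | (imm26 << 2)         # shifting left 2 appends the 00 bits

pc = 0x10400000
instr = 0x08000100                       # opcode 000010 (j), address field 0x100
assert jump_target(pc, instr) == 0x10000400
```

Note that the shift left by 2 is the same trick used for the branch offset: because instructions are word aligned, the low two bits need not be stored in the instruction.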
Why single cycle implementation is not
used today?
• Although the single-cycle design will work correctly, it would not be used
in modern designs because it is inefficient. The clock cycle must have the
same length for every instruction; in this single-cycle design the CPI
therefore is 1.
• The clock cycle is determined by the longest possible path in the machine.
• This path is almost always the load instruction, which uses five functional
units in series: the instruction memory, the register file, the ALU, the data
memory and the register file.
• Although the CPI is 1, the overall performance of the single-cycle
implementation is not likely to be very good, since several of the
instruction classes could fit in a shorter clock cycle.
• Unfortunately, implementing a variable-speed clock for each instruction
class is extremely difficult; an alternative is to use a shorter clock cycle
that does less work and then vary the number of clock cycles for the
different instruction classes.
• The single-cycle implementation violates the design principle of making
the common case fast.
• In this single-cycle implementation each functional unit can be used only
once per clock; therefore some functional units must be duplicated, raising
the cost of the implementation.
• Hence it is inefficient both in its performance and in its hardware cost.
• These difficulties can be avoided by using a shorter clock cycle, derived from
the basic functional-unit delays, that requires multiple clock cycles for
each instruction.
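The longest-path argument is easy to check with numbers. Using the functional-unit delays given in the operation-times table later in these slides (200 ps memory access, 100 ps register access, 200 ps ALU), this small Python sketch shows that the load path sets the single-cycle clock:

```python
# Functional-unit delays in ps, as in the operation-times table later
# in these slides: memory 200, register read/write 100, ALU 200.
MEM, REG, ALU = 200, 100, 200

paths = {
    "load":     MEM + REG + ALU + MEM + REG,  # inst mem, regs, ALU, data mem, regs
    "store":    MEM + REG + ALU + MEM,
    "R-format": MEM + REG + ALU + REG,
    "branch":   MEM + REG + ALU,
}

single_cycle_clock = max(paths.values())
assert paths["load"] == 800
assert single_cycle_clock == 800   # every instruction pays the load's 800 ps
```

A 500 ps branch still occupies an 800 ps cycle, which is exactly the inefficiency the text describes.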
Multicycle implementation
• Also called a multiple-clock-cycle implementation: an
implementation in which an instruction is executed in
multiple clock cycles.
• The multicycle implementation allows a functional unit to be
used more than once per instruction as long as it is used on
different clock cycles.
• The sharing can help reduce the amount of hardware
required.
• The ability to allow instructions to take different numbers of
clock cycles and the ability to share functional units within the
execution of a single instruction are the major advantages of a
multi cycle design
Difference with single cycle version
• A single memory unit is used for both instructions and data.
• There is a single ALU, rather than ALU and two adders.
• One or more registers are added after every major functional unit to hold
the output of that unit until the value is used in a subsequent clock cycle.
• At the end of the clock cycle, all data that is used in subsequent clock
cycles must be stored in a state element.
• Data used by subsequent instructions in a later clock cycle is stored into
one of the programmer-visible state elements: the register file, the PC or
the memory.
• Data used by the same instruction in a later cycle must be stored into one of
these additional registers.
• The position of the additional registers is determined by two factors:
what combinational unit will fit in one clock cycle, and what data are needed
in later cycles of the instruction.
-In the multicycle design it is assumed that the clock cycle can accommodate
at most one of the following operations: a memory access, a register file
access (two reads or one write) or an ALU operation.
-Any data produced by one of these functional units must be saved into a
temporary register for use in a later cycle.
-If the data were not saved, a timing race could occur, leading to the use
of an incorrect value.
-All the registers except the IR hold data only between a pair of adjacent clock
cycles and thus do not need a write control signal.
-The IR, however, needs to hold its instruction until the end of the execution
of that instruction and thus requires a write control signal.
Pipelining
Up till now: we have designed a processor that
executes all the SimpleRisc instructions.
Two styles:
Hardwired control unit
Microprogrammed control unit:
microprogrammed datapath
microassembly language
microinstruction
Designing efficient processors
 Microprogrammed processors are much
slower than hardwired processors
 Even hardwired processors
 Have a lot of waste !!!
 We have 5 stages.
 What is the IF stage doing, when the MA stage is
active ?
 ANSWER : It is idling
Resource utilization
• Single-cycle design: each resource is tied up for the entire
duration of the instruction's execution.
• Multi-cycle design: a resource utilized in cycle t of instruction i
is available again in cycle t+1 of instruction i.
• Pipelined design: a resource utilized in cycle t of instruction i is
available again in cycle t of instruction i+1.
Problems with single cycle design
• Slowest instruction pulls down the clock frequency
• Resource utilization is poor
• There are some instructions which are impossible to be implemented in
this manner.
(Chart: CPI versus cycle time. The multi-cycle design has high CPI and
a short cycle time; the single-cycle design has low CPI but a long
cycle time; the pipelined design achieves low CPI with a short cycle
time.)
The Notion of Pipelining
 Let us go back to the car assembly line
 Is the engine shop idle, when the paint shop is
painting a car ?
 NO : It is building the engine of another car
 When this engine goes to the body shop, it builds
the engine of another car, and so on ….
 Insight :
 Multiple cars are built at the same time.
 A car proceeds from one stage to the next
• Pipelining is an implementation technique in which multiple
instructions are overlapped in execution. Today pipelining is
the key to making processors fast.
• Ex. Laundry analogy for pipelining: see Fig.
The washer, dryer, folder and storer each take 30 minutes for their
task. Sequential laundry takes 8 hours for four loads of wash, while
pipelined laundry takes just 3.5 hours.
(Figure: sequential timeline, 6 pm to 2 am — loads A-D each go through
W, D, F, S back to back.)
(Figure: pipelined timeline, 6 pm to 9.30 pm — loads A-D overlap, each
starting W as soon as the washer is free.)
We show the pipeline stages of different loads over time by showing
copies of the four resources on this two-dimensional time line, but we
really have just one of each resource.
Observed so far..
• The pipeline paradox is that the time from placing a single
dirty sock in the washer until it is dried, folded and put away is
not shorter with pipelining.
• The reason pipelining is faster for many loads is that
everything is working in parallel, so more loads are finished
per hour.
• Pipelining improves the throughput of the laundry system without
improving the time to complete a single load.
• Hence, pipelining does not decrease the time to complete
one load of laundry, but when we have many loads of laundry
to do, the improvement in throughput decreases the total
time to complete the work.
• If all the stages take about the same amount of time and there is enough
work to do, then the speedup due to pipelining is equal to the number of
stages in the pipeline (in this case four).
• 20 loads would take about 5 times as long as 1 load, while 20 loads of
sequential laundry take 20 times as long as 1 load.
• The figure shows only an 8/3.5 ≈ 2.3 times speedup, because only four
loads are shown.
• At the beginning and end of the workload in the pipelined version, the
pipeline is not completely full.
• This start-up and wind-down affects performance when the number of
tasks is not large compared to the number of stages in the pipeline; if
the number of loads is much larger than 4, then the stages will be full most
of the time and the increase in throughput will be very close to 4.
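The start-up and wind-down effect above is easy to quantify. A hedged Python sketch of the laundry timing (4 stages of half an hour each; function names are my own):

```python
STAGES, STAGE_TIME = 4, 0.5          # four stages, 0.5 hours each

def sequential(loads):
    """Each load runs all four stages before the next starts."""
    return loads * STAGES * STAGE_TIME

def pipelined(loads):
    """First load fills the pipe; then one load finishes per stage time."""
    return STAGES * STAGE_TIME + (loads - 1) * STAGE_TIME

assert sequential(4) == 8.0          # 8 hours, as in the figure
assert pipelined(4) == 3.5           # 3.5 hours, as in the figure
assert sequential(20) == 20 * sequential(1)        # 20x for sequential
assert pipelined(20) / pipelined(1) == 5.75        # only about 5-6x of one load
```

With 20 loads the pipeline is full most of the time, so the speedup over sequential laundry (40 h vs 11.5 h ≈ 3.5x) is much closer to the four-stage ideal than with four loads.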
• The same principles apply to processors, where we pipeline
instruction execution. MIPS instructions classically take five
steps:
1)Fetch the instruction from memory.
2)Read registers while decoding the instruction (the format of
MIPS allows reading and decoding to occur simultaneously).
3)Execute the operation or calculate an address.
4)Access an operand in data memory.
5)Write the result into a register.
Single cycle Vs pipelined performance
Compare the average time between instructions of a single-cycle
implementation, in which all instructions take one clock cycle, with
a pipelined implementation.
The operation times for the major functional units in this example are
given below. (In the single-cycle model every instruction takes exactly
one clock cycle, so the clock cycle must be stretched to accommodate
the slowest instruction.)
Inst class  Inst fetch  Reg read  ALU oper  Data access  Reg write  Total
Load        200ps       100ps     200ps     200ps        100ps      800ps
Store       200ps       100ps     200ps     200ps                   700ps
R-format    200ps       100ps     200ps                  100ps      600ps
Branch      200ps       100ps     200ps                             500ps
(Figure: single-cycle execution of lw $1,100($0); lw $2,200($0);
lw $3,300($0) on a time line from 200 to 1800 ps — each load takes
800 ps before the next can start.)
(Figure: pipelined execution of the same three loads on a time line
from 200 to 1400 ps — the IF, Reg, ALU, Data access and Reg stages
overlap, and a new instruction starts every 200 ps.)
• The average time between instructions is reduced from 800 ps to 200 ps.
• Computer pipeline stage times are limited by the slowest resource, either
the ALU operation or the memory access.
• We assume that the write to the register file occurs in the first half of the
clock cycle and the read from the register file occurs in the second half of
the clock cycle.
• For the speedup:
time between instructions (pipelined) = time between
instructions (nonpipelined) / number of pipe stages
Ideally this would give an 800/5 = 160 ps clock cycle, but the stages may be
imperfectly balanced, and pipelining involves some overhead.
Thus the time per instruction in the pipelined processor will exceed the
minimum possible, and the speedup will be less than the number of pipeline
stages.
• Our claim of a fourfold improvement is not reflected in the total execution
time of the three instructions (2400/1400 ≈ 1.7). This is because the number of
instructions is not large. Let us increase the number of instructions:
• Let us add 1,000,000 instructions to the pipeline; each instruction adds
200 ps to the total execution time.
• Total execution time = 1,000,000 × 200 ps + 1400 ps = 200,001,400 ps
• Nonpipelined: 1,000,000 × 800 ps + 2400 ps = 800,002,400 ps
• Ratio = 800,002,400 / 200,001,400 ≈ 4 = 800/200
• Note: pipelining improves performance by increasing instruction
throughput, as opposed to decreasing the execution time of an individual
instruction.
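The arithmetic above can be checked directly:

```python
N = 1_000_000                        # instructions added to the program
PIPE_CC, SINGLE_CC = 200, 800        # ps per instruction once steady-state
PIPE_BASE, SINGLE_BASE = 1400, 2400  # ps for the original three loads

pipelined = N * PIPE_CC + PIPE_BASE          # 200,001,400 ps
nonpipelined = N * SINGLE_CC + SINGLE_BASE   # 800,002,400 ps

assert pipelined == 200_001_400
assert nonpipelined == 800_002_400
# With many instructions the ratio approaches the cycle-time ratio 800/200:
assert round(nonpipelined / pipelined, 2) == 4.0
```

With only three instructions the same formula gives 2400/1400 ≈ 1.7, which is why the start-up transient matters for short programs.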
Pipeline hazards
• There are situations in pipelining when the next instruction cannot execute in
the following clock cycle. These events are called hazards, and there are three
types: structural, data and control hazards.
• Structural hazard: an occurrence in which a planned instruction cannot execute in
the proper clock cycle because the hardware cannot support the combination of
instructions that are set to execute in the given clock cycle.
• With reference to the laundry example, a structural hazard would occur if we used a
combination washer-dryer instead of a separate washer and dryer, or if your
roommate was busy doing something else and wouldn't put clothes away.
• Hence, the carefully scheduled pipeline plans would then be foiled.
• If the pipeline in the previous figure had a fourth instruction, we would see that
the first instruction is accessing data from memory while the fourth instruction
is fetching an instruction from that same memory.
• Without two memories, our pipeline could have a structural hazard.
 A structural hazard may occur when two instructions have a conflict on
the same set of resources in a cycle
 Example :
 Assume that we have an add instruction that can read one operand
from memory, such as add r1, r2, 10[r3]
[1]: st r4, 20[r5]
[2]: sub r8, r9, r10
[3]: add r1, r2, 10[r3]
 This code will have a structural hazard
 [3] tries to read 10[r3] (MA unit) in cycle 4
 [1] tries to write to 20[r5] (MA unit) in cycle 4
 This does not happen in our pipeline (no SimpleRisc instruction
reads an operand from memory)
• Data hazard: an occurrence in which a planned instruction cannot
execute in the proper clock cycle because the data that is needed to
execute the instruction is not yet available.
• Occurs when the pipeline must be stalled because one step must wait for
another to complete.
• In the computer pipeline, data hazards arise from the dependence of one
instruction on an earlier one that is still in the pipeline (a relationship that
doesn't really exist when doing laundry).
add $s0,$t0,$t1
sub $t2,$s0,$t3
A data hazard could severely stall the pipeline: the add instruction does not
write its result until the fifth stage, meaning that we would have to add three
bubbles to the pipeline.
Data Hazard
[1]: add r1, r2, r3
[2]: sub r3, r1, r4
(Pipeline diagram over cycles 1-9: [1] occupies IF, OF, EX, MA, RW in
cycles 1-5; [2] follows one cycle behind and reads r1 in its OF stage
before [1] has written it back in RW.)
 Instruction 2 will read incorrect values !!!
This situation represents a data hazard.
Specifically, it is a RAW (read after write) hazard.
The earliest we can dispatch instruction 2 is cycle 5.
• Although we would like to rely on compilers to remove all such hazards,
the result would not be satisfactory. These dependences happen just too
often and the delay is just too long to expect the compiler to rescue us
from this dilemma.
• Forwarding or bypassing:
• For the code sequence mentioned earlier, as soon as the ALU creates the
sum for the add, we can supply it as an input for the subtract. Adding extra
hardware to retrieve the missing item early from the internal resources is
called forwarding or bypassing.
Graphical representation
Fig.2: add $s0,$t0,$t1
(Figure: each instruction drawn as a series of stage boxes — IF, ID,
ALU, MEM, WB — on a time line; the three loads lw $1,100($0),
lw $2,200($0) and lw $3,300($0) each start 200 ps after the previous
one.)
Forwarding
• add $s0,$t0,$t1
• sub $t2,$s0,$t3
• If the correct value is already there in another
stage, we can forward it.
[1]: add r1, r2, r3
[2]: sub r4, r1, r2
(Pipeline diagram over cycles 1-9: [1] occupies IF, OF, EX, MA, RW in
cycles 1-5, with [2] one cycle behind.)
Forwarding from MA to EX
 Forwarding in cycle 4 from instruction [1]
to [2]
[1]: add r1, r2, r3
[2]: sub r4, r1, r2
(Pipeline diagram: in cycle 4, [1] is in MA and [2] is in EX; [1]'s
ALU result is forwarded to [2]'s EX stage.)
Different Forwarding Paths
 We need to add a multitude of forwarding
paths
 Rules for creating forwarding paths:
 Add a path from a later stage to an earlier stage
 Try to add a forwarding path as late as possible. For
example, we avoid the EX → OF forwarding path, since
we have the MA → EX forwarding path
 The IF stage is not a part of any forwarding path.
Forwarding Path
 3 Stage Paths
 RW → OF
 2 Stage Paths
 RW → EX
 MA → OF (X Not Required)
 1 Stage Paths
 RW → MA (load to store)
 MA → EX (ALU Instructions, load, store)
 EX → OF (X Not Required)
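The forwarding-path rules above boil down to comparing register numbers across pipeline registers. This is a hedged Python sketch of the two EX-stage checks (field and function names are illustrative, modelled loosely on the MIPS-style forwarding unit; the "!= 0" guard reflects MIPS's hardwired zero register):

```python
def forward_source(src_reg, ma_dest, ma_writes, rw_dest, rw_writes):
    """Decide where an EX-stage source operand should come from.

    Returns 'MA' (1-stage path), 'RW' (2-stage path) or 'REG' (no
    forwarding needed). The most recent producer (MA) takes priority
    over the older one (RW).
    """
    if ma_writes and ma_dest != 0 and ma_dest == src_reg:
        return "MA"                  # MA -> EX forwarding path
    if rw_writes and rw_dest != 0 and rw_dest == src_reg:
        return "RW"                  # RW -> EX forwarding path
    return "REG"                     # register-file value is already correct

# add r1, r2, r3 ; sub r4, r1, r2 : sub's first operand comes from MA
assert forward_source(1, ma_dest=1, ma_writes=True,
                      rw_dest=0, rw_writes=False) == "MA"
# an operand nobody in flight is writing comes from the register file
assert forward_source(2, ma_dest=1, ma_writes=True,
                      rw_dest=0, rw_writes=False) == "REG"
```

The priority order matters: if two in-flight instructions both write the source register, the pipeline must forward the newer value, which is the one in MA.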
Forwarding Paths : RW → MA
[1]: ld r1, 4[r2]
[2]: sw r1, 10[r3]
(Pipeline diagram: [1]'s loaded value becomes available in RW in
cycle 5, the same cycle in which [2] is in MA, so it is forwarded from
RW to MA.)
Forwarding Paths : RW → EX
[1]: ld r1, 4[r2]
[2]: sw r8, 10[r3]
[3]: add r2, r1, r4
(Pipeline diagram: [1] reaches RW in cycle 5 while [3] is in EX; the
loaded value of r1 is forwarded from RW to [3]'s EX stage.)
Forwarding Path : MA → EX
[1]: add r1, r2, r3
[2]: sub r4, r1, r2
(Pipeline diagram: in cycle 4, [1] is in MA and [2] is in EX; [1]'s
ALU result is forwarded to [2].)
Forwarding Path : RW → OF
[1]: ld r1, 4[r2]
[2]: sw r4, 10[r3]
[3]: sw r5, 10[r6]
[4]: sub r7, r1, r2
(Pipeline diagram: [1] reaches RW in cycle 5 while [4] is in OF; the
loaded value of r1 is forwarded from RW to [4]'s OF stage.)
• Forwarding works very well but cannot prevent all pipeline stalls.
• Suppose the first instruction were a load of $s0 instead of an add. From the
figure it is clear that the desired data would be available only after the
fourth stage of the first instruction in the dependence, which is too late for
the input of the third stage of the sub.
• Even with forwarding, we would have to stall one stage for a load-use data
hazard (a specific form of data hazard in which the data requested by a load
instruction has not yet become available when it is requested).
 Cannot forward (the arrow would go backwards in time)
 Need to add a bubble (then use RW → EX
forwarding)
[1]: ld r1, 10[r2]
[2]: sub r4, r1, r2
(Pipeline diagram: [1]'s loaded value is ready only at the end of MA
in cycle 4, but [2] needs it in EX in that same cycle; one bubble
delays [2] so the RW → EX path can supply it.)
• We need to stall, even with forwarding, when an R-format instruction
following a load tries to use the loaded data.
• A stall initiated in order to resolve a hazard is called a pipeline stall (bubble).
Data Hazards with Forwarding
 Forwarding has unfortunately not eliminated all data hazards
 We are left with one special case.
 Load-use hazard
 The instruction immediately after a load instruction has a RAW
dependence with it.
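Detecting the load-use hazard also reduces to a register-number comparison, performed one stage earlier than forwarding. A minimal Python sketch (names are my own): stall if the instruction in EX is a load whose destination matches a source of the instruction being decoded.

```python
def must_stall(ex_is_load, ex_dest, id_src1, id_src2):
    """Load-use hazard check: the instruction immediately after a load
    has a RAW dependence on the loaded register, so insert one bubble."""
    return ex_is_load and ex_dest in (id_src1, id_src2)

# lw $t1, 0($t0) ; add $t3, $t1, $t2  -> one bubble is needed
assert must_stall(True, ex_dest=9, id_src1=9, id_src2=10) is True
# a load followed by an independent instruction -> no stall
assert must_stall(True, ex_dest=9, id_src1=11, id_src2=12) is False
```

An ALU instruction in EX never triggers this check (ex_is_load is False), because its result can be forwarded in time.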
Consider the C segment A = B + E; C = B + F. In MIPS:
lw $t1,0($t0)
lw $t2,4($t0)
add $t3,$t1,$t2
sw $t3,12($t0)
lw $t4,8($t0)
add $t5,$t1,$t4
sw $t5,16($t0)
Find the hazards in these instructions and reorder them to avoid any pipeline stalls.
Control Hazards
• Also called a branch hazard: an occurrence in which the proper
instruction cannot execute in the proper clock cycle because the
instruction that was fetched is not the one that is needed, i.e. the flow of
instruction addresses is not what the pipeline expected.
Performance of "stall on branch"
• Let us assume that we put in enough extra hardware so that we can test
registers, calculate the branch address and update the PC during the
second stage of the pipeline. Even with this extra hardware, a pipeline
involving conditional branches would look like the figure in the previous slide.
• Estimate the impact on the clock cycles per instruction (CPI) of stalling on
branches. Assume all other instructions have a CPI of 1.
• It has been found that branches are 13% of the instructions executed in
SPECint2000. Since other instructions have a CPI of 1 and branches take
one extra clock cycle for the stall, the CPI is 1.13, a slowdown of
1.13 versus the ideal case. Jumps also incur stalls.
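The CPI estimate works out as follows:

```python
base_cpi = 1.0
branch_fraction = 0.13       # branches in SPECint2000
branch_stall = 1             # one extra cycle per branch when stalling

cpi = base_cpi + branch_fraction * branch_stall
assert round(cpi, 2) == 1.13  # slowdown of 1.13 versus the ideal CPI of 1
```

The same formula shows why longer pipelines hurt: with a 3-cycle branch penalty the CPI would be 1 + 0.13 × 3 = 1.39.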
• If we cannot resolve the branch in the second stage, as is often the
case for longer pipelines, then we would see a larger slowdown if we
stall on branches.
• The cost of this option is too high for most computers to use and
motivates a second solution to the control hazard.
Predict: with reference to the laundry example, if you are pretty sure of
having the right formula to wash uniforms, then just predict that it will
work and wash the second load while waiting for the first load to dry.
This option does not slow down the pipeline when you are correct.
When you are wrong, you need to redo the load that was
washed while guessing the decision.
• Computers do indeed use prediction to handle branches. One
simple approach is to always predict that branches will be
untaken.
• When you are right, the pipeline proceeds at full speed. Only
when branches are taken does the pipeline stall.
Predicting that branches are not taken is one solution to the
control hazard.
Branch prediction
• Some branches are predicted as taken and some untaken. In the laundry
example, the dark or home uniforms might take one formula, while the
light or road uniforms might take another.
• As a computer example: at the bottom of loops are branches that jump
back to the top of the loop. Since they are likely to be taken and they
branch backwards, we could always predict taken for branches that jump
to an earlier address.
• Dynamic hardware predictors make their guesses depending on the behaviour
of each branch and may change predictions for a branch over the life of a
program.
• In our analogy, for a dynamic prediction, a person would look at how dirty the
uniform was and guess at the formula, adjusting the next guess
depending on the success of recent guesses.
• One popular approach to dynamic prediction of branches is
keeping a history for each branch as taken or untaken, and
then using the recent past behaviour to predict the future.
• Studies report that dynamic branch predictors can correctly predict
branches with over 90% accuracy.
• When the guess is wrong, the pipeline control must ensure
that the instructions following the wrongly guessed branch
have no effect and must restart the pipeline from the proper
branch address.
• In laundry analogy, we must stop taking new loads so that we
can restart the load that we incorrectly predicted.
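The per-branch history scheme above can be sketched as a 2-bit saturating counter per branch address, a common refinement of one-bit history (this is an illustrative model of my own, not from the slides):

```python
class TwoBitPredictor:
    """One 2-bit saturating counter per branch address.
    States 0-1 predict not taken; states 2-3 predict taken."""
    def __init__(self):
        self.counters = {}                     # branch PC -> counter state

    def predict(self, pc):
        return self.counters.get(pc, 1) >= 2   # unseen branches: weakly not-taken

    def update(self, pc, taken):
        c = self.counters.get(pc, 1)
        self.counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)

p = TwoBitPredictor()
for _ in range(3):
    p.update(0x400, True)            # a loop branch, taken repeatedly
assert p.predict(0x400) is True      # recent history now predicts taken
p.update(0x400, False)               # one loop exit...
assert p.predict(0x400) is True      # ...does not flip the prediction
```

The 2-bit counter captures exactly the loop behaviour described above: a single mispredict at the loop exit does not destroy the taken prediction for the next run of the loop.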
• A third approach to the control hazard: the delayed decision
(delayed branch).
• The delayed branch always executes the next sequential
instruction, with the branch taking place after that one-instruction
delay.
• MIPS software will place an instruction immediately after the
delayed branch instruction that is not affected by the branch, and
a taken branch changes the address of the instruction that
follows this safe instruction.
• The add instruction before the branch in the figure does not affect
the branch and can be moved after the branch to fully hide the
branch delay.
A pipeline Datapath
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture
CS4109 Computer System Architecture

  • 1. CS4109 Computer System Architecture By Prof. K.Sridhar Patnaik Department of Computer Science and Engineering, BIT Mesra, Ranchi
  • 2. Course Objectives • To Learn: 1) How computers work, basic principles. 2) How to analyze their performance (or how not to). 3) How computers are designed and built. 4) Issues affecting modern processors (caches, pipelines, etc.)
  • 3. Course Motivation This knowledge will be useful if you need to: 1) Design/build a new computer (a rare opportunity). 2) Design/build a new version of a computer. 3) Improve software performance. 4) Purchase a computer. 5) Provide a solution with an embedded computer.
  • 4. Introduction • I think it’s fair to say that personal computers have become the most empowering tool we’ve ever created. They’re tools of communication, they’re tools of creativity, and they can be shaped by their user. Bill Gates, February 24, 2004
  • 5. Computers? A computer is a general purpose device that can be programmed to process information, and yield meaningful results. McGrawHill
  • 6. Computers for.. • Processing and Communication • For processing (ex. Numbers) requires processor , memory and I/O(input-output) Processor-for processing. Memory-for storage. I/O -can be considered as extended memory.(Also helps us in interacting with m/c)
  • 7. Computer Architecture vs Computer Organization You can study computer systems from the user's (car driver's) point of view or the designer's (car mechanic's) point of view. Computer Architecture: the view of a computer as presented to software designers, i.e. the programmer's (software) point of view. Building architecture: structural design (Civil Engg). Computer architecture: circuit design (EE). Computer Organization: the actual implementation of a computer in hardware, i.e. the designer's (hardware) point of view.
  • 8. • Ex. Multiplier: from the organization point of view you know there is a multiplier; one need not bother about how it is designed. Similarly, there is an instruction set, and for the user knowing the instructions is enough; he is not bothered about how they are implemented. • So how the multiplier and the instruction set are implemented is the job of the designers. • Application spectrum
  • 9. Computer Program Information store results • Program – List of instructions given to the computer • Information store – data, images, files, videos • Computer – Process the information store according to the instructions in the program McGrawHill
  • 10. Inside CPU Memory Hard disk  Let us take the lid off a desktop computer
  • 11. Let us make it a full system ... Computer Memory Hard disk Keyboard Mouse Monitor Printer
  • 13. • The processor keeps addressing the memory and gets data from memory for processing. • The processor consists of datapath circuits + control path circuits (DP ckts + CP ckts). • The processor must be able to set up some kind of path to handle the data for processing (data path setting). • It must also be able to carry out the processing through a sequence of control signals. • The part which interacts with the CPU indirectly is the storage (magnetic tape, disc system, etc.). • The CPU deals with memory in the same way as it deals with I/O.
  • 14. SOFTWARE ABSTRACTION • int sum(int x,int y) HLL(C) • { • int z=x+y; • return z;} • 0x401040<sum>:0x55 MACHINE CODE 0x89 • 0xe5 • 0x8b • 0x45 • 0x0C • 0x03 • 0x45 • 0x08 • 0x89 • 0xec • 0x5d • 0xc3 • _sum: ASSEMBLY • pushl%ebp • movl%esp,%ebp • movl 12(%ebp),%eax • addl 8(%ebp),%eax • movl %ebp,%esp • popl %ebp • ret
  • 15. HARDWARE ABSTRACTION CPU Register file System bus Memory bus I/O BUS Expansion slots for other devices such as network adapters PC BUS INTERFACE ALU BRIDGE MAIN MEMORY GRAPHICS Adapter USB Controller DISK Controller DISK MOUSE KEYBOARD E DISPLAY
  • 16. HARDWARE/SOFTWARE INTERFACE SOFTWARE Our focus HARDWARE C++ M/C Intruction Reg,Adder Transistors …………………………………………………………………………………………………
  • 17. How does an Electronic Computer Differ from our Brain ? • Computers are ultra-fast and ultra-dumb. Feature (Computer vs. Our Brilliant Brain): Intelligence: Dumb vs. Intelligent. Speed of basic calculations: Ultra-fast vs. Slow. Can get tired: Never vs. After some time. Can get bored: Never vs. Almost always.
  • 18. What Can a Computer Understand ?  Computers can clearly NOT understand instructions of the form  Multiply two matrices  Compute the determinant of a matrix  Find the shortest path between Mumbai and Delhi  They understand :  Add a + b to get c  Multiply a * b to get c
  • 19. Architecture levels • Instruction set architecture: Lowest level visible to programmer • Micro architecture: Fills the gap between instructions and logic modules The semantics of all the instructions supported by a processor is known as its instruction set architecture (ISA). This includes the semantics of the instructions themselves, along with their operands, and interfaces with peripheral devices.
  • 22. • Programmer -Visible State: PC-Program Counter Register File-Heavily used data Condition Codes Memory-Byte Array -Code+Data -Stack
  • 23. Why different processors? • What is the difference between processors used in desk-tops, lap-tops,mobile phones, washing machines etc.? • Performance/speed • Power consumption • Cost • General purpose/special purpose
  • 24. Topics to be Covered • Performance issues • A specific instruction set architecture • Arithmetic and how to build an ALU • Constructing a processor to execute instruction • Pipelining to improve performance • Memory:Caches and Virtual memory • Input/output
  • 25. Features of an ISA  Example of instructions in an ISA  Arithmetic instructions : add, sub, mul, div  Logical instructions : and, or, not  Data transfer/movement instructions  Complete  It should be able to implement all the programs that users may write.
  • 26. Features of an ISA – II  Concise  The instruction set should have a limited size. Typically an ISA contains 32-1000 instructions.  Generic  Instructions should not be too specialized, e.g. add14 (adds a number with 14) instruction is too specialized  Simple  Should not be very complicated.
  • 27. Designing an ISA  Important questions that need to be answered :  How many instructions should we have ?  What should they do ?  How complicated should they be ? Two different paradigms : RISC and CISC RISC (Reduced Instruction Set Computer) CISC (Complex Instruction Set Computer)
  • 28. RISC vs CISC A reduced instruction set computer (RISC) implements simple instructions that have a simple and regular structure. The number of instructions is typically a small number (64 to 128). Examples: ARM, IBM PowerPC, HP PA-RISC A complex instruction set computer (CISC) implements complex instructions that are highly irregular, take multiple operands, and implement complex functionalities. Secondly, the number of instructions is large (typically 500+). Examples: Intel x86, VAX
  • 29. Completeness of an ISA – II How do we ensure that we have just enough instructions so that we can implement every possible program that we might want to write? Answer:  Let us look at results in theoretical computer science  Is there a universal ISA? The universal machine has a set of basic actions, and each such action can be interpreted as an instruction. Universal ISA ~ Universal Machine
  • 30. The Turing Machine – Alan Turing  Facts about Alan Turing  Known as the father of computer science  Discovered the Turing machine that is the most powerful computing device known to man  Indian connection : His father worked with the Indian Civil Service at the time he was born. He was posted in Chhatrapur, Odisha.
  • 31. Turing Machine: Infinite Tape, State Register, Tape Head (moves L/R), Action Table • The tape head can only move left or right • Transition rule: (old state, old symbol) -> (new state, new symbol, left/right)
  • 32. Operation of a Turing Machine  There is an infinite tape that extends to the left and right. It consists of an infinite number of cells.  The tape head points to a cell, and can either move 1 cell to the left or right  Based on the symbol in the cell, and its current state, the Turing machine computes the transition :  Computes the next state  Overwrites the symbol in the cell (or keeps it the same)  Moves to the left or right by 1 cell  The action table records the rules for the transitions.
  • 33. Example of a Turing Machine • Design a Turing machine to increment a number by 1.  Start from the rightmost position (state = 1).  If (state = 1), replace a digit x by (x+1) mod 10  The new state is equal to the value of the carry  Keep going left till the '$' sign. Tape (head at the rightmost digit): $ 3 4 6 9 7 $
  • 34. More about the Turing Machine  This machine is extremely simple, and extremely powerful  We can solve all kinds of problems – mathematical problems, engineering analyses, protein folding, computer games, …  Try to use the Turing machine to solve many more types of problems (TO DO)
  • 35. Church-Turing Thesis Church-Turing thesis: Any real-world computation can be translated into an equivalent computation involving a Turing machine. (Source: Wolfram MathWorld.) Any computing system that is equivalent to a Turing machine is said to be Turing complete. Universal Turing Machine: For every problem in the world, we can design a Turing machine (Church-Turing thesis). Can we design a universal Turing machine that can simulate any Turing machine? This would make it a universal machine (UTM). Why not? The logic of a Turing machine is really simple. We need to move the tape head left or right, and update the symbol and state based on the action table. A UTM can easily do this. A UTM needs to have an action table, state register, and tape that can simulate any arbitrary Turing machine.
  • 36. Universal Turing Machine Prog. 1 Prog. 2 Prog. 3 Turing Machine 1 Turing Machine 2 Turing Machine 3 Universal Turing Machine
  • 37. A Universal Turing Machine: Generic State Register, Tape Head (L/R), Generic Action Table; on the tape: Simulated Action Table, Simulated State Register, Work Area
  • 38. A Universal Turing Machine - II [Diagram mapping the UTM to a computer: generic state register and tape head ~ CPU with program counter (PC); simulated action table ~ instruction memory; simulated state register and work area ~ data memory]
  • 39. Computer Inspired by the Turing Machine: CPU (Program Counter (PC), Control Unit, Arithmetic Unit), Program/Instruction memory, Data Memory
  • 40. Elements of a Computer  Memory (array of bytes) contains  The program, which is a sequence of instructions  The program data → variables, and constants  The program counter(PC) points to an instruction in a program  After executing an instruction, it points to the next instruction by default  A branch instruction makes the PC point to another instruction (not in sequence)  CPU (Central Processing Unit) contains the  Program counter, instruction execution units
  • 41. Designing Practical Machines Harvard Architecture CPU Control ALU Instruction memory Data memory I/O devices
  • 43. Problems with Harvard/Von Neumann Architectures  The memory is assumed to be one large array of bytes  It is very slow  Solution:  Have a small array of named locations (registers) that can be used by instructions  This small array is very fast. General Rule: The larger a structure is, the slower it is. Insight: Accesses exhibit locality (programs tend to use the same variables frequently in the same window of time).
  • 44. Uses of Registers  A CPU (processor) contains a set of registers (16-64)  These are named storage locations.  Typically, values are loaded from memory into registers.  Arithmetic/logical instructions use registers as input operands.  Finally, data is stored back into its memory location.
  • 45. Example of a Program in Machine Language with Registers  r1, r2, and r3 are registers  mem → array of bytes representing memory 1: r1 = mem[b] // load b 2: r2 = mem[c] // load c 3: r3 = r1 + r2 // add b and c 4: mem[a] = r3 // save the result
  • 47. Performance Measure • When we say one computer has better performance than another, what do we mean? Airplane / Passenger capacity / Cruising range (miles) / Cruising speed (mph) / Passenger throughput (passengers x mph): A1: 300 / 4000 / 600 / 180,000. A2: 400 / 3500 / 600 / 240,000. A3: 130 / 3400 / 1000 / 130,000. A4: 140 / 8000 / 500 / 70,000.
  • 48. • Consider different measures of performance: the plane with the highest cruising speed is A3, the plane with the longest range is A4, and the plane with the largest capacity is A2. • Suppose we define performance in terms of speed. Then the fastest plane is the one with the highest cruising speed, taking a passenger from one point to another in the least time. • If you are interested in transporting 400 passengers, then A2 would clearly be the fastest (highest throughput). • Similarly, we can define computer performance in several different ways.
  • 49. • If you are running a program on two desktop computers, The faster one is the one that gets the job done first. • If you were running a datacenter that had several servers running jobs submitted by many users,the faster computer was the one that completed the most jobs during a day. • As an individual computer user we are interested in reducing response time(execution time). • Datacenter managers are often interested in increasing throughput or bandwidth(total amount of work done in a given time) • Hence in most cases we need different performance metrics as well as different sets of applications to benchmark embedded and desktop computers,which are more focused on response time,versus servers,which are more focused on throughput
  • 50. Throughput and Response time • Do the following changes to a computer system increase throughput, decrease response time, or both? 1. Replacing the processor in a computer with a faster version. 2. Adding additional processors to a system that uses multiple processors for separate tasks, e.g. searching the Web. In case 1, decreasing response time almost always improves throughput, hence case 1 improves both. In case 2, only throughput increases. Note: if the demand for processing in the second case were almost as large as the throughput, the system might force requests to queue up. In this case, increasing the throughput could also improve response time, since it would reduce the waiting time in the queue. Thus, in many real computer systems, changing either execution time or throughput often affects the other.
  • 51. Performance Measure Equations • Performance = 1/Execution time • For two computers X and Y, if the performance of X is greater than that of Y: Perf_X > Perf_Y => 1/Exe time_X > 1/Exe time_Y => Exe time_Y > Exe time_X • Do this: If computer A runs a program in 10 sec and computer B runs the same program in 15 sec, how much faster is A than B? • Response time, or elapsed time: total time to complete a task, including disk accesses, memory accesses, I/O activities, OS overhead. • CPU execution time/CPU time: the actual time the CPU spends computing the specific task (does not include time spent waiting for I/O or running other programs). • CPU time = user CPU time + system CPU time. User CPU time (CPU time spent in the program); system CPU time (CPU time spent in the OS performing tasks on behalf of the program).
  • 52. • Differentiating user and system CPU time is difficult to do accurately ,because it is hard to assign responsibility for OS activities to one user program rather than another and because of the functionality differences among OSs. • CPU execution time for a program=CPU clock cycles for a program X Clock cycle time. =CPU clock cycles for a program /Clock rate Note-The hardware designer can improve performance by reducing the number of clock cycles required for a program or the length of the clock cycle. The designer often faces a trade off between the number of clock cycles needed for a program and the length of each cycle. Many techniques that decrease the number of clock cycles may also increase the clock cycle time.
  • 53. • Do this: A program runs in 10 secs on computer A, which has a 2 GHz clock. You as a designer build a computer B which will run the same program in 6 secs. As a designer you determined that a substantial increase in the clock rate is possible, but this increase will affect the rest of the CPU design, causing computer B to require 1.2 times as many clock cycles as computer A for this program. What clock rate should the designer target?
  • 54. • CPU time_A = CPU clock cycles_A / Clock rate_A. 10 secs = CPU clock cycles_A / (2 x 10^9 cycles/sec), so CPU clock cycles_A = 20 x 10^9 cycles. CPU time_B = 1.2 x CPU clock cycles_A / Clock rate_B. 6 secs = 1.2 x 20 x 10^9 cycles / Clock rate_B, so Clock rate_B = 4 x 10^9 cycles/sec = 4 GHz. To run the program in 6 secs, B must have twice the clock rate of A.
  • 55. Instruction Performance • CPU execution time for a program=CPU clock cycles for a program X Clock cycle time. • CPU clock cycles= instructions for a program X Avg clock cycles per instruction • Avg clock cycles per instruction(CPI). • Do this-Suppose we have two implementations of the same ISA. Computer A has a clock cycle time of 250ps and a CPI of 2.0 for some program, and computer B has a clock cycle time of 500ps and a CPI of 1.2 for the same program. Which computer is faster for this program and by how much? • The classic CPU performance equation= CPU time=Instruction count X CPI X Clock cycle time =(Instruction count X CPI)/Clock rate
  • 56. SPEC CPU Benchmark • To evaluate two computer systems, a user would simply compare the execution time of the workload on the two computers. • Workload: a set of programs run on a computer that is either the actual collection of applications run by a user or constructed from real programs to approximate such a mix. • Benchmark: a program selected for use in comparing computer performance. • SPEC (Standard Performance Evaluation Corporation) is an effort funded and supported by a number of computer vendors to create standard sets of benchmarks for modern computer systems. • In 1989, SPEC created a benchmark set focusing on processor performance (SPEC89), which has evolved through 5 generations. • SPEC CPU2006 consists of a set of 12 integer benchmarks (CINT2006) and 17 floating point benchmarks (CFP2006).
  • 57. • The integer benchmarks: C compiler, chess program, quantum computer simulation. • The floating point benchmarks: structured grid codes for finite element modeling, particle method codes for molecular dynamics, sparse linear algebra codes for fluid dynamics. • Check for SPEC14. • SPECRatio = Exe time_REF / Exe time (reference execution time divided by measured execution time) • Take the geometric mean (GM) of the SPECRatios. • When comparing two computers using SPEC ratios, use the GM so that it gives the same relative answer no matter what computer is used to normalize the results. If we averaged the normalized exe time values with an arithmetic mean (AM), the results would vary depending on the computer we choose as the reference.
  • 58. SPECINT2006 benchmarks running on the AMD Opteron X4 model. Columns: Description / Name / Inst count (x10^9) / CPI / Clock cycle time (sec x10^-9) / Exe time (secs) / Ref time (secs) / SPECratio. Interpreted string processing / perl / 2118 / 0.75 / 0.4 / 637 / 9770 / 15.3. GNU C compiler / gcc / 1050 / 1.72 / 0.4 / 724 / 8050 / 11.1.
  • 59. Q1: Show that the ratio of the geometric means is equal to the geometric mean of the performance ratios, and that the reference computer of SPECRatio does not matter. Assume two computers A and B and a set of SPEC ratios for each. GM_A / GM_B = (prod_{i=1..n} SPECRatio_A_i)^(1/n) / (prod_{i=1..n} SPECRatio_B_i)^(1/n) = (prod_{i=1..n} SPECRatio_A_i / SPECRatio_B_i)^(1/n) = (prod_{i=1..n} Execution time_B_i / Execution time_A_i)^(1/n) = (prod_{i=1..n} Performance_A_i / Performance_B_i)^(1/n). That is, the ratio of the geometric means equals the geometric mean of the performance ratios, and the reference times cancel.
  • 60. Fallacies and Pitfalls • Pitfall: Expecting the improvement of one aspect of a computer to increase overall performance by an amount proportional to the size of the improvement. • This pitfall has visited designers of both h/w and s/w. • Ex. Suppose a program runs in 100 secs on a computer, with multiply operations responsible for 80 secs of this time. How much do you have to improve the speed of multiplication if you want your program to run five times faster? • The exe time of the program after making the improvement is given by the equation known as Amdahl's Law. • Exe time after improvement = (Exe time affected by improvement / Amount of improvement) + Exe time unaffected
  • 61. Exe time after improvement = 80/n + (100 - 80). For a fivefold speedup we would need 20 = 80/n + 20, i.e. 0 = 80/n: there is no amount by which we can enhance multiply to achieve a fivefold increase in performance if multiply accounts for only 80% of the workload. The performance enhancement possible with a given improvement is limited by the amount that the improved feature is used (it's the law of diminishing returns). We can use Amdahl's law to estimate performance improvements when we know the time consumed by some function and its potential speedup.
  • 62. Amdahl’s Law Parent thread Time Spawn child threads Child threads Thread join operation Sequential section Initialisation
  • 63.  For P parallel processors, we can expect a speedup of P (in the ideal case)  Let us assume that a program takes T_old units of time  Let us divide it into two parts: sequential and parallel  Sequential portion : T_old * f_seq  Parallel portion : T_old * (1 - f_seq)  f_seq = fraction of time spent in the sequential part  (1 - f_seq) = fraction of time spent in the parallel part
  • 64.  Only the parallel portion gets sped up P times  The sequential portion is unaffected  Time taken with parallelisation: T_new = T_old * f_seq + T_old * (1 - f_seq) / P  The speedup is thus: S = T_old / T_new = 1 / (f_seq + (1 - f_seq) / P)
  • 65. Implications  Consider multiple values of f_seq  [Plot: Speedup (S) vs. Number of processors (P), for f_seq = 10%, 5%, and 2%]
  • 66. Conclusions  We observe that with an increasing number of processors the speedup gradually saturates and tends to the limiting value 1/f_seq. We observe diminishing returns as we increase the number of processors beyond a certain point.  We are limited by the size of the sequential section  For a very large number of processors, the parallel section's share of the time is actually very small  Ideally, a parallel workload should have as small a sequential section as possible.
  • 67. • Fallacy: Computers at low utilization use little power. Power efficiency matters at low utilizations because server workloads vary. CPU utilization for servers at Google is between 10% and 50% most of the time. (SPECpower benchmark.) Columns: Server manufacturer / Microprocessor / Total cores/sockets / Clock rate / Peak perf / 100% load power / 50% load power / 10% load power / Idle power. HP / Xeon E5440 / 8/2 / 3 GHz / 308,022 / 269 W / 227 W / 174 W / 160 W. Dell / Xeon E5440 / 8/2 / 2.8 GHz / 305,413 / 276 W / 230 W / 173 W / 157 W. Fujitsu Siemens / Xeon X3220 / 4/1 / 2.4 GHz / 143,742 / 132 W / 110 W / 85 W / 60 W.
  • 68. • Even servers that are only 10% utilized burn about two-thirds of their peak power. • Since servers' workloads vary but use a large fraction of peak power, we should redesign h/w to achieve "energy-proportional computing". If future servers used, say, 10% of peak power at 10% workload, we could reduce the electricity bill of datacentres (also a concern for CO2 emissions). • Pitfall: Using a subset of the performance equation as a performance metric. Measuring performance using clock rate or CPI alone is a fallacy; using two or three factors to compare performance may be valid in a limited context or misused. An alternative to time is MIPS (million instructions per sec) = Inst count / (Exe time x 10^6).
  • 69. • Problems with MIPS: We cannot compare computers with different instruction sets using MIPS. MIPS varies between programs on the same computer (a computer cannot have a single MIPS rating). If a new program executes more instructions but each instruction is faster, MIPS can vary independently from performance. MIPS = IC / (Exe time x 10^6) = IC / ((IC x CPI / clock rate) x 10^6) = clock rate / (CPI x 10^6).
  • 70. ARM Instruction Set • 32-bit instruction set. • Arithmetic (ADD, SUB, ...) • Data transfer (load register, store register, MOV, etc.) • Logical (AND, OR, NOT, ...) • Conditional branch (branch on EQ, NE, LT, LE, ...) • Unconditional branch (branch (always), branch and link). Name / Example / Comments: 16 registers / r0, r1, r2, ..., r11, r12, sp, lr, pc / Fast locations for data; in ARM, data must be in registers to perform arithmetic. 2^30 memory words / Memory[0], Memory[4], ..., Memory[4294967292] / ARM uses byte addresses, so sequential word addresses differ by 4.
  • 71. MIPS Instruction Set. Name / Example / Comments: 32 registers / $s0-$s7, $t0-$t9, $zero, $a0-$a3, $v0-$v1, $gp, $fp, $sp, $ra, $at / Fast locations for data; in MIPS, data must be in registers to perform arithmetic. Register $zero always equals 0, and register $at is reserved by the assembler to handle large constants. 2^30 memory words / Memory[0], Memory[4], ..., Memory[4294967292] / MIPS uses byte addresses, so sequential word addresses differ by 4. Memory holds data structures, arrays, and spilled registers.
  • 72. Comparison. Category / Instruction / ARM / MIPS: Arithmetic / add / ADD r1,r2,r3 / add $s1,$s2,$s3. Arithmetic / sub / SUB r1,r2,r3 / sub $s1,$s2,$s3. Arithmetic / add immediate / (immediate Operand2) / addi $s1,$s2,20. Data transfer / load register (load word) / LDR r1,[r2,#20] {r1 = Memory[r2+20]} / lw $s1,20($s2). Data transfer / store register (store word) / STR r1,[r2,#20] / sw $s1,20($s2).
  • 73. ARM assembly language notation. An ARM arithmetic instruction performs only one operation and must always have exactly three variables. Ex. place the sum of four variables b, c, d, e into variable a: • ADD a,b,c (sum of b and c placed in a) • ADD a,a,d (sum of b, c and d is now in a) • ADD a,a,e (sum of b, c, d and e is now in a). It takes three instructions to sum the four variables. First Design Principle: Simplicity favours regularity. Requiring every instruction to have exactly three operands, no more and no less, conforms to the philosophy of keeping the h/w simple; h/w for a variable number of operands is more complicated than h/w for a fixed number.
  • 74. C assignment to ARM assembly • Second Design Principle: smaller is faster(A very large number of registers may increase the clock cycle time because it takes electronic signals longer when they must travel farther. smaller is faster ;15 registers may not be faster than 16 .Energy is also the major concern ,fewer registers is to conserve energy) • a=b+c;d=a-e; ADD a,b,c , SUB d,a,e • f=(g+h)-(i+j);ADD t0,g,h ,ADD t1,i,j,SUB f,t0,t1. Memory operands: for simple variables single data elements is used but for complex data structures –arrays, structure etc.The DSs can contain many more elements than there are registers in computer. How can a computer represent and access such large structures? The processor can keep only small amount of data in registers, but memory contains billions of data elements, Hence DSs(array and structures)are kept in memory.
  • 75. • Compile the C assignment g = h + A[8]: LDR r1,[r2,#32] (r2 = base address of A, r1 = temp register); ADD r3,r4,r1 • The offset is element 8 x 4 bytes = 32 • In ARM, words must start at addresses that are multiples of 4 (alignment restriction). [Memory diagram: byte addresses 0, 4, 8, ..., 32, each holding one word-sized data element of A.]
  • 76. Compile using load and store: A[12] = h + A[8]. LDR r1,[r2,#32]; ADD r4,r3,r1; STR r4,[r2,#48]. Many programs have more variables than computers have registers. Consequently, the compiler tries to keep the most frequently used variables in registers and places the rest in memory, using loads and stores to move variables between registers and memory. The process of putting less commonly used variables (or those needed later) into memory is called spilling registers. To achieve the highest performance and conserve energy, compilers must use registers efficiently.
  • 77. Constant or Immediate Operands Many times a program will use a constant in an operation. Ex .increment an index to point to the next element of an array. More than half of the ARM arithmetic instructions have a constant as an operand when running the SPEC2006 benchmarks. Add a constant 4 to register r3. LDR r5,[r1,#AddrConstant4] ADD r3,r3,r5 r1+AddrConstant4 is the memory address of the constant. ADD r3,r3,#4 {r3=r3+4},immediate operand. Third Design Principle: Make the common case fast(Constant operands occur frequently ,and by including constants inside arithmetic instructions, operations are much faster and use less energy than if constants were loaded from memory)
  • 78. Signed and Unsigned Numbers. [Bit layout: bit 31 (MSB) down to bit 0 (LSB).] An ARM word is 32 bits long, so we can represent the numbers from 0 to 2^32 - 1 (4,294,967,295): 0000 0000 0000 0000 0000 0000 0000 0000 = 0; 0000 0000 0000 0000 0000 0000 0000 0001 = 1; 0000 0000 0000 0000 0000 0000 0000 0010 = 2; ...; 1111 1111 1111 1111 1111 1111 1111 1111 = 4,294,967,295.
  • 79. Positive and negative numbers represented by separate sign by single bit 0(positive) and 1(negative). Integers representation: a)Signed Magnitude representation b)Signed-1’s complement representation c) Signed-2’s complement representation Ex 14 in 8- bit register=0000 1110 Three different ways to represent -14= i) Signed Magnitude representation=1 0001110 ii) Signed-1’s complement representation=1 1110001 iii)Signed-2’s complement representation=1 1110010
  • 80. Conclusions • The signed magnitude system is used in ordinary arithmetic but is awkward when employed in computer arithmetic. • Hence signed complement is used, but 1's complement imposes difficulties because it has two representations of 0 (+0 and -0). 1's complement is used for logical operations. • 2's complement is used for representing negative numbers. 2's complement addition: 6 = 0000 0110, 13 = 0000 1101, sum 19 = 0001 0011. Do: -6+13, 6-13, -6-13.
  • 81. Translating ARM assembly into m/c instructions • ADD r5,r1,r2. The decimal representation of the fields: Cond=14, F=0, I=0, Opcode=4, S=0, Rn=1, Rd=5, Operand2=2 (the opcode field tells us the instruction performs addition). The instruction in binary form (instruction format): 1110 | 00 | 0 | 0100 | 0 | 0001 | 0101 | 0000 0000 0010. ARM fields (4, 2, 1, 4, 1, 4, 4 and 12 bits): Cond | F | I | Opcode | S | Rn | Rd | Operand2. Cond: condition field (used by conditional branch instructions). F: instruction format. I: immediate (if 0 the second source operand is a register, if 1 it is a 12-bit immediate). S: set condition codes (related to conditional branches).
  • 82. ADD r3,r3,#4 (r3 = r3 + 4): Cond=14, F=0, I=1, Opcode=4, S=0, Rn=3, Rd=3, Operand2=4. LDR r5,[r3,#32] (temp reg r5 gets A[8]). Load and store instructions use 6 fields (4, 2, 6, 4, 4 and 12 bits): Cond | F | Opcode | Rn | Rd | Offset. F=1 indicates a data transfer instruction; opcode 24 = load word. LDR fields: Cond=14, F=1, Opcode=24, Rn=3, Rd=5, Offset=32.
  • 83. Instruction Format. Columns: Instruction / Format / Cond / F / I / Op / S / Rn / Rd / Operand2. ADD / DP / 14 / 0 / 0 / 4 / 0 / Reg / Reg / Reg. SUB / DP / 14 / 0 / 0 / 2 / 0 / Reg / Reg / Reg. ADD (immediate) / DP / 14 / 0 / 1 / 4 / 0 / Reg / Reg / Constant. LDR / DT / 14 / 1 / n.a. / 24 / n.a. / Reg / Reg / Address. STR / DT / 14 / 1 / n.a. / 25 / n.a. / Reg / Reg / Address.
  • 84. A[30] = h + A[30]: LDR r5,[r3,#120]; ADD r5,r2,r5; STR r5,[r3,#120]. Decimal fields (Cond, F, (I,) Op, (S,) Rn, Rd, Operand2/Offset): LDR: 14, 1, 24, 3, 5, 120. ADD: 14, 0, 0, 4, 0, 2, 5, 5. STR: 14, 1, 25, 3, 5, 120. Binary: LDR: 1110 | 01 | 011000 | 0011 | 0101 | 0000 0111 1000. ADD: 1110 | 00 | 0 | 0100 | 0 | 0010 | 0101 | 0000 0000 0101. STR: 1110 | 01 | 011001 | 0011 | 0101 | 0000 0111 1000.
  • 85. Logical Operations. AND r5,r1,r2 (reg5 = reg1 & reg2). MVN r5,r1 (reg5 = ~reg1, "move not"). MOV r6,r5 (reg6 = reg5). MIPS code for f = (g+h) - (i+j). Assign variables to registers: $s0=f, $s1=g, $s2=h, $s3=i, $s4=j. add $t0,$s1,$s2; add $t1,$s3,$s4; sub $s0,$t0,$t1. For g = h + A[8]: lw $t0,32($s3) (temp reg t0 gets A[8]); add $s1,$s2,$t0.
  • 86. ADD r5,r1,r2 in ARM (decimal representation); the MIPS decimal representation of add $t0,$s1,$s2 (with $s1=17, $s2=18, $t0=8) is: 0 | 17 | 18 | 8 | 0 | 32, in fields of 6, 5, 5, 5, 5 and 6 bits.
• 87. MIPS fields (op rs rt rd shamt funct, of 6, 5, 5, 5, 5 and 6 bits): • op = opcode • rs = first register source operand • rt = second register source operand • rd = register destination operand • shamt = shift amount • funct = function; this field selects the specific variant of the operation in the op field, and is also called the function code.
• 88. • A problem occurs when an instruction needs longer fields than those shown. For example, the load word instruction must specify two registers and a constant. If the address were to use one of the 5-bit fields of the format above, the constant within the load word would be limited to only 2^5 or 32. This constant is used to select elements from arrays or data structures, and is often much larger than 32. • We have a conflict between the desire to keep all instructions the same length and the desire to have a single instruction format. • Design principle four: Good design demands good compromises. • The compromise chosen is to keep all instructions the same length, thereby requiring different kinds of instruction formats for different kinds of instructions: • R-type (register format, as shown above) • I-type (immediate or data transfer)
• 89. I-format: op (6 bits), rs (5 bits), rt (5 bits), constant or address (16 bits). The 16-bit address means a load word instruction can load any word within a region of ±2^15 bytes (±8192 words) of the address in the base register rs. Example: lw $t0,32($s3) ; temp reg $t0 gets A[8]. • Formats (Op, Rs, Rt, Rd, Shamt, Funct, const/address): add (R): 0, Reg, Reg, Reg, 0, 32 ; sub (R): 0, Reg, Reg, Reg, 0, 34 ; addi (I): 8, Reg, Reg, const ; lw (I): 35, Reg, Reg, address ; sw (I): 43, Reg, Reg, address
• 90. Translate MIPS assembly into machine code • A[300] = h + A[300] (byte offset 300 × 4 = 1200): lw $t0,1200($t1) # temp reg $t0 gets A[300] ; add $t0,$s1,$t0 # temp reg $t0 gets h + A[300] ; sw $t0,1200($t1) # stores h + A[300] back into A[300] • Decimal fields (op rs rt rd shamt/address funct): lw: 35, 9, 8, 1200 ; add: 0, 18, 8, 8, 0, 32 ; sw: 43, 9, 8, 1200
• 91. Machine code ($s0-$s7 map to registers 16-23, $t0-$t7 to 8-15), fields op rs rt rd shamt/address funct: lw: 100011 01001 01000 0000010010110000 ; add: 000000 10010 01000 01000 00000 100000 ; sw: 101011 01001 01000 0000010010110000
• 92. • In MIPS: shift left logical (sll), e.g. sll $t2,$s0,4 # reg $t2 = reg $s0 << 4 bits; shift right logical (srl). • In ARM: logical shift left (LSL), e.g. LSL r5,r1,#2 # r5 = r1 << 2; logical shift right (LSR), e.g. MOV r6,r5,LSR #4 # r6 = r5 >> 4. • MIPS encoding of the sll (op rs rt rd shamt funct): 0 | 0 | 16 | 10 | 4 | 0
• 93. [Flowchart: test i == j; if equal, f = g + h; if i ≠ j, f = g - h; both paths join at Exit]
• 94. • Shifting left by i bits gives the same result as multiplying by 2^i • Shifting right by i bits gives the same result as dividing by 2^i • Instructions for making decisions, in ARM: if (i==j) f=g+h; else f=g-h; Assigning variables to registers: r0=f, r1=g, r2=h, r3=i, r4=j CMP r3,r4 BNE Else ;go to Else if i ≠ j ADD r0,r1,r2 ;f = g + h B Exit ;go to Exit Else: SUB r0,r1,r2 ;f = g - h Exit:
• 95. Instructions for making decisions • if (i==j) f=g+h; else f=g-h; in MIPS: beq reg1,reg2,L1 # go to statement L1 if reg1 == reg2 bne reg1,reg2,L1 # go to statement L1 if reg1 ≠ reg2 f,g,h,i,j correspond to the five registers $s0-$s4: bne $s3,$s4,Else # go to Else if i ≠ j add $s0,$s1,$s2 # f = g + h (skipped if i ≠ j) j Exit # go to Exit (unconditional branch) Else: sub $s0,$s1,$s2 # f = g - h (skipped if i == j) Exit:
• 96. ARM and MIPS assembly for the loop. In ARM: while (save[i] == k) i += 1; Assume i and k correspond to registers r3 and r5 and the base of the array save is in r6. First step: load save[i] into a temp reg. Before that, add i to the base of the array save to form the address; multiply the index i by 4 because of byte addressing. Use LSL, since shifting left 2 bits multiplies by 2^2. Loop: ADD r12,r6,r3,LSL #2 ;r12 = address of save[i] LDR r0,[r12,#0] ;temp reg r0 = save[i] CMP r0,r5 BNE Exit ;go to Exit if save[i] ≠ k ADD r3,r3,#1 ;i = i + 1 B Loop ;go to Loop Exit:
• 97. In MIPS: Loop: sll $t1,$s3,2 ;temp reg $t1 = 4*i add $t1,$t1,$s6 ;$t1 = address of save[i] lw $t0,0($t1) ;temp reg $t0 = save[i] bne $t0,$s5,Exit ;go to Exit if save[i] ≠ k addi $s3,$s3,1 ;i = i + 1 j Loop Exit: • Example: let r0 = 1111 1111 and r1 = 0000 0001, and execute CMP r0,r1. Which conditional branch is taken? BLO L1 ;the branch on lower (unsigned) instruction is not taken, since as an unsigned number 1111 1111 is 255 > 1 BLT L2 ;the branch on less than (signed) instruction is taken, since as a signed number 1111 1111 is -1 < 1 • Note: an unsigned comparison of x < y also checks whether x is negative, as well as whether x is less than y.
• 98. In MIPS the instruction slt (set on less than) is used, e.g. in for loops: slt $t0,$s3,$s4 ;register $t0 is set to 1 if $s3 < $s4, otherwise $t0 is set to 0 slti $t0,$s2,10 ;$t0 = 1 if $s2 < 10 The MIPS architecture does not include a branch-on-less-than instruction because it is too complicated: it would either stretch the clock cycle time or take extra clock cycles per instruction. MIPS compilers use slt, slti, bne, beq and the fixed value 0 ($zero) to create all relative conditions: equal, not equal, less than, less than or equal, greater than, greater than or equal.
• 99. Case/Switch statements • The simplest way to implement switch is via a sequence of conditional tests, turning the switch statement into a chain of if-then-else statements. • Sometimes the alternatives can be encoded more efficiently as a table of addresses of alternative instruction sequences, called a jump address table (jump table); the program then only needs to index into the table and jump to the appropriate sequence. • The jump table is just an array of words containing addresses that correspond to labels in the code. The program jumps using the address in the appropriate entry of the jump table. • ARM handles such situations through the stored-program concept: we need a register holding the address of the current instruction being executed (PC, the program counter). • In ARM, register 15 is the PC (also the instruction address register). • Any instruction with register 15 as its destination register is an unconditional branch to the value written there.
• 100. • Encoding the branch instruction in ARM: Cond (4 bits), opcode 12 (4 bits), address (24 bits). • The cond field encodes the many versions of the conditional branch: 0 EQ (EQual), 1 NE (Not Equal), 2 HS (unsigned Higher or Same), 3 LO (unsigned LOwer), 4 MI (MInus, <0), 5 PL (PLus, >=0), 6 VS (oVerflow Set), 7 VC (oVerflow Clear), 8 HI (unsigned HIgher), 9 LS (unsigned Lower or Same), 10 GE (signed Greater than or Equal), 11 LT (signed Less Than), 12 GT (signed Greater Than), 13 LE (signed Less than or Equal), 14 AL (ALways), 15 NV (reserved)
• 101. • The cond field encodes the many versions of the conditional branch (shown in the table). • The 24-bit address limits programs to 2^24 or 16MB, which would be fine for many programs but constrains large ones. • An alternative would be a register added to the branch address, so the branch instruction would calculate PC = Reg + branch address. • The sum would allow programs to be as large as 2^32, solving the branch-address size problem. • Which register? Conditional branches are found in loops and in if statements, so they tend to branch to a nearby instruction; for example, about half of all conditional branches in the SPEC benchmarks go to locations less than 16 instructions away. • Since the PC contains the address of the current instruction, we can branch within ±2^24 words of the current instruction if we use the PC as the register to be added to the address. Almost all loops and if statements are much smaller than ±2^24 words, so the PC is the ideal choice. • This is called PC-relative addressing.
• 102. Conditional Execution • An unusual feature of ARM is that most instructions, not just branches, can be conditionally executed. This is the purpose of the 4-bit cond field found in most ARM instruction formats. • The assembly language programmer simply appends the desired condition to the instruction name; the operation is performed only if the condition is true, based on the last time the condition flags were set. • Example: CMP r3,r4 ADDEQ r0,r1,r2 SUBNE r0,r1,r2
• 103. Procedures in Computer Hardware • Procedure: a stored subroutine that performs a specific task based on the parameters with which it is provided. • Registers are the fastest place to hold data in a computer, so we use them as much as possible. ARM software follows these conventions for procedure calling in allocating its 16 registers: • r0-r3: four argument registers in which to pass parameters • lr: one link register containing the return address, used to return to the point of origin • BL ProcedureAddress: the branch-and-link instruction jumps to an address and simultaneously saves the address of the following instruction in register lr (register 14).
• 104. • MOV pc,lr • The calling program, or caller, puts the parameter values in r0-r3 and uses BL X to jump to procedure X (the callee). The callee then performs the calculations, places the results (if any) into r0 and r1, and returns control to the caller using MOV pc,lr. • The BL instruction actually saves PC+4 in register lr, to link to the following instruction and set up the procedure return. • Stack: suppose a compiler needs more registers for a procedure than the four argument and two return-value registers. Since we must cover our tracks after our mission is complete, any registers needed by the caller must be restored to the values they contained before the procedure was invoked. This is a situation in which we need to spill registers to memory. The ideal data structure for spilling registers is a stack: a last-in-first-out queue.
• 105. • A stack needs a pointer to the most recently allocated address in the stack, to show where the next procedure should place the registers to be spilled or where old register values are found. • The stack pointer is adjusted by one word for each register that is saved or restored. ARM reserves register 13 for the stack pointer sp. • Compile a C procedure that does not call another procedure: int example(int g, int h, int i, int j) { int f; f = (g + h) - (i + j); return f; }
• 106. • Let g,h,i,j correspond to the argument registers r0,r1,r2,r3 and f correspond to r4. • Label the procedure ex_procedure: • Save three registers: r4, r5, r6. • "Push" the old values onto the stack by creating space for three words (12 bytes) on the stack, then store them: SUB sp,sp,#12 ;adjust stack to make room for 3 items STR r6,[sp,#8] ;save register r6 for use afterwards STR r5,[sp,#4] ;save register r5 for use afterwards STR r4,[sp,#0] ;save register r4 for use afterwards ADD r5,r0,r1 ;reg r5 contains g+h ADD r6,r2,r3 ;reg r6 contains i+j SUB r4,r5,r6 ;f gets r5-r6, which is (g+h)-(i+j)
• 107. [Figure: values of sp and the stack before, during and after the procedure call. Before the call, sp points near the high addresses; during the call, sp has moved down by 12 bytes and the stack holds the contents of r6, r5 and r4; after the call, sp is back at its original position]
• 108. To return the value of f, we copy it into the return-value register r0: MOV r0,r4 ;return f (r0=r4) Before returning, we restore the three old values of the registers we saved by "popping" them from the stack: LDR r4,[sp,#0] ;restore register r4 for caller LDR r5,[sp,#4] ;restore register r5 for caller LDR r6,[sp,#8] ;restore register r6 for caller ADD sp,sp,#12 ;adjust stack to delete 3 items The procedure ends with a branch through the return address: MOV pc,lr ;jump back to calling routine. We used temporary registers and assumed their old values must be saved and restored. To avoid saving and restoring a register whose value is never used, which might happen with a temporary register, ARM software separates 12 of the registers into two groups:
• 109. r0-r3, r12: argument or scratch registers that are not preserved by the callee (the called procedure) on a procedure call. r4-r11: eight variable registers that must be preserved on a procedure call (if used, the callee saves and restores them). Note: this simple convention reduces register spilling. In the example above, if we rewrote the code to use r12 and reuse one of r0-r3, we could drop two stores and two loads. We must still save and restore r4, since the callee must assume that the caller needs its value. In MIPS: MIPS software follows these conventions in allocating its 32 registers for procedure calling: $a0-$a3: four argument registers in which to pass parameters. $v0-$v1: two value registers in which to return values. $ra: one return address register to return to the point of origin.
• 110. jal: the jump-and-link instruction, which jumps to an address and simultaneously saves the address of the following instruction in a register ($ra in MIPS): jal ProcedureAddress The jal instruction saves PC+4 in register $ra, to link to the following instruction and set up the procedure return. jr $ra ;the jump register instruction: an unconditional jump to the address specified in a register. The calling program, or caller, puts the parameter values in $a0-$a3 and uses jal X to jump to procedure X (the callee). The callee then performs the calculations, places the results (if any) into $v0 and $v1, and returns control to the caller using jr $ra.
• 111. • Let g,h,i,j correspond to the argument registers $a0,$a1,$a2,$a3 and f correspond to $s0. • Label the procedure ex_procedure: • Save three registers: $s0, $t0, $t1. • "Push" the old values onto the stack by creating space for three words (12 bytes) on the stack, then store them: addi $sp,$sp,-12 ;adjust stack to make room for 3 items sw $t1,8($sp) ;save register $t1 for use afterwards sw $t0,4($sp) ;save register $t0 for use afterwards sw $s0,0($sp) ;save register $s0 for use afterwards add $t0,$a0,$a1 ;reg $t0 contains g+h add $t1,$a2,$a3 ;reg $t1 contains i+j sub $s0,$t0,$t1 ;f gets $t0-$t1, which is (g+h)-(i+j) add $v0,$s0,$zero ;returns f ($v0 = $s0 + 0)
• 112. Before returning, we restore the three old values of the registers we saved by "popping" them from the stack: lw $s0,0($sp) ;restore register $s0 for caller lw $t0,4($sp) ;restore register $t0 for caller lw $t1,8($sp) ;restore register $t1 for caller addi $sp,$sp,12 ;adjust stack to delete 3 items The procedure ends with a jump register using the return address: jr $ra ;jump back to calling routine. We used temporary registers and assumed their old values must be saved and restored. To avoid saving and restoring a register whose value is never used, which might happen with a temporary register, MIPS software separates 18 of the registers into two groups: $t0-$t9: 10 temporary registers that are not preserved by the callee (the called procedure) on a procedure call. $s0-$s7: eight saved registers that must be preserved on a procedure call (if used, the callee saves and restores them).
• 113. Note: this simple convention reduces register spilling. In the example above, since the caller (the procedure doing the calling) does not expect registers $t0 and $t1 to be preserved across a procedure call, we can drop two stores and two loads from the code. We must still save and restore $s0, since the callee must assume that the caller needs its value. Nested procedures: suppose the main program calls procedure A with an argument of 3, by placing the value 3 into register r0 and then using BL A. Suppose procedure A then calls procedure B via BL B with an argument of 7, also placed in r0. Since A has not finished its task yet, there is a conflict over the use of register r0. Similarly, there is a conflict over the return address in register lr, which now holds the return address for B. Solution: push all the other registers that must be preserved onto the stack.
• 114. • The caller pushes any argument registers (r0-r3) that are needed after the call. The callee pushes the return address register lr and any variable registers (r4-r11) it uses. The sp is adjusted to account for the number of registers placed on the stack. Upon the return, the registers are restored from memory and the sp is readjusted. • Convert into ARM assembly code: int fact(int n) { if (n < 1) return (1); else return (n * fact(n - 1)); }
• 115. fact in ARM:
fact: SUB sp,sp,#8 ;adjust stack for 2 items
STR lr,[sp,#4] ;save return address
STR r0,[sp,#0] ;save the argument n
CMP r0,#1 ;compare n to 1
BGE L1 ;if n>=1, go to L1
MOV r0,#1 ;return 1
ADD sp,sp,#8 ;pop two items off stack
MOV pc,lr ;return to the caller
L1: SUB r0,r0,#1 ;n>=1: argument gets (n-1)
BL fact ;call fact with (n-1)
MOV r12,r0 ;save the return value
LDR r0,[sp,#0] ;return from BL; restore argument n
LDR lr,[sp,#4] ;restore the return address
ADD sp,sp,#8 ;adjust sp to pop 2 items
MUL r0,r0,r12 ;return n * fact(n-1)
MOV pc,lr ;return to the caller
fact in MIPS:
fact: addi $sp,$sp,-8 ;adjust stack for 2 items
sw $ra,4($sp) ;save return address
sw $a0,0($sp) ;save the argument n
slti $t0,$a0,1 ;test for n<1
beq $t0,$zero,L1 ;if n>=1, go to L1
addi $v0,$zero,1 ;return 1
addi $sp,$sp,8 ;pop two items off stack
jr $ra ;return to after jal
L1: addi $a0,$a0,-1 ;n>=1: argument gets (n-1)
jal fact ;call fact with (n-1)
lw $a0,0($sp) ;return from jal; restore argument n
lw $ra,4($sp) ;restore the return address
addi $sp,$sp,8 ;adjust sp to pop 2 items
mul $v0,$a0,$v0 ;return n * fact(n-1)
jr $ra ;return to caller
• 116. [Figure: stack allocation before, during and after the procedure call. $fp marks the start of the procedure frame; the stack grows toward low addresses with saved argument registers (if any), the saved return address, saved registers (if any) and local arrays and structures (if any); $sp points to the top of the stack]
• 117. What is and what is not preserved across a procedure call • Preserved: variable registers (r4-r11), stack pointer register (sp), link register (lr), stack above the sp • Not preserved: argument registers (r0-r3), intra-procedure-call scratch register (r12), stack below the sp
• 118. • Storage classes in C: automatic (local to a procedure) and static (variables declared outside procedures). • Global pointer ($gp): to simplify access to global data, MIPS software reserves another register, called the global pointer. • Allocating space for new data on the stack: a final complexity is that the stack is also used to store variables that are local to the procedure but do not fit in the registers, such as local arrays and structures. • The segment of the stack containing a procedure's saved registers and local variables is called the procedure frame (activation record). • Some MIPS software uses a frame pointer ($fp) to point to the first word of the frame of the procedure. • $fp denotes the location of the saved registers and local variables for a given procedure. • $sp points to the top of the stack; when $fp is used, it is initialized using the address in $sp on a call, and $sp is restored using $fp.
• 119. [Figure: MIPS convention for allocating memory, including space for new data on the heap. From high to low addresses: stack starting at $sp = 7fff fffc(hex) and growing down; dynamic data (heap) growing up toward it; static data at $gp = 1000 8000(hex), beginning at 1000 0000(hex); text (program code) beginning at pc = 0040 0000(hex); a reserved area at address 0]
• 120. • In addition to the automatic variables that are local to procedures, C programmers need space in memory for static variables and for dynamic data structures. • Text segment: the segment of the UNIX object file that contains the machine code for the routines in the source file. • Static data segment: the place for constants and other static variables. • Data structures like linked lists tend to grow and shrink during their lifetimes; the segment for such data structures is the heap. • The stack and heap grow toward each other, allowing efficient use of memory as the two segments wax and wane. • C allocates and frees space on the heap with explicit functions: malloc() allocates space on the heap and returns a pointer to it, and free() releases the space on the heap to which the pointer points.
• 121. • Memory allocation is controlled by the program in C, and it is the source of many common and difficult bugs. • Forgetting to free space leads to a "memory leak", which eventually uses up so much memory that the OS may crash. Freeing space too early leads to "dangling pointers", which can cause pointers to point to things the program never intended. • The GNU MIPS C compiler uses a frame pointer; the C compiler from MIPS/Silicon Graphics does not use $fp, and instead uses register 30 as another save register.
• 122. ARM register conventions (name, register number, usage; preserved on call?): • a1-a2 (0-1): argument/return result/scratch registers; no • a3-a4 (2-3): argument/scratch registers; no • v1-v8 (4-11): variables for local routines; yes • ip (12): intra-procedure-call scratch register; no • sp (13): stack pointer; yes • lr (14): link register; yes • pc (15): program counter; n.a.
• 123. MIPS register conventions (name, register number, usage; preserved on call?): • $zero (0): the constant value 0; n.a. • $v0-$v1 (2-3): values for results and expression evaluation; no • $a0-$a3 (4-7): arguments; no • $t0-$t7 (8-15): temporaries; no • $s0-$s7 (16-23): saved; yes • $t8-$t9 (24-25): more temporaries; no • $gp (28): global pointer; yes • $sp (29): stack pointer; yes • $fp (30): frame pointer; yes • $ra (31): return address; yes
• 124. 32-bit immediate operands • Although constants are frequently short and fit into the 16-bit field, sometimes they are bigger. The MIPS instruction set includes the instruction load upper immediate (lui), specifically to set the upper 16 bits of a constant in a register, allowing a subsequent instruction to specify the lower 16 bits of the constant.
• 125. Machine code of lui $t0,255 ($t0 is register 8): 001111 | 00000 | 01000 | 0000 0000 1111 1111. Contents of register $t0 after executing lui $t0,255: 0000 0000 1111 1111 0000 0000 0000 0000
• 126. • Loading a 32-bit constant: what is the MIPS assembly code to load this 32-bit constant into register $s0? 0000 0000 0011 1101 0000 1001 0000 0000 First load the upper 16 bits, which are 61 in decimal, using lui: lui $s0,61 The value of register $s0 afterwards is: 0000 0000 0011 1101 0000 0000 0000 0000 Then insert the lower 16 bits, whose decimal value is 2304: ori $s0,$s0,2304 The final value in register $s0: 0000 0000 0011 1101 0000 1001 0000 0000
• 127. Compiling a string copy procedure: void strcpy(char x[], char y[]) { int i; i = 0; while ((x[i] = y[i]) != '\0') i += 1; } • In ARM: assume the base addresses of arrays x and y are found in r0 and r1, while i is in r4. strcpy adjusts the stack pointer and then saves the saved register r4 on the stack: strcpy: SUB sp,sp,#4 ;adjust stack for 1 more item STR r4,[sp,#0] ;save r4 To initialize i to 0, the next instruction sets r4 to 0: MOV r4,#0 ;i = 0
• 128. L1: ADD r2,r4,r1 ;address of y[i] in r2 (this is the beginning of the loop; the address of y[i] is formed by adding i to the base of y[], assuming an array of bytes) To load the character in y[i]: LDRB r3,[r2,#0] ;r3 = y[i] (load register byte: loads a byte from memory) ADD r12,r4,r0 ;address of x[i] in r12 STRB r3,[r12,#0] ;x[i] = y[i] CMP r3,#0 ;set the condition flags, testing for end of string BEQ L2 ;if y[i]==0, go to L2 Increment i and loop back: ADD r4,r4,#1 ;i = i + 1 B L1 ;go to L1 If we do not loop back, it was the last character of the string; we restore r4 and the stack pointer, and then return: L2: LDR r4,[sp,#0] ;y[i]==0: end of string, restore old r4 ADD sp,sp,#4 ;pop 1 word off stack MOV pc,lr ;return
• 129. In MIPS: strcpy: addi $sp,$sp,-4 ;adjust stack for 1 more item sw $s0,0($sp) ;save $s0 add $s0,$zero,$zero ;i = 0 + 0 L1: add $t1,$s0,$a1 ;address of y[i] in $t1 lb $t2,0($t1) ;$t2 = y[i] add $t3,$s0,$a0 ;address of x[i] in $t3 sb $t2,0($t3) ;x[i] = y[i] beq $t2,$zero,L2 ;if y[i]==0, go to L2 addi $s0,$s0,1 ;i = i + 1 j L1 ;go to L1 L2: lw $s0,0($sp) ;restore old $s0 addi $sp,$sp,4 ;pop 1 word off stack jr $ra ;return
• 130. Addressing in branches and jumps j 10000 ;go to location 10000. Fields: 2 | 10000, of 6 and 26 bits. bne $s0,$s1,Exit ;go to Exit if $s0 ≠ $s1. Fields: 5 | 16 | 17 | Exit, of 6, 5, 5 and 16 bits. If the addresses of the program are bigger than 16 bits, then PC = Reg + branch address, and the sum is 32 bits.
• 131. Branching far away: given beq $s0,$s1,L1, replace it with a pair of instructions that offer a much greater branching distance: bne $s0,$s1,L2 j L1 L2: MIPS addressing modes: 1) Register addressing (the operand is a register) 2) Base or displacement addressing (the operand is at the memory location whose address is the sum of a register and a constant in the instruction) 3) Immediate addressing (the operand is a constant within the instruction itself) 4) PC-relative addressing (the branch address is the sum of the PC and a constant in the instruction) 5) Pseudodirect addressing (the jump address is the 26 bits of the instruction concatenated with the upper bits of the PC)
• 132. • Convert this machine instruction into assembly: 00af8020(hex) = 0000 0000 1010 1111 1000 0000 0010 0000 Find the op field: bits 31-29 and 28-26 are 000 and 000, hence an R-format instruction: 000000 | 00101 | 01111 | 10000 | 00000 | 100000 (op rs rt rd shamt funct) Bits 5-3 are 100 and bits 2-0 are 000, so the funct field represents add. The decimal values are 5 for rs, 15 for rt and 16 for rd (shamt is unused); these numbers represent registers $a1, $t7 and $s0: add $s0,$a1,$t7
• 133. Translating and starting a program: C program → COMPILER → assembly program → ASSEMBLER → object (machine language) module → LINKER (together with library routines in machine language) → executable machine language program → LOADER → memory. Source file: x.c; assembly file: x.s; object file: x.o. Statically linked library routines are x.a and dynamically linked library routines are x.so. Executable files by default are called a.out. MS-DOS uses .C, .ASM, .OBJ, .LIB, .DLL and .EXE.
• 134. Compiling a Java program: Java program → COMPILER → class files (bytecodes), which the JVM executes together with the Java library routines (machine language); the JIT produces compiled Java methods (machine language). JVM: a software interpreter that can execute bytecodes; a program that simulates an ISA. It is portable, found in devices from mobile phones to internet browsers. To preserve portability and improve execution speed, the next phase is the JIT. JIT: a compiler that operates at runtime, translating the interpreted code segments into the native code of the computer.
• 135. Arithmetic for computers • Two's complement representation: positive and negative 32-bit numbers can be represented as x31 × (-2^31) + x30 × 2^30 + x29 × 2^29 + … + x1 × 2^1 + x0 × 2^0 • The sign bit is multiplied by -2^31, and the rest of the bits are then multiplied by positive versions of their respective base values. • Exercise: find the decimal value of the two's complement number 1111 1111 1111 1111 1111 1111 1111 1100. Multiplication: sequential version of the multiplication algorithm and hardware. Assume the multiplier is in the 32-bit Multiplier register and the 64-bit Product register is initialized to 0. We need to move the multiplicand left one digit each step, as it may be added to the intermediate products. Over 32 steps, a 32-bit multiplicand would move 32 bits to the left. Hence we need a 64-bit Multiplicand register, initialized with the 32-bit multiplicand in the right half and zero in the left half. This register is then shifted left 1 bit each step, to align the multiplicand with the sum being accumulated in the 64-bit Product register.
• 136. [Figure: first version of the multiplication hardware. A 64-bit Multiplicand register (shift left), a 64-bit ALU, a 64-bit Product register (write), a 32-bit Multiplier register (shift right), and control test logic]
• 137. First multiplication algorithm (see flowchart_multiply.pdf) • Three basic steps are needed for each bit. These three steps are repeated 32 times to obtain the product. If each step took a clock cycle, this algorithm would require almost 100 clock cycles to multiply two 32-bit numbers.
• 138. Multiply 0010 × 0011 (multiplicand 2, multiplier 3):
Iteration 0, initial values: Multiplier 0011, Multiplicand 0000 0010, Product 0000 0000
Iteration 1: 1a: multiplier bit 1, so Prod = Prod + Mcand → 0000 0010; 2: shift Multiplicand left → 0000 0100; 3: shift Multiplier right → 0001
Iteration 2: 1a: multiplier bit 1, so Prod = Prod + Mcand → 0000 0110; 2: shift Multiplicand left → 0000 1000; 3: shift Multiplier right → 0000
Iteration 3: 1: multiplier bit 0, so no operation; 2: shift Multiplicand left → 0001 0000; 3: shift Multiplier right → 0000
Iteration 4: 1: multiplier bit 0, so no operation; 2: shift Multiplicand left → 0010 0000; 3: shift Multiplier right → 0000; final Product 0000 0110
• 139. [Figure: refined version of the multiplication hardware. A 32-bit Multiplicand register, a 32-bit ALU, a 64-bit Product register (shift right, write), and control test logic]
• 140. [Flowchart: Booth's algorithm. Multiplicand in BR, multiplier in QR; initialize AC←0, Qn+1←0, SC←n. Examine the bit pair Qn Qn+1: if 10, AC←AC+BR'+1 (subtract BR); if 01, AC←AC+BR; if 00 or 11, do nothing. Then arithmetic-shift-right (AC & QR) and SC←SC-1; if SC ≠ 0, repeat; else END]
• 141. [Figure: hardware for Booth's algorithm. The BR register feeds a complementer and parallel adder; the AC register and QR register hold the accumulating result (with bits Qn and Qn+1); a sequence counter controls the iterations]
• 142. [Figure: first version of the division hardware. A 64-bit Divisor register (shift right), a 64-bit ALU, a 64-bit Remainder register (write), a 32-bit Quotient register (shift left), and control test logic]
• 143. Floating Point Reals in mathematics: 3.14159265… (π), 2.71828… (e), 0.000000001 or 1.0 × 10^-9, 3,155,760,000 or 3.15576 × 10^9. The last number does not represent a small fraction, but it is bigger than we can represent with a 32-bit signed integer. The alternative notation for the last two numbers is called scientific notation, which has a single digit to the left of the decimal point. A number in scientific notation that has no leading 0s is called a normalized number, which is the usual way to write it: 1.0 × 10^-9 is in normalized scientific notation, but 0.1 × 10^-8 and 10.0 × 10^-10 are not. Note: all numbers here are in decimal.
• 144. • We can also show binary numbers in scientific notation, e.g. 1.0 × 2^-1. Floating point: computer arithmetic that represents numbers in which the binary point is not fixed. In scientific notation: (1.xxxxxx)₂ × 2^yyyy. Advantages of scientific notation in normalized form: 1) It simplifies exchange of data that includes floating-point numbers. 2) It simplifies the floating-point arithmetic algorithms. 3) It increases the accuracy of the numbers that can be stored in a word. General form: (-1)^S × F × 2^E. Single-precision layout: S (1 bit, bit 31), Exponent (8 bits, bits 30-23), Fraction (23 bits, bits 22-0).
• 145. • These chosen sizes of exponent and fraction give MIPS computer arithmetic an extraordinary range: fractions almost as small as 2.0 × 10^-38 and numbers as large as 2.0 × 10^38. Overflow: the positive exponent becomes too large to fit in the exponent field. Underflow: the negative exponent becomes too large to fit in the exponent field. Double precision: a floating-point value represented in two 32-bit words. Single precision: a floating-point value represented in a single 32-bit word. MIPS double precision: large numbers up to 2 × 10^308, small numbers down to 2 × 10^-308. These formats go beyond MIPS: they are part of the IEEE 754 floating-point standard, found in virtually every computer invented since 1980. Double-precision layout: S (1 bit, bit 31), Exponent (11 bits, bits 30-20), Fraction (20 bits, bits 19-0, continued in the second 32-bit word).
• 146. To pack more bits into the significand (fraction), IEEE 754 makes the leading 1 bit of normalized binary numbers implicit. Hence the number is actually 24 bits long in single precision (1 implied + 23-bit fraction), and 53 bits long in double precision (1 + 52). General form: (-1)^S × (1 + Fraction) × 2^E, where the bits of the fraction represent a number between 0 and 1 and E specifies the value in the exponent field. If we number the bits of the fraction from left to right s1, s2, s3, …, then the value is (-1)^S × (1 + s1 × 2^-1 + s2 × 2^-2 + …) × 2^E
• 147. Examples (exponent field written in two's complement): • 1.0 × 2^-1 : S = 0, exponent = 1111 1111, fraction = 0000 0000… • 1.0 × 2^+1 : S = 0, exponent = 0000 0001, fraction = 0000 0000…
• 148. The Processor: [Figure: abstract view of the MIPS datapath. The PC supplies the instruction address to the instruction memory; the fetched instruction's register numbers go to the register file; the ALU operates on register operands; the data memory is read or written; two adders compute PC + 4 and the branch target]
• 149. • Abstract view of the implementation of the MIPS subset, showing the major functional units and the major connections between them: a) All instructions start by using the PC to supply the instruction address to the instruction memory. b) After the instruction is fetched, the register operands it uses are specified by fields of that instruction. c) Once the register operands have been fetched, they can be operated on to compute a memory address (for a load or store), to compute an arithmetic result (for an integer arithmetic-logical instruction), or to compare (for a branch).
• 150. d) If the instruction is an arithmetic-logical instruction, the result from the ALU must be written to a register. e) If the operation is a load or store, the ALU result is used as an address to either store a value from the registers or load a value from memory into the registers. The result from the ALU or memory is written back into the register file.
• 151. f) Branches require the use of the ALU output to determine the next instruction address, which comes either from the ALU (where the PC and the branch offset are summed) or from an adder that increments the current PC by 4. g) All instruction classes (memory reference, arithmetic-logical and branch) except jump use the ALU after reading the registers.
• 152. Logic design conventions • The functional units in the MIPS implementation consist of two different types of logic elements: elements that operate on data values and elements that contain state. • Elements that operate on data values are called combinational, which means that their output depends only on the current inputs: given the same inputs, a combinational element always produces the same output (e.g., the ALU). • An element contains state if it has some internal storage. These are called state elements, because if we pulled the plug on the machine we could restart it by loading the state elements with the values they contained before we pulled the plug. • The instruction and data memories and the registers are state elements. A state element has two inputs and one output.
• 153. • The inputs are the data value to be written into the element and the clock, which determines when the data value is written. The output from the state element provides the value that was written in an earlier clock cycle (e.g., a D-type flip-flop). • Logic components that contain state are sequential: their output depends on both the inputs and the contents of the internal state (e.g., registers). • Clocking methodology defines when data is valid and stable relative to the clock. • Edge-triggered clocking: a clocking scheme in which all state changes occur on a clock edge. Any values stored in a sequential logic element are updated only on a clock edge. • The inputs are the values that were written in a previous clock cycle, while the outputs are values that can be used in a following clock cycle.
• 154. Combinational logic, state elements, and the clock are closely related. [Figure: combinational logic placed between state element 1 and state element 2; the clock cycle determines when state element 2 captures the output of the combinational logic.]
• 155. Edge-triggered methodology allows a state element to be read and written in the same clock cycle without creating a race that could lead to indeterminate data values. [Figure: a state element feeding combinational logic whose output is written back to the same element.]
• 156. • Control signal: a signal used for multiplexor selection or for directing the operation of a functional unit. • It contrasts with a data signal, which contains information that is operated on by a functional unit.
• 157. Building a datapath. [Figure: the basic elements — a) instruction memory, which takes an instruction address and supplies the instruction; b) the PC register; c) an adder.]
• 158. [Figure: portion of the datapath used for fetching instructions and incrementing the PC — the PC supplies the read address to instruction memory, and an adder computes PC + 4.]
• 159. Register file. [Figure: two 5-bit read-register inputs, a 5-bit write-register input, a write-data input, two read-data outputs, and the RegWrite control signal.]
• 161. [Figure: the data memory unit (Address, Write data, Read data, with MemRead and MemWrite controls) and the sign-extension unit (16 bits to 32 bits).]
• 162. The datapath for a branch uses the ALU to evaluate the branch condition and a separate adder to compute the branch target as the sum of the incremented PC and the sign-extended lower 16 bits of the instruction shifted left two bits. [Figure: branch datapath — PC + 4 from the instruction datapath, register file reads (RegWrite, Read reg1/reg2, Write reg), sign extend, shift left 2, an adder producing the branch target, and the ALU Zero output going to the branch control logic.]
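The branch-target arithmetic just described can be sketched in a few lines of Python (a minimal illustration added here, not from the slides — the helper names are made up):

```python
def sign_extend16(imm):
    """Sign-extend a 16-bit immediate field to a Python int."""
    return imm - 0x10000 if imm & 0x8000 else imm

def branch_target(pc, imm16):
    """Branch target = (PC + 4) + (sign-extended offset << 2),
    mirroring the separate adder in the branch datapath."""
    return (pc + 4) + (sign_extend16(imm16) << 2)

print(hex(branch_target(0x1000, 0x0003)))   # forward branch, offset +3 words
print(hex(branch_target(0x1000, 0xFFFF)))   # backward branch, offset -1 word
```

The shift left by two reflects the fact that MIPS branch offsets are counted in words, not bytes.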
• 163. Creating a single datapath 1) To share a datapath element between two different instruction classes, we may need to allow multiple connections to the input of the element, using a multiplexor and a control signal to select among the multiple inputs. 2) The operations of the R-type and memory-instruction datapaths are quite similar, but there are differences: a) The R-type uses the ALU with inputs coming from two registers. The memory instructions also use the ALU for address calculation, although their second input is the sign-extended 16-bit offset field from the instruction. b) The value stored into the destination register comes from the ALU (R-type) or the memory (for load).
• 164. ALU Control

ALU control lines   Function
0000                AND
0001                OR
0010                add
0110                subtract
0111                set on less than
1100                NOR
• 165. • We can generate the 4-bit ALU control input using a small control unit that has as inputs the function field of the instruction and a 2-bit control field, ALUOp. • ALUOp indicates whether the operation to be performed should be add (00) for loads and stores, subtract (01) for beq, or determined by the operation encoded in the function field (10). • The output of the ALU control unit is a 4-bit signal that directly controls the ALU by generating one of the 4-bit combinations above.
• 166. How to set the ALU control inputs based on the 2-bit ALUOp control and the 6-bit function code:

Instruction opcode  ALUOp  Instruction operation  Function field  Desired ALU action  ALU control input
lw                  00     load word              xxxxxx          add                 0010
sw                  00     store word             xxxxxx          add                 0010
beq                 01     branch equal           xxxxxx          subtract            0110
R-type              10     add                    100000          add                 0010
R-type              10     subtract               100010          subtract            0110
R-type              10     AND                    100100          AND                 0000
R-type              10     OR                     100101          OR                  0001
R-type              10     set on less than       101010          set on less than    0111
• 167. • The main control unit generates the ALUOp bits, which are then used as input to the ALU control, which generates the actual signals to control the ALU unit. • Using multiple levels of control can reduce the size of the main control unit. • Using several smaller control units may also potentially increase the speed of the control unit.
• 168. Truth table for the ALU control bits:

ALUOp1  ALUOp0  Function field  Operation
0       0       xxxxxx          0010
x       1       xxxxxx          0110
1       x       100000          0010
1       x       100010          0110
1       x       100100          0000
1       x       100101          0001
1       x       101010          0111
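The ALU control truth table can be expressed directly as a small decoder in Python (a sketch added for illustration; the function name is made up):

```python
def alu_control(alu_op, funct):
    """Return the 4-bit ALU control value from the 2-bit ALUOp field
    and the 6-bit function field, following the truth table above."""
    if alu_op == 0b00:              # lw/sw: address calculation -> add
        return 0b0010
    if alu_op == 0b01:              # beq: compare via subtract
        return 0b0110
    # ALUOp = 10: R-type, decode the function field
    return {0b100000: 0b0010,       # add
            0b100010: 0b0110,       # sub
            0b100100: 0b0000,       # and
            0b100101: 0b0001,       # or
            0b101010: 0b0111}[funct]  # slt

print(bin(alu_control(0b10, 0b100010)))   # R-type subtract -> 0b110
```

Note how the funct field is consulted only when ALUOp is 10, exactly as the don't-care entries in the table indicate.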
• 169. Designing the Main Control Unit • To connect the fields of an instruction to the datapath, the instruction formats must be reviewed. • Op field: bits 31–26. • Two registers to be read, rs and rt: bits 25–21 and 20–16. This is true for R-type, branch equal, and store. • The base register for load and store instructions is always in bit positions 25–21 (rs). • R-type format: opcode 0 (31–26), rs (25–21), rt (20–16), rd (15–11), shamt (10–6), funct (5–0).
• 170. Load or store instruction (opcode: load = 35, store = 43): opcode (31–26), rs (25–21), rt (20–16), address (15–0). Branch instruction (opcode = 4): opcode (31–26), rs (25–21), rt (20–16), address (15–0).
• 171. Simple datapath with control unit • The input to the control unit is the 6-bit opcode field from the instruction. The outputs of the control unit consist of three 1-bit signals that are used to control multiplexors (RegDst, ALUSrc, and MemtoReg), • three signals for controlling reads and writes in the register file and data memory (RegWrite, MemRead, MemWrite), • a 1-bit signal used in determining whether to possibly branch (Branch), and a 2-bit control signal for the ALU (ALUOp). • An AND gate is used to combine the branch control signal and the Zero output from the ALU. • The AND gate output controls the selection of the next PC.
• 172. • The setting of the control lines is completely determined by the opcode field of the instruction. • The first row: R-type (add, sub, and, or, slt). For all these instructions, the source register fields are rs and rt and the destination is rd; this defines how the signals ALUSrc and RegDst are set.

Instruction  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
R-format     1       0       0         1         0        0         0       1       0
lw           0       1       1         1         1        0         0       0       0
sw           x       1       x         0         0        1         0       0       0
beq          x       0       x         0         0        0         1       0       1
• 173. • An R-type instruction writes a register (RegWrite = 1), but neither reads nor writes data memory. • When the Branch control signal is 0, the PC is unconditionally replaced with PC + 4; otherwise the PC is replaced by the branch target if the Zero output of the ALU is also high. • The ALUOp field of an R-type instruction is set to 10 to indicate that the ALU control should be generated from the function field. • The second and third rows show the control signal settings for lw and sw. • The ALUSrc and ALUOp fields are set to perform the address calculation. • MemRead and MemWrite are set to perform the memory access. • Finally, RegDst and RegWrite are set for a load to cause the result to be stored into the rt register.
• 174. • The branch instruction is similar to an R-type, since it sends the rs and rt registers to the ALU. • The ALUOp field for branch is set for a subtract (ALU control = 01), which is used to test for equality. • Note: the MemtoReg field is irrelevant when the RegWrite signal is 0; since the register is not being written, the value of the data on the register write port is not used. • Thus the MemtoReg entry in the last two rows of the table is replaced with x for don't care. • A don't care can also be added to RegDst when RegWrite is 0. This type of don't care must be added by the designer, since it depends on knowledge of how the datapath works.
• 175. Finalizing the control: the control function for the single-cycle implementation, specified by a truth table.

Signal name  R-format  lw  sw  beq
Inputs:
Op5          0         1   1   0
Op4          0         0   0   0
Op3          0         0   1   0
Op2          0         0   0   1
Op1          0         1   1   0
Op0          0         1   1   0
Outputs:
RegDst       1         0   x   x
ALUSrc       0         1   1   0
MemtoReg     0         1   x   x
RegWrite     1         1   0   0
MemRead      0         1   0   0
MemWrite     0         0   1   0
Branch       0         0   0   1
ALUOp1       1         0   0   0
ALUOp0       0         0   0   1
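The single-cycle control function can be sketched as a lookup from opcode to control-line settings (an illustrative Python rendering of the truth table above, with None standing in for don't cares — not from the slides):

```python
# Control-line settings per instruction class; None marks a don't care.
CONTROL = {
    "r-format": dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
                     MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),
    "lw":       dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
                     MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),
    "sw":       dict(RegDst=None, ALUSrc=1, MemtoReg=None, RegWrite=0,
                     MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),
    "beq":      dict(RegDst=None, ALUSrc=0, MemtoReg=None, RegWrite=0,
                     MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),
}

def main_control(opcode):
    """Decode the 6-bit opcode (0 = R-format, 35 = lw, 43 = sw, 4 = beq)
    to its control-line settings."""
    return CONTROL[{0: "r-format", 35: "lw", 43: "sw", 4: "beq"}[opcode]]

print(main_control(35))   # lw: reads data memory, writes the register file
```

A hardware implementation would realize the same function with gates; the dictionary simply makes the truth table executable.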
• 176. Implementing jumps • The jump instruction looks somewhat like a branch instruction, but it computes the target PC differently and is not conditional. • Like a branch, the low-order two bits of the jump address are always 00. The next lower 26 bits of this 32-bit address come from the 26-bit immediate field of the instruction. Format: opcode 000010 (31:26), address (25:0).
• 177. • The upper 4 bits of the address that should replace the PC come from the PC of the jump instruction plus 4. We can implement the jump by storing into the PC the concatenation of: • the upper 4 bits of the current PC + 4 (bits 31:28), • the 26-bit immediate field of the jump instruction, • the bits 00.
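The concatenation above can be written out as bit operations (a short Python sketch for illustration; the function name and sample addresses are made up):

```python
def jump_target(pc, imm26):
    """Jump target: upper 4 bits of PC + 4, then the 26-bit immediate
    field, then the bits 00 (i.e., the field shifted left by 2)."""
    return ((pc + 4) & 0xF0000000) | (imm26 << 2)

# Same 256 MB region as PC + 4, word-aligned target:
print(hex(jump_target(0x10400020, 0x0000100)))
```

Because only bits 31:28 come from PC + 4, a jump can reach any word address within the same 256 MB-aligned region.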
• 178. Why is the single-cycle implementation not used today? • Although the single-cycle design will work correctly, it would not be used in modern designs because it is inefficient. The clock cycle must have the same length for every instruction, so the CPI is 1. • The clock cycle is determined by the longest possible path in the machine. • This path is almost certainly the load instruction, which uses five functional units in series: the instruction memory, the register file, the ALU, the data memory, and the register file again. • Although the CPI is 1, the overall performance of a single-cycle implementation is not likely to be very good, since several of the instruction classes could fit in a shorter clock cycle.
• 179. • Unfortunately, implementing a variable-speed clock for each instruction class is extremely difficult. An alternative is to use a shorter clock cycle that does less work, and then vary the number of clock cycles for the different instruction classes. • The single-cycle implementation violates the design principle of making the common case fast. • In the single-cycle implementation each functional unit can be used only once per clock; therefore some functional units must be duplicated, raising the cost of the implementation. • Hence it is inefficient both in performance and in hardware cost. • These difficulties can be avoided by using a shorter clock cycle derived from the basic functional unit delays, and requiring multiple clock cycles for each instruction.
• 180. Multicycle implementation • Also called the multiple-clock-cycle implementation: an implementation in which an instruction is executed in multiple clock cycles. • The multicycle implementation allows a functional unit to be used more than once per instruction, as long as it is used on different clock cycles. • This sharing can help reduce the amount of hardware required. • The ability to allow instructions to take different numbers of clock cycles, and the ability to share functional units within the execution of a single instruction, are the major advantages of a multicycle design.
• 181. Differences from the single-cycle version • A single memory unit is used for both instructions and data. • There is a single ALU, rather than an ALU and two adders. • One or more registers are added after every major functional unit to hold the output of that unit until the value is used in a subsequent clock cycle. • At the end of the clock cycle, all data that is used in subsequent clock cycles must be stored in a state element. • Data used by subsequent instructions in a later clock cycle is stored into one of the programmer-visible state elements: the register file, the PC, or the memory. • Data used by the same instruction in a later cycle must be stored into one of these additional registers.
• 182. • The position of the additional registers is determined by two factors: what combinational unit will fit in one clock cycle, and what data are needed in later cycles of implementing the instruction. • In the multicycle design it is assumed that the clock cycle can accommodate at most one of the following operations: a memory access, a register file access (two reads or one write), or an ALU operation. • Any data produced by one of these functional units must be saved into a temporary register for use in a later cycle. • If the data were not saved, a timing race could occur, leading to the use of an incorrect value. • All the registers except the IR hold data only between a pair of adjacent clock cycles and thus do not need a write control signal. • The IR needs to hold the instruction until the end of the execution of that instruction, and thus requires a write control signal.
• 183. Pipelining. Up to now we have designed a processor that executes all the SimpleRisc instructions, in two styles: a hardwired control unit, and a microprogrammed control unit (microprogrammed datapath, microassembly language, microinstructions).
• 184. Designing efficient processors  Microprogrammed processors are much slower than hardwired processors.  Even hardwired processors have a lot of waste!  We have 5 stages.  What is the IF stage doing when the MA stage is active?  ANSWER: it is idling.
• 185. Resource utilization • Single-cycle design: each resource is tied up for the entire duration of the instruction's execution. • Multicycle design: a resource utilized in cycle t of instruction i is available for cycle t + 1 of instruction i. • Pipelined design: a resource utilized in cycle t of instruction i is available again for cycle t of instruction i + 1.
• 186. Problems with the single-cycle design • The slowest instruction pulls down the clock frequency. • Resource utilization is poor. • There are some instructions that are impossible to implement in this manner. [Figure: CPI vs. cycle time — single-cycle design: low CPI, long cycle time; multicycle design: high CPI, short cycle time; pipelined design: low CPI and short cycle time.]
• 187. The Notion of Pipelining  Let us go back to the car assembly line.  Is the engine shop idle when the paint shop is painting a car?  NO: it is building the engine of another car.  When this engine goes to the body shop, it builds the engine of yet another car, and so on.  Insight:  Multiple cars are built at the same time.  A car proceeds from one stage to the next.
• 188. • Pipelining is an implementation technique in which multiple instructions are overlapped in execution. Today, pipelining is key to making processors fast. • Example: the laundry analogy for pipelining. [Figure: four loads A–D, each passing through washer (W), dryer (D), folder (F), and storer (S); each load starts as soon as the washer is free, so the stages of different loads overlap in time.]
• 189. The washer, dryer, folder, and storer each take 30 minutes for their task. Sequential laundry takes 8 hours for four loads of wash, while pipelined laundry takes just 3.5 hours. [Figure: the pipeline stages of the different loads over time, showing copies of the four resources on a two-dimensional timeline — but there is really just one of each resource.]
• 190. Observed so far • The pipeline paradox: the time from placing a single dirty sock in the washer until it is dried, folded, and put away is no shorter with pipelining. • The reason pipelining is faster for many loads is that everything works in parallel, so more loads are finished per hour. • Pipelining improves the throughput of the laundry system without improving the time to complete a single load. • Hence pipelining does not decrease the time to complete one load of laundry, but when we have many loads to do, the improvement in throughput decreases the total time to complete the work.
• 191. • If all the stages take about the same amount of time and there is enough work to do, then the speedup due to pipelining equals the number of stages in the pipeline (in this case, four). • 20 loads would take about 5 times as long as 1 load, while 20 loads of sequential laundry take 20 times as long as 1 load. • The figure shows only 8/3.5 = 2.3 times faster, because only four loads are shown. • At the beginning and end of the workload in the pipelined version, the pipeline is not completely full. • This start-up and wind-down affects performance when the number of tasks is not large compared with the number of stages in the pipeline. If the number of loads is much larger than 4, the stages will be full most of the time and the increase in throughput will be very close to 4.
• 192. • The same principles apply to processors, where we pipeline instruction execution. MIPS instructions classically take five steps: 1) Fetch the instruction from memory. 2) Read registers while decoding the instruction (the MIPS format allows reading and decoding to occur simultaneously). 3) Execute the operation or calculate an address. 4) Access an operand in data memory. 5) Write the result into a register.
• 193. Single-cycle vs. pipelined performance. Compare the average time between instructions of a single-cycle implementation, in which all instructions take one clock cycle, to a pipelined implementation. Example operation times for the major functional units (in the single-cycle model every instruction takes exactly one clock cycle, so the clock cycle must be stretched to accommodate the slowest instruction):

Inst class  Inst fetch  Reg read  ALU oper  Data access  Reg write  Total
load        200ps       100ps     200ps     200ps        100ps      800ps
store       200ps       100ps     200ps     200ps                   700ps
R-format    200ps       100ps     200ps                  100ps      600ps
branch      200ps       100ps     200ps                             500ps
• 194. [Figure: nonpipelined execution of lw $1,100($0); lw $2,200($0); lw $3,300($0) — each instruction takes 800 ps (IF, Reg, ALU, Data access, Reg write), and each begins only after the previous one finishes.]
• 195. [Figure: pipelined execution of the same three loads — a new instruction starts every 200 ps, with the five stages of successive instructions overlapped in time.]
• 196. • The average time between instructions is reduced from 800 ps to 200 ps. • Pipeline stage times are limited by the slowest resource, either the ALU operation or the memory access. • We assume that the write to the register file occurs in the first half of the clock cycle and the read from the register file occurs in the second half. • Speedup: time between instructions (pipelined) = time between instructions (nonpipelined) / number of pipe stages. • Ideally this would give an 800/5 = 160 ps clock cycle, but the stages may be imperfectly balanced and pipelining adds some overhead. Thus the time per instruction in the pipelined processor exceeds the minimum possible, and the speedup is less than the number of pipeline stages.
• 197. • Our claim of fourfold improvement is not reflected in the total execution time of three instructions (2400/1400 ≈ 1.7), because the number of instructions is not large. Let us increase the number of instructions: • Add 1,000,000 instructions to the pipeline; each instruction adds 200 ps to the total execution time. • Total execution time = 1,000,000 × 200 ps + 1400 ps = 200,001,400 ps. • Nonpipelined: 1,000,000 × 800 ps + 2400 ps = 800,002,400 ps. • Ratio = 800,002,400 / 200,001,400 ≈ 4 = 800/200. • Note: pipelining improves performance by increasing instruction throughput, as opposed to decreasing the execution time of an individual instruction.
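The throughput arithmetic above can be reproduced directly (a trivial Python check of the slide's numbers, added for illustration):

```python
# Add 1,000,000 instructions to the three-instruction program:
# pipelined, each extra instruction adds one 200 ps cycle on top of the
# 1400 ps it took to run the original three instructions;
# nonpipelined, each takes the full 800 ps, on top of the original 2400 ps.
extra = 1_000_000
pipelined = extra * 200 + 1400
nonpipelined = extra * 800 + 2400

print(pipelined)                     # 200,001,400 ps
print(nonpipelined)                  # 800,002,400 ps
print(nonpipelined / pipelined)      # approaches 800/200 = 4
```

As the instruction count grows, the fill/drain constants (1400 ps and 2400 ps) become negligible and the ratio converges to the stage count's ideal of 4.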
• 198. Pipeline hazards • There are situations in pipelining when the next instruction cannot execute in the following clock cycle. These events are called hazards, and there are three types: structural, data, and control hazards. • Structural hazard: an occurrence in which a planned instruction cannot execute in the proper clock cycle because the hardware cannot support the combination of instructions that are set to execute in the given clock cycle. • In the laundry example, a structural hazard would occur if we used a combination washer-dryer instead of a separate washer and dryer, or if our roommate were busy doing something else and wouldn't put the clothes away. The carefully scheduled pipeline plans would then be foiled. • If the pipeline in the previous figure had a fourth instruction, we would see that in the same cycle the first instruction is accessing data from memory while the fourth instruction is fetching an instruction from that same memory. • Without two memories, our pipeline could have a structural hazard.
• 199.  A structural hazard may occur when two instructions have a conflict over the same set of resources in a cycle.  Example: assume that we have an add instruction that can read one operand from memory: add r1, r2, 10[r3]. The code sequence
[1]: st r4, 20[r5]
[2]: sub r8, r9, r10
[3]: add r1, r2, 10[r3]
would have a structural hazard:  [3] tries to read 10[r3] (MA unit) in cycle 4, while [1] tries to write to 20[r5] (MA unit) in cycle 4.  This does not happen in our pipeline.
• 200. • Data hazard: an occurrence in which a planned instruction cannot execute in the proper clock cycle because the data needed to execute the instruction is not yet available. • It occurs when the pipeline must be stalled because one step must wait for another to complete. • In a computer pipeline, data hazards arise from the dependence of one instruction on an earlier one that is still in the pipeline (a relationship that doesn't really exist when doing laundry). Example: add $s0,$t0,$t1 followed by sub $t2,$s0,$t3. A data hazard can severely stall the pipeline: the add instruction does not write its result until the fifth stage, meaning we would have to insert three bubbles into the pipeline.
• 201. Data Hazard. [Figure: pipeline diagram for [1]: add r1, r2, r3; [2]: sub r3, r1, r4 — instruction 2 reads r1 in its OF stage before instruction 1 has written it in its RW stage.]  Instruction 2 will read incorrect values!
• 202. This situation represents a data hazard. Specifically, it is a RAW (read after write) hazard. The earliest we can dispatch instruction 2 is cycle 5.
• 203. • Although we would like to rely on compilers to remove all such hazards, the results would not be satisfactory. These dependences happen just too often, and the delay is just too long, to expect the compiler to rescue us from this dilemma. • Forwarding or bypassing: for the code sequence mentioned earlier, as soon as the ALU creates the sum for the add, we can supply it as an input for the subtract. Adding extra hardware to retrieve the missing item early from the internal resources is called forwarding or bypassing.
• 204. Graphical representation. [Figure: pipeline diagrams redrawn with stage abbreviations IF, ID, ALU, MEM, WB for add $s0,$t0,$t1 and for the three lw instructions from the earlier example.]
• 206. • If the correct value is already available in another stage, we can forward it. [Figure: pipeline diagram for [1]: add r1, r2, r3; [2]: sub r4, r1, r2.]
• 207. Forwarding from MA to EX  Forwarding in cycle 4 from instruction [1] to [2]. [Figure: pipeline diagram for [1]: add r1, r2, r3; [2]: sub r4, r1, r2 — the ALU result of [1] is forwarded from its MA stage to the EX stage of [2].]
• 208. Different Forwarding Paths  We need to add a multitude of forwarding paths.  Rules for creating forwarding paths:  Add a path from a later stage to an earlier stage.  Try to add a forwarding path as late as possible. For example, we avoid the EX → OF forwarding path, since we have the MA → EX forwarding path.  The IF stage is not part of any forwarding path.
  • 209. Forwarding Path  3 Stage Paths  RW → OF  2 Stage Paths  RW → EX  MA → OF (X Not Required)  1 Stage Paths  RW → MA (load to store)  MA → EX (ALU Instructions, load, store)  EX → OF (X Not Required)
• 210. Forwarding Paths : RW → MA. [Figure: pipeline diagram for [1]: ld r1, 4[r2]; [2]: sw r1, 10[r3] — the loaded value is forwarded from the RW stage of [1] to the MA stage of [2].]
• 211. Forwarding Paths : RW → EX. [Figure: pipeline diagram for [1]: ld r1, 4[r2]; [2]: sw r8, 10[r3]; [3]: add r2, r1, r4 — the loaded value is forwarded from the RW stage of [1] to the EX stage of [3].]
• 212. Forwarding Path : MA → EX. [Figure: pipeline diagram for [1]: add r1, r2, r3; [2]: sub r4, r1, r2 — the ALU result is forwarded from the MA stage of [1] to the EX stage of [2].]
• 213. Forwarding Path : RW → OF. [Figure: pipeline diagram for [1]: ld r1, 4[r2]; [2]: sw r4, 10[r3]; [3]: sw r5, 10[r6]; [4]: sub r7, r1, r2 — the loaded value is forwarded from the RW stage of [1] to the OF stage of [4].]
• 214. • Forwarding works very well but cannot prevent all pipeline stalls. • Suppose the first instruction were a load of $s0 instead of an add. From the figure it is clear that the desired data would be available only after the fourth stage of the first instruction in the dependence, which is too late for the input of the third stage of the sub. • Even with forwarding, we would have to stall one stage for a load-use data hazard (a specific form of data hazard in which the data requested by a load instruction has not yet become available when it is requested).
• 215.  We cannot forward (the arrow would go backwards in time).  We need to add a bubble (and then use RW → EX forwarding). [Figure: pipeline diagram for [1]: ld r1, 10[r2]; [2]: sub r4, r1, r2 — [2] is stalled one cycle so the loaded value can be forwarded.]
• 216. • We need to stall, even with forwarding, when an R-format instruction following a load tries to use its data. • A stall initiated in order to resolve a hazard is called a pipeline stall.
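The load-use condition just described can be sketched as a tiny hazard-detection check (an illustrative Python model, not from the slides; the dictionary encoding of instructions is made up for this example):

```python
def needs_stall(prev, curr):
    """Load-use hazard: stall when the previous instruction is a load
    whose destination register is a source of the current instruction.
    Forwarding cannot help here, because the loaded value is available
    only after the memory-access stage."""
    return prev["op"] == "lw" and prev["dest"] in curr["srcs"]

lw_s0 = {"op": "lw",  "dest": "$s0", "srcs": ["$t0"]}
sub_i = {"op": "sub", "dest": "$t2", "srcs": ["$s0", "$t3"]}
print(needs_stall(lw_s0, sub_i))   # one bubble required even with forwarding
```

A real hazard-detection unit performs the same comparison in hardware, between the rt field of the instruction in the EX stage and the source fields of the instruction in the decode stage.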
• 217. Data Hazards with Forwarding  Forwarding has unfortunately not eliminated all data hazards.  We are left with one special case:  the load-use hazard:  the instruction immediately after a load has a RAW dependence with it. Consider the C segment A = B + E and C = B + F, in MIPS:
lw $t1,0($t0)
lw $t2,4($t0)
add $t3,$t1,$t2
sw $t3,12($t0)
lw $t4,8($t0)
add $t5,$t1,$t4
sw $t5,16($t0)
Find the hazards in these instructions and reorder them to avoid any pipeline stalls.
  • 218.
• 219. Control Hazards • Also called branch hazards: an occurrence in which the proper instruction cannot execute in the proper clock cycle because the instruction that was fetched is not the one that is needed, i.e., the flow of instruction addresses is not what the pipeline expected.
  • 220.
  • 221. Performance of “stall on branch”
• 222. • Let us assume that we put in enough extra hardware that we can test the registers, calculate the branch address, and update the PC during the second stage of the pipeline. Even with this extra hardware, the pipeline involving conditional branches would look like the figure on the previous slide. • Estimate the impact on the clock cycles per instruction (CPI) of stalling on branches, assuming all other instructions have a CPI of 1. • Branches are 13% of the instructions executed in SPECint2000. Since the other instructions have a CPI of 1 and branches take one extra clock cycle for the stall, the CPI is 1.13: a slowdown of 1.13 versus the ideal case. Jumps also incur stalls.
  • 223.
• 224. • If we cannot resolve the branch in the second stage, as is often the case for longer pipelines, then we would see an even larger slowdown if we stall on branches. • The cost of this option is too high for most computers and motivates a second solution to the control hazard. • Predict: in the laundry analogy, if you are pretty sure you have the right formula to wash uniforms, then just predict that it will work and wash the second load while waiting for the first load to dry. • This option does not slow down the pipeline when the prediction is correct. When it is wrong, you need to redo the load that was washed while guessing the decision.
• 225. • Computers do indeed use prediction to handle branches. One simple approach is to always predict that branches will not be taken. • When the prediction is right, the pipeline proceeds at full speed. Only when branches are taken does the pipeline stall.
  • 226. Predicting that branches are not taken as a solution to control hazard
• 227. Branch prediction • Some branches are predicted as taken and some as untaken. In the laundry example, the dark or home uniforms might take one formula, while the light or road uniforms might take another. • As a computer example, at the bottom of loops are branches that jump back to the top of the loop. Since they are likely to be taken and they branch backwards, we could always predict taken for branches that jump to an earlier address. • Dynamic hardware predictors make their guesses depending on the behaviour of each branch, and may change predictions for a branch over the life of a program. • In our analogy, with dynamic prediction a person would look at how dirty the uniform was and guess at the formula, adjusting the next guess depending on the success of recent guesses.
• 228. • One popular approach to dynamic branch prediction is keeping a history for each branch as taken or untaken, and then using the recent past behaviour to predict the future. • Surveys say that dynamic branch predictors can correctly predict branches with over 90% accuracy. • When the guess is wrong, the pipeline control must ensure that the instructions following the wrongly guessed branch have no effect, and must restart the pipeline from the proper branch address. • In the laundry analogy, we must stop taking new loads so that we can restart the load whose formula we incorrectly predicted.
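One common history-based scheme is a saturating 2-bit counter per branch; the Python sketch below is an illustration of the idea (the class name and sample branch address are made up, and real predictors index a hardware table rather than a dictionary):

```python
class TwoBitPredictor:
    """Saturating 2-bit counter per branch address: predict taken when
    the counter is 2 or 3; move the counter one step toward the actual
    outcome after each branch resolves."""
    def __init__(self):
        self.counters = {}                      # branch address -> 0..3

    def predict(self, addr):
        return self.counters.get(addr, 0) >= 2  # 2 or 3 means "taken"

    def update(self, addr, taken):
        c = self.counters.get(addr, 0)
        self.counters[addr] = min(3, c + 1) if taken else max(0, c - 1)

# A mostly-taken loop branch: 8 taken, one not-taken loop exit, 8 taken.
p = TwoBitPredictor()
outcomes = [True] * 8 + [False] + [True] * 8
hits = 0
for taken in outcomes:
    hits += p.predict(0x40) == taken
    p.update(0x40, taken)
print(hits, "of", len(outcomes), "predicted correctly")   # 14 of 17
```

The 2-bit counter is why the single loop exit costs only one misprediction here: one not-taken outcome moves the counter from 3 to 2, which still predicts taken on the next iteration.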
• 229. • A third approach to the control hazard: the delayed decision. • The delayed branch always executes the next sequential instruction, with the branch taking place after that one-instruction delay. • MIPS software will place an instruction immediately after the delayed branch instruction that is not affected by the branch, and a taken branch changes the address of the instruction that follows this safe instruction.
• 230. • The add instruction before the branch in the figure does not affect the branch and can be moved after the branch to fully hide