This document discusses computer architecture performance, including metrics like execution time, throughput, and instructions per cycle (IPC). It provides examples of calculating the cycles per instruction (CPI) for different instruction types and evaluating potential design changes based on their impact on CPI and overall performance. The principles of locality and Amdahl's Law, which states that speedups from parallelism are limited by the serial fraction of a program, are also covered.
Lec3 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Performance
1. ECE 4100/6100
Advanced Computer Architecture
Lecture 3 Performance
Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology
2. Performance
• Execution/Response time (Latency)
– Elapsed time between the start and completion of an event
– How long does my job take?
• Throughput (Bandwidth)
– Total amount of work done within a given period of time
– How many jobs are done per unit time on a system?
3. CPU Performance
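• Execution Time = Seconds / Program
$$
\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}
$$
• Instructions / Program is influenced by:
– Programmer
– Algorithms
– ISA
– Compilers
• Cycles / Instruction is influenced by:
– Microarchitecture
– System architecture
• Seconds / Cycle is influenced by:
– Microarchitecture, pipeline depth
– Circuit design
– Technology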
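The three-factor breakdown of execution time on this slide can be sketched in a few lines of Python. The function name and the workload numbers are illustrative assumptions, not values from the lecture.

```python
def execution_time(instructions: float, cpi: float, clock_hz: float) -> float:
    """Seconds per program = instructions * (cycles / instruction) * (seconds / cycle)."""
    seconds_per_cycle = 1.0 / clock_hz
    return instructions * cpi * seconds_per_cycle

# Hypothetical workload: 2 billion instructions, CPI of 1.5, 1 GHz clock.
baseline = execution_time(2e9, 1.5, 1e9)
# Halving CPI at the same instruction count and clock halves the run time.
improved = execution_time(2e9, 0.75, 1e9)
print(baseline, improved)
```

Because the three factors multiply, an improvement in one only helps if it does not worsen another, for example a deeper pipeline that raises the clock rate but also raises CPI.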
5. Architecture Comparison
• Much architecture research makes the following
assumptions
• Instructions / program is fixed
– Same binary
– Same compiler
– Same benchmark
• Seconds per cycle is constant
– Same frequency
– Same pipeline depth
– Typically a bad assumption today
• Focus on IPC or CPI
• It is more complicated for today’s architects!
6. Example: Calculating CPI
Base Machine (Reg / Reg): typical mix of instruction types in a program
Op       Freq   Cycles   CPI(i)   (% Time)
ALU      50%    1        0.5      (33%)
Load     20%    2        0.4      (27%)
Store    10%    2        0.2      (13%)
Branch   20%    2        0.4      (27%)
Total                    1.5
Design guideline: Make the common case fast
MIPS 1% rule: only consider adding an instruction if it is shown to add 1%
performance improvement on reasonable benchmarks.
Run benchmark and collect workload characterization
(simulate, machine counters, or sampling)
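The CPI figures in the base-machine table can be reproduced with a short script. This is a minimal sketch: the frequencies and cycle counts come from the slide, while the variable names are arbitrary.

```python
# (frequency, cycles) per instruction class, taken from the table in slide 6.
mix = {
    "ALU":    (0.50, 1),
    "Load":   (0.20, 2),
    "Store":  (0.10, 2),
    "Branch": (0.20, 2),
}

# Each class contributes frequency * cycles to the overall CPI.
contrib = {op: freq * cycles for op, (freq, cycles) in mix.items()}
cpi = sum(contrib.values())

# Share of execution time spent in each class.
time_share = {op: c / cpi for op, c in contrib.items()}

for op in mix:
    print(f"{op:<7} CPI(i)={contrib[op]:.2f}  time={time_share[op]:.0%}")
print(f"Overall CPI = {cpi:.2f}")
```

The time-share column is what identifies the common case: ALU ops are half of all instructions but only about a third of the execution time.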
7. Performance Comparison
• For some program running on machine X,
Performance_X = 1 / Execution time_X
• "X is n times faster than Y"
Performance_X / Performance_Y = n
= speedup of X over Y
• Problem:
– machine A runs a program in 20 seconds
– machine B runs the same program in 25 seconds
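The slide leaves the answer to this problem out. Applying the speedup definition from the same slide to the two stated run times gives the following worked sketch:

```latex
\frac{\text{Performance}_A}{\text{Performance}_B}
  = \frac{\text{Execution time}_B}{\text{Execution time}_A}
  = \frac{25\ \text{s}}{20\ \text{s}}
  = 1.25
```

Under that definition, machine A is 1.25 times faster than machine B.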
8. Performance Evaluation: Benchmark
• (Real) Programs
– In the form of a collection of programs
– E.g., SPEC, Winstone, SYSMARK, 3D Winbench, EEMBC
• Kernels:
– Small key pieces of real programs
– E.g., Livermore Fortran Loops Kernels (LFK), Linpack
• Modified (or scripted)
– To focus on some particular aspects (e.g. remove I/O, focus on CPU)
• (Toy) Benchmarks
– Produce expected results
• Synthetic Benchmarks:
– Representative instruction mix
– E.g., Dhrystone, Whetstone
• Important for
– Architectural and microarchitectural design trade-offs
– Competitive analysis of real products
9. Performance Summary Measurement
• Average of total execution time
• This is the Arithmetic Mean (or Weighted Arithmetic Mean)
$$
\frac{1}{n}\sum_{i=1}^{n} \text{Time}_i
\quad \text{or} \quad
\frac{1}{n}\sum_{i=1}^{n} \text{Weight}_i \ast \text{Time}_i
$$
10. Performance Summary Measurement
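• Harmonic Mean (or Weighted Harmonic Mean)
• Rate_i is a function of 1/Time_i
• Used to represent the average “rate” such as
instructions per cycle (IPC)
$$
\frac{n}{\sum_{i=1}^{n} \frac{1}{\text{Rate}_i}}
\quad \text{or} \quad
\frac{n}{\sum_{i=1}^{n} \frac{\text{Weight}_i}{\text{Rate}_i}}
$$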
11. Why Harmonic Mean?
• 30 mph for the first 10 miles
• 90 mph for the next 10 miles
• Average speed? (30+90)/2 = 60 mph??
• Wrong!
• Average speed = total distance / total time
• (10+10)/(10/30 + 10/90) = 45 mph
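The driving example can be checked numerically. The sketch below uses the speeds and distances from the slide; the helper function names are illustrative only.

```python
def arithmetic_mean(values):
    return sum(values) / len(values)

def harmonic_mean(rates):
    # n divided by the sum of reciprocals; appropriate when each rate
    # applies to the same amount of work (here, the same distance).
    return len(rates) / sum(1.0 / r for r in rates)

speeds_mph = [30, 90]   # from slide 11
leg_miles = [10, 10]    # equal-length legs

total_time_h = sum(d / s for d, s in zip(leg_miles, speeds_mph))
true_average = sum(leg_miles) / total_time_h

print(arithmetic_mean(speeds_mph))   # the naive answer
print(harmonic_mean(speeds_mph))     # agrees with total distance / total time
print(true_average)
```

The harmonic mean agrees with total distance over total time only because both legs cover the same distance; with unequal legs the weighted form would be needed. Python's standard library also provides `statistics.harmonic_mean` for the unweighted case.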
12. New Breed of Metrics
• Performance / Watt
– Performance achievable at the same cooling capacity
• Performance / Joule (Energy)
– Performance achievable over the lifetime of the same energy source (i.e., battery = energy)
– Equivalent to the reciprocal of the energy-delay product (ED product)
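The link to the energy-delay product follows directly if performance is taken as 1/T, where T is the execution time (the definition used in slide 7), and E is the energy consumed over that run. The symbols E, T, and Power are shorthand introduced here, not notation from the slide.

```latex
\frac{\text{Performance}}{\text{Joule}}
  = \frac{1/T}{E}
  = \frac{1}{E \cdot T}
\qquad\qquad
\frac{\text{Performance}}{\text{Watt}}
  = \frac{1/T}{\text{Power}}
  = \frac{1/T}{E/T}
  = \frac{1}{E}
```

Under the same assumption, performance per watt reduces to the reciprocal of the energy used for the task.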
13. Amdahl’s Law (Law of Diminishing Returns)
• Make the common case faster
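• Speedup
= Perf_new / Perf_old = T_old / T_new
$$
\text{Speedup} = \frac{T_{old}}{T_{new}} = \frac{1}{(1 - f) + \frac{f}{P}}
$$
• Performance improvement from using the faster mode
is limited by the fraction of time the faster mode can be
applied.
[Figure: T_old is split into (1 - f) and f; T_new is split into (1 - f) and f / P.]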
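Amdahl's Law as stated on this slide translates into a one-line function. In the sketch below, `f` is the fraction of the original time spent in the faster mode and `p` is how much faster that mode is; the example inputs are hypothetical.

```python
def amdahl_speedup(f: float, p: float) -> float:
    """Speedup = 1 / ((1 - f) + f / P), with f the fraction run in the faster mode."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if p <= 0:
        raise ValueError("P must be positive")
    return 1.0 / ((1.0 - f) + f / p)

# Hypothetical case: 80% of the run time benefits from a 4x faster mode.
print(amdahl_speedup(0.8, 4))
# Even an enormous P cannot beat 1 / (1 - f), which is 5x here.
print(amdahl_speedup(0.8, 1e12))
```

The second call illustrates the "diminishing returns" in the slide title: the untouched (1 - f) portion sets a ceiling regardless of P.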
14. Amdahl’s Law Analogy
• Driving from Orlando to Atlanta
– 60 miles/hr from Orlando to Macon
– 120 miles/hr from Macon to Atlanta
– How much time can you save compared with driving all the way at 60 miles/hr from Orlando to Atlanta?
• 6hr 45min vs. 7hr 30min = ~11% speedup
• The key is to speed up the biggest portion, i.e.,
speed up frequently executed blocks
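The ~11% figure matches Amdahl's formula. The slide does not give distances; the values below are inferred from the two stated trip times (7.5 hr at 60 miles/hr implies 450 miles, and a 45-minute saving from doubling the speed implies a 90-mile final leg), so treat them as assumptions.

```latex
f = \frac{90 / 60\ \text{hr}}{7.5\ \text{hr}} = 0.2, \qquad P = \frac{120}{60} = 2
\\[6pt]
\text{Speedup} = \frac{1}{(1 - f) + \frac{f}{P}}
  = \frac{1}{0.8 + 0.1}
  = \frac{7.5\ \text{hr}}{6.75\ \text{hr}}
  \approx 1.11
```

Only 20% of the trip time is affected, so doubling the speed on that leg yields about 11% overall.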
15. Parallelism vs. Speedup
[Figure: Amdahl's Law speedup as a function of parallelism. x-axis: code portion in faster mode (f), from 0 to 1; y-axis: speedup, log scale from 1 to 100; one curve each for P = 1, 2, 4, 8, 16, 32, 64; annotated points at 1.11x, 1.33x, and 1.97x.]
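The curves in this figure can be regenerated from Amdahl's formula. The sketch below tabulates speedup for the P values in the legend; the sampling of f in steps of 0.1 is an assumption based on the axis ticks.

```python
def amdahl_speedup(f, p):
    return 1.0 / ((1.0 - f) + f / p)

processors = [1, 2, 4, 8, 16, 32, 64]       # the P values in the legend
fractions = [i / 10 for i in range(10)]     # f = 0.0 .. 0.9 (f = 1 gives speedup P)

# One row per f, one column per P.
table = {f: [amdahl_speedup(f, p) for p in processors] for f in fractions}

print("f    " + "".join(f"P={p:<6}" for p in processors))
for f, row in table.items():
    print(f"{f:<5.1f}" + "".join(f"{s:<8.2f}" for s in row))
```

At f = 0.5 the table gives about 1.33x for P = 2 and about 1.97x for P = 64, which is consistent with two of the annotations, although the slide text does not say which points they mark.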
16. Gustafson’s Law
• Amdahl’s Law killed massively parallel processing (MPP)
• Gustafson came to the rescue
[Figure: T_new is split into Seq and Parallel; T_old is split into Seq and P * Parallel Time.]
Assume: Seq + Parallel = 1 (T_new)
∴
Speedup = Seq + p * (1 – Seq) where p = parallel factor
If Seq diminishes with increased problem size, Speedup → p
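Gustafson's scaled speedup can be sketched the same way. The formula is the one on the slide; the parallel factor of 64 and the sample Seq values are hypothetical.

```python
def gustafson_speedup(seq: float, p: float) -> float:
    """Scaled speedup = Seq + p * (1 - Seq), with Seq + Parallel = 1 on the parallel machine."""
    if not 0.0 <= seq <= 1.0:
        raise ValueError("seq must be between 0 and 1")
    return seq + p * (1.0 - seq)

p = 64  # hypothetical parallel factor
for seq in (0.5, 0.1, 0.01, 0.0):
    # As Seq shrinks (for example with a larger problem), speedup approaches p.
    print(seq, gustafson_speedup(seq, p))
```

The difference from Amdahl's Law is the reference point: here the fractions are measured on the parallel run, so the parallel work grows with p instead of staying fixed.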
18. The Principle of Locality
• Knuth made the original observation about program locality
in 1971.
– … less than 4 percent of a program generally accounts for
more than half of its running time.
• 90/10 rule: a program spends 90% of its execution time in
only 10% of the code
• Two types of locality
– Temporal locality (locality in time)
– Spatial locality (locality in space)
• Memory subsystem design heavily leverages the locality
concept for better performance
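To illustrate why the memory subsystem benefits from locality, here is a toy direct-mapped cache model. It is a sketch only: the line size, cache size, and access patterns are all assumptions chosen to keep the numbers easy to check, and no real cache is this simple.

```python
def hit_rate(addresses, line_words=8, num_lines=16):
    """Hit rate of a toy direct-mapped cache; all sizes here are illustrative assumptions."""
    tags = [None] * num_lines          # one stored block number per cache line
    hits = 0
    for addr in addresses:
        block = addr // line_words     # which memory block the word lives in
        index = block % num_lines      # which cache line that block maps to
        if tags[index] == block:
            hits += 1
        else:
            tags[index] = block        # miss: fetch the whole block
    return hits / len(addresses)

n = 1024
sequential = list(range(n))             # spatial locality: neighbors share a block
strided = [i * 8 for i in range(n)]     # one word per block: no spatial reuse
hot_loop = [i % 4 for i in range(n)]    # temporal locality: the same 4 words reused

print(hit_rate(sequential), hit_rate(strided), hit_rate(hot_loop))
```

The sequential walk misses once per 8-word block, the strided walk never reuses a fetched block, and the hot loop misses only on its first access.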
19. Example of Performance Evaluation (I)
Operation            Frequency   Clock cycle count
ALU Ops (reg-reg)    43%         1
Loads                21%         2
Stores               12%         2
Branches             24%         2
Assume 25% of the ALU ops directly use a loaded operand that is not used again.
We propose adding ALU instructions that have one source operand in memory.
These new reg-mem instructions take 2 clock cycles. Also assume that the
extended instruction set increases the branch cycle count by 1 but has no impact on cycle time.
Would this change improve performance?
20. Example of Performance Evaluation (I)
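$$
\text{oldCycles} = 0.43 \ast 1 + 0.21 \ast 2 + 0.12 \ast 2 + 0.24 \ast 2 = 1.57
$$
$$
\text{newCycles} = 0.25 \ast 0.43 \ast 2 + (0.43 - 0.25 \ast 0.43) \ast 1 + (0.21 - 0.25 \ast 0.43) \ast 2 + 0.12 \ast 2 + 0.24 \ast 3 = 1.703
$$
Both totals are cycles per instruction of the original program. Since cycle time is unchanged and newCycles > oldCycles, the change would not improve performance.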
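The cycle counts for this example can be recomputed step by step. The sketch below uses only the frequencies and cycle counts given in the problem; the variable names are illustrative.

```python
# Instruction mix from the slide (fractions of the original instruction count).
alu, load, store, branch = 0.43, 0.21, 0.12, 0.24

old_cycles = alu * 1 + load * 2 + store * 2 + branch * 2

# 25% of ALU ops fuse with their load into one 2-cycle reg-mem instruction.
fused = 0.25 * alu
new_cycles = (
    fused * 2                # new reg-mem ALU instructions
    + (alu - fused) * 1      # remaining reg-reg ALU ops
    + (load - fused) * 2     # loads that were not absorbed
    + store * 2
    + branch * 3             # branches now cost one extra cycle
)

# Both totals are per original instruction, so they compare directly
# when cycle time is unchanged.
new_instruction_count = 1.0 - fused   # relative to the original program
new_cpi = new_cycles / new_instruction_count

print(f"old = {old_cycles:.3f}, new = {new_cycles:.4f}, speedup = {old_cycles / new_cycles:.3f}")
```

The instruction count drops because each fused pair replaces two instructions, but the extra branch cycle costs more than the fusion saves.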
21. Example of Performance Evaluation (II)
FP instructions = 25%
Average CPI of FP instructions = 4.0
Average CPI of other instructions = 1.33
FPSQRT = 2% of all instructions, CPI of FPSQRT = 20
• Design Option 1: decrease the CPI of FPSQRT to 2
• Design Option 2: decrease the average CPI of all FP instructions to 2.5
22. Example of Performance Evaluation (II)
Original CPI = 0.25*4 + 1.33*(1-0.25) = 2.0
Option 1 CPI = 2.0 – 2%*(20-2) = 1.64
Option 2 CPI = 0.25*2.5 + 1.33*(1-0.25) = 1.625
Speedup of Option 1 = 2/1.64 = 1.2195
Speedup of Option 2 = 2/1.625 = 1.2308
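The two options can be compared with a short script. It follows the same arithmetic as the slide but does not round the original CPI to 2.0 first, so the printed values differ slightly in the third decimal place.

```python
fp_frac, fp_cpi = 0.25, 4.0
other_cpi = 1.33
sqrt_frac, sqrt_cpi = 0.02, 20.0

original = fp_frac * fp_cpi + (1 - fp_frac) * other_cpi

# Option 1: only FPSQRT gets faster (CPI 20 -> 2).
option1 = original - sqrt_frac * (sqrt_cpi - 2.0)
# Option 2: every FP instruction gets faster on average (CPI 4.0 -> 2.5).
option2 = fp_frac * 2.5 + (1 - fp_frac) * other_cpi

for name, cpi in (("Option 1", option1), ("Option 2", option2)):
    print(f"{name}: CPI = {cpi:.3f}, speedup = {original / cpi:.3f}")
```

With or without the rounding, Option 2 comes out slightly ahead when the clock frequency is the same for both designs.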
23. Example of Performance Evaluation (III)
Clock freq = 1.4 GHz
FP instructions = 25%
Average CPI of FP instructions = 4.0
Average CPI of other instructions = 1.33
FPSQRT = 2%, CPI of FPSQRT = 20
• Design Option 1: decrease the CPI of FPSQRT to 2, clock freq = 1.2 GHz
• Design Option 2: decrease the average CPI of all FP instructions to 2.5,
clock freq = 1.1 GHz
24. Example of Performance Evaluation (III)
Original CPI = 2.0, IPC = 1/2, Inst/Sec = ½*1.4G = 0.7G inst/s
Option 1 CPI = 1.64, IPC = 1/1.64, Inst/Sec = 1/1.64*1.2G = 0.73G inst/s
Option 2 CPI = 1.625, IPC = 1/1.625, Inst/Sec = 1/1.625*1.1G = 0.68G inst/s
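Once clock frequency differs between designs, CPI alone no longer decides the comparison. The sketch below recomputes instructions per second from the CPI and clock values on this slide; the dictionary layout is just one way to organize it.

```python
designs = {
    # name: (CPI, clock in GHz), values from slide 24
    "Original": (2.0, 1.4),
    "Option 1": (1.64, 1.2),
    "Option 2": (1.625, 1.1),
}

# Instructions per second = IPC * clock rate = clock rate / CPI.
throughput = {name: ghz / cpi for name, (cpi, ghz) in designs.items()}

for name, ginst in throughput.items():
    rel = ginst / throughput["Original"]
    print(f"{name}: {ginst:.2f} G inst/s, relative to original = {rel:.3f}")

best = max(throughput, key=throughput.get)
print("Highest throughput:", best)
```

Option 2 has the best CPI but falls below the original design once its lower clock is taken into account, while Option 1 still gains a little.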