This document discusses processor design, covering custom single-purpose processors and general-purpose processors. Topics include combinational and sequential logic design, finite-state machine design, and the optimization of custom processors at four levels: the original program, the finite-state machine with datapath (FSMD), the datapath, and the finite-state machine (FSM). General-purpose processors are then introduced, including their basic architecture of a control unit and datapath.
2. Contents
• Custom Single-Purpose Processor
  • RT-Level Combinational Components
  • RT-Level Sequential Components
  • Custom Single-Purpose Processor Design
  • Optimizing Custom Single-Purpose Processors
    • Optimizing the original program, FSMD, datapath, FSM
• General-Purpose Processors
  • Basic Architecture
    • Datapath
    • Control unit
    • Memory
  • Pipelining
3. Contents (cont.)
• Superscalar and VLIW Architectures
• Application-Specific Instruction-Set Processors (ASIPs)
  • Microcontrollers
  • DSP
  • Less general ASIP environments
• Selecting a Microprocessor/General-Purpose Processor
4. Introduction
• Processor: digital circuit that performs computation tasks
  • Datapath
  • Controller
• General-purpose processor
  • Handles a wide variety of computation tasks
• Single-purpose processor
  • Carries out one particular computation task
  • Used for common tasks
• Custom single-purpose processor
  • Built for a non-standard task
5. Introduction (cont.)
• Why a custom single-purpose processor?
  • Faster performance
    • Fewer clock cycles from a customized datapath
    • Shorter clock cycles from simple functional units
  • Smaller size
    • Simpler datapath
    • No program memory
  • Less power consumption
    • More efficient computation
• Drawbacks
  • High NRE costs
  • Longer time to market
  • Reduced flexibility
6. Combinational Logic
• Transistor: basic electrical component in digital systems
  • Transistors → Logic Gates → Digital Systems
• MOS transistor on silicon
  • Acts as an on/off switch
  • Voltage at the "gate" controls whether current flows from source to drain
[Figure: cross-section of a MOS transistor inside an IC package, showing the source, drain, gate, oxide and channel on a silicon substrate. The transistor conducts if gate=1.]
7. CMOS Transistor Implementations
• CMOS: Complementary Metal Oxide Semiconductor
• We refer to logic levels
  • Typically 0 is 0V, 1 is 5V
• nMOS conducts if gate=1
• pMOS conducts if gate=0
• Basic gates
  • Inverter: F = x'
  • NAND gate: F = (xy)'
  • NOR gate: F = (x+y)'
[Figure: nMOS and pMOS transistor symbols, and CMOS transistor-level implementations of the inverter, NAND gate and NOR gate between the 1 and 0 supply rails.]
8. Basic Logic Gates
Single-input gates
  Driver:   F = x
  Inverter: F = x'

  x | Driver | Inverter
  0 |   0    |    1
  1 |   1    |    0

Two-input gates
  AND:  F = x y
  OR:   F = x + y
  XOR:  F = x ⊕ y
  NAND: F = (x y)'
  NOR:  F = (x + y)'
  XNOR: F = (x ⊕ y)'

  x y | AND | OR | XOR | NAND | NOR | XNOR
  0 0 |  0  | 0  |  0  |  1   |  1  |  1
  0 1 |  0  | 1  |  1  |  1   |  0  |  0
  1 0 |  0  | 1  |  1  |  1   |  0  |  0
  1 1 |  1  | 1  |  0  |  0   |  0  |  1
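The gate equations and truth tables on this slide can be checked with a few one-line C functions. This is only a sketch: the function names are hypothetical, and each argument is assumed to be a single bit (0 or 1).

```c
/* Each function models one gate on single-bit inputs (0 or 1).
   The names are illustrative and do not come from the slides. */
int gate_driver(int x)      { return x; }          /* F = x      */
int gate_inverter(int x)    { return !x; }         /* F = x'     */
int gate_and(int x, int y)  { return x & y; }      /* F = x y    */
int gate_or(int x, int y)   { return x | y; }      /* F = x + y  */
int gate_xor(int x, int y)  { return x ^ y; }      /* F = x XOR y */
int gate_nand(int x, int y) { return !(x & y); }   /* F = (x y)' */
int gate_nor(int x, int y)  { return !(x | y); }   /* F = (x+y)' */
int gate_xnor(int x, int y) { return !(x ^ y); }   /* F = (x XOR y)' */
```

Calling each function with the four input pairs 00, 01, 10 and 11 reproduces the corresponding truth-table column.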
9. Combinational Logic Design
• Combinational circuit
  • Digital circuit whose output is a function of the current inputs only
  • No memory of past inputs
• Steps in designing a combinational logic circuit
  1. Problem Definition
  2. Truth Table
  3. Output Equations
  4. Minimized Expressions
  5. Logic Circuit
10. Combinational Logic Design
1. Problem Description
y is 1 if a is equal to 1, or if b and c are both 1.
z is 1 if b or c is equal to 1, but not both, or if a, b and c are all 1.
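The problem description can be turned directly into Boolean expressions. The sketch below is one possible reading, with hypothetical function names: y = a + bc and z = (b XOR c) + abc. A minimized form may look different but must produce the same truth table.

```c
/* Inputs a, b, c are assumed to be single bits (0 or 1). */

/* y is 1 if a is 1, or if b and c are both 1. */
int output_y(int a, int b, int c) { return a | (b & c); }

/* z is 1 if exactly one of b and c is 1, or if a, b and c are all 1. */
int output_z(int a, int b, int c) { return (b ^ c) | (a & b & c); }
```

Enumerating all eight input combinations of these two functions gives the truth table called for in step 2 of the design procedure.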
14. Combinational Logic Design (cont.)
• Large circuits are too complex to design directly from logic gates
  • E.g., 16 inputs
  • 2^16 = 64K rows in the truth table
• Reduce complexity by using components that are more abstract than logic gates
16. Sequential Logic Design
• Sequential circuit
  • Output is a function of the current as well as previous input values
  • Has memory
• Basic sequential circuit: the flip-flop
  • Stores a single bit
26. Custom Single-Purpose Processor Basic Model
[Figure: controller and datapath. The controller has external control inputs and outputs, and the datapath has external data inputs and outputs. The controller drives the datapath control inputs and reads the datapath control outputs.]
[Figure: a view inside the controller and datapath. The controller contains a state register and the next-state and control logic. The datapath contains registers and functional units.]
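27. State Diagram Templates
• Assignment statement
    a = b
    next statement
  State diagram: one state performing a = b, followed by the state of the next statement.
• Loop statement
    while (cond) {
      loop-body-statements
    }
    next statement
  State diagram: condition state C goes to the loop-body statements on cond and to the next statement on !cond. Join state J follows the loop body and returns to C.
• Branch statement
    if (c1)
      c1 stmts
    else if (c2)
      c2 stmts
    else
      other stmts
    next statement
  State diagram: condition state C goes to the c1 stmts on c1, to the c2 stmts on !c1*c2, and to the other stmts on !c1*!c2. All three branches meet at join state J, which leads to the next statement.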
28. Example: Greatest Common Divisor
• First create the algorithm
• Convert the algorithm to a "complex" state machine
  • Known as an FSMD: finite-state machine with datapath
  • Templates can be used to perform the conversion
(a) Black-Box View
[Figure: GCD block with inputs go_i, x_i and y_i and output d_o.]
29. (b) Desired Functionality
0: int x, y;
1: while (1) {
2:   while (!go_i);
3:   x = x_i;
4:   y = y_i;
5:   while (x != y) {
6:     if (x < y)
7:       y = y - x;
       else
8:       x = x - y;
     }
9:   d_o = x;
   }
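For experimenting on a PC, the inner loop of the program (lines 5 to 8) can be lifted into an ordinary C function. This is a sketch: the name gcd_subtract is hypothetical, the go_i handshake and the outer while (1) loop are left out, and both inputs are assumed to be greater than zero, since the loop does not terminate if exactly one input is zero.

```c
/* Subtraction-based GCD, mirroring lines 5 to 8 of the slide's program.
   Assumption: x > 0 and y > 0. */
unsigned gcd_subtract(unsigned x, unsigned y)
{
    while (x != y) {
        if (x < y)
            y = y - x;   /* line 7 */
        else
            x = x - y;   /* line 8 */
    }
    return x;            /* the value written to d_o on line 9 */
}
```

For example, gcd_subtract(42, 8) returns 2.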
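30. (c) State Diagram
[Figure: FSMD state diagram for the GCD program, built by applying the templates to each statement. Its states and transitions are:]
• 1: on 1 go to 2; on !1 exit the loop (never taken)
• 2: on !go_i go to 2-J; on !(!go_i) go to 3
• 2-J: return to 2
• 3: x = x_i, then go to 4
• 4: y = y_i, then go to 5
• 5: on x!=y go to 6; on !(x!=y) go to 9
• 6: on x<y go to 7; on !(x<y) go to 8
• 7: y = y - x, then go to 6-J
• 8: x = x - y, then go to 6-J
• 6-J: go to 5-J
• 5-J: return to 5
• 9: d_o = x, then go to 1-J
• 1-J: return to 1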
32. Creating the Datapath
• Create a register for each declared variable
• Create a functional unit for each arithmetic operation
• Connect the ports, registers and functional units
  • Based on reads and writes
  • Use multiplexors where a destination has multiple sources
• Create a unique identifier for each control input and output of the datapath components
33. Creating the Controller
• State 3 → x_sel=0; x_ld=1
  • For loading x
• State 4 → y_sel=0; y_ld=1
  • For loading y
• State 7 → y_sel=1; y_ld=1
  • For loading the subtracted result y-x
• State 8 → x_sel=1; x_ld=1
  • For loading the subtracted result x-y
• State 9 → d_ld=1
  • For loading the output register
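The split between controller and datapath can be sketched in C as a small simulation. The control signal names come from the slide; everything else is an assumption made for illustration: the type and function names are hypothetical, a select value of 0 is taken to route the external input (x_i or y_i) into the register and 1 to route the subtractor output, go_i is treated as already asserted, and the join states are folded into the transitions.

```c
/* Datapath registers and the control word driven by the controller. */
typedef struct { unsigned x, y, d; } Datapath;
typedef struct { int x_sel, x_ld, y_sel, y_ld, d_ld; } Control;

/* One clock edge: every register with its load signal set is updated
   from values computed before the edge. */
void datapath_clock(Datapath *dp, Control c, unsigned x_i, unsigned y_i)
{
    unsigned x_next = c.x_sel ? dp->x - dp->y : x_i;  /* mux in front of x */
    unsigned y_next = c.y_sel ? dp->y - dp->x : y_i;  /* mux in front of y */
    if (c.d_ld) dp->d = dp->x;
    if (c.x_ld) dp->x = x_next;
    if (c.y_ld) dp->y = y_next;
}

/* Control logic: control word for each state, as listed on the slide. */
Control control_for_state(int state)
{
    Control c = {0, 0, 0, 0, 0};
    switch (state) {
    case 3: c.x_sel = 0; c.x_ld = 1; break;
    case 4: c.y_sel = 0; c.y_ld = 1; break;
    case 7: c.y_sel = 1; c.y_ld = 1; break;
    case 8: c.x_sel = 1; c.x_ld = 1; break;
    case 9: c.d_ld = 1; break;
    default: break;               /* states 5 and 6 only test conditions */
    }
    return c;
}

/* Next-state logic for states 3 to 9. Assumption: x_i > 0 and y_i > 0. */
unsigned gcd_run(unsigned x_i, unsigned y_i)
{
    Datapath dp = {0, 0, 0};
    int state = 3;
    for (;;) {
        datapath_clock(&dp, control_for_state(state), x_i, y_i);
        switch (state) {
        case 3: state = 4; break;
        case 4: state = 5; break;
        case 5: state = (dp.x != dp.y) ? 6 : 9; break;
        case 6: state = (dp.x < dp.y) ? 7 : 8; break;
        case 7:
        case 8: state = 5; break;  /* via join states 6-J and 5-J */
        case 9: return dp.d;       /* output register now holds the GCD */
        default: return 0;         /* unreachable */
        }
    }
}
```

With these assumptions, gcd_run(42, 8) returns 2, the same value the original program writes to d_o.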
36. Completing the GCD Custom Single-Purpose Processor Design
• We finished the datapath
• We have a state table for the next-state and control logic
  • It serves as the truth table for the combinational logic
• This is not an optimized design
[Figure: a view inside the controller and datapath. The controller contains a state register and the next-state and control logic. The datapath contains registers and functional units.]
37. Optimizing Single-Purpose Processors
• Optimization is the task of making design metric values the best possible
• GCD example: if the numbers are large, the algorithm takes more steps
  • Speed decreases
• Optimization opportunities
  • Original program
  • FSMD
  • Datapath
  • FSM
38. Optimizing the Original Program
• Analyze program attributes and look for areas of possible improvement
  • Number of computations
  • Size of variables
  • Time and space complexity
  • Operations used
    • Multiplication and division are very expensive
39. Optimizing the Original Program (cont.)
Original Program
0: int x, y;
1: while (1) {
2:   while (!go_i);
3:   x = x_i;
4:   y = y_i;
5:   while (x != y) {
6:     if (x < y)
7:       y = y - x;
       else
8:       x = x - y;
     }
9:   d_o = x;
   }
GCD(42, 8)
• The loop visits 9 (x, y) pairs, i.e. 8 subtractions
• x and y values evaluated as follows: (42, 8), (34, 8), (26, 8), (18, 8), (10, 8), (2, 8), (2, 6), (2, 4), (2, 2)

Optimized Program
Replace the subtraction operation(s) with the modulo operation in order to speed up the program.
0:  int x, y, r;
1:  while (1) {
2:    while (!go_i);
      // x must be the larger number
3:    if (x_i >= y_i) {
4:      x = x_i;
5:      y = y_i;
      }
6:    else {
7:      x = y_i;
8:      y = x_i;
      }
9:    while (y != 0) {
10:     r = x % y;
11:     x = y;
12:     y = r;
      }
13:   d_o = x;
    }
GCD(42, 8)
• The loop visits 3 (x, y) pairs, i.e. 2 modulo operations
• x and y values evaluated as follows: (42, 8), (8, 2), (2, 0)
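The difference between the two programs can be measured with a small C sketch. The function names are hypothetical. Each function returns how many times its loop body runs, which is one fewer than the number of (x, y) pairs in the corresponding trace. Inputs are assumed to be greater than zero.

```c
/* Loop-body executions of the subtraction-based program.
   Assumption: x > 0 and y > 0. */
unsigned subtract_gcd_iterations(unsigned x, unsigned y)
{
    unsigned count = 0;
    while (x != y) {
        if (x < y)
            y = y - x;
        else
            x = x - y;
        count++;
    }
    return count;
}

/* Loop-body executions of the modulo-based program. */
unsigned modulo_gcd_iterations(unsigned x_i, unsigned y_i)
{
    /* x must be the larger number, as in the optimized program */
    unsigned x = (x_i >= y_i) ? x_i : y_i;
    unsigned y = (x_i >= y_i) ? y_i : x_i;
    unsigned count = 0;
    while (y != 0) {
        unsigned r = x % y;
        x = y;
        y = r;
        count++;
    }
    return count;
}
```

For the inputs 42 and 8 the subtraction loop body runs 8 times and the modulo loop body runs 2 times. Note that the saving in iterations comes at the price of a modulo unit, which is more expensive in hardware than a subtractor.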
40. Optimizing the FSMD
• Areas of possible improvement
  • Merge states
    • States with constants on their transitions can be eliminated, since the transition taken is already known
    • States with independent operations can be merged
  • Separate states
    • States that require complex operations (a*b*c*d) can be broken into smaller states to reduce hardware size
  • Scheduling
    • Task of assigning operations from the original program to states in an FSMD
41. Optimizing the FSMD
[Figure: original FSMD and optimized FSMD side by side.]
• Eliminate state 1: its transitions have constant values
• Merge state 2 and state 2-J: there is no loop operation between them
• Merge state 3 and state 4: the assignment operations are independent of one another
• Merge state 5 and state 6: the transitions from state 6 can be done in state 5
• Eliminate states 5-J and 6-J: the transitions from each state can be done from state 7 and state 8, respectively
• Eliminate state 1-J: the transition from state 1-J can be done directly from state 9
42. Optimizing the FSMD (cont.)
• Consider a = b * c * d * e
  • Generating a single state for the operation requires 3 multipliers in the datapath
  • Multipliers are expensive
• Break the operation down into smaller operations
  • t1 = b * c
  • t2 = d * e
  • a = t1 * t2
• Each smaller operation has its own state
  • Only 1 multiplier is required in the datapath
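The three-state schedule can be sketched in C to show that a single multiplier suffices. The names below are hypothetical. The one function multiply stands in for the one hardware multiplier, and the loop over state plays the role of three successive clock cycles.

```c
/* Stand-in for the single multiplier in the datapath. */
unsigned multiply(unsigned p, unsigned q) { return p * q; }

/* Computes a = b * c * d * e over three states, using the multiplier
   once per state. */
unsigned product_three_states(unsigned b, unsigned c, unsigned d, unsigned e)
{
    unsigned t1 = 0, t2 = 0, a = 0;
    for (int state = 0; state < 3; state++) {
        switch (state) {
        case 0: t1 = multiply(b, c);   break;  /* state 1: t1 = b * c   */
        case 1: t2 = multiply(d, e);   break;  /* state 2: t2 = d * e   */
        case 2: a  = multiply(t1, t2); break;  /* state 3: a = t1 * t2  */
        }
    }
    return a;
}
```

The trade-off is visible in the loop bound: the result takes three states instead of one, in exchange for one multiplier instead of three.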
43. Optimizing the FSMD (cont.)
• The timing of output operations can change when the FSMD is optimized
  • The reduced FSMD generates the GCD output in fewer clock cycles
• Changing the timing is not acceptable in all cases, e.g., a clock divider
• Thus, when optimizing an FSMD, the designer must know whether the output timing may be modified
44. Optimizing the Datapath
• Sharing of functional units
  • A one-to-one mapping from operations to functional units, as done previously, is not necessary
  • If the same operation occurs in different states, those states can share a single functional unit
• Multi-functional units
  • An ALU supports a variety of operations, so it can be shared among operations occurring in different states
45. Optimizing the FSM
• State encoding
  • Task of assigning a unique bit pattern to each state in an FSM
  • The size of the state register and of the combinational logic varies with the encoding
  • E.g., an FSM with n states encoded in log2(n) bits has n! possible encodings
    • Can be treated as an ordering problem
  • More encodings are possible: more than log2(n) bits can be used to encode n states
  • CAD tools are a great aid in searching for the best encoding
• State minimization
  • Task of merging equivalent states into a single state
  • Two states are equivalent if, for all possible input combinations, they generate the same outputs and transition to the same next state
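The size of the encoding search space can be made concrete with a short C sketch. The function names are hypothetical. min_state_bits gives the smallest state-register width for n states, and encoding_count gives the number of ways to assign distinct bit patterns to n states in a register of a chosen width. The count overflows a 64-bit integer quickly, so the sketch is meant for small n only.

```c
/* Smallest register width k such that 2^k >= n. */
unsigned min_state_bits(unsigned n)
{
    unsigned bits = 0;
    while ((1ULL << bits) < n)
        bits++;
    return bits;
}

/* Number of assignments of distinct bit patterns to n states using a
   register of the given width: 2^bits * (2^bits - 1) * ... for n factors.
   Returns 0 if the register is too narrow. Assumes bits < 64 and a
   result small enough to fit in unsigned long long. */
unsigned long long encoding_count(unsigned n, unsigned bits)
{
    unsigned long long codes = 1ULL << bits;
    unsigned long long count = 1;
    if (n > codes)
        return 0;
    for (unsigned i = 0; i < n; i++)
        count *= codes - i;
    return count;
}
```

For 4 states in 2 bits this gives 4! = 24 encodings, matching the n! figure on the slide. Allowing a third bit raises the count to 1680, which illustrates why CAD tools are used for the search.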
46. RT-level Custom Single-Purpose Processor Design
• Converting a sequential program into a custom single-purpose processor
  • Convert the program into an FSMD
  • Split the FSMD into a simple FSM controlling a datapath
  • Perform sequential logic design on the FSM
• In many cases, we prefer to start not with a program but directly with an FSMD
  • Cycle-by-cycle timing of a system is central to the design
  • Programming languages don't typically support cycle-by-cycle description
47. RT-level Custom Single-Purpose Processor Design
• Example
  • A device (the sender) must send an 8-bit number to another device (the receiver)
  • The receiver can receive all 8 bits at once
  • The sender sends 4 bits at a time: first the lower-order 4 bits, then the higher-order 4 bits
  • A bridge should be designed that enables the 2 devices to communicate
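The behavior required of the bridge can be sketched as a two-state machine in C. All names here are hypothetical, and handshake signals between the devices are left out. The bridge stores the lower-order 4 bits in the first state and, when the higher-order 4 bits arrive, presents the full 8-bit value.

```c
#include <stdint.h>

typedef enum { WAIT_LOW, WAIT_HIGH } BridgeState;
typedef struct { BridgeState state; uint8_t low; } Bridge;

/* Accepts one 4-bit transfer from the sender. Returns 1 and writes the
   assembled byte to *byte_out when both halves have arrived, else 0. */
int bridge_step(Bridge *b, uint8_t nibble, uint8_t *byte_out)
{
    nibble &= 0x0F;                       /* only 4 bits are transferred */
    if (b->state == WAIT_LOW) {
        b->low = nibble;                  /* store lower-order 4 bits */
        b->state = WAIT_HIGH;
        return 0;
    }
    *byte_out = (uint8_t)((nibble << 4) | b->low);
    b->state = WAIT_LOW;
    return 1;
}

/* Convenience wrapper: one complete 8-bit transfer through the bridge. */
uint8_t bridge_transfer(uint8_t low, uint8_t high)
{
    Bridge b = { WAIT_LOW, 0 };
    uint8_t byte = 0;
    bridge_step(&b, low, &byte);
    bridge_step(&b, high, &byte);
    return byte;
}
```

For example, sending 0xB followed by 0x3 delivers 0x3B to the receiver.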
51. Introduction
• General-purpose processor
  • Processor designed for a variety of computation tasks
  • Low unit cost, in part because the manufacturer spreads NRE over large numbers of units
    • Motorola sold half a billion 68HC05 microcontrollers in 1996 alone
  • Carefully designed, since higher NRE is acceptable
    • Can yield good performance, size and power
  • Low NRE cost, short time-to-market/prototype, high flexibility
    • User just writes software; no processor design
  • Also known as a "microprocessor": "micro" was used when they were implemented on one or a few chips rather than entire rooms
52. Basic Architecture
• Control unit and datapath
  • Note the similarity to a single-purpose processor
• Key differences
  • The datapath is general
  • The control unit doesn't store the algorithm: the algorithm is "programmed" into the memory
53. Datapath Operations
• Load
  • Read a memory location into a register
• ALU operation
  • Input certain registers through the ALU, store the result back in a register
• Store
  • Write a register to a memory location
54. Control Unit
• Control unit: configures the datapath operations
  • The sequence of desired operations ("instructions") is stored in memory: the "program"
• Instruction cycle: broken into several sub-operations, each taking one clock cycle, e.g.:
  • Fetch: get the next instruction into the IR
  • Decode: determine what the instruction means
  • Fetch operands: move data from memory to a datapath register
  • Execute: move data through the ALU
  • Store results: write data from a register to memory
55. Control Unit Sub-Operations
• Fetch
  • Get the next instruction into the IR
  • PC: program counter, always points to the next instruction
  • IR: holds the fetched instruction
58. Control Unit Sub-Operations
• Execute
  • Move data through the ALU
  • This particular instruction does nothing during this sub-operation
59. Control Unit Sub-Operations
• Store results
  • Write data from a register to memory
  • This particular instruction does nothing during this sub-operation
63. Architectural Considerations
• N-bit processor
  • N-bit ALU, registers, buses, memory data interface
  • Embedded: 8-bit, 16-bit, 32-bit common
  • Desktop/servers: 32-bit, even 64-bit
• PC size determines the address space
64. Architectural Considerations
• Clock frequency
  • Inverse of the clock period
  • The clock period must be longer than the longest register-to-register delay in the entire processor
  • Memory access is often the longest delay
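The relation between delay and clock frequency is simple arithmetic, sketched here in C. The function name is hypothetical, and the delay value in the example is made up for illustration rather than taken from any real processor.

```c
/* Upper bound on the clock frequency in MHz, given the longest
   register-to-register delay in nanoseconds. Since frequency is the
   inverse of the period, a period of T ns corresponds to 1000 / T MHz.
   Assumption: longest_delay_ns > 0. */
double max_clock_mhz(double longest_delay_ns)
{
    return 1000.0 / longest_delay_ns;
}
```

For instance, a longest delay of 10 ns limits the clock to at most 100 MHz.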
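66. Two Memory Architectures
• Princeton
  • Fewer memory wires
• Harvard
  • Simultaneous program and data memory access
[Figure: Harvard architecture, with the processor connected to separate program memory and data memory. Princeton architecture, with the processor connected to a single memory holding program and data.]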
67. Cache Memory
• Memory access may be slow
• Cache is a small but fast memory close to the processor
  • Holds a copy of part of memory
  • Hits and misses
[Figure: processor, cache and memory. The cache uses fast/expensive technology, usually on the same chip as the processor. The memory uses slower/cheaper technology, usually on a different chip.]
68. Superscalar and VLIW Architectures
• Performance can be improved by:
  • Faster clock (but there's a limit)
  • Pipelining: slice up the instruction into stages, overlap stages
  • Multiple ALUs to support more than one instruction stream
    • Superscalar
      • Scalar: non-vector operations
      • Fetches instructions in batches, executes as many as possible
      • May require extensive hardware to detect independent instructions
    • VLIW: each word in memory has multiple independent instructions
      • Relies on the compiler to detect and schedule instructions
      • Currently growing in popularity
69. Programmer's View
• The programmer doesn't need a detailed understanding of the architecture
  • Instead, needs to know what instructions can be executed
• Two levels of instructions:
  • Assembly level
  • Structured languages (C, C++, Java, etc.)
• Most development today is done using structured languages
  • But some assembly-level programming may still be necessary
  • Drivers: portion of a program that communicates with and/or controls (drives) another device
    • Often have detailed timing considerations, extensive bit manipulation
    • Assembly level may be best for these
70. Assembly-Level Instructions
[Figure: instruction format. Instructions 1 to 4 and onward each consist of the fields opcode, operand1, operand2.]
• Instruction set
  • Defines the legal set of instructions for that processor
    • Data transfer: memory/register, register/register, I/O, etc.
    • Arithmetic/logical: move a register through the ALU and back
    • Branches: determine the next PC value when it is not just PC+1
73. Sample Programs
C program
int total = 0;
for (int i=10; i!=0; i--)
  total += i;
// next instructions...

Equivalent assembly program
0      MOV R0, #0;    // total = 0
1      MOV R1, #10;   // i = 10
2      MOV R2, #1;    // constant 1
3      MOV R3, #0;    // constant 0
Loop:  JZ R1, Next;   // Done if i=0
5      ADD R0, R1;    // total += i
6      SUB R1, R2;    // i--
7      JZ R3, Loop;   // Jump always
Next:  // next instructions...
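To see the fetch-and-execute cycle at work on this sample program, a minimal instruction-set simulator can be sketched in C. Several things are assumptions made for illustration: the instructions are taken to run in address order 0 to 7 with Loop at address 4 and Next directly after address 7, the encoding (Op, Instr) is invented, and HALT is a hypothetical instruction placed at Next to stop the simulation.

```c
/* Hypothetical encoding: a is the destination or tested register;
   b is an immediate, a source register, or a jump target. */
typedef enum { MOV_IMM, ADD, SUB, JZ, HALT } Op;
typedef struct { Op op; int a; int b; } Instr;

/* Runs a program until HALT and returns the contents of R0. */
int iss_run(const Instr *program)
{
    int reg[4] = {0, 0, 0, 0};
    int pc = 0;
    for (;;) {
        Instr ir = program[pc++];          /* fetch; PC now points to next */
        switch (ir.op) {                   /* decode and execute */
        case MOV_IMM: reg[ir.a] = ir.b;                 break;
        case ADD:     reg[ir.a] += reg[ir.b];           break;
        case SUB:     reg[ir.a] -= reg[ir.b];           break;
        case JZ:      if (reg[ir.a] == 0) pc = ir.b;    break;
        case HALT:    return reg[0];
        }
    }
}

/* The sample assembly program, with HALT standing in for Next. */
int run_sum_program(void)
{
    static const Instr program[] = {
        { MOV_IMM, 0, 0 },   /* 0:    MOV R0, #0    total = 0    */
        { MOV_IMM, 1, 10 },  /* 1:    MOV R1, #10   i = 10       */
        { MOV_IMM, 2, 1 },   /* 2:    MOV R2, #1    constant 1   */
        { MOV_IMM, 3, 0 },   /* 3:    MOV R3, #0    constant 0   */
        { JZ, 1, 8 },        /* 4:    Loop: JZ R1, Next          */
        { ADD, 0, 1 },       /* 5:    ADD R0, R1    total += i   */
        { SUB, 1, 2 },       /* 6:    SUB R1, R2    i--          */
        { JZ, 3, 4 },        /* 7:    JZ R3, Loop   jump always  */
        { HALT, 0, 0 },      /* 8:    Next                       */
    };
    return iss_run(program);
}
```

run_sum_program() returns 55, the sum of the integers from 10 down to 1, which is the value the C program leaves in total.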
74. Programmer Considerations
• Program and data memory space
  • Embedded processors are often very limited
    • e.g., 64 Kbytes program, 256 bytes of RAM (expandable)
• Registers: how many are there?
  • Only a direct concern for assembly-level programmers
• I/O
  • How to communicate with external signals?
• Interrupts
75. Operating System
• Optional software layer providing low-level services to a program (application)
  • File management, disk access
  • Keyboard/display interfacing
  • Scheduling multiple programs for execution
    • Or even just multiple threads from one program
  • Program makes system calls to the OS
76. Development Environment
• Development processor
  • The processor on which we write and debug our programs
    • Usually a PC
• Target processor
  • The processor that the program will run on in our embedded system
    • Often different from the development processor
77. Software Development Process
• Compilers
  • Cross compiler
    • Runs on one processor, but generates code for another
• Assemblers
• Linkers
• Debuggers
• Profilers
78. Running a Program
• If the development processor is different from the target, how can we run our compiled code? Two options:
  • Download to the target processor
  • Simulate
• Simulation
  • One method: hardware description language
    • But slow, not always available
  • Another method: instruction set simulator (ISS)
    • Runs on the development processor, but executes the instructions of the target processor
79. Testing and Debugging
• ISS
  • Gives us control over time: set breakpoints, look at register values, set values, step-by-step execution, ...
  • But doesn't interact with the real environment
• Download to board
  • Use a device programmer
  • Runs in the real environment, but is not controllable
• Compromise: emulator
  • Runs in the real environment
  • Supports some controllability from the PC
80. Application-Specific Instruction-Set Processors (ASIPs)
• General-purpose processors
  • Sometimes too general to be effective in a demanding application
    • e.g., video processing requires huge video buffers and operations on large arrays of data, which is inefficient on a GPP
  • But a single-purpose processor has high NRE and is not programmable
• ASIPs: targeted to a particular domain
  • Contain architectural features specific to that domain
    • e.g., embedded control, digital signal processing, video processing, network processing, telecommunications, etc.
  • Still programmable
81. A Common ASIP: Microcontroller
• For embedded control applications
  • Reading sensors, setting actuators
  • Mostly dealing with events (bits): data is present, but not in huge amounts
  • e.g., VCR, disk drive, digital camera (assuming SPP for image compression), washing machine, microwave oven
• Microcontroller features
  • On-chip peripherals
    • Timers, analog-digital converters, serial communication, etc.
    • Tightly integrated for the programmer, typically part of the register space
  • On-chip program and data memory
  • Direct programmer access to many of the chip's pins
  • Specialized instructions for bit manipulation and other low-level operations
• Incorporating peripherals and memory onto the same IC reduces the number of required ICs → compact and low-power implementations
82. Another Common ASIP: Digital Signal Processors (DSP)
• For signal processing applications
  • Large amounts of digitized data, often streaming
    • Source: a photo captured by a digital camera, a voice packet passing through a network router
  • Data transformations must be applied fast
  • e.g., cell-phone voice filter, digital TV, music synthesizer
• DSP features
  • Several instruction execution units for filtering and for transforming vectors or matrices of data
  • Single-cycle multiply-accumulate instruction, among other instructions
  • Efficient vector operations, e.g., add two arrays
    • Vector ALUs, loop buffers, etc.
  • Contains a number of ADCs, DACs, PWMs, timers, counters, etc.
• Commonly used DSPs are well supported in terms of compilers and other development tools → easy and cheap to integrate into most embedded systems
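The multiply-accumulate operation at the heart of many DSP filters looks like this in C. This is a sketch with hypothetical names and made-up sample data. On a general-purpose processor each loop iteration costs several instructions, whereas a DSP with a single-cycle multiply-accumulate instruction can perform the multiply and the add in one step.

```c
#include <stddef.h>

/* Sum of element-wise products of two arrays of length n. */
long mac_dot(const int *a, const int *b, size_t n)
{
    long acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (long)a[i] * b[i];   /* one multiply-accumulate per element */
    return acc;
}

/* Small demonstration with made-up coefficients and samples. */
long mac_demo(void)
{
    static const int coeff[3]  = { 1, 2, 3 };
    static const int sample[3] = { 4, 5, 6 };
    return mac_dot(coeff, sample, 3);   /* 1*4 + 2*5 + 3*6 */
}
```

mac_demo() returns 32.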
83. Less General ASIP Environments
• ASIPs that are less general in nature
  • Designed to perform very domain-specific processing while allowing some degree of programmability
  • ASIPs designed for networking hardware → may be designed to be programmable with different network routing algorithms, checksum, and packet processing protocols
84. Trend: Even More Customized ASIPs
• In the past, microprocessors were acquired as chips
• Today, we increasingly acquire a processor as Intellectual Property (IP)
  • e.g., synthesizable VHDL model
• Opportunity to add custom datapath hardware and a few custom instructions, or to delete a few instructions
  • Can have significant performance, power and size impacts
  • Problem: need a compiler/debugger for the customized ASIP
    • Remember, most development uses structured languages
    • One solution: automatic compiler/debugger generation
      • e.g., www.tensillica.com
    • Another solution: re-targetable compilers
      • e.g., www.improvsys.com (customized VLIW architectures)
85. Selecting a Microprocessor
• Issues
  • Technical: speed, power, size, cost
  • Other: development environment, prior expertise, licensing, etc.
• Speed: how to evaluate a processor's speed?
  • Clock speed, but instructions per cycle may differ
  • Instructions per second, but work per instruction may differ
  • Dhrystone: synthetic benchmark, developed in 1984. Measured in Dhrystones/sec.
    • MIPS: 1 MIPS = 1757 Dhrystones per second (based on Digital's VAX 11/780). Also known as Dhrystone MIPS. Commonly used today.
      • So, 750 MIPS = 750*1757 = 1,317,750 Dhrystones per second
  • SPEC: set of more realistic benchmarks, but oriented to desktops
  • EEMBC: EDN Embedded Benchmark Consortium, www.eembc.org
    • Suites of benchmarks: automotive, consumer electronics, networking, office automation, telecommunications
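The Dhrystone MIPS conversion is a single multiplication, sketched below in C with hypothetical names. The constant 1757 is the Dhrystones-per-second figure for 1 MIPS given on the slide.

```c
/* Dhrystones per second corresponding to 1 Dhrystone MIPS. */
#define DHRYSTONES_PER_MIPS 1757L

/* Converts a Dhrystone MIPS rating to Dhrystones per second. */
long mips_to_dhrystones_per_sec(long mips)
{
    return mips * DHRYSTONES_PER_MIPS;
}

/* Converts a measured Dhrystones-per-second score to Dhrystone MIPS. */
double dhrystones_per_sec_to_mips(double dhrystones_per_sec)
{
    return dhrystones_per_sec / (double)DHRYSTONES_PER_MIPS;
}
```

mips_to_dhrystones_per_sec(750) returns 1317750, matching the slide's worked example.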
86. General Purpose Processors
Processor | Clock speed | Periph. | Bus Width | MIPS | Power | Trans. | Price

General Purpose Processors
Intel PIII | 1GHz | 2x16 K L1, 256K L2, MMX | 32 | ~900 | 97W | ~7M | $900
IBM PowerPC 750X | 550 MHz | 2x32 K L1, 256K L2 | 32/64 | ~1300 | 5W | ~7M | $900
MIPS R5000 | 250 MHz | 2x32 K 2 way set assoc. | 32/64 | NA | NA | 3.6M | NA
StrongARM SA-110 | 233 MHz | None | 32 | 268 | 1W | 2.1M | NA

Microcontroller
Intel 8051 | 12 MHz | 4K ROM, 128 RAM, 32 I/O, Timer, UART | 8 | ~1 | ~0.2W | ~10K | $7
Motorola 68HC811 | 3 MHz | 4K ROM, 192 RAM, 32 I/O, Timer, WDT, SPI | 8 | ~.5 | ~0.1W | ~10K | $5

Digital Signal Processors
TI C5416 | 160 MHz | 128K, SRAM, 3 T1 Ports, DMA, 13 ADC, 9 DAC | 16/32 | ~600 | NA | NA | $34
Lucent DSP32C | 80 MHz | 16K Inst., 2K Data, Serial Ports, DMA | 32 | 40 | NA | NA | $75
87. Chapter Summary
• General-purpose processors
  • Good performance, low NRE, flexible
• Controller, datapath, and memory
• Structured languages prevail
  • But some assembly-level programming is still necessary
• Many tools available
  • Including instruction-set simulators and in-circuit emulators
• ASIPs
  • Microcontrollers, DSPs, network processors, more customized ASIPs
• Choosing among processors is an important step
• Designing a general-purpose processor is conceptually the same as designing a single-purpose processor
89. 1. An algorithm for matrix multiplication, assuming that we have one adder and one multiplier, follows:
a. Convert the matrix multiplication algorithm into a state diagram.
b. Rewrite the matrix multiplication algorithm given the assumption that we have 3 adders and 6 multipliers.
c. If each multiplication takes 2 cycles to compute and each addition takes one cycle to compute, how many cycles does it take to complete the matrix multiplication given one adder and one multiplier? Three adders and six multipliers?
d. If each adder requires 10 transistors to implement and each multiplier requires 100 transistors to implement, what is the total number of transistors needed to implement the matrix multiplication circuit using 1 adder and 1 multiplier? Three adders and six multipliers?
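90. Matrix multiplication algorithm for exercise 1
int main(void)
{
    int A[3][2] = { {1, 2}, {3, 4}, {5, 6} };
    int B[2][3] = { {7, 8, 9}, {10, 11, 12} };
    int C[3][3], i, j, k;
    for (i = 0; i < 3; i++) {
        for (j = 0; j < 3; j++) {
            C[i][j] = 0;                       /* clear the accumulator */
            for (k = 0; k < 2; k++) {
                C[i][j] += A[i][k] * B[k][j];  /* one multiply, one add */
            }
        }
    }
    return 0;
}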
93. 2. Design a single-purpose processor that outputs Fibonacci numbers up to n places. Start with a function computing the desired result, translate it into a state diagram, and sketch a probable datapath.