3. INSTRUCTION SET
• ARM instruction set
– All instructions are 32-bit
– Most instructions can be executed conditionally
• Thumb instruction set
– 16-bit instruction set
– No conditional execution (except for branches)
AAETC3v00
Instruction Sets 3
– Optimized for code density from C code (~65% of ARM code size)
• Thumb-2 technology
– Extension to Thumb instruction set
– Mix of 16-bit and 32-bit instructions
– Conditional execution via IT instruction
– Higher performance than Thumb and smaller than ARM
8. BYTE REVERSAL
• Byte Reversal Instructions
REV{cond} Rd, Rm Reverses the bytes in a word
REV16{cond} Rd, Rm Reverses the bytes in each halfword
[Figure: REV r0, r0 reverses the byte order of r0, moving bytes 3 2 1 0 to positions 0 1 2 3]
REVSH{cond} Rd, Rm Reverses the bottom two bytes,
and sign extends to 32 bits
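V6 and later:
REV r0, r0
Pre-V6 equivalent (uses r1 as a scratch register):
EOR r1, r0, r0, ROR #16 ; r1 = r0 EOR (r0 rotated right by 16)
BIC r1, r1, #0xFF0000 ; clear bits 23:16 of r1
MOV r0, r0, ROR #8 ; rotate r0 right by one byte
EOR r0, r0, r1, LSR #8 ; correct the two bytes the rotate left out of place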
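The byte-reversal operations on this slide can be modelled in portable C. This is only an illustrative sketch: the function names (`rev`, `rev16`, `revsh`, `rev_pre_v6`, `ror32`) are chosen here to mirror the mnemonics and are not part of any ARM library. `rev_pre_v6` transcribes the four-instruction pre-V6 sequence one statement per instruction, so it can be compared against `rev`.

```c
#include <stdint.h>

/* Rotate right by n bits (0 < n < 32), like the ARM ROR shift. */
uint32_t ror32(uint32_t x, unsigned n) {
    return (x >> n) | (x << (32u - n));
}

/* REV: reverse all four bytes of a word. */
uint32_t rev(uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* REV16: reverse the two bytes inside each halfword. */
uint32_t rev16(uint32_t x) {
    return ((x >> 8) & 0x00FF00FFu) | ((x << 8) & 0xFF00FF00u);
}

/* REVSH: reverse the bottom two bytes, then sign extend from bit 15. */
uint32_t revsh(uint32_t x) {
    uint32_t h = ((x >> 8) & 0xFFu) | ((x & 0xFFu) << 8);
    return (h ^ 0x8000u) - 0x8000u;
}

/* The pre-V6 sequence, one C statement per instruction. */
uint32_t rev_pre_v6(uint32_t r0) {
    uint32_t r1 = r0 ^ ror32(r0, 16);   /* EOR r1, r0, r0, ROR #16 */
    r1 &= ~0x00FF0000u;                 /* BIC r1, r1, #0xFF0000   */
    r0 = ror32(r0, 8);                  /* MOV r0, r0, ROR #8      */
    r0 ^= r1 >> 8;                      /* EOR r0, r0, r1, LSR #8  */
    return r0;
}
```

For the input 0x11223344, both `rev` and `rev_pre_v6` give 0x44332211, while `rev16` gives 0x22114433.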
9. SIMD
• ARMv6 added a number of instructions which perform SIMD (Single Instruction
Multiple Data) operations using ARM registers
– Includes instructions for addition, subtraction, multiplication and sum of absolute
differences
– Instructions can work on four 8-bit quantities, or two 16-bit quantities
– Signed/unsigned and saturating versions available of many instructions
– CPSR GE bits used instead of normal ALU flags
• There are instructions for packing (PKHBT/PKHTB) and unpacking
(UXTH/UXTB) registers
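[Figure: UADD16 Rd, Rm, Rs adds the top halfwords and the bottom halfwords of Rm and Rs independently to form Rd; the top addition drives GE[3:2] and the bottom addition drives GE[1:0]]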
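As a rough illustration of the UADD16 behaviour described above, the following C sketch adds the two halfword lanes independently and derives the GE bits from each lane's carry. The struct and function names are invented for this example; only the lane arithmetic and the GE[3:2] / GE[1:0] pairing come from the slide.

```c
#include <stdint.h>

/* Result of the modelled UADD16: destination value plus the four GE bits. */
typedef struct {
    uint32_t rd;
    uint32_t ge;   /* bit 3..0 = GE[3:0] */
} uadd16_result;

uadd16_result uadd16(uint32_t rm, uint32_t rs) {
    uint32_t lo = (rm & 0xFFFFu) + (rs & 0xFFFFu);   /* bottom halfword lane */
    uint32_t hi = (rm >> 16) + (rs >> 16);           /* top halfword lane    */
    uadd16_result r;
    r.rd = ((hi & 0xFFFFu) << 16) | (lo & 0xFFFFu);  /* carries do not cross lanes */
    r.ge = ((hi > 0xFFFFu) ? 0xCu : 0u) |            /* GE[3:2] = carry out of top lane    */
           ((lo > 0xFFFFu) ? 0x3u : 0u);             /* GE[1:0] = carry out of bottom lane */
    return r;
}
```

With `rm = 0x0001FFFF` and `rs = 0x00010001`, the bottom lane wraps to 0 and sets GE[1:0], while the top lane gives 2 with GE[3:2] clear.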
10. SATURATED MATH AND CLZ
• Support for Saturated Arithmetic
– Targeted at DSP & control applications
– Overflow sets Q flag (sticky) not V, and sets result to +/- max value
QSUB{cond} Rd, Rm, Rn ; Rd = saturate(Rm - Rn)
QADD{cond} Rd, Rm, Rn ; Rd = saturate(Rm + Rn)
[Figure: signed 32-bit range, from 0x80000000 (-ve limit) through 0x0 to 0x7FFFFFFF (+ve limit)]
QDSUB{cond} Rd, Rm, Rn ; Rd = saturate(Rm - saturate(Rn * 2))
QDADD{cond} Rd, Rm, Rn ; Rd = saturate(Rm + saturate(Rn * 2))
• Count Leading Zeros
CLZ{cond} Rd, Rm
– Returns number of unset bits before the most significant set bit
Bit 31 (left) to bit 0 (right):
0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 1 0 0 1 1 1 0 0 1 1 1 0 1 0 0
CLZ returns 10 in this case
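The saturating operations and CLZ can be sketched in C as follows. This is a host-side model under stated assumptions, not ARM-provided code: the sticky Q flag is represented by a global `q_flag`, and the helper `sat32` and the function names are made up for the illustration.

```c
#include <stdint.h>

int q_flag = 0;   /* models the sticky Q flag: set on saturation, never cleared here */

/* Clamp a 64-bit intermediate result to the signed 32-bit range. */
int32_t sat32(int64_t v) {
    if (v > INT32_MAX) { q_flag = 1; return INT32_MAX; }
    if (v < INT32_MIN) { q_flag = 1; return INT32_MIN; }
    return (int32_t)v;
}

int32_t qadd(int32_t rm, int32_t rn)  { return sat32((int64_t)rm + rn); }
int32_t qsub(int32_t rm, int32_t rn)  { return sat32((int64_t)rm - rn); }

/* QDADD / QDSUB: the doubling of Rn saturates first, then the add or subtract saturates. */
int32_t qdadd(int32_t rm, int32_t rn) { return sat32((int64_t)rm + sat32((int64_t)rn * 2)); }
int32_t qdsub(int32_t rm, int32_t rn) { return sat32((int64_t)rm - sat32((int64_t)rn * 2)); }

/* CLZ: number of zero bits above the most significant set bit (32 for zero). */
unsigned clz32(uint32_t x) {
    unsigned n = 0;
    if (x == 0) return 32;
    while (!(x & 0x80000000u)) { x <<= 1; n++; }
    return n;
}
```

The bit pattern shown on the slide is 0x002C4E74, for which `clz32` returns 10.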
11. SATURATION
• Saturate a value to a specified bit position (effectively saturating to any
power of 2)
– USAT - Unsigned saturate 32-bit
• Syntax: USAT Rd, #sat, Rm {shift}
• Operation: Rd = Saturate(Shift(Rm), #sat)
[Figure: unsigned saturation, showing the saturation position within the word and the bit pattern of the resulting maximum value]
– Variants
• SSAT - signed saturation
• USAT16 - unsigned saturation of two 16-bit halfwords (no shift allowed)
• SSAT16 - signed saturation of two 16-bit halfwords (no shift allowed)
– #sat is specified as an immediate value in the range 0 to 31 for USAT (1 to 32 for SSAT)
– {shift} is optional and is limited to LSL or ASR
– Q flag is set if saturation occurs
[Figure: signed saturation, showing the bit patterns of the resulting maximum (0 0 0 1 1 ...) and minimum (1 1 1 0 0 ...) values]
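A minimal C sketch of the saturation step, assuming the optional shift has already been applied to the input value. The function names and the `q` output parameter (standing in for the Q flag) are illustrative choices, not an ARM API.

```c
#include <stdint.h>

/* USAT model: clamp to 0 .. 2^sat - 1, for sat in 0..31.
   *q is set to 1 when clamping occurs and left unchanged otherwise. */
uint32_t usat(int32_t value, unsigned sat, int *q) {
    int64_t max = ((int64_t)1 << sat) - 1;
    if (value < 0)   { *q = 1; return 0; }
    if (value > max) { *q = 1; return (uint32_t)max; }
    return (uint32_t)value;
}

/* SSAT model: clamp to -2^(sat-1) .. 2^(sat-1) - 1, for sat in 1..32. */
int32_t ssat(int32_t value, unsigned sat, int *q) {
    int64_t max = ((int64_t)1 << (sat - 1)) - 1;
    int64_t min = -max - 1;
    if (value > max) { *q = 1; return (int32_t)max; }
    if (value < min) { *q = 1; return (int32_t)min; }
    return value;
}
```

For example, saturating 300 to 8 unsigned bits gives 255, and saturating -200 to 8 signed bits gives -128; both cases set the flag.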
12. SINGLE / DOUBLE REGISTER DATA TRANSFER
• Use to move data between one or two registers and memory
Load    Store   Size
LDRD    STRD    Doubleword
LDR     STR     Word
LDRB    STRB    Byte
LDRH    STRH    Halfword
LDRSB           Signed byte load
LDRSH           Signed halfword load
• Syntax:
– LDR{<size>}{<cond>} Rd, <address>
– STR{<size>}{<cond>} Rd, <address>
• Example:
– LDRB r0, [r1] ; load bottom byte of r0 from the
; byte of memory at address in r1
– Any remaining space in Rd is zero filled or sign extended
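The difference between zero filling and sign extending can be shown with a small C model of the byte and halfword loads. It is a sketch only: the function names mirror the mnemonics, memory is represented as a byte array, and little-endian data is assumed.

```c
#include <stdint.h>

/* LDRB: load a byte, zero fill bits 31:8. */
uint32_t ldrb(const uint8_t *addr) {
    return addr[0];
}

/* LDRSB: load a byte, sign extend from bit 7. */
uint32_t ldrsb(const uint8_t *addr) {
    return ((uint32_t)addr[0] ^ 0x80u) - 0x80u;
}

/* LDRH: load a little-endian halfword, zero fill bits 31:16. */
uint32_t ldrh(const uint8_t *addr) {
    return (uint32_t)addr[0] | ((uint32_t)addr[1] << 8);
}

/* LDRSH: load a little-endian halfword, sign extend from bit 15. */
uint32_t ldrsh(const uint8_t *addr) {
    return (ldrh(addr) ^ 0x8000u) - 0x8000u;
}
```

Loading the byte 0x80 gives 0x00000080 with `ldrb` but 0xFFFFFF80 with `ldrsb`.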
13. ADDRESSING MEMORY
• The address accessed by LDR/STR is specified by a base register with
an optional offset
– Base register only (no offset)
LDR r0, [r1]
– Base register plus constant
LDR r0, [r1, #8]
– Base register, plus register (optionally shifted by an immediate value)
LDR r0, [r1, r2]
LDR r0, [r1, r2, LSL #2]
– The offset can be either added or
subtracted from the base register
LDR r0, [r1, #-8]
LDR r0, [r1, -r2]
LDR r0, [r1, -r2, LSL #2]
[Figure: the memory address is formed as r1 +/- (#8 or r2, LSL #2); the value at that address is loaded into r0]
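The address calculations listed above reduce to simple 32-bit arithmetic. The C sketch below uses hypothetical helper names and models only the offset forms shown on this slide (immediate offset, and register offset with an optional LSL).

```c
#include <stdint.h>

/* [Rn, #imm]: base plus a signed constant, e.g. [r1, #8] or [r1, #-8]. */
uint32_t ea_immediate(uint32_t base, int32_t offset) {
    return base + (uint32_t)offset;   /* wraps modulo 2^32 */
}

/* [Rn, +/-Rm, LSL #n]: base plus or minus a shifted register, n in 0..31. */
uint32_t ea_register(uint32_t base, uint32_t index, unsigned lsl, int subtract) {
    uint32_t offset = index << lsl;
    return subtract ? base - offset : base + offset;
}
```

With r1 = 0x1000 and r2 = 3, `[r1, r2, LSL #2]` addresses 0x100C and `[r1, -r2, LSL #2]` addresses 0xFF4. The LSL #2 form is a natural fit for indexing an array of words.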
15. MULTIPLE REGISTER DATA TRANSFER
• These instructions move data between multiple registers and memory
• Syntax
<LDM|STM>{<addressing_mode>}{<cond>} Rb{!}, <register list>
• 4 addressing modes
– Increment after/before (IA, IB)
– Decrement after/before (DA, DB)
• Also: PUSH/POP, equivalent to STMDB/LDMIA with SP! as base register
• Example
LDM r10, {r0,r1,r4} ; load registers, using r10 base
PUSH {r4-r6,lr} ; store registers, using SP base
[Figure: placement of r0, r1 and r4 in memory, in order of increasing address, relative to the base register (Rb) r10 for each addressing mode]
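The four addressing modes differ only in where the block of words sits relative to the base register. The following C sketch computes the lowest address touched and the written-back base for a transfer of `n` registers. The enum and function names are invented for the example; within the block, the lowest-numbered register always occupies the lowest address.

```c
#include <stdint.h>

typedef enum { MODE_IA, MODE_IB, MODE_DA, MODE_DB } ldm_mode;

/* Lowest address touched when n registers are transferred. */
uint32_t ldm_lowest_address(ldm_mode mode, uint32_t base, unsigned n) {
    switch (mode) {
    case MODE_IA: return base;                 /* increment after  */
    case MODE_IB: return base + 4u;            /* increment before */
    case MODE_DA: return base - 4u * n + 4u;   /* decrement after  */
    default:      return base - 4u * n;        /* MODE_DB: decrement before */
    }
}

/* Value written back to the base register when '!' is specified. */
uint32_t ldm_writeback(ldm_mode mode, uint32_t base, unsigned n) {
    if (mode == MODE_IA || mode == MODE_IB)
        return base + 4u * n;
    return base - 4u * n;
}
```

For instance, a four-register PUSH (STMDB with SP!) starting from SP = 0x1000 stores to 0xFF0 through 0xFFC and leaves SP at 0xFF0.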
16. INSTRUCTIONS FOR LOADING CONSTANTS
• The assembler provides some instructions for loading values into registers
– These are the recommended mechanisms for loading constants into registers
• PC- or register-relative constants
ADR Rn, label
– Adds or subtracts an immediate value to or from the PC to generate the address of the label into the specified register, using one instruction
– ADRL pseudo instruction uses two instructions, giving a better range
– Can be used to generate addresses for position independent code (but only if in same code section)
– Constant determined at run time
• Absolute constants
LDR Rn, =<constant>
LDR Rn, =label
– Pseudo instruction
– Assembler will use optimal sequence to generate constant into specified register (one of MOV, MVN or an LDR from a literal pool)
– Can load to the PC, causing a branch
– Use for absolute addressing and references outside the current section (resulting in position dependent code)
– Constant determined at assembly or link time
17. LDR= EXAMPLES
• The following examples show how the LDR= pseudo instruction
makes code more readable, portable and flexible
Code                    Disassembly
LDR r0, =0x2543         MOV r0, #0x2543
LDR r0, =0xFFFF43FF     MVN r0, #0xBC00
LDR r0, =0xFFFFF5       LDR r0, [pc, #xx]
                        ...
                        DCD 0xFFFFF5
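The choice between MOV, MVN and a literal pool load depends on whether the constant fits an immediate encoding. The C sketch below reproduces that decision for the three examples above. It is an approximation under stated assumptions, not the assembler's actual algorithm: `is_arm_immediate` tests the classic ARM form (an 8-bit value rotated right by an even amount), and the `has_mov16` parameter is a hypothetical switch for targets that also offer a MOV with a 16-bit immediate (MOVW, available from ARMv6T2 onward), which is how a value such as 0x2543 can be loaded by a single MOV.

```c
#include <stdint.h>

typedef enum { USE_MOV, USE_MVN, USE_LITERAL_POOL } load_strategy;

/* True if v is an 8-bit value rotated right by an even number of bits. */
int is_arm_immediate(uint32_t v) {
    for (unsigned rot = 0; rot < 32; rot += 2) {
        /* Rotate v left by rot; if the result fits in 8 bits, v is encodable. */
        uint32_t r = (v << rot) | (v >> ((32u - rot) & 31u));
        if (r <= 0xFFu)
            return 1;
    }
    return 0;
}

/* Pick a strategy for LDR Rd, =value (order of checks is an assumption). */
load_strategy classify_constant(uint32_t value, int has_mov16) {
    if (is_arm_immediate(value))          return USE_MOV;
    if (is_arm_immediate(~value))         return USE_MVN;
    if (has_mov16 && value <= 0xFFFFu)    return USE_MOV;   /* 16-bit immediate form */
    return USE_LITERAL_POOL;
}
```

Under these assumptions 0xFFFF43FF is classified as MVN (its complement 0xBC00 is encodable) and 0xFFFFF5 falls back to the literal pool, matching the table.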
18. BRANCH INSTRUCTIONS
• Branch instructions have the following format
B{<cond>} label
– Might not cause a pipeline flush (branch prediction)
– Branch range depends on instruction set and width
• A BL instruction additionally generates a return address in r14 (lr)
– Returning is performed by restoring the program counter (pc) from lr
C source:
void func1 (void)
{
:
func2();
:
}
func1:
:
BL func2
:
func2:
:
BX lr
19. BRANCH RANGES
• The range of a branch instruction depends on which instruction set
is being used
• It also varies between different types of branch
Instruction     ARM      Thumb
B               ±32MB    ±16MB
CBZ/CBNZ        -        126 bytes
BL/BLX (imm)    ±32MB    ±16MB
BLX (reg)       Any      Any
BX              Any      Any
TBB             -        510 bytes
TBH             -        131070 bytes
“Any” indicates an instruction which can branch to any address in the 4GB address space
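The figures in the table follow from the width of each instruction's offset field and the scaling applied to it. As a worked check, the C sketch below recomputes them. The field widths used in the usage note (a 6-bit forward offset for CBZ/CBNZ, 8-bit and 16-bit table entries for TBB and TBH, a 24-bit signed offset for B) are stated here as assumptions about the encodings rather than taken from the slide, and the function names are hypothetical.

```c
#include <stdint.h>

/* Largest forward offset for an unsigned field of field_bits bits (< 32),
   where each unit of the field represents scale bytes. */
uint32_t max_forward_offset(unsigned field_bits, unsigned scale) {
    return ((1u << field_bits) - 1u) * scale;
}

/* Reach in each direction for a signed field of field_bits bits (1..31),
   where each unit of the field represents scale bytes. */
uint32_t signed_reach(unsigned field_bits, unsigned scale) {
    return (1u << (field_bits - 1u)) * scale;
}
```

With halfword scaling (2 bytes), 6, 8 and 16 bits give 126, 510 and 131070 bytes. A 24-bit signed field gives 32MB with word scaling (ARM) and 16MB with halfword scaling (Thumb).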
20. READING AND WRITING PC
• In general, writing PC causes a branch to the value written
– Bit zero controls the execution state (ARM or Thumb) at the destination
– The bottom bit of the destination address is always forced to zero
– Writing a value with ‘10’ in the bottom two bits results in unpredictable behavior
– Note that architectures prior to ARMv7 do not change state when the PC is written
directly
• Loading PC from memory behaves similarly
– Architectures prior to ARMv5T do not change state when the PC is loaded from memory
• The PC reads as the address of the current instruction plus an offset
– In ARM state, the offset is 8
– In Thumb state, the offset is 4
– This reflects the 3-stage structure of the ARM7TDMI pipeline
– In Thumb state, the bottom bit always reads as zero
– In ARM state, the bottom two bits will always read as zero
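The PC read rule above amounts to a fixed offset per state. A one-function C sketch, with a hypothetical name and a boolean `thumb` parameter standing in for the execution state:

```c
#include <stdint.h>

/* Value observed when an instruction at insn_addr reads the PC.
   ARM state: current instruction + 8. Thumb state: current instruction + 4. */
uint32_t pc_read_value(uint32_t insn_addr, int thumb) {
    return insn_addr + (thumb ? 4u : 8u);
}
```

This is why PC-relative offsets, such as the one in a literal pool load, are computed from a point two instructions ahead of the instruction itself in ARM state.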
21. CHANGING STATE
• Changing between ARM and Thumb states (or “interworking”) can be carried out
using the Branch Exchange instruction
BX Rn
BLX Rn
– Bit 0 of Rn determines the exchange behavior
• Unset (0) - change to (or remain in) ARM state
• Set (1) - change to (or remain in) Thumb state
• Branch and Link with Exchange
– Used to branch to a subroutine which is known to be in the opposite instruction set
– When branching to imported labels, use BL; the linker will substitute BLX if necessary
BLX offset ; ARM/Thumb instruction which always
; changes state (and sets LR)
• All instructions which modify the PC can cause a state change
– Depending on bit 0 of the result
– For data processing instructions, state changes only if S variant not used
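How BX interprets its register operand can be sketched in C as below. The types and function names are illustrative only. `bx_is_predictable` reflects the rule that a destination value with '10' in its bottom two bits gives unpredictable behaviour.

```c
#include <stdint.h>

typedef enum { STATE_ARM, STATE_THUMB } isa_state;

typedef struct {
    uint32_t  pc;      /* destination address, bit 0 cleared */
    isa_state state;   /* instruction set selected at the destination */
} bx_result;

/* Model of BX Rn: bit 0 of Rn selects the state and is not part of the address. */
bx_result bx(uint32_t rn) {
    bx_result r;
    r.state = (rn & 1u) ? STATE_THUMB : STATE_ARM;
    r.pc    = rn & ~1u;
    return r;
}

/* A value ending in binary '10' requests ARM state at a non-word-aligned address. */
int bx_is_predictable(uint32_t rn) {
    return (rn & 3u) != 2u;
}
```

Branching to 0x8001 therefore enters Thumb state at address 0x8000, while branching to 0x8000 enters (or stays in) ARM state at the same address.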
22. IF-THEN
• Thumb only, makes the next 1-4 instructions conditional
• Syntax
IT{T|E}{T|E}{T|E} <cond>
– Any condition code may be used
– Doesn’t affect condition flags
– 16-bit instructions in the IT block do not affect condition flags (except CMP, CMN & TST)
– 32-bit instructions do affect condition flags (normal rules apply)
– No need to write this instruction: the assembler will insert it for you where necessary
• Current “if-then status” stored in CPSR
– Conditional block may be safely interrupted and returned to
– Not recommended to branch into or out of ‘if-then’ block
• Example
; if (r0 == 0)
; r0 = *r1 + 2;
; else
; r0 = *r2 + 4;
CMP r0, #0 ; if
ITTEE EQ
LDREQ r0, [r1] ; then
ADDEQ r0, #2
LDRNE r0, [r2] ; else
ADDNE r0, #4
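For reference, the pseudocode in the comments of the IT example corresponds to the C function below. The function and parameter names are chosen for this illustration. Whether a compiler actually emits an IT block for it depends on the target and optimization settings.

```c
#include <stdint.h>

/* C equivalent of the IT block example: r0 selects which pointer is
   dereferenced and which constant is added. */
int32_t it_example(int32_t r0, const int32_t *r1, const int32_t *r2) {
    if (r0 == 0)
        r0 = *r1 + 2;   /* "then" part: LDREQ / ADDEQ */
    else
        r0 = *r2 + 4;   /* "else" part: LDRNE / ADDNE */
    return r0;
}
```

Only one of the two loads is performed on any given call, just as only the instructions whose condition passes take effect in the IT block.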
23. STATUS REGISTER ACCESS
• MRS and MSR allow contents of CPSR/SPSR to be transferred
to/from a general purpose register or be set to an immediate value
– MSR allows the whole status register, or just parts of it, to be updated
MRS r0,CPSR ; read CPSR into r0
BIC r0,r0,#0x80 ; clear bit 7 to enable IRQ
MSR CPSR_c,r0 ; write modified value to ‘c’ byte only
• CPS can be used to directly modify some bits in the CPSR
– These are related to interrupt enable/disable and operating mode
• SETEND instruction selects the endianness of data accesses
– For use in systems with mixed endian data (e.g. peripherals)
SETEND BE
LDR r0, [r7], #4 ; big-endian
SETEND LE
LDR r1, [r7], #4 ; little-endian
• User mode programs may read all bits of CPSR but may only change the flag bits
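The effect of the field suffix on MSR (such as `_c`) can be modelled as a byte mask over the status register. The C sketch below is a simplified host-side model: the `FIELD_*` names and both functions are invented for the example, and it does not model the privilege restrictions on which fields a given mode may change.

```c
#include <stdint.h>

/* One bit per byte-wide field of the status register. */
enum { FIELD_C = 1, FIELD_X = 2, FIELD_S = 4, FIELD_F = 8 };

/* Model of MSR CPSR_<fields>, Rm: only the selected bytes are replaced. */
uint32_t msr_write(uint32_t cpsr, uint32_t value, unsigned fields) {
    uint32_t mask = 0;
    if (fields & FIELD_C) mask |= 0x000000FFu;   /* control byte, bits 7:0 */
    if (fields & FIELD_X) mask |= 0x0000FF00u;   /* bits 15:8              */
    if (fields & FIELD_S) mask |= 0x00FF0000u;   /* bits 23:16             */
    if (fields & FIELD_F) mask |= 0xFF000000u;   /* flags byte, bits 31:24 */
    return (cpsr & ~mask) | (value & mask);
}

/* The MRS / BIC / MSR sequence from the slide. */
uint32_t enable_irq(uint32_t cpsr) {
    uint32_t r0 = cpsr;                    /* MRS r0, CPSR        */
    r0 &= ~0x80u;                          /* BIC r0, r0, #0x80   */
    return msr_write(cpsr, r0, FIELD_C);   /* MSR CPSR_c, r0      */
}
```

Starting from 0x600000D3, `enable_irq` yields 0x60000053: bit 7 is cleared and the flag bits in the top byte are untouched.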
24. SYSTEM CONTROL INSTRUCTIONS
• ARM uses coprocessors for “internal functions” so as not to enforce
a particular memory map
– System Control Coprocessor: cp15
• Used for processor configuration: System ID, caches, MMU, TCMs, etc.
– Debug Coprocessor: cp14
• Can be used to access debug control registers
– VFP and NEON: cp10 and cp11
• In earlier versions of the architecture, designers were permitted to
add external coprocessors
– This is not permitted in ARMv7 architecture profiles
26. VFP ARCHITECTURE
• VFP (Vector Floating Point) is ARM’s floating point architecture
– There have been 4 versions of the architecture to date (VFPv1 is no longer
supported)
– VFPv2 is supported by ARM9 and ARM11 processor families
– VFPv3 and VFPv4 are optional extensions to the ARMv7-AR architecture profiles
• VFPv3 (Cortex-A8, Cortex-A9, Cortex-R4, Cortex-R5)
– Can be implemented with either 16 (VFPv3-D16) or 32 (VFPv3-D32) registers
– Can be extended with half-precision conversion functions
• VFPv4 (Cortex-A5, Cortex-A7 and Cortex-A15)
– Includes half-precision conversion functions
– Supports fused multiply-add operations
27. THE NEON ARCHITECTURE EXTENSION
• NEON refers to the Advanced SIMD instruction set extension
– Optional extension to ARMv7-AR architecture profiles
– The NEON register set is separate from the core register bank
– NEON instructions support parallel operations on vectors of elements held in registers
– Advanced SIMDv1 is the base NEON architecture
• Can be extended with half-precision conversion functions
– Advanced SIMDv2 adds fused multiply-add operations
34. ARM9TDMI PIPELINE (LDR INTERLOCK)
[Figure: pipeline timing diagram over cycles 1 to 9 for the instructions ADD R1, R1, R2; SUB R3, R4, R1; LDR R4, [R7]; ORR R8, R3, R4; AND R6, R3, R1; EOR R3, R1, R2, including one interlock (I) cycle]
F - Fetch D - Decode E - Execute I - Interlock M - Memory W - Writeback
• In this example it takes 7 clock cycles to execute 6 instructions, a CPI of about 1.2
• The LDR instruction immediately followed by a data operation using the same register causes an interlock
35. ARM9TDMI PIPELINE (LDR)
[Figure: pipeline timing diagram over cycles 1 to 9 for the same six instructions as the previous slide (ADD, SUB, LDR, ORR, AND, EOR), with no interlock cycle]
F - Fetch D - Decode E - Execute I - Interlock M - Memory W - Writeback
• In this example it takes 6 cycles to execute 6 instructions, CPI of 1
• Cycle 4 has simultaneous I & D memory accesses
• Cycle 5 R4 data available to ORR before written to register
– Internal forwarding paths are used
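The two cycle counts can be reproduced with a very small model of the load-use interlock. The C sketch below is a simplification under stated assumptions: it counts one cycle per instruction (steady-state throughput, as the slides do) plus one extra cycle whenever a load is immediately followed by an instruction that reads the loaded register. It ignores every other source of stalls. The `insn` struct and function name are hypothetical.

```c
/* Minimal description of an instruction: register numbers, -1 if unused. */
typedef struct {
    int is_load;   /* non-zero for LDR */
    int dest;      /* destination register */
    int src1;      /* first source register */
    int src2;      /* second source register */
} insn;

/* Cycles = one per instruction + one per load-use interlock. */
unsigned count_cycles(const insn *prog, unsigned n) {
    unsigned cycles = n;
    for (unsigned i = 0; i + 1 < n; i++) {
        if (prog[i].is_load &&
            (prog[i + 1].src1 == prog[i].dest ||
             prog[i + 1].src2 == prog[i].dest)) {
            cycles++;   /* consumer must wait for the loaded data */
        }
    }
    return cycles;
}
```

When the load of R4 is directly followed by an instruction that reads R4, the model gives 7 cycles for 6 instructions. Placing an independent instruction between the load and its consumer brings it back to 6.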
36. CORTEX-R4 PIPELINE
[Figure: Cortex-R4 pipeline. Labels: Prefetch Unit (Fetch1, Fetch2), Branch1, Branch2, Branch3, Instruction queue, common decode pipeline (Pre-Decode, Decode, Issue), 4 parallel back end pipelines, with stages Shift, ALU, Sat, Wr; MAC1, MAC2, MAC3, Wr; AGU, Data Cache 1, Data Cache 2, Format; and FPU (optional) with FPU0, FPU1, FPU2]
• Dual issue can occur for certain instruction sequences
• Enabled at reset, can be disabled in CP15
• AGU = Address Generation Unit
• Separate divide pipeline for hardware DIV instruction
38. CORTEX-A15 AND CORTEX-A7
[Figure: Cortex-A15 pipeline. Labels: Fetch; Decode, Rename & Dispatch; Loop Cache; Queue; Issue; Integer; Integer; Multiply; Floating-Point / NEON; Branch; Load; Store; Writeback]
[Figure: Cortex-A7 pipeline. Labels: Fetch; Decode; Queue; Issue; Dual Issue; Integer; Multiply; Floating-Point / NEON; Load/Store; Writeback]
• Cortex-A15 and Cortex-A7 form an architecturally-identical pair
• Cortex-A15 is optimized for performance
• Cortex-A7 is optimized for power consumption
• Together they can be built into a big.LITTLE configuration
40. CYCLE COUNTING
• Early pipelines (e.g. ARM7TDMI) were entirely deterministic and
predictable
• Later pipelines introduce interlocks and inter-instruction
dependencies
– Address, resource and data dependencies are all possible
– Interactions between instructions become very complicated
• On ARMv7 cores, manual cycle counting is not really possible, so use one of the following instead:
– Cycle-accurate trace
– Simulation models
– Performance Monitoring Unit (see later)
41. PERFORMANCE MONITORING HARDWARE
• ARMv7-A cores include a performance monitoring unit (PMU)
• A PMU provides a non-intrusive method of collecting execution information
from the core
– Enabling the PMU does not change the timing of the core
• The PMU provides:
– Cycle counter – counts execution cycles (optional 1/64 divider)
– Programmable event counters
• The number of counters and available events vary between cores
– The PMU can be configured to generate interrupts if a counter overflows
• Some examples common to most cores:
– Cache hits or misses, TLB misses (on MMU cores), branch predictions (correct/incorrect), number of instructions executed, etc.
• Some events are architecturally defined while others are core-dependent
– Check the ARM ARM and your core’s TRM for a full list
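When two cycle counter readings are turned into an elapsed cycle count, the optional 1/64 divider and counter wrap-around both need to be taken into account. The C sketch below shows the arithmetic only. It assumes a 32-bit counter that wraps at most once between the two readings, and the function name and parameters are hypothetical. How the counter is actually read is core-specific and is not shown.

```c
#include <stdint.h>

/* Elapsed core cycles between two readings of a 32-bit cycle counter.
   divider_enabled is non-zero when the counter ticks once every 64 cycles. */
uint64_t elapsed_cycles(uint32_t start, uint32_t end, int divider_enabled) {
    uint32_t ticks = end - start;   /* unsigned subtraction handles one wrap */
    return (uint64_t)ticks * (divider_enabled ? 64u : 1u);
}
```

With the divider enabled the result is only accurate to within 64 cycles, which is the trade-off for a longer interval before the counter overflows.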