AAME ARM Techcon2013 004v02 Debug and Optimization

Software Debug & Optimization for ARM® Cortex®-M Microcontrollers

Agenda
Tools & Debug Configuration
Debug Components
Physical Debug Interfaces
Compiler Configuration
Introduction to Optimization
Mixing C and Assembler
AAME TechCon 2013
TC004v02

Keil® MDK
Low cost tools for ARM7, ARM9, ARM Cortex-M and ARM Cortex-R4 MCUs
– Extensive device support for many devices
– Core and peripheral simulation
– Flash support
Microcontroller Development Kit (MDK)
– µVision IDE
– ARM Compiler, optimized run-time library, KEIL RTX RTOS
– Real-time trace (for Cortex-M3 and Cortex-M4 based devices)
Real-Time Library
– Keil RTX RTOS + Source Code
– TCP networking suite, Flash File System, CAN Driver Library, USB Device Interface
Debug Hardware
Evaluation boards

ARM Cortex-M3/M4 Debug Features
Compliant with ARMv7-M Debug Architecture (CoreSight™ based)
Traditional ARM Debug Features
– Two debug modes (Halt mode and Monitor mode)
– Two stepping modes (with and without interrupts taken)
– BKPT instruction
– Vector Catch
– Optional Embedded Trace Macrocell (ETM)
CoreSight Debug Features
– Flash Patch and Breakpoint (FPB)
– Instruction Breakpoints and Code Patching
– Data Watchpoint and Trace (DWT)
– Hardware Watchpoints, Event Counters and PC Sampling
– Instrumentation Trace Macrocell (ITM)
– Low bandwidth trace driven by application software or DWT
– Serial Wire Viewer

ARM Cortex-M3/M4 CoreSight
Overview
Utilizes a Debug Access Port (DAP)
– Consists of a Debug Port (DP) + AHB Access Port (AP)
Debug Port (DP) has 2 implementation options
– SWJ-DP – supports Serial Wire (2-pin) and conventional JTAG interface
– SW-DP – supports Serial Wire (2-pin) only
AHB Access Port (AP)
– Provides AHB-Lite access to core, memory and debug components
– All Debug registers are memory mapped
– Traditional internal scan chains no longer utilized
– No coprocessors on Cortex-M3 (CP14 was traditionally the debug coprocessor)
[Diagram: two DAP options. SWJ-DP + AHB-AP, reached over Serial Wire or JTAG; SW-DP + AHB-AP, reached over Serial Wire only]

Cortex-M3/M4 Debug Access Paths
[Block diagram: the SW/J-DP and AHB-AP (together the DAP) connect through the bus matrix to the Cortex-M3 core. The DWT, FPB and ITM are on the internal Private Peripheral Bus (AHB); the ETM, TPIU and ROM Table are on the external Private Peripheral Bus (APB).]

Halted Debug Mode
Traditional start/stop debug
– Core executes and then halts in Debug mode
Debug Fault Status Register (DFSR) identifies the type of Debug event
– EXTERNAL EDBGRQ input asserted from other SoC component
– VCATCH Vector Catch triggered
– DWTTRAP Data access to address matching a Watchpoint
– BKPT BKPT instruction executed
– HALTED Halt request from debugger (or stepping in debug)
BKPT instruction
– Debugger replaces original instruction with BKPT for software breakpoint
– Synthesized by FPB unit for a hardware breakpoint
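The DFSR flags listed above can be decoded in a few lines of C. This is a minimal sketch, not code from the slides: the bit positions follow the ARMv7-M definition of the DFSR (memory mapped at 0xE000ED30), while the helper name `dfsr_print` is hypothetical. The function takes the register value as a parameter so that it can be exercised on a host as well as on a target.

```c
#include <stdint.h>
#include <stdio.h>

/* DFSR bit positions (ARMv7-M). */
#define DFSR_HALTED   (1u << 0)  /* halt request from debugger, or step */
#define DFSR_BKPT     (1u << 1)  /* BKPT instruction or FPB breakpoint  */
#define DFSR_DWTTRAP  (1u << 2)  /* DWT watchpoint match                */
#define DFSR_VCATCH   (1u << 3)  /* vector catch triggered              */
#define DFSR_EXTERNAL (1u << 4)  /* EDBGRQ asserted                     */

/* Hypothetical helper: prints every debug event flagged in 'dfsr'
 * and returns how many flags were set. */
int dfsr_print(uint32_t dfsr)
{
    static const struct { uint32_t mask; const char *name; } flags[] = {
        { DFSR_HALTED,   "HALTED"   },
        { DFSR_BKPT,     "BKPT"     },
        { DFSR_DWTTRAP,  "DWTTRAP"  },
        { DFSR_VCATCH,   "VCATCH"   },
        { DFSR_EXTERNAL, "EXTERNAL" },
    };
    int count = 0;
    for (unsigned i = 0; i < sizeof flags / sizeof flags[0]; i++) {
        if (dfsr & flags[i].mask) {
            printf("debug event: %s\n", flags[i].name);
            count++;
        }
    }
    return count;
}
```

On a target, the argument would typically be read with `*(volatile uint32_t *)0xE000ED30`. More than one flag can be set at once, which is why the helper reports all of them rather than only the first.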

Vector Catch
Mechanism traps selected exceptions
– Core halts when exception is asserted
– No DWT / Breakpoint resources utilized
– Suitable for early software development
– Selection made through debugger
Following exceptions may be trapped
– Reset
– Hard Fault
– Memory Management Fault
– Bus Fault
– Usage Fault
– Exception Service Error
Note - cannot catch interrupts this way
– Unlike other ARM and Cortex-A/R cores
– Use breakpoint in interrupt handler
Address      Vector
0x00         Initial Main SP
0x04         Reset
0x08         NMI
0x0C         Hard Fault
0x10         Memory Manage
0x14         Bus Fault
0x18         Usage Fault
0x1C - 0x28  Reserved
0x2C         SVCall
0x30         Debug Monitor
0x34         Reserved
0x38         PendSV
0x3C         SysTick
0x40         IRQ0
…            More IRQs

Reset
Core has 3 different reset inputs
– PORESETn - Power-on reset for Cortex-M3 system
– SYSRESETn - System reset for processor (debug components not reset)
– DAPRESETn - AHB-AP reset
Software Generated Resets
– VECTRESET bit in Application Interrupt and Reset Control Register
– Equivalent to asserting SYSRESETn
– Software reset option is available in Keil/MDK and DS-5 Development Studio
– Use CTRL_REG for the “Reset Type”
– Core is safely reset without asserting nSRST JTAG signal
– SYSRESETREQ bit in Application Interrupt and Reset Control Register
– Sends a request for a reset to the system
– Reset is generated by customer-defined reset controller (not the M3)
– Other components in the system other than the Cortex-M3 may be affected
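As a rough illustration of the software-generated resets described above, the sketch below builds the word that would be written to the Application Interrupt and Reset Control Register (AIRCR, memory mapped at 0xE000ED0C). The field positions follow the ARMv7-M definition of AIRCR; the helper name `aircr_reset_word` is hypothetical. The function only computes the value, so it can be checked on a host.

```c
#include <stdint.h>

/* AIRCR fields (ARMv7-M). */
#define AIRCR_VECTKEY       (0x05FAu << 16) /* key required on every write    */
#define AIRCR_PRIGROUP_MASK (0x7u << 8)     /* priority grouping, preserved   */
#define AIRCR_SYSRESETREQ   (1u << 2)       /* request reset from the system  */
#define AIRCR_VECTRESET     (1u << 0)       /* core-only reset (debug use)    */

/* Hypothetical helper: returns the value to write to AIRCR in order to
 * request a reset, keeping the current PRIGROUP setting intact.
 * 'reset_bit' is AIRCR_SYSRESETREQ or AIRCR_VECTRESET. */
uint32_t aircr_reset_word(uint32_t current_aircr, uint32_t reset_bit)
{
    return AIRCR_VECTKEY
         | (current_aircr & AIRCR_PRIGROUP_MASK)
         | reset_bit;
}
```

On a target, the result would be stored to `*(volatile uint32_t *)0xE000ED0C`, normally followed by a data synchronization barrier and an endless loop while the reset takes effect. Writes without the 0x05FA key are ignored by the core.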

Agenda
Debug Components

Flash Patch and Breakpoint Unit (FPB)
Flash Patching
– Allows runtime patching of firmware
– Remaps reads from the Code space to System space using a Patch Table
– Total of 8 addresses may be patched
– 6 instruction comparators (for instruction fetches from Code space)
– 2 literal comparators (for literal data loads from Code space)
– Only reads are patched
– Writes will be performed as normal
– Intended Usage
– ROM-based designs (costly fix)
– Firmware field upgrades
Hardware Breakpoints – maximum of 6
– The 6 instruction comparators can return a BKPT to halt the core
– Instruction comparators are shared with Flash Patch functionality
– If 3 instructions are flash patched, only 3 hardware breakpoints are available

Data Watchpoint and Trace (DWT)
DWT component useful for Debug, Trace and Profiling
– Enabled by setting the TRCENA bit in the Debug Exception and Monitor Control Register (DEMCR)
Debug Support
– Traditional data watchpoint for halt mode debug
– Can break on [data value && data address] match (x1)
Trace Support
– Generate trace trigger for Embedded Trace Macrocell (ETM)
Profiling / Event Support
– Provides non-invasive view of application execution
– Packets output through Instrumentation Trace Macrocell (ITM)
– Must have debug tools connected to view the output
– Packets generated for selected events of interest
– Data Address matching
– Periodic PC Sampling
– Exception Entry, Exit and Return
– Hardware performance counting

DWT Block Diagram
DWT interfaces to the Core, ETM and ITM
ITM communication is packet based
– Packets defined in the ARMv7-M Architectural Reference Manual
[Block diagram: inside the DWT, a comparator bank, event counters and a cycle counter take input from the Cortex-M3 core. Signals shown: break to the core, trigger to the ETM*, packets to the ITM.]
*ETM optional

Instrumentation Trace Macrocell (ITM)
Generates and outputs Trace Packets
Packet types (in priority order):
– Software trace
– Software can write directly to ITM stimulus registers, causing packets to be emitted
– Similar to using printf() to debug a C program
– Hardware trace
– Packets are generated by the DWT and emitted by the ITM
– Timestamps
Timestamp Packets
– Must be enabled in ITM Trace Control Register
– Local Timestamp (differential) value generated from 21-bit counter
– Counter clocked from either core clock or TPIU clock
– Global Timestamp (absolute) value generated from 48-bit counter
– Packet generation
– When any other trace packet is generated (which resets timestamp counter)
– When timestamp counter overflows
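The software trace path described above (writing to an ITM stimulus register, much like a printf()) might look like the following sketch. It assumes the ITM and the chosen stimulus port have already been enabled by the debugger or by start-up code, and that stimulus port 0 lives at 0xE0000000 as on Cortex-M3/M4. The function name `itm_put_char` and the `max_spins` timeout are hypothetical; the port address is passed in so the routine can be exercised against an ordinary variable on a host.

```c
#include <stdint.h>

/* Hypothetical helper: sends one character through an ITM stimulus port.
 * Reading the port returns bit 0 set when the FIFO can accept data.
 * Returns 1 if the byte was written, 0 if the port never became ready. */
int itm_put_char(volatile uint32_t *stim_port, char c, unsigned max_spins)
{
    while ((*stim_port & 1u) == 0u) {      /* FIFO full or port disabled */
        if (max_spins-- == 0u) {
            return 0;                      /* give up instead of hanging */
        }
    }
    /* A byte-wide write produces a packet with a one-byte payload. */
    *(volatile uint8_t *)stim_port = (uint8_t)c;
    return 1;
}
```

On a target, a call such as `itm_put_char((volatile uint32_t *)0xE0000000, 'A', 1000u)` would emit the character over SWO when a debugger is connected. The timeout is a design choice for this sketch: it keeps the application from stalling when no debug tool has enabled the ITM.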

ARM Cortex-M3/M4 and ITM
[Block diagram: Cortex-M3 macrocell containing the core, DWT, ITM, ETM (with trigger input) and SW/SWJ-DP on the APB; the ITM and ETM connect over ATB to the TPIU, which outputs SWO and TraceData[3:0]; local timestamp, global timestamp and global timestamp clock are also shown]

Embedded Trace Macrocell (ETM)
Optional non-invasive debug component
ETM Hardware monitors activity of processor
Trace allows:
– Historical debug of sequences leading up to events of interest
– e.g. System crash on peripheral access during overnight testing
– Debug of events in real-time systems where the target cannot be halted
– Hard Disk drives, Engine Management
– Visibility of accesses inside a SoC
– To internal memories/peripherals
– Software profiling and code coverage

ARM Cortex-M3/M4 and ETM
[Block diagram: Cortex-M3 macrocell containing the core, DWT, ITM, ETM (with trigger input) and SW/SWJ-DP on the APB; the ITM and ETM connect over ATB to the TPIU, which outputs SWO and TraceData[3:0]; local timestamp, global timestamp and global timestamp clock are also shown]

TPIU Interface / Serial Wire Output
Formats and serializes data from ETM and ITM
Trace data clocked out asynchronous to core clock
– TRACECLK derived from TRACECLKIN
ETM packets output over Trace Port (TRACECLK and TRACEDATA [3:0])
– Data decompressed with a conventional Trace Port Analyzer
ITM/DWT packets optionally output over Serial Wire Output (SWO)
– SWO also referred to as “Serial Wire Viewer”
– Data decompressed with an Event Viewer
[Block diagram: inside the TPIU, the ETM and ITM each feed a FIFO, followed by a formatter and a serializer clocked from TRACECLKIN; outputs are SWO, TRACEDATA [3:0] and TRACECLK]

Agenda
Physical Debug Interfaces

Physical Interfaces
New 10-pin and 20-pin interfaces available
– Higher pin density (0.05”) over standard JTAG IDC interface (0.10”)
– Reduces connector footprint
Serial Wire Debug is the preferred solution
– SWO likely not available when using JTAG interface
Trace can use a 20-pin connector
– Legacy 38-pin Mictor connector not recommended
Consult CoreSight Component TRM or Device Data Sheet

Samtec FTSH-110 Connector
Pin  Signal       Pin  Signal
1    VTref        2    SWDIO / TMS
3    GND          4    SWCLK / TCK
5    GND          6    SWO / TDO
7    KEY          8    NC/EXTb / TDI
9    GNDDetect    10   nRESET

Samtec FTSH-120 Connector
Pin  Signal            Pin  Signal
1    VTref             2    SWDIO / TMS
3    GND               4    SWCLK / TCK
5    GND               6    SWO/EXTa/TRACECTL / TDO
7    KEY               8    NC/EXTb / TDI
9    GNDDetect         10   nRESET
11   GND/TgtPwr+Cap    12   TRACECLK
13   GND/TgtPwr+Cap    14   TRACEDATA[0]
15   GND               16   TRACEDATA[1]
17   GND               18   TRACEDATA[2]
19   GND               20   TRACEDATA[3]

Agenda
Compiler Configuration

Language Support
Single compiler armcc can compile standard ISO C/C++
Source language modes
– ISO C90
– 1990 C standard, compile option --c90 (default)
– ISO C99
– 1999 C standard, compile option --c99
– ISO C++
– 2003 C++ standard, compile option --cpp
Language compliance
– Default mode supports several common extensions
– Strict mode enforces compliance with language standard: --strict
– GNU mode offers partial support for GCC extensions: --gnu

Variable types supported
The compiler supports these basic types
int / long   32-bit (word) integer
short        16-bit (half-word) integer
char         8-bit byte, unsigned by default
long long    64-bit integer
float        32-bit single-precision IEEE floating point
double       64-bit double-precision IEEE floating point
bool         8-bit Boolean (C++ only)
wchar_t      16-bit “wide character” type (C++ only)
Pointers     32-bit integer addresses

Optimization Levels
Level of optimizations carried out by the compiler is selectable
-O0
– Minimum optimization
– The least optimized code, but with the best debug view
-O1
– Restricted optimization
– Optimized code and a good debug view
-O2 (default)
– High optimization
– Well optimized code but with limited debug view
-O3
– More aggressive optimization, weighted toward -Ospace / -Otime choice
– Enables multifile compilation by default (more later)
Select optimization for code size or execution speed with -Ospace (default) or -Otime
Use -g or --debug to generate source level debug information

Selecting an Architecture or Core
Each new version of the ARM Architecture typically supports extra instructions and
models of operation
Implementation of an architecture version may vary between cores
– Use the most specific setting you can when compiling
Inform the compiler of the architecture or processor
– The default CPU setting is ARM7TDMI (Architecture 4T)
– Either specify an architecture version, or a specific core
--cpu 7-M (Do not prefix with a ‘v’)
--cpu Cortex-M3
Some examples of features the compiler and libraries can take advantage of:
– UDIV and SDIV (7-M and 7-R)
– REV (v6) can be used to reverse byte endianness
– Unaligned memory access (v6)
When using the Cortex-M3 it is essential to specify 7-M or Cortex-M3 to ensure
the correct (Thumb only) libraries are used

Agenda
Introduction to Optimization

Using “volatile”
int f(int *p)
{
    return (*p == *p);
}
armcc output:
f
    MOVS r0, #1
    BX lr
int f(volatile int *p)
{
    return (*p == *p);
}
armcc output:
f
    LDR r1,[r0,#0]
    LDR r0,[r0,#0]
    CMP r1,r0
    ITE NE
    MOVNE r0,#0
    MOVEQ r0,#1
    BX lr
This code is compiled with “-O2 -Otime --cpu=Cortex-M3”

Instruction Scheduling
Instruction scheduling is enabled at -O1 and higher
– Instructions are re-ordered to suit the core on which the code will run
– Improves throughput by minimizing interlocks
– Select processor (--cpu) to determine algorithm used
For example:
int f(int *p, int x) { return *p + x * 3; }
Without scheduling (-O0)
    MOV r2,r0
    ADD r3,r1,r1,LSL #1
    LDR r0,[r2,#0]
    ADD r0,r0,r3
    BX lr
With scheduling (-O1, -O2, -O3)
    LDR r0,[r0,#0]
    ADD r1,r1,r1,LSL #1
    ADD r0,r0,r1
    BX lr
Compiler never re-orders instructions if this would change the behavior

Inlining of functions
Inlining can improve performance, at the expense of a larger image
– Body of inlined function inserted directly into the calling code wherever it is called
– Only possible if caller and callee are in same compilation unit (except --multifile)
The compiler can inline functions automatically
– Normally no need to annotate your source code, or use any special switches
Factors that influence auto-inlining include
– Whether the function is marked with __inline ‘hint’
– Optimization level and -Otime / -Ospace
– How many places the function is called
– Size of the function
– Whether the function has external or static linkage
To force a function to be inlined, either use --forceinline with __inline or use __forceinline
Any non-static function that gets auto-inlined has an out-of-line version generated too (another reason to use static)
– Increases code size
– More complex debug view
Example...
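The inlining advice above can be illustrated with a small helper that is both `static` and marked inline. This is a sketch using standard C99 `static inline` rather than the armcc-specific `__inline` keyword (with armcc this would need --c99); the function names are hypothetical. Because the helper has internal linkage, a compiler that inlines every call has no reason to keep an out-of-line copy.

```c
/* Small helper: a good inlining candidate because the body is tiny
 * and, being static, it is visible only in this compilation unit. */
static inline int clamp_u8(int v)
{
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return v;
}

/* Caller: the compiler may insert the body of clamp_u8() directly
 * into the loop instead of emitting a call. */
int sum_clamped(const int *values, int count)
{
    int total = 0;
    for (int i = 0; i < count; i++) {
        total += clamp_u8(values[i]);
    }
    return total;
}
```

Whether the call is actually inlined still depends on the optimization level and the -Otime / -Ospace setting, as listed above.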

Loop Transformation
The compiler can transform and restructure loops automatically
– Enabled with -O3 -Otime
Loop unrolling reduces loop overhead at the cost of increase in code size
for (i = 0; i < 100; i++)
{
    c[i] = b[i] + 1;
}
is unrolled to
for (i = 0; i < 100; i += 4)
{
    c[i + 0] = b[i + 0] + 1;
    c[i + 1] = b[i + 1] + 1;
    c[i + 2] = b[i + 2] + 1;
    c[i + 3] = b[i + 3] + 1;
}
Loop re-rolling
– Recognize manually unrolled loops, re-roll and unroll optimally
Loops with constant, low iteration count may be unrolled completely

Branch Target Optimization (1)
The ARM Compiler implements a performance optimization to ensure that a loop branch target is not an unaligned 32-bit instruction
– Available when compiling at -O3
– Improves EEMBC performance by 1.5%
The compiler will try first to widen the instruction before the branch target
– If widening is not possible then it will insert a 16-bit NOP
Two examples on the following slides
– The first shows an instruction being widened to align a loop
– The second shows a NOP being inserted to align a loop

Instruction being widened to align a loop target
int foo1(int a[16]) {
    int i;
    int total = 0;
    for (i=0; i<8; i++) {
        total += a[i];
    }
    return total;
}
foo1
0x00000000: 4602 .F MOV r2,r0
0x00000002: 2000 . MOVS r0,#0
0x00000004: ea4f0100 O... MOV.W r1,r0 <<<< widened MOV
loop 0x00000008: f8523021 R.!0 LDR r3,[r2,r1,LSL #2]
0x0000000c: 1c49 I. ADDS r1,r1,#1
0x0000000e: 4418 .D ADD r0,r0,r3
0x00000010: 2908 .) CMP r1,#8
0x00000012: dbf9 .. BLT {pc}-0xa ; 0x8 loop
0x00000014: 4770 pG BX lr

NOP being inserted to align a loop target
int foo2(int a[16], int j) {
    int total = 0;
    int i = 0;
    if (a[0]!=0) {
        for (i=0; i<8; i++) {
            total += a[i] + a[i+1]; } }
    return total;
}
foo2
0x00000000: b510 .. PUSH {r4,lr}
0x00000002: 4602 .F MOV r2,r0
0x00000004: 2000 . MOVS r0,#0
0x00000006: 6813 .h LDR r3,[r2,#0]
0x00000008: 4601 .F MOV r1,r0
0x0000000a: 2b00 .+ CMP r3,#0
0x0000000c: d00a .. BEQ {pc}+0x18 ; 0x24
0x0000000e: bf00 .. NOP <<<< added NOP
loop 0x00000010: eb020481 .... ADD r4,r2,r1,LSL #2 <<<< 32 bit aligned
0x00000014: f8523021 R.!0 LDR r3,[r2,r1,LSL #2]
0x00000018: 6864 dh LDR r4,[r4,#4]
0x0000001a: 1c49 I. ADDS r1,r1,#1
0x0000001c: 4423 #D ADD r3,r3,r4
0x0000001e: 2908 .) CMP r1,#8
0x00000020: 4418 .D ADD r0,r0,r3
0x00000022: dbf5 .. BLT {pc}-0x12 ; 0x10 loop
0x00000024: bd10 .. POP {r4,pc}

Register Usage
Register   Usage
r0-r3*     Arguments into function, result(s) from function, otherwise corruptible
           (Additional parameters passed on stack)
r4-r11     Register variables, must be preserved
r12*       Scratch register (corruptible)
r13/sp     Stack Pointer
r14/lr*    Link Register
r15/pc*    Program Counter
The compiler has a set of rules known as a Procedure Call Standard that determine how to pass parameters to a function (see AAPCS)
CPSR flags may be corrupted by function call
Assembler code which links with compiled code must follow the AAPCS at external interfaces
The AAPCS is part of the ABI for the ARM Architecture
Registers marked with a star are automatically pushed on to the stack when an exception occurs
The xPSR (processor state) is also pushed to the stack
- r14 can be used as a temporary once value stacked
- AAPCS requires that sp be 8-byte (2 word) aligned at externally visible boundaries

Register Usage (2)
Caller
    ...
    BL foo
    ...
– Parameters passed in r0-r3
– May need to save r0-r3, r12
– Do not need to save r4-r11
Callee
foo
    PUSH {r4-r11, lr}
    ...
    POP {r4-r11, pc}
– Must preserve r4-r11, lr if used by callee
– May corrupt r0-r3, r12
Value returned in r0 for int/short/char; in r0 and r1 for long long/double
AAPCS – Procedure Call Standard for ARM Architecture

Parameter Passing (1)
The first four word sized parameters passed to a function will be transferred in
registers r0-r3 (fast & efficient)
– Sub-word sized arguments will still use a whole register
– Arguments larger than a word will be passed in multiple registers (more about 64 bit types later)
– See AAPCS for more details
If more arguments are needed, then the 5th, 6th and subsequent words will be passed on the stack
– Involves extra instructions and memory accesses
Therefore always try to limit arguments to 4 words or fewer
– If unavoidable, place most commonly used parameters in first 4 positions
– Or if arguments are in a structure then pass a pointer to the structure instead
C++ uses the first argument to pass the this pointer to member functions, so only 3
arguments can be passed in registers
Example...

Parameter Passing (4 parameters)
int func1(int a, int b, int c, int d)
{
    return a+b+c+d;
}
int caller1(void)
{
    return func1(1,2,3,4);
}
func1
    ADDS r0, r0, r1
    ADDS r0, r0, r2
    ADDS r0, r0, r3
    BX lr
    :
caller1
    MOVS r3, #4
    MOVS r2, #3
    MOVS r1, #2
    MOVS r0, #1
    B func1

Parameter Passing (6 parameters)
func2
    PUSH {r4,r5,lr}
    ADD r0,r0,r1
    LDRD r4,r5,[sp,#0xc]
    ADD r0,r0,r2
    ADD r0,r0,r3
    ADD r0,r0,r4
    ADD r0,r0,r5
    POP {r4,r5,pc}
caller2
    PUSH {r2,r3,lr}
    MOVS r3,#6
    MOVS r2,#5
    STRD r2,r3,[sp,#0]
    MOVS r3,#4
    MOVS r2,#3
    MOVS r1,#2
    MOVS r0,#1
    BL func2
    POP {r2,r3,pc}
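The earlier advice to pass a pointer to a structure when more than four words are needed can be sketched as follows. The type and function names are hypothetical and not taken from the slides. Under the AAPCS, the six-argument version receives its last two arguments on the stack, whereas the pointer version needs only r0.

```c
/* Hypothetical grouping of six related values. */
typedef struct {
    int a, b, c, d, e, f;
} six_args_t;

/* Six word-sized arguments: a-d arrive in r0-r3, e and f on the stack. */
int sum_six_values(int a, int b, int c, int d, int e, int f)
{
    return a + b + c + d + e + f;
}

/* One pointer argument: arrives in r0; members are loaded as needed. */
int sum_six_struct(const six_args_t *p)
{
    return p->a + p->b + p->c + p->d + p->e + p->f;
}
```

The pointer version trades the caller's stack stores for loads in the callee, so it pays off mainly when the values already live in a structure, as the slide suggests.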

The AAPCS has rules about 64-bit types
– 64-bit types must be 8-byte aligned in memory
– 64-bit arguments to functions must be passed in an even + consecutive odd register (i.e. r0+r1 or r2+r3) or on the stack at an 8-byte aligned location
Registers or stack will be ‘wasted’ if arguments are listed in a sub-optimal order
                                          r0  r1      r2  r3  stack  stack   stack  stack
fy(int a, int c, double b)                a   c       b   b
fx(int a, double b, int c)                a   unused  b   b   c
fz(double a, double b, int c, double d)   a   a       b   b   c      unused  d      d
Remember the hidden this argument in r0 for non-static C++ member functions

Loop Termination (1)
In for(), while() and do…while() loops always use an integer counter
Preferably decrement down to zero, rather than up towards a final value
– Subtract and compare to zero can be done in one instruction (SUBS)
– But must either use an unsigned int counter…
…or test not equal to zero (rather than greater than or equal to zero)
(otherwise the potential wraparound from –ve to +ve prohibits this optimization)
For example, replace:
for (loop = 1; loop <= total; loop++)
with:
for (loop = total; loop != 0; loop--)
Loop limit (total) then only used once at the beginning
– Compiler can reuse this register once the loop counter has been loaded
Resulting code is smaller and faster
Example...

Loop Termination (2)
Count up
int fact1(unsigned int limit)
{
    unsigned int i;
    int fact = 1;
    for (i = 1; i <= limit; i++)
    {
        fact = fact * i;
    }
    return fact;
}
fact1
    MOV r2,r0
    MOVS r0,#1
    MOV r1,r0
    CMP r2,#1
    IT CC
    BXCC lr
|L1.20|
    MUL r0,r1,r0
    ADDS r1,r1,#1
    CMP r1,r2
    BLS |L1.20|
    BX lr
Count down
int fact2(unsigned int limit)
{
    unsigned int i;
    int fact = 1;
    for (i = limit; i != 0; i--)
    {
        fact = fact * i;
    }
    return fact;
}
fact2
    MOVS r1,r0
    MOV r0,#1
    IT EQ
    BXEQ lr
|L1.52|
    MUL r0,r1,r0
    SUBS r1,r1,#1
    BNE |L1.52|
    BX lr
Both examples compiled with -O2 -Otime

Division Operations
Prior to ARMv7, ARM cores contain no division hardware
– Division typically implemented by a run-time library function
– This can take many cycles to execute
int div(int a, int b)
{
return (a / b);
}
div
PUSH {r4,lr}
BL __aeabi_idivmod
POP {r4,pc}
v7-M cores include division hardware
Signed and unsigned divide instructions included in Thumb-2 instruction set
int div(int a, int b)
{
return (a / b);
}
div
SDIV r0,r0,r1
BX lr

Division by Compile-time Constants
Division by compile-time constants is treated as a special case
Division by powers of two will use shift operations
unsigned div2(unsigned n)
{
    return (n / 2);
}
div2
    LSRS r0, r0, #1
    BX lr
– With -O1 and higher (with -Otime), other constants will use a standard long multiply sequence on v7-M cores
unsigned div10(unsigned n)
{
    return (n / 10);
}
div10
    LDR r1, =0xCCCCCCCD
    UMULL r1, r0, r1, r0
    LSRS r0, r0, #3
    BX lr
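The long multiply sequence in the div10 listing can be written out in C to show why it works. This sketch mirrors the UMULL and LSRS instructions: 0xCCCCCCCD is 2^35 / 10 rounded up, so taking the high word of the 64-bit product and shifting right by 3 (35 bits in total) gives n / 10 for every 32-bit unsigned n. The function name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical C rendering of the compiler's div10 sequence. */
uint32_t div10_by_multiply(uint32_t n)
{
    /* UMULL: 32 x 32 -> 64-bit product with the constant 0xCCCCCCCD. */
    uint64_t product = (uint64_t)n * 0xCCCCCCCDu;

    /* Keep the high word (product >> 32), then LSRS #3. */
    uint32_t high = (uint32_t)(product >> 32);
    return high >> 3;
}
```

There is normally no reason to write this by hand, since the compiler applies the transformation itself; the sketch only makes the generated code easier to read.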

Modulo Arithmetic
The remainder operator ‘%’ is commonly used in modulo arithmetic
However, this will be expensive if the modulo value is not a power of two
– Will use hardware divide, if present, or will use division library code
Can be avoided by rewriting C code to use if() statement check
For example, if count has the range 0 to 59, replace
count = (count+1) % 60;
with
if (++count >= 60) count = 0;
modulo
MOVS r1, #0x3c
ADDS r0, r0, #1
BL __aeabi_uidivmod
MOV r0, r1
test_and_reset
ADDS r0, r0, #1
CMP r0, #0x3c
BLT |L1.4|
MOVS r0, #0
|L1.4|
This code is compiled with “-O2”

Floating Point
ARM Cortex-M3 and ARM Cortex-M4 have no hardware floating-point operations
– Compiler generates calls to software floating-point library routines whenever a floating-point operation is required (default option is --fpu=softvfp)
Cortex-M4F supports hardware floating-point operations
float fplib(float num1, float num2)
{
    float temp, temp2;
    temp = num1 + num2;
    temp2 = num2 * num2;
    return temp2-temp;
}
--cpu=Cortex-M3
fplib
    PUSH {r4-r6,lr}
    MOV r4,r1
    BL __aeabi_fadd
    MOV r5,r0
    MOV r1,r4
    MOV r0,r4
    BL __aeabi_fmul
    MOV r1,r5
    POP {r4-r6,lr}
    B.W __aeabi_fsub
--cpu=Cortex-M4F
fplib
    VADD.F32 s0,s0,s1
    VMUL.F32 s1,s1,s1
    VSUB.F32 s0,s1,s0
    BX lr

Variable Types
Global & static variables are held in RAM
– Which requires loads/stores to memory – more later
– External globals also require an extra level of indirection because the compiler needs to
load a pointer to the variable first
Local variables are normally held in registers, for fast & efficient processing
– If the compiler’s register allocator runs out of registers, then locals will be 'spilled' onto the stack
– Taking the address of a variable also forces it to be placed in memory
For local variables, use word-sized (int) variables rather than halfword and byte
– Avoids additional shifts/masks to ensure that variables only occupy correct space within 32-bit register
Example...

Size of Local Variables
int wordsize(int a)
{
    a = a + 1;
    return a;
}
wordsize
    ADDS r0, r0, #1
    BX lr
int halfsize(short b)
{
    b = b + 1;
    return b;
}
halfsize
    ADDS r0, r0, #1
    SXTH r0, r0
    BX lr
int bytesize(char c)
{
    c = c + 1;
    return c;
}
bytesize
    ADDS r0, r0, #1
    UXTB r0, r0
    BX lr
These examples compiled with --cpu=Cortex-M3

Global Data Layout
Global (and static) data is stored in memory, not registers
– Require load / store instruction to access
– So for performance reasons will be aligned on natural size boundaries
ARM compilers will optimize the layout of globals in a module
e.g. declared data in this order
char one;
short two;
char three;
int four;
Declared layout: char, 1 byte of padding, short, char, 3 bytes of padding, int
– 12 bytes (4 bytes of padding)
Optimal layout: the two chars and the short share one word, followed by the int
– 8 bytes (No padding)
Compiler will re-order the data like this

Unaligned Accesses
ARM processors access data in memory most efficiently when on natural
size boundary
– (Multi-)Word access on word boundaries (LDR, STR, LDM, STM)
– Halfword access on halfword boundaries (LDRH, STRH)
– Byte access on byte (any) boundary (LDRB, STRB)
Use the __packed type qualifier to warn the compiler of potential unaligned accesses
– e.g. for byte-oriented network protocols or when porting legacy code
ARMv6 and later processors support unaligned accesses when
appropriately configured
– Must still use __packed to tell compiler the data may be unaligned
Unaligned accesses might cost additional bus cycles
– Trade-off between memory usage and performance
Outcome of an “accidental” unaligned data access is configurable
– Set UNALIGN_TRP bit of Configuration Control Register to detect unaligned accesses and trigger
an unaligned usage fault
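Where the armcc-specific __packed qualifier is not available, a commonly used portable alternative is to copy the bytes with memcpy(). The sketch below is that substitute technique, not the __packed mechanism described above; the function names are hypothetical. Compilers typically lower these small fixed-size copies to whatever load/store sequence is safe for the target.

```c
#include <stdint.h>
#include <string.h>

/* Reads a 32-bit value from an address of any alignment. */
uint32_t load_u32_unaligned(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);   /* byte-wise copy: no alignment assumed */
    return v;
}

/* Writes a 32-bit value to an address of any alignment. */
void store_u32_unaligned(void *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}
```

Both helpers use the host byte order, so a byte-oriented network protocol would still need an explicit endianness conversion on top.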

Packing of structures
What about structures?
The C standard does not permit the compiler to re-order structure members
Members are still naturally aligned for good performance and code size
struct sta
{
    char one;
    short two;
    char three;
    int four;
}a;
Layout: char, 1 byte of padding, short, char, 3 bytes of padding, int
    LDRB r1,[r0,#0]
    LDRSH r2,[r0,#2]
    LDRB r3,[r0,#4]
    LDR r4,[r0,#8]
Marking a structure as __packed will remove any padding
– Useful for accessing structures specified externally or for porting legacy code
– Efficient code generated using unaligned accesses
__packed struct stb
{
    char one;
    short two;
    char three;
    int four;
}b;
Layout: char, short, char, int with no padding
    LDRB r1,[r0,#0]
    LDRSH r2,[r0,#1]
    LDRB r3,[r0,#3]
    LDR r4,[r0,#4]
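The natural-alignment offsets used in the unpacked listing (0, 2, 4 and 8) can be checked with offsetof and sizeof. The sketch below also shows the manual alternative to packing: since the compiler may not re-order structure members, the programmer can order them from largest to smallest. The struct and function names are hypothetical, and the sizes assume a 1-byte char, 2-byte short and 4-byte int with natural alignment, as on ARM.

```c
#include <stddef.h>

/* Members in the order used on the slide. */
struct declared {
    char  one;    /* offset 0, then 1 byte of padding  */
    short two;    /* offset 2                          */
    char  three;  /* offset 4, then 3 bytes of padding */
    int   four;   /* offset 8                          */
};

/* Same members ordered largest first: no padding is needed. */
struct reordered {
    int   four;   /* offset 0 */
    short two;    /* offset 4 */
    char  one;    /* offset 6 */
    char  three;  /* offset 7 */
};

/* Bytes saved by re-ordering instead of packing. */
size_t padding_saved(void)
{
    return sizeof(struct declared) - sizeof(struct reordered);
}
```

Re-ordering keeps every member naturally aligned, so it avoids the unaligned accesses that __packed introduces, at the cost of changing the member order seen by other code.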

Alignment of structures
What does __packed do?
– It sets the alignment of a variable, pointer or all the members of a structure to 1
Structures have the same alignment as their ‘most’ aligned member
– Therefore a packed structure (all members byte aligned) has an alignment of 1
– But marking the whole structure (i.e. all members) __packed may be unnecessary
– Instead define packed members within structures to minimize penalties
__packed struct c
{
    int one;
    char two;
    short three;
};
This structure has 1-byte alignment
    LDR r0,[r4,#0]
    LDRB r1,[r4,#4]
    LDRSH r2,[r4,#5]
struct d
{
    int one;
    char two;
    __packed short three;
};
This version has 4-byte alignment so a byte of padding is added
    LDR r0,[r4,#0]
    LDRB r1,[r4,#4]
    LDRSH r2,[r4,#5]

Agenda
Mixing C and Assembler

Mixing C and Assembly
C/C++ and assembly can easily be mixed to
– Access processor features which are not available from C
– Generate highly optimized code
Easy to make function calls between C, C++ and Assembly
– Just be sure to conform to the procedure calling standard…
…and import and export the relevant symbols

Calling Assembly from C/C++ (1)
Define the routine in assembly and export its name
Call directly from C just like any other function
– Provide a function prototype in C
– Disable C++ name mangling with extern "C" if using the C++ compiler
Link as normal
extern void mystrcopy(char *d, const char *s);
int main(void)
{
    const char *src = "Source";
    char dest[10];
    ...
    mystrcopy(dest, src);
    ...
}
    AREA StringCopy,CODE,READONLY
    EXPORT mystrcopy
mystrcopy PROC
    LDRB r2, [r1], #1
    STRB r2, [r0], #1
    CMP r2, #0
    BNE mystrcopy
    BX lr
    ENDP
    END

Calling Assembly from C/C++ (2)
Where possible use CMSIS functions or compiler intrinsics
– e.g. __nop(), __disable_irq()
Compiler also contains an Embedded assembler...
– Write complete functions in assembly language
– No optimization

CMSIS
ARM Cortex Microcontroller Software Interface Standard (CMSIS)
– Vendor-independent hardware abstraction layer for the Cortex-M series of cores
Provides C language access to core features
– Access to internal registers
– Helper functions for common core tasks
– Internal address definitions for core memory map
– Intrinsics for certain common assembly tasks
Example: function to set interrupt priority mask
__ASM void __set_PRIMASK(uint32_t priMask)
{
msr primask, r0
bx lr
}
Available for download from http://www.onarm.com/

Intrinsics
C/C++ standards do not define core-specific functionality
– The ARM Compiler intrinsics provide extra features to realize these
operations.
The ARM Compiler supports various families of intrinsics for operations that cannot be generated directly from C/C++ code
– Generic intrinsics: __current_pc, __current_sp, __return_address, ...
– IRQ/FIQ intrinsics: __disable_irq, __enable_irq, ...
– Optimization barriers: __schedule_barrier, __force_stores, ...
– Native instructions: __isb, __dsb,...
