Software Debug & Optimization for ARM® Cortex®-M Microcontrollers
AAME TechCon 2013
TC004v02
Agenda
Tools & Debug Configuration
Debug Components
Physical Debug Interfaces
Compiler Configuration
Introduction to Optimization
Mixing C and Assembler
Keil® MDK
Low-cost tools for ARM7, ARM9, ARM Cortex-M and ARM Cortex-R4 MCUs
– Extensive device support
– Core and peripheral simulation
– Flash support
Microcontroller Development Kit (MDK)
– µVision IDE
– ARM Compiler, optimized run-time library, KEIL RTX RTOS
– Real-time trace (for Cortex-M3 and Cortex-M4 based devices)
Real-Time Library
– Keil RTX RTOS + Source Code
– TCP networking suite, Flash File System, CAN Driver Library, USB Device Interface
Debug Hardware
Evaluation boards
ARM Cortex-M3/M4 Debug Features
Compliant with ARMv7-M Debug Architecture (CoreSight™ based)
Traditional ARM Debug Features
– Two debug modes (Halt mode and Monitor mode)
– Two stepping modes (with and without interrupts taken)
– BKPT instruction
– Vector Catch
– Optional Embedded Trace Macrocell (ETM)
CoreSight Debug Features
– Flash Patch and Breakpoint (FPB)
– Instruction Breakpoints and Code Patching
– Data Watchpoint and Trace (DWT)
– Hardware Watchpoints, Event Counters and PC Sampling
– Instrumentation Trace Macrocell (ITM)
– Low bandwidth trace driven by application software or DWT
– Serial Wire Viewer
ARM Cortex-M3/M4 CoreSight Overview
Utilizes a Debug Access Port (DAP)
– Consists of a Debug Port (DP) + AHB Access Port (AP)
Debug Port (DP) has 2 implementation options
– SWJ-DP – supports Serial Wire (2-pin) and conventional JTAG interface
– SW-DP – supports Serial Wire (2-pin) only
AHB Access Port (AP)
– Provides AHB-Lite access to core, memory and debug components
– All Debug registers are memory mapped
– Traditional internal scan chains no longer utilized
– No coprocessors on Cortex-M3 (CP14 was traditionally the debug coprocessor)
[Diagram: two DAP options. SWJ-DP + AHB-AP accepts Serial Wire or JTAG; SW-DP + AHB-AP accepts Serial Wire only.]
Cortex-M3/M4 Debug Access Paths
[Diagram: the SW/J-DP and AHB-AP (together the DAP) access the Cortex-M3 core and the bus matrix. The DWT, FPB and ITM sit on the internal Private Peripheral Bus (AHB); the ETM, TPIU and ROM table sit on the external Private Peripheral Bus (APB).]
Halted Debug Mode
Traditional start/stop debug
– Core executes and then halts in Debug mode
Debug Fault Status Register (DFSR) identifies the type of Debug event
– EXTERNAL EDBGRQ input asserted from other SoC component
– VCATCH Vector Catch triggered
– DWTTRAP Data access to address matching a Watchpoint
– BKPT BKPT instruction executed
– HALTED Halt request from debugger (or stepping in debug)
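The DFSR flags listed above can be decoded in software. A minimal sketch, assuming the ARMv7-M bit positions (HALTED = bit 0, BKPT = bit 1, DWTTRAP = bit 2, VCATCH = bit 3, EXTERNAL = bit 4); the helper name dfsr_reason is hypothetical, and on a target the value would be read from the DFSR at 0xE000ED30:

```c
#include <stdint.h>

/* DFSR bit positions (ARMv7-M). */
#define DFSR_HALTED   (1u << 0)
#define DFSR_BKPT     (1u << 1)
#define DFSR_DWTTRAP  (1u << 2)
#define DFSR_VCATCH   (1u << 3)
#define DFSR_EXTERNAL (1u << 4)

/* Hypothetical helper: name the highest-numbered debug event flag set in a DFSR value. */
const char *dfsr_reason(uint32_t dfsr)
{
    if (dfsr & DFSR_EXTERNAL) return "EXTERNAL";
    if (dfsr & DFSR_VCATCH)   return "VCATCH";
    if (dfsr & DFSR_DWTTRAP)  return "DWTTRAP";
    if (dfsr & DFSR_BKPT)     return "BKPT";
    if (dfsr & DFSR_HALTED)   return "HALTED";
    return "none";
}
```

More than one flag can be set at the same time, so a real debugger would normally report every set bit rather than only the first match.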
BKPT instruction
– Debugger replaces original instruction with BKPT for software breakpoint
– Synthesized by FPB unit for a hardware breakpoint
Vector Catch
Mechanism traps selected exceptions
– Core halts when exception is asserted
– No DWT / Breakpoint resources utilized
– Suitable for early software development
– Selection made through debugger
Following exceptions may be trapped
– Reset
– Hard Fault
– Memory Management Fault
– Bus Fault
– Usage Fault
– Exception Service Error
Note - cannot catch interrupts this way
– Unlike other ARM and Cortex-A/R cores
– Use breakpoint in interrupt handler
Address      Vector
0x00         Initial Main SP
0x04         Reset
0x08         NMI
0x0C         Hard Fault
0x10         Memory Manage
0x14         Bus Fault
0x18         Usage Fault
0x1C - 0x28  Reserved
0x2C         SVCall
0x30         Debug Monitor
0x34         Reserved
0x38         PendSV
0x3C         SysTick
0x40         IRQ0
…            More IRQs
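The table above follows a simple rule: each vector is one word, so the offset is four times the exception number, and IRQ0 is exception 16. A small sketch of that arithmetic (the function names are hypothetical):

```c
#include <stdint.h>

/* Byte offset of the vector for a given exception number (1 = Reset, 2 = NMI, 3 = Hard Fault, ...). */
uint32_t exception_vector_offset(uint32_t exception_number)
{
    return 4u * exception_number;
}

/* Byte offset of the vector for external interrupt IRQn; IRQ0 is exception number 16. */
uint32_t irq_vector_offset(uint32_t irq)
{
    return exception_vector_offset(16u + irq);
}
```

These offsets are relative to the start of the vector table, which sits at address 0x00 in the table above.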
Reset
Core has 3 different reset inputs
– PORESETn - Power-on reset for Cortex-M3 system
– SYSRESETn - System reset for processor (debug components not reset)
– DAPRESETn - AHB-AP reset
Software Generated Resets
– VECTRESET bit in Application Interrupt and Reset Control Register
– Equivalent to asserting SYSRESETn
– Software reset option is available in Keil/MDK and DS-5 Development Studio
– Use CTRL_REG for the “Reset Type”
– Core is safely reset without asserting nSRST JTAG signal
– SYSRESETREQ bit in Application Interrupt and Reset Control Register
– Sends a request for a reset to the system
– Reset is generated by customer-defined reset controller (not the M3)
– Other components in the system other than the Cortex-M3 may be affected
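As an illustration of the two software reset requests above, the sketch below builds the word that would be written to the Application Interrupt and Reset Control Register. It assumes the ARMv7-M layout (VECTKEY 0x05FA in bits [31:16], VECTRESET in bit 0, SYSRESETREQ in bit 2); the helper name aircr_reset_word is hypothetical.

```c
#include <stdint.h>

#define AIRCR_ADDR         0xE000ED0Cu      /* Application Interrupt and Reset Control Register */
#define AIRCR_VECTKEY      (0x05FAu << 16)  /* writes are ignored without this key */
#define AIRCR_VECTRESET    (1u << 0)
#define AIRCR_SYSRESETREQ  (1u << 2)

/* Hypothetical helper: build the AIRCR write value for the chosen reset request. */
uint32_t aircr_reset_word(int use_sysresetreq)
{
    uint32_t request = use_sysresetreq ? AIRCR_SYSRESETREQ : AIRCR_VECTRESET;
    return AIRCR_VECTKEY | request;
}
```

On a target the word would be stored through a volatile pointer to AIRCR_ADDR, normally after preserving the PRIGROUP field of the current register value. For the SYSRESETREQ case, CMSIS already provides NVIC_SystemReset().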
Agenda
Tools & Debug Configuration
Debug Components
Physical Debug Interfaces
Compiler Configuration
Introduction to Optimization
Mixing C and Assembler
Flash Patch and Breakpoint Unit (FPB)
Flash Patching
– Allows runtime patching of firmware
– Remaps reads from the Code space to System space using a Patch Table
– Total of 8 addresses may be patched
– 6 instruction comparators (for instruction fetches from Code space)
– 2 literal comparators (for literal data loads from Code space)
– Only reads are patched
– Writes will be performed as normal
– Intended Usage
– ROM-based designs (costly fix)
– Firmware field upgrades
Hardware Breakpoints – maximum of 6
– The 6 instruction comparators can return a BKPT to halt the core
– Instruction comparators are shared with Flash Patch functionality
– If 3 instructions are flash patched, only 3 hardware breakpoints are available
Data Watchpoint and Trace (DWT)
DWT component useful for Debug, Trace and Profiling
– Enabled by setting TRCENA bit in Debug Exception and Monitor Control Register (DEMCR)
Debug Support
– Traditional data watchpoint for halt mode debug
– Can break on [data value && data address] match (x1)
Trace Support
– Generate trace trigger for Embedded Trace Macrocell (ETM)
Profiling / Event Support
– Provides non-invasive view of application execution
– Packets output through Instrumentation Trace Macrocell (ITM)
– Must have debug tools connected to view the output
– Packets generated for selected events of interest
– Data Address matching
– Periodic PC Sampling
– Exception Entry, Exit and Return
– Hardware performance counting
DWT Block Diagram
DWT interfaces to the Core, ETM and ITM
ITM communication is packet based
– Packets defined in the ARMv7-M Architecture Reference Manual
[Block diagram: inside the DWT, a comparator bank, event counters and a cycle counter monitor the Cortex-M3 core. The DWT can send a break request to the core, a trigger to the ETM, and packets to the ITM. The ETM is optional.]
Instrumentation Trace Macrocell (ITM)
Generates and outputs Trace Packets
Packet types (in priority order):
– Software trace
– Software can write directly to ITM stimulus registers, causing packets to be emitted
– Similar to using printf() to debug a C program
– Hardware trace
– Packets are generated by the DWT and emitted by the ITM
– Timestamps
Timestamp Packets
– Must be enabled in ITM Trace Control Register
– Local Timestamp (differential) value generated from 21-bit counter
– Counter clocked from either core clock or TPIU clock
– Global Timestamp (absolute) value generated from 48-bit counter
– Packet generation
– When any other trace packet is generated (which resets timestamp counter)
– When timestamp counter overflows
ARM Cortex-M3/M4 and ITM
[Diagram: Cortex-M3 macrocell containing the core, DWT, ITM and ETM (with a trigger input); trace travels over ATB to the TPIU, which drives SWO and TraceData[3:0]; the SW/SWJ-DP connects over APB; local and global timestamp sources are shown.]
Embedded Trace Macrocell (ETM)
Optional non-invasive debug component
ETM Hardware monitors activity of processor
Trace allows:
– Historical debug of sequences leading up to events of interest
– e.g. System crash on peripheral access during overnight testing
– Debug of events in real-time systems where the target cannot be halted
– Hard Disk drives, Engine Management
– Visibility of accesses inside a SoC
– To internal memories/peripherals
– Software profiling and code coverage
ARM Cortex-M3/M4 and ETM
[Diagram: Cortex-M3 macrocell containing the core, DWT, ITM and ETM (with a trigger input); trace travels over ATB to the TPIU, which drives SWO and TraceData[3:0]; the SW/SWJ-DP connects over APB; local and global timestamp sources are shown.]
TPIU Interface / Serial Wire Output
Formats and serializes data from ETM and ITM
Trace data clocked out asynchronous to core clock
– TRACECLK derived from TRACECLKIN
ETM packets output over Trace Port (TRACECLK and TRACEDATA [3:0])
– Data decompressed with a conventional Trace Port Analyzer
ITM/DWT packets optionally output over Serial Wire Output (SWO)
– SWO also referred to as “Serial Wire Viewer”
– Data decompressed with an Event Viewer
[Diagram: ETM and ITM each feed a FIFO inside the TPIU; a formatter and serializer, clocked from TRACECLKIN, drive TRACECLK, TRACEDATA[3:0] and SWO.]
Agenda
Tools & Debug Configuration
Debug Components
Physical Debug Interfaces
Compiler Configuration
Introduction to Optimization
Mixing C and Assembler
Physical Interfaces
New 10-pin and 20-pin interfaces available
– Higher pin density (0.05”) over standard JTAG IDC interface (0.10”)
– Reduces connector footprint
Serial Wire Debug is the preferred solution
– SWO likely not available when using JTAG interface
Trace can use a 20-pin connector
– Legacy 38-pin Mictor connector not recommended
Consult CoreSight Component TRM or Device Data Sheet
Samtec FTSH-110 Connector
Signal     Pin  Pin  Signal
VTref      1    2    SWDIO / TMS
GND        3    4    SWCLK / TCK
GND        5    6    SWO / TDO
KEY        7    8    NC/EXTb / TDI
GNDDetect  9    10   nRESET
Samtec FTSH-120 Connector
Signal          Pin  Pin  Signal
VTref           1    2    SWDIO / TMS
GND             3    4    SWCLK / TCK
GND             5    6    SWO/EXTa/TRACECTL / TDO
KEY             7    8    NC/EXTb / TDI
GNDDetect       9    10   nRESET
GND/TgtPwr+Cap  11   12   TRACECLK
GND/TgtPwr+Cap  13   14   TRACEDATA[0]
GND             15   16   TRACEDATA[1]
GND             17   18   TRACEDATA[2]
GND             19   20   TRACEDATA[3]
Agenda
Tools & Debug Configuration
Debug Components
Physical Debug Interfaces
Compiler Configuration
Introduction to Optimization
Mixing C and Assembler
Language Support
Single compiler armcc can compile standard ISO C/C++
Source language modes
– ISO C90
– 1990 C standard, compile option --c90 (default)
– ISO C99
– 1999 C standard, compile option --c99
– ISO C++
– 2003 C++ standard, compile option --cpp
Language compliance
– Default mode supports several common extensions
– Strict mode enforces compliance with language standard: --strict
– GNU mode offers partial support for GCC extensions: --gnu
Variable types supported
The compiler supports these basic types
Type        Description
int / long  32-bit (word) integer
short       16-bit (half-word) integer
char        8-bit byte, unsigned by default
long long   64-bit integer
float       32-bit single-precision IEEE floating point
double      64-bit double-precision IEEE floating point
bool        8-bit Boolean (C++ only)
wchar_t     16-bit “wide character” type (C++ only)
Pointers    32-bit integer addresses
Optimization Levels
Level of optimizations carried out by the compiler is selectable
-O0
– Minimum optimization
– The least optimized code, but with the best debug view
-O1
– Restricted optimization
– Optimized code and a good debug view
-O2 (default)
– High optimization
– Well optimized code but with limited debug view
-O3
– More aggressive optimization, weighted toward -Ospace / -Otime choice
– Enables multifile compilation by default (more later)
Select optimization for code size or execution speed with -Ospace (default) or -Otime
Use -g or --debug to generate source level debug information
Selecting an Architecture or Core
Each new version of the ARM Architecture typically supports extra instructions and modes of operation
Implementation of an architecture version may vary between cores
– Use the most specific setting you can when compiling
Inform the compiler of the architecture or processor
– The default CPU setting is ARM7TDMI (Architecture 4T)
– Either specify an architecture version, or a specific core
--cpu 7-M (Do not prefix with a ‘v’)
--cpu Cortex-M3
Some examples of features the compiler and libraries can take advantage of:
– UDIV and SDIV (7-M and 7-R)
– REV (v6) can be used to reverse byte endianness
– Unaligned memory access (v6)
When using the Cortex-M3 it is essential to specify 7-M or Cortex-M3 to ensure the correct (Thumb only) libraries are used
Agenda
Tools & Debug Configuration
Debug Components
Physical Debug Interfaces
Compiler Configuration
Introduction to Optimization
Mixing C and Assembler
Using “volatile”
int f(int *p)
{
    return (*p == *p);
}
armcc output (the two reads are assumed identical, so the comparison is folded to 1):
f
MOVS r0, #1
BX lr
int f(volatile int *p)
{
    return (*p == *p);
}
armcc output (both reads are performed and compared):
f
LDR r1,[r0,#0]
LDR r0,[r0,#0]
CMP r1,r0
ITE NE
MOVNE r0,#0
MOVEQ r0,#1
BX lr
This code is compiled with “-O2 -Otime --cpu=Cortex-M3”
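A common practical use of volatile is a flag shared between an interrupt handler and the main loop. The sketch below is illustrative only: tick_isr and consume_tick are hypothetical names, and the handler is an ordinary function standing in for a real interrupt service routine.

```c
#include <stdint.h>

/* Shared between an interrupt handler and the main loop, so it is declared volatile. */
static volatile uint32_t tick_flag;

/* Stand-in for an interrupt handler (hypothetical name). */
void tick_isr(void)
{
    tick_flag = 1u;
}

/* Returns 1 and clears the flag if a tick is pending, otherwise returns 0. */
int consume_tick(void)
{
    if (tick_flag != 0u) {
        tick_flag = 0u;
        return 1;
    }
    return 0;
}
```

Without the volatile qualifier the compiler would be free to keep tick_flag in a register, and a polling loop built on consume_tick might never observe the handler's write.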
Instruction Scheduling
Instruction scheduling is enabled at -O1 and higher
– Instructions are re-ordered to suit the core on which the code will run
– Improves throughput by minimizing interlocks
– Select processor (--cpu) to determine algorithm used
For example:
int f(int *p, int x) { return *p + x * 3; }
Without scheduling (-O0)
MOV r2,r0
ADD r3,r1,r1,LSL #1
LDR r0,[r2,#0]
ADD r0,r0,r3
BX lr
With scheduling (-O1, -O2, -O3)
LDR r0,[r0,#0]
ADD r1,r1,r1,LSL #1
ADD r0,r0,r1
BX lr
Compiler never re-orders instructions if this would change the behavior
Inlining of functions
Inlining can improve performance, at the expense of a larger image
– Body of inlined function inserted directly into the calling code wherever it is called
– Only possible if caller and callee are in same compilation unit (except --multifile)
The compiler can inline functions automatically
– Normally no need to annotate your source code, or use any special switches
Factors that influence auto-inlining include
– Whether the function is marked with __inline ‘hint’
– Optimization level and -Otime / -Ospace
– How many places the function is called
– Size of the function
– Whether the function has external or static linkage
To force a function to be inlined, either use --forceinline with __inline or use __forceinline
Any non-static function that gets auto-inlined has an out-of-line version generated too (another reason to use static)
– Increases code size
– More complex debug view
Loop Transformation
The compiler can transform and restructure loops automatically
– Enabled with -O3 -Otime
Loop unrolling reduces loop overhead at the cost of increase in code size
Original loop:
for (i = 0; i < 100; i++)
{
    c[i] = b[i] + 1;
}
Unrolled loop:
for (i = 0; i < 100; i += 4)
{
    c[i + 0] = b[i + 0] + 1;
    c[i + 1] = b[i + 1] + 1;
    c[i + 2] = b[i + 2] + 1;
    c[i + 3] = b[i + 3] + 1;
}
Loop re-rolling
– Recognize manually unrolled loops, re-roll and unroll optimally
Loops with constant, low iteration count may be unrolled completely
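To make the unrolling transformation concrete, here is a self-contained sketch that runs the rolled and the four-way unrolled loop on the same input and compares the results. The function names and the ramp input are illustrative assumptions; the loop bodies mirror the slide.

```c
#define N 100  /* must be a multiple of 4 for the unrolled version below */

/* Rolled form: one element per iteration. */
void add_one_rolled(int *c, const int *b)
{
    int i;
    for (i = 0; i < N; i++) {
        c[i] = b[i] + 1;
    }
}

/* Unrolled by four: a quarter of the loop-control overhead, more code. */
void add_one_unrolled(int *c, const int *b)
{
    int i;
    for (i = 0; i < N; i += 4) {
        c[i + 0] = b[i + 0] + 1;
        c[i + 1] = b[i + 1] + 1;
        c[i + 2] = b[i + 2] + 1;
        c[i + 3] = b[i + 3] + 1;
    }
}

/* Returns 1 if both versions produce identical output for a simple ramp input. */
int unrolled_matches_rolled(void)
{
    int b[N], c1[N], c2[N];
    int i;
    for (i = 0; i < N; i++) {
        b[i] = i * 3;
    }
    add_one_rolled(c1, b);
    add_one_unrolled(c2, b);
    for (i = 0; i < N; i++) {
        if (c1[i] != c2[i]) {
            return 0;
        }
    }
    return 1;
}
```

Unrolling by hand is rarely needed when the compiler performs this step itself; the sketch only shows that the two forms are equivalent when the iteration count is a multiple of the unroll factor.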
Branch Target Optimization (1)
The ARM Compiler implements a performance optimization to ensure that a loop branch target is not an unaligned 32-bit instruction
– Available when compiling at -O3
– Improves EEMBC performance by 1.5%
The compiler will try first to widen the instruction before the branch target
– If widening is not possible then it will insert a 16-bit NOP
Two examples on the following slides
– The first shows an instruction being widened to align a loop
– The second shows a NOP being inserted to align a loop
Branch Target Optimization (2)
Instruction being widened to align a loop target
int foo1(int a[16]) {
    int i;
    int total = 0;
    for (i=0; i<8; i++) {
        total += a[i];
    }
    return total;
}
foo1
0x00000000: 4602 .F MOV r2,r0
0x00000002: 2000 . MOVS r0,#0
0x00000004: ea4f0100 O... MOV.W r1,r0 <<<< widened MOV
loop 0x00000008: f8523021 R.!0 LDR r3,[r2,r1,LSL #2]
0x0000000c: 1c49 I. ADDS r1,r1,#1
0x0000000e: 4418 .D ADD r0,r0,r3
0x00000010: 2908 .) CMP r1,#8
0x00000012: dbf9 .. BLT {pc}-0xa ; 0x8 loop
0x00000014: 4770 pG BX lr
Branch Target Optimization (3)
NOP being inserted to align a loop target
int foo2(int a[16], int j) {
    int total = 0;
    int i = 0;
    if (a[0]!=0) {
        for (i=0; i<8; i++) {
            total += a[i] + a[i+1];
        }
    }
    return total;
}
foo2
0x00000000: b510 .. PUSH {r4,lr}
0x00000002: 4602 .F MOV r2,r0
0x00000004: 2000 . MOVS r0,#0
0x00000006: 6813 .h LDR r3,[r2,#0]
0x00000008: 4601 .F MOV r1,r0
0x0000000a: 2b00 .+ CMP r3,#0
0x0000000c: d00a .. BEQ {pc}+0x18 ; 0x24
0x0000000e: bf00 .. NOP <<<< added NOP
loop 0x00000010: eb020481 .... ADD r4,r2,r1,LSL #2 <<<< 32 bit aligned
0x00000014: f8523021 R.!0 LDR r3,[r2,r1,LSL #2]
0x00000018: 6864 dh LDR r4,[r4,#4]
0x0000001a: 1c49 I. ADDS r1,r1,#1
0x0000001c: 4423 #D ADD r3,r3,r4
0x0000001e: 2908 .) CMP r1,#8
0x00000020: 4418 .D ADD r0,r0,r3
0x00000022: dbf5 .. BLT {pc}-0x12 ; 0x10 loop
0x00000024: bd10 .. POP {r4,pc}
Register Usage
Register   Usage
r0-r3*     Arguments into function, result(s) from function, otherwise corruptible (additional parameters passed on stack)
r4-r11     Register variables, must be preserved
r12*       Scratch register (corruptible)
r13/sp     Stack Pointer
r14/lr*    Link Register
r15/pc*    Program Counter
The compiler has a set of rules known as a Procedure Call Standard that determine how to pass parameters to a function (see AAPCS)
CPSR flags may be corrupted by function call
Assembler code which links with compiled code must follow the AAPCS at external interfaces
The AAPCS is part of the ABI for the ARM Architecture
Registers marked with a star are automatically pushed on to the stack when an exception occurs
The xPSR (processor state) is also pushed to the stack
- r14 can be used as a temporary once value stacked
- AAPCS requires that sp be 8-byte (2 word) aligned at externally visible boundaries
Register Usage (2)
Caller
...
BL foo
...
– Parameters passed in r0-r3
– May need to save r0-r3, r12
– Do not need to save r4-r11
Callee
foo
PUSH {r4-r11, lr}
...
POP {r4-r11, pc}
– Must preserve r4-r11, lr if used by callee
– May corrupt r0-r3, r12
Value returned in r0 for int/short/char; in r0 and r1 for long long/double
AAPCS – Procedure Call Standard for ARM Architecture
Parameter Passing (1)
The first four word sized parameters passed to a function will be transferred in registers r0-r3 (fast & efficient)
– Sub-word sized arguments will still use a whole register
– Arguments larger than a word will be passed in multiple registers (more about 64 bit types later)
– See AAPCS for more details
If more arguments are needed, then the 5th, 6th and subsequent words will be passed on the stack
– Involves extra instructions and memory accesses
Therefore always try to limit arguments to 4 words or fewer
– If unavoidable, place most commonly used parameters in first 4 positions
– Or if arguments are in a structure then pass a pointer to the structure instead
C++ uses the first argument to pass the this pointer to member functions, so only 3 arguments can be passed in registers
Example...
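The advice to pass a pointer to a structure instead of many separate arguments can be sketched as follows. The type six_params and both function names are hypothetical and chosen only for illustration.

```c
#include <stdint.h>

/* Hypothetical parameter block; field names are illustrative only. */
typedef struct {
    int32_t a, b, c, d, e, f;
} six_params;

/* Six separate arguments: under the AAPCS, e and f are passed on the stack. */
int32_t sum6_args(int32_t a, int32_t b, int32_t c,
                  int32_t d, int32_t e, int32_t f)
{
    return a + b + c + d + e + f;
}

/* One pointer argument: it fits in r0 and the fields are loaded as needed. */
int32_t sum6_ptr(const six_params *p)
{
    return p->a + p->b + p->c + p->d + p->e + p->f;
}
```

The pointer form pays off mainly when the values already live in a structure; building a structure only to make a single call would simply move the stores from the stack to the structure.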
Parameter Passing (2)
Parameter Passing (4 parameters)
int func1(int a, int b, int c, int d)
{
    return a+b+c+d;
}
int caller1(void)
{
    return func1(1,2,3,4);
}
func1
ADDS r0, r0, r1
ADDS r0, r0, r2
ADDS r0, r0, r3
BX lr
:
caller1
MOVS r3, #4
MOVS r2, #3
MOVS r1, #2
MOVS r0, #1
B func1
Parameter Passing (6 parameters)
func2
PUSH {r4,r5,lr}
ADD r0,r0,r1
LDRD r4,r5,[sp,#0xc]
ADD r0,r0,r2
ADD r0,r0,r3
ADD r0,r0,r4
ADD r0,r0,r5
POP {r4,r5,pc}
caller2
PUSH {r2,r3,lr}
MOVS r3,#6
MOVS r2,#5
STRD r2,r3,[sp,#0]
MOVS r3,#4
MOVS r2,#3
MOVS r1,#2
MOVS r0,#1
BL func2
POP {r2,r3,pc}
Parameter Passing (3)
The AAPCS has rules about 64-bit types
– 64-bit types must be 8-byte aligned in memory
– 64-bit arguments to functions must be passed in an even + consecutive odd register (i.e. r0+r1 or r2+r3) or on the stack at an 8-byte aligned location
Registers or stack will be ‘wasted’ if arguments are listed in a sub-optimal order
Function                                 r0  r1      r2  r3  stack  stack   stack  stack
fy(int a, int c, double b)               a   c       b   b
fx(int a, double b, int c)               a   unused  b   b   c
fz(double a, double b, int c, double d)  a   a       b   b   c      unused  d      d
Remember the hidden this argument in r0 for non-static C++ member functions
Loop Termination (1)
In for(), while() and do…while() loops always use an integer counter
Preferably decrement down to zero, rather than up towards a final value
– Subtract and compare to zero can be done in one instruction (SUBS)
– But must either use an unsigned int counter…
…or test not equal to zero (rather than greater than or equal to zero)
(otherwise the potential wraparound from negative to positive prohibits this optimization)
For example, replace:
for (loop = 1; loop <= total; loop++)
with:
for (loop = total; loop != 0; loop--)
Loop limit (total) then only used once at the beginning
– Compiler can reuse this register once the loop counter has been loaded
Resulting code is smaller and faster
Loop Termination (2)
Count up
int fact1(unsigned int limit)
{
    unsigned int i;
    int fact = 1;
    for (i = 1; i <= limit; i++)
    {
        fact = fact * i;
    }
    return fact;
}
fact1
MOV r2,r0
MOVS r0,#1
MOV r1,r0
CMP r2,#1
IT CC
BXCC lr
|L1.20|
MUL r0,r1,r0
ADDS r1,r1,#1
CMP r1,r2
BLS |L1.20|
BX lr
Count down
int fact2(unsigned int limit)
{
    unsigned int i;
    int fact = 1;
    for (i = limit; i != 0; i--)
    {
        fact = fact * i;
    }
    return fact;
}
fact2
MOVS r1,r0
MOV r0,#1
IT EQ
BXEQ lr
|L1.52|
MUL r0,r1,r0
SUBS r1,r1,#1
BNE |L1.52|
BX lr
Both examples compiled with -O2 -Otime
Division Operations
Prior to ARMv7, ARM cores contain no division hardware
– Division typically implemented by a run-time library function
– This can take many cycles to execute
int div(int a, int b)
{
return (a / b);
}
div
PUSH {r4,lr}
BL __aeabi_idivmod
POP {r4,pc}
v7-M cores include division hardware
Signed and unsigned divide instructions included in Thumb-2 instruction set
int div(int a, int b)
{
return (a / b);
}
div
SDIV r0,r0,r1
BX lr
Division by Compile-time Constants
Division by compile-time constants is treated as a special case
Division by powers of two will use shift operations
unsigned div2(unsigned n)
{
    return (n / 2);
}
div2
LSRS r0, r0, #1
BX lr
– With -O1 and higher (with -Otime), other constants will use a standard long multiply sequence on v7-M cores
unsigned div10(unsigned n)
{
    return (n / 10);
}
div10
LDR r1, =0xCCCCCCCD
UMULL r1, r0, r1, r0
LSRS r0, r0, #3
BX lr
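The div10 listing can be mirrored step by step in C to see why it works: the 64-bit product with 0xCCCCCCCD is formed, the high word is kept (the UMULL result in r0), and that word is shifted right by 3. The function name div10_by_multiply is hypothetical.

```c
#include <stdint.h>

/* Same steps as the div10 listing: multiply by 0xCCCCCCCD, keep the high word, shift right by 3. */
uint32_t div10_by_multiply(uint32_t n)
{
    uint64_t product = (uint64_t)n * 0xCCCCCCCDu;  /* UMULL r1, r0, r1, r0 */
    uint32_t high = (uint32_t)(product >> 32);     /* high word, left in r0 */
    return high >> 3;                              /* LSRS r0, r0, #3 */
}
```

The constant is approximately 2^35 / 10, so the multiply followed by a total shift of 35 bits yields n / 10 for unsigned 32-bit inputs.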
Modulo Arithmetic
The remainder operator ‘%’ is commonly used in modulo arithmetic
However, this will be expensive if the modulo value is not a power of two
– Will use hardware divide, if present, or will use division library code
Can be avoided by rewriting C code to use if() statement check
For example, if count has the range 0 to 59, replace
count = (count+1) % 60;
with
if (++count >= 60) count = 0;
modulo
MOVS r1, #0x3c
ADDS r0, r0, #1
BL __aeabi_uidivmod
MOV r0, r1
test_and_reset
ADDS r0, r0, #1
CMP r0, #0x3c
BLT |L1.4|
MOVS r0, #0
|L1.4|
This code is compiled with “-O2”
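The two forms above can be checked against each other over the whole 0 to 59 range. A minimal sketch, with hypothetical function names:

```c
/* Wrap using the remainder operator. */
unsigned next_with_modulo(unsigned count)
{
    return (count + 1u) % 60u;
}

/* Wrap using a compare and reset, avoiding the division. */
unsigned next_with_test(unsigned count)
{
    if (++count >= 60u) {
        count = 0u;
    }
    return count;
}

/* Returns 1 if both versions agree for every count in the range 0..59. */
int wrap_versions_agree(void)
{
    unsigned count;
    for (count = 0u; count < 60u; count++) {
        if (next_with_modulo(count) != next_with_test(count)) {
            return 0;
        }
    }
    return 1;
}
```

The equivalence only holds while count stays inside 0 to 59 on entry, which is the precondition the slide states.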
Floating Point
ARM Cortex-M3 and ARM Cortex-M4 have no hardware floating-point operations
– Compiler generates calls to software floating-point library routines whenever a floating point operation is required (default option is --fpu=softvfp)
Cortex-M4F supports hardware floating-point operations
float fplib(float num1, float num2)
{
    float temp, temp2;
    temp = num1 + num2;
    temp2 = num2 * num2;
    return temp2-temp;
}
--cpu=Cortex-M3
fplib
PUSH {r4-r6,lr}
MOV r4,r1
BL __aeabi_fadd
MOV r5,r0
MOV r1,r4
MOV r0,r4
BL __aeabi_fmul
MOV r1,r5
POP {r4-r6,lr}
B.W __aeabi_fsub
--cpu=Cortex-M4F
fplib
VADD.F32 s0,s0,s1
VMUL.F32 s1,s1,s1
VSUB.F32 s0,s1,s0
BX lr
Variable Types
Global & static variables are held in RAM
– Which requires loads/stores to memory – more later
– External globals also require an extra level of indirection because the compiler needs to load a pointer to the variable first
Local variables are normally held in registers, for fast & efficient processing
– If the compiler’s register allocator runs out of registers, then locals will be 'spilled' onto the stack
– Taking the address of a variable also forces it to be placed in memory
For local variables, use word-sized (int) variables rather than halfword and byte
– Avoids additional shifts/masks to ensure that variables only occupy correct space within 32-bit register
Example...
Size of Local Variables
int wordsize(int a)
{
    a = a + 1;
    return a;
}
wordsize
ADDS r0, r0, #1
BX lr
int halfsize(short b)
{
    b = b + 1;
    return b;
}
halfsize
ADDS r0, r0, #1
SXTH r0, r0
BX lr
int bytesize(char c)
{
    c = c + 1;
    return c;
}
bytesize
ADDS r0, r0, #1
UXTB r0, r0
BX lr
These examples compiled with --cpu=Cortex-M3
Global Data Layout
Global (and static) data is stored in memory, not registers
– Require load / store instruction to access
– So for performance reasons will be aligned on natural size boundaries
ARM compilers will optimize the layout of globals in a module
e.g. declared data in this order
char one;
short two;
char three;
int four;
Declared layout: char, short, char, int
– 12 bytes (4 bytes of padding)
Optimal layout: char, char, short, int
– 8 bytes (No padding)
– Compiler will re-order the data like this
Unaligned Accesses
ARM processors access data in memory most efficiently when on natural size boundary
– (Multi-)Word access on word boundaries (LDR, STR, LDM, STM)
– Halfword access on halfword boundaries (LDRH, STRH)
– Byte access on byte (any) boundary (LDRB, STRB)
Use the __packed type qualifier to warn the compiler of potential unaligned accesses
– e.g. for byte-oriented network protocols or when porting legacy code
ARMv6 and later processors support unaligned accesses when appropriately configured
– Must still use __packed to tell compiler the data may be unaligned
Unaligned accesses might cost additional bus cycles
– Trade-off between memory usage and performance
Outcome of an “accidental” unaligned data access is configurable
– Set UNALIGN_TRP bit of Configuration Control Register to detect unaligned accesses and trigger an unaligned usage fault
Packing of structures
What about structures?
The C standard does not permit the compiler to re-order structure members
Members are still naturally aligned for good performance and code size
struct sta
{
    char one;
    short two;
    char three;
    int four;
}a;
Layout: char, 1 byte padding, short, char, 3 bytes padding, int (12 bytes)
LDRB r1,[r0,#0]
LDRSH r2,[r0,#2]
LDRB r3,[r0,#4]
LDR r4,[r0,#8]
Marking a structure as __packed will remove any padding
– Useful for accessing structures specified externally or for porting legacy code
– Efficient code generated using unaligned accesses
__packed struct stb
{
    char one;
    short two;
    char three;
    int four;
}b;
Layout: char, short, char, int (8 bytes, no padding)
LDRB r1,[r0,#0]
LDRSH r2,[r0,#1]
LDRB r3,[r0,#3]
LDR r4,[r0,#4]
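Because the compiler may not re-order structure members, padding can instead be removed by hand, without __packed, by declaring the largest members first. The sketch below assumes an ABI with a 4-byte int, a 2-byte short and natural alignment (as on Cortex-M and most desktop targets); the struct and function names are hypothetical.

```c
#include <stddef.h>

/* Declaration order from the slide: padding follows each char. */
struct declared_order {
    char one;
    short two;
    char three;
    int four;
};

/* Same members, largest first: no padding is needed on the assumed ABI. */
struct largest_first {
    int four;
    short two;
    char one;
    char three;
};

/* Size in bytes of the slide's declaration order. */
size_t declared_order_size(void)
{
    return sizeof(struct declared_order);
}

/* Size in bytes of the manually re-ordered structure. */
size_t largest_first_size(void)
{
    return sizeof(struct largest_first);
}
```

Every member of the re-ordered structure stays naturally aligned, so the accesses remain ordinary aligned loads and stores.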
Alignment of structures
What does __packed do?
– It sets the alignment of a variable, pointer or all the members of a structure to 1
Structures have the same alignment as their ‘most’ aligned member
– Therefore a packed structure (all members byte aligned) has an alignment of 1
– But marking the whole structure (i.e. all members) __packed may be unnecessary
– Instead define packed members within structures to minimize penalties
__packed struct c
{
    int one;
    char two;
    short three;
};
This structure has 1-byte alignment
LDR r0,[r4,#0]
LDRB r1,[r4,#4]
LDRSH r2,[r4,#5]
struct d
{
    int one;
    char two;
    __packed short three;
};
This version has 4-byte alignment so a byte of padding is added at the end
LDR r0,[r4,#0]
LDRB r1,[r4,#4]
LDRSH r2,[r4,#5]
Agenda
Tools & Debug Configuration
Debug Components
Physical Debug Interfaces
Compiler Configuration
Introduction to Optimization
Mixing C and Assembler
Mixing C and Assembly
C/C++ and assembly can easily be mixed to
– Access processor features which are not available from C
– Generate highly optimized code
Easy to make function calls between C, C++ and
Assembly
– Just be sure to conform to the procedure calling standard…
…and import and export the relevant symbols
Calling Assembly from C/C++ (1)
Define the routine in assembly and export its name
Call directly from C just like any other function
– Provide a function prototype in C
– Disable C++ name mangling with extern "C" if using the C++ compiler
Link as normal
extern void mystrcopy(char *d, const char *s);
int main(void)
{
const char *src = "Source";
char dest[10];
...
mystrcopy(dest, src);
...
}
AREA StringCopy,CODE,READONLY
EXPORT mystrcopy
mystrcopy PROC
LDRB r2, [r1], #1
STRB r2, [r0], #1
CMP r2, #0
BNE mystrcopy
BX lr
ENDP
END
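For comparison, the assembly routine above corresponds to the following C loop, which copies bytes up to and including the terminating zero. This is a reference sketch only; the name mystrcopy_c is hypothetical and is not part of the original example.

```c
/* C reference for the assembly routine: copy bytes up to and including the terminating NUL. */
void mystrcopy_c(char *d, const char *s)
{
    char ch;
    do {
        ch = *s++;      /* LDRB r2, [r1], #1 */
        *d++ = ch;      /* STRB r2, [r0], #1 */
    } while (ch != 0);  /* CMP r2, #0 ; BNE mystrcopy */
}
```

As in the assembly version, the destination buffer must be large enough for the source string plus its terminator; neither version checks this.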
Calling Assembly from C/C++ (2)
Where possible use CMSIS functions or compiler intrinsics
e.g. __nop(), __disable_irq()
Compiler also contains an Embedded assembler...
– Write complete functions in assembly language
– No optimization
CMSIS
ARM Cortex Microcontroller Software Interface Standard (CMSIS)
– Vendor-independent hardware abstraction layer for the Cortex-M series of cores
Provides C language access to core features
– Access to internal registers
– Helper functions for common core tasks
– Internal address definitions for core memory map
– Intrinsics for certain common assembly tasks
Example: function to set interrupt priority mask
__ASM void __set_PRIMASK(uint32_t priMask)
{
msr primask, r0
bx lr
}
Available for download from http://www.onarm.com/
Intrinsics
C/C++ standards do not define core-specific functionality
– The ARM Compiler intrinsics provide extra features to realize these operations.
The ARM Compiler supports various families of intrinsics for operations that cannot be generated directly from C/C++ code
– Generic intrinsics: __current_pc, __current_sp, __return_address, ...
– IRQ/FIQ intrinsics: __disable_irq, __enable_irq, ...
– Optimization barriers: __schedule_barrier, __force_stores, ...
– Native instructions: __isb, __dsb,...
Software Debug &
Optimization for ARM®
Cortex®-M MicrocontrollersCortex -M Microcontrollers

AAME ARM Techcon2013 004v02 Debug and Optimization

  • 1. Software Debug & Optimization for ARM® Cortex®-M MicrocontrollersCortex -M Microcontrollers
  • 2. Agenda Tools & Debug Configuration Debug Components Physical Debug Interfaces Compiler Configuration 2 AAME TechCon 2013 TC004v02 2 Compiler Configuration Introduction to Optimization Mixing C and Assembler
  • 3. Keil® MDK Low cost tools for ARM7, ARM9, ARM Cortex-M and ARM Cortex-R4 MCUs – Extensive device support for many devices – Core and peripheral simulation – Flash support Microcontroller Development Kit (MDK) – µVision IDE – ARM Compiler, optimized run-time library, KEIL RTX RTOS 3 AAME TechCon 2013 TC004v02 3 – Real-time trace (for Cortex-M3 and Cortex-M4 based devices) Real-Time Library – Keil RTX RTOS + Source Code – TCP networking suit, Flash File System, CAN Driver Library, USB Device Interface Debug Hardware Evaluation boards
  • 4. ARM Cortex-M3/M4 Debug Features Compliant with ARMv7-M Debug Architecture (CoreSight™ based) Traditional ARM Debug Features – Two debug modes (Halt mode and Monitor mode) – Two stepping modes (with and without interrupts taken) – BKPT instruction – Vector Catch – 4 AAME TechCon 2013 TC004v02 4 – Optional Embedded Trace Macrocell (ETM) CoreSight Debug Features – Flash Patch and Breakpoint (FPB) – Instruction Breakpoints and Code Patching – Data Watchpoint and Trace (DWT) – Hardware Breakpoints, Event Counters and PC Sampling – Instrumentation Trace Macrocell (ITM) – Low bandwidth trace driven by application software or DWT – Serial Wire Viewer
  • 5. ARM Cortex-M3/M4 CoreSight Overview Utilizes a Debug Access Port (DAP) – Consists of a Debug Port (DP) + AHB Access Port (AP) Debug Port (DP) has 2 implementation options – SWJ-DP – supports Serial Wire (2-pin) and conventional JTAG interface – SW-DP – supports Serial Wire (2-pin) only Serial Wire 5 AAME TechCon 2013 TC004v02 5 AHB Access Port (AP) – Provides AHB-Lite access to core, memory and debug components – All Debug registers are memory mapped – Traditional internal scan chains no longer utilized – No coprocessors on Cortex-M3 (CP14 was traditionally the debug coprocessor) SWJ-DP AHB- AP DAP Serial Wire or JTAG SW-DP AHB- AP DAP Serial Wire
  • 6. Cortex-M3/M4 Debug Access Paths Cortex-M3 Core Data Watchpoin t & Trace (DWT) Flash Patch & Breakpoint (FPB) Instrument . Trace Macrocell (ITM) Bus Matrix AHB - Internal Private Peripheral Bus APB - External Private Peripheral Bus 6 AAME TechCon 2013 TC004v02 6 Embedded Trace Macrocell (ETM) Trace Port Interface Unit (TPIU) SW/J-DP AHB -AP DAP APB - External Private Peripheral Bus ROM Table
  • 7. Halted Debug Mode Traditional start/stop debug – Core executes and then halts in Debug mode Debug Fault Status Register (DFSR) identifies the type of Debug event – EXTERNAL EDBGRQ input asserted from other SoC component – VCATCH Vector Catch triggered 7 AAME TechCon 2013 TC004v02 7 – VCATCH Vector Catch triggered – DWTTRAP Data access to address matching a Watchpoint – BKPT BKPT instruction executed – HALTED Halt request from debugger (or stepping in debug) BKPT instruction – Debugger replaces original instruction with BKPT for software breakpoint – Synthesized by FPB unit for a hardware breakpoint
  • 8. Vector Catch Mechanism traps selected exceptions – Core halts when exception is asserted – No DWT / Breakpoint resources utilized – Suitable for early software development – Selection made through debugger Following exceptions may be trapped – Reset Address Vector 0x00 Initial Main SP 0x04 Reset 0x08 NMI 0x0C Hard Fault 0x10 Memory Manage 0x14 Bus Fault 8 AAME TechCon 2013 TC004v02 8 – Reset – Hard Fault – Memory Management Fault – Bus Fault – Usage Fault – Exception Service Error Note - cannot catch interrupts this way – Unlike other ARM and Cortex-A/R cores – Use breakpoint in interrupt handler 0x14 Bus Fault 0x18 Usage Fault 0x1C - 0x28 Reserved 0x2C SVCall 0x30 Debug Monitor 0x34 Reserved 0x38 PendSV 0x3C SysTick 0x40 IRQ0 …. More IRQs
  • 9. Reset Core has 3 different reset inputs – PORESETn - Power-on reset for Cortex-M3 system – SYSRESETn - System reset for processor (debug components not reset) – DAPRESETn - AHB-AP reset Software Generated Resets 9 AAME TechCon 2013 TC004v02 9 Software Generated Resets – VECTRESET bit in Application Interrupt and Reset Control Register – Equivalent to asserting SYSRESETn – Software reset option is available in Keil/MDK and DS-5 Development Studio – Use CTRL_REG for the “Reset Type” – Core is safely reset without asserting nSRST JTAG signal – SYSRESETREQ bit in Application Interrupt and Reset Control Register – Sends a request for a reset to the system – Reset is generated by customer-defined reset controller (not the M3) – Other components in the system other than the Cortex-M3 may be affected
  • 10. Agenda Tools & Debug Configuration Debug Components Physical Debug Interfaces 10 AAME TechCon 2013 TC004v02 10 Compiler Configuration Introduction to Optimization Mixing C and Assembler
  • 11. Flash Patch and Breakpoint Unit (FPB) Flash Patching – Allows runtime patching of firmware – Remaps reads from the Code space to System space using a Patch Table – Total of 8 addresses may be patched – 6 instruction comparators (for instruction fetches from Code space) – 2 literal comparators (for literal data loads from Code space) – Only reads are patched – 11 AAME TechCon 2013 TC004v02 11 – Writes will be performed as normal – Intended Usage – ROM-based designs (costly fix) – Firmware field upgrades Hardware Breakpoints – maximum of 6 – The 6 instruction comparators can return a BKPT to halt the core – Instruction comparators are shared with Flash Patch functionality – If 3 instructions are flash patched, only 3 hardware breakpoints are available
  • 12. Data Watchpoint and Trace (DWT) DWT component useful for Debug, Trace and Profiling – Enabled by setting TRCENA bit in Debug Exception and Monitor Control Reg Debug Support – Traditional data watchpoint for halt mode debug – Can break on [data value && data address] match (x1) Trace Support – Generate trace trigger for Embedded Trace Macrocell (ETM) 12 AAME TechCon 2013 TC004v02 12 Generate trace trigger for Embedded Trace Macrocell (ETM) Profiling / Event Support – Provides non-invasive view of application execution – Packets output through Instrumentation Trace Macrocell (ITM) – Must have debug tools connected to view the output – Packets generated for selected events of interest – Data Address matching – Periodic PC Sampling – Exception Entry, Exit and Return – Hardware performance counting
  • 13. DWT Block Diagram DWT interfaces to the Core, ETM and ITM ITM communication is packet based – Packets defined in the ARMv7-M Architectural Reference Manual ETM* 13 AAME TechCon 2013 TC004v02 13 Comparator Bank Cortex-M3 Core DWT break trigger ETM* ITM packet Event Counters packet packet Cycle Counter *ETM Optional
  • 14. Instrumentation Trace Macrocell (ITM) Generates and outputs Trace Packets Packet types (in priority order): – Software trace – Software can write directly to ITM stimulus registers, causing packets to be emitted – Similar to using printf() to debug a C program – Hardware trace – Packets are generated by the DWT and emitted by the ITM 14 AAME TechCon 2013 TC004v02 14 – Packets are generated by the DWT and emitted by the ITM – Timestamps Timestamp Packets – Must be enabled in ITM Trace Control Register – Local Timestamp (differential) value generated from 21-bit counter – Counter clocked from either core clock or TPIU clock – Global Timestamp (absolute) value generated from 48-bit counter – Packet generation – When any other trace packet is generated (which resets timestamp counter) – When timestamp counter overflows
  • 15. ARM Cortex-M3/M4 and ITM Cortex-M3 Core ETM Trigger ATB Cortex-M3 Macrocell Global Timestamp ClockGlobal Timestamp Global 15 AAME TechCon 2013 TC004v02 15 SW/ SWJ-DP DWT ITM TPIU APB ATB SWO & TraceData[3:0] Local Timestamp Global Timestamp
  • 16. Embedded Trace Macrocell (ETM) Optional non-invasive debug component ETM Hardware monitors activity of processor Trace allows: – Historical debug of sequences leading up to events of interest – e.g. System crash on peripheral access during overnight testing 16 AAME TechCon 2013 TC004v02 16 – e.g. System crash on peripheral access during overnight testing – Debug of events in real-time systems where the target cannot be halted – Hard Disk drives, Engine Management – Visibility of accesses inside a SoC – To internal memories/peripherals – Software profiling and code coverage
  • 17. ARM Cortex-M3/M4 and ETM Cortex-M3 Core ETM Trigger ATB Cortex-M3 Macrocell Global Timestamp ClockGlobal Timestamp Global 17 AAME TechCon 2013 TC004v02 17 SW/ SWJ-DP DWT ITM TPIU APB ATB SWO & TraceData[3:0] Local Timestamp Global Timestamp
  • 18. TPIU Interface / Serial Wire Output Formats and serializes data from ETM and ITM Trace data clocked out asynchronous to core clock – TRACECLK derived from TRACECLKIN ETM packets output over Trace Port (TRACECLK and TRACEDATA [3:0]) – Data decompressed with a conventional Trace Port Analyzer 18 AAME TechCon 2013 TC004v02 18 ITM/DWT packets optionally output over Serial Wire Output (SWO) – SWO also referred to as “Serial Wire Viewer” – Data decompressed with an Event Viewer FIFO Formatter Serializer TPIU FIFO ETM ITM TRACECLKIN SWO TRACEDATA [3:0] TRACEC LK
  • 19. Agenda Tools & Debug Configuration Debug Components Physical Debug Interfaces Compiler Configuration Introduction to Optimization Mixing C and Assembler
  • 20. Physical Interfaces
      New 10-pin and 20-pin interfaces available
      – Higher pin density (0.05”) over standard JTAG IDC interface (0.10”)
      – Reduces connector footprint
      Serial Wire Debug is the preferred solution
      – SWO likely not available when using JTAG interface
      Trace can use a 20-pin connector
      – Legacy 38-pin Mictor connector not recommended
      Consult CoreSight Component TRM or Device Data Sheet
  • 21. Samtec FTSH-110 Connector
      Pin  Signal           Pin  Signal
      1    VTref            2    SWDIO / TMS
      3    GND              4    SWCLK / TCK
      5    GND              6    SWO / TDO
      7    KEY              8    NC/EXTb / TDI
      9    GNDDetect        10   nRESET
  • 22. Samtec FTSH-120 Connector
      Pin  Signal             Pin  Signal
      1    VTref              2    SWDIO / TMS
      3    GND                4    SWCLK / TCK
      5    GND                6    SWO/EXTa/TRACECTL / TDO
      7    KEY                8    NC/EXTb / TDI
      9    GNDDetect          10   nRESET
      11   GND/TgtPwr+Cap     12   TRACECLK
      13   GND/TgtPwr+Cap     14   TRACEDATA[0]
      15   GND                16   TRACEDATA[1]
      17   GND                18   TRACEDATA[2]
      19   GND                20   TRACEDATA[3]
  • 23. Agenda Tools & Debug Configuration Debug Components Physical Debug Interfaces Compiler Configuration Introduction to Optimization Mixing C and Assembler
  • 24. Language Support
      Single compiler armcc can compile standard ISO C/C++
      Source language modes
      – ISO C90
        – 1990 C standard, compile option --c90 (default)
      – ISO C99
        – 1999 C standard, compile option --c99
      – ISO C++
        – 2003 C++ standard, compile option --cpp
      Language compliance
      – Default mode supports several common extensions
      – Strict mode enforces compliance with language standard: --strict
      – GNU mode offers partial support for GCC extensions: --gnu
  • 25. Variable types supported
      The compiler supports these basic types
      Type         Description
      int / long   32-bit (word) integer
      short        16-bit (half-word) integer
      char         8-bit byte, unsigned by default
      long long    64-bit integer
      float        32-bit single-precision IEEE floating point
      double       64-bit double-precision IEEE floating point
      bool         8-bit Boolean (C++ only)
      wchar_t      16-bit “wide character” type (C++ only)
      Pointers     32-bit integer addresses
  • 26. Optimization Levels
      Level of optimizations carried out by the compiler is selectable
      -O0
      – Minimum optimization
      – The least optimized code, but with the best debug view
      -O1
      – Restricted optimization
      – Optimized code and a good debug view
      -O2 (default)
      – High optimization
      – Well optimized code but with limited debug view
      -O3
      – More aggressive optimization, weighted toward -Ospace / -Otime choice
      – Enables multifile compilation by default (more later)
      Select optimization for code size or execution speed with -Ospace (default) or -Otime
      Use -g or --debug to generate source level debug information
  • 27. Selecting an Architecture or Core
      Each new version of the ARM Architecture typically supports extra instructions and modes of operation
      Implementation of an architecture version may vary between cores
      – Use the most specific setting you can when compiling
      Inform the compiler of the architecture or processor
      – The default CPU setting is ARM7TDMI (Architecture 4T)
      – Either specify an architecture version, or a specific core
          --cpu 7-M (Do not prefix with a ‘v’)
          --cpu Cortex-M3
      Some examples of features the compiler and libraries can take advantage of:
      – UDIV and SDIV (7-M and 7-R)
      – REV (v6) can be used to reverse byte endianness
      – Unaligned memory access (v6)
      When using the Cortex-M3 it is essential to specify 7-M or Cortex-M3 to ensure the correct (Thumb only) libraries are used
  • 28. Agenda Tools & Debug Configuration Debug Components Physical Debug Interfaces Compiler Configuration Introduction to Optimization Mixing C and Assembler
  • 29. Using “volatile”
      Without volatile, armcc assumes *p cannot change between the two reads and folds the comparison to a constant:
          int f(int *p) { return (*p == *p); }
          f
              MOVS r0, #1
              BX lr
      With volatile, armcc must perform both loads and compare the results:
          int f(volatile int *p) { return (*p == *p); }
          f
              LDR r1,[r0,#0]
              LDR r0,[r0,#0]
              CMP r1,r0
              ITE NE
              MOVNE r0,#0
              MOVEQ r0,#1
              BX lr
      This code is compiled with “-O2 -Otime --cpu=Cortex-M3”
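A typical place where this matters is polling a status word that an interrupt handler or a peripheral updates. The following is a minimal sketch, not taken from the slides: `wait_for_flag` and its `max_polls` limit are hypothetical names chosen for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: poll a word that hardware or an ISR may change
   behind the compiler's back. Returns 1 if the flag became non-zero
   within max_polls reads, 0 otherwise. */
int wait_for_flag(volatile uint32_t *flag, unsigned max_polls)
{
    while (max_polls-- != 0) {
        if (*flag != 0) {   /* volatile: the load is repeated on every pass */
            return 1;
        }
    }
    return 0;               /* flag never became non-zero */
}
```

Without the `volatile` qualifier the compiler would be entitled to read `*flag` once and reuse the value for every iteration.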
  • 30. Instruction Scheduling
      Instruction scheduling is enabled at -O1 and higher
      – Instructions are re-ordered to suit the core on which the code will run
      – Improves throughput by minimizing interlocks
      – Select processor (--cpu) to determine algorithm used
      For example:
          int f(int *p, int x) { return *p + x * 3; }
      Without scheduling (-O0):
          MOV r2,r0
          ADD r3,r1,r1,LSL #1
          LDR r0,[r2,#0]
          ADD r0,r0,r3
          BX lr
      With scheduling (-O1, -O2, -O3):
          LDR r0,[r0,#0]
          ADD r1,r1,r1,LSL #1
          ADD r0,r0,r1
          BX lr
      Compiler never re-orders instructions if this would change the behavior
  • 31. Inlining of functions
      Inlining can improve performance, at the expense of a larger image
      – Body of inlined function inserted directly into the calling code wherever it is called
      – Only possible if caller and callee are in same compilation unit (except --multifile)
      The compiler can inline functions automatically
      – Normally no need to annotate your source code, or use any special switches
      Factors that influence auto-inlining include
      – Whether the function is marked with __inline ‘hint’
      – Optimization level and -Otime / -Ospace
      – How many places the function is called
      – Size of the function
      – Whether the function has external or static linkage
      To force a function to be inlined, either use --forceinline with __inline or use __forceinline
      Any non-static function that gets auto-inlined has an out-of-line version generated too (another reason to use static)
      – Increases code size
      – More complex debug view
      Example...
  • 32. Loop Transformation
      The compiler can transform and restructure loops automatically
      – Enabled with -O3 -Otime
      Loop unrolling reduces loop overhead at the cost of increase in code size
      Original loop:
          for (i = 0; i < 100; i++) {
              c[i] = b[i] + 1;
          }
      Unrolled by four:
          for (i = 0; i < 100; i += 4) {
              c[i + 0] = b[i + 0] + 1;
              c[i + 1] = b[i + 1] + 1;
              c[i + 2] = b[i + 2] + 1;
              c[i + 3] = b[i + 3] + 1;
          }
      Loop re-rolling
      – Recognize manually unrolled loops, re-roll and unroll optimally
      Loops with constant, low iteration count may be unrolled completely
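The unrolled loop on the slide works because 100 is a multiple of 4. A general hand-written version needs a clean-up loop for the leftover elements. The sketch below assumes hypothetical function names (`add_one`, `add_one_unrolled`) and is only meant to show the shape of the transformation the compiler performs for you.

```c
#include <stddef.h>

/* Reference version: one element per iteration. */
void add_one(int *c, const int *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        c[i] = b[i] + 1;
    }
}

/* Unrolled by four, with a clean-up loop for the 0..3 leftover elements. */
void add_one_unrolled(int *c, const int *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        c[i + 0] = b[i + 0] + 1;
        c[i + 1] = b[i + 1] + 1;
        c[i + 2] = b[i + 2] + 1;
        c[i + 3] = b[i + 3] + 1;
    }
    for (; i < n; i++) {    /* remainder when n is not a multiple of 4 */
        c[i] = b[i] + 1;
    }
}
```

Since the compiler can also re-roll manually unrolled loops, writing the simple form and letting the optimizer choose is usually the safer starting point.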
  • 33. Branch Target Optimization (1)
      The ARM Compiler implements a performance optimization to ensure that a loop branch target is not an unaligned 32-bit instruction
      – Available when compiling at -O3
      – Improves EEMBC performance by 1.5%
      The compiler will try first to widen the instruction before the branch target
      – If widening is not possible then it will insert a 16-bit NOP
      Two examples on the following slides
      – The first shows an instruction being widened to align a loop
      – The second shows a NOP being inserted to align a loop
  • 34. Branch Target Optimization (2)
      Instruction being widened to align a loop target
          int foo1(int a[16])
          {
              int i;
              int total = 0;
              for (i=0; i<8; i++) {
                  total += a[i];
              }
              return total;
          }
          foo1
          0x00000000: 4602     .F   MOV r2,r0
          0x00000002: 2000     .    MOVS r0,#0
          0x00000004: ea4f0100 O... MOV.W r1,r0             <<<< widened MOV
          loop
          0x00000008: f8523021 R.!0 LDR r3,[r2,r1,LSL #2]
          0x0000000c: 1c49     I.   ADDS r1,r1,#1
          0x0000000e: 4418     .D   ADD r0,r0,r3
          0x00000010: 2908     .)   CMP r1,#8
          0x00000012: dbf9     ..   BLT {pc}-0xa ; 0x8 (loop)
          0x00000014: 4770     pG   BX lr
  • 35. Branch Target Optimization (3)
      NOP being inserted to align a loop target
          int foo2(int a[16], int j)
          {
              int total = 0;
              int i = 0;
              if (a[0]!=0) {
                  for (i=0; i<8; i++) {
                      total += a[i] + a[i+1];
                  }
              }
              return total;
          }
          foo2
          0x00000000: b510     ..   PUSH {r4,lr}
          0x00000002: 4602     .F   MOV r2,r0
          0x00000004: 2000     .    MOVS r0,#0
          0x00000006: 6813     .h   LDR r3,[r2,#0]
          0x00000008: 4601     .F   MOV r1,r0
          0x0000000a: 2b00     .+   CMP r3,#0
          0x0000000c: d00a     ..   BEQ {pc}+0x18 ; 0x24
          0x0000000e: bf00     ..   NOP                     <<<< added NOP
          loop
          0x00000010: eb020481 .... ADD r4,r2,r1,LSL #2     <<<< 32 bit aligned
          0x00000014: f8523021 R.!0 LDR r3,[r2,r1,LSL #2]
          0x00000018: 6864     dh   LDR r4,[r4,#4]
          0x0000001a: 1c49     I.   ADDS r1,r1,#1
          0x0000001c: 4423     #D   ADD r3,r3,r4
          0x0000001e: 2908     .)   CMP r1,#8
          0x00000020: 4418     .D   ADD r0,r0,r3
          0x00000022: dbf5     ..   BLT {pc}-0x12 ; 0x10 (loop)
          0x00000024: bd10     ..   POP {r4,pc}
  • 36. Register Usage
      The compiler has a set of rules known as a Procedure Call Standard that determine how to pass parameters to a function (see AAPCS)
      – CPSR flags may be corrupted by function call
      – Assembler code which links with compiled code must follow the AAPCS at external interfaces
      – The AAPCS is part of the ABI for the ARM Architecture
      Register    Usage
      r0-r3 *     Arguments into function; result(s) from function; otherwise corruptible (additional parameters passed on stack)
      r4-r11      Register variables; must be preserved
      r12 *       Scratch register (corruptible)
      r13/sp      Stack Pointer (AAPCS requires that sp be 8-byte (2 word) aligned at externally visible boundaries)
      r14/lr *    Link Register (r14 can be used as a temporary once value stacked)
      r15/pc *    Program Counter
      Registers marked with a star are automatically pushed on to the stack when an exception occurs
      – The xPSR (processor state) is also pushed to the stack
  • 37. Register Usage (2)
      Caller
          ...
          BL foo
          ...
      – Parameters passed in r0-r3
      – May need to save r0-r3, r12
      – Do not need to save r4-r11
      Callee
          foo
              PUSH {r4-r11, lr}
              ...
              POP {r4-r11, pc}
      – Must preserve r4-r11, lr if used by callee
      – May corrupt r0-r3, r12
      Value returned in r0 for int/short/char; in r0 and r1 for long long/double
      AAPCS – Procedure Call Standard for ARM Architecture
  • 38. Parameter Passing (1)
      The first four word sized parameters passed to a function will be transferred in registers r0-r3 (fast & efficient)
      – Sub-word sized arguments will still use a whole register
      – Arguments larger than a word will be passed in multiple registers (more about 64 bit types later)
      – See AAPCS for more details
      If more arguments are needed, then the 5th, 6th and subsequent words will be passed on the stack
      – Involves extra instructions and memory accesses
      Therefore always try to limit arguments to 4 words or fewer
      – If unavoidable, place most commonly used parameters in first 4 positions
      – Or if arguments are in a structure then pass a pointer to the structure instead
      C++ uses the first argument to pass the this pointer to member functions, so only 3 arguments can be passed in registers
      Example...
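The "pass a pointer to the structure" advice can be sketched as follows. The type `six_ints` and both function names are hypothetical, invented here for illustration; the point is only that the second form needs a single register argument, at the price of loading each field from memory inside the callee.

```c
/* Hypothetical record that groups six related values. */
struct six_ints {
    int a, b, c, d, e, f;
};

/* Six scalar arguments: under the AAPCS, e and f are passed on the stack. */
int sum6(int a, int b, int c, int d, int e, int f)
{
    return a + b + c + d + e + f;
}

/* One pointer argument: fits in r0, fields are loaded by the callee. */
int sum6_ptr(const struct six_ints *p)
{
    return p->a + p->b + p->c + p->d + p->e + p->f;
}
```

This tends to pay off when the values already live in a structure; copying six separate variables into a temporary structure just to pass a pointer would likely cost more than it saves.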
  • 39. Parameter Passing (2)
      Parameter Passing (4 parameters)
          int func1(int a, int b, int c, int d)
          {
              return a+b+c+d;
          }
          int caller1(void)
          {
              return func1(1,2,3,4);
          }
          func1
              ADDS r0, r0, r1
              ADDS r0, r0, r2
              ADDS r0, r0, r3
              BX lr
          :
          caller1
              MOVS r3, #4
              MOVS r2, #3
              MOVS r1, #2
              MOVS r0, #1
              B func1
      Parameter Passing (6 parameters)
          func2
              PUSH {r4,r5,lr}
              ADD r0,r0,r1
              LDRD r4,r5,[sp,#0xc]
              ADD r0,r0,r2
              ADD r0,r0,r3
              ADD r0,r0,r4
              ADD r0,r0,r5
              POP {r4,r5,pc}
          caller2
              PUSH {r2,r3,lr}
              MOVS r3,#6
              MOVS r2,#5
              STRD r2,r3,[sp,#0]
              MOVS r3,#4
              MOVS r2,#3
              MOVS r1,#2
              MOVS r0,#1
              BL func2
              POP {r2,r3,pc}
  • 40. Parameter Passing (3)
      The AAPCS has rules about 64-bit types
      – 64-bit types must be 8-byte aligned in memory
      – 64-bit arguments to functions must be passed in an even + consecutive odd register (i.e. r0+r1 or r2+r3) or on the stack at an 8-byte aligned location
      Registers or stack will be ‘wasted’ if arguments are listed in a sub-optimal order
      Function                                   r0  r1      r2  r3  stack  stack   stack  stack
      fy(int a, int c, double b)                 a   c       b   b
      fx(int a, double b, int c)                 a   unused  b   b   c
      fz(double a, double b, int c, double d)    a   a       b   b   c      unused  d      d
      Remember the hidden this argument in r0 for non-static C++ member functions
  • 41. Loop Termination (1)
      In for(), while() and do…while() loops always use an integer counter
      Preferably decrement down to zero, rather than up towards a final value
      – Subtract and compare to zero can be done in one instruction (SUBS)
      – But must either use an unsigned int counter… …or test not equal to zero (rather than greater than or equal to zero) (otherwise the potential wraparound from –ve to +ve prohibits this optimization)
      For example, replace:
          for (loop = 1; loop <= total; loop++)
      with:
          for (loop = total; loop != 0; loop--)
      Loop limit (total) then only used once at the beginning
      – Compiler can reuse this register once the loop counter has been loaded
      Resulting code is smaller and faster
      Example...
  • 42. Loop Termination (2)
      Count up
          int fact1(unsigned int limit)
          {
              unsigned int i;
              int fact = 1;
              for (i = 1; i <= limit; i++) {
                  fact = fact * i;
              }
              return fact;
          }
          fact1
              MOV r2,r0
              MOVS r0,#1
              MOV r1,r0
              CMP r2,#1
              IT CC
              BXCC lr
          |L1.20|
              MUL r0,r1,r0
              ADDS r1,r1,#1
              CMP r1,r2
              BLS |L1.20|
              BX lr
      Count down
          int fact2(unsigned int limit)
          {
              unsigned int i;
              int fact = 1;
              for (i = limit; i != 0; i--) {
                  fact = fact * i;
              }
              return fact;
          }
          fact2
              MOVS r1,r0
              MOV r0,#1
              IT EQ
              BXEQ lr
          |L1.52|
              MUL r0,r1,r0
              SUBS r1,r1,#1
              BNE |L1.52|
              BX lr
      Both examples compiled with -O2 -Otime
  • 43. Division Operations
      Prior to ARMv7, ARM cores contain no division hardware
      – Division typically implemented by a run-time library function
      – This can take many cycles to execute
          int div(int a, int b) { return (a / b); }
          div
              PUSH {r4,lr}
              BL __aeabi_idivmod
              POP {r4,pc}
      v7-M cores include division hardware
      Signed and unsigned divide instructions included in Thumb-2 instruction set
          int div(int a, int b) { return (a / b); }
          div
              SDIV r0,r0,r1
              BX lr
  • 44. Division by Compile-time Constants
      Division by compile-time constants is treated as a special case
      Division by powers of two will use shift operations
          unsigned div2(unsigned n) { return (n / 2); }
          div2
              LSRS r0, r0, #1
              BX lr
      – With -O1 and higher (with -Otime), other constants will use a standard long multiply sequence on v7-M cores
          unsigned div10(unsigned n) { return (n / 10); }
          div10
              LDR r1, =0xCCCCCCCD
              UMULL r1, r0, r1, r0
              LSRS r0, r0, #3
              BX lr
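The div10 sequence multiplies by 0xCCCCCCCD (roughly 2^35 / 10), keeps the high word of the 64-bit product, and shifts it right by 3. The sketch below writes that arithmetic out in portable C so it can be checked against ordinary division; the function name `div10_mul` is an assumption made for illustration, not something armcc exposes.

```c
#include <stdint.h>

/* Model of the UMULL + LSRS sequence for n / 10. */
uint32_t div10_mul(uint32_t n)
{
    uint64_t product = (uint64_t)n * 0xCCCCCCCDu;   /* UMULL: 32x32 -> 64 */
    uint32_t high = (uint32_t)(product >> 32);      /* high word ends up in r0 */
    return high >> 3;                               /* LSRS r0, r0, #3 */
}
```

In normal code there is no need to write this by hand: dividing by a literal constant lets the compiler pick the sequence itself.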
  • 45. Modulo Arithmetic
      The remainder operator ‘%’ is commonly used in modulo arithmetic
      However, this will be expensive if the modulo value is not a power of two
      – Will use hardware divide, if present, or will use division library code
      Can be avoided by rewriting C code to use if() statement check
      For example, if count has the range 0 to 59, replace
          count = (count+1) % 60;
      with
          if (++count >= 60) count = 0;
          modulo
              MOVS r1, #0x3c
              ADDS r0, r0, #1
              BL __aeabi_uidivmod
              MOV r0, r1
          test_and_reset
              ADDS r0, r0, #1
              CMP r0, #0x3c
              BLT |L1.4|
              MOVS r0, #0
          |L1.4|
      This code is compiled with “-O2”
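The two forms are only interchangeable while count stays within 0 to 59, as the slide states. A small sketch that wraps both in functions makes this easy to check; `next_mod` and `next_wrap` are hypothetical names used only here.

```c
/* Remainder form: may cost a divide instruction or a library call. */
unsigned next_mod(unsigned count)
{
    return (count + 1) % 60;
}

/* Compare-and-reset form: add, compare, conditional reset.
   Equivalent to next_mod() only for count in the range 0..59. */
unsigned next_wrap(unsigned count)
{
    if (++count >= 60) {
        count = 0;
    }
    return count;
}
```

If count could ever start outside that range, the two functions diverge, so the range assumption is worth documenting next to the code.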
  • 46. Floating Point
      ARM Cortex-M3 and ARM Cortex-M4 have no hardware floating-point operations
      – Compiler generates calls to software floating-point library routines whenever a floating point operation is required (default option is --fpu=softvfp)
      Cortex-M4F supports hardware floating-point operations
          float fplib(float num1, float num2)
          {
              float temp, temp2;
              temp = num1 + num2;
              temp2 = num2 * num2;
              return temp2-temp;
          }
      --cpu=Cortex-M3
          fplib
              PUSH {r4-r6,lr}
              MOV r4,r1
              BL __aeabi_fadd
              MOV r5,r0
              MOV r1,r4
              MOV r0,r4
              BL __aeabi_fmul
              MOV r1,r5
              POP {r4-r6,lr}
              B.W __aeabi_fsub
      --cpu=Cortex-M4F
          fplib
              VADD.F32 s0,s0,s1
              VMUL.F32 s1,s1,s1
              VSUB.F32 s0,s1,s0
              BX lr
  • 47. Variable Types
      Global & static variables are held in RAM
      – Which requires loads/stores to memory – more later
      – External globals also require an extra level of indirection because the compiler needs to load a pointer to the variable first
      Local variables are normally held in registers, for fast & efficient processing
      – If the compiler’s register allocator runs out of registers, then locals will be 'spilled' onto the stack
      – Taking the address of a variable also forces it to be placed in memory
      For local variables, use word-sized (int) variables rather than halfword and byte
      – Avoids additional shifts/masks to ensure that variables only occupy correct space within 32-bit register
      Example...
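One common consequence is that a loop which updates a global directly may have to load and store it on every iteration, because the compiler often cannot prove that a pointer argument does not alias the global. A hedged sketch of the usual workaround, with hypothetical names (`g_total`, `sum_global`, `sum_local`):

```c
#include <stddef.h>

int g_total;    /* global: lives in RAM */

/* Updates the global inside the loop. Since data could point at g_total,
   the compiler generally has to reload and store it each time round. */
void sum_global(const int *data, size_t n)
{
    g_total = 0;
    for (size_t i = 0; i < n; i++) {
        g_total += data[i];
    }
}

/* Accumulates in a local (normally a register) and stores once at the end. */
void sum_local(const int *data, size_t n)
{
    int acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += data[i];
    }
    g_total = acc;
}
```

Both functions produce the same result as long as data does not actually overlap g_total; whether the second is measurably faster depends on the compiler and optimization level.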
  • 48. Size of Local Variables
          int wordsize(int a)
          {
              a = a + 1;
              return a;
          }
          wordsize
              ADDS r0, r0, #1
              BX lr
          int halfsize(short b)
          {
              b = b + 1;
              return b;
          }
          halfsize
              ADDS r0, r0, #1
              SXTH r0, r0
              BX lr
          int bytesize(char c)
          {
              c = c + 1;
              return c;
          }
          bytesize
              ADDS r0, r0, #1
              UXTB r0, r0
              BX lr
      These examples compiled with --cpu=Cortex-M3
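The extra SXTH/UXTB instructions are not wasted work: they preserve the wraparound behavior of the narrow type. The sketch below uses fixed-width unsigned types so the result is well defined on any host; `count_word` and `count_byte` are hypothetical names for illustration.

```c
#include <stdint.h>

/* Word-sized counter: a single add, no extension needed. */
uint32_t count_word(uint32_t c)
{
    return c + 1;
}

/* Byte-sized counter: the result must wrap at 256, which is what the
   UXTB in the slide's bytesize() implements. */
uint32_t count_byte(uint8_t c)
{
    c = (uint8_t)(c + 1);
    return c;
}
```

If the wraparound is not actually wanted, an int local avoids the extension; if it is wanted, the narrow type states that intent clearly.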
  • 49. Global Data Layout
      Global (and static) data is stored in memory, not registers
      – Require load / store instruction to access
      – So for performance reasons will be aligned on natural size boundaries
      ARM compilers will optimize the layout of globals in a module
      e.g. declared data in this order
          char one;
          short two;
          char three;
          int four;
      Declared layout: char, short, char, int
      – 12 bytes (4 bytes of padding)
      Compiler will re-order the data like this: char, char, short, int
      – Optimal layout, 8 bytes (no padding)
  • 50. Unaligned Accesses
      ARM processors access data in memory most efficiently when on natural size boundary
      – (Multi-)Word access on word boundaries (LDR, STR, LDM, STM)
      – Halfword access on halfword boundaries (LDRH, STRH)
      – Byte access on byte (any) boundary (LDRB, STRB)
      Use the __packed type qualifier to warn the compiler of potential unaligned accesses
      – e.g. for byte-oriented network protocols or when porting legacy code
      ARMv6 and later processors support unaligned accesses when appropriately configured
      – Must still use __packed to tell compiler the data may be unaligned
      Unaligned accesses might cost additional bus cycles
      – Trade-off between memory usage and performance
      Outcome of an “accidental” unaligned data access is configurable
      – Set UNALIGN_TRP bit of Configuration Control Register to detect unaligned accesses and trigger an unaligned usage fault
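Where __packed is not available, or the code must stay portable across compilers, a byte-oriented field can be read without any unaligned word access at all. This is an alternative technique rather than what the slide describes, and the helper names `read_u32_le` and `read_u32_native` are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Assemble a little-endian 32-bit value from four byte loads.
   Safe at any address, independent of host endianness. */
uint32_t read_u32_le(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Copy four bytes in native byte order. memcpy makes no alignment
   assumption, and compilers commonly turn it into a single load
   when the target permits unaligned access. */
uint32_t read_u32_native(const void *p)
{
    uint32_t value;
    memcpy(&value, p, sizeof value);
    return value;
}
```

The first form suits wire formats with a defined byte order, such as the network protocols the slide mentions; the second suits data written by the same machine.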
  • 51. Packing of structures
      What about structures?
      The C standard does not permit the compiler to re-order structure members
      Members are still naturally aligned for good performance and code size
          struct sta
          {
              char one;
              short two;
              char three;
              int four;
          }a;
          LDRB r1,[r0,#0]
          LDRSH r2,[r0,#2]
          LDRB r3,[r0,#4]
          LDR r4,[r0,#8]
      Marking a structure as __packed will remove any padding
      – Useful for accessing structures specified externally or for porting legacy code
      – Efficient code generated using unaligned accesses
          __packed struct stb
          {
              char one;
              short two;
              char three;
              int four;
          }b;
          LDRB r1,[r0,#0]
          LDRSH r2,[r0,#1]
          LDRB r3,[r0,#3]
          LDR r4,[r0,#4]
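Since the compiler may not re-order structure members, the programmer can do it by hand, which often removes padding without resorting to __packed. The sketch below measures the padding with sizeof; `struct sta_sorted` and the two helper functions are hypothetical additions. On ARM and most common 32-bit and 64-bit ABIs the expected sizes are 12 and 8 bytes, but that is an ABI property and not guaranteed by the C standard.

```c
#include <stddef.h>

/* Member order from the slide: padding after one and after three. */
struct sta {
    char one;
    short two;
    char three;
    int four;
};

/* Same members, largest first: typically no padding at all. */
struct sta_sorted {
    int four;
    short two;
    char one;
    char three;
};

/* Bytes of padding = total size minus the sum of the member sizes. */
size_t sta_padding(void)
{
    return sizeof(struct sta) - (2 * sizeof(char) + sizeof(short) + sizeof(int));
}

size_t sta_sorted_padding(void)
{
    return sizeof(struct sta_sorted) - (2 * sizeof(char) + sizeof(short) + sizeof(int));
}
```

Re-ordering is only an option when the layout is not dictated externally; for a layout fixed by a file format or protocol, __packed remains the tool the slide recommends.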
  • 52. Alignment of structures
      What does __packed do?
      – It sets the alignment of a variable, pointer or all the members of a structure to 1
      Structures have the same alignment as their ‘most’ aligned member
      – Therefore a packed structure (all members byte aligned) has an alignment of 1
      – But marking the whole structure (i.e. all members) __packed may be unnecessary
      – Instead define packed members within structures to minimize penalties
          __packed struct c
          {
              int one;
              char two;
              short three;
          };
          LDR r0,[r4,#0]
          LDRB r1,[r4,#4]
          LDRSH r2,[r4,#5]
      This structure has 1-byte alignment
          struct d
          {
              int one;
              char two;
              __packed short three;
          };
          LDR r0,[r4,#0]
          LDRB r1,[r4,#4]
          LDRSH r2,[r4,#5]
      This version has 4-byte alignment so a byte of padding is added at the end
  • 53. Agenda Tools & Debug Configuration Debug Components Physical Debug Interfaces Compiler Configuration Introduction to Optimization Mixing C and Assembler
  • 54. Mixing C and Assembly
      C/C++ and assembly can easily be mixed to
      – Access processor features which are not available from C
      – Generate highly optimized code
      Easy to make function calls between C, C++ and Assembly
      – Just be sure to conform to the procedure calling standard… …and import and export the relevant symbols
  • 55. Calling Assembly from C/C++ (1)
      Define the routine in assembly and export its name
      Call directly from C just like any other function
      – Provide a function prototype in C
      – Disable C++ name mangling with extern "C" if using the C++ compiler
      Link as normal
          extern void mystrcopy(char *d, const char *s);
          int main(void)
          {
              const char *src = "Source";
              char dest[10];
              ...
              mystrcopy(dest, src);
              ...
          }
              AREA StringCopy,CODE,READONLY
              EXPORT mystrcopy
          mystrcopy PROC
              LDRB r2, [r1], #1
              STRB r2, [r0], #1
              CMP r2, #0
              BNE mystrcopy
              BX lr
              ENDP
              END
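When writing a routine like this in assembly, it can help to keep a plain C reference model alongside it for host-side testing. The version below mirrors the assembly loop instruction by instruction; the name `mystrcopy_ref` is hypothetical and not part of the original slides.

```c
/* C reference model of the assembly loop: copies bytes from s to d,
   up to and including the terminating NUL. */
void mystrcopy_ref(char *d, const char *s)
{
    char c;
    do {
        c = *s++;       /* LDRB r2, [r1], #1 */
        *d++ = c;       /* STRB r2, [r0], #1 */
    } while (c != 0);   /* CMP r2, #0 ; BNE mystrcopy */
}
```

Like the assembly routine, this performs no bounds checking, so the caller must make sure the destination is large enough for the source string plus its terminator.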
  • 56. Calling Assembly from C/C++ (2)
      Where possible use CMSIS functions or compiler intrinsics
      e.g. __nop(), __disable_irq()
      Compiler also contains an Embedded assembler...
      – Write complete functions in assembly language
      – No optimization
  • 57. CMSIS
      ARM Cortex Microcontroller Software Interface Standard (CMSIS)
      – Vendor-independent hardware abstraction layer for the Cortex-M series of cores
      Provides C language access to core features
      – Access to internal registers
      – Helper functions for common core tasks
      – Internal address definitions for core memory map
      – Intrinsics for certain common assembly tasks
      Example: function to set interrupt priority mask
          __ASM void __set_PRIMASK(uint32_t priMask)
          {
              msr primask, r0
              bx lr
          }
      Available for download from http://www.onarm.com/
  • 58. Intrinsics
      C/C++ standards do not define core-specific functionality
      – The ARM Compiler intrinsics provide extra features to realize these operations.
      The ARM Compiler supports various families of intrinsics for operations that cannot be generated directly from C/C++ code
      – Generic intrinsics: __current_pc, __current_sp, __return_address, ...
      – IRQ/FIQ intrinsics: __disable_irq, __enable_irq, ...
      – Optimization barriers: __schedule_barrier, __force_stores, ...
      – Native instructions: __isb, __dsb, ...
  • 59. Software Debug & Optimization for ARM® Cortex®-M Microcontrollers