TI DSP Optimization on Jacinto
Hank
2015/06/09

Generation
● 2002 OMAP
● 2006 Jacinto 1
● 2008 Jacinto 3
● 2010 Jacinto 4/5
● 2016 Jacinto 6

OMAP
– Application:
● CD-DA and CD-ROM/DVD-ROM/USB/SD with MP3, WMA, and AAC audio decoder support
– Software platform:
● Cooperation with QNX Software Systems

Jacinto 1-DDR
– Application:
● Compressed video playback, Bluetooth A2DP audio streaming
– Improvement:
● C64x+ fixed-point DSP for graphics acceleration, compressed audio decoding, voice recognition

Jacinto 3
– Application:
● Compressed video playback, Bluetooth A2DP audio streaming
– Hardware improvement:
● ARM Cortex-A8
● GPU PowerVR SGX

Jacinto 4/5
– Application:
● Full-HD 1080p video decode/encode
● QNX CAR 2 platform
– Hardware improvement:
● Dual ARM Cortex-M3, used for decoding video streams
● C674x DSP

Jacinto 6
– Application:
● Advanced Driver Assistance System (ADAS)
– Hardware improvement:
● ARM Cortex-A15
● DSP C66x
● GPU SGX544

C6000 DSP Optimization
● The code generation tools support these languages:
– ANSI C (C89)
– ISO C++ (C++98)
– C6000 DSP assembly
– C6000 linear assembly

Optimization: five key concepts
● Core(architecture)
– parallel processing
● Pipeline
– High throughput
● Software pipelining
– Instruction scheduling
● Compiler optimization
● Optimized software library
– Intrinsic operations in C6000, inlined functions

C6000 Core
● 8 parallel functional units
– D: data load/store (.D1, .D2)
– S: shift, branch (.S1, .S2)
– M: multiply (.M1, .M2)
– L: logic, arithmetic operations (.L1, .L2)

C6000 Core (cont.)
● 32 32-bit registers for each side of the functional units
– A0-A31 (.D1, .S1, .M1, .L1)
– B0-B31 (.D2, .S2, .M2, .L2)
● Separate program and data memory (L1P, L1D)
● 256-bit internal program bus: fetches 8 32-bit instructions from L1P every cycle
● 2 64-bit internal data buses that allow both .D1 and .D2 to fetch data from L1D every cycle

Core optimization
[Figure: C++ source compared with pseudo assembly and compiled parallel assembly]

Pipeline
F: fetch D: decode E: execute

C6000 pipeline
● Divides fetch, decode, and execute into more substages: 4-stage fetch, 2-stage decode, 10-stage execute

Delay slots
● The pipeline cannot stay full when
– The current instruction depends on the result of a previous instruction that takes more than 1 cycle
– A branch is performed
● Solution
– Software scheduling (software pipelining)
– Hardware enhancement (SPLOOP buffer)

Software pipelining
● Enable
– For code written in C, add the compiler option -o3 to enable software pipelining
● Drawback
– Assembly code size increases
● Solution
– Software pipelined loop (SPLOOP) buffer
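
A minimal sketch of the kind of loop this slide has in mind: a short body with no function calls and no branches, which gives an optimizing compiler room to overlap iterations. The function name `dot_product` and the choice of 16-bit inputs are illustrative assumptions, not taken from the slides.

```c
/* Sum of products of two 16-bit arrays.
 * The loop body is straight-line code (load, load, multiply, add),
 * so successive iterations can be overlapped by the scheduler. */
int dot_product(const short *a, const short *b, int n)
{
    int sum = 0;
    int i;

    for (i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Built with -o3, a loop like this is a typical candidate for software pipelining; adding -k and -mw (mentioned later in the deck) keeps the generated assembly and its feedback comments for inspection.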

SPLOOP buffer
● Support platform
– C64x+, C674x, and C66x
● The SPLOOP buffer stores a single scheduled iteration of the loop in a specialized buffer
● The C compiler automatically utilizes SPLOOP
● Cannot handle loops that exceed 14 execute packets (at most 8 instructions per execute packet)
– Nor nested loops, conditional branches inside loops, or function calls inside loops

Compiler Optimization
● Using C compiler to generate assembly codes
that utilize C6000 functional units and pipeline
as fully as possible
– Give the compiler additional information and directives to help it optimize your code as far as possible
● Compiler options, e.g. -o3
● Keywords (C or C6000), e.g. restrict
● Pragma directives, e.g. MUST_ITERATE
– Understand compiler feedback

Loop qualification (option -k -mw)
Compiler feedback (option -k -mw)

Dependency & resource information
● Minimize the iteration interval, which is bounded below by
– The loop-carried dependency bound
● Distance of the largest loop-carried path
– The partitioned resource bound
● Maximum number of cycles any functional unit is used in a single iteration
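
The two bounds can be contrasted with a small sketch. Both functions are hypothetical examples written for this note: the first has a true loop-carried dependency, the second has none, so only resource usage limits how tightly its iterations can be packed.

```c
/* Loop-carried dependency: each iteration reads y[i - 1], which the
 * previous iteration wrote. A new iteration cannot begin its add until
 * that value is available, which sets a floor on the iteration interval. */
void running_sum(int *y, const int *x, int n)
{
    int i;

    if (n <= 0) {
        return;
    }
    y[0] = x[0];
    for (i = 1; i < n; i++) {
        y[i] = y[i - 1] + x[i];
    }
}

/* No dependency between iterations: every y[i] depends only on x[i].
 * Here the limit comes from how many loads, adds, and stores the
 * functional units can issue per cycle. */
void scale_by_two(int *restrict y, const int *restrict x, int n)
{
    int i;

    for (i = 0; i < n; i++) {
        y[i] = x[i] + x[i];
    }
}
```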

Explicit code optimization
● The previous approach works well, except for
– Function calls in a loop
– Complex, hard-to-implement operations
● Solution: explicit code optimization
– Intrinsic operations
– Optimized C6000 DSP libraries
– C inline functions

Intrinsic operations
● Sample
– The shuffle operation separates the even and odd bits of a 32-bit value into two variables
● Intrinsic operations
– Function-like statements
– Leading underscore, e.g. _shfl
– Not a function call, no branch needed
● Listed in “TMS320C6000 Optimizing Compiler v7.6 User's Guide”
● Device-dependent
● _abs can be used directly
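
To show why an intrinsic is attractive here, the sketch below writes the even/odd bit separation from the sample in portable C. It needs a 16-iteration loop of shifts and masks, which is the work a single bit-manipulation instruction would replace. The function name `split_even_odd_bits` is an assumption for this example, and which intrinsic matches this exact bit layout should be confirmed in the compiler user's guide before substituting one.

```c
#include <stdint.h>

/* Portable reference: gather the even-numbered bits of v into *even and
 * the odd-numbered bits into *odd. Bit 0 is the least significant bit,
 * and bit 2*i (or 2*i + 1) of v lands in bit i of the result. */
void split_even_odd_bits(uint32_t v, uint16_t *even, uint16_t *odd)
{
    uint32_t e = 0;
    uint32_t o = 0;
    int i;

    for (i = 0; i < 16; i++) {
        e |= ((v >> (2 * i)) & 1u) << i;
        o |= ((v >> (2 * i + 1)) & 1u) << i;
    }
    *even = (uint16_t)e;
    *odd = (uint16_t)o;
}
```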

Optimized DSP software libraries
● Foundational math & signal processing
– MathLIB
– IQMath
– FastRTS
– DSPLIB
● Adaptive filtering, matrix computations
● Image & video processing
– IMGLIB
– Video Analytics & Vision Library (VLIB)
– VICP Signal Processing Library

Inline functions
● Pros
– To reduce overhead of a function call
– Lets the optimizer perform loop optimization
● Cons
– Code size increases
● To use
– Use -O2 or -O3 to make functions inline automatically
– Use explicit inline keyword
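
A small sketch of the explicit inline approach, assuming a hypothetical saturating helper. Once the helper is inlined, the loop no longer contains a function call, which is one of the conditions named earlier as blocking loop optimization.

```c
/* Clamp a 32-bit intermediate result into the signed 16-bit range. */
static inline short saturate16(int v)
{
    if (v > 32767) {
        return 32767;
    }
    if (v < -32768) {
        return -32768;
    }
    return (short)v;
}

/* After inlining, the loop body holds only the add and the clamp;
 * there is no call or return inside the loop. */
void add_saturated(short *restrict out, const short *restrict a,
                   const short *restrict b, int n)
{
    int i;

    for (i = 0; i < n; i++) {
        out[i] = saturate16(a[i] + b[i]);
    }
}
```

The trade-off noted in the slide still applies: every call site receives its own copy of the helper body, so code size grows with the number of uses.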

Optimization practice
● Use -o3 and consider -mt for optimization; use -k and consider -mw for compiler feedback (-mt: assume all pointers in a loop are independent)
● Apply the restrict keyword to minimize the loop-carried dependency bound (an alternative to -mt)
● Use the MUST_ITERATE and UNROLL pragmas to optimize pipeline usage
● Choose the smallest applicable data type and ensure proper data alignment to help the compiler invoke Single Instruction Multiple Data (SIMD) operations
● Use intrinsic operations and TI libraries where major code modification is needed (avoid standard I/O functions)

Using pragma
● Without a minimum iteration count, the compiler must assume the loop may iterate only once
– Providing a multiple factor gives the compiler freedom to unroll the loop
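
The code shown on the original slide did not survive extraction, so the following is a stand-in sketch rather than the author's example. It assumes a caller that always passes between 8 and 1024 samples in multiples of 8; the function name and those limits are illustrative. The pragma is guarded so that the file also builds with other compilers.

```c
#include <assert.h>

/* Sum n 16-bit samples. The caller guarantees 8 <= n <= 1024 and that
 * n is a multiple of 8 (assumed limits for this sketch). */
int sum_samples(const short *x, int n)
{
    int sum = 0;
    int i;

    /* The pragma below is a promise, not a check, so verify it in
     * debug builds. */
    assert(n >= 8 && n <= 1024 && n % 8 == 0);

#ifdef __TI_COMPILER_VERSION__
    /* Arguments: minimum trip count, maximum trip count, multiple. */
#pragma MUST_ITERATE(8, 1024, 8)
#endif
    for (i = 0; i < n; i++) {
        sum += x[i];
    }
    return sum;
}
```

With the minimum known, the compiler does not need to guard against a trip count that is too short for the pipelined schedule, and the multiple lets it unroll without generating cleanup code for leftover iterations.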

Reference
● Texas Instruments, 『Introduction to TMS320C6000 DSP Optimization』
– Recommended to read first
● Texas Instruments, 『In-Vehicle Connectivity is So Retro』
● Texas Instruments, 『TMS320C6000 Programmer's Guide』
● 『TMS320C6000 Optimizing Compiler v7.6 User's
