3. OMAP
– Application:
● CD-DA and CD-ROM/DVD-ROM/USB/SD with MP3, WMA, and AAC audio decoder support
– Software platform:
● Cooperation with QNX Software Systems
4. Jacinto 1-DDR
– Application:
● Compressed video playback, Bluetooth A2DP audio streaming
– Improvement:
● C64x+ fixed-point DSP for graphics acceleration, compressed audio decoding, voice recognition
5. Jacinto 3
– Application:
● Compressed video playback, Bluetooth A2DP audio streaming
– Hardware improvement:
● ARM Cortex-A8
● PowerVR SGX GPU
6. Jacinto 4/5
– Application:
● Full-HD 1080p video decode/encode
● QNX CAR 2 platform
– Hardware improvement:
● Dual ARM Cortex-M3, used for decoding video streams
● C674x DSP
11. C6000 Core
● 8 parallel functional units
– D: data load/store (.D1, .D2)
– S: shift, branch (.S1, .S2)
– M: multiply (.M1, .M2)
– L: logic, arithmetic operations (.L1, .L2)
12. C6000 Core (cont.)
● 32 32-bit registers for each side of the functional units
– A0-A31 (.D1, .S1, .M1, .L1)
– B0-B31 (.D2, .S2, .M2, .L2)
● Separate program and data memory (L1P, L1D)
● 256-bit internal program bus: fetches 8 32-bit instructions from L1P every cycle
● 2 64-bit internal data buses that allow both .D1 and .D2 to fetch data from L1D every cycle
15. C6000 pipeline
● Divide fetch, decode, and execute into more substages: 4-stage fetch, 2-stage decode, 10-stage execute
16. Delay slots
● The pipeline cannot be kept full when
– The current instruction depends on the result of a previous instruction that takes more than 1 cycle
– A branch is performed
● Solution
– Software scheduling (software pipelining)
– Hardware enhancement (SPLOOP buffer)
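The software-scheduling idea can be sketched at the C level. This is only an illustration under assumptions: the function names `dot_plain` and `dot_pipelined` are hypothetical, and the real C6000 compiler performs software pipelining on instructions, not on C statements. The second version overlaps the loads of iteration i with the multiply of iteration i-1, using a prologue to start the pipeline and an epilogue to drain it.

```c
#include <stddef.h>

/* Plain loop: each iteration loads, multiplies, then accumulates. */
long dot_plain(const short *a, const short *b, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (long)a[i] * b[i];
    return sum;
}

/* Hand-written model of software pipelining (illustrative only). */
long dot_pipelined(const short *a, const short *b, size_t n)
{
    if (n == 0)
        return 0;

    long sum = 0;
    short x = a[0];                 /* prologue: loads for iteration 0 */
    short y = b[0];

    for (size_t i = 1; i < n; i++) {
        long prod = (long)x * y;    /* multiply for iteration i-1 */
        x = a[i];                   /* loads for iteration i overlap it */
        y = b[i];
        sum += prod;
    }

    sum += (long)x * y;             /* epilogue: drain the last iteration */
    return sum;
}
```

Both functions return the same result; the difference is only in how work from neighbouring iterations is interleaved, which is what fills the delay slots.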
18. SPLOOP buffer
● Supported platforms
– C64x+, C674x, and C66x
● The SPLOOP buffer stores a single scheduled iteration of the loop in a specialized buffer
● The C compiler uses SPLOOP automatically
● Cannot handle loops that exceed 14 execute packets (at most 8 instructions per execute packet)
– Nested loops, conditional branches inside loops,
function calls inside loops
19. Compiler Optimization
● Use the C compiler to generate assembly code that utilizes the C6000 functional units and pipeline as fully as possible
– Add extra information and instructions to help the compiler optimize your code as much as possible
● Compiler options, e.g. -o3
● Keywords (C or C6000), e.g. restrict
● Pragma directives, e.g. MUST_ITERATE
– Understand compiler feedback
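As a minimal sketch of the keyword approach, the function below uses the standard C99 `restrict` qualifier. The function name `vec_add` and its parameters are hypothetical. The qualifier is a promise from the programmer that the three buffers do not overlap, so the compiler does not have to assume that a store to `out` changes a later load from `a` or `b`.

```c
#include <stddef.h>

/* restrict: caller guarantees that a, b and out never alias each other. */
void vec_add(const int *restrict a, const int *restrict b,
             int *restrict out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];   /* iterations are independent */
}
```

If the caller breaks the promise (for example by passing overlapping buffers), the behaviour is undefined, so the keyword should only be added where the non-overlap really holds.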
21. Dependency & resource information
● Minimize the iteration interval
– Loop-carried dependency bound
● Length of the longest loop-carried dependency path
– Partitioned resource bound
● Maximum number of cycles any functional unit is used in a single iteration
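A short sketch of what a loop-carried dependency looks like in C, together with the way the two bounds limit the iteration interval. The names `recursive_filter` and `min_iteration_interval` are hypothetical and the bound values passed in are made-up inputs, not real compiler feedback.

```c
#include <stddef.h>

/* Loop-carried dependency: each output needs the previous output,
 * so iteration i cannot finish before iteration i-1 has produced prev. */
void recursive_filter(const int *x, int *y, int coeff, size_t n)
{
    int prev = 0;
    for (size_t i = 0; i < n; i++) {
        prev = prev * coeff + x[i];
        y[i] = prev;
    }
}

/* The iteration interval can be no smaller than the larger of the
 * loop-carried dependency bound and the partitioned resource bound. */
unsigned min_iteration_interval(unsigned loop_carried_bound,
                                unsigned resource_bound)
{
    return loop_carried_bound > resource_bound ? loop_carried_bound
                                               : resource_bound;
}
```

When the loop-carried bound is the larger of the two, adding functional-unit capacity does not help; the dependency path itself has to be shortened or removed.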
23. Explicit code optimization
● Compiler optimization works well, except for
– Function calls in a loop
– Complex, hard-to-implement operations
● Solution: explicit code optimization
– Intrinsic operations
– Optimized C6000 DSP libraries
– C inline functions
24. Intrinsic operations
● Sample
– A shuffle operation separates the even and odd bits of a 32-bit value into two variables
● Intrinsic operations
– Function-like statements
– Leading underscore, e.g. _shfl
– Not a function call, no branch needed
● Listed in the “TMS320C6000 Optimizing Compiler v7.6 User's Guide”
● Device dependent
● _abs could be used directly
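For comparison, the sample operation can be written in portable C as the loop below. This is a sketch with a hypothetical function name, `split_even_odd`; it is not the TI intrinsic itself. On a C6000 target an intrinsic would replace this whole loop with a single instruction; the exact intrinsic to use and its bit layout should be checked in the compiler user's guide.

```c
#include <stdint.h>

/* Portable model: collect the even-numbered bits of value into *even
 * and the odd-numbered bits into *odd (bit 0 is the least significant). */
void split_even_odd(uint32_t value, uint16_t *even, uint16_t *odd)
{
    uint16_t e = 0;
    uint16_t o = 0;

    for (int i = 0; i < 16; i++) {
        e |= (uint16_t)(((value >> (2 * i)) & 1u) << i);
        o |= (uint16_t)(((value >> (2 * i + 1)) & 1u) << i);
    }

    *even = e;
    *odd = o;
}
```

The loop needs 16 iterations of shifts and masks, which is the kind of complex, hard-to-implement operation where an intrinsic pays off.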
26. Optimized DSP software libraries
● Foundational math & signal processing
– MathLIB
– IQMath
– FastRTS
– DSPLIB
● Adaptive filtering, matrix computations
● Image & video processing
– IMGLIB
– Video Analytics & Vision Library (VLIB)
– VICP Signal Processing Library
27. Inline functions
● Pros
– Reduces the overhead of a function call
– Lets the optimizer perform loop optimization
● Cons
– Code size increases
● To use
– Use -O2 or -O3 to let the compiler inline functions automatically
– Use the explicit inline keyword
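A minimal sketch of the explicit keyword, using hypothetical names (`saturate_to_s16`, `clip_buffer`). Because the helper is declared `static inline`, the compiler is free to paste its body into the loop, which removes the call inside the loop and leaves a loop body the optimizer can schedule.

```c
#include <stddef.h>

/* Small helper: clamp a 32-bit value to the signed 16-bit range. */
static inline int saturate_to_s16(int v)
{
    if (v > 32767)
        return 32767;
    if (v < -32768)
        return -32768;
    return v;
}

/* Once the helper is inlined, the loop contains no function call. */
void clip_buffer(const int *in, short *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (short)saturate_to_s16(in[i]);
}
```

The cost is that every call site gets its own copy of the helper body, which is the code-size increase listed under Cons.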
30. Optimization practice
● Use -o3 and consider -mt for optimization; use -k and consider -mw for compiler feedback (-mt: assume all pointers in a loop are independent)
● Apply the restrict keyword to minimize the loop-carried dependency bound (an alternative to -mt)
● Use the MUST_ITERATE and UNROLL pragmas to optimize pipeline usage
● Choose the smallest applicable data type and ensure proper data alignment to help the compiler invoke Single Instruction Multiple Data (SIMD) operations
● Use intrinsic operations and TI libraries in case major code modification is needed (avoid standard I/O functions)
31. Using pragma
● Without a minimum iteration count, the compiler must assume the loop may iterate only once
– Providing a factor gives the compiler the freedom to unroll the loop
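A hedged sketch of how the two pragmas are typically placed directly before a loop. The function name `sum_samples` and the numbers (minimum 8, maximum 1024, multiple of 4, unroll by 4) are assumptions chosen for illustration. The pragmas are guarded with `__TI_COMPILER_VERSION__` so that the same file also builds with a host compiler, which simply compiles the plain loop.

```c
/* Caller contract (assumed for this example): n is a multiple of 4
 * and lies between 8 and 1024. */
long sum_samples(const short *x, int n)
{
    long sum = 0;
    int i;

#ifdef __TI_COMPILER_VERSION__
#pragma MUST_ITERATE(8, 1024, 4)   /* min count, max count, multiple */
#pragma UNROLL(4)                  /* ask for 4x unrolling */
#endif
    for (i = 0; i < n; i++)
        sum += x[i];

    return sum;
}
```

The pragmas are promises, not checks: if the function is called with a count that violates them, the optimized loop may produce wrong results.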