TRACK G: An Innovative Multicore System Architecture for Wireless SoCs / Alon Yaakov
1. May 1, 2013
An Innovative Multicore System
Architecture for Wireless SoCs
Alon Yaakov
DSP Architecture Manager, CEVA
2. May 1, 2013
Multicore in Embedded System
Defining the Problem
• Control-plane
– Synchronization between cores
– Semaphores
– Message passing using mailbox mechanism
– Snooping mechanism
– Interrupt handling
• Data-plane
– Antenna processing, equalization and error correction
This will be the focus of today’s presentation
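To make the control-plane mechanisms above concrete, here is a minimal sketch of message passing between cores using a shared-memory mailbox. All names and the layout are hypothetical; a real SoC would place the structure at a fixed shared address and pair the ready flag with an inter-core interrupt.

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical shared-memory mailbox for core-to-core messaging.
 * In real hardware this would live at a fixed shared-memory address and
 * the 'ready' flag would be paired with an inter-core interrupt. */
typedef struct {
    volatile uint32_t msg;    /* message word, e.g. a task ID or buffer pointer */
    volatile bool     ready;  /* set by the sender, cleared by the receiver */
} mailbox_t;

/* Sender: returns false if the previous message was not yet consumed. */
static bool mailbox_send(mailbox_t *mb, uint32_t msg)
{
    if (mb->ready)            /* receiver has not drained the mailbox yet */
        return false;
    mb->msg = msg;
    mb->ready = true;         /* on real HW this would also raise an IRQ */
    return true;
}

/* Receiver: polls the mailbox; returns true if a message was read. */
static bool mailbox_recv(mailbox_t *mb, uint32_t *out)
{
    if (!mb->ready)
        return false;
    *out = mb->msg;
    mb->ready = false;
    return true;
}
```

A single-word mailbox like this is the simplest case; the snooping and semaphore mechanisms listed above address the same synchronization problem at other granularities.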
4. May 1, 2013
Multicore Challenges
1. Partitioning
> Task partitioning onto different chip resources
> Data partitioning onto different chip resources
2. Resource sharing
> Memories, buses, system I/Fs, peripherals, etc.
3. Scheduling
> Allocating tasks/data
4. Data sharing
> Transferring data between engines
[Diagram: mapping an application’s tasks (FFT, MLD, CTC) onto DSP A, DSP B and DSP C]
5. May 1, 2013
• Tasks
– Parts of an algorithm running in sequential order
– A task must have defined input and output data structures (packets)
Challenge 1: Task Partitioning
[Diagram: the receive chain (antenna processing, equalization, error correction) split into tasks – FFT and channel estimation per antenna, MLD, reordering, interleaving, CTC decoding, concatenation & CRC checking – with data packets passed between tasks]
6. May 1, 2013
Challenge 1: Task Partitioning
HW Offloading
• Parts of the algorithm are more suited for HW
acceleration
– Well known algorithms that require little programmability
– Heavy computational effort
[Diagram: the same task graph with the FFT, MLD and CTC blocks offloaded to hardware accelerators]
7. May 1, 2013
Challenge 1: Data Partitioning
• Several cores are used to process different
input data packets
• Suitable for homogeneous systems
• Shared memory is used for storing history data
– A core must wait for the data to update before using it, adding latency
• The entire program code is used by all cores
– Cores suffer stall cycles if L1 memory is small
8. May 1, 2013
Challenge 1: Partitioning
OK, Now What?
• Efficient partitioning is dependent on the
hardware platform
• Building the optimal system depends on the
partitioning
• There is no single optimal solution
– Each approach has its merits
• Partitioning can be eased by starting with a
reference that can be used as a basis
10. May 1, 2013
Challenge 2: Resource Sharing
Avoiding Contentions
• If possible, avoid contentions by duplicating HW
– Multiple DMAs
– Duplicated HW accelerators
– Multilayer BUS
– Partition memory into blocks enabling concurrent access
• Throughput and latency govern the minimum amount of
hardware resources
[Diagram: memory partitioned into multiple blocks for concurrent access]
11. May 1, 2013
Challenge 2: Resource Sharing
Arbitration
• When a simple set of known rules can be defined, a
resource can be shared using a HW arbiter
• QoS
– Priority
– Bandwidth allocation (weight)
– Well known algorithms (round robin)
• Arbitration is based on time-sharing of resources
[Diagram: arbiter scheduling multiple masters’ accesses to a shared memory]
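The QoS properties listed above (priority, bandwidth weights, round robin) can be sketched as a weighted round-robin arbiter. This is an illustrative software model, not CEVA’s hardware design: each requester carries a credit count equal to its bandwidth weight, and credits are refilled when a new round starts.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_REQ 3

/* Illustrative weighted round-robin arbiter model: each requester may
 * receive 'weight' grants per round, giving proportional bandwidth. */
typedef struct {
    int weight[NUM_REQ];   /* bandwidth allocation per requester */
    int credit[NUM_REQ];   /* grants left in the current round */
    int last;              /* last granted requester, for rotation */
} arbiter_t;

void arbiter_init(arbiter_t *a, const int weight[NUM_REQ])
{
    for (int i = 0; i < NUM_REQ; i++) {
        a->weight[i] = weight[i];
        a->credit[i] = weight[i];
    }
    a->last = NUM_REQ - 1;
}

/* Grant one cycle to a requester asserting 'req'; returns -1 if none.
 * Credits are refilled when all requesting masters ran out of credit. */
int arbiter_grant(arbiter_t *a, const bool req[NUM_REQ])
{
    for (int pass = 0; pass < 2; pass++) {
        for (int i = 1; i <= NUM_REQ; i++) {
            int c = (a->last + i) % NUM_REQ;
            if (req[c] && a->credit[c] > 0) {
                a->credit[c]--;
                a->last = c;
                return c;
            }
        }
        for (int i = 0; i < NUM_REQ; i++)   /* start a new round */
            a->credit[i] = a->weight[i];
    }
    return -1; /* nobody is requesting */
}
```

With weights 2:1, a continuously requesting master 0 receives two grants for every grant of master 1, which is the "bandwidth allocation" behavior named on the slide.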
12. May 1, 2013
Challenge 3: Scheduling
• How do we assign and schedule tasks to
cores?
[Diagram: the application task graph (FFT, channel estimation, MLD, reordering, interleaving, CTC, concatenation & CRC) to be mapped onto DSP A, DSP B and DSP C]
13. May 1, 2013
Challenge 3: Scheduling
Static Scheduling
• Tasks are statically assigned to DSP cores
• Design phase includes task scheduling
– Data flow is fixed
– Suitable when the load on each task is fixed
[Diagram: static assignment – DSP A with the FFT HW core, DSP B with the MLD HW core, DSP C with the CTC HW core, each pinned to a fixed stage of the task graph]
14. May 1, 2013
Challenge 3: Scheduling
Dynamic Scheduling
[Diagram: a master scheduler dispatching tasks from the task graph to DSP A, DSP B and DSP C at runtime]
> A scheduler dynamically assigns tasks
to cores
> Scheduler algorithm selects the best
core to execute the task
> Processing capabilities
> Locality of data
> Load balance
> Suitable for complex designs with
variable processing load
and QoS
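The three selection criteria above (processing capabilities, locality of data, load balance) can be combined into a simple cost function. The sketch below is a hypothetical scheduler model under assumed state tracking, not the actual MUST scheduler algorithm.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_CORES 3

/* Hypothetical per-core state a master scheduler might track. */
typedef struct {
    int  load;          /* queued work, e.g. an estimate of pending cycles */
    bool can_run[8];    /* capability matrix: task type -> supported? */
    int  resident_data; /* ID of the data block already in local L1, -1 if none */
} core_state_t;

/* Pick the best core for (task_type, data_id): prefer cores that already
 * hold the input data locally (no pre-fetch cost), then break ties by the
 * lightest load. Returns -1 if no core can run the task. */
int pick_core(const core_state_t cores[NUM_CORES], int task_type, int data_id)
{
    int best = -1, best_cost = 0;
    for (int i = 0; i < NUM_CORES; i++) {
        if (!cores[i].can_run[task_type])
            continue;                     /* processing capabilities */
        int cost = cores[i].load;         /* load balance */
        if (cores[i].resident_data != data_id)
            cost += 100;                  /* assumed penalty for moving data */
        if (best < 0 || cost < best_cost) {
            best = i;
            best_cost = cost;
        }
    }
    return best;
}
```

The fixed penalty of 100 is an arbitrary illustration; a real scheduler would weigh the actual data-transfer time against the load imbalance.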
15. May 1, 2013
Challenge 4: Data Sharing
Memory Hierarchy
• Internal L1 memory
– Fast memory with no access penalty
– Small / medium size
– Dedicated per core
• External memory
– Can be on-chip (L2) or off-chip (e.g. DDR)
– Slow memory with access penalty
– Large size
– Shared among several cores
– Prone to contentions
16. May 1, 2013
Challenge 4: Data Sharing
Using Cache
• When shared data is used, a cache system can be
used to reduce the stall count
– Statistically reduces memory stalls, but is not
deterministic
• Best suited for narrow data accesses
– Cache should be used for control data
– Not recommended for vector DSP data flows: it would
require large caches and still incur many stall cycles
How to share vector data?
17. May 1, 2013
Challenge 4: Data Sharing
Pre-Fetching Data
• A task cannot start until its preceding task completes
• If we can schedule the next task to be executed we can
pre-fetch its input data
– Using static scheduling the data flow is known
– Using dynamic scheduling the scheduler must handle data
move prior to activating a task
[Diagram: the task graph with DMA transfers inserted between consecutive tasks, pre-fetching each task’s input data]
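The pre-fetching idea above is essentially double buffering: while the DSP processes packet n out of one L1 buffer, the DMA fills the other buffer with packet n+1. The sketch below models this ping-pong loop; `dma_fetch` and `process` are software stand-ins (assumed names) so the control flow is testable, and on real hardware the fetch would overlap the processing instead of running sequentially.

```c
#include <assert.h>

#define PKT_WORDS 4

/* Software stand-in for a DMA transfer of one packet. */
static void dma_fetch(int dst[PKT_WORDS], const int *src)
{
    for (int i = 0; i < PKT_WORDS; i++)
        dst[i] = src[i];
}

/* Software stand-in for the DSP kernel: sums a packet. */
static int process(const int pkt[PKT_WORDS])
{
    int acc = 0;
    for (int i = 0; i < PKT_WORDS; i++)
        acc += pkt[i];
    return acc;
}

/* Process n_pkts packets stored back-to-back in external memory using
 * two L1 buffers (ping and pong); returns the sum of per-packet results. */
int run_pipeline(const int *ext_mem, int n_pkts)
{
    int l1[2][PKT_WORDS];
    int total = 0;

    if (n_pkts <= 0)
        return 0;
    dma_fetch(l1[0], ext_mem);           /* prime the pipeline: packet 0 */
    for (int n = 0; n < n_pkts; n++) {
        int cur = n & 1;
        if (n + 1 < n_pkts)              /* pre-fetch the next packet... */
            dma_fetch(l1[cur ^ 1], ext_mem + (n + 1) * PKT_WORDS);
        total += process(l1[cur]);       /* ...overlapped with this on real HW */
    }
    return total;
}
```

Because the next task's input is already in L1 when processing starts, the external-memory latency is hidden behind useful work.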
18. May 1, 2013
Challenge 4: Data Sharing
Pre-Fetching using DMA
• DMA transfer must wait for the following conditions:
– Source data is available
– Destination data can be written (i.e. allocated memory is free)
• DMA activation schemes
– Real-time SW: programmable, but large MIPS overhead
– HW system events: not programmable
– Queue manager: programmable, no MIPS overhead
19. May 1, 2013
Challenge 4: Data Sharing
Pre-Fetching using DMA with Queue Manager
• A Queue is a list of tasks handled in a FIFO manner
• Each DMA queue contains all DMA tasks associated with a
data-flow channel
• DMA tasks are pushed to the queue by
– DSP software (static scheduling)
– The system scheduler (dynamic scheduling)
• Tasks are automatically activated using HW or SW events
– Source data is available & destination memory is free
[Diagram: a DMA queue feeding the FFT and channel-estimation tasks]
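The queue-manager behavior described above can be sketched as a FIFO of DMA task descriptors, where the head task fires only once its source data is available and its destination buffer is free. This is a hypothetical software model of the mechanism, with assumed structure and field names.

```c
#include <assert.h>
#include <stdbool.h>

#define QUEUE_DEPTH 8

/* Hypothetical DMA task descriptor: the transfer may fire only when the
 * source holds valid data and the destination buffer has been drained. */
typedef struct {
    int  channel;      /* data-flow channel this task belongs to */
    bool src_ready;    /* producer has written the source buffer */
    bool dst_free;     /* consumer has drained the destination buffer */
} dma_task_t;

typedef struct {
    dma_task_t task[QUEUE_DEPTH];
    int head, tail, count;   /* FIFO state */
} dma_queue_t;

/* Push a task (from DSP software or the system scheduler). */
bool queue_push(dma_queue_t *q, dma_task_t t)
{
    if (q->count == QUEUE_DEPTH)
        return false;
    q->task[q->tail] = t;
    q->tail = (q->tail + 1) % QUEUE_DEPTH;
    q->count++;
    return true;
}

/* Activate the head task if its conditions are met (driven by HW or SW
 * events in a real system). Returns the channel served, or -1. */
int queue_service(dma_queue_t *q)
{
    if (q->count == 0)
        return -1;
    dma_task_t *t = &q->task[q->head];
    if (!t->src_ready || !t->dst_free)
        return -1;           /* head is blocked: FIFO order is preserved */
    int ch = t->channel;
    q->head = (q->head + 1) % QUEUE_DEPTH;
    q->count--;
    return ch;
}
```

Since activation is event-driven rather than polled by the DSP, the queue manager achieves the "programmable, no MIPS overhead" property from the previous slide.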
22. May 1, 2013
MUST™ Multi-core System Technology
Overview
• Fully featured data cache
– Non-blocking, software operations, Write-Back &
Write-Through
• Advanced support for cache coherency
– Based on ARM’s leading AMBA-4 ACE™ technology
• Advanced system interconnect
– AXI-4 - easy system integration and high Quality of
Service (QoS)
– Multi-layer FIC (Fast Inter-Connect) - low latency, high
throughput master and slave ports
– Multi-level memory architecture using local TCMs and
hierarchy of caches
23. May 1, 2013
MUST™ Multicore System Technology
– Cont.
• Data Traffic Manager
– Automated data traffic management without DSP intervention
• Comprehensive software development support
– Advanced multicore debug and profiling
– Complete system emulation with real hardware
– Hardware abstraction layer (HAL) including drivers and system APIs
• Support for homogeneous and heterogeneous clusters of
multiple DSPs and CPUs
– Support for advanced resource management and sharing
– Flexible task scheduling for different system architectures: dynamic,
event based, data driven, etc.
24. May 1, 2013
Cache Coherency Support
> Allows multiple cores to use shared memory without any software
intervention
> Superior performance to SW coherency
> Simplifies software development
> Easy SW partitioning and scaling from single core to multicore
> External memory can be dynamically partitioned into
shared and unique areas
> Minimizes system memory size
> Flexible memory allocation speeds up SW development
> Snooping is only applied to shared areas
25. May 1, 2013
Data Traffic Management
– Data Traffic Manager
– Based on Queue Manager and Buffer Manager Structures
• Queue Manager - Maintains multiple independent queues of “tasks”
• Buffer Manager - Autonomously tracks data status of source and
destination buffers
– Data transfers are automatically managed based on tasks
status, input and output data buffers load
– Automatic data traffic management and DSP offloading
– Prioritized scheduling for guaranteed QoS
– Low latency packet transfers without software intervention
– Results in lower memory consumption and improved system
performance
26. May 1, 2013
Data Traffic Manager
– Allows sharing a resource among multiple cores via a shared queue
• Tasks are executed based on priority and buffer status
• Prevents starvation and deadlocks
– Allows a single core to work with multiple queues
• The core reads / writes
from / to its buffers
(local or external)
• All data transfers
between cores and
accelerators are
performed automatically
via the data traffic
manager
27. May 1, 2013
Dynamic Scheduling
– Dynamic scheduling in symmetric systems
– A clustered system based on homogeneous DSP cores
– Dynamic task allocation to DSP cores in runtime
– Flexible choice of algorithms
based on system load
– Hardware abstraction using
task oriented APIs
– Shared external memories
– FIC interface for low-latency
high-bandwidth data accesses
• Commonly used in wireless
infrastructure applications
28. May 1, 2013
MUST™ Hardware Abstraction Layer
(HAL)
• MUST™ is assisted by user-friendly software support
– Abstracts the queues, buffers, DMA and caches
• The software package includes:
– Drivers and APIs
– Full system profiling
– Graphical interface via CEVA ToolBox
29. May 1, 2013
Multicore Modeling and Simulation
• Simulating any number of cores
– Support for symmetric and asymmetric configurations
• Support for ARM CADI (Component Architecture Debug Interface)
– Including connectivity to ARM’s Real-View debugger
• Comprehensive multi-core simulation support
– Synchronization, system browsing, shared memories, interconnect,
accelerator simulation, cross-triggering, AXI / FIC I/F
– Support for user-defined
components
• ESL tools integration with
full debug capabilities
– Compliant with TLM 2.0
– Full support for Carbon
and Synopsys
30. May 1, 2013
Co-processor Portfolio for
Wireless Modems
– Wide range of co-processors offering power-efficient
DSP offloading at extreme processing rates
• A complete wireless platform addressing all major modem PHY
processing requirements
• Offering flexible hardware-software partitioning
• Customers can focus on differentiation via DSP software
– Unique automated data traffic management
between DSP memory and hardware accelerators
• Allows fully parallel processing support
• Based on data traffic manager
31. May 1, 2013
Co-processor Portfolio for Wireless
Modems – Cont.
• Optimized tightly coupled extensions (TCE)
– MLD – Maximum Likelihood MIMO detectors
• Supports up to four MIMO layers
• Achieves near ML performance
– De-spreader – 3G De-spreader units
• Supports all WCDMA and HSDPA channels
• Scalable up to 3GPP HSPA+ Rel-11
– DFT / FFT
• Supports multi radix DFTs
• Includes NCO correction
– Viterbi
• Programmable K and r values
• Supports tail biting
– LLR processing and HARQ combining
• Supports LTE de-rate matching
• Significantly reduces HARQ memory buffer sizes
Dramatically reduces
time-to-market
32. May 1, 2013
Putting It All Together
A Cluster of Four CEVA-XC DSPs
> Processor Level
> Fixed-point and Floating-point
Vector DSPs
> Running over 1GHz
> Data Cache
> Platform Level
> Complete set of tightly coupled co-
processor units
> Automated DSP offloading using data traffic
management
> System Level
> Full cache coherency support
> AMBA-4 and FIC system interfaces
> Software Development Support
> HAL using drivers and APIs
> Comprehensive system debug & profiling
Task scheduling must account for the worst-case execution time per task.
Cache coherency: when shared data resides in the cache of one core, that core is unaware of changes made to the data by other cores. This is best solved using MESI / ACE protocols.