TRACK G: An Innovative Multicore System Architecture for Wireless SoCs / Alon Yaakov
1. May 1, 2013
An Innovative Multicore System
Architecture for Wireless SoCs
Alon Yaakov
DSP Architecture Manager, CEVA
2. May 1, 2013
Multicore in Embedded System
Defining the Problem
• Control-plane
– Synchronization between cores
– Semaphores
– Message passing using mailbox mechanism
– Snooping mechanism
– Interrupt handling
• Data-plane
– Antenna processing, equalization and error correction
This will be the focus of today’s presentation
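To make the control-plane mechanisms above concrete, here is a minimal sketch of message passing between cores using a shared-memory mailbox. All names and the layout are hypothetical; a real SoC would place the structure at a fixed shared address and pair the ready flag with an inter-core interrupt.

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical shared-memory mailbox for core-to-core messaging.
 * In real hardware this would live at a fixed shared-memory address and
 * the 'ready' flag would be paired with an inter-core interrupt. */
typedef struct {
    volatile uint32_t msg;    /* message word, e.g. a task ID or buffer pointer */
    volatile bool     ready;  /* set by the sender, cleared by the receiver */
} mailbox_t;

/* Sender: returns false if the previous message was not yet consumed. */
static bool mailbox_send(mailbox_t *mb, uint32_t msg)
{
    if (mb->ready)            /* receiver has not drained the mailbox yet */
        return false;
    mb->msg = msg;
    mb->ready = true;         /* on real HW this would also raise an IRQ */
    return true;
}

/* Receiver: polls the mailbox; returns true if a message was read. */
static bool mailbox_recv(mailbox_t *mb, uint32_t *out)
{
    if (!mb->ready)
        return false;
    *out = mb->msg;
    mb->ready = false;
    return true;
}
```

A single-word mailbox like this is the simplest case; the snooping and semaphore mechanisms listed above address the same synchronization problem at other granularities.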
4. May 1, 2013
Multicore Challenges
1. Partitioning
> Task partitioning onto different chip resources
> Data partitioning onto different chip resources
2. Resource sharing
> Memories, buses, system I/Fs, peripherals, etc.
3. Scheduling
> Allocating tasks/data
4. Data sharing
> Transferring data between engines
[Diagram: mapping an application’s tasks (FFT, MLD, CTC) onto DSP A, DSP B and DSP C]
5. May 1, 2013
• Tasks
– Parts of an algorithm running in sequential order
– A task must have defined input and output data structures (packets)
Challenge 1: Task Partitioning
[Diagram: the receive chain (antenna processing, equalization, error correction) split into tasks – FFT and channel estimation per antenna, MLD, reordering, interleaving, CTC decoding, concatenation & CRC checking – with data packets passed between tasks]
6. May 1, 2013
Challenge 1: Task Partitioning
HW Offloading
• Parts of the algorithm are more suited for HW
acceleration
– Well known algorithms that require little programmability
– Heavy computational effort
[Diagram: the same task graph with the FFT, MLD and CTC blocks offloaded to hardware accelerators]
7. May 1, 2013
Challenge 1: Data Partitioning
• Several cores are used to process different
input data packets
• Suitable for homogeneous systems
• Shared memory is used for storing history data
– A core must wait for the data to update before using it, adding latency
• The entire program code is used by all cores
– Cores suffer stall cycles if L1 memory is small
8. May 1, 2013
Challenge 1: Partitioning
OK, Now What?
• Efficient partitioning is dependent on the
hardware platform
• Building the optimal system depends on the
partitioning
• There is no single optimal solution
– Each approach has its merits
• Partitioning can be eased by starting with a
reference that can be used as a basis
10. May 1, 2013
Challenge 2: Resource Sharing
Avoiding Contentions
• If possible, avoid contentions by duplicating HW
– Multiple DMAs
– Duplicated HW accelerators
– Multilayer BUS
– Partition memory into blocks enabling concurrent access
• Throughput and latency govern the minimum amount of
hardware resources
[Diagram: memory partitioned into multiple blocks for concurrent access]
11. May 1, 2013
Challenge 2: Resource Sharing
Arbitration
• When a simple set of known rules can be defined, a
resource can be shared using a HW arbiter
• QoS
– Priority
– Bandwidth allocation (weight)
– Well known algorithms (round robin)
• Arbitration is based on time-sharing of resources
[Diagram: arbiter scheduling multiple masters’ accesses to a shared memory]
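The QoS properties listed above (priority, bandwidth weights, round robin) can be sketched as a weighted round-robin arbiter. This is an illustrative software model, not CEVA’s hardware design: each requester carries a credit count equal to its bandwidth weight, and credits are refilled when a new round starts.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_REQ 3

/* Illustrative weighted round-robin arbiter model: each requester may
 * receive 'weight' grants per round, giving proportional bandwidth. */
typedef struct {
    int weight[NUM_REQ];   /* bandwidth allocation per requester */
    int credit[NUM_REQ];   /* grants left in the current round */
    int last;              /* last granted requester, for rotation */
} arbiter_t;

void arbiter_init(arbiter_t *a, const int weight[NUM_REQ])
{
    for (int i = 0; i < NUM_REQ; i++) {
        a->weight[i] = weight[i];
        a->credit[i] = weight[i];
    }
    a->last = NUM_REQ - 1;
}

/* Grant one cycle to a requester asserting 'req'; returns -1 if none.
 * Credits are refilled when all requesting masters ran out of credit. */
int arbiter_grant(arbiter_t *a, const bool req[NUM_REQ])
{
    for (int pass = 0; pass < 2; pass++) {
        for (int i = 1; i <= NUM_REQ; i++) {
            int c = (a->last + i) % NUM_REQ;
            if (req[c] && a->credit[c] > 0) {
                a->credit[c]--;
                a->last = c;
                return c;
            }
        }
        for (int i = 0; i < NUM_REQ; i++)   /* start a new round */
            a->credit[i] = a->weight[i];
    }
    return -1; /* nobody is requesting */
}
```

With weights 2:1, a continuously requesting master 0 receives two grants for every grant of master 1, which is the "bandwidth allocation" behavior named on the slide.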
12. May 1, 2013
Challenge 3: Scheduling
• How do we assign and schedule tasks to
cores?
[Diagram: the application task graph (FFT, channel estimation, MLD, reordering, interleaving, CTC, concatenation & CRC) to be mapped onto DSP A, DSP B and DSP C]
13. May 1, 2013
Challenge 3: Scheduling
Static Scheduling
• Tasks are statically assigned to DSP cores
• Design phase includes task scheduling
– Data flow is fixed
– Suitable when the load on each task is fixed
[Diagram: static assignment – DSP A with the FFT HW core, DSP B with the MLD HW core, DSP C with the CTC HW core, each pinned to a fixed stage of the task graph]
14. May 1, 2013
Challenge 3: Scheduling
Dynamic Scheduling
[Diagram: a master scheduler dispatching tasks from the task graph to DSP A, DSP B and DSP C at runtime]
> A scheduler dynamically assigns tasks
to cores
> Scheduler algorithm selects the best
core to execute the task
> Processing capabilities
> Locality of data
> Load balance
> Suitable for complex designs with
variable processing load
and QoS
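The three selection criteria above (processing capabilities, locality of data, load balance) can be combined into a simple cost function. The sketch below is a hypothetical scheduler model under assumed state tracking, not the actual MUST scheduler algorithm.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_CORES 3

/* Hypothetical per-core state a master scheduler might track. */
typedef struct {
    int  load;          /* queued work, e.g. an estimate of pending cycles */
    bool can_run[8];    /* capability matrix: task type -> supported? */
    int  resident_data; /* ID of the data block already in local L1, -1 if none */
} core_state_t;

/* Pick the best core for (task_type, data_id): prefer cores that already
 * hold the input data locally (no pre-fetch cost), then break ties by the
 * lightest load. Returns -1 if no core can run the task. */
int pick_core(const core_state_t cores[NUM_CORES], int task_type, int data_id)
{
    int best = -1, best_cost = 0;
    for (int i = 0; i < NUM_CORES; i++) {
        if (!cores[i].can_run[task_type])
            continue;                     /* processing capabilities */
        int cost = cores[i].load;         /* load balance */
        if (cores[i].resident_data != data_id)
            cost += 100;                  /* assumed penalty for moving data */
        if (best < 0 || cost < best_cost) {
            best = i;
            best_cost = cost;
        }
    }
    return best;
}
```

The fixed penalty of 100 is an arbitrary illustration; a real scheduler would weigh the actual data-transfer time against the load imbalance.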
15. May 1, 2013
Challenge 4: Data Sharing
Memory Hierarchy
• Internal L1 memory
– Fast memory with no access penalty
– Small / medium size
– Dedicated per core
• External memory
– Can be on-chip (L2) or off-chip (e.g. DDR)
– Slow memory with access penalty
– Large size
– Shared among several cores
– Prone to contentions
16. May 1, 2013
Challenge 4: Data Sharing
Using Cache
• When shared data is used, a cache system can be
used to reduce the stall count
– Statistically reduces memory stalls, but is not
deterministic
• Best suited for narrow data accesses
– Cache should be used for control data
– Not recommended for vector DSP data flows: it would
require large caches and still incur many stall cycles
How to share vector data?
17. May 1, 2013
Challenge 4: Data Sharing
Pre-Fetching Data
• A task cannot start until its preceding task completes
• If we can schedule the next task to be executed we can
pre-fetch its input data
– Using static scheduling the data flow is known
– Using dynamic scheduling the scheduler must handle data
move prior to activating a task
[Diagram: the task graph with DMA transfers inserted between consecutive tasks, pre-fetching each task’s input data]
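The pre-fetching idea above is essentially double buffering: while the DSP processes packet n out of one L1 buffer, the DMA fills the other buffer with packet n+1. The sketch below models this ping-pong loop; `dma_fetch` and `process` are software stand-ins (assumed names) so the control flow is testable, and on real hardware the fetch would overlap the processing instead of running sequentially.

```c
#include <assert.h>

#define PKT_WORDS 4

/* Software stand-in for a DMA transfer of one packet. */
static void dma_fetch(int dst[PKT_WORDS], const int *src)
{
    for (int i = 0; i < PKT_WORDS; i++)
        dst[i] = src[i];
}

/* Software stand-in for the DSP kernel: sums a packet. */
static int process(const int pkt[PKT_WORDS])
{
    int acc = 0;
    for (int i = 0; i < PKT_WORDS; i++)
        acc += pkt[i];
    return acc;
}

/* Process n_pkts packets stored back-to-back in external memory using
 * two L1 buffers (ping and pong); returns the sum of per-packet results. */
int run_pipeline(const int *ext_mem, int n_pkts)
{
    int l1[2][PKT_WORDS];
    int total = 0;

    if (n_pkts <= 0)
        return 0;
    dma_fetch(l1[0], ext_mem);           /* prime the pipeline: packet 0 */
    for (int n = 0; n < n_pkts; n++) {
        int cur = n & 1;
        if (n + 1 < n_pkts)              /* pre-fetch the next packet... */
            dma_fetch(l1[cur ^ 1], ext_mem + (n + 1) * PKT_WORDS);
        total += process(l1[cur]);       /* ...overlapped with this on real HW */
    }
    return total;
}
```

Because the next task's input is already in L1 when processing starts, the external-memory latency is hidden behind useful work.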
18. May 1, 2013
Challenge 4: Data Sharing
Pre-Fetching using DMA
• DMA transfer must wait for the following conditions:
– Source data is available
– Destination data can be written (i.e. allocated memory is free)
• DMA activation schemes
– Real-time SW: programmable, but large MIPS overhead
– HW system events: not programmable
– Queue manager: programmable, no MIPS overhead
19. May 1, 2013
Challenge 4: Data Sharing
Pre-Fetching using DMA with Queue Manager
• A Queue is a list of tasks handled in a FIFO manner
• Each DMA queue contains all DMA tasks associated with a
data-flow channel
• DMA tasks are pushed to the queue by
– DSP software (static scheduling)
– The system scheduler (dynamic scheduling)
• Tasks are automatically activated using HW or SW events
– Source data is available & destination memory is free
[Diagram: a DMA queue feeding the FFT and channel-estimation tasks]
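The queue-manager behavior described above can be sketched as a FIFO of DMA task descriptors, where the head task fires only once its source data is available and its destination buffer is free. This is a hypothetical software model of the mechanism, with assumed structure and field names.

```c
#include <assert.h>
#include <stdbool.h>

#define QUEUE_DEPTH 8

/* Hypothetical DMA task descriptor: the transfer may fire only when the
 * source holds valid data and the destination buffer has been drained. */
typedef struct {
    int  channel;      /* data-flow channel this task belongs to */
    bool src_ready;    /* producer has written the source buffer */
    bool dst_free;     /* consumer has drained the destination buffer */
} dma_task_t;

typedef struct {
    dma_task_t task[QUEUE_DEPTH];
    int head, tail, count;   /* FIFO state */
} dma_queue_t;

/* Push a task (from DSP software or the system scheduler). */
bool queue_push(dma_queue_t *q, dma_task_t t)
{
    if (q->count == QUEUE_DEPTH)
        return false;
    q->task[q->tail] = t;
    q->tail = (q->tail + 1) % QUEUE_DEPTH;
    q->count++;
    return true;
}

/* Activate the head task if its conditions are met (driven by HW or SW
 * events in a real system). Returns the channel served, or -1. */
int queue_service(dma_queue_t *q)
{
    if (q->count == 0)
        return -1;
    dma_task_t *t = &q->task[q->head];
    if (!t->src_ready || !t->dst_free)
        return -1;           /* head is blocked: FIFO order is preserved */
    int ch = t->channel;
    q->head = (q->head + 1) % QUEUE_DEPTH;
    q->count--;
    return ch;
}
```

Since activation is event-driven rather than polled by the DSP, the queue manager achieves the "programmable, no MIPS overhead" property from the previous slide.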
22. May 1, 2013
MUST™ Multi-core System Technology
Overview
• Fully featured data cache
– Non-blocking, software operations, Write-Back &
Write-Through
• Advanced support for cache coherency
– Based on ARM’s leading AMBA-4 ACE™ technology
• Advanced system interconnect
– AXI-4 - easy system integration and high Quality of
Service (QoS)
– Multi-layer FIC (Fast Inter-Connect) - low latency, high
throughput master and slave ports
– Multi-level memory architecture using local TCMs and
hierarchy of caches
23. May 1, 2013
MUST™ Multicore System Technology
– Cont.
• Data Traffic Manager
– Automated data traffic management without DSP intervention
• Comprehensive software development support
– Advanced multicore debug and profiling
– Complete system emulation with real hardware
– Hardware abstraction layer (HAL) including drivers and system APIs
• Support for homogeneous and heterogeneous clusters of
multiple DSPs and CPUs
– Support for advanced resource management and sharing
– Flexible task scheduling for different system architectures: dynamic,
event based, data driven, etc.
24. May 1, 2013
Cache Coherency Support
> Allows multiple cores to use shared memory without any software
intervention
> Superior performance to SW coherency
> Simplifies software development
> Easy SW partitioning and scaling from single core to multicore
> External memory can be dynamically partitioned into
shared and unique areas
> Minimizes system memory size
> Flexible memory allocation speeds up SW development
> Snooping is only applied to shared areas
25. May 1, 2013
Data Traffic Management
– Data Traffic Manager
– Based on Queue Manager and Buffer Manager Structures
• Queue Manager - Maintains multiple independent queues of “tasks”
• Buffer Manager - Autonomously tracks data status of source and
destination buffers
– Data transfers are automatically managed based on tasks
status, input and output data buffers load
– Automatic data traffic management and DSP offloading
– Prioritized scheduling for guaranteed QoS
– Low latency packet transfers without software intervention
– Results in lower memory consumption and improved system
performance
26. May 1, 2013
Data Traffic Manager
– Allows sharing a resource among multiple cores via a shared queue
• Tasks are executed based on priority and buffer status
• Prevents starvation and deadlocks
– Allows a single core to work with multiple queues
• The core reads / writes
from / to its buffers
(local or external)
• All data transfers
between cores and
accelerators are
performed automatically
via the data traffic
manager
27. May 1, 2013
Dynamic Scheduling
– Dynamic scheduling in symmetric systems
– A clustered system based on homogeneous DSP cores
– Dynamic task allocation to DSP cores in runtime
– Flexible choice of algorithms
based on system load
– Hardware abstraction using
task oriented APIs
– Shared external memories
– FIC interface for low-latency
high-bandwidth data accesses
• Commonly used in wireless
infrastructure applications
28. May 1, 2013
MUST™ Hardware Abstraction Layer
(HAL)
• MUST™ is assisted by user-friendly software support
– Abstracts the queues, buffers, DMA and caches
• The software package includes:
– Drivers and APIs
– Full system profiling
– Graphical interface via CEVA ToolBox
29. May 1, 2013
Multicore Modeling and Simulation
• Simulating any number of cores
– Support for symmetric and asymmetric configurations
• Support for ARM CADI (Component Architecture Debug Interface)
– Including connectivity to ARM’s Real-View debugger
• Comprehensive multi-core simulation support
– Synchronization, system browsing, shared memories, interconnect,
accelerator simulation, cross-triggering, AXI / FIC I/F
– Support for user-defined
components
• ESL tools integration with
full debug capabilities
– Compliant with TLM 2.0
– Full support for Carbon
and Synopsys
30. May 1, 2013
Co-processor Portfolio for
Wireless Modems
– Wide range of co-processors offering power-efficient
DSP offloading at extreme processing rates
• A complete wireless platform addressing all major modem PHY
processing requirements
• Offering flexible hardware-software partitioning
• Customers can focus on differentiation via DSP software
– Unique automated data traffic management
between DSP memory and hardware accelerators
• Allows fully parallel processing support
• Based on data traffic manager
31. May 1, 2013
Co-processor Portfolio for Wireless
Modems – Cont.
• Optimized tightly coupled extensions (TCE)
– MLD – Maximum Likelihood MIMO detectors
• Supports up to four MIMO layers
• Achieves near ML performance
– De-spreader – 3G De-spreader units
• Supports all WCDMA and HSDPA channels
• Scalable up to 3GPP HSPA+ Rel-11
– DFT / FFT
• Supports multi radix DFTs
• Includes NCO correction
– Viterbi
• Programmable K and r values
• Supports tail biting
– LLR processing and HARQ combining
• Supports LTE de-rate matching
• Significantly reduces HARQ memory buffer sizes
Dramatically reduces
time-to-market
32. May 1, 2013
Putting It All Together
A Cluster of Four CEVA-XC DSPs
> Processor Level
> Fixed-point and Floating-point
Vector DSPs
> Running over 1GHz
> Data Cache
> Platform Level
> Complete set of tightly coupled co-
processor units
> Automated DSP offloading using data traffic
management
> System Level
> Full cache coherency support
> AMBA-4 and FIC system interfaces
> Software Development Support
> HAL using drivers and APIs
> Comprehensive system debug & profiling
Task scheduling must account for the worst-case execution time per task.
Cache coherency: when shared data resides in the cache of one core, that core is unaware of changes made to the data by other cores. This is best solved using MESI / ACE protocols.