Power Optimization Through Manycore Multiprocessing
1. Power Optimization Through
Many-Core Multiprocessing
Delivering High Performance in a Low Power World
ChipEx2012
Haydn Povey
Marketing Director – Implementation & Security
ARM Processor Division
May 2, 2012
2. Billions of Connected Devices
Performance expectations continue to increase exponentially, but power efficiency and scalability are becoming formidable challenges

Form Factor               2015 TAM (m)
Mobile Phones             1,750
Media players             300
Mobile Computers          750
Desktop PCs               150
Digital TV/STB            500
Automotive Infotainment   100
Other*                    450
Total                     4 billion
*Includes PND, photo-frames, etc
Source: ABI Research, IDC, Gartner and ARM forecasts
3. Historic Technology Drivers
Era           Platform           Driver
Up to 1980s   Mainframes/mini    Functionality
1990s         The PC             Functionality / $
2000s         Notebooks          Functionality / (Power × $)
2010s         Mobile Computing   Functionality / (Energy × $)
4. Low Power Positioned for the Future
Going forward low power is necessary
for everything from microcontroller to servers
Low power is a design philosophy
Mindset, style, culture and working practice
Not something you change or acquire easily
Low power is a design reality
ARM is an efficient architecture
None of the legacy or CISC complexity
Low cost is a design & manufacturing partnership
Time to volume not time to niche markets
Speed-binning not good enough for mass-market
(Figure: 2010s, Mobile Computing: Functionality / (Energy × $))
5. Limitations with Multiprocessing
Cost of offering the peak single thread performance on each CPU quickly exceeds chassis thermal limits
System and software bottlenecks limit overall scalability
Single die integration offered some roadmap
6. Evolution to Many-Core
Base theorem
Simpler and smaller processor designs require exponentially less
energy to accomplish same amount of compute as a more complex
and larger processor design.
“Approximate rule of thumb”
To increase performance 50% you double the power and area cost of
the processor design
Quickly reaches point of diminishing returns
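
The rule of thumb above can be turned into a rough scaling model. This is a minimal sketch, assuming the rule applies continuously (so performance grows as power raised to log2(1.5)) and that the work is perfectly parallel across cores. Both are idealisations, and the function names are illustrative rather than taken from the presentation.

```python
import math

# Rule of thumb from the slide: +50% performance costs 2x power (and area).
# Assumption: the rule holds continuously, so perf = power ** ALPHA.
ALPHA = math.log2(1.5)  # roughly 0.585


def single_core_perf(power_budget: float) -> float:
    """Relative performance of one core that spends the whole budget."""
    return power_budget ** ALPHA


def many_core_perf(power_budget: float, cores: int) -> float:
    """Aggregate performance of `cores` equal cores sharing the budget.

    Assumes perfectly parallel work, which real software rarely achieves.
    """
    return cores * single_core_perf(power_budget / cores)


if __name__ == "__main__":
    budget = 4.0  # arbitrary units; 1.0 is the baseline core
    for n in (1, 2, 4):
        print(n, "core(s):", round(many_core_perf(budget, n), 2))
```

Under these assumptions, one core spending four units of power delivers about 2.25x the baseline, while four baseline cores deliver 4x, provided the software can actually be split four ways.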
7. Challenge of Many-Core
Many-core definition
Use ‘lots’ of smaller, more efficient processors to achieve a higher
aggregate performance than can be reached through multiprocessing
Smaller processors are not capable of executing the same
single thread as a higher performance processor in the same
time – so can’t execute existing applications effectively
Many threads cannot easily be decomposed into simpler,
smaller tasks so as to benefit from multiprocessing on the
smaller processor
Software development challenge
8. Software Data Decomposition
Each data item is independent
Split large quantity of DATA into smaller chunks that can be operated on in parallel
(Figure: a block of DATA split into TASKs, each dispatched to its own CPU)
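
Data decomposition as described on this slide can be sketched as follows. This is an illustration only: `process_chunk`, `split` and `parallel_map` are hypothetical names, and the squaring step stands in for real per-item work such as filtering one picture. Threads are used to keep the sketch short; in standard CPython, CPU-bound work would normally swap in `ProcessPoolExecutor` to run on several cores.

```python
from concurrent.futures import ThreadPoolExecutor


def process_chunk(chunk):
    """Stand-in for per-item work; each item is handled independently."""
    return [value * value for value in chunk]


def split(data, parts):
    """Split `data` into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def parallel_map(data, workers=4):
    """Process independent chunks concurrently, then rejoin them in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(process_chunk, split(data, workers)))
    return [item for chunk in results for item in chunk]


if __name__ == "__main__":
    print(parallel_map(list(range(10))))
```

Because no chunk depends on another, the only coordination needed is the final rejoin.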
9. Software Task Decomposition
Each task item is functionally independent
Functionally independent tasks can be executed concurrently
(Figure: a queue of TASKs distributed across several CPUs)
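
Task decomposition differs from data decomposition in that the units of work are different functions rather than slices of one data set. A minimal sketch, assuming three unrelated background jobs; the task names are invented for illustration and do not come from the presentation.

```python
from concurrent.futures import ThreadPoolExecutor


def sync_mail():
    """Hypothetical task: stands in for a mail synchronisation job."""
    return "mail synced"


def decode_audio():
    """Hypothetical task: stands in for music playback decoding."""
    return "audio decoded"


def log_gps():
    """Hypothetical task: stands in for GPS logging."""
    return "gps logged"


def run_independent(tasks, workers=4):
    """Submit functionally independent tasks and collect results by name."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {name: pool.submit(fn) for name, fn in tasks.items()}
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    jobs = {"mail": sync_mail, "audio": decode_audio, "gps": log_gps}
    print(run_independent(jobs))
```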
10. Functional Block Partitioning
Functional blocks are serially dependent
But temporally independent
Distribute different functional blocks across
available processors
Split into defined functional threads
Uses passing of data blocks between threads
to allocate work
Requires code changes and fine tuning
Example: Real Time Video Encoding
(Simplified MPEG encoding functional block diagram)
Analogue Video Sampling -> Remove Inter-Frame Redundancy -> Remove Intra-Frame Redundancy -> Quantise Samples -> Run-Length Compress -> Buffer Store
Figure labels: CPU0, CPU1, CPU2 (Motion Compensation), CPU3; horizontal axis: TIME
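
The pipelined arrangement above, where serially dependent blocks run at the same time on different frames, can be sketched with one thread per stage and queues passing data blocks between them. This is an assumption-laden illustration: the stage functions are placeholders that only tag a string, and `run_pipeline` is a hypothetical helper, not the encoder from the slide.

```python
import queue
import threading

_STOP = object()  # sentinel that tells a stage to shut down


def stage(fn, inbox, outbox):
    """Run `fn` on every block from `inbox` and pass the result on."""
    while True:
        block = inbox.get()
        if block is _STOP:
            outbox.put(_STOP)  # propagate shutdown downstream
            return
        outbox.put(fn(block))


def run_pipeline(blocks, stage_fns):
    """Push `blocks` through one thread per stage, preserving order."""
    queues = [queue.Queue() for _ in range(len(stage_fns) + 1)]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stage_fns)
    ]
    for t in threads:
        t.start()
    for block in blocks:
        queues[0].put(block)
    queues[0].put(_STOP)
    results = []
    while True:
        out = queues[-1].get()
        if out is _STOP:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Placeholder stages named after blocks in the diagram.
    stages = [
        lambda f: f + "-inter",
        lambda f: f + "-intra",
        lambda f: f + "-quant",
    ]
    print(run_pipeline(["frame0", "frame1", "frame2"], stages))
```

While the last stage is finishing frame0, the first stage can already be working on frame2, which is the temporal independence the slide refers to.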
11. Strategy Focus: The Thermal Wall
SOC sustained power is limited in mobile devices by thermals;
1.5W to 2W with low-cost POP and stacked memories
3W without stacked memories
Responsiveness is a must
Complex active management is needed
“Opportunistic Residency”
(Figure: Power against Time, with the following levels from highest to lowest)
Un-managed Max Power (@Tjmax)
Burst for responsiveness (e.g. Browsing)
Managed Sustained Power
Sustained performance (e.g. HD Video Record, Gaming)
Power Optimised Low End (e.g. e-Mail, Voice, MP3)
Threshold labels in the figure: T >= Tjmax, Tskin; Tj >= Tmax; Tj < Tmax
12. Applying Nominal Use Case
Typical Day for Smartphone User
90 min voice calling
60 min email / social networking
30 min reading web
50 min Angry Birds / other gaming
90 min jogging while listening to music and
logging GPS co-ordinates
10 min video recording
7 hrs sleep with music alarm clock
OS typically executing ~28 active processes
Apps synching in background
14. Use Case Conclusion
Profiled CPU State   Minutes   % of CPU Active
Deep Sleep           1186      n/a
200 MHz              154       60%
500 MHz              69        27%
800 MHz              18        7%
1000 MHz             4         2%
1200 MHz             10        4%
If the phone was ARM big.LITTLE™ enabled...
Active CPU time
12% big
88% LITTLE
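
The percentages in the profile table, and the big/LITTLE split, can be checked with a few lines of arithmetic. The slide does not say how CPU states were assigned to each core type, so the mapping below (200 and 500 MHz on LITTLE, 800 MHz and above on big) is an assumption made for this sketch.

```python
# Minutes spent in each active CPU state, copied from the profile table.
minutes = {200: 154, 500: 69, 800: 18, 1000: 4, 1200: 10}
deep_sleep = 1186

active = sum(minutes.values())  # total active minutes in the day


def share(freqs):
    """Percentage of active CPU time spent in the given MHz states."""
    return 100.0 * sum(minutes[f] for f in freqs) / active


# Assumption: the two lowest states run on LITTLE, the rest on big.
little_share = share([200, 500])
big_share = share([800, 1000, 1200])

if __name__ == "__main__":
    for mhz in minutes:
        print(f"{mhz} MHz: {share([mhz]):.1f}% of active time")
    print(f"LITTLE: {little_share:.1f}%  big: {big_share:.1f}%")
    print("minutes accounted for:", active + deep_sleep)
```

With this mapping the split comes out at roughly 87.5% LITTLE and 12.5% big, close to the 88% and 12% quoted on the slide.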
16. “big” Processor – Cortex-A15
ARM Cortex™-A15 Processor
3.5+ DMIPS/MHz
1-4 core MPCore™ configurable
Advanced Capabilities
Full ARMv7A architecture
Thumb®-2, TrustZone®, VFP, NEON™
Virtualization, large address extensions
AMBA® 4 ACE™ coherency
High Performance
Targeting 1.5GHz mobile implementation on 28nm
Hard Macro Quad-core Implementation @ 2GHz on 28HPM process
17. “LITTLE” Processor – Cortex-A7
ARM Cortex-A7 Processor
“LITTLE” to Cortex-A15 “big”
1-4 core MPCore configurable
Same Architectural Capabilities
Full ARMv7A architecture
Thumb-2, TrustZone, VFP, NEON
Virtualization, large address extensions
AMBA 4 ACE Coherency
ISA identical to Cortex-A15 processor
High Performance
Up to 1.2GHz for mobile implementation on 28nm
21. Software Use Models
big.LITTLE Task Migration – One CPU active
Migrate between Cortex-A15 and Cortex-A7 depending on
performance requirements
big.LITTLE MP – Both CPUs can be active
Allocate threads that need high performance to Cortex-A15
Allocate threads that don’t require high performance to Cortex-A7 for
best energy efficiency
AMBA 4 hardware coherency between Cortex-A15 and Cortex-A7
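
The two use models can be contrasted with a toy policy. This is a sketch under stated assumptions, not ARM's switcher or scheduler: the thresholds, the normalised `load` and `demand` values, and the names `MigrationPolicy` and `allocate_threads` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MigrationPolicy:
    """Task Migration model: only one cluster is active at a time."""

    up_threshold: float = 0.85    # hypothetical: move to big above this load
    down_threshold: float = 0.30  # hypothetical: move back below this load
    cluster: str = "LITTLE"

    def update(self, load: float) -> str:
        """Return the cluster to run on, given a load in the range 0..1."""
        if self.cluster == "LITTLE" and load > self.up_threshold:
            self.cluster = "big"
        elif self.cluster == "big" and load < self.down_threshold:
            self.cluster = "LITTLE"
        return self.cluster


def allocate_threads(demands, threshold=0.5):
    """MP model: both clusters active, each thread placed by its demand."""
    return {
        name: ("big" if demand > threshold else "LITTLE")
        for name, demand in demands.items()
    }


if __name__ == "__main__":
    policy = MigrationPolicy()
    print([policy.update(load) for load in (0.2, 0.9, 0.5, 0.1)])
    print(allocate_threads({"ui": 0.9, "mail_sync": 0.1}))
```

The gap between the two thresholds is a design choice in this sketch: it avoids bouncing between clusters when the load hovers near a single cut-off.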
23. CCI-400 Cache Coherent Interconnect
AMBA 4 compliant, 128-bit single layer at up to ½ Cortex-A15 frequency
CCI interfaces:
2 full AMBA 4 ACE slave interfaces
+3 ACE-Lite I/O coherent slave interfaces
x3 master interfaces
AMBA 4 ACE and ACE-Lite manage all coherency, sharability and barriers
(Block diagram, blocks shown from top to bottom)
Masters: GIC-400, Quad Cortex-A15, Quad Cortex-A7, Mali-T604 Graphics, Coherent I/O device, DMA, LCD
Bridges: ADB-400 (ACE, 128b), MMU-400 x3 (ACE-Lite + DVM, 128b), NIC-400 with configurable AXI 4/AXI 3/AHB interfaces
CoreLink™ CCI-400 Cache Coherent Interconnect, 128 bit @ up to 0.5 Cortex-A15 frequency
Downstream (ACE-Lite, 128b): DMC-400 with two PHYs to DDR3/2 or LPDDR2/3; NIC-400 with configurable AXI 4/AXI 3/AHB/APB interfaces to other slaves
24. Summary
Multiprocessing enables today’s applications to scale while maintaining single thread performance
Addresses nicely the multi-tasking of stacked usage scenarios
Many-core brings the energy advantages of simpler and smaller processors, but with the challenge of software complexity and lack of backwards compatibility with respect to single thread performance
big.LITTLE processing, as delivered by the ARM Cortex-A15 and Cortex-A7, offers both the performance and compatibility advantages of multiprocessing along with the power efficiency and scalability advantages of many-core processing
Editor's Notes
The performance requirements of handsets and other mobile devices continue to grow exponentially, with new applications, advanced gaming, and traditional PC-type functionality migrating rapidly to these platforms. While this capability enables the next wave of the digital revolution, it comes at the price of increased power usage and potential thermal challenges. This presentation will investigate the issues and compromises traditionally required to push performance to the next level, and the challenges we face as an industry if we do not innovate architecturally in the implementation of advanced systems. We will demonstrate key advances in future processor designs and highlight the advantages and challenges faced as we look to deliver high performance in the low power world.
EXAMPLE: Digital camera sport mode (burst mode). Take a lot of pictures, then filter and JPEG-encode them on the go. Each picture is an independent work item and can be processed in parallel. Instead of processing the pictures one at a time, one after the other, you can process them in parallel. Quicker execution. Then switch off cores and go to sleep. Low leakage and no dynamic power consumption. ANOTHER EXAMPLE: Complex post-processing on a large RAW digital image. You can have more than one thread concurrently acting on the input data and writing to the output image (reads can overlap).
EXAMPLE: You have more than one application running at the same time. On a single core your multitasking OS will time-slice. On a multi-core system things will happen in parallel. They will execute in less time and be more responsive (i.e. the UI).
EXAMPLE: VIDEO CODEC: This works because a video codec processes a stream. Within a single frame, and within a group of frames, there are all sorts of dependencies, BUT this is a stream, so while you are storing the result of an encoded frame, you can already be calculating the maths of the following frames, sampling the next one, and so on. Each core can have a task allocated to it, and the code needs to be modified so that these tasks synchronise and communicate with each other. Distribute different functional blocks of the decoder across available processors. Multi-task pipeline, e.g. taskA -> taskB -> (multiple) taskC -> taskD. Split into defined functional threads. Uses passing of data blocks between threads to allocate work.
Start with a cheap package (high thermal resistance: 15C/W Thetajb, 30C/W Thetaja) and 60C Tjb (so we use Thetajb). 1.5 to 2W limit with stacked memory (including the memory Tj max of 85C). 3W without memories (20C advantage to play with, assuming 105C max Tj for the SOC). NB: This is an issue we need to understand a lot better.
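
The power limits in this note follow from a simple steady-state thermal calculation: allowed temperature rise divided by thermal resistance. A minimal sketch, assuming the "60C Tjb" figure is the board-side reference temperature and that the junction-to-board resistance is the only path considered; `sustained_power_w` is a hypothetical helper name.

```python
def sustained_power_w(tj_max_c, t_board_c, theta_jb_c_per_w):
    """Steady-state power (W) that keeps the junction at or below tj_max_c."""
    return (tj_max_c - t_board_c) / theta_jb_c_per_w


THETA_JB = 15.0  # C/W, from the note
T_BOARD = 60.0   # C, assumed meaning of "60C Tjb" in the note

# Stacked memory limits the junction to 85C.
with_stacked_memory = sustained_power_w(85.0, T_BOARD, THETA_JB)

# Without stacked memory the SOC can run to 105C (20C more headroom).
without_stacked_memory = sustained_power_w(105.0, T_BOARD, THETA_JB)

if __name__ == "__main__":
    print(f"with stacked memory:    {with_stacked_memory:.2f} W")
    print(f"without stacked memory: {without_stacked_memory:.2f} W")
```

The first figure, about 1.67 W, falls inside the 1.5 to 2 W range quoted for stacked memories, and the second is the 3 W quoted without them.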
What is DVM? Why does the slide say 3 masters and 2 slaves (it looks like it should be the other way around)?