The early 21c has brought the power of the computer into the hands of the general population, and though these computers consume small amounts of energy they are so numerous that their Energy Efficiency will soon become a major issue. This presentation looks at modern Computing, the ways that Energy Efficiency is currently being enhanced, and the principles behind this.
Energy Efficient Computing in the 21c
1. Energy Efficient Computing ... In the early 21C
Abstract:
Opinions expressed are those of the author alone
With the assistance of its global partners, ARM shipped 8.7 billion CPUs in 2012; a number which continues
to grow at around 20% pa. The 40B we have shipped to date outnumber the total of PCs by more than 50
times; and today more than 75% of the things connected to the Internet are ARM based. The dominant
nature of Computing in the 21c is very different to that of the Mainframe era. It is sobering to think that if
each of those 8.7B CPUs were to dissipate just 100mW, then it would require the output of two modern power
stations to drive them; with 2.4 next year, and 3 the year after that! So Electronic Systems are also defining
where the real Energy Efficient Computing issue is! But with such a small footprint, surely it must be easy to
measure and manage power optimisation? Not so: an increasing percentage of these are immensely complex
systems, running significant multi-tasking and multi-threaded operating systems on platforms which include
multi-processor CPU/GPU configurations, and GB of memory. Whilst their minimum dissipations are a few
uW, their peak power exceeds the silicon's ability to dissipate it; so the penalty for power-unaware software
design is huge. What has been done to manage this in Electronic Systems design, and can any lessons be
transferred to the Classic Computing domains?
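The power-station arithmetic above can be checked with a short sketch. The station output of 435 MW is an assumption (the talk gives no figure); it is chosen only because it makes 8.7B CPUs at 100mW equal to two stations. The function name is illustrative.

```python
# Rough check of the abstract's arithmetic.
CPUS_2012 = 8.7e9          # CPUs shipped in 2012 (from the abstract)
POWER_PER_CPU_W = 0.1      # 100 mW each (from the abstract)
GROWTH_PER_YEAR = 0.20     # around 20% pa (from the abstract)
STATION_OUTPUT_W = 435e6   # ASSUMPTION: output of one "modern power station"


def stations_needed(years_after_2012: int) -> float:
    """Power stations needed to drive a single year's shipments."""
    cpus = CPUS_2012 * (1 + GROWTH_PER_YEAR) ** years_after_2012
    return cpus * POWER_PER_CPU_W / STATION_OUTPUT_W


for offset in range(3):
    print(2012 + offset, round(stations_needed(offset), 1))
```

Under that assumption the three years come out at roughly 2, 2.4 and just under 3 stations, in line with the figures quoted.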
Context
1hr talk at The Centre for Robotics and Neural Systems (CRNS) at University of Plymouth, Devon, UK.
The CRNS has a regular seminar series inviting national and international speakers.
http://www.tech.plym.ac.uk/SOCCE/CRNS/
SlideCast and pdf available via http://ianp24.blogspot.co.uk/
2. Opinions expressed are those of the author alone
Prof. Ian Phillips
Principal Staff Eng’r,
ARM Ltd
ian.phillips@arm.com
Visiting Prof. at ...
Contribution to Industry
Award 2008
Centre for Robotics and Neural Systems
University of Plymouth
1nov13
1v0
7. The Invisible Face of Computing Today
100s of Billions of computers, each consuming mW!
Bringing Embedded Intelligence to the Consumer
Market has changed the Face of Computing! (Again)
9. Markets provide the Growth Drivers
[Chart: Millions of Units against year, 1960 to 2020]
1st Era: Select work tasks
2nd Era: Broad-based computing for specific tasks
3rd Era: Computing as part of our lives
Today: ~2% of our Energy Use goes on Computing and Electronics!
... Tomorrow: It could easily be 20%!
10. ARM in the Digital World
8.7B CPUs shipped in 2012 (Growing 20% pa)
40+ billion CPUs to date
150+ billion CPUs cumulative by 2020
75% of the things connected to the Internet today are ARM Powered! (Gartner)
[Chart: CPU shipments, 1998 to 2012, projected to 2020]
http://www.arm.com/
11. Moore’s Law ...
Gordon Moore. Co-founder of Intel. (1965)
[Chart: Transistors/Chip (M) and Transistor/PM (K) against Approximate Process Geometry, 100um down to 10nm. Source: ITRS’99]
http://en.wikipedia.org/wiki/Moore’s_law
x More Functionality on a Si Chip in 20 yrs!
12. A Machine for Computing ...
Computing: A general term for algebraic manipulation of data ...
Numerated Phenomena IN (x) => y=F(x,t,s) => Processed Data/Information OUT (y)
... State and Time are always factors (variable weight).
It can include phenomena ranging from human thinking to calculations
with a narrower meaning.
Usually used to exercise analogies (models) of real-world situations;
Frequently in real-time (Fast enough to be a stabilising factor in a loop).
Wikipedia
... So what part does Hardware and Software play?
... And what about Energy?
13. Antikythera c87BC ... Planet Motion Computer
Mechanical
Technology
• Inventor: Hipparchos (c.190 BC – c.120 BC). Ancient Greek Astronomer, Philosopher and Mathematician.
• Single-Task, Continuous Time, Analogue Mechanical Computing (With backlash!)
See: http://www.youtube.com/watch?v=L1CuR29OajI
14. Orrery c1700 ... Planet Motion Computer
Mechanical
Technology
• Inventor: George Graham (1674-1751). English Clock-Maker.
• Single-Task, Continuous Time, Analogue Mechanical Computing (With backlash!)
15. Babbage's Difference Engine 1837
Mechanical
Technology
(Re)construction
c2000
The difference engine consists of a number of columns, numbered from 1 to N. Each column is able to store one decimal number. The only operation the engine
can do is add the value of a column n + 1 to column n to produce the new value of n. Column N can only store a constant, column 1 displays (and possibly prints)
the value of the calculation on the current iteration.
Computer for Calculating Tables: A Basic ALU Engine
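The column-addition rule described above can be sketched in a few lines. This is an illustrative model of the method of finite differences, not a description of Babbage's mechanism; the function name and the choice of starting values are assumptions.

```python
def difference_engine(columns, steps):
    """Tabulate a polynomial using only column-to-column additions.

    columns[0] is column 1 (the displayed value); columns[-1] is column N
    (a constant). Each step adds column n+1 into column n.
    """
    cols = list(columns)
    results = [cols[0]]
    for _ in range(steps):
        # Work from column 1 upwards so each addition uses its
        # neighbour's value from the previous iteration.
        for n in range(len(cols) - 1):
            cols[n] += cols[n + 1]
        results.append(cols[0])
    return results


# f(x) = x**2 for x = 0, 1, 2, ...: start value 0, first difference 1,
# constant second difference 2.
print(difference_engine([0, 1, 2], 5))  # [0, 1, 4, 9, 16, 25]
```

Loading different starting differences tabulates a different polynomial; no multiplication is ever needed.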
19. Signal Processing
BTH Crystal Set, c1925: 1 Diode
Tele-Verta Radio, c1945: 4 Valves, 1 Rectifier Valve
Bush Radio, c1960: 7 Transistors, 1 Diode
Evoke DAB Radio, c2005: 100 M Transistors, 2-3 Embedded Processors
20. Radio as Computation ...
Signal chain: Vi => Vrf => Vif => Vro
Vrf=Vi*100
Vlo=Cos(t*10^6)
Vif=Vrf*Vlo
Vro='Bandpass'(Vif*1000)
Single-Task (Embedded), Real-Time, Analogue (Close-Enough) Computing
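The mixer stage (Vif=Vrf*Vlo) is the computational heart of this chain: multiplying two cosines produces tones at the sum and difference of their frequencies, and the bandpass stage then keeps one of them. A minimal numerical sketch follows; the two angular frequencies are assumed values picked for illustration, and the function names are hypothetical.

```python
import math


def mixer(v_rf: float, v_lo: float) -> float:
    """Mixer stage from the slide: Vif = Vrf * Vlo."""
    return v_rf * v_lo


def sum_and_difference(w_rf: float, w_lo: float, t: float) -> float:
    """The same product written as its sum and difference tones."""
    return 0.5 * (math.cos((w_rf - w_lo) * t) + math.cos((w_rf + w_lo) * t))


# ASSUMED angular frequencies (rad/s), for illustration only.
W_RF = 1.455e6
W_LO = 1.0e6

for k in range(5):
    t = k * 1e-7
    v_if = mixer(math.cos(W_RF * t), math.cos(W_LO * t))
    print(f"t={t:.1e}  Vif={v_if:+.6f}  tones={sum_and_difference(W_RF, W_LO, t):+.6f}")
```

The two columns agree at every sample, which is the trigonometric identity cos(a)cos(b) = 0.5[cos(a-b) + cos(a+b)] at work.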
21. Radio as Computation ...
Technology: Valve
(Signal chain and equations as on slide 20)
Single-Task (Embedded), Real-Time, Analogue (Close-Enough) Computing
22. Radio as Computation ...
Technology: Valve, Transistor, ‘Integrated Circuit’
(Signal chain and equations as on slide 20)
Single-Task (Embedded), Real-Time, Analogue (Close-Enough) Computing
23. Computing is Era and Application Related ...
Computing: Creating Useful Output from Input ...
Architecture: The way this is done on the day.
It is the Most Important Product Decision!
(HW, SW, Digital, Analogue, Optics, Graphene, Mechanics, Steam, etc)
24. Moore's Real Law: x2 Functionality Every 18mth!
Cascade of Technologies supporting Functional growth ...
[Chart: Functional Density (units), log scale from 10^0 to 10^12, against year, 1960 to 2020]
Electronic era: 1975-2005
System era: 2003-2030
... The ‘Law’ started with Wood ⇒ Stone ⇒ Bronze ⇒ Iron
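As a worked check of the "x2 every 18 months" rule, the compound growth can be computed directly. The function name is illustrative, and reading the chart as starting from one unit in 1960 is an assumption.

```python
def functional_growth(years: float, doubling_months: float = 18.0) -> float:
    """Growth factor after `years` if functionality doubles every `doubling_months`."""
    return 2.0 ** (years * 12.0 / doubling_months)


for years in (20, 40, 60):
    print(years, f"{functional_growth(years):.2e}")
```

Sixty years at this rate is 40 doublings, or about 10^12, which matches the span of the functional density axis.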
29. Inside The Control Board
(b-side)
Level-2: Sub-Assemblies
More Visible Computing Contributors ...
A4 Processor. Spec: Apple, Design & Mfr: Samsung - Digital-CMOS (nm) ...
Provides the iPhone 4 with its GP computing power.
(Said to contain ARM A8 600 MHz CPU and other ARM IP)
ST-Micro: 3 axis Gyroscope - MEM-CMOS (ARM Partner)
Broadcom: Wi-Fi, Bluetooth, and GPS - Analogue-CMOS (ARM Partner)
Skyworks: GSM - Analogue-Bipolar
Triquint: GSM PA - Analogue-GaAs
Infineon: GSM Transceiver - Analogue/Digital-CMOS (ARM Partner)
[Board photo labels: GPS; Bluetooth, EDR & FM]
http://www.ifixit.com
30. Level-3: Processor
(Nvidia Tegra 3, around 1B transistors)
NB: The Tegra 3 is similar to the A4/5, but not used in the iPhone
31. Packing Technology into an iCon
Analogue and Digital Design
Embedded Software
Mechanics, Plastics and Glass
Micro-Machines (MEMs)
Displays and Transducers
Robotics and Test
Knowledge and Know-How
Research, Education and Training
Components, Sub-Systems and Systems;
Design, Assembly and Manufacture
Metrology, Methodology and Tools
... Involving Many Specialist Businesses
... Round and Round the World
...Not-Least from Europe
32. Architecting your Product
: Is the cumulative non-functional choices made to support the functional need
A Good Architecture is the one that ‘survives’
History is written by winners (2nd is for losers)
: Component Performance may be ‘poor’ as long as System Performance is ‘better’ for its use.
Architectural Options ...
: Business Model (Cost-of-Ownership, ROI), TTM (Productivity, History, IP Availability, Know-How), Aesthetics (Power, Quality, Behaviour, Appearance)
: Analogue, Digital, Mechanical, Optical, RF, Software, Plastics, Metal-forming, Manufacturing, Glass, ...
: More than 99% of a Product is Reused from its Predecessor
...
is assumed (working is expected!)
... It used to be the only consideration!
33. Power Philosophy
Hardware Dissipates Power ...
Choose Underlying Technology for best power efficiency.
One size does not fit all (Products, Applications or Instances)
... Software Doesn’t (But it Tells Hardware To!)
Chips can literally melt-down under software ‘instruction’
Make computing hardware power as ‘Activity’ dependent as possible
Zero Activity => Zero Power
Make OS/Apps aware of the power/performance situation,
and their options for controlling it (Need Indicators and Levers)
... Think System: It’s how the ‘box’ performs, not the components
34. Core Power Management
For Processor and Peripheral Circuitry...
Variable/Gated - Clock Domains
Variable/Switched - Power Domains
Indicators and Levers
Allow the software to see and influence what is going on
Principles of Core Power Efficiency...
Minimise voltage/frequency (P=CV^2f) so that the processor has just
enough performance for the current application need
Maximise ‘Activity Power’ dependence (Zero Activity => Zero Power)
Management by the OS and the Application SW
Apply to all on & off-chip zones (not just the CPU) ...
Methodology
Retention Flops/Latches, Level Shifters, Power-Switch Cells, PLLs
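The "just enough performance" principle can be sketched as a lookup over voltage/frequency operating points. This is a minimal illustration, not any vendor's DVFS driver: the operating points, the capacitance value and all names are hypothetical, and only dynamic power (P=CV^2f) is modelled.

```python
from typing import NamedTuple


class OperatingPoint(NamedTuple):
    volts: float
    hertz: float


# HYPOTHETICAL voltage/frequency pairs, lowest performance first.
OPERATING_POINTS = [
    OperatingPoint(0.8, 200e6),
    OperatingPoint(0.9, 600e6),
    OperatingPoint(1.1, 1000e6),
]
CAPACITANCE_F = 1e-9  # ASSUMED effective switched capacitance


def dynamic_power(point: OperatingPoint, capacitance: float = CAPACITANCE_F) -> float:
    """P = C * V^2 * f (dynamic power only; leakage is ignored)."""
    return capacitance * point.volts ** 2 * point.hertz


def select_point(required_hertz: float) -> OperatingPoint:
    """Lowest operating point that still meets the requested clock rate."""
    for point in OPERATING_POINTS:
        if point.hertz >= required_hertz:
            return point
    return OPERATING_POINTS[-1]  # saturate at the fastest point


for need in (150e6, 500e6, 900e6):
    chosen = select_point(need)
    print(f"need {need / 1e6:.0f} MHz -> {chosen.hertz / 1e6:.0f} MHz at "
          f"{chosen.volts} V, {dynamic_power(chosen):.3f} W")
```

Because voltage enters as a square, dropping to a lower operating point saves more power than the frequency reduction alone would suggest.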
35. Architectural Energy Efficiency - Parallelism
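[Diagram: one Processor clocked at f between Input and Output, compared with two Processors in parallel, each clocked at f/2]
One processor: Capacitance = C, Voltage = V, Frequency = f
Power = CV^2f
Two processors in parallel: Capacitance = 2.2C, Voltage = 0.6V, Frequency = 0.5f
Power = (2.2*0.6*0.6*0.5)CV^2f = 0.4CV^2f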
To a limit determined by Amdahl’s or Gustafson’s Law ...
Amdahl: Extracted parallelism from existing code (Reuse)
Gustafson: Some needs only benefit from parallelism (Custom)
... Actual improvement is application specific.
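The figures on this slide can be reproduced with a short sketch. The power calculation uses the slide's own scaling factors; the speedup functions are the standard textbook forms of Amdahl's and Gustafson's laws, and the 90% parallel fraction is an assumed example value.

```python
def relative_power(cap_scale: float, volt_scale: float, freq_scale: float) -> float:
    """Power relative to the baseline C*V^2*f after scaling C, V and f."""
    return cap_scale * volt_scale ** 2 * freq_scale


def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Fixed workload: the serial part limits the speedup."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / processors)


def gustafson_speedup(parallel_fraction: float, processors: int) -> float:
    """Workload grows with the machine: scaled speedup."""
    return (1.0 - parallel_fraction) + parallel_fraction * processors


print(f"two half-speed cores: {relative_power(2.2, 0.6, 0.5):.3f} x baseline power")
print(f"Amdahl, 90% parallel, 2 cores: {amdahl_speedup(0.9, 2):.2f}x")
print(f"Gustafson, 90% parallel, 2 cores: {gustafson_speedup(0.9, 2):.2f}x")
```

The power factor comes out at 0.396, which the slide rounds to 0.4; whether throughput really stays the same depends on how much of the workload is parallel.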
36. Architectural Energy Efficiency - Data
Moving Data takes significant Energy
Becoming the dominant energy consumption in a system
Data Location
Avoid moving or copying Data
Energy ∝ DataVolume x Speed x Distance^(>2 (3))
Bring the Processing to the Data
Caching is good (depends on implementation)
Write back is better than write-through
Local working memory is good
Aka Software Caching
... The Arrangement of your Data matters!
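The write-back versus write-through point can be illustrated by counting how many stores actually travel to main memory. This is a toy model under stated assumptions (direct-mapped cache, one word per line, write-allocate, stores only); the function name is hypothetical and no real cache is this simple.

```python
def memory_writes(store_addresses, num_lines, policy):
    """Count writes that reach main memory for a stream of stores."""
    if policy == "write-through":
        return len(store_addresses)  # every store also goes to memory
    if policy != "write-back":
        raise ValueError(f"unknown policy: {policy}")
    resident = {}  # line index -> tag of the (dirty) word held there
    writes = 0
    for addr in store_addresses:
        index, tag = addr % num_lines, addr // num_lines
        if index in resident and resident[index] != tag:
            writes += 1  # evicting a dirty line writes it back
        resident[index] = tag
    return writes + len(resident)  # flush the remaining dirty lines


# A loop that updates the same four words 100 times.
stores = [0, 1, 2, 3] * 100
print("write-through:", memory_writes(stores, 8, "write-through"))
print("write-back:   ", memory_writes(stores, 8, "write-back"))
```

With good locality the write-back cache moves each word once; with addresses that keep colliding on one line the advantage disappears, which is why the arrangement of the data matters.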
38. Choose The Horses for The Course
About 50MTr
About 50KTr
... Delivering ~5x speed (Architecture + Process + Clock)
39. Multicore ARM On-Chip ...
Heterogeneous Multicore Systems have been in ARM for a long time:
Application: Cortex™-A8
UI & 3D Graphics: Mali™-400 MP
Power Manager: Cortex-M3
Interconnect
Memory
40. Coherent Multicore Cluster ...
Homogeneous Multicore cluster, as part of a heterogeneous system:
Cortex-A9 … Cortex-A9, with Coherency Logic
User Interface and 3D graphics: Mali-400 MP
Power Manager: Cortex-M3
Interconnect
42. Computer On a Chip c2010 ...
Today’s Consumer requires a pocket ‘Super-Computer’ ...
Silicon Technology Provides a Billion transistors ...
It will be supported with a few GB of memory ...
• Typically 10 Processors ...
• 4 x A9 Processors (2x2)
• 4 x MALI 400 Frag. Proc
• 1 x MALI 400 Vertex Proc
• 1 x MALI Video CoDec
• Software Stacks, OS’s and Design Tools/
ARM Technology gives chip/system designers ...
• Improved Productivity
• Improved TTM
• Improved Quality/Certainty
http://www.arm.com/
43. CoreLink™ CCN-504 and DMC-520
[Block diagram: CoreLink™ CCN-504 Cache Coherent Network with CoreLink™ DMC-520 memory controllers]
Heterogeneous processors – CPU, GPU, DSP and accelerators
Up to 4 cores per cluster; up to 4 coherent clusters
4 x Quad Cortex-A15, each with L2 cache (ACE)
Virtualized Interrupts; Interrupt Control
Integrated L3 cache: 8-16MB L3 cache, Snoop Filter
Uniform System memory
2 x CoreLink™ DMC-520: Dual channel DDR3/4 x72 (PHY, x72 DDR4-3200 per channel)
Up to 18 AMBA interfaces for I/O coherent accelerators and IO
IO Virtualisation with System MMU
I/O blocks: 3 x DSP, PCIe, DPI, Crypto, USB, SATA, 10-40 GbE (AHB, ACE, NIC-400)
NIC-400 Network Interconnect: Flash, GPIO
Peripheral address space
44. Methodology As Well As Hardware
C/C++
Debug & Trace
Development
Energy Trace
Modules
Middleware
45. big.LITTLE Processing
For High-Performance systems...
Tightly coupled combination of two ARM CPU clusters:
Cortex-A15 and Cortex-A7 - functionally identical
Same programmers view, looks the same to OS and applications
big.LITTLE combines high-performance and low power
Automatically selects the right processor for the right job
Redefines the efficiency/performance trade-off
big: “Demanding tasks”, >2x Performance
LITTLE: “Always on, always connected tasks”, 30% of the Power (select use cases)
[Chart: Current smartphone compared with big.LITTLE]
46. Fine-Tuned to Different Performance Points
LITTLE (Cortex-A7, Cortex-A53)
Most energy-efficient applications processor from ARM
Simple, in-order, 8 stage pipelines
Performance better than mainstream, high-volume smartphones (Cortex-A8 and Cortex-A9)
big (Cortex-A15, Cortex-A57)
Highest performance in mobile power envelope
Complex, out-of-order, multi-issue pipelines
Up to 2x the performance of today’s high-end smartphones
[Pipeline diagram labels: Queue, Issue, Integer]
47. big.LITTLE Software
CPU Migration
Migrate a single processor workload to the appropriate CPU
Migration = save context then resume on another core
Also known as Linaro “In Kernel Switcher”
DVFS driver modifications and kernel modifications
Based on standard power management routines
Small modification to OS and DVFS, ~600 lines of code
big.LITTLE MP
OS scheduler moves threads/tasks to appropriate CPU
Based on CPU workload
Based on dynamic thread performance requirements
Enables highest peak performance by using all cores at once
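The migration decision can be pictured as a load threshold with hysteresis. The sketch below is purely illustrative: it is not the Linaro In Kernel Switcher or any kernel scheduler code, and the thresholds and the function name are hypothetical.

```python
def choose_cluster(load: float, current: str,
                   up_threshold: float = 0.85,
                   down_threshold: float = 0.30) -> str:
    """Pick 'big' or 'LITTLE' from recent CPU load (0.0 to 1.0).

    Two thresholds give hysteresis, so a load hovering near one value
    does not cause repeated migrations.
    """
    if current == "LITTLE" and load > up_threshold:
        return "big"
    if current == "big" and load < down_threshold:
        return "LITTLE"
    return current


cluster = "LITTLE"
for load in (0.10, 0.50, 0.90, 0.60, 0.20):
    cluster = choose_cluster(load, cluster)
    print(f"load={load:.2f} -> {cluster}")
```

Each migration costs a context save and restore, so the gap between the two thresholds trades responsiveness against needless switching.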
48. Bringing the Processing to the Data …
Press Claims:
Dell + Marvell, Copper
BaiDu + Marvell, Baserock
288 server nodes in a 4U rack space
Public Source: http://www.engadget.com/2011/11/02/hp-and-calxedas-moonshot-arm-servers-will-bring-all-the-boys-to/
50. Transferable Lessons to GP Software
Moving data is Power Expensive ...
Don’t move data; use it locally (Cache it)
Refine it once, use it often (Pre-Process it)
Your CPU Power is work-load independent ...
So, get in; get the work done; and get out.
Maximise the workload of your code; terminate when complete.
Make your Processing work-load dependent
Use a Hypervisor and turn off (at least free) processors not in use.
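The "get in; get the work done; and get out" advice follows from simple energy accounting: if an awake CPU draws roughly the same power whatever it is doing, energy is set by how long it stays awake. A minimal sketch, with assumed power figures and an illustrative function name:

```python
def energy_joules(active_w: float, sleep_w: float,
                  busy_s: float, window_s: float) -> float:
    """Energy over a window when the CPU is awake for busy_s and asleep otherwise."""
    if not 0 <= busy_s <= window_s:
        raise ValueError("busy_s must lie within the window")
    return active_w * busy_s + sleep_w * (window_s - busy_s)


ACTIVE_W = 1.0   # ASSUMED power while awake, regardless of workload
SLEEP_W = 0.01   # ASSUMED power while asleep
WINDOW_S = 10.0

print("finish in 2 s, then sleep:", energy_joules(ACTIVE_W, SLEEP_W, 2.0, WINDOW_S), "J")
print("stay awake the whole time:", energy_joules(ACTIVE_W, SLEEP_W, 10.0, WINDOW_S), "J")
```

With these numbers, code that completes and releases the processor uses about a fifth of the energy of code that lingers.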
51. Society’s Challenges in the 21c
Urbanisation (Smart Cities)
Health (eHealth)
Transport
Energy (Smart Grid)
Security
Environment
Food/Water
Ageing Society
Sustainability
Digital Inclusion
Economics
And whilst our technologies will be an
essential part of all solutions, they
cannot fix them without Society’s
help and cooperation!
... Energy Efficient Computing will minimise
the impact not avert the challenges!
Having a great time!
52. Conclusions
Putting the power of Computation into the hands of the masses,
has changed the face of Computing (again)
Electronic Systems will become Essential to our Lives and the Economy
Power Efficient ES are a major issue to Society
Which faces a future with them as a significant energy consumer in themselves
Power Efficiency must be architected into the System Hardware
and Software from the beginning
To realise the maximum potential out of your Silicon (Avoiding Dark Si)
Architect & Design HW as efficiently as possible (reflecting the task)
Strive for: No Work => No Power
Equip HW with Indicators and Levers so the System/App can manage it
Bring Processing to the Data ...
Don’t move Data; move Information
Process data Locally
Energy ∝ DataVolume x Speed x Distance^(>2 (3))
53. Computing at the heart of the 21c
ARM:
Enabling the Creation of
High-Performance Electronic Systems
• Productively, Economically and Reliably
• Through Hw/Sw Reuse Methodologies
• Based on a family of CPU/GPU cores