Dark Silicon, Mobile Devices, and Possible Open-Source Solutions

Koan-Sin Tan
freedom@computer.org
COSCUP 2013, Aug. 3rd, TICC, Taipei
Friday, August 23, 13

• Software engineer, veteran open-source user
• Learned about lightweight processes (LWP) on SunOS 4.x in the early 1990s
• Wrote a user-level thread library on 386BSD with a classmate in 1992
• Recently involved in big.LITTLE scheduling work

Samsung “optimization” for benchmarks
http://www.anandtech.com/show/7187/looking-at-cpugpu-benchmark-optimizations-galaxy-s-4

• “Dark Silicon refers to the exponentially
increasing number of a chip's transistors
that must remain passive, or "dark", in
order to stay within a chip's power budget”

Figure from the textbook. We know we are in the CMP (chip multiprocessor) era.
“Since 2003, the limits of power and available instruction-
level parallelism have slowed uniprocessor performance.”

Dennard scaling hits the wall
• Dennard Scaling
• When voltages are scaled along with all dimensions, a device’s electric fields remain constant, and most device characteristics are preserved
• scaling maintains constant power density
• logic area and power are scaled down by alpha^2
• energy per transition is scaled down by alpha^3, but frequency is scaled up by 1/alpha, resulting in an alpha^2 decrease in power per gate
• ........
• Search for “Dennard scaling” to find more information, e.g., http://www1.cs.columbia.edu/~cs4824/lectures/csee4824_f12_lec22.pdf
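
The scaling bullets above can be written out as a short derivation. This is a sketch under the usual textbook assumptions: alpha < 1 is the linear scaling factor, and the symbols L (dimension), V (voltage), C (capacitance), A (area), E (energy per transition), f (frequency) and P (power per gate) are names chosen here for illustration; they do not appear on the slide.

```latex
\begin{align*}
L &\to \alpha L, \qquad V \to \alpha V, \qquad C \to \alpha C \qquad (\alpha < 1) \\
A &\propto L^2 \;\to\; \alpha^2 A \\
E &= C V^2 \;\to\; (\alpha C)(\alpha V)^2 = \alpha^3 E \\
f &\to \frac{f}{\alpha} \\
P &= E f \;\to\; \alpha^3 E \cdot \frac{f}{\alpha} = \alpha^2 P \\
\frac{P}{A} &\to \frac{\alpha^2 P}{\alpha^2 A} = \frac{P}{A}
\end{align*}
```

The last line is the "constant power density" bullet: power per gate and area per gate shrink by the same factor. Once voltage can no longer be scaled by alpha, the cancellation fails and power density rises, which is where dark silicon comes from.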

Mobile Devices
• Both power and thermal constraints are more severe than on desktop devices
• Battery technology progresses relatively slowly
• You don’t want to put a fan in your smartphone
• Heat transfer: conduction, convection, radiation

Yes, modern high-end mobile processors have serious thermal problems. Tegra 4 game console figure from iFixit

Nexus 10 Thermal Throttling
• Antutu 3.0.2
• X-axis unit is 200 ms
• It reaches 80 ˚C in 20 seconds
• Throttling starts at 80 ˚C; stops at 78 ˚C
• Throttling decrements the maximum frequency value of cpufreq
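
The start/stop thresholds above form a hysteresis loop, which can be sketched in C as follows. This is only an illustration, not the Nexus 10 kernel code: the frequency table, the `throttle_state` struct and `throttle_step()` are hypothetical names, and raising the cap one step per sample after throttling stops is an assumption (the slide only says when throttling starts and stops).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical frequency table in kHz, highest first (not the real Nexus 10 table). */
static const int freq_table_khz[] = {1700000, 1400000, 1100000, 800000};
#define NUM_FREQS (sizeof freq_table_khz / sizeof freq_table_khz[0])

#define THROTTLE_START_C 80 /* from the slide: throttling starts at 80 C */
#define THROTTLE_STOP_C  78 /* from the slide: throttling stops at 78 C  */

struct throttle_state {
    int throttling; /* 1 while the governor is actively throttling */
    size_t level;   /* index into freq_table_khz; 0 means no cap    */
};

/* Feed one temperature sample; returns the new maximum frequency cap in kHz. */
int throttle_step(struct throttle_state *s, int temp_c)
{
    assert(s != NULL);

    /* Hysteresis: between the two thresholds the previous state is kept. */
    if (temp_c >= THROTTLE_START_C)
        s->throttling = 1;
    else if (temp_c <= THROTTLE_STOP_C)
        s->throttling = 0;

    if (s->throttling) {
        if (s->level + 1 < NUM_FREQS)
            s->level++;  /* decrement the maximum frequency by one step */
    } else if (s->level > 0) {
        s->level--;      /* assumption: raise the cap again one step at a time */
    }
    return freq_table_khz[s->level];
}
```

Calling `throttle_step()` once per 200 ms sample would mirror the sampling unit on the slide; on a real device the returned value would be written to the cpufreq maximum frequency limit.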

[Figure: Running Antutu on Octa. Series: freq 0 to freq 3 and temp 0 to temp 3, plotted against sample number]
Antutu 3.0.2 on S4 Octa

[Figure: Running Antutu on New One. Series: thermal zones tz0 to tz11, plotted against sample number]
Antutu 3.0.2 on new One

Introducing big.LITTLE
Figure 28-3 Processor DVFS curves
In a big.LITTLE system these operating points are applied both to the Cortex-A15 and
Cortex-A7 processors. When the Cortex-A7 processor is executing the OS can tune the
operating points as it would for an existing platform with a single applications processor. When
the Cortex-A7 processor is at its highest operating point (Figure 28-3), if more performance is
required a switch is invoked that transfers the OS and applications to the Cortex-A15 processor.
Further DVFS tuning takes place on the Cortex-A15 processor if required, as the operating load
increases.
Migration requires rapid context switching capability. Coherency is clearly a critical enabler in
achieving a fast task migration time as it allows the state that has been saved on the outbound
(migrated from) processor to be snooped and restored on the inbound (migrated to) processor
rather than going via main memory. Additionally, for Cluster migration, (or for CPU migration
when all processors have been switched) because the L2 cache of the outbound processor is
coherent it can remain powered up after a task migration to improve the cache warming time of
ARM big.LITTLE
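
The switching policy described in the excerpt (stay on the Cortex-A7 until its highest operating point is not enough, then move to the Cortex-A15) can be sketched as a table lookup. Everything numeric here is made up: the `opp` struct, the `opps` table and its performance units are hypothetical, and the real DVFS curves of the two processors overlap rather than forming one strictly increasing list.

```c
#include <assert.h>
#include <stddef.h>

enum cluster { LITTLE_A7, BIG_A15 };

struct opp {
    enum cluster cluster;
    int perf; /* hypothetical relative performance units */
};

/* Hypothetical operating points, sorted by increasing performance. */
static const struct opp opps[] = {
    {LITTLE_A7, 10}, {LITTLE_A7, 20}, {LITTLE_A7, 30},
    {BIG_A15, 45},   {BIG_A15, 60},   {BIG_A15, 80},
};
#define NUM_OPPS (sizeof opps / sizeof opps[0])

/* Pick the lowest operating point that satisfies the demand.
 * Demand above the top Cortex-A7 point implies a switch to the Cortex-A15. */
const struct opp *select_opp(int required_perf)
{
    assert(required_perf >= 0);
    for (size_t i = 0; i < NUM_OPPS; i++) {
        if (opps[i].perf >= required_perf)
            return &opps[i];
    }
    return &opps[NUM_OPPS - 1]; /* saturate at the highest Cortex-A15 point */
}
```

With this table, a demand of 25 stays on the Cortex-A7 and a demand of 31 triggers the switch, matching the "highest operating point, then migrate" rule in the text.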

Thread-Level Parallelism
• Thread-level parallelism (TLP) is an index; you can treat it as the number of threads running concurrently
• a table from an ISCA ’10 paper named “Evolution of thread-level parallelism in desktop applications”
• 2000 vs. 2010
• mobile devices are worse
• http://dl.acm.org/citation.cfm?id=1816000
Parallel Programming Could Help a Bit
• Parallel computing/programming has been around for a long time
• pthreads and OpenMP are available, and C++11 came with concurrency support
• Java uses threads and its own synchronization model
• “Why Threads Are A Bad Idea”, by John Ousterhout, http://www.cc.gatech.edu/classes/AY2009/cs4210_fall/papers/ousterhout-threads.pdf
• Threads are “easy: to describe; to use; to get wrong”, to quote Andrew Birrell, http://www.cs.princeton.edu/courses/archive/spr07/cos598A/lectures/Birrell.pdf
• For a more theoretical explanation, see “The Problems with Threads” by Edward Lee, http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf
• Besides the shared-memory model, there is the message-passing model, and more, e.g., actors, data-flow, systolic arrays, etc.

Threads are Bad Ideas?
• “Why Threads Are A Bad Idea”, John Ousterhout, 1995, http://www.cc.gatech.edu/classes/AY2009/cs4210_fall/papers/ousterhout-threads.pdf
• Yes, it’s a bit dated. Some of its points are no longer valid; many of them have stood the test of time
• Threads:
• Too hard for most programmers to use
• Even for experts, development is painful

Some of Ousterhout’s arguments remain valid
• Synchronization
• locks/mutexes must be placed manually
• deadlock: yes, deadlock
• hard to debug
• threads break modularity
• callbacks don’t work with locks

Threads are easy to get wrong
• Manual selection of mutual exclusion:
• Default is too little (and hence races)
• Easy ﬁx is too much (deadlocks or
blank stares)
• Projects don’t create hierarchical
abstractions
• Can’t decide and/or maintain acyclic
locking order
• “Composition” requires entire new
abstractions
• “Clever” optimizations aren’t maintainable
• .....

User-level libraries,
frameworks
• Android AsyncTask
• a class to help perform background operations and publish results on the UI
thread without having to manipulate threads and/or handlers
• http://developer.android.com/reference/android/os/AsyncTask.html
• Intel Threading Building Blocks (TBB)
• http://threadingbuildingblocks.org/, http://en.wikipedia.org/wiki/Intel_Threading_Building_Blocks
• works on Android x86 and ARM
• Apple Grand Central Dispatch (GCD)
• http://developer.apple.com/library/ios/#documentation/Performance/Reference/GCD_libdispatch_Ref/
• Software Transactional Memory
• http://gcc.gnu.org/wiki/TransactionalMemory

Language extension
• Intel Cilk Plus
• http://cilkplus.org/, http://en.wikipedia.org/wiki/Intel_Cilk_Plus
• open sourced, trying to get into gcc and llvm
• Apple blocks
• http://developer.apple.com/library/ios/#documentation/cocoa/Conceptual/Blocks/

OpenCL Related
• OpenCL
• pocl, http://pocl.sourceforge.net/
• OpenCL and Java
• Aparapi, https://code.google.com/p/aparapi/
• Sumatra, http://openjdk.java.net/projects/sumatra/
• RenderScript
• in AOSP
• ThorScript
• will be open-sourced

Cilk Plus: simple language extensions that originated with Charles Leiserson

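Simple Cilk Plus Example
/* Serial version */
int fib(int n) {
    if (n < 2) return n;
    int x = fib(n-1);
    int y = fib(n-2);
    return x + y;
}
/* Cilk Plus version; needs #include <cilk/cilk.h> */
int fib(int n) {
    if (n < 2) return n;          /* base case, same as the serial version */
    int x = cilk_spawn fib(n-1);  /* may run in parallel with the next call */
    int y = fib(n-2);
    cilk_sync;                    /* wait for the spawned call before using x */
    return x + y;
}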

simple GCD+blocks
dispatch_group_t group = dispatch_group_create();
fib = ^() {
if (n < 2) {
result = n;
return;
}
__block int x, y;
int m = n;
n = m - 1;
dispatch_group_async(group, a_queue, ^{fib(); x = result;});
dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
n = m - 2;
dispatch_sync(a_queue, ^{fib(); y = result;});
n = m;
result = x + y;
return;
};

data parallel fib() looks more reasonable
int fib(int n) {
int p = 0, q = 1, result =0;
cilk_for (int i=2; i <= n; i++) {
result = p + q;
p = q; q = result;
}
return result;
}
n.b.: in case you didn’t
notice, this may produce
wrong results because of
loop-carried dependency
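
To spell out the n.b. above: a loop can be handed to a parallel-for only if its iterations are independent. The plain C sketch below contrasts the two cases; `fib_iter()` and `vec_add()` are names made up for this example, and neither function is parallelized here.

```c
#include <assert.h>
#include <stddef.h>

/* Loop-carried dependency: iteration i reads the p and q written by
 * iteration i-1, so the iterations must run in order. */
int fib_iter(int n)
{
    assert(n >= 0);
    if (n == 0)
        return 0;

    int p = 0, q = 1;
    for (int i = 2; i <= n; i++) {
        int next = p + q;
        p = q;
        q = next;
    }
    return q;
}

/* No loop-carried dependency: out[i] depends only on a[i] and b[i],
 * so the iterations could run in any order, or concurrently. */
void vec_add(const int *a, const int *b, int *out, size_t len)
{
    for (size_t i = 0; i < len; i++)
        out[i] = a[i] + b[i];
}
```

Only a loop shaped like `vec_add()` is a safe candidate for `cilk_for` or `dispatch_apply`. Running the Fibonacci loop that way lets iterations race on `p` and `q`.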

parallel fib() with GCD and blocks
int(^fib)(int);
fib = ^(int n){
__block int p = 0, q = 1, result = 0;
dispatch_apply(n-1, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^(size_t i) {
result = p + q;
p = q; q = result;
});
return result;
};

GCD can be used with OpenCL
• That’s what is available on Mac OS X and iOS
• No, iOS hasn’t opened OpenCL to developers yet, but you can easily find out how to use OpenCL for ARM on iOS

What is available
• Task-parallel and data-parallel constructs, libraries, or languages
• Lambda, closure, continuation, etc.
• Queue, queue management: load balance, work stealing, etc.
• Data structures, e.g., TBB
• Lock-less synchronization

Lock-free synchronization
• In case you didn’t know, NO, it’s not new at all
• Linux has used RCU (Read-Copy-Update) for several years
• In fact, the idea has been around since the 1970s; see Kung’s 1980 paper, which proposed an RCU-like mechanism
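
The read-copy-update idea can be sketched with C11 atomics. This is not the Linux RCU API: it only shows the read, copy, update and publish steps, and it deliberately leaves out the hard part of real RCU, which is deciding when the old copy can be freed. `struct config`, `read_max_freq()` and `update_max_freq()` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct config {
    int max_freq_khz;
    int throttle_temp_c;
};

/* Shared pointer to the current version; starts out NULL. */
static _Atomic(struct config *) current_config;

/* Reader: a single atomic load, no lock taken. Returns -1 if nothing is published yet. */
int read_max_freq(void)
{
    struct config *c = atomic_load_explicit(&current_config, memory_order_acquire);
    return c ? c->max_freq_khz : -1;
}

/* Writer: read the current version, copy it, update the copy, then publish it
 * with compare-and-swap. Retries if another writer published in between. */
int update_max_freq(int new_khz)
{
    assert(new_khz > 0);

    struct config *fresh = malloc(sizeof *fresh);
    if (fresh == NULL)
        return -1;

    struct config *old = atomic_load_explicit(&current_config, memory_order_acquire);
    do {
        if (old != NULL) {
            *fresh = *old;              /* copy */
        } else {
            fresh->throttle_temp_c = 0; /* nothing to copy yet */
        }
        fresh->max_freq_khz = new_khz;  /* update */
    } while (!atomic_compare_exchange_weak_explicit(
                 &current_config, &old, fresh,
                 memory_order_acq_rel, memory_order_acquire));

    /* The old version is intentionally not freed: a reader may still be using it.
     * Real RCU frees it only after a grace period. */
    return 0;
}
```

Readers never block and never see a half-updated struct, because they either get the old pointer or the new one. The cost is moved to the writer and to memory reclamation, which this sketch avoids by leaking the old copy.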

Kernel
• big.LITTLE
• IKS: In-Kernel Switcher
• related code is being upstreamed after 3.10
• Global Task Scheduling (GTS), Heterogeneous Multi-Processing (HMP)
• Current CFS maintainer Ingo didn’t like GTS’s power-saving part
• Power Management
• So many mechanisms: cpufreq, cpuidle, runtime PM, CCF, etc.
• Linaro has a wiki page on what to enable/implement for a new SoC
• Thermal Management
• Throttling, i.e., asking related components to slow down so that less heat is generated

Linaro In-kernel Switcher

Global Task-Scheduling (GTS)

Much remains to be done
• No widely used open-source power or thermal management framework available?
• Some problems are fundamentally hard to parallelize, e.g.,
• parsing in browsers: nowadays, WebKit and Firefox use LALR(1) or similar parsing algorithms
• No full-featured open-source OpenCL implementation for GPGPU

Wrap-up
• “dark silicon” is a reality on mobile devices
• power wall and thermal wall
• parallel/concurrent code isn’t popular on mobile devices (yet)
• discussed some possible free and open-source solutions
• much remains to be done
