A Taste of Open Source
Neural Network
Frameworks on Cell Phones
Koan-Sin Tan

freedom@computer.org

COSCUP, Taipei

Aug 11th, 2018
• interrupt me anytime you have questions

• the talk is in Taiwanese

• the slide deck is in English

• you can ask me questions in English, Mandarin, or
Taiwanese
NN-based ML is already in
cell phones
• Google I/O: Mobile First —> AI First

• TensorFlow Lite, Android Neural Network API

• Lots of stuff from Google blogs and papers, e.g., Google Lens, federated learning in
Gboard

• Pixel Visual Core in Pixel 2 and Pixel 2 XL

• Apple announced CoreML, a machine learning framework, at WWDC 2017 (June 2017)

• Apple’s machine learning journal (https://machinelearning.apple.com/): how Apple
uses CNN and other machine learning techniques in iPhone

• Neural Engine

• Computer Architecture: A Quantitative Approach, 6th Ed. (Nov, 2017) has a whole new
chapter on Domain Specific Architecture, actually NN accelerators.
https://www.amazon.com/Computational-Aspects-Principles-Computer-Science/dp/0914894951
• Michael Jordan published an
article on Medium named
“Artificial Intelligence — The
Revolution Hasn’t
Happened Yet”

• Yes, but current deep learning
driven stuff should be enough
for next few years

[1] https://medium.com/@mijordan3/artificial-intelligence-the-revolution-hasnt-happened-yet-5e1d5812e1e7
open source nn frameworks on cellphones
Your phone personalizes the model locally, based on your usage (A).
Many users' updates are aggregated (B) to form a consensus change
(C) to the shared model, after which the procedure is repeated.
https://research.googleblog.com/2017/04/federated-learning-collaborative.html
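The (A)/(B)/(C) loop in the caption above can be sketched in a few lines. This is only a toy illustration: the function names are hypothetical, the "model" is a plain list of floats, and weighting each phone's update by its number of local examples is an assumption that the caption does not state.

```python
def local_update(gradient, lr=0.1):
    """(A) On-device step: the delta this phone would upload (assumed plain SGD)."""
    return [-lr * g for g in gradient]


def aggregate(deltas, weights):
    """(B) Weighted average of many users' deltas (weights are an assumption)."""
    total = sum(weights)
    dim = len(deltas[0])
    return [
        sum(w * d[i] for d, w in zip(deltas, weights)) / total
        for i in range(dim)
    ]


def apply_update(shared, consensus):
    """(C) Apply the consensus change to the shared model."""
    return [s + c for s, c in zip(shared, consensus)]


shared_model = [1.0, 2.0]
# Two phones upload deltas; the second one saw three times as much local data.
phone_deltas = [[0.5, 0.0], [-0.5, 1.0]]
consensus = aggregate(phone_deltas, weights=[1, 3])
shared_model = apply_update(shared_model, consensus)
print(shared_model)
```

Only the deltas leave the phones in this sketch; the raw usage data stays on the device, which is the privacy argument made in the talk.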
• Why talk about open-source frameworks on edge
devices?

• I like open source

• I work for a company which is designing chips for edge
devices mostly

• Some arguments for NN and general machine learning
on edge devices are: privacy, latency, bandwidth,
connection, local sensors, cost, and convenience
Some advances make NN
on edge devices really viable
• “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size” [1]. A keynote at
ESWEEK 2017, “Keynote: Small Neural Nets Are Beautiful: Enabling Embedded Systems with Small Deep-
Neural-Network Architectures” [2]

• MobileNet V1 [3] and V2 [4]: Depthwise separable convolution [5] and inverted residuals and linear
bottlenecks

• AutoML, e.g., NASNet Mobile [6] and Mnasnet [7]

• Quantization [8][9]

[1] https://arxiv.org/abs/1602.07360

[2] https://arxiv.org/abs/1710.02759

[3] https://arxiv.org/abs/1704.04861

[4] https://arxiv.org/abs/1801.04381

[5] https://www.di.ens.fr/data/publications/papers/phd_sifre.pdf

[6] https://arxiv.org/abs/1707.07012

[7] https://ai.googleblog.com/2018/08/mnasnet-towards-automating-design-of.html, https://arxiv.org/abs/1807.11626

[8] https://arxiv.org/abs/1712.05877

[9] https://arxiv.org/abs/1806.08342
• Mainly on TensorFlow Lite and Caffe2 for edge devices

• Why?

• open source

• designed to use NN accelerators

• Running NN stuff on CPUs is generally not as
energy-efficient as on accelerators
• We heard about Android NN and TensorFlow Lite back in Google I/O 2017

• My COSCUP 2017 slide deck “TensorFlow on Android”

• https://www.slideshare.net/kstan2/tensorflow-on-
android

• People knew a bit about Android NN API before it was
announced and released

• No information about TensorFlow Lite, at least to me,
before it was released in November 2017
tf-lite and android NN in
Google I/O
• New TensorFlow runtime
• Optimized for mobile and
embedded apps

• Runs TensorFlow models on
device

• Leverage Android NN API

• Soon to be open sourced
from Google I/O 2017 video
Actual Android NN API
• Announced/published with Android 8.1
Preview 1

• Available to developers in the NDK

• yes, NDK

• The Android Neural Networks API (NNAPI)
is an Android C API designed for running
computationally intensive operations for
machine learning on mobile devices

• NNAPI is designed to provide a base layer
of functionality for higher-level machine
learning frameworks (such as TensorFlow
Lite, Caffe2, or others) that build and train
neural networks

• The API is available on all devices running
Android 8.1 (API level 27) or higher.
https://developer.android.com/ndk/images/nnapi/nnapi_architecture.png
Android NN on Pixel 2
• Only the CPU fallback was available on Oreo MR1

• Actually, Android NN API related code has been visible in AOSP since the Oreo MR1 (8.1) release

• user level code, see https://android.googlesource.com/platform/frameworks/ml/+/oreo-mr1-release

• HAL, see https://android.googlesource.com/platform/hardware/interfaces/+/oreo-mr1-release/neuralnetworks/

• There is NN API 1.1 on Android Pie

• https://developer.android.com/about/versions/pie/android-9.0#nnapi

• adding support for nine new ops — Pad, BatchToSpaceND, SpaceToBatchND,
Transpose, Strided Slice, Mean, Div, Sub, and Squeeze

• In the Android P DP1/2 (https://developer.android.com/preview/download.html), there
was an HVX NN API 1.0 (yes, 1.0) driver. It was gone after DP2 and is not in the recent Pie release.
TensorFlow Lite
• TensorFlow Lite is TensorFlow’s lightweight solution for
mobile and embedded devices

• It enables on-device machine learning inference with low
latency and a small binary size

• Low latency techniques: optimizing the kernels for mobile
apps, pre-fused activations, and quantized kernels that
allow smaller and faster (fixed-point math) models

• TensorFlow Lite also supports hardware acceleration with
the Android Neural Networks API
https://www.tensorflow.org/mobile/tflite/
What does TensorFlow Lite
contain?
• a set of core operators, both quantized and float, which have been tuned for mobile platforms

• pre-fused activations and biases to further enhance performance and quantized accuracy

• using custom operations in models is also supported

• a new model file format, based on FlatBuffers

• the primary difference is that FlatBuffers does not need a parsing/unpacking step to a secondary
representation before you can access data

• the code footprint of FlatBuffers is an order of magnitude smaller than protocol buffers

• a new mobile-optimized interpreter

• key goals: keeping apps lean and fast

• a static graph ordering and a custom (less-dynamic) memory allocator to ensure minimal load,
initialization, and execution latency

• an interface to Android NN API if available
https://www.tensorflow.org/mobile/tflite/
why a new mobile-specific
library?
• Innovation at the silicon layer is enabling new possibilities for hardware
acceleration, and frameworks such as the Android Neural Networks API
make it easy to leverage these

• Recent advances in real-time computer-vision and spoken language
understanding have led to mobile-optimized benchmark models being open
sourced (e.g. MobileNets, SqueezeNet)

• Widely-available smart appliances create new possibilities for on-device
intelligence

• Interest in stronger user data privacy paradigms where user data does not
need to leave the mobile device

• Ability to serve ‘offline’ use cases, where the device does not need to be
connected to a network
https://www.tensorflow.org/mobile/tflite/
• A set of core operators, both quantized and float, many of which have been tuned for mobile platforms. These can be
used to create and run custom models. Developers can also write their own custom operators and use them in models

• A new FlatBuffers-based model file format

• On-device interpreter with kernels optimized for faster execution on mobile

• TensorFlow converter to convert TensorFlow-trained models to the TensorFlow Lite format.

• Smaller in size: TensorFlow Lite is smaller than 300KB when all supported operators are linked and less than 200KB
when using only the operators needed for supporting InceptionV3 and Mobilenet

• FACT CHECK: armeabi-v7a: 497,192 bytes, arm64-v8a: 675,572 bytes

• Pre-tested models

• Inception V3, MobileNet, On Device Smart Reply

• Quantized versions of the MobileNet model, which runs faster than the non-quantized (float) version on CPU.

• New Android demo app to illustrate the use of TensorFlow Lite with a quantized MobileNet model for object
classification

• Java and C++ API support
https://www.tensorflow.org/mobile/tflite/
• Java API: A convenience
wrapper around the C++ API
on Android

• C++ API: Loads the
TensorFlow Lite Model File
and invokes the Interpreter.
The same library is available
on both Android and iOS
Other bindings
• Python and C APIs

• Python: introduced in TF 1.8.0, built into pip package in 1.9.0

• my label_image.py for tflite merged on Aug 9, 2018

• https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/examples/python/label_image.py

• https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/examples/python/label_image.md

• C API: introduced for Unity

• https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/experimental/c
In Dec, 2017
• Let $TF_ROOT be root of tensorflow

• source of tf-lite: ${TF_ROOT}/tensorflow/contrib/lite/

• https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/README.md

• examples

• two for Android, two for iOS

• APIs: ${TF_ROOT}/tensorflow/contrib/lite/g3doc/apis.md, https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/g3doc/apis.md

• no benchmark_model: well, there is one, https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/tools/benchmark_model.cc

• it’s incomplete

• no command line label_image (https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/label_image)
Aug, 2018
• Let $TF_ROOT be root of tensorflow

• source of tf-lite: ${TF_ROOT}/tensorflow/contrib/lite/

• https://github.com/tensorflow/tensorflow/tree/r1.10/tensorflow/contrib/lite/README.md

• https://github.com/tensorflow/tensorflow/tree/r1.10/tensorflow/docs_src/mobile/tflite

• examples

• (two + at least 3) for Android, two for iOS

• APIs: ${TF_ROOT}/tensorflow/contrib/lite/g3doc/apis.md, https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/g3doc/apis.md

• benchmark_model: https://github.com/tensorflow/tensorflow/tree/r1.10/tensorflow/contrib/lite/tools/benchmark/

• command line label_image: my label_image for TF Lite was merged (https://github.com/tensorflow/tensorflow/pull/15095)
Basic Usage
• model: .tflite model

• resolver: if no custom ops, builtin
op resolver is enough

• interpreter: we need it to compute
the graph

• interpreter->AllocateTensors():
allocate stuff for you, e.g., input
tensor(s)

• fill the input

• interpreter->Invoke(): run the graph

• process the output
// Load the .tflite model; BuildFromFile returns nullptr on failure.
std::unique_ptr<tflite::FlatBufferModel> model =
    tflite::FlatBufferModel::BuildFromFile(path_to_model);
tflite::ops::builtin::BuiltinOpResolver resolver;
std::unique_ptr<tflite::Interpreter> interpreter;
tflite::InterpreterBuilder(*model, resolver)(&interpreter);
// Resize input tensors, if desired.
interpreter->AllocateTensors();
float* input = interpreter->typed_input_tensor<float>(0);
// Fill `input`.
interpreter->Invoke();
float* output = interpreter->typed_output_tensor<float>(0);
some abstractions
// TF_LITE_ENSURE - Self-sufficient error checking
// TfLiteStatus - Status reporting
// TfLiteIntArray - stores tensor shapes (dims)
// TfLiteContext - allows an op to access the tensors
// TfLiteTensor - tensor (a multidimensional array)
// TfLiteNode - a single node or operation
// TfLiteRegistration - the implementation of a conceptual operation.
https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/context.h#L19-L26
TfLiteIntArray
typedef struct {
  int size;
// gcc 6.1+ have a bug where flexible members aren't properly handled
// https://github.com/google/re2/commit/b94b7cd42e9f02673cd748c1ac1d16db4052514c
#if !defined(__clang__) && defined(__GNUC__) && __GNUC__ == 6 && \
    __GNUC_MINOR__ >= 1
  int data[0];
#else
  int data[];
#endif
} TfLiteIntArray;
https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/context.h#L68-L80
TfLiteTensor
typedef struct {
  TfLiteType type;
  TfLitePtrUnion data;
  TfLiteIntArray* dims;
  TfLiteQuantizationParams params;

  TfLiteAllocationType allocation_type;
  size_t bytes;
  const void* allocation;
  const char* name;
  TfLiteDelegate* delegate;
  TfLiteBufferHandle buffer_handle;
  bool data_is_stale;
  bool is_variable;
} TfLiteTensor;
https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/context.h#L202-L253
TfLiteQuantizationParams
typedef struct {
  float scale;
  int32_t zero_point;
} TfLiteQuantizationParams;
https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/context.h#L165-L171
r = S(q − Z), where r is the real value, q the quantized integer, S the scale, and Z the zero_point
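A small worked example of the affine mapping r = S(q − Z) may help. The sketch below assumes uint8 codes (0..255) and one common way of picking S and Z from a real-valued range; the helper names are hypothetical and this is not the TF Lite code itself.

```python
def choose_params(rmin, rmax, qmin=0, qmax=255):
    """Pick scale S and zero point Z for the range [rmin, rmax] (rmin < rmax).

    The range is widened to include 0.0 so that real zero maps exactly
    to an integer code (an assumption commonly made for padding).
    """
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = int(round(qmin - rmin / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(r, scale, zero_point, qmin=0, qmax=255):
    """Real value -> integer code, clamped to the uint8 range."""
    q = int(round(r / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Integer code -> real value: r = S * (q - Z)."""
    return scale * (q - zero_point)


scale, zero_point = choose_params(-1.0, 3.0)
for r in (-1.0, 0.0, 1.7, 3.0):
    q = quantize(r, scale, zero_point)
    print(r, q, dequantize(q, scale, zero_point))
```

Inside the chosen range, the round trip error is at most half a quantization step (S / 2); values outside the range are clamped.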
TfLiteNode
typedef struct {
  TfLiteIntArray* inputs;
  TfLiteIntArray* outputs;
  TfLiteIntArray* temporaries;
  void* user_data;
  void* builtin_data;
  const void* custom_initial_data;
  int custom_initial_data_size;
  TfLiteDelegate* delegate;
} TfLiteNode;
https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/context.h#L272-L303
TfLiteContext
typedef struct TfLiteContext {
  size_t tensors_size;
  TfLiteStatus (*GetExecutionPlan)(struct TfLiteContext* context,
                                   TfLiteIntArray** execution_plan);
  TfLiteTensor* tensors;
  void* impl_;
  TfLiteStatus (*ResizeTensor)(struct TfLiteContext*, TfLiteTensor* tensor,
                               TfLiteIntArray* new_size);
  void (*ReportError)(struct TfLiteContext*, const char* msg, ...);
  TfLiteStatus (*AddTensors)(struct TfLiteContext*, int tensors_to_add,
                             int* first_new_tensor_index);
  TfLiteStatus (*GetNodeAndRegistration)(struct TfLiteContext*, int node_index,
                                         TfLiteNode** node,
                                         TfLiteRegistration** registration);
  TfLiteStatus (*ReplaceSubgraphsWithDelegateKernels)(
      struct TfLiteContext*, TfLiteRegistration registration,
      const TfLiteIntArray* nodes_to_replace, TfLiteDelegate* delegate);
  int recommended_num_threads;
  TfLiteExternalContext* (*GetExternalContext)(struct TfLiteContext*,
                                               TfLiteExternalContextType);
  void (*SetExternalContext)(struct TfLiteContext*, TfLiteExternalContextType,
                             TfLiteExternalContext*);
} TfLiteContext;
https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/context.h#L305-L371
TfLiteRegistration
typedef struct _TfLiteRegistration {
  void* (*init)(TfLiteContext* context, const char* buffer, size_t length);
  void (*free)(TfLiteContext* context, void* buffer);
  TfLiteStatus (*prepare)(TfLiteContext* context, TfLiteNode* node);
  TfLiteStatus (*invoke)(TfLiteContext* context, TfLiteNode* node);
  const char* (*profiling_string)(const TfLiteContext* context,
                                  const TfLiteNode* node);
  int32_t builtin_code;
  const char* custom_name;
  int version;
} TfLiteRegistration;
https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/context.h#L272-L303
beyond basic stuff
• More information in https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/interpreter.h

• const char* GetInputName(int index): https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/interpreter.h#L198-L200

• const char* GetOutputName(int index): https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/interpreter.h#L210-L212

• size_t tensors_size() const: https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/interpreter.h#L215

• TfLiteTensor* tensor(int tensor_index): https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/interpreter.h#L230-L234

• size_t nodes_size() const: https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/interpreter.h#L218

• const std::pair<TfLiteNode, TfLiteRegistration>* node_and_registration(int node_index): https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/interpreter.h#L244-L249

• Yes, we can enumerate/traverse tensors and nodes
beyond basic stuff
• void UseNNAPI(bool enable)

• void SetNumThreads(int num_threads)

• my label_image for tflite

• merged since mid-Jan, 2018

• benchmark_model for tflite
Conv Op
• Convolution: https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/kernels/conv.cc#L445-L492

• Quantized uint8

template <KernelType kernel_type>
void EvalQuantized(TfLiteContext* context, TfLiteNode* node,
                   TfLiteConvParams* params, OpData* data, TfLiteTensor* input,
                   TfLiteTensor* filter, TfLiteTensor* bias,
                   TfLiteTensor* im2col, TfLiteTensor* hwcn_weights,
                   TfLiteTensor* output)
• https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/kernels/conv.cc#L326-L367

• optimized_ops::Conv(…), https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/kernels/internal/optimized/optimized_ops.h#L2049-L2126

• float32

template <KernelType kernel_type>
void EvalFloat(TfLiteContext* context, TfLiteNode* node,
               TfLiteConvParams* params, OpData* data, TfLiteTensor* input,
               TfLiteTensor* filter, TfLiteTensor* bias, TfLiteTensor* im2col,
               TfLiteTensor* hwcn_weights, TfLiteTensor* output)
• https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/kernels/conv.cc#L369-L443

• multithreaded_ops::Conv(…) for most cases, https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/kernels/internal/optimized/multithreaded_conv.h#L135-L162
misc
• builtin state dump function

• void PrintInterpreterState(Interpreter* interpreter): https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/optional_debug_tools.h#L25

• https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/examples/label_image/label_image.cc#L159

• Mapping TF operations to TF Lite operations is not trivial

• https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/g3doc/tf_ops_compatibility.md

• https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/nnapi_delegate.cc
Interpreter
https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/interpreter.cc#L611-L697
label_image for tf lite
• https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/examples/label_image/

• https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/examples/label_image/label_image.md

• Run a TF Lite single input, single output classifier model, e.g., MobileNet V1, to verify whether the classifier
works

• What does it do

• read an image: unlike TF, there is no image decoder in TF Lite, so I wrote a simple .bmp decoder

• resize the input image to a specific size, e.g., 224x224 or 299x299

• convert the image tensor to floating point if necessary

• load the classifier

• prepare tensors

• run the model

• process the output

• top-k labels
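Two of the steps above, converting pixels to floating point and picking the top-k labels, are easy to sketch without TF Lite. The snippet is only an illustration of the idea, not the code of label_image; the helper names, the mean/std of 127.5, and the score threshold are assumptions made for this example.

```python
import heapq


def to_float(pixels, mean=127.5, std=127.5):
    """Map uint8 pixel values (0..255) to floats in roughly [-1, 1]."""
    return [(p - mean) / std for p in pixels]


def top_k(scores, labels, k=5, threshold=0.001):
    """Return the k highest (score, label) pairs above a small threshold."""
    candidates = [(s, l) for s, l in zip(scores, labels) if s >= threshold]
    return heapq.nlargest(k, candidates, key=lambda pair: pair[0])


print(to_float([0, 127.5, 255]))
print(top_k([0.1, 0.7, 0.0005, 0.2], ["cat", "dog", "fish", "bird"], k=2))
```

For a quantized model the input stays uint8, so the float conversion step is skipped, which is why the list above says "if necessary".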
speed of quantized one
• It seems it's much better than naive quantization as we saw before

• On Nexus 9 (MobileNet 1.0/224)

• Quantized

• ./label_image -t 2: ~ 160 ms

• ./label_image -t 2 -c 100: ~ 60 ms

• Floating point

• ./label_image -t 2 -m ./mobilenet_v1_1.0_224.tflite: ~ 300 ms

• ./label_image -t 2 -c 100 -m ./mobilenet_v1_1.0_224.tflite: ~ 82 ms

• TFLiteCameraDemo: 130 - 180 ms

• Pixel 2

• TFLiteCameraDemo:

• CPU 

• single thread: as is: ~ 90 ms, controlled env: ~ 70 ms

• 4 threads: ~ 30 ms

• HVX: ~ 12 ms

Custom Operators
• https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/g3doc/custom_operators.md

• OpInit(), OpFree(),
OpPrepare(), and OpInvoke() in
interpreter.cc
typedef struct {
  void* (*init)(TfLiteContext* context, const char* buffer, size_t length);
  void (*free)(TfLiteContext* context, void* buffer);
  TfLiteStatus (*prepare)(TfLiteContext* context, TfLiteNode* node);
  TfLiteStatus (*invoke)(TfLiteContext* context, TfLiteNode* node);
} TfLiteRegistration;
Fake Quantization in early
Dec, 2017
• How hard can it be? How much time is needed?

• Several pre-tested models are available

• https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/g3doc/models.md

• but only one of them (https://storage.googleapis.com/download.tensorflow.org/models/tflite/mobilenet_v1_224_android_quant_2017_11_08.zip) is a quantized
one

• as we can guess from related docs, retraining is more or less required to
get accuracy back
Note that the biases are not quantized because they are
represented as 32-bit integers in the inference process, with
a much higher range and precision compared to the 8 bit
weights and activations. Furthermore, quantization parameters
used for biases are inferred from the quantization parameters
of the weights and activations. See section 2.4.
Typical TensorFlow code illustrating use of [19] follows:
from tf.contrib.quantize import quantize_graph as qg
g = tf.Graph()
with g.as_default():
  output = ...
  total_loss = ...
  optimizer = ...
  train_tensor = ...
if is_training:
  quantized_graph = qg.create_training_graph(g)
else:
  quantized_graph = qg.create_eval_graph(g)
# Train or evaluate quantized_graph.
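The remark about biases can be made concrete with a toy dot product. The sketch below assumes uint8 inputs and weights, an int32 bias whose scale is the product of the input and weight scales with zero point 0, and it applies the rescaling factor as a float for brevity, whereas an integer-only implementation would use a fixed-point multiplier. All names and the chosen scales are hypothetical.

```python
def quantize(r, scale, zero_point, qmin=0, qmax=255):
    """Real value -> uint8 code, using r ~= scale * (q - zero_point)."""
    q = int(round(r / scale)) + zero_point
    return max(qmin, min(qmax, q))


def quantize_bias(bias, s_in, s_w):
    """Bias as a 32-bit integer: scale s_in * s_w, zero point 0."""
    return int(round(bias / (s_in * s_w)))


def quantized_dot(q_in, q_w, q_bias, s_in, z_in, s_w, z_w, s_out, z_out):
    """Dot product on integer codes with an int32-style accumulator."""
    acc = q_bias  # the bias is added directly to the accumulator
    for a, w in zip(q_in, q_w):
        acc += (a - z_in) * (w - z_w)
    multiplier = (s_in * s_w) / s_out  # float here, fixed-point in practice
    q = int(round(multiplier * acc)) + z_out
    return max(0, min(255, q))


# Power-of-two scales keep the arithmetic in this toy example exact.
S_IN, Z_IN = 1 / 128, 128
S_W, Z_W = 1 / 256, 128
S_OUT, Z_OUT = 1 / 128, 128

x, w, bias = [0.5, -0.25, 0.75], [0.2, 0.4, -0.1], 0.3
q_x = [quantize(v, S_IN, Z_IN) for v in x]
q_w = [quantize(v, S_W, Z_W) for v in w]
q_b = quantize_bias(bias, S_IN, S_W)

q_y = quantized_dot(q_x, q_w, q_b, S_IN, Z_IN, S_W, Z_W, S_OUT, Z_OUT)
y_quantized = S_OUT * (q_y - Z_OUT)
y_float = sum(a * b for a, b in zip(x, w)) + bias
print(q_y, y_quantized, y_float)
```

The quantized result differs from the float result by less than one output quantization step, which is the kind of gap that retraining with simulated quantization is meant to absorb.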
3.2. Batch normalization folding
For models that use batch normalization (see [17]), there
is additional complexity: the training graph contains batch
normalization as a separate block of operations, whereas
the inference graph has batch normalization parameters
“folded” into the convolutional or fully connected layer’s
[Figure 1.1 panels: (a) Integer-arithmetic-only inference; (b) Training with simulated quantization; (c) ImageNet latency-vs-accuracy tradeoff, plotting Top-1 accuracy against latency (ms) for Float and 8-bit models]
Figure 1.1: Integer-arithmetic-only quantization. a) Integer-arithmetic-only inference of a convolution layer. The input and output
are represented as 8-bit integers according to equation 1. The convolution involves 8-bit integer operands and a 32-bit integer accumulator.
The bias addition involves only 32-bit integers (section 2.4). The ReLU6 nonlinearity only involves 8-bit integer arithmetic. b) Training
with simulated quantization of the convolution layer. All variables and computations are carried out using 32-bit floating-point arithmetic.
Weight quantization (“wt quant”) and activation quantization (“act quant”) nodes are injected into the computation graph to simulate the
effects of quantization of the variables (section 3). The resultant graph approximates the integer-arithmetic-only computation graph in panel
a), while being trainable using conventional optimization algorithms for floating point models. c) Our quantization scheme benefits from
the fast integer-arithmetic circuits in common CPUs to deliver an improved latency-vs-accuracy tradeoff (section 4). The figure compares
integer quantized MobileNets [10] against floating point baselines on ImageNet [3] using Qualcomm Snapdragon 835 LITTLE cores.
tions [14, 27, 34]. With these approaches, both multiplications and additions can be implemented by efficient bit-shift and bit-count operations, which are showcased in custom GPU kernels (BNN [14]). However, 1 bit quantization of-
Our work draws inspiration from [7], which leverages low-precision fixed-point arithmetic to accelerate the training speed of CNNs, and from [31], which uses 8-bit fixed-point arithmetic to speed up inference on x86 CPUs. Our
[1] https://www.tensorflow.org/performance/quantization
[2] https://arxiv.org/abs/1712.05877
[3] https://arxiv.org/abs/1806.08342
Real computation
• BLAS part: Eigen (http://eigen.tuxfamily.org/) and
gemmlowp (https://github.com/google/gemmlowp)

• Some Caveats

• convolutions are multithreaded

• uint8/gemm: number of cores

• float32/Eigen: 4

• problems: big.LITTLE, number of cores, scheduling
Things we didn’t touch
• Memory management: to get reasonably good performance when running highly
parallel workloads on mobile devices, you need a good enough memory management mechanism

• Profiling: there has been a simple profiling mechanism in TF Lite since Apr, 2018

• time profiling only for now. How about memory?

• static buffer size: https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/profiling/profiler.h#L80

• https://github.com/tensorflow/tensorflow/tree/r1.10/tensorflow/contrib/lite/profiling

• Computation of quantized uint8

• when you do some operations on tensors, the scale and zero point could
change. How can this be done efficiently?
Quick Intro to Caffe 2
• Caffe 2

• 2nd generation of Caffe, which was the most popular deep learning framework
(before TensorFlow) from Berkeley

• merged into PyTorch

• What's the difference? Caffe2 improves Caffe 1.0 in a series of directions:

• first-class support for large-scale distributed training

• mobile deployment
• new hardware support (in addition to CPU and CUDA)

• flexibility for future directions such as quantized computation

• stress tested by the vast scale of Facebook applications
https://caffe2.ai/docs/caffe-migration.html
Caffe2 on Android
• Official Android demo

• https://caffe2.ai/docs/AI-Camera-demo-android.html, https://github.com/caffe2/AICamera

• SqueezeNet 1.1:

• 5.8/5.7 fps on Samsung S7 and Google Pixel

• not very impressive

• OpenGL backend

• https://www.facebook.com/Caffe2AI/videos/126340488008269/

• up to 6X speedup (24 FPS) compared to CPU on high-end Android devices (e.g.
Galaxy S8) for style transfer models
https://trends.google.com/trends/explore?q=tensorflow,caffe2
• TensorFlow Lite is also looking into the possibility of an
OpenGL ES backend

• https://github.com/tensorflow/tensorflow/issues/16189
What can we use on
Android now
https://github.com/caffe2/caffe2/tree/master/caffe2/mobile/contrib
Caffe2 backends for
Android I know
• ARM CPU:

• NNPACK, Eigen: quite mature

• OpenGL ES:

• OpenGL: not actively maintained (?)

• ARM Compute Library (GL ES part): newly added, still growing

• NEON, and OpenCL

• NNAPI: not fully integrated yet.
How to build
• > scripts/build_android.sh
• With that, no command line test binaries are built

• Caffe 2 has some tests and a simple command line benchmark tool
called speed_benchmark
> scripts/build_android.sh -DBUILD_TEST -DBUILD_BINARY

• then we can get build_android/bin/speed_benchmark and
other test binaries

• PyTorch has a good tutorial on using it, http://pytorch.org/tutorials/advanced/super_resolution_with_caffe2.html
Some results
• > ./speed_benchmark --input_file input.blobproto --input data --init_net init_net.pb --net predict_net.pb --caffe2_log_level=0

01-06 23:15:42.073 32623 32623 I native : [I net_simple.cc:101] Starting benchmark.
01-06 23:15:42.074 32623 32623 I native : [I net_simple.cc:102] Running warmup runs.
01-06 23:15:42.074 32623 32623 I native : [I net_simple.cc:112] Main runs.
01-06 23:15:43.805 32623 32623 I native : [I net_simple.cc:123] Main run finished. Milliseconds per iter: 173.15. Iters per second: 5.77535
Some results
• ARM Compute Library backend: Caffe2 added an ARM Compute Library backend at the end of February 2018. With some tweaks, it's
possible to run SqueezeNet 1.1 faster with OpenGL than on the CPU (NNPACK)

01-04 03:41:38.297 25523 25523 I native : [I gl_model_test.h:52] [C2DEBUG] Benchmarking OpenGL Net

01-04 03:41:38.297 25523 25523 I native : [I net_gl.cc:104] Starting benchmark.

01-04 03:41:38.297 25523 25523 I native : [I net_gl.cc:105] Running warmup runs.

01-04 03:41:38.796 25523 25523 I native : [I net_gl.cc:121] Main runs.

01-04 03:41:43.107 25523 25523 I native : [I net_gl.cc:134] [C2DEBUG] Main run finished. Milliseconds per iter: 43.1077. Iters per second: 23.1977

01-04 03:41:43.110 25523 25523 I native : [I gl_model_test.h:66] [C2DEBUG] Benchmarking CPU Net

01-04 03:41:43.110 25523 25523 I native : [I net_simple.cc:101] Starting benchmark.

01-04 03:41:43.110 25523 25523 I native : [I net_simple.cc:102] Running warmup runs.

01-04 03:41:43.768 25523 25523 I native : [I net_simple.cc:112] Main runs.

01-04 03:41:50.229 25523 25523 I native : [I net_simple.cc:123] Main run finished. Milliseconds per iter: 64.6136. Iters per second: 15.4766
Comparing with TF Lite
• cmake is easier than bazel :-)

• Relatively large, or say comprehensive. If you want to enable something like on-device learning, it's
easier to start with TFLite.

• binary could be large

• Code looks cleaner

• Review process, or say, software engineering, is not as rigid as TensorFlow's

• TF has a larger team (?)

• See, https://www.oreilly.com/ideas/how-the-tensorflow-team-handles-open-source-support

• Some interesting code,

• The Observer design pattern could be used to measure performance, https://en.wikipedia.org/wiki/Observer_pattern

• https://github.com/caffe2/caffe2/tree/master/caffe2/observers
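To show how an observer can measure performance without touching the operators themselves, here is a minimal sketch of the pattern. It is a generic illustration, not Caffe2's observer API: the `Net` and `TimingObserver` classes and their methods are hypothetical names made up for this example.

```python
import time


class TimingObserver:
    """Accumulates wall-clock time per operator name."""

    def __init__(self):
        self.elapsed = {}
        self._started = None

    def start(self, op_name):
        self._started = time.perf_counter()

    def stop(self, op_name):
        delta = time.perf_counter() - self._started
        self.elapsed[op_name] = self.elapsed.get(op_name, 0.0) + delta


class Net:
    """A toy net: an ordered list of (name, function) operators."""

    def __init__(self, ops):
        self.ops = ops
        self.observers = []

    def attach(self, observer):
        self.observers.append(observer)

    def run(self, value):
        for name, fn in self.ops:
            for obs in self.observers:
                obs.start(name)
            value = fn(value)
            for obs in self.observers:
                obs.stop(name)
        return value


net = Net([("double", lambda v: 2 * v), ("increment", lambda v: v + 1)])
timer = TimingObserver()
net.attach(timer)
print(net.run(3), sorted(timer.elapsed))
```

The net only notifies whoever is attached, so profiling can be switched on or off without changing any operator code.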
Beyond Open Source
• Apple CoreML

• https://developer.apple.com/documentation/coreml

• Google ML Kit

• https://developers.google.com/ml-kit/

• image labeling, OCR, face detection, bar
code scanning, landmark detection, etc.

• Custom models in TF Lite

• Qualcomm Snapdragon Neural Processing
Engine (SNPE)

• https://developer.qualcomm.com/software/snapdragon-neural-processing-engine-ai

• Huawei HiAI DDK
Concluding Remarks
• Deep learning on devices is here to stay. You can already see some applications today, with more
to come.

• Pick an open-source framework to learn how system software for ML/DL works.

• Parallelization, parallelization, and parallelization

• Memory, memory, and memory

• If you are a hardware guy, accelerators on edge devices should be an interesting topic

• I didn’t expect to see systolic array and many-core stuff on edge devices for general
apps

• If you are a more research-oriented guy, think about something like NN models for edge
devices

• Even if you are none of the above, learning the history and status quo of AI and machine
learning to satisfy your intellectual curiosity should be fun
The End
Depthwise Separable Convolution
• CNNs with depthwise separable convolution such as Mobilenet [1]
changed almost everything

• Depthwise separable convolution “factorizes” a standard convolution
into a depthwise convolution and a 1 × 1 convolution called a
pointwise convolution. Thus it greatly reduces computational
complexity.

• Depthwise separable convolution is not that new [2], but pure
depthwise separable convolution-based networks such as Xception
and MobileNet demonstrated its power

[1] https://arxiv.org/abs/1704.04861

[2] L. Sifre. “Rigid-motion scattering for image classification”, PhD thesis, 2014
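The reduction in computation can be checked with a little arithmetic. The sketch below counts multiply-adds for one layer using the cost expressions from the MobileNet paper [1], assuming stride 1 and an output feature map of the same spatial size; the function names and the example layer sizes are chosen for illustration only.

```python
def standard_conv_cost(dk, m, n, df):
    """Multiply-adds of a standard convolution.

    dk: kernel size, m: input channels, n: output channels,
    df: spatial size of the (square) feature map.
    """
    return dk * dk * m * n * df * df


def depthwise_separable_cost(dk, m, n, df):
    """Multiply-adds of a depthwise conv followed by a 1x1 pointwise conv."""
    depthwise = dk * dk * m * df * df
    pointwise = m * n * df * df
    return depthwise + pointwise


# Example layer: 3x3 kernel, 512 -> 512 channels, 14x14 feature map.
dk, m, n, df = 3, 512, 512, 14
standard = standard_conv_cost(dk, m, n, df)
separable = depthwise_separable_cost(dk, m, n, df)
print(standard, separable, separable / standard)
```

The ratio equals 1/N + 1/D_K², so with 3 × 3 kernels the separable version needs roughly 8 to 9 times fewer multiply-adds.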
Depthwise Separable Convolution
[Figure from https://arxiv.org/abs/1704.04861: standard convolution filters, depthwise convolution filters, and 1×1 convolutional filters (pointwise convolution)]

open source nn frameworks on cellphones

  • 1. A Taste of Open Source Neural Network Frameworks on Cell Phones Koan-Sin Tan freedom@computer.org COSCUP, Taipei Aug 11th, 2018
  • 2. • interrupt me any when you have any questions • the talk is Taiwanese • the slide deck is in English • you ask me questions in English, Mandarin, or Taiwanese !2
  • 3. NN-based ML is already in cell phones • Google I/O: Mobile First —> AI First • TensorFlow Lite, Android Neural Network API • Lots of stuff from Google blogs and papers, e.g., Google Lens, federated learning in Gboard • Pixel Visual Core in Pixel 2 and Pixel 2/XL • Apple announced CoreML, a machine framework, at WWDC 2017 (June 2017) • Apple’s machine learning journal (https://machinelearning.apple.com/): how Apple uses CNN and other machine techniques in iPhone • Neural Engine • Computer Architecture: A Quantitative Approach, 6th Ed. (Nov, 2017) has a whole new chapter on Domain Specific Architecture, actually NN accelerators. !3
  • 5. • Michael Jordan published an article on Medium named “Artificial Intelligence — The Revolution Hasn’t Happened Yet” • Yes, but current deep learning driven stuff should be enough for next few years [1] https://medium.com/ @mijordan3/artificial-intelligence- the-revolution-hasnt-happened- yet-5e1d5812e1e7
  • 7. Your phone personalizes the model locally, based on your usage (A). Many users' updates are aggregated (B) to form a consensus change (C) to the shared model, after which the procedure is repeated. https://research.googleblog.com/2017/04/federated-learning-collaborative.html
  • 8. • Why talking about open-source frameworks on edge devices • I like open source • I work for a company which is designing chips for edge devices mostly • Some arguments for NN and general machine learning on edge devices are: privacy, latency, bandwidth, connection, local sensors, cost, and convenience !8
  • 9. Some progresses make NN on edge devices really viable • “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size” [1]. A keynote at ESWEEK 2017, “Keynote: Small Neural Nets Are Beautiful: Enabling Embedded Systems with Small Deep- Neural-Network Architectures” [2] • MobileNet V1 [3] and V2 [4]: Depthwise separable convolution [5] and inverted residuals and linear bottlenecks • AutoML, e.g., NASNet Mobile [6] and Mnasnet [7] • Quantization [8][9] [1] https://arxiv.org/abs/1602.07360 [2] https://arxiv.org/abs/1710.02759 [3] https://arxiv.org/abs/1704.04861 [4] https://arxiv.org/abs/1801.04381 [5] https://www.di.ens.fr/data/publications/papers/phd_sifre.pdf [6] https://arxiv.org/abs/1707.07012 [7] https://ai.googleblog.com/2018/08/mnasnet-towards-automating-design-of.html, https://arxiv.org/abs/ 1807.11626 [8] https://arxiv.org/abs/1712.05877 [9] https://arxiv.org/abs/1806.08342 !9
  • 10. • Mainly on TensorFlow Lite and Caffe2 for edge devices • Why? • open source • designed to use NN accelerators • Running NN stuff on CPUs is generally not as energy-efficient as on accelerators !10
  • 11. !11
  • 12. • We heard Android NN and TensorFlow Lite back in Google I/ O 2017 • My COSCUP 2017 slide deck “TensorFlow on Android” • https://www.slideshare.net/kstan2/tensorflow-on- android • People knew a bit about Android NN API before it was announced and released • No information about TensorFlow Lite, at least to me, before it was released in last Nov !12
  • 13. tf-lite and android NN in Google I/O • New TensorFlow runtime • Optimized for mobile and embedded apps • Runs TensorFlow models on device • Leverage Android NN API • Soon to be open sourced from Google I/O 2017 video 13
  • 14. Actual Android NN API • Announced/published with Android 8.1 Preview 1 • Available to developer in NDK • yes, NDK • The Android Neural Networks API (NNAPI) is an Android C API designed for running computationally intensive operations for machine learning on mobile devices • NNAPI is designed to provide a base layer of functionality for higher-level machine learning frameworks (such as TensorFlow Lite, Caffe2, or others) that build and train neural networks • The API is available on all devices running Android 8.1 (API level 27) or higher. https://developer.android.com/ndk/images/nnapi/nnapi_architecture.png 14
  • 15. Android NN on Pixel 2 • Only the CPU fallback was available on Oreo MR1 • Actually, you can see Android NN API related in AOSP after Oreo MR1 (8.1) release already • user level code, see https://android.googlesource.com/platform/frameworks/ml/+/oreo- mr1-release • HAL, see https://android.googlesource.com/platform/hardware/interfaces/+/oreo-mr1- release/neuralnetworks/ • There is NN API 1.1 on Android Pie • https://developer.android.com/about/versions/pie/android-9.0#nnapi • adding support for nine new ops — Pad, BatchToSpaceND, SpaceToBatchND, Transpose, Strided Slice, Mean, Div, Sub, and Squeeze • In the Android P DP1/2 (https://developer.android.com/preview/download.html), there was a HVX NN API 1.0 (yes, 1.0) driver. Gone after DP2. Not in recent Pie release. !15
  • 16. TensorFlow Lite • TensorFlow Lite is TensorFlow’s lightweight solution for mobile and embedded devices • It enables on-device machine learning inference with low latency and a small binary size • Low latency techniques: optimizing the kernels for mobile apps, pre-fused activations, and quantized kernels that allow smaller and faster (fixed-point math) models • TensorFlow Lite also supports hardware acceleration with the Android Neural Networks API !16 https://www.tensorflow.org/mobile/tflite/
  • 17. What does TensorFlow Lite contain? • a set of core operators, both quantized and float, which have been tuned for mobile platforms • pre-fused activations and biases to further enhance performance and quantized accuracy • using custom operations in models also supported • a new model file format, based on FlatBuffers • the primary difference is that FlatBuffers does not need a parsing/unpacking step to a secondary representation before you can access data • the code footprint of FlatBuffers is an order of magnitude smaller than protocol buffers • a new mobile-optimized interpreter, • key goals: keeping apps lean and fast. • a static graph ordering and a custom (less-dynamic) memory allocator to ensure minimal load, initialization, and execution latency • an interface to Android NN API if available !17 https://www.tensorflow.org/mobile/tflite/
  • 18. why a new mobile-specific library? • Innovation at the silicon layer is enabling new possibilities for hardware acceleration, and frameworks such as the Android Neural Networks API make it easy to leverage these • Recent advances in real-time computer-vision and spoken language understanding have led to mobile-optimized benchmark models being open sourced (e.g. MobileNets, SqueezeNet) • Widely-available smart appliances create new possibilities for on-device intelligence • Interest in stronger user data privacy paradigms where user data does not need to leave the mobile device • Ability to serve ‘offline’ use cases, where the device does not need to be connected to a network !18 https://www.tensorflow.org/mobile/tflite/
  • 19. • A set of core operators, both quantized and float, many of which have been tuned for mobile platforms. These can be used to create and run custom models. Developers can also write their own custom operators and use them in models • A new FlatBuffers-based model file format • On-device interpreter with kernels optimized for faster execution on mobile • TensorFlow converter to convert TensorFlow-trained models to the TensorFlow Lite format • Smaller in size: TensorFlow Lite is smaller than 300KB when all supported operators are linked and less than 200KB when using only the operators needed for supporting InceptionV3 and Mobilenet • FACT CHECK: armeabi-v7a: 497,192 bytes, arm64-v8a: 675,572 bytes • Pre-tested models • Inception V3, MobileNet, On Device Smart Reply • Quantized versions of the MobileNet model, which run faster than the non-quantized (float) version on CPU • New Android demo app to illustrate the use of TensorFlow Lite with a quantized MobileNet model for object classification • Java and C++ API support https://www.tensorflow.org/mobile/tflite/
  • 20. • Java API: A convenience wrapper around the C++ API on Android • C++ API: Loads the TensorFlow Lite Model File and invokes the Interpreter. The same library is available on both Android and iOS https://www.tensorflow.org/mobile/tflite/ 20
  • 21. Other bindings • Python and C APIs • Python: introduced in TF 1.8.0, built into pip package in 1.9.0 • my label_image.py for tflite merged on Aug 9, 2018 • https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/examples/python/label_image.py • https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/examples/python/label_image.md • C API: introduced for Unity • https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/experimental/c
  • 22. In Dec, 2017 • Let $TF_ROOT be root of tensorflow • source of tf-lite: ${TF_ROOT}/tensorflow/contrib/lite/ • https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/README.md • examples • two for Android, two for iOS • APIs: ${TF_ROOT}/tensorflow/contrib/lite/g3doc/apis.md, https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/g3doc/apis.md • no benchmark_model: well, there is one, https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/tools/benchmark_model.cc • it’s incomplete • no command line label_image (https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/label_image)
  • 23. Aug, 2018 • Let $TF_ROOT be root of tensorflow • source of tf-lite: ${TF_ROOT}/tensorflow/contrib/lite/ • https://github.com/tensorflow/tensorflow/tree/r1.10/tensorflow/contrib/lite/README.md • https://github.com/tensorflow/tensorflow/tree/r1.10/tensorflow/docs_src/mobile/tflite • examples • (two + at least 3) for Android, two for iOS • APIs: ${TF_ROOT}/tensorflow/contrib/lite/g3doc/apis.md, https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/g3doc/apis.md • benchmark_model: https://github.com/tensorflow/tensorflow/tree/r1.10/tensorflow/contrib/lite/tools/benchmark/ • command line label_image: my label_image for TF Lite has been merged (https://github.com/tensorflow/tensorflow/pull/15095)
  • 24. Basic Usage • model: .tflite model • resolver: if no custom ops, builtin op resolver is enough • interpreter: we need it to compute the graph • interpreter->AllocateTensors(): allocate stuff for you, e.g., input tensor(s) • fill the input • interpreter->Invoke(): run the graph • process the output
    // Load the .tflite file; BuildFromFile() returns a std::unique_ptr.
    std::unique_ptr<tflite::FlatBufferModel> model = tflite::FlatBufferModel::BuildFromFile(path_to_model);
    tflite::ops::builtin::BuiltinOpResolver resolver;
    std::unique_ptr<tflite::Interpreter> interpreter;
    tflite::InterpreterBuilder(*model, resolver)(&interpreter);
    // Resize input tensors, if desired.
    interpreter->AllocateTensors();
    float* input = interpreter->typed_input_tensor<float>(0);
    // Fill `input`.
    interpreter->Invoke();
    float* output = interpreter->typed_output_tensor<float>(0);
  • 30. TfLiteContext
    typedef struct TfLiteContext {
      size_t tensors_size;
      TfLiteStatus (*GetExecutionPlan)(struct TfLiteContext* context, TfLiteIntArray** execution_plan);
      TfLiteTensor* tensors;
      void* impl_;
      TfLiteStatus (*ResizeTensor)(struct TfLiteContext*, TfLiteTensor* tensor, TfLiteIntArray* new_size);
      void (*ReportError)(struct TfLiteContext*, const char* msg, ...);
      TfLiteStatus (*AddTensors)(struct TfLiteContext*, int tensors_to_add, int* first_new_tensor_index);
      TfLiteStatus (*GetNodeAndRegistration)(struct TfLiteContext*, int node_index, TfLiteNode** node, TfLiteRegistration** registration);
      TfLiteStatus (*ReplaceSubgraphsWithDelegateKernels)(struct TfLiteContext*, TfLiteRegistration registration, const TfLiteIntArray* nodes_to_replace, TfLiteDelegate* delegate);
      int recommended_num_threads;
      TfLiteExternalContext* (*GetExternalContext)(struct TfLiteContext*, TfLiteExternalContextType);
      void (*SetExternalContext)(struct TfLiteContext*, TfLiteExternalContextType, TfLiteExternalContext*);
    } TfLiteContext;
    https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/context.h#L305-L371
  • 32. beyond basic stuff • More information in https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/interpreter.h • const char* GetInputName(int index): https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/interpreter.h#L198-L200 • const char* GetOutputName(int index): https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/interpreter.h#L210-L212 • size_t tensors_size() const: https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/interpreter.h#L215 • TfLiteTensor* tensor(int tensor_index): https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/interpreter.h#L230-L234 • size_t nodes_size() const: https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/interpreter.h#L218 • const std::pair<TfLiteNode, TfLiteRegistration>* node_and_registration(int node_index): https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/interpreter.h#L244-L249 • Yes, we can enumerate/traverse tensors and nodes
  • 33. beyond basic stuff • void UseNNAPI(bool enable) • void SetNumThreads(int num_threads) • my label_image for tflite • merged since mid-Jan, 2018 • benchmark_model for tflite !33
  • 34. Conv Op • Convolution: https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/kernels/conv.cc#L445-L492 • Quantized uint8 template <KernelType kernel_type> void EvalQuantized(TfLiteContext* context, TfLiteNode* node, TfLiteConvParams* params, OpData* data, TfLiteTensor* input, TfLiteTensor* filter, TfLiteTensor* bias, TfLiteTensor* im2col, TfLiteTensor* hwcn_weights, TfLiteTensor* output) • https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/kernels/conv.cc#L326-L367 • optimized_ops::Conv(…), https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/kernels/internal/optimized/optimized_ops.h#L2049-L2126 • float32 template <KernelType kernel_type> void EvalFloat(TfLiteContext* context, TfLiteNode* node, TfLiteConvParams* params, OpData* data, TfLiteTensor* input, TfLiteTensor* filter, TfLiteTensor* bias, TfLiteTensor* im2col, TfLiteTensor* hwcn_weights, TfLiteTensor* output) • https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/kernels/conv.cc#L369-L443 • multithreaded_ops::Conv(…) for most cases, https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/kernels/internal/optimized/multithreaded_conv.h#L135-L162
  • 35. misc • builtin state dump function • void PrintInterpreterState(Interpreter* interpreter): https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/optional_debug_tools.h#L25 • https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/examples/label_image/label_image.cc#L159 • Converting TF operations to TF Lite operations is not trivial • https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/g3doc/tf_ops_compatibility.md • https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/nnapi_delegate.cc
  • 37. label_image for tf lite • https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/examples/label_image/ • https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/examples/label_image/label_image.md • Run a TF Lite single-input, single-output classifier model, e.g., MobileNet V1, so that we can verify whether the classifier works • What does it do • read an image: unlike TF, there is no image decoder in TF Lite, so I wrote a simple .bmp decoder • resize the input image to a specific size, e.g., 224x224 or 299x299 • convert the image tensor to floating point if necessary • load the classifier • prepare tensors • run the model • process the output • top-k labels
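The last step on the slide, turning the classifier's output into top-k labels, can be sketched in plain Python. This is a minimal illustration, not the code in label_image.cc: the function name `top_k`, the `threshold` parameter, and the sample scores and labels are all hypothetical, and the division by 255 for a uint8 output is an assumption about how a quantized model's scores map to [0, 1].

```python
import heapq


def top_k(scores, labels, k=5, threshold=0.001, quantized=False):
    """Return up to k (score, label) pairs with score >= threshold, best first."""
    # Assumption: a quantized (uint8) output is mapped to [0, 1] by dividing by 255.
    if quantized:
        scores = [s / 255.0 for s in scores]
    candidates = [(s, labels[i]) for i, s in enumerate(scores) if s >= threshold]
    # nlargest avoids sorting the full score vector (about 1000 classes for ImageNet).
    return heapq.nlargest(k, candidates, key=lambda pair: pair[0])


# Hypothetical outputs of a float model and of a quantized model.
labels = ["background", "tabby", "tiger cat", "lynx"]
float_result = top_k([0.01, 0.70, 0.25, 0.04], labels, k=2)
quant_result = top_k([3, 179, 64, 10], labels, k=2, quantized=True)
```

Both calls rank "tabby" first and "tiger cat" second; only the score scale differs between the float and the quantized path.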
  • 38. speed of quantized one • It seems it's much better than naive quantization as we saw before • On Nexus 9 (MobileNet 1.0/224) • Quantized • ./label_image -t 2: ~ 160 ms • ./label_image -t 2 -c 100: ~ 60 ms • Floating point • ./label_image -t 2 -m ./mobilenet_v1_1.0_224.tflite: ~ 300 ms • ./label_image -t 2 -c 100 -m ./mobilenet_v1_1.0_224.tflite: ~ 82 ms • TFLiteCameraDemo: 130 - 180 ms • Pixel 2 • TFLiteCameraDemo: • CPU • single thread: as is: ~ 90 ms, controlled env: ~ 70 ms • 4 threads: ~ 30 ms • HVX: ~ 12 ms !38
  • 39. Custom Operators • https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/g3doc/custom_operators.md • OpInit(), OpFree(), OpPrepare(), and OpInvoke() in interpreter.cc
    typedef struct {
      void* (*init)(TfLiteContext* context, const char* buffer, size_t length);
      void (*free)(TfLiteContext* context, void* buffer);
      TfLiteStatus (*prepare)(TfLiteContext* context, TfLiteNode* node);
      TfLiteStatus (*invoke)(TfLiteContext* context, TfLiteNode* node);
    } TfLiteRegistration;
  • 40. Fake Quantization in early Dec, 2017 • How hard can it be? How much time is needed? • Several pre-tested models are available • https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/g3doc/models.md • but only one of them (https://storage.googleapis.com/download.tensorflow.org/models/tflite/mobilenet_v1_224_android_quant_2017_11_08.zip) is a quantized one • as we can guess from related docs, retraining is more or less required to get the accuracy back
  • 41. Note that the biases are not quantized because they are represented as 32-bit integers in the inference process, with a much higher range and precision compared to the 8 bit weights and activations. Furthermore, quantization parameters used for biases are inferred from the quantization parameters of the weights and activations. See section 2.4. Typical TensorFlow code illustrating use of [19] follows:
    from tf.contrib.quantize import quantize_graph as qg
    g = tf.Graph()
    with g.as_default():
      output = ...
      total_loss = ...
      optimizer = ...
      train_tensor = ...
    if is_training:
      quantized_graph = qg.create_training_graph(g)
    else:
      quantized_graph = qg.create_eval_graph(g)
    # Train or evaluate quantized_graph.
  3.2. Batch normalization folding: For models that use batch normalization (see [17]), there is additional complexity: the training graph contains batch normalization as a separate block of operations, whereas the inference graph has batch normalization parameters “folded” into the convolutional or fully connected layer’s
  [Figure 1.1 panels: (a) Integer-arithmetic-only inference; (b) Training with simulated quantization; (c) ImageNet latency-vs-accuracy tradeoff, Top 1 Accuracy against Latency (ms) for Float and 8-bit]
  Figure 1.1: Integer-arithmetic-only quantization. a) Integer-arithmetic-only inference of a convolution layer. The input and output are represented as 8-bit integers according to equation 1. The convolution involves 8-bit integer operands and a 32-bit integer accumulator. The bias addition involves only 32-bit integers (section 2.4). The ReLU6 nonlinearity only involves 8-bit integer arithmetic. b) Training with simulated quantization of the convolution layer. All variables and computations are carried out using 32-bit floating-point arithmetic. Weight quantization (“wt quant”) and activation quantization (“act quant”) nodes are injected into the computation graph to simulate the effects of quantization of the variables (section 3). The resultant graph approximates the integer-arithmetic-only computation graph in panel a), while being trainable using conventional optimization algorithms for floating point models. c) Our quantization scheme benefits from the fast integer-arithmetic circuits in common CPUs to deliver an improved latency-vs-accuracy tradeoff (section 4). The figure compares integer quantized MobileNets [10] against floating point baselines on ImageNet [3] using Qualcomm Snapdragon 835 LITTLE cores.
  tions [14, 27, 34]. With these approaches, both multiplications and additions can be implemented by efficient bit-shift and bit-count operations, which are showcased in custom GPU kernels (BNN [14]). However, 1 bit quantization of-
  Our work draws inspiration from [7], which leverages low-precision fixed-point arithmetic to accelerate the training speed of CNNs, and from [31], which uses 8-bit fixed-point arithmetic to speed up inference on x86 CPUs. Our
  [1] https://www.tensorflow.org/performance/quantization [2] https://arxiv.org/abs/1712.05877 [3] https://arxiv.org/abs/1806.08342
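The excerpt says that weights and activations are stored as 8-bit integers while biases stay 32-bit with parameters inferred from the other two. A minimal Python sketch of that scheme follows, assuming the affine mapping r = S(q − Z) that the caption calls equation 1. The helper names (`choose_params`, `quantize`, `dequantize`) and all numeric ranges are made up for illustration; this is not the TensorFlow or gemmlowp implementation.

```python
def choose_params(r_min, r_max, q_min=0, q_max=255):
    """Pick a scale S and zero point Z so that [r_min, r_max] maps onto [q_min, q_max]."""
    # Widen the range to include 0.0 so that real zero is exactly representable.
    r_min, r_max = min(r_min, 0.0), max(r_max, 0.0)
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = int(round(q_min - r_min / scale))
    return scale, max(q_min, min(q_max, zero_point))


def quantize(r, scale, zero_point, q_min=0, q_max=255):
    """Map a real value to the nearest quantized value, clamped to [q_min, q_max]."""
    q = int(round(r / scale)) + zero_point
    return max(q_min, min(q_max, q))


def dequantize(q, scale, zero_point):
    """Affine mapping r = S * (q - Z)."""
    return scale * (q - zero_point)


# Hypothetical ranges: activations in [-1, 3], weights in [-0.5, 0.5].
act_scale, act_zero = choose_params(-1.0, 3.0)
w_scale, w_zero = choose_params(-0.5, 0.5)

# Biases stay 32-bit: their scale is inferred as S_weights * S_activations, zero point 0.
bias_scale = w_scale * act_scale
bias_q = int(round(0.37 / bias_scale))

x = 1.234
x_restored = dequantize(quantize(x, act_scale, act_zero), act_scale, act_zero)
```

For in-range values the round trip through 8 bits loses at most half a quantization step (`act_scale / 2`), while real zero survives exactly, which matters for zero padding.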
  • 43. Real computation • BLAS part: Eigen (http://eigen.tuxfamily.org/) and gemmlowp (https://github.com/google/gemmlowp) • Some Caveats • convolutions are multithreaded • uint8/gemm: number of cores • float32/Eigen: 4 • problems: big.LITTLE, number of cores, scheduling !43
  • 44. Things we didn’t touch • Memory management: to get reasonably good performance when running highly parallel workloads on mobile devices, you need a good enough memory management mechanism • Profiling: there has been a simple profiling mechanism in TF Lite since Apr, 2018 • time profiling only for now. How about memory profiling? • static buffer size: https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/profiling/profiler.h#L80 • https://github.com/tensorflow/tensorflow/tree/r1.10/tensorflow/contrib/lite/profiling • Computation of quantized uint8 • when you do operations on tensors, the scale and zero point can change. How do we handle that efficiently?
  • 45. Quick Intro to Caffe2 • Caffe2 • 2nd generation of Caffe, the Berkeley framework that was the most popular deep learning framework before TensorFlow • merged into PyTorch • What's the difference? Caffe2 improves Caffe 1.0 in a series of directions: • first-class support for large-scale distributed training • mobile deployment • new hardware support (in addition to CPU and CUDA) • flexibility for future directions such as quantized computation • stress tested by the vast scale of Facebook applications https://caffe2.ai/docs/caffe-migration.html
  • 46. Caffe2 on Android • Official Android demo • https://caffe2.ai/docs/AI-Camera-demo-android.html, https://github.com/caffe2/AICamera • SqueezeNet 1.1: • 5.8/5.7 fps on Samsung S7 and Google Pixel • not very impressive • OpenGL backend • https://www.facebook.com/Caffe2AI/videos/126340488008269/ • up to 6X speedup (24 FPS) compared to CPU on high-end Android devices (e.g. Galaxy S8) for style transfer models
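One common answer to the scale question on the slide above is to fold the input, weight, and output scales into one real multiplier M = S_in · S_w / S_out and apply it to the 32-bit accumulator as an integer multiply plus a right shift. The Python sketch below illustrates that idea with hypothetical scales; the function names are made up, and the rounding is simplified, so it will not match gemmlowp or TF Lite kernels bit for bit.

```python
import math


def quantize_multiplier(real_multiplier):
    """Split a real multiplier in (0, 1) into (fixed-point mantissa M0, right shift)."""
    assert 0.0 < real_multiplier < 1.0
    # frexp gives real_multiplier == mantissa * 2**exponent with mantissa in [0.5, 1).
    mantissa, exponent = math.frexp(real_multiplier)
    m0 = int(round(mantissa * (1 << 31)))
    if m0 == (1 << 31):  # rounding pushed the mantissa up to 1.0; renormalize
        m0 //= 2
        exponent += 1
    return m0, -exponent


def rescale(acc, m0, shift):
    """Approximate acc * real_multiplier with integer operations only (round to nearest)."""
    total_shift = 31 + shift
    rounding = 1 << (total_shift - 1)
    return (acc * m0 + rounding) >> total_shift


# Hypothetical scales: input 0.02, weights 0.005, output 0.1, so M = 0.001.
m0, shift = quantize_multiplier(0.02 * 0.005 / 0.1)
requantized = rescale(12345, m0, shift)  # int32 accumulator mapped to the output scale
```

The floating-point work happens once, when the multiplier is prepared; the per-element path is a multiply, an add, and a shift, which is why a changing scale does not force float arithmetic at inference time.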
  • 48. • Tensorflow Lite is also looking for the possibility of OpenGL ES backend • https://github.com/tensorflow/tensorflow/issues/16189 !48
  • 49. What can we use on Android now !49 https://github.com/caffe2/caffe2/tree/master/caffe2/mobile/contrib
  • 50. Caffe2 backends for Android I know • ARM CPU: • NNPACK, Eigen: quite mature • OpenGL ES: • OpenGL: not actively maintained (?) • ARM Compute Library (GL ES part): newly added, still growing • NEON, and OpenCL • NNAPI: not fully integrated yet. !50
  • 51. How to build • > scripts/build_android.sh • With that alone, no command-line test binaries are built • Caffe2 has some tests and a simple command line benchmark tool called speed_benchmark > scripts/build_android.sh -DBUILD_TEST -DBUILD_BINARY • then we can get build_android/bin/speed_benchmark and other test binaries • PyTorch has a good tutorial on using it, http://pytorch.org/tutorials/advanced/super_resolution_with_caffe2.html
  • 52. Some results • > ./speed_benchmark --input_file input.blobproto --input data --init_net init_net.pb --net predict_net.pb --caffe2_log_level=0
    01-06 23:15:42.073 32623 32623 I native : [I net_simple.cc:101] Starting benchmark.
    01-06 23:15:42.074 32623 32623 I native : [I net_simple.cc:102] Running warmup runs.
    01-06 23:15:42.074 32623 32623 I native : [I net_simple.cc:112] Main runs.
    01-06 23:15:43.805 32623 32623 I native : [I net_simple.cc:123] Main run finished. Milliseconds per iter: 173.15. Iters per second: 5.77535
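The log shows the usual benchmark structure: untimed warmup runs, then timed main runs, reported as milliseconds per iteration and its reciprocal (1000 / 173.15 ≈ 5.775 iterations per second). A minimal Python sketch of that structure follows; it is not Caffe2's speed_benchmark, and the `benchmark` function, its default counts, and the stand-in workload are all hypothetical.

```python
import time


def benchmark(run_once, warmup=10, iterations=10):
    """Time run_once() after some warmup; return (ms per iteration, iterations per second)."""
    # Warmup runs are not timed: they absorb one-time costs such as cache fills.
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(iterations):
        run_once()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    ms_per_iter = elapsed_ms / iterations
    return ms_per_iter, 1000.0 / ms_per_iter


def fake_inference():
    """Stand-in workload; a real run would invoke the network here."""
    return sum(i * i for i in range(10000))


ms_per_iter, iters_per_second = benchmark(fake_inference, warmup=2, iterations=5)
```

Keeping the warmup outside the timed region matters on phones in particular, where the first runs often include frequency ramp-up and lazy initialization.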
  • 53. Some results • ARM Compute Library backend: Caffe2 added an ARM Compute Library backend at the end of February 2018. With some tweaks, it's possible to run SqueezeNet 1.1 faster than CPU (NNPACK) with OpenGL
    01-04 03:41:38.297 25523 25523 I native : [I gl_model_test.h:52] [C2DEBUG] Benchmarking OpenGL Net
    01-04 03:41:38.297 25523 25523 I native : [I net_gl.cc:104] Starting benchmark.
    01-04 03:41:38.297 25523 25523 I native : [I net_gl.cc:105] Running warmup runs.
    01-04 03:41:38.796 25523 25523 I native : [I net_gl.cc:121] Main runs.
    01-04 03:41:43.107 25523 25523 I native : [I net_gl.cc:134] [C2DEBUG] Main run finished. Milliseconds per iter: 43.1077. Iters per second: 23.1977
    01-04 03:41:43.110 25523 25523 I native : [I gl_model_test.h:66] [C2DEBUG] Benchmarking CPU Net
    01-04 03:41:43.110 25523 25523 I native : [I net_simple.cc:101] Starting benchmark.
    01-04 03:41:43.110 25523 25523 I native : [I net_simple.cc:102] Running warmup runs.
    01-04 03:41:43.768 25523 25523 I native : [I net_simple.cc:112] Main runs.
    01-04 03:41:50.229 25523 25523 I native : [I net_simple.cc:123] Main run finished. Milliseconds per iter: 64.6136. Iters per second: 15.4766
  • 54. Comparing with TF Lite • cmake is easier than bazel :-) • Relatively large, or say comprehensive. If you want to enable something like on-device learning, it's easier to start with TFLite. • binary could be large • Code looks cleaner • Review process, or say, software engineering is not as rigid as TensorFlow's • TF has a larger team (?) • See https://www.oreilly.com/ideas/how-the-tensorflow-team-handles-open-source-support • Some interesting code • The Observer design pattern could be used to measure performance, https://en.wikipedia.org/wiki/Observer_pattern • https://github.com/caffe2/caffe2/tree/master/caffe2/observers
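How the Observer pattern can measure performance is easier to see in a small example: the network notifies attached observers before and after each operator, and a timing observer accumulates the elapsed time without the operators knowing about it. The Python sketch below is a generic illustration; the `Net` and `TimeObserver` classes are hypothetical and do not mirror the Caffe2 observer classes.

```python
import time


class TimeObserver:
    """Accumulates wall-clock time per operator name."""

    def __init__(self):
        self.elapsed_ms = {}
        self._start = None

    def start(self, op_name):
        self._start = time.perf_counter()

    def stop(self, op_name):
        ms = (time.perf_counter() - self._start) * 1000.0
        self.elapsed_ms[op_name] = self.elapsed_ms.get(op_name, 0.0) + ms


class Net:
    """Runs operators in order and notifies attached observers around each one."""

    def __init__(self, ops):
        self.ops = ops  # list of (name, callable) pairs
        self.observers = []

    def attach(self, observer):
        self.observers.append(observer)

    def run(self, x):
        for name, fn in self.ops:
            for observer in self.observers:
                observer.start(name)
            x = fn(x)
            for observer in self.observers:
                observer.stop(name)
        return x


# Two toy "operators" standing in for real layers.
net = Net([("double", lambda v: v * 2), ("increment", lambda v: v + 1)])
timer = TimeObserver()
net.attach(timer)
result = net.run(3)
```

Because measurement lives in the observer, profiling can be switched on by attaching one and costs nothing when none is attached.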
  • 55. Beyond Open Source • Apple CoreML • https://developer.apple.com/documentation/coreml • Google ML Kit • https://developers.google.com/ml-kit/ • image labeling, OCR, face detection, bar code scanning, landmark detection, etc. • Custom models in TF Lite • Qualcomm Snapdragon Neural Processing Engine (SNPE) • https://developer.qualcomm.com/software/snapdragon-neural-processing-engine-ai • Huawei HiAI DDK
  • 56. Concluding Remarks • Deep learning on devices is here to stay. You can see some applications nowadays. More to come. • Pick an open-source framework to learn how system software for ML/DL works. • Parallelization, parallelization, and parallelization • Memory, memory, and memory • If you are a hardware guy, accelerators on edge devices should be an interesting topic • I didn’t expect to see systolic arrays and many-core stuff on edge devices for general apps • If you are a more research-oriented guy, think about something like NN models for edge devices • Even if you are none of the above, learning the history and status quo of AI and machine learning to satisfy your intellectual curiosity should be fun
  • 58. Depthwise Separable Convolution • CNNs with depthwise separable convolution such as MobileNet [1] changed almost everything • Depthwise separable convolution “factorizes” a standard convolution into a depthwise convolution and a 1 × 1 convolution called a pointwise convolution. Thus it greatly reduces computational complexity. • Depthwise separable convolution is not that new [2], but pure depthwise separable convolution-based networks such as Xception and MobileNet demonstrated its power [1] https://arxiv.org/abs/1704.04861 [2] L. Sifre. “Rigid-motion scattering for image classification”, PhD thesis, 2014
  • 59. Depthwise Separable Convolution • [Figure from https://arxiv.org/abs/1704.04861: standard convolution filters (N filters of size DK × DK × M), depthwise convolution filters (M filters of size DK × DK × 1), and 1×1 convolutional filters, i.e., pointwise convolution (N filters of size 1 × 1 × M)]
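To put a number on "greatly reduces", the multiply-accumulate counts of the two forms can be compared directly, using the symbols of the MobileNet paper [1]: kernel size DK, input channels M, output channels N, and a DF × DF output feature map. The sketch below is plain arithmetic; the layer shape is a hypothetical example.

```python
def standard_conv_cost(dk, m, n, df):
    """Multiply-accumulates of a standard convolution: DK*DK*M*N*DF*DF."""
    return dk * dk * m * n * df * df


def depthwise_separable_cost(dk, m, n, df):
    """Depthwise part (DK*DK*M*DF*DF) plus pointwise 1x1 part (M*N*DF*DF)."""
    depthwise = dk * dk * m * df * df
    pointwise = m * n * df * df
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 512 input and output channels, 14x14 feature map.
dk, m, n, df = 3, 512, 512, 14
ratio = depthwise_separable_cost(dk, m, n, df) / standard_conv_cost(dk, m, n, df)
# The ratio simplifies to 1/N + 1/DK**2, about 0.113 for this layer.
```

With 3 × 3 kernels the factorized form therefore needs roughly 8 to 9 times fewer multiply-accumulates, nearly independent of the channel counts once N is large.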