Explains the basic mechanism of the convolutional neural network (CNN) with sample TensorFlow code.
Sample codes: https://github.com/enakai00/cnn_introduction
Introduction to Convolutional Neural Network with TensorFlow
1. Introduction to Convolutional Neural Network with TensorFlow
Etsuji Nakai
Cloud Solutions Architect at Google
2017/03/24 ver1.0
3. Image Classification Transfer Learning with Inception v3
● What's happening here?!
○ https://codelabs.developers.google.com/codelabs/cpb102-txf-learning
4. Convolutional Neural Network with Two Convolution Layers
● Let's study the underlying mechanism with this (relatively) simple CNN.
[Figure: CNN pipeline with two convolution layers: raw image → convolution filter → pooling layer → convolution filter → pooling layer → fully-connected layer → dropout layer → softmax function]
5. Jupyter Notebooks
● Launch Cloud Datalab.
○ https://cloud.google.com/datalab/docs/quickstarts
● Open a new notebook and execute the following command.
○ !git clone https://github.com/enakai00/cnn_introduction.git
● Find the notebook files in the "cnn_introduction" folder.
7. Sample Problem
● Training set:
○ N data points on the (x, y) plane.
○ Each data point belongs to one of two categories, labeled as t = 1, 0.
● Problem to solve:
○ Find a straight line that classifies the given data.
○ If there is no perfect answer (one without any misclassification), find a line that is optimal in some sense.
[Figure: scatter plot of the two classes (●, ✕) on the (x, y) plane]
8. Logistic Regression: Theoretical Ground
● Define the straight line as below; the line itself is given by f(x, y) = 0:
f(x, y) = w0 + w1 x + w2 y
● We apply the maximum likelihood method to determine the parameter w = (w0, w1, w2).
● In other words, we define a "probability of obtaining the training set" and maximize it.
[Figure: the separation line f(x, y) = 0 between the two classes (●, ✕) on the (x, y) plane]
9. Logistic Sigmoid Function
● The probability of t = 1 for a new data point at (x, y) should have the following properties:
○ P(x, y) = 0.5 on the separation line.
○ P(x, y) → 1 (or 0) when moving away from the separation line on one side (or the other).
● This can be satisfied by translating f(x, y) into a probability through the logistic sigmoid function σ(a):
σ(a) = 1 / (1 + e^(-a)),  P(x, y) = σ(f(x, y))
[Figure: the sigmoid surface over the (x, y) plane; P(x, y) increases in the direction perpendicular to the separation line]
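As a quick numerical check, here is a minimal NumPy sketch of these properties of σ(a):

import numpy as np

def sigmoid(a):
    # Logistic sigmoid: maps any real number into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-a))

# On the separation line, f(x, y) = 0, so the probability is exactly 0.5.
print(sigmoid(0.0))                    # 0.5
# Moving away from the line, the probability approaches 0 or 1.
print(sigmoid(np.array([-5.0, 5.0])))  # [0.00669285 0.99330715]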
10. Likelihood Function of Logistic Regression
● Using the probability defined on the previous page, calculate the probability of reproducing the training set {(x_n, y_n), t_n} (n = 1, …, N).
○ If t_n = 1, the probability of observing it at (x_n, y_n) is P_n = P(x_n, y_n).
○ If t_n = 0, the probability of observing it at (x_n, y_n) is 1 - P_n.
○ These results can be expressed by a single equation as below. (Remember that x^0 = 1 for any x.)
P_n^(t_n) (1 - P_n)^(1 - t_n)
● Hence, the total probability of reproducing all the data (the likelihood function) is expressed as:
P = Π_(n=1)^N P_n^(t_n) (1 - P_n)^(1 - t_n)
11. Loss Function
● Instead of maximizing the likelihood function, we generally minimize the following loss function to avoid the underflow issue of numerical calculation; taking the logarithm turns the product into a sum:
E = -log P = -Σ_n [ t_n log P_n + (1 - t_n) log(1 - P_n) ]
12. Gradient Descent Optimization
● By incrementally modifying the parameters in the opposite direction of the gradient vector, we may eventually reach the minimum:
w ← w - ε ∇E(w)
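As a concrete illustration, here is a minimal pure-Python sketch of gradient descent on a hypothetical one-dimensional loss E(w) = (w - 3)^2:

def grad(w):
    # Gradient of the hypothetical loss E(w) = (w - 3)**2.
    return 2.0 * (w - 3.0)

w = 0.0        # initial parameter value
epsilon = 0.1  # learning rate ("step size"; see the next slide)
for _ in range(100):
    w -= epsilon * grad(w)  # step in the opposite direction of the gradient
print(w)  # converges toward the minimum at w = 3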
13. Learning Rate and Convergence Issue
● The learning rate ε decides the "step size" of each modification.
● The convergence of the optimization depends on the value of the learning rate.
[Figure: a suitable ε converges to the minimum; a too-large ε diverges]
http://sebastianruder.com/optimizing-gradient-descent/
15. Programming Style of TensorFlow
● All data is represented by "multidimensional lists."
○ In many cases, you can use a two-dimensional list, which is equivalent to a matrix. So by expressing models (functions) in matrix form, you can translate them into TensorFlow code.
● As a concrete example, we will write the following model (functions) in TensorFlow code.
○ Pay attention to the distinction between the following three kinds of objects:
■ Placeholder: a variable to store training data.
■ Variable: parameters to be adjusted by the training algorithm.
■ Functions constructed from Placeholders and Variables.
16. Programming Style of TensorFlow
● The linear function representing the straight line can be expressed in matrix form as below, where x = (x, y) is a 1×2 matrix and w = (w1, w2)^T is a 2×1 matrix:
f = xw + w0
● In general, (x, y) should be treated as a Placeholder which holds multiple data points simultaneously. So, letting (x_n, y_n) represent the n-th data point and X be the N×2 matrix whose n-th row is (x_n, y_n), you can write down the following matrix equation:
F = XW + w0,  W = (w1, w2)^T
○ Here, the n-th element of F corresponds to the value of f for the n-th data point, and the "broadcast rule" is applied to the last part "+ w0". This means adding w0 to all matrix elements.
17. Programming Style of TensorFlow
● Finally, by applying the sigmoid function σ to each element of F, the probability for each data point is calculated:
P = σ(F)
○ The "broadcast rule" is applied to σ(F), meaning that σ is applied to each element of F.
● These relationships are expressed in TensorFlow code as below.
x = tf.placeholder(tf.float32, [None, 2])
w = tf.Variable(tf.zeros([2, 1]))
w0 = tf.Variable(tf.zeros([1]))
f = tf.matmul(x, w) + w0
p = tf.sigmoid(f)
18. Programming Style of TensorFlow
● This explains the relationship between the matrix calculations and the TensorFlow code.
# The Placeholder stores the training data. The row size is None so that
# it can hold an arbitrary number of data points.
x = tf.placeholder(tf.float32, [None, 2])
# Variables represent the parameters to be trained (initialized to 0 here).
w = tf.Variable(tf.zeros([2, 1]))
w0 = tf.Variable(tf.zeros([1]))
# The "broadcast rule" (similar to NumPy arrays) is applied to the
# calculations: w0 is added to every element, and sigmoid is applied
# element-wise.
f = tf.matmul(x, w) + w0
p = tf.sigmoid(f)
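As a usage sketch (assuming TensorFlow 1.x), evaluating p before any training returns 0.5 for every data point, since the zero-initialized parameters make f = 0 everywhere:

import numpy as np
import tensorflow as tf

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Two hypothetical data points on the (x, y) plane.
    print(sess.run(p, feed_dict={x: np.array([[1.0, 2.0], [3.0, 4.0]])}))
    # [[0.5]
    #  [0.5]]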
19. Error Function and Training Algorithm
● To train the model (i.e., to adjust the parameters), we need to define the error function and the training algorithm.
t = tf.placeholder(tf.float32, [None, 1])
# tf.reduce_sum adds up all matrix elements.
loss = -tf.reduce_sum(t*tf.log(p) + (1-t)*tf.log(1-p))
# Use the Adam optimizer to minimize "loss".
train_step = tf.train.AdamOptimizer().minimize(loss)
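The training loop on the following slides also evaluates an "accuracy" node, which is not defined on this slide. A sketch of one way to define it for this binary model (predict t = 1 whenever p > 0.5) would be:

# A prediction is correct when p - 0.5 and t - 0.5 have the same sign.
correct_prediction = tf.equal(tf.sign(p - 0.5), tf.sign(t - 0.5))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))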
20. Calculations inside a Session
● The TensorFlow code we have prepared so far just defines functions and various relations without doing any calculation. We prepare a "Session", and the actual calculations are executed inside that session.
[Figure: Placeholders and Variables feed the calculations, all of which run inside a Session]
21. Using a Session to Train the Model
● Create a new session and initialize the Variables inside the session.
● By evaluating the training algorithm inside the session, the Variables are adjusted with the gradient descent method.
○ "feed_dict" specifies the data to be stored in the Placeholders.
○ When functions are evaluated in the session, the corresponding values are calculated using the current values of the Variables.
sess = tf.Session()
sess.run(tf.initialize_all_variables())

i = 0
for _ in range(20000):
    i += 1
    # The gradient descent method is applied using the training data
    # specified by feed_dict.
    sess.run(train_step, feed_dict={x:train_x, t:train_t})
    if i % 2000 == 0:
        # Calculate "loss" and "accuracy" using the current values of
        # the Variables.
        loss_val, acc_val = sess.run(
            [loss, accuracy], feed_dict={x:train_x, t:train_t})
        print ('Step: %d, Loss: %f, Accuracy: %f'
               % (i, loss_val, acc_val))
24. Recap: Logistic Regression
● Logistic regression gives the "probability of being classified as t = 1" for each data point in the training set.
● The parameters are adjusted to minimize the following loss function:
E = -Σ_n [ t_n log P_n + (1 - t_n) log(1 - P_n) ]
[Figure: the separation line between the two classes (●, ✕); P(x, y) increases in the direction perpendicular to the line]
25. Graphical Interpretation of Logistic Regression
● Drawing the 3-dimensional graph of z = f(x, y), we can see that the "tilted plate" divides the (x, y) plane into two classes.
● The logistic function σ translates the height z on the plate into the probability of t = 1.
[Figure: the tilted plane z = f(x, y) over the (x, y) plane, with the logistic function σ mapping the height z to a probability]
26. Building a Multicategory Linear Classifier
● How can we divide the plane into three classes (instead of two)?
● We can define three linear functions f_i(x, y) (i = 1, 2, 3) and classify a point based on "which of them has the maximum value at that point."
○ This is equivalent to dividing the plane with three tilted plates.
27. Translation to Probability with the Softmax Function
● We can define the probability that (x, y) belongs to the i-th class with the following softmax function:
P_i(x, y) = e^(f_i(x, y)) / Σ_j e^(f_j(x, y))
● This translates the magnitude of f_i(x, y) into a probability satisfying the following (reasonable) conditions:
○ 0 ≤ P_i(x, y) ≤ 1
○ Σ_i P_i(x, y) = 1
○ P_i increases monotonically with f_i.
[Figure: one-dimensional example of the "softmax translation"]
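A minimal NumPy sketch of this "softmax translation" for three hypothetical values of the linear functions:

import numpy as np

def softmax(f):
    # Subtracting the maximum before exponentiating avoids overflow and
    # does not change the result.
    e = np.exp(f - np.max(f))
    return e / np.sum(e)

f = np.array([2.0, 1.0, 0.1])  # hypothetical values of f_1, f_2, f_3
p = softmax(f)
print(p)         # [0.65900114 0.24243297 0.09856589]
print(p.sum())   # 1.0: a valid probability distribution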
29. Images as Points in a High Dimensional Space
● A grayscale image with 28×28 pixels can be represented as a 784-dimensional vector, i.e., a collection of 784 float numbers.
○ In other words, it corresponds to a single point in a 784-dimensional space!
● When we spread a bunch of images into this 784-dimensional space, similar images may come together to form clusters.
○ If this assumption is correct, we can classify the images by dividing the 784-dimensional space with the softmax function.
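A one-line NumPy sketch of this correspondence, using a hypothetical random image:

import numpy as np

image = np.random.rand(28, 28).astype(np.float32)  # hypothetical grayscale image
vector = image.reshape(784)  # the same image as a point in a 784-dimensional space
print(vector.shape)          # (784,)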
30. Matrix Representation
● To divide an M-dimensional space into K classes, we prepare K linear functions:
f_k(x_1, …, x_M) = w_(1k) x_1 + ⋯ + w_(Mk) x_M + w_k  (k = 1, …, K)
● Defining the n-th image data as x_n = (x_(n1), …, x_(nM)) and stacking all the data into the N×M matrix X, the values of the linear functions for all the data can be represented as below. (The broadcast rule is applied to the "+ w" operation: the bias row vector w = (w_1, …, w_K) is added to every row.)
F = XW + w
32. Matrix Representation
● Finally, we can translate the result into a probability by applying the softmax function. The probability that the n-th data point is classified into the k-th category is:
P_(nk) = e^(F_(nk)) / Σ_(k') e^(F_(nk'))
● TensorFlow has the "tf.nn.softmax" function, which calculates these directly from the matrix F.
33. TensorFlow Code for the Model
● The matrix representations we have built so far can be written in TensorFlow code as below.
○ Pay attention to the difference between the Placeholder and the Variables.
x = tf.placeholder(tf.float32, [None, 784])
w = tf.Variable(tf.zeros([784, 10]))
w0 = tf.Variable(tf.zeros([10]))
f = tf.matmul(x, w) + w0
p = tf.nn.softmax(f)
34. Loss Function
● The class label of the n-th data point is given by a vector t_n = (t_(n1), …, t_(nK)) in the one-of-K representation: it has 1 only for the k-th element, meaning its class is k.
● Since the probability of getting the correct answer for this data point is Π_(k') P_(nk')^(t_(nk')) = P_(nk) (because t_(nk') = 1 only for k' = k, the class of the n-th data point), the probability of getting correct answers for all the data is calculated as below:
P = Π_n Π_k P_(nk)^(t_(nk))
● We define the loss function as below. Minimizing the loss function is then equivalent to maximizing the probability P:
E = -log P = -Σ_n Σ_k t_(nk) log P_(nk)
35. TensorFlow Code for the Loss Function
● The loss function and the optimization algorithm can be written in TensorFlow code as below.
● The following code also calculates the accuracy of the model.
○ "correct_prediction" is a list of bool values: "correct or incorrect."
○ "accuracy" is calculated by taking the mean of the bool values (1 for correct, 0 for incorrect).
t = tf.placeholder(tf.float32, [None, 10])
loss = -tf.reduce_sum(t * tf.log(p))
train_step = tf.train.AdamOptimizer().minimize(loss)
correct_prediction = tf.equal(tf.argmax(p, 1), tf.argmax(t, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
36. Comparing Predictions and Class Labels
● The following shows how we calculate the correctness of the predictions: tf.argmax(p, 1) predicts the class of each image according to the maximum probability, tf.argmax(t, 1) indicates the correct class given by the one-of-K label, and tf.equal compares the two to check the answer.
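A small NumPy sketch of this comparison, using two hypothetical samples and three classes:

import numpy as np

p = np.array([[0.1, 0.7, 0.2],   # predicted probabilities for sample 1
              [0.3, 0.3, 0.4]])  # predicted probabilities for sample 2
t = np.array([[0, 1, 0],         # one-of-K label: class 1
              [1, 0, 0]])        # one-of-K label: class 0
correct = np.argmax(p, axis=1) == np.argmax(t, axis=1)
print(correct)         # [ True False]
print(correct.mean())  # 0.5, i.e., the accuracy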
37. Mini-Batch Optimization of Parameters
● We repeat the optimization operation using 100 samples at a time.
i = 0
for _ in range(2000):
    i += 1
    # Take the next 100 samples: batch_xs holds the image data and
    # batch_ts the label data.
    batch_xs, batch_ts = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, t: batch_ts})
    if i % 100 == 0:
        loss_val, acc_val = sess.run([loss, accuracy],
            feed_dict={x:mnist.test.images, t: mnist.test.labels})
        print ('Step: %d, Loss: %f, Accuracy: %f'
               % (i, loss_val, acc_val))
38. Mini-Batch Optimization of Parameters
● Mini-batch optimization has the following advantages:
○ It reduces the memory usage.
○ Its random movement helps avoid being trapped in local minima.
[Figure: the simple gradient descent method using all training data at once can get stuck in a local minimum, while stochastic gradient descent with the mini-batch method can escape toward the true minimum]
41. The Limitation of the Linear Categorizer
● The linear categorizer assumes that samples can be classified with flat planes.
● This cannot be a perfect assumption, and it fails to capture the global (topological) features of handwritten digits.
[Figure: correct and incorrect examples from the result of the linear classifier]
42. The Overview of the Convolutional Neural Network
● The convolutional neural network (CNN) uses image filters to extract features from images and applies hidden layers to classify them.
[Figure: CNN pipeline: raw image → convolution filter → pooling layer → convolution filter → pooling layer → fully-connected layer → dropout layer → softmax function]
43. Examples of Convolutional Filters
● Convolutional filters are ... just the kind of image filter you sometimes apply in Photoshop!
[Figure: a filter to blur images and a filter to extract vertical edges]
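As a sketch of how such a filter works, the following applies a hypothetical 3×3 vertical-edge filter to a toy image with SciPy:

import numpy as np
from scipy.signal import convolve2d

# A hypothetical 3x3 filter that responds to vertical edges:
# positive weights on the left column, negative on the right.
edge_filter = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]], dtype=np.float32)

# A toy 5x5 image: dark on the left, bright on the right.
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=np.float32)

filtered = convolve2d(image, edge_filter, mode='same')
print(filtered)  # large magnitudes appear along the vertical boundary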
44. Question
● To classify the following training set, what would be the best filters?
45. How the Convolutional Neural Network Works
● Applying image filters to capture various features of the image.
○ For example, if we want to classify the three characters "+", "-", "|", we can apply filters that extract vertical and horizontal edges as below.
● Applying the pooling layer to (deliberately) reduce the image resolution; see the sketch after this list.
○ The necessary information for classification is just the density of the filtered image.
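A minimal NumPy sketch of the 2×2 max pooling step that reduces the resolution:

import numpy as np

# 2x2 max pooling: halve the resolution by taking the maximum over each
# 2x2 block of the (filtered) image.
image = np.arange(16, dtype=np.float32).reshape(4, 4)
pooled = image.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[ 5.  7.]
               #  [13. 15.]]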
47. A Simple Model to Classify "+", "-", "|"
● In this model, we use pre-defined (fixed) filters to capture vertical and horizontal edges.
● Question: how can we choose appropriate filters for more general images?
[Figure: input image → convolution filter → pooling layer → softmax]
48. Exercise
● Run through the notebooks:
○ No.3 Convolutional Filter Example
○ No.4 Toy model with static filters
50. Dynamic Optimization of Filters
● In the convolutional neural network, we define the filters as Variables. The optimization algorithm tries to adjust the filter values to achieve better predictions.
○ The following code applies 16 filters to images with 28×28 pixels (= 784-dimensional vectors).
num_filters = 16

# Placeholder to store the images; each 784-dimensional vector is
# reshaped back into a 28x28 single-channel image.
x = tf.placeholder(tf.float32, [None, 784])
x_image = tf.reshape(x, [-1,28,28,1])

# Define the filters as Variables: 16 filters of size 5x5.
W_conv = tf.Variable(tf.truncated_normal([5,5,1,num_filters], stddev=0.1))

# Apply the filters and the pooling layer.
h_conv = tf.nn.conv2d(x_image, W_conv,
                      strides=[1,1,1,1], padding='SAME')
h_pool = tf.nn.max_pool(h_conv, ksize=[1,2,2,1],
                        strides=[1,2,2,1], padding='SAME')
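The notebooks continue by feeding the pooled output into a softmax readout. A minimal sketch of that continuation (the exact layer sizes here are assumptions based on the shapes above):

# After 2x2 max pooling, each 28x28 image becomes 14x14, so the pooled
# output holds 14*14*num_filters values per image. Flatten it and feed
# it into a fully-connected softmax readout over the 10 digit classes.
h_pool_flat = tf.reshape(h_pool, [-1, 14*14*num_filters])
w_out = tf.Variable(tf.zeros([14*14*num_filters, 10]))
b_out = tf.Variable(tf.zeros([10]))
p = tf.nn.softmax(tf.matmul(h_pool_flat, w_out) + b_out)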
51. Exercise
● Run through the notebooks:
○ No.5 Single layer CNN for MNIST
○ No.6 Single layer CNN for MNIST result
(Since the filtered images contain negative pixel values, the background of the images is not necessarily white.)
52. Multi-layer Convolutional Neural Network
● By adding more filter (and pooling) layers, we can build a multi-layer CNN.
○ Filters in different layers are believed to recognize different kinds of features, but the details are still under study.
○ The dropout layer is used to avoid overfitting by randomly cutting a part of the connections during training; see the sketch below.
[Figure: multi-layer CNN pipeline: raw image → convolution filter → pooling layer → convolution filter → pooling layer → fully-connected layer → dropout layer → softmax function]
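A sketch of the dropout layer in TensorFlow 1.x, applied to the output of a hypothetical fully-connected layer h_fc:

# Each unit is kept with probability keep_prob (and scaled by
# 1/keep_prob). Feed keep_prob < 1.0 during training to randomly cut
# connections, and keep_prob = 1.0 during evaluation to disable dropout.
keep_prob = tf.placeholder(tf.float32)
h_fc_drop = tf.nn.dropout(h_fc, keep_prob)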
53. Exercise
● Run through the notebook:
○ No.7 CNN Handwriting Recognizer
[Figure: the images after passing through the second filter layer, and the prediction for a handwritten number]
55. Single Layer Neural Network
● This is an example of a single layer neural network; see the sketch below.
○ Two nodes in the hidden layer transform the value of a linear function with an activation function.
○ There are several choices for the activation function, such as the logistic sigmoid, the hyperbolic tangent, and ReLU. We will use the hyperbolic tangent in the following examples.
[Figure: a network with a hidden layer and an output layer, alongside plots of the logistic sigmoid, hyperbolic tangent, and ReLU activations]
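A minimal TensorFlow 1.x sketch of this network (the variable names are assumptions):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 2])  # input point (x, y)
# Hidden layer: two nodes with the hyperbolic tangent activation.
w1 = tf.Variable(tf.truncated_normal([2, 2]))
b1 = tf.Variable(tf.zeros([2]))
hidden = tf.tanh(tf.matmul(x, w1) + b1)
# Output layer: one node with the logistic sigmoid activation.
w0 = tf.Variable(tf.zeros([2, 1]))
b0 = tf.Variable(tf.zeros([1]))
p = tf.sigmoid(tf.matmul(hidden, w0) + b0)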
56. Single Layer Neural Network
● Since the output of the hyperbolic tangent changes quickly from -1 to 1, the outputs of the hidden layer effectively split the input space into discrete regions with straight lines.
○ In this example, the (x, y) plane is split into 4 regions: ①, ②, ③, ④.
[Figure: two straight lines dividing the plane into the regions ① to ④]
57. Single Layer Neural Network
● Since the logistic sigmoid in the output node can classify the plane with a straight line, this single layer network can classify the 4 regions into two classes as below.
[Figure: the 4 regions mapped to the corners ①②④③ of the hidden-layer output space; three regions are classified as ◯ and one as ✕, separated by a straight line]
58. Limitation of the Single Layer Network
● On the other hand, this neural network cannot classify data in the following pattern.
○ How can you extend the network to cope with this data?
[Figure: an XOR-like pattern where the diagonal regions share a class (◯/✕); the corners ①②④③ cannot be classified with a straight line]
59. Neural Network as Logical Units
● A single node (consisting of a linear function and an activation function) works as a logical unit for AND or OR, as below.
[Figure: single nodes implementing the AND and OR truth tables on binary inputs]
60. Neural Network as Logical Units
● Since the previous pattern is equivalent to XOR, we can combine the AND and OR units to make an XOR unit. As a result, the following "enhanced output node" can classify the previous pattern; see the sketch below.
[Figure: an AND unit and an OR unit combined into an XOR unit that classifies the ◯/✕ pattern]
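A small NumPy sketch of these logical units, using a hard threshold in place of a saturated activation (the weights and biases are illustrative assumptions):

import numpy as np

def unit(x, w, b):
    # A single node: a linear function followed by a hard threshold,
    # mimicking a saturated sigmoid/tanh activation.
    return (np.dot(x, w) + b > 0).astype(int)

x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # all binary inputs
and_out = unit(x, [1, 1], -1.5)  # AND unit: [0 0 0 1]
or_out = unit(x, [1, 1], -0.5)   # OR unit:  [0 1 1 1]
# XOR = OR AND (NOT AND): one more node combines the two units.
xor_out = unit(np.stack([or_out, and_out], axis=1), [1, -1], -0.5)
print(xor_out)  # [0 1 1 0]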
61. Neural Network as Logical Units
● Combining the hidden layer and the "enhanced output unit" results in the following 2-layer neural network.
○ The first hidden layer extracts features as a combination of binary variables, and the second hidden layer plus the output node classify them as an XOR logical unit.
[Figure: 2-layer network: the first hidden layer extracts the features, and the second layer plus the output node classify them with the XOR logical unit]
62. Exercise
● You can see the actual result on the Neural Network Playground.
○ http://goo.gl/VIvOaQ