DEEP LEARNING
Fundamentals and Research Trends
Taichi Kiwaki
Aihara Lab.
Introduction to Machine Learning
Classifying Good/Defective Electrical Wires
• Parameters
• Wire resistance
• Minimum wire thickness
• Keywords
• Supervised learning
• Classification (discrimination) problem
[Figure: thickness vs. resistance scatter plot. Linearly separable!]
Image Classification Problem
• Parameters
• Pixel values
• Dimensionality: thousands to millions (benchmark datasets)
• Characteristics
• Data lie on a low-dimensional manifold in R^N
• Not linearly separable
Sequence Prediction Problem
• Learning a map f : R^{N x T} → R^N
• Representative methods
• N-gram
• Back-off model
• Hidden Markov Models (HMMs)
• Conditional Random Fields (CRFs)
• Example: given "A quick brown fox", predict "jumps over ..." (see the sketch below)
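As a minimal illustration of sequence prediction (my own sketch, not from the slides), the bigram model below estimates P(next word | previous word) by counting and then predicts the most likely continuation; the corpus and names are hypothetical.

from collections import Counter, defaultdict

corpus = "a quick brown fox jumps over the lazy dog".split()

# Count bigram transitions: how often each word follows each previous word.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def predict_next(word):
    # Return the most frequent continuation of `word`, or None if unseen.
    if word not in bigram_counts:
        return None
    return bigram_counts[word].most_common(1)[0][0]

print(predict_next("fox"))  # -> "jumps"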
Outline
• Introduction to machine learning
• Neural Nets and Deep Learning (DL)
• Examples
• Deep Convolutional Networks
• Recurrent Neural Networks
Neural Nets (NNs)
(McCulloch and Pitts, 1943)
(Rumelhart et al., 1986)
(LeCun et al., 1989)
[Figures: error signals propagating backward through an MLP (Rumelhart et al., 1986), and the handwritten-digit network of LeCun et al. (1989): 256 input units, hidden layers H1 (12 x 64 = 768 units), H2 (12 x 16 = 192 units), H3 (30 units), 10 output units, with convolutional layers sharing 5 x 5 kernels; the paper reports 8.1% misclassifications on the test set.]
History of Neural Nets (NNs)
Perceptron (Rosenblatt, 1957)
Backpropagation (Rumelhart et al., 1986)
Boom of neural networks research
Timeline: 1960 → 1990 → 2010; Deep Learning (2006~)
Perceptron
• A classifier for linearly separable problems (Rosenblatt, 1957)
• Linear discriminant model: y = Sign(Wx + b)
• Cannot be used for problems that are not linearly separable
• The XOR affair (Minsky and Papert, 1969)
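A small numpy sketch (my own, under the slide's y = Sign(Wx + b) model) that runs the classic perceptron learning rule on XOR; whatever weights it ends with, the four points are never all classified correctly, illustrating the linear-separability limitation.

import numpy as np

def perceptron_train(X, y, epochs=50):
    # Classic perceptron rule: on each mistake, w <- w + y_i * x_i (bias folded into w).
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            if np.sign(w @ xi) != yi:
                w += yi * xi
    return w

X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([-1, 1, 1, -1])          # XOR labels in {-1, +1}

w = perceptron_train(X_xor, y_xor)
preds = np.sign(np.hstack([X_xor, np.ones((4, 1))]) @ w)
print(preds, y_xor)  # the predictions never match all four labels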
Multilayer Perceptron (MLP)
[Figure: a network with inputs X1 and X2, hidden units H1 and H2, and output unit O (regions P1 and P2 shown alongside).]
Back Propagation (BP, Back-prop)
(Rumelhart et al., 1986)
• A fast way to compute gradient-descent updates for MLPs
• Differentiate the prediction error with respect to the parameters
• Propagate the derivatives from upper layers to lower layers via the chain rule
• Code at http://deeplearning.net/tutorial/mlp.html
[Figure: error signals flowing backward through the network.]
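A minimal numpy sketch of one back-prop update for a tiny sigmoid MLP with squared error (my own illustration of the chain-rule propagation described above, not the tutorial code at the link); shapes assumed are W1: (n_hidden, n_in), W2: (n_out, n_hidden).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W1, b1, W2, b2, lr=0.1):
    # Forward pass.
    h = sigmoid(W1 @ x + b1)                      # hidden activations
    y = sigmoid(W2 @ h + b2)                      # network output
    # Backward pass: the error signal is propagated from the top layer down.
    delta_out = (y - t) * y * (1 - y)             # dE/dz at the output layer
    delta_hid = (W2.T @ delta_out) * h * (1 - h)  # dE/dz at the hidden layer (chain rule)
    # Gradient-descent updates.
    W2 -= lr * np.outer(delta_out, h); b2 -= lr * delta_out
    W1 -= lr * np.outer(delta_hid, x); b1 -= lr * delta_hid
    return 0.5 * np.sum((y - t) ** 2)             # squared prediction error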
Conv Nets
• LeCun et al. (1989)
• An MLP for image recognition
• Weight sharing
• Convolution
• Pooling (see the sketch below)
[Figure from LeCun et al. (1989): the digit-recognition network with 256 input units, hidden layers H1 (12 x 64 = 768 units), H2 (12 x 16 = 192 units), H3 (30 units), 10 output units, and shared 5 x 5 convolution kernels; Figure 3 of the paper plots log MSE and raw error rate versus number of training passes.]
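To make "weight sharing, convolution, pooling" concrete, here is a small numpy sketch of a valid 2-D convolution with one shared kernel followed by non-overlapping max pooling; it is an illustration of the operations, not LeCun et al.'s implementation.

import numpy as np

def conv2d(image, kernel):
    # Valid 2-D convolution: the same kernel (shared weights) is slid over the image.
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    # Non-overlapping max pooling over size x size blocks.
    H, W = fmap.shape
    H2, W2 = H // size, W // size
    return fmap[:H2 * size, :W2 * size].reshape(H2, size, W2, size).max(axis=(1, 3))

feature_map = conv2d(np.random.rand(8, 8), np.random.randn(3, 3))
print(max_pool(feature_map).shape)  # (3, 3)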
Elman Nets
• Elman, 1990
• Learns sequential data (e.g., text streams)
• Feeds the state of the context layer from one time step earlier back into the hidden layer
• Trained with Backpropagation Through Time (BPTT)
• https://github.com/pascanur/trainingRNNs
[Figure 2 of Elman (1990): a simple recurrent network in which activations are copied from the hidden layer to the context layer on a one-for-one basis with a fixed weight of 1.0; dotted lines are trainable connections. Both the input units and the context units activate the hidden units, which feed the output units. Example: given "quick brown", predict "fox".]
(Elman, 1990)
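A numpy sketch (my own, using the usual simple-recurrent-network equations) of the Elman forward pass: the previous hidden state plays the role of the context units and is fed back at the next step; training would use BPTT as noted above.

import numpy as np

def elman_forward(inputs, Wxh, Whh, Why, h0):
    # inputs: list of input vectors, one per time step.
    h = h0
    outputs = []
    for x in inputs:
        h = np.tanh(Wxh @ x + Whh @ h)   # hidden state mixes current input and context (previous h)
        outputs.append(Why @ h)          # prediction for the next element of the sequence
    return outputs, h

n_in, n_hid, n_out = 5, 8, 5
rng = np.random.RandomState(0)
seq = [rng.rand(n_in) for _ in range(3)]          # e.g., "quick", "brown", "fox" as vectors
outs, _ = elman_forward(seq,
                        rng.randn(n_hid, n_in) * 0.1,
                        rng.randn(n_hid, n_hid) * 0.1,
                        rng.randn(n_out, n_hid) * 0.1,
                        np.zeros(n_hid))
print(len(outs), outs[0].shape)                   # 3 predictions of dimension n_out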
Problems with NNs (in the 80-90's)
• Performance gains from adding layers are hard to obtain with back-prop
• It is unclear what is going on inside an NN
• Computation is too heavy
Vanishing Gradient
• Bengio, 1994; Hochreiter et al., 2001
• In deep MLPs the error signal attenuates as it propagates down the layers:
  ∂/∂θ1 (f_N ∘ ··· ∘ f_1) = (∂f_N/∂f_{N−1}) ··· (∂f_1/∂θ1)
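A small numpy experiment (my own illustration of the product-of-Jacobians formula above): with sigmoid layers and modest random weights, the norm of the back-propagated error shrinks rapidly with depth.

import numpy as np

rng = np.random.RandomState(0)
d, depth = 10, 30
grad = np.ones(d)                      # error signal at the top layer
for layer in range(depth):
    W = rng.randn(d, d) * 0.1          # naive small random weights
    a = rng.rand(d)                    # sigmoid activations lie in (0, 1)
    J = W.T @ np.diag(a * (1 - a))     # backward Jacobian: W^T diag(sigma'(z)), with sigma' <= 0.25
    grad = J @ grad
    if layer % 5 == 4:
        print(layer + 1, np.linalg.norm(grad))   # the norm decays geometrically with depth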
Vanishing/Exploding Grads of RNNs
• Bengio, 1994
• If the RNN with its input removed is a stable dynamical system, the gradient vanishes
• If it is unstable (chaotic), the gradient explodes
• With x2 = f(x1), ..., xT = f(x_{T−1}):
  ∂/∂θ1 (f ∘ ··· ∘ f) = (∂f/∂xT) ··· (∂f/∂x2)(∂f/∂θ1)
Problems with NNs (in the 80-90's), and how they were addressed
• Performance gains from adding layers are hard to obtain with back-prop
  → Pretraining / ReL / Initialization
• It is unclear what is going on inside an NN
  → Visualization techniques
• Computation is too heavy
Outline
• Introduction to machine learning
• Neural Nets and Deep Learning (DL)
• Examples
• Deep Convolutional Networks
• Recurrent Neural Networks
Key Persons and Research Institutes
Montréal: Bengio / Toronto: Hinton / New York: Le Cun / Ng, Manning
"Major breakthrough in 2006: the ability to train deep architectures by using layer-wise unsupervised learning, whereas previous purely supervised attempts had failed.
Unsupervised feature learners:
• RBMs
• Auto-encoder variants
• Sparse coding variants"
(from the ICML '12 tutorial by Y. Bengio)
Deep NNs, Deep Belief Nets, & Deep Auto Encoders
• Hinton et al., 2006; Hinton and Salakhutdinov, 2006; Bengio et al., 2007
• Recipe (sketched below)
• Pretrain a network in a layer-wise manner
• Stack the networks
• Finetune (e.g., by BP)
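The recipe can be sketched schematically as follows; `Autoencoder` and `SupervisedHead` are hypothetical interfaces standing in for an RBM or denoising-autoencoder trainer and a stacked network with an output layer, not any particular library.

def greedy_pretrain_and_finetune(X, y, layer_sizes, Autoencoder, SupervisedHead):
    encoders, H = [], X
    for n_hidden in layer_sizes:
        ae = Autoencoder(n_hidden)   # unsupervised learner for one layer (e.g., RBM or denoising AE)
        ae.fit(H)                    # pretrain on the representation produced by the layer below
        encoders.append(ae)
        H = ae.encode(H)             # its hidden activities become the "data" for the next layer
    net = SupervisedHead(encoders)   # stack the pretrained encoders and add an output layer
    net.finetune(X, y)               # finetune the whole network end-to-end, e.g. by back-prop
    return net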
DBNs/DAEs
Table 1 of Hinton et al. (2006): error rates of various learning algorithms on the MNIST digit recognition task.
Permutation-invariant task:
  Our generative model, 784-500-500<->2000<->10                       1.25%
  SVM, degree-9 polynomial kernel                                     1.4%
  Backprop 784-500-300-10, cross-entropy and weight decay             1.51%
  Backprop 784-800-10, cross-entropy and early stopping               1.53%
  Backprop 784-500-150-10, squared error and on-line updates          2.95%
  Nearest neighbor, all 60,000 examples, L3 norm                      2.8%
  Nearest neighbor, all 60,000 examples, L2 norm                      3.1%
  Nearest neighbor, 20,000 examples, L3 norm                          4.0%
  Nearest neighbor, 20,000 examples, L2 norm                          4.4%
Unpermuted images:
  Convolutional net + elastic deformations (backprop)                 0.4%
  Virtual SVM, degree-9 kernel, de-skewed + 2-pixel translations      0.56%
  Shape-context features, hand-coded matching                         0.63%
  LeNet5 + affine transformations (backprop)                          0.8%
  LeNet5 (backprop)                                                   0.95%
[Also shown: the RBM learning rule Δw_ij = ε(<v_i h_j>_data − <v_i h_j>_recon) and Figures 3-4 of Hinton and Salakhutdinov (2006): two-dimensional codes for MNIST digits from the first two principal components vs. a 784-1000-500-250-2 autoencoder, and document-retrieval results.]
(Hinton et al., 2006)
(Hinton and Salakhutdinov, 2006)
Effect of Pretraining
"Effective deep learning became possible through unsupervised pre-training." [Erhan et al., JMLR 2010]
[Figure: purely supervised neural net vs. with unsupervised pre-training (with RBMs and denoising auto-encoders).]
(Erhan et al., 2010)
How does pre-training help learning of deep nets?
• Analysis of deep linear networks by Saxe et al., 2014
• Pre-training initializes the weight matrices to be orthogonal matrices
• The strength of both the error and the feedforward signals is preserved
[Figure 3 of Saxe et al. (2014): dynamics of learning in a three-layer network's representation of seven modes of the input-output correlations; analytical curves closely match simulations of a nonlinear (tanh) network, and the delay in learning due to competitive dynamics is shown as the relative difference between simulated and analytical time of half learning.]
(Saxe et al., 2014)
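A numpy sketch of the orthogonal initialization that this analysis points to (my own illustration, not Saxe et al.'s code): take the Q factor of a random Gaussian matrix so that signal norms are preserved across layers.

import numpy as np

def orthogonal_init(n_out, n_in, rng=np.random):
    # QR of a random Gaussian matrix gives a factor with orthonormal columns.
    A = rng.randn(max(n_out, n_in), min(n_out, n_in))
    Q, _ = np.linalg.qr(A)
    return Q if n_out >= n_in else Q.T    # shape (n_out, n_in), orthonormal rows or columns

W = orthogonal_init(4, 6)
print(np.round(W @ W.T, 6))               # identity for this shape: row norms are preserved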
Deep Conv Net
• Krizhevsky et al. (2012)
• Points
• Rectified Linear Units (ReLU)
• Dropout
• GPGPU
• https://code.google.com/p/cuda-convnet/
From the authors' slides: "Max-pooling layers follow the first, second, and fifth convolutional layers. The number of neurons in each layer is 253440, 186624, 64896, 64896, 43264, 4096, 4096, 1000."
(Krizhevsky et al., 2012)
ILSVRC12 / ILSVRC13
ReLU (Rectified Linear Units)
• ReL(x) = max(0, x)
[Plot: ReL vs. sigmoid activation functions.]
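A quick numpy comparison (my own) of why ReL units help with the vanishing-gradient issue discussed earlier: the sigmoid derivative is at most 0.25 and vanishes for large |z|, while the ReL derivative is exactly 1 on the active side.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
print(np.round(sigmoid(z) * (1.0 - sigmoid(z)), 3))  # sigmoid gradient: <= 0.25, near 0 for large |z|
print((z > 0).astype(float))                         # ReL gradient: 0 for z <= 0, 1 for z > 0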
Dropout/DropConnect
• Dropout (Krizhevsky et al., 2012)
• Randomly select units and temporarily turn them off
• DropConnect (Wan et al., 2013)
• A generalization of dropout to connections
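A numpy sketch of the dropout idea (one common "inverted dropout" formulation, my own illustration rather than the exact scheme in the papers): units are zeroed at random during training and the survivors are rescaled so the expected activation is unchanged.

import numpy as np

def dropout(h, p_drop=0.5, train=True, rng=np.random):
    if not train:
        return h                                            # no units are dropped at test time
    mask = (rng.rand(*h.shape) >= p_drop).astype(h.dtype)   # randomly select units to keep
    return h * mask / (1.0 - p_drop)                        # rescale so E[output] matches the full net

h = np.ones((4, 8))
print(dropout(h))   # roughly half of the units are temporarily turned off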
How does dropout work so well?
• Wager et al., 2013; Baldi and Sadowski, 2013
• Dropout behaves like L2 regularization applied to parameters normalized by the Fisher information
[Figure A.2 of Wager et al. (2013): comparison of two L2 regularizers. Black solid ellipses are level surfaces of the likelihood; blue dashed curves are level surfaces of the regularizer; OPT marks the regularized optimum. Left: the classic spherical regularizer ||β||²_2. Right: a regularizer β' diag(Î) β adapted to the curvature of the likelihood, where Î is the Fisher information matrix; dropout training is argued to be comparable to this second setup.]
(Wager et al., 2013)
Speech/Audio Processing with Deep CNNs
• Zeiler et al. (ICASSP 2013) showed that deep CNNs with ReLU can also be applied to speech data for utterance recognition
• Oord and Dieleman (2013) also used deep CNNs to classify music genre from audio data
Visualization of Features (1)
(Le et al., 2012, "Building high-level features using large-scale unsupervised learning")
[Excerpts shown on the slide: a locally connected (non-convolutional) network with L2 pooling and local contrast normalization is trained without labels. Its best neuron detects faces with 81.7% accuracy, even though no supervisory signals were given (guessing all-negative gives 64.8%; the best neuron of a one-layer network reaches 71%, and the best random linear filter 74%); removing local contrast normalization drops accuracy to 78.5%. Figure 2: activation histograms for faces vs. random images. Figure 3: the top 48 test-set stimuli of the best neuron and its optimal stimulus found by norm-constrained numerical optimization. Figure 4: scale and out-of-plane rotation invariance of the best feature. Figure 6: visualization of the cat-face neuron and the human-body neuron.]
(Le et al., 2012)
Visualization of Features (2)
(Zeiler and Fergus, 2013)
[Figure 1 of Zeiler and Fergus (2013): a deconvnet layer attached to a convnet layer. The deconvnet reconstructs an approximate version of the convnet features from the layer beneath by max-unpooling (using "switches" that record the location of each local max during pooling), rectification, and filtering with the transposed filters F^T. Because the model is trained discriminatively, the reconstructions show which parts of the input image are discriminative; they are not samples from a generative model.]
[Feature visualizations for layers 1-5: for layers 2-5 the top 9 activations of a random subset of feature maps across the validation data are projected down to pixel space with the deconvolutional network; the reconstructions are patterns from the validation set, not samples from the model.]
Outline
• Introduction to machine learning
• Neural Nets and Deep Learning (DL)
• Examples
• Deep Convolutional Networks
• Recurrent Neural Networks
RNNLM
• Mikolov et al., 2010
• They stabilized learning by truncating "explosive" gradient vectors (a sketch of clipping follows below)
Table 1 of Mikolov et al. (2010): performance on the WSJ DEV set as training data increases.
  KN5 LM (200K words)               PPL 336, WER 16.4
  KN5 LM + RNN 90/2 (200K words)    PPL 271, WER 15.4
  KN5 LM (1M words)                 PPL 287, WER 15.1
  KN5 LM + RNN 90/2 (1M words)      PPL 225, WER 14.0
  KN5 LM (6.4M words)               PPL 221, WER 13.5
  KN5 LM + RNN 250/5 (6.4M words)   PPL 156, WER 11.7
[Also shown: Table 2, comparing RNN LM configurations and their combination with the KN5 backoff model on 6.4M training words (the best, a dynamic mixture of three RNNs, reaches PPL 121 and WER 11.1 on WSJ DEV); Table 3, WSJ '92/'93 rescoring results (dynamic 3xRNN + KN5 reaches 10.7 DEV / 16.3 EVAL WER vs. the 12.2/17.2 KN5 baseline trained on 37M words); and Table 4, NIST RT05 results (RNN LMs trained on only 5.4M in-domain words combined with the RT09 LM reach 22.8 WER vs. the 24.1 baseline).]
(on WSJ '92/WSJ '93 and NIST RT05 data sets)
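The "truncation of explosive gradients" can be sketched as follows (my own numpy illustration; the threshold values are placeholders). Element-wise clipping to a range is closest to the truncation described here, while norm-rescaling is another common variant.

import numpy as np

def clip_elementwise(grad, limit=15.0):
    # Truncate each gradient component to the interval [-limit, limit].
    return np.clip(grad, -limit, limit)

def clip_by_norm(grad, threshold=5.0):
    # Rescale the whole gradient vector if its norm exceeds a threshold.
    norm = np.linalg.norm(grad)
    return grad * (threshold / norm) if norm > threshold else grad

g = np.array([0.1, -40.0, 3.0])
print(clip_elementwise(g), clip_by_norm(g))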
Initialization of RNNs
• Sutskever et al. (2013) empirically showed that initialization and momentum critically improve RNN performance
• Echo-state-network (ESN) based initialization (a sketch of the idea follows)
[Figures from Jaeger and Haas (2004): schema of previous RNN-learning approaches vs. the ESN approach, in which a fixed random "reservoir" is driven by the signal and only the readout connections are trained; prediction of the Mackey-Glass chaotic time series with a 1000-neuron reservoir reaches an NRMSE of about 10^-4.2 when predicting 84 steps ahead.]
(Jaeger and Haas, 2004)
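A numpy sketch of an echo-state-network style recurrent initialization (my own illustration of the idea; the sparsity and spectral-radius values are placeholders, not the settings used by Sutskever et al.).

import numpy as np

def esn_style_recurrent_init(n_hidden, spectral_radius=1.1, sparsity=0.15, rng=np.random):
    # Sparse random recurrent matrix, rescaled to a chosen spectral radius.
    W = rng.randn(n_hidden, n_hidden) * (rng.rand(n_hidden, n_hidden) < sparsity)
    radius = np.max(np.abs(np.linalg.eigvals(W)))
    return W * (spectral_radius / radius)

W_rec = esn_style_recurrent_init(100)
print(np.max(np.abs(np.linalg.eigvals(W_rec))))   # ~1.1 by construction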
Regularization of RNNs
Pascanu et al., 2013
• Handle exploding gradients with the same clipping method as Mikolov et al. (2010)
• Handle vanishing gradients by introducing the regularizer Ω below (a sketch of one term follows this slide):
  Ω = Σ_k Ω_k = Σ_k ( ||(∂E/∂x_{k+1}) (∂x_{k+1}/∂x_k)|| / ||∂E/∂x_{k+1}|| − 1 )²   (Eq. 9)
• The regularizer prefers solutions for which the error signal preserves its norm as it travels back in time; for efficiency, only the "immediate" partial derivative of Ω with respect to W_rec is used, with ∂E/∂x_k taken from BPTT.
Table 1 of Pascanu et al. (2013): polyphonic music prediction, negative log-likelihood per time step (lower is better), for SGD / SGD+C (clipping) / SGD+CR (clipping + regularizer).
  Piano-midi.de   train 6.87 / 6.81 / 7.01    test 7.56 / 7.53 / 7.46
  Nottingham      train 3.67 / 3.21 / 3.24    test 3.80 / 3.48 / 3.46
  MuseData        train 8.25 / 6.54 / 6.51    test 7.11 / 7.00 / 6.99
Table 2: next-character prediction on the Penn Treebank, entropy in bits/character (SGD / SGD+C / SGD+CR).
  1 step    train 1.46 / 1.34 / 1.36    test 1.50 / 1.42 / 1.41
  5 steps   train N/A / 3.76 / 3.70     test N/A / 3.89 / 3.74
• On the synthetic temporal-order task, SGD-CR succeeds on sequences of up to 200 steps, while plain SGD and SGD-C fail beyond length 20 because of vanishing gradients.
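A numpy sketch of one term Ω_k of the regularizer in Eq. (9) (my own illustration; in the paper only the "immediate" derivative with respect to W_rec is used during training, and ∂E/∂x_{k+1} comes from BPTT).

import numpy as np

def omega_term(dE_dx_next, J_next_to_k):
    # dE_dx_next: gradient of the error w.r.t. x_{k+1}; J_next_to_k: Jacobian dx_{k+1}/dx_k.
    propagated = dE_dx_next @ J_next_to_k            # error signal after one step back in time
    ratio = np.linalg.norm(propagated) / np.linalg.norm(dE_dx_next)
    return (ratio - 1.0) ** 2                        # penalize any change in norm

print(omega_term(np.ones(4), np.eye(4)))             # 0.0: an identity Jacobian preserves the norm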
Conclusion
• Reviewed the history of NNs and how it led to DL
• Summarized recent DL research trends
• Conv NNs hold the best performance in image recognition
• RNNs hold the best performance in text prediction
• Research on why DL became feasible is making progress
Information Sources
• Conferences
• NIPS
• ICML
• AISTATs
• ICLR
• ICASSP
• Tutorial pages
• http://deeplearning.net/
• http://deeplearning.net/tutorial/contents.html
• Google+
• Many DL researchers post there, led by Yann LeCun and Yoshua Bengio
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 

Talk

• 16-17. Problems with NNs (in the '80s-'90s) • Making networks deeper did not improve performance when trained with back-propagation • It was not clear what was going on inside an NN • The computation was far too heavy
• 18. Vanishing Gradient • Bengio, 1994; Hochreiter et al., 2001 • In a deep MLP the error signal decays as it is propagated back through the layers: $\frac{\partial}{\partial \theta_1}(f_N \circ \cdots \circ f_1) = \frac{\partial f_N}{\partial f_{N-1}} \cdots \frac{\partial f_1}{\partial \theta_1}$, a product of many Jacobians whose norms are typically below one
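To make the decay concrete, here is a small numerical sketch (not from the slides; the depth, width, and the choice of sigmoid units are illustrative assumptions): an error signal back-propagated through many saturating layers is multiplied by one Jacobian per layer, and its norm typically shrinks by several orders of magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

depth, width = 20, 100
Ws = [rng.normal(0, 1.0 / np.sqrt(width), (width, width)) for _ in range(depth)]

# Forward pass, keeping activations for the backward pass
x = rng.normal(size=width)
hs = [x]
for W in Ws:
    hs.append(sigmoid(W @ hs[-1]))

# Back-propagate an error signal of unit norm from the top layer downward
delta = rng.normal(size=width)
delta /= np.linalg.norm(delta)
for W, h in zip(reversed(Ws), reversed(hs[1:])):
    delta = W.T @ (delta * h * (1.0 - h))   # chain rule through one sigmoid layer
    print(f"{np.linalg.norm(delta):.3e}")   # the norm shrinks layer by layer
```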
• 19. Vanishing/Exploding Grads of RNNs • Bengio, 1994 • Consider the input-free RNN $x_2 = f(x_1), \ldots, x_T = f(x_{T-1})$ with gradient $\frac{\partial}{\partial \theta_1}(f \circ \cdots \circ f) = \frac{\partial f}{\partial x_T} \cdots \frac{\partial f}{\partial x_2} \frac{\partial f}{\partial \theta_1}$ • If the RNN is a stable dynamical system, the gradient vanishes • If it is unstable (chaotic), the gradient explodes
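The stable-vs-unstable dichotomy can be seen directly in a toy linear RNN (a sketch with illustrative sizes; real RNNs are nonlinear, but the same spectral-radius intuition applies): the product of per-step Jacobians shrinks toward zero when the recurrent dynamics are contracting and blows up when they are expanding.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 50, 100

def gradient_norm_through_time(spectral_radius):
    """||prod_t dx_{t+1}/dx_t|| for the input-free linear RNN x_{t+1} = W x_t."""
    W = rng.normal(size=(n, n))
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))  # set spectral radius
    J = np.eye(n)
    for _ in range(T):
        J = W @ J            # each time step multiplies in one more Jacobian, here just W
    return np.linalg.norm(J)

for rho in (0.9, 1.1):
    print(rho, gradient_norm_through_time(rho))
# rho < 1 shrinks toward zero (vanishing), rho > 1 blows up (exploding)
```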
• 20-23. Problems with NNs (in the '80s-'90s), revisited with the Deep Learning answers • Deeper networks did not train well with back-prop: pretraining / rectified linear units (ReL) / better initialization • It was not clear what was going on inside an NN: visualization techniques • The computation was far too heavy
• 24. Outline • Introduction to machine learning • Neural Nets and Deep Learning (DL) • Case studies • Deep Convolutional Networks • Recurrent Neural Networks
• 25. Key Persons and Research Institutes • Montréal: Bengio • Toronto: Hinton • New York: LeCun • Ng, Manning • (From the ICML '12 tutorial by Y. Bengio: "Major breakthrough in 2006: the ability to train deep architectures by using layer-wise unsupervised learning, whereas previous purely supervised attempts had failed. Unsupervised feature learners: RBMs, auto-encoder variants, sparse coding variants.")
• 26. Deep NNs, Deep Belief Nets, & Deep Autoencoders • Hinton et al., 2006; Hinton and Salakhutdinov, 2006; Bengio et al., 2007 • Recipe • Pretrain each network layer-wise (unsupervised) • Stack the networks • Finetune the whole stack (e.g., by BP)
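A minimal sketch of that recipe with tied-weight denoising autoencoders in plain NumPy (layer sizes, corruption level, and learning rate are illustrative assumptions; this is not the Theano code from the deeplearning.net tutorials): each layer is trained unsupervised on the codes produced by the layer below, and the learned weights would then initialize an MLP for supervised fine-tuning.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class DenoisingAutoencoder:
    """One layer trained to reconstruct its input from a corrupted copy (tied weights)."""
    def __init__(self, n_in, n_hidden, corruption=0.3, lr=0.1):
        self.W = rng.normal(0, 0.01, size=(n_in, n_hidden))
        self.b = np.zeros(n_hidden)
        self.c = np.zeros(n_in)
        self.corruption, self.lr = corruption, lr

    def encode(self, x):
        return sigmoid(x @ self.W + self.b)

    def train_step(self, x):
        # corrupt, encode, decode with the transposed weights
        mask = rng.random(x.shape) > self.corruption
        h = self.encode(x * mask)
        r = sigmoid(h @ self.W.T + self.c)
        # squared-error reconstruction loss and its gradients
        dr = (r - x) * r * (1 - r)
        dh = (dr @ self.W) * h * (1 - h)
        self.W -= self.lr * ((x * mask).T @ dh + dr.T @ h) / len(x)
        self.b -= self.lr * dh.mean(axis=0)
        self.c -= self.lr * dr.mean(axis=0)

def pretrain_stack(data, layer_sizes, epochs=10):
    """Greedy layer-wise pretraining: train each DA on the codes of the layer below."""
    layers, x = [], data
    for n_hidden in layer_sizes:
        da = DenoisingAutoencoder(x.shape[1], n_hidden)
        for _ in range(epochs):
            da.train_step(x)
        layers.append(da)
        x = da.encode(x)           # codes become the next layer's "data"
    return layers                  # these weights would initialize an MLP for fine-tuning

X = rng.random((256, 64))          # toy unlabeled data
stack = pretrain_stack(X, [32, 16])
```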
• 27. DBNs/DAEs • (Table 1 of Hinton et al., 2006: MNIST test error in the permutation-invariant setting; the generative 784-500-500-2000-10 model reaches 1.25%, vs. 1.4% for an SVM with a degree-9 polynomial kernel and 1.51-2.95% for plain backprop nets) • (Figure from Hinton and Salakhutdinov, 2006: two-dimensional codes for MNIST digits, first two principal components vs. the codes of a 784-1000-500-250-2 deep autoencoder)
• 28. Effect of pretraining • "Effective deep learning became possible through unsupervised pre-training" (Erhan et al., JMLR 2010) • (Figure from Erhan et al., 2010: test error of a purely supervised neural net vs. one with unsupervised pre-training using RBMs and denoising auto-encoders)
• 29. How does pre-training help learning of deep nets? • Analysis of deep linear networks by Saxe et al., 2014 • Pre-training initializes the weight matrices to (near-)orthogonal matrices • Orthogonal weights preserve the strength of both the feedforward and the error signals across layers • (Figure 3 of Saxe et al., 2014: analytical vs. simulated learning dynamics of a three-layer network)
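A small numerical illustration of the orthogonality point (an illustrative sketch, not the analysis of Saxe et al.): the end-to-end linear map of a deep stack of generic random matrices has wildly spread singular values, so most directions of the forward and error signals are crushed or inflated, whereas a stack of orthogonal matrices preserves every direction exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 200, 30

# End-to-end linear map of a deep stack, for two different initializations
gaussian   = [rng.normal(0, 1.0 / np.sqrt(n), (n, n)) for _ in range(depth)]
orthogonal = [np.linalg.qr(rng.normal(size=(n, n)))[0] for _ in range(depth)]

def singular_spread(weights):
    product = np.linalg.multi_dot(weights)        # what a forward/backward pass multiplies by
    s = np.linalg.svd(product, compute_uv=False)
    return s.max(), s.min()

print("Gaussian  :", singular_spread(gaussian))    # huge max/min ratio: signals get distorted
print("Orthogonal:", singular_spread(orthogonal))  # all singular values 1 (up to rounding)
```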
• 30. Deep Conv Net • Krizhevsky et al. (2012) • Points • Rectified Linear Units (ReLU) • Dropout • GPGPU • https://code.google.com/p/cuda-convnet/ • (From the authors' slides: max-pooling layers follow the first, second, and fifth convolutional layers; the number of neurons per layer is 253440, 186624, 64896, 64896, 43264, 4096, 4096, 1000)
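For concreteness, a toy forward pass through the three ingredients such networks stack many times: convolution with a shared kernel, ReLU, and max-pooling. Sizes and weights are illustrative and this is unrelated to the cuda-convnet code linked above.

```python
import numpy as np

def conv2d(x, k):
    """Valid 2-D convolution (really cross-correlation, as in most deep-learning code)."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def relu(x):
    return np.maximum(0.0, x)

def max_pool(x, size=2):
    H, W = x.shape
    H2, W2 = H // size, W // size
    return x[:H2 * size, :W2 * size].reshape(H2, size, W2, size).max(axis=(1, 3))

image = np.random.rand(28, 28)          # toy grey-scale input
kernel = np.random.randn(5, 5) * 0.1    # one 5x5 filter, shared across the image (random here)
feature_map = max_pool(relu(conv2d(image, kernel)))   # conv -> ReLU -> pool
print(feature_map.shape)                # (12, 12)
```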
• 32. ReLU (Rectified Linear Units) • ReL(x) = max(0, x) • (Figure: the ReL and sigmoid activation functions)
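A tiny comparison of the two derivatives (illustrative only): the sigmoid gradient is at most 0.25 and collapses in the saturated tails, while the ReLU gradient is exactly 1 for every active unit, which is one reason deep rectifier nets train better with plain back-propagation.

```python
import numpy as np

x = np.linspace(-6, 6, 7)
sigmoid = 1 / (1 + np.exp(-x))
d_sigmoid = sigmoid * (1 - sigmoid)      # <= 0.25, nearly 0 in the saturated tails
d_relu = (x > 0).astype(float)           # 1 for every active unit, 0 otherwise
print(np.round(d_sigmoid, 3))
print(d_relu)
```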
• 33-36. Dropout/DropConnect • Dropout (Krizhevsky et al., 2012): randomly select units and temporarily turn them off during training • DropConnect (Wan et al., 2013): a generalization of dropout that drops connections instead of units • (Figures: the same network with different random subsets of units turned off at each step)
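A short sketch of both ideas. The "inverted" rescaling at training time is a common implementation convenience and an assumption here; the original formulation instead scales the weights at test time.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, p_drop=0.5, train=True):
    """Inverted dropout: zero a random subset of units, rescale the survivors."""
    if not train:
        return h                                   # no dropping (and no rescaling) at test time
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)

def dropconnect_forward(x, W, p_drop=0.5):
    """DropConnect: drop individual connections (weights) instead of whole units."""
    mask = rng.random(W.shape) >= p_drop
    return x @ (W * mask) / (1.0 - p_drop)

h = rng.normal(size=(4, 8))                        # a toy batch of hidden activations
W = rng.normal(size=(8, 3))
print(dropout_forward(h).shape, dropconnect_forward(h, W).shape)
```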
• 37. How does dropout work so well? • Wager et al., 2013; Baldi and Sadowski, 2013 • Dropout acts as an L2 regularizer on the parameters, but one whose penalty is normalized by the Fisher information, i.e. adapted to the curvature of the likelihood • (Figure A.2 of Wager et al., 2013: level surfaces of a classical spherical L2 penalty vs. an L2 penalty scaled by diag(I), the diagonal of the Fisher information matrix, which aligns the regularizer with the shape of the likelihood)
• 38. Speech/Audio Processing with Deep CNNs • Zeiler et al. (ICASSP 2013) showed that deep CNNs with ReLU can also be applied to speech data for utterance recognition • Oord and Dieleman (2013) likewise used deep CNNs to classify music categories from audio data
• 39. Visualization of Features (1) • Le et al. (2012): a deep network trained only on unlabeled images learns a neuron that acts as a face detector (81.7% detection accuracy, vs. 64.8% for always answering "no face"), plus neurons for cat faces and human bodies • Two visualization techniques verify that the optimal stimulus is indeed a face: (1) show the most responsive stimuli from the test set; (2) numerically optimize the input, $x^* = \arg\max_x f(x; W, H)$ subject to $\lVert x \rVert_2 = 1$, by projected gradient with line search • (Figures: activation histograms for faces vs. random images; the top-48 stimuli and the numerically optimized optimal stimulus; invariance curves for scale, translation, and out-of-plane 3-D rotation)
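A hedged sketch of the second visualization technique, activation maximization by projected gradient ascent on the input. The random two-layer "neuron" below is a stand-in for the trained network of Le et al., and the step size and iteration count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "neuron": a fixed random two-layer feature detector f(x) = v . tanh(W x)
W = rng.normal(size=(64, 256))
v = rng.normal(size=64)

def f(x):
    return v @ np.tanh(W @ x)

def grad_f(x):
    h = np.tanh(W @ x)
    return W.T @ (v * (1.0 - h**2))

# Projected gradient ascent: maximize f(x) subject to ||x||_2 = 1
x = rng.normal(size=256)
x /= np.linalg.norm(x)
for _ in range(200):
    x = x + 0.1 * grad_f(x)
    x /= np.linalg.norm(x)          # project back onto the unit sphere
print("activation of the optimal stimulus:", f(x))
```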
• 40. Visualization of Features (2) • Zeiler and Fergus (2013): a deconvnet attached to each convnet layer projects feature activations back to pixel space by unpooling (using recorded max locations, the "switches"), rectification, and filtering with the transposed convolutional filters • Because the model is trained discriminatively, the projections implicitly show which parts of the input image are discriminative; they are reconstructed patterns, not samples from a generative model • (Figure 1 of Zeiler and Fergus, 2013: a deconvnet layer attached to a convnet layer, and the unpooling operation using switches)
• 41. (Figure: deconvnet visualizations of the top-9 activations for layers 1-3 of a trained convnet, from Zeiler and Fergus, 2013)
• 42. (Figure: the same visualization for layers 4 and 5; per the original caption, for layers 2-5 the top 9 activations in a random subset of feature maps across the validation data are projected down to pixel space with the deconvolutional network; the reconstructions are not samples from the model but patterns from the validation set that cause high activations)
• 43. Outline • Introduction to machine learning • Neural Nets and Deep Learning (DL) • Case studies • Deep Convolutional Networks • Recurrent Neural Networks
• 44. RNNLM • Mikolov et al., 2010 • Learning was stabilized by truncating "exploding" gradient vectors • (Tables from Mikolov et al.: on WSJ, interpolating an RNN LM with the KN5 backoff model steadily lowers perplexity and WER as training data grows, e.g. 13.5% to 11.7% WER with 6.4M training words, and mixing static and dynamic RNN LMs is best, 10.7% DEV WER vs. the 12.2% KN5 baseline; on NIST RT05, RNN LMs trained on 5.4M in-domain words combined with the RT09 LM reduce WER from 24.1% to 22.8%)
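A minimal sketch of that gradient treatment (the threshold is arbitrary; element-wise truncation is close in spirit to Mikolov's implementation, while rescaling by the norm is the variant later popularized by Pascanu et al.):

```python
import numpy as np

def clip_gradient(grad, threshold=15.0, by_norm=False):
    """Keep back-propagated gradients bounded so one 'exploding' step cannot
    destroy the parameters."""
    if by_norm:
        # rescale the whole vector if its norm exceeds the threshold
        norm = np.linalg.norm(grad)
        return grad * (threshold / norm) if norm > threshold else grad
    # element-wise truncation into [-threshold, threshold]
    return np.clip(grad, -threshold, threshold)

g = np.array([0.3, -120.0, 4.0])
print(clip_gradient(g))                 # components truncated to [-15, 15]
print(clip_gradient(g, by_norm=True))   # same direction, norm rescaled to 15
```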
• 45. Initialization of RNNs • Sutskever et al. (2013) empirically showed that careful initialization and momentum critically improve RNN performance • Echo state network (ESN) based initialization is one such scheme • (Figure from Jaeger and Haas, 2004: in the ESN approach a large fixed random "reservoir" of recurrent units, driven through output-feedback connections, produces "echo functions" of the teacher signal, and only the output connections are trained)
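What ESN-style initialization usually amounts to in practice is sketched below (the sparsity level and target spectral radius are illustrative defaults, not values from the cited papers): draw a sparse random recurrent matrix and rescale it so its spectral radius sits just below 1, keeping the recurrent dynamics near the edge of stability.

```python
import numpy as np

def esn_style_recurrent_init(n_hidden, spectral_radius=0.95, sparsity=0.9, seed=0):
    """Sparse random recurrent weights rescaled to a chosen spectral radius."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_hidden, n_hidden))
    W[rng.random((n_hidden, n_hidden)) < sparsity] = 0.0   # keep only ~10% of connections
    radius = np.max(np.abs(np.linalg.eigvals(W)))
    return W * (spectral_radius / radius)

W_rec = esn_style_recurrent_init(100)
print(np.max(np.abs(np.linalg.eigvals(W_rec))))   # ~0.95 by construction
```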
• 46. Regularization of RNNs • Pascanu et al., 2013 • Exploding gradients are handled with the same clipping strategy as Mikolov et al., 2010 • Vanishing gradients are handled by adding a regularizer that prefers solutions in which the error signal preserves its norm as it travels back in time: $\Omega = \sum_k \left( \frac{\left\lVert \frac{\partial E}{\partial x_{k+1}} \frac{\partial x_{k+1}}{\partial x_k} \right\rVert}{\left\lVert \frac{\partial E}{\partial x_{k+1}} \right\rVert} - 1 \right)^2$ • (Tables from Pascanu et al., 2013: on polyphonic music prediction and character-level Penn Treebank, SGD with clipping (SGD+C) and with clipping plus the regularizer (SGD+CR) improve over plain SGD, and SGD-CR solves the pathological temporal-order task for sequences up to 200 steps)
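To unpack the formula, the sketch below evaluates Omega for a toy input-free tanh RNN with a quadratic loss on the final state (sizes and the loss are illustrative; in the paper the regularizer's gradient is obtained with Theano, which is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 20, 30
W = rng.normal(0, 1.0 / np.sqrt(n), (n, n))

# Roll out an input-free tanh RNN and a toy quadratic loss on the last state
xs = [rng.normal(size=n)]
for _ in range(T):
    xs.append(np.tanh(W @ xs[-1]))
target = rng.normal(size=n)
delta = xs[-1] - target            # dE/dx_T for E = 0.5 ||x_T - target||^2

# Pascanu et al.'s regularizer penalizes any step where the back-propagated
# error changes norm: (||dE/dx_{k+1} * dx_{k+1}/dx_k|| / ||dE/dx_{k+1}|| - 1)^2
omega = 0.0
for k in range(T - 1, -1, -1):
    J = np.diag(1.0 - xs[k + 1] ** 2) @ W         # dx_{k+1}/dx_k for x_{k+1} = tanh(W x_k)
    propagated = delta @ J                        # dE/dx_k
    omega += (np.linalg.norm(propagated) / np.linalg.norm(delta) - 1.0) ** 2
    delta = propagated
print("Omega =", omega)
```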
• 47. Conclusion • Summarized the history of NNs and how it led to DL • Summarized the latest DL research trends • Conv NNs hold the best performance in image recognition • RNNs hold the best performance in text (sequence) prediction • Research on why DL became possible is making progress
• 48. Resources • Conferences • NIPS • ICML • AISTATS • ICLR • ICASSP • Tutorial pages • http://deeplearning.net/ • http://deeplearning.net/tutorial/contents.html • Google+ • Many DL researchers post there, led by Yann LeCun and Yoshua Bengio