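5. Sequence Prediction Problem
• Learning a function $f : \mathbb{R}^{N \times T} \to \mathbb{R}^{N}$
• Representative methods
• N-gram
• Back-off model
• Hidden Markov Models (HMMs)
• Conditional Random Fields (CRFs)
• Example: given "A quick brown fox", predict "jumps over ..."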
Thursday, June 12, 14
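As a rough illustration of the N-gram approach listed on slide 5, the sketch below predicts the next word from bigram counts. The function names (`train_bigram`, `predict_next`) and the toy corpus are assumptions made for this example only; no smoothing or back-off is included.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev):
    """Return the most frequent successor of `prev`, or None if `prev` is unseen."""
    if prev not in counts:
        return None
    return counts[prev].most_common(1)[0][0]

# Toy corpus (hypothetical); a real model would be trained on far more text.
corpus = "a quick brown fox jumps over the lazy dog".split()
model = train_bigram(corpus)
next_word = predict_next(model, "fox")
```

A back-off model would extend `predict_next` to fall back to shorter contexts when the longer context has not been observed.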
7. Neural Nets (NNs)
(McCulloch and Pitts, 1943)
(Rumelhart et al., 1986)
(LeCun et al., 1989)
[Figure: network diagram labeled "ERROR SIGNALS"]
[Figure: network architecture from LeCun et al. (1989), p. 548: 256 input units; layer H1, 12 x 64 = 768 hidden units; layer H2, 12 x 16 = 192 hidden units; layer H3, 30 hidden units (fully connected); 10 output units (fully connected)]
Excerpt (LeCun et al., 1989):
Figure 3: Log mean squared error (MSE) (top) and raw error rate (bottom) versus number of training passes.
... training set, 8.1% misclassifications on the test set, and 19.4% rejections for 1% error rate on the remaining test patterns. A full comparative study will be described in another paper.
5.1 Comparison with Other Work. The first several stages of processing in our previous system (described in Denker et al. 1989) involved convolutions in which the coefficients had been laboriously hand designed. In the present system, the first two layers of the network are constrained to be convolutional, but the system automatically learns the ...
8. History of Neural Nets (NNs)
• Perceptron (Rosenblatt, 1957)
• Backpropagation (Rumelhart et al., 1986)
• Boom of neural networks research
• Deep Learning (2006~)
[Timeline axis: 1960, 1990, 2010]
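13. Back Propagation
(BP, Back-prop)
(Rumelhart et al., 1986)
• A fast way to compute the gradients needed for gradient descent in MLPs
• Differentiate the prediction error with respect to the parameters
• Propagate the derivatives from the upper layers down to the lower layers using the chain rule
• Code at http://deeplearning.net/tutorial/mlp.html
[Figure labels: ERROR SIGNALS]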
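The bullets on slide 13 can be made concrete with a minimal sketch of backpropagation for a one-hidden-layer MLP. The architecture (sigmoid hidden units, a single linear output, squared error) and all names are assumptions chosen for brevity; this is not the code from the linked tutorial.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, w2):
    """x: inputs; w1: one weight row per hidden unit; w2: hidden-to-output weights."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    y = sum(w * hi for w, hi in zip(w2, h))
    return h, y

def loss(x, t, w1, w2):
    """Squared prediction error for target t."""
    _, y = forward(x, w1, w2)
    return 0.5 * (y - t) ** 2

def backprop(x, t, w1, w2):
    """Gradients of the squared error w.r.t. w1 and w2, obtained with the chain rule."""
    h, y = forward(x, w1, w2)
    delta_out = y - t                                   # dE/dy at the output
    grad_w2 = [delta_out * hi for hi in h]
    # Error signal passed down to the hidden layer: dE/dy * w2_j * h_j * (1 - h_j)
    delta_hid = [delta_out * w * hi * (1.0 - hi) for w, hi in zip(w2, h)]
    grad_w1 = [[d * xi for xi in x] for d in delta_hid]
    return grad_w1, grad_w2

# Hypothetical toy values.
x, t = [0.5, -1.0], 1.0
w1 = [[0.1, -0.2], [0.4, 0.3]]
w2 = [0.7, -0.5]
grad_w1, grad_w2 = backprop(x, t, w1, w2)
```

A gradient-descent step would then subtract a small multiple of `grad_w1` and `grad_w2` from the weights. Comparing the result with finite differences of `loss` is a quick way to check the derivation.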
14. Conv Nets
• LeCun et al. (1989)
• MLP for image recognition
• Weight sharing
• Convolution
• Pooling
[Figure: LeCun et al. (1989) network architecture, from 256 input units through layers H1, H2, and H3 to 10 output units]
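To illustrate the weight sharing, convolution, and pooling bullets on slide 14, here is a small one-dimensional sketch. The 1D setting, the function names, and the toy signal are assumptions for illustration; LeCun et al. (1989) work on 2D images with learned kernels.

```python
def conv1d(signal, kernel):
    """Valid cross-correlation: the same kernel weights are reused at every position."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def max_pool1d(features, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(features[i:i + size])
            for i in range(0, len(features) - size + 1, size)]

# Hypothetical input and a hand-picked kernel that responds to rising edges.
signal = [0, 0, 1, 1, 0, 0, 1, 1, 0]
edge_kernel = [-1, 1]
feature_map = conv1d(signal, edge_kernel)
pooled = max_pool1d(feature_map, 2)
```

Because one kernel is shared across positions, the layer has only `len(kernel)` weights regardless of input length, and pooling makes the response tolerant to small shifts.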
15. Elman Nets
• Elman, 1990
• Learns sequential data (e.g., text streams)
• Feeds back the state of the context layer from one time step earlier
• BP Through Time (BPTT)
• https://github.com/pascanur/trainingRNNs
Excerpt (Elman, 1990, page 4):
This approach can be modified in the following way. Suppose a network (shown in Figure 2) is augmented at the input level by additional units; call these Context Units. These units are also “hidden” in the sense that they interact exclusively with other nodes internal to the network, and not the outside world.
Imagine that there is a sequential input to be processed, and some clock which regulates presentation of the input to the network. Processing would then consist of the following sequence of events. At time t, the input units receive the first input in the sequence. Each input might be a single scalar value or a vector, depending on the nature of the problem. The context units are initially set to 0.5. (Footnote 2: The activation function used here bounds values between 0.0 and 1.0.) Both the input units and context units activate the hidden units; and then the hidden units feed forward to ...
Figure 2. A simple recurrent network in which activations are copied from hidden layer to context layer on a one-for-one basis, with fixed weight of 1.0. Dotted lines represent trainable connections. Not all connections are shown.
[Figure labels: OUTPUT UNITS; HIDDEN UNITS; INPUT UNITS; CONTEXT UNITS. Example inputs "quick", "brown", "fox" lead to a prediction.]
(Elman, 1990)
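The forward pass described in the Elman excerpt can be sketched as follows. The sigmoid activation matches the footnote's 0.0 to 1.0 range, but the sigmoid output layer, the weight values, and the function names are assumptions for illustration; training (BPTT) is not shown.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def elman_step(x, context, w_in, w_ctx, w_out):
    """One time step: input and context units drive the hidden units, which drive the output."""
    hidden = [sigmoid(sum(a * b for a, b in zip(w_in[j], x)) +
                      sum(a * b for a, b in zip(w_ctx[j], context)))
              for j in range(len(w_in))]
    output = [sigmoid(sum(a * b for a, b in zip(row, hidden))) for row in w_out]
    # Hidden activations are copied one-for-one (fixed weight 1.0) into the context units.
    return output, hidden[:]

def run_sequence(xs, w_in, w_ctx, w_out):
    context = [0.5] * len(w_in)   # context units start at 0.5, as in the excerpt
    outputs = []
    for x in xs:
        y, context = elman_step(x, context, w_in, w_ctx, w_out)
        outputs.append(y)
    return outputs

# Hypothetical weights: 2 inputs, 2 hidden/context units, 1 output.
w_in = [[0.5, -0.3], [0.2, 0.8]]
w_ctx = [[0.1, 0.4], [-0.6, 0.3]]
w_out = [[1.0, -1.0]]
same_input = [[1.0, 0.0], [1.0, 0.0]]
outputs = run_sequence(same_input, w_in, w_ctx, w_out)
```

Feeding the same input twice gives two different outputs, because the context units carry the previous hidden state forward.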
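16. Problems with NNs (in the 80s-90s)
• Even when networks are made deeper, performance with back-prop hardly improves
• It is unclear what is happening inside an NN
• Computation is too heavy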
25. Key Persons and Research Institutes
• Montréal: Bengio
• Toronto: Hinton
• New York: Le Cun
• Ng, Manning
Major Breakthrough in 2006 (from ICML ’12 Tutorial by Y. Bengio)
• Ability to train deep architectures by using layer-wise unsupervised learning, whereas previous purely supervised attempts had failed
• Unsupervised feature learners:
• RBMs
• Auto-encoder variants
• Sparse coding variants
26. Deep NNs, Deep Belief Nets,
& Deep Auto Encoders
• Hinton et al., 2006; Hinton and Salakhutdinov, 2006; Bengio et al., 2007
• Recipe
• Pretrain a network in a layer-wise manner
• Stack networks
• Fine-tune (e.g., by BP)
27. DBNs/DAEs
Excerpt (G. Hinton, S. Osindero, and Y.-W. Teh, p. 1544):
Table 1: Error rates of Various Learning Algorithms on the MNIST Digit Recognition Task.
Version of MNIST Task | Learning Algorithm | Test Error %
Permutation invariant | Our generative model: 784 → 500 → 500 ↔ 2000 ↔ 10 | 1.25
Permutation invariant | Support vector machine: degree 9 polynomial kernel | 1.4
Permutation invariant | Backprop: 784 → 500 → 300 → 10, cross-entropy and weight-decay | 1.51
Permutation invariant | Backprop: 784 → 800 → 10, cross-entropy and early stopping | 1.53
Permutation invariant | Backprop: 784 → 500 → 150 → 10, squared error and on-line updates | 2.95
Permutation invariant | Nearest neighbor: all 60,000 examples and L3 norm | 2.8
Permutation invariant | Nearest neighbor: all 60,000 examples and L2 norm | 3.1
Permutation invariant | Nearest neighbor: 20,000 examples and L3 norm | 4.0
Permutation invariant | Nearest neighbor: 20,000 examples and L2 norm | 4.4
Unpermuted images; extra data from elastic deformations | Backprop: cross-entropy and early-stopping convolutional neural net | 0.4
Unpermuted de-skewed images; extra data from 2 pixel translations | Virtual SVM: degree 9 polynomial kernel | 0.56
Unpermuted images | Shape-context features: hand-coded matching | 0.63
Unpermuted images; extra data from affine transformations | Backprop in LeNet5: convolutional neural net | 0.8
Unpermuted images | Backprop in LeNet5: convolutional neural net | 0.95
Excerpt (Hinton and Salakhutdinov, 2006):
... adjusting the weights and biases to lower the energy of that image and to raise the energy of similar, "confabulated" images that the network would prefer to the real data. Given a training image, the binary state $h_j$ of each feature detector $j$ is set to 1 with probability $\sigma(b_j + \sum_i v_i w_{ij})$, where $\sigma(x)$ is the logistic function $1/[1 + \exp(-x)]$, $b_j$ is the bias of $j$, $v_i$ is the state of pixel $i$, and $w_{ij}$ is the weight between $i$ and $j$. Once binary states have been chosen for the hidden units, a "confabulation" is produced by setting each $v_i$ to 1 with probability $\sigma(b_i + \sum_j h_j w_{ij})$, where $b_i$ is the bias of $i$. The states of the hidden units are then updated once more so that they represent features of the confabulation. The change in a weight is given by
$$\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}} \right) \tag{2}$$
where $\varepsilon$ is a learning rate, $\langle v_i h_j \rangle_{\text{data}}$ is the fraction of times that the pixel $i$ and feature detector $j$ are on together when the feature detectors are being driven by data, and $\langle v_i h_j \rangle_{\text{recon}}$ is the corresponding fraction for confabulations. A simplified version of the same learning rule is used for the biases. The learning works well even though it is not exactly following the gradient of the log probability of the training data (6).
A single layer of binary features is not the best way to model the structure in a set of images. After learning one layer of feature detectors, we can treat their activities (when they are being driven by the data) as data for learning a second layer of features. The first layer of feature detectors then become the visible units for learning the next RBM. This layer-by-layer learning can be repeated as many ...
Fig. 3. (A) The two-dimensional codes for 500 digits of each class produced by taking the first two principal components of all 60,000 training images. (B) The two-dimensional codes found by a 784-1000-500-250-2 autoencoder. For an alternative visualization, see (8).
Fig. 4. (A) The fraction of retrieved documents in the same class as the query when ...
(Hinton et al., 2006)
(Hinton and Salakhutdinov, 2006)
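The weight update in the Hinton and Salakhutdinov excerpt (data-driven statistics minus reconstruction statistics) can be sketched for a single training vector as below. Estimating the two averages from one sample, omitting the bias updates, and the names `cd1_update` and `sample` are all simplifying assumptions for illustration, not the authors' implementation.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sample(p, rng):
    """Draw a binary state that is 1 with probability p."""
    return 1 if rng.random() < p else 0

def cd1_update(v, w, b_vis, b_hid, lr, rng):
    """One update on a single binary vector v; w[i][j] is modified in place."""
    n_vis, n_hid = len(b_vis), len(b_hid)
    # Hidden states driven by the data.
    h = [sample(sigmoid(b_hid[j] + sum(v[i] * w[i][j] for i in range(n_vis))), rng)
         for j in range(n_hid)]
    # "Confabulation": visible states reconstructed from the hidden states.
    v_rec = [sample(sigmoid(b_vis[i] + sum(h[j] * w[i][j] for j in range(n_hid))), rng)
             for i in range(n_vis)]
    # Hidden states updated once more, now driven by the confabulation.
    h_rec = [sample(sigmoid(b_hid[j] + sum(v_rec[i] * w[i][j] for i in range(n_vis))), rng)
             for j in range(n_hid)]
    # Single-sample version of: delta w_ij = lr * (<v_i h_j>_data - <v_i h_j>_recon)
    for i in range(n_vis):
        for j in range(n_hid):
            w[i][j] += lr * (v[i] * h[j] - v_rec[i] * h_rec[j])
    return v_rec

rng = random.Random(0)
weights = [[0.0, 0.0] for _ in range(3)]   # 3 visible units, 2 hidden units (toy sizes)
reconstruction = cd1_update([1, 0, 1], weights, [0.0] * 3, [0.0] * 2, 0.1, rng)
```

In the layer-wise recipe, the hidden activities of a trained layer would serve as the visible data for the next layer's updates.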
28. Effect of pretraining
Effective deep learning became possible through unsupervised pre-training
[Figure panels: "Purely supervised neural net" vs. "With unsupervised pre-training (with RBMs and Denoising Auto-Encoders)"]
(Erhan et al., JMLR 2010)
29. How does pre-training help
learning of deep nets?
• Analysis on deep linear
networks performed by
Saxe et al., 2014
• Pre-training initializes the
weight matrices to be
orthogonal matrices
• The strength of both the error signals and the feedforward signals is preserved
[Figure 3 of Saxe et al. (2014); the caption is clipped on the right in the slide. Left panel: dynamics of learning in a three-layer network, mode strength versus t (Epochs), with analytical curves from Eqn. 12 in red. Right panel: delay in learning due to competitive dynamics, shown as the relative difference (t_half − t_analy)/t_analy between the simulated and the analytical time of half learning.]
(Saxe et al., 2014)
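The claim that orthogonal weight matrices preserve signal strength in a deep linear network can be checked with a toy sketch. Using 2x2 rotation matrices as the orthogonal layers and a 0.9 scaling for the comparison are assumptions for illustration; they are not taken from Saxe et al.

```python
import math

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def rotation(theta):
    """A 2x2 orthogonal matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def propagate(layers, v):
    """Pass v through a stack of linear layers."""
    for m in layers:
        v = matvec(m, v)
    return v

depth = 50
orthogonal_layers = [rotation(0.3 * (k + 1)) for k in range(depth)]
# Same layers, each scaled by 0.9, so every layer shrinks the signal slightly.
shrunk_layers = [[[0.9 * a for a in row] for row in m] for m in orthogonal_layers]
x = [1.0, 2.0]
norm_orthogonal = norm(propagate(orthogonal_layers, x))
norm_shrunk = norm(propagate(shrunk_layers, x))
```

After 50 layers the orthogonal stack leaves the norm of `x` unchanged, while the scaled stack reduces it by a factor of 0.9^50. The same argument applies to error signals sent backward through the transposed matrices.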
30. Deep Conv Net
• Krizhevsky et al. (2012)
• Points
• Rectified Linear Units (ReLU)
• Dropout
• GPGPU
• https://code.google.com/p/cuda-convnet/
Our model
● Max-pooling layers follow first, second, and fifth convolutional layers
● The number of neurons in each layer is given by 253440, 186624, 64896, 64896, 43264, 4096, 4096, 1000
(Krizhevsky et al., 2012)
32. ReLU (Rectified Linear Units)
• ReL(x) = max(0, x)
[Figure: activation curves, ReL vs. Sigmoid]
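A short sketch comparing ReL(x) = max(0, x) with the sigmoid, including their derivatives. The function names and the convention of a zero derivative at x = 0 are assumptions for this example.

```python
import math

def relu(x):
    """ReL(x) = max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Derivative of ReL; the value at x == 0 is set to 0 by convention here."""
    return 1.0 if x > 0 else 0.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)
```

For large positive inputs the sigmoid's derivative shrinks toward zero while the ReL derivative stays at 1, which is one commonly cited reason ReLUs are easier to train in deep networks.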
33. Dropout/DropConnect
• Dropout (Krizhevsky et al., 2012)
• Randomly select units and temporarily turn off these units
• DropConnect (Wan et al., 2013)
• Generalization of dropout on connections
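The dropout bullet can be sketched as a random mask over a layer's activations. The "inverted" rescaling of the surviving units by 1/(1 − p) is an assumption made here so that nothing changes at test time; the original formulation instead rescales weights at test time. The function name is hypothetical.

```python
import random

def dropout(activations, p_drop, rng, train=True):
    """Zero each unit with probability p_drop and rescale the survivors (inverted dropout)."""
    if not train or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
hidden = [1.0] * 1000                       # toy activations
dropped = dropout(hidden, 0.5, rng)         # training: about half the units are turned off
unchanged = dropout([1.0, 2.0], 0.5, rng, train=False)
```

DropConnect would apply the same kind of random mask to individual weights rather than to whole units.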
37. How does dropout work so well?
• Wager et al., 2013; Baldi and Sadowski, 2013
• Dropout is L2-regularization over parameters normalized by the Fisher information
[Figure panels: Classical L2 (left), Dropout (right)]
Figure A.2: Comparison of two L2 regularizers. In both cases, the black solid ellipses are level surfaces of the likelihood and the blue dashed curves are level surfaces of the regularizer; the optimum of the regularized objective is denoted by OPT. The left panel shows a classic spherical L2 regularizer $\|\beta\|_2^2$, whereas the right panel has an L2 regularizer $\beta^\top \operatorname{diag}(\mathcal{I})\,\beta$ that has been adapted to the shape of the likelihood ($\mathcal{I}$ is the Fisher information matrix). The second regularizer is still aligned with the axes, but the relative importance of each axis is now scaled using the curvature of the likelihood function. As argued in (11), dropout training is comparable to the setup depicted in the right panel.
(Wager et al., 2013)
38. Speech/Audio Processing with
Deep CNNs
• Zeiler et al. (ICASSP 2013) showed that deep CNNs with ReLU can also be applied to speech data for utterance recognition
• Oord and Dieleman (2013) also used deep CNNs to classify music categories from audio data
39. Visualization of Features (1)
Excerpt ("Building high-level features using large-scale unsupervised learning"; the page images are clipped at the edges in the slide, so truncated specifics are marked with "…"):
... the cortex. They also demonstrate that convolutional DBNs (Lee et al., 2009), trained on aligned images of faces, can learn a face detector. This result is interesting, but unfortunately requires a certain degree of supervision during dataset construction: their training images (i.e., Caltech 101 images) are aligned, homogeneous and belong to one selected category.
Figure 1. The architecture and parameters in one layer of our network. The overall network replicates this structure three times. For simplicity, the images are in 1D.
3.2. Architecture
... logical and computational models (Pinto et al., 200…; Lyu & Simoncelli, 2008; Jarrett et al., 2009).
As mentioned above, central to our approach is the use of local connectivity between neurons. In our experiments, the first sublayer has receptive fields of 18x1… pixels and the second sub-layer pools over 5x5 overlapping neighborhoods of features (i.e., pooling size). The neurons in the first sublayer connect to pixels in all input channels (or maps) whereas the neurons in the second sublayer connect to pixels of only one channel (or map). While the first sublayer outputs linear filter responses, the pooling layer outputs the square root of the sum of the squares of its inputs, and therefore, it is known as L2 pooling.
Our style of stacking a series of uniform modules, switching between selectivity and tolerance layers, is reminiscent of Neocognitron and HMAX (Fukushima & Miyake, 1982; LeCun et al., 1998; Riesenhuber & Poggio, 1999). It has also been argued to be an architecture employed by the brain (DiCarlo et al., 2012).
Although we use local receptive fields, they are not convolutional: the parameters are not shared across different locations in the image. This is a stark difference between our approach and previous work (LeCun et al., 1998; Jarrett et al., 200…; Lee et al., 2009). In addition to being more biologically plausible, unshared weights allow the learning of more invariances other than translational invar…
Figure 4. Scale (left) and out-of-plane (3D) rotation (right) invariance properties of the best feature.
Figure 6. Visualization of the cat face neuron (left) and human body neuron (right).
... described in (Zhang et al., 2008). In this dataset, there are 10,000 positive images and 18,409 negative images (so that the positive-to-negative ratio is similar to the case of faces). The negative images are chosen randomly from the ImageNet dataset.
... and minimum activation values, then picked 20 equally spaced thresholds in between. The reported accuracy is the best classification accuracy among 20 thresholds.
4.3. Recognition
Surprisingly, the best neuron in the network performs very well in recognizing faces, despite the fact that no supervisory signals were given during training. The best neuron in the network achieves 81.7% accuracy in detecting faces. There are 13,026 faces in the test set, so guessing all negative only achieves 64.8%. The best neuron in a one-layered network only achieves 71% accuracy while best linear filter, selected among 100,000 filters sampled randomly from the training set, only achieves 74%.
To understand their contribution, we removed the local contrast normalization sublayers and trained the network again. Results show that the accuracy of best neuron drops to 78.5%. This agrees with previous study showing the importance of local contrast normalization (Jarrett et al., 2009).
We visualize histograms of activation values for face images and random images in Figure 2. It can be seen, even with exclusively unlabeled data, the neuron learns to differentiate between faces and random distractors. Specifically, when we give a face as an input image, the neuron tends to output value larger than the threshold, 0. In contrast, if we give a random image as an input image, the neuron tends to output value less than 0.
Figure 2. Histograms of faces (red) vs. no faces (blue). The test set is subsampled such that the ratio between faces and no faces is one.
4.4. Visualization
In this section, we will present two visualization techniques to verify if the optimal stimulus of the neuron is indeed a face. The first method is visualizing the most responsive stimuli in the test set. Since the test set is large, this method can reliably detect near optimal stimuli of the tested neuron. The second approach is to perform numerical optimization to find the optimal stimulus (Berkes & Wiskott, 2005; Erhan et al., 2009; Le et al., 2010). In particular, we find the norm-bounded input x which maximizes the output f of the tested neuron, by solving:
$$x^{*} = \arg\min_{x} f(x; W, H), \quad \text{subject to } \|x\|_2 = 1.$$
Here, f(x; W, H) is the output of the tested neuron given learned parameters W, H and input x. In our experiments, this constraint optimization problem is solved by projected gradient descent with line search.
These visualization methods have complementary strengths and weaknesses. For instance, visualizing the most responsive stimuli may suffer from fitting to noise. On the other hand, the numerical optimization approach can be susceptible to local minima. Results, shown in Figure 3, confirm that the tested neuron indeed learns the concept of faces.
Figure 3. Top: Top 48 stimuli of the best neuron from the test set. Bottom: The optimal stimulus according to numerical constraint optimization.
4.5. Invariance properties
We would like to assess the robustness of the face detector against common object transformations, e.g., translation, scaling and out-of-plane rotation. First, we chose a set of 10 face images and perform distortions to them, e.g., scaling and translating. For out-of-plane rotation, we used 10 images of faces rotating in 3D (“out-of-plane”) as the test set. To check the robustness of the neuron, we plot its averaged response over the small test set with respect to changes in scale, 3D rotation (Figure 4), and translation (Figure 5). (Footnote 6: Scaled, translated faces are generated by standard cubic interpolation. For 3D rotated faces, we used 10 se…)
(Le et al., 2012)
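The L2 pooling mentioned in the excerpt (the square root of the sum of squares over overlapping neighborhoods) can be sketched in one dimension. The 1D setting, the function name, and the toy values are assumptions; the paper pools over 5x5 neighborhoods of 2D feature maps.

```python
import math

def l2_pool(values, size, stride):
    """Overlapping L2 pooling: sqrt of the sum of squares in each window."""
    return [math.sqrt(sum(v * v for v in values[i:i + size]))
            for i in range(0, len(values) - size + 1, stride)]

# Hypothetical linear filter responses from the first sublayer.
responses = [3.0, 4.0, 0.0, -3.0, -4.0]
pooled = l2_pool(responses, 2, 1)
```

Because the responses are squared, the pooled value ignores the sign of each filter response, which contributes to the tolerance properties discussed in the excerpt.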
40. Visualization of Features (2)
Excerpt ("Visualizing and Understanding Convolutional Networks"):
... to a given input image, the reconstruction obtained from a single activation thus resembles a small piece of the original input image, with structures weighted according to their contribution toward the feature activation. Since the model is trained discriminatively, they implicitly show which parts of the input image are discriminative. Note that these projections are not samples from the model, since there is no generative process involved.
[Figure 1 diagram labels. Convnet path: Layer Below Pooled Maps, Convolutional Filtering {F}, Feature Maps, Rectified Linear Function, Rectified Feature Maps, Max Pooling, Pooled Maps. Deconvnet path: Layer Above Reconstruction, Max Unpooling (using Switches), Unpooled Maps, Rectified Linear Function, Rectified Unpooled Maps, Convolutional Filtering {F^T}, Reconstruction. Bottom panel: Pooling, Max Locations "Switches", Unpooling.]
Figure 1. Top: A deconvnet layer (left) attached to a convnet layer (right). The deconvnet will reconstruct an approximate version of the convnet features from the layer beneath. Bottom: An illustration of the unpooling operation in the deconvnet, using switches which record the location of the local max in each pooling region (colored zones) during pooling in the convnet.
3. Training Details
... 256x256 region, subtracting the per-pixel mean (across all images) and then using 10 different sub-crops of size 224x224 (corners + center with(out) horizontal flips). Stochastic gradient descent with a mini-batch size of 128 was used to update the parameters, starting with a learning rate of $10^{-2}$, in conjunction with a momentum term of 0.9. We anneal the learning rate throughout training manually when the validation error plateaus. Dropout (Hinton et al., 2012) is used in the fully connected layers (6 and 7) with a rate of 0.5. All weights are initialized to $10^{-2}$ and biases are set to 0.
Visualization of the first layer filters during training reveals that a few of them dominate, as shown in Fig. 6(a). To combat this, we renormalize each filter in the convolutional layers whose RMS value exceeds a fixed radius of $10^{-1}$ to this fixed radius. This is crucial, especially in the first layer of the model, where the input images are roughly in the [-128,128] range. As in (Krizhevsky et al., 2012), we produce multiple different crops and flips of each training example to boost training set size. We stopped training after 70 epochs, which took around 12 days on a single GTX580 GPU, using an implementation based on (Krizhevsky et al., 2012).
4. Convnet Visualization
Using the model described in Section 3, we now use the deconvnet to visualize the feature activations on the ImageNet validation set.
Feature Visualization: Fig. 2 shows feature visualizations from our model once training is complete. However, instead of showing the single strongest activation for a given feature map, we show the top 9 activations. Projecting each separately down to pixel space reveals the different structures that excite a given feature map, hence showing its invariance to input deformations. Alongside these visualizations we ...
[Figure panels: Layer 1, Layer 2, Layer 3]
(Zeiler and Fergus, 2013)
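The unpooling operation described in the Figure 1 caption, where switches record the location of each local max, can be sketched in one dimension. The 1D setting, non-overlapping windows, and function names are assumptions for illustration rather than the authors' implementation.

```python
def max_pool_with_switches(values, size):
    """Non-overlapping 1D max pooling that also records where each max was found."""
    pooled, switches = [], []
    for start in range(0, len(values) - size + 1, size):
        window = values[start:start + size]
        offset = max(range(size), key=lambda k: window[k])
        pooled.append(window[offset])
        switches.append(start + offset)
    return pooled, switches

def unpool(pooled, switches, length):
    """Place each pooled value back at its recorded location; zeros elsewhere."""
    out = [0.0] * length
    for value, index in zip(pooled, switches):
        out[index] = value
    return out

# Hypothetical rectified feature map.
features = [0.1, 0.9, 0.4, 0.2, 0.7, 0.3]
pooled, switches = max_pool_with_switches(features, 2)
restored = unpool(pooled, switches, len(features))
```

Pooling alone discards where the maximum occurred; the switches keep that information so the reconstruction places activity at the right positions.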
42. Layer 4, Layer 5
Figure 2. Visualization of features in a fully trained model. For layers 2-5 we show the top 9 activations in a random subset of feature maps across the validation data, projected down to pixel space using our deconvolutional network approach. ... reconstructions are not samples from the model: they are reconstructed patterns from the validation set that ...
44. RNNLM
• Mikolov et al., 2010
• They stabilized learning
by truncation of
“explosive” gradient
vectors
Table 1: Performance of models on WSJ DEV set when increasing size of training data.
Model | # words | PPL | WER
KN5 LM | 200K | 336 | 16.4
KN5 LM + RNN 90/2 | 200K | 271 | 15.4
KN5 LM | 1M | 287 | 15.1
KN5 LM + RNN 90/2 | 1M | 225 | 14.0
KN5 LM | 6.4M | 221 | 13.5
KN5 LM + RNN 250/5 | 6.4M | 156 | 11.7
... where C_rare is number of words in the vocabulary that occur less often than the threshold. All rare words are thus treated equally, i.e., probability is distributed uniformly between them.
Schwenk [4] describes several possible approaches that can be used for further performance improvements. Additional possibilities are also discussed in [10][11][12] and most of them can be applied also to RNNs. For comparison, it takes around 6 hours for our basic implementation to train RNN model based on Brown corpus (800K words, 100 hidden units and vocabulary threshold 5), while Bengio reports 113 days for basic implementation and 26 hours with importance sampling [10], when using similar data and size of neural network. We use only BLAS library to speed up computation.
3. WSJ experiments
To evaluate performance of simple recurrent neural network based language model, we have selected several standard speech recognition tasks. First we report results after rescoring 100-best lists from DARPA WSJ’92 and WSJ’93 data sets - the same data sets were used by Xu [8] and Filimonov [9]. Oracle WER is 6.1% for dev set and 9.5% for eval set. Training ...
Table 2: Comparison of various configurations of RNN LMs and combinations with backoff models while using 6.4M words in training data (WSJ DEV).
Model | PPL (RNN) | PPL (RNN+KN) | WER (RNN) | WER (RNN+KN)
KN5 - baseline | - | 221 | - | 13.5
RNN 60/20 | 229 | 186 | 13.2 | 12.6
RNN 90/10 | 202 | 173 | 12.8 | 12.2
RNN 250/5 | 173 | 155 | 12.3 | 11.7
RNN 250/2 | 176 | 156 | 12.0 | 11.9
RNN 400/10 | 171 | 152 | 12.5 | 12.1
3xRNN static | 151 | 143 | 11.6 | 11.3
3xRNN dynamic | 128 | 121 | 11.3 | 11.1
Table 3: Comparison of WSJ results obtained with various models. Note that RNN models are trained just on 6.4M words.
Model | DEV WER | EVAL WER
Lattice 1 best | 12.9 | 18.4
Baseline - KN5 (37M) | 12.2 | 17.2
Discriminative LM [8] (37M) | 11.5 | 16.9
Joint LM [9] (70M) | - | 16.7
Static 3xRNN + KN5 (37M) | 11.0 | 15.5
Dynamic 3xRNN + KN5 (37M) | 10.7 | 16.3
... dynamic RNN LMs - actually, by mixing static and dynamic RNN LMs with larger learning rate used when processing testing data (α = 0.3), the best perplexity result was 112.
All LMs in the preceding experiments were trained on only 6.4M words, which is much less than the amount of data used by others for this task. To provide a comparison with Xu [8] and ...
(on WSJ ’92/WSJ ’93 data sets)
Table 4: Comparison of very large back-off LMs and RNN LMs trained only on limited in-domain data (5.4M words).
Model | WER static | WER dynamic
RT05 LM | 24.5 | -
RT09 LM - baseline | 24.1 | -
KN5 in-domain | 25.7 | -
RNN 500/10 in-domain | 24.2 | 24.1
RNN 500/10 + RT09 LM | 23.3 | 23.2
RNN 800/10 in-domain | 24.3 | 23.8
RNN 800/10 + RT09 LM | 23.4 | 23.1
RNN 1000/5 in-domain | 24.2 | 23.7
RNN 1000/5 + RT09 LM | 23.4 | 22.9
3xRNN + RT09 LM | 23.3 | 22.8
... extraction use 13 Mel-PLP’s features with deltas, double and triple ...
(on NIST RT05)
45. Initialization of RNNs
• Sutskever et al. (2013)
empirically showed that
initialization and
momentum critically
improve RNN
performance
• Echo state network
based initialization
Excerpt (Jaeger and Haas, 2004):
... irregular time series (Fig. 2A). The prediction task has two steps: (i) using an initial teacher sequence generated by the original MGS to learn a black-box model M of the generating system, and (ii) using M to predict the value of the sequence some steps ahead.
First, we created a random RNN with 1000 neurons (called the “reservoir”) and one output neuron. The output neuron was equipped with random connections that project back into the reservoir (Fig. 2B). A 3000-step teacher sequence d(1), . . ., d(3000) was generated from the MGS equation and fed into the output neuron. This excited the internal neurons through the output feedback connections. After an initial transient period, they started to exhibit systematic individual variations of the teacher sequence (Fig. 2B).
The fact that the internal neurons display systematic variants of the exciting external signal is constitutional for ESNs: The internal neurons must work as “echo functions” for the driving signal. Not every randomly generated RNN has this property, but it can effectively be built into a reservoir (supporting online text).
... square error
$$\mathrm{NRMSE} = \left( \sum_{j=1}^{100} \frac{\left( d_j(3084) - y_j(3084) \right)^2}{100\,\sigma^2} \right)^{1/2} \approx 10^{-4.2}$$
was obtained ($d_j$ and $y_j$: teacher and network ...
Fig. 1. (A) Schema of previous approaches to
RNN learning. (B) Schema of ESN approach.
Solid bold arrows, fixed synaptic connections;
dotted arrows, adjustable connections. Both
approaches aim at minimizing the error d(n) –
y(n), where y(n) is the network output and d(n)
is the teacher time series observed from the
target system.
2 APRIL 2004 VOL 304 SCIENCE www.sciencemag.org
(Jaeger and Haas, 2004)
46. Regularization of RNNs
• Pascanu et al., 2013
• Exploding gradients are handled with the same method as in Mikolov et al. (2010)
• Vanishing gradients are handled by introducing a regularization term Ω
Excerpt ("On the difficulty of training Recurrent Neural Networks"; the page image is clipped on the left in the slide):
[Figure caption, clipped: ... rate of success for solving the temporal order ... versus log of sequence length. See text.]
... become an issue, addressing the exploding ... problem ensures a better success rate. ... clipping as well as the regularization ... in section 3.3, we call this algorithm ... SGD-CR solved the task with a success rate ... sequences up to 200 steps (the maximal ... in Martens and Sutskever (2011)). ... can train a single model to deal with ...
Table 1. Results on polyphonic music prediction in negative log likelihood per time step. Lower is better.
Data set | Data fold | SGD | SGD+C | SGD+CR
Piano-midi.de | train | 6.87 | 6.81 | 7.01
Piano-midi.de | test | 7.56 | 7.53 | 7.46
Nottingham | train | 3.67 | 3.21 | 3.24
Nottingham | test | 3.80 | 3.48 | 3.46
MuseData | train | 8.25 | 6.54 | 6.51
MuseData | test | 7.11 | 7.00 | 6.99
Table 2. Results on the next character prediction task in entropy (bits/character)
Data set | Data fold | SGD | SGD+C | SGD+CR
1 step | train | 1.46 | 1.34 | 1.36
1 step | test | 1.50 | 1.42 | 1.41
5 steps | train | N/A | 3.76 | 3.70
5 steps | test | N/A | 3.89 | 3.74
4.2. Natural problems
We address the task of polyphonic music prediction, using the datasets Piano-midi.de, Nottingham and MuseData described in Boulanger-Lewandowski et al. (2012) and language modelling at the character level on the Penn Treebank dataset (Mikolov et al., 2012).
We also explore a modified version of the task, where ...
... magnitude. Our intuition is that increasing the norm of $\frac{\partial x_t}{\partial x_k}$ means the error at time t is more sensitive to all inputs $u_t, \ldots, u_k$ ($\frac{\partial x_t}{\partial x_k}$ is a factor in $\frac{\partial E_t}{\partial u_k}$). In practice some of these inputs will be irrelevant for the prediction at time t and will behave like noise that the network needs to learn to ignore. The network can not learn to ignore these irrelevant inputs unless there is an error signal. These two issues can not be solved in parallel, and it seems natural to expect that we need to force the network to increase the norm of $\frac{\partial x_t}{\partial x_k}$ at the expense of larger errors (caused by the irrelevant input entries) and then wait for it to learn to ignore these irrelevant input entries. This suggests that moving towards increasing the norm of $\frac{\partial x_t}{\partial x_k}$ can not be always done while following a descent direction of the error E (which is, e.g., what a second order method would try to do), and therefore we need to enforce it via a regularization term.
The regularizer we propose below prefers solutions for which the error signal preserves norm as it travels back in time:
$$\Omega = \sum_k \Omega_k = \sum_k \left( \frac{\left\| \frac{\partial E}{\partial x_{k+1}} \frac{\partial x_{k+1}}{\partial x_k} \right\|}{\left\| \frac{\partial E}{\partial x_{k+1}} \right\|} - 1 \right)^2 \tag{9}$$
In order to be computationally efficient, we only use the “immediate” partial derivative of $\Omega$ with respect to $W_{rec}$ (we consider that $x_k$ and $\frac{\partial E}{\partial x_{k+1}}$ as being constant with respect to $W_{rec}$ when computing the derivative of $\Omega_k$), as depicted in equation (10). Note we use the parametrization of equation (11). This can be done efficiently because we get the values of $\frac{\partial E}{\partial x_k}$ from BPTT. We use Theano to compute these gradients (Bergstra ...
... model such that it is further away from the attractor (such that it does not converge to it, case in which the gradients vanish) and closer to boundaries between basins of attractions, making it more probable for the gradients to explode.
4. Experiments and Results
4.1. Pathological synthetic problems
As done in Martens and Sutskever (2011), we address the pathological problems proposed by Hochreiter and Schmidhuber (1997) that require learning long term correlations. We refer the reader to this original paper for a detailed description of the tasks and to the supplementary materials for the complete description of the experimental setup.
4.1.1. The Temporal Order problem
We consider the temporal order problem as the prototypical pathological problem, extending our results to the other proposed tasks afterwards. The input is a long stream of discrete symbols. At two points in time (in the beginning and middle of the sequence) a symbol within {A, B} is emitted. The task consists in classifying the order (either AA, AB, BA, BB) at the end of the sequence.
Fig. 7 shows the success rate of standard SGD, SGD-C (SGD enhanced with our clipping strategy) and SGD-CR (SGD with the clipping strategy and the regularization term). Note that for sequences longer than 20, the vanishing gradients problem ensures that neither SGD nor SGD-C algorithms can solve the task. The x-axis is on log scale.
This task provides empirical evidence that exploding ...
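The clipping strategy referred to as SGD-C can be sketched as rescaling the gradient whenever its norm exceeds a threshold; an element-wise truncation variant is shown alongside for comparison. The threshold values and function names are assumptions for illustration, and the source does not specify which exact variant each paper's code uses.

```python
import math

def clip_by_norm(grad, threshold):
    """Rescale grad so that its L2 norm does not exceed threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)
    scale = threshold / norm
    return [g * scale for g in grad]

def clip_elementwise(grad, limit):
    """Truncate each component of grad to the range [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grad]

# Hypothetical exploding gradient.
exploding = [3.0, 4.0]
rescaled = clip_by_norm(exploding, 1.0)         # direction kept, norm reduced to 1
truncated = clip_elementwise([3.0, -4.0, 0.5], 1.0)
```

Rescaling by the norm preserves the gradient's direction, whereas element-wise truncation can change it; both bound the size of a single update step.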