2. Bidirectional RNNs
Motivation:
Sequences where context matters: ideally, the prediction at each time step can use
knowledge about the future as well as the past, e.g. speech and handwriting recognition.
h(t): state of the sub-RNN moving forward in time
g(t): state of the sub-RNN moving backward in time
The architecture extends to inputs with n dimensions (using 2n sub-RNNs); e.g. with
2-D images, 4 sub-RNNs can capture long-range lateral interactions between features,
though this is more expensive to train than a convolutional neural net.
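A minimal numpy sketch of the 1-D (sequence) case; the function and weight names (U_f, W_f, U_b, W_b) and the tanh update are illustrative choices, not a prescribed implementation:

import numpy as np

def birnn_forward(X, U_f, W_f, U_b, W_b):
    """Run a forward sub-RNN h(t) and a backward sub-RNN g(t) over the input sequence X."""
    T = X.shape[0]
    n = W_f.shape[0]
    h = np.zeros((T, n))                 # h(t): forward states
    g = np.zeros((T, n))                 # g(t): backward states
    for t in range(T):
        prev = h[t - 1] if t > 0 else np.zeros(n)
        h[t] = np.tanh(U_f @ X[t] + W_f @ prev)
    for t in reversed(range(T)):
        nxt = g[t + 1] if t < T - 1 else np.zeros(n)
        g[t] = np.tanh(U_b @ X[t] + W_b @ nxt)
    # An output layer at time t can now condition on both the past (h[t]) and the future (g[t]).
    return np.concatenate([h, g], axis=1)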
3. Encoder-Decoder Sequence-to-Sequence
Architectures
Allow for input and output sequences of different
lengths. Applications include speech recognition,
machine translation, and question answering.
C: vector, or sequence of vectors, summarizing the input sequence X = (x(1), …, x(n_x)).
Encoder: input RNN
Decoder: output RNN
Both RNNs are trained jointly to maximize the average of
log P(y(1), …, y(n_y) | x(1), …, x(n_x))
over the training pairs of x and y sequences.
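A minimal sketch of the idea, assuming the context C is simply the encoder's final hidden state and the decoder is conditioned on C at every step (real decoders also feed back the previous output y(t−1) and end with a softmax; all names here are ours):

import numpy as np

def encode(X, U_enc, W_enc):
    """Encoder (input) RNN: read x(1), ..., x(n_x) and summarize them in a context vector C."""
    h = np.zeros(W_enc.shape[0])
    for x_t in X:
        h = np.tanh(U_enc @ x_t + W_enc @ h)
    return h                              # C: here simply the encoder's final hidden state

def decode(C, W_dec, V_dec, n_y):
    """Decoder (output) RNN: unroll n_y steps, conditioning every step on C."""
    s = np.tanh(V_dec @ C)                # initialize the decoder state from the context
    states = []
    for _ in range(n_y):
        s = np.tanh(W_dec @ s + V_dec @ C)
        states.append(s)
    return np.stack(states)               # in practice each s feeds a softmax over y(t)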
4. Deep Recurrent Networks
Typically, RNNs can be decomposed into 3
blocks:
- Input-to-hidden
- Hidden-to-hidden
- Hidden-to-output
Basic idea here: introduce depth in each of the
above blocks.
Fig (a) : Lower levels transform the raw input into a more appropriate representation for the higher levels
Fig (b) : Add extra layers in the recurrence
relationship
Fig (c) : Mitigate the longer path from t to t+1 (created by the extra depth) by adding skip connections
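A rough sketch of one transition combining ideas (b) and (c); the weight names are ours:

import numpy as np

def deep_rnn_step(h, x, U, W1, W2, W_skip):
    """One hidden-to-hidden transition with an extra layer in the recurrence (as in Fig (b)),
    plus a skip connection from h(t) to h(t+1) to shorten the path again (as in Fig (c))."""
    z = np.tanh(U @ x + W1 @ h)              # deeper transformation inside the recurrence
    return np.tanh(W2 @ z + W_skip @ h)      # skip connection mitigates the added depth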
5. Recursive Neural Networks
Generalize computational graph from chain to a
tree.
For a sequence of length T, the depth (number of compositions of non-linear operations) can be
reduced from O(T) to O(log T) (the simplest way to see this is to solve 2^depth ≈ T, assuming a
branching factor of 2).
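A small sketch of a balanced-tree composition with shared weights, assuming T is a power of 2 and a branching factor of 2 (all names are ours):

import numpy as np

def recursive_reduce(X, W_left, W_right, b):
    """Combine a sequence of T vectors with a balanced binary tree of shared-weight
    compositions; the depth of composition is log2(T) rather than T."""
    nodes = list(X)
    depth = 0
    while len(nodes) > 1:
        nodes = [np.tanh(W_left @ nodes[i] + W_right @ nodes[i + 1] + b)
                 for i in range(0, len(nodes), 2)]
        depth += 1
    return nodes[0], depth    # e.g. T = 8 gives depth 3, since 2**3 = 8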
Open question: How to best structure the tree. In
practice, depends on the problem at hand.
Ideally, the learner itself infers and implements
the appropriate structure given the input.
6. Challenge of Long-Term Dependencies
Basic problem:
- Gradients propagated over several time steps tend to either vanish or explode
We can think of the recurrence relation
h(t) = W^T h(t−1)
as a simple RNN lacking inputs and a non-linear activation function. It can be simplified to
h(t) = (W^t)^T h(0),
so that if W admits an eigendecomposition of the form
W = Q Λ Q^T
with Q an orthogonal matrix, the recurrence further simplifies to
h(t) = Q Λ^t Q^T h(0).
Thus eigenvalues λ_i with |λ_i| < 1 will tend to decay to zero, while those with |λ_i| > 1 will tend to
explode, eventually causing any component of h(0) that is not aligned with the largest eigenvector to
be discarded.
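A toy numerical illustration of this decay/explosion (values chosen arbitrarily):

import numpy as np

# Iterate h(t) = W^T h(t-1) with W = Q Λ Q^T.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # a random orthogonal matrix
lam = np.array([0.5, 0.9, 1.1])                # eigenvalues λ_i
W = Q @ np.diag(lam) @ Q.T
h = Q @ np.ones(3)                             # h(0): unit weight along each eigenvector
for _ in range(50):
    h = W.T @ h
print(Q.T @ h)    # ≈ [0.5**50, 0.9**50, 1.1**50]: two components vanish, one explodes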
7. Challenge of Long-Term Dependencies
The problem is inherent to RNNs because the same weights are reused at every time step; for
non-recurrent networks we can always choose different weights at different time steps (layers).
Imagine a scalar weight w getting multiplied by itself once per time step.
● The product w^t will either vanish or explode, depending on the magnitude of w.
● On the other hand, if every w(t) is independent but identically distributed with mean 0 and
variance v, then the state after n time steps is the product of all the w(t)'s, and the variance of
that product is O(v^n).
For non-recurrent deep feedforward networks, we may achieve some desired variance v* by
sampling the individual weights with variance (v*)^(1/n), and thus avoid the vanishing and exploding
gradient problem.
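A quick arithmetic check of the two cases, with arbitrarily chosen v = 1.5 and n = 50:

n, v = 50, 1.5
print(v ** n)               # variance of the product of n i.i.d. mean-0 weights with variance v: ~6.4e8
print((v ** (1 / n)) ** n)  # per-weight variance v**(1/n) ≈ 1.008 keeps the product's variance at v = 1.5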
Open problem: Allow an RNN to learn long-term dependencies without vanishing/exploding
gradients.
8. Echo State Networks
Hidden-to-hidden and input-to-hidden weights are usually most difficult parameters to learn in an
RNN.
Echo State Networks (ESNs): Set recurrent weights such that hidden units capture history of
past inputs, and learn only the output weights.
Liquid State Machines: Same idea as above, except with spiking neurons (binary outputs) instead of
the continuous-valued hidden units used for ESNs.
This approach is collectively referred to as reservoir computing (hidden units form a reservoir of
temporal features, capturing different aspects of input history).
9. Echo State Networks
Spectral radius: the largest absolute value among the eigenvalues of the Jacobian at time t, J(t) = ∂s(t)/∂s(t−1).
Suppose J has an eigenvector v with eigenvalue λ. Further suppose we want to back-propagate a
gradient vector g back in time, and compare this to back-propagating the perturbed vector g + δv.
After n propagation steps, the two executions diverge by δ|λ|^n, which grows exponentially large if
|λ| > 1 and vanishes if |λ| < 1. (A similar argument applies to forward propagation, with the
non-linearity removed.)
Strategy in ESNs is to fix weights to have some bounded spectral radius, such that information is
carried through time but does not explode/vanish.
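A minimal ESN sketch along these lines; the reservoir size, tanh non-linearity, spectral radius of 0.9, and least-squares readout are illustrative choices, and all names are ours:

import numpy as np

def make_reservoir(n, spectral_radius=0.9, seed=0):
    """Fix random recurrent weights, rescaled so the spectral radius is bounded (< 1 here)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n, n))
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    return W

def esn_fit(X, Y, n=200, seed=0):
    """Echo state network sketch: only the hidden-to-output weights are learned."""
    rng = np.random.default_rng(seed)
    W = make_reservoir(n, seed=seed)
    U = rng.normal(size=(n, X.shape[1]))     # fixed input-to-hidden weights
    h = np.zeros(n)
    H = []
    for x_t in X:
        h = np.tanh(W @ h + U @ x_t)         # reservoir of temporal features
        H.append(h)
    H = np.stack(H)
    W_out, *_ = np.linalg.lstsq(H, Y, rcond=None)  # learn only the output weights
    return W_out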
10. Strategies for Multiple Time Scales
Design models that operate at multiple time scales, e.g. some parts operating at fine-grained time
scales, others at more coarse-grained scales.
Adding Skip Connections Through Time:
Add direct connections from variables in the distant past to variables in the present, instead of only
from time t to time t+1.
Leaky Units and a Spectrum of Different Time Scales:
Design units with linear self-connections and a weight near 1 on those connections. As an analogy,
consider accumulating a running average μ(t) of some variable v(t) via
μ(t) = α μ(t−1) + (1 − α) v(t).
When α is close to 1, the running average remembers the past for a long time. Hidden units with
such linear self-connections and weights close to 1 can behave similarly.
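A small sketch of the running-average analogy (names are ours):

import numpy as np

def running_average(v, alpha=0.99):
    """mu(t) = alpha * mu(t-1) + (1 - alpha) * v(t): alpha near 1 remembers the distant past."""
    mu = 0.0
    out = []
    for v_t in v:
        mu = alpha * mu + (1 - alpha) * v_t
        out.append(mu)
    return np.array(out)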
Removing connections:
Remove the length-one connections and replace them with longer ones, forcing units to operate on a longer (coarser) time scale.
11. LSTM and Other Gated RNNs
As of now, the most effective sequence models used in practical applications are gated RNNs,
including the long short-term memory (LSTM) and networks based on the gated recurrent unit (GRU).
Basic idea: Create paths through time with derivatives that don’t explode/vanish, by allowing
connection weights to become functions of time. Leaky units can allow network to accumulate
information over a long period of time, but there should also be a forgetting mechanism when that
info becomes irrelevant. Ideally, we want the network itself to decide when to forget.
12. LSTM
The state unit s_i(t) has a linear self-loop similar to the leaky units described in the previous section.
The self-loop weight is controlled by a forget gate unit
f_i(t) = σ( b^f_i + Σ_j U^f_ij x_j(t) + Σ_j W^f_ij h_j(t−1) )
where
x(t): current input vector
h(t): current hidden layer vector (containing the outputs of all the LSTM cells)
b^f: forget-gate biases
U^f: forget-gate input weights
W^f: forget-gate recurrent weights
(Figure: LSTM “cell”)
13. LSTM
The LSTM cell internal state is then updated as follows:
s_i(t) = f_i(t) s_i(t−1) + g_i(t) σ( b_i + Σ_j U_ij x_j(t) + Σ_j W_ij h_j(t−1) )
where
b: biases
U: input weights
W: recurrent weights
g_i(t): external input gate, computed like the forget gate but with its own parameters
(Figure: LSTM “cell”)
14. LSTM
The output h_i(t) of the LSTM cell can also be shut off via the output gate q_i(t):
h_i(t) = tanh( s_i(t) ) q_i(t)
q_i(t) = σ( b^o_i + Σ_j U^o_ij x_j(t) + Σ_j W^o_ij h_j(t−1) )
One can choose to use the cell state s_i(t) as an extra input (with its own weight) into the three gates
of the i-th unit, as shown in the figure.
LSTMs have been shown to learn long-term dependencies more easily than simple recurrent
architectures.
(Figure: LSTM “cell”)
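Putting the three LSTM slides together, a minimal numpy sketch of a single cell update (applied elementwise over a vector of cells); the parameter container p and its key names are our own, and tanh is often used instead of σ for the candidate input:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, s_prev, p):
    """One LSTM cell update following the gate equations above. The dict p holds the
    parameters (bf/Uf/Wf, bg/Ug/Wg, b/U/W, bo/Uo/Wo); this container and its keys are ours."""
    f = sigmoid(p["bf"] + p["Uf"] @ x + p["Wf"] @ h_prev)                 # forget gate f(t)
    g = sigmoid(p["bg"] + p["Ug"] @ x + p["Wg"] @ h_prev)                 # external input gate g(t)
    q = sigmoid(p["bo"] + p["Uo"] @ x + p["Wo"] @ h_prev)                 # output gate q(t)
    s = f * s_prev + g * sigmoid(p["b"] + p["U"] @ x + p["W"] @ h_prev)   # internal state s(t)
    h = np.tanh(s) * q                                                    # gated cell output h(t)
    return h, s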
15. Other Gated RNNs
Main difference from the LSTM: a single gating unit simultaneously controls the forgetting factor and the
decision to update the state unit.
u: update gate
r: reset gate
Both gates can individually ignore parts of the state vector.
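A minimal sketch of one such update, using tanh for the candidate state (a common choice); the parameter names are ours:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, p):
    """One gated recurrent unit update: the single update gate u controls both forgetting and
    updating, and the reset gate r can ignore parts of the previous state."""
    u = sigmoid(p["bu"] + p["Uu"] @ x + p["Wu"] @ h_prev)        # update gate u(t)
    r = sigmoid(p["br"] + p["Ur"] @ x + p["Wr"] @ h_prev)        # reset gate r(t)
    h_cand = np.tanh(p["b"] + p["U"] @ x + p["W"] @ (r * h_prev))
    return u * h_prev + (1.0 - u) * h_cand                       # mix of old state and candidate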
16. Optimization for Long-Term Dependencies
Basic problem: Vanishing and exploding gradients when optimizing RNNs over many time steps.
Clipping Gradients: The cost function can have sharp cliffs as a function of the weights/biases, so the
gradient direction can change dramatically within a short distance. Solution: reduce the step size by
rescaling the gradient whenever its norm gets too large:
if ||g|| > v:  g ← g v / ||g||
where v is the norm threshold and g is the gradient.
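A one-function sketch of this clipping rule:

import numpy as np

def clip_gradient(g, v):
    """Norm clipping: if ||g|| exceeds the threshold v, rescale g so its norm equals v."""
    norm = np.linalg.norm(g)
    return g * (v / norm) if norm > v else g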
17. Optimization for Long-Term Dependencies
Regularizing to Encourage Information Flow: The previous technique helps with exploding gradients,
but not with vanishing gradients. Ideally, we would like the back-propagated gradient
(∇_h(t) L) ∂h(t)/∂h(t−1) to be as large as ∇_h(t) L, so that it maintains its magnitude as it gets
back-propagated. We could therefore use the following term as a regularizer to achieve this effect:
Ω = Σ_t ( ||(∇_h(t) L) ∂h(t)/∂h(t−1)|| / ||∇_h(t) L|| − 1 )^2
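A sketch of this regularizer for a simple tanh RNN, h(t) = tanh(W h(t−1) + U x(t)), whose Jacobian ∂h(t)/∂h(t−1) is diag(1 − h(t)^2) W; all names here are ours:

import numpy as np

def info_flow_penalty(deltas, hs, W):
    """Sum over t of ( ||delta(t) * dh(t)/dh(t-1)|| / ||delta(t)|| - 1 )**2, where
    deltas[t] holds the back-propagated gradient dL/dh(t) and hs[t] holds h(t)."""
    omega = 0.0
    for delta, h in zip(deltas, hs):
        back = delta @ (np.diag(1.0 - h ** 2) @ W)    # (dL/dh(t)) · dh(t)/dh(t-1)
        omega += (np.linalg.norm(back) / np.linalg.norm(delta) - 1.0) ** 2
    return omega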