4. Artificial Neural Network (ANN)
An artificial neuron is the elementary unit of an artificial neural network.
5. Feedforward NN vs. Recurrent NN
Recurrent neural networks (RNNs) allow cyclical connections.
6. Unfolding the RNN and training using BPTT
Can do backprop on the unfolded network: Backpropagation through time (BPTT)
http://ir.hit.edu.cn/~jguo/docs/notes/bptt.pdf
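As a concrete illustration (not in the original notes), here is a minimal numpy sketch of the idea: the recurrence is unfolded over the time steps of one sequence, and ordinary backprop is run over the unfolded graph. All names, shapes and hyperparameters are illustrative.

# Minimal sketch: unroll a vanilla tanh RNN over a sequence, then run
# backpropagation through time (BPTT) on the unfolded graph (numpy only).
import numpy as np

def rnn_bptt(xs, ys, Wxh, Whh, Why, bh, by):
    """xs, ys: lists of input/target column vectors; returns loss and grads."""
    hs = {-1: np.zeros((Whh.shape[0], 1))}
    outs, loss = {}, 0.0
    # forward pass = unfolding the recurrence in time
    for t in range(len(xs)):
        hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t - 1] + bh)
        outs[t] = Why @ hs[t] + by
        loss += 0.5 * np.sum((outs[t] - ys[t]) ** 2)
    # backward pass = ordinary backprop on the unfolded network
    g = {name: np.zeros_like(p) for name, p in
         [("Wxh", Wxh), ("Whh", Whh), ("Why", Why), ("bh", bh), ("by", by)]}
    dh_next = np.zeros_like(hs[-1])
    for t in reversed(range(len(xs))):
        dy = outs[t] - ys[t]
        g["Why"] += dy @ hs[t].T
        g["by"] += dy
        dh = Why.T @ dy + dh_next          # gradient arriving at h_t
        dpre = (1.0 - hs[t] ** 2) * dh     # through the tanh nonlinearity
        g["bh"] += dpre
        g["Wxh"] += dpre @ xs[t].T
        g["Whh"] += dpre @ hs[t - 1].T
        dh_next = Whh.T @ dpre             # flows back to the previous time step
    return loss, g

# illustrative usage with random weights and a length-4 sequence
n_in, n_hid, n_out, T = 3, 5, 2, 4
p = [np.random.randn(*s) * 0.1 for s in
     [(n_hid, n_in), (n_hid, n_hid), (n_out, n_hid), (n_hid, 1), (n_out, 1)]]
xs = [np.random.randn(n_in, 1) for _ in range(T)]
ys = [np.random.randn(n_out, 1) for _ in range(T)]
loss, grads = rnn_bptt(xs, ys, *p)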
7. Neural Network properties
Feedforward NN (FFNN):
● FFNN is a universal approximator: a feed-forward network with a single hidden layer
containing a finite number of hidden neurons can approximate continuous functions
on compact subsets of R^n, under mild assumptions on the activation function
(a minimal sketch follows below).
● Typical FFNNs have no inherent notion of order in time. Apart from what was learned
during training, they retain no memory of past inputs.
Recurrent NN (RNN):
● RNNs are Turing-complete: they can compute anything that can be computed and
have the capacity to simulate arbitrary procedures.
● RNNs possess a certain type of memory. They are much better suited to dealing with
sequences, context modeling and time dependencies.
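To make the universal-approximator bullet above concrete, here is a minimal sketch (not in the original notes): a single hidden layer of tanh units, trained with plain gradient descent, approximating sin(x) on an interval. The architecture and hyperparameters are illustrative.

# Minimal sketch: one hidden layer of tanh units fitted to sin(x) with
# full-batch gradient descent on mean squared error (numpy only).
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(-np.pi, np.pi, 256).reshape(-1, 1)      # inputs
Y = np.sin(X)                                           # target function
H = 32                                                  # hidden units
W1, b1 = rng.normal(0, 1, (1, H)), np.zeros(H)
W2, b2 = rng.normal(0, 1, (H, 1)), np.zeros(1)

lr = 0.05
for step in range(5000):
    A = np.tanh(X @ W1 + b1)          # hidden layer
    P = A @ W2 + b2                   # linear output
    err = P - Y
    # gradients of the mean squared error
    dW2 = A.T @ err / len(X); db2 = err.mean(0)
    dA = err @ W2.T * (1 - A ** 2)
    dW1 = X.T @ dA / len(X); db1 = dA.mean(0)
    W1 -= lr * dW1; b1 -= lr * db1; W2 -= lr * dW2; b2 -= lr * db2

print("final MSE:", float((err ** 2).mean()))   # typically small once trained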
12. Comparing LSTM and Simple RNN
More on LSTMs: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
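A minimal sketch (not in the original notes) of the difference at the level of a single time step: the simple RNN overwrites its whole hidden state through a tanh, while the LSTM keeps an additively updated cell state controlled by gates, which is what lets gradients survive over long ranges. Parameter names and shapes are illustrative.

# Minimal sketch: one step of a simple tanh RNN cell vs. one step of an LSTM cell.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simple_rnn_step(x, h, Wx, Wh, b):
    # the whole state is a single hidden vector, overwritten at every step
    return np.tanh(Wx @ x + Wh @ h + b)

def lstm_step(x, h, c, W, U, b):
    # W, U, b hold the stacked parameters of the four gates (i, f, o, g)
    z = W @ x + U @ h + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c_new = f * c + i * g          # additive cell update: gradients flow further
    h_new = o * np.tanh(c_new)
    return h_new, c_new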
13. Another solution: Gated Recurrent Unit (GRU)
GRU (Cho et al., 2014) is a bit simpler than LSTM (fewer weights)
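For comparison, a minimal sketch (not in the original notes) of one GRU step: two gates (update and reset) and no separate cell state, hence fewer weight matrices than the LSTM. Names and shapes are illustrative.

# Minimal sketch: one GRU step.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    z = sigmoid(Wz @ x + Uz @ h)              # update gate
    r = sigmoid(Wr @ x + Ur @ h)              # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
    return (1 - z) * h + z * h_tilde          # interpolate old and new state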
14. Bidirectional RNN/LSTM
There are many situations when you see the whole sequence at once (OCR, speech
recognition, translation, caption generation, …). So you can scan the [1D] sequence
in both directions, forward and backward.
This is where BRNN/BLSTM (Graves, Schmidhuber, 2005) come in.
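A minimal sketch (not in the original notes) of the bidirectional idea: two independent recurrent passes, one forward and one backward, whose hidden states are concatenated at each position. step_fwd and step_bwd stand in for any RNN/GRU/LSTM cell.

# Minimal sketch: bidirectional scan over a sequence.
import numpy as np

def birnn(xs, step_fwd, step_bwd, h0_fwd, h0_bwd):
    """xs: list of input vectors; step_*: functions (x, h) -> h."""
    hs_fwd, h = [], h0_fwd
    for x in xs:                       # left-to-right pass
        h = step_fwd(x, h)
        hs_fwd.append(h)
    hs_bwd, h = [], h0_bwd
    for x in reversed(xs):             # right-to-left pass
        h = step_bwd(x, h)
        hs_bwd.append(h)
    hs_bwd.reverse()
    # each output position now sees context from both directions
    return [np.concatenate([f, b]) for f, b in zip(hs_fwd, hs_bwd)]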
17. Multidirectional Multidimensional RNN/LSTM
Standard RNNs are inherently one dimensional, and therefore poorly suited to
multidimensional data (e.g. images).
The basic idea of MDRNNs (Graves, Fernandez, Schmidhuber, 2007) is to replace
the single recurrent connection found in standard RNNs with as many recurrent
connections as there are dimensions in the data.
It assumes some ordering on the multidimensional data. BRNNs can be extended to
n-dimensional data by using 2^n separate hidden layers.
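A minimal sketch (not in the original notes) of the MDRNN recurrence for the 2D case: with a raster-scan ordering, the hidden state at (i, j) receives recurrent input from (i-1, j) and (i, j-1), i.e. one recurrent connection per data dimension. Shapes and names are illustrative.

# Minimal sketch: 2D MDRNN scan over an image-like array.
import numpy as np

def mdrnn_2d(X, Wx, Wh_rows, Wh_cols, b):
    """X: (H, W, d_in) array; returns (H, W, d_hid) hidden states."""
    H, W, _ = X.shape
    d = b.shape[0]
    Hid = np.zeros((H, W, d))
    for i in range(H):                 # scan in a fixed raster ordering
        for j in range(W):
            h_up = Hid[i - 1, j] if i > 0 else np.zeros(d)
            h_left = Hid[i, j - 1] if j > 0 else np.zeros(d)
            Hid[i, j] = np.tanh(Wx @ X[i, j] + Wh_rows @ h_up
                                + Wh_cols @ h_left + b)
    return Hid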
20. Multidirectional multidimensional RNN (MDMDRNN?)
The previously mentioned ordering is not the only possible one. It might be OK for
some tasks, but it is usually preferable for the network to have access to the
surrounding context in all directions. This is particularly true for tasks where precise
localisation is required, such as image segmentation.
For one dimensional RNNs, the problem of multidirectional context was solved by
the introduction of bidirectional recurrent neural networks (BRNNs). BRNNs contain
two separate hidden layers that process the input sequence in the forward and
reverse directions.
BRNNs can be extended to n-dimensional data by using 2^n separate hidden layers,
each of which processes the sequence using the ordering defined above, but with a
different choice of axes.
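A minimal sketch (not in the original notes) of this multidirectional extension in 2D, reusing the mdrnn_2d sketch from above: run 2^2 = 4 separate hidden layers, one per choice of axis directions, and concatenate their re-aligned outputs so every position sees context from all sides.

# Minimal sketch: multidirectional 2D processing with 4 separate hidden layers.
import numpy as np

def multidirectional_2d(X, layers):
    """layers: list of 4 parameter tuples (Wx, Wh_rows, Wh_cols, b)."""
    outputs = []
    flips = [(False, False), (False, True), (True, False), (True, True)]
    for (flip_rows, flip_cols), params in zip(flips, layers):
        Xo = X[::-1] if flip_rows else X
        Xo = Xo[:, ::-1] if flip_cols else Xo
        H = mdrnn_2d(Xo, *params)          # scan with this choice of axes
        H = H[::-1] if flip_rows else H    # re-align to the original layout
        H = H[:, ::-1] if flip_cols else H
        outputs.append(H)
    return np.concatenate(outputs, axis=-1)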
22. Tree-LSTM (2015)
Interesting LSTM generalisation: Tree-LSTM
“However, natural language exhibits syntactic
properties that would naturally combine
words to phrases. We introduce the
Tree-LSTM, a generalization of LSTMs to
tree-structured network topologies.
Tree-LSTMs outperform all existing systems
and strong LSTM baselines on two tasks:
predicting the semantic relatedness of two
sentences and sentiment classification.”
Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks,
https://arxiv.org/abs/1503.00075
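A minimal sketch (not in the original notes) of the Child-Sum Tree-LSTM node update from the cited paper: the children's hidden states are summed for the input, output and candidate gates, and each child gets its own forget gate over its cell state. The parameter layout is illustrative.

# Minimal sketch: one Child-Sum Tree-LSTM node update.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tree_lstm_node(x, children, W, U, b):
    """children: list of (h_k, c_k) pairs; W/U/b: dicts of per-gate parameters."""
    h_sum = sum((h for h, _ in children), np.zeros_like(b["i"]))
    i = sigmoid(W["i"] @ x + U["i"] @ h_sum + b["i"])
    o = sigmoid(W["o"] @ x + U["o"] @ h_sum + b["o"])
    u = np.tanh(W["u"] @ x + U["u"] @ h_sum + b["u"])
    c = i * u
    for h_k, c_k in children:                      # one forget gate per child
        f_k = sigmoid(W["f"] @ x + U["f"] @ h_k + b["f"])
        c = c + f_k * c_k
    return o * np.tanh(c), c                       # (h_j, c_j)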
23. Grid LSTM (2016)
Another interesting LSTM generalisation: Grid LSTM
“This paper introduces Grid Long Short-Term Memory, a network of LSTM cells
arranged in a multidimensional grid that can be applied to vectors, sequences
or higher dimensional data such as images. The network differs from existing
deep LSTM architectures in that the cells are connected between network layers
as well as along the spatiotemporal dimensions of the data. The network provides
a unified way of using LSTM for both deep and sequential computation.”
Grid Long Short-Term Memory, https://arxiv.org/abs/1507.01526
25. Grid LSTM (2016)
One-dimensional Grid LSTM corresponds to a feed-forward network that uses
LSTM cells in place of transfer functions such as tanh and ReLU. These networks
are related to Highway Networks (Srivastava et al., 2015) where a gated transfer
function is used to successfully train feed-forward networks with up to 900 layers
of depth.
Grid LSTM with two dimensions is analogous to the Stacked LSTM, but it adds
cells along the depth dimension too.
Grid LSTM with three or more dimensions is analogous to Multidimensional
LSTM, but differs from it not just by having the cells along the depth dimension,
but also by using the proposed mechanism for modulating the N-way interaction
that is not prone to the instability present in Multidimensional LSTM.
27. End of Intro
From here on we will not distinguish between RNN/GRU/LSTM and will usually use
the word RNN for any kind of internal block. In practice, most RNNs now are
actually LSTMs.
29. Encoding semantics
Using word2vec instead of word indexes lets you deal better with word meanings
(e.g. no need to enumerate all synonyms, because their vectors are already close
to each other).
But the naive way of working with word2vec vectors still gives you a “bag of words”
model, where the phrases “The man killed the tiger” and “The tiger killed the man”
get identical representations.
We need models that pay attention to word ordering: paragraph2vec, sentence
embeddings (using RNN/LSTM), even World2Vec (LeCun @CVPR2015).
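To make the “bag of words” point concrete, here is a minimal sketch (not in the original notes) with made-up toy word vectors: averaging word embeddings gives exactly the same representation for both sentences.

# Minimal sketch: averaged word embeddings lose word order.
import numpy as np

vectors = {                      # hypothetical toy embeddings
    "the":    np.array([0.1, 0.0, 0.2]),
    "man":    np.array([0.9, 0.1, 0.3]),
    "tiger":  np.array([0.2, 0.8, 0.5]),
    "killed": np.array([0.4, 0.4, 0.9]),
}

def bow_embedding(sentence):
    return np.mean([vectors[w] for w in sentence.lower().split()], axis=0)

a = bow_embedding("The man killed the tiger")
b = bow_embedding("The tiger killed the man")
print(np.allclose(a, b))         # True: word order is lost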
34. Multi-modal Learning
Deep Learning models are becoming multi-modal: they use 2+ modalities
simultaneously, e.g.:
● Image caption generation: images + text
● Search Web by an image: images + text
● Video description: the same, but with an added time dimension
● Visual question answering: images + text
● Speech recognition: audio + video (lips motion)
● Image classification and navigation: RGB-D (color + depth)
Where is all this heading?
● A common metric space for all concepts, a “thought vector”. It will then be possible
to match different modalities easily.
38. Example: Text generation by image
http://arxiv.org/abs/1411.4555 “Show and Tell: A Neural Image Caption Generator”
39. Example: Image generation by text
StackGAN: Text to Photo-realistic Image Synthesis with
Stacked Generative Adversarial Networks, https://arxiv.org/abs/1612.03242
40. Example: Code generation by image
pix2code: Generating Code from a Graphical User Interface Screenshot,
https://arxiv.org/abs/1705.07962
42. Sequence to Sequence Learning (seq2seq)
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
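A minimal sketch (not in the original notes) of the seq2seq loop: an encoder RNN reads the whole source into a state, and a decoder RNN emits target tokens one at a time, feeding each prediction back in (greedy decoding here). rnn_step, embed and project are hypothetical stand-ins for a trained recurrent cell, embedding table and output layer.

# Minimal sketch: encoder-decoder with greedy decoding.
import numpy as np

def seq2seq_greedy(src_ids, params, bos_id, eos_id, max_len=50):
    h = np.zeros(params["hidden"])
    for tok in src_ids:                       # encoder: read the whole source
        h = params["rnn_step"](params["embed"](tok), h)
    out, tok = [], bos_id
    for _ in range(max_len):                  # decoder: emit one token at a time
        h = params["rnn_step"](params["embed"](tok), h)
        tok = int(np.argmax(params["project"](h)))   # greedy choice
        if tok == eos_id:
            break
        out.append(tok)
    return out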
43. Another useful thing: CTC Output Layer
CTC (Connectionist Temporal Classification; Graves, Fernández, Gomez,
Schmidhuber, 2006) was specifically designed for temporal classification tasks; that
is, for sequence labelling problems where the alignment between the inputs and the
target labels is unknown.
CTC models all aspects of the sequence with a single neural network, and does not
require the network to be combined with a hidden Markov model. It also does not
require presegmented training data, or external post-processing to extract the
label sequence from the network outputs.
The CTC network predicts only the sequence of phonemes (typically as a series
of spikes, separated by ‘blanks’, or null predictions), while the framewise network
attempts to align them with the manual segmentation.
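A minimal sketch (not in the original notes) of the simplest CTC decoding rule implied by the description above: take the best label per frame, collapse repeated labels, then drop the blanks.

# Minimal sketch: greedy CTC decoding (collapse repeats, remove blanks).
import numpy as np

def ctc_greedy_decode(frame_probs, blank=0):
    """frame_probs: (T, num_labels) per-frame label probabilities."""
    best = np.argmax(frame_probs, axis=1)
    out, prev = [], None
    for label in best:
        if label != prev and label != blank:   # collapse repeats, skip blanks
            out.append(int(label))
        prev = label
    return out

# e.g. frames predicting [-, a, a, -, b, b, -] collapse to [a, b]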
55. Encoder-Decoder: modern architecture
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation,
https://arxiv.org/abs/1609.08144
Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation
https://arxiv.org/abs/1611.04558
58. The solution #1: CNN encoder
A Convolutional Encoder Model for Neural Machine Translation, https://arxiv.org/abs/1611.02344
Convolutional Encoder / Recurrent Decoder
59. The solution #1.5: CNN encoder + decoder
Convolutional Sequence to Sequence Learning, https://arxiv.org/abs/1705.03122
Actually no RNN here (Facebook AI Research loves CNNs).
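A minimal sketch (not in the original notes) of what a convolutional encoder does in place of a recurrent one: a 1D convolution over the token-embedding sequence, so every position is computed in parallel and the context window grows with the number of stacked layers. Shapes and the 'same'-padding choice are illustrative.

# Minimal sketch: one 1D convolution layer over a sequence of embeddings.
import numpy as np

def conv1d(X, K, b):
    """X: (T, d_in) embeddings; K: (width, d_in, d_out) kernel; zero 'same' padding."""
    width, _, d_out = K.shape
    pad = width // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))
    T = X.shape[0]
    out = np.zeros((T, d_out))
    for t in range(T):
        window = Xp[t:t + width]                    # (width, d_in) context window
        out[t] = np.einsum("wi,wio->o", window, K) + b
    return np.tanh(out)

# stacking such layers widens the receptive field, as in the cited CNN encoder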
60. The solution #2: Optimizing RNNs
Exploring Sparsity in Recurrent Neural Networks, https://arxiv.org/abs/1704.05119
“Pruning RNNs reduces the size of the model and can also help achieve significant
inference time speed-up using sparse matrix multiply. Benchmarks show that using
our technique model size can be reduced by 90% and speed-up is around 2× to
7×.”
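A minimal sketch (not in the original notes) of magnitude pruning, the kind of technique the quote refers to: zero out the smallest-magnitude weights so that sparse kernels can skip them. The 90% threshold mirrors the number quoted above.

# Minimal sketch: magnitude pruning of a weight matrix.
import numpy as np

def prune(W, sparsity=0.9):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    thresh = np.quantile(np.abs(W), sparsity)
    return np.where(np.abs(W) < thresh, 0.0, W)

W = np.random.randn(1024, 1024)
W_sparse = prune(W, 0.90)              # ~90% of entries become exact zeros
print((W_sparse == 0).mean())          # ≈ 0.9; sparse matmul can now skip them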
62. The solution #3: Better hardware
Why could this solution be among the most interesting ones?
The current success of NNs (especially CNNs) is backed by the large amounts of data
available _AND_ by more powerful hardware (running decades-old algorithms). We
could potentially have achieved the same performance in the past, but the learning
process was just too slow (and we were too impatient).
Processor performance grows exponentially, and in 5-10 years the available
computing power may increase 1000x. Computing units better suited to RNN
computations may appear as well.
The situation could repeat itself. When hardware allows fast training of RNNs, we
could achieve a new class of results. Remember, RNNs are Turing-complete. They
are (potentially) much more powerful than feed-forward NNs.