4. Artificial Neural Network (ANN)
An artificial neuron is the elementary unit of an artificial neural network.
5. Feedforward NN vs. Recurrent NN
Recurrent neural networks (RNNs) allow cyclical connections.
6. Unfolding the RNN and training using BPTT
Can do backprop on the unfolded network: Backpropagation through time (BPTT)
http://ir.hit.edu.cn/~jguo/docs/notes/bptt.pdf
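As a concrete illustration (not in the original notes), here is a minimal numpy sketch of the idea: the recurrence is unfolded over the time steps of one sequence, and ordinary backprop is run over the unfolded graph. All names, shapes and hyperparameters are illustrative.

# Minimal sketch: unroll a vanilla tanh RNN over a sequence, then run
# backpropagation through time (BPTT) on the unfolded graph (numpy only).
import numpy as np

def rnn_bptt(xs, ys, Wxh, Whh, Why, bh, by):
    """xs, ys: lists of input/target column vectors; returns loss and grads."""
    hs = {-1: np.zeros((Whh.shape[0], 1))}
    outs, loss = {}, 0.0
    # forward pass = unfolding the recurrence in time
    for t in range(len(xs)):
        hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t - 1] + bh)
        outs[t] = Why @ hs[t] + by
        loss += 0.5 * np.sum((outs[t] - ys[t]) ** 2)
    # backward pass = ordinary backprop on the unfolded network
    g = {name: np.zeros_like(p) for name, p in
         [("Wxh", Wxh), ("Whh", Whh), ("Why", Why), ("bh", bh), ("by", by)]}
    dh_next = np.zeros_like(hs[-1])
    for t in reversed(range(len(xs))):
        dy = outs[t] - ys[t]
        g["Why"] += dy @ hs[t].T
        g["by"] += dy
        dh = Why.T @ dy + dh_next          # gradient arriving at h_t
        dpre = (1.0 - hs[t] ** 2) * dh     # through the tanh nonlinearity
        g["bh"] += dpre
        g["Wxh"] += dpre @ xs[t].T
        g["Whh"] += dpre @ hs[t - 1].T
        dh_next = Whh.T @ dpre             # flows back to the previous time step
    return loss, g

# illustrative usage with random weights and a length-4 sequence
n_in, n_hid, n_out, T = 3, 5, 2, 4
p = [np.random.randn(*s) * 0.1 for s in
     [(n_hid, n_in), (n_hid, n_hid), (n_out, n_hid), (n_hid, 1), (n_out, 1)]]
xs = [np.random.randn(n_in, 1) for _ in range(T)]
ys = [np.random.randn(n_out, 1) for _ in range(T)]
loss, grads = rnn_bptt(xs, ys, *p)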
7. Neural Network properties
Feedforward NN (FFNN):
● FFNN is a universal approximator: a feed-forward network with a single hidden layer
containing a finite number of hidden neurons can approximate continuous functions
on compact subsets of R^n, under mild assumptions on the activation function
(a minimal sketch follows below).
● Typical FFNNs have no inherent notion of order in time. Apart from what was learned
during training, they retain no memory of past inputs.
Recurrent NN (RNN):
● RNNs are Turing-complete: they can compute anything that can be computed and
have the capacity to simulate arbitrary procedures.
● RNNs possess a certain type of memory. They are much better suited to dealing with
sequences, context modeling and time dependencies.
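To make the universal-approximator bullet above concrete, here is a minimal sketch (not in the original notes): a single hidden layer of tanh units, trained with plain gradient descent, approximating sin(x) on an interval. The architecture and hyperparameters are illustrative.

# Minimal sketch: one hidden layer of tanh units fitted to sin(x) with
# full-batch gradient descent on mean squared error (numpy only).
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(-np.pi, np.pi, 256).reshape(-1, 1)      # inputs
Y = np.sin(X)                                           # target function
H = 32                                                  # hidden units
W1, b1 = rng.normal(0, 1, (1, H)), np.zeros(H)
W2, b2 = rng.normal(0, 1, (H, 1)), np.zeros(1)

lr = 0.05
for step in range(5000):
    A = np.tanh(X @ W1 + b1)          # hidden layer
    P = A @ W2 + b2                   # linear output
    err = P - Y
    # gradients of the mean squared error
    dW2 = A.T @ err / len(X); db2 = err.mean(0)
    dA = err @ W2.T * (1 - A ** 2)
    dW1 = X.T @ dA / len(X); db1 = dA.mean(0)
    W1 -= lr * dW1; b1 -= lr * db1; W2 -= lr * dW2; b2 -= lr * db2

print("final MSE:", float((err ** 2).mean()))   # typically small once trained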
12. Comparing LSTM and Simple RNN
More on LSTMs: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
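A minimal sketch (not in the original notes) of the difference at the level of a single time step: the simple RNN overwrites its whole hidden state through a tanh, while the LSTM keeps an additively updated cell state controlled by gates, which is what lets gradients survive over long ranges. Parameter names and shapes are illustrative.

# Minimal sketch: one step of a simple tanh RNN cell vs. one step of an LSTM cell.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simple_rnn_step(x, h, Wx, Wh, b):
    # the whole state is a single hidden vector, overwritten at every step
    return np.tanh(Wx @ x + Wh @ h + b)

def lstm_step(x, h, c, W, U, b):
    # W, U, b hold the stacked parameters of the four gates (i, f, o, g)
    z = W @ x + U @ h + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c_new = f * c + i * g          # additive cell update: gradients flow further
    h_new = o * np.tanh(c_new)
    return h_new, c_new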
13. Another solution: Gated Recurrent Unit (GRU)
GRU (Cho et al., 2014) is a bit simpler than LSTM (fewer weights)
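For comparison, a minimal sketch (not in the original notes) of one GRU step: two gates (update and reset) and no separate cell state, hence fewer weight matrices than the LSTM. Names and shapes are illustrative.

# Minimal sketch: one GRU step.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    z = sigmoid(Wz @ x + Uz @ h)              # update gate
    r = sigmoid(Wr @ x + Ur @ h)              # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
    return (1 - z) * h + z * h_tilde          # interpolate old and new state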
14. Bidirectional RNN/LSTM
There are many situations when you see the whole sequence at once (OCR, speech
recognition, translation, caption generation, …). So you can scan the [1D] sequence
in both directions, forward and backward.
This is where BRNN/BLSTM (Graves, Schmidhuber, 2005) come in.
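A minimal sketch (not in the original notes) of the bidirectional idea: two independent recurrent passes, one forward and one backward, whose hidden states are concatenated at each position. step_fwd and step_bwd stand in for any RNN/GRU/LSTM cell.

# Minimal sketch: bidirectional scan over a sequence.
import numpy as np

def birnn(xs, step_fwd, step_bwd, h0_fwd, h0_bwd):
    """xs: list of input vectors; step_*: functions (x, h) -> h."""
    hs_fwd, h = [], h0_fwd
    for x in xs:                       # left-to-right pass
        h = step_fwd(x, h)
        hs_fwd.append(h)
    hs_bwd, h = [], h0_bwd
    for x in reversed(xs):             # right-to-left pass
        h = step_bwd(x, h)
        hs_bwd.append(h)
    hs_bwd.reverse()
    # each output position now sees context from both directions
    return [np.concatenate([f, b]) for f, b in zip(hs_fwd, hs_bwd)]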
17. Multidirectional Multidimensional RNN/LSTM
Standard RNNs are inherently one dimensional, and therefore poorly suited to
multidimensional data (e.g. images).
The basic idea of MDRNNs (Graves, Fernandez, Schmidhuber, 2007) is to replace
the single recurrent connection found in standard RNNs with as many recurrent
connections as there are dimensions in the data.
It assumes some ordering on the multidimensional data. BRNNs can be extended to
n-dimensional data by using 2^n separate hidden layers.
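A minimal sketch (not in the original notes) of the MDRNN recurrence for the 2D case: with a raster-scan ordering, the hidden state at (i, j) receives recurrent input from (i-1, j) and (i, j-1), i.e. one recurrent connection per data dimension. Shapes and names are illustrative.

# Minimal sketch: 2D MDRNN scan over an image-like array.
import numpy as np

def mdrnn_2d(X, Wx, Wh_rows, Wh_cols, b):
    """X: (H, W, d_in) array; returns (H, W, d_hid) hidden states."""
    H, W, _ = X.shape
    d = b.shape[0]
    Hid = np.zeros((H, W, d))
    for i in range(H):                 # scan in a fixed raster ordering
        for j in range(W):
            h_up = Hid[i - 1, j] if i > 0 else np.zeros(d)
            h_left = Hid[i, j - 1] if j > 0 else np.zeros(d)
            Hid[i, j] = np.tanh(Wx @ X[i, j] + Wh_rows @ h_up
                                + Wh_cols @ h_left + b)
    return Hid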
20. Multidirectional multidimensional RNN (MDMDRNN?)
The previously mentioned ordering is not the only possible one. It might be OK for
some tasks, but it is usually preferable for the network to have access to the
surrounding context in all directions. This is particularly true for tasks where precise
localisation is required, such as image segmentation.
For one dimensional RNNs, the problem of multidirectional context was solved by
the introduction of bidirectional recurrent neural networks (BRNNs). BRNNs contain
two separate hidden layers that process the input sequence in the forward and
reverse directions.
BRNNs can be extended to n-dimensional data by using 2^n separate hidden layers,
each of which processes the sequence using the ordering defined above, but with a
different choice of axes.
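A minimal sketch (not in the original notes) of this multidirectional extension in 2D, reusing the mdrnn_2d sketch from above: run 2^2 = 4 separate hidden layers, one per choice of axis directions, and concatenate their re-aligned outputs so every position sees context from all sides.

# Minimal sketch: multidirectional 2D processing with 4 separate hidden layers.
import numpy as np

def multidirectional_2d(X, layers):
    """layers: list of 4 parameter tuples (Wx, Wh_rows, Wh_cols, b)."""
    outputs = []
    flips = [(False, False), (False, True), (True, False), (True, True)]
    for (flip_rows, flip_cols), params in zip(flips, layers):
        Xo = X[::-1] if flip_rows else X
        Xo = Xo[:, ::-1] if flip_cols else Xo
        H = mdrnn_2d(Xo, *params)          # scan with this choice of axes
        H = H[::-1] if flip_rows else H    # re-align to the original layout
        H = H[:, ::-1] if flip_cols else H
        outputs.append(H)
    return np.concatenate(outputs, axis=-1)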
22. Tree-LSTM (2015)
Interesting LSTM generalisation: Tree-LSTM
“However, natural language exhibits syntactic
properties that would naturally combine
words to phrases. We introduce the
Tree-LSTM, a generalization of LSTMs to
tree-structured network topologies.
Tree-LSTMs outperform all existing systems
and strong LSTM baselines on two tasks:
predicting the semantic relatedness of two
sentences and sentiment classification.”
Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks,
https://arxiv.org/abs/1503.00075
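A minimal sketch (not in the original notes) of the Child-Sum Tree-LSTM node update from the cited paper: the children's hidden states are summed for the input, output and candidate gates, and each child gets its own forget gate over its cell state. The parameter layout is illustrative.

# Minimal sketch: one Child-Sum Tree-LSTM node update.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tree_lstm_node(x, children, W, U, b):
    """children: list of (h_k, c_k) pairs; W/U/b: dicts of per-gate parameters."""
    h_sum = sum((h for h, _ in children), np.zeros_like(b["i"]))
    i = sigmoid(W["i"] @ x + U["i"] @ h_sum + b["i"])
    o = sigmoid(W["o"] @ x + U["o"] @ h_sum + b["o"])
    u = np.tanh(W["u"] @ x + U["u"] @ h_sum + b["u"])
    c = i * u
    for h_k, c_k in children:                      # one forget gate per child
        f_k = sigmoid(W["f"] @ x + U["f"] @ h_k + b["f"])
        c = c + f_k * c_k
    return o * np.tanh(c), c                       # (h_j, c_j)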
23. Grid LSTM (2016)
Another interesting LSTM generalisation: Grid LSTM
“This paper introduces Grid Long Short-Term Memory, a network of LSTM cells
arranged in a multidimensional grid that can be applied to vectors, sequences
or higher dimensional data such as images. The network differs from existing
deep LSTM architectures in that the cells are connected between network layers
as well as along the spatiotemporal dimensions of the data. The network provides
a unified way of using LSTM for both deep and sequential computation.”
Grid Long Short-Term Memory, https://arxiv.org/abs/1507.01526
25. Grid LSTM (2016)
One-dimensional Grid LSTM corresponds to a feed-forward network that uses
LSTM cells in place of transfer functions such as tanh and ReLU. These networks
are related to Highway Networks (Srivastava et al., 2015) where a gated transfer
function is used to successfully train feed-forward networks with up to 900 layers
of depth.
Grid LSTM with two dimensions is analogous to the Stacked LSTM, but it adds
cells along the depth dimension too.
Grid LSTM with three or more dimensions is analogous to Multidimensional
LSTM, but differs from it not just by having the cells along the depth dimension,
but also by using the proposed mechanism for modulating the N-way interaction
that is not prone to the instability present in Multidimensional LSTM.
27. End of Intro
From here on we will not distinguish between RNN/GRU/LSTM and will usually use
the word RNN for any kind of internal block. In practice, most RNNs now are
actually LSTMs.
29. Encoding semantics
Using word2vec instead of word indexes lets you deal better with word meanings
(e.g. no need to enumerate all synonyms, because their vectors are already close
to each other).
But the naive way of working with word2vec vectors still gives you a “bag of words”
model, where the phrases “The man killed the tiger” and “The tiger killed the man”
get identical representations.
We need models that pay attention to word ordering: paragraph2vec, sentence
embeddings (using RNN/LSTM), even World2Vec (LeCun @CVPR2015).
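To make the “bag of words” point concrete, here is a minimal sketch (not in the original notes) with made-up toy word vectors: averaging word embeddings gives exactly the same representation for both sentences.

# Minimal sketch: averaged word embeddings lose word order.
import numpy as np

vectors = {                      # hypothetical toy embeddings
    "the":    np.array([0.1, 0.0, 0.2]),
    "man":    np.array([0.9, 0.1, 0.3]),
    "tiger":  np.array([0.2, 0.8, 0.5]),
    "killed": np.array([0.4, 0.4, 0.9]),
}

def bow_embedding(sentence):
    return np.mean([vectors[w] for w in sentence.lower().split()], axis=0)

a = bow_embedding("The man killed the tiger")
b = bow_embedding("The tiger killed the man")
print(np.allclose(a, b))         # True: word order is lost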
34. Multi-modal Learning
Deep Learning models are becoming multi-modal: they use 2+ modalities
simultaneously, e.g.:
● Image caption generation: images + text
● Search Web by an image: images + text
● Video description: the same, but with an added time dimension
● Visual question answering: images + text
● Speech recognition: audio + video (lips motion)
● Image classification and navigation: RGB-D (color + depth)
Where is all this heading?
● A common metric space for all concepts, a “thought vector”. It will then be possible
to match different modalities easily.
38. Example: Text generation by image
http://arxiv.org/abs/1411.4555 “Show and Tell: A Neural Image Caption Generator”
39. Example: Image generation by text
StackGAN: Text to Photo-realistic Image Synthesis with
Stacked Generative Adversarial Networks, https://arxiv.org/abs/1612.03242
40. Example: Code generation by image
pix2code: Generating Code from a Graphical User Interface Screenshot,
https://arxiv.org/abs/1705.07962
42. Sequence to Sequence Learning (seq2seq)
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
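A minimal sketch (not in the original notes) of the seq2seq loop: an encoder RNN reads the whole source into a state, and a decoder RNN emits target tokens one at a time, feeding each prediction back in (greedy decoding here). rnn_step, embed and project are hypothetical stand-ins for a trained recurrent cell, embedding table and output layer.

# Minimal sketch: encoder-decoder with greedy decoding.
import numpy as np

def seq2seq_greedy(src_ids, params, bos_id, eos_id, max_len=50):
    h = np.zeros(params["hidden"])
    for tok in src_ids:                       # encoder: read the whole source
        h = params["rnn_step"](params["embed"](tok), h)
    out, tok = [], bos_id
    for _ in range(max_len):                  # decoder: emit one token at a time
        h = params["rnn_step"](params["embed"](tok), h)
        tok = int(np.argmax(params["project"](h)))   # greedy choice
        if tok == eos_id:
            break
        out.append(tok)
    return out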
43. Another useful thing: CTC Output Layer
CTC (Connectionist Temporal Classification; Graves, Fernández, Gomez,
Schmidhuber, 2006) was specifically designed for temporal classification tasks; that
is, for sequence labelling problems where the alignment between the inputs and the
target labels is unknown.
CTC models all aspects of the sequence with a single neural network, and does not
require the network to be combined with a hidden Markov model. It also does not
require presegmented training data, or external post-processing to extract the
label sequence from the network outputs.
The CTC network predicts only the sequence of phonemes (typically as a series
of spikes, separated by ‘blanks’, or null predictions), while the framewise network
attempts to align them with the manual segmentation.
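A minimal sketch (not in the original notes) of the simplest CTC decoding rule implied by the description above: take the best label per frame, collapse repeated labels, then drop the blanks.

# Minimal sketch: greedy CTC decoding (collapse repeats, remove blanks).
import numpy as np

def ctc_greedy_decode(frame_probs, blank=0):
    """frame_probs: (T, num_labels) per-frame label probabilities."""
    best = np.argmax(frame_probs, axis=1)
    out, prev = [], None
    for label in best:
        if label != prev and label != blank:   # collapse repeats, skip blanks
            out.append(int(label))
        prev = label
    return out

# e.g. frames predicting [-, a, a, -, b, b, -] collapse to [a, b]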
55. Encoder-Decoder: modern architecture
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation,
https://arxiv.org/abs/1609.08144
Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation
https://arxiv.org/abs/1611.04558
58. The solution #1: CNN encoder
A Convolutional Encoder Model for Neural Machine Translation, https://arxiv.org/abs/1611.02344
Convolutional Encoder / Recurrent Decoder
59. The solution #1.5: CNN encoder + decoder
Convolutional Sequence to Sequence Learning, https://arxiv.org/abs/1705.03122
Actually no RNN here (Facebook AI Research loves CNNs).
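A minimal sketch (not in the original notes) of what a convolutional encoder does in place of a recurrent one: a 1D convolution over the token-embedding sequence, so every position is computed in parallel and the context window grows with the number of stacked layers. Shapes and the 'same'-padding choice are illustrative.

# Minimal sketch: one 1D convolution layer over a sequence of embeddings.
import numpy as np

def conv1d(X, K, b):
    """X: (T, d_in) embeddings; K: (width, d_in, d_out) kernel; zero 'same' padding."""
    width, _, d_out = K.shape
    pad = width // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))
    T = X.shape[0]
    out = np.zeros((T, d_out))
    for t in range(T):
        window = Xp[t:t + width]                    # (width, d_in) context window
        out[t] = np.einsum("wi,wio->o", window, K) + b
    return np.tanh(out)

# stacking such layers widens the receptive field, as in the cited CNN encoder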
60. The solution #2: Optimizing RNNs
Exploring Sparsity in Recurrent Neural Networks, https://arxiv.org/abs/1704.05119
“Pruning RNNs reduces the size of the model and can also help achieve significant
inference time speed-up using sparse matrix multiply. Benchmarks show that using
our technique model size can be reduced by 90% and speed-up is around 2× to
7×.”
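A minimal sketch (not in the original notes) of magnitude pruning, the kind of technique the quote refers to: zero out the smallest-magnitude weights so that sparse kernels can skip them. The 90% threshold mirrors the number quoted above.

# Minimal sketch: magnitude pruning of a weight matrix.
import numpy as np

def prune(W, sparsity=0.9):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    thresh = np.quantile(np.abs(W), sparsity)
    return np.where(np.abs(W) < thresh, 0.0, W)

W = np.random.randn(1024, 1024)
W_sparse = prune(W, 0.90)              # ~90% of entries become exact zeros
print((W_sparse == 0).mean())          # ≈ 0.9; sparse matmul can now skip them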
62. The solution #3: Better hardware
Why could this solution be among the most interesting ones?
The current success of NNs (especially CNNs) is backed by the large amounts of data
available _AND_ by more powerful hardware (running decades-old algorithms). We
could potentially have achieved the same performance in the past, but the learning
process was just too slow (and we were too impatient).
Processor performance grows exponentially, and in 5-10 years the available
computing power may increase 1000x. Computing units better suited to RNN
computations may appear as well.
The situation could repeat itself. When hardware allows fast training of RNNs, we
could achieve a new class of results. Remember, RNNs are Turing-complete. They
are (potentially) much more powerful than feed-forward NNs.