Sequence Modelling
with Deep Learning
ODSC London 2019 Tutorial
Natasha Latysheva
I. Introduction to sequence modelling
II. Quick neural network review
• Feed-forward networks
III. Recurrent neural networks
• From feed-forward networks to recurrence
• RNNs with gating mechanisms
IV. Practical: Building a language model for Game of Thrones
V. Components of state-of-the-art RNN models
• Encoder-decoder models
• Bidirectionality
• Attention
VI. Transformers and self-attention
Speaker Intro
• Welocalize
• We provide language services
• Fairly large, by revenue 8th largest globally,
4th largest US. 1500+ employees.
• Lots of localisation (translation)
• International marketing, site optimisation
• NLP engineering team
• 14 people remote across US, Ireland, UK,
Germany, China
• Various NLP things: machine translation,
text-to-speech, NER, sentiment, topics,
classification, etc.
I. Introduction to Sequence Modelling
Other sequence problems
Less conventional sequence data
• Activity on a website:
• [click_button, move_cursor, wait,
wait, click_subscribe, close_tab]
• Customer history:
• [inactive -> mildly_active ->
payment_made -> complaint_filed
-> inactive -> account_closed]
• Code (constrained language) is
sequential data – can learn the
II. Quick Neural Network Review
Feed-forward networks
Simplifying the notation
• Single neurons
• Weight matrices, bias vectors
• Fully-connected layer
III. Recurrent Neural Networks
Why do we need fancy methods to
model sequences?
• Say we are training a translation
model, English->French
• “The cat is black” to “Le chat est noir”
• Could in theory use a feed-
forward network to translate
Why do we need fancy methods?
• A feed-forward network treats
time steps as completely independent
• Even in this simple 1-to-1
correspondence example, things
are broken
• How you translate “black” depends
on noun gender (“noir” vs. “noire”)
• How you translate “The” also
depends on gender (“Le” vs. “La”)
• More generally, getting the
translation right requires context
Why do we need fancy methods?
• We need a way for the network
to remember information from
previous time steps
Recurrent neural networks
• Extremely popular way of modelling
sequential data
• Process data one time step at a
time, while updating a running
internal hidden state
Standard FF network to RNN
• At each time step, RNN
passes on its activations
from previous time step
• In theory all the way back
to the first time step
Standard FF network to RNN
*Activation function probably tanh or ReLU
Standard FF network to RNN
• So you can say this is a
form of memory
• Cell hidden state
• Basis for RNNs
remembering context
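The recurrence described on these slides can be sketched in plain Python. This is a minimal illustration rather than the tutorial's own notebook code: the names `rnn_step`, `run_rnn`, `W_xh`, `W_hh` and `b_h`, and all the toy numbers, are assumptions chosen for the example, and tanh is used as the activation.

```python
import math

def rnn_step(x, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    h = []
    for i in range(len(b_h)):
        total = b_h[i]
        total += sum(w * xi for w, xi in zip(W_xh[i], x))       # input contribution
        total += sum(w * hi for w, hi in zip(W_hh[i], h_prev))  # memory contribution
        h.append(math.tanh(total))
    return h

def run_rnn(inputs, W_xh, W_hh, b_h):
    """Process a sequence one time step at a time, carrying the hidden state."""
    h = [0.0] * len(b_h)  # initial hidden state (assumed to be all zeros)
    states = []
    for x in inputs:
        h = rnn_step(x, h, W_xh, W_hh, b_h)
        states.append(h)
    return states

# Toy parameters (hypothetical values, 2 inputs -> 2 hidden units)
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [-0.4, 0.6]]
b_h = [0.0, 0.1]
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

states = run_rnn(sequence, W_xh, W_hh, b_h)
for t, h in enumerate(states):
    print(t, [round(v, 3) for v in h])
```

Because the same `W_hh` is applied at every step, the hidden state at the last step still depends on the first input, which is the "memory" the slides refer to.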
Memory problems
• Basic RNNs not great at
long-term dependencies
but plenty of ways to
improve this
• Information gating
• Condensing input using
Gating mechanisms
• Gates regulate the flow of information
• Very helpful - basic RNN cells not really
used anymore. Responsible for recent
RNN popularity.
• Add explicit mechanisms to remember
information and forget information
• Why use gates?
• Helps you learn long-term dependencies
• Not all time points are equally relevant
– not everything has to be remembered
• Speeds up training/convergence
Gated recurrent
units (GRUs)
• GRUs were developed later
than LSTMs but are simpler
• Motivation is to get the main
benefits of LSTMs but with less
• Reset gate: Mechanism to
decide when to remember vs.
forget/reset previous
information (hidden state)
• Update gate: Mechanism to
decide when to update
hidden state
GRU mechanics
• Reset gate controls how
much past info we use
• Rt = 0 means we are resetting
our RNN, not using any
previous information
• Rt = 1 means we use all of
previous information (back to
our normal vanilla RNN)
GRU mechanics
• Update gate controls whether
we bother updating our
hidden state using new
• Zt = 1 means you’re not
updating, you’re just using
previous hidden state
• Zt = 0 means you’re updating as
much as possible
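The two gates can be sketched as a single GRU step in plain Python. This is a minimal sketch under stated assumptions, not the tutorial's code: the parameter names (`W_r`, `U_r`, `b_r`, and so on), the helper `make_params` and the toy values are hypothetical. It follows the convention used on these slides, where Zt = 1 keeps the previous hidden state and Zt = 0 replaces it; some libraries define the update gate the other way round.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def affine(W, U, b, x, h):
    """Row-wise W x + U h + b for small list-based matrices."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + b_i
        for W_row, U_row, b_i in zip(W, U, b)
    ]

def gru_step(x, h_prev, p):
    # Reset gate: how much of the previous hidden state feeds the candidate
    r = [sigmoid(a) for a in affine(p["W_r"], p["U_r"], p["b_r"], x, h_prev)]
    # Update gate: 1 = keep previous state, 0 = take the new candidate
    z = [sigmoid(a) for a in affine(p["W_z"], p["U_z"], p["b_z"], x, h_prev)]
    gated_h = [r_i * h_i for r_i, h_i in zip(r, h_prev)]
    cand = [math.tanh(a) for a in affine(p["W_h"], p["U_h"], p["b_h"], x, gated_h)]
    return [z_i * h_i + (1.0 - z_i) * c_i for z_i, h_i, c_i in zip(z, h_prev, cand)]

def make_params(b_r, b_z):
    """Hypothetical 1-unit GRU whose gates are driven only by their biases."""
    return {
        "W_r": [[0.0]], "U_r": [[0.0]], "b_r": [b_r],
        "W_z": [[0.0]], "U_z": [[0.0]], "b_z": [b_z],
        "W_h": [[1.0]], "U_h": [[1.0]], "b_h": [0.0],
    }

x, h_prev = [0.5], [0.9]
keep = gru_step(x, h_prev, make_params(b_r=0.0, b_z=20.0))      # Zt close to 1
reset = gru_step(x, h_prev, make_params(b_r=-20.0, b_z=-20.0))  # Rt, Zt close to 0
print(keep, reset)
```

With the update gate saturated near 1 the step returns (almost) the old hidden state; with both gates near 0 the new state depends (almost) only on the current input.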
LSTM mechanics
• LSTMs add a memory unit to
further control the flow of
information through the cell
• Also whereas GRUs have 2
gates, an LSTM cell has 3
• An input gate – should I ignore
or consider the input?
• A forget gate – should I keep
or throw away the information
in memory?
• An output gate – how should I
use input, hidden state and
memory to output my next
hidden state?
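The three gates and the memory unit can be sketched as one LSTM step in plain Python. This follows the commonly used LSTM formulation, which is assumed here rather than taken from the slides; the parameter names (`W_i`, `U_f`, `b_o`, `W_g`, ...), the helper `make_params` and the toy values are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def affine(W, U, b, x, h):
    """Row-wise W x + U h + b for small list-based matrices."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + b_i
        for W_row, U_row, b_i in zip(W, U, b)
    ]

def lstm_step(x, h_prev, c_prev, p):
    i = [sigmoid(a) for a in affine(p["W_i"], p["U_i"], p["b_i"], x, h_prev)]    # input gate
    f = [sigmoid(a) for a in affine(p["W_f"], p["U_f"], p["b_f"], x, h_prev)]    # forget gate
    o = [sigmoid(a) for a in affine(p["W_o"], p["U_o"], p["b_o"], x, h_prev)]    # output gate
    g = [math.tanh(a) for a in affine(p["W_g"], p["U_g"], p["b_g"], x, h_prev)]  # candidate
    # Memory: keep part of the old memory, add part of the candidate
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    # Hidden state: gated view of the updated memory
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

def make_params(b_i, b_f, b_o):
    """Hypothetical 1-unit LSTM whose gates are driven only by their biases."""
    zero = [[0.0]]
    return {
        "W_i": zero, "U_i": zero, "b_i": [b_i],
        "W_f": zero, "U_f": zero, "b_f": [b_f],
        "W_o": zero, "U_o": zero, "b_o": [b_o],
        "W_g": [[1.0]], "U_g": zero, "b_g": [0.0],
    }

x, h_prev, c_prev = [0.5], [0.1], [0.8]
h_keep, c_keep = lstm_step(x, h_prev, c_prev, make_params(-20.0, 20.0, 20.0))
h_new, c_new = lstm_step(x, h_prev, c_prev, make_params(20.0, -20.0, 20.0))
print(c_keep, c_new)
```

In the first call the input gate is closed and the forget gate is open, so the memory is carried over almost unchanged; in the second the old memory is thrown away and replaced by the candidate.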
GRUs vs. LSTMs
• GRUs are simpler + train
• LSTMs more popular – can
give slightly better
performance, but GRU
performance often on par
• LSTMs would in theory
outperform GRUs in tasks
requiring very long-range
IV. Game of Thrones Language Model
• ~30 mins
• Jupyter
notebook on
building an RNN-
based language
• Python 3 + Keras
for neural
V. Components of SOTA RNN models
Encoder-Decoder architectures
• Being forced to
output a French
word for every
English word
Encoder-Decoder architectures
• Tends to work a lot
better than using a
single sequence-to-
sequence RNN to
produce an output
for each input step
• You often need to
see the whole
sequence before
knowing what to
Bidirectionality in RNN encoder-decoders
• For the encoder,
bidirectional RNNs
(BRNNs) often used
• BRNNs read the
input sequences
forwards and
backwards
• Process input
sequences in both
directions
The problem with RNN encoder-decoders
• Serious information
• Condense input
sequence down to a
small vector?!
• Memorise long
sequence + regurgitate
• Not how humans work
• Long computation
Attention concept
• Has been very influential in
deep learning
• Originally developed for
MT (Bahdanau, 2014)
• As you’re producing your
output sequence, maybe
not every part of your input
is equally relevant
• Image captioning example
Lu et al. 2017. Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning.
Attention intuition
• Attention allows the
network to refer back
to the input
sequence, instead of
forcing it to encode
all information into
one fixed-length
• Encoder: Used BRNN
to compute rich set of
features about source
words and their
surrounding words
• Decoder is asked to
choose which hidden
states to use and
• Weighted sum of
hidden states used to
predict the next word
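The "weighted sum of hidden states" idea can be sketched in plain Python. This is an illustration under assumptions: plain dot-product scoring is used here for brevity, whereas the original attention paper used a small learned scoring network; the function names `softmax` and `attend` and the toy vectors are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Score every encoder hidden state against the decoder state,
    then return the attention weights and the weighted-sum context vector."""
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[j] for w, h in zip(weights, encoder_states)) for j in range(dim)]
    return weights, context

# One hidden state per source word (toy values)
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
decoder_state = [2.0, 0.0]

weights, context = attend(decoder_state, encoder_states)
print([round(w, 3) for w in weights], [round(c, 3) for c in context])
```

The encoder state most aligned with the current decoder state receives the largest weight, so it dominates the context vector used to predict the next word.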
Attention intuition
• Decoder RNN uses
attention parameters
to decide how much
to pay attention to
different parts of the
• Allows the model to
amplify the signal
from relevant parts of
the input sequence
• This improves
Attention intuition
Main benefits
• Encoder passes a lot
more data to
the decoder
• Not just last hidden
• Passes all hidden states
at every time step
• Computation path
problem: relevant
information is now
closer by
Summary so far
• Sequence modelling
• Recurrent neural
• Some key components
of SOTA RNN-based
• Gating mechanisms
(GRUs and LSTMs)
• Encoder-decoders
• Bidirectional encoding
• Attention
VI. Transformers and self-attention
Transformers are taking over NLP
• Translation, language
models, question
answering, summarisation,
• Some of the best word
embeddings are based on
• BERT, ELMo, OpenAI GPT-2
A single Transformer encoder block
• No recurrence, no convolutions
• “Attention is all you need” paper
• The core concept is the self-
attention mechanism
• Much more parallelisable than
RNN-based models, which
means faster training
Self-attention is a
• At the highest level – self-
attention takes t input
vectors and outputs t
output vectors
• Take input embedding for
“the” and update it by
incorporating in
information from its
How is the vector for “the” updated?
• Each output vector
is a weighted sum
of the input vectors
• But all of these
weights are
These are not learned weights in the
traditional neural network sense
• The weights are
calculated by taking
dot products
• Can use different
functions over input
Example calculation of a single weight
Calculating a weight matrix row
Attention weight matrix
• The dot product can be
anything (negative infinity to
positive infinity)
• We normalise by length
• We softmax this so that the
weights are positive values
summing to 1
• Attention weight matrix
summarises relationship
between words
• Because dot products capture
similarity between vectors
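The steps above (dot products, scaling, softmax, weighted sum) can be sketched as basic self-attention in plain Python. This is a minimal sketch without any learned parameters. It assumes that "normalise by length" means dividing each dot product by the square root of the vector dimension, as in the Transformer paper; the function names and toy vectors are hypothetical.

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Basic self-attention: t input vectors in, t output vectors out."""
    d = len(X[0])
    scale = math.sqrt(d)  # assumed scaling factor
    weights = []
    for x_i in X:
        raw = [sum(a * b for a, b in zip(x_i, x_j)) / scale for x_j in X]
        weights.append(softmax(raw))  # one row of the attention weight matrix
    outputs = [
        [sum(w * x_j[k] for w, x_j in zip(w_row, X)) for k in range(d)]
        for w_row in weights
    ]
    return weights, outputs

# Three toy "word" vectors: the first two are similar, the third is different
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
weights, outputs = self_attention(X)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` is positive and sums to 1, and similar vectors give each other larger weights than dissimilar ones.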
Multi-headed attention
• Attention weight matrix
captures relationship
between words
• But there are many
different ways words can
be related
• And which ones you want
to capture depends on
your task
• Different attention heads
learn different relations
between word pairs
Difference to RNNs
• Whereas RNNs update context
token-by-token by updating an
internal hidden state, self-
attention captures context by
updating all word representations
• Lower computational complexity,
scales better with more data
• More parallelisable = faster
Connecting all
these concepts
• “Useful” input representations are
• “Useful” weights for transforming
input vectors are learned
• These quantities should produce
“useful” dot products
• That lead to “useful” updated input
• That lead to “useful” input to the
feed-forward network layer
• … etc. … that eventually lead to
lower overall loss on the training set
I. Introduction to sequence modelling
II. Quick neural network review
• How a single neuron functions
• Feed-forward networks
III. Recurrent neural networks
• From feed-forward networks to recurrence
• RNNs with gating mechanisms
IV. Practical: Building a language model for Game of Thrones
V. Components of state-of-the-art RNN models
• Encoder-decoder models
• Bidirectionality
• Attention
VI. Transformers and self-attention
Further Reading
• More accessible: Andrew Ng
Sequence Course on Coursera
• More technical: Deep Learning book
by Goodfellow et al.
• Also: Alex Smola Berkeley Lectures
Just for fun
• Talk to transformer
• Using OpenAI’s “too
dangerous to release” GPT-
2 language model
Thanks, questions?
Extra slides
Sequences in natural language
• Sequence modelling very popular in
NLP because language is sequential by
• Text
• Sequences of words
• Sequences of characters
• We process text sequentially, though in
principle could see all words at once
• Speech
• Sequence of amplitudes over time
• Frequency spectrogram over time
• Extracted frequency features over time
Sequences in biology
• Genomics, DNA and
RNA sequences
• Proteomics, protein
structural biology
• Trying to represent
sequences in some
way, or predict some
function or
association of the
Sequences in finance
• Lots of time series data
• Numerical sequences (stocks,
• Lots of forecasting work –
predicting the future (trading
• Deep learning for these
sequences perhaps not as
popular as you might think
• Quite well-developed methods
based on classical statistics,
interpretability important
Single neuron computation
• What computation is
happening inside 1 neuron?
• If you understand how 1
neuron computes output
given input, it’s a small
step to understand how an
entire network computes
output given input
• Modelling a binary outcome using
binary input features
• Should I have a cup of tea?
• 0 = no
• 1 = yes
• Three features with 1 weight each:
• Do they have Earl Grey?
• earl_grey, w1 = 3
• Have I just had a cup of tea?
• already_had, w2 = -1
• Can I get it to go?
• to_go, w3 = 2
• Here weights are
cherry-picked, but
perceptrons learn these
weights automatically
from training data by
shifting parameters to
minimise error
• Formalising the perceptron
• Instead of a threshold, more
common to see a bias term
• Instead of writing out the
sums using sigma notation,
more common to see dot products
• Vectorisation for efficiency
• Here, I manually chose these
values – but given a dataset of
past inputs/outputs, you could
learn the optimal parameter
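The tea example can be written out as a perceptron in plain Python. The three weights come from the slides; the threshold value of 3 is a hypothetical choice made for this sketch, since the transcript does not state one. The second function shows the equivalent bias form, with the bias equal to minus the threshold.

```python
from itertools import product

WEIGHTS = {"earl_grey": 3, "already_had": -1, "to_go": 2}  # from the slides
THRESHOLD = 3  # hypothetical value, not given in the transcript

def weighted_sum(features):
    return sum(WEIGHTS[name] * value for name, value in features.items())

def perceptron_threshold(features):
    """Output 1 (have tea) if the weighted sum reaches the threshold."""
    return 1 if weighted_sum(features) >= THRESHOLD else 0

def perceptron_bias(features, bias=-THRESHOLD):
    """Same decision with a bias term: w . x + b >= 0."""
    return 1 if weighted_sum(features) + bias >= 0 else 0

# Enumerate all 8 combinations of the three binary features
all_inputs = [dict(zip(WEIGHTS, values)) for values in product([0, 1], repeat=3)]
for features in all_inputs:
    print(features, perceptron_threshold(features), perceptron_bias(features))
```

Both forms give the same answer for every input, which is why the bias notation can replace the explicit threshold.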
Sigmoid neurons
• Want to handle continuous values
• Where input can be
something other than just 0 or 1
• Where output can be
something other than just 0 or 1
• We put the weighted sum of
inputs through an activation function
• Sigmoid or logistic function
Sigmoid neurons
• The sigmoid function is
basically a smoothed out
• Output no longer a
sudden jump
• It’s the smoothness of the
function that we care about
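The difference between a hard threshold and the sigmoid can be sketched in a few lines of Python. The function names are hypothetical, and the weights and bias in the usage line reuse the tea example purely as an illustration (the bias of -3 is an assumed value).

```python
import math

def step(z):
    """Hard threshold: the output jumps from 0 to 1 at z = 0."""
    return 1.0 if z >= 0 else 0.0

def sigmoid(z):
    """Logistic function: a smooth curve between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_neuron(inputs, weights, bias):
    """Weighted sum of inputs passed through the sigmoid activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

for z in (-2.0, -0.1, 0.0, 0.1, 2.0):
    print(z, step(z), round(sigmoid(z), 3))

# Inputs no longer have to be exactly 0 or 1
print(round(sigmoid_neuron([0.8, 0.2, 0.5], [3, -1, 2], -3), 3))
```

Near z = 0 a small change in the input flips the step function completely, while the sigmoid output changes only slightly. That smoothness is what makes learning by small parameter adjustments workable.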
Activation functions
• Which activation function
to use?
• Heuristics based on
experiments, not proof-
More layers!
• Increase
number of
layers to
capacity for
processing of
Training on big window sizes
• How large should the window be? On a very long sequence, an unrolled
RNN becomes a very deep network
• Same problems with vanishing/exploding gradients as normal
deep networks
• And takes a longer time to train
• The normal tricks can help – good initialization of parameters, non-
saturating activation functions, gradient clipping, batch norm
• Training over a limited number of steps – truncated
backpropagation through time
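Of the tricks listed above, gradient clipping is the simplest to show in code. The sketch below clips by the overall norm of the gradient, which is one common variant and is assumed here; the function name `clip_by_global_norm` and the toy numbers are hypothetical, and real frameworks provide their own clipping utilities.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale the gradient vector if its L2 norm exceeds max_norm.

    Returns the (possibly rescaled) gradients and the original norm.
    The direction of the gradient is preserved; only its length changes.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads), norm
    scale = max_norm / norm
    return [g * scale for g in grads], norm

# An "exploding" gradient gets shrunk back to length 1
clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm, clipped)

# A small gradient is left untouched
unchanged, small_norm = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(small_norm, unchanged)
```

Clipping addresses exploding gradients only; vanishing gradients are what the gating mechanisms and truncated backpropagation through time are meant to help with.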
LSTM mechanics
• Input, forget, output gates are
little neural networks within the
• Memory being updated via
forget gate and candidate
• Hidden state being updated by
output gate, which weighs up all
Query, Key, and Value transformations
• Notice that we are using
each input vector on 3
separate occasions
• E.g. vector x2
1. To take dot products
with each other input
vector when calculating
the weights for its own output y2
2. In dot products when the
weights for the other output
vectors (y1, y3, y4) are calculated
3. And in the weighted
sum to produce output
vector y2
Query, Key, and Value transformations
• To model these 3
different functions for
each input vector, and
give the model extra
expressivity and
flexibility, we are going
to modify the input
• Apply simple linear
Input transformation
• These weight matrices
are learnable
• Gives something else
to learn by gradient
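The query, key and value transformations can be sketched by adding three weight matrices to basic self-attention. This is a minimal single-head illustration in plain Python. The names `W_q`, `W_k`, `W_v` and `qkv_self_attention`, the scaling by the square root of the key dimension, and the toy matrices are assumptions made for the example. In a real model the three matrices would be learned by gradient descent rather than set by hand.

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def qkv_self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention with separate query, key and value roles."""
    Q = [matvec(W_q, x) for x in X]  # role 1: compared against every key
    K = [matvec(W_k, x) for x in X]  # role 2: what other queries compare to
    V = [matvec(W_v, x) for x in X]  # role 3: what gets summed into outputs
    scale = math.sqrt(len(K[0]))
    outputs = []
    for q in Q:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in K])
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return outputs

X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
double = [[2.0, 0.0], [0.0, 2.0]]

plain = qkv_self_attention(X, identity, identity, identity)
scaled = qkv_self_attention(X, identity, identity, double)
print(plain)
print(scaled)
```

With identity matrices this reduces to basic self-attention. Changing only `W_v` changes what is summed into each output without changing the attention weights, which shows that the three roles are now controlled independently.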

Networking Case Study prepared by teacher.pptx
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization

Sequence Modelling with Deep Learning

  • 1. Sequence Modelling with Deep Learning ODSC London 2019 Tutorial Natasha Latysheva
  • 2. Overview I. Introduction to sequence modelling II. Quick neural network review • Feed-forward networks III. Recurrent neural networks • From feed-forward networks to recurrence • RNNs with gating mechanisms IV. Practical: Building a language model for Game of Thrones V. Components of state-of-the-art RNN models • Encoder-decoder models • Bidirectionality • Attention VI. Transformers and self-attention
  • 3. Speaker Intro • Welocalize • We provide language services • Fairly large, by revenue 8th largest globally, 4th largest US. 1500+ employees. • Lots of localisation (translation) • International marketing, site optimisation • NLP engineering team • 14 people remote across US, Ireland, UK, Germany, China • Various NLP things: machine translation, text-to-speech, NER, sentiment, topics, classification, etc.
  • 4. I. Introduction to Sequence Modelling
  • 7. Less conventional sequence data • Activity on a website: • [click_button, move_cursor, wait, wait, click_subscribe, close_tab] • Customer history: • [inactive -> mildly_active -> payment_made -> complaint_filed -> inactive -> account_closed] • Code (constrained language) is sequential data – can learn the structure
  • 8. II. Quick Neural Network Review
  • 10. Simplifying the notation • Single neurons • Weight matrices, bias vectors • Fully-connected layer
  • 12. Why do we need fancy methods to model sequences? • Say we are training a translation model, English->French • “The cat is black” to “Le chat est noir” • Could in theory use a feed-forward network to translate word-by-word
  • 13. Why do we need fancy methods? • A feed-forward network treats time steps as completely independent • Even in this simple 1-to-1 correspondence example, things are broken • How you translate “black” depends on noun gender (“noir” vs. “noire”) • How you translate “The” also depends on gender (“Le” vs. “La”) • More generally, getting the translation right requires context
  • 14. Why do we need fancy methods? • We need a way for the network to remember information from previous time steps
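The failure mode described on slides 12 to 14 can be sketched with a context-free lookup, which is roughly what a word-by-word model without memory amounts to. The lexicon below is a hypothetical toy, not taken from the talk; it only illustrates that one fixed output per input word cannot respect noun gender.

```python
# Hypothetical toy lexicon: one fixed French output per English word,
# which is all a context-free, word-by-word translator can represent.
LEXICON = {
    "the": "le",
    "cat": "chat",
    "house": "maison",
    "is": "est",
    "black": "noir",
}


def translate_word_by_word(sentence):
    """Translate each token independently, ignoring every other token."""
    return " ".join(LEXICON[token] for token in sentence.lower().split())


masculine = translate_word_by_word("The cat is black")    # "le chat est noir"
feminine = translate_word_by_word("The house is black")   # "le maison est noir"
```

The second output is wrong: "maison" is feminine, so the correct translation is "la maison est noire". Both "the" and "black" need information from another time step to be translated correctly.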
  • 15. Recurrent neural networks • Extremely popular way of modelling sequential data • Process data one time step at a time, while updating a running internal hidden state
  • 20. Standard FF network to RNN • At each time step, RNN passes on its activations from previous time step • In theory all the way back to the first time step
  • 21. Standard FF network to RNN *Activation function probably tanh or ReLU
  • 22. Standard FF network to RNN • So you can say this is a form of memory • Cell hidden state transferred • Basis for RNNs remembering context
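As a minimal sketch of the recurrence described on slides 20 to 22, the step below computes h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h) in plain Python. The parameter names and values are illustrative assumptions, and tanh is chosen only because the slides mention it as a likely activation.

```python
import math


def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: h_t = tanh(W_xh x_t + W_hh h_prev + b_h)."""
    h_t = []
    for i in range(len(b_h)):
        pre = b_h[i]
        pre += sum(W_xh[i][j] * x_t[j] for j in range(len(x_t)))
        pre += sum(W_hh[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_t.append(math.tanh(pre))
    return h_t


def run_rnn(inputs, h0, W_xh, W_hh, b_h):
    """Process a sequence one time step at a time, carrying the hidden state."""
    h, states = h0, []
    for x_t in inputs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        states.append(h)
    return states


# Made-up parameters: 2 input features, 2 hidden units.
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [0.0, 0.2]]
b_h = [0.0, 0.0]
inputs = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
states = run_rnn(inputs, [0.0, 0.0], W_xh, W_hh, b_h)
```

The last input is all zeros, yet the last hidden state is non-zero: what the network computes at step 3 still depends on steps 1 and 2 through the hidden state, which is the memory the slides refer to.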
  • 23. Memory problems • Basic RNNs not great at long-term dependencies but plenty of ways to improve this • Information gating mechanisms • Condensing input using encoders
  • 24. Gating mechanisms • Gates regulate the flow of information • Very helpful - basic RNN cells not really used anymore. Responsible for recent RNN popularity. • Add explicit mechanisms to remember information and forget information • Why use gates? • Helps you learn long-term dependencies • Not all time points are equally relevant – not everything has to be remembered • Speeds up training/convergence
  • 25. Gated recurrent units (GRUs) • GRUs were developed later than LSTMs but are simpler • Motivation is to get the main benefits of LSTMs but with less computation • Reset gate: Mechanism to decide when to remember vs. forget/reset previous information (hidden state) • Update gate: Mechanism to decide when to update hidden state
  • 26. GRU mechanics • Reset gate controls how much past info we use • Rt = 0 means we are resetting our RNN, not using any previous information • Rt = 1 means we use all of previous information (back to our normal vanilla RNN)
  • 27. GRU mechanics • Update gate controls whether we bother updating our hidden state using new information • Zt = 1 means you’re not updating, you’re just using previous hidden state • Zt = 0 means you’re updating as much as possible
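The two gates can be sketched for a scalar hidden state as follows. This assumes the gate convention used on the slides (Zt = 1 keeps the previous hidden state, Rt = 0 ignores it when building the candidate); the parameter names in the dictionary are hypothetical, and real GRUs use weight matrices rather than scalars.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gru_step(x_t, h_prev, p):
    """One scalar GRU step; p holds hypothetical scalar parameters."""
    r_t = sigmoid(p["w_xr"] * x_t + p["w_hr"] * h_prev + p["b_r"])  # reset gate
    z_t = sigmoid(p["w_xz"] * x_t + p["w_hz"] * h_prev + p["b_z"])  # update gate
    # Candidate state: the reset gate scales how much of h_prev is used.
    h_cand = math.tanh(p["w_xh"] * x_t + p["w_hh"] * (r_t * h_prev) + p["b_h"])
    # z_t near 1 keeps the previous state; z_t near 0 replaces it.
    return z_t * h_prev + (1.0 - z_t) * h_cand


params = {
    "w_xr": 1.0, "w_hr": 1.0, "b_r": 0.0,
    "w_xz": 0.0, "w_hz": 0.0, "b_z": 0.0,
    "w_xh": 1.0, "w_hh": 1.0, "b_h": 0.0,
}
# Push the update gate to its extremes through the bias term.
kept = gru_step(0.7, 0.3, {**params, "b_z": 50.0})      # z_t is close to 1
replaced = gru_step(0.7, 0.3, {**params, "b_z": -50.0})  # z_t is close to 0
```

With the update gate saturated near 1 the new hidden state equals the old one; near 0 it equals the candidate computed from the new input and the reset-gated history.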
  • 28. LSTM mechanics • LSTMs add a memory unit to further control the flow of information through the cell • Also whereas GRUs have 2 gates, an LSTM cell has 3 gates: • An input gate – should I ignore or consider the input? • A forget gate – should I keep or throw away the information in memory? • An output gate – how should I use input, hidden state and memory to output my next hidden state?
  • 29. GRUs vs. LSTMs • GRUs are simpler + train faster • LSTMs more popular – can give slightly better performance, but GRU performance often on par • LSTMs would in theory outperform GRUs in tasks requiring very long-range modelling
  • 30. IV. Game of Thrones Language Model
  • 31. Notebook • ~30 mins • Jupyter notebook on building an RNN-based language model • Python 3 + Keras for neural networks
  • 32. V. Components of SOTA RNN models
  • 33. Encoder-Decoder architectures • Being forced to immediately output a French word for every English word
  • 36. Encoder-Decoder architectures • Tends to work a lot better than using a single sequence-to-sequence RNN to produce an output for each input step • You often need to see the whole sequence before knowing what to output
  • 38. Bidirectionality in RNN encoder-decoders • For the encoder, bidirectional RNNs (BRNNs) often used • BRNNs read the input sequences forwards and backwards
  • 41. The problem with RNN encoder-decoders • Serious information bottleneck • Condense input sequence down to a small vector?! • Memorise long sequence + regurgitate • Not how humans work • Long computation paths
  • 42. Attention concept • Has been very influential in deep learning • Originally developed for MT (Bahdanau, 2014) • As you’re producing your output sequence, maybe not every part of your input is equally relevant • Image captioning example: Lu et al. 2017. Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning.
  • 43. Attention intuition • Attention allows the network to refer back to the input sequence, instead of forcing it to encode all information into one fixed-length vector
  • 44. Attention intuition • Encoder: Used BRNN to compute rich set of features about source words and their surrounding words • Decoder is asked to choose which hidden states to use and ignore • Weighted sum of hidden states used to predict the next word
  • 45. Attention intuition • Decoder RNN uses attention parameters to decide how much to pay attention to different parts of the input • Allows the model to amplify the signal from relevant parts of the input sequence • This improves modelling
  • 46. Main benefits • Encoder passes a lot more data to the decoder • Not just last hidden state • Passes all hidden states at every time step • Computation path problem: relevant information is now closer by
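The "weighted sum of hidden states" from slides 44 to 46 can be sketched for a single decoder step as below. One simplification is assumed and should be read as such: relevance is scored with a plain dot product, whereas Bahdanau-style attention scores with a small learned network. The hidden-state values are made up.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(decoder_state, encoder_states):
    """Return (weights, context) for one decoder step.

    Scoring is a plain dot product here (a simplifying assumption).
    """
    scores = [
        sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * h[i] for w, h in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Made-up encoder hidden states, one per source word.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attention_context([2.0, 0.0], encoder_states)
```

The decoder sees every encoder hidden state rather than only the last one, and the weights decide how strongly each source position contributes to the context vector used for the next prediction.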
  • 47. Summary so far • Sequence modelling • Recurrent neural networks • Some key components of SOTA RNN-based models: • Gating mechanisms (GRUs and LSTMs) • Encoder-decoders • Bidirectional encoding • Attention
  • 48. VI. Transformers and self-attention
  • 49. Transformers are taking over NLP • Translation, language models, question answering, summarisation, etc. • Some of the best word embeddings are based on Transformers • BERT, OpenAI GPT-2 models (ELMo, often listed alongside them, is LSTM-based)
  • 50. A single Transformer encoder block • No recurrence, no convolutions • “Attention is all you need” paper • The core concept is the self-attention mechanism • Much more parallelisable than RNN-based models, which means faster training
  • 51. Self-attention is a sequence-to-sequence operation • At the highest level – self-attention takes t input vectors and outputs t output vectors • Take input embedding for “the” and update it by incorporating information from its context
  • 52. How is the vector for “the” updated?
  • 53. • Each output vector is a weighted sum of the input vectors • But all of these weights are different
  • 54. These are not learned weights in the traditional neural network sense • The weights are calculated by taking dot products • Can use different functions over input
  • 55. Example calculation of a single weight
  • 56. Example calculation of a single weight
  • 57. Calculating a weight matrix row
  • 58. Attention weight matrix • The dot product can be any real number (negative infinity to positive infinity) • We scale it down (in scaled dot-product attention, by the square root of the vector dimension) • We then apply a softmax so that the weights are positive values summing to 1 • Attention weight matrix summarises relationship between words • Because dot products capture similarity between vectors
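Slides 53 to 58 can be put together in a short sketch of basic self-attention with no learned parameters. The scaling step is assumed here to be division by the square root of the vector dimension, as in scaled dot-product attention; the input vectors are made up.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X):
    """Basic self-attention over t input vectors, no learned parameters.

    Returns (W, Y): the t x t attention weight matrix and t output vectors.
    """
    d = len(X[0])
    W = []
    for x_i in X:
        # Raw weights: scaled dot products of x_i with every input vector.
        raw = [sum(a * b for a, b in zip(x_i, x_j)) / math.sqrt(d) for x_j in X]
        W.append(softmax(raw))
    # Each output vector is a weighted sum of all input vectors.
    Y = [
        [sum(W[i][j] * X[j][k] for j in range(len(X))) for k in range(d)]
        for i in range(len(X))
    ]
    return W, Y


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up input embeddings
W, Y = self_attention(X)
```

Row i of W holds the weights used to build output i, so similar input vectors (large dot product) contribute more to each other's updated representations.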
  • 59. Multi-headed attention • Attention weight matrix captures relationship between words • But there are many different ways words can be related • And which ones you want to capture depends on your task • Different attention heads learn different relations between word pairs
  • 60. Difference to RNNs • Whereas RNNs update context token-by-token by updating an internal hidden state, self-attention captures context by updating all word representations simultaneously • Lower per-layer computational complexity when sequence length is smaller than the representation dimension; scales better with more data • More parallelisable = faster training
  • 61. Connecting all these concepts • “Useful” input representations are learned • “Useful” weights for transforming input vectors are learned • These quantities should produce “useful” dot products • That lead to “useful” updated input vectors • That lead to “useful” input to the feed-forward network layer • … etc. … that eventually lead to lower overall loss on the training set
  • 62. Summary I. Introduction to sequence modelling II. Quick neural network review • How a single neuron functions • Feed-forward networks III. Recurrent neural networks • From feed-forward networks to recurrence • RNNs with gating mechanisms IV. Practical: Building a language model for Game of Thrones V. Components of state-of-the-art RNN models • Encoder-decoder models • Bidirectionality • Attention VI. Transformers and self-attention
  • 63. Further Reading • More accessible: Andrew Ng Sequence Course on Coursera • sequence-models • More technical: Deep Learning book by Goodfellow et al. • ents/rnn.html • Also: Alex Smola Berkeley Lectures • deos
  • 64. Just for fun • Talk to transformer • Using OpenAI’s “too dangerous to release” GPT-2 language model
  • 67. Sequences in natural language • Sequence modelling very popular in NLP because language is sequential by nature • Text • Sequences of words • Sequences of characters • We process text sequentially, though in principle could see all words at once • Speech • Sequence of amplitudes over time • Frequency spectrogram over time • Extracted frequency features over time
  • 68. Sequences in biology • Genomics, DNA and RNA sequences • Proteomics, protein sequences, structural biology • Trying to represent sequences in some way, or predict some function or association of the sequence
  • 69. Sequences in finance • Lots of time series data • Numerical sequences (stocks, indices) • Lots of forecasting work – predicting the future (trading strategies) • Deep learning for these sequences perhaps not as popular as you might think • Quite well-developed methods based on classical statistics, interpretability important
  • 70. Single neuron computation • What computation is happening inside 1 neuron? • If you understand how 1 neuron computes output given input, it’s a small step to understand how an entire network computes output given input
  • 72. Perceptrons • Modelling a binary outcome using binary input features • Should I have a cup of tea? • 0 = no • 1 = yes • Three features with 1 weight each: • Do they have Earl Grey? • earl_grey, w_1 = 3 • Have I just had a cup of tea? • already_had, w_2 = -1 • Can I get it to go? • to_go, w_3 = 2
  • 74. Perceptrons • Here weights are cherry-picked, but perceptrons learn these weights automatically from training data by shifting parameters to minimise error
  • 75. Perceptrons • Formalising the perceptron calculation • Instead of a threshold, more common to see a bias term • Instead of writing out the sums using sigma notation, more common to see dot products. • Vectorisation for efficiency • Here, I manually chose these values – but given a dataset of past inputs/outputs, you could learn the optimal parameter values
  • 76. Perceptrons • Formalising the perceptron calculation • Instead of a threshold, more common to see a bias term • Instead of writing out the sums using sigma notation, more common to see dot products. • Vectorisation for efficiency
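The tea example can be written out as a small perceptron using the three weights from the slides (3, -1, 2). The bias value is a hypothetical choice, equivalent to a threshold of 2.5, because the transcript does not state the threshold that was used.

```python
# Weights from the slides; the bias is a hypothetical value
# (equivalent to a threshold of 2.5), not given in the transcript.
WEIGHTS = {"earl_grey": 3, "already_had": -1, "to_go": 2}
BIAS = -2.5


def perceptron(features):
    """Return 1 (have tea) if w . x + b > 0, else 0."""
    activation = sum(WEIGHTS[name] * value for name, value in features.items())
    return 1 if activation + BIAS > 0 else 0


only_earl_grey = perceptron({"earl_grey": 1, "already_had": 0, "to_go": 0})
only_to_go = perceptron({"earl_grey": 0, "already_had": 0, "to_go": 1})
earl_grey_but_had = perceptron({"earl_grey": 1, "already_had": 1, "to_go": 0})
```

Under this assumed bias, Earl Grey alone is enough for a "yes", a to-go cup alone is not, and having just had tea cancels out the Earl Grey. Training would adjust the weights and bias rather than hand-picking them.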
  • 77. Sigmoid neurons • Want to handle continuous values • Where input can be something other than just 0 or 1 • Where output can be something other than just 0 or 1 • We put the weighted sum of inputs through an activation function • Sigmoid or logistic function
  • 78. Sigmoid neurons • The sigmoid function is basically a smoothed out perceptron! • Output no longer a sudden jump • It’s the smoothness of the function that we care about
  • 79. Activation functions • Which activation function to use? • Heuristics based on experiments, not proof-based
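A sigmoid neuron differs from the perceptron only in the final step: the weighted sum goes through the logistic function instead of a hard threshold. The sketch below uses hypothetical weights, bias, and inputs to show that continuous inputs produce continuous outputs.

```python
import math


def sigmoid(a):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))


def sigmoid_neuron(x, w, b):
    """Weighted sum of inputs plus bias, passed through the sigmoid."""
    return sigmoid(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)


# Hypothetical weights and bias; inputs no longer have to be 0 or 1.
w, b = [3.0, -1.0, 2.0], -2.5
low = sigmoid_neuron([0.5, 0.0, 0.0], w, b)     # weighted sum is negative
middle = sigmoid_neuron([0.8, 0.0, 0.1], w, b)  # weighted sum is close to 0
high = sigmoid_neuron([1.0, 0.0, 1.0], w, b)    # weighted sum is positive
```

Small changes in the inputs now move the output smoothly between 0 and 1, which is what makes gradient-based learning possible.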
  • 80. More layers! • Increase number of layers to increase capacity for abstraction, hierarchical processing of input
  • 81. Training on big window sizes • How large should the window be? On a very long sequence, an unrolled RNN becomes a very deep network • Same problems with vanishing/exploding gradients as normal deep networks • And takes longer to train • The normal tricks can help – good initialisation of parameters, non-saturating activation functions, gradient clipping, batch norm • Training over a limited number of steps – truncated backpropagation through time
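Two of the tricks on slide 81 are easy to sketch without a deep learning framework: gradient clipping by norm, and splitting a long sequence into fixed-size windows as done in truncated backpropagation through time. The function names are hypothetical, and frameworks such as Keras provide their own equivalents.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale a gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


def truncated_windows(sequence, k):
    """Split a sequence into consecutive chunks of at most k time steps.

    In truncated BPTT, gradients are propagated only within each chunk.
    """
    return [sequence[i:i + k] for i in range(0, len(sequence), k)]


clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)   # original norm is 5
untouched = clip_by_norm([0.3, 0.4], max_norm=1.0)  # norm 0.5, left as is
windows = truncated_windows(list(range(7)), k=3)
```

Clipping keeps the gradient direction but bounds its size, which addresses exploding gradients; windowing bounds the depth of the unrolled network at the cost of not learning dependencies longer than the window.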
  • 82. LSTM mechanics • Input, forget, output gates are little neural networks within the cell • Memory being updated via forget gate and candidate memory • Hidden state being updated by output gate, which weighs up all information
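The gate interactions on slide 82 can be sketched for scalar states using the standard LSTM update: the memory is c_t = f_t * c_{t-1} + i_t * candidate, and the hidden state is h_t = o_t * tanh(c_t). The parameter names and values are hypothetical, and real LSTM cells use weight matrices rather than scalars.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def lstm_step(x_t, h_prev, c_prev, p):
    """One scalar LSTM step; p holds hypothetical scalar parameters."""
    def unit(name, squash):
        # Each gate is a tiny network over the input and previous hidden state.
        pre = p["w_x" + name] * x_t + p["w_h" + name] * h_prev + p["b_" + name]
        return squash(pre)

    i_t = unit("i", sigmoid)       # input gate: consider the new input?
    f_t = unit("f", sigmoid)       # forget gate: keep the old memory?
    o_t = unit("o", sigmoid)       # output gate: how much memory to expose?
    c_cand = unit("c", math.tanh)  # candidate memory
    c_t = f_t * c_prev + i_t * c_cand
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t


params = {kind + g: 0.5 for kind in ("w_x", "w_h", "b_") for g in "ifoc"}
# Saturate the gates through their biases to show the two extremes.
h_keep, c_keep = lstm_step(1.0, 0.2, 0.8, {**params, "b_f": 50.0, "b_i": -50.0})
h_new, c_new = lstm_step(1.0, 0.2, 0.8, {**params, "b_f": -50.0, "b_i": 50.0})
```

With the forget gate open and the input gate closed the memory passes through unchanged; with the opposite setting the old memory is discarded and replaced by the candidate.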
  • 84. Query, Key, and Value transformations • Notice that we are using each input vector on 3 separate occasions • E.g. vector x2: 1. To take dot products with each other input vector when calculating y2 2. In the dot products taken when the other output vectors (y1, y3, y4) are calculated 3. And in the weighted sum to produce output vector y2
  • 85. Query, Key, and Value transformations • To model these 3 different functions for each input vector, and give the model extra expressivity and flexibility, we are going to modify the input vectors • Apply simple linear transformations
  • 86. Input transformation matrices • These weight matrices are learnable parameters • Gives something else to learn by gradient descent
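The three roles from slides 84 to 86 can be sketched by applying three linear transformations to each input vector before the attention computation. The matrix names W_q, W_k and W_v follow common usage and are assumptions here, as is the scaling by the square root of the key dimension; in a trained model these matrices would be learned by gradient descent rather than set by hand.

```python
import math


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def qkv_self_attention(X, W_q, W_k, W_v):
    """Self-attention with separate query, key and value transformations."""
    Q = [matvec(W_q, x) for x in X]  # role 1: compared against others
    K = [matvec(W_k, x) for x in X]  # role 2: compared against by others
    V = [matvec(W_v, x) for x in X]  # role 3: mixed into the output
    d_k = len(K[0])
    Y = []
    for q in Q:
        raw = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(raw)
        Y.append([
            sum(w * v[i] for w, v in zip(weights, V)) for i in range(len(V[0]))
        ])
    return Y


identity = [[1.0, 0.0], [0.0, 1.0]]
X = [[1.0, 0.0], [0.0, 1.0]]  # made-up input vectors
Y = qkv_self_attention(X, identity, identity, identity)
```

With identity matrices this reduces to basic self-attention; non-identity matrices let the same input vector behave differently in each of its three roles.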