Contemporary Models of Natural Language Processing
Ekaterina Vylomova
June, 2017
Contents
1 Introduction
2 Language Modeling
N-Grams
Distributional Semantics
Learning Representations
Neural Language Models
Evaluation
3 Machine Translation
Statistical Machine Translation
Neural Machine Translation
Attentional Mechanism
Comparison of MT systems
Google MT
NLP and CL
Natural Language Processing: the art of solving engineering problems that need to analyze
(or generate) natural language text
Computational Linguistics: computational methods to answer the scientific questions of
linguistics
Tasks
Language Modeling
Sentiment Analysis
Machine Translation
POS Tagging
Text Classification
Question Answering
Recommender Systems
... and many others!
N-Grams
Language Modeling
A probability distribution over sequences of words: P(w_1 w_2 w_3 ... w_n).
Markov Assumption: P(w_1 w_2 w_3 ... w_n) ≈ ∏_i P(w_i | w_{i−k}, ..., w_{i−1})
N-Grams:
Chain rule: P(w_1 w_2 w_3) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2)
Unigram: P(w_1 w_2 w_3) = P(w_1) P(w_2) P(w_3)
Bigram: P(w_1 w_2 w_3) = P(w_1) P(w_2 | w_1) P(w_3 | w_2)
... 3-grams, 4-grams, etc.
where P(w_i | w_{i−1}) = count(w_{i−1}, w_i) / count(w_{i−1}) (Maximum Likelihood Estimation)
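A minimal sketch of this maximum-likelihood bigram estimation in Python (the toy corpus, function names and the handling of sentence boundaries are illustrative, not from the slides):

# A toy bigram model with maximum-likelihood estimates.
from collections import Counter

corpus = "the dog chases the cat . the cat sleeps .".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_bigram(w_prev, w):
    # P(w | w_prev) = count(w_prev, w) / count(w_prev)
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

def p_sentence(words):
    # P(w1 ... wn) under the bigram Markov assumption (start/end symbols omitted).
    p = unigram_counts[words[0]] / len(corpus)
    for w_prev, w in zip(words, words[1:]):
        p *= p_bigram(w_prev, w)
    return p

print(p_sentence("the dog chases the cat".split()))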
N-Grams
N-Grams
Insufficient because:
Language has long-distance dependencies
Not enough generalisation
Surface forms instead of meanings
Distributional Semantics
Distributional Semantics
Where does the meaning come from?
How to get the meaning of a word, a
phrase, a sentence?
Distributional Semantics
Distributional approach
Frege, Firth, Harris: the meaning of a word ≈ the meaning of its context
An example for Russian: contexts containing forms of the root бег-/беж- 'to run' (the stem was shown in bold on the slide); English glosses added
...он не пробежит быстрее гепарда... (he will not outrun a cheetah)
...перед пробегом лыжники и ветераны... (before the run, skiers and veterans)
...они выбегают на проезжую часть... (they run out onto the roadway)
...есть, бежать, спать и дышать... (to eat, to run, to sleep and to breathe)
...бегать и прыгать на платформу... (to run and jump onto the platform)
...мы быстро побежали к пляжу... (we quickly ran to the beach)
...дневник: почему именно бег... (a diary: why running, of all things)
...бежать вверх, я струсил что ли... (to run upwards; did I chicken out or what)
...побежал к своей маме... (ran to his mum)
...где дети могут бегать и играть... (where children can run and play)
...затем он выбегал, опрокидывал... (then he would run out and knock things over)
...дети из любых семей сбегают из дому... (children from all kinds of families run away from home)
Learning Representations
Distributed Representations
Moving away from local representations:
cat = [0000100]
dog = [0100000]
to distributed representations:
doc1: A dog eating meat.
doc2: A dog chases a cat.
doc3: A car drives.
Recall the Term-Document matrix:

        a   car  cat  chases  dog  drives  eating  meat
doc1    1   0    0    0       1    0       1       1
doc2    2   0    1    1       1    0       0       0
doc3    1   1    0    0       0    1       0       0
Learning Representations
Distributed Representations
doc1: A dog eating meat.
doc2: A dog chases a cat.
doc3: A car drives.
⇒ Term-Term matrix: set the context window! Window size ∈ [1, 10]; let's take 1.
Shorter windows (1-3) → more syntactic representations
Longer windows (4-10) → more semantic representations

        a   car  cat  chases  dog  drives  eating  meat
a       0   1    1    0       2    0       0       0
car     1   0    0    0       0    0       0       0
cat     1   0    0    0       0    0       0       0
chases  1   0    0    0       1    0       0       0
dog     1   0    0    0       0    0       1       0
drives  0   1    0    0       0    0       0       0
eating  0   0    0    0       0    0       0       1
meat    0   0    0    0       0    0       1       0
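A minimal sketch of building such a term-term co-occurrence matrix with a symmetric window of size 1; the tokenization and variable names are illustrative, and the exact counts depend on the windowing convention, so they may differ slightly from the table above:

import numpy as np

# Illustrative documents from the slide.
docs = [
    "a dog eating meat".split(),
    "a dog chases a cat".split(),
    "a car drives".split(),
]

vocab = sorted({w for doc in docs for w in doc})
index = {w: i for i, w in enumerate(vocab)}
window = 1  # symmetric context window of size 1

counts = np.zeros((len(vocab), len(vocab)), dtype=int)
for doc in docs:
    for i, w in enumerate(doc):
        for j in range(max(0, i - window), min(len(doc), i + window + 1)):
            if j != i:
                counts[index[w], index[doc[j]]] += 1

print(vocab)
print(counts)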
Learning Representations
Learning Representations
Raw counts are very skewed ⇒ use Pointwise Mutual Information: PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ).
Negative values are problematic, so we use Positive PMI:
PPMI(x, y) = max(PMI(x, y), 0)
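A minimal sketch of turning raw co-occurrence counts into PPMI weights; the helper name and the toy matrix are illustrative:

import numpy as np

def ppmi(counts, eps=1e-12):
    # PPMI(x, y) = max(log2(P(x, y) / (P(x) P(y))), 0)
    total = counts.sum()
    p_xy = counts / total
    p_x = p_xy.sum(axis=1, keepdims=True)   # row marginals
    p_y = p_xy.sum(axis=0, keepdims=True)   # column marginals
    pmi = np.log2((p_xy + eps) / (p_x @ p_y + eps))
    return np.maximum(pmi, 0.0)

# A tiny toy co-occurrence matrix (any term-term count matrix works here).
toy_counts = np.array([[0., 2., 1.],
                       [2., 0., 0.],
                       [1., 0., 0.]])
print(ppmi(toy_counts))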
Learning Representations
Learning Representations
The matrix is too large (typically millions of tokens/types)
The matrix is too sparse!
Curse of dimensionality
We want dense and short representations!
Learning Representations
Learning Representations
SVD/PCA
approximate N-dimensional data with fewer (most important) dimensions
by rotating the axes into a new space
the highest dimension captures the most variance in the original data
the next dimension captures the next most variance, etc.
Learning Representations
Learning Representations
Principal Component Analysis
Learning Representations
Learning Representations
Singular Value Decomposition
Store top k singular values instead of all m dimensions. So, each row in W is k-dimensional
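A minimal sketch of obtaining dense k-dimensional word vectors via truncated SVD; the random stand-in matrix and names are illustrative:

import numpy as np

def dense_vectors(matrix, k):
    # Keep only the top-k singular values/vectors: each row becomes k-dimensional.
    U, S, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k] * S[:k]

rng = np.random.default_rng(0)
ppmi_matrix = rng.random((8, 8))        # stand-in for a real (P)PMI matrix
word_vectors = dense_vectors(ppmi_matrix, k=2)
print(word_vectors.shape)               # (8, 2): dense, short representations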
Learning Representations
Learning Representations
Dense vs. Sparse representations
Dense vectors lead to:
denoising
better generalization
easier for classifiers to properly weight the dimensions for the task
better at capturing higher order co-occurrence
Neural Language Models
Models
HLBL: A Scalable Hierarchical Distributed Language Model (Mnih, 2009)
J = (1/T) Σ_{i=1..T} exp(w̃_i · w_i + b_i) / Σ_{k=1..V} exp(w̃_i · w_k + b_k),
where w̃_i = Σ_{j=1..n−1} C_j w_{i−j} is the context embedding, {C_j} are scaling matrices, and b_* are bias terms.
SENNA (CW): Natural Language Processing (almost) from Scratch (Collobert, 2011)
J = (1/T) Σ_{i=1..T} Σ_{k=1..V} max(0, 1 − f(w_{i−c}, ..., w_{i−1}, w_i) + f(w_{i−c}, ..., w_{i−1}, w_k)),
where w_{i−c}, ..., w_{i−1} are the context words and f(x) is a non-linear function (score) of the input window.
Neural Language Models
RNN: Linguistic Regularities in Continuous Space Word Representations
(Mikolov, 2013)
Recurrent Neural Language Model (Inspired by Elman, 1992)
s(t) = f (Uw(t) + Ws(t − 1)), y(t) = g(Vs(t))
f (z) = 1
1+e−z , g(zm) =
ezm
k ezk
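A minimal sketch of one pass through this Elman-style recurrent language model, using the equations above with toy sizes and random weights:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

vocab_size, hidden_size = 10, 4          # toy sizes
rng = np.random.default_rng(0)
U = rng.normal(size=(hidden_size, vocab_size))
W = rng.normal(size=(hidden_size, hidden_size))
V = rng.normal(size=(vocab_size, hidden_size))

s = np.zeros(hidden_size)                # s(0)
for word_id in [3, 7, 1]:                # a toy input word sequence
    w = np.zeros(vocab_size)
    w[word_id] = 1.0                     # one-hot input w(t)
    s = sigmoid(U @ w + W @ s)           # s(t) = f(U w(t) + W s(t-1))
    y = softmax(V @ s)                   # y(t) = g(V s(t)): next-word distribution
print(y)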
Neural Language Models
Learning Representations: predictive models
Efficient Estimation of Word Representations in Vector Space (Mikolov et al., 2013)
Word2Vec: CBOW and Skip-gram models (Mikolov et al., 2013)
Neural Language Models
Learning Representations
Skip-Gram Model (Mikolov, 2013)
Neural Language Models
Learning Representations: Skip-Gram with Negative Sampling
Objective for one (word, context) pair: maximize log σ(w · c) + Σ_{k=1..K} log σ(−w · c_k),
where σ(x) = 1/(1 + exp(−x)) and c_1, ..., c_K are sampled negative contexts.
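A minimal sketch of this negative-sampling loss for a single (word, context) pair, with random stand-in vectors:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(w, c, negatives):
    # Negative log of: sigma(w . c) * prod_k sigma(-w . n_k)
    positive = np.log(sigmoid(w @ c))
    negative = sum(np.log(sigmoid(-w @ n)) for n in negatives)
    return -(positive + negative)

rng = np.random.default_rng(0)
dim = 50
w = rng.normal(size=dim)                              # target word vector
c = rng.normal(size=dim)                              # observed context vector
negatives = [rng.normal(size=dim) for _ in range(5)]  # sampled negative contexts
print(sgns_loss(w, c, negatives))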
Neural Language Models
GloVe: Global Vectors for Word Representation (Pennington, 2014)
GloVe
J = (1/2) Σ_{i,j=1..V} f(P_ij) (w_i · w̃_j − log P_ij)²,
where w_i is the vector for the left context, w̃_j is the vector for the right context, P_ij is the relative frequency of
word j in the context of word i, and f is a weighting function.
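A minimal sketch of this (bias-free, as on the slide) GloVe objective with toy arrays; the weighting f(x) = (x / x_max)^α capped at 1 follows the original paper:

import numpy as np

def weighting(x, x_max=100.0, alpha=0.75):
    # f(x) = (x / x_max)^alpha, capped at 1; f(0) = 0, so unseen pairs contribute nothing.
    return np.minimum((x / x_max) ** alpha, 1.0)

def glove_loss(W, W_ctx, counts, eps=1e-12):
    diff = W @ W_ctx.T - np.log(counts + eps)      # w_i . w~_j - log P_ij for all i, j
    diff = np.where(counts > 0, diff, 0.0)         # guard against log(0) terms
    return 0.5 * np.sum(weighting(counts) * diff ** 2)

rng = np.random.default_rng(0)
V, d = 6, 3                                        # toy vocabulary size and dimensionality
W = rng.normal(size=(V, d))                        # "left" vectors
W_ctx = rng.normal(size=(V, d))                    # "right" (context) vectors
counts = rng.integers(0, 20, size=(V, V)).astype(float)
print(glove_loss(W, W_ctx, counts))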
Neural Language Models
Evaluation
Intrinsic
Perplexity: how well a probability distribution or probability model predicts a sample:
PP(p) = 2^H(p) = 2^(−Σ_x p(x) log2 p(x)),
where H(p) is the entropy of the distribution and x ranges over events.
Cross-Entropy: H(p̃, q) = −Σ_x p̃(x) log2 q(x),
where q is the model and p̃(x) is the empirical distribution of the test sample.
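A minimal sketch of computing cross-entropy and perplexity for a model q on a toy test distribution:

import numpy as np

def cross_entropy(p_emp, q):
    # H(p~, q) = -sum_x p~(x) log2 q(x)
    return -np.sum(p_emp * np.log2(q))

def perplexity(p_emp, q):
    return 2.0 ** cross_entropy(p_emp, q)

p_emp = np.array([0.5, 0.25, 0.25])    # empirical distribution of the test sample
q = np.array([0.4, 0.4, 0.2])          # model distribution
print(cross_entropy(p_emp, q), perplexity(p_emp, q))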
Extrinsic
Word Analogy Tasks
Neural Language Models
Linguistic Regularities in Continuous Space Word Representations (Mikolov,
2013)
Word Analogy Task, initially designed for word2vec
king is to man as queen is to ?
good is to best as smart is to ?
china is to beijing as russia is to ?
Neural Language Models
Linguistic Regularities in Continuous Space Word Representations (Mikolov,
2013)
Word Analogy Task: vector(king) − vector(man) + vector(woman) ≈ vector(queen)
Use cosine similarity: x = argmax_x′ cos(x′, a* − a + b), where a*, a, and b are typically excluded from the candidates.
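A minimal sketch of the analogy query with cosine similarity; the tiny hand-made embedding table is purely illustrative:

import numpy as np

emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.8, 0.1, 0.1]),
    "woman": np.array([0.8, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
}

def cos(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a_star, a, b):
    # x = argmax_x' cos(x', a* - a + b), excluding a*, a, b themselves.
    target = emb[a_star] - emb[a] + emb[b]
    candidates = {w: v for w, v in emb.items() if w not in {a_star, a, b}}
    return max(candidates, key=lambda w: cos(candidates[w], target))

print(analogy("king", "man", "woman"))   # expected: "queen"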
Neural Language Models
Is word2vec better than SVD and GloVe?
Levy & Goldberg, 2014: Neural Word Embedding as Implicit Matrix Factorization (word2vec
skip-gram with negative sampling implicitly factorizes a shifted PMI matrix)
Vylomova, 2016: word2vec performs similarly to SVD-PPMI on semantic and syntactic
evaluation tasks
Neural Language Models
Deep Models: LSTMs
The vanishing-gradient problem in RNNs (for long-term dependencies). Solution: LSTMs (Long
Short-Term Memory) and GRUs (Gated Recurrent Units), which perform similarly.
More: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Neural Language Models
Deep Models: CNNs
Originating from Computer Vision
Neural Language Models
Deep Models: CNN-Highway (Kim et al, 2015)
Char-CNN+biLSTM
Statistical Machine Translation
Parallel Corpus
Parallel Corpus
Popular corpora: Europarl, CommonCrawl. MT Workshop: http://www.statmt.org/
Statistical Machine Translation
Learning Alignment
Alignments yield word/phrase translation probabilities
Alignment for phrase-based MT
Statistical Machine Translation
SMT Model: Noisy Channel Model
Noisy Channel from Information Theory
Statistical Machine Translation
SMT Model: Noisy Channel Model
Recall Bayes' Theorem: P(B|A) = P(A|B) P(B) / P(A)
Bayesian Approach
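A minimal sketch of noisy-channel decoding, ê = argmax_e P(f | e) · P(e), i.e. translation model times language model; the candidate set and probabilities are toy stand-ins, not a real decoder:

# Toy candidate translations with made-up translation-model (tm) and
# language-model (lm) probabilities; a real decoder searches a huge space.
candidates = {
    "the house is small": {"tm": 0.20, "lm": 0.30},
    "the house is little": {"tm": 0.25, "lm": 0.10},
    "small is the house": {"tm": 0.20, "lm": 0.02},
}

best = max(candidates, key=lambda e: candidates[e]["tm"] * candidates[e]["lm"])
print(best)   # the candidate maximizing P(f | e) * P(e)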
Neural Machine Translation
First Neural MT models
Encoder-Decoder: encode the source sentence using RNN into a single vector and then
iteratively decode until EOS symbol is produced.
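A minimal sketch of this encode-then-greedily-decode loop with a plain tanh RNN and random weights (so the "translation" is meaningless; the point is the control flow):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

rng = np.random.default_rng(0)
src_vocab, tgt_vocab, hidden = 12, 10, 6        # toy sizes
EOS = 0                                          # end-of-sentence symbol id

Enc_in, Enc_hh = rng.normal(size=(hidden, src_vocab)), rng.normal(size=(hidden, hidden))
Dec_in, Dec_hh = rng.normal(size=(hidden, tgt_vocab)), rng.normal(size=(hidden, hidden))
Dec_out = rng.normal(size=(tgt_vocab, hidden))

# Encoder: read the source left to right, keep only the final hidden state.
h = np.zeros(hidden)
for src_id in [3, 5, 7]:                         # a toy source sentence
    h = np.tanh(Enc_in @ one_hot(src_id, src_vocab) + Enc_hh @ h)

# Decoder: start from the encoder's summary vector; stop at EOS (or a length cap).
output, prev = [], EOS
for _ in range(10):
    h = np.tanh(Dec_in @ one_hot(prev, tgt_vocab) + Dec_hh @ h)
    prev = int(np.argmax(softmax(Dec_out @ h)))  # greedy choice of the next token
    if prev == EOS:
        break
    output.append(prev)
print(output)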
Neural Machine Translation
Sutskever et al., 2014: Sequence to Sequence Learning with Neural Networks
Deep LSTMs with 4 layers, 1000 cells per layer, and 1000-dimensional word embeddings;
|Ve| = 160,000, |Vf| = 80,000. The resulting LSTM has 384M parameters.
Neural Machine Translation
Sutskever et al., 2014: Sequence to Sequence Learning with Neural Networks
Vector Space
Attentional Mechanism
Bahdanau et al., 2014: Neural Machine Translation by Jointly Learning to Align and Translate
Let's learn the alignment!
Attentional Mechanism
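A minimal sketch of attention over encoder states: score every source position against the current decoder state, softmax the scores into alignment weights, and take the weighted sum as the context vector. Bahdanau et al. use an additive (MLP) scoring function; plain dot products are used here only to keep the sketch short:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(decoder_state, encoder_states):
    scores = encoder_states @ decoder_state   # one score per source position
    weights = softmax(scores)                 # alignment weights (sum to 1)
    context = weights @ encoder_states        # weighted sum of encoder states
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 8))      # 5 source positions, hidden size 8
decoder_state = rng.normal(size=8)
context, weights = attend(decoder_state, encoder_states)
print(weights)                                # one row of the alignment matrix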
Attentional Mechanism
Bahdanau et al., 2014: Neural Machine Translation by Jointly Learning to Align and Translate
Attentional Mechanism (shamelessly stolen from nvidia tutorial)
Attentional Mechanism
Bahdanau et al., 2014: Neural Machine Translation by Jointly Learning to Align and Translate
Good news: the ability to interpret and visualize what the model is doing (the alignment
weights)
An example of Alignment Matrix
Attentional Mechanism
Other applications of attention
Other great papers to read
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (Xu et al.,
2015)
Grammar as a Foreign Language (Vinyals et al.,2014)
Teaching Machines to Read and Comprehend (Hermann et al., 2015)
Comparison of MT systems
Comparison of the SMT/Neural MT Models
Phrase-Based SMT baseline vs. Attentional vs. Seq2Seq (taken from Sutskever's paper)
Google MT
Google Translator
Google Translation is officially neural!
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine
Translation (Wu and many others, 2016)
And many others ...
Subword models (character-level, morpheme-level)
Dialogue systems (iPavlov challenge from MIPT)
Transfer Learning, NLP for Low-Resource Languages
Other models such as memory networks, adversarial networks, etc.
Great researchers: Yoshua Bengio, Geoffrey Hinton, Tomas Mikolov, Chris Dyer, Russ
Salakhutdinov, Kyunghyun Cho, Chris Manning, Hinrich Schuetze, Dan Jurafsky
RuSSIR-2017! Deadline: June 25th