Contemporary Models of Natural Language Processing
Ekaterina Vylomova
June, 2017
Contents
1 Introduction
2 Language Modeling
N-Grams
Distributional Semantics
Learning Representations
Neural Language Models
Evaluation
3 Machine Translation
Statistical Machine Translation
Neural Machine Translation
Attentional Mechanism
Comparison of MT systems
Google MT
NLP and CL
Natural Language Processing: the art of solving engineering problems that need to analyze
(or generate) natural language text
Computational Linguistics: computational methods to answer the scientific questions of
linguistics
Tasks
Language Modeling
Sentiment Analysis
Machine Translation
POS Tagging
Text Classification
Question Answering
Recommender Systems
... and many others!
N-Grams
Language Modeling
A probability distribution over sequences of words: P(w_1 w_2 w_3 ... w_n).
Markov Assumption: P(w_1 w_2 w_3 ... w_n) ≈ ∏_i P(w_i | w_{i−k}, ..., w_{i−1})
N-Grams:
Chain rule: P(w_1 w_2 w_3) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2)
Unigram: P(w_1 w_2 w_3) = P(w_1) P(w_2) P(w_3)
Bigram: P(w_1 w_2 w_3) = P(w_1) P(w_2 | w_1) P(w_3 | w_2)
... 3-grams, 4-grams, etc.
where P(w_i | w_{i−1}) = count(w_{i−1}, w_i) / count(w_{i−1}) (Maximum Likelihood Estimation)
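A minimal sketch of this maximum-likelihood bigram estimation in Python (the toy corpus, function names and the handling of sentence boundaries are illustrative, not from the slides):

# A toy bigram model with maximum-likelihood estimates.
from collections import Counter

corpus = "the dog chases the cat . the cat sleeps .".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_bigram(w_prev, w):
    # P(w | w_prev) = count(w_prev, w) / count(w_prev)
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

def p_sentence(words):
    # P(w1 ... wn) under the bigram Markov assumption (start/end symbols omitted).
    p = unigram_counts[words[0]] / len(corpus)
    for w_prev, w in zip(words, words[1:]):
        p *= p_bigram(w_prev, w)
    return p

print(p_sentence("the dog chases the cat".split()))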
N-Grams
N-Grams
Insufficient because:
Language has long-distance dependencies
Not enough generalisation
Surface forms instead of meanings
Distributional Semantics
Distributional Semantics
Where does the meaning come from?
How to get the meaning of a word, a
phrase, a sentence?
Distributional Semantics
Distributional approach
Frege, Firth, Harris: the meaning of a word ≈ the meaning of its context
An example for Russian: contexts containing forms of the root бег-/беж- 'to run' (the stem was shown in bold on the slide); English glosses added
...он не пробежит быстрее гепарда... (he will not outrun a cheetah)
...перед пробегом лыжники и ветераны... (before the run, skiers and veterans)
...они выбегают на проезжую часть... (they run out onto the roadway)
...есть, бежать, спать и дышать... (to eat, to run, to sleep and to breathe)
...бегать и прыгать на платформу... (to run and jump onto the platform)
...мы быстро побежали к пляжу... (we quickly ran to the beach)
...дневник: почему именно бег... (a diary: why running, of all things)
...бежать вверх, я струсил что ли... (to run upwards; did I chicken out or what)
...побежал к своей маме... (ran to his mum)
...где дети могут бегать и играть... (where children can run and play)
...затем он выбегал, опрокидывал... (then he would run out and knock things over)
...дети из любых семей сбегают из дому... (children from all kinds of families run away from home)
Learning Representations
Distributed Representations
Moving away from local representations:
cat = [0000100]
dog = [0100000]
to distributed representations:
doc1: A dog eating meat.
doc2: A dog chases a cat.
doc3: A car drives.
Recall the Term-Document matrix:

        a   car  cat  chases  dog  drives  eating  meat
doc1    1   0    0    0       1    0       1       1
doc2    2   0    1    1       1    0       0       0
doc3    1   1    0    0       0    1       0       0
Learning Representations
Distributed Representations
doc1: A dog eating meat.
doc2: A dog chases a cat.
doc3: A car drives.
⇒ Term-Term matrix: set the context window! Window size ∈ [1, 10]; let's take 1.
Shorter windows (1-3) → more syntactic representations
Longer windows (4-10) → more semantic representations

        a   car  cat  chases  dog  drives  eating  meat
a       0   1    1    0       2    0       0       0
car     1   0    0    0       0    0       0       0
cat     1   0    0    0       0    0       0       0
chases  1   0    0    0       1    0       0       0
dog     1   0    0    0       0    0       1       0
drives  0   1    0    0       0    0       0       0
eating  0   0    0    0       0    0       0       1
meat    0   0    0    0       0    0       1       0
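A minimal sketch of building such a term-term co-occurrence matrix with a symmetric window of size 1; the tokenization and variable names are illustrative, and the exact counts depend on the windowing convention, so they may differ slightly from the table above:

import numpy as np

# Illustrative documents from the slide.
docs = [
    "a dog eating meat".split(),
    "a dog chases a cat".split(),
    "a car drives".split(),
]

vocab = sorted({w for doc in docs for w in doc})
index = {w: i for i, w in enumerate(vocab)}
window = 1  # symmetric context window of size 1

counts = np.zeros((len(vocab), len(vocab)), dtype=int)
for doc in docs:
    for i, w in enumerate(doc):
        for j in range(max(0, i - window), min(len(doc), i + window + 1)):
            if j != i:
                counts[index[w], index[doc[j]]] += 1

print(vocab)
print(counts)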
Learning Representations
Learning Representations
Raw counts are very skewed ⇒ use Pointwise Mutual Information: PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ).
Negative values are problematic, so we use Positive PMI:
PPMI(x, y) = max(PMI(x, y), 0)
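A minimal sketch of turning raw co-occurrence counts into PPMI weights; the helper name and the toy matrix are illustrative:

import numpy as np

def ppmi(counts, eps=1e-12):
    # PPMI(x, y) = max(log2(P(x, y) / (P(x) P(y))), 0)
    total = counts.sum()
    p_xy = counts / total
    p_x = p_xy.sum(axis=1, keepdims=True)   # row marginals
    p_y = p_xy.sum(axis=0, keepdims=True)   # column marginals
    pmi = np.log2((p_xy + eps) / (p_x @ p_y + eps))
    return np.maximum(pmi, 0.0)

# A tiny toy co-occurrence matrix (any term-term count matrix works here).
toy_counts = np.array([[0., 2., 1.],
                       [2., 0., 0.],
                       [1., 0., 0.]])
print(ppmi(toy_counts))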
Learning Representations
Learning Representations
The matrix is too large (typically millions of tokens/types)
The matrix is too sparse!
Curse of dimensionality
We want dense and short representations!
Learning Representations
Learning Representations
SVD/PCA
approximate N-dimensional data with fewer (most important) dimensions
by rotating the axes into a new space
the highest dimension captures the most variance in the original data
the next dimension captures the next most variance, etc.
Learning Representations
Learning Representations
Principal Component Analysis
Learning Representations
Learning Representations
Singular Value Decomposition
Store top k singular values instead of all m dimensions. So, each row in W is k-dimensional
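A minimal sketch of obtaining dense k-dimensional word vectors via truncated SVD; the random stand-in matrix and names are illustrative:

import numpy as np

def dense_vectors(matrix, k):
    # Keep only the top-k singular values/vectors: each row becomes k-dimensional.
    U, S, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k] * S[:k]

rng = np.random.default_rng(0)
ppmi_matrix = rng.random((8, 8))        # stand-in for a real (P)PMI matrix
word_vectors = dense_vectors(ppmi_matrix, k=2)
print(word_vectors.shape)               # (8, 2): dense, short representations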
Learning Representations
Learning Representations
Dense vs. Sparse representations
Dense vectors lead to:
denoising
better generalization
easier for classifiers to properly weight the dimensions for the task
better at capturing higher order co-occurrence
Neural Language Models
Models
HLBL: A Scalable Hierarchical Distributed Language Model (Mnih, 2009)
J = (1/T) Σ_{i=1..T} exp(w̃_i · w_i + b_i) / Σ_{k=1..V} exp(w̃_i · w_k + b_k),
where w̃_i = Σ_{j=1..n−1} C_j w_{i−j} is the context embedding, {C_j} are scaling matrices, and b_* are bias terms.
SENNA (CW): Natural Language Processing (almost) from Scratch (Collobert, 2011)
J = (1/T) Σ_{i=1..T} Σ_{k=1..V} max(0, 1 − f(w_{i−c}, ..., w_{i−1}, w_i) + f(w_{i−c}, ..., w_{i−1}, w_k)),
where w_{i−c}, ..., w_{i−1} are the context words and f(x) is a non-linear function (score) of the input window.
Neural Language Models
RNN: Linguistic Regularities in Continuous Space Word Representations
(Mikolov, 2013)
Recurrent Neural Language Model (Inspired by Elman, 1992)
s(t) = f (Uw(t) + Ws(t − 1)), y(t) = g(Vs(t))
f (z) = 1
1+e−z , g(zm) =
ezm
k ezk
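A minimal sketch of one pass through this Elman-style recurrent language model, using the equations above with toy sizes and random weights:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

vocab_size, hidden_size = 10, 4          # toy sizes
rng = np.random.default_rng(0)
U = rng.normal(size=(hidden_size, vocab_size))
W = rng.normal(size=(hidden_size, hidden_size))
V = rng.normal(size=(vocab_size, hidden_size))

s = np.zeros(hidden_size)                # s(0)
for word_id in [3, 7, 1]:                # a toy input word sequence
    w = np.zeros(vocab_size)
    w[word_id] = 1.0                     # one-hot input w(t)
    s = sigmoid(U @ w + W @ s)           # s(t) = f(U w(t) + W s(t-1))
    y = softmax(V @ s)                   # y(t) = g(V s(t)): next-word distribution
print(y)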
Neural Language Models
Learning Representations: predictive models
Efficient Estimation of Word Representations in Vector Space (Mikolov et al., 2013)
Word2Vec: CBOW and Skip-gram models (Mikolov et al., 2013)
Neural Language Models
Learning Representations
Skip-Gram Model (Mikolov, 2013)
Neural Language Models
Learning Representations: Skip-Gram with Negative Sampling
Objective for one (word, context) pair: maximize log σ(w · c) + Σ_{k=1..K} log σ(−w · c_k),
where σ(x) = 1/(1 + exp(−x)) and c_1, ..., c_K are sampled negative contexts.
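A minimal sketch of this negative-sampling loss for a single (word, context) pair, with random stand-in vectors:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(w, c, negatives):
    # Negative log of: sigma(w . c) * prod_k sigma(-w . n_k)
    positive = np.log(sigmoid(w @ c))
    negative = sum(np.log(sigmoid(-w @ n)) for n in negatives)
    return -(positive + negative)

rng = np.random.default_rng(0)
dim = 50
w = rng.normal(size=dim)                              # target word vector
c = rng.normal(size=dim)                              # observed context vector
negatives = [rng.normal(size=dim) for _ in range(5)]  # sampled negative contexts
print(sgns_loss(w, c, negatives))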
Neural Language Models
GloVe: Global Vectors for Word Representation (Pennington, 2014)
GloVe
J = (1/2) Σ_{i,j=1..V} f(P_ij) (w_i · w̃_j − log P_ij)²,
where w_i is the vector for the left context, w̃_j is the vector for the right context, P_ij is the relative frequency of
word j in the context of word i, and f is a weighting function.
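A minimal sketch of this (bias-free, as on the slide) GloVe objective with toy arrays; the weighting f(x) = (x / x_max)^α capped at 1 follows the original paper:

import numpy as np

def weighting(x, x_max=100.0, alpha=0.75):
    # f(x) = (x / x_max)^alpha, capped at 1; f(0) = 0, so unseen pairs contribute nothing.
    return np.minimum((x / x_max) ** alpha, 1.0)

def glove_loss(W, W_ctx, counts, eps=1e-12):
    diff = W @ W_ctx.T - np.log(counts + eps)      # w_i . w~_j - log P_ij for all i, j
    diff = np.where(counts > 0, diff, 0.0)         # guard against log(0) terms
    return 0.5 * np.sum(weighting(counts) * diff ** 2)

rng = np.random.default_rng(0)
V, d = 6, 3                                        # toy vocabulary size and dimensionality
W = rng.normal(size=(V, d))                        # "left" vectors
W_ctx = rng.normal(size=(V, d))                    # "right" (context) vectors
counts = rng.integers(0, 20, size=(V, V)).astype(float)
print(glove_loss(W, W_ctx, counts))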
Neural Language Models
Evaluation
Intrinsic
Perplexity: how well a probability distribution or probability model predicts a sample:
PP(p) = 2^H(p) = 2^(−Σ_x p(x) log2 p(x)),
where H(p) is the entropy of the distribution and x ranges over events.
Cross-Entropy: H(p̃, q) = −Σ_x p̃(x) log2 q(x),
where q is the model and p̃(x) is the empirical distribution of the test sample.
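A minimal sketch of computing cross-entropy and perplexity for a model q on a toy test distribution:

import numpy as np

def cross_entropy(p_emp, q):
    # H(p~, q) = -sum_x p~(x) log2 q(x)
    return -np.sum(p_emp * np.log2(q))

def perplexity(p_emp, q):
    return 2.0 ** cross_entropy(p_emp, q)

p_emp = np.array([0.5, 0.25, 0.25])    # empirical distribution of the test sample
q = np.array([0.4, 0.4, 0.2])          # model distribution
print(cross_entropy(p_emp, q), perplexity(p_emp, q))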
Extrinsic
Word Analogy Tasks
Neural Language Models
Linguistic Regularities in Continuous Space Word Representations (Mikolov,
2013)
Word Analogy Task, initially designed for word2vec
king is to man as queen is to ?
good is to best as smart is to ?
china is to beijing as russia is to ?
Neural Language Models
Linguistic Regularities in Continuous Space Word Representations (Mikolov,
2013)
Word Analogy Task: vector(king) − vector(man) + vector(woman) ≈ vector(queen)
Use cosine similarity: x = argmax_x′ cos(x′, a* − a + b), where a*, a, and b are typically excluded from the candidates.
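A minimal sketch of the analogy query with cosine similarity; the tiny hand-made embedding table is purely illustrative:

import numpy as np

emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.8, 0.1, 0.1]),
    "woman": np.array([0.8, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
}

def cos(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a_star, a, b):
    # x = argmax_x' cos(x', a* - a + b), excluding a*, a, b themselves.
    target = emb[a_star] - emb[a] + emb[b]
    candidates = {w: v for w, v in emb.items() if w not in {a_star, a, b}}
    return max(candidates, key=lambda w: cos(candidates[w], target))

print(analogy("king", "man", "woman"))   # expected: "queen"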
Neural Language Models
Is word2vec better than SVD and GloVe?
Levy & Goldberg, 2014: Neural Word Embedding as Implicit Matrix Factorization (word2vec
skip-gram with negative sampling implicitly factorizes a shifted PMI matrix)
Vylomova, 2016: word2vec performs similarly to SVD-PPMI on semantic and syntactic
evaluation tasks
Neural Language Models
Deep Models: LSTMs
The vanishing-gradient problem in RNNs (for long-term dependencies). Solution: LSTMs (Long
Short-Term Memory) and GRUs (Gated Recurrent Units), which perform similarly.
More: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Neural Language Models
Deep Models: CNNs
Originating from Computer Vision
Neural Language Models
Deep Models: CNN-Highway (Kim et al, 2015)
Char-CNN+biLSTM
Statistical Machine Translation
Parallel Corpus
Parallel Corpus
Popular corpora: Europarl, CommonCrawl. MT Workshop: http://www.statmt.org/
Statistical Machine Translation
Learning Alignment
Alignments yield word/phrase translation probabilities
Alignment for phrase-based MT
Statistical Machine Translation
SMT Model: Noisy Channel Model
Noisy Channel from Information Theory
Statistical Machine Translation
SMT Model: Noisy Channel Model
Recall Bayes' Theorem: P(B|A) = P(A|B) P(B) / P(A)
Bayesian Approach
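A minimal sketch of noisy-channel decoding, ê = argmax_e P(f | e) · P(e), i.e. translation model times language model; the candidate set and probabilities are toy stand-ins, not a real decoder:

# Toy candidate translations with made-up translation-model (tm) and
# language-model (lm) probabilities; a real decoder searches a huge space.
candidates = {
    "the house is small": {"tm": 0.20, "lm": 0.30},
    "the house is little": {"tm": 0.25, "lm": 0.10},
    "small is the house": {"tm": 0.20, "lm": 0.02},
}

best = max(candidates, key=lambda e: candidates[e]["tm"] * candidates[e]["lm"])
print(best)   # the candidate maximizing P(f | e) * P(e)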
Neural Machine Translation
First Neural MT models
Encoder-Decoder: encode the source sentence using RNN into a single vector and then
iteratively decode until EOS symbol is produced.
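A minimal sketch of this encode-then-greedily-decode loop with a plain tanh RNN and random weights (so the "translation" is meaningless; the point is the control flow):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

rng = np.random.default_rng(0)
src_vocab, tgt_vocab, hidden = 12, 10, 6        # toy sizes
EOS = 0                                          # end-of-sentence symbol id

Enc_in, Enc_hh = rng.normal(size=(hidden, src_vocab)), rng.normal(size=(hidden, hidden))
Dec_in, Dec_hh = rng.normal(size=(hidden, tgt_vocab)), rng.normal(size=(hidden, hidden))
Dec_out = rng.normal(size=(tgt_vocab, hidden))

# Encoder: read the source left to right, keep only the final hidden state.
h = np.zeros(hidden)
for src_id in [3, 5, 7]:                         # a toy source sentence
    h = np.tanh(Enc_in @ one_hot(src_id, src_vocab) + Enc_hh @ h)

# Decoder: start from the encoder's summary vector; stop at EOS (or a length cap).
output, prev = [], EOS
for _ in range(10):
    h = np.tanh(Dec_in @ one_hot(prev, tgt_vocab) + Dec_hh @ h)
    prev = int(np.argmax(softmax(Dec_out @ h)))  # greedy choice of the next token
    if prev == EOS:
        break
    output.append(prev)
print(output)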
Neural Machine Translation
Sutskever et al., 2014: Sequence to Sequence Learning with Neural Networks
Deep LSTMs with 4 layers, 1000 cells per layer, and 1000-dimensional word embeddings;
|Ve| = 160,000, |Vf| = 80,000. The resulting LSTM has 384M parameters.
Neural Machine Translation
Sutskever et al., 2014: Sequence to Sequence Learning with Neural Networks
Vector Space
Attentional Mechanism
Bahdanau et al., 2014: Neural Machine Translation by Jointly Learning to Align and Translate
Let's learn the alignment!
Attentional Mechanism
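A minimal sketch of attention over encoder states: score every source position against the current decoder state, softmax the scores into alignment weights, and take the weighted sum as the context vector. Bahdanau et al. use an additive (MLP) scoring function; plain dot products are used here only to keep the sketch short:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(decoder_state, encoder_states):
    scores = encoder_states @ decoder_state   # one score per source position
    weights = softmax(scores)                 # alignment weights (sum to 1)
    context = weights @ encoder_states        # weighted sum of encoder states
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 8))      # 5 source positions, hidden size 8
decoder_state = rng.normal(size=8)
context, weights = attend(decoder_state, encoder_states)
print(weights)                                # one row of the alignment matrix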
Attentional Mechanism
Bahdanau et al., 2014: Neural Machine Translation by Jointly Learning to Align and Translate
Attentional Mechanism (shamelessly stolen from nvidia tutorial)
Attentional Mechanism
Bahdanau et al., 2014: Neural Machine Translation by Jointly Learning to Align and Translate
Good news: the ability to interpret and visualize what the model is doing (the alignment
weights)
An example of Alignment Matrix
Attentional Mechanism
Other applications of attention
Other great papers to read
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (Xu et al.,
2015)
Grammar as a Foreign Language (Vinyals et al.,2014)
Teaching Machines to Read and Comprehend (Hermann et al., 2015)
Comparison of MT systems
Comparison of the SMT/Neural MT Models
Phrase-Based SMT baseline vs. Attentional vs. Seq2Seq (taken from Sutskever's paper)
Google MT
Google Translator
Google Translation is officially neural!
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine
Translation (Wu and many others, 2016)
And many others ...
Subword models (character-level, morpheme-level)
Dialogue systems (iPavlov challenge from MIPT)
Transfer Learning, NLP for Low-Resource Languages
Other models such as memory networks, adversarial networks, etc.
Great researchers: Yoshua Bengio, Geoffrey Hinton, Tomas Mikolov, Chris Dyer, Russ
Salakhutdinov, Kyunghyun Cho, Chris Manning, Hinrich Schuetze, Dan Jurafsky
RuSSIR-2017! Deadline: June 25th