Representation Learning of
Text : Word Vectors
Anuj Gupta
Satyam Saxena
@anujgupta82, @Satyam8989
anujgupta82@gmail.com, satyamiitj89@gmail.com
Outline
• Session 1
•Introduction
•Bigram model
•Skip Gram model
•CBOW model
•Evaluation
•Speed Up
•Session 2
•Glove
•T-SNE
•Secret Ingredients
2
Introduction
Example of NLP tasks :
Easy
• Spell Checking
• Keyword Search
• Finding Synonyms
Medium
• Parsing information from websites, documents, etc.
3
4
Hard
• Machine Translation (e.g. Translate Chinese text to English)
• Semantic Analysis (What is the meaning of query statement?)
• Co-reference (e.g. What does "he" or "it" refer to given a document?)
• Question Answering (e.g. Answering Jeopardy questions).
The first and arguably most important common denominator across
all NLP tasks is : how we represent text as input to our models.
• Machine does not understand text.
• We need numeric representation
• An integral part of any NLP pipeline.
• Unlike images (RGB matrix), for text there is no obvious way.
Legacy Techniques*
• Bag of words
• N-gram
• TF-IDF
5* Details in appendix
Bottom Line
• More often than not, how rich your input representation is has a huge bearing
on the quality of your downstream ML models.
• For NLP, archaic techniques treat words as atomic symbols. Thus every two
words are equally far apart.
• They don’t have any notion of either syntactic or semantic similarity
between parts of language.
• This is one of the chief reasons for poor/mediocre performance of NLP
based models.
But this has changed dramatically in the past few years
6
Distributional & Distributed Representations
7
Distributional representations
• Linguistic aspect.
• Based on co-occurrence/ context
• Distributional hypothesis: linguistic units with similar distributions
have similar meanings.
• The distributional property is usually induced from document or
context or textual vicinity (like sliding window).
8
Distributed representations
• Compact, dense and low dimensional representation.
• Differs from distributional representations as the constraint is to seek
efficient dense representation, not just to capture the co-occurrence
similarity.
• Each single component of vector representation does not have any
meaning of its own.
• The interpretable features (for example, word contexts in case of
word2vec) are hidden and distributed among uninterpretable vector
components.
9
• Embedding: Mapping between space with one dimension per linguistic
unit (word, character, phrase, sentence, document ) to a continuous vector
space with much lower dimension.
“You shall know a word by the company it keeps” - J R Firth
• One of the most successful ideas of modern statistical NLP
10
Global Matrix Factorization
11
Co-occurrence with SVD
• Define a word using the words in its context.
• Words that co-occur
• Building a co-occurrence matrix M.
Context = previous word and
next word
Corpus ={“I like deep learning.”
“I like NLP.”
“I enjoy flying.”}
12
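To make this concrete, here is a minimal numpy sketch (not from the deck) that builds such a window-1 co-occurrence matrix for the toy corpus; the tokenization and vocabulary ordering are my own choices.

```python
import numpy as np

# Toy corpus from the slide; context = previous word and next word (window = 1).
corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

M = np.zeros((len(vocab), len(vocab)), dtype=np.int32)
for sent in tokens:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):                    # previous and next position
            if 0 <= j < len(sent):
                M[idx[w], idx[sent[j]]] += 1        # count the co-occurrence

print(vocab)
print(M)
```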
• Imagine we do this for a large
corpus of text
• row vector xdog describes usage
of word dog in the corpus
• can be seen as coordinates of
point in n-dimensional
Euclidean space Rn
• Reduce dimensions using SVD =
M
13
• Given a matrix of m × n dimensionality, construct a m × k matrix, where k << n
• M = U Σ Vᵀ
• U is an m × m orthogonal matrix (UUᵀ = I)
• Σ is a m × n diagonal matrix, with diagonal values ordered from largest to smallest (σ1 ≥
σ2 ≥ · · · ≥ σr ≥ 0, where r = min(m, n)) [σi’s are known as singular values]
• V is an n × n orthogonal matrix (VVᵀ = I)
• We construct M’ s.t. rank(M’) = k
• We compute M’ = U Σ’ Vᵀ, where Σ’ = Σ with only the k largest singular values retained
• k captures the desired percentage of variance
• Then, the submatrix of U formed by its first k columns is our desired word embedding matrix.
14
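A sketch of the rank-k reduction described above, using numpy's SVD on the co-occurrence matrix M built earlier; k is left as a free parameter.

```python
import numpy as np

def svd_embeddings(M, k):
    """Return |V| x k word vectors from a |V| x |V| co-occurrence matrix."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)   # M = U diag(S) Vt, singular values sorted
    return U[:, :k]                                    # first k columns of U, as on the slide

# vectors = svd_embeddings(M.astype(float), k=2)       # 2-D vectors, e.g. for plotting
```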
Result of SVD based Model
[Plots of the resulting embeddings for K = 2 and K = 3]
15
An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence
Rohde et al. 2005
16
Pros & Cons
+ Simple method
+ Captures some sense (though weak) of similarity between words.
- Matrix is extremely sparse.
- Quadratic cost to train (perform SVD)
- Drastic imbalance in frequencies can adversely impact quality of
embeddings.
- Adding new words is expensive.
Take home : we worked with statistics of the corpus rather than working with
the corpus directly. This will recur in GloVe
17
BiGram Model
Idea: Directly learn low-dimensional word vectors ?
18
Language Models
• Filter out good sentences from bad ones.
• Good = semantically and syntactically correct.
• Modeled this via probability of given sequence of n words
Pr (w1, w2, ….., wn)
• S1 = “the cat jumped over the dog”, Pr(S1) ~ 1
• S2 = “jumped over the the cat dog”, Pr(S2) ~ 0
19
Unary Language Models
20
Binary Language Models
21
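The equation images for these two slides are not reproduced in this transcript, but the factorizations behind them are the standard ones:

```latex
\text{Unigram:}\quad P(w_1,\dots,w_n) \approx \prod_{i=1}^{n} P(w_i)
\qquad\qquad
\text{Bigram:}\quad P(w_1,\dots,w_n) \approx \prod_{i=2}^{n} P(w_i \mid w_{i-1})
```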
BiGram Model
• Objective : given wi , predict wi+1
• Training data: given sequence of n words < w1, w2, ….., wn >, extract bi-gram
pairs (wi-1 , wi)
• Knowns:
• input – output training examples : (wi-1 , wi)
• Vocab of training corpus (V) = ∪ (wi)
• Unknowns: word embeddings. Model as a matrix E |v| x d . d = embedding
dimensions. Usually a hyper parameter.
• Model : shallow net
22
Architecture
[Diagram: input wi-1 → embedding lookup → scoring layer → softmax layer → predicted wi]
23
• Feed index of wi-1 as input to network.
• Use index to lookup embedding matrix.
• Perform affine transform on word embedding to get a score vector.
• Compute probability for each word.
• Set 1-hot vector of wi as target.
• Set loss = cross-entropy between probability vector and target vector.
Steps
24
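A toy numpy version of these steps (the variable names and sizes here are mine, not the notebook's):

```python
import numpy as np

V, d = 1000, 50                       # vocab size, embedding dimension (hyper-parameters)
E = 0.01 * np.random.randn(V, d)      # embedding matrix E, one row per word
W = 0.01 * np.random.randn(d, V)      # weights of the scoring (affine) layer
b = np.zeros(V)

def bigram_step(prev_idx, target_idx):
    v = E[prev_idx]                                   # lookup embedding of w_{i-1}
    scores = v @ W + b                                # affine transform -> score per vocab word
    scores -= scores.max()                            # numerical stability before exp
    probs = np.exp(scores) / np.exp(scores).sum()     # softmax -> probability vector
    loss = -np.log(probs[target_idx])                 # cross-entropy against 1-hot target w_i
    return probs, loss
```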
Softmax
25
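The softmax that turns the score vector z into probabilities (standard form):

```latex
\hat{y}_j = \operatorname{softmax}(z)_j = \frac{\exp(z_j)}{\sum_{k=1}^{|V|} \exp(z_k)}
```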
Cross Entropy
26
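With a 1-hot target y (1 at the index of the true word wi), cross-entropy reduces to the negative log-probability of that word:

```latex
H(y, \hat{y}) = -\sum_{j=1}^{|V|} y_j \log \hat{y}_j = -\log \hat{y}_{i}
```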
27
●Per word, we have 2 vectors :
1. As row in Embedding layer (E)
2. As column in weights layer (used for affine transformation)
●It’s common to take the average of the 2 vectors.
●It’s common to normalise the vectors. Divide by norm.
●An alternative way to compute ŷi : # (wi, wi-1) / Σj∈V # (wj, wi-1)
●Use co-occurrence matrix to compute these counts.
Remarks
I learn best with toy code,
that I can play with.
- Andrew Trask
jupyter notebook 1
28
CBOW
SkipGram
29
CBOW
• Continuous Bag of words.
• Proposed by Mikolov et al. in 2013
• Conceptually, very similar to Bi-gram model
• In the bigram model, there were 2 key drawbacks:
1. The context was very small – we took only wi-1 while predicting wi
2. Context is not just the preceding words, but the following words too.
30
• “the brown cat jumped over the dog”
Context = the brown cat over the dog
Target = jumped
• Context window = k words on either side of the word to be
predicted.
• Pr (w1, w2, ….., wn) = ∏ Pr(wc | wc−k, . . . , wc−1, wc+1, . . . , wc+k)
• W = total number of unique windows
• Each window is a sliding block of 2k+1 words
31
CBOW Model
• Objective : given wc−k, . . . , wc−1, wc+1, . . . , wc+k , predict wc
• Training data: given sequence of n words < w1, w2, ….., wn >, for each window
extract context and target (wc−k, . . . , wc−1, wc+1, . . . , wc+k ; wc )
• Knowns:
• input – output training examples : (wc−k, . . . , wc−1, wc+1, . . . , wc+k ; wc )
• Vocab of training corpus (V) = ∪(wi)
• Unknowns: word embeddings. Model as a matrix E |v| x d . d = embedding
dimensions. Usually a hyper parameter.
32
Architecture
33
• Feed indexes of (x(c−k) , ... , x(c−1) , x(c+1) , ... , x(c+k)) for the input context of size
k.
• Use indexes to lookup embedding matrix.
• Average these vectors to get vˆ = (vc−k + ... + vc−1 + vc+1 + ... + vc+k) / 2k
• Perform affine transform on vˆ to get a score vector.
• Turn scores into probabilities for each word.
• Set 1-hot vector of wc as target.
• Set loss = cross-entropy between probability vector and target vector.
Steps
34
Maths behind the scene
• Optimization objective J = - log Pr(wc | wc−k, . . . , wc−1, wc+1, . . . , wc+k)
• Maximizing Pr() = Minimizing – log Pr()
• Let vˆ = (vc−k + . . . + vc−1 + vc+1 + . . . + vc+k) / 2k
• Then, the RHS expands as shown below
• gradient descent to update all relevant word vectors uc and wj.
35
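The expanded form of this objective (the “RHS” referred to above), in the standard notation where u_j is the output (score-layer) vector of word j:

```latex
J = -\log P(w_c \mid w_{c-k},\dots,w_{c-1},w_{c+1},\dots,w_{c+k})
  = -u_c^{\top}\hat{v} + \log \sum_{j=1}^{|V|} \exp\!\left(u_j^{\top}\hat{v}\right)
```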
Skip-Gram model
• 2nd model proposed by Mikolov et al. in 2013
• Turns CBOW over its head.
• CBOW = given context, predict the target word
• Skip Gram = given target, predict context
• “the brown cat jumped over the dog”
Target = jumped
Context = the, brown, cat, over, the, dog
36
• Objective : given wc , predict wc−k, . . . , wc−1, wc+1, . . . , wc+k
• Training data: given sequence of n words < w1, w2, ….., wn >, for each window
extract target and context pairs (wc, wc−k) , (wc, wc−1) , (wc, wc+1), (wc, wc+k)
• Knowns:
• input – output training examples : (wc, wc−k) , (wc, wc−1) , (wc, wc+1), (wc, wc+k)
• Vocab of training corpus (V) = ∪ (wi)
• Unknowns: word embeddings. Model as a matrix E |v| x d . d = embedding
dimensions. Usually a hyper parameter.
37
Architecture
38
• Feed index of xc
• Use index to lookup embedding matrix.
• Perform affine transform on vc to get a score vector.
• Turn scores into probabilities for each word.
• Set 1-hot vectors of the context words wc−k, . . . , wc+k as targets.
• Set loss = cross-entropy between probability vector and target vector.
Steps
39
Maths behind the scene
• Optimization objective J = - log Pr(wc−k, . . . , wc−1, wc+1, . . . , wc+k | wc)
• gradient descent to update all relevant word vectors uc and wj.
40
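With the usual conditional-independence (naive Bayes) assumption over context words, this objective expands to:

```latex
J = -\sum_{\substack{j=-k \\ j \neq 0}}^{k} \log P(w_{c+j} \mid w_c)
  = \sum_{\substack{j=-k \\ j \neq 0}}^{k} \left( \log \sum_{i=1}^{|V|} \exp\!\left(u_i^{\top} v_c\right) - u_{c+j}^{\top} v_c \right)
```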
Evaluating Word vectors
41
• How to quantitatively evaluate the quality of word vectors?
• Intrinsic Evaluation :
• Word Vector Analogies
• Extrinsic Evaluation :
• Downstream NLP task
42
Intrinsic Evaluation
• Specific Intermediate subtasks
• Easy to compute.
• Analogy completion:
• a : b :: c : ? (find the word d that best completes the analogy; formula below)
man : woman :: king : ?
• Evaluate word vectors by how well their cosine distance after addition
captures intuitive semantic and syntactic analogy questions
• Discarding the input words from the search!
• Problem: What if the information is there but not linear?
43
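Concretely, the answer d is the vocabulary word whose vector is closest (by cosine similarity) to x_b − x_a + x_c, with a, b, c excluded from the search:

```latex
d = \arg\max_{i} \; \frac{(x_b - x_a + x_c)^{\top} x_i}{\lVert x_b - x_a + x_c \rVert \, \lVert x_i \rVert}
```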
44
Extrinsic Evaluation
• Real task at hand
• Ex: Sentiment analysis.
• Not very robust.
• End result is a function of whole process and not just embeddings.
• Process:
• Data pipelines
• Algorithm(s)
• Fine tuning
• Quality of dataset
45
Speed Up
46
Bottleneck
• Recall, to calculate probability, we use softmax. The denominator is a
sum across the entire vocab.
• Further, this is calculated for every window.
• Too expensive.
• A single update of parameters requires iterating over |V|. Our vocab is
usually in the millions.
47
To approximate the probability, don’t use the entire vocab.
There are 2 popular lines of attack to achieve this:
•Modify the structure of the softmax
•Hierarchical Softmax
• Sampling techniques : don’t use the entire vocabulary to compute the sum
• Negative sampling
48
● Arrange words in vocab as leaf units of a
balanced binary tree.
● |V| leaves and |V| - 1 internal nodes
● Each leaf node has a unique path from root to
the leaf
● Probability of a word (leaf node Lw) =
Probability of the path from root node to leaf Lw
● No output vector representation for words,
unlike softmax.
● Instead every internal node has a d-dimension
vector associated with it - v’n(w, j)
Hierarchical Softmax
n(w, j) means the j-th unit on the path from root to the
word w
● Product of probabilities over nodes in the path
● Each probability is computed using a sigmoid
● Inside it we check : is the (j+1)-th node on the path the left child of the j-th node or not?
● v’n(w, j)ᵀ h : vector product between the vector on the hidden layer and the vector for the
inner node in consideration.
● p(w = w2) : we start at the root and navigate to the leaf w2, multiplying the sigmoid
probabilities of taking the left/right branch at each internal node along the way.
Example
● Cost: O(|V|) to O(log |V| )
●In practice, use Huffman tree
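Putting the pieces together, the standard hierarchical-softmax probability of a word is a product of sigmoid decisions along its path, where [[x]] is +1 if x is true and −1 otherwise, L(w) is the path length, and h is the hidden-layer vector:

```latex
p(w \mid h) = \prod_{j=1}^{L(w)-1} \sigma\!\Big( [[\, n(w, j{+}1) = \operatorname{left\text{-}child}(n(w, j)) \,]] \cdot {v'}_{n(w,j)}^{\top} h \Big)
```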
Negative Sampling
●Given (w, c) : word and context
●Let P(D=1|w,c) be probability that (w, c) came from the corpus data.
●P(D=0|w,c) = probability that (w, c) didn’t come from the corpus data.
● Lets model P(D=1|w,c) with sigmoid:
●Objective function (J):
○ maximize P(D=1|w,c) if (w, c) is in the corpus data.
○ maximize P(D=0|w,c) if (w, c) is not in the corpus data.
●We take a simple maximum likelihood approach of these two probabilities.
θ denotes the parameters of the model; in our case U and V, the input and output word vectors.
(Taking log on both sides)
●Now, maximizing log likelihood = minimizing negative log likelihood.
● D˜ is a “false” or negative “corpus” with wrong sentences - "jumped cat dog the the over"
● Generate D˜ on the fly by randomly sampling negatives from the word bank.
● For skip-gram, our new objective function for observing the context word wc − m + j given
the center word wc would be :
regular softmax loss for skip-gram
● Likewise for CBOW, our new objective function for observing the center
word uc given the context vector vˆ
● In the above formulation, {u˜k | k = 1 . . . K} are sampled from Pn(w).
● best Pn(w) = Unigram distribution raised to the power of 3/4
● Usually K = 20-30 works well.
regular softmax loss for CBOW
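For reference, the skip-gram negative-sampling objective for one (center, context) pair has the standard form, with K negative words ũk drawn from Pn(w):

```latex
J = -\log \sigma\!\left(u_{c+j}^{\top} v_c\right) \;-\; \sum_{k=1}^{K} \log \sigma\!\left(-\tilde{u}_k^{\top} v_c\right)
```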
GloVe
Global matrix factorization methods
● Use co-occurrence counts
● Ex: LSA, HAL (Lund & Burgess), COALS (Rohde et al), Hellinger-PCA (Lebret & Collobert)
+ Fast training
+ Efficient usage of statistics
+ Captures word similarity
- Do badly on analogy tasks
- Disproportionate importance given to large counts
58
Local context window method
● Use window to determine context of a word
● Ex: Skip-gram/CBOW ( Mikolov et al), NNLM(Bengio et al), HLBL, (Collobert & Weston)
+ Capture word similarity.
+ Also perform better on analogy tasks
- Slow down with increase in corpus size
- Inefficient usage of statistics
59
Combining the best of both worlds
● Glove model tries to combine the two major model families :-
○ Global matrix factorization (co-occurrence counts)
○ Local context window (context comes from window)
= Co-occurrence counts with context distance
60
Co-occurrence counts with context distance
● Uses context distance : weight each word in context window using its
distance from the center word
● This ensures nearby words have more influence than far off ones.
● Sentence -> “I like NLP”
○ Co-occurrence for I -> like : 1.0 & I -> NLP : 0.5
○ Co-occurrence for like -> I : 1.0 & like -> NLP : 1.0
○ Co-occurrence for NLP -> I : 0.5 & NLP -> like : 1.0
● Corpus C: I like NLP. I like cricket.
Co-occurrence matrix for C
61
Issues with Co-occurrence Matrix
● Long tail distribution
● Frequent words contribute disproportionately
(use weight function to fix this)
● Use Log for normalization
● Avoid log 0 : Add 1 to each Xij
62
Intuition for Glove
●Think of matrix factorization algorithms used in recommendation systems.
●Latent Factor models
○ Find features that describe the characteristics of rated objects.
○ Item characteristics and user preferences are described using vectors which are called factor
vectors
○ Assumption: Ratings can be inferred from a model put together from a smaller number of
parameters
63
Latent Factor models
● Dot product estimates user’s interest in the item
○ where, qi : factor vector for item i.
pu : factor vector for user u
r̂ui : estimated interest of user u in item i
● How to compute vectors for items and users ?
64
Matrix Factorization
●rui : known rating of user u for item i
● predicted rating :
● Similarly glove model tries to model the co-occurrence counts with the
following equation :
65
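The two equations referenced above, in their standard forms: the recommender-system rating prediction, and the analogous GloVe model of log co-occurrence counts.

```latex
\hat{r}_{ui} = q_i^{\top} p_u
\qquad\qquad
w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j \;\approx\; \log X_{ij}
```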
Weighting function
●Properties of f(X)
○vanish at 0 i.e. f(0) = 0
○monotonically increasing
○f(x) should be relatively small for large values of x
● Empirically α = 0.75, xmax = 100 works best
66
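The weighting function from the GloVe paper, which satisfies all three properties:

```latex
f(x) =
\begin{cases}
\left( x / x_{\max} \right)^{\alpha} & \text{if } x < x_{\max} \\
1 & \text{otherwise}
\end{cases}
```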
Loss Function
● Scalable.
● Fast training
○ Training time doesn’t depend on the corpus size
○ Always fitting to a |V| x |V| matrix.
● Good performance with small corpus, and small vectors.
67
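The loss itself (from the GloVe paper) is a weighted least-squares fit to the log co-occurrence counts:

```latex
J = \sum_{i, j = 1}^{|V|} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
```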
●Input :
○Xij (|V| x |V| matrix) : co-occurrence matrix
●Parameters
○ W (|V| x |D| matrix) & W˜ (|V| x |D| matrix) :
■ wi and w˜j are the representations of the i-th & j-th words from the W and W˜ matrices respectively.
○ bi (|V| x 1) column vector : variable for incorporating bias terms
○ bj (1 x |V|) row vector : variable for incorporating bias terms
68
Training
● Train on Wikipedia data
●|V| = 2000
● Window size = 3
● Iterations = 10000
●D = 50
●Learn two representations for each word in |V|.
●reg = 0.01
●Use momentum optimizer with momentum=0.9.
69
Quick Experiment
Results - months & centuries
70
Countries & languages
71
military terms
72
Music
73
Countries & Languages
[Plot: languages and countries form separate clusters]
74
t-SNE
Objective
● Given a collection of N high-dimensional objects x1, x2, …. xN.
● How can we get a feel for how these objects are (relatively) arranged ?
76
Introduction
● Build a map (low dimensional) s.t. distances between points reflect “similarities” in
the data :
●Minimize some objective function that measures the discrepancy between
similarities in the data and similarities in the map
77
Principal Components Analysis
78
Principal component analysis
● PCA mainly tries to preserve large pairwise distances in the map.
●Is that what we want ?
79
Goals
● Preserve Distances
● Preserve the neighborhood of each point
80
t-SNE High dimension
●Measure pairwise similarities between high dimensional objects
81
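The high-dimensional similarities are Gaussian-based conditional probabilities, symmetrized as in the t-SNE paper:

```latex
p_{j \mid i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j \mid i} + p_{i \mid j}}{2N}
```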
t-SNE Lower dimension
●Measure pairwise similarities between low dimensional map points
82
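In the low-dimensional map, similarities use a Student t-distribution with one degree of freedom (heavy tails):

```latex
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}
```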
t-SNE
●We have measure of similarity of data points in High Dimension
●We have measure of similarity of data points in Low Dimension
●We need a distance measure between the two.
●Once we have distance measure, all we want is : to minimize it
83
One possible choice - KL divergence
● It’s a measure of how one probability distribution diverges from a second
expected probability distribution
84
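For discrete distributions the divergence reads as below; t-SNE uses it directly as the objective C over the pairwise similarities:

```latex
KL(P \,\Vert\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)},
\qquad
C = KL(P \,\Vert\, Q) = \sum_i \sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```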
KL divergence applied to t-SNE
Objective function (C)
● We want nearby points in high-D to remain nearby in low-D
○ In case it’s not, then
■ pij will be large (because points are nearby)
■ but qij will be small (because points are far away)
■ This will result in a larger penalty
■ In contrast, if both pij and qij are large : lower penalty 85
KL divergence applied to t-SNE
●Likewise, we want far away points in high-D to remain (relatively) far away in
low-D
○ In case it’s not, then
■ pij will be small (because points are far away)
■ but qij will be large (because points are nearby)
■ This will result in a lower penalty
● t-SNE mainly preserves local similarity structure of the data
86
t-Distributed Stochastic Neighbor Embedding
●Move points around to minimize :
87
Why a Student t-Distribution ?
●t-SNE tries to retain local structure of this data in the map
●Result : dissimilar points have to be modelled as far apart in the map
●Hinton has shown that the Student t-distribution is very similar to a Gaussian
distribution
88
[Plot: local structures vs global structure]
● Local structures are preserved
● Global structure is lost
Deciding the effective number of neighbours
● We need to decide the radii in different parts of the space, so that we can keep
the effective number of neighbours about constant.
● A big radius leads to a high entropy for the distribution over neighbors of i.
● A small radius leads to a low entropy.
● So decide what entropy you want and then find the radius that produces that
entropy.
● It's easier to specify 2^entropy
○ This is called the perplexity
○ It is the effective number of neighbors.
89
90
Experiments
Hyper parameters really matter: Playing with perplexity
● Projected 100 data points, clearly separated into two different clusters, with t-SNE
● Applied t-SNE with different values of perplexity
● With perplexity = 2, local variations in the data dominate
● With perplexity in the range 5-50, as suggested in the paper, plots still capture some structure in the data
91
Hyper parameters really matter: Playing with #iterations
● Perplexity set to 30.0
● Applied tSNE with different number of iterations
● Takeaway : different datasets may require different number of iterations
92
Cluster sizes can be misleading
● Used t-SNE to plot two clusters with different standard deviations
● Bottom line : we cannot see cluster sizes in t-SNE plots
93
Distances in t-SNE plots
● At lower perplexity clusters look equidistant
● At perplexity=50, tSNE captures some notion of global geometry in the data
● 50 data points in each sub cluster
94
Distances in t-SNE plots
● tSNE is not able to capture global geometry even at perplexity=50.
● Key takeaway : well-separated clusters may not mean anything in t-SNE.
● 200 data points in each sub cluster
95
Random noise doesn’t always look random
● For this experiment, we generated random points from a Gaussian distribution
● Plots with lower perplexity show misleading structures in the data
96
You can see some shapes sometimes
● Axis-aligned Gaussian distribution
● For certain values of perplexity, long clusters look almost correct.
● t-SNE tends to expand regions which are denser
97
98
Why word2vec does
better than others ?
99
At heart they are all the same !!
● It has been shown that in essence GloVe and word2vec are no different
from traditional methods like PCA, LSA etc. (Levy et al. 2015 call them
DSMs)
● GloVe ⋍ PCA/LSA is straightforward (both factorize a global counts
matrix)
● word2vec ⋍ PCA/LSA is non-trivial (Levy et al. 2015)
● They show that in essence word2vec also factorizes a word-context
matrix (PMI)
100
●Despite this “equality” of algorithm, word2vec is still known to do better
on several tasks.
●Why ?
○Levy et al. 2015 show : magic lies in Hyperparameters
101
Hyperparameters
●Pre-processing
○ Dynamic context window
○ Subsampling frequent words
○ Deleting rare words
●Post-processing
○ Adding context vectors
○ Vector normalization
Pre-processing
●Dynamic Context window
○ In DSM, context window: unweighted & constant size.
○ Glove & SGNS - give more weightage to closer terms
○ SGNS - even the window size can be dynamic and take a value between 1 & max of windowsize.
●Subsampling frequent words
○ SGNS dilutes frequent words by randomly removing words whose frequency f is higher than
some threshold t, with the probability shown below
●Deleting rare words
○ In SGNS, rare words are also deleted before creating context windows. 102
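The subsampling probability referred to above is the one from the word2vec paper: a word with frequency f is dropped with probability

```latex
p(w) = 1 - \sqrt{\tfrac{t}{f(w)}}
```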
Post-processing
●Adding context vectors
○ Glove adds word vectors and the context vectors for the final representation.
●Vector normalization
○ All vectors can be normalized to unit length
103
Key Take Home
●Hyperparameters vs Algorithms
○ Hyper-parameter settings are more important than the algorithm choice
○ No single algorithm consistently outperforms the others
●Hyperparameters vs more data
○ Training on a larger corpus helps on some tasks
○ In many cases, tuning hyperparameters is more beneficial
104
References
Idea of word vectors is not new.
• Learning representations by back-propagating errors (Rumelhart et al. 1986)
• A neural probabilistic language model (Bengio et al., 2003)
• NLP from Scratch (Collobert & Weston, 2008)
• Word2Vec (Mikolov et al. 2013)
•Sebastian Ruder’s 3 part Blog series
•Lecture 2-4, CS 224d “Deep Learning for NLP” by Richard Socher
•word2vec Parameter Learning Explained by X Rong
105
References
• GloVe :
•https://nlp.stanford.edu/pubs/glove.pdf
• https://www.youtube.com/watch?v=tRsSi_sqXjI
• http://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/
• https://cs224d.stanford.edu/lectures/CS224d-Lecture3.pdf
• t-SNE:
•http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf
• http://distill.pub/2016/misread-tsne/
• https://www.slideshare.net/ssuserb667a8/visualization-data-using-tsne
• https://youtu.be/RJVL80Gg3lA
• KL Divergence
• http://tdhopper.com/blog/2015/Sep/04/cross-entropy-and-kl-divergence/
106
References
• Cross Entropy :
• https://www.youtube.com/watch?v=tRsSi_sqXjI
• http://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/
• Softmax:
• https://en.wikipedia.org/wiki/Softmax_function
• http://cs231n.github.io/linear-classify/#softmax
• Tensor Flow
• 1.0 API docs
• CS20SI
107
https://fifthelephant.talkfunnel.com/2017/17-learning-representations-of-text-for-nlp
108
Appendix
109
Bag of Words
• Vocab = set of all the words in corpus
• Document = Words in document w.r.t vocab with multiplicity
Sentence 1: "The cat sat on the hat"
Sentence 2: "The dog ate the cat and the hat”
Vocab = { the, cat, sat, on, hat, dog, ate, and }
Sentence 1: { 2, 1, 1, 1, 1, 0, 0, 0 }
Sentence 2 : { 3, 1, 0, 0, 1, 1, 1, 1}
110
Pros & Cons
+ Quick and Simple
- Too simple
- Orderless
- No notion of syntactic/semantic similarity
111
N-gram model
• Vocab = set of all n-grams in corpus
• Document = n-grams in document w.r.t vocab with multiplicity
For bigram:
Sentence 1: "The cat sat on the hat"
Sentence 2: "The dog ate the cat and the hat”
Vocab = { the cat, cat sat, sat on, on the, the hat, the dog, dog ate, ate the, cat and,
and the}
Sentence 1: { 1, 1, 1, 1, 1, 0, 0, 0, 0, 0}
Sentence 2 : { 1, 0, 0, 0, 0, 1, 1, 1, 1, 1}
112
Pros & Cons
+ Tries to incorporate order of words
- Very large vocab set
- No notion of syntactic/semantic similarity
113
Term Frequency–Inverse Document Frequency (TF-IDF)
• Captures importance of a word to a document in a corpus.
• Importance increases proportionally to the number of times a word appears in the
document; but is offset by the frequency of the word in the corpus.
• TF(t) = (Number of times term t appears in a document) / (Total number of terms
in the document).
• IDF(t) = log (Total number of documents / Number of documents with term t in
it).
• TF-IDF (t) = TF(t) * IDF(t)
114
Example
• Document D1 contains 100 words.
• cat appears 3 times in D1
• TF(cat) = 3 / 100
= 0.03
• Corpus contains 10 million documents
• cat appears in 1000 documents
• IDF(cat) = log (10,000,000 / 1,000)
= 4
• TF-IDF (cat) = 0.03 * 4 = 0.12
115
Pros & Cons
• Pros:
• Easy to compute
• Has some basic metric to extract the most descriptive terms in a document
• Thus, can easily compute the similarity between 2 documents using it
• Disadvantages:
• Based on the bag-of-words (BoW) model, therefore it does not capture position
in text, semantics, co-occurrences in different documents, etc.
• Thus, TF-IDF is only useful as a lexical-level feature (presence/absence).
• Cannot capture semantics (unlike topic models, word embeddings)
116
● Positive Pointwise Mutual Information (PPMI): PMI is a common measure for the strength of
association between two words. It is defined as the log ratio between the joint probability of two
words w and c and the product of their marginal probabilities:
a. PMI(w, c) = log [ P(w, c) / (P(w)P(c)) ]
b. PPMI(w, c) = max(PMI(w, c), 0)
117

Mais conteúdo relacionado

Mais procurados

Word Embeddings, why the hype ?
Word Embeddings, why the hype ? Word Embeddings, why the hype ?
Word Embeddings, why the hype ? Hady Elsahar
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsRoelof Pieters
 
Visual-Semantic Embeddings: some thoughts on Language
Visual-Semantic Embeddings: some thoughts on LanguageVisual-Semantic Embeddings: some thoughts on Language
Visual-Semantic Embeddings: some thoughts on LanguageRoelof Pieters
 
Diversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News StoriesDiversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News StoriesBryan Gummibearehausen
 
Representation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesRepresentation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesFelipe Moraes
 
Tomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPTomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPMachine Learning Prague
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Bhaskar Mitra
 
[KDD 2018 tutorial] End to-end goal-oriented question answering systems
[KDD 2018 tutorial] End to-end goal-oriented question answering systems[KDD 2018 tutorial] End to-end goal-oriented question answering systems
[KDD 2018 tutorial] End to-end goal-oriented question answering systemsQi He
 
MaxEnt (Loglinear) Models - Overview
MaxEnt (Loglinear) Models - OverviewMaxEnt (Loglinear) Models - Overview
MaxEnt (Loglinear) Models - Overviewananth
 
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...Daniele Di Mitri
 
Deep Learning for Information Retrieval
Deep Learning for Information RetrievalDeep Learning for Information Retrieval
Deep Learning for Information RetrievalRoelof Pieters
 
Recent Advances in NLP
  Recent Advances in NLP  Recent Advances in NLP
Recent Advances in NLPAnuj Gupta
 
Sentence representations and question answering (YerevaNN)
Sentence representations and question answering (YerevaNN)Sentence representations and question answering (YerevaNN)
Sentence representations and question answering (YerevaNN)YerevaNN research lab
 
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshopورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshopiwan_rg
 
Information Retrieval with Deep Learning
Information Retrieval with Deep LearningInformation Retrieval with Deep Learning
Information Retrieval with Deep LearningAdam Gibson
 
Deep Learning, an interactive introduction for NLP-ers
Deep Learning, an interactive introduction for NLP-ersDeep Learning, an interactive introduction for NLP-ers
Deep Learning, an interactive introduction for NLP-ersRoelof Pieters
 

Mais procurados (19)

Word Embeddings, why the hype ?
Word Embeddings, why the hype ? Word Embeddings, why the hype ?
Word Embeddings, why the hype ?
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word Embeddings
 
Visual-Semantic Embeddings: some thoughts on Language
Visual-Semantic Embeddings: some thoughts on LanguageVisual-Semantic Embeddings: some thoughts on Language
Visual-Semantic Embeddings: some thoughts on Language
 
Diversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News StoriesDiversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News Stories
 
Representation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesRepresentation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and Phrases
 
Tomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPTomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLP
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)
 
[KDD 2018 tutorial] End to-end goal-oriented question answering systems
[KDD 2018 tutorial] End to-end goal-oriented question answering systems[KDD 2018 tutorial] End to-end goal-oriented question answering systems
[KDD 2018 tutorial] End to-end goal-oriented question answering systems
 
Word embedding
Word embedding Word embedding
Word embedding
 
MaxEnt (Loglinear) Models - Overview
MaxEnt (Loglinear) Models - OverviewMaxEnt (Loglinear) Models - Overview
MaxEnt (Loglinear) Models - Overview
 
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
 
Deep Learning for Information Retrieval
Deep Learning for Information RetrievalDeep Learning for Information Retrieval
Deep Learning for Information Retrieval
 
Recent Advances in NLP
  Recent Advances in NLP  Recent Advances in NLP
Recent Advances in NLP
 
Sentence representations and question answering (YerevaNN)
Sentence representations and question answering (YerevaNN)Sentence representations and question answering (YerevaNN)
Sentence representations and question answering (YerevaNN)
 
What is word2vec?
What is word2vec?What is word2vec?
What is word2vec?
 
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshopورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
 
Information Retrieval with Deep Learning
Information Retrieval with Deep LearningInformation Retrieval with Deep Learning
Information Retrieval with Deep Learning
 
Thai Word Embedding with Tensorflow
Thai Word Embedding with Tensorflow Thai Word Embedding with Tensorflow
Thai Word Embedding with Tensorflow
 
Deep Learning, an interactive introduction for NLP-ers
Deep Learning, an interactive introduction for NLP-ersDeep Learning, an interactive introduction for NLP-ers
Deep Learning, an interactive introduction for NLP-ers
 

Semelhante a Representation Learning of Text: Word Vectors in 40 Characters

Deep Learning for Personalized Search and Recommender Systems
Deep Learning for Personalized Search and Recommender SystemsDeep Learning for Personalized Search and Recommender Systems
Deep Learning for Personalized Search and Recommender SystemsBenjamin Le
 
K means and dbscan
K means and dbscanK means and dbscan
K means and dbscanYan Xu
 
presentation2-180202073525.pptx
presentation2-180202073525.pptxpresentation2-180202073525.pptx
presentation2-180202073525.pptxKtonNguyn2
 
Language Model (D3L1 Deep Learning for Speech and Language UPC 2017)
Language Model (D3L1 Deep Learning for Speech and Language UPC 2017)Language Model (D3L1 Deep Learning for Speech and Language UPC 2017)
Language Model (D3L1 Deep Learning for Speech and Language UPC 2017)Universitat Politècnica de Catalunya
 
A note on word embedding
A note on word embeddingA note on word embedding
A note on word embeddingKhang Pham
 
Machine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional ManagersMachine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional ManagersAlbert Y. C. Chen
 
machine learning.pptx
machine learning.pptxmachine learning.pptx
machine learning.pptxAbdusSadik
 
Word2vec in Theory Practice with TensorFlow
Word2vec in Theory Practice with TensorFlowWord2vec in Theory Practice with TensorFlow
Word2vec in Theory Practice with TensorFlowBruno Gonçalves
 
Nlp and transformer (v3s)
Nlp and transformer (v3s)Nlp and transformer (v3s)
Nlp and transformer (v3s)H K Yoon
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You NeedDaiki Tanaka
 
Breaking the Softmax Bottleneck: a high-rank RNN Language Model
Breaking the Softmax Bottleneck: a high-rank RNN Language ModelBreaking the Softmax Bottleneck: a high-rank RNN Language Model
Breaking the Softmax Bottleneck: a high-rank RNN Language ModelSsu-Rui Lee
 
Supervised Prediction of Graph Summaries
Supervised Prediction of Graph SummariesSupervised Prediction of Graph Summaries
Supervised Prediction of Graph SummariesDaniil Mirylenka
 
Unit-1 Introduction and Mathematical Preliminaries.pptx
Unit-1 Introduction and Mathematical Preliminaries.pptxUnit-1 Introduction and Mathematical Preliminaries.pptx
Unit-1 Introduction and Mathematical Preliminaries.pptxavinashBajpayee1
 

Semelhante a Representation Learning of Text: Word Vectors in 40 Characters (20)

Deep Learning for Personalized Search and Recommender Systems
Deep Learning for Personalized Search and Recommender SystemsDeep Learning for Personalized Search and Recommender Systems
Deep Learning for Personalized Search and Recommender Systems
 
CNN for modeling sentence
CNN for modeling sentenceCNN for modeling sentence
CNN for modeling sentence
 
Lecture1.pptx
Lecture1.pptxLecture1.pptx
Lecture1.pptx
 
K means and dbscan
K means and dbscanK means and dbscan
K means and dbscan
 
presentation2-180202073525.pptx
presentation2-180202073525.pptxpresentation2-180202073525.pptx
presentation2-180202073525.pptx
 
Science in text mining
Science in text miningScience in text mining
Science in text mining
 
Language Model (D3L1 Deep Learning for Speech and Language UPC 2017)
Language Model (D3L1 Deep Learning for Speech and Language UPC 2017)Language Model (D3L1 Deep Learning for Speech and Language UPC 2017)
Language Model (D3L1 Deep Learning for Speech and Language UPC 2017)
 
Word2Vec
Word2VecWord2Vec
Word2Vec
 
Word2 vec
Word2 vecWord2 vec
Word2 vec
 
A note on word embedding
A note on word embeddingA note on word embedding
A note on word embedding
 
wordembedding.pptx
wordembedding.pptxwordembedding.pptx
wordembedding.pptx
 
Machine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional ManagersMachine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional Managers
 
machine learning.pptx
machine learning.pptxmachine learning.pptx
machine learning.pptx
 
Word2vec in Theory Practice with TensorFlow
Word2vec in Theory Practice with TensorFlowWord2vec in Theory Practice with TensorFlow
Word2vec in Theory Practice with TensorFlow
 
Word2vec and Friends
Word2vec and FriendsWord2vec and Friends
Word2vec and Friends
 
Nlp and transformer (v3s)
Nlp and transformer (v3s)Nlp and transformer (v3s)
Nlp and transformer (v3s)
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need
 
Breaking the Softmax Bottleneck: a high-rank RNN Language Model
Breaking the Softmax Bottleneck: a high-rank RNN Language ModelBreaking the Softmax Bottleneck: a high-rank RNN Language Model
Breaking the Softmax Bottleneck: a high-rank RNN Language Model
 
Supervised Prediction of Graph Summaries
Supervised Prediction of Graph SummariesSupervised Prediction of Graph Summaries
Supervised Prediction of Graph Summaries
 
Unit-1 Introduction and Mathematical Preliminaries.pptx
Unit-1 Introduction and Mathematical Preliminaries.pptxUnit-1 Introduction and Mathematical Preliminaries.pptx
Unit-1 Introduction and Mathematical Preliminaries.pptx
 

Mais de Anuj Gupta

ODSC East 2020 : Continuous_learning_systems
ODSC East 2020 : Continuous_learning_systemsODSC East 2020 : Continuous_learning_systems
ODSC East 2020 : Continuous_learning_systemsAnuj Gupta
 
Continuous Learning Systems: Building ML systems that learn from their mistakes
Continuous Learning Systems: Building ML systems that learn from their mistakesContinuous Learning Systems: Building ML systems that learn from their mistakes
Continuous Learning Systems: Building ML systems that learn from their mistakesAnuj Gupta
 
Sarcasm Detection: Achilles Heel of sentiment analysis
Sarcasm Detection: Achilles Heel of sentiment analysisSarcasm Detection: Achilles Heel of sentiment analysis
Sarcasm Detection: Achilles Heel of sentiment analysisAnuj Gupta
 
Representation Learning of Text for NLP
Representation Learning of Text for NLPRepresentation Learning of Text for NLP
Representation Learning of Text for NLPAnuj Gupta
 
Synthetic Gradients - Decoupling Layers of a Neural Nets
Synthetic Gradients - Decoupling Layers of a Neural NetsSynthetic Gradients - Decoupling Layers of a Neural Nets
Synthetic Gradients - Decoupling Layers of a Neural NetsAnuj Gupta
 
Representation Learning for NLP
Representation Learning for NLPRepresentation Learning for NLP
Representation Learning for NLPAnuj Gupta
 
Building Continuous Learning Systems
Building Continuous Learning SystemsBuilding Continuous Learning Systems
Building Continuous Learning SystemsAnuj Gupta
 

Mais de Anuj Gupta (7)

ODSC East 2020 : Continuous_learning_systems
ODSC East 2020 : Continuous_learning_systemsODSC East 2020 : Continuous_learning_systems
ODSC East 2020 : Continuous_learning_systems
 
Continuous Learning Systems: Building ML systems that learn from their mistakes
Continuous Learning Systems: Building ML systems that learn from their mistakesContinuous Learning Systems: Building ML systems that learn from their mistakes
Continuous Learning Systems: Building ML systems that learn from their mistakes
 
Sarcasm Detection: Achilles Heel of sentiment analysis
Sarcasm Detection: Achilles Heel of sentiment analysisSarcasm Detection: Achilles Heel of sentiment analysis
Sarcasm Detection: Achilles Heel of sentiment analysis
 
Representation Learning of Text for NLP
Representation Learning of Text for NLPRepresentation Learning of Text for NLP
Representation Learning of Text for NLP
 
Synthetic Gradients - Decoupling Layers of a Neural Nets
Synthetic Gradients - Decoupling Layers of a Neural NetsSynthetic Gradients - Decoupling Layers of a Neural Nets
Synthetic Gradients - Decoupling Layers of a Neural Nets
 
Representation Learning for NLP
Representation Learning for NLPRepresentation Learning for NLP
Representation Learning for NLP
 
Building Continuous Learning Systems
Building Continuous Learning SystemsBuilding Continuous Learning Systems
Building Continuous Learning Systems
 

Último

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 

Último (20)

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

Representation Learning of Text: Word Vectors in 40 Characters

  • 1. Representation Learning of Text : Word Vectors Anuj Gupta Satyam Saxena @anujgupta82, @Satyam8989 anujgupta82@gmail.com, satyamiitj89@gmail.com
  • 2. Outline • Session 1 •Introduction •Bigram model •Skip Gram model •CBOW model •Evaluation •Speed Up •Session 2 •Glove •T-SNE •Secret Ingredients 2
  • 3. Introduction Example of NLP tasks : Easy • Spell Checking • Keyword Search • Finding Synonyms Medium • Parsing information from websites, documents, etc. 3
  • 4. 4 Hard • Machine Translation (e.g. Translate Chinese text to English) • Semantic Analysis (What is the meaning of query statement?) • Co-reference (e.g. What does "he" or "it" refer to given a document?) • Question Answering (e.g. Answering Jeopardy questions). The first and arguably most important common denominator across all NLP tasks is : how we represent text as input to our models.
  • 5. • Machine does not understand text. • We need numeric representation • An integral part of any NLP pipeline. • Unlike images (RGB matrix), for text there is no obvious way. Legacy Techniques* • Bag of words • N-gram • TF-IDF 5* Details in appendix
  • 6. Bottom Line • More often than not, how rich your input representation is has huge bearing on the quality of your downstream ML models. • For NLP, archaic techniques treat words as atomic symbols. Thus every 2 words are equally apart. • They don’t have any notion of either syntactic or semantic similarity between parts of language. • This is one of the chief reasons for poor/mediocre performance of NLP based models. But this has changed dramatically in past few years 6
  • 7. Distributional & Distributed Representations 7
  • 8. Distributional representations • Linguistic aspect. • Based on co-occurrence/ context • Distributional hypothesis: linguistic units with similar distributions have similar meanings. • The distributional property is usually induced from document or context or textual vicinity (like sliding window). 8
  • 9. Distributed representations • Compact, dense and low dimensional representation. • Differs from distributional representations as the constraint is to seek efficient dense representation, not just to capture the co-occurrence similarity. • Each single component of vector representation does not have any meaning of its own. • The interpretable features (for example, word contexts in case of word2vec) are hidden and distributed among uninterpretable vector components. 9
  • 10. • Embedding: Mapping between space with one dimension per linguistic unit (word, character, phrase, sentence, document ) to a continuous vector space with much lower dimension. “You shall know a word by the company it keeps” - J R Firth • One of the most successful ideas of modern statistical NLP 10
  • 12. Co-occurrence with SVD • Define a word using the words in its context. • Words that co-occur • Building a co-occurrence matrix M. Context = previous word and next word Corpus ={“I like deep learning.” “I like NLP.” “I enjoy flying.”} 12
  • 13. • Imagine we do this for a large corpus of text • row vector xdog describes usage of word dog in the corpus • can be seen as coordinates of point in n-dimensional Euclidean space Rn • Reduce dimensions using SVD = M 13
  • 14. • Given a matrix of m × n dimensionality, construct a m × k matrix, where k << n • M = U Σ VT • U is an m × m orthogonal matrix (UUT = I) • Σ is a m × n diagonal matrix, with diagonal values ordered from largest to smallest (σ1 ≥ σ2 ≥ · · · ≥ σr ≥ 0, where r = min(m, n)) [σi’s are known as singular values] • V is an n × n orthogonal matrix (VVT = I) • We construct M’ s.t. rank(M’) = k • We compute M’ = U Σ’ V, where Σ’ = Σ with k largest singular values • k captures desired percentage variance • Then, submatrix U v,k is our desired word embedding matrix. 14
  • 15. Result of SVD based Model K = 2 K = 3 15
  • 16. An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence Rohde et al. 2005 16
  • 17. Pros & Cons + Simple method + Captures some sense (though weak) of similarity between words. - Matrix is extremely sparse. - Quadratic cost to train (perform SVD) - Drastic imbalance in frequencies can adversely impact quality of embeddings. - Adding new words is expensive. Take home : we worked with statistics of the corpus rather than working with the corpus directly. This will recur in GloVe 17
  • 18. BiGram Model Idea: Directly learn low-dimensional word vectors ? 18
  • 19. Language Models • Filter out good sentences from bad ones. • Good = semantically and syntactically correct. • Modeled this via probability of given sequence of n words Pr (w1, w2, ….., wn) • S1 = “the cat jumped over the dog”, Pr(S1) ~ 1 • S2 = “jumped over the the cat dog”, Pr(S2) ~ 0 19
  • 22. BiGram Model • Objective : given wi , predict wi+1 • Training data: given sequence of n words < w1, w2, ….., wn >, extract bi-gram pairs (wi-1 , wi) • Knowns: • input – output training examples : (wi-1 , wi) • Vocab of training corpus (V) = U (wi) • Unknowns: word embeddings. Model as a matrix E |v| x d . d = embedding dimensions. Usually a hyper parameter. • Model : shallow net 22
  • 24. • Feed index of wi-1 as input to network. • Use index to lookup embedding matrix. • Perform affine transform on word embedding to get a score vector. • Compute probability for each word. • Set 1-hot vector of wi as target. • Set loss = cross-entropy between probability vector and target vector. Steps 24
  • 27. 27 ●Per word, we have 2 vectors : 1. As row in Embedding layer (E) 2. As column in weights layer (used for affine transformation) ●It’s common to take the average of the 2 vectors. ●It’s common to normalise the vectors (divide by the norm). ●An alternative way to compute ŷi : #(wi-1, wi) / Σj∈V #(wi-1, wj) ●Use the co-occurrence matrix to compute these counts. Remarks
  • 28. I learn best with toy code, that I can play with. - Andrew Trask jupyter notebook 1 28
  • 30. CBOW • Continuous Bag of words. • Proposed by Mikolov et al. in 2013 • Conceptually, very similar to Bi-gram model • In the bigram model, there were 2 key drawbacks: 1. The context was very small – we took only wi-1 , while predicting wi 2. Context is not just preceding words; but following words too. 30
  • 31. • “the brown cat jumped over the dog” Context = the brown cat over the dog Target = jumped • Context window = k words on either side of the word to be predicted. • Pr (w1, w2, ….., wn) = ∏ Pr(wc | wc−k, . . . , wc−1, wc+1, . . . , wc+k) • W = total number of unique windows • Each window is a sliding block of 2k+1 words 31
  • 32. CBOW Model • Objective : given wc−k, . . . , wc−1, wc+1, . . . , wc+k , predict wc • Training data: given sequence of n words < w1, w2, ….., wn >, for each window extract context and target (wc−k, . . . , wc−1, wc+1, . . . , wc+k ; wc ) • Knowns: • input – output training examples : (wc−k, . . . , wc−1, wc+1, . . . , wc+k ; wc ) • Vocab of training corpus (V) = ∪(wi) • Unknowns: word embeddings. Model as a matrix E |v| x d . d = embedding dimensions. Usually a hyper parameter. 32
  • 34. • Feed indexes of (x(c−k) , ... , x(c−1) , x(c+1) , ... , x(c+k)) for the input context of size k. • Use indexes to lookup embedding matrix. • Average these vectors to get v̂ = (vc−k + ... + vc−1 + vc+1 + ... + vc+k) / 2k • Perform affine transform on v̂ to get a score vector. • Turn scores into probabilities for each word. • Set 1-hot vector of wc as target. • Set loss = cross-entropy between probability vector and target vector. Steps 34
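A minimal sketch of the CBOW forward pass described above, reusing the same shapes as the bigram sketch (E is |V| x d, W is d x |V|, b has length |V|); illustrative only, not the notebook code.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cbow_loss(context_idxs, target_idx, E, W, b):
    v_hat = E[context_idxs].mean(axis=0)    # average the 2k context embeddings
    probs = softmax(v_hat @ W + b)          # affine transform -> probabilities
    return -np.log(probs[target_idx])       # cross-entropy vs 1-hot target w_c
```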
  • 35. Maths behind the scene • Optimization objective J = - log Pr(wc | wc−k, . . . , wc−1, wc+1, . . . , wc+k) • Maximizing Pr() = Minimizing – log Pr() • Let v̂ = (vc−k + . . . + vc−1 + vc+1 + . . . + vc+k) / 2k • Then J = − ucᵀ v̂ + log Σj=1..|V| exp(ujᵀ v̂) • gradient descent to update all relevant word vectors uc and vj. 35
  • 36. Skip-Gram model • 2nd model proposed by Mikolov et al. in 2013 • Turns CBOW on its head. • CBOW = given context, predict the target word • Skip Gram = given target, predict context • “the brown cat jumped over the dog” Target = jumped Context = the, brown, cat, over, the, dog 36
  • 37. • Objective : given wc , predict wc−k, . . . , wc−1, wc+1, . . . , wc+k • Training data: given sequence of n words < w1, w2, ….., wn >, for each window extract target and context pairs (wc, wc−k), ..., (wc, wc−1), (wc, wc+1), ..., (wc, wc+k) • Knowns: • input – output training examples : (wc, wc−k), ..., (wc, wc−1), (wc, wc+1), ..., (wc, wc+k) • Vocab of training corpus (V) = ∪ (wi) • Unknowns: word embeddings. Model as a matrix E |v| x d . d = embedding dimensions. Usually a hyper parameter. 37
  • 39. • Feed index of xc • Use index to lookup embedding matrix and get v̂ (no averaging here). • Perform affine transform on v̂ to get a score vector. • Turn scores into probabilities for each word. • Set the 1-hot vectors of the 2k context words as targets. • Set loss = sum of cross-entropies between the probability vector and each target vector. Steps 39
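A matching sketch of the skip-gram loss for one window: a single softmax over the vocab, with the cross-entropy summed over every context word (illustrative names and shapes, as above).

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def skipgram_loss(center_idx, context_idxs, E, W, b):
    v_hat = E[center_idx]                   # embedding of the center word w_c
    probs = softmax(v_hat @ W + b)          # scores -> probabilities over vocab
    return -sum(np.log(probs[j]) for j in context_idxs)  # sum of cross-entropies
```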
  • 40. Maths behind the scene • Optimization objective J = - log Pr(wc−k, . . . , wc−1, wc+1, . . . , wc+k | wc) • gradient descent to update all relevant word vectors vc and uj. 40
  • 42. • How to quantitatively evaluate the quality of word vectors? • Intrinsic Evaluation : • Word Vector Analogies • Extrinsic Evaluation : • Downstream NLP task 42
  • 43. Intrinsic Evaluation • Specific intermediate subtasks • Easy to compute. • Analogy completion: a:b :: c:? e.g. man:woman :: king:? • Evaluate word vectors by how well their cosine distance after addition captures intuitive semantic and syntactic analogy questions • Discarding the input words from the search! • Problem: What if the information is there but not linear? 43
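A small, assumed helper for analogy completion with cosine similarity, discarding the input words from the search; `vectors` is taken to be a dict mapping words to numpy arrays.

```python
import numpy as np

def analogy(a, b, c, vectors, topn=1):
    target = vectors[b] - vectors[a] + vectors[c]
    target = target / np.linalg.norm(target)
    scored = []
    for w, v in vectors.items():
        if w in (a, b, c):                  # discard the input words
            continue
        scored.append((v @ target / np.linalg.norm(v), w))
    return [w for _, w in sorted(scored, reverse=True)[:topn]]

# e.g. analogy("man", "woman", "king", vectors) should ideally return ["queen"]
```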
  • 45. Extrinsic Evaluation • Real task at hand • Ex: Sentiment analysis. • Not very robust. • End result is a function of whole process and not just embeddings. • Process: • Data pipelines • Algorithm(s) • Fine tuning • Quality of dataset 45
  • 47. Bottleneck • Recall, to calculate probability, we use softmax. The denominator is a sum across the entire vocab. • Further, this is calculated for every window. • Too expensive. • A single update of parameters requires iterating over |V|. Our vocab is usually in the millions. 47
  • 48. To approximate the probability, don’t use the entire vocab. There are 2 popular lines of attack to achieve this: •Modify the structure of the softmax •Hierarchical Softmax • Sampling techniques : don’t use the entire vocabulary to compute the sum • Negative sampling 48
  • 49. ● Arrange words in vocab as leaf units of a balanced binary tree. ● |V| leaves ⇒ |V| - 1 internal nodes ● Each leaf node has a unique path from root to the leaf ● Probability of a word (leaf node Lw) = Probability of the path from root node to leaf Lw ● No output vector representation for words, unlike softmax. ● Instead every internal node has a d-dimensional vector associated with it - v’n(w, j) Hierarchical Softmax n(w, j) means the j-th unit on the path from root to the word w
  • 50. ● Probability of a word = product of probabilities over the internal nodes on its path ● Each probability is computed using a sigmoid: p(w | h) = ∏j σ( [[n(w, j+1) = left-child(n(w, j))]] · v’n(w, j)ᵀ h ) ● Inside it we check : is the (j+1)th node on the path the left child of the jth node or not ([[·]] is +1 if yes, −1 otherwise) ● v’n(w, j)ᵀ h : vector product between the vector on the hidden layer and the vector for the inner node in consideration.
  • 51. ● Example: p(w = w2) ● We start at the root and navigate to the leaf w2 ● At each internal node n(w2, j) on the path we pick up one factor σ(±v’n(w2, j)ᵀ h), the sign set by whether we branch left or right ● p(w = w2) = product of these factors along the path
  • 52. ● Cost: drops from O(|V|) to O(log |V|) ● In practice, use a Huffman tree (frequent words get shorter paths)
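A sketch of how one leaf probability is computed under hierarchical softmax: walk the root-to-leaf path and multiply one sigmoid factor per internal node. The list-based path encoding here is an assumption for illustration, not word2vec's actual tree code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_probability(h, path_vectors, directions):
    # path_vectors: d-dim vectors v'_{n(w, j)} of the inner nodes on the path
    # directions:   +1 if the (j+1)-th node is the left child of the j-th, else -1
    p = 1.0
    for v, sign in zip(path_vectors, directions):
        p *= sigmoid(sign * (v @ h))        # one factor per node: O(log |V|) total
    return p
```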
  • 53. Negative Sampling ●Given (w, c) : word and context ●Let P(D=1|w,c) be the probability that (w, c) came from the corpus data. ●P(D=0|w,c) = probability that (w, c) didn’t come from the corpus data. ● Let’s model P(D=1|w,c) with a sigmoid: ●Objective function (J): ○ maximize P(D=1|w,c) if (w, c) is in the corpus data. ○ maximize P(D=0|w,c) if (w, c) is not in the corpus data. ●We take a simple maximum likelihood approach over these two probabilities.
  • 54. θ denotes the parameters of the model; in our case U and V, the input and output word vectors. Taking log on both sides:
  • 55. ●Now, maximizing log likelihood = minimizing negative log likelihood. ● D̃ is a “false” or negative “corpus” made of wrong sentences - "jumped cat dog the the over" ● We can generate D̃ on the fly by randomly sampling these negatives from the word bank. ● For skip-gram, our new objective function for observing the context word wc−m+j given the center word wc would be : − log σ(uc−m+jᵀ vc) − Σk=1..K log σ(−ũkᵀ vc) (contrast with the regular softmax loss for skip-gram)
  • 56. ● Likewise for CBOW, our new objective function for observing the center word uc given the context vector v̂ would be : − log σ(ucᵀ v̂) − Σk=1..K log σ(−ũkᵀ v̂) ● In the above formulation, {u˜k |k = 1 . . . K} are sampled from Pn(w). ● best Pn(w) = Unigram distribution raised to the power of 3/4 ● Usually K = 20-30 works well. (contrast with the regular softmax loss for CBOW)
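A hedged sketch of the skip-gram negative-sampling loss for one (center, context) pair: pull the observed pair together and push K sampled noise words away, with the noise distribution being the unigram distribution raised to the power 3/4. Argument names and shapes are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v_center, u_context, U_out, unigram_probs, K=20, rng=None):
    rng = rng or np.random.default_rng()
    noise = np.asarray(unigram_probs, dtype=float) ** 0.75
    noise /= noise.sum()                                # P_n(w) proportional to unigram^(3/4)
    neg_idx = rng.choice(len(noise), size=K, p=noise)   # K negative samples
    loss = -np.log(sigmoid(u_context @ v_center))       # observed pair: P(D=1|w,c)
    loss -= np.log(sigmoid(-U_out[neg_idx] @ v_center)).sum()  # noise pairs: P(D=0)
    return loss
```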
  • 57. GloVe
  • 58. Global matrix factorization methods ● Use co-occurrence counts ● Ex: LSA, HAL (Lund & Burgess), COALS (Rohde et al), Hellinger-PCA (Lebret & Collobert) + Fast training + Efficient usage of statistics + Captures word similarity - Do badly on analogy tasks - Disproportionate importance given to large counts 58
  • 59. Local context window methods ● Use a window to determine the context of a word ● Ex: Skip-gram/CBOW (Mikolov et al), NNLM (Bengio et al), HLBL, (Collobert & Weston) + Capture word similarity. + Also perform better on analogy tasks - Slow down as corpus size increases - Inefficient usage of statistics 59
  • 60. Combining the best of both worlds ● Glove model tries to combine the two major model families :- ○ Global matrix factorization (co-occurrence counts) ○ Local context window (context comes from window) = Co-occurrence counts with context distance 60
  • 61. Co-occurrence counts with context distance ● Uses context distance : weight each word in context window using its distance from the center word ● This ensures nearby words have more influence than far off ones. ● Sentence -> “I like NLP” ○ Co-occurrence for I -> like : 1.0 & I -> NLP : 0.5 ○ Co-occurrence for like -> I : 1.0 & like -> NLP : 1.0 ○ Co-occurrence for NLP -> I : 0.5 & NLP -> like : 1.0 ● Corpus C: I like NLP. I like cricket. Co-occurrence matrix for C 61
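A small sketch of the distance-weighted counts above: each context word contributes 1/distance rather than 1 (toy whitespace tokenizer and a window of 2 are assumptions).

```python
from collections import defaultdict

def weighted_cooccurrence(sentences, window=2):
    X = defaultdict(float)
    for sent in sentences:
        words = sent.lower().rstrip(".").split()
        for i, w in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    X[(w, words[j])] += 1.0 / abs(i - j)  # nearby words weigh more
    return X

X = weighted_cooccurrence(["I like NLP.", "I like cricket."])
# X[("i", "like")] == 2.0, X[("i", "nlp")] == 0.5, X[("like", "nlp")] == 1.0
```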
  • 62. Issues with Co-occurrence Matrix ● Long tail distribution ● Frequent words contribute disproportionately (use a weight function to fix this) ● Use log for normalization ● Avoid log 0 : add 1 to each Xij 62
  • 63. Intuition for Glove ●Think of matrix factorization algorithms used in recommendation systems. ●Latent Factor models ○ Find features that describe the characteristics of rated objects. ○ Item characteristics and user preferences are described using vectors which are called factor vectors ○ Assumption: Ratings can be inferred from a model put together from a smaller number of parameters 63
  • 64. Latent Factor models ● The dot product qiᵀpu estimates user u’s interest in item i ○ where qi : factor vector for item i, pu : factor vector for user u, r̂ui : estimated user interest ● How to compute vectors for items and users ? 64
  • 65. Matrix Factorization ●rui : known rating of user u for item i ● predicted rating : r̂ui = qiᵀ pu ● Similarly, the GloVe model tries to model the co-occurrence counts with the following equation : wiᵀ w̃j + bi + b̃j = log(Xij) 65
  • 66. Weighting function ● f(x) = (x/xmax)^α if x < xmax, else 1 ●Properties of f(X) ○vanishes at 0 i.e. f(0) = 0 ○monotonically increasing ○f(x) should be relatively small for large values of x ● Empirically 𝞪 = 0.75, xmax=100 works best 66
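The weighting function written out directly, using the empirically reported values (alpha = 0.75, xmax = 100):

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    # f(0) = 0, monotonically increasing, and capped at 1 for large counts
    return (x / x_max) ** alpha if x < x_max else 1.0
```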
  • 67. Loss Function ● Scalable. ● Fast training ○ Training time doesn’t depend on the corpus size ○ Always fitting to a |V| x |V| matrix. ● Good performance with small corpus, and small vectors. 67
  • 68. ●Input : ○Xij (|V| x |V| matrix) : co-occurrence matrix ●Parameters ○ W (|V| x |D| matrix) & W˜ (|V| x |D| matrix) : ■ wi and w̃j are the representations of the ith & jth word from the W and W˜ matrices respectively. ○bi (|V| x 1) column vector : variable for incorporating biases in terms ○bj (1 x |V|) row vector : variable for incorporating biases in terms 68 Training
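A hedged sketch of the resulting objective, J = sum over non-zero Xij of f(Xij) (wi·w̃j + bi + b̃j − log Xij)², with X stored as a dict of non-zero counts; this illustrates the loss only, not the slides' TensorFlow training code.

```python
import numpy as np

def glove_loss(X, W, W_tilde, b, b_tilde, x_max=100.0, alpha=0.75):
    loss = 0.0
    for (i, j), x in X.items():                 # only non-zero co-occurrence counts
        f = (x / x_max) ** alpha if x < x_max else 1.0
        diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(x)
        loss += f * diff ** 2
    return loss
```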
  • 69. ● Train on Wikipedia data ●|V| = 2000 ● Window size = 3 ● Iterations = 10000 ●D = 50 ●Learn two representations for each word in |V|. ●reg = 0.01 ●Use momentum optimizer with momentum=0.9. 69 Quick Experiment
  • 70. Results - months & centuries 70
  • 75. t-SNE
  • 76. Objective ● Given a collection of N high-dimensional objects x1, x2, …. xN. ● How can we get a feel for how these objects are (relatively) arranged ? 76
  • 77. Introduction ●Build a map (low dimensional) such that distances between map points reflect “similarities” in the data ●Minimize some objective function that measures the discrepancy between similarities in the data and similarities in the map 77
  • 79. Principal component analysis ● PCA mainly tries to preserve large pairwise distances in the map. ●Is that what we want ? 79
  • 80. Goals ● Preserve distances ● Preserve the neighbourhood of each point 80
  • 81. t-SNE High dimension ●Measure pairwise similarities between high dimensional objects 81
  • 82. t-SNE Lower dimension ●Measure pairwise similarities between low dimensional map points 82
  • 83. t-SNE ●We have measure of similarity of data points in High Dimension ●We have measure of similarity of data points in Low Dimension ●We need a distance measure between the two. ●Once we have distance measure, all we want is : to minimize it 83
  • 84. One possible choice - KL divergence ● It’s a measure of how one probability distribution diverges from a second expected probability distribution 84
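A small, assumed helper showing the discrete KL divergence; note it is not symmetric, which is why the direction KL(P || Q) matters for t-SNE.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

print(kl_divergence([0.7, 0.2, 0.1], [0.5, 0.3, 0.2]))  # > 0; 0 only when P == Q
```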
  • 85. KL divergence applied to t-SNE Objective function (C) ● We want nearby points in high-D to remain nearby in low-D ○ If that’s not the case, then ■ pij will be large (because the points are nearby in high-D) ■ but qij will be small (because the points are far apart in the map) ■ This results in a larger penalty ■ In contrast, if both pij and qij are large : lower penalty 85
  • 86. KL divergence applied to t-SNE ●Likewise, we want far away points in high-D to remain (relatively) far away in low-D ○ If that’s not the case, then ■ pij will be small (because the points are far apart in high-D) ■ but qij will be large (because the points are nearby in the map) ■ This results in a lower penalty ● t-SNE mainly preserves the local similarity structure of the data 86
  • 87. t-Distributed Stochastic Neighbor Embedding ●Move points around in the map to minimize : C = KL(P || Q) = Σi Σj pij log(pij / qij) 87
  • 88. Why a Student t-Distribution ? ●t-SNE tries to retain the local structure of the data in the map ●Result : dissimilar points have to be modelled as far apart in the map ●Hinton has shown that the Student t-distribution is very similar to the Gaussian distribution 88 ● Local structure is preserved ● Global structure is lost
  • 89. Deciding the effective number of neighbours ● We need to decide the radii in different parts of the space, so that we can keep the effective number of neighbours about constant. ● A big radius leads to a high entropy for the distribution over neighbors of i. ● A small radius leads to a low entropy. ● So decide what entropy you want and then find the radius that produces that entropy. ● It's easier to specify 2^entropy ○ This is called the perplexity ○ It is the effective number of neighbors. 89
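A tiny illustration of why 2^entropy acts as an effective number of neighbours: a uniform distribution over k neighbours has entropy log2(k), so its perplexity is exactly k.

```python
import numpy as np

def perplexity(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    entropy = -np.sum(p * np.log2(p))
    return 2.0 ** entropy

print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0 -> 4 effective neighbours
```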
  • 91. Hyper parameters really matter: Playing with perplexity ● Projected 100 data points, clearly separated into two clusters, with t-SNE ● Applied t-SNE with different values of perplexity ● With perplexity=2, local variations in the data dominate ● With perplexity in the range 5-50, as suggested in the paper, the plots still capture some structure in the data 91
  • 92. Hyper parameters really matter: Playing with #iterations ● Perplexity set to 30.0 ● Applied t-SNE with different numbers of iterations ● Takeaway : different datasets may require different numbers of iterations 92
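Typical scikit-learn usage matching these experiments, with perplexity and the iteration budget as the knobs being varied; the data here is random, and the `n_iter` argument name is an assumption about the installed version (newer scikit-learn releases call it `max_iter`).

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.RandomState(0).randn(100, 50)       # 100 points in 50 dimensions
Y = TSNE(n_components=2, perplexity=30.0, n_iter=1000,
         random_state=0).fit_transform(X)
print(Y.shape)                                    # (100, 2)
```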
  • 93. Cluster sizes can be misleading ● Uses t-SNE to plot two clusters with different standard deviations ● Bottom line: we cannot read cluster sizes off t-SNE plots 93
  • 94. Distances in t-SNE plots ● At lower perplexity clusters look equidistant ● At perplexity=50, tSNE captures some notion of global geometry in the data ● 50 data points in each sub cluster 94
  • 95. Distances in t-SNE plots ● tSNE is not able to capture global geometry even at perplexity=50. ● key take away : well separated clusters may not mean anything in tSNE. ● 200 data points in each sub cluster 95
  • 96. Random noise doesn’t always look random ● For this experiment, we generated random points from a Gaussian distribution ● Plots with lower perplexity show misleading structure in the data 96
  • 97. You can see some shapes sometimes ● Axis-aligned Gaussian distribution ● For certain values of perplexity, long clusters look almost correct. ● t-SNE tends to expand regions which are denser 97
  • 99. 99 At heart they are all the same !! ●It has been shown that in essence GloVe and word2vec are no different from traditional methods like PCA, LSA etc (Levy et al. 2015 call them DSMs) ●GloVe ⋍ PCA/LSA is straightforward (both factorize a global counts matrix) ●word2vec ⋍ PCA/LSA is non-trivial (Levy et al. 2015) ●They show that in essence word2vec also factorizes a word-context matrix (PMI)
  • 100. 100 ●Despite this “equality” of algorithms, word2vec is still known to do better on several tasks. ●Why ? ○Levy et al. 2015 show : the magic lies in the hyperparameters
  • 101. 101 Hyperparameters ●Pre-processing ○ Dynamic context window ○ Subsampling frequent words ○ Deleting rare words ●Post-processing ○ Adding context words ○ Vector normalization
  • 102. Pre-processing ●Dynamic context window ○ In DSMs, the context window is unweighted & of constant size. ○ GloVe & SGNS give more weight to closer terms ○ SGNS - even the window size can be dynamic, taking a value between 1 & the maximum window size. ●Subsampling frequent words ○ SGNS dilutes frequent words by randomly removing words whose frequency f is higher than some threshold t, with probability p = 1 − √(t/f) (see the sketch below) ●Deleting rare words ○ In SGNS, rare words are also deleted before creating context windows. 102
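A sketch of the subsampling rule mentioned above, assuming the word2vec-style discard probability p = 1 − sqrt(t/f); exact formulas differ slightly across implementations.

```python
import numpy as np

def keep_token(freq, t=1e-5, rng=None):
    # freq: the word's relative frequency in the corpus
    rng = rng or np.random.default_rng()
    p_discard = max(0.0, 1.0 - np.sqrt(t / freq))
    return rng.random() >= p_discard

# A word making up 1% of the corpus is kept only ~3% of the time
print(keep_token(0.01))
```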
  • 103. Post-processing ●Adding context vectors ○ Glove adds word vectors and the context vectors for the final representation. ●Vector normalization ○ All vectors can be normalized to unit length 103
  • 104. Key Take Home ●Hyperparameters vs Algorithms ○ Hyperparameter settings are more important than the algorithm choice ○ No single algorithm consistently outperforms the others ●Hyperparameters vs more data ○ Training on a larger corpus helps on some tasks ○ In many cases, tuning hyperparameters is more beneficial 104
  • 105. References Idea of word vectors is not new. • Learning representations by back-propagating errors (Rumelhart et al. 1986) • A neural probabilistic language model (Bengio et al., 2003) • NLP from Scratch (Collobert & Weston, 2008) • Word2Vec (Mikolov et al. 2013) •Sebastian Ruder’s 3 part Blog series •Lecture 2-4, CS 224d “Deep Learning for NLP” by Richard Socher •word2vec Parameter Learning Explained by X Rong 105
  • 106. References • GloVe : •https://nlp.stanford.edu/pubs/glove.pdf • https://www.youtube.com/watch?v=tRsSi_sqXjI • http://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/ • https://cs224d.stanford.edu/lectures/CS224d-Lecture3.pdf • t-SNE: •http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf • http://distill.pub/2016/misread-tsne/ • https://www.slideshare.net/ssuserb667a8/visualization-data-using-tsne • https://youtu.be/RJVL80Gg3lA • KL Divergence • http://tdhopper.com/blog/2015/Sep/04/cross-entropy-and-kl-divergence/ 106
  • 107. References • Cross Entropy : • https://www.youtube.com/watch?v=tRsSi_sqXjI • http://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/ • Softmax: • https://en.wikipedia.org/wiki/Softmax_function • http://cs231n.github.io/linear-classify/#softmax • Tensor Flow • 1.0 API docs • CS20SI 107
  • 110. Bag of Words • Vocab = set of all the words in corpus • Document = Words in document w.r.t vocab with multiplicity Sentence 1: "The cat sat on the hat" Sentence 2: "The dog ate the cat and the hat” Vocab = { the, cat, sat, on, hat, dog, ate, and } Sentence 1: { 2, 1, 1, 1, 1, 0, 0, 0 } Sentence 2 : { 3, 1, 0, 0, 1, 1, 1, 1} 110
  • 111. Pros & Cons + Quick and Simple - Too simple - Orderless - No notion of syntactic/semantic similarity 111
  • 112. N-gram model • Vocab = set of all n-grams in corpus • Document = n-grams in document w.r.t vocab with multiplicity For bigram: Sentence 1: "The cat sat on the hat" Sentence 2: "The dog ate the cat and the hat” Vocab = { the cat, cat sat, sat on, on the, the hat, the dog, dog ate, ate the, cat and, and the} Sentence 1: { 1, 1, 1, 1, 1, 0, 0, 0, 0, 0} Sentence 2 : { 1, 0, 0, 0, 1, 1, 1, 1, 1, 1} 112
  • 113. Pros & Cons + Tries to incorporate order of words - Very large vocab set - No notion of syntactic/semantic similarity 113
  • 114. Term Frequency–Inverse Document Frequency (TF-IDF) • Captures importance of a word to a document in a corpus. • Importance increases proportionally to the number of times a word appears in the document; but is offset by the frequency of the word in the corpus. • TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document). • IDF(t) = log (Total number of documents / Number of documents with term t in it). • TF-IDF (t) = TF(t) * IDF(t) 114
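The formulas transcribed directly into a small helper (log base 10, which matches the worked example on the next slide):

```python
import math

def tf(term_count, total_terms):
    return term_count / total_terms

def idf(total_docs, docs_with_term):
    return math.log10(total_docs / docs_with_term)

def tf_idf(term_count, total_terms, total_docs, docs_with_term):
    return tf(term_count, total_terms) * idf(total_docs, docs_with_term)

print(tf_idf(3, 100, 10_000_000, 1_000))  # 0.03 * 4 = 0.12
```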
  • 115. Example • Document D1 contains 100 words. • cat appears 3 times in D1 • TF(cat) = 3 / 100 = 0.03 • Corpus contains 10 million documents • cat appears in 1000 documents • IDF(cat) = log (10,000,000 / 1,000) = 4 • TF-IDF (cat) = 0.03 * 4 = 0.12 115
  • 116. Pros & Cons • Pros: • Easy to compute • Has some basic metric to extract the most descriptive terms in a document • Thus, can easily compute the similarity between 2 documents using it • Disadvantages: • Based on the bag-of-words (BoW) model, therefore it does not capture position in text, semantics, co-occurrences in different documents, etc. • Thus, TF-IDF is only useful as a lexical-level feature (presence/absence). • Cannot capture semantics (unlike topic models, word embeddings) 116
  • 117. ● Positive Pointwise Mutual Information (PPMI): PMI is a common measure for the strength of association between two words. It is defined as the log ratio between the joint probability of two words w and c and the product of their marginal probabilities: a. PMI(w,c) = log [ P(w,c) / (P(w) P(c)) ] b. PPMI(w, c) = max(PMI(w,c), 0) 117
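A minimal sketch of PPMI over a word-context count matrix M (rows = words, columns = contexts): estimate P(w, c), P(w), P(c) from the counts, then clip negative PMI values at zero.

```python
import numpy as np

def ppmi(M, eps=1e-12):
    M = np.asarray(M, dtype=float)
    total = M.sum()
    p_wc = M / total                            # joint probability P(w, c)
    p_w = p_wc.sum(axis=1, keepdims=True)       # marginal P(w)
    p_c = p_wc.sum(axis=0, keepdims=True)       # marginal P(c)
    pmi = np.log((p_wc + eps) / (p_w * p_c + eps))
    return np.maximum(pmi, 0.0)                 # PPMI = max(PMI, 0)
```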