Skip-Gram Model Broken Down
Chin Huan Tan
January 2019
1 Introduction
The Skip-gram model uses a neural network to create word representations. A word
may be easily understood by a human but not by a machine. A common way
to represent a word in a machine is to encode it as an array of characters, i.e. a
string. However, an array of characters does not carry any meaning on its own.
This is where the Skip-gram model comes in.
Rather than using characters to represent a word, a Skip-gram model creates a
vector, an element of a vector space, to represent each word, called a word vector.
A word vector can be seen as a position relative to the origin in a coordinate
space. Words with similar meanings have word vectors that lie close together,
whereas words with different meanings are far apart. Interestingly, the word
vectors also encode semantic relations through linear translations.
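As a quick illustration of that last property, the sketch below checks the classic king − man + woman ≈ queen analogy with cosine similarity. It assumes a trained embedding matrix embeddings and a lookup dictionary word_to_index; both names are placeholders, not part of the model described below.

import numpy as np

# vector("king") - vector("man") + vector("woman") is expected to lie
# close (in cosine similarity) to vector("queen") in a well-trained model
query = (embeddings[word_to_index["king"]]
         - embeddings[word_to_index["man"]]
         + embeddings[word_to_index["woman"]])
target = embeddings[word_to_index["queen"]]
cosine = query @ target / (np.linalg.norm(query) * np.linalg.norm(target))
# cosine should be high relative to the similarity with most other words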
2 Overview
In a Skip-gram model, we will be training a fully connected neural network with
a single hidden layer by predicting a context word from a center word. Assume
that our dataset is
the quick brown fox jumped over the lazy dog
We will create (center, context) pairs of words to train our model. The context
is the window of words to the left and to the right of a center word. Using a
window size of 2,
• Center word, “the” will have context words, “quick” and “brown”.
• Center word, “quick” will have context words, “the”, “brown” and “fox”.
• Center word, “brown” will have context words, “the”, “quick”, “fox” and
“jumped”.
• Center word, “fox” will have context words, “quick”, “brown”, “jumped”
and “over”.
. . . and so on.
Thus, our training dataset becomes
(the, quick), (the, brown), (quick, the), (quick, brown), (quick, fox),
(brown, the), (brown, quick), (brown, fox), ...
of (center, context) pairs. The center words will be the input to the neural
network whereas context words are the target.
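Generating these pairs takes only a few lines. Below is a minimal sketch; the variable names tokens, WINDOW_SIZE and train_pairs are my own and only illustrate the idea.

tokens = "the quick brown fox jumped over the lazy dog".split()
WINDOW_SIZE = 2

train_pairs = []
for pos, center in enumerate(tokens):
    # Context words are up to WINDOW_SIZE positions to the left and right
    for offset in range(-WINDOW_SIZE, WINDOW_SIZE + 1):
        ctx_pos = pos + offset
        if offset != 0 and 0 <= ctx_pos < len(tokens):
            train_pairs.append((center, tokens[ctx_pos]))

# train_pairs[:3] == [('the', 'quick'), ('the', 'brown'), ('quick', 'the')]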
Before feeding the center words into the neural network, we replace each word
with a number from 0 to W − 1, where W is the number of words in the
vocabulary. So each word corresponds to a number (I call it the word index).
The order does not matter here.
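One simple way to build such a mapping, continuing from the sketch above (the dictionary names are again illustrative), is shown below; the resulting train_data of index pairs is the form the training loop later assumes.

# Assign each distinct word an index in order of first appearance
word_to_index = {}
for token in tokens:
    if token not in word_to_index:
        word_to_index[token] = len(word_to_index)
index_to_word = {idx: word for word, idx in word_to_index.items()}

W = len(word_to_index)  # vocabulary size

# Convert the (center, context) string pairs into index pairs
train_data = [(word_to_index[c], word_to_index[x]) for c, x in train_pairs]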
We can randomly initialize the input weights matrix and the output weights matrix.
There are better ways to initialize the matrices than fixing the standard deviation
at 0.1, but that is out of scope here. After training the neural network, the weight
matrices will contain the word vectors. Notice that the size of each weight matrix
is W × EMBED_SIZE; each row of the matrix is the vector for one word.
import numpy as np

# The number of components in each word vector
EMBED_SIZE = 128

# W is the number of words in the vocabulary
input_weights = np.random.normal(0, 0.1, size=(W, EMBED_SIZE))
output_weights = np.random.normal(0, 0.1, size=(W, EMBED_SIZE))
An important thing to note here is that the final product of a Skip-gram
model is the weight matrices themselves, not the model's ability to predict the
context word.
Before training, let's understand the training objective first. The training
objective of the Skip-gram model is to maximize the average log probability

    \frac{1}{N} \sum_{i=1}^{N} \log p(w_o \mid w_i)
where w_i is the center word (input), w_o is the context word (target) and N is
the number of (center, context) pairs. It simply means that, given w_i as input,
we have to maximize the probability that the neural network predicts w_o as
output.
In a basic Skip-gram model, p(w_o | w_i) is defined using a softmax function:

    p(w_o \mid w_i) = \frac{\exp({v'_{w_o}}^{\top} v_{w_i})}{\sum_{w=1}^{W} \exp({v'_w}^{\top} v_{w_i})}

where v_x and v'_x are the input and output vector representations of word x,
and W is the number of words in the vocabulary.
So, our overall simplified training objective for a single input is

    {v'_{w_o}}^{\top} v_{w_i} - \log\left( \sum_{w=1}^{W} \exp({v'_w}^{\top} v_{w_i}) \right)
However, the cost of computing the summation in the expression above is too
high to be practical, as it is proportional to W, which is often very large. To
solve this problem, we use negative sampling, described in the paper "Distributed
Representations of Words and Phrases and their Compositionality" [1].
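To make the cost concrete, here is a sketch of what evaluating p(w_o | w_i) with the full softmax would look like using the weight matrices defined above; the function name and arguments are mine. Every call performs one dot product per vocabulary word, i.e. O(W × EMBED_SIZE) work for a single training pair.

def softmax_prob(center_index, context_index, input_weights, output_weights):
    # Scores for every word in the vocabulary: shape (W,)
    v_in = input_weights[center_index]
    scores = output_weights @ v_in
    scores -= scores.max()           # for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[context_index] / exp_scores.sum()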
3 Negative Sampling
Negative sampling is a simplified form of Noise Contrastive Estimation (NCE).
Our problem with the softmax function is that it is too expensive to compute.
With NCE, we make an estimate (to reduce the cost) by creating a contrast
between the positive sample and a set of negative samples.
Basically, given an input center word w_i, we have to maximize the probability
that its corresponding context word w_o is the output of the neural network. In
addition, we select k negative samples from a noise distribution P_n(w) and
minimize the probability of the negative samples being the outputs. Therefore,
our training objective with negative sampling becomes
the outputs. Therefore, our training objective with negative sampling becomes
log σ(vwo
vwi
) +
k
j=1
Ewj ∼Pn(w)[log σ(−vwj
vwi
)]
where σ is a sigmoid function.
Based on the experiments by Tomas Mikolov and his team [1], values of k in the
range 5–20 are useful for small training datasets, while for large datasets the k
value can be as small as 2–5. Note that the cost of computing the summation is
greatly reduced compared to the objective without negative sampling.
We haven't defined our noise distribution yet. From the empirical results of
T. Mikolov's team [1], the best noise distribution was found to be the unigram
distribution raised to the power of 3/4. In other words, the probability of selecting
a word as a negative sample is equal to the frequency of the word raised to the
power of 3/4, divided by the sum of all word frequencies raised to the power of 3/4:

    P(w_i) = \frac{f(w_i)^{3/4}}{\sum_{j=0}^{W} f(w_j)^{3/4}}
Let's implement the negative sampling in Python. The code is based on the
word2vec C implementation [2].
POWER = 0.75
TABLE_SIZE = int(1e8)

total_prob = 0
"""
Assume that vocab is a list of Word objects
which have a count attribute
storing the frequency of the word in the dataset.
vocab can be constructed when reading the dataset.
"""
for word in vocab:
    total_prob += word.count ** POWER

unigram_table = np.zeros((TABLE_SIZE,), dtype=np.int64)

i = 0
d1 = vocab[i].count ** POWER / total_prob
for a in range(TABLE_SIZE):
    unigram_table[a] = i
    if a / TABLE_SIZE > d1:
        i += 1
        d1 += vocab[i].count ** POWER / total_prob
    if i >= len(vocab):
        i = len(vocab) - 1
After the unigram table is constructed, it contains word indexes. The number of
times a word index appears in the table is approximately P(w_i) × TABLE_SIZE.
Thus, by generating a random integer from 0 to TABLE_SIZE (exclusive) and
taking the word index at that position of the unigram table, we sample a word
from the noise distribution.
# Number of negative samples
k = 5
indexes = np.random.randint(0, TABLE_SIZE, size=k)
neg_samples = unigram_table[indexes]
However, it is possible that the sample taken is the same as the target context
word (a positive sample). We will take care of that later.
4 Training
Now, we can start our forward propagation.
k = 5

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

for center, context in train_data:
    indexes = np.random.randint(0, TABLE_SIZE, size=k)
    neg_samples = unigram_table[indexes]
    # We form a list of tuples of (word, label).
    # label is 1 if the word is a positive sample, else 0.
    samples = [(context, 1)]
    # Keep only the samples which are not the context word
    samples.extend([(w, 0) for w in neg_samples if w != context])
    # Select the word vector in input_weights
    input_embed = input_weights[center]
    for word_index, label in samples:
        output_embed = output_weights[word_index]
        # z is the value at the output layer
        z = np.dot(input_embed, output_embed)
        # a is the activation at the output layer
        a = sigmoid(z)
        # Backpropagation (continued in the next code block)
For backpropagation, we should look back at our training objective J:

    J = \log \sigma({v'_{w_o}}^{\top} v_{w_i}) + \sum_{j=1}^{k} \mathbb{E}_{w_j \sim P_n(w)}\left[\log \sigma(-{v'_{w_j}}^{\top} v_{w_i})\right]
Let's define z_o and z_j, the weighted-sum values for the positive sample and a
negative sample respectively:

    z_o = {v'_{w_o}}^{\top} v_{w_i}, \qquad z_j = {v'_{w_j}}^{\top} v_{w_i}

and our training objective becomes

    J = \log \sigma(z_o) + \sum_{j=1}^{k} \mathbb{E}_{w_j \sim P_n(w)}[\log \sigma(-z_j)]
Then we can find the gradients ∂J/∂z_o and ∂J/∂z_j. Using the fact that the
derivative of log σ(z) is 1 − σ(z), the gradients are given by

    \frac{\partial J}{\partial z_o} = 1 - \sigma(z_o), \qquad \frac{\partial J}{\partial z_j} = -\sigma(z_j)
We can generalize these two derivatives into a single expression so that we can
use it in the for loop:

    \frac{\partial J}{\partial z} = \text{label} - \sigma(z)

where label is 1 if it is a positive sample and 0 if it is a negative sample.
After that, we can easily get the gradients of the training objective with respect
to the input weights and output weights:

    \frac{\partial J}{\partial v_{w_i}} = \sum_{j=1}^{k+1} \frac{\partial J}{\partial z_j}\, v'_{w_j}, \qquad \frac{\partial J}{\partial v'_{w_j}} = \frac{\partial J}{\partial z_j}\, v_{w_i}

The summation runs up to k + 1 because there are k negative samples and 1
positive sample.
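If you want to convince yourself of these gradients, a quick finite-difference check on a single positive sample is enough. The sketch below is standalone, uses made-up vectors, and relies on the sigmoid function defined in the training code.

np.random.seed(0)
v_in = np.random.normal(size=8)    # stand-in for v_wi
v_out = np.random.normal(size=8)   # stand-in for v'_wo

def J_pos(v_out, v_in):
    # Contribution of one positive sample to J: log sigma(z_o)
    return np.log(sigmoid(np.dot(v_out, v_in)))

# Analytic gradient with respect to v'_wo: (1 - sigma(z_o)) * v_wi
analytic = (1 - sigmoid(np.dot(v_out, v_in))) * v_in

# Central finite differences along each coordinate
eps = 1e-6
numeric = np.array([(J_pos(v_out + eps * e, v_in)
                     - J_pos(v_out - eps * e, v_in)) / (2 * eps)
                    for e in np.eye(8)])
# np.allclose(analytic, numeric) should be True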
Below is the code for backpropagation, which continues from the forward
propagation code above.
# Learning rate
lr = 0.1

...

for center, context in train_data:
    ...
    # Select the word vector in input_weights
    input_embed = input_weights[center]
    delta1 = np.zeros((EMBED_SIZE,))
    for word_index, label in samples:
        output_embed = output_weights[word_index]
        # z is the value at the output layer
        z = np.dot(input_embed, output_embed)
        # a is the activation at the output layer
        a = sigmoid(z)
        delta = (label - a) * lr
        delta1 += delta * output_embed
        output_weights[word_index] += delta * input_embed
    input_weights[center] += delta1
Here, we have completed our backpropagation. The weight matrices input_weights
and output_weights will now contain the word vectors. The word with word
index n has its word vector in the nth row of the weight matrix.
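As a quick sanity check of the trained vectors, you can look up a word's nearest neighbours by cosine similarity. The helper below is a sketch; word_to_index and index_to_word are hypothetical lookup tables mapping between words and word indexes.

def nearest_neighbors(word_index, embeddings, top_n=5):
    # Cosine similarity between one row and every row of the embedding matrix
    norms = np.linalg.norm(embeddings, axis=1)
    sims = embeddings @ embeddings[word_index] / (norms * norms[word_index] + 1e-8)
    sims[word_index] = -np.inf   # exclude the word itself
    return np.argsort(-sims)[:top_n]

# e.g. [index_to_word[i] for i in nearest_neighbors(word_to_index["fox"], input_weights)]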
5 Subsampling
A Skip-gram model tries to explain the meaning of a word through the contexts
in which the word is used, which is what the word pairs capture. However, very
frequent words such as 'the' and 'and' appear in almost every context and seldom
explain it. On the other hand, infrequent or domain-specific words and phrases
usually carry the meaning of the context. That's why we use subsampling.
In layman's terms, subsampling keeps infrequent words more often than frequent
words when creating the new training data, which trims the frequent words
aggressively while preserving the ranking of the word frequencies. Therefore, a
probability of keeping a word has to be defined. Below is the probability used in
the word2vec C implementation [2]:
    P(w_i) = \left(\sqrt{\frac{z(w_i)}{0.001}} + 1\right) \times \frac{0.001}{z(w_i)}

where z(w_i) is the fraction of the training words in the corpus that are the word
w_i, and P(w_i) is the probability of keeping an occurrence of w_i.
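For example, a word that makes up 1% of the corpus (z = 0.01) is kept with probability (√10 + 1) × 0.1 ≈ 0.42, whereas a rare word with z = 10^-5 gets P ≈ 110, which means it is always kept.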
# A storage to store tokens after subsampling
new_tokens = []

# tokens is a list of word indexes from the original text
for word in tokens:
    frac = vocab[word].count / len(tokens)
    prob = (np.sqrt(frac / 0.001) + 1) * (0.001 / frac)
    if np.random.random() < prob:
        new_tokens.append(word)
After subsampling, we can now create our training data using the new tokens
and follow the training steps described before.
6 N-grams
In addition, the meaning of a phrase is usually not the direct addition of the
meaning of the individual words. So, our task includes deciding whether a word
is a part of a phrase or a standalone word. Then, we give the phrase its own
word vector. In this case, we use the idea of n-grams to identify the phrases.
Before going to n-grams, we start from bigram. We define a score for each
bigram. If the score is higher than a certain threshold, we consider it a valid
bigram to be used in the training data.
    \text{score}(w_i, w_j) = \frac{f(w_i w_j) - \delta}{f(w_i) \times f(w_j)}

where f(w_i w_j) is the frequency of the bigram w_i w_j in the corpus, f(w) is the
frequency of a word in the corpus, and δ is a discounting coefficient used to remove
infrequent phrases.
Notice that there are two threshold values here: δ and the score threshold. δ is
used because it is possible for both w_i and w_j to be infrequent, which would
otherwise lead to a high score.
Below is a sample code. It assumes that you have a dictionary called n_gram_hash
which has the n-gram itself as the key and its n-gram index as the value. An
n-gram index is the index of the n-gram in the list vocab. Recall that vocab is a
list of Word objects; an n-gram is now treated as a Word too. For example, let's
say your training data is

nice to meet you

In the first iteration, we start with only bigrams. So n_gram_hash will have the
bigrams "nice to", "to meet" and "meet you" as keys, and the corresponding Word
objects in vocab store the frequency of each bigram.
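One possible way to build n_gram_hash and extend vocab with the bigrams is sketched below; it assumes a Word class with token and count attributes and a keyword constructor, and that tokens is the list of word indexes. This is only an illustration, not the word2vec C logic.

from collections import Counter

# Count every adjacent pair of tokens in the corpus
bigram_counts = Counter(
    vocab[tokens[i]].token + " " + vocab[tokens[i + 1]].token
    for i in range(len(tokens) - 1))

n_gram_hash = {}
for bigram, count in bigram_counts.items():
    # Register the bigram as if it were an ordinary Word
    n_gram_hash[bigram] = len(vocab)
    vocab.append(Word(token=bigram, count=count))

With vocab and n_gram_hash in place, the phrase detection pass itself looks like this: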
min_delta = 5

# A storage to store tokens after
# identifying the phrases
new_tokens = []

i = 0
while i < len(tokens) - 1:
    word = tokens[i]
    next_word = tokens[i + 1]
    # fi is the frequency of the first word (i)
    fi = vocab[word].count
    # fj is the frequency of the next word
    fj = vocab[next_word].count
    # The token attribute stores the word itself as a string
    bigram = vocab[word].token + " " + vocab[next_word].token
    # fij is the frequency of the bigram
    fij = vocab[n_gram_hash[bigram]].count
    score = (fij - min_delta) / (fi * fj)
    # The threshold has to be defined based on the corpus
    if score > threshold:
        new_tokens.append(n_gram_hash[bigram])
        i += 2
    else:
        new_tokens.append(word)
        i += 1

# If the last 2 words do not form a bigram,
# we have to include the last word
if i < len(tokens):
    new_tokens.append(tokens[i])
    i += 1
After this is done, our new_tokens variable will hold a list of unigrams and
bigrams. If we run the program again, bigrams and unigrams can be combined to
form trigrams. Make sure your vocab and n_gram_hash include the trigrams,
four-grams and so on before running the program again. We can repeat this as
many times as we want; typically it is run 2–4 times with a decreasing threshold
value, because the chance of a long phrase appearing in a corpus is smaller.
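Put together, the repeated passes might be driven by a small loop like the sketch below, where build_phrases stands for the while-loop above wrapped in a function, rebuild_vocab stands for whatever code regenerates vocab and n_gram_hash from the current tokens, and the threshold values are only placeholders.

# Hypothetical driver: each pass can merge tokens into longer phrases
for threshold in [100, 50, 25]:                   # decreasing thresholds, tuned per corpus
    vocab, n_gram_hash = rebuild_vocab(tokens)    # must now cover the longer n-grams
    tokens = build_phrases(tokens, vocab, n_gram_hash, threshold)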
The new tokens can now be used for subsampling and then for training the model.
7 Conclusion
The code given here is meant for learning; it is not optimized at all and is
certainly not the best way to do it. If you need more details, take a look at the
paper [1] and the word2vec C implementation [2].
References
[1] Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. Retrieved from https://arxiv.org/abs/1310.4546
[2] dav/word2vec. Retrieved from https://github.com/dav/word2vec