Skip-Gram Model Broken Down
Chin Huan Tan
January 2019
1 Introduction
The Skip-gram model uses a neural network to create word representations. A word
may be easily understood by a human but not by a machine. A common way
to represent a word in a machine is to encode it as an array of characters, i.e. a
string. However, an array of characters does not carry any meaning on its own.
This is where the Skip-gram model comes in.
Rather than using characters to represent a word, a Skip-gram model creates a
vector, an element of a vector space, to represent each word, called a word vector.
A word vector can be seen as a position relative to the origin in a coordinate
space. Words with similar meanings have word vectors that lie close together,
whereas words with different meanings are far apart. Interestingly, the word
vectors also encode semantic relations through linear translations.
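As a quick illustration of that last property, the sketch below checks the classic king − man + woman ≈ queen analogy with cosine similarity. It assumes a trained embedding matrix embeddings and a lookup dictionary word_to_index; both names are placeholders, not part of the model described below.

import numpy as np

# vector("king") - vector("man") + vector("woman") is expected to lie
# close (in cosine similarity) to vector("queen") in a well-trained model
query = (embeddings[word_to_index["king"]]
         - embeddings[word_to_index["man"]]
         + embeddings[word_to_index["woman"]])
target = embeddings[word_to_index["queen"]]
cosine = query @ target / (np.linalg.norm(query) * np.linalg.norm(target))
# cosine should be high relative to the similarity with most other words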
2 Overview
In a Skip-gram model, we will be training a fully connected neural network with
a single hidden layer by predicting a context word from a center word. Assume
that our dataset is
the quick brown fox jumped over the lazy dog
We will create (center, context) pairs of words to train our model. The context
is the window of words to the left and to the right of a center word. Using a
window size of 2,
• Center word, “the” will have context words, “quick” and “brown”.
• Center word, “quick” will have context words, “the”, “brown” and “fox”.
• Center word, “brown” will have context words, “the”, “quick”, “fox” and
“jumped”.
• Center word, “fox” will have context words, “quick”, “brown”, “jumped”
and “over”.
. . . and so on.
Thus, our training dataset becomes
(the, quick), (the, brown), (quick, the), (quick, brown), (quick, fox),
(brown, the), (brown, quick), (brown, fox), ...
of (center, context) pairs. The center words will be the input to the neural
network whereas context words are the target.
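Generating these pairs takes only a few lines. Below is a minimal sketch; the variable names tokens, WINDOW_SIZE and train_pairs are my own and only illustrate the idea.

tokens = "the quick brown fox jumped over the lazy dog".split()
WINDOW_SIZE = 2

train_pairs = []
for pos, center in enumerate(tokens):
    # Context words are up to WINDOW_SIZE positions to the left and right
    for offset in range(-WINDOW_SIZE, WINDOW_SIZE + 1):
        ctx_pos = pos + offset
        if offset != 0 and 0 <= ctx_pos < len(tokens):
            train_pairs.append((center, tokens[ctx_pos]))

# train_pairs[:3] == [('the', 'quick'), ('the', 'brown'), ('quick', 'the')]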
Before feeding the center words into the neural network, we replace each word
with a number from 0 to W − 1, where W is the number of words in the
vocabulary. So each word corresponds to a number (I call it the word index).
The order does not matter here.
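One simple way to build such a mapping, continuing from the sketch above (the dictionary names are again illustrative), is shown below; the resulting train_data of index pairs is the form the training loop later assumes.

# Assign each distinct word an index in order of first appearance
word_to_index = {}
for token in tokens:
    if token not in word_to_index:
        word_to_index[token] = len(word_to_index)
index_to_word = {idx: word for word, idx in word_to_index.items()}

W = len(word_to_index)  # vocabulary size

# Convert the (center, context) string pairs into index pairs
train_data = [(word_to_index[c], word_to_index[x]) for c, x in train_pairs]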
We can randomly initialize the input weights matrix and the output weights matrix.
There are better ways to initialize the matrices than fixing the standard deviation
at 0.1, but that is out of scope here. After training the neural network, the weight
matrices will contain the word vectors. Notice that the size of each weight matrix
is W × EMBED_SIZE; each row of the matrix is the vector for one word.
import numpy as np

# The number of components in each word vector
EMBED_SIZE = 128

# W is the number of words in the vocabulary
input_weights = np.random.normal(0, 0.1, size=(W, EMBED_SIZE))
output_weights = np.random.normal(0, 0.1, size=(W, EMBED_SIZE))
An important thing to note here is that the final product of a Skip-gram
model is the weight matrices themselves, not the model's ability to predict the
context word.
Before training, let's understand the training objective first. The training
objective of the Skip-gram model is to maximize the average log probability

    \frac{1}{N} \sum_{i=1}^{N} \log p(w_o \mid w_i)
where w_i is the center word (input), w_o is the context word (target) and N is
the number of (center, context) pairs. It simply means that, given w_i as input,
we have to maximize the probability that the neural network predicts w_o as
output.
In a basic Skip-gram model, p(w_o | w_i) is defined using a softmax function:

    p(w_o \mid w_i) = \frac{\exp({v'_{w_o}}^{\top} v_{w_i})}{\sum_{w=1}^{W} \exp({v'_w}^{\top} v_{w_i})}

where v_x and v'_x are the input and output vector representations of word x,
and W is the number of words in the vocabulary.
So, our overall simplified training objective for a single input is

    {v'_{w_o}}^{\top} v_{w_i} - \log\left( \sum_{w=1}^{W} \exp({v'_w}^{\top} v_{w_i}) \right)
However, the cost of computing the summation in the expression above is too
high to be practical, as it is proportional to W, which is often very large. To
solve this problem, we use negative sampling, described in the paper "Distributed
Representations of Words and Phrases and their Compositionality" [1].
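To make the cost concrete, here is a sketch of what evaluating p(w_o | w_i) with the full softmax would look like using the weight matrices defined above; the function name and arguments are mine. Every call performs one dot product per vocabulary word, i.e. O(W × EMBED_SIZE) work for a single training pair.

def softmax_prob(center_index, context_index, input_weights, output_weights):
    # Scores for every word in the vocabulary: shape (W,)
    v_in = input_weights[center_index]
    scores = output_weights @ v_in
    scores -= scores.max()           # for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[context_index] / exp_scores.sum()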
3 Negative Sampling
Negative sampling is a simplified form of Noise Contrastive Estimation (NCE).
Our problem with the softmax function is that it is too expensive to compute.
With NCE, we make an estimate (to reduce the cost) by creating a contrast
between the positive sample and a set of negative samples.
Basically, given an input center word w_i, we have to maximize the probability
that its corresponding context word w_o is the output of the neural network. In
addition, we select k negative samples from a noise distribution P_n(w) and
minimize the probability of the negative samples being the outputs. Therefore,
our training objective with negative sampling becomes
the outputs. Therefore, our training objective with negative sampling becomes
log σ(vwo
vwi
) +
k
j=1
Ewj ∼Pn(w)[log σ(−vwj
vwi
)]
where σ is a sigmoid function.
Based on the experiments by Tomas Mikolov and his team [1], values of k in the
range 5–20 are useful for small training datasets, while for large datasets the k
value can be as small as 2–5. Note that the cost of computing the summation is
greatly reduced compared to the objective without negative sampling.
We haven't defined our noise distribution yet. From the empirical results of
T. Mikolov's team [1], the best noise distribution was found to be the unigram
distribution raised to the power of 3/4. In other words, the probability of selecting
a word as a negative sample is equal to the frequency of the word raised to the
power of 3/4, divided by the sum of all word frequencies raised to the power of 3/4:

    P(w_i) = \frac{f(w_i)^{3/4}}{\sum_{j=0}^{W} f(w_j)^{3/4}}
Let's implement the negative sampling in Python. The code is based on the
word2vec C implementation [2].
POWER = 0.75
TABLE_SIZE = int(1e8)

total_prob = 0
"""
Assume that vocab is a list of Word objects
which have a count attribute
storing the frequency of the word in the dataset.
vocab can be constructed when reading the dataset.
"""
for word in vocab:
    total_prob += word.count ** POWER

unigram_table = np.zeros((TABLE_SIZE,), dtype=np.int64)

i = 0
d1 = vocab[i].count ** POWER / total_prob
for a in range(TABLE_SIZE):
    unigram_table[a] = i
    if a / TABLE_SIZE > d1:
        i += 1
        d1 += vocab[i].count ** POWER / total_prob
    if i >= len(vocab):
        i = len(vocab) - 1
After the unigram table is constructed, it contains word indexes. The number of
times a word index appears in the table is approximately P(w_i) × TABLE_SIZE.
Thus, by generating a random integer from 0 to TABLE_SIZE (exclusive) and
taking the word index at that position of the unigram table, we sample a word
from the noise distribution.
# Number of negative samples
k = 5
indexes = np.random.randint(0, TABLE_SIZE, size=k)
neg_samples = unigram_table[indexes]
However, it is possible that the sample taken is the same as the target context
word (a positive sample). We will take care of that later.
4 Training
Now, we can start our forward propagation.
k = 5

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

for center, context in train_data:
    indexes = np.random.randint(0, TABLE_SIZE, size=k)
    neg_samples = unigram_table[indexes]
    # We form a list of tuples of (word, label).
    # label is 1 if the word is a positive sample, else 0.
    samples = [(context, 1)]
    # Keep only the samples which are not the context word
    samples.extend([(w, 0) for w in neg_samples if w != context])
    # Select the word vector in input_weights
    input_embed = input_weights[center]
    for word_index, label in samples:
        output_embed = output_weights[word_index]
        # z is the value at the output layer
        z = np.dot(input_embed, output_embed)
        # a is the activation at the output layer
        a = sigmoid(z)
        # Backpropagation (continued in the next code block)
For backpropagation, we should look back at our training objective J:

    J = \log \sigma({v'_{w_o}}^{\top} v_{w_i}) + \sum_{j=1}^{k} \mathbb{E}_{w_j \sim P_n(w)}\left[\log \sigma(-{v'_{w_j}}^{\top} v_{w_i})\right]
Let's define z_o and z_j, the weighted-sum values for the positive sample and a
negative sample respectively:

    z_o = {v'_{w_o}}^{\top} v_{w_i}, \qquad z_j = {v'_{w_j}}^{\top} v_{w_i}

and our training objective becomes

    J = \log \sigma(z_o) + \sum_{j=1}^{k} \mathbb{E}_{w_j \sim P_n(w)}[\log \sigma(-z_j)]
Then we can find the gradients ∂J/∂z_o and ∂J/∂z_j. Using the fact that the
derivative of log σ(z) is 1 − σ(z), the gradients are given by

    \frac{\partial J}{\partial z_o} = 1 - \sigma(z_o), \qquad \frac{\partial J}{\partial z_j} = -\sigma(z_j)
We can generalize these two derivatives into a single expression so that we can
use it in the for loop:

    \frac{\partial J}{\partial z} = \text{label} - \sigma(z)

where label is 1 if it is a positive sample and 0 if it is a negative sample.
After that, we can easily get the gradients of the training objective with respect
to the input weights and output weights:

    \frac{\partial J}{\partial v_{w_i}} = \sum_{j=1}^{k+1} \frac{\partial J}{\partial z_j}\, v'_{w_j}, \qquad \frac{\partial J}{\partial v'_{w_j}} = \frac{\partial J}{\partial z_j}\, v_{w_i}

The summation runs up to k + 1 because there are k negative samples and 1
positive sample.
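If you want to convince yourself of these gradients, a quick finite-difference check on a single positive sample is enough. The sketch below is standalone, uses made-up vectors, and relies on the sigmoid function defined in the training code.

np.random.seed(0)
v_in = np.random.normal(size=8)    # stand-in for v_wi
v_out = np.random.normal(size=8)   # stand-in for v'_wo

def J_pos(v_out, v_in):
    # Contribution of one positive sample to J: log sigma(z_o)
    return np.log(sigmoid(np.dot(v_out, v_in)))

# Analytic gradient with respect to v'_wo: (1 - sigma(z_o)) * v_wi
analytic = (1 - sigmoid(np.dot(v_out, v_in))) * v_in

# Central finite differences along each coordinate
eps = 1e-6
numeric = np.array([(J_pos(v_out + eps * e, v_in)
                     - J_pos(v_out - eps * e, v_in)) / (2 * eps)
                    for e in np.eye(8)])
# np.allclose(analytic, numeric) should be True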
Below is the code for backpropagation, which continues from the forward
propagation code above.
# Learning rate
lr = 0.1

...

for center, context in train_data:
    ...
    # Select the word vector in input_weights
    input_embed = input_weights[center]
    delta1 = np.zeros((EMBED_SIZE,))
    for word_index, label in samples:
        output_embed = output_weights[word_index]
        # z is the value at the output layer
        z = np.dot(input_embed, output_embed)
        # a is the activation at the output layer
        a = sigmoid(z)
        delta = (label - a) * lr
        delta1 += delta * output_embed
        output_weights[word_index] += delta * input_embed
    input_weights[center] += delta1
Here, we have completed our backpropagation. The weight matrices input_weights
and output_weights will now contain the word vectors. The word with word
index n has its word vector in the nth row of the weight matrix.
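As a quick sanity check of the trained vectors, you can look up a word's nearest neighbours by cosine similarity. The helper below is a sketch; word_to_index and index_to_word are hypothetical lookup tables mapping between words and word indexes.

def nearest_neighbors(word_index, embeddings, top_n=5):
    # Cosine similarity between one row and every row of the embedding matrix
    norms = np.linalg.norm(embeddings, axis=1)
    sims = embeddings @ embeddings[word_index] / (norms * norms[word_index] + 1e-8)
    sims[word_index] = -np.inf   # exclude the word itself
    return np.argsort(-sims)[:top_n]

# e.g. [index_to_word[i] for i in nearest_neighbors(word_to_index["fox"], input_weights)]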
5 Subsampling
A Skip-gram model tries to explain the meaning of a word through the contexts
in which the word is used, which is what the word pairs capture. However, very
frequent words such as 'the' and 'and' appear in almost every context and seldom
explain it. On the other hand, infrequent or domain-specific words and phrases
usually carry the meaning of the context. That's why we use subsampling.
In layman's terms, subsampling keeps infrequent words more often than frequent
words when creating the new training data, which trims the frequent words
aggressively while preserving the ranking of the word frequencies. Therefore, a
probability of keeping a word has to be defined. Below is the probability used in
the word2vec C implementation [2]:
    P(w_i) = \left(\sqrt{\frac{z(w_i)}{0.001}} + 1\right) \times \frac{0.001}{z(w_i)}

where z(w_i) is the fraction of the training words in the corpus that are the word
w_i, and P(w_i) is the probability of keeping an occurrence of w_i.
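For example, a word that makes up 1% of the corpus (z = 0.01) is kept with probability (√10 + 1) × 0.1 ≈ 0.42, whereas a rare word with z = 10^-5 gets P ≈ 110, which means it is always kept.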
# A storage to store tokens after subsampling
new_tokens = []

# tokens is a list of word indexes from the original text
for word in tokens:
    frac = vocab[word].count / len(tokens)
    prob = (np.sqrt(frac / 0.001) + 1) * (0.001 / frac)
    if np.random.random() < prob:
        new_tokens.append(word)
After subsampling, we can now create our training data using the new tokens
and follow the training steps described before.
6 N-grams
In addition, the meaning of a phrase is usually not the direct addition of the
meaning of the individual words. So, our task includes deciding whether a word
is a part of a phrase or a standalone word. Then, we give the phrase its own
word vector. In this case, we use the idea of n-grams to identify the phrases.
Before going to n-grams, we start from bigram. We define a score for each
bigram. If the score is higher than a certain threshold, we consider it a valid
bigram to be used in the training data.
    \text{score}(w_i, w_j) = \frac{f(w_i w_j) - \delta}{f(w_i) \times f(w_j)}

where f(w_i w_j) is the frequency of the bigram w_i w_j in the corpus, f(w) is the
frequency of a word in the corpus, and δ is a discounting coefficient used to remove
infrequent phrases.
Notice that there are two threshold values here: δ and the score threshold. δ is
used because it is possible for both w_i and w_j to be infrequent, which would
otherwise lead to a high score.
Below is a sample code. It assumes that you have a dictionary called n_gram_hash
which has the n-gram itself as the key and its n-gram index as the value. An
n-gram index is the index of the n-gram in the list vocab. Recall that vocab is a
list of Word objects; an n-gram is now treated as a Word too. For example, let's
say your training data is

nice to meet you

In the first iteration, we start with only bigrams. So n_gram_hash will have the
bigrams "nice to", "to meet" and "meet you" as keys, and the corresponding Word
objects in vocab store the frequency of each bigram.
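One possible way to build n_gram_hash and extend vocab with the bigrams is sketched below; it assumes a Word class with token and count attributes and a keyword constructor, and that tokens is the list of word indexes. This is only an illustration, not the word2vec C logic.

from collections import Counter

# Count every adjacent pair of tokens in the corpus
bigram_counts = Counter(
    vocab[tokens[i]].token + " " + vocab[tokens[i + 1]].token
    for i in range(len(tokens) - 1))

n_gram_hash = {}
for bigram, count in bigram_counts.items():
    # Register the bigram as if it were an ordinary Word
    n_gram_hash[bigram] = len(vocab)
    vocab.append(Word(token=bigram, count=count))

With vocab and n_gram_hash in place, the phrase detection pass itself looks like this: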
min_delta = 5

# A storage to store tokens after
# identifying the phrases
new_tokens = []

i = 0
while i < len(tokens) - 1:
    word = tokens[i]
    next_word = tokens[i + 1]
    # fi is the frequency of the first word (i)
    fi = vocab[word].count
    # fj is the frequency of the next word
    fj = vocab[next_word].count
    # The token attribute stores the word itself as a string
    bigram = vocab[word].token + " " + vocab[next_word].token
    # fij is the frequency of the bigram
    fij = vocab[n_gram_hash[bigram]].count
    score = (fij - min_delta) / (fi * fj)
    # The threshold has to be defined based on the corpus
    if score > threshold:
        new_tokens.append(n_gram_hash[bigram])
        i += 2
    else:
        new_tokens.append(word)
        i += 1

# If the last 2 words do not form a bigram,
# we have to include the last word
if i < len(tokens):
    new_tokens.append(tokens[i])
    i += 1
After this is done, our new_tokens variable will hold a list of unigrams and
bigrams. If we run the program again, bigrams and unigrams can be combined to
form trigrams. Make sure your vocab and n_gram_hash include the trigrams,
four-grams and so on before running the program again. We can repeat this as
many times as we want; typically it is run 2–4 times with a decreasing threshold
value, because the chance of a long phrase appearing in a corpus is smaller.
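Put together, the repeated passes might be driven by a small loop like the sketch below, where build_phrases stands for the while-loop above wrapped in a function, rebuild_vocab stands for whatever code regenerates vocab and n_gram_hash from the current tokens, and the threshold values are only placeholders.

# Hypothetical driver: each pass can merge tokens into longer phrases
for threshold in [100, 50, 25]:                   # decreasing thresholds, tuned per corpus
    vocab, n_gram_hash = rebuild_vocab(tokens)    # must now cover the longer n-grams
    tokens = build_phrases(tokens, vocab, n_gram_hash, threshold)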
The new tokens can now be used for subsampling and then for training the model.
7 Conclusion
The code given here is meant for learning; it is not optimized at all and is
certainly not the best way to do it. If you need more details, take a look at the
paper [1] and the word2vec C implementation [2].
References
[1] Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. Retrieved from https://arxiv.org/abs/1310.4546
[2] dav/word2vec. Retrieved from https://github.com/dav/word2vec