2. Distributional Semantics
Quantify and categorize semantic similarities between items
using distributional properties in Big Data.
- Never-Ending Language Learning (entity extraction): http://rtw.ml.cmu.edu/rtw
- ConceptNet (relation extraction): http://conceptnet5.media.mit.edu
- Word2Vec (word embedding): http://word2vec.googlecode.com
4. Language Class Model
Assume that each word w ∈ V belongs to a class c ∈ C.
V = {w_1, ..., w_n}   C = {c_1, ..., c_k}

Language Model:
  P(w_1^n) = ∏_{i=1}^n P(w_i | w_{i-1})

Language Class Model, where c_i is the class that contains w_i:
  P(w_1^n) = ∏_{i=1}^n P(w_i | c_i) · P(c_i | c_{i-1})
  log P(w_1^n) = ∑_{i=1}^n log [ P(w_i | c_i) · P(c_i | c_{i-1}) ]
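A minimal Python sketch of this factorization, using made-up word classes and probabilities (none of these numbers come from the slides):

```python
import math

# Hypothetical toy distributions, only to illustrate
# log P(w_1^n) = sum_i log[ P(w_i|c_i) * P(c_i|c_{i-1}) ].
word_class = {"the": "DET", "dog": "NOUN", "barks": "VERB"}
p_word_given_class = {("the", "DET"): 0.6, ("dog", "NOUN"): 0.1, ("barks", "VERB"): 0.05}
p_class_given_class = {("<s>", "DET"): 0.5, ("DET", "NOUN"): 0.7, ("NOUN", "VERB"): 0.4}

def class_model_logprob(sentence):
    logp = 0.0
    prev_class = "<s>"                                          # sentence-start class
    for w in sentence:
        c = word_class[w]                                       # the class that contains w_i
        logp += math.log(p_word_given_class[(w, c)])            # log P(w_i | c_i)
        logp += math.log(p_class_given_class[(prev_class, c)])  # log P(c_i | c_{i-1})
        prev_class = c
    return logp

print(class_model_logprob(["the", "dog", "barks"]))
```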
5. Cluster Quality
The quality of a clustering is measured by the log-likelihood:
  log P(w_1^n) = ∑_{i=1}^n log [ P(w_i | c_i) · P(c_i | c_{i-1}) ]
Bigrams: (w, w'), where w is the word immediately preceding w'.
Q(C) = 1/n ∑_{i=1}^n log [ P(w_i | c_i) · P(c_i | c_{i-1}) ]
     = ∑_{w,w'} n(w,w')/n · log [ P(w' | c') · P(c' | c) ]
where n(w,w')/n = P(w,w').
6. Cluster Quality
Q(C) = ∑_{w,w'} n(w,w')/n · log [ P(w' | c') · P(c' | c) ]
     = ∑_{w,w'} n(w,w')/n · log [ n(w')/n(c') · n(c,c')/n(c) ]
     = ∑_{w,w'} n(w,w')/n · log [ n(w')/n · n·n(c,c') / (n(c)·n(c')) ]
     = ∑_{w,w'} n(w,w')/n · log n(w')/n + ∑_{w,w'} n(w,w')/n · log [ n·n(c,c') / (n(c)·n(c')) ]
     = ∑_{w'} n(w')/n · log n(w')/n + ∑_{c,c'} n(c,c')/n · log [ n·n(c,c') / (n(c)·n(c')) ]
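The first term is the negative entropy of the word distribution and does not depend on the clustering, so only the second, class mutual-information term changes when clusters are merged. As a sanity check, both forms of Q(C) can be computed on a toy corpus; the corpus and two-cluster assignment below are made up for illustration:

```python
from collections import Counter
import math

# Toy corpus and a hypothetical 2-cluster assignment (illustration only).
tokens = ["a", "b", "a", "c", "b", "a", "c", "c", "b", "a"]
cluster = {"a": 0, "b": 1, "c": 1}

bigrams = list(zip(tokens, tokens[1:]))
n = len(bigrams)
n_ww = Counter(bigrams)                                           # n(w, w')
n_w2 = Counter(w2 for _, w2 in bigrams)                           # n(w')
n_cc = Counter((cluster[w1], cluster[w2]) for w1, w2 in bigrams)  # n(c, c')
n_c1 = Counter(cluster[w1] for w1, _ in bigrams)                  # n(c)
n_c2 = Counter(cluster[w2] for _, w2 in bigrams)                  # n(c')

# Direct form: sum over bigrams of P(w,w') * log[ P(w'|c') * P(c'|c) ].
direct = sum(
    k / n * math.log(
        (n_w2[w2] / n_c2[cluster[w2]]) * (n_cc[(cluster[w1], cluster[w2])] / n_c1[cluster[w1]])
    )
    for (w1, w2), k in n_ww.items()
)

# Decomposed form: word (negative) entropy term + class mutual-information term.
entropy_term = sum(k / n * math.log(k / n) for k in n_w2.values())
mi_term = sum(
    k / n * math.log(n * k / (n_c1[c1] * n_c2[c2]))
    for (c1, c2), k in n_cc.items()
)

print(direct, entropy_term + mi_term)   # the two values agree
```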
8. Brown Clustering
V = {w1, . . . , wn} C = {c1, . . . , ck}
Initial state: each word is assigned to its own unique cluster → n clusters.
Terminal state: each word is assigned to one of the clusters in C → k clusters.
Run n − k merge steps; at each step, pick the two clusters c_i and c_j that maximize the quality:
  argmax_{i,j} Q(c_i ∪ c_j)
Complexity?
9. Brown Clustering
Assign the top k most frequent words to unique clusters → k clusters.
Run n − k merge steps:
  Assign the next most frequent word to a new unique cluster → k + 1 clusters.
  Pick the two clusters c_i and c_j that maximize the quality and merge them → k clusters:
    argmax_{i,j} Q(c_i ∪ c_j)
Run k − 1 more merge steps to make the clustering completely hierarchical.
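A rough Python sketch of this frequency-ordered variant, under the assumption that merges are scored by the class mutual-information term of Q(C) (the word-entropy term is constant across clusterings). Function names are made up, the final hierarchical merges are omitted, and a real implementation updates count tables incrementally instead of rescoring every candidate merge:

```python
from collections import Counter
from itertools import combinations
import math

def class_mi(bigrams, assign):
    """Class mutual-information term of Q(C), restricted to bigrams whose
    words are already assigned to a cluster."""
    pairs = [(assign[w1], assign[w2]) for w1, w2 in bigrams
             if w1 in assign and w2 in assign]
    if not pairs:
        return 0.0
    n = len(pairs)
    n_cc = Counter(pairs)                       # n(c, c')
    n_c1 = Counter(c1 for c1, _ in pairs)       # n(c)
    n_c2 = Counter(c2 for _, c2 in pairs)       # n(c')
    return sum(k / n * math.log(n * k / (n_c1[c1] * n_c2[c2]))
               for (c1, c2), k in n_cc.items())

def brown_clusters(tokens, k):
    """Greedy, frequency-ordered Brown clustering sketch (slow: it rescores
    every candidate merge from scratch)."""
    bigrams = list(zip(tokens, tokens[1:]))
    words = [w for w, _ in Counter(tokens).most_common()]
    assign = {w: i for i, w in enumerate(words[:k])}    # top-k words -> unique clusters
    clusters = {i: {w} for i, w in enumerate(words[:k])}
    for step, w in enumerate(words[k:], start=k):
        assign[w] = step                                # next word -> new (k+1)-th cluster
        clusters[step] = {w}
        # Merge the pair of clusters that maximizes the quality.
        ci, cj = max(combinations(clusters, 2),
                     key=lambda p: class_mi(bigrams, {x: (p[0] if c == p[1] else c)
                                                      for x, c in assign.items()}))
        for x in clusters.pop(cj):
            assign[x] = ci
            clusters[ci].add(x)
    return clusters

tokens = "the dog barks the cat meows the dog runs the cat sleeps".split()
print(brown_clusters(tokens, 2))
```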
11. Term Document Matrix
The term-document matrix X is an m ✕ n matrix whose entry x_{i,j} is the term frequency of term t_i in document d_j (optionally re-weighted by TF-IDF).
Row t_i^T = (x_{i,1}, ..., x_{i,n}): compare rows for term similarity.
Column d_j = (x_{1,j}, ..., x_{m,j})^T: compare columns for document similarity.
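A small sketch of building such a matrix with one common TF-IDF weighting; the documents and the exact IDF formula are illustrative assumptions, not taken from the slides:

```python
import numpy as np

# Toy documents (hypothetical). Rows of X are terms t_i, columns are documents d_j.
docs = [["king", "queen", "royal"], ["man", "king"], ["woman", "queen", "queen"]]
terms = sorted({w for d in docs for w in d})

tf = np.array([[d.count(t) for d in docs] for t in terms], dtype=float)  # x_{i,j}
df = (tf > 0).sum(axis=1)                       # document frequency of t_i
idf = np.log(len(docs) / df)                    # one common IDF variant
X = tf * idf[:, None]                           # TF-IDF weighted term-document matrix

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

i, j = terms.index("king"), terms.index("queen")
print(cosine(X[i], X[j]))        # row vs. row: term similarity
print(cosine(X[:, 0], X[:, 1]))  # column vs. column: document similarity
```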
12. Latent Semantic Analysis
Low-rank approximation
Remove irrelevant terms or documents from the matrix
Singular value decomposition:
  X = U ⋅ Σ ⋅ V^T
where U = [u_1, ..., u_m] is an orthogonal matrix, Σ = diag(σ_1, ..., σ_n) is a diagonal matrix, and V = [v_1, ..., v_n] is an orthogonal matrix.
13. Latent Semantic Analysis
Choose the top-k singular values:
  U → M ✕ M  becomes  U' → M ✕ K
  Σ → M ✕ N  becomes  Σ' → K ✕ K
  V^T → N ✕ N  becomes  V'^T → K ✕ N
  X' = U' ⋅ Σ' ⋅ V'^T ← LSA matrix
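A minimal sketch of this truncation with NumPy's SVD; note that full_matrices=False already drops the parts of U and V^T that the top-k cut would discard anyway:

```python
import numpy as np

def lsa(X, k):
    """Rank-k approximation of the term-document matrix X (LSA)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U · Σ · V^T
    U_k = U[:, :k]            # U'  -> M x K
    S_k = np.diag(s[:k])      # Σ'  -> K x K
    Vt_k = Vt[:k, :]          # V'^T -> K x N
    return U_k @ S_k @ Vt_k   # X' = U' · Σ' · V'^T, the LSA matrix

X = np.array([[1., 0., 1.], [0., 2., 0.], [1., 1., 0.]])  # toy matrix
print(lsa(X, 2))
```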
14. Word Embeddings
Word vectors generated by neural networks.
Toy binary vectors (dimensions grouped roughly as royal, male, female):
  "king"  = 0 1 1 1 0 0 0 1 1 0 0 0
  "man"   = 0 0 0 0 0 0 0 1 1 0 0 0
  "woman" = 0 0 0 0 0 1 1 0 0 0 0 0
  "king" − "man" + "woman" = 0 1 1 1 0 1 1 0 0 0 0 0 ≈ "queen"?
Each dimension in word vectors captures distributional semantics.
https://code.google.com/p/word2vec/
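A toy sketch of the analogy by vector arithmetic and cosine similarity; the three-dimensional vectors below are invented to roughly match the royal/male/female axes above, not taken from word2vec:

```python
import numpy as np

# Made-up embeddings whose axes loosely mean (royal, male, female).
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def nearest(vec, exclude):
    """Word whose embedding has the highest cosine similarity to vec."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude), key=lambda w: cos(emb[w], vec))

# king - man + woman ≈ queen
target = emb["king"] - emb["man"] + emb["woman"]
print(nearest(target, exclude={"king", "man", "woman"}))   # -> "queen"
```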
15. Generative vs. Discriminative
Predict w_i given {w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}}.

Generative model:
  w_i = argmax_{w*} P(w* | w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2})

Discriminative model:
  x = bow(w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}) ∈ R^{1✕n}   (n = vocabulary size)
  w ∈ R^{n✕d},  v ∈ R^{d✕n}   (d = embedding size)
  w_i = argmax_i (x · w · v)
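A sketch of the discriminative scoring with random (untrained) weights, just to show the shapes involved; the vocabulary and dimensions are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
n, d = len(vocab), 4                       # vocabulary size, embedding size
w = rng.normal(size=(n, d))                # w ∈ R^{n×d}
v = rng.normal(size=(d, n))                # v ∈ R^{d×n}

def bow(words):
    """Bag-of-words vector x ∈ R^{1×n} over the context words."""
    x = np.zeros((1, n))
    for word in words:
        x[0, vocab.index(word)] += 1
    return x

x = bow(["the", "sat", "on", "mat"])       # context {w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}}
scores = x @ w @ v                         # 1 × n score vector
print(vocab[int(np.argmax(scores))])       # argmax_i (x · w · v); random here, weights untrained
```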
16. Word2Vec
Feed-Forward Neural Network:
  Input:  x ∈ R^v   (vocabulary size)
  Hidden: h ∈ R^d   (embedding size)
  Output: ŷ ∈ R^v   (vocabulary size)
  Weights: w ∈ R^{v✕d}, v ∈ R^{d✕v}
Initialization: each weight w_i ∈ [−0.5/v, 0.5/v], each weight v_i ← 0.
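A sketch of one forward pass through this network, following the slide's initialization (small uniform input weights, zero output weights); the vocabulary and embedding sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5000, 100                            # vocabulary size, embedding size

# Initialization as on the slide: small uniform input weights, zero output weights.
w = rng.uniform(-0.5, 0.5, size=(V, d)) / V  # w ∈ R^{V×d}
v = np.zeros((d, V))                         # v ∈ R^{d×V}

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def forward(context_ids):
    """One CBOW-style forward pass: input -> hidden -> output."""
    x = np.zeros(V)
    x[context_ids] = 1.0                    # x ∈ R^V, bag of context words
    h = x @ w                               # h ∈ R^d, hidden layer
    return softmax(h @ v)                   # ŷ ∈ R^V, distribution over the vocabulary

y_hat = forward([10, 42, 7, 99])
print(y_hat.shape, y_hat.sum())             # (5000,) and sums to 1
```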
18. Negative Sampling
Build a table (size: vocabulary size * embedding size * 10) in which words are proportionally distributed by their counts:
  w_0 w_0 w_0 w_0 … w_1 w_1 w_2 … w_2 w_2 … w_n
Distribution ratio:
  dist(w_i) = |w_i|^{3/4} / ∑_j |w_j|^{3/4}
Randomly select negative words from the distribution.
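A sketch of building the sampling table and drawing negatives from the count^{3/4} distribution; the word counts and the (shrunken) table size are illustrative:

```python
import random
from collections import Counter

counts = Counter({"the": 1000, "cat": 50, "sat": 30, "mat": 20})  # hypothetical counts
TABLE_SIZE = 10_000                                               # shrunk for illustration

# dist(w_i) = |w_i|^(3/4) / sum_j |w_j|^(3/4)
weights = {w: c ** 0.75 for w, c in counts.items()}
total = sum(weights.values())

# Fill the table so each word occupies a share proportional to dist(w).
table = []
for w, wt in weights.items():
    table.extend([w] * int(round(wt / total * TABLE_SIZE)))

def sample_negatives(positive, k=5):
    """Draw k negative words from the distribution, skipping the positive word."""
    negs = []
    while len(negs) < k:
        w = random.choice(table)
        if w != positive:
            negs.append(w)
    return negs

print(sample_negatives("cat"))
```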
19. Sub-Sampling
Randomly discard highly frequent words from the bag of words (BOW).
d = ( √(|w|/s) + 1 ) · (s/|w|) ≈ √(s/|w|)
where |w| = word count and s = subsample threshold * total word count.
[Plot: d as a function of word count / sample size (|w|/s), for |w|/s from 0.1 to 3.9.]
Choose a random number r and skip the word s.t. d < r.
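A sketch of the sub-sampling rule; the keep probability d follows the formula above, and the sample threshold value is an arbitrary choice:

```python
import math
import random
from collections import Counter

def keep_prob(word_count, total_count, sample=1e-3):
    """Keep probability from the slide: d = (sqrt(|w|/s) + 1) * (s/|w|),
    with s = subsample threshold * total word count."""
    s = sample * total_count
    return (math.sqrt(word_count / s) + 1) * (s / word_count)

def subsample(tokens, counts, sample=1e-3):
    """Randomly discard very frequent words: keep w only if d >= r for random r in [0, 1)."""
    total = sum(counts.values())
    return [w for w in tokens if keep_prob(counts[w], total, sample) >= random.random()]

# A word holding 10% of all tokens with sample = 0.001 is kept only
# (sqrt(0.1/0.001) + 1) * (0.001/0.1) ≈ 11% of the time.
tokens = ["the"] * 100 + ["cat"] * 3
counts = Counter(tokens)
print(len(subsample(tokens, counts, sample=0.01)))
```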