Word Mover's Distance

From Word Embeddings To Document
Distances
ICML 2015
80 citations
Matt J. Kusner
Washington University
Yu Sun
Nicholas I. Kolkin
Kilian Q. Weinberger

Motivation
● Same content, but in different words

Motivation
● Naive BoW gives a poor similarity score

Motivation
– Many attempts, but none of them improves
significantly on the naive BoW

Motivation
Why not use a better word embedding?

Background: Word2Vec
● Idea: similar words have similar surroundings

● Word embeddings are learned in an unsupervised
manner

● Property: relational information is preserved
German
bratwurst
Japanese
sushi

Problem Formulation
● Word2Vec measures only single-word similarity
– however, a document contains many words
● How do you use it to compare documents?

Word Mover’s Distance
[Figure: the non-stopwords of "Obama speaks to the media in Illinois" (Obama, speaks, media, Illinois) plotted in the embedding space]

[Figure: the non-stopwords of "The President greets the press in Chicago" (President, greets, press, Chicago) added to the same embedding space, next to Obama, speaks, media, Illinois]

[Figure: words of "Obama speaks to the media in Illinois" matched to words of "The President greets the press in Chicago" in the embedding space]
The distance between 2 sentences is the minimum cost
to move all points from one to the other

[Figure: "Obama speaks in Illinois" (Obama, speaks, Illinois) versus "The President greets the press in Chicago" (President, greets, press, Chicago)]
What if the length doesn’t match?

Each document is represented by a normalized BoW vector
Given a word embedding matrix

Distance between documents is defined as

Practical Optimization (1/4)
● For document retrieval
– given a document, find most similar K documents
– better avoid exact computation of WMD for unlikely docs
● Since solving WMD optimization is slow
– best average time complexity is O(p^3 log p), p = number of unique words
● Let’s derive cheaper lower-bounds to prune docs

● Derive a cheaper lower-bound to prune docs
by triangle inequality

by WMD constraints

Word Centroid Distance

takes O(dp) to compute

However,
it’s not very tight

● Derive a tighter lower-bound to prune docs
Relaxed Word Mover’s Distance (RWMD)

optimal solution

optimal solution
requires 1 NN-search per source word

optimal solution
conversely, 1 NN-search per destination word

Taking the max of the 2 relaxed solutions
gives an even tighter bound (both are lower bounds)

For document search, we can speed up by
● Sort all documents by WCD
– Prefetch first m documents
● Then, select best k among m documents
– compute RWMD to prune unlikely documents
(with 95% accuracy on experiment dataset)
– update WMD of possible documents

Experiment (1/3)
● Run KNN on different document embeddings

Experiment (1/3)
baselines
– BoW, TF-IDF, BM25 Okapi

Experiment (1/3)
baselines
– LSI (Latent Semantic Indexing)

Experiment (1/3)
baselines
– LDA (Latent Dirichlet Allocation),
CCG (Componential Counting Grid)

Experiment (1/3)
baselines
– mSDA (marginalized Stacked Denoising Autoencoder)

Experiment (1/3)
● KNN result

Experiment (2/3)
● Lower bounds – tightness

Experiment (3/3)
● Lower bounds – speedup

Conclusion
● WMD works well with word2vec
● However, no comparison to doc2vec
● Possible application to automatically evaluate
quality of language translation

From Word Embeddings To Document
Distances
Today's paper is "From Word Embeddings To Document Distances", from ICML 2015.
The authors are from Washington University; the last author, Kilian Q. Weinberger,
is one of the coauthors of the marginalized stacked denoising autoencoder (mSDA).

Motivation
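The main problem this paper tackles is that the same fact can often be
described in many different ways. For example, the two sentences "Obama speaks
to the media in Illinois" and "The President greets the press in Chicago"
both describe Obama being interviewed in Chicago, Illinois.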

Motivation
A BoW or TF-IDF model, however, judges these two sentences to be very
dissimilar: after basic preprocessing removes the stopwords, the two
sentences share no words at all.

Motivation
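There are many obvious ways to attack this problem. For example, you could
run LDA on the documents to find topics and then compare documents by the
topics they contain. But the paper notes that, empirically, these attempts
do not differ much from naive BoW.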

Motivation
Why not use a better word embedding?
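The direction this paper takes: at the time there was a very popular word
embedding, Google's word2vec, which separates word meanings very well.
So why not use word2vec to solve this problem?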

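In case you have forgotten, here is a quick review. To tell word meanings
apart, the most basic question is how to define whether two words are close.
Google's answer is that if one word can be substituted for another, the two
words must be similar in some respect.
Take a topical example: "Make ___ great again." Besides "America", which
country's recent election shows a similarly worrying trend?

The Netherlands. The point of this example is that even if you have not
followed the news, this kind of fill-in-the-blank analogy lets you infer that
the Netherlands probably has a popular candidate who advocates extreme
xenophobia.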

● Word embeddings are learned in an unsupervised
manner
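Following the fill-in-the-blank idea, how should the architecture be designed?
Google designed two ways of filling in the blank: CBoW, which predicts the
middle word from its context, and skip-gram, which predicts the context from
the middle word. Although predicting nearby words is formally supervised
learning, remember that our goal is to learn word similarity, which needs no
labels and can be learned automatically from any written text, so the word
embedding is in effect learned in an unsupervised way. You can therefore train
the fill-in-the-blank task on a large corpus, and in the end the matrix in the
first half of the model, which maps words to the hidden layer, is the word
embedding we want.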

● Property: relational information is preserved
German
bratwurst
Japanese
sushi
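The most important property of word2vec is that relationships between words
are preserved as directed offsets (vector differences), so you can do word
analogies with vector arithmetic. For example, to answer "Japan is to sushi as
Germany is to what?", compute sushi - Japan + Germany; the resulting vector
points roughly at bratwurst.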
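The analogy arithmetic can be sketched in a few lines. The 2-D vectors below are invented for illustration only (they are not real word2vec output), and `analogy` is a hypothetical helper name; real embeddings have hundreds of dimensions and are usually compared by cosine similarity rather than Euclidean distance.

```python
import math

# Toy 2-D "embeddings", invented for illustration (not real word2vec vectors).
# First coordinate encodes the country, second encodes "is a dish".
toy_vectors = {
    "Japan": (1.0, 0.0),
    "sushi": (1.0, 1.0),
    "Germany": (3.0, 0.0),
    "bratwurst": (3.0, 1.0),
    "Italy": (5.0, 0.0),
    "pizza": (5.0, 1.0),
}

def analogy(a, b, c, vectors):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b, c."""
    target = tuple(vb - va + vc
                   for va, vb, vc in zip(vectors[a], vectors[b], vectors[c]))
    candidates = (w for w in vectors if w not in (a, b, c))
    return min(candidates, key=lambda w: math.dist(vectors[w], target))

# "Japan is to sushi as Germany is to ...?"
answer = analogy("Japan", "sushi", "Germany", toy_vectors)
print(answer)
```

With these toy vectors the offset from a country to its dish is the same everywhere, which is exactly the property the analogy relies on.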

Problem Formulation
● Word2Vec measures only single-word similarity
– however, a document contains many words
● How do you use it to compare documents?
But word2vec can only compare the similarity of two words. A document has
many words, so how do we use word2vec to compare documents?

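[Figure: the non-stopwords of "Obama speaks to the media in Illinois" (Obama, speaks, media, Illinois) plotted in the embedding space]
Take the earlier example: to compute the distance between the two sentences,
we first place every word of a sentence into the embedding space.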

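[Figure: the non-stopwords of "The President greets the press in Chicago" (President, greets, press, Chicago) added to the same embedding space, next to Obama, speaks, media, Illinois]
If word2vec is well trained, then after the second sentence is added, words
with similar meanings lie very close to each other.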

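[Figure: words of "Obama speaks to the media in Illinois" matched to words of "The President greets the press in Chicago" in the embedding space]
The distance between 2 sentences is the minimum cost
to move all points from one to the other
The distance between the two sentences can then be defined as the minimum
total distance needed to move every word of one sentence onto the other
sentence. In this example the optimal solution is clearly to move each word to
its closest counterpart: Obama to President, speaks to greets, media to press,
Illinois to Chicago. The sum of the distances moved by each word is the
distance between the two sentences.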

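[Figure: "Obama speaks in Illinois" (Obama, speaks, Illinois) versus "The President greets the press in Chicago" (President, greets, press, Chicago)]
What if the length doesn’t match?
But on reflection, every word just happened to have a counterpart in that
example. What if the two sentences differ in length?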

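Thinking in discrete terms is what creates the length-mismatch problem. The
simple fix is to normalize by sentence length: given a d-dimensional word
embedding over a vocabulary of n words, each document can be represented as an
n-dimensional normalized BoW vector.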
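A minimal sketch of the normalized BoW representation, assuming whitespace tokenization and a tiny hand-picked stopword list (both are simplifications for illustration). It stores only the nonzero entries of the n-dimensional vector as a dict.

```python
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"to", "the", "in"}

def nbow(text, stopwords=STOPWORDS):
    """Map a sentence to {word: count / total count} over its non-stopwords."""
    tokens = [t for t in text.lower().split() if t not in stopwords]
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

d1 = nbow("Obama speaks to the media in Illinois")
d2 = nbow("The President greets the press in Chicago")
d3 = nbow("Obama speaks in Illinois")

print(d1)  # four words, each with weight 1/4
print(d3)  # three words, each with weight 1/3
```

Because every document's weights sum to 1, a three-word sentence and a four-word sentence carry the same total mass, so one can always be moved onto the other.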

Distance between documents is defined as
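The WMD of two documents is then defined as the smallest moving cost over all
feasible assignments. The matrix T gives the flow: T_ij is how much of the i-th
vocabulary word of one document is moved to the j-th vocabulary word of the
other. T need not be an integer but must be nonnegative (T >= 0), and the cost
of moving one unit is the distance between the two words. The requirement that
one document be moved entirely onto the other translates into two constraints:
every word of the source document must be moved out completely, and the result
of the move must equal the target document.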
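Written out, the optimization described above takes roughly the following form. The symbols are assumptions chosen for this sketch: x_i is the embedding of vocabulary word i, and d, d' are the normalized BoW vectors of the source and target documents.

```latex
\begin{aligned}
\mathrm{WMD}(d, d') \;=\; \min_{T \ge 0} \;& \sum_{i=1}^{n} \sum_{j=1}^{n} T_{ij}\, \lVert x_i - x_j \rVert_2 \\
\text{subject to}\;& \sum_{j=1}^{n} T_{ij} = d_i \quad \forall i \quad \text{(every source word is moved out completely)} \\
& \sum_{i=1}^{n} T_{ij} = d'_j \quad \forall j \quad \text{(the result equals the target document)}
\end{aligned}
```

This is a linear program in T, which is why it is a special case of the transportation problem.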

● For document retrieval
– given a document, find most similar K documents
– better avoid exact computation of WMD for unlikely docs
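● Since solving WMD optimization is slow
– best average time complexity is O(p^3 log p), p = number of unique words
● Let’s derive cheaper lower-bounds to prune docs
As you might expect, WMD is slow to compute. Fortunately it is a special case
of a classic problem, the Earth Mover's Distance (EMD), for which fast
algorithms already exist; when the paper was written, the fastest had
complexity O(p^3 log p), where p is the number of unique words used in the
documents.
The paper also knows that the slowness of WMD is an obvious point of attack,
so it spends half its length on practical use. In document retrieval, for
example, where you are given a document and must find documents similar to it,
the best strategy is of course to avoid computing WMD whenever possible, so the
paper derives lower bounds on WMD that are much cheaper to compute.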

by triangle inequality
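Given a flow T, the moving cost can be written as the T-weighted sum of word
distances. Since T >= 0, you can move T inside the norm and then apply the
triangle inequality: the sum of the norms is >= the norm of the sum.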
by WMD constraints
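Multiplying T in lets you expand the expression into the difference of two
summations. Look closely: the first summation contains, for each i, the total
amount moved out of the i-th source word, and the second contains, for each j,
the total amount moved into the j-th target word. Aren't those exactly our
constraints?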

Word Centroid Distance
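So, substituting the constraints, the resulting bound is the difference
between the two documents' weighted sums of word vectors, each word vector
weighted by its normalized count. You can also think of each document as
reduced to a single point, its centroid: this lower bound just compares the
distance between the two centroids, which is why the paper calls it the
Word Centroid Distance (WCD).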
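The three steps (triangle inequality, WMD constraints, centroid form) can be chained into one derivation. The notation is an assumption of this sketch: X is the embedding matrix with columns x_i, d and d' are the normalized BoW vectors, and T is any feasible flow.

```latex
\begin{aligned}
\sum_{i,j} T_{ij}\, \lVert x_i - x_j \rVert_2
  &= \sum_{i,j} \lVert T_{ij}\,(x_i - x_j) \rVert_2
     && (T_{ij} \ge 0) \\
  &\ge \Bigl\lVert \sum_{i,j} T_{ij}\,(x_i - x_j) \Bigr\rVert_2
     && \text{(triangle inequality)} \\
  &= \Bigl\lVert \sum_{i} \Bigl(\sum_{j} T_{ij}\Bigr) x_i
       - \sum_{j} \Bigl(\sum_{i} T_{ij}\Bigr) x_j \Bigr\rVert_2 \\
  &= \Bigl\lVert \sum_{i} d_i\, x_i - \sum_{j} d'_j\, x_j \Bigr\rVert_2
     && \text{(WMD constraints)} \\
  &= \lVert X d - X d' \rVert_2 .
\end{aligned}
```

Since the inequality holds for every feasible T, it holds for the optimal one, so the centroid distance is a lower bound on WMD.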

takes O(dp) to compute
It is fast to compute: you only need a for-loop over the words used in the
document to accumulate the weighted sum, so the complexity is the embedding
dimension d times the number of unique words p.

However,
it’s not very tight
Unfortunately this lower bound is too loose: for two documents whose centroids
are very close, if one is concentrated and the other spread out, the WMD can
still be very large.
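A minimal sketch of the centroid distance, together with a case where it is loose. The 2-D word vectors and word names are invented for illustration; documents are dicts of normalized weights.

```python
import math

def centroid(doc, vectors):
    """Weighted sum of word vectors; one pass over the words of the document."""
    dim = len(next(iter(vectors.values())))
    center = [0.0] * dim
    for word, weight in doc.items():
        for k, value in enumerate(vectors[word]):
            center[k] += weight * value
    return center

def wcd(doc_a, doc_b, vectors):
    """Word Centroid Distance: Euclidean distance between the two centroids."""
    return math.dist(centroid(doc_a, vectors), centroid(doc_b, vectors))

# Toy vectors, invented for illustration.
vectors = {"left": (-1.0, 0.0), "middle": (0.0, 0.0), "right": (1.0, 0.0)}

spread = {"left": 0.5, "right": 0.5}   # mass split across two far-apart words
tight = {"middle": 1.0}                # all mass on one word

loose_bound = wcd(spread, tight, vectors)
print(loose_bound)
# Both centroids sit at the origin, so the bound is 0. The true WMD is 1:
# all mass must end on "middle", and each half travels a distance of 1.
```

The bound is valid here (0 <= 1) but carries no information, which is the spread-versus-concentrated situation described above.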

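So the paper derives another, tighter lower bound. If you drop one of the
constraints, say the second, so that the source document only has to move
everything out without caring whether the result matches the target document,
the optimal solution is intuitive.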

optimal solution
If all that matters is that the source is fully moved out, the optimal
solution is of course to move each source word entirely to the target word
nearest to it.

optimal solution
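requires 1 NN-search per source word
So each word of the source document needs one nearest-neighbor search over the
words of the target, and the complexity is p, the number of unique words in
the document, times the O(pd) cost of one search, i.e. O(p^2 d).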

optimal solution
reversely, 1 NN-search per destination word
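Of course, we just dropped the second constraint; dropping the first instead
gives a similar optimal solution just as easily. The only difference is that
the direction of the search is reversed: each destination word searches over
the source words.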

Taking the max of the 2 relaxed solutions
gives an even tighter bound
Each of the two relaxations gives a solution, and each is a lower bound on
WMD, so taking the larger of the two gives an even tighter bound.
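A minimal sketch of the relaxed bound, using brute-force nearest-neighbor search and invented toy vectors. Because each relaxation is a lower bound on WMD, the sketch keeps the larger of the two values; the smaller one would still be a valid bound but a looser one.

```python
import math

def relaxed_cost(src, dst, vectors):
    """One constraint dropped: every src word sends all of its mass to its
    nearest word in dst (one nearest-neighbor search per src word)."""
    return sum(
        weight * min(math.dist(vectors[w], vectors[v]) for v in dst)
        for w, weight in src.items()
    )

def rwmd(doc_a, doc_b, vectors):
    """Relaxed WMD: the tighter (larger) of the two one-sided lower bounds."""
    return max(relaxed_cost(doc_a, doc_b, vectors),
               relaxed_cost(doc_b, doc_a, vectors))

# Toy vectors, invented for illustration.
vectors = {"middle": (0.0, 0.0), "right": (1.0, 0.0)}
doc_a = {"middle": 1.0}
doc_b = {"middle": 0.5, "right": 0.5}

a_to_b = relaxed_cost(doc_a, doc_b, vectors)  # "middle" stays put: cost 0
b_to_a = relaxed_cost(doc_b, doc_a, vectors)  # half the mass moves 1 unit
bound = rwmd(doc_a, doc_b, vectors)
print(a_to_b, b_to_a, bound)
```

In this example one direction gives a useless bound of 0 while the other recovers the exact WMD of 0.5, which illustrates why combining both directions matters.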

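For document search, we can speed up by
● Sort all documents by WCD
– Prefetch first m documents
● Then, select best k among m documents
– compute RWMD to prune unlikely documents
(with 95% accuracy on experiment dataset)
– update WMD of possible documents
With the lower bounds derived, how do we actually speed up document
retrieval? As just noted, WCD, which reduces a document to its centroid, is
inaccurate but very fast, while RWMD is accurate but slower. So the approach is
to first sort all documents by their centroid distance (WCD) to the query
document. Because WCD is inaccurate, even though only k answers are needed we
take a few more, the first m, and apply a second layer of pruning to them. In
the second layer we again compute RWMD first, and only when it is smaller than
the WMD of the current k answers do we compute the true WMD and update the
answers.
A later experiment shows that RWMD really is a tight bound: it comes very
close to the true WMD, so the fraction of documents pruned wrongly is low.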
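The retrieval procedure can be sketched as follows. The function name `prefetch_and_prune` and its signature are assumptions of this sketch; the three distance functions are passed in as callables, and the toy usage below replaces them with stand-ins on 1-D "documents" purely to exercise the control flow. The sketch assumes 1 <= k <= m.

```python
import heapq

def prefetch_and_prune(query, docs, k, m, wcd, rwmd, wmd):
    """Return (k nearest docs as sorted (distance, index) pairs,
    number of exact WMD computations performed)."""
    # 1. Rank every document by the cheap centroid bound, keep the first m.
    order = sorted(range(len(docs)), key=lambda i: wcd(query, docs[i]))[:m]
    # 2. Exact WMD for the first k candidates (max-heap via negated distances).
    heap = [(-wmd(query, docs[i]), i) for i in order[:k]]
    heapq.heapify(heap)
    exact_calls = len(heap)
    # 3. For the rest, compute WMD only if the RWMD lower bound is below
    #    the current k-th best distance; otherwise prune.
    for i in order[k:]:
        kth_best = -heap[0][0]
        if rwmd(query, docs[i]) >= kth_best:
            continue
        exact_calls += 1
        dist = wmd(query, docs[i])
        if dist < kth_best:
            heapq.heapreplace(heap, (-dist, i))
    return sorted((-neg, i) for neg, i in heap), exact_calls

# Toy stand-ins (not real WMD): documents are numbers, "WMD" is |a - b|,
# and the two bounds are scaled-down versions of it.
docs = [5.0, 1.0, 3.0, 0.5, 4.0]
toy_wmd = lambda a, b: abs(a - b)
toy_rwmd = lambda a, b: 0.9 * abs(a - b)
toy_wcd = lambda a, b: 0.5 * abs(a - b)

result, exact_calls = prefetch_and_prune(0.0, docs, k=2, m=5,
                                         wcd=toy_wcd, rwmd=toy_rwmd, wmd=toy_wmd)
print(result, exact_calls)
```

In the toy run only the two prefetched candidates need an exact distance; the other three are discarded by the cheaper bound.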

Experiment (1/3)
So how do we evaluate whether WMD is any good? The evaluation uses KNN for
text categorization, comparing KNN performance across different document
embeddings.

Experiment (1/3)
baselines
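The first group of baselines is the BoW family: BoW, TF-IDF, and BM25 Okapi.
BM25 Okapi is less familiar, but it is similar in form to TF-IDF; the
difference is that document length appears in the denominator. If a word
occurs in only one document, repeating that word over and over raises its
TF-IDF importance without bound, but it also raises the document length, so
BM25 reduces this effect by putting the document length in the denominator.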
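For reference, the commonly used form of the Okapi BM25 score is shown below. The symbols are not from the slides: f(t, D) is the count of term t in document D, |D| is the document length, avgdl is the average document length in the collection, and k_1 and b are free parameters.

```latex
\mathrm{score}(D, q) \;=\; \sum_{t \in q} \mathrm{IDF}(t)\,
  \frac{f(t, D)\,(k_1 + 1)}
       {f(t, D) + k_1 \left( 1 - b + b\, \dfrac{|D|}{\mathrm{avgdl}} \right)}
```

The document length sits in the denominator, and the fraction saturates at k_1 + 1 as f(t, D) grows, so repeating a word cannot raise its weight without bound.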

Experiment (1/3)
baselines
Another baseline is the classic LSI: the BoW matrix is factored by SVD into
U, S, and V, and V serves as the index of the documents.
Experiment (1/3)
baselines
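Topic models: LDA and CCG, an LDA variant that was state of the art at the
time. The difference is that LDA assumes the latent topics are fixed, whereas
in practice topics are likely to change over time; CCG is an LDA that can
model such changing topics.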

Experiment (1/3)
baselines
– mSDA (marginalized Stacked Denoising Autoencoder)
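The last baseline is mSDA. Start with the DAE: to check whether an AE is well
trained, add noise to the input; if the AE can restore the original, it has
indeed been trained well. But when the input dimension is very high, as it is
in NLP, generating noisy inputs for training is inefficient. mSDA makes a small
change, mainly restricting the encoding function to be linear, so that a
closed-form solution can be derived. Why "marginalized" in the name? Because
the closed-form solution averages over the noisy inputs, and by the weak law
of large numbers, the more samples you take, the closer you get to the true
mean. If the noise you add is zero-mean Gaussian, you do not even need to
generate a single noisy sample: the AE parameters can be computed in closed
form.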
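A sketch of where the closed form comes from, written as ordinary linear least squares. The notation is an assumption of this sketch: X-bar stacks m copies of the clean inputs as columns, X-tilde stacks the corresponding noisy versions, and W is the linear mapping that reconstructs clean from noisy inputs.

```latex
\begin{aligned}
W &= \arg\min_{W} \; \lVert \bar{X} - W \tilde{X} \rVert_F^2
   \;=\; P\, Q^{-1},
   \qquad P = \bar{X} \tilde{X}^{\top}, \quad Q = \tilde{X} \tilde{X}^{\top}, \\
W &\;\longrightarrow\; \mathbb{E}[P]\; \mathbb{E}[Q]^{-1}
   \qquad \text{as the number of noisy copies } m \to \infty .
\end{aligned}
```

When the expectations of P and Q under the noise distribution can be written down analytically, W is obtained without sampling any noisy input, which is the "marginalized" part.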

Experiment (1/3)
● Dataset
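The datasets used in the experiments are all classic text categorization
datasets, such as 20 Newsgroups and Reuters. n is the number of documents. The
key column is unique words, because the cost of WMD scales as p^3 log p in the
number of unique words p per document. The largest is BBC Sport with 117
words, so one WMD computation is on the order of 10^6 operations and takes
roughly 0.1 seconds.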

Experiment (1/3)
● KNN result
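These are the KNN results with the different document embeddings. The last,
burgundy bar is WMD, whose error rate is the lowest almost everywhere.
Unsurprisingly the BoW family performs poorly, while LDA and LSI do
surprisingly well, second only to WMD.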

Experiment (2/3)
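Next come the lower-bound experiments. This one measures how tight WMD's two
lower bounds are, so it picks random pairs of documents from a dataset and
computes their distances. Blue is WMD and red is RWMD; you can see that the
RWMD distance is almost exactly the WMD. WCD is far off and underestimates WMD
considerably, but it is still roughly proportional to WMD.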

Experiment (2/3)
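The next experiment uses the lower bounds directly in place of WMD and
measures KNN performance. Surprisingly, estimating WMD accurately does not
mean the performance will be good.
The left plot shows tightness. The first two, RWMD with subscripts c1 and c2,
each drop one constraint and have tightness 0.92; the one without a subscript
takes the max of the two and has tightness 0.96, only slightly tighter than
dropping a single constraint. Yet in the right plot, which shows the error
from actually running KNN, the RWMD variants that drop only one constraint
perform badly, even worse than WCD, which reduces each document to its
centroid. Once the max of the two RWMDs is taken, performance jumps to second
only to the true WMD.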

Experiment (3/3)
● Lower bounds – speedup
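The last experiment shows how much pruning with the lower bounds speeds
things up. The top entry is the time one document search takes when KNN uses
exact WMD. m means that after WCD has been computed for all documents, only m
documents go on to the WMD computation and the rest are pruned. So the smaller
m is, the greater the speedup, but it is a trade-off against the test error on
the y-axis.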

Conclusion
● WMD works well with word2vec
● However, no comparison to doc2vec
● Possible application to automatically evaluate
quality of language translation
WMD makes elegant use of word2vec to define a good document distance metric,
but unfortunately it is not compared with doc2vec. Also, given an embedding
that preserves the correspondence between two languages, WMD could serve as a
metric for how good a translation is.
