The Emergence of Hubs in High-Dimensional Space (11th STAIR Lab AI Seminar)
1. 2017/07/21@STAIR Lab AI seminar
Improving Nearest Neighbor Methods
from the Perspective of Hubness Phenomenon
Yutaro Shigeto
STAIR Lab, Chiba Institute of Technology
2. A complete reference list is available at
https://yutaro-s.github.io/download/ref-20170721.html
11-14. Hubness Phenomenon
The nearest neighbors of many queries are the same objects ("hubs")
[Figure: successive builds show many queries whose nearest neighbor is the same object ("cat"); such an object is a hub]
[Radovanović+, 2010]
15-16. Why do hubs emerge?
$X$: normal distribution (zero mean)
Fixed objects $y_1$, $y_2$, with $\|y_1\| < \|y_2\|$
Then it can be shown that
$\mathbb{E}_X[\|x - y_2\|] - \mathbb{E}_X[\|x - y_1\|] > 0$
i.e., $y_1$ is more likely to be closer to the queries, and hence more likely to be a hub
Because this holds for any pair $y_1$ and $y_2$, objects closest to the origin tend to be hubs
This bias is called "spatial centrality"
[Radovanović+, 2010]
17. Variants
• Squared Euclidean distance [Shigeto+, 2015]:
$\mathbb{E}_X[\|x - y_2\|^2] - \mathbb{E}_X[\|x - y_1\|^2] > 0$
• Inner product [Suzuki+, 2013]:
$\frac{1}{|D|} \sum_{x \in D} \langle x, y_2 \rangle - \frac{1}{|D|} \sum_{x \in D} \langle x, y_1 \rangle < 0$
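To make the spatial-centrality claim concrete, here is a minimal simulation (not from the slides; the dimensionality and the two fixed points are illustrative assumptions) that samples queries from N(0, I) and checks the distance inequality empirically:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
X = rng.standard_normal((100_000, d))    # queries x ~ N(0, I)
y1 = 0.5 * rng.standard_normal(d)        # ||y1|| smaller (with high probability)
y2 = 2.0 * rng.standard_normal(d)        # ||y2|| larger

d1 = np.linalg.norm(X - y1, axis=1)
d2 = np.linalg.norm(X - y2, axis=1)

print(d2.mean() - d1.mean())   # > 0: E[||x - y2||] - E[||x - y1||] > 0
print(np.mean(d1 < d2))        # well above 0.5: y1 is the closer of the two
                               # for most queries, i.e., y1 behaves like a hub
```

The second print shows that the point nearer the origin wins the "nearest of the two" contest for most queries, which is exactly the bias that turns centrally located objects into hubs.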
18. Problem:
The emergence of hubs degrades the performance of nearest neighbor methods
Research Objective:
Improve the performance of nearest neighbor methods by reducing the emergence of hubs
20-21. Centering: Reducing spatial centrality
Spatial centrality implies that an object similar to the centroid tends to be a hub
After centering, the similarities to the centroid are all identical (zero): the centroid coincides with the origin
[Suzuki+, 2013]
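A minimal sketch of what centering buys, assuming inner-product similarity (synthetic data; not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.standard_normal((1000, 20)) + 3.0   # data with a nonzero centroid

centroid = D.mean(axis=0)
print((D @ centroid)[:5])          # similarities to the centroid differ per object

Dc = D - centroid                  # centering: the centroid moves to the origin
print((Dc @ Dc.mean(axis=0))[:5])  # ~0 for every object: the bias source is gone
```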
22-24. Mutual proximity: Breaking asymmetric relations
Although a hub becomes the nearest neighbor of many objects, those objects cannot all become the nearest neighbor of the hub
Mutual proximity makes neighbor relations symmetric
[Schnitzer+, 2012]
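An empirical mutual-proximity sketch in the spirit of Schnitzer+ (2012): the MP of a pair is estimated as the fraction of other objects that are farther from both members of the pair than the pair is from each other. The function name and sampling setup are ours:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def mutual_proximity(D):
    """D: (n, n) distance matrix -> (n, n) mutual-proximity matrix."""
    n = D.shape[0]
    MP = np.zeros_like(D, dtype=float)
    for i in range(n):
        for j in range(n):
            # fraction of objects farther from BOTH i and j than d(i, j)
            MP[i, j] = np.mean((D[i] > D[i, j]) & (D[j] > D[i, j]))
    return MP  # symmetric by construction; use 1 - MP as a distance

X = np.random.default_rng(2).standard_normal((200, 50))
MP = mutual_proximity(squareform(pdist(X)))
```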
26. Zero-shot learning
Active research topic in NLP, CV, ML
Many applications:
•Image labeling
•Bilingual lexicon extraction
+ Many other cross-domain matching tasks
!26
[Larochelle+, 2008]
27. ZSL is a type of multi-class classification…
…but the classifier has to predict labels not appearing in the training set
[Figure: standard classification task vs. ZSL task]
29. Training: find a projection function
Find a matrix M that projects examples into label space
[Figure: M maps the example space into the label space containing the labels chimpanzee, lion, and tiger]
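The slides do not spell out the training objective here, but slide 43 identifies it as ridge/least squares regression. A minimal sketch under that assumption (rows of X are examples, rows of Y are the corresponding label vectors; lam is an illustrative regularization weight):

```python
import numpy as np

def train_projection(X, Y, lam=1.0):
    """Ridge regression: M minimizes sum_i ||M x_i - y_i||^2 + lam ||M||_F^2."""
    d = X.shape[1]
    return Y.T @ X @ np.linalg.inv(X.T @ X + lam * np.eye(d))
```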
30-31. Prediction: Nearest neighbor search
Given a test object and the test labels, to predict the label of the test object:
1. project the example into label space, using matrix M
2. find the nearest label
[Figure: M maps a test example into the label space and the nearest of the candidate labels (chimpanzee, gorilla, leopard, lion, tiger) is returned]
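The two prediction steps as code (a sketch continuing the training function above; `labels` is a matrix whose rows are the test-label vectors):

```python
import numpy as np

def predict(M, x, labels):
    z = M @ x                                                 # 1. project into label space
    return int(np.linalg.norm(labels - z, axis=1).argmin())  # 2. nearest label
```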
32-34. Hubness: Problem in ZSL
The classifier frequently predicts the same labels ("hubs")
[Figure: many projected examples fall nearest to the same few labels (sheep, zebra, hippo, rat)]
[Dinu and Baroni, 2015; see also Radovanović+, 2010]
37. Problem with the current regression approach:
The learned classifier frequently predicts the same labels (emergence of "hub" labels)
Research objective:
Investigate why hubs emerge in regression-based ZSL, and how to reduce their emergence
41. Synthetic data result
                        Current   Proposed
Hubness (N1 skewness)      24.2        0.5
Accuracy [%]               13.8       87.6
The proposed approach reduces hubness and improves accuracy
42. Why the proposed approach reduces hubness
The argument for our proposal relies on two concepts:
• Shrinkage in regression
• Spatial centrality of data distributions
43-44. "Shrinkage" in ridge/least squares regression
If we optimize the ridge/least squares objective, then the projected objects have smaller variance than the regression targets (shrinkage)
For simplicity, the projected objects are assumed to also follow a normal distribution
[See also Lazaridou+, 2015]
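A quick numerical illustration of shrinkage (our synthetic setup, reusing the ridge solution from the training sketch above):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((500, 20))
Y = X @ rng.standard_normal((20, 20)) + 0.5 * rng.standard_normal((500, 20))

lam = 10.0
M = Y.T @ X @ np.linalg.inv(X.T @ X + lam * np.eye(20))

print(Y.var())           # variance of the regression targets
print((X @ M.T).var())   # variance of the projected objects: smaller (shrinkage)
```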
45. Why the proposed approach reduces hubness
The argument for our proposal relies on two concepts:
• Shrinkage in regression ✔
• Spatial centrality of data distributions
47-48. "Spatial centrality"
$X$: query distribution (zero mean)
Fixed objects $y_1$, $y_2$, with $\|y_1\| < \|y_2\|$
Then it can be shown that
$\mathbb{E}_X[\|x - y_2\|^2] - \mathbb{E}_X[\|x - y_1\|^2] > 0$
i.e., $y_1$ is more likely to be closer to the queries, and more likely to be a hub
Because this holds for any pair $y_1$ and $y_2$, objects closest to the origin tend to be hubs
This bias is called "spatial centrality."
[See also Radovanović+, 2010]
50. Degree of spatial centrality
Further assuming distributions for the fixed objects as well, we obtain a formula that quantifies the degree of spatial centrality
51. Spatial centrality depends on the variance of label distributions
The smaller the variance of the label distribution, the smaller the spatial centrality (= the bias causing hubness)
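A simulation of this dependence (a sketch; we measure hubness, as on slide 41, by the skewness of the N_1 occurrence counts, and the variances below are illustrative assumptions):

```python
import numpy as np
from scipy.stats import skew
from scipy.spatial.distance import cdist

def n1_skewness(queries, data):
    nn = cdist(queries, data).argmin(axis=1)       # nearest datum for each query
    counts = np.bincount(nn, minlength=len(data))  # N_1 occurrence counts
    return skew(counts)

rng = np.random.default_rng(4)
Q = rng.standard_normal((5000, 50))                # fixed query distribution
wide = 1.0 * rng.standard_normal((200, 50))        # high-variance data
narrow = 0.1 * rng.standard_normal((200, 50))      # low-variance data

print(n1_skewness(Q, wide))     # strongly skewed: a few points attract most queries
print(n1_skewness(Q, narrow))   # noticeably smaller skewness
```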
52. Why the proposed approach reduces hubness
The argument for our proposal relies on two concepts:
• Shrinkage in regression ✔
• Spatial centrality of data distributions ✔
58. Q. Which configuration is better for reducing hubs?
[Figure: reverse direction (Proposed) vs. forward direction (Current)]
Spatial centrality:
For a fixed query distribution, a data distribution with smaller variance is preferable to reduce hubs
60. Q. Which configuration is better for reducing hubs?
Since the query distribution is not fixed across the two configurations, directly comparing the label distributions is not meaningful
61. Q. Which configuration is better for reducing hubs?
[Figure: Proposed (scaled) vs. Current]
Scaling does not change the nearest neighbor relation
62. Q. Which configuration is better for reducing hubs?
A. The reverse direction is preferable:
for a fixed query distribution, the variance of the data distribution in the proposed configuration is smaller
63. Summary of our proposal
Proposal: project labels into the example space
[Figure: the labels (chimpanzee, gorilla) are mapped from label space into example space]
Shrinkage: regression shrinks the variance of the projected objects
Spatial centrality: a label distribution with smaller variance is desirable to reduce hubness
➥ Projecting labels reduces the variance of the labels, hence suppresses hubness
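The proposed direction as code (a sketch mirroring the earlier training function, with the roles of examples and labels swapped; the names are ours):

```python
import numpy as np

def train_reverse(X, Y, lam=1.0):
    """Ridge regression mapping label vectors into the example space:
    W minimizes sum_i ||W y_i - x_i||^2 + lam ||W||_F^2."""
    m = Y.shape[1]
    return X.T @ Y @ np.linalg.inv(Y.T @ Y + lam * np.eye(m))

def predict_reverse(W, x, labels):
    proj = labels @ W.T   # candidate labels, moved into the example space
    return int(np.linalg.norm(proj - x, axis=1).argmin())
```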
70. Summary
• Analyzed why hubs emerge in the current ZSL approach
- the variance of the labels is greater than that of the examples
• Proposed a simple method for reducing hubness
- reverse the mapping direction
• The proposed method reduced hubness and outperformed the current approach and CCA in image labeling and bilingual lexicon extraction tasks
75. k-nearest neighbor classification
Given a dataset $D = \{(x_i, y_i)\}_{i=1}^n$, the label of $x$ is decided by its k-nearest neighbors:
$\hat{y} = \arg\min_{y_i : (x_i, y_i) \in D} f(x, x_i)$
Distance metric learning learns a matrix $L$:
$f(x, x_i) = \|Lx - Lx_i\|$
Training is computationally expensive
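For reference, the 1-NN decision rule with a learned linear metric as code (a sketch; L would come from a metric-learning method such as LMNN, which we do not implement here):

```python
import numpy as np

def knn_predict(x, X, y, L):
    d = np.linalg.norm(X @ L.T - L @ x, axis=1)   # f(x, x_i) = ||Lx - Lx_i||
    return y[d.argmin()]                          # label of the nearest neighbor
```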
76-78. Proposal: Dissimilarity
Given a dataset $D = \{(x_i, y_i)\}_{i=1}^n$, the label of $x$ is decided by its k-nearest neighbors:
$\hat{y} = \arg\min_{y_i : (x_i, y_i) \in D} f(x, x_i)$
Spatial centrality:
For a fixed query distribution, a data distribution with smaller variance is preferable to reduce hubs
The function f needs to be computed only between labeled objects and the unlabeled object
➡ labeled objects are always the target of retrieval, and the unlabeled object is always the query
$f(x, x_i) = \|x - Wx_i\|^2$
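The proposed dissimilarity as code (a one-line sketch: the labeled objects are moved by W; the query stays put):

```python
import numpy as np

def dissimilarity(x, X, W):
    return np.linalg.norm(x - X @ W.T, axis=1) ** 2   # ||x - W x_i||^2 for all i
```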
79. This method is not metric learning
• The goal of classification is to classify the query correctly
- i.e., finding a suitable decision boundary (not a metric)
80-83. Proposal: Training
Find a matrix W which minimizes the distance:
$\min_W \sum_{i=1}^n \sum_{z \in T_i} \|x_i - Wz\|^2 + \lambda \|W\|_F^2$
This function has the closed-form solution:
$W = XJX^\top (XX^\top + \lambda I)^{-1}$
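The closed form as code (a sketch; the slides do not define J, so we assume it is the (n, n) binary matrix pairing each x_i with the objects z in T_i, and we write the regularization weight as lam):

```python
import numpy as np

def train_W(X, J, lam=1.0):
    """X: (d, n) matrix with objects as columns; J: (n, n) pairing matrix (assumed).
    Returns W = X J X^T (X X^T + lam I)^{-1}."""
    d = X.shape[0]
    return X @ J @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(d))
```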
84-85. Proposal: Test
Given a query object $x$,
$\hat{y} = \arg\min_{y_i : (x_i, y_i) \in D} \|x - Wx_i\|^2$
86. Move labeled objects vs. move the query
• Move labeled objects (proposal): $f(x, x_i) = \|x - Wx_i\|^2$
This reduces the variance = reducing the emergence of hubs
• Move the query: $f(x, x_i) = \|Mx - x_i\|^2$
This increases the variance = promoting the emergence of hubs
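A quick check of the variance argument (a sketch on synthetic, unpaired data, with ridge solutions computed in both directions as above; the absolute numbers are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(5)
Q = rng.standard_normal((1000, 30))          # queries
X = 2.0 * rng.standard_normal((1000, 30))    # labeled objects

def ridge(A, B, lam=1.0):
    """Map rows of A onto rows of B by ridge regression."""
    return B.T @ A @ np.linalg.inv(A.T @ A + lam * np.eye(A.shape[1]))

W = ridge(X, Q)   # move labeled objects (proposal)
M = ridge(Q, X)   # move the query

print((X @ W.T).var(), Q.var())   # moved objects: variance below the queries'
print((Q @ M.T).var(), X.var())   # moved queries shrink instead, so the data's
                                  # variance is relatively larger -> more hubs
```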
88. Experimental objective
We evaluate the proposed method on various datasets
Our main questions are:
- Does it suppress hubs?
- Does it improve the classification accuracy?
- Is it faster than distance metric learning?
91. Results: Training time [s]
The proposed method
- reduces the emergence of hubs
- is better than metric learning methods on most datasets
- is faster than … on all datasets
Training time [s]; bold figures indicate the best performer for each dataset.
(b) Image datasets.
method      AwA      CUB      SUN     aPY
LMNN     1525.5   1098.2  15704.3   317.3
ITML     1536.3    577.6   1126.4  9211.2
DML-eig  2048.0   2084.7   2006.1  1787.1
proposed    9.5      1.5      4.1     6.4
92. Results: UCI datasets
The proposed method
- reduces the emergence of hubs
- is better than metric learning methods on most datasets
- is faster than … on all datasets
- does not work well on UCI datasets
Table 3: Classification accuracy [%]. Bold figures indicate the best performer for each dataset.
(a) UCI datasets.
method                   ionosphere  balance-scale  iris  wine  glass
original metric                86.8           89.5  97.2  98.1   68.1
LMNN                           90.3           90.0  96.7  98.1   67.7
ITML                           87.7           89.5  97.8  99.1   65.0
DML-eig                        87.7           91.2  96.7  98.6   66.5
Move-labeled (proposed)        89.6           89.5  97.2  98.6   70.8
Move-query                     79.7           89.4  97.2  96.3   62.3
93. Summary
Prediction:
$\hat{y} = \arg\min_{y_i : (x_i, y_i) \in D} \|x - Wx_i\|^2$
The proposed method
- reduces the emergence of hubs
- is better than metric learning methods on most datasets
- is faster than … on all datasets
- does not work well on UCI datasets