AET vs. AED:
Unsupervised Representation Learning
by Auto-Encoding Transformations
rather than Data
Liheng Zhang, Guo-Jun Qi, Liqiang Wang, Jiebo Luo
n Presented at the 53rd Computer Vision Study Group @ Kanto (CVPR2019 reading session)
- https://kantocv.connpass.com/event/133980/
n Covered paper
- Liheng Zhang, Guo-Jun Qi, Liqiang Wang, Jiebo Luo,
“AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations
rather than Data”,
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
2019. [Link]
※ Figures are taken from the cited papers unless otherwise noted
AI @DeNA
twitter: @tomoyukun
n About the presenter
- ~2019.3: computer vision research
- 2019.4~: computer vision at DeNA AI
- 2019.6: attended CVPR2019
n CVPR2019 participation
- 30 members / 234 paper summaries published on SlideShare (@DeNAxAI_NEWS)
n CVPR2019 Oral
n Covered paper
AET vs. AED: Unsupervised Representation Learning by Auto-Encoding
Transformations rather than Data
Liheng Zhang¹*, Guo-Jun Qi¹²†, Liqiang Wang³, Jiebo Luo⁴
¹ Laboratory for MAchine Perception and LEarning (MAPLE), http://maple-lab.net/
² Huawei Cloud, ³ University of Central Florida, ⁴ University of Rochester
guojun.qi@huawei.com
http://maple-lab.net/projects/AET.htm
Abstract (excerpt): “The success of deep neural networks often relies on a
large amount of labeled examples, which can be difficult to obtain in many
real scenarios. To address this challenge, unsupervised methods are strongly
preferred for training …”
[Figure: (a) Auto-Encoding Data (AED)]
Paper / Project page / Code
Unsupervised Representation Learning
n Learn transferable feature representations from unlabeled data x ~ p(x)
(terminology follows cvpaper.challenge)
- Pretext task (ex. ImageNet w/o labels) => Target task
➤ A network (ex. AlexNet) trained on the pretext task is reused as a feature
extractor for the target task
[Diagram: Pretext task (ex. ImageNet w/o labels) → network (ex. AlexNet) → Target task]
- Desirable properties of the learned representation: discriminative, semantic,
disentangled, …
Unsupervised Representation Learning
n Motivation
- DNNs rely on large amounts of labeled data
- Labels are costly to obtain in many real settings
n ImageNet pre-training helps, but …
- ImageNet itself required a massive labeling effort
Previous works (unsupervised)
n Auto Encoder [Hinton+, 06] / Variational Auto Encoder [Kingma+, 13]
n GANs
- Use the Discriminator features as an encoder [Radford+, 16]
- Learn an inverse mapping E of the Generator G: x = G(z), z = E(x) [Donahue+, 17]
n BiGAN
➤ Learns an encoder p(z|x) jointly with the generator p(x|z) in a GAN framework
➤ Adversarially matches the generator joint distribution
p_G(x, z) = p_G(x|z) p(z) with the encoder joint distribution
p_E(x, z) = p_E(z|x) p(x)
➤ The discriminator D distinguishes generator pairs (G(z), z) from encoder pairs (x, E(x))
- Transfer results (PASCAL VOC, %):

  Method   Cls.   Det.
  random   53.3   43.4
  BiGAN    60.3   46.9
  JP       67.7   53.2

[Figure 1: The structure of Bidirectional Generative Adversarial Networks (BiGAN)]
Donahue et al., ”Adversarial Feature Learning”, ICLR 2017.
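The adversarial pairing can be written in a few lines. A minimal sketch, assuming PyTorch, with hypothetical MLP shapes rather than the convolutional architecture of the paper:

```python
# Minimal BiGAN sketch: D scores joint pairs, matching p_G(x, z) with p_E(x, z).
# PyTorch assumed; all shapes and architectures here are hypothetical.
import torch
import torch.nn as nn

latent_dim, img_dim = 64, 784  # e.g. flattened 28x28 images

G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, img_dim))
E = nn.Sequential(nn.Linear(img_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
D = nn.Sequential(nn.Linear(img_dim + latent_dim, 256), nn.ReLU(), nn.Linear(256, 1))

bce = nn.BCEWithLogitsLoss()
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_ge = torch.optim.Adam(list(G.parameters()) + list(E.parameters()), lr=2e-4)

def bigan_step(x):
    b = x.size(0)
    z = torch.randn(b, latent_dim)
    fake_pair = torch.cat([G(z), z], dim=1)   # generator joint: (G(z), z)
    real_pair = torch.cat([x, E(x)], dim=1)   # encoder joint:   (x, E(x))
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

    # Discriminator: tell encoder pairs (label 1) from generator pairs (label 0).
    loss_d = bce(D(real_pair.detach()), ones) + bce(D(fake_pair.detach()), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator and encoder are trained jointly to fool D (labels flipped).
    loss_ge = bce(D(fake_pair), ones) + bce(D(real_pair), zeros)
    opt_ge.zero_grad(); loss_ge.backward(); opt_ge.step()

bigan_step(torch.randn(16, img_dim))
```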
Previous works (self-supervised)
n Context prediction [Doersch+, 15]
- Pretext task: given a pair of patches, predict which of the eight spatial
configurations they were sampled in
[Figure 1: the task involves randomly sampling a patch (blue) and then one of
eight possible neighbors (red). Can you guess the spatial configuration for the
two pairs of patches? Note that the task is much easier once you have
recognized the object! Answer key: Q1: Bottom right, Q2: Top center]
- Analogy to word embeddings (word2vec): context prediction is just a “pretext”
to force the model to learn a good representation, which in turn is useful in
a number of real tasks
Doersch et al., “Unsupervised Visual Representation Learning by Context Prediction”, ICCV 2015.
Previous works (self-supervised)
n Context prediction [Doersch+, 15]
- Formulated as an 8-way classification over patch pairs with a Siamese network
- The CNN representation learned this way transfers to target tasks via fine-tuning
[Figure 3: architecture for pair classification — two towers with shared weights
(conv1 11x11,96,4 → … → pool5 3x3,256,2 → fc6 4096), concatenated into
fc7 (4096) → fc8 (4096) → fc9 (8). ‘conv’ is a convolution layer, ‘fc’ a
fully-connected one, ‘pool’ a max-pooling layer, ‘LRN’ local response
normalization; numbers in parentheses are kernel size, number of outputs, and stride]
[Figure 2: the algorithm receives two patches in one of the eight configurations]
Doersch et al., “Unsupervised Visual Representation Learning by Context Prediction”, ICCV 2015.
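A minimal sketch of this patch-pair pretext task, assuming PyTorch; the tiny shared tower is a hypothetical stand-in for the AlexNet-style towers above:

```python
# Context-prediction pretext: classify the relative position of two patches.
import torch
import torch.nn as nn

class PatchPairNet(nn.Module):
    def __init__(self, n_positions=8):
        super().__init__()
        # Shared-weight tower applied to both patches (Siamese).
        self.tower = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Classify the relative position from concatenated features.
        self.head = nn.Linear(64 * 2, n_positions)

    def forward(self, patch1, patch2):
        f1, f2 = self.tower(patch1), self.tower(patch2)
        return self.head(torch.cat([f1, f2], dim=1))

net = PatchPairNet()
p1, p2 = torch.randn(4, 3, 32, 32), torch.randn(4, 3, 32, 32)
labels = torch.randint(0, 8, (4,))  # which of the 8 neighbor positions
loss = nn.CrossEntropyLoss()(net(p1, p2), labels)
```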
Previous works (self-supervised)
n Rotation prediction [Gidaris+, 18]
- Patch-based tasks can be solved from low-level artifacts => trivial solutions
➤ Predicting the rotation of the whole image instead requires recognizing the
object and its canonical orientation
➤ Improves transfer performance (Cls., Det.) over earlier pretext tasks
(random / context / jigsaw baselines)
[Figure: an image X is rotated by 0, 90, 180, and 270 degrees via g(X, y);
a shared ConvNet F(.) is trained to maximize the probability of the correct
rotation class y for each rotated copy]
Gidaris et al., “Unsupervised Representation Learning by Predicting Image Rotation”, ICLR 2018.
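A minimal rotation-pretext sketch, assuming PyTorch; the encoder is a hypothetical stand-in for the ConvNet F(.) in the figure:

```python
# RotNet-style pretext: 4-way classification of the applied rotation.
import torch
import torch.nn as nn

def rotate_batch(x):
    """Return the four rotated copies of x and their rotation labels (0..3)."""
    rots = [torch.rot90(x, k, dims=(2, 3)) for k in range(4)]
    labels = torch.arange(4).repeat_interleave(x.size(0))
    return torch.cat(rots), labels

encoder = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
head = nn.Linear(32, 4)  # 4-way rotation classification

x = torch.randn(8, 3, 32, 32)   # unlabeled images
xr, y = rotate_batch(x)         # self-supervised labels come for free
loss = nn.CrossEntropyLoss()(head(encoder(xr)), y)
```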
Idea
n AED (Auto-Encoding Data)
- Encode the input x and decode a reconstruction x̂; the training signal is the
reconstruction error

  l(x, x̂) = ∥x − x̂∥²₂

[Figure: (a) Auto-Encoding Data (AED)]
Zhang et al., “AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data”, CVPR 2019.
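For contrast with AET below, a minimal AED sketch, assuming PyTorch and a hypothetical fully-connected auto-encoder:

```python
# AED: the regression target is the data itself, l(x, x_hat) = ||x - x_hat||^2.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
decoder = nn.Linear(128, 3 * 32 * 32)

x = torch.randn(8, 3, 32, 32)
x_hat = decoder(encoder(x)).view_as(x)
loss = F.mse_loss(x_hat, x)  # reconstruction loss
```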
Idea
n AET (Auto-Encoding Transformations)
- Sample a transformation t, encode both x and t(x), and decode an estimate t̂
of the transformation instead of the data
- Predicting the transformation forces the encoder to capture the structure of
the image rather than pixel-level details
- For a parametric transformation with parameters θ, θ̂ ∈ R⁸ the loss is

  l(t, t̂) = ∥θ − θ̂∥²₂

[Figure: (b) Auto-Encoding Transformations (AET), contrasted with (a) AED]
Zhang et al., “AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data”, CVPR 2019.
Network
n Two-branch encoder with shared weights: one branch encodes the original image
x, the other the transformed image t(x)
n The features of the two branches are concatenated
n A decoder predicts the transformation t̂ from the concatenated features
[Paper excerpt: the loss ½∥M(θ) − M(θ̂)∥²₂ models the difference between the
target and the estimated transformations; transformations without explicit
geometric meaning, e.g. induced by a GAN generator, are also possible]
Zhang et al., “AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data”, CVPR 2019.
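A minimal end-to-end AET sketch, assuming PyTorch; for simplicity it regresses a 6-parameter affine matrix rather than the paper's 8-parameter projective θ, and the encoder/decoder are hypothetical stand-ins:

```python
# AET: Siamese encoding of (x, t(x)), decoder regresses transformation params.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
decoder = nn.Linear(32 * 2, 6)  # predict the 6 affine parameters theta

def random_affine(b):
    """Sample random 2x3 affine matrices (rotation + scale + translation)."""
    ang = torch.empty(b).uniform_(-3.14159, 3.14159)
    scale = torch.empty(b).uniform_(0.7, 1.3)
    tx, ty = torch.empty(b).uniform_(-0.2, 0.2), torch.empty(b).uniform_(-0.2, 0.2)
    theta = torch.zeros(b, 2, 3)
    theta[:, 0, 0] = scale * torch.cos(ang); theta[:, 0, 1] = -scale * torch.sin(ang)
    theta[:, 1, 0] = scale * torch.sin(ang); theta[:, 1, 1] = scale * torch.cos(ang)
    theta[:, 0, 2], theta[:, 1, 2] = tx, ty
    return theta

x = torch.randn(8, 3, 32, 32)
theta = random_affine(x.size(0))
grid = F.affine_grid(theta, x.shape, align_corners=False)
tx_img = F.grid_sample(x, grid, align_corners=False)  # t(x)

feats = torch.cat([encoder(x), encoder(tx_img)], dim=1)  # two-branch concat
theta_hat = decoder(feats).view(-1, 2, 3)
loss = F.mse_loss(theta_hat, theta)  # l(t, t_hat) = ||theta - theta_hat||^2
```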
Transformations
n Two white-box (parametric) families
- AET-affine: composition of
  Rotation [-180°, +180°] / Translation [-0.2, +0.2] × (H or W) /
  Scale [0.7, 1.3] / Shear [-30°, +30°]
- AET-project: projective transformation formed by
  Scale [0.8, 1.2] / Rotation {0°, 90°, 180°, 270°} /
  Stretching each corner by up to 0.125 × (H or W)
[Figure: an original image and examples of each transformation]
Zhang et al., “AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data”, CVPR 2019.
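To show how an 8-dimensional θ can arise from corner stretching, here is a sketch assuming numpy that solves for the homography mapping four perturbed corners; the scale/rotation pre-steps of the paper are omitted:

```python
# Sample an AET-project-style target: perturb the four image corners and solve
# for the 3x3 homography (h33 = 1), whose 8 free parameters form theta in R^8.
import numpy as np

def homography_from_corners(src, dst):
    """Direct linear solve for H mapping 4 src corners to 4 dst corners."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    theta = np.linalg.solve(np.array(A, float), np.array(b, float))
    return theta  # [h11, h12, h13, h21, h22, h23, h31, h32]

H = W = 32
src = np.array([[0, 0], [W, 0], [W, H], [0, H]], float)
dst = src + np.random.uniform(-0.125, 0.125, src.shape) * np.array([W, H])
theta = homography_from_corners(src, dst)  # regression target for AET
```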
Transformations
n Loss: MSE between the parameter vectors of the sampled and predicted
transformations, θ, θ̂ ∈ R⁸, for both AET-affine and AET-project

  l(t, t̂) = ∥θ − θ̂∥²₂

n Only these two transformation families are evaluated; an ablation over other
transformations (crop, flip, color jitter, random erasing, adversarial
perturbation) would be interesting but is not reported …
Evaluation
n Train AET on the CIFAR-10 / ImageNet train data (without labels)
n Freeze the learned encoder and use it as a feature extractor
n Train a classifier (or run k-NN) on features of the train data
n Evaluate on the val data
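A sketch of this protocol on pre-extracted features, assuming scikit-learn; the feature arrays here are random placeholders for outputs of the frozen AET encoder:

```python
# Frozen-feature evaluation: fit k-NN and a linear probe, score on val data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Hypothetical pre-extracted features from the frozen encoder.
train_feats = np.random.randn(500, 64); train_labels = np.random.randint(0, 10, 500)
val_feats = np.random.randn(100, 64); val_labels = np.random.randint(0, 10, 100)

knn = KNeighborsClassifier(n_neighbors=10).fit(train_feats, train_labels)
linear = LogisticRegression(max_iter=1000).fit(train_feats, train_labels)

print("k-NN val acc:", knn.score(val_feats, val_labels))
print("linear val acc:", linear.score(val_feats, val_labels))
```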
Details
n CIFAR-10
- Backbone: Network In Network (NIN)
• AET network: NIN blocks (×4) => GAP => concatenate the two branches => fc
• The encoder used for evaluation is the first 2 blocks (+ GAP)
- Classifiers: 3 fc layers (“+ FC”) or the 3rd NIN block (“+ conv”), trained on
the frozen encoder
n ImageNet
- Backbone: AlexNet
• AET network: AlexNet up to fc2 => concatenate the two branches => fc
• The convolutional layers are used as the encoder
- Classifiers: 3 fc layers (non-linear) or 1 fc layer (linear), trained on the
frozen encoder
Results on CIFAR-10
Table 1: Comparison between unsupervised feature learning methods on CIFAR-10.
The fully supervised NIN and the random Init. + conv have the same three-block
NIN architecture, but the first is fully supervised while the second is trained
on top of the first two blocks that are randomly initialized and stay frozen
during training.

  Method                               Error rate
  Supervised NIN (Lower Bound)          7.20
  Random Init. + conv (Upper Bound)    27.50
  Roto-Scat + SVM [21]                 17.7
  ExamplarCNN [7]                      15.7
  DCGAN [25]                           17.2
  Scattering [20]                      15.3
  RotNet + FC [10]                     10.94
  RotNet + conv [10]                    8.84
  (Ours) AET-affine + FC                9.77
  (Ours) AET-affine + conv              8.05
  (Ours) AET-project + FC               9.41
  (Ours) AET-project + conv             7.82

[Paper excerpt: trained with SGD; momentum 0.9 and weight decay 5 × 10⁻⁴;
learning rate dropped on a fixed epoch schedule; transformation parameters
sampled from the ranges on the Transformations slide]
n AET-project outperforms AET-affine, and both outperform RotNet; the same
trend holds under k-NN evaluation
n AET-project (+ conv) almost closes the gap to the fully supervised model
(7.82 vs. 7.20)
Zhang et al., “AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data”, CVPR 2019.
Results on ImageNet
n On ImageNet, AET-project consistently narrows the gap to the fully supervised
upper bound and outperforms prior unsupervised methods

Table 4: Top-1 accuracy with linear layers on ImageNet. AlexNet is used as
backbone to train the unsupervised models for comparison. A 1,000-way linear
classifier is trained upon various convolutional layers of feature maps that
are resized to have about 9,000 elements. Fully supervised and random models
are also reported to show the upper and lower bounds of unsupervised model
performances. Only a single crop is used and no dropout or local response
normalization is used during testing for the AET, except the models denoted
with * where ten crops are applied to compare results.

  Method                               Conv1  Conv2  Conv3  Conv4  Conv5
  ImageNet Labels (Upper Bound) [10]   19.3   36.3   44.2   48.3   50.5
  Random (Lower Bound) [10]            11.6   17.1   16.9   16.3   14.1
  Random rescaled [16] (Lower Bound)   17.5   23.0   24.5   23.2   20.6
  Context [5]                          16.2   23.3   30.2   31.7   29.6
  Context Encoders [22]                14.1   20.7   21.0   19.8   15.5
  Colorization [30]                    12.5   24.5   30.4   31.5   30.3
  Jigsaw Puzzles [18]                  18.2   28.8   34.0   33.9   27.1
  BiGAN [6]                            17.7   24.5   31.0   29.9   28.0
  Split-Brain [29]                     17.7   29.3   35.4   35.2   32.8
  Counting [19]                        18.0   30.6   34.3   32.5   25.7
  RotNet [10]                          18.8   31.7   38.7   38.2   36.5
  (Ours) AET-project                   19.2   32.8   40.6   39.7   37.7
  DeepCluster* [4]                     13.4   32.3   41.0   39.6   38.2
  (Ours) AET-project*                  19.3   35.4   44.0   43.6   42.4

Table 3: Top-1 accuracy with non-linear layers on ImageNet. AlexNet is used as
backbone to train the unsupervised models. After unsupervised features are
learned, non-linear classifiers are trained on top of Conv4 and Conv5 layers
with labeled examples to compare their performances. We also compare with the
fully supervised models and random models that give upper and lower bounded
performances. For a fair comparison, only a single crop is applied in AET and
no dropout or local response normalization is applied during the testing.

  Method                               Conv4  Conv5
  ImageNet Labels [3] (Upper Bound)    59.7   59.7
  Random [19] (Lower Bound)            27.1   12.0
  Tracking [28]                        38.8   29.8
  Context [5]                          45.6   30.4
  Colorization [30]                    40.7   35.2
  Jigsaw Puzzles [18]                  45.3   34.6
  BiGAN [6]                            41.9   32.2
  NAT [3]                              -      36.0
  DeepCluster [4]                      -      44.0
  RotNet [10]                          50.0   43.8
  (Ours) AET-project                   53.2   47.0

[Paper excerpt: learning rate 0.01, dropped by a factor of 10 at epochs 100 and
150; AET trained for 200 epochs in total; projective transformations randomly sampled]
(non-linear classifier: 3-layer NN; linear classifier: 1 layer)
Zhang et al., “AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data”, CVPR 2019.
Results
n Comparison on CIFAR-10 and ImageNet (fragment):

  Method               CIFAR-10  ImageNet
  Counting [19]          23.3      33.9
  RotNet [10]            21.5      31.0
  (Ours) AET-project     22.1      32.9

[Figure: (a) CIFAR-10, (b) ImageNet]
Zhang et al., “AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data”, CVPR 2019.
Results
n ImageNet comparison (fragment):

  RotNet [10]          31.0  35.1  34.6  33.7
  (Ours) AET-project   32.9  37.1  36.2  34.7

Zhang et al., “AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data”, CVPR 2019.
n Summary / impressions
- Forcing the model to predict transformations encourages the encoder to
capture “structure information” of the image …
- The AET framework is general; it could plausibly extend to other targets,
e.g. Keypoints
n Open questions
- Whether certain transformation families admit trivial solutions
Zhang et al., “AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data”, CVPR 2019.