CVPR2010: Semi-supervised Learning in Vision: Part 2: Theory

Semi-Supervised Learning in Vision

Amir Saﬀari, Christian Leistner, Horst Bischof

Institute for Computer Graphics and Vision, Graz University of Technology

CVPR San Francisco, June 18, 2010

SSL Self-Training Generative Models Margin Manifold Multi-View Large-Scale Related

Outline

1 Semi-Supervised Learning

2 Self-Training

3 Generative Models

4 Margin Assumption

5 Cluster and Manifold Assumption

6 Multi-View Learning

7 Large-Scale, Multi-Class SSL, and Online Learning

8 Related Topics
Graz University of Technology

Amir Saﬀari, Christian Leistner, Horst Bischof Semi-Supervised Learning in Vision


Supervised and Semi-Supervised Learning

Supervised learning is all about ﬁnding mappings from input (feature)
space to output space:

f :X →Y





Supervised learning is all about ﬁnding mappings from input (feature)
space to output space:

f (x; θ ) ∈ F : X → Y





In Semi-supervised learning we wish to ﬁnd mappings by using both
labeled and unlabeled data:

f (x; θ ) ∈ F : X → Y




Why Should SSL Make Sense?

Learner
f (x; θ ) ∈ F : X → Y
Labeled data
Dl = {(x, y) ∈ X × Y }





Learner
f (x; θ ) ∈ F : X → Y
Labeled data
Dl = {(x, y) ∈ X × Y }
Unlabeled data
Du = {x ∈ X }





Learner
f (x; θ ) ∈ F : X → Y
Labeled data
Dl = {(x, y) ∈ X × Y }
Unlabeled data
Du = {x ∈ X }
Unsupervised learning: the goal is to recover the structure of the
data, eg. clusters, manifolds, low dimensional embeddings, density ...





Learner
f (x; θ ) ∈ F : X → Y
Labeled data
Dl = {(x, y) ∈ X × Y }
Unlabeled data
Du = {x ∈ X }
Unsupervised learning: the goal is to recover the structure of the
data, eg. clusters, manifolds, low dimensional embeddings, density ...

Density
p(x)




When unlabeled data is going to help?

Bayes rule:
p(y) p(x|y)
prior likelihood
p(y|x) =
p(x)
posterior
evidence





When we expect that p(x) (structure of the data) is related to
the p(y|x), ie. they share parameters.





When we expect that p(x) (structure of the data) is related to
the p(y|x), ie. they share parameters.
In other words, a better estimation of p(x) can improve the
estimation of p(y|x).




Semi-Supervised Learning Assumptions

Semi-supervised learning often is eﬀective, if the assumptions
regarding the relationship between the structure of the data p(x) and
the posterior p(y|x) are true for a given problem.




Computer Vision Problems: Object Recognition

Structure: similar images may contain similar objects.




Computer Vision Problems: Object Detection

Structure: similar image patches may contain similar objects. Very
close patches (over the 2D image neighborhood) may contain the
same object.




Computer Vision Problems: Object Tracking

Structure: similar image patches may contain similar objects. Very
close patches (over the 2D image neighborhood and time) may
contain the same object.




Computer Vision Problems: Object Segmentation

Structure: similar pixels may correspond to the same object. Very
close pixels (over the 2D image neighborhood and time) may belong
to the same object.




Diﬀerent Semi-Supervised Learning Settings

Semi-supervised classiﬁcation: f : X → Y , Y = {1, · · · , K }.





Semi-supervised regression: f : X → Y , Y = R.





Semi-supervised regression: f : X → Y , Y = R.
Semi-supervised clustering: constrainted clustering, clustering
with pair-wise must-link and cannot-link constraints.




Transductive and Inductive SSL

Transductive: ﬁnd f : Dl ∪ Du → Y |Dl ∪Du | . Can not be used for
any future example which was not in the training set.
Inductive: ﬁnd f : X → Y . Can be used for any future example,
beyond the training set.




Interactive Segmentation




Other Uses of Unlabeled Data

Unsupervised preprocessing: normalization, standardization,
PCA, ICA, ...

[Torralba ICCV 2009, Mobahi et al. ICML 2009, Ranzato NIPS 2007]




PCA, ICA, ...
Feature extraction: bag-of-words





PCA, ICA, ...
Feature extraction: bag-of-words

Unsupervised feature learning: deep learning and sparse coding




The Simplest Approach to SSL

Self-training is a meta learning (wrapper) semi-supervised method.
Self-Training
Inputs: learning algorithm T, labeled set Dl , and unlabeled set
Du .
For n = 1 to N:
1 Train using the labeled set: f n = T (Dl ).
2 Use f n to classify the unlabeled set: cu = f n (Du ).
3 Create the set of m most conﬁdent examples from the unlabeled
set: C ⊂ Du .
4 Update the labeled set: Dl ← Dl ∪ {(x, f n (x ))| x ∈ C}.
5 Update the unlabeled set: Du ← Du C .




Example: 1-NN Classiﬁer


[Zhu TPCL 2009]


Example: Interactive Segmentation with RF




Example: Self-Training with RF (Iteration 5)




Example: Self-Training with RF (Iteration 10)




Generative Models

Joint Distribution

p(x, y|θ, π ) = p(x|y; θ ) p(y|π )
mixture component πy




Expectation Maximization Approach

Log-likelihood
(θ, π; Dl , Du ) = ∑ log πy p(x|y; θ ) +λ ∑ log ∑ πk p(x|k; θ )
(x,y)∈Dl x ∈Du k

labeled data unlabeled data




Expectation Maximization Approach

Log-likelihood
(θ, π; Dl , Du ) = ∑ log πy p(x|y; θ ) +λ ∑ log ∑ πk p(x|k; θ )
(x,y)∈Dl x ∈Du k

labeled data unlabeled data
Expectation Maximization (EM) can be used to estimate the model
parameters.
EM
Inputs: labeled set Dl , and unlabeled set Du , initial model
parameters θ0 , π0 .
For n = 1 to N:
1 E step: Estimate the posterior p (y | x; θn−1 , πn−1 ) for Du .
2 M step: Update the model parameters, given the posteriors for
unlabeled data: θn , πn = arg max (θ, π; Dl , Du ).
θ,π Graz University of Technology



Example: Two Gaussians




Example: Supervised GMM




Example: Semi-Supervised EM with GMM




Assumptions

[Jaakkola et al.]




Linear Classiﬁer




Maximum Margin Linear Classiﬁer




SVM




S3VM and TSVM




S3VM

SVM with soft margins: binary case y ∈ {−1, 1} without the bias
term
λ 1
min w 2 +
w 2 2 ∑ max(0, 1 − y w, x )
|Dl | (x,y)∈D
l
margin
hinge loss




S3VM

SVM with soft margins: binary case y ∈ {−1, 1} without the bias
term
λ 1
min w 2 +
w 2 2 ∑ max(0, 1 − y w, x )
|Dl | (x,y)∈D
l
margin
hinge loss
Prediction rule:
ˆ
y = sign( w, x )
Margin of an unlabeled data:

mu (w, x) = | w, x |




S3VM: Loss Functions




S3VM: Formulation

S3VM
λ 1
min
w 2
w 2
2 + ∑ max(0, 1 − y w, x ) +
|Dl | (x,y)∈D
l
margin
hinge loss
γ
+ ∑ max(0, 1 − | w, x |)
|Du | x∈Du
sym. hinge




S3VM: Formulation

S3VM
λ 1
min
w 2
w 2
2 + ∑ max(0, 1 − y w, x ) +
|Dl | (x,y)∈D
l
margin
hinge loss
γ
+ ∑ max(0, 1 − | w, x |)
|Du | x∈Du
sym. hinge

Balancing constraint:
1 1
∑ y = |Du | ∑ w, x |
|Dl | (x,y)∈D
lx∈Du

[Vapnik 1998, Joachims ICML 1999] Graz University of Technology



Entropy Regularization




Example: Interactive Segmentation with Linear SVM




Example: Interactive Segmentation with Linear TSVM




Example: Linear TSVM

[Zhu TPCL 2009]



Cluster Assumption and Margin Maximization




Cluster Kernels

Linear SVM:
f (x) = w, x




Cluster Kernels

Linear SVM:
f (x) = w, x
Nonlinear SVM:

f (x ) = w, φ(x ) = ∑ yα x φ(x), φ(x )
(x,y)∈Dl
nonlinear mapping K (x,x )




Cluster Kernels

Linear SVM:
f (x) = w, x
Nonlinear SVM:

f (x ) = w, φ(x ) = ∑ yα x φ(x), φ(x )
(x,y)∈Dl
nonlinear mapping K (x,x )

Cluster Kernel:
∑ p I(c(x) == c(x ))
Kc (x, x ) = K (x, x )
n
[Weston et al. NIPS 2003]




Boosting

Boosting
M
f (x) = ∑ wm g(x; θm )
m =1




Boosting

Boosting
M
f (x) = ∑ wm g(x; θm )
m =1

Base Learner
g(x; θm ) ∈ G : X → Y




Boosting

Boosting
M
f (x) = ∑ wm g(x; θm )
m =1

Base Learner
g(x; θm ) ∈ G : X → Y
Boosting is a linear classiﬁer over the space of base learners G :

f (x) = G (x; θ )w




Boosting: Learning with Functional Gradient Descent

M
1
f (x; β∗ ) = arg min ∑ (x, y; f ) ⇒ f (x; β∗ ) = ∑ wm g(x; θm )
|Xl | (x,y)∈X
∗ ∗
β l m =1




Cluster Prior


[Saﬀari et al. CVPR 2009]



Cluster Prior




Example: SSL Boosting with Cluster Prior

0.60
VOC2006
0.55

0.50
Class. Acc.

0.45

0.40

0.35
RMSB
SER
0.30 AML
SVM
TSVM
0.25
0.0 0.1 0.2 0.3 0.4 0.5
Labeled Samp. Ratio, r




Example: SSL Boosting with Cluster Prior

10000
Computation Time
r =0.1
r =0.5
8000
Time (sec)

6000

4000

2000

0 RMSB SER TSVM





Example: Interactive Segmentation with Boosting




Example: Interactive Segmentation with Cluster Prior




Manifold Assumption

[Hein and von Luxburg MLSS 2007]




Manifold Assumption

[Hein and von Luxburg MLSS 2007] Graz University of Technology



Manifold Assumption

[Hein and von Luxburg MLSS 2007]



Label Propagation: Supervised

[Zhu TPCL 2009]




Label Propagation: Semi-Supervised

[Zhu TPCL 2009]




Label Propagation: Semi-Supervised

[Kveton et al. CVPR OLCV 2010]




Mincut




Mincut

Mincut energy:

min ∑
{y x } x∈D (x ,y )∈D
∑ s(x, x )(yx − y )2 + ∑ s(x, x )(yx − yx )2
u l x ∈Du




Random Walks on Graph

[Zhu TPCL 2009]



Manifold Regularization: LapSVM

LapSVM
λ 1
min
w 2
w 2
2 + ∑ max(0, 1 − y w, x )+
|Dl | (x,y)∈D
l
γ
+
|Du |2 ∑ ∑ s(x, x )( w, x − w, x )2
x∈Du x ∈Dl ∪Du




Manifold Regularization: LapSVM

LapSVM
λ 1
min
w 2
w 2
2 + ∑ max(0, 1 − y w, x )+
|Dl | (x,y)∈D
l
γ
+
|Du |2 ∑ ∑ s(x, x )( w, x − w, x )2
x∈Du x ∈Dl ∪Du

Graph Laplacian
λ 1
min
w 2
w 2
2 + ∑ max(0, 1 − y w, x )+
|Dl | (x,y)∈D
l
γ
+ f, Lf
|Du |2
where f is the response vector of the classiﬁer to both labeled and
unlabeled data.

[Belkin et al. JMLR 2006]



Boosting with Priors and Manifolds

1
min
w,θ
∑ (x, y; w, θ ) +
|Dl | (x,y)∈D
l
Supervised
γ s(x, x )
+ ∑ λ  p (x, q; w, θ ) +(1 − λ) ∑ z(x) m (x, x ; w, θ )
|Du | x∈Du x ∈Du
Prior x =x
Manifold

Unsupervised


Example: Boosting with Cluster and Manifold Priors

Methods-Dataset g241c g241d Digit1 USPS COIL BCI
1-NN 44.05 43.22 23.47 19.82 65.91 48.74
SVM 47.32 46.66 30.60 20.03 68.36 49.85
RF (weak) 47.51 48.44 42.42 22.90 75.72 49.35
GBoost-RF 46.77 46.61 38.70 20.89 69.85 49.12
TSVM 24.71 50.08 17.77 25.20 67.50 49.15
Cluster Kernel 48.28 42.05 18.73 19.41 67.32 48.31
LapSVM 46.21 45.15 08.97 19.05 N/A 49.25
ManifoldBoost 42.17 42.80 19.42 19.97 N/A 47.12
SemiBoost-RF 48.41 47.19 10.57 15.83 63.39 49.77
MCSSB-RF 49.77 48.57 38.50 22.95 69.96 49.12
GPMBoost-RF 15.69 39.45 11.19 14.92 62.60 49.27

[Saﬀari et al. 2010]




Example: Boosting with Cluster and Manifold Priors

Methods-Dataset g241c g241d Digit1 USPS COIL BCI
1-NN 40.28 37.49 06.12 07.64 23.27 44.83
SVM 23.11 24.64 05.53 09.75 22.93 34.31
RF (weak) 44.23 45.02 17.80 16.73 34.26 43.78
GBoost-RF 31.84 32.38 06.24 13.81 21.88 40.08
TSVM 18.46 22.42 06.15 09.77 25.80 33.25
Cluster Kernel 13.49 04.95 03.79 09.68 21.99 35.17
LapSVM 23.82 26.36 03.13 04.70 N/A 32.39
ManifoldBoost 22.87 25.00 04.29 06.65 N/A 32.17
SemiBoost-RF 41.26 39.14 02.56 05.92 15.31 47.12
MCSSB-RF 45.11 40.26 13.19 12.31 31.09 47.64
GPMBoost-RF 12.80 12.59 02.35 06.33 14.49 45.41

[Saﬀari et al. 2010]




Example: Interactive Segmentation with Cluster Prior and Manifolds




Learning from Diﬀerent Views

Data is represented by multiple views: x = [x1 | · · · |xV ] T
T T




Learning from Diﬀerent Views

Data is represented by multiple views: x = [x1 | · · · |xV ] T
T T

There is a classiﬁer per view: F = { f v }V=1
v

Multi-View Learning

F ∗ = arg min ∑ (x, y; F ) + γ ∑ φ(x; F )
F (x,y)∈Dl x∈Du




Co-Training

Co-Training
Inputs: learning algorithm T, labeled set Dl , and unlabeled set
Du .
For n = 1 to N:
1 1 2
1 Train using the labeled set: f n = T (Dl ) and f n = T (Dl ).2

2 Use f n1 and f 2 to classify the unlabeled set: c1 = f 1 (D 1 ) and
n u n u
c2 = f n (Du ).
u
2 2

3 Create the set of m most conﬁdent examples from each view of
the unlabeled set: C 1 ⊂ Du andC 2 ⊂ Du .
1 2

4 Update the labeled set:
Dl ← Dl ∪ {(x, f n (x))| x ∈ C 1 } ∪ {(x, f n (x))| x ∈ C 2 }.
1 2

5 Update the unlabeled set: Du ← Du (C 1 ∪ C 2 ).

[Blum and Mitchell COLT 1998]




Co-Training and Assumptions

We can represent the data in two views: x = [x1 , x2 ].





We can train a good classiﬁer only using either x1 or x2 .

[Blum and Mitchell COLT 1998, Sindhwani et al. ICML 2005]





x1 and x2 are conditionally independent given the class.




Co-Training: Visual and EEG Data


[Kapoor et al. CVPR 2008]



Multi-View Boosting with Priors

Multi-View Priors
1
V − 1 s∑
qv (k |xv ) = ps (k |xs ), ∀v ∈ {1, · · · , V }, k ∈ Y
=v




Multi-View Boosting with Priors

Multi-View Priors
1
V − 1 s∑
qv (k |xv ) = ps (k |xs ), ∀v ∈ {1, · · · , V }, k ∈ Y
=v

Boosting with Priors:
1
arg min ∑ (x, y; w, θ ) +
|Dl | (x,y)∈D
w,θ l
Supervised
γ
+ ∑  p (x, q; w, θ )
|Du | x∈Du
Prior
[Saﬀari et al. ECCV 2010]




Robust Loss Functions

q + =0.5
5 5
0-1 KL
Hinge SKL
4
Exponential 4
JS
Logit
Savage
3 3
(x,y;f)

(x,q;f)
2 2

1 1

0 4 2 0 2 4 0 4 2 0 2 4
m(x,y;f) f + (x)

q + =0.75
5
KL
SKL
4
JS

3
(x,q;f)

2

1

0 4 2 0 2 4
f + (x)

[Saﬀari et al. ECCV 2010]


Example: Interactive Segmentation with Multi-View
Classiﬁers




Large-Scale Applications




Learning from Gigantic Image Collections

Original Objective
J (f) = (f − y)T A(f − y) + f T Lf

Objective with eigenvectors using f = Uα
J (α) = (Uα − y)T A(Uα − y) + α T Σα T

[Fergus et al. NIPS 2009]




80 million tiny images

[Torralba PAMI 2008, Fergus et al. NIPS 2009]





[Fergus et al. NIPS 2009]



Deep Learning

Embedding loss
∑ ( f ( x ), y ) + γ ∑ ∑ s(x, x ) ( f (x), f (x ))
(x,y)∈Dl x∈Du x ∈D




Deep Learning

Semantic Role Labeling:

“The cat eats the ﬁsh in the pond” →
“TheARG0 catARG0 eatsREL theARG1 ﬁshARG1 inARGM-LOC
theARGM-LOC pondARGM-LOC ”

Trained with stochastic gradient descent on 1 million labeled data
and 631 million unlabeled data from Wikipedia.


[Weston et al. ICML 2008]



Online Learning and SSL

More about online semi-supervised learning in the second part of the
tutorial.

[Goldberg et al. ECML 2009, Zeisl et al. CVPR 2010, Leistner et al. PR 2010, Saﬀari et al. ECCV 2010, Kveton CVPR

OLCV 2010]



Multi-Class Problems

Many interesting problems are multi-class.





What is wrong with 1-vs-all?





Computational problems.





Calibration problems [B. Schoelkopf, A. Smola, 2002].





Calibration problems [B. Schoelkopf, A. Smola, 2002].
Artiﬁcial unbalanced binary problems.




Boosting with Priors and Manifolds

Boosting with Priors and Manifolds is an inherently multi-class
algorithm.


[Saﬀari et al. CVPR 2009, CVPR 2010, ECCV 2010]


Transfer Learning

[Pan and Yang TKDE 2009]




Transfer Learning with Attributes

[Farhadi CVPR 2009, Lampert CVPR 2009] Graz University of Technology





[Farhadi et al. CVPR 2010]




[Rohbach et al. CVPR 2010]



Learning from Latent (Hidden) Variables

Latent SVM
f w (x) = max w, φ(x, z)
z

λ 1
min
w 2
w 2
2 + ∑ max(0, 1 − y f w (x))
|Dl | (x,y)∈D
l


[Felzenszwalb et al. CVPR 2008]



Indoor Scene Labeling

[Wang et al. ECCV 2010]




Latent Structural SVM

[Yu and Joachims 2009]




Latent Structural SVM for Indoor Scene Labeling


[Wang et al. ECCV 2010]



Multiple Instance Learning

-
-
- -
+ -

-
+ -

- -
-
+
-


CVPR2010: Semi-supervised Learning in Vision: Part 2: Theory

Recomendados

Recomendados

Mais conteúdo relacionado

Mais de zukun

Mais de zukun (20)

Último

Último (20)

CVPR2010: Semi-supervised Learning in Vision: Part 2: Theory