The harmony potential: fusing local and global information for semantic image segmentation

Introduction
Graph cuts for image segmentation
The harmony potential
Experimental results
Discussion
The harmony potential:
fusing local and global information for semantic image segmentation
Andrew D. Bagdanov
bagdanov@cvc.uab.es
Departamento de Ciencias de la Computación
Universidad Autónoma de Barcelona
CVPR 2010 (to appear)
J. Gonfaus, X. Boix, J. van de Weijer, J. Serrat, J. González The harmony potential

Outline
1 Introduction
2 Graph cuts for image segmentation
3 The harmony potential
4 Experimental results
5 Discussion

Giving semantics to pixels
[Figure panels: Image, Object, Class]
Semantic image segmentation is not object segmentation
Only for simple cases are they the same

Turning a hard problem into a harder one
[Figure panels: Image, Object, Class]
The goal is to assign a semantic label to every pixel
Fine distinctions must be made

Make that a very hard one
[Figure panels: Image, Object, Class]
The goal is to assign a semantic label to every pixel
Fine distinctions must be made
Occlusions, varying viewpoint and size complicate things

Semantic categories
20 semantic categories for Pascal
aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow,
diningtable, dog, horse, motorbike, person, potted plant, sheep,
sofa, train, and tv/monitor.

State of the art: Conditional Random Fields (CRFs)
One of the most successful approaches to image segmentation is
the Hierarchical CRF approach.
Using potential functions, information at different scales can be
incorporated into the segmentation.
We identify three levels of scale: local, mid-level and global [Zhu,
NIPS2008].
We show how these three levels of scale can be integrated in a
way that preserves their unique characteristics.
Existing techniques apply overly simplified models of context that
do not generalize upward from local to global scales.

Global constraints on label combinations
Our principal idea is to use global classification to enhance
segmentation results.
Global image classification results tend to be less noisy than local ones.
We will use them to constrain the combinations of semantic labels
we are likely to encounter during segmentation.
We also show how the resulting optimization problem can be
made tractable by learning to efficiently subsample label
combinations at the global level.

Some terminology
We represent our segmentation problem as a graph: G = (V, E)
V is used for indexing random variables, and E is the set of
undirected edges representing compatibility relationships between
random variables.
X = {Xi} denotes the set of random variables or nodes, for i ∈ V.
An energy function will be defined over graphical configurations of
random variables.
By the Hammersley-Clifford theorem, the probability of a configuration
x = {xi} can be written as the negative exponential of an
energy function, $P(x) \propto \exp(-E(x))$ with
$E(x) = \sum_{c \in C} \varphi_c(x_c)$, where $\varphi_c$ is the potential
function of clique c ∈ C.

Consistency potentials for labeling problems
The energy function of G can be written as:
$$E(x) = \sum_{i \in V} \phi(x_i) + \sum_{(i,j) \in E_L} \psi_L(x_i, x_j) + \sum_{(i,g) \in E_G} \psi_G(x_i, x_g).$$
The unary term $\phi(x_i)$ depends on a single probability $P(X_i = x_i \mid O_i)$,
where $O_i$ is the observation that affects $X_i$ in the model.
The smoothness potential ψL(xi, xj) determines the pairwise
relationship between two local nodes.
The consistency potential ψG(xi, xg) expresses the dependency
between local nodes and a global node.
The Maximum a Posteriori (MAP) estimate of the optimal labeling is:
$$x^* = \arg\min_x E(x).$$
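The three terms of the energy and the MAP estimate can be illustrated on a toy problem. The sketch below is only an illustration under stated assumptions: the label set, the unary costs, the constant smoothness cost and the constant Potts-style consistency cost are all hypothetical, and the minimization is exhaustive enumeration, which is feasible only for a handful of nodes (the talk itself uses α-expansion graph cuts for inference).

```python
from itertools import product

LABELS = ("cat", "dog", "cow")  # hypothetical label set, for illustration only


def potts_consistency(x_i, x_g, cost=1.0):
    # Assumed constant cost: penalize a local label that differs from the global label.
    return cost if x_i != x_g else 0.0


def energy(local, x_g, unary, edges, smooth_cost):
    """E(x) = sum_i phi(x_i) + sum_(i,j) psi_L(x_i, x_j) + sum_(i,g) psi_G(x_i, x_g)."""
    e = sum(unary[i][x] for i, x in enumerate(local))                # unary terms
    e += sum(smooth_cost for i, j in edges if local[i] != local[j])  # smoothness terms
    e += sum(potts_consistency(x, x_g) for x in local)               # consistency terms
    return e


def map_estimate(unary, edges, smooth_cost):
    """Exhaustive arg min over all local labelings and global labels (toy sizes only)."""
    candidates = (
        (energy(local, x_g, unary, edges, smooth_cost), local, x_g)
        for local in product(LABELS, repeat=len(unary))
        for x_g in LABELS
    )
    return min(candidates, key=lambda c: c[0])


# Hypothetical unary costs (think of them as negative log probabilities) for 3 nodes.
unary = [
    {"cat": 0.2, "dog": 1.5, "cow": 2.0},
    {"cat": 1.0, "dog": 0.9, "cow": 2.0},
    {"cat": 0.3, "dog": 1.4, "cow": 2.0},
]
edges = [(0, 1), (1, 2)]  # a chain of three neighboring nodes

best_energy, best_local, best_global = map_estimate(unary, edges, smooth_cost=0.5)
print(best_energy, best_local, best_global)
```

In this toy case the middle node slightly prefers "dog" on its own, but the smoothness and consistency terms make the all-"cat" labeling the minimum-energy configuration.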

Representing semantic segmentations
Each node represents an image region
Each node takes a single label from the set of semantic categories

Smoothness: only local constraints
Adds a constraint on neighboring nodes
Usually enforces gradual (local) changes

Potts: $\psi_G(x_i, x_g) = \gamma_i^l \, T[x_i \neq x_g]$
New node enforces global consistency among local labels
Consistency with a single global label [Plath, ICML2009]

Robust $P^N$: consistency + “anything goes”
Extends the Potts potential [Kohli, CVPR2008]
“Free label” at global node allows any local combination

Different features for discriminations
The previously mentioned approaches all try to make global
distinctions using local information.
Either by voting over local observations (Potts),
or by penalizing rampantly discordant local label assignments
(Robust $P^N$).
None of these techniques tries to exploit truly global information to
constrain local labels.
And none incorporate the notion of encoding combinations of
primitive node labels at the global level.

The harmony potential: symphony of semantics
Let L = {l1, . . . , lM} denote the set of semantic class labels from
which the local nodes Xi take their labels.
The global node Xg, instead, takes its label from P(L), the power
set of L.
In this way, we can represent any combination of primitive labels
from L at the global node.
The harmony potential is now defined as:
$$\psi_G(x_i, x_g) = \gamma_i^l \, T[x_i \notin x_g].$$

The harmony potential: selective subsets
Only local labels that fall outside the global subset are penalized.
Can represent more diverse combinations.
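The difference between a single global label and a global subset can be seen in a small sketch. It assumes a constant cost `gamma` for every node and label (the slides use a per-node, per-label cost), and the label names and function names are hypothetical.

```python
def potts_penalty(x_i, x_g, gamma=1.0):
    """Potts consistency: the global node holds ONE label; any other local label is penalized."""
    return gamma if x_i != x_g else 0.0


def harmony_penalty(x_i, x_g, gamma=1.0):
    """Harmony: the global node holds a SET of labels; only local labels outside it are penalized."""
    return gamma if x_i not in x_g else 0.0


# Hypothetical local labeling of four superpixels.
local_labels = ["person", "horse", "person", "dog"]

# With a single global label, both "horse" and "dog" disagree with "person".
potts_cost = sum(potts_penalty(x, "person") for x in local_labels)

# With the global subset {person, horse}, only "dog" is penalized.
harmony_cost = sum(
    harmony_penalty(x, frozenset({"person", "horse"})) for x in local_labels
)

print(potts_cost, harmony_cost)
```

Under these assumptions the Potts-style cost is 2.0 and the harmony cost is 1.0: the subset label lets "person" and "horse" coexist without penalty.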

Potentials: the gory details
The unary potential of the local nodes is:
$$\phi_L(x_i) = -\mu_L K_i \, \omega_L(x_i) \log P(X_i = x_i \mid O_i),$$
where $\mu_L$ is the weighting factor of the local unary potential, $K_i$
normalizes over the number of pixels inside superpixel i, and
$\omega_L(x_i)$ is a learned per-class normalization.
$P(X_i = x_i \mid O_i)$ is the classification score given an observed
representation $O_i$ of the region, which is based on a bag-of-words
built from features of superpixel i and those superpixels adjacent
to it.

More potentials
The global unary potential is defined as:
$$\phi_G(x_g) = -\mu_G \, \omega_G(x_g) \log P(X_g = x_g \mid O_g),$$
where $\mu_G$ is the weighting factor of the global unary potential, and
$\omega_G(x_g)$ is again a per-class normalization like the one used in the
local unary potential.
The main difference comes in the computation of $P(X_g = x_g \mid O_g)$,
which is the posterior:
$$P(X_g = x_g \mid O_g) \propto P(O_g \mid X_g = x_g) \, P(X_g = x_g).$$

Holy crap that’s a lot of labels!
We have turned a barely tractable optimization problem into a
(seemingly) spectacularly intractable one.
To optimize the energy function, we must optimize over $2^{|L|}$
possible global node labels.
If we had an analytic form for $P(\ell = x_g^* \mid O)$ we might be able to do
something.
We don’t. Instead, we will use the probability that a certain label
$\ell \in P(L)$ appears in $x_g^*$, given all the observations O required by
the model.

Ranked subsampling of P(L)
We can do this using the following posterior:
$$P(\ell \subseteq x_g^* \mid O) \propto P(\ell \subseteq x_g^*) \, P(O \mid \ell \subseteq x_g^*).$$
This allows us to effectively rank possible global node labels, and
thus to prioritize candidates in the search for the optimal label $x_g^*$.
$P(\ell \subseteq x_g^* \mid O)$ establishes an order on subsets of the (unknown)
optimal labeling of the global node $x_g^*$ that guides the
consideration of global labels.
We may not be able to exhaustively consider all labels in P(L), but
at least we consider the most likely candidates for $x_g^*$.
And image classification can give us an estimate of this posterior.
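The idea of ranking subsets of P(L) and keeping only the best candidates can be sketched as follows. This is not the estimator used in the talk: it assumes, purely for illustration, that an image classifier supplies an independent presence probability per class, so that the score of a subset is the product of the probabilities of its members. The function names, the class probabilities and the exhaustive enumeration are hypothetical, and the empty subset is skipped because it is trivially contained in every labeling.

```python
import heapq
from itertools import chain, combinations
from math import prod


def non_empty_subsets(labels):
    """All non-empty subsets of the label set, i.e. P(L) without the empty set."""
    return chain.from_iterable(
        combinations(labels, r) for r in range(1, len(labels) + 1)
    )


def rank_subsets(presence_prob, k):
    """Return the k highest-scoring subsets under the assumed independence model."""

    def score(subset):
        # Assumed score: probability that every label in the subset is present.
        return prod(presence_prob[label] for label in subset)

    subsets = (frozenset(s) for s in non_empty_subsets(sorted(presence_prob)))
    return heapq.nlargest(k, subsets, key=score)


# Hypothetical per-class presence probabilities from a global image classifier.
presence_prob = {"person": 0.9, "horse": 0.8, "car": 0.1, "sofa": 0.05}

top3 = rank_subsets(presence_prob, k=3)
print(top3)

# Size of the full search space for 20 categories.
print(2 ** 20)
```

Enumerating every subset is only practical for small label sets; with 20 categories there are already 1,048,576 subsets, which is why only a short ranked list of candidates is passed on to inference.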

Datasets
We have evaluated the harmony potential approach on two
standard, publicly available datasets.
The Pascal VOC 2009 Segmentation Challenge dataset contains
2250 color images of 20 different semantic classes.
This set is split into 750 images for training, 750 images for
testing, and 750 for validation.
The Microsoft MSRC-21 dataset contains 591 color images of 21
object classes.
We do our own splits for cross-validation on MSRC-21.

Unsupervised segmentation
Images are first over-segmented with quick-shift to derive
superpixels [Fulkerson, ICCV 2009].
This preserves object boundaries while simplifying the
representation.
Working at the superpixel level reduces the number of nodes in
the CRF by $10^2$ to $10^5$ per image.

Local classification scores: P(Xi = xi|Oi)
We extract patches with 50% overlap on a regular grid at several
resolutions (12, 24, 36 and 48 pixels in diameter).
Patches are described with SIFT and color features and, for MSRC-21,
location features.
A vocabulary is constructed using k-means to quantize to 1000
SIFT words and 400 color words.
An SVM classifier using an intersection kernel is built for each
semantic category.
A similar number of positive and negative examples are used:
a total of around 8,000 superpixel samples for MSRC-21, and
20,000 for VOC 2009, for each class.
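For readers unfamiliar with the intersection kernel, here is a minimal sketch of how it compares two bag-of-words histograms. The 4-bin histograms are toy data (the vocabularies above have 1000 SIFT and 400 color words), and the L1 normalization step is an assumption, since the slides do not say how histograms are normalized.

```python
def l1_normalize(hist):
    """Scale a histogram so its bins sum to 1 (assumed preprocessing step)."""
    total = sum(hist)
    return [v / total for v in hist] if total else list(hist)


def intersection_kernel(h1, h2):
    """Histogram intersection: the sum of bin-wise minima of two histograms."""
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    return sum(min(a, b) for a, b in zip(h1, h2))


# Toy visual-word counts for two superpixels (hypothetical data).
h_a = [2, 0, 1, 1]
h_b = [1, 1, 2, 0]

raw_similarity = intersection_kernel(h_a, h_b)
normalized_similarity = intersection_kernel(l1_normalize(h_a), l1_normalize(h_b))
print(raw_similarity, normalized_similarity)
```

With normalized histograms the kernel value lies between 0 (no shared visual words) and 1 (identical distributions).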

Global classification scores: P(Xg = xg|Og)
For the Pascal 2009 dataset we use our entry to the 2009 VOC
Classification Challenge
[Khan, PAMI2010 (submitted)].
It uses a bag-of-words representation based on SIFT and color
SIFT, plus spatial pyramids and color attention
[Khan, ICCV 2009].
An SVM classifier with a χ2 kernel is trained for each semantic
category in the dataset.
SVM outputs are re-normalized to generate an estimate of the
global label: P(Xg = xg|Og).

MAP inference
The optimal MAP label configuration x∗ is inferred using
α-expansion graph cuts [Kolmogorov, PAMI2004].
The global node uses the 100 most probable label subsets
obtained from ranked subsampling.
No significant improvements were observed by considering more
than 100 label subsets.
The average time to do MAP inference for an image in MSRC-21
is 0.24 seconds and in VOC 2009 is 0.32 seconds.

Cross-validation of CRF parameters
For MSRC-21 we learn the CRF parameters with a 5-fold
cross-validation of the union of training and validation sets.
If we only use the validation set of 59 images, we overfit to this
small set.
For VOC 2009, we used the available validation set to train CRF
parameters.
Since the background class always appears in combination with
other classes, we do not allow the harmony potential to apply any
penalization to the background class.

Qualitative results

Qualitative results (II)

Quantitative results
Class          BONN   BROOKES   Harmony potential
Background     83.9   79.6      80.5
Aeroplane      64.3   48.3      62.3
Bicycle        21.8    6.7      24.1
Bird           21.7   19.1      28.3
Boat           32.0   10.0      30.5
Bottle         40.2   16.6      32.7
Bus            57.3   32.7      42.2
Car            49.4   38.1      48.1
Cat            38.8   25.3      22.8
Chair           5.2    5.5       9.1
Cow            28.5    9.4      30.1
DiningTable    22.0   25.1       7.9
Dog            19.6   13.3      21.5
Horse          33.6   12.3      41.9
Motorbike      45.5   35.5      49.6
Person         33.6   20.7      31.5
PottedPlant    27.3   13.4      26.1
Sheep          40.4   17.1      37.0
Sofa           18.1   18.4      20.1
Train          33.6   37.5      39.4
TV/Monitor     46.1   36.4      31.1
Average        36.3   24.8      34.1

A modest cluster proposal
4 Dell R610i 1U Rack Servers
Each with: 2x Intel Xeon E5502 Quad Core CPUs
Each with: 24GB RAM
Each with: 4x Broadcom 10Gb Ethernet adapters
Each with: 1x 160GB 7.2K RPM Disk
Two units with: PERC 6/i SAS RAID Controller
One unit with: 5x 300GB 10K RPM Disk

Organizing computations

Some (mostly meaningless) numbers
Days of Pascal challenge: 45
Seconds of computation: 3,888,000.00
Estimated GFLOPS: 307.2
Sustained CPU utilization: 80%
Total GFLOP: 955,514,880.00
Images: 15,000
Pixels (assuming 640 × 480): 4,608,000,000.00
GFLOP/Image: 63,700.99
GFLOP/Pixel: 0.21
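The figures above follow from simple arithmetic, which the short sketch below reproduces. The variable names are illustrative; the inputs (45 days, 307.2 GFLOPS, 80% utilization, 15,000 images of 640 × 480 pixels) are the ones listed on the slide.

```python
days = 45
seconds = days * 24 * 60 * 60      # seconds of computation over the challenge period

peak_gflops = 307.2                # estimated throughput, in GFLOP per second
utilization = 0.80                 # sustained CPU utilization

total_gflop = seconds * peak_gflops * utilization

images = 15_000
pixels = images * 640 * 480        # assuming 640 x 480 images

gflop_per_image = total_gflop / images
gflop_per_pixel = total_gflop / pixels

print(f"seconds:     {seconds:,}")
print(f"total GFLOP: {total_gflop:,.2f}")
print(f"pixels:      {pixels:,}")
print(f"GFLOP/image: {gflop_per_image:,.2f}")
print(f"GFLOP/pixel: {gflop_per_pixel:.2f}")
```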

Conclusions
The harmony potential works well for fusing global information into
local segmentations.
It works by modeling global observations as subsets of the local
label set.
Ranked sub-sampling, driven by the same posterior as used to
define the global potential function, renders the optimization
problem tractable.
The harmony potential gets state-of-the-art results on difficult,
publicly available datasets.
Most useful when multiple semantic classes co-occur frequently.

Prospectus
Semantic image segmentation has come a long way, but still has a
long way to go.
Segmentation will become a mainstream event in Pascal VOC 2010.
We have shown that combining global information with local can
be tractable and improves on the state of the art.
Currently, combining mid-level information is where the game is
being played.
Detection is probably the key.
We can also begin to think about what types of new applications
are enabled by such combinations.

Final words
Semantic image segmentation is hard.
Participating in a competition like the Pascal VOC is very hard.
But, it brings many technologies and people and groups and ideas
together.
Xavier Pep Fahad
