Contextless Object Recognition with Shape-enriched SIFT and Bags of Features

Marcel Tella Amo
Directed by Dr. Matthias Zeppelzauer (TU Wien)
Codirected by Dr. Xavier Giró-i-Nieto (UPC)

Motivation
Object Recognition and Classification
Categories
• Ball
• Airplane
• Chair
• Beaver
• …
Ball, Airplane, Chair
Shape information
Texture information

Index
Requirements
State of the Art
Design
Results

Requirements
Design shape features that can be used in an aggregated framework, such as Bag of Words, with no need for matching or alignment.
Take a successful method: SIFT + shape information.

Analyse the implication of the vocabulary size
with respect to the size of the shape features.

The proposed features should be invariant to at least scale, rotation and translation and, if possible, to flips as well.

Segmentation is needed to encode the shape
Study the limitations of shape coding when using a state-of-the-art segmentation.
Manual annotations vs automatic segmentation

Object Candidates algorithms
Multiscale Combinatorial Grouping (MCG)
Ranking by object plausibility (high to low)
Arbelaez, P., Pont-Tuset, J., Barron, J. T., Marques, F., Malik, J. (2014). Multiscale Combinatorial Grouping. CVPR.

Shape Context
G. Mori, S. Belongie, and J. Malik. Efficient shape matching using shape contexts. PAMI, 27(11), 2005.

Interest point descriptors:
SIFT descriptor
Simplified example
Typically 4x4 divisions * 8 bins/hist = 128 features
dense SIFT
sparse SIFT
David G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2004), no. 2, 91-110.
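The 4x4 divisions * 8 bins layout can be illustrated with a toy descriptor sampled on a regular grid, as in dense SIFT. This is only a simplified sketch, not Lowe's full algorithm: it omits Gaussian weighting, interpolation and rotation normalisation, and the names `toy_sift_descriptor` and `dense_grid`, the 16x16 patch size and the step of 8 pixels are assumptions made for illustration.

```python
import numpy as np

def toy_sift_descriptor(patch, cells=4, bins=8):
    """Simplified SIFT-like descriptor of a square grayscale patch."""
    patch = np.asarray(patch, dtype=float)
    gy, gx = np.gradient(patch)                      # gradients along rows, columns
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx) % (2 * np.pi)   # angles in [0, 2*pi)
    bin_index = np.minimum((orientation / (2 * np.pi) * bins).astype(int), bins - 1)

    cell_size = patch.shape[0] // cells
    descriptor = np.zeros((cells, cells, bins))
    for r in range(cells):
        for c in range(cells):
            rows = slice(r * cell_size, (r + 1) * cell_size)
            cols = slice(c * cell_size, (c + 1) * cell_size)
            # One magnitude-weighted orientation histogram per cell
            descriptor[r, c] = np.bincount(
                bin_index[rows, cols].ravel(),
                weights=magnitude[rows, cols].ravel(),
                minlength=bins,
            )
    descriptor = descriptor.ravel()                  # 4 * 4 * 8 = 128 values
    norm = np.linalg.norm(descriptor)
    return descriptor / norm if norm > 0 else descriptor

def dense_grid(height, width, step=8, patch=16):
    """Top-left corners of patches sampled on a regular grid (dense sampling)."""
    ys = np.arange(0, height - patch + 1, step)
    xs = np.arange(0, width - patch + 1, step)
    return [(y, x) for y in ys for x in xs]

rng = np.random.default_rng(0)
image = rng.random((64, 64))                         # stand-in for a real image
corners = dense_grid(*image.shape)
descriptors = np.array(
    [toy_sift_descriptor(image[y:y + 16, x:x + 16]) for y, x in corners]
)
print(descriptors.shape)  # (49, 128)
```

Sparse SIFT would compute the same kind of descriptor only at detected interest points instead of at every grid position.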

Enrichment of SIFT
Extra features : Absolute spatial location (X,Y) or angle and distance
Rene Grzeszick, Leonard Rothacker, and Gernot A. Fink, "Bag-of-features representations using spatial visual vocabularies for object classification," in IEEE Intl. Conf. on Image Processing, Melbourne, Australia, 2013
Extra features : Relative position + aspect ratio + scale ratio + Color Space
Carreira, J., Caseiro, R., Batista, J., & Sminchisescu, C. (2012). Semantic segmentation with second-order pooling. In Computer Vision-ECCV 2012 (pp. 430-443). Springer Berlin Heidelberg.
128-dimensional SIFT descriptor + extra features
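Enrichment amounts to appending extra values to each 128-dimensional descriptor. A minimal sketch, assuming the extra features are the absolute (X, Y) keypoint location normalised by the image size; the function name `enrich` and the `weight` parameter are hypothetical and not taken from the cited papers.

```python
import numpy as np

def enrich(descriptors, keypoints_xy, image_size, weight=1.0):
    """Append normalised (x, y) locations to each descriptor.

    image_size is (width, height); weight balances the extra features
    against the 128 SIFT values (an assumed design choice).
    """
    descriptors = np.asarray(descriptors, dtype=float)
    xy = np.asarray(keypoints_xy, dtype=float) / np.asarray(image_size, dtype=float)
    return np.hstack([descriptors, weight * xy])

rng = np.random.default_rng(0)
sift = rng.random((5, 128))                       # stand-in SIFT descriptors
keypoints = np.array([[10, 20], [30, 40], [50, 60], [70, 80], [90, 100]])
esift = enrich(sift, keypoints, image_size=(200, 100))
print(esift.shape)  # (5, 130)
```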

Bag of Words

Bags of Words - Pipeline
Get Descriptors -> Clustering (K-means) -> Create histograms -> Train Model (SVM)
Image -> Create histogram -> Evaluate (SVM)
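The clustering and histogram stages of the pipeline can be sketched with a small NumPy k-means. This is an illustrative sketch under assumptions: the descriptors are random stand-ins, the helper names `kmeans`, `assign` and `bow_histogram` are hypothetical, and the SVM stage is left out (the resulting histograms would be the input of the classifier).

```python
import numpy as np

def assign(data, centers):
    """Index of the nearest centre (visual word) for every descriptor."""
    distances = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return distances.argmin(axis=1)

def kmeans(data, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns the k centres (the visual vocabulary)."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iterations):
        labels = assign(data, centers)
        for j in range(k):
            members = data[labels == j]
            if len(members):              # keep the old centre if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers

def bow_histogram(descriptors, vocabulary):
    """Normalised histogram of visual-word occurrences for one image."""
    words = assign(descriptors, vocabulary)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(0)
train_descriptors = rng.random((200, 128))        # descriptors from training images
vocabulary = kmeans(train_descriptors, k=10)      # vocabulary size = 10
image_descriptors = rng.random((50, 128))         # descriptors from one image
histogram = bow_histogram(image_descriptors, vocabulary)
print(histogram.shape)  # (10,)
```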

Why dense SIFT?

Main principle: Combination of dense SIFT and Object Candidates

Distance to the nearest border (DNB)
Logarithmic distance to the nearest border (LDNB)
Less influence of large distances
Carreira, J., Caseiro, R., Batista, J., & Sminchisescu, C. (2012). Semantic segmentation with second-order pooling. In Computer Vision-ECCV 2012 (pp. 430-443). Springer Berlin Heidelberg.
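Both features can be sketched on a binary region mask. The slides do not give the exact formulas, so this assumes a Euclidean distance to the closest border pixel of the region and `log(1 + d)` as the logarithmic variant; the function names are hypothetical.

```python
import numpy as np

def border_pixels(mask):
    """Region pixels with at least one 4-neighbour outside the region."""
    mask = np.asarray(mask, dtype=bool)
    padded = np.pad(mask, 1, constant_values=False)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])
    return np.argwhere(mask & ~interior)

def dnb(mask, keypoints):
    """Distance from each (row, col) keypoint to the nearest border pixel."""
    border = border_pixels(mask)
    keypoints = np.asarray(keypoints, dtype=float)
    diffs = keypoints[:, None, :] - border[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)

def ldnb(mask, keypoints):
    """Logarithmic variant (assumed form): large distances are compressed."""
    return np.log1p(dnb(mask, keypoints))

mask = np.zeros((11, 11), dtype=bool)
mask[2:9, 2:9] = True                      # square region, rows and columns 2..8
keypoints = [(5, 5), (3, 5), (2, 2)]
print(dnb(mask, keypoints))                # [3. 1. 0.]
print(ldnb(mask, keypoints))
```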

Distance and Angle to the nearest border (DANB)
Problem: Points that are very similar in 2D can get very different values.
Solution: Encode them as two separate features.

Rotation Invariant Angle to the nearest border

Distance to the center (DC)

η - Angular Scan (ηAS)
WINNER!

Shape Context from a dense SIFT (DSC)
Note: Like Shape Context, it crosses the contour of the region; ηAS does not.

Rotation Invariant Region Quantization (RIRQ)
Main idea: Get spatial information.
Easily extensible to a pyramid!
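Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Vol. 2, pp. 2169-2178). IEEE.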

Achieving flip invariance (RIRQ)
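Figure: the four regions (1 to 4) of a shape and of its flipped version. Reading the regions gives 4 2 for one and 2 4 for the other; after sorting, both give 2 4.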
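The sorting step can be illustrated in a couple of lines. This sketch assumes that a flip only changes the order in which a pair of region labels is read, so sorting the pair gives the same code for a shape and its mirrored version; the function name `flip_invariant_code` is hypothetical.

```python
def flip_invariant_code(region_pair):
    """Sort a pair of region labels so that reading order no longer matters."""
    return tuple(sorted(region_pair))

original = (4, 2)   # labels read on the original shape
flipped = (2, 4)    # same labels read on the flipped shape
print(flip_invariant_code(original) == flip_invariant_code(flipped))  # True
```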

Where do we integrate our features?
Two main Architectures
Enriched SIFT (eSIFT): SIFT + Shape features -> Visual Vocabulary -> Bag of eSIFT visual words
BoW+Shape: SIFT -> Visual Vocabulary -> Bag of Words + Shape histogram

BoW+Shape Creation of the shape histograms
SIFT -> Visual Vocabulary -> Bag of Words + Shape histogram (accumulation of features)
1. Accumulate the same feature for all points.
2. Create a histogram of X bins for that feature.
3. Concatenate histograms to create the final one.
Example: 8-Angular Scan gives 8 distances (different angles) for each SIFT keypoint.
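The three steps can be sketched for the 8-Angular Scan example. The slides do not define the scan precisely, so this assumes that each keypoint casts rays in equally spaced directions and records how far each ray travels before leaving the region mask; `angular_scan`, `shape_histogram`, the unit step and the histogram range are hypothetical choices.

```python
import numpy as np

def angular_scan(mask, keypoint, n_angles=8, step=1.0):
    """Distance from a (row, col) keypoint to the region border along n_angles rays."""
    mask = np.asarray(mask, dtype=bool)
    distances = np.zeros(n_angles)
    angles = np.linspace(0, 2 * np.pi, n_angles, endpoint=False)
    for i, theta in enumerate(angles):
        direction = np.array([np.sin(theta), np.cos(theta)])   # (row, col) step
        d = 0.0
        while True:
            r, c = np.rint(np.asarray(keypoint) + (d + step) * direction).astype(int)
            inside = 0 <= r < mask.shape[0] and 0 <= c < mask.shape[1] and mask[r, c]
            if not inside:             # the ray stops at the border, it never crosses it
                break
            d += step
        distances[i] = d
    return distances

def shape_histogram(features, n_bins, value_range):
    """One histogram per feature column (steps 1-2), concatenated (step 3)."""
    features = np.asarray(features, dtype=float)
    parts = []
    for column in features.T:          # same feature accumulated over all keypoints
        hist, _ = np.histogram(column, bins=n_bins, range=value_range)
        parts.append(hist / max(hist.sum(), 1))
    return np.concatenate(parts)

mask = np.zeros((21, 21), dtype=bool)
mask[5:16, 5:16] = True                # square region, rows and columns 5..15
keypoints = [(10, 10), (7, 12), (13, 8)]
features = np.array([angular_scan(mask, kp) for kp in keypoints])
histogram = shape_histogram(features, n_bins=8, value_range=(0, 21))
print(features.shape, histogram.shape)  # (3, 8) (64,)
```

In the BoW+Shape architecture, a vector like `histogram` would be concatenated with the Bag of Words histogram of the image.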

The dataset: Caltech-101
• Well-recognized dataset
• 101 different categories of images
• Ground truth annotations available
• From 40 to 800 images per category

Metrics: Accuracy (%)
Accuracy (%) = 100 × Correct Classifications / (Correct + Incorrect Classifications)
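The metric is a plain ratio of correct to total classifications. A small sketch with made-up labels; the function name `accuracy_percent` is hypothetical.

```python
def accuracy_percent(predicted, actual):
    """Percentage of predictions that match the ground-truth labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

predicted = ["ball", "chair", "airplane", "beaver"]
actual = ["ball", "chair", "chair", "beaver"]
print(accuracy_percent(predicted, actual))  # 75.0
```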

Experiments setup
• 30 images per category in train and 30-50 in test.
• 101 categories + background category.
• Different vocabulary sizes on the X axis.
• Accuracy (%) on the Y axis.
• Experiments and analysis:
• eSIFT
• BoW+S
• eSIFT vs BoW+S
• Performance achieved
• Comparison between adding features before or after quantization
• Number of bins per histogram
• Ground truth vs MCG Object Candidates
• Context vs Shape

Results enriched SIFT

Results BoW+S

Performance achieved
Conclusion: With Angular Scan, performance increases from 16% to around 41%.

Comparison between adding features before and after quantization
Conclusion: In Angular Scan, if the number of shape features is high, both architectures tend to converge.

Number of bins per histogram
Conclusion: In Angular Scan, 8 bins per histogram gives the best performance.

Ground truth vs MCG Object Candidates
Conclusion 1: Larger vocabulary sizes make the approach more robust to segmentation errors.
Conclusion 2: Shape-based methods are more sensitive to segmentation errors than texture-based ones.

Context gain vs Shape gain
Conclusion: Encoding the shape of the object gives better performance than encoding the context of the image.

Future Work
Comparison between our work and Second Order Pooling
PhD thesis of Carles Ventura
Carreira, J., Caseiro, R., Batista, J., & Sminchisescu, C. (2012). Semantic segmentation with second-order pooling. In Computer Vision-ECCV 2012 (pp. 430-443). Springer Berlin Heidelberg.

Future Work
Distance to the nearest border (DNB)

Conclusions
1. With Angular Scan, performance increases from 16% to around 41%.
2. In Angular Scan, if the number of shape features is high, both architectures tend to converge.
3. In Angular Scan, 8 bins per histogram gives the best performance.
4. Larger vocabulary sizes make the approach more robust to segmentation errors.
5. Shape-based methods are more sensitive to segmentation errors than texture-based ones.
6. Encoding the shape gives better performance than encoding the context of the image.
Thank you!
Questions?
