A location-aware embedding technique for accurate landmark recognition

Federico Magliani, Navid Mahmoudian Bidgoli, Andrea Prati
ICDSC 2017 – Stanford, USA – 5-7 September 2017

Agenda
➢ Motivations
➢ Summary of contribution
➢ Related works
➢ Introduction to VLAD
➢ Proposed approach (locVLAD)
➢ Experimental results
➢ Conclusions and Future Works

Motivations
Landmark Recognition problem
➢ try to understand what is in front of you
➢ using client-server communication
➢ aided by geolocalization (GPS)

Motivations
➢ Challenges
○ high retrieval accuracy (precision)
○ fast search (quick response to the query)
○ small memory footprint (mobile friendly)
○ scalability to large databases (>100k images)
➢ Possible applications
○ augmented reality (tourism)
➢ Why mobile based?
○ everyone owns a mobile phone
○ modern mobile phones have hardware powerful enough to run such applications

Motivations
“Changes in the image resolution, illumination conditions, viewpoint and the presence of distractors such as trees or traffic signs (just to mention some) make the task of matching features between a query image and the database rather difficult.”
➢ To mitigate these problems, existing approaches rely on feature descriptors with a certain degree of invariance to scale, orientation and illumination changes.

Summary of contribution
➢ A location-aware version of VLAD, called locVLAD, that outperforms the state of the art on the intra-dataset problem. It addresses a weakness of VLAD by reducing the noise introduced by features near the image borders
➢ The time needed to create the vocabulary is significantly reduced by using only a random ⅕ of the detected features
➢ A new balanced version of the public ZuBuD dataset, called ZuBuD+, is proposed and made available to the scientific community
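The vocabulary speed-up can be illustrated with a minimal sketch, assuming plain Lloyd k-means over a random 20% subsample of the pooled local descriptors. The slides do not say which k-means implementation the authors used; `build_vocabulary` and its parameters are hypothetical names chosen for illustration.

```python
import numpy as np


def build_vocabulary(descriptors, k, sample_fraction=0.2, iters=20, seed=0):
    """Hypothetical helper: k-means codebook trained on a random subset.

    descriptors: (N, d) array of local descriptors (e.g. 128-d SIFT).
    sample_fraction: share of descriptors used; 0.2 mirrors the random 1/5.
    """
    rng = np.random.default_rng(seed)
    n = descriptors.shape[0]
    m = min(n, max(k, int(round(n * sample_fraction))))
    if m < k:
        raise ValueError("need at least k descriptors")
    # Random subsample without replacement: clustering cost scales with m, not n.
    sample = descriptors[rng.choice(n, size=m, replace=False)].astype(np.float64)
    # Initialize the centroids with k distinct sampled descriptors.
    centroids = sample[rng.choice(m, size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance of every sample to every centroid, shape (m, k).
        d2 = ((sample[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for i in range(k):
            members = sample[labels == i]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[i] = members.mean(axis=0)
    return centroids
```

Since each k-means iteration costs O(m·k·d), clustering one fifth of the descriptors cuts the per-iteration time by roughly 5×. The dense (m, k, d) distance tensor keeps the sketch short but is memory-hungry for real SIFT sets; a production version would compute distances in chunks.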

Related work
➢ Bag of Words (BoW): the first approach to the problem (different techniques: vocabulary tree, …)
➢ Fisher vector: embedding based on the Fisher kernel
➢ VLAD and its variants: a simplified version of the Fisher vector
➢ Hamming embedding: embedding based on binarized descriptors
➢ CNN-based: deep neural networks ending with classification layers

VLAD (Vector of Locally Aggregated Descriptors)
$C = \{c_1, \dots, c_k\}$: codebook of $k$ visual words (k-means clustering)
1. Every local descriptor $x_j$ extracted from the image is assigned to the closest cluster center of the codebook: $c_i = \mathrm{NN}(x_j)$
2. $v_i = \sum_{x_j : \mathrm{NN}(x_j) = c_i} (x_j - c_i)$ (residuals)
3. The VLAD vector is the concatenation of the $d$-dimensional vectors $v_i$ ($i = 1, \dots, k$)
4. VLAD normalization to counter the burstiness problem
Example: 16 centroids, features described with 128-d SIFT → $D = 128 \times 16 = 2048$
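Steps 1 to 3 above can be sketched in a few lines of NumPy. This is a minimal, unnormalized encoder assuming hard nearest-neighbour assignment with Euclidean distance; `vlad` is a hypothetical name, and step 4 (normalization) is left to a separate function.

```python
import numpy as np


def vlad(descriptors, centroids):
    """Unnormalized VLAD: per-centroid sum of residuals, concatenated."""
    k, d = centroids.shape
    # Step 1: assign each descriptor x_j to its nearest centroid c_i = NN(x_j).
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nn = d2.argmin(axis=1)
    v = np.zeros((k, d))
    # Step 2: v_i = sum of the residuals x_j - c_i over the descriptors assigned to c_i.
    for i in range(k):
        assigned = descriptors[nn == i]
        if len(assigned):
            v[i] = (assigned - centroids[i]).sum(axis=0)
    # Step 3: concatenate the k d-dimensional blocks into one k*d vector.
    return v.reshape(-1)
```

With 16 centroids and 128-d SIFT descriptors the returned vector has 16 × 128 = 2048 entries, matching the example above.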

VLAD normalization
➢ Signed Square Rooting normalization: $\mathrm{sign}(x_i)\sqrt{|x_i|}$ followed by L2 norm
➢ Residual normalization: independent L2 norm of each residual followed by L2 norm
➢ Z-Score normalization: residual normalization followed by subtraction of the mean from every vector and division by the standard deviation
➢ Power normalization: $\mathrm{sign}(x_i)|x_i|^{\alpha}$ (usually $\alpha = 0.2$) followed by L2 norm
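Three of these schemes can be sketched as follows. Residual normalization is read here as L2-normalizing each individual residual $x_j - c_i$ before it is summed into $v_i$, which is an interpretation of "independent residual L2 norm". Z-score is omitted because the slide does not pin down over which vectors the mean and standard deviation are computed. All function names are hypothetical.

```python
import numpy as np


def l2(v, eps=1e-12):
    """L2-normalize a vector (returned unchanged if its norm is ~0)."""
    n = np.linalg.norm(v)
    return v / n if n > eps else v


def ssr(v):
    """Signed Square Rooting: sign(x_i) * sqrt(|x_i|), then L2."""
    return l2(np.sign(v) * np.sqrt(np.abs(v)))


def power_norm(v, alpha=0.2):
    """Power normalization: sign(x_i) * |x_i|**alpha, then L2 (alpha=0.5 equals SSR)."""
    return l2(np.sign(v) * np.abs(v) ** alpha)


def residual_normalized_vlad(descriptors, centroids):
    """VLAD where each residual x_j - c_i is L2-normalized before summing, then L2."""
    k, d = centroids.shape
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nn = d2.argmin(axis=1)
    v = np.zeros((k, d))
    for x, i in zip(descriptors, nn):
        # Each descriptor contributes a unit-length residual, limiting burstiness.
        v[i] += l2(x - centroids[i])
    return l2(v.reshape(-1))
```

Both SSR and power normalization shrink large components relative to small ones, which is what counters burstiness: a few repeated local patterns can no longer dominate the final vector.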

Proposed approach: locVLAD
➢ This method improves the performance of VLAD vectors in the recognition problem.
➢ It does so by reducing the influence of features found at the borders of the image.
How does it work?
It consists of a new global descriptor: the mean of the VLAD descriptor of the original query image ($\dot{v}$) and the VLAD descriptor computed on a cropped query image ($\dot{v}_{cropped}$), i.e. $v_{locVLAD} = (\dot{v} + \dot{v}_{cropped}) / 2$.

The size of the cropped image is a parameter that depends on the dataset:
➢ ZuBuD → 90% of the original query image
➢ Holidays → 70% of the original query image
[Figure panels: "16424 features detected", "367 features detected"]
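A minimal sketch of the locVLAD descriptor follows. It rests on several assumptions not stated on the slides: the crop is centered, the percentage scales each side of the image, and "VLAD on the cropped image" is approximated by keeping only the keypoints that fall inside the crop window (the authors may instead re-detect features on the cropped image). The two raw VLAD vectors are averaged without renormalization; whether normalization happens before or after the mean is not specified. `central_crop_mask` and `loc_vlad` are hypothetical names.

```python
import numpy as np


def vlad(descriptors, centroids):
    """Unnormalized VLAD (hard assignment, sum of residuals, concatenation)."""
    k, d = centroids.shape
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nn = d2.argmin(axis=1)
    v = np.zeros((k, d))
    for i in range(k):
        assigned = descriptors[nn == i]
        if len(assigned):
            v[i] = (assigned - centroids[i]).sum(axis=0)
    return v.reshape(-1)


def central_crop_mask(keypoints, width, height, ratio):
    """True for (x, y) keypoints inside a centered crop keeping `ratio` of each side."""
    x0 = width * (1.0 - ratio) / 2.0
    y0 = height * (1.0 - ratio) / 2.0
    x, y = keypoints[:, 0], keypoints[:, 1]
    return (x >= x0) & (x <= width - x0) & (y >= y0) & (y <= height - y0)


def loc_vlad(descriptors, keypoints, centroids, width, height, ratio=0.9):
    """Mean of the full-image VLAD and the VLAD of the (approximated) cropped image."""
    v_full = vlad(descriptors, centroids)
    mask = central_crop_mask(keypoints, width, height, ratio)
    v_crop = vlad(descriptors[mask], centroids)
    return (v_full + v_crop) / 2.0
```

Averaging, rather than using only the cropped VLAD, means border features are down-weighted by half instead of discarded, which matches the motivation that they are not guaranteed to be pure noise. Here `ratio=0.9` corresponds to the ZuBuD setting and `ratio=0.7` to Holidays.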

Why does it improve performance?
Because the features that matter for recognition are usually located near the center of the image, while the features close to the borders tend to be noisy.
Why not apply VLAD encoding only to the cropped image?
Because useful information might be lost: there is no guarantee that the features near the borders are only noise.
Why not create a cropped vocabulary?
Experiments were conducted, but the results were poor.

Datasets
➢ INRIA Holidays (1491 images at 2448x3264: 500 classes, 500 queries)
➢ ZuBuD (1005 images at 640x480: 201 classes, 115 queries at 320x240)
➢ ZuBuD+ (1005 images at 640x480: 201 classes, 1005 queries at 320x240)

ZuBuD+
ZuBuD+ is the balanced version of ZuBuD:
➢ 1005 queries at 320x240 instead of 115.
➢ The new query images are randomly chosen database images (different from the other query images), transformed with either:
○ rotation (±90°) and resize
○ resize only
Download: http://implab.ce.unipr.it/?page_id=194

Evaluation Metrics
Different evaluation metrics are used to compare with the state-of-the-art approaches:
➢ Top1 → retrieval accuracy considering only the first position of the ranking
➢ 5 x Recall in Top5 → average number of correct images among the top 5 results of the ranking (maximum 5)
➢ mAP (mean Average Precision) → mean over all queries of the Average Precision, which accounts for the positions of the correct results in the ranking
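The three metrics can be computed from per-query rankings with a short sketch like the one below. It assumes each query comes with its ranked list of database image IDs (with the query itself already excluded) and the set of IDs that depict the same landmark or scene; the function names are hypothetical and not taken from any evaluation toolkit.

```python
def top1(ranking, relevant):
    """1.0 if the first retrieved image is correct, else 0.0."""
    return 1.0 if ranking and ranking[0] in relevant else 0.0


def hits_in_top5(ranking, relevant):
    """Number of correct images among the first 5 results (0..5)."""
    return sum(1 for item in ranking[:5] if item in relevant)


def average_precision(ranking, relevant):
    """Average of precision@r over the ranks r where a correct image appears."""
    hits, precisions = 0, []
    for r, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / r)
    # Dividing by |relevant| penalizes correct images that were never retrieved.
    return sum(precisions) / len(relevant) if relevant else 0.0


def evaluate(results):
    """results: list of (ranking, relevant_set) pairs, one per query."""
    n = len(results)
    return {
        "top1_percent": 100.0 * sum(top1(r, rel) for r, rel in results) / n,
        "5x_recall_top5": sum(hits_in_top5(r, rel) for r, rel in results) / n,
        "mAP_percent": 100.0 * sum(average_precision(r, rel) for r, rel in results) / n,
    }


# Two toy queries: the first ranks a correct image first, the second ranks it second.
scores = evaluate([(["a", "x", "b"], {"a", "b"}), (["y", "c", "z"], {"c"})])
```

In the toy example the first query has AP = (1/1 + 2/3)/2 = 5/6 and the second AP = 1/2, giving Top1 = 50%, 5 x Recall in Top5 = 1.5 and mAP ≈ 66.7%.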

Results on ZuBuD (and ZuBuD+)
| Method | Descriptor size | Top1 | 5 x Recall in Top5 |
|---|---|---|---|
| Tree histogram (ZuBuD) [7] | 10M | 98.00 % | - |
| Decision tree (ZuBuD) [9] | n/a | 91.00 % | - |
| Sparse coding (ZuBuD) [22] | 8k*64+1k*36 | - | 4.538 |
| VLAD (ZuBuD) [12] | 4281*128 | 99.00 % | 4.416 |
| VLAD (ZuBuD+) [12] | 4281*128 | 99.00 % | 4.526 |
| locVLAD (ZuBuD) | 4281*128 | 100.00 % | 4.469 |
| locVLAD (ZuBuD+) | 4281*128 | 100.00 % | 4.543 |
It is worth noting that on ZuBuD the sparse-coding method slightly outperforms the proposed one in 5 x Recall in Top5. This is due to the unbalanced query set and, probably, to its use of color information.

Results on Holidays
| Method | Descriptor size | mAP |
|---|---|---|
| Sparse coding [22] | 8k*64+1k*36 | 76.51 % |
| VLAD [12] | 4281*128 | 74.43 % |
| locVLAD | 4281*128 | 77.20 % |
| Sparse coding [4] | 20k*128 | 79.00 % |
| VLAD [12] | 20k*128 | 78.78 % |
| locVLAD | 20k*128 | 80.89 % |

Conclusions
➢ The proposed locVLAD technique incorporates, to a certain degree, information about the location of the features, mitigating the negative effects of distractors found at the image borders.
➢ Experiments on two public datasets, ZuBuD and Holidays, demonstrate higher recognition accuracy than the state of the art.

Future works
➢ Compression: reduce the size of the descriptors while keeping the same retrieval accuracy (mobile friendly).
➢ Indexing: build a system for large-scale evaluation (adding up to 1M distractors), moving from the Nearest Neighbor problem to the Approximate Nearest Neighbor problem. We are working with kd-trees and permutation-based methods.
➢ Sparse coding: new methods for creating the vocabulary and for assigning the features to the VLAD vector.

Thank you for your attention!
Questions?
http://implab.ce.unipr.it
