Abstract
The phenomenal expansion of the web, the increasing number of recording devices and the emergence of social media all lead to the production of a constantly increasing number of digital images. These images often end up stored without any additional information or metadata attached. Enhancing the ability of machines to analyze the content of a picture and discover semantic knowledge is therefore mandatory if we want to improve the use that is made of these data.
This thesis studies the field known as Content-Based Image Retrieval, which deals with the image retrieval problem when no metadata is involved. Recently a new family of methods, deep learning algorithms, has emerged and has proven to greatly outperform traditional methods for all kinds of tasks. Therefore a strong emphasis is placed on these techniques throughout the study, and several experiments are presented that assess the dominance of deep learning algorithms.
Contents

Part I: State Of The Art

1 Content-Based Image Retrieval Overview
  1.1 Introduction
  1.2 Real World Applications
  1.3 Components of CBIR systems
    1.3.1 Database Management
    1.3.2 Query Specification
    1.3.3 Relevance Feedback

2 Segmentation
  2.1 Clustering methods
  2.2 Histogram-based methods
  2.3 Region-growing methods
  2.4 Graph partitioning methods
    2.4.1 Cut Criterion
    2.4.2 Graphical Probabilistic Model

3 Feature Extraction
  3.1 Global and Local Features
  3.2 Handcrafted Features
    3.2.1 Color Features
    3.2.2 Shape Features
    3.2.3 Texture Features
  3.3 Shallow methods
    3.3.1 Visual Bag-of-Word
    3.3.2 Improved Fisher Kernel

4 Convolutional Neural Network
  4.1 Design
    4.1.1 Convolutional Layer
    4.1.2 Pooling Layer
    4.1.3 Local Response Normalization Layer
    4.1.4 Fully Connected Layer
    4.1.5 Softmax and Loss Layer
  4.2 Training
    4.2.1 Backpropagation and Gradient Descent
    4.2.2 L1 and L2 regularization
    4.2.3 Batch normalization
    4.2.4 Dropout
  4.3 Research Trends
    4.3.1 Convolutional Neural Network Architecture
    4.3.2 Enhanced Convolutional Neural Network
    4.3.3 Transfer Learning

5 Feature Interpretation
  5.1 Similarity Measure
    5.1.1 Fixed Measure
    5.1.2 Similarity Learning
  5.2 Ontology
  5.3 Machine Learning

6 Evaluation of CBIR Systems
  6.1 Datasets
    6.1.1 Pascal Voc
    6.1.2 Caltech
    6.1.3 ImageNet
  6.2 Tasks Evaluated
    6.2.1 Image Classification
    6.2.2 Object Detection
    6.2.3 Image Similarity
    6.2.4 Multi-class Image Segmentation and Labeling
  6.3 Evaluation Metrics
    6.3.1 Precision and Recall
    6.3.2 Top-k Error
    6.3.3 Confusion Matrix

Part II: Contribution

7 Experiments
  7.1 Image Similarity
    7.1.1 Introduction
    7.1.2 Dataset
    7.1.3 Methods
    7.1.4 Evaluation and further Improvements
  7.2 Video Classification and Tagging
    7.2.1 Introduction
    7.2.2 Dataset
    7.2.3 Methods
    7.2.4 Evaluation and further improvements

Appendices
A Video Corpus Description
Part I
State Of The Art
Chapter 1
Content-Based Image Retrieval Overview
1.1 Introduction
It is often said that an image is worth a thousand words. This idiom refers to
the notion that a complex idea can be conveyed with just a single still image.
Indeed, images are a powerful means of communication and we interact with them
in our everyday life. We use them to freeze a moment and share it with our
relatives and friends, to illustrate a written article or to capture an emotion in
a painting. Nowadays an increasing number of devices, cameras, satellites and
video sensors allow us to automatically record digital images, which are then
stored in constantly growing databases. The phenomenal expansion of the
web, which enables us to store those images online, also contributes to this
accumulation of images in huge databases. To take advantage of this increasing
amount of data we need to develop tools that efficiently process images
and retrieve them in a meaningful fashion. This is the problem addressed by
Content-Based Image Retrieval. Formally it can be defined as any technology
that helps to organize digital pictures based on their content [19]. The key idea
is that no metadata is involved and only the raw pixels of the images are used
in order to infer semantic knowledge. It is a growing field which spans numerous
research areas such as Information Retrieval, Computer Vision and Machine Learning.
Humans are naturally gifted at deducing information from just a few seconds spent
looking at an image, and it is interesting to understand how we perform this
task in order to be able to reproduce it in a machine. In doing so we usually
rely on at least three different factors:
Our sensory abilities Our sensory abilities refer to our ability to acquire
visual input through our eyes and the processing of this input by our visual
cortex.
The context The context of an image includes everything that is not in the
image but contributes to its interpretation. It is well known that an image
can have different meanings depending on its context.
Our experiences Different experiences make us respond differently when
confronted with a situation. This also holds for the interpretation of an image,
which will naturally differ between individuals.
In order to successfully retrieve images for an end user, a machine should
theoretically be able to incorporate these three factors into its process. As one
can imagine, such an objective is especially challenging for many reasons. First,
we are far from completely understanding how our brain processes the visual
input it receives. Then, the context of an image is not always available,
especially when dealing only with raw pixels without the help of additional
metadata. Finally, individuals with different cultures can have very different
interpretations of the same image, and retrieving the right images for a user
would imply having some knowledge of their background or intentions.
1.2 Real World Applications
Even if the process of making good interpretations based on the content of an
image is far from being mastered, numerous real world applications already make
use of Content-Based Image Retrieval technologies. Among these applications we
can differentiate between those that operate on images from a very broad domain
and those that operate on a narrow domain. In their review, Smeulders et al. [65]
define a broad domain as having "an unlimited and unpredicted variability even
for the same semantic meaning". Web applications and general visual search
engines often operate on a broad domain since images can come from many different
users, and similar objects can have different sizes, shapes or illumination and
can even be partially occluded.
The first visual search engine was developed by IBM with their commercial
system called QBIC (Query by Image Content) [55]. QBIC enables a user to
query an image database with image queries. Many possibilities are given to the
user to specify the query: query by sketch, query by specifying a particular
color, texture and shape, or query using an image similar to the ones being
looked for. Depending on the type of query, different similarity measures are
then used to retrieve the most similar images. To further help the search,
semi-automatic detection of objects or points of interest is also performed when
a new image is inserted into the database. Since this first attempt other systems
(TinEye, Pinterest, Google Query By Image) have been developed which rely on
more advanced techniques. For instance Pinterest is a web and mobile application
where users can upload images of interest into collections known as pinboards.
Pinterest then uses those images to look for similar images and recommend them
to the user. While QBIC relied only on handcrafted features, Pinterest relies,
among other algorithms, on feature learning and algorithms called Convolutional
Neural Networks to extract features from images [44].
As opposed to a broad domain, a narrow domain is defined as having a
limited and predictable variability in all relevant aspects of its appearance.
When dealing with a narrow domain, knowledge of the domain can be efficiently
used to design good feature detectors. Illustrations of narrow domains can be
found in many applications such as medical diagnosis or face recognition. In the
medical domain, due in part to the steadily growing rate of image production,
Content-Based Image Retrieval can play a key role in many applications [54].
It can for example enable the automatic annotation and classification of
medical images. Content-Based Image Retrieval can also be used to help in the
clinical decision-making process. Indeed, when confronted with a medical case,
a useful scenario would be to supply the doctor with similar cases by finding
other images of the same disease, allowing a decision to be taken with more
information at hand. Another application would be, for instance, within the
radiology department, to develop tools that enable automatic diagnosis of
diseases [45].
Content-Based Image Retrieval techniques might also find useful applications
in many more domains. For instance in digital art galleries or online museum
collections they can be used to bring the user artworks similar to the ones
they browsed. For law enforcement and criminal investigation they could help
investigators process pictures from security cameras.
1.3 Components of CBIR systems
All typical Content-Based Image Retrieval systems perform common steps to
achieve the retrieval of images, as illustrated in Figure 1.1.
The first step that might be performed is segmentation, which can be useful
to isolate meaningful regions from the background, for example. As a second
step, feature extraction is performed. The extraction of features can rely on
two different methods. Either it is done through handcrafted features, that is,
humans have to design a specific feature detector for the problem at hand, or it
is done through feature learning, that is, we let the machine learn features from
training samples. In the domain of image processing one model has started
to show great promise for learning features and is known as the Convolutional
Neural Network. One of the great challenges of feature extraction is to reduce
what is known as the semantic gap, that is, the difference between the
interpretation of the image and the features that have been computed. Finally,
one has to interpret these features, which is often done through a similarity
measure or a machine learning classifier. The remainder of this work details
each of these steps.
Figure 1.1: Content-Based Image Retrieval System Components
While every Content-Based Image Retrieval system usually performs the
operations described above, its performance is not limited to them. Below is a
review of the main components that typically compose a CBIR system;
improvement in each of these components is required to achieve better systems [31].
1.3.1 Database Management
When an image database grows very large, efficient access methods should
be developed in order to speed up the retrieval of images. These methods should
act as a filter, with only relevant images being retrieved from the storage system
and tested for similarity with the query image. Most data structures used today
for indexing are spatial access methods: they assume that all image features
are numeric and represent an image as a point in a multidimensional space.
These methods can be classified into three categories: space partitioning, data
partitioning and distance-based techniques [65]. Space partitioning techniques
(k-d tree, k-d B-tree...) organize the feature space as a tree with nodes
standing for regions of the space. When a region contains too many points it
is split into subregions whose nodes become the children of the original region.
Data partitioning techniques (R-tree, R+-tree, R*-tree...) associate with each
point a region representing the neighborhood of that point, with leaf nodes
corresponding to minimum bounding regions of a set of vectors. Distance-based
index structures (VP-tree, MVP-tree, M-tree...) are applicable to general metric
spaces; their primary idea is to pick an example point and divide the rest of
the feature space into concentric rings around that example. Nevertheless, when
the feature vectors being indexed contain a considerable number of dimensions,
the performance of the previous indices greatly decreases. This is known as the
curse of dimensionality.
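As an illustration of the space-partitioning family, below is a small sketch (not taken from the thesis) that indexes feature vectors with SciPy's k-d tree and retrieves the nearest neighbors of a query. The 128-dimensional random vectors are hypothetical stand-ins for real image descriptors and, as noted above, such trees degrade in very high dimensions.

```python
import numpy as np
from scipy.spatial import cKDTree

# Index a database of image feature vectors (random stand-ins here).
rng = np.random.default_rng(0)
database_features = rng.normal(size=(10_000, 128))
tree = cKDTree(database_features)

# Retrieve the 5 images whose features are closest to the query vector.
query = rng.normal(size=128)
distances, indices = tree.query(query, k=5)
print(indices, distances)
```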
1.3.2 Query Specification
Queries made by the user can take several forms, which logically influence the
retrieval process. Query by example, where the user provides an image and the
similarity is computed based on the entire scene, is a common way to perform
the request. Another method is to let the user specify a region of interest (a
person or an object for example) and search for images with similar regions.
Yet another way is for the user to draw a sketch of the item or landscape they
are interested in. How the query is specified depends on the users of the system.
Casual users will likely prefer to make their request using natural language
("Find me a mountain with snow and with a bird in the sky") or with an image as
an example, whereas domain experts might want more sophisticated methods such
as query by sketch.
1.3.3 Relevance Feedback
As pointed out earlier, bringing relevant images to the end user implies knowing
their point of view. Learning the point of view of the user is increasingly
achieved through relevance feedback. From a broad perspective a relevance
feedback scenario usually follows this scheme: first, initial results are returned
by the machine; then the user provides a judgment on the results being displayed;
finally the machine learns from these judgments to retrieve new results, and so
on. The exact algorithm used for learning depends on whether the user is looking
for a particular target item or for a class of similar items. It will also differ
depending on whether the user provides a binary judgment (relevant or not) or a
comparative judgment (this result is better than that one). Different relevance
feedback algorithms can be found in the survey of Zhou et al. [81] or the one
of Crucianu et al. [16].
Chapter 2
Segmentation
As humans we often attribute the semantics of an image based only on a
particular region of this image, disregarding the rest of the scene. It is
therefore logical, when trying to learn good representations of an image, to
first select meaningful regions by segmenting the image. A good segmentation
should group into regions pixels with similar characteristics such as color or
texture. Below is a review of different techniques that have been investigated
to perform segmentation.
2.1 Clustering methods
One way to achieve image segmentation is through clustering methods. The
most basic case would be to use a simple k-means algorithm. In this case the
main steps to perform the segmentation are the following (a code sketch is given
after the list):
1. Choose k cluster centers (e.g. k pixels), either randomly or based on some heuristic.
2. Based on a distance measure, assign each pixel in the image to the closest cluster.
3. Recompute the cluster centers.
4. Repeat steps 2 and 3 until convergence, that is, until no pixel changes cluster.
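A minimal sketch of these steps, assuming scikit-learn and scikit-image are available; the image path and the choice of k = 4 clusters are arbitrary, and only pixel colors (no spatial information) are used as features.

```python
import numpy as np
from skimage import io
from sklearn.cluster import KMeans

# Load the image and flatten it into a list of RGB pixels.
image = io.imread("example.jpg")              # hypothetical image path
h, w, c = image.shape
pixels = image.reshape(-1, c).astype(np.float64)

# Steps 1-4: KMeans picks initial centers, assigns each pixel to the closest
# center and recomputes the centers until no pixel changes cluster.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(pixels)

# The segmentation is the per-pixel cluster index reshaped to the image grid.
segmentation = labels.reshape(h, w)
```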
An improvement upon this simple scheme is to use fuzzy clustering. In fuzzy
clustering, pixels, rather than belonging completely to just one cluster, have a
degree of membership in each cluster. A further improvement is achieved by adding
spatial information to the membership function, as in the fuzzy c-means clustering
proposed by Chuang et al. [13]. Indeed their final membership function takes
into account features of the pixels as well as neighboring information. Hence the
probability of belonging to a cluster will be higher for a pixel if its neighboring
pixels already belong to this cluster. Fuzzy c-means clustering is also explored
by Ahmed et al. [2] for the segmentation of magnetic resonance images.
2.2 Histogram-based methods
The principle of histogram-based methods for image segmentation is first to
compute a histogram from the image. Then one or several thresholds are chosen
between two peaks of the histogram, and the pixels of the image are assigned to
clusters according to these thresholds. This technique is for example used
by Arifin et al. [3] to distinguish between the foreground and background
of grayscale images. A refinement of this technique is to recursively apply the
histogram computation and the thresholding to subregions, as explained in detail
in the work of Ohlander et al. [57].
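As a simplified illustration of the thresholding idea, the sketch below separates foreground from background with a single global threshold (here chosen with Otsu's method rather than the recursive scheme of Ohlander et al.); the image path is hypothetical.

```python
import numpy as np
from skimage import io, color
from skimage.filters import threshold_otsu

# Load an image and convert it to grayscale.
image = color.rgb2gray(io.imread("example.jpg"))   # hypothetical path

# Gray-level histogram (for inspection) and a single threshold that should
# fall between the two main peaks for a bimodal image.
histogram, bin_edges = np.histogram(image, bins=256, range=(0.0, 1.0))
threshold = threshold_otsu(image)

# Pixels are assigned to two clusters: background and foreground.
foreground_mask = image > threshold
```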
2.3 Region-growing methods
In region-growing methods we start with elementary regions, either all pixels
or a subset of pixels, and we iteratively combine these small regions, using
a statistical test to decide whether or not to merge them. A recent algorithm
that follows this approach is Statistical Region Merging, proposed by Richard
Nock and Frank Nielsen [56]. Their algorithm is based on a model of image
generation which captures the idea that grouping is an inference problem. It
provides a simple merging predicate and a simple ordering of merges. They argue
that their method can cope with significant noise corruption, handle occlusions,
and perform scale-sensitive segmentation. Another method is seeded region
growing. This method takes as input, along with the image, a set of seeds that
correspond to the objects to be segmented. The regions are then iteratively
grown by comparing unallocated neighboring pixels to the regions. One way of
comparing could be, for example, to compare the intensity of a pixel with the
average intensity of the region.
2.4 Graph partitioning methods
Graph partitioning methods have lately been the main research direction for
segmenting images. These methods see an image as a weighted undirected graph
G = (V, E) where each node v ∈ V corresponds to a pixel or a group of pixels
and each edge (i, j) ∈ E is weighted according to the similarity between the
two pixels it links.
2.4.1 Cut Criterion
In order for the image graph to be partitioned into relevant clusters, good cut
criteria must be found. A cut, in graph-theoretic terms, is a partitioning of the
graph into two disjoint subsets which are judged dissimilar. The degree of
dissimilarity is basically computed by summing the weights of the edges that
connect the two subsets:

Cut(I, J) = \sum_{i \in I, j \in J} w(i, j)

where w is a function used to estimate the similarity between two nodes/pixels.
The problem with this metric is that it tends to create clusters composed of a
single node. A popular criterion for finding good clusters is known as the
normalized cut [63]. To avoid the single-node bias, the normalized cut normalizes
the cut criterion by the total edge weight connecting each subset to all the nodes
in the graph. Other criteria are the minimum cut [77] (a cut is minimum if no cut
with a smaller weight can be found in the graph) or the maximum cut (no cut with
a larger weight exists). Yet another recent variant is the Multiscale Normalized
Cuts (NCuts) approach of Cour et al. [15].
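A small NumPy sketch of how the cut and normalized cut values of a candidate partition can be evaluated from a pixel similarity matrix; the matrix and the partition are toy values, and a real system would solve an eigenvalue problem to find the partition, as in [63].

```python
import numpy as np

def cut_value(W, I, J):
    """Sum of the edge weights connecting subset I to subset J."""
    return W[np.ix_(I, J)].sum()

def normalized_cut(W, I, J):
    """Normalized cut criterion [63]: the cut is divided by the total
    connection weight of each subset to the whole graph."""
    cut = cut_value(W, I, J)
    assoc_I = W[I, :].sum()   # total edge weight from I to all nodes
    assoc_J = W[J, :].sum()   # total edge weight from J to all nodes
    return cut / assoc_I + cut / assoc_J

# Toy similarity matrix over 4 pixels and a candidate bipartition.
W = np.array([[0.0, 0.9, 0.1, 0.0],
              [0.9, 0.0, 0.2, 0.1],
              [0.1, 0.2, 0.0, 0.8],
              [0.0, 0.1, 0.8, 0.0]])
I, J = [0, 1], [2, 3]
print(cut_value(W, I, J), normalized_cut(W, I, J))
```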
2.4.2 Graphical Probabilistic Model
Graphical probabilistic models are probabilistic models for which a graph
expresses the conditional dependence structure between random variables. In the
case of image processing they are often used to both segment the image and label
each segmented region. In this context random variables represent pixels or,
more often, groups of pixels (superpixels).
Figure 2.1: A graphical model for image segmentation and labeling
The first step when using a graphical model is usually to perform an unsupervised
segmentation of the image. The result of this step is a set of superpixels, as
illustrated by the image on the left of Figure 2.1. Different algorithms have
been designed to achieve this, such as SLIC [1], an adaptation of the k-means
clustering approach, or Felzenszwalb's method [24], a graph-based approach.
Following this step we can model the network with random variables Ci that
represent superpixels and take as values the different class labels of the
problem at hand. A potential function Ψ(Ci, Cj) defined between all neighboring
variables captures the dependencies between them. We also define other random
variables Xi that represent the observations, or evidence, we have about a
superpixel. In the context of image processing this evidence consists of local
features such as color, shape and texture features. It is used to make inferences
about the values of Ci through the potential Φ(Xi).
Two different models, namely Markov Random Fields (MRF) and Conditional
Random Fields (CRF), have been the most investigated by researchers.
Markov random fields are generative models that model the joint probability
distribution P(C, X), whereas conditional random fields are discriminative
models that model the conditional probability P(C|X). The advantage of CRFs over
MRFs is that they directly model the decision problem of interest and are often
more accurate. They are also easier to train since they do not try to model
the distribution over the observations. However they are limited to this strict
prediction problem and cannot be used for anything else. Demonstrations of
segmentation with Markov random fields can be found in the work of Won et
al. [76] or that of Zhang et al. [79], who use the model to segment brain
magnetic resonance (MR) images. Conditional random fields have been investigated
by Plath et al. [60], who use them to perform scale-space segmentation
with class assignment.
Chapter 3
Feature Extraction
Features in Content-Based Image Retrieval are relevant pieces of information
about the underlying image. What makes a feature relevant generally depends on
the problem at hand. Nevertheless one can identify several properties that
features should respect in order to be considered good features. Thus features
should be invariant to:
Affine transformations (rotation, scaling, translation...). Ideally the features
computed on an image would be the same whatever the location or the
scale of the object in the image.
Distortion As for affine transformations, features should be tolerant to small
distortions.
We can discern two approaches to feature extraction. The first one is to rely
on handcrafted features that describe elementary characteristics of the image
such as shape, color or texture. A major drawback of handcrafted features is
their dependence on the application domain, which has led to another set of
techniques called feature learning. Feature learning exploits a training dataset
to discover useful features from the images. Another distinction one can make in
the domain of feature extraction is between global and local features.
3.1 Global and Local Features
Global features are features which are aggregated over the entire image. More
formally, global features can be written as:

F_j = \bigoplus_{T_j} f \circ i(x)

where \bigoplus represents an aggregation operation (which can be something other
than a sum), T_j is the partition over which the value of F_j is computed, f
accounts for possible weights and i(x) is the image.
As opposed to global features, local features are computed by considering
only a subpart of the image. Usually a set of features is computed for each pixel
using its neighborhood, or for non-overlapping blocks. After this step we have a
set X_i, 0 < i < size_image, where X_i is the feature vector computed at location
i of the image. A further summarization step can also be performed; for example
we might derive a distribution for the X_i based on this set. As reported by
Datta et al. [20], local features often correspond to more meaningful image
components, such as rigid objects and entities, which makes the association of
semantics with image portions more straightforward.
3.2 Handcrafted Features
3.2.1 Color Features
An example of a global feature that has been extensively used is the color
histogram, a representation of the distribution of colors in an image. Color
histograms can be useful for retrieval as long as the color pattern of interest is
representative of an item throughout the dataset. They have the advantage of being
robust to translation and rotation. Various distance measures, such as the
Euclidean distance, histogram intersection, cosine or quadratic distance, can then
be used to compute the similarity between images [68] [32] [67]. However color
histograms suffer from obvious flaws. They cannot help to identify that a red cup
and a blue cup actually represent the same object, and if the similarity of two
images with very different scenes but identical color distributions is computed
using color histograms, the images might be falsely judged similar. To improve the
efficiency of color histograms, joint histograms [58] (histograms that incorporate
additional information other than color), two-dimensional histograms [5]
(histograms that consider the relations between pairs of pixel colors) and
correlograms [39] (three-dimensional histograms) have been investigated.
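A minimal sketch of a global color histogram and of the histogram intersection similarity, assuming 8-bit RGB images stored as NumPy arrays; the quantization into 8 bins per channel is an arbitrary choice.

```python
import numpy as np

def color_histogram(image, bins_per_channel=8):
    """Joint RGB histogram, normalized so that it sums to one."""
    hist, _ = np.histogramdd(
        image.reshape(-1, 3),
        bins=(bins_per_channel,) * 3,
        range=((0, 256),) * 3,
    )
    return hist.ravel() / hist.sum()

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: 1 means identical color distributions."""
    return np.minimum(h1, h2).sum()

# Toy usage with two random "images".
rng = np.random.default_rng(0)
img_a = rng.integers(0, 256, size=(64, 64, 3))
img_b = rng.integers(0, 256, size=(64, 64, 3))
print(histogram_intersection(color_histogram(img_a), color_histogram(img_b)))
```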
Another key issue when dealing with color features is the choice of the color
space being used (RGB, HSV, Lab...), which usually depends on the specific needs
of the application. Two aspects of color have to be taken into account here. The
first is that depending on how the scene was captured (viewpoint of the camera,
position of the illumination, orientation of the surface...) the recorded color
might vary considerably. The second is that the human perception of color greatly
changes between individuals. RGB space is one of the most popular color spaces
and assigns to each pixel a (R(x), G(x), B(x)) triplet corresponding to the
additive primary colors of light (red, green, blue). RGB space is an adequate
choice only when there is little variation in the recording conditions. For
instance the RGB space would probably be a good choice for art paintings but a
bad choice for outdoor pictures. Indeed, colors relatively close in the RGB color
space might be perceived as very different by a human. In the opponent color
space, colors are defined according to the opponent color axes derived from the
RGB values: (R-G, 2B-R-G, R+B+G). It has the advantage of isolating the brightness
information on the third axis. Since humans are more sensitive to variations in
brightness, the two other axes can be downsampled to reduce memory usage. Other
color spaces have been studied and can be found in the survey of Smeulders et
al. [65] or the one of Khokher et al. [47].
3.2.2 Shape Features
Shape feature methods try to identify interesting regions in an image, such as
edges or corners, and compute features based on these regions. As a preliminary
step, scale-space detection is very often performed since it provides a way to
detect interesting regions at any scale. A widely used detector is SIFT
(Scale-Invariant Feature Transform), published by David Lowe in 1999 [52]. SIFT
extracts keypoints from an image at different scale-spaces using Differences of
Gaussians and assigns an orientation to each of them to achieve invariance to
image rotation. Keypoint descriptors are then computed based on their
neighborhoods, that is, for each neighborhood an orientation histogram is created.
In addition several measures are taken to increase the robustness of the
descriptors to changes in illumination. Improvements over SIFT have since been
made, especially with the Speeded Up Robust Features (SURF) detector [7], which is
several times faster than SIFT and claimed by its authors to be more robust
against different image transformations. Another descriptor is known as the
Histogram of Oriented Gradients (HOG). The idea behind histograms of oriented
gradients is that objects or humans can be described by the distribution of
oriented gradients. In this technique the image is divided into cells (groupings
of adjacent pixels) of circular or rectangular shape. For each pixel the gradient
orientation is computed, and then for each cell a histogram of gradient
orientations is deduced. Each pixel in a cell contributes to the histogram,
generally depending on the magnitude of the gradient. To account for illumination
changes and shadowing, gradients in a region (a grouping of several cells) are
normalized with the average intensity of the region. The HOG descriptor is then
the concatenated vector of all the normalized histograms. It has been investigated
for instance for human detection [18]. Other techniques include the Harris corner
detector [34] or the Hough transform [4].
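A hedged sketch of extracting a HOG descriptor with scikit-image; the parameter values are common defaults, not necessarily those of [18], and the image path is hypothetical.

```python
from skimage import io, color
from skimage.feature import hog

# Load an image and convert it to grayscale.
image = color.rgb2gray(io.imread("person.jpg"))   # hypothetical path

# Gradient orientations are accumulated in 8x8-pixel cells; the resulting
# histograms are normalized over blocks of 2x2 cells, then concatenated.
descriptor = hog(
    image,
    orientations=9,
    pixels_per_cell=(8, 8),
    cells_per_block=(2, 2),
    block_norm="L2-Hys",
    feature_vector=True,
)
print(descriptor.shape)
```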
3.2.3 Texture Features
A texture can be defined as a homogeneous region in an image that does not result
only from a single color but from identical structural arrangements or
repetitive patterns within that region. For instance the bark of trees might have
different colors across species or depending on the season, but it forms the same
kind of texture. Bricks, parquet or grass are other examples of textures.
Detecting textures is important because they often provide strong semantic
information. According to different surveys (Khokher et al. [47], Haralick et
al. [33]), texture features can be divided into two main categories: the
structured approach and the statistical approach. In the structured approach the
image is seen as a set of primitive texels arranged in a regular or repeated
pattern. The more widely used statistical approach is based on the distribution of
gray levels in the image. According to Robert M. Haralick [33], the statistical
approach can be divided into eight different techniques: autocorrelation
functions, optical transforms, digital transforms, textural edgeness, structural
elements, spatial gray tone co-occurrence probabilities, gray tone run lengths,
and autoregressive models. An explanation of each of these techniques is provided
in the survey. Wavelet-based features have also received wide attention. In the
work by Minh N. Do et al. [21], wavelet features are used in combination with the
Kullback-Leibler distance for texture retrieval. Gabor filters have also been
investigated for texture features [30].
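A small sketch of Gabor-filter texture features using scikit-image, assuming a grayscale image; the filter bank (two frequencies, four orientations) and the mean/variance statistics are illustrative choices.

```python
import numpy as np
from skimage import io, color
from skimage.filters import gabor

image = color.rgb2gray(io.imread("texture.jpg"))   # hypothetical path

features = []
for frequency in (0.1, 0.4):
    for theta in np.arange(4) * np.pi / 4:
        # Filter the image with a Gabor kernel of this frequency/orientation.
        real, imag = gabor(image, frequency=frequency, theta=theta)
        # Summarize the filter response by its mean and variance.
        features.extend([real.mean(), real.var()])

texture_descriptor = np.array(features)   # 16-dimensional texture descriptor
```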
3.3 Shallow methods
Shallow methods refer here to techniques used to extract a meaningful
representation from the feature descriptors described above. Shallow is used in
opposition to deep methods, which involve several layers of feature extraction.
3.3.1 Visual Bag-of-Word
The visual bag-of-words method in computer vision is analogous to the bag-of-words
model for documents. For a document the bag-of-words is a vector (i.e. a
histogram) that contains the number of occurrences of the words defined by a
vocabulary. For an image a bag of words is a vector or histogram that contains
the number of occurrences of visual words, where visual words correspond to
representative vectors computed from local image features. The main steps of
the method are:
Local feature extraction For this step different detectors such as the Harris
detector or SIFT can be used and have been effectively investigated.
Encoding in a codebook From the local features, codewords (analogous to
words) have to be found to produce the codebook (analogous to a dictionary). The
simplest method is to perform k-means clustering over all the features, with
cluster centers corresponding to codewords. Other methods to cluster the vector
space are square-error partitioning algorithms or hierarchical techniques.
Bag of keypoints construction Once the codebook has been determined, we
can construct for each image a histogram (called here a bag of keypoints) that
contains the number of vectors assigned to each codeword (i.e. cluster center).
More detailed explanations about the bag-of-words model can be found in the paper
by Csurka et al. [17] that introduced the method in 2004. Since then, further
encoding methods such as locality-constrained linear encoding, improved Fisher
encoding, super vector encoding or kernel codebook encoding have been proposed to
improve the model.
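A condensed sketch of this pipeline, assuming local descriptors (e.g. SIFT vectors) have already been extracted for every image; scikit-learn's KMeans builds the codebook, and the codebook size of 100 visual words is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(all_descriptors, n_words=100):
    """Cluster the pooled local descriptors; cluster centers are the visual words."""
    return KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(all_descriptors)

def bag_of_words(image_descriptors, codebook):
    """Histogram counting how many local descriptors fall into each visual word."""
    words = codebook.predict(image_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters)
    return hist / hist.sum()

# Toy usage: 3 "images", each with random 128-dimensional local descriptors.
rng = np.random.default_rng(0)
descriptors_per_image = [rng.normal(size=(200, 128)) for _ in range(3)]
codebook = build_codebook(np.vstack(descriptors_per_image), n_words=100)
histograms = [bag_of_words(d, codebook) for d in descriptors_per_image]
```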
3.3.2 Improved Fisher Kernel
The Fisher Kernel was introduced by Jaakkola and Haussler [43]. It combines the
benefits of generative and discriminative approaches by deriving a kernel from a
generative model of the data. In the case of images it consists in characterizing
local patch descriptors by their deviation from a Gaussian Mixture Model. The
Improved Fisher Kernel thus extends the BoW representation by including not only
statistical counts of visual words but also additional information about the
distribution of the descriptors. However, in practice, results obtained with the
Improved Fisher Kernel have not been better than those obtained with the BoW
model [59]. A comparison of these two shallow methods can be found in the work
of Chatfield et al. [10].
Chapter 4
Convolutional Neural Network
The alternative to handcrafted features that has gained huge attention in
recent years is the possibility of letting the machine learn the best features to
represent the image. In the domain of image processing, the learning of features
is mainly done through a model called the Convolutional Neural Network (CNN).
Convolutional neural networks are models inspired by how the brain works and
how neurons interact with each other. They are made of successive layers of
neurons, each of them aggregating the results from the preceding layer in order
to compute more abstract features. This succession of layers is the reason why
such algorithms are known as deep learning methods. The convolutional neural
network was inspired by the neocognitron proposed by Kunihiko Fukushima in
the 1980s [25]. It was the first attempt to mimic the animal visual cortex.
Early papers [40] had shown that the cortex of animals is made of different kinds
of neurons: neurons located early in the visual recognition process are
responsible for the extraction of specific patterns within their receptive fields,
while neurons located further downstream aggregate these results and are
invariant to the exact locations of the patterns. Accordingly, the neocognitron
consists of multiple types of cells, the most important of which are called
S-cells and C-cells. The purpose of S-cells is to extract local features, which
are integrated gradually and classified in the higher layers. However, the
neocognitron lacked a proper supervised training algorithm. This algorithm was
found simultaneously by different teams in the eighties and is known as
backpropagation, an abbreviation for backward propagation of errors.
Backpropagation, used in conjunction with an optimization method such as gradient
descent, was successfully applied in 1989 by LeCun et al. to recognize handwritten
digits [49]. With the rise of efficient GPU computing, a second breakthrough was
achieved in 2012 by Geoffrey Hinton and his team, who designed a convolutional
neural network known as AlexNet [48] that achieved state-of-the-art performance
on the ImageNet dataset.
As opposed to handcrafted features, convolutional neural networks have the
advantage of requiring very little, if any, preliminary knowledge of the domain at
hand. For instance, when using a Visual Bag-of-Words representation for images,
one has to choose which of the different detectors will be the most efficient for
the task, a choice which obviously depends on the underlying problem. With
convolutional neural networks, domain knowledge is used to finely tune the
parameters of the model, but this is also often achieved through trial and error.
Prior knowledge can also be put into the network by designing appropriate
connectivity, weight constraints and neuron activation functions. In any case it
is less intrusive than handcrafted features. Another benefit of convolutional
neural networks is their ability to learn more discriminative features: both the
review of Chatfield et al. [11] and the study of Zhang et al. [78] report better
performance than traditional shallow methods. Drawbacks of convolutional neural
networks include their complexity, the substantial amount of data and the
computational resources required to train them effectively.
4.1 Design
The complexity of convolutional neural networks resides mostly in their design
and in the different layers that the network is made of. Figure 4.1 illustrates
a very simple convolutional neural network. It is made of a succession of two
convolutional layers with pooling layers, followed by a fully connected layer.
As in the neocognitron, the first layers, namely the convolutional layers and
the pooling layers, are responsible for the extraction of local features.
Specifically, the convolutional layers learn to recognize different patterns and
the pooling layers gradually ensure that the network is invariant to the location
of these patterns. Fully connected layers then map these low-level features to
more abstract concepts. Below is a detailed description of the different layers
one can find in a network.
Figure 4.1: A Convolutional Neural Network
4.1.1 Convolutional Layer
Convolutional layers are the core of convolutional neural networks. Figure
4.2 represents a convolutional layer made of four learnable linear filters, or
kernels. Each performs a convolution on the preceding layer, that is, each filter
is slid across the width and the height of the preceding layer. The output of the
convolution of one filter is called a feature map. Formally we have:
h^k_{ij} = (W^k * x)_{ij} + b^k

where h^k denotes the k-th feature map, W^k and b^k the associated filter and
bias, and x the input.
Figure 4.2: A Convolutional Layer
It results from this operation that all neurons in a feature map share the same
weights but have different receptive fields, that is, they look at different
regions of the preceding layer. Thus neurons in the same feature map act as
replicated feature detectors able to detect a pattern at any location. Adjacent
neurons in different feature maps share the same receptive field but, since the
different filters learn different weights, each feature map learns to detect
different patterns. The exact nature of the patterns detected is hard to know
and might not be easily interpretable by humans.
Following the convolutions, each neuron is the input to an activation function
that determines the final output of the layer. Different activation functions can
be used, such as the binary function, the logistic function or the hyperbolic
tangent function, but the most commonly used activation function is the rectifier.
This function is defined as f(x) = max(0, x) and has been argued to be the most
biologically plausible [28].
Using convolutional layers before fully connected layers presents many
advantages. First, it makes it possible to take the spatial structure of the image
into account. In fully connected layers, neurons are joined to all the neurons of
the previous layer, so they treat pixels which are far apart and pixels which are
close together exactly the same way, whereas neurons in convolutional layers are
only joined to neighboring neurons of the preceding layer. Second, it greatly
reduces the number of parameters the network has to learn. For example, if an
image is of size 200x200x3, a single neuron of a fully connected layer would have
to learn 200*200*3 = 120,000 weights, whereas a convolutional layer with 50
different filters of size 5x5x3 only has to learn 50*5*5*3 = 3,750 weights.
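A minimal NumPy sketch of what a single convolutional layer computes: one "valid" convolution per filter followed by a ReLU. Real implementations are vectorized and support stride and padding, which are omitted here.

```python
import numpy as np

def conv_layer(x, filters, biases):
    """x: input of shape (H, W, C); filters: (K, kh, kw, C); biases: (K,).
    Returns K feature maps of shape (H - kh + 1, W - kw + 1), after a ReLU."""
    H, W, C = x.shape
    K, kh, kw, _ = filters.shape
    out = np.zeros((H - kh + 1, W - kw + 1, K))
    for k in range(K):                        # one feature map per filter
        for i in range(H - kh + 1):
            for j in range(W - kw + 1):
                receptive_field = x[i:i + kh, j:j + kw, :]
                out[i, j, k] = np.sum(receptive_field * filters[k]) + biases[k]
    return np.maximum(out, 0)                 # ReLU activation

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 32, 3))              # small toy input
filters = rng.normal(size=(50, 5, 5, 3))      # 50 filters of size 5x5x3
feature_maps = conv_layer(x, filters, biases=np.zeros(50))
print(filters.size)          # 3750 learnable weights, as computed in the text
print(feature_maps.shape)    # (28, 28, 50)
```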
4.1.2 Pooling Layer
Pooling layers are commonly inserted between two convolutional layers. They
perform a form of non-linear down-sampling: they partition the previous layer
into rectangular regions, overlapping or not, and compute a single output for
each region. This provides a small amount of translational invariance each time
they are used.
Figure 4.3: A Pooling Layer
Most of the time the function used by a pooling layer is max-pooling, which
computes the maximum of the underlying region, but average-pooling (which
computes the average of the underlying region) or L2-norm pooling can also be
used. With pooling, the exact locations of the features are lost, but the relative
locations, which are the most important, are preserved. Pooling layers are useful
to progressively reduce the spatial size of the feature maps, and thus the number
of parameters, and to prevent overfitting.
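A short NumPy sketch of 2x2 max-pooling with non-overlapping regions over a stack of feature maps (even spatial dimensions are assumed).

```python
import numpy as np

def max_pool_2x2(feature_maps):
    """feature_maps: (H, W, K) with H and W even. Returns (H/2, W/2, K)."""
    H, W, K = feature_maps.shape
    # Group pixels into non-overlapping 2x2 regions and take their maximum.
    reshaped = feature_maps.reshape(H // 2, 2, W // 2, 2, K)
    return reshaped.max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4, 1)
print(max_pool_2x2(x)[:, :, 0])   # [[ 5.  7.]
                                  #  [13. 15.]]
```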
4.1.3 Local Response Normalization Layer
Local Response Normalization layers can be found in particular in the AlexNet
network [48]. They perform a normalization between neurons from adjacent kernels.
The idea is to suppress the activity of a hidden neuron if nearby neurons in
adjacent feature maps have stronger activities: if a pattern is detected with low
intensity, it becomes less relevant when strong patterns are detected nearby. The
authors argue that it helps to increase the performance of the network; however,
another study [64] reports that performance improvements were rarely achieved by
adding local response normalization layers and that they even lead to increased
memory consumption and computation time. More generally, this kind of layer has
not been reused in subsequent works.
4.1.4 Fully Connected Layer
The next layers in a convolutional neural network are the ones found in a
traditional multilayer perceptron, that is, a succession of one or more fully
connected layers. Fully connected layers are called this way because the neurons
of these layers are connected to all the neurons of the layer below. In a
convolutional neural network, the feature maps of the last convolutional layer
are vectorized and fed into the fully connected layers. As for convolutional
layers, an activation function is used. If we suppose that rectifier units are
used, we formally have:
y_j = max(0, z_j) with z_j = b + \sum_i W_{ij} x_i

where y_j is the output of the j-th neuron of the layer, W is the weight matrix
of the layer and b is the associated bias.
4.1.5 Softmax and Loss Layer
The loss layer specifies how the network penalizes the deviation between the
predicted and true labels and is obviously the last layer in the network. Right
before that layer we usually find a softmax layer, which takes a feature vector as
input and forces the outputs to sum to one so that they can represent a
probability distribution across discrete, mutually exclusive classes:

y_i = \frac{e^{z_i}}{\sum_j e^{z_j}}

The cost function associated with the softmax function is the cross-entropy
function:

E = -\sum_j t_j \log(y_j)

where t_j is the target value and y_j is the predicted value.
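A small NumPy sketch of the softmax layer and the cross-entropy loss defined above; subtracting the maximum score before exponentiating is a standard trick for numerical stability.

```python
import numpy as np

def softmax(z):
    """Turn raw scores into a probability distribution over classes."""
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(y, t):
    """E = -sum_j t_j * log(y_j), with t a one-hot target vector."""
    return -np.sum(t * np.log(y + 1e-12))

z = np.array([2.0, 1.0, -1.0])       # scores from the last fully connected layer
t = np.array([1.0, 0.0, 0.0])        # ground-truth class is the first one
y = softmax(z)
print(y, cross_entropy(y, t))
```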
4.2 Training
The training of a network is done through the backpropagation algorithm, which
will be explained shortly. One problem, due to the fact that convolutional neural
networks are able to learn complex mathematical models, is that they can easily
overfit. To prevent this from happening, one can either limit the number of hidden
layers and the number of units per layer or stop the learning early, but several
more advanced methods have been proposed, which we discuss thereafter.
4.2.1 Backpropagation and Gradient Descent
Gradient descent is an optimization algorithm for finding a local minimum of a
function. In the case of machine learning it is used to find a minimum of the cost
function by applying the delta rule:

W_{ij} = W_{ij} - \alpha \nabla E(W_{ij})

where W_{ij} is the weight between nodes i and j, \alpha is the learning rate and
\nabla E(W_{ij}) is the gradient of the cost function with respect to the weight
W_{ij}.
Figure 4.4: Illustration of gradient descent.
The idea is that \nabla E(W_{ij}) gives us the direction of the steepest increase
of the cost function, so by iteratively updating the weights in the direction of
the negative gradient we should reach a minimum of the cost function, as
illustrated in Figure 4.4.
The delicate part with convolutional neural networks is to compute the gradients,
which is achieved thanks to the backpropagation algorithm. Below is a sketch of
the algorithm for fully connected layers, where we assume that the cost function
is the sum of squares:

E = \frac{1}{2} \sum_j (t_j - y_j)^2
Figure 4.5: Fully Connected Layers
The goal is to find \frac{\partial E}{\partial W_{ij}} so we can inject it into
the delta rule. For this we make use of the chain rule:

\frac{\partial E}{\partial W_{ij}} = \frac{\partial y_j}{\partial W_{ij}} \cdot \frac{\partial E}{\partial y_j}

\frac{\partial E}{\partial y_j} can easily be found from the cost function:

\frac{\partial E}{\partial y_j} = -(t_j - y_j)

To find \frac{\partial y_j}{\partial W_{ij}} we once again make use of the chain
rule:

\frac{\partial y_j}{\partial W_{ij}} = \frac{\partial z_j}{\partial W_{ij}} \cdot \frac{\partial y_j}{\partial z_j}

\frac{\partial y_j}{\partial z_j} depends on the activation function used. If we
assume a logistic activation function, where y_j = \frac{1}{1 + e^{-z_j}}, then
\frac{\partial y_j}{\partial z_j} = y_j (1 - y_j). We also easily find
\frac{\partial z_j}{\partial W_{ij}} = x_i, so that we finally have:

\frac{\partial E}{\partial W_{ij}} = -x_i \, y_j (1 - y_j) (t_j - y_j)
This final expression can then be injected into the delta rule, enabling the
network to be trained. Obviously the above derivation depends on the chosen cost
function, the kind of layer used and the activation function. Also, gradient
descent is most of the time used in conjunction with the momentum method, which
enables faster convergence.
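The derivation above translates directly into code. Below is a minimal NumPy sketch for a single fully connected layer with logistic activations, trained with the delta rule on one example; the shapes, the learning rate and the number of iterations are toy values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 2))   # weights W_ij: 3 inputs, 2 outputs
b = np.zeros(2)
x = np.array([0.5, -0.2, 0.8])           # input vector
t = np.array([1.0, 0.0])                 # target output
alpha = 0.5                              # learning rate

for step in range(100):
    # Forward pass: z_j = b + sum_i W_ij x_i, y_j = sigmoid(z_j)
    z = x @ W + b
    y = sigmoid(z)
    # Backward pass: dE/dW_ij = -x_i * y_j (1 - y_j) (t_j - y_j)
    delta = -(t - y) * y * (1 - y)       # dE/dz_j
    grad_W = np.outer(x, delta)          # dE/dW_ij
    grad_b = delta
    # Delta rule: W_ij <- W_ij - alpha * dE/dW_ij
    W -= alpha * grad_W
    b -= alpha * grad_b

print(sigmoid(x @ W + b))                # should move toward the target [1, 0]
```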
4.2.2 L1 and L2 regularization
L1 and L2 regularization are common ways of preventing the network from
overfitting. They both consist in introducing an extra term in the cost function
that penalizes large weights. L2 regularization is the most commonly used and is
also known as weight decay. The extra term in this case is the sum of the squares
of all the weights in the network, scaled by a factor. The new error function is:

E' = E + \frac{\lambda}{2} \sum_{i,j} W_{ij}^2

The effect of such a regularization is to make the network prefer to learn small
weights and to allow large weights only if they considerably improve the cost
function. Promoting small weights has been shown to be effective in reducing
overfitting.
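In code, weight decay amounts to adding λ/2 times the sum of squared weights to the cost and λW to the gradient of each weight matrix; a minimal sketch with hypothetical helper names:

```python
import numpy as np

def l2_penalty(weights, lam):
    """Extra term added to the cost: (lambda / 2) * sum of squared weights."""
    return 0.5 * lam * sum(np.sum(W ** 2) for W in weights)

def l2_gradient(W, lam):
    """Contribution of the penalty to dE/dW: simply lambda * W."""
    return lam * W

# During training the update of each weight matrix W becomes:
#   W <- W - alpha * (grad_data + l2_gradient(W, lam))
# which shrinks ("decays") the weights at every step.
```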
4.2.3 Batch normalization
Batch normalization [42] potentially helps the training in two ways: faster
learning and higher overall accuracy. In machine learning, normalization of the
data is often performed before the training process in order to make the data
comparable across features. One problem with neural networks is that during the
training process the weights of the network fluctuate, and the inputs of upper
layers are affected by the weights of the preceding layers. Therefore the
distribution of inputs to the upper layers can end up being distorted. This
problem is referred to by the authors as internal covariate shift. Their solution
is to normalize the input data of each layer over each training mini-batch.
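A minimal NumPy sketch of the normalization applied to one mini-batch of layer inputs; the learnable scale gamma and shift beta and the small epsilon follow the formulation of [42], while the running averages used at test time are omitted.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x: mini-batch of layer inputs, shape (batch_size, n_features)."""
    mean = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                      # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalized inputs
    return gamma * x_hat + beta              # learnable scale and shift

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(32, 4))
out = batch_norm(batch, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # ~0 and ~1
```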
4.2.4 Dropout
A typical way to reduce overfitting in machine learning is to combine the results
of several models. However, with large neural networks, training many models can
be very tedious due to the amount of training data and computational resources
required. Dropout is a technique proposed by Geoffrey Hinton et al. [38] [66] to
prevent large networks from overfitting; it is equivalent to combining the results
of many models. The term dropout refers to temporarily dropping neurons, along
with their incoming and outgoing connections, from the network. Each neuron is
assigned a fixed probability p, independently of the others, and for each training
case each neuron is retained with that probability p. Hence for each training case
a different network is trained. This prompts neurons not to make up for the errors
made by the other neurons but to make good predictions on their own. At test time
the outgoing weights of each neuron are multiplied by the probability p assigned
to it, which is a way of approximating the predictions of the different models
that have been trained.
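A sketch of the scheme described above, where each neuron is retained with probability p during training and activations are scaled by p at test time; this follows [66] rather than the "inverted dropout" variant used by many modern frameworks.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5   # probability of retaining a neuron

def dropout_train(activations):
    """Randomly drop neurons (and their outgoing contributions) during training."""
    mask = rng.random(activations.shape) < p
    return activations * mask

def dropout_test(activations):
    """At test time keep all neurons but scale their outputs by p."""
    return activations * p

h = np.array([1.2, -0.3, 0.8, 2.0])
print(dropout_train(h))   # some entries zeroed out
print(dropout_test(h))    # all entries scaled by 0.5
```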
4.3 Research Trends
As mentioned earlier, the first convolutional neural network to show great
performance was AlexNet [48]. The overall architecture of AlexNet was composed of
five convolutional layers and three fully connected layers. Three max-pooling
layers were also used, as well as layers they called Local Response Normalization
layers. Since AlexNet, many other attempts to improve the performance of CNNs
have been made, mainly focusing on varying the number of layers used as well as
the size of the kernel filters. Simonyan et al. [64], for instance, fixed all the
parameters of the network other than its depth (the number of layers) and showed
that the deeper the network, the better the performance achieved. However a bigger
size usually means more parameters, which can become problematic regarding the use
of computational resources, the training time and overfitting. This problem is
addressed by Szegedy et al. [69], whose study focuses on the efficiency of
convolutional neural networks through sparsely connected architectures. Apart from
increasing the depth and the width of convolutional neural networks, other methods
have been explored, some of which are presented below.
4.3.1 Convolutional Neural Network Architecture
MCDNN MCDNN stands for Multi-Column Deep Neural Network and is an
implementation of a neural network proposed by Ciresan et al. [14]. Each column of
their network is actually a convolutional neural network, and the final
classification is made by aggregating the results of these networks. Different
training strategies ensure that each column produces a different result.
Network In Network Lin et al. argue in their study [50] that the level of
abstraction obtained with linear filters is low. Therefore they substitute the
linear filters in the convolutional layers with what they call micro neural
networks, which are actually multilayer perceptrons. Also, fully connected layers
often being responsible for overfitting, they propose a strategy called global
average pooling to replace them. The idea is for the last convolutional layer to
have as many feature maps as the number of categories on which the network is
trained. For each feature map the average is computed, and the resulting vector
is fed directly into the softmax layer. The final network was tested on four
benchmark datasets (CIFAR-10, CIFAR-100, SVHN and MNIST) and achieved the best
performance on all of them.
4.3.1.1 Going deeper with convolution
The main contribution of the work by Szegedy et al. [69] is the use of the
Inception module. The first idea is that for each convolutional layer in the
network we do not know which convolution size is the best choice to capture the
local correlations in the previous layer.
Figure 4.6: Inception module with dimensionality reduction
So instead of applying only one convolution, the Inception module simultaneously
performs three convolutions as well as a pooling operation on the previous layer,
then concatenates all the feature maps to form the final output of the module. One
problem with this approach is that the concatenation of the different convolutions
and of the pooling operation leads to an exploding number of feature maps, and
then even a modest number of 5x5 convolutions can become prohibitively expensive
to compute. To prevent this from happening, all the 3x3 and 5x5 convolutions in
the module are preceded by a 1x1 convolution that reduces the number of filters
and thus makes the learning tractable. The exact network they used for the
ILSVRC14 competition, which achieved the best results, was made of 22 layers and
8 of these Inception modules.
4.3.1.2 Residual Learning for Image Recognition
Even if adding more and more layers to a network usually helps to achieve better
performance, cases have been reported where it leads to worse results [35]. This
problem is addressed by Kaiming He et al. [36].
Figure 4.7: Residual learning: a building block
Their idea is that you only want to add an extra layer if you are sure it will
increase the performance of the network. One way to ensure this is to provide to
the output of each layer its input without any transformation. Formally, instead
of learning F(x), the layer learns

H(x) = F(x) + x

The result of this new mapping is that if the identity mapping is ideal, it is
easy to drive the weights of the layer to 0; otherwise, it forces the layer to
learn something new. Using such building blocks, the network they used for the
ILSVRC15 competition was made of 152 layers and achieved the best results.
4.3.2 Enhanced Convolutional Neural Network
Since the first breakthrough made by the AlexNet architecture for classifying
images, convolutional neural networks have also shown great promise for different
applications such as object detection, human pose estimation and multi-label
classification.
4.3.2.1 Object Detection
In the work by Girshick et al. [26], convolutional neural networks are used for
object detection. Their main contribution is to show that convolutional neural
networks can be used successfully not only for image classification but also for
object detection. They use the following approach: first, for an image, they
generate category-independent region proposals through selective search; then for
each region they compute a feature vector using a CNN; finally they classify each
region with category-specific linear SVMs. Another contribution is to show that
training first on a large dataset (ILSVRC) followed by domain-specific fine-tuning
on a small dataset is an effective way to train a network. Since their system
combines region proposals with CNNs, they called their method R-CNN: Regions with
CNN features. It improved on the previous performance on the Pascal VOC object
detection challenge.
4.3.2.2 Multi-label classification
Single-label classification is interesting, but very often an image contains several entities and we would like to be able to detect them all. Different approaches have been investigated to enable CNNs to predict multiple labels. In the work of Wei et al. [74] the following methodology is used. Given an image, a state-of-the-art objectness detection technique (BING) is first applied to extract hypotheses; the result of this step is a set of patches of the original image that might contain an object. Each of these patches is then fed to a traditional CNN and all the outputs are pooled into a single vector for the final prediction. The global architecture is called Hypotheses-CNN-Pooling (HCP). For the training, the traditional CNN is first pretrained on a single-label image dataset, then the HCP network is finetuned on a multi-label image dataset. Multi-label classification has also been studied by Gong et al. [29] with an emphasis on the loss layer: they investigated several loss functions to train the network. Their first approach is to use a softmax layer followed by a loss layer that minimizes the KL-divergence between the predictions and the ground-truth probabilities. The second loss function is called pairwise-ranking loss and ensures that positive labels always receive higher scores than negative labels. The third loss is the weighted approximate ranking (WARP) defined by Weston et al. [75], which optimizes the top-k accuracy for annotation by using a stochastic sampling approach. Their results show that all three losses give comparable performance, except that the WARP loss achieves better performance for rare tags.
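As an illustration of the second option, here is a minimal hinge-style pairwise ranking loss in plain NumPy. It is a simplified sketch of the idea (every positive label should score higher than every negative label), not the exact loss used in [29].

```python
import numpy as np

def pairwise_ranking_loss(scores, labels, margin=1.0):
    """scores: (n_labels,) raw prediction scores for one image.
    labels: (n_labels,) binary ground-truth vector.
    Penalises every (positive, negative) pair whose score gap is below the margin."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    violations = margin - (pos[:, None] - neg[None, :])
    return float(np.maximum(0.0, violations).sum())

# Example: the single positive label is scored well above both negatives -> loss 0.
print(pairwise_ranking_loss(np.array([2.0, 0.1, -1.0]), np.array([1, 0, 0])))
```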
4.3.3 Transfer Learning
Training a whole network from scratch is cumbersome because it requires a very large amount of data (1.2 million images for the ImageNet database) and suitable hardware. Even when these requirements are met, the training can last up to two weeks. For these reasons it is common to use a pre-trained network, either as a fixed feature extractor or by finetuning it on a smaller dataset.
4.3.3.1 Convolutional Neural Networks as feature extractor
When using a pre-trained convolutional neural network as a feature extractor, the common procedure is to remove the last fully connected layer (this layer's output corresponds to the class scores of the task it has been trained for) and take the output of the preceding layer as our features. The resulting feature vector can then be fed to any machine learning classifier such as an SVM or a Random Forest. In the work of Donahue et al. [22], different experiments show that features extracted from a pre-trained model actually generalize well to unseen classes. The pre-trained model follows the architecture of the AlexNet model [48] and is trained on the ILSVRC-2012 dataset. Their first experiment is to extract features from a new dataset called SUN-397 and then run a dimensionality reduction algorithm (the t-SNE algorithm) to find a 2-d representation. By plotting the points, colored according to their semantic categories, we can observe good clustering. The second experiment is made on the Caltech-101 dataset. Again they extract features with their pre-trained model and use different classifiers (SVM and logistic regression) to discriminate between objects. They show that the performance obtained with the SVM classifier outperforms methods based on traditional hand-engineered image features. In other experiments (domain adaptation, subcategory recognition and scene recognition) they show again that features from a pre-trained deep model outperform traditional hand-engineered image features. Supervised pre-training is also explored by Wan et al. [72]. Like Donahue et al., they pre-train an AlexNet model on the ImageNet dataset and test its performance on other datasets against visual bag-of-words and GIST features. In this case features from the network frequently, but not always, outperform traditional features. Convolutional neural network features are also explored by Razavian et al. [61] on different tasks such as image classification, object detection and image retrieval, and again they outperform traditional shallow methods.
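The mechanics of this procedure are sketched below with PyTorch/torchvision and scikit-learn, used here for illustration only (the thesis experiments rely on Caffe and Theano). The random images and labels are stand-ins; in practice the AlexNet weights would be ImageNet-pretrained.

```python
import torch
import torchvision.models as models
from sklearn.svm import LinearSVC

# Drop the last fully connected layer of an AlexNet-style network and use
# the penultimate activations (4096-d) as image features.
alexnet = models.alexnet()   # in practice, load ImageNet-pretrained weights here
alexnet.classifier = torch.nn.Sequential(*list(alexnet.classifier.children())[:-1])
alexnet.eval()

with torch.no_grad():
    images = torch.randn(8, 3, 224, 224)   # stand-in for a real image batch
    feats = alexnet(images).numpy()        # (8, 4096) feature vectors

# The extracted features can then be fed to any off-the-shelf classifier.
labels = [0, 1, 0, 1, 0, 1, 0, 1]          # stand-in labels
clf = LinearSVC().fit(feats, labels)
```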
4.3.3.2 Fine-tuning a Convolutional Neural Network
The second strategy when using a pretrained network is to finetune it on a smaller dataset. For this we need to replace the last layer of the pre-trained network so as to fit the new task; the training can then be continued on the new dataset. The weights of the unchanged layers are initialized with the values learned during the previous training, and the weights of the new layers are initialized randomly. During the training phase, different learning rates can also be used for the different layers. Indeed, we can consider that the first layers already encode very generic features and therefore do not need to be retrained, whereas the last layers encode features that are more task-specific and thus need to be adapted to the new task at hand. In the study of Wan et al. [72], the results obtained from a CNN used as a simple feature extractor are compared with those obtained when it is finetuned on the new task. As expected, the finetuning strategy always yields better results.
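A minimal sketch of this setup is given below, again with PyTorch for illustration only; the 22-class output size and the two learning rates are arbitrary example values.

```python
import torch
import torchvision.models as models

# Fine-tuning sketch: replace the last fully connected layer to match the new
# task, and train the generic early layers with a smaller learning rate.
num_new_classes = 22                                   # example value
model = models.alexnet()                               # pretrained weights in practice
model.classifier[6] = torch.nn.Linear(4096, num_new_classes)  # new, randomly initialised layer

optimizer = torch.optim.SGD([
    {"params": model.features.parameters(),   "lr": 1e-4},  # generic early layers
    {"params": model.classifier.parameters(), "lr": 1e-2},  # task-specific layers
], momentum=0.9)
```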
Chapter 5
Feature Interpretation
Once features have been extracted with one of the techniques described in the previous chapter, one needs to interpret them in order to build end applications. By interpret I mean that, based on the features, one of the following questions, depending on the problem at hand, has to be answered: "Can we identify an object in the image?", "What is the class of the image?", "Are these two images similar?" . . .
5.1 Similarity Measure
An obvious objective in CBIR is to assess the similarity of a query image with the images in a database. Different similarity measures have been investigated for this purpose, and algorithms able to learn a similarity measure from training data have also been developed. The choice of the right metric is of course dependent on the problem at hand, but also on the technique that was used to extract the features, since the representation of the image differs depending on that technique. In the survey of Datta et al. [20], these different representations are called signatures and three types of signatures are distinguished:
Single Vector The image is represented as a single feature vector. It is for
instance the result of using a global shape descriptor.
Set of Vectors The image is represented as a set of vectors and is usually the
result of features computed on different regions obtained by segmentation
of the image.
Histogram The image is represented as a histogram. This can be the result
of the shallow methods introduced previously.
5.1.1 Fixed Measure
Basic metrics for computing the similarity between two feature vectors include the Minkowski distance and the cosine similarity. However, when computed on low-level features, such metrics often fail to make the computed distance agree with the semantic similarity. An attractive measure of the distance between two probability distributions is the Earth Mover's Distance (EMD) introduced by Rubner et al. in 2000 [62]. The EMD is based on the minimal cost that must be paid to transform one distribution into the other and matches perceptual similarity better than other distances. Another metric is the Kullback–Leibler distance, used for instance by Do et al. [21] to compute the similarity from wavelet-based texture features.
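For reference, the two basic metrics mentioned above can be written in a few lines of NumPy; this is a generic sketch, not code from the thesis experiments.

```python
import numpy as np

def minkowski(x, y, p=2):
    """Minkowski distance between two feature vectors (p=1: Manhattan, p=2: Euclidean)."""
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

def cosine_similarity(x, y):
    """Cosine similarity: 1 means identical orientation, 0 means orthogonal vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

a, b = np.array([1.0, 0.0, 2.0]), np.array([0.5, 0.5, 2.0])
print(minkowski(a, b), cosine_similarity(a, b))
```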
5.1.2 Similarity Learning
Rather than using a fixed measure, another approach is to rely on machine learning to learn a similarity measure. Three common setups exist:
Regression similarity learning In this approach pairs of images (I_i^1, I_i^2) are given with their associated measure of similarity y_i, and the goal is to learn a function that approximates f(I_i^1, I_i^2) = y_i.
Classification similarity learning In this approach pairs of images (I_i^1, I_i^2) are given with binary labels y_i ∈ {0, 1}. The goal is to learn a classifier able to judge new pairs of images.
Ranking similarity learning This time training samples consist of triplets (I_i, I_i^+, I_i^-) where I_i is known to be more similar to I_i^+ than to I_i^-.
An example of ranking similarity learning at large scale is the OASIS algorithm [12], which models the similarity function as a bilinear form: given two images I1 and I2, the similarity is measured as I1 * W * I2, where the matrix W is not required to be positive or symmetric.
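The scoring side of such a model is very simple, as the following NumPy sketch shows; the random W used here is just a stand-in for a matrix that OASIS would learn from triplets.

```python
import numpy as np

def bilinear_similarity(x1, W, x2):
    """OASIS-style similarity S_W(x1, x2) = x1^T W x2; W need not be
    symmetric or positive definite."""
    return float(x1 @ W @ x2)

rng = np.random.default_rng(0)
W = np.eye(4) + 0.1 * rng.standard_normal((4, 4))   # stand-in for a learned matrix
query, positive, negative = rng.standard_normal((3, 4))

# Ranking objective: the query should score higher with the similar image.
print(bilinear_similarity(query, W, positive) > bilinear_similarity(query, W, negative))
```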
5.2 Ontology
An object ontology can be used to map low-level features to high-level entities. An illustration of an object ontology is shown in figure 5.1. With this example, a concept such as sky could be defined as having a blue color, an upper position and a uniform texture. This way, by extracting color, texture and other features from an image, different regions can be mapped to concepts. An object ontology is used by Hyvönen et al. [41] to access pictures of the promotional events of the University of Helsinki. Top-level ontological categories represent concepts such as Persons, Events and Places, and pictures are mapped to these entities when inserted into the database. In the work of Mezaris et al. [53], the images are first segmented, then color and shape descriptors are extracted, enabling the mapping of each region to a corresponding concept.
Figure 5.1: Object Ontology
5.3 Machine Learning
A last approach to associate features with high-level concepts is to rely on supervised learning methods. A good candidate which has proven to show strong performance is the support vector machine (SVM) classifier. SVM was originally designed for binary classification, so we need to train multiple models if we want to achieve multi-class classification. The idea behind SVM is to find a hyperplane that best divides the features belonging to two separate classes. Among the possible hyperplanes, a common choice is the optimal separating hyperplane, which maximizes the distance between the hyperplane and the nearest data points of each class. SVM has often been used in conjunction with the visual bag-of-words representation to classify images; this is the case, for example, in the work of Csurka et al. [17] or that of Lowe et al. [52]. Another common classifier is the naive Bayes classifier. It relies on Bayes' theorem, so prior probabilities and class-conditional densities for the features have to be computed from the training samples. In the work of Vailaya et al. [70], first and second order moments in the LUV color space are used as color features together with MSAR texture features. Vector quantization is then performed with the learning vector quantization (LVQ) algorithm and the result is fed as input to the naive Bayes classifier. This scheme is used to classify indoor / outdoor and city / landscape images. Other classifiers that have been investigated are neural networks and decision trees.
Chapter 6
Evaluation of CBIR Systems
In order to evaluate the different systems that have been proposed for Content-Based Image Retrieval, researchers need common image databases with trustworthy ground truth and well-defined metrics. We review here some of the well-established datasets used for this purpose, as well as the different tasks that are evaluated.
6.1 Datasets
6.1.1 Pascal Voc
The Pascal Visual Object Classes (VOC) Challenge consists of two components: a publicly available dataset and an annual competition. The dataset consists of annotated consumer photographs collected from the Flickr photo-sharing website. In total, 500,000 images were retrieved from Flickr and divided between 20 classes. For each of the 20 classes, images were retrieved by querying Flickr with a number of related keywords and randomly choosing an image among the first 100,000 results. The process was repeated until sufficient images were collected. In order to evaluate the detection challenge, a bounding box was further added for every object in the target set of object classes.
6.1.2 Caltech
The goal of the Caltech datasets is to provide relevant images for multi-class object recognition. The Caltech-101 dataset provides pictures of objects belonging to 101 categories, with most of the categories having about 50 images; the resulting training set is therefore relatively small compared to other datasets. Each image contains only a single object. A common criticism of this dataset is that the images are largely without clutter, variation in pose is limited, and the images have been manually aligned to reduce the variability in appearance. Caltech-256 corrects some of the shortcomings of the previous dataset by more than doubling the number of classes and introducing a large number of clutter images.
6.1.3 ImageNet
ImageNet is an image database organized according to the WordNet hierarchy, in which each meaningful concept, possibly described by multiple words or word phrases, is called a synonym set or synset. ImageNet comprises more than 100,000 synsets with, on average, 1,000 images per synset. The size of the ImageNet dataset makes it particularly well suited to deep learning methods, which need huge amounts of training data.
6.2 Tasks Evaluated
6.2.1 Image Classification
One common task is to determine, among different classes, which one corresponds to a given image. It is called Image Classification and often relies on the presence or absence of a specific object in the image such as a car, a plane, a bicycle and so on. Performing such a task can be useful to answer a query of the type "Find pictures with a red car". To achieve Image Classification, a commonly used approach is to compute local features of the image and summarize them into a histogram which is given as input to a classifier [23]. This approach is known as bag-of-visual-words, in analogy with the bag-of-words (BOW) representation used for text. Within this approach, different feature extractors (SIFT descriptor, Harris descriptor...) and different classifiers (SVM, Earth Mover's Distance kernel...) have been investigated [17] [78] [52]. More recent work also performs classification using convolutional neural networks, which achieved the best performance on several benchmarks, especially on the ImageNet database.
6.2.2 Object Detection
Object Detection consists in assessing whether an object is present in a given image and identifying its location. A very widespread method to achieve Object Detection is to use a sliding window on the image. Features are computed from the window and given to a classifier to evaluate whether the object is present. The window is slid throughout the image at different scales, and for each scale and location the classifier is applied [71] [18].
6.2.3 Image Similarity
The purpose of Image Similarity is to assess whether images, commonly stored in a database, are similar to a query image. The most similar images can then be returned in answer to the query. Image similarity is typically performed by reverse image search engines. The process is first to extract features from the images; a similarity measure is then used to compute the similarity between them. The similarity can be computed with distance metrics such as the Euclidean distance or with more advanced techniques relying on machine learning to learn a similarity function [12].
6.2.4 Multi-class Image Segmentation and Labeling
Multi-class Image Segmentation and Labeling consists in assigning a class label to each pixel. The first step is usually to perform a segmentation of the image, aggregating similar pixels into groups; a label is then chosen for each group. Probabilistic graphical models have been successfully applied to this kind of task.
6.3 Evaluation Metrics
6.3.1 Precision and Recall
Probably the most common evaluation measures used in information retrieval
are precision and recall. Precision is the fraction of retrieved items that are relevant to the query:

precision = |relevant documents ∩ retrieved documents| / |retrieved documents|

Recall is the fraction of relevant items that are retrieved:

recall = |relevant documents ∩ retrieved documents| / |relevant documents|
Precision and recall are often presented through a precision versus recall graph. Based on these two metrics, several others have been derived that bring additional information and make up for their inadequacy in some cases. For instance, when performing retrieval over a huge quantity of documents, recall is no longer a relevant metric, since a query might have thousands of relevant documents.
Precision at k Precision at rank k is the fraction of relevant results among the first k retrieved documents.
Average Precision For one query, average precision is the average of the precision values computed at each rank where a relevant document is retrieved.
Mean Average Precision For a set of queries, mean average precision is the mean of the average precision scores of each query.
F-score The F-score is the harmonic mean of precision and recall, defined as 2 * (precision * recall) / (precision + recall), where 1 is the best value and 0 the worst.
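For concreteness, two of these ranking metrics can be computed as follows; this is a generic sketch operating on arbitrary document identifiers, not code tied to a particular dataset.

```python
def precision_at_k(relevant, retrieved, k):
    """Fraction of the first k retrieved items that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def average_precision(relevant, retrieved):
    """Average of precision@k taken at every rank k where a relevant item appears."""
    hits, precisions = 0, []
    for k, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

relevant = {"img3", "img7", "img9"}
retrieved = ["img3", "img1", "img7", "img5", "img9"]
print(precision_at_k(relevant, retrieved, 3))   # 2/3
print(average_precision(relevant, retrieved))   # (1/1 + 2/3 + 3/5) / 3
```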
6.3.2 Top-k Error
The Top-k error rate is useful to evaluate classification tasks. It is defined as the
fraction of items where the correct label is not among the k labels considered
the most probable by the model. In the Imagenet classification challenge the
top-1 and the top-5 error rates are used as benchmarks to compare the different
submissions.
6.3.3 Confusion Matrix
To evaluate the performance of a classification system, a confusion matrix, also known as a contingency table, is often used. The name confusion comes from the fact that the matrix makes it easy to check whether the system is confusing different classes. A confusion matrix reports the rates of true positives and true negatives, also known as sensitivity and specificity. For a binary classification test, sensitivity measures the proportion of positives that are correctly identified as such, while specificity measures the proportion of negatives that are correctly identified as such.
Part II
Contribution
Chapter 7
Experiments
Several experiments have been conducted to assess the dominance of convolutional neural networks in the domain of image processing. Such networks were then used to perform video classification and tagging for a startup that wishes to exploit the video corpus at its disposal more efficiently. The startup wishes to offer a system where clients can retrieve videos from the corpus through natural queries, and new videos are likely to be added to the initial corpus. To answer such requests, the system should rely on the information obtained from the classification and tagging stages. The initial corpus has already been manually annotated, but to make the whole process more efficient automatic annotation should be performed.
7.1 Image Similarity
7.1.1 Introduction
The objective of the experiments was to investigate how well traditional handcrafted features perform compared to features learned with a convolutional neural network. The features were used to represent images in an image similarity task: features from images assumed to be stored in a database are computed using both methods, then different image queries are performed and, for each query, the system retrieves the images judged most similar to the query.
7.1.2 Dataset
7.1.2.1 MNIST
The dataset used for the experiment is the MNIST dataset. It consists of grayscale images representing handwritten digits and has a training set of 60,000 examples and a test set of 10,000 examples. The images were obtained from another dataset called NIST, normalized and centered in a 28x28 image by computing the center of mass of the pixels and translating the images so as to position this point at the center of the 28x28 field.
Figure 7.1: Handwritten digits from the MNIST Dataset
7.1.3 Methods
Two methods were used to extract features from the images. The first method is a visual bag-of-words model with SIFT descriptors; this representation was chosen because it is a simple model that has nonetheless proven to achieve relatively good performance. The second method consists in training a convolutional neural network and extracting the features it has learned. The similarity between two images is then assessed using a similarity learning algorithm named OASIS (Online Algorithm for Scalable Image Similarity) proposed by Chechik et al. [12].
7.1.3.1 SIFT and Visual Bag-Of-Words
To obtain a visual bag-of-words model, local descriptors are computed using the SIFT algorithm through the OpenCV-Python library [9]. The number of layers in each octave was set to 3, which is the value D. Lowe used in his paper [51]. The contrast threshold used to filter out weak features was 0.01 and the threshold used to filter out edge-like features was 50; using a higher threshold to filter weak features or a lower threshold to filter edge-like features led to no keypoints being detected. Finally, the sigma of the Gaussian was set to 0.8. To go from local descriptors to global image descriptors, a standard histogram encoding was performed. All local descriptors from a subset of the training images (5,000 images randomly picked from the original training set) were fed to a k-means clustering with k = 64. The centers of the clusters are used as codewords, and for each image a histogram is computed by counting the number of local descriptors that belong to each cluster.
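The pipeline just described can be sketched as follows with OpenCV-Python and scikit-learn. The parameter values mirror the text above; `train_images` is assumed to be a list of grayscale uint8 arrays (e.g. MNIST digits), and this is an illustrative reconstruction rather than the exact code used for the experiment.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

sift = cv2.SIFT_create(nOctaveLayers=3, contrastThreshold=0.01,
                       edgeThreshold=50, sigma=0.8)

def local_descriptors(img):
    """SIFT descriptors of one grayscale image (128-d each)."""
    _, desc = sift.detectAndCompute(img, None)
    return desc if desc is not None else np.empty((0, 128), np.float32)

def build_codebook(train_images, k=64):
    """Cluster all local descriptors of a training subset into k codewords."""
    all_desc = np.vstack([local_descriptors(img) for img in train_images])
    return KMeans(n_clusters=k, n_init=10).fit(all_desc)

def bow_histogram(img, codebook, k=64):
    """Normalised histogram counting how many descriptors fall into each codeword."""
    desc = local_descriptors(img)
    hist = np.zeros(k)
    if len(desc):
        for word in codebook.predict(desc):
            hist[word] += 1
    return hist / max(hist.sum(), 1)
```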
(a) Bag-Of-Words features (b) Network Features
Figure 7.2: Result of a query for the two different features
7.1.3.2 Convolutional Neural Network
The learning of the features was achieved with a convolutional neural network. The model was implemented using the Theano library [6] [8]; the full code is available on GitHub1. The network was initialized with two convolutional layers, with 20 filters for the first layer and 50 filters for the second. Each convolutional layer is followed by a pooling layer, and two fully connected layers with respectively 225 and 100 neurons are added after them. The MNIST training set was randomly divided into 50,000 training images and 10,000 validation images in order to train the network.
The weights of the model are saved at the lowest validation error reached, which is 0.0115. The features of all the training images are then extracted using these weights.
7.1.3.3 Similarity Learning
Using the previously computed training features, a similarity function was learned with the OASIS algorithm. OASIS learns a weight matrix W that, given two images I1 and I2, computes their similarity through the form I1 * W * I2. The implementation of the algorithm is again made in Python2, based on the Matlab implementation of Chechik et al. [12].
7.1.4 Evaluation and further Improvements
7.1.4.1 Evaluation
To compare the two methods, we evaluate how well the system performs when retrieving images in answer to a query image. Query images come from the testing set and the system searches the training images for the most similar ones. Figure A.3 shows the images returned by the system in answer to a query with an image representing the digit 3. The web demo is implemented with the Django and AngularJS frameworks.
1https://github.com/leovetter/DeepNet
2https://gist.github.com/gwtaylor/94412c4808421acbd1d0
We can observe that the features extracted with the convolutional neural network seem to perform much better than the features from the bag-of-words model: when using the network features, all the retrieved images are of the same class as the query image, which is far from being the case when using the BoW features. This observation is confirmed by figure 7.3, which shows the precision-recall curve for the same query. We can see that for the network features the precision is always 1, which means that even results with a high rank are still of the same class as the query image. As for the BoW features, the precision slowly decreases as the recall increases: when we retrieve more results we have, as expected, a higher recall, but since not all results are relevant the precision decreases. Figure 7.4 shows the same results, but this time precision and recall are plotted as a function of the number of images retrieved.
Figure 7.3: Precision-Recall curve
(a) Bag-Of-Words features (b) Network Features
Figure 7.4: Precision and Recall for the two methods
7.1.4.2 Further Improvements
Because of the limited computational resources available to me, the dataset used is not a very challenging one and does not allow us to observe the limitations of the features from a convolutional neural network. It would be interesting to see how well they perform on datasets with more realistic images such as the Pascal VOC dataset or the ImageNet dataset. Also, in order to improve the performance of the bag-of-words model, better encodings could be used such as the Fisher encoding [59], the support vector encoding [80] or the locality-constrained linear (LLC) encoding [73].
7.2 Video Classification and Tagging
7.2.1 Introduction
The other experiments conducted to evaluate the efficiency of convolutional neural networks consist first in correctly classifying a video and then in extracting meaningful tags from the same video. More formally, the first task predicts one label corresponding to the global appearance of the video, whereas the second task extracts up to N tags relevant to the content of the video.
7.2.2 Dataset
For the two tasks mentioned above, experiments were conducted using a corpus of approximately seven thousand videos (rushes). All videos present semantic unity, that is, they are made of a single shot (no cuts) showing related content (for example, one video only displays a sea landscape and another only a city landscape), and their average length is 4-5 s. All the videos have been manually annotated with a theme describing the global landscape of the video and with tags denoting entities present in the video. A more detailed description of the corpus can be found in the appendix.
7.2.3 Methods
For both tasks, classification and tagging, the method used was to finetune a pretrained network. Two different networks were evaluated and compared against each other: the first uses the AlexNet architecture [48] and the second the Inception architecture [69]. Both models have been trained on the ImageNet database and are publicly available through the Caffe framework. All the following experiments have been made using the Caffe framework with its Python interface. As the networks only take images as input, and not videos, two key frames have been extracted for each video. For videos with no camera movement, two random images are extracted. For videos with camera movement, ORB features are extracted for each frame of the video, then a semantic clustering is done using the k-means algorithm with k set to two. The objective here is to extract two different images that best represent the video. The resulting set contains about 14,000 images. Eighty percent of the images are then used for the training set and twenty percent for the evaluation set.
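One plausible reading of this key-frame selection step is sketched below with OpenCV and scikit-learn: each frame is summarised by the mean of its ORB descriptors, frames are clustered into two groups, and the frame closest to each cluster centre is kept. This is an illustrative interpretation, not the exact code used for the experiments.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def two_keyframes(video_path):
    """Return two representative frames of a video with camera movement."""
    orb = cv2.ORB_create()
    cap = cv2.VideoCapture(video_path)
    frames, signatures = [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        _, desc = orb.detectAndCompute(gray, None)
        if desc is not None:
            frames.append(frame)
            signatures.append(desc.mean(axis=0))   # crude per-frame signature
    cap.release()
    if len(frames) < 2:
        return frames
    signatures = np.array(signatures)
    km = KMeans(n_clusters=2, n_init=10).fit(signatures)
    keyframes = []
    for c in range(2):
        dists = np.linalg.norm(signatures - km.cluster_centers_[c], axis=1)
        keyframes.append(frames[int(np.argmin(dists))])
    return keyframes
```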
7.2.3.1 Classification
For the classification task, finetuning a pretrained network to discriminate between custom classes implies replacing the last fully connected layer of the network to adapt it to the new problem. Specifically, given that we want to learn to discriminate between 22 themes, we specify a fully connected layer with 22 outputs. Our corpus of videos is then split into a training and a testing set and the network is retrained on the training set. For the training, the values of all the weights, except those of the new layer, are initialized with the values computed during the pretraining, and the weights of the new layer are initialized with the Xavier algorithm [27], which automatically determines the scale of initialization based on the number of input and output neurons. The bias is initialized to 0. On a computer running Linux (Ubuntu 14.04) with a GeForce GTX Titan X (12 GB of memory), an Intel Core i7-5960X processor and 23 GB of RAM, the training takes only a few minutes to complete.
(a) Alexnet Model (b) Inception Model
Figure 7.5: Learning curves for the classification task
7.2.3.2 Tagging
As for the classification task, finetuning a network to predict several outputs implies replacing the last fully connected layer by a layer with as many outputs as the number of tags we want to predict. Also, since we are no longer dealing with a simple classification scheme (several tags can be predicted for one image), the loss layer has to be adapted; specifically, the sigmoid cross-entropy loss is used. For the training, previously existing layers are again initialized with values from the pretraining and new layers are initialized following the same procedure as for the classification task. As for the classification task, the learning is very fast.
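For reference, the sigmoid cross-entropy loss treats each output neuron as an independent binary classifier. The following NumPy sketch illustrates the computation for one image; it is not the Caffe layer itself.

```python
import numpy as np

def sigmoid_cross_entropy(logits, targets):
    """Sum over all tags of the binary cross-entropy between sigmoid(logit)
    and the 0/1 target, as used for multi-label (tagging) training."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-12  # numerical safety for log(0)
    return float(-np.sum(targets * np.log(probs + eps)
                         + (1 - targets) * np.log(1 - probs + eps)))

# One image with three tag outputs, two of which are present.
print(sigmoid_cross_entropy(np.array([2.0, -1.0, 0.5]), np.array([1, 0, 1])))
```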
(a) Alexnet Model
(b) Inception Model
Figure 7.6: Learning curves for the tagging tasks
7.2.4 Evaluation and further improvements
7.2.4.1 Evaluation
The models are evaluated on the evaluation set; precision, recall and F1-score are used for the classification task. The best results are obtained with the Inception architecture: the average precision and recall are both 0.91. The detailed results for each class are shown below. For the tagging task, precision and recall are used and the best results are again obtained with the Inception architecture. The network achieves good results, yet not as good as for the classification task. Again, the detailed results are shown below.
Precision Recall
Person 0.58 0.59
Animal 0.93 0.61
Nature 0.92 0.60
Water 0.96 0.60
Building 0.92 0.60
Figure 7.7: Precision and Recall for the Tagging task
Precision Recall f1-score
Campagne / Zone rural 0.91 0.85 0.88
Canyon 0.74 0.96 0.84
Culture 1.00 0.50 0.67
Désert 0.91 0.84 0.87
Enfants 1.00 1.00 1.00
Evenement 1.00 0.67 0.80
Forêt 0.74 0.83 0.78
Gastronomie 0.87 0.74 0.80
Jardins 1.00 0.75 0.86
Marais 1.00 1.00 1.00
Mer 1.00 0.78 0.88
Montagne 0.94 0.83 0.88
Montagne / Zone urbaine 0.83 0.91 0.87
Patrimoine 1.00 0.67 0.80
Plaine 1.00 1.00 1.00
Marais 1.00 0.91 0.95
Plein air 0.83 0.83 0.83
Rizière 1.00 1.00 1.00
Savane 1.00 0.95 0.97
Ville / Zone côtière 1.00 0.86 0.93
Ville / Zone urbaine 0.95 0.96 0.96
avg / Total 0.91 0.91 0.91
Figure 7.8: Precision, Recall and F1-score for the Classification task
7.2.4.2 Further Improvements
7.2.4.2.1 Data Augmentation In order to improve the performance obtained by the networks, we can consider increasing the size of the training set. Indeed, we have only about 11,000 images for the training, which is a very small dataset when dealing with neural networks. Data augmentation can be realized in several ways. In [48] two forms of data augmentation are used. The first form consists in generating image translations and horizontal reflections from the original images: random 224*224 crops and their horizontal reflections are extracted from the 256*256 original images, allowing the network to capture some translation invariance and increasing the size of the training set by a factor of 2048, which they claim greatly reduces overfitting. The second form of data augmentation consists in altering the intensities of the RGB channels of the training set; this scheme reduced their top-1 error rate in the ILSVRC-2012 competition by over 1%. Another common way to generate more training data is to perform color jittering, that is, to change the contrast of the image.
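A minimal version of these augmentations on a single HxWx3 image array can be sketched in NumPy as follows; the crop size, flip probability and jitter range are example values, not the exact settings of [48].

```python
import numpy as np

def augment(image, crop_size=224, rng=np.random.default_rng()):
    """Random crop, random horizontal flip and a simple per-channel
    intensity jitter applied to one HxWx3 uint8 image."""
    h, w, _ = image.shape
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    patch = image[top:top + crop_size, left:left + crop_size].astype(np.float32)
    if rng.random() < 0.5:                     # horizontal reflection
        patch = patch[:, ::-1]
    patch *= rng.uniform(0.8, 1.2, size=3)     # per-channel intensity jitter
    return np.clip(patch, 0, 255).astype(np.uint8)

augmented = augment(np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8))
print(augmented.shape)  # (224, 224, 3)
```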
7.2.4.2.2 Recurrent Neural Networks and Graphical Models Another approach to reach better performance would be to investigate how recurrent neural networks perform. Indeed, traditional convolutional neural networks are not able to combine information from different frames of the video: they predict an output based solely on the information present in one frame. On the contrary, recurrent neural networks are designed to deal with sequences of inputs and thus might be able to combine information from different frames of the video to achieve better performance. Investigating how different algorithms such as probabilistic graphical models perform might also be interesting. Graphical models such as Markov random fields [46] or conditional random fields [37] have been used with some success for image processing tasks. One of the advantages of such models is that they can model the dependencies between pixels or groups of pixels (superpixels), so they can make predictions for an image by relying not only on features but also, for instance, on spatial dependencies between pixels (nearby pixels are more likely to share the same label than far away pixels). For the features used by the graphical models, it might be judicious to use features extracted from a convolutional neural network.
7.2.4.2.3 Semantic resources A final idea to improve the classification and tagging of the videos is to rely on external semantic resources. Indeed, we have the geolocation of each video, so we can query a spatial database to retrieve nearby buildings or monuments. Knowing, for example, that a video has been taken near the Eiffel Tower, we can combine this information with the results of the convolutional neural networks by increasing the probability that a building appears in the video.
Bibliography
[1] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pas-
cal Fua, and Sabine Susstrunk. Slic superpixels compared to state-of-the-
art superpixel methods. Pattern Analysis and Machine Intelligence, IEEE
Transactions on, 34(11):2274–2282, 2012.
[2] Mohamed N Ahmed, Sameh M Yamany, Nevin Mohamed, Aly A Farag,
and Thomas Moriarty. A modified fuzzy c-means algorithm for bias field
estimation and segmentation of mri data. Medical Imaging, IEEE Trans-
actions on, 21(3):193–199, 2002.
[3] Agus Zainal Arifin and Akira Asano. Image thresholding by histogram
segmentation using discriminant analysis. In Proceedings of Indonesia–
Japan Joint Scientific Symposium, pages 169–174, 2004.
[4] Dana H Ballard. Generalizing the hough transform to detect arbitrary
shapes. Pattern recognition, 13(2):111–122, 1981.
[5] EA Bashkov and NS Kostyukova. Effectiveness estimation of image re-
trieval by 2d color histogram. Journal of Automation and Information
Sciences, 8(11):74–80, 2006.
[6] Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J.
Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio.
Theano: new features and speed improvements. Deep Learning and Unsu-
pervised Feature Learning NIPS 2012 Workshop, 2012.
[7] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust
features. In Computer vision–ECCV 2006, pages 404–417. Springer, 2006.
[8] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Raz-
van Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley,
and Yoshua Bengio. Theano: a CPU and GPU math expression compiler.
In Proceedings of the Python for Scientific Computing Conference (SciPy),
June 2010. Oral Presentation.
[9] Gary Bradski. The OpenCV library. Dr. Dobb's Journal of Software Tools, 2000.
[10] Ken Chatfield, Victor S Lempitsky, Andrea Vedaldi, and Andrew Zisser-
man. The devil is in the details: an evaluation of recent feature encoding
methods. In BMVC, volume 2, page 8, 2011.
[11] Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman.
Return of the devil in the details: Delving deep into convolutional nets.
arXiv preprint arXiv:1405.3531, 2014.
[12] Gal Chechik, Varun Sharma, Uri Shalit, and Samy Bengio. Large scale on-
line learning of image similarity through ranking. The Journal of Machine
Learning Research, 11:1109–1135, 2010.
[13] Keh-Shih Chuang, Hong-Long Tzeng, Sharon Chen, Jay Wu, and Tzong-
Jer Chen. Fuzzy c-means clustering with spatial information for image seg-
mentation. computerized medical imaging and graphics, 30(1):9–15, 2006.
[14] Dan Ciresan, Ueli Meier, and Jürgen Schmidhuber. Multi-column deep
neural networks for image classification. In Computer Vision and Pattern
Recognition (CVPR), 2012 IEEE Conference on, pages 3642–3649. IEEE,
2012.
[15] Timothee Cour, Florence Benezit, and Jianbo Shi. Spectral segmentation
with multiscale graph decomposition. In Computer Vision and Pattern
Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on,
volume 2, pages 1124–1131. IEEE, 2005.
[16] Michel Crucianu, Marin Ferecatu, and Nozha Boujemaa. Relevance feed-
back for image retrieval: a short survey. Report of the DELOS2 European
Network of Excellence (FP6), 2004.
[17] Gabriella Csurka, Christopher Dance, Lixin Fan, Jutta Willamowski, and
Cédric Bray. Visual categorization with bags of keypoints. In Workshop
on statistical learning in computer vision, ECCV, volume 1, pages 1–2.
Prague, 2004.
[18] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human
detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005.
IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE,
2005.
[19] Ritendra Datta, Dhiraj Joshi, Jia Li, and James Z. Wang. Image retrieval:
Ideas, influences, and trends of the new age. ACM Computing Surveys, 40,
2008.
[20] Ritendra Datta, Dhiraj Joshi, Jia Li, and James Z Wang. Image retrieval:
Ideas, influences, and trends of the new age. ACM Computing Surveys
(CSUR), 40(2):5, 2008.
[21] Minh N Do and Martin Vetterli. Wavelet-based texture retrieval using gen-
eralized gaussian density and kullback-leibler distance. Image Processing,
IEEE Transactions on, 11(2):146–158, 2002.
[22] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric
Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature
for generic visual recognition. arXiv preprint arXiv:1310.1531, 2013.
[23] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn,
and Andrew Zisserman. The pascal visual object classes (voc) challenge.
International journal of computer vision, 88(2):303–338, 2010.
[24] Pedro F Felzenszwalb and Daniel P Huttenlocher. Efficient graph-based
image segmentation. International Journal of Computer Vision, 59(2):167–
181, 2004.
[25] Kunihiko Fukushima. Neocognitron: A self-organizing neural network
model for a mechanism of pattern recognition unaffected by shift in po-
sition. Biological cybernetics, 36(4):193–202, 1980.
[26] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich
feature hierarchies for accurate object detection and semantic segmenta-
tion. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 580–587, 2014.
[27] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training
deep feedforward neural networks. In International conference on artificial
intelligence and statistics, pages 249–256, 2010.
[28] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier
neural networks. In International Conference on Artificial Intelligence and
Statistics, pages 315–323, 2011.
[29] Yunchao Gong, Yangqing Jia, Thomas Leung, Alexander Toshev, and
Sergey Ioffe. Deep convolutional ranking for multilabel image annotation.
arXiv preprint arXiv:1312.4894, 2013.
[30] Simona E Grigorescu, Nicolai Petkov, and Peter Kruizinga. Comparison of
texture features based on gabor filters. Image Processing, IEEE Transac-
tions on, 11(10):1160–1167, 2002.
[31] Venkat N Gudivada and Vijay V Raghavan. Content based image retrieval
systems. Computer, 28(9):18–22, 1995.
[32] James Hafner, Harpreet S Sawhney, Will Equitz, Myron Flickner, and
Wayne Niblack. Efficient color histogram indexing for quadratic form dis-
tance functions. Pattern Analysis and Machine Intelligence, IEEE Trans-
actions on, 17(7):729–736, 1995.
[33] Robert M Haralick. Statistical and structural approaches to texture. Pro-
ceedings of the IEEE, 67(5):786–804, 1979.
[34] Chris Harris and Mike Stephens. A combined corner and edge detector. In
Alvey vision conference, volume 15, page 50. Citeseer, 1988.
[35] Kaiming He and Jian Sun. Convolutional neural networks at constrained
time cost. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 5353–5360, 2015.
[36] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual
learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[37] Xuming He, Richard S Zemel, and Miguel Á. Carreira-Perpiñán. Multi-
scale conditional random fields for image labeling. In Computer vision
and pattern recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE
computer society conference on, volume 2, pages II–695. IEEE, 2004.
[38] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and
Ruslan R Salakhutdinov. Improving neural networks by preventing co-
adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[39] Jing Huang, S Ravi Kumar, Mandar Mitra, Wei-Jing Zhu, and Ramin
Zabih. Spatial color indexing and applications. International Journal of
Computer Vision, 35(3):245–268, 1999.
[40] David H Hubel and Torsten N Wiesel. Receptive fields of single neurones in
the cat’s striate cortex. The Journal of physiology, 148(3):574–591, 1959.
[41] Eero Hyvönen, Samppa Saarela, Avril Styrman, and Kim Viljanen.
Ontology-based image retrieval. In WWW (Posters), 2003.
[42] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating
deep network training by reducing internal covariate shift. arXiv preprint
arXiv:1502.03167, 2015.
[43] Tommi Jaakkola, David Haussler, et al. Exploiting generative models in
discriminative classifiers. Advances in neural information processing sys-
tems, pages 487–493, 1999.
[44] Yushi Jing, David Liu, Dmitry Kislyuk, Andrew Zhai, Jiajing Xu, Jeff
Donahue, and Sarah Tavel. Visual search at pinterest. arXiv preprint
arXiv:1505.07647, 2015.
[45] Charles E Kahn Jr. Artificial intelligence in radiology: decision support
systems. Radiographics, 14(4):849–861, 1994.
[46] Zoltan Kato and Ting-Chuen Pong. A markov random field image seg-
mentation model for color textured images. Image and Vision Computing,
24(10):1103–1114, 2006.
Content Based Image Retrieval
Content Based Image Retrieval
Content Based Image Retrieval
Content Based Image Retrieval
Content Based Image Retrieval
Content Based Image Retrieval

Mais conteúdo relacionado

Mais procurados

Content Based Image Retrieval
Content Based Image Retrieval Content Based Image Retrieval
Content Based Image Retrieval Swati Chauhan
 
Cbir final ppt
Cbir final pptCbir final ppt
Cbir final pptrinki nag
 
Color and texture based image retrieval
Color and texture based image retrievalColor and texture based image retrieval
Color and texture based image retrievaleSAT Journals
 
Content based image retrieval using clustering Algorithm(CBIR)
Content based image retrieval using clustering Algorithm(CBIR)Content based image retrieval using clustering Algorithm(CBIR)
Content based image retrieval using clustering Algorithm(CBIR)Raja Sekar
 
CBIR For Medical Imaging...
CBIR For Medical  Imaging...CBIR For Medical  Imaging...
CBIR For Medical Imaging...Isha Sharma
 
Content Based Image Retrieval
Content Based Image RetrievalContent Based Image Retrieval
Content Based Image RetrievalPrem kumar
 
Content based image retrieval
Content based image retrievalContent based image retrieval
Content based image retrievalrubaiyat11
 
Content Based Image and Video Retrieval Algorithm
Content Based Image and Video Retrieval AlgorithmContent Based Image and Video Retrieval Algorithm
Content Based Image and Video Retrieval AlgorithmAkshit Bum
 
Content Based Image Retrieval
Content Based Image RetrievalContent Based Image Retrieval
Content Based Image RetrievalSOURAV KAR
 
Content Based Image Retrieval
Content Based Image RetrievalContent Based Image Retrieval
Content Based Image RetrievalRayan Dasoriya
 
Multimedia content based retrieval slideshare.ppt
Multimedia content based retrieval slideshare.pptMultimedia content based retrieval slideshare.ppt
Multimedia content based retrieval slideshare.pptgovintech1
 
Week06 bme429-cbir
Week06 bme429-cbirWeek06 bme429-cbir
Week06 bme429-cbirIkram Moalla
 
Content Based Image Retrieval Using Dominant Color and Texture Features
Content Based Image Retrieval Using Dominant Color and Texture FeaturesContent Based Image Retrieval Using Dominant Color and Texture Features
Content Based Image Retrieval Using Dominant Color and Texture FeaturesIJMTST Journal
 

Mais procurados (20)

Content Based Image Retrieval
Content Based Image Retrieval Content Based Image Retrieval
Content Based Image Retrieval
 
Cbir final ppt
Cbir final pptCbir final ppt
Cbir final ppt
 
CBIR
CBIRCBIR
CBIR
 
Color and texture based image retrieval
Color and texture based image retrievalColor and texture based image retrieval
Color and texture based image retrieval
 
Content based image retrieval using clustering Algorithm(CBIR)
Content based image retrieval using clustering Algorithm(CBIR)Content based image retrieval using clustering Algorithm(CBIR)
Content based image retrieval using clustering Algorithm(CBIR)
 
Image Indexing and Retrieval
Image Indexing and RetrievalImage Indexing and Retrieval
Image Indexing and Retrieval
 
CBIR For Medical Imaging...
CBIR For Medical  Imaging...CBIR For Medical  Imaging...
CBIR For Medical Imaging...
 
Content Based Image Retrieval
Content Based Image RetrievalContent Based Image Retrieval
Content Based Image Retrieval
 
CBIR by deep learning
CBIR by deep learningCBIR by deep learning
CBIR by deep learning
 
CBIR
CBIRCBIR
CBIR
 
Content based image retrieval
Content based image retrievalContent based image retrieval
Content based image retrieval
 
Image search engine
Image search engineImage search engine
Image search engine
 
Content Based Image and Video Retrieval Algorithm
Content Based Image and Video Retrieval AlgorithmContent Based Image and Video Retrieval Algorithm
Content Based Image and Video Retrieval Algorithm
 
CBIR with RF
CBIR with RFCBIR with RF
CBIR with RF
 
Content Based Image Retrieval
Content Based Image RetrievalContent Based Image Retrieval
Content Based Image Retrieval
 
Gi3411661169
Gi3411661169Gi3411661169
Gi3411661169
 
Content Based Image Retrieval
Content Based Image RetrievalContent Based Image Retrieval
Content Based Image Retrieval
 
Multimedia content based retrieval slideshare.ppt
Multimedia content based retrieval slideshare.pptMultimedia content based retrieval slideshare.ppt
Multimedia content based retrieval slideshare.ppt
 
Week06 bme429-cbir
Week06 bme429-cbirWeek06 bme429-cbir
Week06 bme429-cbir
 
Content Based Image Retrieval Using Dominant Color and Texture Features
Content Based Image Retrieval Using Dominant Color and Texture FeaturesContent Based Image Retrieval Using Dominant Color and Texture Features
Content Based Image Retrieval Using Dominant Color and Texture Features
 

Destaque

Analise e comparação de métricas de similaridade
Analise e comparação de métricas de similaridadeAnalise e comparação de métricas de similaridade
Analise e comparação de métricas de similaridadeAlberth David
 
Trabajo Miguel Historia de los ordenadores
Trabajo Miguel  Historia de los ordenadoresTrabajo Miguel  Historia de los ordenadores
Trabajo Miguel Historia de los ordenadoresMiguel Agudo Postigo
 
Giáo trình kế toán hành chính sự nghiệp data4u
Giáo trình kế toán hành chính sự nghiệp data4uGiáo trình kế toán hành chính sự nghiệp data4u
Giáo trình kế toán hành chính sự nghiệp data4uXephang Daihoc
 
Додавання та віднімання натуральних чисел
Додавання та віднімання натуральних чиселДодавання та віднімання натуральних чисел
Додавання та віднімання натуральних чиселelena_kalinina
 
Roles of different financial institutions
Roles of different financial institutionsRoles of different financial institutions
Roles of different financial institutionsTanzil Al Gazmir
 
откуда пришла в наш дом вода
откуда пришла в наш дом водаоткуда пришла в наш дом вода
откуда пришла в наш дом водаTatjana Videva
 
Content obesity: An organisation's silent killer
Content obesity: An organisation's silent killerContent obesity: An organisation's silent killer
Content obesity: An organisation's silent killerSally Bagshaw
 

Destaque (18)

Analise e comparação de métricas de similaridade
Analise e comparação de métricas de similaridadeAnalise e comparação de métricas de similaridade
Analise e comparação de métricas de similaridade
 
Image004.jpg
Image004.jpgImage004.jpg
Image004.jpg
 
Image018.jpg
Image018.jpgImage018.jpg
Image018.jpg
 
Trabajo Miguel Historia de los ordenadores
Trabajo Miguel  Historia de los ordenadoresTrabajo Miguel  Historia de los ordenadores
Trabajo Miguel Historia de los ordenadores
 
Image008.jpg
Image008.jpgImage008.jpg
Image008.jpg
 
Linkedin Powerpoint
Linkedin PowerpointLinkedin Powerpoint
Linkedin Powerpoint
 
Giáo trình kế toán hành chính sự nghiệp data4u
Giáo trình kế toán hành chính sự nghiệp data4uGiáo trình kế toán hành chính sự nghiệp data4u
Giáo trình kế toán hành chính sự nghiệp data4u
 
Image015.jpg
Image015.jpgImage015.jpg
Image015.jpg
 
Додавання та віднімання натуральних чисел
Додавання та віднімання натуральних чиселДодавання та віднімання натуральних чисел
Додавання та віднімання натуральних чисел
 
Image019.jpg
Image019.jpgImage019.jpg
Image019.jpg
 
Roles of different financial institutions
Roles of different financial institutionsRoles of different financial institutions
Roles of different financial institutions
 
T.i.c aprendizaje obicuo
T.i.c aprendizaje obicuo T.i.c aprendizaje obicuo
T.i.c aprendizaje obicuo
 
откуда пришла в наш дом вода
откуда пришла в наш дом водаоткуда пришла в наш дом вода
откуда пришла в наш дом вода
 
Image006.jpg
Image006.jpgImage006.jpg
Image006.jpg
 
Content obesity: An organisation's silent killer
Content obesity: An organisation's silent killerContent obesity: An organisation's silent killer
Content obesity: An organisation's silent killer
 
Acidentes com máquinas riscos e prevenção
Acidentes com máquinas   riscos e prevençãoAcidentes com máquinas   riscos e prevenção
Acidentes com máquinas riscos e prevenção
 
Folha de setembro
Folha de setembroFolha de setembro
Folha de setembro
 
Συναρτήσεις, επανάληψη
Συναρτήσεις, επανάληψη Συναρτήσεις, επανάληψη
Συναρτήσεις, επανάληψη
 

Semelhante a Content Based Image Retrieval

Semelhante a Content Based Image Retrieval (20)

Upstill_thesis_2000
Upstill_thesis_2000Upstill_thesis_2000
Upstill_thesis_2000
 
btpreport
btpreportbtpreport
btpreport
 
Trade-off between recognition an reconstruction: Application of Robotics Visi...
Trade-off between recognition an reconstruction: Application of Robotics Visi...Trade-off between recognition an reconstruction: Application of Robotics Visi...
Trade-off between recognition an reconstruction: Application of Robotics Visi...
 
mscthesis
mscthesismscthesis
mscthesis
 
Grl book
Grl bookGrl book
Grl book
 
Sona project
Sona projectSona project
Sona project
 
Im-ception - An exploration into facial PAD through the use of fine tuning de...
Im-ception - An exploration into facial PAD through the use of fine tuning de...Im-ception - An exploration into facial PAD through the use of fine tuning de...
Im-ception - An exploration into facial PAD through the use of fine tuning de...
 
Report
ReportReport
Report
 
Thesis_Prakash
Thesis_PrakashThesis_Prakash
Thesis_Prakash
 
Robustness in Deep Learning - Single Image Denoising using Untrained Networks...
Robustness in Deep Learning - Single Image Denoising using Untrained Networks...Robustness in Deep Learning - Single Image Denoising using Untrained Networks...
Robustness in Deep Learning - Single Image Denoising using Untrained Networks...
 
High Performance Traffic Sign Detection
High Performance Traffic Sign DetectionHigh Performance Traffic Sign Detection
High Performance Traffic Sign Detection
 
Final Report - Major Project - MAP
Final Report - Major Project - MAPFinal Report - Major Project - MAP
Final Report - Major Project - MAP
 
main
mainmain
main
 
Data mining of massive datasets
Data mining of massive datasetsData mining of massive datasets
Data mining of massive datasets
 
Mining of massive datasets
Mining of massive datasetsMining of massive datasets
Mining of massive datasets
 
Mining of massive datasets
Mining of massive datasetsMining of massive datasets
Mining of massive datasets
 
Master's Thesis
Master's ThesisMaster's Thesis
Master's Thesis
 
main
mainmain
main
 
Thesis
ThesisThesis
Thesis
 
Neural Networks on Steroids
Neural Networks on SteroidsNeural Networks on Steroids
Neural Networks on Steroids
 

Content Based Image Retrieval

  • 1.
  • 2. Abstract The phenomenal expansion of the web, the increasing number of recording de- vices, the emergence of social medias, all these phenomena lead to the pro- duction of a constantly increasing number of digital images. Theses images often end up stored without any additional information or metadata associated. Enhancing the machine ability to analyze the content of a picture to discover semantic knowledge is therefore mandatory if we want to improve the use which is made of these data. This thesis study the field known as Content-Based Image Retrieval that deal with the image retrieval problem when no metadata is involved. Recently new methods, deep learning algorithms, have emerged and have proven to greatly outperform traditional methods for all kinds of tasks. Therefore a strong em- phasis will be made on these new techniques throughout the study and several experiments are presented that assess the dominance of deep learning algo- rithms. 1
  • 3.
  • 4. Contents I State Of The Art 5 1 Content-Based Image Retrieval Overview 6 1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.2 Real World Applications . . . . . . . . . . . . . . . . . . . . . . . 7 1.3 Components of CBIR systems . . . . . . . . . . . . . . . . . . . . 8 1.3.1 Database Management . . . . . . . . . . . . . . . . . . . . 9 1.3.2 Query Specification . . . . . . . . . . . . . . . . . . . . . 10 1.3.3 Relevance Feedback . . . . . . . . . . . . . . . . . . . . . 10 2 Segmentation 11 2.1 Clustering methods . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.2 Histogram-based methods . . . . . . . . . . . . . . . . . . . . . . 12 2.3 Region-growing methods . . . . . . . . . . . . . . . . . . . . . . . 12 2.4 Graph partitioning methods . . . . . . . . . . . . . . . . . . . . . 12 2.4.1 Cut Criterion . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.4.2 Graphical Probabilistic Model . . . . . . . . . . . . . . . . 13 3 Feature Extraction 15 3.1 Global and Local Features . . . . . . . . . . . . . . . . . . . . . . 15 3.2 Handcrafted Features . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.2.1 Color Features . . . . . . . . . . . . . . . . . . . . . . . . 16 3.2.2 Shape Features . . . . . . . . . . . . . . . . . . . . . . . . 17 3.2.3 Texture Features . . . . . . . . . . . . . . . . . . . . . . . 17 3.3 Shallow methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.3.1 Visual Bag-of-Word . . . . . . . . . . . . . . . . . . . . . 18 3.3.2 Improved Fisher Kernel . . . . . . . . . . . . . . . . . . . 19 4 Convolutional Neural Network 20 4.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 4.1.1 Convolutional Layer . . . . . . . . . . . . . . . . . . . . . 21 4.1.2 Pooling Layer . . . . . . . . . . . . . . . . . . . . . . . . . 22 4.1.3 Local Response Normalization Layer . . . . . . . . . . . . 23 4.1.4 Fully Connected Layer . . . . . . . . . . . . . . . . . . . . 23 4.1.5 Softmax and Loss Layer . . . . . . . . . . . . . . . . . . . 24 2
4.2 Training 24
4.2.1 Backpropagation and Gradient Descent 24
4.2.2 L1 and L2 regularization 26
4.2.3 Batch normalization 26
4.2.4 Dropout 26
4.3 Research Trends 27
4.3.1 Convolutional Neural Network Architecture 27
4.3.2 Enhanced Convolutional Neural Network 29
4.3.3 Transfer Learning 30
5 Feature Interpretation 32
5.1 Similarity Measure 32
5.1.1 Fixed Measure 33
5.1.2 Similarity Learning 33
5.2 Ontology 33
5.3 Machine Learning 34
6 Evaluation of CBIR Systems 35
6.1 Datasets 35
6.1.1 Pascal Voc 35
6.1.2 Caltech 35
6.1.3 ImageNet 36
6.2 Tasks Evaluated 36
6.2.1 Image Classification 36
6.2.2 Object Detection 36
6.2.3 Image Similarity 36
6.2.4 Multi-class Image Segmentation and Labeling 37
6.3 Evaluation Metrics 37
6.3.1 Precision and Recall 37
6.3.2 Top-k Error 38
6.3.3 Confusion Matrix 38
II Contribution 39
7 Experiments 40
7.1 Image Similarity 40
7.1.1 Introduction 40
7.1.2 Dataset 40
7.1.3 Methods 41
7.1.4 Evaluation and further Improvements 42
7.2 Video Classification and Tagging 44
7.2.1 Introduction 44
7.2.2 Dataset 44
7.2.3 Methods 44
7.2.4 Evaluation and further improvements 46
Appendices 56
A Video Corpus Description 57
Part I
State Of The Art
Chapter 1
Content-Based Image Retrieval Overview

1.1 Introduction

It is often said that an image is worth a thousand words. This idiom refers to the notion that a complex idea can be conveyed with a single still image. Images are indeed a powerful means of communication and we interact with them in our everyday life. We use them to freeze a moment and share it with our relatives and friends, to illustrate a written article or to capture an emotion in a painting. Nowadays an increasing number of devices, cameras, satellites and video sensors allow us to automatically record digital images, which are then stored in ever-growing databases. The phenomenal expansion of the web, which enables us to store those images online, also contributes to this accumulation of images in very large databases. To take advantage of this increasing amount of data we need to develop tools that efficiently process images and retrieve them in a meaningful fashion. This is the problem addressed by Content-Based Image Retrieval. Formally, it can be defined as any technology that helps to organize digital pictures based on their content [19]. The key idea is that no metadata is involved and only the raw pixels of the images are used to infer semantic knowledge. It is a growing field which spans numerous research areas such as Information Retrieval, Computer Vision and Machine Learning.

Humans are naturally gifted at deducing information from an image after looking at it for only a few seconds, and it is interesting to understand how we perform this task in order to reproduce it in a machine. In doing so we usually rely on at least three different factors:

Our sensory abilities. Our sensory abilities refer to our capacity to acquire visual inputs through our eyes and to the processing of these inputs by our visual cortex.
The context. The context of an image includes everything that is not in the image but contributes to its interpretation. It is well known that an image can have different meanings depending on its context.

Our experiences. Different experiences make us respond differently when confronted with a situation. This holds for the interpretation of an image, which will naturally differ between individuals.

In order to successfully retrieve images for an end user, a machine should theoretically be able to incorporate these three factors into its process. As one can imagine, such an objective is especially challenging for many reasons. First, we are far from completely understanding how our brain processes the visual inputs it receives. Then, the context of an image is not always available, especially when dealing only with raw pixels without the help of additional metadata. Finally, individuals with different cultures can have very different interpretations of the same image, and retrieving the right images for a user would imply having some knowledge of their background and intentions.

1.2 Real World Applications

Even if the process of making good interpretations based on the content of an image is far from being mastered, numerous real-world applications already make use of Content-Based Image Retrieval technologies. Among these applications we can differentiate between those that operate on images from a very broad domain and those that operate on a narrow domain. In their review, Smeulders et al. [65] define a broad domain as having "an unlimited and unpredicted variability even for the same semantic meaning". Web applications and general visual search engines often operate on a broad domain since images can come from many different users, and similar objects can have different sizes, shapes or illuminations and can even be partially occluded. The first visual search engine was developed by IBM with their commercial system called QBIC (Query by Image Content) [55]. QBIC enables a user to query an image database with image queries. Many possibilities are given to the user to specify the query: querying by sketch, querying by specifying particular colors, textures and shapes, or querying with an image similar to the ones the user is looking for. Depending on the type of the query, different similarity measures are then used to retrieve the most similar images. To further help the search, semi-automatic detection of objects or points of interest is also performed when a new image is inserted into the database. Since this first attempt, other systems (TinEye, Pinterest, Google Query By Image) have been developed which rely on more advanced techniques. For instance, Pinterest is a web and mobile application where users can upload images of interest into collections known as pinboards. Pinterest then uses those images to look for similar images and recommend them to the user. While QBIC relied only on handcrafted features, Pinterest relies, among other algorithms, on feature learning and on algorithms called Convolutional Neural Networks to extract features from images [44].

As opposed to a broad domain, a narrow domain is defined as having a limited and predictable variability in all relevant aspects of its appearance. When dealing with a narrow domain, knowledge of the domain can be used efficiently to design good feature detectors. Illustrations of narrow domains can be found in many applications such as medical diagnosis or face recognition. In the medical domain, due in part to the steadily growing rate of image production, Content-Based Image Retrieval can play a key role in many applications [54]. It can for example enable the automatic annotation and classification of medical images. Content-Based Image Retrieval can also be used to help in the clinical decision-making process. Indeed, when confronted with a medical case, a useful scenario would be to supply the doctor with similar cases by finding other images of the same disease, allowing a decision to be taken with more information at hand. Another application would be, for instance within the radiology department, to develop tools that enable automatic diagnosis of diseases [45]. Content-Based Image Retrieval techniques might also find useful applications in many more domains. For instance, in digital art galleries or online museum collections they can be used to bring to users art works similar to the ones they browsed. For law enforcement and criminal investigation they could help investigators to process pictures from security cameras.

1.3 Components of CBIR systems

All typical Content-Based Image Retrieval systems perform common steps to achieve the retrieval of images, as illustrated by figure 1.1. The first step that might be performed is segmentation, which can be useful to isolate meaningful regions from the background, for example. As a second step, feature extraction is performed. The extraction of features can rely on two different methods: either it is done through handcrafted features, that is humans have to design a specific feature detector for the problem at hand, or it is done through feature learning, that is we let the machine learn features from training samples. In the domain of image processing, one model has started to show great promise for learning features and is known as the Convolutional Neural Network. One of the great challenges of feature extraction is to reduce what is known as the semantic gap, that is the difference between the interpretation of the image and the features that have been computed. Finally, one has to interpret these features, which is often done through a similarity measure or a machine learning classifier. The remainder of this work details each of these steps.
Figure 1.1: Content-Based Image Retrieval System Components

While every Content-Based Image Retrieval system usually performs the operations described above, its performance is not limited to them. Below is a review of the main components that typically compose a CBIR system; improvement in each of these components is required to achieve better systems [31].

1.3.1 Database Management

When an image database grows very large, efficient access methods should be developed in order to speed up the retrieval of images. These methods should act as a filter, with only relevant images being retrieved from the storage system and tested for similarity with the query image. Most data structures used today for indexing are spatial access methods: they assume that all image features are numeric and represent an image as a point in a multidimensional space. These methods can be classified into three categories: space partitioning, data partitioning and distance-based techniques [65]. Space partitioning techniques (k-d tree, k-d B-tree...) organize the feature space as a tree with nodes standing for regions of the space. When a region contains too many points it is split into subregions whose nodes become the children of the original region. Data partitioning techniques (R-tree, R+-tree, R*-tree...) associate with each point a region representing the neighborhood of that point, with leaf nodes corresponding to minimum bounding regions of a set of vectors. Distance-based index structures (VP-tree, MVP-tree, M-tree...) are applicable to general metric spaces; their primary idea is to pick an example point and divide the rest of the feature space into concentric rings around the example. Nevertheless, when the feature vectors being indexed contain a considerable number of dimensions the performance of the previous indices greatly decreases. This is known as the curse of dimensionality phenomenon.
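As a concrete illustration of the space-partitioning idea, the minimal sketch below indexes a set of image feature vectors with a k-d tree and retrieves the nearest neighbours of a query vector. The random features and the choice of a 64-dimensional space are placeholders; the scipy tree is used purely for illustration and, as stated above, such structures degrade when the dimensionality grows.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical database of 10,000 images, each described by a 64-d feature vector.
rng = np.random.default_rng(0)
database_features = rng.random((10_000, 64))

# Build a space-partitioning index (k-d tree) over the feature space.
index = cKDTree(database_features)

# Query: retrieve the 5 database images closest to a query feature vector.
query_features = rng.random(64)
distances, image_ids = index.query(query_features, k=5)
print(image_ids, distances)
```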
1.3.2 Query Specification

Queries made by the user can take several forms, which logically influence the retrieval process. Query by example, where the user provides an image and the similarity is computed based on the entire scene, is a common way to perform the request. Another method is to let the user specify a region of interest (a person or an object for example) and search for images with similar regions. Yet another way is for the user to draw a sketch of the item or landscape they are interested in. How the query is specified depends on the users of the system. Casual users will probably prefer to make their request using natural language ("Find me a mountain with snow and with a bird in the sky") or with an image as an example, whereas domain experts might want more sophisticated methods such as query by sketch.

1.3.3 Relevance Feedback

As pointed out earlier, bringing relevant images to the end user implies knowing their point of view. Learning the point of view of the user is more and more achieved through relevance feedback. From a broad perspective, a relevance feedback scenario usually follows this scheme: first, initial results are returned by the machine; then the user provides a judgment on the results being displayed; finally the machine learns from these judgments to retrieve new results, and so on. The exact algorithm used for learning depends on whether the user is looking for a particular target item or for a class of similar items. It also differs depending on whether the user provides a binary judgment (relevant or not) or a comparative judgment (this result is better than that one). Different relevance feedback algorithms can be found in the survey of Zhou et al. [81] or the one of Crucianu et al. [16].
Chapter 2
Segmentation

As humans we often attribute the semantics of an image based only on a particular region of this image, disregarding the rest of the scene. Therefore it is logical, when trying to learn good representations of the image, to first select meaningful regions by segmenting the image. A good segmentation should group into regions pixels with similar characteristics such as color or texture. Below is a review of different techniques that have been investigated to perform segmentation.

2.1 Clustering methods

One way to achieve image segmentation is through clustering methods. The most basic case would be to use a simple k-means algorithm. In this case the main steps to perform the segmentation are the following (a minimal implementation is sketched at the end of this section):

1. Choose k cluster centers (e.g. pixels), either randomly or based on some heuristic.
2. Based on a distance measure, assign each pixel in the image to the closest cluster.
3. Recompute the cluster centers.
4. Repeat steps 2 and 3 until convergence, that is, until no more pixels change cluster.

An improvement upon this simple scheme is to use fuzzy clustering. In fuzzy clustering, pixels, rather than belonging completely to just one cluster, have a degree of membership. A further improvement is achieved by adding spatial information to the membership function, as in the fuzzy c-means clustering proposed by Chuang et al. [13]. Indeed their final membership function takes into account the features of the pixels as well as neighboring information. Hence the probability of belonging to a cluster will be higher for a pixel if the neighboring pixels already belong to this cluster. Fuzzy c-means clustering is also explored by Ahmed et al. [2] for the segmentation of magnetic resonance images.
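The four k-means steps listed above translate almost directly into code. The sketch below performs a minimal k-means segmentation on RGB pixel values with scikit-learn; the image path and the choice of k = 4 clusters are arbitrary placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from skimage import io

# Load an image and flatten it to a list of RGB pixels. Steps 1-4 of the
# k-means scheme (initialise, assign, recompute, repeat) are handled by KMeans.fit.
image = io.imread("example.jpg")          # hypothetical input image
pixels = image.reshape(-1, 3).astype(float)

# Cluster the pixels into k groups based on colour similarity.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(pixels)

# Each pixel receives the label of its closest cluster centre, which
# yields a segmentation map with the spatial shape of the original image.
segmentation = kmeans.labels_.reshape(image.shape[:2])
```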
2.2 Histogram-based methods

The principle of histogram-based methods for image segmentation is first to compute a histogram from the image. Then one or several thresholds are chosen between two peaks of the histogram, and the pixels of the image are assigned to clusters according to these thresholds. This technique is for example used by Arifin et al. [3] to distinguish between the foreground and background of grayscale images. A refinement of this technique is to recursively apply the histogram computation and the thresholding to subregions, as explained in detail in the work of Ohlander et al. [57].

2.3 Region-growing methods

In region-growing methods we start with elementary regions, either all pixels or a subset of pixels, and we iteratively combine these small regions, using a statistical test to decide whether or not to merge them. A recent algorithm that follows this approach is Statistical Region Merging, proposed by Richard Nock and Frank Nielsen [56]. Their algorithm is based on a model of image generation which captures the idea that grouping is an inference problem. It provides a simple merging predicate and a simple ordering of the merges. They argue that their method can cope with significant noise corruption, handle occlusions and perform scale-sensitive segmentation. Another method is seeded region growing. This method takes as input, along with the image, a set of seeds which correspond to the objects to be segmented. The regions are then iteratively grown by comparing unallocated neighboring pixels to the regions. One way of comparison could be, for example, to compare the intensity of a pixel with the average intensity of the region.

2.4 Graph partitioning methods

Graph partitioning methods have lately been the main research direction for segmenting images. These methods see an image as a weighted undirected graph $G = (V, E)$ where each node $v \in V$ corresponds to a pixel or a group of pixels and each edge $(i, j) \in E$ is weighted according to the dissimilarity between the two pixels that it links.

2.4.1 Cut Criterion

In order for the image graph to be partitioned into relevant clusters, good cut criteria must be found. A cut, in graph-theoretic terms, is the partitioning of the graph into two disjoint subsets which are judged dissimilar. The degree of dissimilarity is basically computed by summing the weights of the edges that connect the two subsets:

$$\mathrm{Cut}(I, J) = \sum_{i \in I,\, j \in J} w(i, j)$$

where $w$ is a function used to estimate the similarity between two nodes/pixels. The problem with this metric is that it tends to create clusters composed of a single node. A popular criterion for finding good clusters is known as the normalized cut [63]. To avoid the single-node bias, the normalized cut normalizes the cut criterion by the total edge weights of all the nodes in the graph. Other criteria are the minimal cut [77] (a cut is minimal if no cut in the graph has a smaller weight) or the maximum cut (no cut has a larger weight). Yet another recent variant is the Multiscale Normalized Cuts (NCuts) approach of Cour et al. [15].

2.4.2 Graphical Probabilistic Model

Graphical probabilistic models are probabilistic models for which a graph expresses the conditional dependence structure between random variables. In the case of image processing they are often used both to segment the image and to label each segmented region. In this context, random variables represent pixels or, more often, groups of pixels (superpixels).

Figure 2.1: A graphical model for image segmentation and labeling

The first step when using a graphical model is usually to perform an unsupervised segmentation of the image. The result of this step is a set of superpixels, as illustrated by the image on the left of figure 2.1. Different algorithms have been designed to achieve this, such as SLIC [1], an adaptation of the k-means clustering approach, or Felzenszwalb's method [24], a graph-based approach. Following this step we can model the network with random variables $C_i$ that represent superpixels and can take as values the different class labels of the problem at hand. A potential function $\Psi(C_i, C_j)$ is defined between all neighboring variables to capture the dependencies between them. We also define other random variables $X_i$ that represent the observations, or the evidence, we have about a superpixel. In the context of image processing they consist of local features such as color, shape and texture features. This evidence is used to make inferences about the values of $C_i$ through the potential $\Phi(X_k)$.

Two different models, namely Markov Random Fields (MRF) and Conditional Random Fields (CRF), have mostly been investigated by researchers. Markov random fields are generative models that model the joint probability distribution $P(C, X)$, whereas conditional random fields are discriminative models that model the conditional probability $P(C|X)$. The advantage of CRFs over MRFs is that they directly model the standard decision problem and are often more accurate. They are also easier to train since they do not try to model the distribution over the observations. However they are limited to the strict prediction problem and cannot be used for anything else. Demonstrations of segmentation with Markov random fields can be found in the work of Won et al. [76] or the one of Zhang et al. [79], which uses the model to segment brain magnetic resonance (MR) images. Conditional random fields have been investigated by Plath et al. [60], who use them to perform scale-space segmentation with class assignment.
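As an illustration of the unsupervised over-segmentation step these models start from, the sketch below extracts SLIC superpixels with scikit-image and computes one simple colour evidence vector per superpixel. The MRF/CRF potentials themselves are not implemented here, and the image path and parameter values are arbitrary placeholders.

```python
import numpy as np
from skimage import io
from skimage.segmentation import slic

image = io.imread("example.jpg")  # hypothetical input image

# Step 1: unsupervised over-segmentation into superpixels (SLIC).
superpixels = slic(image, n_segments=200, compactness=10)

# Step 2: one observation vector X_i per superpixel, here simply the mean colour.
# A full CRF would combine such evidence with pairwise potentials between
# neighbouring superpixels before running inference over the labels C_i.
observations = np.array([
    image[superpixels == label].mean(axis=0)
    for label in np.unique(superpixels)
])
```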
Chapter 3
Feature Extraction

Features in Content-Based Image Retrieval are relevant information about the underlying image. What makes a feature relevant generally depends on the problem at hand. Nevertheless, one can identify several properties that features should respect in order to be considered good features. Thus features should be invariant to:

Affine transformations (rotation, scaling, translation...). Ideally, the features computed on an image would be the same whatever the location or the scale of the object in the image.

Distortion. As for affine transformations, features should be tolerant to small distortions.

We can discern two approaches to feature extraction. The first one is to rely on handcrafted features that describe elementary characteristics of the image such as shape, color or texture. A major drawback of handcrafted features is their dependence on the application domain, which has led to another set of techniques called feature learning. Feature learning exploits a training dataset to discover useful features from the images. Another distinction one can make in the domain of feature extraction is between global and local features.

3.1 Global and Local Features

Global features are features which are aggregated over the entire image. More formally, global features can be written as

$$F_j = \bigoplus_{T_j} (f \circ i)(x)$$

where $\bigoplus$ represents an aggregation operation (which can be different from a sum), $T_j$ is the partition over which the value of $F_j$ is computed, $f$ accounts for possible weights and $i(x)$ is the image.
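To make the notation concrete, the sketch below computes one of the simplest global features, a colour histogram aggregated over the whole image: here the aggregation operation is a plain count and the partition is the full image. The image path and the 8-bin choice are arbitrary placeholders.

```python
import numpy as np
from skimage import io

image = io.imread("example.jpg")  # hypothetical input image

def global_color_histogram(img, bins=8):
    """Global feature: one histogram per colour channel, aggregated over
    every pixel of the image and normalised to sum to one."""
    channels = [np.histogram(img[..., c], bins=bins, range=(0, 255))[0]
                for c in range(3)]
    feature = np.concatenate(channels).astype(float)
    return feature / feature.sum()

f = global_color_histogram(image)   # a single vector describing the whole image
```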
As opposed to global features, local features are computed by considering only a subpart of the image. Usually a set of features is computed for each pixel using its neighborhood, or for non-overlapping blocks. After this step we usually have a set $\{X_i\}$, $0 < i < \mathrm{size}_{image}$, where $X_i$ represents the feature vector computed at location $i$ of the image. A further summarization step can also be performed; for example we might derive a distribution for the $X_i$ based on this set. As reported by Datta et al. [20], local features often correspond to more meaningful image components, such as rigid objects and entities, which makes the association of semantics with image portions straightforward.

3.2 Handcrafted Features

3.2.1 Color Features

An example of a global feature that has been extensively used is the color histogram, a representation of the distribution of colors in an image. Color histograms can be useful for retrieval as long as the color pattern of interest is representative of an item throughout the dataset. They have the advantage of being robust to translation and rotation. Various distance measures can then be used, such as the euclidean distance, histogram intersection, cosine or quadratic distance, to compute the similarity between images [68] [32] [67]. However color histograms suffer from obvious flaws. They cannot help to identify that a red cup and a blue cup actually represent the same object, and if the similarity of two images with very different scenes but identical color distributions is computed using color histograms, the images might be falsely judged similar. To improve the efficiency of color histograms, joint histograms [58] (histograms that incorporate additional information besides color), two-dimensional histograms [5] (histograms that consider the relations between pixel pair colors) and correlograms [39] (three-dimensional histograms) have been investigated.

Another key issue when dealing with color features is the choice of the color space that is used (RGB, HSV, Lab space...), which usually depends on the specific needs of the application. Two aspects of color have to be taken into account here. The first is that depending on how the scene was captured (viewpoint of the camera, position of the illumination, orientation of the surface...) the recorded color might vary considerably. The second is that the human perception of color greatly changes between individuals. The RGB space is one of the most popular color spaces and assigns to each pixel a triplet (R(x), G(x), B(x)) corresponding to the additive primary colors of light (red, green, blue). The RGB space is an adequate choice only when there is little variation in the recording conditions. For instance, the RGB space would probably be a good choice for art paintings but a bad choice for outdoor pictures. Indeed, colors relatively close in the RGB color space might be perceived as very different by a human. In the opponent color space, colors are defined according to the opponent color axes derived from the RGB values: (R-G, 2B-R-G, R+B+G). It has the advantage of isolating the brightness information on the third axis. Since humans are more sensitive to variations in brightness, the two other axes can be downsampled to reduce memory usage. Other color spaces have been studied and can be found in the survey of Smeulders et al. [65] or the one of Khokher et al. [47].

3.2.2 Shape Features

Shape feature methods try to identify interesting regions in an image, like edges or corners, and compute features based on these regions. As a primary step, scale-space detection is very often performed since it provides a way to detect interesting regions at any scale. A widely used detector is SIFT (Scale-Invariant Feature Transform), published by David Lowe in 1999 [52]. SIFT extracts keypoints from an image at different scale-spaces using Differences of Gaussians and assigns an orientation to each of them to achieve invariance to image rotation. Keypoint descriptors are then computed based on their neighborhood, that is, for each neighborhood an orientation histogram is created. In addition, several measures are taken to increase the robustness of the descriptors to changes in illumination. Improvements over SIFT have since been made, especially with the Speeded Up Robust Features (SURF) detector [7], which is several times faster than SIFT and claimed by its authors to be more robust against different image transformations. Another descriptor is the Histogram of Oriented Gradients (HOG). The idea behind HOG is that objects or humans can be described by the distribution of oriented gradients. In this technique the image is divided into cells (groups of adjacent pixels) of circular or rectangular shape. For each pixel the gradient orientation is computed, and then for each cell a histogram of gradient orientations is built. Each pixel in a cell contributes to the histogram, generally according to the magnitude of its gradient. To account for illumination changes and shadowing, gradients in a region (a group of several cells) are normalized by the average intensity of the region. The HOG descriptor is then the concatenated vector of all the normalized histograms. It has been investigated for instance for human detection [18]. Other techniques include the Harris corner detector [34] and the Hough transform [4].
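The descriptors above are available in standard libraries. The sketch below extracts SIFT keypoints with OpenCV and a HOG descriptor with scikit-image; it assumes a recent OpenCV build where SIFT is included (cv2.SIFT_create), the image path is a placeholder, and the parameter values are library defaults rather than those of the original papers.

```python
import cv2
from skimage import io, color
from skimage.feature import hog

image = io.imread("example.jpg")  # hypothetical input image
gray = (color.rgb2gray(image) * 255).astype("uint8")

# SIFT: detect scale-space keypoints and compute 128-d local descriptors.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)

# HOG: a descriptor built from per-cell histograms of gradient orientations,
# block-normalised to reduce the effect of illumination changes.
hog_descriptor = hog(gray, orientations=9, pixels_per_cell=(8, 8),
                     cells_per_block=(2, 2))
```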
  • 20. statistical approach is based on the distribution of gray level in the image. Ac- cording to Robert M. Hawlick [33] statistical approach can be divided between height different techniques : autocorrelation functions, optical transforms, dig- ital transforms, textural edgeness, structural elements, spatial gray tone cooc- currence probabilities, gray tone run lengths, and autoregressive models. A explanation for each of these techniques is provided in the survey. Wavelet- based features have also received wide attention. In the work by Minh N. Do and all [21] wavelet features are used in combination with Kullback-Leibler dis- tance for texture retrieval. Gabor filters have also been investigated for texture features [30]. 3.3 Shallow methods Shallow methods refer here to techniques used in order to extract meaningful representation using features descriptors described above. Shallow is used in opposition to deep methods that involve several layers of features extraction. 3.3.1 Visual Bag-of-Word The visual bag-of-words method in computer vision is analogous to the bag-of- words model for document. For a document the bag-of-word is a vector (i.e. histogram) that contains the number of occurrence of words that are defined by a vocabulary. For an image a bag of word is a vector or histogram that contain the number of occurrence of visual words. Visual words correspond to representative vectors computed from local image features. The main steps of the method are : Local features Extraction For this step different detectors such as harris detector or SIFT can be used and have been effectively investigated. Encoding in a Codebook From the local features codewords (analogous to words) have to been found that will produce the codebook (analogous to a dictonary). The simplest method is to perform a k-means clustering over the entire features with cluster centers corresponding to codewords. Other methods to cluster the vector space are Square-error partitioning algorithms or Hierarchical techniques. Bag of keypoints construction Once the codebook has been determined we can construct for each image an histogram (called here bag of keypoints) that contain the number of vectors assigned to each codewords (i.e. cluster centers). More detailed explanations about the bag-of-words model can be found in the paper by Csurka and al. [17] that introduced the method in 2004. Since then more encoding methods such as locality-constrained linear encoding, improved Fisher encoding, super vector encoding or kernel codebook encoding have been proposed to improve the model. 18
3.3.2 Improved Fisher Kernel

The Fisher Kernel was introduced by Jaakkola and Haussler [43]. It combines the benefits of generative and discriminative approaches by deriving a kernel from a generative model of the data. In the case of images, it consists in characterizing local patch descriptors by their deviation from a Gaussian Mixture Model. The Improved Fisher Kernel thus extends the BoW representation by including not only the statistical counts of visual words but also additional information about the distribution of the descriptors. However, in practice, results obtained with the Improved Fisher Kernel have not been better than those obtained with the BoW model [59]. A comparison of these two shallow methods can be found in the work of Chatfield et al. [10].
Chapter 4
Convolutional Neural Network

The alternative to handcrafted features that has gained huge attention in recent years is the possibility to let the machine learn the best features to represent the image. In the domain of image processing, the learning of features is mainly done through a model called the Convolutional Neural Network (CNN). Convolutional neural networks are models inspired by how the brain works and how neurons interact with each other. They are made of successive layers of neurons, each of them aggregating the results from the preceding layer in order to compute more abstract features. This succession of layers is the reason why such algorithms are known as deep learning methods.

The convolutional neural network was inspired by the neocognitron proposed by Kunihiko Fukushima in the 1980s [25]. It was the first attempt to mimic the animal visual cortex. Early papers [40] had shown that the cortex of animals is made of different types of neurons. Neurons located early in the visual recognition process are responsible for the extraction of specific patterns within their receptive fields, while neurons located further along aggregate these results and are invariant to the specific locations of the patterns. Accordingly, the neocognitron consists of multiple types of cells, the most important of which are called S-cells and C-cells. The purpose of S-cells is to extract local features, which are integrated gradually and classified in the higher layers. However, the neocognitron lacked a proper supervised training algorithm. Such an algorithm was found simultaneously by different teams in the eighties and is known as backpropagation, an abbreviation for backward propagation of errors. Backpropagation used in conjunction with an optimization method such as gradient descent was successfully applied in 1989 by LeCun et al. to recognize handwritten digits [49]. With the rise of efficient GPU computing, a second breakthrough was achieved in 2012 by Geoffrey Hinton and his team, who designed a convolutional neural network known as AlexNet [48] that achieved state-of-the-art performance on the ImageNet dataset.
As opposed to handcrafted features, convolutional neural networks have the advantage of requiring very little, if any, preliminary knowledge of the domain at hand. For instance, when using a visual bag-of-words representation for images, one has to choose which of the different detectors will be the most efficient for the task. This choice obviously depends on the underlying problem. As for convolutional neural networks, domain knowledge is used to finely tune the parameters of the model, but this is also often achieved through trial and error. Prior knowledge can also be put into the network by designing appropriate connectivity, weight constraints and neuron activation functions. In any case it is less intrusive than handcrafted features. Another benefit of convolutional neural networks is their ability to learn more discriminative features. Both in the review of Chatfield et al. [11] and in the study of Zhang et al. [78] they show better performance than traditional shallow methods. Drawbacks of convolutional neural networks include their complexity, the substantial amount of data and the computational resources required to train them effectively.

4.1 Design

The complexity of convolutional neural networks resides mostly in their design and in the different layers that the network is made of. Figure 4.1 illustrates a very simple convolutional neural network. It is made of a succession of two convolutional layers with pooling layers, followed by a fully connected layer. As for the neocognitron, the first layers, that is the convolutional layers and the pooling layers, are responsible for the extraction of local features. Specifically, the convolutional layers learn to recognize different patterns and the pooling layers gradually ensure that the network is invariant to the location of these patterns. Fully connected layers make it possible to map these low-level features to more abstract concepts. Below is a detailed description of the different layers one can find in a network.

Figure 4.1: A Convolutional Neural Network

4.1.1 Convolutional Layer

Convolutional layers are the core of convolutional neural networks. Figure 4.2 represents a convolutional layer made of four learnable linear filters, or kernels. Each filter performs a convolution on the preceding layer, that is, each filter is slid across the width and the height of the preceding layer. The output of the convolution of one filter is called a feature map. Formally we have:

$$h^k_{ij} = (W^k * x)_{ij} + b^k$$

where $h^k$ denotes the k-th feature map, $W^k$ and $b^k$ the associated filter and bias, and $x$ the inputs.

Figure 4.2: A Convolutional Layer

As a result of this operation, all neurons in a feature map share the same weights but have different receptive fields, that is, they look at different regions of the preceding layer. Thus neurons in a same feature map act as replicated feature detectors able to detect a pattern at any location. Adjacent neurons in different feature maps share the same receptive field but, the different filters learning different weights, each feature map learns to detect different patterns. The exact nature of the detected patterns is hard to know and might not be easily interpretable for humans. Following the convolutions, each neuron is the input to an activation function that determines the final output of the layer. Different activation functions can be used, such as the binary function, the logistic function or the hyperbolic tangent function, but the most commonly used activation function is the rectifier function. This function is defined as $f(x) = \max(0, x)$ and has been argued to be the most biologically plausible [28].

Using convolutional layers before fully connected layers presents many advantages. First, it takes the spatial structure of the image into account. In fully connected layers, neurons are joined to all the neurons of the previous layer, so they treat pixels which are far apart and close together exactly the same way, whereas neurons in convolutional layers are only joined to neighboring neurons from the preceding layer. Second, it greatly reduces the number of parameters to be learned by the network. For example, if an image is of size 200x200x3, each neuron of a fully connected layer would have to learn 200*200*3 = 120,000 weights, whereas a single convolutional layer with 50 different filters of size 5x5x3 only has to learn 50*5*5*3 = 3,750 weights.

4.1.2 Pooling Layer

Pooling layers are commonly inserted between two convolutional layers. They perform a form of non-linear down-sampling: they partition the previous layer into rectangular regions, overlapping or not, and compute a single output for each region. Each time they are used, they provide a small amount of translational invariance.

Figure 4.3: A Pooling Layer

Most of the time the function used by a pooling layer is max-pooling, which computes the maximum of the underlying region, but average-pooling (computing the average of the underlying region) or L2-norm pooling can also be used. By pooling, the exact locations of the features are lost, but the relative locations, which are the most important, are preserved. Pooling layers are useful to progressively reduce the spatial size of the feature maps, and thus the number of parameters, and to prevent overfitting.

4.1.3 Local Response Normalization Layer

Local Response Normalization layers can be found in particular in the AlexNet network [48]. They perform a normalization between neurons from adjacent kernels. The idea is to suppress the activity of a hidden neuron if nearby neurons in adjacent feature maps have stronger activities. Indeed, if a pattern is detected with low intensity, it becomes less relevant when strong patterns are detected nearby. The authors argue that it helps to increase the performance of the network; however another study [64] reports that performance improvements were rarely achieved by adding local response normalization layers and that they even lead to increased memory consumption and computation time. More generally, this kind of layer has not been reused in subsequent works.

4.1.4 Fully Connected Layer

The following layers in a convolutional neural network are the ones that can be found in a traditional multilayer perceptron, that is, one or more fully connected layers. Fully connected layers are called this way because their neurons are connected to all the neurons of the layer below. In a convolutional neural network, the feature maps of the last convolutional layer are vectorized and fed into fully connected layers. As for convolutional layers, an activation function is used. If we suppose that rectifier units are used, then we formally have:

$$y_j = \max(0, z_j) \quad \text{with} \quad z_j = b + \sum_i W_{ij} x_i$$

where $y_j$ is the output of the j-th neuron of the layer, $W$ is the weight matrix of this layer and $b$ is the associated bias.

4.1.5 Softmax and Loss Layer

The loss layer specifies how the network penalizes the deviation between the predicted and true labels and is obviously the last layer in the network. Right before that layer we usually find a softmax layer, which takes a feature vector as input and forces the outputs to sum to one so that they can represent a probability distribution across discrete, mutually exclusive classes:

$$y_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

The cost function associated with the softmax function is the cross-entropy function:

$$E = -\sum_j t_j \log(y_j)$$

where $t_j$ is the target value and $y_j$ is the predicted value.

4.2 Training

The training of a network is done through the backpropagation algorithm, which will be explained shortly. One problem, due to the fact that convolutional neural networks are able to learn complex mathematical models, is that they can easily overfit. To prevent this from happening, one can either limit the number of hidden layers and the number of units per layer, or stop the learning early, but several more advanced methods have been proposed which we discuss thereafter.

4.2.1 Backpropagation and Gradient Descent

Gradient descent is an optimization algorithm for finding a local minimum of a function. In the case of machine learning, it is used to find the minimum of the cost function by applying the delta rule:

$$W_{ij} = W_{ij} - \alpha \nabla E(W_{ij})$$

where $W_{ij}$ is the weight between nodes i and j, $\alpha$ is the learning rate and $\nabla E(W_{ij})$ is the gradient of the cost function with respect to the weight $W_{ij}$.
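The softmax, the cross-entropy cost and the delta rule above can be checked with a few lines of NumPy. The following is a minimal sketch performing one gradient-descent update of a single softmax layer on toy data; the input, target and learning rate are arbitrary, and the (y - t) shortcut used for the gradient follows from combining the softmax and cross-entropy derivatives.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.random(5)                    # toy input vector
t = np.array([0.0, 1.0, 0.0])        # one-hot target over 3 classes
W = rng.random((3, 5))               # weights of the softmax layer
alpha = 0.1                          # learning rate

z = W @ x
y = softmax(z)
loss = -np.sum(t * np.log(y))        # cross-entropy E = -sum_j t_j log(y_j)

# For softmax followed by cross-entropy the gradient w.r.t. z is (y - t),
# so the gradient w.r.t. W follows directly from the chain rule.
grad_W = np.outer(y - t, x)
W = W - alpha * grad_W               # delta rule: W <- W - alpha * dE/dW
```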
Figure 4.4: Illustration of gradient descent

The idea is that $\nabla E(W_{ij})$ gives us the direction of the steepest increase of the cost function, so by iteratively updating the weights in the direction of the negative gradient we should reach the minimum of the cost function, as illustrated by figure 4.4.

The sensitive part with convolutional neural networks is to compute the gradients, which is achieved thanks to the backpropagation algorithm. Below is a sketch of the algorithm for fully connected layers, where we assume that the cost function is the sum of squares:

$$E = \frac{1}{2} \sum_j (t_j - y_j)^2$$

Figure 4.5: Fully Connected Layers

The goal is to find $\frac{\partial E}{\partial W_{ij}}$ so we can inject it into the delta rule. For this we make use of the chain rule:

$$\frac{\partial E}{\partial W_{ij}} = \frac{\partial y_j}{\partial W_{ij}} \cdot \frac{\partial E}{\partial y_j}$$

$\frac{\partial E}{\partial y_j}$ can easily be found from the cost function:

$$\frac{\partial E}{\partial y_j} = -(t_j - y_j)$$

To find $\frac{\partial y_j}{\partial W_{ij}}$ we once again make use of the chain rule:

$$\frac{\partial y_j}{\partial W_{ij}} = \frac{\partial z_j}{\partial W_{ij}} \cdot \frac{\partial y_j}{\partial z_j}$$

$\frac{\partial y_j}{\partial z_j}$ depends on the activation function used. If we assume a logistic activation function, where $y_j = \frac{1}{1 + e^{-z_j}}$, then $\frac{\partial y_j}{\partial z_j} = y_j(1 - y_j)$. We also easily find $\frac{\partial z_j}{\partial W_{ij}} = x_i$, so that we finally have:

$$\frac{\partial E}{\partial W_{ij}} = -x_i \, y_j (1 - y_j)(t_j - y_j)$$

This final expression can be injected into the delta rule, enabling the network to be trained. Obviously the above derivation depends on the chosen cost function, the kind of layer used and the activation function. Gradient descent is also most of the time used in conjunction with the momentum method, which enables faster convergence.

4.2.2 L1 and L2 regularization

L1 and L2 regularization are common ways of preventing the overfitting of the network. They both consist in introducing an extra term in the cost function that penalizes large weights. L2 regularization is the most commonly used and is also known as weight decay. The extra term in this case is the sum of the squares of all the weights in the network, scaled by a factor, so the new error function is:

$$E' = E + \frac{\lambda}{2} \sum_{ij} W_{ij}^2$$

The effect of such a regularization is to make the network prefer to learn small weights and to allow large weights only if they considerably improve the cost function. Promoting small weights has been proven to be effective in reducing overfitting.

4.2.3 Batch normalization

Batch normalization [42] potentially helps the training in two ways: faster learning and higher overall accuracy. In machine learning, normalization of the data is often performed before the training process in order to make the data comparable across features. One problem with neural networks is that during the training process the weights of the network fluctuate, and the inputs of the upper layers are affected by the weights of the preceding layers. Therefore the distribution of inputs for the upper levels can end up being distorted. This problem is referred to by the authors as internal covariate shift. Their solution is to normalize the input data of each layer for each training mini-batch.

4.2.4 Dropout

A typical way to reduce overfitting when doing machine learning is to combine the results of several models. However, with large neural networks, training many models can be very tedious due to the amount of training data required, the computational resources, and so on. Dropout is a technique proposed by Geoffrey Hinton et al. [38] [66] to prevent large networks from overfitting; it is equivalent to combining the results of many models. The term dropout refers to temporarily dropping neurons, along with their incoming and outgoing connections, from the network. Each neuron is assigned a fixed probability p, independently of the others, and for each training case each neuron is retained with that probability p. Hence for each training case a different network is trained. This prompts neurons not to make up for the errors made by the other neurons but to make good predictions on their own. At test time, the outgoing weights of each neuron are multiplied by the probability p assigned to it, which is a way to approximate the predictions of the different models that have been trained.

4.3 Research Trends

As mentioned earlier, the first convolutional neural network to show great performance was AlexNet [48]. The overall architecture of AlexNet was composed of five convolutional layers and three fully connected layers. Three max-pooling layers were also used, as well as layers they called Local Response Normalization layers. Since AlexNet, many other attempts to improve the performance of CNNs have been made, which mainly focus on varying the number of layers used as well as the size of the kernel filters. Simonyan et al. [64], for instance, fixed all the parameters of the network other than its depth (the number of layers) and showed that the deeper the network, the better the performance achieved. However a bigger size usually means more parameters, which can become problematic regarding the use of computational resources, the training time and overfitting. This problem is addressed by Szegedy et al. [69], whose study focuses on the efficiency of convolutional neural networks by using sparsely connected architectures. Apart from increasing the depth and the width of convolutional neural networks, other methods have been explored, some of which are presented below.

4.3.1 Convolutional Neural Network Architecture

MCDNN. MCDNN stands for Multi-Column Deep Neural Network and is an implementation of a neural network proposed by Ciresan et al. [14]. Each column of their network is actually a convolutional neural network and the final classification is made by aggregating the results of these networks. Different training strategies ensure that each column produces a different result.

Network In Network. Lin et al. argue in their study [50] that the level of abstraction obtained with linear filters is low. Therefore they substitute the linear filters in the convolutional layers with what they call micro neural networks, which are actually multilayer perceptrons. Also, fully connected layers often being responsible for overfitting, they propose a strategy called global average pooling to replace them. The idea is for the last convolutional layer to have as many feature maps as the number of categories on which the network is trained. For each feature map the average is computed and the resulting vector is fed directly into the softmax layer. The final network was tested on four benchmark datasets (CIFAR-10, CIFAR-100, SVHN and MNIST) and achieved the best performance on all of them.

4.3.1.1 Going deeper with convolution

The main contribution of the work by Szegedy et al. [69] is the use of the Inception module. The starting point is that for each convolutional layer in the network we do not know which convolution size is the best choice to capture the local correlations in the previous layer.

Figure 4.6: Inception module with dimensionality reduction

So instead of applying only one convolution, the Inception module performs three convolutions simultaneously, as well as a pooling operation, on the previous layer, then concatenates all the feature maps to form the final output of the module. One problem with this approach is that the concatenation of the different convolutions and of the pooling operation leads to an exploding number of feature maps, and then even a modest number of 5x5 convolutions can become prohibitively expensive to compute. To prevent this from happening, all the 3x3 and 5x5 convolutions in the module are preceded by a 1x1 convolution that reduces the number of filters and thus makes the learning tractable. The exact structure of the network they used for the ILSVRC14 competition, which achieved the best results, was made of 22 layers and 8 of these Inception modules.

4.3.1.2 Residual Learning for Image Recognition

Even if adding more and more layers to a network usually helps to achieve better performance, cases have been reported where it leads to worse results [35]. This problem is addressed by Kaiming He et al. [36].

Figure 4.7: Residual Learning: a building block

Their idea is that you only want to add an extra layer if you are sure it will increase the performance of the network. One way to ensure this is to provide to the output of each layer the input without any transformation. Formally, instead of learning $F(x)$ the layer is learning $H(x) = F(x) + x$. The result of this new mapping is that if the identity mapping is ideal, then it is easy to set the weights of the layer to 0. Otherwise, it forces the layer to learn something new. Using such building blocks, the network they used for the ILSVRC15 competition was made of 152 layers and achieved the best results.

4.3.2 Enhanced Convolutional Neural Network

Since the first breakthrough made by the AlexNet architecture for classifying images, convolutional neural networks have also shown great promise for different applications such as object detection, human pose estimation and multi-label classification.

4.3.2.1 Object Detection

In the work by Girshick et al. [26], convolutional neural networks are used for object recognition. Their main contribution is to show that convolutional neural networks can be used successfully not only for image classification but also for object recognition. They use the following approach: first, for an image, they generate category-independent region proposals through selective search; then for each region they compute a feature vector using a CNN; finally they classify each region with category-specific linear SVMs. Another contribution is to show that a first training on a large dataset (ILSVRC) followed by a domain-specific fine-tuning on a small dataset is an effective way to train a network. Since their system combines region proposals with CNNs, they called their method R-CNN: Regions with CNN features. It improved the previous performance on the Pascal VOC object recognition challenge.

4.3.2.2 Multi-label classification

Single-label classification is interesting, but very often an image contains several entities and we would like to be able to detect them all. Different approaches have been investigated to enable CNNs to predict multiple labels. In the work of Yunchao et al. [74] the following methodology is used. Given an image, they first use a state-of-the-art objectness detection technique (BING) to extract hypotheses. The result of this step is a set of patches of the original image that might contain an object. Each of these patches is then fed to a traditional CNN and all the outputs are pooled into a single vector for the final prediction. The global architecture is called Hypotheses-CNN-Pooling (HCP). For the training, the traditional CNN is first pretrained on a single-label image dataset, then the HCP network is finetuned on a multi-label image dataset. Multi-label classification has also been studied by Gong et al. [29] with emphasis on the loss layer. Indeed, they investigated several loss functions to train the network. Their first approach is to use a softmax layer followed by a loss layer that minimizes the KL-divergence between the predictions and the ground-truth probabilities. The second loss function is called pairwise-ranking loss and ensures that positive labels always have better scores than negative labels. The third loss used is the weighted approximate ranking (WARP), which is defined by Weston et al. [75] and optimizes the top-k accuracy for annotation by using a stochastic sampling approach. Their results show that all three losses give comparable performance, except for the WARP loss which achieves better performance for rare tags.

4.3.3 Transfer Learning

Training a whole network from scratch is cumbersome because it requires a very large amount of data (1.2 million images for the ImageNet database) and suitable hardware. Even when these requirements are met, the training can last up to two weeks. For these reasons it is common to use a pre-trained network, either as a fixed feature extractor or to finetune it on a smaller dataset.

4.3.3.1 Convolutional Neural Networks as feature extractor

When using a pre-trained convolutional neural network as a feature extractor, the common procedure is to remove the last fully connected layer (this layer's output corresponds to the class scores of the task the network has been trained for) and to take the output of the preceding layer as our features. The resulting feature vector can then be fed to any machine learning classifier such as an SVM or a Random Forest. In the work of Donahue et al. [22], different experiments show that features extracted from a pre-trained model actually generalize well to unseen classes. The pre-trained model follows the architecture of the AlexNet model [48] and is trained on the ILSVRC-2012 dataset. Their first experiment is to extract features from a new dataset called SUN-397 and then run a dimensionality reduction algorithm (t-SNE) to find a 2-d representation. By plotting the points, colored according to their semantic categories, good clustering can be observed. The second experiment is made on the Caltech-101 dataset. Again, they extract features with their pre-trained model and use different classifiers (SVM and logistic regression) to discriminate between objects. They show that the performance obtained with the SVM classifier outperforms methods based on traditional hand-engineered image features. In other experiments, on domain adaptation, subcategory recognition and scene recognition, they show again that features from a pre-trained deep model outperform traditional hand-engineered image features. Supervised pre-training is also explored by Wan et al. [72]. As for Donahue et al., they pre-train an AlexNet model on the ImageNet dataset and test its performance on other datasets against visual bag-of-words and GIST features. In this case, features from the network frequently, but not always, outperform traditional features. Convolutional neural networks are also explored by Razavian et al. [61] on different tasks such as image classification, object detection and image retrieval, and again they outperform traditional shallow methods.
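A common way to follow this procedure today is to take a network pre-trained on ImageNet and cut off its classification layer. The sketch below does this with a torchvision ResNet-18 rather than AlexNet, an assumption made purely for brevity, and assumes a recent torchvision (>= 0.13) for the weights argument; the random tensor stands in for a batch of preprocessed images.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a network pre-trained on ImageNet.
model = models.resnet18(weights="IMAGENET1K_V1")

# Remove the last fully connected layer: its output corresponds to the
# ImageNet class scores, which are not useful for the new task.
model.fc = nn.Identity()
model.eval()

# A preprocessed image batch of shape (N, 3, 224, 224) now yields
# 512-dimensional feature vectors that can be fed to an SVM, a random
# forest or any other external classifier.
with torch.no_grad():
    images = torch.randn(4, 3, 224, 224)   # placeholder for real preprocessed images
    features = model(images)               # shape (4, 512)
```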
4.3.3.2 Fine-tuning a Convolutional Neural Network

The second strategy when using a pretrained network is to finetune it using a smaller dataset. For this we need to replace the last layer of the pre-trained network so that it fits the new task. The training can then be continued on the new dataset. The weights of the unchanged layers are initialized with the values learned from the previous training, and the weights of the new layers are initialized randomly. During the training phase, different learning rates can also be used for the different layers. Indeed, we can consider that the first layers already encode very generic features and therefore do not need to be retrained, whereas the last layers encode features that are more task-specific and thus need to be adapted to the new task at hand. In the study of Wan et al. [72], the results obtained with a CNN used as a simple feature extractor are compared with those obtained when it is finetuned on the new task. As expected, the fine-tuning strategy always yields better results.
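The sketch below illustrates this fine-tuning strategy with the same hypothetical torchvision ResNet-18 backbone used in the previous sketch: the last layer is replaced to match the new number of classes, and a smaller learning rate is assigned to the pre-trained layers than to the freshly initialised one. The class count and learning rates are arbitrary choices, not values from the cited studies.

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

num_new_classes = 20                         # hypothetical size of the new task
model = models.resnet18(weights="IMAGENET1K_V1")

# Replace the last layer so the network fits the new task; its weights are
# initialised randomly while the other layers keep their pre-trained values.
model.fc = nn.Linear(model.fc.in_features, num_new_classes)

# Generic early layers get a small learning rate, the new layer a larger one.
backbone_params = [p for name, p in model.named_parameters()
                   if not name.startswith("fc")]
optimizer = optim.SGD([
    {"params": backbone_params, "lr": 1e-4},
    {"params": model.fc.parameters(), "lr": 1e-2},
], momentum=0.9)
```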
Chapter 5
Feature Interpretation

Once features have been extracted by one of the techniques described in the previous chapter, one needs to interpret these features in order to build end applications. By interpret, I mean that based on the features one of the following questions, depending on the problem at hand, has to be answered: "Can we identify an object in the image?", "What is the class of the image?", "Are these two images similar?"...

5.1 Similarity Measure

An obvious objective in CBIR is to assess the similarity of a query image with the images in the database. Different similarity measures have been investigated for this. Algorithms that are able to learn a similarity measure from training data have also been developed. The choice of the right metric is of course dependent on the problem at hand, but also on the technique that was used to extract the features. Indeed, depending on which techniques were used, the representation of the image will differ. In their survey, Datta et al. [20] call these different representations signatures and discern three types of signatures:

Single Vector. The image is represented as a single feature vector. It is for instance the result of using a global shape descriptor.

Set of Vectors. The image is represented as a set of vectors and is usually the result of features computed on different regions obtained by segmentation of the image.

Histogram. The image is represented as a histogram. This can be the result of the shallow methods introduced previously.
5.1.1 Fixed Measure

Basic metrics for computing the similarity between two feature vectors include the Minkowski distances and the cosine similarity. However, when computed on low-level features, such metrics often fail to make the computed distance agree with the semantic similarity. An attractive measure of the distance between two probability distributions is the Earth Mover's Distance (EMD), introduced by Rubner et al. in 2000 [62]. The EMD is based on the minimal cost that must be paid to transform one distribution into the other and matches perceptual similarity better than other distances. Another metric is the Kullback–Leibler divergence, used for instance by Do et al. [21] to compute the similarity between wavelet-based texture features.

5.1.2 Similarity Learning

Rather than using a fixed measure, another approach is to rely on machine learning to learn a similarity measure. Three common setups exist:

Regression similarity learning Pairs of images (I_i^1, I_i^2) are given together with an associated measure of similarity y_i, and the goal is to learn a function f such that f(I_i^1, I_i^2) ≈ y_i.

Classification similarity learning Pairs of images (I_i^1, I_i^2) are given with binary labels y_i ∈ {0, 1}. The goal is to learn a classifier able to judge new pairs of images.

Ranking similarity learning Training samples consist of triplets (I_i, I_i^+, I_i^-) where I_i is known to be more similar to I_i^+ than to I_i^-.

An example of ranking similarity learning for large-scale image data is the OASIS algorithm [12], which models the similarity function as a bilinear form: given two images I_1 and I_2, the similarity is measured as S_W(I_1, I_2) = I_1^T W I_2, where the matrix W is not required to be positive or symmetric.

5.2 Ontology

An object ontology can be used to map low-level features to high-level entities. An illustration of an object ontology is shown in figure 5.1. With this example, a concept such as sky could be defined as having a blue color, an upper position and a uniform texture. This way, by extracting color, texture and other features from an image, different regions can be mapped to concepts. Object ontology is used by Hyvönen et al. [41] to access pictures of the promotional events of the University of Helsinki. Top-level ontological categories represent concepts such as Persons, Events and Places, and pictures are mapped to these entities when inserted into the database. In the work of Mezaris et al. [53], the images are first segmented, then color and shape descriptors are extracted, enabling the mapping of each region to a corresponding concept.
Figure 5.1: Object Ontology

5.3 Machine Learning

A last approach to associate features with high-level concepts is to rely on supervised learning methods. A good candidate which has proven to show strong performance is the support vector machine (SVM) classifier. SVM was originally designed for binary classification, so multiple models need to be trained to achieve multi-class classification. The idea behind SVM is to find a hyperplane that best divides the features belonging to two separate classes. Among the possible hyperplanes, a common choice is the optimal separating hyperplane, which maximizes the distance between the hyperplane and the nearest data points of each class. SVM has often been used in conjunction with the visual bag-of-words representation to classify images; this is the case, for example, in the work of Csurka et al. [17] or that of Lowe [52]. Another common classifier is the naive Bayes classifier. It relies on Bayes' theorem, so that prior probabilities and class-conditional densities for the features have to be estimated from the training samples. In the work of Vailaya et al. [70], first and second order moments in the LUV color space are used as color features together with MSAR texture features. Vector quantization is then performed with the learning vector quantization (LVQ) algorithm and the result is fed as input to the naive Bayes classifier. This scheme is used to classify indoor / outdoor and city / landscape images. Other classifiers that have been investigated are neural networks and decision trees. A small sketch of both classifiers applied to histogram features is given below.
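The sketch below shows a multi-class SVM (one-vs-rest, since SVMs are inherently binary) and a naive Bayes classifier applied to bag-of-visual-words histograms. The histograms are synthetic placeholders generated for illustration; the classifier choices follow the discussion above but are not the exact setups of the cited works.

# Sketch: multi-class classification of BoW histograms with a linear SVM
# (one binary SVM per class) and with a naive Bayes classifier.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_images, vocab_size, n_classes = 300, 64, 3

# Fake BoW histograms: each class is biased towards a different set of codewords.
X = rng.poisson(lam=2.0, size=(n_images, vocab_size)).astype(float)
y = rng.integers(0, n_classes, size=n_images)
X[np.arange(n_images), y * 10] += 20          # make classes separable
X /= X.sum(axis=1, keepdims=True)             # normalize histograms

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

svm = OneVsRestClassifier(LinearSVC())        # one binary SVM per class
svm.fit(X_tr, y_tr)
nb = GaussianNB()                             # class-conditional Gaussian densities
nb.fit(X_tr, y_tr)

print("SVM accuracy:        ", svm.score(X_te, y_te))
print("Naive Bayes accuracy:", nb.score(X_te, y_te))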
Chapter 6

Evaluation of CBIR Systems

In order to evaluate the different systems that have been proposed for Content-Based Image Retrieval, researchers need common image databases with trustworthy ground truth and well-defined metrics. We review here some of the well-established datasets used for this purpose, as well as the different tasks that are evaluated.

6.1 Datasets

6.1.1 Pascal VOC

The Pascal Visual Object Classes (VOC) Challenge consists of two components: a publicly available dataset and an annual competition. The dataset consists of annotated consumer photographs collected from the flickr photo-sharing website. In total 500,000 images were retrieved from flickr and divided between 20 classes. For each of the 20 classes, images were retrieved by querying flickr with a number of related keywords and randomly choosing an image among the first 100,000 results; the process was repeated until sufficient images were collected. In order to evaluate the detection challenge, a bounding box was further added for every object in the target set of object classes.

6.1.2 Caltech

The goal of the Caltech datasets is to provide relevant images for performing multi-class object detection. The Caltech 101 dataset provides pictures of objects belonging to 101 categories, with most of the categories having about 50 images; the resulting training set is thus relatively small compared to other datasets. Each image contains only a single object. A common criticism of this dataset is that the images are largely without clutter, variation in pose is limited, and the images
have been manually aligned to reduce the variability in appearance. Caltech 256 corrects some disadvantages of the previous dataset by more than doubling the number of classes and introducing a large number of clutter images.

6.1.3 ImageNet

ImageNet is an image database organized according to the WordNet hierarchy, in which each meaningful concept, possibly described by multiple words or word phrases, is called a synonym set or synset. ImageNet comprises more than 100,000 synsets with on average 1000 images per synset. The ImageNet dataset has been created especially for deep learning methods, which need huge amounts of training data.

6.2 Tasks Evaluated

6.2.1 Image Classification

One common task is to discern which of several classes corresponds to a given image. It is called Image Classification and often relies on the presence or not of a specific object in the image, such as a car, a plane, a bicycle and so on. Performing such a task can be useful to answer a query of the type "Find pictures with a red car". To achieve image classification, a commonly used approach is to compute local features of the image and summarize them into a histogram which is given as input to a classifier [23]. This approach is known as bag-of-visual-words, in analogy with the bag-of-words (BOW) representation used for text. Within this approach, different feature extractors (SIFT descriptor, Harris descriptor...) and different classifiers (SVM, Earth Mover's Distance kernel...) have been investigated [17] [78] [52]. More recent work performs classification using convolutional neural networks, which have achieved the best performance on several benchmarks, especially on the ImageNet database.

6.2.2 Object Detection

Object Detection consists in assessing whether an object is present in a given image and identifying its location. A very widespread method to achieve object detection is to use a sliding window on the image. Features are computed from the window and given to a classifier to evaluate whether the object is present. The window is slid over the image at different scales, and for each scale and location the classifier is applied [71] [18].
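The following minimal sketch illustrates the sliding-window scheme: a window is slid over the image at several scales and a classifier is applied at each position. The classifier here is a dummy stand-in (a real detector would use, for instance, HOG features and a linear SVM as in Dalal and Triggs [18]); the window size, step and scales are arbitrary illustrative values.

# Sketch of sliding-window detection with a placeholder classifier.
import numpy as np

def sliding_windows(image, window=(64, 64), step=16, scales=(1.0, 0.75, 0.5)):
    """Yield (scale, x, y, patch) for every window position at every scale."""
    wh, ww = window
    for s in scales:
        h, w = int(image.shape[0] * s), int(image.shape[1] * s)
        # Naive nearest-neighbour rescaling keeps the sketch dependency-free.
        ys = (np.arange(h) / s).astype(int)
        xs = (np.arange(w) / s).astype(int)
        scaled = image[ys][:, xs]
        for y in range(0, h - wh + 1, step):
            for x in range(0, w - ww + 1, step):
                yield s, x, y, scaled[y:y + wh, x:x + ww]

def dummy_classifier(patch):
    # Placeholder decision function: "detect" unusually bright regions.
    return patch.mean() > 200

image = np.zeros((240, 320), dtype=np.uint8)
image[80:160, 120:200] = 255                      # a bright square to find
detections = [(s, x, y) for s, x, y, p in sliding_windows(image)
              if dummy_classifier(p)]
print(f"{len(detections)} windows fired, e.g. {detections[:3]}")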
from the images; then a similarity measure is used to compute the similarity between the images. The similarity can be computed with distance metrics such as the Euclidean distance, or with more advanced techniques relying on machine learning to learn a similarity function [12].

6.2.4 Multi-class Image Segmentation and Labeling

Multi-class Image Segmentation and Labeling consists in assigning a class label to each pixel. The first step here is usually to perform a segmentation of the image and to aggregate similar pixels into groups; a label is then chosen for each group. Probabilistic graphical models have been successfully applied to this kind of task.

6.3 Evaluation Metrics

6.3.1 Precision and Recall

Probably the most common evaluation measures used in information retrieval are precision and recall. Precision is the fraction of retrieved items that are relevant to the query:

precision = |relevant documents ∩ retrieved documents| / |retrieved documents|

Recall is the fraction of relevant items that are retrieved:

recall = |relevant documents ∩ retrieved documents| / |relevant documents|

Precision and recall are often presented together in a precision versus recall graph. Several other metrics have been derived from these two; they bring additional information and make up for their inadequacy in some cases. For instance, when performing retrieval over a huge quantity of documents, recall alone is no longer a relevant metric, as a query might have thousands of relevant documents. A small computational sketch of these metrics follows the list below.

Precision at k Precision at rank k is the fraction of relevant results among the first k retrieved documents.

Average Precision For one query, average precision is the average of the precision values computed at the ranks where relevant documents are retrieved.

Mean Average Precision For a set of queries, mean average precision is the mean of the average precision scores of the individual queries.

F-score The F-score is a weighted average of precision and recall defined as

F = 2 · precision · recall / (precision + recall)

where 1 is the best value and 0 the worst.
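The sketch below computes these retrieval metrics for a single query. The ranked result list and the number of relevant documents are hypothetical values chosen only to exercise the formulas.

# Minimal sketch of the retrieval metrics defined above, for a single query.
# `ranked` is a ranked result list where 1 marks a relevant document and 0
# an irrelevant one.
import numpy as np

def precision_at_k(ranked_relevance, k):
    return np.mean(ranked_relevance[:k])

def average_precision(ranked_relevance, n_relevant_total):
    rel = np.asarray(ranked_relevance, dtype=float)
    ranks = np.arange(1, len(rel) + 1)
    precisions = np.cumsum(rel) / ranks            # precision at each rank
    return np.sum(precisions * rel) / n_relevant_total

def f_score(precision, recall):
    return 2 * precision * recall / (precision + recall)

ranked = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]   # system output for one query
total_relevant = 5                        # relevant documents in the database

p5 = precision_at_k(ranked, 5)
r5 = np.sum(ranked[:5]) / total_relevant
print("P@5 =", p5, " R@5 =", r5, " F =", round(f_score(p5, r5), 3))
print("AP  =", round(average_precision(ranked, total_relevant), 3))
# Mean average precision (mAP) would simply average AP over a set of queries.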
6.3.2 Top-k Error

The top-k error rate is useful to evaluate classification tasks. It is defined as the fraction of items for which the correct label is not among the k labels considered the most probable by the model. In the ImageNet classification challenge, the top-1 and top-5 error rates are used as benchmarks to compare the different submissions.

6.3.3 Confusion Matrix

To evaluate the performance of a classification system, a confusion matrix, also known as a contingency table, is often used. The name confusion comes from the fact that the matrix makes it easy to check whether the system is confusing different classes. From a confusion matrix, the true positive and true negative rates, also known as sensitivity and specificity, can be computed. For a binary classification test, sensitivity measures the proportion of positives that are correctly identified as such, while specificity measures the proportion of negatives that are correctly identified as such.
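A short sketch of both evaluation tools follows. The scores and labels are synthetic placeholders; only the metric computations themselves are the point.

# Sketch: top-k error rate and confusion matrix, using only numpy.
import numpy as np

def top_k_error(scores, labels, k):
    """Fraction of samples whose true label is not among the k highest scores."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = (topk == labels[:, None]).any(axis=1)
    return 1.0 - hits.mean()

def confusion_matrix(labels, predictions, n_classes):
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(labels, predictions):
        cm[t, p] += 1          # rows: true class, columns: predicted class
    return cm

rng = np.random.default_rng(0)
n, n_classes = 200, 5
labels = rng.integers(0, n_classes, size=n)
scores = rng.standard_normal((n, n_classes))
scores[np.arange(n), labels] += 1.5          # make the model better than chance

preds = scores.argmax(axis=1)
print("top-1 error:", top_k_error(scores, labels, 1))
print("top-3 error:", top_k_error(scores, labels, 3))
print(confusion_matrix(labels, preds, n_classes))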
Chapter 7

Experiments

Several experiments have been conducted to assess the dominance of convolutional neural networks in the domain of image processing. Such networks were then used to perform video classification and tagging for a startup that wishes to exploit more efficiently a video corpus at its disposal. The startup wants to offer a system in which clients can retrieve the videos in the corpus through natural queries, and new videos may be added to the initial corpus over time. To answer a request, the system should rely on the information obtained from the classification and tagging stages. The initial corpus has already been manually annotated, but to make the whole process more efficient, automatic annotation should be performed.

7.1 Image Similarity

7.1.1 Introduction

The objective of these experiments was to investigate how well traditional handcrafted features perform compared to features learned with a convolutional neural network. The features were used to represent images in an image similarity task. Specifically, features are computed with both methods for a set of images standing for the database; different image queries are then performed, and for each query the system retrieves the images judged the most similar to the query.

7.1.2 Dataset

7.1.2.1 MNIST

The dataset used for the experiment is the MNIST dataset. It consists of grayscale images representing handwritten digits. It has a training set of 60,000 examples and a test set of 10,000 examples. The images were obtained from another dataset called NIST, normalized and centered in a 28x28 image
by computing the center of mass of the pixels and translating the images so as to position this point at the center of the 28x28 field.

Figure 7.1: Handwritten digits from the MNIST Dataset

7.1.3 Methods

Two methods were used to extract the features from the images. The first method consists of a visual bag-of-words model built on SIFT descriptors. This representation was chosen because it is a simple model that has nevertheless proven to achieve relatively good performance. The second method consists in training a convolutional neural network and extracting the features it has learned. Furthermore, the similarity between two images is assessed using a similarity learning algorithm named OASIS (Online Algorithm for Scalable Image Similarity) proposed by Chechik et al. [12].

7.1.3.1 SIFT and Visual Bag-Of-Words

To obtain a visual bag-of-words model, local descriptors are computed using the SIFT algorithm through the OpenCV-Python library [9]. The value 3 was used for the number of layers in each octave, which is what D. Lowe used in his paper [51]. The contrast threshold used to filter out weak features was 0.01 and the threshold used to filter out edge-like features was 50; using a higher threshold to filter weak features or a lower threshold to filter edge-like features led to no keypoints being detected. Finally, the sigma of the Gaussian was 0.8. To go from local descriptors to global image descriptors, a standard histogram encoding was performed: all local descriptors from a subset of the training images (5,000 images randomly picked from the original training set) were fed as input to a k-means clustering with k = 64. The centers of the clusters found are used as codewords, and for each image a histogram is computed by counting the number of local descriptors that belong to each cluster. A sketch of this pipeline is given below.
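The following sketch reproduces the SIFT + bag-of-visual-words pipeline with the parameter values quoted above (3 octave layers, contrast threshold 0.01, edge threshold 50, sigma 0.8, vocabulary of k = 64 codewords). It assumes opencv-python >= 4.4 (for cv2.SIFT_create) and scikit-learn; the `images` list is a random placeholder standing in for the MNIST training subset, so real digits should be substituted in practice.

# Sketch of the SIFT + bag-of-visual-words encoding described above.
import cv2
import numpy as np
from sklearn.cluster import KMeans

sift = cv2.SIFT_create(nOctaveLayers=3, contrastThreshold=0.01,
                       edgeThreshold=50, sigma=0.8)

def sift_descriptors(gray_image):
    _, descriptors = sift.detectAndCompute(gray_image, None)
    return descriptors if descriptors is not None else np.empty((0, 128), np.float32)

# Placeholder images: random 28x28 grayscale patches standing in for MNIST digits.
rng = np.random.default_rng(0)
images = [rng.integers(0, 256, size=(28, 28), dtype=np.uint8) for _ in range(200)]

# 1) Build the visual vocabulary from all local descriptors of a training subset.
all_desc = np.vstack([sift_descriptors(img) for img in images])
if len(all_desc) < 64:
    raise SystemExit("Not enough descriptors on these placeholder images; "
                     "use real MNIST digits in practice.")
kmeans = KMeans(n_clusters=64, n_init=10, random_state=0).fit(all_desc)

# 2) Encode each image as a 64-bin histogram of codeword occurrences.
def bow_histogram(gray_image):
    desc = sift_descriptors(gray_image)
    hist = np.zeros(64, dtype=np.float32)
    if len(desc):
        words = kmeans.predict(desc)
        hist = np.bincount(words, minlength=64).astype(np.float32)
    return hist / max(hist.sum(), 1.0)        # L1-normalized histogram

print(bow_histogram(images[0]))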
Figure 7.2: Result of a query for the two different features — (a) Bag-of-Words features, (b) Network features

7.1.3.2 Convolutional Neural Network

The learning of the features was achieved with a convolutional neural network. The model was implemented using the Theano library [6] [8]; the full code is available on github 1. For the training, the network was initialized with two convolutional layers, with 20 filters for the first layer and 50 filters for the second layer. Each convolutional layer is followed by a pooling layer, and two fully connected layers with respectively 225 neurons and 100 neurons are added after them. The MNIST training set was randomly divided into 50,000 training images and 10,000 validation images in order to train the network. The weights of the model are saved for the lowest validation error rate reached, which is 0.0115. The features of all the training images are then extracted using these weights.

7.1.3.3 Similarity Learning

Using the previously computed training features, a similarity function was learned through the OASIS algorithm. OASIS learns a weight matrix W that, given two images I_1 and I_2, enables the computation of a similarity of the form S_W(I_1, I_2) = I_1^T W I_2. The implementation of the algorithm is again made in Python 2, based on the Matlab implementation of Chechik et al. [12].

7.1.4 Evaluation and further Improvements

7.1.4.1 Evaluation

To compare the two methods, we evaluate how well the system performs when retrieving images in answer to a query image. Query images come from the testing set, and the system searches the training images for the most similar ones. Figure A.3 shows the images returned by the system in answer to a query with an image representing the digit 3. The web demo is implemented with the django and angularjs frameworks.

1 https://github.com/leovetter/DeepNet
2 https://gist.github.com/gwtaylor/94412c4808421acbd1d0
We can observe that the features extracted with the convolutional neural network seem to perform much better than the features from the bag-of-words model: when using the network features, all the retrieved images are of the same class as the query image, which is far from being the case with the bag-of-words features. This observation is confirmed by figure 7.3, which shows the precision-recall curve for the same query. We can see that for the network features the precision is always 1, meaning that even results at high ranks are still of the same class as the query image. As for the bag-of-words features, the precision slowly decreases as the recall increases: when we retrieve more results we get, as expected, a higher recall, but since not all results are relevant the precision decreases. Figure 7.4 shows the same results, but this time the precision and the recall are plotted as a function of the number of images retrieved.

Figure 7.3: Precision-Recall curve

Figure 7.4: Precision and Recall for the two methods — (a) Bag-of-Words features, (b) Network features
7.1.4.2 Further Improvements

Owing to the limited computational resources available to me, the dataset used is not a very challenging one and does not allow observing the limitations of the features from a convolutional neural network. It would be interesting to see how well they perform on datasets with more realistic images, such as the Pascal VOC dataset or the ImageNet dataset. Also, in order to improve the performance of the bag-of-words model, better encodings could be used, such as the Fisher encoding [59], the super vector encoding [80] or the locality-constrained linear (LLC) encoding [73].

7.2 Video Classification and Tagging

7.2.1 Introduction

The other experiments conducted to evaluate the efficiency of convolutional neural networks consist, first, in correctly classifying a video and, second, in extracting meaningful tags from the same videos. More formally, the first task predicts one label corresponding to the global appearance of the video, whereas the second task extracts up to N tags relevant to the content of the video.

7.2.2 Dataset

For the two tasks mentioned above, experiments were conducted using a corpus of approximately seven thousand videos (rushes). All videos present semantic unity, that is, they are made of a single shot (no cuts) showing related content (for example, one video only displays a sea landscape and another only a city landscape), and their average length is 4-5 s. All the videos have been manually annotated with a theme describing the global landscape of the video and with tags denoting entities present in the video. A more detailed description of the corpus can be found in the appendix.

7.2.3 Methods

For both tasks, classification and tagging, the method used was to fine-tune a pre-trained network. Two different networks were evaluated and compared against each other: the first one uses the AlexNet architecture [48] and the second one the Inception architecture [69]. Both models have been trained on the ImageNet database and are publicly available through the Caffe framework. All the following experiments have been made using the Caffe framework with the Python interface. As the networks only take images as input, and not videos, two key frames have been extracted from each video. For videos with no camera movement, two random frames are extracted. For videos with camera movement, ORB features are extracted for each frame of the video, then a clustering is done using the k-means algorithm with k set to two; the objective here is to extract two different frames that best represent the video. A sketch of this key-frame selection is given below.
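The following hedged sketch illustrates the key-frame selection just described: ORB descriptors are computed for every frame, frames are clustered into k = 2 groups, and one representative frame per cluster is kept. The thesis does not detail how per-frame descriptors are aggregated; averaging a frame's ORB descriptors into a single vector is an assumption made here purely for illustration, as is the video file name in the usage line.

# Sketch: pick two representative key frames from a rush via ORB + k-means.
import cv2
import numpy as np
from sklearn.cluster import KMeans

orb = cv2.ORB_create()

def frame_signature(gray_frame):
    _, desc = orb.detectAndCompute(gray_frame, None)
    if desc is None or len(desc) == 0:
        return np.zeros(32, dtype=np.float32)
    return desc.astype(np.float32).mean(axis=0)   # 32-d mean ORB descriptor (assumption)

def two_key_frames(video_path):
    cap = cv2.VideoCapture(video_path)
    frames, signatures = [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
        signatures.append(frame_signature(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)))
    cap.release()
    if len(frames) < 2:
        return frames
    X = np.stack(signatures)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    # Keep the frame closest to each cluster centre.
    keep = [int(np.argmin(np.linalg.norm(X - c, axis=1))) for c in km.cluster_centers_]
    return [frames[i] for i in keep]

# Usage (hypothetical path): key_frames = two_key_frames("rush_0001.mp4")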
The resulting set contains about 14,000 images; eighty percent of the images are then used for the training set and twenty percent for the evaluation set.

7.2.3.1 Classification

For the classification task, fine-tuning a pre-trained network to learn to discriminate between custom classes implies replacing the last fully connected layer of the network so as to adapt it to the new problem. Specifically, given that we want to learn to discriminate between 22 themes, we specify a fully connected layer with 22 outputs. Our corpus of videos is then split into a training and a testing set and the network is retrained on the training set. For the training, the values of all the weights, except for the new layer, are initialized with the values computed during the pre-training, and the weights of the new layer are initialized using the Xavier algorithm [27], which automatically determines the scale of initialization based on the number of input and output neurons. The biases are initialized to 0. On a computer running Linux (Ubuntu 14.04) with a GeForce GTX Titan X (12 GB of memory), an Intel Core i7-5960X processor and 23 GB of RAM, the training only takes several minutes to complete.

(a) Alexnet Model (b) Inception Model
Figure 7.5: Learning curves for the classification task

7.2.3.2 Tagging

As for the classification task, fine-tuning a network to learn to predict several outputs implies replacing the last fully connected layer with a layer whose number of outputs equals the number of tags we want to predict. Also, since we are no longer dealing with a simple classification scheme (several tags can be predicted for one image), the loss layer has to be adapted: specifically, the sigmoid cross-entropy loss is used. For the training, previously existing layers are again initialized with the values from the pre-training and new layers are initialized following the same procedure as for the classification task. A numerical sketch of this loss is given below.
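The sketch below spells out the multi-label sigmoid cross-entropy loss: each tag gets an independent sigmoid output, and the per-tag binary cross-entropies are summed. The scores and ground-truth tag vectors are synthetic placeholders, not values from the experiments.

# Sketch of the sigmoid cross-entropy loss used for multi-label tagging.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_cross_entropy(logits, targets):
    """Mean over samples of the per-tag binary cross-entropy, summed over tags."""
    p = sigmoid(logits)
    eps = 1e-12
    per_tag = -(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps))
    return per_tag.sum(axis=1).mean()

# Two images, five possible tags (e.g. Person, Animal, Nature, Water, Building).
logits = np.array([[ 2.0, -1.0, 0.5, -2.0, 0.0],
                   [-1.5,  3.0, 0.0,  1.0, -0.5]])
targets = np.array([[1, 0, 1, 0, 0],
                    [0, 1, 0, 1, 0]], dtype=float)

print("loss:", round(sigmoid_cross_entropy(logits, targets), 4))
# At prediction time, tags whose sigmoid probability exceeds a threshold
# (e.g. 0.5) are assigned to the image.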
As for the classification task, the learning is very fast.

(a) Alexnet Model (b) Inception Model
Figure 7.6: Learning curves for the tagging task

7.2.4 Evaluation and further improvements

7.2.4.1 Evaluation

The models are evaluated on the evaluation set; precision, recall and the F1-score are used for the classification task. The best results are obtained with the Inception architecture: the average precision and recall obtained are both 0.91. The detailed results for each class are shown below. For the tagging task, precision and recall are used and the best results are again obtained with the Inception architecture. The network achieves good results, though not as good as for the classification task. Again the detailed results are shown below.

              Precision   Recall
  Person         0.58      0.59
  Animal         0.93      0.61
  Nature         0.92      0.60
  Water          0.96      0.60
  Building       0.92      0.60

Figure 7.7: Precision and Recall for the Tagging task
                              Precision   Recall   f1-score
  Campagne / Zone rural          0.91      0.85      0.88
  Canyon                         0.74      0.96      0.84
  Culture                        1.00      0.50      0.67
  Désert                         0.91      0.84      0.87
  Enfants                        1.00      1.00      1.00
  Evenement                      1.00      0.67      0.80
  Forêt                          0.74      0.83      0.78
  Gastronomie                    0.87      0.74      0.80
  Jardins                        1.00      0.75      0.86
  Marais                         1.00      1.00      1.00
  Mer                            1.00      0.78      0.88
  Montagne                       0.94      0.83      0.88
  Montagne / Zone urbaine        0.83      0.91      0.87
  Patrimoine                     1.00      0.67      0.80
  Plaine                         1.00      1.00      1.00
  Marais                         1.00      0.91      0.95
  Plein air                      0.83      0.83      0.83
  Rizière                        1.00      1.00      1.00
  Savane                         1.00      0.95      0.97
  Ville / Zone côtière           1.00      0.86      0.93
  Ville / Zone urbaine           0.95      0.96      0.96
  avg / Total                    0.91      0.91      0.91

Figure 7.8: Precision, Recall and F1-score for the Classification task

7.2.4.2 Further Improvements

7.2.4.2.1 Data Augmentation

In order to improve the performance obtained by the networks, we can consider increasing the size of the training set. Indeed, we have about 11,000 images for training, which is a very small dataset when dealing with neural networks. Data augmentation can be realized in several ways. In [48], two forms of data augmentation are used. The first form consists in generating image translations and horizontal reflections from the original images: random 224x224 crops are taken from each 256x256 original image, capturing some translation invariance. This increases the size of their training set by a factor of 2048, and they claim it allows them to greatly reduce overfitting. The second form of data augmentation consists in altering the intensities of the RGB channels in the training set; this scheme enabled them to reduce the top-1 error rate in the ILSVRC-2012 competition by over 1%. Another common way to generate more training data is to perform color jittering, that is, to change the contrast of the image. A minimal sketch of these transformations is given below.
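The sketch below implements the simple augmentations just discussed (random 224x224 crops, horizontal flips and a crude brightness/contrast jitter) with numpy only. The input image is a synthetic placeholder for a 256x256 RGB training image, and the jitter ranges are illustrative choices rather than the exact scheme of [48].

# Sketch: simple data augmentation by cropping, flipping and color jittering.
import numpy as np

rng = np.random.default_rng(0)

def random_crop(image, size=224):
    h, w = image.shape[:2]
    y = rng.integers(0, h - size + 1)
    x = rng.integers(0, w - size + 1)
    return image[y:y + size, x:x + size]

def random_flip(image):
    return image[:, ::-1] if rng.random() < 0.5 else image

def color_jitter(image, max_shift=20, contrast_range=(0.8, 1.2)):
    contrast = rng.uniform(*contrast_range)
    shift = rng.integers(-max_shift, max_shift + 1)
    jittered = image.astype(np.float32) * contrast + shift
    return np.clip(jittered, 0, 255).astype(np.uint8)

def augment(image):
    return color_jitter(random_flip(random_crop(image)))

image = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
batch = np.stack([augment(image) for _ in range(8)])   # 8 variants of one image
print(batch.shape)   # (8, 224, 224, 3)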
7.2.4.2.2 Recurrent Neural Networks and Graphical Models

Another approach to reach better performance would be to investigate how recurrent neural networks perform. Indeed, traditional convolutional neural networks are not able to combine information from different frames of the video: they predict an output based solely on the information present in one frame. On the contrary, recurrent neural networks are designed to deal with sequences of inputs and thus might be able to combine information from different frames of the video to achieve better performance. Investigating how other algorithms such as probabilistic graphical models perform might also be interesting. Graphical models such as Markov random fields [46] or conditional random fields [37] have been used with some success for image processing tasks. One of the advantages of such models is that they can model the dependencies between pixels or groups of pixels (superpixels). They can therefore make predictions for an image by relying not only on features but also, for instance, on spatial dependencies between pixels (nearby pixels are more likely to share the same label than distant pixels). As features for the graphical models, it might be judicious to use features extracted from a convolutional neural network.

7.2.4.2.3 Semantic resources

A final idea to improve the classification and tagging of the videos is to rely on external semantic resources. Indeed, we have the geolocation of each video, so we can query a spatial database to retrieve nearby buildings or monuments. Knowing, for example, that a video has been shot near the Eiffel tower, we can aggregate this information with the results of the convolutional neural networks by increasing the probability that a building appears in the video.
Bibliography

[1] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. Slic superpixels compared to state-of-the-art superpixel methods. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(11):2274–2282, 2012.

[2] Mohamed N Ahmed, Sameh M Yamany, Nevin Mohamed, Aly A Farag, and Thomas Moriarty. A modified fuzzy c-means algorithm for bias field estimation and segmentation of mri data. Medical Imaging, IEEE Transactions on, 21(3):193–199, 2002.

[3] Agus Zainal Arifin and Akira Asano. Image thresholding by histogram segmentation using discriminant analysis. In Proceedings of Indonesia–Japan Joint Scientific Symposium, pages 169–174, 2004.

[4] Dana H Ballard. Generalizing the hough transform to detect arbitrary shapes. Pattern Recognition, 13(2):111–122, 1981.

[5] EA Bashkov and NS Kostyukova. Effectiveness estimation of image retrieval by 2d color histogram. Journal of Automation and Information Sciences, 8(11):74–80, 2006.

[6] Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.

[7] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In Computer Vision–ECCV 2006, pages 404–417. Springer, 2006.

[8] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral Presentation.

[9] G. Bradski. The OpenCV Library. Dr. Dobb's Journal of Software Tools, 2000.
[10] Ken Chatfield, Victor S Lempitsky, Andrea Vedaldi, and Andrew Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. In BMVC, volume 2, page 8, 2011.

[11] Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531, 2014.

[12] Gal Chechik, Varun Sharma, Uri Shalit, and Samy Bengio. Large scale online learning of image similarity through ranking. The Journal of Machine Learning Research, 11:1109–1135, 2010.

[13] Keh-Shih Chuang, Hong-Long Tzeng, Sharon Chen, Jay Wu, and Tzong-Jer Chen. Fuzzy c-means clustering with spatial information for image segmentation. Computerized Medical Imaging and Graphics, 30(1):9–15, 2006.

[14] Dan Ciresan, Ueli Meier, and Jürgen Schmidhuber. Multi-column deep neural networks for image classification. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3642–3649. IEEE, 2012.

[15] Timothee Cour, Florence Benezit, and Jianbo Shi. Spectral segmentation with multiscale graph decomposition. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 2, pages 1124–1131. IEEE, 2005.

[16] Michel Crucianu, Marin Ferecatu, and Nozha Boujemaa. Relevance feedback for image retrieval: a short survey. Report of the DELOS2 European Network of Excellence (FP6), 2004.

[17] Gabriella Csurka, Christopher Dance, Lixin Fan, Jutta Willamowski, and Cédric Bray. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV, volume 1, pages 1–2. Prague, 2004.

[18] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005.

[19] Ritendra Datta, Dhiraj Joshi, Jia Li, and James Z. Wang. Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys, 40, 2008.

[20] Ritendra Datta, Dhiraj Joshi, Jia Li, and James Z Wang. Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys (CSUR), 40(2):5, 2008.
[21] Minh N Do and Martin Vetterli. Wavelet-based texture retrieval using generalized gaussian density and kullback-leibler distance. Image Processing, IEEE Transactions on, 11(2):146–158, 2002.

[22] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531, 2013.

[23] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.

[24] Pedro F Felzenszwalb and Daniel P Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167–181, 2004.

[25] Kunihiko Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202, 1980.

[26] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.

[27] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.

[28] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In International Conference on Artificial Intelligence and Statistics, pages 315–323, 2011.

[29] Yunchao Gong, Yangqing Jia, Thomas Leung, Alexander Toshev, and Sergey Ioffe. Deep convolutional ranking for multilabel image annotation. arXiv preprint arXiv:1312.4894, 2013.

[30] Simona E Grigorescu, Nicolai Petkov, and Peter Kruizinga. Comparison of texture features based on gabor filters. Image Processing, IEEE Transactions on, 11(10):1160–1167, 2002.

[31] Venkat N Gudivada and Vijay V Raghavan. Content based image retrieval systems. Computer, 28(9):18–22, 1995.

[32] James Hafner, Harpreet S Sawhney, Will Equitz, Myron Flickner, and Wayne Niblack. Efficient color histogram indexing for quadratic form distance functions. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 17(7):729–736, 1995.
[33] Robert M Haralick. Statistical and structural approaches to texture. Proceedings of the IEEE, 67(5):786–804, 1979.

[34] Chris Harris and Mike Stephens. A combined corner and edge detector. In Alvey Vision Conference, volume 15, page 50. Citeseer, 1988.

[35] Kaiming He and Jian Sun. Convolutional neural networks at constrained time cost. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5353–5360, 2015.

[36] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.

[37] Xuming He, Richard S Zemel, and Miguel Á. Carreira-Perpiñán. Multiscale conditional random fields for image labeling. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II–695. IEEE, 2004.

[38] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

[39] Jing Huang, S Ravi Kumar, Mandar Mitra, Wei-Jing Zhu, and Ramin Zabih. Spatial color indexing and applications. International Journal of Computer Vision, 35(3):245–268, 1999.

[40] David H Hubel and Torsten N Wiesel. Receptive fields of single neurones in the cat's striate cortex. The Journal of Physiology, 148(3):574–591, 1959.

[41] Eero Hyvönen, Samppa Saarela, Avril Styrman, and Kim Viljanen. Ontology-based image retrieval. In WWW (Posters), 2003.

[42] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[43] Tommi Jaakkola, David Haussler, et al. Exploiting generative models in discriminative classifiers. Advances in Neural Information Processing Systems, pages 487–493, 1999.

[44] Yushi Jing, David Liu, Dmitry Kislyuk, Andrew Zhai, Jiajing Xu, Jeff Donahue, and Sarah Tavel. Visual search at pinterest. arXiv preprint arXiv:1505.07647, 2015.

[45] Charles E Kahn Jr. Artificial intelligence in radiology: decision support systems. Radiographics, 14(4):849–861, 1994.

[46] Zoltan Kato and Ting-Chuen Pong. A markov random field image segmentation model for color textured images. Image and Vision Computing, 24(10):1103–1114, 2006.