Eye Tracking Predicts Image Relevance in CBIR Systems

Eye Movement as an Interaction Mechanism for Relevance Feed-
back in a Content-Based Image Retrieval System
Yun Zhang*1,2 ¶
† ‡
Hong Fu 2, Zhen Liang 2, Zheru Chi 2
§
Dagan Feng 2,3,
1 3
School of Computer Science 2
Centre for Multimedia Signal School of Information Technologies
Northwestern Polytechnical Uni- Processing Department of Electronic The University of Sydney
versity, Xi’an, Shaanxi, China and Information Engineering Sydney, Australia
The Hong Kong Polytechnic University
Hong Kong, China
ver, the subjective nature of human annotation adds another
Abstract
dimension of difficulty in managing image database.
Relevance feedback (RF) mechanisms are widely adopted in CBIR is an alternative solution to retrieve images. However,
Content-Based Image Retrieval (CBIR) systems to improve after years of rapid growth since 1990s [Flickner et al.1995], the
image retrieval performance. However, there exist some intrinsic gaps between low level features and semantic contents of images
problems: (1) the semantic gap between high-level concepts and holds back the progress and has entered a plateau phase. Such
low-level features and (2) the subjectivity of human perception gaps can be concretely outlined into three aspects: (1) image
of visual contents. The primary focus of this paper is to evaluate representation (2) similarity measure (3) user’s interaction. Most
the possibility of inferring the relevance of images based on eye of the image representations are based on intuitiveness of the
movement data. In total, 882 images from 101 categories are researchers and the fulfillment of mathematics, instead of hu-
viewed by 10 subjects to test the usefulness of implicit RF, man’s eye behavior. Do the features extracted reflect humans’
where the relevance of each image is known beforehand. A set understanding of the image’s content? There is no clear answer
of measures based on fixations are thoroughly evaluated which to this question. Similarity measure is highly dependent on the
include fixation duration, fixation count, and the number of revi- features and structures used in image representation. Moreover,
sits. Finally, the paper proposes a decision tree to predict the developing better distance descriptors and refining similarity
user’s input during the image searching tasks. The prediction measures are also very challenging. User interaction can be a
precision of the decision tree is over 87%, which spreads light feasible approach to answer the question and to improve the
on a promising integration of natural eye movement into CBIR image retrieval performance. In the Relevance Feedback (RF)
systems in the future. process, the user is asked to refine the searching by providing
CR Categories: H.3.3 [Information Storage and Retrieval]: explicit RF, such as selecting Areas-of-Interest (AOIs) from the
Information Search and Retrieval—Relevance feedback, Search query image, or to tick positive and negative samples from re-
Process; H.5.2 [Information Interfaces and Representation]: trieves. In the past few years, many articles reported that RF can
User interfaces help to establish the association between the low-level features
Keywords: Eye Tracking, Relevance Feedback (RF), Content- and the semantics of images and to improve the retrieval per-
Based Image Retrieval (CBIR), Visual Perception formance [Liu et al.2006; Dacheng Tao et al.2008].

1 Introduction However, the explicit feedback is laborious for the user and
limited in complexity. In this paper, we propose an eye move-
Numerous digital images are being produced everyday from ment based implicit feedback as a rich and natural source to
digital cameras, medical devices, security monitors, and other replace the time-consuming and expensive explicit feedback. As
image capturing apparatus. It has become more and more diffi- far as we know, there are just a few preliminary studies on im-
cult to retrieve a desired picture even from a photo album on a plementing some general eye movement features in image re-
home computer because of the exponential increase in the num- trieval. One is from Oyekoya and Setntiford’s work [Oyekoya
ber of images. Most traditional and common methods of im- and Stentiford.2004; Oyekoya and Stentiford.2006]. They made
age retrieval based on metadata, such as textual annotations or an investigation into the fixation duration and found that they
user-specified tags, have become the industry standard for re- are different on images with/without a clear AOI. The other
trieval from large image collections. However, manual image work was reported by Klami et al. [Klami et al. 2008]. They
annotation is time-consuming, laborious and expensive. Moreo- proposed nine-feature vectors from different forms of fixations
*email: tvsunny@gmail.comemail:
†
and saccades and used a classifier to predict one relevant image
email:zhenliang@eie.polyu.edu.hk from four candidates.
‡
email:enhongfu@inet.polyu.edu.hk
‖
email:enzheru@inet.polyu.edu.hk Different from the previous work, the study reported in this pa-
§
email: feng@it.usyd.edu.au per attempts to simulate a more real and complex image retrieval
situation and to quantitatively analyze the correlation between
users’ eye behavior and target images (positive images). In our
Copyright © 2010 by the Association for Computing Machinery, Inc.
experiments, the images come from a wide variety of web
Permission to make digital or hard copies of part or all of this work for personal or sources, and in each task, the query image and the numbers of
classroom use is granted without fee provided that copies are not made or distributed positive images are varied from time to time. We evaluated the
for commercial advantage and that copies bear this notice and the full citation on the significance of fixation durations, fixation counts, and the num-
first page. Copyrights for components of this work owned by others than ACM must be
honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on
ber of revisits to provide a systematic interoperation of the us-
servers, or to redistribute to lists, requires prior specific permission and/or a fee. er’s attention and effort allocation in eye movements, laying a
Request permissions from Permissions Dept, ACM Inc., fax +1 (212) 869-0481 or e-mail
permissions@acm.org.
ETRA 2010, Austin, TX, March 22 – 24, 2010.
© 2010 ACM 978-1-60558-994-7/10/0003 $10.00

37

concrete and substantial foundation to involve natural eye Ten participants took part in the study, four females and six
movement as a robust RF source [Zhou and Huang. 2003]. males in an age range from 20 to 32 all with an academic back-
ground. All of them are proficient computer users, and half of
The rest of the paper is organized as follows. Section 2 introduc- them have had experience of using an eye tracking system. Their
es experimental design and setting for relevance feedback tasks visions are either normal or correct-to-normal. The participants
and the corresponding eye movement data collecting. In Section were asked to complete two sets of the above mentioned image
3, we report our thorough investigation on using fixation dura- searching tasks and the gaze data are recorded with a 60 Hz
tion, fixation count and the numbers of revisits for the prediction sampling rate. Afterwards the participants were asked to indicate
of relevant images. These factors are performed with the ANO- which images they have chosen as positive images to ensure the
VA test to reveal their significances and interconnections. Sec- accuracy of a further analysis on their eye movement data. The
tion 4 proposes a decision tree model to predict the user’s input eye tracker is non-intrusive and allows a 300x220x300 mm free
during the images searching tasks. Finally, we conclude the head movement space. Different candidate images and the loca-
results and propose the future work. tions of positive images are ensured in and between each set of
the task. In other words, no two images are the same and no two
2 Design of Experiments stimuli have the same positive image locations. This is to reduce
the memory effects and to simulate the natural relevance feed-
2.1 Task Setup back situation.
We study an image searching task which reflects kinds of activi-
ties occurring in a complete CBIR system. In total, 882 images 3 Analysis of Gaze Data in Image Searching
are randomly selected from 101 object categories. The image set
is obtained by collecting images through the Google image Raw gaze data are preprocessed by finding the fixations with the
search enginee [Li 2005]. The design and examples of the built-in filter provided by Tobii Technology. The filter maps a
searching task interface is shown in Fig. 1. On the top left is the series of raw coordinates to a single fixation if the coordinates
query image. Twenty candidate images are arranged as a 4x5 stay sufficiently long within a sphere of a given radius. We used
grid display. All of the images are from 101 categories such as an interval threshold of 150 ms and a radius of 1 º visual angle.
landscapes, animals, buildings, human faces, and home ap- 3.1 Fixation Duration and Fixation Count
pliances. The red blocks in Fig. 1(a) denotes the locations of
positive images in Fig. 1(b) (Class No. 22: Pyramid). The others The main features used in eye tracking related information re-
are negative images and their image classes are different from trieval are fixations and saccades [Jacob and Karn.2003]. Two
each other. That is to say, apart from the query image’s category, groups of derived metrics stem from the fixation: fixation dura-
no two images in the grid are from the same category. The can- tion and fixation count are thoroughly studied to support the
didate images in one searching stimulus are randomly arranged. possibility of inferring the relevance of images based on eye
movements [Goldberg et al.2002; Gołofit 2008]. Suppose that
FDP(m) and FDN(m) are the fixation durations on the positive
Query Class No Class No Class No Class No Class No
Image 01 22 22 75 64 and the negative images observed by subject m, respectively;
Negative Positive Positive Negative Negative
FCP(m) and FCN(m) are the fixation counts on the positive and
Class No
56
Class No
38
Class No
17
Class No
100
Class No
12
the negative images observed by subject m, respectively; Then
Negative Negative Negative Negative Negative
in our searching task, FDP(m) and FDN(m) are defined as
Class No Class No Class No Class No Class No
45 22 06 77 91
Negative Positive Negative Negative Negative ∑, , FD sgn ,
FDP =
(a) Class No Class No Class No Class No Class No
(b) ∑, , sgn , (1)
13 69 22 22 28
Negative Negative Positive Positive Negative ∑, , FD 1 sgn ,
FDN =
∑, , 1 sgn 1,
Figure 1. Image searching stimulus. (a) the layout of the search-
ing stimulus with 5 positive images; (b) an example. where 0,1, … ,20 denotes the image candidate in each
searching stimulus interface; 1,2, … ,21 denotes the stimulus
in each searching task (it also represents the numbers of positive
Such a simulated relevance feedback task asks each participant
images in the current stimulus); 1,2 denotes the task set,
to use his eye to locate the positive image on each stimulus. On
1,2, … ,10 represents the subject and sgn(x) is the signum
locating the positive image, the participants select the target by
function. Consequently, FD is the fixation duration on the
fixating on it for a short period of time with the eye. A set of the
i-th image candidate of the j-th stimulus of the k-th task from
task are composed of 21 such stimulus whose positive image
subject m, and
number are varied from 0 to 20. Thus, the set of task contains
21x21 = 441 images and the total number of the negative images 1 if subject regards cadidate image as positive
and positive images are equal (210 images each). ,
0 if subject regards cadidate image as negative
2.2 Apparatus and Procedure In the similar manner, FCP(m) and FCN(m) are defined as
Eye tracking data is collected by the Tobii X120 eye tracker, ∑, , FC sgn ,
whose accuracy is α 0.5° and drift β 0.3°. Each candidate FCP =
∑, , sgn ,
image has a resolution of 300 x 300 pixels and thus an image
∑, , FC 1 sgn , (2)
stimulus has 1800 x 1200 pixels. Each of stimuli is displayed on FCN =
the screen with a viewing distance of 600 mm and the screen’s ∑, , 1 sgn ,
resolution is 1920x1280 pixels and the pixel pitch is h = 0.264 where FC is the fixation counts on the i-th image candi-
mm. Hence the output uncertainty is just R tan α β /h = date of the j-th stimulus of the k-th task from subject m. The two
30 pixels, which has ensured the error of gaze data no larger pairs of fixation-related variables were monitored and recorded
than 1% area of each candidate image.

38

during the experiment. The average value and standard deviation ing task. We can see that (1) some of the candidate images are
of ten participants are summarized in Table 1. never visited, which indicates the use of pre-attentive vision at
the very beginning of the visual search [Salojärvi et al. 2004].
Table 1 Statistics on the fixation duration and fixation count on During the pre-attentive process, all the candidate images have
positive and negative images been examined to decide the successive fixation locations; and
Sub. FDP(m) FDN(m) FCP(m) FCN(m) (2) in our experiments, revisits happen both on positive images
1.410±1.081 0.415±0.481 2.5±1.9 1.3±1.3 and negative images. The majority of them have just been vi-
1
sited once, while some of them are revisited during the image
2 1.332±0.394 0.283±0.247 2.7±1.4 1.2±0.9 searching.
3 2.582±1.277 0.418±0.430 5.6±3.3 1.7±1.5
4 0.805±0.414 0.356±0.328 2.4±1.2 1.5±1.2 The Number of Visits ‐‐ Histogram
2500 2149
5 1.154±0.484 0.388±0.284 2.6±1.4 1.5±1.0
2000
6 1.880±0.926 0.402±0.338 3.0±1.9 1.4±1.0
1500
7 0.987±0.397 0.166±0.283 1.7±0.8 0.6±0.7 878
8 0.704±0.377 0.358±0.254 2.2±1.1 1.3±0.9 1000
403 306
9 1.125±0.674 0.329±0.403 3.0±2.0 1.4±1.5 500 119 65 80
10 1.101±0.444 0.392±0.235 2.7±1.3 1.5±0.8 0
AVG. 1.308±0.891 0.351±0.345 2.8±2.0 1.3±1.1 No Visit 1 times 2 times 3 times 4 times 5 times > 6 times

Figure 2 The total revisit histogram. The X-axis denotes the
Analysis of variance (ANOVA) tests are performed to find out number of re-fixation and Y-axis is the corresponding count
whether there are discriminating visual behaviors between the (unit: millisecond).
observation of positive and negative images. Given the individu-
al difference in eye movements, we designed two groups of two- Table 3 Overall revisits on positive and negative images
way ANOVA among three factors: test subject, fixation duration A1 1 2 3 4 5 6 >7
and fixation count. The results are shown in Table 2.
A2 549 196 88 55 34 13 27
Table 2 ANOVA test results among three factors: test subject, A3 329 110 31 10 3 2 1
fixation duration and fixation count.
A4 878 306 119 65 37 15 28
GROUP I
Factor Levels Test result A5 63% 64% 74% 85% 92% 87% 100%
(A) Test 10 levels A1 = the number of revisits on an image candidate; A2 = revisit
F(9,9) = 1.26, p < 0.37
Subjects (10 subjects) counts on positive images; A3 = revisit counts on negative im-
(B) Fixation 2 levels ages; A4 = the total number of revisits; A5 = the percentage of
F(1,9) = 32.84, p < 0.0003
Duration (FDP & FDN) the total revisits occurring to positive images.
GROUP II
Factor Levels Test result To compare with Oyekoya and Setntiford’s work [2006], we
(A) Test 10 levels investigate whether the variance of revisit counts has a different
F(9,9) = 2.03, p < 0.15 effect between positive and negative image candidates over all
Subjects (10 subjects)
(B) Fixation 2 levels the participants (as shown in Table 3). When revisits counts ≥ 3
F(1,9) = 28.28, p < 0.0005
Count (FCP & FCN) times, the result of one-way ANOVA is significant with F(1,8)
= 5.73, p < 0.044. That is to say, the probability of revisits on a
As illustrated in Table 2, both fixation duration and fixation positive image is increased with revisits counts. For example,
count revealed significant effects to positive and negative im- when an image is revisited more than three times, it has a very
ages during simulated relevance feedback tasks. Concretely high probability (over 74%) to be a positive image candidate. As
speaking, the fixation durations on each positive image from all a result, the number of revisit is also a feasible implicit relev-
the subjects (1.30 seconds) are longer than those on negative ance feedback to drive an image retrieval engine.
image (0.35seconds). Correspondingly, the analysis of fixation
count produces similar results that subjects visit more times on a 4 Feature Extraction and Results
positive image (2.8) than on a negative one (1.3). On the other The primary focus of this paper is on evaluating the possibility
hand, the variations of different subjects have no significant of inferring the relevance of images based on eye movement
effects on both groups. (In GROUP I, 0.37 > α = 0.05; in GROUP II, data. The features such as fixation duration, fixation count and
0.15 > α = 0.05). the number of revisit have shown discriminating power between
positive and negative images. Consequently, we composed a
3.2 Number of Revisits simple set of 11 features , ,…, , an eye
A revisit is defined as the re-fixation on an AOI previously fix- movement’s vector to predict the positive images from each
ated. Much human computer interaction and usability research returned 4x5 image candidates set in the simulated relevance
shows that re-fixation or revisit on a target may be an indication feedback task, where 1,2, … ,20 denotes the numbers of
of special interest on the target. Therefore, the analysis of revisit positive images in the current stimulus; 1,2, … ,10
during the relevance feedback process may reveal the correlation represents the subject , , … , are listed in Table 4, where
between the eye movement pattern and positive image candi- 1, … ,20 and FL FD /FC .
dates. Table 4 Features used in relevance feedback to predict positive
images
Figure 2 shows a general status of the overall visit frequency (no.
of revisits = no. of visits - 1) throughout the whole image search-

39

NO. Features Description color, texture, shape, and spatial information, to human attention,
such as AOIs. As a result, eye tracking data can be a rich and
Fixation duration on i-th image inside 4x5 image
FD
candidate set interface
new source for improving image representation [Lei Wu et al.
Fixation Count on i-th image inside 4x5 image
2009]. Our future work is to develop an eye tracking based
FC CBIR system in which human beings’ natural eye movements
FL FD /FC will be effectively exploited and used in the modules of image
FL Fixation Length on i-th image inside 4x5 image representation, similarity measurement and relevance feedback.
R
Revisit numbers happened on i-th image inside Acknowledgments
4x5 image candidate set interface
The work reported in this paper is substantially supported by the
Different from Klami et al.’s work [Klami et al. 2008], we use a Research Grants Council of the Hong Kong Special Administra-
decision tree (DT) as a classifier to automatically learn the pre- tive Region, China (Project code: PolyU 5141/07E) and the
diction rules. The data set mentioned in Section 2 is divided into PolyU Grant (Project code: 1-BBZ9).
a training and a testing sets to evaluate the prediction accuracy.
Two different methods are used to train the DT, which are illu- References
strated in Table 5 (prediction precisions are 87.3% and 93.5%,
respectively), and an example of predicted positive image from DACHENG TAO, XIAOOU TANG AND XUELONG LI. 2008. Which
4x5 candidates set is shown in Figure 3. Components are Important for Interactive Image Searching Circuits and
Systems for Video Technology, IEEE Transactions on 18, 3-11. .
Table 5 Training methods and testing results of decision trees
FLICKNER, M., SAWHNEY, H., NIBLACK, W., ASHLEY, J., HUANG,
Method I
Q., DOM, B., GORKANI, M., HAFNER, J., LEE, D., PETKOVIC, D.,
Training Data Set 1,2, … 5 STEELE, D. AND YANKER, P. 1995. Query by Image and Video
Testing Data Set 5,6, … 10 Content: The QBIC System. Computer 28, 23-32. .
Prediction Precision 87.3% GOLDBERG, J.H., STIMSON, M.J., LEWENSTEIN, M., SCOTT, N.
Method II AND WICHANSKY, A.M. 2002. Eye tracking in web search tasks:
design implications. In ETRA '02: Proceedings of the 2002 symposium
Training Data Set 1,3,5 … 19
on Eye tracking research & applications, New Orleans, Louisiana,
Testing Data Set 2,4,6 … 20 Anonymous ACM, New York, NY, USA, 51-58.
Prediction Precision 93.5%
GOŁOFIT, K. 2008. Click Passwords Under Investigation. Computer
Security - ESORICS 2007 343-358. .
JACOB, R. AND KARN, K. 2003. Eye Tracking in Human-Computer
Interaction and Usability Research: Ready to Deliver the Promises. In
The Mind's Eye: Cognitive and Applied Aspects of Eye Movement Re-
search, HYONA, RADACH AND DEUBEL, Eds. Elsevier Science,
Oxford, England.
KLAMI, A., SAUNDERS, C., DE CAMPOS, T.E. AND KASKI, S. 2008.
Can relevance of images be inferred from eye movements? In MIR '08:
Proceeding of the 1st ACM international conference on Multimedia
information retrieval, Vancouver, British Columbia, Canada, Anonym-
ous ACM, New York, NY, USA, 134-140.

Figure 3 An example of predicted positive images from 4x5 LEI WU, YANG HU, MINGJING LI, NENGHAI YU AND XIAN-
candidates set in the simulated relevance feedback task. The SHENG HUA. 2009. Scale-Invariant Visual Language Modeling for
Object Categorization. Multimedia, IEEE Transactions on 11, 286-294. .
query image is “hedgehog”, and DT model returned 8 predicted
positive images (in red frames) based on the 11 features vector FEIFEI. LI ,Visual recognition: computational models and human psy-
with 100% accuracy. chophysics, Phd Thesis, California Institute of Technology, 2005.
LIU, D., HUA, K., VU, K. AND YU, N. 2006. Fast Query Point Move-
5 Conclusion and Further Work ment Techniques with Relevance Feedback for Content-Based Image
Retrieval. Advances in Database Technology - EDBT 2006 700-717. .
An eye tracking system can be possibly integrated into a CBIR
system as a more efficient input mechanism for implementing OYEKOYA, O. AND STENTIFORD, F. 2004. Exploring Human Eye
the user’s relevance feedback process. In this paper, we mainly Behaviour using a Model of Visual Attention. 17th International Confe-
concentrate on a group of fixation- related measurements which rence on (ICPR'04) Volume 4, IEEE Computer Society, Washington, DC,
USA, 945-948.
shows static eye movement patterns. In fact, the dynamic cha-
racteristics can also manifest human organizational behavior and OYEKOYA, O. AND STENTIFORD, F. 2006. Perceptual Image Retriev-
decision processes, such as saccades and scan path, which reveal al Using Eye Movements. Advances in Machine Vision, Image
the pre-attention and cognition process of a human being while Processing, and Pattern Analysis 281-289. .
viewing an image. In our further work, we will continue to de- SALOJÄRVI, J., PUOLAMÄKI, K. AND KASKI, S. 2004. Relevance
velop a more comprehensive study which includes both the stat- feedback from eye movements for proactive information retrieval. In
ic and dynamic features of eye movements. Originally, it is a Workshop on Processing Sensory Information for Proactive Systems
unity of human’s conscious and unconscious visual cognition (PSIPS 2004, Anonymous , 14-15.
behavior, which can not only be used in relevance feedback, but
also a new source of image representation. Human’s image ZHOU, X.S. AND HUANG, T.S. 2003. Relevance feedback in image
viewing automatically bridge the low level features, such as retrieval: A comprehensive review. Multimedia Systems 8, 536-544. .

40

Eye Tracking Predicts Image Relevance in CBIR Systems

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (19)

Destaque

Destaque (20)

Semelhante a Eye Tracking Predicts Image Relevance in CBIR Systems

Semelhante a Eye Tracking Predicts Image Relevance in CBIR Systems (20)

Mais de Kalle

Mais de Kalle (20)

Eye Tracking Predicts Image Relevance in CBIR Systems