Exploring Interaction Modes for Image Retrieval

Corey Engelman¹, Rui Li¹, Jeff Pelz², Pengcheng Shi¹, Anne Haake¹

¹ B. Thomas Golisano College of Computing and Information Sciences, Rochester Institute of Technology, 1 Lomb Memorial Drive, Rochester, NY 14623-5603, {cde7825, rxl5604, spcast, arhics}@rit.edu

² College of Imaging Arts and Sciences, Rochester Institute of Technology, 1 Lomb Memorial Drive, Rochester, NY 14623-5603, {jbppph}@rit.edu

NGCA '11, May 26-27, 2011, Karlskrona, Sweden. Copyright 2011 ACM 978-1-4503-0680-5/11/05…$10.00.
ABSTRACT
The number of digital images in use is growing at an increasing rate across a wide array of application domains, so there is an ever-growing need for innovative ways to help end users gain access to these images quickly and effectively. Moreover, it is becoming increasingly difficult to manually annotate these images, for example with text labels, to generate useful metadata. One method for helping users gain access to digital images is content-based image retrieval (CBIR). Practical use of CBIR systems has been limited by several "gaps", including the well-known semantic gap and usability gaps [1]. Innovative designs are needed to bring end users into the loop to bridge these gaps. Our human-centered approaches integrate human perception and multimodal interaction to facilitate more usable and effective image retrieval. Here we show that multi-touch interaction is more usable than gaze-based interaction for explicit image region selection.

Categories and Subject Descriptors
H.5.2 [Information Interfaces and Presentation]: User Interfaces – graphical user interfaces, input devices and strategies, prototyping, user-centered design, voice I/O, interaction styles.

General Terms
Measurement, Performance, Design, Experimentation, Human Factors

Keywords
Multimodal, eye tracking, image retrieval, human-centered computing

1. INTRODUCTION
Research in CBIR has shown that image content is more expressive of users' perception than is textual annotation. A semantic gap occurs, however, when low-level image features, such as color or texture, are insufficient to represent an image in a way that reflects human perception. One possible way to bridge the semantic gap is to take a "human-centered" approach to system design. This is particularly important in knowledge-rich domains, such as biomedical applications, where information about the images can be extracted from experts and utilized. Major questions remain as to how best to bring users "into the loop" [2,3].

Multimodal user interfaces are promising as the interactive component of CBIR systems because different modes are best suited to expressing different kinds of information. Recent research efforts have focused on developing and studying usability for multimodal interaction [4,5,6]. Designing natural, usable interaction will require an understanding of which user interactions should be explicit and which implicit. Consider query by example (QBE), which requires users to select a representative image and often a region of that image. It is the usual paradigm in CBIR, but users have difficulty forming such queries, and innovative new methods are needed to support QBE. Beyond QBE, more effective methods are needed for gaining relevance feedback from the user to refine the results of a search. This could be done explicitly, by having the user directly specify which images were close to what they were looking for, or implicitly, by simply noting which images they looked at with interest (e.g., via gaze). Finally, good organization of the images returned from a query is as important as the underlying retrieval system itself, in that it allows the user to quickly scan the results and find what they are looking for.

Our approach to overcoming the interactivity challenges of CBIR is largely based on bringing the user into the process by combining traditional modes of input, such as the keyboard and mouse, with interaction styles that may be more natural, such as gaze input (eye tracking), voice recognition, and multi-touch interaction. A software framework for such a system was developed using existing graphical user interface (GUI) libraries, along with several subcomponents that allow for interaction via the new methods within a GUI. With this basic framework for multimodal interface design in place, it is now possible to quickly develop and test prototypes for different interface layouts, and even for different modes of interaction, using one or more of the input modes (mouse, keyboard, gaze, voice, touch).

A series of studies will be performed to determine which of these prototypes are most efficient and usable across a range of image types and among varied end-user groups. The first of these, described here, is a study of modes of interaction for performing QBE through explicit region-of-interest selection. The main goal is to compare the efficiency of the different interaction methods, as well as user preference, ease-of-use, and ease-of-learning.

2. METHODS
2.1 Design and Implementation
The best approach to developing a multimodal user interface such as the one described here is an evolutionary one: breaking the large overall goal of building a multimodal user interface into smaller, obtainable goals, and designing, implementing, testing, and integrating these smaller portions. In this way, the developer can ensure that separate components are not dependent on one another, because one builds stand-alone subsystems and then integrates them.
2.1.1 Eye Tracking
A SensoMotoric Instruments (SMI) RED 250 Hz eye-tracking device was used to track the position of the user's gaze on the monitor. SMI's iView X software was used to run the eye tracker, and SMI's Experiment Center was used to perform a calibration prior to use. Our custom software, written in Java, communicates with the device using the User Datagram Protocol (UDP) to send signals to the eye tracker to start and stop recording. Once the eye tracker receives the start signal, it begins streaming screen coordinates to the program. A separate program thread can then repeatedly read the new coordinates and update the variables corresponding to the user's gaze.
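A minimal sketch of such a listener thread is shown below. The command string and the "x y" sample format are placeholders, since the actual iView X remote commands and packet layout are defined in SMI's documentation.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

// Minimal gaze-streaming client (illustrative only; the real iView X
// command strings and sample format are specified by SMI).
public class GazeClient implements Runnable {
    private volatile int gazeX, gazeY;          // latest raw gaze sample
    private final DatagramSocket socket;
    private final InetAddress trackerHost;
    private final int trackerPort;

    public GazeClient(String host, int port) throws Exception {
        this.socket = new DatagramSocket();     // local port chosen by the OS
        this.trackerHost = InetAddress.getByName(host);
        this.trackerPort = port;
    }

    private void send(String command) throws Exception {
        byte[] data = command.getBytes(StandardCharsets.US_ASCII);
        socket.send(new DatagramPacket(data, data.length, trackerHost, trackerPort));
    }

    @Override
    public void run() {
        try {
            send("START_STREAMING");            // hypothetical start command
            byte[] buffer = new byte[256];
            while (!Thread.currentThread().isInterrupted()) {
                DatagramPacket packet = new DatagramPacket(buffer, buffer.length);
                socket.receive(packet);         // blocks until the next gaze sample
                String[] fields = new String(packet.getData(), 0, packet.getLength(),
                        StandardCharsets.US_ASCII).trim().split("\\s+");
                gazeX = (int) Double.parseDouble(fields[0]);
                gazeY = (int) Double.parseDouble(fields[1]);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public int getGazeX() { return gazeX; }     // polled by the UI thread
    public int getGazeY() { return gazeY; }
}
```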
Because the human eye is naturally jittery, it is necessary to implement an algorithm for smoothing/filtering the data coming from the eye tracker. Because the system is developed in an object-oriented language, implementing such functionality is as simple as creating an abstract Filter class and then creating several concrete subclasses of it, which allows multiple filtering algorithms to be created and swapped easily. Even this basic functionality affords a vast array of possibilities for how the eye input data can be used for interaction; for example, eye tracking could be used to replace mouse/keyboard scrolling and panning [7].
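For illustration, a moving-average filter under this design might look like the following (class and method names are illustrative, not taken from the system as published):

```java
// Abstract smoothing filter for raw gaze samples; concrete subclasses
// (moving average, dispersion-based, ...) can be swapped freely.
public abstract class GazeFilter {
    public abstract int[] filter(int rawX, int rawY); // returns {smoothedX, smoothedY}
}

// Simple moving average over the last N samples, kept in a ring buffer.
public class MovingAverageFilter extends GazeFilter {
    private final int[] xs, ys;
    private int count, next;

    public MovingAverageFilter(int windowSize) {
        xs = new int[windowSize];
        ys = new int[windowSize];
    }

    @Override
    public int[] filter(int rawX, int rawY) {
        xs[next] = rawX;
        ys[next] = rawY;
        next = (next + 1) % xs.length;          // advance ring buffer index
        if (count < xs.length) count++;
        long sumX = 0, sumY = 0;
        for (int i = 0; i < count; i++) {
            sumX += xs[i];
            sumY += ys[i];
        }
        return new int[] { (int) (sumX / count), (int) (sumY / count) };
    }
}
```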
2.1.2 Voice Recognition
Java defines the Java Speech Application Programming Interface (JSAPI), which is implemented by several open-source libraries. Any implementation of the JSAPI is a suitable choice, as they all provide the functionality specified by Java. For our system, we chose the Cloud Garden JSAPI (http://www.cloudgarden.com). Beyond a suitable library that implements the JSAPI, a speech recognition engine is required on the computer running the multimodal system. We used Windows Speech Recognition, because it is included in the Windows 7 operating system. A custom "grammar" can be written to specify which commands the system will accept; a simple controller then receives recognized commands, interprets them, and passes them on to the proper event handler. Voice recognition has the potential to greatly increase the efficiency of interaction between system and user. Furthermore, it is simple to include basic functions such as a speech lock, so that the user can easily turn voice recognition on and off.
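Such a grammar can be written in JSGF, the grammar format that JSAPI consumes. The sketch below shows the shape of a controller built this way; the grammar contents and the dispatch code are illustrative, not the system's actual source:

```java
import java.io.StringReader;
import javax.speech.Central;
import javax.speech.recognition.Recognizer;
import javax.speech.recognition.Result;
import javax.speech.recognition.ResultAdapter;
import javax.speech.recognition.ResultEvent;
import javax.speech.recognition.ResultToken;
import javax.speech.recognition.RuleGrammar;

// Illustrative JSAPI controller: loads a small JSGF command grammar and
// forwards each accepted phrase to a dispatcher.
public class VoiceController {
    // JSGF grammar listing the spoken commands the system accepts.
    private static final String GRAMMAR =
        "#JSGF V1.0;\n" +
        "grammar commands;\n" +
        "public <command> = set anchor | start eye tracking | stop eye tracking | undo;";

    public static void main(String[] args) throws Exception {
        Recognizer recognizer = Central.createRecognizer(null); // default installed engine
        recognizer.allocate();
        RuleGrammar grammar = recognizer.loadJSGF(new StringReader(GRAMMAR));
        grammar.setEnabled(true);
        recognizer.addResultListener(new ResultAdapter() {
            @Override
            public void resultAccepted(ResultEvent e) {
                // Rebuild the full spoken phrase and hand it to the dispatcher.
                Result result = (Result) e.getSource();
                StringBuilder phrase = new StringBuilder();
                for (ResultToken token : result.getBestTokens()) {
                    if (phrase.length() > 0) phrase.append(' ');
                    phrase.append(token.getSpokenText());
                }
                dispatch(phrase.toString());
            }
        });
        recognizer.commitChanges();
        recognizer.requestFocus();
        recognizer.resume();
    }

    // Route a recognized phrase to the matching event handler (stubbed here).
    private static void dispatch(String command) {
        System.out.println("Recognized: " + command);
    }
}
```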
2.1.3 Multi-Touch Interaction
For multi-touch, an open-source library called MT4J (http://www.mt4j.org) was used. This library allows the Windows 7 touch-screen events to be used within a Java application. From here, it is possible to implement custom gesture processors or to use a number of predefined ones. Touch interaction can be applied to QBE and to a number of other interactions with the user. Beyond this, the library allows creation of custom multi-touch user interface components. Another benefit is that it is simple to create stand-alone multi-touch applications and then embed them in the system. This follows the previously mentioned evolutionary prototyping methodology, because it allows simple standalone prototypes to be developed and then integrated into the existing system. For our experiment, a Dell SX2210T touch-screen monitor was used.
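The registration pattern looks roughly like the following sketch, modeled on MT4J's published examples; the API names come from those examples and have not been verified against the exact library version used here:

```java
import org.mt4j.MTApplication;
import org.mt4j.components.visibleComponents.shapes.MTEllipse;
import org.mt4j.input.gestureAction.DefaultDragAction;
import org.mt4j.input.inputProcessors.componentProcessors.dragProcessor.DragProcessor;
import org.mt4j.sceneManagement.AbstractScene;
import org.mt4j.util.math.Vector3D;

// Sketch of MT4J's processor/listener pattern: a circle that can be
// dragged with a finger, added to a scene's canvas.
public class TouchScene extends AbstractScene {
    public TouchScene(MTApplication app, String name) {
        super(app, name);
        MTEllipse circle = new MTEllipse(app, new Vector3D(300, 300, 0), 80, 80);
        // Predefined processor: turns touch-drag input into gesture events.
        circle.registerInputProcessor(new DragProcessor(app));
        circle.addGestureListener(DragProcessor.class, new DefaultDragAction());
        getCanvas().addChild(circle);
    }

    @Override
    public void init() { }

    @Override
    public void shutDown() { }
}
```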
2.1.4 Traditional GUI Components
Because the subcomponents of the multimodal user interface were developed in Java, the Swing GUI libraries can be used to create traditional visual components and to handle input from the mouse and keyboard. This also makes developing the basic framework for the user interface (i.e., windowing and layout structure) very simple, because Java's Swing library includes classes for a UI window (JFrame) and the LayoutManager interface for managing placement of components within the window. Furthermore, a system for rapid prototyping of UI layouts can be put in place to facilitate development. This involves creating an abstract class called PrototypeUI that inherits from Java's JFrame class. Any number of prototype UI layouts can then be created and tested without changing the code for the core functionality of the system or for the previously mentioned subcomponents that handle the different modes of input.
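A minimal version of such a base class might look like this (the method names are illustrative; the paper does not give the actual signature):

```java
import javax.swing.JFrame;
import javax.swing.SwingUtilities;

// Base class for swappable UI prototypes: each layout subclass decides only
// how components are arranged; core input handling lives elsewhere.
public abstract class PrototypeUI extends JFrame {

    public PrototypeUI(String title) {
        super(title);
        setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
    }

    /** Subclasses arrange their components here. */
    protected abstract void buildLayout();

    public void launch() {
        SwingUtilities.invokeLater(() -> {   // build the UI on the event-dispatch thread
            buildLayout();
            pack();
            setVisible(true);
        });
    }
}
```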
2.2 Experimental Design
To evaluate prototype interaction styles for QBE, we recruited nine undergraduate and graduate students at Rochester Institute of Technology as study participants. Participants were given an explanation of the CBIR paradigm and of QBE, and then a brief tutorial on each prototype mode they would be using. For the study they were shown a set of ten images, four separate times, in randomized order. Each of the four times they were shown the ten images, their task was to perform QBE by explicit region-of-interest selection using one of the four prototype methods of interaction. Because we are concerned in this study not with regions of interest within objects but with whether the user can effectively select an object at all, we instructed the user to select a specific object from each image (e.g., select the eight ball from an image of billiard balls on a pool table; see Figure 1C).

2.2.1 Image Selection
When choosing the images to use for the study, there were two main considerations. First, because we specified what to select, the images needed obvious, discrete objects to eliminate ambiguity. Second, we wanted to test our four prototypes across a variety of images, so we defined three categories of images (simple, intermediate, and complex) based on the complexity of the object the user was to select. For the simple category, we photographed billiard balls in different configurations. This covers both criteria, because the shape is simply a circle and it allows us to instruct the user to select the eight ball. For the intermediate category, we used dice. This allowed us to construct a number of shapes of intermediate complexity; we considered them intermediate because the edges are always straight, so in a 2D image the shapes formed by the dice are essentially polygons. Finally, for the complex category, we chose images of horses. A horse is an obviously more complex shape than the previous examples, and it still allows for easy instruction of what to select, because each of the images contained a brown pony and a larger whitish/greyish horse.

2.2.2 Prototype Interaction Methods
2.2.2.1 The Anchor Method
The anchor method combines gaze, voice, and either the mouse or the touch screen. The user looks at the center of the object they want to select, then says the command "set anchor". This places a small selection circle on screen where the user was looking. Next to this selection circle is a slider that can be moved left to decrease, or right to increase, the radius of the selection circle. The slider can be adjusted using either mouse or touch, depending on the user's preference.
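A compact Swing sketch of this control is shown below; all names are illustrative, and the gaze coordinates would come from the eye-tracking thread of Section 2.1.1:

```java
import java.awt.Color;
import java.awt.Graphics;
import javax.swing.JFrame;
import javax.swing.JPanel;
import javax.swing.JSlider;

// Sketch of the anchor method: a selection circle placed at the gaze
// point when "set anchor" is spoken, then resized with a slider.
public class AnchorPanel extends JPanel {
    private int anchorX = -1, anchorY = -1;   // set from the gaze thread on "set anchor"
    private int radius = 40;

    public void setAnchor(int gazeX, int gazeY) {
        anchorX = gazeX;
        anchorY = gazeY;
        repaint();
    }

    @Override
    protected void paintComponent(Graphics g) {
        super.paintComponent(g);
        if (anchorX >= 0) {
            g.setColor(new Color(255, 0, 0, 90));   // translucent selection circle
            g.fillOval(anchorX - radius, anchorY - radius, 2 * radius, 2 * radius);
        }
    }

    public static void main(String[] args) {
        JFrame frame = new JFrame("Anchor method sketch");
        AnchorPanel panel = new AnchorPanel();
        JSlider slider = new JSlider(10, 200, 40);  // selection radius in pixels
        slider.addChangeListener(e -> { panel.radius = slider.getValue(); panel.repaint(); });
        frame.add(panel);
        frame.add(slider, java.awt.BorderLayout.SOUTH);
        frame.setSize(640, 480);
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        panel.setAnchor(320, 220);                  // stand-in for a real gaze point
        frame.setVisible(true);
    }
}
```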
2.2.2.2 Gaze Interaction
Unlike the anchor method, this method uses eye tracking almost exclusively. The user finds the object to select, then clicks a button using either mouse or touch screen to begin eye tracking. Once it is turned on, the program paints over the area the user glances over, to provide feedback on the selection so far. When finished, the user presses the same button to stop the eye tracker. Alternatively, eye tracking can be started by saying the command "start eye tracking" and stopped by saying "stop eye tracking". While painting, saccades are not drawn; rather, fixations are visualized by placing translucent circles on the screen, with the radius of each circle determined by the fixation duration (i.e., a longer fixation duration means a larger radius).
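The painting behavior can be sketched as follows, using a simple dispersion-based grouping of samples into fixations; the exact fixation-detection algorithm used in the system is not specified in this paper, so this is only illustrative:

```java
import java.awt.Color;
import java.awt.Graphics;
import java.util.ArrayList;
import java.util.List;
import javax.swing.JPanel;

// Sketch of gaze painting: consecutive samples that stay within a small
// dispersion window are merged into one fixation, drawn as a translucent
// circle whose radius grows with fixation duration. Saccades (the jumps
// between fixations) are never drawn.
public class GazePaintPanel extends JPanel {
    private static final int DISPERSION_PX = 30; // max spread within one fixation
    private static final int PX_PER_SAMPLE = 1;  // radius growth per 4 ms sample at 250 Hz (assumed)

    private static final class Fixation {
        int x, y, samples;
    }

    private final List<Fixation> fixations = new ArrayList<>();

    /** Called from the gaze thread for every new (smoothed) sample. */
    public synchronized void addSample(int x, int y) {
        Fixation last = fixations.isEmpty() ? null : fixations.get(fixations.size() - 1);
        if (last != null && Math.abs(x - last.x) < DISPERSION_PX
                         && Math.abs(y - last.y) < DISPERSION_PX) {
            last.samples++;                      // same fixation: just grow it
        } else {
            Fixation f = new Fixation();         // saccade happened: start a new fixation
            f.x = x; f.y = y; f.samples = 1;
            fixations.add(f);
        }
        repaint();
    }

    @Override
    protected synchronized void paintComponent(Graphics g) {
        super.paintComponent(g);
        g.setColor(new Color(0, 120, 255, 60));  // translucent paint
        for (Fixation f : fixations) {
            int r = 10 + f.samples * PX_PER_SAMPLE;
            g.fillOval(f.x - r, f.y - r, 2 * r, 2 * r);
        }
    }
}
```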
2.2.2.3 Mouse Selection
For this method, the user finds the object of interest and then presses and holds the mouse button to begin drawing a selection window. The selection auto-completes by always drawing a straight line from the point of the initial click to where the mouse is currently located. When the user finishes the selection, they simply release the mouse button.

2.2.2.4 Touch Selection
This method works like mouse selection except that, rather than pointing and clicking with the mouse, the user traces the object with a finger to form the selection window. The window auto-completes in the same fashion as for mouse selection.
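Both selection methods reduce to accumulating pointer positions into a path that is always closed back to its starting point. A sketch using java.awt.geom follows (names illustrative; touch input arrives through the same listener when the OS delivers it as pointer events):

```java
import java.awt.Color;
import java.awt.Graphics;
import java.awt.Graphics2D;
import java.awt.event.MouseAdapter;
import java.awt.event.MouseEvent;
import java.awt.geom.GeneralPath;
import javax.swing.JPanel;

// Sketch of mouse/touch region selection: points are appended while the
// button (or finger) is down, and the path is auto-closed with a straight
// line back to the first point.
public class SelectionPanel extends JPanel {
    private final GeneralPath path = new GeneralPath();
    private boolean drawing;

    public SelectionPanel() {
        MouseAdapter handler = new MouseAdapter() {
            @Override public void mousePressed(MouseEvent e) {
                path.reset();
                path.moveTo(e.getX(), e.getY()); // selection starts at first press
                drawing = true;
            }
            @Override public void mouseDragged(MouseEvent e) {
                if (drawing) {
                    path.lineTo(e.getX(), e.getY());
                    repaint();
                }
            }
            @Override public void mouseReleased(MouseEvent e) {
                drawing = false;                 // release finishes the selection
                repaint();
            }
        };
        addMouseListener(handler);
        addMouseMotionListener(handler);
    }

    @Override
    protected void paintComponent(Graphics g) {
        super.paintComponent(g);
        if (path.getCurrentPoint() == null) return; // nothing drawn yet
        Graphics2D g2 = (Graphics2D) g;
        GeneralPath closed = (GeneralPath) path.clone();
        closed.closePath();                      // auto-complete: straight line to start
        g2.setColor(new Color(0, 180, 0, 70));
        g2.fill(closed);
        g2.setColor(Color.GREEN.darker());
        g2.draw(closed);
    }
}
```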
Figure 1. From left to right, images from the intermediate, complex, and simple categories. The first is a selection made using the touch screen, the second uses gaze interaction, and the third uses the anchor method.

2.2.3 Metrics
To evaluate the usability attributes of efficiency and usefulness for each style of interaction, we defined several metrics. Accuracy was measured by calculating the area of the object in the image (in pixels) prior to selection, using the GNU Image Manipulation Program (GIMP), and then calculating the area of the object covered by a given selection, to determine the percentage of the object the user missed. Precision was determined by calculating how much of the user's selection fell outside the object: the amount of excess selection (in pixels) was divided by the total selection (in pixels) to give the relative excess of the user's selection. Efficiency of the different modes was determined by measuring the time (in seconds) to complete a selection. We also asked the users to rate each of the prototypes in three categories (ease-of-use, ease-of-learning, and how natural the method felt) on a scale from one to five, and we counted the number of times the user had to use the undo function. These measurements reflect the usability of a prototype rather than its efficiency and accuracy.
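Concretely, with the object area, the selected area inside the object, and the total selected area all in pixels, missed = 1 - inside/object and excess = (total - inside)/total. A small helper (illustrative, not from the study's tooling) makes the arithmetic explicit:

```java
// Per-trial usability metrics, computed from pixel areas measured in GIMP.
public final class SelectionMetrics {
    /** Fraction of the object the user failed to select: 1 - inside/object. */
    public static double missed(long objectArea, long selectedInsideObject) {
        return 1.0 - (double) selectedInsideObject / objectArea;
    }

    /** Fraction of the selection that fell outside the object: excess/total. */
    public static double excess(long totalSelection, long selectedInsideObject) {
        return (double) (totalSelection - selectedInsideObject) / totalSelection;
    }

    public static void main(String[] args) {
        // Example: 10,000 px object, 9,200 px of it selected, 11,500 px total selection.
        System.out.printf("missed = %.1f%%%n", 100 * missed(10_000, 9_200));  // 8.0%
        System.out.printf("excess = %.1f%%%n", 100 * excess(11_500, 9_200)); // 20.0%
    }
}
```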
3. DATA ANALYSIS
3.1 Data Collection
Camtasia Studio (TechSmith) was used to record the screen during the study, and data were extracted from still images captured from the video. These images showed the participants' selections for each of the ten images, four separate times (once for each method). The data extracted included the area (in pixels) they selected within the object and the area that was excess selection; again, the values were measured using GIMP. Inspection of the data suggested that the best way to compare the four prototypes would be to show a measure of accuracy, as the percentage of the object that the participants missed; a measure of precision, as the percentage of the user's total selection that was not the object; and a measure of efficiency, as the time taken to complete the image.

3.2 Efficiency of Interaction Methods
Descriptive statistical analysis of the data was performed to determine the efficiency of the different prototypes in terms of accuracy, precision, and time to complete, and box plots were constructed to compare the prototypes.

Figure 2 (a-c). Box plots of the data collected from the nine participants on all four interaction methods for one of the images of horses: (a) the percentage of the selection that was excess, (b) the percentage of the object missed by the user, and (c) the time taken to complete the selection.

In all three plots, the touch screen method has the most consistent results (the smallest boxes). The touch screen also has the lowest median value for percentage of the object missed and for time taken to complete. For percentage of excess selection, the mouse has the lowest median, but the touch screen still had the more consistent set of values, the bulk of which were lower than those from the mouse.
Table 1. Average values of excess selection, percentage of the object missed, and time taken for all four prototype methods.

                  Anchor   Touch   Mouse   Gaze
Excess            48.4%    17.7%   17.1%   49.4%
Missed             9.0%     4.7%    9.8%    7.6%
Time (s)          17.6     13.9    16.3    20.8
3.3 User Preference
Table 2. Average user-preference ratings (scale of one to five) and average undo usage for all four prototypes.

                  Anchor   Touch   Mouse   Gaze
Ease-of-Use        2.9      4.5     4.7     3.3
Ease-of-Learning   3.5      4.8     4.4     3.8
Natural            2.6      4.7     4.0     2.4
Undo Usage         8        1       1       1

Table 2 clearly shows that the mouse and touch screen received higher ratings than the two methods using eye tracking. In general, the users were in agreement about the different prototypes, with the standard deviation of the ratings being below one on average (SD ≈ 0.86). Undo usage was fairly low, with the average user pressing undo just once per ten images when using touch, mouse, or gaze. The anchor method, however, had a markedly higher undo usage, and its variance in undo usage is also relatively high (SD ≈ 10.2). This is likely caused by a combination of factors. The method has a steep learning curve, requiring the user to coordinate three input modes. Furthermore, the inaccuracy of the eye tracker, plus or minus two degrees of visual angle, plays a more significant role here: unlike the gaze method, where users can see where they are painting and adjust their eyes, with this method an off-target tracker only becomes apparent after the anchor is placed, and the user must then click undo.
4. CONCLUSIONS
4.1 Eye Tracking Interaction Methods
This study shows clearly that using eye tracking for explicit user interaction is not effective in a task that requires the user to be precise and accurate. This is not surprising, since people have difficulty with the kind of smooth pursuit that drawing or tracing activities might require when objects are stationary [8]. This, in combination with some inaccuracy of the eye tracker, does not allow enough accuracy with the interaction styles implemented for this study. It is more likely that implicit interaction, i.e., selection based on more natural gaze behavior as a user is browsing or examining an image, such as in [5,9], will be effective for QBE.

4.2 Touch Screen and Mouse Interaction Methods
For the user group studied here, touch screen and mouse show similar results for a tracing/selection task, with the touch screen generally slightly more efficient than the mouse. For the complexly shaped images, however, the gap widens: the touch screen is clearly more efficient than the mouse. This is likely because the touch screen is more natural than the mouse, even for technically savvy, college-age participants, in that it is closer to humans' natural interaction with objects. The mouse, in contrast, only mimics natural interaction and requires the user to coordinate hand and eye without the hand being in their field of vision. Furthermore, the average user prefers to use a mouse or touch screen for this type of task.

4.3 Individual Differences
Finally, our study metrics show that interaction with the mouse and touch screen is generally consistent across participants, whereas there is greater variability with eye tracking. This probably occurs because using one's eyes to select or trace something is not natural, so while some people may learn the method very quickly, others will not.

4.4 Future Studies
Studies are ongoing to prototype and test additional interaction styles that may be useful for image retrieval. For example, a study of the efficiency of the different modes in search-related tasks, such as scrolling, selecting an entire image from a set, or using gestures (see [10]), would be useful; it might be the case that in these types of tasks, mouse and touch screen are not the most efficient. We are also engaged in using gaze for implicit interaction, as in [5,9], towards our long-term goal of creating adaptive, multimodal systems for image retrieval.

5. ACKNOWLEDGMENTS
This work is supported by NSF grant IIS-0941452. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

6. REFERENCES
[1] Deserno, T.M., Antani, S., and Long, R. Ontology of gaps in content-based image retrieval. J Digit Imaging 22, 2 (Apr. 2009), 202-215.
[2] Lew, M.S., Sebe, N., Djeraba, C., and Jain, R. Content-based multimedia information retrieval: State of the art and challenges. ACM Transactions on Multimedia Computing, Communications and Applications 2, 1 (2006), 1-19.
[3] Müller, H., Michoux, N., Bandon, D., and Geissbuhler, A. A review of content-based image retrieval systems in medical applications: clinical benefits and future directions. Int J Med Inform 73, 1 (2004), 1-23.
[4] Qvarfordt, P. and Zhai, S. Conversing with the user based on eye-gaze patterns. In Proc. CHI 2005, ACM, 221-230.
[5] Sadeghi, M., Tien, G., Hamarneh, G., and Atkins, M.S. Hands-free interactive image segmentation using eyegaze. In SPIE Medical Imaging, 2009.
[6] Ren, J., Zhao, R., Feng, D.D., and Siu, W.C. Multimodal interface techniques in content-based multimedia retrieval. In Proc. ICMI 2000, 634-641.
[7] Kumar, M. and Winograd, T. Gaze-enhanced scrolling techniques. In UIST 2007: Symposium on User Interface Software and Technology, Newport, RI, 2007.
[8] Krauzlis, R.J. The control of voluntary eye movements: new perspectives. The Neuroscientist 11, 2 (Apr. 2005), 124-137. PMID 15746381.
[9] Santella, A., Agrawala, M., DeCarlo, D., Salesin, D., and Cohen, M. Gaze-based interaction for semi-automatic photo cropping. In Proc. CHI 2006.
[10] Heikkilä, H. and Räihä, K.-J. Speed and accuracy of gaze gestures. Journal of Eye Movement Research, 2009.