Pontillo Semanti Code Using Content Similarity And Database Driven Matching To Code Wearable Eyetracker Gaze Data

SemantiCode: Using Content Similarity and Database-driven
Matching to Code Wearable Eyetracker Gaze Data
Daniel F. Pontillo, Thomas B. Kinsman, & Jeff B. Pelz*
Multidisciplinary Vision Research Lab, Carlson Center for Imaging Science, Rochester Institute of Technology
{dfp7615, btk1526, *pelz}@cis.rit.edu

Abstract can be mapped onto the scene camera’s intrinsic 3D coordinate
system. This allows for accurate ray tracing from a known origin
Laboratory eyetrackers, constrained to a fixed display and static relative to the scene camera. While this method has been shown
(or accurately tracked) observer, facilitate automated analysis of to be accurate, it has limitations. Critically, it requires an
fixation data. Development of wearable eyetrackers has accurate and complete a priori map of the environment to relate
extended environments and tasks that can be studied at the object identities with fixated volumes of interest. In addition, all
expense of automated analysis. data collection must be completed with a carefully calibrated
scene camera, and the algorithm is computationally intensive.
Wearable eyetrackers provide 2D point-of-regard (POR) in Another proposed method is based on Simultaneous
scene-camera coordinates, but the researcher is typically Localization and Mapping (SLAM) algorithms originally
interested in some high-level semantic property (e.g., object developed for mobile robotics applications [Thrun and Leonard
identity, region, or material) surrounding individual fixation 2008]. Like FixTag, current implementations of SLAM-based
points. The synthesis of POR into fixations and semantic analyses require that the environment be completely mapped
information remains a labor-intensive manual task, limiting the before analysis begins, and are brittle to scene layout changes,
application of wearable eyetracking. precluding their use in novel and/or dynamic environments.

We describe a system that segments POR videos into fixations Our initial impetus for this research was the need for a tool to
and allows users to train a database-driven, object-recognition aid the coding of gaze data from mobile shoppers interacting
system. A correctly trained library results in a very accurate and with products. Because the environment changes every time a
semi-automated translation of raw POR data into a sequence of product is purchased (or the shopper picks up a product to
objects, regions or materials. inspect it), neither FixTag nor SLAM-based solutions were
viable. Another application of the tool is in a geoscience
Keywords: semantic coding, eyetracking, gaze data analysis research project, in which multiple observers explore a large
number of sites. While the background in each scene is static, it
isn’t practical to survey each site horizon-to-horizon, and
1 Introduction because the scenes include an active instructor and other
observers, existing solutions were not suitable for this case.
Eye tracking has a well-established history of revealing valuable
information about visual perception and more broadly about Figure 1 shows sample frames from the geosciences-project
cognitive processes [Buswell 1935; Yarbus 1967; Mackworth gaze video recorded in an open, natural scene, which contains
and Morandi 1967; Just and Carpenter 1976]. Within this field many irregular objects and other observers. Note that even if it
of research, the objective is often to examine how an observer were possible to extract volumes of interest and camera motions
visually engages with the content or layout of an environment. within this environment, there would be no mechanism for
When the observer’s head is stationary (or accurately tracked) mapping fixations within volumes into their semantic identities
and the stimuli are static (or their motion over time is recorded), because of the dynamic properties of the scene.
commercial systems exist that are capable of automatically
extracting gaze behavior in scene coordinates. Outside the
laboratory, where observers are free to move through dynamic
environments, the lack of constraints precludes the use of most
existing automatic methods.

A variety of solutions have been proposed and implemented in
order to overcome this issue. One approach, ‘FixTag,’ [Munn
and Pelz 2009] utilizes ray tracing to estimate fixation on 3D
volumes of interest. In this scheme, a calibrated scene camera is
used to track features across frames, allowing for the extraction
of 3D camera movement. With this, points in a 2D image plane
Figure 1 Frames gaze captured in outdoor scene
Copyright © 2010 by the Association for Computing Machinery, Inc.
Permission to make digital or hard copies of part or all of this work for personal or 2 Semantic-based Coding
classroom use is granted without fee provided that copies are not made or distributed
for commercial advantage and that copies bear this notice and the full citation on the
first page. Copyrights for components of this work owned by others than ACM must be Our goal in developing the SemantiCode tool was to replace the
honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on concept of location-based coding with a semantic-based tool. In
servers, or to redistribute to lists, requires prior specific permission and/or a fee.
Request permissions from Permissions Dept, ACM Inc., fax +1 (212) 869-0481 or e-mail
2D location-based coding, the identity of each fixation is defined
permissions@acm.org.
ETRA 2010, Austin, TX, March 22 – 24, 2010.
© 2010 ACM 978-1-60558-994-7/10/0003 $10.00

267

by the (X,Y) coordinate of the fixation in a scene plane. The 2D An existing algorithm for automatic extraction of fixations
scene plane can be extended for dynamic cases such as web [Munn 2009; Rothkopf and Pelz 2004] was modified and
pages, providing that any scene motion (i.e., scrolling) is embedded within the SemantiCode system. Temporal and spatial
captured for analysis. In 3D location-based coding, fixations are
defined by the (X,Y,Z) coordinate of the fixation in scene space,
provided that the space is mapped and all objects of interest are
placed within the map.

By contrast, in semantic-based coding, a fixation’s identity can
be determined independent of its location in a 2D or 3D scene.
Rather than basing identity on location, semantic-based coding
uses the tools of object recognition to infer a semantic identity
for each fixation. A wide range of spectral, spatial, and temporal
features can be used in this recognition step. Note that while
identity can be determined independent of location in semantic-
based coding, location can be retained as a feature in identifying
a fixation by integrating location data. Alternatively, a ‘relative Figure 2 The SemantiCode GUI as it appears after the user
location’ feature can be included by incorporating the semantic- has loaded a video and tagged a number of fixations. This
based features of the region surrounding the fixated object. usage example represents a scenario wherein a library has
just been built. The area on the left side of the interface
Fundamental to the design of the SemantiCode Tool is the contains all of the fixation viewer components, while the area
concept of database training. Training occurs at two levels; the on the right is generally devoted to coding, library
system is first trained by manually coding gaze videos. As each management, and the display of the top matches from the
fixation is coded, the features at fixation are captured and stored active library
along with the image region as an exemplar of the semantic
identifier. Higher-level training can occur via relative weighting constraints on the fixation extraction can be adjusted by the
of multiple features, as described in Section 8. experimenter via the Fixation Definition Adjustment subgui seen
in Figure 3. The user is also presented with statistics about the
3 Software Overview fixations as calculated from the currently selected video. The
average fixation duration and the rate of fixations per second can
be useful indicators of how well the automatic segmentation has
SemantiCode was designed as a tool to optimize and optionally
worked for the current video [Salvucci and Goldberg 2000].
automate the process of coding without sacrificing adaptability,
robustness, and an immediate mechanism for manual overrides.
The software is reliant on user interaction and feedback, yet in
most cases this interaction requires very little training. One
major design consideration was a scalable operative complexity;
this is crucial for research groups who employ undergraduates
and other short-term researchers, as it obviates the need for an
extended period of operator training. To this end, the graphical
user interface (GIU) allows users manual control over every
parameter and phase of video analysis and coding, while
simultaneously offering default settings and semi-automation
that should be applicable to most situations. Assuming previous Figure 3 Fixation Definition Adjustment subgui allows the
users have trained a strong library of objects and exemplars, the user to shift the constraints on what may be considered a
coding task could be as simple as pressing one key to progress fixation in order to produce more or fewer fixations.
through fixations, resulting in a table of data that correlates each
fixation to the semantic identity of the fixated region. The 5 Fixation Analysis
training process requires some straightforward manual
configuration before this type of usage is possible, but A single frame, extracted from the temporal center of the active
depending on the variety of objects of interest, this can still be fixation in the gaze video is displayed on the left of the main
achieved in a much shorter period of time with significantly less GUI. Within the frame, a blue box indicates the pixel region
effort than previous manual processes have required. considered relevant in all further steps. Beneath the frame, that
region’s semantic identifier is shown, if one exists, along with a
4 Graphical User Interface text display of the progress that has been made in coding the
currently selected video. The user can use an intuitive control
When the user runs SemantiCode for the first time, the first step panel for switching between fixation frames, videos and
is to import a video that has been run through POR analysis projects. Users have the option of manually navigating fixations
software. (The examples here were done with Positive Science either with a drop-down menu fixation selector, with the
Yarbus 2.0 software [www.positivescience.com]). Any video next/previous buttons, or with the right and left arrows on the
with an accompanying text file listing the POR for each video user’s keyboard.
frame can be used. The POR location and time code of each
frame are used to automatically segment the video into estimated 6 Object Coding
fixations. Once this is finished, the first fixation frame appears,
and coding can proceed. The primary purpose of SemantiCode is the attachment of
semantic identification information to a set of pixel locations in

268

an eye tracking video. Thus, the actual coding of fixations is a exemplars to test against. The denominator is the sum of each
critical functionality in the software. The first time the software model’s histogram, a normalization constant computed once.
is used with video of a new environment, coding begins H(I,M) represents the fractional match value [0 – 1] between the
manually. Users add new objects to the active library by typing fixated region and a model in the library. This has the desirable
in an identifier for the fixated region in the active frame, which qualities that background colors, which are not present in the
can be selected as either 64x64 or 128x128 pixels surrounding object, are ignored. The intersection only increases if the same
the point of regard. With each added object, the image and its colors are present, and the amount of those colors does not
histogram are stored in the active library under the specified exceed the amount expected. This approach is robust to changes
name. Once a sufficient number of objects have been added to in orientation and scale because it relies only on the color
describe the elements of interest in the environment, the user can intensity values within the two images being compared. It is also
continue coding by selecting the most appropriate member of the computationally efficient, requiring only n comparisons, n
object list. As each frame is tagged with a name, the frame additions, and one division.
number, video name, and semantic identifier are stored and
displayed as coded frames. The representation of 3D content by 2D views is elegantly
handled by the design of the library. Each semantic identifier
After coding each fixation (either manually or by accepting can contain an arbitrary number of exemplars from any view or
SemantiCode’s match), data about the fixation and the video scale. Consequently, multiple perspectives are added to the
from which it was extracted are written to an output data file. library as they are required. The library is thus adaptively
With this, statistical analyses can easily be run on the newly modified to meet the needs of the coding task.
added semantic information for each coded fixation.
Future work will involve extended feature generation and
7 Building a Library selection, including alternative and adaptive histogram
techniques, and the use of machine-learning algorithms for
enhanced object discrimination.
The data structure that underlies SemantiCode is referred to as a
library. A library is simply a collection of semantic identifiers
that each contain one or more images or exemplars that has been
constructed through the act of coding. When a user runs the
software for the first time, an unpopulated default library is
automatically created. Users can immediately start adding to this
library, which is a persistent data structure that is automatically
imported for every subsequent session of the software. The user
can create a new blank library, copy an existing library into a
new one, merge two or more libraries into one, and delete
unwanted libraries.
Figure 4 The Examine subgui for a region called “Distant
Alternatively, users can import a pre-existing library, or merge Terrain.” The GUI displays the exemplar and image for
several libraries into one before ever coding a single object. This each fixation tagged with this name.
portability is a major feature, as it means that theoretically for a
given environment, manual object coding must only be done Since the current algorithm is not affected by shape or spatial
once. All subsequent coding tasks, regardless of the user or the configuration, it is not is not necessary to segment the region of
location of the software, can be based on a pre-built library of interest from its background. As a result, irregular environments
exemplars and object images. and observer movement do not degrade performance. Even more
compelling is the capacity for this algorithm to accurately match
8 Semantic Recognition materials and other mass nouns that may not take the form of
discrete objects. The ability to automatically identify materials
Computer Vision usually attempts to either find the location of a along with objects helps to address a larger issue in the machine-
known object (“where?”), or identify an object at a known vision field about the salience of uniform material regions.
location (“what?”). In the case of eyetracking the fixation
location is given, so the primary question is, “What is the fixated These factors make the Swain and Ballard [1990] color-
object, region, or material?” To answer this the region histogram method an attractive choice for a highly adaptable and
surrounding the calculated POR is characterized by one or more robust form of assisted semantic coding. Testing with just RGB
features. Those features are then compared to the features stored histogram intersections shows great promise. In its current
in a library to answer the question posed above. implementation, each time a new fixation frame is shown,
SemantiCode matches its histogram against every object in the
As our initial method, we used the color-histogram intersection currently active library, ranks them, and displays the top ten
method introduced by Swain and Ballard [1990], in which the objects on the right panel. The highest-ranking object shows the
count of the number of pixels in each bin of image I’s histogram top three exemplars.
is compared to the number of pixels in the same bin of model
M’s histogram: Table 1 shows the results of preliminary tests in a challenging
outdoor environment similar to that depicted in Figure 1. For
n n analysis, five regions were identified: Distant terrain,
H(I, M) = " min(I j , M j ) " Mj Eq.1 Midground terrain, Horizon, and Lake. After initializing the
j=1 j=1
library by coding the first nine fixations within each region, the
color-histogram match scores for the tenth fixation in each
Where Ij represents the jth bin of the histogram at fixation and Mj region were calculated. Recall that SemantiCode performs an
is the jth bin of a model’s histogram from the library of
!

269

exhaustive search of all histograms. Table 1 contains the peak With future improvements and extensibility, SemantiCode
histogram match within each category. In the current promises to become a valuable tool to support attaching
implementation, SemantiCode presents the top ten matches to semantic identifiers to image content. It will be possible to tune
the experimenter. Hitting a single key accepts the top match; any SemantiCode to virtually any environment. By combining the
of the next nine can be accepted instead by using the numeric power of database-driven identification with unique matching
keypad, as seen in Figure 2. techniques, it will only be limited by the degree to which it is
appropriately trained. It is thus promising both as a tool for
Table 1 Peak histogram match (see text) evaluating which algorithms are useful in different experimental
scenarios, and as an improved practical coding system with
which to analyze research data.
Midground

Horizon
Lighter

Distant
terrain

terrain

terrain
10 Acknowledgments

Lake
This work was made possible by the generous support of Procter
Midground 81% 52% 26% 38% 55% & Gamble and NSF Grant 0909588.
terrain
Lighter terrain 34% 77% 72% 54% 65%
Distant terrain 45% 65% 82% 58% 71% 11 Proprietary Information/Conflict of Interest
Horizon 14% 30% 39% 60% 55%
Lake 14% 61% 72% 65% 81% Invention disclosure and provisional patent protection for the
described tools are in process.
The next version will allow the experimenter to implement
automatic coding when the feature matches are unambiguous. References
For example, if the top match exceeds a predefined accept
parameter (e.g., 80%), and no other matches are closer than a BUSSWELL, G.T. 1935 How People Look At Pictures: A Study Of
conflict parameter (e.g., 10%) of the top match, the fixation The Psychology Of Perception In Art, The University of
would be coded without experimenter intervention. If either Chicago Press, Chicago
constraint is not met, SemantiCode would revert to suggesting
codes and waiting for verification. Table 1 shows that even in JUST, M. A. AND CARPENTER, P. A. 1976. Eye fixations and
the challenging case of a low-contrast outdoor scene with similar cognitive processes. Cognitive Psychology, 8, 441-480.
spectral signatures, three of the five semantic categories would
be coded correctly without user intervention, even with only MACKWORTH, N.H. AND MORANDI, A. 1967. The gaze selects
nine exemplars per region. Note that in this case the semantic informative details within pictures, Perception and
label ‘Horizon’ spans two distinct regions, making it a challenge Psychophysics, 2, 547–552.
to match. Still, the correct label is the second highest match.
MUNN, S.M., and Pelz, J.B. 2009. FixTag: An algorithm for
To test SemantiCode’s ability to work in various environments, identifying and tagging fixations to simplify the analysis of data
it was also evaluated in a consumer-shopping environment. Six collected by portable eye trackers. Transactions on Applied
regions were identified for analysis: four shampoos and two Perception, Special Issue on APGV, In press.
personal hygiene products. Histogram matches were calculated
as described for the outdoor environment. The indoor ROTHKOPF, C. A. and PELZ, J. B. 2004. Head movement
environment was less challenging – after training, all six estimation for wearable eye tracker. In Proceedings of the 2004
semantic categories could be coded correctly without user Symposium on Eye Tracking Research & Applications (San
intervention with top matches ranging from 74% to 85%. Antonio, Texas, March 22 - 24, 2004). ETRA '04. ACM, New
York, NY, 123-130.
In the near future, additional image-matching algorithms will be
evaluated within the SemantiCode application for their SALVUCCI, D. D. and GOLDBERG, J. H. 2000. Identifying
effectiveness in different scene circumstances. Using the results fixations and saccades in eye-tracking protocols. In Proceedings
from these evaluations it will be possible to select optimally of the 2000 Symposium on Eye Tracking Research &
useful match evaluation approaches. Applications (Palm Beach Gardens, Florida, United States,
November 06 - 08, 2000). ETRA '00. ACM, New York, NY, 71-
Match scores can be computed as weighted combinations of 78.
outputs from a number of image matching algorithms. Weights,
dynamically adjusted by the reinforcement introduced by the SWAIN, M.J., BALLARD, D.H. 1990. Indexing Via Color
experimenter’s manual coding, would allow a given library to be Histograms, 1990, Third International Conference on Computer
highly tuned to the detection of content that may otherwise be Vision.
too indistinct for any individual matching technique.
THRUN, S. and LEONARD, J. 2008. Simultaneous localization and
9 Conclusion mapping. In SICILIANO, B. and KHATIB, O., Springer Handbook
of Robotics, Springer, Berlin.
SemantiCode offers a significant improvement over previous
YARBUS, A.L. 1967. Eye movements and vision. New York:
approaches to streamlining the coding of eyetracking data. The
Plenum Press.
immediate benefit is seen in the dramatically increased
efficiency for video coding, and increased gains are anticipated
with the semi-autonomous coding described.

270

Pontillo Semanti Code Using Content Similarity And Database Driven Matching To Code Wearable Eyetracker Gaze Data

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (17)

Destaque

Destaque (19)

Semelhante a Pontillo Semanti Code Using Content Similarity And Database Driven Matching To Code Wearable Eyetracker Gaze Data

Semelhante a Pontillo Semanti Code Using Content Similarity And Database Driven Matching To Code Wearable Eyetracker Gaze Data (20)

Mais de Kalle

Mais de Kalle (17)

Pontillo Semanti Code Using Content Similarity And Database Driven Matching To Code Wearable Eyetracker Gaze Data