We present a novel method for measuring the similarity between aggregates of scanpaths. This may be thought of as a solution to the "average scanpath" problem. As a by-product of this method, we derive a classifier for groups of scanpaths drawn from various classes. This capability is demonstrated empirically using data gathered from an experiment that attempts to automatically classify subjects as experts or novices on a set of visual tasks.
Figure 2: Collections of scanpaths of novice (a) and expert (c) pilots over a single stimulus. Time-projected scanpaths of novices (b) and of experts (d) can be considered side views of the three-dimensional data.
ADHD through eye tracking data. They created three classifiers, including a classifier based on Levenshtein distance, and discovered that Levenshtein's gave the best results among their chosen algorithms. To show relative improvement, we also compare the performance of our algorithm to a similar Levenshtein classifier.

3 Group-Wise Similarity

Our algorithm takes as input two collections of fixation-filtered scanpaths. An example image is presented in Figure 2, displayed in 2(a) with all novice scanpaths and in 2(c) with all expert scanpaths. From a simple visual examination, there is no obvious characteristic that stands out for either collection. A procedure is then needed to perform a deeper statistical analysis of each collection.

The original impetus for this approach was the desire to formulate an elegant scanpath comparison measure for dynamic stimuli, such as movies or interactive tasks. Current string-editing approaches are not sufficient for video. For example, a string-editing alignment could mistakenly align AOIs from frames that are many seconds apart. There is nothing to explicitly constrain AOIs to only coincide within specific temporal limits.

From the perspective of a collection of movie frames, each frame can be thought of as a separate stimulus. The scanpath for a single subject, viewing a movie stimulus, can then be broken up into a collection of fixation-frame units, which are more or less independent of each other. This conceptualization of a scanpath differs from the conventional view, in that the conventional visualization is a "projection" of fixations over time onto a two-dimensional plane. Our conceptualization avoids this projection entirely. Thus, we produce a three-dimensional "scanpath function". Given some scanpath s and time t, the fixation function, f(s, t), produces either the fixation attributable at that timestamp, e.g., frame, or null (for saccades) from scanpath s. Figure 1 visualizes the difference between the standard scanpath representation and a side view of the three-dimensional representation.

We extend the above definition to the function f(S, t) by changing the single scanpath parameter s to a collection of scanpaths S. This function then returns a collection of fixations for all scanpaths in S at the given timestamp. Then, we may differentiate groups of subjects into their own scanpath sets. For instance, in our experiment, we study the differences between experts and novices. We may then create an expert scanpath set E and a novice scanpath set N. The functions f(E, t) and f(N, t) then return collections of fixations at timestamp t for experts and novices, respectively (Figures 2(b) and 2(d) visualize the same data as in Figures 2(a) and 2(c), but as side views of their three-dimensional representations).

These group-specific collections of fixations for single frames may be clustered by the mean shift approach described by Santella and DeCarlo [2004]. The resulting clusters serve as general AOIs for a given frame, describing regions of varying interest for that specific group of individuals. We may then construct a probabilistic model of expected attention for that group. Such a model for a single frame is visualized in Figure 3.

Each frame will have a separate model associated with it, and we may calculate the "error per group" of a given fixation in a frame by summing the Gaussian distances from the fixation point to all group-specific cluster centers. We use a Gaussian kernel with a standard deviation of 50 pixels to determine the distance value.
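The per-frame similarity computation above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the exact inverted-distance function is not fully specified in the text, so a unit-height Gaussian kernel with σ = 50 pixels (matching the stated standard deviation) is assumed, and the group-specific cluster centers are taken as given (e.g., from mean shift clustering).

```python
import math

SIGMA = 50.0  # Gaussian kernel standard deviation in pixels, per the text

def fixation_similarity(fixation, cluster_centers, sigma=SIGMA):
    """Similarity of one fixation point to a group's cluster centers for a
    single frame: a Gaussian-kernel 'inverse distance' to each center
    (1.0 when collocated, near 0 well beyond ~50 px), summed and divided
    by the number of clusters, yielding a value between 0 and 1."""
    if not cluster_centers:
        return 0.0
    fx, fy = fixation
    total = 0.0
    for cx, cy in cluster_centers:
        d_sq = (fx - cx) ** 2 + (fy - cy) ** 2
        total += math.exp(-d_sq / (2.0 * sigma ** 2))
    return total / len(cluster_centers)
```

A fixation sitting exactly on a lone cluster center scores 1.0; a fixation far from every center scores near 0, with intermediate cases averaged over all clusters.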
Figure 3: Mixture of Gaussians for expert fixations at a discrete timestamp. Displayed novice fixations were not used in the clustering operation. Note that the fixation labeled 'A' is far from the cluster centers, and thus has lower similarity than the fixation labeled 'B' that is close to a cluster center.

We then invert this distance value: a fixation point collocated with a cluster mean or centroid has an inverse distance value (similarity) of 1.0, and a fixation point more than 50 pixels away from the cluster mean has an inverse distance value close to 0. The summation of the cluster similarities for a single fixation point is divided by the number of clusters, giving a value between 0 and 1.

With a mechanism to evaluate group-specific error, or rather similarity, of fixation points in individual frames, we may then extrapolate this process over the entire scanpath duration by summing individual similarities for each frame and then returning the average. Thus, a scanpath in which most fixation points lie near group-specific clusters will have similarity close to 1.0 for that group, while a scanpath in which most fixation points lie far away from those clusters will have similarity close to 0. This metric may then be extrapolated further to describe the similarity of one group of scanpaths to another by simply averaging together the group-wise similarities of each scanpath in one group to the entire other group.

The data collected for expert/novice classification purposes did not, in fact, use video as stimulus. Nevertheless, while the video-based approach is expected to be more reliable for video, its application to static images would also be beneficial. In concordance with the video paradigm, we take samples from our data every 16 milliseconds. Thus, this procedure may be utilized for analysis over both static and dynamic stimuli. In our study, recorded scanpaths are of various lengths. We must, therefore, specify a time window over which to collect fixation data. The upper bound on the length of this window is the shorter of either the length of the scanpath being compared or the mean of the scanpath lengths for a given stimulus.

To evaluate the capabilities of this new approach, we compared the results to a group-wise extension of pairwise string-editing similarity. The group-wise string-editing similarity of a single scanpath to a group of scanpaths is the average pairwise similarity of that scanpath to each scanpath in the group it is being compared to.

4 Classification

Our method of group-wise scanpath similarity is validated by a machine learning validation approach. Machine learning, specifically classification, is a statistical framework which takes, as input, one or more groups of data and produces, as output, probability values that describe the likelihood that some arbitrary datum is a member of one or more of the defined groups. Thus, we may use this approach to validate whether our group-wise similarity measure produces information that may be used to reliably discriminate between groups. A classifier must be constructed for each group, e.g., the expert group and the novice group. The classifier for expert data will be described below. The classifier for novice data may be constructed identically, though with different input values.

As input to our classifier, we provide a list of group-wise similarity scores, corresponding to the similarities of individual scanpaths to the expert model, as described above. The goal of the expert classifier, then, is to determine some similarity threshold score, above which a given scanpath is likely to be expert and below which it is unlikely to be expert.

We use the receiver operating characteristic (ROC) curve to find this threshold. A thorough description of the curve may be found in Fogarty et al. [2005]. This curve may also be used to compute the area under the ROC curve (AUC). This value describes the discriminative ability of a classifier. Simple percentage accuracy values may be misrepresentative, especially in skewed cases, such as having a large quantity of data from one class and a small quantity of data from another. The AUC value describes the probability that an individual instance of one class will be classified differently from an instance of another class.

Two classifiers are trained: an expert and a novice classifier. This means that two scores are produced for a single instance. Each score describes the probability that an instance is a member of the expert or novice group, respectively. To decide which class this instance conclusively belongs to, we use a heuristic. There are a few possibilities for the arrangement of these scores. First, the expert score may be higher than the expert threshold, and the novice score may be lower than the novice threshold. This case is trivially expert. Similarly, an instance with expert score lower than the expert threshold and novice score higher than the novice threshold is trivially novice. In the case of both scores being above or below their respective thresholds, we divide the score of each classifier by its threshold value and choose the greater of the two.

5 Results

In order to evaluate our method, we analyzed the results of a study wherein 20 high-time pilots (experts) and 20 non-pilots (novices) were presented with 20 different images of weather. Subjects were asked to determine whether they would continue their current flight path or whether they needed to divert. Their eye movements were recorded by a Tobii ET-1750 eye tracker (their verbal responses were ignored in our analysis). Our objective was to produce a classifier that can predict whether a subject is expert or novice, based solely on their eye movements.

With two classes, a random classifier would be expected to produce 0.50 accuracy and AUC values. Evaluation metrics for our mechanism are listed in Table 1. In our evaluation, we refer to expert data as our positive class and novice data as negative. According to the p-values, all metrics are significantly higher than random for our method, while only the accuracy and AUC for the positive classifier are significantly higher for the string-editing method.

Results show the classifier's discriminative ability over a single stimulus. Given multiple stimuli, our measure is extrapolated over all stimuli for each subject. A "majority vote" is then used, where one vote is drawn from each stimulus. If more than half the votes indicate that a subject is expert, that subject is classified as conclusively expert. Otherwise, a subject is classified as novice. Accuracies for this voting mechanism are listed in Table 2.
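The decision heuristic and the majority vote described above can be sketched as follows. This is a minimal sketch: the function names are our own, the thresholds are assumed to come from the ROC analysis described earlier, and the tie-breaking when the two threshold-normalized scores are exactly equal is an assumption not stated in the text.

```python
def classify_instance(expert_score, novice_score,
                      expert_threshold, novice_threshold):
    """Decide a single instance's class from its two similarity scores.
    Trivial cases: exactly one score exceeds its threshold. Otherwise
    (both above or both below), divide each score by its threshold and
    choose the greater of the two normalized values."""
    is_expertish = expert_score > expert_threshold
    is_novicish = novice_score > novice_threshold
    if is_expertish and not is_novicish:
        return "expert"
    if is_novicish and not is_expertish:
        return "novice"
    if expert_score / expert_threshold >= novice_score / novice_threshold:
        return "expert"
    return "novice"

def classify_subject(per_stimulus_votes):
    """Majority vote over stimuli: a subject is conclusively expert
    iff more than half of the per-stimulus votes are 'expert'."""
    expert_votes = sum(1 for v in per_stimulus_votes if v == "expert")
    return "expert" if expert_votes > len(per_stimulus_votes) / 2 else "novice"
```

For example, an instance whose expert and novice scores both exceed their thresholds is assigned to whichever classifier's score is larger relative to its own threshold.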
Cross-Validation Results
                posAcc   negAcc   totAcc   posAUC   negAUC
Temporal
  Average         0.71     0.64     0.68     0.85     0.86
  Std Dev         0.07     0.12     0.07     0.07     0.04
  Median          0.74     0.66     0.68     0.87     0.86
  p-value         0.00     0.01     0.00     0.00     0.00
String-editing
  Average         0.49     0.64     0.57     0.81     0.72
  Std Dev         0.17     0.13     0.06     0.06     0.11
  Median          0.48     0.64     0.57     0.82     0.71
  p-value         0.98     0.02     0.16     0.00     0.00

Table 1: Results of classification cross-validation for both the new temporal method and string-editing similarity. Columns are accuracy of positive (expert) and negative (novice) instances, total combined accuracy, and AUC values for positive and negative classification. P-values are results of a t-test for significance of score distributions against a random distribution.

Subject Results
                Temporal               String-editing
                Experts   Novices      Experts   Novices
  Average         0.68      0.35         0.45      0.34
  Accuracy         85%       95%          40%       80%

Table 2: Results of cross-stimulus validation. Accuracy is determined by counting the number of experts/novices with expert ratio greater than 0.5 in the case of experts and less than or equal to 0.5 in the case of novices.

6 Discussion

The AUC values listed in Table 1 show stronger discriminative ability than a measure based on string-editing. P-values from t-tests indicate that the results of our new method are significantly different from random for all measures, while results of the string-editing method are only significant for novices and AUC values. The cross-stimulus results in Table 2 show that novice instances are consistently easier to classify than expert instances, but the overall accuracies are still quite high: 85% of the positive instances are properly classified as experts, while 95% of the negative instances are classified as novice. This is an improvement over string-editing, with 40% positive accuracy and 80% negative accuracy.

The average accuracies in the cross-stimulus table may be interpreted as the cross-validated similarity of each class to the expert class. The group-wise similarity of experts to the expert class is 0.68, while the group-wise similarity of novices to the expert class is 0.35. The experts' similarity is above 0.5, while the novices' is below 0.5, which is appropriate and intuitive, though one might expect the similarity of a class with itself to be closer to 1.0. In this case, though, since we are cross-validating our results, we are not so much measuring the similarity between a group and itself as measuring the average similarity between members of the same class. In the case of measuring the similarity of different classes, though, such as comparing the novice class to the expert class, the intuitive idea of group-wise similarity is more appropriate and convenient.

7 Conclusion

A group-wise scanpath similarity measure and classification algorithm have been described, allowing analysis and discrimination of groups of scanpaths, based on any informative grouping of those scanpaths. This mechanism has been empirically and statistically validated, showing that it is capable of discriminating between groupings at least as diverse as expert/novice subject appellation, with greater accuracy and reliability than random. Potential applications include training environments, neurological disorder diagnosis, and, in general, evaluation of attention deviation from that expected or desired during a dynamic stimulus. Future work may include pre-alignment of unclassified scanpaths with classified scanpaths, attempting to increase the accuracy further during the calculation of class similarity.

References

Dempere-Marco, L., Hu, X.-P., Ellis, S. M., Hansell, D. M., and Yang, G.-Z. 2006. Analysis of Visual Search Patterns With EMD Metric in Normalized Anatomical Space. IEEE Transactions on Medical Imaging 25, 8 (August), 1011–1021.

Duchowski, A. T. and McCormick, B. H. 1998. Gaze-Contingent Video Resolution Degradation. In Human Vision and Electronic Imaging III. SPIE, Bellingham, WA.

Fogarty, J., Baker, R. S., and Hudson, S. E. 2005. Case studies in the use of ROC curve analysis for sensor-based estimates in human computer interaction. In GI '05: Proceedings of Graphics Interface 2005. Canadian Human-Computer Communications Society, School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada, 129–136.

Galgani, F., Sun, Y., Lanzi, P., and Leigh, J. 2009. Automatic analysis of eye tracking data for medical diagnosis. In Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining (IEEE CIDM 2009). IEEE.

Hembrooke, H., Feusner, M., and Gay, G. 2006. Averaging Scan Patterns and What They Can Tell Us. In Eye Tracking Research & Applications (ETRA) Symposium. ACM, San Diego, CA, 41.

Leigh, R. J. and Zee, D. S. 1991. The Neurology of Eye Movements, 2nd ed. Contemporary Neurology Series. F. A. Davis Company, Philadelphia, PA.

Pomplun, M., Ritter, H., and Velichkovsky, B. 1996. Disambiguating Complex Visual Information: Towards Communication of Personal Views of a Scene. Perception 25, 8, 931–948.

Privitera, C. M. and Stark, L. W. 2000. Algorithms for Defining Visual Regions-of-Interest: Comparison with Eye Fixations. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 22, 9, 970–982.

Räihä, K.-J., Aula, A., Majaranta, P., Rantala, H., and Koivunen, K. 2005. Static Visualization of Temporal Eye-Tracking Data. In INTERACT. IFIP, 946–949.

Sadasivan, S., Greenstein, J. S., Gramopadhye, A. K., and Duchowski, A. T. 2005. Use of Eye Movements as Feedforward Training for a Synthetic Aircraft Inspection Task. In Proceedings of ACM CHI 2005 Conference on Human Factors in Computing Systems. ACM Press, Portland, OR, 141–149.

Santella, A. and DeCarlo, D. 2004. Robust Clustering of Eye Movement Recordings for Quantification of Visual Interest. In Eye Tracking Research & Applications (ETRA) Symposium. ACM, San Antonio, TX, 27–34.

Wooding, D. 2002. Fixation Maps: Quantifying Eye-Movement Traces. In Eye Tracking Research & Applications (ETRA) Symposium. ACM, New Orleans, LA.