Homography normalization is presented as a novel gaze estimation method for uncalibrated setups. The method applies when head movements are present and imposes no requirements for camera calibration or geometric calibration. The method is geometrically and empirically demonstrated to be robust to head pose changes and, despite being less constrained than cross-ratio methods, it consistently outperforms them by several degrees on both simulated data and data from physical setups. The physical setups include the use of off-the-shelf web cameras with infrared light (night vision) and standard cameras with and without infrared light. The benefits of homography normalization and uncalibrated setups in general are also demonstrated through obtaining gaze estimates (in the visible spectrum) using only the screen reflections on the cornea.
Feature-based methods explore the characteristics of the human eye to identify a set of distinctive and informative features around the eyes that are less sensitive to variations in illumination and viewpoint. Ensuring head pose invariance is a common problem often solved through the use of external light sources and their reflections (glints) on the cornea. Besides the glints, the pupil is the most common feature to use, since it is easy to extract in IR spectrum images. The image measurements (e.g. the pupil) however, are influenced by refraction [Guestrin and Eizenman 2006]. The limbus is less influenced by refraction, but since its boundary may be partially occluded, it may be more difficult to obtain reliable measurements.

Two types of feature-based gaze estimation approaches exist: the interpolation-based (regression-based) and the model-based (geometric). Using a single camera, the 2D regression methods model the optical properties, geometry and the eye physiology indirectly and may, therefore, be considered as approximate models which may not strictly guarantee head pose invariance. They are, however, simple to implement, do not require camera or geometric calibration (a.k.a. weak prior model) and may still provide good results under conditions of small head movements. More recent 2D regression-based methods attempt to improve performance under larger head movements through compensation, or by adding additional cameras [Hansen and Ji 2010]. The 3D model-based methods, on the other hand, directly compute the gaze direction from the eye features based on a geometric model of the eye. Most 3D model-based (or geometric) approaches rely on metric information and thus require camera calibration and a global geometric model (external to the eye) of light sources, camera and monitor position and orientation. Gaze direction is modeled either as the optical axis or the visual axis. The optical axis is the line connecting the pupil center, cornea center and the eyeball center. The line connecting the fovea and the center of the cornea is the visual axis. The visual axis is presumably the true direction of gaze. The visual and optical axes intersect at the cornea center with subject dependent angular offsets. In a typical adult, the fovea is located about 4-5° horizontally and about 1.5° below the point of the optic axis and the retina and may vary up to 3° vertically between subjects. Much of the theory behind geometric models using fully calibrated setups, has been formalized by Guestrin and Eizenman [2006]. Their model covers a variable number of light sources and cameras, human specific parameters, light source positions, refraction, and camera parameters but is limited by only applying to fully calibrated setups. Methods relying on fully calibrated setups are most common in commercial and research-based systems but are limited for public use unless placed in a rigid setup. Any change (e.g. placing the camera differently or changing the zoom of the camera) requires a tedious recalibration.

An alternative to the fully calibrated systems while allowing for head movements is to use projective invariants and multiple light sources [Yoo and Chung 2005; Coutinho and Morimoto 2006]. Contrary to the previous methods, Yoo et al. [2005] describe a method which is capable of determining the point of regard based solely on the availability of light source position information (e.g. no camera calibration or prior knowledge of rigid transformations between hardware units) by exploiting the cross-ratio of four points (light sources) in projective space. Yoo et al. [2005] use two cameras and four IR light sources placed around the screen to project these corners on the corneal surface, but only one camera is needed for gaze estimation. When looking at the screen the pupil center should ideally be within the four glint area. A fifth IR light emitter is placed on-axis to produce bright pupil images and to be able to account for non-linear displacements (modeled by four $\alpha_i$ parameters) of the glints. The method of Yoo et al. [2005] was shown to be prone to large person specific errors [Coutinho and Morimoto 2006] and can only use the light sources for calibration (e.g. not on other screen positions). Coutinho and Morimoto [2006] extend the model of Yoo et al. [2005], by using the offset between visual and optical axes as an argument to learn a constant on-screen offset. They additionally perform an elaborate evaluation of the consequences of changing the calibration of the virtual calibration parameter ($\alpha$). Based on this, they argue that a simpler model can be made by learning a single $\alpha$ value rather than four different values as originally proposed. Where calibration in [Yoo and Chung 2005] can only be done by looking at the light sources in the screen corners, the method of [Coutinho and Morimoto 2006] may use multiple on-screen targets.

Since the cross-ratio is defined on projective planes and is invariant to any projective transformation, scale changes will not influence the cross-ratio. The method is therefore not directly applicable to depth translations. Coutinho and Morimoto [2006] show significant accuracy improvements compared to the original paper, provided the user does not change their distance to the camera and monitor. The advantage of the method, compared to methods based on calibrated setups, is that full hardware calibration is needless. The method only requires light source position data relative to the screen. One limitation is that the light sources should be placed right on the corners of the screen. In practice the method is highly sensitive to the individual eye and formal analysis of the method is presented by Kang et al. [2008]. They identified two main sources of errors: (1) the angular offset between visual and optical axes and (2) the offset between pupil and glint planes. Depending on the point configuration, the cross-ratio is also known for not being particularly robust to noise, since small changes in point positions can result in large variations in the cross-ratio.

3 Homography Normalization for Gaze Estimation

This section presents the fundamental model for a robust point of regard estimation method in uncalibrated setups (a priori unknown geometry and camera parameters). The components of the model are illustrated in figure 1.

Figure 1: Geometric model of the human eye, light sources, screen, camera and projections (dashed line). The pupil is depicted as an ellipse with center $p_c$ and the cornea as a hemisphere with center $C$. The corneal-reflection plane, $\Pi_c$, and its projection in the image are shown by quadrilaterals. Both $\Pi_c$ and the cornea focal point, $f_c$, are displaced relative to each other and to the pupil center for illustration purposes.

The cornea is approximately spherical and has a radius, $R_c$, about 7.8 mm. The cornea reflects light similarly to a convex mirror and has a focal point, $f_c$, located halfway between the corneal surface
and the center of corneal curvature ($f_c = R_c/2 \approx 3.9$ mm). Reflections on the cornea consequently appear further away than the corneal surface (a.k.a. virtual reflections).

Denote the screen plane $\Pi_s$ and four (virtual) reflections on the cornea $(g_1^c \ldots g_4^c)$. The reflections may come from any point in 3D space, for example external light sources ($L_i$) or the corners of a screen reflected on the cornea. The issue of screen projections will be addressed in section 5.3. For the sake of simplicity and without loss of generality, the following description assumes $(g_1^c \ldots g_4^c)$ come from point light sources. Provided the eye is stationary, any location of a light source, $L_i$, on $l_i$ with the same direction produces the same point of reflection on the cornea. The light sources can therefore interchangeably be assumed located on e.g. the screen plane $\Pi_s$ or at infinity as depicted in figure 1. Projected points at infinity lie in the focal plane of the convex mirror. With four light sources there will exist a plane $\Pi_c$ (in fact a family of planes related by homographies), spanned by the lines $l_i$. This plane is denoted the corneal-reflection plane and is close to $f_c$ when the $L_i$ are at infinity. When considering the reflection laws (e.g. not a projection) the corneal reflections may only be approximately planar.

Without loss of generality suppose the light sources are located on $\Pi_s$. The quadrilateral of glints $(g_1^c \ldots g_4^c)$ is consequently related to the corresponding quadrilateral $(g_1^i \ldots g_4^i)$ in the image via a homography, $H_c^i$, from the cornea ($\Pi_c$) to the image ($\Pi_i$) [Hartley and Zisserman 2004]. Similarly, the mapping from the cornea to the screen is also given by a homography $H_c^s$. The homography from the image to the screen, $H_i^s = H_c^s \circ H_i^c$ with $H_i^c = (H_c^i)^{-1}$, via $\Pi_c$ will therefore exist regardless of the location of the cornea, provided the geometric setup does not change. These arguments also apply to cross-ratio-based methods [Coutinho and Morimoto 2006; Yoo and Chung 2005].

The pupil center is located about 4.2 mm from the cornea center but its location varies between subjects and over time for a particular subject [Guestrin and Eizenman 2006]. However, the pupil is located approximately 0.3 mm ($|R_c/2 - 4.2|$) from the corneal focal point, $f_c$, and thus also close to $\Pi_c$. In the following suppose that $\Pi_c$ and the pupil coincide. The pupil may under these assumptions be mapped through $H_i^s$ from the image to the screen via the corneal reflections.

Figure 2: (left) Reflection points (crosses) and the pupil (gray ellipse) are observed in the image and (middle) the pupil mapped to the normalized space using the four reflection points. (right) From the normalized space the pupil is mapped to the point of regard.

These basic observations are sufficient to describe the fundamental and simple algorithm for point of regard (PoR) estimation in an uncalibrated setting. The method is illustrated in figure 2 and is based on locating and tracking four reflections $(g_1^i \ldots g_4^i)$ (e.g. glints) and the pupil in the image. The pupil center, $p_c$, will be used in the following description. However, the presented method may alternatively use the limbus center or the pupil/limbus ellipse contours directly in the mapping since homographies allow for mappings of points, lines and conics.

It is convenient, though not necessary, to define a virtual plane, $\Pi_n$ (normalized plane), spanned by four points $g_1^n \ldots g_4^n$. $\Pi_n$ represents the (unknown) corneal-reflection plane given up to a homography. Let $g_j^n$ ($j = 1..4$) be the corners of the unit square and define $H_i^n$ such that $g_j^n = H_i^n g_j^i$. Notice, using the screen corners to span the normalized space would be equally viable. The basic idea is that the pupil is mapped to the normalized space through $H_i^n$ to normalize the effects of head pose prior to any calibration or gaze estimation procedure ($F_n^s$, in figure 2). The mapping of the reflections from the image $\Pi_i$ to the screen $\Pi_s$ via $\Pi_n$ is therefore $H_i^s = H_n^s \circ H_i^n$. That is, a homography $H_n^s$ is a sufficient model for $F_n^s$ when the pupil and $\Pi_c$ coincide.

$H_i^s$ can be found through a user calibration consisting of a minimum of 4 calibration targets, $t_1 \ldots t_N$, on the screen. Denote the general principle of normalizing eye data (pupil center, pupil or limbus contours) with respect to the reflections by homography normalization. The method of using $F_n^s = H_n^s$ in connection with homography normalization is referred to as (Hom).

The cross-ratio method does not model the visual axis well [Kang et al. 2008]. Homography normalization, on the other hand, does model the offset between the optical and visual axes to a much higher degree. Points in normalized space are based on the pupil center i.e. a model of the optical axis without the interference of head movements. However, as offsets between the optical and visual axes correspond to translations in normalized space, the visual and optical axis offset is modeled implicitly through $F_n^s = H_n^s$.

3.1 Model Error from Planarity Assumption

The previous section describes a generalized approach for head pose invariant PoR estimation under the assumption that the pupil and $\Pi_c$ coincide. If the pupil had been located on $\Pi_c$, it would be a head pose invariant gaze estimation method that models the visual and optical axis offset. Euclidean information is not available in uncalibrated settings. Using metric information (e.g. between the pupil and $\Pi_c$) does therefore not apply in this setting. This section provides an analysis of the model error and section 3.2 discusses an approach to accommodate the errors. Figure 3 illustrates two different gaze directions and the associated modeling error measured from the camera.

Figure 3: Projected differences between pupil and the corresponding point on $\Pi_c$ for two gaze directions. $\Pi_c$ is kept constant for clarity.

When the user looks away from the camera ('gaze direction 1') it is evident that the error in the image plane is related to the projection onto the image plane of the line segment $e_1$ (between the point on $\Pi_c$ and the actual location of the pupil). A gaze vector directed towards the camera ('gaze direction 2') yields a point and therefore no error. Hence equal angular offsets from the optical axis of the camera generate offset vectors $\Delta_c(i, j)$ with the same magnitude
when viewed from the camera. The largest magnitude of errors occurs when the gaze direction is perpendicular to the optical axis of the camera. The magnitude field $|\Delta_c(i, j)|$ in camera coordinates consequently consists of elliptic iso-contours, centered around the optical axis of the camera. However, it is the error, $\Delta_s$, in screen coordinates, that is of interest. The true point of regard in screen coordinates, $\rho_s^* = \hat{\rho}_s + \Delta_s$, is a function of the estimated gaze $\hat{\rho}_s$ and the error $\Delta_s$. That is, $\rho_s^* = H_i^s (p_c + \Delta_i) = H_i^s p_c + H_i^s \Delta_i$, hence errors on the screen, $\Delta_s = H_i^s \Delta_i$, are merely errors in the camera propagated to the screen through the homography. An example of the error vector field, $\Delta_s$, using a simulator and the corresponding vector magnitudes is shown in figure 4.

Figure 4: (left) Error vector field and (right) corresponding magnitudes obtained from simulated data. Crosses indicate calibration targets and the circles the projection of the camera center.

To argue for the characteristics of $\Delta_s$ it is without loss of generality and for the sake of simplicity assumed that only four calibration points, $(t_1 \ldots t_4)$, are used (crosses in figure 4). When estimating the homography, $H_i^s$, through user calibration, the errors at the calibration targets are minimized to zero, $\Delta_s(t_i) = 0$, and there will therefore be 5 points (the calibration targets and the camera optical axis) where $\Delta_s$ is zero.

One way of thinking of a homography is that it generates a linear vector field of displacements. $\Delta_s = H_i^s \Delta_i$ is therefore a composition of two vector fields ($\Delta_s = V_h + \Delta_i$), a linear vector field corresponding to the homography ($V_h$) and an ellipsoidal vector field $\Delta_i$. Since $\Delta_s(t_i) = 0$ then $V_h(t_i) = -\Delta_i(t_i)$. $V_h(t_i)$ is consequently defined through the negative error vectors of $\Delta_i(t_i)$. It is worth noting that as the camera location is unknown due to the uncalibrated setup assumption and the location of the maximum error depends on the location of the camera, it would be impossible to determine the extremal location without additional information. However, despite this, it is shown in the following sections that it is possible through homography normalization to obtain results quite similar to fully calibrated setups.

3.2 Modeling Error Vectors

This section discusses one approach of modeling the error caused by the non-coplanarity of $\Pi_c$ and the pupil. Even though the location of the largest errors cannot be determined (a priori) due to the uncalibrated setup, it may be worthwhile to accommodate the errors to the extent possible, that is, to estimate a vector field similar to figure 4. When the camera is placed outside the screen area, the error due to the homography is zero in 5 points (e.g. the calibration targets and the camera projection center) and non-zero elsewhere. After estimating $H_i^s$ it is possible to measure the error due to the homography for each additional calibration target. Since the error vector field is smooth, a simplified yet effective approach would be to model the error through polynomials in a similar way as previously seen for single or dual glint systems [Morimoto and Mimica 2005]. One of the limitations when using polynomials is that any increase of the order of the polynomial would require additional calibration targets in order to estimate the parameters of the polynomial. A cubic polynomial seems to be a good approximation for $\Delta_i$ [Cerrolaza et al. 2008], however it would require at least 10 calibration targets.

Different from the 'weight space' approach of polynomials is the function view approach of Gaussian processes (GP). A Gaussian process (GP) interpolation method is used to estimate $\Delta_i$ by using a squared exponential covariance function [Rasmussen and Williams 2006]:

$$\mathrm{cov}(x_p, x_q) = k_1 \exp\left(-\frac{1}{2} \frac{|x_p - x_q|^2}{k_2}\right) + k_3 \sigma^2$$

where $x_p$ and $x_q$ are data points and $k_i$ are weights. GPs have several innate properties that make them highly suited for gaze estimation. Gaussian processes do not model weights directly and thus there are no requirements on the minimum number of calibration targets needed to infer model parameters. Each additional calibration target provides additional information that will be used to increase accuracy. Each estimate also comes with an error measurement which, via the covariance function, is related to the distance from the input data to the calibration data. This information can potentially be used to regularize output data. The squared exponential covariance function has been adopted since it is highly smooth (like $\Delta_i$) and it makes it possible to account for noise directly in the covariance function through $k_3 \sigma^2$. In the following we denote with (GP) the method of $F_n^s$ that uses (Hom) together with Gaussian process modeling of $\Delta_i$.

4 Assessment on Simulated Data

Head pose, head position, the offset between visual and optical axes, refraction, measurement noise, relative position of hardware and camera parameters are the factors that mostly influence the accuracy of gaze estimation methods. We will in the following sections evaluate the homography normalization methods ((Hom) and (GP)) against the cross-ratio methods ((Yoo) [Yoo and Chung 2005] and (Cou) [Coutinho and Morimoto 2006]). These methods have been chosen since they operate under similar premises as homography normalization (e.g. uncalibrated/semi-calibrated setup). Simulated data is used in this section to be able to assess the effects of potential noise factors separately. The simulator [Böhme et al. 2008] allows for detailed modeling of the different components of the setup and eye specific parameters. The evaluation is divided according to the presence of head movements and the number of calibration targets (N). Notice the methods, except (Yoo), allow for multiple on-screen calibration targets. The effects of eye specific parameters such as refraction and offset between the visual and optical axis as well as the effect of the number of calibration targets and errors associated with the model assumptions are evaluated when the head is fixed (section 4.2). The methods are examined with respect to head movements in section 4.3. In some experiments the (GP) method has been left out since it is a derivative of (Hom) and would not alter the inherent properties of using homography normalization; it only makes a difference to the accuracy when the number of calibration targets is larger than four (N > 4).

4.1 Setup

The camera is located slightly below and to the right of the center of the screen so as to simulate a realistic setup (e.g. users do not place the components in an exact position). All tests have been conducted with the same camera focal length. The cornea is modeled as a sphere with radius 7.98 mm. Four light sources are placed
at the corners of a planar surface (screen) to be able to compare homography and cross-ratio methods. In the following denote with N the number of calibration targets. $\gamma$ and $\beta$ correspond to the angular offsets between the visual and optical axes in horizontal and vertical directions, respectively.

4.2 Stationary Head

Basic Settings and Refraction. In this section the methods are evaluated as if the head is kept still while gazing at a uniformly distributed set of 64 × 64 targets. Figure 5 shows the mean accuracy (degrees) with error bars (variance) in the hypothetical eye model, where there is no offset between visual and optical axes, $E_0 = \{\gamma = \beta = 0\}$, and a more realistic setting with eye model $E_1 = \{\gamma = 4.5, \beta = 1.5\}$. Each sub-figure shows the cases where refraction is included and when it is not. $E_0$ is a physically infeasible setup, since in a real eye the optical and visual axes are different, but the model avoids eye specific biases. It is clear from figure 5 that the methods exhibit similar accuracies in $E_0$, but the offset between visual and optical axes in $E_1$ makes a notable difference between the methods. Refraction has only a minor effect on the methods.

Figure 5: Comparison of methods (with/without refraction) when the head is kept still using (left) eye model $E_0 = \{\gamma = \beta = 0\}$ and (right) eye model $E_1 = \{\gamma = 4.5, \beta = 1.5\}$ and N = 4 calibration targets.

Changing N. The previous test is based on a minimum number of calibration targets. However, the methods may, besides (Yoo), improve accuracy as the number N of uniformly distributed calibration targets increases. Figure 6 shows accuracy of the methods as a function of N for both eye models. (GP) exhibits a rapid increase of accuracy when increasing N. Both (Hom) and (Cou) may be improved by increasing N, but large N implies an accuracy decrease for (Cou). The accuracy for (Yoo) is as expected.

Figure 6: Changing the number of calibration targets, N, for $E_0$ (left) and $E_1$ (right).

Offset between Visual and Optical Axes. There is a noticeable accuracy difference when using $E_0$ and $E_1$ in the previous experiments. Figure 7 shows that the influence of the angular horizontal offset, $\gamma$ (with $\beta = 0$), has a significant effect on the accuracies of the cross-ratio methods but not on homography normalization. The reason is that homography normalization models the optical visual offset to a much higher degree.

Figure 7: Accuracy as a function of the angular offset.

4.3 Head Movements

Gaze trackers should ideally be head pose invariant. This section evaluates the methods in scenarios where the eye location changes in space (±300 mm in both x and y directions from the camera center) but the target location remains fixed on the screen.

Influence of N and $\gamma$. Figure 8 shows the accuracies of using a variable number of calibration targets and eye parameters in the presence of head movements. The results show similarities to the head still experiments by also revealing that the offset between the optical and visual axes makes a significant difference to the cross-ratio methods, but not to the homography-based methods. The number of calibration targets has only a minor effect on accuracy. Non-linear modeling improves accuracy and especially the difference between 4 and 9 calibration targets makes a significant difference. When considering the nuisance of calibration and the obtained accuracy, it is task dependent whether the rather small increase in accuracy between 9 and 16 calibration targets is worthwhile.

Depth Translation. The methods analyzed here all use properties on projective planes. Handling movements in depth is therefore not an inherent property of the methods. The influence of head movements will therefore be examined by evaluating head movements as translations parallel to the screen plane (or equivalently $\Pi_c$) as depicted in figure 9 and movements in depth (figure 10). A single depth is used for calibration. The results show that none of the methods is invariant to either depth or in-plane translations, but that the homography normalization-based methods have better performance. For depth changes larger than 150 mm (see figure 10) the (GP) method does not perform as well as (Hom). The reason is that the learned offsets in (GP) are only valid for a single scale. The graphs in figure 10 show the accuracy as a function of depth changes (from the calibration depth) when using different eye parameters ($E_0$ and $E_1$) and with a variable number of calibration targets, N.
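The (Hom) pipeline of section 3 (normalize the pupil through the glint homography, then map the normalized pupil to the screen through a calibrated homography) can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names (`fit_homography`, `normalize_pupil`, `calibrate`, `estimate_por`) are hypothetical, the homographies are estimated with a plain direct linear transform without coordinate pre-conditioning, and the four glints are assumed to be detected in a consistent order in every frame.

```python
import numpy as np

# Corners of the unit square spanning the normalized plane (assumed glint order).
UNIT_SQUARE = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])


def fit_homography(src, dst):
    """Direct linear transform: H such that dst ~ H @ src, from N >= 4 point pairs."""
    rows = []
    for (x, y), (u, v) in zip(np.asarray(src, float), np.asarray(dst, float)):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows))
    H = vt[-1].reshape(3, 3)
    return H / np.linalg.norm(H)


def apply_homography(H, pts):
    """Map 2D points through H using homogeneous coordinates."""
    pts = np.atleast_2d(np.asarray(pts, float))
    hom = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return hom[:, :2] / hom[:, 2:3]


def normalize_pupil(glints_img, pupil_img):
    """Map the pupil center from the image to the normalized plane (H_i^n)."""
    H_i_n = fit_homography(glints_img, UNIT_SQUARE)
    return apply_homography(H_i_n, pupil_img)[0]


def calibrate(glints_per_frame, pupils_img, targets_screen):
    """Estimate H_n^s from N >= 4 calibration frames and their screen targets."""
    normalized = [normalize_pupil(g, p) for g, p in zip(glints_per_frame, pupils_img)]
    return fit_homography(normalized, targets_screen)


def estimate_por(H_n_s, glints_img, pupil_img):
    """Point of regard on the screen for one frame: H_n^s applied to the normalized pupil."""
    return apply_homography(H_n_s, normalize_pupil(glints_img, pupil_img))[0]
```

Because the glint homography is re-estimated in every frame, a change in head pose changes `H_i_n` but, under the planarity assumption of section 3.1, not the normalized pupil position, which is why a single calibrated `H_n_s` can be reused across frames.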
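The Gaussian process correction of section 3.2 can be illustrated in the same spirit. The sketch below assumes the covariance has the squared exponential form $k_1 \exp(-\frac{1}{2}|x_p - x_q|^2 / k_2)$ with the noise term $k_3 \sigma^2$ added on the diagonal only, uses fixed rather than learned hyperparameters, and takes the normalized pupil positions as inputs and the residual error vectors left by (Hom) at the calibration targets as outputs. All names and default values (`se_cov`, `gp_fit`, `gp_predict`, `k1`, `k2`, `noise`) are hypothetical choices for illustration.

```python
import numpy as np


def se_cov(A, B, k1=1.0, k2=0.1):
    """Squared exponential covariance between two sets of 2D points."""
    A = np.atleast_2d(np.asarray(A, float))
    B = np.atleast_2d(np.asarray(B, float))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return k1 * np.exp(-0.5 * d2 / k2)


def gp_fit(X_train, residuals, k1=1.0, k2=0.1, noise=1e-6):
    """Solve (K + noise * I) w = residuals; `noise` plays the role of k3 * sigma^2."""
    K = se_cov(X_train, X_train, k1, k2) + noise * np.eye(len(X_train))
    return np.linalg.solve(K, np.asarray(residuals, float))


def gp_predict(X_train, weights, X_new, k1=1.0, k2=0.1):
    """Posterior mean of the residual error vectors at new input points."""
    return se_cov(X_new, X_train, k1, k2) @ weights
```

A (GP)-style estimate would then add `gp_predict(...)` for the current normalized pupil position to the (Hom) point of regard. Unlike a cubic polynomial, the fit works with any number of calibration targets, which matches the argument made in section 3.2.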