GAZE OBJECT DETECTION
Amartya Bhattacharya
Intern
Institute of Datability Science
Osaka University
Supervisor: Prof. Hajime Nagahara
ABOUT ME
I am Amartya Bhattacharya, currently working as an intern at the Institute of Datability Science, Osaka University.
I graduated with a Bachelor's degree in Computer Science from the University of Calcutta, India.
I have previously worked in the domains of Computer Vision, Natural Language Processing, and multi-modal models, and my research interests lie in the same areas.
I will present a project on Gaze Object Detection.
PROBLEM INTRODUCTION
Detect the objects people are gazing at in videos.
Also detect whether a person is looking at another person present in the video.
From the gaze objects obtained, the interactions between the people can be studied.
PREVIOUS WORKS
Recasens et al. 2015 proposed a methodology for estimating gaze point coordinates from image data.
Chong et al. 2018 proposed an improved model for the same problem, and Chong et al. 2020 provided the first spatiotemporal model for gaze estimation in videos.
Wang et al. 2022 provided the first gaze object detection in images, proposing an improved gaze estimation model with a YOLO v5 late-fusion branch added to it.
COMPARISON OF PREVIOUS WORKS
NOTE:
1. Models were trained on the Gaze on Objects (GOO) dataset, which contains only images, not videos
2. Chong et al. 2020 has both a spatial and a temporal model; to obtain results on image data, the temporal part was removed
Models | Type of Data | Type of Input | Angular Error (°) | Type of Problem
Recasens et al. 2015 | Image | Image + head location (x, y, w, h) | 33.00 | Gaze estimation
Chong et al. 2018 | Image | Image + head location (x, y, w, h) | 21.80 | Gaze estimation
Chong et al. 2020 | Video + Image | Image + head location (x, y, w, h) | 15.10 | Gaze estimation
Wang et al. 2022 | Image | Image + head location (x, y, w, h) | 14.90 | Gaze object detection
• Recasens, Adria, et al. "Where are they looking?." Advances in Neural Information Processing Systems 28 (2015).
• Chong, Eunji, et al. "Connecting Gaze, Scene, and Attention: Generalized Attention Estimation via Joint Modeling of Gaze and Scene Saliency." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
• Chong, Eunji, et al. "Detecting attended visual targets in video." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
• Wang, Binglu, et al. "GaTector: A Unified Framework for Gaze Object Prediction." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
[Figure: angular error, the angle between the ground-truth (GT) and predicted gaze directions measured from the center of the head]
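Concretely, the angular error in the table is the angle between those two directions:

θ = arccos( (d_GT · d_pred) / (‖d_GT‖ ‖d_pred‖) )

where d_GT and d_pred are the ground-truth and predicted gaze direction vectors, both measured from the center of the head.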
SCOPE OF IMPROVEMENT
There is no existing model for detecting gaze objects in video data.
There is no model for understanding the interactions between people and objects or between people and people.
All existing models require the head position as an input: the bounding box of the head must be annotated for every frame of the video, which is a tedious task.
A 10 min. video at 30 fps with n people requires n × 30 × 60 × 10 = n × 18,000 annotations!
AUTOMATIC GAZE OBJECT DETECTION
Whole architecture of Gaze Object Detection:
Input Video → Head Tracking Module → Spatiotemporal Model → Gaze Object Class Assignment → Gaze Object
Input Video → Object Detection Model (YOLO v7) → Gaze Object Class Assignment
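As a rough sketch of how these stages connect, the wiring could look as follows. Every function here is an illustrative placeholder passed in as a callable, not the actual implementation:

```python
# Minimal wiring sketch for the pipeline above; each stage is supplied as a
# callable so the sketch stays independent of any concrete model code.
def detect_gaze_objects(frames, track_heads, estimate_gaze, detect_objects, assign_class):
    """frames: list of images -> list of (person_id, frame_idx, object_label)."""
    tracks = track_heads(frames)                      # {person_id: [head box per frame]}
    detections = [detect_objects(f) for f in frames]  # YOLO v7-style labelled boxes
    results = []
    for person_id, head_boxes in tracks.items():
        gaze_points = estimate_gaze(frames, head_boxes)   # spatiotemporal gaze model
        for i, point in enumerate(gaze_points):
            results.append((person_id, i, assign_class(point, detections[i])))
    return results
```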
HEAD TRACKING MODULE PART 1 - DETECTION
A novel head tracking module had to be proposed due to the unavailability of any such model.
It is based on the standard object tracking principle: I) detection in the initial frame, II) detection in the next frame, III) associating the objects in the current frame with those in the previous frame.
Head detection is done using a YOLO v5 model trained on the CrowdHuman dataset1 (see the sketch below).
1. Shao, Shuai, et al. "CrowdHuman: A benchmark for detecting human in a crowd." arXiv preprint arXiv:1805.00123 (2018).
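As an illustration of this step, a YOLO v5 detector fine-tuned for heads could be loaded through the ultralytics/yolov5 hub entry point roughly as below; the checkpoint name is a placeholder, not a published model:

```python
import torch

# "head_crowdhuman.pt" is a placeholder for a YOLOv5 checkpoint fine-tuned
# on CrowdHuman head annotations.
model = torch.hub.load("ultralytics/yolov5", "custom", path="head_crowdhuman.pt")

def detect_heads(frame, conf_thres=0.4):
    """Return head boxes as (x1, y1, x2, y2, confidence) tuples for one frame."""
    results = model(frame)                 # frame: HxWx3 image array
    boxes = results.xyxy[0].cpu().numpy()  # columns: x1, y1, x2, y2, conf, class
    return [tuple(b[:5]) for b in boxes if b[4] >= conf_thres]
```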
HEAD TRACKING MODULE PART 2 - DETECTION AND ASSIGNMENT
- SOTA object tracking methods handle assignment by comparing feature vectors computed from the detections (a matching sketch follows the references below)
- Feature vectors are generally computed using a ResNet 501-based model
- Comparing objects across frames is a re-identification task
- Here, feature vectors are computed using the Omni-Scale Network (SOTA for person re-identification)2
- A person re-identification method was chosen due to the absence of any head re-identification model
- The intuition was that the model would extract features important for re-identifying objects across frames, and thus enable tracking
- This idea was validated, after completion of the work, by the latest SOTA paper3
- The module tracked heads successfully and also performed well when occlusions occurred
1. He, Kaiming, et al. "Deep Residual Learning for Image Recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778.
2. K. Zhou, Y. Yang, A. Cavallaro and T. Xiang, "Learning Generalisable Omni-Scale Representations for Person Re-Identification," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 9, pp. 5056-5069, 1 Sept. 2022, doi: 10.1109/TPAMI.2021.3069237.
3. Du, Yunhao, et al. "StrongSORT: Make DeepSORT Great Again." IEEE Transactions on Multimedia (2023).
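For illustration, the appearance-based assignment described above can be sketched as cosine-similarity matching between re-identification embeddings (e.g. from OSNet), solved with the Hungarian algorithm; the embedding extractor itself is assumed here, not shown:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_detections(prev_feats, curr_feats, sim_thres=0.5):
    """Associate current detections with previous ones by appearance.

    prev_feats: (N, D) and curr_feats: (M, D) arrays of L2-normalised
    re-id embeddings. Returns (prev_idx, curr_idx) pairs; current
    detections left unmatched would start new tracks.
    """
    sim = prev_feats @ curr_feats.T           # cosine similarity matrix
    rows, cols = linear_sum_assignment(-sim)  # Hungarian: maximise similarity
    return [(r, c) for r, c in zip(rows, cols) if sim[r, c] >= sim_thres]
```

Thresholding the matched similarity is what lets a tracker tolerate occlusions: a detection that matches nothing simply starts a new track instead of being forced onto a wrong one.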
HEAD TRACKING MODULE PART 3 - MODULE SUMMARY AND RESULTS
[Figure: Head Tracking Module results]
SPATIOTEMPORAL MODEL FOR GAZE ESTIMATION
- The Chong et al. 20201 model was implemented for the gaze estimation step
- It is the only model for gaze estimation in videos
- The image as well as the head bounding box coordinates are needed as inputs to the model
- It uses spatial as well as temporal features for gaze estimation
- The head bounding boxes obtained from the Head Tracking Module were passed into the model (an inference sketch follows the reference below)
- Trained on the VideoAttentionTarget dataset1
[Figure: the video attention network for gaze estimation, which outputs a gaze point (x, y)]
1. Chong, Eunji, et al. "Detecting attended visual targets in video." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
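A rough sketch of how the tracked head boxes could feed the spatiotemporal model at inference time; the model is assumed to be a callable over a short clip, and the window length here is an arbitrary choice, not the paper's value:

```python
def estimate_gaze_track(frames, head_boxes, model, window=7):
    """Predict one gaze point per frame using a causal temporal window.

    frames: list of images; head_boxes: one (x, y, w, h) box per frame from
    the Head Tracking Module; model: a spatiotemporal gaze network in the
    spirit of Chong et al. 2020. Returns a list of (x, y) gaze points.
    """
    points = []
    for t in range(len(frames)):
        lo = max(0, t - window + 1)                 # clip ending at frame t
        clip, boxes = frames[lo:t + 1], head_boxes[lo:t + 1]
        points.append(model(clip, boxes))           # -> (x, y) for frame t
    return points
```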
GAZE ESTIMATION RESULTS
The pre-trained spatiotemporal model was used to generate the gaze coordinates.
The spatial model achieved an angular error 0.2° lower than that of the only gaze object detection model.
GaTector1, that gaze object detection model, is not suitable for video data.
GAZE OBJECT CLASS ASSIGNMENT
If a gaze point lies inside a bounding box, assign the gaze object the label associated with that bounding box.
If a gaze point lies inside multiple bounding boxes, assign the object class associated with the bounding box whose center is closest to the gaze point.
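This rule translates directly into code; a minimal sketch, assuming object detections arrive as (x1, y1, x2, y2, label) tuples:

```python
def assign_gaze_object(gaze_point, boxes):
    """Return the label of the detection box containing the gaze point.

    gaze_point: (gx, gy); boxes: iterable of (x1, y1, x2, y2, label).
    If several boxes contain the point, pick the one whose center is
    closest to the gaze point; return None if no box contains it.
    """
    gx, gy = gaze_point
    hits = [b for b in boxes if b[0] <= gx <= b[2] and b[1] <= gy <= b[3]]
    if not hits:
        return None

    def center_dist_sq(box):
        cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
        return (cx - gx) ** 2 + (cy - gy) ** 2

    return min(hits, key=center_dist_sq)[4]
```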
DISCUSSIONS
The model is the first of its kind for detecting gaze objects in a video.
It solves the issue, observed in previous works, of manually annotating thousands of frames in order to obtain gaze estimates.
It provides an opportunity to study the interactions between different people in a video.
DRAWBACKS
- The model was pre-trained on a predefined dataset; its generalization capability was decent
- Performance was observed to decrease as image quality decreases
- Noise in the images, such as masks or accessories, can affect the model
SCOPES OF IMPROVEMENT
The model works decently in most cases, but performance is sensitive to noise such as masks, caps, etc., and it is susceptible to giving false positives.
This effect was observed on the UCL data, where the model showed an accuracy of 67% (whether accuracy is the correct metric to judge by is also open to discussion).
Pre-trained models were used, so the generalization capability of the model is debatable; a novel model trained on a new object detection dataset could improve performance.
SUPPLEMENTARY – OS NET
The model1 learns features at different scales.
Stacking 3×3 convolutions helps the network learn multi-scale features.
In re-identification, these multi-scale features prove to be important.
Features are aggregated through learned weights, which enables dynamic feature learning (see the sketch below).
1. K. Zhou, Y. Yang, A. Cavallaro and T. Xiang, "Learning Generalisable Omni-Scale Representations for Person Re-Identification," in IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 9, pp. 5056-5069, 1 Sept. 2022, doi:
10.1109/TPAMI.2021.3069237.
[Figures: the OS Net bottleneck block and the whole architecture]
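For intuition, the multi-scale idea can be sketched in a few lines of PyTorch: parallel streams of stacked 3×3 convolutions (the deeper the stream, the larger its receptive field) whose outputs are mixed by an input-dependent channel gate. This is a simplified sketch of the idea, not the exact OS Net block:

```python
import torch.nn as nn

class OmniScaleSketch(nn.Module):
    """Simplified OSNet-style block: parallel 3x3 stacks + learned aggregation."""

    def __init__(self, channels, num_scales=4):
        super().__init__()
        # Stream t stacks t 3x3 convs, so its receptive field grows with t.
        self.streams = nn.ModuleList([
            nn.Sequential(*[
                nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                              nn.ReLU(inplace=True))
                for _ in range(t)
            ])
            for t in range(1, num_scales + 1)
        ])
        # Shared channel-wise gate producing input-dependent mixing weights.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        out = 0
        for stream in self.streams:
            y = stream(x)
            out = out + self.gate(y) * y  # dynamic per-channel aggregation
        return out
```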