Fast object re-detection and localization in video for spatio-temporal fragment creation
1. 1
Information Technologies Institute
Centre for Research and Technology Hellas
Fast object re-detection and localization
in video for spatio-temporal fragment
creation
Evlampios Apostolidis, Vasileios Mezaris, Ioannis Kompatsiaris
Information Technologies Institute / Centre for Research and Technology Hellas
ICME MMIX 2013, San Jose, CA, USA, July 2013
2. 2
Overview
• Introduction - problem formulation
• Related work
• Baseline approach
• Proposed approach
– GPU-based processing
– Video-structure-based sampling of video frames
– Robustness to scale variations
• Experiments and results
• Conclusions
3. 3
Introduction – problem formulation
• Object re-detection: a particular case of image matching
• Main goal: find instances of a specific object within a single video or a
collection of videos
– Input: object of interest + video file
– Processing: similarity estimation by means of image matching
– Output: detected instances of the object of interest
4. 4
Introduction – problem formulation
Extension for interactive and linked TV
• Semi-automatic identification and annotation of object-specific spatio-
temporal media fragments
– Annotate the object of interest
– Run the object re-detection algorithm
– Automatically obtain instance-based annotated video fragments
– Find related content fragments and establish links between them
Figure: assign a label to the object of interest; instance-based annotated video fragment; links to related content
5. 5
Related work
• Extraction and matching of scale- and rotation-invariant local descriptors
is one of the most popular state-of-the-art approaches for similarity
estimation between pairs of images
– Local feature extraction
• Edge detectors (e.g. Canny), corner detectors (e.g. Harris-Laplace)
– Local feature description
• SIFT or extensions of it, SURF, BRISK, binary descriptors such as BRIEF, …
– Matching of local descriptors
• k-Nearest Neighbor search between descriptor pairs using brute-force or hashing
– Filtering of erroneous matches
• Symmetry test between the pairs of matched descriptors
• Ratio test regarding the distances of the calculated nearest neighbors
• Geometric verification between the pair of images using RANSAC
– Extensions
• Combined use of keypoints and motion information (tracking)
• Bag-of-Words (BoW) matching for pruning
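As an illustration of the filtering steps listed above, the following minimal Python sketch applies a ratio test and a symmetry test to brute-force 2-NN matches. It is not the authors' implementation: the function names, the toy 2-D descriptors and the 0.8 ratio threshold are assumptions made for this example, and geometric verification with RANSAC is left out.

```python
import math

def two_nearest(query, train):
    """Brute-force 2-NN search for one query descriptor.
    Returns [(distance, index), (distance, index)] for the two closest
    descriptors in `train` (which must contain at least two entries)."""
    return sorted((math.dist(query, t), i) for i, t in enumerate(train))[:2]

def ratio_matches(queries, train, ratio=0.8):
    """Ratio test: keep a match only when the nearest neighbour is clearly
    closer than the second nearest one. The 0.8 threshold is an assumed value."""
    matches = {}
    for q_idx, q in enumerate(queries):
        (d1, best), (d2, _) = two_nearest(q, train)
        if d1 < ratio * d2:
            matches[q_idx] = best
    return matches

def filter_matches(desc_a, desc_b, ratio=0.8):
    """Run the ratio test in both directions, then the symmetry test:
    a pair (a, b) survives only if a's match is b and b's match is a."""
    a_to_b = ratio_matches(desc_a, desc_b, ratio)
    b_to_a = ratio_matches(desc_b, desc_a, ratio)
    return sorted((a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a)

# Hypothetical toy descriptors (real SURF descriptors have 64 or 128 dimensions).
object_desc = [(0.0, 0.0), (10.0, 0.0), (5.0, 5.0)]
frame_desc = [(0.1, 0.0), (10.0, 0.2), (20.0, 20.0)]
good = filter_matches(object_desc, frame_desc)
print(good)
```

In this toy example the third object descriptor lies almost equally far from its two nearest neighbours, so the ratio test rejects it, while the first two descriptors pass both tests.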
6. 6
Proposed approach
• Starting from a baseline approach,
– Improve detection accuracy
– Reduce the needed processing time
• Work directions:
– GPU-based processing
– Video-structure-based sampling of frames
– Enhancing robustness to scale variations
7. 7
GPU-based processing
Accelerated parts of the overall pipeline:
• Video decompression into frames
• Keypoint detection and description
• Brute-force matching and 2-NN search
• Drawing of the calculated bounding boxes (optional)
8. 8
Video-structure-based sampling
• Sequential processing of video frames is replaced by a structure-based
one, using the analysis results of a shot segmentation method
Example
– Check shot 1: no detection, so move to the next shot
– Check shot 2: detection, so check all shot-2 frames
– Detect and highlight the object of interest
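The shot-by-shot strategy in the example can be sketched in Python as follows. This is a hedged illustration only: the slides do not say how many frames of each shot are checked first, so the `samples_per_shot` parameter, the evenly spaced sampling and the `contains_object` callback (standing in for the matching step) are all assumptions of this sketch.

```python
def redetect_by_shots(shots, contains_object, samples_per_shot=3):
    """Structure-based sampling of video frames.

    shots            list of (first_frame, last_frame) index pairs, inclusive,
                     as produced by some shot segmentation method
    contains_object  callable(frame_index) -> bool, the (expensive) matching step
    samples_per_shot assumed number of frames checked before deciding on a shot
    Returns the frame indices in which the object was detected.
    """
    detections = []
    for first, last in shots:
        length = last - first + 1
        n = min(samples_per_shot, length)
        # Evenly spaced sample frames, including the first and last frame.
        samples = sorted({first + (k * (length - 1)) // max(n - 1, 1)
                          for k in range(n)})
        if not any(contains_object(f) for f in samples):
            continue  # no detection: move to the next shot
        # Detection: check every frame of this shot.
        detections.extend(f for f in range(first, last + 1)
                          if contains_object(f))
    return detections

# Hypothetical video with two shots; the object is visible in frames 120-180.
calls = []
def toy_detector(frame):
    calls.append(frame)
    return 120 <= frame <= 180

found = redetect_by_shots([(0, 99), (100, 199)], toy_detector)
print(len(found), "frames with the object,", len(calls), "frames examined")
```

A limitation of this sketch is that an object visible only between the sampled frames of a shot would be missed, so the number and placement of samples trade speed against recall.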
9. 9
Robustness to scale variations
Problem
• Major changes in scale may lead to detection failure because the area
available for matching shrinks significantly
• Zoom-in case: the middle image (b) corresponds to a small upper right
area of the object O in the left one (a)
• Zoom-out case: in the right image (c) the object O occupies a very small
part of the frame
• Both cases lead to a considerable reduction of the number of matched
pairs of descriptors, and thus often to detection failure
Figure panels: (a) object O, (b) zoom-in case, (c) zoom-out case
10. 10
Robustness to scale variations
Solution
• We automatically generate a zoomed-out and a center-aligned zoomed-in
instance of the object O and use both in the matching procedure
Zoomed-in instance
– selection of a center-aligned sub-
area of the original object O and
enlargement to the actual size of O
using bilinear interpolation
– choice: 70% of the original image
area, corresponding to a 140% zoom-in factor
Zoomed-out instance
– shrink the original image O into a
smaller one using nearest neighbor
interpolation
– the maximum zoom-out factor is
determined by the restrictions of
the GPU-based implementation of
SURF
Figure: original image, zoomed-in instance, zoomed-out instance
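The two extra instances could be generated along the following lines. This is a plain-Python sketch on a grayscale image stored as a list of rows, not the authors' GPU-based code. The `keep` parameter is applied per side here, which is an assumption (the 70% figure on the slide refers to image area), and the 0.5 zoom-out factor is a placeholder, since the slide only says that the maximum factor is bounded by the GPU-based SURF implementation.

```python
def resize_nearest(img, new_h, new_w):
    """Nearest-neighbour resize: each output pixel copies one source pixel."""
    h, w = len(img), len(img[0])
    return [[img[r * h // new_h][c * w // new_w] for c in range(new_w)]
            for r in range(new_h)]

def resize_bilinear(img, new_h, new_w):
    """Bilinear resize: each output pixel blends its four nearest source pixels."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(new_h):
        y = r * (h - 1) / (new_h - 1) if new_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        fy = y - y0
        row = []
        for c in range(new_w):
            x = c * (w - 1) / (new_w - 1) if new_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            fx = x - x0
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out

def zoomed_in(img, keep=0.7):
    """Crop a center-aligned sub-area and enlarge it back to the size of img."""
    h, w = len(img), len(img[0])
    ch, cw = max(1, round(h * keep)), max(1, round(w * keep))
    top, left = (h - ch) // 2, (w - cw) // 2
    crop = [row[left:left + cw] for row in img[top:top + ch]]
    return resize_bilinear(crop, h, w)

def zoomed_out(img, factor=0.5):
    """Shrink img with nearest-neighbour interpolation (factor is assumed)."""
    h, w = len(img), len(img[0])
    return resize_nearest(img, max(1, round(h * factor)), max(1, round(w * factor)))

# Hypothetical 10x10 test image whose pixel value encodes its position.
image = [[r * 10 + c for c in range(10)] for r in range(10)]
big = zoomed_in(image)      # same size as the original, central area enlarged
small = zoomed_out(image)   # 5x5 shrunken copy
```

Both instances would then be matched against the video frames alongside the original object image.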
11. 11
Experiments and Results
• System specifications
– Intel Core i7 processor at 3.4 GHz
– 8 GB RAM
– CUDA-enabled NVIDIA GeForce GTX 560 GPU
• Dataset
– 6 videos* of 273 minutes total duration
– 30 manually selected objects
• Ground-truth (generated via manual annotation)
– 75,632 frames contain at least one of these objects
– 333,455 frames do not include any of the selected objects
* The videos are episodes from the “Antiques Roadshow” of the Dutch public broadcaster AVRO (http://avro.nl/)
Examples of sought objects
12. 12
Experiments and Results
• Aim: quantify the improvement that each extension of the baseline
approach is responsible for
• Four experimental configurations:
– C1: baseline implementation
– C2: GPU-accelerated implementation
– C3: GPU-accelerated implementation with video-structure-based sampling
– C4: complete proposed approach, which combines GPU-based processing,
video-structure-based sampling and robustness to scale variations
13. 13
Experiments and Results
• Detection accuracy is expressed in terms of Precision, Recall and F-Score
• Evaluation was performed on a per-frame basis, i.e. considering the 30
selected objects and counting the number of frames where these were
correctly detected, missed, etc.
• Time efficiency was evaluated by expressing the processing time of each
configuration as a factor of the actual duration of the processed videos
• Robustness to scale variations was quantified using two specific sets of
frames where the object of interest was observed from:
– a very close viewing position (2,940 frames) and
– a very distant viewing position (4,648 frames)
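The measures named above follow the standard definitions, which the short Python sketch below spells out. The frame counts and durations in the example are hypothetical and chosen only to make the arithmetic easy to follow; the function names are likewise assumptions of this sketch.

```python
def precision_recall(tp, fp, fn):
    """Per-frame precision and recall from counts of correctly detected (tp),
    wrongly detected (fp) and missed (fn) frames."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def realtime_factor(processing_seconds, video_seconds):
    """Processing time expressed as a factor of the video's duration
    (values below 1 mean faster than real time)."""
    return processing_seconds / video_seconds

# Hypothetical counts: 90 correct detections, 10 false detections, 30 misses.
p, r = precision_recall(tp=90, fp=10, fn=30)
print(round(p, 3), round(r, 3), round(f_score(p, r), 3))

# Hypothetical timing: 27.3 minutes of processing for 273 minutes of video.
print(round(realtime_factor(27.3 * 60, 273 * 60), 2))
```

Applying `f_score` to a precision of 0.999 and a recall of 0.868 gives 0.929 after rounding to three decimals.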
14. 14
Experiments and Results
      Precision  Recall  F-Score
C1    0.999      0.856   0.922
C2    0.999      0.856   0.922
C3    1.000      0.852   0.920
C4    1.000      0.992   0.996

      Precision  Recall  F-Score  Processing Time (x Real-Time)
C1    0.999      0.868   0.929    2.98-5.26
C2    0.999      0.850   0.918    0.35-1.24
C3    0.999      0.849   0.918    0.03-0.13
C4    0.999      0.872   0.931    0.03-0.19
Evaluation results for configurations C1 to C4

      Precision  Recall  F-Score
C1    0.999      0.831   0.907
C2    0.999      0.831   0.907
C3    1.000      0.799   0.888
C4    1.000      0.914   0.955
Evaluation results for highly zoomed-out instances; Evaluation results for highly zoomed-in instances
15. 15
Experiments and Results
Detection accuracy
• All configurations achieved high detection accuracy
• Configuration C4 (complete proposed approach) achieved the best results
• The algorithm performed well across a range of scales and orientations,
and under partial visibility or partial occlusion
Processing time
• The video-structure-based sampling
strategy led to a great reduction of the
required processing time
• The algorithm needs about 10% of the
video’s duration, while preserving the same
high detection accuracy as
the slower configurations
Online demo available at: http://www.youtube.com/watch?v=0IeVkXRTYu8
16. 16
Extensions, ideas and plans
• Recent extension: Multiple instances of an object of interest can be used
as input for more efficient re-detection of 3D objects
• Future ideas: test the algorithm’s performance as a tool for chapter
segmentation in videos where the chapters are temporally demarcated by
the presence of a specific object (e.g. a painting in a video about art)
• Future plans: evaluate the extended algorithm’s performance (detection
accuracy and time efficiency) in a new set of videos
17. 17
Conclusions
• The proposed method can be used for fast and accurate re-detection of
pre-defined objects in videos
• The time performance of the implemented algorithm allows for real-time
processing of multimedia content
• Extended by a prior object labeling step, this technique can be seen as:
– A reliable tool for creating instance-based annotated spatio-temporal
fragments in videos
– A key enabling technology for finding similar content and establishing
links between related media fragments, thus contributing to the
realization of interactive and linked TV