Event Detection in Surveillance Video: How we Got Here, What We Should Do Next - presentation on our work on crowd counting and a reflection on 10 years of TRECVid Surveillance Event Detection task
Huawei STW 2018 public
1. Event Detection in Surveillance Video: How We Got Here, What We Should Do Next
Prof. Alan F. Smeaton
E: alan.smeaton@dcu.ie
2. Talk Agenda
• Importance of visual content
• Manual annotation, automatic annotation
• TRECVid – what it is, what it does
• Surveillance Event Detection task – how far it’s got
• Understanding Crowds
• Crowd counting, crowd behaviour, metric performance
• Surveillance Video
• What can we do now?
• What’s the roadmap ?
4. • Annual workshop series (2001-) promoting
research/progress in content-based video analysis
• Foundation for large-scale laboratory testing & forum for
exchange of research ideas and discussion of
approaches – what works, what doesn’t, and why.
• Focus: content-based tasks
• search / detection / summarization / segmentation
• Realistic tasks and test collections
• focus on relatively high-level functionality (e.g.
interactive search) & measurement against human
abilities
• Provides data, tasks, and uniform scoring procedures
What is TRECVid ?
5. TRECVid Video Data: 2003 to 2016
[Figure: chart of TRECVid video data by year, 2003 to 2016; vertical axis from 0 to 4500. Data sources labelled: English TV News, TV news, BBC rushes, Sound & vision, Airport Surveillance, Internet Archive Creative Commons, HAVIC, Flickr, BBC EastEnders, BBC hyperlinking, Blip.tv, YFCC100M]
6. TRECVid Tasks: 2001 to 2018
1. Shot boundary detection
2. Ad hoc search
3. Features/semantic indexing
4. Stories
5. Camera motion
6. BBC summaries
7. Copy detection
8. Surveillance events
9. Known-item search
10. Instance search
11. Multimedia event detection
12. Multimedia event recounting
13. Video hyperlinking
14. Localization
15. Video to text (captions)
9. In 2012, this happened
• ImageNet, an equivalent of TRECVid,
for images rather than videos
• Krizhevsky, Sutskever and Hinton @
Univ Toronto, “won” the ImageNet
large scale visual recognition
challenge with a “convolutional neural
network”
11. • Surveillance event detection - leverage machine learning
for detecting a pre-defined set of events … in airport
surveillance video
• Use case … detect visual events (people engaged in
particular activities) in a large collection of streaming video
data collected by the UK Home Office
• Part of TRECVid since 2008 so 10 years
• 7 events …
TRECVid Surveillance Event Detection
12. 1. CellToEar: put a cell phone to his/her head or ear
2. Embrace: put one or both arms at least part way around another
person (POOR)
3. ObjectPut: drop or put down an object (VERY GOOD)
4. PeopleMeet: One or more people walk up to one or more other
people, stop, and some communication occurs
5. PeopleSplitUp: From two or more people, standing, sitting, or moving
together, communicating, one or more people separate themselves
and leave the frame (VERY GOOD PERFORMANCE)
6. PersonRuns: Someone runs (POOR)
7. Pointing: Someone points (VERY GOOD PERFORMANCE)
13. TRECVid SED 2016
Participating Research Group                    #Years
Beijing Univ Posts & Telegraphs                 8
CMU/Renmin Univ/Univ of Sydney/Shandong Univ    9
Hikvision Research Institute                    1
Wuhan University                                2
ITI Greece                                      2
NII – Hitachi - UiT                             1
Southeast Univ Jiulonghu Campus                 2
Univ of Queensland, Australia                   2
Per-event columns (entries not shown): Embrace, ObjPut, PeopMeet, PeopSplit, PersRuns, Point, CellToEar
(4 China, 1 Greece, 1 Japan, 1 US, 1 Vietnam) (8 groups in total)
14. A 10 year critique …
• Progress is slow, but improving because groups use deep
learning
• Task is still going because it’s important but available
training data is the bottleneck
• Approaches are tailored and tuned to each activity … we
can’t afford to do that
• These aren’t anomalous activities, these are everyday
activities, this is behaviour monitoring for the purpose of
behaviour monitoring and then anomaly detection
• But … do we need the events to detect the anomalies ?
15. What has this got to do with surveillance video ?
17. He needs help !
• 2016 – 100M new surveillance cameras shipped
worldwide; in 2018 it will be 130M
• But – as costs fall, so vendors need ways to differentiate and using
deep learning for analysing video content is one of those ways
• Deep Learning is already appearing …
– Deep Learning equipment has a fast uptake in China
– Deep learning-enabled cameras, chips from Nvidia, Movidius or
others
– Body worn cameras and vehicular dashcams also have great
potential, especially when combined with GPS and accelerometers
18. • What are the Deep Learning services ?
1. Face Recognition and tracking is the early, and easiest,
scenario, immediately useful in safe city applications
2. Detecting event outliers, anomalies, as well as usual
patterns, public safety abnormal event detection … can be
in real time or archive search for evidence gathering
3. Large scale search – across cameras
• Deep Learning addresses a big data problem, fusion of
heterogeneous data sources – hence body-worn cameras – and
also large scale search
21. Motivation for Understanding Crowds
• Crowd Density Estimation
• Level of crowd congestion observed at a given point in time
• Crowd Counting (state of the art results)
• True number of people present in an image of a crowded scene
• Crowd Segmentation
• Locate different crowd characteristics in a scene
• Crowd Behaviour Classification (state of the art results)
• Categorise the behaviour observed in a crowded scene
• Anomalous Behaviour Detection
• Identify behaviour which strays significantly from an established
norm, typically learned from normal behaviour training data
23. Recent Approaches to Crowd Counting
1. Counting by detection
• Training a visual object detector to find and count each person
• Performs poorly with 100+ people in frame
2. Counting by regression
• Learn a direct mapping between low-level features and the overall number of people in frame
Deep learning approaches lead to significant improvements in counting accuracy for high density crowds (100-5000 people)
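Counting by regression can be illustrated with a very small sketch: fit a line from one low-level feature to the known count, then predict the count for a new frame. The feature (a hypothetical "foreground area" value), the ordinary least squares fit and all numbers are assumptions chosen for illustration; real systems use richer features and models.

```python
from typing import Sequence, Tuple


def fit_line(features: Sequence[float], counts: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of count = slope * feature + intercept."""
    n = len(features)
    mean_x = sum(features) / n
    mean_y = sum(counts) / n
    # Covariance of feature and count, and variance of the feature.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(features, counts))
    var = sum((x - mean_x) ** 2 for x in features)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_count(feature: float, slope: float, intercept: float) -> float:
    """Map a low-level feature value directly to an estimated people count."""
    return slope * feature + intercept


# Illustrative training data: hypothetical foreground area vs. annotated count.
train_features = [1.0, 2.0, 3.0]
train_counts = [12.0, 22.0, 32.0]
slope, intercept = fit_line(train_features, train_counts)
estimate = predict_count(4.0, slope, intercept)
```

Unlike counting by detection, nothing here locates individual people, which is why this family of methods degrades more gracefully as crowds get dense.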
24. Crowd Counting – Approach #1
Contributions
• Training set augmentation scheme which improves generalisation
• Deep, single column, fully convolutional network architecture
• Multi-scale count averaging step during inference
Marsden, M., McGuinness, K., Little, S., O’Connor, N. E. (2017). Fully Convolutional Crowd Counting on Highly Congested Scenes. International Conference on Computer Vision Theory and Applications.
[Figure: Fully Convolutional Neural Network; Pixel-wise sum = 1544; Crowd Count = 1566]
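The "pixel-wise sum" label and the multi-scale count averaging step can be sketched as follows. This is a minimal illustration, not the authors' implementation: `predict_density` is a hypothetical stand-in for the trained network, and the nearest-neighbour rescaling and the scale set (0.5, 1.0, 2.0) are assumptions.

```python
from typing import Callable, List, Sequence

Image = List[List[float]]        # grey-level image as rows of pixel values
DensityMap = List[List[float]]   # per-pixel density values


def count_from_density(density: DensityMap) -> float:
    """The crowd count is the pixel-wise sum of the density map."""
    return sum(sum(row) for row in density)


def rescale_nearest(image: Image, scale: float) -> Image:
    """Nearest-neighbour resize; a stand-in for a proper image resize."""
    h, w = len(image), len(image[0])
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))
    return [
        [image[min(h - 1, int(r / scale))][min(w - 1, int(c / scale))]
         for c in range(new_w)]
        for r in range(new_h)
    ]


def multi_scale_count(
    image: Image,
    predict_density: Callable[[Image], DensityMap],  # hypothetical network
    scales: Sequence[float] = (0.5, 1.0, 2.0),       # assumed scale set
) -> float:
    """Average the counts predicted at several input scales."""
    counts = [
        count_from_density(predict_density(rescale_nearest(image, s)))
        for s in scales
    ]
    return sum(counts) / len(counts)
```

Averaging over scales is one way to reduce sensitivity to how large people appear in the frame; any callable that returns a density map can be passed as `predict_density`.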
25. Crowd Counting – Approach #1
UCF_CC_50 Dataset
Method                             Mean Absolute Error    Mean Squared Error
(Rodriguez et al., 2011)           655.7                  697.8
(Lempitsky and Zisserman, 2010)    493.4                  487.1
(Idrees et al., 2013)              419.5                  541.6
(Zhang et al., 2015)               467                    498.6
(Zhang et al., 2016)               377.6                  509.1
Our Approach                       338.6                  425.5
Our approach improves upon the state-of-the-art by 11% (MAE) and 13% (MSE)
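For readers unfamiliar with the two metrics, the sketch below shows how they are commonly computed from per-image predicted and ground-truth counts. The slide does not say whether its "MSE" column is the plain mean squared error or its square root (crowd counting papers often report the root), so the sketch exposes both as an option; the example counts are invented.

```python
import math
from typing import Sequence


def mean_absolute_error(pred: Sequence[float], true: Sequence[float]) -> float:
    """Average absolute difference between predicted and true counts."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def mean_squared_error(
    pred: Sequence[float], true: Sequence[float], root: bool = False
) -> float:
    """Average squared difference; with root=True, return its square root."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)
    return math.sqrt(mse) if root else mse


# Invented counts for three test images.
true_counts = [100.0, 200.0, 300.0]
pred_counts = [110.0, 190.0, 330.0]
mae = mean_absolute_error(pred_counts, true_counts)
mse = mean_squared_error(pred_counts, true_counts)
rmse = mean_squared_error(pred_counts, true_counts, root=True)
```

MAE reflects typical counting accuracy, while the squared variant penalises occasional large misses more heavily.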
32. Crowd Counting – Approach #2
Contributions
• A new 100-image dataset, fully annotated for crowd counting, violent behaviour detection and density level classification
• A deep, residual ANN architecture for simultaneous counting, behaviour detection and crowd density estimation
Marsden, M., McGuinness, K., Little, S., O’Connor, N. E. (2017). ResnetCrowd: A Residual Deep Learning Architecture for Crowd Counting, Violent Behaviour Detection and Crowd Density Level Classification. IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS).
[Figure: Multi-Task Neural Network with outputs Crowd Count, Violent Behaviour Detection, Crowd Density Level]
33. Data set
• Apply labels for additional tasks to an existing dataset.
• WWW Crowd clips where either the “Fight” or “Mob” concepts are present.
• Crowd counting ground truth (GT) created in the same way as for UCF_CC_50
35. ResnetCrowd
• Based upon the Resnet18 network of He et al.
• Minimise a loss function which combines losses for each of the outputs.
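One common way to combine losses for several outputs is a weighted sum, sketched below for the three ResnetCrowd outputs. The individual loss functions (squared error for the count, binary cross-entropy for violent behaviour, categorical cross-entropy for density level), the weights and all names are assumptions for illustration; the slide does not specify them.

```python
import math
from typing import Sequence, Tuple


def squared_error(pred_count: float, true_count: float) -> float:
    """Regression loss for the crowd count output."""
    return (pred_count - true_count) ** 2


def binary_cross_entropy(p_violent: float, is_violent: bool, eps: float = 1e-12) -> float:
    """Classification loss for the violent behaviour output."""
    p = min(max(p_violent, eps), 1.0 - eps)  # avoid log(0)
    return -math.log(p) if is_violent else -math.log(1.0 - p)


def categorical_cross_entropy(probs: Sequence[float], true_class: int, eps: float = 1e-12) -> float:
    """Classification loss for the density level output."""
    return -math.log(max(probs[true_class], eps))


def multi_task_loss(
    pred_count: float, true_count: float,
    p_violent: float, is_violent: bool,
    density_probs: Sequence[float], density_class: int,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),  # assumed weights
) -> float:
    """Weighted sum of the per-task losses, minimised jointly in training."""
    w_count, w_violent, w_density = weights
    return (
        w_count * squared_error(pred_count, true_count)
        + w_violent * binary_cross_entropy(p_violent, is_violent)
        + w_density * categorical_cross_entropy(density_probs, density_class)
    )
```

The weights control how much each task influences the shared layers; setting a weight to zero effectively removes that task from training.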
36. Object Counting Has Applications In Many Domains
[Figure panels: People, Vehicles, Cell Nuclei, Wildlife]
Marsden, M., McGuinness, K., Little, S., Keogh, C. E., O’Connor, N. E. (2018). People, Penguins and Petri Dishes: Adapting Object Counting Models To New Visual Domains And Object Types Without Forgetting. Computer Vision and Pattern Recognition (CVPR).
37. Crowd Counting – Approach #3
People, Penguins and Petri Dishes: Adapting Object Counting Models To New Visual Domains And Object Types Without Forgetting
• Single object counting model for multiple domains (People, Vehicles, Cells, Wildlife): trained model can be adjusted to each domain
• Mean count error: 19% on ShanghaiTech dataset
• 30% relative improvement on prior approach to crowd counting
• Current state of the art for crowd counting and wildlife counting
[Figure: Image Patch → Base Network (pre-trained on ImageNet) → Shared Counting Neural Network; Object Count = ∑ patch counts]
38. Crowd Counting – Approach #3
• Base object counting regressor
• A set of high-level features is extracted from each image patch using a
pre-trained image classification network
• N-dimensional feature representation is then mapped to an object
count value using a fully connected neural network.
• Domain-specific layers
• Included before each fully connected layer and after the final fully
connected layer.
• Increases the trainable parameter count by just 5%
• Sequential training
• Leverages Rebuffi et al. for learning new tasks over time without
discarding the previously learned functions.
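The previous slide gives the object count as the sum of patch counts. A minimal sketch of that aggregation step follows, assuming non-overlapping square patches and a hypothetical `count_patch` callable standing in for the feature extractor plus counting regressor; the patch size and tiling scheme are illustrative assumptions.

```python
from typing import Callable, List

Image = List[List[float]]  # image as rows of pixel values


def split_into_patches(image: Image, patch_size: int) -> List[Image]:
    """Tile the image into non-overlapping patches (edge patches may be smaller)."""
    patches = []
    for top in range(0, len(image), patch_size):
        for left in range(0, len(image[0]), patch_size):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches


def count_objects(
    image: Image,
    patch_size: int,
    count_patch: Callable[[Image], float],  # hypothetical per-patch regressor
) -> float:
    """Object count for the whole image = sum of the per-patch counts."""
    return sum(count_patch(p) for p in split_into_patches(image, patch_size))


def toy_count_patch(patch: Image) -> float:
    """Toy regressor: treats every pixel equal to 1.0 as one object."""
    return float(sum(1 for row in patch for value in row if value == 1.0))


toy_image = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]
total = count_objects(toy_image, 2, toy_count_patch)
```

Because each patch is counted independently, the same aggregation works whatever the object type, which fits the multi-domain setting described above.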
41. A Summary of our work in …
• Crowd Density Estimation
• Level of crowd congestion observed at a given point in time
• Crowd Counting
• True number of people present in an image of a crowded
scene
• Crowd Segmentation
• Locate different crowd characteristics in a scene
• Crowd Behaviour Classification
• Categorise the behaviour observed in a crowded scene
• Anomalous Behaviour Detection
• Identify behaviour which strays significantly from an established
norm, typically learned from normal behaviour training data
42. Surveillance Video – Roadmap
• We can compute crowd counts, crowd segments, traffic volumes, very accurately
• We’re not good at detecting events
• We can learn behaviour patterns for an area, campus, stadium, city, from surveillance video – but by detecting simpler things, like crowd numbers
• We can use raw CCTV + raw audio as sensor input streams
• We can determine regular behaviour using e.g. periodicity
• We can detect deviations from normal, alert security, let them do their job
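The roadmap idea of learning regular behaviour from a simple signal and flagging deviations can be sketched as below. It assumes a series of crowd counts sampled at a fixed interval with a known period (for example 24 hourly samples per day), learns a mean and standard deviation for each phase of the period, and flags an observation whose z-score exceeds a threshold. The period, threshold, minimum standard deviation and data are all illustrative assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def learn_baseline(counts: Sequence[float], period: int = 24) -> List[Tuple[float, float]]:
    """Mean and standard deviation of the counts at each phase of the period."""
    baseline = []
    for phase in range(period):
        samples = counts[phase::period]  # every observation at this phase
        baseline.append((mean(samples), pstdev(samples)))
    return baseline


def is_deviation(
    count: float,
    phase: int,
    baseline: Sequence[Tuple[float, float]],
    z_threshold: float = 3.0,  # assumed alert threshold
    min_std: float = 1.0,      # floor to avoid dividing by a near-zero spread
) -> bool:
    """Flag a count that is far from what is normal for this phase."""
    mu, sigma = baseline[phase]
    return abs(count - mu) / max(sigma, min_std) > z_threshold


# Invented counts: three cycles of a four-sample period.
history = [10, 50, 100, 20, 12, 48, 104, 18, 11, 52, 96, 22]
normal = learn_baseline(history, period=4)
quiet_time_alert = is_deviation(60, 0, normal)   # busy when it is usually quiet
busy_time_alert = is_deviation(105, 2, normal)   # within the usual range
```

A flagged deviation only says that the scene is unusual for that time; deciding what it means is left to the security staff.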