Huawei STW 2018 public

Event Detection in Surveillance Video: How We Got Here, What We Should Do Next
Prof. Alan F. Smeaton
E: alan.smeaton@dcu.ie

Talk Agenda
• Importance of visual content
• Manual annotation, automatic annotation
• TRECVid – what it is, what it does
• Surveillance Event Detection task – how far it’s got
• Understanding Crowds
• Crowd counting, crowd behaviour, metric performance
• Surveillance Video
• What can we do now?
• What’s the roadmap?

We know … Manual Annotation

What is TRECVid?
• Annual workshop series (2001-) promoting
research/progress in content-based video analysis
• Foundation for large-scale laboratory testing & forum for
exchange of research ideas and discussion of
approaches – what works, what doesn’t, and why.
• Focus: content-based tasks
• search / detection / summarization / segmentation
• Realistic tasks and test collections
• focus on relatively high-level functionality (e.g.
interactive search) & measurement against human
abilities
• Provides data, tasks, and uniform scoring procedures

TRECVid Video Data: 2003 to 2016
[Chart: TRECVid video data by year, 2003 to 2016; vertical axis from 0 to 4500. Data sources shown: English TV news, TV news, BBC rushes, Sound & Vision, airport surveillance, Internet Archive Creative Commons, HAVIC, Flickr, BBC EastEnders, BBC hyperlinking, Blip.tv, YFCC100M]

TRECVid Tasks: 2001 to 2018
1. Shot boundary detection
2. Ad hoc search
3. Features/semantic indexing
4. Stories
5. Camera motion
6. BBC summaries
7. Copy detection
8. Surveillance events
9. Known-item search
10. Instance search
11. Multimedia event detection
12. Multimedia event recounting
13. Video hyperlinking
14. Localization
15. Video to text (captions)

TRECVid 2017 Tasks and 39 Finishers
Groups Finished   Task code   Task name
8                 SED         Surveillance event detection
10                AVS         Ad-hoc Video Search
8                 INS         Instance search
6                 MED         Multimedia event detection
3                 LNK         Video hyperlinking
16                VTT         Pilot task (Video_to_Text)
[Chart: finishers by region; values 20, 10, 7, 2; legend: Asia, Europe, North America, Australia]

TRECVid Concept Detection: 2003

In 2012, this happened
• ImageNet, an equivalent of TRECVid,
for images rather than videos
• Krizhevsky, Sutskever and Hinton @
Univ Toronto, “won” the ImageNet
large scale visual recognition
challenge with a “convolutional neural
network”

Now everybody tries deep learning, for everything

TRECVid Surveillance Event Detection
• Surveillance event detection - leverage machine learning
for detecting a pre-defined set of events … in airport
surveillance video
• Use case … detect visual events (people engaged in
particular activities) in a large collection of streaming video
data collected by the UK Home Office
• Part of TRECVid since 2008, so 10 years
• 7 events …

1. CellToEar: put a cell phone to his/her head or ear
2. Embrace: put one or both arms at least part way around another
person (POOR)
3. ObjectPut: drop or put down an object (VERY GOOD)
4. PeopleMeet: One or more people walk up to one or more other
people, stop, and some communication occurs
5. PeopleSplitUp: From two or more people, standing, sitting, or moving
together, communicating, one or more people separate themselves
and leave the frame (VERY GOOD PERFORMANCE)
6. PersonRuns: Someone runs (POOR)
7. Pointing: Someone points (VERY GOOD PERFORMANCE)

TRECVid SED 2016
Participating Research Group                      #Years
Beijing Univ Posts & Telegraphs                   8
CMU/Renmin Univ/Univ of Sydney/Shandong Univ      9
Hikvision Research Institute                      1
Wuhan University                                  2
ITI Greece                                        2
NII – Hitachi - UiT                               1
Southeast Univ Jiulonghu Campus                   2
Univ of Queensland, Australia                     2
[Per-event columns in the original table, entries not recoverable: Embrace, ObjPut, PeopMeet, PeopSplit, PersRuns, Point, CellToEar]
(4 China, 1 Greece, 1 Japan, 1 US, 1 Vietnam) (8 groups in total)

A 10 year critique …
• Progress is slow, but improving because groups use deep
learning
• Task is still going because it’s important, but available
training data is the bottleneck
• Approaches are tailored and tuned to each activity … we
can’t afford to do that
• These aren’t anomalous activities, these are everyday
activities, this is behaviour monitoring for the purpose of
behaviour monitoring and then anomaly detection
• But … do we need the events to detect the anomalies?

What has this got to do with surveillance video?

He needs help!
• 2016 – 100M new surveillance cameras shipped
worldwide … in 2018 it will be 130M
• But – as costs fall, so vendors need ways to differentiate and using
deep learning for analysing video content is one of those ways
• Deep Learning is already appearing …
– Deep Learning equipment has a fast uptake in China
– Deep learning-enabled cameras, chips from Nvidia, Movidius or
others
– Body worn cameras and vehicular dashcams also have great
potential, especially when combined with GPS and accelerometers

• What are the Deep Learning services?
1. Face recognition and tracking is the early, and easiest,
scenario, immediately useful in safe city applications
2. Detecting events, outliers and anomalies, as well as usual
patterns; public safety abnormal event detection … can be
in real time or archive search for evidence gathering
3. Large scale search – across cameras
• Deep Learning addresses a big data problem, fusion of
heterogeneous data sources – hence body-worn cameras – and
also large scale search

Motivation for Understanding Crowds
• Crowd Density Estimation
• Level of crowd congestion observed at a given point in time
• Crowd Counting (state of the art results)
• True number of people present in an image of a crowded scene
• Crowd Segmentation
• Locate different crowd characteristics in a scene
• Crowd Behaviour Classification (state of the art results)
• Categorise the behaviour observed in a crowded scene
• Anomalous Behaviour Detection
• Identify behaviour which strays significantly from an established
norm, typically learned from normal behaviour training data

Motivation for Understanding Crowds
• Crowd Counting (state of the art results)
• True number of people present in an image of a crowded scene
• Crowd Behaviour Classification (state of the art results)
• Categorise the behaviour observed in a crowded scene

Recent Approaches to Crowd Counting
1. Counting by detection
• Training a visual object
detector to find and count
each person
• Performs poorly with 100+
people in frame
2. Counting by regression
• Learn a direct mapping
between low-level features
and the overall number of
people in frame
Deep learning approaches
lead to significant
improvements in counting
accuracy for high density
crowds (100-5000 people)
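As a rough illustration of counting by regression, the sketch below fits a linear map from low-level features to a person count. The two features (foreground area and edge pixel count), the toy numbers and the function names are all hypothetical, chosen only to show the idea; they are not taken from any of the systems discussed in the talk.

```python
import numpy as np

def fit_count_regressor(features, counts):
    """Least-squares fit of: count ≈ features · w + b."""
    features = np.asarray(features, dtype=float)
    counts = np.asarray(counts, dtype=float)
    # Append a constant column so the fit includes a bias term.
    design = np.column_stack([features, np.ones(len(features))])
    weights, *_ = np.linalg.lstsq(design, counts, rcond=None)
    return weights

def predict_count(weights, feature_vector):
    """Apply the fitted linear map to one frame's feature vector."""
    x = np.append(np.asarray(feature_vector, dtype=float), 1.0)
    return float(x @ weights)

# Hypothetical training frames: [foreground area, edge pixel count] -> people.
train_features = [[1000, 200], [2000, 500], [4000, 700], [8000, 1500]]
train_counts = [14, 30, 54, 110]

weights = fit_count_regressor(train_features, train_counts)
print(round(predict_count(weights, [3000, 600]), 1))
```

No individual person is detected at any point, which is why this family of methods degrades more gracefully than detection as crowds get denser.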

Crowd Counting – Approach #1
Contributions
• Training set augmentation scheme which improves generalisation
• Deep, single column, fully convolutional network architecture
• Multi-scale count averaging step during inference
Marsden, M., McGuinness, K., Little, S., O’Connor, N. E. (2017). Fully Convolutional Crowd Counting on Highly
Congested Scenes. International Conference on Computer Vision Theory and Applications.
Fully Convolutional Neural Network
Pixel-wise sum = 1544, Crowd Count = 1566
Fully Convolutional Crowd Counting on Highly Congested Scenes
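The "pixel-wise sum" and "multi-scale count averaging" steps can be sketched as follows, assuming the network outputs a per-pixel density map whose values sum to the number of people. The toy maps and function names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def count_from_density(density_map):
    """Estimated count is the sum of all density values."""
    return float(np.sum(density_map))

def multi_scale_count(density_maps):
    """Average the counts obtained from the same image at several scales."""
    counts = [count_from_density(d) for d in density_maps]
    return sum(counts) / len(counts)

# Toy density map: three people, each spread over a 2x2 block of mass 1.0.
toy = np.zeros((6, 6))
toy[0:2, 0:2] = 0.25
toy[2:4, 3:5] = 0.25
toy[4:6, 0:2] = 0.25

print(count_from_density(toy))
```

Averaging over scales is a simple way to smooth out the error of any single prediction; in practice each map would come from running the network on a rescaled copy of the image.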

UCF_CC_50 Dataset
Method                            Mean Absolute Error   Mean Squared Error
(Rodriguez et al., 2011)          655.7                 697.8
(Lempitsky and Zisserman, 2010)   493.4                 487.1
(Idrees et al., 2013)             419.5                 541.6
(Zhang et al., 2015)              467                   498.6
(Zhang et al., 2016)              377.6                 509.1
Our Approach                      338.6                 425.5
Our approach improves upon the state-of-the-art by 11% (MAE) and 13% (MSE)
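For readers unfamiliar with the two metrics, here is a minimal sketch of how they are computed over a set of images. It uses the textbook definitions and, as toy input, the two estimated/true pairs quoted on the example slides of this talk (26 vs 23 and 1544 vs 1566). The slides do not define the metrics; crowd counting papers frequently report the square root of the mean squared error under the "MSE" label, so the root is shown as well.

```python
import math

def mean_absolute_error(estimates, truths):
    """Average of |estimate - truth| over all images."""
    return sum(abs(e - t) for e, t in zip(estimates, truths)) / len(truths)

def mean_squared_error(estimates, truths):
    """Average of (estimate - truth)^2 over all images."""
    return sum((e - t) ** 2 for e, t in zip(estimates, truths)) / len(truths)

estimates = [26, 1544]
truths = [23, 1566]

mae = mean_absolute_error(estimates, truths)
mse = mean_squared_error(estimates, truths)
rmse = math.sqrt(mse)  # the variant often reported as "MSE" in this literature
print(mae, mse, round(rmse, 2))
```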

Crowd Counting in Action
Academic Datasets

Crowd Counting in Action: Academic Datasets
Estimated Person Count: 26
True Person Count: 23

Estimated Person Count: 1544
True Person Count: 1566

Crowd Counting in Action
CCTV

Crowd Counting in Action: CCTV Footage
• Challenging CCTV footage taken from Croke Park Stadium, Dublin
• Same scene observed during a quiet and a busy period on a match day
Metric for a video sequence: mean count ± standard deviation
Quiet period: Estimated Count: 0 ± 1.5, True Count: 0 ± 0.8
Busy period: Estimated Count: 52 ± 6.8, True Count: 70 ± 3.3
Video clips removed for © reasons!
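The per-sequence metric can be computed from frame-level counts as in the sketch below. The frame counts are made up, and the use of the population standard deviation (NumPy's default, `ddof=0`) is an assumption, since the slide does not say which form was used.

```python
import numpy as np

def summarise_sequence(frame_counts):
    """Return (mean, standard deviation) of the per-frame counts."""
    counts = np.asarray(frame_counts, dtype=float)
    return float(counts.mean()), float(counts.std())

def format_summary(mean, std):
    """Format as 'mean ± std', matching the style used on the slide."""
    return f"{mean:.0f} ± {std:.1f}"

# Hypothetical per-frame estimates for a short clip.
mean, std = summarise_sequence([50, 52, 54])
print(format_summary(mean, std))
```

Comparing the estimated and true summaries side by side shows both the bias of the estimator (difference in means) and how much it jitters from frame to frame (difference in spread).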

Contributions
• A new 100-image dataset, fully annotated for crowd counting, violent behaviour detection and
density level classification
• A deep, residual ANN architecture for simultaneous counting, behaviour detection and crowd
density estimation
Marsden, M., McGuinness, K., Little, S., O’Connor, N. E. (2017). ResnetCrowd: A Residual Deep Learning
Architecture for Crowd Counting, Violent Behaviour Detection and Crowd Density Level Classification.
IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS).
Multi-Task Neural Network outputs: Crowd Count, Violent Behaviour Detection, Crowd Density Level
ResnetCrowd: A Residual Deep Learning Architecture for Crowd Counting,
Violent Behaviour Detection and Crowd Density Level Classification
Data set
• Apply labels for additional tasks to an existing dataset.
• WWW Crowd clips where either the “Fight” or “Mob” concepts are present.
• Crowd counting ground truth (GT) created in the same way as for UCF_CC_50

ResnetCrowd
• Based upon the Resnet18 network of He et al.
• Minimise a loss function which combines losses for each of the outputs.
Multi-Task Neural Network outputs: Crowd Count, Violent Behaviour Detection, Crowd Density Level
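A loss that "combines losses for each of the outputs" might look like the weighted sum sketched below. The choice of per-task losses (squared error for the count, binary cross-entropy for violence, categorical cross-entropy for density level) and the equal default weights are assumptions made for illustration; the slides do not specify them.

```python
import math

EPS = 1e-12  # guards against log(0)

def squared_error(pred, target):
    return (pred - target) ** 2

def binary_cross_entropy(prob, label):
    prob = min(max(prob, EPS), 1.0 - EPS)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def categorical_cross_entropy(probs, label_index):
    return -math.log(max(probs[label_index], EPS))

def combined_loss(count_pred, count_true,
                  violent_prob, violent_label,
                  density_probs, density_label,
                  weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three task losses for a single image."""
    w_count, w_violent, w_density = weights
    return (w_count * squared_error(count_pred, count_true)
            + w_violent * binary_cross_entropy(violent_prob, violent_label)
            + w_density * categorical_cross_entropy(density_probs, density_label))

loss = combined_loss(10.0, 12.0, 0.5, 1, [0.25, 0.5, 0.25], 1)
print(round(loss, 4))
```

Because the count loss is typically much larger in magnitude than the two classification losses, the weights usually need tuning so that no single task dominates training.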

Object Counting Has Applications In Many Domains
People, Vehicles, Cell Nuclei, Wildlife
Marsden, M., McGuinness, K., Little, S., Keogh, C. E., O’Connor, N. E. (2018). People, Penguins and Petri Dishes:
Adapting Object Counting Models To New Visual Domains And Object Types Without Forgetting.
Computer Vision and Pattern Recognition (CVPR).

People, Penguins and Petri Dishes: Adapting Object Counting Models To
New Visual Domains And Object Types Without Forgetting
• Single object counting model for multiple domains (People, Vehicles, Cells,
Wildlife): Trained model can be adjusted to each domain
• Mean count error: 19% on ShanghaiTech dataset
• 30% relative improvement on prior approach to crowd counting
• Current state of the art for crowd counting and wildlife counting
Image Patch → Base Network (pre-trained on ImageNet) → Shared Counting Neural Network
Object Count = ∑ patch counts
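The relation "Object Count = ∑ patch counts" can be sketched as below. The non-overlapping tiling and the stand-in `toy_patch_regressor` (which just counts marked pixels) are assumptions for illustration; in the real system the per-patch count would come from the neural network.

```python
import numpy as np

def split_into_patches(image, patch_size):
    """Tile the image into non-overlapping patches (edge patches may be smaller)."""
    height, width = image.shape[:2]
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append(image[top:top + patch_size, left:left + patch_size])
    return patches

def count_objects(image, patch_size, patch_regressor):
    """Sum the per-patch counts to get the count for the whole image."""
    return sum(patch_regressor(p) for p in split_into_patches(image, patch_size))

def toy_patch_regressor(patch):
    """Hypothetical stand-in for the network: counts non-zero marker pixels."""
    return float(np.count_nonzero(patch))

# Toy 5x7 image with five marked object positions.
image = np.zeros((5, 7))
for row, col in [(0, 0), (1, 5), (2, 2), (4, 6), (4, 0)]:
    image[row, col] = 1.0

print(count_objects(image, patch_size=3, patch_regressor=toy_patch_regressor))
```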

• Base object counting regressor
• A set of high-level features is extracted from each image patch using a
pre-trained image classification network
• The N-dimensional feature representation is then mapped to an object
count value using a fully connected neural network.
• Domain-specific layers
• Included before each fully connected layer and after the final fully
connected layer.
• Increases the trainable parameter count by just 5%
• Sequential training
• Leverages Rebuffi et al. for learning new tasks over time without
discarding the previously learned functions.
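The idea of adding small domain-specific layers around shared layers can be illustrated with the sketch below. It uses a per-domain scale and shift applied after a shared fully connected layer; this is an assumed stand-in chosen for brevity, not the exact module from the paper or from Rebuffi et al. The point it shows is that adding a new domain leaves the shared weights, and therefore the outputs for earlier domains, unchanged.

```python
import numpy as np

class DomainAdaptedLayer:
    """Shared fully connected layer followed by a small per-domain scale and shift."""

    def __init__(self, weight, bias):
        self.weight = np.asarray(weight, dtype=float)  # shared across domains
        self.bias = np.asarray(bias, dtype=float)      # shared across domains
        self.adapters = {}                             # domain -> (scale, shift)

    def add_domain(self, name):
        """Start a new domain with an identity adapter (scale 1, shift 0)."""
        n_out = self.weight.shape[0]
        self.adapters[name] = (np.ones(n_out), np.zeros(n_out))

    def forward(self, x, domain):
        scale, shift = self.adapters[domain]
        hidden = self.weight @ np.asarray(x, dtype=float) + self.bias
        return scale * hidden + shift

layer = DomainAdaptedLayer(weight=np.eye(2), bias=np.zeros(2))
layer.add_domain("people")
people_before = layer.forward([1.0, 2.0], "people")

# "Training" on a second domain only touches that domain's adapter.
layer.add_domain("penguins")
layer.adapters["penguins"] = (np.array([2.0, 2.0]), np.array([1.0, 1.0]))

people_after = layer.forward([1.0, 2.0], "people")
print(np.allclose(people_before, people_after))
```

With a scale-and-shift adapter, each domain adds only two values per output unit, which is the kind of small overhead the slide's parameter figure refers to, although the exact percentage depends on the real architecture.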

Crowd Counting, Cell Counting

Penguin Counting, Vehicle Counting

A Summary of our work in …
• Crowd Counting
• True number of people present in an image of a crowded
scene
• Crowd Behaviour Classification
• Categorise the behaviour observed in a crowded scene

Surveillance Video – Roadmap
• We can compute crowd counts,
crowd segments, traffic volumes, very
accurately
• We’re not good at detecting events
• We can learn behaviour patterns for an
area, campus, stadium, city, from
surveillance video – but by detecting
simpler things, like crowd numbers;
• We can use raw CCTV + raw audio as sensor input streams
• We can determine regular behaviour using e.g. periodicity
• We can detect deviations from normal, alert security, let them do their job
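One simple way to realise "regular behaviour using periodicity" and "deviations from normal" is sketched below: learn a per-phase baseline (for example, per hour of day) from a crowd-count time series, then flag new observations that fall far from it. The period, the threshold of 3 standard deviations and the floor on the spread are illustrative assumptions, not values from the talk.

```python
import numpy as np

def periodic_baseline(counts, period):
    """Mean and standard deviation of the counts at each phase of the cycle."""
    counts = np.asarray(counts, dtype=float)
    usable = len(counts) - len(counts) % period  # drop an incomplete final cycle
    cycles = counts[:usable].reshape(-1, period)
    return cycles.mean(axis=0), cycles.std(axis=0)

def flag_deviations(new_counts, mean, std, threshold=3.0, min_std=1.0):
    """Flag observations more than `threshold` spreads away from the baseline."""
    flags = []
    for i, count in enumerate(new_counts):
        phase = i % len(mean)
        spread = max(float(std[phase]), min_std)  # avoid a zero-width baseline
        flags.append(bool(abs(count - mean[phase]) > threshold * spread))
    return flags

# Toy history: three identical cycles of four time slots each.
history = [0, 10, 50, 10] * 3
mean, std = periodic_baseline(history, period=4)

# A new cycle in which the third slot is far busier than usual.
print(flag_deviations([0, 11, 90, 10], mean, std))
```

The flagged slots would be passed to security staff as alerts; no event detector is involved, only counts compared against what is normal for that time.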
