SlideShare uma empresa Scribd logo
1 de 33
Baixar para ler offline
Víctor Campos
victor.campos@bsc.es
PhD Candidate
Barcelona Supercomputing Center
Video Analysis
Day 3 Lectures 1 & 2
#DLUPC
http://bit.ly/dlcv2018
bit.ly/DLCV2018
#DLUPCOutline
1. Self-supervision from videos
2. Architectures for video analysis
3. Exploiting redundancy in videos
4. Tips and tricks for applying deep learning to video
bit.ly/DLCV2018
#DLUPCSelf-supervision: motivation
● Neural Networks generally need tons of annotated data, but
generating those annotations is expensive
● Can we create some tasks for which we can get free annotations?
○ Yes! And videos are very convenient for this
● We want to
○ Frame the problem as a supervised learning task…
○ … but collecting annotations in an unsupervised manner
bit.ly/DLCV2018
#DLUPCSelf-supervision: examples
Temporal coherence
Misra et al., Shuffle and learn: unsupervised learning using temporal order verification. ECCV 2016
bit.ly/DLCV2018
#DLUPCSelf-supervision: examples
Motion as a proxy for foreground segmentation
Pathak et al., Learning features by watching objects move. CVPR 2017
bit.ly/DLCV2018
#DLUPCSelf-supervision: examples
Correspondence between images and audio
Arandjelović & Zisserman. Look, Listen and Learn. ICCV 2017.
bit.ly/DLCV2018
#DLUPCOutline
1. Self-supervision from videos
2. Architectures for video analysis
3. Exploiting redundancy in videos
4. Tips and tricks for applying deep learning to video
bit.ly/DLCV2018
#DLUPCWhat is a video?
● Formally, a video is a 3D signal
○ Spatial coordinates: x, y
○ Temporal coordinate: t
● If we fix t, we obtain an image. We can understand videos as
sequences of images (a.k.a. frames)
bit.ly/DLCV2018
#DLUPCHow do we work with images?
● Convolutional Neural Networks (CNN) provide state of the art
performance on image analysis tasks
● How can we extend CNNs to image sequences?
bit.ly/DLCV2018
#DLUPCCNNs for sequences of images
Several approaches have been proposed
1. Single frame models
2. CNN + RNN
3. 3D convolutions
4. Two-stream CNN
Each method leverages the temporal information in a different way
f ŷ
bit.ly/DLCV2018
#DLUPCSingle frame models
CNN CNN CNN...
Combination method
Combination is commonly
implemented as a small NN on
top of a pooling operation
(e.g. max, sum, average).
Drawback: pooling is not
aware of the temporal order!
Ng et al., Beyond short snippets: Deep networks for video classification, CVPR 2015
bit.ly/DLCV2018
#DLUPCCNN + RNN
Recurrent Neural Networks are
well suited for processing
sequences.
Drawback: RNNs are sequential
and cannot be parallelized.
Donahue et al., Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015
CNN CNN CNN...
RNN RNN RNN...
bit.ly/DLCV2018
#DLUPC3D Convolutions: C3D
We can add an extra dimension to standard CNNs:
● An image is a HxWxD tensor: MxNxD’ conv filters
● A video is a TxHxWxD tensor: KxMxNxD’ conv filters
The video needs to be split into chunks (also known as clips) with a number of
frames that fits the receptive field of the C3D. Usually clips have 16 frames.
Drawbacks:
● How can we handle longer videos?
● How can we capture longer temporal dependencies?
● How can we use pre-trained networks?
Tran et al., Learning spatiotemporal features with 3D convolutional networks, ICCV 2015
bit.ly/DLCV2018
#DLUPCFrom 2D CNNs to 3D CNNs
We can add an extra dimension to standard CNNs:
● An image is a HxWxD tensor: MxNxD’ conv filters
● A video is a TxHxWxD tensor: KxMxNxD’ conv filters
We can convert an MxNxD’ conv filter into a KxMxNxD’ filter by replicating it K
times in the time axis and scaling its values by 1/K.
● This allows to leverage networks pre-trained on ImageNet and alleviate
the computational burden associated to training from scratch
Feichtenhofer et al., Spatiotemporal Residual Networks for Video Action Recognition, NIPS 2016
Carreira et al., Quo vadis, action recognition? A new model and the kinetics dataset, CVPR 2017
bit.ly/DLCV2018
#DLUPCTwo-stream CNNs
Single frame models do not take into account motion in videos. Proposal:
extract optical flow for a stack of frames and use it as an input to a CNN.
Drawback: not scalable due to computational requirements and memory
footprint.
Simonyan & Zisserman, Two-stream convolutional networks for action recognition in videos, ICCV 2015
bit.ly/DLCV2018
#DLUPCOutline
1. Self-supervision from videos
2. Architectures for video analysis
3. Exploiting redundancy in videos
4. Tips and tricks for applying deep learning to video
bit.ly/DLCV2018
#DLUPC
● So far, we considered video-level predictions
● What about frame-level predictions?
● What about applications which require low latency?
Problem definition
f ŷ
f (ŷ1
, …, ŷN
)
bit.ly/DLCV2018
#DLUPCMinimizing latency
Not all methods are suited for real-time applications
1. Single frame models
2. CNN + RNN
3. 3D convolutions
4. Two-stream CNN
bit.ly/DLCV2018
#DLUPCMinimizing latency
Not all methods are suited for real-time applications
1. Single frame models
2. CNN + RNN
3. 3D convolutions
4. Two-stream CNN
bit.ly/DLCV2018
#DLUPCSingle frame models
CNN CNN CNN...
bit.ly/DLCV2018
#DLUPCSingle frame models: redundancy
Shellhamer et al., Clockwork Convnets for Video Semantic Segmentation, ECCV 2016
bit.ly/DLCV2018
#DLUPCSingle frame models: exploiting redundancy
Carreira et al., Massively Parallel Video Networks, arXiv 2018
bit.ly/DLCV2018
#DLUPCCNN+RNN: redundancy
CNN CNN CNN...
RNN RNN RNN...
bit.ly/DLCV2018
#DLUPCCNN+RNN: exploiting redundancy
CNN CNN CNN...
RNN RNN RNN... After processing a frame, let
the RNN decide how many
future frames can be skipped
In skipped frames, simply
copy the output and state
from the previous time step
There is no ground truth for
which frames can be skipped.
The RNN learns it by itself
during training!
Campos et al., Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks, ICLR 2018
bit.ly/DLCV2018
#DLUPCCNN+RNN: exploiting redundancy
Campos et al., Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks, ICLR 2018
Used Unused
CNN CNN CNN...
RNN RNN RNN...
bit.ly/DLCV2018
#DLUPCOutline
1. Self-supervision from videos
2. Architectures for video analysis
3. Exploiting redundancy in videos
4. Tips and tricks for applying deep learning to video
bit.ly/DLCV2018
#DLUPCActivity Recognition: Datasets
Abu-El-Haija et al., Youtube-8m: A large-scale video classification benchmark, arxiv 2016
bit.ly/DLCV2018
#DLUPCComputational burden
● The reference dataset for image classification, ImageNet, has ~1.3M
images
○ Training a state of the art CNN can take up to 2 weeks on a
single GPU
● Now imagine that we have an ‘ImageNet’ of 1.3M videos
○ Assuming videos of 30s at 24fps, we have 936M frames
○ This is 720x ImageNet!
● Videos exhibit a large redundancy in time
○ We can reduce the frame rate without losing too much
information
bit.ly/DLCV2018
#DLUPCMemory issues
● Current GPUs can fit batches of 32~64 images when training state of
the art CNNs
○ This means 32~64 video frames at once
● Memory footprint can be reduced in different ways if a pre-trained
CNN model is used
○ Freezing some of the lower layers, reducing the memory impact
of backprop
○ Extracting frame-level features and training a model on top of it
(e.g. RNN on top of CNN features). This is equivalent to freezing
the whole architecture, but the CNN part needs to be computed
only once.
bit.ly/DLCV2018
#DLUPCI/O bottleneck
● In practice, applying deep learning to video analysis requires from
multi-GPU or distributed settings
● In such settings it is very important to avoid starving the GPUs or we
will not obtain any speedup
○ The next batch needs to be loaded and preprocessed to keep
the GPU as busy as possible
○ Using asynchronous data loading pipelines is a key factor
○ Loading individual files is slow due to the introduced overhead,
so using other formats such as TFRecord/HDF5/LMDB is highly
recommended
bit.ly/DLCV2018
#DLUPCOutline
1. Self-supervision from videos
2. Architectures for video analysis
3. Exploiting redundancy in videos
4. Tips and tricks for applying deep learning to video
bit.ly/DLCV2018
#DLUPCCNN+RNN: redundancy
CNN CNN CNN...
RNN RNN RNN...
Frame-level predictions
are optional and depend
on the task
~99% of the computational
burden is here

Mais conteúdo relacionado

Mais de Universitat Politècnica de Catalunya

Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...Universitat Politècnica de Catalunya
 
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Universitat Politècnica de Catalunya
 
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...Universitat Politècnica de Catalunya
 
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Universitat Politècnica de Catalunya
 
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Universitat Politècnica de Catalunya
 
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Universitat Politècnica de Catalunya
 
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)Universitat Politècnica de Catalunya
 
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...Universitat Politècnica de Catalunya
 
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020Universitat Politècnica de Catalunya
 
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...Universitat Politècnica de Catalunya
 
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...Universitat Politècnica de Catalunya
 
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...Universitat Politècnica de Catalunya
 
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC BarcelonaSelf-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC BarcelonaUniversitat Politècnica de Catalunya
 
Deep Video Object Tracking 2020 - Xavier Giro - UPC TelecomBCN Barcelona
Deep Video Object Tracking 2020 - Xavier Giro - UPC TelecomBCN BarcelonaDeep Video Object Tracking 2020 - Xavier Giro - UPC TelecomBCN Barcelona
Deep Video Object Tracking 2020 - Xavier Giro - UPC TelecomBCN BarcelonaUniversitat Politècnica de Catalunya
 
Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020
Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020
Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020Universitat Politècnica de Catalunya
 

Mais de Universitat Politècnica de Catalunya (20)

Discovery and Learning of Navigation Goals from Pixels in Minecraft
Discovery and Learning of Navigation Goals from Pixels in MinecraftDiscovery and Learning of Navigation Goals from Pixels in Minecraft
Discovery and Learning of Navigation Goals from Pixels in Minecraft
 
Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...
 
Intepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural NetworksIntepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural Networks
 
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
 
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
 
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
 
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
 
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
 
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
 
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
 
Curriculum Learning for Recurrent Video Object Segmentation
Curriculum Learning for Recurrent Video Object SegmentationCurriculum Learning for Recurrent Video Object Segmentation
Curriculum Learning for Recurrent Video Object Segmentation
 
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
 
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
 
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
 
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
 
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...
 
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC BarcelonaSelf-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
 
Deep Video Object Tracking 2020 - Xavier Giro - UPC TelecomBCN Barcelona
Deep Video Object Tracking 2020 - Xavier Giro - UPC TelecomBCN BarcelonaDeep Video Object Tracking 2020 - Xavier Giro - UPC TelecomBCN Barcelona
Deep Video Object Tracking 2020 - Xavier Giro - UPC TelecomBCN Barcelona
 
Neural Architectures for Video Encoding
Neural Architectures for Video EncodingNeural Architectures for Video Encoding
Neural Architectures for Video Encoding
 
Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020
Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020
Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020
 

Último

Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxBoston Institute of Analytics
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in collegessuser7a7cd61
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 

Último (20)

Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in college
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 

Deep Video Analysis - Víctor Campos - UPC Barcelona 2018

  • 1. Víctor Campos victor.campos@bsc.es PhD Candidate Barcelona Supercomputing Center Video Analysis Day 3 Lectures 1 & 2 #DLUPC http://bit.ly/dlcv2018
  • 2. bit.ly/DLCV2018 #DLUPCOutline 1. Self-supervision from videos 2. Architectures for video analysis 3. Exploiting redundancy in videos 4. Tips and tricks for applying deep learning to video
  • 3. bit.ly/DLCV2018 #DLUPCSelf-supervision: motivation ● Neural Networks generally need tons of annotated data, but generating those annotations is expensive ● Can we create some tasks for which we can get free annotations? ○ Yes! And videos are very convenient for this ● We want to ○ Frame the problem as a supervised learning task… ○ … but collecting annotations in an unsupervised manner
  • 4. bit.ly/DLCV2018 #DLUPCSelf-supervision: examples Temporal coherence Misra et al., Shuffle and learn: unsupervised learning using temporal order verification. ECCV 2016
  • 5. bit.ly/DLCV2018 #DLUPCSelf-supervision: examples Motion as a proxy for foreground segmentation Pathak et al., Learning features by watching objects move. CVPR 2017
  • 6. bit.ly/DLCV2018 #DLUPCSelf-supervision: examples Correspondence between images and audio Arandjelović & Zisserman. Look, Listen and Learn. ICCV 2017.
  • 7. bit.ly/DLCV2018 #DLUPCOutline 1. Self-supervision from videos 2. Architectures for video analysis 3. Exploiting redundancy in videos 4. Tips and tricks for applying deep learning to video
  • 8. bit.ly/DLCV2018 #DLUPCWhat is a video? ● Formally, a video is a 3D signal ○ Spatial coordinates: x, y ○ Temporal coordinate: t ● If we fix t, we obtain an image. We can understand videos as sequences of images (a.k.a. frames)
  • 9. bit.ly/DLCV2018 #DLUPCHow do we work with images? ● Convolutional Neural Networks (CNN) provide state of the art performance on image analysis tasks ● How can we extend CNNs to image sequences?
  • 10. bit.ly/DLCV2018 #DLUPCCNNs for sequences of images Several approaches have been proposed 1. Single frame models 2. CNN + RNN 3. 3D convolutions 4. Two-stream CNN Each method leverages the temporal information in a different way f ŷ
  • 11. bit.ly/DLCV2018 #DLUPCSingle frame models CNN CNN CNN... Combination method Combination is commonly implemented as a small NN on top of a pooling operation (e.g. max, sum, average). Drawback: pooling is not aware of the temporal order! Ng et al., Beyond short snippets: Deep networks for video classification, CVPR 2015
  • 12. bit.ly/DLCV2018 #DLUPCCNN + RNN Recurrent Neural Networks are well suited for processing sequences. Drawback: RNNs are sequential and cannot be parallelized. Donahue et al., Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015 CNN CNN CNN... RNN RNN RNN...
  • 13. bit.ly/DLCV2018 #DLUPC3D Convolutions: C3D We can add an extra dimension to standard CNNs: ● An image is a HxWxD tensor: MxNxD’ conv filters ● A video is a TxHxWxD tensor: KxMxNxD’ conv filters The video needs to be split into chunks (also known as clips) with a number of frames that fits the receptive field of the C3D. Usually clips have 16 frames. Drawbacks: ● How can we handle longer videos? ● How can we capture longer temporal dependencies? ● How can we use pre-trained networks? Tran et al., Learning spatiotemporal features with 3D convolutional networks, ICCV 2015
  • 14. bit.ly/DLCV2018 #DLUPCFrom 2D CNNs to 3D CNNs We can add an extra dimension to standard CNNs: ● An image is a HxWxD tensor: MxNxD’ conv filters ● A video is a TxHxWxD tensor: KxMxNxD’ conv filters We can convert an MxNxD’ conv filter into a KxMxNxD’ filter by replicating it K times in the time axis and scaling its values by 1/K. ● This allows to leverage networks pre-trained on ImageNet and alleviate the computational burden associated to training from scratch Feichtenhofer et al., Spatiotemporal Residual Networks for Video Action Recognition, NIPS 2016 Carreira et al., Quo vadis, action recognition? A new model and the kinetics dataset, CVPR 2017
  • 15. bit.ly/DLCV2018 #DLUPCTwo-stream CNNs Single frame models do not take into account motion in videos. Proposal: extract optical flow for a stack of frames and use it as an input to a CNN. Drawback: not scalable due to computational requirements and memory footprint. Simonyan & Zisserman, Two-stream convolutional networks for action recognition in videos, ICCV 2015
  • 16. bit.ly/DLCV2018 #DLUPCOutline 1. Self-supervision from videos 2. Architectures for video analysis 3. Exploiting redundancy in videos 4. Tips and tricks for applying deep learning to video
  • 17. bit.ly/DLCV2018 #DLUPC ● So far, we considered video-level predictions ● What about frame-level predictions? ● What about applications which require low latency? Problem definition f ŷ f (ŷ1 , …, ŷN )
  • 18. bit.ly/DLCV2018 #DLUPCMinimizing latency Not all methods are suited for real-time applications 1. Single frame models 2. CNN + RNN 3. 3D convolutions 4. Two-stream CNN
  • 19. bit.ly/DLCV2018 #DLUPCMinimizing latency Not all methods are suited for real-time applications 1. Single frame models 2. CNN + RNN 3. 3D convolutions 4. Two-stream CNN
  • 21. bit.ly/DLCV2018 #DLUPCSingle frame models: redundancy Shellhamer et al., Clockwork Convnets for Video Semantic Segmentation, ECCV 2016
  • 22. bit.ly/DLCV2018 #DLUPCSingle frame models: exploiting redundancy Carreira et al., Massively Parallel Video Networks, arXiv 2018
  • 24. bit.ly/DLCV2018 #DLUPCCNN+RNN: exploiting redundancy CNN CNN CNN... RNN RNN RNN... After processing a frame, let the RNN decide how many future frames can be skipped In skipped frames, simply copy the output and state from the previous time step There is no ground truth for which frames can be skipped. The RNN learns it by itself during training! Campos et al., Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks, ICLR 2018
  • 25. bit.ly/DLCV2018 #DLUPCCNN+RNN: exploiting redundancy Campos et al., Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks, ICLR 2018 Used Unused CNN CNN CNN... RNN RNN RNN...
  • 26. bit.ly/DLCV2018 #DLUPCOutline 1. Self-supervision from videos 2. Architectures for video analysis 3. Exploiting redundancy in videos 4. Tips and tricks for applying deep learning to video
  • 27. bit.ly/DLCV2018 #DLUPCActivity Recognition: Datasets Abu-El-Haija et al., Youtube-8m: A large-scale video classification benchmark, arxiv 2016
  • 28. bit.ly/DLCV2018 #DLUPCComputational burden ● The reference dataset for image classification, ImageNet, has ~1.3M images ○ Training a state of the art CNN can take up to 2 weeks on a single GPU ● Now imagine that we have an ‘ImageNet’ of 1.3M videos ○ Assuming videos of 30s at 24fps, we have 936M frames ○ This is 720x ImageNet! ● Videos exhibit a large redundancy in time ○ We can reduce the frame rate without losing too much information
  • 29. bit.ly/DLCV2018 #DLUPCMemory issues ● Current GPUs can fit batches of 32~64 images when training state of the art CNNs ○ This means 32~64 video frames at once ● Memory footprint can be reduced in different ways if a pre-trained CNN model is used ○ Freezing some of the lower layers, reducing the memory impact of backprop ○ Extracting frame-level features and training a model on top of it (e.g. RNN on top of CNN features). This is equivalent to freezing the whole architecture, but the CNN part needs to be computed only once.
  • 30. bit.ly/DLCV2018 #DLUPCI/O bottleneck ● In practice, applying deep learning to video analysis requires from multi-GPU or distributed settings ● In such settings it is very important to avoid starving the GPUs or we will not obtain any speedup ○ The next batch needs to be loaded and preprocessed to keep the GPU as busy as possible ○ Using asynchronous data loading pipelines is a key factor ○ Loading individual files is slow due to the introduced overhead, so using other formats such as TFRecord/HDF5/LMDB is highly recommended
  • 31. bit.ly/DLCV2018 #DLUPCOutline 1. Self-supervision from videos 2. Architectures for video analysis 3. Exploiting redundancy in videos 4. Tips and tricks for applying deep learning to video
  • 32.
  • 33. bit.ly/DLCV2018 #DLUPCCNN+RNN: redundancy CNN CNN CNN... RNN RNN RNN... Frame-level predictions are optional and depend on the task ~99% of the computational burden is here