O slideshow foi denunciado.
Utilizamos seu perfil e dados de atividades no LinkedIn para personalizar e exibir anúncios mais relevantes. Altere suas preferências de anúncios quando desejar.

Deep Video Object Tracking - Xavier Giro - UPC Barcelona 2019

648 visualizações

Publicada em

https://mcv-m6-video.github.io/deepvideo-2019/

These slides provides an overview of how deep neural networks can be used to solve an object tracking task

Publicada em: Dados e análise
  • Seja o primeiro a comentar

Deep Video Object Tracking - Xavier Giro - UPC Barcelona 2019

  1. 1. @DocXavi Module 6 - Day 9 - Lecture 1 Deep Video Object Tracking 4th April 2019 Xavier Giró-i-Nieto [http://pagines.uab.cat/mcv/]
  2. 2. Acknowledgements 2 Andreu Girbau
  3. 3. Outline ● Motivation ● Datasets & Benchmarks ● Tracking from classification CNN ● Correlation filters ● Feature learning ● Matching functions ● Learning to track ● Regression networks 3
  4. 4. Motivation 4 Tracking: Mostly, object detection as a first step and tracking as post-processing. Objective: Infer a tracklet over multiple frames by simultaneous detection and tracking. Tracklet Slide credit: Andreu Girbau
  5. 5. Outline ● Motivation ● Datasets & Benchmarks ● Tracking from classification CNN ● Correlation filters ● Feature learning ● Matching functions ● Learning to track ● Regression networks 5
  6. 6. Tracking Challenges at CVPR 2019 4th BMTT MOTChallenge Workshop - How crowded can it get? 2nd Workshop and Challenge on Target Re-identification and Multi-Target Multi-Camera Tracking AI City Challenge 2019
  7. 7. 7 Datasets: MOT Leal-Taixé, Laura, Anton Milan, Konrad Schindler, Daniel Cremers, Ian Reid, and Stefan Roth. "Tracking the trackers: an analysis of the state of the art in multiple object tracking." arXiv (2017). Milan, Anton, Laura Leal-Taixé, Ian Reid, Stefan Roth, and Konrad Schindler. "MOT16: A benchmark for multi-object tracking." arXiv (2016).
  8. 8. 8 Datasets: ImageNet VID Video Object Detection = Intra-frame Localization + Inter-frame tracking Olga Russakovsky*, Jia Deng*, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015
  9. 9. 9 Real, Esteban, Jonathon Shlens, Stefano Mazzocchi, Xin Pan, and Vincent Vanhoucke. "Youtube-boundingboxes: A large high-precision human-annotated data set for object detection in video." CVPR 2017. Datasets: YouTube-BB
  10. 10. Outline ● Motivation ● Datasets & Benchmarks ● Tracking from classification CNN ● Correlation filters ● Feature learning ● Matching functions ● Learning to track ● Regression networks 10
  11. 11. 11 Off-the-shelf convs as weak trackers Previous work: Off-the-shelff CNN filters as weak object detectors. Zhou, Bolei, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. "Object detectors emerge in deep scene CNNs." ICLR 2015.
  12. 12. 12Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." ICCV 2015. [code] Off-the-shelf convs as weak trackers Simiarly to Zhou (ICLR 2016), in this work they identified feature maps in conv5-3 as weak object detectors… but were not discriminative enough to discriminate between instances of the same class.
  13. 13. 13 Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." ICCV 2015. [code] Off-the-shelf convs as weak trackers conv4-3 (specific) conv5-3 (general) On the other hand, feature maps from conv4-3 are more sensitive to intra-class appearance variation…
  14. 14. 14 SNet=Specific Network (online learning) GNet=General Network (fixed) Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." ICCV 2015. [code] Off-the-shelf convs as weak trackers + Online learning
  15. 15. Outline ● Motivation ● Datasets & Benchmarks ● Tracking from classification CNN ● Correlation filters ● Feature learning ● Matching functions ● Learning to track ● Regression networks 15
  16. 16. 16 Danelljan, Martin, Gustav Hager, Fahad Shahbaz Khan, and Michael Felsberg. "Convolutional features for correlation filter based visual tracking." ICCCVW 2015. Correlation Filters: Off-the-shelf The activations from the convolutional layers of an off-the-shelf CNN (VGG-M) are used as discriminative correlation filters for tracking. Filters from layer are the best ones
  17. 17. 17 Danelljan, Martin, Gustav Hager, Fahad Shahbaz Khan, and Michael Felsberg. "Convolutional features for correlation filter based visual tracking." ICCCVW 2015. Correlation Filters: Of-the-shelf Responses from the filters of the first layer with highest activations. Filters captured different orientations and colors.
  18. 18. 18 Danelljan, Martin, Gustav Hager, Fahad Shahbaz Khan, and Michael Felsberg. "Convolutional features for correlation filter based visual tracking." ICCCVW 2015. Correlation Filters: Off-the-shelf Off-the-shelf filters from the first VGG-M layer outperform handcrafted correlation filters.
  19. 19. 19 #CFNET Valmadre, Jack, Luca Bertinetto, João F. Henriques, Andrea Vedaldi, and Philip HS Torr. "End-to-end representation learning for Correlation Filter based tracking." CVPR 2017 Correlation Filters: Learned VOT-17 Learned !!
  20. 20. 20 Correlation Filters: Learned + Detection Feichtenhofer, Christoph, Axel Pinz, and Andrew Zisserman. "Detect to track and track to detect." ICCV 2017 [Slides by Andreu Girbau] Activations from the learned correlation features.
  21. 21. 21 Correlation Filters: Learned + Detection Feichtenhofer, Christoph, Axel Pinz, and Andrew Zisserman. "Detect to track and track to detect." ICCV 2017 [Slides by Andreu Girbau] Simultaneous detection and tracking in a single architecture through a multi-objective loss.
  22. 22. 22 Correlation Filters Learned + Detection Feichtenhofer, Christoph, Axel Pinz, and Andrew Zisserman. "Detect to track and track to detect." ICCV 2017 [Slides by Andreu Girbau] Simultaneous detection and tracking in a single archietcture.
  23. 23. Outline ● Motivation ● Datasets & Benchmarks ● Tracking from classification CNN ● Correlation filters ● Feature learning ● Matching functions ● Learning to track ● Regression networks 23
  24. 24. 24 Wang, Naiyan, and Dit-Yan Yeung. "Learning a deep compact image representation for visual tracking." NIPS 2013 [Project page with code] Robust appearance features are learned with a denoising MLP autoencoder... Feature Learning: MLP Artificial noise (“corrpution”) is added to force the autoencoder to be robust to it.
  25. 25. 25 Wang, Naiyan, and Dit-Yan Yeung. "Learning a deep compact image representation for visual tracking." NIPS 2013 [Project page with code] ...the encoder part is used as feature extractor to train a binary classifier used in a particles filter. Feature Learning: MLP + Particle Filter
  26. 26. 26 Wang, Naiyan, and Dit-Yan Yeung. "Learning a deep compact image representation for visual tracking." NIPS 2013 [Project page with code] What is the size of the input images ? Hint: They are square. Feature Learning: MLP + Particle Filter
  27. 27. 27 Wang, Naiyan, and Dit-Yan Yeung. "Learning a deep compact image representation for visual tracking." NIPS 2013 [Project page with code] What is the size of the input images ? Hint: They are square. “From the almost 80 million tiny images each of size 32 × 32, we randomly sample 1 million images for offline training” 32x32 = 2014 Feature Learning: MLP + Particle Filter
  28. 28. 28 Wang, Naiyan, and Dit-Yan Yeung. "Learning a deep compact image representation for visual tracking." NIPS 2013 [Project page with code] What basic limitation presents this neural architecture for the task ? Feature Learning: MLP + Particle Filter
  29. 29. 29 Wang, Naiyan, and Dit-Yan Yeung. "Learning a deep compact image representation for visual tracking." NIPS 2013 [Project page with code] What basic limitation presents this neural architecture for the task ? “... it would be an interesting direction to investigate a shift-variant CNN.” Feature Learning: MLP + Particle Filter
  30. 30. 30 CNN is trained with a triplet loss. Ristani, Ergys, and Carlo Tomasi. "Features for multi-target multi-camera tracking and re-identification." CVPR 2018. Feature Learning: CNN + Clustering
  31. 31. 31 Sadeghian, Amir, Alexandre Alahi, and Silvio Savarese. "Tracking the untrackable: Learning to track multiple cues with long-term dependencies." ICCV 2017. The temporally aggregated appearance of target BBi from t to t+1 is encoded with an LSTM. Feature Learning: RNN
  32. 32. 32 The temporally aggregated appearance of target BBi from t to t+1 is encoded with an RNN. Feature Learning: RNN CNN CNN CNN... RNN RNN RNN... time
  33. 33. 33 Sadeghian, Amir, Alexandre Alahi, and Silvio Savarese. "Tracking the untrackable: Learning to track multiple cues with long-term dependencies." ICCV 2017. A FC layer assessed whether the appearance of BBi matches the one of detection BBj in t+1. Feature Learning: RNN
  34. 34. 34 Sadeghian, Amir, Alexandre Alahi, and Silvio Savarese. "Tracking the untrackable: Learning to track multiple cues with long-term dependencies." ICCV 2017. A similar approach is used to assess the similarity between targets and detections in terms of motion and interaction (spatial layout). Feature Learning: RNN
  35. 35. 35 Sadeghian, Amir, Alexandre Alahi, and Silvio Savarese. "Tracking the untrackable: Learning to track multiple cues with long-term dependencies." ICCV 2017. The similarities in terms of appearance, motion and interaction are fused with another fully connected (FC) layer to compute a similarity score between each target at T and detections at T+1. Feature Learning: RNN
  36. 36. Outline ● Motivation ● Datasets & Benchmarks ● Tracking from classification CNN ● Correlation filters ● Feature learning ● Matching functions ● Learning to track ● Regression networks 36
  37. 37. 37#SINT Tao, Ran, Efstratios Gavves, and Arnold WM Smeulders. "Siamese instance search for tracking." CVPR 2016. Matching functions The tracker simply matches the initial patch of the target in the first frame with candidates in a new frame and returns the most similar patch by a learned matching function.
  38. 38. 38#SINT Tao, Ran, Efstratios Gavves, and Arnold WM Smeulders. "Siamese instance search for tracking." CVPR 2016. The matching function can learned with a siamese architecture (shared weights) with a margin contrastive loss. Matching functions: Siamese
  39. 39. 39 The matching function may be completed with contextual features derived from the position and size of the compared input patches. Matching functions: Siamese + context Leal-Taixé, Laura, Cristian Canton-Ferrer, and Konrad Schindler. "Learning by tracking: Siamese CNN for robust target association." CVPRW. 2016. [video]
  40. 40. Outline ● Motivation ● Datasets & Benchmarks ● Tracking from classification CNN ● Correlation filters ● Feature learning ● Matching functions ● Learning to track ● Regression networks 40
  41. 41. 41 Learning to Track: RNN P. Ondruska and I. Posner, “Deep Tracking: Seeing Beyond Seeing Using Recurrent Neural Networks,” AAAI 2016 [code]
  42. 42. 42P. Ondruska and I. Posner, “Deep Tracking: Seeing Beyond Seeing Using Recurrent Neural Networks,” AAAI 2016 [code]
  43. 43. 43 Milan, Anton, S. Hamid Rezatofighi, Anthony Dick, Ian Reid, and Konrad Schindler. "Online multi-target tracking using recurrent neural networks." AAAI 2017. Learning to Track: RNN
  44. 44. Outline ● Motivation ● Datasets & Benchmarks ● Tracking from classification CNN ● Correlation filters ● Feature learning ● Matching functions ● Learning to track ● Regression networks 44
  45. 45. #GOTURN Held, David, Sebastian Thrun, and Silvio Savarese. "Learning to track at 100 fps with deep regression networks." ECCV 2016. Regression Network The bounding box to be tracked is not explicitly encoded: it determines the position of the crops to be fed into the network. Inference at 100 fps.
  46. 46. Gordon, Daniel, Ali Farhadi, and Dieter Fox. "Re3 : Re al-Time Recurrent Regression Networks for Visual Tracking of Generic Objects." ICRA 2018. [code] Regression Network + RNN A RNN is added on top of a regression network for long-term temporal consistency.
  47. 47. Gordon, Daniel, Ali Farhadi, and Dieter Fox. "Re3 : Re al-Time Recurrent Regression Networks for Visual Tracking of Generic Objects." ICRA 2018. [code] Skip connections help in maintaining the spatial resolution at deeper layers. Regression Network + RNN
  48. 48. Gordon, Daniel, Ali Farhadi, and Dieter Fox. "Re3 : Re al-Time Recurrent Regression Networks for Visual Tracking of Generic Objects." ICRA 2018.
  49. 49. Ren, Shaoqing, Kaiming He, Ross Girshick, and Jian Sun. "Faster R-CNN: Towards real-time object detection with region proposal networks." NIPS 2015. [Lecture by Amaia Salvador] [Lecture by Míriam Bellver] Regression network + Tracking by detection Background: Faster-RCNN Conv Layer 5 Conv layers RPN RPN Proposals RPN Proposals Class probabilities RoI pooling layer FC layers Class scores Figure: Amaia Salvador (DLCV 2016)
  50. 50. #FasterRCNN Ren, Shaoqing, Kaiming He, Ross Girshick, and Jian Sun. "Faster R-CNN: Towards real-time object detection with region proposal networks." NIPS 2015. [Lecture by Amaia Salvador] [Lecture by Míriam Bellver] Regression network + Tracking by detection Background: Faster-RCNN Conv Layer 5 Conv layers RPN RPN Proposals RPN Proposals Class probabilities RoI pooling layer FC layers Class scores Figure: Amaia Salvador (DLCV 2016)
  51. 51. Ren, Shaoqing, Kaiming He, Ross Girshick, and Jian Sun. "Faster R-CNN: Towards real-time object detection with region proposal networks." NIPS 2015. [Lecture by Amaia Salvador] [Lecture by Míriam Bellver] Regression network + Tracking by detection Objectness scores Bounding Box Regression In practice, k = 9 (3 different scales and 3 aspect ratios) Background: Faster-RCNN: Region Proposal Network (RPN)
  52. 52. Regression network + Tracking by detection Green: The regressor from the Faster R-CNN detector regresses all bounding boxes from previous frames (bk t-1 ) to the object’s new position in frame t (bk t ). #Tracktor Bergmann, Philipp, Tim Meinhardt, and Laura Leal-Taixe. "Tracking without bells and whistles." arXiv (2019). [Slides]
  53. 53. Regression network + Tracking by detection Green: The corresponding object classification scores sk t of the new bounding box positions are then used to kill potentially occluded tracks. #Tracktor Bergmann, Philipp, Tim Meinhardt, and Laura Leal-Taixe. "Tracking without bells and whistles." arXiv (2019). [Slides]
  54. 54. Regression network + Tracking by detection Red: The object detector provides a set of object detections Dt at frame t (not only the ones tracked from t-1). This allows initializing new tracks if non of the detections has an IoU larger than a certain threshold. #Tracktor Bergmann, Philipp, Tim Meinhardt, and Laura Leal-Taixe. "Tracking without bells and whistles." arXiv (2019). [Slides]
  55. 55. Regression network + Tracking by detection #Tracktor Bergmann, Philipp, Tim Meinhardt, and Laura Leal-Taixe. "Tracking without bells and whistles." arXiv (2019). [Slides] The basic algorithm is combined with short term bounding box re-identifcation (reID) and camera motion compensation (CMC) to improve the state of the art.
  56. 56. Suggested Extra Lecture 56 Laura Leal-Taixé, UPC DLCV 2018
  57. 57. 57 #ROLO Ning, Guanghan, Zhi Zhang, Chen Huang, Zhihai He, Xiaobo Ren, and Haohong Wang. "Spatially Supervised Recurrent Convolutional Neural Networks for Visual Object Tracking." ISCAS 2017. Bounding box regression with RNNs
  58. 58. 58 Ning, Guanghan, Zhi Zhang, Chen Huang, Zhihai He, Xiaobo Ren, and Haohong Wang. "Spatially Supervised Recurrent Convolutional Neural Networks for Visual Object Tracking." ISCAS 2017.
  59. 59. 59 Questions ?
  60. 60. 60 Deep Learning courses @ UPC TelecomBCN: ● MSc course [2017] [2018] ● BSc course [2018] [2019] ● 1st edition (2016) ● 2nd edition (2017) ● 3rd edition (2018) ● 4th edition (2019) ● 1st edition (2017) ● 2nd edition (2018) ● 3rd edition - NLP (2019) Next edition: Autumn 2019 Registration open for 2019Registration open for 2019
  61. 61. 61 Deep Learning for Professionals @ UPC School Next edition starts November 2019. Sign up here.

×