Information Retrieval
and Data Science
Paul Ramirez
paul.m.ramirez@jpl.nasa.gov
Madhav Sharan
msharan@usc.edu
ICMR 2017, Bucharest
Scalable Hadoop-Based Pooled
Time Series of Big Video Data
from the Deep Web
Dr. Chris Mattmann
mattmann@usc.edu
https://github.com/USCDataScience/hadoop-pot
Information Retrieval and Data Science (IRDS) Group
University of Southern California, Los Angeles, CA https://irds.usc.edu
Dr. Chris Mattmann
Director, IRDS
Chief Scientist JPL
ABOUT
Madhav Sharan
Graduate Student IRDS/JPL
Computer Science for Data Intensive Applications Group
Jet Propulsion Laboratory, Pasadena, CA
Paul Ramirez
Group Supervisor JPL
OUTLINE
1. Introduction
2. Dataset
3. Hadoop PoT
4. Evaluation
5. VideoSpace
6. Thanks
INTRODUCTION
• AIM – To create a scalable approach for calculating the similarity between all pairs in a set of videos
• Builds on the Pooled Time Series (PoT) algorithm presented at CVPR 2015 by Dr. Michael Ryoo
• We present our dataset and video-similarity use case, then our journey of scaling the algorithm on Hadoop
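Because the aim is all-pairs similarity, the amount of work grows quadratically with the number of videos. A minimal Python sketch of the pair count, assuming unordered pairs with no self-pairs (the helper name `pair_count` is illustrative and not from the slides); the video counts are the ones quoted for the two datasets in this deck:

```python
from math import comb


def pair_count(num_videos: int) -> int:
    """Number of unordered video pairs (i, j) with i < j."""
    return comb(num_videos, 2)


# Video counts as quoted on the dataset slides.
for name, n in [("HT", 6805), ("HMDB", 7000)]:
    print(f"{name}: {n} videos -> {pair_count(n):,} pairs")
```

Each dataset therefore yields well over 20 million pairs, which is why a sequential implementation becomes impractical.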
DATASET
HUMAN TRAFFICKING DATASET
HT (Human Trafficking) videos were crawled from internet escort ads on backpage.com
1. TOTAL SIZE - 26 GB
2. TOTAL VIDEOS - 6,805
3. AVERAGE VIDEO SIZE - 3.8 MB
4. TOTAL RECORDING LENGTH ≈ 2,250 min
5. AVERAGE RECORDING LENGTH = 19.8 secs
HMDB DATASET
HMDB: a large Human Motion DataBase, open-sourced by the Serre Lab
1. TOTAL SIZE - 1.9 GB
2. TOTAL VIDEOS - 7,000
3. AVERAGE VIDEO SIZE ≈ 0.5 MB
4. TOTAL RECORDING LENGTH ≈ 350 min
5. AVERAGE RECORDING LENGTH = 3.1 secs
This open-source, labeled dataset is used to evaluate the similarity algorithm.
PoT Similarity
FEATURE EXTRACTION
SIMILARITY ALGORITHM
1. Enumerate all possible pairs of videos across the whole video set
2. For each pair - calculate the mean distance
   a. Calculate HOF (histogram of optical flow) and HOG (histogram of oriented gradients) for both videos using OpenCV, or use cached values; cache the computed HOF and HOG
   b. Calculate the pooled time series feature for both videos
3. For each pair - calculate the chi-squared similarity
   a. Use the cached HOF and HOG
   b. Calculate the pooled time series feature for both videos
   c. Use the mean distance and both series to calculate a similarity score for the pair
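The two passes over all pairs can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the PoT implementation: the feature vectors are toy stand-ins for pooled time series features, the chi-squared distance uses one common form (sum of squared differences over sums), and the exponential kernel normalised by the mean distance is an assumed way of combining "mean distance and both series" into a score. All function names are hypothetical.

```python
from itertools import combinations
from math import exp


def chi_squared_distance(a, b):
    """Chi-squared distance between two non-negative feature vectors."""
    total = 0.0
    for x, y in zip(a, b):
        if x + y > 0:  # skip bins that are empty in both vectors
            total += (x - y) ** 2 / (x + y)
    return total


def pairwise_similarity(features):
    """Map each unordered pair of video ids to a similarity in (0, 1].

    features: dict of video id -> feature vector (toy stand-in for a
    pooled time series feature).
    """
    pairs = list(combinations(sorted(features), 2))
    if not pairs:
        return {}
    # Pass 1: distance for every pair, then the mean over all pairs.
    dist = {p: chi_squared_distance(features[p[0]], features[p[1]]) for p in pairs}
    mean_dist = sum(dist.values()) / len(dist)
    if mean_dist == 0:  # all videos identical
        return {p: 1.0 for p in pairs}
    # Pass 2 (assumed kernel): similarity decays with distance relative to the mean.
    return {p: exp(-d / mean_dist) for p, d in dist.items()}


toy_features = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
scores = pairwise_similarity(toy_features)
print(scores)
```

Because the second pass needs the mean over all pairs, the first pass has to finish before any score can be produced, which is one reason the work splits naturally into separate jobs.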
PROBLEMS
1. Out of Memory (OoM) issues
2. Time-consuming sequential code
3. Missing instrumentation and checkpointing
4. Could only process 500 videos in 2 days
HADOOP PoT
HADOOP JOBS
CARTESIAN INPUT FORMAT
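As a rough illustration of the general idea behind a Cartesian input format (not the hadoop-pot implementation itself), the Python sketch below cuts the video list into splits and turns every pair of splits into one independent task, so that all video pairs can be processed in parallel. The split size, the function names, and the choice to pair split i only with splits j >= i (so each unordered video pair appears exactly once) are assumptions made for this sketch.

```python
def make_splits(items, split_size):
    """Cut the input list into consecutive chunks of at most split_size items."""
    return [items[i:i + split_size] for i in range(0, len(items), split_size)]


def cartesian_tasks(items, split_size):
    """Yield one task per pair of splits (i, j) with i <= j."""
    splits = make_splits(items, split_size)
    for i in range(len(splits)):
        for j in range(i, len(splits)):
            yield i, j, splits[i], splits[j]


def pairs_in_task(i, j, left, right):
    """Video pairs covered by one task, each unordered pair emitted once."""
    if i == j:
        # Same split on both sides: only pairs within the split.
        return [(a, b) for k, a in enumerate(left) for b in left[k + 1:]]
    return [(a, b) for a in left for b in right]


videos = [f"v{k}" for k in range(7)]
all_pairs = []
for i, j, left, right in cartesian_tasks(videos, split_size=3):
    all_pairs.extend(pairs_in_task(i, j, left, right))
print(len(all_pairs), "pairs from", len(videos), "videos")
```

Each task depends only on its two splits, so a mapper that receives one task needs no other input to compute its share of the pairwise scores.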
EVALUATION
OBSERVED RUNTIME
Total time for all Hadoop jobs:
HT - 33.18 hours
HMDB - 26.84 hours
Time difference as per video length
Similar time for different video length
QUALITATIVE EVALUATION
1. For each video, fetch the top 5 most similar videos as ranked by PoT
2. Record the number of fetched videos with the same label (True)
3. Recall = True/Total
4. Every label had the highest recall for its own label
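The evaluation steps above can be sketched in Python as follows. This is a minimal illustration, assuming that Total is the number of fetched videos (k per query video) and that recall is aggregated per label of the query video; the function name and the toy scores and labels are hypothetical.

```python
from collections import defaultdict


def top_k_label_recall(similarity, labels, k=5):
    """Per-label recall = True / Total over the top-k neighbours of each video.

    similarity: dict of (video_a, video_b) -> score, one entry per unordered pair.
    labels: dict of video -> label.
    """
    # Build each video's list of (score, other video).
    neighbours = defaultdict(list)
    for (a, b), score in similarity.items():
        neighbours[a].append((score, b))
        neighbours[b].append((score, a))

    true_hits = defaultdict(int)
    total = defaultdict(int)
    for video, scored in neighbours.items():
        # Highest scores first; keep the k most similar videos.
        for _, other in sorted(scored, reverse=True)[:k]:
            total[labels[video]] += 1
            if labels[other] == labels[video]:
                true_hits[labels[video]] += 1
    return {label: true_hits[label] / total[label] for label in total}


toy_labels = {"a": "run", "b": "run", "c": "jump", "d": "jump"}
toy_similarity = {
    ("a", "b"): 0.9, ("a", "c"): 0.1, ("a", "d"): 0.2,
    ("b", "c"): 0.3, ("b", "d"): 0.1, ("c", "d"): 0.8,
}
print(top_k_label_recall(toy_similarity, toy_labels, k=1))
```

With k=1 every toy video retrieves its same-label partner, so both labels score 1.0; raising k to 3 forces each video to fetch all others and the recall drops to 1/3.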
VIDEOSPACE
INTRODUCING VIDEOSPACE
SEARCH RESULTS PAGE
DETAILS POPUP
FUTURE WORK
1. Preprocessing videos
   1. Removing banners at the start of a video
   2. Dividing a video into a set of scenes
2. Adding convolutional features to enable object recognition, etc.; HOF and HOG are too simple
THANK YOU
Questions/Comments?
Madhav Sharan
msharan@usc.edu
@goyal_madhav
@smadha
Dr. Chris Mattmann
mattmann@usc.edu
@chrismattmann
@chrismattmann
https://github.com/USCDataScience/hadoop-pot
