Scalable high-dimensional
indexing with Hadoop
TEXMEX team, INRIA Rennes, France
Denis Shestakov, PhD
denis.shestakov at {aalto.fi,inria.fr}
linkedin: linkedin.com/in/dshestakov
mendeley: mendeley.com/profiles/denis-shestakov
Denis Shestakov, Diana Moise,
Gylfi Gudmundsson, Laurent Amsaleg
Outline
● Motivation
● Approach overview: scaling indexing &
searching using Hadoop
● Experimental setup: datasets, resources,
configuration
● Results
● Observations & implications
● Things to share
● Future directions
Motivation
● Big data is here
○ Lots of multimedia content
○ Even forgetting 'big' companies, 1TB/day of
multimedia is now common for many parties
● Solution: apply more computational power
○ Luckily, easier access to such power via grid/cloud
resources
● Applications:
○ Large-scale image retrieval: e.g., detecting copyright
violations in huge image repositories
○ Google Goggles-like systems: annotating the scene
Our approach
● Index & search huge image collection using
MapReduce-based eCP algorithm
○ See our work at ICMR'13: Indexing and searching
100M images with MapReduce [7]
○ See Section II for quick overview
● Use the Grid5000 platform
○ Distributed infrastructure available to French
researchers & their partners
● Use the Hadoop framework
○ Most popular open-source implementation of
MapReduce model
○ Data stored in HDFS that splits it into chunks (64MB or
often bigger) and distributes it across nodes
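To make the chunking concrete, here is a minimal Python sketch of how many HDFS blocks a file occupies for a given block size. The function name and the 10 GiB example file are illustrative assumptions and do not come from the slides; only the 64MB block size is taken from the bullet above.

```python
def hdfs_block_count(file_size_bytes: int, block_size_bytes: int) -> int:
    """Number of blocks a file of the given size occupies (the last block may be partial)."""
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("file size must be >= 0 and block size must be > 0")
    # Ceiling division without floating point.
    return -(-file_size_bytes // block_size_bytes)


MB = 1024 * 1024
example_file = 10 * 1024 * MB  # hypothetical 10 GiB file, not from the slides

blocks_64mb = hdfs_block_count(example_file, 64 * MB)    # 160 blocks
blocks_1gb = hdfs_block_count(example_file, 1024 * MB)   # 10 blocks
```

Fewer, larger blocks mean fewer map tasks per job, which matters later when the number of blocks is compared with the number of available cores.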
Our approach
● Hadoop used for both indexing and searching
● Our search scenario:
■ Searching for batch of images
● Thousands of images in one run
● Focus on throughput, not on response time
for individual image
■ Use case: copyright violation detection
● Note: indexed dataset can be searched on single
machine with adequate disk capacity if necessary
Experimental setup
● Used Grid5000 platform:
○ Nodes at the Rennes site of Grid5000
■ Up to 110 nodes available
■ Node capacity/performance varied
● Heterogeneous: nodes come from three clusters
● From 8 cores to 24 cores per node
● From 24GB to 48GB RAM per node
● Hadoop ver.1.0.1
○ (!) No changes in Hadoop internals
■ Pros: easy for others to migrate, try, and compare
■ Cons: not top performance
Experimental setup
● Over 100 million images (~30 billion SIFT descriptors)
○ Collected from the Web and provided by one of the
partners in Quaero project
■ One of the largest reported in literature
○ Images resized to 150px on largest side
○ Worked with
■ The whole set (~4TB)
■ The subset, 20 million images (~1TB)
○ Used as distracting dataset
Experimental setup
● For evaluation of indexing quality:
○ Added to distracting datasets:
■ INRIA Copydays (127 images)
○ Queried for
■ Copydays batch (3055 images = 127 original
images and their associated variants incl. strong
distortions, e.g. print-crumple-scan )
■ 12k batch (12081 images = 245 random images
from dataset and their variants)
○ Checked if original images returned as top voted
search results
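The quality check in the last bullet can be sketched as follows. This is a toy illustration, assuming that each matched descriptor casts one vote for the database image it belongs to; the slides only state that results are "top voted", so the vote model, the function names, and the sample data are all hypothetical.

```python
from collections import Counter


def top_voted(votes):
    """Return the database image id with the most votes, or None if there are no votes."""
    counts = Counter(votes)
    if not counts:
        return None
    # Ties are broken by first occurrence; the toy data below has no ties.
    return counts.most_common(1)[0][0]


def fraction_found(results, ground_truth):
    """Fraction of queries whose top-voted result is the expected original image."""
    if not ground_truth:
        return 0.0
    hits = sum(
        1
        for query, original in ground_truth.items()
        if top_voted(results.get(query, [])) == original
    )
    return hits / len(ground_truth)


# Hypothetical toy data: votes collected per query image.
results = {
    "q1": ["img_a", "img_a", "img_b"],  # img_a wins, matches the original
    "q2": ["img_c", "img_d", "img_d"],  # img_d wins, original was img_c
    "q3": [],                           # no votes at all
}
ground_truth = {"q1": "img_a", "q2": "img_c", "q3": "img_e"}

score = fraction_found(results, ground_truth)  # 1 of 3 queries found
```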
Results: workflow overview
● Experiment on indexing & searching 1TB took 5-6
hours
Results: indexing 1TB
Results: indexing 4TB
● 4TB
● 100 nodes
● Used tuned parameters
○ Except change in #mappers/#reducers per node
■ To fit bigger index tree (for 4TB) to RAM
■ 4 mappers/2 reducers
● Time: 507min
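The reason for lowering the number of mappers and reducers per node can be illustrated with a simple memory budget. The sketch below assumes each task holds its own copy of the index tree in RAM; the reserved memory and per-task memory figures are hypothetical and chosen only for illustration, while the 24GB / 8-core node corresponds to the low end of the hardware range given in the experimental setup.

```python
def max_task_slots(node_ram_gb: int, reserved_gb: int, ram_per_task_gb: int, cores: int) -> int:
    """Upper bound on concurrent tasks on one node, limited by both cores and RAM."""
    if ram_per_task_gb <= 0:
        raise ValueError("ram_per_task_gb must be positive")
    usable_gb = max(node_ram_gb - reserved_gb, 0)
    return min(cores, usable_gb // ram_per_task_gb)


# Hypothetical figures: 4 GB reserved for the OS and Hadoop daemons.
slots_small_tree = max_task_slots(24, 4, 2, 8)  # 8: cores are the limit
slots_big_tree = max_task_slots(24, 4, 3, 8)    # 6: RAM is the limit
```

In Hadoop 1.x the per-node slot counts are set with the `mapred.tasktracker.map.tasks.maximum` and `mapred.tasktracker.reduce.tasks.maximum` properties.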
Results: search quality
Results: search scalability
Results: search execution
Search 12k batch over 1TB using 100 nodes
Results: searching 4TB
● 4TB
● 87 nodes
● Copydays query batch (3k images)
○ Throughput: 460ms per image
● 12k query batch
○ Throughput: 210ms per image
● Bigger batches improve throughput insignificantly
○ bigger batch -> bigger lookup table -> more RAM per
mapper required -> fewer mappers per node
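As a rough worked example, the per-image figures above can be converted back into batch wall-clock time. This assumes that "ms per image" means total batch time divided by the number of query images, which the slides do not state explicitly; the function name is illustrative.

```python
def batch_seconds(n_images: int, ms_per_image: float) -> float:
    """Total batch time in seconds, assuming ms_per_image = batch time / n_images."""
    return n_images * ms_per_image / 1000.0


copydays_s = batch_seconds(3055, 460)    # about 1405 s, roughly 23 minutes
batch_12k_s = batch_seconds(12081, 210)  # about 2537 s, roughly 42 minutes
ratio = batch_12k_s / copydays_s         # about 1.8x longer for ~4x more queries
```

Under that assumption, a batch roughly four times larger takes less than twice as long, which is why the search scenario targets batch throughput and not per-image response time.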
Observations &
implications
● HDFS block size limits scalability
○ 1TB dataset => 1186 blocks of 1024MB size
○ Assuming 8-core nodes and reported searching
method: no scaling after 149 nodes (i.e.
8x149=1192)
○ Solutions:
■ Smaller HDFS blocks, e.g., scaling up to 280 nodes for
512MB blocks
■ Re-visit search process: e.g., partial-loading of lookup
table
● Big data is here, but the resources to process it are not
○ E.g., indexing & searching >10TB not possible given resources we had
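The scaling limit quoted above follows from simple arithmetic, sketched here under two assumptions: one map task per HDFS block, and one map slot per core. The function name is illustrative.

```python
def max_useful_nodes(num_blocks: int, map_slots_per_node: int) -> int:
    """Smallest node count whose map slots cover every block in a single wave."""
    if num_blocks < 0 or map_slots_per_node <= 0:
        raise ValueError("num_blocks must be >= 0 and map_slots_per_node must be > 0")
    return -(-num_blocks // map_slots_per_node)  # ceiling division


limit = max_useful_nodes(1186, 8)  # 149 nodes
total_slots = limit * 8            # 1192 slots, enough for all 1186 blocks
```

Beyond that node count, extra nodes would have no block to process. For other block sizes the block count has to be taken from the actual dataset layout, so the 280-node figure for 512MB blocks is quoted as reported and is not recomputed here.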
Things to share
● Our methods/system can be applied to audio datasets
○ No major changes expected
○ Contact me if interested
● Code for MapReduce-eCP algorithm available on request
○ Should run smoothly on your Hadoop cluster
○ Interested in comparisons
● Hadoop job history logs behind our experiments (not only
for those reported at CBMI) available on request
○ Describe indexing/searching of our dataset, with details
on map/reduce task execution
○ Insights on better analysis/visualization are welcome
○ Job logs for CBMI'13 experiments: http://goo.gl/e06wE
Future directions
● Deal with big batches of query images
○ ~200k query images
● Share auxiliary data (index tree, lookup table) among
mappers
○ Multithreaded map tasks
● (environment-specific) Test scalability on more nodes
○ Use several sites of Grid5000 infrastructure
■ rennes+nancy sites (up to 300 nodes) --in
progress
Acknowledgements
● TEXMEX team, INRIA Rennes, http://www.irisa.fr/texmex/index_en.php
● Quaero project, http://www.quaero.org/
● Grid5000 infrastructure & its Rennes
maintenance team, https://www.grid5000.fr
Thank you!
Questions?
