3. Motivation
● Big data is here
○ Lots of multimedia content
○ Even setting aside the 'big' companies, 1TB/day of
multimedia is now common for many organizations
● Solution: apply more computational power
○ Luckily, easier access to such power via grid/cloud
resources
● Applications:
○ Large-scale image retrieval: e.g., detecting copyright
violations in huge image repositories
○ Google Goggles-like systems: annotating the scene
4. Our approach
● Index & search huge image collection using
MapReduce-based eCP algorithm
○ See our work at ICMR'13: Indexing and searching
100M images with MapReduce [7]
○ See Section II for quick overview
● Use the Grid5000 platform
○ Distributed infrastructure available to French
researchers & their partners
● Use the Hadoop framework
○ Most popular open-source implementation of
MapReduce model
○ Data is stored in HDFS, which splits it into chunks (64MB
by default, often configured larger) and distributes them
across nodes
5. Our approach
● Hadoop used for both indexing and searching
● Our search scenario:
■ Searching for batch of images
● Thousands of images in one run
● Focus on throughput, not on response time
for individual image
■ Use case: copyright violation detection
● Note: if necessary, the indexed dataset can be searched
on a single machine with adequate disk capacity
6. Experimental setup
● Used Grid5000 platform:
○ Nodes in rennes site of Grid5000
■ Up to 110 nodes available
■ Nodes capacity/performance varied
● Heterogeneous, coming from three clusters
● From 8 cores to 24 cores per node
● From 24GB to 48GB RAM per node
● Hadoop ver.1.0.1
○ (!) No changes to Hadoop internals
■ Pros: easy for others to migrate, try, and compare
■ Cons: not top performance
7. Experimental setup
● Over 100 mln images (~30 billion SIFT descriptors)
○ Collected from the Web and provided by one of the
partners in Quaero project
■ One of the largest reported in the literature
○ Images resized to 150px on largest side
○ Worked with
■ The whole set (~4TB)
■ A subset of 20 mln images (~1TB)
○ Used as the distractor dataset
8. Experimental setup
● For evaluation of indexing quality:
○ Added to the distractor dataset:
■ INRIA Copydays (127 images)
○ Queried for
■ Copydays batch (3055 images = 127 original
images and their associated variants incl. strong
distortions, e.g. print-crumple-scan )
■ 12k batch (12081 images = 245 random images
from dataset and their variants)
○ Checked whether the original images were returned as
the top-voted search results
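The quality check above (is the original image the top-voted result for each variant?) can be sketched as a simple accuracy computation. The data layout and names here are illustrative assumptions, not taken from the paper:

```python
def top_result_is_original(results, query_to_original):
    """Fraction of queries whose top-ranked result is the true original.

    results: dict mapping each query (variant) image id to its ranked
             list of retrieved image ids (best match first).
    query_to_original: dict mapping each variant back to the Copydays
                       original it was derived from.
    """
    hits = sum(1 for q, ranked in results.items()
               if ranked and ranked[0] == query_to_original[q])
    return hits / len(results)

# Toy example with hypothetical ids: one hit, one miss.
results = {"v1": ["orig1", "x"], "v2": ["y", "orig2"]}
mapping = {"v1": "orig1", "v2": "orig2"}
print(top_result_is_original(results, mapping))  # -> 0.5
```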
11. Results: indexing 4TB
● 4TB
● 100 nodes
● Used tuned parameters
○ Except for #mappers/#reducers per node
■ Changed to fit the bigger index tree (for 4TB) in RAM
■ 4 mappers / 2 reducers
● Time: 507min
15. Results: searching 4TB
● 4TB
● 87 nodes
● Copydays query batch (3k images)
○ Throughput: 460ms per image
● 12k query batch
○ Throughput: 210ms per image
● Bigger batches improve throughput only marginally
○ bigger batch -> bigger lookup table -> more RAM
required per mapper -> fewer mappers per node
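As a sanity check, the per-image figures above convert to batch wall-clock time as follows (assuming, as the slides suggest, that ms/image is total job time divided by batch size):

```python
def batch_minutes(n_images, ms_per_image):
    # Amortized per-image time x batch size -> total wall-clock minutes.
    return n_images * ms_per_image / 1000 / 60

print(round(batch_minutes(3055, 460)))   # Copydays batch: ~23 min
print(round(batch_minutes(12081, 210)))  # 12k batch: ~42 min
```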
16. Observations & implications
● HDFS block size limits scalability
○ 1TB dataset => 1186 blocks of 1024MB
○ Assuming 8-core nodes and the reported searching
method: no scaling beyond 149 nodes (8x149=1192 map
slots >= 1186 blocks)
○ Solutions:
■ Smaller HDFS blocks: e.g., 512MB blocks would allow
scaling up to 280 nodes
■ Revisit the search process: e.g., partial loading of the
lookup table
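The block-count argument above follows from Hadoop scheduling one map task per HDFS block, with a fixed number of map slots per node; this small sketch restates the slide's arithmetic:

```python
import math

def max_useful_nodes(n_blocks, map_slots_per_node=8):
    # One map task per HDFS block; beyond ceil(blocks / slots) nodes,
    # additional nodes receive no map tasks and cannot speed up the job.
    return math.ceil(n_blocks / map_slots_per_node)

# 1TB dataset in 1024MB blocks -> 1186 blocks (slide figure)
print(max_useful_nodes(1186))  # -> 149 (8 x 149 = 1192 >= 1186)
```

Halving the block size doubles the number of map tasks, which is why smaller blocks push the scaling limit out (at the cost of more task-scheduling overhead).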
● Big data is here, but resources to process it are not
○ E.g., indexing & searching >10TB was not possible with the resources we had
17. Things to share
● Our methods/system can be applied to audio datasets
○ No major changes expected
○ Contact me if interested
● Code for MapReduce-eCP algorithm available on request
○ Should run smoothly on your Hadoop cluster
○ Interested in comparisons
● Hadoop job history logs behind our experiments (not only
those reported at CBMI) available on request
○ They describe the indexing/searching of our dataset, with
details on map/reduce task execution
○ Insights on better analysis/visualization are welcome
○ Job logs for CBMI'13 experiments: http://goo.gl/e06wE
18. Future directions
● Deal with big batches of query images
○ ~200k query images
● Share auxiliary data (index tree, lookup table) among
mappers
○ Multithreaded map tasks
● (environment-specific) Test scalability on more nodes
○ Use several sites of Grid5000 infrastructure
■ rennes+nancy sites (up to 300 nodes), in progress