MRShare: Sharing Across Multiple Queries in MapReduce
Tomasz Nykiel (University of Toronto), Michalis Potamias (Boston University), Chaitanya Mishra (University of Toronto, currently Facebook), George Kollios (Boston University), Nick Koudas (University of Toronto)

Data management landscape
[Figure: data management systems placed between flexibility and efficiency. Labels: large-scale setups; time performance; σ, π; MRShare, a sharing framework for MapReduce.]

MRShare: a sharing framework for MapReduce
The MRShare framework:
• Is inspired by sharing primitives from the relational domain
• Introduces a cost model for MapReduce jobs
• Searches for the optimal sharing strategies
• Does not change the MapReduce computational model

Outline
• Introduction
• MapReduce recap
• MRShare: sharing primitives in MapReduce
• MRShare: cost-based approach to sharing
• MRShare evaluation
• Summary

MapReduce recap
[Figure: MapReduce data flow. Labels: I, HDFS, Map, network, Reduce, Output, HDFS.]

Sharing primitives: sharing scans
SQL:
SELECT COUNT(*) FROM user GROUP BY hometown
SELECT AVG(age) FROM user GROUP BY hometown
[Figure: both MapReduce jobs scan the same input records (for example "id1 student Toronto"). The map of the first job emits pairs such as (Toronto, 1) and (Ottawa, 1); the map of the second job emits pairs such as (Toronto, 17) and (Ottawa, 23). Each reduce aggregates per hometown, giving values such as (Toronto, 3) and (Toronto, 18).]

MRShare: sharing scans (map)
[Figure: the input feeds a meta-map that wraps Map 1, Map 2, Map 3 and Map 4 and produces the map output.]

MRShare: sharing scans (reduce)
[Figure: a meta-reduce wraps Reduce 1, Reduce 2, Reduce 3 and Reduce 4.]

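The meta-map and meta-reduce idea for sharing scans can be sketched in a few lines of Python. This is a single-process illustration, not the modified Hadoop engine from the talk: the record layout, the `USERS` data, the job names and the tagging scheme are all assumptions made for the example. Each input record is read once, every job's map function runs on it, and the outputs carry a job tag so that the reduce side can dispatch them to the right reduce function.

```python
from collections import defaultdict

# Hypothetical input records: (id, occupation, hometown, age).
USERS = [
    ("id1", "student", "Toronto", 17),
    ("id2", "student", "Toronto", 19),
    ("id3", "engineer", "Toronto", 18),
    ("id4", "student", "Ottawa", 23),
    ("id5", "teacher", "Ottawa", 25),
    ("id6", "student", "Montreal", 20),
]


def map_count(rec):
    # SELECT COUNT(*) ... GROUP BY hometown
    yield rec[2], 1


def map_avg(rec):
    # SELECT AVG(age) ... GROUP BY hometown
    yield rec[2], rec[3]


def reduce_count(key, values):
    return sum(values)


def reduce_avg(key, values):
    return sum(values) / len(values)


# Job tag -> (map function, reduce function). The tags are illustrative.
JOBS = {"count": (map_count, reduce_count), "avg": (map_avg, reduce_avg)}


def run_shared(records, jobs):
    """Run all jobs over a single scan of the input."""
    shuffled = defaultdict(list)  # (job tag, key) -> values
    records_read = 0
    for rec in records:  # the only scan of the input
        records_read += 1
        for tag, (map_fn, _) in jobs.items():  # meta-map: call every map
            for key, value in map_fn(rec):
                shuffled[(tag, key)].append(value)
    results = {tag: {} for tag in jobs}
    for (tag, key), values in shuffled.items():  # meta-reduce: dispatch by tag
        reduce_fn = jobs[tag][1]
        results[tag][key] = reduce_fn(key, values)
    return results, records_read


RESULTS, RECORDS_READ = run_shared(USERS, JOBS)
```

Running the two jobs separately would read each of the six records twice; here `RECORDS_READ` is 6 for both jobs together, while the map output is the same size as before.
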
Outline
• Introduction
• MapReduce recap
• MRShare: sharing primitives in MapReduce
  - Sharing scans
  - Sharing intermediate data
• MRShare: cost-based approach to sharing
• MRShare evaluation
• Summary

Sharing primitives: sharing intermediate data
SQL:
SELECT COUNT(*) FROM user WHERE occupation = 'student' GROUP BY hometown
SELECT COUNT(*) FROM user WHERE age > 18 GROUP BY hometown
[Figure: both MapReduce jobs scan the same input records (for example "id1 student Toronto"). One map tests occupation = 'student', the other tests age > 18, and both emit pairs of the same form, such as (Toronto, 1) and (Ottawa, 1). Each reduce sums the counts per hometown.]

MRShare: sharing intermediate data (map)
[Figure: the input feeds a meta-map that wraps Map 1, Map 2, Map 3 and Map 4 and produces the map output.]

MRShare: sharing intermediate data (reduce)
[Figure: a meta-reduce wraps Reduce 1, Reduce 2, Reduce 3 and Reduce 4.]

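When several jobs emit identical map output, the shared job can emit each pair once and record which jobs it belongs to. The sketch below illustrates that idea for the two filtered COUNT queries; it is a single-process approximation, and the record layout, the `USERS` data, the job names and the use of a tag set per pair are assumptions for the example rather than the encoding used by the authors.

```python
from collections import defaultdict

# Hypothetical input records: (id, occupation, hometown, age).
USERS = [
    ("id1", "student", "Toronto", 17),
    ("id2", "student", "Toronto", 19),
    ("id3", "engineer", "Toronto", 18),
    ("id4", "student", "Ottawa", 23),
    ("id5", "teacher", "Ottawa", 25),
    ("id6", "student", "Montreal", 20),
]

# Job tag -> predicate from the WHERE clause. The tags are illustrative.
FILTERS = {
    "students": lambda rec: rec[1] == "student",
    "adults": lambda rec: rec[3] > 18,
}


def meta_map(rec, filters):
    """Emit (hometown, 1) at most once, tagged with every job that wants it."""
    tags = frozenset(name for name, keep in filters.items() if keep(rec))
    if tags:
        yield rec[2], (tags, 1)


def run_shared(records, filters):
    shuffled = defaultdict(list)  # hometown -> [(tags, value), ...]
    pairs_emitted = 0
    for rec in records:
        for key, tagged in meta_map(rec, filters):
            shuffled[key].append(tagged)
            pairs_emitted += 1
    counts = {name: defaultdict(int) for name in filters}
    for key, tagged_values in shuffled.items():  # meta-reduce
        for tags, value in tagged_values:
            for name in tags:  # route the shared pair to each tagged job
                counts[name][key] += value
    return {name: dict(c) for name, c in counts.items()}, pairs_emitted


COUNTS, SHARED_PAIRS = run_shared(USERS, FILTERS)

# Pairs the two jobs would emit if each ran its own map phase.
UNSHARED_PAIRS = sum(
    1 for rec in USERS for keep in FILTERS.values() if keep(rec)
)
```

On this toy input the shared job emits 5 pairs where the separate jobs would emit 8, so less data has to be sorted and copied.
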
Outline
• Introduction
• MapReduce recap
• MRShare: sharing primitives in MapReduce
• MRShare: cost-based approach to sharing
  - Cost model for finding the optimal sharing strategy
  - SplitJobs: cost-based algorithm for sharing scans
  - MultiSplitJobs: an improvement of SplitJobs
  - γ-MultiSplitJobs: the algorithm for sharing intermediate data
• MRShare evaluation
• Summary

Cost model for MapReduce (single job)
Phases: reading input, sorting intermediate data, copying, writing output.
• Reading: f(input size)
• Sorting: f(intermediate data size)
• Copying: f(intermediate data size)
• Writing: f(output size)

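A minimal sketch of such a per-job cost function follows. The slide only says that each phase is a function of a data size; the linear form and the coefficient values below are assumptions chosen for illustration, and the names `Job`, `Coefficients` and `job_cost` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    input_size: float         # data read by the map phase
    intermediate_size: float  # map output that is sorted and copied
    output_size: float        # data written by the reduce phase


@dataclass(frozen=True)
class Coefficients:
    # Hypothetical per-unit costs; real values would be measured on the cluster.
    read: float = 1.0
    sort: float = 2.0
    copy: float = 1.5
    write: float = 1.0


def job_cost(job: Job, c: Coefficients = Coefficients()) -> float:
    """Cost of one job as the sum of its four phases (assumed linear)."""
    reading = c.read * job.input_size
    sorting = c.sort * job.intermediate_size
    copying = c.copy * job.intermediate_size
    writing = c.write * job.output_size
    return reading + sorting + copying + writing


EXAMPLE_COST = job_cost(Job(input_size=100, intermediate_size=10, output_size=1))
```

With these made-up coefficients the example job costs 100 for reading, 20 for sorting, 15 for copying and 1 for writing.
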
Cost of executing a group of jobs
[Figure: Read, Sort, Copy and Write cost bars for J1, J2 and J3 executed separately, compared with the merged job J1+J2+J3. Regions are marked "Savings", "Potential savings" and "Potential costs".]

Finding the optimal sharing strategy
[Figure: alternative groupings of jobs J1 to J5, with "NoShare" and "GreedyShare" shown as the two extreme strategies.]

Sharing scans: cost-based optimization
[Figure: Read and Sort cost bars for J1, J2 and J3 executed separately, compared with the merged job J1+J2+J3. Regions are marked "Savings" and "Potential costs".]
• Savings come from the reduced number of scans
• The sorting cost might change
• The costs of copying and writing the output do not change

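The trade-off in the bullets above can be written as a small gain function. This is a sketch under stated assumptions: all jobs read the same input, the read cost is linear with a hypothetical coefficient `c_read`, and the sorting costs are supplied by the caller because the slide does not say how they are computed. Copying and writing are left out since they do not change.

```python
def scan_sharing_gain(input_size, sort_costs, merged_sort_cost, c_read=1.0):
    """Cost saved by running the jobs as one merged job that scans once.

    sort_costs       -- sorting cost of each job when run on its own
    merged_sort_cost -- sorting cost of the merged job (may be higher)
    A negative result means sharing is more expensive than not sharing.
    """
    n_jobs = len(sort_costs)
    saved_scans = (n_jobs - 1) * c_read * input_size
    extra_sorting = merged_sort_cost - sum(sort_costs)
    return saved_scans - extra_sorting


# Three jobs over the same input of size 100 (made-up numbers).
GAIN_CHEAP_SORT = scan_sharing_gain(100, [10, 20, 30], merged_sort_cost=75)
GAIN_COSTLY_SORT = scan_sharing_gain(100, [10, 20, 30], merged_sort_cost=300)
```

In the first case two scans are saved (200) at the price of 15 extra sorting cost; in the second case the extra sorting cost (240) outweighs the saved scans, so the jobs are better left separate.
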
SplitJobs: a dynamic programming (DP) solution for sharing scans
We reduce the problem of grouping jobs to the problem of splitting a sorted list of jobs, by approximating the cost of sorting.
[Figure: SplitJobs splits the sorted list J1 to J6 into groups G1, G2 and G3.]

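Splitting a sorted list into contiguous groups is a classic dynamic program, sketched below in Python. This is a generic illustration of the technique, not the SplitJobs algorithm as published: the function name `split_sorted`, the representation of a job by a single number, and the example cost function are all assumptions. Any cost function that depends only on the jobs inside a group can be plugged in.

```python
from typing import Callable, List, Sequence


def split_sorted(
    jobs: Sequence[float],
    group_cost: Callable[[Sequence[float]], float],
) -> List[List[float]]:
    """Split a sorted job list into contiguous groups of minimal total cost."""
    n = len(jobs)
    best = [0.0] + [float("inf")] * n  # best[i]: minimal cost of jobs[:i]
    cut = [0] * (n + 1)                # cut[i]: start of the last group in jobs[:i]
    for end in range(1, n + 1):
        for start in range(end):
            cost = best[start] + group_cost(jobs[start:end])
            if cost < best[end]:
                best[end] = cost
                cut[end] = start
    groups = []
    end = n
    while end > 0:  # walk the cut points backwards to recover the groups
        start = cut[end]
        groups.append(list(jobs[start:end]))
        end = start
    groups.reverse()
    return groups


SCAN_COST = 10  # hypothetical cost of one scan of the input


def example_group_cost(group):
    # Assumed model: one scan per group, and every job in the group pays a
    # sorting cost driven by the largest job in that group.
    return SCAN_COST + len(group) * max(group)


# Jobs represented by a made-up sorting weight, already sorted.
GROUPS = split_sorted([1, 1, 2, 8, 9, 20], example_group_cost)
```

With this cost function the cheapest split is three groups, `[1, 1, 2]`, `[8, 9]` and `[20]`, at a total cost of 74, compared with 101 for six separate jobs and 130 for a single group.
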
MultiSplitJobs: an improvement of SplitJobs
[Figure: SplitJobs is applied several times to jobs J1 to J8, producing groups G1, G2, G3 and G4.]

Sharing intermediate data: cost-based optimization
[Figure: Read, Sort and Copy cost bars for J1, J2 and J3 executed separately, compared with the merged job J1+J2+J3. Regions are marked "Savings", "Potential savings" and "Potential costs or savings".]
• The sorting and copying costs change, depending on the size of the intermediate data
• We need to estimate the size of the intermediate data of all combinations of jobs
• Maintaining such statistics has a prohibitive cost

γ-MultiSplitJobs: the solution for sharing intermediate data
• Approximate the size of the intermediate data
• γ is set heuristically
[Figure: the intermediate data of the merged jobs J1, J2 and J3 expressed as a sum with a term scaled by γ.]

Evaluation setup
• 40 EC2 small-instance virtual machines
• Modified Hadoop engine
• 30 GB text dataset consisting of blogs
• Multiple grep-wordcount queries
  - Counts words matching a regular expression
  - Allows for variable intermediate data sizes
  - Generic aggregation MapReduce job

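A grep-wordcount query can be pictured as a word count whose map phase only emits words that match a regular expression. The Python sketch below is a single-process illustration and not the benchmark code from the evaluation: the function names are hypothetical, words are split on whitespace, and a word is assumed to match only when the whole word matches the pattern.

```python
import re
from collections import Counter


def grep_wordcount_map(line, pattern):
    """Map: emit (word, 1) for every word that fully matches the pattern."""
    for word in line.split():
        if pattern.fullmatch(word):
            yield word, 1


def grep_wordcount(lines, regex):
    """Run the map over all lines and sum the counts per word (the reduce)."""
    pattern = re.compile(regex)
    counts = Counter()
    for line in lines:
        for word, one in grep_wordcount_map(line, pattern):
            counts[word] += one
    return dict(counts)


LINES = ["map reduce map", "merge maps"]
SELECTIVE = grep_wordcount(LINES, r"ma.*")  # few words match
PERMISSIVE = grep_wordcount(LINES, r".*")   # every word matches
```

The more permissive the expression, the more pairs the map phase emits, which is how one query template yields jobs with different intermediate data sizes.
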
Evaluation goals
• Sharing is not always beneficial: the 'GreedyShare' policy
• How much can we save on sharing scans? MRShare MultiSplitJobs evaluation
• How much can we save on sharing intermediate data? MRShare γ-MultiSplitJobs evaluation

Is sharing always beneficial? The 'GreedyShare' policy

How much do we save on sharing scans? MRShare MultiSplitJobs

How much do we save on sharing intermediate data? MRShare γ-MultiSplitJobs

Summary
• We introduced MRShare, a framework for automatic work sharing in MapReduce.
• We identified sharing primitives and demonstrated their implementation in a MapReduce engine.
• We established a cost model and solved several work-sharing optimization problems.
• We demonstrated vast savings when using MRShare.

Thank you! Questions?

Mr Share 11 Sep 2010

  • 1. MRShare: Sharing Across Multiple Queries in MapReduce. Tomasz Nykiel (University of Toronto), Michalis Potamias (Boston University), Chaitanya Mishra (University of Toronto, currently Facebook), George Kollios (Boston University), Nick Koudas (University of Toronto)
  • 2.
  • 4. Time performanceσπ efficiency 2
  • 5. MRShare – a sharing framework for Map Reduce. The MRShare framework: is inspired by sharing primitives from the relational domain; introduces a cost model for Map Reduce jobs; searches for the optimal sharing strategies; does not change the Map Reduce computational model.
  • 6. Outline: Introduction; Map Reduce recap; MRShare – Sharing primitives in Map-Reduce; MRShare – Cost-based approach to sharing; MRShare Evaluation; Summary
  • 8. Map Reduce recap. [Diagram: Map and Reduce tasks; labels: input (I), network, Output, HDFS]
  • 10. Sharing primitives – sharing scans. SQL: (1) SELECT COUNT(*) FROM user GROUP BY hometown; (2) SELECT AVG(age) FROM user GROUP BY hometown. [Figure: the two Map Reduce jobs read the same input record (id1, student, Toronto); the count job's map emits (Toronto, 1) and the average job's map emits (Toronto, 17); the reducers aggregate per hometown for Toronto, Montreal and Ottawa.]
  • 11. MRShare – sharing scans (map). [Diagram: Input, then a Meta-map wrapping Map 1, Map 2, Map 3 and Map 4, then Map output]
  • 12. MRShare – sharing scans (reduce). [Diagram: a Meta-reduce wrapping Reduce 1, Reduce 2, Reduce 3 and Reduce 4]
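The meta-map and meta-reduce idea for sharing scans can be sketched in a few lines of Python. This is an illustrative in-memory simulation, not the modified Hadoop engine used in the talk: the record layout `(user_id, occupation, hometown, age)`, the function names, and the tagging of each value with a job index are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical per-query map/reduce functions for the two SQL queries on the slide.
def map_count(record):
    _, _, hometown, _ = record
    yield hometown, 1

def map_avg_age(record):
    _, _, hometown, age = record
    yield hometown, age

def reduce_count(values):
    return sum(values)

def reduce_avg(values):
    return sum(values) / len(values)

def meta_map(record, mappers):
    """Run every job's map function on one scanned record, tagging output with the job index."""
    for job_id, mapper in enumerate(mappers):
        for key, value in mapper(record):
            yield key, (job_id, value)

def meta_reduce(tagged_values, reducers):
    """Route tagged values back to the reduce function of the job that produced them."""
    per_job = defaultdict(list)
    for job_id, value in tagged_values:
        per_job[job_id].append(value)
    return {job_id: reducers[job_id](vals) for job_id, vals in per_job.items()}

def run_shared(records, mappers, reducers):
    """Simulate one merged job: the input is scanned once for all queries."""
    shuffled = defaultdict(list)
    for record in records:  # single scan of the input
        for key, tagged in meta_map(record, mappers):
            shuffled[key].append(tagged)
    return {key: meta_reduce(vals, reducers) for key, vals in shuffled.items()}
```

Calling `run_shared(users, [map_count, map_avg_age], [reduce_count, reduce_avg])` returns, per hometown, a dictionary keyed by job index, so both query answers come out of one pass over `users`.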
  • 13. Outline: Introduction; Map Reduce recap; MRShare – Sharing primitives in Map-Reduce (sharing scans, sharing intermediate data); MRShare – Cost-based approach to sharing; MRShare Evaluation; Summary
  • 14. Sharing primitives – sharing intermediate data. SQL: (1) SELECT COUNT(*) FROM user WHERE occupation = 'student' GROUP BY hometown; (2) SELECT COUNT(*) FROM user WHERE age > 18 GROUP BY hometown. [Figure: both Map Reduce jobs read the same input record (id1, student, Toronto); each map applies its own predicate (occupation = 'student' or age > 18) and emits (Toronto, 1) when the record passes; the reducers count per hometown for Toronto, Ottawa and Montreal.]
  • 15. MRShare – sharing intermediate data (map). [Diagram: Input, then a Meta-map wrapping Map 1, Map 2, Map 3 and Map 4, then Map output]
  • 16. MRShare – sharing intermediate data (reduce). [Diagram: a Meta-reduce wrapping Reduce 1, Reduce 2, Reduce 3 and Reduce 4]
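Sharing intermediate data goes one step further than sharing scans: when several jobs would emit the same key and value for a record, the merged job can emit it once and note which jobs it belongs to. The sketch below illustrates this for the two filtered COUNT queries. The record layout, the function names, and the use of a `frozenset` of job indices as the tag are assumptions for illustration, not the talk's implementation.

```python
from collections import defaultdict

def meta_map(record, predicates):
    """Emit at most one tuple per record, tagged with the jobs whose predicate it satisfies."""
    _, _, hometown, _ = record
    tags = frozenset(j for j, pred in enumerate(predicates) if pred(record))
    if tags:  # records that no job wants produce no intermediate data
        yield hometown, (tags, 1)

def meta_reduce(tagged_values, num_jobs):
    """Add each shared value to the count of every job named in its tag."""
    counts = [0] * num_jobs
    for tags, value in tagged_values:
        for job_id in tags:
            counts[job_id] += value
    return counts

def run_shared_counts(records, predicates):
    """Simulate one merged job over an in-memory list of records."""
    shuffled = defaultdict(list)
    for record in records:
        for key, tagged in meta_map(record, predicates):
            shuffled[key].append(tagged)
    return {key: meta_reduce(vals, len(predicates)) for key, vals in shuffled.items()}
```

A record that satisfies both predicates contributes a single intermediate tuple instead of two, which is where the reduction in sorting and copying comes from.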
  • 17. Outline: Introduction; Map Reduce recap; MRShare – Sharing primitives in Map-Reduce; MRShare – Cost-based approach to sharing (cost model for finding the optimal sharing strategy; SplitJobs, a cost-based algorithm for sharing scans; MultiSplitJobs, an improvement of SplitJobs; γ-MultiSplitJobs, the algorithm for sharing intermediate data); MRShare Evaluation; Summary
  • 18. Cost model for Map Reduce (single job). Four cost components: reading input, f(input size); sorting intermediate data, f(intermediate data size); copying, f(intermediate data size); writing output, f(output size).
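One way to write the single-job model as a formula, under the simplifying assumption that every component is linear in its size (the slide only states that each cost is some function f of the corresponding size). The symbols |I|, |D| and |O| for input, intermediate data and output size, and the constants C_r, C_s, C_c, C_w, are notation introduced here for illustration.

```latex
T(J) \;=\; \underbrace{C_r\,|I|}_{\text{read}}
      \;+\; \underbrace{C_s\,|D|}_{\text{sort}}
      \;+\; \underbrace{C_c\,|D|}_{\text{copy}}
      \;+\; \underbrace{C_w\,|O|}_{\text{write}}
```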
  • 19. Cost of executing a group of jobs. [Figure: Read, Sort, Copy and Write cost bars for jobs J1, J2 and J3 run separately, compared with the merged job J1+J2+J3; annotations mark savings, potential savings and potential costs.]
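To see why merging jobs brings both savings and potential costs, here is a toy cost calculation. It assumes a sort cost that grows with the number of merge passes, so that sorting the combined intermediate data can cost more than sorting each job's data separately. The `Job` type, the cost constants, the buffer size and the merge fan-in are all hypothetical values chosen for the example, not figures from the talk.

```python
from typing import NamedTuple

class Job(NamedTuple):
    input_size: int
    intermediate_size: int
    output_size: int

def merge_passes(data_size, buffer_size=100, fan_in=10):
    """Passes of an external merge sort: one to create sorted runs, then merges."""
    runs = -(-data_size // buffer_size)  # ceiling division
    passes = 1
    while runs > 1:
        runs = -(-runs // fan_in)
        passes += 1
    return passes

def job_cost(job, c_read=1, c_sort=1, c_copy=1, c_write=1):
    """Read + sort + copy + write cost of a single job (assumed model)."""
    sort = c_sort * job.intermediate_size * merge_passes(job.intermediate_size)
    return (c_read * job.input_size + sort
            + c_copy * job.intermediate_size + c_write * job.output_size)

def merged(jobs):
    """Merged job for sharing scans: one read of the common input, summed data."""
    return Job(jobs[0].input_size,
               sum(j.intermediate_size for j in jobs),
               sum(j.output_size for j in jobs))

def savings(jobs):
    """Positive when running the group as one job is cheaper than running each alone."""
    return sum(job_cost(j) for j in jobs) - job_cost(merged(jobs))
```

With a large input and small intermediate data, `savings([Job(1000, 100, 10)] * 2)` is positive because the second scan is avoided. With a small input and large intermediate data, `savings([Job(100, 1000, 10)] * 2)` is negative because the extra merge pass outweighs the saved scan.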
  • 25. MultiSplitJobs – an improvement of SplitJobs. [Diagram: jobs J1 to J8; successive applications of SplitJobs yield groups G1, G2, G3 and G4, which together form the MultiSplitJobs result.]
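The diagram suggests that MultiSplitJobs applies SplitJobs repeatedly to the jobs that were left ungrouped. The sketch below captures only that outer loop, under explicit assumptions: `split_jobs` is a placeholder callable supplied by the caller that returns a partition of the given jobs as a list of lists, jobs left in single-element groups are retried, and the loop stops as soon as a round forms no new group. The talk's actual SplitJobs algorithm and stopping rule are not reproduced here.

```python
def multi_split_jobs(jobs, split_jobs):
    """Repeatedly apply a SplitJobs-like function to the jobs it leaves ungrouped.

    `split_jobs(jobs)` is a hypothetical callable returning a partition of `jobs`
    as a list of lists; groups of size one are treated as "not shared yet".
    """
    remaining = list(jobs)
    groups = []
    while remaining:
        partition = split_jobs(remaining)
        # Keep every group that actually shares work between several jobs.
        groups.extend(g for g in partition if len(g) > 1)
        singles = [g[0] for g in partition if len(g) == 1]
        if len(singles) == len(remaining):
            # No progress in this round: run the leftovers on their own.
            groups.extend([job] for job in singles)
            break
        remaining = singles
    return groups
```

Passing the grouping step in as a parameter keeps the sketch independent of any particular cost model.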
  • 27. Sharing intermediate data – cost-based optimization. [Figure: Read, Sort and Copy cost bars for jobs J1, J2 and J3 run separately, compared with the merged job J1+J2+J3; annotations mark savings, potential savings, and potential costs or savings.] The sorting and copying costs change depending on the size of the intermediate data. We need to estimate the size of the intermediate data of all combinations of jobs (J1, J2, J3), and maintaining such statistics has a prohibitive cost.
  • 30. Evaluation setup: 40 EC2 small-instance virtual machines; modified Hadoop engine; 30 GB text dataset consisting of blogs; multiple grep-wordcount queries (each counts words matching a regular expression, allows for variable intermediate data sizes, and is a generic aggregation Map Reduce job).
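A grep-wordcount query, as described in the setup, can be sketched as a map step that emits only words matching a regular expression and a reduce step that sums the counts. The function names and the choice to match whole whitespace-separated tokens with `re.fullmatch` are assumptions made for this illustration; the talk does not specify them.

```python
import re
from collections import Counter

def grep_map(line, regex):
    """Emit (word, 1) for every whitespace-separated word that fully matches the regex."""
    for word in line.split():
        if regex.fullmatch(word):
            yield word, 1

def grep_wordcount(lines, pattern):
    """Run the map step over all lines and sum the counts per word (the reduce step)."""
    regex = re.compile(pattern)
    counts = Counter()
    for line in lines:
        for word, one in grep_map(line, regex):
            counts[word] += one
    return dict(counts)
```

A more selective pattern lets fewer words through the map step, which is how such a query can vary the size of the intermediate data.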
  • 31. Evaluation goals: (1) Sharing is not always beneficial: the ‘GreedyShare’ policy. (2) How much can we save on sharing scans? MRShare MultiSplitJobs evaluation. (3) How much can we save on sharing intermediate data? MRShare γ-MultiSplitJobs evaluation.
  • 32. Is sharing always beneficial? The ‘GreedyShare’ policy
  • 33. How much do we save on sharing scans? MRShare MultiSplitJobs
  • 34. How much do we save on sharing intermediate data? MRShare γ-MultiSplitJobs
  • 35. Summary. We introduced MRShare, a framework for automatic work sharing in Map Reduce. We identified sharing primitives and demonstrated their implementation in a Map Reduce engine. We established a cost model and solved several work sharing optimization problems. We demonstrated vast savings when using MRShare.
  • 37. Ongoing work – sharing expensive computation. Sharing across multiple Map Reduce jobs with expensive predicates. [Diagram: Input, then a Meta-map wrapping Map 1, Map 2, Map 3 and Map 4]
  • 38. Ongoing work – dynamic sharing. [Diagram: progress over time of jobs J1 and J2 run separately and of the merged job J1+J2]

Editor's Notes

  1. Talk about different possibilities of arranging jobs, and the question which one is the optimal one.