MapReduce in the Clouds for Science
  Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox
  {tgunarat, taklwu, xqiu, gcf}@indiana.edu
  CloudCom 2010, Nov 30 – Dec 3, 2010
Introduction
  • Cloud computing combined with cloud infrastructure services → a very viable alternative for scientists
  • MapReduce frameworks
      • Scalability
      • Excellent fault tolerance features
      • Ease of use
  • Several options for using MapReduce in cloud environments
      • MapReduce as a service
      • Setting up a MapReduce cluster on cloud instances
      • Specialized cloud MapReduce runtimes → take advantage of cloud infrastructure services
Introduction
  • Analyze the performance and viability of performing two types of bioinformatics computations using MapReduce in cloud environments
      • Sequence alignment
      • Sequence assembly
  • AzureMapReduce
      • Provides a decentralized, on-demand MapReduce framework
      • Leverages the high-latency, eventually consistent, yet highly scalable Azure infrastructure services
  • Sustained performance of clouds
Platforms
  • Apache Hadoop
      • On bare metal
      • On EC2
  • Amazon Web Services Elastic MapReduce
  • Microsoft Azure: AzureMapReduce
Challenges for MapReduce in the clouds
  • Data storage
  • Reliability
  • Master node
  • Metadata storage
  • Performance consistency
  • Communication consistency and scalability
  • CPU performance
  • → Choosing suitable instance types
  • Logging
AzureMapReduce
  • Built using Azure cloud services
  • Distributed, highly scalable & highly available services
  • Minimal management / maintenance overhead
  • Reduced footprint
  • Co-exists with the eventual consistency & high latency of cloud services
  • Decentralized control
AzureMapReduce Features
  • Ability to dynamically scale up/down
  • Familiar programming model
  • Fault tolerance
  • Easy testing and deployment
  • Combiner step
  • Web-based monitoring console
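To make the "familiar programming model" and "combiner step" items concrete, here is a minimal in-memory sketch of a map, combine, reduce word count. It is not AzureMapReduce's API: the function names (`map_fn`, `combine_fn`, `reduce_fn`, `run_job`) and the list-of-lines input format are assumptions made for illustration only.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def combine_fn(word, counts):
    """Combiner: pre-aggregate counts locally, before any data is shuffled."""
    yield word, sum(counts)


def reduce_fn(word, counts):
    """Reduce: sum the partial counts produced by all map tasks."""
    yield word, sum(counts)


def group_by_key(pairs):
    """Group (key, value) pairs into {key: [values]}."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def run_job(splits):
    """Run the job in memory; each split stands in for one map task (hypothetical)."""
    intermediate = []
    for split in splits:
        mapped = [kv for i, line in enumerate(split) for kv in map_fn(i, line)]
        # The combiner runs on the output of a single map task only.
        for key, values in group_by_key(mapped).items():
            intermediate.extend(combine_fn(key, values))
    result = {}
    for key, values in group_by_key(intermediate).items():
        for out_key, total in reduce_fn(key, values):
            result[out_key] = total
    return result
```

Because the combiner collapses repeated keys inside each map task, less intermediate data has to travel between the map and reduce stages, which matters when that transfer goes through high-latency cloud storage.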
AzureMapReduce Architecture [slides 8-14: architecture diagrams, not captured in the text]
AzureMapReduce Architecture
  • The Sort & Reduce phases of a reduce task start when
      • all the map tasks are finished, and
      • the reduce task has finished downloading all the intermediate data products
  • There is no guarantee of when all the intermediate data will appear in the Task tables
  • Each map task stores the number of reduce data products it generated for each reduce task
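The start condition above can be sketched as a small readiness check. This is an illustrative sketch rather than the AzureMapReduce implementation: the record layout (`status`, `products_per_reducer`) and the function names are hypothetical stand-ins for whatever the Task tables actually hold.

```python
def expected_products(map_task_records, reduce_id):
    """Total number of intermediate products all map tasks produced for one reducer."""
    return sum(
        record["products_per_reducer"].get(reduce_id, 0)
        for record in map_task_records
    )


def can_start_reduce(map_task_records, reduce_id, downloaded_count):
    """Return True when the Sort & Reduce phases may begin for reduce_id.

    Condition 1: every map task is finished.
    Condition 2: the reducer has downloaded as many products as the map tasks
    say they generated for it. The stored counts are what make this check safe
    when table entries become visible late (eventual consistency).
    """
    all_maps_done = all(r["status"] == "finished" for r in map_task_records)
    if not all_maps_done:
        return False
    return downloaded_count >= expected_products(map_task_records, reduce_id)
```

Comparing against the counts stored by the map tasks, rather than against whatever entries happen to be visible at the moment, lets a reducer distinguish "nothing more is coming" from "more data exists but has not appeared yet".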
Performance
  • Parallel efficiency
  • AzureMapReduce: Azure small instances, single core (1.7 GB memory)
  • Hadoop bare metal: IBM iDataPlex cluster, two quad-core CPUs (Xeon 2.33 GHz), 16 GB memory, Gigabit Ethernet per node
  • EMR & Hadoop on EC2
      • Cap3: High-CPU Extra Large instances (8 cores, 20 CU, 7 GB memory per instance)
      • SWG: Extra Large instances (4 cores, 8 CU, 15 GB memory per instance)
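The slide names parallel efficiency without defining it. A minimal sketch, assuming the standard definition E = T(1) / (p × T(p)), where T(1) is the best sequential running time, T(p) the running time on p cores, and p the number of cores; the function name and the sample numbers are illustrative, not measurements from the talk.

```python
def parallel_efficiency(t_sequential, t_parallel, cores):
    """Parallel efficiency E = T(1) / (p * T(p)).

    t_sequential: running time of the best sequential run, T(1)
    t_parallel:   running time on `cores` cores, T(p)
    cores:        number of cores used, p
    """
    if cores <= 0 or t_parallel <= 0:
        raise ValueError("cores and t_parallel must be positive")
    return t_sequential / (cores * t_parallel)


# Illustrative numbers only: a 1000 s sequential job that takes 10 s on 128 cores.
efficiency = parallel_efficiency(1000.0, 10.0, 128)
```

An efficiency of 1.0 means perfect linear speedup; values below 1.0 show the share of the allocated cores lost to overheads such as scheduling, data transfer and load imbalance.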
Sequence Alignment
  • Smith-Waterman-Gotoh (SWG) to calculate all-pairs dissimilarity
  • [Diagram labels: OutFile1, OutFile2, OutFile3, OutFile4]
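For readers unfamiliar with the computation, the sketch below scores a local alignment with affine gap penalties (the Smith-Waterman recurrence with Gotoh's gap handling) and fills a symmetric all-pairs matrix. The scoring parameters, the gap convention (`gap_open` is the cost of the first gap position, `gap_extend` of each further one) and the score-to-dissimilarity normalization are assumptions chosen for illustration; the talk does not state which ones were used.

```python
def swg_score(a, b, match=2, mismatch=-3, gap_open=3, gap_extend=1):
    """Best local alignment score of a and b with affine gap penalties."""
    neg_inf = float("-inf")
    cols = len(b) + 1
    h_prev = [0] * cols          # H: best score of an alignment ending at (i-1, j)
    f_prev = [neg_inf] * cols    # F: best score ending in a vertical gap
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * cols
        f_cur = [neg_inf] * cols
        e = neg_inf              # E: best score ending in a horizontal gap
        for j in range(1, cols):
            e = max(e - gap_extend, h_cur[j - 1] - gap_open)
            f_cur[j] = max(f_prev[j] - gap_extend, h_prev[j] - gap_open)
            sub = match if a[i - 1] == b[j - 1] else mismatch
            h_cur[j] = max(0, h_prev[j - 1] + sub, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


def dissimilarity(a, b, **params):
    """Hypothetical normalization: 1 - score / smaller self-alignment score."""
    denom = min(swg_score(a, a, **params), swg_score(b, b, **params))
    if denom == 0:
        return 1.0
    return 1.0 - swg_score(a, b, **params) / denom


def all_pairs(seqs, **params):
    """Fill the full matrix while computing only the upper triangle."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = dissimilarity(seqs[i], seqs[j], **params)
    return d
```

Since the matrix is symmetric, only half of the pairs need computing, and each pair is independent of the others, so the work divides naturally among map tasks.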
Sequence Alignment Performance
Sequence Assembly
  • Assemble sequences using Cap3
  • Pleasingly parallel
  • Map only
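A map-only, pleasingly parallel job amounts to running the same program on every input file with no reduce step. The sketch below shows that shape on a single machine; it assumes a `cap3` executable on the PATH that takes a FASTA file as its first argument, and the output naming (`.cap3.out`) and the helper names are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_cap3(fasta_path, cap3_binary="cap3"):
    """One map task: assemble a single FASTA file and capture stdout to a file."""
    out_path = Path(str(fasta_path) + ".cap3.out")  # hypothetical naming scheme
    with open(out_path, "w") as out:
        subprocess.run([cap3_binary, str(fasta_path)], stdout=out, check=True)
    return out_path


def map_only(inputs, map_task=run_cap3, workers=4):
    """Apply map_task to every input independently; there is no reduce phase."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(map_task, inputs))
```

Threads suffice here because each task spends its time waiting on an external process; in a cloud deployment the same per-file function would instead be handed to separate worker instances.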
Sequence Assembly Performance
Sustained performance of clouds
Conclusion
  • MapReduce in cloud infrastructures provides an easy-to-use, economical option for performing loosely coupled scientific computations.
  • Cloud infrastructure services can successfully be leveraged to build distributed parallel systems with acceptable performance and consistency.
  • For non-I/O-intensive workloads, cloud performance was sustained well.
Thanks http://salsahpc.indiana.edu/azuremapreduce/
Acknowledgements
  • All the SALSA group members, for their support.
  • Microsoft, for their technical support on Azure.
  • This work was made possible by the compute use grant provided by Amazon Web Services, titled "Proof of concepts linking FutureGrid users to AWS".
  • This work is partially funded by the Microsoft "CRMC" grant and NIH Grant Number RC2HG005806-02.

Map Reduce in the Clouds (http://salsahpc.indiana.edu/mapreduceroles4azure/)


Editor's Notes

  1. The utility computing model introduced by cloud computing, combined with the rich set of cloud infrastructure services, offers a very viable alternative to traditional servers and computing clusters. The MapReduce distributed data processing architecture has become the weapon of choice for data-intensive analyses in the clouds and in commodity clusters due to its excellent fault tolerance features, scalability and ease of use. Currently, there are several options for using MapReduce in cloud environments, such as using MapReduce as a service, setting up one’s own MapReduce cluster on cloud instances, or using specialized cloud MapReduce runtimes that take advantage of cloud infrastructure services. In this paper, we introduce AzureMapReduce, a novel MapReduce runtime built using the Microsoft Azure cloud infrastructure services. The AzureMapReduce architecture successfully leverages the high latency, eventually consistent, yet highly scalable Azure infrastructure services to provide an efficient, on-demand alternative to traditional MapReduce clusters. Further, we evaluate the use and performance of MapReduce frameworks, including AzureMapReduce, in cloud environments for scientific applications, using sequence assembly and sequence alignment as use cases.
  2. Data storage: Clouds typically provide a variety of storage options, such as off-instance cloud storage (e.g., Amazon S3), mountable off-instance block storage (e.g., Amazon EBS) as well as virtualized instance storage (persistent for the lifetime of the instance), which can be used to set up a file system similar to HDFS [13]. The choice of the storage best suited to the particular MapReduce deployment plays a crucial role, as the performance of data-intensive applications relies heavily on the storage location and on the storage bandwidth.
     Metadata storage: MapReduce frameworks need to maintain metadata to manage the jobs as well as the infrastructure. This metadata needs to be stored reliably, with good scalability and accessibility, to avoid single points of failure and performance bottlenecks in the MapReduce computation.
     Communication consistency and scalability: Cloud infrastructures are known to exhibit inter-node I/O performance fluctuations (due to shared network, unknown topology), which affect the intermediate data transfer performance of MapReduce applications.
     Performance consistency (sustained performance): Clouds are implemented as shared infrastructures operating using virtual machines. It’s possible for the performance to fluctuate based on the load of the underlying infrastructure services as well as on the load from other users on the shared physical node which hosts the virtual machine (see Section VII).
     Reliability (node failures): Node failures are to be expected whenever large numbers of nodes are utilized for computations, but they become more prevalent when virtual instances are running on top of non-dedicated hardware. While MapReduce frameworks can recover jobs from worker node failures, failures of master nodes (nodes which store metadata, handle the job scheduling queue, etc.) can be disastrous.
     Choosing a suitable instance type: Clouds offer users several types of instance options, with different configurations and price points (see Sections B and D). It’s important to select the best matching instance type for a particular MapReduce job, in terms of both performance and monetary cost.
     Logging: Cloud instance storage is preserved only for the lifetime of the instance. Hence, information logged to the instance storage would be lost after the instance terminates. This can be crucial if one needs to process the logs afterwards, for example to identify a software-caused instance failure. On the other hand, performing excessive logging to a bandwidth-limited off-instance storage location can become a performance bottleneck for the MapReduce computation.
  3. Client driver loads the map & reduce tasks to queues in parallel using TPL. Create the task monitoring table. Standalone client or a web client. Can wait for completion. Explain the advantages of using Azure queues. Explain the advantages of using Azure tables. Scalability. Ease of use. No maintenance overhead. No need to install a DB. Easily visualized using a web role.
  4. Map & Reduce workers pick up map tasks from the queue
  5. Map workers download data from Blob storage and start processing, then update the status in the task monitoring table. Advantages of blob storage. Custom input/output formats & keys.
  6. Finished Map tasks upload result data sets to Azure Storage and then add entries to the respective reduce task tables, then update the status. Get the next task from the queue and start processing it. Custom part
  7. Reduce tasks notice the intermediate data product metadata in the reduce task tables and start downloading the products, then update the reduce task tables. This happens while the map workers are already processing the next set of map tasks.
  8. Reduce tasks start reducing when all the map tasks are finished and the respective reduce tasks have finished downloading the intermediate data products. Custom output formats.
  9. Global barrier…
  10. iDataPlex: two quad-core CPUs (Intel Xeon CPU E5410 2.33GHz), 16 GB memory, Gigabit Ethernet network interface.
  11. Use block decomposition. Lower triangle only, using a load balancing algorithm. Each row block is collected by reducers. Relatively small amount of input data, but large intermediate and output data.
  12. ~123 million sequence alignments for under $30, with zero upfront hardware cost.
  13. SWG - In these tests, 32 cores were used to align 4000 sequences. Standard deviations of 1.56% for EMR and 2.25% for AzureMapReduce.
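
The block decomposition mentioned in note 11 can be illustrated with a short sketch. It assumes only that the all-pairs dissimilarity matrix is symmetric, so blocks on or below the diagonal are enough; the function names `lower_triangle_blocks` and `group_by_row_block` are hypothetical, and the simple enumeration below stands in for the load balancing algorithm, which the notes do not describe.

```python
def lower_triangle_blocks(num_sequences, block_size):
    """Yield (row_block, col_block, rows, cols) for blocks on or below the diagonal."""
    num_blocks = -(-num_sequences // block_size)  # ceiling division
    for rb in range(num_blocks):
        rows = range(rb * block_size, min((rb + 1) * block_size, num_sequences))
        for cb in range(rb + 1):  # col_block <= row_block: lower triangle only
            cols = range(cb * block_size, min((cb + 1) * block_size, num_sequences))
            yield rb, cb, rows, cols


def group_by_row_block(blocks):
    """Group computed blocks by row block, as a reducer collecting one row block might."""
    grouped = {}
    for rb, cb, _rows, _cols in blocks:
        grouped.setdefault(rb, []).append(cb)
    return grouped


# Example: 10 sequences in blocks of 4 give a 3 x 3 block grid,
# of which only the 6 blocks on or below the diagonal are computed.
blocks = list(lower_triangle_blocks(num_sequences=10, block_size=4))
rows_per_reducer = group_by_row_block(blocks)
```

Each yielded block would correspond to one map task computing the dissimilarities for its `rows` x `cols` pairs. In this sketch the grouping only lists the computed lower-triangle blocks; how the transposed blocks are routed to fill out each full row is not covered here.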