SlideShare uma empresa Scribd logo
1 de 24
MapReduce in the Clouds for Science ThilinaGunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox {tgunarat, taklwu, xqiu,gcf}@indiana.edu CloudCom 2010 Nov 30 – Dec 3, 2010
Introduction Cloud computing combined with cloud infrastructure services  A very viable alternative for scientists MapReduce frameworks  Scalability Excellent fault tolerance features Ease of use.  Several options for using MapReduce in cloud environments MapReduceas a service Setting up MapReducecluster on cloud instances Specialized cloud MapReduce runtimes  Take advantage of cloud infrastructure services.
Introduction Analyze the performance and viability of performing two types of bioinformatics computations using MapReduce in cloud environments Sequence alignment Sequence assembly AzureMapReduce Provide an decentralized, on demand MapReduce framework Leverages the high latency, eventually consistent, yet highly scalable Azure infrastructure services  Sustained performance of clouds
Platforms Apache Hadoop On BareMetal On EC2 Amazon Web Services Elastic MapReduce Microsoft Azure AzureMapReduce
Challenges for MapReduce in the clouds Data storage Reliability Master node Metadata storage Performance consistency Communication consistency and scalability CPU performance  Choosing suitable instance types Logging
AzureMapReduce Built on using Azure cloud services Distributed, highly scalable & highly available services Minimal management / maintenance overhead Reduced footprint Co-exist with eventual consistency & high latency of cloud services Decentralized control
AzureMapReduce Features Ability to dynamically scale up/down Familiar programming model Fault Tolerance Easy testing and deployment  Combiner step Web based monitoring console
AzureMapReduce Architecture
AzureMapReduce Architecture
AzureMapReduce Architecture
AzureMapReduce Architecture
AzureMapReduce Architecture
AzureMapReduce Architecture
AzureMapReduce Architecture
AzureMapReduce Architecture Starting the Sort & Reduce phases,  When all the map tasks are finished & When a reduce task is finished downloading all the intermediate data products No guarantee when all the intermediate data will appear in Task tables Map Tasks store the number of reduce data products it generated for each reduce task
Performance Parallel efficiency AzureMapReduce Azure small instances – Single Core (1.7 GB memory) Hadoop Bare Metal -IBM iDataplex cluster Two quad-core CPUs (Xeon 2.33GHz),16 GB memory, Gigabit Ethernet per node  EMR & Hadoop on EC2 Cap3 – HighCPU Extra Large instances (8 Cores, 20 CU, 7GB memory per instance) SWG – Extra Large Instances (4 Cores, 8 CU, 15GB memory per instance)
Sequence Alignment Smith-Waterman-GOTOH to calculate all-pairs dissimilarity OutFile1 OutFile2 OutFile3 OutFile4
Sequence Alignment Performance
Seqeunce Assembly Assemble sequences using Cap3 Pleasingly parallel Map Only
Sequence Assembly Performance
Sustained performance of clouds
Conclusion MapReduce in the cloud infrastructures provides an easy to use, economical option to perform loosely coupled scientific computations. Cloud infrastructure services can successfully be leveraged built distributed parallel systems with acceptable performance and consistency. For non-IO intensive workloads, cloud performance sustained well.
Thanks http://salsahpc.indiana.edu/azuremapreduce/
Acknowledgements All the SALSA group members for their support Microsoft for their technical support on Azure.  This work was made possible using the compute use grant provided by Amazon Web Service which is titled "Proof of concepts linking FutureGrid users to AWS". This work is partially funded by Microsoft "CRMC" grant and NIH Grant Number RC2HG005806-02.

Mais conteúdo relacionado

Mais procurados

BUDW: Energy-Efficient Parallel Storage Systems with Write-Buffer Disks
BUDW: Energy-Efficient Parallel Storage Systems with Write-Buffer DisksBUDW: Energy-Efficient Parallel Storage Systems with Write-Buffer Disks
BUDW: Energy-Efficient Parallel Storage Systems with Write-Buffer Disks
Xiao Qin
 
06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering
Subhas Kumar Ghosh
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
Xiao Qin
 
Clustering (from Google)
Clustering (from Google)Clustering (from Google)
Clustering (from Google)
Sri Prasanna
 

Mais procurados (20)

BUDW: Energy-Efficient Parallel Storage Systems with Write-Buffer Disks
BUDW: Energy-Efficient Parallel Storage Systems with Write-Buffer DisksBUDW: Energy-Efficient Parallel Storage Systems with Write-Buffer Disks
BUDW: Energy-Efficient Parallel Storage Systems with Write-Buffer Disks
 
cuSTINGER: Supporting Dynamic Graph Aigorithms for GPUs (NOTES)
cuSTINGER: Supporting Dynamic Graph Aigorithms for GPUs (NOTES)cuSTINGER: Supporting Dynamic Graph Aigorithms for GPUs (NOTES)
cuSTINGER: Supporting Dynamic Graph Aigorithms for GPUs (NOTES)
 
cuSTINGER: Supporting Dynamic Graph Aigorithms for GPUs : NOTES
cuSTINGER: Supporting Dynamic Graph Aigorithms for GPUs : NOTEScuSTINGER: Supporting Dynamic Graph Aigorithms for GPUs : NOTES
cuSTINGER: Supporting Dynamic Graph Aigorithms for GPUs : NOTES
 
A New Approach for Parallel Region Growing Algorithm in Image Segmentation u...
A New Approach for Parallel Region Growing Algorithm in Image Segmentation u...A New Approach for Parallel Region Growing Algorithm in Image Segmentation u...
A New Approach for Parallel Region Growing Algorithm in Image Segmentation u...
 
HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010
 
06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering
 
Working together with SURF Raymond Oonk Annette Langedijk SURF
Working together with SURF Raymond Oonk Annette Langedijk SURFWorking together with SURF Raymond Oonk Annette Langedijk SURF
Working together with SURF Raymond Oonk Annette Langedijk SURF
 
post119s1-file3
post119s1-file3post119s1-file3
post119s1-file3
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
 
2015 cloud sim projects
2015 cloud sim projects2015 cloud sim projects
2015 cloud sim projects
 
energy efficient resource management in virtualised datacenters
energy efficient resource management in virtualised datacentersenergy efficient resource management in virtualised datacenters
energy efficient resource management in virtualised datacenters
 
Scaling Deep Learning Models for Large Spatial Time-Series Forecasting
Scaling Deep Learning Models for Large Spatial Time-Series ForecastingScaling Deep Learning Models for Large Spatial Time-Series Forecasting
Scaling Deep Learning Models for Large Spatial Time-Series Forecasting
 
Hello cloud 3
Hello  cloud 3Hello  cloud 3
Hello cloud 3
 
Distributed, concurrent, and independent access to encrypted cloud databases
Distributed, concurrent, and independent access to encrypted cloud databasesDistributed, concurrent, and independent access to encrypted cloud databases
Distributed, concurrent, and independent access to encrypted cloud databases
 
IEEE CLOUD \'11
IEEE CLOUD \'11IEEE CLOUD \'11
IEEE CLOUD \'11
 
Making Elasticity Testing of Cloud-Based Systems Reproducible
Making Elasticity Testing of Cloud-Based Systems ReproducibleMaking Elasticity Testing of Cloud-Based Systems Reproducible
Making Elasticity Testing of Cloud-Based Systems Reproducible
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 
SkyhookDM - Towards an Arrow-Native Storage System
SkyhookDM - Towards an Arrow-Native Storage SystemSkyhookDM - Towards an Arrow-Native Storage System
SkyhookDM - Towards an Arrow-Native Storage System
 
Clustering (from Google)
Clustering (from Google)Clustering (from Google)
Clustering (from Google)
 
Super Computer
Super ComputerSuper Computer
Super Computer
 

Semelhante a Map Reduce in the Clouds (http://salsahpc.indiana.edu/mapreduceroles4azure/)

HPC with Clouds and Cloud Technologies
HPC with Clouds and Cloud TechnologiesHPC with Clouds and Cloud Technologies
HPC with Clouds and Cloud Technologies
Inderjeet Singh
 
Cloud computing skepticism - But i'm sure
Cloud computing skepticism - But i'm sureCloud computing skepticism - But i'm sure
Cloud computing skepticism - But i'm sure
Nguyen Duong
 
Qiu bosc2010
Qiu bosc2010Qiu bosc2010
Qiu bosc2010
BOSC 2010
 
Cloud Based Datacenter Network Acceleration Using FPGA for Data-Offloading
Cloud Based Datacenter Network Acceleration Using FPGA for Data-Offloading Cloud Based Datacenter Network Acceleration Using FPGA for Data-Offloading
Cloud Based Datacenter Network Acceleration Using FPGA for Data-Offloading
Onyebuchi nosiri
 
MapReduce: Distributed Computing for Machine Learning
MapReduce: Distributed Computing for Machine LearningMapReduce: Distributed Computing for Machine Learning
MapReduce: Distributed Computing for Machine Learning
butest
 

Semelhante a Map Reduce in the Clouds (http://salsahpc.indiana.edu/mapreduceroles4azure/) (20)

D017212027
D017212027D017212027
D017212027
 
A Novel Approach for Workload Optimization and Improving Security in Cloud Co...
A Novel Approach for Workload Optimization and Improving Security in Cloud Co...A Novel Approach for Workload Optimization and Improving Security in Cloud Co...
A Novel Approach for Workload Optimization and Improving Security in Cloud Co...
 
Everything comes in 3's
Everything comes in 3'sEverything comes in 3's
Everything comes in 3's
 
Azure and cloud design patterns
Azure and cloud design patternsAzure and cloud design patterns
Azure and cloud design patterns
 
HPC with Clouds and Cloud Technologies
HPC with Clouds and Cloud TechnologiesHPC with Clouds and Cloud Technologies
HPC with Clouds and Cloud Technologies
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersA Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
 
Improved Utilization of Infrastructure of Clouds by using Upgraded Functional...
Improved Utilization of Infrastructure of Clouds by using Upgraded Functional...Improved Utilization of Infrastructure of Clouds by using Upgraded Functional...
Improved Utilization of Infrastructure of Clouds by using Upgraded Functional...
 
An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)
An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)
An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)
 
Cloud computing skepticism - But i'm sure
Cloud computing skepticism - But i'm sureCloud computing skepticism - But i'm sure
Cloud computing skepticism - But i'm sure
 
Paper444012-4014
Paper444012-4014Paper444012-4014
Paper444012-4014
 
Eg4301808811
Eg4301808811Eg4301808811
Eg4301808811
 
Qiu bosc2010
Qiu bosc2010Qiu bosc2010
Qiu bosc2010
 
GRID COMPUTING
GRID COMPUTINGGRID COMPUTING
GRID COMPUTING
 
International Journal of Engineering Inventions (IJEI)
International Journal of Engineering Inventions (IJEI)International Journal of Engineering Inventions (IJEI)
International Journal of Engineering Inventions (IJEI)
 
Cloud Based Datacenter Network Acceleration Using FPGA for Data-Offloading
Cloud Based Datacenter Network Acceleration Using FPGA for Data-Offloading Cloud Based Datacenter Network Acceleration Using FPGA for Data-Offloading
Cloud Based Datacenter Network Acceleration Using FPGA for Data-Offloading
 
5 1-33-1-10-20161221 kennedy
5 1-33-1-10-20161221 kennedy5 1-33-1-10-20161221 kennedy
5 1-33-1-10-20161221 kennedy
 
MapReduce: Distributed Computing for Machine Learning
MapReduce: Distributed Computing for Machine LearningMapReduce: Distributed Computing for Machine Learning
MapReduce: Distributed Computing for Machine Learning
 
Architecture and Performance of Runtime Environments for Data Intensive Scala...
Architecture and Performance of Runtime Environments for Data Intensive Scala...Architecture and Performance of Runtime Environments for Data Intensive Scala...
Architecture and Performance of Runtime Environments for Data Intensive Scala...
 
Scalable analytics for iaas cloud availability
Scalable analytics for iaas cloud availabilityScalable analytics for iaas cloud availability
Scalable analytics for iaas cloud availability
 

Último

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Último (20)

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 

Map Reduce in the Clouds (http://salsahpc.indiana.edu/mapreduceroles4azure/)

  • 1. MapReduce in the Clouds for Science ThilinaGunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox {tgunarat, taklwu, xqiu,gcf}@indiana.edu CloudCom 2010 Nov 30 – Dec 3, 2010
  • 2. Introduction Cloud computing combined with cloud infrastructure services A very viable alternative for scientists MapReduce frameworks Scalability Excellent fault tolerance features Ease of use. Several options for using MapReduce in cloud environments MapReduceas a service Setting up MapReducecluster on cloud instances Specialized cloud MapReduce runtimes Take advantage of cloud infrastructure services.
  • 3. Introduction Analyze the performance and viability of performing two types of bioinformatics computations using MapReduce in cloud environments Sequence alignment Sequence assembly AzureMapReduce Provide an decentralized, on demand MapReduce framework Leverages the high latency, eventually consistent, yet highly scalable Azure infrastructure services Sustained performance of clouds
  • 4. Platforms Apache Hadoop On BareMetal On EC2 Amazon Web Services Elastic MapReduce Microsoft Azure AzureMapReduce
  • 5. Challenges for MapReduce in the clouds Data storage Reliability Master node Metadata storage Performance consistency Communication consistency and scalability CPU performance Choosing suitable instance types Logging
  • 6. AzureMapReduce Built on using Azure cloud services Distributed, highly scalable & highly available services Minimal management / maintenance overhead Reduced footprint Co-exist with eventual consistency & high latency of cloud services Decentralized control
  • 7. AzureMapReduce Features Ability to dynamically scale up/down Familiar programming model Fault Tolerance Easy testing and deployment Combiner step Web based monitoring console
  • 15. AzureMapReduce Architecture Starting the Sort & Reduce phases, When all the map tasks are finished & When a reduce task is finished downloading all the intermediate data products No guarantee when all the intermediate data will appear in Task tables Map Tasks store the number of reduce data products it generated for each reduce task
  • 16. Performance Parallel efficiency AzureMapReduce Azure small instances – Single Core (1.7 GB memory) Hadoop Bare Metal -IBM iDataplex cluster Two quad-core CPUs (Xeon 2.33GHz),16 GB memory, Gigabit Ethernet per node EMR & Hadoop on EC2 Cap3 – HighCPU Extra Large instances (8 Cores, 20 CU, 7GB memory per instance) SWG – Extra Large Instances (4 Cores, 8 CU, 15GB memory per instance)
  • 17. Sequence Alignment Smith-Waterman-GOTOH to calculate all-pairs dissimilarity OutFile1 OutFile2 OutFile3 OutFile4
  • 19. Seqeunce Assembly Assemble sequences using Cap3 Pleasingly parallel Map Only
  • 22. Conclusion MapReduce in the cloud infrastructures provides an easy to use, economical option to perform loosely coupled scientific computations. Cloud infrastructure services can successfully be leveraged built distributed parallel systems with acceptable performance and consistency. For non-IO intensive workloads, cloud performance sustained well.
  • 24. Acknowledgements All the SALSA group members for their support Microsoft for their technical support on Azure. This work was made possible using the compute use grant provided by Amazon Web Service which is titled "Proof of concepts linking FutureGrid users to AWS". This work is partially funded by Microsoft "CRMC" grant and NIH Grant Number RC2HG005806-02.

Notas do Editor

  1. The utility computing model introduced by cloud computing combined with the rich set of cloud infrastructure services offers a very viable alternative to traditional servers and computing clusters. MapReduce distributed data processing architecture has become the weapon of choice for data-intensive analyses in the clouds and in commodity clusters due to its excellent fault tolerance features, scalability and the ease of use. Currently, there are several options for using MapReduce in cloud environments, such as using MapReduce as a service, setting up one’s own MapReduce cluster on cloud instances, or using specialized cloud MapReduce runtimes that take advantage of cloud infrastructure services. In this paper, we introduce AzureMapReduce, a novel MapReduce runtime built using the Microsoft Azure cloud infrastructure services. AzureMapReduce architecture successfully leverages the high latency, eventually consistent, yet highly scalable Azure infrastructure services to provide an efficient, on demand alternative to traditional MapReduce clusters. Further we evaluate the use and performance of MapReduce frameworks, including AzureMapReduce, in cloud environments for scientific applications using sequence assembly and sequence alignment as use cases.
  2. Data storage: Clouds typically provide a variety of storage options, such as off-instance cloud storage (e.g.: Amazon S3), mountable off-instance block storage (e.g.: Amazon EBS) as well as virtualized instance storage (persistent for the lifetime of the instance), which can be used to set up a file system similar to HDFS [13]. The choice of the storage best-suited to the particular MapReduce deployment plays a crucial role as the performance of data intensive applications rely a lot on the storage location and on the storage bandwidth.Metadata storage: MapReduce frameworks need to maintain metadata information to manage the jobs as well as the infrastructure. This metadata needs to be stored reliability ensuring good scalability and the accessibility to avoid single point of failures and performance bottlenecks to the MapReduce computation.Communication consistency and scalability: Cloud infrastructures are known to exhibit inter-node I/O performance fluctuations (due to shared network, unknown topology), which affect the intermediate data transfer performance of MapReduce applications.Performance consistency (sustained performance): Clouds are implemented as shared infrastructures operating using virtual machines. It’s possible for the performance to fluctuate based the load of the underlying infrastructure services as well as based on the load from other users on the shared physical node which hosts the virtual machine (see Section VII).Reliability (Node failures): Node failures are to be expected whenever large numbers of nodes are utilized for computations. But they become more prevalent when virtual instances are running on top of non-dedicated hardware. While MapReduce frameworks can recover jobs from worker node failures, master node (nodes which store meta-data, which handle job scheduling queue, etc) failures can become disastrous.Choosing a suitable instance type: Clouds offer users several types of instance options, with different configurations and price points (See Sections B and D). It’s important to select the best matching instance type, both in terms of performance as well as monetary wise, for a particular MapReduce job.Logging: Cloud instance storage is preserved only for the lifetime of the instance. Hence, information logged to the instance storage would be lost after the instance termination. This can be crucial if one needs to process the logs afterwards, for an example to identify a software-caused instance failure. On the other hand, performing excessive logging to a bandwidth limited off-instance storage location can become a performance bottleneck for the MapReduce computation.
  3. Client driver loads the map & reduce tasks to queues in parallel using TPL..Create the task monitoring table. Standalone client or a web client. Can wait for completion.Explain the advantages of using Azure queues.Explain the advantages of using Azure table.. Scalability. Ease of use.. No maintenance overhead. No need to install DB. Easily visualize using a webrole.
  4. Map & Reduce workers pick up map tasks from the queue
  5. Map workers download data from Blob storage and start processing- – update the status in the task monitoring table.Advantages of blob storage.Custom input/output formats & keys..
  6. Finished Map tasks upload result data sets to Azure Storage and then add entries for the respective reduce task tables. – update the status. Get the next task from the queue and start processing it.Custom part
  7. Reduce tasks notice the intermediate data product meta-data in reduce task tables and start downloading them -> update the reduce task tablesThis happens when the map tasks are actually processing the next set of map tasks..
  8. Reduce tasks start reducing, when all the map tasks are finished and when the respective reduce tasks are finish downloading the intermediate data products.Custom output formats
  9. Global barrier…
  10. Idataplex - Two quad-core CPUs (Intel Xeon CPU E5410 2.33GHz) 16 GB memory, Gigabit Ethernet network interface
  11. Use block decompositionLower triangle only using load balancing algorithmEach row block is collected by reducers.Relatively small amount of input data, but large intermediate and output data.
  12. ~123 million sequence alignments, for under 30$ with zero up front hardware cost,
  13. SWG - In these tests, 32 cores were used to align 4000 sequences. Standard deviations of 1.56% for EMR and 2.25% for AzureMapReduce.