Deadline Queries: Leveraging the Cloud to Produce On-Time Results Authors: David Ribeiro Alves, Pedro Bizarro, Paulo Marques
In a nutshell Cluster computing is widely used to solve “Big Data” problems. Users rely on programming abstractions such as MapReduce to express the computation, but are left with some difficult questions: how many nodes? how long will it take? Proposed solution: users define a deadline; the cluster expands/contracts to meet it. 2 CLOUD '11
Introducing Deadline Queries Cluster computing tasks that complete within a deadline… …while minimizing cost/resource consumption. Independently of: Processing Capacity per Machine, Faults or Perturbations, Initial Number of Nodes, Data Size, Content or Skew, Computation Complexity. 3 CLOUD '11
Approaches in current systems 4 … make the task fit the cluster. CLOUD '11
Our Approach 5 … make the cluster fit the task. CLOUD '11
Architecture and Runtime 6 Ex: SELECT symbol, avg(value), avg(volume) FROM Stocks GROUP BY symbol FINISH IN 900 SEC [Architecture diagram: the Master Node receives the query, requests nodes from the IaaS Provider, modifies the cluster based on worker metrics, and assigns partitions Part. 1 … Part. n to Worker Nodes.] CLOUD '11
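As a rough illustration of the control loop implied by the diagram, the sketch below shows a master that queues partitions, polls worker metrics, and asks the IaaS provider to grow or shrink the cluster. The IaaS/worker interfaces and method names here are hypothetical; the slides do not show the system's actual APIs.

```python
import time

class DeadlineQueryMaster:
    """Minimal sketch of the master node from the architecture slide:
    it queues partitions for workers, polls their progress metrics, and
    resizes the cluster through a (hypothetical) IaaS provider interface."""

    def __init__(self, iaas, partitions, deadline_sec):
        self.iaas = iaas                    # assumed to expose add_nodes()/remove_nodes()
        self.partitions = list(partitions)  # Part. 1 … Part. n, fetched by workers
        self.deadline_sec = deadline_sec
        self.workers = []

    def run(self, initial_nodes, target_nodes_fn, poll_sec=30):
        self.workers = self.iaas.add_nodes(initial_nodes)
        start = time.time()
        while any(not w.done() for w in self.workers):
            metrics = [w.report_progress() for w in self.workers]
            elapsed = time.time() - start
            # target_nodes_fn encapsulates the progress estimation (slide 10).
            target = target_nodes_fn(metrics, elapsed, self.deadline_sec)
            delta = target - len(self.workers)
            if delta > 0:      # predicted to miss the deadline: expand
                self.workers += self.iaas.add_nodes(delta)
            elif delta < 0:    # predicted to finish early: contract to save cost
                self.workers = self.iaas.remove_nodes(self.workers, -delta)
            time.sleep(poll_sec)
```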
Stream Processing Continuous processing allows phases to start before previous phases complete. Continuous processing also allows progress metrics about the computation as a whole to be gathered continuously. SP provides continuous load balancing, which makes it possible to: take immediate advantage of arriving nodes; deal with temporary or permanent asymmetries; deal with data skew. SP fault tolerance allows faults to be handled quickly. CLOUD '11 7
MapReduce SELECT symbol, avg(value), avg(volume) FROM Stocks GROUP BY symbol FINISH IN 900 sec MapReduce Decomposition: Fetch & Transform → Map (Select/Project) → Group → Reduce (Aggregate) → Store Results. 8 CLOUD '11
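A toy illustration of this decomposition in plain Python; the in-memory `stocks` records and the `map_fn`/`reduce_fn` helpers are made up for the example and are not the system's actual operators:

```python
from collections import defaultdict

# Illustrative input: each record is (symbol, value, volume).
stocks = [("AAPL", 10.0, 100), ("GOOG", 20.0, 50), ("AAPL", 12.0, 80)]

# Map (Select/Project): emit the symbol with the fields the aggregate needs.
def map_fn(record):
    symbol, value, volume = record
    yield symbol, (value, volume, 1)

# Group: collect mapped tuples per key (done by the framework in MapReduce).
groups = defaultdict(list)
for record in stocks:
    for key, val in map_fn(record):
        groups[key].append(val)

# Reduce (Aggregate): compute avg(value) and avg(volume) per symbol.
def reduce_fn(key, values):
    total_value = sum(v for v, _, _ in values)
    total_volume = sum(v for _, v, _ in values)
    count = sum(c for _, _, c in values)
    return key, total_value / count, total_volume / count

results = [reduce_fn(k, vs) for k, vs in groups.items()]
print(results)  # Store Results
```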
Streaming MapReduce - Scaling Stream Processing => load balancing and fault tolerance in a changing cluster MapReduce => Simple, parallel, scalable programming and execution model 9 CLOUD '11
Progress estimation Consumed vs. remaining data + linear regression to estimate finish time. React accordingly by either expanding or contracting the cluster. 10 CLOUD '11
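A minimal sketch of the estimation step, assuming progress is reported as the fraction of input consumed at each point in time; the least-squares fit matches the slide's linear regression, while the proportional expand/contract rule is only an illustrative guess at the policy:

```python
def estimate_finish_time(samples):
    """Least-squares line through (elapsed_seconds, fraction_consumed) points;
    returns the elapsed time at which the line reaches fraction 1.0."""
    n = len(samples)
    sum_t = sum(t for t, _ in samples)
    sum_f = sum(f for _, f in samples)
    sum_tt = sum(t * t for t, _ in samples)
    sum_tf = sum(t * f for t, f in samples)
    slope = (n * sum_tf - sum_t * sum_f) / (n * sum_tt - sum_t ** 2)
    intercept = (sum_f - slope * sum_t) / n
    return (1.0 - intercept) / slope  # time when all data would be consumed

def decide_cluster_size(samples, deadline, current_nodes, max_nodes=27):
    """Expand when the predicted finish time overshoots the deadline,
    contract when it undershoots; the proportional rule is illustrative."""
    predicted = estimate_finish_time(samples)
    elapsed = samples[-1][0]
    remaining_budget = max(deadline - elapsed, 1e-9)
    remaining_work = max(predicted - elapsed, 0.0)
    needed = current_nodes * remaining_work / remaining_budget
    return min(max_nodes, max(1, round(needed)))

# Example: 300 s elapsed, only 25% of the data consumed, 900 s deadline, 4 nodes.
samples = [(60, 0.05), (120, 0.10), (180, 0.15), (240, 0.20), (300, 0.25)]
print(estimate_finish_time(samples))         # ~1200 s at the current rate
print(decide_cluster_size(samples, 900, 4))  # 6: suggests expanding the cluster
```

With the 900 s deadline of the running query, the example above predicts a finish around 1200 s at the current rate and therefore suggests growing the cluster from 4 to 6 nodes.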
Experimental Evaluation - Setup 11 Real-world environment experiments on top of Amazon EC2. Running Query: SELECT symbol, avg(value), avg(volume) FROM Stocks GROUP BY symbol FINISH IN 900 sec. Used between 1 and 27 machines (m1.large): 2x Dual-Core Xeon (2.66 GHz), 7.5 GB of RAM. Experiments show: predicted remaining time; number of nodes. CLOUD '11
Exp. 1 – Varying Initial Cluster Size 12 CLOUD '11
Exp. 2 – Varying Deadline 13 CLOUD '11
Exp. 3 – Introducing Perturbations 14 CLOUD '11
Conclusions Cloud computing, e.g., IaaS, allows new approaches to cluster computing and new optimization goals. Deadline Queries may help express computation provisioning requirements beyond the number of nodes. Deadline Queries are a viable alternative for implementing hard time limits on query execution. A real implementation and evaluation show the approach is feasible and works as expected. 15 CLOUD '11
16 Questions? CLOUD '11
Fault Tolerance 17 CLOUD ‘11


Editor's Notes

  1. ----- Meeting Notes (10/20/10 14:48) ----- General notes: be sharper. Focus on the audience. I never say how I will evaluate the system. The Gantt chart is too small.
  2. In particular I'd like to refer to two practical cases: the 1st is that of a Portuguese bank that must finish processing 10M transactions and produce the respective reports by morning, but has no idea how much machine power it needs to do so. The 2nd is that of a Portuguese telecom company that is building the largest Portuguese private cloud, but still has trouble allocating nodes to tasks to guarantee they complete in time.
  3. Create an animation in a slide or two that describes how the problem was previously dealt with and our solution; introduce the running example here. Story of the slide: start processing documents (start moving the doc arrow to the cluster); when the system predicts the deadline will be missed (clock turns red)… it starts discarding data or reducing accuracy (put documents in the trash). Mention orally that other systems discard data, but don't dwell on this too long; mention that in many cases data cannot simply be thrown away (previous examples).
  4. Story of the slide: start processing documents (start moving the doc arrow to the cluster); when we see the deadline will be missed (clock turns red)… start expanding the resources used.
  5. Mention that we adopt streaming MapReduce so we are able to deal with changes in the cluster.
  6. Transform the task into a dataflow and split the data into partitions. Request nodes and assign dataflow parts to them. Nodes fetch partitions from a queue and insert them into the dataflow. Nodes send report updates to the master, which decides whether more nodes are needed and, if so… **CLICK** new nodes are added to the computation. The fact that we use streaming MapReduce allows us to: deal with data skew, by using streaming routing techniques; deal with faults relatively quickly.
  7. Transform the task into a dataflow and split the data into partitions. Request nodes and assign dataflow parts to them. Nodes fetch partitions from a queue and insert them into the dataflow. Nodes send report updates to the master, which decides whether more nodes are needed and, if so… **CLICK** new nodes are added to the computation. Use load-balanced content-insensitive routing where possible; use load-balanced content-sensitive routing where needed. ----- Meeting Notes (6/27/11 19:18) ----- Put the machine names. ----- Meeting Notes (6/29/11 14:33) ----- "Efficient" is too relative a word.
  8. Experiment 1 – Varying Initial Cluster Size. Click 1 – The experiment starts with 1 node. Click 2 – At first we
  9. Series 1. Let's see the experiments. We begin by executing the query starting with one node. **CLICK** The system starts execution with 1 node. At first there are no statistics on progress, so nothing can be said about whether the deadline will be met. **CLICK** As soon as the system detects the deadline will be missed… ----- Meeting Notes (6/29/11 14:33) ----- Threshold on deadline-miss detection; threshold on max number of machines; interesting story to tell about how it would behave under different cost models. peculOS
  10. Clarify orally how the perturbations were injected (Linux commands).
  11. Normal operation: Maps process single partitions and tag results with part_id. Partial reduces maintain per-partition windows. Total reduces maintain a tentative set, where results are separated partition-wise. Upon receiving a part_end punctuation… When faults occur: the master notifies the remaining nodes that the node has failed (so they know not to receive data from that node). Nodes discard all data from that partition (partial reduces discard the partition's window set and total reduces discard the partition's group in the tentative set). ----- Meeting Notes (6/27/11 18:55) ----- Turn this into two slides.
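A minimal sketch of the partition-wise bookkeeping described in note 11, assuming (as the note suggests) that a partition's tentative results are only committed once its part_end punctuation arrives; the class and method names are illustrative, not the system's actual operators:

```python
from collections import defaultdict

class TotalReduce:
    """Keeps per-partition tentative aggregates; a partition's results only
    become final when its part_end punctuation arrives, so the data of a
    failed node's partition can be discarded and the partition reprocessed."""

    def __init__(self):
        self.tentative = defaultdict(dict)  # part_id -> {key: [sum, count]}
        self.final = {}                     # key -> [sum, count] (committed)

    def on_tuple(self, part_id, key, value):
        agg = self.tentative[part_id].setdefault(key, [0.0, 0])
        agg[0] += value
        agg[1] += 1

    def on_part_end(self, part_id):
        # Punctuation: the partition is complete, commit its aggregates.
        for key, (total, count) in self.tentative.pop(part_id, {}).items():
            running = self.final.setdefault(key, [0.0, 0])
            running[0] += total
            running[1] += count

    def on_node_failed(self, part_ids):
        # Master notified us of a failure: drop all tentative state for the
        # failed node's partitions so they can be re-fetched from the queue.
        for part_id in part_ids:
            self.tentative.pop(part_id, None)
```

For example, on_tuple(1, "AAPL", 10.0) followed by on_part_end(1) commits partition 1's aggregates, whereas on_node_failed([1]) before the punctuation would discard them.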