IEEE CLOUD '11

2. In a nutshell
Cluster computing is widely used to solve "Big Data" problems.
Users express the computation through programming abstractions such as MapReduce, but are left with difficult questions: how many nodes are needed, and how long will it take?
Proposed solution: users define a deadline, and the cluster expands or contracts to meet it.

3. Introducing Deadline Queries
Cluster computing tasks that complete within a deadline while minimizing cost and resource consumption, independently of:
processing capacity per machine
faults or perturbations
initial number of nodes
data size, content, or skew
computation complexity

4. Approaches in current systems … make the task fit the cluster.

5. Our Approach … make the cluster fit the task.

6. Architecture and Runtime
Ex: SELECT symbol, avg(value), avg(volume) FROM Stocks GROUP BY symbol FINISH IN 900 SEC
[Diagram: the query is submitted to the Master Node; Worker Nodes process partitions Part. 1 … Part. n and send metrics to the Master Node; the Master Node requests nodes from the IaaS Provider and modifies the cluster.]

7. Stream Processing
Continuous processing allows phases to start before previous phases complete.
Continuous processing makes it possible to continuously gather progress metrics about the computation as a whole.
Stream processing (SP) provides continuous load balancing, which makes it possible to:
take immediate advantage of arriving nodes
deal with temporary or permanent asymmetries
deal with data skew
SP fault tolerance allows a quick response to faults.
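
The point about taking immediate advantage of arriving nodes can be illustrated with a small simulation. This is only a sketch under simplifying assumptions, not the system described in the talk: it uses a plain pull-based work queue instead of streaming routing, every worker handles exactly one partition per time step, and the function name `ticks_to_drain` and its `arrivals` parameter are hypothetical.

```python
from collections import deque


def ticks_to_drain(num_partitions, initial_workers, arrivals=None):
    """Simulate pull-based scheduling: each tick, every worker takes one partition.

    arrivals maps a tick number to the number of workers that join at the
    start of that tick. Returns the number of ticks needed to empty the queue.
    """
    arrivals = arrivals or {}
    queue = deque(range(num_partitions))
    workers = initial_workers
    tick = 0
    while queue:
        # Workers that arrive now start pulling work in this very tick.
        workers += arrivals.get(tick, 0)
        for _ in range(min(workers, len(queue))):
            queue.popleft()
        tick += 1
    return tick


static_ticks = ticks_to_drain(12, 2)                     # cluster never changes
elastic_ticks = ticks_to_drain(12, 2, arrivals={2: 2})   # two nodes join at tick 2
```

Because work is pulled rather than assigned up front, the nodes that join at tick 2 need no repartitioning step before they contribute, which is the property the slide relies on.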

8. MapReduce
SELECT symbol, avg(value), avg(volume) FROM Stocks GROUP BY symbol FINISH IN 900 SEC
MapReduce decomposition: Fetch & Transform, Map (Select/Project), Group, Reduce (Aggregate), Store Results.
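
The decomposition of the running query can be sketched on a single machine as follows. This is an illustrative sketch only: the record layout (dictionaries with `symbol`, `value`, `volume`) and the function names are assumptions, the sample rows are made up, and the FINISH IN clause is not modeled.

```python
from collections import defaultdict


def map_record(record):
    """Map (select/project): keep only the columns the query needs."""
    return record["symbol"], (record["value"], record["volume"])


def reduce_group(pairs):
    """Reduce (aggregate): average value and average volume for one symbol."""
    n = len(pairs)
    return (sum(v for v, _ in pairs) / n, sum(vol for _, vol in pairs) / n)


def run_query(records):
    groups = defaultdict(list)  # Group: bucket mapped pairs by symbol
    for record in records:
        symbol, pair = map_record(record)
        groups[symbol].append(pair)
    return {symbol: reduce_group(pairs) for symbol, pairs in groups.items()}


# Hypothetical input rows; the extra "exchange" column is projected away by the map.
stocks = [
    {"symbol": "AAA", "value": 10.0, "volume": 100, "exchange": "X"},
    {"symbol": "AAA", "value": 20.0, "volume": 300, "exchange": "X"},
    {"symbol": "BBB", "value": 5.0, "volume": 50, "exchange": "Y"},
]
result = run_query(stocks)
```

In a distributed setting the map and reduce steps would run on different nodes; here they are plain function calls so the stages stay easy to follow.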

9. Streaming MapReduce - Scaling
Stream processing => load balancing and fault tolerance in a changing cluster.
MapReduce => simple, parallel, scalable programming and execution model.

10. Progress estimation
Consumed vs. remaining data, plus linear regression, is used to estimate the finish time.
The system reacts accordingly by either expanding or contracting the cluster.
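
A minimal sketch of this idea, assuming progress is reported as (elapsed seconds, fraction of data consumed) samples. The slide does not say how the new cluster size is chosen, so the rule in `target_nodes` is a hypothetical one that assumes throughput scales linearly with the number of nodes; all names here are illustrative.

```python
import math


def estimate_finish_time(samples):
    """Fit consumed = a + b * t by least squares; return the t where consumed hits 1.0.

    samples: list of (elapsed_seconds, fraction_of_data_consumed).
    Returns None when there are too few samples or no measurable progress.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_c = sum(c for _, c in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (c - mean_c) for t, c in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_c - slope * mean_t
    return (1.0 - intercept) / slope


def target_nodes(samples, deadline, nodes):
    """Hypothetical sizing rule: scale the node count by needed time / available time."""
    finish = estimate_finish_time(samples)
    now = samples[-1][0] if samples else 0
    if finish is None or deadline <= now:
        return nodes  # no basis for a decision yet, or the deadline has passed
    work_left = max(finish - now, 0.0)  # seconds still needed at the current rate
    return max(1, math.ceil(nodes * work_left / (deadline - now)))


slow = [(0, 0.0), (100, 0.05), (200, 0.10)]   # on course to finish at about 2000 s
fast = [(0, 0.0), (100, 0.25), (200, 0.50)]   # on course to finish at 400 s
expand_to = target_nodes(slow, deadline=900, nodes=1)
contract_to = target_nodes(fast, deadline=900, nodes=8)
```

With a 900 second deadline, the slow run asks for more nodes and the fast run releases some. A real controller would likely add thresholds so that small estimation noise does not trigger constant resizing.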

11. Experimental Evaluation - Setup
Real-world experiments on top of Amazon EC2.
Running query: SELECT symbol, avg(value), avg(volume) FROM Stocks GROUP BY symbol FINISH IN 900 SEC
Used between 1 and 27 machines (m1.large): 2 × dual-core Xeon (2.66 GHz), 7.5 GB of RAM.
The experiments show: predicted remaining time, number of nodes.

12. Exp. 1 – Varying Initial Cluster Size

13. Exp. 2 – Varying Deadline

14. Exp. 3 – Introducing Perturbations

15. Conclusions
Cloud computing, e.g., IaaS, allows new approaches to cluster computing and new optimization goals.
Deadline Queries may help express computation provisioning requirements beyond the number of nodes.
Deadline Queries are a viable alternative for implementing hard time limits on query execution.
A real implementation and evaluation show the approach is feasible and works as expected.

16. Questions?

17. Fault Tolerance

Editor's Notes

----- Meeting Notes (10/20/10 14:48) -----
General notes: Be more "sharp". Focus on the audience. I never say how I am going to evaluate the system. The Gantt chart is too small.
In particular, I'd like to refer to two practical cases. The first is that of a Portuguese bank that must finish processing 10M transactions and produce the respective reports in the morning, but has no idea how much machine power it requires to do so. The second is that of a Portuguese telecom company that is building the largest Portuguese private cloud, but still has problems allocating nodes to tasks so that they are guaranteed to complete in time.
Create an animation in a slide or two that describes how the problem was previously dealt with and our solution; introduce the running example here.
Story of the slide: start processing documents (start moving the doc arrow to the cluster); when the system predicts the deadline will be missed (clock turns red), it starts discarding data or reducing accuracy (put documents in the trash).
Mention orally that other systems discard data, but do not dwell on this for long. Mention that in many cases data cannot be thrown away (previous examples).
Story of the slide: start processing documents (start moving the doc arrow to the cluster); when we see the deadline will be missed (clock turns red), start expanding the resources used.
Mention that we adopted streaming MapReduce so that we are able to deal with changes in the cluster.
Transform the task into a dataflow and split the data into partitions.
Request nodes and assign dataflow parts to them.
Nodes fetch partitions from a queue and insert them into the dataflow.
Nodes send report updates to the master, which decides whether more nodes are needed, and if so…
**CLICK**
New nodes are added to the computation.
The fact that we use streaming MapReduce allows us to:
deal with data skew, by using streaming routing techniques
deal with faults relatively quickly
Transform the task into a dataflow and split the data into partitions.
Request nodes and assign dataflow parts to them.
Nodes fetch partitions from a queue and insert them into the dataflow.
Nodes send report updates to the master, which decides whether more nodes are needed, and if so…
**CLICK**
New nodes are added to the computation.
Use load-balanced content-insensitive routing where possible.
Use load-balanced content-sensitive routing where needed.
----- Meeting Notes (6/27/11 19:18) -----
Add the names of the machines.
----- Meeting Notes (6/29/11 14:33) -----
"Efficient" is too relative a word.
Experiment 1 – Varying Initial Cluster Size.
Click 1 – The experiment starts with 1 node.
Click 2 – At first we
Series 1
Let's see the experiments. We begin by executing the query starting with one node.
**CLICK**
The system starts execution with 1 node. At first there are no statistics on progress, so nothing can be said about whether the deadline will be met.
**CLICK**
As soon as the system detects the deadline will be missed
----- Meeting Notes (6/29/11 14:33) -----
Threshold on deadline fault detection.
Threshold on max number of machines.
Interesting story to tell: how it would behave under various cost models.
speculOS
Clarify orally how the perturbations were injected (Linux commands).
Normal operation:
Maps process single partitions and tag results with part_id.
Partial reduces maintain per-partition windows.
Total reduces maintain a tentative set, where results are separated partition-wise.
Upon receiving a part_end punctuation
When faults occur:
Master notifies the remaining nodes that the node has failed (so they know not to receive data from that node).
Nodes discard all data from that partition (partial reduces discard the partition's window set and total reduces discard the partition's group in the tentative set).
----- Meeting Notes (6/27/11 18:55) -----
Turn this into two slides.
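
The partition-wise tentative set can be sketched as below. This is an illustration under stated assumptions rather than the actual implementation: the class name `TotalReduce`, its method names, and the way the failed node's partitions are passed in are all hypothetical, and the handling of the part_end punctuation is left out because the notes do not spell it out.

```python
from collections import defaultdict


class TotalReduce:
    """Keeps tentative (sum, count) aggregates separated by partition id."""

    def __init__(self):
        # part_id -> symbol -> [sum_of_values, count]
        self.tentative = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))

    def on_result(self, part_id, symbol, value):
        """Accumulate a result that a map tagged with its partition id."""
        acc = self.tentative[part_id][symbol]
        acc[0] += value
        acc[1] += 1

    def on_node_failed(self, in_flight_part_ids):
        """Discard tentative results of partitions the failed node was processing."""
        for part_id in in_flight_part_ids:
            self.tentative.pop(part_id, None)

    def averages(self):
        """Merge the surviving partitions into one average per symbol."""
        totals = defaultdict(lambda: [0.0, 0])
        for groups in self.tentative.values():
            for symbol, (total, count) in groups.items():
                totals[symbol][0] += total
                totals[symbol][1] += count
        return {symbol: total / count for symbol, (total, count) in totals.items()}


reducer = TotalReduce()
reducer.on_result(1, "AAA", 10.0)
reducer.on_result(2, "AAA", 30.0)
reducer.on_result(2, "BBB", 5.0)
before_fault = reducer.averages()
reducer.on_node_failed([2])   # the node working on partition 2 has failed
after_fault = reducer.averages()
```

Keeping results keyed by partition is what makes the discard cheap: only the failed partition's group is dropped, and that partition can be reprocessed by another node without touching the rest.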
