Deadline Queries: Leveraging the Cloud to Produce On-Time Results Authors: David Ribeiro Alves, Pedro Bizarro, Paulo Marques
In a nutshell Cluster computing is widely used to solve “Big Data” problems. Users rely on programming abstractions such as MapReduce to express the computation, but are left with some difficult questions: how many nodes? how long will it take? Proposed solution: users define a deadline; the cluster expands/contracts to meet it. 2 CLOUD '11
Introducing Deadline Queries Cluster computing tasks that complete within a deadline… …while minimizing cost/resource consumption. Independently of: Processing Capacity per Machine, Faults or Perturbations, Initial Number of Nodes, Data Size, Content or Skew, Computation Complexity. 3 CLOUD '11
Approaches in current systems 4 … make the task fit the cluster. CLOUD '11
Our Approach 5 … make the cluster fit the task. CLOUD '11
Architecture and Runtime 6 Ex: SELECT symbol, avg(value), avg(volume) FROM Stocks GROUP BY symbol FINISH IN 900 SEC [Architecture diagram: the Master Node receives the query, requests nodes from the IaaS Provider, modifies the cluster based on worker metrics, and assigns partitions Part. 1 … Part. n to Worker Nodes.] CLOUD '11
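As a rough illustration of the control loop implied by the diagram, the sketch below shows a master that queues partitions, polls worker metrics, and asks the IaaS provider to grow or shrink the cluster. The IaaS/worker interfaces and method names here are hypothetical; the slides do not show the system's actual APIs.

```python
import time

class DeadlineQueryMaster:
    """Minimal sketch of the master node from the architecture slide:
    it queues partitions for workers, polls their progress metrics, and
    resizes the cluster through a (hypothetical) IaaS provider interface."""

    def __init__(self, iaas, partitions, deadline_sec):
        self.iaas = iaas                    # assumed to expose add_nodes()/remove_nodes()
        self.partitions = list(partitions)  # Part. 1 … Part. n, fetched by workers
        self.deadline_sec = deadline_sec
        self.workers = []

    def run(self, initial_nodes, target_nodes_fn, poll_sec=30):
        self.workers = self.iaas.add_nodes(initial_nodes)
        start = time.time()
        while any(not w.done() for w in self.workers):
            metrics = [w.report_progress() for w in self.workers]
            elapsed = time.time() - start
            # target_nodes_fn encapsulates the progress estimation (slide 10).
            target = target_nodes_fn(metrics, elapsed, self.deadline_sec)
            delta = target - len(self.workers)
            if delta > 0:      # predicted to miss the deadline: expand
                self.workers += self.iaas.add_nodes(delta)
            elif delta < 0:    # predicted to finish early: contract to save cost
                self.workers = self.iaas.remove_nodes(self.workers, -delta)
            time.sleep(poll_sec)
```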
Stream Processing Continuous processing allows phases to start before previous phases complete. Continuous processing also allows progress metrics about the computation as a whole to be gathered continuously. SP provides continuous load balancing, which makes it possible to: take immediate advantage of arriving nodes; deal with temporary or permanent asymmetries; deal with data skew. SP fault tolerance allows faults to be handled quickly. CLOUD '11 7
MapReduce SELECT symbol, avg(value), avg(volume) FROM Stocks GROUP BY symbol FINISH IN 900 sec MapReduce Decomposition: Fetch & Transform → Map (Select/Project) → Group → Reduce (Aggregate) → Store Results. 8 CLOUD '11
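A toy illustration of this decomposition in plain Python; the in-memory `stocks` records and the `map_fn`/`reduce_fn` helpers are made up for the example and are not the system's actual operators:

```python
from collections import defaultdict

# Illustrative input: each record is (symbol, value, volume).
stocks = [("AAPL", 10.0, 100), ("GOOG", 20.0, 50), ("AAPL", 12.0, 80)]

# Map (Select/Project): emit the symbol with the fields the aggregate needs.
def map_fn(record):
    symbol, value, volume = record
    yield symbol, (value, volume, 1)

# Group: collect mapped tuples per key (done by the framework in MapReduce).
groups = defaultdict(list)
for record in stocks:
    for key, val in map_fn(record):
        groups[key].append(val)

# Reduce (Aggregate): compute avg(value) and avg(volume) per symbol.
def reduce_fn(key, values):
    total_value = sum(v for v, _, _ in values)
    total_volume = sum(v for _, v, _ in values)
    count = sum(c for _, _, c in values)
    return key, total_value / count, total_volume / count

results = [reduce_fn(k, vs) for k, vs in groups.items()]
print(results)  # Store Results
```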
Streaming MapReduce - Scaling Stream Processing => load balancing and fault tolerance in a changing cluster MapReduce => Simple, parallel, scalable programming and execution model 9 CLOUD '11
Progress estimation Consumed vs. remaining data + linear regression to estimate finish time. React accordingly by either expanding or contracting the cluster. 10 CLOUD '11
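A minimal sketch of the estimation step, assuming progress is reported as the fraction of input consumed at each point in time; the least-squares fit matches the slide's linear regression, while the proportional expand/contract rule is only an illustrative guess at the policy:

```python
def estimate_finish_time(samples):
    """Least-squares line through (elapsed_seconds, fraction_consumed) points;
    returns the elapsed time at which the line reaches fraction 1.0."""
    n = len(samples)
    sum_t = sum(t for t, _ in samples)
    sum_f = sum(f for _, f in samples)
    sum_tt = sum(t * t for t, _ in samples)
    sum_tf = sum(t * f for t, f in samples)
    slope = (n * sum_tf - sum_t * sum_f) / (n * sum_tt - sum_t ** 2)
    intercept = (sum_f - slope * sum_t) / n
    return (1.0 - intercept) / slope  # time when all data would be consumed

def decide_cluster_size(samples, deadline, current_nodes, max_nodes=27):
    """Expand when the predicted finish time overshoots the deadline,
    contract when it undershoots; the proportional rule is illustrative."""
    predicted = estimate_finish_time(samples)
    elapsed = samples[-1][0]
    remaining_budget = max(deadline - elapsed, 1e-9)
    remaining_work = max(predicted - elapsed, 0.0)
    needed = current_nodes * remaining_work / remaining_budget
    return min(max_nodes, max(1, round(needed)))

# Example: 300 s elapsed, only 25% of the data consumed, 900 s deadline, 4 nodes.
samples = [(60, 0.05), (120, 0.10), (180, 0.15), (240, 0.20), (300, 0.25)]
print(estimate_finish_time(samples))         # ~1200 s at the current rate
print(decide_cluster_size(samples, 900, 4))  # 6: suggests expanding the cluster
```

With the 900 s deadline of the running query, the example above predicts a finish around 1200 s at the current rate and therefore suggests growing the cluster from 4 to 6 nodes.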
Experimental Evaluation - Setup 11 Real-world environment experiments on top of Amazon EC2. Running Query: SELECT symbol, avg(value), avg(volume) FROM Stocks GROUP BY symbol FINISH IN 900 sec. Used between 1 and 27 machines (m1.large): 2x Dual-Core Xeon (2.66 GHz), 7.5 GB of RAM. Experiments show: predicted remaining time; number of nodes. CLOUD '11
Exp. 1 – Varying Initial Cluster Size 12 CLOUD '11
Exp. 2 – Varying Deadline 13 CLOUD '11
Exp. 3 – Introducing Perturbations 14 CLOUD '11
Conclusions Cloud computing, e.g., IaaS, allows new approaches to cluster computing and new optimization goals. Deadline Queries may help express computation provisioning requirements beyond the number of nodes. Deadline Queries are a viable alternative for implementing hard time limits on query execution. A real implementation and evaluation show the approach is feasible and works as expected. 15 CLOUD '11
16 Questions? CLOUD '11
Fault Tolerance 17 CLOUD ‘11


Editor's Notes

  1. ----- Meeting Notes (10/20/10 14:48) ----- General notes: be sharper. Focus on the audience. I never say how I will evaluate the system. The Gantt chart is too small.
  2. In particular I'd like to refer to two practical cases: the 1st is that of a Portuguese bank that must finish processing 10M transactions and produce the respective reports by morning, but has no idea how much machine power it needs to do so. The 2nd is that of a Portuguese telecom company that is building the largest Portuguese private cloud, but still has trouble allocating nodes to tasks to guarantee they complete in time.
  3. Create an animation in a slide or two that describes how the problem was previously dealt with and our solution; introduce the running example here. Story of the slide: start processing documents (start moving the doc arrow to the cluster); when the system predicts the deadline will be missed (clock turns red)… it starts discarding data or reducing accuracy (put documents in the trash). Mention orally that other systems discard data, but don't dwell on this too long; mention that in many cases data cannot simply be thrown away (previous examples).
  4. Story of the slide: start processing documents (start moving the doc arrow to the cluster); when we see the deadline will be missed (clock turns red)… start expanding the resources used.
  5. Mention that we adopt streaming MapReduce so we are able to deal with changes in the cluster.
  6. Transform the task into a dataflow and split the data into partitions. Request nodes and assign dataflow parts to them. Nodes fetch partitions from a queue and insert them into the dataflow. Nodes send report updates to the master, which decides whether more nodes are needed and, if so… **CLICK** new nodes are added to the computation. The fact that we use streaming MapReduce allows us to: deal with data skew, by using streaming routing techniques; deal with faults relatively quickly.
  7. Transform the task into a dataflow and split the data into partitions. Request nodes and assign dataflow parts to them. Nodes fetch partitions from a queue and insert them into the dataflow. Nodes send report updates to the master, which decides whether more nodes are needed and, if so… **CLICK** new nodes are added to the computation. Use load-balanced content-insensitive routing where possible; use load-balanced content-sensitive routing where needed. ----- Meeting Notes (6/27/11 19:18) ----- Put the machine names. ----- Meeting Notes (6/29/11 14:33) ----- "Efficient" is too relative a word.
  8. Experiment 1 – Varying Initial Cluster Size. Click 1 – The experiment starts with 1 node. Click 2 – At first we
  9. Series 1. Let's see the experiments. We begin by executing the query starting with one node. **CLICK** The system starts execution with 1 node. At first there are no statistics on progress, so nothing can be said about whether the deadline will be met. **CLICK** As soon as the system detects the deadline will be missed… ----- Meeting Notes (6/29/11 14:33) ----- Threshold on deadline-miss detection; threshold on max number of machines; interesting story to tell about how it would behave under different cost models. peculOS
  10. Clarify orally how the perturbations were injected (Linux commands).
  11. Normal operation: Maps process single partitions and tag results with part_id. Partial reduces maintain per-partition windows. Total reduces maintain a tentative set, where results are separated partition-wise. Upon receiving a part_end punctuation… When faults occur: the master notifies the remaining nodes that the node has failed (so they know not to receive data from that node). Nodes discard all data from that partition (partial reduces discard the partition's window set and total reduces discard the partition's group in the tentative set). ----- Meeting Notes (6/27/11 18:55) ----- Turn this into two slides.
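A minimal sketch of the partition-wise bookkeeping described in note 11, assuming (as the note suggests) that a partition's tentative results are only committed once its part_end punctuation arrives; the class and method names are illustrative, not the system's actual operators:

```python
from collections import defaultdict

class TotalReduce:
    """Keeps per-partition tentative aggregates; a partition's results only
    become final when its part_end punctuation arrives, so the data of a
    failed node's partition can be discarded and the partition reprocessed."""

    def __init__(self):
        self.tentative = defaultdict(dict)  # part_id -> {key: [sum, count]}
        self.final = {}                     # key -> [sum, count] (committed)

    def on_tuple(self, part_id, key, value):
        agg = self.tentative[part_id].setdefault(key, [0.0, 0])
        agg[0] += value
        agg[1] += 1

    def on_part_end(self, part_id):
        # Punctuation: the partition is complete, commit its aggregates.
        for key, (total, count) in self.tentative.pop(part_id, {}).items():
            running = self.final.setdefault(key, [0.0, 0])
            running[0] += total
            running[1] += count

    def on_node_failed(self, part_ids):
        # Master notified us of a failure: drop all tentative state for the
        # failed node's partitions so they can be re-fetched from the queue.
        for part_id in part_ids:
            self.tentative.pop(part_id, None)
```

For example, on_tuple(1, "AAPL", 10.0) followed by on_part_end(1) commits partition 1's aggregates, whereas on_node_failed([1]) before the punctuation would discard them.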