1. MapReduce Online. Tyson Condie, UC Berkeley. Joint work with Neil Conway, Peter Alvaro, and Joseph M. Hellerstein (UC Berkeley), Khaled Elmeleegy and Russell Sears (Yahoo! Research).
2. MapReduce Programming Model. Think data-centric: apply a two-step transformation to data sets. Map step: Map(k1, v1) -> list(k2, v2). Apply the map function to input records; divide output records into groups. Reduce step: Reduce(k2, list(v2)) -> list(v3). Consolidate groups from the map step; apply the reduce function to each group.
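The two-step model above can be sketched in a few lines of Python. This is an illustrative single-process sketch, not Hadoop code; `map_fn` and `run_job` are made-up names for this example.

```python
from collections import defaultdict

# map(k1, v1) -> list((k2, v2)); reduce(k2, list(v2)) -> list(v3)

def map_fn(_, line):
    # Emit (word, 1) for every word in the input line.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Consolidate one group: sum the partial counts.
    return [sum(counts)]

def run_job(records):
    groups = defaultdict(list)
    for key, value in records:            # map step
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)         # group output records by k2
    return {k: reduce_fn(k, vs)[0] for k, vs in sorted(groups.items())}

print(run_job([(0, "a b a"), (1, "b c")]))  # {'a': 2, 'b': 2, 'c': 1}
```

In a real Hadoop job the grouping step is distributed: map output is partitioned across reduce tasks rather than collected in one dictionary.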
4. Life Beyond Batch. MapReduce is often used for analytics on streams of data that arrive continuously: click streams, network traffic, web crawl data, … Batch approach: buffer, load, process; high latency and not scalable. Online approach: run MR jobs continuously, analyzing data as it arrives.
5. Online Query Processing. Two domains of interest (at massive scale): online aggregation, i.e., interactive data analysis (watch the answer evolve), and stream processing, i.e., continuous (real-time) analysis of data streams. Blocking operators are a poor fit: they give final answers only and cannot handle infinite streams. Operators need to pipeline, BUT we must retain fault tolerance AND Keep It Simple, Stupid!
6. A Brave New MapReduce World. Pipelined MapReduce: maps can operate on infinite data (stream processing); reduces can export early answers (online aggregation). Hadoop Online Prototype (HOP): Hadoop with pipelining support; preserves Hadoop interfaces and APIs; pipelining fault tolerance model.
7. Outline Hadoop MR Background Hadoop Online Prototype (HOP) Online Aggregation Stream Processing Performance (blocking vs. pipelining) Future Work
8. Hadoop Architecture. Hadoop MapReduce: single master node (JobTracker), many worker nodes (TaskTrackers). A client submits a job to the JobTracker; the JobTracker splits each job into tasks (map/reduce) and assigns tasks to TaskTrackers on demand. Hadoop Distributed File System (HDFS): single name node, many data nodes. Data is stored as fixed-size (e.g., 64MB) blocks. HDFS typically holds map input and reduce output.
11. Hadoop Job Execution (map side). Diagram: map and reduce tasks write to and read from the local FS; reducers learn finished map locations from the JobTracker. Map output: sorted by group id and key, where group id = hash(key) mod # reducers.
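The group-id rule above can be sketched directly. This is a stand-in for Hadoop's default hash partitioner, not its actual Java implementation; `crc32` replaces Java's `hashCode` only to keep the sketch deterministic across runs.

```python
import zlib

def partition(key, num_reducers):
    # group id = hash(key) mod (# reducers): every record with the same
    # key is assigned to the same reduce task.
    return zlib.crc32(key.encode("utf-8")) % num_reducers
```

Because the group id depends only on the key, all partial counts for a key meet at one reducer, which is what makes the reduce step's per-group consolidation possible.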
13. Hadoop Job Execution (reduce side). The reduce task writes the final answer to HDFS. Input: sorted runs of records assigned the same group id. Process: merge-sort the runs; for each final group, call reduce.
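The merge-sort-then-reduce process above can be sketched with the standard library. A minimal sketch, assuming each run is already sorted by key; `reduce_step` is a made-up name for this example.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def reduce_step(sorted_runs, reduce_fn):
    # Merge-sort the sorted runs into one key-ordered stream, then call
    # the reduce function once per final group.
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    return [(k, reduce_fn(k, [v for _, v in group]))
            for k, group in groupby(merged, key=itemgetter(0))]

runs = [[("a", 1), ("b", 1)], [("a", 2), ("c", 3)]]
print(reduce_step(runs, lambda k, vs: sum(vs)))  # [('a', 3), ('b', 1), ('c', 3)]
```

`heapq.merge` streams the runs rather than materializing them, which mirrors why the reduce side can start its merge work as soon as sorted runs arrive.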
14.
15.
16. Pipelining Data Unit. Initial design: pipeline eagerly (each record). Drawbacks: prevents map-side preaggregation (a.k.a., the combiner), moves all the sorting work to the reduce step, and map computation can block on network I/O. Revised design: pipeline small sorted runs (spills). Task thread: apply the (map/reduce) function and buffer output. Spill thread: sort & combine the buffer, spill to a file. TaskTracker: sends spill files to consumers. Simple adaptive algorithm: halt the pipeline when (1) spill files back up OR (2) the combiner is effective; resume the pipeline by first merging & combining accumulated spill files into a single file.
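The halt conditions of the adaptive algorithm can be written as a small predicate. This is a hypothetical sketch: the thresholds `max_pending` and `effective_ratio` are invented for illustration, not HOP's actual constants.

```python
def should_halt_pipeline(pending_spills, combine_ratio,
                         max_pending=10, effective_ratio=0.5):
    # Halt pipelining when (1) unsent spill files back up on disk, or
    # (2) the combiner is effective: combine_ratio is combined size over
    # raw size, so a low ratio means accumulating and merging spills
    # locally pays for itself before sending.
    return pending_spills > max_pending or combine_ratio <= effective_ratio
```

When the predicate fires, the task stops shipping individual spills, merges and combines the accumulated spill files into one file, and then resumes pipelining.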
17. Pipelined Fault Tolerance (PFT). Simple PFT design: reduce treats in-progress map output as tentative; if the map dies, throw away its output, and if it succeeds, accept its output. Revised PFT design: spill files have deterministic boundaries and are assigned a sequence number. Correctness: reduce tasks ensure spill files are idempotent. Optimization: map tasks avoid sending redundant spill files.
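The idempotence side of the revised design can be sketched as a reduce-side receiver that accepts each (map task, sequence number) pair exactly once. `SpillReceiver` is a made-up class for illustration, not HOP's implementation.

```python
class SpillReceiver:
    # Because spill files have deterministic boundaries and sequence
    # numbers, a re-executed map produces byte-identical spills, so
    # duplicates can simply be dropped.
    def __init__(self):
        self.accepted = set()

    def accept(self, map_id, seqno, data):
        if (map_id, seqno) in self.accepted:
            return False            # duplicate spill: safe to ignore
        self.accepted.add((map_id, seqno))
        # ... merge `data` into this reduce task's input runs ...
        return True

recv = SpillReceiver()
recv.accept("map-3", 0, [("a", 1)])   # accepted
recv.accept("map-3", 0, [("a", 1)])   # re-sent after a map restart: dropped
```

The matching map-side optimization is the mirror image: a map task consults which sequence numbers each reducer has already acknowledged and skips resending them.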
18. Benefits of Pipelining. Online aggregation: an early view of the result from a running computation; interactive data analysis (you say when to stop). Stream processing: tasks operate on infinite data streams; real-time data analysis. Performance? Pipelining can improve CPU and I/O overlap, steady network traffic (fewer load spikes), and improve cluster utilization (reducers presort earlier).
19. Outline Hadoop MR Background Hadoop Online Prototype (HOP) Online Aggregation Implementation Example Approximation Query Stream Processing Performance (blocking vs. pipelining) Future Work
22. Bar graph shows results for a single hour (1600), taken less than 2 minutes into a ~2 hour job!
23. Approximation error: |estimate – actual| / actual. The "job progress" estimator assumes hours are uniformly sampled; the "sample fraction" estimator tracks how much of each hour has actually been seen, so it is closer to the true sample distribution of each hour.
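The error metric and the basic scale-up idea behind these estimators fit in a few lines. A minimal sketch: `scaled_estimate` is the generic online-aggregation scale-up (the slide's point is that a per-hour sample fraction is a better choice of f than overall job progress).

```python
def approximation_error(estimate, actual):
    # The relative-error metric from the slide: |estimate - actual| / actual.
    return abs(estimate - actual) / actual

def scaled_estimate(partial_sum, fraction_seen):
    # If a fraction f of a group's input has been processed, estimate the
    # final answer by scaling the partial result up by 1/f.
    return partial_sum / fraction_seen
```

For example, a partial sum of 25 over a quarter of the input yields an estimate of 100; if the true answer were 110, the reported error would be |100 − 110| / 110 ≈ 0.09.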
24. Outline Hadoop MR Background Hadoop Online Prototype (HOP) Online Aggregation Stream Processing Implementation Use case: real-time monitoring system Performance (blocking vs. pipelining) Future Work
25. Implementation. Map and reduce tasks run continuously. Challenge: what if the number of tasks exceeds slot capacity? Current approach: wait for the required slot capacity. Map tasks stream spill files; input is taken from an arbitrary source (MR job, socket, etc.); garbage collection is handled by the system. Window management is done at the reducer. Reduce function arguments: the input data (the set of current input records), an OutputCollector (output records for the current window), and an InputCollector (records for subsequent windows). The return value says when to call next, e.g., in X milliseconds, after receiving Y records, etc.
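The continuous-reduce contract above can be sketched as an interface. This is a hypothetical sketch: plain Python lists stand in for the OutputCollector and InputCollector, and the returned dict is a made-up stand-in for the "call me again in X ms" signal.

```python
class WindowedReducer:
    # The framework hands reduce() the current window's records plus two
    # collectors; the return value schedules the next invocation.
    def reduce(self, records, output_collector, input_collector):
        total = sum(v for _, v in records)
        output_collector.append(("window_sum", total))  # this window's result
        # Nothing is carried over in this sketch; a real reducer could
        # push records for later windows into input_collector here.
        return {"next_call_ms": 1000}   # hypothetical scheduling hint
```

Keeping window management at the reducer means the map side stays a stateless stream of spill files, while each reducer decides when a window closes and emits.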
26. Real-time Monitoring System. Use MapReduce to monitor MapReduce: continuous jobs monitor cluster health, and the same scheduler serves user jobs and monitoring jobs (economy of mechanism). Agents monitor machines, recording statistics of interest (/proc, log files, etc.); each agent is implemented as a continuous map task. Aggregators group agent-local statistics for high-level (rack, datacenter) analysis and correlations. Reduce windows: 1, 5, and 15 second load averages.
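The aggregator's trailing load averages can be sketched with timestamped samples. A minimal sketch, assuming monotonically increasing timestamps in seconds; `LoadAverages` is a made-up class for illustration.

```python
from collections import deque

class LoadAverages:
    # Keep recent (timestamp, load) samples and report the mean over the
    # trailing 1, 5, and 15 second windows.
    def __init__(self, windows=(1, 5, 15)):
        self.windows = windows
        self.samples = deque()

    def record(self, ts, load):
        self.samples.append((ts, load))
        # Drop samples that fall outside even the widest window.
        while self.samples and ts - self.samples[0][0] > max(self.windows):
            self.samples.popleft()

    def averages(self, now):
        result = {}
        for w in self.windows:
            vals = [v for t, v in self.samples if now - t <= w]
            result[w] = sum(vals) / len(vals) if vals else 0.0
        return result
```

In the monitoring system this logic would live inside a reduce task's window handler, with agent map tasks supplying the samples.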
28. Performance. Open problem! A lot of performance-related work still remains, so focus on the obvious cases first. Why block? An effective combiner; the reduce step is a bottleneck. Why pipeline? Improve cluster utilization; smooth out network traffic.
29. Blocking vs. Pipelining. Setup: simple wordcount on two (small) EC2 nodes; map machine with 2 map slots, reduce machine with 2 reduce slots; 2GB of input data at a 512MB block size, so the job contains 4 maps and (a hard-coded) 2 reduces. Chart annotations: 2 maps sort and send output to the reducer, then the 3rd map finishes, sorts, and sends, then the final map finishes, sorts, and sends to reduce.
30. Blocking vs. Pipelining (same wordcount setup). Chart annotations: the reduce task receives the 1st through 4th map outputs and the job completes when the reduce task finishes its final merge-sort. Blocking: a 6.5 minute reduce idle period, ~15 minutes total. Pipelining: no significant idle periods during the shuffle phase, ~9 minutes total.
31. Recall in blocking… Operators block: poor CPU and I/O overlap, and reduce task idle periods. Only the final answer is fetched, so more data is fetched, resulting in network traffic spikes, especially when a group of maps finish.
32. CPU Utilization (Amazon CloudWatch). Mapper CPU: map tasks loading 2GB of data. Reducer CPU: with pipelining, reduce tasks start working (presorting) early; with blocking, the reduce task shows the 6.5 minute idle period.
33. Recall in blocking… Operators block: poor CPU and I/O overlap, and reduce task idle periods. Only the final answer is fetched, so more data is fetched at once, resulting in network traffic spikes, especially when a group of maps finish.
34. Network Traffic Spikes (map machine network out, Amazon CloudWatch). Chart annotations: the first 2 maps finish and send output, then the 3rd map finishes and sends, then the last map finishes and sends.
36. Adaptive vs. Fixed Block Size. 512MB block size, 240 maps, scheduled in 3 waves (1st 80, 2nd 80, last 80). Job completion time: ~49 minutes vs. ~36 minutes. A large block size creates reduce idle periods and poor CPU and I/O overlap; pipelining minimizes the idle periods.
37. Adaptive vs. Fixed Block Size. 32MB block size, 3120 maps, scheduled in 39 waves; maps finish fast! Job completion time: ~42 minutes vs. ~35 minutes. A small block size improves CPU and I/O overlap BUT increases scheduling overhead, so it is not a scalable solution. The adaptive policy finds the right degree of pipelined parallelism based on runtime dynamics (load, network capacity, etc.).
38. Future Work. Blocking vs. pipelining: a thorough performance analysis at scale; a Hadoop optimizer. Online aggregation: statistically robust estimation; random sampling of the input; a better UI for approximate results. Stream processing: develop a full-fledged stream processing framework; stream support for high-level query languages.