An introduction to Hadoop for large-scale data analysis
1. Hadoop – Large-scale data analysis
Abhijit Sharma
2. Big Data Trends
- Unprecedented growth in data set size: Facebook runs a 21+ PB data warehouse, growing by 12+ TB/day
- Unstructured and semi-structured data: logs, documents, graphs
- Connected data: web, tags, graphs
- Relevant to enterprises: logs, social media, machine-generated data, the breaking down of silos
3. Putting Big Data to Work
- Data-driven organizations: decision support, new offerings
- Analytics on large data sets, e.g. Facebook Insights (Page and App statistics)
- Data mining, e.g. clustering of Google News articles
- Search, e.g. Google
4. Problem Characteristics and Examples
- Embarrassingly data-parallel problems
- Data is chunked and distributed across the cluster
- Parallel processing with data locality: each task is dispatched to where its data is
- Horizontal/linear scaling using commodity hardware
- Write once, read many
- Examples: distributed logs (grep, number of accesses per URL); search (term vector generation, reverse links); see the mapper sketch below
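Distributed grep is the archetypal map-only job: each map task filters its own input split independently, so no reduce phase is needed. A minimal Java sketch; the class name and search pattern are assumptions of this sketch:

import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical distributed-grep mapper: each map task scans only its own
// input split, so the filtering parallelizes across the cluster.
public class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
  private static final Pattern PATTERN = Pattern.compile("ERROR"); // assumed search term

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    if (PATTERN.matcher(line.toString()).find()) {
      context.write(line, NullWritable.get()); // emit matching lines unchanged
    }
  }
}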
5. What is Hadoop?
- An open-source system for large-scale, batch, distributed computing on big data
- The MapReduce programming paradigm and framework
- MapReduce infrastructure
- A distributed file system (HDFS)
- Endorsed/used extensively by web giants – Google, Facebook, Yahoo!
6. MapReduce – Definition
- MapReduce is a programming model, and an accompanying implementation, for parallel processing of large data sets
- Map processes each logical record of an input split to generate a set of intermediate key/value pairs
- Reduce merges all intermediate values associated with the same intermediate key
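In the type notation of Dean and Ghemawat's original MapReduce paper, the two phases can be summarized as:

map    (k1, v1)       -> list(k2, v2)
reduce (k2, list(v2)) -> list(v2)

The framework groups the map output by k2 before invoking reduce, which is what "merges all intermediate values associated with the same intermediate key" refers to.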
7. MapReduce – Functional Programming Origins
Map: apply a function to each list member – parallelizable. In Groovy:
[1, 2, 3].collect { it * it }
Output: [1, 2, 3] -> Map (Square) -> [1, 4, 9]
Reduce: apply a function and an accumulator across the list members:
[1, 2, 3].inject(0) { sum, item -> sum + item }
Output: [1, 2, 3] -> Reduce (Sum) -> 6
Map & Reduce combined:
[1, 2, 3].collect { it * it }.inject(0) { sum, item -> sum + item }
Output: [1, 2, 3] -> Map (Square) -> [1, 4, 9] -> Reduce (Sum) -> 14
10. Word Count – Pseudocode
mapper(filename, file_contents):
    for each word in file_contents:
        emit(word, 1)   // one count per occurrence, e.g. ("the", 1) for each occurrence of "the"

reducer(word, values):  // values iterates over the counts for a word, e.g. ("the", [1, 1, ...])
    sum = 0
    for each value in values:
        sum = sum + value
    emit(word, sum)
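A runnable Java version of this pseudocode, modeled on the classic Hadoop word-count example (each class would normally live in its own file):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: tokenizes each input line and emits (word, 1) per occurrence.
class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);
    }
  }
}

// Reducer: sums the counts emitted for each word.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}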
11. Examples – MapReduce Definitions
- Word count / distributed log search for number of accesses to various URLs:
  Map emits (word/URL, 1) for each document or log split; Reduce sums up the counts for each specific word/URL
- Term vector generation (term -> [doc-id]):
  Map emits (term, doc-id) for each document split; an identity Reducer accumulates (term, [doc-id, doc-id, ...])
- Reverse links (source -> target becomes target -> source):
  Map emits (target, source) for each document split; an identity Reducer accumulates (target, [source, source, ...]); see the sketch below
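As one concrete instance, a reverse-links mapper might look like the sketch below. The class name and the input format (lines of "source target" pairs) are assumptions of this sketch:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical reverse-links mapper: reads "source target" pairs and emits
// (target, source); the shuffle then groups every source that links to a target.
class ReverseLinksMapper extends Mapper<Object, Text, Text, Text> {
  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] parts = value.toString().trim().split("\\s+");
    if (parts.length == 2) {
      context.write(new Text(parts[1]), new Text(parts[0])); // (target, source)
    }
  }
}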
12. MapReduce – Hadoop Implementation
- Hides the complexity of distributed computing
- Automatic parallelization of jobs
- Automatic data chunking and distribution (via HDFS)
- Data locality: MapReduce tasks are dispatched to where the data is
- Fault tolerant against server, storage, and network failures
- Network and disk transfer optimization
- Load balancing
20. Hadoop MapReduce Components
- Job Tracker: tracks MapReduce jobs; runs on the master node
- Task Tracker: runs on the data nodes and tracks the Mapper and Reducer tasks assigned to that node; sends heartbeats to the Job Tracker; maintains a task queue and picks up tasks from it
A job reaches the Job Tracker via a small client-side driver, as sketched below.
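A minimal driver sketch using the org.apache.hadoop.mapreduce API, reusing the word-count Mapper and Reducer from the earlier sketch (the driver class name is this sketch's own):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver: configures a word-count job and submits it to the cluster.
public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(TokenizerMapper.class);  // from the earlier sketch
    job.setCombinerClass(IntSumReducer.class);  // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);       // blocks until done
  }
}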
21. HDFS
Name Node
- Manages the file system namespace and regulates access to files by clients; stores the metadata
- Maintains the mapping of blocks to Data Nodes and their replicas; manages replication
- Executes file system namespace operations such as opening, closing, and renaming files and directories
Data Node
- One per node; manages the local storage attached to that node
- Internally, a file is split into one or more blocks, and these blocks are stored across a set of Data Nodes
- Responsible for serving read and write requests from the file system's clients; also performs block creation, deletion, and replication upon instruction from the Name Node
A client-side sketch of writing and reading a file through HDFS follows.
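To make the client's view concrete, a short sketch against HDFS's Java FileSystem API (the file path is an assumption of this sketch):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical HDFS client: the Name Node resolves paths and block locations,
// while the file bytes themselves stream to and from Data Nodes.
public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // reads core-site.xml, hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/user/demo/hello.txt");    // assumed example path
    try (FSDataOutputStream out = fs.create(path)) { // write: block placement decided by the Name Node
      out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
    }

    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) { // read streams from Data Nodes
      System.out.println(in.readLine());
    }
  }
}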