Map/Reduce


   Based on original presentation by
Christophe Bisciglia, Aaron Kimball, and
        Sierra Michels-Slettvet

  Except as otherwise noted, the content
 of this presentation is licensed under the
Creative Commons Attribution 2.5 License.
Functional Programming Review

   Functional operations do not modify data structures: They
    always create new ones
   Original data still exists in unmodified form
   Data flows are implicit in program design
   Order of operations does not matter
Functional Programming Review

fun foo(l: int list) =
 sum(l) + mul(l) + length(l)

 The order of sum(), mul(), etc. does not matter – they do not
 modify l
Functional Updates Do Not Modify Structures

fun append(x, lst) =
  let val lst' = rev lst in
    rev (x :: lst')
  end


  The append() function above reverses the list, adds the new
  element to the front, and reverses the result again – the net
  effect is appending x to the end.


  But it never modifies lst!
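
A quick illustrative check in SML, using the definition above (the concrete list is my own):

val xs = [1, 2, 3]
val ys = append (4, xs)   (* ys = [1, 2, 3, 4] *)
(* xs is still [1, 2, 3]: the original list was never modified *)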
Functions Can Be Used As Arguments

fun DoDouble(f, x) = f (f x)

It does not matter what f does to its
argument; DoDouble() will do it twice.


What is the type of this function?
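
One answer, for reference (the definition repeats the slide's; the doubling example is my own): SML infers the polymorphic type ('a -> 'a) * 'a -> 'a, since f's result must be a valid argument to f again.

fun DoDouble (f, x) = f (f x)              (* : ('a -> 'a) * 'a -> 'a *)
val eight = DoDouble (fn n => n * 2, 2)    (* evaluates to 8 *)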
Map
map f lst: ('a->'b) -> ('a list) -> ('b list)

   Creates a new list by applying f to each element of the input list; returns
    output in order.


[diagram: f is applied independently to each element of the input list, producing the output list in the same order]
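
For instance, doubling every element (a small illustrative example; map here is the SML Basis map, or the implementation given later):

val doubled = map (fn x => x * 2) [1, 2, 3, 4]   (* [2, 4, 6, 8] *)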
Fold

fold f x0 lst: ('a*'b->'b)->'b->('a list)->'b
  Moves across a list, applying f to each element plus an accumulator. f
   returns the next accumulator value, which is combined with the next
   element of the list




[diagram: f is applied element by element, threading the accumulator from the initial value through to the returned result]
fold left vs. fold right

   Order of list elements can be significant
   Fold left moves left-to-right across the list
   Fold right moves from right-to-left


SML Implementation:

fun foldl f a []      = a
  | foldl f a (x::xs) = foldl f (f(x, a)) xs

fun foldr f a []      = a
  | foldr f a (x::xs) = f(x, (foldr f a xs))
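
A small illustration of why direction matters, using the definitions above with the non-commutative string-concatenation operator ^ (the example values are my own):

val left  = foldl (fn (x, a) => a ^ x) "" ["a", "b", "c"]   (* "abc" *)
val right = foldr (fn (x, a) => a ^ x) "" ["a", "b", "c"]   (* "cba" *)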
Example

fun foo(l: int list) =
 sum(l) + mul(l) + length(l)

How can we implement this?
Example (Solved)

fun foo(l: int list) =
 sum(l) + mul(l) + length(l)

fun sum(lst) = foldl (fn (x,a)=>x+a) 0 lst
fun mul(lst) = foldl (fn (x,a)=>x*a) 1 lst
fun length(lst) = foldl (fn (x,a)=>1+a) 0 lst
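
As an illustrative check of the fold-based definitions above:

val n = foo [1, 2, 3, 4]   (* 10 + 24 + 4 = 38 *)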
A More Complicated Fold Problem

   Given a list of numbers, how can we generate a list of partial
    sums?

e.g.: [1, 4, 8, 3, 7, 9] ->
    [0, 1, 5, 13, 16, 23, 32]
A More Complicated Fold Problem

   Given a list of numbers, how can we generate a list of partial
    sums?

e.g.: [1, 4, 8, 3, 7, 9] ->
    [0, 1, 5, 13, 16, 23, 32]
fun partialsum(lst) = foldl (fn (x,a) => a @ [List.last(a) + x]) [0] lst
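
An illustrative check (note that List.last makes this version quadratic; a production version would carry the running sum in the accumulator):

val sums = partialsum [1, 4, 8, 3, 7, 9]   (* [0, 1, 5, 13, 16, 23, 32] *)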
A More Complicated Map Problem

   Given a list of words, can we: reverse the letters in each word,
    and reverse the whole list, so it all comes out backwards?

[“my”, “happy”, “cat”] -> [“tac”, “yppah”, “ym”]
A More Complicated Map Problem

   Given a list of words, can we: reverse the letters in each word,
    and reverse the whole list, so it all comes out backwards?

[“my”, “happy”, “cat”] -> [“tac”, “yppah”, “ym”]
fun reverseword(w) = String.implode (rev (String.explode w))
fun reverse2(lst) = foldr (fn (x,a) => a @ [reverseword(x)]) [] lst
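
An illustrative check, using the reverseword helper defined above:

val back = reverse2 ["my", "happy", "cat"]   (* ["tac", "yppah", "ym"] *)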
map Implementation
fun map f []      = []
  | map f (x::xs) = (f x) :: (map f xs)


   This implementation moves left-to-right across the list,
    mapping elements one at a time

   … But does it need to?
Implicit Parallelism In Map

   In a purely functional setting, elements of a list being computed by map
    cannot see the effects of the computations on other elements
   If the order in which f is applied to the list elements does not matter, we
    can reorder or parallelize execution
   This is the insight behind MapReduce
Motivation: Large Scale Data Processing

   Want to process lots of data ( > 1 TB)
   Want to parallelize across hundreds/thousands of CPUs
   … Want to make this easy
    • Hide the details of parallelism, machine management, fault
      tolerance, etc.
Sample Applications
   Distributed Grep
   Count of URL Access Frequency
   Reverse Web-Link Graph
   Inverted Index
   Distributed Sort
MapReduce

   Automatic parallelization & distribution
   Fault-tolerant
   Provides status and monitoring tools
   Clean abstraction for programmers
Programming Model

   Borrows from functional programming
   Users implement interface of two functions:

    • map (in_key, in_value) ->
       (out_key, intermediate_value) list

    • reduce (out_key, intermediate_value list) ->
       out_value list
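
Restating this interface in the notation of the earlier functional-programming slides (an illustrative SML sketch of the types only, not Google's actual API):

type ('k1, 'v1, 'k2, 'v2) mapper  = 'k1 * 'v1      -> ('k2 * 'v2) list
type ('k2, 'v2, 'v3)      reducer = 'k2 * 'v2 list -> 'v3 list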
Map

   Records from the data source (lines out of files, rows of a
     database, etc.) are fed into the map function as key*value pairs:
    e.g., (filename, line)
   map() produces one or more intermediate values along with an
    output key from the input
   Buffers intermediate values in memory before periodically
    writing to local disk
   Writes are split into R regions based on intermediate key value
    (e.g., hash(key) mod R)
      • Locations of regions are communicated back to the master, which informs
        reduce tasks of the appropriate disk locations
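
A minimal sketch of the partitioning rule just described, assuming string keys and a simple illustrative hash function (hash and partition are hypothetical names, not Google's):

fun hash s =
      foldl (fn (c, a) => 0w31 * a + Word.fromInt (Char.ord c)) 0w0 (explode s)
fun partition r key =
      Word.toInt (Word.mod (hash key, Word.fromInt r))   (* region index in 0 .. r-1 *)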
Reduce

   After the map phase is over, all the intermediate values for a
    given output key are combined together into a list
     • RPC over GFS to gather all the keys for a given region
     • Sort all keys since the same key can in general come from
       multiple map processes
   reduce() combines those intermediate values into one or more
    final values for that same output key
   Optional combine() phase as an optimization
[diagram: input key*value pairs from Data store 1 ... Data store n are fed to parallel map tasks, each emitting (key 1, values ...), (key 2, values ...), (key 3, values ...) pairs; a barrier aggregates intermediate values by output key; one reduce task per key then produces the final values for key 1, key 2, and key 3]
Parallelism

   map() functions run in parallel, creating different intermediate
    values from different input data sets
   reduce() functions also run in parallel, each working on a
    different output key
   All values are processed independently
   Bottleneck: reduce phase cannot start until map phase
    completes
Example: Count word occurrences
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // output_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
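
For concreteness, here is a hypothetical single-machine simulation of this flow in SML, matching the earlier functional-programming slides; mapReduce, insertPair, wcMap and wcReduce are illustrative names, not part of Google's library:

(* collect one (key, value) pair into an association list of (key, value list) *)
fun insertPair ((k, v), []) = [(k, [v])]
  | insertPair ((k, v), (k', vs) :: rest) =
      if k = k' then (k', v :: vs) :: rest
      else (k', vs) :: insertPair ((k, v), rest)

fun mapReduce mapf reducef inputs =
  let
    val intermediate = List.concat (map mapf inputs)   (* "map" phase *)
    val grouped = foldl insertPair [] intermediate     (* "shuffle": group values by key *)
  in
    List.concat (map reducef grouped)                  (* "reduce" phase *)
  end

(* word count, mirroring the pseudo-code above *)
fun wcMap (_, contents) =
      map (fn w => (w, 1)) (String.tokens Char.isSpace contents)
fun wcReduce (word, counts) = [(word, foldl (fn (x, a) => x + a) 0 counts)]

val counts = mapReduce wcMap wcReduce
               [("doc1", "the cat sat"), ("doc2", "the cat")]
(* counts = [("the", 2), ("cat", 2), ("sat", 1)] *)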
Example vs. Actual Source Code

   Example is written in pseudo-code
   Actual implementation is in C++, using a MapReduce library
   Bindings for Python and Java exist via interfaces
   True code is somewhat more involved (defines how the input
    key/values are divided up and accessed, etc.)
Implementation
Locality

   Master program divides up tasks based on location of data:
    tries to have map() tasks on same machine as physical file data
    • Failing that, on the same switch where bandwidth is relatively
      plentiful
    • Datacenter communications architecture?
   map() task inputs are divided into 64 MB blocks: same size as
    Google File System chunks
Fault Tolerance

   Master detects worker failures
    • Re-executes completed & in-progress map() tasks
    • Re-executes in-progress reduce() tasks
    • Importance of deterministic operations
   Data written to temporary files by both map() and reduce()
    • Upon successful completion, map() tells master of file names
         The master ignores the message if it has already heard from another map worker for the same task
    • Upon successful completion, reduce() atomically renames file
   Master notices particular input key/values cause crashes in
    map(), and skips those values on re-execution.
    • Effect: Can work around bugs in third-party libraries
Optimizations

   No reduce can start until map is complete:
     • A single slow disk controller can rate-limit the whole process
   Master redundantly executes “slow-moving” map tasks; uses
    results of first copy to finish




    Why is it safe to redundantly execute map tasks? Wouldn’t this mess up
    the total computation?
Optimizations

   “Combiner” functions can run on same machine as a mapper
   Causes a mini-reduce phase to occur before the real reduce
    phase, to save bandwidth
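
For word count specifically, the combiner can reuse the reduce logic, since summing counts is associative and commutative; continuing the hypothetical SML sketch from the word-count slide:

val wcCombine = wcReduce   (* partial counts are summed on the mapper's machine first *)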
Performance Evaluation
   1800 Machines
    • Gigabit Ethernet Switches
    • Two level tree hierarchy, 100-200 Gbps at root
     • 4 GB RAM, dual 2 GHz Intel processors
    • Two 160GB drives
   Grep: 10^10 100-byte records (1 TB)
     • Search for a relatively rare 3-character sequence
   Sort: 10^10 100-byte records (1 TB)
Grep Data Transfer Rate
Sort: Normal Execution
Sort: No Backup Tasks
MapReduce Conclusions

   MapReduce has proven to be a useful abstraction
   Greatly simplifies large-scale computations at Google
   Functional programming paradigm can be applied to large-scale
    applications
   Focus on problem, let library deal w/ messy details
