This document provides an overview of the MapReduce pattern and various MapReduce frameworks including Google MapReduce, Hadoop, and Qizmt.
- The MapReduce pattern provides automatic parallelization and distribution, fault tolerance, and tools for monitoring jobs. It offers a clean abstraction for programmers.
- Qizmt is a MapReduce framework for Windows that lets you write MapReduce jobs in C# and debug them in an IDE. It supports features such as delta-only exchange and dynamically adding machines to a cluster.
3. • Automatic parallelization & distribution
• Fault-tolerant
• Provides status and monitoring tools
• Clean abstraction for programmers
4. MAP REDUCE
• Google
• MapReduce
• PageRank, crawler, Google Maps
• Hadoop
• Map function, reduce function
• Qizmt
• C# map function, reduce function
• etc.
• C++, C#, Java, Haskell
• http://en.wikipedia.org/wiki/MapReduce
5. MAP
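map f lst: ('a->'b) -> ('a list) -> ('b list)
Applies f to each element of lst and returns the list of results.
In MapReduce, the map function emits intermediate <key, value> pairs.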
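As a rough illustration of the signature above, here is a minimal C++ sketch of map as a higher-order function. The name map_list is hypothetical and does not come from any MapReduce framework; the result type B is given explicitly by the caller.

```cpp
#include <vector>

// Hypothetical helper, not part of any framework.
// map f lst : ('a -> 'b) -> 'a list -> 'b list
template <typename B, typename A, typename F>
std::vector<B> map_list(F f, const std::vector<A>& lst) {
  std::vector<B> out;
  out.reserve(lst.size());
  for (const A& x : lst) {
    out.push_back(f(x));  // apply f to each element independently
  }
  return out;
}
```

Because each call to f depends only on its own element, the calls could in principle run on different machines, which is the property the MapReduce pattern relies on.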
6. REDUCE
(= fold, accumulate, compress, inject)
fold f x0 lst: ('a*'b->'b)->'b->('a list)->'b
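Combines the elements of lst into a single value: f is applied to each element and the accumulator, starting from x0.
In MapReduce, the values that share a key are passed to reduce together.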
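The fold signature above can be sketched in C++ as follows. The helper name fold is hypothetical; the standard library's std::accumulate provides similar behavior, with the accumulator as the first argument of the combining function instead of the second.

```cpp
#include <vector>

// Hypothetical helper, not part of any framework.
// fold f x0 lst : ('a * 'b -> 'b) -> 'b -> 'a list -> 'b
template <typename A, typename B, typename F>
B fold(F f, B x0, const std::vector<A>& lst) {
  B acc = x0;  // the accumulator starts at x0
  for (const A& x : lst) {
    acc = f(x, acc);  // f takes (element, accumulator) and returns the new accumulator
  }
  return acc;
}
```

Summing a list of counts, as a word-count reducer does, is then fold with addition and an initial value of 0.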
7. MAPREDUCE ?
• Input format
• Input location
• Map function
• Reduce function
• Output format
• Output location
Provided by user: Job Configuration, Input Format, Input Locations, Map Function, Number of Reduce Tasks, Reduce Function, Output Key Type, Output Value Type, Output Format, Output Location
Provided by Hadoop framework: Input Splitting & Distribution; Start of Individual Map Tasks; Shuffle, Partition/Sort per Map Output; Merge Sort for Map Outputs for Each Reduce Task; Start of Individual Reduce Tasks; Collection of Final Output
Figure 2-1. Parts of a MapReduce job
The user is responsible for handling the job setup, specifying the input
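The division of labor in Figure 2-1 can be sketched as a single-process driver: the user supplies the map and reduce functions, and the driver does the grouping and sorting that a framework would do across machines. This is only an illustrative sketch, not Hadoop's API; run_job, MapFn, ReduceFn, and KV are hypothetical names, and keys and values are assumed to be strings.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

using KV = std::pair<std::string, std::string>;
// User-provided pieces (hypothetical types, not a real framework API).
using MapFn = std::function<std::vector<KV>(const std::string& record)>;
using ReduceFn = std::function<std::string(const std::string& key,
                                           const std::vector<std::string>& values)>;

// Framework-provided pieces, collapsed into one local function.
std::map<std::string, std::string> run_job(const std::vector<std::string>& records,
                                           const MapFn& map_fn,
                                           const ReduceFn& reduce_fn) {
  // Map phase plus shuffle: group every emitted value under its key.
  // std::map keeps the keys sorted, standing in for the sort step.
  std::map<std::string, std::vector<std::string>> groups;
  for (const std::string& record : records) {
    for (const KV& kv : map_fn(record)) {
      groups[kv.first].push_back(kv.second);
    }
  }
  // Reduce phase: one reduce call per distinct key.
  std::map<std::string, std::string> output;
  for (const auto& group : groups) {
    output[group.first] = reduce_fn(group.first, group.second);
  }
  return output;
}
```

A real framework additionally splits the input, schedules map and reduce tasks on many machines, and reruns failed tasks; none of that is modeled here.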
11. ?
PROGRAM                 | Map function              | Reduce function
Distributed Grep        | matched lines             | pass (identity)
Reverse Web link graph  | <target, source>          | <target, list(src)>
URL access frequency    | <URL, 1>                  | <URL, total count>
Term-Vector per Host    | <hostname, term-vector>   | <hostname, all-term-vector>
Inverted Index          | <word, doc id>            | <word, list(doc id)>
Distributed Sort        | <key, value>              | pass (identity)
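To make one row of the table concrete, here is a minimal local sketch of the Inverted Index program in C++. All function names are hypothetical, document ids are assumed to be the positions of the documents in the input vector, and the de-duplication of ids in the reduce step is an added assumption.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Map: emit <word, doc id> for every whitespace-separated word in the document.
std::vector<std::pair<std::string, int>> map_inverted_index(int doc_id,
                                                            const std::string& text) {
  std::vector<std::pair<std::string, int>> out;
  std::istringstream in(text);
  std::string word;
  while (in >> word) {
    out.emplace_back(word, doc_id);
  }
  return out;
}

// Reduce: turn all ids seen for one word into a sorted list without duplicates.
std::vector<int> reduce_inverted_index(const std::vector<int>& doc_ids) {
  std::set<int> unique(doc_ids.begin(), doc_ids.end());
  return std::vector<int>(unique.begin(), unique.end());
}

// Local stand-in for the framework: group by word, then reduce each group.
std::map<std::string, std::vector<int>> build_inverted_index(
    const std::vector<std::string>& docs) {
  std::map<std::string, std::vector<int>> grouped;
  for (std::size_t id = 0; id < docs.size(); ++id) {
    for (const auto& kv : map_inverted_index(static_cast<int>(id), docs[id])) {
      grouped[kv.first].push_back(kv.second);
    }
  }
  for (auto& entry : grouped) {
    entry.second = reduce_inverted_index(entry.second);
  }
  return grouped;
}
```

The other rows follow the same shape: only the pair emitted by the map step and the way the reduce step combines the values change.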
12.
13. CLUSTER - HADOOP
• Master
• Name node
• Job tracker
• Slave (= Worker)
• Data node
• Task tracker
Enable Job Control Options on the Web Interfaces
Both the JobTracker and the NameNode provide a web interface for monitoring and control. By default, the JobTracker provides web service on [...] and the NameNode provides web service on [...]. If the [...] parameter is set to [...], the JobTracker web interface will add [...] and Change Job Priority options to the per-job detail page. The default location of the additional options is the bottom-left corner of the page (so you usually need to scroll the page to see them).
A Sample Cluster Configuration
In this section, we will walk through a simple configuration of a six-node Hadoop cluster. The cluster will be composed of six machines: [...]. The JobTracker and NameNode will reside on the machine [...]. The [...] NameNode will be placed on [...]. The DataNodes and TaskTrackers will be [...] the same machines, and the nodes will be named [...] through [...]. Figure 3-2 shows this setup.
Master: NameNode (http://master:50070/), JobTracker (http://master:50030/)
Slave01 to Slave05: DataNode and TaskTracker on each machine
Figure 3-2. A simple six-node cluster
17. EXAMPLE - WORDCOUNT
A Word Frequency
This section contains a program that counts the number of occurrences of each unique word in a set of input files specified on the command line.

#include "mapreduce/mapreduce.h"

// User's map function
class WordCounter : public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    const string& text = input.value();
    const int n = text.size();
    for (int i = 0; i < n; ) {
      // Skip past leading whitespace
      while ((i < n) && isspace(text[i]))
        i++;

      // Find word end
      int start = i;
      while ((i < n) && !isspace(text[i]))
        i++;
      if (start < i)
        Emit(text.substr(start,i-start),"1");
    }
  }
};
REGISTER_MAPPER(WordCounter);

// User's reduce function
class Adder : public Reducer {
  virtual void Reduce(ReduceInput* input) {
    // Iterate over all entries with the
    // same key and add the values
    int64 value = 0;
    while (!input->done()) {
      value += StringToInt(input->value());
      input->NextValue();
    }

    // Emit sum for input->key()
    Emit(IntToString(value));
  }
};
REGISTER_REDUCER(Adder);

int main(int argc, char** argv) {
  ParseCommandLineFlags(argc, argv);

  MapReduceSpecification spec;

  // Store list of input files into "spec"
  for (int i = 1; i < argc; i++) {
    MapReduceInput* input = spec.add_input();
    input->set_format("text");
    input->set_filepattern(argv[i]);
    input->set_mapper_class("WordCounter");
  }

  // Specify the output files:
  //    /gfs/test/freq-00000-of-00100
  //    /gfs/test/freq-00001-of-00100
  //    ...
  MapReduceOutput* out = spec.output();
  out->set_filebase("/gfs/test/freq");
  out->set_num_tasks(100);
  out->set_format("text");
  out->set_reducer_class("Adder");

  // Optional: do partial sums within map
  // tasks to save network bandwidth
  out->set_combiner_class("Adder");

  // Tuning parameters: use at most 2000
  // machines and 100 MB of memory per task
  spec.set_machines(2000);
  spec.set_map_megabytes(100);
  spec.set_reduce_megabytes(100);

  // Now run it
  MapReduceResult result;
  if (!MapReduce(spec, &result)) abort();

  // Done: 'result' structure contains info
  // about counters, time taken, number of
  // machines used, etc.

  return 0;
}

To appear in OSDI 2004
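The WordCount listing depends on Google's internal mapreduce.h, so it cannot be compiled outside that environment. The sketch below is a self-contained approximation in standard C++ that shows what the optional combiner (set_combiner_class("Adder")) buys: partial sums inside each map task reduce the number of pairs that would cross the network. The names map_words, add_counts, word_count, and WordCountResult are hypothetical, each input string is treated as one map task, and counts are carried as numbers instead of strings.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Pair = std::pair<std::string, long>;

// Same whitespace tokenization as WordCounter::Map, emitting <word, 1>.
std::vector<Pair> map_words(const std::string& text) {
  std::vector<Pair> out;
  const std::size_t n = text.size();
  for (std::size_t i = 0; i < n; ) {
    while (i < n && std::isspace(static_cast<unsigned char>(text[i]))) ++i;
    const std::size_t start = i;
    while (i < n && !std::isspace(static_cast<unsigned char>(text[i]))) ++i;
    if (start < i) out.emplace_back(text.substr(start, i - start), 1L);
  }
  return out;
}

// Plays the role of Adder: sums the counts per key.
// Addition is associative, so it can serve as both combiner and reducer.
std::map<std::string, long> add_counts(const std::vector<Pair>& pairs) {
  std::map<std::string, long> sums;
  for (const Pair& p : pairs) sums[p.first] += p.second;
  return sums;
}

struct WordCountResult {
  std::map<std::string, long> counts;  // final <word, total> output
  std::size_t pairs_shuffled;          // pairs handed from map side to reduce side
};

WordCountResult word_count(const std::vector<std::string>& inputs, bool use_combiner) {
  std::vector<Pair> shuffled;
  for (const std::string& text : inputs) {  // one simulated map task per input
    std::vector<Pair> emitted = map_words(text);
    if (use_combiner) {
      // Partial sums within the map task: one pair per distinct word.
      for (const auto& kv : add_counts(emitted)) {
        shuffled.emplace_back(kv.first, kv.second);
      }
    } else {
      shuffled.insert(shuffled.end(), emitted.begin(), emitted.end());
    }
  }
  return {add_counts(shuffled), shuffled.size()};
}
```

With or without the combiner the final counts are the same; only pairs_shuffled differs, and the saving grows with the number of repeated words per map task.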