2. Algorithms
SORT
map (key, values):
  for each val in values:
    emit (val)
No reduce needed.
The emitted values act as intermediate keys, so shuffle/sort orders them automatically.
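A minimal Python sketch of the same idea, assuming a single-process stand-in for the framework. The hypothetical `run_sort` helper plays the role of shuffle/sort by ordering the intermediate keys; the function names are illustrative and not part of any Hadoop API.

```python
def map_sort(key, values):
    """Mapper: emit each value as an intermediate key (no payload needed)."""
    for val in values:
        yield val, None


def run_sort(records):
    """Hypothetical driver standing in for the framework's shuffle/sort phase."""
    intermediate = []
    for key, values in records:
        intermediate.extend(map_sort(key, values))
    # Shuffle/sort orders intermediate pairs by key; no reducer is required.
    intermediate.sort(key=lambda pair: pair[0])
    return [k for k, _ in intermediate]


if __name__ == "__main__":
    print(run_sort([("split1", [5, 3, 9]), ("split2", [1, 7])]))
```

The mapper does no comparison work at all; the ordering comes entirely from the sort on the intermediate key.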
2011 IPM - HPC4 2
3. Algorithms
INVERTED INDEX
File1: aa bb cc
File2: bb cc
Result -> ("aa", "File1") ("bb", "File1,File2") ("cc", "File1,File2")
map (key, values):
  for each val in values:
    emit (val, key)
reduce (key, values):
  string str
  for each val in values:
    str += val + ","
  emit (key, str)
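The pseudocode above can be sketched in Python as follows. This is an illustration under stated assumptions: `build_index` is a hypothetical in-memory driver that replaces the framework's grouping step, and the reducer also removes duplicates and sorts file names, which the slide's reducer does not do.

```python
from collections import defaultdict


def map_index(filename, words):
    """Mapper: emit (word, filename) for every word in the file."""
    for word in words:
        yield word, filename


def reduce_index(word, filenames):
    """Reducer: join the distinct file names that contain the word."""
    return word, ",".join(sorted(set(filenames)))


def build_index(files):
    """Hypothetical single-process driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for filename, text in files.items():
        for word, name in map_index(filename, text.split()):
            groups[word].append(name)
    return dict(reduce_index(w, names) for w, names in sorted(groups.items()))


if __name__ == "__main__":
    print(build_index({"File1": "aa bb cc", "File2": "bb cc"}))
```

Joining with `",".join(...)` avoids the trailing comma that plain string concatenation would leave behind.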
5. Algorithms
INNER JOIN
Map()
  if (type == PK) emit ((a_id, 'A'), a_data)
  else emit ((a_id, 'B'), b_data)
-> Secondary sort: intermediate values are ordered by (key, keyType), where keyType marks PK or FK
=> The primary key always comes before the foreign key
Reduce()
  string a_data_val
  if (key.keyType == 'A') a_data_val = value.data
  if (key.keyType == 'B') emit (key.a_id, a_data_val, value.data)
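A rough Python sketch of this reduce-side join, assuming records arrive as `(table, a_id, data)` tuples. The record layout and the `run_join` driver are hypothetical; the driver emulates the secondary sort by sorting on the composite key and then grouping on `a_id` only.

```python
from itertools import groupby


def map_join(record):
    """Mapper: tag primary-key rows 'A' and foreign-key rows 'B'."""
    table, a_id, data = record
    key_type = "A" if table == "PK" else "B"
    yield (a_id, key_type), data


def reduce_join(a_id, tagged_values):
    """Reducer: remember the 'A' row, then pair it with every 'B' row."""
    a_data_val = None
    for key_type, data in tagged_values:
        if key_type == "A":
            a_data_val = data
        elif a_data_val is not None:  # inner join: skip rows with no match
            yield a_id, a_data_val, data


def run_join(records):
    """Hypothetical driver: secondary sort on (a_id, key_type), group by a_id."""
    pairs = [kv for rec in records for kv in map_join(rec)]
    pairs.sort(key=lambda kv: kv[0])  # 'A' sorts before 'B' within each a_id
    joined = []
    for a_id, group in groupby(pairs, key=lambda kv: kv[0][0]):
        tagged = [(key[1], data) for key, data in group]
        joined.extend(reduce_join(a_id, tagged))
    return joined
```

Because the 'A' row is guaranteed to arrive first, the reducer holds only one row in memory instead of buffering all values for the key.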
6. Algorithms
Standard Deviation.
Weather datasets: classify abnormal climatic conditions.
Standard deviation is one of the measures of dispersion; it describes the spread of the data.
Standard deviations away from mean   Abnormality              Probability of occurrence
beyond -3 sd                         extremely subnormal       0.15%
-3 to -2 sd                          greatly subnormal         2.35%
-2 to -1 sd                          subnormal                13.50%
-1 to +1 sd                          normal                   68.00%
+1 to +2 sd                          above normal             13.50%
+2 to +3 sd                          greatly above normal      2.35%
beyond +3 sd                         extremely above normal    0.15%
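As an illustration, the bands can be applied to a single reading once the mean and standard deviation are known. The `classify` function below is a hypothetical helper; how readings that fall exactly on a band boundary are assigned is an assumption, since the table does not say.

```python
def classify(value, mean, sd):
    """Map a reading to the abnormality label for its distance from the mean."""
    z = (value - mean) / sd  # number of standard deviations from the mean
    if z < -3:
        return "extremely subnormal"
    if z < -2:
        return "greatly subnormal"
    if z < -1:
        return "subnormal"
    if z <= 1:
        return "normal"
    if z <= 2:
        return "above normal"
    if z <= 3:
        return "greatly above normal"
    return "extremely above normal"


if __name__ == "__main__":
    print(classify(15.0, mean=10.0, sd=2.0))
```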
7. Algorithms
Weather dataset : http://www.ncdc.noaa.gov/
0200010570999992011010106004...000010021019N0250001N1-01401-01591999999ADDAA112...70002;
0114010570999992011010112004...000010021019N0750001N1-00901-01081999999ADDAY1818...693/;
0114010570999992011012712004...005010300019N0750001N1+00131-00581999999ADDAY1310...3945;
Extract date, temperature and quality.
The process should:
  Filter by quality.
  Calculate the mean temperature for each date.
  Calculate the standard deviation of the temperature for each date.
8. Algorithms
Standard deviation
Map()
{ if quality = …
    Emit(date, temp) }
Reduce(date, temp)
{
  n = size(temp)
  μ = ∑temp / n;
  σ = √( ∑(temp_i – μ)² / n )
  Emit(date, σ)
}
Can we use a combiner?
All processing is done in the reducers, so there is no load balancing across nodes.
Bottleneck if there are many samples per date (the temperature array becomes too big).
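A small Python sketch of this naive version makes the bottleneck visible: the reducer needs the complete list of temperatures for a date before it can compute anything. The boolean `quality_ok` flag is a placeholder for the quality test, which the slide leaves unspecified, and `run_naive` is a hypothetical in-memory driver.

```python
import math
from collections import defaultdict


def map_reading(date, temp, quality_ok):
    """Mapper: emit (date, temp) only for readings that pass the filter."""
    if quality_ok:
        yield date, temp


def reduce_stddev(date, temps):
    """Reducer: needs every temperature of the date in memory at once."""
    n = len(temps)
    mu = sum(temps) / n
    sigma = math.sqrt(sum((t - mu) ** 2 for t in temps) / n)
    return date, sigma


def run_naive(readings):
    """Hypothetical driver: group mapper output by date, then reduce."""
    groups = defaultdict(list)
    for date, temp, quality_ok in readings:
        for d, t in map_reading(date, temp, quality_ok):
            groups[d].append(t)
    return dict(reduce_stddev(d, ts) for d, ts in groups.items())
```

The deviation from the mean cannot be computed until the mean is known, so this form gives a combiner nothing useful to pre-aggregate.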
9. Algorithms
Standard deviation can be expressed differently:
Map()
{ Emit(date, [1, temp, temp²]) }
Combine(date, [[n, sum, sum2]])
{ Emit(date, [∑n, ∑sum, ∑sum2]) }
Reduce(date, [[n, sum, sum2]])
{
  μ = ∑sum / ∑n;
  σ = √( (∑sum2 / ∑n) - μ² );
  Emit(date, σ)
}
The combiner contains the associative part of the calculation.
It is executed on the mapper nodes -> much better load balancing.
But is the combiner always executed?
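A minimal Python sketch of the combiner-friendly formulation. Hadoop does not guarantee that a combiner runs (it may be applied zero, one or several times), so mapper output and combiner output share the same `(n, sum, sum2)` shape here and the reducer accepts either. The function names are illustrative, and the clamp to zero is an added safeguard against floating-point rounding that is not on the slide.

```python
import math


def map_stats(date, temp):
    """Mapper: emit a partial summary [n, sum, sum of squares] per reading."""
    return date, (1, temp, temp * temp)


def combine_stats(partials):
    """Combiner: element-wise sum; associative, so it may run any number of times."""
    n = sum(p[0] for p in partials)
    s = sum(p[1] for p in partials)
    s2 = sum(p[2] for p in partials)
    return n, s, s2


def reduce_stddev(date, partials):
    """Reducer: merge the summaries, then apply sigma = sqrt(sum2/n - mu^2)."""
    n, s, s2 = combine_stats(partials)
    mu = s / n
    variance = max(s2 / n - mu * mu, 0.0)  # guard against tiny negative rounding
    return date, math.sqrt(variance)
```

Feeding the reducer raw mapper output or pre-combined summaries gives the same σ, which is what makes the combiner safe to treat as an optional optimization.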
10. Reference
http://www.cloudera.com
Hadoop: The Definitive Guide, Tom White
Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer
Beautiful Data, Toby Segaran and Jeff Hammerbacher