MAP/REDUCE ALGORITHMS

            HPC4 Seminar
                IPM
           December 2011




                                      Omid Djoudi
                               od90125@yahoo.com


2011           IPM - HPC4                       1
Algorithms
SORT
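map (key, values):
   for each val in values:
       emit (val, null)


No reduce logic needed (an identity reducer is enough)

Each value is emitted as a key, so it is sorted automatically by shuffle/sort.
With several reducers each output file is sorted on its own; a total order also needs a range partitioner.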
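
The sort pattern can be sketched in a single Python process, assuming the framework's shuffle/sort phase is modelled by sorting the intermediate pairs by key. The names `map_sort` and `run_sort` are hypothetical helpers for illustration, not part of any MapReduce API.

```python
from typing import Iterable, Iterator, List, Tuple


def map_sort(key: str, values: Iterable[int]) -> Iterator[Tuple[int, None]]:
    """Emit each value as the output key; the payload is left empty."""
    for val in values:
        yield (val, None)


def run_sort(records: Iterable[Tuple[str, List[int]]]) -> List[int]:
    """Simulate map, then shuffle/sort, then an identity reduce."""
    intermediate = [pair for key, values in records
                    for pair in map_sort(key, values)]
    # Stand-in for the framework's shuffle/sort: order the pairs by key.
    intermediate.sort(key=lambda pair: pair[0])
    return [k for k, _ in intermediate]


sorted_values = run_sort([("split1", [5, 3, 9]), ("split2", [1, 7])])
print(sorted_values)
```

The map function does no comparison work itself; the ordering comes entirely from the simulated shuffle step.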




Algorithms
INVERTED INDEX
File1: aa bb cc
File2: bb cc
Result -> ("aa", "File1") ("bb", "File1,File2") ("cc", "File1,File2")

map (key, values):
   for each val in values:
        emit (val, key)

reduce (key, values):
   string str
   for each val in values:
        if str is not empty: str += ","
        str += val
   emit (key, str)
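
A minimal Python sketch of the inverted index, assuming each input file is a whitespace-separated string and that grouping by word stands in for the shuffle phase. `map_index`, `reduce_index` and `run_index` are hypothetical names; removing duplicate file names in the reducer is an added assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_index(filename: str, words: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Emit (word, filename) for every word found in the file."""
    for word in words:
        yield (word, filename)


def reduce_index(word: str, filenames: Iterable[str]) -> Tuple[str, str]:
    """Join the distinct file names for one word into a comma-separated string."""
    return (word, ",".join(sorted(set(filenames))))


def run_index(files: Dict[str, str]) -> List[Tuple[str, str]]:
    groups = defaultdict(list)
    for filename, text in files.items():
        for word, name in map_index(filename, text.split()):
            groups[word].append(name)  # stand-in for shuffle: group by word
    return [reduce_index(word, groups[word]) for word in sorted(groups)]


index = run_index({"File1": "aa bb cc", "File2": "bb cc"})
print(index)
```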
Algorithms
INNER JOIN
Map ()
   if (type == PK)          emit ((a_id, 'A'), a_data)
   else                     emit ((a_id, 'B'), b_data)


-> Secondary sort: intermediate values are ordered by (key, keyType), keyType being PK or FK,
   and partitioned by a_id so both record types reach the same reducer
   => the primary-key record always comes before the foreign-key records

Reduce()
   string a_data_val        (kept between calls for the same a_id)
   if (key.keyType == 'A')            a_data_val = value.data
   if (key.keyType == 'B')            emit (key.a_id, a_data_val, value.data)
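
The reduce-side join can be sketched in Python as follows, assuming records arrive as `(type, a_id, data)` tuples and that sorting on the composite key `(a_id, tag)` models the secondary sort. The record layout, the sample data and the function names are hypothetical. Unlike the pseudocode, the sketch groups on `a_id` and resets `a_data_val` per group, so foreign-key rows without a matching primary-key row are dropped, as an inner join requires.

```python
from itertools import groupby
from typing import Iterable, List, Tuple

Record = Tuple[str, int, str]  # ("PK" or "FK", a_id, data)


def map_join(record: Record) -> Tuple[Tuple[int, str], str]:
    """Tag primary-key rows 'A' and foreign-key rows 'B'."""
    rec_type, a_id, data = record
    tag = "A" if rec_type == "PK" else "B"
    return ((a_id, tag), data)


def run_join(records: Iterable[Record]) -> List[Tuple[int, str, str]]:
    # Secondary sort: order by (a_id, tag) so 'A' precedes 'B' for each a_id.
    intermediate = sorted(map_join(r) for r in records)
    joined = []
    # Group on a_id only, as a grouping comparator would.
    for a_id, group in groupby(intermediate, key=lambda kv: kv[0][0]):
        a_data_val = None
        for (_, tag), data in group:
            if tag == "A":
                a_data_val = data
            elif a_data_val is not None:  # inner join: skip unmatched FK rows
                joined.append((a_id, a_data_val, data))
    return joined


rows = run_join([
    ("FK", 1, "order-17"),
    ("PK", 1, "alice"),
    ("PK", 2, "bob"),
    ("FK", 1, "order-42"),
    ("FK", 3, "order-99"),
])
print(rows)
```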
Algorithms
  Standard Deviation.
  Weather datasets: classify abnormal climatic conditions.
  StdDev is one of the measures of dispersion describing the spread of data.

Standard Deviations    Abnormality              Probability of
Away From Mean                                  Occurrence
beyond -3 sd           extremely subnormal       0.15%
-3 to -2 sd            greatly subnormal         2.35%
-2 to -1 sd            subnormal                13.50%
-1 to +1 sd            normal                   68.00%
+1 to +2 sd            above normal             13.50%
+2 to +3 sd            greatly above normal      2.35%
beyond +3 sd           extremely above normal    0.15%
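
The table can be applied as a lookup on the number of standard deviations from the mean. The sketch below is one way to do it in Python; the function name `classify` and the rule that a value exactly on a boundary falls into the higher band are assumptions, since the table does not say which side is inclusive.

```python
import bisect

# Band boundaries in standard deviations, and one label per band (7 bands).
BOUNDS = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
LABELS = [
    "extremely subnormal",
    "greatly subnormal",
    "subnormal",
    "normal",
    "above normal",
    "greatly above normal",
    "extremely above normal",
]


def classify(value: float, mean: float, sd: float) -> str:
    """Return the abnormality label for a value given the mean and std dev."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    z = (value - mean) / sd
    # bisect_right counts how many boundaries are <= z, which is the band index.
    return LABELS[bisect.bisect_right(BOUNDS, z)]


print(classify(15.0, 10.0, 2.0))
```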
Algorithms
Weather dataset : http://www.ncdc.noaa.gov/
0200010570999992011010106004...000010021019N0250001N1-01401-01591999999ADDAA112...70002;
0114010570999992011010112004...000010021019N0750001N1-00901-01081999999ADDAY1818...693/;
0114010570999992011012712004...005010300019N0750001N1+00131-00581999999ADDAY1310...3945;


Extract Date, Temperature and Quality.

The process should:
Filter by Quality
Calculate Mean for temperature on each date.
Calculate standard deviation for temperature on each date.
Algorithms
     Standard deviation


Map()
{if quality = …
   Emit(date,temp)}

Reduce(date,temp)
{
n = size(temp)
μ = ∑temp/n;
σ = √( ∑(temp_i–μ)²/n )
Emit (date, σ)
}

Can we use a combiner?
All processing is done in reducers, no load balancing across nodes.
Bottleneck if there are many samples per date (temperature array becoming too big).
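
A sketch of this direct version in Python makes the bottleneck visible: the reducer has to hold every temperature for a date before it can compute anything. The quality test is elided on the slide, so the boolean `quality_ok` flag, the record layout and the sample values are assumptions for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Reading = Tuple[str, float, bool]  # (date, temperature, quality_ok)


def map_temp(record: Reading) -> Iterator[Tuple[str, float]]:
    """Emit (date, temp) only for readings that pass the quality filter."""
    date, temp, quality_ok = record
    if quality_ok:
        yield (date, temp)


def reduce_stddev(date: str, temps: Iterable[float]) -> Tuple[str, float]:
    temps = list(temps)  # the whole array sits in memory on one reducer
    n = len(temps)
    mu = sum(temps) / n
    sigma = math.sqrt(sum((t - mu) ** 2 for t in temps) / n)
    return (date, sigma)


def run_naive(records: Iterable[Reading]) -> Dict[str, float]:
    groups = defaultdict(list)
    for rec in records:
        for date, temp in map_temp(rec):
            groups[date].append(temp)  # stand-in for shuffle: group by date
    return dict(reduce_stddev(d, ts) for d, ts in groups.items())


readings = [("20110101", t, True) for t in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)]
readings.append(("20110101", 99.9, False))  # rejected by the quality filter
result = run_naive(readings)
print(result)
```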
Algorithms
   Standard deviation can be expressed differently:
   σ² = (∑temp_i²)/n – μ²



Map(){
Emit(date,[1,temp,temp²])}

Combine(date,[[n,sum,sum2]]){
Emit (date,[∑n,∑sum,∑sum2])}

Reduce(date,[[n,sum,sum2]])
{
μ = ∑sum/ ∑n;
σ = √ ((∑(sum2) / ∑n) - μ²);
Emit (date, σ)
}



The combiner contains the associative part of the calculation.
It is executed on mapper nodes -> much better load balancing.
But is the combiner always executed? In Hadoop, no: the framework may run it zero, one or several times,
so the reducer must give the same result with or without it.
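
The following Python sketch simulates the combiner version and checks that the result is the same whether or not the combiner runs. The function names, the `use_combiner` switch and the two sample splits are hypothetical. The sum-of-squares formula can lose precision when the variance is small relative to the mean, so the sketch clamps the variance at zero before taking the root.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Stats = Tuple[int, float, float]  # (n, sum, sum of squares)


def map_stats(date: str, temp: float) -> Tuple[str, Stats]:
    return (date, (1, temp, temp * temp))


def combine(partials: Iterable[Stats]) -> Stats:
    """Associative part: add up counts, sums and sums of squares."""
    n, s, s2 = 0, 0.0, 0.0
    for pn, ps, ps2 in partials:
        n, s, s2 = n + pn, s + ps, s2 + ps2
    return (n, s, s2)


def reduce_stddev(date: str, partials: Iterable[Stats]) -> Tuple[str, float]:
    n, s, s2 = combine(partials)
    mu = s / n
    variance = max(s2 / n - mu * mu, 0.0)  # guard against rounding below zero
    return (date, math.sqrt(variance))


def run(splits: List[List[Tuple[str, float]]], use_combiner: bool) -> Dict[str, float]:
    shuffled = defaultdict(list)
    for split in splits:  # one split per simulated mapper
        local = defaultdict(list)
        for date, temp in split:
            key, stats = map_stats(date, temp)
            local[key].append(stats)
        for key, stats_list in local.items():
            if use_combiner:
                shuffled[key].append(combine(stats_list))  # one tuple per mapper
            else:
                shuffled[key].extend(stats_list)  # every raw tuple is shipped
    return dict(reduce_stddev(k, v) for k, v in shuffled.items())


splits = [
    [("20110101", 2.0), ("20110101", 4.0), ("20110101", 4.0), ("20110101", 4.0)],
    [("20110101", 5.0), ("20110101", 5.0), ("20110101", 7.0), ("20110101", 9.0)],
]
with_combiner = run(splits, use_combiner=True)
without_combiner = run(splits, use_combiner=False)
print(with_combiner, without_combiner)
```

Because the reducer reuses the same summation as the combiner, it accepts raw and pre-aggregated tuples alike, which is what makes the combiner optional.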
Reference
http://www.cloudera.com

Hadoop: The Definitive Guide
Tom White

Data-Intensive Text Processing with MapReduce
Jimmy Lin and Chris Dyer

Beautiful Data
Toby Segaran / Jeff Hammerbacher



