Map Reduce - An Introduction

Nagarjuna K
nagarjuna@outlook.com
Understanding MapReduce

• Map Reduce - An Introduction
• Word count – default
• Word count – custom
  
• Programming model to process large datasets
• Supported languages for MR:
  ◦ Java
  ◦ Ruby
  ◦ Python
  ◦ C++
• MapReduce programs are inherently parallel.
  ◦ More data → more machines to analyze.
  ◦ No need to change anything in the code.
  
• Start with the WORDCOUNT example
  ◦ “Do as I say, not as I do”

Word count:

    Word   Count
    As     2
    Do     2
    I      2
    Not    1
    Say    1
  
define wordCount as Map<String, long>;

for each document in documentSet {
    T = tokenize(document);
    for each token in T {
        wordCount[token]++;
    }
}
display(wordCount);

• This works only as long as the number of documents to process is not very large.
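As a minimal runnable Java version of this pseudocode (a sketch: whitespace tokenization and reading documents from command-line file paths are assumptions, since the slide does not define tokenize or documentSet):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.HashMap;
    import java.util.Map;

    // Single-machine word count, mirroring the pseudocode above.
    public class LocalWordCount {
        public static void main(String[] args) throws IOException {
            Map<String, Long> wordCount = new HashMap<>();
            for (String document : args) {                        // each document in documentSet
                for (String line : Files.readAllLines(Path.of(document))) {
                    for (String token : line.split("\\s+")) {     // tokenize(document)
                        if (!token.isEmpty()) {
                            wordCount.merge(token, 1L, Long::sum); // wordCount[token]++
                        }
                    }
                }
            }
            // display(wordCount)
            wordCount.forEach((word, count) -> System.out.println(word + "\t" + count));
        }
    }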
  
• Spam filter
  ◦ Millions of emails
  ◦ Word count for analysis
• Working from a single computer is time consuming
• Rewrite the program to count from multiple machines
  
• How do we attain parallel computing?
  1. All the machines compute a fraction of the documents
  2. Combine the results from all the machines
  
STAGE 1

define wordCount as Map<String, long>;

for each document in documentSubset {
    T = tokenize(document);
    for each token in T {
        wordCount[token]++;
    }
}
  
STAGE 2

define totalWordCount as Multiset;

for each wordCount received from firstPhase {
    multisetAdd(totalWordCount, wordCount);
}
display(totalWordCount);
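In Java terms, the stage-2 combine is just a map merge (a sketch: the pseudocode's Multiset is approximated here with a HashMap of counts, matching stage 1's output type):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Stage 2: merge the per-machine word counts produced by stage 1.
    public class Stage2Combiner {
        public static Map<String, Long> combine(List<Map<String, Long>> perMachineCounts) {
            Map<String, Long> totalWordCount = new HashMap<>();
            for (Map<String, Long> wordCount : perMachineCounts) {
                for (Map.Entry<String, Long> entry : wordCount.entrySet()) {
                    // multisetAdd(totalWordCount, wordCount)
                    totalWordCount.merge(entry.getKey(), entry.getValue(), Long::sum);
                }
            }
            return totalWordCount;
        }
    }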
  
[Diagram: a Master distributes the Documents across four worker machines, Comp-1 through Comp-4.]
  
Problems: STAGE 1

• Document segregation must be well defined
• Bottleneck in network transfer
  ◦ This is data-intensive, not computation-intensive, processing
  ◦ So it is better to store the files on the processing machines
• BIGGEST FLAW
  ◦ Storing the words and counts in memory
  ◦ A disk-based hash-table implementation is needed
  
Problems: STAGE 2

• Phase 2 has only one machine
  ◦ Bottleneck
  ◦ Phase 1 is highly distributed, though
• Make phase 2 also distributed
• This needs changes in phase 1
  ◦ Partition the phase-1 output (say, based on the first character of the word), as sketched below
  ◦ We then have 26 machines in phase 2
  ◦ The single disk-based hash-table becomes 26 disk-based hash-tables
  ◦ wordcount-a, wordcount-b, wordcount-c, …
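A sketch of that first-character idea using Hadoop's Java Partitioner API (the class is illustrative, not Hadoop's default; Hadoop normally uses HashPartitioner):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hypothetical partitioner that routes words by first letter,
    // mimicking the "26 machines, one per letter" scheme above.
    public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text word, IntWritable count, int numPartitions) {
            if (word.getLength() == 0) {
                return 0;
            }
            char first = Character.toLowerCase(word.toString().charAt(0));
            int bucket = (first >= 'a' && first <= 'z') ? (first - 'a') : 0;
            // Fold the 26 letter buckets onto however many reducers actually run.
            return bucket % numPartitions;
        }
    }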
  
[Diagram: each mapper machine (Comp-1 … Comp-4) builds a per-letter count table, e.g. A:1 B:2 C:4 D:5 E:10 on one machine and A:10 B:20 C:40 D:5 E:9 on another; the per-letter partitions are routed to the phase-2 machines Comp-10, Comp-20, Comp-30, Comp-40.]
  
• After phase-1
  ◦ From comp-1:
    ▪ WordCount-A → comp-10
    ▪ WordCount-B → comp-20
    ▪ …
• Each machine in phase 1 will shuffle its output to different machines in phase 2
  
• This is getting complicated
  ◦ Store files where they are being processed
  ◦ Write a disk-based hash table, obviating RAM limitations
  ◦ Partition the phase-1 output
  ◦ Shuffle the phase-1 output and send it to the appropriate reducer
  
• This is already a lot of machinery just for word count
• We haven't even touched fault tolerance
  ◦ What if comp-1 or comp-10 fails?
• So there is a need for a framework to take care of all these things
  ◦ We concentrate only on the business logic
  
Interim

[Diagram: documents stored in HDFS feed the MAPPER machines (Comp-1 … Comp-4); the mapper output is partitioned, shuffled, and delivered to the REDUCER machines (Comp-10 … Comp-40).]
  
• Mapper
• Reducer

The Mapper filters and transforms the input.
The Reducer collects that output and aggregates over it.

Extensive research was done to arrive at this two-phase strategy.
  
• Mapper, Reducer, Partitioner, Shuffling
  ◦ Work together → a common structure for data processing

                Input            Output
    Mapper      <K1, V1>         List<K2, V2>
    Reducer     <K2, List<V2>>   List<K3, V3>
  
• Mapper
  ◦ Input: <key, words_per_line>
  ◦ Output: <word, 1>
• Reducer
  ◦ Input: <word, list(1)>
  ◦ Output: <word, count(list(1))>
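These two signatures in Hadoop's Java API look roughly like the classic WordCount mapper and reducer (a sketch; class names are illustrative):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // <K1,V1> = <byte offset, line of text>; emits List<K2,V2> = <word, 1>.
    public class TokenCounterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // <word, 1>
            }
        }
    }

    // <K2,list(V2)> = <word, list(1)>; emits <word, count(list(1))>.
    class TokenCounterReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> ones, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable one : ones) {
                sum += one.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }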
  
• As said, don't store the data in memory
  ◦ So keys and values regularly have to be written to disk.
  ◦ They must be serialized.
  ◦ Hadoop provides its own serialization and deserialization mechanism.
  ◦ Any class used as a key or value has to implement the WRITABLE interface.
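For illustration, a minimal custom value type implementing Writable (a hypothetical class; note that keys additionally need WritableComparable so they can be sorted during the shuffle):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.Writable;

    // Hypothetical value type: a word's total count plus the number
    // of documents it appeared in.
    public class CountStats implements Writable {
        private long totalCount;
        private long documentCount;

        public CountStats() { }   // Hadoop requires a no-arg constructor.

        public CountStats(long totalCount, long documentCount) {
            this.totalCount = totalCount;
            this.documentCount = documentCount;
        }

        @Override
        public void write(DataOutput out) throws IOException {     // Serialize to disk.
            out.writeLong(totalCount);
            out.writeLong(documentCount);
        }

        @Override
        public void readFields(DataInput in) throws IOException {  // Deserialize from disk.
            totalCount = in.readLong();
            documentCount = in.readLong();
        }
    }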
  
    Java Type    Hadoop Serialized Type
    String       Text
    Integer      IntWritable
    Long         LongWritable
  
• Let's try to execute the following commands:

    hadoop jar hadoop-examples-0.20.2-cdh3u4.jar wordcount

    hadoop jar hadoop-examples-0.20.2-cdh3u4.jar wordcount <input> <output>

• What does this code do?
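The driver that wires such a job together would look roughly like this (a sketch against the Hadoop 2.x mapreduce API; on the 0.20 jar above the job would be built with new Job(conf, ...) instead, and the mapper/reducer classes are the ones sketched earlier):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Sketch of a WordCount driver: configures the job and points it at <input>/<output>.
    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(TokenCounterMapper.class);
            job.setReducerClass(TokenCounterReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // <input>
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // <output>
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }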
  
• Switch to Eclipse
