SlideShare uma empresa Scribd logo
1 de 22
Baixar para ler offline
Introduction To Hadoop

                                    Kenneth Heafield

                                          Google Inc


                                    January 14, 2008



Example code from Hadoop 0.13.1 used under the Apache License Version 2.0
and modified for presentation. Except as otherwise noted, the content of this
presentation is licensed under the Creative Commons Attribution 2.5 License.


 Kenneth Heafield (Google Inc)         Introduction To Hadoop   January 14, 2008   1 / 12
Outline


1    Word Count Code
      Mapper
      Reducer
      Main

2    How it Works
       Serialization
       Data Flow

3    Lab




    Kenneth Heafield (Google Inc)   Introduction To Hadoop   January 14, 2008   2 / 12
Word Count Code   Mapper


Mapper

< “wikipedia.org”, “The Free” >→ < “The”, 1 >, < “Free”, 1 >

public void map(WritableComparable key,
    Writable value, OutputCollector output,
    Reporter reporter) throws IOException {
  String line = ((Text)value).toString();
  StringTokenizer itr = new StringTokenizer(line);
  Text word = new Text();
  while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    output.collect(word, new IntWritable(1));
  }
}


 Kenneth Heafield (Google Inc)         Introduction To Hadoop   January 14, 2008   3 / 12
Word Count Code   Mapper


Mapper

< “wikipedia.org”, “The Free” >→ < “The”, 1 >, < “Free”, 1 >

public void map(WritableComparable key,
    Writable value, OutputCollector output,
    Reporter reporter) throws IOException {
  String line = ((Text)value).toString();
  StringTokenizer itr = new StringTokenizer(line);
  Text word = new Text();
  while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    output.collect(word, new IntWritable(1));
  }
}


 Kenneth Heafield (Google Inc)         Introduction To Hadoop   January 14, 2008   3 / 12
Word Count Code   Mapper


Mapper

< “wikipedia.org”, “The Free” >→ < “The”, 1 >, < “Free”, 1 >

public void map(WritableComparable key,
    Writable value, OutputCollector output,
    Reporter reporter) throws IOException {
  String line = ((Text)value).toString();
  StringTokenizer itr = new StringTokenizer(line);
  Text word = new Text();
  while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    output.collect(word, new IntWritable(1));
  }
}


 Kenneth Heafield (Google Inc)         Introduction To Hadoop   January 14, 2008   3 / 12
Word Count Code   Mapper


Mapper

< “wikipedia.org”, “The Free” >→ < “The”, 1 >, < “Free”, 1 >

public void map(WritableComparable key,
    Writable value, OutputCollector output,
    Reporter reporter) throws IOException {
  String line = ((Text)value).toString();
  StringTokenizer itr = new StringTokenizer(line);
  Text word = new Text();
  while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    output.collect(word, new IntWritable(1));
  }
}


 Kenneth Heafield (Google Inc)         Introduction To Hadoop   January 14, 2008   3 / 12
Word Count Code   Mapper


Mapper

< “wikipedia.org”, “The Free” >→ < “The”, 1 >, < “Free”, 1 >

public void map(WritableComparable key,
    Writable value, OutputCollector output,
    Reporter reporter) throws IOException {
  String line = ((Text)value).toString();
  StringTokenizer itr = new StringTokenizer(line);
  Text word = new Text();
  while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    output.collect(word, new IntWritable(1));
  }
}


 Kenneth Heafield (Google Inc)         Introduction To Hadoop   January 14, 2008   3 / 12
Word Count Code   Reducer


Reducer

< “The”, 1 >, < “The”, 1 >→ < “The”, 2 >

public void reduce(WritableComparable key,
                   Iterator values,
                   OutputCollector output,
                   Reporter reporter)
                   throws IOException {
  int sum = 0;
  while (values.hasNext()) {
    sum += ((IntWritable) values.next()).get();
  }
  output.collect(key, new IntWritable(sum));
}


 Kenneth Heafield (Google Inc)         Introduction To Hadoop   January 14, 2008   4 / 12
Word Count Code   Reducer


Reducer

< “The”, 1 >, < “The”, 1 >→ < “The”, 2 >

public void reduce(WritableComparable key,
                   Iterator values,
                   OutputCollector output,
                   Reporter reporter)
                   throws IOException {
  int sum = 0;
  while (values.hasNext()) {
    sum += ((IntWritable) values.next()).get();
  }
  output.collect(key, new IntWritable(sum));
}


 Kenneth Heafield (Google Inc)         Introduction To Hadoop   January 14, 2008   4 / 12
Word Count Code   Reducer


Reducer

< “The”, 1 >, < “The”, 1 >→ < “The”, 2 >

public void reduce(WritableComparable key,
                   Iterator values,
                   OutputCollector output,
                   Reporter reporter)
                   throws IOException {
  int sum = 0;
  while (values.hasNext()) {
    sum += ((IntWritable) values.next()).get();
  }
  output.collect(key, new IntWritable(sum));
}


 Kenneth Heafield (Google Inc)         Introduction To Hadoop   January 14, 2008   4 / 12
Word Count Code   Reducer


Reducer

< “The”, 1 >, < “The”, 1 >→ < “The”, 2 >

public void reduce(WritableComparable key,
                   Iterator values,
                   OutputCollector output,
                   Reporter reporter)
                   throws IOException {
  int sum = 0;
  while (values.hasNext()) {
    sum += ((IntWritable) values.next()).get();
  }
  output.collect(key, new IntWritable(sum));
}


 Kenneth Heafield (Google Inc)         Introduction To Hadoop   January 14, 2008   4 / 12
Word Count Code   Main


Main

public static void main(String[] args)
    throws IOException {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");
  conf.setMapperClass(MapClass.class);
  conf.setCombinerClass(ReduceClass.class);
  conf.setReducerClass(ReduceClass.class);
  conf.setNumMapTasks(new Integer(40));
  conf.setNumReduceTasks(new Integer(30));
  conf.setInputPath(new Path("/shared/wikipedia_small"));
  conf.setOutputPath(new Path("/user/kheafield/word_count"));
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);
  JobClient.runJob(conf);
}
 Kenneth Heafield (Google Inc)         Introduction To Hadoop   January 14, 2008   5 / 12
Word Count Code   Main


Main

public static void main(String[] args)
    throws IOException {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");
  conf.setMapperClass(MapClass.class);
  conf.setCombinerClass(ReduceClass.class);
  conf.setReducerClass(ReduceClass.class);
  conf.setNumMapTasks(new Integer(40));
  conf.setNumReduceTasks(new Integer(30));
  conf.setInputPath(new Path("/shared/wikipedia_small"));
  conf.setOutputPath(new Path("/user/kheafield/word_count"));
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);
  JobClient.runJob(conf);
}
 Kenneth Heafield (Google Inc)         Introduction To Hadoop   January 14, 2008   5 / 12
Word Count Code   Main


Main

public static void main(String[] args)
    throws IOException {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");
  conf.setMapperClass(MapClass.class);
  conf.setCombinerClass(ReduceClass.class);
  conf.setReducerClass(ReduceClass.class);
  conf.setNumMapTasks(new Integer(40));
  conf.setNumReduceTasks(new Integer(30));
  conf.setInputPath(new Path("/shared/wikipedia_small"));
  conf.setOutputPath(new Path("/user/kheafield/word_count"));
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);
  JobClient.runJob(conf);
}
 Kenneth Heafield (Google Inc)         Introduction To Hadoop   January 14, 2008   5 / 12
Word Count Code   Main


Main

public static void main(String[] args)
    throws IOException {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");
  conf.setMapperClass(MapClass.class);
  conf.setCombinerClass(ReduceClass.class);
  conf.setReducerClass(ReduceClass.class);
  conf.setNumMapTasks(new Integer(40));
  conf.setNumReduceTasks(new Integer(30));
  conf.setInputPath(new Path("/shared/wikipedia_small"));
  conf.setOutputPath(new Path("/user/kheafield/word_count"));
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);
  JobClient.runJob(conf);
}
 Kenneth Heafield (Google Inc)         Introduction To Hadoop   January 14, 2008   5 / 12
How it Works   Serialization


Types

Purpose
Simple serialization for keys, values, and other data

Interface Writable
     Read and write binary format
     Convert to String for text formats
     WritableComparable adds sorting order for keys

Example Implementations
    ArrayWritable is only Writable
     BooleanWritable
     IntWritable sorts in increasing order
     Text holds a String

 Kenneth Heafield (Google Inc)      Introduction To Hadoop      January 14, 2008   6 / 12
How it Works   Serialization


A Writable

public class IntPairWritable implements Writable {
  public int first;
  public int second;
  public void write(DataOutput out) throws IOException {
    out.writeInt(first);
    out.writeInt(second);
  }
  public void readFields(DataInput in) throws IOException {
    first = in.readInt();
    second = in.readInt();
  }
  public int hashCode() { return first + second; }
  public String toString() {
    return Integer.toString(first) + "," +
           Integer.toString(second);
  }
 Kenneth Heafield (Google Inc)      Introduction To Hadoop      January 14, 2008   7 / 12
How it Works   Serialization


WritableComparable Method



public int compareTo(Object other) {
  IntPairWritable o = (IntPairWritable)other;
  if (first < o.first) return -1;
  if (first > o.first) return 1;
  if (second < o.second) return -1;
  if (second > o.second) return 1;
  return 0;
}




 Kenneth Heafield (Google Inc)      Introduction To Hadoop      January 14, 2008   8 / 12
How it Works   Data Flow


Data Flow

Default Flow
 1 Mappers read from HDFS

 2   Map output is partitioned by key and sent to Reducers
 3   Reducers sort input by key
 4   Reduce output is written to HDFS

          1   HDFS              2   Mapper      3   Reducer                    4    HDFS
              Input                 Map             Sort              Reduce       Output




              Input                 Map             Sort              Reduce       Output


 Kenneth Heafield (Google Inc)                Introduction To Hadoop                January 14, 2008   9 / 12
How it Works   Data Flow


Combiners

Concept
    Add counts at Mapper before sending to Reducer.
     Word count is 6 minutes with combiners and 14 without.

Implementation
    Mapper caches output and periodically calls Combiner
     Input to Combine may be from Map or Combine
     Combiner uses interface as Reducer
              Mapper
                                Combine            Sort        Reduce        Output

Input              Map

                                 Cache             Sort        Reduce        Output
 Kenneth Heafield (Google Inc)         Introduction To Hadoop            January 14, 2008   10 / 12
Lab


Exercises


Recommended: Word Count
Get word count running.

Bigrams
Count bigrams and unigrams efficiently.

Capitalization
With what probability is a word capitalized?

Indexer
In what documents does each word appear? Where in the documents?



 Kenneth Heafield (Google Inc)   Introduction To Hadoop   January 14, 2008   11 / 12
Lab


Instructions




  1   Login to the cluster successfully (and set your password).
  2   Get Eclipse installed, so you can build Java code.
  3   Install the Hadoop plugin for Eclipse so you can deploy jobs to the
      cluster.
  4   Set up your Eclipse workspace from a template that we provide.
  5   Run the word counter example over the Wikipedia data set.




 Kenneth Heafield (Google Inc)   Introduction To Hadoop      January 14, 2008   12 / 12

Mais conteúdo relacionado

Mais procurados

Scalding - Hadoop Word Count in LESS than 70 lines of code
Scalding - Hadoop Word Count in LESS than 70 lines of codeScalding - Hadoop Word Count in LESS than 70 lines of code
Scalding - Hadoop Word Count in LESS than 70 lines of codeKonrad Malawski
 
Should I Use Scalding or Scoobi or Scrunch?
Should I Use Scalding or Scoobi or Scrunch? Should I Use Scalding or Scoobi or Scrunch?
Should I Use Scalding or Scoobi or Scrunch? DataWorks Summit
 
Map Reduce: Which Way To Go?
Map Reduce: Which Way To Go?Map Reduce: Which Way To Go?
Map Reduce: Which Way To Go?Ozren Gulan
 
The Ring programming language version 1.5.2 book - Part 13 of 181
The Ring programming language version 1.5.2 book - Part 13 of 181The Ring programming language version 1.5.2 book - Part 13 of 181
The Ring programming language version 1.5.2 book - Part 13 of 181Mahmoud Samir Fayed
 
The Ring programming language version 1.5.1 book - Part 12 of 180
The Ring programming language version 1.5.1 book - Part 12 of 180The Ring programming language version 1.5.1 book - Part 12 of 180
The Ring programming language version 1.5.1 book - Part 12 of 180Mahmoud Samir Fayed
 
Distributed systems
Distributed systemsDistributed systems
Distributed systemsSonali Parab
 
Cascading Through Hadoop for the Boulder JUG
Cascading Through Hadoop for the Boulder JUGCascading Through Hadoop for the Boulder JUG
Cascading Through Hadoop for the Boulder JUGMatthew McCullough
 
IPC: AIDL is sexy, not a curse
IPC: AIDL is sexy, not a curseIPC: AIDL is sexy, not a curse
IPC: AIDL is sexy, not a curseYonatan Levin
 
Ipc: aidl sexy, not a curse
Ipc: aidl sexy, not a curseIpc: aidl sexy, not a curse
Ipc: aidl sexy, not a curseYonatan Levin
 
The Ring programming language version 1.5.3 book - Part 14 of 184
The Ring programming language version 1.5.3 book - Part 14 of 184The Ring programming language version 1.5.3 book - Part 14 of 184
The Ring programming language version 1.5.3 book - Part 14 of 184Mahmoud Samir Fayed
 
Application-Specific Models and Pointcuts using a Logic Meta Language
Application-Specific Models and Pointcuts using a Logic Meta LanguageApplication-Specific Models and Pointcuts using a Logic Meta Language
Application-Specific Models and Pointcuts using a Logic Meta LanguageESUG
 
Programming with Python and PostgreSQL
Programming with Python and PostgreSQLProgramming with Python and PostgreSQL
Programming with Python and PostgreSQLPeter Eisentraut
 
Showdown of the Asserts by Philipp Krenn
Showdown of the Asserts by Philipp KrennShowdown of the Asserts by Philipp Krenn
Showdown of the Asserts by Philipp KrennJavaDayUA
 
Time Series Meetup: Virtual Edition | July 2020
Time Series Meetup: Virtual Edition | July 2020Time Series Meetup: Virtual Edition | July 2020
Time Series Meetup: Virtual Edition | July 2020InfluxData
 
The Ring programming language version 1.8 book - Part 18 of 202
The Ring programming language version 1.8 book - Part 18 of 202The Ring programming language version 1.8 book - Part 18 of 202
The Ring programming language version 1.8 book - Part 18 of 202Mahmoud Samir Fayed
 
.NET Database Toolkit
.NET Database Toolkit.NET Database Toolkit
.NET Database Toolkitwlscaudill
 
It Probably Works - QCon 2015
It Probably Works - QCon 2015It Probably Works - QCon 2015
It Probably Works - QCon 2015Fastly
 
Rx for Android & iOS by Harin Trivedi
Rx for Android & iOS  by Harin TrivediRx for Android & iOS  by Harin Trivedi
Rx for Android & iOS by Harin Trivediharintrivedi
 

Mais procurados (20)

Scalding - Hadoop Word Count in LESS than 70 lines of code
Scalding - Hadoop Word Count in LESS than 70 lines of codeScalding - Hadoop Word Count in LESS than 70 lines of code
Scalding - Hadoop Word Count in LESS than 70 lines of code
 
DCN Practical
DCN PracticalDCN Practical
DCN Practical
 
Should I Use Scalding or Scoobi or Scrunch?
Should I Use Scalding or Scoobi or Scrunch? Should I Use Scalding or Scoobi or Scrunch?
Should I Use Scalding or Scoobi or Scrunch?
 
Map Reduce: Which Way To Go?
Map Reduce: Which Way To Go?Map Reduce: Which Way To Go?
Map Reduce: Which Way To Go?
 
The Ring programming language version 1.5.2 book - Part 13 of 181
The Ring programming language version 1.5.2 book - Part 13 of 181The Ring programming language version 1.5.2 book - Part 13 of 181
The Ring programming language version 1.5.2 book - Part 13 of 181
 
The Ring programming language version 1.5.1 book - Part 12 of 180
The Ring programming language version 1.5.1 book - Part 12 of 180The Ring programming language version 1.5.1 book - Part 12 of 180
The Ring programming language version 1.5.1 book - Part 12 of 180
 
Distributed systems
Distributed systemsDistributed systems
Distributed systems
 
Cascading Through Hadoop for the Boulder JUG
Cascading Through Hadoop for the Boulder JUGCascading Through Hadoop for the Boulder JUG
Cascading Through Hadoop for the Boulder JUG
 
IPC: AIDL is sexy, not a curse
IPC: AIDL is sexy, not a curseIPC: AIDL is sexy, not a curse
IPC: AIDL is sexy, not a curse
 
Ipc: aidl sexy, not a curse
Ipc: aidl sexy, not a curseIpc: aidl sexy, not a curse
Ipc: aidl sexy, not a curse
 
XTW_Import
XTW_ImportXTW_Import
XTW_Import
 
The Ring programming language version 1.5.3 book - Part 14 of 184
The Ring programming language version 1.5.3 book - Part 14 of 184The Ring programming language version 1.5.3 book - Part 14 of 184
The Ring programming language version 1.5.3 book - Part 14 of 184
 
Application-Specific Models and Pointcuts using a Logic Meta Language
Application-Specific Models and Pointcuts using a Logic Meta LanguageApplication-Specific Models and Pointcuts using a Logic Meta Language
Application-Specific Models and Pointcuts using a Logic Meta Language
 
Programming with Python and PostgreSQL
Programming with Python and PostgreSQLProgramming with Python and PostgreSQL
Programming with Python and PostgreSQL
 
Showdown of the Asserts by Philipp Krenn
Showdown of the Asserts by Philipp KrennShowdown of the Asserts by Philipp Krenn
Showdown of the Asserts by Philipp Krenn
 
Time Series Meetup: Virtual Edition | July 2020
Time Series Meetup: Virtual Edition | July 2020Time Series Meetup: Virtual Edition | July 2020
Time Series Meetup: Virtual Edition | July 2020
 
The Ring programming language version 1.8 book - Part 18 of 202
The Ring programming language version 1.8 book - Part 18 of 202The Ring programming language version 1.8 book - Part 18 of 202
The Ring programming language version 1.8 book - Part 18 of 202
 
.NET Database Toolkit
.NET Database Toolkit.NET Database Toolkit
.NET Database Toolkit
 
It Probably Works - QCon 2015
It Probably Works - QCon 2015It Probably Works - QCon 2015
It Probably Works - QCon 2015
 
Rx for Android & iOS by Harin Trivedi
Rx for Android & iOS  by Harin TrivediRx for Android & iOS  by Harin Trivedi
Rx for Android & iOS by Harin Trivedi
 

Destaque

كيفية إضافة تقرير المواطن
كيفية إضافة تقرير المواطنكيفية إضافة تقرير المواطن
كيفية إضافة تقرير المواطنTom Trewinnard
 
American Urbanization: New York City
American Urbanization: New York CityAmerican Urbanization: New York City
American Urbanization: New York Citymeggss24
 
The fifth p, student version
The fifth p, student versionThe fifth p, student version
The fifth p, student versionRonald Voorn
 
Developing the Work Programme
Developing the Work ProgrammeDeveloping the Work Programme
Developing the Work ProgrammeNoel Hatch
 
Nations of the Americas (Culture)
Nations of the Americas (Culture)Nations of the Americas (Culture)
Nations of the Americas (Culture)03ram
 
Panama & Los Angeles: The Waterworks That Made the American West
Panama & Los Angeles: The Waterworks That Made the American WestPanama & Los Angeles: The Waterworks That Made the American West
Panama & Los Angeles: The Waterworks That Made the American West03ram
 
Interview Questions for Organisations
Interview Questions for OrganisationsInterview Questions for Organisations
Interview Questions for OrganisationsNoel Hatch
 
Cross Border Ediscovery vs. EU Data Protection at LegalTech West Coast
 Cross Border Ediscovery vs. EU Data Protection at LegalTech West Coast Cross Border Ediscovery vs. EU Data Protection at LegalTech West Coast
Cross Border Ediscovery vs. EU Data Protection at LegalTech West CoastAltheimPrivacy
 
How to post a Citizen Report (EN)
How to post a Citizen Report (EN)How to post a Citizen Report (EN)
How to post a Citizen Report (EN)Tom Trewinnard
 
Ripped from the Headlines: Cautionary Tales from the Annals of Data Privacy
Ripped from the Headlines: Cautionary Tales from the Annals of Data PrivacyRipped from the Headlines: Cautionary Tales from the Annals of Data Privacy
Ripped from the Headlines: Cautionary Tales from the Annals of Data PrivacyAltheimPrivacy
 
踏み出そう、初めの一歩!
踏み出そう、初めの一歩!踏み出そう、初めの一歩!
踏み出そう、初めの一歩!naka9181
 
Americas 19th Century
Americas 19th CenturyAmericas 19th Century
Americas 19th Centurymeggss24
 
California
CaliforniaCalifornia
Californiameggss24
 
8. Comparative History: Article Readings
8. Comparative History: Article Readings8. Comparative History: Article Readings
8. Comparative History: Article Readings03ram
 
Part 1: California
Part 1: CaliforniaPart 1: California
Part 1: Californiameggss24
 
Co-creating social and economic Europe
Co-creating social and economic EuropeCo-creating social and economic Europe
Co-creating social and economic EuropeNoel Hatch
 
Festival Techniques
Festival TechniquesFestival Techniques
Festival TechniquesNoel Hatch
 
American Urbanization and New York City
American Urbanization and New York CityAmerican Urbanization and New York City
American Urbanization and New York City03ram
 
Day ın the Lıfe - Example
Day ın the Lıfe - ExampleDay ın the Lıfe - Example
Day ın the Lıfe - ExampleNoel Hatch
 
Brazil: Nation Report
Brazil: Nation ReportBrazil: Nation Report
Brazil: Nation Reportmeggss24
 

Destaque (20)

كيفية إضافة تقرير المواطن
كيفية إضافة تقرير المواطنكيفية إضافة تقرير المواطن
كيفية إضافة تقرير المواطن
 
American Urbanization: New York City
American Urbanization: New York CityAmerican Urbanization: New York City
American Urbanization: New York City
 
The fifth p, student version
The fifth p, student versionThe fifth p, student version
The fifth p, student version
 
Developing the Work Programme
Developing the Work ProgrammeDeveloping the Work Programme
Developing the Work Programme
 
Nations of the Americas (Culture)
Nations of the Americas (Culture)Nations of the Americas (Culture)
Nations of the Americas (Culture)
 
Panama & Los Angeles: The Waterworks That Made the American West
Panama & Los Angeles: The Waterworks That Made the American WestPanama & Los Angeles: The Waterworks That Made the American West
Panama & Los Angeles: The Waterworks That Made the American West
 
Interview Questions for Organisations
Interview Questions for OrganisationsInterview Questions for Organisations
Interview Questions for Organisations
 
Cross Border Ediscovery vs. EU Data Protection at LegalTech West Coast
 Cross Border Ediscovery vs. EU Data Protection at LegalTech West Coast Cross Border Ediscovery vs. EU Data Protection at LegalTech West Coast
Cross Border Ediscovery vs. EU Data Protection at LegalTech West Coast
 
How to post a Citizen Report (EN)
How to post a Citizen Report (EN)How to post a Citizen Report (EN)
How to post a Citizen Report (EN)
 
Ripped from the Headlines: Cautionary Tales from the Annals of Data Privacy
Ripped from the Headlines: Cautionary Tales from the Annals of Data PrivacyRipped from the Headlines: Cautionary Tales from the Annals of Data Privacy
Ripped from the Headlines: Cautionary Tales from the Annals of Data Privacy
 
踏み出そう、初めの一歩!
踏み出そう、初めの一歩!踏み出そう、初めの一歩!
踏み出そう、初めの一歩!
 
Americas 19th Century
Americas 19th CenturyAmericas 19th Century
Americas 19th Century
 
California
CaliforniaCalifornia
California
 
8. Comparative History: Article Readings
8. Comparative History: Article Readings8. Comparative History: Article Readings
8. Comparative History: Article Readings
 
Part 1: California
Part 1: CaliforniaPart 1: California
Part 1: California
 
Co-creating social and economic Europe
Co-creating social and economic EuropeCo-creating social and economic Europe
Co-creating social and economic Europe
 
Festival Techniques
Festival TechniquesFestival Techniques
Festival Techniques
 
American Urbanization and New York City
American Urbanization and New York CityAmerican Urbanization and New York City
American Urbanization and New York City
 
Day ın the Lıfe - Example
Day ın the Lıfe - ExampleDay ın the Lıfe - Example
Day ın the Lıfe - Example
 
Brazil: Nation Report
Brazil: Nation ReportBrazil: Nation Report
Brazil: Nation Report
 

Semelhante a Hadoop

Hadoop Design and k -Means Clustering
Hadoop Design and k -Means ClusteringHadoop Design and k -Means Clustering
Hadoop Design and k -Means ClusteringGeorge Ang
 
Introducción a hadoop
Introducción a hadoopIntroducción a hadoop
Introducción a hadoopdatasalt
 
Open XKE - Big Data, Big Mess par Bertrand Dechoux
Open XKE - Big Data, Big Mess par Bertrand DechouxOpen XKE - Big Data, Big Mess par Bertrand Dechoux
Open XKE - Big Data, Big Mess par Bertrand DechouxPublicis Sapient Engineering
 
JRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop PapyrusJRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop PapyrusKoichi Fujikawa
 
Scalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedInScalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedInVitaly Gordon
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaDesing Pathshala
 
AJUG April 2011 Raw hadoop example
AJUG April 2011 Raw hadoop exampleAJUG April 2011 Raw hadoop example
AJUG April 2011 Raw hadoop exampleChristopher Curtin
 
Introduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with HadoopIntroduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with HadoopDilum Bandara
 
Hadoop Installation_13_09_2022(1).docx
Hadoop Installation_13_09_2022(1).docxHadoop Installation_13_09_2022(1).docx
Hadoop Installation_13_09_2022(1).docx1MS20CS406
 
실시간 인벤트 처리
실시간 인벤트 처리실시간 인벤트 처리
실시간 인벤트 처리Byeongweon Moon
 
Cs267 hadoop programming
Cs267 hadoop programmingCs267 hadoop programming
Cs267 hadoop programmingKuldeep Dhole
 
Map reduce模型
Map reduce模型Map reduce模型
Map reduce模型dhlzj
 
PigSPARQL: A SPARQL Query Processing Baseline for Big Data
PigSPARQL: A SPARQL Query Processing Baseline for Big DataPigSPARQL: A SPARQL Query Processing Baseline for Big Data
PigSPARQL: A SPARQL Query Processing Baseline for Big DataAlexander Schätzle
 
Answer this question for quality assurance. Include a final applicat.pdf
Answer this question for quality assurance. Include a final applicat.pdfAnswer this question for quality assurance. Include a final applicat.pdf
Answer this question for quality assurance. Include a final applicat.pdfakanshanawal
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopMohamed Elsaka
 
Introduction Big Data and Hadoop
Introduction Big Data and HadoopIntroduction Big Data and Hadoop
Introduction Big Data and Hadoop명신 김
 
Solving real world problems with Hadoop
Solving real world problems with HadoopSolving real world problems with Hadoop
Solving real world problems with Hadoopsynctree
 
Auto-GWT : Better GWT Programming with Xtend
Auto-GWT : Better GWT Programming with XtendAuto-GWT : Better GWT Programming with Xtend
Auto-GWT : Better GWT Programming with XtendSven Efftinge
 

Semelhante a Hadoop (20)

Hadoop Design and k -Means Clustering
Hadoop Design and k -Means ClusteringHadoop Design and k -Means Clustering
Hadoop Design and k -Means Clustering
 
Introducción a hadoop
Introducción a hadoopIntroducción a hadoop
Introducción a hadoop
 
Open XKE - Big Data, Big Mess par Bertrand Dechoux
Open XKE - Big Data, Big Mess par Bertrand DechouxOpen XKE - Big Data, Big Mess par Bertrand Dechoux
Open XKE - Big Data, Big Mess par Bertrand Dechoux
 
JRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop PapyrusJRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop Papyrus
 
Scalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedInScalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedIn
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
 
AJUG April 2011 Raw hadoop example
AJUG April 2011 Raw hadoop exampleAJUG April 2011 Raw hadoop example
AJUG April 2011 Raw hadoop example
 
Introduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with HadoopIntroduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with Hadoop
 
Hadoop Installation_13_09_2022(1).docx
Hadoop Installation_13_09_2022(1).docxHadoop Installation_13_09_2022(1).docx
Hadoop Installation_13_09_2022(1).docx
 
실시간 인벤트 처리
실시간 인벤트 처리실시간 인벤트 처리
실시간 인벤트 처리
 
Cs267 hadoop programming
Cs267 hadoop programmingCs267 hadoop programming
Cs267 hadoop programming
 
Map reduce模型
Map reduce模型Map reduce模型
Map reduce模型
 
PigSPARQL: A SPARQL Query Processing Baseline for Big Data
PigSPARQL: A SPARQL Query Processing Baseline for Big DataPigSPARQL: A SPARQL Query Processing Baseline for Big Data
PigSPARQL: A SPARQL Query Processing Baseline for Big Data
 
Answer this question for quality assurance. Include a final applicat.pdf
Answer this question for quality assurance. Include a final applicat.pdfAnswer this question for quality assurance. Include a final applicat.pdf
Answer this question for quality assurance. Include a final applicat.pdf
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Introduction Big Data and Hadoop
Introduction Big Data and HadoopIntroduction Big Data and Hadoop
Introduction Big Data and Hadoop
 
Solving real world problems with Hadoop
Solving real world problems with HadoopSolving real world problems with Hadoop
Solving real world problems with Hadoop
 
Auto-GWT : Better GWT Programming with Xtend
Auto-GWT : Better GWT Programming with XtendAuto-GWT : Better GWT Programming with Xtend
Auto-GWT : Better GWT Programming with Xtend
 

Mais de Karunakar Singh Thakur

Mais de Karunakar Singh Thakur (14)

Rational Team Concert (RTC) installation and setup guide
Rational Team Concert (RTC) installation and setup guideRational Team Concert (RTC) installation and setup guide
Rational Team Concert (RTC) installation and setup guide
 
All About Jazz Team Server Technology
All About Jazz Team Server TechnologyAll About Jazz Team Server Technology
All About Jazz Team Server Technology
 
Android Firewall project
Android Firewall projectAndroid Firewall project
Android Firewall project
 
Hyper Threading technology
Hyper Threading technologyHyper Threading technology
Hyper Threading technology
 
Advanced sql injection 2
Advanced sql injection 2Advanced sql injection 2
Advanced sql injection 2
 
Advanced sql injection 1
Advanced sql injection 1Advanced sql injection 1
Advanced sql injection 1
 
Plsql programs(encrypted)
Plsql programs(encrypted)Plsql programs(encrypted)
Plsql programs(encrypted)
 
Complete placement guide(non technical)
Complete placement guide(non technical)Complete placement guide(non technical)
Complete placement guide(non technical)
 
Complete placement guide(technical)
Complete placement guide(technical)Complete placement guide(technical)
Complete placement guide(technical)
 
How to answer the 64 toughest interview questions
How to answer the 64 toughest interview questionsHow to answer the 64 toughest interview questions
How to answer the 64 toughest interview questions
 
genetic algorithms-artificial intelligence
 genetic algorithms-artificial intelligence genetic algorithms-artificial intelligence
genetic algorithms-artificial intelligence
 
Prepare for aptitude test
Prepare for aptitude testPrepare for aptitude test
Prepare for aptitude test
 
Thesis of SNS
Thesis of SNSThesis of SNS
Thesis of SNS
 
Network survivability karunakar
Network survivability karunakarNetwork survivability karunakar
Network survivability karunakar
 

Hadoop

  • 1. Introduction To Hadoop Kenneth Heafield Google Inc January 14, 2008 Example code from Hadoop 0.13.1 used under the Apache License Version 2.0 and modified for presentation. Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License. Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 1 / 12
  • 2. Outline 1 Word Count Code Mapper Reducer Main 2 How it Works Serialization Data Flow 3 Lab Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 2 / 12
  • 3. Word Count Code Mapper Mapper < “wikipedia.org”, “The Free” >→ < “The”, 1 >, < “Free”, 1 > public void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter) throws IOException { String line = ((Text)value).toString(); StringTokenizer itr = new StringTokenizer(line); Text word = new Text(); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, new IntWritable(1)); } } Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 3 / 12
  • 4. Word Count Code Mapper Mapper < “wikipedia.org”, “The Free” >→ < “The”, 1 >, < “Free”, 1 > public void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter) throws IOException { String line = ((Text)value).toString(); StringTokenizer itr = new StringTokenizer(line); Text word = new Text(); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, new IntWritable(1)); } } Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 3 / 12
  • 5. Word Count Code Mapper Mapper < “wikipedia.org”, “The Free” >→ < “The”, 1 >, < “Free”, 1 > public void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter) throws IOException { String line = ((Text)value).toString(); StringTokenizer itr = new StringTokenizer(line); Text word = new Text(); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, new IntWritable(1)); } } Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 3 / 12
  • 6. Word Count Code Mapper Mapper < “wikipedia.org”, “The Free” >→ < “The”, 1 >, < “Free”, 1 > public void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter) throws IOException { String line = ((Text)value).toString(); StringTokenizer itr = new StringTokenizer(line); Text word = new Text(); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, new IntWritable(1)); } } Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 3 / 12
  • 7. Word Count Code Mapper Mapper < “wikipedia.org”, “The Free” >→ < “The”, 1 >, < “Free”, 1 > public void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter) throws IOException { String line = ((Text)value).toString(); StringTokenizer itr = new StringTokenizer(line); Text word = new Text(); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, new IntWritable(1)); } } Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 3 / 12
  • 8. Word Count Code Reducer Reducer < “The”, 1 >, < “The”, 1 >→ < “The”, 2 > public void reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += ((IntWritable) values.next()).get(); } output.collect(key, new IntWritable(sum)); } Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 4 / 12
  • 9. Word Count Code Reducer Reducer < “The”, 1 >, < “The”, 1 >→ < “The”, 2 > public void reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += ((IntWritable) values.next()).get(); } output.collect(key, new IntWritable(sum)); } Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 4 / 12
  • 10. Word Count Code Reducer Reducer < “The”, 1 >, < “The”, 1 >→ < “The”, 2 > public void reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += ((IntWritable) values.next()).get(); } output.collect(key, new IntWritable(sum)); } Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 4 / 12
  • 11. Word Count Code Reducer Reducer < “The”, 1 >, < “The”, 1 >→ < “The”, 2 > public void reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += ((IntWritable) values.next()).get(); } output.collect(key, new IntWritable(sum)); } Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 4 / 12
  • 12. Word Count Code Main Main public static void main(String[] args) throws IOException { JobConf conf = new JobConf(WordCount.class); conf.setJobName("wordcount"); conf.setMapperClass(MapClass.class); conf.setCombinerClass(ReduceClass.class); conf.setReducerClass(ReduceClass.class); conf.setNumMapTasks(new Integer(40)); conf.setNumReduceTasks(new Integer(30)); conf.setInputPath(new Path("/shared/wikipedia_small")); conf.setOutputPath(new Path("/user/kheafield/word_count")); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(IntWritable.class); JobClient.runJob(conf); } Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 5 / 12
  • 13. Word Count Code Main Main public static void main(String[] args) throws IOException { JobConf conf = new JobConf(WordCount.class); conf.setJobName("wordcount"); conf.setMapperClass(MapClass.class); conf.setCombinerClass(ReduceClass.class); conf.setReducerClass(ReduceClass.class); conf.setNumMapTasks(new Integer(40)); conf.setNumReduceTasks(new Integer(30)); conf.setInputPath(new Path("/shared/wikipedia_small")); conf.setOutputPath(new Path("/user/kheafield/word_count")); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(IntWritable.class); JobClient.runJob(conf); } Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 5 / 12
  • 14. Word Count Code Main Main public static void main(String[] args) throws IOException { JobConf conf = new JobConf(WordCount.class); conf.setJobName("wordcount"); conf.setMapperClass(MapClass.class); conf.setCombinerClass(ReduceClass.class); conf.setReducerClass(ReduceClass.class); conf.setNumMapTasks(new Integer(40)); conf.setNumReduceTasks(new Integer(30)); conf.setInputPath(new Path("/shared/wikipedia_small")); conf.setOutputPath(new Path("/user/kheafield/word_count")); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(IntWritable.class); JobClient.runJob(conf); } Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 5 / 12
  • 15. Word Count Code Main Main public static void main(String[] args) throws IOException { JobConf conf = new JobConf(WordCount.class); conf.setJobName("wordcount"); conf.setMapperClass(MapClass.class); conf.setCombinerClass(ReduceClass.class); conf.setReducerClass(ReduceClass.class); conf.setNumMapTasks(new Integer(40)); conf.setNumReduceTasks(new Integer(30)); conf.setInputPath(new Path("/shared/wikipedia_small")); conf.setOutputPath(new Path("/user/kheafield/word_count")); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(IntWritable.class); JobClient.runJob(conf); } Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 5 / 12
  • 16. How it Works Serialization Types Purpose Simple serialization for keys, values, and other data Interface Writable Read and write binary format Convert to String for text formats WritableComparable adds sorting order for keys Example Implementations ArrayWritable is only Writable BooleanWritable IntWritable sorts in increasing order Text holds a String Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 6 / 12
  • 17. How it Works Serialization A Writable public class IntPairWritable implements Writable { public int first; public int second; public void write(DataOutput out) throws IOException { out.writeInt(first); out.writeInt(second); } public void readFields(DataInput in) throws IOException { first = in.readInt(); second = in.readInt(); } public int hashCode() { return first + second; } public String toString() { return Integer.toString(first) + "," + Integer.toString(second); } Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 7 / 12
  • 18. How it Works Serialization WritableComparable Method public int compareTo(Object other) { IntPairWritable o = (IntPairWritable)other; if (first < o.first) return -1; if (first > o.first) return 1; if (second < o.second) return -1; if (second > o.second) return 1; return 0; } Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 8 / 12
  • 19. How it Works Data Flow Data Flow Default Flow 1 Mappers read from HDFS 2 Map output is partitioned by key and sent to Reducers 3 Reducers sort input by key 4 Reduce output is written to HDFS 1 HDFS 2 Mapper 3 Reducer 4 HDFS Input Map Sort Reduce Output Input Map Sort Reduce Output Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 9 / 12
  • 20. How it Works Data Flow Combiners Concept Add counts at Mapper before sending to Reducer. Word count is 6 minutes with combiners and 14 without. Implementation Mapper caches output and periodically calls Combiner Input to Combine may be from Map or Combine Combiner uses interface as Reducer Mapper Combine Sort Reduce Output Input Map Cache Sort Reduce Output Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 10 / 12
  • 21. Lab Exercises Recommended: Word Count Get word count running. Bigrams Count bigrams and unigrams efficiently. Capitalization With what probability is a word capitalized? Indexer In what documents does each word appear? Where in the documents? Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 11 / 12
  • 22. Lab Instructions 1 Login to the cluster successfully (and set your password). 2 Get Eclipse installed, so you can build Java code. 3 Install the Hadoop plugin for Eclipse so you can deploy jobs to the cluster. 4 Set up your Eclipse workspace from a template that we provide. 5 Run the word counter example over the Wikipedia data set. Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 12 / 12