A MapReduce-based Programming Model for Self-maintainable Aggregate Views
2012-08-31

Johannes Schildgen
TU Kaiserslautern
schildgen@cs.uni-kl.de
Motivation
Kinderriegel: 26
Balisto: 39
Hanuta: 14
Snickers: 19
Ritter Sport: 41
Pickup: 12
…
Δ
Balisto: -1

Δ
Balisto: -8
Snickers: +24
Ritter Sport: -7
Increment Installation

Kinderriegel: 26
Balisto: 31
Hanuta: 14
Snickers: 43
Ritter Sport: 34
Pickup: 12
…

Δ
Balisto: -8
Snickers: +24
Ritter Sport: -7
Overwrite Installation

Old result:
Kinderriegel: 26
Balisto: 39
Hanuta: 14
Snickers: 19
Ritter Sport: 41
Pickup: 12
…

Δ
Balisto: -8
Snickers: +24
Ritter Sport: -7

New result:
Kinderriegel: 26
Balisto: 31
Hanuta: 14
Snickers: 43
Ritter Sport: 34
Pickup: 12
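
The difference between the two installation modes can be sketched in plain Java. This is only an illustration under assumptions: an in-memory map stands in for the stored result table, and the class and method names (InstallationSketch, incrementInstall, overwriteInstall) are hypothetical and not part of Marimba. The numbers are the ones from the slides.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: a map stands in for the stored aggregate view.
final class InstallationSketch {

    // The stored result before the changes arrive (values from the slides).
    static Map<String, Integer> sampleView() {
        Map<String, Integer> view = new LinkedHashMap<>();
        view.put("Kinderriegel", 26);
        view.put("Balisto", 39);
        view.put("Hanuta", 14);
        view.put("Snickers", 19);
        view.put("Ritter Sport", 41);
        view.put("Pickup", 12);
        return view;
    }

    // The aggregated delta (values from the slides).
    static Map<String, Integer> sampleDelta() {
        Map<String, Integer> delta = new LinkedHashMap<>();
        delta.put("Balisto", -8);
        delta.put("Snickers", 24);
        delta.put("Ritter Sport", -7);
        return delta;
    }

    // Increment installation: only the changed keys are touched, in place.
    static void incrementInstall(Map<String, Integer> view, Map<String, Integer> delta) {
        delta.forEach((key, change) -> view.merge(key, change, Integer::sum));
    }

    // Overwrite installation: old result and delta are combined into a
    // complete new result, which then replaces the old one.
    static Map<String, Integer> overwriteInstall(Map<String, Integer> oldView,
                                                 Map<String, Integer> delta) {
        Map<String, Integer> newView = new LinkedHashMap<>(oldView);
        delta.forEach((key, change) -> newView.merge(key, change, Integer::sum));
        return newView;
    }

    public static void main(String[] args) {
        Map<String, Integer> view = sampleView();
        incrementInstall(view, sampleDelta());
        System.out.println(view);
        System.out.println(overwriteInstall(sampleView(), sampleDelta()));
    }
}
```

Both modes end with the same result; they differ in how much data is read and written to get there.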
Fundamentals & Related Work | The Marimba Framework | Evaluation
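import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.Tool;

public class WordCount extends Configured implements Tool {

        public static class WordCountMapper extends
                         Mapper<LongWritable, Text, ImmutableBytesWritable,
                                          LongWritable> {

                 private Text word = new Text();

                 @Override
                 public void map(LongWritable key, Text value, Context context)
                                   throws IOException, InterruptedException {
                          // Split the input line into words
                          String line = value.toString();
                          StringTokenizer tokenizer = new StringTokenizer(line);
                          while (tokenizer.hasMoreTokens()) {
                                   this.word.set(tokenizer.nextToken());
                                   // … (the slide cuts off here)
                          }
                 }
        }
        // … (rest of the job is not shown on the slide)
}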
> create 'person', 'default'
> put 'person', 'p27', 'default:forename', 'Anton'
> put 'person', 'p27', 'default:surname', 'Schmidt'

> get 'person', 'p27'
COLUMN               CELL
 default:forename    timestamp=1338991436688, value=Anton
 default:surname     timestamp=1338991497408, value=Schmidt
2 row(s) in 0.0640 seconds
Old result:           Δ:
banane    3           kaufe     1
die       4           die       0
iss       1           banane    1
nimm      1           schmeiß   -1
schale    1           schale    -1
schäle    1           weg       -1
schmeiß   1
weg       1

After increment():
banane    4
die       4
iss       1
kaufe     1
nimm      1
schale    0
schäle    1
schmeiß   0
weg       0
Old result:           Δ:
banane    3           kaufe     1
die       4           die       0
iss       1           banane    1
nimm      1           schmeiß   -1
schale    1           schale    -1
schäle    1           weg       -1
schmeiß   1
weg       1

After overwrite():
banane    4
die       4
iss       1
kaufe     1
nimm      1
schale    0
schäle    1
schmeiß   0
weg       0
void map(key, value) {
 if(value is inserted) {
    for(word : value.split(" ")) {
       write(word, 1);   // every inserted word counts +1
    }
 }
 else if(value is deleted) {
    for(word : value.split(" ")) {
       write(word, -1);  // every deleted word counts -1
    }
 }
}
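void map(key, value) {
 if(value is inserted) {
    for(word : value.split(" ")) {
       write(word, 1);   // every inserted word counts +1
    }
 }
 else if(value is deleted) {
    for(word : value.split(" ")) {
       write(word, -1);  // every deleted word counts -1
    }
 }
 else { // old result
    write(key, value);   // pass the stored count through unchanged
 }
}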
Overwrite Installation

void reduce(key, values) {
  sum = 0;
  for(value : values) {
    sum += value;
  }
  put = new Put(key);           // Put replaces the stored value
  put.add("fam", "col", sum);
  context.write(key, put);
}
Increment Installation

void reduce(key, values) {
  sum = 0;
  for(value : values) {
    sum += value;
  }
  inc = new Increment(key);     // Increment adds sum to the stored value
  inc.add("fam", "col", sum);
  context.write(key, inc);
}
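
The delta computation from the map and reduce pseudocode above can be tried out without Hadoop. The following is a minimal sketch under assumptions: the class name DeltaWordCountSketch is hypothetical, the reduce step (summing per key) is folded into a map merge, and the two sample lines are a guess at which inserted and deleted lines produce the Δ shown on the WordCount slides. The letter ß is written as a Unicode escape so the source stays ASCII.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch of the delta computation, not Marimba code.
final class DeltaWordCountSketch {

    // Map step: +1 per word of an inserted line, -1 per word of a deleted line.
    // Summing per key (the reduce step) happens directly in merge().
    static void map(String line, boolean deleted, Map<String, Integer> delta) {
        int sign = deleted ? -1 : 1;
        for (String word : line.split(" ")) {
            delta.merge(word, sign, Integer::sum);
        }
    }

    // Runs the map step over all inserted and deleted lines.
    static Map<String, Integer> computeDelta(String[] insertedLines, String[] deletedLines) {
        Map<String, Integer> delta = new TreeMap<>();
        for (String line : insertedLines) {
            map(line, false, delta);
        }
        for (String line : deletedLines) {
            map(line, true, delta);
        }
        return delta;
    }

    public static void main(String[] args) {
        // Assumed sample input; "schmei\u00df" is "schmeiß".
        Map<String, Integer> delta = computeDelta(
                new String[] {"kaufe die banane"},
                new String[] {"schmei\u00df die schale weg"});
        System.out.println(delta);
    }
}
```

The word "die" occurs once in the inserted and once in the deleted line, so its delta is 0, which matches the Δ column on the slides.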
Formalization
Generic Mapper
Generic Reducer
Fundamentals & Related Work | The Marimba Framework | Evaluation
Core functionality: distributed computations with MapReduce

Core functionality: incremental computations
"I care about IncDec, Overwrite, reading old results, producing Increments, …"

"I tell you how to read input data, aggregate, invert, and write the output."
public class WordTranslator extends
    Translator<LongWritable, Text> {
  public void translate(…) {
    …
  }
}


IncJob job = new IncJobOverwrite(conf);
job.setTranslatorClass(
             WordTranslator.class);
job.setAbelianClass(WordAbelian.class);
public class WordAbelian implements
    Abelian<WordAbelian> {
 WordAbelian invert() { … }
 WordAbelian aggregate(WordAbelian
                        other) { … }
 WordAbelian neutral() { … }
 boolean isNeutral() { … }
 Writable extractKey() { … }
 void write(…) { … }
 void readFields(…) { … }
}
public class WordSerializer
 implements Serializer<WordAbelian> {

 Writable serialize(Writable key,
                    WordAbelian v) {
    …
 }
 WordAbelian deserializeHBase(
    byte[] rowId, byte[] colFamily,
    byte[] qualifier, byte[] value) {
    …
 }
}
How To Write A Marimba-Job

1. Abelian-Class
2. Translator-Class
3. Serializer-Class
4. Write a Hadoop-Job and use the
   class IncJob
Implementation

IncJob: setInputTable(…), setOutputTable(…)
  IncJobFull (recomputation)
  IncJobIncDec
  IncJobOverwrite: setResultInputTable(…)

NeutralOutputStrategy (for IncJobOverwrite)
public interface Abelian<T extends
 Abelian<?>> extends
 WritableComparable<BinaryComparable>{

 T invert();
 T aggregate(T other);
 T neutral();
 boolean isNeutral();
 Writable extractKey();
 void write(…);
 void readFields(…);
}
public interface Serializer<T extends
 Abelian<?>> {

 Writable serialize(T obj);
 T deserializeHBase(
    byte[] rowId, byte[] colFamily,
    byte[] qualifier, byte[] value);
}
public abstract class Translator
 <KEYIN, VALUEIN> {

 public abstract void translate
    (KEYIN key, VALUEIN value,
     Context context);

this.mapContext.write(
    abelianValue.extractKey(),
    this.invertValue ?
         abelianValue.invert() :
         abelianValue);
GenericMapper

From InputFormat: Value

Value type        Action
OverwriteResult   deserialize
InsertedValue     translate
DeletedValue      set invertValue=true; translate
PreservedValue    ignore
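
The dispatch in the GenericMapper can be sketched as follows. This is a simplified illustration under assumptions: the names GenericMapperSketch and Kind are hypothetical, values are plain strings, emitted pairs are encoded as "word=count" strings, and translate is hard-wired to the WordCount case. Marimba's actual classes work on Hadoop and HBase types.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the per-value dispatch, specialized to WordCount.
final class GenericMapperSketch {

    // The four kinds of values delivered by the input format.
    enum Kind { OVERWRITE_RESULT, INSERTED, DELETED, PRESERVED }

    // Returns the emitted pairs, each encoded as "word=count".
    static List<String> map(Kind kind, String value) {
        List<String> out = new ArrayList<>();
        switch (kind) {
            case OVERWRITE_RESULT:
                // Old result: already aggregated, only needs deserializing.
                // Here the stored form is assumed to be "word=count" already.
                out.add(value);
                break;
            case INSERTED:
                translate(value, false, out);
                break;
            case DELETED:
                // Same translation, but every emitted value is inverted.
                translate(value, true, out);
                break;
            case PRESERVED:
                // Unchanged input does not affect the delta.
                break;
        }
        return out;
    }

    // WordCount translation: one pair per word, inverted for deletions.
    static void translate(String line, boolean invertValue, List<String> out) {
        for (String word : line.split(" ")) {
            out.add(word + "=" + (invertValue ? -1 : 1));
        }
    }

    public static void main(String[] args) {
        System.out.println(map(Kind.INSERTED, "kaufe die banane"));
        System.out.println(map(Kind.DELETED, "die schale weg"));
    }
}
```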
GenericReducer

1. aggregate
2. serialize
3. write

IncDec:    putToIncrement(…)
Overwrite: PUT → write
           IGNORE → don't write
           DELETE → putToDelete(...)
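
A rough sketch of the reducer's two output paths follows. It rests on an assumption that the slides do not spell out: that PUT, IGNORE and DELETE are the options of the NeutralOutputStrategy and decide what happens when the aggregated value is neutral (for WordCount, a count of 0). All names (GenericReducerSketch, reduceOverwrite, reduceIncDec) and the string-encoded output are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch; output operations are encoded as strings.
final class GenericReducerSketch {

    // Assumed meaning: what to do in overwrite mode when the result is neutral.
    enum NeutralOutputStrategy { PUT, IGNORE, DELETE }

    // Step 1 of the reducer: aggregate all values of one key.
    static int aggregate(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // IncDec mode: the aggregated delta becomes an increment operation.
    static String reduceIncDec(String key, List<Integer> values) {
        return "INCREMENT " + key + " by " + aggregate(values);
    }

    // Overwrite mode: non-neutral results are written; neutral results
    // are handled according to the chosen strategy. Returns null if
    // nothing is written.
    static String reduceOverwrite(String key, List<Integer> values,
                                  NeutralOutputStrategy strategy) {
        int sum = aggregate(values);
        if (sum != 0) {
            return "PUT " + key + "=" + sum;
        }
        switch (strategy) {
            case PUT:
                return "PUT " + key + "=0";
            case DELETE:
                return "DELETE " + key;
            default:
                return null; // IGNORE
        }
    }

    public static void main(String[] args) {
        System.out.println(reduceIncDec("banane", Arrays.asList(1)));
        System.out.println(reduceOverwrite("schale", Arrays.asList(1, -1),
                NeutralOutputStrategy.DELETE));
    }
}
```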
GenericCombiner

"Write a Combiner"
   -- 7 Tips for Improving MapReduce Performance (Tip 4)

1. aggregate
TextWindowInputFormat
Example:
1. WordCount

Translator:
void translate(key, value) {
 for(word : value.split(" ")) {
  write(
    new WordAbelian(word, 1));
 }
}

Serializer:
Writable serialize(
    WordAbelian w) {
 Put p = new Put(
        w.getWord());
 p.add(…);
 return p;
}

WordAbelian:
WordAbelian invert() {
 return new WordAbelian(
    this.word,
    -1 * this.count);
}

WordAbelian aggregate(
    WordAbelian other) {
 return new WordAbelian(
    this.word,
    this.count
    + other.count);
}

WordAbelian neutral() {
 return new WordAbelian(
    this.word, 0);
}

boolean isNeutral() {
 return (this.count == 0);
}
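
The WordCount value type behaves like an element of an Abelian group: aggregating a value with its inverse gives the neutral element. The sketch below shows this in plain Java. It is a simplification under assumptions: SimpleAbelian is a hypothetical stand-in for the Abelian interface without the Hadoop Writable methods and without extractKey(), and WordCountValue is a hypothetical stand-in for WordAbelian.

```java
// Simplified stand-in for the Abelian interface (no Hadoop dependencies).
interface SimpleAbelian<T extends SimpleAbelian<T>> {
    T invert();
    T aggregate(T other);
    T neutral();
    boolean isNeutral();
}

// Immutable (word, count) value; hypothetical stand-in for WordAbelian.
final class WordCountValue implements SimpleAbelian<WordCountValue> {
    final String word;
    final long count;

    WordCountValue(String word, long count) {
        this.word = word;
        this.count = count;
    }

    // Inverse element: the same word with negated count.
    @Override
    public WordCountValue invert() {
        return new WordCountValue(word, -count);
    }

    // Group operation: add the counts (both values belong to the same word).
    @Override
    public WordCountValue aggregate(WordCountValue other) {
        return new WordCountValue(word, count + other.count);
    }

    // Neutral element: count 0.
    @Override
    public WordCountValue neutral() {
        return new WordCountValue(word, 0);
    }

    @Override
    public boolean isNeutral() {
        return count == 0;
    }
}

final class WordCountValueDemo {
    public static void main(String[] args) {
        WordCountValue die = new WordCountValue("die", 4);
        // A value aggregated with its inverse is neutral.
        System.out.println(die.aggregate(die.invert()).isNeutral());
        // Aggregating with the neutral element changes nothing.
        System.out.println(die.aggregate(die.neutral()).count);
    }
}
```

The inverse is what allows a deletion to be expressed as an ordinary value that flows through the same aggregation as an insertion.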
Example:
2. Friends Of Friends

[Figure: persons A, B, C, D, E connected by FRIENDS edges, with the derived FRIENDS OF FRIENDS edges]

translate(person, friends):

aggregate(…):
Merge friends-of-friends sets
Example:
3. Reverse WebLink-Graph

aggregate(…):
Merge link-sets

REVERSE WEB LINK GRAPH
(Row-ID -> Columns)

Google -> {eBay, Wikipedia}
eBay -> {Google, Wikipedia}
Mensa-KL -> {Google}
Facebook -> {Google, Mensa-KL, Uni-KL}
Wikipedia -> {Google}
Uni-KL -> {Google, Wikipedia}
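
How such a reverse graph comes about can be sketched in plain Java: each outgoing link (source, target) is emitted under the target as key, and aggregation merges the sets of sources. The class name ReverseWebLinkSketch is hypothetical, and the forward link graph in sampleLinks() is an assumption, derived backwards from the reverse graph shown above. Deletions are not handled in this sketch.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative sketch; an in-memory map stands in for the link table.
final class ReverseWebLinkSketch {

    // Assumed forward graph (page -> pages it links to), reconstructed
    // from the reverse graph on the slide.
    static Map<String, List<String>> sampleLinks() {
        Map<String, List<String>> links = new LinkedHashMap<>();
        links.put("Google", Arrays.asList("eBay", "Mensa-KL", "Facebook", "Wikipedia", "Uni-KL"));
        links.put("eBay", Arrays.asList("Google"));
        links.put("Wikipedia", Arrays.asList("Google", "eBay", "Uni-KL"));
        links.put("Mensa-KL", Arrays.asList("Facebook"));
        links.put("Uni-KL", Arrays.asList("Facebook"));
        return links;
    }

    // For every link source -> target, add source to the link-set of target.
    // Merging the sets corresponds to the aggregate step.
    static Map<String, Set<String>> reverse(Map<String, List<String>> links) {
        Map<String, Set<String>> reversed = new TreeMap<>();
        for (Map.Entry<String, List<String>> page : links.entrySet()) {
            for (String target : page.getValue()) {
                reversed.computeIfAbsent(target, k -> new TreeSet<>()).add(page.getKey());
            }
        }
        return reversed;
    }

    public static void main(String[] args) {
        reverse(sampleLinks()).forEach(
                (page, sources) -> System.out.println(page + " -> " + sources));
    }
}
```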
Example:
         4. Bigrams
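Hi, kannst du mich ___?___ am Bahnhof abholen? So in etwa 10 ___?___. Viele liebe ___?___. P.S. Ich habe viel ___?___.
(English: "Hi, can you ___?___ pick me up at the station? In about 10 ___?___. Lots of love ___?___. P.S. I have a lot of ___?___.")
Idea:
Analyze large amounts of text data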
Example:
4. Bigrams

NGramAbelian
  fields: a, b, count
  extractKey()
  write(…)
  invert(): count*=-1
  aggregate(… other): count+=other.count
  neutral(): count=0
  isNeutral(): count==0

NGramStep2Abelian
"Which input data?"

Suggestions for the gaps:
Hi, kannst du mich ___ am Bahnhof abholen?   bitte (please), nicht (not)
So in etwa 10 ___.                           <num> Minuten (minutes), <num> Jahre (years)
Viele liebe ___.                             Grüße (greetings), dich (you)
P.S. Ich habe viel ___.                      zu (too), Spaß (fun)
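
The idea behind the suggestions can be sketched in plain Java: count how often each word is followed by each other word, then propose the most frequent successor. The class name BigramSketch, the tiny corpus, and the tie-breaking rule (alphabetical order) are assumptions for illustration; the example ignores the <num> normalization and the incremental maintenance that the slides describe.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of bigram counting and next-word suggestion.
final class BigramSketch {

    // Counts bigrams: first word -> (second word -> number of occurrences).
    static Map<String, Map<String, Integer>> count(String text) {
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        String[] words = text.toLowerCase().split("\\s+");
        for (int i = 0; i + 1 < words.length; i++) {
            counts.computeIfAbsent(words[i], k -> new HashMap<>())
                  .merge(words[i + 1], 1, Integer::sum);
        }
        return counts;
    }

    // Returns the most frequent successor of the given word, or null if
    // the word was never seen. Ties are broken alphabetically.
    static String suggest(Map<String, Map<String, Integer>> counts, String word) {
        Map<String, Integer> successors = counts.get(word);
        if (successors == null) {
            return null;
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : successors.entrySet()) {
            boolean moreFrequent = e.getValue() > bestCount;
            boolean tieButSmaller = e.getValue() == bestCount
                    && best != null && e.getKey().compareTo(best) < 0;
            if (moreFrequent || tieButSmaller) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Assumed toy corpus (ASCII only).
        Map<String, Map<String, Integer>> counts =
                count("ich habe viel spass und viel spass und viel zu tun");
        System.out.println(suggest(counts, "viel"));
    }
}
```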
Fundamentals & Related Work | The Marimba Framework | Evaluation
WordCount

[Chart: runtime (Time [hh:mm], axis from 00:00 to 01:10) over the share of changes in the input (Changes, 0% to 80%) for FULL, INCDEC, and OVERWRITE]
Reverse Weblink-Graph

[Chart: runtime (Time [hh:mm], axis from 00:00 to 02:51) over the share of changes in the input (Changes, 0% to 80%) for FULL, INCDEC, and OVERWRITE]
Conclusion
Full Recomputation
IncDec / Overwrite
Images
Slide 5-9:
Flames and smiley: Microsoft Office 2010

Slide 10:
Google: http://www.google.de

Slide 11:
Amazon: http://www.amazon.de

Slide 12:
Hadoop: http://hadoop.apache.org
Casio Wristwatch: http://www.flickr.com/photos/andresrueda/3448240252

Slide 16:
Hadoop: http://hadoop.apache.org

Slide 17:
Hadoop: http://hadoop.apache.org
Notebook: Microsoft Office 2010

Slide 18:
HBase: http://hbase.apache.org

Slide 31:
Hadoop: http://hadoop.apache.org
Casio Wristwatch: http://www.flickr.com/photos/andresrueda/3448240252

Slide 32:
Scaffold: http://www.flickr.com/photos/michale/94538528/
Hadoop: http://hadoop.apache.org
Boy: Microsoft Office 2010

Slide 37-44:
Puzzle: http://www.flickr.com/photos/dps/136565237/

Slide 46-48:
Boy: Microsoft Office 2010

Slide 49:
Google: http://www.google.de
eBay: http://www.ebay.de
Mensa-KL: http://www.mensa-kl.de
facebook: http://www.facebook.de
Wikipedia: http://de.wikipedia.org
TU Kaiserslautern: http://www.uni-kl.de

Slide 50-51:
Mobile phone: Microsoft Office 2010

Slide 56:
Wikipedia: http://de.wikipedia.org
Twitter: http://www.twitter.com

Slide 57:
Mobile phone: Microsoft Office 2010

Slide 58:
Hadoop: http://hadoop.apache.org
Casio Wristwatch: http://www.flickr.com/photos/andresrueda/3448240252
Bibliography (1/2)
[0] Johannes Schildgen. Ein MapReduce-basiertes Programmiermodell für selbstwartbare Aggregatsichten.
Master's thesis, TU Kaiserslautern, August 2012.

[1] Apache Hadoop project. http://hadoop.apache.org/.
[2] Virga: Incremental Recomputations in MapReduce. http://wwwlgis.informatik.uni-kl.de/cms/?id=526.
[3] Philippe Adjiman. Hadoop Tutorial Series, Issue #4: To Use Or Not To Use A Combiner, 2010.
http://www.philippeadjiman.com/blog/2010/01/14/hadoop-tutorial-series-issue-4-to-use-or-not-to-use-a-combiner/.
[4] Kai Biermann. Big Data: Twitter wird zum Fieberthermometer der Gesellschaft, April 2012.
http://www.zeit.de/digital/internet/2012-04/twitter-krankheiten-nowcast.
[5] Julie Bort. 8 Crazy Things IBM Scientists Have Learned Studying Twitter, January 2012.
http://www.businessinsider.com/8-crazy-things-ibm-scientists-have-learned-studying-twitter-2012-1.
[6] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI, pages 137–150, 2004.
[7] Lars George. HBase: The Definitive Guide. O’Reilly Media, 1 edition, 2011.
[8] Brown University Data Management Group. A Comparison of Approaches to Large-Scale Data Analysis.
http://database.cs.brown.edu/projects/mapreduce-vs-dbms/.
[9] Ricky Ho. Map/Reduce to recommend people connection, August 2010.
http://horicky.blogspot.de/2010/08/mapreduce-to-recommend-people.html.
[10] Yong Hu. Efficiently Extracting Change Data from HBase. April 2012.
[11] Thomas Jörg, Roya Parvizi, Hu Yong, and Stefan Dessloch. Can MapReduce learn from materialized views?
In LADIS 2011, pages 1 – 5, 9 2011.
[12] Thomas Jörg, Roya Parvizi, Hu Yong, and Stefan Dessloch. Incremental recomputations in mapreduce. In CloudDB 2011, 10 2011.
[13] Steve Krenzel. MapReduce: Finding Friends, 2010. http://stevekrenzel.com/finding-friends-with-mapreduce.
[14] Todd Lipcon. 7 Tips for Improving MapReduce Performance, 2009. http://www.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/.
Bibliography (2/2)
[15] TouchType Ltd. SwiftKey X - Android Apps auf Google Play, February 2012.
http://play.google.com/store/apps/details?id=com.touchtype.swiftkey.
[16] Karl H. Marbaise. Hadoop - Think Large!, 2011. http://www.soebes.de/files/RuhrJUGEssenHadoop-20110217.pdf.
[17] Arnab Nandi, Cong Yu, Philip Bohannon, and Raghu Ramakrishnan. Distributed Cube Materialization on Holistic Measures.
ICDE, pages 183–194, 2011.
[18] Alexander Neumann. Studie: Hadoop wird ähnlich erfolgreich wie Linux, May 2012.
http://heise.de/-1569837.
[19] Owen O’Malley, Jack Hebert, Lohit Vijayarenu, and Amar Kamat. Partitioning your job into maps and reduces, September 2009.
http://wiki.apache.org/hadoop/HowManyMapsAndReduces?action=recall&rev=7.
[20] Roya Parvizi. Inkrementelle Neuberechnungen mit MapReduce. Bachelor's thesis, TU Kaiserslautern, June 2011.
[21] Arnd Poetzsch-Heffter. Konzepte objektorientierter Programmierung. eXamen.press.
Springer Berlin Heidelberg, Berlin, Heidelberg, 2009.
[22] Dave Rosenberg. Hadoop, the elephant in the enterprise, June 2012.
http://news.cnet.com/8301-1001 3-57452061-92/hadoop-the-elephant-in-the-enterprise/.
[23] Marc Schäfer. Inkrementelle Wartung von Data Cubes. Bachelor's thesis, TU Kaiserslautern, January 2012.
[24] Sanjay Sharma. Advanced Hadoop Tuning and Optimizations, 2009.
http://www.slideshare.net/ImpetusInfo/ppt-on-advanced-hadoop-tuning-n-optimisation.
[25] Jason Venner. Pro Hadoop. Apress, Berkeley, CA, 2009.
[26] Dick Weisinger. Big Data: Think of NoSQL As Complementary to Traditional RDBMS, June 2012.
http://www.formtek.com/blog/?p=3032.
[27] Tom White. 10 MapReduce Tips, May 2009. http://www.cloudera.com/blog/2009/05/10-mapreduce-tips/.

Apidays New York 2024 - The value of a flexible API Management solution for O...
 

Marimba - A MapReduce-based Programming Model for Self-maintainable Aggregate Views

  • 1. A MapReduce-based Programming Model for Self-maintainable Aggregate Views 2012-08-31 Johannes Schildgen TU Kaiserslautern schildgen@cs.uni-kl.de
  • 3. 11 26 14 23 37 39 41 26 19 8 25 19 22 15 18 10 16 27 8 9 12 14 15
  • 4. Kinderriegel: 26 Balisto: 39 Hanuta: 14 Snickers: 19 Ritter Sport: 41 Pickup: 12 …
  • 5. Kinderriegel: 26 Balisto: 39 Hanuta: 14 Snickers: 19 Ritter Sport: 41 Pickup: 12 …
  • 6. Kinderriegel: 26 Balisto: 39 Hanuta: 14 Snickers: 19 Ritter Sport: 41 Pickup: 12 … Δ: Balisto: -1
  • 7. Kinderriegel: 26 Balisto: 39 Hanuta: 14 Snickers: 19 Ritter Sport: 41 Pickup: 12 … Δ: Balisto: -8 Snickers: +24 Ritter Sport: -7
  • 8. Increment Installation. Kinderriegel: 26 Balisto: 31 Hanuta: 14 Snickers: 43 Ritter Sport: 34 Pickup: 12 … Δ: Balisto: -8 Snickers: +24 Ritter Sport: -7
  • 9. Overwrite Installation. Old view: Kinderriegel: 26 Balisto: 39 Hanuta: 14 Snickers: 19 Ritter Sport: 41 Pickup: 12 … Δ: Balisto: -8 Snickers: +24 Ritter Sport: -7. New view: Kinderriegel: 26 Balisto: 31 Hanuta: 14 Snickers: 43 Ritter Sport: 34 Pickup: 12
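The two installation strategies on the slides above can be sketched in plain Java. This is a minimal illustration that uses in-memory maps instead of HBase tables; the class name `InstallationSketch` and its methods are hypothetical and not part of Marimba. The sample numbers are the ones shown on the slides.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: in-memory maps stand in for the HBase view table.
final class InstallationSketch {

    /** Increment installation: add each delta to the stored value in place. */
    static void installIncrement(Map<String, Long> view, Map<String, Long> delta) {
        for (Map.Entry<String, Long> e : delta.entrySet()) {
            view.merge(e.getKey(), e.getValue(), Long::sum);
        }
    }

    /** Overwrite installation: combine old results with the deltas and write a complete new view. */
    static Map<String, Long> installOverwrite(Map<String, Long> oldView, Map<String, Long> delta) {
        Map<String, Long> newView = new LinkedHashMap<>(oldView);
        for (Map.Entry<String, Long> e : delta.entrySet()) {
            newView.merge(e.getKey(), e.getValue(), Long::sum);
        }
        return newView;
    }

    /** The view from the slides before the changes are applied. */
    static Map<String, Long> sampleView() {
        Map<String, Long> view = new LinkedHashMap<>();
        view.put("Kinderriegel", 26L);
        view.put("Balisto", 39L);
        view.put("Hanuta", 14L);
        view.put("Snickers", 19L);
        view.put("Ritter Sport", 41L);
        view.put("Pickup", 12L);
        return view;
    }

    /** The delta from the slides. */
    static Map<String, Long> sampleDelta() {
        Map<String, Long> delta = new LinkedHashMap<>();
        delta.put("Balisto", -8L);
        delta.put("Snickers", 24L);
        delta.put("Ritter Sport", -7L);
        return delta;
    }

    public static void main(String[] args) {
        Map<String, Long> incremented = sampleView();
        installIncrement(incremented, sampleDelta());
        System.out.println("increment: " + incremented);
        System.out.println("overwrite: " + installOverwrite(sampleView(), sampleDelta()));
    }
}
```

Both strategies end with the same view; they differ in whether the old results are updated in place or read and rewritten as a whole.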
  • 12. Agenda: Fundamentals & Related Work | The Marimba Framework | Evaluation
  • 16. public class WordCount extends Configured implements Tool { public static class WordCountMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, LongWritable> { private Text word = new Text(); @Override public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { this.word.set(tokenizer.nextToken());
  • 17. A A E B E F C F B D C D
  • 18. > create 'person', 'default' > put 'person', 'p27', 'default:forename', 'Anton' > put 'person', 'p27', 'default:surname', 'Schmidt' > get 'person', 'p27' COLUMN CELL default:forename timestamp=1338991436688, value=Anton default:surname timestamp=1338991497408, value=Schmidt 2 row(s) in 0.0640 seconds
  • 19. View: banane 3, die 4, iss 1, nimm 1, schale 1, schäle 1, schmeiß 1, weg 1. Δ: kaufe 1, die 0, banane 1, schmeiß -1, schale -1, weg -1
  • 20. increment(). View: banane 4, die 4, iss 1, kaufe 1, nimm 1, schale 0, schäle 1, schmeiß 0, weg 0. Δ: kaufe 1, die 0, banane 1, schmeiß -1, schale -1, weg -1
  • 21. overwrite(). View: banane 3, die 4, iss 1, nimm 1, schale 1, schäle 1, schmeiß 1, weg 1. Δ: kaufe 1, die 0, banane 1, schmeiß -1, schale -1, weg -1
  • 22. overwrite(). View: banane 4, die 4, iss 1, kaufe 1, nimm 1, schale 0, schäle 1, schmeiß 0, weg 0. Δ: kaufe 1, die 0, banane 1, schmeiß -1, schale -1, weg -1
  • 23. void map(key, value) { if(value is inserted) { for(word : value.split(" ")) { write(word, 1); } } else if(value is deleted) { for(word : value.split(" ")) { write(word, -1); } } }
  • 24. void map(key, value) { if(value is inserted) { for(word : value.split(" ")) { write(word, 1); } } else if(value is deleted) { for(word : value.split(" ")) { write(word, -1); } } else { // old result write(key, value); } }
  • 25. Overwrite Installation void reduce(key, values) { sum = 0; for(value : values) { sum += value; } put = new Put(key); put.add("fam", "col", sum); context.write(key, put); }
  • 26. Increment Installation void reduce(key, values) { sum = 0; for(value : values) { sum += value; } inc = new Increment(key); inc.add("fam", "col", sum); context.write(key, inc); }
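The map and reduce pseudocode above can be tried out without a cluster. The following sketch folds both phases into one method and runs them on two lines of text. The two input lines are an assumption: they were chosen so that the result reproduces the Δ shown on the WordCount slides. The class `DeltaWordCount` is hypothetical and is not Hadoop or Marimba code.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process sketch of the map/reduce pseudocode from the slides.
final class DeltaWordCount {

    enum Change { INSERTED, DELETED }

    /**
     * Map phase: emit (word, +1) for an inserted line and (word, -1) for a deleted line.
     * Reduce phase: sum the emitted values per word (done here directly via Map.merge).
     */
    static void mapAndReduce(String line, Change change, Map<String, Long> delta) {
        long sign = (change == Change.INSERTED) ? 1L : -1L;
        for (String word : line.split(" ")) {
            delta.merge(word, sign, Long::sum);
        }
    }

    /** Assumed sample input: one inserted and one deleted line. */
    static Map<String, Long> sampleDelta() {
        Map<String, Long> delta = new TreeMap<>();
        mapAndReduce("kaufe die banane", Change.INSERTED, delta);
        // "schmei\u00df" is "schmeiß", written with a Unicode escape to stay encoding-safe.
        mapAndReduce("schmei\u00df die schale weg", Change.DELETED, delta);
        return delta;
    }

    public static void main(String[] args) {
        // "die" occurs in both lines, so its delta is 0.
        System.out.println(sampleDelta());
    }
}
```

A delta of 0, as for "die" here, means the stored count does not have to change.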
  • 31. Agenda: Fundamentals & Related Work | The Marimba Framework | Evaluation
  • 32. Hadoop: "Core functionality: Distributed computations with MapReduce." Marimba: "Core functionality: Incremental computations. I care about: IncDec, Overwrite, reading old results, producing Increments, …" Developer: "I tell you how to read input data, aggregate, invert and write the output."
  • 33. public class WordTranslator extends Translator<LongWritable, Text> { public void translate(…) { … } } … IncJob job = new IncJobOverwrite(conf); job.setTranslatorClass(WordTranslator.class); job.setAbelianClass(WordAbelian.class);
  • 34. public class WordAbelian implements Abelian<WordAbelian> { WordAbelian invert() { … } WordAbelian aggregate(WordAbelian other) { … } WordAbelian neutral() { … } boolean isNeutral() { … } Writable extractKey() { … } void write(…) { … } void readFields(…) { … } }
  • 35. public class WordSerializer implements Serializer<WordAbelian> { Writable serialize(Writable key, WordAbelian v) { … } WordAbelian deserializeHBase( byte[] rowId, byte[] colFamily, byte[] qualifier, byte[] value) { … } }
  • 36. How To Write A Marimba-Job 1. Abelian-Class 2. Translator-Class 3. Serializer-Class 4. Write a Hadoop-Job and use the class IncJob
  • 37. Implementation. IncJob: setInputTable(…), setOutputTable(…). Variants: IncJobFull, IncJobIncDec, IncJobOverwrite. Recomputation. setResultInputTable(…)
  • 39. public interface Abelian<T extends Abelian<?>> extends WritableComparable<BinaryComparable>{ T invert(); T aggregate(T other); T neutral(); boolean isNeutral(); Writable extractKey(); void write(…); void readFields(…); }
  • 40. public interface Serializer<T extends Abelian<?>> { Writable serialize(T obj); T deserializeHBase( byte[] rowId, byte[] colFamily, byte[] qualifier, byte[] value); }
  • 41. public abstract class Translator<KEYIN, VALUEIN> { public abstract void translate(KEYIN key, VALUEIN value, Context context); … this.mapContext.write(abelianValue.extractKey(), this.invertValue ? abelianValue.invert() : abelianValue); … }
  • 42. GenericMapper. Value from InputFormat: OverwriteResult → deserialize; InsertedValue → translate; DeletedValue → set invertValue=true; translate; PreservedValue → ignore
  • 43. GenericReducer: 1. aggregate 2. serialize 3. write. IncDec: putToIncrement(…). Overwrite: PUT → write; IGNORE → don't write; DELETE → putToDelete(...)
  • 44. GenericCombiner: 1. aggregate. "Write A Combiner" (7 Tips for Improving MapReduce Performance, Tip 4)
  • 46. Example: 1. WordCount. Translator: void translate(key, value) { for(word : value.split(" ")) { write(new WordAbelian(word, 1)); } } Serializer: Writable serialize(WordAbelian w) { Put p = new Put(w.getWord()); p.add(…); return p; } WordAbelian: WordAbelian invert() { return new WordAbelian(this.word, -1 * this.count); } WordAbelian aggregate(WordAbelian other) { return new WordAbelian(this.word, this.count + other.count); } WordAbelian neutral() { return new WordAbelian(this.word, 0); } boolean isNeutral() { return (this.count == 0); }
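To make the roles of invert(), aggregate(), neutral() and isNeutral() in the WordCount example concrete, here is a self-contained sketch. It is not the Marimba API: `SimpleAbelian`, `WordCountValue` and `AbelianDemo` are hypothetical names, and the Hadoop-specific methods from the slides (extractKey(), write(…), readFields(…)) are left out so that the code runs without Hadoop.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical, Hadoop-free stand-in for the Abelian interface shown on the slides. */
interface SimpleAbelian<T extends SimpleAbelian<T>> {
    T invert();
    T aggregate(T other);
    T neutral();
    boolean isNeutral();
}

/** A (word, count) pair; counts form an Abelian group under addition. */
final class WordCountValue implements SimpleAbelian<WordCountValue> {
    final String word;
    final long count;

    WordCountValue(String word, long count) {
        this.word = word;
        this.count = count;
    }

    @Override public WordCountValue invert() { return new WordCountValue(word, -count); }

    @Override public WordCountValue aggregate(WordCountValue other) {
        return new WordCountValue(word, count + other.count);
    }

    @Override public WordCountValue neutral() { return new WordCountValue(word, 0); }

    @Override public boolean isNeutral() { return count == 0; }
}

final class AbelianDemo {

    /** Folds all values for one key, starting from the neutral element. */
    static WordCountValue aggregateAll(List<WordCountValue> values) {
        WordCountValue acc = values.get(0).neutral();
        for (WordCountValue v : values) {
            acc = acc.aggregate(v);
        }
        return acc;
    }

    public static void main(String[] args) {
        WordCountValue oldResult = new WordCountValue("die", 4);
        WordCountValue inserted = new WordCountValue("die", 1);
        WordCountValue deleted = new WordCountValue("die", 1).invert(); // deletions are inverted
        WordCountValue updated = aggregateAll(Arrays.asList(oldResult, inserted, deleted));
        System.out.println(updated.word + " = " + updated.count);
    }
}
```

Because an inserted value and the inverse of a deleted value cancel to the neutral element, aggregating the old result with the changes yields the same result as a full recomputation.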
  • 47. Example: 2. Friends Of Friends FRIENDS A D B FRIENDS OF FRIENDS C E
  • 48. Example: 2. Friends Of Friends. translate(person, friends); aggregate(…): merge friends-of-friends sets
  • 49. Example: 3. Reverse WebLink-Graph. REVERSE WEB LINK GRAPH (Row-ID -> Columns): Google -> {eBay, Wikipedia}; eBay -> {Google, Wikipedia}; Mensa-KL -> {Google}; Facebook -> {Google, Mensa-KL, Uni-KL}; Wikipedia -> {Google}; Uni-KL -> {Google, Wikipedia}. aggregate(…): merge link-sets
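The "merge link-sets" aggregation of the Reverse WebLink-Graph example can be illustrated with a small in-memory sketch. The forward link graph used below is an assumption: it was derived by inverting the reverse graph listed on the slide and is not shown in the original. `ReverseLinkGraph` is a hypothetical class, not part of Marimba.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: builds the reverse web link graph by merging link sets per target page.
final class ReverseLinkGraph {

    /** For every link source -> target, add source to the link set of target. */
    static Map<String, Set<String>> reverse(Map<String, Set<String>> forward) {
        Map<String, Set<String>> reversed = new TreeMap<>();
        for (Map.Entry<String, Set<String>> page : forward.entrySet()) {
            for (String target : page.getValue()) {
                // Merging into the existing set plays the role of aggregate(...).
                reversed.computeIfAbsent(target, k -> new TreeSet<>()).add(page.getKey());
            }
        }
        return reversed;
    }

    private static Set<String> setOf(String... pages) {
        return new TreeSet<>(Arrays.asList(pages));
    }

    /** Assumed forward graph, derived from the reverse graph on the slide. */
    static Map<String, Set<String>> sampleForward() {
        Map<String, Set<String>> forward = new TreeMap<>();
        forward.put("Google", setOf("eBay", "Mensa-KL", "Facebook", "Wikipedia", "Uni-KL"));
        forward.put("eBay", setOf("Google"));
        forward.put("Wikipedia", setOf("Google", "eBay", "Uni-KL"));
        forward.put("Mensa-KL", setOf("Facebook"));
        forward.put("Uni-KL", setOf("Facebook"));
        return forward;
    }

    public static void main(String[] args) {
        reverse(sampleForward()).forEach((page, linkedFrom) ->
                System.out.println(page + " -> " + linkedFrom));
    }
}
```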
  • 50. Example: 4. Bigrams. "Hi, can you ___?___ pick me up at the station? In about 10 ___?___. Lots of dear ___?___. P.S. I have a lot ___?___." (translated from German)
  • 54. Example: 4. Bigrams. NGramAbelian (fields a, b, count): extractKey(); write(…); invert(): count*=-1; aggregate(… other): count+=other.count; neutral(): count=0; isNeutral(): count==0. NGramStep2Abelian
  • 55. Example applications: 4. Bigrams. NGramAbelian (fields a, b, count): extractKey(); write(…); invert(): count*=-1; aggregate(… other): count+=other.count; neutral(): count=0; isNeutral(): count==0. NGramStep2Abelian
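  • 56. "Which input data?"
  • 57. Suggested next words (sentences translated from German): "Hi, can you ___ pick me up at the station?" → bitte (please) | nicht (not). "In about 10 ___." → <num> Minuten (minutes) | <num> Jahre (years). "Lots of dear ___." → Grüße (regards) | dich (you). "P.S. I have a lot ___." → zu (to) | Spaß (fun).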
  • 58. Agenda: Fundamentals & Related Work | The Marimba Framework | Evaluation
  • 59. WordCount: chart of runtime (Time [hh:mm], axis from 00:00 to 01:10) against the share of changes (0% to 80%) for FULL, INCDEC and OVERWRITE
  • 60. Reverse Weblink-Graph: chart of runtime (Time [hh:mm], axis from 00:00 to 02:51) against the share of changes (0% to 80%) for FULL, INCDEC and OVERWRITE
  • 62. Images. Slides 5-9: Flames and smiley: Microsoft Office 2010. Slide 10: Google: http://www.google.de. Slide 11: Amazon: http://www.amazon.de. Slide 12: Hadoop: http://hadoop.apache.org; Casio Wristwatch: http://www.flickr.com/photos/andresrueda/3448240252. Slide 16: Hadoop: http://hadoop.apache.org. Slide 17: Hadoop: http://hadoop.apache.org; Notebook: Microsoft Office 2010. Slide 18: HBase: http://hbase.apache.org. Slide 31: Hadoop: http://hadoop.apache.org; Casio Wristwatch: http://www.flickr.com/photos/andresrueda/3448240252. Slide 32: Scaffold: http://www.flickr.com/photos/michale/94538528/; Hadoop: http://hadoop.apache.org; Boy: Microsoft Office 2010. Slides 37-44: Puzzle: http://www.flickr.com/photos/dps/136565237/. Slides 46-48: Boy: Microsoft Office 2010. Slide 49: Google: http://www.google.de; eBay: http://www.ebay.de; Mensa-KL: http://www.mensa-kl.de; facebook: http://www.facebook.de; Wikipedia: http://de.wikipedia.org; TU Kaiserslautern: http://www.uni-kl.de. Slides 50-51: Mobile phone: Microsoft Office 2010. Slide 56: Wikipedia: http://de.wikipedia.org; Twitter: http://www.twitter.com. Slide 57: Mobile phone: Microsoft Office 2010. Slide 58: Hadoop: http://hadoop.apache.org; Casio Wristwatch: http://www.flickr.com/photos/andresrueda/3448240252
  • 63. Bibliography (1/2) [0] Johannes Schildgen. Ein MapReduce-basiertes Programmiermodell für selbstwartbare Aggregatsichten. Master's thesis, TU Kaiserslautern, August 2012. [1] Apache Hadoop project. http://hadoop.apache.org/. [2] Virga: Incremental Recomputations in MapReduce. http://wwwlgis.informatik.uni-kl.de/cms/?id=526. [3] Philippe Adjiman. Hadoop Tutorial Series, Issue #4: To Use Or Not To Use A Combiner, 2010. http://www.philippeadjiman.com/blog/2010/01/14/hadoop-tutorial-series-issue-4-to-use-or-not-to-use-a-combiner/. [4] Kai Biermann. Big Data: Twitter wird zum Fieberthermometer der Gesellschaft, April 2012. http://www.zeit.de/digital/internet/2012-04/twitter-krankheiten-nowcast. [5] Julie Bort. 8 Crazy Things IBM Scientists Have Learned Studying Twitter, January 2012. http://www.businessinsider.com/8-crazy-things-ibm-scientists-have-learned-studying-twitter-2012-1. [6] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI, pages 137–150, 2004. [7] Lars George. HBase: The Definitive Guide. O'Reilly Media, 1st edition, 2011. [8] Brown University Data Management Group. A Comparison of Approaches to Large-Scale Data Analysis. http://database.cs.brown.edu/projects/mapreduce-vs-dbms/. [9] Ricky Ho. Map/Reduce to recommend people connection, August 2010. http://horicky.blogspot.de/2010/08/mapreduce-to-recommend-people.html. [10] Yong Hu. Efficiently Extracting Change Data from HBase. April 2012. [11] Thomas Jörg, Roya Parvizi, Hu Yong, and Stefan Dessloch. Can mapreduce learn from materialized views? In LADIS 2011, pages 1–5, 9 2011. [12] Thomas Jörg, Roya Parvizi, Hu Yong, and Stefan Dessloch. Incremental recomputations in mapreduce. In CloudDB 2011, 10 2011. [13] Steve Krenzel. MapReduce: Finding Friends, 2010. http://stevekrenzel.com/finding-friends-with-mapreduce. [14] Todd Lipcon. 7 Tips for Improving MapReduce Performance, 2009. http://www.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/.
  • 64. Bibliography (2/2) [15] TouchType Ltd. SwiftKey X - Android Apps auf Google Play, February 2012. http://play.google.com/store/apps/details?id=com.touchtype.swiftkey. [16] Karl H. Marbaise. Hadoop - Think Large!, 2011. http://www.soebes.de/files/RuhrJUGEssenHadoop-20110217.pdf. [17] Arnab Nandi, Cong Yu, Philip Bohannon, and Raghu Ramakrishnan. Distributed Cube Materialization on Holistic Measures. ICDE, pages 183–194, 2011. [18] Alexander Neumann. Studie: Hadoop wird ähnlich erfolgreich wie Linux, May 2012. http://heise.de/-1569837. [19] Owen O'Malley, Jack Hebert, Lohit Vijayarenu, and Amar Kamat. Partitioning your job into maps and reduces, September 2009. http://wiki.apache.org/hadoop/HowManyMapsAndReduces?action=recall&rev=7. [20] Roya Parvizi. Inkrementelle Neuberechnungen mit MapReduce. Bachelor's thesis, TU Kaiserslautern, June 2011. [21] Arnd Poetzsch-Heffter. Konzepte objektorientierter Programmierung. eXamen.press. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009. [22] Dave Rosenberg. Hadoop, the elephant in the enterprise, June 2012. http://news.cnet.com/8301-1001 3-57452061-92/hadoop-the-elephant-in-the-enterprise/. [23] Marc Schäfer. Inkrementelle Wartung von Data Cubes. Bachelor's thesis, TU Kaiserslautern, January 2012. [24] Sanjay Sharma. Advanced Hadoop Tuning and Optimizations, 2009. http://www.slideshare.net/ImpetusInfo/ppt-on-advanced-hadoop-tuning-n-optimisation. [25] Jason Venner. Pro Hadoop. Apress, Berkeley, CA, 2009. [26] Dick Weisinger. Big Data: Think of NoSQL As Complementary to Traditional RDBMS, June 2012. http://www.formtek.com/blog/?p=3032. [27] Tom White. 10 MapReduce Tips, May 2009. http://www.cloudera.com/blog/2009/05/10-mapreduce-tips/.