Data-Intensive Computing for Text Analysis
                 CS395T / INF385T / LIN386M
            University of Texas at Austin, Fall 2011



                      Lecture 3
                  September 8, 2011

        Jason Baldridge                      Matt Lease
   Department of Linguistics           School of Information
  University of Texas at Austin    University of Texas at Austin
Jasonbaldridge at gmail dot com   ml at ischool dot utexas dot edu
Acknowledgments
      Course design and slides derived from
     Jimmy Lin’s cloud computing courses at
    the University of Maryland, College Park

Some figures courtesy of the following
excellent Hadoop books (order yours today!)
• Chuck Lam’s Hadoop In Action (2010)
• Tom White’s Hadoop: The Definitive Guide,
  2nd Edition (2010)
Today’s Agenda
• Review
• Toward MapReduce “design patterns”
  – Building block: preserving state across calls
  – In-Map & In-Mapper combining (vs. combiners)
  – Secondary sorting (via value-to-key Conversion)
  – Pairs and Stripes
  – Order Inversion
• Group Work (examples)
  – Interlude: scaling counts, TF-IDF
Review
MapReduce: Recap
Required:
   map ( K1, V1 ) → list ( K2, V2 )
   reduce ( K2, list(V2) ) → list ( K3, V3)
All values with the same key are reduced together
Optional:
   partition (K2, N) → Rj      maps K2 to some reducer Rj in [1..N]
      Often a simple hash of the key, e.g., hash(k’) mod n
      Divides up key space for parallel reduce operations


   combine ( K2, list(V2) ) → list ( K2, V2 )
      Mini-reducers that run in memory after the map phase
      Used as an optimization to reduce network traffic


The execution framework handles everything else…
“Everything Else”
    The execution framework handles everything else…
        Scheduling: assigns workers to map and reduce tasks
        "Data distribution": moves processes to data
        Synchronization: gathers, sorts, and shuffles intermediate data
        Errors and faults: detects worker failures and restarts
    Limited control over data and execution flow
        All algorithms must be expressed in m, r, c, p
    You don’t know:
        Where mappers and reducers run
        When a mapper or reducer begins or finishes
        Which input a particular mapper is processing
        Which intermediate key a particular reducer is processing
[Diagram: MapReduce data flow. Input pairs (k1,v1) … (k6,v6) are split across four map tasks,
which emit (a,1) (b,2) | (c,3) (c,6) | (a,5) (c,2) | (b,7) (c,8). Combiners aggregate locally,
e.g. (c,3)+(c,6) → (c,9). Partitioners assign keys to reducers. Shuffle and Sort aggregates
values by key: a → [1,5], b → [2,7], c → [2,9,8]. Three reducers emit (r1,s1), (r2,s2), (r3,s3).]
Shuffle and Sort
[Diagram: on the map side, output fills a circular buffer in memory and spills to disk; spills are
merged on disk (with the Combiner optionally re-applied) into intermediate files that are served to
reducers. On the reduce side, each Reducer fetches and merges its partition from this mapper and
from other mappers, while this mapper also feeds the other reducers.]
Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 178


    Shuffle and 2 Sorts




   As map emits values, local sorting
    runs in tandem (1st sort)
   Combine is optionally called
    0..N times for local aggregation
    on sorted (K2, list(V2)) tuples (more sorting of output)
   Partition determines which (logical) reducer Rj each key will go to
   Node’s TaskTracker tells JobTracker it has keys for Rj
   JobTracker determines node to run Rj based on data locality
   When local map/combine/sort finishes, sends data to Rj’s node
   Rj’s node iteratively merges inputs from map nodes as they arrive (2nd sort)
   For each (K, list(V)) tuple in merged output, call reduce(…)
Scalable Hadoop Algorithms: Themes
   Avoid object creation
       Inherently costly operation
       Garbage collection
   Avoid buffering
       Limited heap size
       Works for small datasets, but won’t scale!
         • Yet… we’ll talk about patterns involving buffering…
Importance of Local Aggregation
   Ideal scaling characteristics:
       Twice the data, twice the running time
       Twice the resources, half the running time
   Why can’t we achieve this?
       Synchronization requires communication
       Communication kills performance
   Thus… avoid communication!
       Reduce intermediate data via local aggregation
       Combiners can help
Tools for Synchronization
    Cleverly-constructed data structures
        Bring partial results together
    Sort order of intermediate keys
        Control order in which reducers process keys
    Partitioner
        Control which reducer processes which keys
    Preserving state in mappers and reducers
        Capture dependencies across multiple keys and values
Secondary Sorting
   MapReduce sorts input to reducers by key
       Values may be arbitrarily ordered
   What if we want to sort values also?
       E.g., k → (k1,v1), (k1,v3), (k1,v4), (k1,v8)…
   Solutions?
       Swap key and value to sort by value?
       What if we use (k,v) as a joint key (and change nothing else)?
Secondary Sorting: Solutions
   Solution 1: Buffer values in memory, then sort
       Tradeoffs?
    Solution 2: "Value-to-key conversion" design pattern
       Form composite intermediate key: (k, v1)
       Let execution framework do the sorting
       Preserve state across multiple key-value pairs
       …how do we make this happen?
Secondary Sorting (Lin 57, White 241)
    Create composite key: (k,v)
    Define a Key Comparator to sort via both
        Possibly not needed in some cases (e.g. strings & concatenation)
    Define a partition function based only on the (original) key
        All pairs with same key should go to same reducer
        Multiple keys may still go to the same reduce node; how do you
         know when the key changes across invocations of reduce()?
          • i.e. assume you want to do something with all values associated with
            a given key (e.g. print all on the same line, with no other keys)
    Preserve state in the reducer across invocations
        reduce() will be called separately for each pair, but we need to
         track the current key so we can detect when it changes


 Hadoop also provides Group Comparator
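
A minimal Java sketch of value-to-key conversion (not from the slides; class and field names are
illustrative): a composite key whose comparator sorts by the natural key and then by the value,
plus a partitioner that hashes only the natural key so all pairs for k reach the same reducer.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Partitioner;

// Composite intermediate key (k, v); hypothetical field names, shown only to illustrate the pattern.
public class CompositeKey implements WritableComparable<CompositeKey> {
    public String naturalKey;   // the original key k
    public long value;          // the value we also want sorted

    public void write(DataOutput out) throws IOException {
        out.writeUTF(naturalKey);
        out.writeLong(value);
    }

    public void readFields(DataInput in) throws IOException {
        naturalKey = in.readUTF();
        value = in.readLong();
    }

    // Key comparator behavior: natural key first, then value (the secondary sort).
    public int compareTo(CompositeKey other) {
        int cmp = naturalKey.compareTo(other.naturalKey);
        return cmp != 0 ? cmp : Long.compare(value, other.value);
    }
}

// Partition on the natural key only, so every (k, *) pair goes to the same reducer.
class NaturalKeyPartitioner extends Partitioner<CompositeKey, NullWritable> {
    @Override
    public int getPartition(CompositeKey key, NullWritable val, int numPartitions) {
        return (key.naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

A Group Comparator that compares only naturalKey would then hand all values for k to a single
reduce() call, avoiding the manual key-change tracking described above.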
Preserving State in Hadoop
[Diagram: one Mapper object and one Reducer object per task, each holding its own state across calls.
    configure: API initialization hook
    map: one call per input key-value pair        reduce: one call per intermediate key
    close: API cleanup hook]
Combiner Design
   Combiners and reducers share same method signature
       Sometimes, reducers can serve as combiners
       Often, not…
   Remember: combiners are optional optimizations
       Should not affect algorithm correctness
       May be run 0, 1, or multiple times
“Hello World”: Word Count
map ( K1=String, V1=String ) → list ( K2=String, V2=Integer )

reduce ( K2=String, list(V2=Integer) ) → list ( K3=String, V3=Integer)



    Map(String docid, String text):
      for each word w in text:
          Emit(w, 1);

    Reduce(String term, Iterator<Int> values):
      int sum = 0;
      for each v in values:
          sum += v;
      Emit(term, sum);
“Hello World”: Word Count
map ( K1=String, V1=String ) → list ( K2=String, V2=Integer )

reduce ( K2=String, list(V2=Integer) ) → list ( K3=String, V3=Integer)



    Map(String docid, String text):
      for each word w in text:
          Emit(w, 1);

    Reduce(String term, Iterator<Int> values):
      int sum = 0;
      for each v in values:
          sum += v;
      Emit(term, sum);

    Combiner?
MapReduce Algorithm Design
Design Pattern for Local Aggregation
   "In-mapper combining"
       Fold the functionality of the combiner into the mapper,
        including preserving state across multiple map calls
   Advantages
       Speed
       Why is this faster than actual combiners?
         • Construction/deconstruction, serialization/deserialization
         • Guarantee and control use
   Disadvantages
       Buffering! Explicit memory management required
          • Can use a disk-backed buffer, based on # items or bytes in memory
          • What if multiple mappers are running on the same node? Do we know?
       Potential for order-dependent bugs
“Hello World”: Word Count
map ( K1=String, V1=String ) → list ( K2=String, V2=Integer )

reduce ( K2=String, list(V2=Integer) ) → list ( K3=String, V3=Integer)



    Map(String docid, String text):
      for each word w in text:
          Emit(w, 1);

    Reduce(String term, Iterator<Int> values):
      int sum = 0;
      for each v in values:
          sum += v;
      Emit(term, sum);

    Combine = reduce
Word Count: in-map combining




Are combiners still needed?
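
The slide's code is an image and did not survive extraction; a minimal sketch (assumed input types
and whitespace tokenization) of per-document in-map combining, where counts are aggregated in a map
that is local to a single map() call:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapCombiningMapper extends Mapper<Object, Text, Text, IntWritable> {
    @Override
    protected void map(Object docid, Text doc, Context context)
            throws IOException, InterruptedException {
        Map<String, Integer> counts = new HashMap<>();   // local to this single map() call
        for (String word : doc.toString().split("\\s+"))
            if (!word.isEmpty())
                counts.merge(word, 1, Integer::sum);      // aggregate within the document
        for (Map.Entry<String, Integer> e : counts.entrySet())
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
}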
Word Count: in-mapper combining




Are combiners still needed?
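
Likewise, a minimal sketch of in-mapper combining: the buffer lives in the Mapper object,
accumulates across all map() calls, and is flushed once in cleanup() (the new-API analogue of close()).

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapperCombiningMapper extends Mapper<Object, Text, Text, IntWritable> {
    private Map<String, Integer> counts;            // state preserved across map() calls

    @Override
    protected void setup(Context context) {         // "configure" hook
        counts = new HashMap<>();
    }

    @Override
    protected void map(Object docid, Text doc, Context context) {
        for (String word : doc.toString().split("\\s+"))
            if (!word.isEmpty())
                counts.merge(word, 1, Integer::sum); // aggregate locally; emit nothing yet
    }

    @Override
    protected void cleanup(Context context)          // "close" hook: flush the buffer once
            throws IOException, InterruptedException {
        for (Map.Entry<String, Integer> e : counts.entrySet())
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
}

Note that the HashMap can grow with the size of a task's input split, which is exactly the
buffering and memory-management concern raised earlier.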
Example 2: Compute the Mean (v1)




Why can’t we use reducer as combiner?
Example 2: Compute the Mean (v2)




Why doesn’t this work?
Example 2: Compute the Mean (v3)
Computing the Mean:
                 in-mapper combining




Are combiners still needed?
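
The pseudocode for the mean examples is also missing from this text version; a minimal sketch
(assumed key/value types, with partial results encoded as "sum,count" Text) of the in-mapper
combining version, which stays correct by passing sums and counts rather than means:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MeanInMapperCombining {
    public static class MeanMapper extends Mapper<Text, LongWritable, Text, Text> {
        private Map<String, long[]> partials;        // key -> {sum, count}

        @Override protected void setup(Context ctx) { partials = new HashMap<>(); }

        @Override protected void map(Text key, LongWritable value, Context ctx) {
            long[] sc = partials.computeIfAbsent(key.toString(), k -> new long[2]);
            sc[0] += value.get();                    // running sum
            sc[1] += 1;                              // running count
        }

        @Override protected void cleanup(Context ctx) throws IOException, InterruptedException {
            for (Map.Entry<String, long[]> e : partials.entrySet())
                ctx.write(new Text(e.getKey()), new Text(e.getValue()[0] + "," + e.getValue()[1]));
        }
    }

    public static class MeanReducer extends Reducer<Text, Text, Text, Text> {
        @Override protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0, count = 0;
            for (Text v : values) {                  // each value is "partialSum,partialCount"
                String[] parts = v.toString().split(",");
                sum += Long.parseLong(parts[0]);
                count += Long.parseLong(parts[1]);
            }
            ctx.write(key, new Text(Double.toString((double) sum / count)));
        }
    }
}

This is also why the plain mean reducer cannot serve as a combiner: a mean of partial means is
generally not the overall mean, while partial sums and counts combine safely.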
Example 3: Term Co-occurrence
   Term co-occurrence matrix for a text collection
       M = N x N matrix (N = vocabulary size)
       Mij: number of times i and j co-occur in some context
        (for concreteness, let’s say context = sentence)
   Why?
       Distributional profiles as a way of measuring semantic distance
       Semantic distance useful for many language processing tasks
MapReduce: Large Counting Problems
   Term co-occurrence matrix for a text collection
    = specific instance of a large counting problem
       A large event space (number of terms)
       A large number of observations (the collection itself)
       Goal: keep track of interesting statistics about the events
   Basic approach
       Mappers generate partial counts
       Reducers aggregate partial counts



        How do we aggregate partial counts efficiently?
Approach 1: “Pairs”
    Each mapper takes a sentence:
        Generate all co-occurring term pairs
        For all pairs, emit (a, b) → count
    Reducers sum up counts associated with these pairs
    Use combiners!
Pairs: Pseudo-Code
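
The pseudocode on this slide is an image; a minimal sketch (assuming one sentence per input value,
whitespace tokenization, and a tab-separated Text pair key) of the "pairs" mapper and reducer:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class Pairs {
    public static class PairsMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(Object key, Text sentence, Context context)
                throws IOException, InterruptedException {
            String[] words = sentence.toString().split("\\s+");
            for (String u : words)
                for (String w : words)
                    if (!u.equals(w))                          // every co-occurring pair in the sentence
                        context.write(new Text(u + "\t" + w), ONE);
        }
    }

    public static class PairsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text pair, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();       // also usable as the combiner
            context.write(pair, new IntWritable(sum));
        }
    }
}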
“Pairs” Analysis
    Advantages
        Easy to implement, easy to understand
    Disadvantages
        Lots of pairs to sort and shuffle around (upper bound?)
        Not many opportunities for combiners to work
Another Try: “Stripes”
    Idea: group together pairs into an associative array
          (a, b) → 1
          (a, c) → 2
          (a, d) → 5                   a → { b: 1, c: 2, d: 5, e: 3, f: 2 }
          (a, e) → 3
          (a, f) → 2

    Each mapper takes a sentence:
        Generate all co-occurring term pairs
        For each term, emit a → { b: count_b, c: count_c, d: count_d, … }
    Reducers perform element-wise sum of associative arrays
                a → { b: 1,       d: 5, e: 3 }
           +    a → { b: 1, c: 2, d: 2,       f: 2 }
                a → { b: 2, c: 2, d: 7, e: 3, f: 2 }
Stripes: Pseudo-Code
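
Likewise, a minimal sketch of "stripes" (same input assumptions, with Hadoop's MapWritable as
the associative array):

import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class Stripes {
    public static class StripesMapper extends Mapper<Object, Text, Text, MapWritable> {
        @Override
        protected void map(Object key, Text sentence, Context context)
                throws IOException, InterruptedException {
            String[] words = sentence.toString().split("\\s+");
            for (String u : words) {
                MapWritable stripe = new MapWritable();        // u -> { w: count, ... }
                for (String w : words) {
                    if (u.equals(w)) continue;
                    Text t = new Text(w);
                    IntWritable c = (IntWritable) stripe.get(t);
                    stripe.put(t, new IntWritable(c == null ? 1 : c.get() + 1));
                }
                context.write(new Text(u), stripe);
            }
        }
    }

    public static class StripesReducer extends Reducer<Text, MapWritable, Text, MapWritable> {
        @Override
        protected void reduce(Text word, Iterable<MapWritable> stripes, Context context)
                throws IOException, InterruptedException {
            MapWritable sum = new MapWritable();               // element-wise sum of stripes
            for (MapWritable stripe : stripes)
                for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
                    Text coWord = new Text(e.getKey().toString());   // copy: framework may reuse objects
                    IntWritable prev = (IntWritable) sum.get(coWord);
                    int add = ((IntWritable) e.getValue()).get();
                    sum.put(coWord, new IntWritable(prev == null ? add : prev.get() + add));
                }
            context.write(word, sum);
        }
    }
}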
“Stripes” Analysis
    Advantages
        Far less sorting and shuffling of key-value pairs
        Can make better use of combiners
    Disadvantages
        More difficult to implement
        Underlying object more heavyweight
        Fundamental limitation in terms of size of event space
          • Buffering!
Cluster size: 38 cores
Data Source: Associated Press Worldstream (APW) of the English Gigaword Corpus (v3),
which contains 2.27 million documents (1.8 GB compressed, 5.7 GB uncompressed)
Relative Frequencies
   How do we estimate relative frequencies from counts?

           f(B | A) = count(A, B) / count(A) = count(A, B) / Σ_{B'} count(A, B')



   Why do we want to do this?
   How do we do this with MapReduce?
f(B|A): “Stripes”

     a → {b1:3, b2 :12, b3 :7, b4 :1, … }


    Easy!
        One pass to compute (a, *)
        Another pass to directly compute f(B|A)
f(B|A): “Pairs”

         (a, *) → 32    Reducer holds this value in memory

         (a, b1) → 3                        (a, b1) → 3 / 32
         (a, b2) → 12                       (a, b2) → 12 / 32
         (a, b3) → 7                        (a, b3) → 7 / 32
         (a, b4) → 1                        (a, b4) → 1 / 32
         …                                  …


    For this to work:
        Must emit extra (a, *) for every b_n in mapper
        Must make sure all a’s get sent to same reducer (use partitioner)
        Must make sure (a, *) comes first (define sort order)
        Must hold state in reducer across different key-value pairs
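
A minimal sketch of the partitioning piece (hypothetical "a<TAB>b" Text encoding of the pair key,
with the marginal written as "a<TAB>*"): route every pair by its left word so the marginal (a, *)
and all joint counts for a land on the same reducer. With this encoding, '*' also sorts before
alphabetic words in a raw byte comparison, so for typical lowercase word data (a, *) arrives first.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class LeftWordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text pairKey, IntWritable count, int numPartitions) {
        // Hash only the left word so (a, *) and every (a, b) share a reducer.
        String left = pairKey.toString().split("\t", 2)[0];
        return (left.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}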
“Order Inversion”
    Common design pattern
        Computing relative frequencies requires marginal counts
        But the marginal cannot be computed until you see all counts
        Buffering is a bad idea!
        Trick: getting the marginal counts to arrive at the reducer before
         the joint counts
    Optimizations
        Apply in-memory combining pattern to accumulate marginal counts
        Should we apply combiners?
Synchronization: Pairs vs. Stripes
    Approach 1: turn synchronization into an ordering problem
        Sort keys into correct order of computation
        Partition key space so that each reducer gets the appropriate set
         of partial results
        Hold state in reducer across multiple key-value pairs to perform
         computation
        Illustrated by the "pairs" approach
    Approach 2: construct data structures that bring partial
     results together
        Each reducer receives all the data it needs to complete the
         computation
        Illustrated by the "stripes" approach
Recap: Tools for Synchronization
   Cleverly-constructed data structures
       Bring data together
   Sort order of intermediate keys
       Control order in which reducers process keys
   Partitioner
       Control which reducer processes which keys
   Preserving state in mappers and reducers
       Capture dependencies across multiple keys and values
Issues and Tradeoffs
   Number of key-value pairs
       Object creation overhead
       Time for sorting and shuffling pairs across the network
   Size of each key-value pair
       De/serialization overhead
   Local aggregation
       Opportunities to perform local aggregation vary
       Combiners make a big difference
       Combiners vs. in-mapper combining
       RAM vs. disk vs. network
Group Work (Examples)
Task 5
   How many distinct words in the document collection start
    with each letter?
       Note: "types" vs. "tokens"
Task 5
   How many distinct words in the document collection start
    with each letter?
        Note: "types" vs. "tokens"

    Mapper<String,String → String,String>
    Map(String docID, String document)
        for each word in document
             emit (first character, word)




   Ways to make more efficient?
Task 5
   How many distinct words in the document collection start
    with each letter?
        Note: "types" vs. "tokens"

    Mapper<String,String → String,String>
    Map(String docID, String document)
        for each word in document
             emit (first character, word)

    Reducer<String,String → String,Integer>
    Reduce(String letter, Iterator<String> words):
        set of words = empty set;
        for each word
           add word to set
        emit(letter, size of word set)



   Ways to make more efficient?
Task 5b
   How many distinct words in the document collection start
    with each letter?
        How to use in-mapper combining and a separate combiner
        Tradeoffs

    Mapper<String,String → String,String>
    Map(String docID, String document)
        for each word in document
             emit (first character, word)
Task 5b
   How many distinct words in the document collection start
    with each letter?
       How to use in-mapper combining and a separate combiner
       Tradeoffs?

 Mapper<String,String → String,String>
 Map(String docID, String document)
     for each word in document
          emit (first character, word)

 Combiner<String,String → String,String>
 Combine(String letter, Iterator<String> words):
     set of words = empty set;
     for each word
        add word to set
     for each word in set
         emit(letter, word)
Task 6: find median document length
Task 6: find median document length
  Mapper<K1,V1 → Integer,Integer>
  Map(K1 xx, V1 xx)
    10,000 / N times
       emit( length(generateRandomDocument()), 1)
Task 6: find median document length
    Mapper<K1,V1 → Integer,Integer>
    Map(K1 xx, V1 xx)
      10,000 / N times
         emit( length(generateRandomDocument()), 1)

    Reducer<Integer,Integer → Integer,V3>
    Reduce(Integer length, Iterator<Integer> values):
        static list lengths = empty list;
        for each value
           append length to list

    Close() { output median }




   conf.setNumReduceTasks(1)
   Problems with this solution?
Interlude: Scaling counts
    Many applications require counts of words in some
     context.
        E.g. information retrieval, vector-based semantics
    Counts from frequent words like "the" can overwhelm the
     signal from content words such as "stocks" and "football"
    Two strategies for combating high frequency words:
        Use a stop list that excludes them
        Scale the counts such that high frequency words are downgraded.
Interlude: Scaling counts, TF-IDF
    TF-IDF, or term frequency-inverse document frequency,
     is a standard way of scaling.
    Inverse document frequency for a term t is the ratio of the
     number of documents in the collection to the number of
     documents containing t:




    TF-IDF is just the term frequency times the idf:
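
The formulas did not survive the text extraction; from the definitions above (N documents in the
collection, df_t of them containing term t):

    idf_t = N / df_t
    tf-idf_{t,d} = tf_{t,d} × idf_t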
Interlude: Scaling counts using DF
    Recall the word co-occurrence counts task from the earlier
     slides.
         m_ij represents the number of times word j has occurred in the
          neighborhood of word i.
         The row m_i gives a vector profile of word i that we can use for
          tasks like determining word similarity (e.g. using cosine distance)
         Words like "the" will tend to have high counts that we want to scale
          down so they don't dominate this computation.
     The counts in m_ij can be scaled down using df_j. Let's
      create a transformed matrix S where:
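
The slide's definition of S is also missing from this text version; one plausible reconstruction
(an assumption, not necessarily the original formula) divides each count by the document frequency
of the co-occurring word:

    s_ij = m_ij / df_j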
Task 7
     Compute S, the co-occurrence counts scaled by document
      frequency.
       • First: do the simplest mapper
       • Then: simplify things for the reducer
