SlideShare a Scribd company logo
1 of 124
Download to read offline
Large-scale social media analysis with Hadoop

                                            Jake Hofman

                                            Yahoo! Research


                                            May 23, 2010




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   1 / 71
@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   2 / 71
1970s ∼ 101 nodes                      456              JOURNAL OF ANTHROPOLOGICAL RESEARCH
                                                                          FIGURE 1
                                                    Social Network Model of Relationships   in the Karate Club

                                                                            34      1
                                                                     33 3                    2




                                        27                                                                            8

                                        26 i                                                                           9

                                        25                                                                            10



                                                       CONFLICT AND FISSION IN SMALL GROUPS                                  453

                                  to bounded social groups of all types in all settings. Also, the data
                                  required can be collected by a reliable method currently familiar to
                                  anthropologists, the use of nominal scales.
                                                                    19       18             16
                                                                      18   17
                                                            THE ETHNOGRAPHIC RATIONALE
                                      The is the the clubrepresentationline ofis the socialbetween of three years, the indi-1970
                                      This
                                             karate karate was observed for a period two points when 34 two
                                      viduals in
                                                   graphic
                                                               club. A            drawn
                                                                                              relationships among the
                                                                                                                       from
                                  to 1972. In addition to direct observation, the history of outside those of to
                                      individuals being represented consistently      interacted in contexts the club prior
                                      karate classes, workouts, was reconstructed such line
                                  the period of the study and club meetings. Each through drawn is referredandasclub
                                      an edge.
                                                                                                           informants to
                                  records in the university archives. During the period of observation, the
                                  club maintained between 50 and 100 members, and its activities
                                      two individuals consistently were observed to interact outside the
                                  included social affairs (parties, dances, and club
                                      normal activities of the club (karate classes banquets, etc.) Thatwell as       as
          • Few direct observations; highly detailed info on nodes and
                                      an edge is drawn
                                                                                                           meetings).
                                  regularly scheduled ifkarate lessons. could be said to be friends outside the
                                                                the individuals The political organization of
                                  clubthe club activities.This while there was a constitutionin Figure 2. officers,
                                        was informal, and graph is represented as a matrix and four All
                                                                                                                          is,


                                  most decisions were made nondirectional at represent interaction in both
                                      the edges in Figure 1 are by concensus                   club meetings. For its classes,
              edges                                                                   (they
                                  the club employed thepart-time karate instructor, who will possible to to
                                      directions), and a graph is said to be symmetrical.It is also be referred
                                      draw edges that are directed (representing one-way relationships); such
                                  as Mr. Hi.2
                                      At the beginning of the study there was an incipient conflict
          • E.g. karate club (Zachary, 1977)
                                  between the club president, John A., and Mr. Hi over the price of
                                  karate lessons. Mr. Hi, who wished to raise prices, claimed the authority
                                  to set his own lesson fees, since he was the instructor. John A., who
                                  wished to stabilize prices, claimed the authority to set the lesson fees
                                  since he was the club's chief administrator.
@jakehofman   (Yahoo! Research)       As time passed the entire club became w/ Hadoop
                                     Large-scale social media analysis divided over this issue, and                                May 23, 2010   3 / 71
1990s ∼ 104 nodes




          • Larger, indirect samples; relatively few details on nodes and
              edges
          • E.g. APS co-authorship network (http://bit.ly/aps08jmh)

@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   4 / 71
Present ∼ 107 nodes +




          • Very large, dynamic samples; many details in node and edge
              metadata
          • E.g. Mail, Messenger, Facebook, Twitter, etc.



@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   5 / 71
@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   6 / 71
What could you ask of it?




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   6 / 71
Look familiar?




          # ls -talh neat_dataset.tar.gz
          -rw-r--r-- 100T May 23 13:00 neat_dataset.tar.gz




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   8 / 71
Look familiar?1




          # ls -talh twitter_rv.tar
          -rw-r--r-- 24G May 23 13:00 twitter_rv.tar




          1
              http://an.kaist.ac.kr/traces/WWW2010.html
@jakehofman    (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   10 / 71
Agenda




                      Large-scale social media analysis with Hadoop




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   11 / 71
Agenda




                      Large-scale social media analysis with Hadoop


                                  GB/TB/PB-scale, 10,000+ nodes




@jakehofman   (Yahoo! Research)      Large-scale social media analysis w/ Hadoop   May 23, 2010   11 / 71
Agenda




                      Large-scale social media analysis with Hadoop


                                      network & text data




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   11 / 71
Agenda




                      Large-scale social media analysis with Hadoop


                             network analysis & machine learning




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   11 / 71
Agenda




                      Large-scale social media analysis with Hadoop


        open source Apache project for distributed storage/computation




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   11 / 71
Warning



      You may be bored if you already know how to ...




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   12 / 71
Warning



      You may be bored if you already know how to ...
          • Install and use Hadoop (on a single machine and EC2)




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   12 / 71
Warning



      You may be bored if you already know how to ...
          • Install and use Hadoop (on a single machine and EC2)
          • Run jobs in local and distributed modes




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   12 / 71
Warning



      You may be bored if you already know how to ...
          • Install and use Hadoop (on a single machine and EC2)
          • Run jobs in local and distributed modes
          • Implement distributed solutions for:




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   12 / 71
Warning



      You may be bored if you already know how to ...
          • Install and use Hadoop (on a single machine and EC2)
          • Run jobs in local and distributed modes
          • Implement distributed solutions for:
              • Parsing and manipulating large text collections




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   12 / 71
Warning



      You may be bored if you already know how to ...
          • Install and use Hadoop (on a single machine and EC2)
          • Run jobs in local and distributed modes
          • Implement distributed solutions for:
              • Parsing and manipulating large text collections
              • Clustering coefficient, BFS, etc., for networks w/ billions of
                edges




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   12 / 71
Warning



      You may be bored if you already know how to ...
          • Install and use Hadoop (on a single machine and EC2)
          • Run jobs in local and distributed modes
          • Implement distributed solutions for:
              • Parsing and manipulating large text collections
              • Clustering coefficient, BFS, etc., for networks w/ billions of
                edges
              • Classification, clustering




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   12 / 71
Selected resources




                                  http://www.hadoopbook.com/


@jakehofman   (Yahoo! Research)     Large-scale social media analysis w/ Hadoop   May 23, 2010   13 / 71
Selected resources




                                  http://www.cloudera.com/




@jakehofman   (Yahoo! Research)    Large-scale social media analysis w/ Hadoop   May 23, 2010   13 / 71
Selected resources



                                                                                                                         i


                            Data-Intensive Text Processing
                            with MapReduce

                            Jimmy Lin and Chris Dyer
                            University of Maryland, College Park

                            Draft of February 19, 2010

              http://www.umiacs.umd.edu/∼jimmylin/book.html




                            This is a (partial) draft of a book that is in preparation for Morgan & Claypool Synthesis
                            Lectures on Human Language Technologies. Anticipated publication date is mid-2010.
                            Comments and feedback are welcome!



@jakehofman   (Yahoo! Research)          Large-scale social media analysis w/ Hadoop                                         May 23, 2010   13 / 71
Selected resources




                                     ... and many more at

                       http://delicious.com/jhofman/hadoop




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   13 / 71
Selected resources




                                     ... and many more at

                    http://delicious.com/pskomoroch/hadoop




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   13 / 71
Outline




     1 Background (5 Ws)



     2 Introduction to MapReduce (How, Part I)



     3 Applications (How, Part II)




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   14 / 71
What?




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   15 / 71
What?


              “... to create building blocks for programmers who just
              happen to have lots of data to store, lots of data to
              analyze, or lots of machines to coordinate, and who
              don’t have the time, the skill, or the inclination to
              become distributed systems experts to build the
              infrastructure to handle it.”

                                                                           -Tom White
                                                            Hadoop: The Definitive Guide




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   16 / 71
What?



      Hadoop contains many subprojects:




      We’ll focus on distributed computation with MapReduce.




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   17 / 71
Who/when?




                                    An overly brief history




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   18 / 71
Who/when?




                                    pre-2004
        Cutting and Cafarella develop open source projects for web-scale
                        indexing, crawling, and search




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   18 / 71
Who/when?


                                  2004
         Dean and Ghemawat publish MapReduce programming model,
                        used internally at Google




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   18 / 71
Who/when?


                                  2006
         Hadoop becomes official Apache project, Cutting joins Yahoo!,
                          Yahoo adopts Hadoop




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   18 / 71
Who/when?




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   18 / 71
Where?




                    http://wiki.apache.org/hadoop/PoweredBy



@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   19 / 71
Why?




                                  Why yet another solution?


                   (I already use too many languages/environments)




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   20 / 71
Why?




                                  Why a distributed solution?


                (My desktop has TBs of storage and GBs of memory)




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   20 / 71
Why?




          Roughly how long to read 1TB from a commodity hard disk?




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   21 / 71
Why?




          Roughly how long to read 1TB from a commodity hard disk?


                                  1 Gb 1 B         sec       GB
                                       ×    × 3600     ≈ 225
                                  2 sec 8 b         hr       hr




@jakehofman   (Yahoo! Research)      Large-scale social media analysis w/ Hadoop   May 23, 2010   21 / 71
Why?




          Roughly how long to read 1TB from a commodity hard disk?

                                                 ≈ 4hrs




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   21 / 71
Why?




                                  http://bit.ly/petabytesort



@jakehofman   (Yahoo! Research)     Large-scale social media analysis w/ Hadoop   May 23, 2010   22 / 71
Outline




     1 Background (5 Ws)



     2 Introduction to MapReduce (How, Part I)



     3 Applications (How, Part II)




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   23 / 71
Typical scenario


      Store, parse, and analyze high-volume server logs,




      e.g. how many search queries match “icwsm”?




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   24 / 71
MapReduce: 30k ft




      Break large problem into smaller parts, solve in parallel, combine
      results




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   25 / 71
Typical scenario



                                          “Embarassingly parallel”
                                              (or nearly so)

                                  local read                           filter




                                                                                       }
               node 1



                                  local read                           filter
               node 2
                                                                                          collect results


                                  local read                           filter
               node 3



                                  local read                           filter
               node 4




@jakehofman   (Yahoo! Research)          Large-scale social media analysis w/ Hadoop   May 23, 2010         26 / 71
Typical scenario++



      How many search queries match “icwsm”, grouped by month?




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   27 / 71
MapReduce: example

                                                   Map                    Shuffle                    Reduce
                                            matching records to    to collect all records      results by adding
                                           (YYYYMM, count=1)       w/ same key (month)      count values for each key

        20091201,4.2.2.1,"icwsm 2010"
        20100523,2.4.1.2,"hadoop"
        20100101,9.7.6.5,"tutorial"
                                                 200912, 1
        20091125,2.4.6.1,"data"
        20090708,4.2.2.1,"open source"
        20100124,1.2.2.4,"washington dc"                                                    200912, 1
                                                                                                            200912, 2
                                                                                            200912, 1
                                                                                                            ...
                                                                                            ...
                                                                                                            200910, 1
                                                                                            200910, 1
        20100522,2.4.1.2,"conference"
        20091008,4.2.2.1,"2009 icwsm"
        20090807,4.2.2.1,"apache.org"            200910, 1
        20100101,9.7.6.5,"mapreduce"             200911, 1
        20100123,1.2.2.4,"washington dc"
        20091121,2.4.6.1,"icwsm dates"
                                                                                            200911, 1       200911, 1
                                                                                            ...             ...

        20090807,4.2.2.1,"distributed"
        20091225,4.2.2.1,"icwsm"
        20100522,2.4.1.2,"media"
                                                 200912, 1
        20100123,1.2.2.4,"social"
        20091114,2.4.6.1,"d.c."
        20100101,9.7.6.5,"new year's"




@jakehofman   (Yahoo! Research)            Large-scale social media analysis w/ Hadoop                  May 23, 2010    28 / 71
MapReduce: paradigm




                    Programmer specifies map and reduce functions




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   29 / 71
MapReduce: paradigm




          Map: tranforms input record to intermediate (key, value) pair




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   29 / 71
MapReduce: paradigm




                     Shuffle: collects all intermediate records by key




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   29 / 71
MapReduce: paradigm



              Reduce: transforms all records for given key to final output




@jakehofman    (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   29 / 71
MapReduce: paradigm




      Distributed read, shuffle, and write are transparent to programmer




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   29 / 71
MapReduce: principles




          • Move code to data (local computation)
          • Allow programs to scale transparently w.r.t size of input
          • Abstract away fault tolerance, synchronization, etc.




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   30 / 71
MapReduce: strengths




          • Batch, offline jobs
          • Write-once, read-many across full data set
          • Usually, though not always, simple computations
          • I/O bound by disk/network bandwidth




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   31 / 71
!MapReduce




      What it’s not:
          • High-performance parallel computing, e.g. MPI
          • Low-latency random access relational database
          • Always the right solution




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   32 / 71
Word count



                                                                                dog 2
                                                                                --   1
                                                                                the 3
                                                                                brown      1
                                                                                fox 2
               the quick brown fox
                                                                                jumped     1
               jumps over the lazy dog
                                                                                lazy 2
               who jumped over that
                                                                                jumps      1
               lazy dog -- the fox ?
                                                                                over 2
                                                                                quick      1
                                                                                that 1
                                                                                who 1
                                                                                ?    1




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop            May 23, 2010   33 / 71
Word count

               Map: for each line, output each word and count (of 1)
                                                                                       the 1
                                                                                       quick       1
                                                                                       brown       1
                                                                                       fox 1
                                                                                       ---------
                                                                                       jumps       1
                                                                                       over 1
                                                                                       the 1
                      the quick brown fox
                                                                                       lazy 1
                      --------------------------------
                                                                                       dog 1
                      jumps over the lazy dog
                                                                                       ---------
                      --------------------------------
                                                                                       who 1
                      who jumped over that
                                                                                       jumped      1
                      --------------------------------
                                                                                       over 1
                      lazy dog -- the fox ?
                                                                                       ---------
                                                                                       that 1
                                                                                       lazy 1
                                                                                       dog 1
                                                                                       --     1
                                                                                       the 1
                                                                                       fox 1
                                                                                       ?      1


@jakehofman   (Yahoo! Research)          Large-scale social media analysis w/ Hadoop                   May 23, 2010   33 / 71
Word count

                           Shuffle: collect all records for each word
                                                                                --     1
                                                                                ---------
                                                                                ?      1
                                                                                ---------
                                                                                brown       1
                                                                                ---------
                                                                                dog 1
                                                                                dog 1
                                                                                ---------
                                                                                fox 1
                                                                                fox 1
                                                                                ---------
                                  the quick brown fox                           jumped      1
                                  --------------------------------              ---------
                                  jumps over the lazy dog                       jumps       1
                                  --------------------------------              ---------
                                  who jumped over that                          lazy 1
                                  --------------------------------              lazy 1
                                  lazy dog -- the fox ?                         ---------
                                                                                over 1
                                                                                over 1
                                                                                ---------
                                                                                quick       1
                                                                                ---------
                                                                                that 1
                                                                                ---------
                                                                                the 1
                                                                                the 1
                                                                                the 1
                                                                                ---------
                                                                                who 1




@jakehofman   (Yahoo! Research)            Large-scale social media analysis w/ Hadoop          May 23, 2010   33 / 71
Word count

                                  Reduce: add counts for each word
                                    --     1
                                    ---------
                                    ?      1
                                    ---------
                                    brown       1
                                    ---------
                                    dog 1
                                    dog 1
                                    ---------
                                    fox 1                                --   1
                                    fox 1                                ?    1
                                    ---------                            brown       1
                                    jumped      1                        dog 2
                                    ---------                            fox 2
                                    jumps       1                        jumped      1
                                    ---------                            jumps       1
                                    lazy 1                               lazy 2
                                    lazy 1                               over 2
                                    ---------                            quick       1
                                    over 1                               that 1
                                    over 1                               the 3
                                    ---------                            who 1
                                    quick       1
                                    ---------
                                    that 1
                                    ---------
                                    the 1
                                    the 1
                                    the 1
                                    ---------
                                    who 1




@jakehofman   (Yahoo! Research)        Large-scale social media analysis w/ Hadoop       May 23, 2010   33 / 71
Word count
                                                               dog 1
                                                               dog 1
                                                               ---------
                                                               --     1
                                                               ---------
                                                               the 1
                                                               the 1
                                                               the 1
                                                               ---------
                                                               brown       1              dog 2
                                                               ---------                  --   1
                                                               fox 1                      the 3
       the quick brown fox                                     fox 1                      brown        1
       --------------------------------                        ---------                  fox 2
       jumps over the lazy dog                                 jumped      1              jumped       1
       --------------------------------                        ---------                  lazy 2
       who jumped over that                                    lazy 1                     jumps        1
       --------------------------------                        lazy 1                     over 2
       lazy dog -- the fox ?                                   ---------                  quick        1
                                                               jumps       1              that 1
                                                               ---------                  who 1
                                                               over 1                     ?    1
                                                               over 1
                                                               ---------
                                                               quick       1
                                                               ---------
                                                               that 1
                                                               ---------
                                                               ?      1
                                                               ---------
                                                               who 1



@jakehofman     (Yahoo! Research)         Large-scale social media analysis w/ Hadoop   May 23, 2010       33 / 71
WordCount.java




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   34 / 71
Hadoop streaming




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   35 / 71
Hadoop streaming



      MapReduce for *nix geeks2 :

               # cat data | map | sort | reduce

          • Mapper reads input data from stdin
          • Mapper writes output to stdout
          • Reducer receives input, sorted by key, on stdin
          • Reducer writes output to stdout




          2
              http://bit.ly/michaelnoll
@jakehofman    (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   37 / 71
wordcount.sh


      Locally:

              # cat data | tr " " "n" | sort | uniq -c




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   39 / 71
wordcount.sh


      Locally:

              # cat data | tr " " "n" | sort | uniq -c


                                                      ⇓
      Distributed:




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   39 / 71
Transparent scaling




        Use the same code on MBs locally or TBs across thousands
                             of machines.




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   40 / 71
wordcount.py




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   42 / 71
Outline




     1 Background (5 Ws)



     2 Introduction to MapReduce (How, Part I)



     3 Applications (How, Part II)




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   43 / 71
Network data




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   44 / 71
Scale



                                                                                 ...
       • Example numbers:
           • ∼ 107 nodes
           • ∼ 102 edges/node
                                                                        User 1             User 2
           • no node/edge data
           • static
           • ∼8GB
                                                                                 ...




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop          May 23, 2010   45 / 71
Scale



                                                                                 ...
       • Example numbers:
           • ∼ 107 nodes
           • ∼ 102 edges/node
                                                                        User 1             User 2
           • no node/edge data
           • static
           • ∼8GB
                                                                                 ...

      Simple, static networks push memory limit for commodity machines




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop          May 23, 2010   45 / 71
Scale




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   46 / 71
Scale



       • Example numbers:                                                                ...
           • ∼ 107 nodes
           • ∼ 102 edges/node                                                         Message
                                                                                     Header
           • node/edge metadata                                     User
                                                                            User 1   Content
                                                                                     ...
                                                                                                   User 2
                                                                                                              User
                                                                  Profile                                    Profile
           • dynamic                                              History                                   History
                                                                  ...                                       ...
           • ∼100GB/day
                                                                                         ...




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop                   May 23, 2010          47 / 71
Scale



       • Example numbers:                                                                ...
           • ∼ 107 nodes
           • ∼ 102 edges/node                                                         Message
                                                                                     Header
           • node/edge metadata                                     User
                                                                            User 1   Content
                                                                                     ...
                                                                                                   User 2
                                                                                                              User
                                                                  Profile                                    Profile
           • dynamic                                              History                                   History
                                                                  ...                                       ...
           • ∼100GB/day
                                                                                         ...

       Dynamic, data-rich social networks exceed memory limits; require
                             considerable storage




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop                   May 23, 2010          47 / 71
Assumptions
      Look only at topology, ignoring node and edge metadata




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   48 / 71
Assumptions
      Full network exceeds memory of single machine




                                                                                ...


                                               ...




                                                                                ...
@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop         May 23, 2010   48 / 71
Assumptions
      Full network exceeds memory of single machine




                                                                                ...


                                               ...




                                                                                ...
@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop         May 23, 2010   48 / 71
Assumptions
      First-hop neighborhood of any individual node fits in memory




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   48 / 71
Distributed network analysis




      MapReduce convenient for
      parallelizing individual
      node/edge-level calculations




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   49 / 71
Distributed network analysis




      Higher-order calculations
      more difficult , but can be
      adapted to MapReduce
      framework




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   50 / 71
Distributed network analysis
                                                                • Higher-order node-level
          • Network                                                 descriptive statistics
              creation/manipulation                                     • Clustering coefficient
                 • Logs → edges                                         • Implicit degree
                 • Edge list ↔ adjacency                                • ...
                     list
                                                                • Global calculations
                 • Directed ↔ undirected
                 • Edge thresholds                                  • Pairwise connectivity
                                                                    • Connected
          • First-order descriptive
                                                                      components
              statistics                                            • Minimum spanning
                 • Number of nodes                                    tree
                 • Number of edges                                  • Breadth-first search
                 • Node degrees                                     • Pagerank
                                                                    • Community detection




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop         May 23, 2010   51 / 71
Edge list → adjacency list




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   52 / 71
Edge list → adjacency list



                   source target
                      1     4                                        node in/out neighbors
                      1     5
                                                                       1    37 3546
                      1     6
                                                                       10        2
                      7     6
                                                                       2    10 3 4  98
                      7     1
                                                                       3    14 124
                      3     1
                                                                       4    13 32
                      1     3
                                                                       5    1
                      4     2
                                                                       6    17
                      3     2
                                                                       7         16
                      2     8
                                                                       8    2
                      10    2
                                                                       9    2
                      2     9
                      3     4
                      4     3




@jakehofman   (Yahoo! Research)    Large-scale social media analysis w/ Hadoop               May 23, 2010   52 / 71
Edge list → adjacency list
          Map: for each (source, target), output (source, →, target) &
                              (target, ←, source)
                                                                   node direction neighbor
                                                                         1      >      4
                                                                         4      <      1
                                                                         -----------------
                   source target                                         1      >      5
                       1      4                                          5      <      1
                       ---------                                         -----------------
                       1      5                                          1      >      6
                       ---------                                         6      <      1
                       1      6                                          -----------------
                       ---------                                         7      >      6
                       7      6                                          6      <      7
                       ---------                                         -----------------
                       7      1                                          7      >      1
                       ---------                                         1      <      7
                       3      1                                          -----------------
                       ---------                                         3      >      1
                       1      3                                          1      <      3
                       ---------                                         -----------------
                                                                         1      >      3
                                                                         3      <      1
                                                                         -----------------

@jakehofman   (Yahoo! Research)    Large-scale social media analysis w/ Hadoop               May 23, 2010   52 / 71
Edge list → adjacency list

                                   Shuffle: collect each node’s records
                                                                       node direction neighbor
                                                                             1      <      3
                                                                             1      <      7
                                                                             1      >      3
                   source target                                             1      >      4
                       1      4                                              1      >      5
                       ---------                                             1      >      6
                       1      5                                              -----------------
                       ---------                                             10 >          2
                       1      6                                              -----------------
                       ---------                                             2      <      10
                       7      6                                              2      <      3
                       ---------                                             2      <      4
                       7      1                                              2      >      8
                       ---------                                             2      >      9
                       3      1                                              -----------------
                       ---------                                             3      <      1
                       1      3                                              3      <      4
                       ---------                                             3      >      1
                                                                             3      >      2
                                                                             3      >      4



@jakehofman   (Yahoo! Research)        Large-scale social media analysis w/ Hadoop               May 23, 2010   52 / 71
Edge list → adjacency list

          Reduce: for each node, concatenate all in- and out-neighbors
                   node direction neighbor
                         1      <      3
                         1      <      7
                         1      >      3
                         1      >      4
                         1      >      5
                         1      >      6
                                                                         node in/out neighbors
                         -----------------
                         10 >          2                                   1     37 3546
                         -----------------                                 10         2
                         2      <      10                                  2     10 3 4  98
                         2      <      3                                   3     14 124
                         2      <      4                                   4     13 32
                         2      >      8                                   5     1
                         2      >      9                                   6     17
                         -----------------                                 7          16
                         3      <      1                                   8     2
                         3      <      4                                   9     2
                         3      >      1
                         3      >      2
                         3      >      4
                                 ...


@jakehofman   (Yahoo! Research)          Large-scale social media analysis w/ Hadoop             May 23, 2010   52 / 71
Edge list → adjacency list

                                      node direction neighbor
                                           1      <      3
                                           1      <      7
                                           1      >      3
                                           1      >      4
       source target
                                           1      >      5
          1      4                         1      >      6
                                                                                node in/out neighbors
          ---------                        -----------------
          1      5                         10 >          2                       1    37 3546
          ---------                        -----------------                     10        2
          1      6                         2      <      10                      2    10 3 4  98
          ---------                        2      <      3                       3    14 124
          7      6                         2      <      4                       4    13 32
          ---------                        2      >      8                       5    1
          7      1                         2      >      9                       6    17
          ---------                        -----------------                     7         16
          3      1                         3      <      1                       8    2
          ---------                        3      <      4                       9    2
          1      3                         3      >      1
          ---------                        3      >      2
             ...                           3      >      4
                                                   ...




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop             May 23, 2010   52 / 71
Edge list → adjacency list

        Adjacency lists provide access to a node’s local structure — e.g.
             we can pass messages from a node to its neighbors.




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   52 / 71
Degree distribution




               node in/out-degree
                1    2   4
                10   0   1
                2    3   2
                3    2   3
                4    2   2
                5    1   0
                6    2   0
                7    0   2
                8    1   0
                9    1   0




@jakehofman   (Yahoo! Research)     Large-scale social media analysis w/ Hadoop   May 23, 2010   53 / 71
Degree distribution
       Map: for each node, output in- and out-degree with count (of 1)
                                                 bin count
                                                 in_0 1
                                                 in_0 1
                                                 ---------
                                                 in_1 1
                                                 in_1 1
                                                 in_1 1
                                                 ---------
                                                 in_2 1
              node in/out-degree
                                                 in_2 1                          in/out degree count
               1    2   4                        in_2 1
                                                                                     in    0      2
               10   0   1                        in_2 1
                                                                                     in    1      3
               2    3   2                        ---------
                                                                                     in    2      4
               3    2   3                        in_3 1
                                                                                     in    3      1
               4    2   2                        ---------
                                                                                     out   0      4
               5    1   0                        out_0       1
                                                                                     out   1      1
               6    2   0                        out_0       1
                                                                                     out   2      3
               7    0   2                        out_0       1
                                                                                     out   3      1
               8    1   0                        out_0       1
                                                                                     out   4      1
               9    1   0                        ---------
                                                 out_1       1
                                                 ---------
                                                 out_2       1
                                                 out_2       1
                                                 out_2       1
                                                 ---------
                                                 out_3       1
                                                 ---------
                                                 out_4       1


@jakehofman    (Yahoo! Research)   Large-scale social media analysis w/ Hadoop                 May 23, 2010   54 / 71
Degree distribution
                        Shuffle: collect counts for each in/out-degree
                                                 bin count
                                                 in_0 1
                                                 in_0 1
                                                 ---------
                                                 in_1 1
                                                 in_1 1
                                                 in_1 1
                                                 ---------
                                                 in_2 1
              node in/out-degree
                                                 in_2 1                          in/out degree count
               1    2   4                        in_2 1
                                                                                     in    0      2
               10   0   1                        in_2 1
                                                                                     in    1      3
               2    3   2                        ---------
                                                                                     in    2      4
               3    2   3                        in_3 1
                                                                                     in    3      1
               4    2   2                        ---------
                                                                                     out   0      4
               5    1   0                        out_0       1
                                                                                     out   1      1
               6    2   0                        out_0       1
                                                                                     out   2      3
               7    0   2                        out_0       1
                                                                                     out   3      1
               8    1   0                        out_0       1
                                                                                     out   4      1
               9    1   0                        ---------
                                                 out_1       1
                                                 ---------
                                                 out_2       1
                                                 out_2       1
                                                 out_2       1
                                                 ---------
                                                 out_3       1
                                                 ---------
                                                 out_4       1


@jakehofman    (Yahoo! Research)   Large-scale social media analysis w/ Hadoop                 May 23, 2010   54 / 71
Degree distribution
                                       Reduce: add counts
                                                 bin count
                                                 in_0 1
                                                 in_0 1
                                                 ---------
                                                 in_1 1
                                                 in_1 1
                                                 in_1 1
                                                 ---------
                                                 in_2 1
              node in/out-degree
                                                 in_2 1                          in/out degree count
               1    2   4                        in_2 1
                                                                                     in    0      2
               10   0   1                        in_2 1
                                                                                     in    1      3
               2    3   2                        ---------
                                                                                     in    2      4
               3    2   3                        in_3 1
                                                                                     in    3      1
               4    2   2                        ---------
                                                                                     out   0      4
               5    1   0                        out_0       1
                                                                                     out   1      1
               6    2   0                        out_0       1
                                                                                     out   2      3
               7    0   2                        out_0       1
                                                                                     out   3      1
               8    1   0                        out_0       1
                                                                                     out   4      1
               9    1   0                        ---------
                                                 out_1       1
                                                 ---------
                                                 out_2       1
                                                 out_2       1
                                                 out_2       1
                                                 ---------
                                                 out_3       1
                                                 ---------
                                                 out_4       1


@jakehofman    (Yahoo! Research)   Large-scale social media analysis w/ Hadoop                 May 23, 2010   54 / 71
Clustering coefficient



      Fraction of edges amongst a node’s in/out-neighbors

                                                                 ?

                                          ?
      — e.g. how many of a node’s friends are following each other?




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   55 / 71
Clustering coefficient

                                     8
                                    10
                                                                                                                followers
                                                                                                                friends
                                     7
                                    10


                                     6
                                    10
                   ?
                                     5
                                    10
                            count




                                     4
                                    10


                                     3
                   ?                10


                                     2
                                    10


                                     1
                                    10


                                     0
                                    10
                                         0     0.1      0.2      0.3            0.4           0.5   0.6   0.7               0.8
                                                                       clustering coefficient




@jakehofman   (Yahoo! Research)              Large-scale social media analysis w/ Hadoop                            May 23, 2010   56 / 71
Clustering coefficient


                                                   5


                                  6

                                               1
                                                             4
                                  7
                                                                                     8
                                                       3
                                                                     2


                                                                                10

                                                                 9




@jakehofman   (Yahoo! Research)       Large-scale social media analysis w/ Hadoop        May 23, 2010   56 / 71
Clustering coefficient


                                                   5


                                  6

                                               1
                                                             4
                                  7
                                                                                     8
                                                       3
                                                                     2


                                                                                10

                                                                 9




@jakehofman   (Yahoo! Research)       Large-scale social media analysis w/ Hadoop        May 23, 2010   56 / 71
Clustering coefficient


                                                   5


                                  6

                                               1
                                                             4
                                  7
                                                                                     8
                                                       3
                                                                     2


                                                                                10

                                                                 9




@jakehofman   (Yahoo! Research)       Large-scale social media analysis w/ Hadoop        May 23, 2010   56 / 71
Clustering coefficient

      Map: pass all of a node’s out-neighbors to each of its in-neighbors
                                                                                  out    two-hop
                                                                          node neighbor neighbors
                      node in/out-neighbors
                                                                               3      1      3546
                         1      37 3546                                        7      1      3546
                         -------------------------                             -------------------------
                         10            2                                       10 2          98
                         -------------------------                             3      2      98
                         2      10 3 4        98                               4      2      98
                         -------------------------                             -------------------------
                         3      14 124                                         1      3      124
                         -------------------------                             4      3      124
                         4      13 32                                          -------------------------
                         -------------------------                             1      4      32
                         5      1                                              3      4      32
                         -------------------------                             -------------------------
                         6      17                                             1      5
                         -------------------------                             -------------------------
                         7             16                                      1      6
                         -------------------------                             7      6
                         8      2                                              -------------------------
                         -------------------------                             2      8
                         9      2                                              -------------------------
                                                                               2      9




@jakehofman   (Yahoo! Research)             Large-scale social media analysis w/ Hadoop                    May 23, 2010   57 / 71
Clustering coefficient


                  Shuffle: collect each node’s two-hop neighborhoods
                                                                                  out    two-hop
                      node in/out-neighbors                               node neighbor neighbors

                         1      37 3546                                        1      3      124
                         -------------------------                             1      4      32
                         10            2                                       1      5
                         -------------------------                             1      6
                         2      10 3 4        98                               -------------------------
                         -------------------------                             10 2          98
                         3      14 124                                         -------------------------
                         -------------------------                             2      8
                         4      13 32                                          2      9
                         -------------------------                             -------------------------
                         5      1                                              3      1      3546
                         -------------------------                             3      2      98
                         6      17                                             3      4      32
                         -------------------------                             -------------------------
                         7             16                                      4      2      98
                         -------------------------                             4      3      124
                         8      2                                              -------------------------
                         -------------------------                             7      1      3546
                         9      2                                              7      6




@jakehofman   (Yahoo! Research)             Large-scale social media analysis w/ Hadoop                    May 23, 2010   57 / 71
Clustering coefficient

        Reduce: count a half-triangle for each node reachable by both a
                            one- and two-hop path
                              out    two-hop
                      node neighbor neighbors

                           1      3      124
                           1      4      32
                           1      5                                                  node      triangles
                           1      6
                           -------------------------                                      1      1.0
                           10 2          98                                               ------------
                           -------------------------                                      10 0.0
                           2      8                                                       ------------
                           2      9                                                       2      0.0
                           -------------------------                                      ------------
                           3      1      3546                                             3      1.0
                           3      2      98                                               ------------
                           3      4      32                                               4      0.5
                           -------------------------                                      ------------
                           4      2      98                                               7      0.5
                           4      3      124
                           -------------------------
                           7      1      3546
                           7      6




@jakehofman   (Yahoo! Research)             Large-scale social media analysis w/ Hadoop                    May 23, 2010   57 / 71
Clustering coefficient



                                                                                node    triangles

                                                                                   1      1.0
                                                                                   ------------
                                                                                   10 0.0
                                                                                   ------------
                                                                                   2      0.0
                                                                                   ------------
                                                                                   3      1.0
                                                                                   ------------
                                                                                   4      0.5
                                                                                   ------------
                                                                                   7      0.5




       Note: this approach generates large amount of intermediate data
                           relative to final output.



@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop          May 23, 2010   57 / 71
Breadth first search
      Iterative approach: each MapReduce round expands the frontier
                                                    ?


                                  ?

                                                0
                                                             ?
                                  ?
                                                                                  ?
                                                        ?
                                                                     ?


                                                                              ?

                                                                 ?



      Map: If node’s distance d to source is finite, output neighbor’s
      distance as d+1
      Reduce: Set node’s distance to minimum received from all
      in-neighbors
@jakehofman   (Yahoo! Research)       Large-scale social media analysis w/ Hadoop     May 23, 2010   58 / 71
Breadth first search
      Iterative approach: each MapReduce round expands the frontier
                                                    1


                                  1

                                                0
                                                             1
                                  ?
                                                                                  ?
                                                        1
                                                                     ?


                                                                              ?

                                                                 ?



      Map: If node’s distance d to source is finite, output neighbor’s
      distance as d+1
      Reduce: Set node’s distance to minimum received from all
      in-neighbors
@jakehofman   (Yahoo! Research)       Large-scale social media analysis w/ Hadoop     May 23, 2010   58 / 71
Breadth first search
      Iterative approach: each MapReduce round expands the frontier
                                                    1


                                  1

                                                0
                                                             1
                                  ?
                                                                                  ?
                                                        1
                                                                     2


                                                                              ?

                                                                 ?



      Map: If node’s distance d to source is finite, output neighbor’s
      distance as d+1
      Reduce: Set node’s distance to minimum received from all
      in-neighbors
@jakehofman   (Yahoo! Research)       Large-scale social media analysis w/ Hadoop     May 23, 2010   58 / 71
Breadth first search
      Iterative approach: each MapReduce round expands the frontier
                                                    1


                                  1

                                                0
                                                             1
                                  ?
                                                                                  3
                                                        1
                                                                     2


                                                                              ?

                                                                 3



      Map: If node’s distance d to source is finite, output neighbor’s
      distance as d+1
      Reduce: Set node’s distance to minimum received from all
      in-neighbors
@jakehofman   (Yahoo! Research)       Large-scale social media analysis w/ Hadoop     May 23, 2010   58 / 71
Break complicated tasks into multiple, simpler MapReduce rounds.




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   59 / 71
Pagerank
      Iterative approach: each MapReduce round broadcasts and collects
      edge messages for power method 3
                                                q5


                                    q6

                                              q1
                                                          q4
                                    q7
                                                                           q8
                                                     q3
                                                                    q2


                                                                         q10

                                                               q9



      Map: Output current pagerank over degree to each out-neighbor
      Reduce: Sum incoming probabilities to update estimate

                               http://bit.ly/nielsenpagerank
          3
              Extra rounds for random jump, dangling nodes
@jakehofman    (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   60 / 71
Machine learning




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   61 / 71
Machine learning




          • Often use MapReduce for feature extraction, then fit/optimize
            locally
          • Useful for “embarassingly parallel” parts of learning, e.g.
                 • parameters sweeps for cross-validation
                 • independent restarts for local optimization
                 • making predictions on independent examples
          • Remember: MapReduce isn’t always the answer




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   62 / 71
Classification

      Example: given words in an article, assign article to one of K
      classes


                  “Floyd Landis showed up at the Tour of California”




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   63 / 71
Classification

      Example: given words in an article, assign article to one of K
      classes


                  “Floyd Landis showed up at the Tour of California”


                                                      ⇓




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   63 / 71
Classification

      Example: given words in an article, assign article to one of K
      classes


                  “Floyd Landis showed up at the Tour of California”


                                                      ⇓




                                                     ...


@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   63 / 71
Classification: naive Bayes
          • Model presence/absence of each word as independent coin flip

                    p (word|class) = Bernoulli(θwc )
                   p (words|class) = p (word1 |class) p (word2 |class) . . .




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   64 / 71
Classification: naive Bayes
          • Model presence/absence of each word as independent coin flip

                    p (word|class) = Bernoulli(θwc )
                   p (words|class) = p (word1 |class) p (word2 |class) . . .

          • Maximum likelihood estimates of probabilities from word and
              class counts

                                               ˆ                Nwc
                                               θwc       =
                                                                Nc
                                                 ˆ              Nc
                                                 θc      =
                                                                N




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   64 / 71
Classification: naive Bayes
          • Model presence/absence of each word as independent coin flip

                    p (word|class) = Bernoulli(θwc )
                   p (words|class) = p (word1 |class) p (word2 |class) . . .

          • Maximum likelihood estimates of probabilities from word and
              class counts

                                               ˆ                Nwc
                                               θwc       =
                                                                Nc
                                                 ˆ              Nc
                                                 θc      =
                                                                N
          • Use bayes’ rule to calculate distribution over classes given
              words

                                                    p (words|class, Θ) p (class, Θ)
                      p (class|words, Θ) =
                                                            p (words, Θ)
@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   64 / 71
Classification: naive Bayes




                                  Naive ↔ independent features




@jakehofman   (Yahoo! Research)    Large-scale social media analysis w/ Hadoop   May 23, 2010   65 / 71
Classification: naive Bayes




                                  Naive ↔ independent features


                                                       ⇓


                                  Class-conditional word count




@jakehofman   (Yahoo! Research)    Large-scale social media analysis w/ Hadoop   May 23, 2010   65 / 71
Classification: naive Bayes


                                                                                         class word count
                                                                                         sports   an    355
                                                                                         sports   be    317
                                                                                         sports   first 318
                                                                                         sports   game 379
                                                                                         sports   has 374
                                                                                         sports   have 284
                                                                                         sports   one 296
                                                                                         sports   said 325
       class     words
                                                                                         sports   season 295
                                                                                         sports   team 279
       world         Economics Is on Agenda for U.S. Meetings in China
                                                                                         sports   their 334
       --------------------------------
                                                                                         sports   this 293
       world         U.K. Backs Germany's Effort to Support Euro
                                                                                         sports   when 290
       --------------------------------
                                                                                         sports   who 363
       sports        A Pitchersʼ Duel Ends in Favor of the Yankees
                                                                                         world    after 347
       --------------------------------
                                                                                         world    but 299
       sports        After Doping Allegations, a Race for Details
                                                                                         world    government   300
                                                                                         world    had 352
                                                                                         world    have 342
                                                                                         world    he    355
                                                                                         world    its 308
                                                                                         world    mr    293
                                                                                         world    united 313
                                                                                         world    were 319




@jakehofman     (Yahoo! Research)              Large-scale social media analysis w/ Hadoop                     May 23, 2010   66 / 71
Clustering:
      Find clusters of “similar” points




                                  http://bit.ly/oldfaithful

@jakehofman   (Yahoo! Research)     Large-scale social media analysis w/ Hadoop   May 23, 2010   67 / 71
Clustering: K-means
      Map: Assign each point to cluster with closest mean4 , output
      (cluster, features)
      Reduce: Update clusters by calculating new class-conditional
      means




              http://en.wikipedia.org/wiki/K-means clustering
          4
              Each mapper loads all cluster centers on init
@jakehofman    (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   68 / 71
Mahout




                                  http://mahout.apache.org/




@jakehofman   (Yahoo! Research)     Large-scale social media analysis w/ Hadoop   May 23, 2010   69 / 71
Thanks


          • Sharad Goely
          • Winter Masony
          • Sid Suriy
          • Sergei Vassilvitskiiy
          • Duncan Wattsy
          • Eytan Bakshym,y


      y   Yahoo! Research (http://research.yahoo.com)
      m   University of Michigan




@jakehofman   (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   70 / 71
Thanks.

                                              Questions?5




          5
              http://jakehofman.com/icwsm2010
@jakehofman    (Yahoo! Research)   Large-scale social media analysis w/ Hadoop   May 23, 2010   71 / 71

More Related Content

Viewers also liked

Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Jonathan Seidman
 
Oracle 12c r1 installation on solaris 11.1
Oracle 12c r1 installation on solaris 11.1Oracle 12c r1 installation on solaris 11.1
Oracle 12c r1 installation on solaris 11.1Laurent Leturgez
 
Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)
Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)
Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)Laurent Leturgez
 
Oracle 12c in memory en action
Oracle 12c in memory en actionOracle 12c in memory en action
Oracle 12c in memory en actionLaurent Leturgez
 
Oracle Database In-Memory Option in Action
Oracle Database In-Memory Option in ActionOracle Database In-Memory Option in Action
Oracle Database In-Memory Option in ActionTanel Poder
 
Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Jonathan Seidman
 
Terabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practiceTerabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practiceDenis Shestakov
 
The Evolution of Data Analysis with Hadoop - StampedeCon 2014
The Evolution of Data Analysis with Hadoop - StampedeCon 2014The Evolution of Data Analysis with Hadoop - StampedeCon 2014
The Evolution of Data Analysis with Hadoop - StampedeCon 2014StampedeCon
 
HBase Status Report - Hadoop Summit Europe 2014
HBase Status Report - Hadoop Summit Europe 2014HBase Status Report - Hadoop Summit Europe 2014
HBase Status Report - Hadoop Summit Europe 2014larsgeorge
 
Modern Linux Performance Tools for Application Troubleshooting
Modern Linux Performance Tools for Application TroubleshootingModern Linux Performance Tools for Application Troubleshooting
Modern Linux Performance Tools for Application TroubleshootingTanel Poder
 
Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010
Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010
Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010Yahoo Developer Network
 
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and SparkAlphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and SparkJongwook Woo
 
Resume of Vimal 4.1
Resume of Vimal 4.1Resume of Vimal 4.1
Resume of Vimal 4.1Vimal Suthar
 

Viewers also liked (16)

Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
 
Hanganalyze presentation
Hanganalyze presentationHanganalyze presentation
Hanganalyze presentation
 
Oracle 12c r1 installation on solaris 11.1
Oracle 12c r1 installation on solaris 11.1Oracle 12c r1 installation on solaris 11.1
Oracle 12c r1 installation on solaris 11.1
 
Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)
Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)
Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)
 
Oracle 12c in memory en action
Oracle 12c in memory en actionOracle 12c in memory en action
Oracle 12c in memory en action
 
Oracle Database In-Memory Option in Action
Oracle Database In-Memory Option in ActionOracle Database In-Memory Option in Action
Oracle Database In-Memory Option in Action
 
Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011
 
Terabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practiceTerabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practice
 
The Evolution of Data Analysis with Hadoop - StampedeCon 2014
The Evolution of Data Analysis with Hadoop - StampedeCon 2014The Evolution of Data Analysis with Hadoop - StampedeCon 2014
The Evolution of Data Analysis with Hadoop - StampedeCon 2014
 
Using MapReduce for Large–scale Medical Image Analysis
Using MapReduce for Large–scale Medical Image AnalysisUsing MapReduce for Large–scale Medical Image Analysis
Using MapReduce for Large–scale Medical Image Analysis
 
Virtualizing Hadoop
Virtualizing HadoopVirtualizing Hadoop
Virtualizing Hadoop
 
HBase Status Report - Hadoop Summit Europe 2014
HBase Status Report - Hadoop Summit Europe 2014HBase Status Report - Hadoop Summit Europe 2014
HBase Status Report - Hadoop Summit Europe 2014
 
Modern Linux Performance Tools for Application Troubleshooting
Modern Linux Performance Tools for Application TroubleshootingModern Linux Performance Tools for Application Troubleshooting
Modern Linux Performance Tools for Application Troubleshooting
 
Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010
Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010
Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010
 
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and SparkAlphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
 
Resume of Vimal 4.1
Resume of Vimal 4.1Resume of Vimal 4.1
Resume of Vimal 4.1
 

More from jakehofman

Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2jakehofman
 
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1jakehofman
 
Modeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: NetworksModeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: Networksjakehofman
 
Modeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: ClassificationModeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: Classificationjakehofman
 
Modeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalizationModeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalizationjakehofman
 
Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1jakehofman
 
Modeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at ScaleModeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at Scalejakehofman
 
Modeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in RModeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in Rjakehofman
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Countingjakehofman
 
Modeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: OverviewModeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: Overviewjakehofman
 
Modeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation SystemsModeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation Systemsjakehofman
 
Modeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive BayesModeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive Bayesjakehofman
 
Modeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at ScaleModeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at Scalejakehofman
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Countingjakehofman
 
Modeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case StudiesModeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case Studiesjakehofman
 
NYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social ScienceNYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social Sciencejakehofman
 
Computational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: ClassificationComputational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: Classificationjakehofman
 
Computational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: RegressionComputational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: Regressionjakehofman
 
Computational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online ExperimentsComputational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online Experimentsjakehofman
 
Computational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data WranglingComputational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data Wranglingjakehofman
 

More from jakehofman (20)

Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
 
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
 
Modeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: NetworksModeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: Networks
 
Modeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: ClassificationModeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: Classification
 
Modeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalizationModeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalization
 
Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1
 
Modeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at ScaleModeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at Scale
 
Modeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in RModeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in R
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Counting
 
Modeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: OverviewModeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: Overview
 
Modeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation SystemsModeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation Systems
 
Modeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive BayesModeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive Bayes
 
Modeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at ScaleModeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at Scale
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Counting
 
Modeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case StudiesModeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case Studies
 
NYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social ScienceNYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social Science
 
Computational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: ClassificationComputational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: Classification
 
Computational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: RegressionComputational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: Regression
 
Computational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online ExperimentsComputational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online Experiments
 
Computational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data WranglingComputational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data Wrangling
 

Recently uploaded

Education and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptxEducation and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptxraviapr7
 
How to Manage Cross-Selling in Odoo 17 Sales
How to Manage Cross-Selling in Odoo 17 SalesHow to Manage Cross-Selling in Odoo 17 Sales
How to Manage Cross-Selling in Odoo 17 SalesCeline George
 
Over the counter (OTC)- Sale, rational use.pptx
Over the counter (OTC)- Sale, rational use.pptxOver the counter (OTC)- Sale, rational use.pptx
Over the counter (OTC)- Sale, rational use.pptxraviapr7
 
ARTICULAR DISC OF TEMPOROMANDIBULAR JOINT
ARTICULAR DISC OF TEMPOROMANDIBULAR JOINTARTICULAR DISC OF TEMPOROMANDIBULAR JOINT
ARTICULAR DISC OF TEMPOROMANDIBULAR JOINTDR. SNEHA NAIR
 
How to Add a New Field in Existing Kanban View in Odoo 17
How to Add a New Field in Existing Kanban View in Odoo 17How to Add a New Field in Existing Kanban View in Odoo 17
How to Add a New Field in Existing Kanban View in Odoo 17Celine George
 
How to Create a Toggle Button in Odoo 17
How to Create a Toggle Button in Odoo 17How to Create a Toggle Button in Odoo 17
How to Create a Toggle Button in Odoo 17Celine George
 
5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...CaraSkikne1
 
Slides CapTechTalks Webinar March 2024 Joshua Sinai.pptx
Slides CapTechTalks Webinar March 2024 Joshua Sinai.pptxSlides CapTechTalks Webinar March 2024 Joshua Sinai.pptx
Slides CapTechTalks Webinar March 2024 Joshua Sinai.pptxCapitolTechU
 
What is the Future of QuickBooks DeskTop?
What is the Future of QuickBooks DeskTop?What is the Future of QuickBooks DeskTop?
What is the Future of QuickBooks DeskTop?TechSoup
 
In - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptxIn - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptxAditiChauhan701637
 
How to Send Emails From Odoo 17 Using Code
How to Send Emails From Odoo 17 Using CodeHow to Send Emails From Odoo 17 Using Code
How to Send Emails From Odoo 17 Using CodeCeline George
 
P4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdf
P4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdfP4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdf
P4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdfYu Kanazawa / Osaka University
 
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRA
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRADUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRA
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRATanmoy Mishra
 
The basics of sentences session 10pptx.pptx
The basics of sentences session 10pptx.pptxThe basics of sentences session 10pptx.pptx
The basics of sentences session 10pptx.pptxheathfieldcps1
 
The Stolen Bacillus by Herbert George Wells
The Stolen Bacillus by Herbert George WellsThe Stolen Bacillus by Herbert George Wells
The Stolen Bacillus by Herbert George WellsEugene Lysak
 
Protein Structure - threading Protein modelling pptx
Protein Structure - threading Protein modelling pptxProtein Structure - threading Protein modelling pptx
Protein Structure - threading Protein modelling pptxvidhisharma994099
 
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...Nguyen Thanh Tu Collection
 
3.21.24 The Origins of Black Power.pptx
3.21.24  The Origins of Black Power.pptx3.21.24  The Origins of Black Power.pptx
3.21.24 The Origins of Black Power.pptxmary850239
 

Recently uploaded (20)

Education and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptxEducation and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptx
 
How to Manage Cross-Selling in Odoo 17 Sales
How to Manage Cross-Selling in Odoo 17 SalesHow to Manage Cross-Selling in Odoo 17 Sales
How to Manage Cross-Selling in Odoo 17 Sales
 
Over the counter (OTC)- Sale, rational use.pptx
Over the counter (OTC)- Sale, rational use.pptxOver the counter (OTC)- Sale, rational use.pptx
Over the counter (OTC)- Sale, rational use.pptx
 
ARTICULAR DISC OF TEMPOROMANDIBULAR JOINT
ARTICULAR DISC OF TEMPOROMANDIBULAR JOINTARTICULAR DISC OF TEMPOROMANDIBULAR JOINT
ARTICULAR DISC OF TEMPOROMANDIBULAR JOINT
 
How to Add a New Field in Existing Kanban View in Odoo 17
How to Add a New Field in Existing Kanban View in Odoo 17How to Add a New Field in Existing Kanban View in Odoo 17
How to Add a New Field in Existing Kanban View in Odoo 17
 
How to Create a Toggle Button in Odoo 17
How to Create a Toggle Button in Odoo 17How to Create a Toggle Button in Odoo 17
How to Create a Toggle Button in Odoo 17
 
5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...
 
Slides CapTechTalks Webinar March 2024 Joshua Sinai.pptx
Slides CapTechTalks Webinar March 2024 Joshua Sinai.pptxSlides CapTechTalks Webinar March 2024 Joshua Sinai.pptx
Slides CapTechTalks Webinar March 2024 Joshua Sinai.pptx
 
What is the Future of QuickBooks DeskTop?
What is the Future of QuickBooks DeskTop?What is the Future of QuickBooks DeskTop?
What is the Future of QuickBooks DeskTop?
 
In - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptxIn - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptx
 
How to Send Emails From Odoo 17 Using Code
How to Send Emails From Odoo 17 Using CodeHow to Send Emails From Odoo 17 Using Code
How to Send Emails From Odoo 17 Using Code
 
P4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdf
P4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdfP4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdf
P4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdf
 
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRA
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRADUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRA
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRA
 
Finals of Kant get Marx 2.0 : a general politics quiz
Finals of Kant get Marx 2.0 : a general politics quizFinals of Kant get Marx 2.0 : a general politics quiz
Finals of Kant get Marx 2.0 : a general politics quiz
 
The basics of sentences session 10pptx.pptx
The basics of sentences session 10pptx.pptxThe basics of sentences session 10pptx.pptx
The basics of sentences session 10pptx.pptx
 
Prelims of Kant get Marx 2.0: a general politics quiz
Prelims of Kant get Marx 2.0: a general politics quizPrelims of Kant get Marx 2.0: a general politics quiz
Prelims of Kant get Marx 2.0: a general politics quiz
 
The Stolen Bacillus by Herbert George Wells
The Stolen Bacillus by Herbert George WellsThe Stolen Bacillus by Herbert George Wells
The Stolen Bacillus by Herbert George Wells
 
Protein Structure - threading Protein modelling pptx
Protein Structure - threading Protein modelling pptxProtein Structure - threading Protein modelling pptx
Protein Structure - threading Protein modelling pptx
 
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
 
3.21.24 The Origins of Black Power.pptx
3.21.24  The Origins of Black Power.pptx3.21.24  The Origins of Black Power.pptx
3.21.24 The Origins of Black Power.pptx
 

Large-scale social media analysis with Hadoop

  • 1. Large-scale social media analysis with Hadoop Jake Hofman Yahoo! Research May 23, 2010 @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 1 / 71
  • 2. @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 2 / 71
  • 3. 1970s ∼ 101 nodes 456 JOURNAL OF ANTHROPOLOGICAL RESEARCH FIGURE 1 Social Network Model of Relationships in the Karate Club 34 1 33 3 2 27 8 26 i 9 25 10 CONFLICT AND FISSION IN SMALL GROUPS 453 to bounded social groups of all types in all settings. Also, the data required can be collected by a reliable method currently familiar to anthropologists, the use of nominal scales. 19 18 16 18 17 THE ETHNOGRAPHIC RATIONALE The is the the clubrepresentationline ofis the socialbetween of three years, the indi-1970 This karate karate was observed for a period two points when 34 two viduals in graphic club. A drawn relationships among the from to 1972. In addition to direct observation, the history of outside those of to individuals being represented consistently interacted in contexts the club prior karate classes, workouts, was reconstructed such line the period of the study and club meetings. Each through drawn is referredandasclub an edge. informants to records in the university archives. During the period of observation, the club maintained between 50 and 100 members, and its activities two individuals consistently were observed to interact outside the included social affairs (parties, dances, and club normal activities of the club (karate classes banquets, etc.) Thatwell as as • Few direct observations; highly detailed info on nodes and an edge is drawn meetings). regularly scheduled ifkarate lessons. could be said to be friends outside the the individuals The political organization of clubthe club activities.This while there was a constitutionin Figure 2. officers, was informal, and graph is represented as a matrix and four All is, most decisions were made nondirectional at represent interaction in both the edges in Figure 1 are by concensus club meetings. For its classes, edges (they the club employed thepart-time karate instructor, who will possible to to directions), and a graph is said to be symmetrical.It is also be referred draw edges that are directed (representing one-way relationships); such as Mr. Hi.2 At the beginning of the study there was an incipient conflict • E.g. karate club (Zachary, 1977) between the club president, John A., and Mr. Hi over the price of karate lessons. Mr. Hi, who wished to raise prices, claimed the authority to set his own lesson fees, since he was the instructor. John A., who wished to stabilize prices, claimed the authority to set the lesson fees since he was the club's chief administrator. @jakehofman (Yahoo! Research) As time passed the entire club became w/ Hadoop Large-scale social media analysis divided over this issue, and May 23, 2010 3 / 71
  • 4. 1990s ∼ 104 nodes • Larger, indirect samples; relatively few details on nodes and edges • E.g. APS co-authorship network (http://bit.ly/aps08jmh) @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 4 / 71
  • 5. Present ∼ 107 nodes + • Very large, dynamic samples; many details in node and edge metadata • E.g. Mail, Messenger, Facebook, Twitter, etc. @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 5 / 71
  • 6. @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 6 / 71
  • 7. What could you ask of it? @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 6 / 71
  • 8. Look familiar? # ls -talh neat_dataset.tar.gz -rw-r--r-- 100T May 23 13:00 neat_dataset.tar.gz @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 8 / 71
  • 9. Look familiar?1 # ls -talh twitter_rv.tar -rw-r--r-- 24G May 23 13:00 twitter_rv.tar 1 http://an.kaist.ac.kr/traces/WWW2010.html @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 10 / 71
  • 10. Agenda Large-scale social media analysis with Hadoop @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 11 / 71
  • 11. Agenda Large-scale social media analysis with Hadoop GB/TB/PB-scale, 10,000+ nodes @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 11 / 71
  • 12. Agenda Large-scale social media analysis with Hadoop network & text data @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 11 / 71
  • 13. Agenda Large-scale social media analysis with Hadoop network analysis & machine learning @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 11 / 71
  • 14. Agenda Large-scale social media analysis with Hadoop open source Apache project for distributed storage/computation @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 11 / 71
  • 15. Warning You may be bored if you already know how to ... @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 12 / 71
  • 16. Warning You may be bored if you already know how to ... • Install and use Hadoop (on a single machine and EC2) @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 12 / 71
  • 17. Warning You may be bored if you already know how to ... • Install and use Hadoop (on a single machine and EC2) • Run jobs in local and distributed modes @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 12 / 71
  • 18. Warning You may be bored if you already know how to ... • Install and use Hadoop (on a single machine and EC2) • Run jobs in local and distributed modes • Implement distributed solutions for: @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 12 / 71
  • 19. Warning You may be bored if you already know how to ... • Install and use Hadoop (on a single machine and EC2) • Run jobs in local and distributed modes • Implement distributed solutions for: • Parsing and manipulating large text collections @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 12 / 71
  • 20. Warning You may be bored if you already know how to ... • Install and use Hadoop (on a single machine and EC2) • Run jobs in local and distributed modes • Implement distributed solutions for: • Parsing and manipulating large text collections • Clustering coefficient, BFS, etc., for networks w/ billions of edges @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 12 / 71
  • 21. Warning You may be bored if you already know how to ... • Install and use Hadoop (on a single machine and EC2) • Run jobs in local and distributed modes • Implement distributed solutions for: • Parsing and manipulating large text collections • Clustering coefficient, BFS, etc., for networks w/ billions of edges • Classification, clustering @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 12 / 71
  • 22. Selected resources http://www.hadoopbook.com/ @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 13 / 71
  • 23. Selected resources http://www.cloudera.com/ @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 13 / 71
  • 24. Selected resources i Data-Intensive Text Processing with MapReduce Jimmy Lin and Chris Dyer University of Maryland, College Park Draft of February 19, 2010 http://www.umiacs.umd.edu/∼jimmylin/book.html This is a (partial) draft of a book that is in preparation for Morgan & Claypool Synthesis Lectures on Human Language Technologies. Anticipated publication date is mid-2010. Comments and feedback are welcome! @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 13 / 71
  • 25. Selected resources ... and many more at http://delicious.com/jhofman/hadoop @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 13 / 71
  • 26. Selected resources ... and many more at http://delicious.com/pskomoroch/hadoop @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 13 / 71
  • 27. Outline 1 Background (5 Ws) 2 Introduction to MapReduce (How, Part I) 3 Applications (How, Part II) @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 14 / 71
  • 28. What? @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 15 / 71
  • 29. What? “... to create building blocks for programmers who just happen to have lots of data to store, lots of data to analyze, or lots of machines to coordinate, and who don’t have the time, the skill, or the inclination to become distributed systems experts to build the infrastructure to handle it.” -Tom White Hadoop: The Definitive Guide @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 16 / 71
  • 30. What? Hadoop contains many subprojects: We’ll focus on distributed computation with MapReduce. @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 17 / 71
  • 31. Who/when? An overly brief history @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 18 / 71
  • 32. Who/when? pre-2004 Cutting and Cafarella develop open source projects for web-scale indexing, crawling, and search @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 18 / 71
  • 33. Who/when? 2004 Dean and Ghemawat publish MapReduce programming model, used internally at Google @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 18 / 71
  • 34. Who/when? 2006 Hadoop becomes official Apache project, Cutting joins Yahoo!, Yahoo adopts Hadoop @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 18 / 71
  • 35. Who/when? @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 18 / 71
  • 36. Where? http://wiki.apache.org/hadoop/PoweredBy @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 19 / 71
  • 37. Why? Why yet another solution? (I already use too many languages/environments) @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 20 / 71
  • 38. Why? Why a distributed solution? (My desktop has TBs of storage and GBs of memory) @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 20 / 71
  • 39. Why? Roughly how long to read 1TB from a commodity hard disk? @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 21 / 71
  • 40. Why? Roughly how long to read 1TB from a commodity hard disk? 1 Gb 1 B sec GB × × 3600 ≈ 225 2 sec 8 b hr hr @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 21 / 71
  • 41. Why? Roughly how long to read 1TB from a commodity hard disk? ≈ 4hrs @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 21 / 71
  • 42. Why? http://bit.ly/petabytesort @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 22 / 71
  • 43. Outline 1 Background (5 Ws) 2 Introduction to MapReduce (How, Part I) 3 Applications (How, Part II) @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 23 / 71
  • 44. Typical scenario Store, parse, and analyze high-volume server logs, e.g. how many search queries match “icwsm”? @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 24 / 71
  • 45. MapReduce: 30k ft Break large problem into smaller parts, solve in parallel, combine results @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 25 / 71
  • 46. Typical scenario “Embarassingly parallel” (or nearly so) local read filter } node 1 local read filter node 2 collect results local read filter node 3 local read filter node 4 @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 26 / 71
  • 47. Typical scenario++ How many search queries match “icwsm”, grouped by month? @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 27 / 71
  • 48. MapReduce: example Map Shuffle Reduce matching records to to collect all records results by adding (YYYYMM, count=1) w/ same key (month) count values for each key 20091201,4.2.2.1,"icwsm 2010" 20100523,2.4.1.2,"hadoop" 20100101,9.7.6.5,"tutorial" 200912, 1 20091125,2.4.6.1,"data" 20090708,4.2.2.1,"open source" 20100124,1.2.2.4,"washington dc" 200912, 1 200912, 2 200912, 1 ... ... 200910, 1 200910, 1 20100522,2.4.1.2,"conference" 20091008,4.2.2.1,"2009 icwsm" 20090807,4.2.2.1,"apache.org" 200910, 1 20100101,9.7.6.5,"mapreduce" 200911, 1 20100123,1.2.2.4,"washington dc" 20091121,2.4.6.1,"icwsm dates" 200911, 1 200911, 1 ... ... 20090807,4.2.2.1,"distributed" 20091225,4.2.2.1,"icwsm" 20100522,2.4.1.2,"media" 200912, 1 20100123,1.2.2.4,"social" 20091114,2.4.6.1,"d.c." 20100101,9.7.6.5,"new year's" @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 28 / 71
  • 49. MapReduce: paradigm Programmer specifies map and reduce functions @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 29 / 71
  • 50. MapReduce: paradigm Map: tranforms input record to intermediate (key, value) pair @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 29 / 71
  • 51. MapReduce: paradigm Shuffle: collects all intermediate records by key @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 29 / 71
  • 52. MapReduce: paradigm Reduce: transforms all records for given key to final output @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 29 / 71
  • 53. MapReduce: paradigm Distributed read, shuffle, and write are transparent to programmer @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 29 / 71
  • 54. MapReduce: principles • Move code to data (local computation) • Allow programs to scale transparently w.r.t size of input • Abstract away fault tolerance, synchronization, etc. @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 30 / 71
  • 55. MapReduce: strengths • Batch, offline jobs • Write-once, read-many across full data set • Usually, though not always, simple computations • I/O bound by disk/network bandwidth @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 31 / 71
  • 56. !MapReduce What it’s not: • High-performance parallel computing, e.g. MPI • Low-latency random access relational database • Always the right solution @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 32 / 71
  • 57. Word count dog 2 -- 1 the 3 brown 1 fox 2 the quick brown fox jumped 1 jumps over the lazy dog lazy 2 who jumped over that jumps 1 lazy dog -- the fox ? over 2 quick 1 that 1 who 1 ? 1 @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 33 / 71
  • 58. Word count Map: for each line, output each word and count (of 1) the 1 quick 1 brown 1 fox 1 --------- jumps 1 over 1 the 1 the quick brown fox lazy 1 -------------------------------- dog 1 jumps over the lazy dog --------- -------------------------------- who 1 who jumped over that jumped 1 -------------------------------- over 1 lazy dog -- the fox ? --------- that 1 lazy 1 dog 1 -- 1 the 1 fox 1 ? 1 @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 33 / 71
  • 59. Word count Shuffle: collect all records for each word -- 1 --------- ? 1 --------- brown 1 --------- dog 1 dog 1 --------- fox 1 fox 1 --------- the quick brown fox jumped 1 -------------------------------- --------- jumps over the lazy dog jumps 1 -------------------------------- --------- who jumped over that lazy 1 -------------------------------- lazy 1 lazy dog -- the fox ? --------- over 1 over 1 --------- quick 1 --------- that 1 --------- the 1 the 1 the 1 --------- who 1 @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 33 / 71
  • 60. Word count Reduce: add counts for each word -- 1 --------- ? 1 --------- brown 1 --------- dog 1 dog 1 --------- fox 1 -- 1 fox 1 ? 1 --------- brown 1 jumped 1 dog 2 --------- fox 2 jumps 1 jumped 1 --------- jumps 1 lazy 1 lazy 2 lazy 1 over 2 --------- quick 1 over 1 that 1 over 1 the 3 --------- who 1 quick 1 --------- that 1 --------- the 1 the 1 the 1 --------- who 1 @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 33 / 71
  • 61. Word count dog 1 dog 1 --------- -- 1 --------- the 1 the 1 the 1 --------- brown 1 dog 2 --------- -- 1 fox 1 the 3 the quick brown fox fox 1 brown 1 -------------------------------- --------- fox 2 jumps over the lazy dog jumped 1 jumped 1 -------------------------------- --------- lazy 2 who jumped over that lazy 1 jumps 1 -------------------------------- lazy 1 over 2 lazy dog -- the fox ? --------- quick 1 jumps 1 that 1 --------- who 1 over 1 ? 1 over 1 --------- quick 1 --------- that 1 --------- ? 1 --------- who 1 @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 33 / 71
  • 62. WordCount.java @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 34 / 71
  • 63. Hadoop streaming @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 35 / 71
  • 64. Hadoop streaming MapReduce for *nix geeks2 : # cat data | map | sort | reduce • Mapper reads input data from stdin • Mapper writes output to stdout • Reducer receives input, sorted by key, on stdin • Reducer writes output to stdout 2 http://bit.ly/michaelnoll @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 37 / 71
  • 65. wordcount.sh Locally: # cat data | tr " " "n" | sort | uniq -c @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 39 / 71
  • 66. wordcount.sh Locally: # cat data | tr " " "n" | sort | uniq -c ⇓ Distributed: @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 39 / 71
  • 67. Transparent scaling Use the same code on MBs locally or TBs across thousands of machines. @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 40 / 71
  • 68. wordcount.py @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 42 / 71
  • 69. Outline 1 Background (5 Ws) 2 Introduction to MapReduce (How, Part I) 3 Applications (How, Part II) @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 43 / 71
  • 70. Network data @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 44 / 71
  • 71. Scale ... • Example numbers: • ∼ 107 nodes • ∼ 102 edges/node User 1 User 2 • no node/edge data • static • ∼8GB ... @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 45 / 71
  • 72. Scale ... • Example numbers: • ∼ 107 nodes • ∼ 102 edges/node User 1 User 2 • no node/edge data • static • ∼8GB ... Simple, static networks push memory limit for commodity machines @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 45 / 71
  • 73. Scale @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 46 / 71
  • 74. Scale • Example numbers: ... • ∼ 107 nodes • ∼ 102 edges/node Message Header • node/edge metadata User User 1 Content ... User 2 User Profile Profile • dynamic History History ... ... • ∼100GB/day ... @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 47 / 71
  • 75. Scale • Example numbers: ... • ∼ 107 nodes • ∼ 102 edges/node Message Header • node/edge metadata User User 1 Content ... User 2 User Profile Profile • dynamic History History ... ... • ∼100GB/day ... Dynamic, data-rich social networks exceed memory limits; require considerable storage @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 47 / 71
  • 76. Assumptions Look only at topology, ignoring node and edge metadata @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 48 / 71
  • 77. Assumptions Full network exceeds memory of single machine ... ... ... @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 48 / 71
  • 78. Assumptions Full network exceeds memory of single machine ... ... ... @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 48 / 71
  • 79. Assumptions First-hop neighborhood of any individual node fits in memory @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 48 / 71
  • 80. Distributed network analysis MapReduce convenient for parallelizing individual node/edge-level calculations @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 49 / 71
  • 81. Distributed network analysis Higher-order calculations more difficult , but can be adapted to MapReduce framework @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 50 / 71
  • 82. Distributed network analysis • Higher-order node-level • Network descriptive statistics creation/manipulation • Clustering coefficient • Logs → edges • Implicit degree • Edge list ↔ adjacency • ... list • Global calculations • Directed ↔ undirected • Edge thresholds • Pairwise connectivity • Connected • First-order descriptive components statistics • Minimum spanning • Number of nodes tree • Number of edges • Breadth-first search • Node degrees • Pagerank • Community detection @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 51 / 71
  • 83. Edge list → adjacency list @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 52 / 71
  • 84. Edge list → adjacency list source target 1 4 node in/out neighbors 1 5 1 37 3546 1 6 10 2 7 6 2 10 3 4 98 7 1 3 14 124 3 1 4 13 32 1 3 5 1 4 2 6 17 3 2 7 16 2 8 8 2 10 2 9 2 2 9 3 4 4 3 @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 52 / 71
  • 85. Edge list → adjacency list Map: for each (source, target), output (source, →, target) & (target, ←, source) node direction neighbor 1 > 4 4 < 1 ----------------- source target 1 > 5 1 4 5 < 1 --------- ----------------- 1 5 1 > 6 --------- 6 < 1 1 6 ----------------- --------- 7 > 6 7 6 6 < 7 --------- ----------------- 7 1 7 > 1 --------- 1 < 7 3 1 ----------------- --------- 3 > 1 1 3 1 < 3 --------- ----------------- 1 > 3 3 < 1 ----------------- @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 52 / 71
  • 86. Edge list → adjacency list Shuffle: collect each node’s records node direction neighbor 1 < 3 1 < 7 1 > 3 source target 1 > 4 1 4 1 > 5 --------- 1 > 6 1 5 ----------------- --------- 10 > 2 1 6 ----------------- --------- 2 < 10 7 6 2 < 3 --------- 2 < 4 7 1 2 > 8 --------- 2 > 9 3 1 ----------------- --------- 3 < 1 1 3 3 < 4 --------- 3 > 1 3 > 2 3 > 4 @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 52 / 71
  • 87. Edge list → adjacency list Reduce: for each node, concatenate all in- and out-neighbors node direction neighbor 1 < 3 1 < 7 1 > 3 1 > 4 1 > 5 1 > 6 node in/out neighbors ----------------- 10 > 2 1 37 3546 ----------------- 10 2 2 < 10 2 10 3 4 98 2 < 3 3 14 124 2 < 4 4 13 32 2 > 8 5 1 2 > 9 6 17 ----------------- 7 16 3 < 1 8 2 3 < 4 9 2 3 > 1 3 > 2 3 > 4 ... @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 52 / 71
  • 88. Edge list → adjacency list node direction neighbor 1 < 3 1 < 7 1 > 3 1 > 4 source target 1 > 5 1 4 1 > 6 node in/out neighbors --------- ----------------- 1 5 10 > 2 1 37 3546 --------- ----------------- 10 2 1 6 2 < 10 2 10 3 4 98 --------- 2 < 3 3 14 124 7 6 2 < 4 4 13 32 --------- 2 > 8 5 1 7 1 2 > 9 6 17 --------- ----------------- 7 16 3 1 3 < 1 8 2 --------- 3 < 4 9 2 1 3 3 > 1 --------- 3 > 2 ... 3 > 4 ... @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 52 / 71
  • 89. Edge list → adjacency list Adjacency lists provide access to a node’s local structure — e.g. we can pass messages from a node to its neighbors. @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 52 / 71
  • 90. Degree distribution node in/out-degree 1 2 4 10 0 1 2 3 2 3 2 3 4 2 2 5 1 0 6 2 0 7 0 2 8 1 0 9 1 0 @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 53 / 71
  • 91. Degree distribution Map: for each node, output in- and out-degree with count (of 1) bin count in_0 1 in_0 1 --------- in_1 1 in_1 1 in_1 1 --------- in_2 1 node in/out-degree in_2 1 in/out degree count 1 2 4 in_2 1 in 0 2 10 0 1 in_2 1 in 1 3 2 3 2 --------- in 2 4 3 2 3 in_3 1 in 3 1 4 2 2 --------- out 0 4 5 1 0 out_0 1 out 1 1 6 2 0 out_0 1 out 2 3 7 0 2 out_0 1 out 3 1 8 1 0 out_0 1 out 4 1 9 1 0 --------- out_1 1 --------- out_2 1 out_2 1 out_2 1 --------- out_3 1 --------- out_4 1 @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 54 / 71
  • 92. Degree distribution Shuffle: collect counts for each in/out-degree bin count in_0 1 in_0 1 --------- in_1 1 in_1 1 in_1 1 --------- in_2 1 node in/out-degree in_2 1 in/out degree count 1 2 4 in_2 1 in 0 2 10 0 1 in_2 1 in 1 3 2 3 2 --------- in 2 4 3 2 3 in_3 1 in 3 1 4 2 2 --------- out 0 4 5 1 0 out_0 1 out 1 1 6 2 0 out_0 1 out 2 3 7 0 2 out_0 1 out 3 1 8 1 0 out_0 1 out 4 1 9 1 0 --------- out_1 1 --------- out_2 1 out_2 1 out_2 1 --------- out_3 1 --------- out_4 1 @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 54 / 71
  • 93. Degree distribution Reduce: add counts bin count in_0 1 in_0 1 --------- in_1 1 in_1 1 in_1 1 --------- in_2 1 node in/out-degree in_2 1 in/out degree count 1 2 4 in_2 1 in 0 2 10 0 1 in_2 1 in 1 3 2 3 2 --------- in 2 4 3 2 3 in_3 1 in 3 1 4 2 2 --------- out 0 4 5 1 0 out_0 1 out 1 1 6 2 0 out_0 1 out 2 3 7 0 2 out_0 1 out 3 1 8 1 0 out_0 1 out 4 1 9 1 0 --------- out_1 1 --------- out_2 1 out_2 1 out_2 1 --------- out_3 1 --------- out_4 1 @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 54 / 71
  • 94. Clustering coefficient Fraction of edges amongst a node’s in/out-neighbors ? ? — e.g. how many of a node’s friends are following each other? @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 55 / 71
  • 95. Clustering coefficient 8 10 followers friends 7 10 6 10 ? 5 10 count 4 10 3 ? 10 2 10 1 10 0 10 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 clustering coefficient @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 56 / 71
  • 96. Clustering coefficient 5 6 1 4 7 8 3 2 10 9 @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 56 / 71
  • 97. Clustering coefficient 5 6 1 4 7 8 3 2 10 9 @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 56 / 71
  • 98. Clustering coefficient 5 6 1 4 7 8 3 2 10 9 @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 56 / 71
  • 99. Clustering coefficient Map: pass all of a node’s out-neighbors to each of its in-neighbors out two-hop node neighbor neighbors node in/out-neighbors 3 1 3546 1 37 3546 7 1 3546 ------------------------- ------------------------- 10 2 10 2 98 ------------------------- 3 2 98 2 10 3 4 98 4 2 98 ------------------------- ------------------------- 3 14 124 1 3 124 ------------------------- 4 3 124 4 13 32 ------------------------- ------------------------- 1 4 32 5 1 3 4 32 ------------------------- ------------------------- 6 17 1 5 ------------------------- ------------------------- 7 16 1 6 ------------------------- 7 6 8 2 ------------------------- ------------------------- 2 8 9 2 ------------------------- 2 9 @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 57 / 71
  • 100. Clustering coefficient Shuffle: collect each node’s two-hop neighborhoods out two-hop node in/out-neighbors node neighbor neighbors 1 37 3546 1 3 124 ------------------------- 1 4 32 10 2 1 5 ------------------------- 1 6 2 10 3 4 98 ------------------------- ------------------------- 10 2 98 3 14 124 ------------------------- ------------------------- 2 8 4 13 32 2 9 ------------------------- ------------------------- 5 1 3 1 3546 ------------------------- 3 2 98 6 17 3 4 32 ------------------------- ------------------------- 7 16 4 2 98 ------------------------- 4 3 124 8 2 ------------------------- ------------------------- 7 1 3546 9 2 7 6 @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 57 / 71
  • 101. Clustering coefficient Reduce: count a half-triangle for each node reachable by both a one- and two-hop path out two-hop node neighbor neighbors 1 3 124 1 4 32 1 5 node triangles 1 6 ------------------------- 1 1.0 10 2 98 ------------ ------------------------- 10 0.0 2 8 ------------ 2 9 2 0.0 ------------------------- ------------ 3 1 3546 3 1.0 3 2 98 ------------ 3 4 32 4 0.5 ------------------------- ------------ 4 2 98 7 0.5 4 3 124 ------------------------- 7 1 3546 7 6 @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 57 / 71
  • 102. Clustering coefficient node triangles 1 1.0 ------------ 10 0.0 ------------ 2 0.0 ------------ 3 1.0 ------------ 4 0.5 ------------ 7 0.5 Note: this approach generates large amount of intermediate data relative to final output. @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 57 / 71
  • 103. Breadth first search Iterative approach: each MapReduce round expands the frontier ? ? 0 ? ? ? ? ? ? ? Map: If node’s distance d to source is finite, output neighbor’s distance as d+1 Reduce: Set node’s distance to minimum received from all in-neighbors @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 58 / 71
  • 104. Breadth first search Iterative approach: each MapReduce round expands the frontier 1 1 0 1 ? ? 1 ? ? ? Map: If node’s distance d to source is finite, output neighbor’s distance as d+1 Reduce: Set node’s distance to minimum received from all in-neighbors @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 58 / 71
  • 105. Breadth first search Iterative approach: each MapReduce round expands the frontier 1 1 0 1 ? ? 1 2 ? ? Map: If node’s distance d to source is finite, output neighbor’s distance as d+1 Reduce: Set node’s distance to minimum received from all in-neighbors @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 58 / 71
  • 106. Breadth first search Iterative approach: each MapReduce round expands the frontier 1 1 0 1 ? 3 1 2 ? 3 Map: If node’s distance d to source is finite, output neighbor’s distance as d+1 Reduce: Set node’s distance to minimum received from all in-neighbors @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 58 / 71
  • 107. Break complicated tasks into multiple, simpler MapReduce rounds. @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 59 / 71
  • 108. Pagerank Iterative approach: each MapReduce round broadcasts and collects edge messages for power method 3 q5 q6 q1 q4 q7 q8 q3 q2 q10 q9 Map: Output current pagerank over degree to each out-neighbor Reduce: Sum incoming probabilities to update estimate http://bit.ly/nielsenpagerank 3 Extra rounds for random jump, dangling nodes @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 60 / 71
  • 109. Machine learning @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 61 / 71
  • 110. Machine learning • Often use MapReduce for feature extraction, then fit/optimize locally • Useful for “embarassingly parallel” parts of learning, e.g. • parameters sweeps for cross-validation • independent restarts for local optimization • making predictions on independent examples • Remember: MapReduce isn’t always the answer @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 62 / 71
  • 111. Classification Example: given words in an article, assign article to one of K classes “Floyd Landis showed up at the Tour of California” @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 63 / 71
  • 112. Classification Example: given words in an article, assign article to one of K classes “Floyd Landis showed up at the Tour of California” ⇓ @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 63 / 71
  • 113. Classification Example: given words in an article, assign article to one of K classes “Floyd Landis showed up at the Tour of California” ⇓ ... @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 63 / 71
  • 114. Classification: naive Bayes • Model presence/absence of each word as independent coin flip p (word|class) = Bernoulli(θwc ) p (words|class) = p (word1 |class) p (word2 |class) . . . @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 64 / 71
  • 115. Classification: naive Bayes • Model presence/absence of each word as independent coin flip p (word|class) = Bernoulli(θwc ) p (words|class) = p (word1 |class) p (word2 |class) . . . • Maximum likelihood estimates of probabilities from word and class counts ˆ Nwc θwc = Nc ˆ Nc θc = N @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 64 / 71
  • 116. Classification: naive Bayes • Model presence/absence of each word as independent coin flip p (word|class) = Bernoulli(θwc ) p (words|class) = p (word1 |class) p (word2 |class) . . . • Maximum likelihood estimates of probabilities from word and class counts ˆ Nwc θwc = Nc ˆ Nc θc = N • Use bayes’ rule to calculate distribution over classes given words p (words|class, Θ) p (class, Θ) p (class|words, Θ) = p (words, Θ) @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 64 / 71
  • 117. Classification: naive Bayes Naive ↔ independent features @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 65 / 71
  • 118. Classification: naive Bayes Naive ↔ independent features ⇓ Class-conditional word count @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 65 / 71
  • 119. Classification: naive Bayes class word count sports an 355 sports be 317 sports first 318 sports game 379 sports has 374 sports have 284 sports one 296 sports said 325 class words sports season 295 sports team 279 world Economics Is on Agenda for U.S. Meetings in China sports their 334 -------------------------------- sports this 293 world U.K. Backs Germany's Effort to Support Euro sports when 290 -------------------------------- sports who 363 sports A Pitchersʼ Duel Ends in Favor of the Yankees world after 347 -------------------------------- world but 299 sports After Doping Allegations, a Race for Details world government 300 world had 352 world have 342 world he 355 world its 308 world mr 293 world united 313 world were 319 @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 66 / 71
  • 120. Clustering: Find clusters of “similar” points http://bit.ly/oldfaithful @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 67 / 71
  • 121. Clustering: K-means Map: Assign each point to cluster with closest mean4 , output (cluster, features) Reduce: Update clusters by calculating new class-conditional means http://en.wikipedia.org/wiki/K-means clustering 4 Each mapper loads all cluster centers on init @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 68 / 71
  • 122. Mahout http://mahout.apache.org/ @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 69 / 71
  • 123. Thanks • Sharad Goely • Winter Masony • Sid Suriy • Sergei Vassilvitskiiy • Duncan Wattsy • Eytan Bakshym,y y Yahoo! Research (http://research.yahoo.com) m University of Michigan @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 70 / 71
  • 124. Thanks. Questions?5 5 http://jakehofman.com/icwsm2010 @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 71 / 71