MapReduce / Hadoop for Scientific Data Mining
        Hadoop = Open Source MapReduce
                     Wider World of Hadoop




   An Introduction to the World of Hadoop
               Applications to Scientific Data Mining


                                    Gordon Rios

                              g.rios@4c.ucc.ie
                     Cork Constraint Computation Centre (4C)
                             University College Cork


                                October 29, 2010




                                Gordon Rios    Introduction to Hadoop


Outline

  1   MapReduce / Hadoop for Scientific Data Mining
        Objectives for the Talk
        MapReduce as Simplified Parallel Computing
        Thinking in Terms of MapReduce
  2   Hadoop is Open Source MapReduce
        Basics of Hadoop
        Hadoop Examples
        Developing Production Systems with Hadoop
  3   Wider World of Hadoop
        Ad Hoc Analysis with Hadoop
        Further Reading



Objectives

  At the end of this talk I want you to have ideas for how to apply
  MapReduce to your domains and confidence that Hadoop is a
  good way to do it. . .
       Introduce thinking in terms of MapReduce and why it’s a
       good idea
       Introduce Hadoop as an open source implementation of
       MapReduce
       Present a detailed example of using the Hadoop
       streaming API for a scientific data mining task
       Discuss higher level notions for performing ad hoc analysis
       and building systems with Hadoop



MapReduce
 MapReduce is distributed computing where we take advantage of
 data locality to push the computation to the data. . .
     Distributed computing: clusters of computers with local memory
     and disk (network intensive for big data)
     Parallel Computing: multiple CPUs processing over shared
     memory and filesystem
     If we can decompose the problem into independent map and
     reduce tasks we can achieve “easy” parallelism with
     MapReduce. . .
        1 Map works independently to convert input data to key value
          pairs. . .
        2 Reduce works independently on all values for a given key
          and transforms them to a single output set (possibly even
          just the ∅) per key. . .
     Now, let’s expand that a bit. . .

Basic Elements of MapReduce
  MapReduce is a distributed sort with specific places to insert
  application logic. . .
          an input reader: reads work data W from the file system [1] and
          produces a set of splits S: W → S
          a Map function: S → (K, V)
          a combiner function: a mapper-side optimization. . .
          a partition function: assigns [2] keys k ∈ K to reducers: K → R
          a compare function cmp(k_i, k_j): sorts the keys presented to each
          reducer
          a Reduce function: reduces the output from all mappers for a
          particular key k to another set of values w_k for that key:
          (k, V) → (k, w_k)
          an output writer: writes output to the file system.
    [1] A distributed file system (DFS) for stability and scale
    [2] The default partitioner hashes the key modulo the number of reducers
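
The elements listed above can be sketched in a few lines of single-process Python. This is only an illustration under stated assumptions, not how Hadoop is implemented: the names `run_job` and `partition`, the in-memory lists standing in for input splits, the made-up temperature records, and the use of `zlib.crc32` as the hash are all choices made for this sketch. No combiner is shown.

```python
import zlib
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def partition(key, num_reducers):
    """Assign a key to a reducer: hash of the key modulo the number of reducers.

    zlib.crc32 is used only because it is stable across runs; it is an
    assumption of this sketch, not Hadoop's actual hash function.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_job(splits, map_fn, reduce_fn, num_reducers=2):
    """Toy single-process MapReduce: map, partition, sort, reduce."""
    # Map: each split is processed independently into (key, value) pairs,
    # and each pair is routed to a reducer by the partition function.
    buckets = defaultdict(list)
    for split in splits:
        for key, value in map_fn(split):
            buckets[partition(key, num_reducers)].append((key, value))

    output = []
    for r in sorted(buckets):
        # Sort: each reducer sees its pairs ordered by key.
        pairs = sorted(buckets[r], key=itemgetter(0))
        # Reduce: one call per key, with all values for that key.
        for key, group in groupby(pairs, key=itemgetter(0)):
            output.append((key, reduce_fn(key, [v for _, v in group])))
    return output


def year_temp_map(split):
    """Map: emit (year, temperature) for each 'year,temperature' line."""
    for line in split:
        year, temp = line.split(",")
        yield year, int(temp)


def max_reduce(key, values):
    """Reduce: keep the maximum value seen for the key."""
    return max(values)


# Made-up records, divided in two to mimic two input splits.
splits = [["1949,111", "1950,0"], ["1950,22", "1949,78", "1950,-11"]]
result = dict(run_job(splits, year_temp_map, max_reduce))
print(result)
```

Because every value for a given key lands in the same bucket, each reduce call can run without knowing anything about the other keys, which is what makes the reducers independent.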

Examples of Map and Reduce
  Let’s start with a few examples of Map. . .
       Word Count: read in a stream of text (e.g. a document or a set of
       documents) and emit each word as a key with a value of 1
       Inverted Index: read in a stream of documents and emit each
       word as a key and the document ID as the value
       Max Temperature: read in formatted data and emit year as a
       key with temperature as the value
       Mean Rain Precipitation: read in daily data and emit
       (year-month, lat, long) as a key with precipitation as
       the value
  Reduce in these cases simply applies a count, list, max, or
  average, respectively, to the set of values for each
  key. [Dean and Ghemawat, 2008; Wikipedia, 2010; White, 2011]
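
As a rough sketch of the first two examples, the functions below write the Word Count and Inverted Index maps as plain Python generators and apply a count and a list as the reduce. The names (`word_count_map`, `inverted_index_map`, `reduce_by_key`) and the two toy documents are invented for illustration; splitting on whitespace is a simplification.

```python
from collections import defaultdict


def word_count_map(doc_id, text):
    """Emit each word as a key with a value of 1."""
    for word in text.lower().split():
        yield word, 1


def inverted_index_map(doc_id, text):
    """Emit each word as a key and the document ID as the value."""
    for word in text.lower().split():
        yield word, doc_id


def reduce_by_key(pairs, reduce_fn):
    """Collect all values per key, then apply reduce_fn to each list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in grouped.items()}


# Toy input: document ID -> text (made up for this sketch).
docs = {"d1": "the quick brown fox", "d2": "the lazy dog and the fox"}

count_pairs = [kv for d, t in docs.items() for kv in word_count_map(d, t)]
index_pairs = [kv for d, t in docs.items() for kv in inverted_index_map(d, t)]

counts = reduce_by_key(count_pairs, sum)                      # reduce = count
index = reduce_by_key(index_pairs, lambda v: sorted(set(v)))  # reduce = list

print(counts["the"], counts["fox"])
print(index["fox"], index["dog"])
```

The two jobs share the same grouping step and differ only in what the map emits as the value and which function the reduce applies.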

Visualizing Word Count




  source: Chris Wensel, from http://www.cascading.org




Engineering Intermezzo



  This is how easy it is to get Hadoop installed . . . given that you
  have Java 6 installed already. . .
  Get Hadoop: http://hadoop.apache.org/

       % tar xzf hadoop-x.y.z.tar.gz
       % export HADOOP_INSTALL=BUILD_DIR/hadoop-x.y.z
       % export PATH=$PATH:$HADOOP_INSTALL/bin




MapReduce with Hadoop and the streaming library


  Now, let's take a closer look at how Hadoop implements
  MapReduce, following [White, 2011]. . .




Hadoop Streaming Library



  We’ll focus on the streaming library as it’s the most natural for
  scientific or technical computing. . . let’s look at the Definitive
  Guide’s weather example. . .




Hadoop Book Examples

  More examples from Hadoop: The Definitive Guide, 2nd Edition
  (Hadoop 20.1) http://www.hadoopbook.com/. . . here’s
  how to install and try them for yourself. . .

  Install Git: http://git-scm.com/
  Visit GitHub for the book code: http://github.com/tomwhite/hadoop-book/

  Check out the code examples from The Definitive Guide
  % cd BUILD_DIR
  % git clone http://github.com/tomwhite/hadoop-book.git hadoop-book




Example: ECA Mean Precipitation

  Let’s compute mean precipitation at over 2,000 weather stations and
  make some graphics. There are 2,186 files with median of 21,875
  lines each, a minimum of 1,025 and a maximum of 78,090.
  ECA Daily Data

  The ECA dataset contains series of daily observations at meteorological stations throughout Europe and the
  Mediterranean. Part of the dataset is freely available for non-commercial research. To download this public data
  select one of the options below. Note that a gridded version with daily temperature and precipitation fields is also
  available. source: http://eca.knmi.nl/dailydata/index.php


  File Format


  FILE FORMAT (MISSING VALUE CODE = -9999):

  01-06   STAID : Station identifier
  08-13   SOUID : Source identifier
  15-22   DATE  : Date YYYYMMDD
  24-28   RR    : Precipitation amount in 0.1 mm
  30-34   Q_RR  : quality code for RR (0='valid'; 1='suspect'; 9='missing')




Example: ECA Mean Precipitation


  Scientific Data Mining: use the Hadoop streaming library and
  manually pipeline MapReduce jobs together as needed. . .
      Write the Hadoop scripts in Python, in two steps
      Test locally with cat data | map.py | sort | reduce.py >
      output (not shown)
      Process the data into individual files for each time period
      (year/month) of interest using the Hadoop streaming library
      (local mode)
      Call R in batch mode to produce image files




ECA Mean Precipitation: Step One

  map_one.py
  import sys

  # BEGIN_DATE, END_DATE (YYYYMMDD strings) and latlons (station id ->
  # (lat, lon)) are set up elsewhere in the script (not shown on this slide).

  def lat_lon_to_coord(dms):
      """Convert a "D:M:S" string to signed decimal degrees."""
      d, m, s = [float(x) for x in dms.split(":")]
      # Take the sign from the text so that "-00:30:00" stays negative.
      sign = -1 if dms.strip().startswith("-") else 1
      return sign * (abs(d) + m / 60.0 + s / 3600.0)

  for line in sys.stdin:
      # flds = (staid, souid, date, rr, q_rr)
      flds = line.strip().split(",")
      if len(flds) != 5:
          continue
      staid = flds[0].strip()   # station id
      date = flds[2].strip()    # YYYYMMDD
      if date < BEGIN_DATE or date > END_DATE:
          continue
      rr = flds[3].strip()      # precipitation in 0.1 mm
      q_rr = flds[4].strip()    # quality code, "0" = valid
      lat, lon = latlons.get(staid, (None, None))
      if q_rr == "0" and lat is not None and lon is not None:
          # key = (yyyymm, lat, lon), value = precipitation
          print("%s,%.4f,%.4f\t%s" % (date[0:6], lat, lon, rr))
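
The slide uses `latlons` without showing how it is built. One way it might be populated is sketched below, assuming a comma-separated station list with the columns STAID, NAME, COUNTRY, LAT, LON, HEIGHT and coordinates written as degrees:minutes:seconds. That layout, the function name `load_latlons`, and the sample rows are assumptions made for this sketch, not taken from the talk. The converter here takes the sign from the leading character, so a value such as -000:30:00 stays negative.

```python
def lat_lon_to_coord(dms):
    """Convert a 'D:M:S' string such as '+56:52:00' to signed decimal degrees."""
    d, m, s = (float(x) for x in dms.split(":"))
    sign = -1 if dms.strip().startswith("-") else 1
    return sign * (abs(d) + m / 60.0 + s / 3600.0)


def load_latlons(lines):
    """Build {station id: (lat, lon)} from comma-separated station lines.

    Assumed layout (hypothetical): STAID, NAME, COUNTRY, LAT, LON, HEIGHT,
    with LAT and LON written as degrees:minutes:seconds.
    """
    latlons = {}
    for line in lines:
        flds = [f.strip() for f in line.split(",")]
        if len(flds) != 6 or ":" not in flds[3]:
            continue  # skip headers, blank lines and malformed rows
        latlons[flds[0]] = (lat_lon_to_coord(flds[3]), lat_lon_to_coord(flds[4]))
    return latlons


# Made-up rows in the assumed layout.
sample = [
    "STAID, NAME, CN, LAT, LON, HGHT",
    "000001, STATION A, XX, +56:52:00, +014:48:00, 166",
    "000002, STATION B, XX, +41:07:00, -000:30:00, 20",
]
latlons = load_latlons(sample)
print(latlons["000002"])
```

The station IDs used as dictionary keys must be stripped the same way as `staid` in the mapper, otherwise the lookup silently returns `(None, None)` and the record is dropped.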




ECA Mean Precipitation: Step One (cont)


  reduce_one.py
  import sys

  # Streaming delivers lines sorted by key, so all values for a key are adjacent.
  (last_key, x, n) = (None, 0.0, 0)
  for line in sys.stdin:
      (key, val) = line.strip().split("\t")
      if last_key and last_key != key:  # key changed: time to emit reduced value
          if n > 0:
              print("%s\t%.2f" % (last_key, x / n))
          x = 0.0
          n = 0
      # accumulate the running sum and count for the current key
      (last_key, x, n) = (key, x + float(val), n + 1)
  if last_key:
      if n > 0:
          print("%s\t%.2f" % (last_key, x / n))
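
The same "emit when the key changes" pattern can also be written with `itertools.groupby`, which relies on the same guarantee that the input arrives sorted by key. A minimal sketch follows; `mean_by_key` is a name invented here, and a hand-written list with made-up values stands in for `sys.stdin`.

```python
from itertools import groupby


def mean_by_key(lines):
    """Yield 'key<TAB>mean' for each run of adjacent lines sharing a key."""
    pairs = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        values = [float(val) for _, val in group]
        yield "%s\t%.2f" % (key, sum(values) / len(values))


# Stand-in for sorted mapper output: key = yyyymm,lat,lon ; value = precipitation
sorted_lines = [
    "200901,41.1187,1.2453\t10\n",
    "200901,41.1187,1.2453\t20\n",
    "200902,41.1187,1.2453\t5\n",
]
for out in mean_by_key(sorted_lines):
    print(out)
```

In a streaming job the call would be `mean_by_key(sys.stdin)`. If the input were not sorted, `groupby` would start a new group every time the key changed and emit partial means, which is the same failure mode as the hand-written loop.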




ECA Mean Precipitation: Step Two
  Map ((yyyymm,lat,lon),mean) -> (yyyymm, (lat,lon,mean))

  map_two.py

  import sys

  for line in sys.stdin:
      yyyymm_lat_lon, mean_precip = line.strip().split("\t")
      yyyymm, lat, lon = yyyymm_lat_lon.strip().split(",")
      print("%s\t%s %s %s" % (yyyymm, lat, lon, mean_precip))

  The reduce step is effectively empty: it just writes each key's values
  to a local file (a hack, since we're running locally)

  reduce_two.py

  import sys

  # write_file(key, values) is a helper defined elsewhere in the script (not shown).
  last_key = None
  values = []
  for line in sys.stdin:
      (key, val) = line.strip().split("\t")
      if last_key and last_key != key:  # key changed: write out the finished group
          write_file(last_key, values)
          values = []
      last_key = key
      values.append(val)  # val is a string with three values: "lat lon mean"

  if last_key:
      write_file(last_key, values)
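
The `write_file` helper is not shown on the slide. A hypothetical version might look like the sketch below, which writes one "lat lon mean" string per line to a file named after the key. The `<yyyymm>.dat` naming is inferred from the file names read by the R script later in the talk; the `out_dir` parameter and the sample values are additions made for this sketch.

```python
import os
import tempfile


def write_file(key, values, out_dir="."):
    """Write one 'lat lon mean' string per line to <out_dir>/<key>.dat."""
    path = os.path.join(out_dir, "%s.dat" % key)
    with open(path, "w") as f:
        for val in values:
            f.write(val + "\n")
    return path


# Try it in a throwaway directory with made-up values.
tmp_dir = tempfile.mkdtemp()
path = write_file("200901", ["41.1187 1.2453 15.00", "48.8900 2.3508 22.50"], tmp_dir)
with open(path) as f:
    written = f.read().splitlines()
print(os.path.basename(path), len(written))
```

Writing to the local disk from a reducer only makes sense in local mode, as the slide notes; on a cluster each reducer would write on whichever node it happened to run.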



Example: ECA Mean Precipitation


  Step One: input -> (yyyymm,lat,lon), mean precip

  % hadoop jar /Users/gordon/build/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar -input
  /Downloads/tarragona/ECA_blend_all_data/RR_STAID00* -output output -mapper
  /Desktop/tmp/tarragona/python/map_one.py -reducer /Desktop/tmp/tarragona/python/reduce_one.py
  % 10/10/25 19:34:31 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
  ...
  % 10/10/25 20:07:36 INFO streaming.StreamJob: Job complete: job_local_0001
  % 10/10/25 20:07:36 INFO streaming.StreamJob: Output: output



  Step Two: (yyyymm,lat,lon), mean precip -> files(yyyymm)

  % hadoop jar /Users/gordon/build/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar -input output/part-00000
  -output output_two -mapper /Desktop/tmp/tarragona/python/map_two.py -reducer
  /Desktop/tmp/tarragona/python/reduce_two.py
  % 10/10/25 20:41:43 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
  ...
  % 10/10/25 20:41:47 INFO streaming.StreamJob: Job complete: job_local_0001
  % 10/10/25 20:41:47 INFO streaming.StreamJob: Output: output_two




Batch Processing in R
  And, after a little batch processing with R. . .
  batch-graphics.R

  library(fields)
  files <- c("200901.dat", "200902.dat", "200903.dat",
             "200904.dat", "200905.dat", "200906.dat",
             "200907.dat", "200908.dat", "200909.dat",
             "200910.dat", "200911.dat", "200912.dat")
  i <- 1
  for (f in files) {
      mat <- read.table(f)
      names(mat) <- c("lat", "long", "precip")
      png(filename=paste("precip-", i, ".png", sep=""), height=480, width=480)
      quilt.plot(mat$long, mat$lat, mat$precip, ncol=100, nrow=100,
          ylim=c(22.0, 79.0), xlim=c(-52.0, 72.0),
          col=two.colors(256, start="wheat", end="darkblue", middle="blue"),
          zlim=c(0, 410), add.legend=T, cex.lab=0.6)
      points(1.2453, 41.1187, pch=1)
      text(1.2453, 41.1187, "tarragona", cex=0.8, pos=1)
      points(2.35083, 48.89, pch=1)
      text(2.3508, 48.89, "paris", cex=0.8, pos=4)
      points(12.4823, 41.8955, pch=1)
      text(12.4823, 41.8955, "rome", cex=0.8, pos=4)
      dev.off()
      i <- i + 1
  }


ECA Precipitation 2009 Month: 1 through Month: 12

  [Figures: twelve slides, one precipitation map per month; images not reproduced.]


Summary of What We Did

  We worked through a complete example, but that’s not all: with very
  little additional work we can...
       Test the scripts in pseudo-distributed mode locally on our own
       machine
       Run the job on a compute cluster remotely
       Run the job in the cloud on EC2, treating their system as just another
       remote cluster
       Run the job with Amazon’s Elastic MapReduce
       (http://aws.amazon.com/elasticmapreduce/), which
       allows you to pay for exactly as much computing as you use
  See [White, 2011] for complete details on how to run in these different
  modes.
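
  Before trying any of these modes, it can help to dry-run the map and
  reduce logic in plain Python on a few records. The sketch below is not
  pseudo-distributed mode itself (no Hadoop daemons are involved); it only
  mimics the map, sort, reduce sequence in memory. The record layout
  ("lat long precip" per line), the function names, and the sample values
  are all assumptions made for illustration, not the scripts from this talk.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit (key, value) pairs: key is "lat long", value is precipitation."""
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip malformed records
        lat, lon, precip = fields
        yield f"{lat} {lon}", float(precip)


def reducer(pairs):
    """Sum the values for each key; input must already be sorted by key."""
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


def run_local(lines):
    """Mimic the pipeline in memory: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines), key=itemgetter(0))))


# Hypothetical sample input, including one malformed line.
sample = [
    "41.1 1.2 10.0",
    "48.9 2.4 5.5",
    "41.1 1.2 2.5",
    "bad record",
]
totals = run_local(sample)
print(totals)
```

  If the totals look right on a handful of lines, the same logic is a
  reasonable candidate to try in one of the modes listed above.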



Systems Development APIs



  And you can build production systems with Hadoop in either
  Java or C++...
      Full-featured Java API for Hadoop
      Pipes is the C++ API for Hadoop MapReduce
      Cascading is an API for developing general data
      processing systems that incorporate MapReduce
      (http://www.cascading.org)






Cascading

  Cascading allows developers to model very complex data flows at a higher level than Map & Reduce and then
  automatically generate and visualize the dozens or perhaps even hundreds of necessary Hadoop jobs as a graph
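
  To make the idea of a flow built from several jobs concrete, here is a
  toy in-memory sketch in Python. It is not the Cascading API (which is a
  Java library); run_job, run_flow, the stage functions, and the sample
  readings are all hypothetical names and data chosen for illustration.
  Each stage is one map/reduce pair, and the output of one stage is the
  input of the next.

```python
from collections import defaultdict


def run_job(records, map_fn, reduce_fn):
    """Run one in-memory MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [reduce_fn(key, values) for key, values in sorted(groups.items())]


def run_flow(records, stages):
    """Feed the output of each (map_fn, reduce_fn) stage into the next one."""
    for map_fn, reduce_fn in stages:
        records = run_job(records, map_fn, reduce_fn)
    return records


# Stage 1: total precipitation per (station, month).
def map_station_month(record):
    station, month, precip = record
    yield (station, month), precip


def reduce_total(key, values):
    return key, sum(values)


# Stage 2: wettest month per station, computed from the stage 1 output.
def map_station(record):
    (station, month), total = record
    yield station, (total, month)


def reduce_wettest(station, values):
    total, month = max(values)
    return station, month, total


# Hypothetical readings: (station, month, precipitation).
readings = [
    ("paris", 1, 4.0), ("paris", 1, 6.0), ("paris", 2, 3.0),
    ("rome", 1, 1.0), ("rome", 2, 2.0), ("rome", 2, 5.0),
]
flow = [(map_station_month, reduce_total), (map_station, reduce_wettest)]
result = run_flow(readings, flow)
print(result)
```

  Here the flow is just a list of two stages run in order; a real flow
  may expand into many more jobs, which is why a higher-level description
  is attractive.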






Ad Hoc Analysis

  What’s missing? Sometimes you need to do fast ad hoc
  queries... can we do that in a scalable way?
       Pig: “Pig is a scripting language for exploring large datasets”
       [White, 2011] (Yahoo!)
       Hive: provides an SQL interface for running ad hoc queries and
       other data processing tasks for SQL analysts (Facebook)
       HBase: column-oriented database along the lines of Google’s
       Bigtable database (Powerset)
       Hypertable: GPL clone of Google’s Bigtable database written in
       C++ (Zvents)
  Google’s Bigtable database is described
  in [Chang, Dean, Ghemawat, et al., 2006]
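
  The Bigtable paper describes its data model as a sparse, sorted map
  indexed by row key, column key, and timestamp. The toy class below
  sketches only that idea in memory; it says nothing about how Bigtable,
  HBase, or Hypertable are implemented or what their client APIs look
  like. TinyTable, its methods, and the keys and values are hypothetical.

```python
class TinyTable:
    """In-memory toy of a (row, column, timestamp) -> value map."""

    def __init__(self):
        self._cells = {}  # (row, column) -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        """Store one version of a cell."""
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before timestamp (newest overall if None)."""
        versions = self._cells.get((row, column), {})
        eligible = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(eligible)] if eligible else None

    def scan(self, start_row, stop_row):
        """Yield (row, column, latest value) for start_row <= row < stop_row, in key order."""
        for row, column in sorted(self._cells):
            if start_row <= row < stop_row:
                yield row, column, self.get(row, column)


# Hypothetical cells: one station per row, one column per month.
table = TinyTable()
table.put("paris", "precip:200901", 1, 40.0)
table.put("paris", "precip:200901", 2, 42.5)  # newer version of the same cell
table.put("rome", "precip:200901", 1, 80.0)

latest = table.get("paris", "precip:200901")
older = table.get("paris", "precip:200901", timestamp=1)
rows = list(table.scan("p", "q"))
print(latest, older, rows)
```

  Keeping rows in sorted order is what makes range scans over adjacent
  row keys natural in this model.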




Interesting Application Frameworks with Hadoop

  Here are a few examples of frameworks in development or already
  available that use Hadoop as a platform...
          Apache Mahout: ambitious project to implement popular
          machine learning algorithms and recommenders with Hadoop [3]
          Graph: Jake Hofman from Yahoo! Research has released some
          of his work on large-scale network analysis with Hadoop, with
          prototype code [4]. Also see [Vassilvitskii, 2010] for related graph
          analysis research.
          Application to GIS: Nathan Kerr’s M.S. thesis, with lots of details
          on how to do GIS with Hadoop [5]

     [3] http://mahout.apache.org/
     [4] http://github.com/jhofman/icwsm2010_tutorial
     [5] http://www.nathankerr.com/projects/parallel-gis-processing/alternative_approaches_to_parallel_gis_processing.html


Further Reading
     White, T.
     Hadoop: The Definitive Guide, 2nd Edition
     O’Reilly Media, Inc., Sebastopol, CA, 2011

     Sanderson, D.
     Programming Google App Engine
     O’Reilly Media, Inc., Sebastopol, CA, 2009

     Murty, J.
     Programming Amazon Web Services
     O’Reilly Media, Inc., Sebastopol, CA, 2008

     Dean, J. and Ghemawat, S.
     MapReduce: simplified data processing on large clusters
     Communications of the ACM, 51(1):107–113, 2008

     Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A.,
     Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E.
     Bigtable: a distributed storage system for structured data
     OSDI ’06: Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation,
     USENIX Assoc., Berkeley, CA, 2006

     MapReduce on Wikipedia
     http://en.wikipedia.org/wiki/MapReduce

     Vassilvitskii, S.
     XXL Graph Algorithms, Hadoop Summit 2010
     http://developer.yahoo.com/events/hadoopsummit2010/

                                       Gordon Rios      Introduction to Hadoop

Mais conteúdo relacionado

Mais procurados

Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
Christopher Pezza
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
Varun Narang
 

Mais procurados (20)

Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop Ecosystem
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architecture
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data Processing
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Hadoop technology doc
Hadoop technology docHadoop technology doc
Hadoop technology doc
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Apache Hadoop at 10
Apache Hadoop at 10Apache Hadoop at 10
Apache Hadoop at 10
 
Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoop
 
Hadoop overview
Hadoop overviewHadoop overview
Hadoop overview
 
Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
Hadoop tools with Examples
Hadoop tools with ExamplesHadoop tools with Examples
Hadoop tools with Examples
 

Destaque

BIG DATA Online Training | Hadoop Online Training with Placement Assistance
BIG DATA Online Training | Hadoop Online Training with Placement Assistance BIG DATA Online Training | Hadoop Online Training with Placement Assistance
BIG DATA Online Training | Hadoop Online Training with Placement Assistance
Computer Trainings Online
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
Milind Bhandarkar
 

Destaque (20)

Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Tennis
TennisTennis
Tennis
 
Introduction to Big data & Hadoop -I
Introduction to Big data & Hadoop -IIntroduction to Big data & Hadoop -I
Introduction to Big data & Hadoop -I
 
Startup Recruiting Trends
Startup Recruiting TrendsStartup Recruiting Trends
Startup Recruiting Trends
 
BIG DATA Online Training | Hadoop Online Training with Placement Assistance
BIG DATA Online Training | Hadoop Online Training with Placement Assistance BIG DATA Online Training | Hadoop Online Training with Placement Assistance
BIG DATA Online Training | Hadoop Online Training with Placement Assistance
 
Using R with Hadoop
Using R with HadoopUsing R with Hadoop
Using R with Hadoop
 
Building a Business on Hadoop, HBase, and Open Source Distributed Computing
Building a Business on Hadoop, HBase, and Open Source Distributed ComputingBuilding a Business on Hadoop, HBase, and Open Source Distributed Computing
Building a Business on Hadoop, HBase, and Open Source Distributed Computing
 
Tennis presentation slide FINAL
Tennis presentation slide  FINALTennis presentation slide  FINAL
Tennis presentation slide FINAL
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop Tutorial
 
Connecting Hadoop and Oracle
Connecting Hadoop and OracleConnecting Hadoop and Oracle
Connecting Hadoop and Oracle
 
Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)
 
Big Data Career Path | Big Data Learning Path | Hadoop Tutorial | Edureka
Big Data Career Path | Big Data Learning Path | Hadoop Tutorial | EdurekaBig Data Career Path | Big Data Learning Path | Hadoop Tutorial | Edureka
Big Data Career Path | Big Data Learning Path | Hadoop Tutorial | Edureka
 
Hadoop training by keylabs
Hadoop training by keylabsHadoop training by keylabs
Hadoop training by keylabs
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
 
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiHadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
 
Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data Applications
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 

Semelhante a An Introduction to the World of Hadoop

Learning How to Learn Hadoop
Learning How to Learn HadoopLearning How to Learn Hadoop
Learning How to Learn Hadoop
Silicon Halton
 
Cred_hadoop_presenatation
Cred_hadoop_presenatationCred_hadoop_presenatation
Cred_hadoop_presenatation
Ashish Saraf
 
3.introduction to map reduce
3.introduction to map reduce3.introduction to map reduce
3.introduction to map reduce
databloginfo
 

Semelhante a An Introduction to the World of Hadoop (20)

Big Data and Hadoop with MapReduce Paradigms
Big Data and Hadoop with MapReduce ParadigmsBig Data and Hadoop with MapReduce Paradigms
Big Data and Hadoop with MapReduce Paradigms
 
Hadoop - A Very Short Introduction
Hadoop - A Very Short IntroductionHadoop - A Very Short Introduction
Hadoop - A Very Short Introduction
 
Lecture 2 Hadoop.pptx
Lecture 2 Hadoop.pptxLecture 2 Hadoop.pptx
Lecture 2 Hadoop.pptx
 
62_Tazeen_Sayed_Hadoop_Ecosystem.pptx
62_Tazeen_Sayed_Hadoop_Ecosystem.pptx62_Tazeen_Sayed_Hadoop_Ecosystem.pptx
62_Tazeen_Sayed_Hadoop_Ecosystem.pptx
 
Hadoop map reduce
Hadoop map reduceHadoop map reduce
Hadoop map reduce
 
Hadoop programming
Hadoop programmingHadoop programming
Hadoop programming
 
Intro to BigData , Hadoop and Mapreduce
Intro to BigData , Hadoop and MapreduceIntro to BigData , Hadoop and Mapreduce
Intro to BigData , Hadoop and Mapreduce
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to Hadoop
 
Learning How to Learn Hadoop
Learning How to Learn HadoopLearning How to Learn Hadoop
Learning How to Learn Hadoop
 
Cred_hadoop_presenatation
Cred_hadoop_presenatationCred_hadoop_presenatation
Cred_hadoop_presenatation
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Hadoop Tutorial for Beginners
Hadoop Tutorial for BeginnersHadoop Tutorial for Beginners
Hadoop Tutorial for Beginners
 
Introduction to apache hadoop
Introduction to apache hadoopIntroduction to apache hadoop
Introduction to apache hadoop
 
Unit IV.pdf
Unit IV.pdfUnit IV.pdf
Unit IV.pdf
 
3.introduction to map reduce
3.introduction to map reduce3.introduction to map reduce
3.introduction to map reduce
 
Understanding hadoop
Understanding hadoopUnderstanding hadoop
Understanding hadoop
 
2.1-HADOOP.pdf
2.1-HADOOP.pdf2.1-HADOOP.pdf
2.1-HADOOP.pdf
 
[FOSS4G KOREA 2014]Hadoop 상에서 MapReduce를 이용한 Spatial Big Data 집계와 시스템 구축
[FOSS4G KOREA 2014]Hadoop 상에서 MapReduce를 이용한 Spatial Big Data 집계와 시스템 구축[FOSS4G KOREA 2014]Hadoop 상에서 MapReduce를 이용한 Spatial Big Data 집계와 시스템 구축
[FOSS4G KOREA 2014]Hadoop 상에서 MapReduce를 이용한 Spatial Big Data 집계와 시스템 구축
 
Hadoop and Graph Data Management: Challenges and Opportunities
Hadoop and Graph Data Management: Challenges and OpportunitiesHadoop and Graph Data Management: Challenges and Opportunities
Hadoop and Graph Data Management: Challenges and Opportunities
 
Mapreduce Hadop.pptx
Mapreduce Hadop.pptxMapreduce Hadop.pptx
Mapreduce Hadop.pptx
 

Último

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Último (20)

Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
HTML Injection Attacks: Impact and Mitigation Strategies

An Introduction to the World of Hadoop

  • 1. MapReduce / Hadoop for Scientific Data Mining Hadoop = Open Source MapReduce Wider World of Hadoop An Introduction to the World of Hadoop Applications to Scientific Data Mining Gordon Rios g.rios@4c.ucc.ie Cork Constraint Computation Centre (4C) University College Cork October 29, 2010 Gordon Rios Introduction to Hadoop
  • 2. Outline
      1 MapReduce / Hadoop for Scientific Data Mining: Objectives for the Talk; MapReduce as Simplified Parallel Computing; Thinking in Terms of MapReduce
      2 Hadoop is Open Source MapReduce: Basics of Hadoop; Hadoop Examples; Developing Production Systems with Hadoop
      3 Wider World of Hadoop: Ad Hoc Analysis with Hadoop; Further Reading
  • 4. Objectives
      At the end of this talk I want you to have ideas for how to apply MapReduce to your domains and confidence that Hadoop is a good way to do it...
      - Introduce thinking in terms of MapReduce and why it’s a good idea
      - Introduce Hadoop as an open source implementation of MapReduce
      - Present a detailed example of using the Hadoop streaming API for a scientific data mining task
      - Discuss higher level notions for performing ad hoc analysis and building systems with Hadoop
  • 9. MapReduce
      MapReduce is distributed computing where we take advantage of data locality to push the computation to the data...
      - Distributed computing: clusters of computers with local memory and disk (network intensive for big data)
      - Parallel computing: multiple CPUs processing over shared memory and filesystem
      If we can decompose the problem into independent map and reduce tasks we can achieve “easy” parallelism with MapReduce...
      1 Map works independently to convert input data to key value pairs...
      2 Reduce works independently on all values for a given key and transforms them to a single output set (possibly even just the ∅) per key...
      Now, let’s expand that a bit...
  • 15. Basic Elements of MapReduce
      MapReduce is a distributed sort with specific places to insert application logic...
      - an input reader: reads work data W from the file system [1] and produces a set of splits S: W → S
      - a Map function: (S) → (K, V)
      - a combiner function: a mapper optimization...
      - a partition function: partitions [2] keys k ∈ K to reducers: K → R
      - a compare function cmp(k_i, k_j): sorts the keys presented to each reducer
      - a Reduce function: reduces the output from all mappers for a particular key k to another set of values w_k for that key: (k, V) → (k, w_k)
      - an output writer: writes output to the file system
      [1] A distributed file system (DFS) for stability and scale
      [2] The default hashes keys modulo the number of reducers
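      The elements above can be sketched in a few lines of single-process Python. This is only an illustration of the data flow, not how Hadoop is implemented: the names `partition` and `run_mapreduce` are made up for this sketch, and CRC32 stands in for whatever hash function a real framework uses.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    # Stand-in for "hash the key modulo the number of reducers".
    # CRC32 is an assumption chosen because it is deterministic across runs.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_mapreduce(splits, map_fn, reduce_fn, num_reducers=2):
    # Map phase: every split is processed independently of the others.
    buckets = defaultdict(list)
    for split in splits:
        for key, value in map_fn(split):
            buckets[partition(key, num_reducers)].append((key, value))

    # Sort and reduce phase: each reducer sees its keys in sorted order,
    # with all values for one key grouped together.
    output = {}
    for reducer_id in range(num_reducers):
        grouped = defaultdict(list)
        for key, value in sorted(buckets[reducer_id]):
            grouped[key].append(value)
        for key in sorted(grouped):
            output[key] = reduce_fn(key, grouped[key])
    return output


def word_count_map(split):
    # Emit (word, 1) for every word in the split.
    for word in split.split():
        yield word.lower(), 1


def word_count_reduce(key, values):
    # Sum the ones emitted for this word.
    return sum(values)


counts = run_mapreduce(
    ["the quick fox", "the lazy dog", "The end"],
    word_count_map,
    word_count_reduce,
)
print(counts)
```

      Because every key always lands on the same reducer, the reduce step never needs to see data held by another reducer.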
  • 17. Examples of Map and Reduce
      Let’s start with a few examples of Map...
      - Word Count: read in a stream of text (e.g. a document or a set of documents) and emit each word as a key with a value of 1
      - Inverted Index: read in a stream of documents and emit each word as a key and the document ID as the value
      - Max Temperature: read in formatted data and emit year as a key with temperature as the value
      - Mean Rain Precipitation: read in daily data and emit (year-month, lat, long) as a key with precipitation as the value
      Reduce in these cases simply applies a count, list, max, or average, respectively, to the set of values for each key.
      [Dean, Ghemawat, 2008; Wikipedia, 2010; White, 2011]
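      Two of the map examples above might look like this in Python. It is a sketch under assumptions: the "year,temperature" record layout and the helper `reduce_by_key` are invented for illustration and are not taken from the talk or from Hadoop.

```python
from itertools import groupby
from operator import itemgetter


def inverted_index_map(doc_id, text):
    # Emit (word, document ID) for every word in the document.
    for word in text.lower().split():
        yield word, doc_id


def max_temperature_map(record):
    # Hypothetical record layout: "year,temperature".
    year, temp = record.split(",")
    yield year, int(temp)


def reduce_by_key(pairs, reduce_fn):
    # Sort by key, then apply reduce_fn to the list of values for each key.
    result = {}
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        result[key] = reduce_fn([value for _, value in group])
    return result


docs = {"d1": "hadoop runs mapreduce", "d2": "mapreduce sorts keys"}
index_pairs = [
    pair
    for doc_id, text in docs.items()
    for pair in inverted_index_map(doc_id, text)
]
# Reduce for the inverted index: the list of distinct documents per word.
index = reduce_by_key(index_pairs, lambda values: sorted(set(values)))

records = ["1949,111", "1950,22", "1949,78", "1950,-11"]
temp_pairs = [pair for rec in records for pair in max_temperature_map(rec)]
# Reduce for max temperature: the built-in max over each year's values.
max_temps = reduce_by_key(temp_pairs, max)

print(index)
print(max_temps)
```

      Only the map function and the function passed to `reduce_by_key` change between the two jobs; the sort and grouping in the middle stay the same.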
  • 18. Visualizing Word Count [figure; source: Chris Wensel, http://www.cascading.org]
  • 20. Engineering Intermezzo
      This is how easy it is to get Hadoop installed... given that you have Java 6 installed already...
      Get Hadoop: http://hadoop.apache.org/
          % tar xzf hadoop-x.y.z.tar.gz
          % export HADOOP_INSTALL=BUILD_DIR/hadoop-x.y.z
          % export PATH=$PATH:$HADOOP_INSTALL/bin
  • 21. MapReduce with Hadoop and the streaming library
      Now, let’s take a closer look at how Hadoop implements MapReduce, from [White, 2011]...
  • 22. Hadoop Streaming Library
      We’ll focus on the streaming library as it’s the most natural for scientific or technical computing... let’s look at the Definitive Guide’s weather example...
  • 24. Hadoop Book Examples
      More examples from Hadoop: The Definitive Guide, 2nd Edition (Hadoop 20.1), http://www.hadoopbook.com/... here’s how to install and try them for yourself...
      - Install Git: http://git-scm.com/
      - Visit GitHub for the book code: http://github.com/tomwhite/hadoop-book/
      - Check out the code examples from The Definitive Guide:
          % cd BUILD_DIR
          % git clone http://github.com/tomwhite/hadoop-book.git hadoop-book
  • 25. Example: ECA Mean Precipitation
      Let’s compute mean precipitation at over 2,000 weather stations and make some graphics. There are 2,186 files with a median of 21,875 lines each, a minimum of 1,025 and a maximum of 78,090.
      ECA Daily Data: “The ECA dataset contains series of daily observations at meteorological stations throughout Europe and the Mediterranean. Part of the dataset is freely available for non-commercial research. To download this public data select one of the options below. Note that a gridded version with daily temperature and precipitation fields is also available.” source: http://eca.knmi.nl/dailydata/index.php
      File Format
          FILE FORMAT (MISSING VALUE CODE = -9999):
          01-06 STAID : Station identifier
          08-13 SOUID : Source identifier
          15-22 DATE  : Date YYYYMMDD
          24-28 RR    : Precipitation amount in 0.1 mm
          30-34 Q_RR  : quality code for RR (0='valid'; 1='suspect'; 9='missing')
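      As a rough illustration of the file format above, one data line could be parsed as follows. The sketch assumes the five fields are comma-separated, as the mapper in this talk treats them; the function name `parse_rr_record` and the returned dictionary layout are invented here.

```python
MISSING = -9999  # missing value code from the file format description


def parse_rr_record(line):
    """Parse one data line into a dict, or return None if it is unusable."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 5:
        return None  # not a five-field data line
    staid, souid, date, rr, q_rr = fields
    if not (date.isdigit() and len(date) == 8):
        return None  # header line or malformed date
    try:
        rr_tenths = int(rr)
    except ValueError:
        return None
    if q_rr != "0" or rr_tenths == MISSING:
        return None  # keep only valid, non-missing observations
    # RR is stored in units of 0.1 mm, so divide by ten to get millimetres.
    return {"staid": staid, "date": date, "precip_mm": rr_tenths / 10.0}


print(parse_rr_record("     1, 35381,20090101,   12,    0"))
print(parse_rr_record("     1, 35381,20090102,-9999,    9"))
```

      Filtering on the quality code and the missing value code in one place keeps suspect readings out of the monthly means.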
  • 26. Example: ECA Mean Precipitation
      Scientific Data Mining: use the Hadoop streaming library and manually pipeline MapReduce jobs together as needed...
      - Write the Hadoop scripts in Python, in two steps
      - Test with: cat data | map.py | sort | reduce.py > output (not shown)
      - Process the data into individual files for each time period (year/month) of interest using the Hadoop streaming library (local mode)
      - Call R in batch mode to produce image files
  • 27. ECA Mean Precipitation: Step One
      map_one.py
          import sys

          # BEGIN_DATE and END_DATE (YYYYMMDD strings) and the latlons lookup
          # (station id -> (lat, lon)) are defined elsewhere in the script;
          # they are not shown on the slide.

          def lat_lon_to_coord(s):
              # convert "D:M:S" text to signed decimal degrees
              d, m, sec = map(lambda x: float(x), s.split(":"))
              # take the sign from the text so that "-0:30:00" stays negative
              sign = -1 if s.strip().startswith("-") else 1
              x = abs(d) + m / 60.0 + sec / 3600.0
              return float(sign * x)

          for line in sys.stdin:
              # flds = (staid, souid, date, rr, q_rr)
              flds = line.strip().split(",")
              if len(flds) != 5:
                  continue
              staid = flds[0].strip()  # station id
              date = flds[2].strip()   # YYYYMMDD
              if date < BEGIN_DATE or date > END_DATE:
                  continue
              rr = flds[3].strip()     # precipitation in 0.1 mm
              q_rr = flds[4].strip()   # quality code "0" = valid
              lat, lon = latlons.get(staid, (None, None))
              if q_rr == '0' and (lat is not None) and (lon is not None):
                  # key = "yyyymm,lat,lon", value = precipitation
                  print("%s,%.4f,%.4f\t%s" % (date[0:6], lat, lon, rr))
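      The slide does not show how the `latlons` lookup is built. One way it might be done is sketched below, assuming a station list with lines of the form `STAID,STANAME,CN,LAT,LON,HGHT` and coordinates written as signed degrees:minutes:seconds. That layout, the sample rows, and the names `dms_to_degrees` and `load_latlons` are assumptions for illustration, not part of the original script.

```python
def dms_to_degrees(text):
    # Convert "+DD:MM:SS" or "-DDD:MM:SS" to signed decimal degrees.
    text = text.strip()
    sign = -1.0 if text.startswith("-") else 1.0
    d, m, s = (float(part) for part in text.lstrip("+-").split(":"))
    return sign * (d + m / 60.0 + s / 3600.0)


def load_latlons(lines):
    # Build {station id: (lat, lon)} and skip headers or malformed lines.
    latlons = {}
    for line in lines:
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 6 or not fields[0].isdigit():
            continue
        try:
            lat = dms_to_degrees(fields[3])
            lon = dms_to_degrees(fields[4])
        except ValueError:
            continue  # coordinate was not in D:M:S form
        latlons[fields[0]] = (lat, lon)
    return latlons


# Made-up sample rows in the assumed layout.
sample = [
    "STAID,STANAME,CN,LAT,LON,HGHT",
    "1,EXAMPLE A,XX,+56:52:00,+014:48:00,166",
    "2,EXAMPLE B,XX,+41:07:00,-000:30:00,10",
]
latlons = load_latlons(sample)
print(latlons)
```

      Reading the sign from the text rather than from the parsed degrees matters for stations just west of Greenwich, where the degrees field is zero.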
  • 28. ECA Mean Precipitation: Step One (cont)
      reduce_one.py
          import sys

          (last_key, x, n) = (None, 0.0, 0)
          for line in sys.stdin:
              (key, val) = line.strip().split("\t")
              if last_key and last_key != key:
                  # time to emit reduced value
                  if n > 0:
                      print("%s\t%.2f" % (last_key, x / n))
                  x = 0.0
                  n = 0
              # accumulate the running sum and count for the current key
              (last_key, x, n) = (key, x + float(val), n + 1)
          if last_key:
              # emit the mean for the final key
              if n > 0:
                  print("%s\t%.2f" % (last_key, x / n))
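      The same reducer logic can also be expressed with `itertools.groupby`, which relies on the same guarantee that streaming input arrives sorted by key. This is an alternative sketch rather than the script used in the talk; the function name `mean_by_key` and the sample lines are invented here.

```python
from itertools import groupby


def mean_by_key(lines):
    """Yield 'key<TAB>mean' for each run of equal keys in sorted input."""
    pairs = (
        line.rstrip("\n").split("\t", 1) for line in lines if line.strip()
    )
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        values = [float(value) for _, value in group]
        yield "%s\t%.2f" % (key, sum(values) / len(values))


# Sample mapper output, already sorted by key as the framework would deliver it.
sorted_input = [
    "200901,41.1187,1.2453\t10\n",
    "200901,41.1187,1.2453\t30\n",
    "200902,41.1187,1.2453\t5\n",
]
for out in mean_by_key(sorted_input):
    print(out)
```

      Unlike the running sum on the slide, this version holds all values for one key in memory at once, which is fine for a month of readings at a single station but is worth keeping in mind for heavier keys.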
  • 29. ECA Mean Precipitation: Step Two
      Map: ((yyyymm,lat,lon), mean) -> (yyyymm, (lat,lon,mean))
      map_two.py
          import sys

          for line in sys.stdin:
              yyyymm_lat_lon, mean_precip = line.strip().split("\t")
              yyyymm, lat, lon = yyyymm_lat_lon.strip().split(",")
              # new key = yyyymm, new value = "lat lon mean"
              print("%s\t%s %s %s" % (yyyymm, lat, lon, mean_precip))
      Empty reduce: just write to a local file (a hack, since we’re running locally)
      reduce_two.py
          import sys

          # write_file(key, values) is a helper that is not shown on the slide

          last_key = None
          values = []
          for line in sys.stdin:
              (key, val) = line.strip().split("\t")
              if last_key and last_key != key:
                  # time to emit reduced value
                  write_file(last_key, values)
                  values = []
              last_key = key
              values.append(val)  # val is a string with three values
          if last_key:
              write_file(last_key, values)
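      The `write_file` helper called by reduce_two.py is not shown in the slides. A minimal guess at what it could look like follows, assuming one output file per key named `<yyyymm>.dat` with one "lat lon mean" line per value, which matches the file names and three columns read by the R script later in the talk. The `out_dir` parameter and the demo values are additions for illustration.

```python
import os
import tempfile


def write_file(key, values, out_dir="."):
    # Hypothetical helper: write one "lat lon mean" line per value
    # to <out_dir>/<key>.dat and return the path that was written.
    path = os.path.join(out_dir, "%s.dat" % key)
    with open(path, "w") as handle:
        for value in values:
            handle.write(value + "\n")
    return path


# Demo in a temporary directory so nothing is left in the working directory.
demo_dir = tempfile.mkdtemp()
demo_path = write_file(
    "200901",
    ["41.1187 1.2453 20.00", "48.8900 2.3508 35.50"],
    demo_dir,
)
with open(demo_path) as handle:
    demo_lines = handle.read().splitlines()
print(demo_path)
print(demo_lines)
```

      Writing from inside a reducer only works here because the job runs in local mode; on a cluster the files would end up on whichever node ran the reducer.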
  • 31. Example: ECA Mean Precipitation
      Step One: input -> (yyyymm,lat,lon), mean precip
          % hadoop jar /Users/gordon/build/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
              -input /Downloads/tarragona/ECA_blend_all_data/RR_STAID00* \
              -output output \
              -mapper /Desktop/tmp/tarragona/python/map_one.py \
              -reducer /Desktop/tmp/tarragona/python/reduce_one.py
          % 10/10/25 19:34:31 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
          ...
          % 10/10/25 20:07:36 INFO streaming.StreamJob: Job complete: job_local_0001
          % 10/10/25 20:07:36 INFO streaming.StreamJob: Output: output
      Step Two: (date,lat,lon), mean precip -> files(yyyymm)
          % hadoop jar /Users/gordon/build/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
              -input output/part-00000 \
              -output output_two \
              -mapper /Desktop/tmp/tarragona/python/map_two.py \
              -reducer /Desktop/tmp/tarragona/python/reduce_two.py
          % 10/10/25 20:41:43 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
          ...
          % 10/10/25 20:41:47 INFO streaming.StreamJob: Job complete: job_local_0001
          % 10/10/25 20:41:47 INFO streaming.StreamJob: Output: output_two
  • 33. Batch Processing in R
      And, after a little batch processing with R...
      batch-graphics.R
          library(fields)
          files <- c("200901.dat", "200902.dat", "200903.dat", "200904.dat",
                     "200905.dat", "200906.dat", "200907.dat", "200908.dat",
                     "200909.dat", "200910.dat", "200911.dat", "200912.dat")
          i <- 1
          for (f in files) {
            mat <- read.table(f)
            names(mat) <- c("lat", "long", "precip")
            png(filename=paste("precip-", i, ".png", sep=""), height=480, width=480)
            quilt.plot(mat$long, mat$lat, mat$precip, ncol=100, nrow=100,
                       ylim=c(22.0, 79.0), xlim=c(-52.0, 72.0),
                       col=two.colors(256, start="wheat", end="darkblue", middle="blue"),
                       zlim=c(0, 410), add.legend=T, cex.lab=0.6)
            points(1.2453, 41.1187, pch=1)
            text(1.2453, 41.1187, "tarragona", cex=0.8, pos=1)
            points(2.35083, 48.89, pch=1)
            text(2.3508, 48.89, "paris", cex=0.8, pos=4)
            points(12.4823, 41.8955, pch=1)
            text(12.4823, 41.8955, "rome", cex=0.8, pos=4)
            dev.off()
            i <- i + 1
          }
  • 34-45. ECA Precipitation 2009, Month: 1 through Month: 12 [figures: one precipitation map per month, one slide each]
  • 46. Summary of What We Did
      We worked through a complete example, but that’s not all, since with very little additional work we can...
      - Test the scripts in pseudo-distributed mode locally on our own machine
      - Run the job on a compute cluster remotely
      - Run the job in the cloud with EC2, treating their system as just another remote cluster
      - Run the job with Amazon’s Elastic MapReduce (http://aws.amazon.com/elasticmapreduce/), which allows you to pay for exactly as much computing as you use
      See [White, 2011] for complete details on how to run in these different modes...
  • 48. Systems Development APIs
      And, you can build production systems with Hadoop in either Java or C++...
      - Full featured Java API for Hadoop
      - Pipes is the C++ API for Hadoop MapReduce
      - Cascading is an API for developing general data processing systems that incorporate MapReduce (http://www.cascading.org)
  • 51. Cascading
      Cascading allows developers to model very complex data flows at a higher level than Map & Reduce and then automatically generate and visualize the dozens or perhaps even hundreds of necessary Hadoop jobs as a graph.
  • 53. Ad Hoc Analysis
      What’s missing? Sometimes you need to run fast ad hoc queries. Can we do that in a scalable way?
      - Pig: “Pig is a scripting language for exploring large datasets” [White, 2011] (Yahoo!)
      - Hive: provides an SQL interface for running ad hoc queries and other data processing tasks, aimed at SQL analysts (Facebook)
      - HBase: column-oriented database along the lines of Google’s Bigtable database (Powerset)
      - Hypertable: GPL clone of Google’s Bigtable database, written in C++ (Zvents)
      Google’s Bigtable database is described in [Chang, Dean, Ghemawat, et al., 2008]
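      To show why a query layer is attractive, here is a minimal sketch of a simple ad hoc aggregation written by hand as a map step and a reduce step in plain Python. The question (average reading per station), the sample rows, and the names `ROWS`, `map_row`, `reduce_station` and `avg_by_station` are all invented for illustration. In an SQL interface such as Hive's, the same question would be a single statement along the lines of `SELECT station, AVG(temp) FROM readings GROUP BY station` (table and column names are hypothetical). The sketch makes no claim about how Hive or Pig actually plan such a query.

```python
from collections import defaultdict

# Hypothetical sample data: (station, temperature) rows.
ROWS = [
    ("cork", 11.0),
    ("dublin", 9.0),
    ("cork", 13.0),
    ("galway", 10.0),
    ("dublin", 10.0),
]


def map_row(row):
    """Map: emit the grouping key with a partial (sum, count) pair."""
    station, temp = row
    yield station, (temp, 1)


def reduce_station(station, partials):
    """Reduce: combine the partial pairs for one key into an average."""
    total = sum(t for t, _ in partials)
    count = sum(n for _, n in partials)
    return station, total / count


def avg_by_station(rows):
    """Run map, group by key (the shuffle), then reduce each group."""
    groups = defaultdict(list)
    for row in rows:
        for key, value in map_row(row):
            groups[key].append(value)
    return dict(reduce_station(k, v) for k, v in sorted(groups.items()))


averages = avg_by_station(ROWS)
print(averages)
```

      Even this small question needs a mapper, a grouping step and a reducer when written by hand, which is the gap that tools such as Pig and Hive aim to close.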
  • 58. Interesting Application Frameworks with Hadoop
      Here are a few examples of frameworks, in development or already available, that use Hadoop as a platform:
      - Apache Mahout: ambitious project to implement popular machine learning algorithms and recommenders with Hadoop (http://mahout.apache.org/)
      - Graph: Jake Hofman from Yahoo Research has released some of his work on large-scale network analysis with Hadoop, with prototype code (http://github.com/jhofman/icwsm2010_tutorial). Also see [Vassilvitskii, 2010] for related graph analysis research.
      - Application to GIS: Nathan Kerr’s M.S. thesis, with lots of details on how to do GIS with Hadoop (http://www.nathankerr.com/projects/parallel-gis-processing/alternative_approaches_to_parallel_gis_processing.html)
  • 60. Further Reading
      - White, T. Hadoop: The Definitive Guide, 2nd Edition. O’Reilly Media, Inc., Sebastopol, CA, 2011.
      - Sanderson, D. Programming Google App Engine. O’Reilly Media, Inc., Sebastopol, CA, 2009.
      - Murty, J. Programming Amazon Web Services. O’Reilly Media, Inc., Sebastopol, CA, 2008.
      - Dean, J. and Ghemawat, S. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
      - Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. Bigtable: a distributed storage system for structured data. OSDI ’06: Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation, USENIX Assoc., Berkeley, CA, 2006.
      - MapReduce on Wikipedia. http://en.wikipedia.org/wiki/MapReduce
      - Vassilvitskii, S. XXL Graph Algorithms. Hadoop Summit 2010. http://developer.yahoo.com/events/hadoopsummit2010/