

       Thinking at Scale with Hadoop

       Ken Krugler
       Bixo Labs & Scale Unlimited

         Copyright (c) 2008 Scale Unlimited, Inc. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.

Friday, September 24, 2010




       Introduction - Ken Krugler

              Founder, Bixo Labs - http://bixolabs.com
                      Large scale web mining workflows, running in Amazon’s cloud
              Instructor, Scale Unlimited - http://scaleunlimited.com
                      Teaching Hadoop Boot Camp
              Founder, Krugle (2005 - 2009)
                      Code search startup
                      Parse/index 2.6 billion lines of open source code







        Goals for Talk


             Intro to Map-Reduce
             Overview of Hadoop
             Challenges with Map-Reduce
             Solutions to Complexity





       Introduction to Map-Reduce





       Definitions


           Map -> The ‘map’ function in the MapReduce algorithm, user defined
           Reduce -> The ‘reduce’ function in the MapReduce algorithm, user defined
           Group -> The ‘magic’ that happens between Map and Reduce
           Key Value Pair -> two units of data, exchanged between Map & Reduce








       How MapReduce Works

              Map translates input keys and values to new keys and values
                      [K1,V1] -> Map -> [K2,V2]

              The system Groups each unique key with all its values
                      [K2,V2] -> Group -> [K2,{V2,V2,...}]

              Reduce translates the values of each unique key to new keys and values
                      [K2,{V2,V2,...}] -> Reduce -> [K3,V3]
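The Group step above can be sketched as a tiny in-memory simulation in plain Java. This is not the Hadoop API; `MapReduceSketch` and its methods are illustrative names only, modeling just the contract of the grouping "magic":

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MapReduceSketch {

    // Group: the "magic" between Map and Reduce - gather every value
    // emitted for a unique key K2 into one list [K2, {V2, V2, ...}]
    public static <K, V> Map<K, List<V>> group(List<Map.Entry<K, V>> mapOutput) {
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K, V> kv : mapOutput) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                   .add(kv.getValue());
        }
        return grouped;
    }

    // Convenience for building [K,V] pairs
    public static <K, V> Map.Entry<K, V> pair(K key, V value) {
        return new SimpleEntry<>(key, value);
    }
}
```

In real Hadoop the grouping happens during the distributed shuffle/sort across the cluster; the single in-memory map above only models what the framework guarantees to the Reduce function.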






       All Together

              [Diagram: multiple Mappers feed the Shuffle, which feeds multiple Reducers;
               every Mapper's output is partitioned across all the Reducers]








       Canonical Example - Word Count

              Word Count
                      Read a document, parse out the words, count the frequency of each word


              Specifically, in MapReduce
                      With a document consisting of lines of text
                      Translate each line of text into [word, 1] key-value pairs, one per word
                      Sum the values (ones) for each unique word
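The word-count recipe above can be sketched end-to-end in plain Java. This is an in-memory stand-in for the Hadoop API; `WordCountSketch` is an illustrative name, not part of Hadoop:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {

    // Map: one line of text in, one [word, 1] pair out per word
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Reduce: sum the ones for a single unique word
    public static int reduce(List<Integer> ones) {
        int sum = 0;
        for (int one : ones) {
            sum += one;
        }
        return sum;
    }

    // Map -> Group -> Reduce over a whole document
    public static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {        // Map
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());                          // Group
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            counts.put(entry.getKey(), reduce(entry.getValue()));    // Reduce
        }
        return counts;
    }
}
```

Note that `map` and `reduce` are the only parts a Hadoop user writes; the grouping loop in the middle is what the framework does for you.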



              [K1,V1]           [1, When in the Course of human events, it becomes]
                                [2, necessary for one people to dissolve the political bands]
                                [3, which have connected them with another, and to assume]
                                [n, {k,k,k,k,...}]

                                        Map (user defined)

              [K2,V2]           [When,1] [in,1] [the,1] [Course,1] ... [k,1]

                                        Group

              [K2,{V2,V2,...}]  [When,{1,1,1,1}] [people,{1,1,1,1,1,1}] [dissolve,{1,1,1}]
                                [connected,{1,1,1,1,1}] ... [k,{v,v,...}]

                                        Reduce (user defined)

              [K3,V3]           [When,4] [people,6] [dissolve,3] [connected,5] ... [k,sum(v,v,v,...)]







       Divide & Conquer

              Because
                      The Map function only cares about the current key and value, and
                      The Reduce function only cares about the current key and its values
                      Mappers & Reducers can run Map/Reduce functions independently
              Then
                      A Mapper can invoke Map on an arbitrary number of input keys and values
                      A Reducer can invoke Reduce on an arbitrary number of the unique keys
                      Any number of Mappers & Reducers can be run on any number of nodes



       Overview of Hadoop





       Logical Architecture
              [Diagram: the Cluster provides an Execution layer on top of a Storage layer]

            Logically, Hadoop is simply a computing cluster that provides:
                    a Storage layer, and
                    an Execution layer






       Logical Architecture
              [Diagram: multiple Clients outside the Cluster submit jobs; jobs run on the
               Execution layer and read/write data on the Storage layer]

              Where users can submit jobs that:
                      Run on the Execution layer, and
                      Read/write data from the Storage layer





       Storage - HDFS
              [Diagram: HDFS spans geographies, racks, and nodes; each file is split into
               blocks that are replicated across nodes]

              Performance
                      Optimized for many concurrent sequential (streaming) reads of large files
                      But only a single writer (write-once, no append)
              Fidelity
                      Replication via block replication





       Execution
              [Diagram: map and reduce tasks are spread across nodes, racks, and geographies]

              Performance
                      Divide and Conquer
                              MapReduce
                      Data Affinity
                              Move code to data

              Fidelity
                      Applications must choose to fail on bad data, or ignore bad data
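The fail-vs-ignore choice might look like this inside a map function's record handling. A plain-Java sketch: `BadDataSketch`, `parseOrSkip`, and the static counter are hypothetical names, though Hadoop jobs commonly track skipped records with framework counters:

```java
import java.util.OptionalLong;

public class BadDataSketch {

    // Stand-in for a Hadoop counter tracking ignored records
    public static long badRecords = 0;

    // Each record is expected to hold a numeric field; the caller picks the policy
    public static OptionalLong parseOrSkip(String record, boolean failOnBadData) {
        try {
            return OptionalLong.of(Long.parseLong(record.trim()));
        } catch (NumberFormatException e) {
            if (failOnBadData) {
                throw e;      // fail: the task (and, after retries, the job) dies
            }
            badRecords++;     // ignore: count the bad record and keep processing
            return OptionalLong.empty();
        }
    }
}
```

Either policy can be right: failing fast surfaces upstream data bugs, while skip-and-count keeps a multi-hour job alive through a handful of corrupt records.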






       Architectural Components

              [Diagram: a Client submits jobs to the JobTracker, which hands tasks to
               TaskTrackers; each TaskTracker spawns mapper and reducer child JVMs.
               The NameNode (backed by a Secondary NameNode) serves namespace operations,
               while mappers and reducers read/write data blocks on the DataNodes]
              Solid boxes are unique applications
              Dashed boxes are child JVM instances on same node as parent
              Dotted boxes are blocks of managed files on same node as parent


       Challenges of Map-Reduce





       Complexity of keys/values
              Simple != Simplicity
              People think about processing records/objects
              The correct key definition changes during a workflow
              A “value” is typically a set of fields








       Complexity of chaining jobs
              Most real-world workflows require multiple Map-Reduce jobs
              Connecting data between jobs is tedious
              Keeping key/value definitions in sync is error-prone
              As an example...








       A trivial filter/count/sort workflow
    private void start(String inputPath, String outputPath, String filterUser)
        throws Exception {
        // counts the pages the users clicked and sorts by number of clicks

        // filter and count
        Path outputDir = new Path(outputPath);
        Path parent = outputDir.getParent();
        Path tmpOutput = new Path(parent, "" + System.currentTimeMillis());

        // Create the job configuration
        Configuration conf = new Configuration();
        conf.set(FilterMapper.USER_KEY, filterUser);

        Job filterCountJob = new Job(conf, "filter+count job");
        FileInputFormat.setInputPaths(filterCountJob, new Path(inputPath));
        FileOutputFormat.setOutputPath(filterCountJob, tmpOutput);
        filterCountJob.setMapperClass(FilterMapper.class);
        filterCountJob.setMapOutputKeyClass(Text.class);
        filterCountJob.setMapOutputValueClass(LongWritable.class);

        filterCountJob.setReducerClass(CounterReducer.class);
        filterCountJob.setOutputKeyClass(Text.class);
        filterCountJob.setOutputValueClass(LongWritable.class);

        filterCountJob.setInputFormatClass(TextInputFormat.class);
        filterCountJob.setOutputFormatClass(SequenceFileOutputFormat.class);

        // run job and block until job is done
        filterCountJob.waitForCompletion(false);

        // sort
        Job sortJob = new Job(conf, "sort job");
        FileInputFormat.setInputPaths(sortJob, tmpOutput);
        FileOutputFormat.setOutputPath(sortJob, outputDir);

        sortJob.setMapperClass(SortMapper.class);
        sortJob.setMapOutputKeyClass(UserAndCount.class);
        sortJob.setMapOutputValueClass(NullWritable.class);

        // sorts are only stable with one reducer, but you lose distribution
        sortJob.setNumReduceTasks(1);
        sortJob.setInputFormatClass(SequenceFileInputFormat.class);
        sortJob.setOutputFormatClass(TextOutputFormat.class);

        // run job and block until job is done
        sortJob.waitForCompletion(false);

        // Get rid of temp directory used to bridge between the two jobs
        FileSystem.get(conf).delete(tmpOutput, true);
    }








       Not seeing the forest...

              Too much low-level detail obscures intent
              Not seeing intent leads to bugs
              And makes optimization harder








       Hard to use MR optimizations

              MR => MR{0,1} => M{1,}R{0,1} => M{1,}R{0,1}M{0,}
              Combiners can reduce map output size
              Map-side joins where one of the sides is “small”
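Of these, the combiner is the easiest to picture: it is essentially a reduce run locally on each mapper's output before the shuffle, which is safe when the reduce is associative and commutative (like summing word counts). A plain-Java sketch of the effect on shuffle size — `CombinerSketch` is an illustrative name, not the Hadoop Combiner API:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombinerSketch {

    // Pre-sum [word, 1] pairs on the map side, so the shuffle moves one
    // [word, partialSum] pair per unique word instead of one pair per occurrence
    public static List<Map.Entry<String, Integer>> combine(
            List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> partialSums = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> kv : mapOutput) {
            partialSums.merge(kv.getKey(), kv.getValue(), Integer::sum);
        }
        return new ArrayList<>(partialSums.entrySet());
    }
}
```

In Hadoop you opt in with `job.setCombinerClass(...)`, typically reusing the reducer class when the operation is a pure sum or count.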








       Solutions to Complexity


              Cascading - Tuples, Pipes, and Operations
              FlumeJava - Java Classes and Deferred Execution
              Pig/Hive - “It’s just a big database”
              Datameer - “It’s just a big spreadsheet”





       Cascading





       Cascading


              An API for defining data processing workflows
              First public release Jan 2008
              Currently 1.1.1 and supports Hadoop 0.19.x through 0.20.2
              http://www.cascading.org








       Cascading


           Basic unit of data is “Tuple” - roughly equal to a record
           Implements all standard data processing operations
                   function, filter, group by, co-group, aggregator, etc
           Complex applications are built by chaining operations with pipes
           And are planned into the minimum number of MapReduce jobs








       Operations on Tuples


              Map and Reduce communicate through Keys and Values
              Our problems are really about elements of Keys and Values
                      e.g. “username”, not “line”
              The effect is more reusability of our ‘map’ and ‘reduce’ operations








       Mapping onto Hadoop


              Functions & Filters are map operations
              Aggregates & Buffers are reduce operations
              Built-in support for joining, multi-field sorting
              Concept of “Taps” with “Schemes” as input and output formats








       Loose Coupling

              [Diagram: Source -> func -> Group -> aggr -> Sink, with an additional func
               showing the other possible attachment points between operations and data]

              Above, all the possible relationships between operations and data
              If Fields are the interface between operations, we can build flexible apps





       Layered Over MapReduce

              [Diagram: two Sources, each passing through func operations, feed a CoGroup
               and aggr (planned as the first Map/Reduce job); a temp file bridges into a
               GroupBy, aggr, and func writing the Sink (the second Map/Reduce job)]





       Cascading Code

         Tap input = new Hfs( new TextLine( new Fields( "line" ) ), inputPath );
         Tap output = new Hfs( new TextLine( 1 ), outputPath, SinkMode.REPLACE );

         Pipe pipe = new Pipe( "example" );

          RegexSplitGenerator function = new RegexSplitGenerator( new Fields( "word" ), "\\s+" );
         pipe = new Each( pipe, new Fields( "line" ), function );

         pipe = new GroupBy( pipe, new Fields( "word" ) );

         pipe = new Every( pipe, new Fields( "word" ), new Count( new Fields( "count" ) ) );
         pipe = new Every( pipe, new Fields( "word" ), new Average( new Fields( "avg" ) ) );

         pipe = new GroupBy( pipe, new Fields( "count", -1 ) );

          Properties properties = new Properties();
          FlowConnector.setApplicationJarClass( properties, Main.class );
          Flow flow = new FlowConnector( properties ).connect( "example flow", input, output, pipe );
         flow.complete();




         [Flow plan diagram, from [head] to [tail]: records with fields 'ip', 'time', 'method', 'event', 'status', 'size' are read from Hfs output/logs/; one branch applies DateParser to produce a 'ts' field, groups by it (GroupBy 'tsCount'), and counts (Every/Count); a parallel branch applies an ExpressionFunction to produce a 'tm' field, groups by it (GroupBy 'tmCount'), and counts; results are written to TextLine sinks output/arrivalrate/sec and output/arrivalrate/min]
33
         [Flow plan diagram, from [head] to [tail]: a larger multi-source flow in which several Dfs TextLine sources are split with RegexSplitter functions, staged through TempHfs SequenceFile intermediates, combined by a chain of CoGroup joins with Identity projections, then grouped (GroupBy) and aggregated (Every/Aggregator) before being written to Dfs TextLine sinks]
34

                                                                                                                        FlumeJava




35




       FlumeJava


              Recent paper (June 2010) by Googlers (Chambers et al.)
              Internal project to make it easier to write Map-Reduce jobs
              Equivalent open source project just started
                      Plume - http://github.com/tdunning/Plume




36




       Java Classes, Deferred Exec

              Java-centric view of map-reduce processing
              Handful of classes to represent immutable, parallel collections
                      PCollection, PTable
              Small number of primitive operations on classes
                      parallelDo, groupByKey, combineValues, flatten
              Deferred execution allows construction/optimization of workflow
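The deferred-execution idea can be sketched in a few lines of plain Java. This is a hypothetical illustration, not FlumeJava's actual API: here parallelDo() merely records an operation, and nothing executes until run() is called, which is what lets a planner see and optimize the whole pipeline before running it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch (class and method names are illustrative, not
// FlumeJava's API): operations are recorded, not executed, until run().
class DeferredCollection<T> {
    private final List<?> source;
    private final List<Function<List<?>, List<?>>> pendingOps = new ArrayList<>();

    DeferredCollection(List<T> source) { this.source = source; }

    private DeferredCollection(List<?> source, List<Function<List<?>, List<?>>> ops) {
        this.source = source;
        this.pendingOps.addAll(ops);
    }

    // Record a per-element transform; nothing executes yet.
    @SuppressWarnings("unchecked")
    <R> DeferredCollection<R> parallelDo(Function<T, R> fn) {
        DeferredCollection<R> next = new DeferredCollection<>(source, pendingOps);
        next.pendingOps.add(in -> {
            List<Object> out = new ArrayList<>();
            for (Object item : in) out.add(fn.apply((T) item));
            return out;
        });
        return next;
    }

    // Deferred execution: the recorded pipeline runs only when asked.
    @SuppressWarnings("unchecked")
    List<T> run() {
        List<?> current = source;
        for (Function<List<?>, List<?>> op : pendingOps) current = op.apply(current);
        return (List<T>) current;
    }
}
```

Because each parallelDo() returns a new deferred collection carrying the accumulated operation list, a real implementation could fuse adjacent operations into a single MapReduce pass at run() time.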




37




       FlumeJava Code


         PCollection<String> lines = readTextFileCollection("/gfs/data/shakes/hamlet.txt");

         // Emit one output element per word found in each input line.
         PCollection<String> words = lines.parallelDo(new DoFn<String,String>() {
            public void process(String line, EmitFn<String> emitFn) {
               for (String word : splitIntoWords(line)) {
                  emitFn.emit(word);
               }
            }
         }, collectionOf(strings()));




38




       Results of Use at Google


                                                                    [Chart: "Lines of code" vs. "Number of Map-Reduce Jobs"]
39

                                                                                                                         Pig & Hive




40




       Pig

              Pig is a data flow ‘programming environment’ for processing large files
              PigLatin is the text language it provides for defining data flows
              Developed by researchers at Yahoo! and widely used internally
              A sub-project of Hadoop
              APL licensed
              http://hadoop.apache.org/pig/
              First incubator release v0.1.0 in September 2008



41




       Pig

              Provides two kinds of optimizations:
                      Physical - caching and stream optimizations at the MapReduce level
                      Algebraic - mathematical reductions at the syntax level
              Allows for User Defined Functions (UDF)
                      must register jar of functions via register path/some.jar
                      can be called directly from PigLatin
                              B = foreach A generate myfunc.MyEvalFunc($0);




42




       PigLatin - Example

                W = LOAD 'filename' AS (url, outlink);
                G = GROUP W by url;
                R = FOREACH G {
                        FW = FILTER W BY outlink eq 'www.apache.org';
                        PW = FW.outlink;
                        DW = DISTINCT PW;
                        GENERATE group, COUNT(DW);
                }
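As a sanity check on what this script computes, here is a rough in-memory Java equivalent (the Link class and sample values are illustrative, not Pig API): for each url, count the distinct outlinks that survive the filter.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Minimal record type standing in for a (url, outlink) tuple.
class Link {
    final String url;
    final String outlink;
    Link(String url, String outlink) { this.url = url; this.outlink = outlink; }
}

class LinkCounts {
    static Map<String, Long> run(List<Link> links) {
        return links.stream().collect(Collectors.groupingBy(
            l -> l.url,                                           // GROUP W BY url
            Collectors.collectingAndThen(Collectors.toList(), group ->
                group.stream()
                     .filter(l -> l.outlink.equals("www.apache.org")) // FILTER FW BY outlink eq '...'
                     .map(l -> l.outlink)                             // PW = FW.outlink
                     .distinct()                                      // DW = DISTINCT PW
                     .count())));                                     // COUNT(DW)
    }
}
```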




43




       Hive


              Hive is a data warehouse built on top of flat files
              Developed by Facebook
              Available as a Hadoop contribution in v0.19.0
              http://wiki.apache.org/hadoop/Hive




44




       Hive

              Data Organization into Tables with logical and hash partitioning
              A Metastore to store metadata about Tables/Partitions/Buckets
                      Partitions are ‘how’ the data is stored (collections of related rows)
                      Buckets further divide Partitions based on a hash of a column
                      metadata is actually stored in an RDBMS
              A SQL-like query language over object data stored in Tables
              Custom types, structs of primitive types
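The bucketing idea boils down to a simple hash assignment. This sketch illustrates the concept only; the hash function is an assumption for demonstration, not Hive's actual implementation:

```java
// Hive-style bucketing sketch: a row lands in one of N buckets based on
// a hash of the bucketing column, so sampling and bucketed joins can
// read only the buckets they need. The hash choice here is an example.
class Bucketer {
    static int bucketFor(String columnValue, int numBuckets) {
        // floorMod keeps the bucket index non-negative even if hashCode() is negative
        return Math.floorMod(columnValue.hashCode(), numBuckets);
    }
}
```

Because the assignment is deterministic, two tables bucketed the same way on the same column can be joined bucket-by-bucket.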



45




       Hive - Queries

              Queries always write results into a table (or an output directory)

     FROM user
     INSERT OVERWRITE TABLE user_active
     SELECT user.*
     WHERE user.active = true;

     FROM pv_users
     INSERT OVERWRITE TABLE pv_gender_sum
     SELECT pv_users.gender, count(DISTINCT pv_users.userid)
     GROUP BY pv_users.gender
     INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum.txt'
     SELECT pv_users.age, count(DISTINCT pv_users.userid)
     GROUP BY pv_users.age;
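To make the second query's semantics concrete, here is an in-memory Java sketch of "count distinct user ids per gender" (the PvUser class and sample data are illustrative, not Hive's representation):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Minimal stand-in for a row of the pv_users table.
class PvUser {
    final String gender;
    final int userId;
    PvUser(String gender, int userId) { this.gender = gender; this.userId = userId; }
}

class GenderSum {
    static Map<String, Long> countDistinctByGender(List<PvUser> rows) {
        return rows.stream().collect(Collectors.groupingBy(
            u -> u.gender,                                             // GROUP BY gender
            Collectors.collectingAndThen(
                Collectors.mapping(u -> u.userId, Collectors.toSet()), // DISTINCT userid
                ids -> (long) ids.size())));                           // count(...)
    }
}
```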
46

                                                                                                                         Datameer Analytics Solution (DAS)




47




       DAS - Big Spreadsheets


              Commercial solution from Datameer
              Similar to IBM BigSheets
              GUI + workflow editor on top of Hadoop
              Define data sources, operations (via spreadsheet), and data sinks
              http://www.datameer.com




48





Summary

photo by: Tambako the Jaguar, flickr

Ken Krugler
Bixo Labs & Scale Unlimited


Thinking at Scale with Hadoop

- Understand the underlying Map-Reduce architecture
- Model the problem as a workflow with operations on records
- Consider using Cascading or (eventually) Plume
- Use Pig/Hive to solve query-centric problems
- Look at DAS or BigSheets for easier analytics solutions
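The first two points above, Map-Reduce plus operations on records, can be seen in the word-count example from earlier in the talk. Here is that [K1,V1] → Map → Group → Reduce flow as a toy sketch in pure Python; it illustrates the model only and is not Hadoop API code:

```python
# Toy word count following the talk's [K1,V1] -> Map -> Group -> Reduce flow.
# Pure Python, no Hadoop; illustrates the programming model only.
from collections import defaultdict

def map_fn(line):
    """Map: translate a line of text into (word, 1) key/value pairs."""
    for word in line.split():
        yield (word, 1)

def group(pairs):
    """Group: collect all values for each unique key (the 'magic' step)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_fn(key, values):
    """Reduce: sum the ones for each unique word."""
    return (key, sum(values))

lines = ["when in the course", "when in doubt"]
pairs = (kv for line in lines for kv in map_fn(line))
counts = dict(reduce_fn(k, vs) for k, vs in group(pairs).items())
# counts == {"when": 2, "in": 2, "the": 1, "course": 1, "doubt": 1}
```

Because `map_fn` only sees one line and `reduce_fn` only sees one key's values, each can run independently on many nodes, which is exactly the divide-and-conquer property Hadoop exploits.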








Resources

- Hadoop mailing lists - many, e.g. common-user@hadoop.apache.org
- Hadoop: The Definitive Guide by Tom White (O’Reilly)
- Hadoop Training
  - Scale Unlimited: http://scaleunlimited.com/courses
  - Cloudera: http://www.cloudera.com/hadoop-training/
- Cascading: http://www.cascading.org
- DAS: http://www.datameer.com







       Questions?





Thinking at scale with hadoop

  • 1. 1 Thinking at Scale with Hadoop Ken Krugler photo by: i_pinz, flickr Bixo Labs & Scale Unlimited Copyright (c) 2008 Scale Unlimited, Inc. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. Friday, September 24, 2010
  • 2. 2 Introduction - Ken Krugler Founder, Bixo Labs - http://bixolabs.com Large scale web mining workflows, running in Amazon’s cloud Instructor, Scale Unlimited - http://scaleunlimited.com Teaching Hadoop Boot Camp Founder, Krugle (2005 - 2009) Code search startup Parse/index 2.6 billion lines of open source code Copyright (c) 2008 Scale Unlimited, Inc. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. Friday, September 24, 2010
  • 3. 3 Goals for Talk Intro to Map-Reduce Overview of Hadoop Challenges with Map-Reduce Solutions to Complexity Friday, September 24, 2010
  • 4. 4 Introduction to Map-Reduce Ken Krugler photo by: exfordy, flickr Bixo Labs & Scale Unlimited Copyright (c) 2008 Scale Unlimited, Inc. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. Friday, September 24, 2010
  • 5. 5 Definitions Map -> The ‘map’ function in the MapReduce algorithm, user defined Reduce -> The ‘reduce’ function in the MapReduce algorithm, user defined Group -> The ‘magic’ that happens between Map and Reduce Key Value Pair -> two units of data, exchanged between Map & Reduce Copyright (c) 2008 Scale Unlimited, Inc. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. Friday, September 24, 2010
  • 6. 6 How MapReduce Works Map translates input to keys and values to new keys and values [K1,V1] Map [K2,V2] System Groups each unique key with all its values [K2,V2] Group [K2,{V2,V2,....}] Reduce translates the values of each unique key to new keys and values [K2,{V2,V2,....}] Reduce [K3,V3] Copyright (c) 2008 Scale Unlimited, Inc. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. Friday, September 24, 2010
  • 7. 7 All Together Mapper Mapper Shuffle Reducer Mapper Shuffle Reducer Mapper Shuffle Reducer Mapper Copyright (c) 2008 Scale Unlimited, Inc. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. Friday, September 24, 2010
  • 8. 8 Canonical Example - Word Count Word Count Read a document, parse out the words, count the frequency of each word Specifically, in MapReduce With a document consisting of lines of text Translate each line of text into key = word and value = 1 Sum the values (ones) for each unique word Copyright (c) 2008 Scale Unlimited, Inc. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. Friday, September 24, 2010
  • 9. 9 [1, When in the Course of human events, it becomes] [2, necessary for one people to dissolve the political bands] [3, which have connected them with another, and to assume] [K1,V1] [n, {k,k,k,k,...}] Map [K2,V2] [When,1] [in,1] [the,1] [Course,1] [k,1] User Group Defined [K2,{V2,V2,....}] [When,{1,1,1,1}] [people,{1,1,1,1,1,1}] [dissolve,{1,1,1}] [connected,{1,1,1,1,1}] [k,{v,v,...}] Reduce [When,4] [peope,6] [dissolve,3] [K3,V3] [connected,5] [k,sum(v,v,v,..)] Copyright (c) 2008 Scale Unlimited, Inc. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. Friday, September 24, 2010
  • 10. 10 Divide & Conquer Because The Map function only cares about the current key and value, and The Reduce function only cares about the current key and its values Mappers & Reducers can run Map/Reduce functions independently Then A Mapper can invoke Map on an arbitrary number of input keys and values A Reducer can invoke Reduce on an arbitrary number of the unique keys Any number of Mappers & Reducers can be run on any number of nodes Copyright (c) 2008 Scale Unlimited, Inc. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. Friday, September 24, 2010
  • 11. 11 Overview of Hadoop Ken Krugler photo by: exfordy, flickr Bixo Labs & Scale Unlimited Copyright (c) 2008 Scale Unlimited, Inc. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. Friday, September 24, 2010
  • 12. 12 Logical Architecture Cluster Execution Storage Logically, Hadoop is simply a computing cluster that provides: a Storage layer, and an Execution layer Copyright (c) 2008 Scale Unlimited, Inc. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. Friday, September 24, 2010
  • 13. 13 Logical Architecture Users Cluster Client Client Execution job job job Storage Client Where users can submit jobs that: Run on the Execution layer, and Read/write data from the Storage layer Copyright (c) 2008 Scale Unlimited, Inc. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. Friday, September 24, 2010
  • 14. 14 Storage - HDFS Geography Geography Rack Rack Rack Node Node Node Node ... block block block block block block block block Performance Optimized for many concurrent serialized reads of large files But only a single writer (write-once, no append) Fidelity Replication via block replication Copyright (c) 2008 Scale Unlimited, Inc. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. Friday, September 24, 2010
• 15. 15 Execution
[Diagram: Geography > Racks > Nodes running map and reduce tasks]
Performance
  Divide and Conquer via MapReduce
Data Affinity
  Move code to data
Fidelity
  Applications must choose to fail on bad data, or ignore bad data
• 16. 16 Architectural Components
[Diagram: the Client performs namespace operations against the NameNode (with a Secondary NameNode) and reads/writes data blocks directly on DataNodes; the Client submits jobs to the JobTracker, which hands tasks to TaskTrackers running mapper and reducer child JVMs]
Solid boxes are unique applications
Dashed boxes are child JVM instances on the same node as their parent
Dotted boxes are blocks of managed files on the same node as their parent
• 17. 17 Challenges of Map-Reduce
Ken Krugler, Bixo Labs & Scale Unlimited (photo by: Graham Racher, flickr)
• 18. 18 Complexity of keys/values
Simple != simplicity:
  People think about processing records/objects
  The correct key definition changes during a workflow
  A "value" is typically a set of fields
• 19. 19 Complexity of chaining jobs
Most real-world workflows require multiple Map-Reduce jobs
  Connecting data between jobs is tedious
  Keeping key/value definitions in sync is error-prone
As an example...
• 20. 20 A trivial filter/count/sort workflow

private void start(String inputPath, String outputPath, String filterUser)
    throws Exception {
    // counts the pages the users clicked and sorts by number of clicks
    Path outputDir = new Path(outputPath);
    Path parent = outputDir.getParent();
    Path tmpOutput = new Path(parent, "" + System.currentTimeMillis());

    // Create the job configuration
    Configuration conf = new Configuration();
    conf.set(FilterMapper.USER_KEY, filterUser);

    // filter and count
    Job filterCountJob = new Job(conf, "filter+count job");
    FileInputFormat.setInputPaths(filterCountJob, new Path(inputPath));
    FileOutputFormat.setOutputPath(filterCountJob, tmpOutput);
    filterCountJob.setMapperClass(FilterMapper.class);
    filterCountJob.setMapOutputKeyClass(Text.class);
    filterCountJob.setMapOutputValueClass(LongWritable.class);
    filterCountJob.setReducerClass(CounterReducer.class);
    filterCountJob.setOutputKeyClass(Text.class);
    filterCountJob.setOutputValueClass(LongWritable.class);
    filterCountJob.setInputFormatClass(TextInputFormat.class);
    filterCountJob.setOutputFormatClass(SequenceFileOutputFormat.class);

    // run job and block until job is done
    filterCountJob.waitForCompletion(false);

    // sort
    Job sortJob = new Job(conf, "sort job");
    FileInputFormat.setInputPaths(sortJob, tmpOutput);
    FileOutputFormat.setOutputPath(sortJob, outputDir);
    sortJob.setMapperClass(SortMapper.class);
    sortJob.setMapOutputKeyClass(UserAndCount.class);
    sortJob.setMapOutputValueClass(NullWritable.class);
    // sorts are only stable with one reducer, but you lose distribution
    sortJob.setNumReduceTasks(1);
    sortJob.setInputFormatClass(SequenceFileInputFormat.class);
    sortJob.setOutputFormatClass(TextOutputFormat.class);

    // run job and block until job is done
    sortJob.waitForCompletion(false);

    // Get rid of temp directory used to bridge between the two jobs
    FileSystem.get(conf).delete(tmpOutput, true);
}
• 21. 21 Not seeing the forest...
Too much low-level detail obscures intent
Not seeing intent leads to bugs
And makes optimization harder
• 22. 22 Hard to use MR optimizations
MR => MR{0,1} => M{1,}R{0,1} => M{1,}R{0,1}M{0,}
Combiners can reduce map output size
Map-side joins where one of the sides is "small"
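The combiner optimization can be sketched without Hadoop (hypothetical helper names): running the reducer's aggregation logic locally over a single mapper's output shrinks the number of pairs that must cross the network during the shuffle.

```java
import java.util.*;

// Sketch of the combiner optimization (hypothetical, no Hadoop API).
public class CombinerSketch {

    // Raw map output: one (word, 1) pair per word occurrence.
    static List<Map.Entry<String, Integer>> mapOutput(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Combiner: the reducer's logic applied locally to one mapper's output,
    // so fewer pairs need to be shuffled to the reducers.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : pairs) {
            combined.merge(kv.getKey(), kv.getValue(), Integer::sum);
        }
        return combined;
    }
}
```

On the line "a a a b", the raw map output is four pairs but the combined output is only two, and the savings grow with the amount of repetition in the data.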
• 23. 23 Solutions to Complexity
Cascading - Tuples, Pipes, and Operations
FlumeJava - Java Classes and Deferred Execution
Pig/Hive - "It's just a big database"
Datameer - "It's just a big spreadsheet"
• 24. 24 Cascading
Ken Krugler, Bixo Labs & Scale Unlimited (photo by: Graham Racher, flickr)
• 25. 25 Cascading
An API for defining data processing workflows
First public release Jan 2008
Currently 1.1.1 and supports Hadoop 0.19.x through 0.20.2
http://www.cascading.org
• 26. 26 Cascading
Basic unit of data is the "Tuple" - roughly equal to a record
Implements all standard data processing operations
  function, filter, group by, co-group, aggregator, etc.
Complex applications are built by chaining operations with pipes
And are planned into the minimum number of MapReduce jobs
• 27. 27 Operations on Tuples
Map and Reduce communicate through Keys and Values
Our problems are really about elements of Keys and Values
  e.g. "username", not "line"
The effect is more reusability of our 'map' and 'reduce' operations
• 28. 28 Mapping onto Hadoop
Functions & Filters are map operations
Aggregators & Buffers are reduce operations
Built-in support for joining, multi-field sorting
Concept of "Taps" with "Schemes" as input and output formats
• 29. 29 Loose Coupling
[Diagram: Source -> func -> Group -> aggr -> Sink, showing all the possible relationships between operations and data]
If Fields are the interface between operations, we can build flexible apps
• 30. 30 Layered Over MapReduce
[Diagram: two Sources feed func operations into a CoGroup + aggr (the first Map/Reduce job), which writes a temp file; a GroupBy + aggr + func (the second Map/Reduce job) reads it and writes to the Sink]
• 31. 31 Cascading Code

Tap input = new Hfs( new TextLine( new Fields( "line" ) ), inputPath );
Tap output = new Hfs( new TextLine( 1 ), outputPath, SinkMode.REPLACE );

Pipe pipe = new Pipe( "example" );
RegexSplitGenerator function = new RegexSplitGenerator( new Fields( "word" ), "\\s+" );
pipe = new Each( pipe, new Fields( "line" ), function );
pipe = new GroupBy( pipe, new Fields( "word" ) );
pipe = new Every( pipe, new Fields( "word" ), new Count( new Fields( "count" ) ) );
pipe = new Every( pipe, new Fields( "word" ), new Average( new Fields( "avg" ) ) );
pipe = new GroupBy( pipe, new Fields( "count", -1 ) );

Properties properties = new Properties();
FlowConnector.setApplicationJarClass( properties, Main.class );
Flow flow = new FlowConnector( properties ).connect( "example flow", input, output, pipe );
flow.complete();
• 32. 32
[Flow plan diagram: log records with fields ('ip', 'time', 'method', 'event', 'status', 'size') are read from 'output/logs/'; a DateParser produces 'ts' and an expression function produces 'tm'; GroupBy + Count pipes then write arrival-rate counts per second to 'output/arrivalrate/sec' and per minute to 'output/arrivalrate/min']
• 33. 33
[Flow plan diagram: a much larger auto-generated Cascading flow plan, chaining many RegexSplitter, CoGroup, Identity, and GroupBy/Aggregator steps through temporary SequenceFiles between multiple Map/Reduce jobs]
• 34. 34 FlumeJava
Ken Krugler, Bixo Labs & Scale Unlimited (photo by: Graham Racher, flickr)
• 35. 35 FlumeJava
Recent paper (June 2010) by Googlers - Chambers et al.
Internal project to make it easier to write Map-Reduce jobs
Equivalent open source project just started
  Plume - http://github.com/tdunning/Plume
• 36. 36 Java Classes, Deferred Exec
Java-centric view of map-reduce processing
Handful of classes to represent immutable, parallel collections
  PCollection, PTable
Small number of primitive operations on classes
  parallelDo, groupByKey, combineValues, flatten
Deferred execution allows construction/optimization of workflow
• 37. 37 FlumeJava Code

PCollection<String> lines =
    readTextFileCollection("/gfs/data/shakes/hamlet.txt");
PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
    void process(String line, EmitFn<String> emitFn) {
        for (String word : splitIntoWords(line)) {
            emitFn.emit(word);
        }
    }
}, collectionOf(strings()));
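To get a feel for the programming model, the parallelDo and groupByKey primitives can be mimicked with plain in-memory collections (a toy, eagerly-evaluated sketch with hypothetical names; the real library defers execution, optimizes the plan, and distributes the work):

```java
import java.util.*;
import java.util.function.Function;

// Toy in-memory stand-ins for two FlumeJava primitives (not the real library).
public class MiniFlume {

    // parallelDo: apply a function to every element, emitting zero or more
    // outputs per element; each element is processed independently.
    static <I, O> List<O> parallelDo(List<I> in, Function<I, List<O>> doFn) {
        List<O> out = new ArrayList<>();
        for (I item : in) {
            out.addAll(doFn.apply(item));
        }
        return out;
    }

    // groupByKey: collect the values sharing each key, like the shuffle phase.
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K, V> kv : pairs) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                   .add(kv.getValue());
        }
        return grouped;
    }
}
```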
• 38. 38 Results of Use at Google
[Chart: lines of code vs. number of Map-Reduce jobs for FlumeJava applications]
• 39. 39 Pig & Hive
Ken Krugler, Bixo Labs & Scale Unlimited (photo by: Tambako the Jaguar, flickr)
• 40. 40 Pig
Pig is a data flow 'programming environment' for processing large files
PigLatin is the text language it provides for defining data flows
Developed by researchers at Yahoo! and widely used internally
A sub-project of Hadoop, APL licensed - http://hadoop.apache.org/pig/
First incubator release v0.1.0 in Sept 2008
• 41. 41 Pig
Provides two kinds of optimizations:
  Physical - caching and stream optimizations at the MapReduce level
  Algebraic - mathematical reductions at the syntax level
Allows for User Defined Functions (UDF)
  must register jar of functions via: register path/some.jar
  can be called directly from PigLatin: B = foreach A generate myfunc.MyEvalFunc($0);
• 42. 42 PigLatin - Example

W = LOAD 'filename' AS (url, outlink);
G = GROUP W BY url;
R = FOREACH G {
        FW = FILTER W BY outlink eq 'www.apache.org';
        PW = FW.outlink;
        DW = DISTINCT PW;
        GENERATE group, COUNT(DW);
};
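What the PigLatin script computes can be sketched in plain Java (a hypothetical, in-memory equivalent): group the (url, outlink) rows by url, filter each group's outlinks to 'www.apache.org', take the distinct set, and count it per group.

```java
import java.util.*;

// In-memory sketch of the PigLatin example above (hypothetical, not Pig).
public class PigSketch {

    // rows are {url, outlink} pairs, matching the (url, outlink) schema in the LOAD.
    static Map<String, Long> run(List<String[]> rows) {
        Map<String, Set<String>> distinctPerUrl = new TreeMap<>();  // GROUP W BY url
        for (String[] row : rows) {
            String url = row[0], outlink = row[1];
            Set<String> dw = distinctPerUrl.computeIfAbsent(url, k -> new HashSet<>());
            if (outlink.equals("www.apache.org")) {  // FILTER W BY outlink eq '...'
                dw.add(outlink);                     // DISTINCT
            }
        }
        Map<String, Long> counts = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : distinctPerUrl.entrySet()) {
            counts.put(e.getKey(), (long) e.getValue().size());  // GENERATE group, COUNT(DW)
        }
        return counts;
    }
}
```

The PigLatin version says the same thing in eight declarative lines and leaves the grouping, shuffling, and parallelism to the Pig planner.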
• 43. 43 Hive
Hive is a data warehouse built on top of flat files
Developed by Facebook
Available as a Hadoop contribution in v0.19.0
http://wiki.apache.org/hadoop/Hive
• 44. 44 Hive
Data organization into Tables with logical and hash partitioning
A Metastore to store metadata about Tables/Partitions/Buckets
  Partitions are 'how' the data is stored (collections of related rows)
  Buckets further divide Partitions based on a hash of a column
  metadata is actually stored in an RDBMS
A SQL-like query language over object data stored in Tables
  Custom types, structs of primitive types
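The bucketing idea can be sketched as a simple hash assignment (illustrative only; Hive's actual bucketing hash differs): every row in a partition lands in a bucket determined by hashing the bucketing column's value, so all rows with the same value end up in the same bucket.

```java
// Sketch of hash bucketing (illustrative; not Hive's actual hash function).
public class BucketSketch {

    // Assign a row to one of numBuckets buckets by hashing a column value.
    // floorMod keeps the result non-negative even for negative hash codes.
    static int bucketFor(String columnValue, int numBuckets) {
        return Math.floorMod(columnValue.hashCode(), numBuckets);
    }
}
```

Because the assignment is deterministic, a query that filters or joins on the bucketed column only needs to read the matching bucket's files rather than the whole partition.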
• 45. 45 Hive - Queries
Queries always write results into a table:

FROM user
INSERT OVERWRITE TABLE user_active
SELECT user.* WHERE user.active = true;

FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
    SELECT pv_users.gender, count_distinct(pv_users.userid)
    GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum.txt'
    SELECT pv_users.age, count_distinct(pv_users.userid)
    GROUP BY pv_users.age;
• 46. 46 Datameer Analytics Solution (DAS)
Ken Krugler, Bixo Labs & Scale Unlimited (photo by: Tambako the Jaguar, flickr)
• 47. 47 DAS - Big Spreadsheets
Commercial solution from Datameer
Similar to IBM BigSheets
GUI + workflow editor on top of Hadoop
Define data sources, operations (via spreadsheet), and data sinks
http://www.datameer.com
• 48. 48 [image-only slide]
• 49. 49 [image-only slide]
• 50. 50 Summary
Ken Krugler, Bixo Labs & Scale Unlimited (photo by: Tambako the Jaguar, flickr)
• 51. 51 Thinking at Scale with Hadoop
Understand the underlying Map-Reduce architecture
Model the problem as a workflow with operations on records
Consider using Cascading or (eventually) Plume
Use Pig/Hive to solve query-centric problems
Look at DAS or BigSheets for easier analytics solutions
• 52. 52 Resources
Hadoop mailing lists - many, e.g. common-user@hadoop.apache.org
Hadoop: The Definitive Guide by Tom White (O'Reilly)
Hadoop Training
  Scale Unlimited: http://scaleunlimited.com/courses
  Cloudera: http://www.cloudera.com/hadoop-training/
Cascading: http://www.cascading.org
DAS: http://www.datameer.com
• 53. 53 Questions?