CHUG, February 12, 2013

dean.wampler@thinkbiganalytics.com
polyglotprogramming.com/talks

Scalding for Hadoop

Using Scalding to leverage Cascading to write MapReduce jobs. Note: This talk was presented
in tandem with a Cascading introduction presented by Paco Nathan (@pacoid), so it assumes
you know some Cascading concepts.

(All photos are © Dean Wampler, 2005-2013, All Rights Reserved. Otherwise, the
presentation is free to use.)
About Me...

dean.wampler@thinkbiganalytics.com
@deanwampler
github.com/thinkbiganalytics
github.com/deanwampler

[Book covers: "Functional Programming for Java Developers" (Dean Wampler); "Programming Hive" (Dean Wampler, Jason Rutherglen & Edward Capriolo)]

My books and contact information.
Hive Training in Chicago!

Introduction to Hive: March 7
Hive Master Class: March 8 (also before StrataConf, February 25)

http://bit.ly/ThinkBigEvents

[Book cover: "Programming Hive" (Dean Wampler, Jason Rutherglen & Edward Capriolo)]

I’m teaching two, one-day Hive classes in Chicago in March. I’m also teaching the Hive
Master Class on Monday before StrataConf, 2/25, in Mountain View.
How many of you have all the time in the world to get your work done?
How many of you have all the time in the world to get your work done?

Then why are you doing this:
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

class WCMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  static final IntWritable one = new IntWritable(1);
  static final Text word = new Text();  // Value will be set in a non-thread-safe way!

  @Override
  public void map(LongWritable key, Text valueDocContents,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String[] tokens = valueDocContents.toString().split("\\s+");
    for (String wordString: tokens) {
      if (wordString.length() > 0) {
        word.set(wordString.toLowerCase());
        output.collect(word, one);
      }
    }
  }
}

class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text keyWord, java.util.Iterator<IntWritable> valuesCounts,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int totalCount = 0;
    while (valuesCounts.hasNext()) {
      totalCount += valuesCounts.next().get();
    }
    output.collect(keyWord, new IntWritable(totalCount));
  }
}

The "simple" Word Count algorithm.
This is intentionally too small to read and we’re not showing the main routine, which roughly doubles the code size. The “Word Count” algorithm is simple, but the
Hadoop MapReduce framework is in your face. It’s very hard to see the actual “business logic”. Plus, your productivity is terrible. Yet, many Hadoop developers
insist on working this way...
The main routine I’ve omitted contains boilerplate details for configuring and running the job. This is just the “core” MapReduce code. In fact, Word Count is not
too bad, but when you get to more complex algorithms, even conceptually simple ideas like relational-style joins and group-bys, the corresponding MapReduce
code in this API gets very complex and tedious very fast!
Notice the green, which I use for all the types, most of which are infrastructure types we don't really care about. There is little yellow, which marks the function calls that do real work. We'll see how these change...
(The same code appears twice more as a build, with callouts: "Green types." marking the infrastructure types, and "Yellow operations." marking the function calls that do real work.)
Your tools should:

We would like a “domain-specific language” (DSL) for data manipulation and analysis.
Your tools should:

Minimize boilerplate by exposing the right abstractions.
Your tools should:

Maximize expressiveness and extensibility.

Expressiveness - ability to tell the system what you need done.
Extensibility - ability to add functionality to the system when it doesn’t already support some
operation you need.
Use Cascading (Java)
(Solution #1)
Cascading is a Java library that provides higher-level abstractions for building data processing pipelines, with concepts familiar from SQL such as joins, group-bys, etc. It works on top of Hadoop's MapReduce and hides most of the boilerplate from you.
See http://cascading.org.
import cascading.flow.*;
...
public class WordCount {
  public static void main(String[] args) {
    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass( properties, WordCount.class );

    Scheme sourceScheme = new TextLine( new Fields( "line" ) );
    Scheme sinkScheme = new TextLine( new Fields( "word", "count" ) );
    String inputPath = args[0];
    String outputPath = args[1];
    Tap source = new Hfs( sourceScheme, inputPath );
    Tap sink   = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );

    Pipe assembly = new Pipe( "wordcount" );

    String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
    Function function = new RegexGenerator( new Fields( "word" ), regex );
    assembly = new Each( assembly, new Fields( "line" ), function );
    assembly = new GroupBy( assembly, new Fields( "word" ) );
    Aggregator count = new Count( new Fields( "count" ) );
    assembly = new Every( assembly, count );

    FlowConnector flowConnector = new FlowConnector( properties );
    Flow flow = flowConnector.connect( "word-count", source, sink, assembly );
    flow.complete();
  }
}
Here is the Cascading Java code. It's cleaner than the MapReduce API, because the code is more focused on the algorithm with less boilerplate, although it looks like it's not that much shorter. HOWEVER, this is all the code, whereas previously I omitted the setup (main) code. See http://docs.cascading.org/cascading/1.2/userguide/html/ch02.html for details of the API features used here; we won't discuss them all, but just mention some highlights.
Note that there is still a lot of green for types, but now they are almost all domain-related concepts and fewer infrastructure types. The infrastructure is less "in your face." There is still little yellow, because Cascading has to wrap behavior in Java classes, like GroupBy, Each, Count, etc. Also, the API emphasizes composing behaviors together in an intuitive and powerful way.
(The same code appears again as a series of builds, with callouts for its three parts: the "Flow" setup, the input and output taps, and assembling the pipes.)
Word Count

[Diagram: a Flow containing a Pipe ("word count assembly"): a source Tap reads lines from HDFS, Each(Regex) extracts words, GroupBy groups them by word, Every(Count) produces each count, and a sink Tap writes the results back to HDFS.]

Copyright © 2011-2012, Think Big Analytics, All Rights Reserved

Schematically, here is what Word Count looks like in Cascading. See http://docs.cascading.org/cascading/1.2/userguide/html/ch02.html for details.
Use Scalding (Scala)
(Solution #2)
Scalding is a Scala "DSL" (domain-specific language) that wraps Cascading, providing an even more intuitive and boilerplate-free API for writing MapReduce jobs. https://github.com/twitter/scalding
Scala is a new JVM language that modernizes Java’s object-oriented (OO) features and adds support for functional programming, as we’ll describe
shortly.
import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TextLine( args("input") )
    .read
    .flatMap('line -> 'word) { line: String =>
      line.trim.toLowerCase.split("""\W+""")
    }
    .groupBy('word) { group => group.size('count) }
    .write(Tsv(args("output")))
}

That's it!!
This Scala code is almost pure domain logic with very little boilerplate. There are a few minor differences in the implementation. You don't explicitly specify the "Hfs" (Hadoop Distributed File System) taps; that's handled by Scalding implicitly when you run in "non-local" mode. Also, I'm using a simpler tokenization approach here, where I split on anything that isn't a "word character" [0-9a-zA-Z_].
There is much less green, in part because Scala infers types in many cases, but also because fewer infrastructure types are required. There is a lot more yellow
for the functions that do real work!
What if MapReduce, and hence Cascading and Scalding, went obsolete tomorrow? This code is so short, I wouldn’t care about throwing it away! I invested little
time writing it, testing it, etc.
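To see why this is so short, here is essentially the same logic on an ordinary in-memory Scala collection; Scalding's field-based operations mirror these standard collection methods (a plain-Scala sketch with illustrative data, not the Scalding API):

  val lines = List("Hadoop and Scalding", "Scalding and Scala")
  val wordCounts = lines
    .flatMap(_.trim.toLowerCase.split("""\W+"""))  // split each line into words
    .filter(_.nonEmpty)
    .groupBy(identity)                             // word -> all of its occurrences
    .map { case (word, occurrences) => (word, occurrences.size) }
  // Map(hadoop -> 1, and -> 2, scalding -> 2, scala -> 1)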
CREATE EXTERNAL TABLE docs (line STRING)
LOCATION '/path/to/corpus';

CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
  (SELECT explode(split(line, '\\s')) AS word
   FROM docs) w
GROUP BY word
ORDER BY word;

Hive
Here is the Hive equivalent (more or less, we aren’t splitting into words using a sophisticated regular expression this time). Note that Word Count is a bit unusual
to implement with a SQL dialect (we’ll see more typical cases later), and we’re using some Hive SQL extensions that we won’t take time to explain (like ARRAYs),
but you get the point that SQL is wonderfully concise… when you can express your problem in it!!
inpt = load '/path/to/corpus'
  using TextLoader as (line: chararray);
words = foreach inpt generate
  flatten(TOKENIZE(line)) as word;
grpd = group words by word;
cntd = foreach grpd generate group, COUNT(words);
dump cntd;

Pig
Pig is more expressive for non-query problems like this.
What Is Scala?
Scala is a modern, concise, object-oriented and functional language for the JVM.

http://scala-lang.org

I’ll explain what I mean for these terms in the next few slides.
Modern.

Reflects the 20 years of language evolution since Java's invention.

Java was a good compromise of ideas for the state-of-the-art in the early 90’s, but that was
~20 years ago! We’ve learned a lot about language design and effective software development
techniques since then.
Concise.

Type inference, syntax for common OOP and FP idioms, APIs.

Briefly, type inference reduces “noise”, allowing the logic to shine through. Scala has lots of
idioms to simplify class definition and object creation, as well as many FP idioms, such as
anonymous functions. The collection API emphasizes the important data transformations we
need...
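For instance, a small sketch of type inference and an anonymous function (the names are illustrative):

  val words = List("scalding", "cascading", "hadoop")  // List[String] is inferred
  val lengths = words.map(word => word.length)         // anonymous function; List[Int]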
Object-Oriented Programming.

Fixes flaws in Java's object model. Composition/reuse through traits.

Scala fixes flaws in Java’s object model, such as the essential need for a mix-in composition
mechanism, which Scala provides via traits. For more on this, see Daniel Spiewak’s keynote
for NEScala 2013: “The Bakery from the Black Lagoon”.
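A minimal sketch of mix-in composition with a trait (the names are illustrative):

  trait Logging {
    def log(message: String): Unit = println(message)
  }
  // Mix the trait into a class to reuse its behavior:
  class JobRunner extends Logging {
    def run(): Unit = log("running...")
  }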
Functional Programming.

Most natural fit for data! Robust, scalable.

It may not look like it, but this is actually the most important slide in this talk; FP is the most
appropriate tool for data, since everything we do with data is mathematics. FP also gives us
the idioms we need for robust, scalable concurrency.
JVM.

Designed for the JVM. Interoperates with all Java software.

But Scala is a JVM language. It interoperates with any Java, Clojure, JRuby, etc. libraries.
What Is Scalding?
Scalding is a Scala Big Data library based on Cascading, developed by Twitter.

https://github.com/twitter/scalding

I say “based on” Cascading, because it isn’t just a thin Scala veneer over the Java API. It adds
some different concepts, hiding some Cascading API concepts (e.g., IO specifics), but not all
(e.g., groupby function <=> GroupBy class).
Scalding adds a Matrix API useful for graph and machine-learning algorithms.

I’ll show examples.
Note: Cascalog is a Clojure alternative for Cascading. It also adds Logic Programming based on Datalog.

https://github.com/nathanmarz/cascalog

For completeness, note that Cascalog is a Clojure alternative. It is also notable for its addition
of Logic-Programming features, adapted from Datalog. Cascalog was developed by Nathan
Marz, now at Twitter, who also invented Storm.
Moar Scalding!
More examples, starting with a deeper look at our first example.
Returning to our first example.

import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TextLine( args("input") )
    .read
    .flatMap('line -> 'word) { line: String =>
      line.trim.toLowerCase.split("""\W+""")
    }
    .groupBy('word) { group => group.size('count) }
    .write(Tsv(args("output")))
}

Let's walk through the code to get the "gist" of what's happening.
First, import the Scalding API.
Import the library:

import com.twitter.scalding._
Create a "Job" class:

class WordCountJob(args: Args) extends Job(args) { ... }
Wrap the logic in a class (OOP!) that extends the Job abstraction.
Open the user-specified input as text:

TextLine( args("input") ).read
The “args” holds the user-specified arguments used to invoke the job. “--input path” specifies the source of the text. Similarly, “--output path” is used to specify the
output location for the results.
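By analogy with the invocation examples later in this deck, running the job might look like this (the paths are illustrative):

  scald.rb WordCountJob.scala --input data/corpus.txt --output output/wordcount.txt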
Split each 'line into words, yielding a collection per line, and flatten the collections:

.flatMap('line -> 'word) { line: String =>
  line.trim.toLowerCase.split("""\W+""")
}
A flatMap iterates through a collection, calling a function on each element, where the function "converts" each element into a collection. So, you get a collection of collections. The "flat" part flattens those nested collections into one larger collection. The expression "line: String => line.trim.toLowerCase.split(...)" is the function called by flatMap on each element: the argument is a single string variable named line, and the body is on the right-hand side of the =>.
The output word field is called “word”, from the first argument to flatMap: “‘line -> ‘word”.
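A quick illustration of flatMap on an ordinary Scala collection (not the Scalding API, just the core idea):

  // map produces a collection of collections; flatMap flattens them:
  List("a b", "c").map(_.split(" "))      // List(Array("a", "b"), Array("c"))
  List("a b", "c").flatMap(_.split(" "))  // List("a", "b", "c")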
Group by each 'word and determine the size of each group:

.groupBy('word) { group => group.size('count) }
Now we group the stream of words by the word value, then count the size of each group. Give the output field the name “count”, rather than the default name
“size” (if no argument is passed to size()).
Write tab-separated words and counts:

.write(Tsv(args("output")))
Finally, we write the tab-separated words and counts to the output location specified using “--output path”.
Profit!!
For all you South Park fans out there...
Joins
An example of doing a simple inner join between stock and dividend data.
import com.twitter.scalding._

class StocksDivsJoin(args: Args) extends Job(args) {
  val stocksSchema = ('symd, 'close, 'volume)
  val divsSchema = ('dymd, 'dividend)
  val stocksPipe = new Tsv(args("stocks"), stocksSchema)
    .read
    .project('symd, 'close)
  val divsPipe = new Tsv(args("dividends"), divsSchema)
    .read

  stocksPipe
    .joinWithTiny('symd -> 'dymd, divsPipe)
    .project('symd, 'close, 'dividend)
    .write(Tsv(args("output")))
}

Load stocks and dividends data in separate pipes for a given stock, e.g., IBM.
Capture the schema as values.
“symd” is “stocks year-month-day” and similarly for dividends. Note that Scalding requires unique names for these fields from the different tables.
Load the stock and dividend records into separate pipes; project the desired stock fields.
Join the stocks with the smaller dividends pipe on the dates; project the desired fields.
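As an aside, joinWithTiny is meant for the case where the right-hand pipe is small. Conceptually, the inner join just pairs rows with matching keys, as in this in-memory sketch (plain Scala, not the Scalding API; the data is illustrative):

  val stocks = Map("2010-02-08" -> 121.88, "2009-11-06" -> 123.49)
  val divs   = Map("2010-02-08" -> 0.55, "2009-11-06" -> 0.55)
  // Pair each stock row with the dividend that has the same date, if any:
  val joined = for ((ymd, close) <- stocks; div <- divs.get(ymd))
               yield (ymd, close, div)
  // (2010-02-08,121.88,0.55), (2009-11-06,123.49,0.55)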
# Invoking the script (bash)
scald.rb StocksDivsJoin.scala \
  --stocks    data/stocks/IBM.txt \
  --dividends data/dividends/IBM.txt \
  --output    output/IBM-join.txt

# Output (not sorted!)
2010-02-08      121.88   0.55
2009-11-06      123.49   0.55
2009-08-06      117.38   0.55
2009-05-06      104.62   0.55
2009-02-06      96.14    0.5
2008-11-06      85.15    0.5
2008-08-06      129.16   0.5
2008-05-07      124.14   0.5
...
NGrams
Compute ngrams in a corpus...
import com.twitter.scalding._

class ContextNGrams(args: Args) extends Job(args) {
  val ngramPrefix =
    args.list("ngram-prefix").mkString(" ")
  val keepN = args.getOrElse("count", "10").toInt
  val ngramRE = (ngramPrefix + """\s+(\w+)""").r

  // Sort (phrase,count) by count, descending.
  val countReverseComparator =
    (tuple1: (String,Int), tuple2: (String,Int)) =>
      tuple1._2 > tuple2._2
  ...

Find, count, and sort all "I love _" ngrams in Shakespeare's plays...
1st half of this example (our longest).
  ...
  val lines = TextLine(args("input"))
    .read
    .flatMap('line -> 'ngram) { text: String =>
      ngramRE.findAllIn(text).toIterable }
    .discard('num, 'line)
    .groupBy('ngram) { g => g.size('count) }
    .groupAll { g =>
      g.sortWithTake[(String,Int)](
        ('ngram, 'count) -> 'sorted_ngrams, keepN)(
        countReverseComparator)
    }
    .write(Tsv(args("output")))
}
2nd half.
From the user-specified args: the ngram prefix ("I love"), the number of ngrams desired, and a matching regular expression.
Now walking through it...
A function "value" we'll use to sort ngrams by frequency, descending.
This is a function that is a “normal” value. We’ll pass it to another function as an argument in the 2nd half of the code.
Read the input, find the desired ngram phrases, keep only the desired fields, then group by ngram.
Discard is like the opposite of project(). We don’t need the line number nor the whole line anymore, so toss ‘em.
Group all (ngram,count) tuples together and sort descending by count. Write the output...
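What sortWithTake computes, illustrated on an in-memory collection (a plain-Scala sketch; the data is illustrative):

  val ngrams = List(("I love thee", 44), ("I love you", 24), ("I love him", 15))
  val keepN = 2
  // Sort by count, descending, and keep the top keepN:
  val top = ngrams.sortWith((t1, t2) => t1._2 > t2._2).take(keepN)
  // List((I love thee,44), (I love you,24))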
# Invoking the script (bash)
scald.rb ContextNGrams.scala \
  --input data/shakespeare/plays.txt \
  --output output/context-ngrams.txt \
  --ngram-prefix "I love" \
  --count 10

# Output (reformatted)
(I love thee,44),
(I love you,24),
(I love him,15),
(I love the,9),
(I love her,8),
...,
(I love myself,3),
...,
(I love Valentine,1),
...,
(I love France,1), ...
An Example of the Matrix API

From Scalding's own Matrix examples, "MatrixTutorial4": compute the cosine of every pair of vectors. This is a similarity measurement, useful, e.g., for recommending movies and other stuff to one user based on the preferences of similar users.
import com.twitter.scalding._
import com.twitter.scalding.mathematics.Matrix

// Load a directed graph adjacency matrix where:
// a[i,j] = 1 if there is an edge from a[i] to b[j],
// and compute the cosine of the angle between
// every pair of vectors.
class MatrixCosine(args: Args) extends Job(args) {
  import Matrix._
  val schema = ('user1, 'user2, 'relation)
  val adjacencyMatrix = Tsv(args("input"), schema)
    .read
    .toMatrix[Long,Long,Double](schema)
  val normMatrix = adjacencyMatrix.rowL2Normalize

  // Inner product is equivalent to the cosine:
  //   AA^T/(||A||*||A||)
  (normMatrix * normMatrix.transpose)
    .write(Tsv(args("output")))
}
Import the matrix library.
Scala lets you put imports anywhere! So, you can scope them to just where they are needed.
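For example, a trivial sketch of a locally scoped import:

  def area(radius: Double): Double = {
    import math.Pi   // the import is visible only inside this method body
    Pi * radius * radius
  }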
Load the input into a matrix.
Normalize the rows: sqrt(x^2 + y^2 + ...).
L2 norm is the usual Euclidean distance: sqrt(a*a + b*b + …)
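In plain Scala, the normalization and cosine that the Matrix API computes might be sketched like this (an illustration, not the Scalding implementation):

  def l2Normalize(v: Vector[Double]): Vector[Double] = {
    val norm = math.sqrt(v.map(x => x * x).sum)  // Euclidean length of the row
    v.map(_ / norm)
  }
  // After row normalization, the inner product of two rows is their cosine:
  def cosine(a: Vector[Double], b: Vector[Double]): Double =
    (l2Normalize(a) zip l2Normalize(b)).map { case (x, y) => x * y }.sum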
Compute the cosine!
(I won’t show an example invocation here.)
Scalding vs. Hive vs. Pig
Some concluding thoughts on Scalding (and Cascading) vs. Hive vs. Pig, since I’ve used all three a lot.
Use Hive for queries.
SQL is about as concise as it gets when asking questions of data and doing “standard” analytics. A great example of a “purpose-built” language.
It’s great at what it was designed for. Not so great when you need to do something it wasn’t designed to do.
Use Pig for data flows.

Pig is great for data flows, such as sequencing transformations typical of ETL (extract, transform, and load).
As you might expect for a purpose-built language, it is very concise for these purposes, but writing
extensions is a non-trivial step (same for Hive).
Use Scalding for total flexibility.

Scalding can do everything you can do with these other tools (more or less), but it gives you
the full flexible power of a Turing-complete language, so it’s easier to adapt to non-trivial,
custom scenarios.
For more on Scalding:

github.com/twitter/scalding
github.com/ThinkBigAnalytics/scalding-workshop
github.com/Cascading/Impatient/tree/master/part8
Thanks!

dean.wampler@thinkbiganalytics.com
polyglotprogramming.com/talks

Hive Intro. & Master Classes
StrataConf: 2/25-26, Chicago: 3/7-8
http://bit.ly/ThinkBigEvents

All pictures © Dean Wampler, 2005-2013. The rest of the presentation is free to reuse, but
attribution is appreciated. Hire us!!

2014 holden - databricks umd scala crash courseHolden Karau
 

Mais procurados (20)

Scalding: Reaching Efficient MapReduce
Scalding: Reaching Efficient MapReduceScalding: Reaching Efficient MapReduce
Scalding: Reaching Efficient MapReduce
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
Big Data Everywhere Chicago: Unleash the Power of HBase Shell (Conversant)
Big Data Everywhere Chicago: Unleash the Power of HBase Shell (Conversant) Big Data Everywhere Chicago: Unleash the Power of HBase Shell (Conversant)
Big Data Everywhere Chicago: Unleash the Power of HBase Shell (Conversant)
 
A deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internalsA deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internals
 
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
 
Kotlin @ Coupang Backend 2017
Kotlin @ Coupang Backend 2017Kotlin @ Coupang Backend 2017
Kotlin @ Coupang Backend 2017
 
Apache Spark - Key-Value RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Key-Value RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Key-Value RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Key-Value RDD | Big Data Hadoop Spark Tutorial | CloudxLab
 
Spark workshop
Spark workshopSpark workshop
Spark workshop
 
Scala+data
Scala+dataScala+data
Scala+data
 
Kotlin @ Coupang Backed - JetBrains Day seoul 2018
Kotlin @ Coupang Backed - JetBrains Day seoul 2018Kotlin @ Coupang Backed - JetBrains Day seoul 2018
Kotlin @ Coupang Backed - JetBrains Day seoul 2018
 
Requery overview
Requery overviewRequery overview
Requery overview
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop
 
Meet scala
Meet scalaMeet scala
Meet scala
 
Model-Driven Software Development - Pretty-Printing, Editor Services, Term Re...
Model-Driven Software Development - Pretty-Printing, Editor Services, Term Re...Model-Driven Software Development - Pretty-Printing, Editor Services, Term Re...
Model-Driven Software Development - Pretty-Printing, Editor Services, Term Re...
 
Kotlin coroutines and spring framework
Kotlin coroutines and spring frameworkKotlin coroutines and spring framework
Kotlin coroutines and spring framework
 
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...
 
Domain-Specific Languages for Composable Editor Plugins (LDTA 2009)
Domain-Specific Languages for Composable Editor Plugins (LDTA 2009)Domain-Specific Languages for Composable Editor Plugins (LDTA 2009)
Domain-Specific Languages for Composable Editor Plugins (LDTA 2009)
 
Beyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel ProcessingBeyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel Processing
 
Cascading Through Hadoop for the Boulder JUG
Cascading Through Hadoop for the Boulder JUGCascading Through Hadoop for the Boulder JUG
Cascading Through Hadoop for the Boulder JUG
 
2014 holden - databricks umd scala crash course
2014   holden - databricks umd scala crash course2014   holden - databricks umd scala crash course
2014 holden - databricks umd scala crash course
 

Destaque

Spark at Twitter - Seattle Spark Meetup, April 2014
Spark at Twitter - Seattle Spark Meetup, April 2014Spark at Twitter - Seattle Spark Meetup, April 2014
Spark at Twitter - Seattle Spark Meetup, April 2014Sriram Krishnan
 
Practical Pig and PigUnit (Michael Noll, Verisign)
Practical Pig and PigUnit (Michael Noll, Verisign)Practical Pig and PigUnit (Michael Noll, Verisign)
Practical Pig and PigUnit (Michael Noll, Verisign)Swiss Big Data User Group
 
How LinkedIn Uses Scalding for Data Driven Product Development
How LinkedIn Uses Scalding for Data Driven Product DevelopmentHow LinkedIn Uses Scalding for Data Driven Product Development
How LinkedIn Uses Scalding for Data Driven Product DevelopmentSasha Ovsankin
 
Cascading - A Java Developer’s Companion to the Hadoop World
Cascading - A Java Developer’s Companion to the Hadoop WorldCascading - A Java Developer’s Companion to the Hadoop World
Cascading - A Java Developer’s Companion to the Hadoop WorldCascading
 
Intro To Cascading
Intro To CascadingIntro To Cascading
Intro To CascadingNate Murray
 

Destaque (6)

Spark at Twitter - Seattle Spark Meetup, April 2014
Spark at Twitter - Seattle Spark Meetup, April 2014Spark at Twitter - Seattle Spark Meetup, April 2014
Spark at Twitter - Seattle Spark Meetup, April 2014
 
Unit testing pig
Unit testing pigUnit testing pig
Unit testing pig
 
Practical Pig and PigUnit (Michael Noll, Verisign)
Practical Pig and PigUnit (Michael Noll, Verisign)Practical Pig and PigUnit (Michael Noll, Verisign)
Practical Pig and PigUnit (Michael Noll, Verisign)
 
How LinkedIn Uses Scalding for Data Driven Product Development
How LinkedIn Uses Scalding for Data Driven Product DevelopmentHow LinkedIn Uses Scalding for Data Driven Product Development
How LinkedIn Uses Scalding for Data Driven Product Development
 
Cascading - A Java Developer’s Companion to the Hadoop World
Cascading - A Java Developer’s Companion to the Hadoop WorldCascading - A Java Developer’s Companion to the Hadoop World
Cascading - A Java Developer’s Companion to the Hadoop World
 
Intro To Cascading
Intro To CascadingIntro To Cascading
Intro To Cascading
 

Semelhante a Scalding for Hadoop

Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examplesAndrea Iacono
 
Scalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedInScalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedInVitaly Gordon
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questionsKalyan Hadoop
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview questionpappupassindia
 
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013Robert Metzger
 
Mapredtutorial
MapredtutorialMapredtutorial
MapredtutorialAnup Mohta
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Chris Baglieri
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerankgothicane
 
map reduce Technic in big data
map reduce Technic in big data map reduce Technic in big data
map reduce Technic in big data Jay Nagar
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comsoftwarequery
 
Big-data-analysis-training-in-mumbai
Big-data-analysis-training-in-mumbaiBig-data-analysis-training-in-mumbai
Big-data-analysis-training-in-mumbaiUnmesh Baile
 
Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezMapR Technologies
 
Map Reduce
Map ReduceMap Reduce
Map Reduceschapht
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaDesing Pathshala
 
Why Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) ModelWhy Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) ModelDean Wampler
 
JRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop PapyrusJRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop PapyrusKoichi Fujikawa
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scalesamthemonad
 

Semelhante a Scalding for Hadoop (20)

Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
 
Scalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedInScalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedIn
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questions
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview question
 
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
 
Mapredtutorial
MapredtutorialMapredtutorial
Mapredtutorial
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
 
map reduce Technic in big data
map reduce Technic in big data map reduce Technic in big data
map reduce Technic in big data
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.com
 
Big-data-analysis-training-in-mumbai
Big-data-analysis-training-in-mumbaiBig-data-analysis-training-in-mumbai
Big-data-analysis-training-in-mumbai
 
Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco Vasquez
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
 
Why Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) ModelWhy Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) Model
 
JRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop PapyrusJRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop Papyrus
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scale
 

Mais de Chicago Hadoop Users Group

Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Chicago Hadoop Users Group
 
Choosing the Right Big Data Architecture for your Business
Choosing the Right Big Data Architecture for your BusinessChoosing the Right Big Data Architecture for your Business
Choosing the Right Big Data Architecture for your BusinessChicago Hadoop Users Group
 
Everything you wanted to know, but were afraid to ask about Oozie
Everything you wanted to know, but were afraid to ask about OozieEverything you wanted to know, but were afraid to ask about Oozie
Everything you wanted to know, but were afraid to ask about OozieChicago Hadoop Users Group
 
An Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopAn Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopChicago Hadoop Users Group
 
HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917Chicago Hadoop Users Group
 
Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416Chicago Hadoop Users Group
 

Mais de Chicago Hadoop Users Group (19)

Kinetica master chug_9.12
Kinetica master chug_9.12Kinetica master chug_9.12
Kinetica master chug_9.12
 
Chug dl presentation
Chug dl presentationChug dl presentation
Chug dl presentation
 
Yahoo compares Storm and Spark
Yahoo compares Storm and SparkYahoo compares Storm and Spark
Yahoo compares Storm and Spark
 
Using Apache Drill
Using Apache DrillUsing Apache Drill
Using Apache Drill
 
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
 
Meet Spark
Meet SparkMeet Spark
Meet Spark
 
Choosing the Right Big Data Architecture for your Business
Choosing the Right Big Data Architecture for your BusinessChoosing the Right Big Data Architecture for your Business
Choosing the Right Big Data Architecture for your Business
 
An Overview of Ambari
An Overview of AmbariAn Overview of Ambari
An Overview of Ambari
 
Hadoop and Big Data Security
Hadoop and Big Data SecurityHadoop and Big Data Security
Hadoop and Big Data Security
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
Advanced Oozie
Advanced OozieAdvanced Oozie
Advanced Oozie
 
Financial Data Analytics with Hadoop
Financial Data Analytics with HadoopFinancial Data Analytics with Hadoop
Financial Data Analytics with Hadoop
 
Everything you wanted to know, but were afraid to ask about Oozie
Everything you wanted to know, but were afraid to ask about OozieEverything you wanted to know, but were afraid to ask about Oozie
Everything you wanted to know, but were afraid to ask about Oozie
 
An Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopAn Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache Hadoop
 
HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917
 
Map Reduce v2 and YARN - CHUG - 20120604
Map Reduce v2 and YARN - CHUG - 20120604Map Reduce v2 and YARN - CHUG - 20120604
Map Reduce v2 and YARN - CHUG - 20120604
 
Hadoop in a Windows Shop - CHUG - 20120416
Hadoop in a Windows Shop - CHUG - 20120416Hadoop in a Windows Shop - CHUG - 20120416
Hadoop in a Windows Shop - CHUG - 20120416
 
Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815
 
Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416
 

Último

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 

Último (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 

Scalding for Hadoop

  • 1. CHUG,  February  12,  2013   dean.wampler@thinkbiganalyCcs.com polyglotprogramming.com/talks Scalding  for   Hadoop Wednesday, February 13, 13 Using Scalding to leverage Cascading to write MapReduce jobs. Note: This talk was presented in tandem with a Cascading introduction presented by Paco Nathan (@pacoid), so it assumes you know some Cascading concepts. (All photos are © Dean Wampler, 2005-2013, All Rights Reserved. Otherwise, the presentation is free to use.)
  • 2. About  Me... dean.wampler@thinkbiganalyCcs.com @deanwampler github.com/thinkbiganalyCcs   github.com/deanwampler Programming Hive Functional Programming for Java Developers Dean Wampler, Dean Wampler Jason Rutherglen & Edward Capriolo Wednesday, February 13, 13 My books and contact information.
  • 3. Hive  Training  in  Chicago! IntroducCon  to  Hive March  7 Hive  Master  Class March  8  (also  before   Programming StrataConf,  February  25) Hive hQp://bit.ly/ThinkBigEvents Dean Wampler, Jason Rutherglen & Edward Capriolo Wednesday, February 13, 13 I’m teaching two, one-day Hive classes in Chicago in March. I’m also teaching the Hive Master Class on Monday before StrataConf, 2/25, in Mountain View.
  • 4. How  many  of  you  have   all  the  'me  in  the  world   to  get  your  work  done? Wednesday, February 13, 13
  • 5. How  many  of  you  have   all  the  'me  in  the  world   to  get  your  work  done? Then  why  are  you doing  this: Wednesday, February 13, 13
  • 6. import org.apache.hadoop.io.*; import org.apache.hadoop.mapred.*; import java.util.StringTokenizer; class WCMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { static final IntWritable one = new IntWritable(1); static final Text word = new Text; // Value will be set in a non-thread-safe way! @Override public void map(LongWritable key, Text valueDocContents, OutputCollector<Text, IntWritable> output, Reporter reporter) { String[] tokens = valueDocContents.toString.split("s+"); for (String wordString: tokens) { if (wordString.length > 0) { word.set(wordString.toLowerCase); output.collect(word, one); The “simple” } } Word Count } } algorithm class Reduce extends MapReduceBase implements Reducer[Text, IntWritable, Text, IntWritable] { public void reduce(Text keyWord, java.util.Iterator<IntWritable> valuesCounts, OutputCollector<Text, IntWritable> output, Reporter reporter) { int totalCount = 0; while (valuesCounts.hasNext) { totalCount += valuesCounts.next.get; } output.collect(keyWord, new IntWritable(totalCount)); } } 6 Wednesday, February 13, 13 This is intentionally too small to read and we’re not showing the main routine, which roughly doubles the code size. The “Word Count” algorithm is simple, but the Hadoop MapReduce framework is in your face. It’s very hard to see the actual “business logic”. Plus, your productivity is terrible. Yet, many Hadoop developers insist on working this way... The main routine I’ve omitted contains boilerplate details for configuring and running the job. This is just the “core” MapReduce code. In fact, Word Count is not too bad, but when you get to more complex algorithms, even conceptually simple ideas like relational-style joins and group-bys, the corresponding MapReduce code in this API gets very complex and tedious very fast! Notice the green, which I use for all the types, most of which are infrastructure types we don’t really care about. There is little yellow, which are function calls that do work. We’ll see how these change...
  • 7. import org.apache.hadoop.io.*; import org.apache.hadoop.mapred.*; import java.util.StringTokenizer; class WCMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { static final IntWritable one = new IntWritable(1); static final Text word = new Text; // Value will be set in a non-thread-safe way! @Override public void map(LongWritable key, Text valueDocContents, OutputCollector<Text, IntWritable> output, Reporter reporter) { String[] tokens = valueDocContents.toString.split("s+"); for (String wordString: tokens) { if (wordString.length > 0) { word.set(wordString.toLowerCase); output.collect(word, one); } } } } class Reduce extends MapReduceBase implements Reducer[Text, IntWritable, Text, IntWritable] { public void reduce(Text keyWord, java.util.Iterator<IntWritable> valuesCounts, OutputCollector<Text, IntWritable> output, Reporter reporter) { int totalCount = 0; while (valuesCounts.hasNext) { totalCount += valuesCounts.next.get; } output.collect(keyWord, new IntWritable(totalCount)); } } 7 Wednesday, February 13, 13 This is intentionally too small to read and we’re not showing the main routine, which roughly doubles the code size. The “Word Count” algorithm is simple, but the Hadoop MapReduce framework is in your face. It’s very hard to see the actual “business logic”. Plus, your productivity is terrible. Yet, many Hadoop developers insist on working this way... The main routine I’ve omitted contains boilerplate details for configuring and running the job. This is just the “core” MapReduce code. In fact, Word Count is not too bad, but when you get to more complex algorithms, even conceptually simple ideas like relational-style joins and group-bys, the corresponding MapReduce code in this API gets very complex and tedious very fast! Notice the green, which I use for all the types, most of which are infrastructure types we don’t really care about. There is little yellow, which are function calls that do work. We’ll see how these change...
  • 8. output.collect(word, one); } } } } class Reduce extends MapReduceBase implements Reducer[Text, IntWritable, Text, IntWritable] { public void reduce(Text keyWord, java.util.Iterator<IntWritabl OutputCollector<Text, IntWritable> output, Reporter int totalCount = 0; while (valuesCounts.hasNext) { totalCount += valuesCounts.next.get; } output.collect(keyWord, new IntWritable(totalCount)); } } 7 Wednesday, February 13, 13 This is intentionally too small to read and we’re not showing the main routine, which roughly doubles the code size. The “Word Count” algorithm is simple, but the Hadoop MapReduce framework is in your face. It’s very hard to see the actual “business logic”. Plus, your productivity is terrible. Yet, many Hadoop developers insist on working this way... The main routine I’ve omitted contains boilerplate details for configuring and running the job. This is just the “core” MapReduce code. In fact, Word Count is not too bad, but when you get to more complex algorithms, even conceptually simple ideas like relational-style joins and group-bys, the corresponding MapReduce code in this API gets very complex and tedious very fast! Notice the green, which I use for all the types, most of which are infrastructure types we don’t really care about. There is little yellow, which are function calls that do work. We’ll see how these change...
  • 9. output.collect(word, one); } } } } class Reduce extends MapReduceBase implements Reducer[Text, IntWritable, Text, IntWritable] { public void reduce(Text keyWord, java.util.Iterator<IntWritabl OutputCollector<Text, IntWritable> output, Reporter int totalCount = 0; while (valuesCounts.hasNext) { totalCount += valuesCounts.next.get; } output.collect(keyWord, new IntWritable(totalCount)); } } Green types. 7 Wednesday, February 13, 13 This is intentionally too small to read and we’re not showing the main routine, which roughly doubles the code size. The “Word Count” algorithm is simple, but the Hadoop MapReduce framework is in your face. It’s very hard to see the actual “business logic”. Plus, your productivity is terrible. Yet, many Hadoop developers insist on working this way... The main routine I’ve omitted contains boilerplate details for configuring and running the job. This is just the “core” MapReduce code. In fact, Word Count is not too bad, but when you get to more complex algorithms, even conceptually simple ideas like relational-style joins and group-bys, the corresponding MapReduce code in this API gets very complex and tedious very fast! Notice the green, which I use for all the types, most of which are infrastructure types we don’t really care about. There is little yellow, which are function calls that do work. We’ll see how these change...
  • 10. output.collect(word, one); } } } } class Reduce extends MapReduceBase implements Reducer[Text, IntWritable, Text, IntWritable] { public void reduce(Text keyWord, java.util.Iterator<IntWritabl OutputCollector<Text, IntWritable> output, Reporter int totalCount = 0; while (valuesCounts.hasNext) { totalCount += valuesCounts.next.get; } output.collect(keyWord, new IntWritable(totalCount)); } } Green Yellow types. operations. 7 Wednesday, February 13, 13 This is intentionally too small to read and we’re not showing the main routine, which roughly doubles the code size. The “Word Count” algorithm is simple, but the Hadoop MapReduce framework is in your face. It’s very hard to see the actual “business logic”. Plus, your productivity is terrible. Yet, many Hadoop developers insist on working this way... The main routine I’ve omitted contains boilerplate details for configuring and running the job. This is just the “core” MapReduce code. In fact, Word Count is not too bad, but when you get to more complex algorithms, even conceptually simple ideas like relational-style joins and group-bys, the corresponding MapReduce code in this API gets very complex and tedious very fast! Notice the green, which I use for all the types, most of which are infrastructure types we don’t really care about. There is little yellow, which are function calls that do work. We’ll see how these change...
  • 11. Your  tools  should: Wednesday, February 13, 13 We would like a “domain-specific language” (DSL) for data manipulation and analysis.
  • 12. Your  tools  should: Minimize  boilerplate   by  exposing  the   right  abstrac'ons. Wednesday, February 13, 13 We would like a “domain-specific language” (DSL) for data manipulation and analysis.
  • 13. Your  tools  should: Maximize  expressiveness   and  extensibility. Wednesday, February 13, 13 Expressiveness - ability to tell the system what you need done. Extensibility - ability to add functionality to the system when it doesn’t already support some operation you need.
  • 14. Use Cascading (Java) (SoluCon  #1) 10 Wednesday, February 13, 13 Cascading is a Java library that provides higher-level abstractions for building data processing pipelines with concepts familiar from SQL such as a joins, group-bys, etc. It works on top of Hadoop’s MapReduce and hides most of the boilerplate from you. See http://cascading.org.
  • 15. import org.cascading.*; ... public class WordCount { public static void main(String[] args) { Properties properties = new Properties(); FlowConnector.setApplicationJarClass( properties, WordCount.class ); Scheme sourceScheme = new TextLine( new Fields( "line" ) ); Scheme sinkScheme = new TextLine( new Fields( "word", "count" ) ); String inputPath = args[0]; String outputPath = args[1]; Tap source = new Hfs( sourceScheme, inputPath ); Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE ); Pipe assembly = new Pipe( "wordcount" ); String regex = "(?<!pL)(?=pL)[^ ]*(?<=pL)(?!pL)"; Function function = new RegexGenerator( new Fields( "word" ), regex ); assembly = new Each( assembly, new Fields( "line" ), function ); assembly = new GroupBy( assembly, new Fields( "word" ) ); Aggregator count = new Count( new Fields( "count" ) ); assembly = new Every( assembly, count ); FlowConnector flowConnector = new FlowConnector( properties ); Flow flow = flowConnector.connect( "word-count", source, sink, assembly); flow.complete(); } } 11 Wednesday, February 13, 13 Here is the Cascading Java code. It’s cleaner than the MapReduce API, because the code is more focused on the algorithm with less boilerplate, although it looks like it’s not that much shorter. HOWEVER, this is all the code, where as previously I omitted the setup (main) code. See http:// docs.cascading.org/cascading/1.2/userguide/html/ch02.html for details of the API features used here; we won’t discuss them here, but just mention some highlights. Note that there is still a lot of green for types, but now they are almost all domain-related concepts and less infrastructure types. The infrastructure is less “in your face.” There is still little yellow, because Cascading has to wrap behavior in Java classes, like GroupBy, Each, Count, etc. Also, the API emphasizes composing behaviors together in an intuitive and powerful way.
  • 16. import org.cascading.*; ... public class WordCount { public static void main(String[] args) { Properties properties = new Properties(); FlowConnector.setApplicationJarClass( properties, WordCount.class ); Scheme sourceScheme = new TextLine( new Fields( "line" ) ); Scheme sinkScheme = new TextLine( new Fields( "word", "count" ) ); String inputPath = args[0]; String outputPath = args[1]; Tap source = new Hfs( sourceScheme, inputPath ); Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE ); Pipe assembly = new Pipe( "wordcount" ); String regex = "(?<!pL)(?=pL)[^ ]*(?<=pL)(?!pL)"; Function function = new RegexGenerator( new Fields( "word" ), regex ); assembly = new Each( assembly, new Fields( "line" ), function ); assembly = new GroupBy( assembly, new Fields( "word" ) ); Aggregator count = new Count( new Fields( "count" ) ); assembly = new Every( assembly, count ); FlowConnector flowConnector = new FlowConnector( properties ); Flow flow = flowConnector.connect( "word-count", source, sink, assembly); flow.complete(); } } 12 Wednesday, February 13, 13 Here is the Cascading Java code. It’s cleaner than the MapReduce API, because the code is more focused on the algorithm with less boilerplate, although it looks like it’s not that much shorter. HOWEVER, this is all the code, where as previously I omitted the setup (main) code. See http:// docs.cascading.org/cascading/1.2/userguide/html/ch02.html for details of the API features used here; we won’t discuss them here, but just mention some highlights. Note that there is still a lot of green for types, but now they are almost all domain-related concepts and less infrastructure types. The infrastructure is less “in your face.” There is still little yellow, because Cascading has to wrap behavior in Java classes, like GroupBy, Each, Count, etc. Also, the API emphasizes composing behaviors together in an intuitive and powerful way.
  • 17. import org.cascading.*; ... public class WordCount { public static void main(String[] args) { Properties properties = new Properties(); FlowConnector.setApplicationJarClass( properties, Wo Scheme sourceScheme = new TextLine( new Fields( "lin Scheme sinkScheme = new TextLine( new Fields( "word" String inputPath = args[0]; String outputPath = args[1]; Tap source = new Hfs( sourceScheme, inputPath ); Tap sink = new Hfs( sinkScheme, outputPath, SinkMo Pipe assembly = new Pipe( "wordcount" ); “Flow” setup. 12 String regex = "(?<!pL)(?=pL)[^ ]*(?<=pL)(?! Wednesday, February 13, 13 Function function = new RegexGenerator( new Fields( Here is the Cascading Java code. It’s cleaner than the MapReduce API, because the code is more focused on the algorithm with less boilerplate, although it looks like it’s not that much shorter. HOWEVER, this is all the code, where as previously I omitted the setup (main) code. See http:// docs.cascading.org/cascading/1.2/userguide/html/ch02.html for details of the API features used here; we won’t discuss them "line" ), assembly = new Each( assembly, new Fields( here, but just mention some highlights. Note that there is still a lot of green for types, but now they are almost all domain-related concepts and less infrastructure types. The "word" assembly = new GroupBy( assembly, new Fields( infrastructure is less “in your face.” There is still little yellow, because Cascading has to wrap behavior in Java classes, like GroupBy, Each, Count, etc. Also, theAggregator behaviors together in new Count( new Fields( "count" ) API emphasizes composing count = an intuitive and powerful way.
  • 18. Scheme sourceScheme = new TextLine( new Fields( "lin Scheme sinkScheme = new TextLine( new Fields( "word" String inputPath = args[0]; String outputPath = args[1]; Tap source = new Hfs( sourceScheme, inputPath ); Tap sink = new Hfs( sinkScheme, outputPath, SinkMo Pipe assembly = new Pipe( "wordcount" ); String regex = "(?<!pL)(?=pL)[^ ]*(?<=pL)(?! Function function = new RegexGenerator( new Fields( assembly = new Each( assembly, new Fields( "line" ), assembly = new GroupBy( assembly, new Fields( "word" Aggregator count = new Count( new Fields( "count" ) assembly = new Every( assembly, count ); FlowConnector flowConnector = new FlowConnector( pro Flow flow = flowConnector.connect( "word-count", sou flow.complete(); Input and “Flow” setup. } output taps. 12 } Wednesday, February 13, 13 Here is the Cascading Java code. It’s cleaner than the MapReduce API, because the code is more focused on the algorithm with less boilerplate, although it looks like it’s not that much shorter. HOWEVER, this is all the code, where as previously I omitted the setup (main) code. See http:// docs.cascading.org/cascading/1.2/userguide/html/ch02.html for details of the API features used here; we won’t discuss them here, but just mention some highlights. Note that there is still a lot of green for types, but now they are almost all domain-related concepts and less infrastructure types. The infrastructure is less “in your face.” There is still little yellow, because Cascading has to wrap behavior in Java classes, like GroupBy, Each, Count, etc. Also, the API emphasizes composing behaviors together in an intuitive and powerful way.
  • 19. Pipe assembly = new Pipe( "wordcount" ); String regex = "(?<!pL)(?=pL)[^ ]*(?<=pL)(?! Function function = new RegexGenerator( new Fields( assembly = new Each( assembly, new Fields( "line" ), assembly = new GroupBy( assembly, new Fields( "word" Aggregator count = new Count( new Fields( "count" ) assembly = new Every( assembly, count ); FlowConnector flowConnector = new FlowConnector( pro Flow flow = flowConnector.connect( "word-count", sou flow.complete(); } } Input and Assemble “Flow” setup. output taps. pipes. 12 Wednesday, February 13, 13 Here is the Cascading Java code. It’s cleaner than the MapReduce API, because the code is more focused on the algorithm with less boilerplate, although it looks like it’s not that much shorter. HOWEVER, this is all the code, where as previously I omitted the setup (main) code. See http:// docs.cascading.org/cascading/1.2/userguide/html/ch02.html for details of the API features used here; we won’t discuss them here, but just mention some highlights. Note that there is still a lot of green for types, but now they are almost all domain-related concepts and less infrastructure types. The infrastructure is less “in your face.” There is still little yellow, because Cascading has to wrap behavior in Java classes, like GroupBy, Each, Count, etc. Also, the API emphasizes composing behaviors together in an intuitive and powerful way.
  • 20. Word  Count Flow Pipe ("word count assembly") line words word count Each(Regex) GroupBy Every(Count) Tap Tap HDFS (source) (sink) Copyright  ©  2011-­‐2012,  Think  Big  AnalyCcs,  All  Rights  Reserved Wednesday, February 13, 13 Schematically, here is what Word Count looks like in Cascading. See http:// docs.cascading.org/cascading/1.2/userguide/html/ch02.html for details.
  • 21. Use  Scalding (Scala) (SoluCon  #2) 14 Wednesday, February 13, 13 Scalding is a Scala “DSL” (domain-specific language) that wraps Cascading providing an even more intuitive and more boilerplate-free API for writing MapReduce jobs. https://github.com/twitter/scalding Scala is a new JVM language that modernizes Java’s object-oriented (OO) features and adds support for functional programming, as we’ll describe shortly.
  • 22. import com.twitter.scalding._ class WordCountJob(args: Args) extends Job(args) { TextLine( args("input") ) .read .flatMap('line -> 'word) { line: String => line.trim.toLowerCase.split("W+") } .groupBy('word) { group => group.size('count) } } .write(Tsv(args("output"))) } That’s it!! 15 Wednesday, February 13, 13 This Scala code is almost pure domain logic with very little boilerplate. There are a few minor differences in the implementation. You don’t explicitly specify the “Hfs” (Hadoop Distributed File System) taps. That’s handled by Scalding implicitly when you run in “non-local” model. Also, I’m using a simpler tokenization approach here, where I split on anything that isn’t a “word character” [0-9a-zA-Z_]. There is much less green, in part because Scala infers types in many cases, but also because fewer infrastructure types are required. There is a lot more yellow for the functions that do real work! What if MapReduce, and hence Cascading and Scalding, went obsolete tomorrow? This code is so short, I wouldn’t care about throwing it away! I invested little time writing it, testing it, etc.
  • 23. CREATE EXTERNAL TABLE docs (line STRING) LOCATION '/path/to/corpus'; CREATE TABLE word_counts AS SELECT word, count(1) AS count FROM (SELECT explode(split(line, 's')) AS word FROM docs) w GROUP BY word ORDER BY word; Hive 16 Wednesday, February 13, 13 Here is the Hive equivalent (more or less, we aren’t splitting into words using a sophisticated regular expression this time). Note that Word Count is a bit unusual to implement with a SQL dialect (we’ll see more typical cases later), and we’re using some Hive SQL extensions that we won’t take time to explain (like ARRAYs), but you get the point that SQL is wonderfully concise… when you can express your problem in it!!
  • 24. inpt = load '/path/to/corpus' using TextLoader as (line: chararray); words = foreach inpt generate flatten(TOKENIZE(line)) as word; grpd = group words by word; cntd = foreach grpd generate group, COUNT(words); dump cntd; Pig 17 Wednesday, February 13, 13 Pig is more expressive for non-query problems like this.
  • 25. What  Is Scala? 18 Wednesday, February 13, 13
  • 26. Scala  is  a modern,  concise, object-­‐oriented  and   func'onal  language for  the  JVM. http://scala-lang.org Wednesday, February 13, 13 I’ll explain what I mean for these terms in the next few slides.
  • 27. Modern. Reflects  the  20  years   of  language  evolu'on   since  Java’s  inven'on. Wednesday, February 13, 13 Java was a good compromise of ideas for the state-of-the-art in the early 90’s, but that was ~20 years ago! We’ve learned a lot about language design and effective software development techniques since then.
  • 28. Concise. Type  inference,  syntax   for  common  OOP  and   FP  idioms,  APIs. Wednesday, February 13, 13 Briefly, type inference reduces “noise”, allowing the logic to shine through. Scala has lots of idioms to simplify class definition and object creation, as well as many FP idioms, such as anonymous functions. The collection API emphasizes the important data transformations we need...
  • 29. Object-­‐Oriented   Programming. Fixes  flaws  in  Java’s   object  model.   ComposiCon/reuse   through  traits. Wednesday, February 13, 13 Scala fixes flaws in Java’s object model, such as the essential need for a mix-in composition mechanism, which Scala provides via traits. For more on this, see Daniel Spiewak’s keynote for NEScala 2013: “The Bakery from the Black Lagoon”. “
  • 30. Func'onal   Programming. Most  natural  fit for  data!   Robust,  scalable. Wednesday, February 13, 13 It may not look like it, but this is actually the most important slide in this talk; FP is the most appropriate tool for data, since everything we do with data is mathematics. FP also gives us the idioms we need for robust, scalable concurrency.
  • 31. JVM. Designed  for  the  JVM.   Interoperates  with  all   Java  socware. Wednesday, February 13, 13 But Scala is a JVM language. It interoperates with any Java, Clojure, JRuby, etc. libraries.
  • 32. What  Is Scalding? 25 Wednesday, February 13, 13
  • 33. Scalding  is  a  Scala   Big  Data  library  based   on  Cascading,   developed  by TwiLer. https://github.com/twitter/scalding Wednesday, February 13, 13 I say “based on” Cascading, because it isn’t just a thin Scala veneer over the Java API. It adds some different concepts, hiding some Cascading API concepts (e.g., IO specifics), but not all (e.g., groupby function <=> GroupBy class).
  • 34. Scalding  adds  a Matrix  API  useful  for   graph  and  machine-­‐ learning  algorithms. Wednesday, February 13, 13 I’ll show examples.
  • 35. Note:  Cascalog  is  a   Clojure  alternaCve  for   Cascading.  Also  adds   Logic-­‐Programming   based  on  Datalog. https://github.com/nathanmarz/cascalog Wednesday, February 13, 13 For completeness, note that Cascalog is a Clojure alternative. It is also notable for its addition of Logic-Programming features, adapted from Datalog. Cascalog was developed by Nathan Marz, now at Twitter, who also invented Storm.
  • 36. Moar Scalding! 29 Wednesday, February 13, 13 More examples, starting with a deeper look at our first example.
  • 37. Returning to our first example. import com.twitter.scalding._ class WordCountJob(args: Args) extends Job(args) { TextLine( args("input") ) .read .flatMap('line -> 'word) { line: String => line.trim.toLowerCase.split("W+") } .groupBy('word) { group => group.size('count) } } .write(Tsv(args("output"))) } 30 Wednesday, February 13, 13 Let’s walk through the code to get the “gist” of what’s happening. First, import the Scalding API.
  • 38. Returning to our first example. import com.twitter.scalding._ class WordCountJob(args: Args) extends Job(args) { TextLine( args("input") ) .read .flatMap('line -> 'word) { line: String => line.trim.toLowerCase.split("W+") } .groupBy('word) { group => group.size('count) } } .write(Tsv(args("output"))) } Import the library. 30 Wednesday, February 13, 13 Let’s walk through the code to get the “gist” of what’s happening. First, import the Scalding API.
  • 39. import com.twitter.scalding._ class WordCountJob(args: Args) extends Job(args) { TextLine( args("input") ) .read .flatMap('line -> 'word) { line: String => line.trim.toLowerCase.split("W+") } .groupBy('word) { group => group.size('count) } } .write(Tsv(args("output"))) } Create a “Job” class. 31 Wednesday, February 13, 13 Wrap the logic in a class (OOP!) that extends the Job abstraction.
  • 40. import com.twitter.scalding._ class WordCountJob(args: Args) extends Job(args) { TextLine( args("input") ) .read .flatMap('line -> 'word) { line: String => line.trim.toLowerCase.split("W+") } .groupBy('word) { group => group.size('count) } } .write(Tsv(args("output"))) } Open the user-specified input as text. 32 Wednesday, February 13, 13 The “args” holds the user-specified arguments used to invoke the job. “--input path” specifies the source of the text. Similarly, “--output path” is used to specify the output location for the results.
• 41.

import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .read
    .flatMap('line -> 'word) {
      line: String => line.trim.toLowerCase.split("""\W+""")
    }
    .groupBy('word) { group => group.size('count) }
    .write(Tsv(args("output")))
}

Split each 'line into a collection of words, then flatten the collections into one stream of 'word values.
A flatMap iterates through a collection, calling a function on each element, where the function "converts" each element into a collection. So, you get a collection of collections. The "flat" part flattens those nested collections into one larger collection. The "line: String => line ... ("""\W+""")" expression is the function called by flatMap on each element, where the argument to the function is a single string variable named line and the body is to the right-hand side of the =>. The output field is named 'word, from the first argument to flatMap: 'line -> 'word. See the sketch below.
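As a quick aside, here is a hypothetical illustration of map vs. flatMap on ordinary Scala collections:

// map keeps the nesting; flatMap flattens it away.
val lines = List("hello world", "goodbye world")
lines.map(_.split(" "))      // List(Array(hello, world), Array(goodbye, world))
lines.flatMap(_.split(" "))  // List(hello, world, goodbye, world)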
• 42.

import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .read
    .flatMap('line -> 'word) {
      line: String => line.trim.toLowerCase.split("""\W+""")
    }
    .groupBy('word) { group => group.size('count) }
    .write(Tsv(args("output")))
}

Group by each 'word and determine the size of each group.
Now we group the stream of words by the word value, then count the size of each group. Give the output field the name 'count, rather than the default name "size" (used when no argument is passed to size()).
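The same group-and-count step, sketched on local collections with made-up words:

val words = List("to", "be", "or", "not", "to", "be")
words.groupBy(w => w)                       // Map(to -> List(to, to), be -> List(be, be), ...)
     .map { case (w, ws) => (w, ws.size) }  // Map(to -> 2, be -> 2, or -> 1, not -> 1)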
• 43.

import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .read
    .flatMap('line -> 'word) {
      line: String => line.trim.toLowerCase.split("""\W+""")
    }
    .groupBy('word) { group => group.size('count) }
    .write(Tsv(args("output")))
}

Write tab-separated words and counts.
Finally, we write the tab-separated words and counts to the output location specified using "--output path".
• 44.

import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .read
    .flatMap('line -> 'word) {
      line: String => line.trim.toLowerCase.split("""\W+""")
    }
    .groupBy('word) { group => group.size('count) }
    .write(Tsv(args("output")))
}

Profit!!
For all you South Park fans out there...
• 45. Joins
An example of doing a simple inner join between stock and dividend data.
• 46.

import com.twitter.scalding._

class StocksDivsJoin(args: Args) extends Job(args) {
  val stocksSchema = ('symd, 'close, 'volume)
  val divsSchema = ('dymd, 'dividend)

  val stocksPipe = new Tsv(args("stocks"), stocksSchema)
    .read
    .project('symd, 'close)

  val divsPipe = new Tsv(args("dividends"), divsSchema)
    .read

  stocksPipe
    .joinWithTiny('symd -> 'dymd, divsPipe)
    .project('symd, 'close, 'dividend)
    .write(Tsv(args("output")))
}

Load stocks and dividends data in separate pipes for a given stock, e.g., IBM.
• 47.

import com.twitter.scalding._

class StocksDivsJoin(args: Args) extends Job(args) {
  val stocksSchema = ('symd, 'close, 'volume)
  val divsSchema = ('dymd, 'dividend)

  val stocksPipe = new Tsv(args("stocks"), stocksSchema)
    .read
    .project('symd, 'close)

  val divsPipe = new Tsv(args("dividends"), divsSchema)
    .read

  stocksPipe
    .joinWithTiny('symd -> 'dymd, divsPipe)
    .project('symd, 'close, 'dividend)
    .write(Tsv(args("output")))
}

Capture the schema as values.
"symd" is "stock year-month-day" and similarly for dividends. Note that Scalding requires unique names for these fields from the different tables.
• 48.

import com.twitter.scalding._

class StocksDivsJoin(args: Args) extends Job(args) {
  val stocksSchema = ('symd, 'close, 'volume)
  val divsSchema = ('dymd, 'dividend)

  val stocksPipe = new Tsv(args("stocks"), stocksSchema)
    .read
    .project('symd, 'close)

  val divsPipe = new Tsv(args("dividends"), divsSchema)
    .read

  stocksPipe
    .joinWithTiny('symd -> 'dymd, divsPipe)
    .project('symd, 'close, 'dividend)
    .write(Tsv(args("output")))
}

Load the stock and dividend records into separate pipes, project desired stock fields.
• 49.

import com.twitter.scalding._

class StocksDivsJoin(args: Args) extends Job(args) {
  val stocksSchema = ('symd, 'close, 'volume)
  val divsSchema = ('dymd, 'dividend)

  val stocksPipe = new Tsv(args("stocks"), stocksSchema)
    .read
    .project('symd, 'close)

  val divsPipe = new Tsv(args("dividends"), divsSchema)
    .read

  stocksPipe
    .joinWithTiny('symd -> 'dymd, divsPipe)
    .project('symd, 'close, 'dividend)
    .write(Tsv(args("output")))
}

Join stocks with the smaller dividends pipe on dates, project desired fields.
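joinWithTiny is intended for the case where one side is small enough to replicate to every task, so the join can happen map-side with no reduce step. A local analogue, with hypothetical data, is building an in-memory map from the small side:

// Inner join via an in-memory lookup table (made-up sample data).
val stocks = List(("2010-02-08", 121.88), ("2010-02-09", 123.21))
val divs   = Map("2010-02-08" -> 0.55)    // the "tiny" side, held in memory
val joined = for {
  (ymd, close) <- stocks
  dividend     <- divs.get(ymd)           // keep only matching dates
} yield (ymd, close, dividend)
// joined: List((2010-02-08,121.88,0.55))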
• 50.

# Invoking the script (bash)
scald.rb StocksDivsJoin.scala \
  --stocks data/stocks/IBM.txt \
  --dividends data/dividends/IBM.txt \
  --output output/IBM-join.txt
• 51.

# Invoking the script (bash)
scald.rb StocksDivsJoin.scala \
  --stocks data/stocks/IBM.txt \
  --dividends data/dividends/IBM.txt \
  --output output/IBM-join.txt

# Output (not sorted!)
2010-02-08  121.88  0.55
2009-11-06  123.49  0.55
2009-08-06  117.38  0.55
2009-05-06  104.62  0.55
2009-02-06  96.14   0.5
2008-11-06  85.15   0.5
2008-08-06  129.16  0.5
2008-05-07  124.14  0.5
...
• 53. NGrams
Compute ngrams in a corpus...
• 54.

import com.twitter.scalding._

class ContextNGrams(args: Args) extends Job(args) {
  val ngramPrefix = args.list("ngram-prefix").mkString(" ")
  val keepN = args.getOrElse("count", "10").toInt
  val ngramRE = (ngramPrefix + """\s+(\w+)""").r

  // Sort (phrase,count) by count, descending.
  val countReverseComparator =
    (tuple1: (String,Int), tuple2: (String,Int)) => tuple1._2 > tuple2._2
  ...

Find, count, and sort all "I love _" ngrams in Shakespeare's plays...
1st half of this example (our longest).
• 55.

  ...
  val lines = TextLine(args("input"))
    .read
    .flatMap('line -> 'ngram) {
      text: String => ngramRE.findAllIn(text).toIterable
    }
    .discard('num, 'line)
    .groupBy('ngram) { g => g.size('count) }
    .groupAll { g =>
      g.sortWithTake[(String,Int)](
        ('ngram, 'count) -> 'sorted_ngrams, keepN)(countReverseComparator)
    }
    .write(Tsv(args("output")))
}

2nd half.
• 56.

import com.twitter.scalding._

class ContextNGrams(args: Args) extends Job(args) {
  val ngramPrefix = args.list("ngram-prefix").mkString(" ")
  val keepN = args.getOrElse("count", "10").toInt
  val ngramRE = (ngramPrefix + """\s+(\w+)""").r

  // Sort (phrase,count) by count, descending.
  val countReverseComparator =
    (tuple1: (String,Int), tuple2: (String,Int)) => tuple1._2 > tuple2._2
  ...

From the user-specified args: the ngram prefix ("I love"), the number of ngrams to keep, and a matching regular expression.
Now walking through it...
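A hypothetical REPL sketch of how the constructed regex behaves:

val ngramRE = ("I love" + """\s+(\w+)""").r
ngramRE.findAllIn("I love thee; and I love you.").toList
// List(I love thee, I love you)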
• 57.

import com.twitter.scalding._

class ContextNGrams(args: Args) extends Job(args) {
  val ngramPrefix = args.list("ngram-prefix").mkString(" ")
  val keepN = args.getOrElse("count", "10").toInt
  val ngramRE = (ngramPrefix + """\s+(\w+)""").r

  // Sort (phrase,count) by count, descending.
  val countReverseComparator =
    (tuple1: (String,Int), tuple2: (String,Int)) => tuple1._2 > tuple2._2
  ...

A function "value" we'll use to sort ngrams by frequency, descending.
This is a function that is a "normal" value. We'll pass it to another function as an argument in the 2nd half of the code.
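Because functions are ordinary values in Scala, the same kind of comparator can be handed to any higher-order function; a hypothetical sketch:

val byCountDesc = (a: (String, Int), b: (String, Int)) => a._2 > b._2
List(("or", 1), ("be", 2), ("not", 1)).sortWith(byCountDesc)
// List((be,2), (or,1), (not,1))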
• 58.

  ...
  val lines = TextLine(args("input"))
    .read
    .flatMap('line -> 'ngram) {
      text: String => ngramRE.findAllIn(text).toIterable
    }
    .discard('num, 'line)
    .groupBy('ngram) { g => g.size('count) }
    .groupAll { g =>
      g.sortWithTake[(String,Int)](
        ('ngram, 'count) -> 'sorted_ngrams, keepN)(countReverseComparator)
    }
    .write(Tsv(args("output")))
}

Read, find desired ngram phrases, project desired fields, then group by ngram.
Discard is like the opposite of project(). We don't need the line number nor the whole line anymore, so toss 'em.
• 59.

  ...
  val lines = TextLine(args("input"))
    .read
    .flatMap('line -> 'ngram) {
      text: String => ngramRE.findAllIn(text).toIterable
    }
    .discard('num, 'line)
    .groupBy('ngram) { g => g.size('count) }
    .groupAll { g =>
      g.sortWithTake[(String,Int)](
        ('ngram, 'count) -> 'sorted_ngrams, keepN)(countReverseComparator)
    }
    .write(Tsv(args("output")))
}

Group all (ngram,count) tuples together and sort descending by count. Write...
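What sortWithTake computes, sketched on local collections with hypothetical counts: sort descending by count, keep the top N.

val counts = List(("I love thee", 44), ("I love you", 24), ("I love him", 15))
val keepN  = 2
counts.sortWith(_._2 > _._2).take(keepN)
// List((I love thee,44), (I love you,24))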
• 60.

# Invoking the script (bash)
scald.rb ContextNGrams.scala \
  --input data/shakespeare/plays.txt \
  --output output/context-ngrams.txt \
  --ngram-prefix "I love" \
  --count 10
• 61.

# Invoking the script (bash)
scald.rb ContextNGrams.scala \
  --input data/shakespeare/plays.txt \
  --output output/context-ngrams.txt \
  --ngram-prefix "I love" \
  --count 10

# Output (reformatted)
(I love thee,44), (I love you,24),
(I love him,15), (I love the,9),
(I love her,8), ..., (I love myself,3), ...,
(I love Valentine,1), ..., (I love France,1), ...
• 62. An Example of the Matrix API
From Scalding's own Matrix examples, "MatrixTutorial4": compute the cosine of every pair of vectors. This is a similarity measurement, useful, e.g., for recommending movies and other stuff to one user based on the preferences of similar users.
• 63.

import com.twitter.scalding._
import com.twitter.scalding.mathematics.Matrix

// Load a directed graph adjacency matrix where:
//   a[i,j] = 1 if there is an edge from a[i] to b[j],
// and compute the cosine of the angle between
// every pair of vectors.
class MatrixCosine(args: Args) extends Job(args) {
  import Matrix._

  val schema = ('user1, 'user2, 'relation)
  val adjacencyMatrix = Tsv(args("input"), schema)
    .read
    .toMatrix[Long,Long,Double](schema)

  val normMatrix = adjacencyMatrix.rowL2Normalize

  // Inner product is equivalent to the cosine:
  //   AA^T / (||A|| * ||A||)
  (normMatrix * normMatrix.transpose)
    .write(Tsv(args("output")))
}
• 64.

import com.twitter.scalding._
import com.twitter.scalding.mathematics.Matrix

// Load a directed graph adjacency matrix where:
//   a[i,j] = 1 if there is an edge from a[i] to b[j],
// and compute the cosine of the angle between
// every pair of vectors.
class MatrixCosine(args: Args) extends Job(args) {
  import Matrix._

  val schema = ('user1, 'user2, 'relation)
  val adjacencyMatrix = Tsv(args("input"), schema)
    .read
    .toMatrix[Long,Long,Double](schema)

  val normMatrix = adjacencyMatrix.rowL2Normalize

  // Inner product is equivalent to the cosine:
  //   AA^T / (||A|| * ||A||)
  (normMatrix * normMatrix.transpose)
    .write(Tsv(args("output")))
}

Import matrix library.
Scala lets you put imports anywhere! So, you can scope them to just where they are needed.
• 65.

import com.twitter.scalding._
import com.twitter.scalding.mathematics.Matrix

// Load a directed graph adjacency matrix where:
//   a[i,j] = 1 if there is an edge from a[i] to b[j],
// and compute the cosine of the angle between
// every pair of vectors.
class MatrixCosine(args: Args) extends Job(args) {
  import Matrix._

  val schema = ('user1, 'user2, 'relation)
  val adjacencyMatrix = Tsv(args("input"), schema)
    .read
    .toMatrix[Long,Long,Double](schema)

  val normMatrix = adjacencyMatrix.rowL2Normalize

  // Inner product is equivalent to the cosine:
  //   AA^T / (||A|| * ||A||)
  (normMatrix * normMatrix.transpose)
    .write(Tsv(args("output")))
}

Load into a matrix.
• 66.

import com.twitter.scalding._
import com.twitter.scalding.mathematics.Matrix

// Load a directed graph adjacency matrix where:
//   a[i,j] = 1 if there is an edge from a[i] to b[j],
// and compute the cosine of the angle between
// every pair of vectors.
class MatrixCosine(args: Args) extends Job(args) {
  import Matrix._

  val schema = ('user1, 'user2, 'relation)
  val adjacencyMatrix = Tsv(args("input"), schema)
    .read
    .toMatrix[Long,Long,Double](schema)

  val normMatrix = adjacencyMatrix.rowL2Normalize

  // Inner product is equivalent to the cosine:
  //   AA^T / (||A|| * ||A||)
  (normMatrix * normMatrix.transpose)
    .write(Tsv(args("output")))
}

Normalize: divide each row by its length, sqrt(x1^2 + x2^2 + ...).
The L2 norm is the usual Euclidean length: sqrt(a*a + b*b + ...).
• 67.

import com.twitter.scalding._
import com.twitter.scalding.mathematics.Matrix

// Load a directed graph adjacency matrix where:
//   a[i,j] = 1 if there is an edge from a[i] to b[j],
// and compute the cosine of the angle between
// every pair of vectors.
class MatrixCosine(args: Args) extends Job(args) {
  import Matrix._

  val schema = ('user1, 'user2, 'relation)
  val adjacencyMatrix = Tsv(args("input"), schema)
    .read
    .toMatrix[Long,Long,Double](schema)

  val normMatrix = adjacencyMatrix.rowL2Normalize

  // Inner product is equivalent to the cosine:
  //   AA^T / (||A|| * ||A||)
  (normMatrix * normMatrix.transpose)
    .write(Tsv(args("output")))
}

Compute cosine!
(I won't show an example invocation here.)
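A hypothetical local sketch of what the matrix expression computes for one pair of rows: cosine(a, b) = a . b / (||a|| * ||b||).

def dot(a: Seq[Double], b: Seq[Double]) = a.zip(b).map { case (x, y) => x * y }.sum
def norm(a: Seq[Double]) = math.sqrt(dot(a, a))
def cosine(a: Seq[Double], b: Seq[Double]) = dot(a, b) / (norm(a) * norm(b))

cosine(Seq(1.0, 1.0, 0.0), Seq(1.0, 0.0, 0.0))  // 0.7071... (45 degrees apart)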
• 68. Scalding vs. Hive vs. Pig
Some concluding thoughts on Scalding (and Cascading) vs. Hive vs. Pig, since I've used all three a lot.
• 69. Use Hive for queries.
SQL is about as concise as it gets when asking questions of data and doing "standard" analytics. A great example of a "purpose-built" language. It's great at what it was designed for. Not so great when you need to do something it wasn't designed to do.
• 70. Use Pig for data flows.
Pig is great for data flows, such as sequencing transformations typical of ETL (extract, transform, and load). As you might expect for a purpose-built language, it is very concise for these purposes, but writing extensions is a non-trivial step (same for Hive).
• 71. Use Scalding for total flexibility.
Scalding can do everything you can do with these other tools (more or less), but it gives you the full flexible power of a Turing-complete language, so it's easier to adapt to non-trivial, custom scenarios.
• 72. For more on Scalding:
github.com/twitter/scalding
github.com/ThinkBigAnalytics/scalding-workshop
github.com/Cascading/Impatient/tree/master/part8
• 73. dean.wampler@thinkbiganalytics.com polyglotprogramming.com/talks
Thanks!
Hive Intro. & Master Classes
StrataConf: 2/25-26, Chicago: 3/7-8
http://bit.ly/ThinkBigEvents
All pictures © Dean Wampler, 2005-2013. The rest of the presentation is free to reuse, but attribution is appreciated. Hire us!!