23. Word count in Cascading
1. public class WordCount {
2. public static void main(String[] args) {
3. Properties properties = new Properties();
4. FlowConnector.setApplicationJarClass(properties, WordCount.class);
5. Scheme sourceScheme = new TextLine(new Fields("line"));
6. Scheme sinkScheme = new TextLine(new Fields("word", "count"));
7. Tap source = new Hfs(sourceScheme, args[0]);
8. Tap sink = new Hfs(sinkScheme, args[1], SinkMode.REPLACE);
9. Pipe assembly = new Pipe("wordcount");
10. String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
11. Function function = new RegexGenerator(new Fields("word"), regex);
12. assembly = new Each(assembly, new Fields("line"), function);
13. assembly = new GroupBy(assembly, new Fields("word"));
14. Aggregator count = new Count(new Fields("count"));
15. assembly = new Every(assembly, count);
16. FlowConnector flowConnector = new FlowConnector(properties);
17. Flow flow = flowConnector.connect("word-count", source, sink, assembly);
18. flow.complete();
19. }
20. }
70% less boilerplate code than the Java MapReduce API
But still some infrastructure code
31. Scalding…
…open sourced by Twitter in 2011
…has more than 100 open source contributors
…exposes the right abstractions
…maximizes expressiveness
…promotes extensibility
…adds new capabilities to Cascading
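For comparison, a word count in Scalding's Fields API fits in a handful of lines. This is a minimal sketch in the style of the standard tutorial job; the --input and --output argument names are illustrative:

import com.twitter.scalding._

// Sketch of a Scalding word count (tutorial style); tokenization is deliberately simplistic.
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))                        // read raw lines, e.g. --input data/input.txt
    .flatMap('line -> 'word) { line: String =>
      line.toLowerCase.split("\\s+")             // split each line into words (map side)
    }
    .groupBy('word) { _.size('count) }           // count occurrences per word (reduce side)
    .write(Tsv(args("output")))                  // write (word, count) pairs, e.g. --output out.tsv
}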
36. Group operations
1. val pipe = Tsv("input", ('shopId, 'itemId, 'quantity))
2. .groupBy('shopId) {
3. _.sum[Long]('quantity -> 'totalSoldItems)
4. }
5. .write(Tsv("results.tsv"))
.groupBy groups by particular fields
.groupAll groups all data into a single group
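A minimal groupAll sketch, for contrast: it folds every row into one group to compute a single global aggregate. File and field names are illustrative, and the snippet assumes the same Job context as above:

// Sum the quantity column across the whole dataset (forces a single reducer).
Tsv("input", ('shopId, 'itemId, 'quantity))
  .groupAll { _.sum[Long]('quantity -> 'grandTotal) }
  .write(Tsv("grand-total.tsv"))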
37. Pipe operations
1. val p = (pipe1 ++ pipe2) // Concatenate 2 pipes
2. .debug // Print sample data to screen
3. .addTrap(Tsv("bogus_lines")) // Dirty data is recorded here
Simple pipe operations
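Put together inside a job, the same pipe operations might look like the following sketch; paths and field names are illustrative:

import com.twitter.scalding._

class ConcatenateJob(args: Args) extends Job(args) {
  val pipe1 = Tsv("input1.tsv", ('id, 'value)).read
  val pipe2 = Tsv("input2.tsv", ('id, 'value)).read

  (pipe1 ++ pipe2)                 // concatenate the two pipes
    .debug                         // print sample tuples to the console
    .addTrap(Tsv("bogus_lines"))   // dirty/unparsable tuples are diverted here instead of failing the job
    .write(Tsv("union.tsv"))
}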
39. Scalding + Hive
1. class HiveExample (args: Args) extends Job(args) {
2. val USER_SCHEMA = List('userId, 'username, 'photo)
3. HiveSource("myHiveTable", SinkMode.KEEP)
4. .withHCatScheme(osvInputScheme(fields = USER_SCHEMA))
5. .write(Tsv("outputFromHive"))
6. }
Define the schema
Query HCatalog
Read directly from HDFS
40. Scalding + ElasticSearch
1. val schema = List('number, 'product, 'description)
2. val readES = ElasticSearchTap("localhost", 9200, "index/firstType", "", schema)
   .read.write(Tsv("data/es-out.tsv"))
3. val writeES = Tsv("data.tsv").read
   .write(ElasticSearchTap("localhost", 9200, "index/secondType", "", schema))
Read from ElasticSearch in one line!
Also index new data in ES
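A follow-up sketch reusing the ElasticSearchTap from the example above (host, port and index names are illustrative): read from one type, filter the documents, and index the result into another type. The snippet assumes the same Job context as the example above:

val schema = List('number, 'product, 'description)

ElasticSearchTap("localhost", 9200, "index/firstType", "", schema)
  .read
  .filter('description) { d: String => d.nonEmpty }   // keep only documents with a description
  .write(ElasticSearchTap("localhost", 9200, "index/secondType", "", schema))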
45. Testing challenges in the context of MR
Unit / Component Tests
Integration Tests
System Tests
Acceptance Tests
Scalding enables testing in every layer & TDD
46. example
1. class TsvWordCountJobTest extends FlatSpec
2. with ShouldMatchers with TupleConversions {
3. "WordCountJob" should "count words" in {
4. JobTest(new WordCountJob(_))
5. .arg("input", "inFile")
6. .arg("output", "outFile")
7. .source(TextLine("inFile"), List((0, "cool Scala cool")))
8. .sink[(String, Int)](Tsv("outFile")) { out =>
9. out.toList should contain ("cool" -> 2)
10. }
11. .run
12. .finish
13. }
14. }
JobTest replaces the taps with in-memory collections and asserts the expected output
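The same test can also be driven through Hadoop's local runner instead of the in-memory Cascading local mode; a sketch of that variant (same assumptions as the test above) only changes the run call:

JobTest(new WordCountJob(_))
  .arg("input", "inFile")
  .arg("output", "outFile")
  .source(TextLine("inFile"), List((0, "cool Scala cool")))
  .sink[(String, Int)](Tsv("outFile")) { out =>
    out.toList should contain ("cool" -> 2)
  }
  .runHadoop   // exercise the job through Hadoop's local mode rather than the in-memory mode
  .finish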
48. "Driven takes Cascading application development to the next level with management and monitoring capabilities for your apps"
http://driven.cascading.io
Monday, June 30, 2014
6:30 PM to 9:30 PM
Barclays Accelerator 69-89 Mile End Road, E1 4UJ, London
http://www.meetup.com/big-data-london/events/188925412/
Here are my contact details.
On GitHub you can find a number of open-source projects related to Hadoop & MapReduce that I have been contributing to
And also the technical blog http://scalding.io
First book ever available on Scala + MapReduce + Hadoop
Comes with hundreds of ready-to-run examples
Book @ Amazon = http://amazon.co.uk/dp/1783287012
Book @ PACKT = http://packtpub.com/programming-mapreduce-with-scalding/book
GitHub repository with examples = https://github.com/scalding-io/ProgrammingWithScalding
http://github.com/twitter/scalding
Once upon a time..
Hadoop provides HDFS for the distributed storage of large files and services for coordinated execution of MapReduce tasks
The Java MapReduce API is very verbose: a simple WordCount example requires about 70 lines of code
In-memory systems, e.g. memcached, Redis, etc.
Document Databases
Search systems
Explain:
Taps, Tuples, pipes
Show parallelism
Cascading word count requires 20 lines
Java MapReduce API word count requires 70 lines
By using Cascading we remove 50 lines of code (~70%)
Is this what Scalding adds on top of Cascading?
Parquet => Efficient columnar storage
For a Scalding application to execute, all defined input and output taps must participate in the pipeline.
Reading & Writing files
// 15 map operations that are translated into map phases
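A minimal sketch of reading and writing files, where the map operation is translated into a map phase; paths are illustrative:

import com.twitter.scalding._

class ReadWriteJob(args: Args) extends Job(args) {
  TextLine("input.txt")                                         // read raw lines from HDFS or local disk
    .map('line -> 'upper) { line: String => line.toUpperCase }  // map-only transformation (map phase)
    .project('upper)                                            // keep just the new field
    .write(Tsv("output.tsv"))                                   // write the result as TSV
}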