Hadoop Conf Japan 2009 After Party LT - Hadoop Ruby DSL
1. MapReduce by JRuby and DSL
Hadoop Papyrus
2010/8/28
JRubyKaigi 2010
藤川幸一 FUJIKAWA Koichi @fujibee
2. What’s Hadoop?
• Framework for parallel distributed processing of big data
• OSS clone of Google MapReduce
• For data processing beyond the terabyte scale
– Reading 400 TB (Web-scale data) from a standard HDD at
50 MB/s would take over 2,000 hours
– We need a distributed file system and a parallel
processing framework!
3. Hadoop Papyrus
• My own OSS project
– Hosted on GitHub: http://github.com/fujibee/hadoop-papyrus
• Framework for writing Hadoop jobs as a (J)Ruby
DSL description
– Hadoop jobs are normally written in Java
– Just a few lines of Ruby replace the very complex
procedure required in Java!
• Supported by the IPA MITOH 2009 project
(government funding)
• Can be run via a Hudson (CI tool) plug-in
7. In Java (Javaの場合), 70 lines are needed..
Hadoop Papyrus needs only 10 lines!

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  public static class TokenizerMapper extends
      Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer extends
      Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args)
        .getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Hadoop Papyrus

dsl 'LogAnalysis'

from 'test/in'
to 'test/out'

pattern /\[\[([^|\]:]+)[^\]:]*\]\]/
column_name :link

topic "link num", :label => 'n' do
  count_uniq column[:link]
end
8. Hadoop Papyrus Details
• Invokes the Ruby script via JRuby inside the
Map/Reduce processes running on Java
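A hypothetical sketch of the Ruby side of that bridge (the method and argument names here are assumptions for illustration, not Papyrus's actual API): the Java Mapper hands each input value to a Ruby routine, which tokenizes it and emits key/value pairs back.

```ruby
# Hypothetical Ruby-side mapper body, called from the Java Mapper
# through JRuby (names are invented for this sketch).
def run_map(value)
  value.to_s.split(/\s+/).each { |word| yield word, 1 }
end

# The Java side would collect emitted pairs via context.write;
# here we just gather them into an array to show the flow.
pairs = []
run_map("hello hadoop hello") { |k, v| pairs << [k, v] }
pairs  # => [["hello", 1], ["hadoop", 1], ["hello", 1]]
```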
9. Hadoop Papyrus Details (cont’d)
• Additionally, you can describe the processing you want (log analysis,
etc.) as a DSL script. Papyrus selects the appropriate behavior for each
phase (Map, Reduce, or job initialization), so only a single script is needed.
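One way that "one script, many phases" idea can work in Ruby is sketched below. This is invented here to illustrate the mechanism (not Papyrus's real internals): the same DSL script is evaluated in a per-phase context that only executes the block matching its phase.

```ruby
# Minimal sketch of phase-selective DSL evaluation (assumption,
# not Papyrus's actual implementation).
class PhaseContext
  attr_reader :emitted

  def initialize(phase)
    @phase   = phase
    @emitted = []
  end

  # Run the block only when this context's phase matches.
  def on(phase, &block)
    instance_eval(&block) if phase == @phase
  end

  def emit(key, value)
    @emitted << [key, value]
  end
end

# One script describes all phases...
script = proc do
  on(:setup)  { emit(:phase, :initialized) }
  on(:map)    { emit(:word, 1) }
  on(:reduce) { emit(:word, :sum) }
end

# ...but each phase's context runs only its own part.
map_ctx = PhaseContext.new(:map)
map_ctx.instance_eval(&script)
map_ctx.emitted  # => [[:word, 1]]
```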