Rapid Development of Big Data applications using Spring for Apache Hadoop

Spring for Apache Hadoop
By Zenyk Matchyshyn

Agenda
• Goals of the project
• Hadoop Introduction
• High level support
• Workflows
• Scripting & Migration
• Alternatives
• Testing & Related

Big Data: Why?
Because of Terabytes and Petabytes:
• Smart meter analysis
• Genome processing
• Sentiment & social media analysis
• Network capacity trending & management
• Ad targeting
• Fraud detection

Goals
• Provide a programmatic model for working with the Hadoop ecosystem
• Simplify use of the client libraries
• Provide Spring-friendly wrappers
• Enable real-world usage as part of Spring Batch & Spring Integration
• Leverage Spring features

Supported distros
• Apache Hadoop 1.2.1/2.0.6/2.2.0
• Cloudera CDH4
• Hortonworks HDP 1.3
• Pivotal HD 1.0/1.1

Hadoop
• Hadoop: Map/Reduce, HDFS
• HBase
• Pig
• Hive

Hadoop basics
Split → Map → Shuffle → Reduce
• Split: "Dog ate the bone" | "Cat ate the fish"
• Map: (Dog, 1) (Ate, 1) (The, 1) (Bone, 1) | (Cat, 1) (Ate, 1) (The, 1) (Fish, 1)
• Shuffle: (Dog, 1) (Ate, {1, 1}) (The, {1, 1}) (Bone, 1) (Cat, 1) (Fish, 1)
• Reduce: (Dog, 1) (Ate, 2) (The, 2) (Bone, 1) (Cat, 1) (Fish, 1)
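
The four stages can be traced in plain Java, without a cluster. This is only a sketch of the data flow for the two sample sentences, not Hadoop code: the class name WordCountStages and its methods are made up for illustration, and words are lower-cased for grouping.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; it mimics the stages in memory only.
final class WordCountStages {

    // Map: emit a (word, 1) pair for every word in one input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.trim().toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleImmutableEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle: group all emitted values by key, e.g. ate -> [1, 1].
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: sum the grouped values of each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int one : entry.getValue()) {
                sum += one;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    // Each list element plays the role of one split.
    static Map<String, Integer> run(List<String> splits) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String split : splits) {
            emitted.addAll(map(split));
        }
        return reduce(shuffle(emitted));
    }

    public static void main(String[] args) {
        // prints {ate=2, bone=1, cat=1, dog=1, fish=1, the=2}
        System.out.println(run(Arrays.asList("Dog ate the bone", "Cat ate the fish")));
    }
}
```

In a real job the map and reduce steps run on different nodes and the framework performs the shuffle; only the two functions are user code.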

Configuration
< … XML …>
<context:property-placeholder location="hadoop.properties"/>
<hdp:configuration>
    fs.default.name=${hd.fs}
    mapred.job.tracker=${hd.jt}
</hdp:configuration>
<… XML … >
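
The ${hd.fs} and ${hd.jt} placeholders are filled in from hadoop.properties before the key=value lines reach Hadoop. The substitution idea can be sketched in plain Java as follows. This is not Spring's implementation; the class PlaceholderSketch and the sample host values are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; Spring's own placeholder support is richer.
final class PlaceholderSketch {

    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replace every ${name} in the text with the matching property value.
    static String resolve(String text, Properties props) {
        Matcher matcher = PLACEHOLDER.matcher(text);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String value = props.getProperty(matcher.group(1));
            if (value == null) {
                throw new IllegalArgumentException("Unresolved placeholder: " + matcher.group(1));
            }
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    // Parse the key=value body of a configuration element into resolved settings.
    static Map<String, String> parse(String body, Properties props) {
        Map<String, String> settings = new LinkedHashMap<>();
        for (String rawLine : body.split("\\R")) {
            String line = rawLine.trim();
            int eq = line.indexOf('=');
            if (line.isEmpty() || eq < 0) {
                continue; // skip blank or malformed lines in this sketch
            }
            settings.put(line.substring(0, eq).trim(), resolve(line.substring(eq + 1).trim(), props));
        }
        return settings;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("hd.fs", "hdfs://localhost:9000"); // assumed sample value
        props.setProperty("hd.jt", "localhost:9001");        // assumed sample value
        String body = "fs.default.name=${hd.fs}\nmapred.job.tracker=${hd.jt}";
        // prints {fs.default.name=hdfs://localhost:9000, mapred.job.tracker=localhost:9001}
        System.out.println(parse(body, props));
    }
}
```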

Job definition
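Spring for Apache Hadoop:
<hdp:job id="hadoopJob"
    input-path="${wordcount.input.path}"
    output-path="${wordcount.output.path}"
    libs="file:${app.repo}/supporting-lib-*.jar"
    mapper="org.company.Mapper"
    reducer="org.company.Reducer"/>
Equivalent plain Hadoop API:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

Configuration conf = new Configuration();
Job job = new Job(conf, "hadoopJob");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// Fully qualified so they are not confused with Hadoop's own Mapper/Reducer base classes
job.setMapperClass(org.company.Mapper.class);
job.setReducerClass(org.company.Reducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));   // input path
FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path
job.waitForCompletion(true);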

Job Execution
• Basic:
<hdp:job-runner id="runner" run-at-startup="true"
    pre-action="someScript"
    post-action="someOtherScript"
    job-ref="hadoopJob" />
• Scheduled
– TaskScheduler
– Quartz
• Custom
Solutions
• HBase
• Hive
• Pig
• Cascading

Simplifies
• Thread safety
• DAO friendliness, wrappers and basic mappers
• Simple connection interfaces
• Runners, templates and callback methods
• Simplification of common scenarios
• Scripting support

Example: Template
template.execute("MyTable", new TableCallback<Object>() {
    @Override
    public Object doInTable(HTable table) throws Throwable {
        Put p = new Put(Bytes.toBytes("SomeRow"));
        p.add(Bytes.toBytes("SomeColumn"), Bytes.toBytes("SomeQualifier"), Bytes.toBytes("AValue"));
        table.put(p);
        return null;
    }
});
<hdp:hbase-configuration/>
<bean id="hbaseTemplate"
    class="org.springframework.data.hadoop.hbase.HbaseTemplate"
    p:configuration-ref="hbaseConfiguration"/>
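
What the template buys you is that the caller never opens or closes the table. A minimal plain-Java sketch of the template and callback pattern follows. All names here (TemplateSketch, FakeTable, the nested Template and TableCallback) are made-up stand-ins and not the HbaseTemplate API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the template/callback pattern.
final class TemplateSketch {

    // Stand-in for a table handle that must be released after use.
    static final class FakeTable implements AutoCloseable {
        final String name;
        final List<String> rows = new ArrayList<>();
        boolean closed;

        FakeTable(String name) {
            this.name = name;
        }

        void put(String row) {
            if (closed) {
                throw new IllegalStateException("Table already closed: " + name);
            }
            rows.add(row);
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    // The caller supplies only the work to do with an open table.
    interface TableCallback<T> {
        T doInTable(FakeTable table) throws Exception;
    }

    // The template owns the resource lifecycle and the exception translation.
    static final class Template {
        final List<FakeTable> opened = new ArrayList<>();

        <T> T execute(String tableName, TableCallback<T> action) {
            try (FakeTable table = new FakeTable(tableName)) {
                opened.add(table);
                return action.doInTable(table);
            } catch (Exception e) {
                throw new IllegalStateException("Table action failed: " + tableName, e);
            }
        }
    }

    public static void main(String[] args) {
        Template template = new Template();
        int rows = template.execute("MyTable", table -> {
            table.put("SomeRow");
            return table.rows.size();
        });
        // prints 1 true
        System.out.println(rows + " " + template.opened.get(0).closed);
    }
}
```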

Example: ScriptRunner
<hdp:hive-server host="hivehost" port="10001" />
<hdp:hive-template />
<hdp:hive-client-factory host="some-host" port="some-port" >
    <hdp:script location="classpath:org/company/hive/script.q">
        <arguments>ignore-case=true</arguments>
    </hdp:script>
</hdp:hive-client-factory>
<hdp:hive-runner id="hiveRunner" run-at-startup="true">
    <hdp:script>
        DROP TABLE IF EXISTS testHiveBatchTable;
        CREATE TABLE testHiveBatchTable (key int, value string);
    </hdp:script>
    <hdp:script location="hive-scripts/script.q"/>
</hdp:hive-runner>

Typical Big Data Processing Flow
Capture → Pre-Process → Insert → Process → Extract → Present

Spring Batch & Spring Integration
• Big Data flows are based on Spring Integration & Spring Batch
• Spring for Hadoop provides:
– Spring Batch tasklets
– Spring Integration support

Tasklets
• Job runners
• Script runners
• Hive
• Pig
• Cascading

Example
<hdp:job-tasklet id="hadoop-tasklet" job-ref="mr-job" wait-for-completion="true" />
<batch:job id="job1">
    <batch:step id="import" next="ht">
        <batch:tasklet ref="script-tasklet"/>
    </batch:step>
    <batch:step id="ht">
        <batch:tasklet ref="hadoop-tasklet" />
    </batch:step>
</batch:job>
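
The next attribute is what chains the "import" step to the "ht" step. The control flow can be sketched in plain Java as below. StepFlowSketch is a hypothetical illustration and not Spring Batch; it handles only a linear chain and has no restart, retry or cycle handling.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of step chaining via a "next" reference.
final class StepFlowSketch {

    static final class Step {
        final String id;
        final String next; // id of the following step, or null for the last step
        final Runnable tasklet;

        Step(String id, String next, Runnable tasklet) {
            this.id = id;
            this.next = next;
            this.tasklet = tasklet;
        }
    }

    // Run steps starting at firstId, following "next" until it is null.
    static List<String> run(String firstId, Map<String, Step> steps) {
        List<String> executed = new ArrayList<>();
        String current = firstId;
        while (current != null) {
            Step step = steps.get(current);
            if (step == null) {
                throw new IllegalArgumentException("Unknown step: " + current);
            }
            step.tasklet.run();
            executed.add(step.id);
            current = step.next;
        }
        return executed;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        Map<String, Step> steps = new LinkedHashMap<>();
        steps.put("import", new Step("import", "ht", () -> log.add("script-tasklet")));
        steps.put("ht", new Step("ht", null, () -> log.add("hadoop-tasklet")));
        // prints [import, ht] [script-tasklet, hadoop-tasklet]
        System.out.println(run("import", steps) + " " + log);
    }
}
```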

Details
• Supports JSR-223 JVM scripting languages (Groovy, JRuby, Jython, Rhino)
• Exposes SimplerFileSystem
• Provides implicit variables
• Exposes FsShell to mimic the HDFS shell
• Exposes DistCp to mimic distcp from Hadoop

Example
<hdp:script-tasklet id="script-tasklet">
    <hdp:script language="groovy">
        inputPath = "/user/gutenberg/input/word/"
        outputPath = "/user/gutenberg/output/word/"
        if (fsh.test(inputPath)) {
            fsh.rmr(inputPath)
        }
        if (fsh.test(outputPath)) {
            fsh.rmr(outputPath)
        }
        inputFile = "src/main/resources/data/nietzsche-chapter-1.txt"
        fsh.put(inputFile, inputPath)
    </hdp:script>
</hdp:script-tasklet>
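
The script makes the run repeatable: it clears stale input and output, then uploads fresh input. The same test / rmr / put sequence can be tried against the local file system with plain java.nio. This is a sketch under that assumption; LocalFsShellSketch is a hypothetical stand-in and does not talk to HDFS or use FsShell.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical local-file-system stand-in for the fsh calls in the script.
final class LocalFsShellSketch {

    // Like "test": does the path exist?
    static boolean test(Path path) {
        return Files.exists(path);
    }

    // Like "rmr": delete a directory tree, children before parents.
    static void rmr(Path path) {
        if (!Files.exists(path)) {
            return;
        }
        try {
            List<Path> entries;
            try (Stream<Path> walk = Files.walk(path)) {
                entries = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
            }
            for (Path entry : entries) {
                Files.delete(entry);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Like "put": copy a local file into the target directory.
    static Path put(Path localFile, Path targetDir) {
        try {
            Files.createDirectories(targetDir);
            return Files.copy(localFile, targetDir.resolve(localFile.getFileName()),
                    StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Same sequence as the Groovy script: clean both paths, then upload.
    static Path prepareInput(Path localFile, Path inputPath, Path outputPath) {
        if (test(inputPath)) {
            rmr(inputPath);
        }
        if (test(outputPath)) {
            rmr(outputPath);
        }
        return put(localFile, inputPath);
    }

    // Runs the sequence in a temp directory; true if stale data is gone and the upload exists.
    static boolean demo() {
        try {
            Path work = Files.createTempDirectory("fsh-sketch");
            Path local = Files.write(work.resolve("chapter.txt"), "some text".getBytes(StandardCharsets.UTF_8));
            Path input = Files.createDirectories(work.resolve("input"));
            Path stale = Files.write(input.resolve("stale.txt"), new byte[0]);
            Path uploaded = prepareInput(local, input, work.resolve("output"));
            boolean ok = Files.exists(uploaded) && !Files.exists(stale);
            rmr(work);
            return ok;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```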

Migration
Hadoop Streaming:
<hdp:streaming id="streaming"
    input-path="/input/" output-path="/output/"
    mapper="${path.cat}" reducer="${path.wc}"/>
Hadoop Tool Executor:
<hdp:tool-runner id="someTool" tool-class="org.foo.SomeTool" run-at-startup="true">
    <hdp:arg value="data/in.txt"/>
    <hdp:arg value="data/out.txt"/>
    property=value
</hdp:tool-runner>

Alternatives
• Apache Flume – distributed data collection
• Apache Oozie – workflow scheduler
• Apache Sqoop – SQL bulk import/export

Testing
• JUnit/Mocks + MRUnit
• Mini-HDFS and Mini-MapReduce cluster
• LocalJobRunner

Spring YARN
• Hadoop 1.x: HDFS (storage), Map/Reduce (cluster / data process)
• Hadoop 2.x: HDFS (storage), YARN (cluster), Map/Reduce (data process), Other like Spark (data)

Spring eXtreme Data (XD)
• Ultimate data processing solution
• Implements the most common approaches; the business logic is up to you
• Built on top of Spring Batch and Spring Integration
• Has a DSL
• Scalable

More speedups
• Use a provider's quick-start VM for initial development
• Use cloud-based images for production (start/stop)
• Don’t use Map/Reduce without a real need. Start with a higher abstraction.
• Don’t migrate without a real need!
• Invest in DevOps (Chef / Puppet / Vagrant…)
