An introduction to Apache Crunch
1. Apache Crunch
●
What is it?
●
How does it work?
●
Why use it?
●
Hadoop MapReduce pipelines
●
Scrunch
●
Joins
www.semtech-solutions.co.nz
info@semtech-solutions.co.nz
2. Apache Crunch – Pipeline
●
Crunch is based on Google's FlumeJava
●
Provides a Java-based API for MapReduce (M/R) pipelines
●
It uses an MST (multiple serializable type) data model
●
Good for processing complex data types
●
Better suited to “non-tuple” data types, e.g.
–
Images
–
Audio
–
Seismic data
3. Apache Crunch – Pipeline
●
What is a MapReduce pipeline?
–
Map
–
Shuffle
–
Reduce
–
Combine
●
Stages are arranged in sequence and/or in parallel
●
Potentially very long chains
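The stages listed above can be sketched in plain Java. This is a minimal illustration of the idea, not the Crunch or Hadoop API: the class `PhasesSketch` and its methods are hypothetical names, and everything runs in memory on a single machine. It assumes the usual Hadoop ordering, in which the combiner pre-aggregates each mapper's output before the shuffle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the map, combine, shuffle and reduce stages.
// NOT the Crunch or Hadoop API; all names here are hypothetical.
class PhasesSketch {

    // Map: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // Combine: pre-aggregate the pairs produced by a single mapper.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> partial = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            partial.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return partial;
    }

    // Shuffle: group the partial counts from all mappers by key.
    static Map<String, List<Integer>> shuffle(List<Map<String, Integer>> partials) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((k, v) ->
                grouped.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
        }
        return grouped;
    }

    // Reduce: collapse each key's list of values into a single total.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new TreeMap<>();
        grouped.forEach((k, vs) ->
            totals.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return totals;
    }

    // Chains the stages in sequence; each input line plays the role of one mapper.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (String line : lines) {
            partials.add(combine(map(line)));
        }
        return reduce(shuffle(partials));
    }

    public static void main(String[] args) {
        // prints {be=2, do=1, not=1, or=1, to=3}
        System.out.println(wordCount(List.of("to be or not to be", "to do")));
    }
}
```

A longer chain would feed the output of one `wordCount`-style stage into the next stage's map step, which is the sequencing that a pipeline library manages on the user's behalf.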
4. Apache Crunch – Scala
●
Scrunch is a Scala wrapper for Apache Crunch
●
Less code than the Java API
●
Functional and OO styles
●
Uses type inference for Map / Reduce
●
Incorporates Java Materialize functionality
●
Includes a REPL (read-eval-print loop)
5. Apache Crunch – Joins
●
Joins available in Crunch
–
Inner and outer joins, as in SQL
–
Outer joins come in left, right and full variants
–
A map-side join is an in-memory join
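The join variants above can be illustrated with a small plain-Java sketch. It is not the Crunch join API: `JoinSketch`, `Kind` and `join` are hypothetical names, and keys are assumed to be unique on each side with non-null values. The right-hand table is held in memory and probed once per left-hand record, which is the idea behind an in-memory map-side join.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Plain-Java illustration of join semantics; NOT the Crunch API.
// All names are hypothetical. Keys are assumed unique per side.
class JoinSketch {

    enum Kind { INNER, LEFT_OUTER, RIGHT_OUTER, FULL_OUTER }

    static List<String> join(Map<String, String> left,
                             Map<String, String> right,
                             Kind kind) {
        List<String> out = new ArrayList<>();
        Set<String> matched = new HashSet<>();

        // Stream the left side and probe the in-memory right side.
        for (Map.Entry<String, String> l : left.entrySet()) {
            String r = right.get(l.getKey());
            if (r != null) {
                matched.add(l.getKey());
                out.add(row(l.getKey(), l.getValue(), r));
            } else if (kind == Kind.LEFT_OUTER || kind == Kind.FULL_OUTER) {
                // Left and full outer joins keep unmatched left rows.
                out.add(row(l.getKey(), l.getValue(), null));
            }
        }

        // Right and full outer joins also keep unmatched right rows.
        if (kind == Kind.RIGHT_OUTER || kind == Kind.FULL_OUTER) {
            for (Map.Entry<String, String> r : right.entrySet()) {
                if (!matched.contains(r.getKey())) {
                    out.add(row(r.getKey(), null, r.getValue()));
                }
            }
        }
        return out;
    }

    // Renders one joined row; a missing side appears as "null".
    static String row(String key, String l, String r) {
        return key + "=(" + l + "," + r + ")";
    }

    public static void main(String[] args) {
        // Sorted maps keep the output order deterministic.
        Map<String, String> users = new TreeMap<>(Map.of("1", "ann", "2", "bob"));
        Map<String, String> orders = new TreeMap<>(Map.of("2", "book", "3", "pen"));
        for (Kind kind : Kind.values()) {
            System.out.println(kind + " " + join(users, orders, kind));
        }
    }
}
```

With the sample data in `main`, only key `2` appears on both sides, so the inner join yields one row, the left and right outer joins yield two rows each, and the full outer join yields three.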
6. Apache Crunch – Performance
●
A lightweight API that runs efficiently
●
Crunch is a thin veneer on top of MapReduce
●
Two implementations available
–
Hadoop Writables
–
Avro
●
Avro implementation much faster
8. Contact Us
●
Feel free to contact us at
–
www.semtech-solutions.co.nz
–
info@semtech-solutions.co.nz
●
We offer IT project consultancy
●
We are happy to hear about your problems
●
You can just pay for those hours that you need to solve your problems