3. What is Big Data?
Datasets that grow so large that they become
awkward to work with using on-hand database
management tools. Difficulties include
capture, storage, search, sharing, analytics,
and visualization. - Wikipedia
High volume of data (storage) + velocity of data
(speed) + variety of data (different types) - Gartner
4. World is ON = Content + Interactions = More Data
(Social and Mobile)
5. Tons of data is generated by each one of us!
(We moved from GB to ZB and from Millions to Zillions)
11. Why Cloud and Big Data?
Cloud has democratized access to large
scale infrastructure for masses!
You can store, process and manage big
data sets without worrying about IT!
http://wiki.apache.org/hadoop/PoweredBy
21. Hadoop – Getting Started
• Download the latest stable version - http://hadoop.apache.org/common/releases.html
• Install Java ( > 1.6.0_20 ) and set your JAVA_HOME
• Install rsync and ssh
• Follow the instructions - http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html
• Hadoop Modes – Local, Pseudo-distributed and Fully distributed
• Run in pseudo-distributed mode for your testing and development
• Assign a decent JVM heap size through mapred.child.java.opts if you
notice task errors, GC overhead or OOM
• Play with the samples – WordCount, TeraSort etc.
• Good for learning - http://www.cloudera.com/hadoop-training-virtual-machine
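The heap-size bullet above can be sketched as a heredoc that writes the property into mapred-site.xml; the `conf/` path and the 512 MB value are illustrative assumptions, not recommendations:

```shell
# Sketch: raise the child-JVM heap via mapred.child.java.opts.
# The conf/ directory and -Xmx512m value are illustrative assumptions.
mkdir -p conf
cat > conf/mapred-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>
</configuration>
EOF
```

Restart the daemons (or rerun the job in local mode) for the new heap size to take effect.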
22. Why Amazon EMR?
I am interested in using Hadoop
to solve problems and not in
building and managing Hadoop
Infrastructure!
23. Amazon EMR – Setup
• Install Ruby 1.8.X and use the EMR Ruby CLI for managing EMR.
• Just create a credentials.json file in your EMR Ruby CLI installation
directory and provide your access key & private key.
• Bootstrapping is a great way to install required components or
perform custom actions in your EMR cluster.
• A default bootstrap action is available to control the configuration of
Hadoop and MapReduce.
• Bootstrap with Ganglia during your development and tuning phase –
it provides monitoring metrics across your cluster.
• Minor bugs in the EMR Ruby CLI, but pretty cool for your needs.
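The credentials.json mentioned above is plain JSON. A sketch follows; the field names are as I recall them from the EMR Ruby CLI README (verify against your CLI version), and every value is a placeholder:

```shell
# Sketch of a credentials.json for the EMR Ruby CLI.
# Field names are from memory of the CLI README; all values are placeholders.
cat > credentials.json <<'EOF'
{
  "access_id":     "YOUR-AWS-ACCESS-KEY",
  "private_key":   "YOUR-AWS-SECRET-KEY",
  "keypair":       "your-ec2-keypair-name",
  "key-pair-file": "/path/to/your-ec2-keypair.pem",
  "log_uri":       "s3n://your-bucket/emr-logs/",
  "region":        "us-east-1"
}
EOF
```

Drop the file in the CLI installation directory and subsequent elastic-mapreduce calls pick it up automatically.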
26. EMR CLI – What you need to know?
• elastic-mapreduce -j <jobflow id> --describe
• elastic-mapreduce --list --active
• elastic-mapreduce -j <jobflow id> --terminate
• elastic-mapreduce --jobflow <jobflow id> --ssh
• Look into your logs directory in S3 if you need any other
information on cluster setup, Hadoop logs, job step logs, task
attempt logs etc.
27. EMR Map Reduce Jobs
• Amazon EMR supports streaming, custom JAR, Cascading, Pig
and Hive. So you can write jobs any way you want without worrying
about managing the underlying infrastructure, including Hadoop.
• Streaming – write MapReduce jobs in any scripting language.
• Custom JAR – write in Java; good for speed/control.
• Cascading, Hive and Pig – higher levels of abstraction.
• Use a good S3 explorer, FoxyProxy and ElasticFox.
• Leverage the AWS EMR forum if you need help.
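Streaming in practice: any executable that reads stdin and writes `key<TAB>value` lines can serve as a mapper or reducer. Below is a minimal word-count pair, exercised by piping locally to simulate the map/sort/reduce pipeline; the file names are my own, and on EMR you would pass them via the CLI's --stream options:

```shell
# Minimal Hadoop Streaming word-count pair (file names are illustrative).
cat > mapper.sh <<'EOF'
#!/bin/sh
# emit "word<TAB>1" for every whitespace-separated token on stdin
tr -s ' \t' '\n\n' | sed '/^$/d' | awk '{print $1 "\t1"}'
EOF
cat > reducer.sh <<'EOF'
#!/bin/sh
# input arrives sorted by key; sum the counts per word
awk -F'\t' '{c[$1]+=$2} END {for (w in c) print w "\t" c[w]}'
EOF
chmod +x mapper.sh reducer.sh

# simulate the streaming pipeline locally:
echo "to be or not to be" | ./mapper.sh | sort | ./reducer.sh | sort
# on EMR, something like (check flags against your CLI version):
#   elastic-mapreduce --create --stream --mapper mapper.sh --reducer reducer.sh \
#     --input s3n://bucket/in --output s3n://bucket/out
```

The local pipe is also a cheap way to smoke-test streaming scripts before paying for a cluster.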
29. Hadoop – Debugging and Profiling
• Run Hadoop in local mode for debugging so mapper and reducer
tasks run in a single JVM instead of separate JVMs.
• Configure HADOOP_OPTS to enable debugging.
(export HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8008")
• Configure the fs.default.name value in core-site.xml to file:/// from hdfs://
• Configure the mapred.job.tracker value in mapred-site.xml to local
• Create a debug configuration for Eclipse and set the port to 8008.
• Run your Hadoop job and launch Eclipse with your Java code so you
can start debugging.
• Use your favorite profiler to understand code-level hotspots.
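The two configuration edits above can be sketched as heredocs; the `conf/` directory is illustrative (use your Hadoop installation's conf directory):

```shell
# Sketch: switch Hadoop to local mode for single-JVM debugging.
# The conf/ path is illustrative; point at your real conf directory.
mkdir -p conf
cat > conf/core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>file:///</value>   <!-- local filesystem instead of hdfs:// -->
  </property>
</configuration>
EOF
cat > conf/mapred-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>local</value>      <!-- run tasks in-process, no JobTracker -->
  </property>
</configuration>
EOF
```

With both set, the whole job runs in one JVM, so the JDWP debugger attached on port 8008 sees your mapper and reducer code directly.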
30. EMR – Good, Bad and Ugly
• Great for bootstrapping large clusters and very cost-effective if
you need once-in-a-while infrastructure to run your Hadoop jobs.
• Don't need to worry about underlying Hadoop cluster setup and
management. Most patches are applied and Amazon creates new
AMIs with improvements.
• Doesn't have a fallback (secondary name node) – only one
master node.
• Intermittent network issues – sometimes these can cause serious
degradation of performance.
• Network IO is variable, and streaming jobs will be much more
sluggish on EMR compared to a dedicated setup.
• Disk IO is terrible across instance families and types – please fix
it.
31. Hadoop – High Level Tuning
• Small files problem – avoid too many small files and tune your
block size.
• Tune your settings – JVM reuse, sort buffer, sort factor,
map/reduce tasks, parallel copies, MapRed output compression etc.
• Know what is limiting you at a node level – CPU, memory,
disk IO or network IN/OUT.
• Good thing is that you can use a small cluster and sample
input size for tuning.
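The knobs named above map onto Hadoop 0.20-era property names roughly as sketched below; every value is an illustrative assumption for a small test cluster, not a recommendation:

```shell
# Sketch: the tuning knobs from this slide as 0.20-era properties.
# All values are illustrative assumptions, not recommendations.
cat > tuning-snippet.xml <<'EOF'
<configuration>
  <!-- JVM reuse: run several tasks per child JVM -->
  <property><name>mapred.job.reuse.jvm.num.tasks</name><value>10</value></property>
  <!-- sort buffer (MB) and sort factor for the map-side sort -->
  <property><name>io.sort.mb</name><value>200</value></property>
  <property><name>io.sort.factor</name><value>25</value></property>
  <!-- parallel copies during the reduce-side shuffle -->
  <property><name>mapred.reduce.parallel.copies</name><value>20</value></property>
  <!-- compress intermediate map output -->
  <property><name>mapred.compress.map.output</name><value>true</value></property>
</configuration>
EOF
```

On EMR these can be supplied through the default Hadoop-configuration bootstrap action rather than by editing files on the nodes.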
32. Hadoop – What affects your job performance?
• GC overhead – increase memory and reduce the JVM reuse tasks.
• Increase the dfs block size (default 128 MB in EMR) for large files.
• Avoid read contention at S3 – have equal or more files in S3
compared to available mappers.
• Use MapRed output compression to save storage, processing
time and bandwidth costs.
• Set the mapred task timeout to 0 if you have long-running jobs (> 10
mins) and disable speculative execution.
• Increase the sort buffer and sort factor based on map task output.
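The last few bullets can be sketched as configuration properties (0.20-era names; the 256 MB block size and other values are illustrative assumptions):

```shell
# Sketch: block size, compression, timeout and speculative-execution
# settings from this slide. Values are illustrative assumptions.
cat > job-tuning.xml <<'EOF'
<configuration>
  <!-- larger HDFS block size for large input files (bytes; 256 MB here) -->
  <property><name>dfs.block.size</name><value>268435456</value></property>
  <!-- compress final job output -->
  <property><name>mapred.output.compress</name><value>true</value></property>
  <!-- 0 disables the per-task timeout for long-running tasks -->
  <property><name>mapred.task.timeout</name><value>0</value></property>
  <!-- turn speculative execution off alongside the timeout change -->
  <property><name>mapred.map.tasks.speculative.execution</name><value>false</value></property>
  <property><name>mapred.reduce.tasks.speculative.execution</name><value>false</value></property>
</configuration>
EOF
```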
36. Hadoop and EMR – What I have learned?
• Code is god – if you have severe performance issues, then look at
your code 100 times, understand the third-party libraries used and
rewrite in Java if required.
• Streaming jobs are slow compared to custom JAR jobs due to the
overhead; scripting is good for ad-hoc analysis.
• Disk IO and network IO affect your processing time.
• Be ready to face variable performance in the cloud.
• Monitor everything once in a while and keep benchmarking with
data points.
• Default settings are seldom optimal in EMR – unless you run
simple jobs.
• Focus on optimization as it's the only way to save cost and time.
37. Hadoop and EMR – Performance Tuning Example
• Streaming: MapReduce jobs were written in Ruby. The input
dataset was 150 GB and the output was around 4000 GB. Complex
processing, highly CPU-bound and disk-IO-bound.
• Time taken to complete job processing: 4000 m1.xlarge nodes
and 180 minutes.
• Rewrote the code in Java – job processing time was reduced to
70 minutes on just 400 m1.xlarge nodes.
• Tuning the EMR configuration further reduced it to 32 minutes.
• Focus on code first and then focus on configuration.
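The case-study numbers above, restated as node-minutes (a rough proxy for cost), make the gap concrete:

```shell
# node-minutes for each run of the case study above
ruby_run=$((4000 * 180))    # streaming in Ruby
java_run=$((400 * 70))      # rewritten in Java
tuned_run=$((400 * 32))     # Java + tuned EMR configuration
echo "Ruby:  $ruby_run node-minutes"    # 720000
echo "Java:  $java_run node-minutes"    # 28000
echo "Tuned: $tuned_run node-minutes"   # 12800
# the Java rewrite alone cut node-minutes by roughly a factor of 25
echo "Reduction from rewrite: $((ruby_run / java_run))x"   # 25x (integer division)
```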