Cost effective BigData Processing on Amazon EC2

Big Data Cloud Meetup Cost Effective Big-Data Processing using Amazon Elastic Map Reduce Sujee Maniyam s@sujee.net / www.sujee.net July 08, 2011

Hi, I’m Sujee 10+ years of software development enterprise apps  web apps iphone apps  Hadoop More : http://sujee.net/tech

Quiz PRIZE! Where was this picture taken?

Quiz : Where was this picture taken?

Nature of Data… Primary Data Email, blogs, pictures, tweets Critical for operation (Gmail can’t loose emails) Secondary data Wikipedia access logs, Google search logs Not ‘critical’, but used to ‘enhance’ user experience Search logs help predict ‘trends’ Yelp can figure out you like Chinese food

Data Explosion Primary data has grown phenomenally But secondary data has exploded in recent years “log every thing and ask questions later” Used for Recommendations (books, restaurants ..etc) Predict trends (job skills in demand) Show ADS ($$$) ..etc ‘Big Data’ is no longer just a problem for BigGuys (Google / Facebook) Startups are struggling to get on top of ‘big data’

Hadoop to Rescue Hadoop can help with BigData Hadoop has been proven in the field Under active development Throw hardware at the problem Getting cheaper by the year Bleeding edge technology Hire good people!

About This Presentation Based on my experience with a startup 5 people (3 Engineers) Ad-Serving Space Amazon EC2 is our ‘data center’ Technologies: Web stack : Python, Tornado, PHP, mysql , LAMP Amazon EMR to crunch data Data size : 1 TB / week

Story of a Startup…month-1 Each web serverwrites logs locally Logs were copiedto a log-serverand purged from web servers Log Data size : ~100-200 G

Story of a Startup…month-6 More web servers comeonline Aggregate log serverfalls behind

Data @ 6 months 2 TB of data already 50-100 G new data / day And we were operating at 20% of our capacity!

Solution? Scalable database (NOSQL) Hbase Cassandra Hadoop log processing / Map Reduce

What We Evaluated 1) Hbase cluster 2) Hadoop cluster 3) Amazon EMR

Hadoop on Amazon EC2 1) Permanent Cluster 2) On demand cluster (elastic map reduce)

Hadoop Cluster 7 C1.xlarge machines 15 TB EBS volumes Sqoop exports mysql log tables into HDFS Logs are compressed (gz) to minimize disk usage (data locality trade-off) All is working well…

2 months later Couple of EBS volumes DIE Couple of EC2 instances DIE Maintaining the hadoop cluster is mechanical job less appealing COST! Our jobs utilization is about 50% But still paying for machines running 24x7

Lessons Learned C1.xlarge is pretty stable (8 core / 8G memory) EBS volumes max size 1TB, so string few for higher density / node DON’T RAID them; let hadoop handle them as individual disks Might fail Backup data on S3 Skip EBS. Use instance store disks, and store data in S3 Use Apache WHIRR to setup cluster easily

Hadoop cluster on EC2 cost $3,500 = 7 c1.xlarge @ $500 / month $1,500 = 15 TB EBS storage @ $0.10 per GB $ 500 = EBS I/O requests @ $0.10 per 1 million I/O requests  $5,500 / month $60,000 / year !

Buy / Rent ? Typical hadoop machine cost : $10-15k 10 node cluster = $100k Plus data center costs Plus IT-ops costs Amazon Ec2 10 node cluster: $500 * 10 = $5,000 / month = $60k / year

Buy / Rent Amazon EC2 is great, for Quickly getting started Startups Scaling on demand / rapidly adding more servers popular social games Netflix story Streaming is powered by EC2 Encoding movies ..etc Use 1000s of instances Not so economical for running clusters 24x7 http://blog.rapleaf.com/dev/2008/12/10/rent-or-own-amazon-ec2-vs-colocation-comparison-for-hadoop-clusters/

Amazon’s Elastic Map Reduce Basically ‘on demand’ hadoop cluster Store data on Amazon S3 Kick off a hadoop cluster to process data Shutdown when done Pay for the HOURS used

Moving parts Logs go into Scribe Scribe master ships logs into S3, gzipped Spin EMR cluster, run job, done Using same old Java MR jobs for EMR Summary data gets directly updated to a mysql (no output files from reducers)

EMR Wins Cost  only pay for use http://aws.amazon.com/elasticmapreduce/pricing/ Example: EMR ran on 5 C1.xlarge for 3hrs EC2 instances for 3 hrs = $0.68 per hr x 5 inst x 3 hrs = $10.20 http://aws.amazon.com/elasticmapreduce/faqs/#billing-4 (1 hour of c1.xlarge = 8 hours normalized compute time) EMR cost = 5 instances x 3 hrs x 8 normalized hrs x 0.12 emr = $14.40 Plus S3 storage cost : 1TB / month = $150 Data bandwidth from S3 to EC2 is FREE!  $25 bucks

Design Wins Bidders now write logs to Scribe directly No mysql at web server machines Writes much faster! S3 has been a reliable storage and cheap

EMR Wins No hadoop cluster to maintainno failed nodes / disks

EMR Wins Hadoop clusters can be of any size! Can have multiple hadoop clusters smaller jobs  fewer number of machines memory hungry tasks  m1.xlarge cpu hungry tasks  c1.xlarge

EMR trade-offs Lower performance on MR jobs compared to a clusterReduced data throughput (S3 isn’t the same as local disk) Streaming data from S3, for each job EMR Hadoop is not the latest version Missing tools : Oozie Right now, trading performance for convenience and cost

Lessons Learned Debugging a failed MR job is tricky Because the hadoop cluster is terminated  no logs files Save log files to S3

Lessons : Script every thing scripts to launch jar EMR jobs Custom parameters depending on job needs (instance types, size of cluster ..etc) monitor job progress Save logs for later inspection Job status (finished / cancelled) https://github.com/sujee/amazon-emr-beyond-basics

Sample Launch Script #!/bin/bash ## run-sitestats4.sh # config MASTER_INSTANCE_TYPE="m1.large" SLAVE_INSTANCE_TYPE="c1.xlarge" INSTANCES=5 export JOBNAME="SiteStats4" export TIMESTAMP=$(date +%Y%m%d-%H%M%S) # end config echo "===========================================" echo $(date +%Y%m%d.%H%M%S) " > $0 : starting...." export t1=$(date +%s) export JOBID=$(elastic-mapreduce --plain-output --create --name "${JOBNAME}__${TIMESTAMP}" --num-instances "$INSTANCES" --master-instance-type "$MASTER_INSTANCE_TYPE" --slave-instance-type "$SLAVE_INSTANCE_TYPE" --jar s3://my_bucket/jars/adp.jar --main-class com.adpredictive.hadoop.mr.SiteStats4 --arg s3://my_bucket/jars/sitestats4-prod.config --log-uri s3://my_bucket/emr-logs/ --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "--core-config-file,s3://my_bucket/jars/core-site.xml,--mapred-config-file,s3://my_bucket/jars/mapred-site.xml”) sh ./emr-wait-for-completion.sh

Lessons : tweak cluster for each job Mapred-config-m1-xl.xml <configuration> <property> <name>mapreduce.map.java.opts</name> <value>-Xmx1024M</value> </property> <property> <name>mapreduce.reduce.java.opts</name> <value>-Xmx3000M</value> </property> <property> <name>mapred.tasktracker.reduce.tasks.maximum</name> <value>3</value> </property> <property> <name>mapred.output.compress</name> <value>true</value> </property> <property> <name>mapred.output.compression.type</name> <value>BLOCK</value> </property> </configuration>

Map reduce tips : Control the amount of Input We get different type of events event A (freq: 10,000) >>> event B (100) >> event C (1) Initially we put them all into a single log file A A A A B A A B C

Control Input… So have to process the entire file, even if we are interested only in ‘event C’ too much wasted processing So we split the logs log_A….gz log_B….gz log_C…gz Now only processing fraction of our logs Input : s3://my_bucket/logs/log_B* x-ref using memcache if needed

Map reduce tips: Data joining (x-ref) Data is split across log files, need to x-ref during Map phase Used to load the data in mapper’s memory (data was small and in mysql) Now we use Membase (Memcached) Two MR jobs are chained First one processes logfile_type_A and populates Membase (very quick, takes minutes) Second one, processes logfile_type_B, cross-references values from Membase

Map reduce tips: Logfile format CSV  JSON Started with CSV CSV: "2","26","3","07807606-7637-41c0-9bc0-8d392ac73b42","MTY4Mjk2NDk0eDAuNDk4IDEyODQwMTkyMDB4LTM0MTk3OTg2Ng","2010-09-09 03:59:56:000 EDT","70.68.3.116","908105","http://housemdvideos.com/seasons/video.php?s=01&e=07","908105","160x600","performance","25","ca","housemdvideos.com","1","1.2840192E9","0","221","0.60000","NULL","NULL 20-40 fields… fragile, position dependant, hard to code url = csv[18]…counting position numbers gets old after 100th time around) If (csv.length == 29) url = csv[28] else url = csv[26]

Map reduce tips: Logfile format JSON: { exchange_id: 2, url : “http://housemdvideos.com/seasons/video.php?s=01&e=07”….} Self-describing, easy to add new fields, easy to process url = map.get(‘url’) Flatten JSON to fit in ONE LINE Compresses pretty well (not much data inflation)

Map reduce tips: Incremental Log Processing Recent data (today / yesterday / this week) is more relevant than older data (6 months +)

Map reduce tips: Incremental Log Processing Adding ‘time window’ to our stats only process newer logs faster

Next steps : faster processing Streaming S3 data for each MR job is not optimal Spin cluster Copy data from S3 to HDFS Run all MR jobs (make use of data locality) terminate

Next Steps : More Processing More MR jobs More frequent data processing Frequent log rolls Smaller delta window (1 hr / 15 mins)

Next steps : new software New Software Pig, python mrJOB(from Yelp) Scribe  Cloudera flume? Use work flow tools like Oozie Hive? Adhoc SQL like queries

Next Steps : SPOT instances SPOT instances : name your price (ebay style) Been available on EC2 for a while Just became available for Elastic map reduce! New cluster setup: 10 normal instances + 10 spot instances Spots may go away anytime That is fine! Hadoop will handle node failures Bigger cluster : cheaper & faster

In summary… Amazon EMR could be a great solution We are happy!

Take a test drive Just bring your credit-card  http://aws.amazon.com/elasticmapreduce/ Forum : https://forums.aws.amazon.com/forum.jspa?forumID=52

Thanks Questions? Sujee Maniyam http://sujee.net hello@sujee.net Devil’s slide, Pacifica

Cost effective BigData Processing on Amazon EC2

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Destaque

Destaque (20)

Semelhante a Cost effective BigData Processing on Amazon EC2

Semelhante a Cost effective BigData Processing on Amazon EC2 (20)

Último

Último (20)

Cost effective BigData Processing on Amazon EC2