Big Data Cloud Meetup Cost Effective Big-Data Processing using Amazon Elastic Map Reduce Sujee Maniyam s@sujee.net   /  www.sujee.net July 08, 2011
Hi, I’m Sujee 10+ years of software development: enterprise apps → web apps → iphone apps → Hadoop More: http://sujee.net/tech
I am  an ‘expert’ 
Quiz PRIZE! Where was this picture taken?
Quiz : Where was this picture taken?
Answer : Montara Light House
Ah.. Data
Nature of Data… Primary data: email, blogs, pictures, tweets. Critical for operation (Gmail can’t lose emails). Secondary data: Wikipedia access logs, Google search logs. Not ‘critical’, but used to ‘enhance’ user experience. Search logs help predict ‘trends’; Yelp can figure out you like Chinese food
Data Explosion Primary data has grown phenomenally, but secondary data has exploded in recent years: “log everything and ask questions later.” Used for recommendations (books, restaurants, etc.), predicting trends (job skills in demand), showing ads ($$$), etc. ‘Big Data’ is no longer just a problem for the big guys (Google / Facebook); startups are struggling to get on top of ‘big data’
Big Guys
Startups
Startups and bigdata
Hadoop to the Rescue Hadoop can help with BigData. Hadoop has been proven in the field and is under active development. Throw hardware at the problem: it is getting cheaper by the year. Bleeding edge technology: hire good people!
Hadoop: It is a CAREER
Data Spectrum
Who is Using Hadoop?
About This Presentation Based on my experience with a startup: 5 people (3 engineers), ad-serving space, Amazon EC2 is our ‘data center’. Technologies: Web stack: Python, Tornado, PHP, MySQL (LAMP). Amazon EMR to crunch data. Data size: 1 TB / week
Story of a Startup…month-1 Each web server writes logs locally. Logs were copied to a log-server and purged from web servers. Log data size: ~100-200 G
Story of a Startup…month-6 More web servers come online. Aggregate log server falls behind
Data @ 6 months 2 TB of data already 50-100 G new data / day  And we were operating at 20% of our capacity!
Future…
Solution? Scalable database (NoSQL): HBase, Cassandra. Hadoop log processing / Map Reduce
What We Evaluated 1) HBase cluster 2) Hadoop cluster 3) Amazon EMR
Hadoop on Amazon EC2 1) Permanent cluster 2) On-demand cluster (Elastic MapReduce)
1) Permanent Hadoop Cluster
Architecture 1
Hadoop Cluster 7 c1.xlarge machines, 15 TB EBS volumes. Sqoop imports MySQL log tables into HDFS. Logs are compressed (gz) to minimize disk usage (a data-locality trade-off). All is working well…
2 months later Couple of EBS volumes DIE. Couple of EC2 instances DIE. Maintaining the hadoop cluster is a mechanical job → less appealing. COST! Our job utilization is about 50%, but we are still paying for machines running 24x7
Lessons Learned c1.xlarge is pretty stable (8 core / 8G memory). EBS volumes max out at 1 TB, so string a few together for higher density per node. DON’T RAID them; let hadoop handle them as individual disks. They might fail: back up data on S3. Or skip EBS entirely: use instance store disks and keep data in S3. Use Apache Whirr to set up a cluster easily
Amazon Storage Options
Amazon EC2 Cost
Hadoop cluster on EC2 cost $3,500 = 7 c1.xlarge @ $500 / month $1,500 = 15 TB EBS storage @ $0.10 per GB $500 = EBS I/O requests @ $0.10 per 1 million I/O requests → $5,500 / month ≈ $66,000 / year!
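The monthly figure above is easy to sanity-check; a quick sketch using the 2011 prices quoted on the slide (1 TB counted as 1000 GB, as the slide does):

```python
# Monthly cost model for the permanent 7-node cluster, figures from the slide.
instances   = 7 * 500            # 7 c1.xlarge at ~$500 / month each
ebs_storage = 15 * 1000 * 0.10   # 15 TB of EBS at $0.10 per GB-month
ebs_io      = 500                # ~5 billion I/O requests at $0.10 per million

monthly = instances + ebs_storage + ebs_io
yearly  = monthly * 12
print(monthly, yearly)           # ~$5,500 / month, ~$66,000 / year
```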
Buy / Rent? Typical hadoop machine cost: $10-15k → 10-node cluster ≈ $100k, plus data center costs, plus IT-ops costs. Amazon EC2 10-node cluster: $500 * 10 = $5,000 / month = $60k / year
Buy / Rent Amazon EC2 is great for: quickly getting started; startups; scaling on demand / rapidly adding more servers (popular social games). Netflix story: streaming is powered by EC2, encoding movies, etc., using 1000s of instances. Not so economical for running clusters 24x7. http://blog.rapleaf.com/dev/2008/12/10/rent-or-own-amazon-ec2-vs-colocation-comparison-for-hadoop-clusters/
Buy vs Rent
Next : Amazon EMR
Where was this picture taken?
Answer : Pacifica Pier
Amazon’s Elastic Map Reduce Basically an ‘on demand’ hadoop cluster: store data on Amazon S3, kick off a hadoop cluster to process it, shut down when done. Pay for the HOURS used
Architecture2 : Amazon EMR
Moving parts Logs go into Scribe. Scribe master ships logs into S3, gzipped. Spin up EMR cluster, run job, done. Using the same old Java MR jobs for EMR. Summary data gets written directly to MySQL (no output files from reducers)
EMR Wins Cost → only pay for use (http://aws.amazon.com/elasticmapreduce/pricing/). Example: EMR ran on 5 c1.xlarge for 3 hrs. EC2 instances for 3 hrs = $0.68 per hr x 5 instances x 3 hrs = $10.20. EMR cost = 5 instances x 3 hrs x 8 normalized hrs x $0.12 = $14.40 (per http://aws.amazon.com/elasticmapreduce/faqs/#billing-4, 1 hour of c1.xlarge = 8 hours of normalized compute time). Plus S3 storage cost: 1 TB / month = $150. Data bandwidth from S3 to EC2 is FREE! → ~$25 per run
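The per-run arithmetic above can be sketched the same way (2011 prices from the slide):

```python
# One EMR run: 5 c1.xlarge instances for 3 hours.
ec2_cost = 0.68 * 5 * 3        # on-demand EC2 hourly rate x instances x hours
emr_fee  = 5 * 3 * 8 * 0.12    # normalized hours (c1.xlarge = 8x) x EMR surcharge

job_total = ec2_cost + emr_fee # ~ $24.60, the "~$25" on the slide
print(round(job_total, 2))
```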
Design Wins Bidders now write logs to Scribe directly; no MySQL at web server machines → writes are much faster! S3 has been reliable and cheap storage
EMR Wins No hadoop cluster to maintain, no failed nodes / disks
EMR Wins Hadoop clusters can be of any size! Can have multiple hadoop clusters: smaller jobs → fewer machines; memory-hungry tasks → m1.xlarge; cpu-hungry tasks → c1.xlarge
EMR trade-offs Lower performance on MR jobs compared to a dedicated cluster: reduced data throughput (S3 isn’t the same as local disk), and data is streamed from S3 for each job. EMR Hadoop is not the latest version. Missing tools: Oozie. Right now, trading performance for convenience and cost
Lessons Learned Debugging a failed MR job is tricky: because the hadoop cluster is terminated, there are no log files left → save log files to S3
Lessons: Script everything Scripts to launch EMR jar jobs, with custom parameters depending on job needs (instance types, size of cluster, etc.); monitor job progress; save logs for later inspection; record job status (finished / cancelled). https://github.com/sujee/amazon-emr-beyond-basics
Sample Launch Script

#!/bin/bash
## run-sitestats4.sh

# config
MASTER_INSTANCE_TYPE="m1.large"
SLAVE_INSTANCE_TYPE="c1.xlarge"
INSTANCES=5
export JOBNAME="SiteStats4"
export TIMESTAMP=$(date +%Y%m%d-%H%M%S)
# end config

echo "==========================================="
echo $(date +%Y%m%d.%H%M%S) " > $0 : starting...."
export t1=$(date +%s)

export JOBID=$(elastic-mapreduce --plain-output \
    --create --name "${JOBNAME}__${TIMESTAMP}" \
    --num-instances "$INSTANCES" \
    --master-instance-type "$MASTER_INSTANCE_TYPE" \
    --slave-instance-type "$SLAVE_INSTANCE_TYPE" \
    --jar s3://my_bucket/jars/adp.jar \
    --main-class com.adpredictive.hadoop.mr.SiteStats4 \
    --arg s3://my_bucket/jars/sitestats4-prod.config \
    --log-uri s3://my_bucket/emr-logs/ \
    --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
    --args "--core-config-file,s3://my_bucket/jars/core-site.xml,--mapred-config-file,s3://my_bucket/jars/mapred-site.xml")

sh ./emr-wait-for-completion.sh
Lessons: tweak the cluster for each job (mapred-config-m1-xl.xml):

<configuration>
    <property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xmx1024M</value>
    </property>
    <property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xmx3000M</value>
    </property>
    <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>3</value>
    </property>
    <property>
        <name>mapred.output.compress</name>
        <value>true</value>
    </property>
    <property>
        <name>mapred.output.compression.type</name>
        <value>BLOCK</value>
    </property>
</configuration>
Saved Logs
Sample Saved Log
Map reduce tips: Control the amount of input We get different types of events: event A (freq: 10,000) >>> event B (100) >> event C (1). Initially we put them all into a single log file: A A A A B A A B C
Control Input… So we had to process the entire file even if we were interested only in ‘event C’ → too much wasted processing. So we split the logs: log_A….gz, log_B….gz, log_C….gz. Now we only process a fraction of our logs. Input: s3://my_bucket/logs/log_B* (x-ref using memcache if needed)
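A minimal sketch of that split step (the comma-separated "event type first" record layout and the file naming are assumptions for illustration):

```python
import os

def split_log(lines, out_dir):
    """Write each event to its own per-type file, e.g. log_A.txt, log_B.txt."""
    handles = {}
    for line in lines:
        etype = line.split(",", 1)[0]   # assume the event type is the first field
        if etype not in handles:
            handles[etype] = open(os.path.join(out_dir, "log_%s.txt" % etype), "w")
        handles[etype].write(line + "\n")
    for h in handles.values():
        h.close()
    return sorted(handles)              # event types seen
```

A job that only cares about rare event C now reads log_C* instead of scanning the whole combined file.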
Map reduce tips: Data joining (x-ref) Data is split across log files, so we need to x-ref during the Map phase. We used to load the data into the mapper’s memory (the data was small and lived in MySQL). Now we use Membase (memcached protocol). Two MR jobs are chained: the first processes logfile_type_A and populates Membase (very quick, takes minutes); the second processes logfile_type_B, cross-referencing values from Membase
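The two-pass join can be sketched like this, with a plain dict standing in for Membase (the record fields here are invented for illustration):

```python
def pass1_build_lookup(a_records):
    """Job 1: scan the small type-A log and load key -> value into the store."""
    return {r["id"]: r["name"] for r in a_records}

def pass2_join(b_records, lookup):
    """Job 2: map over the big type-B log, cross-referencing from the store."""
    for r in b_records:
        yield dict(r, name=lookup.get(r["id"], "unknown"))
```

The point of the design: the big log is scanned exactly once, and every mapper in job 2 sees the same shared lookup, which in the real pipeline lives in Membase rather than a local dict.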
X-ref
Map reduce tips: Logfile format CSV → JSON. Started with CSV: "2","26","3","07807606-7637-41c0-9bc0-8d392ac73b42","MTY4Mjk2NDk0eDAuNDk4IDEyODQwMTkyMDB4LTM0MTk3OTg2Ng","2010-09-09 03:59:56:000 EDT","70.68.3.116","908105","http://housemdvideos.com/seasons/video.php?s=01&e=07","908105","160x600","performance","25","ca","housemdvideos.com","1","1.2840192E9","0","221","0.60000","NULL","NULL". 20-40 fields… fragile, position dependent, hard to code: url = csv[18]… counting position numbers gets old after the 100th time around. if (csv.length == 29) url = csv[28]; else url = csv[26]
Map reduce tips: Logfile format JSON: { "exchange_id": 2, "url": "http://housemdvideos.com/seasons/video.php?s=01&e=07" …}. Self-describing, easy to add new fields, easy to process: url = map.get('url'). Flatten JSON to fit on ONE LINE. Compresses pretty well (not much data inflation)
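The difference in fragility is easy to demonstrate (field values below are made up):

```python
import json

# Positional CSV parsing: field 4 must stay the URL forever.
csv_line = '"2","26","3","2010-09-09","http://housemdvideos.com/video.php"'
url_by_position = csv_line.split(",")[4].strip('"')   # breaks if a field is inserted

# JSON is self-describing: new fields cannot shift the URL.
json_line = '{"exchange_id": 2, "url": "http://housemdvideos.com/video.php"}'
url_by_name = json.loads(json_line)["url"]
```

The naive split(",") above also breaks on commas inside quoted fields, which is one more reason positional CSV parsing got old quickly.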
Map reduce tips: Incremental Log Processing Recent data (today / yesterday / this week) is more relevant than older data (6 months +)
Map reduce tips: Incremental Log Processing Adding a ‘time window’ to our stats: only process newer logs → faster
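A sketch of the window filter, assuming a log_YYYYMMDD.gz naming convention (the naming is invented for illustration):

```python
from datetime import date, timedelta

def in_window(filename, today, days=7):
    """Keep only log files whose date stamp falls within the last `days` days."""
    stamp = filename.split("_")[1].split(".")[0]   # "log_20110708.gz" -> "20110708"
    d = date(int(stamp[:4]), int(stamp[4:6]), int(stamp[6:8]))
    return timedelta(0) <= (today - d) <= timedelta(days=days)
```

The job's input list is then just the files that pass the filter, instead of every log ever written.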
Next Steps
Where was this pic taken?
Answer : Foster City
Next steps: faster processing Streaming S3 data for each MR job is not optimal. Instead: spin up cluster, copy data from S3 to HDFS, run all MR jobs (make use of data locality), terminate
Next Steps : More Processing More MR jobs More frequent data processing Frequent log rolls Smaller delta window (1 hr / 15 mins)
Next steps: new software Pig; python → mrjob (from Yelp); Scribe → Cloudera Flume?; workflow tools like Oozie; Hive? (ad-hoc SQL-like queries)
Next Steps: SPOT instances SPOT instances: name your price (eBay style). Been available on EC2 for a while; just became available for Elastic MapReduce! New cluster setup: 10 normal instances + 10 spot instances. Spots may go away anytime. That is fine! Hadoop will handle node failures. Bigger cluster: cheaper & faster
Example Price Comparison
In summary… Amazon EMR could be a great solution We are happy!
Take a test drive Just bring your credit-card  http://aws.amazon.com/elasticmapreduce/ Forum : https://forums.aws.amazon.com/forum.jspa?forumID=52
Thanks Questions? Sujee Maniyam http://sujee.net hello@sujee.net Devil’s slide, Pacifica

Mais conteúdo relacionado

Mais procurados

Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processingHave your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
DataWorks Summit
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
DataWorks Summit
 
Qubole Overview at the Fifth Elephant Conference
Qubole Overview at the Fifth Elephant ConferenceQubole Overview at the Fifth Elephant Conference
Qubole Overview at the Fifth Elephant Conference
Joydeep Sen Sarma
 

Mais procurados (20)

Apache Spark & Hadoop
Apache Spark & HadoopApache Spark & Hadoop
Apache Spark & Hadoop
 
Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processingHave your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
 
The Evolution of Apache Kylin
The Evolution of Apache KylinThe Evolution of Apache Kylin
The Evolution of Apache Kylin
 
Hadoop2 new and noteworthy SNIA conf
Hadoop2 new and noteworthy SNIA confHadoop2 new and noteworthy SNIA conf
Hadoop2 new and noteworthy SNIA conf
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
 
Apache Lens at Hadoop meetup
Apache Lens at Hadoop meetupApache Lens at Hadoop meetup
Apache Lens at Hadoop meetup
 
Hadoop in the cloud with AWS' EMR
Hadoop in the cloud with AWS' EMRHadoop in the cloud with AWS' EMR
Hadoop in the cloud with AWS' EMR
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
Amazon EMR Masterclass
Amazon EMR MasterclassAmazon EMR Masterclass
Amazon EMR Masterclass
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick view
 
Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015
 
Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
 
Qubole Overview at the Fifth Elephant Conference
Qubole Overview at the Fifth Elephant ConferenceQubole Overview at the Fifth Elephant Conference
Qubole Overview at the Fifth Elephant Conference
 
Putting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at NetflixPutting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at Netflix
 
October 2014 HUG : Hive On Spark
October 2014 HUG : Hive On SparkOctober 2014 HUG : Hive On Spark
October 2014 HUG : Hive On Spark
 
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
 

Destaque

Destaque (20)

Big Data and Analytics on AWS
Big Data and Analytics on AWS Big Data and Analytics on AWS
Big Data and Analytics on AWS
 
elasticsearch
elasticsearchelasticsearch
elasticsearch
 
Rstudio in aws 16 9
Rstudio in aws 16 9Rstudio in aws 16 9
Rstudio in aws 16 9
 
Solr on Cloud
Solr on CloudSolr on Cloud
Solr on Cloud
 
White Paper: Turning Anonymous Shoppers into Known Customers
White Paper: Turning Anonymous Shoppers into Known CustomersWhite Paper: Turning Anonymous Shoppers into Known Customers
White Paper: Turning Anonymous Shoppers into Known Customers
 
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
 
AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...
AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...
AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...
 
Real Time Security Analytics
Real Time Security AnalyticsReal Time Security Analytics
Real Time Security Analytics
 
게임 서비스 품질 향상을 위한 데이터 분석 활용하기 - 김필중 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
게임 서비스 품질 향상을 위한 데이터 분석 활용하기 - 김필중 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming게임 서비스 품질 향상을 위한 데이터 분석 활용하기 - 김필중 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
게임 서비스 품질 향상을 위한 데이터 분석 활용하기 - 김필중 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
 
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
 
빅 데이터 분석을 위한 AWS 활용 사례 - 최정욱 솔루션즈 아키텍트:: AWS Cloud Track 1 Intro
빅 데이터 분석을 위한 AWS 활용 사례 - 최정욱 솔루션즈 아키텍트:: AWS Cloud Track 1 Intro빅 데이터 분석을 위한 AWS 활용 사례 - 최정욱 솔루션즈 아키텍트:: AWS Cloud Track 1 Intro
빅 데이터 분석을 위한 AWS 활용 사례 - 최정욱 솔루션즈 아키텍트:: AWS Cloud Track 1 Intro
 
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
 
Hadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluationHadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluation
 
Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWS
 
Machine Learning Loves Hadoop
Machine Learning Loves HadoopMachine Learning Loves Hadoop
Machine Learning Loves Hadoop
 
AWS 비용 최적화 기법 (윤석찬) - AWS 웨비나 시리즈 2015
AWS 비용 최적화 기법 (윤석찬) - AWS 웨비나 시리즈 2015AWS 비용 최적화 기법 (윤석찬) - AWS 웨비나 시리즈 2015
AWS 비용 최적화 기법 (윤석찬) - AWS 웨비나 시리즈 2015
 
빅데이터를 위한 AWS 모범사례와 아키텍처 구축 패턴 :: 양승도 :: AWS Summit Seoul 2016
빅데이터를 위한 AWS 모범사례와 아키텍처 구축 패턴 :: 양승도 :: AWS Summit Seoul 2016빅데이터를 위한 AWS 모범사례와 아키텍처 구축 패턴 :: 양승도 :: AWS Summit Seoul 2016
빅데이터를 위한 AWS 모범사례와 아키텍처 구축 패턴 :: 양승도 :: AWS Summit Seoul 2016
 
아마존웹서비스와 함께하는 클라우드 비용 최적화 전략 - 윤석찬 (AWS 코리아 테크에반젤리스트)
아마존웹서비스와 함께하는 클라우드 비용 최적화 전략 - 윤석찬 (AWS 코리아 테크에반젤리스트)아마존웹서비스와 함께하는 클라우드 비용 최적화 전략 - 윤석찬 (AWS 코리아 테크에반젤리스트)
아마존웹서비스와 함께하는 클라우드 비용 최적화 전략 - 윤석찬 (AWS 코리아 테크에반젤리스트)
 
Zookeeper 소개
Zookeeper 소개Zookeeper 소개
Zookeeper 소개
 
Building Random Forest at Scale
Building Random Forest at ScaleBuilding Random Forest at Scale
Building Random Forest at Scale
 

Semelhante a Cost effective BigData Processing on Amazon EC2

3rd meetup - Intro to Amazon EMR
3rd meetup - Intro to Amazon EMR3rd meetup - Intro to Amazon EMR
3rd meetup - Intro to Amazon EMR
Faizan Javed
 
BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...
BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...
BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...
BigDataCloud
 
Capacity Management from Flickr
Capacity Management from FlickrCapacity Management from Flickr
Capacity Management from Flickr
xlight
 

Semelhante a Cost effective BigData Processing on Amazon EC2 (20)

3rd meetup - Intro to Amazon EMR
3rd meetup - Intro to Amazon EMR3rd meetup - Intro to Amazon EMR
3rd meetup - Intro to Amazon EMR
 
BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...
BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...
BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...
 
How to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesHow to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutes
 
Capacity Management from Flickr
Capacity Management from FlickrCapacity Management from Flickr
Capacity Management from Flickr
 
Bostonrb Amazon Talk
Bostonrb Amazon TalkBostonrb Amazon Talk
Bostonrb Amazon Talk
 
Building prediction models with Amazon Redshift and Amazon ML
Building prediction models with  Amazon Redshift and Amazon MLBuilding prediction models with  Amazon Redshift and Amazon ML
Building prediction models with Amazon Redshift and Amazon ML
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015
 
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best PracticesAWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
 
Building data pipelines
Building data pipelinesBuilding data pipelines
Building data pipelines
 
Amazon web services : Layman Introduction
Amazon web services : Layman IntroductionAmazon web services : Layman Introduction
Amazon web services : Layman Introduction
 
Big Data Real Time Analytics - A Facebook Case Study
Big Data Real Time Analytics - A Facebook Case StudyBig Data Real Time Analytics - A Facebook Case Study
Big Data Real Time Analytics - A Facebook Case Study
 
Deep Dive on Amazon Aurora
Deep Dive on Amazon AuroraDeep Dive on Amazon Aurora
Deep Dive on Amazon Aurora
 
Klmug presentation - Simple Analytics with MongoDB
Klmug presentation - Simple Analytics with MongoDBKlmug presentation - Simple Analytics with MongoDB
Klmug presentation - Simple Analytics with MongoDB
 
Crunch Your Data in the Cloud with Elastic Map Reduce - Amazon EMR Hadoop
Crunch Your Data in the Cloud with Elastic Map Reduce - Amazon EMR HadoopCrunch Your Data in the Cloud with Elastic Map Reduce - Amazon EMR Hadoop
Crunch Your Data in the Cloud with Elastic Map Reduce - Amazon EMR Hadoop
 
Building an Amazon Datawarehouse and Using Business Intelligence Analytics Tools
Building an Amazon Datawarehouse and Using Business Intelligence Analytics ToolsBuilding an Amazon Datawarehouse and Using Business Intelligence Analytics Tools
Building an Amazon Datawarehouse and Using Business Intelligence Analytics Tools
 
10 things I’ve learnt In the clouds
10 things I’ve learnt In the clouds10 things I’ve learnt In the clouds
10 things I’ve learnt In the clouds
 
Cloud Computing ...changes everything
Cloud Computing ...changes everythingCloud Computing ...changes everything
Cloud Computing ...changes everything
 
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
 
Hosting Drupal on Amazon EC2
Hosting Drupal on Amazon EC2Hosting Drupal on Amazon EC2
Hosting Drupal on Amazon EC2
 
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
 

Último

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 

Último (20)

What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 

Cost effective BigData Processing on Amazon EC2

  • 1. Big Data Cloud Meetup Cost Effective Big-Data Processing using Amazon Elastic Map Reduce Sujee Maniyam s@sujee.net / www.sujee.net July 08, 2011
  • 2. Hi, I’m Sujee 10+ years of software development enterprise apps  web apps iphone apps  Hadoop More : http://sujee.net/tech
  • 3. I am an ‘expert’ 
  • 4. Quiz PRIZE! Where was this picture taken?
  • 5. Quiz : Where was this picture taken?
  • 6. Answer : Montara Light House
  • 8. Nature of Data… Primary Data Email, blogs, pictures, tweets Critical for operation (Gmail can’t loose emails) Secondary data Wikipedia access logs, Google search logs Not ‘critical’, but used to ‘enhance’ user experience Search logs help predict ‘trends’ Yelp can figure out you like Chinese food
  • 9. Data Explosion Primary data has grown phenomenally But secondary data has exploded in recent years “log every thing and ask questions later” Used for Recommendations (books, restaurants ..etc) Predict trends (job skills in demand) Show ADS ($$$) ..etc ‘Big Data’ is no longer just a problem for BigGuys (Google / Facebook) Startups are struggling to get on top of ‘big data’
  • 13. Hadoop to Rescue Hadoop can help with BigData Hadoop has been proven in the field Under active development Throw hardware at the problem Getting cheaper by the year Bleeding edge technology Hire good people!
  • 14. Hadoop: It is a CAREER
  • 16. Who is Using Hadoop?
  • 17. About This Presentation Based on my experience with a startup 5 people (3 Engineers) Ad-Serving Space Amazon EC2 is our ‘data center’ Technologies: Web stack : Python, Tornado, PHP, mysql , LAMP Amazon EMR to crunch data Data size : 1 TB / week
  • 18. Story of a Startup…month-1 Each web serverwrites logs locally Logs were copiedto a log-serverand purged from web servers Log Data size : ~100-200 G
  • 19. Story of a Startup…month-6 More web servers comeonline Aggregate log serverfalls behind
  • 20. Data @ 6 months 2 TB of data already 50-100 G new data / day And we were operating at 20% of our capacity!
  • 22. Solution? Scalable database (NOSQL) Hbase Cassandra Hadoop log processing / Map Reduce
  • 23. What We Evaluated 1) Hbase cluster 2) Hadoop cluster 3) Amazon EMR
  • 24. Hadoop on Amazon EC2 1) Permanent Cluster 2) On demand cluster (elastic map reduce)
  • 27. Hadoop Cluster 7 C1.xlarge machines 15 TB EBS volumes Sqoop exports mysql log tables into HDFS Logs are compressed (gz) to minimize disk usage (data locality trade-off) All is working well…
  • 28. 2 months later Couple of EBS volumes DIE Couple of EC2 instances DIE Maintaining the hadoop cluster is mechanical job less appealing COST! Our jobs utilization is about 50% But still paying for machines running 24x7
  • 29. Lessons Learned C1.xlarge is pretty stable (8 cores / 8G memory) EBS volumes max out at 1TB, so string a few together for higher density / node DON’T RAID them; let Hadoop handle them as individual disks They might fail → back up data on S3 Skip EBS. Use instance store disks, and store data in S3 Use Apache WHIRR to set up a cluster easily
  • 32. Hadoop cluster on EC2 cost $3,500 = 7 c1.xlarge @ $500 / month $1,500 = 15 TB EBS storage @ $0.10 per GB $500 = EBS I/O requests @ $0.10 per 1 million I/O requests → $5,500 / month → $66,000 / year !
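The slide's arithmetic can be checked with a quick sketch. Prices are the approximate 2011 figures quoted above; the I/O request volume is an assumption backed out from the $500 line item, not a measured number.

```python
# Rough monthly cost model for the 7-node EC2 Hadoop cluster above.
def monthly_cluster_cost(instances=7, instance_usd=500,
                         ebs_tb=15, ebs_usd_per_gb=0.10,
                         io_million_requests=5000, io_usd_per_million=0.10):
    compute = instances * instance_usd             # 7 c1.xlarge @ $500/month
    storage = ebs_tb * 1000 * ebs_usd_per_gb       # 15 TB of EBS @ $0.10/GB
    io = io_million_requests * io_usd_per_million  # assumed I/O volume -> $500
    return compute + storage + io

monthly = monthly_cluster_cost()   # 3500 + 1500 + 500
yearly = monthly * 12
```

Note the yearly figure: $5,500 a month is $66,000 a year, which is what makes the 24x7 cluster hard to justify at 50% utilization.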
  • 33. Buy / Rent ? Typical hadoop machine cost : $10-15k 10 node cluster = $100k Plus data center costs Plus IT-ops costs Amazon Ec2 10 node cluster: $500 * 10 = $5,000 / month = $60k / year
  • 34. Buy / Rent Amazon EC2 is great for Quickly getting started Startups Scaling on demand / rapidly adding more servers popular social games Netflix story Streaming is powered by EC2 Encoding movies etc. Use 1000s of instances Not so economical for running clusters 24x7 http://blog.rapleaf.com/dev/2008/12/10/rent-or-own-amazon-ec2-vs-colocation-comparison-for-hadoop-clusters/
  • 37. Where was this picture taken?
  • 39. Amazon’s Elastic Map Reduce Basically ‘on demand’ hadoop cluster Store data on Amazon S3 Kick off a hadoop cluster to process data Shutdown when done Pay for the HOURS used
  • 41. Moving parts Logs go into Scribe Scribe master ships logs into S3, gzipped Spin up EMR cluster, run job, done Using same old Java MR jobs for EMR Summary data gets updated directly in MySQL (no output files from reducers)
  • 42. EMR Wins Cost → only pay for use http://aws.amazon.com/elasticmapreduce/pricing/ Example: EMR ran on 5 C1.xlarge for 3 hrs EC2 instances for 3 hrs = $0.68 per hr x 5 inst x 3 hrs = $10.20 http://aws.amazon.com/elasticmapreduce/faqs/#billing-4 (1 hour of c1.xlarge = 8 hours normalized compute time) EMR cost = 5 instances x 3 hrs x 8 normalized hrs x $0.12 = $14.40 Plus S3 storage cost : 1TB / month = $150 Data bandwidth from S3 to EC2 is FREE! → ~$25 bucks
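The per-job math above, as a small sketch. These are the slide's own 2011 example numbers (the $0.68/hr and $0.12 per normalized hour rates), not current AWS pricing:

```python
# Reproduces the EMR cost example: 5 c1.xlarge for 3 hours.
# Per the EMR FAQ quoted above, 1 hour of c1.xlarge = 8 normalized hours.
EC2_C1_XLARGE_HOURLY = 0.68     # $/hr on-demand (2011 figure from the slide)
EMR_PER_NORMALIZED_HR = 0.12    # $/normalized hour (2011 figure from the slide)

instances, hours, normalized = 5, 3, 8
ec2_cost = EC2_C1_XLARGE_HOURLY * instances * hours
emr_cost = instances * hours * normalized * EMR_PER_NORMALIZED_HR
total = ec2_cost + emr_cost     # compute only; S3 storage billed separately
```

The $150/month S3 line item is for keeping 1 TB of logs around regardless of how many jobs run against it.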
  • 43. Design Wins Bidders now write logs to Scribe directly No mysql at web server machines Writes much faster! S3 has been a reliable storage and cheap
  • 44. EMR Wins No Hadoop cluster to maintain → no failed nodes / disks
  • 45. EMR Wins Hadoop clusters can be of any size! Can have multiple Hadoop clusters smaller jobs → fewer machines memory hungry tasks → m1.xlarge cpu hungry tasks → c1.xlarge
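Those sizing rules can be sketched as a tiny helper. The profile labels and instance counts here are illustrative choices, not EMR API values:

```python
def pick_cluster(job_profile):
    """Map a job's resource profile to an (instance type, count) pair.
    Profiles and counts are hypothetical; only the instance types
    come from the slide."""
    if job_profile == "small":
        return ("m1.large", 2)       # smaller jobs -> fewer machines
    if job_profile == "memory_hungry":
        return ("m1.xlarge", 5)      # memory hungry tasks -> m1.xlarge
    if job_profile == "cpu_hungry":
        return ("c1.xlarge", 5)      # cpu hungry tasks -> c1.xlarge
    raise ValueError("unknown profile: %s" % job_profile)
```

Because each EMR cluster is disposable, a launch script can call something like this per job instead of sizing one shared cluster for the worst case.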
  • 46. EMR trade-offs Lower performance on MR jobs compared to a permanent cluster: reduced data throughput (S3 isn’t the same as local disk) Streaming data from S3, for each job EMR Hadoop is not the latest version Missing tools : Oozie Right now, trading performance for convenience and cost
  • 47. Lessons Learned Debugging a failed MR job is tricky Because the Hadoop cluster is terminated → no log files Save log files to S3
  • 48. Lessons : Script every thing scripts to launch jar EMR jobs Custom parameters depending on job needs (instance types, size of cluster ..etc) monitor job progress Save logs for later inspection Job status (finished / cancelled) https://github.com/sujee/amazon-emr-beyond-basics
  • 49. Sample Launch Script
#!/bin/bash
## run-sitestats4.sh

# config
MASTER_INSTANCE_TYPE="m1.large"
SLAVE_INSTANCE_TYPE="c1.xlarge"
INSTANCES=5
export JOBNAME="SiteStats4"
export TIMESTAMP=$(date +%Y%m%d-%H%M%S)
# end config

echo "==========================================="
echo $(date +%Y%m%d.%H%M%S) " > $0 : starting...."
export t1=$(date +%s)
export JOBID=$(elastic-mapreduce --plain-output --create \
  --name "${JOBNAME}__${TIMESTAMP}" \
  --num-instances "$INSTANCES" \
  --master-instance-type "$MASTER_INSTANCE_TYPE" \
  --slave-instance-type "$SLAVE_INSTANCE_TYPE" \
  --jar s3://my_bucket/jars/adp.jar \
  --main-class com.adpredictive.hadoop.mr.SiteStats4 \
  --arg s3://my_bucket/jars/sitestats4-prod.config \
  --log-uri s3://my_bucket/emr-logs/ \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "--core-config-file,s3://my_bucket/jars/core-site.xml,--mapred-config-file,s3://my_bucket/jars/mapred-site.xml")

sh ./emr-wait-for-completion.sh
  • 50. Lessons : tweak cluster for each job Mapred-config-m1-xl.xml
<configuration>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx1024M</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx3000M</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>3</value>
  </property>
  <property>
    <name>mapred.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.output.compression.type</name>
    <value>BLOCK</value>
  </property>
</configuration>
  • 53. Map reduce tips : Control the amount of Input We get different type of events event A (freq: 10,000) >>> event B (100) >> event C (1) Initially we put them all into a single log file A A A A B A A B C
  • 54. Control Input… So we have to process the entire file, even if we are interested only in ‘event C’ → too much wasted processing So we split the logs log_A….gz log_B….gz log_C…gz Now only processing a fraction of our logs Input : s3://my_bucket/logs/log_B* x-ref using memcache if needed
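The split step can be sketched in a few lines. This assumes the event type is the first comma-separated field of each log line, which is an illustrative format, not the deck's actual schema:

```python
from collections import defaultdict

def split_by_event(lines):
    """Bucket mixed log lines by event type, mimicking the split into
    separate log_A / log_B / log_C files so a job can read only one."""
    buckets = defaultdict(list)
    for line in lines:
        event = line.split(",", 1)[0]   # assumed: event type is field 1
        buckets[event].append(line)
    return buckets

# A A B A B C -- the frequent event A dominates the mixed stream
mixed = ["A,x", "A,y", "B,x", "A,z", "B,y", "C,x"]
buckets = split_by_event(mixed)
```

In production each bucket would be gzipped and written under its own S3 prefix, so a job interested in ‘event B’ sets its input to s3://my_bucket/logs/log_B* and never touches the event-A volume.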
  • 55. Map reduce tips: Data joining (x-ref) Data is split across log files, need to x-ref during Map phase Used to load the data in mapper’s memory (data was small and in mysql) Now we use Membase (Memcached) Two MR jobs are chained First one processes logfile_type_A and populates Membase (very quick, takes minutes) Second one, processes logfile_type_B, cross-references values from Membase
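A minimal in-memory sketch of the two chained jobs, with a plain dict standing in for Membase. The record fields ("id", "campaign", "clicks") are hypothetical names for illustration:

```python
def job1_build_lookup(type_a_records, store):
    """First MR job: populate the key/value store from logfile_type_A
    (small and quick, as on the slide)."""
    for rec in type_a_records:
        store[rec["id"]] = rec["campaign"]

def job2_enrich(type_b_records, store):
    """Second MR job: cross-reference logfile_type_B against the store
    during the map phase."""
    return [dict(rec, campaign=store.get(rec["id"], "unknown"))
            for rec in type_b_records]

store = {}   # stands in for Membase / Memcached
job1_build_lookup([{"id": "908105", "campaign": "performance"}], store)
enriched = job2_enrich([{"id": "908105", "clicks": 25}], store)
```

The point of the chaining is that the mappers of the second job never hold the lookup data themselves; they only issue key lookups, which is what made the move from in-mapper memory to Membase possible as the data grew.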
  • 56. X-ref
  • 57. Map reduce tips: Logfile format CSV → JSON Started with CSV CSV: "2","26","3","07807606-7637-41c0-9bc0-8d392ac73b42","MTY4Mjk2NDk0eDAuNDk4IDEyODQwMTkyMDB4LTM0MTk3OTg2Ng","2010-09-09 03:59:56:000 EDT","70.68.3.116","908105","http://housemdvideos.com/seasons/video.php?s=01&e=07","908105","160x600","performance","25","ca","housemdvideos.com","1","1.2840192E9","0","221","0.60000","NULL","NULL 20-40 fields… fragile, position dependent, hard to code (url = csv[18]… counting position numbers gets old after the 100th time around) If (csv.length == 29) url = csv[28] else url = csv[26]
  • 58. Map reduce tips: Logfile format JSON: { exchange_id: 2, url : “http://housemdvideos.com/seasons/video.php?s=01&e=07”….} Self-describing, easy to add new fields, easy to process url = map.get(‘url’) Flatten JSON to fit in ONE LINE Compresses pretty well (not much data inflation)
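The contrast is easy to show. The line below abridges the slide's example record to two fields; parsing it needs no position counting, and new fields can be ignored or defaulted:

```python
import json

# One flattened JSON log record (one line = one record, as on the slide)
line = ('{"exchange_id": 2, '
        '"url": "http://housemdvideos.com/seasons/video.php?s=01&e=07"}')

record = json.loads(line)
url = record["url"]                          # no csv[18] position counting
width = record.get("ad_width", "unknown")    # missing/new fields degrade gracefully
```

Flattening each record to a single line matters for MapReduce: the default input format hands mappers one line at a time, so a multi-line pretty-printed JSON object would be split across records.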
  • 59. Map reduce tips: Incremental Log Processing Recent data (today / yesterday / this week) is more relevant than older data (6 months +)
  • 60. Map reduce tips: Incremental Log Processing Adding ‘time window’ to our stats → only process newer logs → faster
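A sketch of the time-window filter, assuming each record carries a parsed timestamp (the field name "ts" and the 7-day window are illustrative choices):

```python
from datetime import datetime, timedelta

def in_window(records, now, days=7):
    """Keep only records inside the window, so a stats job reprocesses
    recent logs instead of the whole 6-month history."""
    cutoff = now - timedelta(days=days)
    return [r for r in records if r["ts"] >= cutoff]

now = datetime(2011, 7, 8)
records = [{"ts": datetime(2011, 7, 7)},   # yesterday -> kept
           {"ts": datetime(2011, 1, 1)}]   # six months old -> dropped
recent = in_window(records, now)
```

In practice the window can also be applied before the job even starts, by naming log files by date and passing only the matching S3 prefixes as input.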
  • 62. Where was this pic taken?
  • 64. Next steps : faster processing Streaming S3 data for each MR job is not optimal Spin up cluster → copy data from S3 to HDFS → run all MR jobs (make use of data locality) → terminate
  • 65. Next Steps : More Processing More MR jobs More frequent data processing Frequent log rolls Smaller delta window (1 hr / 15 mins)
  • 66. Next steps : new software New Software Pig, python mrjob (from Yelp) Scribe → Cloudera Flume? Use workflow tools like Oozie Hive? Ad-hoc SQL-like queries
  • 67. Next Steps : SPOT instances SPOT instances : name your price (ebay style) Been available on EC2 for a while Just became available for Elastic map reduce! New cluster setup: 10 normal instances + 10 spot instances Spots may go away anytime That is fine! Hadoop will handle node failures Bigger cluster : cheaper & faster
  • 69. In summary… Amazon EMR could be a great solution We are happy!
  • 70. Take a test drive Just bring your credit-card  http://aws.amazon.com/elasticmapreduce/ Forum : https://forums.aws.amazon.com/forum.jspa?forumID=52
  • 71. Thanks Questions? Sujee Maniyam http://sujee.net hello@sujee.net Devil’s slide, Pacifica