This slide deck was presented at Cloud Connect 2013. "Lock, Stock and X Smoking EC2's" was inspired by Guy Ritchie movies. It describes how we put Amazon EMR and Spot EC2 instances to use for a customer and achieved cost savings while solving a Big Data problem.
2. This Presentation is Strongly Inspired by “Guy Ritchie” Movies
Disclaimer: All images are downloaded from the internet. If you find any of the content / images violating copyright, please let me know and I will act upon it immediately.
4. Case
Cigarette smoking is injurious to health
• Mobile Advertising company, USA
• Forbes 1000 clientele
• TBs of unstructured data -> Big Data Problem
5. Lock
• Hourly ~1 TB
• CDN Logs
• Text Files
• XML Files
• Geo data files
• Server logs
• DB records
7. Challenges
• Daily analysis (was OK), monthly analysis (pain) and historical analysis (almost dead)
• How do we transfer, store, analyze and share?
• How do we optimize costs at this scale?
8. Solution
• Use AWS Cloud for hosting Analytics module
• Amazon EMR for unstructured Log Analysis
• Automation using Scripts, Java code and other
tools
9. Stage 1: Data Transfer
(Diagram labels: Social / 3rd Party Feeds/Cloud Logs)
• Tsunami UDP
• ~1 TB of uncompressed logs every hour
• High-bandwidth EC2 instances for Tsunami UDP
• Other Popular Options :
• Aspera
• AWS Import/Export
• WAN optimization
• AWS Direct Connect
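To put the ~1 TB per hour figure in perspective, the sketch below works out the sustained throughput the transfer stage has to hold. It assumes a decimal terabyte (10^12 bytes), which the slides do not specify, and ignores protocol overhead.

```python
ONE_TB = 10**12  # assumption: decimal terabyte; the slides do not say TB vs TiB


def required_gbps(bytes_per_hour: float) -> float:
    """Sustained throughput in Gbit/s needed to move this volume every hour."""
    bits_per_hour = bytes_per_hour * 8
    return bits_per_hour / 3600 / 1e9


if __name__ == "__main__":
    # Roughly 2.2 Gbit/s sustained, before any protocol overhead.
    print(f"{required_gbps(ONE_TB):.2f} Gbit/s")
```

A rate in this range is well beyond what a single default TCP stream typically sustains over a WAN, which is consistent with the choice of a UDP-based tool on high-bandwidth instances.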
10. Stage 2: Storage
(Diagram labels: Amazon S3, Logs)
• Amazon Web Services building block: S3
• Scalable object store
• Inherently fault tolerant
• ~2 TB of compressed logs every day
• S3 Reduced Redundancy (RR) option for intermediate outputs
• Amazon Glacier for archival
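The storage tiers above can be expressed as a small lookup from the role a dataset plays to an S3 storage class. This is only a sketch: the role names are hypothetical, and the string values are the storage class identifiers S3 accepts today, not necessarily how the original pipeline was configured.

```python
# Hypothetical data roles mapped to S3 storage class identifiers.
STORAGE_CLASS_BY_ROLE = {
    "raw_logs": "STANDARD",                # durable copy of the incoming logs
    "intermediate": "REDUCED_REDUNDANCY",  # job outputs that can be recomputed
    "archive": "GLACIER",                  # rarely read historical data
}


def storage_class_for(role: str) -> str:
    """Return the storage class for a data role, rejecting unknown roles."""
    try:
        return STORAGE_CLASS_BY_ROLE[role]
    except KeyError:
        raise ValueError(f"unknown data role: {role!r}") from None
```

Reduced redundancy only makes sense for data that a job can regenerate, which is why it is limited to intermediate outputs here.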
12. Wait!!!
• Amazon EMR is great
• But adding Spot EC2 is super cool
13. What is Amazon Spot ?
• Time-flexible, interruption-tolerant tasks
• Bid Price & Spot Price
• m1.xlarge price comparison
• $0.480 per hour: On-Demand
• $0.052 per hour: Spot
• You will never pay more than your
maximum bid price per hour
• A Spot Instance may be interrupted
• If interrupted, you will not be charged for any partial hour of usage (*free)
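The price comparison above can be worked through directly. The sketch below uses the two m1.xlarge prices from the slide and a simplified charging model (you pay the current Spot price as long as it does not exceed your bid). Partial-hour and interruption billing are deliberately left out.

```python
ON_DEMAND_PER_HOUR = 0.480  # m1.xlarge On-Demand price from the slide
SPOT_PER_HOUR = 0.052       # m1.xlarge Spot price from the slide


def savings_fraction(on_demand: float, spot: float) -> float:
    """Fraction of the On-Demand price saved by paying the Spot price."""
    return (on_demand - spot) / on_demand


def hourly_charge(spot_price: float, max_bid: float):
    """Simplified model: pay the Spot price, or None if it exceeds the bid."""
    if spot_price > max_bid:
        return None  # the instance would not run (or would be interrupted)
    return spot_price


if __name__ == "__main__":
    print(f"{savings_fraction(ON_DEMAND_PER_HOUR, SPOT_PER_HOUR):.1%} saved per hour")
```

At these two prices the per-hour saving is a little over 89%. The overall savings reported later in the deck are lower because only part of the cluster runs on Spot.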
16. Amazon EMR with Spot Instances
Project                  Master Instance Group  Core Instance Group  Task Instance Group
Long-running clusters    on-demand              on-demand            spot
Cost-driven workloads    spot                   spot                 spot
Data-critical workloads  on-demand              on-demand            spot
Application testing      spot                   spot                 spot
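The same matrix can be kept as data so that a cluster launcher can look up the purchase option per instance group. This is an illustrative sketch. The dictionary keys and the helper name are made up here and are not part of any EMR API.

```python
# Purchase option per EMR instance group, by workload type (from the slide).
PURCHASE_OPTIONS = {
    "long-running clusters": {"master": "on-demand", "core": "on-demand", "task": "spot"},
    "cost-driven workloads": {"master": "spot", "core": "spot", "task": "spot"},
    "data-critical workloads": {"master": "on-demand", "core": "on-demand", "task": "spot"},
    "application testing": {"master": "spot", "core": "spot", "task": "spot"},
}


def spot_groups(workload: str) -> list:
    """Return the instance groups that run on Spot for the given workload."""
    options = PURCHASE_OPTIONS[workload]
    return [group for group, option in options.items() if option == "spot"]
```

For example, `spot_groups("long-running clusters")` returns only the task group, since losing a master or core node would put the cluster or its HDFS data at risk.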
17. Stage 4: Custom EMR Manager
(Diagram labels: Social / 3rd Party Feeds, Logs, Amazon S3, Elastic MapReduce, Custom EMR Manager)
• We created a Custom EMR Manager
• Choose Spot based on:
• Past price trend intelligence
• Choose AZ based on current market prices
• Choose between Large vs Extra Large
• Spot pricing strategy:
• Set Spot bid price = On-Demand price
• Go overboard by <20% of the On-Demand price at times
• Dynamic sizing of the Core / Task nodes
• Dynamic EMR cluster creation
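The selection rules above might be sketched as follows. This is not the actual Custom EMR Manager: the function names, the tie-breaking rule, and the price inputs are assumptions for illustration. In practice the current and historical prices would come from the EC2 Spot price history.

```python
from statistics import mean


def pick_zone(current_prices: dict, history: dict) -> str:
    """Pick the AZ with the lowest current Spot price.

    Ties are broken by the lower historical average (an assumed rule).
    """
    def sort_key(az):
        past = history.get(az) or [current_prices[az]]
        return (current_prices[az], mean(past))

    return min(current_prices, key=sort_key)


def bid_price(on_demand: float, aggressive: bool = False,
              max_overbid: float = 0.20) -> float:
    """Bid the On-Demand price, or up to max_overbid above it when aggressive.

    The 20% ceiling is read from the '<20%' on the slide and is an assumption.
    """
    factor = 1.0 + max_overbid if aggressive else 1.0
    return round(on_demand * factor, 4)
```

Bidding at the On-Demand price keeps the worst-case hourly cost no higher than On-Demand while making interruptions less likely than a low bid would.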
18. Some Spot Use Cases
• Analytics & Big Data
• Scientific computing
• Web crawling
• Financial model and Analysis
• Testing
• Image & Media Encoding
66 % savings
50 % savings
57 % savings
19. Learning
• Spot + On demand EC2 is a deadly combination for cost savings
• Every millisecond matters in MR – Tune your code
• Merge Files – Bigger ones are better for processing
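One way to act on the "merge files" advice is to plan batches of small files that together approach a target size before handing them to a job. The sketch below is a simple greedy grouping. The target size and file names are hypothetical, and the merging itself (for example concatenating the files in each batch) is left out.

```python
def plan_merges(file_sizes: dict, target_bytes: int) -> list:
    """Greedily group files (name -> size in bytes) into batches.

    A batch is closed as soon as adding the next file would exceed
    target_bytes; a single oversized file gets a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for name, size in sorted(file_sizes.items()):
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Fewer, larger inputs reduce per-task startup overhead in MapReduce, which is where the milliseconds mentioned above tend to go.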
20. More Learning …
• Custom Job Manager was designed by us
• 1 File Per Mapper was better for our case in AWS
• Understand the performance constraints of AWS and
work with it
• Compress data: both in storage and in transit (LZO & Snappy)
21. Continues…
• Keep configuration data in local memory or Amazon
DynamoDB
• Reducers split files suitable for next job mappers
• Elasticity – Increase/Decrease Task nodes
• Elasticity – Create new EMR Clusters matching the Logs
(Core + Task)
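The elasticity points can be illustrated with a sizing rule that derives the number of task nodes from the volume of logs waiting to be processed. The per-node throughput, the deadline, and the node limits below are hypothetical parameters, not measurements from the project.

```python
import math


def task_nodes_needed(log_bytes: int, bytes_per_node_hour: int,
                      deadline_hours: float, min_nodes: int = 0,
                      max_nodes: int = 100) -> int:
    """Estimate how many task nodes are needed to finish within the deadline.

    The result is clamped to [min_nodes, max_nodes].
    """
    if bytes_per_node_hour <= 0 or deadline_hours <= 0:
        raise ValueError("throughput and deadline must be positive")
    needed = math.ceil(log_bytes / (bytes_per_node_hour * deadline_hours))
    return max(min_nodes, min(max_nodes, needed))
```

With an assumed 50 GB per node per hour, one hour's worth of logs (~1 TB) and a one-hour deadline works out to 20 task nodes. Resizing the task group to that number each cycle is one way to scale up and down with the incoming volume.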
22. Value
• ~56% cost savings over a pure On-Demand model for Core + Task nodes
• Automation vastly reduced labor cost (initial + ongoing)
• Customer CXOs were happy
23. About Us
• AWS Premium Partner
• Solution experts in
• Cloud Computing
• Big Data
• Identity Management