3. Who are we?
• Premium photo & video sharing.
• Everyone pays!
• Bootstrapped in ’02.
• $10M+ as of ’07.
• Profitable.
• Top 250 website.
• 35M+ people / month
4. The challenge
• Premium! aka “more” + “better” + “faster”.
• Unlimited storage.
• Unlimited bandwidth.
• Huge photos (100Mpix!). Billions of them.
• Huge videos (1080p, high bitrates)
• Lots of photos per page.
• Super fast.
5. Architecture
early 2006
• Multiple datacenters
• Self-managed
• Self-installed hardware
• Tons of spinning disks
• Tons of custom servers
• Tons of distracting work
• We’re not a datacenter company
6. The phone call
early 2006
• *ring* “Hi, this is Amazon, we’d like to sell you storage.”
• “Say what? Amazon? Storage?”
• “Yeah, how does $0.50/GB/mo sound?”
• ... quick napkin math ... “Sorry, we do $0.20/GB today”
• “Oh, really? Thanks for the feedback.” *click*
• ... days pass ...
• *ring* “Hi, Amazon again. How about $0.15/GB/mo?”
• Sold.
7. It begins
April 2006
• Started simple. Storage - and lots of it.
• Slow at first. “Isn’t Amazon a bookseller?”
• First bill a whopping $1,147.41 in April. ;)
• Redundant backup to begin with.
• Soon, primary with on-site as backup.
• Finally, 100% photos & videos in S3.
• “Wow, this thing is for real!”
8. Show me the money
early 2007
• Guesstimate: ~$500K saved first year
• Actual:
• Growth: 64M photos -> 140M photos
• Stored 200TB at S3
• Disks would have cost: $40K/mo -> $100K/mo
• $922K projected spend, $230K actual
• $692K in cold hard savings
• Taxes! $295K ‘saved’ in cash flow.
• Reselling disks - recouping sunk costs.
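How the numbers above fit together — a minimal napkin-math sketch in Python, using only figures from these slides (the ~$0.20/GB self-hosted cost is derived from them, not a quoted rate):

```python
# Napkin math behind the phone call and the 2007 savings (figures from the slides).
stored_tb = 200                       # TB stored in S3 by early 2007
stored_gb = stored_tb * 1000          # decimal TB -> GB

own_disk_cost_per_month = 40_000      # low end of the $40K-$100K/mo disk estimate
own_cost_per_gb = own_disk_cost_per_month / stored_gb
print(f"self-hosted: ${own_cost_per_gb:.2f}/GB/mo")    # ~$0.20 - the number quoted on the call

s3_rate = 0.15                        # $/GB/mo agreed with Amazon
print(f"S3 at 200TB: ${stored_gb * s3_rate:,.0f}/mo")  # ~$30K/mo

projected, actual = 922_000, 230_000  # projected disk spend vs. actual S3 spend
print(f"savings:     ${projected - actual:,}")          # $692,000
```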
9. The revelation
early 2007
• Yes, S3 is cheap.
• Yes, it’s durable and available.
• Yes, it’s fast.
• But most important: Weight off our shoulders!
• No more hard disk replacements!
• No more midnight datacenter fiascos!
• We can focus on photo sharing!
10. But wait, there’s more...
2007
• Amazon does books ... storage... and compute?!
• Hey, we have lots of compute!
• Web servers, background processing, rendering, etc.
• Buying, installing, maintaining servers
• Often idle.
• Let’s try rendering first.
12. SkyNet Lives!
2007
• First EC2 service: ‘RubberBand’
• Handles all background photo processing
• Automated, near zero human interaction
• Tried to take over the world
• Launched ~1920 cores in a single API call
• Amazon spun them up as requested
• Renamed to ‘SkyNet’ :)
13. SkyNet success
2007-2012+
• Rendering load peaky
• User-driven based on # of photos shot recently
• Only roughly predictable
• Sundays heavy - but how heavy?
• Big spike - but will it last?
• Elastic scaling maximizes throughput, minimizes cost
• Instrument and automate
• No humans!
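SkyNet's internals aren't on the slide; the sketch below only illustrates the queue-driven scaling pattern described above, using a hypothetical SQS work queue and modern boto3 calls (the 2007 original predates both Auto-Scaling and boto3):

```python
"""Queue-driven worker scaling, in the spirit of SkyNet (illustrative only)."""
import boto3

sqs = boto3.client("sqs")
ec2 = boto3.client("ec2")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/render-jobs"  # hypothetical
JOBS_PER_INSTANCE = 50          # tuning knob: backlog one worker can absorb
MAX_WORKERS = 240               # ~1920 cores at 8 cores per instance

def desired_workers() -> int:
    """Size the fleet from the current render backlog."""
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )["Attributes"]
    backlog = int(attrs["ApproximateNumberOfMessages"])
    return min(MAX_WORKERS, -(-backlog // JOBS_PER_INSTANCE))  # ceiling division

def scale_to(count: int) -> None:
    if count > 0:
        ec2.run_instances(
            ImageId="ami-12345678",      # hypothetical worker AMI
            InstanceType="c5.2xlarge",   # any compute-heavy type
            MinCount=count,
            MaxCount=count,
        )

# Run from cron - no humans involved. Workers terminate themselves
# when the queue drains (not shown).
scale_to(desired_workers())
```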
14. Leverage for... new products?
late 2007
• Customers begging us for video
• Not just any video: Hi-Def & high bitrate
• Potentially huge $$ capital expense (lots of servers)
• Totally unknown customer adoption
• Upside? Who knows?!
15. Leverage for new products!
late 2007
• Use EC2! No capital expense!
• If usage takes off, just scale it up!
• If usage falls off a cliff, just turn it off!
• Worked like a charm
• Minimal investment to get it into customers’ hands
• Took off (whew!)
16. New products part two, electric boogaloo
mid 2008
• Customers begging for archival storage
• RAW photos, original video footage, etc
• Breaks our business model
• Potentially costly to implement
• Again, unknown customer adoption
17. New products part two, electric boogaloo
mid 2008
• DevPay to the rescue!
• S3 + Amazon Payments mashup
• We called it ‘SmugVault’
• Store anything you like, pay as you go
• Amazon bills customer directly
• Terabytes of backup storage
• Happy customers
18. Mo money
2009
• Amazon does payments, too?
• Sure, why not. We’ll try it.
• SmugMug subscriptions via Amazon Payments
• Immediate 7% increase in total signups
19. EC2 steamroller begins
2009
• Important new EC2-related services arrive
• Auto-Scaling
• Elastic Load Balancing
• Monitoring
• Able to migrate lots more services to AWS
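None of the configuration is on the slide; as a rough idea of what "Auto-Scaling + ELB + Monitoring" looks like wired together, here's a minimal boto3 sketch with made-up names (the 2009-era API differed, but the concepts match):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Launch configuration: what each web server looks like (AMI and type are hypothetical).
autoscaling.create_launch_configuration(
    LaunchConfigurationName="web-lc",
    ImageId="ami-12345678",
    InstanceType="m5.large",
)

# Auto Scaling group: keep 4-40 instances behind an existing classic ELB.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",
    LaunchConfigurationName="web-lc",
    MinSize=4,
    MaxSize=40,
    AvailabilityZones=["us-east-1a", "us-east-1b"],
    LoadBalancerNames=["web-elb"],      # created separately
)

# Simple scaling policy: add capacity when monitoring (CloudWatch) fires an alarm.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="scale-out",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=4,
)
```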
20. EC2 steamroller: Photos & Videos
2009
• SmugMug’s security & privacy layer complex
• Doesn’t map to S3’s access model
• Needs a proxy layer to intercept & validate requests
• From client straight to AWS
• Bypasses our datacenters
• Auto-Scaling + ELB + EC2 + S3 = Win
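The slide doesn't show how the proxy hands clients off to S3; one common way to get "client straight to AWS" after validating a request is a short-lived pre-signed URL, sketched below with a hypothetical bucket and permission check:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "smugmug-photos-example"    # hypothetical bucket name

def photo_url(user, photo_key: str) -> str:
    """Validate the request against our own privacy rules, then hand the
    client a short-lived S3 URL so the bytes never touch our datacenter."""
    if not user_may_view(user, photo_key):        # our security layer, not S3's
        raise PermissionError(photo_key)
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": photo_key},
        ExpiresIn=300,                            # link expires after 5 minutes
    )

def user_may_view(user, photo_key: str) -> bool:
    return True                                   # stand-in for the real permission check
```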
21. EC2 steamroller: Realtime rendering
2010
• Lots of different devices & screens out there
• SmugMug’s pre-rendered sizes don’t always fit
• Allow realtime dynamic photo resizing server-side
• Any resolution they wish
• Must be lightning fast
• Unpredictable load
• More ELB + Auto-Scaling + EC2 + S3
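As a rough illustration of server-side dynamic resizing (not SmugMug's actual code), the sketch below pulls an original from S3 and resizes it with Pillow; all names are hypothetical:

```python
from io import BytesIO

import boto3
from PIL import Image  # Pillow

s3 = boto3.client("s3")
BUCKET = "smugmug-photos-example"    # hypothetical

def render(photo_key: str, width: int, height: int) -> bytes:
    """Fetch the original from S3 and return a JPEG at the requested size."""
    original = s3.get_object(Bucket=BUCKET, Key=photo_key)["Body"].read()
    img = Image.open(BytesIO(original))
    img.thumbnail((width, height))    # preserves aspect ratio, never upscales
    out = BytesIO()
    img.save(out, format="JPEG", quality=90)
    return out.getvalue()
```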
22. EC2 steamroller: Uploads
2011
• No more proxy uploads to our servers
• Uploads go straight to ELB->EC2->S3
• Can’t use Auto-Scaling - terminates instances too fast
• User-generated, unpredictable load
• ELB + EC2 + S3
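A minimal sketch of that upload path (request hits an EC2 instance behind ELB, which streams the body into S3), again with hypothetical names:

```python
import boto3

s3 = boto3.client("s3")
UPLOAD_BUCKET = "smugmug-uploads-example"   # hypothetical

def handle_upload(user_id: str, filename: str, body) -> str:
    """Runs on an EC2 instance behind ELB; streams the upload straight into S3
    so nothing lands in our own datacenter."""
    key = f"{user_id}/{filename}"
    s3.upload_fileobj(body, UPLOAD_BUCKET, key)   # multipart under the hood for big files
    return key
```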
23. EC2 steamroller
today
• Vast majority of CPU usage in EC2
• 100% photo & video requests served from AWS
• 4 out of 5 customer-facing web clusters 100% in AWS
• 5th one “any day now” - full testing currently underway
• Final stage required advanced AWS functionality
• DynamoDB
• EC2 instances w/SSD (hi1.4xlarge)
• 100% AWS within reach
24. EC2 evolution: hi1.4xlarge
2012
• Finally.
• Extremely high-scale I/O DB-class systems.
• Final missing-link to let us migrate 100% to AWS.
• (We’re already 100% SSD in our datacenters)
• 2TB of SSD storage
• 120,000 random read IOPS
• 10,000 - 85,000 random write IOPS
• omg.
25. Alien technology: DynamoDB
2012
• Finally.
• “S3 for databases”
• Bottomless low-latency datastore.
• Key-value. (aka NoSQL)
• Bulk of our data headed to DynamoDB
• omg.
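"S3 for databases" in practice — a minimal key-value sketch with boto3 and a hypothetical table (not our actual schema):

```python
import boto3

dynamodb = boto3.resource("dynamodb")
photos = dynamodb.Table("photos-example")   # hypothetical table, hash key 'photo_id'

# Write: no capacity planning, no schema migration - just put the item.
photos.put_item(Item={
    "photo_id": "12345",
    "owner": "don",
    "title": "Lake Tahoe",
    "size_bytes": 48_211_309,
})

# Read: low-latency key-value lookup.
item = photos.get_item(Key={"photo_id": "12345"})["Item"]
print(item["title"])
```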
27. Alien technology: CloudSearch
2012
• Billions of documents to search
• Millions of new & changed docs per day
• Many dozens of different facets
• Old system basically duct tape + SSDs
• CloudSearch blazingly fast even with crazy queries.
• omg.
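A rough sketch of a faceted CloudSearch query via boto3 (domain endpoint, fields, and facets are hypothetical):

```python
import boto3

# cloudsearchdomain clients talk to a specific search domain's endpoint.
search = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://search-photos-example.us-east-1.cloudsearch.amazonaws.com",
)

results = search.search(
    query="sunset",
    queryParser="simple",
    facet='{"camera": {"size": 10}, "year": {"size": 5}}',   # facets passed as JSON
    size=20,
)
for hit in results["hits"]["hit"]:
    print(hit["id"])
```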
28. Handling Failure
always & forever
• Everything breaks. Even in your own datacenters.
• Especially in your own datacenters.
• Plan for it.
• With AWS, ‘breaking’ is clearly defined
• Regions, Zones, Services, Instances, etc.
• Mix & match for your needs
• Multi-AZ is currently our sweet spot.
• Minimal impact during various ‘Amazonpocalypse’ events