Presentation by Narendra Venkataraman and Peter McArthur at the AWS Summit in NYC 2011. How Vimeo uses on-demand, spot, and reserved instances in EC2 to do video transcoding.
7. Our Bidding Strategy: Keep it simple • One-time spot requests; one instance per request; across all availability zones • Spot requests expire in 10 minutes • Bid 10% more than the average price over the last hour • Never bid more than the threshold, currently set to 80% of the on-demand price • Not more than 10 open spot requests at any time
8. Reserved Instance Utilization • Chart: on-demand % busy • 96.23% utilization with 54 instances on a low-traffic day • High-priority jobs: buy reserved instance capacity to meet non-peak-hour loads
10. Pro Tips • Use spots for your low-priority and less time-critical jobs • Never kill spots. Let Amazon do it. • Have more retries for jobs running on spots • Watch out for open spot requests. Add expiry to your requests. • For long-running jobs, bid higher or use on-demand • Fail over to on-demand when the spot market is saturated
High-quality video: first to do HD, work hard on supporting every format and getting the most out of the video. Video player: a close second, with the HTML5 player; we want videos on Vimeo to be playable everywhere: iOS, desktop, TV. Friendliest and most supportive community: lots of positive people who like making videos. Good tools for sharing and privacy: you don't have to share your videos with the whole world. Two types of users: free and Plus.
It was there. Managed hosting, expensive storage in 2007. We moved source file storage to S3 because we could do it really easily. No contracts. It was cheaper than what we had. It had more features than what we had. We needed something right away, with low commitment. Our first auto-scaling EC2 transcoders went up in 2008. Our encoding load at peak was 3-4x higher than at non-peak. Our users don't care whether it's a peak time or not, and we prefer not to pay for transcoding machines the five days a week we aren't using them. This workload is perfect for EC2. Since then our workload has normalized a bit, but that was the situation at the time.
Upload machine (long connections, doesn't scale well, high IO). S3. Transcoding machine (jobs take minutes to transcode; open source toolkit; as tightly coupled as can be: controlled from our datacenter via ssh. Works great if you are small. Thousands of ssh connections, though... we don't recommend this).
We have had a pretty good auto-scaling system in place since April 2009. We constantly improve it. We're refreshing it now to make it stateless. At peak we run a few hundred c1.xlarge instances, plus dozens of m1.large instances for uploads. We're experimenting with GPU and cluster compute instances. We buy reserved instances to bring our costs down. In the past we bought enough to keep them at 100% utilization: if our lowest utilization on a weekday was 50 instances, we bought that number. We also buy them to guarantee capacity for our Plus members. Our Plus members shouldn't wait, even when Amazon is low on on-demand instances. The availability guarantee is important. We have had trouble getting capacity at times, for as long as a few days. Now we are buying them to get to 75% utilization. You save money if you use 55% or more.
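The break-even point mentioned above can be worked out with a little arithmetic. The following is a minimal sketch, assuming a reserved instance costs a one-time upfront fee plus a discounted hourly rate that is charged only while the instance runs. The prices are made-up placeholders, not Amazon's actual rates, so the result will not reproduce the 55% figure from the talk; the function names are hypothetical too.

```python
HOURS_PER_YEAR = 24 * 365


def break_even_utilization(upfront_fee, reserved_hourly, on_demand_hourly,
                           term_hours=HOURS_PER_YEAR):
    """Fraction of the term an instance must run before reserving is cheaper.

    Reserved cost:  upfront_fee + reserved_hourly * u * term_hours
    On-demand cost: on_demand_hourly * u * term_hours
    Setting the two equal and solving for u gives the break-even utilization.
    """
    hourly_saving = on_demand_hourly - reserved_hourly
    if hourly_saving <= 0:
        raise ValueError("reserved hourly rate must be below the on-demand rate")
    return upfront_fee / (term_hours * hourly_saving)


def baseline_reserved_count(hourly_instance_counts):
    """The older '100% utilization' rule from the talk: reserve only as many
    instances as were busy during the quietest hour observed."""
    return min(hourly_instance_counts)


# Placeholder prices (assumptions, in USD) for a one-year term.
u = break_even_utilization(upfront_fee=1820.0,
                           reserved_hourly=0.24,
                           on_demand_hourly=0.68)
print(f"break-even utilization: {u:.1%}")
print("baseline reserved count:", baseline_reserved_count([50, 80, 210, 120]))
```

Buying beyond the quietest-hour baseline lowers average utilization of the reserved pool, which is still worthwhile as long as that utilization stays above the break-even fraction.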
Don't manage spots manually :) Amazon has awesome API support. Leverage it. We found Python boto to be very stable and easy to use. We like aws-lib for Node.js. We use it for SQS. Thinking of using spots for your web servers and database machines? DON'T DO IT.
We keep it simple: one-time spot requests, one instance per request, with an expiry of 10 minutes, across all availability zones. We get the average price over the last hour and bid 10% more than that. It is a little more complex than that. Watch how many spots you are launching. We have no more than 10 spot requests "open" at any time. Our bids never exceed 80% of the on-demand price. The 80 came from the 80-20 rule: when in doubt, pick 80. Let us just say we keep tweaking the threshold, and currently it is set to 80%.
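The bidding rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Vimeo's actual code: the talk itself says the real logic is a little more complex. The on-demand price is a placeholder, the function names are hypothetical, and capping the bid at the threshold (rather than skipping the bid) is one possible reading of "never bid more than the threshold". The price history would come from the EC2 API (for example via boto), which is left out here so the sketch runs without AWS credentials.

```python
from statistics import mean

ON_DEMAND_PRICE = 0.68    # placeholder hourly on-demand price in USD (assumption)
BID_MARKUP = 1.10         # bid 10% above the last hour's average spot price
THRESHOLD = 0.80          # never bid above 80% of the on-demand price
MAX_OPEN_REQUESTS = 10    # at most 10 open spot requests at any time
REQUEST_TTL_MINUTES = 10  # each one-time request expires after 10 minutes


def compute_bid(last_hour_prices, on_demand_price=ON_DEMAND_PRICE):
    """Return the bid price, or None when there is no price history."""
    if not last_hour_prices:
        return None
    bid = mean(last_hour_prices) * BID_MARKUP
    cap = on_demand_price * THRESHOLD
    return round(min(bid, cap), 4)


def requests_to_open(instances_wanted, open_requests):
    """How many new one-instance requests may be placed right now."""
    headroom = max(0, MAX_OPEN_REQUESTS - open_requests)
    return min(instances_wanted, headroom)


print(compute_bid([0.20, 0.22, 0.24]))  # average plus 10%, below the cap
print(compute_bid([0.60, 0.70]))        # capped at 80% of on-demand
print(requests_to_open(5, 8))           # only two slots left
```

Keeping each request to a single instance with a short expiry means a stale bid never lingers in the market after the price has moved.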
However, Amazon recommends bidding the maximum price you are comfortable with. We don't do that, primarily because we never kill spots. That is not completely true: when we need to scale down, we terminate spots only when we have no more on-demand instances to kill. Why do we never have to kill spots? We carefully estimate how many machines we need to keep running all the time and buy reserved instance capacity to meet that demand.
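The scale-down order described above (on-demand first, spots only when no on-demand instances are left) might look like this. It is a rough sketch: the `Instance` record and the function name are hypothetical, and treating reserved instances as never eligible for termination is an assumption based on the note that reserved capacity covers the always-on baseline.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    instance_id: str
    kind: str          # "reserved", "on-demand", or "spot"
    running_jobs: int


# Lower rank is terminated first; reserved instances are never candidates.
TERMINATION_RANK = {"on-demand": 0, "spot": 1}


def pick_instances_to_terminate(instances, count):
    """Choose up to `count` instances to shut down when scaling in.

    On-demand instances go before spots, and within each kind the instance
    with the fewest running jobs goes first.
    """
    candidates = [i for i in instances if i.kind in TERMINATION_RANK]
    candidates.sort(key=lambda i: (TERMINATION_RANK[i.kind], i.running_jobs))
    return candidates[:count]


fleet = [
    Instance("r1", "reserved", 1),
    Instance("d1", "on-demand", 2),
    Instance("d2", "on-demand", 0),
    Instance("s1", "spot", 0),
]
print([i.instance_id for i in pick_instances_to_terminate(fleet, 3)])
```

Because spots sit last in the ordering, an idle spot instance still outlives a busy on-demand one, which matches the preference for keeping the cheaper capacity.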
Spots can be saturated during peak hours. Deal with it.
We terminate machines that have no jobs running on them or whose termination has the least impact. Also, we terminate machines 5 minutes before the start of the next billing hour. Jobs running on spots are retried twice as many times as jobs running on reserved or on-demand instances. If a job has failed too many times and/or is delayed beyond the acceptable wait time, it gets to run on an on-demand machine. Use spots for your low-priority and less time-critical jobs. If you have long-running jobs, bid higher or just use on-demand instances. Understand your workload and tweak the spot algorithm to suit your needs: that is the single most important thing you can take from this talk.
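The billing-hour and retry rules above can be sketched as follows. This assumes hourly billing that starts at instance launch; the base retry count, the acceptable wait, and all function names are hypothetical values chosen for illustration, since the talk does not give them.

```python
from datetime import datetime, timedelta

BILLING_PERIOD = timedelta(hours=1)
TERMINATE_WINDOW = timedelta(minutes=5)  # act 5 minutes before the next billed hour
BASE_RETRIES = 3                         # hypothetical retry budget off-spot
MAX_WAIT = timedelta(minutes=30)         # hypothetical acceptable delay


def in_termination_window(launch_time, now):
    """True during the last 5 minutes of the instance's current billing hour."""
    elapsed = (now - launch_time) % BILLING_PERIOD
    return BILLING_PERIOD - elapsed <= TERMINATE_WINDOW


def max_retries(instance_kind):
    """Jobs on spots get twice the retries of jobs on reserved or on-demand."""
    return BASE_RETRIES * 2 if instance_kind == "spot" else BASE_RETRIES


def next_pool(instance_kind, failures, waited):
    """Move a job to on-demand once it has failed too often or waited too long."""
    if failures > max_retries(instance_kind) or waited > MAX_WAIT:
        return "on-demand"
    return instance_kind


launched = datetime(2011, 6, 10, 9, 0)
print(in_termination_window(launched, launched + timedelta(minutes=30)))
print(in_termination_window(launched, launched + timedelta(minutes=56)))
print(next_pool("spot", failures=7, waited=timedelta(minutes=5)))
```

Terminating just before the hour boundary gets the full value out of an hour that has already been paid for, while the on-demand fallback bounds how long a job can be held back by spot interruptions.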