2. Music discovery powered by Scrobbling
Personalised radio, social network, events, a “wikipedia of music”
High traffic
Monthly visitors: 40 million
Monthly page views: 500 million
6. So, what’s there to talk about?
Good Things™
New users. It’s really cool.
Bad Things™
Lots of new users, traffic spikes
A very important, high-profile launch
How did Last.fm approach this?
7. Xbox Live: 15 million users
assuming a 10% take-up rate = 1,500,000 users
startup: 5 requests + starting radio: 5 requests + 15 minutes of radio: 60 requests
1 hour of radio = 250 requests per user
an hour of radio per user is a rough average guess
1,500,000 users = 375,000,000 requests over 24 hours
assuming an even distribution = 4,500 requests / second
Likely peaking at more than triple = 15,000 requests / second
Last.fm: 2,000 requests/sec
based on the number of servers and Apache configuration
estimated max capacity of 3,500 requests per second
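The arithmetic above fits in a few lines of code; a quick sketch of the same back-of-envelope estimate (the 10% take-up, the per-user request counts and the "more than triple" peak factor are the guesses from the slides, not measurements):

    # rough launch-traffic estimate, using the guesses from the slides
    xbox_live_users = 15_000_000
    take_up = 0.10                       # assume 10% of Xbox Live users try it
    users = int(xbox_live_users * take_up)          # 1,500,000

    # startup + starting radio + 1 hour of radio (15 min of radio = 60 requests)
    requests_per_user = 5 + 5 + 4 * 60              # 250

    total_requests = users * requests_per_user      # 375,000,000 per day
    even_rate = total_requests / 86_400             # ~4,300 req/s if spread evenly
    peak_rate = even_rate * 3.5                     # assume peaks of more than 3x

    print(f"{users:,} users, {total_requests:,} requests/day")
    print(f"~{even_rate:,.0f} req/s evenly spread, ~{peak_rate:,.0f} req/s at peak")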
9. What next?
Picked a metric: requests per second
Estimated traffic increase vs capacity
Selected our goals:
Serve requests faster
Reduce the number of requests
10. Profiling traffic
Used traffic generated during beta testing
Web server request logs
Common Log Format, widely supported
Hundreds of existing tools
We generated some stats using AWK...
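AWK over the access logs is all this step needs; the same kind of per-method breakdown in Python might look like the sketch below (the log file name and the "method=" query parameter layout are assumptions):

    import re
    from collections import Counter

    # count requests per API method from a common-format access log
    method_re = re.compile(r'"GET [^"]*method=([\w.]+)')
    counts = Counter()

    with open("access.log") as log:
        for line in log:
            match = method_re.search(line)
            if match:
                counts[match.group(1)] += 1

    total = sum(counts.values()) or 1
    for method, n in counts.most_common(10):
        print(f"{method:<30} {n:>10} ({n / total:.1%})")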
15. Why so many track.getInfo calls?
A tiny UI tweak...
...responsible for 25% of calls.
Arrggghhhhhh
Added that information to a sensible API call
Microsoft kindly updated the app
17. What about the getImages calls?
Powers an artist slideshow visualisation
Results of this call won’t change often
Set an HTTP cache timeout
Set caching on a few other calls too
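How that looks depends on your stack; a minimal WSGI sketch (the endpoint and the 24-hour lifetime are illustrative, not Last.fm's actual code) that lets a reverse proxy or client cache a rarely-changing response:

    from wsgiref.simple_server import make_server

    def artist_get_images(environ, start_response):
        # results rarely change, so let shared caches hold them for 24 hours
        headers = [
            ("Content-Type", "application/json"),
            ("Cache-Control", "public, max-age=86400"),
        ]
        start_response("200 OK", headers)
        return [b'{"images": []}']

    if __name__ == "__main__":
        make_server("", 8080, artist_get_images).serve_forever()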
21. What happens if things break?
Simulated failing calls
Highlighted essential calls
Acted as a dry-run for launch day failures
Informed our backup plans
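One way to run that kind of simulation is to wrap the API client and make chosen methods fail on purpose, then watch what the app does; a hedged sketch (the client interface and method names are hypothetical):

    import random

    class FailureSimulator:
        """Wraps an API client and makes selected methods fail on purpose."""

        def __init__(self, client, broken_methods, failure_rate=1.0):
            self.client = client
            self.broken_methods = set(broken_methods)
            self.failure_rate = failure_rate

        def call(self, method, **params):
            if method in self.broken_methods and random.random() < self.failure_rate:
                raise RuntimeError(f"simulated failure for {method}")
            return self.client.call(method, **params)

    # e.g. watch what happens when getImages is down but radio still works:
    # api = FailureSimulator(real_client, broken_methods={"artist.getImages"})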
23. Prepare for the worst
Unexpected problems we’ve had:
Servers overheating (twice)
Hardware (almost) stolen from data centres
Power outage in the office
24. Backup plans, AKA the “Kill List”
Plan                              | Effect                        | Severity
Disable radio DB-backing          | Faster calls                  | Minor
Disable Flash Player              | Save 200 req/sec              | Major
Drop non-essential Xbox API calls | Reduce Xbox traffic by 0-50%  | Extreme
Drop X% of radio tune calls       | Reduce Xbox traffic by X%     | Nuclear
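Each entry on the list maps to a switch that can be flipped at runtime without a deploy; a rough sketch of the "nuclear" option, dropping a configurable fraction of radio tune calls (names and mechanism are illustrative, not Last.fm's actual code):

    import random

    # kill switches, adjustable at runtime (e.g. from a config service)
    KILL_LIST = {
        "disable_radio_db_backing": False,
        "disable_flash_player": False,
        "radio_tune_drop_fraction": 0.0,   # the "nuclear" option: 0.0 - 1.0
    }

    def handle_radio_tune(request):
        if random.random() < KILL_LIST["radio_tune_drop_fraction"]:
            return 503, "Radio temporarily unavailable, please retry"
        return 200, tune_radio(request)

    def tune_radio(request):
        ...  # normal tuning logic would go here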
27. How did it go?
Our estimate was about 50% too high
Didn’t exceed capacity (but got quite close)
Profiling and caching were essential
Or we would have gone down
28. What did we learn?
Use timezones to roll out slowly
Traffic will follow daily trends
Live monitoring is essential
Backup plans are comforting
Pre-fill caches before launch
30. 1. Estimate
Choose your metric
Estimate launch traffic
Compare against capacity
Make performance targets
Know your limitations
31. 2. Profile requests
Start with a sample of traffic
Extract data for your metric
Visualise the results
Identify expensive requests for your metric
Use profiling tools on individual requests
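For the "visualise" step, even a crude plot is enough to spot daily trends and spikes; a stdlib-only sketch that buckets request timestamps into minutes and prints a text histogram (the input file and its one-timestamp-per-line format are assumptions):

    from collections import Counter

    # one UNIX timestamp per line, extracted from the access log beforehand
    buckets = Counter()
    with open("timestamps.txt") as f:
        for line in f:
            buckets[int(float(line)) // 60] += 1

    peak = max(buckets.values())
    for minute in sorted(buckets):
        bar = "#" * int(40 * buckets[minute] / peak)
        print(f"{minute * 60:>12}  {buckets[minute]:>7} req/min  {bar}")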
32. 3. Optimise
Reduce number of requests
Set the right HTTP caching headers
Combine with a reverse web proxy
Prime caches for common calls
Use an object cache
Avoid language-level optimisation
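An object cache sits between the API and the expensive lookups; a minimal in-process sketch of the idea and of priming it before launch (in production a shared cache such as memcached plays this role, and the helper names in the usage comment are hypothetical):

    import time

    class ObjectCache:
        """A tiny in-process TTL cache, standing in for a shared object cache."""

        def __init__(self, ttl_seconds=300):
            self.ttl = ttl_seconds
            self.store = {}

        def get_or_compute(self, key, compute):
            entry = self.store.get(key)
            if entry and time.time() - entry[0] < self.ttl:
                return entry[1]          # still fresh, skip the expensive work
            value = compute()
            self.store[key] = (time.time(), value)
            return value

    cache = ObjectCache(ttl_seconds=600)

    # prime caches for common calls before launch traffic arrives:
    # for artist in popular_artists():
    #     cache.get_or_compute(("artist.getInfo", artist),
    #                          lambda a=artist: fetch_artist_info(a))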
36. 4. Plan for failure
Simulate failures
Know your weak spots
Prepare backup plans
Communicate with users and partners
37. 5. Launch it!
Roll out slowly, if you can
Set up live monitoring
If something goes wrong:
Don’t panic
Keep people updated
Have some champagne on ice
38. 1. Start with an estimate
2. Profile your traffic
3. Make optimisations
4. Prepare for the worst
5. Launch it!