2. Music discovery powered by Scrobbling
Personalised radio, social network, events, a “wikipedia of music”
High traffic
Monthly visitors: 40 million
Monthly page views: 500 million
6. So, what’s there to talk about?
Good Things™
New users. It’s really cool.
Bad Things™
Lots of new users, traffic spikes
A very important, high-profile launch
How did Last.fm approach this?
7. Xbox Live: 15 million users
assuming a 10% take-up rate = 1,500,000 users
startup: 5 requests + starting radio: 5 requests + 15 minutes of radio: 60 requests
1 hour of radio = 250 requests per user
an hour of radio per user is a rough average guess
1,500,000 users = 375,000,000 requests over 24 hours
assuming an even distribution = 4,500 requests / second
Likely peaking at more than triple = 15,000 requests / second
Last.fm: 2,000 requests/sec
based on the number of servers and Apache configuration
estimated max capacity of 3,500 requests per second
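The arithmetic above fits in a few lines of code; a quick sketch of the same back-of-envelope estimate (the 10% take-up, the per-user request counts and the "more than triple" peak factor are the guesses from the slides, not measurements):

    # rough launch-traffic estimate, using the guesses from the slides
    xbox_live_users = 15_000_000
    take_up = 0.10                       # assume 10% of Xbox Live users try it
    users = int(xbox_live_users * take_up)          # 1,500,000

    # startup + starting radio + 1 hour of radio (15 min of radio = 60 requests)
    requests_per_user = 5 + 5 + 4 * 60              # 250

    total_requests = users * requests_per_user      # 375,000,000 per day
    even_rate = total_requests / 86_400             # ~4,300 req/s if spread evenly
    peak_rate = even_rate * 3.5                     # assume peaks of more than 3x

    print(f"{users:,} users, {total_requests:,} requests/day")
    print(f"~{even_rate:,.0f} req/s evenly spread, ~{peak_rate:,.0f} req/s at peak")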
9. What next?
Picked a metric: requests per second
Estimated traffic increase vs capacity
Selected our goals:
Serve requests faster
Reduce the number of requests
10. Profiling traffic
Used traffic generated during beta testing
Web server request logs
Common Log Format, widely supported
Hundreds of existing tools
We generated some stats using AWK...
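AWK over the access logs is all this step needs; the same kind of per-method breakdown in Python might look like the sketch below (the log file name and the "method=" query parameter layout are assumptions):

    import re
    from collections import Counter

    # count requests per API method from a common-format access log
    method_re = re.compile(r'"GET [^"]*method=([\w.]+)')
    counts = Counter()

    with open("access.log") as log:
        for line in log:
            match = method_re.search(line)
            if match:
                counts[match.group(1)] += 1

    total = sum(counts.values()) or 1
    for method, n in counts.most_common(10):
        print(f"{method:<30} {n:>10} ({n / total:.1%})")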
15. Why so many track.getInfo calls?
A tiny UI tweak...
...responsible for 25% of calls.
Arrggghhhhhh
Added that information to a sensible API call
Microsoft kindly updated the app
17. What about the getImages calls?
Powers an artist slideshow visualisation
Results of this call won’t change often
Set an HTTP cache timeout
Set caching on a few other calls too
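How that looks depends on your stack; a minimal WSGI sketch (the endpoint and the 24-hour lifetime are illustrative, not Last.fm's actual code) that lets a reverse proxy or client cache a rarely-changing response:

    from wsgiref.simple_server import make_server

    def artist_get_images(environ, start_response):
        # results rarely change, so let shared caches hold them for 24 hours
        headers = [
            ("Content-Type", "application/json"),
            ("Cache-Control", "public, max-age=86400"),
        ]
        start_response("200 OK", headers)
        return [b'{"images": []}']

    if __name__ == "__main__":
        make_server("", 8080, artist_get_images).serve_forever()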
21. What happens if things break?
Simulated failing calls
Highlighted essential calls
Acted as a dry-run for launch day failures
Informed our backup plans
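One way to run that kind of simulation is to wrap the API client and make chosen methods fail on purpose, then watch what the app does; a hedged sketch (the client interface and method names are hypothetical):

    import random

    class FailureSimulator:
        """Wraps an API client and makes selected methods fail on purpose."""

        def __init__(self, client, broken_methods, failure_rate=1.0):
            self.client = client
            self.broken_methods = set(broken_methods)
            self.failure_rate = failure_rate

        def call(self, method, **params):
            if method in self.broken_methods and random.random() < self.failure_rate:
                raise RuntimeError(f"simulated failure for {method}")
            return self.client.call(method, **params)

    # e.g. watch what happens when getImages is down but radio still works:
    # api = FailureSimulator(real_client, broken_methods={"artist.getImages"})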
23. Prepare for the worst
Unexpected problems we’ve had:
Servers overheating (twice)
Hardware (almost) stolen from data centres
Power outage in the office
24. Backup plans, AKA the “Kill List”
Plan                              | Effect                        | Severity
Disable radio DB-backing          | Faster calls                  | Minor
Disable Flash Player              | Save 200 req/sec              | Major
Drop non-essential Xbox API calls | Reduce Xbox traffic by 0-50%  | Extreme
Drop X% of radio tune calls       | Reduce Xbox traffic by X%     | Nuclear
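Each entry on the list maps to a switch that can be flipped at runtime without a deploy; a rough sketch of the "nuclear" option, dropping a configurable fraction of radio tune calls (names and mechanism are illustrative, not Last.fm's actual code):

    import random

    # kill switches, adjustable at runtime (e.g. from a config service)
    KILL_LIST = {
        "disable_radio_db_backing": False,
        "disable_flash_player": False,
        "radio_tune_drop_fraction": 0.0,   # the "nuclear" option: 0.0 - 1.0
    }

    def handle_radio_tune(request):
        if random.random() < KILL_LIST["radio_tune_drop_fraction"]:
            return 503, "Radio temporarily unavailable, please retry"
        return 200, tune_radio(request)

    def tune_radio(request):
        ...  # normal tuning logic would go here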
27. How did it go?
Our estimate was about 50% too high
Didn’t exceed capacity (but got quite close)
Profiling and caching were essential
Or we would have gone down
28. What did we learn?
Use timezones to roll out slowly
Traffic will follow daily trends
Live monitoring is essential
Backup plans are comforting
Pre-fill caches before launch
30. 1. Estimate
Choose your metric
Estimate launch traffic
Compare against capacity
Make performance targets
Know your limitations
31. 2. Profile requests
Start with a sample of traffic
Extract data for your metric
Visualise the results
Identify expensive requests for your metric
Use profiling tools on individual requests
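For the "visualise" step, even a crude plot is enough to spot daily trends and spikes; a stdlib-only sketch that buckets request timestamps into minutes and prints a text histogram (the input file and its one-timestamp-per-line format are assumptions):

    from collections import Counter

    # one UNIX timestamp per line, extracted from the access log beforehand
    buckets = Counter()
    with open("timestamps.txt") as f:
        for line in f:
            buckets[int(float(line)) // 60] += 1

    peak = max(buckets.values())
    for minute in sorted(buckets):
        bar = "#" * int(40 * buckets[minute] / peak)
        print(f"{minute * 60:>12}  {buckets[minute]:>7} req/min  {bar}")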
32. 3. Optimise
Reduce number of requests
Set the right HTTP caching headers
Combine with a reverse web proxy
Prime caches for common calls
Use an object cache
Avoid language-level optimisation
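An object cache sits between the API and the expensive lookups; a minimal in-process sketch of the idea and of priming it before launch (in production a shared cache such as memcached plays this role, and the helper names in the usage comment are hypothetical):

    import time

    class ObjectCache:
        """A tiny in-process TTL cache, standing in for a shared object cache."""

        def __init__(self, ttl_seconds=300):
            self.ttl = ttl_seconds
            self.store = {}

        def get_or_compute(self, key, compute):
            entry = self.store.get(key)
            if entry and time.time() - entry[0] < self.ttl:
                return entry[1]          # still fresh, skip the expensive work
            value = compute()
            self.store[key] = (time.time(), value)
            return value

    cache = ObjectCache(ttl_seconds=600)

    # prime caches for common calls before launch traffic arrives:
    # for artist in popular_artists():
    #     cache.get_or_compute(("artist.getInfo", artist),
    #                          lambda a=artist: fetch_artist_info(a))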
36. 4. Plan for failure
Simulate failures
Know your weak spots
Prepare backup plans
Communicate with users and partners
37. 5. Launch it!
Roll out slowly, if you can
Set up live monitoring
If something goes wrong:
Don’t panic
Keep people updated
Have some champagne on ice
38. 1. Start with an estimate
2. Profile your traffic
3. Make optimisations
4. Prepare for the worst
5. Launch it!