What's up?

Bouvet BigOne, 2011-10-27
Lars Marius Garshol, <larsga@bouvet.no>
http://twitter.com/larsga

The problem with RSS readers

• Many feeds (like newspapers’) are too busy,
and so unread stories pile up
• In most feeds you are only interested in a small
subset of the posts
• Staying on top of the flow of news and digging
up the interesting stuff is hard work

What’s up?

• A newsreader that tries to solve this for you
• It uses statistics to figure out which news items are
the most interesting for you
– the statistics are based on feedback from you
• Everything is collected into a single list, sorted
by relevance and freshness
• Stories sink slowly as they age, so if you don’t
read them they gradually fade away

[Screenshot: story list; button labels: Like, Dislike, Mark as read, Not used]

Very interesting post about beer,
but the word “beer” doesn’t actually
appear anywhere

Probabilities combined with Bayes’s
theorem.
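
The slides do not spell out how the probabilities are combined. A minimal sketch, assuming each word in a post carries an estimated probability that a post containing it is interesting, and assuming the common naive Bayes combination with equal priors. The function name `combine` and the clamping bounds are hypothetical, not taken from the whazzup source:

```python
def combine(probs, lo=0.01, hi=0.99):
    """Combine per-word probabilities into one probability for the post.

    Assumes the words are conditionally independent (naive Bayes) and
    that "interesting" and "not interesting" are equally likely a priori.
    """
    yes = 1.0
    no = 1.0
    for p in probs:
        # Clamp so that a single 0.0 or 1.0 cannot decide the result
        # or cause a division by zero (an assumption of this sketch).
        p = min(max(p, lo), hi)
        yes *= p
        no *= 1.0 - p
    return yes / (yes + no)
```

This would also account for the beer example: a post can score high without the word "beer" as long as enough of its other words have been seen in liked posts.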

Utterly irrelevant post about sports

Adding feeds

Probably the usability Achilles'
heel of the system right now

Three implementations

• In-memory single-user version
– worked well for me for several years
– wanted to try it out with more users
• Google AppEngine version
– easy to build and deploy
– used way too much CPU
• “Traditional” version
– PostgreSQL backend, ordinary web hosting
– seems to scale much better

The goal

• Make the site pay for its own hosting
– currently solved by running it on my personal web
server
– expect system to outgrow that server soon
• Move to cloud hosting
– candidates: Amazon EC2, Heroku, Google AppEngine
w/ MySQL
• Income from Google Ads
– income per user likely to be very low
– scaling challenge: support enough users to pay for
computing resources

Data structure

[Diagram: entities Feed, Post, User, Subscription, Rated post]

• Good
– fully normalized, no redundancy
– simple and natural
• Bad
– showing main page requires many joins
– limited possibilities for caching
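
To make the trade-off concrete, here is a rough sketch of such a normalized schema and the kind of join the main page needs. All table and column names are hypothetical, the relationships between the entities are assumed, and SQLite stands in for the real database so the example runs anywhere:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feeds (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE posts (id INTEGER PRIMARY KEY,
                    feed INTEGER REFERENCES feeds(id), title TEXT);
CREATE TABLE users (username TEXT PRIMARY KEY);
CREATE TABLE subscriptions (username TEXT REFERENCES users(username),
                            feed INTEGER REFERENCES feeds(id));
CREATE TABLE rated_posts (username TEXT REFERENCES users(username),
                          post INTEGER REFERENCES posts(id), points REAL);

-- Sample data, invented for illustration
INSERT INTO feeds VALUES (1, 'Beer blog');
INSERT INTO posts VALUES (10, 1, 'New stout'), (11, 1, 'Pub review');
INSERT INTO users VALUES ('lars');
INSERT INTO subscriptions VALUES ('lars', 1);
INSERT INTO rated_posts VALUES ('lars', 10, 120.0), ('lars', 11, 80.0);
""")

# Nothing is stored twice, so the story list has to join its way
# from the rated posts back to the post and feed titles.
STORY_LIST = """
SELECT feeds.title, posts.title, rated_posts.points
FROM rated_posts
JOIN posts ON posts.id = rated_posts.post
JOIN feeds ON feeds.id = posts.feed
WHERE rated_posts.username = ?
ORDER BY rated_posts.points DESC
"""

rows = conn.execute(STORY_LIST, ("lars",)).fetchall()
```

Because the result depends on the individual user's ratings, it is hard to share a cached copy between users.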

Queueing

• The original version would respond to clicks in
real-time
– meant recomputing all stories on each up/down vote,
before showing the page again
– not a very pleasant user experience
• Changed over to a queue approach
– user clicks added to a queue
– queue retrieves tasks, processes them, may add more
– scheduled tasks injected into queue
– admin command-line tool 1) to inject tasks when needed
– works beautifully
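
A minimal sketch of the queue approach, using an in-process `queue.Queue` as a stand-in for the real inter-process message queue. The task names and the `log` list are hypothetical and exist only to show the flow:

```python
import queue

tasks = queue.Queue()
log = []  # records what the worker did, for illustration only

def record_vote(user, post, vote):
    log.append(("vote", user, post, vote))
    # A vote changes the user's statistics, so a follow-up task is queued.
    tasks.put((recalculate, (user,)))

def recalculate(user):
    log.append(("recalculate", user))

def handle_click(user, post, vote):
    """Called by the web frontend: enqueue the work and return at once."""
    tasks.put((record_vote, (user, post, vote)))

def run_worker():
    """Process tasks until the queue is empty; tasks may add more tasks."""
    while True:
        try:
            func, args = tasks.get_nowait()
        except queue.Empty:
            break
        func(*args)

handle_click("lars", 42, "up")
run_worker()
```

A long-running worker would block on `tasks.get()` instead of stopping when the queue is empty; scheduled tasks and the admin tool would simply be additional producers calling `tasks.put`.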

1) http://code.google.com/p/whazzup/source/browse/send.py

Google AppEngine experience

• Easy to build, painless to deploy
– web.py and Python well supported
– good queue and scheduled tasks APIs
• Datastore and GQL too primitive
– high latency registers as high CPU usage (costly)
– very, very limited support for letting the database do
the work, leads to poor performance
• AppEngine apps require heavy caching to work
– not really possible with this application
• Would have hit limit of free usage at 4 users
– not a realistic proposition

Example problem

• How to implement aging of posts?
– that is, reducing score as the posts get older
• Could compute score when loading story list
– not possible in GQL (no expression language)
• Could run scheduled tasks once an hour
– in GQL this requires loading all RatedPost objects
into main process
– way too resource-intensive
• Just didn’t scale at all

(probability * 1000.0) / math.log(ageinsecs)
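
Wrapped in a function, the expression above shows how the score falls as a post ages. The function name `score` is hypothetical, and the sketch assumes the age is always more than one second so that the logarithm is positive:

```python
import math

def score(probability, ageinsecs):
    """Relevance score that decays with the logarithm of the post's age."""
    # ageinsecs must be > 1, otherwise log() is zero or negative.
    return (probability * 1000.0) / math.log(ageinsecs)

hour_old = score(0.9, 60 * 60)
day_old = score(0.9, 24 * 60 * 60)
```

Because the age enters through a logarithm, a post loses points quickly at first and then more and more slowly, which matches stories sinking gradually down the list.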

Current architecture
100% Python
Based on web.py
Single server so far

[Diagram components: Apache w/ mod_python (x2), cron, IPC message queue 1), queue worker, three download threads, PostgreSQL, DBM files]

1) http://semanchuk.com/philip/sysv_ipc/

Aging posts with Postgres

• First attempt
– load posts, compute in Python, save to DB
– took 1.1 seconds per subscription
– with ~50 subscriptions per user, that’s much too slow
• Second attempt
– do calculation in the SQL update statement 1)
– takes 0.5 seconds per user
– more than 100 times faster
– may still be too slow
• with 7200 users it would take an hour

1) http://code.google.com/p/whazzup/source/browse/dbqueue.py?r=4e
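
The idea behind the second attempt can be sketched as a single UPDATE that lets the database compute the new score for all of a user's posts. This is not the statement from dbqueue.py: the table and column names are hypothetical, and SQLite with a registered `ln` function stands in for PostgreSQL (which has `ln()` built in) so the example is runnable:

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.create_function("ln", 1, math.log)  # natural log, as in PostgreSQL
conn.execute("""CREATE TABLE rated_posts (
    username TEXT, post_id INTEGER, probability REAL,
    published INTEGER,  -- publication time, seconds since the epoch
    points REAL)""")

now = 1_000_000_000  # fixed "current time" for a repeatable example
conn.executemany(
    "INSERT INTO rated_posts VALUES (?, ?, ?, ?, NULL)",
    [("lars", 1, 0.9, now - 3600),     # one hour old
     ("lars", 2, 0.9, now - 86400)])   # one day old

def age_posts(conn, username, now):
    """Recompute every score for one user without loading rows into Python."""
    conn.execute(
        "UPDATE rated_posts"
        " SET points = (probability * 1000.0) / ln(? - published)"
        " WHERE username = ?",
        (now, username))

age_posts(conn, "lars", now)
ranked = [row[0] for row in conn.execute(
    "SELECT post_id FROM rated_posts ORDER BY points DESC")]
```

The per-row work is the same as before; the saving comes from not moving each post out of the database and back again.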

More performance tricks

• Loading story pages is a bit expensive
– because of SQL joins required
– now handling votes with AJAX, so page doesn’t have
to be reloaded for every vote
– next step: caching feed titles and story titles?
• Separate worker threads for feed downloading
– because feeds may be slow to respond
– threads save feed XML to disk, then queue task to
process feed
– ParseFeed task doesn’t have any network latency
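
A rough sketch of that hand-off, with one thread per feed for simplicity. The `fetch` function is a stand-in that returns canned XML instead of making an HTTP request, and the file layout, feed URLs, and task tuple are all hypothetical:

```python
import os
import queue
import tempfile
import threading

tasks = queue.Queue()
spool = tempfile.mkdtemp()  # directory where downloaded feed XML is kept

def fetch(url):
    # Stand-in for the real (possibly slow) HTTP download.
    return "<rss><!-- fetched from %s --></rss>" % url

def download(feed_id, url):
    """Runs in its own thread: save the XML, then queue a parse task."""
    path = os.path.join(spool, "feed-%d.xml" % feed_id)
    with open(path, "w", encoding="utf-8") as f:
        f.write(fetch(url))
    tasks.put(("ParseFeed", feed_id, path))

feeds = {1: "http://example.org/a.rss", 2: "http://example.org/b.rss"}
threads = [threading.Thread(target=download, args=item)
           for item in feeds.items()]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The queue worker then only ever reads local files, so one slow feed cannot stall the processing of the others.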

Statistics

• No perceptible server load
• Bottlenecks:
– loading story list pages
– parsing feeds, calculating points

Future architecture

[Diagram components: three web frontends, memcached?, cron, message queue (Gearman?), DB cluster (PostgreSQL?), three queue workers, DBM files?]

More information

• Blog post
– http://www.garshol.priv.no/blog/216.html
• Source code
– http://code.google.com/p/whazzup/
• Pre-alpha trial
– open to anyone; sign up if you’re interested
– no guarantees about anything
– http://whazzup.garshol.priv.no/
– currently limited to 100 users (87 accounts available)
