This document discusses how Business Insider scales its website. On the backend, MongoDB stores blog content and analytics data in separate databases. On the front end, Varnish caching servers cache content and reduce load on the backend servers, using Edge Side Includes to cache parts of a page separately. As traffic grows, more Varnish servers can be added, with layer 7 load balancing distributing traffic so that cache invalidations don't trigger duplicate requests to the backend.
7. Constraints on Scaling BI
✤ Coping with the inherently
unpredictable nature of news
traffic
✤ Some events bring predictably
high traffic (Apple product
announcements), but traffic
spikes can happen anytime
8. Constraints on Scaling BI
✤ Being on top of breaking news is
of huge importance to us
✤ We can’t afford to have latency
between CMS and production
9. Back-end vs. Front-end Scaling
✤ Two different scaling strategies that apply to different use cases
✤ Backend scaling is useful for any site since it helps no matter how
dynamic your content is, but it’s difficult because it involves the
entire stack.
✤ Front-end scaling is useful for sites that need to deliver the same
content to a huge audience.
✤ Which do we use at Business Insider? Both, of course.
11. MongoDB for Backend Scaling
✤ MongoDB is a NoSQL database: it stores documents rather than
relational data, and it lacks transactions.
✤ MongoDB doesn’t ensure writes by default.
✤ These choices make it fast but they need to be understood by
developers using MongoDB.
12. Business Insider Data Constraints
✤ Our data storage for the site content itself is less than 10 GB, growing
a few GB a year.
✤ Images that are blog content, stored in the database using GridFS,
come to another 100 or so GB, growing a few GB a month.
✤ We need to constantly record internal analytics of page views and
unique visitors for business use.
✤ We need updates to be reflected immediately.
✤ Our architecture needs to allow for exponential transaction volume
growth.
13. Business Insider & MongoDB
✤ The blog (including images) is stored on DB1.
✤ Analytics are written to DB3 so the analytics write locks don’t affect
blog performance.
✤ A shared slave is ready to step in
during a failure of either server.
✤ All transactions are performed
against the primaries.
14. Business Insider & MongoDB
✤ As we grow, we plan to move GridFS off the blog server and onto
its own server & slave.
✤ The blog and analytics can be
sharded for performance when that’s
eventually needed.
✤ Our DB servers are dual quad-core
Xeons with 64GB of RAM and SSD
data storage
✤ We’re handling 800-1000 ops/sec
with negligible CPU load.
15. Data Modeling for Scaling
✤ MongoDB allows the storage of embedded documents within a
document. We use this to store an array of comments within the blog
post document they belong to.
✤ This eliminates the costly joins traditional
SQL would require. In order to provide a
moderation interface displaying all recent
comments, we de-normalize that data and
double store it.
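The embed-and-double-store pattern above can be sketched roughly as follows. This is a minimal illustration using plain Python dicts in place of MongoDB documents; the field names, the `add_comment` helper, and the `max_recent` cap are assumptions made for the example, not Business Insider's actual schema.

```python
from datetime import datetime, timezone


def add_comment(post, recent_comments, author, text, max_recent=100):
    """Embed a comment in its post and double-store a copy for moderation."""
    comment = {
        "author": author,
        "text": text,
        "created": datetime.now(timezone.utc),
    }
    # Embedded copy: rendering a post reads the post and its comments together.
    post["comments"].append(comment)
    # De-normalized copy: the moderation view can list recent comments
    # across all posts without scanning every post document.
    recent_comments.insert(0, {**comment, "post_id": post["_id"]})
    # Keep only the newest max_recent entries (hypothetical cap).
    del recent_comments[max_recent:]
    return comment


post = {"_id": "example-post", "title": "Example", "comments": []}
recent = []
add_comment(post, recent, "alice", "First!")
add_comment(post, recent, "bob", "Nice post.")
```

The trade-off is that every comment is written twice, in exchange for both reads (post page and moderation list) needing no join.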
16. Varnish for Front-end Caching
✤ Varnish is a front-end caching reverse proxy.
✤ Varnish retrieves web pages on behalf of clients and caches the result.
Clients that can be served from cache don’t add any load to your
backend.
18. Varnish & Business Insider
✤ We use two Varnish servers with single quad-core processors and
32GB of RAM each. Traffic is randomly load balanced between them, and
each stores a full 24GB RAM cache of the site.
✤ Average weekdays peak around 700 reqs/sec on each Varnish server,
spiking to over 1500 reqs/sec during breaking news such as Apple
quarterly earnings.
✤ Our four backend Apache/PHP
servers tend to see 50-60 reqs/s,
and during breaking news this
only spikes to 70-80 reqs/s.
19. Varnish Active Bans
✤ When an editor publishes or edits a post, a ban request is sent to the
Varnish servers.
✤ They add post pages, vertical pages, and author pages associated
with that post to the Varnish ban list.
✤ The next time a request matches that list, the content is retrieved fresh
from the backend and the cache is refreshed.
✤ This lets us cache but still keep our content totally up-to-date.
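The ban flow described above can be modeled in a few lines. This is a simplified sketch, not Varnish's implementation: the `BanningCache` class, its version counter, and the URL patterns are all hypothetical names chosen for illustration. It shows the key property that a ban does not fetch anything itself; the refresh happens on the next matching request.

```python
import re


class BanningCache:
    """Toy cache where bans are checked lazily at lookup time."""

    def __init__(self, fetch):
        self.fetch = fetch      # callable(url) -> body; stands in for the backend
        self.store = {}         # url -> (body, ban version when it was cached)
        self.bans = []          # list of (version, compiled URL pattern)
        self.version = 0
        self.backend_hits = 0

    def ban(self, pattern):
        # Record the ban; cached objects are not touched until requested.
        self.version += 1
        self.bans.append((self.version, re.compile(pattern)))

    def get(self, url):
        entry = self.store.get(url)
        if entry is not None:
            body, cached_at = entry
            # Only bans issued after the object was cached can invalidate it.
            banned = any(v > cached_at and rx.search(url) for v, rx in self.bans)
            if not banned:
                return body
        self.backend_hits += 1
        body = self.fetch(url)
        self.store[url] = (body, self.version)
        return body


cache = BanningCache(lambda url: f"rendered {url}")
cache.get("/post/1")
cache.get("/post/1")           # served from cache
hits_before_ban = cache.backend_hits
cache.ban(r"^/post/1$")        # editor publishes an update
cache.get("/post/1")           # fetched fresh from the backend
cache.get("/post/1")           # served from cache again
hits_after_ban = cache.backend_hits
```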
21. Edge Side Includes
✤ Varnish allows for the use of Edge Side Includes, parts of a page that
are retrieved separately and have different cache lifetimes.
✤ One example is the “Most Read” widget in
the right rail on Business Insider.
✤ The page hosting the module may be
cached for an hour but the Edge Side
Include hosting the widget has a 5 minute
TTL, keeping it up to date.
22. Edge Side Includes
Another example is the top user menu. This is an Edge Side Include that
includes the logged in user’s ID in the hash, so each user gets this block
customized for them while the rest of the page can be cached more
generally.
Varnish allows you full programmatic control of what is included in the
hash for each request, so complex tricks like this are possible.
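A rough sketch of the per-user hashing idea follows. In Varnish this logic would live in VCL; the Python below only models the decision, and the `/esi/user-menu` path, the `cache_key` function, and the `"anonymous"` fallback are assumptions for illustration.

```python
import hashlib


def cache_key(url, host, user_id=None):
    """Build a cache key from URL and host, adding the user ID for one fragment."""
    h = hashlib.sha256()
    h.update(url.encode())
    h.update(b"\0")
    h.update(host.encode())
    # Hypothetical fragment path: only the user menu varies per user.
    if url.startswith("/esi/user-menu"):
        h.update(b"\0")
        h.update((user_id or "anonymous").encode())
    return h.hexdigest()


page_key_u1 = cache_key("/post/1", "example.com", "u1")
page_key_u2 = cache_key("/post/1", "example.com", "u2")
menu_key_u1 = cache_key("/esi/user-menu", "example.com", "u1")
menu_key_u2 = cache_key("/esi/user-menu", "example.com", "u2")
```

Here the two page keys are identical, so both users share one cached page, while the two menu keys differ, so each user gets their own cached menu fragment.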
23. Edge Side Includes
Edge Side Includes are a common caching standard; this is the same
format used by Akamai and other CDN providers.
Generally you should add a header to any page that includes an ESI, so
that Varnish processes ESI tags only on those pages. If you run that
processing on every payload you’ll waste resources and risk corrupting
any binary or image that happens to contain the sequence “<esi:”.
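The header-gated processing can be illustrated with a small model. This is a sketch under stated assumptions: the `X-ESI` header name, the `deliver` function, and the regular expression are hypothetical, and real ESI parsing in Varnish handles more than this single tag form.

```python
import re

# Matches a self-closing include tag such as <esi:include src="/most-read"/>
ESI_TAG = re.compile(rb'<esi:include\s+src="([^"]+)"\s*/>')


def deliver(headers, body, fetch_fragment):
    """Expand ESI includes only when the backend opted in via a header."""
    if headers.get("X-ESI") != "1":    # hypothetical opt-in header
        return body                    # untouched: safe for images and binaries
    return ESI_TAG.sub(lambda m: fetch_fragment(m.group(1).decode()), body)


def fragment(src):
    # Stand-in for fetching the fragment from cache or backend.
    return b"<ul>" + src.encode() + b"</ul>"


page = b'<html><esi:include src="/most-read"/></html>'
expanded = deliver({"X-ESI": "1"}, page, fragment)
untouched = deliver({}, page, fragment)
```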
24. Scaling Varnish Servers
✤ The load balancer
randomly sends
traffic to the Varnish
servers.
✤ Each Varnish server
caches every page.
✤ Every cache ban
results in two
backend requests, one
from each Varnish
server.
25. Scaling Varnish Servers
✤ Adding a third
Varnish server means
a third backend
request for every ban.
✤ That defeats our
purpose entirely!
26. Scaling Varnish Servers
✤ We can solve this a
few ways, but we’ll
use Layer 7 load
balancing.
✤ We can send a subset
of the URL space to
each Varnish server
✤ Only one copy of each
cached page will exist
on the cluster,
reducing load on the
backend.
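One way to split the URL space is to hash each URL to a server. The sketch below is an illustration only; the server names and both helper functions are hypothetical, and the deck does not say which partitioning scheme the load balancer would use.

```python
import hashlib


def pick_server(url, servers):
    """Route a URL to one Varnish server based on a stable hash of the URL."""
    digest = hashlib.md5(url.encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


def backend_requests_per_ban(n_servers, url_partitioned):
    """Backend fetches triggered by banning one page that is then requested again."""
    # Random balancing: every server holds its own copy and refetches it.
    # URL partitioning: exactly one server holds the page.
    return 1 if url_partitioned else n_servers


servers = ["varnish-1", "varnish-2", "varnish-3"]   # hypothetical names
first = pick_server("/post/1", servers)
second = pick_server("/post/1", servers)            # same URL, same server
random_cost = backend_requests_per_ban(3, url_partitioned=False)
partitioned_cost = backend_requests_per_ban(3, url_partitioned=True)
```

Note that simple modulo hashing remaps most URLs whenever a server is added or removed; consistent hashing is a common alternative when that matters.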
27. A Closing Testimonial From Jay-Z
✤ “If you’re having
scaling problems,
I feel bad for you son...
Clients sent 99 requests
but my backend got
one.”
Photo by flickr user matthew_harrison