18. The Solution: Partition. Work must be structured such that each resource can complete it independently; there is overhead to dividing the workload.
19. Data architecture: Look at the queries you perform. Divide the data such that each query can be answered by querying no more than one partition.
20. Comments on a profile. Comments(user_id, author_id, comment). Queries: post a comment on a user’s profile; get the list of comments on a user’s profile; delete a comment from a user’s profile. Give up for now: comments written by a user.
21. Comments on a profile. Partition by user. Costs: determining the partition of a user (constant); consistency check on access that the author still exists (linear in the number of comments to display).
23. Alternative Solution Partition by user, duplicate by author Comments(user_id, author_id, comment) AuthoredComments(author_id, user_id, comment_id)
24. Alternative Solution. Comments(user_id, author_id, comment); AuthoredComments(author_id, user_id, comment_id). Costs: double writes, extra storage, delete by author still very expensive.
29. Traditional Systems Architecture www.tuenti.com 12.45.34.179 12.45.34.178 Load Balancer Load Balancer Web server farm Web server farm Web server farm Web server farm
30. AJAX What is AJAX? “Asynchronous JavaScript and XML” Paradigm for client-server interaction Change state on client, without loading a complete HTML page
31. Traditional HTML Browsing User clicks link Browser sends request Server receives, parses request, generates response Browser receives response and begins rendering Dependent objects (images, js, css) load and render Page appears
32. AJAX Browsing User clicks link Browser sends request Server receives, parses request, generates response Browser receives response and begins rendering Dependent objects (images, js, css) load and render Page appears
33. How does Tuenti use AJAX? The only full page loads are login and the home page. A loader pulls in all JS/CSS; afterwards we stay within one HTML page, rotating the canvas area content.
34. Balancing Load Top-level requests to www.tuenti.com Each request tells client which farm it should be using, based on a mapping Mapping can be changed to balance load, perform maintenance, etc
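The mapping described above can be sketched in a few lines. This is a minimal illustration, not Tuenti's implementation: the farm list matches the hostnames on the next slides, but the `overrides` table, the integer `user_id` keys, and the modulo mapping are all assumptions.

```python
# Hypothetical sketch: the top-level response at www.tuenti.com tells the
# client which web server farm to use, based on a user -> farm mapping.
FARMS = ["wwwb1.tuenti.com", "wwwb2.tuenti.com",
         "wwwb3.tuenti.com", "wwwb4.tuenti.com"]

overrides = {}  # user_id -> farm; edited to rebalance load or drain a farm

def farm_for(user_id):
    # Overrides win, so operators can move users off a farm for maintenance
    return overrides.get(user_id, FARMS[user_id % len(FARMS)])
```

Because the mapping lives on the server, changing it (e.g. filling `overrides` to drain one farm) rebalances clients without touching DNS.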
35. Client-side Routing www.tuenti.com wwwb3.tuenti.com wwwb2.tuenti.com wwwb1.tuenti.com wwwb4.tuenti.com Load Balancer Load Balancer Load Balancer Load Balancer Web server farm Web server farm Web server farm Web server farm Linearly scalable …
36. Client-side Routing www.tuenti.com wwwb3.tuenti.com wwwb2.tuenti.com wwwb1.tuenti.com wwwb4.tuenti.com Load Balancer Load Balancer Load Balancer Load Balancer Web server farm Web server farm Web server farm Web server farm Linearly scalable … except for top level
37. Client-side Routing www.tuenti.com wwwb3.tuenti.com wwwb2.tuenti.com wwwb1.tuenti.com wwwb4.tuenti.com Load Balancer Load Balancer Load Balancer Load Balancer Web server farm Web server farm Web server farm Web server farm lots of content creation = lots of dynamic data
38. Client-side Routing www.tuenti.com wwwb3.tuenti.com wwwb2.tuenti.com wwwb1.tuenti.com wwwb4.tuenti.com Load Balancer Load Balancer Load Balancer Load Balancer Web server farm Web server farm Web server farm Web server farm Cache Farm lots of dynamic data = lots of cache = internal network traffic
39. Client-side Routing www.tuenti.com wwwb3.tuenti.com wwwb2.tuenti.com wwwb1.tuenti.com wwwb4.tuenti.com Load Balancer Load Balancer Load Balancer Load Balancer Web server farm Web server farm Web server farm Web server farm Cache Farm Cache Farm Cache Farm Cache Farm Partition cache Route requests to a farm near cache needed to respond
49. What is a CDN? Examples: Akamai, Limelight; also dozens more, including Amazon. A big, distributed object cache. Pay per use: either per request, per TB transferred, or per peak Mbps.
50. What is a CDN? Advantages: Outsource dev and infrastructure Geographically distributed Economies of scale Disadvantages: High cost Less control and transparency Commitments
51. What affects image load time? Client internet connection Response time of CDN CDN cache hit rate
52. What affects image load time? Client internet connection Response time of CDN CDN cache hit rate
54. Monitor Performance from Client. Closer to the performance experienced by the end user. The only way to get a view of network issues faced by users (i.e., the last mile).
56. How to fix a slow ISP? Choose a better transit provider; set up peering (or get a CDN too); traffic management.
57. What affects image load time? Client internet connection Response time of CDN CDN cache hit rate
69. Pre-fetch Content. Exploit predictable user behavior, e.g. clicking to the next photo in an album. Simple solution: load the next image hidden; the client browser will cache it (next response < 100 ms). Increases tolerance for slow response time.
70. Pre-fetch Content. More complex solution: pre-fetch the next canvas (full HTML), render it in the background, and rotate it in on Next. Even more complex: instantiate an HTML template with data on the client; pre-fetch data X photos in advance and render Y templates in advance with this data.
71. Pre-fetch Content Problems: Rendering still takes time Increases browser load Need to set cache headers correctly
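The simple pre-fetch idea above can be sketched as follows. This is an illustrative toy, not Tuenti's client code (which is JavaScript in the browser): `fetch` stands in for the real HTTP download, and the dict plays the role of the browser cache.

```python
cache = {}

def fetch(url):
    # Placeholder for the real HTTP download (hypothetical)
    return "bytes-of-" + url

def show_photo(album, i):
    # Serve from cache if the image was pre-fetched, else fetch it now
    img = cache.pop(album[i], None) or fetch(album[i])
    # Pre-fetch the next photo while the user looks at this one, so that
    # clicking "next" is served locally
    if i + 1 < len(album):
        cache[album[i + 1]] = fetch(album[i + 1])
    return img
```

The trade-off in the "Problems" slide shows up here: every view triggers an extra fetch, and wasted pre-fetches cost bandwidth if the user never clicks next.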
72. Image delivery. Small images: high request rate, low data volume; most cost-effective to cache in memory. Large images: high data volume, low request rate, greater tolerance for latency.
73. What affects image load time? Client internet connection Response time of CDN CDN cache hit rate
ComScore numbers show that we have more traffic than all Google properties combined. ComScore estimates 1 in 6 web pages viewed in Spain is from Tuenti. ComScore numbers are lower than our internal measurements.
This is what makes web programming different from application programming. What matters is how much the system can do in a given period of time, not how much time it needs to do one thing. Below a reasonable threshold, I care about how far out to the right I can get on the curve.
“Scalability” is a property of a system architecture. Generally speaking, a system is scalable if it can continue to perform acceptably well as load increases. The load level at which performance becomes unacceptable is the capacity of the system.
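This definition of capacity can be made concrete with a tiny numeric sketch. The performance curve and all numbers below are made up for illustration; only the definition (the highest load at which response time stays acceptable) comes from the notes.

```python
def latency_ms(load_rps):
    # Toy performance curve: flat response time, then degrading past a knee
    return 20 + max(0, load_rps - 5000) * 0.01

def capacity(threshold_ms=50, step=100):
    # Capacity: the highest load at which latency stays within threshold
    load = 0
    while latency_ms(load + step) <= threshold_ms:
        load += step
    return load
```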
Many users trying to access a resource at the same time.
If I add resources, I should be able to shift the curve right. If the system is linearly scalable, it should be able to handle 2x the requests with 2x the machines.
Performance graph from Tuenti from October
Split resource into two, then send half load to one, half load to the other.
The two resources should perform independently, such that the performance curve for the entire system is the sum of the curves for each resource.
These are the two major caveats. The former is fundamentally a design question, and is essentially a data architecture question. The latter is generally simpler to address, and I’ll discuss it a bit later.
These are the two major caveats. The former is fundamentally a design question, and is essentially a data architecture question. The latter is generally simpler to address, but it adds some constant overhead to each response, such that performance is not simply the sum of two curves for independent systems.
This is a very simple example of comments on a profile. I really only need 3 queries: post (insert) a comment on a user’s profile, get the list of comments posted to a user’s profile, and delete a comment from a user’s profile. I’m going to give up on getting a list of comments written by a user – might be nice, but isn’t critical.
The solution is to partition by user. You need a way to map a user to a partition (hash function, lookup table, etc.). Each partition contains data for a set of users, but all the data for each of those users: if user A is on partition 1, all comments on user A’s profile can be found on partition 1, and none of those comments are stored on any other partition. This imposes some costs. 1) Determining the partition of a user, i.e., computing some partitioning function (hash, lookup table, etc.). 2) Since comments WRITTEN by a given user might be spread across many partitions, we’re unable to delete all such comments when that user is deleted (we can’t even look them up without querying every partition). We can’t have any kind of foreign key to enforce that all comments have valid authors. The only solution is to enforce this when we actually access the author information. In practice, this doesn’t add much overhead: presumably when we want to display the comment, we want to display basic info about the author (such as name) as well. If we’re unable to find that basic info, the author probably doesn’t exist, and at that point we can delete the comment. Slightly more logic is needed to account for this possibility and execute the delete, but it’s not too costly. In fact, it’s constant overhead per comment, and presumably the number of comments we display per request is constant with respect to the rate of requests, so it’s constant with respect to requests.
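The partition-by-user scheme above can be sketched in a few lines. This is a hedged illustration, not Tuenti's code: the in-memory dicts stand in for database shards, the class and constant names are invented, and a real system would use a stable hash or a lookup table rather than Python's `hash`.

```python
N_PARTITIONS = 4  # illustrative shard count

class CommentStore:
    def __init__(self):
        # One dict per partition: user_id -> list of (author_id, comment).
        # All of a user's profile comments live on exactly one partition.
        self.partitions = [dict() for _ in range(N_PARTITIONS)]

    def _partition_of(self, user_id):
        # Constant-cost partitioning function (cost 1 from the slide)
        return self.partitions[hash(user_id) % N_PARTITIONS]

    def post(self, user_id, author_id, comment):
        self._partition_of(user_id).setdefault(user_id, []).append((author_id, comment))

    def get_comments(self, user_id):
        # Each of the three supported queries touches exactly one partition
        return self._partition_of(user_id).get(user_id, [])

    def delete(self, user_id, index):
        self._partition_of(user_id)[user_id].pop(index)
```

Note there is deliberately no `comments_by_author` method: answering that query would require scanning every partition, which is exactly the operation the design gives up on.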
The two resources should perform independently, such that the performance curve for the entire system is the sum of the curves for each resource. The two costs I pointed out on the previous slide add overhead to every request, but this overhead is constant with respect to the rate of requests.
If you need to look up comments by author, this can be achieved by maintaining a second table that is partitioned and indexed by author. Querying one partition can get you a list of all comments written by a user, but to get the content you’ll still need to query the primary partitions for each item, which can be expensive.
Every time a comment is written or deleted, you’ll have to write into both the author partition and the user-profile partition: a constant expense. You’ll have a constant overhead of storage; every byte in the AuthoredComments partition is duplicated data. Selecting by author is still very expensive, unless you duplicate the entirety of the comment data, which would be a significant storage cost. This duplication won’t make deletion any faster; deletion in the worst case could still require hitting every partition. This solution could be appropriate for some workloads, but it has a number of drawbacks that make me inclined not to choose it.
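The dual-write scheme can be sketched as follows. This is an assumption-laden toy: plain dicts stand in for the two partitioned tables, the function names are invented, and a real system would make the two writes transactional or reconcile them asynchronously.

```python
import itertools

comments = {}   # (user_id, comment_id) -> (author_id, text); think "partitioned by user_id"
authored = {}   # author_id -> set of (user_id, comment_id); think "partitioned by author_id"
_next_id = itertools.count(1)

def post(user_id, author_id, text):
    cid = next(_next_id)
    comments[(user_id, cid)] = (author_id, text)                # write 1: user partition
    authored.setdefault(author_id, set()).add((user_id, cid))   # write 2: author partition
    return cid

def texts_by_author(author_id):
    # One lookup on the author index yields references; fetching each text
    # still hits the referenced user partitions (the expensive part)
    return sorted(comments[ref][1] for ref in authored.get(author_id, set()))
```

The cost structure from the slide is visible directly: `post` does two writes, the index stores duplicated reference data, and `texts_by_author` fans out across user partitions.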
What I’ve previously described is the partitioning technique applied to databases. We also use analogous methods for scaling our web server and cache tiers.
Load balancer is single point through which requests run – subject to contention, failure.
Load balancer is a single point through which many requests are sent, making it a possible point of contention.
Applying analogous partitioning techniques to the web server tier.
Traditionally, one way to partition the web server tier is with round-robin DNS (RRDNS) to split requests across two load balancers, but …
… we have AJAX. JavaScript and XML are just technologies; “Asynchronous” is what’s important: the shift in thinking from web browsing as serial, page by page, to more fluid navigation that’s wholly contained within the same HTML page. I’m not going to go much into implementation; it’s a lot of detail, and talking about cross-browser compatibility isn’t so fun or interesting. Focus on approaches: what we’ve learned from scaling on the server side can be applied to the client side.
Using AJAX in application design allows steps 1-6 to collapse a bit.
Using AJAX in application design allows steps 1-6 to collapse a bit; we can play with them, and things don’t have to happen in such a serial order.
Doesn’t eliminate the single point of contention at the login/auth/home page load tier, but it does push this back a ways.
However, we have lots of dynamic content, and we heavily use memcache as the storage tier for that content (backed by MySQL instances for persistence) …
But that means we have a ton of data in cache, so a large number of cache servers are needed to store it. That makes for a large cache tier behind our tier of server farms. What are the problems with that? What happens when a web server physically (and logically, at the network level) at one end of the internal network needs data cached at the other end? It’s a long way to go, crossing (and congesting) a ton of intermediary links in the process. All that data crossing in the middle requires powerful switches and large links (even if you have a ring or some other more exotic network architecture)…
The solution is to partition the cache, then route page requests from clients directly to farms that are physically/logically near the partitions holding the data needed to respond. The net effect is that fewer cache requests need to cross the network to get their data; instead they are routed to the cache partition immediately behind the farm. This saves internal network traffic by reducing the hops the data has to take: instead of 1 byte passing over 4 links (web server, web rack switch, center switch, cache rack switch, cache), we pass 1 byte over 2 links (web server, web rack switch, cache), a 50% savings. Less aggregate network traffic means less switching/link capacity is required. Fewer hops also means less latency. In practice, the latter is quite clear…
The global cache tier is an unpartitioned cache: the cache server holding the data is as likely to be on the other side of the network as it is to be in the same rack as the web server making the request. The partitioned cache is separated into farms; each request is routed from the client (by picking a farm in JavaScript) to a web server farm that is likely to be close to the cache farm where most of the data needed for the page is cached. The savings is ~40% and will grow as the size and complexity of the network increases.
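The co-location idea above can be sketched numerically. Everything here is an illustrative assumption (farm count, modulo placement, link counts); only the 2-versus-4-link comparison comes from the notes.

```python
N_FARMS = 4  # illustrative number of web/cache farm pairs

def cache_farm(user_id):
    # Which cache farm holds this user's partition
    return user_id % N_FARMS

def web_farm(user_id):
    # Co-locate: route the client to the web farm next to that cache farm
    return cache_farm(user_id)

def links_crossed(user_id, handling_web_farm):
    # 2 links when data sits in the adjacent cache farm, 4 when the
    # request must cross the core switches (the 50% savings in the notes)
    return 2 if handling_web_farm == cache_farm(user_id) else 4
```

With client-side routing every request lands on `web_farm(user_id)`, so cache lookups take the 2-link path; an unpartitioned cache would hit the 4-link path a fraction of the time proportional to the number of farms.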
Performance graph from Tuenti from October
In December, we managed to flatten the line, but traded away ~10 ms in best-case performance. A good illustration of trading response time for scalability. Note: the fit might not be the most appropriate; it is flattened by some outliers in the low-load range. But we are clearly reaching higher load at a comparable level of performance in December, although there are some poorly performing outliers at high load as well.
The February graph looks really good: we continued to flatten the line and won back the 10 ms cost we paid in December. The data set is also less noisy than the previous two, indicating the system was more stable.
Further improvements in April: shifting the best case down another 10 ms while maintaining the slope. The data set is again quite clean, indicating a very stable system.
We deliver 25k pages/second at peak, but nearly 100k static files/second.
Competitive market: only two players (Akamai and Limelight) are financially very healthy, and Limelight is losing money if you consider investments.