What does it take to scale a system? We'll learn how going distributed can pay dividends in areas like availability and fault tolerance by examining a real-world case study. However, we will also look at the inherent pitfalls. When it comes to distributed systems, for every promise there is a peril.
The Economics of Scale: Promises and Perils of Going Distributed
1. The Economics of Scale: Promises and Perils of Going Distributed
Tyler Treat, Workiva
September 19, 2015
2. About The Speaker
• Backend engineer at Workiva
• Messaging platform tech lead
• Distributed systems
• bravenewgeek.com @tyler_treat
tyler.treat@workiva.com
3. About The Talk
• Why distributed systems?
• Case study
• Advantages/Disadvantages
• Strategies for scaling and resilience patterns
• Scaling Workiva
5. Scale Up vs. Scale Out
Vertical Scaling
– Add resources to a node
– Increases node capacity, load is unaffected
– System complexity unaffected
Horizontal Scaling
– Add nodes to a cluster
– Decreases load, capacity is unaffected
– Availability and throughput w/ increased complexity
22. Partition tweets into different databases using
some consistent hash scheme
(put a hash ring on it).
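As an illustration of the "hash ring" idea, here is a minimal sketch of consistent hashing with virtual nodes. The class name `HashRing`, the node names (`db0`, `db1`, ...), the `tweet:<id>` key format, and the choice of SHA-256 are all assumptions made for this example; the slides do not say how the partitioning was actually implemented.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a 64-bit integer position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Illustrative consistent hash ring (hypothetical, not from the talk)."""

    def __init__(self, nodes, vnodes=64):
        self._vnodes = vnodes
        self._points = []  # sorted list of (position, node)
        self._hashes = []  # positions only, kept parallel for bisect
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node owns several virtual points to even out the key spread.
        for i in range(self._vnodes):
            self._points.append((_hash(f"{node}#{i}"), node))
        self._points.sort()
        self._hashes = [h for h, _ in self._points]

    def lookup(self, key):
        # A key belongs to the first point clockwise from its hash.
        idx = bisect.bisect(self._hashes, _hash(key)) % len(self._hashes)
        return self._points[idx][1]


ring = HashRing(["db0", "db1", "db2"])
keys = [f"tweet:{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}

ring.add("db3")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

The property that matters when adding a database is that keys only ever move to the new node; every other key keeps its original owner, so most of the data stays where it is.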
24. This alleviates lock contention and improves
throughput…
but fetching timelines is still extremely costly
(now scatter-gather query across multiple DBs).
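To make the cost of the scatter-gather read concrete, here is a small sketch in which every timeline fetch has to query each shard and then merge the partial results. The in-memory `shards` list, the `(timestamp, tweet_id, author)` row layout, and the function names are hypothetical stand-ins for real databases.

```python
import heapq
from itertools import islice

# Each shard holds rows of (timestamp, tweet_id, author), newest first.
shards = [
    [(105, "t5", "alice"), (101, "t1", "bob")],
    [(104, "t4", "carol"), (102, "t2", "alice")],
    [(103, "t3", "dave")],
]


def query_shard(shard, following):
    """Stand-in for one database query: tweets by followed authors."""
    return [row for row in shard if row[2] in following]


def scatter_gather_timeline(following, limit=20):
    # Scatter: one query per shard, no matter how few tweets it contributes.
    partials = [query_shard(shard, following) for shard in shards]
    # Gather: merge the per-shard results, newest first.
    merged = heapq.merge(*partials, key=lambda row: row[0], reverse=True)
    return [tweet_id for _, tweet_id, _ in islice(merged, limit)]


timeline = scatter_gather_timeline({"alice", "carol"})
```

The work per read grows with the number of shards, which is why this design is expensive for a read-heavy workload.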
25. Observation: Twitter is a consumption mechanism
more than an ingestion one…
i.e. cheap reads > cheap writes
27. Ingestion/Fan-Out Process
1. Tweet comes in
2. Query the social graph service for followers
3. Iterate through each follower and insert tweet ID into
their timeline (stored in Redis)
4. Store tweet on disk (MySQL)
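The four steps above can be sketched as follows. Plain Python dictionaries stand in for the social graph service, Redis, and MySQL; the names `followers`, `timelines`, `tweet_store`, `ingest`, and the `TIMELINE_LIMIT` cap are assumptions for illustration, not details from the talk.

```python
from collections import defaultdict, deque

TIMELINE_LIMIT = 800  # hypothetical cap on cached timeline length

followers = {"alice": {"bob", "carol"}}  # stand-in for the social graph service
timelines = defaultdict(lambda: deque(maxlen=TIMELINE_LIMIT))  # stand-in for Redis
tweet_store = {}  # stand-in for MySQL


def ingest(tweet_id, author, text):
    """Fan a new tweet out to every follower's timeline, then persist it."""
    # Step 2: look up the author's followers.
    for follower in followers.get(author, ()):
        # Step 3: push the tweet ID (not the body) onto each timeline.
        timelines[follower].appendleft(tweet_id)
    # Step 4: store the full tweet durably.
    tweet_store[tweet_id] = {"author": author, "text": text}


ingest(1, "alice", "hello")
ingest(2, "alice", "hello again")
```

The loop over followers is where the write cost comes from: one insert per follower for every tweet.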
29. Ingestion/Fan-Out Process
• Lots of processing on ingest, no computation on reads
• Redis stores timelines in memory, which makes reads very fast
• Fetching a timeline involves no queries: get the timeline from the
Redis cache and rehydrate with a multi-get on IDs
• If a timeline falls out of cache, reconstitute it from disk
• O(n) on writes (n = number of followers), O(1) on reads
• http://www.infoq.com/presentations/Twitter-Timeline-Scalability
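The read path described above might look roughly like this. Dictionaries again stand in for the Redis cache and the on-disk store, and the rebuild step assumes tweet IDs increase over time so that sorting them in descending order gives newest first; that assumption and all names here are hypothetical.

```python
timeline_cache = {"bob": [3, 1]}  # stand-in for Redis: user -> tweet IDs
tweet_store = {                   # stand-in for tweets on disk
    1: {"author": "alice", "text": "first"},
    2: {"author": "dave", "text": "unrelated"},
    3: {"author": "alice", "text": "second"},
}
following = {"bob": {"alice"}, "carol": {"alice"}}


def multi_get(ids):
    """Rehydrate tweet IDs into full tweets in one batched lookup."""
    return [tweet_store[i] for i in ids if i in tweet_store]


def rebuild_timeline(user):
    """Slow path: reconstitute a timeline from disk and re-cache it."""
    authors = following.get(user, ())
    ids = sorted(
        (tid for tid, tweet in tweet_store.items() if tweet["author"] in authors),
        reverse=True,  # assumes larger ID means newer tweet
    )
    timeline_cache[user] = ids
    return ids


def fetch_timeline(user):
    ids = timeline_cache.get(user)
    if ids is None:  # timeline fell out of cache
        ids = rebuild_timeline(user)
    return multi_get(ids)


cached = fetch_timeline("bob")     # served from the cache
rebuilt = fetch_timeline("carol")  # cache miss, rebuilt from disk
```

The common case touches only the cache and a batched lookup by ID; the expensive scan runs only on a cache miss.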
30. Key Takeaway: think about your access patterns
and design accordingly.
Optimize for the critical path.
31. Let’s Recap…
• Advantages of single database system:
  • Simple!
  • Data and invariants are consistent (ACID transactions)
• Disadvantages of single database system:
  • Slow
  • Doesn’t scale
  • Single point of failure
37. Sure, just coordinate things before proceeding…
“Have you seen this tweet? Okay, good.”
“Have you seen this tweet? Okay, good.”
“Have you seen this tweet? Okay, good.”
“Have you seen this tweet? Okay, good.”
“Have you seen this tweet? Okay, good.”
“Have you seen this tweet? Okay, good.”
38. Sooo what do you do when Justin Bieber tweets to
his 67 million followers?
40. Coordinating for consistency is expensive
when data is distributed
because processes
can’t make progress independently.
43. Source: Peter Bailis, 2015 https://speakerdeck.com/pbailis/silence-is-golden-coordination-avoiding-systems-design
44. Key Takeaway: strong consistency is slow and
distributed coordination is expensive (in terms of
latency and throughput).
67. Flow-Control Mechanisms
• Rate limit
• Bound queues/buffers
• Backpressure - drop messages on the floor
• Increment stat counters for monitoring/alerting
• Exponential back-off
• Use application-level acks for critical transactions
68. Bounding resource utilization and failing fast
helps maintain predictable performance and
impedes cascading failures.
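A few of the flow-control mechanisms listed on the previous slide can be sketched in a handful of lines: a bounded queue that drops messages when full, counters that record what happened, and capped exponential back-off with jitter. The names `BoundedMailbox`, `stats`, and `backoff_delays`, and the default delay values, are assumptions for illustration only.

```python
import queue
import random
from collections import Counter

stats = Counter()  # stand-in for counters feeding monitoring/alerting


class BoundedMailbox:
    """Bounded buffer that fails fast instead of growing without limit."""

    def __init__(self, maxsize):
        self._q = queue.Queue(maxsize=maxsize)

    def offer(self, msg):
        try:
            self._q.put_nowait(msg)
        except queue.Full:
            stats["dropped"] += 1  # drop on the floor, but count it
            return False
        stats["accepted"] += 1
        return True

    def poll(self):
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None


def backoff_delays(base=0.1, cap=5.0, attempts=5, rng=random.random):
    """Yield capped exponential delays (seconds) scaled by random jitter."""
    for attempt in range(attempts):
        yield rng() * min(cap, base * 2 ** attempt)


mailbox = BoundedMailbox(maxsize=2)
results = [mailbox.offer(n) for n in range(3)]  # third message is dropped
delays = list(backoff_delays())
```

Because the buffer is bounded, memory use stays predictable under overload, and the `dropped` counter makes the shed load visible to whoever is watching the dashboards.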