This document discusses quantifying the scalability of software. It recommends instrumenting code from the beginning to collect monitoring data on application health, the entire cluster, and individual nodes' system resources. This allows measuring how well a system can handle increasing load and evolving constraints.
1. The Economies of Scaling Software
Abdelmonaim Remani
@PolymathicCoder
2. About Me
• Platform Architect at just.me Inc.
• JavaOne RockStar and frequent speaker at many developer events and conferences including JavaOne, JAX, OSCON, OREDEV, 33rd Degree, etc...
• Open-source advocate and contributor
• Active community member
• The NorCal Java User Group
• The Silicon Valley Spring User Group
• The Silicon Valley Dart Meetup
• Bio: http://about.me/PolymathicCoder
• Twitter: @PolymathicCoder
• Email: abdelmonaim.remani@gmail.com
• SlideShare: http://www.slideshare.net/PolymathicCoder/
3. License
• Creative Commons Attribution Non-Commercial License 3.0 Unported
• The graphics and logos in this presentation belong to their rightful owners
6. What’s up with the title?
• The Economies of Scale
• “In microeconomics, economies of scale are the cost advantages that enterprises obtain due to size [...] Often operational efficiency is [...] greater with increasing scale [...]” -Wikipedia
7. The line is blurred!
• There was a time when only the enterprise worried about issues like scalability
• The rise of social and the abundance of mobile are responsible for
• Not only an exponential growth of internet traffic
• But the creation of a spoiled user-base that wants answers to questions like
• “I want to see the closest Moroccan restaurants to my current location on a map, along with consumer ratings and whether any of my friends has checked in within the last 30 days”
8. The bar is high!
• Scalability is everyone’s problem
10. The Common Definition
• The ability of a piece of software to handle an increasing amount of work without performance degradation
11. I have a problem with that definition...
• It implies that a scalable system is one that is capable of sustaining its scalability forever
• Not realistic: it fails to recognize the external constraints imposed on the system
• It fails to acknowledge that scalability is relative
• It does not take into account that a system
• Need not be capable of handling the work today
• But simply capable of evolving to handle the work
12. A better definition
• The ability of an application to gracefully evolve within the constraints of its ecosystem in order to handle the maximum potential amount of work without performance degradation
13. Easier said than done!
• A black art
• No surprise here!
• An application that supports 1 million users
• You add one new feature
• 500,000 users crash your system
15. The Bottlenecks
• Scaling is about relieving or managing the limitations or constraints that we call bottlenecks
• When we talk about bottlenecks in computing, we talk about the usual suspects
• The CPU
• Storage or I/O
• The Network
• Inter-related
• The rest of this talk is structured around these bottlenecks to make the case that one’s scalability needs are to be addressed in that fashion
17. The CPU Bottleneck
• Nothing affects the CPU more than the instructions it is summoned to execute
• In other words, this is about the very code of your application
19. Architecture?
• Architecture
• “Things that people perceive as hard-to-change” - Martin Fowler
• http://martinfowler.com/ieeeSoftware/whoNee
• Decisions you commit to; the ones that will be stuck with you forever
20. Be wise... Think twice...
Choosing the right technologies
• Platform
• Languages
• Frameworks
• Libraries
Making the right abstractions
• Technical Abstractions
• Functional Abstractions
• Make sure that the former is subordinate to the latter and not the other way around
22. Write Good Code
• Think your algorithms through and mind their complexity (Asymptotic Complexity, Cyclomatic Complexity, etc...)
• SOLIDify your Design
• Single Responsibility, Open-Closed, Liskov Substitution, Interface Segregation, and Dependency Inversion
• Understand the limitations of your technology and leverage its strengths
• Don’t be afraid to be Polyglot
• Obsess with testing
• TDD/BDD
• Tools
• Static code analyzers (PMD, FindBugs, etc...)
• Profilers (Detect memory leaks, bottlenecks, etc...)
• Etc...
23. Know Your S#!t
• Read
• The classics: “The Mythical Man-Month”
• GoF’s “Design Patterns”
• Eric Evans’ “Domain-Driven Design”
• Every book by Martin Fowler
• Uncle Bob’s “Clean Code”
• Josh Bloch’s “Effective Java”
• Brian Goetz’s “Java Concurrency in Practice”
• Etc...
• The list is long...
24. We do all that... and still end up with this...
• The fading tradition of making cow dung piles
26. Technical Debt is a Reality
• It is inevitable... You will incur it one way or another, deliberately or not
• The quick-and-dirty you are not proud of
• Things you would/should do differently
• Anyway, after a while it starts to smell...
• The bright side
• The fact that it is recognized as a debt is good
• Keep track and refactor
• For the fearless... Be wise and think twice before you do it
• Cut the right corners
• Don’t lock yourself out
• Don’t make it a part of your architecture
28. Parallelism
• Parallelism?
• Writing concurrent code, that is, code that executes simultaneously
• Most of us write code that runs within web containers by extending framework classes that are already multi-threaded
• Sometimes the complexity of the business logic demands that we break it into smaller steps, execute them in parallel, then aggregate the data back to get a result within a reasonable amount of time
• This is not easy!
• Often requires synchronizing state, which is a nightmare
29. Vertical Scaling
• Vertical Scaling (Scaling up)
• A single-node system
• Adding more computing resources to the node (Get a beefier machine)
• Writing code to harness the full power of the one node
30. Easier said than done...
• On the one machine, we have been reaping the benefits of Moore’s Law
• Performance gain is automatically realized by software (In other words, code is faster on faster hardware)
• The End of Moore’s Law: The birth of the multi-core chip
• We actually need to write code to take advantage of this
• Good news! There are frameworks and libraries that make it a lot easier
• Fork/Join in Java
• Akka
• Etc...
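The fork-then-join shape that such frameworks provide can be sketched in a few lines. This is only an illustrative Python sketch, not Java's Fork/Join or Akka: the function names (`chunked`, `parallel_sum_of_squares`) and the chunk size are assumptions made up for the example. It uses a thread pool to keep the example self-contained; in CPython, CPU-bound work would usually call for a process pool instead, since threads share one interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def sum_of_squares(chunk):
    """Work done independently on each chunk; touches no shared state."""
    return sum(n * n for n in chunk)


def parallel_sum_of_squares(numbers, workers=4, chunk_size=1000):
    # Fork: every chunk becomes an independent task.
    chunks = chunked(numbers, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    # Join: aggregate the partial results into the final answer.
    return sum(partials)
```

Because each task only reads its own chunk and returns a value, no locking is needed; the only coordination point is the final aggregation.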
31. Easier said than done...
• Challenges
• What about dependencies and 3rd party code?
• Synchronizing state just got HARDER across cores! Too many cooks!
• Frankly, this shared state deal is a real pain
• Get a life and do without
• Go immutable (Not always straightforward, and sometimes not even possible)
• Go “Functional” (No guts... no glory...)
32. It gets more interesting...
• Amdahl’s Law
• Throwing more cores at a problem does not necessarily result in a performance gain
• The serial part of the work caps the speedup, so we end up with diminishing returns at some point no matter how many cores we throw in
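Amdahl's Law is usually written as speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of cores. A small sketch makes the diminishing returns concrete; the function name `amdahl_speedup` and the sample fractions are illustrative assumptions.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Theoretical speedup for a workload whose `parallel_fraction`
    (between 0 and 1) can be spread evenly over `cores` cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # With 90% of the work parallelizable, the speedup can never reach 10x.
    for cores in (1, 2, 4, 8, 16, 64, 1024):
        print(cores, round(amdahl_speedup(0.9, cores), 2))
```

With 10% of the work serial, the speedup approaches but never reaches 1 / 0.1 = 10, however many cores are added.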
34. Horizontal Scaling
• Horizontal Scaling (Scaling out)
• A distributed system (A cluster)
• Adding more nodes
• Writing code to harness the full power of the cluster
35. Topology
• A typical cluster consists of
• A number of identical application server nodes behind a load balancer
A number?
• It depends on how many you actually need and can afford
• Elastic Scaling / Auto-Scaling
• The number of live nodes within the cluster shrinks and grows depending on the load
• New ones are provisioned or terminated as needed
Identical?
• Application nodes are cloned from image files (Ex. AWS EC2 AMIs, etc...)
• Configuration management tools (Chef, Puppet, Salt, etc...)
Load balancer?
• Load is evenly distributed across live nodes according to some algorithm (Round-Robin typically)
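Round-robin distribution is simple enough to sketch directly. The class below is a toy illustration, not how any particular load balancer is implemented; the class name and the node names are hypothetical, and it ignores health checks and nodes joining or leaving the cluster.

```python
import itertools


class RoundRobinBalancer:
    """Hands out nodes in a fixed rotation, one per request."""

    def __init__(self, nodes):
        nodes = list(nodes)
        if not nodes:
            raise ValueError("at least one node is required")
        self._rotation = itertools.cycle(nodes)

    def next_node(self):
        # Each call returns the next node and wraps around at the end.
        return next(self._rotation)


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
    for request_id in range(5):
        print(request_id, "->", balancer.next_node())
```

Since any node may receive any request, this scheme works best when the nodes are identical and hold no per-client state.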
37. Managing State
• Session data
• Session Replication
• Session Affinity/Sticky Session
• Requests from the same client always get routed back to the same server
• When the node dies, session data dies with it
• Shared/Distributed Session
• Session is in a centralized location
• Do yourself a favor and go stateless!
• No session data
• Any server would do
38. Parallelism
• Leverage MapReduce
• “A programming model for processing large data sets with a parallel, distributed algorithm on a cluster”
• Hadoop
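The programming model itself can be shown on a single machine with the classic word count. This sketch does not use Hadoop or any of its APIs; the function names are illustrative, and the three phases run sequentially in one process, whereas a real framework would spread the map and reduce work across the cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle(pairs))
```

The map calls are independent of each other, and so are the per-key reductions, which is what lets a framework run them in parallel on different nodes.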
39. Misc
• Distributed Lock Manager (DLM)
• Synchronized access to shared resources
• Google Chubby
• Zookeeper
• Hazelcast
• Terracotta
• Etc...
• Distributed Transactions
• X/Open XA
• HTTPS
• Terminate at the load balancer
• Wildcard SSL
• Leverage probabilistic data structures and algorithms
• Bloom filters
• Quotient filters
• Etc...
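As an example of a probabilistic data structure, here is a minimal Bloom filter sketch. It answers "definitely not present" or "possibly present" using a fixed amount of memory. The sizes, the choice of SHA-256 with a per-hash prefix, and the method names are assumptions for illustration; a production filter would pack bits tightly and size itself from a target false-positive rate.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple (not memory-optimal).
        self.bits = bytearray(size_bits)

    def _positions(self, item):
        # Derive several positions by prefixing the item with a hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for position in self._positions(item):
            self.bits[position] = 1

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits[position] for position in self._positions(item))
```

A typical use is to keep the filter in front of an expensive lookup: when `might_contain` returns False, the trip to the data store can be skipped entirely.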
45. What datastore to use?
What kind of question is that?
• There was a time when the obvious choice was the relational model
• Schema that guarantees data integrity
• Data normalized (minimized redundancies, no modification anomalies, etc...)
• ACIDity (Atomicity, Consistency, Isolation, and Durability)
• Data is stored in a way that is independent of how the data is to be accessed (No bias towards any particular query patterns)
• Flexible query language
• As our datasets grew, we scaled vertically
• Buying beefier machines
• Database tuning / Query Optimization
• Creating Materialized Views
• De-normalizing
• Etc...
46. Mucho Data!
• We hit the limit of the one machine
• Attempted to scale the RDBMS horizontally
• Master/Slave clusters
• Data Sharding
• We failed... Why?
• Eric Brewer’s CAP Theorem on distributed systems
• Pick 2 out of 3
• Consistency
• Availability
• Partition Tolerance
• The relational model is designed to favor CA over P
• It cannot be scaled horizontally
47. NoSQL
• A wide range of specialized data stores with the goal of addressing the challenges of the relational model
• “The whole point of seeking alternatives is that you need to solve a problem that relational databases are a bad fit for” -Eric Evans
• A wide variety
• Key-Value Data stores
• Columnar Data stores
• Document Data stores
• Graph Data stores
48. Polyglot Persistence
• Acknowledging
• The complexity and variety of data and data access patterns within the one application
• The absurdity of the idea that all data should be fitted into one storage model
• Proposing solutions that
• Leverage multiple data stores within the one application based on the specific way the data is stored and accessed
49. For more details...
• Check out my talk from JAX Conf 2012
• The Rise of NoSQL and Polyglot Persistence
• YouTube Video:
• http://bit.ly/PCWtWi
51. Caching
• A cache is typically a simple key-value data structure
• Instead of incurring the overhead of data retrieval or computation every time, you check the cache first
• Since we can’t cache everything, caches can be configured to use different eviction algorithms depending on the use case (LRU, Bélády's Algorithm, etc...)
• Use aggressively!
• What to cache?
• Frequently accessed data (Session data, feeds, etc...)
• Long computation results
52. Caching
• Where to cache?
• On disk
• File System: Slow and sequential access
• DB: A little bit better (Data is arranged in structures designed for efficient access, indexes, etc...)
• Generally a terrible idea
• SSDs make things a little better
• In-Memory: Fast and random access, but volatile
• Something in between: Persistent caches (Redis, etc...)
53. Caching
• Types of Caches
• Local
• Replicated
• Distributed
• Clustered
54. Caching
• How to cache?
• Most caches implement a very simple interface
• Always attempt to get from cache first using a key
• If it is a hit, you saved yourself the overhead
• If it is a miss, compute or read from the data store, then put in cache for subsequent gets
• When you update, you can evict stale data
• You can set a TTL when you put
• Many other common operations...
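The get / put / evict / TTL interface described above can be sketched as a small in-memory cache plus a read-through helper. The names `TTLCache` and `get_or_load` and the injectable clock are assumptions for the sake of a testable example, not the API of any specific caching product.

```python
import time


class TTLCache:
    """In-memory key-value cache with an optional per-entry time-to-live."""

    def __init__(self, clock=time.monotonic):
        self._store = {}
        self._clock = clock  # injectable so expiry can be tested

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None  # miss
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._store[key]  # expired entries count as misses
            return None
        return value

    def put(self, key, value, ttl=None):
        expires_at = None if ttl is None else self._clock() + ttl
        self._store[key] = (value, expires_at)

    def evict(self, key):
        self._store.pop(key, None)


def get_or_load(cache, key, loader, ttl=None):
    """Check the cache first; on a miss, load the value and cache it.
    Note: a loaded value of None is treated as a miss in this sketch."""
    value = cache.get(key)
    if value is None:
        value = loader(key)
        cache.put(key, value, ttl)
    return value
```

On an update to the underlying data, calling `evict(key)` forces the next read to reload a fresh value instead of serving the stale one.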
55. Caching Patterns
• Caching Query Results
• Key: hash of the query itself
• How about parametrized complex queries?
• Key: hash of the query itself + hash of parameter values
• Method/Function Memoization
• Key: method name
• How about parametrized methods?
• Key: hash of the method name + hash of parameter values
• Caching Objects
• Key: Identity of the object
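The two key schemes above can be sketched as follows. The helper names (`query_cache_key`, `memoize`), the use of SHA-256, and JSON as the parameter encoding are illustrative assumptions. The memoizer uses the tuple of function name and arguments directly as the dictionary key, which Python hashes internally, so arguments must be hashable. Python's standard library also ships `functools.lru_cache` for the memoization case.

```python
import functools
import hashlib
import json


def _sha256(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def query_cache_key(query, params=()):
    """Key = hash of the query text + hash of the parameter values."""
    encoded_params = json.dumps(list(params), default=str)
    return f"query:{_sha256(query)}:{_sha256(encoded_params)}"


def memoize(func):
    """Cache results keyed by the function name plus its arguments."""
    results = {}

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        key = (func.__qualname__, args, tuple(sorted(kwargs.items())))
        if key not in results:
            results[key] = func(*args, **kwargs)
        return results[key]

    return wrapper
```

Hashing the query and its parameters separately means the same statement with different parameter values shares a common key prefix, which can help when evicting every cached result of one query.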
56. Caching Pattern
• Time-series datasets (Ex. Realtime feed)
• Sometimes pseudo/near realtime is enough
• Use caching to throttle access to the source
• Cache the query result with an expiry of t
• Fresh data is only read from the source once every t
57. Caching Gotchas
• Profile your code to assess what to cache, and whether you need to cache to begin with
• Stale state might bite you hard
• Incoherence: Inconsistent copies of objects cached with multiple keys
• Stale nested aggregates
• Network overhead of misses might outweigh the performance gain of the hits
• Consider writing/updating to cache when you write to the data store
58. Featured Solutions
• EhCache
• Memcached
• Oracle Coherence
• Redis
• A Persistent NoSQL store
• Supports built-in data structures like sets and lists
• Supports intelligent keys and namespaces
62. Asynchronous Processing Patterns
• Pseudo-Asynchronous Processing
• Flow
• Preprocessing data / operations in advance
• Request data or operation
• Responding synchronously with preprocessed result
• Sometimes not possible (Dynamic content, etc...)
63. Asynchronous Processing Patterns
• True Asynchronous Processing
• Flow
• Request data or operation
• Acknowledge
• Ex. A REST endpoint that returns a “202 Accepted” HTTP status code
• Do processing at your own convenience
• Allow the user to check progress
• Optionally notify when processing is complete
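The acknowledge-then-process flow can be sketched without any web framework. `JobService`, its method names, and the status strings are hypothetical; `submit` stands in for a REST handler that would answer with 202 Accepted and a job identifier, and `status` for the endpoint the client polls to check progress.

```python
import threading
import uuid
from concurrent.futures import ThreadPoolExecutor


class JobService:
    def __init__(self, workers=2):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._jobs = {}
        self._lock = threading.Lock()

    def submit(self, task, *args):
        """Acknowledge immediately; the work happens in the background."""
        job_id = str(uuid.uuid4())
        with self._lock:
            self._jobs[job_id] = {"status": "pending", "result": None}
        self._pool.submit(self._run, job_id, task, args)
        return 202, job_id  # mirrors an HTTP "202 Accepted" response

    def _run(self, job_id, task, args):
        try:
            update = {"status": "done", "result": task(*args)}
        except Exception as exc:  # record the failure instead of losing it
            update = {"status": "failed", "result": repr(exc)}
        with self._lock:
            self._jobs[job_id] = update

    def status(self, job_id):
        """Lets the caller check progress at any time."""
        with self._lock:
            return dict(self._jobs[job_id])

    def shutdown(self):
        self._pool.shutdown(wait=True)
```

A notification step (email, webhook, push) could be added at the end of `_run`; it is left out here because the slide lists it as optional.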
66. CDN
• Static Content
• Binary (Video, Audio, etc...)
• Web objects (HTML, JavaScript, CSS, etc...)
• Do not serve it through your application server
• Use a CDN
• “A large distributed system of servers deployed in multiple data centers across the internet”
• Akamai
• AWS CloudFront
67. CDN Gotchas
• Versioning and caching
• Assume that you have a script file named script.js deployed on a CDN
• Copies of the file script.js will be replicated across all edge nodes
• Clients will cache copies of script.js as well in their local cache
68. CDN Gotchas
• Versioning and caching
• When script.js is updated while keeping the same URI as the old version
• The new content is NOT propagated across the edge nodes
• New clients end up being served the old version, which is now dirty state
• Old clients continue to use their local cache containing the old version, which is now dirty state
69. CDN Gotchas
• Versioning and caching
• What to do?
• Simply append version numbers to file names
• script-v1.js, script-v2.js, etc...
• Force invalidation of the file on edge nodes
• Set HTTP caching headers properly
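The file-naming scheme above is easy to automate at build time. The helper names below are hypothetical, and the header values are one common convention rather than anything prescribed by the slides: versioned files never change under the same name, so they can be cached for a long time, while unversioned files ask caches to revalidate before reuse.

```python
from pathlib import PurePosixPath


def versioned_name(filename, version):
    """script.js, version 2 -> script-v2.js (any directory part is kept)."""
    path = PurePosixPath(filename)
    return str(path.with_name(f"{path.stem}-v{version}{path.suffix}"))


def caching_headers(is_versioned):
    """Illustrative Cache-Control values; tune them to your own CDN setup."""
    if is_versioned:
        # Content under a versioned name never changes: cache for a year.
        return {"Cache-Control": "public, max-age=31536000, immutable"}
    # Unversioned content must be revalidated before a cached copy is reused.
    return {"Cache-Control": "no-cache"}
```

Pages then reference the versioned name, so publishing `script-v2.js` changes the URI and sidesteps both the edge-node copies and the clients' local caches of the old file.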
71. DNS
• DNS
• Do not rely on the free DNS service of your domain name registrar
• Use a scalable DNS solution
• AWS Route 53
• DynECT
• UltraDNS
• Etc...
75. When disaster hits...
• Goal:
• Fault tolerant system
• In case of disaster, recover and restore service ASAP
• Be proactive
• Develop a Disaster Recovery Plan (DRP)
• Test DRP in failure drills
77. Scaling Teams
• Hiring
• Always hire top talent
• You are only as strong as your weakest link
• Develop a process to bring people in
• Turnkey Hardware/Software setup (Tools like Vagrant, etc...)
• Arrange for proper access/accounts
• Develop a knowledge base (Architecture documentation, FAQs, etc...)
• Development Process
• Be Agile
• Refine in the spirit of Six Sigma
78. Scaling Teams
• Teams
• Form small ad-hoc teams from pools of Agile breeds
• Product Owners
• Team Members
• Team Lead (Scrum Master)
• Engineers
• QAs
• Architecture Owners
• Keep them small
• Give them ownership of their DevOps
80. The Take-home Message
• The early bird gets the worm
• Design to scale from day one
• Plan for capacity early
• Your needs determine how scalable is scalable
• Do not over-engineer
• Do not bite off more than you can chew
• Building a scalable system is a process
• Commit to a road map around bottlenecks
• Guided by planned business features
• Learn from others’ experiences (Twitter, Netflix, etc...)