The ability to grow (and shrink) according to the needs and the available resources is an essential part of designing applications. In this talk we'll cover the fundamental elements of scalability, including aspects involving people, processes and technology. With sound and proven principles and some advice on how to shape your organisation, set the right processes and design your application, this session is a must-see for developers and technical leads alike.
The Art of Scalability - Managing growth
1. The Art of Scalability
Managing Growth
Lorenzo Alberton
Amsterdam, 11th June 2010
2. Scalability
Scalability is a desirable property of a
system, a network, a business or a
process, which indicates its ability to
handle growing amounts of work
http://en.wikipedia.org/wiki/Scalability
3. Scalable ≠ Fast
A service is said to be scalable if when we
increase the resources in a system, it
results in increased performance in a
manner proportional to resources added.
http://www.julianbrowne.com/article/viewer/scalability
Increasing performance in general means
serving more units of work, but it can also
be to handle larger units of work, such as
when data sets grow.
http://highscalability.com/amazon-architecture
8. Roles And Responsibilities
Role clarity
Overlapping responsibilities and areas with missing responsibilities lead to wasted effort, value-destroying conflicts and failed scale initiatives
Key scale-related responsibilities
Set measurable goals
Staff the team with the appropriate skills
Define and implement a scalable architecture
Test, monitor, develop future demand projections
Define future changes based on the analysis
13. Leadership
Inspire people
Set the right vision and goals
Create the right culture
Create the right tools
Together: accelerator for growth
vision = where we are going
mission = general direction on how to get there
goals = milestones along the path
S Specific
M Measurable
A Achievable (but Aggressive)
R Realistic
T Time-bound
Chip & Dan Heath, “Switch: How To Change Things When Change Is Hard”
People
- Direct the rider
- Motivate the elephant
- Shape the path
15. Management
Project Management
Goals Projects Tasks Individuals
Measurement Communication Resolution
People Management
Hiring Firing Growth
16. Organisational Structure And Team Size
Too small: micromanaging managers, overworked team members
Too big: poor communication, low morale, low productivity
24. Why Are Processes Critical?
Augment management of teams and employees
Standardise actions in repetitive tasks
Reduce mundane decisions to focus on grander ideas
Allow the team to react quickly to crisis
Determine system capacity and scalability needs
Challenge: right amount, right process, right time
33. Headroom Process
1. Identify major components
2. Identify responsible team
3. Determine usage and capacity (e.g. 315 queries/sec, 20MB/min)
4. Determine growth rate
\[
\text{Headroom} = (\text{ideal usage percentage}) \times (\text{max capacity}) - (\text{current usage}) - \sum_{t=1}^{12} \bigl(\text{growth}(t) - \text{optimisation projects}(t)\bigr)
\]
M. L. Abbott, M. T. Fisher, “The Art Of Scalability”, Addison Wesley
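The headroom formula can be sketched in a few lines of Python. This is only an illustration: the function name, the 50% ideal usage, the 1000 queries/sec maximum capacity and the monthly growth and optimisation figures are assumptions made up for the example; only the 315 queries/sec current usage appears on the slide.

```python
def headroom(ideal_usage_pct, max_capacity, current_usage, growth, optimisations):
    """Ideal usage x max capacity, minus current usage, minus net growth over all periods."""
    if len(growth) != len(optimisations):
        raise ValueError("growth and optimisations must cover the same periods")
    # Sum over t of (growth(t) - optimisation projects(t))
    net_growth = sum(g - o for g, o in zip(growth, optimisations))
    return ideal_usage_pct * max_capacity - current_usage - net_growth


# Hypothetical figures, in queries/sec, for t = 1..12 months.
monthly_growth = [10] * 12       # assumed: demand grows by 10 each month
monthly_optimisation = [2] * 12  # assumed: projects reclaim 2 each month
result = headroom(0.5, 1000, 315, monthly_growth, monthly_optimisation)
print(result)
```

A positive result means capacity is still left after the planning period; a negative one means the component would run out before the period ends.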
38. Joint Architecture Design + Review Board
Engineering
Architecture
Operations
Architecture Review Board
Meeting
- State goal
- Review alternative designs
- Q&A session
- Deliberation
- Vote
- Conclusion
M. L. Abbott, M. T. Fisher, “The Art Of Scalability”, Addison Wesley
41. Controlling Change in Production Environment
Change Management Process
Proposal → Approval → Scheduling → Logging → Review
Change Identification Process
- Date & time of the change
- System undergoing the change
- Expected results
- Contact information
- Rollback procedure
45. Determining Risk #3: FMEA
Failure Mode and Effect Analysis
Feature | Failure Mode | Effect | Likelihood of Failure | Severity If Failure Occurs | Ability to Detect | Total Risk Score | Remediation Actions | Revised Risk Score
Sign Up | User data not saved | User not registered | 3 | 3 | 3 | 27 | do this; do that | 3
Sign Up | Users given wrong privileges | Users can access other’s data | 1 | 9 | 3 | 27 | do sth | 9
Credit Card | CC number not encrypted | CC theft risk | 1 | 9 | 1 | 9 | N/A | 9
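The totals in the FMEA table are consistent with multiplying the three ratings (3 × 3 × 3 = 27, 1 × 9 × 3 = 27, 1 × 9 × 1 = 9). A minimal sketch under that assumption follows; the class and field names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    feature: str
    failure_mode: str
    likelihood: int     # the table uses the values 1, 3 and 9
    severity: int
    detectability: int  # the "Ability to Detect" column

    @property
    def risk_score(self) -> int:
        # Assumption: total risk score is the product of the three ratings.
        return self.likelihood * self.severity * self.detectability


rows = [
    FailureMode("Sign Up", "User data not saved", 3, 3, 3),
    FailureMode("Sign Up", "Users given wrong privileges", 1, 9, 3),
    FailureMode("Credit Card", "CC number not encrypted", 1, 9, 1),
]
scores = [row.risk_score for row in rows]
# Highest-risk failure modes first, to decide where remediation goes.
ranked = sorted(rows, key=lambda row: row.risk_score, reverse=True)
print(scores)
```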
47. Managing Risk (Human Factor)
Rules Risk Tolerance Level
6-hour period < 150 pts *
12-hour period < 250 pts *
24-hour period < 350 pts *
72-hour period < 500 pts *
* Numbers are just indicative figures
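One possible way to enforce such rules is to add up the risk points of the changes scheduled inside each time window and flag any window that reaches its limit. The sketch below uses the indicative limits from the slide; the function name, the data layout and the sample changes are assumptions.

```python
# Window length in hours -> points that must not be reached (indicative figures).
LIMITS = {6: 150, 12: 250, 24: 350, 72: 500}


def violations(changes, limits=LIMITS):
    """changes: list of (start_hour, risk_points) tuples.

    Returns (window_hours, window_start, total_points) for every window,
    starting at a change, whose total is not below the limit.
    """
    found = []
    for window, limit in sorted(limits.items()):
        for start, _ in changes:
            total = sum(points for hour, points in changes
                        if start <= hour < start + window)
            if total >= limit:
                found.append((window, start, total))
    return found


# Hypothetical schedule: three changes within six hours, one much later.
schedule = [(0, 60), (2, 50), (5, 45), (30, 27)]
flagged = violations(schedule)
print(flagged)
```

Checking only windows that start at a change is enough, because any window with the highest total can be shifted right until it starts at one.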
49. Managing Incidents And Problems
Detect, Report, Investigate, Escalate, Resolve approach
M. L. Abbott, M. T. Fisher, “The Art Of Scalability”, Addison Wesley
Restore services in a timely and cost-effective manner
Contain chaos: each person has a place
Determine root cause and correct problems
Review issues regularly
Post-mortem Process
Cross-functional brainstorming meeting
56. Performance (Load) Testing
1. Establish success criteria (✓ 1.5k users/sec, ✓ RT < 150ms)
2. Establish the test environment (TEST ≅ LIVE)
3. Define the tests for different things (Pareto rule: 20% - 80%)
4. Identify what needs to be monitored (CPU, memory) and what data needs to be collected (TTL, RT, services)
5. Run, analyse, report to engineers (e.g. CPU: 90%, RT: 180ms, 2K SimUsers/sec)
6. Repeat tests and analysis (rinse and repeat)
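Steps 1 and 5 can be tied together by checking each run against the success criteria automatically. A minimal sketch, using the figures from the slide as sample data; the function name and dictionary keys are assumptions, and reading the 2K SimUsers/sec figure as the load at which RT and CPU were measured is an interpretation.

```python
def evaluate(run, min_users_per_sec=1500, max_rt_ms=150):
    """Compare one load-test run against the success criteria."""
    return {
        "throughput": run["users_per_sec"] >= min_users_per_sec,
        "response_time": run["rt_ms"] < max_rt_ms,
    }


# Sample run built from the slide's example numbers.
run = {"users_per_sec": 2000, "rt_ms": 180, "cpu_pct": 90}
outcome = evaluate(run)
print(outcome)
```

Here the throughput criterion is met but the response time is not, so the run would be reported back to engineering before the tests are repeated.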
65. Designing For Any Technology
Dell WatchGuard
Cisco CSS 11501
HP ProLiant DL
HP Media Cache
Server Appliance
Firewall
Load Balancer
Application Servers
DB Server
Media / Cache
75. Architectural Principles
- N + 1 design
- Design for rollback
- Design to be disabled
- Design to be monitored
- Design for multiple live sites
- Use mature technology
- Asynchronous design
- Stateless systems
- Buy when non-core
82. Stateless Systems
State is often useful, but has a significant cost
(replication between data centres, synchronous calls...)
Avoidance: no sessions / sticky sessions
Decentralisation: data in the cookie / cookie with hash
Centralisation: store cookies in the db or in memcached
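The "cookie with hash" option can be illustrated with an HMAC-signed value: the state travels with the client, and any server holding the shared key can verify it without a session store. This is a sketch only; the key, the `|` separator and the function names are assumptions, and a real deployment would also need expiry and key rotation.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical shared key


def sign(value: str, key: bytes = SECRET_KEY) -> str:
    """Return the value with an HMAC-SHA256 signature appended."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{value}|{digest}"


def verify(cookie: str, key: bytes = SECRET_KEY):
    """Return the original value if the signature matches, else None."""
    value, sep, digest = cookie.rpartition("|")
    if not sep:
        return None
    expected = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes to avoid timing leaks.
    if hmac.compare_digest(digest.encode("utf-8"), expected.encode("utf-8")):
        return value
    return None


cookie = sign("user_id=42")
print(verify(cookie))
```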
87. Creating Fault Isolative Structures
Increase availability
Limit impact of failures
Easier debugging
First:
- Functions causing repetitive problems
- Natural layout or topology of the site
91. Scale Directions
x: cloning of entities or data - unbiased distribution of work
y: separation of work by activity or data
z: separation of work by person for whom the work is done
M. L. Abbott, M. T. Fisher, “The Art Of Scalability”, Addison Wesley
95. Splitting Applications For Scale
x: mirroring
+ scale transactions
- scale data
y: split by service
+ fault isolation
+ scale function data
- scale customer data
z: split by need / location / value
+ fault isolation
+ scale customer data
- scale function data
99. Splitting Databases For Scale
x: data cloning (replication / clustering)
+ easy to implement
+ scale transaction volume
- scale data size and growth
y: split by service / resource / data affinity
+ fault isolation
+ reduce query time
- more difficult
- data migration
z: split by modulus / hash-based lookups
+ balanced demand
+ fault isolation
+ scale data and trans.
- more costly
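The z-axis split can be sketched as a lookup function that maps a key to one of N database shards. Both function names and the choice of SHA-256 are assumptions for illustration; the slide only mentions modulus and hash-based lookups.

```python
import hashlib


def shard_by_modulus(customer_id: int, shard_count: int) -> int:
    """Numeric ids: the shard is the remainder of id / number of shards."""
    return customer_id % shard_count


def shard_by_hash(key: str, shard_count: int) -> int:
    """Arbitrary keys: hash first, so the result is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


print(shard_by_modulus(1234, 4))
print(shard_by_hash("alice@example.com", 4))
```

A stable hash is used here instead of Python's built-in `hash()`, which is randomised per process for strings. Changing the shard count moves most keys to a different shard, which is part of why this split is listed as more costly.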
108. Too Much Data
The more storage, the more:
- storage management
- storage costs
- people and software
- power and space
- processing power
- backup time and costs
Evaluate data retention policy
Consider multi-tiered storage
Distribute work (MapReduce)
109. Clouds And Grids
Cheap, on-demand storage and compute capacity
Clouds
+ Cost (pay for what you use)
+ Speed (procurement, provisioning, deployment)
+ Flexibility (change / reconfigure environment)
- Security, portability, control
- Limitations of virtualisation
- Performance
Grids
+ High computation rates
+ Shared infrastructure (with proper scheduling)
+ Unused capacity (SETI@H)
- Not shared simultaneously
- Monolithic applications
- Complexity (debugging, OS)
113. Monitoring
1. Is there a problem? User experience / Business metrics monitors
2. Where is the problem? System monitors (threshold - variance)
3. What is the problem? Application monitors
Keep the signal-to-noise ratio high
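The two kinds of system monitor mentioned above (threshold and variance) can be contrasted in a short sketch. The function names, the three-sigma default and the sample readings are assumptions; the slide does not prescribe a particular rule.

```python
from statistics import mean, pstdev


def threshold_alert(value, limit):
    """Fire when a reading crosses a fixed limit."""
    return value > limit


def variance_alert(value, history, max_sigmas=3.0):
    """Fire when a reading deviates too far from recent behaviour."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) > max_sigmas * sigma


# Hypothetical response times in ms.
history = [100, 102, 98, 101, 99]
print(threshold_alert(120, 150))     # still under the fixed limit
print(variance_alert(120, history))  # but far from the usual range
```

A variance monitor can catch a change in behaviour before a fixed threshold is crossed, while a generous sigma setting keeps noisy alerts down.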