This document summarizes Twitter's approach to capacity planning for large events like the Super Bowl. It discusses using historical traffic patterns to predict capacity needs, analyzing key metrics like tweets per second, and planning for potential traffic spikes through statistical analysis and scenario modeling. For Super Bowl 2013, Twitter's models predicted a traffic spike could push tweets per second into the 20,000+ range, higher than previous years, and the company was able to maintain high availability during the game despite the brief blackout.
Twitter QCon NY 2013: Isolating Events from the Fail Whale
1. @Twitter | QCon NY 2013 1
Isolating Events from the Fail Whale
Arun Kejariwal, Bryce Yan
(@arun_kejariwal, @bryce_yan)
Capacity Engineering @ Twitter
June 2013
2. @Twitter | QCon NY 2013 2
Delivering Best User Experience
• Performance
Real time!
Latency tolerance of end-users has nosedived
Average, p99, p999
Variability on large clusters
Tolerate faults when using commodity hardware
• Availability
Anytime, Anywhere, Any Device
• Organic Growth
Over 200M monthly active users
• Events
Planned, Unplanned
[1] Xu et al. NSDI 2013 - https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final77.pdf
[2] http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/Berkeley-Latency-Mar2012.pdf
[3] https://twitter.com/twitter/status/281051652235087872
3. @Twitter | QCon NY 2013 3
High Performance, Availability
• Capacity Planning
Throw hardware at the problem
Operationally inefficient
Even otherwise
o How much?
o What kind? (Inventory management etc.)
Reactive approach
Degraded user experience
o Impacts the bottom line
Overall goal
Deliver best user experience
Minimal operational footprint
o Factor in organic growth and lead times for provisioning additional capacity
4. @Twitter | QCon NY 2013 4
Capacity Planning is Non-trivial
• Behavioral response is unpredictable
• Multiplier Effect
# Retweets x Followers of each retweeter
Large fan-out
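The multiplier effect can be made concrete with a rough upper bound on how many timelines a single tweet may reach. The sketch below is illustrative only: the function name and the sample follower counts are hypothetical, and it assumes follower sets do not overlap, so it overestimates the real fan-out.

```python
def estimated_fanout(author_followers, retweeter_followers):
    """Upper bound on timeline deliveries for one tweet.

    Assumes no overlap between follower sets (an illustrative
    simplification), so the result is an overestimate.
    """
    return author_followers + sum(retweeter_followers)


# Hypothetical numbers: an author with 1,000 followers, retweeted by
# three accounts with 50,000, 2,000 and 300 followers.
deliveries = estimated_fanout(1_000, [50_000, 2_000, 300])
print(deliveries)
```

A single retweet by a large account dominates the total, which is why the behavioral response to an event is hard to size in advance.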
5. @Twitter | QCon NY 2013 5
Capacity Planning is Non-trivial (cont’d)
• Unforeseen events
Power failure
“Hurricane Sandy takes data centers offline with flooding, power outages”
Network issues
“Amazon's compute cloud has a networking hiccup”
• Evolving product development landscape
New features
New products
New partners
“Twitter Arrives on Wall Street, Via Bloomberg”
[1] http://arstechnica.com/information-technology/2012/10/hurricane-sandy-takes-data-centers-offline-with-flooding-power-outages/
[2] http://www.zdnet.com/amazons-compute-cloud-has-a-networking-hiccup-7000005776/
[3] Ballani et al. NSDI 2013 - https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final186.pdf
[4] http://dealbook.nytimes.com/2013/04/04/twitter-arrives-on-wall-street-via-bloomberg/
14 June 2013
6. @Twitter | QCon NY 2013 6
Capacity Planning is Non-trivial (cont’d)
• New hardware platforms
Purchase pipeline
How much and when to buy – Cost performance trade-off
7. @Twitter | QCon NY 2013 7
Events
• Planned
Even so, the traffic pattern depends on, for example:
Nature of the event
Behavioral response
Community effect
Demographics
8. @Twitter | QCon NY 2013 8
Events (cont’d)
• Unplanned
Intensity of the event
Population density
Japan Tsunami
New Zealand Earthquake
Hurricane Sandy
Flash Crash
Egyptian Revolution
Iran’s Disputed Election
Boston Explosion
Remembering Steve Jobs
9. @Twitter | QCon NY 2013 9
Events (cont’d)
• Unplanned (transient)
Duration
Type of the transient event
White House Rumor: AP account being hacked
[1]
[1] http://finance.yahoo.com/news/stocks-briefly-drop-recover-fake-172814328.html
10. @Twitter | QCon NY 2013 10
Events (cont’d)
• Black Swans (à la Nassim Taleb)
Planned events, but…
Superbowl’13 Blackout
Zidane in “Action”
“Hand of God”
Usain Bolt’s 100m World Record
11. @Twitter | QCon NY 2013 11
Events (cont’d)
• Events timeline
[Figure: events plotted along a time axis]
12. @Twitter | QCon NY 2013 12
Events’ Impact
• Differ in characteristics
Tweets
Photos
Vines
Now, Music
• Consequently, tax different services
Different capacity requests
14. @Twitter | QCon NY 2013 14
Capacity Modeling
• Takes core drivers as inputs to generate usage demand
Forecasts the amount of work based on core driver projections
• Relates the work metric to a primary resource to identify the capacity threshold
Primary resources
Computing power (CPU, RAM)
Storage (disk I/O, disk space)
Network (network bandwidth)
• Generate hardware demand based on the limiting primary resource
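One way to read the last bullet: size the fleet separately for each primary resource, then let the resource that needs the most servers set the hardware demand. The sketch below assumes this reading; the resource names, demand figures, and per-server capacities are hypothetical.

```python
import math


def servers_needed(demand, per_server_capacity):
    """Return (servers required per resource, name of the limiting resource).

    Both arguments map a resource name to a quantity in the same unit.
    The limiting resource is the one that requires the most servers.
    """
    per_resource = {
        name: math.ceil(demand[name] / per_server_capacity[name])
        for name in demand
    }
    limiting = max(per_resource, key=per_resource.get)
    return per_resource, limiting


# Hypothetical forecast demand and per-server capacity thresholds.
demand = {"cpu_cores": 1200, "disk_iops": 90_000, "network_mbps": 40_000}
capacity = {"cpu_cores": 16, "disk_iops": 2_000, "network_mbps": 1_000}

per_resource, limiting = servers_needed(demand, capacity)
print(per_resource, limiting)
```

With these numbers CPU is the limiting resource, so CPU drives the server count even though disk and network would be satisfied with fewer machines.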
15. @Twitter | QCon NY 2013 15
Core Drivers
• Underlying business metrics that drive demand for more capacity
Active Users
Tweets per second (TPS)
Favorites per second (FPS)
Requests per second (RPS)
• Normalized by Active Users to isolate user engagement
• Project user engagement and Active Users independently
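Projecting engagement and Active Users independently, then multiplying them back together, might look like the sketch below. It assumes simple linear trends for both series (the deck also shows a polynomial trend for some drivers); the helper names and the sample data are hypothetical.

```python
def fit_line(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...

    Returns (slope, intercept). Needs at least two points.
    """
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast_core_driver(active_users, driver_totals, periods_ahead):
    """Forecast a core driver as projected users x projected per-user rate."""
    # Normalize by Active Users to isolate engagement.
    engagement = [d / u for d, u in zip(driver_totals, active_users)]
    u_slope, u_icpt = fit_line(active_users)
    e_slope, e_icpt = fit_line(engagement)
    t = len(active_users) - 1 + periods_ahead
    return (u_slope * t + u_icpt) * (e_slope * t + e_icpt)


# Hypothetical history: four periods of Active Users and photos per day.
users = [100, 110, 120, 130]
photos = [200, 231, 264, 299]
print(forecast_core_driver(users, photos, periods_ahead=1))
```

Keeping the two trends separate makes it possible to see whether growth in a driver comes from more users or from each user doing more.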
16. @Twitter | QCon NY 2013 16
Core Drivers (cont’d)
[Figure: Active Users aka User Growth. Active user count over time, with a linear trend line]
[Figure: Normalized Core Drivers for Engagement. Per-active-user values over time for Favorites (polynomial trend line) and Retweets (linear trend line)]
17. @Twitter | QCon NY 2013 17
Core Drivers (cont’d)
[Figure: User Growth: Active Users over time, with a linear trend line]
[Figure: Engagement: Photos/Active User over time, with a linear trend line]
[Figure: Core Driver: Photos per Day over time, showing actual photos and the photos forecast]
18. @Twitter | QCon NY 2013 18
Capacity Threshold
• Primary resource scalability threshold
Determined by load testing
Synthetic load
Replaying production traffic
Real-time production traffic
Test systems may be
Isolated replicas of production
Staging systems in production
Production systems
[Figure: Average Response Times vs CPU. Axes: CPU and service response time; one point on the curve is marked X]
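A load test yields pairs of resource utilization and response time, and the capacity threshold can be read off as the highest utilization at which response time is still acceptable. The sketch below assumes that reading; the function name, the response-time target, and the sample measurements are hypothetical.

```python
def capacity_threshold(samples, max_response_ms):
    """Highest CPU level up to which every sample meets the response target.

    `samples` is an iterable of (cpu_percent, response_ms) pairs.
    Returns None if even the lowest CPU level misses the target.
    """
    threshold = None
    for cpu, response_ms in sorted(samples):
        if response_ms > max_response_ms:
            break  # stop at the first violation
        threshold = cpu
    return threshold


# Hypothetical load-test results: (CPU %, average response time in ms).
load_test = [(10, 20), (30, 22), (50, 25), (70, 40), (90, 120)]
print(capacity_threshold(load_test, max_response_ms=50))
```

Stopping at the first violation, rather than taking the highest passing sample overall, avoids treating a noisy good measurement beyond the knee of the curve as usable capacity.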
19. @Twitter | QCon NY 2013 19
Hardware Demand
• Core driver → capacity threshold → scaling formula → server count
• Example
Core driver: Requests per Second
Per server request throughput determined by capacity threshold
Scaling formula for Sizing
Number of Servers = (RPS) / Per Server Threshold
[Figure: Core driver (RPS) and server count over time. Series: RPS (Actuals), RPS (Forecast), # Servers (Actuals), # Servers (Forecast)]
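The sizing formula, Number of Servers = RPS / Per Server Threshold, can be applied directly to an RPS forecast. In the sketch below the result is rounded up to whole servers, which is an added assumption; the forecast values and the per-server threshold are hypothetical.

```python
import math


def server_count(rps, per_server_threshold):
    """Number of servers = RPS / per-server threshold, rounded up."""
    if per_server_threshold <= 0:
        raise ValueError("per_server_threshold must be positive")
    return math.ceil(rps / per_server_threshold)


# Hypothetical RPS forecast for three future periods, and a per-server
# threshold of 1,500 requests per second taken from load testing.
rps_forecast = [42_000, 45_500, 51_000]
servers_forecast = [server_count(rps, 1_500) for rps in rps_forecast]
print(servers_forecast)
```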
21. @Twitter | QCon NY 2013 21
Events: High Level Methodology
• Goal
Handle traffic “spike”
• Predict expected traffic based on historical and temporal statistical analysis
Statistical Metrics
Average
Standard deviation
Max
• Limitations
Changing usage patterns
Organic growth, behavioral, cultural
Event driven
How will the game turn out?
22. @Twitter | QCon NY 2013 22
Statistical Time Series Analysis
• Time window
Week over Week (WoW)
Month over Month (MoM)
Year over Year (YoY)
• Data Distribution
Normal, Log Normal, Multi-modal
Has implications on model selection
• Forecasting
Regression model
Linear, Spline
ARIMA
Trending, Seasonal, Residuals
23. @Twitter | QCon NY 2013 23
Superbowl 2013: Capacity Planning
• Assess capacity requirement based on 2011 and 2012 Superbowl traffic patterns
• Core driver selection
RPS (Reads)
TPS (Writes)
• What time granularity to use?
Avg TPS (Tweets per sec)
1s/10s/15s/30s Max TPS
1 min/5 min/10 min Max TPS
1 hr Max TPS
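The choice of granularity matters because wider windows smooth out short spikes. The sketch below assumes that "N-second Max TPS" means the highest average TPS over any window of N consecutive seconds; the deck does not define it, and the function name and sample counts are hypothetical.

```python
def max_windowed_tps(per_second_counts, window_s):
    """Highest average TPS over any window of `window_s` consecutive seconds."""
    if not 1 <= window_s <= len(per_second_counts):
        raise ValueError("window_s must be between 1 and the series length")
    window_sum = sum(per_second_counts[:window_s])
    best = window_sum
    # Slide the window one second at a time, updating the running sum.
    for i in range(window_s, len(per_second_counts)):
        window_sum += per_second_counts[i] - per_second_counts[i - window_s]
        best = max(best, window_sum)
    return best / window_s


# Hypothetical tweets-per-second counts with a one-second spike.
counts = [5, 5, 20, 5, 5, 5]
for window in (1, 2, 6):
    print(window, max_windowed_tps(counts, window))
```

On this series the 1-second maximum is much higher than the 6-second one, so planning against a coarse metric alone can hide the peak that the service actually has to absorb.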
24. @Twitter | QCon NY 2013 24
Superbowl 2013: Capacity Planning (cont’d)
• Which metric to use?
[Figure: candidate metrics plotted over time, annotated "Highly correlated"]
25. @Twitter | QCon NY 2013 25
Superbowl 2013: Capacity Planning (cont’d)
• Which metric to use?
Time sensitive – correlation may change YoY
[Figure: candidate metrics plotted over time, annotated "Highly correlated"]
27. @Twitter | QCon NY 2013 27
Superbowl 2013: Capacity Planning (cont’d)
• Ratio Analysis (R_n)
1s Max TPS    14 Day    28 Day    45 Day
2011          0.791     0.791     1.007
2012          1.062     0.858     0.580
28. @Twitter | QCon NY 2013 28
Superbowl 2013: Capacity Planning (cont’d)
• Distribution Analysis (α_n)
AVG (μ), STDEV (σ)
μ increased YoY (expected)
σ also increased YoY
1s Max TPS    T_n/μ     (T_n – μ)/σ
2011          1.448     1.746
2012          1.517     2.756
TPS during Superbowl has been moving right YoY
[Figure: TPS distribution plots labeled 2011 and 2012]
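The two columns of the distribution analysis can be computed from a baseline series and the event's peak. The sketch below assumes T_n is the event's 1s Max TPS and that μ and σ come from a baseline window of 1s Max TPS values before the event; the deck does not spell this out. It uses the population standard deviation, which is also an assumption, and the sample numbers are hypothetical.

```python
import statistics


def distribution_metrics(baseline_tps, event_tps):
    """Return (T_n / mu, (T_n - mu) / sigma) for an event peak.

    mu and sigma are the mean and population standard deviation of
    the baseline series.
    """
    mu = statistics.mean(baseline_tps)
    sigma = statistics.pstdev(baseline_tps)
    return event_tps / mu, (event_tps - mu) / sigma


# Hypothetical baseline of daily 1s Max TPS values and an event peak.
baseline = [80, 120, 80, 120]
ratio, alpha = distribution_metrics(baseline, event_tps=150)
print(ratio, alpha)
```

The second value expresses how many standard deviations the event sits above the baseline mean, which is what "moving right" into the tail refers to.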
29. @Twitter | QCon NY 2013 29
Superbowl 2013: Capacity Planning (cont’d)
• Distribution Analysis
YoY movement of TPS_Superbowl further into the right tail
Expectation: Progressive moves would be smaller
Overestimate α
Handle unplanned events
Business decision
30. @Twitter | QCon NY 2013 30
Superbowl 2013: Capacity Planning (cont’d)
• Historical component
Determine extent of movement (α_expected) of TPS_Superbowl into right tail
• Temporal component
Current μ_c
Current σ_c
• Capacity planning
Plan capacity corresponding to μ_c + α_expected × σ_c
Scenario Analysis (à la Global Macro Hedge Funds)
α_expected
o α_{n-1} (same as last year)
o α_{n-1} + (α_{n-1} + α_{n-2})/2 (extrapolate from last two years)
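The planning rule, current mean plus the expected α times the current standard deviation, can be evaluated once per scenario. The sketch below is a minimal illustration: it assumes α is the (T_n – μ)/σ value from the distribution analysis, takes 2.756 (the 2012 figure) for the "same as last year" scenario, and uses hypothetical values for the current mean and standard deviation. Other scenarios can be added to the dictionary in the same way.

```python
def capacity_targets(mu_c, sigma_c, alpha_scenarios):
    """Planned TPS capacity per scenario: mu_c + alpha * sigma_c."""
    return {
        name: mu_c + alpha * sigma_c
        for name, alpha in alpha_scenarios.items()
    }


# alpha for "same as last year" is the 2012 value of (T_n - mu) / sigma.
scenarios = {"same_as_last_year": 2.756}

# Hypothetical current mean and standard deviation of 1s Max TPS.
targets = capacity_targets(mu_c=10_000.0, sigma_c=1_500.0,
                           alpha_scenarios=scenarios)
print(targets)
```

Because μ_c and σ_c are measured close to the event, the target tracks organic growth, while α carries the historical behavior of the event itself.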