@Twitter | QCon NY 2013 1
Isolating Events from the Fail Whale
Arun Kejariwal, Bryce Yan
(@arun_kejariwal, @bryce_yan)
Capacity Engineering @ Twitter
June 2013
@Twitter | QCon NY 2013 2
Delivering Best User Experience
•  Performance
  Real time!
  End users' tolerance for latency has nosedived
  Average, p99, p999
  Variability on large clusters
  Tolerate faults when using commodity hardware
•  Availability
  Anytime, Anywhere, Any Device
•  Organic Growth
  Over 200M monthly active users
•  Events
  Planned, Unplanned
[1] Xu et al. NSDI 2013 - https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final77.pdf
[2] http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/Berkeley-Latency-Mar2012.pdf
[3] https://twitter.com/twitter/status/281051652235087872
@Twitter | QCon NY 2013 3
High Performance, Availability
•  Capacity Planning
  Throw hardware at the problem
  Operationally inefficient
  Even otherwise
o  How much?
o  What kind? (Inventory management etc.)
  Reactive approach
  Degraded user experience
o  Impacts the bottom line
  Overall goal
  Deliver best user experience
  Minimal operational footprint 
o  Factor in organic growth and lead times for provisioning additional capacity
@Twitter | QCon NY 2013 4
Capacity Planning is Non-trivial
•  Behavioral response is unpredictable
•  Multiplier Effect
  # Retweets x Followers of each retweeter
Large fan-out
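The multiplier effect can be illustrated with a rough sketch. It assumes, hypothetically, that every follower of the author and of each retweeter receives one delivery, and it ignores overlap between audiences. The function name and the follower counts are made up for illustration.

```python
def fanout_deliveries(author_followers, retweeter_followers):
    """Upper bound on timeline deliveries triggered by one tweet.

    Assumes one delivery per follower of the author and of each
    retweeter, with no de-duplication of overlapping audiences.
    """
    return author_followers + sum(retweeter_followers)


# Hypothetical numbers: an author with 1,000 followers, retweeted three times.
deliveries = fanout_deliveries(1_000, [50_000, 200, 1_200_000])
print(deliveries)
```

A single retweet by a widely followed account dominates the total, which is why the fan-out is hard to predict in advance.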
@Twitter | QCon NY 2013 5
Capacity Planning is Non-trivial (cont’d)
•  Unforeseen events
  Power failure
  “Hurricane Sandy takes data centers offline with flooding, power outages” [1]
  Network issues
  “Amazon's compute cloud has a networking hiccup” [2] [3]
•  Evolving product development landscape
  New features
  New products
  New partners
  “Twitter Arrives on Wall Street, Via Bloomberg” [4]
[1] http://arstechnica.com/information-technology/2012/10/hurricane-sandy-takes-data-centers-offline-with-flooding-power-outages/
[2] http://www.zdnet.com/amazons-compute-cloud-has-a-networking-hiccup-7000005776/
[3] Ballani et al. NSDI 2013 - https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final186.pdf
[4] http://dealbook.nytimes.com/2013/04/04/twitter-arrives-on-wall-street-via-bloomberg/
14 June 2013
@Twitter | QCon NY 2013 6
Capacity Planning is Non-trivial (cont’d)
•  New hardware platforms
  Purchase pipeline
  How much and when to buy – Cost performance trade-off
@Twitter | QCon NY 2013 7
Events
•  Planned
  Still, the traffic pattern depends on, for example:
  Nature of the event 
  Behavioral response
  Community effect
  Demographics
@Twitter | QCon NY 2013 8
Events (cont’d)
•  Unplanned
  Intensity of the event
  Population density
Examples: Japan Tsunami, New Zealand Earthquake, Hurricane Sandy, Flash Crash, Egyptian Revolution, Iran’s Disputed Election, Boston Explosion, Remembering Steve Jobs
@Twitter | QCon NY 2013 9
Events (cont’d)
•  Unplanned (transient)
  Duration
  Type of the transient event
White House Rumor: AP account being hacked [1]
[1] http://finance.yahoo.com/news/stocks-briefly-drop-recover-fake-172814328.html
@Twitter | QCon NY 2013 10
Events (cont’d)
•  Black Swans (à la Nassim Taleb)
  Planned events, but…
Superbowl’13 Blackout
Zidane in “Action”
“Hand of God”
Usain Bolt’s 100m World Record
@Twitter | QCon NY 2013 11
Events (cont’d)
•  Events timeline
[Figure: timeline of events; x-axis: Time]
@Twitter | QCon NY 2013 12
Events’ Impact
•  Differ in characteristics
  Tweets
  Photos
  Vines
  Now, Music
•  Consequently, tax different services
  Different capacity requests
@Twitter | QCon NY 2013 13
Capacity Modeling Overview
@Twitter | QCon NY 2013 14
Capacity Modeling
•  Takes core drivers as inputs to generate usage demand
  Forecasts the amount of work based on core driver projections
•  Relates the work metric to a primary resource to identify the capacity threshold
  Primary resources
  Computing power (CPU, RAM)
  Storage (disk I/O, disk space)
  Network (network bandwidth)
•  Generate hardware demand based on the limiting primary resource
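As a sketch of the last step, the snippet below sizes each primary resource separately and lets the most demanding one set the hardware demand. The resource names, demand figures, and per-server capacities are hypothetical, and the per-server numbers are assumed to already be capped at the capacity threshold.

```python
import math


def servers_needed(work_forecast, per_server_capacity):
    """Return servers required per primary resource and the limiting resource.

    work_forecast: forecast demand per resource (e.g. CPU cores, IOPS, Mbps)
    per_server_capacity: usable capacity of one server for the same resource
    """
    per_resource = {
        name: math.ceil(work_forecast[name] / per_server_capacity[name])
        for name in work_forecast
    }
    # The resource that needs the most servers limits the deployment.
    limiting = max(per_resource, key=per_resource.get)
    return per_resource, limiting


# Hypothetical demand and per-server capacities.
demand = {"cpu_cores": 4_800, "disk_iops": 90_000, "network_mbps": 120_000}
capacity = {"cpu_cores": 16, "disk_iops": 500, "network_mbps": 800}
per_resource, limiting = servers_needed(demand, capacity)
print(per_resource, limiting)
```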
@Twitter | QCon NY 2013 15
Core Drivers
•  Underlying business metrics that drive demand for more capacity
  Active Users
  Tweets per second (TPS)
  Favorites per second (FPS)
  Requests per second (RPS)
•  Normalized by Active Users to isolate user engagement
•  Project user engagement and Active Users independently
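One way to read the last two bullets: divide the core driver by Active Users to get a per-user engagement series, project that series and Active Users separately, then multiply the projections. The sketch below assumes a straight-line trend for both series (the talk does not prescribe a model here), and all function names and numbers are illustrative.

```python
def linear_fit(values):
    """Ordinary least-squares line through (0, v0), (1, v1), ...; needs 2+ points."""
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    slope = sxy / sxx
    return slope, v_mean - slope * t_mean


def linear_forecast(values, horizon):
    """Extend the fitted line `horizon` steps past the end of the history."""
    slope, intercept = linear_fit(values)
    n = len(values)
    return [slope * t + intercept for t in range(n, n + horizon)]


def forecast_core_driver(active_users, driver_totals, horizon):
    """Project users and per-user engagement independently, then recombine."""
    per_user = [total / users for total, users in zip(driver_totals, active_users)]
    users_fc = linear_forecast(active_users, horizon)
    per_user_fc = linear_forecast(per_user, horizon)
    return [u * e for u, e in zip(users_fc, per_user_fc)]


# Hypothetical history: users grow by 10 per period, engagement by 0.1 per user.
users = [100, 110, 120, 130]
photos = [100, 121, 144, 169]
print(forecast_core_driver(users, photos, horizon=2))
```

Keeping the two projections separate makes it possible to tell whether growth in a core driver comes from more users or from more activity per user.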
@Twitter | QCon NY 2013 16
Core Drivers (cont’d)
[Figure: "Active Users aka User Growth". ActiveUserCount vs. Time; series: Active Users, Linear (Active Users)]
[Figure: "Normalized Core Drivers for Engagement". PerActiveUserValues vs. Time; series: Favorites, Retweets, Poly. (Favorites), Linear (Retweets)]
@Twitter | QCon NY 2013 17
Core Drivers (cont’d)
[Figure: "User Growth: Active Users" vs. Time; series: Active Users, Linear (Active Users)]
[Figure: "Engagement: Photos/Active User" vs. Time; series: Photos, Linear (Photos)]
[Figure: "Core Driver: Photos per Day" vs. Time; series: Photos, Photos Forecast]
@Twitter | QCon NY 2013 18
Capacity Threshold
•  Primary resource scalability threshold
  Determined by load testing
  Synthetic load
  Replaying production traffic
  Real-time production traffic
  Test systems may be
  Isolated replicas of production
  Staging systems in production
  Production systems
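A minimal sketch of reading a threshold off load-test results: it picks the highest CPU utilization at which the average response time still meets a target. The response-time target and the sample data are hypothetical; the talk does not state how the threshold was chosen.

```python
def capacity_threshold(samples, max_response_ms):
    """Highest CPU utilization whose response time stays within the target.

    samples: (cpu_percent, avg_response_ms) pairs from a load test
    Returns None if even the lightest load misses the target.
    """
    threshold = None
    for cpu, response_ms in sorted(samples):
        if response_ms > max_response_ms:
            break  # stop at the first breach; later dips are not trusted
        threshold = cpu
    return threshold


# Hypothetical load-test results.
load_test = [(10, 42), (30, 45), (50, 51), (70, 64), (80, 98), (90, 240)]
print(capacity_threshold(load_test, max_response_ms=100))
```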
[Figure: "Average Response Times vs CPU". ServiceResponseTime plotted against CPU (axis ticks 0.00 to 100.00), with a point marked "X"]
@Twitter | QCon NY 2013 19
Hardware Demand
•  Core driver → capacity threshold → scaling formula → server count
•  Example
  Core driver: Requests per Second
  Per server request throughput determined by capacity threshold
  Scaling formula for Sizing
  Number of Servers = (RPS) / Per Server Threshold
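The scaling formula can be applied to a whole forecast series. In this sketch the result is rounded up to whole servers, which is an assumption on top of the slide's formula; the per-server threshold and the RPS values are hypothetical.

```python
import math


def server_count(rps, per_server_threshold):
    """Number of Servers = RPS / Per Server Threshold, rounded up."""
    return math.ceil(rps / per_server_threshold)


# Hypothetical RPS forecast and per-server threshold from load testing.
rps_forecast = [120_000, 125_001, 140_000]
per_server_threshold = 2_500
print([server_count(rps, per_server_threshold) for rps in rps_forecast])
```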
[Figure: CoreDriver (RPS) / ServerCount vs. Time; series: RPS (Actuals), RPS (Forecast), # Servers (Actuals), # Servers (Forecast)]
@Twitter | QCon NY 2013 20
Deep Dive and Superbowl 2013
@Twitter | QCon NY 2013 21
Events: High Level Methodology
•  Goal
  Handle traffic “spike”
•  Predict expected traffic based on historical and temporal statistical analysis
  Statistical Metrics
  Average
  Standard deviation
  Max
•  Limitations
  Changing usage patterns
  Organic growth, behavioral, cultural 
  Event driven
  How will the game turn out?
@Twitter | QCon NY 2013 22
Statistical Time Series Analysis
•  Time window
  Week over Week (WoW)
  Month over Month (MoM)
  Year over Year (YoY)
•  Data Distribution
  Normal, Log Normal, Multi-modal
  Has implications on model selection
•  Forecasting
  Regression model
  Linear, Spline
  ARIMA
  Trending, Seasonal, Residuals
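As a small illustration of the time-window idea, the sketch below computes a Week over Week ratio from a daily series by comparing each day with the same weekday one week earlier. It assumes one value per day with no gaps and no zero values; the data is made up.

```python
def week_over_week(daily_values):
    """Ratio of each day's value to the value seven days earlier."""
    return [
        current / previous
        for previous, current in zip(daily_values, daily_values[7:])
    ]


# Hypothetical two weeks of daily peak TPS: the second week is 10% higher.
series = [100] * 7 + [110] * 7
print(week_over_week(series))
```

Comparing like weekdays removes the weekly seasonality from the ratio; MoM and YoY comparisons follow the same pattern with a longer lag.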
@Twitter | QCon NY 2013 23
Superbowl 2013: Capacity Planning
•  Assess capacity requirement based on 2011 and 2012 Superbowl traffic patterns

•  Core driver selection
  RPS (Reads)
  TPS (Writes)

•  What time granularity to use?
  Avg TPS (Tweets per sec)
  1s/10s/15s/30s Max TPS
  1 min/5 min/10 min Max TPS
  1 hr Max TPS
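The choice of granularity matters because averaging over a longer window flattens short spikes. The sketch below computes Max TPS at several granularities from per-second tweet counts. It assumes sliding windows and defines Max TPS for a window as the highest per-second average over any window of that length; the talk does not spell out its exact definition, and the counts are invented.

```python
def max_tps(per_second_counts, window_s):
    """Peak tweets per second, averaged over sliding windows of window_s seconds."""
    if window_s < 1 or window_s > len(per_second_counts):
        raise ValueError("window must fit inside the series")
    window_sum = sum(per_second_counts[:window_s])
    best = window_sum
    for i in range(window_s, len(per_second_counts)):
        # Slide the window one second forward.
        window_sum += per_second_counts[i] - per_second_counts[i - window_s]
        best = max(best, window_sum)
    return best / window_s


# Hypothetical per-second tweet counts with a one-second spike.
counts = [5, 5, 40, 5, 5, 5]
for window in (1, 2, 6):
    print(window, max_tps(counts, window))
```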
@Twitter | QCon NY 2013 24
Superbowl 2013: Capacity Planning (cont’d)
•  Which metric to use?
[Figure: time-series plot (x-axis: Time), annotated "Highly correlated"]
@Twitter | QCon NY 2013 25
Superbowl 2013: Capacity Planning (cont’d)
•  Which metric to use?
  Time sensitive – correlation may change YoY
[Figure: time-series plot (x-axis: Time), annotated "Highly correlated"]
@Twitter | QCon NY 2013 26
Superbowl 2013: Capacity Planning (cont’d)
•  Approaches
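  $\mathrm{TPS}_{\mathrm{Superbowl}}$ (denoted by $T_n$)
  $d$-day historical window
  $\mathrm{TPS}_{n-1}, \mathrm{TPS}_{n-2}, \ldots, \mathrm{TPS}_{n-d}$
  Ratio Analysis
  $R_n = T_n / \max(\mathrm{TPS}_{n-1}, \mathrm{TPS}_{n-2}, \ldots, \mathrm{TPS}_{n-d})$
  Distribution Analysis
  $\alpha_n = (T_n - \mathrm{AVG}(\mathrm{TPS}_{n-1}, \mathrm{TPS}_{n-2}, \ldots, \mathrm{TPS}_{n-d})) / \mathrm{STDEV}(\mathrm{TPS}_{n-1}, \mathrm{TPS}_{n-2}, \ldots, \mathrm{TPS}_{n-d})$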
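Both measures are straightforward to compute once the event-day value and the historical window are in hand. The sketch below assumes STDEV means the sample standard deviation (Python's `statistics.stdev`); the slide does not say which variant was used, and the numbers are made up.

```python
from statistics import mean, stdev


def ratio_analysis(event_tps, history):
    """R_n: event-day TPS relative to the maximum over the d-day window."""
    return event_tps / max(history)


def distribution_analysis(event_tps, history):
    """alpha_n: how many standard deviations the event sits above the window mean."""
    return (event_tps - mean(history)) / stdev(history)


# Hypothetical 3-day window and event-day value.
window = [10, 12, 14]
event = 20
print(ratio_analysis(event, window))
print(distribution_analysis(event, window))
```

The ratio only looks at the single highest day in the window, while the distribution measure also accounts for how much the window varies.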
@Twitter | QCon NY 2013 27
Superbowl 2013: Capacity Planning (cont’d)
•  Ratio Analysis ($R_n$)
  1s Max TPS
Year  14 Day  28 Day  45 Day
2011  0.791   0.791   1.007
2012  1.062   0.858   0.580
@Twitter | QCon NY 2013 28
Superbowl 2013: Capacity Planning (cont’d)
•  Distribution Analysis ($\alpha_n$)
  AVG ($\mu$), STDEV ($\sigma$)
  $\mu$ increased YoY (expected)
  $\sigma$ also increased YoY
  1s Max TPS
Year  $T_n/\mu$  $(T_n - \mu)/\sigma$
2011  1.448      1.746
2012  1.517      2.756
[Figure: 2011 and 2012 panels with $\mu$ marked; annotation: "TPS during Superbowl has been moving right YoY"]
@Twitter | QCon NY 2013 29
Superbowl 2013: Capacity Planning (cont’d)
•  Distribution Analysis
  YoY movement of $\mathrm{TPS}_{\mathrm{Superbowl}}$ further into the right tail
  Expectation: Progressive moves would be smaller

  Overestimate α
  Handle unplanned events
  Business decision
@Twitter | QCon NY 2013 30
Superbowl 2013: Capacity Planning (cont’d)
•  Historical component
  Determine extent of movement ($\alpha_{\mathrm{expected}}$) of $\mathrm{TPS}_{\mathrm{Superbowl}}$ into right tail

•  Temporal component
  Current $\mu_c$
  Current $\sigma_c$

•  Capacity planning
  Plan capacity corresponding to $\mu_c + \alpha_{\mathrm{expected}} \cdot \sigma_c$
  Scenario Analysis (à la Global Macro Hedge Funds)
  $\alpha_{\mathrm{expected}}$
o  $\alpha_{n-1}$ (same as last year)
o  $\alpha_{n-1} + (\alpha_{n-1} + \alpha_{n-2})/2$ (extrapolate from last two years)
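The scenario analysis amounts to evaluating the planning formula for each candidate value of the expected α. In the sketch below the two α values are the 1s Max TPS figures for 2011 and 2012 from the distribution analysis table, and the second scenario is computed exactly as the expression is written on the slide. The current mean and standard deviation are hypothetical placeholders, so the printed capacities are illustrative only.

```python
def plan_capacity(mu_c, sigma_c, alpha_expected):
    """Capacity target: current mean plus alpha_expected standard deviations."""
    return mu_c + alpha_expected * sigma_c


# alpha values for 2011 and 2012 (1s Max TPS) from the distribution analysis.
alpha_2011, alpha_2012 = 1.746, 2.756

scenarios = {
    "same as last year": alpha_2012,
    # Expression transcribed as written on the slide.
    "extrapolated from last two years": alpha_2012 + (alpha_2012 + alpha_2011) / 2,
}

# Hypothetical current mean and standard deviation of 1s Max TPS.
mu_c, sigma_c = 10_000.0, 2_000.0
for name, alpha in scenarios.items():
    print(name, round(plan_capacity(mu_c, sigma_c, alpha)))
```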
@Twitter | QCon NY 2013 31
Superbowl 2013: Capacity Planning (cont’d)
•  Capacity planning
  1s Max TPS
  $\alpha_{n-1}$ → 20K+
  $\alpha_{n-1} + (\alpha_{n-1} + \alpha_{n-2})/2$ → 22K+
@Twitter | QCon NY 2013 32
Superbowl 2013: Capacity Planning (cont’d)
•  Validation
  1s Max TPS
  $\alpha_{\mathrm{observed}} < \alpha_{\mathrm{expected}}$


  Twitter was highly available during Superbowl 2013
  Over-allocation concerns?
  Minimal 
  Limited to a few services
  Seamlessly handled traffic spike due to the Superbowl 2013 Blackout
@Twitter | QCon NY 2013 33
Join the Flock
•  We are hiring!
  https://twitter.com/JoinTheFlock
  https://twitter.com/jobs

Twitter QCon NY 2013: Isolating Events from the Fail Whale

  • 1. @Twitter | QCon NY 2013 1 Isolating Events from the Fail Whale Arun Kejariwal, Bryce Yan (@arun_kejariwal, @bryce_yan) Capacity Engineering @ Twitter June 2013
  • 2. Delivering Best User Experience
      •  Performance
          - Real time!
          - Latency tolerance of end users has nosedived
          - Average, p99, p999
          - Variability on large clusters
          - Tolerate faults when using commodity hardware
      •  Availability
          - Anytime, Anywhere, Any Device
      •  Organic Growth
          - Over 200M monthly active users
      •  Events
          - Planned, Unplanned
      [1] Xu et al. NSDI 2013 - https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final77.pdf
      [2] http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/Berkeley-Latency-Mar2012.pdf
      [3] https://twitter.com/twitter/status/281051652235087872
  • 3. High Performance, Availability
      •  Capacity Planning
          - Throw hardware at the problem
          - Operationally inefficient
          - Even otherwise
              o  How much?
              o  What kind? (Inventory management etc.)
          - Reactive approach
          - Degraded user experience
              o  Impacts the bottom line
          - Overall goal
          - Deliver best user experience
          - Minimal operational footprint
              o  Factor in organic growth and lead times for provisioning additional capacity
  • 4. Capacity Planning is Non-trivial
      •  Behavioral response is unpredictable
      •  Multiplier Effect
          - # Retweets x Followers of each retweeter
          - Large fan-out
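The multiplier effect on this slide can be illustrated with a rough sketch. The function name and the follower counts below are hypothetical, and the estimate ignores audience overlap between retweeters, so it is best read as an upper bound rather than a measured delivery count.

```python
def retweet_fanout(retweeter_followers):
    """Estimate extra timeline deliveries caused by retweets of one tweet.

    retweeter_followers: one follower count per retweeter (hypothetical data).
    Each retweet is delivered to every follower of that retweeter, so the
    total is the sum of follower counts. Overlapping audiences are ignored.
    """
    return sum(retweeter_followers)


# Three retweets, one of them by an account with a very large following.
extra_deliveries = retweet_fanout([50_000, 200, 3_000_000])
print(extra_deliveries)
```

A single retweet by a heavily followed account dominates the total, which is why the fan-out is hard to predict from the retweet count alone.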
  • 5. Capacity Planning is Non-trivial (cont’d)
      •  Unforeseen events
          - Power failure
          - “Hurricane Sandy takes data centers offline with flooding, power outages” [1]
          - Network issues
          - “Amazon's compute cloud has a networking hiccup” [2]
      •  Evolving product development landscape
          - New features
          - New products
          - New partners
          - “Twitter Arrives on Wall Street, Via Bloomberg” [4]
      [1] http://arstechnica.com/information-technology/2012/10/hurricane-sandy-takes-data-centers-offline-with-flooding-power-outages/
      [2] http://www.zdnet.com/amazons-compute-cloud-has-a-networking-hiccup-7000005776/
      [3] Ballani et al. NSDI 2013 - https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final186.pdf
      [4] http://dealbook.nytimes.com/2013/04/04/twitter-arrives-on-wall-street-via-bloomberg/
      14 June 2013
  • 6. Capacity Planning is Non-trivial (cont’d)
      •  New hardware platforms
          - Purchase pipeline
          - How much and when to buy: cost-performance trade-off
  • 7. Events
      •  Planned
          - Still, the traffic pattern is subject to, say,
              o  Nature of the event
              o  Behavioral response
              o  Community effect
              o  Demographics
  • 8. Events (cont’d)
      •  Unplanned
          - Intensity of the event
          - Population density
      Examples: Japan Tsunami; New Zealand Earthquake; Hurricane Sandy; Flash Crash; Egyptian Revolution; Iran’s Disputed Election; Boston Explosion; Remembering Steve Jobs
  • 9. Events (cont’d)
      •  Unplanned (transient)
          - Duration
          - Type of the transient event
      Example: White House Rumor: AP account being hacked [1]
      [1] http://finance.yahoo.com/news/stocks-briefly-drop-recover-fake-172814328.html
  • 10. Events (cont’d)
      •  Black Swans (à la Nassim Taleb)
          - Planned events, but…
      Examples: Superbowl’13 Blackout; Zidane in “Action”; “Hand of God”; Usain Bolt’s 100m World Record
  • 11. Events (cont’d)
      •  Events timeline
      [Figure: events timeline; x-axis: Time]
  • 12. Events’ Impact
      •  Differ in characteristics
          - Tweets
          - Photos
          - Vines
          - Now, Music
      •  Consequently, tax different services
          - Different capacity requests
  • 13. Capacity Modeling Overview
  • 14. Capacity Modeling
      •  Takes core drivers as inputs to generate usage demand
          - Forecasts the amount of work based on core driver projections
      •  Relates the work metric to a primary resource to identify the capacity threshold
          - Primary resources
              o  Computing power (CPU, RAM)
              o  Storage (disk I/O, disk space)
              o  Network (network bandwidth)
      •  Generates hardware demand based on the limiting primary resource
  • 15. Core Drivers
      •  Underlying business metrics that drive demand for more capacity
          - Active Users
          - Tweets per second (TPS)
          - Favorites per second (FPS)
          - Requests per second (RPS)
      •  Normalized by Active Users to isolate user engagement
      •  Project user engagement and Active Users independently
  • 16. Core Drivers (cont’d)
      [Chart: Active Users, aka User Growth. Active user count vs. time; series: Active Users, Linear (Active Users)]
      [Chart: Normalized Core Drivers for Engagement. Per-active-user values vs. time; series: Favorites, Retweets, Poly. (Favorites), Linear (Retweets)]
  • 17. Core Drivers (cont’d)
      [Chart: User Growth: Active Users vs. time; series: Active Users, Linear (Active Users)]
      [Chart: Engagement: Photos/Active User vs. time; series: Photos, Linear (Photos)]
      [Chart: Core Driver: Photos per Day vs. time; series: Photos, Photos Forecast]
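A minimal sketch of the idea on the Core Drivers slides: project Active Users and per-user engagement independently, then multiply the two projections to get the core driver forecast. The straight-line fit, the function names, and the sample series are assumptions made for illustration; the deck also shows a polynomial trend for some series and does not prescribe a specific fitting method.

```python
def fit_line(ys):
    """Ordinary least-squares line through (0, ys[0]), (1, ys[1]), ...

    Returns (slope, intercept). Needs at least two points.
    """
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def forecast_core_driver(active_users, per_user_rate, periods_ahead):
    """Project both series independently, then multiply the projections.

    Both series are assumed to cover the same time periods.
    """
    t = len(active_users) - 1 + periods_ahead
    u_slope, u_intercept = fit_line(active_users)
    r_slope, r_intercept = fit_line(per_user_rate)
    return (u_slope * t + u_intercept) * (r_slope * t + r_intercept)


# Hypothetical history: active users and photos per active user per day.
users = [100, 110, 120, 130]
photos_per_user = [1.0, 1.1, 1.2, 1.3]
photos_forecast = forecast_core_driver(users, photos_per_user, periods_ahead=2)
```

Keeping the two projections separate makes it possible to tell whether growth in the core driver comes from more users or from more activity per user.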
  • 18. Capacity Threshold
      •  Primary resource scalability threshold
          - Determined by load testing
              o  Synthetic load
              o  Replaying production traffic
              o  Real-time production traffic
          - Test systems may be
              o  Isolated replicas of production
              o  Staging systems in production
              o  Production systems
      [Chart: Average Response Times vs CPU; x-axis: CPU, y-axis: service response time]
  • 19. Hardware Demand
      •  Core driver → capacity threshold → scaling formula → server count
      •  Example
          - Core driver: Requests per Second (RPS)
          - Per-server request throughput determined by capacity threshold
          - Scaling formula for sizing
              o  Number of Servers = RPS / Per-Server Threshold
      [Chart: Core Driver (RPS) / Server Count vs. time; series: RPS (Actuals), RPS (Forecast), # Servers (Actuals), # Servers (Forecast)]
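The scaling formula on this slide can be sketched as follows. Rounding up to a whole server is an assumption added here (the slide gives only the division), and the function name and the sample numbers are hypothetical.

```python
import math


def servers_needed(forecast_rps, per_server_rps):
    """Number of Servers = RPS / Per-Server Threshold, rounded up.

    forecast_rps: forecast requests per second for the service.
    per_server_rps: per-server throughput at the capacity threshold,
        as determined by load testing.
    """
    if per_server_rps <= 0:
        raise ValueError("per_server_rps must be positive")
    return math.ceil(forecast_rps / per_server_rps)


# Hypothetical forecast of 120,000 RPS and a threshold of 2,500 RPS per server.
print(servers_needed(120_000, 2_500))
```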
  • 20. Deep Dive and Superbowl 2013
  • 21. Events: High Level Methodology
      •  Goal
          - Handle traffic “spike”
      •  Predict expected traffic based on historical and temporal statistical analysis
          - Statistical metrics
              o  Average
              o  Standard deviation
              o  Max
      •  Limitations
          - Changing usage patterns
              o  Organic growth, behavioral, cultural
          - Event driven
              o  How will a game turn out?
  • 22. Statistical Time Series Analysis
      •  Time window
          - Week over Week (WoW)
          - Month over Month (MoM)
          - Year over Year (YoY)
      •  Data Distribution
          - Normal, Log Normal, Multi-modal
          - Has implications for model selection
      •  Forecasting
          - Regression model
              o  Linear, Spline
          - ARIMA
              o  Trending, Seasonal, Residuals
  • 23. Superbowl 2013: Capacity Planning
      •  Assess capacity requirement based on 2011 and 2012 Superbowl traffic patterns
      •  Core driver selection
          - RPS (Reads)
          - TPS (Writes)
      •  What time granularity to use?
          - Avg TPS (Tweets per sec)
          - 1s/10s/15s/30s Max TPS
          - 1 min/5 min/10 min Max TPS
          - 1 hr Max TPS
  • 24. Superbowl 2013: Capacity Planning (cont’d)
      •  Which metric to use?
      [Chart: candidate metrics vs. time; annotation: Highly correlated]
  • 25. Superbowl 2013: Capacity Planning (cont’d)
      •  Which metric to use?
          - Time sensitive: correlation may change YoY
      [Chart: candidate metrics vs. time; annotation: Highly correlated]
  • 26. Superbowl 2013: Capacity Planning (cont’d)
      •  Approaches
          - TPS_Superbowl (denoted by T_n)
          - d-day historical window
              o  TPS_{n-1}, TPS_{n-2}, …, TPS_{n-d}
          - Ratio Analysis
              o  R_n = T_n / Max(TPS_{n-1}, TPS_{n-2}, …, TPS_{n-d})
          - Distribution Analysis
              o  α_n = (T_n - AVG(TPS_{n-1}, TPS_{n-2}, …, TPS_{n-d})) / STDEV(TPS_{n-1}, TPS_{n-2}, …, TPS_{n-d})
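The two approaches on this slide can be sketched in a few lines. The function names and the sample window are hypothetical, and STDEV is taken here to be the sample standard deviation (`statistics.stdev`), which is an assumption because the slide does not say whether the sample or population form was used.

```python
from statistics import mean, stdev


def ratio_analysis(event_tps, window_tps):
    """R_n: event-day TPS divided by the max TPS over the d-day window."""
    return event_tps / max(window_tps)


def distribution_analysis(event_tps, window_tps):
    """alpha_n: how many standard deviations the event-day TPS sits
    above the mean of the d-day window (sample standard deviation)."""
    return (event_tps - mean(window_tps)) / stdev(window_tps)


# Hypothetical 3-day window of max TPS values, followed by the event day.
window = [10, 12, 14]
event = 18
r_n = ratio_analysis(event, window)
alpha_n = distribution_analysis(event, window)
```

The ratio depends on a single extreme value in the window, while the distribution measure uses the whole window, so the two can move differently as the window length d changes.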
  • 27. Superbowl 2013: Capacity Planning (cont’d)
      •  Ratio Analysis (R_n)
          - 1s Max TPS

              Year    14 Day    28 Day    45 Day
              2011    0.791     0.791     1.007
              2012    1.062     0.858     0.580
  • 28. Superbowl 2013: Capacity Planning (cont’d)
      •  Distribution Analysis (α_n)
          - AVG (μ), STDEV (σ)
          - μ increased YoY (expected)
          - σ also increased YoY
          - 1s Max TPS

              Year    T_n/μ    (T_n - μ)/σ
              2011    1.448    1.746
              2012    1.517    2.756

      [Figure labels: μ, 2011, 2012. Caption: TPS during Superbowl has been moving right YoY]
  • 29. Superbowl 2013: Capacity Planning (cont’d)
      •  Distribution Analysis
          - YoY movement of TPS_Superbowl further into the right tail
          - Expectation: Progressive moves would be smaller
          - Overestimate α
              o  Handle unplanned events
              o  Business decision
  • 30. Superbowl 2013: Capacity Planning (cont’d)
      •  Historical component
          - Determine extent of movement (α_expected) of TPS_Superbowl into the right tail
      •  Temporal component
          - Current μ_c
          - Current σ_c
      •  Capacity planning
          - Plan capacity corresponding to μ_c + α_expected * σ_c
          - Scenario Analysis (à la Global Macro Hedge Funds)
          - α_expected
              o  α_{n-1} (same as last year)
              o  α_{n-1} + (α_{n-1} + α_{n-2})/2 (extrapolate from last two years)
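A rough sketch of the planning rule on this slide: capacity is set at μ_c + α_expected * σ_c, with α_expected taken from one of the two scenarios. The second scenario is transcribed exactly as it appears on the slide. The function names and the current μ_c and σ_c values are hypothetical; the two α values are the 2011 and 2012 figures from the distribution analysis table.

```python
def alpha_scenarios(alpha_prev, alpha_prev2):
    """Candidate values of alpha_expected.

    alpha_prev: alpha_{n-1}, last year's value.
    alpha_prev2: alpha_{n-2}, the value from two years ago.
    The "extrapolated" formula is copied from the slide as written.
    """
    return {
        "same_as_last_year": alpha_prev,
        "extrapolated": alpha_prev + (alpha_prev + alpha_prev2) / 2,
    }


def planned_capacity(mu_current, sigma_current, alpha_expected):
    """Capacity target: mu_c + alpha_expected * sigma_c."""
    return mu_current + alpha_expected * sigma_current


# alpha values for 2012 and 2011 from the deck; mu_c and sigma_c are made up.
scenarios = alpha_scenarios(alpha_prev=2.756, alpha_prev2=1.746)
targets = {
    name: planned_capacity(10_000, 2_000, alpha)
    for name, alpha in scenarios.items()
}
```

Choosing between the scenarios is the business decision the deck refers to: a larger α_expected buys headroom for unplanned events at the cost of over-allocation.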
  • 31. Superbowl 2013: Capacity Planning (cont’d)
      •  Capacity planning
          - 1s Max TPS
              o  α_{n-1} → 20K+
              o  α_{n-1} + (α_{n-1} + α_{n-2})/2 → 22K+
  • 32. Superbowl 2013: Capacity Planning (cont’d)
      •  Validation
          - 1s Max TPS
              o  α_observed < α_expected
          - Twitter was highly available during Superbowl 2013
          - Over-allocation concerns?
              o  Minimal
              o  Limited to a few services
          - Seamlessly handled the traffic spike due to the Superbowl 2013 Blackout
  • 33. Join the Flock
      •  We are hiring!
          - https://twitter.com/JoinTheFlock
          - https://twitter.com/jobs