SlideShare uma empresa Scribd logo
1 de 17
Michael Kehoe
Senior Site Reliability Engineer
LinkedIn
SouthBay SRE Meetup
LinkedIn Traffic Shifting
2
$ whoami
Michael Kehoe
• Sr Site Reliability Engineer (SRE)
• Member of PROD-SRE
• https://www.linkedin.com/in/michaelkkehoe
3
LinkedIn Multicolo History
4
What is a Traffic Shift?
• Edge (PoP) shift
• Datacenter Load shift
• Single Master Failovers
5
Why do we do traffic shifts
• To mitigate user impact from problems with a 3rd party provider or
LinkedIn’s infrastructure/ services
• To validate Disaster Recovery (DR) in case of any datacenter failure
• To validate and test capacity headroom across our datacenters
• To expose bugs and suboptimal configurations by load testing one or
more datacenters
• To perform planned maintenance
• To validate and exercise the traffic shift automation
6
Traffic shifting
How do we do it?
7
Edge Traffic shifts
How does it work
• We use IPVS to load balance at our edges
• We can withdraw anycast routes to remove traffic from that
PoP
• Health checks on our edge proxy are tested by DNS providers to
verify whether that PoP is in rotation
• We can fail those health checks to remove unicast traffic from
that PoP
8
Edge Traffic shifts
9
Datacenter Traffic shifts
How does it work?
• Different traffic types are partitioned and controlled separately
• Logged-in vs Logged-out
• CDN
• Monitoring
• Microsites
• Logged-in users are placed into ‘buckets’ and have primary/
secondary datacenter assignments
• Buckets are marked online/ offline to move site traffic
10
Mitigating Impact
What a traffic shift looks like
11
Load testing
How do we do it?
12
Load testing
How do we do it?
13
Single Master Failover
How does it work?
• Only used in extreme cases
• Leverage distributed locking in Apache Zookeeper
• Single master services have a spring component that checks the
mastership of the service in a particular datacenter
14
Single Master Failover
How does it work?
15
Conclusion
• The best way to prepare for a disaster is to practice one regularly!
• Tooling and automation is your best friend during an outage
• Capacity planning/ management is extremely important
Questions?
16
Thank You
©2014 LinkedIn Corporation. All Rights Reserved.

Mais conteúdo relacionado

Mais procurados

Microsoft Azure and Windows Application monitoring
Microsoft Azure and Windows Application monitoringMicrosoft Azure and Windows Application monitoring
Microsoft Azure and Windows Application monitoringSite24x7
 
Introducing Tupilak, Snowplow's unified log fabric
Introducing Tupilak, Snowplow's unified log fabricIntroducing Tupilak, Snowplow's unified log fabric
Introducing Tupilak, Snowplow's unified log fabricAlexander Dean
 
[Webinar] AWS Monitoring with Site24x7
[Webinar] AWS Monitoring with Site24x7[Webinar] AWS Monitoring with Site24x7
[Webinar] AWS Monitoring with Site24x7Site24x7
 
Server Monitoring from the Cloud
Server Monitoring from the CloudServer Monitoring from the Cloud
Server Monitoring from the CloudSite24x7
 
URP? Excuse You! The Three Metrics You Have to Know
URP? Excuse You! The Three Metrics You Have to Know URP? Excuse You! The Three Metrics You Have to Know
URP? Excuse You! The Three Metrics You Have to Know confluent
 
Metrics Are Not Enough: Monitoring Apache Kafka and Streaming Applications
Metrics Are Not Enough: Monitoring Apache Kafka and Streaming ApplicationsMetrics Are Not Enough: Monitoring Apache Kafka and Streaming Applications
Metrics Are Not Enough: Monitoring Apache Kafka and Streaming Applicationsconfluent
 
VMware Monitoring-Discover And Monitor Your Virtual Environment
VMware Monitoring-Discover And Monitor Your Virtual EnvironmentVMware Monitoring-Discover And Monitor Your Virtual Environment
VMware Monitoring-Discover And Monitor Your Virtual EnvironmentSite24x7
 
Akka and AngularJS – Reactive Applications in Practice
Akka and AngularJS – Reactive Applications in PracticeAkka and AngularJS – Reactive Applications in Practice
Akka and AngularJS – Reactive Applications in PracticeRoland Kuhn
 
DOES SFO 2016 - Rich Jackson & Rosalind Radcliffe - The Mainframe DevOps Team...
DOES SFO 2016 - Rich Jackson & Rosalind Radcliffe - The Mainframe DevOps Team...DOES SFO 2016 - Rich Jackson & Rosalind Radcliffe - The Mainframe DevOps Team...
DOES SFO 2016 - Rich Jackson & Rosalind Radcliffe - The Mainframe DevOps Team...Gene Kim
 
Integrating Apache Kafka Into Your Environment
Integrating Apache Kafka Into Your EnvironmentIntegrating Apache Kafka Into Your Environment
Integrating Apache Kafka Into Your Environmentconfluent
 
Apache Kafka and API Management / API Gateway – Friends, Enemies or Frenemies...
Apache Kafka and API Management / API Gateway – Friends, Enemies or Frenemies...Apache Kafka and API Management / API Gateway – Friends, Enemies or Frenemies...
Apache Kafka and API Management / API Gateway – Friends, Enemies or Frenemies...HostedbyConfluent
 
Scala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in ScalaScala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in ScalaAlexander Dean
 
Span Conference: Why your company needs a unified log
Span Conference: Why your company needs a unified logSpan Conference: Why your company needs a unified log
Span Conference: Why your company needs a unified logAlexander Dean
 
Top 15 Exchange Questions that Senior Admin ask - Jaap Wesselius
Top 15 Exchange Questions that Senior Admin ask - Jaap WesseliusTop 15 Exchange Questions that Senior Admin ask - Jaap Wesselius
Top 15 Exchange Questions that Senior Admin ask - Jaap WesseliusKemp
 
Operating Kafka on AutoPilot mode @ DBS Bank (Arpit Dubey, DBS Bank) Kafka Su...
Operating Kafka on AutoPilot mode @ DBS Bank (Arpit Dubey, DBS Bank) Kafka Su...Operating Kafka on AutoPilot mode @ DBS Bank (Arpit Dubey, DBS Bank) Kafka Su...
Operating Kafka on AutoPilot mode @ DBS Bank (Arpit Dubey, DBS Bank) Kafka Su...confluent
 
Tale of two streaming frameworks (Karthik D - Walmart)
Tale of two streaming frameworks (Karthik D - Walmart)Tale of two streaming frameworks (Karthik D - Walmart)
Tale of two streaming frameworks (Karthik D - Walmart)KafkaZone
 
Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?confluent
 
Azkaban - WorkFlow Scheduler/Automation Engine
Azkaban - WorkFlow Scheduler/Automation EngineAzkaban - WorkFlow Scheduler/Automation Engine
Azkaban - WorkFlow Scheduler/Automation EnginePraveen Thirukonda
 
Nov 2015 Webinar: Introduction to FileCatalyst v3.6
Nov 2015 Webinar: Introduction to FileCatalyst v3.6Nov 2015 Webinar: Introduction to FileCatalyst v3.6
Nov 2015 Webinar: Introduction to FileCatalyst v3.6FileCatalyst
 

Mais procurados (20)

Microsoft Azure and Windows Application monitoring
Microsoft Azure and Windows Application monitoringMicrosoft Azure and Windows Application monitoring
Microsoft Azure and Windows Application monitoring
 
Intro to.net core 20170111
Intro to.net core   20170111Intro to.net core   20170111
Intro to.net core 20170111
 
Introducing Tupilak, Snowplow's unified log fabric
Introducing Tupilak, Snowplow's unified log fabricIntroducing Tupilak, Snowplow's unified log fabric
Introducing Tupilak, Snowplow's unified log fabric
 
[Webinar] AWS Monitoring with Site24x7
[Webinar] AWS Monitoring with Site24x7[Webinar] AWS Monitoring with Site24x7
[Webinar] AWS Monitoring with Site24x7
 
Server Monitoring from the Cloud
Server Monitoring from the CloudServer Monitoring from the Cloud
Server Monitoring from the Cloud
 
URP? Excuse You! The Three Metrics You Have to Know
URP? Excuse You! The Three Metrics You Have to Know URP? Excuse You! The Three Metrics You Have to Know
URP? Excuse You! The Three Metrics You Have to Know
 
Metrics Are Not Enough: Monitoring Apache Kafka and Streaming Applications
Metrics Are Not Enough: Monitoring Apache Kafka and Streaming ApplicationsMetrics Are Not Enough: Monitoring Apache Kafka and Streaming Applications
Metrics Are Not Enough: Monitoring Apache Kafka and Streaming Applications
 
VMware Monitoring-Discover And Monitor Your Virtual Environment
VMware Monitoring-Discover And Monitor Your Virtual EnvironmentVMware Monitoring-Discover And Monitor Your Virtual Environment
VMware Monitoring-Discover And Monitor Your Virtual Environment
 
Akka and AngularJS – Reactive Applications in Practice
Akka and AngularJS – Reactive Applications in PracticeAkka and AngularJS – Reactive Applications in Practice
Akka and AngularJS – Reactive Applications in Practice
 
DOES SFO 2016 - Rich Jackson & Rosalind Radcliffe - The Mainframe DevOps Team...
DOES SFO 2016 - Rich Jackson & Rosalind Radcliffe - The Mainframe DevOps Team...DOES SFO 2016 - Rich Jackson & Rosalind Radcliffe - The Mainframe DevOps Team...
DOES SFO 2016 - Rich Jackson & Rosalind Radcliffe - The Mainframe DevOps Team...
 
Integrating Apache Kafka Into Your Environment
Integrating Apache Kafka Into Your EnvironmentIntegrating Apache Kafka Into Your Environment
Integrating Apache Kafka Into Your Environment
 
Apache Kafka and API Management / API Gateway – Friends, Enemies or Frenemies...
Apache Kafka and API Management / API Gateway – Friends, Enemies or Frenemies...Apache Kafka and API Management / API Gateway – Friends, Enemies or Frenemies...
Apache Kafka and API Management / API Gateway – Friends, Enemies or Frenemies...
 
Scala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in ScalaScala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in Scala
 
Span Conference: Why your company needs a unified log
Span Conference: Why your company needs a unified logSpan Conference: Why your company needs a unified log
Span Conference: Why your company needs a unified log
 
Top 15 Exchange Questions that Senior Admin ask - Jaap Wesselius
Top 15 Exchange Questions that Senior Admin ask - Jaap WesseliusTop 15 Exchange Questions that Senior Admin ask - Jaap Wesselius
Top 15 Exchange Questions that Senior Admin ask - Jaap Wesselius
 
Operating Kafka on AutoPilot mode @ DBS Bank (Arpit Dubey, DBS Bank) Kafka Su...
Operating Kafka on AutoPilot mode @ DBS Bank (Arpit Dubey, DBS Bank) Kafka Su...Operating Kafka on AutoPilot mode @ DBS Bank (Arpit Dubey, DBS Bank) Kafka Su...
Operating Kafka on AutoPilot mode @ DBS Bank (Arpit Dubey, DBS Bank) Kafka Su...
 
Tale of two streaming frameworks (Karthik D - Walmart)
Tale of two streaming frameworks (Karthik D - Walmart)Tale of two streaming frameworks (Karthik D - Walmart)
Tale of two streaming frameworks (Karthik D - Walmart)
 
Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?
 
Azkaban - WorkFlow Scheduler/Automation Engine
Azkaban - WorkFlow Scheduler/Automation EngineAzkaban - WorkFlow Scheduler/Automation Engine
Azkaban - WorkFlow Scheduler/Automation Engine
 
Nov 2015 Webinar: Introduction to FileCatalyst v3.6
Nov 2015 Webinar: Introduction to FileCatalyst v3.6Nov 2015 Webinar: Introduction to FileCatalyst v3.6
Nov 2015 Webinar: Introduction to FileCatalyst v3.6
 

Destaque

SRECon USA 2016: Growing your Entry Level Talent
SRECon USA 2016: Growing your Entry Level TalentSRECon USA 2016: Growing your Entry Level Talent
SRECon USA 2016: Growing your Entry Level TalentMichael Kehoe
 
Couchbase Meetup Jan 2016
Couchbase Meetup Jan 2016Couchbase Meetup Jan 2016
Couchbase Meetup Jan 2016Michael Kehoe
 
CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4Michael Kehoe
 
The ROLE SRE Approach - Getting more concrete
The ROLE SRE Approach - Getting more concreteThe ROLE SRE Approach - Getting more concrete
The ROLE SRE Approach - Getting more concretedrenzel
 
The Social Requirements Engineering (SRE) Approach to Developing a Large-scal...
The Social Requirements Engineering (SRE) Approach to Developing a Large-scal...The Social Requirements Engineering (SRE) Approach to Developing a Large-scal...
The Social Requirements Engineering (SRE) Approach to Developing a Large-scal...Ralf Klamma
 
Sre con16 tier 1 metamorphosis
Sre con16   tier 1 metamorphosisSre con16   tier 1 metamorphosis
Sre con16 tier 1 metamorphosisNina Mushiana
 
Software reliability tools and common software errors
Software reliability tools and common software errorsSoftware reliability tools and common software errors
Software reliability tools and common software errorsHimanshu
 
Scio Saa S Readiness Evaluation Sre V1.0
Scio Saa S Readiness Evaluation Sre V1.0Scio Saa S Readiness Evaluation Sre V1.0
Scio Saa S Readiness Evaluation Sre V1.0ScioSales
 
Stephen McHenry - Chanecellor of Site Reliability Engineering, Google
Stephen McHenry - Chanecellor of Site Reliability Engineering, GoogleStephen McHenry - Chanecellor of Site Reliability Engineering, Google
Stephen McHenry - Chanecellor of Site Reliability Engineering, GoogleIE Group
 
Works of site reliability engineer
Works of site reliability engineerWorks of site reliability engineer
Works of site reliability engineerShohei Kobayashi
 
How TPM saves the day
How TPM saves the dayHow TPM saves the day
How TPM saves the dayPooja Tangi
 
Cloud Native, Microservices and SRE/Chaos Engineering: The new Rules of The G...
Cloud Native, Microservices and SRE/Chaos Engineering: The new Rules of The G...Cloud Native, Microservices and SRE/Chaos Engineering: The new Rules of The G...
Cloud Native, Microservices and SRE/Chaos Engineering: The new Rules of The G...Diego Pacheco
 
Software Reliability Engineering
Software Reliability EngineeringSoftware Reliability Engineering
Software Reliability Engineeringguest90cec6
 
Software reliability growth model
Software reliability growth modelSoftware reliability growth model
Software reliability growth modelHimanshu
 
You got a couple Microservices, now what? - Adding SRE to DevOps
You got a couple Microservices, now what?  - Adding SRE to DevOpsYou got a couple Microservices, now what?  - Adding SRE to DevOps
You got a couple Microservices, now what? - Adding SRE to DevOpsGonzalo Maldonado
 
Feedback loops: How SREs benefit and what is needed to realize their potential
Feedback loops: How SREs benefit and what is needed to realize their potentialFeedback loops: How SREs benefit and what is needed to realize their potential
Feedback loops: How SREs benefit and what is needed to realize their potentialPooja Tangi
 
Load balancing in the SRE way
Load balancing in the SRE wayLoad balancing in the SRE way
Load balancing in the SRE wayShawn Zhu
 

Destaque (20)

SRECon USA 2016: Growing your Entry Level Talent
SRECon USA 2016: Growing your Entry Level TalentSRECon USA 2016: Growing your Entry Level Talent
SRECon USA 2016: Growing your Entry Level Talent
 
Couchbase Meetup Jan 2016
Couchbase Meetup Jan 2016Couchbase Meetup Jan 2016
Couchbase Meetup Jan 2016
 
CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4
 
The ROLE SRE Approach - Getting more concrete
The ROLE SRE Approach - Getting more concreteThe ROLE SRE Approach - Getting more concrete
The ROLE SRE Approach - Getting more concrete
 
The Social Requirements Engineering (SRE) Approach to Developing a Large-scal...
The Social Requirements Engineering (SRE) Approach to Developing a Large-scal...The Social Requirements Engineering (SRE) Approach to Developing a Large-scal...
The Social Requirements Engineering (SRE) Approach to Developing a Large-scal...
 
Sre con16 tier 1 metamorphosis
Sre con16   tier 1 metamorphosisSre con16   tier 1 metamorphosis
Sre con16 tier 1 metamorphosis
 
SRE in Startup
SRE in StartupSRE in Startup
SRE in Startup
 
Software reliability tools and common software errors
Software reliability tools and common software errorsSoftware reliability tools and common software errors
Software reliability tools and common software errors
 
Scio Saa S Readiness Evaluation Sre V1.0
Scio Saa S Readiness Evaluation Sre V1.0Scio Saa S Readiness Evaluation Sre V1.0
Scio Saa S Readiness Evaluation Sre V1.0
 
Stephen McHenry - Chanecellor of Site Reliability Engineering, Google
Stephen McHenry - Chanecellor of Site Reliability Engineering, GoogleStephen McHenry - Chanecellor of Site Reliability Engineering, Google
Stephen McHenry - Chanecellor of Site Reliability Engineering, Google
 
Works of site reliability engineer
Works of site reliability engineerWorks of site reliability engineer
Works of site reliability engineer
 
How TPM saves the day
How TPM saves the dayHow TPM saves the day
How TPM saves the day
 
Cloud Native, Microservices and SRE/Chaos Engineering: The new Rules of The G...
Cloud Native, Microservices and SRE/Chaos Engineering: The new Rules of The G...Cloud Native, Microservices and SRE/Chaos Engineering: The new Rules of The G...
Cloud Native, Microservices and SRE/Chaos Engineering: The new Rules of The G...
 
Software Reliability Engineering
Software Reliability EngineeringSoftware Reliability Engineering
Software Reliability Engineering
 
Software reliability growth model
Software reliability growth modelSoftware reliability growth model
Software reliability growth model
 
You got a couple Microservices, now what? - Adding SRE to DevOps
You got a couple Microservices, now what?  - Adding SRE to DevOpsYou got a couple Microservices, now what?  - Adding SRE to DevOps
You got a couple Microservices, now what? - Adding SRE to DevOps
 
Feedback loops: How SREs benefit and what is needed to realize their potential
Feedback loops: How SREs benefit and what is needed to realize their potentialFeedback loops: How SREs benefit and what is needed to realize their potential
Feedback loops: How SREs benefit and what is needed to realize their potential
 
SRE Tools
SRE ToolsSRE Tools
SRE Tools
 
SRE From Scratch
SRE From ScratchSRE From Scratch
SRE From Scratch
 
Load balancing in the SRE way
Load balancing in the SRE wayLoad balancing in the SRE way
Load balancing in the SRE way
 

Semelhante a SouthBay SRE Meetup Jan 2016

Microservices Architecture
Microservices ArchitectureMicroservices Architecture
Microservices ArchitectureLucian Neghina
 
Trafficshifting: Avoiding Disasters & Improving Performance at Scale
Trafficshifting: Avoiding Disasters & Improving Performance at ScaleTrafficshifting: Avoiding Disasters & Improving Performance at Scale
Trafficshifting: Avoiding Disasters & Improving Performance at ScaleAPNIC
 
Bringing it all together - Denver JUG
Bringing it all together - Denver JUGBringing it all together - Denver JUG
Bringing it all together - Denver JUGMelissaMcKay15
 
Fleet management system mine excellence
Fleet management system  mine excellenceFleet management system  mine excellence
Fleet management system mine excellenceMason Taylor
 
Nonfunctional Testing: Examine the Other Side of the Coin
Nonfunctional Testing: Examine the Other Side of the CoinNonfunctional Testing: Examine the Other Side of the Coin
Nonfunctional Testing: Examine the Other Side of the CoinTechWell
 
Introduction to High Availability with SQL Server
Introduction to High Availability with SQL ServerIntroduction to High Availability with SQL Server
Introduction to High Availability with SQL ServerJohn Sterrett
 
Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...
Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...
Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...Databricks
 
Building Enterprise Clouds - Key Considerations and Strategies - RED HAT
Building Enterprise Clouds - Key Considerations and Strategies - RED HATBuilding Enterprise Clouds - Key Considerations and Strategies - RED HAT
Building Enterprise Clouds - Key Considerations and Strategies - RED HATFadi Semaan
 
Service Mesh CTO Forum (Draft 3)
Service Mesh CTO Forum (Draft 3)Service Mesh CTO Forum (Draft 3)
Service Mesh CTO Forum (Draft 3)Rick Hightower
 
Oracle Database Lifecycle Management
Oracle Database Lifecycle ManagementOracle Database Lifecycle Management
Oracle Database Lifecycle ManagementHari Srinivasan
 
Velocity San Jose 2017: Traffic shifts: Avoiding disasters at scale
Velocity San Jose 2017: Traffic shifts: Avoiding disasters at scaleVelocity San Jose 2017: Traffic shifts: Avoiding disasters at scale
Velocity San Jose 2017: Traffic shifts: Avoiding disasters at scaleMichael Kehoe
 
Cognos Performance Tuning Tips & Tricks
Cognos Performance Tuning Tips & TricksCognos Performance Tuning Tips & Tricks
Cognos Performance Tuning Tips & TricksSenturus
 
Creating an Elastic Platform Using Kafka and Microservices in OpenShift
Creating an Elastic Platform Using Kafka and Microservices in OpenShift Creating an Elastic Platform Using Kafka and Microservices in OpenShift
Creating an Elastic Platform Using Kafka and Microservices in OpenShift confluent
 
A Three-Tier Load Testing Program Saved Our Bacon
A Three-Tier Load Testing Program Saved Our BaconA Three-Tier Load Testing Program Saved Our Bacon
A Three-Tier Load Testing Program Saved Our BaconTechWell
 
Testing Mobile App Performance
Testing Mobile App PerformanceTesting Mobile App Performance
Testing Mobile App PerformanceTechWell
 
NANOG 80: Measuring RPKI Effectiveness
NANOG 80: Measuring RPKI EffectivenessNANOG 80: Measuring RPKI Effectiveness
NANOG 80: Measuring RPKI EffectivenessAPNIC
 
HKNOG 9.0: Measuring RPKI
HKNOG 9.0: Measuring RPKIHKNOG 9.0: Measuring RPKI
HKNOG 9.0: Measuring RPKIAPNIC
 
Getting start with Performance Testing
Getting start with Performance Testing Getting start with Performance Testing
Getting start with Performance Testing Yogesh Deshmukh
 
Tiger oracle
Tiger oracleTiger oracle
Tiger oracled0nn9n
 
Net flix embracingfailure re-invent2014-141113085858-conversion-gate02
Net flix embracingfailure re-invent2014-141113085858-conversion-gate02Net flix embracingfailure re-invent2014-141113085858-conversion-gate02
Net flix embracingfailure re-invent2014-141113085858-conversion-gate02~Eric Principe
 

Semelhante a SouthBay SRE Meetup Jan 2016 (20)

Microservices Architecture
Microservices ArchitectureMicroservices Architecture
Microservices Architecture
 
Trafficshifting: Avoiding Disasters & Improving Performance at Scale
Trafficshifting: Avoiding Disasters & Improving Performance at ScaleTrafficshifting: Avoiding Disasters & Improving Performance at Scale
Trafficshifting: Avoiding Disasters & Improving Performance at Scale
 
Bringing it all together - Denver JUG
Bringing it all together - Denver JUGBringing it all together - Denver JUG
Bringing it all together - Denver JUG
 
Fleet management system mine excellence
Fleet management system  mine excellenceFleet management system  mine excellence
Fleet management system mine excellence
 
Nonfunctional Testing: Examine the Other Side of the Coin
Nonfunctional Testing: Examine the Other Side of the CoinNonfunctional Testing: Examine the Other Side of the Coin
Nonfunctional Testing: Examine the Other Side of the Coin
 
Introduction to High Availability with SQL Server
Introduction to High Availability with SQL ServerIntroduction to High Availability with SQL Server
Introduction to High Availability with SQL Server
 
Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...
Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...
Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...
 
Building Enterprise Clouds - Key Considerations and Strategies - RED HAT
Building Enterprise Clouds - Key Considerations and Strategies - RED HATBuilding Enterprise Clouds - Key Considerations and Strategies - RED HAT
Building Enterprise Clouds - Key Considerations and Strategies - RED HAT
 
Service Mesh CTO Forum (Draft 3)
Service Mesh CTO Forum (Draft 3)Service Mesh CTO Forum (Draft 3)
Service Mesh CTO Forum (Draft 3)
 
Oracle Database Lifecycle Management
Oracle Database Lifecycle ManagementOracle Database Lifecycle Management
Oracle Database Lifecycle Management
 
Velocity San Jose 2017: Traffic shifts: Avoiding disasters at scale
Velocity San Jose 2017: Traffic shifts: Avoiding disasters at scaleVelocity San Jose 2017: Traffic shifts: Avoiding disasters at scale
Velocity San Jose 2017: Traffic shifts: Avoiding disasters at scale
 
Cognos Performance Tuning Tips & Tricks
Cognos Performance Tuning Tips & TricksCognos Performance Tuning Tips & Tricks
Cognos Performance Tuning Tips & Tricks
 
Creating an Elastic Platform Using Kafka and Microservices in OpenShift
Creating an Elastic Platform Using Kafka and Microservices in OpenShift Creating an Elastic Platform Using Kafka and Microservices in OpenShift
Creating an Elastic Platform Using Kafka and Microservices in OpenShift
 
A Three-Tier Load Testing Program Saved Our Bacon
A Three-Tier Load Testing Program Saved Our BaconA Three-Tier Load Testing Program Saved Our Bacon
A Three-Tier Load Testing Program Saved Our Bacon
 
Testing Mobile App Performance
Testing Mobile App PerformanceTesting Mobile App Performance
Testing Mobile App Performance
 
NANOG 80: Measuring RPKI Effectiveness
NANOG 80: Measuring RPKI EffectivenessNANOG 80: Measuring RPKI Effectiveness
NANOG 80: Measuring RPKI Effectiveness
 
HKNOG 9.0: Measuring RPKI
HKNOG 9.0: Measuring RPKIHKNOG 9.0: Measuring RPKI
HKNOG 9.0: Measuring RPKI
 
Getting start with Performance Testing
Getting start with Performance Testing Getting start with Performance Testing
Getting start with Performance Testing
 
Tiger oracle
Tiger oracleTiger oracle
Tiger oracle
 
Net flix embracingfailure re-invent2014-141113085858-conversion-gate02
Net flix embracingfailure re-invent2014-141113085858-conversion-gate02Net flix embracingfailure re-invent2014-141113085858-conversion-gate02
Net flix embracingfailure re-invent2014-141113085858-conversion-gate02
 

Mais de Michael Kehoe

Code Yellow: Helping operations top-heavy teams the smart way
Code Yellow: Helping operations top-heavy teams the smart wayCode Yellow: Helping operations top-heavy teams the smart way
Code Yellow: Helping operations top-heavy teams the smart wayMichael Kehoe
 
QConSF 2018: Building Production-Ready Applications
QConSF 2018: Building Production-Ready ApplicationsQConSF 2018: Building Production-Ready Applications
QConSF 2018: Building Production-Ready ApplicationsMichael Kehoe
 
Helping operations top-heavy teams the smart way
Helping operations top-heavy teams the smart wayHelping operations top-heavy teams the smart way
Helping operations top-heavy teams the smart wayMichael Kehoe
 
AllDayDevops: What the NTSB teaches us about incident management & postmortems
AllDayDevops: What the NTSB teaches us about incident management & postmortemsAllDayDevops: What the NTSB teaches us about incident management & postmortems
AllDayDevops: What the NTSB teaches us about incident management & postmortemsMichael Kehoe
 
Linux Container Basics
Linux Container BasicsLinux Container Basics
Linux Container BasicsMichael Kehoe
 
Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet Drops
Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet DropsPapers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet Drops
Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet DropsMichael Kehoe
 
What the NTSB teaches us about incident management & postmortems
What the NTSB teaches us about incident management & postmortemsWhat the NTSB teaches us about incident management & postmortems
What the NTSB teaches us about incident management & postmortemsMichael Kehoe
 
PyBay 2018: Production-Ready Python Applications
PyBay 2018: Production-Ready Python ApplicationsPyBay 2018: Production-Ready Python Applications
PyBay 2018: Production-Ready Python ApplicationsMichael Kehoe
 
Helping operations top-heavy teams the smart way
Helping operations top-heavy teams the smart wayHelping operations top-heavy teams the smart way
Helping operations top-heavy teams the smart wayMichael Kehoe
 
The Next Wave of Reliability Engineering
The Next Wave of Reliability EngineeringThe Next Wave of Reliability Engineering
The Next Wave of Reliability EngineeringMichael Kehoe
 
Building Production-Ready Microservices: DevopsExchangeSF
Building Production-Ready Microservices: DevopsExchangeSFBuilding Production-Ready Microservices: DevopsExchangeSF
Building Production-Ready Microservices: DevopsExchangeSFMichael Kehoe
 
SF Chaos Engineering Meetup: Building Disaster Recovery via Resilience Engine...
SF Chaos Engineering Meetup: Building Disaster Recovery via Resilience Engine...SF Chaos Engineering Meetup: Building Disaster Recovery via Resilience Engine...
SF Chaos Engineering Meetup: Building Disaster Recovery via Resilience Engine...Michael Kehoe
 
SRECon-Europe-2017: Reducing MTTR and False Escalations: Event Correlation at...
SRECon-Europe-2017: Reducing MTTR and False Escalations: Event Correlation at...SRECon-Europe-2017: Reducing MTTR and False Escalations: Event Correlation at...
SRECon-Europe-2017: Reducing MTTR and False Escalations: Event Correlation at...Michael Kehoe
 
SRECon-Europe-2017: Networks for SREs
SRECon-Europe-2017: Networks for SREsSRECon-Europe-2017: Networks for SREs
SRECon-Europe-2017: Networks for SREsMichael Kehoe
 

Mais de Michael Kehoe (16)

eBPF Workshop
eBPF WorkshopeBPF Workshop
eBPF Workshop
 
eBPF Basics
eBPF BasicseBPF Basics
eBPF Basics
 
Code Yellow: Helping operations top-heavy teams the smart way
Code Yellow: Helping operations top-heavy teams the smart wayCode Yellow: Helping operations top-heavy teams the smart way
Code Yellow: Helping operations top-heavy teams the smart way
 
QConSF 2018: Building Production-Ready Applications
QConSF 2018: Building Production-Ready ApplicationsQConSF 2018: Building Production-Ready Applications
QConSF 2018: Building Production-Ready Applications
 
Helping operations top-heavy teams the smart way
Helping operations top-heavy teams the smart wayHelping operations top-heavy teams the smart way
Helping operations top-heavy teams the smart way
 
AllDayDevops: What the NTSB teaches us about incident management & postmortems
AllDayDevops: What the NTSB teaches us about incident management & postmortemsAllDayDevops: What the NTSB teaches us about incident management & postmortems
AllDayDevops: What the NTSB teaches us about incident management & postmortems
 
Linux Container Basics
Linux Container BasicsLinux Container Basics
Linux Container Basics
 
Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet Drops
Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet DropsPapers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet Drops
Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet Drops
 
What the NTSB teaches us about incident management & postmortems
What the NTSB teaches us about incident management & postmortemsWhat the NTSB teaches us about incident management & postmortems
What the NTSB teaches us about incident management & postmortems
 
PyBay 2018: Production-Ready Python Applications
PyBay 2018: Production-Ready Python ApplicationsPyBay 2018: Production-Ready Python Applications
PyBay 2018: Production-Ready Python Applications
 
Helping operations top-heavy teams the smart way
Helping operations top-heavy teams the smart wayHelping operations top-heavy teams the smart way
Helping operations top-heavy teams the smart way
 
The Next Wave of Reliability Engineering
The Next Wave of Reliability EngineeringThe Next Wave of Reliability Engineering
The Next Wave of Reliability Engineering
 
Building Production-Ready Microservices: DevopsExchangeSF
Building Production-Ready Microservices: DevopsExchangeSFBuilding Production-Ready Microservices: DevopsExchangeSF
Building Production-Ready Microservices: DevopsExchangeSF
 
SF Chaos Engineering Meetup: Building Disaster Recovery via Resilience Engine...
SF Chaos Engineering Meetup: Building Disaster Recovery via Resilience Engine...SF Chaos Engineering Meetup: Building Disaster Recovery via Resilience Engine...
SF Chaos Engineering Meetup: Building Disaster Recovery via Resilience Engine...
 
SRECon-Europe-2017: Reducing MTTR and False Escalations: Event Correlation at...
SRECon-Europe-2017: Reducing MTTR and False Escalations: Event Correlation at...SRECon-Europe-2017: Reducing MTTR and False Escalations: Event Correlation at...
SRECon-Europe-2017: Reducing MTTR and False Escalations: Event Correlation at...
 
SRECon-Europe-2017: Networks for SREs
SRECon-Europe-2017: Networks for SREsSRECon-Europe-2017: Networks for SREs
SRECon-Europe-2017: Networks for SREs
 

Último

Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdfankushspencer015
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).pptssuser5c9d4b1
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 

Último (20)

Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 

SouthBay SRE Meetup Jan 2016

  • 1. Michael Kehoe Senior Site Reliability Engineer LinkedIn SouthBay SRE Meetup LinkedIn Traffic Shifting
  • 2. 2 $ whoami Michael Kehoe • Sr Site Reliability Engineer (SRE) • Member of PROD-SRE • https://www.linkedin.com/in/michaelkkehoe
  • 4. 4 What is a Traffic Shift? • Edge (PoP) shift • Datacenter Load shift • Single Master Failovers
  • 5. 5 Why do we do traffic shifts • To mitigate user impact from problems with a 3rd party provider or LinkedIn’s infrastructure/ services • To validate Disaster Recovery (DR) in case of any datacenter failure • To validate and test capacity headroom across our datacenters • To expose bugs and suboptimal configurations by load testing one or more datacenters • To perform planned maintenance • To validate and exercise the traffic shift automation
  • 7. 7 Edge Traffic shifts How does it work • We use IPVS to load balance at our edges • We can withdraw anycast routes to remove traffic from that PoP • Health checks on our edge proxy are tested by DNS providers to verify whether that PoP is in rotation • We can fail those health checks to remove unicast traffic from that PoP
  • 9. 9 Datacenter Traffic shifts How does it work? • Different traffic types are partitioned and controlled separately • Logged-in vs Logged-out • CDN • Monitoring • Microsites • Logged-in users are placed into ‘buckets’ and have primary/ secondary datacenter assignments • Buckets are marked online/ offline to move site traffic
  • 10. 10 Mitigating Impact What a traffic shift looks like
  • 13. 13 Single Master Failover How does it work? • Only used in extreme cases • Leverage distributed locking in Apache Zookeeper • Single master services have a spring component that checks the mastership of the service in a particular datacenter
  • 15. 15 Conclusion • The best way to prepare for a disaster is to practice one regularly! • Tooling and automation is your best friend during an outage • Capacity planning/ management is extremely important
  • 17. ©2014 LinkedIn Corporation. All Rights Reserved.

Notas do Editor

  1. Prior to 2010, running out of Chicago only (ECH3) 2010 – completion of our second production data center (ELA4) Formal disaster recovery strategy 2011 Site was not serving traffic – maintaining active/ passive was not easy Recovery from a true disaster would not have been easy 2013 Built LVA1 Re-architected services to be (mostly) MultiMaster Invested in how to recover from disaster/ service outage quickly 2014 Started multi-colo loadtesting Single Master Failover 1 Built LTX1 2015 Single Master Failover 2 Started LSG1 2016 Ramp LSG1 LOR1 – NextGen DC design
  2. Edge (PoP) shifts. LinkedIn currently operates 12 PoP’s around the world, with more on the way, that help improve page load times for our users. These PoP’s give LinkedIn more flexibility about where a user enters our network and also gives us added redundancy in the case of an outage. We work with our DNS providers to direct users to an appropriate PoP or to alter the flow of traffic to each PoP. See Ritesh Maheshwari’s post for more details on this approach. Data Center Load Shifts. From each PoP, we direct traffic to a specific data center. Logged in users are assigned to a specific data center by default, but during a traffic shift, we can instruct the PoPs to reroute any portion of traffic to one or more different data centers. Single master failovers  Some of our legacy services have not been fully migrated to a multi-data center architecture, and operate in single master mode in one data center. This includes both user-facing services and back-end services whose traffic may not be directly related to user page views. When performing maintenance, addressing site issues, or exploring capacity issues, we must also take these single master services into account. Although some of these legacy services require special attention in these situations, many have been converted to a “fast-failover” mechanism that allows us to switch masters between data centers in seconds, with no downtime. Being able to move these master services around at will also allows us to balance that part of the load between data centers. Edge LinkedIn currently has 12 PoP’s around the world These help improve page load times for our users Give flexibility on where a user enters our network Gives us extra redundancy See Ritesh Masheshwari’s blog post Fabric We assign users to a specific datacenter, but during a trafficshift we can instruct the PoP’s to reroute users to other datacenters to increase the load Single Master Failovers Some older, more complicated services have not been fully migrated to a multi-datacentre architecture. They operate in SingleMaster mode on one datacentre. These services may not be directly related to user page-views but are still important to the running of the site. We have converted most of these services to fast-failover which allows us to change the mastership between datacenters in seconds with no downtime.
  3. To mitigate user impact from problems with a 3rd party provider or LinkedIn’s infrastructure To validate Disaster Recovery (DR) in case of any datacentre failure To validate and test capacity headroom across our datacenters To expose bugs and suboptimal configurations by loadtesting one or more datacenters To perform planned maintenance To validate and exercise the traffic shift automation
  4. The traffic shifting process is orchestrated by a system we developed internally which is designed to make the load shift process hands off. The portal gives a holistic view of the site
  5. The following graphs illustrate the traffic-shift process – from last night Here we see the number of buckets online for each data center. At roughly 6PM, we progressively marked 100 buckets offline, and then ramped them back online gradually. The following graph shows actual measured request percentage for one segment of our traffic. The pattern corresponds to the bucket graph, and shows traffic going to zero in one data center, and being redistributed to the other two. Our ability to offline a data center with as little member impact as possible is one of our top priorities. We perform weekly load tests to validate the process and guarantee we can offline a colo successfully with minimal member impact. We load test by load shedding a percentage of traffic to a targeted data center and evaluate sustainability.
  6. You simply schedule a load test and the system does the rest. A load test is preceded by a series of email notifications starting several hours prior to shifting traffic. At the designated time, the system starts shifting traffic to the targeted data center by offlining buckets from the remaining data centers. The manipulation of these buckets is facilitated by underlying libraries that interface with our “sticky routing” service developed in-house. The system has a feedback loop that uses our alerting system to check for any errors potentially triggered by the traffic shift. If an alert is detected, the traffic shift automatically halts and issues notifications, allowing an engineer to manually inspect the reason for the alert and determine whether it is safe to proceed.   The stress test period is reached once the system has successfully redirected the desired volume of traffic to the targeted data center. The stress test period is typically 1 1/2 hours during which we observe the impact the extra load is having on the data center. If impact is detected we immediately rebalance the load and begin investigating the source of the impact. We then work with service owners on reviewing their system and determine a solution. If the stress test period is completed without causing impact, the system rebalances traffic and is considered complete once the rebalance is finished.
  7. Our single master mechanism relies on Apache ZooKeeper to maintain the status of all single master services. On startup, all instances within a cluster of single master services check the value of a cluster master node in ZooKeeper. Each service determines whether or not it is a master based on the value stored in the cluster master node. All services also establish a watch on the cluster master node. When services accept master status, they create an ephemeral node in ZooKeeper that acts as a lockfile.
  8. We can perform a single master failover with a command line tool that handles all the communication with ZooKeeper instances along with the workflow. But in a disaster scenario, it can be useful to have an easy-to-use interface to functionality, as well as a visual overview of the system state. This interface allows an Engineer to failover all of our “Fast Failover” enabled Singlemaster services with one click. The interface also integrates the LiX and ZooKeeper-based mechanisms into a single location.