3. Before We Begin...
• If you have any questions, please type them in the Questions window.
• If you have any audio problems, please chat us for help.
• A recording of this presentation will be sent to you in a few days.
4. Agenda
• About ThousandEyes
• Noteworthy Outages of 2021
• Primer: Digital Service Building Blocks
• Top Ten Outage Countdown
• Lessons & Takeaways
• Q&A
5. Actionable Insight for Internet, Cloud, and SaaS
• Correlated Insights: quickly isolate issues to app, network, or service
• Network Visibility: overlay, hop-by-hop underlay, ISP performance, and BGP routing
• App Experience: SaaS, API, and internal app performance and user experience
15. Top Ten Outage Countdown: #10 to #6

#10 Verizon, January 26, 2021 (~1 hour; network traffic loss; Internet Report Ep. 33)
Packet loss across US East and Midwest parts of the backbone. Varying levels of user impact based on user location and the service being accessed. Lesson: Outages can have very different impacts based on a variety of variables. Visibility is key.

#9 Cloudflare Magic Transit, May 3, 2021 (~2 hours; network traffic loss; Internet Report Ep. 37)
Massive levels of packet loss for Magic Transit customers, a service that involves advertising customer prefixes (with their permission). Some customers were only briefly impacted because they rapidly readvertised their prefixes through another provider. Lesson: Have redundancy + a playbook to activate when needed.

#8 Comcast, November 9, 2021 (~2 hours; network traffic loss + routing issues)
Two-part outage, with the first part more localized to users around the Sunnyvale PoP. The second part was still centered on Sunnyvale, but had broad impact on Midwest- and East-based users. Lesson: Don't rely on provider updates to understand what's happening.

#7 Facebook, April 8, 2021 (~40 minutes; app + routing issues; Internet Report Ep. 34)
Initial phase manifested as HTTP errors (5xx, 4xx). Phase 2 showed unusual routing to one PoP with packet loss and latency, possibly part of mitigation efforts. Lesson: App providers can move users to sub-optimal or out-of-region PoPs at their discretion.

#6 Azure AD, 12/15 (~1.5 hours; authentication service returning HTTP server errors)
Azure AD, an authentication service used across Microsoft and its customers' services, was unavailable, preventing access to a large number of services. Lesson: Authentication is a critical dependency.
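To make the authentication-dependency lesson concrete, here is a minimal probe sketch. The endpoint URL and polling interval are illustrative assumptions, not anything from the deck; the point is to flag HTTP 5xx from the login service (the failure mode seen in the Azure AD outage) separately from the app itself being unreachable.

```python
# Minimal sketch: periodically probe an authentication endpoint and flag
# HTTP 5xx responses, which indicate the auth dependency (not the app) is failing.
# The URL and interval are illustrative placeholders.
import time
import urllib.request
import urllib.error

AUTH_URL = "https://login.example.com/oauth2/token"  # hypothetical auth endpoint

def probe_auth(url: str) -> str:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return f"OK ({resp.status})"
    except urllib.error.HTTPError as e:
        # A 5xx here means the authentication service itself is erroring,
        # the Azure AD failure mode described above.
        return f"AUTH SERVICE ERROR ({e.code})" if e.code >= 500 else f"CLIENT ERROR ({e.code})"
    except (urllib.error.URLError, TimeoutError) as e:
        return f"UNREACHABLE ({e})"

if __name__ == "__main__":
    while True:
        print(time.strftime("%H:%M:%S"), probe_auth(AUTH_URL))
        time.sleep(60)  # one probe per minute
```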
16. Akamai Prolexic Routed, June 16, 2021
#5
• Service unavailable ~4 hours
• …But very different customer outcomes
• Akamai withdrew routes, enabling customer mitigation by advertising through an alternate provider
• Some customers recovered within minutes, while others were impacted for the entire incident duration
Lesson: Plan for failure and have a setup and/or playbook in place before an outage occurs, because outages are inevitable and no provider is immune.
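As a rough illustration of the "redundancy + playbook" lesson, the sketch below automates what the quickest-recovering Prolexic customers did by hand: detect a dead path, then withdraw and readvertise the prefix via a standby provider. The announce()/withdraw() helpers are hypothetical placeholders for real router or provider automation, and the prefix and probe target are documentation-range examples.

```python
# Sketch of a prefix-failover playbook: if the primary scrubbing provider's
# path fails, readvertise the prefix through a standby provider.
# announce()/withdraw() are hypothetical placeholders, not a real library.
import subprocess

PREFIX = "203.0.113.0/24"          # example prefix (RFC 5737 documentation range)
PRIMARY, STANDBY = "provider-a", "provider-b"

def path_is_healthy(target: str = "203.0.113.1") -> bool:
    """Crude reachability probe (Linux-style ping flags); real playbooks
    would use loss/latency measurements from multiple vantage points."""
    return subprocess.run(["ping", "-c", "3", "-W", "2", target],
                          capture_output=True).returncode == 0

def withdraw(provider: str, prefix: str) -> None:
    print(f"[playbook] withdrawing {prefix} from {provider}")   # placeholder action

def announce(provider: str, prefix: str) -> None:
    print(f"[playbook] announcing {prefix} via {provider}")     # placeholder action

def failover_if_needed() -> None:
    if not path_is_healthy():
        # This is the step that separated minutes-long impact from
        # hours-long impact for Prolexic customers.
        withdraw(PRIMARY, PREFIX)
        announce(STANDBY, PREFIX)

if __name__ == "__main__":
    failover_if_needed()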
17. Akamai Edge DNS, July 22, 2021
#4
• Duration: ~1 hour
• Impacted internal DNS service used to route global users to CDN nodes
• No DNS = no CDN (even though the CDN was technically up and running)
• DNS + CDN are intertwined, complicating the question of redundancy
• Multi-CDN has pros and cons (but apex domain control is still important)
Lesson: Plan for failure and have a setup and/or playbook in place before an outage occurs, because outages are inevitable and no provider is immune.
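The "No DNS = no CDN" point can be checked directly: if name resolution fails, the CDN is unreachable no matter how healthy its edges are. A minimal sketch using only the Python standard library (the hostname is a placeholder) separates the two failure modes:

```python
# Sketch: distinguish "DNS is down" from "CDN is down" for a hostname.
# socket.getaddrinfo failing -> resolution problem (the Edge DNS failure mode);
# resolution succeeding but TCP connect failing -> the CDN edge itself is unhealthy.
import socket

HOST, PORT = "www.example.com", 443  # hypothetical CDN-fronted hostname

def diagnose(host: str, port: int) -> str:
    try:
        addrs = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as e:
        # No answer / NXDOMAIN: the CDN is unreachable even if it is up.
        return f"DNS FAILURE: {e}"
    ip = addrs[0][4][0]
    try:
        with socket.create_connection((ip, port), timeout=5):
            return f"OK: {host} -> {ip}, edge accepting connections"
    except OSError as e:
        return f"CDN EDGE FAILURE: resolved to {ip} but connect failed ({e})"

if __name__ == "__main__":
    print(diagnose(HOST, PORT))
```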
18. Fastly, June 8, 2021
#3
• Duration: ~1 hour
• CDN service unavailable across ~85% of infrastructure
• Some customers mitigated by rerouting to origin or an alternate CDN, while others felt the full impact of the incident (e.g., NYTimes vs. Reddit)
Lesson: Understand the role of DNS in driving mitigation actions (e.g., controlling the apex domain, TTL values, etc.). Plan for failure and have a setup and/or playbook in place before an outage occurs.
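Because TTL values gate how fast a DNS-based mitigation takes effect, a pre-incident TTL check is worth scripting. A minimal sketch, assuming the dnspython package is installed and using an illustrative domain and threshold:

```python
# Sketch: check whether a record's TTL would allow a fast CDN failover.
# Requires dnspython (pip install dnspython); the domain is a placeholder.
# A TTL of hours means users keep resolving to a dead CDN long after you repoint.
import dns.resolver

DOMAIN = "www.example.com"   # hypothetical CDN-fronted record
MAX_FAILOVER_TTL = 300       # illustrative threshold: 5 minutes

answer = dns.resolver.resolve(DOMAIN, "A")
ttl = answer.rrset.ttl
print(f"{DOMAIN}: TTL={ttl}s")
if ttl > MAX_FAILOVER_TTL:
    print(f"WARNING: TTL > {MAX_FAILOVER_TTL}s; DNS-based failover will be slow")
```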
19. AWS, December 7 & 10, 2021
#2
• Duration: ~7 hours
• Centered in US-EAST-1, but broad impact on site availability (phase 1) + management access and API gateway
• Lingering performance impacts even after remediation
• High interdependence of AWS services
Lessons: Don’t rely solely on provider status updates to understand what’s happening in real time and how it impacts you. Cloud provider interdependencies can cascade impacts far and wide.
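One way to act on the "don't rely solely on status updates" lesson is to probe your own critical dependencies directly. A minimal sketch with the standard library; the URLs are placeholders for whatever your stack actually depends on:

```python
# Sketch: probe critical dependencies directly instead of trusting only the
# provider status page, which lagged real impact during the AWS incident.
import time
import urllib.request
import urllib.error

DEPENDENCIES = [
    "https://api.example.com/health",    # your own front door
    "https://auth.example.com/health",   # third-party auth dependency
    "https://cdn.example.com/ping",      # CDN / cloud dependency
]

def probe(url: str) -> str:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            ms = (time.monotonic() - start) * 1000
            return f"{url}: {resp.status} in {ms:.0f} ms"
    except (urllib.error.URLError, TimeoutError) as e:
        return f"{url}: FAILED ({e})"

for url in DEPENDENCIES:
    print(probe(url))
```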
20. Facebook, October 4, 2021
#1
• Duration: ~7 hours
• Yes, Facebook DNS went “offline,” but the scope was broader; ultimately, whether DNS was up or not, Facebook services were unusable
• Internal tooling was compromised by the outage, delaying remediation
Lesson: Keep management systems and tooling walled off from the production environment. Consider external providers for critical services such as DNS and monitoring.
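The external-providers lesson applies to monitoring itself: resolve your critical domains through resolvers outside your own infrastructure, so checks keep working even when, as happened to Facebook, your network and internal tooling are unreachable. A minimal sketch, assuming dnspython and a placeholder domain:

```python
# Sketch: resolve a critical domain via public resolvers outside your own
# infrastructure, so DNS monitoring does not depend on the network being probed.
# Requires dnspython; the domain is an illustrative placeholder.
import dns.resolver

DOMAIN = "www.example.com"
EXTERNAL_RESOLVERS = ["8.8.8.8", "1.1.1.1"]  # Google / Cloudflare public DNS

for server in EXTERNAL_RESOLVERS:
    r = dns.resolver.Resolver(configure=False)  # ignore the local resolv.conf
    r.nameservers = [server]
    try:
        answer = r.resolve(DOMAIN, "A", lifetime=5)
        print(f"{server}: {DOMAIN} -> {[a.address for a in answer]}")
    except Exception as e:
        print(f"{server}: lookup failed ({e})")
```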
21. Brief Summary of “Shadow” Outages
• The DNS Comes Under Attack, May 2021
• Routing Mishaps Fly Under the Radar
22. Lessons and Takeaways
• BGP is only as resilient as your design enables it to be
• Public cloud is fantastic! But know the dependencies!
• DNS is often the culprit when things go wrong, but how do you know?
• CDNs can be a black box for IT Ops, limiting visibility.
• SaaS is great, if it can be monitored.