3. Before We Begin...
• If you have any questions, please type them in the Questions window.
• If you have any audio problems, please chat us for help.
• A recording of this presentation will be sent to you in a few days.
4. Agenda
• About ThousandEyes
• Noteworthy Outages of 2021
• Primer: Digital Service Building Blocks
• Top Ten Outage Countdown
• Lessons & Takeaways
• Q&A
5. Actionable Insight for Internet, Cloud, and SaaS
• Correlated Insights: quickly isolate issues to app, network, or service
• Network Visibility: overlay, hop-by-hop underlay, ISP performance, and BGP routing
• App Experience: SaaS, API, and internal app performance and user experience
15. Top Ten Outage Countdown: #10 to #6

#10 Verizon, January 26, 2021 (~1 hour; network traffic loss; Internet Report Ep. 33)
Packet loss across US East and Midwest parts of the backbone. Varying levels of user impact based on user location and the service being accessed. Lesson: Outages can have very different impacts based on a variety of variables. Visibility is key.

#9 Cloudflare Magic Transit, May 3, 2021 (~2 hours; network traffic loss; Internet Report Ep. 37)
Massive levels of packet loss for Magic Transit customers, a service that involves advertising customer prefixes (with their permission). Some customers were only briefly impacted because they rapidly readvertised their prefixes through another provider. Lesson: Have redundancy + a playbook to activate when needed.

#8 Comcast, November 9, 2021 (~2 hours; network traffic loss + routing issues)
Two-part outage, with the first part more localized to users around the Sunnyvale PoP. The second part was still centered on Sunnyvale, but had broad impact on Midwest- and East-based users. Lesson: Don't rely on provider updates to understand what's happening.

#7 Facebook, April 8, 2021 (~40 minutes; app + routing issues; Internet Report Ep. 34)
Initial phase manifested as HTTP errors (5xx, 4xx). Phase 2 showed unusual routing to one PoP with packet loss and latency, possibly part of mitigation efforts. Lesson: App providers can move users to sub-optimal or out-of-region PoPs at their discretion.

#6 Azure AD, 12/15 (~1.5 hours; authentication service returning HTTP server errors)
Azure AD, an authentication service used across Microsoft and its customers' services, was unavailable, preventing access to a large number of services. Lesson: Authentication is a critical dependency.
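To make the authentication-dependency lesson concrete, here is a minimal probe sketch. The endpoint URL and polling interval are illustrative assumptions, not anything from the deck; the point is to flag HTTP 5xx from the login service (the failure mode seen in the Azure AD outage) separately from the app itself being unreachable.

```python
# Minimal sketch: periodically probe an authentication endpoint and flag
# HTTP 5xx responses, which indicate the auth dependency (not the app) is failing.
# The URL and interval are illustrative placeholders.
import time
import urllib.request
import urllib.error

AUTH_URL = "https://login.example.com/oauth2/token"  # hypothetical auth endpoint

def probe_auth(url: str) -> str:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return f"OK ({resp.status})"
    except urllib.error.HTTPError as e:
        # A 5xx here means the authentication service itself is erroring,
        # the Azure AD failure mode described above.
        return f"AUTH SERVICE ERROR ({e.code})" if e.code >= 500 else f"CLIENT ERROR ({e.code})"
    except (urllib.error.URLError, TimeoutError) as e:
        return f"UNREACHABLE ({e})"

if __name__ == "__main__":
    while True:
        print(time.strftime("%H:%M:%S"), probe_auth(AUTH_URL))
        time.sleep(60)  # one probe per minute
```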
16. Akamai Prolexic Routed, June 16, 2021
#5
• Service unavailable ~4 hours
• …But very different customer outcomes
• Akamai withdrew routes, enabling customer mitigation by advertising through an alternate provider
• Some customers recovered within minutes, while others were impacted for the entire incident duration
Lesson: Plan for failure and have a setup and/or playbook in place before an outage occurs, because outages are inevitable and no provider is immune.
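As a rough illustration of the "redundancy + playbook" lesson, the sketch below automates what the quickest-recovering Prolexic customers did by hand: detect a dead path, then withdraw and readvertise the prefix via a standby provider. The announce()/withdraw() helpers are hypothetical placeholders for real router or provider automation, and the prefix and probe target are documentation-range examples.

```python
# Sketch of a prefix-failover playbook: if the primary scrubbing provider's
# path fails, readvertise the prefix through a standby provider.
# announce()/withdraw() are hypothetical placeholders, not a real library.
import subprocess

PREFIX = "203.0.113.0/24"          # example prefix (RFC 5737 documentation range)
PRIMARY, STANDBY = "provider-a", "provider-b"

def path_is_healthy(target: str = "203.0.113.1") -> bool:
    """Crude reachability probe (Linux-style ping flags); real playbooks
    would use loss/latency measurements from multiple vantage points."""
    return subprocess.run(["ping", "-c", "3", "-W", "2", target],
                          capture_output=True).returncode == 0

def withdraw(provider: str, prefix: str) -> None:
    print(f"[playbook] withdrawing {prefix} from {provider}")   # placeholder action

def announce(provider: str, prefix: str) -> None:
    print(f"[playbook] announcing {prefix} via {provider}")     # placeholder action

def failover_if_needed() -> None:
    if not path_is_healthy():
        # This is the step that separated minutes-long impact from
        # hours-long impact for Prolexic customers.
        withdraw(PRIMARY, PREFIX)
        announce(STANDBY, PREFIX)

if __name__ == "__main__":
    failover_if_needed()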
17. Akamai Edge DNS, July 22, 2021
#4
• Duration: ~1 hour
• Impacted internal DNS service used to route global users to CDN nodes
• No DNS = no CDN (even though the CDN was technically up and running)
• DNS + CDN are intertwined, complicating the question of redundancy
• Multi-CDN has pros and cons (but apex domain control is still important)
Lesson: Plan for failure and have a setup and/or playbook in place before an outage occurs, because outages are inevitable and no provider is immune.
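The "No DNS = no CDN" point can be checked directly: if name resolution fails, the CDN is unreachable no matter how healthy its edges are. A minimal sketch using only the Python standard library (the hostname is a placeholder) separates the two failure modes:

```python
# Sketch: distinguish "DNS is down" from "CDN is down" for a hostname.
# socket.getaddrinfo failing -> resolution problem (the Edge DNS failure mode);
# resolution succeeding but TCP connect failing -> the CDN edge itself is unhealthy.
import socket

HOST, PORT = "www.example.com", 443  # hypothetical CDN-fronted hostname

def diagnose(host: str, port: int) -> str:
    try:
        addrs = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as e:
        # No answer / NXDOMAIN: the CDN is unreachable even if it is up.
        return f"DNS FAILURE: {e}"
    ip = addrs[0][4][0]
    try:
        with socket.create_connection((ip, port), timeout=5):
            return f"OK: {host} -> {ip}, edge accepting connections"
    except OSError as e:
        return f"CDN EDGE FAILURE: resolved to {ip} but connect failed ({e})"

if __name__ == "__main__":
    print(diagnose(HOST, PORT))
```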
18. Fastly, June 8, 2021
#3
• Duration: ~1 hour
• CDN service unavailable across ~85% of infrastructure
• Some customers mitigated by rerouting to origin or an alternate CDN, while others felt the full impact of the incident (e.g., NYTimes vs. Reddit)
Lesson: Understand the role of DNS in driving mitigation actions (e.g., controlling the apex domain, TTL values, etc.). Plan for failure and have a setup and/or playbook in place before an outage occurs.
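Because TTL values gate how fast a DNS-based mitigation takes effect, a pre-incident TTL check is worth scripting. A minimal sketch, assuming the dnspython package is installed and using an illustrative domain and threshold:

```python
# Sketch: check whether a record's TTL would allow a fast CDN failover.
# Requires dnspython (pip install dnspython); the domain is a placeholder.
# A TTL of hours means users keep resolving to a dead CDN long after you repoint.
import dns.resolver

DOMAIN = "www.example.com"   # hypothetical CDN-fronted record
MAX_FAILOVER_TTL = 300       # illustrative threshold: 5 minutes

answer = dns.resolver.resolve(DOMAIN, "A")
ttl = answer.rrset.ttl
print(f"{DOMAIN}: TTL={ttl}s")
if ttl > MAX_FAILOVER_TTL:
    print(f"WARNING: TTL > {MAX_FAILOVER_TTL}s; DNS-based failover will be slow")
```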
19. AWS, December 7 & 10, 2021
#2
• Duration: ~7 hours
• Centered in US-EAST-1, but broad impact on site availability (phase 1) + management access and API gateway
• Lingering performance impacts even after remediation
• High interdependence of AWS services
Lessons: Don’t rely solely on provider status updates to understand what’s happening in real time and how it impacts you. Cloud provider interdependencies can cascade impacts far and wide.
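One way to act on the "don't rely solely on status updates" lesson is to probe your own critical dependencies directly. A minimal sketch with the standard library; the URLs are placeholders for whatever your stack actually depends on:

```python
# Sketch: probe critical dependencies directly instead of trusting only the
# provider status page, which lagged real impact during the AWS incident.
import time
import urllib.request
import urllib.error

DEPENDENCIES = [
    "https://api.example.com/health",    # your own front door
    "https://auth.example.com/health",   # third-party auth dependency
    "https://cdn.example.com/ping",      # CDN / cloud dependency
]

def probe(url: str) -> str:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            ms = (time.monotonic() - start) * 1000
            return f"{url}: {resp.status} in {ms:.0f} ms"
    except (urllib.error.URLError, TimeoutError) as e:
        return f"{url}: FAILED ({e})"

for url in DEPENDENCIES:
    print(probe(url))
```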
20. Facebook, October 4, 2021
#1
• Duration: ~7 hours
• Yes, Facebook DNS went “offline,” but the scope was broader; ultimately, whether DNS was up or not, Facebook services were unusable
• Internal tooling was compromised by the outage, delaying remediation
Lesson: Keep management systems and tooling walled off from the production environment. Consider external providers for critical services such as DNS and monitoring.
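The external-providers lesson applies to monitoring itself: resolve your critical domains through resolvers outside your own infrastructure, so checks keep working even when, as happened to Facebook, your network and internal tooling are unreachable. A minimal sketch, assuming dnspython and a placeholder domain:

```python
# Sketch: resolve a critical domain via public resolvers outside your own
# infrastructure, so DNS monitoring does not depend on the network being probed.
# Requires dnspython; the domain is an illustrative placeholder.
import dns.resolver

DOMAIN = "www.example.com"
EXTERNAL_RESOLVERS = ["8.8.8.8", "1.1.1.1"]  # Google / Cloudflare public DNS

for server in EXTERNAL_RESOLVERS:
    r = dns.resolver.Resolver(configure=False)  # ignore the local resolv.conf
    r.nameservers = [server]
    try:
        answer = r.resolve(DOMAIN, "A", lifetime=5)
        print(f"{server}: {DOMAIN} -> {[a.address for a in answer]}")
    except Exception as e:
        print(f"{server}: lookup failed ({e})")
```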
21. Brief Summary of “Shadow” Outages
• The DNS Comes Under Attack, May 2021
• Routing Mishaps Fly Under the Radar
22. Lessons and Takeaways
• BGP is only as resilient as your design enables it to be
• Public cloud is fantastic! But know the dependencies!
• DNS is often the culprit when things go wrong, but how do you know?
• CDNs can be a black box for IT Ops, limiting visibility.
• SaaS is great, if it can be monitored.