First lecture in my series, Design of Digital Machines.
Begins with a real-world case study showing how to be a system detective, then steps back to explain how the shared characteristics of all systems help us see the systems around us.
What is a system?
1. What is a system?
№ 1, Design of Digital Machines
Tim Sheiner
0.5beta 2013 This work by Tim Sheiner is licensed under a Creative Commons Attribution 3.0 United States License.
2. Sections in this presentation
๏ A System Story
๏ What is a system?
๏ Characteristics of a system
3. System Story
5. Huh?
6. Huh(2x)?
8. Outage: Christmas Eve, 12:30pm Pacific
Amazon Web Services, Elastic Load Balancers
“Netflix streaming was impacted on Christmas Eve 2012 by problems in the Amazon Web Services (AWS) Elastic Load Balancer (ELB) service that routes network traffic to the Netflix services supporting streaming.”
9. Outage: Christmas Eve, 12:30pm Pacific
Amazon Web Services, Elastic Load Balancers
Americas only
TV connected devices, primarily
“The outage primarily affected playback on TV connected devices in the US, Canada and Latin America. Our service in the UK, Ireland and Nordic countries was not impacted.”
10. Outage: Christmas Eve, 12:30pm Pacific
Amazon Web Services, Elastic Load Balancers
Americas only
TV connected devices, primarily
100’s of ELBs
~1:1 ELB: Device Type
“Netflix uses hundreds of ELBs. Each one supports a distinct service or a different version of a service and provides a network address that your Web browser or streaming device calls. Netflix streaming has been implemented on over a thousand different streaming devices over the last few years, and groups of similar devices tend to depend on specific ELBs.”
11. Outage: Christmas Eve, 12:30pm Pacific
Amazon Web Services, Elastic Load Balancers
Americas only
TV connected devices, primarily
100’s of ELBs
~1:1 ELB: Device Type
Failure localized to only some ELBs
Issue was requests not passed through
“Out of hundreds of ELBs in use by Netflix, a handful failed, losing their ability to pass requests to the servers behind them. None of the other AWS services failed, so our applications continued to respond normally whenever the requests were able to get through.”
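The roughly one-to-one relationship between ELBs and device types explains why a handful of failed ELBs took out some devices and left others untouched. A minimal sketch of that reasoning follows; the device groups and ELB names are hypothetical and chosen only for illustration, since Netflix's account does not name them.

```python
# Hypothetical device groups and ELB names, for illustration only.
# The Netflix account says only that groups of similar devices tend to
# depend on specific ELBs.
ELB_FOR_DEVICE_GROUP = {
    "game_consoles": "elb-consoles",
    "smart_tvs": "elb-tvs",
    "mac_pc_browsers": "elb-web",
    "mobile": "elb-mobile",
}


def impacted_groups(failed_elbs):
    """Return the device groups whose ELB is in the failed set."""
    return sorted(
        group
        for group, elb in ELB_FOR_DEVICE_GROUP.items()
        if elb in failed_elbs
    )


failed = {"elb-consoles", "elb-tvs"}
print(impacted_groups(failed))  # → ['game_consoles', 'smart_tvs']
```

Because each group depends on its own ELB in this sketch, the failure stays localized: groups mapped to healthy ELBs never appear in the result.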
12. Outage: Christmas Eve, 12:30pm Pacific
Amazon Web Services, Elastic Load Balancers
Americas only
TV connected devices, primarily
100’s of ELBs
~1:1 ELB: Device Type
Failure localized to only some ELBs
Issue was requests not passed through
Slight performance impact to Mac/PC
Game consoles impacted 7 hours
“Over-all streaming playback via Macs and PCs was only slightly reduced from normal levels. A few devices also saw no impact at all as those devices have an ELB configuration that kept running throughout the incident, providing normal playback levels. ... game consoles etc. were impacted for about seven hours.”
13. Outage: Christmas Eve, 12:30pm Pacific
Amazon Web Services, Elastic Load Balancers
Americas only
TV connected devices, primarily
100’s of ELBs
~1:1 ELB: Device Type
Failure localized to only some ELBs
Issue was requests not passed through
Slight performance impact to Mac/PC
Game consoles impacted 7 hours
“It is still early days for cloud innovation and there is certainly more to do in terms of building resiliency in the cloud. We have plans to work on this in 2013. It is an interesting and hard problem to solve, since ... the systems involved ... must be extremely reliable and capable of avoiding cascading overload failures.”
16. US-East Region ELB
Severe but localized interruption
“We would like to share more details with our customers about the event that occurred with the Amazon Elastic Load Balancing Service (“ELB”) earlier this week in the US-East Region. While the service disruption only affected applications using the ELB service (and only a fraction of the ELB load balancers were affected), the impacted load balancers saw significant impact for a prolonged period of time.”
17. US-East Region ELB
Severe but localized interruption
12:24 PM PST on December 24
ELB state data logically deleted
“The service disruption began at 12:24 PM PST on December 24th when a portion of the ELB state data was logically deleted.”
18. US-East Region ELB
Severe but localized interruption
12:24 PM PST on December 24
ELB state data logically deleted
ELB control plane manages configurations
Tracking hosts for traffic routing
“This data is used and maintained by the ELB control plane to manage the configuration of the ELB load balancers in the region (for example tracking all the backend hosts to which traffic should be routed by each load balancer).”
19. US-East Region ELB
Severe but localized interruption
12:24 PM PST on December 24
ELB state data logically deleted
ELB control plane manages configurations
Tracking hosts for traffic routing
Inadvertent maintenance process
production environment access
Unaware of error
“The data was deleted by a maintenance process that was inadvertently run against the production ELB state data. This process was run by one of a very small number of developers who have access to this production environment. Unfortunately, the developer did not realize the mistake at the time.”
20. US-East Region ELB
Severe but localized interruption
12:24 PM PST on December 24
ELB state data logically deleted
ELB control plane manages configurations
Tracking hosts for traffic routing
Inadvertent maintenance process
production environment access
Unaware of error
High latency & error rates
API calls
No impact to running ELBs
“After this data was deleted, the ELB control plane began experiencing high latency and error rates for API calls to manage ELB load balancers. In this initial part of the service disruption, there was no impact to the request handling functionality of running ELB load balancers because the missing ELB state data was not integral to the basic operation of running load balancers.”
21. US-East Region ELB
Severe but localized interruption
12:24 PM PST on December 24
ELB state data logically deleted
ELB control plane manages configurations
Tracking hosts for traffic routing
Inadvertent maintenance process
production environment access
Unaware of error
High latency & error rates
API calls
No impact to running ELBs
Create new, but not manage existing
Failure on attempt to scale
“The team was puzzled as many APIs were succeeding (customers were able to create and manage new load balancers but not manage existing load balancers) and others were failing. As this continued, some customers began to experience performance issues with their running load balancers. These issues only occurred after the ELB control plane attempted to make changes to a running load balancer.”
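Amazon's account describes running load balancers that kept working after the state data was deleted, and then broke once the control plane tried to change them. A toy model of that behavior is sketched below. Everything in it is an assumption made for illustration: the class names, the idea that each balancer routes from a local copy of its backend list, and the way a change overwrites that copy. Amazon's summary does not describe the implementation at this level.

```python
class LoadBalancer:
    """Data plane: routes using its own local copy of the backend list."""

    def __init__(self, name, backends):
        self.name = name
        self.backends = list(backends)

    def route(self, request_id):
        if not self.backends:
            raise RuntimeError(f"{self.name}: no backends to route to")
        return self.backends[request_id % len(self.backends)]


class ControlPlane:
    """Control plane: pushes configuration from the state store to balancers."""

    def __init__(self, state_store):
        self.state_store = state_store  # balancer name -> list of backend hosts

    def apply_change(self, lb):
        # Overwrite the balancer's local config with whatever the store holds.
        lb.backends = list(self.state_store.get(lb.name, []))


state = {"lb-1": ["host-a", "host-b"]}
lb = LoadBalancer("lb-1", state["lb-1"])
plane = ControlPlane(state)

state.clear()           # the state data is "logically deleted"
print(lb.route(0))      # still routes from its local copy: host-a
plane.apply_change(lb)  # e.g. a scaling workflow touches the balancer
try:
    lb.route(0)
except RuntimeError as err:
    print(err)          # lb-1: no backends to route to
```

In this model the deletion stays invisible until a workflow touches a running balancer, which matches the pattern of failures appearing only on attempts to scale.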
22. US-East Region ELB
Severe but localized interruption
12:24 PM PST on December 24
ELB state data logically deleted
ELB control plane manages configurations
Tracking hosts for traffic routing
Inadvertent maintenance process
production environment access
Unaware of error
High latency & error rates
API calls
No impact to running ELBs
Create new, but not manage existing
Failure on attempt to scale
6.8% directly impacted, rest no scaling
“At 5:02 PM PST, the team disabled several of the ELB control plane workflows (including the scaling and descaling workflows) to prevent additional running load balancers from being affected by the missing ELB state data. At the peak of the event, 6.8% of running ELB load balancers were impacted. The rest of the load balancers in the system were unable to scale or be modified by customers, but were operating correctly.”
23. US-East Region ELB
Severe but localized interruption
12:24 PM PST on December 24
ELB state data logically deleted
ELB control plane manages configurations
Tracking hosts for traffic routing
Inadvertent maintenance process
production environment access
Unaware of error
High latency & error rates
API calls
No impact to running ELBs
Create new, but not manage existing
Failure on attempt to scale
6.8% directly impacted, rest no scaling
Merge old state
Initial recovery plan failed
“The team attempted to restore the ELB state data to a point-in-time just before the event began. By restoring the data to this time, we would be able to merge in events that happened after ... to create an accurate state. ... the initial method used by the team to restore the ELB state data ... failed to provide a usable snapshot of the data. This delayed recovery until an alternate recovery process was found.”
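The recovery plan Amazon describes, restoring a point-in-time snapshot and then merging in later events, can be sketched as follows. The data shapes here are hypothetical: the snapshot is a plain dictionary and each event is a `(timestamp, action, name, hosts)` tuple. Amazon's summary does not say how its state data or event log are actually structured.

```python
import copy


def restore_and_merge(snapshot, events, snapshot_time):
    """Rebuild state from a snapshot plus the events recorded after it."""
    state = copy.deepcopy(snapshot)
    for timestamp, action, name, hosts in sorted(events, key=lambda e: e[0]):
        if timestamp <= snapshot_time:
            continue  # already reflected in the snapshot
        if action == "put":
            state[name] = list(hosts)
        elif action == "delete":
            state.pop(name, None)
    return state


# Hypothetical snapshot taken at time 10, and an event log around it.
snapshot = {"lb-1": ["host-a"], "lb-2": ["host-b"]}
events = [
    (5, "put", "lb-1", ["host-a"]),   # before the snapshot: skipped
    (12, "put", "lb-3", ["host-c"]),  # created after the snapshot
    (15, "delete", "lb-2", None),     # removed after the snapshot
]
print(restore_and_merge(snapshot, events, snapshot_time=10))
# → {'lb-1': ['host-a'], 'lb-3': ['host-c']}
```

The sketch also shows why a usable snapshot matters: without a trustworthy starting point there is nothing to replay the later events against.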
24. US-East Region ELB
Severe but localized interruption
12:24 PM PST on December 24
ELB state data logically deleted
ELB control plane manages configurations
Tracking hosts for traffic routing
Inadvertent maintenance process
production environment access
Unaware of error
High latency & error rates
API calls
No impact to running ELBs
Create new, but not manage existing
Failure on attempt to scale
6.8% directly impacted, rest no scaling
Merge old state
Initial recovery plan failed
10:30 am substantial recovery; 20 hours
“The system began recovering the remaining affected load balancers, and by 8:15 AM PST, the team had re-enabled the majority of APIs and backend workflows. By 10:30 AM PST, almost all affected load balancers had been restored to full operation. While the service was substantially recovered at this time, the team continued to closely monitor the service before communicating broadly that it was operating normally at 12:05 PM PST.”
25. US-East Region ELB
Severe but localized interruption
12:24 PM PST on December 24
ELB state data logically deleted
ELB control plane manages configurations
Tracking hosts for traffic routing
Inadvertent maintenance process
production environment access
Unaware of error
High latency & error rates
API calls
No impact to running ELBs
Create new, but not manage existing
Failure on attempt to scale
6.8% directly impacted, rest no scaling
Merge old state
Initial recovery plan failed
10:30 am substantial recovery; 20 hours
“We have made a number of changes to protect the ELB service from this sort of disruption in the future.
• modified the access controls on our production ELB state data
• modified our data recovery process to reflect the learning we went through in this event
We will also incorporate our learning from this event into our service architecture. We believe that we can reprogram [to] allow the service to recover automatically from logical data loss.”
26. Outage: Christmas Eve, 12:30pm Pacific
Amazon Web Services, Elastic Load Balancers
Americas only
TV connected devices, primarily
100’s of ELBs
~1:1 ELB: Device Type
Failure localized to only some ELBs
Issue was requests not passed through
Slight performance impact to Mac/PC
Game consoles impacted 7 hours
US-East Region ELB
Severe but localized interruption
12:24 PM PST on December 24
ELB state data logically deleted
ELB control plane manages configurations
Tracking hosts for traffic routing
Inadvertent maintenance process
production environment access
Unaware of error
High latency & error rates
API calls
No impact to running ELBs
Create new, but not manage existing
Failure on attempt to scale
6.8% directly impacted, rest no scaling
Merge old state
Initial recovery plan failed
10:30 am substantial recovery; 20 hours
27. {(Netflix) + (Amazon)}
Events
๏ Outage: Christmas Eve, 12:30pm Pacific
๏ Americas only
๏ TV connected devices, primarily
๏ 12:24 PM PST on December 24
๏ Severe but localized interruption
๏ US-East Region ELB
๏ ELB state data logically deleted
Objects & Relationships
๏ Amazon Web Services, Elastic Load Balancers
๏ 100’s of ELBs
๏ ~1:1 ELB: Device Type
๏ ELB control plane manages configurations
๏ ELB control plane
๏ Tracking hosts for traffic routing
Patterns
๏ Failure localized to only some ELBs
๏ Issue was requests not passed through
๏ Slight performance impact to Mac/PC
๏ Game consoles impacted 7 hours
๏ High latency & error rates
๏ API calls
๏ No impact to running ELBs
๏ 6.8% directly impacted, rest no scaling
Structural Explanation
๏ Inadvertent maintenance process
๏ production environment access
๏ Unaware of error
๏ Create new, but not manage existing
๏ Failure on attempt to scale
๏ Merge old state
๏ Initial recovery plan failed
๏ 10:30 am substantial recovery; 20 hours
28. {Netflix + Amazon}
Events
๏ Outage: Christmas Eve, 12:30pm Pacific
๏ Americas only
๏ TV connected devices, primarily
๏ 12:24 PM PST on December 24
๏ Severe but localized interruption
๏ US-East Region ELB
๏ ELB state data logically deleted
Objects & Relationships
๏ Amazon Web Services, Elastic Load Balancers
๏ 100’s of ELBs
๏ ~1:1 ELB: Device Type
๏ ELB control plane manages configurations
๏ ELB control plane
๏ Tracking hosts for traffic routing
Patterns
๏ Failure localized to only some ELBs
๏ Issue was requests not passed through
๏ Slight performance impact to Mac/PC
๏ Game consoles impacted 7 hours
๏ High latency & error rates
๏ API calls
๏ No impact to running ELBs
๏ 6.8% directly impacted, rest no scaling
Structural Explanation
๏ Inadvertent maintenance process
๏ production environment access
๏ Unaware of error
๏ Create new, but not manage existing
๏ Failure on attempt to scale
๏ Merge old state
๏ Initial recovery plan failed
๏ 10:30 am substantial recovery; 20 hours
30. What is a system?
31. Bricks
32. Brick Systems or Brick Collections?
33. A system is an interconnected set of elements that is coherently organized in a way that achieves something.
Donella Meadows, Thinking in Systems
34. Operational View of a System
1. Objects
[Diagram: objects A, B, C, D]
35. Operational View of a System
1. Objects
2. Relationships
[Diagram: objects A, B, C, D]
36. Operational View of a System
1. Objects
2. Relationships
3. Currency
[Diagram: objects A, B, C, D]
37. Operational View of a System
1. Objects
2. Relationships
3. Currency
4. Boundary
[Diagram: objects A, B, C, D]
38. Operational View of a System
1. Objects
2. Relationships
3. Currency
4. Boundary
5. Purpose
[Diagram: Input, objects A, B, C, D, Output]
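The five parts of the operational view map naturally onto a small data structure. The sketch below is one possible reading, not a definitive model: the `System` class, its method names, and the example purpose and currency are hypothetical, and the specific connections between A, B, C and D are assumed because the slide text does not record how the objects are linked.

```python
from dataclasses import dataclass, field


@dataclass
class System:
    purpose: str   # 5. what the system achieves
    currency: str  # 3. what flows along the relationships
    objects: set = field(default_factory=set)        # 1. elements inside the boundary
    relationships: set = field(default_factory=set)  # 2. directed (source, target) pairs

    def inside_boundary(self, name):
        # 4. Boundary: only listed objects belong to the system.
        return name in self.objects

    def boundary_crossings(self):
        """Relationships with exactly one end outside the boundary."""
        return sorted(
            (src, dst)
            for src, dst in self.relationships
            if self.inside_boundary(src) != self.inside_boundary(dst)
        )


# Assumed wiring between the objects, for illustration only.
example = System(
    purpose="deliver video to a device",
    currency="network requests",
    objects={"A", "B", "C", "D"},
    relationships={
        ("Input", "B"), ("B", "A"), ("B", "D"),
        ("A", "C"), ("D", "C"), ("C", "Output"),
    },
)
print(example.boundary_crossings())  # → [('C', 'Output'), ('Input', 'B')]
```

Defining the boundary as membership in `objects` makes inputs and outputs fall out for free: they are the relationships that cross it.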
39. Dynamic View of a System
[Diagram: Input, objects A, B, C, D, Output; then Input’, objects A’, B’, C’, D’, Output’ further along the Time axis]
40. Dynamic View of a System
[Plot: Behavior vs Time; Output (0 to 100) against Time (0 to 20)]
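The dynamic view says that the system's state at one moment, together with its input, determines its state at the next, and that plotting output against time reveals its behavior. A minimal sketch of that idea follows. The update rule, the `drain_rate` parameter and the constant inflow are all assumptions picked so the numbers stay within the plot's 0 to 100 range; they are not taken from the slides.

```python
def step(level, inflow, drain_rate=0.1):
    """One tick: the new state depends on the old state and the current input."""
    return level + inflow - drain_rate * level


def behavior_over_time(inflows, level=0.0):
    """Return the state at every time step, starting from the initial level."""
    history = [level]
    for inflow in inflows:
        level = step(level, inflow)
        history.append(level)
    return history


# Twenty time steps with a constant, assumed input of 10 units per step.
history = behavior_over_time([10.0] * 20)
for t in range(0, 21, 5):
    print(t, round(history[t], 1))
```

With these assumed values the output rises quickly at first and then levels off below 100, the kind of curve a behavior-versus-time graph makes visible.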
41. A system is an interconnected set of elements that is coherently organized in a way that achieves something.
The General System
42. These elements. Those connections. This organization. That boundary. This purpose.
The Specific System
43. Seeing systems
44. If it looks like a duck...
๏ A system’s parts must all be present for the system to carry out its purpose optimally.
๏ A system’s parts must be arranged in a specific way for the system to carry out its purpose.
๏ Systems have specific purposes within larger systems.
๏ Systems maintain their stability through fluctuations and adjustments.
๏ Systems have feedback.
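The last two characteristics, stability through adjustment and feedback, can be shown in a few lines. This is a generic sketch of a proportional feedback loop, not something taken from the Netflix or Amazon accounts; the `regulate` function, the setpoint and the gain are hypothetical.

```python
def regulate(disturbances, setpoint=50.0, gain=0.5):
    """Apply each outside disturbance, then correct part of the resulting error."""
    value = setpoint
    trace = []
    for disturbance in disturbances:
        value += disturbance                # fluctuation from outside the system
        value += gain * (setpoint - value)  # feedback: adjust toward the setpoint
        trace.append(value)
    return trace


# One push of +20, then no further disturbance.
print(regulate([20.0, 0.0, 0.0, 0.0]))  # → [60.0, 55.0, 52.5, 51.25]
```

Each step halves the remaining error, so the value drifts back toward the setpoint after the single disturbance.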
45. The nature of systems is that your understanding of a particular one gets more precise over time.
46. Seeing Systems
EVENTS
๏ Outage: Christmas Eve, 12:30pm Pacific
๏ Severe but localized interruption
Events are what we notice first.
47. Seeing Systems
EVENTS
๏ Outage: Christmas Eve, 12:30pm Pacific
๏ Severe but localized interruption
PATTERNS
๏ TV connected devices, primarily
๏ Failure on attempt to scale
Patterns = Observation(Events + Time)
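"Patterns = Observation(Events + Time)" can be read as an aggregation: single events say little, but the same kind of event recurring across a time window stands out. A small sketch of that reading follows. The event log is invented for illustration and is not data from the outage; the `failure_pattern` helper is likewise hypothetical.

```python
from collections import Counter

# Invented event log for illustration: (hour of day, device type, outcome).
events = [
    (13, "game_console", "failed"),
    (13, "mac_pc", "ok"),
    (14, "game_console", "failed"),
    (14, "smart_tv", "failed"),
    (15, "mac_pc", "ok"),
    (15, "game_console", "failed"),
]


def failure_pattern(events):
    """Count failures per device type across the whole observation window."""
    return Counter(
        device for _, device, outcome in events if outcome == "failed"
    )


print(failure_pattern(events).most_common())
# → [('game_console', 3), ('smart_tv', 1)]
```

No single row shows that one device type is failing repeatedly; the pattern only appears once the events are observed together over time.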
48. Seeing Systems
EVENTS
๏ Outage: Christmas Eve, 12:30pm Pacific
๏ Severe but localized interruption
PATTERNS
๏ TV connected devices, primarily
๏ Failure on attempt to scale
STRUCTURE
๏ Issue was requests not passed through
๏ ELB state data logically deleted
From patterns we deduce structure via a ‘black box’ process.
49. Seeing Systems
EVENTS
๏ Outage: Christmas Eve, 12:30pm Pacific
๏ Severe but localized interruption
PATTERNS
๏ TV connected devices, primarily
๏ Failure on attempt to scale
STRUCTURE
๏ Issue was requests not passed through
๏ ELB state data logically deleted
CONTEXT
๏ Amazon Web Services, Elastic Load Balancers
๏ ELB control plane manages configurations
Context helps us discriminate the isomorph
50. Fin
1. Objects
2. Relationships
3. Currency
4. Boundary
5. Purpose
[Diagram: Input, objects A, B, C, D, Output]