What is a system?
№ 1, Design of Digital Machines

Tim Sheiner
0.5beta 2013 This work by Tim Sheiner is licensed under a Creative Commons Attribution 3.0 United States.
Sections in this presentation
๏ A System Story
๏ What is a system?
๏ Characteristics of a system




System Story



Huh?
Huh(2x)?
Outage: Christmas Eve, 12:30pm Pacific
Amazon Web Services, Elastic Load Balancers

“Netflix streaming was impacted on Christmas Eve 2012 by problems in the Amazon Web Services (AWS) Elastic Load Balancer (ELB) service that routes network traffic to the Netflix services supporting streaming.”

Outage: Christmas Eve, 12:30pm Pacific
Amazon Web Services, Elastic Load Balancers
Americas only
TV connected devices, primarily

“The outage primarily affected playback on TV connected devices in the US, Canada and Latin America. Our service in the UK, Ireland and Nordic countries was not impacted.”

Outage: Christmas Eve, 12:30pm Pacific
Amazon Web Services, Elastic Load Balancers
Americas only
TV connected devices, primarily
100’s of ELBs
~1:1 ELB: Device Type

“Netflix uses hundreds of ELBs. Each one supports a distinct service or a different version of a service and provides a network address that your Web browser or streaming device calls. Netflix streaming has been implemented on over a thousand different streaming devices over the last few years, and groups of similar devices tend to depend on specific ELBs.”

Outage: Christmas Eve, 12:30pm Pacific
Amazon Web Services, Elastic Load Balancers
Americas only
TV connected devices, primarily
100’s of ELBs
~1:1 ELB: Device Type
Failure localized to only some ELBs
Issue was requests not passed through

“Out of hundreds of ELBs in use by Netflix, a handful failed, losing their ability to pass requests to the servers behind them. None of the other AWS services failed, so our applications continued to respond normally whenever the requests were able to get through.”

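The roughly one-to-one relationship between ELBs and device types explains why losing a handful of ELBs hit some devices hard and left others alone. The sketch below is a toy illustration only: the device group names and ELB identifiers are hypothetical and do not reflect Netflix's real configuration.

```python
# Hypothetical mapping: each device group depends on one specific ELB.
# None of these names come from Netflix; they are placeholders.
ELB_FOR_DEVICE_GROUP = {
    "game_consoles": "elb-tv-1",
    "smart_tvs": "elb-tv-2",
    "macs_and_pcs": "elb-web-1",
    "mobile": "elb-mobile-1",
}


def impacted_groups(failed_elbs: set[str]) -> list[str]:
    """Return the device groups whose ELB is in the failed set, sorted by name."""
    return sorted(
        group
        for group, elb in ELB_FOR_DEVICE_GROUP.items()
        if elb in failed_elbs
    )


if __name__ == "__main__":
    # Only two ELBs fail; groups behind healthy ELBs are not listed.
    print(impacted_groups({"elb-tv-1", "elb-tv-2"}))
```

With this kind of mapping, the blast radius of an ELB failure is decided by which device groups sit behind the failed ELBs, not by the health of the application servers.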
Outage: Christmas Eve, 12:30pm Pacific
Amazon Web Services, Elastic Load Balancers
Americas only
TV connected devices, primarily
100’s of ELBs
~1:1 ELB: Device Type
Failure localized to only some ELBs
Issue was requests not passed through
Slight performance impact to Mac/PC
Game consoles impacted 7 hours

“Over-all streaming playback via Macs and PCs was only slightly reduced from normal levels. A few devices also saw no impact at all as those devices have an ELB configuration that kept running throughout the incident, providing normal playback levels. ... game consoles etc. were impacted for about seven hours.”

Outage: Christmas Eve, 12:30pm Pacific
Amazon Web Services, Elastic Load Balancers
Americas only
TV connected devices, primarily
100’s of ELBs
~1:1 ELB: Device Type
Failure localized to only some ELBs
Issue was requests not passed through
Slight performance impact to Mac/PC
Game consoles impacted 7 hours

“It is still early days for cloud innovation and there is certainly more to do in terms of building resiliency in the cloud. We have plans to work on this in 2013. It is an interesting and hard problem to solve, since ... the systems involved ... must be extremely reliable and capable of avoiding cascading overload failures.”

US-East Region ELB
Severe but localized interruption

“We would like to share more details with our customers about the event that occurred with the Amazon Elastic Load Balancing Service (“ELB”) earlier this week in the US-East Region. While the service disruption only affected applications using the ELB service (and only a fraction of the ELB load balancers were affected), the impacted load balancers saw significant impact for a prolonged period of time.”

US-East Region ELB
Severe but localized interruption
12:24 PM PST on December 24
ELB state data logically deleted

“The service disruption began at 12:24 PM PST on December 24th when a portion of the ELB state data was logically deleted.”

US-East Region ELB
Severe but localized interruption
12:24 PM PST on December 24
ELB state data logically deleted
ELB control plane manages configurations
Tracking hosts for traffic routing

“This data is used and maintained by the ELB control plane to manage the configuration of the ELB load balancers in the region (for example tracking all the backend hosts to which traffic should be routed by each load balancer).”

US-East Region ELB
Severe but localized interruption
12:24 PM PST on December 24
ELB state data logically deleted
ELB control plane manages configurations
Tracking hosts for traffic routing
Inadvertent maintenance process
production environment access
Unaware of error

“The data was deleted by a maintenance process that was inadvertently run against the production ELB state data. This process was run by one of a very small number of developers who have access to this production environment. Unfortunately, the developer did not realize the mistake at the time.”

US-East Region ELB
Severe but localized interruption
12:24 PM PST on December 24
ELB state data logically deleted
ELB control plane manages configurations
Tracking hosts for traffic routing
Inadvertent maintenance process
production environment access
Unaware of error
High latency & error rates
API calls
No impact to running ELBs

“After this data was deleted, the ELB control plane began experiencing high latency and error rates for API calls to manage ELB load balancers. In this initial part of the service disruption, there was no impact to the request handling functionality of running ELB load balancers because the missing ELB state data was not integral to the basic operation of running load balancers.”

US-East Region ELB
Severe but localized interruption
12:24 PM PST on December 24
ELB state data logically deleted
ELB control plane manages configurations
Tracking hosts for traffic routing
Inadvertent maintenance process
production environment access
Unaware of error
High latency & error rates
API calls
No impact to running ELBs
Create new, but not manage existing
Failure on attempt to scale

“The team was puzzled as many APIs were succeeding (customers were able to create and manage new load balancers but not manage existing load balancers) and others were failing. As this continued, some customers began to experience performance issues with their running load balancers. These issues only occurred after the ELB control plane attempted to make changes to a running load balancer.”

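One way to picture why running load balancers kept working until the control plane touched them is a toy model in which each load balancer routes from its last known configuration, and a change rebuilds that configuration from the state data. Every class, method and host name below is hypothetical; this is a sketch of the general pattern, not a description of how AWS actually implements ELB.

```python
# Toy model with hypothetical names; not AWS's actual design.
class LoadBalancer:
    """Data plane: routes using whatever configuration it was last given."""

    def __init__(self, name, backends):
        self.name = name
        self.backends = list(backends)

    def can_route(self):
        # Routing works as long as at least one backend host is known.
        return len(self.backends) > 0


class ControlPlane:
    """Control plane: owns the state data and pushes changes to load balancers."""

    def __init__(self):
        self.state = {}  # load balancer name -> list of backend hosts

    def register(self, lb):
        self.state[lb.name] = list(lb.backends)

    def apply_change(self, lb):
        # A change (for example scaling) rebuilds the configuration from the
        # state data. If the state is missing, the rebuilt list is empty.
        lb.backends = list(self.state.get(lb.name, []))


control = ControlPlane()
untouched = LoadBalancer("lb-a", ["10.0.0.1", "10.0.0.2"])
changed = LoadBalancer("lb-b", ["10.0.1.1"])
control.register(untouched)
control.register(changed)

control.state.clear()          # the state data goes missing
control.apply_change(changed)  # only lb-b is touched by a workflow

print(untouched.can_route(), changed.can_route())
```

In this model the untouched load balancer keeps routing and only the one that received a change breaks, which is also why stopping further changes would stop the damage from spreading.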
US-East Region ELB
Severe but localized interruption
12:24 PM PST on December 24
ELB state data logically deleted
ELB control plane manages configurations
Tracking hosts for traffic routing
Inadvertent maintenance process
production environment access
Unaware of error
High latency & error rates
API calls
No impact to running ELBs
Create new, but not manage existing
Failure on attempt to scale
6.8% directly impacted, rest no scaling

“At 5:02 PM PST, the team disabled several of the ELB control plane workflows (including the scaling and descaling workflows) to prevent additional running load balancers from being affected by the missing ELB state data. At the peak of the event, 6.8% of running ELB load balancers were impacted. The rest of the load balancers in the system were unable to scale or be modified by customers, but were operating correctly.”

US-East Region ELB
Severe but localized interruption
12:24 PM PST on December 24
ELB state data logically deleted
ELB control plane manages configurations
Tracking hosts for traffic routing
Inadvertent maintenance process
production environment access
Unaware of error
High latency & error rates
API calls
No impact to running ELBs
Create new, but not manage existing
Failure on attempt to scale
6.8% directly impacted, rest no scaling
Merge old state
Initial recovery plan failed

“The team attempted to restore the ELB state data to a point-in-time just before the event began. By restoring the data to this time, we would be able to merge in events that happened after ... to create an accurate state. ... the initial method used by the team to restore the ELB state data ... failed to provide a usable snapshot of the data. This delayed recovery until an alternate recovery process was found.”

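The recovery idea in this quote, restoring a snapshot from just before the event and then merging in later events, can be sketched in a few lines. The event format, the function name and the sample data are all assumptions made for illustration; the AWS report does not describe the real data structures.

```python
import copy


def restore_and_merge(snapshot, snapshot_time, events):
    """Rebuild state from a snapshot plus the events recorded after it.

    snapshot:      dict of load balancer name -> list of backend hosts
    snapshot_time: timestamp at which the snapshot was taken
    events:        iterable of (timestamp, name, backends) tuples, where
                   backends is None when the load balancer was deleted
    All of these shapes are hypothetical.
    """
    state = copy.deepcopy(snapshot)
    for timestamp, name, backends in sorted(events, key=lambda e: e[0]):
        if timestamp <= snapshot_time:
            continue  # already reflected in the snapshot
        if backends is None:
            state.pop(name, None)
        else:
            state[name] = list(backends)
    return state


snapshot = {"lb-a": ["h1"], "lb-b": ["h2"]}
events = [
    (90, "lb-a", ["h0"]),         # older than the snapshot, ignored
    (110, "lb-b", ["h2", "h3"]),  # lb-b gained a host after the snapshot
    (120, "lb-c", ["h4"]),        # lb-c was created after the snapshot
]
print(restore_and_merge(snapshot, 100, events))
```

The approach depends on having a usable snapshot in the first place, which is exactly the step that failed on the first attempt.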
US-East Region ELB
Severe but localized interruption
12:24 PM PST on December 24
ELB state data logically deleted
ELB control plane manages configurations
Tracking hosts for traffic routing
Inadvertent maintenance process
production environment access
Unaware of error
High latency & error rates
API calls
No impact to running ELBs
Create new, but not manage existing
Failure on attempt to scale
6.8% directly impacted, rest no scaling
Merge old state
Initial recovery plan failed
10:30 am substantial recovery; 20 hours

“The system began recovering the remaining affected load balancers, and by 8:15 AM PST, the team had re-enabled the majority of APIs and backend workflows. By 10:30 AM PST, almost all affected load balancers had been restored to full operation. While the service was substantially recovered at this time, the team continued to closely monitor the service before communicating broadly that it was operating normally at 12:05 PM PST.”

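The elapsed times behind the "20 hours" note can be checked with simple date arithmetic. The sketch assumes that the morning timestamps in the quote fall on December 25, 2012, the day after the disruption began; the quote itself gives only clock times. The milestone labels are shorthand invented here.

```python
from datetime import datetime

# Start of the disruption, as quoted: 12:24 PM PST on December 24.
START = datetime(2012, 12, 24, 12, 24)

# Assumption: the morning times are on December 25, 2012.
MILESTONES = {
    "workflows disabled": datetime(2012, 12, 24, 17, 2),
    "most APIs re-enabled": datetime(2012, 12, 25, 8, 15),
    "almost all restored": datetime(2012, 12, 25, 10, 30),
    "declared normal": datetime(2012, 12, 25, 12, 5),
}


def elapsed(moment):
    """Return (hours, minutes) between START and the given moment."""
    minutes = int((moment - START).total_seconds() // 60)
    return divmod(minutes, 60)


for label, moment in MILESTONES.items():
    hours, minutes = elapsed(moment)
    print(f"{label}: {hours} h {minutes} min")
```

Under that assumption, the 8:15 AM milestone is 19 hours 51 minutes after the start, which is close to the 20 hours in the note, while the 10:30 AM restoration is 22 hours 6 minutes after the start.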
Inadvertent      Create new, but
US-East Region                                                                                                      maintenance      not manage
ELB                                                                                                                 process          existing




๏ Severe but localized interruption
๏ production environment access
๏ Failure on attempt to scale
๏ 12:24 PM PST on December 24
๏ Unaware of error
๏ 6.8% directly impacted, rest no scaling
๏ ELB state data logically deleted
๏ High latency & error rates
๏ Merge old state
๏ ELB control plane manages configurations
๏ API calls
๏ Initial recovery plan failed
๏ Tracking hosts for traffic routing
๏ No impact to running ELBs
๏ 10:30 am substantial recovery; 20 hours

“We have made a number of changes to protect the ELB service from this sort of disruption in the future.
    • modified the access controls on our production ELB state data
    • modified our data recovery process to reflect the learning we went through in this event
We will also incorporate our learning from this event into our service architecture. We believe that we can reprogram [to] allow the service to recover automatically from logical data loss.”



๏ Outage: Christmas Eve, 12:30pm Pacific
๏ Amazon Web Services, Elastic Load Balancers
๏ Americas only
๏ TV connected devices, primarily
๏ 100’s of ELBs
๏ ~1:1 ELB: Device Type

๏ Failure localized to only some ELBs
๏ Issue was requests not passed through
๏ Slight performance impact to Mac/PC
๏ Game consoles impacted 7 hours

๏ US-East Region ELB
๏ Severe but localized interruption
๏ 12:24 PM PST on December 24
๏ ELB state data logically deleted
๏ ELB control plane manages configurations
๏ Tracking hosts for traffic routing

๏ Inadvertent maintenance process
๏ production environment access
๏ Unaware of error
๏ High latency & error rates
๏ API calls
๏ No impact to running ELBs

๏ Create new, but not manage existing
๏ Failure on attempt to scale
๏ 6.8% directly impacted, rest no scaling
๏ Merge old state
๏ Initial recovery plan failed
๏ 10:30 am substantial recovery; 20 hours



{(Netflix) + (Amazon)}

Events
๏ Outage: Christmas Eve, 12:30pm Pacific
๏ Americas only
๏ TV connected devices, primarily
๏ 12:24 PM PST on December 24
๏ Severe but localized interruption
๏ US-East Region ELB
๏ ELB state data logically deleted

Objects & Relationships
๏ Amazon Web Services, Elastic Load Balancers
๏ 100’s of ELBs
๏ ~1:1 ELB: Device Type
๏ ELB control plane
๏ ELB control plane manages configurations
๏ Tracking hosts for traffic routing

Patterns
๏ Failure localized to only some ELBs
๏ Issue was requests not passed through
๏ Slight performance impact to Mac/PC
๏ Game consoles impacted 7 hours
๏ High latency & error rates
๏ API calls
๏ No impact to running ELBs
๏ 6.8% directly impacted, rest no scaling

Structural Explanation
๏ Inadvertent maintenance process
๏ production environment access
๏ Unaware of error
๏ Create new, but not manage existing
๏ Failure on attempt to scale
๏ Merge old state
๏ Initial recovery plan failed
๏ 10:30 am substantial recovery; 20 hours

{Netflix + Amazon}

Events
๏ Outage: Christmas Eve, 12:30pm Pacific
๏ Americas only
๏ TV connected devices, primarily
๏ 12:24 PM PST on December 24
๏ Severe but localized interruption
๏ US-East Region ELB
๏ ELB state data logically deleted

Objects & Relationships
๏ Amazon Web Services, Elastic Load Balancers
๏ 100’s of ELBs
๏ ~1:1 ELB: Device Type
๏ ELB control plane
๏ ELB control plane manages configurations
๏ Tracking hosts for traffic routing

Patterns
๏ Failure localized to only some ELBs
๏ Issue was requests not passed through
๏ Slight performance impact to Mac/PC
๏ Game consoles impacted 7 hours
๏ High latency & error rates
๏ API calls
๏ No impact to running ELBs
๏ 6.8% directly impacted, rest no scaling

Structural Explanation
๏ Inadvertent maintenance process
๏ production environment access
๏ Unaware of error
๏ Create new, but not manage existing
๏ Failure on attempt to scale
๏ Merge old state
๏ Initial recovery plan failed
๏ 10:30 am substantial recovery; 20 hours

What is a system?



Bricks




Brick Systems or Brick Collections?




A system is an interconnected set of elements that is coherently organized in a way that achieves something.
Donella Meadows, Thinking in Systems



Operational View of a System

 1. Objects
 2. Relationships
 3. Currency
 4. Boundary
 5. Purpose

[Diagram: objects A, B, C and D, with Input and Output]

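The five parts of the operational view can be sketched as a small data structure. This is only an illustration, not the author's model: the class name `System`, its field names, the wiring between A, B, C and D, and the currency `requests` are all assumptions, because the diagram's connections do not survive in this transcript.

```python
from dataclasses import dataclass, field


@dataclass
class System:
    purpose: str    # 5. Purpose: what the system achieves
    currency: str   # 3. Currency: what flows along the relationships
    # 1. Objects; membership in this set also stands in for 4. Boundary
    objects: set = field(default_factory=set)
    # 2. Relationships, stored as directed (source, target) pairs
    relationships: set = field(default_factory=set)

    def inside(self, name):
        """True when the named object sits inside the boundary."""
        return name in self.objects

    def crossings(self):
        """Relationships with exactly one end inside the boundary."""
        return sorted(
            (source, target)
            for source, target in self.relationships
            if self.inside(source) != self.inside(target)
        )


# Hypothetical wiring: the arrows between A, B, C and D are invented here.
demo = System(
    purpose="turn Input into Output",
    currency="requests",
    objects={"A", "B", "C", "D"},
    relationships={
        ("Input", "B"), ("B", "A"), ("A", "C"), ("B", "D"), ("C", "Output"),
    },
)

print(demo.crossings())
```

In this sketch Input and Output are simply the relationships that cross the boundary, which is one way to read the final build of the slide.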
Dynamic View of a System
[Diagram: the system (A, B, C, D, with Input and Output) shown again further along a Time axis as A’, B’, C’, D’, with Input’ and Output’]


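One way to read the dynamic view is as the same system stepped forward in time, with its output recorded at each tick. The sketch below is an assumption-laden illustration: the rule `approach_limit` (close half of the remaining gap to a limit on every tick) is invented, since the slides do not specify any particular dynamics.

```python
def simulate(step, state, ticks):
    """Apply `step` repeatedly and record the state at every time tick."""
    history = [state]
    for _ in range(ticks):
        state = step(state)
        history.append(state)
    return history


def approach_limit(output, limit=100.0, rate=0.5):
    # Hypothetical dynamics: move a fixed fraction of the way to the limit.
    return output + rate * (limit - output)


history = simulate(approach_limit, 0.0, 20)
print(history[:3])
```

Plotting `history` against its index gives a behavior-over-time curve for this made-up rule; a different `step` function would give a different curve from the same `simulate` loop.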
Dynamic View of a System
Behavior vs Time
[Plot: Output, on an axis from 0 to 100, against Time, with the time axis marked at 20]

A system is an interconnected set of elements that is coherently organized in a way that achieves something.
The General System

These elements.
Those connections.
This organization.
That boundary.
This purpose.
The Specific System

Seeing systems



If it looks like a duck...
๏ A system’s parts must all be present for the system to
  carry out its purpose optimally.

๏ A system’s parts must be arranged in a specific way for
  the system to carry out its purpose.

๏ Systems have specific purposes within larger systems.

๏ Systems maintain their stability through fluctuations and
  adjustments.

๏ Systems have feedback.

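The last two points, stability through adjustments and feedback, can be illustrated with a minimal negative-feedback loop. This is a sketch under stated assumptions: the function `regulate`, the `gain` of 0.5, the target of 20.0 and the disturbance values are all hypothetical and do not come from the deck.

```python
def regulate(value, target, disturbances, gain=0.5):
    """Negative feedback: on each tick, correct a fraction of the error."""
    trace = []
    for disturbance in disturbances:
        value += disturbance              # a fluctuation pushes the system off target
        value += gain * (target - value)  # the adjustment feeds the error back in
        trace.append(value)
    return trace


# Hypothetical run: one push up, one push down, quiet ticks in between.
trace = regulate(value=20.0, target=20.0, disturbances=[4.0, 0.0, 0.0, -2.0, 0.0])
print(trace)
```

After each disturbance the value drifts back toward the target rather than staying displaced, which is the sense of stability the list describes.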

The nature of systems is that
  your understanding of a
  particular one gets more precise
  over time.

Seeing Systems
EVENTS
๏ Outage: Christmas Eve, 12:30pm Pacific
๏ Severe but localized interruption

Events are what we notice first.

Seeing Systems
EVENTS
๏ Outage: Christmas Eve, 12:30pm Pacific
๏ Severe but localized interruption
PATTERNS
๏ TV connected devices, primarily
๏ Failure on attempt to scale

Patterns = Observation(Events + Time)

Seeing Systems
EVENTS
๏ Outage: Christmas Eve, 12:30pm Pacific
๏ Severe but localized interruption
PATTERNS
๏ TV connected devices, primarily
๏ Failure on attempt to scale
STRUCTURE
๏ Issue was requests not passed through
๏ ELB state data logically deleted

From patterns we deduce structure via ‘black box’ process

Seeing Systems
EVENTS
๏ Outage: Christmas Eve, 12:30pm Pacific
๏ Severe but localized interruption
PATTERNS
๏ TV connected devices, primarily
๏ Failure on attempt to scale
STRUCTURE
๏ Issue was requests not passed through
๏ ELB state data logically deleted
CONTEXT
๏ Amazon Web Services, Elastic Load Balancers
๏ ELB control plane manages configurations

Context helps us discriminate the isomorph

Fin

 1. Objects
 2. Relationships
 3. Currency
 4. Boundary
 5. Purpose

[Diagram: objects A, B, C and D, with Input and Output]

 
Verified Trusted Call Girls Adugodi💘 9352852248 Good Looking standard Profil...
Verified Trusted Call Girls Adugodi💘 9352852248  Good Looking standard Profil...Verified Trusted Call Girls Adugodi💘 9352852248  Good Looking standard Profil...
Verified Trusted Call Girls Adugodi💘 9352852248 Good Looking standard Profil...kumaririma588
 
VIP Call Girls Service Kukatpally Hyderabad Call +91-8250192130
VIP Call Girls Service Kukatpally Hyderabad Call +91-8250192130VIP Call Girls Service Kukatpally Hyderabad Call +91-8250192130
VIP Call Girls Service Kukatpally Hyderabad Call +91-8250192130Suhani Kapoor
 
VVIP Pune Call Girls Hadapsar (7001035870) Pune Escorts Nearby with Complete ...
VVIP Pune Call Girls Hadapsar (7001035870) Pune Escorts Nearby with Complete ...VVIP Pune Call Girls Hadapsar (7001035870) Pune Escorts Nearby with Complete ...
VVIP Pune Call Girls Hadapsar (7001035870) Pune Escorts Nearby with Complete ...Call Girls in Nagpur High Profile
 
VIP Call Girl Amravati Aashi 8250192130 Independent Escort Service Amravati
VIP Call Girl Amravati Aashi 8250192130 Independent Escort Service AmravatiVIP Call Girl Amravati Aashi 8250192130 Independent Escort Service Amravati
VIP Call Girl Amravati Aashi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
Petrosains Drama Competition (PSDC).pptx
Petrosains Drama Competition (PSDC).pptxPetrosains Drama Competition (PSDC).pptx
Petrosains Drama Competition (PSDC).pptxIgnatiusAbrahamBalin
 
Recommendable # 971589162217 # philippine Young Call Girls in Dubai By Marina...
Recommendable # 971589162217 # philippine Young Call Girls in Dubai By Marina...Recommendable # 971589162217 # philippine Young Call Girls in Dubai By Marina...
Recommendable # 971589162217 # philippine Young Call Girls in Dubai By Marina...home
 
CALL ON ➥8923113531 🔝Call Girls Kalyanpur Lucknow best Female service 🧵
CALL ON ➥8923113531 🔝Call Girls Kalyanpur Lucknow best Female service  🧵CALL ON ➥8923113531 🔝Call Girls Kalyanpur Lucknow best Female service  🧵
CALL ON ➥8923113531 🔝Call Girls Kalyanpur Lucknow best Female service 🧵anilsa9823
 
Booking open Available Pune Call Girls Nanded City 6297143586 Call Hot India...
Booking open Available Pune Call Girls Nanded City  6297143586 Call Hot India...Booking open Available Pune Call Girls Nanded City  6297143586 Call Hot India...
Booking open Available Pune Call Girls Nanded City 6297143586 Call Hot India...Call Girls in Nagpur High Profile
 

Último (20)

Chapter 19_DDA_TOD Policy_First Draft 2012.pdf
Chapter 19_DDA_TOD Policy_First Draft 2012.pdfChapter 19_DDA_TOD Policy_First Draft 2012.pdf
Chapter 19_DDA_TOD Policy_First Draft 2012.pdf
 
VIP College Call Girls Gorakhpur Bhavna 8250192130 Independent Escort Service...
VIP College Call Girls Gorakhpur Bhavna 8250192130 Independent Escort Service...VIP College Call Girls Gorakhpur Bhavna 8250192130 Independent Escort Service...
VIP College Call Girls Gorakhpur Bhavna 8250192130 Independent Escort Service...
 
NO1 Trending kala jadu Love Marriage Black Magic Punjab Powerful Black Magic ...
NO1 Trending kala jadu Love Marriage Black Magic Punjab Powerful Black Magic ...NO1 Trending kala jadu Love Marriage Black Magic Punjab Powerful Black Magic ...
NO1 Trending kala jadu Love Marriage Black Magic Punjab Powerful Black Magic ...
 
VIP Kolkata Call Girl Gariahat 👉 8250192130 Available With Room
VIP Kolkata Call Girl Gariahat 👉 8250192130  Available With RoomVIP Kolkata Call Girl Gariahat 👉 8250192130  Available With Room
VIP Kolkata Call Girl Gariahat 👉 8250192130 Available With Room
 
Dubai Call Girls Pro Domain O525547819 Call Girls Dubai Doux
Dubai Call Girls Pro Domain O525547819 Call Girls Dubai DouxDubai Call Girls Pro Domain O525547819 Call Girls Dubai Doux
Dubai Call Girls Pro Domain O525547819 Call Girls Dubai Doux
 
CALL ON ➥8923113531 🔝Call Girls Aminabad Lucknow best Night Fun service
CALL ON ➥8923113531 🔝Call Girls Aminabad Lucknow best Night Fun serviceCALL ON ➥8923113531 🔝Call Girls Aminabad Lucknow best Night Fun service
CALL ON ➥8923113531 🔝Call Girls Aminabad Lucknow best Night Fun service
 
VIP Call Girls Service Bhagyanagar Hyderabad Call +91-8250192130
VIP Call Girls Service Bhagyanagar Hyderabad Call +91-8250192130VIP Call Girls Service Bhagyanagar Hyderabad Call +91-8250192130
VIP Call Girls Service Bhagyanagar Hyderabad Call +91-8250192130
 
Fashion trends before and after covid.pptx
Fashion trends before and after covid.pptxFashion trends before and after covid.pptx
Fashion trends before and after covid.pptx
 
Captivating Charm: Exploring Marseille's Hillside Villas with Our 3D Architec...
Captivating Charm: Exploring Marseille's Hillside Villas with Our 3D Architec...Captivating Charm: Exploring Marseille's Hillside Villas with Our 3D Architec...
Captivating Charm: Exploring Marseille's Hillside Villas with Our 3D Architec...
 
WAEC Carpentry and Joinery Past Questions
WAEC Carpentry and Joinery Past QuestionsWAEC Carpentry and Joinery Past Questions
WAEC Carpentry and Joinery Past Questions
 
The history of music videos a level presentation
The history of music videos a level presentationThe history of music videos a level presentation
The history of music videos a level presentation
 
Verified Trusted Call Girls Adugodi💘 9352852248 Good Looking standard Profil...
Verified Trusted Call Girls Adugodi💘 9352852248  Good Looking standard Profil...Verified Trusted Call Girls Adugodi💘 9352852248  Good Looking standard Profil...
Verified Trusted Call Girls Adugodi💘 9352852248 Good Looking standard Profil...
 
VIP Call Girls Service Kukatpally Hyderabad Call +91-8250192130
VIP Call Girls Service Kukatpally Hyderabad Call +91-8250192130VIP Call Girls Service Kukatpally Hyderabad Call +91-8250192130
VIP Call Girls Service Kukatpally Hyderabad Call +91-8250192130
 
VVIP Pune Call Girls Hadapsar (7001035870) Pune Escorts Nearby with Complete ...
VVIP Pune Call Girls Hadapsar (7001035870) Pune Escorts Nearby with Complete ...VVIP Pune Call Girls Hadapsar (7001035870) Pune Escorts Nearby with Complete ...
VVIP Pune Call Girls Hadapsar (7001035870) Pune Escorts Nearby with Complete ...
 
VIP Call Girl Amravati Aashi 8250192130 Independent Escort Service Amravati
VIP Call Girl Amravati Aashi 8250192130 Independent Escort Service AmravatiVIP Call Girl Amravati Aashi 8250192130 Independent Escort Service Amravati
VIP Call Girl Amravati Aashi 8250192130 Independent Escort Service Amravati
 
Petrosains Drama Competition (PSDC).pptx
Petrosains Drama Competition (PSDC).pptxPetrosains Drama Competition (PSDC).pptx
Petrosains Drama Competition (PSDC).pptx
 
Recommendable # 971589162217 # philippine Young Call Girls in Dubai By Marina...
Recommendable # 971589162217 # philippine Young Call Girls in Dubai By Marina...Recommendable # 971589162217 # philippine Young Call Girls in Dubai By Marina...
Recommendable # 971589162217 # philippine Young Call Girls in Dubai By Marina...
 
CALL ON ➥8923113531 🔝Call Girls Kalyanpur Lucknow best Female service 🧵
CALL ON ➥8923113531 🔝Call Girls Kalyanpur Lucknow best Female service  🧵CALL ON ➥8923113531 🔝Call Girls Kalyanpur Lucknow best Female service  🧵
CALL ON ➥8923113531 🔝Call Girls Kalyanpur Lucknow best Female service 🧵
 
Call Girls Service Mukherjee Nagar @9999965857 Delhi 🫦 No Advance VVIP 🍎 SER...
Call Girls Service Mukherjee Nagar @9999965857 Delhi 🫦 No Advance  VVIP 🍎 SER...Call Girls Service Mukherjee Nagar @9999965857 Delhi 🫦 No Advance  VVIP 🍎 SER...
Call Girls Service Mukherjee Nagar @9999965857 Delhi 🫦 No Advance VVIP 🍎 SER...
 
Booking open Available Pune Call Girls Nanded City 6297143586 Call Hot India...
Booking open Available Pune Call Girls Nanded City  6297143586 Call Hot India...Booking open Available Pune Call Girls Nanded City  6297143586 Call Hot India...
Booking open Available Pune Call Girls Nanded City 6297143586 Call Hot India...
 

What is a system?

  • 1. What is a system? № 1, Design of Digital Machines Tim Sheiner 0.5beta 2013 This work by Tim Sheiner is licensed under a Creative Commons Attribution 3.0 United States.
  • 2. Sections in this presentation: ๏ A System Story ๏ What is a system? ๏ Characteristics of a system
  • 3. System Story
  • 4. (no text on slide)
  • 5. Huh?
  • 6. Huh(2x)?
  • 7. (no text on slide)
  • 8. Callouts: Outage: Christmas Eve, 12:30pm Pacific; Amazon Web Services, Elastic Load Balancers. Quote: “Netflix streaming was impacted on Christmas Eve 2012 by problems in the Amazon Web Services (AWS) Elastic Load Balancer (ELB) service that routes network traffic to the Netflix services supporting streaming.”
  • 9. New callouts (those from earlier slides remain): Americas only; TV connected devices, primarily. Quote: “The outage primarily affected playback on TV connected devices in the US, Canada and Latin America. Our service in the UK, Ireland and Nordic countries was not impacted.”
  • 10. New callouts: 100’s of ELBs; ~1:1 ELB: Device Type. Quote: “Netflix uses hundreds of ELBs. Each one supports a distinct service or a different version of a service and provides a network address that your Web browser or streaming device calls. Netflix streaming has been implemented on over a thousand different streaming devices over the last few years, and groups of similar devices tend to depend on specific ELBs.”
  • 11. New callouts: Failure localized to only some ELBs; Issue was requests not passed through. Quote: “Out of hundreds of ELBs in use by Netflix, a handful failed, losing their ability to pass requests to the servers behind them. None of the other AWS services failed, so our applications continued to respond normally whenever the requests were able to get through.”
  • 12. New callouts: Slight performance impact to Mac/PC; Game consoles impacted 7 hours. Quote: “Over-all streaming playback via Macs and PCs was only slightly reduced from normal levels. A few devices also saw no impact at all as those devices have an ELB configuration that kept running throughout the incident, providing normal playback levels. ... game consoles etc. were impacted for about seven hours.”
  • 13. No new callouts. Quote: “It is still early days for cloud innovation and there is certainly more to do in terms of building resiliency in the cloud. We have plans to work on this in 2013. It is an interesting and hard problem to solve, since ... the systems involved ... must be extremely reliable and capable of avoiding cascading overload failures.”
  • 14. (no text on slide)
  • 15. (no text on slide)
  • 16. Callouts: US-East Region ELB; Severe but localized interruption. Quote: “We would like to share more details with our customers about the event that occurred with the Amazon Elastic Load Balancing Service (“ELB”) earlier this week in the US-East Region. While the service disruption only affected applications using the ELB service (and only a fraction of the ELB load balancers were affected), the impacted load balancers saw significant impact for a prolonged period of time.”
  • 17. New callouts (those from earlier slides remain): 12:24 PM PST on December 24; ELB state data logically deleted. Quote: “The service disruption began at 12:24 PM PST on December 24th when a portion of the ELB state data was logically deleted.”
  • 18. New callouts: ELB control plane manages configurations; Tracking hosts for traffic routing. Quote: “This data is used and maintained by the ELB control plane to manage the configuration of the ELB load balancers in the region (for example tracking all the backend hosts to which traffic should be routed by each load balancer).”
  • 19. New callouts: Inadvertent maintenance process; production environment access; Unaware of error. Quote: “The data was deleted by a maintenance process that was inadvertently run against the production ELB state data. This process was run by one of a very small number of developers who have access to this production environment. Unfortunately, the developer did not realize the mistake at the time.”
  • 20. New callouts: High latency & error rates; API calls; No impact to running ELBs. Quote: “After this data was deleted, the ELB control plane began experiencing high latency and error rates for API calls to manage ELB load balancers. In this initial part of the service disruption, there was no impact to the request handling functionality of running ELB load balancers because the missing ELB state data was not integral to the basic operation of running load balancers.”
  • 21. New callouts: Create new, but not manage existing; Failure on attempt to scale. Quote: “The team was puzzled as many APIs were succeeding (customers were able to create and manage new load balancers but not manage existing load balancers) and others were failing. As this continued, some customers began to experience performance issues with their running load balancers. These issues only occurred after the ELB control plane attempted to make changes to a running load balancer.”
  • 22. New callout: 6.8% directly impacted, rest no scaling. Quote: “At 5:02 PM PST, the team disabled several of the ELB control plane workflows (including the scaling and descaling workflows) to prevent additional running load balancers from being affected by the missing ELB state data. At the peak of the event, 6.8% of running ELB load balancers were impacted. The rest of the load balancers in the system were unable to scale or be modified by customers, but were operating correctly.”
  • 23. New callouts: Merge old state; Initial recovery plan failed. Quote: “The team attempted to restore the ELB state data to a point-in-time just before the event began. By restoring the data to this time, we would be able to merge in events that happened after ... to create an accurate state. ... the initial method used by the team to restore the ELB state data ... failed to provide a usable snapshot of the data. This delayed recovery until an alternate recovery process was found.”
  • 24. New callout: 10:30 am substantial recovery; 20 hours. Quote: “The system began recovering the remaining affected load balancers, and by 8:15 AM PST, the team had re-enabled the majority of APIs and backend workflows. By 10:30 AM PST, almost all affected load balancers had been restored to full operation. While the service was substantially recovered at this time, the team continued to closely monitor the service before communicating broadly that it was operating normally at 12:05 PM PST.”
  • 25. No new callouts. Quote: “We have made a number of changes to protect the ELB service from this sort of disruption in the future. • modified the access controls on our production ELB state data • modified our data recovery process to reflect the learning we went through in this event. We will also incorporate our learning from this event into our service architecture. We believe that we can reprogram [to] allow the service to recover automatically from logical data loss.”
  • 26. All callouts shown together: Outage: Christmas Eve, 12:30pm Pacific; Amazon Web Services, Elastic Load Balancers; Americas only; TV connected devices, primarily; 100’s of ELBs; ~1:1 ELB: Device Type; Failure localized to only some ELBs; Issue was requests not passed through; Slight performance impact to Mac/PC; Game consoles impacted 7 hours; US-East Region ELB; Severe but localized interruption; 12:24 PM PST on December 24; ELB state data logically deleted; ELB control plane manages configurations; Tracking hosts for traffic routing; Inadvertent maintenance process; production environment access; Unaware of error; High latency & error rates; API calls; No impact to running ELBs; Create new, but not manage existing; Failure on attempt to scale; 6.8% directly impacted, rest no scaling; Merge old state; Initial recovery plan failed; 10:30 am substantial recovery; 20 hours.
  • 27. {(Netflix) + (Amazon)}: the callouts regrouped under three headings. Events: Outage: Christmas Eve, 12:30pm Pacific; Americas only; TV connected devices, primarily; Severe but localized interruption; 12:24 PM PST on December 24; US-East Region ELB; ELB state data logically deleted. Structural Explanation, Objects & Relationships: Inadvertent maintenance process; production environment access; Unaware of error; Amazon Web Services, Elastic Load Balancers; 100’s of ELBs; ~1:1 ELB: Device Type; Create new, but not manage existing; Failure on attempt to scale; Merge old state; ELB control plane manages configurations; Tracking hosts for traffic routing; ELB control plane; Initial recovery plan failed; 10:30 am substantial recovery; 20 hours. Patterns: Slight performance impact to Mac/PC; Failure localized to only some ELBs; Issue was requests not passed through; Game consoles impacted 7 hours; 6.8% directly impacted, rest no scaling; High latency & error rates; No impact to running ELBs; API calls.
  • 28. {Netflix + Amazon}: the same grouping of callouts as slide 27.
  • 29. (no text on slide)
  • 30. What is a system?
  • 31. Bricks
  • 32. Brick Systems or Brick Collections?
  • 33. “A system is an interconnected set of elements that is coherently organized in a way that achieves something.” Donella Meadows, Thinking in Systems
  • 34. Operational View of a System: 1. Objects. (Diagram: A, B, C, D)
  • 35. Operational View of a System: 1. Objects 2. Relationships. (Diagram: A, B, C, D)
  • 36. Operational View of a System: 1. Objects 2. Relationships 3. Currency. (Diagram: A, B, C, D)
  • 37. Operational View of a System: 1. Objects 2. Relationships 3. Currency 4. Boundary. (Diagram: A, B, C, D)
  • 38. Operational View of a System: 1. Objects 2. Relationships 3. Currency 4. Boundary 5. Purpose. (Diagram: A, B, C, D with Input and Output)
  • 39. Dynamic View of a System. (Diagram: A, B, C, D with Input and Output, then A’, B’, C’, D’ with Input’ and Output’, along a Time axis)
  • 40. Dynamic View of a System: Behavior vs Time. (Plot: Output, 0 to 100, against Time, 0 to 20)
  • 41. The General System: A system is an interconnected set of elements that is coherently organized in a way that achieves something.
  • 42. The Specific System: These elements. Those connections. This organization. That boundary. This purpose.
  • 43. Seeing systems
  • 44. If it looks like a duck... ๏ A system’s parts must all be present for the system to carry out its purpose optimally. ๏ A system’s parts must be arranged in a specific way for the system to carry out its purpose. ๏ Systems have specific purposes within larger systems. ๏ Systems maintain their stability through fluctuations and adjustments. ๏ Systems have feedback.
  • 45. The nature of systems is that your understanding of a particular one gets more precise over time.
  • 46. Seeing Systems. EVENTS: Outage: Christmas Eve, 12:30pm Pacific; Severe but localized interruption. Caption: Events are what we notice first.
  • 47. Seeing Systems. EVENTS: Outage: Christmas Eve, 12:30pm Pacific; Severe but localized interruption. PATTERNS: TV connected devices, primarily; Failure on attempt to scale. Caption: Patterns = Observation(Events + Time)
  • 48. Seeing Systems. EVENTS: Outage: Christmas Eve, 12:30pm Pacific; Severe but localized interruption. PATTERNS: TV connected devices, primarily; Failure on attempt to scale. STRUCTURE: Issue was requests not passed through; ELB state data logically deleted. Caption: From patterns we deduce structure via ‘black box’ process
  • 49. Seeing Systems. EVENTS: Outage: Christmas Eve, 12:30pm Pacific; Severe but localized interruption. PATTERNS: TV connected devices, primarily; Failure on attempt to scale. STRUCTURE: Issue was requests not passed through; ELB state data logically deleted. CONTEXT: Amazon Web Services, Elastic Load Balancers; ELB control plane manages configurations. Caption: Context helps us discriminate the isomorph
  • 50. Fin. 1. Objects 2. Relationships 3. Currency 4. Boundary 5. Purpose. (Diagram: A, B, C, D with Input and Output)
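
The five parts of the operational view (objects, relationships, currency, boundary, purpose) can be sketched as a small data structure. This is only one possible reading of the slides, not the author's model: the class names `Relationship` and `System`, the choice to treat the boundary as "the set of objects inside", and the toy streaming example are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relationship:
    """A directed connection carrying some currency from source to target."""
    source: str
    target: str
    currency: str  # what flows along the connection, e.g. "requests"


@dataclass
class System:
    """Hypothetical model of the operational view of a system."""
    purpose: str                                   # 5. Purpose
    objects: set[str]                              # 1. Objects
    relationships: list[Relationship] = field(default_factory=list)  # 2 and 3

    def inside(self, name: str) -> bool:
        # 4. Boundary: assumed here to be membership in the object set.
        return name in self.objects

    def inputs(self) -> list[Relationship]:
        # Relationships that cross the boundary from outside to inside.
        return [r for r in self.relationships
                if not self.inside(r.source) and self.inside(r.target)]

    def outputs(self) -> list[Relationship]:
        # Relationships that cross the boundary from inside to outside.
        return [r for r in self.relationships
                if self.inside(r.source) and not self.inside(r.target)]

    def internal(self) -> list[Relationship]:
        # Relationships whose two ends are both inside the boundary.
        return [r for r in self.relationships
                if self.inside(r.source) and self.inside(r.target)]


# Toy example, invented for illustration and only loosely inspired by the story above.
streaming = System(
    purpose="deliver video to a device",
    objects={"load balancer", "streaming service"},
    relationships=[
        Relationship("device", "load balancer", "requests"),
        Relationship("load balancer", "streaming service", "requests"),
        Relationship("streaming service", "device", "video"),
    ],
)

for r in streaming.inputs():
    print("input: ", r.source, "->", r.target, f"({r.currency})")
for r in streaming.outputs():
    print("output:", r.source, "->", r.target, f"({r.currency})")
```

Moving "device" into `objects` would redraw the boundary and turn the same relationships into internal ones, which is one way to see that a specific system depends on where the boundary is placed.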