Impact of Soft Errors on Reliability
        and Availability of Servers in the
            Internet Computing Era
                  Ishwar Parulkar
               Sun Microsystems, Inc.




VTS 2006                                      Slide 1
Outline
           • Servers for the Internet Era
           • Server Reliability and Availability
              – Impact to customers
              – Metrics and typical targets
           • Soft Errors in Silicon Components
              – Classification at system level
              – Sensitivity of system metrics
           • Chip Soft Error Trends and Solutions
           • Conclusions

VTS 2006                                            Slide 2
Categorizing Internet Era Workloads
   Figure: two-axis map of workloads
      • Vertical axis: Highly Threaded (top) to Single Threaded (bottom)
      • Horizontal axis: Storage Intensive (left) to Network Intensive (right)
      • Legend: Commercial, Technical
      • Cluster labels: Data, Web, Application, HPC, Compute
      • Workloads plotted, listed by approximate position:
         – Data Warehousing, Data Analysis, OLTP Database, File Server,
           ERP (SAP R3), Batch
         – Proxy Caching, Web Serving, Streaming Media, Security, Directory
         – J2EE Application Servers, EAI Servers, Workgroup,
           Application Development
         – Meteorology/Climate Simulation, Nuclear Simulation/Weapons Modeling,
           Seismic Analysis, Reservoir Modeling, Thermodynamics
         – Compute Grid, Structural Analysis, Electronic Design Simulation,
           Financial Risk/Portfolio Analysis, Monte Carlo Simulation,
           Genomics, Cheminformatics
VTS 2006                                                                                           Slide 10
Optimizing Servers for Workloads

      • Three primary server design points
         – Data centric servers
         – Web centric servers
         – Compute centric servers
      • Application centric servers leverage design
        point of Data and Web centric
      • HPC centric servers leverage design point of
        Data centric


VTS 2006                                               Slide 11
Server Reliability and Availability
   Basic Concepts
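   Figure: timeline alternating between "System Available" and "System Down"
      • "Failure Occurs" ends each available period; "Restart" ends each down period
      • MTTF spans an available period (restart to failure)
      • MTTR spans a down period (failure to restart)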




     • Reliability = MTBF = MTTF + MTTR
     • Availability = MTTF/MTBF = 1 - (MTTR/MTBF)
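
The two definitions above can be checked numerically. This is a minimal sketch; the function names and the sample figures (10,000 hours to failure, 4 hours to repair) are illustrative assumptions, not values from the talk.

```python
def mtbf(mttf_hours, mttr_hours):
    """Mean time between failures, per the slide: MTBF = MTTF + MTTR."""
    return mttf_hours + mttr_hours


def availability(mttf_hours, mttr_hours):
    """Fraction of time the system is up: MTTF / MTBF."""
    return mttf_hours / mtbf(mttf_hours, mttr_hours)


# Hypothetical server: runs 10,000 hours on average, takes 4 hours to restore.
example = availability(10_000.0, 4.0)
print(f"MTBF = {mtbf(10_000.0, 4.0):.0f} hours, availability = {example:.6f}")

# The second form on the slide, 1 - (MTTR / MTBF), gives the same number.
alternate = 1 - 4.0 / mtbf(10_000.0, 4.0)
print(f"1 - MTTR/MTBF = {alternate:.6f}")
```

Both forms agree because MTTF = MTBF - MTTR, so shortening repair time raises availability even when the failure rate is unchanged.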

VTS 2006                                                                      Slide 12
Cost of UnReliability and UnAvailability
   • Ebay outages
      – Estimated $3-5 million lost in revenue because
        of returned fees and lost business
      – $5 billion drop in market capitalization
   • Ameritrade, Schwab, E*Trade outages
      – Class action lawsuit for intermittent service
   • Akamai outage
      – Akamai handles 15% of world's Internet traffic
      – Google, Yahoo, Ebay, etc. affected by this
        outage
   Note: Not all outages were hardware related
VTS 2006                                                 Slide 13
Cost of UnReliability and UnAvailability
           Customer behavior after an Internet server/site outage

   Pie chart with slices of 9%, 24%, 53% and 13%. Legend entries:
      • No change in behavior
      • Found a new site, used it once
      • Found a new site, continued to use both
      • Stopped using site altogether




Source: Jupiter Communications – Internet Research Firm
VTS 2006                                                                    Slide 14
Cost of UnReliability and UnAvailability
   Bar chart of downtime cost by business, in K$ Per Hour
   (log scale from 10 to 10,000)

            Business               K$ Per Hour
            Brokerage                 6,450
            Credit Card               2,600
            Ebay                        225
            Amazon                      180
            Package Shipping            150
            Home Shopping               113
            Catalog Sales                90
            Airline Reservation          89
            Cellular Service             41
            On-line Network              25
            ATM Service                  14

Source: InternetWeek 4/3/2000
VTS 2006                                             Slide 15
Server Reliability and Availability
   Customer Perspective

       • Impacts felt by customers
          – Silent data corruption (SDC)
          – Unscheduled system interruptions (USI)
          – Service or repair rate
          – Downtime (or Uptime)
       • Metrics
         – Mean time between SDC (MTBSDC)
         – Mean time between USI (MTBUSI)
         – Mean time between repair (MTBR)
         – Availability
VTS 2006                                             Slide 16
Server Reliability and Availability
   Typical Targets


           Server Type        MTBSDC           MTBUSI        Availability (%)
           Data Centric       100-1000 years   10-25 years   99.999
           Web Centric        10-100 years     10-25 years   99.999-99.9999
           Compute Centric    100-1000 years   2-10 years    99.990
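
To make the availability targets concrete, they can be converted to allowed downtime per year. A minimal sketch, assuming the Availability column is a percentage and a 365-day year; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year


def downtime_minutes_per_year(availability_percent):
    """Downtime budget implied by an availability target given in percent."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for target in (99.990, 99.999, 99.9999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% available -> {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, 99.999% corresponds to a little over five minutes of downtime per year, and each additional nine cuts the budget by a factor of ten.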




     MTBF in years = 10^9 / (FIT * 24 Hours * 365 Days)
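
The conversion above follows from the definition of FIT as failures per 10^9 device-hours. A short sketch of the conversion in both directions; the function names are illustrative, and the sample inputs are round numbers chosen for the example rather than figures from the talk.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, matching the slide's formula


def fit_to_mtbf_years(fit):
    """Mean time between failures in years for a failure rate given in FIT."""
    return 1e9 / (fit * HOURS_PER_YEAR)


def mtbf_years_to_fit(years):
    """Total FIT budget that yields the given mean time between failures."""
    return 1e9 / (years * HOURS_PER_YEAR)


print(f"1000 FIT      -> {fit_to_mtbf_years(1000):.1f} years")
print(f"100-year goal -> {mtbf_years_to_fit(100):.1f} FIT budget")
```

Read against the targets table, a 100-year target leaves a budget of roughly 1,140 FIT for the whole server, which is why per-chip contributions of a few hundred FIT matter.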


VTS 2006                                                                  Slide 17
A Typical Data Centric Server

           Component                        Approx. Count   Comments
           Processors                       8-64            8-64 way systems
           ASICs                            320             Memory controllers, IO bridges, Crypto, etc.
           Memory DIMMs                     640             Depends on memory capacity
           AC/DC Power Supplies             8-10            Main power supply
           DC/DC Power Supplies             640             High and low voltage supplies
           Clocking                         64              Clock synthesizers and distribution
           Service Processor                4               Small processors, FPGA
           Miscellaneous Small Components   1000-10000      Resistors, Capacitors, Pins, Connectors


VTS 2006                                                                                   Slide 18
Impact of Silicon Soft Errors on
   Servers

  • How much do silicon soft errors contribute to
    total failures in systems?
  • To what degree is each of the system level
    metrics impacted by silicon soft errors?
  • How much protection is adequate?




VTS 2006                                              Slide 19
Classification of Silicon Soft Errors




   Figure: the universe of soft errors in a server chip, shown as a single set




VTS 2006                                  Slide 20
Classification of Silicon Soft Errors
                        Corrected                      Uncorrected
           Silent       SC: Customer does not care     SU: Silent Data Corruption
                        or need not know               (MTBSDC)
           Reported     RC: Required by Service/       RU: System Crash
                        Customer to monitor health     (MTBUSI)
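
The two-by-two classification can be written down as a small lookup. This is an illustrative sketch only: the type names, the `classify` and `tally` helpers, and the sample event list are assumptions made for the example, while the four classes and their system-level meaning come from the slide.

```python
from collections import Counter
from enum import Enum


class Visibility(Enum):
    SILENT = "S"
    REPORTED = "R"


class Outcome(Enum):
    CORRECTED = "C"
    UNCORRECTED = "U"


# System-level meaning of each class, following the slide.
SYSTEM_IMPACT = {
    "SC": "customer does not care or need not know",
    "SU": "silent data corruption (counts against MTBSDC)",
    "RC": "reported so service/customer can monitor health",
    "RU": "system crash (counts against MTBUSI)",
}


def classify(visibility, outcome):
    """Return the two-letter class (SC, SU, RC or RU) for one soft error."""
    return visibility.value + outcome.value


def tally(events):
    """Count soft errors per class from (visibility, outcome) pairs."""
    return Counter(classify(v, o) for v, o in events)


# Hypothetical event log, for illustration only.
events = [
    (Visibility.REPORTED, Outcome.CORRECTED),
    (Visibility.REPORTED, Outcome.CORRECTED),
    (Visibility.REPORTED, Outcome.UNCORRECTED),
    (Visibility.SILENT, Outcome.UNCORRECTED),
]
for label, count in sorted(tally(events).items()):
    print(f"{label}: {count} -> {SYSTEM_IMPACT[label]}")
```

Only the SU and RU counts feed the customer-visible metrics; RC events matter for health monitoring, and SC events are invisible by definition.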

VTS 2006                                                  Slide 26
Silent Data Corruption
   Total Server FIT

   Pie A: Without any protection
      • Slices: 80%, 18%, 2%
   Pie B: With SEC-DED on Memory
      • Slices: 89%, 9%, 1%
   Legend (both pies): Memory, Proc. + ASICs, Misc.
Note: Total FIT in A > Total FIT in B
VTS 2006                                                                  Slide 28
Sensitivity to Silicon Soft Errors
   (Silent Data Corruption)

   Figure: Sensitivity of Server to Processor SU Rate
      • Y axis: Server MTBSDC (Years), 0 to 120
      • X axis: Processor SU (Silent Uncorrected) FIT, 100 to 700




VTS 2006                                                                          Slide 29
Unscheduled System Interruptions
   Total Server FIT
   Pie A: Without any Redundancy or Protection
      • Slices: 70%, 20%, 8%, 2%
   Pie B: With Power Redundancy and SEC-DED on Memory
      • Slices: 52%, 35%, 12%, 1%
   Legend (both pies): Power, Memory, Proc. + ASICs, Misc.

Note: Total FIT in A > Total FIT in B
VTS 2006                                                                     Slide 31
Sensitivity to Silicon Soft Errors
   (Unscheduled System Interruptions)

   Figure: Server Sensitivity to Processor RU Rate
      • Y axis: Server MTBUSI (Years), 0 to 20
      • X axis: Processor RU (Reported Uncorrected) FIT, 100 to 700



VTS 2006                                                                       Slide 32
Server Processor Trends (Memory)
   Figure: On-chip memory trend*

            Processor generation          Memory bits (million)
            64b, 130nm, Single Core                30
            Dual core (1st Gen CMT)                35
            8-core (2nd Gen CMT)                   40
            Next generation CMT                    45

   *Assuming 2-4MB on-chip level-2 cache


       • Typically memories >8KB protected with SEC-DED,
         2Kb-8KB protected with variants of parity
       • Contribution of memories to chip level FIT rate has
         been fairly constant over time
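
For readers unfamiliar with SEC-DED (single error correct, double error detect), the idea can be sketched with a toy extended Hamming (8,4) code. This is an illustration of the principle under stated assumptions, not the code used in any of the processors discussed; production memories typically protect much wider words, and the function names here are made up for the example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    data = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0: overall parity, indexes 1..7: Hamming positions
    code[3], code[5], code[6], code[7] = data
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions with bit 0 set
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions with bit 1 set
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions with bit 2 set
    code[0] = sum(code[1:]) % 2            # gives the whole word even parity
    return code


def decode(code):
    """Return (nibble or None, status): 'ok', 'corrected' or 'uncorrectable'."""
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position  # XOR of set positions is 0 for a valid word
    odd_parity = sum(code) % 2 == 1
    word = list(code)
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Single-bit error: the syndrome names the flipped position
        # (a syndrome of 0 means the overall parity bit itself flipped).
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Even parity with a nonzero syndrome: two bits flipped.
        return None, "uncorrectable"
    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return nibble, status


word = encode(0b1011)
word[6] ^= 1              # one soft error
print(decode(word))       # (11, 'corrected')
word[2] ^= 1              # a second soft error in the same word
print(decode(word))       # (None, 'uncorrectable')
```

In the deck's classification, the first case would be a reported corrected (RC) error and the second a reported uncorrected (RU) error. Plain parity, by contrast, detects a single flip but cannot correct it.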
VTS 2006                                                                                            Slide 33
Server Processor Trends (Flops)
   Figure: On-chip flop trend

            Processor generation          Flops per chip (K)
            64b 130nm Single Core                  80
            Dual core (1st Gen CMT)               200
            8-core (2nd Gen CMT)                  500
            Next generation CMT                  1000



     • With chip multi-threading (CMT), more pipelines
       on a chip, hence more logic
VTS 2006                                                                                       Slide 34
Server Processor Trends (Flops)
    • Flop soft error FIT is typically 0.001 FIT/bit *
    • 30% of flop bit flips contribute to chip failure **
   Figure: Chip level FIT contribution of flops

            Processor generation          FIT per chip
            64b 130nm Single Core                24
            Dual core (1st Gen CMT)              60
            8-cores (2nd Gen CMT)               150
            Next generation CMT                 300

  * SELSE II (Workshop on System Effects of Logic Soft Errors)
  ** Fault injection with architectural trace simulation
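
The chip-level figures follow directly from the two bullets: flop count times 0.001 FIT/bit times the 30% of flips that cause a chip failure. A short sketch of that arithmetic, using the flop counts from the on-chip flop trend slide; the variable and function names are illustrative.

```python
FIT_PER_FLOP = 0.001      # raw soft error rate per flop bit, from the slide
FAILURE_FRACTION = 0.30   # share of flop bit flips that cause a chip failure

# Flop counts per chip, taken from the on-chip flop trend slide.
FLOPS_PER_CHIP = {
    "64b 130nm Single Core": 80_000,
    "Dual core (1st Gen CMT)": 200_000,
    "8-core (2nd Gen CMT)": 500_000,
    "Next generation CMT": 1_000_000,
}


def chip_flop_fit(flop_count):
    """Chip-level FIT contributed by flops after derating."""
    return flop_count * FIT_PER_FLOP * FAILURE_FRACTION


for generation, flops in FLOPS_PER_CHIP.items():
    print(f"{generation}: {chip_flop_fit(flops):.0f} FIT")
```

The results (24, 60, 150 and 300 FIT) match the chart, so the flop contribution scales linearly with flop count across generations.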
VTS 2006                                                                                      Slide 35
Sensitivity to Processor Flop FIT
   Figure: the two sensitivity plots side by side
      • Sensitivity to Processor SU Rate: Server MTBSDC (Years), 0 to 120,
        versus Processor SU (Silent Uncorrected) FIT, 100 to 700;
        annotated points at 89 years and 42 years
      • Sensitivity to Processor RU Rate: Server MTBUSI (Years), 0 to 20,
        versus Processor RU (Reported Uncorrected) FIT, 100 to 700;
        annotated points at 17 years and 14 years




                                  • A 150 FIT increase in processor implies:
                                     – 52.8% degradation of MTBSDC
                                     – 17.7% degradation of MTBUSI
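
The two percentages can be checked against the points marked on the charts. A small sketch, assuming the marked values (89 to 42 years for MTBSDC, 17 to 14 years for MTBUSI) are the before and after points for the 150 FIT increase; the function names are illustrative, and the FIT conversion uses the deck's own formula, MTBF in years = 10^9 / (FIT * 24 * 365):

```python
HOURS_PER_YEAR = 24 * 365


def mtbf_years_to_fit(years: float) -> float:
    """Invert the deck's formula: FIT = 1e9 / (MTBF in hours)."""
    return 1e9 / (years * HOURS_PER_YEAR)


def degradation_pct(before_years: float, after_years: float) -> float:
    """Relative drop in a mean-time-between metric, in percent."""
    return 100.0 * (before_years - after_years) / before_years


# Values as marked (rounded) on the two charts of this slide.
mtbsdc_drop = degradation_pct(89, 42)
mtbusi_drop = degradation_pct(17, 14)

print(f"MTBSDC degradation: {mtbsdc_drop:.1f}%")  # 52.8%
print(f"MTBUSI degradation: {mtbusi_drop:.1f}%")  # 17.6%

# Total server FIT implied by each marked point (no expected values asserted here).
for label, years in [("MTBSDC", 89), ("MTBSDC", 42), ("MTBUSI", 17), ("MTBUSI", 14)]:
    print(f"{label} of {years} years corresponds to {mtbf_years_to_fit(years):.0f} FIT")
```

The MTBSDC figure matches the slide exactly. The MTBUSI figure comes out at 17.6% rather than 17.7%, most likely because the chart labels are rounded to whole years.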
Directions for Solutions to Soft Errors
     • Unit level redundancy is too costly in server
       space, need cheaper solutions
     • Circuit level solutions can be limiting
        – Cannot reduce failure rate to 0
        – Reporting corrected errors
        – CAD, design methodology limitations
     • Logic level and architectural techniques more
       promising - cost/flexibility/portability
     • Detection alone is not sufficient – correction
       or recovery is needed too
     • Taking advantage of features of CMT processors
Conclusions
   • Investment in mitigation of soft errors in silicon
     should be based on top-down system targets
   • Not all soft errors in silicon are equal
   • System level impact of silicon soft errors
      – Very high on silent data corruption rate
      – Medium on unscheduled interruption rate
      – Low on availability
   • Flop SER significant for some types of servers
   • Solutions need to be low overhead – mainframe
     level reliability/availability at server price points
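
One way to see why the impact on availability is rated low is to plug the MTBUSI values from the sensitivity slide into the deck's availability formula, Availability = 1 - (MTTR / MTBF). The sketch below is only an illustration: it treats MTBUSI as the MTBF, counts unscheduled interruptions only, and assumes a one-hour restart time, which is a hypothetical figure not given in the talk.

```python
HOURS_PER_YEAR = 24 * 365

# Hypothetical restart/repair time; the talk does not state an MTTR.
ASSUMED_MTTR_HOURS = 1.0


def availability(mtbf_years: float, mttr_hours: float) -> float:
    """Availability = 1 - (MTTR / MTBF), with both terms in hours."""
    return 1.0 - mttr_hours / (mtbf_years * HOURS_PER_YEAR)


# MTBUSI before and after the 150 FIT increase, as marked on the chart.
before = availability(17, ASSUMED_MTTR_HOURS)
after = availability(14, ASSUMED_MTTR_HOURS)

# Relative change in each metric, in percent.
mtbusi_drop_pct = 100.0 * (17 - 14) / 17
availability_drop_pct = 100.0 * (before - after) / before

print(f"Availability before: {before:.6%}")
print(f"Availability after:  {after:.6%}")
print(f"MTBUSI drop: {mtbusi_drop_pct:.1f}%")
print(f"Availability drop: {availability_drop_pct:.6f}%")
```

Under these assumptions a drop of roughly 18% in MTBUSI moves availability by far less than a thousandth of a percent, because downtime per interruption is tiny compared with the time between interruptions.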
