This document discusses the impact of soft errors on the reliability and availability of servers used for internet computing. It outlines how soft errors can lead to silent data corruption or unscheduled system interruptions. While memory is a major source of soft errors, the number of logic gates and pipelines in processors is increasing, thereby increasing their potential soft error rate over time. Techniques like error correction codes help mitigate soft errors but ongoing improvements are needed to meet high reliability targets for internet infrastructure.
1. Impact of Soft Errors on Reliability
and Availability of Servers in the
Internet Computing Era
Ishwar Parulkar
Sun Microsystems, Inc.
VTS 2006 Slide 1
2. Outline
• Servers for the Internet Era
• Server Reliability and Availability
– Impact to customers
– Metrics and typical targets
• Soft Errors in Silicon Components
– Classification at system level
– Sensitivity of system metrics
• Chip Soft Error Trends and Solutions
• Conclusions
10. Categorizing Internet Era Workloads
[Figure: workloads plotted on two axes. Vertical axis: Single Threaded to Highly Threaded. Horizontal axis: Storage Intensive to Network Intensive. Each workload is marked as Technical or Commercial, and clusters of workloads are labeled Data, Web, Application, HPC and Compute.]
Workloads shown:
• Data Warehousing
• Proxy Caching
• Meteorology/Climate Simulation
• Data Analysis
• Web Serving
• Nuclear Simulation/Weapons Modeling
• Streaming Media
• OLTP Database
• Security
• File Server
• Seismic Analysis, Reservoir Modeling
• ERP (SAP R3)
• Directory
• Thermodynamics
• J2EE Application Servers
• Batch
• EAI Servers
• Structural Analysis
• Electronic Design Simulation
• Workgroup Compute Grid
• Application Development
• Financial Risk/Portfolio Analysis
• Monte Carlo Simulation
• Genomics, Cheminformatics
11. Optimizing Servers for Workloads
• Three primary server design points
– Data centric servers
– Web centric servers
– Compute centric servers
• Application centric servers leverage design
point of Data and Web centric
• HPC centric servers leverage design point of
Data centric
12. Server Reliability and Availability
Basic Concepts
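[Timeline: the system alternates between System Available and System Down. A down period begins when a failure occurs and ends at restart. MTTR spans failure to restart; MTTF spans restart to the next failure.]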
• Reliability = MTBF = MTTF + MTTR
• Availability = MTTF/MTBF = 1 - (MTTR/MTBF)
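The two formulas above can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the input values (a 10-year MTTF and a 4-hour MTTR) are hypothetical and do not come from the presentation.

```python
def mtbf(mttf: float, mttr: float) -> float:
    """MTBF = MTTF + MTTR (one full available-plus-down cycle)."""
    return mttf + mttr


def availability(mttf: float, mttr: float) -> float:
    """Availability = MTTF / MTBF, which equals 1 - (MTTR / MTBF)."""
    return mttf / mtbf(mttf, mttr)


# Hypothetical inputs, both in hours (not taken from the slides).
mttf_hours = 10 * 365 * 24   # 10 years between restart and next failure
mttr_hours = 4.0             # 4 hours from failure to restart

print(f"MTBF: {mtbf(mttf_hours, mttr_hours):.0f} hours")
print(f"Availability: {availability(mttf_hours, mttr_hours):.5%}")
```

Both inputs must use the same time unit; the result is a fraction between 0 and 1.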
13. Cost of UnReliability and UnAvailability
• Ebay outages
– Estimated $3-5 million lost in revenue because
of returned fees and lost business
– $5 billion drop in market capitalization
• Ameritrade, Schwab, E*Trade outages
– Class action lawsuit for intermittent service
• Akamai outage
– Akamai handles 15% of world's Internet traffic
– Google, Yahoo, Ebay, etc. affected by this
outage
Note: Not all outages were hardware related
14. Cost of UnReliability and UnAvailability
Customer behavior after an Internet server/site outage
• No change in behavior: 9%
• Found a new site, used it once: 24%
• Found a new site, continued to use both: 53%
• Stopped using site altogether: 13%
Source: Jupiter Communications – Internet Research Firm
15. Cost of UnReliability and UnAvailability
[Bar chart, logarithmic axis from 10 to 10,000 K$ per hour]
Business               K$ Per Hour
Brokerage              6,450
Credit Card            2,600
Ebay                   225
Amazon                 180
Package Shipping       150
Home Shopping          113
Catalog Sales          90
Airline Reservation    89
Cellular Service       41
On-line Network        25
ATM Service            14
Source: InternetWeek 4/3/2000
16. Server Reliability and Availability
Customer Perspective
• Impacts felt by customers
– Silent data corruption (SDC)
– Unscheduled system interruptions (USI)
– Service or repair rate
– Downtime (or Uptime)
• Metrics
– Mean time between SDC (MTBSDC)
– Mean time between USI (MTBUSI)
– Mean time between repair (MTBR)
– Availability
17. Server Reliability and Availability
Typical Targets
Server Type        MTBSDC            MTBUSI         Availability (%)
Data Centric       100-1000 years    10-25 years    99.999
Web Centric        10-100 years      10-25 years    99.999-99.9999
Compute Centric    100-1000 years    2-10 years     99.990
MTBF in years = 10^9 / (FIT * 24 Hours * 365 Days)
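As a rough illustration of the conversion above, the sketch below turns a FIT rate (failures per 10^9 device-hours) into an MTBF in years and back, and converts an availability percentage into downtime per year. The function names are made up for this example, and the sample inputs are chosen only to line up with the ranges in the table.

```python
HOURS_PER_YEAR = 24 * 365  # same convention as the slide formula


def fit_to_mtbf_years(fit: float) -> float:
    """MTBF in years = 10^9 / (FIT * 24 hours * 365 days)."""
    return 1e9 / (fit * HOURS_PER_YEAR)


def mtbf_years_to_fit(years: float) -> float:
    """Inverse conversion: the FIT budget implied by a target in years."""
    return 1e9 / (years * HOURS_PER_YEAR)


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Expected downtime per year for a given availability percentage."""
    return (1.0 - availability_percent / 100.0) * HOURS_PER_YEAR * 60


# A 100-year target corresponds to a total budget of roughly 1140 FIT.
print(f"{mtbf_years_to_fit(100):.0f} FIT")
# 99.999% availability allows a little over 5 minutes of downtime per year.
print(f"{downtime_minutes_per_year(99.999):.2f} minutes")
```

Because FIT rates of independent components add, a system-level target in years can be turned into a total FIT budget and then divided among components.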
18. A Typical Data Centric Server
Component                         Approx. Count    Comments
Processors                        8-64             8-64 way systems
ASICs                             320              Memory controllers, IO bridges, Crypto, etc.
Memory DIMMs                      640              Depends on memory capacity
AC/DC Power Supplies              8-10             Main power supply
DC/DC Power Supplies              640              High and low voltage supplies
Clocking                          64               Clock synthesizers and distribution
Service Processor                 4                Small processors, FPGA
Miscellaneous Small Components    1000-10000       Resistors, Capacitors, Pins, Connectors
19. Impact of Silicon Soft Errors on
Servers
• How much do silicon soft errors contribute
to total failures in systems?
• To what degree is each of the system level
metrics impacted by silicon soft errors?
• How much protection is adequate?
26. Classification of Silicon Soft Errors
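            Corrected    Uncorrected
Silent      SC           SU
Reported    RC           RU
• SC (silent, corrected): customer does not care or need not know
• SU (silent, uncorrected): Silent Data Corruption (MTBSDC)
• RC (reported, corrected): required by Service/Customer to monitor health
• RU (reported, uncorrected): System Crash (MTBUSI)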
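One way to picture this two-by-two classification is as a small lookup keyed on whether an error is reported and whether it is corrected. The Python below is only a sketch: the names `SoftError`, `classify` and `affected_metric` are invented for illustration and are not part of the presentation.

```python
from enum import Enum
from typing import Optional


class SoftError(Enum):
    SC = "silent, corrected"
    SU = "silent, uncorrected"
    RC = "reported, corrected"
    RU = "reported, uncorrected"


def classify(reported: bool, corrected: bool) -> SoftError:
    """Map the two attributes of a soft error onto the four classes."""
    if reported:
        return SoftError.RC if corrected else SoftError.RU
    return SoftError.SC if corrected else SoftError.SU


# Only the uncorrected classes feed a customer-visible metric on the slide.
AFFECTED_METRIC = {
    SoftError.SU: "MTBSDC",  # silent data corruption
    SoftError.RU: "MTBUSI",  # system crash / unscheduled interruption
}


def affected_metric(error: SoftError) -> Optional[str]:
    """Return the system metric an error class degrades, or None."""
    return AFFECTED_METRIC.get(error)


print(classify(reported=False, corrected=False), affected_metric(SoftError.SU))
```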
28. Silent Data Corruption
[Two pie charts of Total Server FIT, broken down by Memory, Proc. + ASICs and Misc. Chart A (without any protection) shows slices of 2%, 18% and 80%. Chart B (with SEC-DED on Memory) shows slices of 1%, 9% and 89%.]
Note: Total FIT in A > Total FIT in B
29. Sensitivity to Silicon Soft Errors
(Silent Data Corruption)
[Plot: Sensitivity of Server to Processor SU Rate. Y-axis: Server MTBSDC (Years), 0 to 120. X-axis: Processor SU (Silent Uncorrected) FIT, 100 to 700.]
31. Unscheduled System Interruptions
[Two pie charts of Total Server FIT, broken down by Power, Memory, Proc. + ASICs and Misc. Chart A (without any Redundancy or Protection) shows slices of 2%, 8%, 20% and 70%. Chart B (with Power Redundancy and SEC-DED on Memory) shows slices of 1%, 12%, 35% and 52%.]
Note: Total FIT in A > Total FIT in B
32. Sensitivity to Silicon Soft Errors
(Unscheduled System Interruptions)
[Plot: Server Sensitivity to Processor RU Rate. Y-axis: Server MTBUSI (Years), 0 to 20. X-axis: Processor RU (Reported Uncorrected) FIT, 100 to 700.]
33. Server Processor Trends (Memory)
[Bar chart: On-chip memory trend*. Y-axis: Memory bits (million), 0 to 50. Bars for four processor generations: 64b, 130nm Single Core; Dual core (1st Gen CMT); 8-core (2nd Gen CMT); Next generation CMT. Bar values range from 30 to 45 million bits.]
*Assuming 2-4MB on-chip level-2 cache
• Typically memories >8KB protected with SEC-DED,
2Kb-8KB protected with variants of parity
• Contribution of memories to chip level FIT rate has
been fairly constant over time
34. Server Processor Trends (Flops)
On-chip flop trend
Processor                   Flops per chip (K)
64b, 130nm Single Core      80
Dual core (1st Gen CMT)     200
8-core (2nd Gen CMT)        500
Next generation CMT         1000
• With chip multi-threading (CMT), more pipelines
on a chip, hence more logic
35. Server Processor Trends (Flops)
• Flop soft error FIT is typically 0.001 FIT/bit *
• 30% of flop bit flips contribute to chip failure **
Chip level FIT contribution of flops
Processor                   FIT per chip
64b, 130nm Single Core      24
Dual core (1st Gen CMT)     60
8-core (2nd Gen CMT)        150
Next generation CMT         300
* SELSE II (Workshop on System Effects of Logic Soft Errors)
** Fault injection with architectural trace simulation
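The chip-level figures appear to follow from multiplying the flop count by the two factors quoted on this slide. The sketch below reproduces that arithmetic under two assumptions: each flop counts as one bit, and the flop counts from the flop-trend chart (80K, 200K, 500K and 1000K) map to the four processor generations in increasing order. The names are illustrative only.

```python
FIT_PER_FLOP_BIT = 0.001   # typical flop soft error rate quoted on the slide
FAILURE_FRACTION = 0.30    # share of flop bit flips that cause a chip failure

# Assumed mapping of chart values to generations (flops per chip).
FLOPS_PER_CHIP = {
    "64b, 130nm Single Core": 80_000,
    "Dual core (1st Gen CMT)": 200_000,
    "8-core (2nd Gen CMT)": 500_000,
    "Next generation CMT": 1_000_000,
}


def flop_fit(flops: int) -> float:
    """Chip-level FIT contributed by flops: count * FIT/bit * failure share."""
    return flops * FIT_PER_FLOP_BIT * FAILURE_FRACTION


for name, flops in FLOPS_PER_CHIP.items():
    print(f"{name}: {flop_fit(flops):.0f} FIT")
```

Under these assumptions the results match the 24, 60, 150 and 300 FIT values shown in the chart.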
36. Sensitivity to Processor Flop FIT
[Two plots. Left: Sensitivity to Processor SU Rate. Y-axis: Server MTBSDC (Years), 0 to 120. X-axis: Processor SU (Silent Uncorrected) FIT, 100 to 700. Points are marked at 89 years and 42 years. Right: Sensitivity to Processor RU Rate. Y-axis: Server MTBUSI (Years), 0 to 20. X-axis: Processor RU (Reported Uncorrected) FIT, 100 to 700. Points are marked at 17 years and 14 years.]
• A 150 FIT increase in processor implies:
– 52.8% degradation of MTBSDC
– 17.7% degradation of MTBUSI
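The degradation percentages can be checked against the values marked on the plots (89 and 42 years for MTBSDC, 17 and 14 years for MTBUSI). This is a small sketch with a hypothetical helper name. Note that the rounded plot labels give about 17.6% for MTBUSI, so the slide's 17.7% was presumably computed from unrounded values.

```python
def degradation_percent(before_years: float, after_years: float) -> float:
    """Relative loss in mean time between events, as a percentage."""
    return (before_years - after_years) / before_years * 100.0


# Values as marked on the two sensitivity plots.
print(f"MTBSDC: {degradation_percent(89, 42):.1f}% degradation")
print(f"MTBUSI: {degradation_percent(17, 14):.1f}% degradation")
```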
37. Directions for Solutions to Soft Errors
• Unit level redundancy is too costly in server
space, need cheaper solutions
• Circuit level solutions can be limiting
– Cannot reduce failure rate to 0
– Reporting corrected errors
– CAD, design methodology limitations
• Logic level and architectural techniques more
promising - cost/flexibility/portability
• Just detection is not sufficient – need correction
or recovery too
• Taking advantage of features of CMT processors
38. Conclusions
• Investment in mitigation of soft errors in silicon
should be based on top-down system targets
• Not all soft errors in silicon are equal
• System level impact of silicon soft errors
– Very high on silent data corruption rate
– Medium on unscheduled interruption rate
– Low on availability
• Flop SER significant for some types of servers
• Solutions need to be low overhead – mainframe
level reliability/availability at server price points