Impact of Soft Errors on Server Reliability

Impact of Soft Errors on Reliability
and Availability of Servers in the
Internet Computing Era
Ishwar Parulkar
Sun Microsystems, Inc.

VTS 2006 Slide 1

Outline
• Servers for the Internet Era
• Server Reliability and Availability
– Impact to customers
– Metrics and typical targets
• Soft Errors in Silicon Components
– Classification at system level
– Sensitivity of system metrics
• Chip Soft Error Trends and Solutions
• Conclusions

VTS 2006 Slide 2

Categorizing Internet Era Workloads
[Figure: workload map. Vertical axis: Highly Threaded (top) to Single Threaded (bottom). Horizontal axis: Storage Intensive (left) to Network Intensive (right).]
VTS 2006 Slide 3

[Figure: Technical workloads placed on the workload map (Highly Threaded to Single Threaded, Storage Intensive to Network Intensive).]
Workloads shown:
• Meteorology/Climate Simulation
• Nuclear Simulation/Weapons Modeling
• Seismic Analysis, Reservoir Modeling
• Thermodynamics
• EAI Servers
• Structural Analysis
• Electronic Design Simulation
• Workgroup Compute Grid
• Application Development
• Financial Risk/Portfolio Analysis
• Monte Carlo Simulation
• Genomics, Cheminformatics
VTS 2006 Slide 4

[Figure: Commercial workloads placed on the workload map (Highly Threaded to Single Threaded, Storage Intensive to Network Intensive).]
Workloads shown:
• Data Warehousing
• Proxy Caching
• Data Analysis
• Web Serving
• Streaming Media
• OLTP Database
• Security
• File Server
• ERP (SAP R3)
• Directory
• J2EE Application Servers
• Batch
• EAI Servers
Technical workloads still visible: Nuclear Simulation/Weapons Modeling; Seismic Analysis, Reservoir Modeling; Thermodynamics; Structural Analysis
VTS 2006 Slide 5

[Figure: Commercial workload map with a "Data" label added. Workloads visible: Data Warehousing, Proxy Caching, Data Analysis, Web Serving, Streaming Media, OLTP Database, Security, File Server, Nuclear Simulation/Weapons Modeling, Reservoir Modeling, EAI Servers, Structural Analysis.]
VTS 2006 Slide 6

[Figure: Commercial workload map with a "Web" label added. Workloads visible: Data Warehousing, Proxy Caching, Data Analysis, Web Serving, OLTP Database, Security, File Server, Reservoir Modeling, EAI Servers, Structural Analysis.]
VTS 2006 Slide 7

[Figure: Commercial workload map with a "Compute" label added. Workloads visible: Data Warehousing, Proxy Caching, Data Analysis, Web Serving, OLTP Database, Security, File Server, Reservoir Modeling, EAI Servers, Structural Analysis, Monte Carlo Simulation.]
VTS 2006 Slide 8

[Figure: Commercial workload map with an "Application" label added. Workloads visible: Data Warehousing, Proxy Caching, Data Analysis, Web Serving, OLTP Database, Security, File Server, Reservoir Modeling, EAI Servers, Structural Analysis.]
VTS 2006 Slide 9

[Figure: Commercial workload map with an "HPC" label added. Workloads visible: Data Warehousing, Proxy Caching, Data Analysis, Web Serving, OLTP Database, Security, File Server, Reservoir Modeling, Batch, EAI Servers, Structural Analysis.]
VTS 2006 Slide 10

Optimizing Servers for Workloads

• Three primary server design points
– Data centric servers
– Web centric servers
– Compute centric servers
• Application centric servers leverage design
point of Data and Web centric
• HPC centric servers leverage design point of
Data centric

VTS 2006 Slide 11

Server Reliability and Availability
Basic Concepts
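[Figure: timeline alternating between "System Available" and "System Down", with "Failure Occurs" and "Restart" events marked. MTTF spans a System Available interval (restart to next failure); MTTR spans a System Down interval (failure to restart).]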

• Reliability = MTBF = MTTF + MTTR
• Availability = MTTF/MTBF = 1 - (MTTR/MTBF)
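
The two relations above can be checked numerically. The sketch below is illustrative only: the function names and the sample MTTF/MTTR values are assumptions, not figures from the talk.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / MTBF, with MTBF = MTTF + MTTR."""
    mtbf_hours = mttf_hours + mttr_hours
    return mttf_hours / mtbf_hours


def downtime_minutes_per_year(avail: float) -> float:
    """Expected downtime per year implied by an availability fraction."""
    return (1.0 - avail) * 365 * 24 * 60


# Hypothetical system: fails every 10,000 hours, takes 1 hour to restart.
a = availability(mttf_hours=10_000.0, mttr_hours=1.0)
print(f"availability = {a:.6f}")

# "Five nines" availability corresponds to roughly 5.3 minutes of downtime per year.
print(f"{downtime_minutes_per_year(0.99999):.3f} minutes/year")
```

Both forms on the slide agree: MTTF/MTBF and 1 - (MTTR/MTBF) give the same value because MTBF = MTTF + MTTR.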

VTS 2006 Slide 12

Cost of Unreliability and Unavailability
• eBay outages
– Estimated $3-5 million lost in revenue because
of returned fees and lost business
– $5 billion drop in market capitalization
• Ameritrade, Schwab, E*Trade outages
– Class action lawsuit for intermittent service
• Akamai outage
– Akamai handles 15% of world's Internet traffic
– Google, Yahoo, eBay, etc. affected by this
outage
Note: Not all outages were hardware related
VTS 2006 Slide 13

Customer behavior after an Internet server/site outage

• 9%: No change in behavior
• 24%: Found a new site, used it once
• 53%: Found a new site, continued to use both
• 13%: Stopped using site altogether

Source: Jupiter Communications – Internet Research Firm
VTS 2006 Slide 14

[Figure: bar chart of K$ per hour by business type, log scale from 10 to 10,000]

Business               K$ Per Hour
Brokerage              6,450
Credit Card            2,600
eBay                   225
Amazon                 180
Package Shipping       150
Home Shopping          113
Catalog Sales          90
Airline Reservation    89
Cellular Service       41
On-line Network        25
ATM Service            14

Source: InternetWeek 4/3/2000
VTS 2006 Slide 15

Customer Perspective

• Impacts felt by customers
– Silent data corruption (SDC)
– Unscheduled system interruptions (USI)
– Service or repair rate
– Downtime (or Uptime)
• Metrics
– Mean time between SDC (MTBSDC)
– Mean time between USI (MTBUSI)
– Mean time between repair (MTBR)
– Availability
VTS 2006 Slide 16

Typical Targets

Server Type        MTBSDC            MTBUSI         Availability (%)
Data Centric       100-1000 years    10-25 years    99.999
Web Centric        10-100 years      10-25 years    99.999-99.9999
Compute Centric    100-1000 years    2-10 years     99.990

MTBF in years = 10^9 / (FIT * 24 Hours * 365 Days)
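
The conversion between FIT (failures per 10^9 device-hours) and MTBF in years can be sketched as follows. The function names are illustrative assumptions; the arithmetic is the formula above.

```python
HOURS_PER_YEAR = 24 * 365


def fit_to_mtbf_years(fit: float) -> float:
    """MTBF in years = 10^9 / (FIT * 24 hours * 365 days)."""
    if fit <= 0:
        raise ValueError("FIT must be positive")
    return 1e9 / (fit * HOURS_PER_YEAR)


def mtbf_years_to_fit(years: float) -> float:
    """Inverse conversion: the FIT budget implied by an MTBF target."""
    if years <= 0:
        raise ValueError("MTBF must be positive")
    return 1e9 / (years * HOURS_PER_YEAR)


# A 100-year MTBSDC target allows a total silent-corruption rate of about 1142 FIT.
print(round(mtbf_years_to_fit(100), 1))
# A 10-year MTBUSI target allows about 11416 FIT.
print(round(mtbf_years_to_fit(10), 1))
```

Read this way, the targets in the table are FIT budgets for the whole server, to be shared among all of its components.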

VTS 2006 Slide 17

A Typical Data Centric Server

Component                         Approx. Count    Comments
Processors                        8-64             8-64 way systems
ASICs                             320              Memory controllers, IO bridges, Crypto, etc.
Memory DIMMs                      640              Depends on memory capacity
AC/DC Power Supplies              8-10             Main power supply
DC/DC Power Supplies              640              High and low voltage supplies
Clocking                          64               Clock synthesizers and distribution
Service Processor                 4                Small processors, FPGA
Miscellaneous Small Components    1000-10000       Resistors, Capacitors, Pins, Connectors

VTS 2006 Slide 18

Impact of Silicon Soft Errors on
Servers

• How much do silicon soft errors contribute
to total failures in systems?
• To what degree is each of the system level
metrics impacted by silicon soft errors?
• How much protection is adequate?

VTS 2006 Slide 19

Classification of Silicon Soft Errors

[Figure: Universe of Soft Errors in a Server Chip]

VTS 2006 Slide 20


[Figure: the universe of soft errors split into Corrected (C) and Uncorrected (U).]

VTS 2006 Slide 21


            Corrected    Uncorrected
Silent      SC           SU
Reported    RC           RU


VTS 2006 Slide 22

            Corrected    Uncorrected
Silent      SC           SU: Silent Data Corruption (MTBSDC)
Reported    RC           RU


VTS 2006 Slide 23

            Corrected    Uncorrected
Silent      SC           SU: Silent Data Corruption (MTBSDC)
Reported    RC           RU: System Crash (MTBUSI)

VTS 2006 Slide 24

            Corrected                      Uncorrected
Silent      SC: Customer does not care     SU: Silent Data Corruption
            or need not know               (MTBSDC)
Reported    RC                             RU: System Crash (MTBUSI)

VTS 2006 Slide 25

            Corrected                           Uncorrected
Silent      SC: Customer does not care          SU: Silent Data Corruption
            or need not know                    (MTBSDC)
Reported    RC: Required by Service/Customer    RU: System Crash
            to monitor health                   (MTBUSI)
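
The four-way classification can be written down as a small lookup. This is a minimal sketch: the class, dictionary, and function names are assumptions made for illustration, while the categories and their system-level meaning come from the slides.

```python
from enum import Enum


class SoftErrorClass(Enum):
    SC = "Silent Corrected"
    SU = "Silent Uncorrected"
    RC = "Reported Corrected"
    RU = "Reported Uncorrected"


# System-level meaning of each class, as labeled on the slides.
SYSTEM_IMPACT = {
    SoftErrorClass.SC: "Customer does not care or need not know",
    SoftErrorClass.SU: "Silent data corruption (counts against MTBSDC)",
    SoftErrorClass.RC: "Required by service/customer to monitor health",
    SoftErrorClass.RU: "System crash (counts against MTBUSI)",
}


def classify(reported: bool, corrected: bool) -> SoftErrorClass:
    """Map the two yes/no properties of an error onto SC, SU, RC, or RU."""
    if reported:
        return SoftErrorClass.RC if corrected else SoftErrorClass.RU
    return SoftErrorClass.SC if corrected else SoftErrorClass.SU


# An error that is neither reported nor corrected is the silent-corruption case.
err = classify(reported=False, corrected=False)
print(err.name, "->", SYSTEM_IMPACT[err])
```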

VTS 2006 Slide 26

Silent Data Corruption
[Figure: pie chart of Total Server FIT. A: Without any protection. Slice values: 2%, 18%, 80%. Legend: Memory, Proc. + ASICs, Misc.]

VTS 2006 Slide 27

Silent Data Corruption
[Figure: pie charts of Total Server FIT. A: Without any protection, slice values 2%, 18%, 80%. B: With SEC-DED on Memory, slice values 1%, 9%, 89%. Legend: Memory, Proc. + ASICs, Misc.]
Note: Total FIT in A > Total FIT in B
VTS 2006 Slide 28

Sensitivity to Silicon Soft Errors
(Silent Data Corruption)

[Figure: Sensitivity of Server to Processor SU Rate. x-axis: Processor SU (Silent Uncorrected) FIT, 100 to 700. y-axis: Server MTBSDC (Years), 0 to 120.]
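
One way to see why server MTBSDC falls as processor SU FIT rises is a simple additive model in which component FIT rates sum to a server FIT. The sketch below assumes that model; `n_processors` and `other_su_fit` are hypothetical inputs, so the printed numbers illustrate the shape of the relationship and are not the data behind the plot.

```python
HOURS_PER_YEAR = 24 * 365


def server_mtbsdc_years(proc_su_fit: float, n_processors: int,
                        other_su_fit: float) -> float:
    """Server MTBSDC under an additive (series) FIT model.

    other_su_fit is the combined SU FIT of everything except the processors
    (a hypothetical value, not taken from the slides).
    """
    total_fit = n_processors * proc_su_fit + other_su_fit
    return 1e9 / (total_fit * HOURS_PER_YEAR)


# Sweep the same x-axis range as the plot, for an assumed 8-way server.
for fit in range(100, 701, 100):
    years = server_mtbsdc_years(fit, n_processors=8, other_su_fit=200.0)
    print(f"{fit:4d} FIT -> {years:6.1f} years")
```

Because the per-processor rate is multiplied by the processor count, its share of the total grows quickly on larger systems, which is why MTBSDC is so sensitive to it.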

VTS 2006 Slide 29

Unscheduled System Interruptions
[Figure: pie chart of Total Server FIT. A: Without any Redundancy or Protection. Slice values: 2%, 8%, 20%, 70%. Legend: Power, Memory, Proc. + ASICs, Misc.]

VTS 2006 Slide 30

Unscheduled System Interruptions
[Figure: pie charts of Total Server FIT. A: Without any Redundancy or Protection, slice values 2%, 8%, 20%, 70%. B: With Power Redundancy and SEC-DED on Memory, slice values 12%, 35%, 52%, 1%. Legend: Power, Memory, Proc. + ASICs, Misc.]

Note: Total FIT in A > Total FIT in B
VTS 2006 Slide 31

Sensitivity to Silicon Soft Errors
(Unscheduled System Interruptions)

[Figure: Server Sensitivity to Processor RU Rate. x-axis: Processor RU (Reported Uncorrected) FIT, 100 to 700. y-axis: Server MTBUSI (Years), 0 to 20.]

VTS 2006 Slide 32

Server Processor Trends (Memory)
[Figure: bar chart, "On-chip memory trend*". y-axis: Memory bits (million), 0 to 50. Bars for four processor generations: 64b, 130nm, Single Core; Dual core (1st Gen CMT); 8-core (2nd Gen CMT); Next generation CMT. Bar values: 45, 40, 35, 30.]
*Assuming 2-4MB on-chip level-2 cache

• Typically memories >8KB protected with SEC-DED,
2Kb-8KB protected with variants of parity
• Contribution of memories to chip level FIT rate has
been fairly constant over time
VTS 2006 Slide 33

Server Processor Trends (Flops)
On-chip flop trend

Processor generation        Flops per chip (K)
64b, 130nm, Single Core     80
Dual core (1st Gen CMT)     200
8-core (2nd Gen CMT)        500
Next generation CMT         1000

• With chip multi-threading (CMT), more pipelines
on a chip, hence more logic
VTS 2006 Slide 34

Server Processor Trends (Flops)
• Flop soft error FIT is typically 0.001 FIT/bit *
• 30% of flop bit flips contribute to chip failure **
Chip level FIT contribution of flops

Processor generation        FIT per chip
64b, 130nm, Single Core     24
Dual core (1st Gen CMT)     60
8-core (2nd Gen CMT)        150
Next generation CMT         300

* SELSE II (Workshop on System Effects of Logic Soft Errors)
** Fault injection with architectural trace simulation
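
The chip-level numbers follow from the two bullets: flop count times FIT per bit times the fraction of flips that cause a chip failure. The sketch below reproduces that arithmetic. The flop counts are the values from the on-chip flop trend chart; pairing them with processor generations in ascending order, and the function and constant names, are assumptions.

```python
FIT_PER_FLOP_BIT = 0.001   # typical flop soft error rate, FIT/bit
FAILURE_FRACTION = 0.30    # share of flop bit flips that lead to chip failure

# Flops per chip, paired with generations in ascending order (assumed pairing).
FLOPS_PER_CHIP = {
    "64b, 130nm, Single Core": 80_000,
    "Dual core (1st Gen CMT)": 200_000,
    "8-core (2nd Gen CMT)": 500_000,
    "Next generation CMT": 1_000_000,
}


def flop_fit_per_chip(flop_count: int,
                      fit_per_bit: float = FIT_PER_FLOP_BIT,
                      failure_fraction: float = FAILURE_FRACTION) -> float:
    """Chip-level FIT contributed by flops after derating."""
    return flop_count * fit_per_bit * failure_fraction


for generation, flops in FLOPS_PER_CHIP.items():
    print(f"{generation}: {flop_fit_per_chip(flops):.0f} FIT")
```

The results match the chart values of 24, 60, 150, and 300 FIT per chip.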
VTS 2006 Slide 35

Sensitivity to Processor Flop FIT
[Figure: two plots.
Left: Sensitivity to Processor SU Rate. x-axis: Processor SU (Silent Uncorrected) FIT, 100 to 700. y-axis: Server MTBSDC (Years), 0 to 120. Marked points: 89 years and 42 years.
Right: Sensitivity to Processor RU Rate. x-axis: Processor RU (Reported Uncorrected) FIT, 100 to 700. y-axis: Server MTBUSI (Years), 0 to 20. Marked points: 17 years and 14 years.]

• A 150 FIT increase in processor implies:
– 52.8% degradation of MTBSDC
– 17.7% degradation of MTBUSI
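
The degradation percentages can be recomputed from the values marked on the two plots (89 to 42 years, and 17 to 14 years). This is a small sketch with an assumed helper name. The marked values appear to be rounded, which likely explains the small gap between the computed 17.6% and the stated 17.7%.

```python
def percent_degradation(before_years: float, after_years: float) -> float:
    """Relative loss in mean time between events, as a percentage."""
    return (before_years - after_years) / before_years * 100.0


print(round(percent_degradation(89, 42), 1))  # 52.8 (MTBSDC)
print(round(percent_degradation(17, 14), 1))  # 17.6 (MTBUSI, from rounded plot labels)
```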
VTS 2006 Slide 36

Directions for Solutions to Soft Errors
• Unit level redundancy is too costly in server
space, need cheaper solutions
• Circuit level solutions can be limiting
– Cannot reduce failure rate to 0
– Reporting corrected errors
– CAD, design methodology limitations
• Logic level and architectural techniques more
promising - cost/flexibility/portability
• Just detection is not sufficient – need correction
or recovery too
• Taking advantage of features of CMT processors
VTS 2006 Slide 37

Conclusions
• Investment in mitigation of soft errors in silicon
should be based on top-down system targets
• Not all soft errors in silicon are equal
• System level impact of silicon soft errors
– Very high on silent data corruption rate
– Medium on unscheduled interruption rate
– Low on availability
• Flop SER significant for some types of servers
• Solutions need to be low overhead – mainframe
level reliability/availability at server price points
VTS 2006 Slide 38
