White Paper – System Storage Reliability
Juha Salenius
Storage Systems
As a way of defining the various facets of reliability, let’s start by configuring a hypothetical server
using six internal hot-swap disk drives that can be either Serial Attached SCSI (SAS) or Serial Advanced
Technology Attachment (SATA) disk drives. If these drives were SAS disks set up as Just a
Bunch of Drives (JBOD), a non-RAID configuration, then the MTBF would be all we’d need to define the
computed failure rate. With an individual drive MTBF_SAS of 1,400,000 hours [1], the combined MTBF for six
drives computes to 233,333 hours using the following equation, where N is the number of identical
components (in this case disk drives) and the subscripts are tc = total components and
c = component:
$$\mathrm{MTBF}_{tc} = \frac{\mathrm{MTBF}_{c}}{N}$$
(Special case where all components are the same)
In contrast to the SAS MTBF_tc, six SATA drives, each exhibiting an individual
MTBF_SATA of 1,000,000 hours [2], give a combined MTBF_tc of 166,667 hours in a JBOD
configuration.
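The two JBOD figures above can be reproduced with a minimal sketch. The function name `combined_mtbf` is hypothetical and chosen only for illustration; the calculation assumes identical components where any single failure counts as a failure of the group.

```python
def combined_mtbf(component_mtbf_hours: float, n_components: int) -> float:
    """MTBF of n identical components where any one failure counts.

    Special case from the text: MTBF_tc = MTBF_c / N.
    """
    if n_components < 1:
        raise ValueError("n_components must be at least 1")
    return component_mtbf_hours / n_components


# Six-drive JBOD examples using the per-drive MTBF values quoted in the text.
sas_jbod_hours = combined_mtbf(1_400_000, 6)
sata_jbod_hours = combined_mtbf(1_000_000, 6)

print(f"SAS JBOD:  {sas_jbod_hours:,.0f} hours")
print(f"SATA JBOD: {sata_jbod_hours:,.0f} hours")
```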
RAID Considerations:
Even with the drives configured as a RAID array (levels 0, 1, 5 and 6), the total MTBF for all the drives
remains the same as above because MTBF does not take any redundancy into account. MTBI (Mean Time
Between Interruption) can be used to highlight the difference in uptime that a redundant
configuration provides. In a JBOD configuration the MTBF and the MTBI are the same. Once we
move from a non-redundant system to a system with redundant components, we move from reporting
MTBF to reporting MTBI as the more meaningful term from a system perspective.
Consider the following RAID levels:
RAID 0 will not be considered here because it does not provide any failure protection, though it
does provide higher throughput by striping the data stream across multiple drives, and it is
usually combined with other RAID levels to increase their throughput. Certain RAID levels can be
combined, for example RAID 10, RAID 50 and RAID 60. These configurations combine data
striping across multiple drives with either mirroring or parity. They
increase complexity but improve performance.
RAID 1 mirrors data across two disk drives, which requires doubling the number of data drives.
RAID 5 uses parity to recover from a bad read. The data and parity are written in blocks across
all drives in the array, with the parity distributed evenly among all drives. Because of the added
parity information, a minimum of three disk drives is needed to implement RAID 5.
RAID 6 is a RAID 5 configuration with a second, additional set of parity information.
[1] Adaptec Inc., Storage Advisors Weblog, 11/02/2005.
[2] Ibid.
With RAID 5 and 6, the ratio of data storage to parity storage increases as the number of spindles
increases. A system with six drives has the equivalent of five data drives and one parity
drive for RAID 5, and four data drives and two parity drives for RAID 6. The spindle overhead for RAID 5
with five drives is 20%, and doubling the total number of drives to 10 decreases the overhead to 10%.
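The overhead figures above follow from dividing the parity-equivalent drives by the total drive count. A minimal sketch, assuming one parity-equivalent drive for RAID 5 and two for RAID 6 (the helper name `parity_overhead` is hypothetical):

```python
# Parity-equivalent drives per array for the two parity RAID levels discussed.
PARITY_DRIVES = {"RAID 5": 1, "RAID 6": 2}


def parity_overhead(raid_level: str, total_drives: int) -> float:
    """Fraction of the array's spindles consumed by parity."""
    parity = PARITY_DRIVES[raid_level]
    # At least two data drives plus parity: 3 drives for RAID 5, 4 for RAID 6.
    if total_drives < parity + 2:
        raise ValueError(f"{raid_level} needs at least {parity + 2} drives")
    return parity / total_drives


for level, drives in [("RAID 5", 5), ("RAID 5", 10), ("RAID 5", 6), ("RAID 6", 6)]:
    print(f"{level} with {drives} drives: {parity_overhead(level, drives):.0%} overhead")
```

The same function shows the trade-off in the six-drive example: RAID 6 gives up two of six spindles to parity where RAID 5 gives up one.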
Why use RAID 6 instead of RAID 5? RAID 5 provides protection against a single failed drive. RAID 6
provides protection against two concurrent failures. Once a drive has failed, the array is exposed to
an additional failure only for the time it takes to replace and rebuild the failed drive, the MTTR
(Mean Time To Repair) interval. With RAID 6, the array tolerates one additional failure during this
interval because of the additional parity. If the system has a hot spare drive, the time to repair is
significantly reduced: the rebuild can start immediately, and the failed drive can be replaced during or
after the rebuild. The probability of another hardware failure during the MTTR interval is extremely low.
But there is another disk-related issue that could cause a problem during this MTTR interval: a
hard read error, which is more prevalent in SATA disks.
SAS and SATA Drive Considerations:
Both SAS and SATA drives have well-defined Bit Error Rates (BER). SAS drives are more robust than
SATA drives, exhibiting a BER on the order of one error for every 10^15 bits read [3], which equates to
roughly one error for every 100 terabytes (TB) read (10^15 bits is about 125 TB).
SATA drives are not as robust and exhibit BERs on the order of one error for every 10^14 bits read [4],
or roughly one for every 10 TB (about 12.5 TB).
What does this mean from a system perspective? To illustrate the issue, we’ll start with a SATA disk
array in which a drive has failed due to a hardware problem and which is in the process of rebuilding.
Let’s make some assumptions: the array has 10 drives of 500 GB each, and each drive has a BER of one
error per 10^14 bits read. The following formula determines how many complete reads of the array can
be expected per unrecoverable error:
$$\text{reads per error} = \frac{10^{14}\ \text{bits}}{10 \times 500 \times 10^{9}\ \text{bytes} \times 8\ \text{bits/byte}} = 2.5$$
It’s entirely possible that an array will be rebuilt 2.5 times in its life, and on average one
non-recoverable error can be expected during those 2.5 rebuilds. This scenario only addresses 500 GB
drives, but the industry has moved on and drive sizes have increased to 1 TB and beyond, which makes
this issue more problematic. The larger the drives or the more drives in the array, the more frequently
a non-recoverable read error can occur during a rebuild.
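To see how the 2.5 figure changes with drive size, the following sketch divides the bits read per expected error by the number of bits in the whole array. It assumes decimal units (1 GB = 10^9 bytes) and a BER of one error per 10^14 bits; the function name `full_reads_per_error` is hypothetical.

```python
BITS_PER_BYTE = 8
GB = 10**9  # decimal gigabyte, as used by drive vendors


def full_reads_per_error(bits_per_error: int, n_drives: int, drive_bytes: int) -> float:
    """Expected number of complete array reads per unrecoverable read error."""
    array_bits = n_drives * drive_bytes * BITS_PER_BYTE
    return bits_per_error / array_bits


SATA_BITS_PER_ERROR = 10**14

# The text's 10-drive example, then the same array built from larger drives.
for size_gb in (500, 1000, 2000):
    reads = full_reads_per_error(SATA_BITS_PER_ERROR, 10, size_gb * GB)
    print(f"10 x {size_gb} GB drives: {reads:.3f} full reads per expected error")
```

Once the result drops below 1.0, an unrecoverable error is expected, on average, before a single complete pass over the array finishes.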
Combining Optimal RAID and Hard Drive Choices (probability) – the math:
A concern with SATA disk technology is the Unrecoverable Read Error (URE) rate, which is currently
specified at one error per 10^14 bits read. A URE every 10^14 bits equates to an error every 2.4E10
sectors, assuming 512-byte sectors. This becomes critical as drive sizes increase. When a drive fails
in a 7-drive RAID 5 array made up of 2 TB SATA disks, the 6 remaining good 2 TB drives have to be read
completely to recover the missing data. As the RAID controller reconstructs the data, it is very
likely to encounter a URE in the remaining media. At that point the RAID reconstruction stops.
[3] Ibid.
[4] Ibid.
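Here’s the math, taking the chance of a URE on any one sector as 1 in 2.4E10 and reading ~2.3E10 sectors:
$$P(\text{URE during rebuild}) = 1 - \left(1 - \frac{1}{2.4 \times 10^{10}}\right)^{2.3 \times 10^{10}} \approx 0.62$$
There is a 62% chance of data loss due to an uncorrectable read error on a 7-drive (2 TB each) RAID 5
array with one failed disk, assuming a read error rate of one per 10^14 bits and ~23 billion sectors in
12 TB. Feeling lucky? [5]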
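The 62% figure can be checked with a short sketch. It assumes 512-byte sectors, decimal terabytes, and independent read errors at a fixed rate of one per 10^14 bits; the function name `p_ure_during_rebuild` is hypothetical.

```python
import math

SECTOR_BYTES = 512  # assumed sector size
TB = 10**12         # decimal terabyte


def p_ure_during_rebuild(surviving_drives: int, drive_bytes: int,
                         bits_per_error: int = 10**14) -> float:
    """Probability of at least one unrecoverable read error while reading
    every sector of the surviving drives once."""
    sectors_to_read = surviving_drives * drive_bytes / SECTOR_BYTES
    p_sector_error = SECTOR_BYTES * 8 / bits_per_error
    # 1 - (1 - p)^n, computed with log1p/expm1 to avoid precision loss for tiny p.
    return -math.expm1(sectors_to_read * math.log1p(-p_sector_error))


# 7-drive RAID 5 array of 2 TB disks with one failed disk: 6 drives are read.
p = p_ure_during_rebuild(surviving_drives=6, drive_bytes=2 * TB)
print(f"Chance of a URE during rebuild: {p:.0%}")

# Same array with the one-per-10^15 rate quoted for SAS drives.
p_sas = p_ure_during_rebuild(6, 2 * TB, bits_per_error=10**15)
print(f"With a 10^15 error rate: {p_sas:.0%}")
```

The comparison suggests why the drive's error rate matters as much as the RAID level once drive capacities reach multiple terabytes.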
RAID 6 is a technique that can be used to mitigate this failure during the rebuild cycle. This is important
because it allows the system to recover either from two disk failures, or from one disk failure followed
by a single hard read error on a surviving disk in the array during the rebuild.
With customers looking to reduce system cost using SATA technology, the additional overhead for RAID
6 parity is becoming acceptable. But there are drawbacks to using RAID 6, which include longer write
times due to the additional time required to generate the RAID 5 parity and then the RAID
6 parity. When an error occurs during a read, both RAID 5 and RAID 6 arrays suffer reduced read
throughput while the data is recovered from parity.
As we mentioned in the beginning of this article, we wanted to constrain this discussion to defining RAS
and addressing increased reliability with SATA disks in a RAID environment. But there are other areas
that should be addressed at the system level that also affect disk drive performance. One such area is
rotational vibration. This issue is a systemic problem in rack mount systems due to the critical thermal
constraints in 1U and 2U chassis in a NEBS environment. Rotational vibration effects are mitigated in
our mechanical designs and the techniques used are covered in a separate document.
[5] Robin Harris, "Does RAID 6 stop working in 2019?", Saturday, 27 February, 2010
(http://storagemojo.com/2010/02/27/does-raid-6-stops-working-in-2019/)
Reliability – MTTF and MTBF (Mean Time To Failure and Mean Time Between Failure)
With so much data exposed to catastrophic failure, a risk exacerbated in the cloud computing environment,
it’s important to maintain data integrity, especially in the medical, telecommunications and military
markets. Systems designed for these markets must address three key areas: Reliability, Availability and
Serviceability (RAS). System uptime must be maximized and mission-critical data must be maintained.
The term Mean Time To Failure (MTTF) is an estimate of the average, or mean, time until the first
failure of a design or component (external failures may be excluded), or until a disruption occurs in the
operation of the product, process, procedure, or design. A failure assumes that the product
cannot be repaired and cannot resume any of its normal operations without being taken out of service.
MTTF is similar to Mean Time Between Failure (MTBF), though MTBF is typically slightly longer
than MTTF because MTBF is the average time between failures including the average repair time, which
is known as MTTR (Mean Time To Repair). [6]
What is Reliability? Per the Six Sigma SPC (Statistical Process Control) Quality Control Dictionary,
reliability is the probability that a given design or process will execute within the anticipated operational
or design margin for a specified period of time. In addition, the system will work under defined
operating conditions with a minimum amount of stoppage due to a design or process error. Some
indicators of reliability are MTBF (Mean Time Between Failures) computations, ALT (Accelerated Life
Test using temperature chambers), MTTF (Mean Time To Failure) computations, and Chi-Square [7]
(the statistical difference between observed and expected results).
MTBF is a calculated indication of reliability. From a system perspective, any reliable assembly must
satisfactorily perform its intended function under defined circumstances that may not be part of
the MTBF calculation’s environment, such as operation in varying ambient
temperatures. MTBF addresses reliability in a very controlled and limited scope. Traditionally, MTBF
calculations are based on the Telcordia Technologies Special Report SR-332, Issue 1, Reliability
Prediction Procedure for Electronic Equipment. The results of these calculations can be used to
roughly assist customers in evaluating individual products, but should not be used as a
representation or guarantee of the reliability or performance of the product. MTBF is only a gross
representation of how reliable a system will be under clearly defined conditions, which are not
real-world conditions.
If the MTBF calculation cannot tell us when components will wear out in the real world or which
product is better than another, and it does not provide a reliable metric for field failures, then why
use it? It allows us to determine how often a system will fail under steady-state environmental
conditions. Early in the design cycle, component MTBF can be used to determine which parts will fail
first, enabling engineering to improve the design robustness by selecting more robust components or
designing with hardened assemblies. There are three methods used in MTBF calculations: 1. the black
box, 2. the black box plus laboratory inputs, and 3. the black box plus field failure inputs. While the
industry traditionally uses Method 1, Kontron Inc. CBPU/CRMS uses a combination of all three methods:
black box with lab inputs, coupled with field data where available. For large aggregated component
assemblies, such as computer baseboards, there are typically vendor-calculated MTBF figures; for
passive components, there is industry-standard failure rate data; and for proprietary components, lab
or field data is available.
[6] Paraphrasing the Six Sigma SPC's Quality Control Dictionary and Glossary,
http://www.sixsigmaspc.com/dictionary/glossary.html
[7] Ibid.
Availability – MTBI (Mean Time Between Interruption)
If MTTF addresses failures counted from initial power-on, and MTBF addresses failures counted from a
previous failure including the repair time (MTTR), what is meant by Mean Time Between
Interruption (MTBI)? MTBI addresses designs that provide redundancy, where the failure of a
redundant component does not halt (fail) the system. The system may not run at full speed during
the time it takes to replace or rebuild the failed component, but it will run. MTBI durations are
much longer than MTTR and MTBF intervals, which is better, and they can span multiple component
failures (as with RAID 6), each followed by replacement of the failed redundant component.
Serviceability – MTTR (Mean Time To Repair)
This term refers to how quickly and easily a system can be repaired after a failure (an MTBF or
MTTF event) or an interruption (an MTBI event).
One measure of availability is what’s touted as the Five Nines. As we’ve seen, MTBI and MTTR are
tightly coupled. There is a significant amount of marketing literature promoting Five Nines availability
for systems designed for critical environments. But what is meant by Five Nines? This particular metric
is an artifact of the monolithic telecommunications industry, from the time when the incumbent carriers
exercised complete control of the equipment installed in their central offices. Five Nines availability
was, and in many cases remains, a requirement of Telco-grade Service Level Agreements (SLAs). It
defines the ratio of system uptime (MTBI) to total time (uptime plus unplanned downtime, MTTR), not
counting scheduled maintenance, planned updates, reboots, etc. Five Nines availability means an uptime
of 99.999% per year or, expressed conversely, about five minutes and fifteen seconds of unplanned
downtime per year. This is close to, though not the same as, the six sigma level of 99.99966% process
capability. With downtime measured in minutes, it is vitally important that the system serviceability
duration is minimized and that any spare parts are available locally, e.g., hot spares for disk drive
arrays.
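The relationship between availability, MTBI and MTTR can be sketched as follows. The sketch assumes availability is uptime divided by uptime plus unplanned downtime, and a 365-day year; all function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def availability(mtbi_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: uptime / (uptime + downtime)."""
    return mtbi_hours / (mtbi_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Unplanned downtime per year implied by an availability figure."""
    return (1.0 - avail) * MINUTES_PER_YEAR


def max_mttr_hours(mtbi_hours: float, target_avail: float) -> float:
    """Longest repair time that still meets the target availability."""
    return mtbi_hours * (1.0 - target_avail) / target_avail


five_nines = 0.99999
print(f"Five Nines allows {downtime_minutes_per_year(five_nines):.3f} minutes of downtime per year")

# Illustrative only: a system with an assumed MTBI of 100,000 hours.
print(f"Max MTTR for Five Nines: {max_mttr_hours(100_000, five_nines):.2f} hours")
```

Under these assumptions Five Nines works out to about 5.26 minutes per year, which is why locally available spares and short repair times matter so much.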
With Five Nines reflecting the system elements and not the network, we can easily compute the
network-level availability. For example, if two non-redundant network elements in series each have
99.999% availability, the total availability of the network is 0.99999 × 0.99999 = 0.99998, or 99.998%,
which is Four Nines availability. Notice that with redundant components in a system we use MTBI, not
MTBF, as the measure of the interval between system-level failures. By providing redundancy for all
high-powered and rotating components we increase the time the system takes to fail (MTBI) but reduce
the MTTF/MTBF because there are more components to fail.
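The serial network calculation above generalizes to any number of elements by multiplying their availabilities. A minimal sketch, assuming the elements fail independently (the function names are hypothetical):

```python
import math


def series_availability(availabilities: list[float]) -> float:
    """Availability of non-redundant elements in series: all must be up."""
    return math.prod(availabilities)


def count_nines(avail: float) -> int:
    """Number of leading nines in an availability figure, e.g. 0.99998 -> 4."""
    # The small tolerance guards against floating-point error in 1.0 - avail.
    return int(math.floor(-math.log10(1.0 - avail) + 1e-9))


two_elements = series_availability([0.99999, 0.99999])
print(f"Two elements in series: {two_elements:.7f} ({count_nines(two_elements)} nines)")

ten_elements = series_availability([0.99999] * 10)
print(f"Ten elements in series: {ten_elements:.7f} ({count_nines(ten_elements)} nines)")
```

Each added serial element lowers the end-to-end figure, so a network built from Five Nines elements does not itself deliver Five Nines.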
Computations
When evaluating a non-redundant system, the subsystems’ MTBF numbers can be viewed as a series
sequence in which the failure of any single component or assembly causes a system failure. The total
calculated MTBF will be less than the lowest individual component MTBF, as illustrated in the following
formula, where MTBF_ci is the MTBF of the i-th of N components:
$$\mathrm{MTBF}_{tc} = \frac{1}{\sum_{i=1}^{N} \frac{1}{\mathrm{MTBF}_{c_i}}}$$
(Standard case where all components aren’t the same)
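The series case can be sketched by summing the component failure rates (the reciprocals of the MTBFs) and inverting the total. This assumes constant, independent failure rates; the function name `series_mtbf` is hypothetical, and the pairing of a fan group with a single power supply below is only an illustration using MTBF values quoted elsewhere in this paper.

```python
def series_mtbf(component_mtbfs_hours: list[float]) -> float:
    """MTBF of a non-redundant series system with dissimilar components.

    Failure rates (1 / MTBF) add, so the system MTBF is the reciprocal
    of the summed rates.
    """
    if not component_mtbfs_hours:
        raise ValueError("at least one component is required")
    total_failure_rate = sum(1.0 / mtbf for mtbf in component_mtbfs_hours)
    return 1.0 / total_failure_rate


# Illustration: a non-redundant fan group and a single power supply in series.
fan_group_hours = 261_669
power_supply_hours = 125_000
system_hours = series_mtbf([fan_group_hours, power_supply_hours])
print(f"Series MTBF: {system_hours:,.0f} hours")

# With identical components the result reduces to MTBF_c / N.
print(f"Six identical drives: {series_mtbf([1_400_000] * 6):,.0f} hours")
```

The result is always lower than the weakest component's MTBF, which is the point the text makes about non-redundant systems.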
When we add redundant assemblies to the system, these combined components are measured as a
single block and the system-level result is no longer MTBF but rather MTBI; the system keeps working
even with a failed redundant component. For example, in a system with no redundant fans, the MTBF
for the fan group may be 261,669 hours. After we add redundant fans the MTBI is 3,370,238,148 hours,
even though the MTBF is reduced because of the added fans. Because this MTBI is such a large number,
the fan group is virtually eliminated from the equation for system MTBI. We add redundant
components to increase the MTBI of the grouped components so that they no longer adversely affect the
system-level MTBI. The system’s exposure to single points of failure is reduced
by taking the assemblies that traditionally exhibit high failure rates, i.e., any component
assemblies that move, rotate or work at the edge of their thermal or electrical envelope, and designing
the system in such a way that these assemblies are redundant.
Power supplies are also items that fail because they usually work at the higher end of their
components’ thermal and electrical limits. By adding redundant power supplies, the MTBF can go from
125,000 hours for a single supply to an MTBI of 326,041,999 hours for a redundant pair. As in the earlier
example with the fans, this is a substantial change and will have a major positive impact on the system
MTBI.
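The method used to derive the redundant-pair figure above is not stated here. One widely used approximation for a repairable pair in which either unit can carry the load is MTBF² / (2 × MTTR). The sketch below applies it with an assumed 24-hour repair time, which is an illustrative assumption and not a value from this paper; the function name `redundant_pair_mtbi` is hypothetical.

```python
def redundant_pair_mtbi(unit_mtbf_hours: float, mttr_hours: float) -> float:
    """Approximate MTBI of a 1-of-2 redundant, repairable pair.

    Uses the common approximation MTBF^2 / (2 * MTTR), valid when the
    repair time is much shorter than the unit MTBF.
    """
    if mttr_hours <= 0:
        raise ValueError("mttr_hours must be positive")
    return unit_mtbf_hours ** 2 / (2.0 * mttr_hours)


single_supply_hours = 125_000
assumed_mttr_hours = 24.0  # assumption for illustration only

pair_hours = redundant_pair_mtbi(single_supply_hours, assumed_mttr_hours)
print(f"Single supply MTBF:  {single_supply_hours:,} hours")
print(f"Redundant pair MTBI: {pair_hours:,.0f} hours (approximation)")
```

With these inputs the approximation lands in the same range as the 326,041,999 hours quoted for the redundant pair, and it shows how strongly the result depends on keeping the repair time short.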