2012 RAMS Investment in Reliability Program versus Return – How to Decide

Investment in reliability program versus return – how to decide
Fred Schenkelberg, Ops A La Carte, LLC

Key Words: Reliability, program, planning, investment, ROI

© IEEE 2012 – Annual Reliability and Maintainability Symposium

SUMMARY & CONCLUSIONS

Selecting the right tool, or the right investment, for a specific reliability task is often left to the judgment of the reliability professional. With experience these choices become simpler, yet in many cases the task can be daunting. By examining the decision process we explore a means to determine the most cost-effective reliability activities for specific situations.

Not all reliability tools provide useful information or timely results in every situation, so how does one choose the best activities for a given situation? After conducting over 100 reliability program assessments and working with dozens of design teams to build effective reliability programs, the author lays out a means to trade off the costs and benefits when selecting reliability activities.

Given the constraints and the objectives, there is a best set of tools to employ during the development process to produce a reliable product. This paper explores the cost/benefit equation in three different cases: high cost/low volume, low cost/high volume, and brand new technology product development. Considerations include risk, models, processes, and technology, along with customer or market expectations. Another significant consideration is the reliability maturity of the organization.

There isn't a single set of tools or activities that will always produce a reliable product in a cost-effective manner. Carefully considering the current situation and capabilities permits the team to select the right tools to make significant progress toward a reliable product.

1 INTRODUCTION

Many reliability activities are naturally part of design engineering. Adding the weight of a radio to an aircraft is a tradeoff between the value of the communication function and the cost of lifting the additional weight. Another tradeoff considered is between the value of the communication function and the cost of maintenance or repair. The repair cost is in part related to the reliability of the equipment.

Seasonal consumer products may have an emphasis on time to market and the cost of lost sales. Medical products may have an emphasis on product safety and the cost of potential product liability. For each product development team the ability to quantify the cost of unreliability is important in order to balance the appropriate investment in achieving reliability objectives.

Each task related to the development of a product has a cost in resources, time, or materials. Focusing on tasks that effectively and efficiently move the design toward a market-acceptable product drives the overall program’s success. One aspect is the product’s reliability. Engineers and program managers can quickly estimate the cost of specific tasks, like Highly Accelerated Life Test (HALT) or Accelerated Life Test (ALT). Yet, the value returned to the program is not as clear.

This paper explores a few examples and situations to illustrate how to determine the Return On Investment (ROI). Your actual situation is different and the resulting ROI will also be different. The assessment of which tasks add the most value provides a guide to building an overall reliability plan.

In each of the three examples explored there are numerous ‘facts’ stated that would be known by the design team. These are simply stated as facts to build information about the situation. Each case is built from an actual situation experienced in my work. Also, there are many assumptions made and stated as assumptions. In practice we do not have all the information or facts; assumptions permit the calculation to continue, and stating them clearly permits the team to challenge, understand, and improve the calculations.

2 HALT AND TIME TO MARKET

Consider the development of a new game controller. It is a high-volume product with the majority of sales expected immediately after product launch during the holiday sales period. It is a new design with a time-to-market emphasis, the majority of product is manufactured prior to the start of sales, there are no repairs, and the controller is an enabling part of a larger system. The controller’s reliability goal is 98% reliable over the first year of ownership when used as part of the game system.

2.1 HALT vs ALT Discussion

One of the basic questions facing the team is, “Will the product meet the 98% reliability goal?” An ALT may help answer this question if we know which failure mechanism(s) will lead to failure during the first year [1]. This is a new product without any field history. Other controllers designed for this environment have experienced a range of failure causes and are often dominated by shock and vibration damage from dropping.

The design team’s risk analysis strongly suspects that drop damage will be the most significant contributor to product failures. The new controller is different enough that
using the field data is likely to not apply. Also, it is unknown which specific element of the design would experience failure first, or at all, over one year of use. Therefore, discovering the failure mechanisms most likely to occur is important.

The initial project plan did not include HALT on the first set of prototypes. Rather, it would sample from the second set of prototypes, 8 weeks later, just before the transfer of the design to manufacturing, to conduct design verification testing (DVT), including life testing. The drop testing portion of the DVT is expected to take a week to accomplish.

The reliability engineer on this program recommends performing HALT on the first available prototypes, using high loads of random vibration and high shock loads in the HALT plan to quickly assess design weaknesses related to product drop damage. The project manager requests more information on timing, cost, and benefits (value).

2.2 HALT Cost

There isn’t time to procure a HALT chamber within the development schedule; therefore let’s collect quotes from HALT labs to conduct the testing. Let’s assume a quote of $10k for one round of testing [2]. Of course, if HALT facilities were available internally this cost would be less.

Also consider that the first-round prototypes are about 5 times more expensive than second-round prototype units. The first round of prototypes is a small run with specialized tooling and quick-turn production, costing approximately $1k for each unit. We are requesting five units, each costing 5 times as much as a later prototype, which is an $800 price increase per unit, or $4k in total.

Rounding out the expected costs with engineering support, testing equipment support, and failure analysis support, we estimate an additional cost of approximately $10k. Therefore, the total cost to the program to add HALT testing is approximately $24k.

2.3 HALT Value

One of the primary benefits of HALT is the potential uncovering of new failure mechanisms in the design [3]. By conducting the HALT on the first available prototypes the design team increases the time available to resolve design errors or make design improvements. Designers tend to design away from failures; HALT is a tool to discover previously unknown (or unsuspected) failure mechanisms.

Let’s assume (for the purpose of this example) the design prior to any testing has a 25% chance of a failure mechanism that will lead to an unacceptably high first-year failure rate. In discussions with the program manager we learn that they would delay the start of production if there were a 10% or higher expected field failure rate. And, the cost of the delay was estimated at $500k per day in lost sales. With an assumed 30 days to design and implement an improvement to resolve a major reliability issue, the delay would cost the program $500k/day for 30 days, or $15 million.

There is a good chance the design is fine and will meet the reliability objectives. Let’s assume 75% of the time the design has an overall failure rate of less than 10% over the first year. 25% of the time, the underlying design has at least one major failure mechanism that may be detected and resolved prior to the start of sales.

Also, consider that no testing program will uncover all faults, yet let’s assume that only 10% of the time will both HALT and DVT fail to find a major (>10% failure rate) issue. Also, HALT may not find the issue while DVT does detect the fault, let’s say 50% of the time. And let’s assume HALT finds the fault only 40% of the time. Note: this low rate is a pessimistic estimate of the ability of a well-executed HALT; in my experience HALT is much more effective.

For the value calculation, a 25% chance that an unacceptable failure rate exists in the design, times a 40% chance of HALT finding the issue, times the cost avoided by having time to solve the issue without a 30-day program delay, results in an expected savings of 0.25 x 0.40 x $15m = $1.5 million.

Figure 1 HALT Value Calculation

2.4 HALT ROI

The ROI is the ratio of the expected return over the cost. $1.5 million divided by $24k results in an ROI of over 60.

This is only part of the value, as it only considers the detection of major issues and the schedule slip thus avoided. The HALT will also find less significant issues that wouldn’t have resulted in a schedule slip, yet the earlier detection would reduce the cost of implementing design changes. Plus, HALT may find unique failure mechanisms beyond what the DVT would find, thus leading to an incremental reduction in the achieved field failure rate.

3 ALT AND MARKET SHARE

A design team working on a medical device understands that the market share is related to the product reliability. The current product performs adequately yet has the highest field failure rate of similar products. Customers complain about the poor reliability and the market share reflects their comparative reliability ranking. The product with the highest market share is also the most reliable.

The design challenge is to create a product that is more reliable than the competition at about the same price point and, if possible, with improved functionality. The early concepts all include a novel design using a sealing material of unproven reliability. The uncertainty suggests the implementation of an accelerated life test to estimate the expected product

reliability.

Achieving the higher reliability is expected to result in more than tripling the market share in the first year. This would cause sales of the $3k/unit product to jump from 10k units per year now to approximately 30k per year, an additional $60m in revenue. Furthermore, the increase in sales would require more than doubling the manufacturing capacity at a cost of $5 million. The decision to increase the manufacturing capacity is dependent on the estimated product reliability. In order to have the capacity available to meet the expected demand, the decision has to be made and the $5 million committed prior to the start of production.

3.1 Reliability Goal and ALT Discussion

The current product achieves 90% reliability over two years. The best competitive product is estimated to achieve 98% reliability over the same period. The goal for the new design is 99% reliability or better over two years. This is a major goal and simply conducting an ALT is not going to achieve the result. Yet, a key element is understanding whether the goal has been achieved or not. The $5 million investment in manufacturing depends on knowing if the design will or will not meet the goal.

ALT in this case can answer the question as it’s focused on the expected dominant failure mechanism [4]. The failure mechanism and the stresses are all known. The new design using novel material does leave uncertainty around how the design will actually perform. A well-designed ALT has the capability to ascertain the expected reliability performance.

3.2 ALT Cost

ALT is often an expensive test to conduct. The test design, samples, product operation jigs (robots, actuators, software, etc.), monitoring equipment, and failure analysis all add to the cost. Let’s assume the total test planning and setup cost is $50k.

The high reliability to demonstrate will require a significant number of samples. The following formula [5] provides a rough estimate of the number of samples needed for a test to demonstrate 99% reliability with 90% confidence assuming no failures of any tested samples.

\[ n = \frac{\ln(1 - C)}{\ln(R)} = \frac{\ln(1 - 0.9)}{\ln(0.99)} \cong 230 \tag{1} \]

Where,
n is the sample size
C is the statistical confidence
R is the reliability

The 230 sample number is based on a success testing approach assuming the failure mechanism and associated stress are well understood. Reducing the sample size with the use of degradation testing, or some other method, may increase the testing complexity yet result in lower overall costs.

The cost of the subsystem that holds the seal is $200 each. 230 x $200 gives an estimated sample cost of $46k. Therefore, the total cost of the ALT is approximately $96k.

3.3 ALT Value

In this situation the test provides a binary result. The population either does or does not achieve at least 99% reliability. Keeping in mind the ALT is run with a sample to represent the population, there is some uncertainty about the results. Statistical error may lead to four outcomes as shown in Table 1 [6]. Assuming the test design uses 90% confidence and has 90% power, we have a 10% chance of thinking the reliability is less than it actually is, and not investing in added manufacturing capacity (a lost opportunity for increased sales). And, we have a 10% chance of thinking the reliability is better than 99% when it is not, thus investing in added capacity when demand will not materialize.

                The unknown actual reliability is less than 99%
Test Result     Is TRUE          Is FALSE
R >= 99%        Type I error     Correct
R < 99%         Correct          Type II error

Table 1 Statistical Errors

Going into the ALT we have a 50/50 chance that the new material and design will meet the 99% reliability goal. Combining that with the uncertainty of statistical error and a $5m decision, we can calculate the value of the test.

Figure 2 ALT Value Calculations

3.4 ALT ROI

The ROI is the ratio of the expected return over the cost. $4m over $96k results in an ROI of over 41.

Of course, the $5m decision isn’t the only factor in the value of the ALT. It also provides a baseline for further testing (test cost savings), and it may provide information on the amount of margin the design has over the goal and permit further design enhancements. It also confirms the change in reliability, permitting proactive changes in warranty accruals and in service and repair operations.
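The arithmetic behind the HALT example (Section 2) and the ALT example (Section 3) can be collected in a short sketch. This is only an illustration under the assumptions stated in the text; the variable and function names are hypothetical, and the $4m ALT value is taken directly from the text rather than recomputed, since the Figure 2 calculation is not reproduced here.

```python
import math

# --- HALT example (Section 2): every figure is an assumption from the text ---
halt_cost = 10_000 + 5 * 800 + 10_000   # lab quote + prototype premium + support
delay_cost = 500_000 * 30               # $500k/day in lost sales for 30 days
p_major_issue = 0.25                    # chance the design has a major flaw
p_halt_finds_it = 0.40                  # chance HALT detects that flaw
halt_value = p_major_issue * p_halt_finds_it * delay_cost
halt_roi = halt_value / halt_cost


# --- ALT example (Section 3) ---
def success_run_sample_size(reliability, confidence):
    """Zero-failure sample size from equation (1), rounded up to whole units."""
    return math.ceil(math.log(1 - confidence) / math.log(reliability))


n = success_run_sample_size(reliability=0.99, confidence=0.90)
alt_cost = 50_000 + n * 200             # planning/setup + samples at $200 each
alt_value = 4_000_000                   # value as stated in Section 3.4
alt_roi = alt_value / alt_cost

print(f"HALT: cost ${halt_cost:,}, value ${halt_value:,.0f}, ROI {halt_roi:.1f}")
print(f"ALT:  n = {n}, cost ${alt_cost:,}, ROI {alt_roi:.1f}")
```

Keeping each assumption as a named value makes it easy for the team to challenge a single input, for example the 40% HALT detection rate, and see how the ROI moves.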

4 DERATING AND FIELD FAILURE RATE

The specialized test and measurement industry creates very complex electronic equipment. These are expensive tools with total production of maybe 50 per year over a four-year period. And, like other high cost/low volume products, the cost of failure is very high.

Because the unit costs are very high, the ability to test sufficient numbers of units to failure, or at all, is severely limited. It is not uncommon to have only one or two units for all qualification testing. Furthermore, the complexity of the units provides multiple possible failure mechanisms, and only rarely does the design provide a clear dominant failure mechanism on which to focus reliability evaluations.

Given the barriers to conducting physical testing, the reliability team recommends implementing detailed derating analysis for the selection of every electronic component.

The design team does use some derating concepts, yet only based on a 50% guideline and without detailed analysis. Therefore, the project manager has requested more information about the process, costs, and value.

4.1 Derating and Field Failures Discussion

Derating is the selection of components that have ratings (power, voltage, etc.) above the expected stress [. Selecting a capacitor with a voltage rating of 10 volts to bridge a 5 volt potential would be considered a 50% derating.

Selecting components whose rating matches the expected stress generally leads to premature failure of the components. The ratings vendors provide only imply the component can experience the stress at the rated value for a very short time. Derating provides a margin to minimize the accumulation of damage or the chance exposure to a stress high enough to cause a failure. The same concept is applied in mechanical designs using safety margins.

At Hewlett-Packard, a study of the effects of various design for reliability tools found a very high correlation between well-executed derating programs and low field failure rates. This contributed to the 50% fewer field failures experienced [9]. One particular division, where the design team embarked on a full implementation of derating on all products, realized a 50% reduction in field failures in the first year, and continued to realize reduced failure rates over subsequent years as more fully derated product designs shipped.

4.2 Derating Cost

Higher rated components cost more and are generally larger in size. Assume the current bill of material cost is $100k per unit and that, with the implementation of detailed and thorough derating, it doubles to $200k. For a production run of 50 units, the cost increases by $5m.

The additional engineering time for training, circuit analysis, and procurement may add an additional $1m to the project cost. The total cost is an additional $6m to the program.

4.3 Derating Value

The primary value of component derating is that the increased circuit robustness of the product leads to fewer field failures [7]. A field failure is expensive, due to the replacement cost, failure analysis, and possible redesign and qualification costs. Let’s assume that each field failure has an average cost of $2m, or four times the sales price.

Reducing a 10% annual failure rate (a low estimate for such complex products) to 5% would result in 2.5 fewer $2m failures per year (at 50 units per year), for an annual savings of $5m.

4.4 Derating ROI

The ROI is the ratio of the expected return over the cost. With a cost of $6 million and a return of only $5m, the ROI is less than one at 0.83.

If the starting failure rate or cost of failure is low then this ROI may not exceed the breakeven point. Also, consider the market and competition impact. If the high failure rate caused a loss of market share, that may further increase the cost of failure. Currently, implementing derating does not make sense in this situation.

5 RELIABILITY MATURITY CONSIDERATIONS

Organizations have different capabilities and approaches to reliability. In some, product reliability is not considered and the product performance is fairly random and unpredictable. Other organizations do considerable testing and use a wide range of tools to improve reliability, yet the testing and tools are generally applied in response to customer complaints and field failures. And a few organizations are proactive in the selection of high value reliability design activities [8].

The base culture, reactive or proactive with respect to reliability, suggests different routes to making reliability improvements. Less mature organizations may require training and maybe a pilot program to build acceptance of the proposed changes. More mature organizations may not find additional tools with significant ROIs, yet may understand and be able to calculate the impact of reliability improvements on market share or customer satisfaction.

In less mature organizations the calculations for cost and benefit may be more difficult and rely on more assumptions. That is not a reason for not doing the calculations. State the assumptions and start the discussion to find better information for the assessment.

In more mature organizations, while the calculations may be easier to accomplish given the better understanding of costs and benefits, the direct ROIs are likely to be smaller. These organizations also understand the value of customer satisfaction and of avoiding the costs associated with reactive engineering in response to field problems. Mature organizations do not have 25% of their design engineering resources responding to field failures.
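Looking back at the derating example of Section 4, the cost, value, and ROI can be sketched in the same way. All inputs are the assumptions stated in Sections 4.2 and 4.3; the names are hypothetical, and the breakeven helper is an illustrative addition rather than a calculation made in the text.

```python
# Derating example (Section 4); all inputs are the paper's stated assumptions.
units_per_year = 50
bom_increase_per_unit = 200_000 - 100_000    # derated BOM minus current BOM
engineering_cost = 1_000_000                 # training, analysis, procurement
derating_cost = units_per_year * bom_increase_per_unit + engineering_cost

cost_per_field_failure = 2_000_000
failures_avoided = units_per_year * (0.10 - 0.05)   # 10% -> 5% annual rate
derating_value = failures_avoided * cost_per_field_failure
derating_roi = derating_value / derating_cost


def breakeven_failure_cost(total_cost, failures_avoided_per_year):
    """Cost per field failure at which the first-year ROI would equal 1."""
    return total_cost / failures_avoided_per_year


breakeven = breakeven_failure_cost(derating_cost, failures_avoided)

print(f"cost ${derating_cost:,}, value ${derating_value:,.0f}, ROI {derating_roi:.2f}")
print(f"breakeven cost per failure: ${breakeven:,.0f}")
```

Under these assumptions the ROI only reaches one if the average cost of a field failure is higher than the assumed $2m, which is one way to frame the market share consideration raised in Section 4.4.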

6 CONCLUSIONS

The decision to add a reliability-specific task generally adds cost to the development program. The costs are typically easily calculated by summing engineering time, material costs, added samples, added time, and other direct costs. On the other hand, the benefit is more difficult to calculate. The benefits may include estimated reductions in field failure rates, estimated reductions in risk, or expected discovery rates of serious field failure issues during the early design phase.

The calculation of value before the value is realized is difficult and often based on a series of assumptions. Stating the assumptions and showing the calculations permits the team to understand the calculations and check the assumptions. Having an estimated value provides a quantitative means to determine the return on investment. The ROI value provides a means to determine the relative value of any investment, thus permitting the comparison of all the investment decisions made during a development project.

Without the quantitative value calculation the team relies on the anecdotal belief that the tools will provide value. In some cases this will be obvious or not a question, yet in those cases where there is any doubt, the examples in this paper provide guidance for the ROI calculation of reliability tasks.

Every product development team faces different criteria for value (time, cost, etc.) and different sets of constraints (time, samples, test capabilities, etc.). Just like creating a specific test plan, the ROI calculation is tailored to fit the situation.

Not every tool is appropriate to use, and the analysis of the ROI, even with estimates and assumptions, provides an organization the ability to select the tools that provide the best value.

REFERENCES

1. Silverman, Mike. How Reliable Is Your Product? Cupertino, CA: Super Star Press, December, 2010, pg. 193.
2. Personal Communication with Mike Silverman, June 18th, 2011.
3. Hobbs, Gregg K. Accelerated Reliability Engineering: HALT and HASS. Chichester; New York: Wiley, 2000, pg. 43.
4. Nelson, Wayne. Accelerated Testing: Statistical Models, Test Plans, and Data Analysis. Edited by S S Wilks Samuel. Wiley Series in Probability and Mathematical Statistics. New York: John Wiley & Sons, 1990, pg. 3.
5. Wasserman, Gary S. Reliability Verification, Testing and Analysis in Engineering Design. New York: Marcel Dekker, 2003, pg. 209.
6. Ott, Lyman. An Introduction to Statistical Methods and Data Analysis. Belmont, Calif.: Duxbury Press, 1993, pg. 216.
7. Ireson, William Grant, Clyde F Coombs, and Richard Y Moss. Handbook of Reliability Engineering and Management. New York: McGraw Hill, 1995, pg. 16.9.
8. Crosby, Philip B. Quality Is Free: The Art of Making Quality Certain. New York: Signet, 1979.
9. Ireson, William Grant, Clyde F Coombs, and Richard Y Moss. Handbook of Reliability Engineering and Management. New York: McGraw Hill, 1995, pg. 5.4.

BIOGRAPHIES

Fred Schenkelberg
Ops A La Carte, LLC
990 Richard Avenue, Suite 101
Santa Clara, CA 95050, USA

e-mail: fms@opsalacarte.com

Fred Schenkelberg is a reliability engineering and management consultant with Ops A La Carte, with areas of focus including reliability engineering management training and accelerated life testing. Previously, he co-founded and built the HP corporate reliability program, including consulting on a broad range of HP products. He is a lecturer with the University of Maryland teaching a graduate level course on reliability engineering management. He earned a Master of Science degree in statistics at Stanford University in 1996. He earned his bachelor’s degree in Physics at the United States Military Academy in 1983. Fred is an active volunteer with the management committee of RAMS, currently the Chair of the American Society for Quality Reliability Division, active at the local level with the Society of Reliability Engineers and IEEE’s Reliability Society, active with IEEE reliability standards development teams, and recently joined the US delegation as a voting member of the IEC TAG 56 - Durability. He is a Senior Member of ASQ and IEEE. He is an ASQ Certified Quality and Reliability Engineer.

