Paper given at the 2009 ASQ World Quality Congress on key features found in the best (and worst) reliability programs.
TRAITS FOUND IN EFFECTIVE RELIABILITY PROGRAMS
Fred Schenkelberg
Reliability Engineering Consultant
FMS Reliability
Los Gatos, CA 95032
fms@fmsreliability.com
www.fmsreliability.com
SUMMARY
Having the privilege to interview a cross-section of more than 70 product development teams to understand their
reliability programs has led to a few observations. Only a rare few have mature, cost-effective, and efficient reliability
programs.
A clear understanding of your organization’s reliability program, along with a clear vision of what is possible, is the
crucial first step to making systematic program improvements. This paper explores the key traits that separate good from
great reliability programs.
Marketing, product volume, complexity, and organizational structure do not tend to matter; however, a proactive
approach, statistical thinking, fact-based decision making, and integrated reliability tools do tend to make a difference. This
paper outlines how to assess your organization and highlights key traits of good and truly great reliability programs.
INTRODUCTION
On one occasion I conducted assessments of two organizations located in the same building. Both designed and
manufactured telecommunication equipment of similar complexity and volume. The interview schedule had me going up
and down stairs almost every hour for two days, and by midday of the first day I enjoyed going upstairs and dreaded heading
down. Despite all the similarities, the two reliability programs were dramatically different; as different as their reliability
results.
Downstairs, the interviews started late and were interrupted by urgent phone calls or in-person requests: firefighting at its
best. The team employed a wide range of tools, every one listed on a checklist, for each project. The reliability goals were
not known to the design team, and the few who knew them also understood the goals would not be measured and would not
impede getting the product to market. The people I talked to stated reliability was very important, and they were very busy
fixing issues identified in the field or in testing just before product launch. Reliability was done by the guy who left last year.
Upstairs, the interviews started on time, without interruption. No one remembered the last time there was an urgent
need to resolve a field issue. The team employed reliability tools as needed, where they would benefit the project. The specific
testing was tailored to the risks identified during the design phase. The goals were widely known, as was the current status,
during development and after product launch. The people I talked to stated reliability was very important, and they knew what
to do to meet their reliability objectives. Reliability thinking and skills were taught by Sharon, who left last year.
This paper touches on the key traits that separated these two groups.
BACKGROUND
In 1983, John Young, CEO of Hewlett-Packard Company, noticed that warranty expense was growing faster
than revenue. He asked the corporation to reduce warranty by 10x by the end of the decade. One of the
key factors in the success of this program, which reduced warranty from 4% to 1.5% of net revenue, was the identification and
encouragement of key reliability engineering practices [Ireson, 5.1]. Dick Moss conducted the survey and was my mentor at
HP.
In 1996, we were unable to tally the corporate warranty expense; the systems and metrics established in the late ‘80s
had been dismantled. When the result was finally determined, the corporation had lost ground and looked like it would
continue to grow warranty expense faster than revenue. Just as in the early ‘80s, many of the key reliability
practices were widely used, yet the results did not indicate any effectiveness. It was time to conduct another survey, since a
few product divisions did have better results with respect to warranty expenses. So, I dusted off the old Moss survey.
One item that became clear as the survey progressed was that the culture of the product team, and how it viewed
reliability, seemed directly related to the results. This is similar to the quality maturity described in “Quality is Free” by
Philip B. Crosby. Using the same approach for product reliability, the product teams with high maturity did have significantly
lower warranty expenses. Other attempts have explored this relationship between reliability activities, effectiveness, and
results, including a current effort within IEEE to publish a reliability assessment standard. [Gullo]
In my experience, product teams have asked for guidance on how to improve their product reliability (e.g. warranty
expenses), which is guidance on how to move to the right on the maturity matrix and become more effective in achieving
reliable product performance in the field. A few of these engagements involved reliability programs that already employed an
assortment of practices, yet each had one or two missing elements that kept them from achieving systemic improvements. It
is specifically these experiences that form the basis for this paper.
THE TRAITS
There are three main interconnected threads that run through very effective programs. First, the team states clearly
defined reliability goals that are routinely estimated, measured, and evaluated. Second, the team makes design decisions fully
considering the impact to the program and business. And third, the team actively seeks failures and endeavors to learn as
much as possible from each one. Each of these traits consists of a collection of tightly interwoven reliability tools or
practices. The specific tools vary from one team to the next due to volume, market, and other business priorities.
Trait 1: STATE CLEAR GOALS
There are plenty of really bad reliability goal statements: a 20,000-hour MTBF, a 5-year life, “as good as or better
than…”, a 2-year warranty, zero field failures. What very good programs have is a complete statement that permits the
organization to understand the goal and use it to influence each design decision.
A simple definition of reliability includes four elements: function, duration, probability, and environment. Poor goal
statements often state only one of the four elements and force assumed values for the others. A complete reliability
goal statement includes all four elements, as shown below.
“Product FMS provides music storage and playback [key functions] for two years [duration] with 98% reliability
[probability of success over duration period] in a worldwide portable environment [environment].”
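The four elements lend themselves to a simple structured record. A minimal sketch in Python (the class and field names are illustrative, not from any particular program):

```python
from dataclasses import dataclass

# Hypothetical structure capturing the four elements of a complete
# reliability goal: function, duration, probability, and environment.
@dataclass
class ReliabilityGoal:
    functions: list[str]      # what the customer expects to work
    duration_years: float     # period of interest
    probability: float        # probability of success over that period
    environment: str          # reference to the environment definition

    def statement(self) -> str:
        return (f"Provides {', '.join(self.functions)} for "
                f"{self.duration_years:g} years with {self.probability:.0%} "
                f"reliability in a {self.environment}.")

goal = ReliabilityGoal(
    functions=["music storage", "playback"],
    duration_years=2,
    probability=0.98,
    environment="worldwide portable environment",
)
print(goal.statement())
```

Keeping the goal as data rather than a sentence makes it harder for any one element to be silently assumed.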
Both the function and environment require further definition, which is often provided in other key documents or
references. For example, many product development teams have a set of product specifications the design should meet. These
include size, color, features, and performance parameters. Generally, the function element includes anything that the customer
would notice not working and, when it didn’t perform as expected, would call a failure. Understanding what constitutes a
failure from the customer’s point of view tailors risk analysis and product evaluations to the elements most important to customers.
The environment includes shipping, storage, installation, startup, and use. Many organizations develop a set of
documents that capture the key features of their market’s environment. Many organizations rely on standards and do not
tailor the environmental parameters, as the best do, to reflect the experience of their products with their customers. For
example, the MP3 player above is likely to sit on a car dashboard in the sun: does the internal set of environmental
requirements capture this temperature extreme and its expected duration? The better environmental statements include nominal
values and expected ranges for temperature, humidity, shock, radiated emissions, usage profiles, and possibly numerous
other environmental and usage factors that define the most significant parameters impacting the short- and long-term
performance of the product with the customer. It is not a set of fixed profile tests.
A fully stated goal often includes multiple durations with associated probability statements (out of box, first 90 days,
warranty period, and expected life are common durations of interest). Different failure mechanisms may produce failures in
a design at different points in time. For example, shock and vibration from transportation to the customer may be the most
significant root cause of out-of-box failures, whereas mechanical fatigue may dominate failures after the warranty period.
The full statement permits consideration of materials, assembly options, component selection, and packaging approaches early
in the product design process.
A reliability goal is just one of many constraints a design team must consider during product development. They
face a seemingly endless list of requirements, regulations, and business expectations. The three most common are
performance, schedule, and cost. Performance means the functions, i.e., what the product is supposed to do for the customer,
and is often key to the value the product provides. It is immediately measurable: the product either meets the performance
requirements or it doesn’t. The first prototypes provide the first measures and are central to nearly every measure made and
reported during development and manufacturing. Schedule refers to the time-to-market requirement. The project has a target
date to have the product in its final form, on the shelf, ready for sale. The calendar measures this criterion, and a series of
schedule milestones reminds the design team of the deadline. Cost is often the bill-of-material cost and relates to the
profitability of the product. A simple spreadsheet listing the component and assembly costs can tally this every day for the
design team. All three are readily measurable. They each provide feedback to the team.
Reliability, specifically the probability of successful operation at later durations, is difficult at best to measure
accurately. The second element of this trait is the repeated and improving measurement of reliability during the design process.
Goals without some method to track progress leave the team guessing whether they achieved the goal or are on target. The
measure provides a means to make adjustments and to gauge readiness for the market.
One of the best examples I’ve seen involved a weekly report to the design team on reliability. Each Friday, Phil
would gather the best available data or estimates for each of the major sub-systems of the product. On Monday he would
report the results of the tally against the reliability goals. Early in the program these estimates were based on historical data
from previously fielded products. As the design evolved the estimates received adjustments from parts count and vendor data
sources. For key elements the team invested in accelerated life testing or encouraged the vendor to perform the testing. And
finally, with later prototypes, the team conducted accelerated demonstration tests on the entire system using time compression
and elevated temperature. High temperatures accelerated the most dominant high-risk failure mechanisms, and the team closely
monitored the first 6 months of field performance.
During each stage of the product lifecycle the team received the best available measure of reliability. As the design
progressed and the product became more functional, additional testing and estimates continued to improve. Just like the
other three major constraints (performance, schedule, and cost), reliability measures provided regular feedback.
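A weekly tally like Phil's can be sketched as a simple series-system roll-up, assuming subsystems fail independently so the system reliability is the product of the subsystem estimates (the subsystem names and values below are illustrative, not from the paper):

```python
from math import prod

GOAL = 0.95  # e.g., 95% reliability over the first year

# Best available estimate per subsystem: early on from historical data,
# later from parts-count models, vendor data, or accelerated testing.
estimates = {
    "power supply": 0.995,
    "main board":   0.990,
    "display":      0.985,
    "enclosure":    0.999,
}

# Series model: the system works only if every subsystem works.
system_reliability = prod(estimates.values())
print(f"System estimate: {system_reliability:.3f} vs goal {GOAL:.2f}")
if system_reliability < GOAL:
    print("Short of goal; focus effort on the weakest subsystems.")
```

The value of the exercise is less the specific number than the weekly feedback loop it gives the design team.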
A goal without a measure, like measures without a goal, provides limited value to the decision-making process.
Clearly stating a fully expressed reliability goal and regularly measuring reliability permit the team to know where they are
going, whether they are on track, and when they have arrived.
Trait 2: ENABLE TRADEOFFS
A single key piece of information is all that is required to enable designers to balance reliability with performance,
time to market, and cost. This information exists within any company that ships products, yet it is nearly always unknown to
the design team. Providing the cost of a field return, in dollars, permits the designer to translate reliability differences into
dollars.
For example, if the projected shipments are 1,000 units a month and a return costs the company $450 (call center,
repair/replacement, shipping, and failure analysis are examples of elements of this value), then a 1% change
in reliability (from 92% to 93%, for example) would reduce return costs by $4,500 per month. Taking this example a bit
further, assume it would cost $1/unit more (in bill-of-material cost) to achieve the change in field failure rate; is this worth the
increase? Certainly, as the savings is $4,500 per 1,000 units, or $4.50 per unit shipped, adding $3.50 to
profit for each unit shipped.
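The arithmetic above can be captured in a small helper that translates a reliability change into dollars (the figures are those from the example):

```python
def monthly_savings(units_per_month, cost_per_return, reliability_gain):
    """Returns avoided per month, times the cost of each return."""
    avoided_returns = units_per_month * reliability_gain
    return avoided_returns * cost_per_return

units = 1000          # projected shipments per month
return_cost = 450.0   # call center, repair, shipping, failure analysis
gain = 0.01           # 92% -> 93% reliability

savings = monthly_savings(units, return_cost, gain)  # $4,500 per month
per_unit = savings / units                           # $4.50 per unit
bom_increase = 1.00                                  # added BOM cost per unit
net_profit_per_unit = per_unit - bom_increase        # $3.50 per unit
print(savings, per_unit, net_profit_per_unit)
```

With the return cost in hand, any proposed design change can be evaluated the same way.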
For high-risk areas or major elements of a design, the team may face multiple options to trade off cost, time to
market, or functionality, each with associated costs. By understanding the impact to reliability, these trade-offs can be fully
considered. Teams that do this well use it during component selection, design solution comparisons, and design
optimization. They seek the areas with the best return on the investment, whether that is component cost,
functionality, schedule, or reliability.
Trait 3: SECURE FAILURES
“The concept of failure is central to the design process, and it is by thinking in terms of obviating failure that
successful designs are achieved.” [Petroski]
Product teams understand the product should just work for the customer. It shouldn’t fail. In my experience design
teams tend to imagine possible failure modes and attempt to design the product to avoid or mitigate the failure. It may be a
point of litigation if the product fails in a manner that should have been anticipated by the design team. More often it is the
business case: a product that doesn’t fail will sell better and have lower warranty expenses. Hence, a reliable product is
more profitable.
The best teams aggressively seek failures in the design over the entire product lifecycle. In early concept phases,
they consider the fundamental limits of the chosen technology. They also consider the types of stresses expected during use
and project their effect onto the core technology. A Failure Mode and Effects Analysis (FMEA) may help reveal high-risk
areas for further analysis. With the first prototypes, the team can directly evaluate performance and discover failure
mechanisms through testing such as Highly Accelerated Life Testing (HALT). And during the product launch, the team can
either confirm or discover the ways the product fails in use. In all cases, a technical understanding of the interaction of the
design with the applied stress (use, temperature, vibration, etc.) permits the team to uncover the design flaw that revealed
itself as a failure.
Reliability growth modeling is based on the premise that every design has an unknown and finite number of design
flaws. The product development process is the careful uncovering and resolving of as many of these flaws as possible before
shipping the products. At some point, finding the remaining flaws is not worth the effort (cost and time); the flaws that
remain produce an acceptable level of field reliability.
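Reliability growth of this kind is commonly quantified with the Crow-AMSAA (NHPP) model, in which cumulative failures follow N(t) = λ·t^β; β < 1 indicates the flaw-discovery rate is slowing, i.e., reliability is growing. A sketch with illustrative failure times (not data from this paper), using the standard maximum-likelihood estimates for a time-terminated test:

```python
from math import log

def crow_amsaa_mle(failure_times, total_time):
    """MLEs of (lambda, beta) for the Crow-AMSAA model,
    time-terminated test with failure times in (0, total_time]."""
    n = len(failure_times)
    beta = n / sum(log(total_time / t) for t in failure_times)
    lam = n / total_time ** beta
    return lam, beta

times = [40, 110, 300, 560, 890]   # hours at which flaws were found
T = 1000                           # total accumulated test hours
lam, beta = crow_amsaa_mle(times, T)
print(f"beta = {beta:.2f}")        # beta < 1 suggests reliability growth
```

Plotting cumulative failures against time on log-log axes gives the same picture visually: a slope below 1 means each new flaw is taking longer to find.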
Just finding the failures is the key first step in this trait. Many failures, once revealed to a design team, highlight
various design changes that will reduce or eliminate the same failures in the improved design. On some occasions, the failure
is only a symptom, and treating the assumed cause of the failure does not remove the flaw. For example, an intermittent
overvoltage from a power supply may cause sensitive integrated circuits (ICs) to fail. The IC failure may suggest a faulty
component, and its replacement does not change the underlying root cause of the failure. It will happen again. Or the faulty
power supply may cause another component to fail. With careful failure analysis of the broken IC, the root cause of
overvoltage would lead to investigating the power supply. Once the power supply design is fixed, the failure symptom of
blown ICs goes away.
Another element of this trait is the pursuit of every failure. Imagine during prototyping 100 units are created and
distributed to various parts of the team for evaluation and testing. Some failures may occur with all 100 units, some failures
occur with about half, and some occur on only one unit. The first two cases are obvious flaws that need attention and
resolution before shipping, as the sample failure rates approximate 100% and 50% field failure rates.
Now let’s further assume the product goal is 95% reliability over the first year, that five units each revealed a design
flaw, and that the other 95 units function without fault. The team is done, right? No; first, there is an issue with
a sample of 100 units with five failures estimating the population’s failure rate. The nominal estimate is 5/100, or 5%, which
is the same as 95% reliability. We would have to assume either that all 100 units experienced at least a year of operation (very
unlikely) or that the other functional units did not replicate the failure when exposed to the stress that uncovered the fault
(more likely). Using a 90% confidence that the sample represents the population, the actual reliability could be as low as 63%.
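One common way to reach a bound like this uses the chi-squared lower confidence limit on MTBF (assuming an exponential failure model for a time-terminated test), then converts to reliability at the mission time. The total accumulated operating time below is an assumption for illustration: with roughly 0.2 years on each of the 100 units, 5 failures at 90% confidence give approximately the 63% figure quoted above.

```python
from math import exp
from scipy.stats import chi2

def reliability_lower_bound(total_time, failures, mission_time, confidence):
    """Lower confidence bound on R(mission_time), exponential model,
    time-terminated test: MTBF_lower = 2T / chi2(confidence, 2r + 2)."""
    dof = 2 * failures + 2
    mtbf_lower = 2 * total_time / chi2.ppf(confidence, dof)
    return exp(-mission_time / mtbf_lower)

r_low = reliability_lower_bound(total_time=20.0,   # unit-years (assumed)
                                failures=5,
                                mission_time=1.0,  # one year
                                confidence=0.90)
print(f"Reliability could be as low as {r_low:.0%}")
```

Note how sensitive the bound is to the accumulated test time: with a full year on all 100 units, the same five failures would support a much higher lower bound.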
Also, considering that use conditions, environment, manufacturing, and components all vary, the actual failure rate will
certainly be worse than that estimated during development. Therefore, even relatively rare failures in the development
process require careful analysis and resolution.
In other words, each and every failure is a gift: an opportunity to learn about design flaws within a product. Using tools like
FMEA and HALT permits the team to uncover the faults as soon as possible.
CONCLUSION
Product teams that regularly produce reliable products (the upstairs team) have these three traits in common.
• First, a complete reliability goal statement with regular measurement.
• Second, the ability to translate reliability changes into dollars.
• Third, the aggressive discovery and resolution of failures.
Each of these is more than using a reliability engineering tool. They are a collection of tools working together to
encourage and enable the engineer to develop a product that meets the customer’s expectations of reliability. When all the
pieces are in place, the opportunity to meet reliability and business goals improves. The results of the upstairs team have been
repeated by other teams that carefully assessed their development programs and adjusted to include all the elements of the
three traits.
REFERENCES
Crosby, Philip B., Quality is Free: The Art of Making Quality Certain, Mentor, New York, 1979.
Gullo, Louis J., et al., "Assessment of Organizational Reliability Capability", IEEE Transactions on Components and
Packaging Technologies, Vol. 29, No. 2, June 2006, pp. 425-428.
Ireson, W. Grant, Coombs, Clyde F. and Moss, Richard Y., Handbook of Reliability Engineering and Management, 2nd Ed.,
McGraw-Hill, New York, 1996.
Petroski, Henry, Design Paradigms: Case Histories of Error and Judgment in Engineering, Cambridge University Press, 1994.