Software metrics best practices drawn from a benchmarking assignment that examined how software metrics are reported to management and used to drive behavior. We learned how leading companies use dashboards to report on quality progress and improvement results. The best organizations focused on the vital few metrics but also had automated systems that could drill down into metrics at the divisional and team levels. In addition, the best normalized their metrics by number of customers or by complexity, and they systematically used root cause analysis on bugs found in the field. Their SW quality metrics often went beyond the strict definition of quality to also measure release predictability and feature expectations. Finally, the best companies used external benchmarks to set their quality targets.
EXECUTIVE SUMMARY
The purpose of the benchmark study was to capture best
practices in the application of SW metrics dashboards.
Ten technology companies were benchmarked against these
questions:
• What metrics on software quality are reported to
management?
• Internal quality metrics, external field detected metrics?
• How are they normalized? Customers in field, LOC?
• What are the most important?
• Are they tabular, graphical? How many? Are target values
shown?
• How frequently are they reported? How many do you
report on?
• What are key target values you look at for key metrics?
List of participants (3 highly regulated companies, 7 networking/computer/storage companies):
• Alcatel
• Boston Scientific
• Cisco
• Ericsson
• General Dynamics
• IBM
• Saint Jude Medical
• Palo Alto Networks
• Riverbed
• VMware
Key Highlights:
• There is no standard for the number of metrics, the type of
metrics, or the frequency of reporting
• However, there are best practices around Software Quality
Metrics: we can look at what separates the best from the
rest
• The BEST have
1. Automated metrics tracking and analysis systems that
allow drill down and reporting by product, release,
customer
2. Normalization that ensures that the metrics are
meaningful as the number of customers or the
complexity of code increases
3. Root Cause Analysis system that systematically
analyzes defects that escape the company and are
found in the field
4. Quality metrics that go beyond product defects, and
include release predictability and feature expectations
5. External benchmarks that are used to set goals
(created by third parties to establish databases or
perform surveys)
HOW WE APPROACHED THE ANALYSIS
• The Process Capability Maturity Model (CMM) defines five levels of process maturity
• Level 1 (Initial, Chaotic)
• Level 2 (Repeatable)
• Level 3 (Defined)
• Level 4 (Managed, Measured)
• Level 5 (Optimizing)
• Metrics are a key part of the CMM model, and Level 4 indicates mastery of metrics
• SW metrics are well characterized, and are often divided up between Product Quality Metrics, In-Process Metrics, and Metrics for
SW Maintenance*
• From our survey of ten companies, we have derived a sense of metrics maturity, and have created our own rating of SW Metrics
Maturity using five factors
• Automated, Root Cause Analysis, Normalized, External Benchmarks, and Total Quality (not just defects)
• The Best tend to have excellent scores on all five dimensions, the rest lag behind in one or more areas
• The best tend to have measures in the three areas defined above (Product, In-Process, and Maintenance)
EXAMPLE SW METRICS MATURITY
1. Automated metrics tracking and analysis
systems that allow drill down and
reporting by product, release, customer
2. Normalization that ensures that the
metrics are meaningful as the number of
customers or the complexity of code
increases
3. Root Cause Analysis system that
systematically analyzes defects that
escape the company and are found in the
field
4. Quality metrics that go beyond product
defects, and include release predictability
and feature expectations
5. External benchmarks that are used to set
goals (created by third parties to establish
databases or perform surveys)
[Figure: hypothetical radar chart comparing the Best and the Rest on five axes: Root Cause Analysis, Automated Metrics System, Normalization, Total Quality (Predictability/Features), and Uses External Benchmarks. A 5-point scale is used, where mastery is indicated as a 5 (outermost) and absent is a 0 (innermost).]
The nature of the survey did not allow us to complete this chart for each participant, but this treatment would be very useful to
evaluate where you are today and where you should focus in the future to close gaps between the best and the rest.
DASHBOARD – DRAWN FROM BENCHMARKING
• Title & Description • So What • Consistent Design
• Labeled Axes • Target Curves • Narrative
Guiding Principles:
Each metric should be linked to your overall quality objectives, which were derived from your overall
strategy
From the Benchmark Sample, the goals might be:
• Increasing Net Promoter Score (how highly you are recommended)
• Increasing Release Predictability
• Increasing Customer Satisfaction
• Increasing Reported Quality (Field Quality)
• Reducing time to repair
• Reducing the number of Critical Accounts
Each chart has the following graphical properties:
• The charts are composed so that the ‘so what’ is very clear. It is repeated on each chart so that managers who
see the charts only once a quarter know why the metric is there and what, if anything, is significant in
the data.
• Targets should be on all graphs
• Where benchmark data exists, it will also be shown on the chart
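• Each chart should have the properties listed above (title and description, ‘so what’, consistent design, labeled axes, target curves, narrative)
Dashboard layout:
• There should be between 4 and 8 metrics
• Two related metrics per screen
• Text describing and analyzing the data represented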
Percent of Release Slips
This chart plots the percentage of actual versus planned schedule for major
and minor releases.
• The target is set to reach less than 5% slip by 2014, closing the gap in
a straight line from today's 22%
• The increase shown in November 2011 is driven by the A.2a release,
which had to go through 2 alphas
• We expect a steeper drop in July, 2012 because of our new “Darken the
Sky” program to provide requirements stability
• Benchmarking indicates that the best in class number is a slip rate of less
than 15% (for 9 month release cycles).
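The straight-line target described above can be sketched in a few lines of Python. This is only an illustration: the start and end dates are assumptions (the slide says only "today" and "by 2014"), the function names are hypothetical, and `slip_percent` encodes one plausible reading of "actual versus planned schedule", namely the overrun as a percentage of the planned duration.

```python
from datetime import date


def slip_percent(planned_days: float, actual_days: float) -> float:
    """Schedule slip as a percentage of planned duration (assumed definition)."""
    if planned_days <= 0:
        raise ValueError("planned_days must be positive")
    return (actual_days - planned_days) / planned_days * 100.0


def linear_target(start: date, start_pct: float,
                  end: date, end_pct: float, when: date) -> float:
    """Straight-line target between two dated values, clamped outside the range."""
    span_days = (end - start).days
    frac = (when - start).days / span_days
    frac = min(max(frac, 0.0), 1.0)
    return start_pct + frac * (end_pct - start_pct)


# Assumed dates; the 22% and 5% values come from the slide.
START, END = date(2012, 1, 1), date(2014, 1, 1)

if __name__ == "__main__":
    for when in (START, date(2013, 1, 1), END):
        target = linear_target(START, 22.0, END, 5.0, when)
        print(when.isoformat(), round(target, 1))
```

Plotting `linear_target` for each reporting period gives the target curve that the actual slip percentage is compared against.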
Mean Time to Repair
This chart plots the average time, in weeks, that the customers had to wait for
resolution. Measured in weekly intervals, data captured per release.
• The target is derived to get to the fastest resolution (and reduce the
number outstanding)
• The increase shown in January, 2012 is driven by the A.x release.
• The new methods for engineering releases should impact this in 2013
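As a rough sketch of how the mean time to repair per release could be computed, the Python below averages the open-to-resolved interval, in weeks, for each release. The ticket tuple layout, the function name, and the sample dates are all illustrative assumptions, not data from the benchmark.

```python
from collections import defaultdict
from datetime import date


def mean_time_to_repair_weeks(tickets):
    """Return {release: mean weeks from opened to resolved}.

    tickets: iterable of (release, opened, resolved) with date values.
    """
    totals = defaultdict(lambda: [0.0, 0])  # release -> [sum of weeks, count]
    for release, opened, resolved in tickets:
        weeks = (resolved - opened).days / 7
        totals[release][0] += weeks
        totals[release][1] += 1
    return {release: total / count for release, (total, count) in totals.items()}


# Made-up sample tickets, for illustration only.
SAMPLE = [
    ("Major Release 2", date(2012, 1, 2), date(2012, 1, 16)),  # 2 weeks
    ("Major Release 2", date(2012, 1, 9), date(2012, 2, 6)),   # 4 weeks
    ("Major Release 3", date(2012, 3, 5), date(2012, 3, 12)),  # 1 week
]

if __name__ == "__main__":
    print(mean_time_to_repair_weeks(SAMPLE))
```

Only resolved tickets are counted here; tickets that are still open would need separate treatment (for example, a count of outstanding issues), since the slide mentions reducing the number outstanding as well.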
[Charts: two example line charts, each with a labeled vertical axis and a labeled horizontal axis. Plotted series include Benchmark, Major Release 2, and Major Release 3.]
BEST PRACTICES
1. Use of third-party firms to assess how your software defect performance stacks up against the competition, and use of industry-standard databases for
software quality
2. Test Escapes Analysis Process to perform root cause analysis on all significant escapes to the field
3. SW defects reported on the dashboard include broader measures like predictability and expectations
4. Automated, integrated system for real-time metrics analysis, so that presentation to management is simply a matter of pulling up current data and reviewing it formally
5. Normalization for complexity and/or accounts in the field to ensure that proper comparisons are made
6. Create a compound metric that pulls together several important factors for the business
7. Institute metrics that show (unit and integration) statement coverage, branch coverage, all tests passing, and for functional testing, show requirements
coverage and all tests passing
8. Institute metrics that show defect backlog, number of test cases planned, Upgrade/Update failure rate, Early Return Index, and Fault Slip Through
9. Bug tool kit that goes to the field with exhaustive and searchable data to help customers avoid reporting defects, learn about workarounds, and search with
Google-like strength
10. If external benchmark targets are not known, track improvement release over release
11. Focus on what is important. One participant only tracks release predictability and customer satisfaction
12. Use parametric estimation metrics (for example, 4 days for a test case) to ensure high-quality, data-driven schedule estimates (this also helps demonstrate
improvements over time)
In benchmarking studies like this, we often see some exemplary practices that demonstrate creative and effective ways to stay ahead.
[Slide callout groupings for the list above: Top 5; Metrics to Consider; Other Tips]
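The parametric estimation practice in the list above (for example, 4 days for a test case) can be illustrated with a minimal Python sketch. The 4-days-per-test-case figure is the slide's example; the function names, the team-size parameter, and the assumption that effort divides evenly across testers are hypothetical simplifications.

```python
import math


def estimate_test_days(test_cases: int, days_per_case: float = 4.0,
                       testers: int = 1) -> int:
    """Calendar working days needed, assuming effort splits evenly across testers."""
    if test_cases < 0 or days_per_case <= 0 or testers < 1:
        raise ValueError("invalid estimation inputs")
    return math.ceil(test_cases * days_per_case / testers)


def calibrate_days_per_case(actual_person_days: float, completed_cases: int) -> float:
    """Recompute the parameter from a finished release, to track improvement over time."""
    if completed_cases < 1:
        raise ValueError("completed_cases must be at least 1")
    return actual_person_days / completed_cases


if __name__ == "__main__":
    # 30 planned test cases at 4 days each, shared by 5 testers.
    print(estimate_test_days(30, 4.0, 5))
    # A finished release that took 105 person-days for 30 cases.
    print(calibrate_days_per_case(105, 30))
```

Recalibrating the parameter after each release gives both a data-driven estimate for the next schedule and a simple trend showing whether the team is getting faster.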
SUMMARY STATISTICS
• 8 report customer-found defects to management (the remaining
2 report customer satisfaction at a high level)
• 6 report on the order of 4 metrics to management; the
remaining 4 report more or fewer
• 5 include time to market as a metric in their quality dashboard
• 4 report escapes or customer found defects caused by bad fixes
Key Highlights:
• 4 companies have real time visibility of metrics, and they are
automatically updated on a daily basis
• 3 companies reported on compound metrics that combine
reliability, availability, time to fix
• 3 do not use targets for metrics reported to management, but
only report the improvement release to release
• 3 normalize metrics (LOC on inside, or Units in Field on outside)
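To make the two normalization styles concrete (lines of code for internal metrics, units in the field for external ones), here is a small Python sketch. The function names and the sample numbers are illustrative assumptions, not figures from the participants.

```python
def defects_per_kloc(defects: int, lines_of_code: int) -> float:
    """Internal normalization: defects per thousand lines of code."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return defects / (lines_of_code / 1000)


def defects_per_thousand_units(defects: int, units_in_field: int) -> float:
    """External normalization: field defects per thousand deployed units."""
    if units_in_field <= 0:
        raise ValueError("units_in_field must be positive")
    return defects * 1000 / units_in_field


if __name__ == "__main__":
    # Made-up example: raw field defects rise from 40 to 60,
    # but the installed base triples, so the normalized rate halves.
    print(defects_per_thousand_units(40, 2000))
    print(defects_per_thousand_units(60, 6000))
    print(defects_per_kloc(50, 250_000))
```

The example shows why normalization matters for a growing company: the raw defect count goes up while quality per deployed unit actually improves.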
IMPLICATIONS
• Root cause analysis should be performed on defects from the field that are either critical or from regressions
• Many companies have special processes for doing this effectively
• It appears that some participants have higher levels of automation and coverage for unit, integration, and functional test
• And the coverage is measured
• Planning metrics, such as the number of days per test case should be used for prediction and improvement
• If you are growing, some normalization should be used.
• It should be coarse (like judged Lines of Code, converted from Function Points)
• Walker Survey, Quest Database, and Manager-Tools.com are three recommended vendors for metrics and management
• Walker Survey can determine how you stack up against your competitors regarding quality and satisfaction
• Quest is a TL 9000 database
• Manager-Tools are helpful for developing QA managers
• Where absolute targets don’t exist, a target curve based on prior improvement should be used to answer ‘are we getting better?’
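One way to build a target curve from prior improvement is to take the average release-over-release ratio and project it forward. The Python sketch below uses a geometric mean for that ratio; this is one reasonable choice rather than a method taken from the benchmark, and the function name and sample history are hypothetical.

```python
def target_curve(history, future_releases):
    """Project targets by repeating the average release-over-release ratio.

    history: metric values for past releases, oldest first (all positive).
    Returns a list of target values for the next `future_releases` releases.
    """
    if len(history) < 2 or any(value <= 0 for value in history):
        raise ValueError("need at least two positive historical values")
    # Geometric mean of the per-release ratios.
    ratio = (history[-1] / history[0]) ** (1 / (len(history) - 1))
    targets = []
    value = history[-1]
    for _ in range(future_releases):
        value *= ratio
        targets.append(value)
    return targets


if __name__ == "__main__":
    # Made-up defect counts for three past releases (20% better each time).
    print([round(t, 2) for t in target_curve([100, 80, 64], 2)])
```

Comparing each new release against the projected value answers "are we getting better?" at least as fast as before, even when no external benchmark exists.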