This document discusses a study on how the statistical distributions of software metrics can impact quality. It shows that some metrics, like file size, follow a double Pareto distribution with a transition point from lognormal to power law behavior. Files above this transition point account for a large percentage of code size and defects. The probability of finding defects is higher for files with metrics above the transition point. Overall, the findings indicate the statistical distributions of metrics are related to defects density and can help reduce the search space for problematic files.
Statistical Distribution of Metrics
1. Statistical distributions of software metrics: do they matter?
Israel Herraiz
Technical University of Madrid
israel.herraiz@upm.es
Grab these slides from
http://slideshare.net/herraiz/statistical-distributions-of-metrics
Israel Herraiz, UPM Statistical distributions of software metrics: do they matter? 1/17
2. Outline
1 Some background
2 Statistical properties of software metrics
3 Evidence of impact on quality
4 Summary of findings and further work
3. Some background
4. A (not so) long time ago...
Statistical distribution of software metrics
Software size follows a double Pareto distribution
"Towards a theoretical model for software growth", MSR 2007
More recently
Not only size, but some OO metrics too (and some complexity metrics)
"On the Statistical Distribution of Object-Oriented System Properties", WETSoM 2012
5. OK, but what is that double Pareto thing?
[Figure: P[X > x] against SLOC on logarithmic axes, showing the data together with the double Pareto and lognormal curves]
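The curve on this slide is a complementary cumulative distribution, P[X > x], drawn on logarithmic axes. The following is a minimal sketch of how such a curve could be computed from raw file sizes. It assumes a plain list of SLOC counts; the function name `empirical_ccdf` and the sample values are hypothetical and do not come from the study.

```python
from collections import Counter

def empirical_ccdf(sizes):
    """Return sorted (x, P[X > x]) pairs for the distinct values in sizes."""
    n = len(sizes)
    counts = Counter(sizes)
    points = []
    remaining = n  # observations not yet passed while scanning upwards
    for x in sorted(counts):
        remaining -= counts[x]  # now the number of observations > x
        points.append((x, remaining / n))
    return points

# Hypothetical SLOC counts, for illustration only.
sloc = [10, 10, 25, 40, 40, 40, 300, 5000]
ccdf = empirical_ccdf(sloc)
for x, p in ccdf:
    print(x, p)
```

On logarithmic axes a power law tail appears as a straight line, while a lognormal body bends downwards. The last point has probability zero and cannot be drawn on a logarithmic axis, so it is usually dropped before plotting.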
6. But does it matter?
Most of the files are on the lognormal side
[Figure: bar chart of % Files for C, C++, Java, Python and Lisp]
7. But does it matter?
Most of the files are on the lognormal side
But the power law minority matters a lot
[Figure: two bar charts, % Files and % SLOC, each for C, C++, Java, Python and Lisp]
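The two charts contrast the share of files with the share of code. The following is a minimal sketch of how both percentages could be computed for one project. It assumes that "the power law side" means files whose size is at or above a threshold `x_min`; the function name `tail_shares` and the sample sizes are made up for illustration.

```python
def tail_shares(sizes, x_min):
    """Return (% of files, % of SLOC) contributed by files with size >= x_min."""
    tail = [s for s in sizes if s >= x_min]
    pct_files = 100.0 * len(tail) / len(sizes)
    pct_sloc = 100.0 * sum(tail) / sum(sizes)
    return pct_files, pct_sloc

# Hypothetical project: 99 small files and one very large file.
sizes = [100] * 99 + [6600]
pct_files, pct_sloc = tail_shares(sizes, x_min=1000)
print(pct_files, pct_sloc)
```

With these made-up sizes a single file out of one hundred holds a large part of the code, which is the kind of imbalance the slide describes.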
8. Large files have a large impact
Size estimation models
Some software size estimation models are based on the log-normality of size metrics. These models systematically underestimate the size of software.
[Figure: four panels (C, C++, Java, Python), each plotting RE against SLOC]
"On the distribution of source code file sizes", ICSOFT 2011
9. Statistical properties of software metrics
10. Parameters of the statistical distribution
Power law parameters: λ and xmin
Transition from lognormal to power law
[Figure: P[X > x] against SLOC on logarithmic axes, showing the data together with the double Pareto and lognormal curves]
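The slide names the two power law parameters but does not say how they were estimated. The following is a sketch of one standard option, the maximum likelihood estimator for a continuous power law with density proportional to x^(-alpha) for x >= x_min, which gives alpha = 1 + n / sum(ln(x_i / x_min)). It assumes x_min is already known; whether `alpha` here corresponds to the λ of the slide is an assumption, and the function name and sample values are hypothetical.

```python
import math

def powerlaw_alpha_mle(values, x_min):
    """MLE of alpha for a continuous power law p(x) ~ x**(-alpha), x >= x_min."""
    tail = [v for v in values if v >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    log_sum = sum(math.log(v / x_min) for v in tail)
    if log_sum == 0:
        raise ValueError("all tail observations equal x_min; alpha is undefined")
    return 1.0 + len(tail) / log_sum

# Hypothetical file sizes, for illustration only.
sizes = [50, 120, 1000, 2000, 4000, 8000]
alpha = powerlaw_alpha_mle(sizes, x_min=1000)
print(alpha)
```

File sizes are integers, so the continuous estimator is only an approximation here. Choosing x_min itself is a separate problem that this sketch does not address.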
11. Evidence of impact on quality
12. Probability of finding defects
We have seen that files above xmin account for 40% of total size, although they are only about 1% of the files.
What about defects? Probability of finding defects in three software projects (using CYCLO as metric):
Project       Below xmin   Above xmin
Apache        .4178        .7708
OpenIntents   .2500        .7500
Zxing         .2143        .4161
* Data extracted from "ReLink: Recovering Links between Bugs and Changes", FSE 2011.
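The slide does not define how the probabilities in the table were obtained. The following is a minimal sketch under one plausible reading: the probability is the fraction of files with at least one linked defect, computed separately for files below and at or above the threshold. The function name `defect_probability`, the record layout and the sample records are hypothetical.

```python
def defect_probability(files, x_min):
    """Return (P[defect | metric < x_min], P[defect | metric >= x_min]).

    files: iterable of (metric_value, defect_count) pairs.
    """
    below = [d > 0 for m, d in files if m < x_min]
    above = [d > 0 for m, d in files if m >= x_min]

    def fraction(flags):
        # Share of True flags; NaN when the group is empty.
        return sum(flags) / len(flags) if flags else float("nan")

    return fraction(below), fraction(above)

# Hypothetical (CYCLO, defects) records, for illustration only.
files = [(2, 0), (3, 0), (5, 1), (8, 0), (12, 0),
         (40, 2), (55, 0), (90, 1), (120, 3)]
p_below, p_above = defect_probability(files, x_min=40)
print(p_below, p_above)
```

The same function works for any file-level metric, as long as the threshold is expressed in the units of that metric.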
13. Probability of finding defects
Probability of finding defects (normalized metrics)
Using CYCLO / WMC as metric (cyclomatic complexity per LOC)
Project       Below xmin   Above xmin
Apache        .4159        .6296
OpenIntents   .2813        .5417
Zxing         .3181        .2389
14. Probability of finding defects
Defects density (only pre-release defects)
Using Number of Methods and number of pre-release defects per LOC
[Figure: two plots, one labelled Below xmin and one labelled Above xmin]
Below xmin: Avg. Dens. = .2685
Above xmin: Avg. Dens. = .4565
* Data obtained from "Predicting Defects for Eclipse", PROMISE 2007
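The averages on this slide are defects densities, that is, defects per LOC, grouped by a threshold on another metric. The following is a sketch of that grouping, assuming the average is an unweighted mean of the per-file densities; the slides do not state how the averages were computed. The function name `average_density`, the record layout and the sample records are hypothetical.

```python
def average_density(files, x_min):
    """Return (mean density below x_min, mean density at or above x_min).

    files: iterable of (metric_value, defect_count, loc) triples with loc > 0.
    The density of one file is defect_count / loc.
    """
    below = [d / loc for m, d, loc in files if m < x_min]
    above = [d / loc for m, d, loc in files if m >= x_min]

    def mean(xs):
        # Unweighted mean over files; NaN when the group is empty.
        return sum(xs) / len(xs) if xs else float("nan")

    return mean(below), mean(above)

# Hypothetical (number of methods, defects, LOC) records, for illustration only.
files = [(3, 0, 100), (5, 1, 200), (40, 2, 100), (60, 1, 50)]
dens_below, dens_above = average_density(files, x_min=40)
print(dens_below, dens_above)
```

A mean weighted by LOC (total defects divided by total LOC in each group) is an equally reasonable definition and can give different numbers when file sizes vary a lot.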
15. Probability of finding defects
Defects density (only post-release defects)
Using Number of Methods and number of post-release defects per LOC
[Figure: two plots, one labelled Below xmin and one labelled Above xmin]
Below xmin: Avg. Dens. = .1437
Above xmin: Avg. Dens. = .2690
16. Probability of finding defects
Defects density (pre + post-release defects)
Using CYCLO/SLOC and number of total defects per LOC
[Figure: plots on logarithmic axes, with Pr(X ≥ x) on the vertical axis and x on the horizontal axis]
Below xmin: Avg. Dens. = .3335 (>9000 files)
Above xmin: Avg. Dens. = .7747 (364 files)
17. Summary of findings and further work
18. Summary and further work
Summary of preliminary findings
Some metrics have a transition from lognormal to power law
Clear relation between normalized metrics and defects density
Although the threshold might not be perfect (e.g., you might find a high defects density in a file below the threshold), it greatly reduces the search space for potentially problematic files
Further work
Verify in more projects
Do you have defects data at the file level?
Find explanation for the transition and its influence on quality
How do the statistical parameters change over time? Do defects evolve accordingly?