16. [Diagram of the data extraction process: a system release is parsed into a FAMIX model (classes with attributes); the SVN/CVS repository is checked out and the versioning system logs are parsed into commit comments, giving the class/file link; bug reports are obtained by querying and parsing the Bugzilla database; bug references in the commit comments give the inferred link.]
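21. Classification / Ranking
Classification: precision & recall (how small FP is, how small FN is)
Ranking: Spearman correlation coefficient between the predicted and the observed ranking
[Figure: two overlapping sets, "Buggy classes" and "Classes predicted as buggy", with the regions FN, TP, and FP; next to it, a predicted and an observed ranking of Class A, Class D, and Class E]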
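The two evaluation styles on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used in the studies: the helper names `precision_recall` and `spearman` and the example data are assumptions, and the Spearman formula used here is the simple one that assumes no tied ranks.

```python
def precision_recall(buggy, predicted):
    """Precision and recall for two sets of class names."""
    tp = len(buggy & predicted)   # buggy and predicted as buggy
    fp = len(predicted - buggy)   # predicted as buggy, but not buggy
    fn = len(buggy - predicted)   # buggy, but missed by the predictor
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def spearman(predicted_order, observed_order):
    """Spearman coefficient for two orderings of the same items (no ties)."""
    n = len(predicted_order)
    observed_rank = {name: r for r, name in enumerate(observed_order)}
    d2 = sum((r - observed_rank[name]) ** 2
             for r, name in enumerate(predicted_order))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical data, only for illustration.
buggy = {"A", "D", "E"}
predicted = {"A", "E", "F"}
p, r = precision_recall(buggy, predicted)
rho = spearman(["D", "A", "E"], ["E", "A", "D"])
```

Precision grows as FP shrinks and recall grows as FN shrinks, which is the reading given on the slide. A fully reversed ranking, as in the example, gives a coefficient of -1.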
29. Conclusion
Past defects are the predictor for future defects
Software metrics correlate with defects but are not usable in practice
30. Mining metrics to predict
component failures
Nachiappan Nagappan
Thomas Ball
Microsoft Research
Andreas Zeller
Saarland University
33. Experimental settings
Project                        Code size
Internet Explorer 6            511 KLOC
DirectX                        306 KLOC
Process messaging component    147 KLOC
NetMeeting                     109 KLOC
IIS Core                       37 KLOC
Granularity level: module (a binary file within Windows; a set of classes)
35. Q1 Do complexity metrics correlate with defects?
[Bar chart: maximum correlation and percentage of correlated metrics for projects A to E, on a scale from 0 to 1.00]
36. Q2 Is there a unique set of metrics that predicts
defects in all projects?
42. Q3 Can we combine metrics to predict defects?
Multicollinearity of metrics
Principal component analysis
Linear/logistic regression model
[Bar chart: Spearman/Pearson correlation and percentage of splits which correlate for projects A to E, on a scale from 0 to 1.00; annotation: "Too few samples"]
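Multicollinearity means that several metrics are strongly correlated with each other. One simple way to spot it, sketched here, is to compute pairwise Pearson correlations. This is not the procedure from the study: the metric names, the values, and the 0.9 threshold are assumptions made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical per-module metric values.
metrics = {
    "lines": [100, 200, 300, 400],
    "functions": [10, 20, 30, 40],   # grows exactly with "lines"
    "fan_in": [3, 1, 4, 1],
}
names = sorted(metrics)
# Pairs of metrics whose absolute correlation exceeds the assumed threshold.
collinear = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
             if abs(pearson(metrics[a], metrics[b])) > 0.9]
```

Highly correlated pairs such as the one found here carry overlapping information, which is the motivation for reducing the metrics with principal component analysis before fitting a regression model.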
43. Q4 Are predictors obtained from one project
applicable to other projects?
46. Conclusion
Metrics can be used
to predict defects
but
they must be validated
on the history
47. Improving Defect Prediction
Using Temporal Features and
Non Linear Models
Abraham Bernstein
Jayalath Ekanayake
Martin Pinzger
University of Zurich
48. Experimental settings
Plugin #Years #Files
updateui 7 757
updatecore 7 459
search 6.5 540
pdeui 6.5 1621
pdebuild 6 198
compare 6.5 315
Non-linear models based on 21 historical metrics + LOC
50. Classification of files
Using decision tree learners
All files: A
Correctly classified files: CC
Accuracy = Size(CC) / Size(A)
Best predictor (7 metrics): accuracy 99.16%
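The accuracy formula can be written down directly. The following is a small sketch with made-up file names and labels; the function name `accuracy` and the dictionary layout are assumptions.

```python
def accuracy(actual, predicted):
    """Size(CC) / Size(A): share of files whose predicted label is correct."""
    correct = sum(1 for f in actual if predicted.get(f) == actual[f])
    return correct / len(actual)


# Hypothetical labels: True means "has defects".
actual = {"a.java": True, "b.java": False, "c.java": False, "d.java": True}
predicted = {"a.java": True, "b.java": False, "c.java": True, "d.java": True}
acc = accuracy(actual, predicted)
```

Here three of the four files are classified correctly, so `acc` is 0.75.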
51. Ranking of files
Using M5 tree regression algorithm
Spearman correlation
Predictor based on 7 metrics: 0.966
Zimmermann’s pre-release defects: 0.907
52. Conclusion
Defect prediction can be improved with:
Historical information
Non-linear function
60. Complexity = Entropy
The intuition is that one change affecting one file only is simpler than one affecting many different files, as the developer who has to perform the change has to keep track of all of them.
The more the changes are distributed, the higher the entropy.
Shannon Entropy
Hassan proposed to use Shannon Entropy, defined as

$$H_n(P) = -\sum_{k=1}^{n} p_k \log_2 p_k \qquad (1)$$

where $p_k$ is the probability that the file $k$ changes during the considered time interval. Figure 4 shows an example with three files and three time intervals.
[Figure: changes to File A, File B, and File C over the time intervals t1, t2, and t3 (2 weeks each)]
Example for an interval with four changes, two to File A and one each to File B and File C:

$$H_n(P) = -\tfrac{2}{4}\log_2\tfrac{2}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} = 1.5$$
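As a quick check of the entropy formula, the following Python sketch computes it from per-file change counts. The function name `change_entropy` is an assumption; the counts 2, 1, 1 are the ones from the slide's example, and the second call is an added contrast case.

```python
import math


def change_entropy(change_counts):
    """Shannon entropy (in bits) of how changes are spread over files."""
    total = sum(change_counts.values())
    entropy = 0.0
    for count in change_counts.values():
        if count:                      # files without changes contribute 0
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


# Two changes to File A, one each to File B and File C.
h = change_entropy({"File A": 2, "File B": 1, "File C": 1})
# All changes in a single file: nothing to keep track of across files.
h_single = change_entropy({"File A": 4, "File B": 0, "File C": 0})
```

For the counts 2, 1, 1 the function returns 1.5 bits, and it returns 0 when every change lands in the same file.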
62. Complexity = Entropy
The intuition is that one change affecting one file only is simpler than one affecting many different files, as the developer who has to perform the change has to keep track of all of them.
The more the changes are distributed, the higher the entropy.
Shannon Entropy
Hassan proposed to use Shannon Entropy, defined as

$$H_n(P) = -\sum_{k=1}^{n} p_k \log_2 p_k \qquad (1)$$

where $p_k$ is the probability that the file $k$ changes during the considered time interval. Figure 4 shows an example with three files and three time intervals.
[Figure: changes to File A, File B, and File C over the time intervals t1, t2, and t3 (2 weeks each); one interval is annotated with its entropy H and the next with the question of whether its entropy is higher]
63. History of Complexity Metric (HCM)
To use the entropy of code change as a bug predictor, Hassan defined the History of Complexity Metric (HCM) of a file $j$ as

$$HCM_{\{a,..,b\}}(j) = \sum_{i \in \{a,..,b\}} HCPF_i(j) \qquad (3)$$

where $\{a,..,b\}$ is a set of evolution periods and $HCPF$ is defined as:

$$HCPF_i(j) = \begin{cases} c_{ij} \cdot H_i, & j \in F_i \\ 0, & \text{otherwise} \end{cases} \qquad (4)$$

where $i$ is a period with entropy $H_i$, $F_i$ is the set of files modified in the period $i$, and $j$ is a file belonging to $F_i$. According to the definition of $c_{ij}$, there are three types of HCM:
(1) $c_{ij} = 1$: every file modified in the considered period $i$ gets the entropy of the system in the considered time interval. This approach defines HCM.
(2) $c_{ij} = p_j$: each modified file gets the entropy of the system weighted with the probability of the file being modified. This approach defines WHCM.
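The definitions of HCM and WHCM can be sketched as follows. This is an illustrative reading of equations (3) and (4), not Hassan's implementation: the function names, the data layout (one dictionary of change counts per period), and the estimate of the modification probability as a file's share of the changes in a period are assumptions.

```python
import math


def entropy(counts):
    """Shannon entropy (bits) of the change distribution in one period."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c)


def hcpf(counts, file, weighted=False):
    """HCPF_i(j): c_ij * H_i if file j was modified in period i, else 0."""
    if counts.get(file, 0) == 0:
        return 0.0
    # c_ij = 1 for HCM; c_ij = share of the period's changes for WHCM.
    c_ij = counts[file] / sum(counts.values()) if weighted else 1.0
    return c_ij * entropy(counts)


def hcm(periods, file, weighted=False):
    """Sum of HCPF over all periods; weighted=True gives WHCM."""
    return sum(hcpf(counts, file, weighted) for counts in periods)


# Hypothetical change counts per file for two periods.
periods = [{"A": 2, "B": 1, "C": 1}, {"A": 1, "B": 1}]
hcm_a = hcm(periods, "A")                    # 1.5 + 1.0
whcm_a = hcm(periods, "A", weighted=True)    # 0.5 * 1.5 + 0.5 * 1.0
hcm_c = hcm(periods, "C")                    # only modified in period 1
```

A file that is not touched in a period receives nothing from that period, which is why file C only collects the entropy of the first period.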
66. HCM with decay factors
In EDHCM (Exponentially Decayed HCM), entropies for earlier periods of time, i.e., earlier modifications, have their contribution reduced exponentially over time, modelling an exponential decay model. EDHCM was introduced by Hassan. Similarly, LDHCM (Linearly Decayed) and LGDHCM (LoGarithmically Decayed) have their contributions reduced over time in a respectively linear and logarithmic fashion. Both are novel. The definitions of the variants follow:

$$EDHCM_{\{a,..,b\}}(j) = \sum_{i \in \{a,..,b\}} \frac{HCPF_i(j)}{e^{\phi_1 \cdot (|\{a,..,b\}| - i)}}$$

$$LDHCM_{\{a,..,b\}}(j) = \sum_{i \in \{a,..,b\}} \frac{HCPF_i(j)}{\phi_2 \cdot (|\{a,..,b\}| + 1 - i)}$$

$$LGDHCM_{\{a,..,b\}}(j) = \sum_{i \in \{a,..,b\}} \frac{HCPF_i(j)}{\phi_3 \cdot \ln(|\{a,..,b\}| + 1.01 - i)}$$

where $\phi_1$, $\phi_2$ and $\phi_3$ are the decay factors.
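The three decay variants differ only in the divisor applied to each period's contribution. The sketch below makes that explicit. It assumes the periods are numbered 1 to n with the oldest first, so that |{a,..,b}| = n; the function name, the `kind` parameter, and the example values are illustrative assumptions.

```python
import math


def decayed_hcm(hcpf_values, phi, kind):
    """Decayed HCM for one file.

    hcpf_values[i - 1] is HCPF_i(j) for periods i = 1..n (oldest first).
    """
    n = len(hcpf_values)
    total = 0.0
    for i, value in enumerate(hcpf_values, start=1):
        if kind == "exponential":      # EDHCM
            divisor = math.exp(phi * (n - i))
        elif kind == "linear":         # LDHCM
            divisor = phi * (n + 1 - i)
        elif kind == "logarithmic":    # LGDHCM
            divisor = phi * math.log(n + 1.01 - i)
        else:
            raise ValueError(kind)
        total += value / divisor
    return total


values = [1.5, 1.0, 0.5]               # hypothetical HCPF values
edhcm = decayed_hcm(values, 1.0, "exponential")
ldhcm = decayed_hcm(values, 1.0, "linear")
lgdhcm = decayed_hcm(values, 1.0, "logarithmic")
```

In all three variants the most recent period has the smallest divisor, so recent modifications weigh more than old ones.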
67. Experimental settings
System Start date #Subsystem
NetBSD March 1993 235
FreeBSD June 1993 152
OpenBSD Oct 1995 265
Postgres July 1996 280
KDE April 1997 108
KOffice April 1998 158
Entropy metrics
Number of past modifications
Number of past defects
Subsystem level
68. Models fitting in terms of R²
[Bar chart: R² (scale from 0 to 0.6) of the models based on past defects, past changes, HCM, WHCM, and EDHCM, for NetBSD, FreeBSD, OpenBSD, Postgres, KDE, and KOffice]
69. Prediction error
Number of past changes vs Entropy
[Bar chart: "#Changes - WHCM (%)" and "#Changes - EDHCM (%)" for NetBSD, FreeBSD, OpenBSD, Postgres, KDE, and KOffice, on a scale from 0 to 37.5]
70. Prediction error
Number of past defects vs Entropy
[Bar chart: "#Defects - WHCM (%)" and "#Defects - EDHCM (%)" for NetBSD, FreeBSD, OpenBSD, Postgres, KDE, and KOffice, on a scale from -20.0 to 40.0]
72. Conclusion
Models based on the entropy of changes are better defect predictors than the number of past changes or defects
A complex code change process negatively affects its product, the software system
75. Epilogue
We can predict defects
but
results still have limited practical usability
76. Epilogue
Predicting bugs is very difficult
because developing code is a human activity
80. Epilogue
A human activity influenced by too many factors
FOCUS
How complex was the piece of code?
How tested?
How experienced was the developer?
No data yet
How tired was the developer?
How integrated was the developer in the team?
Did he like his job?
93. An ideal bug life cycle
[State diagram with the states Unconfirmed, New, Assigned, Resolved, Verified, and Closed]
95. A bit less ideal
[State diagram with the states Unconfirmed, New, Assigned, Resolved, Verified, Closed, and Reopened]
97. The reality
[State diagram with the states Unconfirmed, New, Assigned, Resolved, Verified, Closed, and Reopened]
98. All bug properties can change over time
Bug
  Problem: id, description, product, component
  Criticality: severity, priority
  Involved people: assignedTo, reporter, qa
  State: Status, Resolution
  ...
100. All bug properties can change over time
[Diagram: two snapshots of a bug, each with the groups Problem, Criticality, Involved people, and State, connected by an Activity on AssignedTo; the people shown are steve, mike, and john. Below, the bug history is drawn as a sequence of such snapshots.]
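One way to model a bug whose properties change over time is as a sequence of immutable snapshots, each produced by applying one activity to the previous one. The sketch below is only an illustration of that idea: the class `BugSnapshot`, the helper `apply_activity`, and all field values are assumptions, and only the names steve and mike come from the slide (the direction of the reassignment is assumed).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BugSnapshot:
    """State of a bug report at one point in time."""
    id: int
    severity: str
    priority: str
    assigned_to: str
    status: str
    resolution: str = ""


def apply_activity(snapshot, field, new_value):
    """Return the next snapshot after one field changes."""
    return replace(snapshot, **{field: new_value})


# Hypothetical bug history built from two activities.
first = BugSnapshot(1, "major", "P2", "steve", "Assigned")
history = [first]
history.append(apply_activity(history[-1], "assigned_to", "mike"))
history.append(apply_activity(history[-1], "status", "Resolved"))
```

Because the snapshots are frozen, earlier states stay available, so the whole history of the bug can be inspected and not only its latest state.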
113. System radiography view
“Where (in the system and in its history) are the open bugs located?”
Visualization principle
• System decomposition on the y axis: Product :: Component
• (x,y) : (time, component); the x position is the time interval, the y position is the component
• Color: # open bugs
[Figure: grid with Time on the x axis and, on the y axis, the components grouped by product (Product A with Component 1 and Component 2, Product B)]
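The data behind such a view is a count of open bugs for every (time interval, component) cell. The following sketch shows one way to compute it. It is not the tool's implementation: the function name, the tuple layout, the rule that a bug counts if it was open at any point during the interval, and the example bugs (including the component name "Mailnews :: Composition") are assumptions.

```python
from collections import Counter
from datetime import date


def open_bug_grid(bugs, interval_starts):
    """Count open bugs per (interval index, component) cell.

    bugs: list of (component, opened, closed), closed=None if still open.
    interval_starts: sorted dates; interval k runs from interval_starts[k]
    up to (not including) interval_starts[k + 1].
    """
    grid = Counter()
    for k in range(len(interval_starts) - 1):
        start, end = interval_starts[k], interval_starts[k + 1]
        for component, opened, closed in bugs:
            # The bug was open at some point during the interval.
            if opened < end and (closed is None or closed >= start):
                grid[(k, component)] += 1
    return grid


# Hypothetical bugs and monthly intervals.
bugs = [
    ("Browser :: Networking", date(2002, 11, 5), date(2003, 1, 20)),
    ("Browser :: Networking", date(2002, 12, 1), None),
    ("Mailnews :: Composition", date(2003, 1, 10), date(2003, 1, 15)),
]
starts = [date(2002, 11, 1), date(2002, 12, 1),
          date(2003, 1, 1), date(2003, 2, 1)]
grid = open_bug_grid(bugs, starts)
```

Each key of `grid` corresponds to one cell of the view, and its value is what the color encodes.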
117. Mozilla example [Sep ‘98 - Apr ‘03]
[Figure: system radiography view of Mozilla, with the Browser and Mailnews products labeled]
128. The Bug Watch View
“How are bugs characterized with respect to their history?”
Visualization principle
Time: Beginning 10/19/1999, End 10/16/2001
• 3 Layers
  • Status
  • Activity
  • Severity
Status      From        To
Assigned    10/19/99    12/21/99
Resolved    12/21/99    1/31/00
Reopened    1/31/00     2/6/00
New         2/6/00      6/5/00
...         ...         ...
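The status layer is driven by a list of status intervals like the one in the table. As a small sketch, the Python below stores the four rows shown on the slide and derives how long the bug stayed in each status. The function names are assumptions, and the dates are read as month/day/year with two-digit years expanded.

```python
from datetime import date

# Status intervals from the slide's table (rows elided there are omitted).
status_history = [
    ("Assigned", date(1999, 10, 19), date(1999, 12, 21)),
    ("Resolved", date(1999, 12, 21), date(2000, 1, 31)),
    ("Reopened", date(2000, 1, 31), date(2000, 2, 6)),
    ("New", date(2000, 2, 6), date(2000, 6, 5)),
]


def days_per_status(history):
    """Total number of days the bug spent in each status."""
    totals = {}
    for status, start, end in history:
        totals[status] = totals.get(status, 0) + (end - start).days
    return totals


def status_on(history, day):
    """Status of the bug on a given day, or None if outside the history."""
    for status, start, end in history:
        if start <= day < end:
            return status
    return None


durations = days_per_status(status_history)
```

For these rows the bug spent 63 days Assigned, 41 days Resolved, 6 days Reopened, and 120 days New.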
130. Examples from Mozilla
Browser :: Networking [Nov ‘02 - Apr ‘03]
Reopened 4 times
The developer in charge of fixing it changed 6 times
Many people added to the CC
132. Examples from Mozilla
Browser :: Networking [Nov ‘02 - Apr ‘03]
One status but many activities (addition of CC)
133. Conclusion
Analyzing a bug database
Provides useful insights into a software system
Helps in detecting the most harmful bugs