Software Defect Prediction on Unlabeled Datasets

Software Defect Prediction
on Unlabeled Datasets
- PhD Thesis Defence -
July 23, 2015
Jaechang Nam
Department of Computer Science and Engineering
HKUST

• General question of software defect
prediction
– Can we identify defect-prone entities (source
code file, binary, module, change,...) in advance?
• # of defects
• buggy or clean
• Why? (applications)
– Quality assurance for large software
(Akiyama@IFIP’71)
– Effective resource allocation
• Testing (Menzies@TSE`07, Kim@FSE`15)
• Code review (Rahman@FSE’11)
2

3
Predict
Training
?
?
Model
Project A
: Metric value
: Buggy-labeled instance
: Clean-labeled instance
?: Unlabeled instance
Related Work
Munson@TSE`92, Basili@TSE`95, Menzies@TSE`07,
Hassan@ICSE`09, Bird@FSE`11,D’ambros@EMSE112
Lee@FSE`11,...

What if labeled instances do not
exist?
4
?
?
?
?
?
Project X
Unlabeled
Dataset
: Metric value
Model
New projects
Projects lacking in
historical data

This problem is...
5
?
?
?
?
?
Project X
Unlabeled
Dataset
: Metric value

Existing Solutions?
6
?
?
?
?
?
(New) Project X
Unlabeled
Dataset
: Metric value

Solution 1
Cross-Project Defect Prediction
(CPDP)
7
?
?
?
?
?
Training
Predict
Model
Project A
(source)
Project X
(target)
Unlabeled
Dataset
: Metric value
: Buggy-labeled instance
: Clean-labeled instance
Related Work
Watanabe@PROMISE08, Turhan@EMSE`09
Zimmermann@FSE`09, Ma@IST`12,
Zhang@MSR`14
Challenges
Same metric set
(same feature space)
• Worse than WPDP
• Heterogeneous
metrics between
source and target
Only 2% out of 622 CPDP
combinations worked.
(Zimmermann@FSE`09)

Solution 2
Using Only Unlabeled Datasets
8
?
?
?
?
?
Project X
Unlabeled
Dataset
Training
Model
Predict
Related Work
Zhong@HASE`04,
Catal@ITNG`09
• Manual Effort
Challenge
Human-intervention

9
Sub-problems Proposed Techniques
CPDP comparable to WPDP? Transfer Defect Learning (TCA+)
CPDP across projects with
heterogeneous metric sets?
Heterogeneous Defect Prediction (HDP)
DP using only unlabeled
datasets
without human effort?
CLAMI

10
datasets
CLAMI

CPDP
• Reason for poor prediction
performance of CPDP
– Different distributions of source and target
datasets (Pan et al@TKDE`09)
11

TCA+
12
Source Target
Oops, we are different! Let’s meet at another world!
(Projecting datasets into a latent feature space)
New Source New Target
Normalize US together!Normalization
Transfer
Component
Analysis (TCA)
+
Make different distributions between source and target
similar!

Data Normalization
• Adjust all metric values in the same
scale
– E.g., Make Mean = 0 and Std = 1
• Known to be helpful for classification
algorithms to improve prediction
performance (Han@`12).
13

Normalization Options
• N1: Min-max Normalization (max=1, min=0) [Han et
al., 2012]
• N2: Z-score Normalization (mean=0, std=1) [Han et
al., 2012]
• N3: Z-score Normalization only using source mean
and standard deviation
• N4: Z-score Normalization only using target mean
and standard deviation
• NoN: No normalization
14

Decision Rules for Normalization
• Find a suitable normalization
• Steps
– #1: Characterize a dataset
– #2: Measure similarity
between source and target datasets
– #3: Decision rules
15

#1: Characterize a dataset
3
1
…
Dataset A Dataset B
2
4
5
8
9
6
11
d1,2
d1,5
d1,3
d3,11
3
1
…
2
4
5
8
9
6
11
d2,6
d1,2
d1,3
d3,11
DIST={dij : i,j, 1 ≤ i < n, 1 < j ≤ n, i
< j}
A
16

#2: Measure Similarity between source and
target
3
1
…
Dataset A Dataset B
2
4
5
8
9
6
11
d1,2
d1,5
d1,3
d3,11
3
1
…
2
4
5
8
9
6
11
d2,6
d1,2
d1,3
d3,11
DIST={dij : i,j, 1 ≤ i < n, 1 < j ≤ n, i
< j}
A
17
• Minimum (min) and maximum (max) values of
DIST
• Mean and standard deviation (std) of DIST
• The number of instances

#3: Decision Rules
• Rule #1
– Mean and Std are same  NoN
• Rule #2
– Max and Min are different  N1 (max=1, min=0)
• Rule #3, #4
– Std and # of instances are different
 N3 or N4 (src/tgt mean=0, std=1)
• Rule #5
– Default  N2 (mean=0, std=1)
18

TCA
• Key idea
Source Target
19

TCA (cont.)
20
Pan et al.@TNN`10, Domain Adaptation via Transfer Component Analysis
Target domain data
Source domain data
Buggy source instances
Clean source instances
Buggy target instances
Clean target instances

TCA (cont.)
21
TCA
Pan et al.@TNN`10, Domain Adaptation via Transfer Component Analysis

TCA+
22
Source Target
Normalize us together with a suitable
option!
Normalization
Transfer
Component
Analysis (TCA)
+
Make different distributions between source and target
similar!

Research Questions
• RQ1
– What is the cross-project prediction performance
of TCA/TCA+ compared to WPDP?
• RQ2
– What is the cross-project prediction performance
of TCA/TCA+ compared to that CPDP without
TCA/TCA+?
24

Experimental Setup
• 8 software subjects
• Machine learning algorithm
– Logistic regression
ReLink (Wu et al.@FSE`11)
Projects
# of metrics
(features)
Apache
26
(Source code)
Safe
ZXing
AEEEM (D’Ambros et al.@MSR`10)
Projects
# of metrics
(features)
Apache Lucene
(LC)
61
(Source code,
Churn,
Entropy,…)
Equinox (EQ)
Eclipse JDT
Eclipse PDE UI
Mylyn (ML)
25

Experimental Design
Test set
(50%)
Training set
(50%)
Within-project defect prediction (WPDP)
26

Experimental Design
Target project (Test set)
Source project (Training set)
Cross-project defect prediction (CPDP)
27

Experimental Design
Target project (Test set)
Source project (Training set)
Cross-project defect prediction with TCA/TCA+
TCA/TCA+
28

ReLink Result
Representative 3 out of 6 combinations
*CPDP: Cross-project defect prediction without
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
F-measure
WPDP CPDP TCA TCA+
Safe  Apache Apache  Safe Safe  ZXing
WPDP CPDP TCA TCA+ WPDP CPDP TCA TCA+
30

ReLink Result
F-measure
Cross
Source  Target
Safe  Apache
Zxing  Apache
Apache  Safe
Zxing  Safe
Apache  ZXing
Safe  ZXing
Average
CPDP
0.52
0.69
0.49
0.59
0.46
0.10
0.49
TCA
0.64
0.64
0.72
0.70
0.45
0.42
0.59
TCA+
0.64
0.72
0.72
0.64
0.49
0.53
0.61
WPDP
0.64
0.62
0.33
0.53
*CPDP: Cross-project defect prediction without 31

AEEEM Result
Representative 3 out of 20 combinations
*CPDP: Cross-project defect prediction without TCA/TCA+
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
F-measure
WPDP CPDP TCA TCA+
JDT  EQ PDE  LC PDE  ML
WPDP CPDP TCA TCA+ WPDP CPDP TCA TCA+
32

AEEEM Result
F-measure
Cross
Source  Target
JDT  EQ
LC  EQ
ML  EQ
…
PDE  LC
EQ  ML
JDT  ML
LC  ML
PDE ML
…
Average
CPDP
0.31
0.50
0.24
…
0.33
0.19
0.27
0.20
0.27
…
0.32
TCA
0.59
0.62
0.56
…
0.27
0.62
0.56
0.58
0.48
…
0.41
TCA+
0.60
0.62
0.56
…
0.33
0.62
0.56
0.60
0.54
…
0.41
WPDP
0.58
…
0.37
0.30
…
0.42
33

Related Work
Transfer
learning
Metric
Compensation
NN Filter TNB TCA+
Preprocessing N/A
Feature
selection,
Log-filter
Log-filter Normalization
Machine
learner
C4.5 Naive Bayes TNB
Logistic
Regression
# of Subjects 2 10 10 8
# of
predictions
2 10 10 26
Avg. f-
measure
0.67
(W:0.79,
C:0.58)
0.35
(W:0.37,
C:0.26)
0.39
(NN: 0.35,
C:0.33)
0.46
(W:0.46,
C:0.36)
Citation Watanabe@PROMISE
`08
Turhan@ESEJ`0
9
Ma@IST`12 Nam@ICSE`13
* NN = Nearest neighbor, W = Within, C = Cross
34

35
datasets
CLAMI

Motivation
36
?
?
?
?
?
Training
Test
Model
Project A
(source)
Project B
(target)
Same metric set
(same feature space)
CPDP
In experiments of TCA+
Datasets in ReLink Datasets in AEEEMX
Unlabeled
Dataset
Apache
Safe
JDTX

Motivation
37
?
Training
Test
Model
Project A
(source)
Project C
(target)
?
?
?
?
?
?
?
Heterogeneous metric sets
(different feature spaces
or different domains)
Possible to Reuse all the existing defect datasets for CPDP!

Key Idea
• Most defect prediction metrics
– Measure complexity of software and its
development process.
• e.g.
– The number of developers touching a source code file
(Bird@FSE`11)
– The number of methods in a class (D’Ambroas@ESEJ`12)
– The number of operands (Menzies@TSE`08)
More complexity implies more defect-proneness
(Rahman@ICSE`13)
38

Key Idea
• Most defect prediction metrics
– Measure complexity of software and its
• e.g.
(Bird@FSE`11)
More complexity implies more defect-proneness
(Rahman@ICSE`13)
39
Match source and target metrics that have similar
distribution

- Overview -
40
X1 X2 X3 X4 Label
1 1 3 10 Buggy
8 0 1 0 Clean
⋮ ⋮ ⋮ ⋮ ⋮
9 0 1 1 Clean
Metric
Matching
Source: Project A Target: Project B
Cross-
prediction Model
Build
(training)
Predict
(test)
Metric
Selection
Y1 Y2 Y3 Y4 Y5 Y6 Y7 Label
3 1 1 0 2 1 9 ?
1 1 9 0 2 3 8 ?
⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
0 1 1 1 2 1 1 ?
1 3 10 Buggy
8 1 0 Clean
⋮ ⋮ ⋮ ⋮
9 1 1 Clean
1 3 10 Buggy
8 1 0 Clean
⋮ ⋮ ⋮ ⋮
9 1 1 Clean
9 1 1 ?
8 3 9 ?
⋮ ⋮ ⋮ ⋮
1 1 1 ?

Metric Selection
• Why? (Guyon@JMLR`03)
– Select informative metrics
• Remove redundant and irrelevant metrics
– Decrease complexity of metric matching
combinations
• Feature Selection Approaches
(Gao@SPE`11,Shivaji@TSE`13)
– Gain Ratio
– Chi-square
– Relief-F
– Significance attribute evaluation
41

Metric Matching
42
Source Metrics Target Metrics
X1
X2
Y1
Y2
0.8
0.5
* We can apply different cutoff values of matching score
* It can be possible that there is no matching at all.

Compute Matching Score
KSAnalyzer
• Use p-value of Kolmogorov-Smirnov Test
(Massey@JASA`51)
43
Matching Score M of i-th source and j-th target metrics:
Mij = pij

Heterogeneous Defect Prediction
- Overview -
44
X1 X2 X3 X4 Label
1 1 3 10 Buggy
8 0 1 0 Clean
⋮ ⋮ ⋮ ⋮ ⋮
9 0 1 1 Clean
Metric
Matching
Source: Project A Target: Project B
Cross-
prediction Model
Build
(training)
Predict
(test)
Metric
Selection
Y1 Y2 Y3 Y4 Y5 Y6 Y7 Label
3 1 1 0 2 1 9 ?
1 1 9 0 2 3 8 ?
⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
0 1 1 1 2 1 1 ?
1 3 10 Buggy
8 1 0 Clean
⋮ ⋮ ⋮ ⋮
9 1 1 Clean
1 3 10 Buggy
8 1 0 Clean
⋮ ⋮ ⋮ ⋮
9 1 1 Clean
9 1 1 ?
8 3 9 ?
⋮ ⋮ ⋮ ⋮
1 1 1 ?

Baselines
• WPDP
• CPDP-CM (Turhan@EMSE`09,Ma@IST`12,He@IST`14)
– Cross-project defect prediction using only
common metrics between source and target
datasets
• CPDP-IFS (He@CoRR`14)
– Cross-project defect prediction on
Imbalanced Feature Set (i.e. heterogeneous
metric set)
– 16 distributional characteristics of values of
an instance as features (e.g., mean, std,
maximum,...)
46

Research Questions (RQs)
• RQ1
– Is heterogeneous defect prediction comparable
to WPDP?
• RQ2
– Is heterogeneous defect prediction comparable
to CPDP-CM?
• RQ3
– Is Heterogeneous defect prediction comparable
to CPDP-IFS?
47

Benchmark Datasets
Group Dataset
# of instances # of
metrics
Granularity
All Buggy (%)
AEEEM
EQ 325 129 (39.7%)
61 Class
JDT 997 206 (20.7%)
LC 399 64 (9.36%)
ML 1862 245 (13.2%)
PDE 1492 209 (14.0%)
MORP
H
ant-1.3 125 20 (16.0%)
20 Class
arc 234 27 (11.5%)
camel-1.0 339 13 (3.8%)
poi-1.5 237 141 (75.0%)
redaktor 176 27 (15.3%)
skarbonka 45 9 (20.0%)
tomcat 858 77 (9.0%)
velocity-1.4 196 147 (75.0%)
xalan-2.4 723 110 (15.2%)
xerces-1.2 440 71 (16.1%)
48
Group Dataset
# of instances # of
metrics
Granularity
All Buggy (%)
ReLink
Apache 194 98 (50.5%)
26 FileSafe 56 22 (39.3%)
ZXing 399
118
(29.6%)
NASA
cm1 327 42 (12.8%)
37 Function
mw1 253 27 (10.7%)
pc1 705 61 (8.7%)
pc3 1077
134
(12.4%)
pc4 1458
178
(12.2%)
SOFTLA
B
ar1 121 9 (7.4%)
29 Function
ar3 63 8 (12.7%)
ar4 107 20 (18.7%)
ar5 36 8 (22.2%)
ar6 101 15 (14.9%)
600 prediction combinations in total!

Experimental Settings
• Logistic Regression
• HDP vs. WPDP, CPDP-CM, and CPDP-IFS
49
Test set
(50%)
Training set
(50%)
Project
1
Project
2
Project
n
...
...
X 1000
Project
1
Project
2
Project
n
...
...
CPDP-CM
CPDP-IFS
HDP
WPDP
Project A

Evaluation Measures
• False Positive Rate = FP/(TN+FP)
• True Positive Rate = Recall
• AUC (Area Under receiver operating characteristic Curve)
50
False Positive rate
TruePositiverate
0
1
1

Evaluation Measures
• Win/Tie/Loss (Valentini@ICML`03, Li@JASE`12, Kocaguneli@TSE`13)
– Wilcoxon signed-rank test (p<0.05) for 1000
prediction results
– Win
• # of outperforming HDP prediction combinations with
statistical significance. (p<0.05)
– Tie
• # of HDP prediction combinations with no statistical
significance. (p≥0.05)
– Loss
• # of outperforming baseline prediction results with
statistical significance. (p>0.05)
51

Prediction Results in median
AUC
Target WPDP
CPDP-
CM
CPDP-
IFS
HDPKS
(cutoff
=0.05)
EQ 0.583 0.776 0.461 0.783
JDT 0.795 0.781 0.543 0.767
MC 0.575 0.636 0.584 0.655
ML 0.734 0.651 0.557 0.692*
PDE 0.684 0.682 0.566 0.717
ant-1.3 0.670 0.611 0.500 0.701
arc 0.670 0.611 0.523 0.701
camel-1.0 0.550 0.590 0.500 0.639
poi-1.5 0.707 0.676 0.606 0.537
redaktor 0.744 0.500 0.500 0.537
skarbonka 0.569 0.736 0.528 0.694*
tomcat 0.778 0.746 0.640 0.818
velocity-
1.4
0.725 0.609 0.500 0.391
xalan-2.4 0.755 0.658 0.499 0.751
xerces-1.2 0.624 0.453 0.500 0.489
53
Target WPDP
CPDP-
CM
CPDP-
IFS
HDPKS
(cutoff
=0.05)
Apache 0.714 0.689 0.635 0.717*
Safe 0.706 0.749 0.616 0.818*
ZXing 0.605 0.619 0.530 0.650*
cm1 0.653 0.622 0.551 0.717*
mw1 0.612 0.584 0.614 0.727
pc1 0.787 0.675 0.564 0.752*
pc3 0.794 0.665 0.500 0.738*
pc4 0.900 0.773 0.589 0.682*
ar1 0.582 0.464 0.500 0.734*
ar3 0.574 0.862 0.682 0.823*
ar4 0.657 0.588 0.575 0.816*
ar5 0.804 0.875 0.585 0.911*
ar6 0.654 0.611 0.527 0.640
All 0.657 0.636 0.555 0.724*
HDPKS: Heterogeneous defect prediction using KSAnalyzer

Win/Tie/Loss Results
Target
Against
WPDP
Against
CPDP-CM
Against
CPDP-IFS
W T L W T L W T L
EQ 4 0 0 2 2 0 4 0 0
JDT 0 0 5 3 0 2 5 0 0
LC 6 0 1 3 3 1 3 1 3
ML 0 0 6 4 2 0 6 0 0
PDE 3 0 2 2 0 3 5 0 0
ant-1.3 6 0 1 6 0 1 5 0 2
arc 3 1 0 3 0 1 4 0 0
camel-1.0 3 0 2 3 0 2 4 0 1
poi-1.5 2 0 2 3 0 1 2 0 2
redaktor 0 0 4 2 0 2 3 0 1
skarbonka 11 0 0 4 0 7 9 0 2
tomcat 2 0 0 1 1 0 2 0 0
velocity-
1.4
0 0 3 0 0 3 0 0 3
xalan-2.4 0 0 1 1 0 0 1 0 0
xerces-1.2 0 0 3 3 0 0 1 0 2 54
Target
Against
WPDP
Against
CPDP-CM
Against
CPDP-IFS
W T L W T L W T L
Apach
e
6 0 5 8 1 2 9 0 2
Safe 14 0 3 12 0 5 15 0 2
ZXing 8 0 0 6 0 2 7 0 1
cm1 7 1 2 8 0 2 9 0 1
mw1 5 0 1 4 0 2 4 0 2
pc1 1 0 5 5 0 1 6 0 0
pc3 0 0 7 7 0 0 7 0 0
pc4 0 0 7 2 0 5 7 0 0
ar1 14 0 1 14 0 1 11 0 4
ar3 15 0 0 5 0 10 10 2 3
ar4 16 0 0 14 1 1 15 0 1
ar5 14 0 4 14 0 4 16 0 2
ar6 7 1 7 8 4 3 12 0 3
Total 147 3 72 147 14 61 182 3 35
%
66.2
%
1.4%
32.4
%
66.2
%
6.3%
27.5
%
82.0
%
1.3%
16.7
%

Matched Metrics (Win)
55
MetricValues
Distribution
(Source metric: RFC-the number of method invoked by a class, Target metric: the number of operand
Matching Score = 0.91
AUC = 0.946 (ant1.3  ar5)

Matched Metrics (Loss)
56
MetricValues
Distribution
(Source metric: LOC, Target metric: average number of LOC in a method)
Matching Score = 0.13
AUC = 0.391 (Safe  velocity-1.4)

Different Feature Selections
(median AUCs, Win/Tie/Loss)
57
Approach
Against
WPDP
Against
CPDP-CM
Against
CPDP-IFS
HDP
AUC Win% AUC Win% AUC Win% AUC
Gain Ratio 0.657 63.7% 0.645 63.2% 0.536 80.2% 0.720
Chi-Square 0.657 64.7% 0.651 66.4% 0.556 82.3% 0.727
Significanc
e
0.657 66.2% 0.636 66.2% 0.553 82.0% 0.724
Relief-F 0.670 57.0% 0.657 63.1% 0.543 80.5% 0.709
None 0.657 47.3% 0.624 50.3% 0.536 66.3% 0.663

Results in Different Cutoffs
58
Cutoff
Against
WPDP
Against
CPDP-CM
Against
CPDP-IFS
HDP Target
Coverage
AUC Win% AUC Win% AUC Win% AUC
0.05 0.657 66.2% 0.636 66.2% 0.553 82.4% 0.724* 100%
0.90 0.657 100% 0.761 71.4% 0.624 100% 0.852* 21%

59
datasets
CLAMI

Motivation
60
- Loss result of HDP

Motivation
61
- Loss result of HDP
Still difficult to make different
distribution similar!

Motivation
62
Training
Predict
Unlabeled Dataset
What if....
?

How?
• Recall the trend of defect prediction metrics
– Measures complexity of software and its
• e.g.
(Bird@FSE`11)
Higher metric values imply more defect-proneness
(Rahman@ICSE`13)
63

How?
• Recall this trend of defect prediction metrics
– Measures complexity of software and its
• e.g.
(Bird@FSE`11)
Higher metric values imply more defect-proneness
(Rahman@ICSE`13)
64
(1) Label instances that have higher metric values as
buggy!
(2) Generate a training set by removing metrics and
instances that violates (1).

CLAMI Approach Overview
65
Unlabeled
Dataset
(1) Clustering
(2) LAbeling
(3) Metric Selection
(4) Instance Selection
(5) Metric
Selection
CLAMI
Model
Build
Predict
Training dataset
Test dataset

CLAMI Approach
- Clustering and Labeling Clusters -
66
Cluster, K=3
Unlabeled Dataset
X1 X2 X3 X4 X5 X6 X7 Label
3 1 3 0 5 1 9 ?
1 1 2 0 7 3 8 ?
2 3 2 5 5 2 1 ?
0 0 8 1 0 1 9 ?
1 0 2 5 6 10 8 ?
1 4 1 1 7 1 1 ?
1 0 1 0 0 1 7 ?
1 1 2 1 5 1 8Median
Inst.
A
Inst. B
Inst.
C
Inst.
D
Inst. E
Inst. F
inst.
G
Instance
s
K = the number of higher metric
values that are greater than Median.
C
Cluster, K=4
A, E
B, D, F
Cluster, K=2
G
Cluster, K=0
(1) Clustering (2) Labeling Clusters
Higher values : buggy clusters : clean clusters

CLAMI Approach
- Metric Selection -
67
{X1,X4}
X1 X2 X3 X4 X5 X6 X7 Label
3 1 3 0 5 1 9 Buggy
1 1 2 0 7 3 8 Clean
2 3 2 5 5 2 1 Buggy
0 0 8 1 0 1 9 Clean
1 0 2 5 6 10 8 Buggy
1 4 1 1 7 1 1 Clean
1 0 1 0 0 1 7 Clean
Inst.
A
Inst. B
Inst. C
Inst. D
Inst. E
Inst. F
Inst.
G 1 3 3 1 4 2 3
# of
Violations
Selected Metrics
Violation: a metric value that does not follow its label!
Higher values are bold-facedViolations

CLAMI Approach
- Instance Selection -
68
X1 X4 Label
3 0 Buggy
1 0 Clean
2 5 Buggy
0 1 Clean
1 5 Buggy
1 1 Clean
1 0 Clean
Inst. A
Inst. B
Inst. C
Inst. D
Inst. E
Inst. F
Inst. G
X1 X4 Label
1 0 Clean
2 5 Buggy
0 1 Clean
1 1 Clean
1 0 Clean
Inst. B
Inst. C
Inst. D
Inst. F
Inst. G
Final Training Dataset

CLAMI Approach Overview
69
Unlabeled
Dataset
(1) Clustering
(2) LAbeling
(3) Metric Selection
(4) Instance Selection
(5) Metric
Selection
CLAMI
Model
Build
Predict
Training dataset
Test dataset

Baselines
• Supervised learning model (i.e. WPDP)
• Defect prediction only using unlabeled
datasets
– Expert-based (Zhong@HASE`04)
• Cluster instances by K-Mean into 20 clusters
• A human expert labels each cluster
– Threshold-based (Catal@ITNG`09)
• [LoC, CC, UOP, UOpnd, TOp, TOpnd]
= [65, 10, 25, 40, 125, 70]
– Label an instance whose any metric value is greater
than a threshold value
• Manual effort requires to decide threshold values in
advance.
71

Research Questions (RQs)
• RQ1
– CLAMI vs. Supervised learning model?
• RQ2
– CLAMI vs. Expert-/threshold-based approaches?
(Zhong@HASE`04, Catal@ITNG`09)
72

Benchmark Datasets
Group Dataset
# of instnaces # of
metrics
Prediction
GranularityAll Buggy (%)
NetGene
Httpclient 361
205
(56.8%)
465
(Network,
Change
genealogy)
File
Jackrabbit 542
225
(41.5%)
Lucene 1671
346
(10.7%)
Rhino 253
109
(43.1%)
ReLink
Apache 194 98 (50.5%)
26
(code
complexity)
File
Safe 56
22
(39.29%)
ZXing 399
118
(29.6%)
73

Experimental Settings (RQ1)
- Supervised learning model -
74
Test set (50%)
Training set (50%)
Supervised
Model
(Baseline)
Training
Predict
X 1000
CLAMI
Model
Training
Predict

Experimental Settings (RQ2)
-Comparison to existing approaches -
75
Unlabeled Dataset
CLAMI
Model
Predict
Training
Predict
Threshold-
Based
(Baseline1,
Catal@ITNG`09)
Expert-
Based
(Baseline2,
Zhong@HASE`04)

Measure
• F-measure
• AUC
76

Supervised model vs. CLAMI
Dataset
F-measure AUC
Supervise
d
(w/ labels)
CLAMI
(w/o
labels)
+/-%
Supervise
d
(w/ labels)
CLAMI
(w/o
labels)
+/-%
Httpclient 0.729 0.722 -1.0% 0.727 0.772 +6.2%
Jackrabbi
t
0.649 0.685 +5.5% 0.727 0.751 +3.2%
Lucene 0.508 0.397 -21.8% 0.708 0.595 -15.9%
Rhino 0.639 0.752 +17.7% 0.702 0.777 +10.7%
Apache 0.653 0.720 +10.2% 0.714 0.753 +5.3%
Safe 0.615 0.667 +8.3% 0.706 0.773 +9.5%
ZXing 0.333 0.497 +49.0% 0.605 0.644 +6.4%
Median 0.639 0.685 +7.2% 0.707 0.753 +6.3%
78

Existing approaches vs. CLAMI
f-measure
Dataset Threshold-based Expert-based CLAMI
Httpclient 0.355 0.811 0.756
Jackrabbit 0.184 0.676 0.685
Lucene 0.144 0.000 0.404
Rhino 0.190 0.707 0.731
Apache 0.547 0.701 0.725
Safe 0.308 0.718 0.694
ZXing 0.228 0.402 0.505
Median 0.228 0.701 0.694
79

Distributions of metrics (Safe)
80
Most frequently selected metrics by CLAMI
Metrics with less discriminative power

Distributions of metrics (Lucene)
81
Most frequently selected metrics by CLAMI
Metrics with less discriminative power

82
datasets
CLAMI

Conclusion
83
Sub-problems
Technique 1:
TCA+
Technique 2:
HDP
Technique 3:
CLAMI
Comparable prediction
performance than WPDP
O
(in f-measure)
O
(in AUC)
O
Able to handle
heterogeneous metric
sets
X O O
Automated
without human effort O O O

Publications at HKUST
• Defect Prediction
– Micro Interaction Metrics for Defect Prediction@FSE`11, Taek Lee,
Jaechang Nam, Donggyun Han, Sunghun Kim and Hoh Peter In
– Transfer Defect Learning@ICSE`13, Jaechang Nam, Sinno Jialin Pan and
Sunghun Kim, Nominee, ACM SIGSOFT Distinguished Paper Award
– Heterogeneous Defect Prediction@FSE`15, Jaechang Nam ann Sunghun Kim
– REMI: Defect Prediction for Efficient API Testing@FSE`15, Mijung Kim,
Jaechang Nam, Jaehyuk Yeon, Soonhwang Choi, and Sunghun Kim, Industrial
Track
– CLAMI: Defect Prediction on Unlabeled Datasets@ASE`15, Jaechang Nam
and Sunghun Kim
• Testing
– Calibrated Mutation Testing@MUTATION`12, Jaechang Nam, David Schuler,
and Andreas Zeller
• Automated bug-fixing
– Automatic Patch Generation Learned from Human-written
Patches@ICSE`13, Dongsun Kim, Jaechang Nam, Jaewoo Song and Sunghun
Kim, ACM SIGSOFT Distinguished Paper Award Winner
84

Cross-
Prediction
Feasibility
Check
CLAMI
NoSame
metric
set?
TCA+
Feasibl
e?
Yes
No
Yes
HDP
Unlabeled
Project
Dataset Existing
Labeled
Project
Datasets
Ensemble model for defect prediction on unlabeled datasets
85

Software Defect Prediction on Unlabeled Datasets

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Semelhante a Software Defect Prediction on Unlabeled Datasets

Semelhante a Software Defect Prediction on Unlabeled Datasets (20)

Mais de Sung Kim

Mais de Sung Kim (20)

Último

Último (20)

Software Defect Prediction on Unlabeled Datasets

Notas do Editor