Reducing Features to Improve Code Change Based
Bug Prediction
Shivkumar Shivaji ∗, E. James Whitehead, Jr. ∗, Ram Akella ∗, Sunghun Kim †
∗School of Engineering
University of California Santa Cruz,
Santa Cruz, CA, USA
Email: {shiv, ejw, ram}@soe.ucsc.edu
†Department of Computer Science
Hong Kong University of Science and Technology
Hong Kong
Email: hunkim@cse.ust.hk
Abstract—Machine learning classifiers have recently emerged
as a way to predict the introduction of bugs in changes made to
source code files. The classifier is first trained on software history,
and then used to predict if an impending change causes a bug.
Drawbacks of existing classifier-based bug prediction techniques
are insufficient performance for practical use and slow prediction
times due to a large number of machine learned features.
This paper investigates multiple feature selection techniques
that are generally applicable to classification-based bug predic-
tion methods. The techniques discard less important features
until optimal classification performance is reached. The total
number of features used for training is substantially reduced,
often to less than 10% of the original. The performance of Naïve
Bayes and Support Vector Machine (SVM) classifiers when using
this technique is characterized on eleven software projects. Naïve
Bayes using feature selection provides significant improvement
in buggy F-measure (21% improvement) over prior change
classification bug prediction results (by the second and fourth
authors [28]). The SVM’s improvement in buggy F-measure
is 9%. Interestingly, an analysis of performance for varying
numbers of features shows that strong performance is achieved
at even 1% of the original number of features.
Index Terms—Reliability; Bug prediction; Machine Learning;
Feature Selection
I. INTRODUCTION
Imagine if you had a little imp, sitting on your shoulder,
telling you whether your latest code change has a bug. Imps,
being mischievous by nature, would occasionally get it wrong,
telling you a clean change was buggy. This would be annoying.
However, say the imp was right 90% of the time. Would you
still listen to him?
While real imps are in short supply, thankfully advanced
machine learning classifiers are readily available (and have
plenty of impish qualities). Prior work by the second and
fourth authors (hereafter called Kim et al. [28]) and similar
work by Hata et al. [22] demonstrate that classifiers, when
trained on historical software project data, can be used to
predict the existence of a bug in an individual file-level
software change. The classifier is first trained on information
found in historical changes, and once trained, can be used
to classify a new impending change as being either buggy
(predicted to have a bug) or clean (predicted to not have a
bug).
We envision a future where software engineers have bug
prediction capability built into their development environment
[36]. Instead of an imp on the shoulder, software engineers
will receive feedback from a classifier on whether a change
they committed is buggy or clean. During this process, a
software engineer completes a change to a source code file,
submits the change to a software configuration management
(SCM) system, then receives a bug prediction back on that
change. If the change is predicted to be buggy, a software
engineer could perform a range of actions to find the latent
bug, including writing more unit tests, performing a code
inspection, or examining similar changes made elsewhere in
the project.
Due to the need for the classifier to have up-to-date training
data, the prediction is performed by a bug prediction service
located on a server machine [36]. Since the service is used
by many engineers, speed is of the essence when performing
bug predictions. Faster bug prediction means better scalability,
since quick response times permit a single machine to service
many engineers.
A bug prediction service must also provide precise predic-
tions. If engineers are to trust a bug prediction service, it must
provide very few “false alarms”, changes that are predicted to
be buggy but which are really clean [6]. If too many clean
changes are falsely predicted to be buggy, developers will lose
faith in the bug prediction system.
The prior change classification bug prediction approach used
by Kim et al. and analogous work by Hata et al. involve the
extraction of “features” (in the machine learning sense, which
differ from software features) from the history of changes
made to a software project. They include everything separated
by whitespace in the code that was added or deleted in
a change. Hence, all variables, comment words, operators,
method names, and programming language keywords are used
as features to train the classifier. Some object-oriented metrics
are also used as part of the training set, together with other
features such as configuration management log messages and
change metadata (size of change, day of change, hour of
change, etc.). This leads to a large number of features: in
the thousands and low tens of thousands. For project histories
which span a thousand revisions or more, this can stretch
into hundreds of thousands of features. Changelog features
were included as a way to capture the intent behind a code
change, for example to scan for bug fixes. Metadata features
were included on the assumption that individual developers
and code submission times are positively correlated with
bugs [50]. Code complexity metrics were captured due to
their effectiveness in previous studies ( [12], [43] amongst
many others). Source keywords are captured en masse as they
contain detailed information for every code change.
The large feature set comes at a cost. Classifiers typically
cannot handle such a large feature set in the presence of
complex interactions and noise. For example, the addition of
certain features can reduce the accuracy, precision, and recall
of a support vector machine. Since the value of a specific
feature is not known a priori, it is necessary to start with
a large feature set and gradually reduce features. Additionally,
the time required to perform classification increases with the
number of features, rising to several seconds per classification
for tens of thousands of features, and minutes for large project
histories. This negatively affects the scalability of a bug
prediction service.
A possible approach (from the machine learning literature)
for handling large feature sets is to perform a feature selection
process to identify that subset of features providing the best
classification results. A reduced feature set improves the
scalability of the classifier, and can often produce substantial
improvements in accuracy, precision, and recall.
This paper investigates multiple feature selection techniques
to improve classifier performance. Classifier performance can
be evaluated using a suitable metric. For this paper, classifier
performance refers to buggy F-measure rates. The choice of
this measure is discussed in section III. The feature selection
techniques investigated include both filter and wrapper meth-
ods. The best technique is Significance Attribute Evaluation (a
filter method) in conjunction with the Naïve Bayes classifier.
This technique discards features with lowest significance until
optimal classification performance is reached (described in
section II-D).
Although many classification techniques could be em-
ployed, this paper focuses on the use of Naïve Bayes and
SVM. The reason is the strong performance of the
SVM and the Naïve Bayes classifier for text categorization
and numerical data [24], [34]. The J48 and JRIP classifiers
were briefly tried, but due to inadequate results, their mention
is limited to section VII.
The primary contribution of this paper is the empirical anal-
ysis of multiple feature selection techniques to classify bugs
in software code changes using file level deltas. An important
secondary contribution is the high average F-measure values
for predicting bugs in individual software changes.
This paper explores the following research questions.
Question 1. Which variables lead to best bug prediction
performance when using feature selection?
The three variables affecting bug prediction performance that
are explored in this paper are: (1) the type of classifier (Naïve
Bayes, Support Vector Machine), (2) the type of feature selection
used, and (3) whether multiple instances of a feature are
significant (count), or whether only the existence of a feature
is significant (binary). Results are reported in Section V-A.
Results for question 1 are reported as averages across all
projects in the corpus. However, in practice it is useful to
know the range of results across a set of projects. This leads
to our second question.
Question 2. Range of bug prediction performance using
feature selection. How do the best-performing SVM and
Naïve Bayes classifiers perform across all projects when using
feature selection? (see Section V-B)
The sensitivity of bug prediction results with number of
features is explored in the next question.
Question 3. Feature Sensitivity. What is the performance of
change classification at varying percentages of features? What
is the F-measure of the best performing classifier when using
just 1% of all project features? (see Section V-D)
Some types of features work better than others for discrimi-
nating between buggy and clean changes; this is explored in the
final research question.
Question 4. Best Performing Features. Which classes of
features are the most useful for performing bug predictions?
(see Section V-E)
A comparison of this paper’s results with those found in
related work (see Section VI) shows that change classification
with feature selection outperforms other existing classification-
based bug prediction approaches. Furthermore, when using the
Naïve Bayes classifier, buggy precision averages 97% with a
recall of 70%, indicating the predictions are generally highly
precise, thereby minimizing the impact of false positives.
In the remainder of the paper, we begin by presenting
an overview of the change classification approach for bug
prediction, and then detail a new process for feature selection
(Section II). Next, standard measures for evaluating classifiers
(accuracy, precision, recall, F-measure, ROC AUC) are de-
scribed in Section III. Following, we describe the experimental
context, including our data set, and specific classifiers (Section
IV). The stage is now set, and in subsequent sections we
explore the research questions described above (Sections V-A
- V-E). A brief investigation of algorithm runtimes is next
(Section V-F). The paper ends with a summary of related
work (Section VI), threats to validity (Section VII), and the
conclusion.
This paper builds on [49], a prior effort by the same set of
authors. In addition to providing substantially more detail and
updating the previous effort, it adds two research questions and
three new experiments, and extends the feature selection
comparison to more techniques. The new artifacts are research
questions 3 and 4, Table II, and Table VII. Further details over
the previous experiments include Table VI, Table IV, and the
accompanying analysis.
II. CHANGE CLASSIFICATION
The primary steps involved in performing change classifi-
cation on a single project are as follows:
Creating a Corpus:
1. File level change deltas are extracted from the revision
history of a project, as stored in its SCM repository (described
further in Section II-A).
2. The bug fix changes for each file are identified by
examining keywords in SCM change log messages (Section
II-A).
3. The bug-introducing and clean changes at the file level
are identified by tracing backward in the revision history from
bug fix changes (Section II-A).
4. Features are extracted from all changes, both buggy and
clean. Features include all terms in the complete source code,
the lines modified in each change (delta), and change meta-
data such as author and change time. Complexity metrics, if
available, are computed at this step. Details on these feature
extraction techniques are presented in Section II-B. At the end
of Step 4, a project-specific corpus has been created, a set of
labeled changes with a set of features associated with each
change.
All of the steps until this point are the same as in Kim et al.
[28]. The following step is the new contribution in this paper.
Feature Selection:
5. Perform a feature selection process that employs a
combination of wrapper and filter methods to compute a
reduced set of features. The filter methods used are Gain Ratio,
Chi-Squared, Significance, and Relief-F feature rankers. The
wrapper methods are based on the Naïve Bayes and the SVM
classifiers. For each iteration of feature selection, classifier F-
measure is optimized. As Relief-F is a slower method, it is
only invoked on the top 15% of features rather than 50%.
Feature selection is iteratively performed until one feature is
left. At the end of Step 5, there is a reduced feature set that
performs optimally for the chosen classifier metric.
Classification:
6. Using the reduced feature set, a classification model is
trained.
7. Once a classifier has been trained, it is ready to use.
New code changes can now be fed to the classifier, which
determines whether a new change is more similar to a buggy
change or a clean change. Classification is performed at a
code change level using file level change deltas as input to
the classifier.
A. Finding Buggy and Clean Changes
The process of determining buggy and clean changes begins
by using the Kenyon infrastructure to extract change transac-
tions from either a CVS or Subversion software configuration
management repository [7]. In Subversion, such transactions
are directly available. CVS, however, provides only versioning
at the file level, and does not record which files were commit-
ted together. To recover transactions from CVS archives, we
group the individual per-file changes using a sliding window
approach [53]: two subsequent changes by the same author
and with the same log message are part of one transaction if
they are at most 200 seconds apart.
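As an illustration of this grouping step, the sketch below (in Python) assumes each per-file CVS change is available as a record with author, log, and timestamp fields; these field names are illustrative and not part of Kenyon's actual interface.

WINDOW_SECONDS = 200  # maximum gap between consecutive changes in one transaction

def group_transactions(file_changes, window=WINDOW_SECONDS):
    """Group per-file CVS changes into transactions: same author, same log
    message, and consecutive changes at most `window` seconds apart."""
    transactions, current = [], []
    for change in sorted(file_changes, key=lambda c: c["timestamp"]):
        if (current
                and change["author"] == current[-1]["author"]
                and change["log"] == current[-1]["log"]
                and change["timestamp"] - current[-1]["timestamp"] <= window):
            current.append(change)
        else:
            if current:
                transactions.append(current)
            current = [change]
    if current:
        transactions.append(current)
    return transactions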
In order to find bug-introducing changes, bug fixes must
first be identified by mining change log messages. We use
two approaches: searching for keywords in log messages such
TABLE I
EXAMPLE BUG FIX SOURCE CODE CHANGE

1.23: Bug-introducing          | 1.42: Fix
...                            | ...
15: if (foo==null) {           | 36: if (foo!=null) {
16:     foo.bar();             | 37:     foo.bar();
...                            | ...
as “Fixed”, “Bug” [40], or other keywords likely to appear
in a bug fix, and searching for references to bug reports like
“#42233”. This allows us to identify whether an entire code
change transaction contains a bug fix. If it does, we then
need to identify the specific file delta change that introduced
the bug. For the systems studied in this paper, we manually
verified that the identified fix commits were, indeed, bug fixes.
For JCP, all bug fixes were identified using a source code to
bug tracking system hook. As a result, we did not have to rely
on change log messages for JCP.
The bug-introducing change identification algorithm pro-
posed by Śliwerski, Zimmermann, and Zeller (SZZ algorithm)
[50] is used in the current paper. After identifying bug fixes,
SZZ uses a diff tool to determine what changed in the bug-
fixes. The diff tool returns a list of regions that differ between
the two files; each region is called a “hunk”. It observes
each hunk in the bug-fix and assumes that the deleted or
modified source code in each hunk is the location of a
bug. Finally, SZZ tracks down the origins of the deleted or
modified source code in the hunks using the built-in annotation
function of SCM systems. The
annotation computes, for each line in the source code, the most
recent revision where the line was changed, and the developer
who made the change. These revisions are identified as bug-
introducing changes. In the example in Table I, revision 1.42
fixes a fault in line 36. This line was introduced in revision
1.23 (when it was line 15). Thus revision 1.23 contains a bug-
introducing change. Specifically, revision 1.23 calls a method
within foo. However, the if-block is entered only if foo is null.
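The back-tracing step of SZZ can be summarized with the following simplified Python sketch. The diff_hunks and annotate helpers, and the hunk field name, are hypothetical stand-ins for the SCM diff and annotate operations; this is a sketch of the idea, not the implementation used in the paper.

def bug_introducing_revisions(path, old_revision, fix_revision, diff_hunks, annotate):
    """Return the revisions blamed for the lines deleted or modified by a bug fix.
    `annotate(path, rev)` is assumed to map each line number to the revision that
    last changed it; `diff_hunks(path, old_rev, new_rev)` is assumed to yield hunks
    with a `deleted_or_modified_lines` field (both helpers are hypothetical)."""
    blamed = set()
    blame = annotate(path, old_revision)                # {line_no: revision}
    for hunk in diff_hunks(path, old_revision, fix_revision):
        for line_no in hunk.deleted_or_modified_lines:  # lines touched by the fix
            blamed.add(blame[line_no])                  # candidate bug-introducing change
    return blamed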
B. Feature Extraction
To classify software changes using machine learning algo-
rithms, a classification model must be trained using features
of buggy and clean changes. In this section, we discuss
techniques for extracting features from a software project’s
change history.
A file change involves two source code revisions (an old
revision and a new revision) and a change delta that records
the added code (added delta) and the deleted code (deleted
delta) between the two revisions. A file change has associated
meta-data, including the change log, author, and commit date.
Every term in the source code, change delta, and change log
texts is used as a feature. This means that every variable,
method name, function name, keyword, comment word, and
operator—that is, everything in the source code separated by
whitespace or a semicolon—is used as a feature.
We gather eight features from change meta-data: author,
commit hour (0, 1, 2, . . . , 23), commit day (Sunday, Monday,
. . . , Saturday), cumulative change count, cumulative bug
count, length of change log, changed LOC (added delta LOC
+ deleted delta LOC), and new revision source code LOC.
We compute a range of traditional complexity metrics of the
source code by using the Understand C/C++ and Java tools
[23]. As a result, we extract 61 complexity metrics for each
file, including LOC, number of comment lines, cyclomatic
complexity, and max nesting. Since we have two source code
files involved in each change (old and new revision files),
we can use complexity metric deltas as features. That is, we
can compute a complexity metric for each file and take the
difference; this difference can be used as a feature.
Change log messages are similar to e-mail or news articles
in that they are human readable texts. To extract features
from change log messages, we use the bag-of-words (BOW)
approach, which converts a stream of characters (the text) into
a BOW (index terms) [47].
We use all terms in the source code as features, including
operators, numbers, keywords, and comments. To generate
features from source code, we use a modified version of BOW,
called BOW+, that extracts operators in addition to all terms
extracted by BOW, since we believe operators such as !=,
++, and && are important terms in source code. We perform
BOW+ extraction on added delta, deleted delta, and new
revision source code. This means that every variable, method
name, function name, programming language keyword, com-
ment word, and operator in the source code separated by
whitespace or a semicolon is used as a feature.
We also convert the directory and filename into features
since they encode both module information and some behav-
ioral semantics of the source code. For example, the file (from
the Columba project) “ReceiveOptionPanel.java” in the direc-
tory “src/mail/core/org/columba/mail/gui/config/ account/” re-
veals that the file receives some options using a panel interface
and the directory name shows that the source code is related to
“account,” “configure,” and “graphical user interface.” Direc-
tory and filenames often use camel case, concatenating words
with word breaks indicated by capital letters. For example, "ReceiveOptionPanel.java"
combines “receive,” “option,” and “panel.” To extract such
words correctly, we use a case change in a directory or a
filename as a word separator. We call this method BOW++.
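To make the extraction concrete, the following Python sketch tokenizes code in the spirit of BOW+ (terms plus operators) and splits directory and file names in the spirit of BOW++ (separators plus camel-case boundaries). The operator set and splitting rules here are assumptions for illustration and may differ from the exact rules used in the study.

import re

OPERATOR_RE = re.compile(r"!=|==|<=|>=|\+\+|--|&&|\|\||[+\-*/%<>=!&|]")
CAMEL_RE = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")   # split point inside camelCase

def bow_plus(text):
    """Whitespace/semicolon-separated terms plus any operators they contain."""
    tokens = []
    for term in re.split(r"[\s;]+", text):
        if not term:
            continue
        tokens.append(term)
        tokens.extend(OPERATOR_RE.findall(term))   # "foo!=null" also yields "!="
    return tokens

def bow_plus_plus(path):
    """Directory/file-name words split on separators and camel-case boundaries."""
    words = []
    for part in re.split(r"[/\\._-]+", path):
        words.extend(w.lower() for w in CAMEL_RE.split(part) if w)
    return words

# bow_plus_plus("ReceiveOptionPanel.java") -> ["receive", "option", "panel", "java"]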
Table II summarizes features generated and used in this paper.
Feature groups which can be interpreted as binary or count
are also indicated. Section V-A explores binary and count
interpretations for those feature groups.
Table III provides an overview of the projects examined in
this research and the duration of each project examined.
For each project we analyzed (see Table III), the number
of meta-data (M) and code complexity (C) features is 8 and
150, respectively. Source code (A, D, N), change log (L), and
directory/filename (F) features contribute thousands of features
per project, ranging from roughly 6,000 to 42,000. Source code
features (A, D, N) can take two forms: binary or count. For
binary features, the feature only notes the presence or absence
of a particular keyword. For count features, the count of the
number of times a keyword appears is recorded. For example,
if a variable maxParameters is present anywhere in the project,
a binary feature just records this variable as being present,
while a count feature would additionally record the number
TABLE III
SUMMARY OF PROJECTS SURVEYED

Project | Period | Clean Changes | Buggy Changes | Features
APACHE 1.3 | 10/1996-01/1997 | 566 | 134 | 17,575
COLUMBA | 05/2003-09/2003 | 1,270 | 530 | 17,411
GAIM | 08/2000-03/2001 | 742 | 451 | 9,281
GFORGE | 01/2003-03/2004 | 399 | 334 | 8,996
JEDIT | 08/2002-03/2003 | 626 | 377 | 13,879
MOZILLA | 08/2003-08/2004 | 395 | 169 | 13,648
ECLIPSE | 10/2001-11/2001 | 592 | 67 | 16,192
PLONE | 07/2002-02/2003 | 457 | 112 | 6,127
POSTGRESQL | 11/1996-02/1997 | 853 | 273 | 23,247
SUBVERSION | 01/2002-03/2002 | 1,925 | 288 | 14,856
JCP | 1 year | 1,516 | 403 | 41,942
Total | N/A | 9,294 | 3,125 | 183,054
of times it appears anywhere in the project’s history.
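The difference between the two interpretations can be illustrated with scikit-learn's CountVectorizer, a rough analogue of the extraction used here; the example tokens below are made up.

from sklearn.feature_extraction.text import CountVectorizer

changes = ["if maxParameters maxParameters foo", "while foo bar"]

count_vec = CountVectorizer()               # count interpretation
binary_vec = CountVectorizer(binary=True)   # binary interpretation: presence/absence

X_count = count_vec.fit_transform(changes)
X_binary = binary_vec.fit_transform(changes)

# In the first change, "maxparameters" is recorded as 2 under the count
# interpretation and as 1 under the binary interpretation.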
C. Feature Selection Techniques
The number of features gathered during the feature extrac-
tion phase is quite large, ranging from 6,127 for Plone to
41,942 for JCP (Table III). Such large feature sets lead to
longer training and prediction times and require large amounts of
memory to perform classification. A common solution to this
problem is the process of feature selection, in which only the
subset of features that are most useful for making classification
decisions are actually used.
This paper empirically compares a selection of wrapper and
filter methods for bug prediction classification effectiveness.
Filter methods use general characteristics of the dataset to
evaluate and rank features [20]. They are independent of learn-
ing algorithms. Wrapper methods, on the other hand, evaluate
features by using scores provided by learning algorithms. All
methods can be used with both count and binary interpretations
of keyword features.
The methods are further described below; a brief code sketch of the filter-ranking pattern follows the list.
• Gain Ratio Attribute Evaluation - Gain Ratio is a myopic
feature scoring algorithm. A myopic feature scoring algo-
rithm evaluates each feature individually, independently of
the other features. Gain Ratio improves upon Information
Gain [2], a well known measure of the amount by which
a given feature contributes information to a classification
decision. Information Gain for a feature in our context
is the amount of information the feature can provide
about whether the code change is buggy or clean. Ideally,
features that provide a good split between buggy and
clean changes should be preserved. For example, if the
presence of a certain feature strongly indicates a bug, and
its absence greatly decreases the likelihood of a bug, that
feature possesses strong Information Gain. On the other
TABLE II
FEATURE GROUPS. FEATURE GROUP DESCRIPTION, EXTRACTION METHOD, AND EXAMPLE FEATURES.

Feature Group | Description | Extraction Method | Example Features | Interpretation
Added Delta (A) | Terms in added delta source code | BOW+ | if, while, for, == | Binary/Count
Deleted Delta (D) | Terms in deleted delta source code | BOW+ | true, 0, <=, ++, int | Binary/Count
Directory/File Name (F) | Terms in directory/file names | BOW++ | src, module, java | N/A
Change Log (L) | Terms in the change log | BOW | fix, added, new | N/A
New Revision Source Code (N) | Terms in new revision source code file | BOW+ | if, !=, do, while, string, false | Binary/Count
Meta-data (M) | Change meta-data such as time and author | Direct | author: hunkim, commit hour: 12 | N/A
Complexity Metrics (C) | Software complexity metrics of each source code unit | Understand tools [23] | LOC: 34, Cyclomatic: 10 | N/A
hand, if a feature’s presence indicates a 50% likelihood
of a bug, and its absence also indicates a 50% likelihood
of a bug, this feature has low Information Gain and is
not useful in predicting the code change.
However, Information Gain places more importance on
features that have a large range of values. This can lead to
problems with features such as the LOC of a code change.
Gain Ratio [2] plays the same role as Information Gain,
but instead provides a normalized measure of a feature’s
contribution to a classification decision [2]. Thus, Gain
Ratio is less affected by features having a large range of
values. More details on how the entropy based measure
is calculated for Gain Ratio (including how a normalized
measure of a feature’s contribution is computed), and
other inner workings can be found in an introductory data
mining book, e.g. [2].
• Chi-Squared Attribute Evaluation - Chi-squared feature
score is also a myopic feature scoring algorithm. The
worth of a feature is given by the value of the Pearson
chi-squared statistic [2] with respect to the classification
decision. Based on the presence of a feature value across
all instances in the dataset, one can compute expected
and observed counts using Pearson’s chi-squared test.
The features with the highest disparity between expected
and observed counts against the class attribute are given
higher scores.
• Significance Attribute Evaluation - With significance at-
tribute scoring, high scores are given to features where
an inversion of the feature value (e.g., a programming
keyword not being present when it had previously been
present) will very likely cause an inversion of the classifi-
cation decision (buggy to clean, or vice-versa) [1]. It too
is a myopic feature scoring algorithm. The significance
itself is computed as a two-way function of its association
to the class decision. For each feature, the attribute-to-
class association along with the class-to-attribute associ-
ation is computed. A feature is quite significant if both of
these associations are high for a particular feature. More
detail on the inner workings can be found in [1].
• Relief-F Attribute Selection - Relief-F is an extension
to the Relief algorithm [31]. Relief samples data points
(code changes in the context of this paper) at random,
and computes two nearest neighbors: one neighbor which
has the same class as the instance (similar neighbor),
and one neighbor which has a different class (differing
neighbor). For the context of this paper, the 2 classes are
buggy and clean. The quality estimation for a feature f
is updated based on the value of its similar and differing
neighbors. If for feature f, a code change and its similar
neighbor have differing values, the feature quality of f
is decreased. If a code change and its differing neighbor
have different values for f, f’s feature quality is increased.
This process is repeated for all the sampled points. Relief-
F is an algorithmic extension to Relief [46]. One of the
extensions is to search for the nearest k neighbors in
each class rather than just one neighbor. As Relief-F is a
slower algorithm than the other presented filter methods,
it was used on the top 15% features returned by the best
performing filter method.
• Wrapper Methods - The wrapper methods leveraged the
classifiers used in the study. The SVM and the Naïve
Bayes classifier were used as wrappers. The features are
ranked by their classifier computed score. The top feature
scores are those features which are valued highly after
the creation of a model driven by the SVM or the Naïve
Bayes classifier.
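To make the filter-ranking pattern concrete, the following Python sketch ranks features with chi-squared or mutual information (a relative of Information Gain) from scikit-learn. Gain Ratio, Significance Attribute Evaluation, and Relief-F are not available in scikit-learn, so this only illustrates the general pattern of scoring each feature independently of any classifier.

import numpy as np
from sklearn.feature_selection import chi2, mutual_info_classif

def rank_features(X, y, scorer="chi2"):
    """Return feature indices ordered from most to least useful."""
    if scorer == "chi2":
        scores, _p_values = chi2(X, y)          # myopic: one score per feature
    else:
        scores = mutual_info_classif(X, y, discrete_features=True)
    return np.argsort(scores)[::-1]             # highest score first

def keep_top_fraction(X, ranking, fraction):
    """Retain only the best-scoring `fraction` of columns, e.g. 0.5 for half."""
    k = max(1, int(len(ranking) * fraction))
    return X[:, ranking[:k]], ranking[:k]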
D. Feature Selection Process
Filter and wrapper methods are used in an iterative pro-
cess of selecting incrementally smaller sets of features, as
detailed in Algorithm 1. The process begins by cutting the
initial feature set in half, to reduce memory and processing
requirements for the remainder of the process. The process
performs optimally when under 10% of all features are present.
The initial feature evaluations for filter methods are 10-fold
cross validated in order to avoid the possibility of over-
fitting feature scores. As wrapper methods perform 10-fold
cross validation during the training process, feature scores
from wrapper methods can directly be used. Note that we
cannot stop the process around the 10% mark and assume the
performance is optimal. Performance need not be a unimodal
curve with a single maximum around the 10% mark; it is
possible that performance is even better at, say, the 5% mark.
In the iteration stage, each step finds those 50% of re-
maining features that are least useful for classification using
individual feature rank, and eliminates them (if, instead, we
were to reduce by one feature at a time, this step would be
similar to backward feature selection [35]). Using 50% of
Algorithm 1 Feature selection process for one project
1) Start with all features, F
2) For feature selection technique, f, in Gain Ratio, Chi-
Squared, Significance Evaluation, Relief-F, Wrapper
method using SVM, Wrapper method using Naïve
Bayes, perform steps 3-6 below.
3) Compute feature evaluations using f over F, and
select the top 50% of features with the best performance,
F/2
4) Selected features, selF = F/2
5) While |selF| ≥ 1 feature, perform steps (a)-(d)
a) Compute and store buggy and clean precision,
recall, accuracy, F-measure, and ROC AUC using
a machine learning classifier (e.g., Naïve Bayes
or SVM), using 10-fold cross validation. Record
result in a tuple list.
b) If f is a wrapper method, recompute feature scores
over selF.
c) Identify removeF, the 50% of features of selF
with the lowest feature evaluation. These are the
least useful features in this iteration.
d) selF = selF − removeF
6) Determine the best F-measure result recorded in step 5.a.
The percentage of features that yields the best result is
optimal for the given metric.
features at a time improves speed without sacrificing much
result quality.
So, for example, selF starts at 50% of all features, then
is reduced to 25% of all features, then 12.5%, and so on. At
each step, change classification bug prediction using selF is
then performed over the entire revision history, using 10-fold
cross validation to reduce the possibility of over-fitting to the
data.
This iteration terminates when only one of the original
features is left. At this point, there is a list of tuples: feature %,
feature selection technique, classifier performance. The final
step involves a pass over this list to find the feature % at
which a specific classifier achieves its greatest performance.
The metric used in this paper for classifier performance is the
buggy F-measure (harmonic mean of precision and recall),
though one could use a different metric.
It should be noted that Algorithm 1 is itself a wrapper
method, as it builds an SVM or Naïve Bayes classifier at
various points. When the number of features for a project
is large, the learning process can take a long time. This is
still strongly preferable to a straightforward use of backward
feature selection, which removes one feature per iteration. An
analysis on the runtime of the feature selection process and its
components is performed in section V-F. When working with
extremely large datasets, using algorithm 1 exclusively with
filter methods will also significantly lower its runtime.
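A minimal Python sketch of the iterative halving in Algorithm 1, for a single filter ranker and a single classifier, is shown below. It assumes X is the change-by-feature matrix, y holds 1 for buggy and 0 for clean, and rank_features is a filter ranker such as the chi-squared sketch above; the real experiments also record precision, recall, accuracy, and ROC AUC at each step.

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB

def halving_feature_selection(X, y, rank_features, clf=None):
    """Iteratively halve the feature set (as in Algorithm 1) for one filter
    ranker and one classifier, tracking buggy F-measure at each step."""
    clf = clf or BernoulliNB()
    ranking = rank_features(X, y)                      # computed once for a filter method
    selected = ranking[: max(1, len(ranking) // 2)]    # step 3: keep the top 50%
    history = []                                       # (number of features, buggy F-measure)
    while True:
        f1 = cross_val_score(clf, X[:, selected], y, cv=10, scoring="f1").mean()  # step 5a
        history.append((len(selected), f1))
        if len(selected) == 1:
            break
        selected = selected[: max(1, len(selected) // 2)]   # steps 5c-5d: drop the worst half
    return max(history, key=lambda t: t[1]), history        # step 6: best F-measure found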
III. PERFORMANCE METRICS
We now define common performance metrics used to eval-
uate classifiers: Accuracy, Precision, Recall, F-Measure, and
ROC AUC. There are four possible outcomes while using a
classifier on a single change: Classifying a buggy change as
buggy, b→b (true positive)
Classifying a buggy change as clean, b→c (false negative)
Classifying a clean change as clean, c→c (true negative)
Classifying a clean change as buggy, c→b (false positive)
With a good set of training data, it is possible to compute
the total number of buggy changes correctly classified as
buggy (nb→b), buggy changes incorrectly classified as clean
(nb→c), clean changes correctly classified as clean (nc→c), and
clean changes incorrectly classified as buggy (nc→b). Accuracy
is the number of correctly classified changes over the total
number of changes. As there are typically more clean changes
than buggy changes, this measure could yield a high value if
clean changes are being better predicted than buggy changes.
This is often less relevant than buggy precision and recall.
Accuracy = (nb→b + nc→c) / (nb→b + nb→c + nc→c + nc→b)
Buggy change precision represents the number of correct bug
classifications over the total number of classifications that
resulted in a bug outcome. Or, put another way, if the change
classifier predicts a change is buggy, what fraction of these
changes really contain a bug?
Buggy change precision, P(b) = nb→b / (nb→b + nc→b)
Buggy change recall is also known as the true positive rate;
it represents the number of correct bug classifications over
the total number of changes that were actually bugs. That is, of
all the changes that are buggy, what fraction does the change
classifier actually label as buggy?
Buggy change recall, R(b) = nb→b / (nb→b + nb→c)
Buggy change F-measure is a composite measure of buggy
change precision and recall; more precisely, it is the harmonic
mean of precision and recall. Since precision can often be
improved at the expense of recall (and vice-versa), F-measure
is a good measure of the overall precision/recall performance
of a classifier, since it incorporates both values. For this
reason, we emphasize this metric in our analysis of classifier
performance in this paper. Clean change recall, precision, and
F-measure can be computed similarly.
Buggy change F-measure = 2 · P(b) · R(b) / (P(b) + R(b))
Consider an example in which 80% of code changes are clean and
20% of code changes are buggy. In this case, a classifier
reporting all changes as clean except one will have around 80%
accuracy. The classifier predicts exactly one buggy change
correctly. The buggy precision and recall, however will be
100% and close to 0% respectively. Precision is 100% because
the single buggy prediction is correct. The buggy F-measure
is 16.51% in this case, revealing poor performance. On the
other hand, if a classifier reports all changes as buggy, the
accuracy is 20%, the buggy recall is 100%, the buggy precision
is 20%, and the buggy F-measure is 33.3%. While these results
seem better, and the buggy F-measure figure of 33.3% gives the
impression that the second classifier is better than the first,
the low F-measure nevertheless indicates that the second
classifier is not performing well either. For this paper, the
classifiers for all experiments are F-measure optimized. Sev-
eral data/text mining papers have compared performance on
classifiers using F-measure including [3] and [32]. Accuracy
is also reported, to flag cases where the F-measure appears
artificially high despite low accuracy. For example, suppose there
are 10 code changes, 5 of which are buggy. If a classifier
predicts all changes as buggy, the resulting precision, recall,
and F-measure are 50%, 100%, and 66.67% respectively. The
accuracy figure of 50% demonstrates less promise than the
high F-measure figure suggests.
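The four outcome counts map directly onto these formulas; the small Python helper below, using hypothetical counts, computes the same quantities.

def buggy_metrics(n_bb, n_bc, n_cc, n_cb):
    """Accuracy, buggy precision, buggy recall, and buggy F-measure from the
    four outcome counts (n_bb = buggy classified as buggy, and so on)."""
    accuracy = (n_bb + n_cc) / (n_bb + n_bc + n_cc + n_cb)
    precision = n_bb / (n_bb + n_cb) if (n_bb + n_cb) else 0.0
    recall = n_bb / (n_bb + n_bc) if (n_bb + n_bc) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    return accuracy, precision, recall, f_measure

# Hypothetical counts: buggy_metrics(n_bb=7, n_bc=3, n_cc=85, n_cb=5)
# -> accuracy 0.92, buggy precision ~0.58, recall 0.70, F-measure ~0.64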
An ROC (originally, receiver operating characteristic, now
typically just ROC) curve is a two-dimensional graph where
the true positive rate (i.e. recall, the number of items correctly
labeled as belonging to the class) is plotted on the Y axis
against the false positive rate (i.e. items incorrectly labeled as
belonging to the class, or nc→b/(nc→b+nc→c)) on the X axis.
The area under an ROC curve, commonly abbreviated as ROC
AUC, has an important statistical property. The ROC AUC of
a classifier is equivalent to the probability that the classifier
will value a randomly chosen positive instance higher than a
randomly chosen negative instance [16]. These instances map
to code changes when relating to bug prediction. Tables V and
VI contain the ROC AUC figures for each project.
One potential problem when computing these performance
metrics is the choice of which data is used to train and test
the classifier. We consistently use the standard technique of 10-
fold cross-validation [2] when computing performance metrics
throughout this paper (for all figures and tables) with the
exception of section V-C where more extensive cross-fold
validation was performed. This avoids the potential problem of
overfitting to a particular training and test set within a specific
project. In 10-fold cross validation, a data set (i.e., the project
revisions) is divided into 10 parts (partitions) at random. One
partition is removed and is the target of classification, while
the other 9 are used to train the classifier. Classifier evaluation
metrics, including buggy and clean accuracy, precision, recall,
F-measure, and ROC AUC are computed for each partition.
After these evaluations are completed for all 10 partitions,
averages are obtained. The classifier performance metrics
reported in this paper are these average figures.
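A sketch of this evaluation using scikit-learn's cross_validate is shown below; it assumes buggy changes are labeled 1 and clean changes 0, and reports the mean of each metric across the 10 folds.

from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB

def evaluate(clf, X, y):
    """10-fold cross-validated metrics, assuming label 1 = buggy, 0 = clean."""
    scores = cross_validate(
        clf, X, y, cv=10,
        scoring={"accuracy": "accuracy", "buggy_precision": "precision",
                 "buggy_recall": "recall", "buggy_f1": "f1", "roc_auc": "roc_auc"})
    # each entry holds 10 per-fold values; the reported figures are their means
    return {name: values.mean() for name, values in scores.items()
            if name.startswith("test_")}

# evaluate(BernoulliNB(), X, y) -> {"test_accuracy": ..., "test_buggy_f1": ..., ...}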
IV. EXPERIMENTAL CONTEXT
We gathered software revision history for Apache, Columba,
Gaim, Gforge, Jedit, Mozilla, Eclipse, Plone, PostgreSQL,
Subversion, and a commercial project written in Java (JCP).
These are all mature open source projects with the exception
of JCP. In this paper these projects are collectively called the
corpus.
Using the project’s CVS (Concurrent Versioning System)
or SVN (Subversion) source code repositories, we collected
revisions 500-1000 for each project, excepting Jedit, Eclipse,
and JCP. For Jedit and Eclipse, revisions 500-750 were col-
lected. For JCP, a year’s worth of changes were collected. We
used the same revisions as Kim et al. [28], since we wanted
to be able to compare our results with theirs. We removed two
of the projects they surveyed from our analysis, Bugzilla and
Scarab, as the bug tracking systems for these projects did not
distinguish between new features and bug fixes. Even though
our technique did well on those two projects, the value of bug
prediction when new features are also treated as bug fixes is
arguably less meaningful.
V. RESULTS
The following sections present results obtained when ex-
ploring the four research questions. For the convenience of
the reader, each result section repeats the research question
that will be answered.
A. Classifier performance comparison
Research Question 1. Which variables lead to best bug
prediction performance when using feature selection?
The three main variables affecting bug prediction per-
formance that are explored in this paper are: (1) the type of
classifier (Naïve Bayes, Support Vector Machine), (2) the type of
feature selection used, and (3) whether multiple instances of a
particular feature are significant (count), or whether only the
existence of a feature is significant (binary). For example, if a
variable by the name of “maxParameters” is referenced four
times during a code change, a binary feature interpretation
records this variable as 1 for that code change, while a count
feature interpretation would record it as 4.
The permutations of variables 1, 2, and 3 are explored
across all 11 projects in the data set with the best performing
feature selection technique for each classifier. For SVMs, a
linear kernel with optimized values for C is used. C is a
parameter that allows one to trade off training error and model
complexity. A low C tolerates a higher number of errors in
training, while a large C allows fewer training errors but increases
the complexity of the model. [2] covers the SVM C parameter
in more detail.
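The effect of C can be illustrated on synthetic data with scikit-learn's LinearSVC; the values of C below are arbitrary and the data is generated, not drawn from the study's corpus.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic, imbalanced data standing in for a project's change-feature matrix.
X, y = make_classification(n_samples=500, n_features=200, n_informative=20,
                           weights=[0.75, 0.25], random_state=0)

for C in (0.01, 1.0, 100.0):            # arbitrary values spanning the trade-off
    clf = LinearSVC(C=C, max_iter=10000)
    f1 = cross_val_score(clf, X, y, cv=10, scoring="f1").mean()
    print(f"C={C}: buggy F-measure ~ {f1:.2f}")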
For each project, feature selection is performed, followed by
computation of per-project accuracy, buggy precision, buggy
recall, and buggy F-measure. Once all projects are complete,
average values across all projects are computed. Results are
reported in Table IV. Average numbers may not provide
enough information on the variance. Every result in Table IV
reports the average value along with the standard deviation.
The results are presented in descending order of buggy F-
measure.
Significance Attribute and Gain Ratio based feature selec-
tion performed best on average across all projects in the corpus
when keywords were interpreted as binary features. McCallum
et al. [38] confirm that feature selection performs very well
on sparse data used for text classification. Anagnostopoulos et
al. [3] report success with Information Gain for sparse text
classification. Our data set is also quite sparse and performs
well using Gain Ratio, an improved version of Information
Gain. A sparse data set is one in which instances contain a
majority of zeroes. For most of the binary attributes analyzed,
the non-zero entries are a small minority. The least
TABLE IV
AVERAGE CLASSIFIER PERFORMANCE ON CORPUS (ORDERED BY DESCENDING BUGGY F-MEASURE)
Feature Interp. | Classifier | Feature Selection Technique | Features | Feature Percentage | Accuracy | Buggy F-measure | Buggy Precision | Buggy Recall
Binary Bayes Significance Attr. 1153.4 ± 756.3 7.9 ± 6.3 0.90 ± 0.04 0.81 ± 0.07 0.97 ± 0.06 0.70 ± 0.10
Binary Bayes Gain Ratio 1914.6 ± 2824.0 9.5 ± 8.0 0.91 ± 0.03 0.79 ± 0.07 0.92 ± 0.10 0.70 ± 0.07
Binary Bayes Relief-F 1606.7 ± 1329.2 9.3 ± 3.6 0.85 ± 0.08 0.71 ± 0.10 0.74 ± 0.19 0.71 ± 0.08
Binary SVM Gain Ratio 1522.3 ± 1444.2 10.0 ± 9.8 0.87 ± 0.05 0.69 ± 0.06 0.90 ± 0.10 0.57 ± 0.08
Binary Bayes Wrapper 2154.6 ± 1367.0 12.7 ± 6.5 0.85 ± 0.06 0.68 ± 0.10 0.76 ± 0.17 0.64 ± 0.11
Binary SVM Significance Attr. 1480.9 ± 2368.0 9.5 ± 14.2 0.86 ± 0.06 0.65 ± 0.06 0.94 ± 0.11 0.51 ± 0.08
Count Bayes Relief-F 1811.1 ± 1276.7 11.8 ± 1.9 0.78 ± 0.07 0.63 ± 0.08 0.56 ± 0.09 0.72 ± 0.10
Binary SVM Relief-F 663.1 ± 503.2 4.2 ± 3.5 0.83 ± 0.04 0.62 ± 0.06 0.75 ± 0.22 0.57 ± 0.14
Binary Bayes Chi-Squared 3337.2 ± 2356.1 25.0 ± 20.5 0.77 ± 0.08 0.61 ± 0.07 0.54 ± 0.07 0.71 ± 0.09
Count SVM Relief-F 988.6 ± 1393.8 5.8 ± 4.0 0.78 ± 0.08 0.61 ± 0.06 0.58 ± 0.10 0.66 ± 0.07
Count SVM Gain Ratio 1726.0 ± 1171.3 12.2 ± 8.3 0.80 ± 0.07 0.57 ± 0.08 0.62 ± 0.13 0.53 ± 0.08
Binary SVM Chi-Squared 2561.5 ± 3044.1 12.6 ± 16.2 0.81 ± 0.06 0.56 ± 0.09 0.55 ± 0.11 0.57 ± 0.07
Count SVM Significance Attr. 1953.0 ± 2416.8 12.0 ± 13.6 0.79 ± 0.06 0.56 ± 0.07 0.62 ± 0.13 0.52 ± 0.06
Count SVM Wrapper 1232.1 ± 1649.7 8.9 ± 14.9 0.76 ± 0.10 0.55 ± 0.05 0.56 ± 0.09 0.56 ± 0.07
Count Bayes Wrapper 1946.6 ± 6023.6 4.8 ± 14.3 0.79 ± 0.08 0.55 ± 0.08 0.63 ± 0.08 0.50 ± 0.09
Count Bayes Chi-Squared 58.3 ± 59.7 0.4 ± 0.4 0.76 ± 0.08 0.55 ± 0.05 0.54 ± 0.09 0.58 ± 0.11
Count Bayes Significance Attr. 2542.1 ± 1787.1 14.8 ± 6.6 0.76 ± 0.08 0.54 ± 0.07 0.53 ± 0.11 0.59 ± 0.13
Count SVM Chi-Squared 4282.3 ± 3554.2 26.2 ± 20.2 0.75 ± 0.10 0.54 ± 0.07 0.55 ± 0.10 0.54 ± 0.06
Binary SVM Wrapper 4771.0 ± 3439.7 33.8 ± 22.8 0.79 ± 0.11 0.53 ± 0.12 0.58 ± 0.14 0.49 ± 0.11
Count Bayes Gain Ratio 2523.6 ± 1587.2 15.3 ± 6.1 0.77 ± 0.08 0.53 ± 0.06 0.56 ± 0.10 0.52 ± 0.09
sparse binary features in the datasets have non-zero values in
the 10%-15% range.
Significance attribute based feature selection narrowly won
over Gain Ratio in combination with the Naïve Bayes clas-
sifier. With the SVM classifier, Gain Ratio feature selection
worked best. Filter methods did better than wrapper methods.
This is because classifier models built over many features
perform worse and can drop the wrong features at an early
cutting stage of the feature selection process presented in
Algorithm 1. When many features are present, the data
contains more outliers. The worse performance of Chi-Squared
Attribute Evaluation can be attributed to computing variances
at the early stages of feature selection. Relief-F is a nearest
neighbor based method. Using nearest neighbor information
in the presence of noise can be detrimental [17]. However, as
Relief-F was used on the top 15% of features returned by the
best filter method, it performed reasonably well.
It appears that simple filter based feature selection tech-
niques such as Gain Ratio and Significance Attribute Evalu-
ation can work on the surveyed large feature datasets. These
techniques do not make strong assumptions on the data. When
the amount of features is reduced to a reasonable size, wrapper
methods can produce a model usable for bug prediction.
Overall, Naïve Bayes using binary interpretation of features
performed better than the best SVM technique. One explana-
tion for these results is that change classification is a type of
optimized binary text classification, and several sources ( [3],
[38]) note that Naïve Bayes performs text classification well.
As a linear SVM was used, it is possible that a
non-linear kernel could improve SVM classifier performance.
However, in preliminary work, optimizing the SVM for one
project by using a polynomial kernel often led to degradation
of performance in another project. In addition, using a
non-linear kernel made the experimental runs significantly slower.
To permit a consistent comparison of results between SVM
and Naïve Bayes, it was important to use the same classi-
fier settings for all projects, and hence no per-project SVM
classifier optimization was performed. Future work could
include optimizing SVMs on a per project basis for better
performance. A major benefit of the Naïve Bayes classifier is
that such optimization is not needed.
The methodology section discussed the differences between
binary and count interpretation of keyword features. Count did
not perform well within the corpus, yielding poor results for
SVM and Naïve Bayes classifiers. Count results are mostly
bundled at the bottom of Table IV. This is consistent with
McCallum et al. [38], which mentions that when the number of
features is low, binary values can yield better results than using
count. Even when the corpus was tested without any feature
selection in the prior work by Kim et al., count performed
worse than binary. This seems to imply that regardless of the
number of features, code changes are better captured by the
presence or absence of keywords. The poor performance of
count can possibly be explained by the difficulty of establishing
the semantic meaning behind keyword recurrence and by the
added complexity it brings to the data. When using the
count interpretation, the top results for both Naïve Bayes and
the SVM were from Relief-F. This suggests that using Relief-
F after trimming out most features via a myopic filter method
can yield good performance.
The two best performing classifier combinations by buggy
F-measure, Bayes (binary interpretation with Significance
Evaluation) and SVM (binary interpretation with Gain Ratio),
both yield an average precision of 90% and above. For the
remainder of the paper, analysis focuses just on these two, to
better understand their characteristics. The remaining figures
and tables in the paper will use a binary interpretation of
keyword features.
B. Effect of feature selection
Question 2. Range of bug prediction performance using
feature selection. How do the best-performing SVM and
Naïve Bayes classifiers perform across all projects when using
feature selection?
In the previous section, aggregate average performance
of different classifiers and optimization combinations was
compared across all projects. In actual practice, change classi-
fication would be trained and employed on a specific project.
As a result, it is useful to understand the range of performance
achieved using change classification with a reduced feature
set. Tables V and VI below report, for each project, overall
prediction accuracy, buggy and clean precision, recall, F-
measure, and ROC area under curve (AUC). Table V presents
results for Naïve Bayes using F-measure feature selection with
binary features, while Table VI presents results for SVM using
feature selection with binary features.
Observing these two tables, eleven projects overall (8 with
Naïve Bayes, 3 with SVM) achieve a buggy precision of 1,
indicating that all buggy predictions are correct (no buggy
false positives). While the buggy recall figures (ranging from
0.40 to 0.84) indicate that not all bugs are predicted, still, on
average, more than half of all project bugs are successfully
predicted.
Comparing buggy and clean F-measures for the two classi-
fiers, Naïve Bayes (binary) clearly outperforms SVM (binary)
across all 11 projects when using the same feature selection
technique. The ROC AUC figures are also better for the Naïve
Bayes classifier than those of the SVM classifier across all
projects. The ROC AUC of a classifier is equivalent to the
probability that the classifier will value a randomly chosen
positive instance higher than a randomly chosen negative
instance. A higher ROC AUC for the Naïve Bayes classifier
indicates that the classifier would still perform strongly if some
of the initial buggy/clean labels of code changes turned out
to be incorrect. Thus, the Naïve Bayes results are less sensitive
to inaccuracies in the data sets [2].
Figure 1 summarizes the relative performance of the two
classifiers and compares against the prior work of Kim et al.
[28]. Examining these figures, it is clear that feature selection
significantly improves F-measure of bug prediction using
change classification. As precision can often be increased at
the cost of recall and vice-versa, we compared classifiers using
buggy F-measure. Good F-measures indicate overall result
quality.
The practical benefit of our result can be demonstrated by
the following. For a given fixed level of recall, say 70%,
our method provides 97% precision, while Kim et al. provide
52.5% precision. These numbers were calculated using buggy
F-measure numbers from Table V and [28]. The buggy precision
value was extrapolated from the buggy F-measure and the
fixed recall figures. High precision with decent recall allows
developers to use our solution with added confidence. This
means that if a change is flagged as buggy, the change is very
likely to actually be buggy.
Kim et al.’s results, shown in Figure 1, are taken from [28].
In all but the following cases, this paper used the same corpus.
• This paper does not use Scarab and Bugzilla because
those projects did not distinguish between bug fixes and new features.
• This paper uses JCP, which was not in Kim et al.’s corpus.
Kim et al. did not perform feature selection and used
substantially more features for each project. Table IV reveals
the drastic reduction in the average number of features per
project when compared to Kim et al.'s work.
Additional benefits of the reduced feature set include better
speeds of classification and scalability. As linear SVMs and
Na¨ıve Bayes classifiers work in linear time for classification
[25], removing 90% of features allows classifications to be
done about ten times faster. We have noted an order of
magnitude improvement in code change classification time.
This helps promote interactive use of the system within an
Integrated Development Environment.
C. Statistical Analysis of Results
While the above results appear compelling, it is necessary
to further analyze them to be sure that the results did not occur
due to statistical anomalies. There are a few possible sources
of anomaly.
• The 10-fold validation is an averaging of the results for
F-measure, accuracy, and all of the statistics presented. It is
possible that the variance of the cross validation folds is
high.
• Statistical analysis is needed to confirm that a better
performing classifier is indeed better after taking into
account cross validation variance.
To ease comparison of results against Kim et al. [28], the
cross validation random seed was set to the same value used by
them. With the same datasets (except for JCP) and the same
cross validation random seed, but with far better buggy F-
measure and accuracy results, the improvement over no feature
selection is straightforward to demonstrate.
Showing that the Naïve Bayes results are better than the
SVM results when comparing Tables V and VI is a more involved
process that is outlined below. For the steps below, the better
and worse classifiers are denoted respectively as cb and cw.
1) Increase the cross validation cycle to 100 runs of 10-fold
validation, with the seed being varied each run.
2) For each run, note the metric of interest, buggy F-
measure or accuracy. This process empirically generates
a distribution for the metric of interest, dmcb using the
better classifier cb.
3) Repeat steps 1, 2 for the worse classifier cw, attaining
dmcw.
4) Use a one-sided Kolmogorov-Smirnov test to show that
the population CDF of distribution dmcb is larger than
the CDF of dmcw at the 95% confidence level.
The seed is varied every run in order to change elements of
the train and test sets. Note that the seed for each run is the
same for both cw and cb.
In step four above, one can consider using a two sample
bootstrap test to show that the points of dmcb come from a
distribution with a higher mean than the points of dmcw [13].
However, the variance within 10-fold validation is the reason
for the statistical analysis. The Birnbaum-Tingey one-sided
contour analysis was used for the one-sided Kolmogorov-
Smirnov test. It takes variance into account and can indicate
if the CDF of one set of samples is larger than the CDF of the
TABLE V
NAÏVE BAYES (WITH SIGNIFICANCE ATTRIBUTE EVALUATION) ON THE OPTIMIZED FEATURE SET (BINARY)
Project Name | Features | Percentage of Features | Accuracy | Buggy Precision | Buggy Recall | Buggy F-measure | Clean Precision | Clean Recall | Clean F-measure | Buggy ROC AUC
APACHE 1098 6.25 0.93 0.99 0.63 0.77 0.92 1.00 0.96 0.86
COLUMBA 1088 6.25 0.89 1.00 0.62 0.77 0.86 1.00 0.93 0.82
ECLIPSE 505 3.12 0.98 1.00 0.78 0.87 0.98 1.00 0.99 0.94
GAIM 1160 12.50 0.87 1.00 0.66 0.79 0.83 1.00 0.91 0.83
GFORGE 2248 25 0.85 0.84 0.84 0.84 0.86 0.86 0.86 0.93
JCP 1310 3.12 0.96 1.00 0.80 0.89 0.95 1.00 0.97 0.90
JEDIT 867 6.25 0.90 1.00 0.73 0.85 0.86 1.00 0.93 0.87
MOZILLA 852 6.24 0.94 1.00 0.80 0.89 0.92 1.00 0.96 0.90
PLONE 191 3.12 0.93 1.00 0.62 0.77 0.92 1.00 0.96 0.83
PSQL 2905 12.50 0.90 1.00 0.59 0.74 0.88 1.00 0.94 0.81
SVN 464 3.12 0.95 0.87 0.68 0.76 0.95 0.98 0.97 0.83
Average 1153.45 7.95 0.92 0.97 0.70 0.81 0.90 0.99 0.94 0.87
TABLE VI
SVM (WITH GAIN RATIO) ON THE OPTIMIZED FEATURE SET (BINARY)
Project Name | Features | Percentage of Features | Accuracy | Buggy Precision | Buggy Recall | Buggy F-measure | Clean Precision | Clean Recall | Clean F-measure | Buggy ROC AUC
APACHE 549 3.12 0.89 0.87 0.51 0.65 0.90 0.98 0.94 0.75
COLUMBA 4352 25 0.79 0.62 0.71 0.66 0.87 0.82 0.84 0.76
ECLIPSE 252 1.56 0.96 1.00 0.61 0.76 0.96 1.00 0.98 0.81
GAIM 580 6.25 0.82 0.98 0.53 0.69 0.78 0.99 0.87 0.76
GFORGE 2248 25 0.84 0.96 0.67 0.79 0.78 0.98 0.87 0.82
JCP 1310 3.12 0.92 1.00 0.61 0.76 0.91 1.00 0.95 0.8
JEDIT 3469 25 0.79 0.74 0.67 0.70 0.81 0.86 0.83 0.76
MOZILLA 426 3.12 0.87 1.00 0.58 0.73 0.85 1.00 0.92 0.79
PLONE 191 3.12 0.89 0.98 0.44 0.60 0.88 1.00 0.93 0.72
PSQL 2905 12.50 0.85 0.92 0.40 0.56 0.84 0.99 0.91 0.7
SVN 464 3.12 0.92 0.80 0.56 0.66 0.94 0.98 0.96 0.77
Average 1522.36 10.08 0.87 0.90 0.57 0.69 0.87 0.96 0.91 0.77
other set [9], [37]. It also returns a p-value for the assertion.
The 95% confidence level was used.
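The comparison can be sketched in Python with scipy's two-sample Kolmogorov-Smirnov test; note that scipy's alternative parameter is phrased in terms of the first sample's CDF, so alternative="less" with the better classifier's F-measures passed first tests whether those values are stochastically larger. The array names are assumptions for illustration.

from scipy.stats import ks_2samp

def dominates(f1_better, f1_worse, alpha=0.05):
    """One-sided two-sample Kolmogorov-Smirnov test that the first sample's
    values are stochastically larger. In scipy's convention, alternative="less"
    states that the CDF of the first sample lies below that of the second,
    i.e. the first sample tends toward larger values."""
    statistic, p_value = ks_2samp(f1_better, f1_worse, alternative="less")
    return p_value < alpha, p_value

# dominates(f1_bayes_runs, f1_svm_runs), with one buggy F-measure per run of
# 10-fold validation for each classifier, checks dominance at the chosen level.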
Incidentally, step 1 is also similar to bootstrapping but
uses further runs of cross validation to generate more data
points. The goal is to ensure that the results of average
error estimation via k-fold validation are not undermined by
variance. Many more cross validation runs help generate an
empirical distribution.
This test was performed on every dataset’s buggy F-measure
and accuracy to compare the performance of the Naïve Bayes
classifier to the SVM. In most of the tests, the Naïve Bayes
dominated the SVM results in both accuracy and F-measure
at p < 0.001. A notable exception is the accuracy metric for
the Gforge project.
While Tables V and VI show that the Naïve Bayes classifier
has a 1% higher accuracy for Gforge, the empirical distribution
for Gforge’s accuracy indicates that this is true only at a p of
0.67, meaning that this is far from a statistically significant
result. The other projects and the F-measure results for Gforge
demonstrate the dominance of the Naïve Bayes results over the
SVM.
In spite of the results of Tables IV and V, it is not possible to
confidently state that a binary representation performs better
than count for both the Naïve Bayes and SVM classifiers
without performing a statistical analysis. Naïve Bayes using
binary features dominates over the performance of Naïve
Bayes with count features at a p < 0.001. The binary SVM's
dominance over the best performing count SVM with the same
feature selection technique (Gain Ratio) is also apparent, with
a p < 0.001 on the accuracy and buggy F-measure for most
projects, but lacking statistical significance on the buggy F-
measure of Gaim.
Fig. 1. Classifier F-measure by Project
D. Feature Sensitivity
Research Question 3. Feature Sensitivity. What is the per-
formance of change classification at varying percentages of
features? What is the F-measure of the best performing clas-
sifier when using just 1% of all project features?
Section V-B reported results from using the best feature
set chosen using a given optimization criteria, and showed
that Na¨ıve Bayes with Significance Attribute feature selection
performed best with 7.95% of the original feature set, and
SVM with Gain Ratio feature selection performed best at
10.08%. This is a useful result, since the reduced feature
set decreases prediction time as compared to Kim et al.
A buggy/clean decision is based on about a tenth of the
initial number of features. This raises the question of how
performance of change classification behaves with varying
percentages of features.
To explore this question, for each project we ran a modified
version of the feature selection process described in Algorithm
1, in which only 10% (instead of 50%) of features are removed
each iteration using Significance Attribute Evaluation. After
each feature selection step, buggy F-measure is computed
using a Na¨ıve Bayes classifier.
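The sensitivity loop can be outlined as follows; this is a sketch under stated assumptions rather than the actual experimental code, and the helpers rank_features and buggy_f1 are hypothetical placeholders for the Significance Attribute ranker and the cross-validated buggy F-measure computation.

```python
# Sketch of the 10%-per-iteration sensitivity experiment described above.
# Assumptions: rank_features(X, y) returns column indices ordered best-first,
# and buggy_f1(clf, X, y) returns a cross-validated buggy F-measure.
from sklearn.naive_bayes import BernoulliNB

def sensitivity_curve(X, y, rank_features, buggy_f1, drop_fraction=0.10):
    features = list(range(X.shape[1]))
    curve = []
    while True:
        curve.append((len(features), buggy_f1(BernoulliNB(), X[:, features], y)))
        if len(features) == 1:
            break
        ranked = rank_features(X[:, features], y)                 # best first
        keep = max(1, int(len(features) * (1 - drop_fraction)))   # drop ~10%
        features = [features[i] for i in ranked[:keep]]
    return curve  # list of (feature count, buggy F-measure) points
```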
The graph of buggy F-measure vs features, figure 2, follows
a similar pattern for all projects. As the number of features decreases, the curve shows a steady increase in buggy F-
measure. Some projects temporarily dip before an increase
in F-measure occurs. Performance can increase with fewer
features due to noise reduction but can also decrease with
fewer features due to important features missing. The dip
in F-measure that typically reverts to higher F-measure can
be explained by the presence of correlated features which
are removed in later iterations. While correlated features are
present at the early stages of feature selection, their influence
is limited by the presence of a large number of features. They
have a more destructive effect towards the middle before being
removed.
Following the curve in the direction of fewer features, most
projects then stay at or near their maximum F-measure down to
single digit percentages of features. This is significant, since it
suggests a small number of features might produce results that
are close to those obtained using more features. An explanation is offered by Menzies et al. [39], who state that a small
number of features with different types of information can
sometimes capture more information than many features of
the same type. In the experiments of the current paper there
are many feature types. Fewer features bring two benefits: a
reduction in memory footprint and an increase in classification
speed.
A notable exception is the Gforge project. When trimming 50% of features at a time, Gforge's optimal F-measure point (at around 35%) was missed. Feature selection
is still useful for Gforge but the optimal point seems to be a
bit higher than for the other projects.
In practice, one will not know a priori the best percentage
of features to use for bug prediction. Empirically from figure
2, a good starting point for bug prediction is at 15% of the
total number of features for a project. One can certainly locate counterexamples, including Gforge from this paper. Nevertheless, 15% of features is a reasonable practical recommendation if no additional information is available for a project.
Fig. 2. Buggy F-measure vs Features using Na¨ıve Bayes
To better characterize performance at low numbers of
features, accuracy, buggy precision, and buggy recall are
computed for all projects using just 1% of features (selected
using the Significance Attribute Evaluation process). Results
are presented in Table VII. When using 1% of the overall features, it is still possible to achieve high buggy precision,
but at the cost of somewhat lower recall. Taking Apache as
an example, using the F-measure optimal number of project
features (6.25%, 1098 total) achieves buggy precision of 0.99
and buggy recall of 0.63 (Table V), while using 1% of all
features yields buggy precision of 1 and buggy recall of 0.46
(Table VII).
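For intuition, the 1% configuration can be approximated with an off-the-shelf pipeline such as the sketch below; scikit-learn's chi-squared ranker is used only as a stand-in for the Significance Attribute Evaluation filter, and the matrix X and labels y are synthetic placeholders rather than the project corpora used here.

```python
# Rough sketch: keep the top 1% of features, then train Naive Bayes and
# report the cross-validated buggy (positive class) F-measure.
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a binary keyword-feature matrix and buggy/clean labels.
X = sparse_random(2000, 10000, density=0.01, format="csr", random_state=0)
y = np.random.RandomState(0).randint(0, 2, size=2000)   # 1 = buggy (assumed)

clf = make_pipeline(SelectPercentile(chi2, percentile=1), BernoulliNB())
print(cross_val_score(clf, X, y, cv=10, scoring="f1").mean())
```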
These results indicate that a small number of features have
strong predictive power for distinguishing between buggy and
clean changes. An avenue of future work is to explore the
degree to which these features have a causal relationship with
bugs. If a strong causal relationship exists, it might be possible
to use these features as input to a static analysis of a software
project. Static analysis violations spanning those keywords can
then be prioritized for human review.
A natural question that follows is the breakdown of the top
features. A software developer might ask which code attributes
are the most effective predictors of bugs. The next section
deals with an analysis of the top 100 features of each project.
TABLE VII
NA¨IVE BAYES WITH 1% OF ALL FEATURES
Project Name | Features | Accuracy | Buggy Precision | Buggy Recall
APACHE 175 0.90 1.00 0.46
COLUMBA 174 0.82 1.00 0.38
ECLIPSE 161 0.97 1.00 0.69
GFORGE 89 0.46 0.46 1.00
JCP 419 0.92 1.00 0.61
JEDIT 138 0.81 1.00 0.48
MOZILLA 136 0.88 1.00 0.60
PLONE 61 0.89 1.00 0.45
PSQL 232 0.84 1.00 0.34
SVN 148 0.93 1.00 0.43
GAIM 92 0.79 1.00 0.45
Average 165.91 0.84 0.95 0.54
E. Breakdown of Top 100 Features
Research Question 4. Best Performing Features. Which
classes of features are the most useful for performing bug
predictions?
Table VIII provides a breakdown by category of the 100
most prominent features in each project. The top three types
are purely keyword related. Adding further occurrences of
a keyword to a file has the highest chance of creating a
bug, followed by deleting a keyword, followed finally by
introducing an entirely new keyword to a file. Changelog
features are the next most prominent set of features. These
are features based on the changelog comments for a particular
code change. Finally, filename features are present within the
top 100 for a few of the projects. These features are generated
from the actual names of files in a project.
It is somewhat surprising that complexity features generally do not make the top 100 feature list. The Apache project is the lone exception, with 3 complexity features in its top 100; these are not presented in Table VIII because Apache is the only project where complexity metrics proved useful. Incidentally, the top individual feature for the Apache project is the average length of the comment lines in a code change.
The absence of metadata features in the top 100 list is also
surprising. Author name, commit hour, and commit day do not
seem to be significant in predicting bugs. The commit hour and
commit day were converted to binary form before classifier
analysis. It is possible that encoding new features using a
combination of commit hour and day such as “Mornings”,
“Evenings”, or “Weekday Nights” might provide more bug
prediction value. An example would be if the hour 22 is
not a good predictor of buggy code changes but hours 20-24
are jointly a strong predictor of buggy changes. In this case,
encoding “Nights” as hours 20-24 would yield a new feature
that is more effective at predicting buggy code changes.
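A hypothetical encoding along these lines is sketched below; the bucket names and hour boundaries are illustrative assumptions and were not evaluated in this study.

```python
# Hypothetical derived metadata features that combine commit hour and day.
# Boundaries (e.g. "night" = hours 20-23) are assumptions for illustration.
WEEKEND = {"Saturday", "Sunday"}

def time_bucket_features(commit_hour, commit_day):
    night = 20 <= commit_hour <= 23
    return {
        "morning": int(5 <= commit_hour <= 11),
        "night": int(night),
        "weekday_night": int(night and commit_day not in WEEKEND),
    }

print(time_bucket_features(22, "Tuesday"))
# {'morning': 0, 'night': 1, 'weekday_night': 1}
```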
Five of the eleven projects (Gforge, JCP, Plone, PSQL, and SVN) had BOW+ features among their top features. The Na¨ıve Bayes optimized feature set (Table V) contains BOW+ features for those projects with the addition of Gaim and Mozilla, a total of seven out of eleven projects. BOW+ features thus add predictive value.
TABLE VIII
TOP 100 BUG PREDICTING FEATURES
Project Name | Added Delta | Deleted Delta | New Revision Source | Changelog Features | Filename Features
APACHE 49 43 0 0 5
COLUMBA 26 18 50 4 2
ECLIPSE 34 40 26 0 0
GAIM 55 40 3 2 0
GFORGE 23 30 40 6 1
JCP 39 25 36 0 0
JEDIT 50 29 17 3 1
MOZILLA 67 9 14 10 0
PLONE 50 23 19 6 2
PSQL 44 34 22 0 0
SVN 48 42 1 9 0
Average 44.09 30.27 20.72 3.63 1
F. Algorithm Runtime Analysis
The classifiers used in the study consist of linear SVM and
Na¨ıve Bayes. The theoretical training time for these classifiers
has been analyzed in the research literature. Traditionally,
Na¨ıve Bayes is faster to train than a linear SVM. The Na¨ıve
Bayes classifier has a training runtime of O(s ∗ n) where
s is the number of non-zero features and n is the number
of training instances. The SVM traditionally has a runtime
of approximately O(s ∗ n^2.1) [25]. There is recent work in reducing the runtime of linear SVMs to O(s ∗ n) [25].
Optimizing the classifiers for F-measure slows both down. In our experiments, training an F-measure optimized Na¨ıve Bayes classifier was faster than doing so for the SVM, though the speeds were comparable when the number of features was low. The current paper uses Liblinear [15], one of the
fastest implementations of a linear SVM. If a slower SVM
package were used, the performance gap would be far wider.
The change classification time for both classifiers is linear in the number of features retained after training.
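The kind of setup implied here can be sketched with scikit-learn, whose LinearSVC is backed by the Liblinear library; the sparse matrix dimensions below are arbitrary placeholders, not the datasets of this paper.

```python
# Sketch: training a Liblinear-backed linear SVM and a Bernoulli Naive Bayes
# classifier on a large, sparse (mostly zero) feature matrix.
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB

X = sparse_random(5000, 20000, density=0.005, format="csr", random_state=1)
y = np.random.RandomState(1).randint(0, 2, size=5000)

LinearSVC(C=1.0).fit(X, y)   # linear SVM via the Liblinear library
BernoulliNB().fit(X, y)      # Naive Bayes trains in time linear in non-zeros
```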
An important issue is the speed of classification on the projects without any feature selection. While the SVM and Na¨ıve Bayes algorithms can process the full feature set without failure, training times and memory requirements are considerable. Feature selection reduces classification time by roughly an order of magnitude, since less than 10% of features are typically needed for improved classification performance. This makes the classification of individual code changes scalable.
In a rough wall-clock analysis, the time required to classify a single code change improved from a few seconds to a few hundred milliseconds when about 10% of features are used (on a 2.33 GHz Intel Core 2 Quad CPU with 6 GB of RAM).
Lowered RAM requirements allow multiple trained classifiers
to operate simultaneously without exceeding the amount of
physical RAM present in a change classification server. The
combined impact of reduced classifier memory footprint and
reduced classification time will permit a server-based classifi-
cation service to support substantially more classifications per
user.
The top performing filter methods themselves are quite fast.
Significance Attribute Evaluation and Gain Ratio also operate
linearly with respect to the number of training instances
multiplied by the number of features. Rough wall-clock times
show 4-5 seconds for the feature ranking process (with a 2.33 GHz Intel Core 2 Quad CPU and 6 GB of RAM). It is
only necessary to do this computation once when using these
techniques within the feature selection process, algorithm 1.
VI. RELATED WORK
Given a software project containing a set of program units
(files, classes, methods or functions, or changes depending on
prediction technique and language), a bug prediction algorithm
outputs one of the following.
Totally Ordered Program Units. A total ordering of program
units from most to least bug prone [26] using an ordering
metric such as predicted bug density for each file [44]. If
desired, this can be used to create a partial ordering (see
below).
Partially Ordered Program Units. A partial ordering of
program units into bug-prone categories (e.g. the top N% most bug-prone files in [21], [29], [44]).
Prediction on a Given Software Unit. A prediction on
whether a given software unit contains a bug. Prediction
granularities range from an entire file or class [19], [22] to
a single change (e.g., Change Classification [28]).
Change classification and faulty program unit detection
techniques both aim to locate software defects, but differ
in scope and resolution. While change classification focuses
on changed entities, faulty module detection techniques do
not need a changed entity. The pros of change classification
include the following:
• Bug pinpointing at a low level of granularity, typically
about 20 lines of code.
• Possibility of IDE integration, enabling a prediction im-
mediately after code is committed.
• Understandability of buggy code prediction from the
perspective of a developer.
The cons include the following:
• Excess of features to account for when including key-
words.
• Failure to address defects not related to recent file
changes.
• Inability to organize a set of modules by likelihood of
being defective.
Related work in each of these areas is detailed below. As
this paper employs feature selection, related work on feature
selection techniques is then addressed. This is followed by a
comparison of the current paper’s results against similar work.
A. Totally Ordered Program Units
Khoshgoftaar and Allen have proposed a model to list
modules according to software quality factors such as future
fault density [26], [27]. The inputs to the model are software
complexity metrics such as LOC, number of unique operators,
and cyclomatic complexity. A stepwise multiregression is then
performed to find weights for each factor. Briand et al. use object-oriented metrics to predict classes which are likely to contain bugs, applying PCA in combination with logistic regression to predict fault-prone classes [10]. Morasca et al.
use rough set theory and logistic regression to predict risky
modules in commercial software [42]. Key inputs to their
model include traditional metrics such as LOC and code block properties, in addition to subjective metrics such as module knowledge. Mockus and Weiss predict risky modules in soft-
ware by using a regression algorithm and change measures
such as the number of systems touched, the number of modules
touched, the number of lines of added code, and the number
of modification requests [41]. Ostrand et al. identify the top 20
percent of problematic files in a project [44]. Using future fault
predictors and a negative binomial linear regression model,
they predict the fault density of each file.
B. Partially Ordered Program Units
The previous section covered work which is based on total
ordering of all program modules. This could be converted into
a partially ordered program list, e.g. by presenting the top N%
of modules, as performed by Ostrand et al. above. This section
deals with work that can only return a partial ordering of bug
prone modules. Hassan and Holt use a caching algorithm to
compute the set of fault-prone modules, called the top-10 list
[21]. They use four factors to determine this list: software units
that were most frequently modified, most recently modified,
most frequently fixed, and most recently fixed. A recent
study by Shihab et al. [48] investigates the impact of code
and process metrics on future defects when applied to the Eclipse project. The focus is on reducing the metrics to reach a much smaller, though still statistically significant, set of metrics for computing potentially defective modules. Kim et al. proposed
the bug cache algorithm to predict future faults based on
previous fault localities [29]. In an empirical study of partially
ordered faulty modules, Lessmann et al. [33] conclude that
the choice of classifier may have a less profound impact on
the prediction than previously thought, and recommend the
ROC AUC as an accuracy indicator for comparative studies in
software defect prediction.
C. Prediction on a Given Software Unit
Using decision trees and neural networks that employ
object-oriented metrics as features, Gyimothy et al. [19] pre-
dict faulty classes of the Mozilla project across several releases.
Their buggy precision and recall are both about 70%, resulting
in a buggy F-measure of 70%. Our buggy precision for the
Mozilla project is around 100% (+30%) and recall is at 80%
(+10%), resulting in a buggy F-measure of 89% (+19%). In
addition they predict faults at the class level of granularity
(typically by file), while our level of granularity is by code
change.
Aversano et al. [4] achieve 59% buggy precision and recall
using KNN (K nearest neighbors) to locate faulty modules.
Hata et al. [22] show that a technique used for spam filtering
of emails can be successfully used on software modules to
classify software as buggy or clean. However, they classify
static code (such as the current contents of a file), while
our approach classifies file changes. They achieve 63.9%
precision, 79.8% recall, and 71% buggy F-measure on the best
data points of source code history for 2 Eclipse plugins. We
obtain buggy precision, recall, and F-measure figures of 100%
(+36.1%), 78% (-1.8%) and 87% (+16%), respectively, with
our best performing technique on the Eclipse project (Table
V). Menzies et al. [39] achieve good results on their best
projects. However, their average precision is low, ranging from a minimum of 2%, through a median of 20%, to a maximum of 70%. As
mentioned in section III, to avoid optimizing on precision or
recall, we present F-measure figures. A commonality we share
with the work of Menzies et al. is their use of Information Gain
(quite similar to the Gain Ratio that we use, as explained in
section II-C) to rank features. Both Menzies and Hata focus
on the file level of granularity.
Kim et al. show that using support vector machines on
software revision history information can provide an average
bug prediction accuracy of 78%, a buggy F-measure of 60%,
and a precision and recall of 60% when tested on twelve
open source projects [28]. Our corresponding results are an
accuracy of 92% (+14%), a buggy F-measure of 81% (+21%),
a precision of 97% (+37%), and a recall of 70% (+10%). Elish
and Elish [14] also used SVMs to predict buggy modules in
software. Table IX compares our results with that of earlier
work. Hata, Aversano, and Menzies did not report overall
accuracy in their results and focused on precision and recall
results.
Recently, D’Ambros et al. [12] provided an extensive com-
parison of various bug prediction algorithms that operate at
the file level using ROC AUC to compare algorithms. Wrapper
Subset Evaluation is used sequentially to trim attribute size.
They find that the top ROC AUC of 0.921 was achieved for
the Eclipse project using the prediction approach of Moser
et al. [43]. As a comparison, feature selection and change classification in the current paper achieve an ROC AUC of 0.94 for Eclipse. While it is not a
significant difference, it does show that change classification
after feature selection can provide results comparable to those
of Moser et al. which are based on code metrics such as code
churn, past bugs, refactorings, number of authors, file age, etc.
An advantage of our change classification approach over that
of Moser et al. is that it operates at the granularity of code
changes, which permits developers to receive faster feedback
on small regions of code. As well, since the top features are keywords (Section V-E), it is easier to explain to developers why a code change was diagnosed as buggy: a diagnosis in terms of specific keyword combinations is more understandable than one in terms of combinations of source code metrics.
Challagulla et al. investigate machine learning algorithms
to identify faulty real-time software modules [11]. They find
that predicting the number of defects in a module is much
harder than predicting whether a module is defective. They
achieve best results using the Na¨ıve Bayes classifier. They
conclude that “size” and “complexity” metrics are not suffi-
cient attributes for accurate prediction.
The next section moves on to work focusing on feature
selection.
D. Feature Selection
Hall and Holmes [20] compare six different feature selection
techniques when using the Na¨ıve Bayes and the C4.5 classifier
[45]. Each dataset analyzed has about one hundred features.
The method and analysis based on iterative feature selection
used in this paper is different from that of Hall and Holmes
in that the present work involves substantially more features, drawn from a different type of corpus (software project histories). Many of the feature selection techniques
used by Hall and Holmes are used in this paper.
Song et al. [51] propose a general defect prediction frame-
work involving a data preprocessor, feature selection, and
learning algorithms. They also note that small changes to
data representation can have a major impact on the results
of feature selection and defect prediction. This paper uses
a different experimental setup due to the large number of
features. However, some elements of this paper were adopted
from Song et al. including cross validation during the feature
ranking process for filter methods. In addition, the Na¨ıve
Bayes classifier was used, and the J48 classifier was attempted.
The results for the latter are mentioned in Section VII. Their
suggestion of using a log pre-processor is hard to adopt for
keyword changes and cannot be done on binary interpretations
of keyword features. Forward and backward feature selection
one attribute at a time is not a practical solution for large
datasets.
Gao et al. [18] apply several feature selection algorithms to
predict defective software modules for a large legacy telecom-
munications software system. Seven filter-based techniques
and three subset selection search algorithms are employed.
Removing 85% of software metric features does not adversely affect results, and in some cases improves them. The current paper uses keyword information as features instead of product, process, and execution metrics, and also uses historical data. Finally, eleven systems are examined here, compared to four in that work. Similar to that work, however, a variety of feature selection techniques is applied, but on a far larger set of attributes.
VII. THREATS TO VALIDITY
There are six major threats to the validity of this study.
Systems examined might not be representative of typical
projects.
Eleven systems are examined, a comparatively high number relative to other work reported in the literature. In spite of this, it is still
possible that we accidentally chose systems that have better
(or worse) than average bug classification accuracy. Since we
intentionally chose systems that had some degree of linkage
between change tracking systems and change log text (to
determine fix inducing changes), there is a project selection
bias.
Systems are mostly open source.
With the exception of JCP, the systems examined in this paper use an open source development methodology, and hence might not be representative of typical development contexts, potentially affecting external validity [52]. It is possible that greater deadline pressures, differing personnel turnover patterns, and varied development processes used in commercial development could lead to different buggy change patterns.
TABLE IX
COMPARISONS OF OUR APPROACH TO RELATED WORK
Authors | Project(s) | Accuracy | Buggy F-measure | Buggy Precision | Buggy Recall | Granularity
Hata et al.a [22] Eclipse Plugins - 71 64 80 File
Shivaji et al. Eclipse 98 87 100 78 Code Change
Gyimothy et al. [19] Mozilla 73 70 70 70 File
Shivaji et al. Mozilla 94 89 100 80 Code Change
Aversano et al. [4] Multiple - 59b 59 59b File
Menzies et al. [39] Multiple - 74b 70b 79 File
Kim et al. [28] Multiple 78 60 60 60 Code Change
Shivaji et al. Multiple 92 81 97 70 Code Change
a Result is from the two best historical data points.
b Results presented are the best project results.
Bug fix data is incomplete.
Even though we selected projects that have decent change
logs, we are still only able to extract a subset of the total number of bugs (typically only 40%-60% of those reported in the bug tracking system). Since the quality of change log
comments varies across projects, it is possible that the output
of the classification algorithm will include false positives and false negatives. Recent research by Bachmann et al. focusing
on the Apache system is starting to shed light on the size of
this missing data [5]. The impact of such missing data has been explored by Bird et al., who find that in the presence of missing data, the
Bug Cache prediction technique [30] is biased towards finding
less severe bug types [8].
Bug introducing data is incomplete.
The SZZ algorithm used to identify bug-introducing changes
has limitations: it cannot find bug introducing changes for bug
fixes that only involve deletion of source code. It also cannot
identify bug-introducing changes caused by a change made to
a file different from the one being analyzed. It is also possible
to miss bug-introducing changes when a file changes its name,
since these are not tracked.
Selected classifiers might not be optimal.
We explored many other classifiers, and found that Na¨ıve
Bayes and SVM consistently returned the best results. Other
popular classifiers include decision trees (e.g. J48) and JRip. The average buggy F-measure for the projects surveyed in this paper, using J48 and JRip with feature selection, was 51% and 48% respectively. Though reasonable results, they are not as
strong as those for Na¨ıve Bayes and SVM, the focus of this
paper.
Feature Selection might remove features which become
important in the future.
Feature selection was employed in this paper to remove
features in order to optimize performance and scalability. The
feature selection techniques used ensured that less important
features were removed. Better performance did result from
the reduction. Nevertheless, it might turn out that in future
revisions, previously removed features become important and
their absence might lower prediction quality.
VIII. CONCLUSION
This paper has explored the use of feature selection tech-
niques to predict software bugs. An important pragmatic result
is that feature selection can be performed in increments of
half of all remaining features, allowing it to proceed quickly.
Between 3.12% and 25% of the total feature set yielded
optimal classification results. The reduced feature set permits
better and faster bug predictions.
The feature selection process presented in section II-D was empirically applied to eleven software projects. The process is fast and can be applied to predicting bugs on other projects. The most important results stemming from using the feature selection process are found in Table V, which presents F-measure optimized results for the Na¨ıve Bayes classifier. The average buggy precision is quite high at 0.97, with a reasonable recall of 0.70. This result outperforms similar classifier-based bug prediction techniques, and paves the way for practical adoption of classifier-based bug prediction.
From the perspective of a developer receiving bug predic-
tions on their work, these figures mean that if the classifier
says a code change has a bug, it is almost always right. The
recall figures mean that on average 30% of all bugs will not
be detected by the bug predictor. This is likely a fundamental
limitation of history-based bug prediction, as there might be
new types of bugs that have not yet been incorporated into
the training data. We believe this represents a reasonable
tradeoff, since increasing recall would come at the expense of
more false bug predictions, not to mention a decrease in the
aggregate buggy F-measure figure. Such predictions can waste
developer time and reduce their confidence in the system.
In the future, when software developers have advanced bug
prediction technology integrated into their software develop-
ment environment, the use of classifiers with feature selection
will permit fast, precise, and accurate bug predictions. The age
of bug-predicting imps will have arrived.
ACKNOWLEDGMENTS
The authors would like to thank the anonymous reviewers,
Dr Tigran Ishkhanov, and Shyamal Chandra of IBM Almaden
for their helpful comments.
REFERENCES
[1] A. Ahmad and L. Dey. A Feature Selection Technique for Classificatory
Analysis. Pattern Recognition Letters, 26(1):43–56, 2005.
[2] E. Alpaydin. Introduction To Machine Learning. MIT Press, 2004.
[3] A. Anagnostopoulos, A. Broder, and K. Punera. Effective and Efficient
Classification on a Search-Engine Model. Proc. 15th ACM Int’l Conf.
on Information and Knowledge Management, Jan 2006.
[4] L. Aversano, L. Cerulo, and C. Del Grosso. Learning from Bug-
introducing Changes to Prevent Fault Prone Code. In Proceedings of
the Foundations of Software Engineering, pages 19–26. ACM New York,
NY, USA, 2007.
[5] A. Bachmann, C. Bird, F. Rahman, P. Devanbu, and A. Bernstein. The
Missing Links: Bugs and Bug-Fix Commits. In Proceedings of the
eighteenth ACM SIGSOFT international symposium on Foundations of
software engineering, FSE ’10, pages 97–106, New York, NY, USA,
2010. ACM.
[6] A. Bessey, K. Block, B. Chelf, A. Chou, B. Fulton, S. Hallem, C. Henri-
Gros, A. Kamsky, S. McPeak, and D. R. Engler. A few billion lines of
code later: using static analysis to find bugs in the real world. Commun.
ACM, 53(2):66–75, 2010.
[7] J. Bevan, E. Whitehead Jr, S. Kim, and M. Godfrey. Facilitating Software
Evolution Research with Kenyon. Proc. ESEC/FSE 2005, pages 177–
186, 2005.
[8] C. Bird, A. Bachmann, E. Aune, J. Duffy, A. Bernstein, V. Filkov,
and P. Devanbu. Fair and Balanced?: Bias in Bug-Fix Datasets. In
Proceedings of the the 7th joint meeting of the European software
engineering conference and the ACM SIGSOFT symposium on The
foundations of software engineering, ESEC/FSE ’09, pages 121–130,
New York, NY, USA, 2009. ACM.
[9] Z. Birnbaum and F. Tingey. One-sided Confidence Contours for Prob-
ability Distribution Functions. The Annals of Mathematical Statistics,
pages 592–596, 1951.
[10] L. C. Briand, J. Wiist, S. V. Ikonomovski, and H. Lounis. Investigating
Quality Factors in Object-oriented Designs: An Industrial Case Study.
Software Engineering, International Conference on, 0:345, 1999.
[11] V. Challagulla, F. Bastani, I. Yen, and R. Paul. Empirical Assessment
of Machine Learning Based Software Defect Prediction Techniques. In
Object-Oriented Real-Time Dependable Systems, 2005. WORDS 2005.
10th IEEE International Workshop on, pages 263–270. IEEE, 2005.
[12] M. D'Ambros, M. Lanza, and R. Robbes. Evaluating Defect Prediction
Approaches: a Benchmark and an Extensive Comparison. Empirical
Software Engineering, pages 1–47, 2011.
[13] B. Efron and R. Tibshirani. An Introduction to the Bootstrap, volume 57.
Chapman & Hall/CRC, 1993.
[14] K. Elish and M. Elish. Predicting Defect-Prone Software Modules Using
Support Vector Machines. Journal of Systems and Software, 81(5):649–
660, 2008.
[15] R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin. Liblinear: A Library
for Large Linear Classification. The Journal of Machine Learning
Research, 9:1871–1874, 2008.
[16] T. Fawcett. An Introduction to ROC Analysis. Pattern Recognition
Letters, 27(8):861–874, 2006.
[17] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical
Learning, volume 1. Springer Series in Statistics, 2001.
[18] K. Gao, T. Khoshgoftaar, H. Wang, and N. Seliya. Choosing Software
Metrics for Defect Prediction: an Investigation on Feature Selection
Techniques. Software: Practice and Experience, 41(5):579–606, 2011.
[19] T. Gyim´othy, R. Ferenc, and I. Siket. Empirical Validation of Object-
Oriented Metrics on Open Source Software for Fault Prediction. IEEE
Trans. Software Eng., 31(10):897–910, 2005.
[20] M. Hall and G. Holmes. Benchmarking Attribute Selection Techniques
for Discrete Class Data Mining. Knowledge and Data Engineering,
IEEE Transactions on, 15(6):1437–1447, 2003.
[21] A. Hassan and R. Holt. The Top Ten List: Dynamic Fault Prediction.
Proc. ICSM’05, Jan 2005.
[22] H. Hata, O. Mizuno, and T. Kikuno. An Extension of Fault-prone
Filtering using Precise Training and a Dynamic Threshold. Proc. MSR
2008, 2008.
[23] S. T. http://www.scitools.com/. Maintenance, Understanding, Metrics
and Documentation Tools for Ada, C, C++, Java, and Fortran. 2005.
[24] T. Joachims. Text Categorization with Support Vector Machines:
Learning with Many Relevant Features. Machine Learning: ECML-98,
pages 137–142, 1998.
[25] T. Joachims. Training Linear SVMs in Linear Time. In Proceedings
of the 12th ACM SIGKDD international conference on Knowledge
discovery and data mining, page 226. ACM, 2006.
[26] T. Khoshgoftaar and E. Allen. Predicting the Order of Fault-Prone
Modules in Legacy Software. Proc. 1998 Int’l Symp. on Software
Reliability Eng., pages 344–353, 1998.
[27] T. Khoshgoftaar and E. Allen. Ordering Fault-Prone Software Modules.
Software Quality J., 11(1):19–37, 2003.
[28] S. Kim, E. W. Jr., and Y. Zhang. Classifying Software Changes: Clean
or Buggy? IEEE Trans. Software Eng., 34(2):181–196, 2008.
[29] S. Kim, T. Zimmermann, E. W. Jr., and A. Zeller. Predicting Faults
from Cached History. Proc. ICSE 2007, pages 489–498, 2007.
[30] S. Kim, T. Zimmermann, E. Whitehead Jr, and A. Zeller. Predicting
Faults from Cached History. In Proceedings of the 29th international
conference on Software Engineering, pages 489–498. IEEE Computer
Society, 2007.
[31] I. Kononenko. Estimating Attributes: Analysis and Extensions of relief.
In Machine Learning: ECML-94, pages 171–182. Springer, 1994.
[32] B. Larsen and C. Aone. Fast and Effective Text Mining using Linear-
time Document Clustering. In Proceedings of the fifth ACM SIGKDD
international conference on Knowledge discovery and data mining,
pages 16–22. ACM New York, NY, USA, 1999.
[33] S. Lessmann, B. Baesens, C. Mues, and S. Pietsch. Benchmarking
Classification Models for Software Defect Prediction: A Proposed
Framework and Novel Findings. IEEE Trans. on Software Eng.,
34(4):485–496, 2008.
[34] D. Lewis. Naive (Bayes) at Forty: The Independence Assumption in
Information Retrieval. Machine Learning: ECML-98, pages 4–15, 1998.
[35] H. Liu and H. Motoda. Feature Selection for Knowledge Discovery and
Data Mining. Springer, 1998.
[36] J. Madhavan and E. Whitehead Jr. Predicting Buggy Changes Inside an
Integrated Development Environment. Proc. 2007 Eclipse Technology
eXchange, 2007.
[37] F. Massey Jr. The Kolmogorov-Smirnov Test for Goodness of Fit.
Journal of the American Statistical Association, pages 68–78, 1951.
[38] A. McCallum and K. Nigam. A Comparison of Event Models for Naive
Bayes Text Classification. AAAI-98 Workshop on Learning for Text
Categorization, Jan 1998.
[39] T. Menzies, J. Greenwald, and A. Frank. Data Mining Static Code
Attributes to Learn Defect Predictors. IEEE Trans. Software Eng.,
33(1):2–13, 2007.
[40] A. Mockus and L. Votta. Identifying Reasons for Software Changes
using Historic Databases. Proc. ICSM 2000, page 120, 2000.
[41] A. Mockus and D. Weiss. Predicting Risk of Software Changes. Bell
Labs Tech. Journal, 5(2):169–180, 2000.
[42] S. Morasca and G. Ruhe. A Hybrid Approach to Analyze Empirical
Software Engineering Data and its Application to Predict Module Fault-
proneness in Maintenance. J. Syst. Softw., 53(3):225–237, 2000.
[43] R. Moser, W. Pedrycz, and G. Succi. A Comparative Analysis of the
Efficiency of Change Metrics and Static Code Attributes for Defect
Prediction. In Software Engineering, 2008. ICSE’08. ACM/IEEE 30th
International Conference on, pages 181–190. IEEE, 2008.
[44] T. Ostrand, E. Weyuker, and R. Bell. Predicting the Location and
Number of Faults in Large Software Systems. IEEE Trans. Software
Eng., 31(4):340–355, 2005.
[45] J. Quinlan. C4. 5: Programs for Machine Learning. Morgan Kaufmann,
1993.
[46] M. Robnik-ˇSikonja and I. Kononenko. Theoretical and Empirical
Analysis of Relieff and RRelieff. Machine learning, 53(1):23–69, 2003.
[47] S. Scott and S. Matwin. Feature Engineering for Text Classification.
Machine Learning-International Workshop, pages 379–388, 1999.
[48] E. Shihab, Z. Jiang, W. Ibrahim, B. Adams, and A. Hassan. Understand-
ing the Impact of Code and Process Metrics on Post-Release Defects:
a Case Study on the Eclipse Project. Proc. ESEM 2010, pages 1–10,
2010.
[49] S. Shivaji, E. Whitehead Jr, R. Akella, and S. Kim. Reducing Features to
Improve Bug Prediction. In 2009 IEEE/ACM International Conference
on Automated Software Engineering, pages 600–604. IEEE, 2009.
[50] J. Sliwerski, T. Zimmermann, and A. Zeller. When Do Changes Induce
Fixes? Proc. MSR 2005, pages 24–28, 2005.
[51] Q. Song, Z. Jia, M. Shepperd, S. Ying, and J. Liu. A General Software
Defect-proneness Prediction Framework. Software Engineering, IEEE
Transactions on, (99):356–370, 2011.
[52] H. K. Wright, M. Kim, and D. E. Perry. Validity Concerns in Software
Engineering Research. In Proceedings of the FSE/SDP workshop on
Future of software engineering research, FoSER ’10, pages 411–414,
New York, NY, USA, 2010. ACM.
[53] T. Zimmermann and P. Weißgerber. Preprocessing CVS Data for Fine-
Grained Analysis. Proc. MSR 2004, pages 2–6, 2004.
Shivkumar Shivaji is a PhD student in the Depart-
ment of Computer Science, University of California,
Santa Cruz. His research interests include software
evolution, software bug prediction, and machine
learning. He has worked for large software firms
such as eBay, Yahoo!, and IBM along with smaller
firms as a senior engineer/architect. He is also a
chess master.
E. James Whitehead, Jr. is a professor in the De-
partment of Computer Science, University of Cali-
fornia, Santa Cruz. His research interests in software
engineering include software evolution and software
bug prediction, and his research interests in com-
puter games are intelligent design tools, procedural
content generation, and level design. He received his
BS in Electrical Engineering from the Rensselaer
Polytechnic Institute, and his PhD in Information
and Computer Science from the University of Cali-
fornia, Irvine.
Sunghun Kim is an assistant professor of computer
science at the Hong Kong University of Science and
Technology. His core research area is software en-
gineering, focusing on software evolution, program
analysis, and empirical studies. He received the PhD
degree from the Computer Science Department at the
University of California, Santa Cruz in 2006. He was
a postdoctoral associate at the Massachusetts Insti-
tute of Technology and a member of the Program
Analysis Group. He was a chief technical officer
(CTO), and led a 25-person team for six years at
Nara Vision Co. Ltd., a leading Internet software company in Korea.
Ram Akella is currently a professor and director
of the Center for Large-scale Live Analytics and
Smart Services (CLASS), and Center for Knowl-
edge, Information Systems and Technology Man-
agement (KISMT) at the University of California,
Silicon Valley Center. He held postdoctoral appoint-
ments at Harvard and MIT (EECS/LIDS). He joined
Carnegie Mellon University in 1985 as an Associate
Professor in the Business School and the School of
Computer Science, prior to appointments at schools
including MIT, Berkeley, and Stanford. He has been
a professor of Engineering, Computer Science, Management, Information, and
Medicine. He has established an ORU, and the TIM program at UCSC/SVC
as Founding Director. His research interests include Data/Text Analytics,
Mining, Search, and Recommender Systems. He has worked with over 200
firms, including AOL (Faculty Award), Cisco, Google (Research Award), IBM
(Faculty Award), and SAP. He has been on editorial boards and program
committees of the ACM, IEEE, INFORMS, and IIE. His students are endowed
or Department Chairs at major universities, or executives (including CEOs) of
major firms. His research has achieved transformational impact on industry,
and he has received stock from executives.

Mais conteúdo relacionado

Destaque

Grammar translation method
Grammar translation methodGrammar translation method
Grammar translation method
Ahmed Adem
 
Aku tdk lbh dulu ke surga
Aku tdk lbh dulu ke surgaAku tdk lbh dulu ke surga
Aku tdk lbh dulu ke surga
hernowo14
 
体系和才华,一半海水一半火焰
体系和才华,一半海水一半火焰体系和才华,一半海水一半火焰
体系和才华,一半海水一半火焰
Leo Zhao
 

Destaque (12)

Neuron
NeuronNeuron
Neuron
 
Back pain
Back painBack pain
Back pain
 
Grammar translation method
Grammar translation methodGrammar translation method
Grammar translation method
 
Mesopotamia expo
Mesopotamia  expoMesopotamia  expo
Mesopotamia expo
 
Control de calidad
Control de calidadControl de calidad
Control de calidad
 
Blood cells
Blood cells Blood cells
Blood cells
 
Aku tdk lbh dulu ke surga
Aku tdk lbh dulu ke surgaAku tdk lbh dulu ke surga
Aku tdk lbh dulu ke surga
 
E&PP Stats
E&PP StatsE&PP Stats
E&PP Stats
 
Ecg line
Ecg lineEcg line
Ecg line
 
Fenicios
FeniciosFenicios
Fenicios
 
Madness Cricket
Madness CricketMadness Cricket
Madness Cricket
 
体系和才华,一半海水一半火焰
体系和才华,一半海水一半火焰体系和才华,一半海水一半火焰
体系和才华,一半海水一半火焰
 

Último

Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
kauryashika82
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
QucHHunhnh
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
ciinovamais
 

Último (20)

SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Spatium Project Simulation student brief
Spatium Project Simulation student briefSpatium Project Simulation student brief
Spatium Project Simulation student brief
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...
 
Magic bus Group work1and 2 (Team 3).pptx
Magic bus Group work1and 2 (Team 3).pptxMagic bus Group work1and 2 (Team 3).pptx
Magic bus Group work1and 2 (Team 3).pptx
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Asian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptxAsian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptx
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 

Reducing features to improve code change based bug prediction

  • 1. 1 Reducing Features to Improve Code Change Based Bug Prediction Shivkumar Shivaji ∗, E. James Whitehead, Jr. ∗, Ram Akella ∗, Sunghun Kim † ∗School of Engineering University of California Santa Cruz, Santa Cruz, CA, USA Email: {shiv, ejw, ram}@soe.ucsc.edu †Department of Computer Science Hong Kong University of Science and Technology Hong Kong Email: hunkim@cse.ust.hk Abstract—Machine learning classifiers have recently emerged as a way to predict the introduction of bugs in changes made to source code files. The classifier is first trained on software history, and then used to predict if an impending change causes a bug. Drawbacks of existing classifier-based bug prediction techniques are insufficient performance for practical use and slow prediction times due to a large number of machine learned features. This paper investigates multiple feature selection techniques that are generally applicable to classification-based bug predic- tion methods. The techniques discard less important features until optimal classification performance is reached. The total number of features used for training is substantially reduced, often to less than 10% of the original. The performance of Na¨ıve Bayes and Support Vector Machine (SVM) classifiers when using this technique is characterized on eleven software projects. Na¨ıve Bayes using feature selection provides significant improvement in buggy F-measure (21% improvement) over prior change classification bug prediction results (by the second and fourth authors [28]). The SVM’s improvement in buggy F-measure is 9%. Interestingly, an analysis of performance for varying numbers of features shows that strong performance is achieved at even 1% of the original number of features. Index Terms—Reliability; Bug prediction; Machine Learning; Feature Selection I. INTRODUCTION Imagine if you had a little imp, sitting on your shoulder, telling you whether your latest code change has a bug. Imps, being mischievous by nature, would occasionally get it wrong, telling you a clean change was buggy. This would be annoying. However, say the imp was right 90% of the time. Would you still listen to him? While real imps are in short supply, thankfully advanced machine learning classifiers are readily available (and have plenty of impish qualities). Prior work by the second and fourth authors (hereafter called Kim et al. [28]) and similar work by Hata et al. [22] demonstrate that classifiers, when trained on historical software project data, can be used to predict the existence of a bug in an individual file-level software change. The classifier is first trained on information found in historical changes, and once trained, can be used to classify a new impending change as being either buggy (predicted to have a bug) or clean (predicted to not have a bug). We envision a future where software engineers have bug prediction capability built into their development environment [36]. Instead of an imp on the shoulder, software engineers will receive feedback from a classifier on whether a change they committed is buggy or clean. During this process, a software engineer completes a change to a source code file, submits the change to a software configuration management (SCM) system, then receives a bug prediction back on that change. If the change is predicted to be buggy, a software engineer could perform a range of actions to find the latent bug, including writing more unit tests, performing a code inspection, or examining similar changes made elsewhere in the project. 
Due to the need for the classifier to have up-to-date training data, the prediction is performed by a bug prediction service located on a server machine [36]. Since the service is used by many engineers, speed is of the essence when performing bug predictions. Faster bug prediction means better scalability, since quick response times permit a single machine to service many engineers. A bug prediction service must also provide precise predic- tions. If engineers are to trust a bug prediction service, it must provide very few “false alarms”, changes that are predicted to be buggy but which are really clean [6]. If too many clean changes are falsely predicted to be buggy, developers will lose faith in the bug prediction system. The prior change classification bug prediction approach used by Kim et al. and analogous work by Hata et al. involve the extraction of “features” (in the machine learning sense, which differ from software features) from the history of changes made to a software project. They include everything separated by whitespace in the code that was added or deleted in a change. Hence, all variables, comment words, operators, method names, and programming language keywords are used as features to train the classifier. Some object-oriented metrics are also used as part of the training set, together with other features such as configuration management log messages and change metadata (size of change, day of change, hour of
  • 2. 2 change, etc.). This leads to a large number of features: in the thousands and low tens of thousands. For project histories which span a thousand revisions or more, this can stretch into hundreds of thousands of features. Changelog features were included as a way to capture the intent behind a code change, for example to scan for bug fixes. Metadata features were included on the assumption that individual developers and code submission times are positively correlated with bugs [50]. Code complexity metrics were captured due to their effectiveness in previous studies ( [12], [43] amongst many others). Source keywords are captured en masse as they contain detailed information for every code change. The large feature set comes at a cost. Classifiers typically cannot handle such a large feature set in the presence of complex interactions and noise. For example, the addition of certain features can reduce the accuracy, precision, and recall of a support vector machine. Due to the value of a specific feature not being known a-priori, it is necessary to start with a large feature set and gradually reduce features. Additionally, the time required to perform classification increases with the number of features, rising to several seconds per classification for tens of thousands of features, and minutes for large project histories. This negatively affects the scalability of a bug prediction service. A possible approach (from the machine learning literature) for handling large feature sets is to perform a feature selection process to identify that subset of features providing the best classification results. A reduced feature set improves the scalability of the classifier, and can often produce substantial improvements in accuracy, precision, and recall. This paper investigates multiple feature selection techniques to improve classifier performance. Classifier performance can be evaluated using a suitable metric. For this paper, classifier performance refers to buggy F-measure rates. The choice of this measure is discussed in section III. The feature selection techniques investigated include both filter and wrapper meth- ods. The best technique is Significance Attribute Evaluation (a filter method) in conjunction with the Na¨ıve Bayes classifier. This technique discards features with lowest significance until optimal classification performance is reached (described in section II-D). Although many classification techniques could be em- ployed, this paper focuses on the use of Na¨ıve Bayes and SVM. The reason is due to the strong performance of the SVM and the Na¨ıve Bayes classifier for text categorization and numerical data [24], [34]. The J48 and JRIP classifiers were briefly tried, but due to inadequate results, their mention is limited to section VII. The primary contribution of this paper is the empirical anal- ysis of multiple feature selection techniques to classify bugs in software code changes using file level deltas. An important secondary contribution is the high average F-measure values for predicting bugs in individual software changes. This paper explores the following research questions. Question 1. Which variables lead to best bug prediction performance when using feature selection? 
The three variables affecting bug prediction performance that are explored in this paper are: (1) type of classifier (Na¨ıve Bayes, Support Vector Machine), (2) type of feature selection used (3) and whether multiple instances of a feature are significant (count), or whether only the existence of a feature is significant (binary). Results are reported in Section V-A. Results for question 1 are reported as averages across all projects in the corpus. However, in practice it is useful to know the range of results across a set of projects. This leads to our second question. Question 2. Range of bug prediction performance using feature selection. How do the best-performing SVM and Na¨ıve Bayes classifiers perform across all projects when using feature selection? (see Section V-B) The sensitivity of bug prediction results with number of features is explored in the next question. Question 3. Feature Sensitivity. What is the performance of change classification at varying percentages of features? What is the F-measure of the best performing classifier when using just 1% of all project features? (see Section V-D) Some types of features work better than others for discrimi- nating between buggy and clean changes, explored in the final research question. Question 4. Best Performing Features. Which classes of features are the most useful for performing bug predictions? (see Section V-E) A comparison of this paper’s results with those found in related work (see Section VI) shows that change classification with feature selection outperforms other existing classification- based bug prediction approaches. Furthermore, when using the Na¨ıve Bayes classifier, buggy precision averages 97% with a recall of 70%, indicating the predictions are generally highly precise, thereby minimizing the impact of false positives. In the remainder of the paper, we begin by presenting an overview of the change classification approach for bug prediction, and then detail a new process for feature selection (Section II). Next, standard measures for evaluating classifiers (accuracy, precision, recall, F-measure, ROC AUC) are de- scribed in Section III. Following, we describe the experimental context, including our data set, and specific classifiers (Section IV). The stage is now set, and in subsequent sections we explore the research questions described above (Sections V-A - V-E). A brief investigation of algorithm runtimes is next (Section V-F). The paper ends with a summary of related work (Section VI), threats to validity (Section VII), and the conclusion. This paper builds on [49], a prior effort by the same set of authors. While providing substantially more detail in addition to updates to the previous effort, there are two additional research questions, followed by three new experiments. In ad- dition, the current paper extends feature selection comparison to more techniques. The new artifacts are research questions 3, 4, tables 2, and VII. Further details over previous experiments include Table VI, Table IV, and the accompanying analysis. II. CHANGE CLASSIFICATION The primary steps involved in performing change classifi- cation on a single project are as follows: Creating a Corpus:
1. File level change deltas are extracted from the revision history of a project, as stored in its SCM repository (described further in Section II-A).
2. The bug fix changes for each file are identified by examining keywords in SCM change log messages (Section II-A).
3. The bug-introducing and clean changes at the file level are identified by tracing backward in the revision history from bug fix changes (Section II-A).
4. Features are extracted from all changes, both buggy and clean. Features include all terms in the complete source code, the lines modified in each change (delta), and change meta-data such as author and change time. Complexity metrics, if available, are computed at this step. Details on these feature extraction techniques are presented in Section II-B.
At the end of Step 4, a project-specific corpus has been created: a set of labeled changes with a set of features associated with each change. All of the steps until this point are the same as in Kim et al. [28]. The following step is the new contribution in this paper.
Feature Selection:
5. Perform a feature selection process that employs a combination of wrapper and filter methods to compute a reduced set of features. The filter methods used are Gain Ratio, Chi-Squared, Significance, and Relief-F feature rankers. The wrapper methods are based on the Naïve Bayes and the SVM classifiers. For each iteration of feature selection, classifier F-measure is optimized. As Relief-F is a slower method, it is only invoked on the top 15% of features rather than 50%. Feature selection is iteratively performed until one feature is left.
At the end of Step 5, there is a reduced feature set that performs optimally for the chosen classifier metric.
Classification:
6. Using the reduced feature set, a classification model is trained.
7. Once a classifier has been trained, it is ready to use. New code changes can now be fed to the classifier, which determines whether a new change is more similar to a buggy change or a clean change. Classification is performed at a code change level, using file level change deltas as input to the classifier.

A. Finding Buggy and Clean Changes

The process of determining buggy and clean changes begins by using the Kenyon infrastructure to extract change transactions from either a CVS or Subversion software configuration management repository [7]. In Subversion, such transactions are directly available. CVS, however, provides only versioning at the file level, and does not record which files were committed together. To recover transactions from CVS archives, we group the individual per-file changes using a sliding window approach [53]: two subsequent changes by the same author and with the same log message are part of one transaction if they are at most 200 seconds apart.
In order to find bug-introducing changes, bug fixes must first be identified by mining change log messages. We use two approaches: searching for keywords in log messages such as “Fixed”, “Bug” [40], or other keywords likely to appear in a bug fix, and searching for references to bug reports like “#42233”. This allows us to identify whether an entire code change transaction contains a bug fix. If it does, we then need to identify the specific file delta change that introduced the bug.

TABLE I
EXAMPLE BUG FIX SOURCE CODE CHANGE

1.23: Bug-introducing            1.42: Fix
...                              ...
15: if (foo==null) {             36: if (foo!=null) {
16:   foo.bar();                 37:   foo.bar();
...                              ...
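The log-message search just described can be sketched in a few lines. The keyword list and regular expressions below are illustrative assumptions, not the exact patterns used in the study.

```python
import re

# Illustrative patterns: the study searches for keywords such as "Fixed" or "Bug"
# and for bug-report references like "#42233" in SCM change log messages.
FIX_KEYWORDS = re.compile(r"\b(fix(e[sd])?|bugs?|defect|patch)\b", re.IGNORECASE)
BUG_REFERENCE = re.compile(r"#\d+")

def is_bug_fix(log_message: str) -> bool:
    """Flag a change transaction as a bug fix if its log message mentions a
    fix-related keyword or references a bug report number."""
    return bool(FIX_KEYWORDS.search(log_message) or BUG_REFERENCE.search(log_message))

print(is_bug_fix("Fixed null check in foo.bar(), closes #42233"))  # True
print(is_bug_fix("Add new configuration panel"))                   # False
```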
For the systems studied in this paper, we manually verified that the identified fix commits were, indeed, bug fixes. For JCP, all bug fixes were identified using a source code to bug tracking system hook. As a result, we did not have to rely on change log messages for JCP. The bug-introducing change identification algorithm pro- posed by ´Sliwerski, Zimmermann, and Zeller (SZZ algorithm) [50] is used in the current paper. After identifying bug fixes, SZZ uses a diff tool to determine what changed in the bug- fixes. The diff tool returns a list of regions that differ between the two files; each region is called a “hunk”. It observes each hunk in the bug-fix and assumes that the deleted or modified source code in each hunk is the location of a bug. Finally, SZZ tracks down the origins of the deleted or modified source code in the hunks using the built-in annotation function of SCM (Source Code Management) systems. The annotation computes, for each line in the source code, the most recent revision where the line was changed, and the developer who made the change. These revisions are identified as bug- introducing changes. In the example in Table I, revision 1.42 fixes a fault in line 36. This line was introduced in revision 1.23 (when it was line 15). Thus revision 1.23 contains a bug- introducing change. Specifically, revision 1.23 calls a method within foo. However, the if-block is entered only if foo is null. B. Feature Extraction To classify software changes using machine learning algo- rithms, a classification model must be trained using features of buggy and clean changes. In this section, we discuss techniques for extracting features from a software project’s change history. A file change involves two source code revisions (an old revision and a new revision) and a change delta that records the added code (added delta) and the deleted code (deleted delta) between the two revisions. A file change has associated meta-data, including the change log, author, and commit date. Every term in the source code, change delta, and change log texts is used as a feature. This means that every variable, method name, function name, keyword, comment word, and operator—that is, everything in the source code separated by whitespace or a semicolon—is used as a feature. We gather eight features from change meta-data: author, commit hour (0, 1, 2, . . . , 23), commit day (Sunday, Monday, . . . , Saturday), cumulative change count, cumulative bug
  • 4. 4 count, length of change log, changed LOC (added delta LOC + deleted delta LOC), and new revision source code LOC. We compute a range of traditional complexity metrics of the source code by using the Understand C/C++ and Java tools [23]. As a result, we extract 61 complexity metrics for each file, including LOC, number of comment lines, cyclomatic complexity, and max nesting. Since we have two source code files involved in each change (old and new revision files), we can use complexity metric deltas as features. That is, we can compute a complexity metric for each file and take the difference; this difference can be used as a feature. Change log messages are similar to e-mail or news articles in that they are human readable texts. To extract features from change log messages, we use the bag-of-words (BOW) approach, which converts a stream of characters (the text) into a BOW (index terms) [47]. We use all terms in the source code as features, including operators, numbers, keywords, and comments. To generate features from source code, we use a modified version of BOW, called BOW+, that extracts operators in addition to all terms extracted by BOW, since we believe operators such as !=, ++, and && are important terms in source code. We perform BOW+ extraction on added delta, deleted delta, and new revision source code. This means that every variable, method name, function name, programming language keyword, com- ment word, and operator in the source code separated by whitespace or a semicolon is used as a feature. We also convert the directory and filename into features since they encode both module information and some behav- ioral semantics of the source code. For example, the file (from the Columba project) “ReceiveOptionPanel.java” in the direc- tory “src/mail/core/org/columba/mail/gui/config/ account/” re- veals that the file receives some options using a panel interface and the directory name shows that the source code is related to “account,” “configure,” and “graphical user interface.” Direc- tory and filenames often use camel case, concatenating words breaks with capitals. For example, “ReceiveOptionPanel.java” combines “receive,” “option,” and “panel.” To extract such words correctly, we use a case change in a directory or a filename as a word separator. We call this method BOW++. Table II summarizes features generated and used in this paper. Feature groups which can be interpreted as binary or count are also indicated. Section V-A explores binary and count interpretations for those feature groups. Table III provides an overview of the projects examined in this research and the duration of each project examined. For each project we analyzed (see Table III) the number of meta-data (M) and code complexity (C) features is 8 and 150 respectively. Source code (A, D, N), change log (L), and directory/filename (F) features contribute thousands of features per project, ranging from 6-42 thousand features. Source code features (A, D, N) can take two forms: binary or count. For binary features, the feature only notes the presence or absence of a particular keyword. For count features, the count of the number of times a keyword appears is recorded. 
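As a concrete illustration of the keyword extraction and the two interpretations just described, the sketch below shows a minimal BOW+/BOW++ tokenizer and a binary versus count featurizer. This is a toy under stated assumptions (the operator list and helper names are ours), not the study's extraction code.

```python
import re

OPERATORS = ["==", "!=", "<=", ">=", "&&", "||", "++", "--"]   # illustrative subset

def bow_plus(code: str):
    """BOW+ sketch: every token separated by whitespace or ';' becomes a feature,
    with common operators additionally extracted as features of their own."""
    tokens = [t for t in re.split(r"[\s;]+", code) if t]
    return tokens + [op for op in OPERATORS if op in code]

def bow_plus_plus(path: str):
    """BOW++ sketch: split directory/file names on separators and camel-case
    boundaries, e.g. 'ReceiveOptionPanel.java' -> receive, option, panel, java."""
    words = []
    for part in re.split(r"[/\\._-]+", path):
        words += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part)
    return [w.lower() for w in words if w]

def featurize(tokens, binary=True):
    """Binary vs. count interpretation of the extracted keyword features."""
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    return {t: 1 for t in counts} if binary else counts

tokens = bow_plus("if (foo != null) { foo.bar(); foo.bar(); }")
print(featurize(tokens, binary=False))   # count interpretation, e.g. 'foo.bar()' -> 2
print(bow_plus_plus("src/mail/gui/config/ReceiveOptionPanel.java"))
```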
For example, if a variable maxParameters is present anywhere in the project, a binary feature just records this variable as being present, while a count feature would additionally record the number of times it appears anywhere in the project’s history.

TABLE II
FEATURE GROUPS. FEATURE GROUP DESCRIPTION, EXTRACTION METHOD, AND EXAMPLE FEATURES.

Feature Group                | Description                                          | Extraction Method     | Example Features                 | Interpretation
Added Delta (A)              | Terms in added delta source code                     | BOW+                  | if, while, for, ==               | Binary/Count
Deleted Delta (D)            | Terms in deleted delta source code                   | BOW+                  | true, 0, <=, ++, int             | Binary/Count
Directory/File Name (F)      | Terms in directory/file names                        | BOW++                 | src, module, java                | N/A
Change Log (L)               | Terms in the change log                              | BOW                   | fix, added, new                  | N/A
New Revision Source Code (N) | Terms in new revision source code file               | BOW+                  | if, !=, do, while, string, false | Binary/Count
Meta-data (M)                | Change meta-data such as time and author             | Direct                | author: hunkim, commit hour: 12  | N/A
Complexity Metrics (C)       | Software complexity metrics of each source code unit | Understand tools [23] | LOC: 34, Cyclomatic: 10          | N/A

TABLE III
SUMMARY OF PROJECTS SURVEYED

Project     | Period          | Clean Changes | Buggy Changes | Features
APACHE 1.3  | 10/1996-01/1997 | 566           | 134           | 17,575
COLUMBA     | 05/2003-09/2003 | 1,270         | 530           | 17,411
GAIM        | 08/2000-03/2001 | 742           | 451           | 9,281
GFORGE      | 01/2003-03/2004 | 399           | 334           | 8,996
JEDIT       | 08/2002-03/2003 | 626           | 377           | 13,879
MOZILLA     | 08/2003-08/2004 | 395           | 169           | 13,648
ECLIPSE     | 10/2001-11/2001 | 592           | 67            | 16,192
PLONE       | 07/2002-02/2003 | 457           | 112           | 6,127
POSTGRESQL  | 11/1996-02/1997 | 853           | 273           | 23,247
SUBVERSION  | 01/2002-03/2002 | 1,925         | 288           | 14,856
JCP         | 1 year          | 1,516         | 403           | 41,942
Total       | N/A             | 9,294         | 3,125         | 183,054

C. Feature Selection Techniques

The number of features gathered during the feature extraction phase is quite large, ranging from 6,127 for Plone to 41,942 for JCP (Table III). Such large feature sets lead to longer training and prediction times, and require large amounts of memory to perform classification. A common solution to this problem is the process of feature selection, in which only the subset of features that is most useful for making classification decisions is actually used.
This paper empirically compares a selection of wrapper and filter methods for bug prediction classification effectiveness. Filter methods use general characteristics of the dataset to evaluate and rank features [20]. They are independent of learning algorithms. Wrapper methods, on the other hand, evaluate features by using scores provided by learning algorithms. All methods can be used with both count and binary interpretations of keyword features. The methods are further described below.
• Gain Ratio Attribute Evaluation - Gain Ratio is a myopic feature scoring algorithm. A myopic feature scoring algorithm evaluates each feature individually, independent of the other features. Gain Ratio improves upon Information Gain [2], a well known measure of the amount by which a given feature contributes information to a classification decision. Information Gain for a feature in our context is the amount of information the feature can provide about whether the code change is buggy or clean. Ideally, features that provide a good split between buggy and clean changes should be preserved. For example, if the presence of a certain feature strongly indicates a bug, and its absence greatly decreases the likelihood of a bug, that feature possesses strong Information Gain.
On the other hand, if a feature’s presence indicates a 50% likelihood of a bug, and its absence also indicates a 50% likelihood of a bug, this feature has low Information Gain and is not useful in predicting the code change.
However, Information Gain places more importance on features that have a large range of values. This can lead to problems with features such as the LOC of a code change. Gain Ratio [2] plays the same role as Information Gain, but instead provides a normalized measure of a feature’s contribution to a classification decision [2]. Thus, Gain Ratio is less affected by features having a large range of values. More details on how the entropy based measure is calculated for Gain Ratio (including how a normalized measure of a feature’s contribution is computed), and other inner workings, can be found in an introductory data mining book, e.g. [2].
• Chi-Squared Attribute Evaluation - The Chi-squared feature score is also a myopic feature scoring algorithm. The worth of a feature is given by the value of the Pearson chi-squared statistic [2] with respect to the classification decision. Based on the presence of a feature value across all instances in the dataset, one can compute expected and observed counts using Pearson’s chi-squared test. The features with the highest disparity between expected and observed counts against the class attribute are given higher scores.
• Significance Attribute Evaluation - With significance attribute scoring, high scores are given to features where an inversion of the feature value (e.g., a programming keyword not being present when it had previously been present) will very likely cause an inversion of the classification decision (buggy to clean, or vice-versa) [1]. It too is a myopic feature scoring algorithm. The significance itself is computed as a two-way function of its association to the class decision. For each feature, the attribute-to-class association along with the class-to-attribute association is computed. A feature is quite significant if both of these associations are high for a particular feature. More detail on the inner workings can be found in [1].
• Relief-F Attribute Selection - Relief-F is an extension to the Relief algorithm [31]. Relief samples data points (code changes in the context of this paper) at random, and computes two nearest neighbors: one neighbor which has the same class as the instance (similar neighbor), and one neighbor which has a different class (differing neighbor). For the context of this paper, the two classes are buggy and clean. The quality estimation for a feature f is updated based on the value of its similar and differing neighbors.
If for feature f, a code change and its similar neighbor have differing values, the feature quality of f is decreased. If a code change and its differing neighbor have different values for f, f’s feature quality is increased. This process is repeated for all the sampled points. Relief- F is an algorithmic extension to Relief [46]. One of the extensions is to search for the nearest k neighbors in each class rather than just one neighbor. As Relief-F is a slower algorithm than the other presented filter methods, it was used on the top 15% features returned by the best performing filter method. • Wrapper Methods - The wrapper methods leveraged the classifiers used in the study. The SVM and the Na¨ıve Bayes classifier were used as wrappers. The features are ranked by their classifier computed score. The top feature scores are those features which are valued highly after the creation of a model driven by the SVM or the Na¨ıve Bayes classifier. D. Feature Selection Process Filter and wrapper methods are used in an iterative pro- cess of selecting incrementally smaller sets of features, as detailed in Algorithm 1. The process begins by cutting the initial feature set in half, to reduce memory and processing requirements for the remainder of the process. The process performs optimally when under 10% of all features are present. The initial feature evaluations for filter methods are 10-fold cross validated in order to avoid the possibility of over- fitting feature scores. As wrapper methods perform 10-fold cross validation during the training process, feature scores from wrapper methods can directly be used. Note that we cannot stop the process around the 10% mark and assume the performance is optimal. Performance may not be a unimodal curve with a single maxima around the 10% mark, it is possible that performance is even better at, say, the 5% mark. In the iteration stage, each step finds those 50% of re- maining features that are least useful for classification using individual feature rank, and eliminates them (if, instead, we were to reduce by one feature at a time, this step would be similar to backward feature selection [35]). Using 50% of
features at a time improves speed without sacrificing much result quality. So, for example, selF starts at 50% of all features, then is reduced to 25% of all features, then 12.5%, and so on. At each step, change classification bug prediction using selF is then performed over the entire revision history, using 10-fold cross validation to reduce the possibility of over-fitting to the data. This iteration terminates when only one of the original features is left. At this point, there is a list of tuples: feature %, feature selection technique, classifier performance. The final step involves a pass over this list to find the feature % at which a specific classifier achieves its greatest performance. The metric used in this paper for classifier performance is the buggy F-measure (harmonic mean of precision and recall), though one could use a different metric.

Algorithm 1 Feature selection process for one project
1) Start with all features, F.
2) For each feature selection technique f in {Gain Ratio, Chi-Squared, Significance Evaluation, Relief-F, Wrapper method using SVM, Wrapper method using Naïve Bayes}, perform steps 3-6 below.
3) Compute feature evaluations using f over F, and select the top 50% of features with the best evaluations, F/2.
4) Selected features, selF = F/2.
5) While |selF| ≥ 1 feature, perform steps (a)-(d):
   a) Compute and store buggy and clean precision, recall, accuracy, F-measure, and ROC AUC using a machine learning classifier (e.g., Naïve Bayes or SVM), using 10-fold cross validation. Record the result in a tuple list.
   b) If f is a wrapper method, recompute feature scores over selF.
   c) Identify removeF, the 50% of features of selF with the lowest feature evaluation. These are the least useful features in this iteration.
   d) selF = selF − removeF.
6) Determine the best F-measure result recorded in step 5.a. The percentage of features that yields the best result is optimal for the given metric.

It should be noted that Algorithm 1 is itself a wrapper method, as it builds an SVM or Naïve Bayes classifier at various points. When the number of features for a project is large, the learning process can take a long time. This is still strongly preferable to a straightforward use of backward feature selection, which removes one feature every iteration. An analysis of the runtime of the feature selection process and its components is performed in Section V-F. When working with extremely large datasets, using Algorithm 1 exclusively with filter methods will also significantly lower its runtime.
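A compact sketch of the filter-method variant of this process is given below. It is an illustrative sketch only, assuming a binary feature matrix X with buggy changes labelled 1, a precomputed score per feature from one of the filter rankers, and scikit-learn's BernoulliNB as a stand-in for the Naïve Bayes classifier used in the study; the wrapper variant would additionally re-rank the surviving features each iteration (step 5b).

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB

def halving_feature_selection(X, y, scores, clf=None):
    """Algorithm 1 sketch with a filter ranker: keep the top 50% of features,
    then repeatedly drop the lower-scoring half, recording the 10-fold buggy
    F-measure at each step and returning the best-performing subset."""
    clf = clf or BernoulliNB()
    order = np.argsort(scores)[::-1]                 # best-ranked features first
    selected = order[: max(1, len(order) // 2)]      # initial cut to 50% of features
    results = []
    while len(selected) >= 1:
        f1 = cross_val_score(clf, X[:, selected], y, cv=10,
                             scoring="f1").mean()    # buggy class assumed to be label 1
        results.append((len(selected), f1, selected))
        if len(selected) == 1:
            break
        selected = selected[: len(selected) // 2]    # keep the better-scoring half
    return max(results, key=lambda r: r[1])          # (n_features, F-measure, indices)
```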
III. PERFORMANCE METRICS

We now define common performance metrics used to evaluate classifiers: Accuracy, Precision, Recall, F-Measure, and ROC AUC. There are four possible outcomes when using a classifier on a single change:
Classifying a buggy change as buggy, b→b (true positive)
Classifying a buggy change as clean, b→c (false negative)
Classifying a clean change as clean, c→c (true negative)
Classifying a clean change as buggy, c→b (false positive)
With a good set of training data, it is possible to compute the total number of buggy changes correctly classified as buggy (nb→b), buggy changes incorrectly classified as clean (nb→c), clean changes correctly classified as clean (nc→c), and clean changes incorrectly classified as buggy (nc→b).
Accuracy is the number of correctly classified changes over the total number of changes. As there are typically more clean changes than buggy changes, this measure could yield a high value if clean changes are being better predicted than buggy changes. This is often less relevant than buggy precision and recall.

Accuracy = (nb→b + nc→c) / (nb→b + nb→c + nc→c + nc→b)

Buggy change precision represents the number of correct bug classifications over the total number of classifications that resulted in a bug outcome. Or, put another way, if the change classifier predicts a change is buggy, what fraction of these changes really contain a bug?

Buggy change precision, P(b) = nb→b / (nb→b + nc→b)

Buggy change recall is also known as the true positive rate; it represents the number of correct bug classifications over the total number of changes that were actually bugs. That is, of all the changes that are buggy, what fraction does the change classifier actually label as buggy?

Buggy change recall, R(b) = nb→b / (nb→b + nb→c)

Buggy change F-measure is a composite measure of buggy change precision and recall; more precisely, it is the harmonic mean of precision and recall. Since precision can often be improved at the expense of recall (and vice-versa), F-measure is a good measure of the overall precision/recall performance of a classifier, since it incorporates both values. For this reason, we emphasize this metric in our analysis of classifier performance in this paper. Clean change recall, precision, and F-measure can be computed similarly.

Buggy change F-measure = 2 * P(b) * R(b) / (P(b) + R(b))

An example is when 80% of code changes are clean and 20% of code changes are buggy. In this case, a classifier reporting all changes as clean except one will have around 80% accuracy. The classifier predicts exactly one buggy change correctly. The buggy precision and recall, however, will be 100% and close to 0% respectively. Precision is 100% because the single buggy prediction is correct. The buggy F-measure is 16.51% in this case, revealing poor performance. On the other hand, if a classifier reports all changes as buggy, the accuracy is 20%, the buggy recall is 100%, the buggy precision is 20%, and the buggy F-measure is 33.3%. While these results seem better, the buggy F-measure figure of 33.3% gives an
  • 7. 7 impression that the second classifier is better than the first one. However, a low F-measure does indicate that the second classifier is not performing well either. For this paper, the classifiers for all experiments are F-measure optimized. Sev- eral data/text mining papers have compared performance on classifiers using F-measure including [3] and [32]. Accuracy is also reported in order to avoid returning artificially higher F-measures when accuracy is low. For example, suppose there are 10 code changes, 5 of which are buggy. If a classifier predicts all changes as buggy, the resulting precision, recall, and F-measure are 50%, 100%, and 66.67% respectively. The accuracy figure of 50% demonstrates less promise than the high F-measure figure suggests. An ROC (originally, receiver operating characteristic, now typically just ROC) curve is a two-dimensional graph where the true positive rate (i.e. recall, the number of items correctly labeled as belonging to the class) is plotted on the Y axis against the false positive rate (i.e. items incorrectly labeled as belonging to the class, or nc→b/(nc→b+nc→c)) on the X axis. The area under an ROC curve, commonly abbreviated as ROC AUC, has an important statistical property. The ROC AUC of a classifier is equivalent to the probability that the classifier will value a randomly chosen positive instance higher than a randomly chosen negative instance [16]. These instances map to code changes when relating to bug prediction. Tables V and VI contain the ROC AUC figures for each project. One potential problem when computing these performance metrics is the choice of which data is used to train and test the classifier. We consistently use the standard technique of 10- fold cross-validation [2] when computing performance metrics throughout this paper (for all figures and tables) with the exception of section V-C where more extensive cross-fold validation was performed. This avoids the potential problem of over fitting to a particular training and test set within a specific project. In 10-fold cross validation, a data set (i.e., the project revisions) is divided into 10 parts (partitions) at random. One partition is removed and is the target of classification, while the other 9 are used to train the classifier. Classifier evaluation metrics, including buggy and clean accuracy, precision, recall, F-measure, and ROC AUC are computed for each partition. After these evaluations are completed for all 10 partitions, averages are are obtained. The classifier performance metrics reported in this paper are these average figures. IV. EXPERIMENTAL CONTEXT We gathered software revision history for Apache, Columba, Gaim, Gforge, Jedit, Mozilla, Eclipse, Plone, PostgreSQL, Subversion, and a commercial project written in Java (JCP). These are all mature open source projects with the exception of JCP. In this paper these projects are collectively called the corpus. Using the project’s CVS (Concurrent Versioning System) or SVN (Subversion) source code repositories, we collected revisions 500-1000 for each project, excepting Jedit, Eclipse, and JCP. For Jedit and Eclipse, revisions 500-750 were col- lected. For JCP, a year’s worth of changes were collected. We used the same revisions as Kim et al. [28], since we wanted to be able to compare our results with theirs. We removed two of the projects they surveyed from our analysis, Bugzilla and Scarab, as the bug tracking systems for these projects did not distinguish between new features and bug fixes. 
Even though our technique did well on those two projects, the value of bug prediction when new features are also treated as bug fixes is arguably less meaningful. V. RESULTS The following sections present results obtained when ex- ploring the four research questions. For the convenience of the reader, each result section repeats the research question that will be answered. A. Classifier performance comparison Research Question 1. Which variables lead to best bug prediction performance when using feature selection? The three main variables affecting bug prediction per- formance that are explored in this paper are: (1) type of classifier (Na¨ıve Bayes, Support Vector Machine), (2) type of feature selection used (3) and whether multiple instances of a particular feature are significant (count), or whether only the existence of a feature is significant (binary). For example, if a variable by the name of “maxParameters” is referenced four times during a code change, a binary feature interpretation records this variable as 1 for that code change, while a count feature interpretation would record it as 4. The permutations of variables 1, 2, and 3 are explored across all 11 projects in the data set with the best performing feature selection technique for each classifier. For SVMs, a linear kernel with optimized values for C is used. C is a parameter that allows one to trade off training error and model complexity. A low C tolerates a higher number of errors in training, while a large C allows fewer train errors but increases the complexity of the model. [2] covers the SVM C parameter in more detail. For each project, feature selection is performed, followed by computation of per-project accuracy, buggy precision, buggy recall, and buggy F-measure. Once all projects are complete, average values across all projects are computed. Results are reported in Table IV. Average numbers may not provide enough information on the variance. Every result in Table IV reports the average value along with the standard deviation. The results are presented in descending order of buggy F- measure. Significance Attribute and Gain Ratio based feature selec- tion performed best on average across all projects in the corpus when keywords were interpreted as binary features. McCallum et al. [38] confirm that feature selection performs very well on sparse data used for text classification. Anagostopoulos et al. [3] report success with Information Gain for sparse text classification. Our data set is also quite sparse and performs well using Gain Ratio, an improved version of Information Gain. A sparse data set is one in which instances contain a majority of zeroes. The number of distinct ones for most of the binary attributes analyzed are a small minority. The least
  • 8. 8 TABLE IV AVERAGE CLASSIFIER PERFORMANCE ON CORPUS (ORDERED BY DESCENDING BUGGY F-MEASURE) Feature Classifier Feature Features Feature Accuracy Buggy Buggy Buggy Interp. Selection Technique Percentage F-measure Precision Recall Binary Bayes Significance Attr. 1153.4 ± 756.3 7.9 ± 6.3 0.90 ± 0.04 0.81 ± 0.07 0.97 ± 0.06 0.70 ± 0.10 Binary Bayes Gain Ratio 1914.6 ± 2824.0 9.5 ± 8.0 0.91 ± 0.03 0.79 ± 0.07 0.92 ± 0.10 0.70 ± 0.07 Binary Bayes Relief-F 1606.7 ± 1329.2 9.3 ± 3.6 0.85 ± 0.08 0.71 ± 0.10 0.74 ± 0.19 0.71 ± 0.08 Binary SVM Gain Ratio 1522.3 ± 1444.2 10.0 ± 9.8 0.87 ± 0.05 0.69 ± 0.06 0.90 ± 0.10 0.57 ± 0.08 Binary Bayes Wrapper 2154.6 ± 1367.0 12.7 ± 6.5 0.85 ± 0.06 0.68 ± 0.10 0.76 ± 0.17 0.64 ± 0.11 Binary SVM Significance Attr. 1480.9 ± 2368.0 9.5 ± 14.2 0.86 ± 0.06 0.65 ± 0.06 0.94 ± 0.11 0.51 ± 0.08 Count Bayes Relief-F 1811.1 ± 1276.7 11.8 ± 1.9 0.78 ± 0.07 0.63 ± 0.08 0.56 ± 0.09 0.72 ± 0.10 Binary SVM Relief-F 663.1 ± 503.2 4.2 ± 3.5 0.83 ± 0.04 0.62 ± 0.06 0.75 ± 0.22 0.57 ± 0.14 Binary Bayes Chi-Squared 3337.2 ± 2356.1 25.0 ± 20.5 0.77 ± 0.08 0.61 ± 0.07 0.54 ± 0.07 0.71 ± 0.09 Count SVM Relief-F 988.6 ± 1393.8 5.8 ± 4.0 0.78 ± 0.08 0.61 ± 0.06 0.58 ± 0.10 0.66 ± 0.07 Count SVM Gain Ratio 1726.0 ± 1171.3 12.2 ± 8.3 0.80 ± 0.07 0.57 ± 0.08 0.62 ± 0.13 0.53 ± 0.08 Binary SVM Chi-Squared 2561.5 ± 3044.1 12.6 ± 16.2 0.81 ± 0.06 0.56 ± 0.09 0.55 ± 0.11 0.57 ± 0.07 Count SVM Significance Attr. 1953.0 ± 2416.8 12.0 ± 13.6 0.79 ± 0.06 0.56 ± 0.07 0.62 ± 0.13 0.52 ± 0.06 Count SVM Wrapper 1232.1 ± 1649.7 8.9 ± 14.9 0.76 ± 0.10 0.55 ± 0.05 0.56 ± 0.09 0.56 ± 0.07 Count Bayes Wrapper 1946.6 ± 6023.6 4.8 ± 14.3 0.79 ± 0.08 0.55 ± 0.08 0.63 ± 0.08 0.50 ± 0.09 Count Bayes Chi-Squared 58.3 ± 59.7 0.4 ± 0.4 0.76 ± 0.08 0.55 ± 0.05 0.54 ± 0.09 0.58 ± 0.11 Count Bayes Significance Attr. 2542.1 ± 1787.1 14.8 ± 6.6 0.76 ± 0.08 0.54 ± 0.07 0.53 ± 0.11 0.59 ± 0.13 Count SVM Chi-Squared 4282.3 ± 3554.2 26.2 ± 20.2 0.75 ± 0.10 0.54 ± 0.07 0.55 ± 0.10 0.54 ± 0.06 Binary SVM Wrapper 4771.0 ± 3439.7 33.8 ± 22.8 0.79 ± 0.11 0.53 ± 0.12 0.58 ± 0.14 0.49 ± 0.11 Count Bayes Gain Ratio 2523.6 ± 1587.2 15.3 ± 6.1 0.77 ± 0.08 0.53 ± 0.06 0.56 ± 0.10 0.52 ± 0.09 sparse binary features in the datasets have non zero values in the 10% - 15% range. Significance attribute based feature selection narrowly won over Gain Ratio in combination with the Na¨ıve Bayes clas- sifier. With the SVM classifier, Gain Ratio feature selection worked best. Filter methods did better than wrapper methods. This is due to the fact that classifier models with a lot of fea- tures are worse performing and can drop the wrong features at an early cutting stage of the feature selection process presented under algorithm 1. When a lot of features are present, the data contains more outliers. The worse performance of Chi-Squared Attribute Evaluation can be attributed to computing variances at the early stages of feature selection. Relief-F is a nearest neighbor based method. Using nearest neighbor information in the presence of noise can be detrimental [17]. However, as Relief-F was used on the top 15% of features returned by the best filter method, it performed reasonably well. It appears that simple filter based feature selection tech- niques such as Gain Ratio and Significance Attribute Evalu- ation can work on the surveyed large feature datasets. These techniques do not make strong assumptions on the data. When the amount of features is reduced to a reasonable size, wrapper methods can produce a model usable for bug prediction. 
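A wrapper-style ranking step of the kind referred to above can be sketched as follows; it is a simplified stand-in (ranking by the magnitude of linear-SVM weights) rather than the exact classifier-derived scoring used with the wrappers in this study.

```python
import numpy as np
from sklearn.svm import LinearSVC

def wrapper_rank(X, y, feature_names):
    """Wrapper-style scoring sketch: fit a linear SVM on the current feature
    subset and rank features by the absolute value of their learned weights."""
    model = LinearSVC(C=1.0, dual=False).fit(X, y)
    weights = np.abs(model.coef_).ravel()
    order = np.argsort(weights)[::-1]
    return [(feature_names[i], float(weights[i])) for i in order]
```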
Overall, Na¨ıve Bayes using binary interpretation of features performed better than the best SVM technique. One explana- tion for these results is that change classification is a type of optimized binary text classification, and several sources ( [3], [38]) note that Na¨ıve Bayes performs text classification well. As a linear SVM was used, it is possible that using a non linear kernel can improve SVM classifier performance. However, in preliminary work, optimizing the SVM for one project by using a polynomial kernel often led to degradation of performance in another project. In addition, using a non linear kernel, made the experimental runs significantly slower. To permit a consistent comparison of results between SVM and Na¨ıve Bayes, it was important to use the same classi- fier settings for all projects, and hence no per-project SVM classifier optimization was performed. Future work could include optimizing SVMs on a per project basis for better performance. A major benefit of the Na¨ıve Bayes classifier is that such optimization is not needed. The methodology section discussed the differences between binary and count interpretation of keyword features. Count did not perform well within the corpus, yielding poor results for SVM and Na¨ıve Bayes classifiers. Count results are mostly bundled at the bottom of Table IV. This is consistent with McCallum et al. [38], which mentions that when the number of features is low, binary values can yield better results than using count. Even when the corpus was tested without any feature selection in the prior work by Kim et al., count performed worse than binary. This seems to imply that regardless of the number of features, code changes are better captured by the presence or absence of keywords. The bad performance of count can possibly be explained by the difficulty in establish- ing the semantic meaning behind recurrence of keywords and the added complexity it brings to the data. When using the count interpretation, the top results for both Na¨ıve Bayes and the SVM were from Relief-F. This suggests that using Relief- F after trimming out most features via a myopic filter method can yield good performance. The two best performing classifier combinations by buggy F-measure, Bayes (binary interpretation with Significance Evaluation) and SVM (binary interpretation with Gain Ratio), both yield an average precision of 90% and above. For the remainder of the paper, analysis focuses just on these two, to better understand their characteristics. The remaining figures and tables in the paper will use a binary interpretation of keyword features. B. Effect of feature selection Question 2. Range of bug prediction performance using feature selection. How do the best-performing SVM and
  • 9. 9 Na¨ıve Bayes classifiers perform across all projects when using feature selection? In the previous section, aggregate average performance of different classifiers and optimization combinations was compared across all projects. In actual practice, change classi- fication would be trained and employed on a specific project. As a result, it is useful to understand the range of performance achieved using change classification with a reduced feature set. Tables V and VI below report, for each project, overall prediction accuracy, buggy and clean precision, recall, F- measure, and ROC area under curve (AUC). Table V presents results for Na¨ıve Bayes using F-measure feature selection with binary features, while Table VI presents results for SVM using feature selection with binary features. Observing these two tables, eleven projects overall (8 with Na¨ıve Bayes, 3 with SVM) achieve a buggy precision of 1, indicating that all buggy predictions are correct (no buggy false positives). While the buggy recall figures (ranging from 0.40 to 0.84) indicate that not all bugs are predicted, still, on average, more than half of all project bugs are successfully predicted. Comparing buggy and clean F-measures for the two classi- fiers, Na¨ıve Bayes (binary) clearly outperforms SVM (binary) across all 11 projects when using the same feature selection technique. The ROC AUC figures are also better for the Na¨ıve Bayes classifier than those of the SVM classifier across all projects. The ROC AUC of a classifier is equivalent to the probability that the classifier will value a randomly chosen positive instance higher than a randomly chosen negative instance. A higher ROC AUC for the Na¨ıve Bayes classifier better indicates that the classifier will still perform strongly if the initial labelling of code changes as buggy/clean turned out to be incorrect. Thus, the Na¨ıve Bayes results are less sensitive to inaccuracies in the data sets. [2]. Figure 1 summarizes the relative performance of the two classifiers and compares against the prior work of Kim et al [28]. Examining these figures, it is clear that feature selection significantly improves F-measure of bug prediction using change classification. As precision can often be increased at the cost of recall and vice-versa, we compared classifiers using buggy F-measure. Good F-measures indicate overall result quality. The practical benefit of our result can be demonstrated by the following. For a given fixed level of recall, say 70%, our method provides 97% precision, while Kim et al. provide 52.5% precision. These numbers were calculated using buggy F-measure numbers from V and [28]. The buggy precision value was extrapolated from the buggy F-measure and the fixed recall figures. High precision with decent recall allows developers to use our solution with added confidence. This means that if a change is flagged as buggy, the change is very likely to be actually be buggy. Kim et al.’s results, shown in Figure 1, are taken from [28]. In all but the following cases, this paper used the same corpus. • This paper does not use Scarab and Bugzilla because those projects did not distinguish buggy and new features. • This paper uses JCP, which was not in Kim et al.’s corpus. Kim et al. did not perform feature selection and used substantially more features for each project. Table IV reveals the drastic reduction in the average number of features per project when compared to Kim et al’s work. 
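The precision figures quoted above can be recovered from a buggy F-measure and a fixed recall by solving F = 2PR/(P + R) for P, which gives P = F·R/(2R − F). A two-line check, using this paper's average buggy F-measure of 0.81 and the prior work's 0.60, both at 70% recall:

```python
def precision_from_f_and_recall(f_measure, recall):
    """Solve F = 2*P*R/(P+R) for P: P = F*R / (2*R - F)."""
    return f_measure * recall / (2 * recall - f_measure)

print(precision_from_f_and_recall(0.81, 0.70))  # ~0.96, in line with the ~97% quoted above
print(precision_from_f_and_recall(0.60, 0.70))  # 0.525, the 52.5% figure for the prior work
```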
Additional benefits of the reduced feature set include better speeds of classification and scalability. As linear SVMs and Na¨ıve Bayes classifiers work in linear time for classification [25], removing 90% of features allows classifications to be done about ten times faster. We have noted an order of magnitude improvement in code change classification time. This helps promote interactive use of the system within an Integrated Development Environment. C. Statistical Analysis of Results While the above results appear compelling, it is necessary to further analyze them to be sure that the results did not occur due to statistical anomalies. There are a few possible sources of anomaly. • The 10-fold validation is an averaging of the results for F-measure, accuracy, and all of the stats presented. Its possible that the variance of the cross validation folds is high. • Statistical analysis is needed to confirm that a better performing classifier is indeed better after taking into account cross validation variance. To ease comparison of results against Kim et al. [28], the cross validation random seed was set to the same value used by them. With the same datasets (except for JCP) and the same cross validation random seed, but with far better buggy F- measure and accuracy results, the improvement over no feature selection is straightforward to demonstrate. Showing that the Na¨ıve Bayes results are better than the SVM results when comparing tables V, VI is a more involved process that is outlined below. For the steps below, the better and worse classifiers are denoted respectively as cb and cw. 1) Increase the cross validation cycle to 100 runs of 10-fold validation, with the seed being varied each run. 2) For each run, note the metric of interest, buggy F- measure or accuracy. This process empirically generates a distribution for the metric of interest, dmcb using the better classifier cb. 3) Repeat steps 1, 2 for the worse classifier cw, attaining dmcw. 4) Use a one-sided Kolmogorov Smirnov test to show that the population CDF of distribution dmcb is larger than the CDF of dmcw at the 95% confidence level. The seed is varied every run in order to change elements of the train and test sets. Note that the seed for each run is the same for both cw and cb. In step four above, one can consider using a two sample bootstrap test to show that the points of dmcb come from a distribution with a higher mean than the points of dmcw [13]. However, the variance within 10-fold validation is the reason for the statistical analysis. The Birnbaum-Tingey one-sided contour analysis was used for the one-sided Kolmogorov- Smirnov test. It takes variance into account and can indicate if the CDF of one set of samples is larger than the CDF of the
  • 10. 10 TABLE V NA¨IVE BAYES (WITH SIGNIFICANCE ATTRIBUTE EVALUATION) ON THE OPTIMIZED FEATURE SET (BINARY) Project Features Percentage Accuracy Buggy Buggy Buggy Clean Clean Clean Buggy Name of Features Precision Recall F-measure Precision Recall F-measure ROC APACHE 1098 6.25 0.93 0.99 0.63 0.77 0.92 1.00 0.96 0.86 COLUMBA 1088 6.25 0.89 1.00 0.62 0.77 0.86 1.00 0.93 0.82 ECLIPSE 505 3.12 0.98 1.00 0.78 0.87 0.98 1.00 0.99 0.94 GAIM 1160 12.50 0.87 1.00 0.66 0.79 0.83 1.00 0.91 0.83 GFORGE 2248 25 0.85 0.84 0.84 0.84 0.86 0.86 0.86 0.93 JCP 1310 3.12 0.96 1.00 0.80 0.89 0.95 1.00 0.97 0.90 JEDIT 867 6.25 0.90 1.00 0.73 0.85 0.86 1.00 0.93 0.87 MOZILLA 852 6.24 0.94 1.00 0.80 0.89 0.92 1.00 0.96 0.90 PLONE 191 3.12 0.93 1.00 0.62 0.77 0.92 1.00 0.96 0.83 PSQL 2905 12.50 0.90 1.00 0.59 0.74 0.88 1.00 0.94 0.81 SVN 464 3.12 0.95 0.87 0.68 0.76 0.95 0.98 0.97 0.83 Average 1153.45 7.95 0.92 0.97 0.70 0.81 0.90 0.99 0.94 0.87 TABLE VI SVM (WITH GAIN RATIO) ON THE OPTIMIZED FEATURE SET (BINARY) Project Features Percentage Accuracy Buggy Buggy Buggy Clean Clean Clean Buggy Name of Features Precision Recall F-measure Precision Recall F-measure ROC APACHE 549 3.12 0.89 0.87 0.51 0.65 0.90 0.98 0.94 0.75 COLUMBA 4352 25 0.79 0.62 0.71 0.66 0.87 0.82 0.84 0.76 ECLIPSE 252 1.56 0.96 1.00 0.61 0.76 0.96 1.00 0.98 0.81 GAIM 580 6.25 0.82 0.98 0.53 0.69 0.78 0.99 0.87 0.76 GFORGE 2248 25 0.84 0.96 0.67 0.79 0.78 0.98 0.87 0.82 JCP 1310 3.12 0.92 1.00 0.61 0.76 0.91 1.00 0.95 0.8 JEDIT 3469 25 0.79 0.74 0.67 0.70 0.81 0.86 0.83 0.76 MOZILLA 426 3.12 0.87 1.00 0.58 0.73 0.85 1.00 0.92 0.79 PLONE 191 3.12 0.89 0.98 0.44 0.60 0.88 1.00 0.93 0.72 PSQL 2905 12.50 0.85 0.92 0.40 0.56 0.84 0.99 0.91 0.7 SVN 464 3.12 0.92 0.80 0.56 0.66 0.94 0.98 0.96 0.77 Average 1522.36 10.08 0.87 0.90 0.57 0.69 0.87 0.96 0.91 0.77 other set [9], [37]. It also returns a p-value for the assertion. The 95% confidence level was used. Incidentally, step 1 is also similar to bootstrapping but uses further runs of cross validation to generate more data points. The goal is to ensure that the results of average error estimation via k-fold validation are not curtailed due to variance. Many more cross validation runs help generate an empirical distribution. This test was performed on every dataset’s buggy F-measure and accuracy to compare the performance of the Na¨ıve Bayes classifier to the SVM. In most of the tests, the Na¨ıve Bayes dominated the SVM results in both accuracy and F-measure at p < 0.001. A notable exception is the accuracy metric for the Gforge project. While tables V and VI show that the Na¨ıve Bayes classifier has a 1% higher accuracy for Gforge, the empirical distribution for Gforge’s accuracy indicates that this is true only at a p of 0.67, meaning that this is far from a statistically significant result. The other projects and the F-measure results for Gforge demonstrate the dominance of the Na¨ıve Bayes results over the SVM. In spite of the results of tables IV and V, it is not possible to confidently state that a binary representation performs better than count for both the Na¨ıve Bayes and SVM classifiers without performing a statistical analysis. Na¨ıve Bayes using binary features dominates over the performance of Na¨ıve Fig. 1. Classifier F-measure by Project Bayes with count features at a p < 0.001. 
The binary SVM’s dominance over the best performing count SVM with the same feature selection technique (Gain Ratio) is also apparent, with a p < 0.001 on the accuracy and buggy F-measure for most projects, but lacking statistical significance on the buggy F-measure of Gaim.
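The repeated cross-validation and one-sided test used for these comparisons can be sketched along the following lines. This is an illustrative stand-in only: it uses scikit-learn's cross_val_score and SciPy's ks_2samp in place of the classifiers and the Birnbaum-Tingey contour computation described above, and it assumes buggy changes are labelled 1.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.model_selection import cross_val_score, StratifiedKFold

def repeated_cv_f1(clf, X, y, runs=100):
    """Buggy F-measure from `runs` repetitions of 10-fold cross validation,
    varying the shuffling seed each run to build an empirical distribution."""
    return np.array([
        cross_val_score(clf, X, y, scoring="f1",
                        cv=StratifiedKFold(n_splits=10, shuffle=True,
                                           random_state=seed)).mean()
        for seed in range(runs)
    ])

def dominates(scores_better, scores_worse, alpha=0.05):
    """One-sided two-sample Kolmogorov-Smirnov test. With alternative='less', the
    alternative hypothesis is that the empirical CDF of the first sample lies below
    that of the second, i.e. the first classifier's F-measures tend to be larger."""
    _, p_value = ks_2samp(scores_better, scores_worse, alternative="less")
    return p_value < alpha, p_value
```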
D. Feature Sensitivity

Research Question 3. Feature Sensitivity. What is the performance of change classification at varying percentages of features? What is the F-measure of the best performing classifier when using just 1% of all project features?
Section V-B reported results from using the best feature set chosen using a given optimization criterion, and showed that Naïve Bayes with Significance Attribute feature selection performed best with 7.95% of the original feature set, and SVM with Gain Ratio feature selection performed best at 10.08%. This is a useful result, since the reduced feature set decreases prediction time as compared to Kim et al. A buggy/clean decision is based on about a tenth of the initial number of features. This raises the question of how performance of change classification behaves with varying percentages of features.
To explore this question, for each project we ran a modified version of the feature selection process described in Algorithm 1, in which only 10% (instead of 50%) of features are removed each iteration using Significance Attribute Evaluation. After each feature selection step, buggy F-measure is computed using a Naïve Bayes classifier.

Fig. 2. Buggy F-measure vs Features using Naïve Bayes

The graph of buggy F-measure vs features, figure 2, follows a similar pattern for all projects. As the number of features decreases, the curve shows a steady increase in buggy F-measure. Some projects temporarily dip before an increase in F-measure occurs. Performance can increase with fewer features due to noise reduction, but can also decrease with fewer features due to important features missing. The dip in F-measure that typically reverts to higher F-measure can be explained by the presence of correlated features which are removed in later iterations. While correlated features are present at the early stages of feature selection, their influence is limited by the presence of a large number of features. They have a more destructive effect towards the middle before being removed. Following the curve in the direction of fewer features, most projects then stay at or near their maximum F-measure down to single digit percentages of features. This is significant, since it suggests a small number of features might produce results that are close to those obtained using more features. The reason can be attributed to Menzies et al. [39], who state that a small number of features with different types of information can sometimes capture more information than many features of the same type. In the experiments of the current paper there are many feature types. Fewer features bring two benefits: a reduction in memory footprint and an increase in classification speed.
A notable exception is the Gforge project. When trimming by 50% of features at a time, Gforge’s optimal F-measure point (at around 35%) was missed. Feature selection is still useful for Gforge, but the optimal point seems to be a bit higher than for the other projects.
In practice, one will not know a-priori the best percentage of features to use for bug prediction. Empirically from figure 2, a good starting point for bug prediction is at 15% of the total number of features for a project. One can certainly locate counterexamples, including Gforge from this paper. Nevertheless, 15% of features is a reasonable practical recommendation if no additional information is available for a project.
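The sensitivity sweep behind figure 2 can be approximated with the same kind of ranking used earlier. The sketch below evaluates the buggy F-measure at a few fixed fractions of the top-ranked features (the study instead removes 10% of features per iteration with Significance Attribute Evaluation); BernoulliNB and the buggy-changes-labelled-1 convention are assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB

def f_measure_curve(X, y, scores, fractions=(1.0, 0.5, 0.25, 0.15, 0.10, 0.05, 0.01)):
    """Buggy F-measure at decreasing fractions of the highest-ranked features,
    approximating the feature-sensitivity curve discussed above."""
    order = np.argsort(scores)[::-1]
    curve = []
    for frac in fractions:
        k = max(1, int(frac * len(order)))
        f1 = cross_val_score(BernoulliNB(), X[:, order[:k]], y,
                             cv=10, scoring="f1").mean()
        curve.append((frac, f1))
    return curve
```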
To better characterize performance at low numbers of features, accuracy, buggy precision, and buggy recall are computed for all projects using just 1% of features (selected using the Significance Attribute Evaluation process). Results are presented in Table VII. When using 1% percent of overall features, it is still possible to achieve high buggy precision, but at the cost of somewhat lower recall. Taking Apache as an example, using the F-measure optimal number of project features (6.25%, 1098 total) achieves buggy precision of 0.99 and buggy recall of 0.63 (Table V), while using 1% of all features yields buggy precision of 1 and buggy recall of 0.46 (Table VII). These results indicate that a small number of features have strong predictive power for distinguishing between buggy and clean changes. An avenue of future work is to explore the degree to which these features have a causal relationship with bugs. If a strong causal relationship exists, it might be possible to use these features as input to a static analysis of a software project. Static analysis violations spanning those keywords can then be prioritized for human review. A natural question that follows is the breakdown of the top features. A software developer might ask which code attributes are the most effective predictors of bugs. The next section deals with an analysis of the top 100 features of each project.
TABLE VII
NAÏVE BAYES WITH 1% OF ALL FEATURES

Project Name | Features | Accuracy | Buggy Precision | Buggy Recall
APACHE       | 175      | 0.90     | 1.00            | 0.46
COLUMBA      | 174      | 0.82     | 1.00            | 0.38
ECLIPSE      | 161      | 0.97     | 1.00            | 0.69
GFORGE       | 89       | 0.46     | 0.46            | 1.00
JCP          | 419      | 0.92     | 1.00            | 0.61
JEDIT        | 138      | 0.81     | 1.00            | 0.48
MOZILLA      | 136      | 0.88     | 1.00            | 0.60
PLONE        | 61       | 0.89     | 1.00            | 0.45
PSQL         | 232      | 0.84     | 1.00            | 0.34
SVN          | 148      | 0.93     | 1.00            | 0.43
GAIM         | 92       | 0.79     | 1.00            | 0.45
Average      | 165.91   | 0.84     | 0.95            | 0.54

E. Breakdown of Top 100 Features

Research Question 4. Best Performing Features. Which classes of features are the most useful for performing bug predictions?
Table VIII provides a breakdown by category of the 100 most prominent features in each project. The top three types are purely keyword related. Adding further occurrences of a keyword to a file has the highest chance of creating a bug, followed by deleting a keyword, followed finally by introducing an entirely new keyword to a file. Changelog features are the next most prominent set of features. These are features based on the changelog comments for a particular code change. Finally, filename features are present within the top 100 for a few of the projects. These features are generated from the actual names of files in a project.
It is somewhat surprising that complexity features do not make the top 100 feature list. The Apache project has 3 complexity features in the top 100 list. This was not presented in Table VIII as Apache is the only project where complexity metrics proved useful. Incidentally, the top individual feature for the Apache project is the average length of the comment lines in a code change.
The absence of metadata features in the top 100 list is also surprising. Author name, commit hour, and commit day do not seem to be significant in predicting bugs. The commit hour and commit day were converted to binary form before classifier analysis. It is possible that encoding new features using a combination of commit hour and day such as “Mornings”, “Evenings”, or “Weekday Nights” might provide more bug prediction value. An example would be if the hour 22 is not a good predictor of buggy code changes but hours 20-24 are jointly a strong predictor of buggy changes. In this case, encoding “Nights” as hours 20-24 would yield a new feature that is more effective at predicting buggy code changes.
Five of the eleven projects had BOW+ features among the top features, including Gforge, JCP, Plone, PSQL, and SVN. The Naïve Bayes optimized feature set (Table V) contains BOW+ features for those projects with the addition of Gaim and Mozilla, a total of seven out of eleven projects. BOW+ features help add predictive value.

TABLE VIII
TOP 100 BUG PREDICTING FEATURES

Project Name | Added Delta | Deleted Delta | New Revision Source | Changelog Features | Filename Features
APACHE       | 49          | 43            | 0                   | 0                  | 5
COLUMBA      | 26          | 18            | 50                  | 4                  | 2
ECLIPSE      | 34          | 40            | 26                  | 0                  | 0
GAIM         | 55          | 40            | 3                   | 2                  | 0
GFORGE       | 23          | 30            | 40                  | 6                  | 1
JCP          | 39          | 25            | 36                  | 0                  | 0
JEDIT        | 50          | 29            | 17                  | 3                  | 1
MOZILLA      | 67          | 9             | 14                  | 10                 | 0
PLONE        | 50          | 23            | 19                  | 6                  | 2
PSQL         | 44          | 34            | 22                  | 0                  | 0
SVN          | 48          | 42            | 1                   | 9                  | 0
Average      | 44.09       | 30.27         | 20.72               | 3.63               | 1

F. Algorithm Runtime Analysis

The classifiers used in the study consist of linear SVM and Naïve Bayes. The theoretical training time for these classifiers has been analyzed in the research literature. Traditionally, Naïve Bayes is faster to train than a linear SVM. The Naïve Bayes classifier has a training runtime of O(s ∗ n), where s is the number of non-zero features and n is the number of training instances. The SVM traditionally has a runtime of approximately O(s ∗ n^2.1) [25].
There is recent work in reducing the runtime of linear SVMs to O(s ∗ n) [25]. Optimizing the classifiers for F-measure slows both down. In our experiments, training an F-measure optimized Na¨ıve Bayes classifier was faster than doing so for the SVM though the speeds were comparable when the number of features are low. The current paper uses Liblinear [15], one of the fastest implementations of a linear SVM. If a slower SVM package were used, the performance gap would be far wider. The change classification time for both classifiers is linear with the number of features returned after the training. An important issue is the speed of classification on the projects without any feature selection. While the SVM or Na¨ıve Bayes algorithms do not crash in the presence of a lot of data, training times and memory requirements are considerable. Feature selection allows one to reduce classi- fication time. The time gain on classification is an order of magnitude improvement, as less than 10% of features are typically needed for improved classification performance. This allows individual code classifications to be scalable. A rough wall-clock analysis of the time required to classify a single code change improved from a few seconds to a few hundred milliseconds when about 10% of features are used (with a 2.33 Ghz Intel Core 2 Quad CPU and 6GB of RAM). Lowered RAM requirements allow multiple trained classifiers to operate simultaneously without exceeding the amount of physical RAM present in a change classification server. The combined impact of reduced classifier memory footprint and reduced classification time will permit a server-based classifi- cation service to support substantially more classifications per user. The top performing filter methods themselves are quite fast. Significance Attribute Evaluation and Gain Ratio also operate
  • 13. 13 linearly with respect to the number of training instances multiplied by the number of features. Rough wall-clock times show 4-5 seconds for the feature ranking process (with a 2.33 Ghz Intel Core 2 Quad CPU and 6GB of RAM). It is only necessary to do this computation once when using these techniques within the feature selection process, algorithm 1. VI. RELATED WORK Given a software project containing a set of program units (files, classes, methods or functions, or changes depending on prediction technique and language), a bug prediction algorithm outputs one of the following. Totally Ordered Program Units. A total ordering of program units from most to least bug prone [26] using an ordering metric such as predicted bug density for each file [44]. If desired, this can be used to create a partial ordering (see below). Partially Ordered Program Units. A partial ordering of program units into bug prone categories (e.g. the top N% most bug-prone files in [21], [29], [44]) Prediction on a Given Software Unit. A prediction on whether a given software unit contains a bug. Prediction granularities range from an entire file or class [19], [22] to a single change (e.g., Change Classification [28]). Change classification and faulty program unit detection techniques both aim to locate software defects, but differ in scope and resolution. While change classification focuses on changed entities, faulty module detection techniques do not need a changed entity. The pros of change classification include the following: • Bug pinpointing at a low level of granularity, typically about 20 lines of code. • Possibility of IDE integration, enabling a prediction im- mediately after code is committed. • Understandability of buggy code prediction from the perspective of a developer. The cons include the following: • Excess of features to account for when including key- words. • Failure to address defects not related to recent file changes. • Inability to organize a set of modules by likelihood of being defective Related work in each of these areas is detailed below. As this paper employs feature selection, related work on feature selection techniques is then addressed. This is followed by a comparison of the current paper’s results against similar work. A. Totally Ordered Program Units Khoshgoftaar and Allen have proposed a model to list modules according to software quality factors such as future fault density [26], [27]. The inputs to the model are software complexity metrics such as LOC, number of unique operators, and cyclomatic complexity. A stepwise multiregression is then performed to find weights for each factor. Briand et al. use object oriented metrics to predict classes which are likely to contain bugs. They used PCA in combination with logistic regression to predict fault prone classes [10]. Morasca et al. use rough set theory and logistic regression to predict risky modules in commercial software [42]. Key inputs to their model include traditional metrics such as LOC, code block properties in addition to subjective metrics such as module knowledge. Mockus and Weiss predict risky modules in soft- ware by using a regression algorithm and change measures such as the number of systems touched, the number of modules touched, the number of lines of added code, and the number of modification requests [41]. Ostrand et al. identify the top 20 percent of problematic files in a project [44]. 
Using future fault predictors and a negative binomial regression model, they predict the fault density of each file.

B. Partially Ordered Program Units
The previous section covered work based on a total ordering of all program modules. This could be converted into a partially ordered program list, e.g. by presenting the top N% of modules, as performed by Ostrand et al. above. This section deals with work that can only return a partial ordering of bug-prone modules. Hassan and Holt use a caching algorithm to compute the set of fault-prone modules, called the top-10 list [21]. They use four factors to determine this list: software units that were most frequently modified, most recently modified, most frequently fixed, and most recently fixed. A recent study by Shihab et al. [48] investigates the impact of code and process metrics on future defects when applied to the Eclipse project. The focus is on reducing metrics to reach a much smaller, though still statistically significant, set of metrics for computing potentially defective modules. Kim et al. proposed the bug cache algorithm to predict future faults based on previous fault localities [29]. In an empirical study of partially ordered faulty modules, Lessmann et al. [33] conclude that the choice of classifier may have a less profound impact on the prediction than previously thought, and recommend the ROC AUC as an accuracy indicator for comparative studies in software defect prediction.

C. Prediction on a Given Software Unit
Using decision trees and neural networks that employ object-oriented metrics as features, Gyimóthy et al. [19] predict faulty classes of the Mozilla project across several releases. Their buggy precision and recall are both about 70%, resulting in a buggy F-measure of 70%. Our buggy precision for the Mozilla project is around 100% (+30%) and recall is at 80% (+10%), resulting in a buggy F-measure of 89% (+19%). In addition, they predict faults at the class level of granularity (typically by file), while our level of granularity is by code change. Aversano et al. [4] achieve 59% buggy precision and recall using KNN (K nearest neighbors) to locate faulty modules. Hata et al. [22] show that a technique used for spam filtering of emails can be successfully used on software modules to classify software as buggy or clean. However, they classify static code (such as the current contents of a file), while our approach classifies file changes. They achieve 63.9% precision, 79.8% recall, and 71% buggy F-measure on the best data points of source code history for two Eclipse plugins.
We obtain buggy precision, recall, and F-measure figures of 100% (+36.1%), 78% (-1.8%), and 87% (+16%), respectively, with our best performing technique on the Eclipse project (Table V). Menzies et al. [39] achieve good results on their best projects. However, their average precision is low, ranging from a minimum of 2%, through a median of 20%, to a maximum of 70%. As mentioned in Section III, to avoid optimizing on precision or recall, we present F-measure figures. A commonality we share with the work of Menzies et al. is their use of Information Gain (quite similar to the Gain Ratio that we use, as explained in Section II-C) to rank features. Both Menzies et al. and Hata et al. focus on the file level of granularity.

Kim et al. show that using support vector machines on software revision history information can provide an average bug prediction accuracy of 78%, a buggy F-measure of 60%, and a precision and recall of 60% when tested on twelve open source projects [28]. Our corresponding results are an accuracy of 92% (+14%), a buggy F-measure of 81% (+21%), a precision of 97% (+37%), and a recall of 70% (+10%). Elish and Elish [14] also used SVMs to predict buggy modules in software. Table IX compares our results with those of earlier work. Hata et al., Aversano et al., and Menzies et al. did not report overall accuracy, focusing instead on precision and recall results.

Recently, D’Ambros et al. [12] provided an extensive comparison of various bug prediction algorithms that operate at the file level, using ROC AUC to compare algorithms. Wrapper Subset Evaluation is used sequentially to trim the attribute set. They find that the top ROC AUC of 0.921 was achieved for the Eclipse project using the prediction approach of Moser et al. [43]. By comparison, the feature selection and change classification approach of the current paper achieved an ROC AUC of 0.94 for Eclipse. While this is not a significant difference, it does show that change classification after feature selection can provide results comparable to those of Moser et al., which are based on code metrics such as code churn, past bugs, refactorings, number of authors, file age, etc. An advantage of our change classification approach over that of Moser et al. is that it operates at the granularity of code changes, which permits developers to receive faster feedback on small regions of code. In addition, since the top features are keywords (Section V-E), it is easier to explain to developers why a code change is diagnosed as buggy: a change flagged for containing certain combinations of keywords is easier to understand than one flagged for a combination of source code metrics.

Challagulla et al. investigate machine learning algorithms to identify faulty real-time software modules [11]. They find that predicting the number of defects in a module is much harder than predicting whether a module is defective. They achieve best results using the Naïve Bayes classifier. They conclude that “size” and “complexity” metrics are not sufficient attributes for accurate prediction. The next section moves on to work focusing on feature selection.

D. Feature Selection
Hall and Holmes [20] compare six different feature selection techniques when using the Naïve Bayes and C4.5 [45] classifiers. Each dataset analyzed has about one hundred features.
The iterative feature selection method and analysis used in this paper differ from those of Hall and Holmes in that the present work involves substantially more features, drawn from a different type of corpus (features extracted from software projects). Many of the feature selection techniques used by Hall and Holmes are also used in this paper.

Song et al. [51] propose a general defect prediction framework involving a data preprocessor, feature selection, and learning algorithms. They also note that small changes to data representation can have a major impact on the results of feature selection and defect prediction. This paper uses a different experimental setup due to the large number of features. However, some elements of this paper were adopted from Song et al., including cross validation during the feature ranking process for filter methods. In addition, the Naïve Bayes classifier was used, and the J48 classifier was attempted; the results for the latter are mentioned in Section VII. Their suggestion of using a log pre-processor is hard to adopt for keyword changes and cannot be applied to binary interpretations of keyword features. Forward and backward feature selection, one attribute at a time, is not a practical solution for large datasets.

Gao et al. [18] apply several feature selection algorithms to predict defective software modules for a large legacy telecommunications software system. Seven filter-based techniques and three subset selection search algorithms are employed. Removing 85% of software metric features does not adversely affect results, and in some cases improves them. The current paper uses keyword information as features instead of product, process, and execution metrics, and also uses historical data. Finally, eleven systems are examined here instead of four. Similar to that work, however, a variety of feature selection techniques is applied, but to a far larger set of attributes.

VII. THREATS TO VALIDITY
There are six major threats to the validity of this study.

Systems examined might not be representative of typical projects. Eleven systems are examined, a relatively high number compared to other work reported in the literature. In spite of this, it is still possible that we accidentally chose systems that have better (or worse) than average bug classification accuracy. Since we intentionally chose systems that had some degree of linkage between change tracking systems and change log text (to determine fix-inducing changes), there is a project selection bias.

Systems are mostly open source. The systems examined in this paper, with the exception of JCP, all use an open source development methodology, and hence might not be representative of typical development contexts, potentially affecting external validity [52]. It is possible that more deadline pressures, differing personnel turnover patterns, and varied development processes used in commercial development could lead to different buggy change patterns.
TABLE IX
COMPARISONS OF OUR APPROACH TO RELATED WORK

Authors               Project(s)        Accuracy   Buggy F-measure   Buggy Precision   Buggy Recall   Granularity
Hata et al. [22] (a)  Eclipse Plugins   -          71                64                80             File
Shivaji et al.        Eclipse           98         87                100               78             Code Change
Gyimóthy et al. [19]  Mozilla           73         70                70                70             File
Shivaji et al.        Mozilla           94         89                100               80             Code Change
Aversano et al. [4]   Multiple          -          59 (b)            59                59 (b)         File
Menzies et al. [39]   Multiple          -          74 (b)            70 (b)            79             File
Kim et al. [28]       Multiple          78         60                60                60             Code Change
Shivaji et al.        Multiple          92         81                97                70             Code Change

(a) Result is from the two best historical data points.
(b) Results presented are the best project results.

Bug fix data is incomplete. Even though we selected projects that have decent change logs, we are still only able to extract a subset of the total number of bugs (typically only 40%-60% of those reported in the bug tracking system). Since the quality of change log comments varies across projects, it is possible that the output of the classification algorithm will include false positives and false negatives. Recent research by Bachmann et al., focusing on the Apache system, is starting to shed light on the size of this missing data [5]. The impact of such missing data has been explored by Bird et al., who find that in its presence the Bug Cache prediction technique [30] is biased towards finding less severe bug types [8].

Bug introducing data is incomplete. The SZZ algorithm used to identify bug-introducing changes has limitations: it cannot find bug-introducing changes for bug fixes that only involve deletion of source code. It also cannot identify bug-introducing changes caused by a change made to a file different from the one being analyzed. It is also possible to miss bug-introducing changes when a file changes its name, since renames are not tracked.

Selected classifiers might not be optimal. We explored many other classifiers and found that Naïve Bayes and SVM consistently returned the best results. Other popular classifiers include decision trees (e.g. J48) and JRip. The average buggy F-measure for the projects surveyed in this paper using J48 and JRip with feature selection was 51% and 48%, respectively. Though reasonable, these results are not as strong as those for Naïve Bayes and SVM, the focus of this paper.

Feature selection might remove features which become important in the future. Feature selection was employed in this paper to remove features in order to optimize performance and scalability. The feature selection techniques used ensured that less important features were removed, and better performance did result from the reduction. Nevertheless, it might turn out that previously removed features become important in future revisions, and their absence might lower prediction quality.

VIII. CONCLUSION
This paper has explored the use of feature selection techniques to predict software bugs. An important pragmatic result is that feature selection can be performed in increments of half of all remaining features, allowing it to proceed quickly. Between 3.12% and 25% of the total feature set yielded optimal classification results. The reduced feature set permits better and faster bug predictions. The feature selection process presented in Section II-D was empirically applied to eleven software projects. The process is fast and can be applied to predicting bugs on other projects.
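The halving scheme summarized above can be sketched as a short loop. In the sketch, rank_features and evaluate are placeholders for a filter ranker (such as Gain Ratio) and a cross-validated buggy F-measure computation; they are assumptions for illustration rather than the paper's exact Algorithm 1.

# Minimal sketch of halving-style feature elimination, under the assumptions above.
def select_features(features, rank_features, evaluate):
    """Repeatedly drop the lower-ranked half of the remaining features,
    keeping the subset that gives the best buggy F-measure seen so far."""
    best_subset = list(features)
    best_score = evaluate(best_subset)
    remaining = list(features)
    while len(remaining) > 1:
        ranked = rank_features(remaining)               # best-ranked first, e.g. by Gain Ratio
        remaining = ranked[: max(1, len(ranked) // 2)]  # discard the bottom half
        score = evaluate(remaining)                     # e.g. cross-validated buggy F-measure
        if score > best_score:
            best_subset, best_score = list(remaining), score
    return best_subset, best_score

# Hypothetical usage:
# best, score = select_features(all_keywords, rank_by_gain_ratio, cross_validated_f_measure)

Because each step halves the candidate set, the number of evaluations is logarithmic in the original feature count, which is what allows the selection process to proceed quickly.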
The most important results stemming from using the feature selection process are found in Table V, which presents F-measure optimized results for the Naïve Bayes classifier. The average buggy precision is quite high at 0.97, with a reasonable recall of 0.70. This result outperforms similar classifier-based bug prediction techniques and paves the way for practical adoption of classifier-based bug prediction.

From the perspective of a developer receiving bug predictions on their work, these figures mean that if the classifier says a code change has a bug, it is almost always right. The recall figures mean that, on average, 30% of all bugs will not be detected by the bug predictor. This is likely a fundamental limitation of history-based bug prediction, as there might be new types of bugs that have not yet been incorporated into the training data. We believe this represents a reasonable tradeoff, since increasing recall would come at the expense of more false bug predictions, not to mention a decrease in the aggregate buggy F-measure figure. Such predictions can waste developer time and reduce their confidence in the system.

In the future, when software developers have advanced bug prediction technology integrated into their software development environment, the use of classifiers with feature selection will permit fast, precise, and accurate bug predictions. The age of bug-predicting imps will have arrived.

ACKNOWLEDGMENTS
The authors would like to thank the anonymous reviewers, Dr. Tigran Ishkhanov, and Shyamal Chandra of IBM Almaden for their helpful comments.
REFERENCES
[1] A. Ahmad and L. Dey. A Feature Selection Technique for Classificatory Analysis. Pattern Recognition Letters, 26(1):43–56, 2005.
[2] E. Alpaydin. Introduction To Machine Learning. MIT Press, 2004.
[3] A. Anagnostopoulos, A. Broder, and K. Punera. Effective and Efficient Classification on a Search-Engine Model. Proc. 15th ACM Int’l Conf. on Information and Knowledge Management, Jan 2006.
[4] L. Aversano, L. Cerulo, and C. Del Grosso. Learning from Bug-introducing Changes to Prevent Fault Prone Code. In Proceedings of the Foundations of Software Engineering, pages 19–26. ACM New York, NY, USA, 2007.
[5] A. Bachmann, C. Bird, F. Rahman, P. Devanbu, and A. Bernstein. The Missing Links: Bugs and Bug-Fix Commits. In Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering, FSE ’10, pages 97–106, New York, NY, USA, 2010. ACM.
[6] A. Bessey, K. Block, B. Chelf, A. Chou, B. Fulton, S. Hallem, C. Henri-Gros, A. Kamsky, S. McPeak, and D. R. Engler. A few billion lines of code later: using static analysis to find bugs in the real world. Commun. ACM, 53(2):66–75, 2010.
[7] J. Bevan, E. Whitehead Jr, S. Kim, and M. Godfrey. Facilitating Software Evolution Research with Kenyon. Proc. ESEC/FSE 2005, pages 177–186, 2005.
[8] C. Bird, A. Bachmann, E. Aune, J. Duffy, A. Bernstein, V. Filkov, and P. Devanbu. Fair and Balanced?: Bias in Bug-Fix Datasets. In Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering, ESEC/FSE ’09, pages 121–130, New York, NY, USA, 2009. ACM.
[9] Z. Birnbaum and F. Tingey. One-sided Confidence Contours for Probability Distribution Functions. The Annals of Mathematical Statistics, pages 592–596, 1951.
[10] L. C. Briand, J. Wüst, S. V. Ikonomovski, and H. Lounis. Investigating Quality Factors in Object-oriented Designs: An Industrial Case Study. Software Engineering, International Conference on, 0:345, 1999.
[11] V. Challagulla, F. Bastani, I. Yen, and R. Paul. Empirical Assessment of Machine Learning Based Software Defect Prediction Techniques. In Object-Oriented Real-Time Dependable Systems, 2005. WORDS 2005. 10th IEEE International Workshop on, pages 263–270. IEEE, 2005.
[12] M. D’Ambros, M. Lanza, and R. Robbes. Evaluating Defect Prediction Approaches: a Benchmark and an Extensive Comparison. Empirical Software Engineering, pages 1–47, 2011.
[13] B. Efron and R. Tibshirani. An Introduction to the Bootstrap, volume 57. Chapman & Hall/CRC, 1993.
[14] K. Elish and M. Elish. Predicting Defect-Prone Software Modules Using Support Vector Machines. Journal of Systems and Software, 81(5):649–660, 2008.
[15] R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin. Liblinear: A Library for Large Linear Classification. The Journal of Machine Learning Research, 9:1871–1874, 2008.
[16] T. Fawcett. An Introduction to ROC Analysis. Pattern Recognition Letters, 27(8):861–874, 2006.
[17] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, 2001.
[18] K. Gao, T. Khoshgoftaar, H. Wang, and N. Seliya. Choosing Software Metrics for Defect Prediction: an Investigation on Feature Selection Techniques. Software: Practice and Experience, 41(5):579–606, 2011.
[19] T. Gyimóthy, R. Ferenc, and I. Siket. Empirical Validation of Object-Oriented Metrics on Open Source Software for Fault Prediction. IEEE Trans. Software Eng., 31(10):897–910, 2005.
[20] M. Hall and G. Holmes. Benchmarking Attribute Selection Techniques for Discrete Class Data Mining. Knowledge and Data Engineering, IEEE Transactions on, 15(6):1437–1447, 2003.
[21] A. Hassan and R. Holt. The Top Ten List: Dynamic Fault Prediction. Proc. ICSM’05, Jan 2005.
[22] H. Hata, O. Mizuno, and T. Kikuno. An Extension of Fault-prone Filtering using Precise Training and a Dynamic Threshold. Proc. MSR 2008, 2008.
[23] Scientific Toolworks. Maintenance, Understanding, Metrics and Documentation Tools for Ada, C, C++, Java, and Fortran. http://www.scitools.com/, 2005.
[24] T. Joachims. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Machine Learning: ECML-98, pages 137–142, 1998.
[25] T. Joachims. Training Linear SVMs in Linear Time. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, page 226. ACM, 2006.
[26] T. Khoshgoftaar and E. Allen. Predicting the Order of Fault-Prone Modules in Legacy Software. Proc. 1998 Int’l Symp. on Software Reliability Eng., pages 344–353, 1998.
[27] T. Khoshgoftaar and E. Allen. Ordering Fault-Prone Software Modules. Software Quality J., 11(1):19–37, 2003.
[28] S. Kim, E. W. Jr., and Y. Zhang. Classifying Software Changes: Clean or Buggy? IEEE Trans. Software Eng., 34(2):181–196, 2008.
[29] S. Kim, T. Zimmermann, E. W. Jr., and A. Zeller. Predicting Faults from Cached History. Proc. ICSE 2007, pages 489–498, 2007.
[30] S. Kim, T. Zimmermann, E. Whitehead Jr, and A. Zeller. Predicting Faults from Cached History. In Proceedings of the 29th international conference on Software Engineering, pages 489–498. IEEE Computer Society, 2007.
[31] I. Kononenko. Estimating Attributes: Analysis and Extensions of Relief. In Machine Learning: ECML-94, pages 171–182. Springer, 1994.
[32] B. Larsen and C. Aone. Fast and Effective Text Mining using Linear-time Document Clustering. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 16–22. ACM New York, NY, USA, 1999.
[33] S. Lessmann, B. Baesens, C. Mues, and S. Pietsch. Benchmarking Classification Models for Software Defect Prediction: A Proposed Framework and Novel Findings. IEEE Trans. on Software Eng., 34(4):485–496, 2008.
[34] D. Lewis. Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval. Machine Learning: ECML-98, pages 4–15, 1998.
[35] H. Liu and H. Motoda. Feature Selection for Knowledge Discovery and Data Mining. Springer, 1998.
[36] J. Madhavan and E. Whitehead Jr. Predicting Buggy Changes Inside an Integrated Development Environment. Proc. 2007 Eclipse Technology eXchange, 2007.
[37] F. Massey Jr. The Kolmogorov-Smirnov Test for Goodness of Fit. Journal of the American Statistical Association, pages 68–78, 1951.
[38] A. McCallum and K. Nigam. A Comparison of Event Models for Naive Bayes Text Classification. AAAI-98 Workshop on Learning for Text Categorization, Jan 1998.
[39] T. Menzies, J. Greenwald, and A. Frank. Data Mining Static Code Attributes to Learn Defect Predictors. IEEE Trans. Software Eng., 33(1):2–13, 2007.
[40] A. Mockus and L. Votta. Identifying Reasons for Software Changes using Historic Databases. Proc. ICSM 2000, page 120, 2000.
[41] A. Mockus and D. Weiss. Predicting Risk of Software Changes. Bell Labs Tech. Journal, 5(2):169–180, 2000.
[42] S. Morasca and G. Ruhe. A Hybrid Approach to Analyze Empirical Software Engineering Data and its Application to Predict Module Fault-proneness in Maintenance. J. Syst. Softw., 53(3):225–237, 2000.
[43] R. Moser, W. Pedrycz, and G. Succi. A Comparative Analysis of the Efficiency of Change Metrics and Static Code Attributes for Defect Prediction. In Software Engineering, 2008. ICSE’08. ACM/IEEE 30th International Conference on, pages 181–190. IEEE, 2008.
[44] T. Ostrand, E. Weyuker, and R. Bell. Predicting the Location and Number of Faults in Large Software Systems. IEEE Trans. Software Eng., 31(4):340–355, 2005.
[45] J. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[46] M. Robnik-Šikonja and I. Kononenko. Theoretical and Empirical Analysis of ReliefF and RReliefF. Machine Learning, 53(1):23–69, 2003.
[47] S. Scott and S. Matwin. Feature Engineering for Text Classification. Machine Learning-International Workshop, pages 379–388, 1999.
[48] E. Shihab, Z. Jiang, W. Ibrahim, B. Adams, and A. Hassan. Understanding the Impact of Code and Process Metrics on Post-Release Defects: a Case Study on the Eclipse Project. Proc. ESEM 2010, pages 1–10, 2010.
[49] S. Shivaji, E. Whitehead Jr, R. Akella, and S. Kim. Reducing Features to Improve Bug Prediction. In 2009 IEEE/ACM International Conference on Automated Software Engineering, pages 600–604. IEEE, 2009.
[50] J. Sliwerski, T. Zimmermann, and A. Zeller. When Do Changes Induce Fixes? Proc. MSR 2005, pages 24–28, 2005.
[51] Q. Song, Z. Jia, M. Shepperd, S. Ying, and J. Liu. A General Software Defect-proneness Prediction Framework. Software Engineering, IEEE Transactions on, (99):356–370, 2011.
[52] H. K. Wright, M. Kim, and D. E. Perry. Validity Concerns in Software Engineering Research. In Proceedings of the FSE/SDP workshop on Future of software engineering research, FoSER ’10, pages 411–414, New York, NY, USA, 2010. ACM.
[53] T. Zimmermann and P. Weißgerber. Preprocessing CVS Data for Fine-Grained Analysis. Proc. MSR 2004, pages 2–6, 2004.
Shivkumar Shivaji is a PhD student in the Department of Computer Science, University of California, Santa Cruz. His research interests include software evolution, software bug prediction, and machine learning. He has worked for large software firms such as eBay, Yahoo!, and IBM, along with smaller firms, as a senior engineer/architect. He is also a chess master.

E. James Whitehead, Jr. is a professor in the Department of Computer Science, University of California, Santa Cruz. His research interests in software engineering include software evolution and software bug prediction, and his research interests in computer games are intelligent design tools, procedural content generation, and level design. He received his BS in Electrical Engineering from the Rensselaer Polytechnic Institute, and his PhD in Information and Computer Science from the University of California, Irvine.

Sunghun Kim is an assistant professor of computer science at the Hong Kong University of Science and Technology. His core research area is software engineering, focusing on software evolution, program analysis, and empirical studies. He received the PhD degree from the Computer Science Department at the University of California, Santa Cruz in 2006. He was a postdoctoral associate at the Massachusetts Institute of Technology and a member of the Program Analysis Group. He was a chief technical officer (CTO), and led a 25-person team for six years at Nara Vision Co. Ltd., a leading Internet software company in Korea.

Ram Akella is currently a professor and director of the Center for Large-scale Live Analytics and Smart Services (CLASS), and Center for Knowledge, Information Systems and Technology Management (KISMT) at the University of California, Silicon Valley Center. He held postdoctoral appointments at Harvard and MIT (EECS/LIDS). He joined Carnegie Mellon University in 1985 as an Associate Professor in the Business School and the School of Computer Science, prior to appointments at schools including MIT, Berkeley, and Stanford. He has been a professor of Engineering, Computer Science, Management, Information, and Medicine. He has established an ORU, and the TIM program at UCSC/SVC as Founding Director. His research interests include Data/Text Analytics, Mining, Search, and Recommender Systems. He has worked with over 200 firms, including AOL (Faculty Award), Cisco, Google (Research Award), IBM (Faculty Award), and SAP. He has been on editorial boards and program committees of the ACM, IEEE, INFORMS, and IIE. His students are endowed or Department Chairs at major universities, or executives (including CEOs) of major firms. His research has achieved transformational impact on industry, and he has received stock from executives.