A Tale of Experiments on
Bug Prediction
Martin Pinzger
Professor of Software Engineering
University of Klagenfurt, Austria
Follow me: @pinzger
Software repositories
Hmm, wait a minute
Can’t we learn “something” from that data?
Goal of software repository mining
Making the information stored in software repositories available to
software developers
Quality analysis and defect prediction
Recommender systems
Examples from my mining research
Predicting failure-prone source files using changes (MSR 2011)
Predicting failure-prone methods (ESEM 2012)
The relationship between developer contributions and failures (FSE
There are many more studies
MSR 2013
A survey and taxonomy of approaches for mining software repositories in the
context of software evolution, Kagdi et al. 2007
Using Fine-Grained Source Code
Changes for Bug Prediction
with Emanuel Giger, Harald Gall
University of Zurich
Bug prediction
Train models to predict the bug-prone source files of the next release
Using product measures, process measures, organizational measures with
machine learning techniques
Many existing studies on building prediction models
Moser et al., Nagappan et al., Zimmermann et al., Hassan et al., etc.
Process measures performed particularly well
Classical change measures
Number of file revisions
Code Churn aka lines added/deleted/changed
Research question of this study: Can we further improve these
Revisions are coarse grained
What did change in a revision?
Code Churn can be imprecise
Extra changes not relevant for locating bugs
Fine Grained-Source Code Changes (SCC)
Revision 1.5: IF "balance > 0" THEN "withDraw(amount);"
Revision 1.6: IF "balance > 0 && amount <= balance" THEN "withDraw(amount);" ELSE "notify();"
3 SCC: 1x condition change, 1x else-part insert, 1x invocation statement insert
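To make the contrast between line-based churn and SCC concrete, here is a minimal sketch. The two Java snippets are a hypothetical reconstruction of revisions 1.5 and 1.6 from the example above, and the SCC list is hand-labeled as on the slide; a real extraction would diff the abstract syntax trees instead.

```python
import difflib

# Hypothetical source text for the two revisions (assumed, not from the slides).
REV_1_5 = """\
if (balance > 0) {
    withDraw(amount);
}
""".splitlines()

REV_1_6 = """\
if (balance > 0 && amount <= balance) {
    withDraw(amount);
} else {
    notify();
}
""".splitlines()


def lines_modified(old, new):
    """Count textual lines added plus lines deleted between two revisions."""
    added = deleted = 0
    for line in difflib.ndiff(old, new):
        if line.startswith("+ "):
            added += 1
        elif line.startswith("- "):
            deleted += 1
    return added + deleted


# Hand-labeled fine-grained changes for 1.5 -> 1.6, as listed on the slide.
SCC_TYPES = [
    "condition change",
    "else-part insert",
    "invocation statement insert",
]

lm = lines_modified(REV_1_5, REV_1_6)
scc = len(SCC_TYPES)
print(f"LM = {lm}, SCC = {scc}")
```

The line count depends on formatting (brace placement, line wrapping), while the SCC count depends only on the statements that changed.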
Research hypotheses
H1 SCC is correlated with the number of bugs in
source files
H2 SCC is a predictor for bug-prone source files
(and outperforms LM)
H3 SCC is a predictor for the number of bugs in
source files (and outperforms LM)
15 Eclipse plug-ins
>850’000 fine-grained source code changes (SCC)
>10’000 files
>9’700’000 lines modified (LM)
>9 years of development history
..... and a lot of bugs referenced in commit messages (e.g., bug #345)
Typical steps of such an experiment
Analyze correctness and distribution of the data
Use descriptive statistics, histograms, Q-Q plots
-> Determines the statistical methods that you can use
Perform correlation analysis
Spearman (non-parametric)
Machine learners/classifiers
Simple ones first (linear regression, binary logistic regression)
10-fold cross validation, precision, recall, AUC ROC
Interpretation, discussion of results (incl. threats to validity)
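The evaluation step above can be sketched in plain Python. This is an illustrative sketch only: the helper names `k_fold_indices` and `precision_recall` are assumptions, the folds are contiguous and unshuffled, and a real experiment would typically rely on a machine learning toolkit.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists for k contiguous folds of near-equal size."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def precision_recall(actual, predicted):
    """Precision and recall for the positive (bug-prone) class."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


folds = list(k_fold_indices(25, k=10))
print(len(folds), precision_recall([1, 1, 0, 0], [1, 0, 1, 0]))
```

Each sample lands in exactly one test fold, so every file is predicted once by a model that did not see it during training.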
Approach overview
[Figure: data extraction in four steps: 1. Versioning Data (log entries), 2. Bug Data, 3. Source Code Changes (SCC), 4. Experiment]
H1: SCC is correlated with #bugs
Non-parametric Spearman rank correlation of bugs, LM, and SCC. * marks significant correlations at α = 0.01. Larger values are printed bold.

Eclipse Project   LM     SCC
Compare           0.68   0.76
jFace             0.74   0.71
JDT Debug         0.62   0.8
Resource          0.75   0.86
Runtime           0.66   0.79
Team Core         0.15   0.66
CVS Core          0.60   0.79
Debug Core        0.63   0.78
jFace Text        0.75   0.74
Update Core       0.43   0.62
Debug UI          0.56   0.81
JDT Debug UI      0.80   0.81
Help              0.54   0.48
JDT Core          0.70   0.74
OSGI              0.70   0.77
Median            0.66   0.77
correlation at 0.01
+/-0.5 substantial
+/-0.7 strong
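For readers who want to reproduce this kind of correlation analysis, the following is a minimal sketch of the Spearman rank correlation in plain Python: rank both variables (ties get the average rank) and compute the Pearson correlation of the ranks. The function names are assumptions for illustration; the sketch does not handle constant inputs, where the correlation is undefined.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-file values: SCC counts and bug counts.
print(spearman([3, 10, 25, 40], [0, 1, 4, 9]))
```

Because only ranks matter, the measure is insensitive to the heavily skewed distributions that are typical for change and bug counts.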
Predicting bug-prone files
Bug-prone vs. not bug-prone
... calculate and assign a probability to a file if it is bug-prone or not bug-prone.
For each Eclipse project we binned files into bug-prone and not bug-prone using the median of the number of bugs per file:
bugClass = not bug-prone, if #bugs <= median
           bug-prone,     if #bugs > median
When using the median as cut point, the labeling of a file is relative to how many bugs other files in a project have. There exist several ways of binning files beforehand. They mainly vary in that they result in different prior probabilities: for instance, Zimmermann et al. [40] and Bernstein et al. [4] labeled files as bug-prone if they had at least one bug. With heavily skewed distributions, this approach may lead to a high prior probability towards one class. Nagappan et al. [28] used a ...
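The median-based binning described above can be sketched as follows. The function name and the sample bug counts are hypothetical; only the cut-point rule (#bugs > median means bug-prone) comes from the text.

```python
from statistics import median


def label_by_median(bugs_per_file):
    """Label each file relative to the project's median number of bugs."""
    cut = median(bugs_per_file.values())
    return {
        name: "bug-prone" if bugs > cut else "not bug-prone"
        for name, bugs in bugs_per_file.items()
    }


# Hypothetical bug counts for one project.
project = {"A.java": 0, "B.java": 1, "C.java": 5, "D.java": 2}
print(label_by_median(project))
```

With the median as cut point, at most half of the files end up in the bug-prone class, which keeps the prior probabilities of the two classes closer than a "at least one bug" rule would on skewed data.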
H2: SCC can predict bug-prone files
AUC values of E 1 using logistic regression with LM and SCC as predictors for bug-prone and not bug-prone files. Larger values are printed in bold.

Eclipse Project   AUC LM   AUC SCC
Compare           0.84     0.85
jFace             0.90     0.90
JDT Debug         0.83     0.95
Resource          0.87     0.93
Runtime           0.83     0.91
Team Core         0.62     0.87
CVS Core          0.80     0.90
Debug Core        0.86     0.94
jFace Text        0.87     0.87
Update Core       0.78     0.85
Debug UI          0.85     0.93
JDT Debug UI      0.90     0.91
Help              0.75     0.70
JDT Core          0.86     0.87
OSGI              0.88     0.88
Median            0.85     0.90
Overall           0.85     0.89
SCC outperforms LM
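As a reminder of what the AUC values measure, here is a small sketch that computes the area under the ROC curve from its rank interpretation: the probability that a randomly chosen bug-prone file receives a higher score than a randomly chosen not bug-prone file, with ties counted as one half. The function name and the sample scores are assumptions for illustration.

```python
def auc_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for label, s in zip(labels, scores) if label]
    negatives = [s for label, s in zip(labels, scores) if not label]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities for four files (1 = bug-prone).
print(auc_roc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]))  # 0.75
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking, which is why the measure does not depend on a particular probability threshold.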
Predicting the number of bugs
Non linear regression with asymptotic model:
$f(\#Bugs) = a_1 + b_2 \cdot e^{b_3 \cdot SCC}$
[Plot: Eclipse Team Core]
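A short sketch of how the asymptotic model, read here as #Bugs = a1 + b2 * e^(b3 * SCC), behaves and how a fit can be scored with R². The parameter values below are purely hypothetical and are not fitted to any Eclipse data; estimating a1, b2, and b3 would require a nonlinear least-squares routine, which is omitted.

```python
import math


def asymptotic_model(scc, a1, b2, b3):
    """Predicted number of bugs for a given number of SCC."""
    return a1 + b2 * math.exp(b3 * scc)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical parameters: with b3 < 0 the curve rises from a1 + b2 towards a1.
A1, B2, B3 = 60.0, -60.0, -0.01
for scc in (0, 100, 1000):
    print(scc, round(asymptotic_model(scc, A1, B2, B3), 2))
```

With a negative b3 the exponential term vanishes for large SCC, so the predicted number of bugs levels off at a1 instead of growing without bound.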
H3: SCC can predict the number of bugs
Table 8: Results of the nonlinear regression in terms of R² and Spearman correlation using LM and SCC as predictors.

Project        R² LM   R² SCC   Spearman LM   Spearman SCC
Compare        0.84    0.88     0.68          0.76
jFace          0.74    0.79     0.74          0.71
JDT Debug      0.69    0.68     0.62          0.8
Resource       0.81    0.85     0.75          0.86
Runtime        0.69    0.72     0.66          0.79
Team Core      0.26    0.53     0.15          0.66
CVS Core       0.76    0.83     0.62          0.79
Debug Core     0.88    0.92     0.63          0.78
jFace Text     0.83    0.89     0.75          0.74
Update Core    0.41    0.48     0.43          0.62
Debug UI       0.7     0.79     0.56          0.81
JDT Debug UI   0.82    0.82     0.8           0.81
Help           0.66    0.67     0.54          0.84
JDT Core       0.69    0.77     0.7           0.74
OSGI           0.51    0.8      0.74          0.77
Median         0.7     0.79     0.66          0.77
Overall        0.65    0.72     0.62          0.74
SCC outperforms LM
Summary of results
SCC performs significantly better than LM
Advanced learners are not always better
Change types do not yield extra discriminatory power
Predicting the number of bugs is “possible”
More information in our paper
“Comparing Fine-Grained Source Code Changes And Code Churn For Bug
Prediction”, MSR 2011
Method-Level Bug Prediction
with Emanuel Giger, Marco D’Ambros*, Harald Gall
University of Zurich
*University of Lugano
Prediction granularity
11 methods per class on average
4 are bug prone (ca. 36%)
Retrieving bug-prone methods saves manual inspection steps and
improves testing effort allocation
Large files are typically the most bug-prone files
Research questions
RQ1 What is the performance of bug prediction on
method level using change & code metrics?
RQ2 Which set of predictors provides the best
RQ3 How does the performance vary if the number
of buggy methods decreases?
21 Java open source projects
Project        #Classes   #Methods   #M-Histories   #Bugs
JDT Core       1140       17703      43134          4888
Jena2          897        8340       7764           704
Lucene         477        3870       1754           377
Xerces         693        8189       6866           1017
Derby Engine   1394       18693      9507           1663
Ant Core       827        8698       17993          1900
Investigated metrics
Source code metrics (from the last release)
fanIn, fanOut, localVar, parameters, commentToCodeRatio, countPath, McCabe
Complexity, statements, maxNesting
Change metrics
methodHistories, authors, stmtAdded, maxStmtAdded, avgStmtAdded,
stmtDeleted, maxStmtDeleted, avgStmtDeleted, churn, maxChurn, avgChurn,
decl, cond, elseAdded, elseDeleted
Count bug references in commit logs for changed methods
Predicting bug-prone methods
Bug-prone vs. not bug-prone
3.1 Experimental Setup
Prior to model building and classification we labeled each method in our dataset either as bug-prone or not bug-prone as follows:
bugClass = not bug-prone, if #bugs = 0
           bug-prone,     if #bugs >= 1
These two classes represent the binary target classes for training and validating the prediction models. Using 0 (respectively 1) as cut-point is a common approach applied in many studies covering bug prediction models, e.g., [30, ... 7, 4, 27, 37]. Other cut-points are applied in literature, for instance, a statistical lower confidence bound [33] or the median [16]. Those varying cut-points as well as the div...
Models computed with change metrics (CM) perform best
authors and methodHistories are the most important measures
RQ1 & RQ2: Performance of prediction models
Table 4: Median classification results over all projects per classifier and per model

             CM                SCM               CM&SCM
Classifier   AUC   P     R     AUC   P     R     AUC   P     R
RndFor       .95   .84   .88   .72   .5    .64   .95   .85   .95
SVM          .96   .83   .86   .7    .48   .63   .95   .8    .96
BN           .96   .82   .86   .73   .46   .73   .96   .81   .96
J48          .95   .84   .82   .69   .56   .58   .91   .83   .89

... values of the code metrics model are approximately 0.7 for each classifier, which is defined by Lessmann et al. as "promising" [26]. However, the source code metrics suffer from considerably low precision values. The highest median precision ...
Predicting bug-prone methods with diff. cut-points
Bug-prone vs. not bug-prone
p = 75%, 90%, 95% percentiles
... how the classification performance varies (RQ3) as the number of samples in the target class shrinks, and whether we observe similar findings as in Section 3.2 regarding the results of the change and code metrics (RQ2). For that, we applied three additional cut-point values as follows:
bugClass = not bug-prone, if #bugs <= p
           bug-prone,     if #bugs > p
where p represents either the value of the 75%, 90%, or 95% percentile of the distribution of the number of bugs in methods per project. For example, using the 95% percentile as cut-point for prior binning would mean to predict the "top five percent" methods in terms of the number of bugs.
To conduct this study we applied the same experimental setup as in Section 3.1, except for the differently chosen ...
Models computed with change metrics (CM) perform best
Precision decreases (as expected)
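The percentile-based labeling can be sketched as below. The nearest-rank percentile definition is an assumption made for this illustration (the slides do not say which definition was used), and the function names and sample data are hypothetical.

```python
def percentile_nearest_rank(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceiling of p * n / 100
    return ordered[rank - 1]


def label_by_cut_point(bugs_per_method, p=None):
    """Return True for bug-prone methods; p=None uses the cut-point 0 (#bugs >= 1)."""
    if p is None:
        cut = 0
    else:
        cut = percentile_nearest_rank(list(bugs_per_method.values()), p)
    return {name: bugs > cut for name, bugs in bugs_per_method.items()}


# Hypothetical project: 20 methods with 1..20 bugs each.
methods = {f"m{i}": i for i in range(1, 21)}
for p in (None, 75, 90, 95):
    bug_prone = sum(label_by_cut_point(methods, p).values())
    print(p, bug_prone)
```

Raising the percentile shrinks the target class, which is exactly the situation in which precision is expected to drop.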
RQ3: Decreasing the number of bug-prone methods
Table 5: Median classification results for RndFor over all projects per cut-point and per model

            CM                SCM               CM&SCM
Cut-point   AUC   P     R     AUC   P     R     AUC   P     R
GT0         .95   .84   .88   .72   .50   .64   .95   .85   .95
75%         .97   .72   .95   .75   .39   .63   .97   .74   .95
90%         .97   .58   .94   .77   .20   .69   .98   .64   .94
95%         .97   .62   .92   .79   .13   .72   .98   .68   .92

... in the case of the 95% percentile (median precision of .13). Looking at the change metrics and the combined model, the median precision is significantly higher for the ...
Application: file level vs. method level prediction
JDT Core 3.0 - LocalDeclaration
Contains 6 methods / 1 affected by post release bugs
LocalDeclaration.resolve(...) was predicted bug-prone with p=0.97
File-level: p=0.17 to guess the bug-prone method
Need to manually rule out 5 methods to reach 0.82 precision 1 / (6-5)
JDT Core 3.0 - Main
Contains 26 methods / 11 affected by post release bugs
Main.configure(...) was predicted bug-prone with p=1.0
File-level: p=0.42 to guess a bug-prone method
Need to rule out 12 methods to reach 0.82 precision 11 / (26-12)
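The file-level baseline in these two examples is simple arithmetic, sketched below. The function names are assumptions; the second helper follows the slide's expression buggy / (total - ruled out) and assumes that every ruled-out method is indeed not buggy.

```python
def file_level_probability(buggy_methods, total_methods):
    """Chance of picking a bug-prone method when only the file is known to be bug-prone."""
    return buggy_methods / total_methods


def precision_after_ruling_out(buggy_methods, total_methods, ruled_out):
    """Precision once `ruled_out` non-buggy methods have been inspected and excluded."""
    return buggy_methods / (total_methods - ruled_out)


# LocalDeclaration: 6 methods, 1 affected by post-release bugs.
print(round(file_level_probability(1, 6), 2))    # 0.17
# Main: 26 methods, 11 affected by post-release bugs.
print(round(file_level_probability(11, 26), 2))  # 0.42
print(precision_after_ruling_out(1, 6, 5))       # 1.0
```

The comparison shows why method-level prediction saves inspection effort: a file-level warning still leaves the developer to search through all methods of the file.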
What can we learn from that?
Large files are more likely to change and have bugs
Test large files more thoroughly - YES
Bugs are fixed through changes that again lead to bugs
Stop changing our systems - NO, of course not!
Test changing entities more thoroughly - YES
Are we not already doing that?
Do we really need (complex) prediction models for that?
Not sure - might be the reason why these models are not really used yet
But, use at least a metrics tool and keep track of your code quality
Continuous integration environments
What is next?
Ease understanding of changes
Analysis of the effect(s) of changes
What is the effect on the design?
What is the effect on the quality?
Recommender techniques
Provide advice on the effects of changes
Facilitating understanding of changes
Alex fixed Bug 14:
Changed if condition in method
send() in module BandPass.
Research: understanding changes
Subscription to
detailed changes
... the history of a software system to assemble the dataset for our experiments: (1) versioning data including lines modified (LM), (2) bug data, i.e., which files contained bugs and how many of them (Bugs), and (3) fine-grained source code changes (SCC).
Figure 1: Stepwise overview of the data extraction process (1. Versioning Data, 2. Bug Data, 3. Source Code Changes (SCC), 4. Experiment).
1. Versioning Data. We use EVOLIZER [14] to access the versioning repositories, e.g., CVS, SVN, or GIT. They provide log entries that contain information about revisions of files that belong to a system. From the log entries we extract the revision number (to identify the revisions of a file in correct temporal order), the revision timestamp, the name of the developer who checked in the new revision, and the commit message. We then compute LM for a source file as the sum of lines added, lines deleted, and lines changed per file revision.
2. Bug Data. Bug reports are stored in bug repositories such as Bugzilla. Traditional bug tracking and versioning repositories ...
Eclipse Project   Files   Rev.     LM        SCC       Bugs    Time
Update Core       595     8’496    251’434   36’151    532     Oct0...
Debug UI          1’954   18’862   444’061   81’836    3’120   May...
JDT Debug UI      775     8’663    168’598   45’645    2’002   Nov...
Help              598     3’658    66’743    12’170    243     May...
JDT Core          1’705   63’038   2’814K    451’483   6’033   Jun0...
OSGI              748     9’866    335’253   56’238    1’411   Nov...
... single source code statements, e.g., method invocation statements, between two versions of a program by comparing their respective abstract syntax trees (AST). Each change thus represents a tree edit operation that is required to transform one version of the AST into the other. The algorithm is implemented in CHANGEDISTILLER [14] that pairwise compares the ASTs between all direct subsequent revisions of each file. Based on this information, we then count the number of different source code changes (SCC) per file revision.
The preprocessed data from step 1-3 is stored into the Release History Database (RHDB) [10]. From that data, we then compute LM, SCC, and Bugs for each source file by aggregating the values over the given observation period.
In this section, we present the empirical study that we performed to investigate the hypotheses stated in Section ... We discuss the dataset, the statistical methods and machine learning algorithms we used, and report on the results and findings of the experiments.
3.1 Dataset and Data Preparation
We performed our experiments on 15 plugins of the Eclipse platform. Eclipse is a popular open source system that has been studied extensively before [4,27,38,39].
Table 1 gives an overview of the Eclipse dataset used in this study with the number of unique *.java files (Fi...
Change summaries
Change visualization

Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson

Último (20)

Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko CEO/Founder: Sri Ambati Keynote at Wells Fargo Day CEO/Founder: Sri Ambati Keynote at Wells Fargo CEO/Founder: Sri Ambati Keynote at Wells Fargo Day CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?

A tale of experiments on bug prediction

  • 1. A Tale of Experiments on Bug Prediction Martin Pinzger Professor of Software Engineering University of Klagenfurt, Austria Follow me: @pinzger
  • 3. Hmm, wait a minute: Can’t we learn “something” from that data?
  • 4. Goal of software repository mining: Making the information stored in software repositories available to software developers. Quality analysis and defect prediction. Recommender systems. ...
  • 5. Examples from my mining research: Predicting failure-prone source files using changes (MSR 2011). Predicting failure-prone methods (ESEM 2012). The relationship between developer contributions and failures (FSE 2008). There are many more studies: MSR 2013; A survey and taxonomy of approaches for mining software repositories in the context of software evolution, Kagdi et al. 2007.
  • 6. Using Fine-Grained Source Code Changes for Bug Prediction with Emanuel Giger, Harald Gall University of Zurich
  • 7. Bug prediction. Goal: Train models to predict the bug-prone source files of the next release. How: Using product measures, process measures, and organizational measures with machine learning techniques. Many existing studies on building prediction models: Moser et al., Nagappan et al., Zimmermann et al., Hassan et al., etc. Process measures performed particularly well.
  • 8. Classical change measures: Number of file revisions. Code churn, i.e., lines added/deleted/changed. Research question of this study: Can we further improve these models?
  • 9. Revisions are coarse-grained: What changed in a revision?
  • 10. Code churn can be imprecise: Extra changes not relevant for locating bugs.
  • 11. Fine-Grained Source Code Changes (SCC). Revision 1.5: IF "balance > 0" THEN MI "withDraw(amount);". Revision 1.6: IF "balance > 0 && amount <= balance" THEN MI "withDraw(amount);" ELSE MI "notify();". Result: 3 SCC: 1x condition change, 1x else-part insert, 1x invocation statement insert.
  • 12. Research hypotheses. H1: SCC is correlated with the number of bugs in source files. H2: SCC is a predictor for bug-prone source files (and outperforms LM). H3: SCC is a predictor for the number of bugs in source files (and outperforms LM).
  • 13. Data: 15 Eclipse plug-ins. >850’000 fine-grained source code changes (SCC). >10’000 files. >9’700’000 lines modified (LM). >9 years of development history. ... and a lot of bugs referenced in commit messages (e.g., bug #345).
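The slide mentions bugs referenced in commit messages (e.g., bug #345). A minimal sketch of how such references might be pulled out of a commit message follows. The regular expression and the function name `bug_ids` are assumptions for illustration; the linking heuristics actually used in the study are not given in the slides.

```python
import re

# Hypothetical pattern: the word "bug", an optional "#", then the bug number.
BUG_REF = re.compile(r"\bbug\s*#?\s*(\d+)", re.IGNORECASE)


def bug_ids(commit_message):
    """Return the bug numbers referenced in a commit message, in order."""
    return [int(number) for number in BUG_REF.findall(commit_message)]


print(bug_ids("Fixed bug #345 in the parser"))
print(bug_ids("Refactoring only"))
```

A word boundary in front of "bug" keeps words such as "debug" from matching.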
  • 14. Typical steps of such an experiment. Analyze correctness and distribution of the data: use descriptive statistics, histograms, Q-Q plots; this determines the statistical methods that you can use. Perform correlation analysis: Spearman (non-parametric). Machine learners/classifiers: simple ones first (linear regression, binary logistic regression); 10-fold cross validation, precision, recall, AUC ROC. Interpretation and discussion of results (incl. threats to validity).
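The correlation step can be sketched without external libraries. The code below computes a Spearman rank correlation as the Pearson correlation of average ranks; the per-file numbers are made up for illustration and are not taken from the study.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values that starts at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-file values (not from the study).
scc_per_file = [3, 10, 0, 25, 7, 40]
bugs_per_file = [0, 2, 0, 5, 1, 4]
rho = spearman(scc_per_file, bugs_per_file)
print(round(rho, 2))
```

Ranking first is what makes the measure non-parametric: only the order of the values matters, not their distribution.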
  • 15. Approach overview. [Figure: stepwise overview of the data extraction process: 1. Versioning Data (CVS, SVN, GIT; log entries read via Evolizer into the RHDB); 2. Bug Data (bug references in commit messages, e.g., #bug123); 3. Source Code Changes (SCC), extracted by ChangeDistiller through AST comparison of subsequent versions (e.g., 1.1 and 1.2); 4. Experiment (e.g., Support Vector Machine).]
  • 16. H1: SCC is correlated with #bugs. Non-parametric Spearman rank correlation of #bugs with LM and with SCC. * marks significant correlations at 0.01. Larger values are printed bold. Scale shown on the slide: +/-0.5 substantial, +/-0.7 strong.
      Eclipse Project   LM     SCC
      Compare           0.68   0.76
      jFace             0.74   0.71
      JDT Debug         0.62   0.8
      Resource          0.75   0.86
      Runtime           0.66   0.79
      Team Core         0.15   0.66
      CVS Core          0.60   0.79
      Debug Core        0.63   0.78
      jFace Text        0.75   0.74
      Update Core       0.43   0.62
      Debug UI          0.56   0.81
      JDT Debug UI      0.80   0.81
      Help              0.54   0.48
      JDT Core          0.70   0.74
      OSGI              0.70   0.77
      Median            0.66   0.77
  • 17. Predicting bug-prone files: bug-prone vs. not bug-prone. [Excerpt from the paper] ... calculate and assign a probability to a file of being bug-prone or not bug-prone. For each Eclipse project we binned files into bug-prone and not bug-prone using the median of the number of bugs per file (#bugs):
      \[ bugClass = \begin{cases} \text{not bug-prone} & : \#bugs \leq median \\ \text{bug-prone} & : \#bugs > median \end{cases} \]
      When using the median as cut point, the labeling of a file is relative to how many bugs other files in a project have. There exist several ways of binning files beforehand. They mainly vary in that they result in different prior probabilities: for instance, Zimmermann et al. [40] and Bernstein et al. [4] labeled files as bug-prone if they had at least one bug. With heavily skewed distributions this approach may lead to a high prior probability towards one class. Nagappan et al. [28] used a ...
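The median binning can be sketched in a few lines. The file names and bug counts below are hypothetical; only the rule (bug-prone if #bugs is above the project median) comes from the slide.

```python
from statistics import median


def label_by_median(bugs_per_file):
    """Label a file bug-prone if its bug count exceeds the project median."""
    cut = median(bugs_per_file.values())
    return {
        name: ("bug-prone" if count > cut else "not bug-prone")
        for name, count in bugs_per_file.items()
    }


# Hypothetical bug counts for one project (not from the study).
bugs = {"A.java": 0, "B.java": 1, "C.java": 3, "D.java": 7, "E.java": 2}
labels = label_by_median(bugs)
print(labels)
```

Because the cut point is computed per project, the same bug count can fall into different classes in different projects.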
  • 18. H2: SCC can predict bug-prone files. AUC values of E 1 using logistic regression with LM and SCC as predictors for bug-prone and not bug-prone files. Larger values are printed in bold. Result: SCC outperforms LM.
      Eclipse Project   AUC LM   AUC SCC
      Compare           0.84     0.85
      jFace             0.90     0.90
      JDT Debug         0.83     0.95
      Resource          0.87     0.93
      Runtime           0.83     0.91
      Team Core         0.62     0.87
      CVS Core          0.80     0.90
      Debug Core        0.86     0.94
      jFace Text        0.87     0.87
      Update Core       0.78     0.85
      Debug UI          0.85     0.93
      JDT Debug UI      0.90     0.91
      Help              0.75     0.70
      JDT Core          0.86     0.87
      OSGI              0.88     0.88
      Median            0.85     0.90
      Overall           0.85     0.89
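For readers unfamiliar with the AUC values in this table, the following sketch computes the area under the ROC curve directly from its rank interpretation: the probability that a randomly chosen bug-prone file receives a higher score than a randomly chosen not bug-prone file. The scores and labels are invented for illustration.

```python
def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    labels use 1 for bug-prone and 0 for not bug-prone; ties count half.
    """
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities and true classes (not from the study).
probabilities = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
truth = [1, 1, 1, 0, 0, 0]
value = auc(probabilities, truth)
print(round(value, 2))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking, which is why values near 0.9 in the table indicate a strong classifier.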
  • 19. Predicting the number of bugs. Nonlinear regression with asymptotic model: \( f(\#Bugs) = a_1 + b_2 \cdot e^{b_3 \cdot SCC} \). [Figure: #Bugs (0 to 60) plotted against #SCC (0 to 4000) for Eclipse Team Core.]
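To illustrate the shape of the asymptotic model, the sketch below evaluates a1 + b2 * e^(b3 * SCC) for made-up parameter values and computes R2 for a set of predictions. The parameters are assumptions chosen so that the curve starts at 0 and levels off near 60; in the study they would come from a nonlinear least-squares fit, which is not shown here.

```python
import math


def asymptotic_model(scc, a1, b2, b3):
    """Predicted number of bugs: a1 + b2 * e^(b3 * SCC)."""
    return a1 + b2 * math.exp(b3 * scc)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Hypothetical parameters (not fitted, not from the study).
A1, B2, B3 = 60.0, -60.0, -0.001

scc_values = [0, 500, 1000, 2000, 4000]
predicted = [asymptotic_model(x, A1, B2, B3) for x in scc_values]
print([round(p, 1) for p in predicted])
```

With a negative b3 the exponential term vanishes as SCC grows, so the prediction approaches a1 from below.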
  • 20. H3: SCC can predict the number of bugs. Table 8: Results of the nonlinear regression in terms of R2 and Spearman correlation using LM and SCC as predictors. Result: SCC outperforms LM. [Figure: plot of normalized residuals (nrm.Residuals).]
      Project        R2 LM   R2 SCC   Spearman LM   Spearman SCC
      Compare        0.84    0.88     0.68          0.76
      jFace          0.74    0.79     0.74          0.71
      JDT Debug      0.69    0.68     0.62          0.8
      Resource       0.81    0.85     0.75          0.86
      Runtime        0.69    0.72     0.66          0.79
      Team Core      0.26    0.53     0.15          0.66
      CVS Core       0.76    0.83     0.62          0.79
      Debug Core     0.88    0.92     0.63          0.78
      jFace Text     0.83    0.89     0.75          0.74
      Update Core    0.41    0.48     0.43          0.62
      Debug UI       0.7     0.79     0.56          0.81
      JDT Debug UI   0.82    0.82     0.8           0.81
      Help           0.66    0.67     0.54          0.84
      JDT Core       0.69    0.77     0.7           0.74
      OSGI           0.51    0.8      0.74          0.77
      Median         0.7     0.79     0.66          0.77
      Overall        0.65    0.72     0.62          0.74
  • 21. Summary of results: SCC performs significantly better than LM. Advanced learners are not always better. Change types do not yield extra discriminatory power. Predicting the number of bugs is “possible”. More information in our paper “Comparing Fine-Grained Source Code Changes And Code Churn For Bug Prediction”, MSR 2011.
  • 22. Method-Level Bug Prediction with Emanuel Giger, Marco D’Ambros*, Harald Gall University of Zurich *University of Lugano
  • 23. Prediction granularity: Large files are typically the most bug-prone files. A class has 11 methods on average, of which 4 are bug-prone (ca. 36%). Retrieving bug-prone methods saves manual inspection steps and improves testing effort allocation. [Figure: class 1, class 2, class 3, ..., class n.]
  • 24. Research questions. RQ1: What is the performance of bug prediction on method level using change & code metrics? RQ2: Which set of predictors provides the best performance? RQ3: How does the performance vary if the number of buggy methods decreases?
  • 25. 21 Java open source projects
      Project        #Classes   #Methods   #M-Histories   #Bugs
      JDT Core       1140       17703      43134          4888
      Jena2          897        8340       7764           704
      Lucene         477        3870       1754           377
      Xerces         693        8189       6866           1017
      Derby Engine   1394       18693      9507           1663
      Ant Core       827        8698       17993          1900
  • 26. Investigated metrics. Source code metrics (from the last release): fanIn, fanOut, localVar, parameters, commentToCodeRatio, countPath, McCabe Complexity, statements, maxNesting. Change metrics: methodHistories, authors, stmtAdded, maxStmtAdded, avgStmtAdded, stmtDeleted, maxStmtDeleted, avgStmtDeleted, churn, maxChurn, avgChurn, decl, cond, elseAdded, elseDeleted. Bugs: count bug references in commit logs for changed methods.
  • 27. Predicting bug-prone methods: bug-prone vs. not bug-prone. [Excerpt from the paper] 3.1 Experimental Setup. Prior to model building and classification we labeled each method in our dataset either as bug-prone or not bug-prone as follows:
      \[ bugClass = \begin{cases} \text{not bug-prone} & : \#bugs = 0 \\ \text{bug-prone} & : \#bugs \geq 1 \end{cases} \]
      These two classes represent the binary target classes for training and validating the prediction models. Using 0 (respectively 1) as cut-point is a common approach applied in many studies covering bug prediction models, e.g., [30 7, 4, 27, 37]. Other cut-points are applied in literature, for instance, a statistical lower confidence bound [33] or the median [16]. Those varying cut-points as well as the div...
  • 28. RQ1 & RQ2: Performance of prediction models. Models computed with change metrics (CM) perform best. authors and methodHistories are the most important measures.
      Table 4: Median classification results over all projects per classifier and per model
                CM                 SCM                CM&SCM
                AUC   P     R      AUC   P     R      AUC   P     R
      RndFor    .95   .84   .88    .72   .5    .64    .95   .85   .95
      SVM       .96   .83   .86    .7    .48   .63    .95   .8    .96
      BN        .96   .82   .86    .73   .46   .73    .96   .81   .96
      J48       .95   .84   .82    .69   .56   .58    .91   .83   .89
      [Excerpt from the paper] ... values of the code metrics model are approximately 0.7 for each classifier, which is defined by Lessmann et al. as “promising” [26]. However, the source code metrics suffer from considerably low precision values. The highest median precision ...
  • 29. Predicting bug-prone methods with different cut-points: bug-prone vs. not bug-prone, p = 75%, 90%, 95% percentiles. [Excerpt from the paper] ... how the classification performance varies (RQ3) as the number of samples in the target class shrinks, and whether we observe similar findings as in Section 3.2 regarding the results of the change and code metrics (RQ2). For that, we applied three additional cut-point values as follows:
      \[ bugClass = \begin{cases} \text{not bug-prone} & : \#bugs \leq p \\ \text{bug-prone} & : \#bugs > p \end{cases} \]
      where p represents either the value of the 75%, 90%, or 95% percentile of the distribution of the number of bugs in methods per project. For example, using the 95% percentile as cut-point for prior binning would mean to predict the “top five percent” methods in terms of the number of bugs. To conduct this study we applied the same experimental setup as in Section 3.1, except for the differently chosen ...
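A small sketch of percentile-based binning follows. It assumes the nearest-rank definition of a percentile, since the slides do not say which definition was used, and the bug counts are hypothetical.

```python
import math


def percentile_nearest_rank(values, p):
    """p-th percentile (p as an integer 1..100) using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def label_by_percentile(bugs_per_method, p):
    """Label a method bug-prone if its bug count exceeds the p-th percentile."""
    cut = percentile_nearest_rank(list(bugs_per_method.values()), p)
    return {
        name: ("bug-prone" if count > cut else "not bug-prone")
        for name, count in bugs_per_method.items()
    }


# Hypothetical bug counts for 20 methods of one project (not from the study).
counts = [0] * 10 + [1] * 4 + [2, 2, 3, 4, 6, 9]
bugs = {f"m{i}": c for i, c in enumerate(counts)}

for p in (75, 90, 95):
    labels = label_by_percentile(bugs, p)
    prone = sum(1 for v in labels.values() if v == "bug-prone")
    print(p, prone)
```

Raising the percentile shrinks the bug-prone class, which is the setting RQ3 examines.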
  • 30. RQ3: Decreasing the number of bug-prone methods. Models computed with change metrics (CM) perform best. Precision decreases (as expected).
      Table 5: Median classification results for RndFor over all projects per cut-point and per model
                CM                 SCM                CM&SCM
                AUC   P     R      AUC   P     R      AUC   P     R
      GT0       .95   .84   .88    .72   .50   .64    .95   .85   .95
      75%       .97   .72   .95    .75   .39   .63    .97   .74   .95
      90%       .97   .58   .94    .77   .20   .69    .98   .64   .94
      95%       .97   .62   .92    .79   .13   .72    .98   .68   .92
      [Excerpt from the paper] ... precision in the case of the 95% percentile (median precision of .13). Looking at the change metrics and the combined model the median precision is significantly higher for the ...
  • 31. Application: file level vs. method level prediction. JDT Core 3.0, LocalDeclaration: contains 6 methods, 1 affected by post-release bugs. LocalDeclaration.resolve(...) was predicted bug-prone with p=0.97. File level: p=0.17 to guess the bug-prone method. Need to manually rule out 5 methods to reach 0.82 precision: 1 / (6-5). JDT Core 3.0, Main: contains 26 methods, 11 affected by post-release bugs. Main.configure(...) was predicted bug-prone with p=1.0. File level: p=0.42 to guess a bug-prone method. Need to rule out 12 methods to reach 0.82 precision: 11 / (26-12).
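The arithmetic behind the file-level numbers can be reproduced with two small helpers. The function names are hypothetical, and the sketch assumes that every method ruled out by manual inspection is in fact not bug-prone.

```python
def guess_probability(buggy_methods, total_methods):
    """Chance of picking a bug-prone method at random within a flagged file."""
    return buggy_methods / total_methods


def precision_after_ruling_out(buggy_methods, total_methods, ruled_out):
    """Precision once `ruled_out` non-buggy methods have been inspected away."""
    remaining = total_methods - ruled_out
    if remaining < buggy_methods:
        raise ValueError("cannot rule out more methods than the non-buggy ones")
    return buggy_methods / remaining


# LocalDeclaration: 6 methods, 1 affected by post-release bugs.
print(round(guess_probability(1, 6), 2))
print(precision_after_ruling_out(1, 6, 5))

# Main: 26 methods, 11 affected by post-release bugs.
print(round(guess_probability(11, 26), 2))
print(round(precision_after_ruling_out(11, 26, 12), 2))
```

For Main, the expression 11 / (26-12) from the slide evaluates to roughly 0.79.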
  • 32. What can we learn from that? Large files are more likely to change and have bugs: test large files more thoroughly? YES. Bugs are fixed through changes that again lead to bugs: stop changing our systems? NO, of course not! Test changing entities more thoroughly? YES. Are we not already doing that? Do we really need (complex) prediction models for that? Not sure; this might be the reason why these models are not really used yet. But use at least a metrics tool and keep track of your code quality (continuous integration environments).
  • 33. What is next? Ease understanding of changes. Analysis of the effect(s) of changes: What is the effect on the design? What is the effect on the quality? Recommender techniques: provide advice on the effects of changes.
  • 34. Facilitating understanding of changes. [Figure: FineMem, changes, developers Alex and Peter.] Example change summary: “Alex fixed Bug 14: Changed if condition in method send() in module BandPass.”
  • 36. Conclusions. Questions? Martin Pinzger
      [Figure: #Bugs plotted against #SCC for Eclipse Team Core.]
      [Excerpt from the paper] ... the history of a software system to assemble the dataset for our experiments: (1) versioning data including lines modified (LM), (2) bug data, i.e., which files contained bugs and how many of them (Bugs), and (3) fine-grained source code changes (SCC).
      Figure 1: Stepwise overview of the data extraction process.
      1. Versioning Data. We use EVOLIZER [14] to access the versioning repositories, e.g., CVS, SVN, or GIT. They provide log entries that contain information about revisions of files that belong to a system. From the log entries we extract the revision number (to identify the revisions of a file in correct temporal order), the revision timestamp, the name of the developer who checked in the new revision, and the commit message. We then compute LM for a source file as the sum of lines added, lines deleted, and lines changed per file revision.
      2. Bug Data. Bug reports are stored in bug repositories such as Bugzilla. Traditional bug tracking and versioning repos...
      [Table fragment]
      Update Core    595     8’496    251’434   36’151    532     Oct0
      Debug UI       1’954   18’862   444’061   81’836    3’120   May
      JDT Debug UI   775     8’663    168’598   45’645    2’002   Nov
      Help           598     3’658    66’743    12’170    243     May
      JDT Core       1’705   63’038   2’814K    451’483   6’033   Jun0
      OSGI           748     9’866    335’253   56’238    1’411   Nov
      ... single source code statements, e.g., method invocation statements, between two versions of a program by comparing their respective abstract syntax trees (AST). Each change represents a tree edit operation that is required to transform one version of the AST into the other. The algorithm is implemented in CHANGEDISTILLER [14] that pairwise compares the ASTs between all direct subsequent revisions of e... Based on this information, we then count the number of different source code changes (SCC) per file revision. The preprocessed data from step 1-3 is stored into the Release History Database (RHDB) [10]. From that data, we then compute LM, SCC, and Bugs for each source file by aggregating the values over the given observation period.
      3. EMPIRICAL STUDY. In this section, we present the empirical study that we performed to investigate the hypotheses stated in Section ... We discuss the dataset, the statistical methods and machine learning algorithms we used, and report on the results and findings of the experiments.
      3.1 Dataset and Data Preparation. We performed our experiments on 15 plugins of the Eclipse platform. Eclipse is a popular open source system that has been studied extensively before [4,27,38,39]. Table 1 gives an overview of the Eclipse dataset used in this study with the number of unique *.java files (Fi...
      [Figure: FineMem: change extractor, changes, detailed changes, change summaries, change visualization, subscription to detailed changes.]
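The per-file aggregation described in the excerpt (LM as the sum of lines added, deleted, and changed per revision, then LM, SCC, and Bugs summed over the observation period) can be sketched as follows. The `Revision` record and its field names are assumptions for illustration; they do not reflect the actual RHDB schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Revision:
    """One file revision; a hypothetical record, not the RHDB schema."""
    file: str
    lines_added: int
    lines_deleted: int
    lines_changed: int
    scc: int        # number of fine-grained source code changes
    bug_refs: int   # number of bug references in the commit message


def aggregate(revisions):
    """Sum LM, SCC, and Bugs per file over all given revisions."""
    totals = defaultdict(lambda: {"LM": 0, "SCC": 0, "Bugs": 0})
    for rev in revisions:
        entry = totals[rev.file]
        entry["LM"] += rev.lines_added + rev.lines_deleted + rev.lines_changed
        entry["SCC"] += rev.scc
        entry["Bugs"] += rev.bug_refs
    return dict(totals)


# Hypothetical revision history (not from the study).
history = [
    Revision("Account.java", 10, 2, 3, scc=4, bug_refs=1),
    Revision("Account.java", 1, 0, 5, scc=3, bug_refs=0),
    Revision("Bank.java", 20, 5, 0, scc=9, bug_refs=2),
]
per_file = aggregate(history)
print(per_file)
```

The resulting per-file totals are the kind of input the correlation and classification experiments operate on.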