This document summarizes a comparative study of text models for information retrieval (IR)-based bug localization. The study evaluates generic models, namely the vector space model (VSM), latent semantic analysis (LSA), latent Dirichlet allocation (LDA), the unigram model, and the cluster-based document model (CBDM), on a dataset of 291 bugs from the iBUGS AspectJ repository. It finds that simpler models such as the unigram model and VSM perform comparably to or better than the more complex topic-based models LDA and LSA. The study also compares the performance of IR-based localization against static and dynamic bug localization tools.
Retrieval from Software Libraries for Bug Localization: A Comparative Study of Generic and Composite Text Models
Shivani Rao and Avinash Kak
School of ECE, Purdue University
Mining Software Repositories (MSR), Hawaii, May 21, 2011
Outline
1. Bug localization
2. IR (Information Retrieval)-based bug localization
3. Text Models
4. Preprocessing of the source files
5. Evaluation Metrics
6. Results
7. Conclusion
Bug localization
Bug localization means locating the files, methods, classes, etc., that are directly related to the problem causing abnormal execution behavior of the software.
IR-based bug localization means locating a bug from its textual description.
Background
Figure: A typical bug localization process
Figure: A typical bug report (JEdit)
Past work on IR-based bug localization
Authors/Paper       Model            Software dataset
Marcus et al. [1]   VSM              JEdit
Cleary et al. [2]   LM, LSA and CA   Eclipse JDT
Lukins et al. [3]   LDA              Mozilla, Eclipse, Rhino and JEdit
Drawbacks
1. None of the reported work has been evaluated on a standard dataset.
2. Inability to compare with static and dynamic techniques.
3. The number of bugs evaluated is only on the order of 5-30.
iBUGS
Created by Dallmeier and Zimmermann [4], iBUGS contains a large number of real bugs with corresponding test suites for generating failing and passing test runs.
ASPECTJ software
Software library size (number of files)   6546
Lines of code                             75 KLOC
Vocabulary size                           7553
Number of bugs                            291
Table: The iBUGS dataset after preprocessing
Figure: A typical bug report in the iBUGS repository
Text Models
VSM: Vector Space Model
LSA: Latent Semantic Analysis Model
UM: Unigram Model
LDA: Latent Dirichlet Allocation Model
CBDM: Cluster-Based Document Model
Vector Space Model
If V is the vocabulary, then queries and documents are |V|-dimensional vectors.
$\mathrm{sim}(q, d_m) = \dfrac{w_q \cdot w_m}{|w_q|\,|w_m|}$
The representation is sparse yet high-dimensional.
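A minimal sketch of this scoring in Python (raw term-frequency vectors; all names are illustrative, not from the paper):

```python
import numpy as np

def cosine_sim(w_q: np.ndarray, w_m: np.ndarray) -> float:
    """Cosine similarity between |V|-dimensional query and document vectors."""
    denom = np.linalg.norm(w_q) * np.linalg.norm(w_m)
    return float(w_q @ w_m) / denom if denom > 0 else 0.0

def vsm_rank(A: np.ndarray, q: np.ndarray) -> list:
    """Rank documents (rows of the documents x |V| term-frequency matrix A)
    by decreasing cosine similarity to the query vector q."""
    scores = [cosine_sim(q, A[m]) for m in range(A.shape[0])]
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
```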
Latent semantic analysis: Eigen decomposition
$A = U \Sigma V^T$
LSA-based models
Topic-based representation: $w_K(m)$, a K-dimensional vector in the eigenspace that represents the m-th document $w_m$:
$w_K(m) = \Sigma_K^{-1} U_K^T w_m, \qquad q_K = \Sigma_K^{-1} U_K^T q$
$\mathrm{sim}(q, d_m) = \dfrac{q_K \cdot w_K(m)}{|q_K|\,|w_K(m)|}$
LSA2: fold the K-dimensional representation back to a smoothed |V|-dimensional representation $\tilde{w} = U_K \Sigma_K w_K$ and compare directly with the query q.
Combined representation: combines LSA2 with the VSM representation using the mixture parameter $\lambda$:
$A_{\mathrm{combined}} = \lambda A + (1 - \lambda) \tilde{A}$
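A sketch of these projections using numpy's SVD; the orientation of A (terms as rows of a |V| x |D| matrix) and the function names are assumptions:

```python
import numpy as np

def lsa_fit(A: np.ndarray, K: int):
    """Truncated SVD of the |V| x |D| term-document matrix: A ~ U_K Sigma_K V_K^T."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, :K], s[:K]                      # U_K and the diagonal of Sigma_K

def lsa_project(U_K: np.ndarray, s_K: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Topic representation w_K = Sigma_K^{-1} U_K^T w (documents and queries alike)."""
    return (U_K.T @ w) / s_K

def lsa2_foldback(U_K: np.ndarray, s_K: np.ndarray, w_K: np.ndarray) -> np.ndarray:
    """LSA2 smoothing: fold back to term space, w~ = U_K Sigma_K w_K."""
    return U_K @ (s_K * w_K)
```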
Unigram model: representing documents using probability distributions [5]
The term frequencies in a document, normalized, give its probability distribution.
The term frequencies in a query give the query's probability distribution.
Similarities are established by comparing the probability distributions using KL divergence.
For smoothing, we mix in the probability distribution over the entire source library.
$p_{\mathrm{uni}}(w|D_m) = \mu \dfrac{c(w, d_m)}{|d_m|} + (1 - \mu) \dfrac{\sum_{m=1}^{|D|} c(w, d_m)}{\sum_{m=1}^{|D|} |d_m|}$
$p_{\mathrm{uni}}(w|q) = \mu \dfrac{c(w, q)}{|q|} + (1 - \mu) \dfrac{\sum_{m=1}^{|D|} c(w, d_m)}{\sum_{m=1}^{|D|} |d_m|}$
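A sketch of the smoothed unigram model and KL-divergence scoring (documents are ranked by increasing divergence from the query; names are illustrative):

```python
import numpy as np

def collection_model(counts: np.ndarray) -> np.ndarray:
    """p_c(w): term counts pooled over the whole source library, normalized.
    counts is a documents x |V| term-count matrix."""
    total = counts.sum(axis=0)
    return total / total.sum()

def smoothed_unigram(c_d: np.ndarray, p_c: np.ndarray, mu: float) -> np.ndarray:
    """p_uni(w|D) = mu * c(w,d)/|d| + (1 - mu) * p_c(w)."""
    return mu * c_d / c_d.sum() + (1.0 - mu) * p_c

def kl_divergence(p_q: np.ndarray, p_d: np.ndarray) -> float:
    """KL(query || document); lower means more similar. p_d must be smoothed (> 0)."""
    mask = p_q > 0
    return float(np.sum(p_q[mask] * np.log(p_q[mask] / p_d[mask])))
```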
LDA: A mixture model to represent documents using topics/concepts [6]
LDA-based models [7]
Topic-based representation: $\theta_m$, a K-dimensional probability vector that gives the topic proportions in the m-th document.
Maximum-likelihood representation: folds back to the |V|-dimensional term space:
$p_{\mathrm{lda}}(w|D_m) = \sum_{t=1}^{K} p(w|z=t)\, p(z=t|D_m) = \sum_{t=1}^{K} \phi(t, w)\, \theta_m(t)$
Combined representation: combines the unigram representation of a document with its MLE-LDA representation:
$p_{\mathrm{combined}}(w|D_m) = \lambda\, p_{\mathrm{lda}}(w|D_m) + (1 - \lambda)\, p_{\mathrm{uni}}(w|D_m)$
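Given fitted LDA parameters φ and θ (from any estimator), the fold-back and the composite model are a few lines; a sketch with assumed array shapes:

```python
import numpy as np

def mle_lda(phi: np.ndarray, theta_m: np.ndarray) -> np.ndarray:
    """MLE-LDA fold-back: p_lda(w|D_m) = sum_t phi(t, w) theta_m(t).
    phi is the K x |V| topic-word matrix, theta_m the K-vector of topic proportions."""
    return theta_m @ phi

def lda_unigram_combined(phi: np.ndarray, theta_m: np.ndarray,
                         p_uni_m: np.ndarray, lam: float) -> np.ndarray:
    """Composite model: lam * p_lda(w|D_m) + (1 - lam) * p_uni(w|D_m)."""
    return lam * mle_lda(phi, theta_m) + (1.0 - lam) * p_uni_m
```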
Cluster-Based Document Model (CBDM) [8]
Cluster the documents into K clusters using deterministic algorithms such as K-means, hierarchical agglomerative clustering, and so on.
Represent each cluster by a multinomial distribution over the terms in the vocabulary, commonly denoted $p_{ML}(w|\mathrm{Cluster}_j)$. The probability distribution for the words in a document $d_m \in \mathrm{Cluster}_j$ is then:
$p_{\mathrm{cbdm}}(w|w_m) = \lambda_1 \dfrac{w_m(w)}{\sum_{n=1}^{|V|} w_m(n)} + \lambda_2\, p_c(w) + \lambda_3\, p_{ML}(w|\mathrm{Cluster}_j) \qquad (1)$
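A sketch of the Eq. (1) mixture, assuming the clustering and the per-cluster ML distributions have already been computed:

```python
import numpy as np

def cbdm(w_m: np.ndarray, p_c: np.ndarray, p_ml_cluster: np.ndarray,
         lam1: float, lam2: float, lam3: float) -> np.ndarray:
    """Eq. (1): mix the document's ML unigram model, the collection model p_c,
    and the ML model of the document's cluster. Assumes lam1 + lam2 + lam3 = 1
    and that w_m holds raw term counts."""
    return lam1 * (w_m / w_m.sum()) + lam2 * p_c + lam3 * p_ml_cluster
```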
Summary of Text Models used in the comparative study
Model     Representation                                  Similarity Metric
VSM       frequency vector                                Cosine similarity
LSA       K-dimensional vector in the eigenspace          Cosine similarity
Unigram   |V|-dimensional probability vector (smoothed)   KL divergence
LDA       K-dimensional probability vector                KL divergence
CBDM      |V|-dimensional combined probability vector     KL divergence or likelihood
Table: Generic models used in the comparative evaluation
Model     Representation                                 Similarity Metric
LSA2      |V|-dimensional representation in term-space   Cosine similarity
MLE-LDA   |V|-dimensional MLE-LDA probability vector     KL divergence or likelihood
Table: The variations on two of the generic models used in the comparative evaluation
Model           Representation                                        Similarity Metric
Unigram + LDA   |V|-dimensional combined probability vector           KL divergence or likelihood
VSM + LSA       |V|-dimensional combined VSM and LSA representation   Cosine similarity
Table: The two composite models used
Preprocessing of the source files
If a patch file does not exist in /trunk, it is searched for and added to the source library from the other branches/tags of ASPECTJ.
The source library consists of ".java" files only. After this step, our library ended up with 6546 Java files.
The repository.xml file documents all the information related to a bug, including the BugID, the bug description, the relevant source files, and so on. We shall refer to this ground-truth information as relevance judgments.
Bugs that are documented in iBUGS but have no relevant source files in the library resulting from the previous step are eliminated. After this step, we are left with 291 bugs.
Hard-words, camel-case words, and soft-words are handled using popular identifier-splitting methods [9, 10]; a sketch follows this list.
The stop-list consists of the most commonly occurring words, for example "for," "else," "while," "int," "double," "long," "public," "void," etc. There are 375 such words in the iBUGS ASPECTJ software. We also drop all unicode strings from the vocabulary.
The vocabulary is pruned further by calculating the relative importance of terms and eliminating ubiquitous and rarely-occurring terms.
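A rough sketch of the hard-word and camel-case cases (the soft-word inference of [9, 10] needs more than a few lines and is not attempted here):

```python
import re

def split_identifier(identifier: str) -> list:
    """Split hard-words (underscores) and camel-case words,
    e.g. 'getFileName_v2' -> ['get', 'file', 'name', 'v2']."""
    parts = identifier.replace('_', ' ')
    # Break lower-to-upper transitions and acronym boundaries ('XMLParser' -> 'XML Parser').
    parts = re.sub(r'(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])', ' ', parts)
    return [p.lower() for p in parts.split()]
```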
Evaluation Metrics
Mean Average Precision (MAP)
Calculated using the following two sets:
retrieved(Nr): the top Nr documents from the ranked list of documents retrieved for the query.
relevant: extracted from the relevance judgments available in repository.xml.
Precision and Recall:
$\mathrm{Precision}(P@N_r) = \dfrac{|\{\mathrm{relevant}\} \cap \{\mathrm{retrieved}\}|}{|\{\mathrm{retrieved}\}|}$
$\mathrm{Recall}(R@N_r) = \dfrac{|\{\mathrm{relevant}\} \cap \{\mathrm{retrieved}\}|}{|\{\mathrm{relevant}\}|}$
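These two definitions in code (a sketch; the file-id representation is an assumption):

```python
def precision_recall_at(ranked: list, relevant: set, n_r: int):
    """P@Nr and R@Nr from a ranked list of file ids and the relevance judgments."""
    retrieved = set(ranked[:n_r])
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)
```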
1. If we were to plot a typical P-R curve from the values of P@Nr and R@Nr, we would get a monotonically decreasing curve with high precision at low recall and vice versa.
2. The area under the P-R curve is called the Average Precision.
3. Taking the mean of the Average Precision over all queries gives the Mean Average Precision (MAP).
4. The physical significance of MAP is the same as that of precision; a sketch of the computation follows.
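A sketch using the usual rank-based computation of average precision, which approximates the area under the P-R curve:

```python
def average_precision(ranked: list, relevant: set) -> float:
    """AP for one query: mean of P@k over the ranks k at which a relevant file appears."""
    hits, total = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs: list) -> float:
    """MAP: runs is a list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```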
Rank of Retrieved Files [3]
The number of queries/bugs for which relevant source files were retrieved with ranks $r_{low} \le R \le r_{high}$ is reported.
For the retrieval performance reported in [3], the rank buckets used are R = 1, 2 ≤ R ≤ 5, 6 ≤ R ≤ 10, and R > 10; a counting sketch follows.
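A sketch for tallying these buckets, assuming each query's best (lowest) rank of a relevant file has already been computed:

```python
def rank_histogram(best_ranks: list) -> dict:
    """Count queries by the rank bucket of their best-ranked relevant file."""
    buckets = {'R = 1': 0, '2 <= R <= 5': 0, '6 <= R <= 10': 0, 'R > 10': 0}
    for r in best_ranks:
        if r == 1:
            buckets['R = 1'] += 1
        elif r <= 5:
            buckets['2 <= R <= 5'] += 1
        elif r <= 10:
            buckets['6 <= R <= 10'] += 1
        else:
            buckets['R > 10'] += 1
    return buckets
```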
SCORE [11]
1. Indicates the proportion of the program that needs to be examined in order to locate (localize) a fault.
2. For each range of this proportion (e.g., 10-20%), the number of test runs (bugs) is reported.
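A sketch of how such a histogram could be tallied, assuming the examined proportion per bug is already known (the exact bucketing in [11] may differ):

```python
def score_histogram(examined_fractions: list, bin_width: float = 0.10) -> list:
    """Bucket, per bug, the fraction of the program examined before the fault
    is reached into 10% ranges (0-10%, 10-20%, ...), as in the SCORE plots."""
    bins = [0] * 10
    for f in examined_fractions:
        bins[min(int(f / bin_width), 9)] += 1
    return bins
```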
Results
Models using LDA
Figure: MAP using the three LDA models for different values of K. The experimental parameters for the LDA+Unigram model are λ = 0.9, µ = 0.5, β = 0.01, and α = 50/K.
The combined LDA+Unigram model
Figure: MAP plotted for different values of mixture proportions (λ and
µ) of the LDA+Unigram combined model.
Models using LSA
Figure: MAP using the LSA model and its variations and combinations for different values of K. The experimental parameter for the LSA+VSM combined model is λ = 0.5.
CBDM
λ1      λ2      λ3     K=100      K=250    K=500     K=1000
0.25    0.25    0.5    0.093144   0.0914   0.08666   0.07664
0.15    0.35    0.5    0.0883     0.0897   0.0963    0.0932
0.81    0.09    0.1    0.143      0.102    0.108     0.09952
0.27    0.63    0.1    0.1306     0.117    0.111     0.0998
0.495   0.495   0.01   0.141      0.141    0.141     0.141
0.05    0.05    0.99   0.069      0.075    0.072     0.065
Table: Retrieval performance (MAP) with the CBDM. λ1 + λ2 + λ3 = 1, where λ1 weights the unigram model, λ2 the collection model, and λ3 the cluster model.
Rank-based metric
Figure: The height of the bars shows the number of queries (bugs) for
which at least one relevant source file was retrieved at rank 1.
Figure: SCORE for the IR-based bug localization tools
SCORE: comparison with AMPLE and FINDBUGS
With FINDBUGS, none of the bugs were localized correctly.
Figure: SCORE values calculated over 44 bugs in iBUGS ASPECTJ using AMPLE [12]
Conclusion
IR-based bug localization techniques are as effective as, or more effective than, static and dynamic bug localization tools.
Sophisticated models like LDA, LSA, or CBDM do not outperform simpler models like the unigram model or VSM for IR-based bug localization on large software systems.
Analyzing the spread of the word distributions over the source files, with the help of measures such as tf and idf, can give useful insights into the usability of topic- and cluster-based models for localization.
End of Presentation
Thank you. Questions?
Threats to validity
We have tested on only a single dataset, iBUGS. How does this generalize?
We eliminated XML files from those that are indexed and queried. This may not be a valid assumption.
References
[1] A. Marcus, A. Sergeyev, V. Rajlich, and J. I. Maletic, "An Information Retrieval Approach to Concept Location in Source Code," in Proceedings of the 11th Working Conference on Reverse Engineering (WCRE 2004), pp. 214-223, IEEE Computer Society, 2004.
[2] B. Cleary, C. Exton, J. Buckley, and M. English, "An Empirical Analysis of Information Retrieval based Concept Location Techniques in Software Comprehension," Empirical Software Engineering, vol. 14, no. 1, pp. 93-130, 2009.
[3] S. K. Lukins, N. A. Kraft, and L. H. Etzkorn, "Source Code Retrieval for Bug Localization using Latent Dirichlet Allocation," in 15th Working Conference on Reverse Engineering (WCRE 2008), 2008.
[4] V. Dallmeier and T. Zimmermann, "Extraction of Bug Localization Benchmarks from History," in ASE '07: Proceedings of the Twenty-Second IEEE/ACM International Conference on Automated Software Engineering, (New York, NY, USA), pp. 433-436, ACM, 2007.
[5] C. Zhai and J. Lafferty, "A Study of Smoothing Methods for Language Models Applied to Information Retrieval," ACM Transactions on Information Systems, pp. 179-214, 2004.
[6] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet Allocation," Journal of Machine Learning Research, pp. 993-1022, 2003.
[7] X. Wei and W. B. Croft, "LDA-Based Document Models for Ad-hoc Retrieval," in Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2006.
[8] X. Liu and W. B. Croft, "Cluster-Based Retrieval Using Language Models," in ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2004.
[9] H. Feild, D. Binkley, and D. Lawrie, "An Empirical Comparison of Techniques for Extracting Concept Abbreviations from Identifiers," in Proceedings of the IASTED International Conference on Software Engineering and Applications, 2006.
[10] E. Enslen, E. Hill, L. Pollock, and K. Vijay-Shanker, "Mining Source Code to Automatically Split Identifiers for Software Analysis," in Proceedings of the 2009 6th IEEE International Working Conference on Mining Software Repositories (MSR '09), (Washington, DC, USA), pp. 71-80, IEEE Computer Society, 2009.
[11] J. A. Jones and M. J. Harrold, "Empirical Evaluation of the Tarantula Automatic Fault-Localization Technique," in Automated Software Engineering, 2005.
[12] V. Dallmeier and T. Zimmermann, "Automatic Extraction of Bug Localization Benchmarks from History," tech. rep., Universität des Saarlandes, Saarbrücken, Germany, June 2007.