This document summarizes a comparative study of text models for information retrieval (IR)-based bug localization. The study evaluates generic models, namely the vector space model (VSM), latent semantic analysis (LSA), latent Dirichlet allocation (LDA), the unigram model, and the cluster-based document model (CBDM), on a dataset of 291 bugs from the iBUGS AspectJ repository. It finds that simpler models such as the unigram model and VSM perform comparably to or better than the more complex topic-based models LDA and LSA. The study also compares the performance of IR-based localization against static and dynamic bug localization tools.
Retrieval from Software Libraries for Bug Localization: A Comparative Study of Generic and Composite Text Models
Shivani Rao and Avinash Kak
School of ECE, Purdue University
Mining Software Repositories (MSR), Hawaii, May 21, 2011
Outline
1. Bug localization
2. IR (Information Retrieval)-based bug localization
3. Text Models
4. Preprocessing of the source files
5. Evaluation Metrics
6. Results
7. Conclusion
Bug localization
Bug localization means locating the files, methods, classes, etc., that are directly related to the problem causing abnormal execution behavior of the software.
IR-based bug localization means locating a bug from its textual description.
Background
Figure: A typical bug localization process
Figure: A typical bug report (JEdit)
Past work on IR-based bug localization
Authors/Paper       Model            Software dataset
Marcus et al. [1]   VSM              JEdit
Cleary et al. [2]   LM, LSA and CA   Eclipse JDT
Lukins et al. [3]   LDA              Mozilla, Eclipse, Rhino and JEdit
Drawbacks
1. None of the reported work has been evaluated on a standard dataset.
2. Inability to compare with static and dynamic techniques.
3. The number of bugs evaluated is only on the order of 5-30.
iBUGS
Created by Dallmeier and Zimmermann [4], iBUGS contains a large number of real bugs with corresponding test suites for generating failing and passing test runs.
ASPECTJ software
Software library size (number of files)   6546
Lines of code                             75 KLOC
Vocabulary size                           7553
Number of bugs                            291
Table: The iBUGS dataset after preprocessing
Figure: A typical bug report in the iBUGS repository
Text Models
VSM: Vector Space Model
LSA: Latent Semantic Analysis Model
UM: Unigram Model
LDA: Latent Dirichlet Allocation Model
CBDM: Cluster-Based Document Model
Vector Space Model
If V is the vocabulary, then queries and documents are |V|-dimensional vectors.
$\mathrm{sim}(q, d_m) = \dfrac{w_q \cdot w_m}{|w_q|\,|w_m|}$
The representation is sparse yet high-dimensional.
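A minimal sketch of this scoring in Python (raw term-frequency vectors; all names are illustrative, not from the paper):

```python
import numpy as np

def cosine_sim(w_q: np.ndarray, w_m: np.ndarray) -> float:
    """Cosine similarity between |V|-dimensional query and document vectors."""
    denom = np.linalg.norm(w_q) * np.linalg.norm(w_m)
    return float(w_q @ w_m) / denom if denom > 0 else 0.0

def vsm_rank(A: np.ndarray, q: np.ndarray) -> list:
    """Rank documents (rows of the documents x |V| term-frequency matrix A)
    by decreasing cosine similarity to the query vector q."""
    scores = [cosine_sim(q, A[m]) for m in range(A.shape[0])]
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
```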
Latent semantic analysis: Eigen decomposition
$A = U \Sigma V^T$
LSA-based models
Topic-based representation: $w_K(m)$, a K-dimensional vector in the eigenspace that represents the m-th document $w_m$:
$w_K(m) = \Sigma_K^{-1} U_K^T w_m, \qquad q_K = \Sigma_K^{-1} U_K^T q$
$\mathrm{sim}(q, d_m) = \dfrac{q_K \cdot w_K(m)}{|q_K|\,|w_K(m)|}$
LSA2: fold the K-dimensional representation back to a smoothed |V|-dimensional representation $\tilde{w} = U_K \Sigma_K w_K$ and compare directly with the query q.
Combined representation: combines LSA2 with the VSM representation using the mixture parameter $\lambda$:
$A_{\mathrm{combined}} = \lambda A + (1 - \lambda) \tilde{A}$
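A sketch of these projections using numpy's SVD; the orientation of A (terms as rows of a |V| x |D| matrix) and the function names are assumptions:

```python
import numpy as np

def lsa_fit(A: np.ndarray, K: int):
    """Truncated SVD of the |V| x |D| term-document matrix: A ~ U_K Sigma_K V_K^T."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, :K], s[:K]                      # U_K and the diagonal of Sigma_K

def lsa_project(U_K: np.ndarray, s_K: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Topic representation w_K = Sigma_K^{-1} U_K^T w (documents and queries alike)."""
    return (U_K.T @ w) / s_K

def lsa2_foldback(U_K: np.ndarray, s_K: np.ndarray, w_K: np.ndarray) -> np.ndarray:
    """LSA2 smoothing: fold back to term space, w~ = U_K Sigma_K w_K."""
    return U_K @ (s_K * w_K)
```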
Unigram model: representing documents using probability distributions [5]
The term frequencies in a document, normalized, give its probability distribution.
The term frequencies in a query give the query's probability distribution.
Similarities are established by comparing the probability distributions using KL divergence.
For smoothing, we mix in the probability distribution over the entire source library.
$p_{\mathrm{uni}}(w|D_m) = \mu \dfrac{c(w, d_m)}{|d_m|} + (1 - \mu) \dfrac{\sum_{m=1}^{|D|} c(w, d_m)}{\sum_{m=1}^{|D|} |d_m|}$
$p_{\mathrm{uni}}(w|q) = \mu \dfrac{c(w, q)}{|q|} + (1 - \mu) \dfrac{\sum_{m=1}^{|D|} c(w, d_m)}{\sum_{m=1}^{|D|} |d_m|}$
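A sketch of the smoothed unigram model and KL-divergence scoring (documents are ranked by increasing divergence from the query; names are illustrative):

```python
import numpy as np

def collection_model(counts: np.ndarray) -> np.ndarray:
    """p_c(w): term counts pooled over the whole source library, normalized.
    counts is a documents x |V| term-count matrix."""
    total = counts.sum(axis=0)
    return total / total.sum()

def smoothed_unigram(c_d: np.ndarray, p_c: np.ndarray, mu: float) -> np.ndarray:
    """p_uni(w|D) = mu * c(w,d)/|d| + (1 - mu) * p_c(w)."""
    return mu * c_d / c_d.sum() + (1.0 - mu) * p_c

def kl_divergence(p_q: np.ndarray, p_d: np.ndarray) -> float:
    """KL(query || document); lower means more similar. p_d must be smoothed (> 0)."""
    mask = p_q > 0
    return float(np.sum(p_q[mask] * np.log(p_q[mask] / p_d[mask])))
```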
LDA: A mixture model to represent documents using topics/concepts [6]
LDA-based models [7]
Topic-based representation: $\theta_m$, a K-dimensional probability vector that gives the topic proportions in the m-th document.
Maximum-likelihood representation: folds back to the |V|-dimensional term space:
$p_{\mathrm{lda}}(w|D_m) = \sum_{t=1}^{K} p(w|z=t)\, p(z=t|D_m) = \sum_{t=1}^{K} \phi(t, w)\, \theta_m(t)$
Combined representation: combines the unigram representation of a document with its MLE-LDA representation:
$p_{\mathrm{combined}}(w|D_m) = \lambda\, p_{\mathrm{lda}}(w|D_m) + (1 - \lambda)\, p_{\mathrm{uni}}(w|D_m)$
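Given fitted LDA parameters φ and θ (from any estimator), the fold-back and the composite model are a few lines; a sketch with assumed array shapes:

```python
import numpy as np

def mle_lda(phi: np.ndarray, theta_m: np.ndarray) -> np.ndarray:
    """MLE-LDA fold-back: p_lda(w|D_m) = sum_t phi(t, w) theta_m(t).
    phi is the K x |V| topic-word matrix, theta_m the K-vector of topic proportions."""
    return theta_m @ phi

def lda_unigram_combined(phi: np.ndarray, theta_m: np.ndarray,
                         p_uni_m: np.ndarray, lam: float) -> np.ndarray:
    """Composite model: lam * p_lda(w|D_m) + (1 - lam) * p_uni(w|D_m)."""
    return lam * mle_lda(phi, theta_m) + (1.0 - lam) * p_uni_m
```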
Cluster-Based Document Model (CBDM) [8]
Cluster the documents into K clusters using deterministic algorithms such as K-means, hierarchical agglomerative clustering, and so on.
Represent each cluster by a multinomial distribution over the terms in the vocabulary, commonly denoted $p_{ML}(w|\mathrm{Cluster}_j)$. The probability distribution for the words in a document $d_m \in \mathrm{Cluster}_j$ is then:
$p_{\mathrm{cbdm}}(w|w_m) = \lambda_1 \dfrac{w_m(w)}{\sum_{n=1}^{|V|} w_m(n)} + \lambda_2\, p_c(w) + \lambda_3\, p_{ML}(w|\mathrm{Cluster}_j) \qquad (1)$
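A sketch of the Eq. (1) mixture, assuming the clustering and the per-cluster ML distributions have already been computed:

```python
import numpy as np

def cbdm(w_m: np.ndarray, p_c: np.ndarray, p_ml_cluster: np.ndarray,
         lam1: float, lam2: float, lam3: float) -> np.ndarray:
    """Eq. (1): mix the document's ML unigram model, the collection model p_c,
    and the ML model of the document's cluster. Assumes lam1 + lam2 + lam3 = 1
    and that w_m holds raw term counts."""
    return lam1 * (w_m / w_m.sum()) + lam2 * p_c + lam3 * p_ml_cluster
```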
Summary of Text Models used in the comparative study
Model     Representation                                  Similarity Metric
VSM       frequency vector                                Cosine similarity
LSA       K-dimensional vector in the eigenspace          Cosine similarity
Unigram   |V|-dimensional probability vector (smoothed)   KL divergence
LDA       K-dimensional probability vector                KL divergence
CBDM      |V|-dimensional combined probability vector     KL divergence or likelihood
Table: Generic models used in the comparative evaluation
Model     Representation                                 Similarity Metric
LSA2      |V|-dimensional representation in term-space   Cosine similarity
MLE-LDA   |V|-dimensional MLE-LDA probability vector     KL divergence or likelihood
Table: The variations on two of the generic models used in the comparative evaluation
Model           Representation                                        Similarity Metric
Unigram + LDA   |V|-dimensional combined probability vector           KL divergence or likelihood
VSM + LSA       |V|-dimensional combined VSM and LSA representation   Cosine similarity
Table: The two composite models used
Preprocessing of the source files
If a patch file does not exist in /trunk, it is searched for and added to the source library from the other branches/tags of ASPECTJ.
The source library consists of ".java" files only. After this step, our library ended up with 6546 Java files.
The repository.xml file documents all the information related to a bug, including the BugID, the bug description, the relevant source files, and so on. We shall refer to this ground-truth information as relevance judgments.
Bugs that are documented in iBUGS but have no relevant source files in the library resulting from the previous step are eliminated. After this step, we are left with 291 bugs.
Hard-words, camel-case words, and soft-words are handled using popular identifier-splitting methods [9, 10]; a sketch follows this list.
The stop-list consists of the most commonly occurring words, for example "for," "else," "while," "int," "double," "long," "public," "void," etc. There are 375 such words in the iBUGS ASPECTJ software. We also drop all unicode strings from the vocabulary.
The vocabulary is pruned further by calculating the relative importance of terms and eliminating ubiquitous and rarely-occurring terms.
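A rough sketch of the hard-word and camel-case cases (the soft-word inference of [9, 10] needs more than a few lines and is not attempted here):

```python
import re

def split_identifier(identifier: str) -> list:
    """Split hard-words (underscores) and camel-case words,
    e.g. 'getFileName_v2' -> ['get', 'file', 'name', 'v2']."""
    parts = identifier.replace('_', ' ')
    # Break lower-to-upper transitions and acronym boundaries ('XMLParser' -> 'XML Parser').
    parts = re.sub(r'(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])', ' ', parts)
    return [p.lower() for p in parts.split()]
```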
Evaluation Metrics
Mean Average Precision (MAP)
Calculated using the following two sets:
retrieved(Nr): the top Nr documents from the ranked list of documents retrieved for the query.
relevant: extracted from the relevance judgments available in repository.xml.
Precision and Recall:
$\mathrm{Precision}(P@N_r) = \dfrac{|\{\mathrm{relevant}\} \cap \{\mathrm{retrieved}\}|}{|\{\mathrm{retrieved}\}|}$
$\mathrm{Recall}(R@N_r) = \dfrac{|\{\mathrm{relevant}\} \cap \{\mathrm{retrieved}\}|}{|\{\mathrm{relevant}\}|}$
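These two definitions in code (a sketch; the file-id representation is an assumption):

```python
def precision_recall_at(ranked: list, relevant: set, n_r: int):
    """P@Nr and R@Nr from a ranked list of file ids and the relevance judgments."""
    retrieved = set(ranked[:n_r])
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)
```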
1. If we were to plot a typical P-R curve from the values of P@Nr and R@Nr, we would get a monotonically decreasing curve with high precision at low recall and vice versa.
2. The area under the P-R curve is called the Average Precision.
3. Taking the mean of the Average Precision over all queries gives the Mean Average Precision (MAP).
4. The physical significance of MAP is the same as that of precision; a sketch of the computation follows.
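A sketch using the usual rank-based computation of average precision, which approximates the area under the P-R curve:

```python
def average_precision(ranked: list, relevant: set) -> float:
    """AP for one query: mean of P@k over the ranks k at which a relevant file appears."""
    hits, total = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs: list) -> float:
    """MAP: runs is a list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```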
Rank of Retrieved Files [3]
The number of queries/bugs for which relevant source files were retrieved with ranks $r_{low} \le R \le r_{high}$ is reported.
For the retrieval performance reported in [3], the rank buckets used are R = 1, 2 ≤ R ≤ 5, 6 ≤ R ≤ 10, and R > 10; a counting sketch follows.
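A sketch for tallying these buckets, assuming each query's best (lowest) rank of a relevant file has already been computed:

```python
def rank_histogram(best_ranks: list) -> dict:
    """Count queries by the rank bucket of their best-ranked relevant file."""
    buckets = {'R = 1': 0, '2 <= R <= 5': 0, '6 <= R <= 10': 0, 'R > 10': 0}
    for r in best_ranks:
        if r == 1:
            buckets['R = 1'] += 1
        elif r <= 5:
            buckets['2 <= R <= 5'] += 1
        elif r <= 10:
            buckets['6 <= R <= 10'] += 1
        else:
            buckets['R > 10'] += 1
    return buckets
```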
SCORE [11]
1. Indicates the proportion of the program that needs to be examined in order to locate (localize) a fault.
2. For each range of this proportion (e.g., 10-20%), the number of test runs (bugs) is reported.
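A sketch of how such a histogram could be tallied, assuming the examined proportion per bug is already known (the exact bucketing in [11] may differ):

```python
def score_histogram(examined_fractions: list, bin_width: float = 0.10) -> list:
    """Bucket, per bug, the fraction of the program examined before the fault
    is reached into 10% ranges (0-10%, 10-20%, ...), as in the SCORE plots."""
    bins = [0] * 10
    for f in examined_fractions:
        bins[min(int(f / bin_width), 9)] += 1
    return bins
```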
Results
Models using LDA
Figure: MAP using the three LDA models for different values of K. The experimental parameters for the LDA+Unigram model are λ = 0.9, µ = 0.5, β = 0.01, and α = 50/K.
The combined LDA+Unigram model
Figure: MAP plotted for different values of mixture proportions (λ and
µ) of the LDA+Unigram combined model.
Models using LSA
Figure: MAP using the LSA model and its variations and combinations for different values of K. The experimental parameter for the LSA+VSM combined model is λ = 0.5.
CBDM
λ1      λ2      λ3     K=100      K=250    K=500     K=1000
0.25    0.25    0.5    0.093144   0.0914   0.08666   0.07664
0.15    0.35    0.5    0.0883     0.0897   0.0963    0.0932
0.81    0.09    0.1    0.143      0.102    0.108     0.09952
0.27    0.63    0.1    0.1306     0.117    0.111     0.0998
0.495   0.495   0.01   0.141      0.141    0.141     0.141
0.05    0.05    0.99   0.069      0.075    0.072     0.065
Table: Retrieval performance (MAP) with the CBDM. λ1 + λ2 + λ3 = 1, where λ1 weights the unigram model, λ2 the collection model, and λ3 the cluster model.
Rank-based metric
Figure: The height of the bars shows the number of queries (bugs) for
which at least one relevant source file was retrieved at rank 1.
Figure: SCORE for the IR-based bug localization tools
SCORE: comparison with AMPLE and FINDBUGS
With FINDBUGS, none of the bugs were localized correctly.
Figure: SCORE values calculated over 44 bugs in iBUGS ASPECTJ using AMPLE [12]
Conclusion
IR-based bug localization techniques are as effective as, or more effective than, static and dynamic bug localization tools.
Sophisticated models like LDA, LSA, or CBDM do not outperform simpler models like the unigram model or VSM for IR-based bug localization on large software systems.
Analyzing the spread of the word distributions over the source files, with the help of measures such as tf and idf, can give useful insights into the usability of topic- and cluster-based models for localization.
End of Presentation
Thank you. Questions?
Threats to validity
We have tested on only a single dataset, iBUGS. How does this generalize?
We eliminated XML files from those that are indexed and queried. This may not be a valid assumption.
References
[1] A. Marcus, A. Sergeyev, V. Rajlich, and J. I. Maletic, "An Information Retrieval Approach to Concept Location in Source Code," in Proceedings of the 11th Working Conference on Reverse Engineering (WCRE 2004), pp. 214-223, IEEE Computer Society, 2004.
[2] B. Cleary, C. Exton, J. Buckley, and M. English, "An Empirical Analysis of Information Retrieval based Concept Location Techniques in Software Comprehension," Empirical Software Engineering, vol. 14, no. 1, pp. 93-130, 2009.
[3] S. K. Lukins, N. A. Kraft, and L. H. Etzkorn, "Source Code Retrieval for Bug Localization using Latent Dirichlet Allocation," in 15th Working Conference on Reverse Engineering (WCRE 2008), 2008.
[4] V. Dallmeier and T. Zimmermann, "Extraction of Bug Localization Benchmarks from History," in ASE '07: Proceedings of the Twenty-Second IEEE/ACM International Conference on Automated Software Engineering, (New York, NY, USA), pp. 433-436, ACM, 2007.
[5] C. Zhai and J. Lafferty, "A Study of Smoothing Methods for Language Models Applied to Information Retrieval," ACM Transactions on Information Systems, pp. 179-214, 2004.
[6] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet Allocation," Journal of Machine Learning Research, pp. 993-1022, 2003.
[7] X. Wei and W. B. Croft, "LDA-Based Document Models for Ad-hoc Retrieval," in Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2006.
[8] X. Liu and W. B. Croft, "Cluster-Based Retrieval Using Language Models," in ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2004.
[9] H. Feild, D. Binkley, and D. Lawrie, "An Empirical Comparison of Techniques for Extracting Concept Abbreviations from Identifiers," in Proceedings of the IASTED International Conference on Software Engineering and Applications, 2006.
[10] E. Enslen, E. Hill, L. Pollock, and K. Vijay-Shanker, "Mining Source Code to Automatically Split Identifiers for Software Analysis," in Proceedings of the 2009 6th IEEE International Working Conference on Mining Software Repositories (MSR '09), (Washington, DC, USA), pp. 71-80, IEEE Computer Society, 2009.
[11] J. A. Jones and M. J. Harrold, "Empirical Evaluation of the Tarantula Automatic Fault-Localization Technique," in Automated Software Engineering, 2005.
[12] V. Dallmeier and T. Zimmermann, "Automatic Extraction of Bug Localization Benchmarks from History," tech. rep., Universität des Saarlandes, Saarbrücken, Germany, June 2007.