Physical and Conceptual Identifier Dispersion: Measures and Relation to Fault Proneness
Venera Arnaoudova, Laleh Eshkevari, Rocco Oliveto, Yann-Gaël Guéhéneuc, Giuliano Antoniol
SOCCER Lab. – DGIGL, École Polytechnique de Montréal, Qc, Canada
SE@SA Lab – DMI, University of Salerno - Salerno - Italy
Ptidej Team – DGIGL, École Polytechnique de Montréal, Qc, Canada
September 15, 2010
SOftware Cost-effective Change and Evolution Research Lab
Software Engineering @ SAlerno
Pattern Trace Identification, Detection, and Enhancement in Java
Outline
Introduction
Our study
Dispersion measures
Our study - refined
Case study
RQ1 – Metric Relevance
RQ2 – Relation to Faults
Conclusions and future work
Introduction
Fault identification
size (e.g., [Gyimóthy et al., 2005])
cohesion (e.g., [Liu et al., 2009])
coupling (e.g., [Marcus et al., 2008])
number of changes (e.g., [Zimmermann et al., 2007])
Importance of linguistic information
program comprehension (e.g., [Takang et al., 1996, Deissenboeck and Pizka, 2006, Haiduc and Marcus, 2008, Binkley et al., 2009])
code quality (e.g., [Marcus et al., 2008, Poshyvanyk and Marcus, 2006, Butler et al., 2009])
Our study
Term dispersion
We are interested in studying the relation between term dispersion and the quality of the source code.
term: a basic component of identifiers
dispersion: the way terms are scattered among different entities (attributes and methods)
quality: the absence of faults
Example: What is the impact of using getRelativePath, returnAbsolutePath, and setPath as method names on the fault proneness of those methods?
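The example can be made concrete with a minimal sketch, assuming that terms are obtained by splitting identifiers on camelCase boundaries and underscores. The slides do not state the splitting rule, and `split_identifier` and `entities_by_term` are hypothetical helper names introduced only for illustration.

```python
import re
from collections import defaultdict


def split_identifier(identifier):
    """Split an identifier into lowercase terms.

    Assumed rule (not stated in the slides): split on underscores and
    camelCase boundaries, keeping acronym runs such as "URL" together.
    """
    terms = []
    for chunk in identifier.split("_"):
        # Alternatives: acronym run, (capitalized) word, digit run.
        terms.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [t.lower() for t in terms]


def entities_by_term(identifiers):
    """Map each term to the list of entities (identifiers) that contain it."""
    index = defaultdict(list)
    for name in identifiers:
        for term in split_identifier(name):
            index[term].append(name)
    return dict(index)


methods = ["getRelativePath", "returnAbsolutePath", "setPath"]
index = entities_by_term(methods)
print(index["path"])  # the term "path" is shared by all three method names
```

In this sketch the term "path" is scattered over all three methods, while "relative" and "absolute" each occur in only one.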
Dispersion measures (1/3)
Physical dispersion - Entropy
[Figure: occurrences of the terms fee, foo, and bar across the entities E1 to E5, with the resulting entropy of each term. A circle marks the occurrences of a term in an entity; the larger the circle, the higher the number of occurrences.]
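As a rough illustration of physical dispersion, the sketch below computes the Shannon entropy of a term's occurrences across entities. It assumes the entropy is taken over the share of a term's occurrences that falls in each entity; the slides do not give the exact normalization. The occurrence counts are hypothetical and are not the values behind the figure.

```python
import math


def term_entropy(counts):
    """Shannon entropy (in bits) of a term's occurrence counts across entities."""
    total = sum(counts)
    if total == 0:
        return 0.0
    probabilities = [c / total for c in counts if c > 0]
    # p * log2(1/p) is the contribution of one entity to the entropy.
    return sum(p * math.log2(1.0 / p) for p in probabilities)


# Hypothetical occurrence counts over the entities E1..E5.
occurrences = {
    "fee": [4, 0, 0, 0, 0],  # concentrated in a single entity
    "foo": [2, 2, 0, 0, 0],  # split evenly over two entities
    "bar": [1, 1, 1, 1, 1],  # spread evenly over all five entities
}
for term, counts in occurrences.items():
    print(term, round(term_entropy(counts), 3))
```

Under this assumption, a term confined to one entity has entropy 0, and the entropy grows as the term is spread more evenly over more entities.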
Dispersion measures (2/3)
Conceptual dispersion - Context Coverage
Entity contexts are identified taking into account the terms contained in the entities.
[Figure: the entities E1 to E5 grouped into the entity contexts C1 to C4, and a grid of the terms fee, foo, and bar against the contexts C1 to C4 with the resulting context coverage of each term. A star indicates that the term appears in the particular context.]
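The idea of conceptual dispersion can be sketched as follows. This is a deliberately simplified stand-in, not the definition used in the study: it assumes the contexts are already given (the slides do not say how entities are grouped) and measures coverage as the fraction of contexts in which a term appears. The function name and the data are hypothetical.

```python
def context_coverage(term, entity_terms, entity_context):
    """Fraction of contexts in which `term` appears (simplified, assumed definition)."""
    all_contexts = set(entity_context.values())
    if not all_contexts:
        return 0.0
    covered = {
        entity_context[entity]
        for entity, terms in entity_terms.items()
        if term in terms
    }
    return len(covered) / len(all_contexts)


# Hypothetical data: five entities grouped into four contexts.
entity_terms = {
    "E1": {"fee", "bar"},
    "E2": {"fee"},
    "E3": {"foo", "bar"},
    "E4": {"bar"},
    "E5": {"foo", "bar"},
}
entity_context = {"E1": "C1", "E2": "C1", "E3": "C2", "E4": "C3", "E5": "C4"}

for term in ("fee", "foo", "bar"):
    print(term, context_coverage(term, entity_terms, entity_context))
```

With this data, "fee" stays within one context while "bar" appears in every context, which is the contrast the star grid is meant to convey.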
Dispersion measures (3/3)
Aggregated metric - numHEHCC
[Figure: terms positioned by entropy (H) and context coverage (CC); the thresholds th_H and th_CC split the plane into four regions.]
Low H, low CC: the term is used in few identifiers and in similar contexts.
High H, low CC: the term is used in many identifiers and in similar contexts.
Low H, high CC: the term is used in few identifiers and in different contexts.
High H, high CC: the term is used in many identifiers and in different contexts (!).
For each entity, numHEHCC counts the number of such terms, that is, terms with high entropy and high context coverage.
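The counting step can be sketched as below, assuming that per-term entropy and context coverage values are already available and that "high" means strictly above the thresholds th_H and th_CC. The slides do not say how the thresholds are chosen or whether the comparison is strict, so both are assumptions, and all values are hypothetical.

```python
def num_hehcc(entity_terms, entropy, coverage, th_h, th_cc):
    """For each entity, count its terms with entropy > th_h and coverage > th_cc."""
    high_terms = {
        term
        for term in entropy
        if entropy[term] > th_h and coverage.get(term, 0.0) > th_cc
    }
    return {
        entity: len(set(terms) & high_terms)
        for entity, terms in entity_terms.items()
    }


# Hypothetical per-term values and thresholds.
entropy = {"fee": 0.0, "foo": 1.0, "bar": 2.32}
coverage = {"fee": 0.25, "foo": 0.5, "bar": 1.0}
entity_terms = {
    "E1": {"fee", "bar"},
    "E2": {"fee"},
    "E3": {"foo", "bar"},
    "E4": {"bar"},
    "E5": {"foo", "bar"},
}

print(num_hehcc(entity_terms, entropy, coverage, th_h=0.9, th_cc=0.4))
```

Here "foo" and "bar" exceed both thresholds, so an entity's numHEHCC is the number of those two terms it contains.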
Our study - refined (1/2)
Research question 1
RQ1 – Metric Relevance: Does numHEHCC capture characteristics different from size?
Our belief: Yes, it does, although we expect some overlap.
To this end, we verify the following:
1. To what extent do numHEHCC and size vary together?
2. Can size explain numHEHCC?
3. Does numHEHCC bring additional information to size for fault explanation?
Our study - refined (2/2)
Research question 2
RQ2 – Relation to Faults: Do term entropy and context coverage help to explain the presence of faults in an entity?
Our belief: Yes, they do!
How?
1. Estimate the risk of being faulty when entities contain terms with high entropy and high context coverage.
Objects
ArgoUML v0.16 – a UML modeling CASE tool.
Rhino v1.4R3 – a JavaScript/ECMAScript interpreter and compiler.

Program   LOC      # Entities   # Terms
ArgoUML   97,946   12,423       2,517
Rhino     18,163   1,624        949

We consider as entities both methods and attributes.
Case study
RQ1 – Metric Relevance (1/3)
Results for RQ1 – Metric Relevance
To what extent do numHEHCC and size vary together?
ArgoUML: 40%
Rhino: 43%
[Figure: correlation between numHEHCC and LOC.]
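A minimal sketch of how such a correlation could be computed is given below. It assumes the Pearson coefficient over per-entity (LOC, numHEHCC) pairs; the slides do not name the coefficient used, and the data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-entity measurements (not taken from ArgoUML or Rhino).
loc = [5, 12, 30, 8, 45, 20]
num_hehcc_values = [1, 0, 3, 2, 2, 4]
print(round(pearson(loc, num_hehcc_values), 2))
```

A coefficient well below 1 would be consistent with the expectation that numHEHCC overlaps with size only partially.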
Case study
RQ1 – Metric Relevance (2/3)
Results for RQ1 – Metric Relevance
Can size explain numHEHCC?
ArgoUML: 17%
Rhino: 19%
Composition of numHEHCC.
Case study
RQ1 – Metric Relevance (3/3)
Results for RQ1 – Metric Relevance (cont'd)
Does numHEHCC bring additional information to size for fault explanation?

Model       Variable       Coefficient   p-value
M_ArgoUML   Intercept      -1.688e+00    2e-16
M_ArgoUML   LOC             7.703e-03    8.34e-10
M_ArgoUML   numHEHCC        7.490e-02    1.42e-05
M_ArgoUML   LOC:numHEHCC   -2.819e-04    0.000211
M_Rhino     Intercept      -4.9625130    2e-16
M_Rhino     LOC             0.0041486    0.17100
M_Rhino     numHEHCC        0.2446853    0.00310
M_Rhino     LOC:numHEHCC   -0.0004976    0.29788
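To show how the coefficients combine, the sketch below evaluates the two models for a given entity. It assumes that the models are logistic regressions with an interaction term, which the slide does not state explicitly; the coefficients are copied from the table, while `fault_probability` and the example inputs are hypothetical.

```python
import math

# Coefficients from the slide: (intercept, LOC, numHEHCC, LOC:numHEHCC).
MODELS = {
    "ArgoUML": (-1.688, 7.703e-03, 7.490e-02, -2.819e-04),
    "Rhino": (-4.9625130, 0.0041486, 0.2446853, -0.0004976),
}


def fault_probability(system, loc, num_hehcc):
    """Predicted fault probability, assuming a logistic model with interaction."""
    b0, b_loc, b_num, b_inter = MODELS[system]
    z = b0 + b_loc * loc + b_num * num_hehcc + b_inter * loc * num_hehcc
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical entity: 50 lines of code, with and without five flagged terms.
for system in MODELS:
    without = fault_probability(system, loc=50, num_hehcc=0)
    with_terms = fault_probability(system, loc=50, num_hehcc=5)
    print(system, round(without, 4), round(with_terms, 4))
```

In both models the numHEHCC coefficient is positive and significant at the 0.05 level, whereas for Rhino the LOC and interaction terms are not (p = 0.171 and p = 0.298).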
Case study
Results for RQ2 – Relation to Faults (1/1)
The risk of being faulty when entities contain terms with high entropy and high context coverage.
[Figure: the set of all entities, with the subset selected by numHEHCC (10% of the entities) highlighted.]
Risk of being faulty?
ArgoUML: 2 x higher
Rhino: 6 x higher
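A statement such as "2 x higher" can be read as a ratio of fault rates between the selected entities and the rest. The sketch below computes a relative risk from a two-by-two split. Whether the study reports a relative risk or an odds ratio is not stated on the slide, so this is only one plausible reading, and the counts are hypothetical.

```python
def relative_risk(faulty_selected, total_selected, faulty_other, total_other):
    """Ratio of the fault rate among selected entities to the rate among the rest."""
    if total_selected <= 0 or total_other <= 0:
        raise ValueError("both groups must be non-empty")
    if faulty_other == 0:
        raise ValueError("relative risk is undefined when the baseline rate is zero")
    risk_selected = faulty_selected / total_selected
    risk_other = faulty_other / total_other
    return risk_selected / risk_other


# Hypothetical counts: 1,000 entities, 100 of them selected by numHEHCC.
ratio = relative_risk(
    faulty_selected=20, total_selected=100,
    faulty_other=90, total_other=900,
)
print(round(ratio, 2))
```

With these made-up counts, 20% of the selected entities are faulty against 10% of the others, so the selected group is twice as likely to be faulty.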
Conclusions and future work
Conclusions
Entropy and context coverage, together, capture characteristics different from size!
Entropy and context coverage, together, help to explain the presence of faults in entities!
Future directions
Replicate the study on other systems.
Use entropy and context coverage to suggest refactorings.
Study the impact of lexicon evolution on entropy and context coverage.
References
Binkley, D., Davis, M., Lawrie, D., and Morrell, C. (2009). To CamelCase or Under_score. In Proceedings of 17th IEEE International Conference on Program Comprehension. IEEE CS Press.
Butler, S., Wermelinger, M., Yu, Y., and Sharp, H. (2009). Relating identifier naming flaws and code quality: An empirical study. In Proceedings of the 16th Working Conference on Reverse Engineering, pages 31–35. IEEE CS Press.
Deissenboeck, F. and Pizka, M. (2006). Concise and consistent naming. Software Quality Journal, 14(3):261–282.
Gyimóthy, T., Ferenc, R., and Siket, I. (2005). Empirical validation of object-oriented metrics on open source software for fault prediction. IEEE Transactions on Software Engineering, 31(10):897–910.
Haiduc, S. and Marcus, A. (2008). On the use of domain terms in source code. In Proceedings of 16th IEEE International Conference on Program Comprehension, pages 113–122. IEEE CS Press.
Liu, Y., Poshyvanyk, D., Ferenc, R., Gyimóthy, T., and Chrisochoides, N. (2009). Modelling class cohesion as mixtures of latent topics. In Proceedings of 25th IEEE International Conference on Software Maintenance, pages 233–242, Edmonton, Canada. IEEE CS Press.
Marcus, A., Poshyvanyk, D., and Ferenc, R. (2008). Using the conceptual cohesion of classes for fault prediction in object-oriented systems. IEEE Transactions on Software Engineering, 34(2):287–300.
Poshyvanyk, D. and Marcus, A. (2006). The conceptual coupling metrics for object-oriented systems. In Proceedings of 22nd IEEE International Conference on Software Maintenance, pages 469–478. IEEE CS Press.
Takang, A., Grubb, P., and Macredie, R. (1996). The effects of comments and identifier names on program comprehensibility: an experiential study. Journal of Programming Languages, 4(3):143–167.
Zimmermann, T., Premraj, R., and Zeller, A. (2007). Predicting defects for Eclipse. In Proceedings of the Third International Workshop on Predictor Models in Software Engineering.