1. 2010 CRC PhD Student Conference
Analysing semantic networks of
identifier names to improve source code
maintainability and quality
Simon Butler
sjb792@student.open.ac.uk
Supervisors: Michel Wermelinger, Yijun Yu & Helen Sharp
Department/Institute: Centre for Research in Computing
Status: Part-time
Probation viva: After
Starting date: October 2008
Source code is the written expression of a software design consisting of identifier
names – natural language phrases that represent concepts being manipulated
by the program – embedded in a framework of keywords and operators provided
by the programming language. Identifiers are crucial for program comprehen-
sion [9], a necessary activity in the development and maintenance of software.
Despite their importance, there is little understanding of the relationship be-
tween identifier names and source code quality and maintainability. Neither is
there automated support for identifier management or the selection of relevant
natural language content for identifiers during software development.
We will extend current understanding of the relationship between identifier
name quality and source code quality and maintainability by developing tech-
niques to analyse identifiers for meaning, modelling the semantic relationships
between identifiers and empirically validating the models against measures of
maintainability and software quality. We will also apply the analysis and mod-
elling techniques in a tool to support the selection and management of identifier
names during software development, and concept identification and location for
program comprehension.
The consistent use of clear identifier names is known to aid program com-
prehension [4, 7, 8]. However, despite the advice given in programming conven-
tions and the popular programming literature on the use of meaningful identifier
names in source code, the reality is that identifier names are not always meaning-
ful, may be selected in an ad hoc manner, and do not always follow conventions
[1, 2, 5].
Researchers in the reverse engineering community have constructed mod-
els to support program comprehension. The models range in complexity from
textual search systems [11], to RDF-OWL ontologies created either solely from
source code and identifier names [8], or with the inclusion of supporting doc-
umentation and source code comments [13]. The ontologies typically focus on
class and method names, and are used for concept identification and location
based on the lexical similarity of identifier names. The approach, however, does
not directly address the quality of identifier names used.
The development of detailed identifier name analysis has focused on method
names because their visibility and reuse in APIs implies a greater need for them
to contain clear information about their purpose [10]. Caprile and Tonella [3]
derived both a grammar and vocabulary for C function identifiers, sufficient
for the implementation of automated name refactoring. Høst and Østvold [5]
have since analysed Java method names looking for a common vocabulary that
could form the basis of a naming scheme for Java methods. Their analysis of
the method names used in multiple Java projects found common grammatical
forms; however, there were sufficient degenerate forms for them to be unable to
derive a grammar for Java method names.
The consequences of identifier naming problems have been considered to be
largely confined to the domain of program comprehension. However, Deißenböck
and Pizka observed an improvement in maintainability when their rules of con-
cise and consistent naming were applied to a project [4], and our recent work
found statistical associations between identifier name quality and source code
quality [1, 2]. Our studies, however, examined only the construction of
identifier names in isolation, and not the relationships between the meanings
of the natural language content of the identifiers. We hypothesise that a
relationship exists between the quality of identifier names, in terms of their natural
language content and semantic relationships, and the quality of source code,
which can be understood in terms of the functionality, reliability, and usability
of the resulting software, and its maintainability [6]. Accordingly, we seek to
answer the following research question:
How are the semantic relationships between identifier names, in-
ferred from their natural language content and programming lan-
guage structure, related to source code maintainability and quality?
We will construct models of source code as semantic networks predicated
on both the semantic content of identifier names and the relationships between
identifier names inferred from the programming language structure. For exam-
ple, the simple class Car in Figure 1 may be represented by the semantic network
in Figure 2. Such models can be applied to support empirical investigations of
the relationship between identifier name quality and source code quality and
maintainability. The models may also be used in tools to support the manage-
ment and selection of identifier names during software development, and to aid
concept identification and location during source code maintenance.
public class Car extends Vehicle {
    Engine engine;
}
Figure 1: The class Car
We will analyse identifier names mined from open source Java projects to
create a catalogue of identifier structures to understand the mechanisms em-
ployed by developers to encode domain information in identifiers. We will build
on the existing analyses of C function and Java method identifier names [3, 5, 8],
and anticipate the need to develop additional techniques to analyse identifiers,
particularly variable identifier names.
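A first step in cataloguing identifier structures is to split each name into its component words. The sketch below is illustrative only: the class name IdentifierTokeniser and its splitting rules (underscores, lower-to-upper camel-case boundaries, and acronym boundaries) are assumptions made for this example, not the analysis techniques proposed here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: splits Java identifier names into lower-case word tokens. */
final class IdentifierTokeniser {

    private IdentifierTokeniser() { }

    /**
     * Splits on underscores, on lower-to-upper case boundaries ("engineSize"),
     * and between an acronym and a following capitalised word ("XMLParser").
     */
    static List<String> tokenise(String identifier) {
        String[] parts = identifier.split(
            "_+|(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])");
        List<String> tokens = new ArrayList<>();
        for (String part : parts) {
            // A leading underscore yields an empty first element; skip it.
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenise("engineSize"));
        System.out.println(tokenise("XMLParser"));
        System.out.println(tokenise("MAX_SPEED"));
    }
}
```

Real identifiers also contain abbreviations, digits and domain-specific contractions, so a production tokeniser would need more than these three rules.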
[Diagram: nodes Car, Vehicle, Engine and engine, joined by edges labelled "extends" (Car to Vehicle), "has a" and "has instance named".]
Figure 2: A semantic network of the class Car
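One possible in-memory form of such a network is a set of subject-relation-object triples. The sketch below is a minimal illustration under stated assumptions: the class SemanticNetwork and its methods are hypothetical, and the directions chosen for the "has a" and "has instance named" edges are one reading of Figure 2 rather than a confirmed transcription of it.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

/** Hypothetical model: a semantic network stored as labelled, directed edges. */
final class SemanticNetwork {

    /** One edge of the network: subject --relation--> object. */
    static final class Triple {
        final String subject;
        final String relation;
        final String object;

        Triple(String subject, String relation, String object) {
            this.subject = Objects.requireNonNull(subject);
            this.relation = Objects.requireNonNull(relation);
            this.object = Objects.requireNonNull(object);
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof Triple)) {
                return false;
            }
            Triple t = (Triple) other;
            return subject.equals(t.subject)
                && relation.equals(t.relation)
                && object.equals(t.object);
        }

        @Override
        public int hashCode() {
            return Objects.hash(subject, relation, object);
        }
    }

    // Insertion-ordered so that queries return edges in the order they were added.
    private final Set<Triple> triples = new LinkedHashSet<>();

    void add(String subject, String relation, String object) {
        triples.add(new Triple(subject, relation, object));
    }

    /** Returns every object linked to the subject by the given relation. */
    List<String> objectsOf(String subject, String relation) {
        List<String> result = new ArrayList<>();
        for (Triple t : triples) {
            if (t.subject.equals(subject) && t.relation.equals(relation)) {
                result.add(t.object);
            }
        }
        return result;
    }

    /** Builds the network for class Car; edge directions are an assumed reading of Figure 2. */
    static SemanticNetwork carExample() {
        SemanticNetwork network = new SemanticNetwork();
        network.add("Car", "extends", "Vehicle");
        network.add("Car", "has a", "Engine");
        network.add("Engine", "has instance named", "engine");
        return network;
    }

    public static void main(String[] args) {
        SemanticNetwork network = carExample();
        System.out.println(network.objectsOf("Car", "extends"));
        System.out.println(network.objectsOf("Car", "has a"));
    }
}
```

Storing edges as plain triples keeps the structural relations (from the programming language) and the linguistic relations (from the identifier names) in one queryable collection.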
Modelling of both the structural and semantic relationships between iden-
tifiers can be accomplished using Gellish [12], an extensible controlled natural
language with dictionaries for natural languages – Gellish English being the
variant for the English language. Unlike a conventional dictionary, a Gellish
dictionary includes human- and machine-readable links between entries to de-
fine relationships between concepts – thus making Gellish a semantic network –
and to show hierarchical linguistic relationships such as meronymy, an entity–
component relationship. Gellish dictionaries also permit the creation of multiple
conceptual links for individual entries to define polysemic senses.
The natural language relationships catalogued in Gellish can be applied to
establish whether the structural relationship between two identifiers implied by
the programming language is consistent with the conventional meaning of the
natural language found in the identifier names. For example, a field is implicitly
a component of the containing class, allowing the inference of a conceptual
and linguistic relationship between class and field identifier names. Any incon-
sistency between the two relationships could indicate potential problems with
either the design or with the natural language content of the identifier names.
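The consistency check described above can be sketched as a lookup: for each field, ask whether a dictionary of part-whole (meronymy) relations agrees that the field's concept is a component of the class's concept. The dictionary here is a hand-written stand-in and does not use Gellish or any Gellish API; the names NamingConsistencyChecker, addPartOf and isConsistent are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical checker comparing class/field structure with a toy meronymy dictionary. */
final class NamingConsistencyChecker {

    // Maps a whole (e.g. "car") to the parts the toy dictionary records for it.
    private final Map<String, Set<String>> partsByWhole = new HashMap<>();

    /** Records that 'part' is a component of 'whole' in natural language. */
    void addPartOf(String part, String whole) {
        partsByWhole
            .computeIfAbsent(normalise(whole), key -> new HashSet<>())
            .add(normalise(part));
    }

    /**
     * Returns true when the dictionary confirms that the field's concept is a
     * part of the class's concept. A false result means "not confirmed": it may
     * point to a naming or design problem, or simply to a gap in the dictionary.
     */
    boolean isConsistent(String className, String fieldTypeName) {
        Set<String> parts = partsByWhole.get(normalise(className));
        return parts != null && parts.contains(normalise(fieldTypeName));
    }

    private static String normalise(String word) {
        return word.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        NamingConsistencyChecker checker = new NamingConsistencyChecker();
        checker.addPartOf("engine", "car");
        System.out.println(checker.isConsistent("Car", "Engine"));
        System.out.println(checker.isConsistent("Car", "Banana"));
    }
}
```

The sketch compares single-word names only; multi-word identifiers would first need to be split into tokens and mapped to dictionary concepts.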
We have assumed a model of source code development and comprehension
predicated on the idea that it is advantageous for coherent and relevant semantic
relationships to exist between identifier names based on their natural language
content. To assess the relevance of our model to real-world source code we
will validate the underlying assumption empirically. We intend to mine both
software repositories and defect reporting systems to identify source code impli-
cated in defect reports and evaluate the source code in terms of the coherence
and consistency of models of its identifiers. To assess maintainability we will
investigate how source code implicated in defect reports develops in successive
versions – e.g. is the code a continuing source of defects? – and monitor areas of
source code modified between versions to determine how well our model predicts
defect-prone and defect-free regions of source code.
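One conventional way to quantify how well a model predicts defect-prone regions is to compare the set of source files it flags against the set implicated in defect reports, using precision and recall. This is offered as a sketch under assumptions: the abstract does not name an evaluation measure, and the class PredictionEvaluation and the file names used with it are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical evaluation helper: compares flagged files with defect-implicated files. */
final class PredictionEvaluation {

    private PredictionEvaluation() { }

    /** Fraction of flagged files that were actually implicated in defect reports. */
    static double precision(Set<String> flagged, Set<String> defective) {
        if (flagged.isEmpty()) {
            return 0.0; // nothing flagged: define precision as 0 to avoid dividing by zero
        }
        return (double) intersectionSize(flagged, defective) / flagged.size();
    }

    /** Fraction of defect-implicated files that the model flagged. */
    static double recall(Set<String> flagged, Set<String> defective) {
        if (defective.isEmpty()) {
            return 0.0; // no known defects: define recall as 0 to avoid dividing by zero
        }
        return (double) intersectionSize(flagged, defective) / defective.size();
    }

    private static int intersectionSize(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    public static void main(String[] args) {
        // Made-up file names, for illustration only.
        Set<String> flagged = new HashSet<>(
            Arrays.asList("Car.java", "Engine.java", "Wheel.java", "Door.java"));
        Set<String> defective = new HashSet<>(
            Arrays.asList("Car.java", "Engine.java", "Radio.java"));
        System.out.println("precision = " + precision(flagged, defective));
        System.out.println("recall    = " + recall(flagged, defective));
    }
}
```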
We will apply the results of our research to develop a tool to support the
selection and management of identifier names during software development, as
well as modelling source code to support software maintenance. We will evaluate
and validate the tool with software developers – both industry partners and
FLOSS developers – to establish the value of identifier naming support. While
intended for software developers, the visualisations of source code presented by
the tool will enable stakeholders (e.g. domain experts) who are not literate
in programming or modelling languages (like Java and UML) to examine, and
give feedback on, the representation of domain concepts in source code.
References
[1] S. Butler, M. Wermelinger, Y. Yu, and H. Sharp. Relating identifier naming
flaws and code quality: an empirical study. In Proc. of the Working Conf.
on Reverse Engineering, pages 31–35. IEEE Computer Society, 2009.
[2] S. Butler, M. Wermelinger, Y. Yu, and H. Sharp. Exploring the influence
of identifier names on code quality: an empirical study. In Proc. of the
14th European Conf. on Software Maintenance and Reengineering, pages
159–168. IEEE Computer Society, 2010.
[3] B. Caprile and P. Tonella. Restructuring program identifier names. In
Proc. Int’l Conf. on Software Maintenance, pages 97–107. IEEE, 2000.
[4] F. Deißenböck and M. Pizka. Concise and consistent naming. Software
Quality Journal, 14(3):261–282, Sep 2006.
[5] E. W. Høst and B. M. Østvold. The Java programmer’s phrase book.
In Software Language Engineering, volume 5452 of LNCS, pages 322–341.
Springer, 2008.
[6] International Standards Organisation. ISO/IEC 9126-1: Software engineer-
ing – product quality, 2001.
[7] D. Lawrie, H. Feild, and D. Binkley. An empirical study of rules for well-
formed identifiers. Journal of Software Maintenance and Evolution: Re-
search and Practice, 19(4):205–229, 2007.
[8] D. Raţiu. Intentional Meaning of Programs. PhD thesis, Technische
Universität München, 2009.
[9] V. Rajlich and N. Wilde. The role of concepts in program comprehension.
In Proc. 10th Int’l Workshop on Program Comprehension, pages 271–278.
IEEE, 2002.
[10] M. Robillard. What makes APIs hard to learn? Answers from developers.
IEEE Software, 26(6):27–34, Nov.-Dec. 2009.
[11] G. Sridhara, E. Hill, L. Pollock, and K. Vijay-Shanker. Identifying word
relations in software: a comparative study of semantic similarity tools. In
Proc. Int’l Conf. on Program Comprehension, pages 123–132. IEEE, June
2008.
[12] A. S. H. P. van Renssen. Gellish: a generic extensible ontological language.
Delft University Press, 2005.
[13] R. Witte, Y. Zhang, and J. Rilling. Empowering software maintainers with
semantic web technologies. In European Semantic Web Conf., pages 37–52,
2007.