Supporting the Maintenance of Identifier Names: A Holistic Approach to High-Quality Automated Identifier Naming

Anthony Peruma
June 28, 2022
B. Thomas Golisano College of Computing and Information Sciences
Ph.D. Dissertation Presentation
Dissertation Committee: Dr. Mohamed Mkaouer, Dr. Mehdi Mirakhorli & Dr. Marcos Zampieri
Dissertation Advisor: Dr. Christian Newman
Dissertation Defense Chair: Dr. Robert Glick

Agenda
01 Introduction
➢ Context
➢ Current Research & Challenges
02 Research
➢ Goal & Research Questions
➢ Completed Studies
➢ Overall Findings
03 Conclusion
➢ Future Work
➢ Summary
“Any fool can write code that a computer can understand. Good programmers write code that humans can understand.”
~ Martin Fowler, 1999

The importance of identifier names
Identifier Names
Lexical tokens that uniquely identify elements
in the source code (classes, methods, etc.)
Names account for 70% of characters in the
code base
Software Maintenance
Every software system undergoes maintenance
(corrective, adaptive, preventive, perfective)
Consumes 60% - 80% of an organization's
resources
Program Comprehension
A precursor to any maintenance task
Developers spend 58% of their time on
program comprehension activities
Poor code readability impacts time and quality

Lexical tokens that uniquely identify entities
class name
attribute name
parameter name
method name
variable name
Responsible for saving/writing results of an operation
Responsible for writing the output which comes in as a parameter

Some poor-quality names are easy to spot…
Unreadable method name
Generic name

… others are not so straightforward!
Readable attribute name
Anti-Pattern: Collection data type, singular identifier name
Readable method name
Anti-Pattern: method name suggests transformation, but no return type

Current Research &
Challenges
Introduction

IDENTIFIER NAMING IS HARD
CODING STANDARDS & STYLE GUIDES
Provide heuristics about the overall readability of a class. They do not produce
strong names, nor can they provide lexical structure recommendations.
RENAMING
Renames account for over 40% of the rework developers perform.
Renames do not guarantee a strong name.

Challenges with renaming
A “rename chain” - multiple instances of developers renaming an identifier
Is the final name high-quality?
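A rename chain can be pictured as a list of successive names for one identifier. The following is a minimal sketch, not the tooling used in the studies: the function names are hypothetical, and it assumes renames arrive in chronological order as (old name, new name) pairs with no two identifiers sharing a name.

```python
def build_rename_chains(renames):
    """Group chronologically ordered (old_name, new_name) pairs into chains.

    Assumption: names are unique across the code base, so a rename whose
    old name equals the last name of a chain extends that chain.
    """
    chains = []
    for old, new in renames:
        for chain in chains:
            if chain[-1] == old:
                chain.append(new)
                break
        else:
            # No existing chain ends with the old name: start a new chain.
            chains.append([old, new])
    return chains


def reverts_to_earlier_name(chain):
    """Return True if any name in the chain appears more than once."""
    return len(set(chain)) < len(chain)


history = [
    ("res", "result"),
    ("cnt", "count"),
    ("result", "queryResult"),
    ("queryResult", "result"),
]
chains = build_rename_chains(history)
```

A chain such as res, result, queryResult, result shows why the final name is not automatically the best one: the developer ended up back at an earlier choice.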

IDENTIFIER NAMING IS HARD
NAME RECOMMENDATION MODELS
Models are prescriptive, not descriptive. A model is built from the
existing code styles and does not account for pre-existing poor identifiers.
Models work only on method names. Context sensitivity is a challenge.
RENAMING

NLP TECHNOLOGY
Current technology is built for English prose, not source code (e.g., Stanford
POS tagger); domain/technology terms pose a challenge.
Names are diverse, and so are the developers who craft
these names, so a one-stop solution is very challenging!

Over 30 years of research in identifier naming
Multiple Research Streams
Identifier Renaming
Identifier Name Quality
Naming styles, metrics, models, linguistic anti-patterns,
grammar patterns
Challenges with current approaches
Name quality is a threat to downstream approaches
Even with over 30 years of research, we do not have a way to measure strong identifier names.

GOAL
Improving the developer code comprehension experience through novel automated
mechanisms in identifier name appraisals and recommendations

Grammar Patterns
A grammar pattern is the sequence of part-of-speech tags assigned to
individual words within an identifier
Part-of-speech is a category to which a word is assigned in accordance with
its syntactic functions
• In English, the main parts of speech are noun, pronoun, adjective, determiner, verb, adverb,
preposition, conjunction, and interjection
int dynamic_Table_Index;
dynamic: Noun Modifier (NM), Table: Noun Modifier (NM), Index: Noun (N)
void save_As_Quadratic_Png();
save: Verb (V), As: Preposition (P), Quadratic: Noun Modifier (NM), Png: Noun (N)
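To make the definition concrete, the sketch below splits an identifier into words and looks up one tag per word. It is only an illustration: the lexicon is a hypothetical stand-in for a real part-of-speech tagger, and it ignores context (a real tagger must decide, for example, whether "table" is a noun modifier or a head noun based on its position).

```python
import re

# Hypothetical toy lexicon covering only the two example identifiers.
TOY_LEXICON = {
    "dynamic": "NM",
    "table": "NM",
    "index": "N",
    "save": "V",
    "as": "P",
    "quadratic": "NM",
    "png": "N",
}


def split_identifier(name):
    """Split an identifier on underscores and camelCase boundaries."""
    words = []
    for part in name.split("_"):
        # Acronyms (PNG), capitalized or lowercase words, and digit runs.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [w.lower() for w in words]


def grammar_pattern(name, lexicon=TOY_LEXICON):
    """Return the tag sequence for a name; unknown words are tagged '?'."""
    return " ".join(lexicon.get(w, "?") for w in split_identifier(name))


print(grammar_pattern("dynamic_Table_Index"))
print(grammar_pattern("save_As_Quadratic_Png"))
```

The two calls reproduce the patterns shown on the slide, NM NM N and V P NM N.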

Research Questions
• RQ 1: How effectively, in terms of correctness, can
grammar patterns be automatically generated for identifier
names?
• RQ 2: To what extent did the automated identifier naming
mechanism positively or negatively influence naming
practices?
• RQ 3: What are the primary challenges in appraising and
recommending the semantic structure of identifier names,
and how can these be improved?

Incorporating automated support for identifier name
maintenance into the developer workflow

[Screenshot: IDE plugin. A squiggly line indicates a naming problem; a panel summarizes all naming problems.]

[Screenshot: plugin detail view showing the selected identifier, the problem summary, the detected and recommended grammar pattern, and the problem explanation.]

Research Focus Areas
Identifier Name Evolution
Identifier Name Tool Development
Developer Workflow Integration
• Rename Prevalence
• Semantic Evolution
• Contextualization
• Grammar Patterns
• Abbreviation Expansion
• Rename Semantic Detection
• Linguistic Anti-Pattern Detection
• Identifier Part-of-Speech Tagger
• Developer Experience

An empirical investigation of how and why
developers rename identifiers
INTRODUCTION & GOAL
Current work in the field does not examine the evolution of the name
Most of these studies do not provide empirical data – mostly conceptual
The study extends a portion of the work done by Arnaoudova et al. to a much larger number of systems
Lays the groundwork for understanding how the semantics of a name evolves
Goal: Explore the volume of rename refactoring operations developers apply and changes to the
structure of the renamed identifier names
Peruma, A., Mkaouer, M. W., Decker, M. J., & Newman, C. D. (2018, September). An empirical investigation of how and why developers rename
identifiers. In Proceedings of the 2nd International Workshop on Refactoring (pp. 26-33).

METHODOLOGY
Empirical study on 3,795 open-source Java systems
• RefactoringMiner to mine rename refactoring operations
• Rename Taxonomy – determines the type of form and semantic
change to an identifier’s name
• NLP Tools – including NLTK to determine the semantics of a name
• Topic Modeling – on the rename refactoring commit messages to
determine
Data Overview:
1M+ refactoring operations, of which 43.36% are rename refactorings

KEY FINDINGS & TAKEAWAYS
Renames form the bulk of the rework developers perform when refactoring their code
Developers mostly perform simple renames – either add or remove a single term in a name
Narrowing the meaning of the name is frequently done by the developer during a rename
There is a strong correlation between grammar changes and preservation of the meaning of the identifier's name
Topic modeling of rename commit messages results in high-level topics - difficult to pinpoint the
developer’s intention
• Exploratory study showing the viability of using the semantic structure of names to determine the quality of the name
• Scope for constructing specialized NLP tools for software engineering artifacts
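The finding that most renames add or remove a single term can be checked mechanically by comparing the terms of the old and new names. The sketch below is a simplified assumption, not the rename taxonomy used in the study: the labels are hypothetical, and terms are split on underscores only.

```python
from collections import Counter


def classify_rename(old_name, new_name):
    """Label a rename by how its underscore-separated terms changed."""
    old_terms = Counter(t for t in old_name.lower().split("_") if t)
    new_terms = Counter(t for t in new_name.lower().split("_") if t)
    # Counter subtraction keeps only positive counts, i.e. the surplus terms.
    added = sum((new_terms - old_terms).values())
    removed = sum((old_terms - new_terms).values())
    if added == 0 and removed == 0:
        return "no term change"
    if removed == 0:
        return "single term added" if added == 1 else "multiple terms added"
    if added == 0:
        return "single term removed" if removed == 1 else "multiple terms removed"
    return "terms replaced"


print(classify_rename("table_index", "dynamic_table_index"))
```

Adding a modifier such as "dynamic" is also an example of narrowing the meaning of a name, although deciding that automatically needs semantic analysis that this sketch does not attempt.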

Contextualizing rename decisions using
refactorings, commit messages, and data types
INTRODUCTION & GOAL
Existing research on identifier naming does not investigate how names evolve and how these
changes correlate with changes made to source code
Help determine when/how to rename identifiers and to understand more about developer naming
mental models
Goal: Understand how surrounding code and development activities influence the structure and
meaning of an identifier’s name
• Data Types – have strong influence over the data and behavior represented by an identifier
• Refactorings – changes made before or after a rename have a relationship with the rename itself
Peruma, A., Mkaouer, M. W., Decker, M. J., & Newman, C. D. (2020). Contextualizing rename decisions using refactorings, commit messages, and data
types. Journal of Systems and Software, 169, 110704.

METHODOLOGY
• RefactoringMiner to mine 28 refactoring operation types in the
source code
• Rename Taxonomy – determines the type of form and semantic
change to an identifier’s name
• Static Analysis – extract the data type for an identifier
• NLP Tools – including NLTK to determine the semantics of a name
• Developer Experience – measured using the number of commits
performed on source code
Data Overview:
748,001 commits, 711,495 refactoring operations, 53.51% of
refactorings are renames

Novice developers perform rename refactorings more frequently than other types of refactoring
operations
A Rename Attribute usually follows a Move Attribute
When a rename follows another rename, the developer reverts to the original name
Developers frequently change the semantic meaning of an identifier name after a refactoring
Renames that involve a change to the type name tend to also involve identifiers whose names
exactly match their type; a data type change to a collection causes the name to change to plural

• Name quality tools should consider the experience of the developer when presenting results
• Incorporation of code & name relationship heuristics (e.g., data types and plurality) into automated name appraisals
and recommendations
Narrowing the name of the type narrows the
meaning of the identifier’s name
Collection data types are associated with plural
identifier names
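The link between collection types and plural names lends itself to a simple check. The following is a rough sketch under stated assumptions: the set of collection type names is a small hypothetical sample, and plurality is approximated by a trailing "s", which misses irregular plurals such as "children" and misreads words such as "status".

```python
# Hypothetical sample of collection type names; a real tool would use
# static analysis to resolve the declared type.
COLLECTION_TYPES = {"List", "ArrayList", "Set", "HashSet", "Collection"}


def is_collection_type(type_name):
    """Treat arrays and a few well-known generic containers as collections."""
    base = type_name.split("<")[0].strip()
    return type_name.strip().endswith("[]") or base in COLLECTION_TYPES


def looks_plural(word):
    """Crude plural test: ends in 's' but not in 'ss'."""
    w = word.lower()
    return w.endswith("s") and not w.endswith("ss")


def singular_name_for_collection(type_name, identifier):
    """Flag a collection-typed identifier whose last term looks singular."""
    head = [t for t in identifier.split("_") if t][-1]
    return is_collection_type(type_name) and not looks_plural(head)


print(singular_name_for_collection("String[]", "method_Name_Prefix"))
```

A flagged name is only a candidate for review; whether the singular form is actually misleading still depends on the surrounding code.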

On the generation, structure, and semantics of
grammar patterns in source code identifiers
Newman, C. D., AlSuhaibani, R. S., Decker, M. J., Peruma, A., Kaushik, D., Mkaouer, M. W., & Hill, E. (2020). On the generation, structure, and semantics
of grammar patterns in source code identifiers. Journal of Systems and Software, 170, 110740.
INTRODUCTION & GOAL
Existing work focuses on a specific type of identifier (class or method) or does not focus on real-world names
Current NLP techniques are not trained to be applied to software
Understanding this connection between name and behavior is challenging for humans and tools
Goal: Study the structure, semantics, diversity, and generation of grammar patterns, including
establishing and exploring the common and diverse grammar pattern structures found in identifiers

METHODOLOGY
Manually curated gold set of grammar patterns
• 20 open-source systems (Java, C++, C)
• Statistically significant sample of 1,335 identifier names (95% confidence level; 6% confidence interval)
• class names, function names, parameter names & attribute names
• Annotated and reviewed by the authors – every identifier reviewed by two annotators
• assigned part-of-speech tags for each word in the name
• Comparison against 3 part-of-speech taggers (Stanford, SWUM, POSSE)
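The slide does not say how the sample size was derived. One common choice, assumed here purely for illustration, is Cochran's formula with an optional finite-population correction; the function name and defaults below are hypothetical.

```python
import math


def sample_size(z=1.96, margin=0.06, p=0.5, population=None):
    """Cochran's sample-size formula.

    z: z-score for the confidence level (1.96 for 95%).
    margin: margin of error (0.06 for a 6% confidence interval).
    p: assumed proportion; 0.5 is the most conservative choice.
    population: if given, apply the finite-population correction.
    """
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    if population is not None:
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)


print(sample_size())
```

Under these assumptions a large population needs 267 sampled items to reach a 95% confidence level with a 6% margin of error; smaller populations need fewer.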

Identified five patterns by looking at how frequently they occurred in the annotated dataset
• Most common: noun phrase (NM+ N) pattern
• Most common for methods: verb phrase (V NM+ N)
SWUM had the most agreement with the annotated dataset, followed by POSSE and Stanford
• SWUM: 67.8%, POSSE: 24.7%, Stanford: 26.6%
Part-of-speech taggers still require significant improvements to be effective on identifiers
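Agreement with the annotated dataset can be expressed as the share of identifiers for which a tagger's grammar pattern equals the gold pattern. The sketch below assumes agreement is measured on whole patterns rather than on individual words, which the slide does not specify; the sample data is invented for illustration.

```python
def pattern_agreement(gold, predicted):
    """Percentage of gold identifiers whose predicted pattern matches exactly."""
    matches = sum(
        1 for name, pattern in gold.items() if predicted.get(name) == pattern
    )
    return 100.0 * matches / len(gold)


# Invented example data: four annotated names and one tagger's output.
gold = {
    "dynamic_Table_Index": "NM NM N",
    "save_As_Quadratic_Png": "V P NM N",
    "to_string": "P N",
    "sort": "V",
}
tagger_output = {
    "dynamic_Table_Index": "NM NM N",
    "save_As_Quadratic_Png": "V P NM N",
    "to_string": "P N",
    "sort": "N",  # a prose-trained tagger may read "sort" as a noun
}
print(pattern_agreement(gold, tagger_output))
```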

• Construction of a specialized identifier name part-of-speech tagger
• Incorporation of common grammar patterns for each identifier type in name appraisals and recommendations
int dynamic_Table_Index;
dynamic: Noun Modifier (NM), Table: Noun Modifier (NM), Index: Noun (N)
void save_As_Quadratic_Png();
save: Verb (V), As: Preposition (P), Quadratic: Noun Modifier (NM), Png: Noun (N)
Noun Phrase – Common for non-function identifiers that are
neither collections nor boolean types
Verb Phrase – Common for function identifiers or
identifiers with a boolean type

Using grammar patterns to interpret test
method name evolution
Peruma, A., Hu, E., Chen, J., AlOmar, E. A., Mkaouer, M. W., & Newman, C. D. (2021, May). Using grammar patterns to interpret test method name
evolution. In 2021 IEEE/ACM 29th International Conference on Program Comprehension (ICPC) (pp. 335-346). IEEE.
INTRODUCTION & GOAL
The purpose of unit test code differs from production code – therefore, so do their identifier names
Test method names are constructed to describe both the entity that is being tested and the
actions taken by the test
Most existing studies focus on production code (and identifier names) → findings do not generalize
to test code
Goal: Understand how test method names are structured, how they evolve in structure and meaning,
and how the structure/meaning of these names relates to statically-verifiable code behavior

METHODOLOGY
• RefactoringMiner – to mine 28 refactoring operation types in the source code
• Test Suites – 12,010 JUnit test files that had undergone a Rename Method refactoring
• Manual Annotation – part-of-speech tags for a statistically significant sample of 632 random test
method names were annotated by the authors
• Rename Taxonomy – determines the type of form and semantic change to an identifier’s name

• Test names have a structure that differs from production names; some of this structure can be leveraged
to provide test-specific recommendations
• Existence of grammar patterns that include determiners, prepositions, and adverbs (e.g., +VM+, +DT+, N
V+, V V N P+)
• Methods with a noun phrase grammar pattern (N) are extremely rare; hence, such names are likely poor test method names
• Grammar pattern prefixes are stable; they do not change very often during rename activities.
• Test method name refactorings tend to change the meaning of terms in the name; this contrasts with
production names, which tend to narrow in meaning

• Code quality tools/techniques should treat test identifiers differently from production identifiers
• Incorporation of common test method grammar patterns when appraising and recommending names
Verb Verb
Noun
Dual Verb Phrase With Preposition – Preposition
identifies the relationship between the nouns
Divided Verb Phrase – Verb enclosed within nouns, where
the verb is the action applied to the secondary noun
Verb Noun Preposition
Verb Noun Noun

Tool Development
Naming Violation Detection
• Detects 19 types of linguistic anti-patterns
• Provides an explanation of the violation
• Analyzes C# & Java source code
• Supports project-specific customizations
• Average precision: 75.27%
• Open-source
Ensemble Part-of-Speech Tagger
• Tagger uses machine-learning and the output
from multiple part-of-speech taggers to
annotate natural language text
• The ensemble uses three state-of-the-art part-
of-speech taggers: SWUM, POSSE, and
Stanford
• Accuracy of 86%; Outperforms Stanford by
51%
Peruma, A., Arnaoudova, V., & Newman, C. D. (2021, September). Ideal:
An open-source identifier name appraisal tool. In 2021 IEEE
International Conference on Software Maintenance and Evolution
(ICSME) (pp. 599-603). IEEE.
Newman, C. D., Decker, M. J., Alsuhaibani, R., Peruma, A., Mkaouer, M.,
Mohapatra, S., ... & Hill, E. (2021). An ensemble approach for
annotating source code identifiers with part-of-speech tags. IEEE
Transactions on Software Engineering.
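The idea of combining several taggers can be illustrated with a per-word majority vote. This is a deliberately simpler technique than the one described above: the Ensemble Tagger applies machine learning to the taggers' output, whereas the sketch below only counts votes, and it assumes exactly three taggers that each return one tag per word.

```python
from collections import Counter


def majority_vote(tag_sequences, fallback_index=0):
    """Combine per-word tags from three taggers by majority vote.

    tag_sequences: one list of tags per tagger, all of equal length.
    fallback_index: tagger to trust when all three disagree.
    """
    combined = []
    for tags in zip(*tag_sequences):
        tag, count = Counter(tags).most_common(1)[0]
        # With three voters, a count above 1 is a strict majority.
        combined.append(tag if count > 1 else tags[fallback_index])
    return combined


# Invented outputs for save_As_Quadratic_Png from three taggers.
tagger_a = ["V", "P", "NM", "N"]
tagger_b = ["V", "P", "N", "N"]
tagger_c = ["N", "P", "NM", "N"]
print(majority_vote([tagger_a, tagger_b, tagger_c]))
```

Here each tagger makes at most one mistake on a different word, so the vote recovers V P NM N even though only one input sequence is fully correct.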

RQ 1: How effectively, in terms of
correctness, can grammar patterns
be automatically generated for
identifier names?

Common identifier naming patterns
• Prepositional w/ Noun, NM* N P NM* (N|NPL): prepositional phrase with leading noun phrase. Example: long query_Timeout_In_Milliseconds; (NM N P NPL)
• Noun w/ Determiner, V* DT NM* (N|NPL): noun phrase with leading determiner. Example: String[] all_Open_Indices; (DT NM NPL)
• Prepositional w/ Verb, V P NM* (N|NPL): prepositional phrase with leading verb. Example: string convert_to_php_namespace(); (V P NM N)
• Prepositional Phrase, P NM* (N|NPL): a noun or verb-phrase with a leading preposition. Example: String to_string(); (P N)
• Plural Noun Phrase, NM* NPL: identical to Noun Phrase, except the head-noun is plural. Example: String[] method_Name_Prefixes; (NM NM NPL)
• Verb Phrase, V NM* (N|NPL): the addition of a verb to a noun phrase creates a verb phrase. Example: bool create_metadata_array(); (V NM N)
• Noun Phrase, NM* N: zero or more noun-modifiers appear to the left of a head-noun. Example: int dynamic_Table_Index; (NM NM N)
• Verb Pattern, V+: one or more verbs with no noun phrase. Example: void sort(); (V)
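Because each named pattern is a regular expression over tags, a tag sequence can be classified with ordinary regex matching. The sketch below is one possible encoding, not the author's implementation: every tag is followed by a space so that N cannot accidentally match the start of NM or NPL, and the pairing of pattern names to expressions is taken from the examples on this slide.

```python
import re

# (label, regex over a tag string in which every tag ends with a space)
NAMED_PATTERNS = [
    ("Noun Phrase", r"(?:NM )*N "),
    ("Plural Noun Phrase", r"(?:NM )*NPL "),
    ("Verb Phrase", r"V (?:NM )*(?:N|NPL) "),
    ("Verb Pattern", r"(?:V )+"),
    ("Prepositional Phrase", r"P (?:NM )*(?:N|NPL) "),
    ("Prepositional w/ Verb", r"V P (?:NM )*(?:N|NPL) "),
    ("Prepositional w/ Noun", r"(?:NM )*N P (?:NM )*(?:N|NPL) "),
    ("Noun w/ Determiner", r"(?:V )*DT (?:NM )*(?:N|NPL) "),
]


def classify_pattern(tags):
    """Return the first named pattern that matches the whole tag sequence."""
    text = "".join(tag + " " for tag in tags.split())
    for label, regex in NAMED_PATTERNS:
        if re.fullmatch(regex, text):
            return label
    return None


print(classify_pattern("NM N P NPL"))
```

Sequences that match none of the eight expressions return None, which a tool could treat as a prompt for closer review rather than as an error.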

Common identifier naming rules
CLASS
• (Plural) Noun Phrase: NM* N(PL) (e.g., class StringUtility)
METHOD
• Verb Phrase/Pattern: V NM* N(PL) | V+
• Event Handler or Casting: (.*) P NM* N(PL)
• Looping or Find/Contains: V* DT NM* N(PL)
VARIABLE & PARAMETER
• Bool Type: V NM* N(PL)
• Non-Collection Type: NM* N
• Collection Type: NM* NPL
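These rules can drive a simple appraisal: given the kind of identifier and its grammar pattern, check whether any rule for that kind matches. The following is a sketch under assumptions: the kind labels ("class", "method", "bool", "non_collection", "collection") are hypothetical keys, and the tag string is normalized so that every tag ends with a space.

```python
import re

HEAD_NOUN = r"(?:N|NPL) "  # N(PL): singular or plural head noun

# Hypothetical rule table keyed by identifier kind.
RULES = {
    "class": [r"(?:NM )*" + HEAD_NOUN],
    "method": [
        r"V (?:NM )*" + HEAD_NOUN,            # verb phrase
        r"(?:V )+",                           # verb pattern
        r"(?:\S+ )*P (?:NM )*" + HEAD_NOUN,   # event handler or casting
        r"(?:V )*DT (?:NM )*" + HEAD_NOUN,    # looping or find/contains
    ],
    "bool": [r"V (?:NM )*" + HEAD_NOUN],
    "non_collection": [r"(?:NM )*N "],
    "collection": [r"(?:NM )*NPL "],
}


def follows_rule(kind, tags):
    """Return True if the tag sequence matches any rule for the kind."""
    text = "".join(tag + " " for tag in tags.split())
    return any(re.fullmatch(regex, text) for regex in RULES[kind])


print(follows_rule("collection", "NM NM N"))
```

For example, a collection-typed variable tagged NM NM N fails its rule because the head noun is singular, which is the anti-pattern shown earlier in the deck.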

Analyzing the quality of names using grammar patterns
Identifier Phrase Structure != Human Language Phrase Structure
Off-the-shelf NLP tools underperform when analyzing source code
It is challenging to automatically determine the meaning of words in an identifier
and how these words interact with one another
Grammar patterns allow a more efficient analysis by broadly categorizing words
into their corresponding part-of-speech
The Ensemble Tagger is a specialized part-of-speech tagger with a high accuracy
and outperforms state-of-the-art taggers
Developers mostly agree with the proposed grammar pattern heuristics to
appraise identifier names

RQ 2: To what extent did the
automated identifier naming
mechanism positively or negatively
influence naming practices?

Plugin for IntelliJ IDEA
• Construction of an IntelliJ plugin that provides real-time appraisals and
recommendations for identifier grammar patterns
• Utilizes the part-of-speech tagger to generate the identifier’s part-of-speech tags
• The tagger is exposed as a web service that is called from the plugin

[Screenshot: IntelliJ IDEA plugin. A squiggly line indicates a naming problem; panels show the selected identifier, the problem summary, the detected and recommended grammar pattern, the problem explanation, and a summary of all naming problems in the file.]

IDE plugin user study
• User study with undergraduate and
graduate students
• 20 participants in total
• Two groups of equal size:
• Group A – utilized the plugin
• Group B – control group
• Participants reviewed pre-defined code snippets in
IntelliJ IDEA and corrected identifier
names
• Code snippets included string
manipulation methods and a simple
object-oriented program
• Pre- and Post-questionnaire

Quantitative participant feedback
• 80% of participants rated the priority they place on part-of-speech tags as either High Priority or Essential
• 80% of participants rated the convenience of having a grammar pattern recommendation tool as either Convenient or Very Convenient
• 70% of participants rated their ability to interpret the recommendations as either Easy or Very Easy
• 90% of participants rated the accuracy of the plugin's recommendations as either Satisfied or Very Satisfied

Qualitative participant feedback
NEGATIVE
• IDE at times is slow or hangs
• The plugin occasionally takes time to
update
• Part-of-speech tags are not known to
everyone
• Not all recommendations are accurate
POSITIVE
• The plugin forces the user to think about
the quality of the identifier's name
• Ensures consistency in identifier naming
• Good resource for novice developers
• Explanation and examples are helpful
• Most of the recommendations were
satisfactory
ENHANCEMENTS
• More examples on recommended patterns
• Definitions for part-of-speech tags
• The UI can improve to make it easier to
navigate to identifiers in the code

RQ 3: What are the primary
challenges in appraising and
recommending the semantic
structure of identifier names, and
how can these be improved?

Types of challenges encountered conducting this research
Tools and
Technologies
Prior Research
Studies

Lack of specialized tools for software artifacts
Due to the diversity of systems, not all name appraisal and recommendation tools incorporate all
naming rules – leading to inaccurate results (i.e., not a one-stop solution)
The Ensemble Tagger misannotates specific grammar patterns – performs poorly on names
having preambles and elongated verb phrase patterns
Names are diverse and subjective – context plays an essential part in evaluating the quality of the
name; context lies in the code surrounding the identifier
Existing well-established NLP tools (e.g., WordNet, NLTK) perform poorly on software engineering
artifacts, such as source code
Current code quality tools/approaches (e.g., Checkstyle) focus on the styling of a name
Rename recommendation models are prescriptive – they do not provide a rationale for the
recommendation

Dearth of empirical data
Most studies focus on the readability of a name --- e.g., readability models look at name styling or
readability of entire files
Readability does not always correlate to understandability
Readable names might not accurately reflect intended behavior
Developers are diverse – experience impacts naming and comprehension activities
Lack of empirical studies on how developers structure and comprehend identifier names
Names are composed of diverse words – including abbreviations, acronyms and digits
These tokens also convey meanings, but studies on how and why they are used by developers are
lacking and therefore inhibit our overall understanding of a name

Summary of overall research findings
01 Grammar Pattern Name Appraisal & Recommendation
At a conceptual level, grammar patterns reflect both the linguistic and program behavior and make it possible to provide accurate name appraisals and recommendations
02 Developer Workflow Integration
Developers find an IDE plugin incorporating grammar pattern name appraisals and recommendations both valuable and useable
03 Primary Challenges With Grammar Patterns
Current tools/technologies have shortcomings and do not provide a one-stop solution; a dearth of empirical studies hinders the comparison of findings

Expanding the knowledge on identifier naming
Pattern Detection
• Detect patterns in specific types of systems/code
• Further the understanding of name-code relationships
Developer Experience
• Insight from professional and novice developers on the characteristics of high-quality names
Tool Development
• Incremental improvements to existing tools
• Improving NLP techniques to better understand code

Summary
• High-quality identifiers are essential for program comprehension
• I study the evolution of names and investigate their relationship with statically detectable code behavior
• My work provides developers with tools to craft and maintain high-quality identifier names in their projects
• This is a long-term initiative that will continue post-graduation and into my academic career

Acknowledgements
Advisor: Dr. Christian Newman
Committee: Dr. Mohamed Mkaouer, Dr. Mehdi Mirakhorli, Dr. Marcos Zampieri
Chair: Dr. Robert Glick
Faculty & Staff: Department of Computing and Information Sciences, Department of Software Engineering
Collaborators, Colleagues & SE Sr. Design Teams
Friends & Family
Anthony Peruma
https://www.peruma.me
