Put Your Hands in the Mud: What Technique, Why, and How

PutYour Hands in the Mud:
What Technique,
Why, and How?
Massimiliano Di Penta
University of Sannio, Italy
dipenta@unisannio.it
http://www.ing.unisannio.it/mdipenta

Outline
MUD of software repositories
Available techniques: 
Pattern matching 
Island parsers 
IR algebraic methods 
Natural Language Parsing
Choosing the solution that ﬁts your needs

Textual Analysis very
successful in SE...

Why?
Code identiﬁers and comments reﬂect
semantics
Software repositories contain a mix of
structured and unstructured data

(Some) Applications
Traceability Link Recovery
Feature Location
Clone Detection
Conceptual Cohesion/Coupling Metrics
Bug prediction
Bug triaging
Software remodularization
Textual smells/antipatterns
Duplicate bug report detection
Software artifact summarization and labeling
Generating assertions from textual descriptions
....

Different kinds of
Repositories
Different kinds of
Repositories

Versioning Systems
Besides code itself, the main source of
unstructured data is represented by
commit messages

Example
• Add Checkclipse preferences to all projects
so Checkstyle is preconﬁgured
• Fix for issue 4503.
• Fix for issue 4517: "changeability" on
association ends not displayed properly.
This was never adapted since UML 1.3 &
NSUML were replaced.
What kind of problem do you see here?

Be aware!
• Commit notes are often insufﬁcient to
know everything about a change
• Need to merge with issue tracker data

Issue Trackers
• Contain a mix of structured and non-
structured data
• Structured: classiﬁcation, priority, severity,
status, but also stack traces and code
snippets
• Unstructured: issue description, comments
• Often natural language mixed with method,
class, package names, etc.

Challenges
• Including comments in the text corpus may
or may not add useful information
• The mix of natural language and code
elements may challenge the application of
techniques such as natural language parsing

Techniques
Nicolas Bettenburg, Stephen W. Thomas, Ahmed E. Hassan: Using Fuzzy Code Search to Link Code
Fragments in Discussions to Source Code. CSMR 2012: 319-328
Using Fuzzy Code Search to Link Code Fragments
in Discussions to Source Code
Nicolas Bettenburg, Stephen W. Thomas, Ahmed E. Hassan
Software Analysis and Intelligence Lab (SAIL)
School of Computing, Queen’s University
Kingston, Ontario, Canada
Email: {nicbet, sthomas, ahmed}@cs.queensu.ca
Abstract—When discussing software, practitioners often ref-
erence parts of the project’s source code. Such references
have different motivations, such as mentoring and guiding less
experienced developers, pointing out code that needs changes,
or proposing possible strategies for the implementation of future
changes. The fact that particular parts of a source code are being
discussed makes these parts of the software special. Knowing
which code is being talked about the most can not only help
practitioners to guide important software engineering and main-
tenance activities, but also act as a high-level documentation of
development activities for managers. In this paper, we use clone-
detection as specific instance of a code search based approach
for establishing links between code fragments that are discussed
by developers and the actual source code of a project. Through
a case study on the Eclipse project we explore the traceability
links established through this approach, both quantitatively and
qualitatively, and compare fuzzy code search based traceability
linking to classical approaches, in particular change log analysis
and information retrieval. We demonstrate a sample application
of code search based traceability links by visualizing those parts
of the project that are most discussed in issue reports with
a Treemap visualization. The results of our case study show
that the traceability links established through fuzzy code search-
Past approaches to uncovering traceability links between doc-
umentation and source code are commonly based on in-
formation retrieval [2], [15], [18], [20], natural language
processing [1], [19] and lightweight textual analyses [5], [11],
[12]. Each approach, however, is tailored towards a specific
set of goals and use cases. For example, when linking code
changes to issue reports by analyzing transaction logs [28],
we observe only the associations between a bug report and
the final locations of the bug fix, but miss the bug fixing
history: all the locations that a developer had to investigate
and understand before he could find an appropriate way to fix
the error.
In this paper, we aim to find traceability links between
issue reports and source code. For this purpose, we propose
a new approach that uses token-based clone detection as an
implementation of fuzzy code search for discovering links
between code fragments mentioned in project discussions and
the location of these fragments in the source code body of a
software system. In a case study on the ECLIPSE project, we
Using Fuzzy Code Search to Link Code Fragments
in Discussions to Source Code
Nicolas Bettenburg, Stephen W. Thomas, Ahmed E. Hassan
Software Analysis and Intelligence Lab (SAIL)
School of Computing, Queen’s University
Kingston, Ontario, Canada
Email: {nicbet, sthomas, ahmed}@cs.queensu.ca
Abstract—When discussing software, practitioners often ref-Abstract—When discussing software, practitioners often ref-Abstract
erence parts of the project’s source code. Such references
have different motivations, such as mentoring and guiding less
experienced developers, pointing out code that needs changes,
or proposing possible strategies for the implementation of future
changes. The fact that particular parts of a source code are being
discussed makes these parts of the software special. Knowing
which code is being talked about the most can not only help
practitioners to guide important software engineering and main-
tenance activities, but also act as a high-level documentation of
Past approaches to uncovering traceability links between doc-
umentation and source code are commonly based on in-
formation retrieval [2], [15], [18], [20], natural language
processing [1], [19] and lightweight textual analyses [5], [11],
[12]. Each approach, however, is tailored towards a specific
set of goals and use cases. For example, when linking code
changes to issue reports by analyzing transaction logs [28],
we observe only the associations between a bug report and

Emails
Similar characteristics of issue trackers,
with less structure
From lisa at usna.navy.MIL Thu Jul 1 15:42:27 1999
From: lisa at usna.navy.MIL (Lisa Becktold {CADIG STAFF})
Date: Tue Dec 2 03:03:10 2003
Subject: nmbd/nmbd_processlogon.c - CODE 12???
Message-ID: <99Jul1.114236-0400edt.4995-357+39@jupiter.usna.navy.mil
Hi:
I have Samba 2.1.0-prealpha running on a Sun Ultra 10. My NT PC
joined
the domain without a problem, but I can't logon. Every time I attempt
to log into the Samba domain, the NT screen blanks as it usually does
during login, but then the "Begin Login" screen reappears.
I see this message in my samba/var/log.nmb file whenever I try to
login:

Mining Forums:
Challenges
• The usual problems you face with issue
trackers and emails are still there
• However in general the separation between
different elements is much more consistent

However, this
convention is not
always followed!

Not all reviews are equal
AR-Miner: Mining Informative Reviews for Developers from
Mobile App Marketplace
Ning Chen, Jialiu Lin†
, Steven C. H. Hoi, Xiaokui Xiao, Boshen Zhang
Nanyang Technological University, Singapore, †
Carnegie Mellon University, USA
{nchen1,chhoi,xkxiao,bszhang}@ntu.edu.sg, †
jialiul@cs.cmu.edu
ABSTRACT
With the popularity of smartphones and mobile devices, mo-
bile application (a.k.a. “app”) markets have been growing
exponentially in terms of number of users and download-
s. App developers spend considerable e↵ort on collecting
and exploiting user feedback to improve user satisfaction,
but su↵er from the absence of e↵ective user review ana-
lytics tools. To facilitate mobile app developers discover
the most “informative” user reviews from a large and rapid-
ly increasing pool of user reviews, we present “AR-Miner”
— a novel computational framework for App Review Min-
ing, which performs comprehensive analytics from raw user
reviews by (i) first extracting informative user reviews by
filtering noisy and irrelevant ones, (ii) then grouping the in-
formative reviews automatically using topic modeling, (iii)
further prioritizing the informative reviews by an e↵ective
review ranking scheme, (iv) and finally presenting the group-
s of most “informative” reviews via an intuitive visualization
approach. We conduct extensive experiments and case s-
tudies on four popular Android apps to evaluate AR-Miner,
intense, in order to seize the initiative, developers tend to
employ an iterative process to develop, test, and improve
apps [23]. Therefore, timely and constructive feedback from
users becomes extremely crucial for developers to fix bugs,
implement new features, and improve user experience ag-
ilely. One key challenge to many app developers is how to
obtain and digest user feedback in an e↵ective and e cient
manner, i.e., the “user feedback extraction” task. One way
to extract user feedback is to adopt typical channels used
in traditional software development, such as (i) bug/change
repositories (e.g., Bugzilla [3]), (ii) crash reporting systems
[19], (iii) online forums (e.g., SwiftKey feedback forum [6]),
and (iv) emails [10].
Unlike the traditional channels, modern app marketplaces,
such as Apple App Store and Google Play, o↵er a much eas-
ier way (i.e., the web-based market portal and the market
app) for users to rate and post app reviews. These reviews
present user feedback on various aspects of apps (such as
functionality, quality, performance, etc), and provide app
developers a new and critical channel to extract user feed-
AR-Miner: Mining Informative Reviews for Developers from
Mobile App Marketplace
Ning Chen, Jialiu Lin†
, Steven C. H. Hoi, Xiaokui Xiao, Boshen Zhang
Nanyang Technological University, Singapore, †
Carnegie Mellon University, USA
{nchen1,chhoi,xkxiao,bszhang}@ntu.edu.sg, †
jialiul@cs.cmu.edu
ABSTRACT
With the popularity of smartphones and mobile devices, mo-
bile application (a.k.a. “app”) markets have been growing
exponentially in terms of number of users and download-
s. App developers spend considerable e↵ort on collecting↵ort on collecting↵
and exploiting user feedback to improve user satisfaction,
but su↵er from the absence of e↵er from the absence of e↵ ↵ective user review ana-↵ective user review ana-↵
lytics tools. To facilitate mobile app developers discover
the most “informative” user reviews from a large and rapid-
ly increasing pool of user reviews, we present “AR-Miner”
intense, in order to seize the initiative, developers tend to
employ an iterative process to develop, test, and improve
apps [23]. Therefore, timely and constructive feedback from
users becomes extremely crucial for developers to fix bugs,
implement new features, and improve user experience ag-
ilely. One key challenge to many app developers is how to
obtain and digest user feedback in an e↵ective and e↵ective and e↵ cient
manner, i.e., the “user feedback extraction” task. One way
to extract user feedback is to adopt typical channels used
in traditional software development, such as (i) bug/change
repositories (e.g., Bugzilla [3]), (ii) crash reporting systems
Ning Chen, Jialiu Lin, Steven C. H. Hoi, Xiaokui Xiao, Boshen Zhang:AR-miner: mining
informative reviews for developers from mobile app marketplace. ICSE 2014: 767-778

Informative reviews…
Functional None of the pictures will load in my news feed.
Performance
It lags and doesn't respond to my touch which almost
always causes me to run into stuff.
Feature (change)
request
Amazing app, although I wish there were more
themes to choose from
Please make it a little easy to get bananas please and
make more power ups that would be awesome.
Remove ads So many ads its unplayable!
Fix permissions
This game is adding for too much unexplained
permissions.

Non-Informative reviews
Purely emotional
Great fun can't put it down!
This is a crap app.
Description
of app/actions
I have changed my review from 2 star to 1 star.
Unclear issue
description
Bad game this is not working on my phone.
Questions How can I get more points?

@ICSME 2015  
Thurdsay, 13.50 - Mobile applications
User Reviews Matter! Tracking Crowdsourced
Reviews to Support Evolution of Successful Apps
Fabio Palomba⇤, Mario Linares-Vásquez§, Gabriele Bavota†, Rocco Oliveto‡,
Massimiliano Di Penta¶, Denys Poshyvanyk§, Andrea De Lucia⇤
⇤University of Salerno, Fisciano (SA), Italy – §The College of William and Mary, VA, USA
†Free University of Bozen-Bolzano, Bolzano (BZ), Italy – ‡University of Molise, Pesche (IS), Italy
¶University of Sannio, Benevento, Italy
Abstract—Nowadays software applications, and especially mo-
bile apps, undergo frequent release updates through app stores.
After installing/updating apps, users can post reviews and provide
ratings, expressing their level of satisfaction with apps, and
possibly pointing out bugs or desired features. In this paper
we show—by performing a study on 100 Android apps—how
applications addressing user reviews increase their success in
terms of rating. Specifically, we devise an approach, named
CRISTAL, for tracing informative crowd reviews onto source
code changes, and for monitoring the extent to which developers
accommodate crowd requests and follow-up user reactions as
reflected in their ratings. The results indicate that developers
implementing user reviews are rewarded in terms of ratings. This
poses the need for specialized recommendation systems aimed at
analyzing informative crowd reviews and prioritizing feedback
to be satisfied in order to increase the apps success.
systems, and network conditions that may not necessarily be
reproducible during development/testing activities.
Consequently, by reading reviews and analyzing the ratings,
development teams are encouraged to improve their apps, for
example, by fixing bugs or by adding commonly requested
features. According to a recent Gartner report’s recommenda-
tion [21], given the complexity of mobile testing “development
teams should monitor app store reviews to identify issues that
are difficult to catch during testing, and to clarify issues that
cause problems on the users’s side”. Moreover, useful app
reviews reflect crowd-based needs and are a valuable source
of comments, bug reports, feature requests, and informal user
experience feedback [7], [8], [14], [19], [22], [27], [29].
In this paper we investigate to what extent app development
User Reviews Matter! Tracking Crowdsourced
Reviews to Support Evolution of Successful Apps
Fabio Palomba⇤, Mario Linares-Vásquez§, Gabriele Bavota†, Rocco Oliveto‡,
Massimiliano Di Penta¶, Denys Poshyvanyk§, Andrea De Lucia⇤
⇤University of Salerno, Fisciano (SA), Italy – §The College of William and Mary, VA, USA
†Free University of Bozen-Bolzano, Bolzano (BZ), Italy – ‡University of Molise, Pesche (IS), Italy
¶University of Sannio, Benevento, Italy
Abstract—Nowadays software applications, and especially mo-Abstract—Nowadays software applications, and especially mo-Abstract
bile apps, undergo frequent release updates through app stores.
After installing/updating apps, users can post reviews and provide
ratings, expressing their level of satisfaction with apps, and
possibly pointing out bugs or desired features. In this paper
we show—by performing a study on 100 Android apps—how
applications addressing user reviews increase their success in
terms of rating. Specifically, we devise an approach, named
CRISTAL, for tracing informative crowd reviews onto source
code changes, and for monitoring the extent to which developers
accommodate crowd requests and follow-up user reactions as
systems, and network conditions that may not necessarily be
reproducible during development/testing activities.
Consequently, by reading reviews and analyzing the ratings,
development teams are encouraged to improve their apps, for
example, by fixing bugs or by adding commonly requested
features. According to a recent Gartner report’s recommenda-
tion [21], given the complexity of mobile testing “development
teams should monitor app store reviews to identify issues that
are difficult to catch during testing, and to clarify issues that
cause problems on the users’s side”. Moreover, useful app

So what we get?
EmailsEmailsEmailsEmails
Versioning
Issue
Reports

Continuous interleaving of structured and
unstructured data
Format not always consistent
Technical/domain terms, abbreviations,
acronyms
Short documents, documents having
different size

Mining unstructured
data: techniques

Different techniques
Techniques based on string / regexp
matching
Information Retrieval Models
Natural Language Parsing
Island and lake parsers

Pattern/Regular
Expression Matching
Pattern/Regular
Expression Matching

Pattern matching:
when and where
• You’re not interested to the whole
document content
• Rather, you want to match some keywords
• Very few variants
• Context insentitiveness

Basic tools you might
want to use
• Unix command line tools: wc, cut, sort, uniq,
head, tail
• Regular expression engines: grep, sed, awk
• Scripting Languages: Perl, Python
• General purposes programming languages
with regexp support: 
java.util.regexp (similar to Perl regexp)

Pattern Matching in Perl
• The =~ or !~ searches a string for a pattern
• /PATTERN/ deﬁne a pattern
• e.g. if($a=~/PATTERN/i){...} or  
if($a!~/PATTERN){...}
• s/PATTERN/REPLACEMENT/egimosx
• e.g. $a=~s/PATTERN/REPLACEMENT/g;

Special Symbols
w Match a "word" character
(alphanumeric plus "_")
W Match a non-word character
s Match a whitespace character
S Match a non-whitespace character
d Match a digit character
D Match a non-digit character

Assertions
b Match a word boundary
B Match a non-(word boundary)
A Match only at beginning of string
Z Match only at end of string, or before
newline at the end
z Match only at end of string

Matching issue ids
“fix 367920 setting pop3 messages as junk/not junk
ignored when message quarantining turned on sr=mscott
$note=~/Issue (d+)/i || $note=~/Issue number:s+(d
+)/i || $note=~/Defects+(d+)/i $note=~/Fixs+(d+)/i

Mapping emails onto
classes/files
• Emails often refer classes / source files
• One can trace email to file using IR-based
traceability
• In the end simple regexp matching is more
precise and efficient
Alberto Bacchelli, Michele Lanza, Romain Robbes:  
Linking e-mails and source code artifacts. ICSE (1) 2010: 375-384

Example (Apache httpd)
Hi,
I'm running SunOS4.1.3 (with vif kernel hacks) and the Apache server.
www.xyz.com is a CNAME alias for xyz.com.
<VirtualHost www.xyz.com etc., is in my httpd.conf file.
Accessing http://www.xyz.com *fails* (actually, it brings up the default
DocumentRoot).
Accessing http://xyz.com *succeeds*, however, bringing up theVirtual
DocumentRoot.
The get_local_addr() function in the Apache util.c file seems to return
the wrong socket name, but I can't figure out why.
Any help is greatly appreciated!

(Simpliﬁed) Perl implementation
#! /usr/bin/perl
@fileNames=`cat filelist.txt`; #content of filelist.txt goes into array @fileNames
%hashFiles=();
foreach $f(@fileNames){
chomp($f);
$hashFiles{$f}++; #populates hashtable %hashFiles
}
while(<>){
$l=$_;
chomp($l);
#matches filename composed of anum chars, ending with .c and word boundary (b)
while($l=~/(w+.cb)/){
$name=$1;
if(defined($hashFiles{$name})){ #checks if the filename is in the hash table
print "Mail mapped to file $namen";
}
$l=$'; #assigns post matching to $l
}
}
while(<>){
$l=$_;
chomp($l);
#matches filename composed of anum chars, ending with .c and word boundary (b)
while($l=~/(w+.cb)/){

RegExp: Pros and Cons
• Often the easiest solution
• Fast and precise enough
• Unsupervised
• Pattern variants can lead to 
misclassiﬁcations
• Context-insensitive
✘
✔
✔
✔
✘

Key Ingredients
• Often ﬁnding the right pattern is not
straight-forward
• A lot of manual analysis might be needed
• Iterative process

General Ideas
Island parser: ignore everything (lake) until
you find a specific grammar rule (island)
Lake parser: parse everything (lake) unless
you find something you’re not able to
match a rule (island)

Traditional applications
Leon Moonen: Generating Robust Parsers Using Island Grammars. WCRE 2001: 13-

free text
error log
free text
source code
free text
stack trace
Applications to MUD

Approach
Alberto Bacchelli, Tommaso Dal Sasso, Marco D'Ambros, Michele Lanza: Content classification of
development emails. ICSE 2012: 375-385
Content Classification of Development Emails
Alberto Bacchelli, Tommaso Dal Sasso, Marco D’Ambros, Michele Lanza
REVEAL @ Faculty of Informatics — University of Lugano, Switzerland
Abstract—Emails related to the development of a software
system contain information about design choices and issues
encountered during the development process. Exploiting the
knowledge embedded in emails with automatic tools is chal-
lenging, due to the unstructured, noisy and mixed language
nature of this communication medium. Natural language text
is often not well-formed and is interleaved with languages with
other syntaxes, such as code or stack traces.
We present an approach to classify email content at line
level. Our technique classifies email lines in five categories (i.e.,
text, junk, code, patch, and stack trace) to allow one to sub-
sequently apply ad hoc analysis techniques for each category.
We evaluated our approach on a statistically significant set
of emails gathered from mailing lists of four unrelated open
source systems.
Keywords-Empirical software engineering; Unstructured
Data Mining; Emails
I. Introduction
Software repositories supply information useful for sup-
porting software analysis and program comprehension [17].
Di↵erent repositories o↵er di↵erent perspectives on systems:
(e.g., IRC chat logs, mailing lists) require the most care, as
the documents leave complete freedom to the authors. For
example, Bettenburg et al. presented the risks of using email
data without a proper cleaning pre-processing phase [9].
NL documents are usually treated as bags of words–a
count of terms’ occurrences. This simplification is proven
to be e↵ective in the information retrieval (IR) field, where
techniques are tested on well-formed NL documents [25]. In
software engineering, although e↵ective for some tasks (e.g.,
traceability between documents and code [1]), this approach
reduces the quality, reliability, and comprehensibility of the
available information, as NL text is often not well-formed
and is interleaved with languages with di↵erent syntaxes:
code fragments, stack traces, patches, etc.
We present a work for advancing the analysis of the
contents of development emails. We argue that we should not
create a single bag with terms indiscriminately coming from
NL parts, code fragments, email signatures, patches, etc. and
treat them equally. We need to recognize every language in
an email to enable techniques exploiting the peculiarities of
Content Classification of Development Emails
Alberto Bacchelli, Tommaso Dal Sasso, Marco D’Ambros, Michele Lanza
REVEAL @ Faculty of Informatics — University of Lugano, Switzerland
Abstract—Emails related to the development of a softwareAbstract—Emails related to the development of a softwareAbstract
system contain information about design choices and issues
encountered during the development process. Exploiting the
knowledge embedded in emails with automatic tools is chal-
lenging, due to the unstructured, noisy and mixed language
nature of this communication medium. Natural language text
is often not well-formed and is interleaved with languages with
other syntaxes, such as code or stack traces.
We present an approach to classify email content at line
level. Our technique classifies email lines in five categories (i.e.,
text, junk, code, patch, and stack trace) to allow one to sub-
(e.g., IRC chat logs, mailing lists) require the most care, as
the documents leave complete freedom to the authors. For
example, Bettenburg et al. presented the risks of using email
data without a proper cleaning pre-processing phase [9].
NL documents are usually treated as bags of words–a
count of terms’ occurrences. This simplification is proven
to be e↵ective in the information retrieval (IR) field, where↵ective in the information retrieval (IR) field, where↵
techniques are tested on well-formed NL documents [25]. In
software engineering, although e↵ective for some tasks (↵ective for some tasks (↵ e.g.,

Alternative approach
Lines/paragraphs belonging to different
elements contain different proportions of
keywords, special characters, etc.
Alberto Bacchelli, Marco D'Ambros, Michele Lanza: Extracting Source Code from E-Mails. ICPC 2010: 24-33
Extracting Source Code from E-Mails
Alberto Bacchelli, Marco D’Ambros, Michele Lanza
REVEAL@ Faculty of Informatics - University of Lugano, Switzerland
Abstract—E-mails, used by developers and system users to
communicate over a broad range of topics, offer a valuable
source of information. If archived, e-mails can be mined to
support program comprehension activities and to provide views
of a software system that are alternative and complementary
to those offered by the source code.
However, e-mails are written in natural language, and
therefore contain noise that makes it difficult to retrieve the
important data. Thus, before conducting an effective system
analysis and extracting data for program comprehension, it is
necessary to select the relevant messages, and to expose only
the meaningful information.
In this work we focus both on classifying e-mails that hold
fragments of the source code of a system, and on extracting the
source code pieces inside the e-mail. We devised and analyzed a
number of lightweight techniques to accomplish these tasks. To
assess the validity of our techniques, we manually inspected and
annotated a statistically significant number of e-mails from five
unrelated open source software systems written in Java. With
such a benchmark in place, we measured the effectiveness of
each technique in terms of precision and recall.
about a system. E-mails are used to discuss issues ranging
from low-level decisions (e.g., implementation strategies, bug
fixing) up to high-level considerations (e.g., design rationale,
future planning). Our goal is to use this kind of artifacts to
enhance program comprehension and analysis.
We already devised techniques to recover the traceability
links between source code entities and e-mails [4]. However,
such ties can only be retrieved by knowing the entities of
the system in advance, i.e., by having the model of the
system generated from the source code. In this paper we
tackle a different problem: We want to extract source code
fragments from e-mail messages. To do this, we first need
to select e-mails that contain source code fragments, and
then we extract such fragments from the content in which
they are enclosed. Separating source code from natural
language in e-mail messages brings several benefits: The
access to structured data (1) facilitates the reconstruction
Extracting Source Code from E-Mails
Alberto Bacchelli, Marco D’Ambros, Michele Lanza
REVEAL@ Faculty of Informatics - University of Lugano, Switzerland
Abstract—E-mails, used by developers and system users toAbstract—E-mails, used by developers and system users toAbstract
communicate over a broad range of topics, offer a valuable
source of information. If archived, e-mails can be mined to
support program comprehension activities and to provide views
of a software system that are alternative and complementary
to those offered by the source code.
However, e-mails are written in natural language, and
therefore contain noise that makes it difficult to retrieve the
important data. Thus, before conducting an effective system
analysis and extracting data for program comprehension, it is
necessary to select the relevant messages, and to expose only
the meaningful information.
In this work we focus both on classifying e-mails that hold
about a system. E-mails are used to discuss issues ranging
from low-level decisions (e.g., implementation strategies, bug
fixing) up to high-level considerations (e.g., design rationale,
future planning). Our goal is to use this kind of artifacts to
enhance program comprehension and analysis.
We already devised techniques to recover the traceability
links between source code entities and e-mails [4]. However,
such ties can only be retrieved by knowing the entities of
the system in advance, i.e., by having the model of the
system generated from the source code. In this paper we
tackle a different problem: We want to extract source code

Learning hidden
languages
Luigi Cerulo, Massimiliano Di Penta, Alberto Bacchelli, Michele Ceccarelli, Gerardo Canfora:
Irish: A Hidden Markov Model to detect coded information islands in free text. Sci. Comput.
Program. 105: 26-43 (2015)

Hidden Markov Models
A language category is modeled as an HMM where hidden
states are language tokens
0.28
0.23
0.6
0.49
0.36
0.36
0.38
0.52
0.430.25
0.31
0.31
0.77
0.34
0.51
0.47
0.52
0.26
0.38
0.71
0.27
0.47
0.65
0.34
0.4
0.4
0.65
0.54
0.41
0.22
0.39
0.21
0.32
0.37
0.9
0.43
0.27
0.3
0.9
0.62
0.38
0.6
0.22
0.25
0.31
0.4
0.97
0.41
0.36
0.36
0.21
0.93
0.91
]
=

~
!
'
`
NUM
(
[
#
<
-
:
,
{
)
NEWLN
KEY
$
}
+
.
*
;
&
/
^
|
%
"
?
@
WORD
UNDSC
>
Fig. 4. A source code HMM trained on PosgreSQL source code (transition
probabilities less than 0.2 are not shown).
syntaxes. For example, in development mailing list messages
usually there is no more than one source code fragment in
a text message. If this is the case, we can assume, a priori,
the transition probability from natural language text to source
code approximated to 1/N, where N is the number of tokens
in the message. It could happens that the transition between
two language syntaxes could never occur. This may be the case
of stack traces and patches. By observing developers’ emails,
it is usual to notice that, after a stack trace is provided, the
resolution patch is introduced with natural language phrases,
such as “Here is the change that we made to fix the problem”.
This leads to a null transition probability from stack trace
to patch code. Other heuristics could be adopted to refine
the estimation of transition probabilities that reflects specific
properties or styles adopted [23].
IV. EMPIRICAL EVALUATION PROCEDURE
To evaluate the effectiveness of our approach we adopt
the Precision and Recall metrics, known respectively also
as Positive Predictive Value (PPV) and Sensitivity [24]. In
particular, for each language syntax, i, we compute:
such as opening parenthesis. Formally, the HMM state space
is defined as:
Q = { T XT , SRC}
where T XT = {WORDT XT , KEYT XT , . . . }, and SRC
= {WORDSRC, KEYSRC, . . . }. Each state emits the corre-
sponding alphabet symbol without subscript label T XT or
SRC. For example, the KEY symbol can be emitted by
KEYT XT or KEYSRC with a probability equal to 1. If the
probability for staying in a natural language text is p and
the probability of staying in source code text is q, then the
transition from a state in T XT to a state in SRC is 1 p,
instead the inverse transition is 1 q. The above defined HMM
emits the sequence of symbols observed in a text by evolving
through a sequence of states { 1, 2, . . . , i, i+1, . . . } with
the transition probabilities tkl defined as:
tkl = P( i = l| i 1 = k) · p, if k, l ⇤ T XT
tkl = P( i = l| i 1 = k) · q, if k, l ⇤ SRC
tkl =
1 p
| |
, if k ⇤ T XT , l ⇤ SRC
tkl =
1 q
| |
, if k ⇤ SRC, l ⇤ T XT
and the emission probabilities defined as:
ekb = 1, if k = bT XT or k = bSRC, otherwise 0.
Fig. 2 shows the global HMM composed by two sub-HMM,
one modeling natural language text and another modeling
source code. Fig. 3 and Fig. 4 show their transition proba-
bilities, estimated on the Frankenstein novel and PostgreSQL
source code respectively. We detail in Section III-D how these
probabilities could be estimated.
It is interesting to observe how typical token sequences
are modeled by each HMM. For example, in the source code
of an arithmetic/logic expressions, array indexing, or function
argument enumeration.
Fig. 2. The source code – natural text island HMM.
0.33
0.67
0.78
0.86
0.82
0.78
0.65
0.12
0.68
0.64
0.87
0.7
0.71
0.9
0.81
0.71
0.5
0.54
1.0
1.0
1.0
1.0
0.11
0.11
0.21
0.14
0.15
0.14
0.73
0.2 0.45
0.21
1.0
0.17
0.120.12
0.14
0.29
0.33
]
WORD
KEY
!
"
@
'
?
(
*
.
NUM
:
NEWLN
$
)
[
UNDSC
-
/
;
%
,
Fig. 3. A natural text HMM trained on the Frankenstein novel (transition
probabilities less than 0.1 are not shown).
C. An extension of the basic model

Transition between
languages…
Language HMMs are connected by a language transition
matrix into a single HMM
if they
after a
to ﬁnd
s more
aracter,
e space
SRC
corre-
XT or
ted by
If the
p and
hen the
1 p,
HMM
WORD modeling typical variable naming convention. Instead,
in the natural language HMM, numbers (NUM) are preceded
just by the dollar symbol ($) indicating currency, and likely
followed by a dot, indicating text item enumerations. Instead,
in the source code HMM it is noticeable that numbers are part
of an arithmetic/logic expressions, array indexing, or function
argument enumeration.
1-q
1-p
p q
Natural text
HMM
Soure code
HMM
Fig. 2. The source code – natural text island HMM.
Natural
language
HMM
Source
code
HMM

Indexing-based IR
Document Query
indexing indexing
(Query analysis)
Representation Representation
Query
evaluation

Document indexing
Term
extraction
The above process may vary a lot...
Stop word
removal
Stemming
Term
indexing
Algebraic
method

Tokenization in SE
• Often identiﬁers are compound words 
getPointer, getpointer
• Sometimes they contain abbreviations and
acronyms 
getPntr, computeIPAddr

Camel Case splitter
getWords getWords
openIP open IP
Pros: simple and fast
Cons: convention not adopted everywhere

Samurai
• Assumption1: A substring composing an
identifier is also likely to be used elsewhere
• Assumption 2: Given two possible split,
prefer the one that contains identifiers
having a high frequency in the program
Eric Enslen, Emily Hill, Lori L. Pollock, K.Vijay-Shanker: Mining source code to
automatically split identifiers for software analysis. MSR 2009: 71-80

More advanced
approaches
• Tidier/Tris, and Normalize
• Allow expanding abbreviations and
acronyms 
getPntr get pointer
• Based on more advanced techniques, e.g.
speech recognition
Dawn Lawrie, David Binkley: Expanding identiﬁers to normalize source code vocabulary. ICSM 2011: 113-122
Latifa Guerrouj, Massimiliano Di Penta, Giuliano Antoniol andYann-Gaël Guéhéneuc.TIDIER: an identiﬁer splitting
approach using speech recognition techniques. Journal of Software: Evolution and Process.Vol. 25, Issue 6, June
2013, p. 575–599

Stop words for SE
• English not enough
• Programming language keywords may or
may not be included
• What about standard API?
• Speciﬁc, recurring words of some
documents (e.g., of bug reports or of
emails)

Use of smoothing ﬁlters
Andrea De Lucia, Massimiliano Di Penta, Rocco Oliveto, Annibale Panichella, Sebastiano Panichella: Applying a
smoothing ﬁlter to improve IR-based traceability recovery processes: An empirical investigation. Information &
Software Technology 55(4): 741-754 (2013)
Applying a smoothing Ä lter to improve IR-based traceability recovery processes: An
empirical investigation q
Andrea De Lucia a
, Massimiliano Di Penta b,⇑
, Rocco Oliveto c
, Annibale Panichella a
, Sebastiano Panichella b
a
University of Salerno, Via Ponte don Melillo, 84084 Fisciano (SA), Italy
b
University of Sannio, Viale Traiano, 82100 Benevento, Italy
c
University of Molise, C.da Fonte Lappone, 86090 Pesche (IS), Italy
a r t i c l e i n f o
Article history:
Available online 24 August 2012
Keywords:
Software traceability
Information retrieval
Smoothing Ä lters
Empirical software engineering
a b s t r a c t
Context: Traceability relations among software artifacts often tend to be missing, outdated, or lost. For
this reason, various traceability recovery approachesˆ based on Information Retrieval (IR) techniquesˆ
have been proposed. The performances of such approaches are often inÅ uenced by ` ` noise' ' contained
in software artifacts (e.g., recurring words in document templates or other words that do not contribute
to the retrieval itself).
Aim: As a complement and alternative to stop word removal approaches, this paper proposes the use of a
smoothing Ä lter to remove ` ` noise' ' from the textual corpus of artifacts to be traced.
Method: We evaluate the effect of a smoothing Ä lter in traceability recovery tasks involving different
kinds of artifacts from Ä ve software projects, and applying three different IR methods, namely Vector
Space Models, Latent Semantic Indexing, and Jensenç Shannon similarity model.
Results: Our study indicates that, with the exception of some speciÄ c kinds of artifacts (i.e., tracing test
cases to source code) the proposed approach is able to signiÄ cantly improve the performances of trace-
ability recovery, and to remove ` ` noise' ' that simple stop word Ä lters cannot remove.
Conclusions: The obtained results not only help to develop traceability recovery approaches able to work
in presence of noisy artifacts, but also suggest that smoothing Ä lters can be used to improve perfor-
mances of other software engineering approaches based on textual analysis.
Ó 2012 Elsevier B.V. All rights reserved.
1. Introduction
In recent and past years, textual analysis has been successfully
applied to several kinds of software engineering tasks, for example
impact analysis [1], clone detection [2], feature location [3,4],
refactoring [5], deÄ nition of new cohesion and coupling metrics
[6,7], software quality assessment [8ç 11], and, last but not least,
traceability recovery [12ç 14].
Such a kind of analysis demonstrated to be effective and useful
for various reasons:
it is lightweight and to some extent independent on the pro-
gramming language, as it does not require a full source code
parsing, but only its tokenization and (for some applications)
lexical analysis;
it provides information (e.g., brought in comments and identiÄ -
ers) complementary to what structural or dynamic analysis can
provide [6,7];
it models software artifacts as textual documents, thus can be
applied to different kinds of artifacts (i.e., it is not limited to
the source code) and, above all, can be used to perform com-
bined analysis of different kinds of artifacts (e.g., requirements
and source code), as in the case of traceability recovery.
Textual analysis has also some weaknesses, and poses chal-
lenges for researchers. It strongly depends on the quality of the lex-
icon: a bad lexicon often means inaccurateˆ if not completely
wrongˆ results. There are two common problems in the textual
analysis of software artifacts. The Ä rst is represented by the pres-
ence of inconsistent terms in related documents (e.g., require-
q
This paper is an extension of the work ` ` Improving IR-based Traceability
Recovery Using Smoothing Filters' ' appeared in the Proceedings of the 19th IEEE
International Conference on Program Comprehension, Kingston, ON, Canada, pp. 21ç
Information and Software Technology 55 (2013) 741ç 754
Contents lists available at SciVerse ScienceDirect
Information and Software Technology
journal homepage: www.elsevier.com/locate/infsof
Applying a smoothing Ä lter to improve IR-based traceability recovery processes: An
empirical investigation q
Andrea De Lucia a
, Massimiliano Di Penta b,⇑
, Rocco Oliveto c
, Annibale Panichella a
, Sebastiano Panichella b
a
University of Salerno, Via Ponte don Melillo, 84084 Fisciano (SA), Italy
b
University of Sannio, Viale Traiano, 82100 Benevento, Italy
c
University of Molise, C.da Fonte Lappone, 86090 Pesche (IS), Italy
a r t i c l e i n f o
Article history:
Available online 24 August 2012
Keywords:
Software traceability
Information retrieval
Smoothing Ä lters
Empirical software engineering
a b s t r a c t
Context: Traceability relations among software artifacts often tend to be missing, outdated, or lost. For
this reason, various traceability recovery approachesˆ based on Information Retrieval (IR) techniquesˆ
have been proposed. The performances of such approaches are often inÅ uenced by ` ` noise' ' contained
in software artifacts (e.g., recurring words in document templates or other words that do not contribute
to the retrieval itself).
Aim: As a complement and alternative to stop word removal approaches, this paper proposes the use of a
smoothing Ä lter to remove ` ` noise' ' from the textual corpus of artifacts to be traced.
Method: We evaluate the effect of a smoothing Ä lter in traceability recovery tasks involving different
kinds of artifacts from Ä ve software projects, and applying three different IR methods, namely Vector
Space Models, Latent Semantic Indexing, and Jensenç Shannon similarity model.
Results: Our study indicates that, with the exception of some speciÄ c kinds of artifacts (i.e., tracing test
cases to source code) the proposed approach is able to signiÄ cantly improve the performances of trace-
ability recovery, and to remove ` ` noise' ' that simple stop word Ä lters cannot remove.
Conclusions: The obtained results not only help to develop traceability recovery approaches able to work
in presence of noisy artifacts, but also suggest that smoothing Ä lters can be used to improve perfor-
mances of other software engineering approaches based on textual analysis.
Ó 2012 Elsevier B.V. All rights reserved.
1. Introduction
In recent and past years, textual analysis has been successfully
applied to several kinds of software engineering tasks, for example
impact analysis [1], clone detection [2], feature location [3,4],
refactoring [5], deÄ nition of new cohesion and coupling metrics
parsing, but only its tokenization and (for some applications)
lexical analysis;
it provides information (e.g., brought in comments and identiÄ -
ers) complementary to what structural or dynamic analysis can
provide [6,7];
it models software artifacts as textual documents, thus can be

Smoothing Filters
S
(mean source vector)
T
(mean target vector)
Source Documents Target Documents
Filtered 
Source Set
-
Filtered 
Target Set
-
s1 s2 s3 … sk t1 t2 t3 … tz

From stemming to
lexical databases…

• Lexical database where concepts are
organised in a semantic network
• Organize lexical information in terms of
word meaning, rather than word form
• Interfaces for Java, Prolog, Lisp, Python, Perl,
C#
WordNet
(http://wordnet.princeton.edu)

Word Categories
Nouns
• Topical hierarchies with lexical inheritance
(hyponymy/hyperymy and meronymy/holonymy).
Verbs
• Entailment relations
Adjectives and adverbs
• Bipolar opposition relations (antonymy)
Function words (articles, prepositions, etc.)
• Simply omitted

Application: mining and
classifying identifier renamings
Generalization/specialization:  
thrownExceptionSize → thrownExceptionLength
Opposite meaning:  
hasClosingBracket → hasOpeningBracket
Whole/part relation:  
filename → extension
Venera Arnaoudova, Laleh Mousavi Eshkevari, Massimiliano Di Penta, Rocco Oliveto, Giuliano
Antoniol,Yann-Gaël Guéhéneuc: REPENT:Analyzing the Nature of Identifier Renamings. IEEE
Trans. Software Eng. 40(5): 502-532 (2014)

Challenge
• Word relations in WordNet do not fully
comprise relations relevant in the IT
domain
• Alternative approaches try to “learn” these
relations
Jinqiu Yang, Lin Tan: SWordNet: Inferring semantically related words from software context. Empirical
Software Engineering 19(6): 1856-1886 (2014)

Term Indexing:Assigning
Weights to Terms
• Binary Weights: terms weighted as 0 (not
appearing) or 1 (appearing)
• (Raw) term frequency: terms weighted by the
number of occurrences/frequency in the
document
• tf-idf: want to weight terms highly if they are
frequent in relevant documents … BUT ...
infrequent in the collection as a whole

What to use?
tf-idf might be useful for text retrieval
but…
not (necessarily) the best choice to identify
keywords, e.g. for classiﬁcation purposes

Algebraic Models
• Aim at providing representations to
documents and at comparing/clustering
documents
• Examples
• Vector Space Model
• Latent Semantic Indexing (LSI)
• latent Dirichlet allocation (LDA)

TheVector Space Model (VSM)
• Assume t distinct terms remain after preprocessing
• call them index terms or the vocabulary.
• These “orthogonal” terms form a vector space.
• Dimension = t = |vocabulary|
• Each term, i, in a document or query, j, is given a real-
valued weight, wij.
• Both documents and queries are expressed as t-
dimensional vectors:
dj = (w1j, w2j, …, wtj)

VSM Representation
T3
T1
T2
D1 = 2T1+ 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3
7
32
5

Latent Semantic Indexing
(LSI)
Overcomes limitations ofVSM (synonymy,
polysemy, considering words occurring in
documents as independent events)
Shifting from a document-term space towards a
document-concept space
Dumais, S.T., Furnas, G.W., Landauer,T. K. and Deerwester, S. (1988), Using latent semantic
analysis to improve information retrieval. In Proceedings of CHI'88: Conference on Human
Factors in Computing, NewYork:ACM, 281-285.

latent Dirichlet Allocation
(LDA)
• Probabilistic model
• Documents treated as a distributions of
topics
• Topics are distributions of words
D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet Allocation,” The Journal of
Machine Learning Research, vol. 3, pp. 993–1022, 2003.

Applications
Alessandra Gorla, Ilaria Tavecchia, Florian Gross, Andreas Zeller: Checking app behavior
against app descriptions. ICSE 2014: 1025-1035
Checking App Behavior Against App Descriptions
Alessandra Gorla · Ilaria Tavecchia⇤
· Florian Gross · Andreas Zeller
Saarland University
Saarbrücken, Germany
{gorla, tavecchia, fgross, zeller}@cs.uni-saarland.de
ABSTRACT
How do we know a program does what it claims to do? After clus-
tering Android apps by their description topics, we identify outliers
in each cluster with respect to their API usage. A “weather” app that
sends messages thus becomes an anomaly; likewise, a “messaging”
app would typically not be expected to access the current location.
Applied on a set of 22,500+ Android applications, our CHABADA
prototype identified several anomalies; additionally, it flagged 56%
of novel malware as such, without requiring any known malware
patterns.
Categories and Subject Descriptors
D.4.6 [Security and Protection]: Invasive software
General Terms
Security
Keywords
Android, malware detection, description analysis, clustering
1. INTRODUCTION
Checking whether a program does what it claims to do is a long-
standing problem for developers. Unfortunately, it now has become
a problem for computer users, too. Whenever we install a new app,
1. App collection 2. Topics
Weather,
Map…
Travel,
Map…
Theme
3. Clusters
Weather
+ Travel
Themes
Access-LocationInternet Access-LocationInternet Send-SMS
4. Used APIs 5. Outliers
Weather
+ Travel
Figure 1: Detecting applications with unadvertised behavior.
Starting from a collection of “good” apps (1), we identify their
description topics (2) to form clusters of related apps (3). For
each cluster, we identify the sentitive APIs used (4), and can
then identify outliers that use APIs that are uncommon for that
cluster (5).
• An app that sends a text message to a premium number to
raise money is suspicious? Maybe, but on Android, this is a
Checking App Behavior Against App Descriptions
Alessandra Gorla · Ilaria Tavecchia⇤
· Florian Gross · Andreas Zeller
Saarland University
Saarbrücken, Germany
{gorla, tavecchia, fgross, zeller}@cs.uni-saarland.de
ABSTRACT
How do we know a program does what it claims to do? After clus-
tering Android apps by their description topics, we identify outliers
in each cluster with respect to their API usage. A “weather” app that
sends messages thus becomes an anomaly; likewise, a “messaging”
app would typically not be expected to access the current location.
Applied on a set of 22,500+ Android applications, our CHABADA
prototype identified several anomalies; additionally, it flagged 56%
of novel malware as such, without requiring any known malware
patterns.
Categories and Subject Descriptors
D.4.6 [Security and Protection]: Invasive software
General Terms
Security
1. App collection 2. Topics
Weather,
Map…Map…
Travel,
Map…
Theme
3. Clusters
Weather
+ Travel+ Travel
ThemesThemes
Access-LocationInternet Access-LocationInternet Send-SMS
4. Used APIs 5. Outliers
Weather
+ Travel+ Travel

Need to properly
calibrate the techniques
• Many IR techniques require a careful
calibration of many parameters
• LSI number of concepts (k)
• LDA k, α, β, number of iterations (n)
• Without that, the performances can be
sub-optimal

… and beyond that…
• Should I use stop word removal? Which
one?
• Stemming?
• Which weighting scheme? tf? tf-idf? others?
• Which technique?VSM? LSI? LDA?

Approaches (I)
Sugandha Lohar, Sorawit Amornborvornwong, Andrea Zisman, Jane Cleland-Huang: Improving trace
accuracy through data-driven configuration and composition of tracing features. ESEC/SIGSOFT FSE
2013: 378-388
Improving Trace Accuracy through Data-Driven
ConÆ guration and Composition of Tracing Features
Sugandha Lohar, Sorawit
Amornborvornwong
DePaul University
Chicago, IL, USA
sonul.123@gmail.com,
sorambww@hotmail.com
Andrea Zisman
Department of Computing
The Open University
Milton Keynes, MK7 6AA, UK
andrea.zisman@open.ac.uk
Jane Cleland-Huang
DePaul University
Systems and Requirements
Engineering Center
Chicago, IL, USA
jhuang@cs.depaul.edu
ABSTRACT
Software traceability is a sought-after, yet often elusive qual-
ity in large software-intensive systems primarily because the
cost and effort of tracing can be overwhelming. State-of-the
art solutions address this problem through utilizing trace re-
trieval techniques to automate the process of creating and
maintaining trace links. However, there is no simple one-
size- ts all solution to trace retrieval. As this paper will
show, nding the right combination of tracing techniques
can lead to signi cant improvements in the quality of gener-
ated links. We present a novel approach to trace retrieval in
which the underlying infrastructure is con gured at runtime
to optimize trace quality. We utilize a machine-learning ap-
proach to search for the best con guration given an initial
training set of validated trace links, a set of available tracing
techniques speci ed in a feature model, and an architecture
capable of instantiating all valid con gurations of features.
critical software-intensive systems [4]. It is used to capture
relationships between requirements, design, code, test-cases,
and other software engineering artifacts, and support crit-
ical activities such as impact analysis, compliance veri ca-
tion, test-regression selection, and safety-analysis. As such,
traceability is mandated in safety-critical domains includ-
ing the automotive, aeronautics, and medical device indus-
tries. Unfortunately, tracing costs can grow excessively high
if trace links have to be created and maintained manually
by human users, and as a result, practitioners often fail to
establish adequate traceability in a project [27].
To address these needs, numerous researchers have devel-
oped or adopted algorithms that semi-automate the process
of creating trace links. These algorithms include the Vec-
tor Space Model (VSM) [23], Probabilistic approaches [14],
Latent Semantic Indexing [2, 12], Latent Dirichlet Alloca-
tion (LDA) [13], rule-based approaches that identify rela-
tionships across project artifacts [35], and approaches that
Improving Trace Accuracy through Data-Driven
ConÆ guration and Composition of Tracing Features
Sugandha Lohar, Sorawit
Amornborvornwong
DePaul University
Chicago, IL, USA
sonul.123@gmail.com,
sorambww@hotmail.com
Andrea Zisman
Department of Computing
The Open University
Milton Keynes, MK7 6AA, UK
andrea.zisman@open.ac.uk
Jane Cleland-Huang
DePaul University
Systems and Requirements
Engineering Center
Chicago, IL, USA
jhuang@cs.depaul.edu
ABSTRACT
Software traceability is a sought-after, yet often elusive qual-
ity in large software-intensive systems primarily because the
cost and effort of tracing can be overwhelming. State-of-theffort of tracing can be overwhelming. State-of-theff
art solutions address this problem through utilizing trace re-
critical software-intensive systems [4]. It is used to capture
relationships between requirements, design, code, test-cases,
and other software engineering artifacts, and support crit-
rint
ical activities such as impact analysis, compliance veri ca-rint
rint

Discussion
• Supervised, task speciﬁc
• Model parameters varied through Genetic
Algorithms
• You train the model on a training set
• Maximize precision/recall for a given task

Approaches (II)
Annibale Panichella, Bogdan Dit, Rocco Oliveto, Massimiliano Di Penta, Denys Poshyvanyk, Andrea
De Lucia: How to effectively use topic models for software engineering tasks? an approach based on
genetic algorithms. ICSE 2013: 522-531
How to Effectively Use Topic Models for
Software Engineering Tasks?
An Approach Based on Genetic Algorithms
Annibale Panichella1, Bogdan Dit2, Rocco Oliveto3,
Massimilano Di Penta4, Denys Poshynanyk2, Andrea De Lucia1
1
University of Salerno, Fisciano (SA), Italy
2
The College of William and Mary, Williamsburg, VA, USA
3
University of Molise, Pesche (IS), Italy
4
University of Sannio, Benevento, Italy
Abstractó Information Retrieval (IR) methods, and in partic-
ular topic models, have recently been used to support essential
software engineering (SE) tasks, by enabling software textual
retrieval and analysis. In all these approaches, topic models have
been used on software artifacts in a similar manner as they
were used on natural language documents (e.g., using the same
settings and parameters) because the underlying assumption was
that source code and natural language documents are similar.
However, applying topic models on software data using the same
settings as for natural language text did not always produce the
expected results.
Recent research investigated this assumption and showed that
source code is much more repetitive and predictable as compared
to the natural language text. Our paper builds on this new
fundamental nding and proposes a novel solution to adapt,
con gure and effectively use a topic modeling technique, namely
Latent Dirichlet Allocation (LDA), to achieve better (acceptable)
performance across various SE tasks. Our paper introduces a
novel solution called LDA-GA, which uses Genetic Algorithms
(GA) to determine a near-optimal con guration for LDA in
the context of three different SE tasks: (1) traceability link
recovery, (2) feature location, and (3) software artifact labeling.
The results of our empirical studies demonstrate that LDA-GA
is able to identify robust LDA con gurations, which lead to a
higher accuracy on all the datasets for these SE tasks as compared
proposed to support software engineering tasks: feature lo-
cation [4], change impact analysis [5], bug localization [6],
clone detection [7], traceability link recovery [8], [9], expert
developer recommendation [10], code measurement [11], [12],
artifact summarization [13], and many others [14], [15], [16].
In all these approaches, LDA and LSI have been used on
software artifacts in a similar manner as they were used on
natural language documents (i.e., using the same settings,
con gurations and parameters) because the underlying as-
sumption was that source code (or other software artifacts)
and natural language documents exhibit similar properties.
More speci cally, applying LDA requires setting the number
of topics and other parameters speci c to the particular LDA
implementation. For example, the fast collapsed Gibbs sam-
pling generative model for LDA requires setting the number
of iterations n and the Dirichlet distribution parameters α
and β [17]. Even though LDA was successfully used in the
IR and natural language analysis community, applying it on
software data, using the same parameter values used for natural
language text, did not always produce the expected results
[18]. As in the case of machine learning and optimization
How to Effectively Use Topic Models for
Software Engineering Tasks?
An Approach Based on Genetic Algorithms
Annibale Panichella1, Bogdan Dit2, Rocco Oliveto3,
Massimilano Di Penta4, Denys Poshynanyk2, Andrea De Lucia1
1
University of Salerno, Fisciano (SA), Italy
2
The College of William and Mary, Williamsburg, VA, USA
3
University of Molise, Pesche (IS), Italy
4
University of Sannio, Benevento, Italy
Abstractó Information Retrieval (IR) methods, and in partic-
ular topic models, have recently been used to support essential
software engineering (SE) tasks, by enabling software textual
retrieval and analysis. In all these approaches, topic models have
been used on software artifacts in a similar manner as they
were used on natural language documents (e.g., using the same
settings and parameters) because the underlying assumption was
that source code and natural language documents are similar.that source code and natural language documents are similar.that source code and natural language documents are similar
However, applying topic models on software data using the same
settings as for natural language text did not always produce the
proposed to support software engineering tasks: feature lo-
cation [4], change impact analysis [5], bug localization [6],
clone detection [7], traceability link recovery [8], [9], expert
developer recommendation [10], code measurement [11], [12],
artifact summarization [13], and many others [14], [15], [16].
In all these approaches, LDA and LSI have been used on
software artifacts in a similar manner as they were used on
natural language documents (i.e., using the same settings,
con gurations and parameters) because the underlying as-

Choose a random
population of LDA
parameters
LDA
Determine fitness of each
chromosome (individual)
of each
(individual)
Fitness = silhouette coefficientFitness = silhouette coefficientFitness = silhouette coefficient

Choose a random
population of LDA
parameters
LDA
Select next generation
Crossover, mutation
Crossover
Mutation

Choose a random
population of LDA
parameters
LDA
Select next generation
Crossover, mutation
Next generation
Best model for first generation
Best model for last generation

• Integrated suite of software facilities for data
manipulation, calculation and graphical display
• VSM and LSI implemented in the lsa package
• topicmodels and lda also available
• many other text processing packages…
R (www.r-project.org)

Creating a t-d matrix
Creating a term-document matrix from a directory
tm-textmatrix(/Users/Max/mydocs)
Syntax:
textmatrix(mydir, stemming=FALSE,
language=“english, 
minWordLength=2, minDocFreq=1,
stopwords=NULL, vocabulary=NULL)

Applying a weighting schema
• lw_tf(m): returns a completely unmodiﬁed n times m matrix
• lw_bintf(m): returns binary values of the n times m matrix.
• gw_normalisation(m): returns a normalised n times m matrix
• gw_gﬁdf(m): returns the global frequency multiplied with idf

Applying LSI
• Decomposes tm and reduces the space of
concepts to 2
lspace-lsa(tm,dims=2)
• Converts lspace into a textmatrix
m2-as.textmatrix(lspace)

Computing the cosine
cosine(v1,v2)
or
cosine(matrix)

Application: Identifying
duplicate bug reports
Problem: people often report bugs someone
has already reported
Solution (in a nutshell):
• Compute textual similarity among bug
reports
• Match (where available) stack traces
Xiaoyin Wang, Lu Zhang,Tao Xie, John Anvik, Jiasu Sun:An approach to
detecting duplicate bug reports using natural language and execution
information. ICSE 2008: 461-470

Example (Eclipse)
Bug 123 - Synchronize View: files nodes in
tree should provide replace with action
This scenario happens very often to me
- I change a file in my workspace to test
something
- I do a release
- In the release I encounter that there is no
need to release the file
- I want to replace it from the stream but a
corresponding action is missing
Now I have to go to the navigator to do the
job.
t-textmatrix(/Users/mdipenta/docs,stemming=TRUE)
cosine(t)
d1 d2
d1 1.0000000 0.6721491
d2 0.6721491 1.0000000
/mdipenta/docs,stemming=TRUE)
Bug 4934 - DCR: Replace from Stream in
Release mode
Sometimes I touch files, without
intention.These files are then unwanted
outgoing changes.To fix it, I have go back to
the package view/navigator and replace the
file with the stream.After this a new
synchronize is needed, for that I have a clean
view.
It would be very nice to have the 'replace
with stream' in the outgoing tree.
Or 'Revert'?

Other tools
Lucene http://lucene.apache.org
RapidMiner www.rapidminer.com
Mallet http://mallet.cs.umass.edu

Careful choice
• In case you don’t want a GA chooses it for
you…
• Most exotic model not necessarily the best
one
• Simple models have advantages (e.g.
scalability)
• Try to experiment multiple models

IR methods:  
Pros and Cons
• Can be used to match similar (related) discussions
• Some techniques deal with issues such as
homonymy and polysemy
• Software-related discussion can be very noisy
• Noise different between different artifacts 
(bug reports, commit notes, code)
• You have less control than with regular expressions
• Do not capture dependencies in sentences  
e.g.,“this bug will not be ﬁxed”
✘
✔
✔
✘
✘
✘

Technology
Stanford Natural Language Parser 
http://nlp.stanford.edu
DependenceVisualizer: DepenSee
Grammar Browser: GrammarScope

(ROOT
(S
(NP (DT this) (NN bug))
(VP (MD will) (RB not)
(VP (VB be)
(VP (VBN fixed))))))
det(bug-2, this-1)
nsubjpass(fixed-6, bug-2)
aux(fixed-6, will-3)
neg(fixed-6, not-4)
auxpass(fixed-6, be-5)
root(ROOT-0, fixed-6)
Example
this bug will not be ﬁxed
DT NN MD RB VB VBN
det aux
nsubjpass
neg
auxpass

Is it applicable to analyze
source code identifiers?
• It could be, once identifiers are split into
compound words
• Useful to tag part-of-speech
• Results may not be very precise because
identifiers are not natural language
sentences

Rules (Abebe and Tonella, 2010)
Surafel Lemma Abebe, Paolo Tonella: Natural Language Parsing of Program
Element Names for Concept Extraction. ICPC 2010: 156-159

Example:
OpenStructure
open structure Let’s insert an artiﬁcial subject and an
article:
subjects open the structure
(ROOT
(NP (JJ open) (NN structure)))
amod(structure-2, open-1)
root(ROOT-0, structure-2)
(ROOT
(S
(NP (NNS subjects))
(VP (VBP open)
(NP (DT the) (NN structure)))))
nsubj(open-2, subjects-1)
root(ROOT-0, open-2)
det(structure-4, the-3)
dobj(open-2, structure-4)

Applications
Rahul Pandita, Xusheng Xiao, Hao Zhong, Tao Xie, Stephen Oney, Amit M. Paradkar:
Inferring method specifications from natural language API descriptions. ICSE 2012: 815-825
Inferring Method Specifications from Natural Language API Descriptions
Rahul Pandita⇤, Xusheng Xiao⇤, Hao Zhong†, Tao Xie⇤, Stephen Oney‡, and Amit Paradkar§
⇤Department of Computer Science, North Carolina State University, Raleigh, USA
†Laboratory for Internet Software Technologies, Institute of Software, Chinese Academy of Sciences, Beijing, China
‡Human-Computer Interaction Institute, Carnegie Mellon University, Pittsburgh, USA
§I.B.M. T. J. Watson Research Center, Hawthorne, NY, USA
{rpandit, xxiao2, txie}@ncsu.edu, zhonghao@itechs.iscas.ac.cn, soney@cs.cmu.edu, paradkar@us.ibm.com
Abstract—Application Programming Interface (API) docu-
ments are a typical way of describing legal usage of reusable
software libraries, thus facilitating software reuse. However,
even with such documents, developers often overlook some
documents and build software systems that are inconsistent
with the legal usage of those libraries. Existing software
verification tools require formal specifications (such as code
contracts), and therefore cannot directly verify the legal usage
described in natural language text in API documents against
code using that library. However, in practice, most libraries
do not come with formal specifications, thus hindering tool-
based verification. To address this issue, we propose a novel
approach to infer formal specifications from natural language
text of API documents. Our evaluation results show that our
approach achieves an average of 92% precision and 93%
recall in identifying sentences that describe code contracts from
more than 2500 sentences of API documents. Furthermore, our
results show that our approach has an average 83% accuracy
in inferring specifications from over 1600 sentences describing
code contracts.
I. INTRODUCTION
Programming Interface (API) documents. Typically, such
documents are provided to client-code developers through
online access, or are shipped with the API code. For exam-
ple, J2EE’s API documentation3
is one of the most popular
API documents.
Even with such documents, client-code developers often
overlook some API documents and use methods in API
libraries incorrectly [22]. Since these documents are written
in natural language, existing tools cannot verify legal usage
described in a library’s API documents against the client
code of that library. One possible solution is to manually
write code contracts based on the specifications described
in API documents. However, due to a large number of
sentences in API documents, manually hunting for contract
sentences and writing code contracts for the API library
is prohibitively time consuming and labor intensive. For
instance, the File class of the C# .NET Framework has
around 800 sentences. Moreover, not all of these sentences
Inferring Method Specifications from Natural Language API Descriptions
Rahul Pandita⇤, Xusheng Xiao⇤, Hao Zhong†, Tao Xie⇤, Stephen Oney‡, and Amit Paradkar§
⇤Department of Computer Science, North Carolina State University, Raleigh, USA
†Laboratory for Internet Software Technologies, Institute of Software, Chinese Academy of Sciences, Beijing, China
‡
Laboratory for Internet Software Technologies, Institute of Software, Chinese Academy of Sciences, Beijing, China
‡
Laboratory for Internet Software Technologies, Institute of Software, Chinese Academy of Sciences, Beijing, China
Human-Computer Interaction Institute, Carnegie Mellon University, Pittsburgh, USA
§
§
I.B.M. T. J. Watson Research Center, Hawthorne, NY, USA
{rpandit, xxiao2, txie}@ncsu.edu, zhonghao@itechs.iscas.ac.cn, soney@cs.cmu.edu, paradkar@us.ibm.com
Abstract—Application Programming Interface (API) docu-Abstract—Application Programming Interface (API) docu-Abstract
ments are a typical way of describing legal usage of reusable
software libraries, thus facilitating software reuse. However,
even with such documents, developers often overlook some
documents and build software systems that are inconsistent
with the legal usage of those libraries. Existing software
verification tools require formal specifications (such as code
contracts), and therefore cannot directly verify the legal usage
described in natural language text in API documents against
code using that library. However, in practice, most libraries
do not come with formal specifications, thus hindering tool-
based verification. To address this issue, we propose a novel
Programming Interface (API) documents. Typically, such
documents are provided to client-code developers through
online access, or are shipped with the API code. For exam-
ple, J2EE’s API documentation3
is one of the most popular
API documents.
Even with such documents, client-code developers often
overlook some API documents and use methods in API
libraries incorrectly [22]. Since these documents are written
in natural language, existing tools cannot verify legal usage
described in a library’s API documents against the client

Applications
How Can I Improve My App? Classifying User
Reviews for Software Maintenance and Evolution
S. Panichella⇤, A. Di Sorbo†, E. Guzman‡, C. A.Visaggio†, G. Canfora† and H. C. Gall⇤
⇤University of Zurich, Switzerland
†University of Sannio, Benevento, Italy
‡Technische Universität München, Garching, Germany
panichella@ifi.uzh.ch, disorbo@unisannio.it, emitza.guzman@mytum.de, {visaggio,canfora}@unisannio.it, gall@ifi.uzh.ch
Abstract—App Stores, such as Google Play or the Apple Store,
allow users to provide feedback on apps by posting review
comments and giving star ratings. These platforms constitute a
useful electronic mean in which application developers and users
can productively exchange information about apps. Previous
research showed that users feedback contains usage scenarios,
bug reports and feature requests, that can help app developers to
accomplish software maintenance and evolution tasks. However,
in the case of the most popular apps, the large amount of received
feedback, its unstructured nature and varying quality can make
the identification of useful user feedback a very challenging task.
In this paper we present a taxonomy to classify app reviews
into categories relevant to software maintenance and evolution,
as well as an approach that merges three techniques: (1)
Natural Language Processing, (2) Text Analysis and (3) Sentiment
Analysis to automatically classify app reviews into the proposed
categories. We show that the combined use of these techniques
allows to achieve better results (a precision of 75% and a recall
of 74%) than results obtained using each technique individually
(precision of 70% and a recall of 67%).
Index Terms—User Reviews, Mobile Applications, Natural
Language Processing, Sentiment Analysis, Text classification
form of unstructured text that is difficult to parse and analyze.
Thus, developers and analysts have to read a large amount
of textual data to become aware of the comments and needs
of their users [10]. In addition, the quality of reviews varies
greatly, from useful reviews providing ideas for improvement
or describing specific issues to generic praises and complaints
(e.g. “You have to be stupid to program this app”, “I love
it!”, “this app is useless”).
To handle this problem Chen et al. [10] proposed AR-
Miner, an approach to help app developers discover the most
informative user reviews. Specifically, the authors use: (i) text
analysis and machine learning to filter out non-informative
reviews and (ii) topic analysis to recognize topics treated
in the reviews classified as informative. In this paper, we
argue that text content represents just one of the possible
dimensions that can be explored to detect informative reviews
from a software maintenance and evolution perspective. In
particular, topic analysis techniques are useful to discover
How Can I Improve My App? Classifying User
Reviews for Software Maintenance and Evolution
S. Panichella⇤, A. Di Sorbo†, E. Guzman‡, C. A.Visaggio†, G. Canfora† and H. C. Gall⇤
⇤University of Zurich, Switzerland
†University of Sannio, Benevento, Italy
‡Technische Universitat Mät M¨ unchen, Garching, Germanyünchen, Garching, Germany¨
panichella@ifi.uzh.ch, disorbo@unisannio.it, emitza.guzman@mytum.de, {visaggio,canfora}@unisannio.it, gall@ifi.uzh.ch
Abstract—App Stores, such as Google Play or the Apple Store,Abstract—App Stores, such as Google Play or the Apple Store,Abstract
allow users to provide feedback on apps by posting review
comments and giving star ratings. These platforms constitute a
useful electronic mean in which application developers and users
can productively exchange information about apps. Previous
research showed that users feedback contains usage scenarios,
bug reports and feature requests, that can help app developers to
accomplish software maintenance and evolution tasks. However,
in the case of the most popular apps, the large amount of received
feedback, its unstructured nature and varying quality can make
the identification of useful user feedback a very challenging task.
In this paper we present a taxonomy to classify app reviews
into categories relevant to software maintenance and evolution,
as well as an approach that merges three techniques: (1)
Natural Language Processing, (2) Text Analysis and (3) Sentiment
Analysis to automatically classify app reviews into the proposed
categories. We show that the combined use of these techniques
allows to achieve better results (a precision of 75% and a recall
of 74%) than results obtained using each technique individually
(precision of 70% and a recall of 67%).
form of unstructured text that is difficult to parse and analyze.
Thus, developers and analysts have to read a large amount
of textual data to become aware of the comments and needs
of their users [10]. In addition, the quality of reviews varies
greatly, from useful reviews providing ideas for improvement
or describing specific issues to generic praises and complaints
(e.g. “You have to be stupid to program this app”, “I love
it!”, “this app is useless”).
To handle this problem Chen et al. [10] proposed AR-
Miner, an approach to help app developers discover the most
informative user reviews. Specifically, the authors use: (i) text
analysis and machine learning to filter out non-informative
reviews and (ii) topic analysis to recognize topics treated
in the reviews classified as informative. In this paper, we
argue that text content represents just one of the possible
dimensions that can be explored to detect informative reviews
from a software maintenance and evolution perspective. In
@ICSME 2015 - Thurdsay, 13.50 - Mobile applications

Natural Language
Parsing: Pros and Cons
• Can identify dependencies between words
• Accurate part-of-speech analysis, better than
lexical databases
• Application to software entities (e.g.
identifiers) may be problematic
• Difficult to process data
• NL parsing is difficult, especially when the
corpus is noisy ✘
✔
✔
✘
✘

What technique?What technique?

What technique?
Need to match precise elements

What technique?
Separate structured and
unstructured data

What technique?
Account for the whole textual
corpus of a document

What technique?
Compare documents

What technique?
Identify “topics” discussed in
documents

What technique?
Not just bag of words, but also
dependencies, parts of speech,
negations

Take aways
Use structure when available
Carefully (empirically) choose the most
suitable technique
Out of the box conﬁguration might not
work for you
Manual analysis cannot be replaced

(ROOT
(S
(VP (VB be)
det(bug-2, this-1)
neg(fixed-6, not-4)
For example, the
Stanford Natural
Language Parser 
Techniques: Natural
Language Parsing
DT NN MD RB VB VBN
det aux
nsubjpass
neg
auxpass
(ROOT
(S
(VP (VB be)
det(bug-2, this-1)
neg(fixed-6, not-4)
For example, the
Stanford Natural
Language Parser
Techniques: Natural
Language Parsing
DT NN MD RB VB VBN
det aux
nsubjpass
neg
auxpass
Maching issue ids
+)/i || $note=~/Defects+(d+)/i $note=~/Fixs+(d+)/
i
Simple yet powerful technique to link commit
notes to issue reports
Maching issue ids
+)/i || $note=~/Defects+(d+)/i $note=~/Fixs+(d+)/
i
Simple yet powerful technique to link commit
notes to issue reports
VSM Representation
T3
T1
T2
D1 = 2T1+ 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3
7
32
5
VSM Representation
T3T3T
T1T1T
T2T2T
D1 = 2T1= 2T1= 2T + 3T2+ 3T2+ 3T + 5T3+ 5T3+ 5T
D2 = 3T1= 3T1= 3T + 7T2+ 7T2+ 7T + T3T3T
Q = 0T1Q = 0T1Q = 0T + 0T2+ 0T2+ 0T + 2T3+ 2T3+ 2T
7
32
5

Put Your Hands in the Mud: What Technique, Why, and How

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Destaque

Destaque (20)

Semelhante a Put Your Hands in the Mud: What Technique, Why, and How

Semelhante a Put Your Hands in the Mud: What Technique, Why, and How (20)

Último

Último (20)

Put Your Hands in the Mud: What Technique, Why, and How