Put Your Hands in the Mud:
What Technique,
Why, and How?
Massimiliano Di Penta
University of Sannio, Italy
dipenta@unisannio.it
http://www.ing.unisannio.it/mdipenta
Outline
MUD of software repositories
Available techniques:
• Pattern matching
• Island parsers
• IR algebraic methods
• Natural Language Parsing
Choosing the solution that fits your needs
Textual Analysis very
successful in SE...
Why?
Code identifiers and comments reflect
semantics
Software repositories contain a mix of
structured and unstructured data
(Some) Applications
Traceability Link Recovery
Feature Location
Clone Detection
Conceptual Cohesion/Coupling Metrics
Bug prediction
Bug triaging
Software remodularization
Textual smells/antipatterns
Duplicate bug report detection
Software artifact summarization and labeling
Generating assertions from textual descriptions
....
Different kinds of
Repositories
Versioning Systems
Besides code itself, the main source of
unstructured data is represented by
commit messages
Example
• Add Checkclipse preferences to all projects
so Checkstyle is preconfigured
• Fix for issue 4503.
• Fix for issue 4517: "changeability" on
association ends not displayed properly.
This was never adapted since UML 1.3 &
NSUML were replaced.
What kind of problem do you see here?
Be aware!
• Commit notes are often insufficient to
know everything about a change
• Need to merge with issue tracker data
Issue Trackers
• Contain a mix of structured and non-
structured data
• Structured: classification, priority, severity,
status, but also stack traces and code
snippets
• Unstructured: issue description, comments
• Often natural language mixed with method,
class, package names, etc.
Challenges
• Including comments in the text corpus may
or may not add useful information
• The mix of natural language and code
elements may challenge the application of
techniques such as natural language parsing
Techniques
Nicolas Bettenburg, Stephen W. Thomas, Ahmed E. Hassan: Using Fuzzy Code Search to Link Code
Fragments in Discussions to Source Code. CSMR 2012: 319-328
Using Fuzzy Code Search to Link Code Fragments
in Discussions to Source Code
Nicolas Bettenburg, Stephen W. Thomas, Ahmed E. Hassan
Software Analysis and Intelligence Lab (SAIL)
School of Computing, Queen’s University
Kingston, Ontario, Canada
Email: {nicbet, sthomas, ahmed}@cs.queensu.ca
Abstract—When discussing software, practitioners often ref-
erence parts of the project’s source code. Such references
have different motivations, such as mentoring and guiding less
experienced developers, pointing out code that needs changes,
or proposing possible strategies for the implementation of future
changes. The fact that particular parts of a source code are being
discussed makes these parts of the software special. Knowing
which code is being talked about the most can not only help
practitioners to guide important software engineering and main-
tenance activities, but also act as a high-level documentation of
development activities for managers. In this paper, we use clone-
detection as specific instance of a code search based approach
for establishing links between code fragments that are discussed
by developers and the actual source code of a project. Through
a case study on the Eclipse project we explore the traceability
links established through this approach, both quantitatively and
qualitatively, and compare fuzzy code search based traceability
linking to classical approaches, in particular change log analysis
and information retrieval. We demonstrate a sample application
of code search based traceability links by visualizing those parts
of the project that are most discussed in issue reports with
a Treemap visualization. The results of our case study show
that the traceability links established through fuzzy code search-
Past approaches to uncovering traceability links between doc-
umentation and source code are commonly based on in-
formation retrieval [2], [15], [18], [20], natural language
processing [1], [19] and lightweight textual analyses [5], [11],
[12]. Each approach, however, is tailored towards a specific
set of goals and use cases. For example, when linking code
changes to issue reports by analyzing transaction logs [28],
we observe only the associations between a bug report and
the final locations of the bug fix, but miss the bug fixing
history: all the locations that a developer had to investigate
and understand before he could find an appropriate way to fix
the error.
In this paper, we aim to find traceability links between
issue reports and source code. For this purpose, we propose
a new approach that uses token-based clone detection as an
implementation of fuzzy code search for discovering links
between code fragments mentioned in project discussions and
the location of these fragments in the source code body of a
software system. In a case study on the ECLIPSE project, we
Emails
Similar characteristics to issue trackers,
but with less structure
From lisa at usna.navy.MIL Thu Jul 1 15:42:27 1999
From: lisa at usna.navy.MIL (Lisa Becktold {CADIG STAFF})
Date: Tue Dec 2 03:03:10 2003
Subject: nmbd/nmbd_processlogon.c - CODE 12???
Message-ID: <99Jul1.114236-0400edt.4995-357+39@jupiter.usna.navy.mil
Hi:
I have Samba 2.1.0-prealpha running on a Sun Ultra 10. My NT PC
joined
the domain without a problem, but I can't logon. Every time I attempt
to log into the Samba domain, the NT screen blanks as it usually does
during login, but then the "Begin Login" screen reappears.
I see this message in my samba/var/log.nmb file whenever I try to
login:
Web Forums
Stack Overflow
Mining Forums:
Challenges
• The usual problems you face with issue
trackers and emails are still there
• However, in general, the separation between
different elements is much more consistent
StackOverflow
Embedded Code
StackOverflow
Embedded Code
However, this
convention is not
always followed!
User reviews
Not all reviews are equal
AR-Miner: Mining Informative Reviews for Developers from
Mobile App Marketplace
Ning Chen, Jialiu Lin†
, Steven C. H. Hoi, Xiaokui Xiao, Boshen Zhang
Nanyang Technological University, Singapore, †
Carnegie Mellon University, USA
{nchen1,chhoi,xkxiao,bszhang}@ntu.edu.sg, †
jialiul@cs.cmu.edu
ABSTRACT
With the popularity of smartphones and mobile devices, mobile application (a.k.a. “app”) markets have been growing exponentially in terms of number of users and downloads. App developers spend considerable effort on collecting and exploiting user feedback to improve user satisfaction, but suffer from the absence of effective user review analytics tools. To facilitate mobile app developers discover the most “informative” user reviews from a large and rapidly increasing pool of user reviews, we present “AR-Miner” — a novel computational framework for App Review Mining, which performs comprehensive analytics from raw user reviews by (i) first extracting informative user reviews by filtering noisy and irrelevant ones, (ii) then grouping the informative reviews automatically using topic modeling, (iii) further prioritizing the informative reviews by an effective review ranking scheme, (iv) and finally presenting the groups of most “informative” reviews via an intuitive visualization approach. We conduct extensive experiments and case studies on four popular Android apps to evaluate AR-Miner,
intense, in order to seize the initiative, developers tend to employ an iterative process to develop, test, and improve apps [23]. Therefore, timely and constructive feedback from users becomes extremely crucial for developers to fix bugs, implement new features, and improve user experience agilely. One key challenge to many app developers is how to obtain and digest user feedback in an effective and efficient manner, i.e., the “user feedback extraction” task. One way to extract user feedback is to adopt typical channels used in traditional software development, such as (i) bug/change repositories (e.g., Bugzilla [3]), (ii) crash reporting systems [19], (iii) online forums (e.g., SwiftKey feedback forum [6]), and (iv) emails [10].
Unlike the traditional channels, modern app marketplaces, such as Apple App Store and Google Play, offer a much easier way (i.e., the web-based market portal and the market app) for users to rate and post app reviews. These reviews present user feedback on various aspects of apps (such as functionality, quality, performance, etc), and provide app developers a new and critical channel to extract user feed-
Ning Chen, Jialiu Lin, Steven C. H. Hoi, Xiaokui Xiao, Boshen Zhang:AR-miner: mining
informative reviews for developers from mobile app marketplace. ICSE 2014: 767-778
Informative reviews…
• Functional: "None of the pictures will load in my news feed."
• Performance: "It lags and doesn't respond to my touch which almost always causes me to run into stuff."
• Feature (change) request: "Amazing app, although I wish there were more themes to choose from"; "Please make it a little easy to get bananas please and make more power ups that would be awesome."
• Remove ads: "So many ads its unplayable!"
• Fix permissions: "This game is adding for too much unexplained permissions."
Non-Informative reviews
• Purely emotional: "Great fun can't put it down!"; "This is a crap app."
• Description of app/actions: "I have changed my review from 2 star to 1 star."
• Unclear issue description: "Bad game this is not working on my phone."
• Questions: "How can I get more points?"
@ICSME 2015 

Thursday, 13.50 - Mobile applications
User Reviews Matter! Tracking Crowdsourced
Reviews to Support Evolution of Successful Apps
Fabio Palomba*, Mario Linares-Vásquez§, Gabriele Bavota†, Rocco Oliveto‡,
Massimiliano Di Penta¶, Denys Poshyvanyk§, Andrea De Lucia*
*University of Salerno, Fisciano (SA), Italy – §The College of William and Mary, VA, USA
†Free University of Bozen-Bolzano, Bolzano (BZ), Italy – ‡University of Molise, Pesche (IS), Italy
¶University of Sannio, Benevento, Italy
Abstract—Nowadays software applications, and especially mo-
bile apps, undergo frequent release updates through app stores.
After installing/updating apps, users can post reviews and provide
ratings, expressing their level of satisfaction with apps, and
possibly pointing out bugs or desired features. In this paper
we show—by performing a study on 100 Android apps—how
applications addressing user reviews increase their success in
terms of rating. Specifically, we devise an approach, named
CRISTAL, for tracing informative crowd reviews onto source
code changes, and for monitoring the extent to which developers
accommodate crowd requests and follow-up user reactions as
reflected in their ratings. The results indicate that developers
implementing user reviews are rewarded in terms of ratings. This
poses the need for specialized recommendation systems aimed at
analyzing informative crowd reviews and prioritizing feedback
to be satisfied in order to increase the apps success.
systems, and network conditions that may not necessarily be
reproducible during development/testing activities.
Consequently, by reading reviews and analyzing the ratings,
development teams are encouraged to improve their apps, for
example, by fixing bugs or by adding commonly requested
features. According to a recent Gartner report’s recommenda-
tion [21], given the complexity of mobile testing “development
teams should monitor app store reviews to identify issues that
are difficult to catch during testing, and to clarify issues that
cause problems on the users’s side”. Moreover, useful app
reviews reflect crowd-based needs and are a valuable source
of comments, bug reports, feature requests, and informal user
experience feedback [7], [8], [14], [19], [22], [27], [29].
In this paper we investigate to what extent app development
So what do we get?
Emails
Versioning
Issue
Reports
Continuous interleaving of structured and
unstructured data
Format not always consistent
Technical/domain terms, abbreviations,
acronyms
Short documents, documents having
different size
Mining unstructured
data: techniques
Different techniques
Techniques based on string / regexp
matching
Information Retrieval Models
Natural Language Parsing
Island and lake parsers
Pattern/Regular
Expression Matching
Pattern/Regular
Expression Matching
Pattern matching:
when and where
• You're not interested in the whole
document content
• Rather, you want to match some keywords
• Very few variants
• Context insensitivity
Basic tools you might
want to use
• Unix command line tools: wc, cut, sort, uniq,
head, tail
• Regular expression engines: grep, sed, awk
• Scripting Languages: Perl, Python
• General purposes programming languages
with regexp support:

java.util.regex (similar to Perl regexps)
Pattern Matching in Perl
• The =~ and !~ operators match a string against a pattern
• /PATTERN/ defines a pattern
• e.g. if($a=~/PATTERN/i){...} or
if($a!~/PATTERN/){...}
• s/PATTERN/REPLACEMENT/egimosx
• e.g. $a=~s/PATTERN/REPLACEMENT/g;
Special Symbols
\w Match a "word" character
(alphanumeric plus "_")
\W Match a non-word character
\s Match a whitespace character
\S Match a non-whitespace character
\d Match a digit character
\D Match a non-digit character
Assertions
\b Match a word boundary
\B Match a non-(word boundary)
\A Match only at beginning of string
\Z Match only at end of string, or before
newline at the end
\z Match only at end of string
Matching issue ids
“fix 367920 setting pop3 messages as junk/not junk
ignored when message quarantining turned on sr=mscott
$note=~/Issue (\d+)/i || $note=~/Issue number:\s+(\d+)/i ||
$note=~/Defect\s+(\d+)/i || $note=~/Fix\s+(\d+)/i
Mapping emails onto
classes/files
• Emails often refer to classes / source files
• One can trace email to file using IR-based
traceability
• In the end simple regexp matching is more
precise and efficient
Alberto Bacchelli, Michele Lanza, Romain Robbes: 

Linking e-mails and source code artifacts. ICSE (1) 2010: 375-384
Example (Apache httpd)
Hi,
I'm running SunOS4.1.3 (with vif kernel hacks) and the Apache server.
www.xyz.com is a CNAME alias for xyz.com.
<VirtualHost www.xyz.com etc., is in my httpd.conf file.
Accessing http://www.xyz.com *fails* (actually, it brings up the default
DocumentRoot).
Accessing http://xyz.com *succeeds*, however, bringing up the Virtual
DocumentRoot.
The get_local_addr() function in the Apache util.c file seems to return
the wrong socket name, but I can't figure out why.
Any help is greatly appreciated!
(Simplified) Perl implementation
#! /usr/bin/perl
@fileNames=`cat filelist.txt`; #content of filelist.txt goes into array @fileNames
%hashFiles=();
foreach $f(@fileNames){
chomp($f);
$hashFiles{$f}++; #populates hashtable %hashFiles
}
while(<>){
$l=$_;
chomp($l);
#matches filename composed of alphanumeric chars, ending with .c and word boundary (\b)
while($l=~/(\w+\.c\b)/){
$name=$1;
if(defined($hashFiles{$name})){ #checks if the filename is in the hash table
print "Mail mapped to file $namen";
}
$l=$'; #assigns post matching to $l
}
}
RegExp: Pros and Cons
✔ Often the easiest solution
✔ Fast and precise enough
✔ Unsupervised
✘ Pattern variants can lead to misclassifications
✘ Context-insensitive
Key Ingredients
• Often finding the right pattern is not
straightforward
• A lot of manual analysis might be needed
• Iterative process
Island and Lake Parsers
General Ideas
Island parser: ignore everything (the water/lake) except the fragments that match specific grammar rules (the islands)
Lake parser: parse everything with a full grammar, tolerating embedded fragments that do not match any rule (the lakes)
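A minimal, regex-based sketch of the island idea in Python (not a grammar-based island parser such as the one in the paper cited below; the stack-trace pattern is a simplified assumption):

import re

# Everything is "water" except fragments matching an island rule. Here the
# island is a Java-style stack trace frame (a hypothetical, simplified pattern;
# real island parsers use grammar rules, not a single regex).
STACK_FRAME = re.compile(r'^\s*at\s+[\w$.]+\([\w$.]*:\d+\)')

def extract_islands(text):
    """Return the lines of `text` matching the island rule, skipping the water."""
    return [line for line in text.splitlines() if STACK_FRAME.match(line)]

if __name__ == "__main__":
    email = """Hi, the app crashes on startup, see below:
    at org.example.Foo.bar(Foo.java:42)
    at org.example.Main.main(Main.java:10)
Any idea? Thanks!"""
    for frame in extract_islands(email):
        print(frame.strip())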
Traditional applications
Leon Moonen: Generating Robust Parsers Using Island Grammars. WCRE 2001: 13-
free text
error log
free text
source code
free text
stack trace
Applications to MUD
Approach
Alberto Bacchelli, Tommaso Dal Sasso, Marco D'Ambros, Michele Lanza: Content classification of
development emails. ICSE 2012: 375-385
Content Classification of Development Emails
Alberto Bacchelli, Tommaso Dal Sasso, Marco D’Ambros, Michele Lanza
REVEAL @ Faculty of Informatics — University of Lugano, Switzerland
Abstract—Emails related to the development of a software
system contain information about design choices and issues
encountered during the development process. Exploiting the
knowledge embedded in emails with automatic tools is chal-
lenging, due to the unstructured, noisy and mixed language
nature of this communication medium. Natural language text
is often not well-formed and is interleaved with languages with
other syntaxes, such as code or stack traces.
We present an approach to classify email content at line
level. Our technique classifies email lines in five categories (i.e.,
text, junk, code, patch, and stack trace) to allow one to sub-
sequently apply ad hoc analysis techniques for each category.
We evaluated our approach on a statistically significant set
of emails gathered from mailing lists of four unrelated open
source systems.
Keywords-Empirical software engineering; Unstructured
Data Mining; Emails
I. Introduction
Software repositories supply information useful for sup-
porting software analysis and program comprehension [17].
Different repositories offer different perspectives on systems:
(e.g., IRC chat logs, mailing lists) require the most care, as
the documents leave complete freedom to the authors. For
example, Bettenburg et al. presented the risks of using email
data without a proper cleaning pre-processing phase [9].
NL documents are usually treated as bags of words–a
count of terms’ occurrences. This simplification is proven
to be effective in the information retrieval (IR) field, where
techniques are tested on well-formed NL documents [25]. In
software engineering, although effective for some tasks (e.g.,
traceability between documents and code [1]), this approach
reduces the quality, reliability, and comprehensibility of the
available information, as NL text is often not well-formed
and is interleaved with languages with different syntaxes:
code fragments, stack traces, patches, etc.
We present a work for advancing the analysis of the
contents of development emails. We argue that we should not
create a single bag with terms indiscriminately coming from
NL parts, code fragments, email signatures, patches, etc. and
treat them equally. We need to recognize every language in
an email to enable techniques exploiting the peculiarities of
Alternative approach
Lines/paragraphs belonging to different
elements contain different proportions of
keywords, special characters, etc.
Alberto Bacchelli, Marco D'Ambros, Michele Lanza: Extracting Source Code from E-Mails. ICPC 2010: 24-33
Extracting Source Code from E-Mails
Alberto Bacchelli, Marco D’Ambros, Michele Lanza
REVEAL@ Faculty of Informatics - University of Lugano, Switzerland
Abstract—E-mails, used by developers and system users to
communicate over a broad range of topics, offer a valuable
source of information. If archived, e-mails can be mined to
support program comprehension activities and to provide views
of a software system that are alternative and complementary
to those offered by the source code.
However, e-mails are written in natural language, and
therefore contain noise that makes it difficult to retrieve the
important data. Thus, before conducting an effective system
analysis and extracting data for program comprehension, it is
necessary to select the relevant messages, and to expose only
the meaningful information.
In this work we focus both on classifying e-mails that hold
fragments of the source code of a system, and on extracting the
source code pieces inside the e-mail. We devised and analyzed a
number of lightweight techniques to accomplish these tasks. To
assess the validity of our techniques, we manually inspected and
annotated a statistically significant number of e-mails from five
unrelated open source software systems written in Java. With
such a benchmark in place, we measured the effectiveness of
each technique in terms of precision and recall.
about a system. E-mails are used to discuss issues ranging
from low-level decisions (e.g., implementation strategies, bug
fixing) up to high-level considerations (e.g., design rationale,
future planning). Our goal is to use this kind of artifacts to
enhance program comprehension and analysis.
We already devised techniques to recover the traceability
links between source code entities and e-mails [4]. However,
such ties can only be retrieved by knowing the entities of
the system in advance, i.e., by having the model of the
system generated from the source code. In this paper we
tackle a different problem: We want to extract source code
fragments from e-mail messages. To do this, we first need
to select e-mails that contain source code fragments, and
then we extract such fragments from the content in which
they are enclosed. Separating source code from natural
language in e-mail messages brings several benefits: The
access to structured data (1) facilitates the reconstruction
Learning hidden
languages
Luigi Cerulo, Massimiliano Di Penta, Alberto Bacchelli, Michele Ceccarelli, Gerardo Canfora:
Irish: A Hidden Markov Model to detect coded information islands in free text. Sci. Comput.
Program. 105: 26-43 (2015)
Hidden Markov Models
A language category is modeled as an HMM where hidden
states are language tokens
Fig. 4. A source code HMM trained on PostgreSQL source code (transition probabilities less than 0.2 are not shown).
syntaxes. For example, in development mailing list messages
usually there is no more than one source code fragment in
a text message. If this is the case, we can assume, a priori,
the transition probability from natural language text to source
code approximated to 1/N, where N is the number of tokens
in the message. It could happens that the transition between
two language syntaxes could never occur. This may be the case
of stack traces and patches. By observing developers’ emails,
it is usual to notice that, after a stack trace is provided, the
resolution patch is introduced with natural language phrases,
such as “Here is the change that we made to fix the problem”.
This leads to a null transition probability from stack trace
to patch code. Other heuristics could be adopted to refine
the estimation of transition probabilities that reflects specific
properties or styles adopted [23].
IV. EMPIRICAL EVALUATION PROCEDURE
To evaluate the effectiveness of our approach we adopt
the Precision and Recall metrics, known respectively also
as Positive Predictive Value (PPV) and Sensitivity [24]. In
particular, for each language syntax, i, we compute:
such as opening parenthesis. Formally, the HMM state space
is defined as:
Q = Σ_TXT ∪ Σ_SRC
where Σ_TXT = {WORD_TXT, KEY_TXT, ...}, and Σ_SRC = {WORD_SRC, KEY_SRC, ...}. Each state emits the corresponding alphabet symbol without subscript label TXT or SRC. For example, the KEY symbol can be emitted by KEY_TXT or KEY_SRC with a probability equal to 1. If the probability for staying in a natural language text is p and the probability of staying in source code text is q, then the transition from a state in Σ_TXT to a state in Σ_SRC is 1 - p, instead the inverse transition is 1 - q. The above defined HMM emits the sequence of symbols observed in a text by evolving through a sequence of states {π_1, π_2, ..., π_i, π_{i+1}, ...} with the transition probabilities t_kl defined as:
t_kl = P(π_i = l | π_{i-1} = k) · p, if k, l ∈ Σ_TXT
t_kl = P(π_i = l | π_{i-1} = k) · q, if k, l ∈ Σ_SRC
t_kl = (1 - p) / |Σ|, if k ∈ Σ_TXT, l ∈ Σ_SRC
t_kl = (1 - q) / |Σ|, if k ∈ Σ_SRC, l ∈ Σ_TXT
and the emission probabilities defined as:
e_kb = 1, if k = b_TXT or k = b_SRC, otherwise 0.
Fig. 2 shows the global HMM composed by two sub-HMM,
one modeling natural language text and another modeling
source code. Fig. 3 and Fig. 4 show their transition proba-
bilities, estimated on the Frankenstein novel and PostgreSQL
source code respectively. We detail in Section III-D how these
probabilities could be estimated.
It is interesting to observe how typical token sequences
are modeled by each HMM. For example, in the source code
of an arithmetic/logic expressions, array indexing, or function
argument enumeration.
Fig. 2. The source code – natural text island HMM.
Fig. 3. A natural text HMM trained on the Frankenstein novel (transition probabilities less than 0.1 are not shown).
C. An extension of the basic model
Transition between
languages…
Language HMMs are connected by a language transition
matrix into a single HMM
WORD modeling typical variable naming convention. Instead,
in the natural language HMM, numbers (NUM) are preceded
just by the dollar symbol ($) indicating currency, and likely
followed by a dot, indicating text item enumerations. Instead,
in the source code HMM it is noticeable that numbers are part
of an arithmetic/logic expressions, array indexing, or function
argument enumeration.
The natural language HMM and the source code HMM are connected by the cross-language transition probabilities 1-p and 1-q (Fig. 2 in the paper).
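A toy sketch of such a model (plain Python, not the Irish implementation): two hidden language categories, TXT and SRC, with invented emission probabilities over token classes, decoded with the Viterbi algorithm.

import math

# A toy two-state HMM (TXT vs. SRC), a simplification of the Irish model:
# hidden states are whole language categories rather than per-token states,
# and the probabilities below are invented for illustration only.
STATES = ["TXT", "SRC"]
p, q = 0.95, 0.95                      # prob. of staying in TXT / in SRC
TRANS = {"TXT": {"TXT": p, "SRC": 1 - p},
         "SRC": {"SRC": q, "TXT": 1 - q}}
EMIT = {"TXT": {"WORD": 0.80, "KEY": 0.02, ";": 0.01, "(": 0.02, ")": 0.02,
                "NUM": 0.03, "NEWLN": 0.10},
        "SRC": {"WORD": 0.40, "KEY": 0.15, ";": 0.15, "(": 0.10, ")": 0.10,
                "NUM": 0.05, "NEWLN": 0.05}}

def viterbi(tokens):
    """Return the most likely TXT/SRC label sequence for a list of token classes."""
    # log-space dynamic programming table: best[i][s] = (score, backpointer)
    best = [{s: (math.log(0.5) + math.log(EMIT[s].get(tokens[0], 1e-6)), None)
             for s in STATES}]
    for tok in tokens[1:]:
        col = {}
        for s in STATES:
            prev, score = max(
                ((r, best[-1][r][0] + math.log(TRANS[r][s])) for r in STATES),
                key=lambda x: x[1])
            col[s] = (score + math.log(EMIT[s].get(tok, 1e-6)), prev)
        best.append(col)
    # backtrack from the best final state
    state = max(STATES, key=lambda s: best[-1][s][0])
    path = [state]
    for col in reversed(best[1:]):
        state = col[state][1]
        path.append(state)
    return list(reversed(path))

tokens = ["WORD", "WORD", "WORD", "KEY", "(", "WORD", ")", ";", "WORD", "WORD"]
print(list(zip(tokens, viterbi(tokens))))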
Information Retrieval
Models
Indexing-based IR
Documents and queries are both indexed (queries possibly after query analysis) into representations, which are then compared during query evaluation.
Document indexing
Term extraction → Stop word removal → Stemming → Term indexing → Algebraic method
The above process may vary a lot...
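A rough sketch of such an indexing pipeline in Python (the stop-word list and the crude suffix stripper are placeholders, not what any particular tool uses):

import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "for"}  # tiny placeholder list

def stem(word):
    # extremely naive suffix stripping, standing in for a real stemmer (e.g., Porter)
    for suffix in ("ing", "ers", "er", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index(document):
    """Term extraction -> stop word removal -> stemming -> term counts."""
    terms = re.findall(r"[A-Za-z]+", document.lower())
    terms = [t for t in terms if t not in STOP_WORDS]
    return Counter(stem(t) for t in terms)

print(index("Fix for issue 4517: changeability on association ends not displayed properly"))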
Tokenization in SE
• Often identifiers are compound words

getPointer, getpointer
• Sometimes they contain abbreviations and
acronyms

getPntr, computeIPAddr
Camel Case splitter
getWords → get Words
openIP → open IP
Pros: simple and fast
Cons: convention not adopted everywhere
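A possible camel-case splitting sketch in Python (regex-based; real splitters, such as Samurai on the next slide, handle more cases):

import re

def camel_case_split(identifier):
    """Split an identifier on lower-to-upper transitions and on underscores."""
    parts = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", identifier)   # getWords -> get Words
    parts = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1 \2", parts)     # IPAddr  -> IP Addr
    return parts.replace("_", " ").split()

print(camel_case_split("getWords"))      # ['get', 'Words']
print(camel_case_split("openIP"))        # ['open', 'IP']
print(camel_case_split("computeIPAddr")) # ['compute', 'IP', 'Addr']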
Samurai
• Assumption 1: A substring composing an
identifier is also likely to be used elsewhere
• Assumption 2: Given two possible splits,
prefer the one that contains identifiers
having a high frequency in the program
Eric Enslen, Emily Hill, Lori L. Pollock, K. Vijay-Shanker: Mining source code to
automatically split identifiers for software analysis. MSR 2009: 71-80
More advanced
approaches
• Tidier/Tris, and Normalize
• Allow expanding abbreviations and
acronyms

getPntr → get pointer
• Based on more advanced techniques, e.g.
speech recognition
Dawn Lawrie, David Binkley: Expanding identifiers to normalize source code vocabulary. ICSM 2011: 113-122
Latifa Guerrouj, Massimiliano Di Penta, Giuliano Antoniol and Yann-Gaël Guéhéneuc. TIDIER: an identifier splitting
approach using speech recognition techniques. Journal of Software: Evolution and Process. Vol. 25, Issue 6, June
2013, p. 575–599
Stop words for SE
• English not enough
• Programming language keywords may or
may not be included
• What about standard API?
• Specific, recurring words of some
documents (e.g., of bug reports or of
emails)
Use of smoothing filters
Andrea De Lucia, Massimiliano Di Penta, Rocco Oliveto, Annibale Panichella, Sebastiano Panichella: Applying a
smoothing filter to improve IR-based traceability recovery processes: An empirical investigation. Information &
Software Technology 55(4): 741-754 (2013)
Applying a smoothing filter to improve IR-based traceability recovery processes: An
empirical investigation
Andrea De Lucia a, Massimiliano Di Penta b,*, Rocco Oliveto c, Annibale Panichella a, Sebastiano Panichella b
a University of Salerno, Via Ponte don Melillo, 84084 Fisciano (SA), Italy
b University of Sannio, Viale Traiano, 82100 Benevento, Italy
c University of Molise, C.da Fonte Lappone, 86090 Pesche (IS), Italy
a r t i c l e  i n f o
Article history: Available online 24 August 2012
Keywords: Software traceability; Information retrieval; Smoothing filters; Empirical software engineering
a b s t r a c t
Context: Traceability relations among software artifacts often tend to be missing, outdated, or lost. For this reason, various traceability recovery approaches—based on Information Retrieval (IR) techniques—have been proposed. The performances of such approaches are often influenced by "noise" contained in software artifacts (e.g., recurring words in document templates or other words that do not contribute to the retrieval itself).
Aim: As a complement and alternative to stop word removal approaches, this paper proposes the use of a smoothing filter to remove "noise" from the textual corpus of artifacts to be traced.
Method: We evaluate the effect of a smoothing filter in traceability recovery tasks involving different kinds of artifacts from five software projects, and applying three different IR methods, namely Vector Space Models, Latent Semantic Indexing, and Jensen–Shannon similarity model.
Results: Our study indicates that, with the exception of some specific kinds of artifacts (i.e., tracing test cases to source code) the proposed approach is able to significantly improve the performances of traceability recovery, and to remove "noise" that simple stop word filters cannot remove.
Conclusions: The obtained results not only help to develop traceability recovery approaches able to work in presence of noisy artifacts, but also suggest that smoothing filters can be used to improve performances of other software engineering approaches based on textual analysis.
© 2012 Elsevier B.V. All rights reserved.
1. Introduction
In recent and past years, textual analysis has been successfully applied to several kinds of software engineering tasks, for example impact analysis [1], clone detection [2], feature location [3,4], refactoring [5], definition of new cohesion and coupling metrics [6,7], software quality assessment [8–11], and, last but not least, traceability recovery [12–14].
Such a kind of analysis demonstrated to be effective and useful for various reasons:
• it is lightweight and to some extent independent on the programming language, as it does not require a full source code parsing, but only its tokenization and (for some applications) lexical analysis;
• it provides information (e.g., brought in comments and identifiers) complementary to what structural or dynamic analysis can provide [6,7];
• it models software artifacts as textual documents, thus can be applied to different kinds of artifacts (i.e., it is not limited to the source code) and, above all, can be used to perform combined analysis of different kinds of artifacts (e.g., requirements and source code), as in the case of traceability recovery.
Textual analysis has also some weaknesses, and poses challenges for researchers. It strongly depends on the quality of the lexicon: a bad lexicon often means inaccurate—if not completely wrong—results. There are two common problems in the textual analysis of software artifacts. The first is represented by the presence of inconsistent terms in related documents (e.g., require-
This paper is an extension of the work "Improving IR-based Traceability Recovery Using Smoothing Filters" appeared in the Proceedings of the 19th IEEE International Conference on Program Comprehension, Kingston, ON, Canada, pp. 21–
Information and Software Technology 55 (2013) 741–754
Contents lists available at SciVerse ScienceDirect
Information and Software Technology
journal homepage: www.elsevier.com/locate/infsof
Smoothing Filters
Source documents s1, s2, s3, …, sk and target documents t1, t2, t3, …, tz
S = mean source vector, T = mean target vector
Filtered source set = source documents - S
Filtered target set = target documents - T
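A toy sketch of this filtering step (my reading of the diagram above: subtract the set's mean vector from each document vector; numpy is assumed):

import numpy as np

# rows = documents, columns = terms (toy numbers)
source = np.array([[2.0, 0.0, 1.0],
                   [1.0, 1.0, 0.0],
                   [3.0, 0.0, 2.0]])
target = np.array([[0.0, 2.0, 1.0],
                   [1.0, 2.0, 0.0]])

S = source.mean(axis=0)          # mean source vector (the shared "noise")
T = target.mean(axis=0)          # mean target vector
filtered_source = source - S     # smoothing filter applied to the source set
filtered_target = target - T     # smoothing filter applied to the target set
print(filtered_source)
print(filtered_target)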
From stemming to
lexical databases…
• Lexical database where concepts are
organised in a semantic network
• Organize lexical information in terms of
word meaning, rather than word form
• Interfaces for Java, Prolog, Lisp, Python, Perl,
C#
WordNet
(http://wordnet.princeton.edu)
Word Categories
Nouns
• Topical hierarchies with lexical inheritance
(hyponymy/hypernymy and meronymy/holonymy).
Verbs
• Entailment relations
Adjectives and adverbs
• Bipolar opposition relations (antonymy)
Function words (articles, prepositions, etc.)
• Simply omitted
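A small illustration of querying these relations with NLTK's WordNet interface (an assumption: NLTK is not mentioned on the slides, and the exact relations returned depend on the installed WordNet version):

from nltk.corpus import wordnet as wn   # assumes nltk and its WordNet corpus are installed

# Hypernyms (generalizations) of the first noun sense of "length"
length = wn.synsets("length", pos=wn.NOUN)[0]
print([h.name() for h in length.hypernyms()])

# Antonyms (opposite meaning), reachable through lemmas, if any are recorded
for lemma in wn.synsets("closing")[0].lemmas():
    print(lemma.name(), [a.name() for a in lemma.antonyms()])

# Part/whole relations (meronymy) of the first noun sense of "wheel"
wheel = wn.synsets("wheel", pos=wn.NOUN)[0]
print([m.name() for m in wheel.part_meronyms()])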
Application: mining and
classifying identifier renamings
Generalization/specialization: 

thrownExceptionSize → thrownExceptionLength
Opposite meaning: 

hasClosingBracket → hasOpeningBracket
Whole/part relation: 

filename → extension
Venera Arnaoudova, Laleh Mousavi Eshkevari, Massimiliano Di Penta, Rocco Oliveto, Giuliano
Antoniol, Yann-Gaël Guéhéneuc: REPENT: Analyzing the Nature of Identifier Renamings. IEEE
Trans. Software Eng. 40(5): 502-532 (2014)
Challenge
• Word relations in WordNet do not fully
comprise relations relevant in the IT
domain
• Alternative approaches try to “learn” these
relations
Jinqiu Yang, Lin Tan: SWordNet: Inferring semantically related words from software context. Empirical
Software Engineering 19(6): 1856-1886 (2014)
Term Indexing: Assigning
Weights to Terms
• Binary Weights: terms weighted as 0 (not
appearing) or 1 (appearing)
• (Raw) term frequency: terms weighted by the
number of occurrences/frequency in the
document
• tf-idf: want to weight terms highly if they are
frequent in relevant documents … BUT ...
infrequent in the collection as a whole
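A brief sketch of the three weighting schemes above on a toy corpus (plain Python; the log-scaled idf shown is one common variant among several):

import math

docs = [["fix", "issue", "login", "login"],
        ["add", "login", "screen"],
        ["fix", "typo", "readme"]]

def binary(doc, term):                  # 1 if the term appears, 0 otherwise
    return 1 if term in doc else 0

def tf(doc, term):                      # raw term frequency
    return doc.count(term)

def idf(term):                          # rare in the collection -> high idf
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

def tf_idf(doc, term):
    return tf(doc, term) * idf(term)

for term in ("login", "fix", "typo"):
    print(term, binary(docs[0], term), tf(docs[0], term), round(tf_idf(docs[0], term), 2))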
What to use?
tf-idf might be useful for text retrieval
but…
not (necessarily) the best choice to identify
keywords, e.g. for classification purposes
Algebraic Models
• Aim at providing representations to
documents and at comparing/clustering
documents
• Examples
• Vector Space Model
• Latent Semantic Indexing (LSI)
• latent Dirichlet allocation (LDA)
TheVector Space Model (VSM)
• Assume t distinct terms remain after preprocessing
• call them index terms or the vocabulary.
• These “orthogonal” terms form a vector space.
• Dimension = t = |vocabulary|
• Each term, i, in a document or query, j, is given a real-
valued weight, wij.
• Both documents and queries are expressed as t-
dimensional vectors:
dj = (w1j, w2j, …, wtj)
VSM Representation
Terms T1, T2, T3 as axes; example document and query vectors:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3
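Working out the cosine similarities for the vectors above (a plain-Python sketch; it shows that Q, which only contains T3, is closer to D1 than to D2):

import math

D1 = (2, 3, 5)          # 2*T1 + 3*T2 + 5*T3
D2 = (3, 7, 1)
Q  = (0, 0, 2)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

print(round(cosine(D1, Q), 3))  # ~0.811
print(round(cosine(D2, Q), 3))  # ~0.130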
Latent Semantic Indexing
(LSI)
Overcomes limitations ofVSM (synonymy,
polysemy, considering words occurring in
documents as independent events)
Shifting from a document-term space towards a
document-concept space
Dumais, S. T., Furnas, G. W., Landauer, T. K. and Deerwester, S. (1988), Using latent semantic
analysis to improve information retrieval. In Proceedings of CHI'88: Conference on Human
Factors in Computing, New York: ACM, 281-285.
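A possible LSI sketch in Python with scikit-learn (an assumption: the slides later use the R lsa package instead; here TruncatedSVD over the term-document matrix plays the role of the LSI decomposition, and the corpus and k=2 are made up):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = ["user cannot login after update",
        "login screen freezes on start",
        "fix typo in user manual"]
tdm = CountVectorizer().fit_transform(docs)             # documents x terms
lsi = TruncatedSVD(n_components=2).fit_transform(tdm)   # documents x concepts
print(cosine_similarity(lsi))                           # document similarities in concept space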
latent Dirichlet Allocation
(LDA)
• Probabilistic model
• Documents are treated as distributions of
topics
• Topics are distributions of words
D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet Allocation,” The Journal of
Machine Learning Research, vol. 3, pp. 993–1022, 2003.
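A minimal LDA sketch, again assuming scikit-learn (the corpus, number of topics, and iteration count are illustrative; as discussed a few slides later, parameters such as k, α, β and the number of iterations need careful calibration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["parse xml configuration file",
        "read xml file from disk",
        "render button widget on screen",
        "resize widget when screen rotates"]
vec = CountVectorizer()
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, max_iter=50, random_state=0).fit(X)
print(lda.transform(X))           # per-document topic distributions
terms = vec.get_feature_names_out()   # requires scikit-learn >= 1.0
for topic in lda.components_:     # per-topic word weights
    print([terms[i] for i in topic.argsort()[-3:]])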
Applications
Alessandra Gorla, Ilaria Tavecchia, Florian Gross, Andreas Zeller: Checking app behavior
against app descriptions. ICSE 2014: 1025-1035
Checking App Behavior Against App Descriptions
Alessandra Gorla · Ilaria Tavecchia* · Florian Gross · Andreas Zeller
Saarland University
Saarbrücken, Germany
{gorla, tavecchia, fgross, zeller}@cs.uni-saarland.de
ABSTRACT
How do we know a program does what it claims to do? After clus-
tering Android apps by their description topics, we identify outliers
in each cluster with respect to their API usage. A “weather” app that
sends messages thus becomes an anomaly; likewise, a “messaging”
app would typically not be expected to access the current location.
Applied on a set of 22,500+ Android applications, our CHABADA
prototype identified several anomalies; additionally, it flagged 56%
of novel malware as such, without requiring any known malware
patterns.
Categories and Subject Descriptors
D.4.6 [Security and Protection]: Invasive software
General Terms
Security
Keywords
Android, malware detection, description analysis, clustering
1. INTRODUCTION
Checking whether a program does what it claims to do is a long-
standing problem for developers. Unfortunately, it now has become
a problem for computer users, too. Whenever we install a new app,
1. App collection → 2. Topics → 3. Clusters → 4. Used APIs → 5. Outliers
Figure 1: Detecting applications with unadvertised behavior.
Starting from a collection of “good” apps (1), we identify their
description topics (2) to form clusters of related apps (3). For
each cluster, we identify the sensitive APIs used (4), and can
then identify outliers that use APIs that are uncommon for that
cluster (5).
• An app that sends a text message to a premium number to
raise money is suspicious? Maybe, but on Android, this is a
Need to properly
calibrate the techniques
• Many IR techniques require a careful
calibration of many parameters
• LSI number of concepts (k)
• LDA k, α, β, number of iterations (n)
• Without that, the performances can be
sub-optimal
… and beyond that…
• Should I use stop word removal? Which
one?
• Stemming?
• Which weighting scheme? tf? tf-idf? others?
• Which technique? VSM? LSI? LDA?
Approaches (I)
Sugandha Lohar, Sorawit Amornborvornwong, Andrea Zisman, Jane Cleland-Huang: Improving trace
accuracy through data-driven configuration and composition of tracing features. ESEC/SIGSOFT FSE
2013: 378-388
Improving Trace Accuracy through Data-Driven
Configuration and Composition of Tracing Features
Sugandha Lohar, Sorawit
Amornborvornwong
DePaul University
Chicago, IL, USA
sonul.123@gmail.com,
sorambww@hotmail.com
Andrea Zisman
Department of Computing
The Open University
Milton Keynes, MK7 6AA, UK
andrea.zisman@open.ac.uk
Jane Cleland-Huang
DePaul University
Systems and Requirements
Engineering Center
Chicago, IL, USA
jhuang@cs.depaul.edu
ABSTRACT
Software traceability is a sought-after, yet often elusive qual-
ity in large software-intensive systems primarily because the
cost and effort of tracing can be overwhelming. State-of-the-
art solutions address this problem through utilizing trace re-
trieval techniques to automate the process of creating and
maintaining trace links. However, there is no simple one-
size-fits-all solution to trace retrieval. As this paper will
show, finding the right combination of tracing techniques
can lead to significant improvements in the quality of gener-
ated links. We present a novel approach to trace retrieval in
which the underlying infrastructure is configured at runtime
to optimize trace quality. We utilize a machine-learning ap-
proach to search for the best configuration given an initial
training set of validated trace links, a set of available tracing
techniques specified in a feature model, and an architecture
capable of instantiating all valid configurations of features.
critical software-intensive systems [4]. It is used to capture
relationships between requirements, design, code, test-cases,
and other software engineering artifacts, and support crit-
ical activities such as impact analysis, compliance verifica-
tion, test-regression selection, and safety-analysis. As such,
traceability is mandated in safety-critical domains includ-
ing the automotive, aeronautics, and medical device indus-
tries. Unfortunately, tracing costs can grow excessively high
if trace links have to be created and maintained manually
by human users, and as a result, practitioners often fail to
establish adequate traceability in a project [27].
To address these needs, numerous researchers have devel-
oped or adopted algorithms that semi-automate the process
of creating trace links. These algorithms include the Vec-
tor Space Model (VSM) [23], Probabilistic approaches [14],
Latent Semantic Indexing [2, 12], Latent Dirichlet Alloca-
tion (LDA) [13], rule-based approaches that identify rela-
tionships across project artifacts [35], and approaches that
Discussion
• Supervised, task specific
• Model parameters varied through Genetic
Algorithms
• You train the model on a training set
• Maximize precision/recall for a given task
Approaches (II)
Annibale Panichella, Bogdan Dit, Rocco Oliveto, Massimiliano Di Penta, Denys Poshyvanyk, Andrea
De Lucia: How to effectively use topic models for software engineering tasks? an approach based on
genetic algorithms. ICSE 2013: 522-531
How to Effectively Use Topic Models for
Software Engineering Tasks?
An Approach Based on Genetic Algorithms
Annibale Panichella1, Bogdan Dit2, Rocco Oliveto3,
Massimiliano Di Penta4, Denys Poshyvanyk2, Andrea De Lucia1
1
University of Salerno, Fisciano (SA), Italy
2
The College of William and Mary, Williamsburg, VA, USA
3
University of Molise, Pesche (IS), Italy
4
University of Sannio, Benevento, Italy
Abstract—Information Retrieval (IR) methods, and in partic-
ular topic models, have recently been used to support essential
software engineering (SE) tasks, by enabling software textual
retrieval and analysis. In all these approaches, topic models have
been used on software artifacts in a similar manner as they
were used on natural language documents (e.g., using the same
settings and parameters) because the underlying assumption was
that source code and natural language documents are similar.
However, applying topic models on software data using the same
settings as for natural language text did not always produce the
expected results.
Recent research investigated this assumption and showed that
source code is much more repetitive and predictable as compared
to the natural language text. Our paper builds on this new
fundamental finding and proposes a novel solution to adapt,
configure and effectively use a topic modeling technique, namely
Latent Dirichlet Allocation (LDA), to achieve better (acceptable)
performance across various SE tasks. Our paper introduces a
novel solution called LDA-GA, which uses Genetic Algorithms
(GA) to determine a near-optimal configuration for LDA in
the context of three different SE tasks: (1) traceability link
recovery, (2) feature location, and (3) software artifact labeling.
The results of our empirical studies demonstrate that LDA-GA
is able to identify robust LDA configurations, which lead to a
higher accuracy on all the datasets for these SE tasks as compared
proposed to support software engineering tasks: feature lo-
cation [4], change impact analysis [5], bug localization [6],
clone detection [7], traceability link recovery [8], [9], expert
developer recommendation [10], code measurement [11], [12],
artifact summarization [13], and many others [14], [15], [16].
In all these approaches, LDA and LSI have been used on
software artifacts in a similar manner as they were used on
natural language documents (i.e., using the same settings,
configurations and parameters) because the underlying as-
sumption was that source code (or other software artifacts)
and natural language documents exhibit similar properties.
More specifically, applying LDA requires setting the number
of topics and other parameters specific to the particular LDA
implementation. For example, the fast collapsed Gibbs sam-
pling generative model for LDA requires setting the number
of iterations n and the Dirichlet distribution parameters α
and β [17]. Even though LDA was successfully used in the
IR and natural language analysis community, applying it on
software data, using the same parameter values used for natural
language text, did not always produce the expected results
[18]. As in the case of machine learning and optimization
Choose a random
population of LDA
parameters
LDA
Determine fitness of each
chromosome (individual)
Fitness = silhouette coefficient
Choose a random
population of LDA
parameters
LDA
Determine fitness of each
chromosome (individual)
Select next generation
Crossover, mutation
Crossover
Mutation
Choose a random
population of LDA
parameters
LDA
Determine fitness of each
chromosome (individual)
Select next generation
Crossover, mutation
Next generation
Best model for first generation
Best model for last generation
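To make the GA loop above concrete, here is a minimal R sketch (not the original LDA-GA implementation): it searches only over the number of topics and uses the mean silhouette coefficient of the clustering induced by each document's dominant topic as fitness, whereas the actual approach also evolves α, β and the number of Gibbs iterations through selection, crossover and mutation. The docs vector is a hypothetical corpus.

# Sketch: tune the number of LDA topics with silhouette-based fitness
# (a plain search over k stands in for the GA loop)
library(tm)           # corpus handling
library(topicmodels)  # LDA with Gibbs sampling
library(cluster)      # silhouette()

dtm <- DocumentTermMatrix(Corpus(VectorSource(docs)))

fitness <- function(k) {
  lda <- LDA(dtm, k = k, method = "Gibbs", control = list(iter = 500))
  theta <- posterior(lda)$topics           # document-topic probabilities
  clusters <- apply(theta, 1, which.max)   # dominant topic = cluster label
  if (length(unique(clusters)) < 2) return(-1)
  mean(silhouette(clusters, dist(theta))[, "sil_width"])
}

candidates <- 2:10
scores <- sapply(candidates, fitness)
best_k <- candidates[which.max(scores)]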
Supporting technology
• Integrated suite of software facilities for data
manipulation, calculation and graphical display
• VSM and LSI implemented in the lsa package
• topicmodels and lda also available
• many other text processing packages…
R (www.r-project.org)
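If you want to follow along, the packages mentioned in this deck are available on CRAN and can be installed once (tm is added here as an assumption, for corpus handling in the later sketches):

install.packages(c("lsa", "tm", "topicmodels", "lda"))
library(lsa)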
Creating a t-d matrix
Creating a term-document matrix from a directory
tm <- textmatrix("/Users/Max/mydocs")
Syntax:
textmatrix(mydir, stemming=FALSE,
language="english",
minWordLength=2, minDocFreq=1,
stopwords=NULL, vocabulary=NULL)
Applying a weighting schema
• lw_tf(m): returns a completely unmodified n times m matrix
• lw_bintf(m): returns binary values of the n times m matrix.
• gw_normalisation(m): returns a normalised n times m matrix
• gw_gfidf(m): returns the global frequency multiplied with idf
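For instance, a tf-idf style weighting can be obtained by combining a local and a global scheme; the lsa package also provides lw_logtf() and gw_idf() for this (tm here is the term-document matrix built with textmatrix() on the previous slide):

tm_weighted <- lw_logtf(tm) * gw_idf(tm)   # log term frequency x inverse document frequency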
Applying LSI
• Decomposes tm and reduces the space of
concepts to 2
lspace <- lsa(tm, dims=2)
• Converts lspace into a textmatrix
m2 <- as.textmatrix(lspace)
Computing the cosine
cosine(v1,v2)
or
cosine(matrix)
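Putting the pieces together, a minimal end-to-end sketch in R (the directory name is hypothetical; the weighting step is optional):

library(lsa)
tm <- textmatrix("/path/to/mydocs", stemming=TRUE)
tmw <- lw_logtf(tm) * gw_idf(tm)      # weighting
lspace <- lsa(tmw, dims=2)            # LSI decomposition
m2 <- as.textmatrix(lspace)           # back to a term-document matrix
cosine(m2)                            # pairwise document similarities in the LSI space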
Application: Identifying
duplicate bug reports
Problem: people often report bugs someone
has already reported
Solution (in a nutshell):
• Compute textual similarity among bug
reports
• Match (where available) stack traces
Xiaoyin Wang, Lu Zhang, Tao Xie, John Anvik, Jiasu Sun: An approach to
detecting duplicate bug reports using natural language and execution
information. ICSE 2008: 461-470
Example (Eclipse)
Bug 123 - Synchronize View: files nodes in
tree should provide replace with action
This scenario happens very often to me
- I change a file in my workspace to test
something
- I do a release
- In the release I encounter that there is no
need to release the file
- I want to replace it from the stream but a
corresponding action is missing
Now I have to go to the navigator to do the
job.
t <- textmatrix("/Users/mdipenta/docs", stemming=TRUE)
cosine(t)
d1 d2
d1 1.0000000 0.6721491
d2 0.6721491 1.0000000
Bug 4934 - DCR: Replace from Stream in
Release mode
Sometimes I touch files, without
intention. These files are then unwanted
outgoing changes. To fix it, I have go back to
the package view/navigator and replace the
file with the stream. After this a new
synchronize is needed, for that I have a clean
view.
It would be very nice to have the 'replace
with stream' in the outgoing tree.
Or 'Revert'?
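The similarity used by Wang et al. combines natural-language information with execution information; as a rough illustration (not their exact heuristic), one could merge the textual cosine computed above with the overlap of stack frames, when stack traces are attached to the reports. All names below are hypothetical.

# Combine text similarity with stack-trace overlap (illustrative only)
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

combined_similarity <- function(cos_text, frames1, frames2, w = 0.5) {
  if (length(frames1) == 0 || length(frames2) == 0) return(cos_text)
  w * cos_text + (1 - w) * jaccard(frames1, frames2)
}

combined_similarity(0.67, c("Sync.run", "Repo.replace"), c("Repo.replace", "Repo.commit"))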
Other tools
Lucene http://lucene.apache.org
RapidMiner www.rapidminer.com
Mallet http://mallet.cs.umass.edu
Choosing the IR model
Careful choice
• In case you don’t want a GA to choose it for
you…
• The most exotic model is not necessarily the
best one
• Simple models have advantages (e.g.
scalability)
• Try to experiment with multiple models
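For example, with the lsa package you can score the same corpus under a plain VSM and under LSI, and check which ranking works better for your task (tm as built earlier; dimensionality left to the package default):

sim_vsm <- cosine(tm)                       # plain vector space model
sim_lsi <- cosine(as.textmatrix(lsa(tm)))   # LSI, default dimensionality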
IR methods:
Pros and Cons
✔ Can be used to match similar (related) discussions
✔ Some techniques deal with issues such as
homonymy and polysemy
✘ Software-related discussion can be very noisy
✘ Noise differs between different artifacts
(bug reports, commit notes, code)
✘ You have less control than with regular expressions
✘ Do not capture dependencies in sentences,
e.g., “this bug will not be fixed”
Natural Language
Parsing
Technology
Stanford Natural Language Parser
http://nlp.stanford.edu
Dependency Visualizer: DependenSee
Grammar Browser: GrammarScope
(ROOT
(S
(NP (DT this) (NN bug))
(VP (MD will) (RB not)
(VP (VB be)
(VP (VBN fixed))))))
det(bug-2, this-1)
nsubjpass(fixed-6, bug-2)
aux(fixed-6, will-3)
neg(fixed-6, not-4)
auxpass(fixed-6, be-5)
root(ROOT-0, fixed-6)
Example
this bug will not be fixed
DT NN MD RB VB VBN
det aux
nsubjpass
neg
auxpass
Is it applicable to analyze
source code identifiers?
• It could be, once identifiers are split into
compound words
• Useful to tag part-of-speech
• Results may not be very precise because
identifiers are not natural language
sentences
Rules (Abebe and Tonella, 2010)
Surafel Lemma Abebe, Paolo Tonella: Natural Language Parsing of Program
Element Names for Concept Extraction. ICPC 2010: 156-159
Example:
OpenStructure → open structure
Let’s insert an artificial subject and an
article:
subjects open the structure
(ROOT
(NP (JJ open) (NN structure)))
amod(structure-2, open-1)
root(ROOT-0, structure-2)
(ROOT
(S
(NP (NNS subjects))
(VP (VBP open)
(NP (DT the) (NN structure)))))
nsubj(open-2, subjects-1)
root(ROOT-0, open-2)
det(structure-4, the-3)
dobj(open-2, structure-4)
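As a rough illustration of the splitting step that precedes parsing, a minimal R sketch (camelCase and underscores only; real splitters also handle abbreviations and other naming conventions):

split_identifier <- function(id) {
  s <- gsub("([a-z0-9])([A-Z])", "\\1 \\2", id)   # OpenStructure -> Open Structure
  s <- gsub("_+", " ", s)                         # open_structure -> open structure
  tolower(unlist(strsplit(s, "\\s+")))
}
split_identifier("OpenStructure")   # "open" "structure"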
Applications
Rahul Pandita, Xusheng Xiao, Hao Zhong, Tao Xie, Stephen Oney, Amit M. Paradkar:
Inferring method specifications from natural language API descriptions. ICSE 2012: 815-825
Inferring Method Specifications from Natural Language API Descriptions
Rahul Pandita⇤, Xusheng Xiao⇤, Hao Zhong†, Tao Xie⇤, Stephen Oney‡, and Amit Paradkar§
⇤Department of Computer Science, North Carolina State University, Raleigh, USA
†Laboratory for Internet Software Technologies, Institute of Software, Chinese Academy of Sciences, Beijing, China
‡Human-Computer Interaction Institute, Carnegie Mellon University, Pittsburgh, USA
§I.B.M. T. J. Watson Research Center, Hawthorne, NY, USA
{rpandit, xxiao2, txie}@ncsu.edu, zhonghao@itechs.iscas.ac.cn, soney@cs.cmu.edu, paradkar@us.ibm.com
Abstract—Application Programming Interface (API) docu-
ments are a typical way of describing legal usage of reusable
software libraries, thus facilitating software reuse. However,
even with such documents, developers often overlook some
documents and build software systems that are inconsistent
with the legal usage of those libraries. Existing software
verification tools require formal specifications (such as code
contracts), and therefore cannot directly verify the legal usage
described in natural language text in API documents against
code using that library. However, in practice, most libraries
do not come with formal specifications, thus hindering tool-
based verification. To address this issue, we propose a novel
approach to infer formal specifications from natural language
text of API documents. Our evaluation results show that our
approach achieves an average of 92% precision and 93%
recall in identifying sentences that describe code contracts from
more than 2500 sentences of API documents. Furthermore, our
results show that our approach has an average 83% accuracy
in inferring specifications from over 1600 sentences describing
code contracts.
I. INTRODUCTION
Programming Interface (API) documents. Typically, such
documents are provided to client-code developers through
online access, or are shipped with the API code. For exam-
ple, J2EE’s API documentation3
is one of the most popular
API documents.
Even with such documents, client-code developers often
overlook some API documents and use methods in API
libraries incorrectly [22]. Since these documents are written
in natural language, existing tools cannot verify legal usage
described in a library’s API documents against the client
code of that library. One possible solution is to manually
write code contracts based on the specifications described
in API documents. However, due to a large number of
sentences in API documents, manually hunting for contract
sentences and writing code contracts for the API library
is prohibitively time consuming and labor intensive. For
instance, the File class of the C# .NET Framework has
around 800 sentences. Moreover, not all of these sentences
Applications
How Can I Improve My App? Classifying User
Reviews for Software Maintenance and Evolution
S. Panichella*, A. Di Sorbo†, E. Guzman‡, C. A. Visaggio†, G. Canfora† and H. C. Gall*
*University of Zurich, Switzerland
†University of Sannio, Benevento, Italy
‡Technische Universität München, Garching, Germany
panichella@ifi.uzh.ch, disorbo@unisannio.it, emitza.guzman@mytum.de, {visaggio,canfora}@unisannio.it, gall@ifi.uzh.ch
Abstract—App Stores, such as Google Play or the Apple Store,
allow users to provide feedback on apps by posting review
comments and giving star ratings. These platforms constitute a
useful electronic mean in which application developers and users
can productively exchange information about apps. Previous
research showed that users feedback contains usage scenarios,
bug reports and feature requests, that can help app developers to
accomplish software maintenance and evolution tasks. However,
in the case of the most popular apps, the large amount of received
feedback, its unstructured nature and varying quality can make
the identification of useful user feedback a very challenging task.
In this paper we present a taxonomy to classify app reviews
into categories relevant to software maintenance and evolution,
as well as an approach that merges three techniques: (1)
Natural Language Processing, (2) Text Analysis and (3) Sentiment
Analysis to automatically classify app reviews into the proposed
categories. We show that the combined use of these techniques
allows to achieve better results (a precision of 75% and a recall
of 74%) than results obtained using each technique individually
(precision of 70% and a recall of 67%).
Index Terms—User Reviews, Mobile Applications, Natural
Language Processing, Sentiment Analysis, Text classification
form of unstructured text that is difficult to parse and analyze.
Thus, developers and analysts have to read a large amount
of textual data to become aware of the comments and needs
of their users [10]. In addition, the quality of reviews varies
greatly, from useful reviews providing ideas for improvement
or describing specific issues to generic praises and complaints
(e.g. “You have to be stupid to program this app”, “I love
it!”, “this app is useless”).
To handle this problem Chen et al. [10] proposed AR-
Miner, an approach to help app developers discover the most
informative user reviews. Specifically, the authors use: (i) text
analysis and machine learning to filter out non-informative
reviews and (ii) topic analysis to recognize topics treated
in the reviews classified as informative. In this paper, we
argue that text content represents just one of the possible
dimensions that can be explored to detect informative reviews
from a software maintenance and evolution perspective. In
particular, topic analysis techniques are useful to discover
@ICSME 2015 - Thursday, 13.50 - Mobile applications
Natural Language
Parsing: Pros and Cons
✔ Can identify dependencies between words
✔ Accurate part-of-speech analysis, better than
lexical databases
✘ Application to software entities (e.g.
identifiers) may be problematic
✘ Difficult to process data
✘ NL parsing is difficult, especially when the
corpus is noisy
Summary
What technique?
• Need to match precise elements
• Separate structured and unstructured data
• Account for the whole textual corpus of a document
• Compare documents
• Identify “topics” discussed in documents
• Not just bag of words, but also dependencies,
parts of speech, negations
Take aways
Use structure when available
Carefully (empirically) choose the most
suitable technique
Out of the box configuration might not
work for you
Manual analysis cannot be replaced
Matching issue ids
“fix 367920 setting pop3 messages as junk/not junk
ignored when message quarantining turned on sr=mscott”
$note=~/Issue (\d+)/i || $note=~/Issue number:\s+(\d+)/i ||
$note=~/Defect\s+(\d+)/i || $note=~/Fix\s+(\d+)/i
Simple yet powerful technique to link commit
notes to issue reports
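The same pattern can be reproduced in R if you are already processing the logs there (the regular expression below is illustrative, not the exact one used on this data):

note <- "fix 367920 setting pop3 messages as junk/not junk ignored when message quarantining turned on sr=mscott"
m <- regmatches(note, regexpr("(issue( number:)?|defect|fix)\\s*#?\\s*([0-9]+)", note, ignore.case=TRUE))
issue_id <- gsub("[^0-9]", "", m)   # "367920"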
VSM Representation
T3
T1
T2
D1 = 2T1+ 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3
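With these vectors the cosine can be computed directly; a quick check in R, using the values from the slide above:

D1 <- c(2, 3, 5); D2 <- c(3, 7, 1); Q <- c(0, 0, 2)
cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cos_sim(D1, Q)   # ~0.81: D1 is closer to the query
cos_sim(D2, Q)   # ~0.13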
Put Your Hands in the Mud: What Technique, Why, and How

  • 1. PutYour Hands in the Mud: What Technique, Why, and How? Massimiliano Di Penta University of Sannio, Italy dipenta@unisannio.it http://www.ing.unisannio.it/mdipenta
  • 2. Outline MUD of software repositories Available techniques:
 Pattern matching
 Island parsers
 IR algebraic methods
 Natural Language Parsing Choosing the solution that fits your needs
  • 4. Why? Code identifiers and comments reflect semantics Software repositories contain a mix of structured and unstructured data
  • 5. (Some) Applications Traceability Link Recovery Feature Location Clone Detection Conceptual Cohesion/Coupling Metrics Bug prediction Bug triaging Software remodularization Textual smells/antipatterns Duplicate bug report detection Software artifact summarization and labeling Generating assertions from textual descriptions ....
  • 7. Versioning Systems Besides code itself, the main source of unstructured data is represented by commit messages
  • 8. Example • Add Checkclipse preferences to all projects so Checkstyle is preconfigured • Fix for issue 4503. • Fix for issue 4517: "changeability" on association ends not displayed properly. This was never adapted since UML 1.3 & NSUML were replaced. What kind of problem do you see here?
  • 9. Be aware! • Commit notes are often insufficient to know everything about a change • Need to merge with issue tracker data
  • 11.
  • 12. Issue Trackers • Contain a mix of structured and non- structured data • Structured: classification, priority, severity, status, but also stack traces and code snippets • Unstructured: issue description, comments • Often natural language mixed with method, class, package names, etc.
  • 13. Challenges • Including comments in the text corpus may or may not add useful information • The mix of natural language and code elements may challenge the application of techniques such as natural language parsing
  • 14. Techniques Nicolas Bettenburg, Stephen W. Thomas, Ahmed E. Hassan: Using Fuzzy Code Search to Link Code Fragments in Discussions to Source Code. CSMR 2012: 319-328 (the slide shows the first page of the paper)
  • 15. Emails Similar characteristics to issue trackers, but with less structure From lisa at usna.navy.MIL Thu Jul 1 15:42:27 1999 From: lisa at usna.navy.MIL (Lisa Becktold {CADIG STAFF}) Date: Tue Dec 2 03:03:10 2003 Subject: nmbd/nmbd_processlogon.c - CODE 12??? Message-ID: <99Jul1.114236-0400edt.4995-357+39@jupiter.usna.navy.mil Hi: I have Samba 2.1.0-prealpha running on a Sun Ultra 10. My NT PC joined the domain without a problem, but I can't logon. Every time I attempt to log into the Samba domain, the NT screen blanks as it usually does during login, but then the "Begin Login" screen reappears. I see this message in my samba/var/log.nmb file whenever I try to login:
  • 18. Mining Forums: Challenges • The usual problems you face with issue trackers and emails are still there • However, in general, the separation between the different elements is much more consistent
  • 21. However, this convention is not always followed!
  • 23. Not all reviews are equal Ning Chen, Jialiu Lin, Steven C. H. Hoi, Xiaokui Xiao, Boshen Zhang: AR-miner: mining informative reviews for developers from mobile app marketplace. ICSE 2014: 767-778 (the slide shows the first page of the paper)
  • 24. Informative reviews… Functional None of the pictures will load in my news feed. Performance It lags and doesn't respond to my touch which almost always causes me to run into stuff. Feature (change) request Amazing app, although I wish there were more themes to choose from Please make it a little easy to get bananas please and make more power ups that would be awesome. Remove ads So many ads its unplayable! Fix permissions This game is adding for too much unexplained permissions.
  • 25. Non-Informative reviews Purely emotional Great fun can't put it down! This is a crap app. Description of app/actions I have changed my review from 2 star to 1 star. Unclear issue description Bad game this is not working on my phone. Questions How can I get more points?
  • 26. @ICSME 2015 
Thursday, 13.50 - Mobile applications Fabio Palomba, Mario Linares-Vásquez, Gabriele Bavota, Rocco Oliveto, Massimiliano Di Penta, Denys Poshyvanyk, Andrea De Lucia: User Reviews Matter! Tracking Crowdsourced Reviews to Support Evolution of Successful Apps. ICSME 2015 (the slide shows the first page of the paper)
  • 27. So what do we get? Emails, Versioning Systems, Issue Reports
  • 28. Continuous interleaving of structured and unstructured data Format not always consistent Technical/domain terms, abbreviations, acronyms Short documents, and documents of widely different sizes
  • 30. Different techniques Techniques based on string / regexp matching Information Retrieval Models Natural Language Parsing Island and lake parsers
  • 32. Pattern matching: when and where • You're not interested in the whole document content • Rather, you want to match some keywords • Very few pattern variants • Context insensitivity is acceptable
  • 33. Basic tools you might want to use • Unix command line tools: wc, cut, sort, uniq, head, tail • Regular expression engines: grep, sed, awk • Scripting Languages: Perl, Python • General-purpose programming languages with regexp support: java.util.regex (similar to Perl regexps)
  • 34. Pattern Matching in Perl • The =~ and !~ operators match (or negate a match of) a string against a pattern • /PATTERN/ defines a pattern • e.g. if($a=~/PATTERN/i){...} or if($a!~/PATTERN/){...} • s/PATTERN/REPLACEMENT/egimosx • e.g. $a=~s/PATTERN/REPLACEMENT/g;
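A minimal, self-contained sketch of these operators; the commit note below is invented for illustration and is not part of the original slides:

#!/usr/bin/perl
use strict;
use warnings;

my $note = "Fix for issue 1234: null pointer when saving an empty diagram";
if ($note =~ /issue\s+(\d+)/i) {         # =~ with a capture group and the /i (case-insensitive) modifier
    print "references issue $1\n";
}
if ($note !~ /refactoring/i) {           # !~ succeeds when the pattern does NOT match
    print "no mention of refactoring\n";
}
(my $clean = $note) =~ s/\s+/ /g;        # s///g substitution: normalize whitespace
print "$clean\n";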
  • 35. Special Symbols \w Match a "word" character (alphanumeric plus "_") \W Match a non-word character \s Match a whitespace character \S Match a non-whitespace character \d Match a digit character \D Match a non-digit character
  • 36. Assertions \b Match a word boundary \B Match a non-(word boundary) \A Match only at beginning of string \Z Match only at end of string, or before newline at the end \z Match only at end of string
  • 37. Matching issue ids "fix 367920 setting pop3 messages as junk/not junk ignored when message quarantining turned on sr=mscott" $note=~/Issue (\d+)/i || $note=~/Issue number:\s+(\d+)/i || $note=~/Defect\s+(\d+)/i || $note=~/Fix\s+(\d+)/i
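A small driver showing how such patterns could be applied to commit notes; only the first message comes from the slide, the other messages and the exact pattern variants are illustrative assumptions:

#!/usr/bin/perl
use strict;
use warnings;

my @notes = (
    "fix 367920 setting pop3 messages as junk/not junk ignored",     # from the slide
    "Fix for issue 5678: wrong label shown on association ends",     # invented
    "Defect 8812: NPE when saving an empty diagram",                  # invented
    "update copyright headers",                                       # invented, no id
);

foreach my $note (@notes) {
    if (   $note =~ /Issue\s+(\d+)/i
        || $note =~ /Issue number:\s+(\d+)/i
        || $note =~ /Defect\s+(\d+)/i
        || $note =~ /Fix\s+(?:for\s+issue\s+)?(\d+)/i) {
        print "issue $1  <-  $note\n";
    } else {
        print "no issue id  <-  $note\n";
    }
}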
  • 38. Mapping emails onto classes/files • Emails often refer to classes / source files • One can trace emails to files using IR-based traceability • In the end, simple regexp matching is more precise and efficient Alberto Bacchelli, Michele Lanza, Romain Robbes: 
 Linking e-mails and source code artifacts. ICSE (1) 2010: 375-384
  • 39. Example (Apache httpd) Hi, I'm running SunOS4.1.3 (with vif kernel hacks) and the Apache server. www.xyz.com is a CNAME alias for xyz.com. <VirtualHost www.xyz.com etc., is in my httpd.conf file. Accessing http://www.xyz.com *fails* (actually, it brings up the default DocumentRoot). Accessing http://xyz.com *succeeds*, however, bringing up the Virtual DocumentRoot. The get_local_addr() function in the Apache util.c file seems to return the wrong socket name, but I can't figure out why. Any help is greatly appreciated!
  • 40. (Simplified) Perl implementation
#!/usr/bin/perl
@fileNames=`cat filelist.txt`;  # content of filelist.txt goes into array @fileNames
%hashFiles=();
foreach $f (@fileNames){
  chomp($f);
  $hashFiles{$f}++;             # populates hash table %hashFiles
}
while(<>){
  $l=$_;
  chomp($l);
  # matches filenames composed of word chars, ending with .c and a word boundary (\b)
  while($l=~/(\w+\.c\b)/){
    $name=$1;
    if(defined($hashFiles{$name})){   # checks if the filename is in the hash table
      print "Mail mapped to file $name\n";
    }
    $l=$';                            # assigns the post-match part to $l
  }
}
  • 41. RegExp: Pros and Cons • Often the easiest solution ✔ • Fast and precise enough ✔ • Unsupervised ✔ • Pattern variants can lead to misclassifications ✘ • Context-insensitive ✘
  • 42. Key Ingredients • Often finding the right pattern is not straightforward • A lot of manual analysis might be needed • Iterative process
  • 43. Island and Lake Parsers
  • 44. General Ideas Island parser: ignore everything (lake) until you find something matching a specific grammar rule (island) Lake parser: parse everything (lake) unless you find something you are not able to match against a rule (island)
  • 45. Traditional applications Leon Moonen: Generating Robust Parsers Using Island Grammars. WCRE 2001: 13-
  • 46. Applications to MUD: free text / error log / free text / source code / free text / stack trace
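To make the island idea concrete, here is a minimal regex-based sketch (not a real island grammar): it treats Java-style stack-trace lines as islands and everything else as lake. The message text, class names, and the recognizing pattern are all invented for illustration.

#!/usr/bin/perl
use strict;
use warnings;

# Toy input mixing free text (lake) with a stack trace (island).
my $email = <<'END';
Hi all, the importer dies as soon as I click Save:
java.lang.NullPointerException
    at org.example.io.Importer.save(Importer.java:42)
    at org.example.ui.SaveAction.run(SaveAction.java:17)
Any idea what I am doing wrong?
END

for my $line (split /\n/, $email) {
    # island: exception headers and "at Class.method(File.java:NN)" frames
    if ($line =~ /^\s*(?:[\w.]+(?:Exception|Error)\b|at\s+[\w.\$]+\([\w.]+:\d+\))/) {
        print "ISLAND: $line\n";
    } else {
        print "lake:   $line\n";
    }
}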
  • 47. Approach Alberto Bacchelli, Tommaso Dal Sasso, Marco D'Ambros, Michele Lanza: Content classification of development emails. ICSE 2012: 375-385 (the slide shows the first page of the paper)
  • 48. Alternative approach Lines/paragraphs belonging to different elements contain different proportions of keywords, special characters, etc. Alberto Bacchelli, Marco D'Ambros, Michele Lanza: Extracting Source Code from E-Mails. ICPC 2010: 24-33 (the slide shows the first page of the paper)
  • 49. Learning hidden languages Luigi Cerulo, Massimiliano Di Penta, Alberto Bacchelli, Michele Ceccarelli, Gerardo Canfora: Irish: A Hidden Markov Model to detect coded information islands in free text. Sci. Comput. Program. 105: 26-43 (2015)
  • 50. Hidden Markov Models A language category is modeled as an HMM where hidden states are language tokens (the slide shows two figures from the paper: a source code HMM trained on the PostgreSQL source code and a natural text HMM trained on the Frankenstein novel, with transition probabilities among token classes such as WORD, KEY, NUM, NEWLN, and punctuation symbols)
  • 51. Transition between languages… Language HMMs are connected by a language transition matrix into a single HMM (the slide shows the source code – natural text island HMM: a natural language HMM and a source code HMM connected by transition probabilities 1-p and 1-q, where p and q are the probabilities of staying in natural text and in source code, respectively)
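The sketch below is a toy two-state (TXT/SRC) HMM decoded with the Viterbi algorithm, in the spirit of this approach; the token categories, all probabilities, and the sample token stream are made-up values, not the ones trained in the paper.

#!/usr/bin/perl
use strict;
use warnings;

my @tokens = ('the', 'parser', 'calls', 'getToken', '(', ')', ';', 'and', 'returns', 'null');

sub category {                       # coarse emission symbol for a token
    my ($t) = @_;
    return 'PUNCT' if $t =~ /^[(){};=.,]$/;
    return 'IDENT' if $t =~ /[a-z][A-Z]/ || $t =~ /_/;   # camelCase or snake_case
    return 'WORD';
}

my @states = ('TXT', 'SRC');
my %start  = (TXT => 0.7, SRC => 0.3);
my %trans  = (TXT => { TXT => 0.8, SRC => 0.2 },
              SRC => { TXT => 0.2, SRC => 0.8 });
my %emit   = (TXT => { WORD => 0.85, IDENT => 0.05, PUNCT => 0.10 },
              SRC => { WORD => 0.30, IDENT => 0.40, PUNCT => 0.30 });

# Viterbi: for each state keep the best log-probability and the path leading to it.
my %v;
my $c0 = category($tokens[0]);
$v{$_} = [ log($start{$_} * $emit{$_}{$c0}), [$_] ] for @states;

for my $i (1 .. $#tokens) {
    my $c = category($tokens[$i]);
    my %next;
    for my $s (@states) {
        my ($best_p, $best_path);
        for my $prev (@states) {
            my $p = $v{$prev}[0] + log($trans{$prev}{$s} * $emit{$s}{$c});
            if (!defined $best_p || $p > $best_p) {
                ($best_p, $best_path) = ($p, [ @{ $v{$prev}[1] }, $s ]);
            }
        }
        $next{$s} = [ $best_p, $best_path ];
    }
    %v = %next;
}

# print each token with its most likely hidden state
my ($best) = sort { $v{$b}[0] <=> $v{$a}[0] } keys %v;
printf "%-10s %s\n", $tokens[$_], $v{$best}[1][$_] for 0 .. $#tokens;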
  • 53. Indexing-based IR (diagram): documents and queries are indexed (query analysis), producing representations that are then compared during query evaluation
  • 54. Document indexing: Term extraction → Stop word removal → Stemming → Term indexing → Algebraic method. The above process may vary a lot...
  • 55. Tokenization in SE • Often identifiers are compound words
 getPointer, getpointer • Sometimes they contain abbreviations and acronyms
 getPntr, computeIPAddr
  • 56. Camel Case splitter getWords → get Words, openIP → open IP Pros: simple and fast Cons: convention not adopted everywhere
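A minimal Camel Case splitter sketch in Perl; the handling of underscores and of all-caps acronyms is an assumption added for illustration, not part of the slide:

#!/usr/bin/perl
use strict;
use warnings;

sub camel_split {
    my ($id) = @_;
    my @words;
    for my $p (split /_/, $id) {                     # also handle snake_case
        $p =~ s/([a-z0-9])([A-Z])/$1 $2/g;           # lowercase/digit followed by uppercase
        $p =~ s/([A-Z]+)([A-Z][a-z])/$1 $2/g;        # acronym followed by a capitalized word
        push @words, split /\s+/, $p;
    }
    return @words;
}

print join(" ", camel_split("getWords")),      "\n";   # get Words
print join(" ", camel_split("openIP")),        "\n";   # open IP
print join(" ", camel_split("computeIPAddr")), "\n";   # compute IP Addr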
  • 57. Samurai • Assumption 1: A substring composing an identifier is also likely to be used elsewhere • Assumption 2: Given two possible splits, prefer the one that contains identifiers having a high frequency in the program Eric Enslen, Emily Hill, Lori L. Pollock, K. Vijay-Shanker: Mining source code to automatically split identifiers for software analysis. MSR 2009: 71-80
  • 58. More advanced approaches • Tidier/Tris, and Normalize • Allow expanding abbreviations and acronyms: getPntr → get pointer • Based on more advanced techniques, e.g. speech recognition Dawn Lawrie, David Binkley: Expanding identifiers to normalize source code vocabulary. ICSM 2011: 113-122 Latifa Guerrouj, Massimiliano Di Penta, Giuliano Antoniol and Yann-Gaël Guéhéneuc. TIDIER: an identifier splitting approach using speech recognition techniques. Journal of Software: Evolution and Process. Vol. 25, Issue 6, June 2013, p. 575–599
  • 59. Stop words for SE • English stop words alone are not enough • Programming language keywords may or may not be included • What about standard API names? • Specific, recurring words of some documents (e.g., of bug reports or of emails)
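A sketch of an SE-specific stop-word filter; the word lists below are tiny, hypothetical samples (a real filter would use complete English and Java keyword lists, plus project-specific terms):

#!/usr/bin/perl
use strict;
use warnings;

my %stop = map { $_ => 1 } (
    qw(the a an and or of to in is are this that with for),            # English stop words (sample)
    qw(public private static void class return new int if else while), # Java keywords (sample)
);

my @tokens = qw(public void get the current user id from the session);
my @kept = grep { !$stop{lc $_} } @tokens;
print "@kept\n";    # get current user id from session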
  • 60. Use of smoothing filters Andrea De Lucia, Massimiliano Di Penta, Rocco Oliveto, Annibale Panichella, Sebastiano Panichella: Applying a smoothing filter to improve IR-based traceability recovery processes: An empirical investigation. Information & Software Technology 55(4): 741-754 (2013) (the slide shows the first page of the paper)
  • 61.
  • 62. Smoothing Filters (diagram): the mean source vector S is computed from the source documents s1 s2 s3 … sk and the mean target vector T from the target documents t1 t2 t3 … tz; subtracting them from the documents yields the filtered source set and the filtered target set
  • 63. From stemming to lexical databases…
  • 64. • Lexical database where concepts are organised in a semantic network • Organize lexical information in terms of word meaning, rather than word form • Interfaces for Java, Prolog, Lisp, Python, Perl, C# WordNet (http://wordnet.princeton.edu)
  • 65. Word Categories Nouns • Topical hierarchies with lexical inheritance (hyponymy/hypernymy and meronymy/holonymy). Verbs • Entailment relations Adjectives and adverbs • Bipolar opposition relations (antonymy) Function words (articles, prepositions, etc.) • Simply omitted
  • 66. Application: mining and classifying identifier renamings Generalization/specialization: 
 thrownExceptionSize → thrownExceptionLength Opposite meaning: 
 hasClosingBracket → hasOpeningBracket Whole/part relation: 
filename → extension Venera Arnaoudova, Laleh Mousavi Eshkevari, Massimiliano Di Penta, Rocco Oliveto, Giuliano Antoniol, Yann-Gaël Guéhéneuc: REPENT: Analyzing the Nature of Identifier Renamings. IEEE Trans. Software Eng. 40(5): 502-532 (2014)
  • 67. Challenge • Word relations in WordNet do not fully cover the relations relevant in the IT domain • Alternative approaches try to “learn” these relations Jinqiu Yang, Lin Tan: SWordNet: Inferring semantically related words from software context. Empirical Software Engineering 19(6): 1856-1886 (2014)
  • 68. Term Indexing:Assigning Weights to Terms • Binary Weights: terms weighted as 0 (not appearing) or 1 (appearing) • (Raw) term frequency: terms weighted by the number of occurrences/frequency in the document • tf-idf: want to weight terms highly if they are frequent in relevant documents … BUT ... infrequent in the collection as a whole
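A pure-Perl sketch of raw tf and tf-idf weighting on a toy corpus; the three "documents" are invented, and note that a term occurring in every document would get an idf of log(1) = 0:

#!/usr/bin/perl
use strict;
use warnings;

my @docs = (
    "fix null pointer exception in parser",
    "parser crashes with null input",
    "update user interface colors",
);

my (@tf, %df);
for my $d (@docs) {
    my %count;
    $count{lc $_}++ for split /\s+/, $d;    # raw term frequency in this document
    push @tf, \%count;
    $df{$_}++ for keys %count;              # document frequency of each term
}

my $n = scalar @docs;
for my $i (0 .. $#docs) {
    for my $term (sort keys %{ $tf[$i] }) {
        # raw term frequency times inverse document frequency
        my $tfidf = $tf[$i]{$term} * log($n / $df{$term});
        printf "doc%d  %-10s %.3f\n", $i, $term, $tfidf;
    }
}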
  • 69. What to use? tf-idf might be useful for text retrieval but… not (necessarily) the best choice to identify keywords, e.g. for classification purposes
  • 70. Algebraic Models • Aim at providing representations to documents and at comparing/clustering documents • Examples • Vector Space Model • Latent Semantic Indexing (LSI) • latent Dirichlet allocation (LDA)
  • 71. TheVector Space Model (VSM) • Assume t distinct terms remain after preprocessing • call them index terms or the vocabulary. • These “orthogonal” terms form a vector space. • Dimension = t = |vocabulary| • Each term, i, in a document or query, j, is given a real- valued weight, wij. • Both documents and queries are expressed as t- dimensional vectors: dj = (w1j, w2j, …, wtj)
  • 72. VSM Representation (3-D example; the axes are the terms T1, T2, T3): D1 = 2T1 + 3T2 + 5T3, D2 = 3T1 + 7T2 + T3, Q = 0T1 + 0T2 + 2T3
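Using the vectors from the slide, a minimal cosine-similarity computation shows that Q is much closer to D1 than to D2 (this sketch is illustrative and not part of the original slides):

#!/usr/bin/perl
use strict;
use warnings;

sub cosine {
    my ($x, $y) = @_;
    my ($dot, $nx, $ny) = (0, 0, 0);
    for my $i (0 .. $#$x) {
        $dot += $x->[$i] * $y->[$i];
        $nx  += $x->[$i] ** 2;
        $ny  += $y->[$i] ** 2;
    }
    return $dot / (sqrt($nx) * sqrt($ny));
}

my @d1 = (2, 3, 5);    # D1 = 2T1 + 3T2 + 5T3
my @d2 = (3, 7, 1);    # D2 = 3T1 + 7T2 + T3
my @q  = (0, 0, 2);    # Q  = 0T1 + 0T2 + 2T3
printf "cos(D1,Q) = %.3f\n", cosine(\@d1, \@q);   # ~0.811
printf "cos(D2,Q) = %.3f\n", cosine(\@d2, \@q);   # ~0.130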
  • 73. Latent Semantic Indexing (LSI) Overcomes limitations of VSM (synonymy, polysemy, considering words occurring in documents as independent events) Shifting from a document-term space towards a document-concept space Dumais, S. T., Furnas, G. W., Landauer, T. K. and Deerwester, S. (1988), Using latent semantic analysis to improve information retrieval. In Proceedings of CHI'88: Conference on Human Factors in Computing, New York: ACM, 281-285.
  • 74. latent Dirichlet Allocation (LDA) • Probabilistic model • Documents treated as distributions of topics • Topics are distributions of words D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet Allocation,” The Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
  • 75. Applications Alessandra Gorla, Ilaria Tavecchia, Florian Gross, Andreas Zeller: Checking app behavior against app descriptions. ICSE 2014: 1025-1035 (the slide shows the first page of the paper)
  • 76. Need to properly calibrate the techniques • Many IR techniques require a careful calibration of many parameters • LSI number of concepts (k) • LDA k, α, β, number of iterations (n) • Without that, the performances can be sub-optimal
  • 77. … and beyond that… • Should I use stop word removal? Which one? • Stemming? • Which weighting scheme? tf? tf-idf? others? • Which technique? VSM? LSI? LDA?
  • 78. Approaches (I) Sugandha Lohar, Sorawit Amornborvornwong, Andrea Zisman, Jane Cleland-Huang: Improving trace accuracy through data-driven configuration and composition of tracing features. ESEC/SIGSOFT FSE 2013: 378-388 (the slide shows the first page of the paper)
  • 79. Discussion • Supervised, task specific • Model parameters varied through Genetic Algorithms • You train the model on a training set • Maximize precision/recall for a given task
  • 80. Approaches (II) Annibale Panichella, Bogdan Dit, Rocco Oliveto, Massimiliano Di Penta, Denys Poshyvanyk, Andrea De Lucia: How to effectively use topic models for software engineering tasks? an approach based on genetic algorithms. ICSE 2013: 522-531 (the slide shows the first page of the paper)
  • 81. Choose a random population of LDA parameters → LDA → determine the fitness of each chromosome (individual); Fitness = silhouette coefficient
  • 82. Choose a random population of LDA parameters → LDA → determine the fitness of each chromosome (individual) → select the next generation → crossover, mutation
  • 83. Choose a random population of LDA parameters → LDA → determine the fitness of each chromosome (individual) → select the next generation → crossover, mutation → next generation; from the best model for the first generation to the best model for the last generation
  • 85. R (www.r-project.org) • Integrated suite of software facilities for data manipulation, calculation and graphical display • VSM and LSI implemented in the lsa package • topicmodels and lda also available • many other text processing packages…
  • 86. Creating a t-d matrix Creating a term-document matrix from a directory: tm <- textmatrix("/Users/Max/mydocs") Syntax: textmatrix(mydir, stemming=FALSE, language="english", minWordLength=2, minDocFreq=1, stopwords=NULL, vocabulary=NULL)
  • 87. Applying a weighting schema • lw_tf(m): returns a completely unmodified n times m matrix • lw_bintf(m): returns binary values of the n times m matrix. • gw_normalisation(m): returns a normalised n times m matrix • gw_gfidf(m): returns the global frequency multiplied with idf
  • 88. Applying LSI • Decomposes tm and reduces the space of concepts to 2: lspace <- lsa(tm, dims=2) • Converts lspace into a textmatrix: m2 <- as.textmatrix(lspace)
  • 90. Application: Identifying duplicate bug reports Problem: people often report bugs someone has already reported Solution (in a nutshell): • Compute textual similarity among bug reports • Match (where available) stack traces Xiaoyin Wang, Lu Zhang,Tao Xie, John Anvik, Jiasu Sun:An approach to detecting duplicate bug reports using natural language and execution information. ICSE 2008: 461-470
  • 91. Example (Eclipse) Bug 123 - Synchronize View: files nodes in tree should provide replace with action This scenario happens very often to me - I change a file in my workspace to test something - I do a release - In the release I encounter that there is no need to release the file - I want to replace it from the stream but a corresponding action is missing Now I have to go to the navigator to do the job. t <- textmatrix("/Users/mdipenta/docs", stemming=TRUE) cosine(t) d1 d2 d1 1.0000000 0.6721491 d2 0.6721491 1.0000000 Bug 4934 - DCR: Replace from Stream in Release mode Sometimes I touch files, without intention. These files are then unwanted outgoing changes. To fix it, I have go back to the package view/navigator and replace the file with the stream. After this a new synchronize is needed, for that I have a clean view. It would be very nice to have the 'replace with stream' in the outgoing tree. Or 'Revert'?
  • 92. Other tools Lucene http://lucene.apache.org RapidMiner www.rapidminer.com Mallet http://mallet.cs.umass.edu
  • 94. Careful choice • In case you don't want a GA to choose it for you… • The most exotic model is not necessarily the best one • Simple models have advantages (e.g. scalability) • Try to experiment with multiple models
  • 95. IR methods: Pros and Cons • Can be used to match similar (related) discussions ✔ • Some techniques deal with issues such as homonymy and polysemy ✔ • Software-related discussions can be very noisy ✘ • Noise differs between different artifacts (bug reports, commit notes, code) ✘ • You have less control than with regular expressions ✘ • Do not capture dependencies in sentences, e.g., “this bug will not be fixed” ✘
 • 97. Technology
Stanford Natural Language Parser: http://nlp.stanford.edu
Dependency visualizer: DependenSee
Grammar browser: GrammarScope
 • 98. Example: “this bug will not be fixed”
Part-of-speech tags: this/DT bug/NN will/MD not/RB be/VB fixed/VBN
Parse tree: (ROOT (S (NP (DT this) (NN bug)) (VP (MD will) (RB not) (VP (VB be) (VP (VBN fixed))))))
Typed dependencies: det(bug-2, this-1), nsubjpass(fixed-6, bug-2), aux(fixed-6, will-3), neg(fixed-6, not-4), auxpass(fixed-6, be-5), root(ROOT-0, fixed-6)
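Bag-of-words methods would treat this sentence like any other mention of a fix; with the typed dependencies, the negation becomes machine-checkable. A small downstream sketch in R (not part of the Stanford toolkit; the dependency strings are the ones shown above):

deps <- c("det(bug-2, this-1)", "nsubjpass(fixed-6, bug-2)", "aux(fixed-6, will-3)",
          "neg(fixed-6, not-4)", "auxpass(fixed-6, be-5)", "root(ROOT-0, fixed-6)")
any(grepl("^neg\\(fixed-", deps))   # TRUE: "fixed" is negated in this sentence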
 • 99. Is it applicable to analyze source code identifiers?
• It could be, once identifiers are split into compound words
• Useful for part-of-speech tagging
• Results may not be very precise, because identifiers are not natural language sentences
  • 100. Rules (Abebe and Tonella, 2010) Surafel Lemma Abebe, Paolo Tonella: Natural Language Parsing of Program Element Names for Concept Extraction. ICPC 2010: 156-159
 • 101. Example: OpenStructure → “open structure”
Parsed as-is: (ROOT (NP (JJ open) (NN structure))); amod(structure-2, open-1), root(ROOT-0, structure-2). Here “open” is tagged as an adjective.
Let’s insert an artificial subject and an article: “subjects open the structure”
Parsed: (ROOT (S (NP (NNS subjects)) (VP (VBP open) (NP (DT the) (NN structure))))); nsubj(open-2, subjects-1), root(ROOT-0, open-2), det(structure-4, the-3), dobj(open-2, structure-4). Now “open” is recognized as a verb and “structure” as its direct object.
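The pre-processing behind this trick can be scripted. A minimal sketch in R (a hypothetical helper, not Abebe and Tonella's implementation): split the identifier into words and wrap them into an artificial sentence before handing it to an external parser.

split_identifier <- function(id) {
  # insert spaces at camelCase boundaries, then split on spaces/underscores
  tolower(unlist(strsplit(gsub("([a-z0-9])([A-Z])", "\\1 \\2", id), "[ _]+")))
}
words    <- split_identifier("OpenStructure")        # "open" "structure"
sentence <- paste("subjects", words[1], "the",
                  paste(words[-1], collapse = " "))  # "subjects open the structure"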
 • 102. Applications
Rahul Pandita, Xusheng Xiao, Hao Zhong, Tao Xie, Stephen Oney, Amit M. Paradkar: Inferring Method Specifications from Natural Language API Descriptions. ICSE 2012: 815-825
Abstract—Application Programming Interface (API) documents are a typical way of describing legal usage of reusable software libraries, thus facilitating software reuse. However, even with such documents, developers often overlook some documents and build software systems that are inconsistent with the legal usage of those libraries. Existing software verification tools require formal specifications (such as code contracts), and therefore cannot directly verify the legal usage described in natural language text in API documents against code using that library. However, in practice, most libraries do not come with formal specifications, thus hindering tool-based verification. To address this issue, we propose a novel approach to infer formal specifications from natural language text of API documents. Our evaluation results show that our approach achieves an average of 92% precision and 93% recall in identifying sentences that describe code contracts from more than 2500 sentences of API documents. Furthermore, our results show that our approach has an average 83% accuracy in inferring specifications from over 1600 sentences describing code contracts.
 • 103. Applications
S. Panichella, A. Di Sorbo, E. Guzman, C. A. Visaggio, G. Canfora, H. C. Gall: How Can I Improve My App? Classifying User Reviews for Software Maintenance and Evolution.
University of Zurich, Switzerland; University of Sannio, Benevento, Italy; Technische Universität München, Garching, Germany
Abstract—App Stores, such as Google Play or the Apple Store, allow users to provide feedback on apps by posting review comments and giving star ratings. These platforms constitute a useful electronic mean in which application developers and users can productively exchange information about apps. Previous research showed that users feedback contains usage scenarios, bug reports and feature requests, that can help app developers to accomplish software maintenance and evolution tasks. However, in the case of the most popular apps, the large amount of received feedback, its unstructured nature and varying quality can make the identification of useful user feedback a very challenging task. In this paper we present a taxonomy to classify app reviews into categories relevant to software maintenance and evolution, as well as an approach that merges three techniques: (1) Natural Language Processing, (2) Text Analysis and (3) Sentiment Analysis to automatically classify app reviews into the proposed categories. We show that the combined use of these techniques allows to achieve better results (a precision of 75% and a recall of 74%) than results obtained using each technique individually (precision of 70% and a recall of 67%).
Index Terms—User Reviews, Mobile Applications, Natural Language Processing, Sentiment Analysis, Text classification
@ICSME 2015 - Thursday, 13.50 - Mobile applications
 • 104. Natural Language Parsing: Pros and Cons
✔ Can identify dependencies between words
✔ Accurate part-of-speech analysis, better than lexical databases
✘ Application to software entities (e.g., identifiers) may be problematic
✘ Difficult to process data
✘ NL parsing is difficult, especially when the corpus is noisy
  • 107. What technique? Need to match precise elements
  • 108. What technique? Separate structured and unstructured data
  • 109. What technique? Account for the whole textual corpus of a document
  • 111. What technique? Identify “topics” discussed in documents
  • 112. What technique? Not just bag of words, but also dependencies, parts of speech, negations
 • 113. Take aways
• Use structure when available
• Carefully (empirically) choose the most suitable technique
• Out-of-the-box configurations might not work for you
• Manual analysis cannot be replaced
 • 114. Techniques: Natural Language Parsing. For example, the Stanford Natural Language Parser (http://nlp.stanford.edu):
this/DT bug/NN will/MD not/RB be/VB fixed/VBN
(ROOT (S (NP (DT this) (NN bug)) (VP (MD will) (RB not) (VP (VB be) (VP (VBN fixed))))))
det(bug-2, this-1), nsubjpass(fixed-6, bug-2), aux(fixed-6, will-3), neg(fixed-6, not-4), auxpass(fixed-6, be-5), root(ROOT-0, fixed-6)
Matching issue ids: a simple yet powerful technique to link commit notes to issue reports, e.g. for the note “fix 367920 setting pop3 messages as junk/not junk ignored when message quarantining turned on sr=mscott”:
$note=~/Issue (\d+)/i || $note=~/Issue number:\s+(\d+)/i || $note=~/Defect\s+(\d+)/i || $note=~/Fix\s+(\d+)/i
VSM Representation: D1 = 2T1 + 3T2 + 5T3, D2 = 3T1 + 7T2 + T3, Q = 0T1 + 0T2 + 2T3
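As a closing illustration of the pattern-matching technique recapped above, the Perl heuristics can be rephrased in R and applied to the commit note on the slide. This is an R translation for illustration only, not the original script, and the pattern is deliberately simplified:

note <- "fix 367920 setting pop3 messages as junk/not junk ignored when message quarantining turned on sr=mscott"
hit  <- regmatches(note, regexpr("(issue|defect|fix)[ #:]*(\\d+)", note, ignore.case = TRUE))
gsub("\\D", "", hit)   # "367920", the candidate issue id to look up in the tracker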