Software evolution research is a thriving area of software engineering research. Recent years have seen a growing interest in a variety of evolution topics, as witnessed by the growing number of publications dedicated to the subject. Without attempting to be complete, in this talk we provide an overview of emerging trends in software evolution research, such as the extension of the traditional boundaries of software, growing attention to social and socio-technical aspects of software development processes, and interdisciplinary research that applies techniques from other research areas to study software evolution, and software evolution research techniques to other research areas. As a large body of software evolution research is empirical in nature, we are confronted by important challenges pertaining to the reproducibility of the research and its generalizability.
8. Business-oriented view
“a set of actors functioning as a unit and
interacting with a shared market for
software and services, together with the
relationships among them.”
with thanks to International Data Corporation (IDC)
9. Development-centric view
a collection of software projects
that are developed and evolve
together in the same environment
with thanks to Bram Adams
10. Socio-technical view
a community of persons (end-users,
developers, debuggers,
…) contributing to a collection
of projects
18. Scientific challenges
Raw data   Processed data set   Tools & scripts   #MSR papers 2004-2009
Y          Y                    Y                 2
Y          Y                    N                 2
Y          P                    Y                 1
Y          P                    P                 2
Y          P                    N                 2
Y          N                    Y                 16
Y          N                    P                 19
Y          N                    N                 64
P          N                    Y                 1
P          N                    N                 2
N          Y                    N                 2
N          P                    N                 1
N          N                    Y                 7
N          N                    P                 2
N          N                    N                 31
N/A        N/A                  N/A               17
We share raw data but rarely share tools: reinventing the wheel, anybody?
19. Practical challenges
• How can we share our big data with other researchers?
• Different formats, different tools, storage problems, …
• How can we make our research results useful to practitioners and development communities?
• How can we build tools and dashboards that integrate our findings?
27. • describe evolutionary steps
• relate to changes of other
artifacts
• describe prevalence in
practice
• support automation
http://help.eclipse.org/juno/index.jsp?topic=%2Forg.eclipse.m2m.atl.doc%2Fguide%2Fconcepts%2FModel-Transformation.html
28. New kind of verification artifacts
[Figure: 2008, 2009, 2012, 2013]
29. 2008 vs. 2014
From technical to socio-technical
perspective:
Who are these
people?
What do they do?
30. MEN
> 90% in WordPress & Drupal
> 95% in FLOSS surveys
> 87% in GNOME
> 70% in software-related jobs (NSF)
39. Heuristics: title + first h1
<title>Ben Kamens</title>
…
<h1>We’re willing to be embarrassed about what we <em>haven’t</em> done…</h1>
Ben Kamens We’re willing to be embarrassed about what we haven’t done…
Stanford Named Entity Tagger
<PERSON>Ben Kamens</PERSON> We’re willing to be embarrassed about what we haven’t done…
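The "title + first h1" heuristic can be sketched roughly as follows. This is only an illustration of the text-extraction step, assuming Python's standard html.parser; the class and function names are hypothetical, and the named entity tagging step that follows on the slide is not reproduced here.

```python
from html.parser import HTMLParser


class TitleAndFirstH1(HTMLParser):
    """Collect the text of <title> and of the first <h1>, ignoring nested tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title_parts = []
        self.h1_parts = []
        self._current = None    # "title", "h1" or None
        self._h1_done = False   # only the first <h1> is of interest

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._current = "title"
        elif tag == "h1" and not self._h1_done:
            self._current = "h1"

    def handle_endtag(self, tag):
        if tag == "title" and self._current == "title":
            self._current = None
        elif tag == "h1" and self._current == "h1":
            self._current = None
            self._h1_done = True

    def handle_data(self, data):
        # Text inside nested tags such as <em> is kept; only the markup is dropped.
        if self._current == "title":
            self.title_parts.append(data)
        elif self._current == "h1":
            self.h1_parts.append(data)


def name_candidate_text(html):
    """Return title text and first-h1 text as one whitespace-normalised string."""
    parser = TitleAndFirstH1()
    parser.feed(html)
    parser.close()
    text = "".join(parser.title_parts) + " " + "".join(parser.h1_parts)
    return " ".join(text.split())
```

The resulting string is what would then be handed to a named entity tagger to look for a PERSON entity.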
40. Quality of gender resolution: Survey
Self-identification   Inferred M   Inferred F   Inferred ?   Total
M                     60           3            43           106
F                     2            5            4            11
+ avatars, other social media sites (manually)
Self-identification   Inferred M   Inferred F   Inferred ?   Total
M                     90           3            13           106
F                     2            9            0            11
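One possible way to summarise the two tables is by coverage (the share of respondents for whom a gender was inferred at all) and by accuracy among the resolved cases. These two measures and the function name are assumptions made for illustration; the slides do not say which quality measures were used.

```python
def resolution_quality(table):
    """table maps self-identified gender -> counts inferred as "M", "F" or "?" (unresolved)."""
    total = sum(sum(row.values()) for row in table.values())
    resolved = sum(n for row in table.values()
                   for label, n in row.items() if label != "?")
    correct = sum(row[gender] for gender, row in table.items())
    return {
        "coverage": resolved / total,                  # share with an inferred gender
        "accuracy_when_resolved": correct / resolved,  # share of those inferred correctly
    }


# Counts copied from the two survey tables on the slide.
heuristics_only = {"M": {"M": 60, "F": 3, "?": 43}, "F": {"M": 2, "F": 5, "?": 4}}
with_manual = {"M": {"M": 90, "F": 3, "?": 13}, "F": {"M": 2, "F": 9, "?": 0}}

before = resolution_quality(heuristics_only)
after = resolution_quality(with_manual)
```

With these counts, coverage rises from 70/117 to 104/117 once avatars and other social media sites are checked manually.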
44. Technical challenges
How can we reliably and efficiently identify human activities?
Editor's Notes
Software maintenance is an area of software engineering with deep financial implications. Indeed, maintenance and evolution costs were forecast to account for more than half of North American and European software budgets in 2010. Similar or even higher figures were reported for countries such as Norway and Chile. In this talk we discuss recent advances in two popular approaches to assessing the evolution of software projects: measuring and mining software. Software metrics, commonly used to measure software, are usually defined at the micro level (method, class, package), while the analysis of maintainability and evolution requires insights at the macro (system) level. Metrics should, therefore, be aggregated. We discuss recent work on software metrics aggregation techniques, and advocate econometric inequality indices to perform aggregation.
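As an illustration of aggregating micro-level metric values into a single system-level number, the sketch below computes two well-known econometric inequality indices, the Gini coefficient and the Theil index. Choosing these two indices, the function names, and the assumption of non-negative metric values are all made here for illustration; the notes do not say which indices the talk advocates.

```python
import math


def gini(values):
    """Gini coefficient of non-negative values: 0 means perfectly equal."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the sorted values (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def theil(values):
    """Theil index of non-negative values: 0 means perfectly equal."""
    n = len(values)
    mean = sum(values) / n if n else 0.0
    if mean == 0:
        return 0.0
    # Zero values contribute nothing, following the convention 0 * ln(0) = 0.
    return sum((x / mean) * math.log(x / mean) for x in values if x > 0) / n


# Hypothetical per-class metric values (e.g. lines of code) for two systems.
evenly_spread = [5, 5, 5, 5]
concentrated = [0, 0, 0, 4]
```

Unlike the mean, both indices distinguish a system whose metric values are spread evenly from one where a few components carry almost everything.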
A complementary approach to studying software evolution consists in mining software repositories, e.g., version control systems, bug trackers and mail archives. While abundant information is usually present in such repositories, successful information extraction is often challenged by the necessity to simultaneously analyze different repositories and to combine the information obtained. We propose to apply process mining techniques, originally developed for business process analysis, to address this challenge. However, in order for process mining to become applicable, different software repositories should be combined, and “related” software development events should be matched: e.g., mails sent about a file, modifications of the file and bug reports that can be traced back to it. In this talk we discuss the approach proposed, as well as a series of case studies addressing such aspects of the development process as roles of different developers, the way bug reports are handled and conformance to software engineering standards.
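The matching of "related" events can be sketched as building one chronological trace per file from events that originate in different repositories. This is a minimal sketch under strong assumptions: each event is assumed to already carry the file it refers to, and the field names (file, time, activity) and sample data are hypothetical. In practice, establishing that link is the hard part.

```python
from collections import defaultdict
from datetime import datetime


def build_traces(*sources):
    """Merge events from several repositories into one time-ordered trace per file."""
    traces = defaultdict(list)
    for events in sources:
        for event in events:
            traces[event["file"]].append(event)
    for events in traces.values():
        events.sort(key=lambda e: e["time"])
    return dict(traces)


# Hypothetical events from a version control system, a bug tracker and a mail archive.
commits = [{"file": "parser.c", "time": datetime(2012, 3, 5), "activity": "commit"}]
bugs = [{"file": "parser.c", "time": datetime(2012, 3, 1), "activity": "bug reported"}]
mails = [{"file": "parser.c", "time": datetime(2012, 3, 3), "activity": "mail sent"},
         {"file": "lexer.c", "time": datetime(2012, 3, 4), "activity": "mail sent"}]

traces = build_traces(commits, bugs, mails)
```

Each trace then plays the role of a case in an event log, which is the input format process mining techniques generally expect.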
Software ecosystem – collection of software products that are developed and evolve in the same environment [Lungu, 2008]
Examples: Eclipse; Android and iOS app stores
Technical challenges:
Extracting and combining data from different sources
Identifying correspondences across different data sources (identity merging)
Dealing with inconsistent and incomplete data
Big data analytics
special skills and tools needed to store, process and analyse huge amounts of data
Example of a technical challenge that has to be addressed: identifying correspondences across different data sources (identity merging). Examples of non-names: root, info, …
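A very simple identity-merging heuristic might look like the sketch below: two (name, e-mail) pairs are merged when they share a normalised name, an e-mail prefix, or a full address, while non-names are ignored as matching keys. Only "root" and "info" come from the notes; the normalisation rule, the function names, and the sample identities are assumptions, and this is not necessarily the algorithm used in the talk.

```python
import re
from collections import defaultdict

NON_NAMES = {"root", "info"}  # from the notes; any further entries would be assumptions


def normalise(text):
    """Lower-case and replace every run of non-alphanumerics by a single space."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def merge_identities(identities):
    """Group (name, email) pairs that share a usable key, using union-find."""
    parent = list(range(len(identities)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}
    for i, (name, email) in enumerate(identities):
        keys = {normalise(name), normalise(email.split("@")[0]), email.lower()}
        for key in keys:
            if not key or key in NON_NAMES:
                continue  # a non-name such as "root" says nothing about the person
            if key in first_seen:
                parent[find(i)] = find(first_seen[key])
            else:
                first_seen[key] = i

    groups = defaultdict(list)
    for i, identity in enumerate(identities):
        groups[find(i)].append(identity)
    return list(groups.values())


# Hypothetical identities taken from a version control log and a mailing list.
people = [
    ("John Smith", "jsmith@example.org"),
    ("john.smith", "john.smith@example.com"),
    ("root", "root@example.org"),
    ("root", "root@example.net"),
    ("J. Smith", "jsmith@example.org"),
]
merged = merge_identities(people)
```

Here the three Smith identities end up in one group, while the two root accounts stay separate because "root" is not used as a matching key.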
From structured data to unstructured data. Even then, there are different stemming algorithms, different information retrieval approaches, etc.
Accessibility of data
E.g. many apps in Google Play are proprietary and historical information is not accessible
Focus on open source software
Reproducibility of results
Generalisability of results
Which research methodology, which metrics, which statistical tools, …
Privacy issues
Can we use and combine information about actual developers?
Can we make these results freely available?
How to reconcile privacy with reproducibility?
Several approaches have been proposed to ensure secure anonymization, and we would like to study them in the near future. The concept of k-anonymity [9] tries to ensure that, for k greater than 1, even with all fields known, a record can be traced back only to a group of k people rather than to a single person. Still, k-anonymity has been shown to be insufficient: attackers can discover sensitive attributes in data with low diversity and, together with other information, identify a single person. Data with sufficient diversity (l-diversity) should therefore be published [5]. Finally, t-closeness requires the distribution of a sensitive attribute within each group to be close to the distribution of that attribute in the overall table (i.e., the distance between the two distributions should be less than a threshold t) [4]. In the meantime, we will combine the data internally, as we have done in the case study shown next.
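To make the first two notions concrete, the sketch below measures the k and l that a table actually achieves: k is the size of the smallest group of records sharing the same quasi-identifier values, and l is the smallest number of distinct sensitive values within any such group (the simplest, "distinct" reading of l-diversity). The column names and sample records are hypothetical, and t-closeness is not covered.

```python
from collections import defaultdict


def k_and_l(records, quasi_identifiers, sensitive):
    """Return (k, l) achieved by the records; (0, 0) for an empty table."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        groups[key].append(record[sensitive])
    if not groups:
        return 0, 0
    k = min(len(values) for values in groups.values())       # smallest group size
    l = min(len(set(values)) for values in groups.values())  # least diverse group
    return k, l


# Hypothetical developer records: country and age band act as quasi-identifiers.
developers = [
    {"country": "BE", "age_band": "20-29", "employer": "A"},
    {"country": "BE", "age_band": "20-29", "employer": "B"},
    {"country": "NL", "age_band": "30-39", "employer": "C"},
    {"country": "NL", "age_band": "30-39", "employer": "C"},
]
k, l = k_and_l(developers, ["country", "age_band"], "employer")
```

In this example the table is 2-anonymous, yet anyone known to be in the second group can be tied to employer C, which is the weakness that l-diversity is meant to address.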
A toy train consisting of an engine and a car
Evolutionary problems specific to model-driven engineering are related to the presence of multiple co-evolving artefacts: meta-models, models and model transformations
Male dominated
Stack Overflow: programmers' ages are roughly normally distributed around 29, though skewed right.
Modern developers have to juggle multiple activities, including source code updates, mails, bug trackers, and questions and answers on Stack Overflow. We start by looking into data coming from the version control repositories of GNOME, and then proceed with analysing mail archives and Stack Overflow question-answering.
Contributing to a modern software system (or an ecosystem of software systems) is not only coding but also localising, testing, creating images/multimedia, developing libraries, writing documentation, creating build or configuration scripts and/or designing databases. All these activities are somehow reflected in the version control system archives. We use file extensions and file paths to map each of the activities to groups of files.
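Mapping files to activities by extension and path can be sketched as an ordered list of pattern rules, where the first matching rule wins. The rules below are hypothetical examples written for illustration and are not the rule set used in the study.

```python
import re

# Ordered (activity, pattern) rules; all of them are illustrative assumptions.
RULES = [
    ("l10n", re.compile(r"\.(po|pot)$|(^|/)po/")),
    ("test", re.compile(r"(^|/)tests?/")),
    ("doc", re.compile(r"(^|/)docs?/|\.(txt|md)$|(^|/)readme")),
    ("img", re.compile(r"\.(png|jpg|svg|gif)$")),
    ("build", re.compile(r"(^|/)(makefile|configure)(\.[a-z]+)?$|\.(am|in|mk)$")),
    ("db", re.compile(r"\.sql$")),
    ("code", re.compile(r"\.(c|h|py|java|js|vala)$")),
]


def activity_of(path):
    """Return the activity of the first rule matching the lower-cased path."""
    lowered = path.lower()
    for activity, pattern in RULES:
        if pattern.search(lowered):
            return activity
    return "unknown"
```

Rule order matters: placing "test" before "code" means that tests/test_io.py counts as testing rather than coding.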
Arrows indicate that a statistical analysis reveals significant differences between the activities linked by the arrow: localization (l10n) has more commits related to it than code, code more than doc or img, etc. Occasional = fewer than 14 commits (the median); frequent = 14 commits or more.
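The occasional/frequent split can be sketched as a threshold at the median number of commits per contributor, which the note above reports as 14. The function name and the sample counts are hypothetical.

```python
from statistics import median


def split_by_median(commit_counts):
    """Label each author "frequent" (at or above the median) or "occasional" (below it)."""
    threshold = median(commit_counts.values())
    return {
        author: "frequent" if count >= threshold else "occasional"
        for author, count in commit_counts.items()
    }


# Hypothetical commit counts chosen so that the median is 14.
labels = split_by_median({"alice": 1, "bob": 14, "carol": 200})
```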