Kenyon: A Software Stratigraphy Platform (ESEC/FSE 2005)
1. Kenyon: A Software Stratigraphy Platform
Jennifer Bevan, Sunghun Kim, E. James Whitehead Jr.
University of California, Santa Cruz
{jbevan, hunkim, ejw}@cs.ucsc.edu
Lijie Zou, Mike Godfrey
University of Waterloo
{lzou, migod}@uwaterloo.edu
2. Motivation
Static analysis-based software evolution
research has several common technical
issues to manage.
Extracting meaningful configurations from an
SCM repository.
Calculating static relations, metrics.
Augments data from commit log messages.
Saving the extracted facts.
For later time-based analysis, data mining,
incremental data addition.
3. Ongoing Static Evolution Research
Instability Analysis (J. Bevan)
Refines Zimmermann/Ying/Murphy using static
dependence to remove temporal dependencies
Entity Mapping/Origin Analysis (L. Zou, M.
Godfrey)
Uses static metrics to identify moved/split/merged
procedures, files.
Code clone evolution (M. Kim)
Identifies clones and follows their evolution.
4. More Static Evolution Research
Association rule mining
For predicting changes [Ying et al., IEEE TSE, v30 n9, Sept. 2004]
For architectural justification [Zimmermann, Diehl, and Zeller,
Proc. IWPSE 2003]
Identifying code “chunks” for future
modularization [Mockus and Weiss, IEEE Software, v18 n2, 2001]
“Feature” identification [Fischer, Pinzger, and Gall, Proc. WCRE
2003]
…and the ongoing research related to these.
5. Problem
Despite similarity of approach, systems make
several choices that limit sharing of technology and
results:
Usually choosing a single SCM system (CVS) for data.
Usually creating a proprietary database schema.
Usually not easy to integrate with other research
projects for result sharing.
The cost of computationally expensive analysis
techniques is not amortized across multiple
research directions.
6. Solution: Kenyon
Kenyon is designed to facilitate static software
evolution research by providing common solutions
to these common problems:
Phase 1: Automatic configuration extraction from SCM
Phase 2: Invoking static analysis tool(s)
Phase 3: Storing the results from these preprocessing
steps.
Asynchronously provides access to previously
processed and stored data.
8. Phase 1: Extract Configurations
Kenyon provides transaction recovery and logical
configuration extraction for multiple SCM systems.
Configurations specified by time + branch identifier.
Sliding window algorithm for transaction recovery.
Only changes from completed transactions are extracted
for a “logical configuration”.
Only changes from transactions that completed between
two specifications are considered for a “configuration
delta”.
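A sliding-window grouping of this kind can be sketched as follows. This is a minimal illustration, not Kenyon's implementation: the FileRevision and Transaction types, the recover_transactions function, and the 200-second window are all assumptions made for the example. Revisions with the same author and log message join a transaction when they arrive within the window of that transaction's most recent revision, so the window slides as the transaction grows.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FileRevision:
    """One per-file commit record, as a file-based SCM such as CVS stores it."""
    path: str
    author: str
    log: str
    timestamp: int  # seconds since some epoch


@dataclass
class Transaction:
    """A recovered group of file revisions (hypothetical type)."""
    author: str
    log: str
    revisions: List[FileRevision]

    @property
    def completed_at(self) -> int:
        # A transaction completes when its last revision is committed.
        return max(r.timestamp for r in self.revisions)


def recover_transactions(revisions: List[FileRevision],
                         window: int = 200) -> List[Transaction]:
    """Group revisions by (author, log) using a sliding time window.

    The window length is an arbitrary assumption for this sketch.
    """
    latest: Dict[Tuple[str, str], Transaction] = {}
    result: List[Transaction] = []
    for rev in sorted(revisions, key=lambda r: r.timestamp):
        key = (rev.author, rev.log)
        txn = latest.get(key)
        # Compare against the most recent revision already in the
        # transaction, so the window moves forward with each addition.
        if txn is not None and rev.timestamp - txn.completed_at <= window:
            txn.revisions.append(rev)
        else:
            txn = Transaction(rev.author, rev.log, [rev])
            latest[key] = txn
            result.append(txn)
    return result
```

The completion timestamp of each recovered transaction is what a time + branch configuration specification would be compared against.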
9. Configuration Specification
Kenyon’s logical configuration extraction and delta
calculations allow researchers to consider software
“as it existed at time T on branch B”.
Most SCM systems archive data along a timeline with
varying support for parallel development.
Kenyon uses this commonality as the basis for its SCM
interface and configuration specification.
There is no indication that change-set based SCM
systems will not be supportable by Kenyon.
10. Logical Configuration
• At any given point in time,
one or more transactions may
have just completed, and one
or more may be ongoing.
• Ongoing transactions are
shown in red.
• Completed transactions are
shown in green.
[Figure: timeline of transactions touching files F1 to F4, evaluated at time T1]
11. Configuration Deltas
• Configuration deltas are
calculated as C(T2) – C(T1).
• Only changes from
transactions completing
between T1 (exclusive) and
T2 (inclusive) are
considered.
[Figure: timeline of transactions touching files F1 to F4, with times T1 and T2 marked]
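The logical configuration and configuration delta rules can be sketched together. This is an illustrative sketch under stated assumptions: the Transaction type, its fields, and both function names are hypothetical, and a configuration is modeled simply as a mapping from file path to revision identifier.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Transaction:
    """Hypothetical transaction record: when it ran and what it changed."""
    start: int
    end: int                  # completion time
    changes: Dict[str, str]   # file path -> revision id written


def logical_configuration(transactions: List[Transaction],
                          t: int) -> Dict[str, str]:
    """Configuration at time t: only completed transactions contribute."""
    config: Dict[str, str] = {}
    for txn in sorted(transactions, key=lambda x: x.end):
        if txn.end <= t:      # transactions still ongoing at t are ignored
            config.update(txn.changes)
    return config


def configuration_delta(transactions: List[Transaction],
                        t1: int, t2: int) -> Dict[str, str]:
    """Changes from transactions completing in the interval (t1, t2]."""
    delta: Dict[str, str] = {}
    for txn in sorted(transactions, key=lambda x: x.end):
        if t1 < txn.end <= t2:   # t1 exclusive, t2 inclusive
            delta.update(txn.changes)
    return delta
```

With the exclusive lower bound, a transaction that completes exactly at T1 belongs to C(T1) and is not counted again in the delta.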
12. Data from Phase 1
Valid configuration specifications for extraction are
created by Kenyon, one per timestamp where a
transaction completed.
For each configuration extracted:
Author and log message of each transaction completing
at that specification.
The configuration is placed on the filesystem.
A configuration delta for each consecutive pair of
configurations processed can also be stored.
13. Phase 2: Invoke Fact Extractors
Kenyon provides an abstract class that is used to
invoke third-party fact extractors on the
configuration extracted to the filesystem.
Kenyon users would subclass this class to invoke their
own fact extractor.
Support for Codesurfer (line-level analysis) and
SWAGKIT (procedure-level analysis) is provided with
Kenyon. [www.grammatech.com, swag.uwaterloo.ca]
FactExtractor subclasses have a tri-modal return status:
“failure”, “new data to store”, or “no new data to store”.
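The subclassing pattern and tri-modal status can be sketched as below. Kenyon itself is written in Java; this Python sketch only mirrors the idea, and every name in it (ExtractionStatus, run_tool, LineCountExtractor) is an assumption for illustration. The line counter stands in for a real third-party tool.

```python
import os
from abc import ABC, abstractmethod
from enum import Enum


class ExtractionStatus(Enum):
    FAILURE = "failure"
    NEW_DATA = "new data to store"
    NO_NEW_DATA = "no new data to store"


class FactExtractor(ABC):
    """Base class: subclasses supply run_tool, the base reports status."""

    def __init__(self):
        self.facts = None  # facts from the most recent successful run

    @abstractmethod
    def run_tool(self, config_dir: str) -> dict:
        """Run the wrapped tool on an extracted configuration."""

    def extract(self, config_dir: str) -> ExtractionStatus:
        try:
            facts = self.run_tool(config_dir)
        except Exception:
            # Any tool error is reported as a failure status.
            return ExtractionStatus.FAILURE
        if facts == self.facts:
            return ExtractionStatus.NO_NEW_DATA
        self.facts = facts
        return ExtractionStatus.NEW_DATA


class LineCountExtractor(FactExtractor):
    """Hypothetical extractor: counts lines per file in the configuration."""

    def run_tool(self, config_dir: str) -> dict:
        if not os.path.isdir(config_dir):
            raise FileNotFoundError(config_dir)
        counts = {}
        for root, _dirs, files in os.walk(config_dir):
            for name in files:
                path = os.path.join(root, name)
                with open(path, encoding="utf-8", errors="replace") as fh:
                    counts[os.path.relpath(path, config_dir)] = sum(1 for _ in fh)
        return counts
```

Distinguishing "no new data" from "failure" lets the caller skip the storage phase without treating the run as an error.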
14. Data from Phase 2
FactExtractor subclasses provide:
A ConfigGraph that maps software elements to nodes
and static relationships to edges.
The graph, any node, and any edge may be attributed
with static metrics.
Multiple fact extractors may be invoked on a single
configuration: each created ConfigGraph is saved
with a reference to the fact extractor that created it.
If a configuration has already been processed by a
given fact extractor, it will not be processed again
unless new metrics are to be calculated.
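An attributed graph of this shape, together with the "skip unless new metrics are needed" check, might look like the following sketch. The field names, the metrics set, and the needs_processing helper are assumptions for illustration and are not taken from Kenyon's classes.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Set, Tuple


@dataclass
class ConfigGraph:
    """Software elements as nodes, static relationships as edges."""
    extractor: str                                    # which fact extractor made it
    attributes: Dict[str, Any] = field(default_factory=dict)
    nodes: Dict[str, Dict[str, Any]] = field(default_factory=dict)
    edges: Dict[Tuple[str, str, str], Dict[str, Any]] = field(default_factory=dict)
    metrics: Set[str] = field(default_factory=set)    # metric names already computed

    def add_node(self, name: str, **attrs: Any) -> None:
        self.nodes.setdefault(name, {}).update(attrs)

    def add_edge(self, src: str, relation: str, dst: str, **attrs: Any) -> None:
        # Endpoints are created on demand so edges never dangle.
        self.add_node(src)
        self.add_node(dst)
        self.edges.setdefault((src, relation, dst), {}).update(attrs)


def needs_processing(stored: Dict[str, ConfigGraph],
                     extractor: str, wanted: Set[str]) -> bool:
    """True if this extractor has no graph yet or lacks a requested metric."""
    graph = stored.get(extractor)
    return graph is None or not wanted <= graph.metrics
```

Keying stored graphs by extractor name reflects the rule that each ConfigGraph keeps a reference to the fact extractor that created it.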
15. Phase 3: Data Storage
Kenyon uses Hibernate to persist data
classes.
Hibernate is an “object/relational persistence and
query service for Java” [www.hibernate.org].
Allows reuse of Kenyon classes by research
tools implemented in Java.
Each configuration processed by Kenyon is
assigned to a Project, the top-level data class
persisted by Kenyon.
16. Persisted Kenyon Data
• Projects contain one set of
data for each configuration
specification processed.
• Each such data set
contains one or more
ConfigGraphs, each
produced by a different
FactExtractor.
• FactExtractors specify
what GraphSchema
subclass they use (not
restrictive).
[Figure: class diagram relating Project, ConfigData, ConfigSpec, ConfigDelta, ConfigGraph, FactExtractor, and GraphSchema]
17. Data Access
Hibernate allows access to preprocessed data using
SQL or the Hibernate query methods (HQL, QBE/
QBC), which support class/field-based queries.
A Hibernate query returns a List of Objects, each of
which is of the type originally persisted.
Data fields in the returned class are populated unless
specified as lazily loaded.
Kenyon provides several convenience queries for
common anticipated queries, such as “what
configurations are available for this project”.
18. Kenyon Usage
Kenyon processes data based on specifications in a
configuration file
Start time, stop time, how often to process
Fact extractors and their assigned metric calculators.
SCM parameters, filesystem parameters, some control
over what Hibernate persists.
A “processing run” will reuse any previously
processed data if available
For example, if a ConfigGraph has already been created
and new metrics are needed, they are calculated and
added to the existing ConfigGraph.
19. Iterative Refinement Example
When looking for “interesting” timeframes of
evolution, a multiple-pass process is recommended.
A user can configure Kenyon to process the changes in a
system once per day.
Days with high activity or other metrics exceeding a
threshold can be flagged as “interesting”.
The user can then configure Kenyon to process those
days (via multiple processing runs) at the frequency of
“every 20 minutes”.
This process can repeat down to the “every second”
level.
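The multiple-pass process can be sketched as a coarse pass that flags days, followed by a finer sampling of only those days. The function names, the activity mapping (day to a change count), and the threshold are assumptions for this example; a real run would take these values from Kenyon's configuration file and stored metrics.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def sample_times(start: datetime, stop: datetime,
                 step: timedelta) -> List[datetime]:
    """Timestamps from start to stop (inclusive), one every step."""
    times = []
    t = start
    while t <= stop:
        times.append(t)
        t += step
    return times


def interesting_days(activity: Dict[datetime, int],
                     threshold: int) -> List[datetime]:
    """Days from the coarse pass whose activity exceeds the threshold."""
    return sorted(day for day, count in activity.items() if count > threshold)


def refinement_runs(days: List[datetime],
                    step: timedelta = timedelta(minutes=20)) -> List[List[datetime]]:
    """One finer-grained processing run per flagged day."""
    return [sample_times(day, day + timedelta(days=1), step) for day in days]
```

Repeating the same two steps with a smaller step on the flagged 20-minute windows gives the next level of refinement.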
20. Parallel Preprocessing
Kenyon is a single-threaded process, but Hibernate
supports multiple connections to a single Kenyon
database.
A 10-year history can be processed in chunks by
any number of computers, even if the processing
configurations have overlapping times or different
intervals.
Kenyon does not integrate the deltas between
different processing runs, so a small overlap in
processing chunks is suggested.
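One way to prepare such overlapping chunks is sketched below. The split_history function and the one-day overlap are assumptions for illustration; each resulting (start, stop) pair would become the time range of a separate processing run on a separate machine.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def split_history(start: datetime, stop: datetime, chunks: int,
                  overlap: timedelta) -> List[Tuple[datetime, datetime]]:
    """Split [start, stop] into equal chunks that overlap their predecessor."""
    span = (stop - start) / chunks
    ranges = []
    for i in range(chunks):
        chunk_start = start + span * i
        # Pin the final chunk to stop so the whole history is covered.
        chunk_stop = stop if i == chunks - 1 else start + span * (i + 1)
        if i > 0:
            # Begin slightly early so the delta across the chunk
            # boundary is computed by this run.
            chunk_start -= overlap
        ranges.append((chunk_start, chunk_stop))
    return ranges
```

For example, `split_history(datetime(1995, 1, 1), datetime(2005, 1, 1), 10, timedelta(days=1))` yields ten ranges of roughly one year each, with every range after the first starting one day before the previous range ends.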
22. Current Status
Kenyon 1.2 available at
http://kenyon.dforge.cse.ucsc.edu
Supports CVS, Subversion, and ClearCase
Students in 290G are performing projects
using Kenyon this quarter
Actively working with Samsung to analyze
some of their source code.
23. Future Work (1/3)
Continue working with M. Kim
Evaluate usefulness of SCM-only module.
If she decides to use Kenyon, assist with full integration.
Finish integration of Beagle/Kenyon and
IVA/Kenyon.
Work with G. Murphy on using Kenyon at UBC.
Evaluate Kenyon’s ability to reduce the time-to-
results for static software evolution research by
analyzing the seminar class projects.
24. Future Work (2/3)
Support branch path traversal
Allow users to see the branch points in a system and
specify a path for processing instead of a single branch.
Will reuse existing visualizations, must add specification
mechanism.
Incorporate full language-specific containment
models for better inter-language graph traversal and
mapping.
Use M. Godfrey’s Java fact extractor and containment
model.
25. Future Work (3/3)
Support more of the Standard Exchange
Formats for ConfigGraph export.
TA is already supported, but only the Fact
sections. Schema sections should be improved
to use the language-specific containment models.
Encourage other researchers to use Kenyon,
and improve results-sharing, capabilities, etc.
based on their feedback.
26. Open Issues (1/3)
The exact mechanism for allowing data
sharing between researchers is not entirely
controllable by Kenyon
Database setup and administration can
effectively override much of Kenyon’s
preferences.
By default, Kenyon-created tables are not
mutable by processes other than Kenyon.
27. Open Issues (2/3)
Kenyon provides a public class, EvolutionPath, that
links a subgraph in one ConfigGraph to one in
another ConfigGraph.
Directed and attributable.
Basic building block for evolution data.
Is currently persisted by Kenyon, but will likely not be
after 1.1, due to database mutability issues.
Other research projects can subclass and, if they want to
share their results easily, persist them to a Hibernate
database using the provided Hibernate mapping
examples.
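The shape of such a link can be sketched as a directed, attributed pair of subgraphs that other projects extend by subclassing. Only the name EvolutionPath comes from the slides; the Subgraph type, the fields, and the RenamePath subclass are hypothetical, and persistence is left out.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, FrozenSet


@dataclass(frozen=True)
class Subgraph:
    """A set of nodes within the ConfigGraph of one configuration."""
    config: str               # identifier of the configuration
    nodes: FrozenSet[str]


@dataclass
class EvolutionPath:
    """Directed link: source subgraph evolves into target subgraph."""
    source: Subgraph
    target: Subgraph
    attributes: Dict[str, Any] = field(default_factory=dict)


@dataclass
class RenamePath(EvolutionPath):
    """Hypothetical subclass a research project might add."""

    def is_rename(self) -> bool:
        # One element on each side, under a different name.
        return (len(self.source.nodes) == 1
                and len(self.target.nodes) == 1
                and self.source.nodes != self.target.nodes)
```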
28. Open Issues (3/3)
Kenyon can be invoked automatically
via a post-commit script or a cron job.
Should Kenyon also be automatically
invocable from an IDE?
What sort of support should Kenyon provide
for better integration with, for example,
Eclipse?
29. Conclusions (1/2)
Kenyon is an engineering solution, designed to
amortize the cost of the computationally expensive
preprocessing steps that can benefit static software
evolution research.
Research projects using Kenyon will not have to
independently create solutions for these common
problems.
18% code reduction in Beagle without really trying.
Is expected to reduce the lag between beginning system
implementation and producing research results.
30. Conclusions (2/2)
Kenyon is not intended to be a lightweight data
mining system for software evolution research.
Tradeoff of speed vs. precision is still controllable via
the choice of fact extractors.
The configuration extraction time and associated
network lag already put the per-configuration time at
O(seconds)
Instead, it allows the cost of time-consuming,
computationally expensive preprocessing to be
amortized among researchers.
31. Questions?
Kenyon was created primarily from code that existed in
IVA, which is being funded by NSF grant CCR-01234603.
Kenyon also contains code from Beagle, the origin analysis
project overseen by Mike Godfrey.
Email jbevan@cs.ucsc.edu with future questions.
http://www.cse.ucsc.edu/research/labs/grase/kenyon/