Information Retrieval methods have been largely adopted to identify traceability links based on the textual similarity of software artifacts. However, noise due to word usage in software artifacts might negatively affect the recovery accuracy. We propose the use of smoothing filters to reduce the effect of noise in software artifacts and improve the performances of traceability recovery methods. An empirical evaluation performed on two repositories indicates that the usage of a smoothing filter is able to significantly improve the performances of Vector Space Model and Latent Semantic
Indexing. Such a result suggests that other than being used for traceability recovery the proposed filter can be used to improve performances of various other software engineering approaches based on textual analysis.
Search-Based Software Testing Tool Competition 2021 by Sebastiano Panichella,...
ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters
1. Improving IR-based Traceability
Recovery Using Smoothing Filters
Andrea Massimiliano Rocco Annibale Sebastiano
De Lucia Di Penta Oliveto Panichella Panichella
2. Software traceability
“The degree to which a relationship can be established
between two products of a software development process”
[IEEE Glossary for Software Terminology]
Source
Use case Test case
code
Source
Use case Test case
code
Important for:
Up-to-date traceability
program comprehension Up-to-date traceability
requirement tracing links rarely exists →
links rarely exists →
impact analysis
need to recover them
need to recover them
software reuse
…
3. IR-based traceability recovery
Antoniol et al., 2002 (VSM+Probabilistic model)
Antoniol et al., 2002 (VSM+Probabilistic model)
Marcus and Maletic, 2003 (LSI)
Marcus and Maletic, 2003 (LSI)
4. Traditional IR vs.
IR applied to Software Engineering
Traditional IR IR applied to SE
Deals with We have sets of
heterogeneous homogeneous
documents for what documents for what
concerns: concerns
Linguistic choices Syntax, linguistic
Syntax choices
Semantics Examples:
We just live with that Use cases, test
differences documents, design
documents follow a
common template and
contain recurrent words
5. Problem
Different kinds of software artifacts require specific
preprocessing
Test case Change the date for a visit:
Test case Change the date for a visit:
C51
C51 Version: 0 02 000
Version: 0 02 000
Use case
Use case Satisfies the request to modify a visit
for a patient request to modify a visit
Satisfies the
for a patient
UcModVis
UcModVis
Priority
Priority High
High
....
....
Test description
Test description
Input
Input Select a visit:
Select a visit:
26/09/2003 11:00 First visit
26/09/2003 11:00 First visit
Change: 03/10/2003 11:00
Change: 03/10/2003 11:00
Oracle
Oracle Invalid sequence: The system does not allow
Invalid sequence: The system does not allow
to change a booking
to change a booking
Coverage
Coverage Valid classes: CE1 CE8 CE14 CE19 CE21
Valid classes: CE1 CE8 CE14 CE19 CE21
Invalid classes: None
Invalid classes: None
6. Problem
Different kinds of software artifacts require specific
preprocessing
Test case Change the date for a visit:
Test case Change the date for a visit:
C51
C51 Version: 0 02 000
Version: 0 02 000
Use case
Use case Satisfies the request to modify a visit
for a patient request to modify a visit
Satisfies the
for a patient
UcModVis
UcModVis
Priority
Priority High
High Artifact-specific words do
....
.... not bring useful
Test description
Test description
Input Select a visit:
information
Input Select a visit:
26/09/2003 11:00 First visit
26/09/2003 11:00 First visit
Change: 03/10/2003 11:00
Change: 03/10/2003 11:00
Oracle
Oracle Invalid sequence: The system does not allow
Invalid sequence: The system does not allow
to change a booking
to change a booking
Coverage
Coverage Valid classes: CE1 CE8 CE14 CE19 CE21
Valid classes: CE1 CE8 CE14 CE19 CE21
Invalid classes: None
Invalid classes: None
8. Noisy images
Pixels with peaks of low Pixels with peaks of
color intensity high color intensity
Noise
9. Reducing noise using smoothing filters
Mean filter
1
g ( x, y ) =
M
∑ f ( n, m )
f ( n , m )∈S
10. Image vs. traceability noise
Image noise: Traceability noise:
Pixels with high or Terms and linguistic
low color intensity patterns occurring in
Pixels are position many artifacts of a given
dependent category
Use cases, test
cases..
Artifacts (columns) are
position independent
d1 d2 d2 d1
11. Representing the noise
Source Documents Target Documents
s1 s2 s3 L sk t1 t2 t3 L tz
word1 v1,1 v1,2 v1,3 L v1, k v1,1 v1,2 v1,3 L v1, z
v v2,2 v2,3 v2, k v2,1 v2,2 v2,3 v2, z
word 2 2,1 L L
M M O M O M O M M O M O M O M
word n vn ,1 L vn ,2 L vn ,3 L vn ,k vn ,1 L vn ,2 L vn ,3 L vn , z
Linguistic information strictly Linguistic information strictly belonging
belonging to source documents to target documents
Common Information Common Information
for Source Documents For target documents
12. Representing the noise
Source Documents Target Documents
s1 s2 s3 L sk t1 t2 t3 L tz
word1 v1,1 v1,2 v1,3 L v1, k v1,1 v1,2 v1,3 L v1, z
v v2,2 v2,3 v2, k v2,1 v2,2 v2,3 v2, z
word 2 2,1 L L
M M O M O M O M M O M O M O M
word n vn ,1 L vn ,2 L vn ,3 L vn ,k vn,1 L vn,2 L vn,3 L vn, z
1 k
k ∑ v1, j 1 m
z ∑ v1, j
j =1 j = k +1
1 k 1 m
∑ v2, j
Mean source vector S= k ∑
S = j =1
v2, j Mean target vector T= T = z j = k +1
M
M m
k 1
1 z ∑ vn , j
∑ vn, j
Common Information k j =1 Common Information j = k +1
for Source Documents For target documents
The Mean Vectors are like the continuous component of a signal…
The Mean Vectors are like the continuous component of a signal…
13. Representing the noise
Source Documents Target Documents
s1 s2 s3 L sk t1 t2 t3 L tz
word1 v1,1 v1,2 v1,3 L v1, k v1,1 v1,2 v1,3 L v1, z
v v2,2 v2,3 v2, k v2,1 v2,2 v2,3 v2, z
word 2 2,1 L L
M M O M O M O M M O M O M O M
word n vn ,1 L vn ,2 L vn ,3 L vn ,k vn ,1 L vn ,2 L vn ,3 L vn , z
- -
S T
(mean target (mean target vector)
vector)
Filtered
Filtered Filtered
Filtered
Source Set
Source Set Target Set
Target Set
14. Empirical Study
Goal: analyze the effect of smoothing filter
Purpose: investigating how the filter affects
traceability recovery
Quality focus: traceability recovery performance
(precision and recall)
Perspective:
Researchers: evaluating the novel technique
Project managers: adopt a better traceability recovery
technique
Context: artifacts from two systems
EasyClinic and Pine
15. Context
EasyClinic Pine
Description Medical doctor office Text-based
management email client
Language Java C
Files/Classe 37 31
s
KLOC 20 130
Documents 113 100
Language Italian English
Artifacts Use cases Requirements
Interaction diagrams Use cases
Source code
Test cases
16. Research Questions and Factors
RQ1: Does the smoothing filter improve the
recovery performances of VSM-based traceability
recovery?
RQ2: Does the smoothing filter improve the
recovery performances of LSI-based traceability
recovery?
RQ3: How do the performances vary for different
types of artifacts?
Factors:
Use of filter: YES, NO
Technique: VSM, LSI
Artifact: Req., UC, Int. Diagrams, Code, TC
System: Easyclinic, Pine
17. Analysis Method
Performances evaluated by precision and recall:
correct ∩ retrieved correct ∩ retrieved
precision = recall =
retrieved correct
M1 M2
We statistically compare the #
of false positives of different 0
methods for each correct link
2
identified
2
Wilcoxon Rank Sum test
3
Cliff’s delta effect size
21. EasyClinic: Test cases into source (LSI)
Test cases are:
Short documents
Limited vocabulary
Mostly consistent with
source code
Precision
Filtered
Not Filtered
Recall
22. Pine: Use cases into requirements (LSI)
Precision
Filtere
d
Not Filtered
Recall
24. Link precision improvement
Login Patient
Login Patient
vs. Person
vs. Person
Poor vocabulary
Poor vocabulary
overlap (10%)
overlap (10%)
25. Threats to validity
Construct validity
Mainly related to our oracle
Provided by developers and for EasyClinic also peer-
reviewed
Internal validity
Improvements could be due to other reasons…
However we compared different techniques (VSM, LSI)
The approach works well regardless of stop word
removal/stemming and use of tf-idf
Conclusion validity
Conclusions based on proper (non-parametric) statistics
External validity
We considered systems with different characteristics and
artifacts
… but further studies are desirable
26. Conclusions
We proposed the use of smoothing filter to
improve performances of IR-based traceability
recovery
Idea inspired from digital signal processing
The filter significantly improves IR-based
traceability recovery based on VSM (RQ1) and LSI
(RQ2)
Filter particularly suitable for artifacts having a
higher verbosity (RQ3)
e.g., requirements and use cases
Less useful for artifacts composed of short
sentences and using a limited vocabulary
e.g., test cases
27. Work-in-progress
Study replication
Different systems and artifacts
Use of relevance feedback
More sophisticated smoothing technique
Non linear filters
Use in other applications of IR to software
engineering
E.g. impact analysis or feature location