Csmr2013 presentation

Change-Based Test Selection
in the Presence of Developer Tests

Quinten David Soetens
Serge Demeyer
Andy Zaidman

Buongiorno, my name is Quinten, I’m from the University of Antwerp.

I will be showing a technique that we investigated to reduce the size of a test suite.

Test Suites Grow

2

As a Software System grows, so does its Test Suite. And they can grow very large indeed!

We talked to a couple of companies in industry and they confirmed that this is indeed a relevant problem for them. ... Why?

Take Hours to Run

3

Because large test suites lead to inefficient testing -- it takes too long to run all the tests

One company we talked to mentioned that their tests take up to 13 hours to run. They start the tests in the evening and when
they come back in the morning for their daily standup scrum meeting the testing is still going on.

4

This in turn leads to delays -- delays in executing the tests as well as delays in updating the test cases.

This leads to reduced test coverage and longer feedback cycles.
It takes a lot longer for a developer to know whether his code is good or not.

Run Tests in Parallel

5

One solution to this problem could be to run the tests in parallel to save time.

For instance, another company that we talked to had tests that run for 8 hours. Their solution was to run the tests in parallel on 16
different machines, effectively reducing the runtime of their test suite from 8 hours to half an hour. In my opinion that is still a
long time to wait, especially for a developer who just wants to check whether his code is OK.

Which tests should I run when changing this method?

6

In light of this, developers are faced with a problem:

Which test(s) should they run when changing a particular part of a system?

Currently developers rely on their own gut feeling, common knowledge in the company or the expert knowledge of a colleague to select
a subset of tests that could be relevant for the code they are working on. However, tool support to aid in this task is desirable.

We therefore need to find which tests are relevant for a particular change. We can do this when we have recorded the fine-grained
changes made during development.

ChEOPSJ

[Architecture diagram. Labels: Applications (TestSelection), Model, ChangeRecorders (Logger, Distiller), ChangeDistiller, SVNKit]

ChEOPSJ: Change-Based Test Optimization
Quinten David Soetens and Serge Demeyer
In Proceedings of the 16th European Conference on Software Maintenance and Reengineering, CSMR 2012
7

This approach was implemented in a tool called ChEOPSJ, which I presented at last year's CSMR.

8

We consider changes made to the source code as first-class objects -- tangible entities that we can analyze and manipulate.

Basically it’s a tool that can record changes in the background while you are programming.
And in order to work with real-world cases we can also recover changes from source code repositories.

Once a change model is instantiated for a system we can analyze the change model and run different applications (for now only
the test selection application).

First Class Change Objects

Changes act on Source Code
(FAMIX) Entities
(e.g. AddClassChange, AddMethodChange, etc.)

9

For instance, adding a new class will result in an AddClassChange.


Changes have Structural Dependencies
(e.g. AddMethod ---> AddClass ---> AddPackage etc.)

10

We can also define dependencies between these changes. For instance, adding a method to a class requires the class to be added
first.
Therefore there is a dependency between the AddMethodChange and the AddClassChange.
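
As a rough illustration of changes as objects with dependencies, here is a minimal sketch. It is not the ChEOPSJ implementation: the class ChangeSketch, its fields and the entity names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of a first-class change (not the ChEOPSJ classes).
class ChangeSketch {
    final String kind;    // e.g. "AddPackage", "AddClass", "AddMethod"
    final String entity;  // name of the source code entity the change acts on
    final List<ChangeSketch> dependsOn = new ArrayList<>();

    ChangeSketch(String kind, String entity, ChangeSketch... prerequisites) {
        this.kind = kind;
        this.entity = entity;
        for (ChangeSketch p : prerequisites) {
            dependsOn.add(p);
        }
    }

    // Collects every change this change depends on, directly or indirectly.
    List<ChangeSketch> transitiveDependencies() {
        List<ChangeSketch> result = new ArrayList<>();
        collect(this, result);
        return result;
    }

    private static void collect(ChangeSketch change, List<ChangeSketch> acc) {
        for (ChangeSketch d : change.dependsOn) {
            if (!acc.contains(d)) {
                acc.add(d);
                collect(d, acc);
            }
        }
    }

    public static void main(String[] args) {
        // AddMethod ---> AddClass ---> AddPackage, with made-up entity names.
        ChangeSketch addPackage = new ChangeSketch("AddPackage", "shop");
        ChangeSketch addClass = new ChangeSketch("AddClass", "shop.Cart", addPackage);
        ChangeSketch addMethod = new ChangeSketch("AddMethod", "shop.Cart.total", addClass);
        for (ChangeSketch d : addMethod.transitiveDependencies()) {
            System.out.println(addMethod.kind + " depends on " + d.kind + " " + d.entity);
        }
    }
}
```

The point of the sketch is only that a change knows which other changes must exist before it.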

Traceability via Dependencies
between Test and Program Code

Changes to Program Code
Changes to Test Code

11

It’s these dependencies that we can use to find relevant tests. Since tests are also source code, we can find a chain of
dependencies between the test code and the program code.

These dependencies form a live traceability link between the test code and the source code. Using these links we can select
relevant tests for a particular change.
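
One plausible way to picture this selection is a walk over the dependency links, starting from the changed entity and collecting every test change that depends on it. This is a sketch under our own assumptions, not the actual ChEOPSJ algorithm; the class TestSelectionSketch and all change names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: changes are identified by name only.
class TestSelectionSketch {
    // dependents.get(x) holds the changes that depend on change x.
    private final Map<String, Set<String>> dependents = new HashMap<>();
    private final Set<String> testChanges = new HashSet<>();

    // Records that change `from` depends on change `to`.
    void addDependency(String from, String to) {
        dependents.computeIfAbsent(to, k -> new HashSet<>()).add(from);
    }

    // Marks a change as a change to test code.
    void markAsTest(String change) {
        testChanges.add(change);
    }

    // Follows the dependencies backwards from the changed entity and returns
    // every test change that depends on it, directly or indirectly.
    Set<String> relevantTests(String changed) {
        Set<String> visited = new HashSet<>();
        Set<String> selected = new TreeSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(changed);
        visited.add(changed);
        while (!queue.isEmpty()) {
            String current = queue.poll();
            if (testChanges.contains(current)) {
                selected.add(current);
            }
            for (String next : dependents.getOrDefault(current, Collections.<String>emptySet())) {
                if (visited.add(next)) {
                    queue.add(next);
                }
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        TestSelectionSketch model = new TestSelectionSketch();
        model.addDependency("Cart.total", "Cart");
        model.addDependency("CartTest.testTotal", "Cart.total");
        model.addDependency("OrderTest.testOrder", "Order");
        model.markAsTest("CartTest.testTotal");
        model.markAsTest("OrderTest.testOrder");
        System.out.println(model.relevantTests("Cart"));
    }
}
```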

Research Questions
Compare test subset against “retest all”

Size Reduction?

Quality?

Accuracy?

12

We evaluated our approach on two open-source cases: Cruisecontrol and PMD.
For each class we searched for the relevant test classes (using the changes in that class).
We could then compare the found subset(s) of tests against the entire (larger) test suite.

And we compared this on three criteria.

How much did we actually reduce the test suite?

What was the quality of the reduced test suite? Is this the same or worse? And we used a metric called Mutation Coverage to
gauge the quality of a set of tests.

And finally we also looked at the accuracy of our approach, which means we looked at precision and recall.

The First Question is: When we reduce the test suite to a subset of tests, how much did we actually reduce it?

Size Reduction?

Cruisecontrol: 295 Tests Reduced to 1 Test

PMD: 215 Tests Reduced to 1 Test

13

For 54% (Cruisecontrol) and 44% (PMD) of the classes we found that there was only 1 relevant test.
For 11% and 20% of the classes we found there were 2 relevant tests.
For 13% and 10% of the classes there were 3 relevant tests.
For 21% and 26% there were 4 or more relevant tests.

Relevant tests per class (share of classes):
Cruisecontrol: 1 (54.5%), 2 (11.4%), 3 (13.1%), >=4 (21.0%)
PMD: 1 (44.0%), 2 (19.9%), 3 (9.9%), >=4 (26.2%)

Size Reduction?

Cruisecontrol: 295 Tests Reduced to 2 Tests

PMD: 215 Tests Reduced to 2 Tests

14



Size Reduction?

Cruisecontrol: 295 Tests Reduced to 3 Tests

PMD: 215 Tests Reduced to 3 Tests

15



Size Reduction?

Cruisecontrol: 295 Tests Reduced to 4 or more Tests (max = 22)

PMD: 215 Tests Reduced to 4 or more Tests (max = 37)

16



Test Suites Grow

17

As such we can say that we can reduce ALL the tests

Size Reduction?

18

to a handful of tests -- 80 to 90% of the classes had up to 5 relevant tests!

Research Questions

Size Reduction?

Quality?

Accuracy?

19

The next question was: does the quality of the reduced test sets remain the same, or is it worse than with retest all?

Quality? Mutation Testing

[Slide background: a Java source code listing (class SuffixTree, package engine), labeled SOURCE CODE]

All Tests Pass

© http://pitest.org
20

To assess the quality of a set of tests, we used mutation testing.

In short, this means inserting a fault into the code and checking whether your test set fails (mutation killed) or not (mutation survived).
We used PIT as a tool to do this automatically for us.

We start with a green test suite (i.e. all tests pass)

Quality?

[Slide background: Java source code listing, labeled SOURCE CODE]

Introduce Mutant + Rerun Tests

pitest.org

21

After inserting a mutation we run the tests. If the tests still pass, we say that the mutation survived (which is BAD, because you
introduced a bug in your system and the tests did not catch it).

Quality?

[Slide background: Java source code listing, labeled SOURCE CODE]

All Tests Pass

Mutation Survived

22


Quality?

[Slide background: Java source code listing, labeled SOURCE CODE]

Introduce Mutant + Rerun Tests

pitest.org

23

After inserting another mutation we run the tests again. Now some of the tests fail, so we can say that this mutation was killed (this
is GOOD).

Quality?

[Slide background: Java source code listing, labeled SOURCE CODE]

Some Tests Fail

Mutation Killed

24


Quality?

[Slide background: Java source code listing, labeled SOURCE CODE]

Repeat For All Possible Mutations

pitest.org

25

We do this for all mutations and we get a metric: Mutation Coverage, which is the percentage of mutants killed out
of the total number of mutants introduced.

We can use this metric to gauge the quality of a set of tests. And we now want to see whether, for a particular class, the quality remains
the same when only using a reduced set of tests.
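
To make the metric concrete, here is a small sketch of the arithmetic. The helper below is our own illustration and not part of PIT, and the counts in main are made up.

```java
// Hypothetical helper (not part of PIT): computes the metric from raw counts.
class MutationCoverageSketch {
    // Returns the percentage of introduced mutants that the tests killed.
    static double mutationCoverage(int killed, int introduced) {
        if (introduced <= 0 || killed < 0 || killed > introduced) {
            throw new IllegalArgumentException("need introduced > 0 and 0 <= killed <= introduced");
        }
        return 100.0 * killed / introduced;
    }

    public static void main(String[] args) {
        // Made-up counts for one class: retest all kills 18 of 20 mutants,
        // the reduced test subset kills 15 of the same 20 mutants.
        double full = mutationCoverage(18, 20);
        double subset = mutationCoverage(15, 20);
        System.out.println("retest all: " + full + "%, subset: " + subset + "%");
    }
}
```

Comparing the two values per class is what tells us whether the reduced set of tests is as good as retest all.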

Quality?

[Slide background: Java source code listing, labeled SOURCE CODE]

Repeat For All Possible Mutations

Mutation Coverage = # Mutants Killed / # Mutants Introduced

pitest.org

25


Quality?

Cruisecontrol: 88% equal Mutation Coverage

PMD: 50% equal Mutation Coverage

26

In 88% (Cruisecontrol) and 50% (PMD) of the inspected classes we have a mutation coverage that remained the same (i.e. the quality of the reduced
test set is equal to that of the full test suite).

In 12% (Cruisecontrol) and 50% (PMD) of the classes, however, we have a worse mutation coverage, which raises the question:

Quality?

Cruisecontrol: 88% equal Mutation Coverage

PMD: 50% equal Mutation Coverage

How much worse is the Mutation Coverage?

27

How much worse is the mutation coverage in these cases?

Quality?

[Two scatter plots, one per case. Vertical axis: Percentage of more surviving mutants (0 to 100). Horizontal axis: Total number of mutants.]

28

So we looked at those test subsets where more mutants survived than with retest all.
We see that it varies from a couple of percent to a hundred percent more mutants surviving.
However, we need to take into account the total number of mutants introduced.

So that is what is shown here.
On the vertical axis we show the percentage of more surviving mutants, meaning the lower the better.

On the horizontal axis we show the total number of mutants introduced, which puts some of the data points in perspective.

Quality?

[Two scatter plots, one per case. Vertical axis: Percentage of more surviving mutants (0 to 100). Horizontal axis: Total number of mutants.]

29

For Cruisecontrol, for instance, there is one point where 100% of the introduced mutants survived the subset but were caught by
retest all. However, when put in perspective, this is out of a total of only 3 mutants!

Quality?

[Two scatter plots, one per case. Vertical axis: Percentage of more surviving mutants (0 to 100). Horizontal axis: Total number of mutants.]

30

The data points that are more worrisome in Cruisecontrol are the two in the middle, because here a relatively high number of
mutants is introduced and quite a few of them survived the subset of tests whereas they did not survive the full test set.

Quality?

[Two scatter plots, one per case. Vertical axis: Percentage of more surviving mutants (0 to 100). Horizontal axis: Total number of mutants.]

31

PMD performs a lot worse, as we can see from all of these data points with high numbers of mutants surviving the subset but not the
full set.

Quality?

[Two scatter plots, one per case. Vertical axis: Percentage of more surviving mutants (0 to 100). Horizontal axis: Total number of mutants.]

Cruisecontrol: On average 12% more mutants survive (weighted average)
PMD: On average 24% more mutants survive (weighted average)

32

Still, on average we can say that 12% (Cruisecontrol) and 24% (PMD) more mutants survive, and this is a weighted average where we took the total
number of mutants as weights.
In short, the closer the data points are to the axes, the better.
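
As a small worked sketch of such a weighted average: each class contributes its percentage of more surviving mutants, weighted by its total number of mutants. The class name and the two data points below are hypothetical and only show why a "100% of 3 mutants" point barely moves the average.

```java
// Hypothetical illustration of a weighted average; the data in main is made up.
class WeightedAverageSketch {
    // percentages[i]: percentage of more surviving mutants for class i
    // totals[i]: total number of mutants introduced in class i (used as weight)
    static double weightedAverage(double[] percentages, int[] totals) {
        if (percentages.length != totals.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        double weightedSum = 0.0;
        long totalWeight = 0;
        for (int i = 0; i < percentages.length; i++) {
            weightedSum += percentages[i] * totals[i];
            totalWeight += totals[i];
        }
        if (totalWeight == 0) {
            throw new IllegalArgumentException("no mutants introduced at all");
        }
        return weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        // One class with 100% more survivors out of 3 mutants,
        // one class with 10% more survivors out of 97 mutants.
        double[] percentages = {100.0, 10.0};
        int[] totals = {3, 97};
        System.out.println(weightedAverage(percentages, totals));
    }
}
```

An unweighted mean of the same two points would be 55%, which overstates the problem; the weighting keeps small classes from dominating.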

So our approach up to now is good, but it’s not perfect. We do miss some relevant tests.

Research Questions

Size Reduction?

Quality?

Accuracy?

33

Which leads us automatically to the next question: what’s our precision and recall?
That is:

How many of the selected tests are really relevant tests (precision)?

How many of the really relevant tests are selected (recall)?

To measure precision and recall we need some kind of oracle to tell us which actually are the relevant tests for each class.
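
Spelled out as code, the two measures look as follows. This is a minimal sketch: the class AccuracySketch and the test names are made up, and the oracle is simply passed in as a set of relevant tests.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of precision and recall for one class under test.
class AccuracySketch {
    // Number of tests that are both selected and relevant.
    private static int overlap(Set<String> selected, Set<String> relevant) {
        Set<String> common = new HashSet<>(selected);
        common.retainAll(relevant);
        return common.size();
    }

    // Fraction of the selected tests that are really relevant.
    static double precision(Set<String> selected, Set<String> relevant) {
        if (selected.isEmpty()) {
            throw new IllegalArgumentException("no tests selected");
        }
        return (double) overlap(selected, relevant) / selected.size();
    }

    // Fraction of the really relevant tests that were selected.
    static double recall(Set<String> selected, Set<String> relevant) {
        if (relevant.isEmpty()) {
            throw new IllegalArgumentException("oracle reports no relevant tests");
        }
        return (double) overlap(selected, relevant) / relevant.size();
    }

    public static void main(String[] args) {
        Set<String> selected = new HashSet<>(Arrays.asList("ATest", "BTest"));
        Set<String> relevant = new HashSet<>(Arrays.asList("ATest", "CTest", "DTest", "ETest"));
        System.out.println("precision = " + precision(selected, relevant));
        System.out.println("recall = " + recall(selected, relevant));
    }
}
```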

Accuracy? Dynamic Analysis

∀ t ∈ Tests: execute t

∀ m: Method invoked during run of t

t is a relevant test for m

34

We used a dynamic analysis to tell us.
In short, we wrote a simple aspect in AspectJ that, during the execution of a test, notes which methods were invoked.
We can then say that the test is relevant for those methods.
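
The bookkeeping behind such an oracle can be sketched in plain Java. This is only a stand-in for the aspect, not our AspectJ code: the class InvocationRecorderSketch and its methods are hypothetical, and in the real setup the call to methodInvoked would come from advice woven around method executions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Plain-Java stand-in for the aspect; all names here are hypothetical.
class InvocationRecorderSketch {
    private final Map<String, Set<String>> testsPerMethod = new HashMap<>();
    private String currentTest;

    // Called when a test starts running.
    void startTest(String testName) {
        currentTest = testName;
    }

    // Called for every method invocation observed at run time.
    void methodInvoked(String methodName) {
        if (currentTest == null) {
            return; // invocation outside of any test run, ignore it
        }
        testsPerMethod.computeIfAbsent(methodName, k -> new TreeSet<>()).add(currentTest);
    }

    // Called when the running test has finished.
    void endTest() {
        currentTest = null;
    }

    // The oracle: all tests during which the given method was invoked.
    Set<String> relevantTestsFor(String methodName) {
        return testsPerMethod.getOrDefault(methodName, new TreeSet<String>());
    }

    public static void main(String[] args) {
        InvocationRecorderSketch recorder = new InvocationRecorderSketch();
        recorder.startTest("CartTest.testTotal");
        recorder.methodInvoked("Cart.add");
        recorder.methodInvoked("Cart.total");
        recorder.endTest();
        System.out.println(recorder.relevantTestsFor("Cart.total"));
    }
}
```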

Using these results we could compare to our static analysis of the changes...

Accuracy?

[Pie charts of precision and recall per class for Cruisecontrol and PMD, binned into [0,0.25[, [0.25,0.5[, [0.5,0.75[, [0.75,1[ and [1]]

Precision: Avg: 0.88 (Cruisecontrol), Avg: 0.83 (PMD)
Recall: Avg: 0.77 (Cruisecontrol), Avg: 0.58 (PMD)

35

We find high precision values for both Cruisecontrol and PMD (on average 0.88 and 0.83),
which means that most of the tests that we selected in the subsets were in fact relevant tests!

The recall values are a bit lower, especially in the case of PMD, with an average recall of 0.77 and 0.58.
This means that some of the actually relevant tests were not selected in the subsets by our tool.
This was also apparent in the mutation testing results.

But is this really bad?

36

When we look back at our individual developer: he is performing changes on a software system and wants to test his code.

When he gets tool support saying "these are the relevant tests for your changes", he gets more confident about his code.
He will test more often. He gets shorter feedback cycles.

The selected subset is not safe, as it occasionally misses a few relevant tests; however, it is adequate, especially since the complete
test suite will be executed as part of the integration build anyway.

Future Directions
• Reducing Test Runtime
  • Polishing of the Approach (& Implementation)
  • More (Industrial) Cases

• Detect Change Patterns
  • Identify Refactorings
  • Recurring sequences of changes

• Reapplying changes
  • bug fixes
  • design improvements
  • API evolution

37

What’s next after this?
We need to do some more work on this, basically polishing the approach (trying to improve recall, probably at the cost of precision)
and seeing how the approach performs on industrial cases.
On the other hand, we also want to have a look at other applications of Change-Centric Software Development.
One thing that we are currently looking at is whether we can detect patterns in the set of changes:

-- Either predefined patterns like refactorings, and checking if we can identify those.

-- Or just frequent pattern mining on a set of changes, not knowing in advance what kind of patterns we might
uncover.

Another application is that successful changes on one branch of a piece of software might be reapplied on other branches of
that system (bug fixes?).

38

To wrap up...
We were looking for a way to find relevant tests for small changes to the software.
We found that our technique could reduce the test suite to a handful of tests (up to 5 tests in 80-90% of the cases).
We found that in 50-88% of the cases those reduced test suites had the same mutation coverage (quality) as the full test set.
The test sets that had a worse mutation coverage were actually not that bad.
And we found that we had really good precision, but lower recall, meaning that we did in fact miss some relevant tests.
However, as we mentioned, this is not a very big problem since the full test suite will in the end be run as part of the integration build anyway.
