Csmr2013 presentation
1. Change-Based Test Selection in the Presence of Developer Tests
Quinten David Soetens
Serge Demeyer
Andy Zaidman
Buongiorno, my name is Quinten, and I’m from the University of Antwerp.
I will be showing a technique that we investigated to reduce the size of a test suite.
2. Test Suites Grow
As a Software System grows, so does its Test Suite. And they can grow very large indeed!
We talked to a couple of companies in industry and they confirmed that this is indeed a relevant problem for them. ... Why?
3. Take Hours to Run
Because large test suites lead to inefficient testing: it takes too long to run all the tests.
One company we talked to mentioned that their tests take up to 13 hours to run. They start the tests in the evening, and when they come back in the morning for their daily standup scrum meeting, the testing is still going on.
4.
This in turn leads to delays: delays in executing the tests as well as delays in updating the test cases.
This leads to reduced test coverage and longer feedback cycles.
It takes a lot longer for a developer to know whether his code is good or not.
5. Run Tests in Parallel
One solution to this problem could be to run the tests in parallel to save time.
For instance, another company that we talked to had tests that run for 8 hours. Their solution was to run the tests in parallel on 16 different machines, effectively reducing the runtime of their test suite from 8 hours to half an hour. In my opinion that is still a long time to wait, especially for a developer who just wants to check if his code is OK.
6. Which tests should I run when changing this method?
In light of this, developers are faced with a problem:
Which test(s) should they run when changing a particular part of a system?
Currently, developers use their own gut feeling, common knowledge in the company or the expert knowledge of a colleague to select a subset of tests that could be relevant for the code they are working on. However, tool support to aid in this task is desirable.
We therefore need to find which tests are relevant for a particular change. We can do this when we have recorded the fine-grained changes made during development.
7. ChEOPSJ
[Architecture diagram with the labels: Applications, TestSelection, Model, ChangeRecorders, Logger, Distiller, Change Distiller, SVNKit]
ChEOPSJ: Change-Based Test Optimization
Quinten David Soetens and Serge Demeyer
In Proceedings of the 16th European Conference on Software Maintenance and Reengineering, CSMR 2012
This approach was implemented in a tool called ChEOPSJ, which I presented at last year’s CSMR.
8. ChEOPSJ
We consider changes made to the source code as first-class objects: tangible entities that we can analyze and manipulate.
Basically it’s a tool that records changes in the background while you are programming.
In order to work with real-world cases, we can also recover changes from source code repositories.
Once a change model is instantiated for a system, we can analyze the change model and run different applications on it (for now only the test selection application).
9. First Class Change Objects
Changes act on Source Code (FAMIX) Entities
(e.g. AddClassChange, AddMethodChange, etc.)
For instance, adding a new class will result in an AddClassChange.
10. First Class Change Objects
Changes have Structural Dependencies
(e.g. AddMethod ---> AddClass ---> AddPackage, etc.)
We can also define dependencies between these changes. For instance, adding a method to a class requires the class to be added first.
Therefore there is a dependency between the AddMethodChange and the AddClassChange.
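The dependency idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ChEOPSJ's actual data model: the `Change` class, its fields and the `add_change` and `dependency_chain` helpers are hypothetical names, and the `shop.Cart` entities are made up. Only the change kinds (AddPackage, AddClass, AddMethod) come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    """A first-class change object acting on one source code entity (hypothetical model)."""
    kind: str      # e.g. "AddPackage", "AddClass", "AddMethod"
    entity: str    # qualified name of the entity the change acts on
    depends_on: list = field(default_factory=list)  # changes that must happen first


def add_change(kind, entity, parent=None):
    """Create a change; adding an entity depends on the change that added its parent."""
    change = Change(kind, entity)
    if parent is not None:
        change.depends_on.append(parent)
    return change


def dependency_chain(change):
    """Follow the structural dependencies from a change up to the root change."""
    chain = [change.kind]
    while change.depends_on:
        change = change.depends_on[0]
        chain.append(change.kind)
    return chain


# Made-up example: a package, a class in it, and a method in that class.
add_package = add_change("AddPackage", "shop")
add_class = add_change("AddClass", "shop.Cart", parent=add_package)
add_method = add_change("AddMethod", "shop.Cart.total", parent=add_class)

print(" ---> ".join(dependency_chain(add_method)))
```

Running the sketch prints the same chain as the slide: the method addition depends on the class addition, which depends on the package addition.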
11. First Class Change Objects
Traceability via Dependencies between Test and Program Code
[Diagram: Changes to Test Code linked by dependencies to Changes to Program Code]
It’s these dependencies that we can use to find relevant tests. Since tests are also source code, we can find a series of dependencies between the test code and the source code.
These dependencies form a live traceability link between the test code and the source code. Using these links we can select the relevant tests for a particular change.
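One way to picture this selection is as a graph traversal: start at the change that introduced the modified entity, walk to every change that (transitively) depends on it, and collect the tests that own those dependent changes. The sketch below assumes exactly that; it is not taken from the ChEOPSJ implementation. The change table, the entity names and the `relevant_tests` function are all hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical change model: id -> (description, owning test class or None, dependencies)
CHANGES = {
    "c1": ("add class Cart", None, []),
    "c2": ("add method Cart.total", None, ["c1"]),
    "c3": ("add class CartTest", "CartTest", []),
    "c4": ("add method CartTest.testTotal", "CartTest", ["c3"]),
    # The invocation of Cart.total inside testTotal depends on both methods existing.
    "c5": ("add invocation testTotal -> Cart.total", "CartTest", ["c4", "c2"]),
    "c6": ("add class Invoice", None, []),
    "c7": ("add class InvoiceTest", "InvoiceTest", []),
}


def relevant_tests(changed_id, changes):
    """Return the test classes owning a change that transitively depends on changed_id."""
    # Invert the dependency edges so we can walk from a change to its dependents.
    dependents = defaultdict(set)
    for change_id, (_, _, deps) in changes.items():
        for dep in deps:
            dependents[dep].add(change_id)

    seen = {changed_id}
    queue = deque([changed_id])
    tests = set()
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent in seen:
                continue
            seen.add(dependent)
            queue.append(dependent)
            owner = changes[dependent][1]
            if owner is not None:
                tests.add(owner)
    return tests


print(relevant_tests("c2", CHANGES))  # tests reachable from the Cart.total change
print(relevant_tests("c6", CHANGES))  # Invoice has no dependent test changes
```

In this made-up model a change to `Cart.total` selects only `CartTest`, while `Invoice` selects nothing because no test change depends on it. That second case hints at why a purely static selection can miss relevant tests.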
12. Research Questions
Compare test subset against “retest all”
Size Reduction?
Quality?
Accuracy?
We evaluated our approach on two open source cases: Cruisecontrol and PMD.
For each class we searched for the relevant test classes, using the changes in the class.
We could then compare the resulting subsets of tests against the entire (larger) test suite.
We compared them on three criteria.
How much did we actually reduce the test suite?
What was the quality of the reduced test suite? Is it the same or worse? We used a metric called Mutation Coverage to gauge the quality of a set of tests.
And finally we looked at the accuracy of our approach, meaning its precision and recall.
The first question is: when we reduce the test suite to a subset of tests, how much did we actually reduce it?
13. Size Reduction?
Cruisecontrol: 295 Tests Reduced to 1 Test
PMD: 215 Tests Reduced to 1 Test
For 54% (Cruisecontrol) and 44% (PMD) of the classes we found that there was only 1 relevant test.
For 11% and 20% of the classes we found there were 2 relevant tests.
13% and 10% of the classes had 3 relevant tests.
For 21% and 26% there were 4 or more relevant tests.
Relevant tests    Cruisecontrol    PMD
1                 54.5%            44.0%
2                 11.4%            19.9%
3                 13.1%             9.9%
>=4               21.0%            26.2%
14. Size Reduction?
Cruisecontrol: 295 Tests Reduced to 2 Tests
PMD: 215 Tests Reduced to 2 Tests
15. Size Reduction?
Cruisecontrol: 295 Tests Reduced to 3 Tests
PMD: 215 Tests Reduced to 3 Tests
16. Size Reduction?
Cruisecontrol: 295 Tests Reduced to 4 or more Tests (max = 22)
PMD: 215 Tests Reduced to 4 or more Tests (max = 37)
17. Test Suites Grow
As such we can say that we can reduce ALL the tests
18. Size Reduction?
to a handful of tests: 80 to 90% of the classes had up to 5 relevant tests!
19. Research Questions
Compare test subset against “retest all”
Size Reduction?
Quality?
Accuracy?
The next question was: does the quality of the reduced test sets remain the same, or is it worse than with retest all?
27. Quality?
Cruisecontrol: 88% equal Mutation Coverage
PMD: 50% equal Mutation Coverage
In 88% (Cruisecontrol) and 50% (PMD) of the inspected classes the mutation coverage remained the same (i.e. the quality of the reduced test set is equal to that of the full test suite).
In 12% (Cruisecontrol) and 50% (PMD), however, we have a worse mutation coverage, but the question then arises:
28. Quality?
Cruisecontrol: 88% equal Mutation Coverage
PMD: 50% equal Mutation Coverage
How much worse is the Mutation Coverage?
How much worse is the mutation coverage in these cases?
29. Quality?
[Two scatter plots, one per case. Vertical axis: percentage of more surviving mutants. Horizontal axis: total number of mutants.]
So we looked at those test subsets where more mutants survived than with the retest all.
We see that it varies from a couple of percent to a hundred percent more mutants surviving.
However, we need to take into account the total number of mutants introduced.
That is what is shown here.
On the vertical axis we show the percentage of more surviving mutants, meaning the lower the better.
On the horizontal axis we show the total number of mutants introduced, which puts some of the data points in perspective.
30. Quality?
[Two scatter plots, one per case. Vertical axis: percentage of more surviving mutants. Horizontal axis: total number of mutants.]
For Cruisecontrol, for instance, there is one point where 100% of the introduced mutants survived the subset but were caught by the retest all. However, when put in perspective, this is out of a total of only 3 mutants!
31. Quality?
[Two scatter plots, one per case. Vertical axis: percentage of more surviving mutants. Horizontal axis: total number of mutants.]
The data points that are more worrisome in Cruisecontrol are the two in the middle, because here a relatively high number of mutants is introduced and quite a few of them survived the subset of tests where they did not survive the full test set.
32. Quality?
[Two scatter plots, one per case. Vertical axis: percentage of more surviving mutants. Horizontal axis: total number of mutants.]
PMD performs a lot worse, as we can see from all of these data points with high numbers of mutants surviving the subset but not the full set.
33. Quality?
[Two scatter plots, one per case. Vertical axis: percentage of more surviving mutants. Horizontal axis: total number of mutants.]
Plot annotations: "On average 12% more mutants survive (weighted average)" and "On average 24% more mutants survive (weighted average)"
Still, on average we can say that 12% (Cruisecontrol) and 24% (PMD) more mutants survive. This is a weighted average where we took the total number of mutants as weights.
In short, the closer the data points are to the axes, the better.
So our approach up to now is good, but it’s not perfect. We do miss some relevant tests.
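To make the weighting concrete, here is a small Python sketch. It assumes that the "percentage of more surviving mutants" for a class is the number of extra survivors divided by the total number of mutants introduced for that class, which matches the 3-mutant example above. The function names and the sample numbers are invented for illustration and are not the study's data.

```python
def extra_survivor_pct(survived_subset, survived_full, total_mutants):
    """Percentage of mutants that survive the subset but are killed by the full suite."""
    return 100.0 * (survived_subset - survived_full) / total_mutants


def weighted_average(classes):
    """Average the per-class percentages, weighted by the total number of mutants."""
    total_weight = sum(total for _, _, total in classes)
    weighted_sum = sum(
        extra_survivor_pct(subset, full, total) * total
        for subset, full, total in classes
    )
    return weighted_sum / total_weight


# Invented data: (survivors with subset, survivors with full suite, total mutants)
sample = [
    (3, 0, 3),       # 100% more survivors, but only 3 mutants in total
    (12, 10, 100),   # 2% more survivors out of 100 mutants
    (5, 5, 50),      # no difference
]

plain_mean = sum(extra_survivor_pct(*c) for c in sample) / len(sample)
print(f"unweighted mean: {plain_mean:.1f}%")
print(f"weighted mean:   {weighted_average(sample):.1f}%")
```

With these invented numbers the unweighted mean is dominated by the 3-mutant outlier, while the weighted mean stays low. That is the reason for using the total number of mutants as weights.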
34. Research Questions
Compare test subset against “retest all”
Size Reduction?
Quality?
Accuracy?
Which leads us automatically to the next question: what are our precision and recall?
That is:
How many of the selected tests are really relevant tests (precision)?
How many of the really relevant tests are selected (recall)?
To measure precision and recall we need some kind of oracle to tell us which actually are the relevant tests for each class.
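As a reminder of how the two measures are computed per class, here is a minimal Python sketch. The function name and the test names are hypothetical, and returning 1.0 for an empty set is a convention chosen for this sketch, not something stated in the talk.

```python
def precision_recall(selected, relevant):
    """Compare the selected tests against the oracle's relevant tests for one class."""
    selected, relevant = set(selected), set(relevant)
    hits = selected & relevant  # selected tests that are really relevant
    # Assumption: an empty set counts as a perfect score for that measure.
    precision = len(hits) / len(selected) if selected else 1.0
    recall = len(hits) / len(relevant) if relevant else 1.0
    return precision, recall


# Hypothetical class: the tool selected two tests, the oracle lists four.
selected = {"CartTest", "OrderTest"}
relevant = {"CartTest", "OrderTest", "CheckoutTest", "InvoiceTest"}

precision, recall = precision_recall(selected, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here every selected test is relevant (precision 1.0), but only half of the relevant tests were selected (recall 0.5).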
35. Accuracy?
Dynamic Analysis
∀ t ∈ Tests: execute t
    ∀ m : Method invoked during run of t
        t is a relevant test for m
We used a dynamic analysis to tell us.
In short, we wrote a simple aspect in AspectJ that, during the execution of a test, notes which methods were invoked.
We can then say that the test is relevant for those methods.
We could then compare these results to our static analysis of the changes...
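The oracle in the talk was built with an AspectJ aspect. The following is only a Python analogue of the same idea, using the standard `sys.setprofile` hook to note which Python functions are called while a test runs. The functions under test, the tests and the helper names `invoked_during` and `build_oracle` are all made up for this sketch.

```python
import sys
from collections import defaultdict


def invoked_during(test_fn):
    """Run one test and return the names of the Python functions it invoked."""
    invoked = set()

    def profiler(frame, event, arg):
        # "call" fires for Python-level function calls only (not C built-ins).
        if event == "call":
            invoked.add(frame.f_code.co_name)

    sys.setprofile(profiler)
    try:
        test_fn()
    finally:
        sys.setprofile(None)
    invoked.discard(test_fn.__name__)  # the test itself is not production code
    return invoked


def build_oracle(tests):
    """For every test t and every method m invoked during t: t is relevant for m."""
    relevant = defaultdict(set)
    for test in tests:
        for method in invoked_during(test):
            relevant[method].add(test.__name__)
    return dict(relevant)


# Made-up production code and tests.
def parse(text):
    return text.split(",")


def count_fields(text):
    return len(parse(text))


def test_parse():
    assert parse("a") == ["a"]


def test_count_fields():
    assert count_fields("a,b") == 2


oracle = build_oracle([test_parse, test_count_fields])
for method, tests in sorted(oracle.items()):
    print(method, "<-", sorted(tests))
```

In this toy example `parse` is exercised by both tests, directly by one and indirectly by the other. That kind of indirect link is what a dynamic oracle can reveal.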
36. Accuracy?
[Pie charts, one per case, showing the distribution of classes over the bins [0,0.25[, [0.25,0.5[, [0.5,0.75[, [0.75,1[ and [1]]
Precision: Avg 0.88 and Avg 0.83
Recall: Avg 0.77 and Avg 0.58
For both Cruisecontrol and PMD we find high precision values (on average 0.88 and 0.83).
This means that most of the tests that we selected in the subsets were in fact relevant tests!
The recall values are a bit lower, especially in the case of PMD, with an average recall of 0.77 and 0.58.
This means that some of the actually relevant tests were not selected in the subsets by our tool.
This was also apparent in the mutation testing approach.
But is this really bad?
37.
Let’s look back at our individual developer. He is making changes to a software system and wants to test his code.
When he gets tool support saying "these are the relevant tests for your changes", he becomes more confident about his code.
He will test more often. He gets shorter feedback cycles.
The selected subset is not safe, as it occasionally misses a few relevant tests. However, it is adequate, especially since the complete test suite will be executed as part of the integration build anyway.
38.
What’s next after this?
We need to do some more work on this, basically polishing the approach (trying to improve recall, probably at the cost of precision) and seeing how it performs on industrial cases.
On the other hand, we also want to look at other applications of Change Centric Software Development.
One thing that we are currently investigating is whether we can detect patterns in the set of changes:
-- Either predefined patterns like refactorings, checking whether we can identify those.
-- Or frequent pattern mining on a set of changes, without knowing in advance what kind of patterns we might uncover.
Another application is that successful changes on one branch of a piece of software might be reapplied on other branches of that system (bug fixes?).
39. Future Directions
• Reducing Test Runtime
  • Polishing of the Approach (& Implementation)
  • More (Industrial) Cases
• Detect Change Patterns
  • Identify Refactorings
  • Recurring sequences of changes
• Reapplying changes
  • bug fixes
  • design improvements
  • API evolution
40.
To wrap up...
We were looking for a way to find the relevant tests for small changes to the software.
We found that our technique could reduce the test suite to a handful of tests (up to 5 tests in 80-90% of the cases).
We found that in 50-88% of the cases those reduced test suites had the same mutation coverage (quality) as the full test set.
The test sets that had a worse mutation coverage were actually not that bad.
And we found that we had really good precision but lower recall, meaning that we did in fact miss some relevant tests.
However, as we mentioned, this is not a very big problem since the full test suite will in the end be executed as part of the integration build anyway.