This document discusses trends in code evolution observed by analyzing version control data from open source projects like Eclipse and Rhino. It finds that certain features like import statements, method calls, and word stems change frequently over time, with patterns of additions, deletions, and substitutions. These evolution patterns can be learned from historical data and used to predict future changes, detect defects, and recommend improvements by combining related patterns. The goal is to help developers better understand how code evolves and prevent defects by learning from the past evolution of other projects.
FSE'08 Doctoral Symposium
1. Trends in Code
Yana Momchilova Mileva
Saarland University, Germany
Advisor: Prof. Andreas Zeller
[Chart: x-axis 2001 to 2007, y-axis 0 to 100]
2. Google Zeitgeist
Nov 4, 2008 “david plouffe”
1. mccain concession speech
2. david plouffe
3. uncle tom
4. did prop 8 pass
5. mccain concedes
6. obama acceptance speech
7. david pluff
8. california election results 2008
9. obama elected president
10. obama campaign manager
3. The Code Evolves
[Chart: lines of code in Rhino, 1999 to 2007; y-axis 0 to 30,000]
4. The Questions
Evolution of code has an impact on the system.
How does this affect the future of the code?
Can this evolution information prevent code defects?
5. Evolution of Features
• Variables
• Import statements
• Packages
• Method calls
• Word Stems
13. What happened to Stack?
Before                                          After
package org.mozilla.javascript;                 package org.mozilla.javascript;
import java.util.Stack;
import java.util.Vector;
...                                             ...
loops = new Stack();                            loops = new ObjArray();
...                                             ...
for (int i = loops.size()-1; i >= 0; i--) {     for (int i = loops.size()-1; i >= 0; i--) {
    Node n = (Node) loops.elementAt(i);             Node n = (Node) loops.get(i);
    if (n.getType() == TokenStream.LABEL) {         if (n.getType() == TokenStream.LABEL) {
        ...                                             ...
    }                                               }
}                                               }
Commit message: “I replaced Stack by ObjArray... It avoids unnecessary synchronization and save memory. To
simplify the replacement I added to ObjArray and ObjToIntMap few utility methods.”
14. Evolution of Features
• Variables
• Import statements
• Packages
• Method calls
• Word Stems
15. Method Calls
Number of occurrences per year in Eclipse
[The "Evolution patterns" column is not recoverable from the slide text.]

Method name              2001   2002   2003   2004   2005   2006
append()                 4522  11473  20215  30658  40169  46913
getMinorComponent()        20     18     34     33     20     24
getVersionStr()            53     25     25      0      0      0
getModifiedElement()        3      3      0     10     16     17
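A deletion pattern such as the one visible for getVersionStr() (counts fall to zero and stay there) can be spotted mechanically from the yearly counts. The following is only a minimal sketch under that reading of "deletion"; the class name DeletionPatterns and the method deletionYear are hypothetical and do not come from the talk.

```java
import java.util.OptionalInt;

/** Hypothetical helper (not from the talk): finds the year a token disappeared for good. */
class DeletionPatterns {

    /**
     * Returns the first year of the trailing run of zero counts, provided the
     * token was used at least once before that run. years[i] labels counts[i].
     * Assumption: "deleted" means the count reaches zero and never recovers.
     */
    static OptionalInt deletionYear(int[] years, int[] counts) {
        if (years.length != counts.length) {
            throw new IllegalArgumentException("years and counts must have the same length");
        }
        // Walk backwards over the trailing run of zeros.
        int firstTrailingZero = counts.length;
        while (firstTrailingZero > 0 && counts[firstTrailingZero - 1] == 0) {
            firstTrailingZero--;
        }
        boolean endsWithZeros = firstTrailingZero < counts.length;
        boolean usedBefore = firstTrailingZero > 0;
        return (endsWithZeros && usedBefore)
                ? OptionalInt.of(years[firstTrailingZero])
                : OptionalInt.empty();
    }
}
```

With the Eclipse counts from the table, getVersionStr() would be reported as deleted in 2004, while getModifiedElement() would not be reported, because its count recovers after 2003.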
17. Method Calls Deletion
Before:
    classFile = new ClassFileWriter(generatedClassName, superClassName, itsSourceFile);
    ...
    for (int i = 0; i < scriptOrFn.getParamCount(); i++) {
        push(i);
        addByteCode(ByteCode.ALOAD, 4);
        ...
    }
    ...
    private void addByteCode(byte theOpcode) {
        classFile.add(theOpcode);
    }
Commit message: “Renaming Codegen.classFile to Codegen.cfw and removal of methods like push/load/store/add in favour of directly calling ClassFileMethods.”
After:
    cfw = new ClassFileWriter(generatedClassName, superClassName, itsSourceFile);
    ...
    for (int i = 0; i < scriptOrFn.getParamCount(); i++) {
        cfw.addPush(i);
        cfw.add(ByteCode.ALOAD, 4);
        ...
    }
18. Evolution of Features
• Variables
• Import statements
• Packages
• Method calls
• Word Stems (e.g., getName = {get, name})
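The getName = {get, name} example suggests that identifiers are broken into their component words. A minimal sketch of such a split is shown below, assuming camelCase and underscore boundaries only; the class WordStems is hypothetical, and no linguistic stemming (for example, reducing "deleted" to "delet") is attempted here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper (not from the talk): splits an identifier into lowercase words. */
class WordStems {

    /**
     * Splits at underscores and wherever an uppercase letter follows a
     * lowercase letter or digit, then lowercases each part.
     */
    static List<String> split(String identifier) {
        List<String> parts = new ArrayList<>();
        for (String part : identifier.split("(?<=[a-z0-9])(?=[A-Z])|_")) {
            if (!part.isEmpty()) {
                parts.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return parts;
    }
}
```

Under these assumptions, getName yields get and name, and gtk_new_system yields gtk, new, and system.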
19. Word Stems
Number of occurrences per year in Rhino
[The "Evolution patterns" column is not recoverable from the slide text.]

Word stem   1999  2000  2001  2002  2003  2004  2005  2006  2007
get         3626  3750  1775  1803  1640  1474  1493  1502  1647
set          488   488   259   316   303   291   292   297   318
feature        0     0    12    16    13    21    21    23    32
system         2     2     0     0     0     0     0     1     3
20. Learning from Evolution
tokens (CVS data) -> tokens db -> evolution analyzer -> deletion patterns
program + deletion patterns -> pattern violations -> point to defect locations
21. Learning from Evolution
tokens (CVS data) -> tokens db -> evolution analyzer -> deletion patterns -> pattern violations -> issue a warning!
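The last step of this pipeline, turning learned deletion patterns into warnings, could look roughly like the sketch below. It assumes a deletion pattern is simply a token name that history shows being removed; ViolationChecker and its warning text are hypothetical, and the talk does not specify how violations are actually matched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical checker (not from the talk): flags tokens that match a deletion pattern. */
class ViolationChecker {

    /**
     * Returns one warning per program token that appears in the set of
     * tokens known (from history) to have been deleted.
     */
    static List<String> warnings(List<String> programTokens, Set<String> deletedTokens) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < programTokens.size(); i++) {
            String token = programTokens.get(i);
            if (deletedTokens.contains(token)) {
                result.add("warning: token #" + i + " uses '" + token
                        + "', which matches a deletion pattern");
            }
        }
        return result;
    }
}
```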
23. Combining Patterns
Project                  Token type  Old token             New token           # substitutions
Rhino (31 in total)      m call      getNextSibling        getNext             12
                         m call      getShort              getIndex            8
                         m call      generateCodeFromNode  generateExpression  8
                         m call      reportError           reportSyntaxError   9
                         ...
Eclipse (1864 in total)  m call      addVariable           addReslover         48
                         m call      outputDelimiter       outputIn            29
                         m call      gtk_new               gtk_new_system      18
                         m call      getString             translateString     16
                         ...
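Substitution counts like these could back a simple recommender: for a token that is being deleted, suggest the replacement seen most often in history. The sketch below is an assumption about how such a lookup might work, not the tool described in the talk; SubstitutionAdvisor and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical recommender (not from the talk) built on observed substitution counts. */
class SubstitutionAdvisor {

    // old token -> (new token -> number of observed substitutions)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    /** Records that oldToken was replaced by newToken the given number of times. */
    void record(String oldToken, String newToken, int times) {
        counts.computeIfAbsent(oldToken, k -> new HashMap<>())
              .merge(newToken, times, Integer::sum);
    }

    /** Returns the most frequently observed replacement, if any was recorded. */
    Optional<String> recommend(String oldToken) {
        Map<String, Integer> candidates = counts.get(oldToken);
        if (candidates == null) {
            return Optional.empty();
        }
        return candidates.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }
}
```

After recording the Rhino row getNextSibling to getNext with 12 substitutions, recommend("getNextSibling") would return getNext.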
24. Combining Patterns
Before                                                  After
...                                                     ...
Node lhs = n.getFirstChild();                           Node lhs = n.getFirstChild();
Node rhs = lhs.getNextSibling();                        Node rhs = lhs.getNext();
lookForVariablesAndCalls(rhs, liveSet, theVariables);   lookForVariablesAndCalls(rhs, liveSet, theVariables);
...                                                     ...
[Two charts omitted: x-axes 1999 to 2005; y-axis ticks 0 to 300 and 0 to 150]
Commit message: “I removed method duplication in Node where getNext() was duplicated as getNextSibling() and code was using both of them and similarly for getFirstChild()/getFirst().”
25. Learning from Evolution
tokens (CVS data) -> tokens db -> evolution analyzer -> combined deletion patterns -> pattern violations -> issue a warning! + recommend substitution
26. Future Work
Extend the tokens list
• Variables
• Import statements
• Packages
• Method calls
• Word Stems
• More features
27. Future Work
Context matters
getShort() -> getIndex() (replaced in Rhino)
    ...
    case LINE_ICODE: {
        int line = getShort(iCode, pc + 1);
        ...
    }
    ...
In 100% of the places where ‘LINE_ICODE’ was used, getShort() was not replaced by getIndex().
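One way to take context into account is to track, per surrounding context, how often a substitution was actually applied, and to recommend it only where history supports it. The sketch below assumes a context is identified by a plain string such as a case label; ContextStats, its methods, and the threshold parameter are hypothetical and not part of the talk.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalDouble;

/** Hypothetical per-context statistics (not from the talk) for one substitution. */
class ContextStats {

    // context -> {times replaced, times observed}
    private final Map<String, int[]> stats = new HashMap<>();

    /** Records one occurrence of the old token in the given context. */
    void observe(String context, boolean replaced) {
        int[] s = stats.computeIfAbsent(context, k -> new int[2]);
        if (replaced) {
            s[0]++;
        }
        s[1]++;
    }

    /** Fraction of occurrences in this context that were replaced, if any were seen. */
    OptionalDouble replacementRate(String context) {
        int[] s = stats.get(context);
        return (s == null) ? OptionalDouble.empty()
                           : OptionalDouble.of((double) s[0] / s[1]);
    }

    /** Recommends only when the context was seen and its rate meets the threshold. */
    boolean shouldRecommend(String context, double threshold) {
        OptionalDouble rate = replacementRate(context);
        return rate.isPresent() && rate.getAsDouble() >= threshold;
    }
}
```

For the LINE_ICODE case, where getShort() was never replaced, the rate would be 0.0 and no substitution would be recommended in that context.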
28. Future Work
Program Analysis Features
...                                     ...
scriptOrFn.getParamCount();             addByteCode(ByteCode.ALOAD, 4);
addByteCode(ByteCode.ALOAD, 4);         scriptOrFn.getParamCount();
...                                     ...
Evolution and trends of sequence of method calls
29. Planned Evaluation
• Training and testing data sets:
    • learn evolution patterns from the training set;
    • predict deletions in the testing set.
• Recommendation tool; perform user studies with:
    • the open-source community;
    • the closed-source community.
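The talk does not say how predicted deletions will be scored against the testing set. One common choice, assumed here purely for illustration, is precision and recall over sets of token names; DeletionEvaluation is a hypothetical class.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical scoring helper (not from the talk): precision and recall of predicted deletions. */
class DeletionEvaluation {

    /** Share of predicted deletions that really occurred in the testing set. */
    static double precision(Set<String> predicted, Set<String> actual) {
        return predicted.isEmpty() ? 0.0 : (double) overlap(predicted, actual) / predicted.size();
    }

    /** Share of real deletions in the testing set that were predicted. */
    static double recall(Set<String> predicted, Set<String> actual) {
        return actual.isEmpty() ? 0.0 : (double) overlap(predicted, actual) / actual.size();
    }

    // Number of tokens present in both sets; the inputs are left unmodified.
    private static int overlap(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }
}
```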
30. Related Work
Thomas Zimmermann
Beat Fluri
Stephan Diehl