"Clone detection in Python": Slides presented at EuroPython 2012
Clone Detection in Python highlights the topic of code duplication detection using Machine Learning techniques.
Some examples on Python code duplications and C-Python implementation duplications are reported as well.
Scaling API-first ā The story of a global engineering organization
Ā
Clone detection in Python
1. DATE: May 13, 2013Florence, Italy
Clone Detection
in Python
Valerio Maggio (valerio.maggio@unina.it)
2. Introduction
Duplicated Code
Number one in the stink
parade is duplicated code.
If you see the same code
structure in more than one
place, you can be sure that
your program will be better if
you ļ¬nd a way to unify them.
2
12. Introduction
Duplicated Code
ā£ Exists: 5% to 30% of code is similar
ā¢ In extreme cases, even up to 50%
- This is the case of Payroll, a COBOL system
ā£ Is often created during development
5
13. Introduction
Duplicated Code
ā£ Exists: 5% to 30% of code is similar
ā¢ In extreme cases, even up to 50%
- This is the case of Payroll, a COBOL system
ā£ Is often created during development
ā¢ due to time pressure for an upcoming deadline
5
14. Introduction
Duplicated Code
ā£ Exists: 5% to 30% of code is similar
ā¢ In extreme cases, even up to 50%
- This is the case of Payroll, a COBOL system
ā£ Is often created during development
ā¢ due to time pressure for an upcoming deadline
ā¢ to overcome limitations of the programming language
5
15. Introduction
Duplicated Code
ā£ Exists: 5% to 30% of code is similar
ā¢ In extreme cases, even up to 50%
- This is the case of Payroll, a COBOL system
ā£ Is often created during development
ā¢ due to time pressure for an upcoming deadline
ā¢ to overcome limitations of the programming language
ā£ Three Public Enemies:
5
16. Introduction
Duplicated Code
ā£ Exists: 5% to 30% of code is similar
ā¢ In extreme cases, even up to 50%
- This is the case of Payroll, a COBOL system
ā£ Is often created during development
ā¢ due to time pressure for an upcoming deadline
ā¢ to overcome limitations of the programming language
ā£ Three Public Enemies:
ā¢ Copy, Paste and Modify
5
17. DATE: May 13, 2013Part I: Clone Detection
Clone Detection
in Python
18. DATE: May 13, 2013Part I: Clone Detection
Clone Detection
in Python
19. Part I: Clone Detection
Code Clones
ā£ There can be different deļ¬nitions of similarity,
based on:
ā¢ Program Text (text, syntax)
ā¢ Semantics
7
(Def.) āSoftware Clones are segments of code that are similar
according to some deļ¬nition of similarityā (I.D. Baxter, 1998)
20. Part I: Clone Detection
Code Clones
ā£ There can be different deļ¬nitions of similarity,
based on:
ā¢ Program Text (text, syntax)
ā¢ Semantics
ā£ Four Diļ¬erent Types of Clones
7
(Def.) āSoftware Clones are segments of code that are similar
according to some deļ¬nition of similarityā (I.D. Baxter, 1998)
21. Part I: Clone Detection
The original one
8
# Original Fragment
def do_something_cool_in_Python(filepath, marker='---end---'):
lines = list()
with open(filepath) as report:
for l in report:
if l.endswith(marker):
lines.append(l) # Stores only lines that ends with "marker"
return lines #Return the list of different lines
22. Part I: Clone Detection
Type 1: Exact Copy
ā£ Identical code segments except for differences in layout, whitespace,
and comments
9
23. Part I: Clone Detection
Type 1: Exact Copy
ā£ Identical code segments except for differences in layout, whitespace,
and comments
9
# Original Fragment
def do_something_cool_in_Python(filepath, marker='---end---'):
lines = list()
with open(filepath) as report:
for l in report:
if l.endswith(marker):
lines.append(l) # Stores only lines that ends with "marker"
return lines #Return the list of different lines
def do_something_cool_in_Python (filepath, marker='---end---'):
lines = list() # This list is initially empty
with open(filepath) as report:
for l in report: # It goes through the lines of the file
if l.endswith(marker):
lines.append(l)
return lines
24. Part I: Clone Detection
Type 2: Parameter Substituted Clones
ā£ Structurally identical segments except for differences in identiļ¬ers,
literals, layout, whitespace, and comments
10
25. Part I: Clone Detection
Type 2: Parameter Substituted Clones
ā£ Structurally identical segments except for differences in identiļ¬ers,
literals, layout, whitespace, and comments
10
# Type 2 Clone
def do_something_cool_in_Python(path, end='---end---'):
targets = list()
with open(path) as data_file:
for t in data_file:
if l.endswith(end):
targets.append(t) # Stores only lines that ends with "marker"
#Return the list of different lines
return targets
# Original Fragment
def do_something_cool_in_Python(filepath, marker='---end---'):
lines = list()
with open(filepath) as report:
for l in report:
if l.endswith(marker):
lines.append(l) # Stores only lines that ends with "marker"
return lines #Return the list of different lines
26. Part I: Clone Detection
Type 3: Structure Substituted Clones
ā£ Similar segments with further modiļ¬cations such as changed, added (or deleted)
statements, in additions to variations in identiļ¬ers, literals, layout and comments
11
27. Part I: Clone Detection
Type 3: Structure Substituted Clones
ā£ Similar segments with further modiļ¬cations such as changed, added (or deleted)
statements, in additions to variations in identiļ¬ers, literals, layout and comments
11
import os
def do_something_with(path, marker='---end---'):
# Check if the input path corresponds to a file
if not os.path.isfile(path):
return None
bad_ones = list()
good_ones = list()
with open(path) as report:
for line in report:
line = line.strip()
if line.endswith(marker):
good_ones.append(line)
else:
bad_ones.append(line)
#Return the lists of different lines
return good_ones, bad_ones
28. Part I: Clone Detection
Type 3: Structure Substituted Clones
ā£ Similar segments with further modiļ¬cations such as changed, added (or deleted)
statements, in additions to variations in identiļ¬ers, literals, layout and comments
11
import os
def do_something_with(path, marker='---end---'):
# Check if the input path corresponds to a file
if not os.path.isfile(path):
return None
bad_ones = list()
good_ones = list()
with open(path) as report:
for line in report:
line = line.strip()
if line.endswith(marker):
good_ones.append(line)
else:
bad_ones.append(line)
#Return the lists of different lines
return good_ones, bad_ones
29. Part I: Clone Detection
Type 3: Structure Substituted Clones
ā£ Similar segments with further modiļ¬cations such as changed, added (or deleted)
statements, in additions to variations in identiļ¬ers, literals, layout and comments
11
import os
def do_something_with(path, marker='---end---'):
# Check if the input path corresponds to a file
if not os.path.isfile(path):
return None
bad_ones = list()
good_ones = list()
with open(path) as report:
for line in report:
line = line.strip()
if line.endswith(marker):
good_ones.append(line)
else:
bad_ones.append(line)
#Return the lists of different lines
return good_ones, bad_ones
30. Part I: Clone Detection
Type 3: Structure Substituted Clones
ā£ Similar segments with further modiļ¬cations such as changed, added (or deleted)
statements, in additions to variations in identiļ¬ers, literals, layout and comments
11
import os
def do_something_with(path, marker='---end---'):
# Check if the input path corresponds to a file
if not os.path.isfile(path):
return None
bad_ones = list()
good_ones = list()
with open(path) as report:
for line in report:
line = line.strip()
if line.endswith(marker):
good_ones.append(line)
else:
bad_ones.append(line)
#Return the lists of different lines
return good_ones, bad_ones
31. Part I: Clone Detection
Type 4: āSemanticā Clones
ā£ Semantically equivalent segments that perform the same
computation but are implemented by different syntactic variants
12
32. Part I: Clone Detection
Type 4: āSemanticā Clones
ā£ Semantically equivalent segments that perform the same
computation but are implemented by different syntactic variants
12
# Original Fragment
def do_something_cool_in_Python(filepath, marker='---end---'):
lines = list()
with open(filepath) as report:
for l in report:
if l.endswith(marker):
lines.append(l) # Stores only lines that ends with "marker"
return lines #Return the list of different lines
def do_always_the_same_stuff(filepath, marker='---end---'):
report = open(filepath)
file_lines = report.readlines()
report.close()
#Filters only the lines ending with marker
return filter(lambda l: len(l) and l.endswith(marker), file_lines)
33. Part I: Clone Detection
What are the consequences?
ā£ Do clones increase the maintenance effort?
ā£ Hypothesis:
ā¢ Cloned code increases code size
ā¢ A ļ¬x to a clone must be applied to all similar fragments
ā¢ Bugs are duplicated together with their clones
ā£ However: it is not always possible to remove clones
ā¢ Removal of Clones is harder if variations exist.
13
37. Part I: Clone Detection 14
Duplix
Scorpio
PMD
CCFinder
Dup
CPD
Duplix
Shinobi
Clone Detective
Gemini
iClones
KClone
ConQAT
Deckard
Clone Digger
JCCD
CloneDr SimScan
CLICS
NiCAD
Simian
Duploc
Dude
SDD
ā£ Syntax Based Tools:
ā¢ Syntax subtrees are
compared to each other
Clone Detection Tools
38. Part I: Clone Detection 14
Duplix
Scorpio
PMD
CCFinder
Dup
CPD
Duplix
Shinobi
Clone Detective
Gemini
iClones
KClone
ConQAT
Deckard
Clone Digger
JCCD
CloneDr SimScan
CLICS
NiCAD
Simian
Duploc
Dude
SDD
ā£ Graph Based Tools:
ā¢ (sub) graphs are compared to each
other
Clone Detection Tools
39. Part I: Clone Detection
Clone Detection Techniques
15
ā£ String/Token based Techiniques:
ā¢ Pros: Run very fast
ā¢ Cons: Too many false clones
ā£ Syntax based (AST) Techniques:
ā¢ Pros: Well suited to detect structural similarities
ā¢ Cons: Not Properly suited to detect Type 3 Clones
ā£ Graph based Techniques:
ā¢ Pros: The only one able to deal with Type 4 Clones
ā¢ Cons: Performance Issues
40. Part I: Clone Detection
The idea: Use Machine Learning, Luke
ā£ Use Machine Learning Techniques to compute similarity of fragments by
exploiting speciļ¬c features of the code.
ā£ Combine different sources of Information
ā¢ Structural Information: ASTs, PDGs
ā¢ Lexical Information: Program Text
16
41. Part I: Clone Detection
Kernel Methods for Structured Data
ā£ Well-grounded on solid and awful
Math
ā£ Based on the idea that objects
can be described in terms of
their constituent Parts
ā£ Can be easily tailored to specific
domains
ā¢ Tree Kernels
ā¢ Graph Kernels
ā¢ ....
17
42. Part I: Clone Detection
Deļ¬ning a Kernel for Structured Data
18
43. Part I: Clone Detection
Deļ¬ning a Kernel for Structured Data
The deļ¬nition of a new Kernel for a Structured Object requires
the deļ¬nition of:
18
44. Part I: Clone Detection
Deļ¬ning a Kernel for Structured Data
The deļ¬nition of a new Kernel for a Structured Object requires
the deļ¬nition of:
ā£ Set of features to annotate each part of the object
18
45. Part I: Clone Detection
Deļ¬ning a Kernel for Structured Data
The deļ¬nition of a new Kernel for a Structured Object requires
the deļ¬nition of:
ā£ Set of features to annotate each part of the object
ā£ A Kernel function to measure the similarity on the smallest part
of the object
18
46. Part I: Clone Detection
Deļ¬ning a Kernel for Structured Data
The deļ¬nition of a new Kernel for a Structured Object requires
the deļ¬nition of:
ā£ Set of features to annotate each part of the object
ā£ A Kernel function to measure the similarity on the smallest part
of the object
ā¢ e.g., Nodes for AST and Graphs
18
47. Part I: Clone Detection
Deļ¬ning a Kernel for Structured Data
The deļ¬nition of a new Kernel for a Structured Object requires
the deļ¬nition of:
ā£ Set of features to annotate each part of the object
ā£ A Kernel function to measure the similarity on the smallest part
of the object
ā¢ e.g., Nodes for AST and Graphs
ā£ A Kernel function to apply the computation on the different
(sub)parts of the structured object
18
48. Part I: Clone Detection
Kernel Methods for Clones:
Tree Kernels Example on AST
ā£ Features: We annotate each node by a set of 4
features
ā¢ Instruction Class
- i.e., LOOP, CONDITIONAL_STATEMENT, CALL
ā¢ Instruction
- i.e., FOR, IF, WHILE, RETURN
ā¢ Context
- i.e. Instruction Class of the closer statement node
ā¢ Lexemes
- Lexical information gathered (recursively) from leaves
- i.e., Lexical Information
19
FOR
49. Part I: Clone Detection
Kernel Methods for Clones:
Tree Kernels Example on AST
ā£ Kernel Function:
ā¢ Aims at identify the maximum
isomorphic Tree/Subtree
20
K(T1, T2) =
X
n2T1
X
n02T2
(n, n0
) Ā· Ksubt(n, n0
)
block
print
p0.0s
=
1.0
=
p
s
f
block
print
y1.0x
=
x
=
y
x
f
Ksubt(n, n0
) = sim(n, n0
) + (1 )
X
(n1,n2)2Ch(n,n0)
k(n1, n2)
50. DATE: May 13, 2013Part II: Clones and Python
Clone Detection
in Python
51. DATE: May 13, 2013Part II: Clones and Python
Clone Detection
in Python
52. Part II: In Python
The Overall Process Sketch
22
1. Pre Processing
53. Part II: In Python
The Overall Process Sketch
22
block
print
p0.0s
=
1.0
=
p
s
f
block
print
y1.0x
=
x
=
y
x
f
1. Pre Processing 2. Extraction
54. Part II: In Python
The Overall Process Sketch
22
block
print
p0.0s
=
1.0
=
p
s
f
block
print
y1.0x
=
x
=
y
x
f
block
print
p0.0s
=
1.0
=
p
s
f
block
print
y1.0x
=
x
=
y
x
f
1. Pre Processing 2. Extraction
3. Detection
55. Part II: In Python
The Overall Process Sketch
22
block
print
p0.0s
=
1.0
=
p
s
f
block
print
y1.0x
=
x
=
y
x
f
block
print
p0.0s
=
1.0
=
p
s
f
block
print
y1.0x
=
x
=
y
x
f
1. Pre Processing 2. Extraction
3. Detection 4. Aggregation
57. Part II: In Python
Empirical Evaluation
ā£ Comparison with another (pure) AST-based: Clone Digger
ā¢ It has been the ļ¬rst Clone detector for and in Python :-)
ā¢ Presented at EuroPython 2006
ā£ Comparison on a system with randomly seeded clones
24
ā£ Results refer only to
Type 3 Clones
ā£ On Type 1 and Type 2
we got the same
results
58. Part II: In Python
Precision/Recall Plot
25
0
0.25
0.50
0.75
1.00
0.6 0.62 0.64 0.66 0.68 0.7 0.72 0.74 0.76 0.78 0.8 0.82 0.84 0.86 0.88 0.9 0.92 0.94 0.96 0.98
Precision, Recall and F-Measure
Precision Recall F1
Precision: How accurate are the obtained results?
(Altern.) How many errors do they contain?
Recall: How complete are the obtained results?
(Altern.) How many clones have been retrieved w.r.t. Total Clones?
59. Part II: In Python
Is Python less clone prone?
26
Roy et. al., IWSC, 2010