SlideShare uma empresa Scribd logo
1 de 61
Baixar para ler offline
DATE: May 13, 2013Florence, Italy
Clone Detection
in Python
Valerio Maggio (valerio.maggio@unina.it)
Introduction
Duplicated Code
Number one in the stink
parade is duplicated code.
If you see the same code
structure in more than one
place, you can be sure that
your program will be better if
you ļ¬nd a way to unify them.
2
Introduction
The Python Way
3
Introduction
The Python Way
3
Introduction
NLTK (tree.py)
4
Introduction
NLTK (tree.py)
4
Introduction
NLTK (tree.py)
4
Introduction
NLTK (tree.py)
4
Introduction
NLTK (tree.py)
4
Introduction
NLTK (tree.py)
4
Introduction
Duplicated Code
ā€£ Exists: 5% to 30% of code is similar
ā€¢ In extreme cases, even up to 50%
- This is the case of Payroll, a COBOL system
5
Introduction
Duplicated Code
ā€£ Exists: 5% to 30% of code is similar
ā€¢ In extreme cases, even up to 50%
- This is the case of Payroll, a COBOL system
ā€£ Is often created during development
5
Introduction
Duplicated Code
ā€£ Exists: 5% to 30% of code is similar
ā€¢ In extreme cases, even up to 50%
- This is the case of Payroll, a COBOL system
ā€£ Is often created during development
ā€¢ due to time pressure for an upcoming deadline
5
Introduction
Duplicated Code
ā€£ Exists: 5% to 30% of code is similar
ā€¢ In extreme cases, even up to 50%
- This is the case of Payroll, a COBOL system
ā€£ Is often created during development
ā€¢ due to time pressure for an upcoming deadline
ā€¢ to overcome limitations of the programming language
5
Introduction
Duplicated Code
ā€£ Exists: 5% to 30% of code is similar
ā€¢ In extreme cases, even up to 50%
- This is the case of Payroll, a COBOL system
ā€£ Is often created during development
ā€¢ due to time pressure for an upcoming deadline
ā€¢ to overcome limitations of the programming language
ā€£ Three Public Enemies:
5
Introduction
Duplicated Code
ā€£ Exists: 5% to 30% of code is similar
ā€¢ In extreme cases, even up to 50%
- This is the case of Payroll, a COBOL system
ā€£ Is often created during development
ā€¢ due to time pressure for an upcoming deadline
ā€¢ to overcome limitations of the programming language
ā€£ Three Public Enemies:
ā€¢ Copy, Paste and Modify
5
DATE: May 13, 2013Part I: Clone Detection
Clone Detection
in Python
DATE: May 13, 2013Part I: Clone Detection
Clone Detection
in Python
Part I: Clone Detection
Code Clones
ā€£ There can be different deļ¬nitions of similarity,
based on:
ā€¢ Program Text (text, syntax)
ā€¢ Semantics
7
(Def.) ā€œSoftware Clones are segments of code that are similar
according to some deļ¬nition of similarityā€ (I.D. Baxter, 1998)
Part I: Clone Detection
Code Clones
ā€£ There can be different deļ¬nitions of similarity,
based on:
ā€¢ Program Text (text, syntax)
ā€¢ Semantics
ā€£ Four Diļ¬€erent Types of Clones
7
(Def.) ā€œSoftware Clones are segments of code that are similar
according to some deļ¬nition of similarityā€ (I.D. Baxter, 1998)
Part I: Clone Detection
The original one
8
# Original Fragment
def do_something_cool_in_Python(filepath, marker='---end---'):
	 lines = list()
	 with open(filepath) as report:
	 	 for l in report:
	 	 	 if l.endswith(marker):
	 	 	 	 lines.append(l) # Stores only lines that ends with "marker"
	 return lines #Return the list of different lines
Part I: Clone Detection
Type 1: Exact Copy
ā€£ Identical code segments except for differences in layout, whitespace,
and comments
9
Part I: Clone Detection
Type 1: Exact Copy
ā€£ Identical code segments except for differences in layout, whitespace,
and comments
9
# Original Fragment
def do_something_cool_in_Python(filepath, marker='---end---'):
	 lines = list()
	 with open(filepath) as report:
	 	 for l in report:
	 	 	 if l.endswith(marker):
	 	 	 	 lines.append(l) # Stores only lines that ends with "marker"
	 return lines #Return the list of different lines
def do_something_cool_in_Python (filepath, marker='---end---'):
	 lines = list() # This list is initially empty
	 with open(filepath) as report:
	 	 for l in report: # It goes through the lines of the file
	 	 	 if l.endswith(marker):
	 	 	 	 lines.append(l)
	 return lines
Part I: Clone Detection
Type 2: Parameter Substituted Clones
ā€£ Structurally identical segments except for differences in identiļ¬ers,
literals, layout, whitespace, and comments
10
Part I: Clone Detection
Type 2: Parameter Substituted Clones
ā€£ Structurally identical segments except for differences in identiļ¬ers,
literals, layout, whitespace, and comments
10
# Type 2 Clone
def do_something_cool_in_Python(path, end='---end---'):
	 targets = list()
	 with open(path) as data_file:
	 	 for t in data_file:
	 	 	 if l.endswith(end):
	 	 	 	 targets.append(t) # Stores only lines that ends with "marker"
	 #Return the list of different lines
	 return targets
# Original Fragment
def do_something_cool_in_Python(filepath, marker='---end---'):
	 lines = list()
	 with open(filepath) as report:
	 	 for l in report:
	 	 	 if l.endswith(marker):
	 	 	 	 lines.append(l) # Stores only lines that ends with "marker"
	 return lines #Return the list of different lines
Part I: Clone Detection
Type 3: Structure Substituted Clones
ā€£ Similar segments with further modiļ¬cations such as changed, added (or deleted)
statements, in additions to variations in identiļ¬ers, literals, layout and comments
11
Part I: Clone Detection
Type 3: Structure Substituted Clones
ā€£ Similar segments with further modiļ¬cations such as changed, added (or deleted)
statements, in additions to variations in identiļ¬ers, literals, layout and comments
11
import os
def do_something_with(path, marker='---end---'):
	 # Check if the input path corresponds to a file
	 if not os.path.isfile(path):
	 	 return None
	
	 bad_ones = list()
	 good_ones = list()
	 with open(path) as report:
	 	 for line in report:
	 	 	 line = line.strip()
	 	 	 if line.endswith(marker):
	 	 	 	 good_ones.append(line)
	 	 	 else:
	 	 	 	 bad_ones.append(line)
	 #Return the lists of different lines
	 return good_ones, bad_ones
Part I: Clone Detection
Type 3: Structure Substituted Clones
ā€£ Similar segments with further modiļ¬cations such as changed, added (or deleted)
statements, in additions to variations in identiļ¬ers, literals, layout and comments
11
import os
def do_something_with(path, marker='---end---'):
	 # Check if the input path corresponds to a file
	 if not os.path.isfile(path):
	 	 return None
	
	 bad_ones = list()
	 good_ones = list()
	 with open(path) as report:
	 	 for line in report:
	 	 	 line = line.strip()
	 	 	 if line.endswith(marker):
	 	 	 	 good_ones.append(line)
	 	 	 else:
	 	 	 	 bad_ones.append(line)
	 #Return the lists of different lines
	 return good_ones, bad_ones
Part I: Clone Detection
Type 3: Structure Substituted Clones
ā€£ Similar segments with further modiļ¬cations such as changed, added (or deleted)
statements, in additions to variations in identiļ¬ers, literals, layout and comments
11
import os
def do_something_with(path, marker='---end---'):
	 # Check if the input path corresponds to a file
	 if not os.path.isfile(path):
	 	 return None
	
	 bad_ones = list()
	 good_ones = list()
	 with open(path) as report:
	 	 for line in report:
	 	 	 line = line.strip()
	 	 	 if line.endswith(marker):
	 	 	 	 good_ones.append(line)
	 	 	 else:
	 	 	 	 bad_ones.append(line)
	 #Return the lists of different lines
	 return good_ones, bad_ones
Part I: Clone Detection
Type 3: Structure Substituted Clones
ā€£ Similar segments with further modiļ¬cations such as changed, added (or deleted)
statements, in additions to variations in identiļ¬ers, literals, layout and comments
11
import os
def do_something_with(path, marker='---end---'):
	 # Check if the input path corresponds to a file
	 if not os.path.isfile(path):
	 	 return None
	
	 bad_ones = list()
	 good_ones = list()
	 with open(path) as report:
	 	 for line in report:
	 	 	 line = line.strip()
	 	 	 if line.endswith(marker):
	 	 	 	 good_ones.append(line)
	 	 	 else:
	 	 	 	 bad_ones.append(line)
	 #Return the lists of different lines
	 return good_ones, bad_ones
Part I: Clone Detection
Type 4: ā€œSemanticā€ Clones
ā€£ Semantically equivalent segments that perform the same
computation but are implemented by different syntactic variants
12
Part I: Clone Detection
Type 4: ā€œSemanticā€ Clones
ā€£ Semantically equivalent segments that perform the same
computation but are implemented by different syntactic variants
12
# Original Fragment
def do_something_cool_in_Python(filepath, marker='---end---'):
	 lines = list()
	 with open(filepath) as report:
	 	 for l in report:
	 	 	 if l.endswith(marker):
	 	 	 	 lines.append(l) # Stores only lines that ends with "marker"
	 return lines #Return the list of different lines
def do_always_the_same_stuff(filepath, marker='---end---'):
	 report = open(filepath)
	 file_lines = report.readlines()
	 report.close()
	 #Filters only the lines ending with marker
	 return filter(lambda l: len(l) and l.endswith(marker), file_lines)
Part I: Clone Detection
What are the consequences?
ā€£ Do clones increase the maintenance effort?
ā€£ Hypothesis:
ā€¢ Cloned code increases code size
ā€¢ A ļ¬x to a clone must be applied to all similar fragments
ā€¢ Bugs are duplicated together with their clones
ā€£ However: it is not always possible to remove clones
ā€¢ Removal of Clones is harder if variations exist.
13
Part I: Clone Detection 14
Duplix
Scorpio
PMD
CCFinder
Dup
CPD
Duplix
Shinobi
Clone Detective
Gemini
iClones
KClone
ConQAT
Deckard
Clone Digger
JCCD
CloneDr SimScan
CLICS
NiCAD
Simian
Duploc
Dude
SDD
Clone Detection Tools
Part I: Clone Detection 14
Duplix
Scorpio
PMD
CCFinder
Dup
CPD
Duplix
Shinobi
Clone Detective
Gemini
iClones
KClone
ConQAT
Deckard
Clone Digger
JCCD
CloneDr SimScan
CLICS
NiCAD
Simian
Duploc
Dude
SDD
ā€£ Text Based Tools:
ā€¢ Lines are compared to other
lines
Clone Detection Tools
Part I: Clone Detection 14
Duplix
Scorpio
PMD
CCFinder
Dup
CPD
Duplix
Shinobi
Clone Detective
Gemini
iClones
KClone
ConQAT
Deckard
Clone Digger
JCCD
CloneDr SimScan
CLICS
NiCAD
Simian
Duploc
Dude
SDD
ā€£ Token Based Tools:
ā€¢ Token sequences are compared to
sequences
Clone Detection Tools
Part I: Clone Detection 14
Duplix
Scorpio
PMD
CCFinder
Dup
CPD
Duplix
Shinobi
Clone Detective
Gemini
iClones
KClone
ConQAT
Deckard
Clone Digger
JCCD
CloneDr SimScan
CLICS
NiCAD
Simian
Duploc
Dude
SDD
ā€£ Syntax Based Tools:
ā€¢ Syntax subtrees are
compared to each other
Clone Detection Tools
Part I: Clone Detection 14
Duplix
Scorpio
PMD
CCFinder
Dup
CPD
Duplix
Shinobi
Clone Detective
Gemini
iClones
KClone
ConQAT
Deckard
Clone Digger
JCCD
CloneDr SimScan
CLICS
NiCAD
Simian
Duploc
Dude
SDD
ā€£ Graph Based Tools:
ā€¢ (sub) graphs are compared to each
other
Clone Detection Tools
Part I: Clone Detection
Clone Detection Techniques
15
ā€£ String/Token based Techiniques:
ā€¢ Pros: Run very fast
ā€¢ Cons: Too many false clones
ā€£ Syntax based (AST) Techniques:
ā€¢ Pros: Well suited to detect structural similarities
ā€¢ Cons: Not Properly suited to detect Type 3 Clones
ā€£ Graph based Techniques:
ā€¢ Pros: The only one able to deal with Type 4 Clones
ā€¢ Cons: Performance Issues
Part I: Clone Detection
The idea: Use Machine Learning, Luke
ā€£ Use Machine Learning Techniques to compute similarity of fragments by
exploiting speciļ¬c features of the code.
ā€£ Combine different sources of Information
ā€¢ Structural Information: ASTs, PDGs
ā€¢ Lexical Information: Program Text
16
Part I: Clone Detection
Kernel Methods for Structured Data
ā€£ Well-grounded on solid and awful
Math
ā€£ Based on the idea that objects
can be described in terms of
their constituent Parts
ā€£ Can be easily tailored to specific
domains
ā€¢ Tree Kernels
ā€¢ Graph Kernels
ā€¢ ....
17
Part I: Clone Detection
Deļ¬ning a Kernel for Structured Data
18
Part I: Clone Detection
Deļ¬ning a Kernel for Structured Data
The deļ¬nition of a new Kernel for a Structured Object requires
the deļ¬nition of:
18
Part I: Clone Detection
Deļ¬ning a Kernel for Structured Data
The deļ¬nition of a new Kernel for a Structured Object requires
the deļ¬nition of:
ā€£ Set of features to annotate each part of the object
18
Part I: Clone Detection
Deļ¬ning a Kernel for Structured Data
The deļ¬nition of a new Kernel for a Structured Object requires
the deļ¬nition of:
ā€£ Set of features to annotate each part of the object
ā€£ A Kernel function to measure the similarity on the smallest part
of the object
18
Part I: Clone Detection
Deļ¬ning a Kernel for Structured Data
The deļ¬nition of a new Kernel for a Structured Object requires
the deļ¬nition of:
ā€£ Set of features to annotate each part of the object
ā€£ A Kernel function to measure the similarity on the smallest part
of the object
ā€¢ e.g., Nodes for AST and Graphs
18
Part I: Clone Detection
Deļ¬ning a Kernel for Structured Data
The deļ¬nition of a new Kernel for a Structured Object requires
the deļ¬nition of:
ā€£ Set of features to annotate each part of the object
ā€£ A Kernel function to measure the similarity on the smallest part
of the object
ā€¢ e.g., Nodes for AST and Graphs
ā€£ A Kernel function to apply the computation on the different
(sub)parts of the structured object
18
Part I: Clone Detection
Kernel Methods for Clones:
Tree Kernels Example on AST
ā€£ Features: We annotate each node by a set of 4
features
ā€¢ Instruction Class
- i.e., LOOP, CONDITIONAL_STATEMENT, CALL
ā€¢ Instruction
- i.e., FOR, IF, WHILE, RETURN
ā€¢ Context
- i.e. Instruction Class of the closer statement node
ā€¢ Lexemes
- Lexical information gathered (recursively) from leaves
- i.e., Lexical Information
19
FOR
Part I: Clone Detection
Kernel Methods for Clones:
Tree Kernels Example on AST
ā€£ Kernel Function:
ā€¢ Aims at identify the maximum
isomorphic Tree/Subtree
20
K(T1, T2) =
X
n2T1
X
n02T2
(n, n0
) Ā· Ksubt(n, n0
)
block
print
p0.0s
=
1.0
=
p
s
f
block
print
y1.0x
=
x
=
y
x
f
Ksubt(n, n0
) = sim(n, n0
) + (1 )
X
(n1,n2)2Ch(n,n0)
k(n1, n2)
DATE: May 13, 2013Part II: Clones and Python
Clone Detection
in Python
DATE: May 13, 2013Part II: Clones and Python
Clone Detection
in Python
Part II: In Python
The Overall Process Sketch
22
1. Pre Processing
Part II: In Python
The Overall Process Sketch
22
block
print
p0.0s
=
1.0
=
p
s
f
block
print
y1.0x
=
x
=
y
x
f
1. Pre Processing 2. Extraction
Part II: In Python
The Overall Process Sketch
22
block
print
p0.0s
=
1.0
=
p
s
f
block
print
y1.0x
=
x
=
y
x
f
block
print
p0.0s
=
1.0
=
p
s
f
block
print
y1.0x
=
x
=
y
x
f
1. Pre Processing 2. Extraction
3. Detection
Part II: In Python
The Overall Process Sketch
22
block
print
p0.0s
=
1.0
=
p
s
f
block
print
y1.0x
=
x
=
y
x
f
block
print
p0.0s
=
1.0
=
p
s
f
block
print
y1.0x
=
x
=
y
x
f
1. Pre Processing 2. Extraction
3. Detection 4. Aggregation
Part II: In Python
Detection Process
23
Part II: In Python
Empirical Evaluation
ā€£ Comparison with another (pure) AST-based: Clone Digger
ā€¢ It has been the ļ¬rst Clone detector for and in Python :-)
ā€¢ Presented at EuroPython 2006
ā€£ Comparison on a system with randomly seeded clones
24
ā€£ Results refer only to
Type 3 Clones
ā€£ On Type 1 and Type 2
we got the same
results
Part II: In Python
Precision/Recall Plot
25
0
0.25
0.50
0.75
1.00
0.6 0.62 0.64 0.66 0.68 0.7 0.72 0.74 0.76 0.78 0.8 0.82 0.84 0.86 0.88 0.9 0.92 0.94 0.96 0.98
Precision, Recall and F-Measure
Precision Recall F1
Precision: How accurate are the obtained results?
(Altern.) How many errors do they contain?
Recall: How complete are the obtained results?
(Altern.) How many clones have been retrieved w.r.t. Total Clones?
Part II: In Python
Is Python less clone prone?
26
Roy et. al., IWSC, 2010
Part II: In Python
Clones in CPython 2.5.1
27
DATE: May 13, 2013Florence, Italy
Thank you!

Mais conteĆŗdo relacionado

Mais procurados

Overfitting & Underfitting
Overfitting & UnderfittingOverfitting & Underfitting
Overfitting & UnderfittingSOUMIT KAR
Ā 
Deadlock in Distributed Systems
Deadlock in Distributed SystemsDeadlock in Distributed Systems
Deadlock in Distributed SystemsPritom Saha Akash
Ā 
Asymptotic analysis
Asymptotic analysisAsymptotic analysis
Asymptotic analysisSoujanya V
Ā 
Church Turing Thesis
Church Turing ThesisChurch Turing Thesis
Church Turing ThesisHemant Sharma
Ā 
recursion tree method.pdf
recursion tree method.pdfrecursion tree method.pdf
recursion tree method.pdfMalikShazen
Ā 
String matching, naive,
String matching, naive,String matching, naive,
String matching, naive,Amit Kumar Rathi
Ā 
Time advance mehcanism
Time advance mehcanismTime advance mehcanism
Time advance mehcanismNikhil Sharma
Ā 
String matching algorithms
String matching algorithmsString matching algorithms
String matching algorithmsAshikapokiya12345
Ā 
UNIX/Linux training
UNIX/Linux trainingUNIX/Linux training
UNIX/Linux trainingMichael Olafusi
Ā 
Zararlı Yazılım Analizi ve Tespitinde YARA Kullanımı
Zararlı Yazılım Analizi ve Tespitinde YARA KullanımıZararlı Yazılım Analizi ve Tespitinde YARA Kullanımı
Zararlı Yazılım Analizi ve Tespitinde YARA KullanımıBGA Cyber Security
Ā 
Algorithm Using Divide And Conquer
Algorithm Using Divide And ConquerAlgorithm Using Divide And Conquer
Algorithm Using Divide And ConquerUrviBhalani2
Ā 
Algorithm chapter 10
Algorithm chapter 10Algorithm chapter 10
Algorithm chapter 10chidabdu
Ā 
Unix file apiā€™s
Unix file apiā€™sUnix file apiā€™s
Unix file apiā€™sSunil Rm
Ā 
Fault tolerance in distributed systems
Fault tolerance in distributed systemsFault tolerance in distributed systems
Fault tolerance in distributed systemssumitjain2013
Ā 
Unix Programming Lab
Unix Programming LabUnix Programming Lab
Unix Programming LabNeil Mathew
Ā 

Mais procurados (20)

Overfitting & Underfitting
Overfitting & UnderfittingOverfitting & Underfitting
Overfitting & Underfitting
Ā 
Deadlock in Distributed Systems
Deadlock in Distributed SystemsDeadlock in Distributed Systems
Deadlock in Distributed Systems
Ā 
Asymptotic analysis
Asymptotic analysisAsymptotic analysis
Asymptotic analysis
Ā 
Church Turing Thesis
Church Turing ThesisChurch Turing Thesis
Church Turing Thesis
Ā 
recursion tree method.pdf
recursion tree method.pdfrecursion tree method.pdf
recursion tree method.pdf
Ā 
String matching, naive,
String matching, naive,String matching, naive,
String matching, naive,
Ā 
Parallel Algorithms
Parallel AlgorithmsParallel Algorithms
Parallel Algorithms
Ā 
Time advance mehcanism
Time advance mehcanismTime advance mehcanism
Time advance mehcanism
Ā 
String matching algorithms
String matching algorithmsString matching algorithms
String matching algorithms
Ā 
UNIX/Linux training
UNIX/Linux trainingUNIX/Linux training
UNIX/Linux training
Ā 
Zararlı Yazılım Analizi ve Tespitinde YARA Kullanımı
Zararlı Yazılım Analizi ve Tespitinde YARA KullanımıZararlı Yazılım Analizi ve Tespitinde YARA Kullanımı
Zararlı Yazılım Analizi ve Tespitinde YARA Kullanımı
Ā 
Chapter 6 synchronization
Chapter 6 synchronizationChapter 6 synchronization
Chapter 6 synchronization
Ā 
DBSCAN
DBSCANDBSCAN
DBSCAN
Ā 
Back patching
Back patchingBack patching
Back patching
Ā 
Algorithm Using Divide And Conquer
Algorithm Using Divide And ConquerAlgorithm Using Divide And Conquer
Algorithm Using Divide And Conquer
Ā 
Algorithm chapter 10
Algorithm chapter 10Algorithm chapter 10
Algorithm chapter 10
Ā 
Tuning learning rate
Tuning learning rateTuning learning rate
Tuning learning rate
Ā 
Unix file apiā€™s
Unix file apiā€™sUnix file apiā€™s
Unix file apiā€™s
Ā 
Fault tolerance in distributed systems
Fault tolerance in distributed systemsFault tolerance in distributed systems
Fault tolerance in distributed systems
Ā 
Unix Programming Lab
Unix Programming LabUnix Programming Lab
Unix Programming Lab
Ā 

Destaque

Machine Learning for Software Maintainability
Machine Learning for Software MaintainabilityMachine Learning for Software Maintainability
Machine Learning for Software MaintainabilityValerio Maggio
Ā 
An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and ...
An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and ...An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and ...
An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and ...Kamiya Toshihiro
Ā 
Droidcon2013 pro guard, optimizer and obfuscator in the android sdk_eric lafo...
Droidcon2013 pro guard, optimizer and obfuscator in the android sdk_eric lafo...Droidcon2013 pro guard, optimizer and obfuscator in the android sdk_eric lafo...
Droidcon2013 pro guard, optimizer and obfuscator in the android sdk_eric lafo...Droidcon Berlin
Ā 
Unsupervised Machine Learning for clone detection
Unsupervised Machine Learning for clone detectionUnsupervised Machine Learning for clone detection
Unsupervised Machine Learning for clone detectionValerio Maggio
Ā 
Android Malware Detection Mechanisms
Android Malware Detection MechanismsAndroid Malware Detection Mechanisms
Android Malware Detection MechanismsTalha Kabakus
Ā 
Climbing Mt. Metagenome
Climbing Mt. MetagenomeClimbing Mt. Metagenome
Climbing Mt. Metagenomec.titus.brown
Ā 
PyCon 2011 talk - ngram assembly with Bloom filters
PyCon 2011 talk - ngram assembly with Bloom filtersPyCon 2011 talk - ngram assembly with Bloom filters
PyCon 2011 talk - ngram assembly with Bloom filtersc.titus.brown
Ā 

Destaque (9)

Machine Learning for Software Maintainability
Machine Learning for Software MaintainabilityMachine Learning for Software Maintainability
Machine Learning for Software Maintainability
Ā 
An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and ...
An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and ...An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and ...
An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and ...
Ā 
Clonedigger-Python
Clonedigger-PythonClonedigger-Python
Clonedigger-Python
Ā 
Empirical Results on Cloning and Clone Detection
Empirical Results on Cloning and Clone DetectionEmpirical Results on Cloning and Clone Detection
Empirical Results on Cloning and Clone Detection
Ā 
Droidcon2013 pro guard, optimizer and obfuscator in the android sdk_eric lafo...
Droidcon2013 pro guard, optimizer and obfuscator in the android sdk_eric lafo...Droidcon2013 pro guard, optimizer and obfuscator in the android sdk_eric lafo...
Droidcon2013 pro guard, optimizer and obfuscator in the android sdk_eric lafo...
Ā 
Unsupervised Machine Learning for clone detection
Unsupervised Machine Learning for clone detectionUnsupervised Machine Learning for clone detection
Unsupervised Machine Learning for clone detection
Ā 
Android Malware Detection Mechanisms
Android Malware Detection MechanismsAndroid Malware Detection Mechanisms
Android Malware Detection Mechanisms
Ā 
Climbing Mt. Metagenome
Climbing Mt. MetagenomeClimbing Mt. Metagenome
Climbing Mt. Metagenome
Ā 
PyCon 2011 talk - ngram assembly with Bloom filters
PyCon 2011 talk - ngram assembly with Bloom filtersPyCon 2011 talk - ngram assembly with Bloom filters
PyCon 2011 talk - ngram assembly with Bloom filters
Ā 

Semelhante a Clone detection in Python

Semelhante a Clone detection in Python (20)

presentation_intro_to_python
presentation_intro_to_pythonpresentation_intro_to_python
presentation_intro_to_python
Ā 
presentation_intro_to_python_1462930390_181219.ppt
presentation_intro_to_python_1462930390_181219.pptpresentation_intro_to_python_1462930390_181219.ppt
presentation_intro_to_python_1462930390_181219.ppt
Ā 
python1.ppt
python1.pptpython1.ppt
python1.ppt
Ā 
Phyton Learning extracts
Phyton Learning extracts Phyton Learning extracts
Phyton Learning extracts
Ā 
Ch3 gnu make
Ch3 gnu makeCh3 gnu make
Ch3 gnu make
Ā 
Parsing
ParsingParsing
Parsing
Ā 
ENGLISH PYTHON.ppt
ENGLISH PYTHON.pptENGLISH PYTHON.ppt
ENGLISH PYTHON.ppt
Ā 
Pcd question bank
Pcd question bank Pcd question bank
Pcd question bank
Ā 
manish python.pptx
manish python.pptxmanish python.pptx
manish python.pptx
Ā 
Kavitha_python.ppt
Kavitha_python.pptKavitha_python.ppt
Kavitha_python.ppt
Ā 
python1.ppt
python1.pptpython1.ppt
python1.ppt
Ā 
python1.ppt
python1.pptpython1.ppt
python1.ppt
Ā 
Python Basics
Python BasicsPython Basics
Python Basics
Ā 
python1.ppt
python1.pptpython1.ppt
python1.ppt
Ā 
Lenguaje Python
Lenguaje PythonLenguaje Python
Lenguaje Python
Ā 
pysdasdasdsadsadsadsadsadsadasdasdthon1.ppt
pysdasdasdsadsadsadsadsadsadasdasdthon1.pptpysdasdasdsadsadsadsadsadsadasdasdthon1.ppt
pysdasdasdsadsadsadsadsadsadasdasdthon1.ppt
Ā 
coolstuff.ppt
coolstuff.pptcoolstuff.ppt
coolstuff.ppt
Ā 
python1.ppt
python1.pptpython1.ppt
python1.ppt
Ā 
Introductio_to_python_progamming_ppt.ppt
Introductio_to_python_progamming_ppt.pptIntroductio_to_python_progamming_ppt.ppt
Introductio_to_python_progamming_ppt.ppt
Ā 
python1.ppt
python1.pptpython1.ppt
python1.ppt
Ā 

Mais de Valerio Maggio

Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniquesImproving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniquesValerio Maggio
Ā 
Number Crunching in Python
Number Crunching in PythonNumber Crunching in Python
Number Crunching in PythonValerio Maggio
Ā 
LINSEN an efficient approach to split identifiers and expand abbreviations
LINSEN an efficient approach to split identifiers and expand abbreviationsLINSEN an efficient approach to split identifiers and expand abbreviations
LINSEN an efficient approach to split identifiers and expand abbreviationsValerio Maggio
Ā 
A Tree Kernel based approach for clone detection
A Tree Kernel based approach for clone detectionA Tree Kernel based approach for clone detection
A Tree Kernel based approach for clone detectionValerio Maggio
Ā 
Refactoring: Improve the design of existing code
Refactoring: Improve the design of existing codeRefactoring: Improve the design of existing code
Refactoring: Improve the design of existing codeValerio Maggio
Ā 
Scaffolding with JMock
Scaffolding with JMockScaffolding with JMock
Scaffolding with JMockValerio Maggio
Ā 
Unit testing with Junit
Unit testing with JunitUnit testing with Junit
Unit testing with JunitValerio Maggio
Ā 
Design patterns and Refactoring
Design patterns and RefactoringDesign patterns and Refactoring
Design patterns and RefactoringValerio Maggio
Ā 
Test Driven Development
Test Driven DevelopmentTest Driven Development
Test Driven DevelopmentValerio Maggio
Ā 
Unit testing and scaffolding
Unit testing and scaffoldingUnit testing and scaffolding
Unit testing and scaffoldingValerio Maggio
Ā 

Mais de Valerio Maggio (12)

Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniquesImproving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Ā 
Number Crunching in Python
Number Crunching in PythonNumber Crunching in Python
Number Crunching in Python
Ā 
LINSEN an efficient approach to split identifiers and expand abbreviations
LINSEN an efficient approach to split identifiers and expand abbreviationsLINSEN an efficient approach to split identifiers and expand abbreviations
LINSEN an efficient approach to split identifiers and expand abbreviations
Ā 
A Tree Kernel based approach for clone detection
A Tree Kernel based approach for clone detectionA Tree Kernel based approach for clone detection
A Tree Kernel based approach for clone detection
Ā 
Refactoring: Improve the design of existing code
Refactoring: Improve the design of existing codeRefactoring: Improve the design of existing code
Refactoring: Improve the design of existing code
Ā 
Scaffolding with JMock
Scaffolding with JMockScaffolding with JMock
Scaffolding with JMock
Ā 
Junit in action
Junit in actionJunit in action
Junit in action
Ā 
Unit testing with Junit
Unit testing with JunitUnit testing with Junit
Unit testing with Junit
Ā 
Design patterns and Refactoring
Design patterns and RefactoringDesign patterns and Refactoring
Design patterns and Refactoring
Ā 
Test Driven Development
Test Driven DevelopmentTest Driven Development
Test Driven Development
Ā 
Unit testing and scaffolding
Unit testing and scaffoldingUnit testing and scaffolding
Unit testing and scaffolding
Ā 
Web frameworks
Web frameworksWeb frameworks
Web frameworks
Ā 

ƚltimo

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
Ā 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
Ā 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
Ā 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
Ā 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
Ā 
šŸ¬ The future of MySQL is Postgres šŸ˜
šŸ¬  The future of MySQL is Postgres   šŸ˜šŸ¬  The future of MySQL is Postgres   šŸ˜
šŸ¬ The future of MySQL is Postgres šŸ˜RTylerCroy
Ā 
Finology Group ā€“ Insurtech Innovation Award 2024
Finology Group ā€“ Insurtech Innovation Award 2024Finology Group ā€“ Insurtech Innovation Award 2024
Finology Group ā€“ Insurtech Innovation Award 2024The Digital Insurer
Ā 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
Ā 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
Ā 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
Ā 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
Ā 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
Ā 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
Ā 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
Ā 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
Ā 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
Ā 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
Ā 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
Ā 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
Ā 
Scaling API-first ā€“ The story of a global engineering organization
Scaling API-first ā€“ The story of a global engineering organizationScaling API-first ā€“ The story of a global engineering organization
Scaling API-first ā€“ The story of a global engineering organizationRadu Cotescu
Ā 

ƚltimo (20)

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Ā 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
Ā 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
Ā 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Ā 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
Ā 
šŸ¬ The future of MySQL is Postgres šŸ˜
šŸ¬  The future of MySQL is Postgres   šŸ˜šŸ¬  The future of MySQL is Postgres   šŸ˜
šŸ¬ The future of MySQL is Postgres šŸ˜
Ā 
Finology Group ā€“ Insurtech Innovation Award 2024
Finology Group ā€“ Insurtech Innovation Award 2024Finology Group ā€“ Insurtech Innovation Award 2024
Finology Group ā€“ Insurtech Innovation Award 2024
Ā 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Ā 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
Ā 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
Ā 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
Ā 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
Ā 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
Ā 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
Ā 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
Ā 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
Ā 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
Ā 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
Ā 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
Ā 
Scaling API-first ā€“ The story of a global engineering organization
Scaling API-first ā€“ The story of a global engineering organizationScaling API-first ā€“ The story of a global engineering organization
Scaling API-first ā€“ The story of a global engineering organization
Ā 

Clone detection in Python

  • 1. DATE: May 13, 2013Florence, Italy Clone Detection in Python Valerio Maggio (valerio.maggio@unina.it)
  • 2. Introduction Duplicated Code Number one in the stink parade is duplicated code. If you see the same code structure in more than one place, you can be sure that your program will be better if you ļ¬nd a way to unify them. 2
  • 11. Introduction Duplicated Code ā€£ Exists: 5% to 30% of code is similar ā€¢ In extreme cases, even up to 50% - This is the case of Payroll, a COBOL system 5
  • 12. Introduction Duplicated Code ā€£ Exists: 5% to 30% of code is similar ā€¢ In extreme cases, even up to 50% - This is the case of Payroll, a COBOL system ā€£ Is often created during development 5
  • 13. Introduction Duplicated Code ā€£ Exists: 5% to 30% of code is similar ā€¢ In extreme cases, even up to 50% - This is the case of Payroll, a COBOL system ā€£ Is often created during development ā€¢ due to time pressure for an upcoming deadline 5
  • 14. Introduction Duplicated Code ā€£ Exists: 5% to 30% of code is similar ā€¢ In extreme cases, even up to 50% - This is the case of Payroll, a COBOL system ā€£ Is often created during development ā€¢ due to time pressure for an upcoming deadline ā€¢ to overcome limitations of the programming language 5
  • 15. Introduction Duplicated Code ā€£ Exists: 5% to 30% of code is similar ā€¢ In extreme cases, even up to 50% - This is the case of Payroll, a COBOL system ā€£ Is often created during development ā€¢ due to time pressure for an upcoming deadline ā€¢ to overcome limitations of the programming language ā€£ Three Public Enemies: 5
  • 16. Introduction Duplicated Code ā€£ Exists: 5% to 30% of code is similar ā€¢ In extreme cases, even up to 50% - This is the case of Payroll, a COBOL system ā€£ Is often created during development ā€¢ due to time pressure for an upcoming deadline ā€¢ to overcome limitations of the programming language ā€£ Three Public Enemies: ā€¢ Copy, Paste and Modify 5
  • 17. DATE: May 13, 2013Part I: Clone Detection Clone Detection in Python
  • 18. DATE: May 13, 2013Part I: Clone Detection Clone Detection in Python
  • 19. Part I: Clone Detection Code Clones ā€£ There can be different deļ¬nitions of similarity, based on: ā€¢ Program Text (text, syntax) ā€¢ Semantics 7 (Def.) ā€œSoftware Clones are segments of code that are similar according to some deļ¬nition of similarityā€ (I.D. Baxter, 1998)
  • 20. Part I: Clone Detection Code Clones ā€£ There can be different deļ¬nitions of similarity, based on: ā€¢ Program Text (text, syntax) ā€¢ Semantics ā€£ Four Diļ¬€erent Types of Clones 7 (Def.) ā€œSoftware Clones are segments of code that are similar according to some deļ¬nition of similarityā€ (I.D. Baxter, 1998)
  • 21. Part I: Clone Detection The original one 8 # Original Fragment def do_something_cool_in_Python(filepath, marker='---end---'): lines = list() with open(filepath) as report: for l in report: if l.endswith(marker): lines.append(l) # Stores only lines that ends with "marker" return lines #Return the list of different lines
  • 22. Part I: Clone Detection Type 1: Exact Copy ā€£ Identical code segments except for differences in layout, whitespace, and comments 9
  • 23. Part I: Clone Detection Type 1: Exact Copy ā€£ Identical code segments except for differences in layout, whitespace, and comments 9 # Original Fragment def do_something_cool_in_Python(filepath, marker='---end---'): lines = list() with open(filepath) as report: for l in report: if l.endswith(marker): lines.append(l) # Stores only lines that ends with "marker" return lines #Return the list of different lines def do_something_cool_in_Python (filepath, marker='---end---'): lines = list() # This list is initially empty with open(filepath) as report: for l in report: # It goes through the lines of the file if l.endswith(marker): lines.append(l) return lines
  • 24. Part I: Clone Detection Type 2: Parameter Substituted Clones ā€£ Structurally identical segments except for differences in identiļ¬ers, literals, layout, whitespace, and comments 10
  • 25. Part I: Clone Detection Type 2: Parameter Substituted Clones ā€£ Structurally identical segments except for differences in identiļ¬ers, literals, layout, whitespace, and comments 10 # Type 2 Clone def do_something_cool_in_Python(path, end='---end---'): targets = list() with open(path) as data_file: for t in data_file: if l.endswith(end): targets.append(t) # Stores only lines that ends with "marker" #Return the list of different lines return targets # Original Fragment def do_something_cool_in_Python(filepath, marker='---end---'): lines = list() with open(filepath) as report: for l in report: if l.endswith(marker): lines.append(l) # Stores only lines that ends with "marker" return lines #Return the list of different lines
  • 26. Part I: Clone Detection Type 3: Structure Substituted Clones ā€£ Similar segments with further modiļ¬cations such as changed, added (or deleted) statements, in additions to variations in identiļ¬ers, literals, layout and comments 11
  • 27. Part I: Clone Detection Type 3: Structure Substituted Clones ā€£ Similar segments with further modiļ¬cations such as changed, added (or deleted) statements, in additions to variations in identiļ¬ers, literals, layout and comments 11 import os def do_something_with(path, marker='---end---'): # Check if the input path corresponds to a file if not os.path.isfile(path): return None bad_ones = list() good_ones = list() with open(path) as report: for line in report: line = line.strip() if line.endswith(marker): good_ones.append(line) else: bad_ones.append(line) #Return the lists of different lines return good_ones, bad_ones
  • 28. Part I: Clone Detection Type 3: Structure Substituted Clones ā€£ Similar segments with further modiļ¬cations such as changed, added (or deleted) statements, in additions to variations in identiļ¬ers, literals, layout and comments 11 import os def do_something_with(path, marker='---end---'): # Check if the input path corresponds to a file if not os.path.isfile(path): return None bad_ones = list() good_ones = list() with open(path) as report: for line in report: line = line.strip() if line.endswith(marker): good_ones.append(line) else: bad_ones.append(line) #Return the lists of different lines return good_ones, bad_ones
  • 29. Part I: Clone Detection Type 3: Structure Substituted Clones ā€£ Similar segments with further modiļ¬cations such as changed, added (or deleted) statements, in additions to variations in identiļ¬ers, literals, layout and comments 11 import os def do_something_with(path, marker='---end---'): # Check if the input path corresponds to a file if not os.path.isfile(path): return None bad_ones = list() good_ones = list() with open(path) as report: for line in report: line = line.strip() if line.endswith(marker): good_ones.append(line) else: bad_ones.append(line) #Return the lists of different lines return good_ones, bad_ones
  • 30. Part I: Clone Detection Type 3: Structure Substituted Clones ā€£ Similar segments with further modiļ¬cations such as changed, added (or deleted) statements, in additions to variations in identiļ¬ers, literals, layout and comments 11 import os def do_something_with(path, marker='---end---'): # Check if the input path corresponds to a file if not os.path.isfile(path): return None bad_ones = list() good_ones = list() with open(path) as report: for line in report: line = line.strip() if line.endswith(marker): good_ones.append(line) else: bad_ones.append(line) #Return the lists of different lines return good_ones, bad_ones
  • 31. Part I: Clone Detection Type 4: ā€œSemanticā€ Clones ā€£ Semantically equivalent segments that perform the same computation but are implemented by different syntactic variants 12
  • 32. Part I: Clone Detection Type 4: ā€œSemanticā€ Clones ā€£ Semantically equivalent segments that perform the same computation but are implemented by different syntactic variants 12 # Original Fragment def do_something_cool_in_Python(filepath, marker='---end---'): lines = list() with open(filepath) as report: for l in report: if l.endswith(marker): lines.append(l) # Stores only lines that ends with "marker" return lines #Return the list of different lines def do_always_the_same_stuff(filepath, marker='---end---'): report = open(filepath) file_lines = report.readlines() report.close() #Filters only the lines ending with marker return filter(lambda l: len(l) and l.endswith(marker), file_lines)
  • 33. Part I: Clone Detection What are the consequences? ā€£ Do clones increase the maintenance effort? ā€£ Hypothesis: ā€¢ Cloned code increases code size ā€¢ A ļ¬x to a clone must be applied to all similar fragments ā€¢ Bugs are duplicated together with their clones ā€£ However: it is not always possible to remove clones ā€¢ Removal of Clones is harder if variations exist. 13
  • 34. Part I: Clone Detection 14 Duplix Scorpio PMD CCFinder Dup CPD Duplix Shinobi Clone Detective Gemini iClones KClone ConQAT Deckard Clone Digger JCCD CloneDr SimScan CLICS NiCAD Simian Duploc Dude SDD Clone Detection Tools
  • 35. Part I: Clone Detection 14 Duplix Scorpio PMD CCFinder Dup CPD Duplix Shinobi Clone Detective Gemini iClones KClone ConQAT Deckard Clone Digger JCCD CloneDr SimScan CLICS NiCAD Simian Duploc Dude SDD ā€£ Text Based Tools: ā€¢ Lines are compared to other lines Clone Detection Tools
  • 36. Part I: Clone Detection 14 Duplix Scorpio PMD CCFinder Dup CPD Duplix Shinobi Clone Detective Gemini iClones KClone ConQAT Deckard Clone Digger JCCD CloneDr SimScan CLICS NiCAD Simian Duploc Dude SDD ā€£ Token Based Tools: ā€¢ Token sequences are compared to sequences Clone Detection Tools
  • 37. Part I: Clone Detection 14 Duplix Scorpio PMD CCFinder Dup CPD Duplix Shinobi Clone Detective Gemini iClones KClone ConQAT Deckard Clone Digger JCCD CloneDr SimScan CLICS NiCAD Simian Duploc Dude SDD ā€£ Syntax Based Tools: ā€¢ Syntax subtrees are compared to each other Clone Detection Tools
  • 38. Part I: Clone Detection 14 Duplix Scorpio PMD CCFinder Dup CPD Duplix Shinobi Clone Detective Gemini iClones KClone ConQAT Deckard Clone Digger JCCD CloneDr SimScan CLICS NiCAD Simian Duploc Dude SDD ā€£ Graph Based Tools: ā€¢ (sub) graphs are compared to each other Clone Detection Tools
  • 39. Part I: Clone Detection Clone Detection Techniques 15 ā€£ String/Token based Techiniques: ā€¢ Pros: Run very fast ā€¢ Cons: Too many false clones ā€£ Syntax based (AST) Techniques: ā€¢ Pros: Well suited to detect structural similarities ā€¢ Cons: Not Properly suited to detect Type 3 Clones ā€£ Graph based Techniques: ā€¢ Pros: The only one able to deal with Type 4 Clones ā€¢ Cons: Performance Issues
  • 40. Part I: Clone Detection The idea: Use Machine Learning, Luke ā€£ Use Machine Learning Techniques to compute similarity of fragments by exploiting speciļ¬c features of the code. ā€£ Combine different sources of Information ā€¢ Structural Information: ASTs, PDGs ā€¢ Lexical Information: Program Text 16
  • 41. Part I: Clone Detection Kernel Methods for Structured Data ā€£ Well-grounded on solid and awful Math ā€£ Based on the idea that objects can be described in terms of their constituent Parts ā€£ Can be easily tailored to specific domains ā€¢ Tree Kernels ā€¢ Graph Kernels ā€¢ .... 17
  • 42. Part I: Clone Detection Deļ¬ning a Kernel for Structured Data 18
  • 43. Part I: Clone Detection Deļ¬ning a Kernel for Structured Data The deļ¬nition of a new Kernel for a Structured Object requires the deļ¬nition of: 18
  • 44. Part I: Clone Detection Deļ¬ning a Kernel for Structured Data The deļ¬nition of a new Kernel for a Structured Object requires the deļ¬nition of: ā€£ Set of features to annotate each part of the object 18
  • 45. Part I: Clone Detection Deļ¬ning a Kernel for Structured Data The deļ¬nition of a new Kernel for a Structured Object requires the deļ¬nition of: ā€£ Set of features to annotate each part of the object ā€£ A Kernel function to measure the similarity on the smallest part of the object 18
  • 46. Part I: Clone Detection Deļ¬ning a Kernel for Structured Data The deļ¬nition of a new Kernel for a Structured Object requires the deļ¬nition of: ā€£ Set of features to annotate each part of the object ā€£ A Kernel function to measure the similarity on the smallest part of the object ā€¢ e.g., Nodes for AST and Graphs 18
  • 47. Part I: Clone Detection Deļ¬ning a Kernel for Structured Data The deļ¬nition of a new Kernel for a Structured Object requires the deļ¬nition of: ā€£ Set of features to annotate each part of the object ā€£ A Kernel function to measure the similarity on the smallest part of the object ā€¢ e.g., Nodes for AST and Graphs ā€£ A Kernel function to apply the computation on the different (sub)parts of the structured object 18
  • 48. Part I: Clone Detection Kernel Methods for Clones: Tree Kernels Example on AST ā€£ Features: We annotate each node by a set of 4 features ā€¢ Instruction Class - i.e., LOOP, CONDITIONAL_STATEMENT, CALL ā€¢ Instruction - i.e., FOR, IF, WHILE, RETURN ā€¢ Context - i.e. Instruction Class of the closer statement node ā€¢ Lexemes - Lexical information gathered (recursively) from leaves - i.e., Lexical Information 19 FOR
  • 49. Part I: Clone Detection Kernel Methods for Clones: Tree Kernels Example on AST ā€£ Kernel Function: ā€¢ Aims at identify the maximum isomorphic Tree/Subtree 20 K(T1, T2) = X n2T1 X n02T2 (n, n0 ) Ā· Ksubt(n, n0 ) block print p0.0s = 1.0 = p s f block print y1.0x = x = y x f Ksubt(n, n0 ) = sim(n, n0 ) + (1 ) X (n1,n2)2Ch(n,n0) k(n1, n2)
  • 50. DATE: May 13, 2013Part II: Clones and Python Clone Detection in Python
  • 51. DATE: May 13, 2013Part II: Clones and Python Clone Detection in Python
  • 52. Part II: In Python The Overall Process Sketch 22 1. Pre Processing
  • 53. Part II: In Python The Overall Process Sketch 22 block print p0.0s = 1.0 = p s f block print y1.0x = x = y x f 1. Pre Processing 2. Extraction
  • 54. Part II: In Python The Overall Process Sketch 22 block print p0.0s = 1.0 = p s f block print y1.0x = x = y x f block print p0.0s = 1.0 = p s f block print y1.0x = x = y x f 1. Pre Processing 2. Extraction 3. Detection
  • 55. Part II: In Python The Overall Process Sketch 22 block print p0.0s = 1.0 = p s f block print y1.0x = x = y x f block print p0.0s = 1.0 = p s f block print y1.0x = x = y x f 1. Pre Processing 2. Extraction 3. Detection 4. Aggregation
  • 56. Part II: In Python Detection Process 23
  • 57. Part II: In Python Empirical Evaluation ā€£ Comparison with another (pure) AST-based: Clone Digger ā€¢ It has been the ļ¬rst Clone detector for and in Python :-) ā€¢ Presented at EuroPython 2006 ā€£ Comparison on a system with randomly seeded clones 24 ā€£ Results refer only to Type 3 Clones ā€£ On Type 1 and Type 2 we got the same results
  • 58. Part II: In Python Precision/Recall Plot 25 0 0.25 0.50 0.75 1.00 0.6 0.62 0.64 0.66 0.68 0.7 0.72 0.74 0.76 0.78 0.8 0.82 0.84 0.86 0.88 0.9 0.92 0.94 0.96 0.98 Precision, Recall and F-Measure Precision Recall F1 Precision: How accurate are the obtained results? (Altern.) How many errors do they contain? Recall: How complete are the obtained results? (Altern.) How many clones have been retrieved w.r.t. Total Clones?
  • 59. Part II: In Python Is Python less clone prone? 26 Roy et. al., IWSC, 2010
  • 60. Part II: In Python Clones in CPython 2.5.1 27
  • 61. DATE: May 13, 2013Florence, Italy Thank you!