Ctcompare: Comparing Multiple Code Trees for Similarity

Warren Toomey
School of IT, Bond University

Using lexical analysis with techniques borrowed from DNA
sequencing, multiple code trees can be quickly compared to find
any code similarities.

Why Write a Code Comparison Tool?
● To detect student plagiarism
● To determine if your codebase is "infected" by proprietary
  code from elsewhere, or by open-source code covered by a
  license like the GPL
● To trace code genealogy between trees separated by time
  (e.g. versions), useful for the new field of computing history

Code Comparison Issues
● Can rearrangement of code be detected?
  - per line? per sub-line?
● Can “munging of code” be detected?
  - variable/function/struct renaming?
● What if one or both codebases are proprietary? How can
  third parties verify any comparison?
● Can a timely comparison be done?
● What is the rate of false positives?
  - of missed matches?

Lexical Comparison
● First version: break the source code from each tree into
  lexical tokens, then compare runs of tokens between trees.
● This removes all the code's semantics, but does deal with
  code rearrangement (though not code “munging”).

Performance of Lexical Approach
● Performance was O(M*N), where M and N are the numbers of
  tokens in each code tree
● Basically: slow, with cost growing quadratically as the
  trees grow
● I also wanted to compare multiple code trees, and also find
  duplicated code within a tree

When is Copying Significant?
● Common code fragments not significant:
     for (i=0; i< val; i++)            13 tokens
     if (x > max) max=x;               10 tokens
     err= stat(file, &sb); if (err)    14 tokens
● Axiom: runs of < 16 tokens insignificant
● Idea: use a run of 16 tokens as a lookup key to find other
  trees with the same run of tokens
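
The token counts above can be reproduced with a very small lexer. The following is a minimal sketch in C, not ctcompare's actual tokeniser: the function names `count_tokens` and `op_length` are hypothetical, and the sketch ignores comments, string literals, and most multi-character operators.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of the operator starting at s.
 * Only a handful of two-character C operators are recognised. */
static size_t op_length(const char *s)
{
    static const char *two_char[] = {
        "++", "--", "==", "!=", "<=", ">=", "&&", "||", "->",
        "+=", "-=", "*=", "/=", "<<", ">>"
    };
    size_t i;

    for (i = 0; i < sizeof two_char / sizeof two_char[0]; i++)
        if (strncmp(s, two_char[i], 2) == 0)
            return 2;
    return 1;
}

/* Count lexical tokens in a fragment of C-like source text.
 * Simplification: no comments, strings, or character literals. */
size_t count_tokens(const char *src)
{
    size_t count = 0;

    assert(src != NULL);
    while (*src != '\0') {
        unsigned char c = (unsigned char)*src;

        if (isspace(c)) {
            src++;                      /* whitespace only separates tokens */
        } else if (isalnum(c) || c == '_') {
            /* identifier, keyword, or number: consume the whole word */
            while (isalnum((unsigned char)*src) || *src == '_')
                src++;
            count++;
        } else {
            src += op_length(src);      /* operator or punctuation */
            count++;
        }
    }
    return count;
}
```

Under these assumptions, `count_tokens("for (i=0; i< val; i++)")` yields 13, matching the first fragment above.
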
Approach in Second Version
● Take all 16-token runs in a code tree
● Use each as a key, and insert the runs into a database
  along with the position of the run in the source tree
● Keys (i.e. runs of 16 tokens) with multiple values indicate
  possible code copying of at least 16 tokens
● For these keys, we can evaluate all code trees from this
  point on to find any real code similarity
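
One way to picture the keyed lookup is sketched below. It is an illustration under stated assumptions, not ctcompare's database: a sorted array stands in for the database, FNV-1a stands in for whatever key encoding the tool uses, and the names `run_rec`, `index_runs`, and `count_candidate_runs` are invented for this example. Tokens are assumed to be already reduced to 16-bit values.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define RUN_LEN 16   /* minimum significant run, from the axiom above */

/* One record: the key for a 16-token run plus where the run starts. */
struct run_rec {
    uint32_t key;    /* hash of the 16 tokens (FNV-1a is an assumption) */
    size_t   pos;    /* index of the run's first token in its tree */
};

/* FNV-1a hash over RUN_LEN 16-bit token values starting at t. */
static uint32_t run_key(const uint16_t *t)
{
    uint32_t h = 2166136261u;
    int i;

    for (i = 0; i < RUN_LEN; i++) {
        h = (h ^ (t[i] & 0xffu)) * 16777619u;
        h = (h ^ (uint32_t)(t[i] >> 8)) * 16777619u;
    }
    return h;
}

static int cmp_rec(const void *a, const void *b)
{
    const struct run_rec *x = a, *y = b;
    return (x->key > y->key) - (x->key < y->key);
}

/* Index every 16-token run in a tree, sorted by key.
 * out must have room for n records; returns the number written. */
size_t index_runs(const uint16_t *tok, size_t n, struct run_rec *out)
{
    size_t i, count = 0;

    assert(tok != NULL && out != NULL);
    for (i = 0; i + RUN_LEN <= n; i++) {
        out[count].key = run_key(tok + i);
        out[count].pos = i;
        count++;
    }
    qsort(out, count, sizeof out[0], cmp_rec);
    return count;
}

/* Count the 16-token runs in tree b that also occur somewhere in tree a.
 * idx/nidx is the index previously built over a with index_runs(). */
size_t count_candidate_runs(const struct run_rec *idx, size_t nidx,
                            const uint16_t *a,
                            const uint16_t *b, size_t nb)
{
    size_t j, hits = 0;

    assert(idx != NULL && a != NULL && b != NULL);
    for (j = 0; j + RUN_LEN <= nb; j++) {
        struct run_rec probe;
        const struct run_rec *r;

        probe.key = run_key(b + j);
        probe.pos = 0;
        r = bsearch(&probe, idx, nidx, sizeof idx[0], cmp_rec);
        if (r == NULL)
            continue;
        while (r > idx && r[-1].key == probe.key)
            r--;                        /* rewind to first record with key */
        for (; r < idx + nidx && r->key == probe.key; r++) {
            /* confirm the tokens really match, in case of a hash collision */
            if (memcmp(a + r->pos, b + j, RUN_LEN * sizeof a[0]) == 0) {
                hits++;
                break;
            }
        }
    }
    return hits;
}
```

Each hit is only a candidate: it says that at least 16 tokens agree, and the full extent of the match still has to be measured from that point on.
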
Algorithm
for each key in the database with multiple record nodes {
  obtain the set of record nodes from the database;

  for each combination of node pairs Na and Nb {
    if (performing a cross-tree comparison)
      skip this combination if Na and Nb in the same tree;
    if (performing a code clone comparison)
      skip this combination if Na and Nb in the same file;

    perform a full token comparison beginning at Na and Nb to
    determine the full extent of the code similarity, i.e. determine
    the actual number of tokens in common;

    add the matching token sequence to a list of "potential runs";
  }
}
walk the list of potential runs to merge overlapping runs, and
remove runs which are subsets of larger runs;

output the details of the actual matching token runs found.
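
The two inner steps of the algorithm, deciding whether a node pair should be compared and then measuring the full extent of the match, might look like this in C. The `node` layout, the `mode` enum, and both function names are assumptions made for illustration; they are not taken from ctcompare.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum mode { CROSS_TREE, CODE_CLONE };

/* Hypothetical record node: where a 16-token run begins. */
struct node {
    int    tree;   /* which code tree */
    int    file;   /* which file within that tree */
    size_t pos;    /* token offset within that file */
};

/* Return 1 if the pair should be compared, 0 if it should be skipped. */
int should_compare(const struct node *na, const struct node *nb, enum mode m)
{
    int same_tree = (na->tree == nb->tree);
    int same_file = same_tree && (na->file == nb->file);

    if (m == CROSS_TREE && same_tree)
        return 0;               /* cross-tree: ignore matches within a tree */
    if (m == CODE_CLONE && same_file)
        return 0;               /* clone search: ignore matches within a file */
    return 1;
}

/* Full token comparison: number of tokens in common when reading
 * forward from a[ia] and b[ib]. na and nb are the array lengths. */
size_t match_length(const uint16_t *a, size_t na, size_t ia,
                    const uint16_t *b, size_t nb, size_t ib)
{
    size_t len = 0;

    assert(ia <= na && ib <= nb);
    while (ia + len < na && ib + len < nb && a[ia + len] == b[ib + len])
        len++;
    return len;
}
```

Because every candidate pair already shares 16 tokens, `match_length` returns at least 16 for such pairs; the interesting cases are the ones where it keeps going well beyond that.
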
Hashing Identifiers and Literals in Each Code Tree
● Simply comparing tokens yields false positives, e.g.
      if (xx > yy) { aa = 2 * bb + cc; }
      if (ab > bc) { ab = 5 * ab + bc; }
● I hash each identifier and literal down to a 16-bit value
● This obfuscates the original values while minimising the
  number of false positives
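
A sketch of such a 16-bit hash follows. The slides do not say which hash function ctcompare uses, so FNV-1a with XOR folding is an assumption here, as is the name `hash16`.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Fold an identifier or literal down to a 16-bit value.
 * Assumption: 32-bit FNV-1a, then XOR the high and low halves. */
uint16_t hash16(const char *text)
{
    uint32_t h = 2166136261u;           /* FNV-1a offset basis */

    assert(text != NULL);
    while (*text != '\0') {
        h ^= (unsigned char)*text++;
        h *= 16777619u;                 /* FNV-1a prime */
    }
    return (uint16_t)((h >> 16) ^ (h & 0xffffu));
}
```

Only the 16-bit value needs to be stored or exchanged, so the original identifier text is not exposed. Distinct names can still collide in 16 bits, so some false positives remain possible.
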
Isomorphic Comparison
● Hashing identifiers helps to ensure code similarities are
  found
● Does not help if variables have been renamed
● This can be solved with isomorphic comparison: keep a 2-way
  variable isomorphism table
● Code is isomorphic when variable names are isomorphic
  across a long run of code

Isomorphic Example
int maxofthree(int x, int y, int z)
 {
  if ((x>y) && (x>z)) return(x);
  if (y>z) return(y);
  return(z);
 }

int bigtriple(int b, int a, int c)
 {
  if ((b>a) && (b>c)) return(b);
  if (a>c) return(a);
  return(c);
 }
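
The two-way table can be sketched as follows. This is an illustrative reading of the idea, not ctcompare's code: it works on the sequence of identifier names in each fragment, and `iso_table`, `iso_pair`, and `identifiers_isomorphic` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_NAMES 64   /* table capacity in this sketch */

/* Two-way table: left[i] in the first fragment pairs with right[i]
 * in the second, and each name appears at most once on its side. */
struct iso_table {
    const char *left[MAX_NAMES];
    const char *right[MAX_NAMES];
    size_t      used;
};

/* Record or check the pairing a <-> b. Returns 1 if it is consistent
 * with every pairing seen so far, 0 otherwise. */
static int iso_pair(struct iso_table *t, const char *a, const char *b)
{
    size_t i;

    for (i = 0; i < t->used; i++) {
        int la = strcmp(t->left[i], a) == 0;
        int rb = strcmp(t->right[i], b) == 0;

        if (la || rb)
            return la && rb;    /* a name may have only one partner */
    }
    assert(t->used < MAX_NAMES);
    t->left[t->used] = a;
    t->right[t->used] = b;
    t->used++;
    return 1;
}

/* Return 1 if the identifier sequences a and b (each of length n)
 * are related by a consistent one-to-one renaming. */
int identifiers_isomorphic(const char *const *a, const char *const *b,
                           size_t n)
{
    struct iso_table t;
    size_t i;

    t.used = 0;
    for (i = 0; i < n; i++)
        if (!iso_pair(&t, a[i], b[i]))
            return 0;
    return 1;
}
```

For the example above, the identifier sequences x y z x y x z x y z y z and b a c b a b c b a c a c pair up consistently (x with b, y with a, z with c), so the two functions are reported as isomorphic.
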
Bottlenecks
● In the second version, main bottleneck is the merging of
  overlapping code fragments which have been found to match
  between two trees
● Similar to the problem with DNA sequencing when genes are
  split up into multiple fragments before sequencing
● Use of AVL trees here has helped to reduce the cost of
  merging immensely
● Other more mundane optimisations such as mmap() have also
  helped
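
The merge step can be illustrated with a simple sort-and-sweep over runs. Note that this swaps in `qsort` for the AVL trees mentioned above, purely to keep the sketch short; the `run` structure and `merge_runs` are assumptions for illustration. Two runs are merged only when they lie on the same diagonal (the same offset between the two trees) and overlap or touch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A matching run: len tokens at pa in tree A equal len tokens at pb
 * in tree B. */
struct run {
    long pa, pb, len;
};

/* Order by diagonal (pa - pb), then by start position in tree A. */
static int cmp_run(const void *x, const void *y)
{
    const struct run *r = x, *s = y;
    long dr = r->pa - r->pb, ds = s->pa - s->pb;

    if (dr != ds)
        return (dr > ds) - (dr < ds);
    return (r->pa > s->pa) - (r->pa < s->pa);
}

/* Merge overlapping runs in place; runs that are subsets of larger
 * runs disappear as a side effect. Returns the new number of runs. */
size_t merge_runs(struct run *runs, size_t n)
{
    size_t i, out = 0;

    assert(runs != NULL || n == 0);
    if (n == 0)
        return 0;
    qsort(runs, n, sizeof runs[0], cmp_run);
    for (i = 1; i < n; i++) {
        struct run *cur = &runs[out];
        const struct run *nxt = &runs[i];
        int same_diag = (cur->pa - cur->pb) == (nxt->pa - nxt->pb);

        if (same_diag && nxt->pa <= cur->pa + cur->len) {
            long end = nxt->pa + nxt->len;

            if (end > cur->pa + cur->len)
                cur->len = end - cur->pa;   /* extend the current run */
        } else {
            runs[++out] = *nxt;             /* start a new merged run */
        }
    }
    return out + 1;
}
```

Sorting costs O(n log n) per batch; a balanced tree such as an AVL tree gives comparable bounds while allowing runs to be inserted and merged incrementally.
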
Current Performance
● 54 seconds to compare 14 trees of code totalling 1 million
  lines of C code

Results
● Many lines of code written at UNSW in the late 1970s are
  still present in System V, and in OpenSolaris
● Thousands of lines of replicated code inside the Linux kernel
● AT&T vs BSDi lawsuit in early 1990s:
   - expert witness found 56 common lines in 230K lines
     between BSD and Unix
   - my tool finds roughly the same lines, plus 100+ more
     isomorphic similarities

Questions?
