Ctcompare: Comparing Multiple Code Trees for Similarity

Warren Toomey
School of IT, Bond University

Using lexical analysis with techniques borrowed from DNA sequencing, multiple code trees can be quickly compared to find any code similarities.

Why Write a Code Comparison Tool?
● To detect student plagiarism
● To determine if your codebase is “infected” by proprietary code from elsewhere, or by open-source code covered by a license like the GPL
● To trace code genealogy between trees separated by time (e.g. versions), useful for the new field of computing history

Code Comparison Issues
● Can rearrangement of code be detected?
  - per line? per sub-line?
● Can “munging of code” be detected?
  - variable/function/struct renaming?
● What if one or both codebases are proprietary? How can third parties verify any comparison?
● Can a timely comparison be done?
● What is the rate of false positives?
  - of missed matches?

Lexical Comparison
● First version: break the source code from each tree into lexical tokens, then compare runs of tokens between trees.
● This removes all the code's semantics, but does deal with code rearrangement (but not code “munging”).

Performance of Lexical Approach
● Performance was O(M*N), where M and N are the number of tokens in each code tree
● Basically: slow, with cost growing quadratically as the trees grow
● I also wanted to compare multiple code trees, and also find duplicated code within a tree

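
To make the cost of the first version concrete, here is a minimal sketch of a brute-force comparison. It is not ctcompare's code: the function name and the whitespace-split token lists are assumptions for illustration. Every start position in one tree is tried against every start position in the other, which is where the M*N factor comes from.

```python
def longest_common_run(a, b):
    """Return (length, start_a, start_b) of the longest run of equal tokens.

    Every pair of start positions is tried, so the number of pairs
    examined is len(a) * len(b).
    """
    best = (0, 0, 0)
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > best[0]:
                best = (k, i, j)
    return best


# Hypothetical token streams; a real tool would produce these with a lexer.
tree_a = "x = 1 ; if ( x > max ) max = x ;".split()
tree_b = "y = 2 ; if ( x > max ) max = x ; return".split()
print(longest_common_run(tree_a, tree_b))  # (11, 3, 3)
```
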
When is Copying Significant?
● Common code fragments not significant:
     for (i=0; i< val; i++)            13 tokens
     if (x > max) max=x;               10 tokens
     err= stat(file, &sb); if (err)    14 tokens
● Axiom: runs of < 16 tokens insignificant
● Idea: use a run of 16 tokens as a lookup key to find other trees with the same run of tokens

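
The token counts above can be checked with a small tokenizer. This is a sketch only: the regular expression is an assumption that covers these three fragments, not a full C lexer and not the lexer ctcompare uses.

```python
import re

# Identifiers, integer literals, a few two-character operators, then any
# other single non-space character. Comments and strings are not handled.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\+\+|--|[<>=!]=|&&|\|\||\S")

MIN_RUN = 16  # threshold taken from the slide


def tokenize(source):
    return TOKEN_RE.findall(source)


fragments = [
    "for (i=0; i< val; i++)",
    "if (x > max) max=x;",
    "err= stat(file, &sb); if (err)",
]
counts = [len(tokenize(f)) for f in fragments]
print(counts)                            # [13, 10, 14]
print(all(c < MIN_RUN for c in counts))  # True: none would count as copying
```
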
Approach in Second Version
● Take all 16-token runs in a code tree
● Use each as a key, and insert the runs into a database along with the position of the run in the source tree
● Keys (i.e. runs of 16 tokens) with multiple values indicate possible code copying of at least 16 tokens
● For these keys, we can evaluate all code trees from this point on to find any real code similarity

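
The indexing step can be sketched as follows. An in-memory dictionary stands in for the database, and the names `index_runs`, `trees`, and the tree labels are hypothetical; the slides do not say how the real database is organised.

```python
from collections import defaultdict

RUN = 16  # run length taken from the slides


def index_runs(trees, run=RUN):
    """Map each run of `run` tokens to the (tree, offset) places it occurs.

    `trees` is an assumed {tree_name: token_list} mapping.
    """
    index = defaultdict(list)
    for name, tokens in trees.items():
        for pos in range(len(tokens) - run + 1):
            index[tuple(tokens[pos:pos + run])].append((name, pos))
    return index


# Two toy trees that share one 16-token run at different offsets.
shared = [f"t{i}" for i in range(16)]
trees = {
    "tree_a": ["a0", "a1"] + shared + ["a2"],
    "tree_b": ["b0"] + shared,
}
index = index_runs(trees)

# Keys with more than one value are the candidates for copied code.
candidates = {key: where for key, where in index.items() if len(where) > 1}
print(candidates[tuple(shared)])  # [('tree_a', 2), ('tree_b', 1)]
```

Building the index takes one pass over each tree, so adding another tree does not require comparing it against every existing tree token by token.
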
Approach in Second Version: Algorithm
for each key in the database with multiple record nodes {
  obtain the set of record nodes from the database;

  for each combination of node pairs Na and Nb {
    if (performing a cross-tree comparison)
      skip this combination if Na and Nb in the same tree;
    if (performing a code clone comparison)
      skip this combination if Na and Nb in the same file;

    perform a full token comparison beginning at Na and Nb to
    determine the full extent of the code similarity, i.e. determine
    the actual number of tokens in common;

    add the matching token sequence to a list of "potential runs";
  }
}
walk the list of potential runs to merge overlapping runs, and
remove runs which are subsets of larger runs;

output the details of the actual matching token runs found.

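
The pseudocode above can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the author's implementation: trees are flat token lists in a dictionary, so the per-file check is not modelled, and instead of a separate merge pass the sketch steps backwards to the true start of each match so that overlapping keys collapse into one entry. The name `find_runs` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

RUN = 16  # run length taken from the slides


def find_runs(trees, run=RUN, cross_tree=True):
    """Return sorted (tree_a, pos_a, tree_b, pos_b, length) matches."""
    index = defaultdict(list)
    for name, tokens in trees.items():
        for pos in range(len(tokens) - run + 1):
            index[tuple(tokens[pos:pos + run])].append((name, pos))

    potential = set()
    for nodes in index.values():
        if len(nodes) < 2:
            continue  # key seen only once: nothing to compare
        for (ta, pa), (tb, pb) in combinations(nodes, 2):
            if cross_tree and ta == tb:
                continue  # cross-tree mode ignores matches inside one tree
            a, b = trees[ta], trees[tb]
            # Assumption: step back to the start of the match so that
            # every overlapping key yields the same tuple.
            while pa > 0 and pb > 0 and a[pa - 1] == b[pb - 1]:
                pa, pb = pa - 1, pb - 1
            # Full token comparison to find the extent of the match.
            length = 0
            while (pa + length < len(a) and pb + length < len(b)
                   and a[pa + length] == b[pb + length]):
                length += 1
            potential.add((ta, pa, tb, pb, length))
    return sorted(potential)


shared = [f"t{i}" for i in range(20)]
trees = {"old": ["x"] + shared + ["y"], "new": ["p", "q"] + shared + ["r"]}
print(find_runs(trees))  # [('old', 1, 'new', 2, 20)]
```

Here five different 16-token keys fall inside the same 20-token match, and all five reduce to the single reported run.
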
Hashing Identifiers and Literals in Each Code Tree
● Simply comparing tokens yields false positives, e.g.
      if (xx > yy) { aa = 2 * bb + cc; }
      if (ab > bc) {ab = 5 * ab + bc; }
● I hash each identifier and literal down to a 16-bit value
● This obfuscates the original values while minimising the number of false positives

Isomorphic Comparison
● Hashing identifiers helps to ensure code similarities are found
● Does not help if variables have been renamed
● This can be solved with isomorphic comparison: keep a 2-way variable isomorphism table
● Code is isomorphic when variable names are isomorphic across a long run of code

Isomorphic Example
int maxofthree(int x, int y, int z)
 {
  if ((x>y) && (x>z)) return(x);
  if (y>z) return(y);
  return(z);
 }

int bigtriple(int b, int a, int c)
 {
  if ((b>a) && (b>c)) return(b);
  if (a>c) return(a);
  return(c);
 }
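
The two functions above can be checked with a two-way mapping table. This is a minimal sketch: `identifiers` and `isomorphic` are hypothetical helpers, and the keyword set covers only what this example needs. A real comparison would also require the non-identifier tokens to match.

```python
import re

KEYWORDS = {"int", "if", "return"}  # only the keywords used in this example


def identifiers(source):
    """Identifiers in order of appearance, keywords excluded."""
    return [t for t in re.findall(r"[A-Za-z_]\w*", source) if t not in KEYWORDS]


def isomorphic(run_a, run_b):
    """True if one consistent, one-to-one renaming turns run_a into run_b."""
    if len(run_a) != len(run_b):
        return False
    a_to_b, b_to_a = {}, {}
    for x, y in zip(run_a, run_b):
        # Both directions must agree, otherwise two names would share one partner.
        if a_to_b.setdefault(x, y) != y or b_to_a.setdefault(y, x) != x:
            return False
    return True


first = """int maxofthree(int x, int y, int z)
 { if ((x>y) && (x>z)) return(x); if (y>z) return(y); return(z); }"""
second = """int bigtriple(int b, int a, int c)
 { if ((b>a) && (b>c)) return(b); if (a>c) return(a); return(c); }"""
altered = second.replace("return(c)", "return(a)")

print(isomorphic(identifiers(first), identifiers(second)))   # True
print(isomorphic(identifiers(first), identifiers(altered)))  # False
```

The second direction of the table is what rejects a pair such as `x y` against `b b`, where two different names would map to the same one.
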
Bottlenecks
● In the second version, main bottleneck is the merging of overlapping code fragments which have been found to match between two trees
● Similar to the problem with DNA sequencing when genes are split up into multiple fragments before sequencing
● Use of AVL trees here has helped to reduce the cost of merging immensely
● Other more mundane optimisations such as mmap() have also helped

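
The merge step can be sketched as an interval merge. This is an illustration under assumptions, not the tool's code: runs are `(pos_a, pos_b, length)` tuples for one pair of trees, and a sort stands in for the AVL tree the slides mention. Two runs are merged only when they lie on the same alignment, that is, when `pos_a - pos_b` is equal.

```python
def merge_runs(runs):
    """Merge (pos_a, pos_b, length) runs that overlap on the same alignment.

    Runs that are subsets of a larger run disappear into it.
    """
    merged = []
    for pa, pb, length in sorted(runs, key=lambda r: (r[0] - r[1], r[0])):
        if merged:
            qa, qb, qlen = merged[-1]
            same_alignment = (qa - qb) == (pa - pb)
            if same_alignment and pa <= qa + qlen:
                # Extend the previous run if this one reaches further.
                merged[-1] = (qa, qb, max(qlen, pa + length - qa))
                continue
        merged.append((pa, pb, length))
    return merged


runs = [(10, 40, 16), (11, 41, 16), (12, 42, 20), (100, 5, 16), (14, 44, 16)]
print(merge_runs(runs))  # [(10, 40, 22), (100, 5, 16)]
```

Sorting the whole list each time costs more than keeping the runs in a balanced tree as they arrive, which is presumably why the tree helps when the list of potential runs is large.
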
Current Performance
● 54 seconds to compare 14 trees of code totalling 1 million lines of C code

Results
● Many lines of code written at UNSW in the late 1970s are still present in System V, and in Open Solaris
● Thousands of lines of replicated code inside the Linux kernel
● AT&T vs BSDi lawsuit in the early 1990s:
   - expert witness found 56 common lines in 230K lines between BSD and Unix
   - my tool finds roughly the same lines, plus 100+ more isomorphic similarities

Questions?

Ctcompare: Comparing Multiple Code Trees for Similarity

  • 1. Ctcompare: Comparing Multiple Code Trees for Similarity Warren Toomey School of IT, Bond University Using lexical analysis with techniques borrowed from DNA sequencing, multiple code trees can be quickly compared to find any code similarities
  • 2. Why Write a Code Comparison Tool? To detect student plagiarism ● To determine if your codebase is ● `infected' by proprietary code from elsewhere, or by open-source code covered by a license like the GPL To trace code genealogy between trees ● separated by time (e.g. versions), useful for the new field of computing history
  • 3. Code Comparison Issues
      ● Can rearrangement of code be detected? Per line? Per sub-line?
      ● Can “munging of code” be detected, e.g. variable/function/struct renaming?
      ● What if one or both codebases are proprietary? How can third parties verify any comparison?
      ● Can a timely comparison be done?
      ● What is the rate of false positives? Of missed matches?
  • 4. Lexical Comparison
      ● First version: break the source code from each tree into lexical tokens, then compare runs of tokens between trees.
      ● This discards the code's semantics, but it does handle code rearrangement (though not code “munging”).
  • 5. Performance of Lexical Approach
      ● Performance was O(M*N), where M and N are the number of tokens in each code tree
      ● Basically: slow, with running time growing quadratically as the trees grow
      ● I also wanted to compare multiple code trees, and also find duplicated code within a tree
  • 6. When is Copying Significant?
      ● Common code fragments are not significant:
          for (i=0; i< val; i++)            13 tokens
          if (x > max) max=x;               10 tokens
          err= stat(file, &sb); if (err)    14 tokens
      ● Axiom: runs of fewer than 16 tokens are insignificant
      ● Idea: use a run of 16 tokens as a lookup key to find other trees with the same run of tokens
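The token counts quoted on this slide can be reproduced with a very small lexer. The following is only a sketch under simplifying assumptions: `count_tokens` is a hypothetical helper, not part of ctcompare, and it recognises only identifiers, integer literals, and one- or two-character operators (no strings, comments, or preprocessor lines).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Count the lexical tokens in a C fragment (hypothetical helper).
 * Simplified on purpose: identifiers/keywords, integer literals, and
 * one- or two-character operators only. */
size_t count_tokens(const char *src)
{
    static const char *two_char_ops[] = {
        "++", "--", "==", "!=", "<=", ">=", "&&", "||", "->", "+=", "-="
    };
    const size_t nops = sizeof two_char_ops / sizeof two_char_ops[0];
    size_t count = 0;

    assert(src != NULL);
    while (*src != '\0') {
        if (isspace((unsigned char)*src)) {
            src++;                      /* whitespace separates tokens */
            continue;
        }
        if (isalpha((unsigned char)*src) || *src == '_') {
            /* identifier or keyword */
            while (isalnum((unsigned char)*src) || *src == '_')
                src++;
        } else if (isdigit((unsigned char)*src)) {
            /* integer literal, including suffixes such as 10UL */
            while (isalnum((unsigned char)*src))
                src++;
        } else {
            /* operator or punctuation: prefer a two-character match */
            size_t len = 1;
            for (size_t i = 0; i < nops; i++) {
                if (strncmp(src, two_char_ops[i], 2) == 0) {
                    len = 2;
                    break;
                }
            }
            src += len;
        }
        count++;
    }
    return count;
}
```

Calling `count_tokens` on the three fragments above gives 13, 10 and 14, all below the 16-token threshold.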
  • 7. Approach in Second Version
      ● Take all 16-token runs in a code tree
      ● Use each as a key, and insert the runs into a database along with the position of the run in the source tree
      ● Keys (i.e. runs of 16 tokens) with multiple values indicate possible code copying of at least 16 tokens
      ● For these keys, we can compare the code trees from this point onwards to find any real code similarity
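The indexing step can be sketched as a sliding window over each tree's token stream. This is an illustration only: it assumes tokens are already encoded as one byte each, and it uses a sorted in-memory array as a stand-in for the database. `RunRecord`, `index_runs` and `count_multi_value_keys` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define RUN_LEN 16   /* minimum significant run, per the axiom above */

/* One database entry: a 16-token key plus where it was found. */
typedef struct {
    const unsigned char *key;  /* points at RUN_LEN tokens in the stream */
    int tree;                  /* which code tree the run came from */
    size_t pos;                /* token offset of the run in that tree */
} RunRecord;

/* Emit one record per 16-token window; returns the number written.
 * The caller must provide room for ntokens - RUN_LEN + 1 records. */
size_t index_runs(const unsigned char *tokens, size_t ntokens, int tree,
                  RunRecord *out)
{
    assert(tokens != NULL && out != NULL);
    if (ntokens < RUN_LEN)
        return 0;
    size_t nruns = ntokens - RUN_LEN + 1;
    for (size_t i = 0; i < nruns; i++) {
        out[i].key = tokens + i;
        out[i].tree = tree;
        out[i].pos = i;
    }
    return nruns;
}

static int compare_keys(const void *a, const void *b)
{
    const RunRecord *ra = a, *rb = b;
    return memcmp(ra->key, rb->key, RUN_LEN);
}

/* Sort records by key and count the keys that have more than one
 * record, i.e. the candidates for copied code. */
size_t count_multi_value_keys(RunRecord *recs, size_t nrecs)
{
    size_t shared = 0;
    assert(recs != NULL || nrecs == 0);
    qsort(recs, nrecs, sizeof recs[0], compare_keys);
    for (size_t i = 0; i < nrecs; ) {
        size_t j = i + 1;
        while (j < nrecs && compare_keys(&recs[i], &recs[j]) == 0)
            j++;
        if (j - i > 1)
            shared++;
        i = j;
    }
    return shared;
}
```

Sorting groups equal keys together, so each group of two or more records corresponds to a key with multiple values in the slide's terms.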
  • 9. Algorithm
      for each key in the database with multiple record nodes {
          obtain the set of record nodes from the database;
          for each combination of node pairs Na and Nb {
              if (performing a cross-tree comparison)
                  skip this combination if Na and Nb are in the same tree;
              if (performing a code clone comparison)
                  skip this combination if Na and Nb are in the same file;
              perform a full token comparison beginning at Na and Nb to
              determine the full extent of the code similarity, i.e.
              determine the actual number of tokens in common;
              add the matching token sequence to a list of "potential runs";
          }
      }
      walk the list of potential runs to merge overlapping runs, and
      remove runs which are subsets of larger runs;
      output the details of the actual matching token runs found.
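The inner loop of this algorithm has two parts: deciding whether a node pair should be skipped, and extending a match forward from the two nodes. A minimal sketch follows, assuming one byte per token and integer tree and file identifiers. `Node`, `CompareMode`, `skip_pair` and `match_length` are hypothetical names, not ctcompare's own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { CROSS_TREE, CODE_CLONE } CompareMode;

/* Where a 16-token run was found (assumed layout). */
typedef struct {
    int tree;    /* code tree identifier */
    int file;    /* file identifier within that tree */
    size_t pos;  /* token offset of the run */
} Node;

/* Cross-tree mode ignores pairs inside one tree; clone mode ignores
 * pairs inside one file. */
bool skip_pair(CompareMode mode, const Node *na, const Node *nb)
{
    assert(na != NULL && nb != NULL);
    if (mode == CROSS_TREE)
        return na->tree == nb->tree;
    return na->tree == nb->tree && na->file == nb->file;
}

/* Full token comparison: count how many tokens agree starting at
 * a[apos] and b[bpos], stopping at the first mismatch or at the end
 * of either stream. */
size_t match_length(const unsigned char *a, size_t alen, size_t apos,
                    const unsigned char *b, size_t blen, size_t bpos)
{
    size_t n = 0;
    assert(a != NULL && b != NULL);
    while (apos + n < alen && bpos + n < blen && a[apos + n] == b[bpos + n])
        n++;
    return n;
}
```

The value returned by `match_length` is the "actual number of tokens in common" for that pair; a run is only worth recording when it reaches the 16-token threshold.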
  • 10. Hashing Identifiers and Literals in Each Code Tree
      ● Simply comparing tokens yields false positives, e.g.
          if (xx > yy) { aa = 2 * bb + cc; }
          if (ab > bc) { ab = 5 * ab + bc; }
      ● I hash each identifier and literal down to a 16-bit value
      ● This obfuscates the original values while minimising the number of false positives
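The slides do not say which hash function is used, so the following is purely an assumed stand-in: 32-bit FNV-1a, folded to 16 bits by XORing the upper and lower halves. `hash16` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hash an identifier or literal down to 16 bits.
 * Assumption: FNV-1a (32-bit) with XOR folding; the hash actually
 * used by ctcompare is not given in the slides. */
uint16_t hash16(const char *text)
{
    uint32_t h = 2166136261u;          /* FNV-1a 32-bit offset basis */

    assert(text != NULL);
    for (; *text != '\0'; text++) {
        h ^= (unsigned char)*text;
        h *= 16777619u;                /* FNV-1a 32-bit prime */
    }
    /* Fold the 32-bit result into 16 bits. */
    return (uint16_t)((h >> 16) ^ (h & 0xFFFFu));
}
```

Storing the hash alongside each identifier or literal token means the two `if` statements above no longer compare equal token for token, while the original names cannot be read back from the stored values. With only 65536 possible values, distinct names can still collide occasionally.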
  • 11. Isomorphic Comparison
      ● Hashing identifiers helps to ensure code similarities are found
      ● It does not help if variables have been renamed
      ● This can be solved with isomorphic comparison: keep a 2-way variable isomorphism table
      ● Code is isomorphic when variable names are isomorphic across a long run of code
  • 12. Isomorphic Example
      int maxofthree(int x, int y, int z)
      {
        if ((x>y) && (x>z)) return(x);
        if (y>z) return(y);
        return(z);
      }

      int bigtriple(int b, int a, int c)
      {
        if ((b>a) && (b>c)) return(b);
        if (a>c) return(a);
        return(c);
      }
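A 2-way isomorphism table can be sketched as a pair of lookup arrays, one per direction. The sketch assumes each identifier has already been reduced to a one-byte id and that all other tokens have been checked for equality separately. `is_isomorphic` is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_IDS 256     /* one slot per possible one-byte identifier id */
#define UNMAPPED (-1)

/* Return true when the identifier sequences a[0..n) and b[0..n) are
 * related by a consistent one-to-one renaming. */
bool is_isomorphic(const unsigned char *a, const unsigned char *b, size_t n)
{
    int a_to_b[MAX_IDS], b_to_a[MAX_IDS];

    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < MAX_IDS; i++)
        a_to_b[i] = b_to_a[i] = UNMAPPED;

    for (size_t i = 0; i < n; i++) {
        if (a_to_b[a[i]] == UNMAPPED && b_to_a[b[i]] == UNMAPPED) {
            /* first sighting of both names: record the pairing both ways */
            a_to_b[a[i]] = b[i];
            b_to_a[b[i]] = a[i];
        } else if (a_to_b[a[i]] != b[i] || b_to_a[b[i]] != a[i]) {
            /* one side is already paired with a different name */
            return false;
        }
    }
    return true;
}
```

For the example above, the parameter and body identifiers of `maxofthree` read x y z x y x z x y z y z and those of `bigtriple` read b a c b a b c b a c a c. The table settles on x↔b, y↔a, z↔c and never sees a conflict, so the two runs are reported as isomorphic.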
  • 13. Bottlenecks
      ● In the second version, the main bottleneck is the merging of overlapping code fragments which have been found to match between two trees
      ● Similar to the problem with DNA sequencing when genes are split up into multiple fragments before sequencing
      ● Use of AVL trees here has helped to reduce the cost of merging immensely
      ● Other more mundane optimisations such as mmap() have also helped
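To make the merging problem concrete, here is a deliberately simpler stand-in for the AVL-tree approach: sort the potential runs and sweep them once. It assumes a run is a triple (position in tree A, position in tree B, length), and that two runs can only be merged when they lie on the same diagonal, i.e. have the same offset between their A and B positions. `Run` and `merge_runs` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* A matching run between two trees (assumed representation). */
typedef struct {
    size_t a_pos;  /* start token in tree A */
    size_t b_pos;  /* start token in tree B */
    size_t len;    /* number of matching tokens */
} Run;

/* Order by diagonal (a_pos - b_pos), then by a_pos. The diagonals are
 * compared via cross sums to avoid negative differences. */
static int compare_runs(const void *x, const void *y)
{
    const Run *p = x, *q = y;
    size_t dp = p->a_pos + q->b_pos, dq = q->a_pos + p->b_pos;
    if (dp != dq)
        return dp < dq ? -1 : 1;
    if (p->a_pos != q->a_pos)
        return p->a_pos < q->a_pos ? -1 : 1;
    return 0;
}

/* Merge overlapping or touching runs in place and drop runs that are
 * subsets of larger ones. Returns the number of runs that remain. */
size_t merge_runs(Run *runs, size_t n)
{
    assert(runs != NULL || n == 0);
    if (n == 0)
        return 0;
    qsort(runs, n, sizeof runs[0], compare_runs);

    size_t out = 0;
    for (size_t i = 1; i < n; i++) {
        Run *last = &runs[out];
        const Run *cur = &runs[i];
        bool same_diag = last->a_pos + cur->b_pos == cur->a_pos + last->b_pos;
        if (same_diag && cur->a_pos <= last->a_pos + last->len) {
            /* cur overlaps or touches last: extend last if cur reaches further */
            size_t end = cur->a_pos + cur->len;
            if (end > last->a_pos + last->len)
                last->len = end - last->a_pos;
        } else {
            runs[++out] = *cur;
        }
    }
    return out + 1;
}
```

This costs O(n log n) per batch because of the sort. A balanced tree such as the AVL tree mentioned above instead keeps the runs ordered as they are discovered, which avoids re-sorting every time new fragments arrive.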
  • 14. Current Performance
      ● 54 seconds to compare 14 trees of code totalling 1 million lines of C code
  • 15. Results
      ● Many lines of code written at UNSW in the late 1970s are still present in System V and in OpenSolaris
      ● Thousands of lines of replicated code inside the Linux kernel
      ● AT&T vs BSDi lawsuit in the early 1990s:
          - an expert witness found 56 common lines in 230K lines between BSD and Unix
          - my tool finds roughly the same lines, plus 100+ more isomorphic similarities