The document summarizes research on enabling internet-scale real-time source code search. It analyzes characteristics of source code on the internet and designs an approach called SeClone that can perform clone search and detection on millions of lines of code within 100 milliseconds. SeClone uses a two-phase approach of pattern matching followed by semantic matching. The research studies how to distribute the 100 milliseconds between the two phases and concludes that pattern matching can be done within 1 millisecond, leaving 99 milliseconds for semantic matching. The research enables internet-scale real-time code search through techniques like multi-level indexing, sampling, and using 32-bit hashes.
1. Internet-scale Source Code
Search and Analysis Framework
Iman Keivanloo
Advisor:
Dr. Juergen Rilling
PhD Seminar
Computer Science and Software Engineering Department
November-17-2011
3. Research Context
Internet-Scale Code Search
“is searching the Internet for source code to help solve a software
development problem”
[Gallardo, SUITE’09]
4. How to search for Source Code?
• Free-form Query:
– “how to write into file in Java”
• Structural Query:
– “select col1 from table1 where col1=‘%write’”
[Keivanloo, SUITE’11] [Keivanloo, ICSM’10]
5. Research Focus
Similar Fragment Search
Step 1: Input [the simplified structural query]
Suggested simplified query: Select line which has
(1) a method call statement on the trigger method.

Internet-Scale Structural Code Search Engine, results:
...
59: Event e=new Event(50);
60: e.trigger();
61: e.update();
...
10: Window myWindow=new Window();
11: CSVReadFile csvData=new CSVReadFile(“input.csv”);
12: myWindow.trigger(csvData);
13: OutputStream o=new OutputStream();
14: myWindow.flush(o);
15: myWindow.close();
...
133: Listener res=new Listener();
134: res.trigger(“warm-up”);
135: res.close();
...
Note on the second result (line 12): This line looks like a match, however it uses .CSV instead of .XML. We can now use our clone search engine to find other code fragments similar to this one.

Step 2: Input [the selected fragment in the first step and its target line (red)]
...
11: CSVReadFile csvData=new CSVReadFile(“input.csv”);
12: myWindow.trigger(csvData);
13: OutputStream o=new OutputStream();
…

Real-time Clone Search Engine, results:
...
55: Window r=new Window();
56: long timestamp=System.Now();
57: System.out.println(“Start reasoning...”);
58: XMLStream xmldata=new XMLStream(io);
59: r.trigger(xmldata);
60: OutputStream o=new OutputStream();
61: r.flush(o);
…
Gapped clone: The pattern is similar but it uses XMLStream instead of XMLFile as the input.
...
89: Window var=new Window();
90: XMLReadFile r=new XMLReadFile (“k.xml”);
91: OutputStream o=new OutputStream();
92: var.trigger(r);
93: var.flush(o);
…
Unordered core from the 1:1 match: This match is acceptable, even if the order is different.

The ideal expected answer:
XMLReadFile inFile=new XMLReadFile(“kb.xml”);
Window myWindow=new Window();
myWindow.trigger(inFile);
OutputStream result=new OutputStream();
myWindow.flush(result);
14. Clone (Source Code Clone)
• Similar code fragments
Fragment A:
for (AttributeEntity theAttributeEntity:aTableEntity.ge…
System.out.println(“Hello!");
Fragment B:
for (AttributeEntity theAttributeEntity:aTableEntity.ge…
System.out.println(“Hello!");
• Type 1: Identical except whitespaces …
• Type 2: Identical except variable names ...
• Type 3: Identical except a few missing…
• Type 4: Similar functionality
[Roy, C. K., Cordy, J. R., & Koschke, R. (2009). Comparison and evaluation of code clone detection techniques and tools: A qualitative approach. Science of Computer Programming.]
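The Type-1 and Type-2 definitions above can be illustrated with a small normalization sketch. This is not SeClone's implementation: the class name CloneNormalizer, the short keyword list, and the regex-based identifier detection are all assumptions made for illustration. Names followed by "(" are kept, which is one possible way to preserve method names; a real tool would use a proper Java lexer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: two fragments are Type-1 clones if normalizeType1 agrees,
// and (roughly) Type-2 clones if normalizeType2 agrees.
final class CloneNormalizer {
    // Tiny illustrative keyword list (an assumption, not the full Java grammar).
    private static final Set<String> KEYWORDS = new HashSet<>(
        Arrays.asList("for", "if", "while", "new", "return", "int", "long"));

    private static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    /** Type-1 view: remove all whitespace so layout differences disappear. */
    static String normalizeType1(String code) {
        return code.replaceAll("\\s+", "");
    }

    /** Type-2 view: replace identifiers with "$", keeping keywords and called names. */
    static String normalizeType2(String code) {
        Matcher m = IDENTIFIER.matcher(code);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String word = m.group();
            boolean keep = KEYWORDS.contains(word) || isFollowedByParen(code, m.end());
            m.appendReplacement(sb, keep ? word : Matcher.quoteReplacement("$"));
        }
        m.appendTail(sb);
        return normalizeType1(sb.toString());
    }

    // True if the next non-whitespace character after pos is '(' (a call or constructor).
    private static boolean isFollowedByParen(String code, int pos) {
        while (pos < code.length() && Character.isWhitespace(code.charAt(pos))) {
            pos++;
        }
        return pos < code.length() && code.charAt(pos) == '(';
    }
}
```

With this sketch, myWindow.trigger(csvData); and r.trigger(xmldata); normalize to the same string, while e.update(); does not. Note that words inside string literals are replaced too, so the Type-2 view is coarser than a strict "variable names only" comparison.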
15. Clone Search
Query:
for (Attribute attribute:exampleSet.getAttributes())
System.out.println(“Hello!");
Code Database:
for (Attribute attribute:es1.getAttributes())
System.out.println(“Test");
for (IAttribute att:source.getAttributes()) {
System.out.println("Please do not read me");
for (JAttribute attribute:formType.getAttributes()) System.out.println(“Test");
22. Internet-scale Real-time Clone Search
for (IAttribute att:source.getAttributes()) {
System.out.println("Please do not read me");
for (JAttribute attribute:formType.getAttributes()) System.out.println(“Test");
Millions LOC, 100 Milliseconds
Requirements:
• Precision
• Recall
• Type-1, 2, 3…
27. Inside SeClone
Phase 1: Pattern Matching
Phase 2: Semantic Matching
• Information Retrieval & Clustering algorithm
Example fragments:
1 for (Attribute attribute:exampleSet.getAttribute…
System.out.println(“The end");
2 for (Attribute attribute:es1.getAttributes())
System.out.println(“Test");
3 for (AttributeEntity theAttributeEntity:aTable…
System.out.println(“Hello!");
4 for (JAttribute attribute:formType.getAttribute…
System.out.println(“Test");
5 for (IAttribute att:source.getAttributes()) {
System.out.println("Please do not read m…
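The two-phase idea (a cheap pattern lookup followed by a more expensive semantic comparison) can be sketched as below. This is an illustrative stand-in, not SeClone itself: the class TwoPhaseSearch, the identifier-blind pattern key, and the use of Jaccard token similarity in place of the Information Retrieval and clustering step are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical two-phase search: phase 1 narrows by pattern, phase 2 ranks by tokens.
final class TwoPhaseSearch {
    private final Map<String, List<String>> index = new HashMap<>();

    /** Phase 1 key: blank string literals, replace every word with "$", drop whitespace. */
    static String pattern(String fragment) {
        return fragment
            .replaceAll("\"[^\"]*\"", "\"\"")
            .replaceAll("[A-Za-z_][A-Za-z0-9_]*", "\\$")
            .replaceAll("\\s+", "");
    }

    /** Phase 2 score (assumed stand-in): Jaccard similarity over word tokens. */
    static double similarity(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        Set<String> common = new HashSet<>(ta);
        common.retainAll(tb);
        return union.isEmpty() ? 0.0 : (double) common.size() / union.size();
    }

    private static Set<String> tokens(String fragment) {
        Set<String> result = new HashSet<>(Arrays.asList(fragment.split("\\W+")));
        result.remove(""); // split may yield an empty leading token
        return result;
    }

    void add(String fragment) {
        index.computeIfAbsent(pattern(fragment), k -> new ArrayList<>()).add(fragment);
    }

    /** Looks up the single pattern group for the query, then ranks it by similarity. */
    List<String> search(String query) {
        List<String> candidates = new ArrayList<>(
            index.getOrDefault(pattern(query), Collections.emptyList()));
        candidates.sort(
            Comparator.comparingDouble((String c) -> similarity(query, c)).reversed());
        return candidates;
    }
}
```

The design point the sketch tries to show is that phase 1 is a single hash lookup, so nearly all of the time budget remains for ranking the small candidate group.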
28. Research Question #2
The Dilemma
How to distribute the 100 milliseconds between
phases?
[Timeline: 0 to 100 milliseconds, to be divided between Pattern Matching and Semantic Matching]
[Keivanloo, WCRE’11]
29. Our Further Analysis [WCRE’11]
Requirements and Constraints
• 100 Milliseconds
• Millions LOC
• Precision
• Recall
• Type-1, 2, 3…
The Dilemma
[Timeline: 0 to 100 milliseconds, to be divided between Pattern Matching and Semantic Matching]
SeClone [ICPC 11]: O ( p * log n )
Data Characteristics
31. Analysis of the Data Characteristics:
Dataset preparation
• Name: IJaDataset
– Comprehensive (Inter-project)
• To avoid project-specific results
– ~18,000 Projects
– 1,500,000 unique Java classes
• No duplicate, empty, or buggy files
– ~300 MLOC
• online at http://aseg.cs.concordia.ca/seclone
32. Analysis of the Data Characteristics:
Granularity Effect
• Three Level Similarity (TLS): Set of similar three-line fragments
• First Level Similarity (FLS): single-line patterns
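The difference between the two granularities can be sketched as grouping code positions by a key built from one line (FLS) or from three consecutive lines (TLS). The class GranularityIndex and the whitespace-stripped line as the "pattern" are assumptions for illustration; the slides do not specify how patterns are encoded.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical index builder: window = 1 gives FLS-style groups, window = 3 gives TLS-style groups.
final class GranularityIndex {
    /** Groups start positions by the pattern formed from `window` consecutive lines. */
    static Map<String, List<Integer>> build(List<String> lines, int window) {
        Map<String, List<Integer>> groups = new HashMap<>();
        for (int start = 0; start + window <= lines.size(); start++) {
            StringBuilder key = new StringBuilder();
            for (int i = start; i < start + window; i++) {
                // Assumed pattern encoding: the line with whitespace removed.
                key.append(lines.get(i).replaceAll("\\s+", "")).append('\n');
            }
            groups.computeIfAbsent(key.toString(), k -> new ArrayList<>()).add(start);
        }
        return groups;
    }
}
```

On repetitive code, three-line keys tend to split candidates into more and smaller groups than single-line keys, which is the effect the granularity analysis measures.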
33. Analysis of the Data Characteristics:
Clone frequency
• How many code fragments are analyzed by
each query?
• Answer: 3 (Average)
34. Analysis of the Data Characteristics:
Clone frequency
• Observation result:
– TLS distributes the candidates into 3.9 times more groups than FLS
– TLS groups are 6 times smaller than FLS groups
35. Analysis of the Data Characteristics:
Clone frequency
• Conclusion:
– TLS heuristic is practical for real-time clone search,
as long as the outliers are handled properly
– Why?
• (1) each TLS group has 2.37 members on average
• (2) it distributes candidates in small-size groups
• (3) for each query, only one group must be evaluated
36. What Does an Outlier Look Like?
• Outlier Definition: patterns with more than 2,000 occurrences
• Observation result:
• Only ~1000 patterns out of 30M
• ~ 0.01% patterns
• Mostly insignificant code patterns
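The outlier definition above translates directly into a threshold check over pattern occurrence counts. The class OutlierFilter and the input map are hypothetical; how SeClone actually treats outliers once they are identified is not stated in the slides.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper that flags patterns matching the slide's outlier definition.
final class OutlierFilter {
    /** Threshold taken from the slide: more than 2,000 occurrences. */
    static final int OUTLIER_THRESHOLD = 2_000;

    /** Returns the patterns whose occurrence count exceeds the threshold. */
    static Set<String> findOutliers(Map<String, Integer> occurrences) {
        Set<String> outliers = new HashSet<>();
        for (Map.Entry<String, Integer> entry : occurrences.entrySet()) {
            if (entry.getValue() > OUTLIER_THRESHOLD) {
                outliers.add(entry.getKey());
            }
        }
        return outliers;
    }
}
```

A caller could, for example, skip or cap these groups at query time so that a single very frequent pattern does not consume the response-time budget.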
37. Analysis of the Data Characteristics:
Sampling efficiency
• Can sampling be used to reduce the amount
of data being analyzed?
• Answer: Yes (e.g., 33% contains 91% of popular patterns)
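One way to check a claim of this shape is to sample the pattern stream and measure what fraction of the popular patterns still appear in the sample. The sketch below assumes systematic sampling (every step-th entry) and defines "popular" as at least minCount occurrences in the full data; both choices, and the class SamplingCoverage, are illustrative assumptions rather than the method used in the study.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical coverage measurement for a systematic sample of code patterns.
final class SamplingCoverage {
    /** Fraction of popular patterns (count >= minCount) that occur in every step-th entry. */
    static double popularCoverage(List<String> patterns, int step, int minCount) {
        Map<String, Integer> counts = new HashMap<>();
        for (String p : patterns) {
            counts.merge(p, 1, Integer::sum);
        }
        Set<String> sampled = new HashSet<>();
        for (int i = 0; i < patterns.size(); i += step) {
            sampled.add(patterns.get(i));
        }
        int popular = 0;
        int covered = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= minCount) {
                popular++;
                if (sampled.contains(entry.getKey())) {
                    covered++;
                }
            }
        }
        return popular == 0 ? 0.0 : (double) covered / popular;
    }
}
```

A step of 3 corresponds to roughly a 33% sample; the returned ratio is the share of popular patterns that the sample retains.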
38. Analysis of the Data Characteristics:
Indexing
• Can 32bit Hash keys (versus MD5) be used
without affecting index quality?
• Illustration: without a collision, abc → 123 and aXc → 456; with a collision, abc → 123 and aXc → 123
• Answer: Yes (0.002% error rate)
– Only 10 cases where three distinct strings share the same key
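The question of whether 32-bit keys are safe can be probed by counting how many distinct strings collide on the short key. The sketch below uses Java's String.hashCode() as the 32-bit hash purely as an assumption (the slides do not name the hash function), and includes an MD5 helper for comparison; the class HashKeyCheck is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;

// Hypothetical check of 32-bit key quality against a set of code patterns.
final class HashKeyCheck {
    /** Counts distinct strings whose 32-bit key was already taken by a different string. */
    static int countCollisions(Collection<String> patterns) {
        Map<Integer, String> firstSeen = new HashMap<>();
        int collisions = 0;
        for (String p : new LinkedHashSet<>(patterns)) { // de-duplicate first
            String previous = firstSeen.putIfAbsent(p.hashCode(), p);
            if (previous != null) {
                collisions++; // a different string already owns this key
            }
        }
        return collisions;
    }

    /** 128-bit MD5 digest as lowercase hex, for comparison with the 32-bit key. */
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(s.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

Dividing the collision count by the number of distinct patterns gives an error rate comparable in spirit to the one reported above; the 32-bit key trades a small collision risk for a key that is a quarter the size of an MD5 digest.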
39. Are Method Names Reliable?
• Input Data: Koders 1-year query log
– ~10M records
• Observation purpose:
– Importance of method names
• Observation result:
– 98% success rate vs. 69%
• Result interpretation:
– Method names in this context are a reliable source of information
– They must be preserved to increase precision
44. Answer:
Research Question #1
Is Internet-scale Real-time Code Search Possible?
YES
45. Answer:
Research Question #2
The Dilemma
How to distribute the 100 milliseconds between phases?
Answer:
Pattern Matching: 1 millisecond
Semantic Matching: 99 milliseconds
47. Summary
Step 1
• Studied characteristics of source code on the Internet
– unique patterns distribution (sampling application)
– Pattern frequencies (multi-level search)
– 32-bit hashing strength (code pattern)
– Outlier patterns
– Method name importance
Step 2
• Designed an Internet-scale clone search
– Customized for code search (precision)
– Fine granularity
– Multi-level Indexing approach (Type-3 clone)
– Microsecond range response time (up to 10 times faster)
48. Publication
Code Clone Search and Detection (http://aseg.cs.concordia.ca/seclone/)
• Iman Keivanloo, Juergen Rilling, Philippe Charland. Internet-scale Real-time Code Clone Search via Multi-level Indexing. 18th Working Conference on Reverse Engineering (WCRE 2011), Lero, Limerick, Ireland.
• Iman Keivanloo, Juergen Rilling, Philippe Charland. SeClone – A Hybrid Approach to Internet-Scale Real-Time Code Clone Search. 19th IEEE International Conference on Program Comprehension (ICPC 2011), Kingston, Ontario, Canada.
Source Code Sharing using Linked Data (secold.org)
• Iman Keivanloo, Chris Forbes, Juergen Rilling, Philippe Charland. Towards Sharing Source Code Facts Using Linked Data. ICSE Workshop on Search-Driven Development: Users, Infrastructure, Tools and Evaluation (SUITE), 2011.
Source Code Search (http://aseg.cs.concordia.ca/codesearch)
• Iman Keivanloo, Laleh Roostapour, Philipp Schugerl, Juergen Rilling. Semantic Web-based Source Code Search. 6th International Workshop on Semantic Web Enabled Software Engineering (SWESE 2010), June 35, San Francisco, USA.
• Iman Keivanloo, Laleh Roostapour, Philipp Schugerl, Juergen Rilling. SE-CodeSearch: A Scalable Semantic Web-based Source Code Search Infrastructure. 26th IEEE International Conference on Software Maintenance (ICSM), Early Research Achievements (ERA) Track, Sept. 12-18, Timișoara, Romania.
49. Thank you for your kind attention
QUESTION?
Editor's Notes
Use of method names in queries resulted in a 98% "click rate" vs. 68% for queries without method names.